Tuesday, July 31, 2012

[part III - Configure environment for script access] Using OpsView Core to check RPO through DPM monitoring

[If you haven't yet, check out Part II of this article here]

So we have the script ready.

2.6. Configure Opsview Agent to make the script accessible

Before you can use this script remotely, you first have to "tell" the agent that it exists and how to run it. So, you should:
  • copy "check_rpo.ps1" script to the scripts folder of the Opsview Agent 
  • open "nsc.ini" file, add the following line to the [NRPE Handlers] section and save the file afterwards:
check_rpo=cmd /c echo scripts\check_rpo.ps1 $ARG1$; exit($lastexitcode) | powershell.exe -PSConsoleFile "C:\Program Files\Microsoft DPM\DPM\bin\dpmshell.psc1" -command - 
  • restart the Opsview agent in test mode by stopping it (CTRL + C in the agent command window) and starting it again:
nsclient++ -test
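
Before testing the check remotely, you can sanity-check the handler locally on the DPM server: from a command prompt in the Opsview Agent folder, run the same pipeline with the $ARG1$ placeholder replaced by real arguments (the server, protection group and datasource names below are just examples - replace them with your own):

cmd /c echo scripts\check_rpo.ps1 -HostName DPMSERVER01 -ProtGrpName "Exchange Databases" -DataSrcName "Mailbox Database 1" -RpoHours 24; exit($lastexitcode) | powershell.exe -PSConsoleFile "C:\Program Files\Microsoft DPM\DPM\bin\dpmshell.psc1" -command -

Afterwards, echo %errorlevel% should show the script's exit code (0 for OK, 2 for CRITICAL).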

2.7. Test the script remotely

Go back to your SSH session to the Opsview Monitoring Server (or repeat point 2.3 of this tutorial) and execute:
./check_nrpe -H 192.168.1.20 -t 20 -c check_rpo -a '-ProtGrpName "Exchange Databases" -DataSrcName "Mailbox Database 1" -RpoHours 24'

You should change the parameters according to your needs:

  • -ProtGrpName - The name of the protection group configured on your DPM server
  • -DataSrcName - The name of the datasource you wish to check
  • -RpoHours - Your recovery point objective (the maximum number of hours allowed since the last recovery point)


On the client agent side, if you have the agent running in test mode, you should see the injected request and the agent's result.
If you check the picture, you'll see that apart from the result itself (OK - the RPO is being met, or CRITICAL - something's wrong), the check also returns:
  • Hours since last RP
  • Last RP Date/time
  • RP validity
  • Size of RP
On the server side, you should now see the response returned.
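Just as an illustration (the values below are made up), an OK response from the check looks like this:

OK - Hours since RP: 5.25 / Last RP Time: 07/30/2012 22:00:03 / Valid: True / Size: 12.6 | Hours_Passed_Since_Last_RP=5.25 Size=12.6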

Ok. You can now call the script remotely and it returns the desired results.


Monday, June 18, 2012

[part II - The script] Using OpsView Core to check RPO through DPM monitoring

[If you haven't yet, check out Part I of this article here]

Ok. So far so good. All base configurations between server and agent are done and the communication between them has been verified.

2.4. Activate "External Scripts" plugin on agent side


The Opsview agent comes with several modules you can use to check a great number of aspects of your Windows system. In this case, as far as I know, there is no specific module that can deal with monitoring Microsoft DPM 2010.
So I'll introduce you to the "External Scripts" module, which is not activated by default on the Opsview agent. It allows you to develop your own scripts (in whatever scripting language you feel comfortable with) to check whatever you want, and to call them from the Opsview Monitoring server like any other check. This module provides great flexibility, as it can be used to monitor all sorts of things.

So, to activate the "External Scripts" functionality, go to the C:\Program Files\Opsview Agent folder, edit the NSC.ini file and:

  • add or uncomment the "CheckExternalScripts.dll" line under the "[modules]" section.
This will enable you, later on, to declare any script you want and call it from the Opsview Monitoring server.
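
For reference, after this change the modules section of NSC.ini should contain a line like this (other module entries omitted):

[modules]
CheckExternalScripts.dll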

2.5. Develop powershell script to check last good recovery point


I should say that this is the critical step of the whole solution. If your script does not return the information you want, nothing else will matter. So, be sure to test it thoroughly.

I chose PowerShell because there are a great number of resources on the web that address similar problems, and with little work you can modify some of these scripts to suit your needs. Needless to say, most of the tools we use to control, monitor or configure Microsoft server applications use PowerShell underneath - which is reason enough for me... 

The main purpose of this script is to check whether the RPO (Recovery Point Objective) is being met. The script, which I called "check_rpo.ps1", receives 4 parameters:

  • [String]$HostName - Name of the DPM server
  • [String]$ProtGrpName - Protection group name
  • [String]$DataSrcName - Datasource name
  • [Int]$RpoHours - Recovery point objective (in hours) to be met

So, what this script does is:
  • Uses the "Get-ProtectionGroup" powershell command to reach for the protection group and hostname (passed into the function as parameters - $ProtGrpName and $HostName) 
  • Goes through every datasource in the protectiongroup until finds the requested one (passed into the function as a parameter - $DataSrcName) 
  • Determines the last recovery point Calculates the elapsed time since last recovery point and calculates the $hours_passed_since_last_rp variable 
  • Calculates size of the RP (in Gigabytes) 
  • Returns "Critical" (exit 2) if the $hours_passed_since_last_rp value is greater than the recovery point objective (passed into the function as a parameter - $RpoHours), or if it could not find the datasource given 
  • Returns "OK" (exit 0) if the recovery point objective is being met 
  • The performance data returned is: Hours_Passed_Since_Last_RP Size



Important: If you analyse the script closely, you'll see that I included an "InvertSlash" function that is applied to the datasource name. This is done because my "Virtual Servers" protection group includes a backslash in the datasource names, which is not an allowed character in a service definition on Opsview. So the trick is: if the datasource name is "Backup Using Child Partition Snapshot\SRVxyz01", I use a forward slash in the service definition ("Backup Using Child Partition Snapshot/SRVxyz01") and the script converts it back. I know!!! Not the best solution, but it does the trick...
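
In other words, the Opsview service definition carries the forward-slash version of the name and the script simply converts it back:

PS> "Backup Using Child Partition Snapshot/SRVxyz01" -replace "/","\"
Backup Using Child Partition Snapshot\SRVxyz01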


Here is the complete code for the "check_rpo.ps1" script. I know it can be improved; if you have any suggestions, please leave a comment. I would very much welcome them.


Param([String]$HostName, [String]$ProtGrpName, [String]$DataSrcName, [Int]$RpoHours ) 

Process 
{ 

# Convert forward slashes (used in the Opsview service definition) back to backslashes
function InvertSlash([string]$dtsrc)
{
    $dtsrc -replace "/","\"
}

$DataSrcName = InvertSlash $DataSrcName

# Get the protection group and the list of its datasources
$pg = Get-ProtectionGroup $HostName | where {$_.FriendlyName -eq $ProtGrpName}
$dslist = Get-Datasource $pg
foreach($ds in $dslist) 
 {
 if ($ds.datasourcename -eq $DataSrcName)
  {
  # Give DPM up to $Timeout seconds to report the latest recovery point for this datasource
  $Counter = 0
  $Incremental = 1
  $Timeout = 10
  While ($ds.LatestRecoveryPoint -eq 0)
   {
   Start-Sleep $Incremental
   $Counter = $Counter + $Incremental
   If ($Counter -gt $Timeout)
    {
    Write-Warning "Timed out."
    Break
    }
   }
  # Walk through all recovery points and keep the most recent one, its size and validity
  $RPs = Get-RecoveryPoint -Datasource $ds
  [datetime]$LastGoodRP_datetime = (Get-Date).AddYears(-1)
  [Boolean]$LastGoodRP_isValid = $false
  $rp_Size = 0
  foreach ($RP in $RPs)
   {
   if ($RP -ne $null)
    {
    if ($RP.RepresentedPointInTime -gt $LastGoodRP_datetime)
     {
     $LastGoodRP_datetime = $RP.RepresentedPointInTime
     $rp_Size = [Math]::Round($RP.Size / 1gb, 2)
     $LastGoodRP_isValid = $RP.IsValidRecoverySource
     }
    }
   }
   
   
   
  # Hours elapsed since the last good recovery point
  $hours_passed_since_last_rp = [Math]::Round(((Get-Date) - $LastGoodRP_datetime).TotalHours, 2)
  # Compare against the RPO and return the status plus performance data
  if ($hours_passed_since_last_rp -gt $RpoHours)
   {
   $Results = "***CRITICAL*** Hours since RP: " + $hours_passed_since_last_rp + " / Last RP Time: " + $LastGoodRP_datetime + " / Valid: " + $LastGoodRP_isValid + " / Size: " + $rp_Size
   $Results = $Results + " | Hours_Passed_Since_Last_RP=" + $hours_passed_since_last_rp + " Size=" + $rp_Size
   write-host $Results
   exit 2
   }
  else
   {
   $Results = "OK - Hours since RP: " + $hours_passed_since_last_rp + " / Last RP Time: " + $LastGoodRP_datetime + " / Valid: " + $LastGoodRP_isValid + " / Size: " + $rp_Size
   $Results = $Results + " | Hours_Passed_Since_Last_RP=" + $hours_passed_since_last_rp + " Size=" + $rp_Size
   write-host $Results
   exit 0
   }
  

  }
 } 

write-host "***CRITICAL*** - Datasource not found - " $DataSrcName
exit 2

}
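
If you want to try the script on its own first, you can run it from a DPM Management Shell prompt on the DPM server, from the folder where you saved it (the server, protection group and datasource names below are just examples):

.\check_rpo.ps1 -HostName DPMSERVER01 -ProtGrpName "Exchange Databases" -DataSrcName "Mailbox Database 1" -RpoHours 24

Afterwards, $LASTEXITCODE should show 0 (OK) or 2 (CRITICAL), which is what the monitoring server will see as the check status.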


Don't miss the third and last part of this article next week.
Feel free to leave any questions or suggestions.

Monday, June 11, 2012

[part I - Architecture and configuration] Using OpsView Core to check RPO through DPM monitoring

I have been using Opsview (an open source monitoring application) to monitor all kinds of things: from datacentre temperature to server health, switch interfaces, UPS load, etc. This monitoring application is really worthwhile.

This is the problem: I use Microsoft DPM 2010 (System Center Data Protection Manager) to make my short-term (to disk) and long-term (to tape) backups. It is a great application, but it keeps sending me dozens of emails about problems and their recovery. I really don't care, especially at 3:00 am, if there is a problem with a recovery point that did not succeed. What I want to know is whether my RPO (recovery point objective) is being met.

So, here is how I solved it.

1. Architecture

Architecture of "RPO monitoring with Opsview Core and DPM"
The solution relies on a check_nrpe plugin service check (on the monitoring server side) that invokes a custom-made PowerShell script (on the client agent side) through the "external scripts" functionality of the Opsview Core agent.
...Sounds confusing??? It really is simple.

Let's assume that you already have the Opsview Core Monitoring server up and running. If you don't, you can check out the Installation Guide - Ubuntu Linux.
If I managed to install the whole system without any special Linux knowledge, so can you.

For this example i assume the following:
  • The Opsview Monitoring Server IP: 192.168.1.1
  • The DPM Server (Opsview Agent) IP: 192.168.1.20 

2. Step-by-step instructions

2.1. Install and configure Opsview agent (initial configuration)

Install the Opsview Core agent for Windows.
Yes, I have tried a couple of other options, installing more recent versions of NSClient (the monitoring daemon the Opsview Windows agent relies on...). But, if you want my advice, stick to the Opsview Core agent available for download - it'll suit your needs and work just fine!!!
In my case I installed the 64-bit version of the Windows agent, which is straightforward (like the 32-bit... :)). After finishing the installation, go to the C:\Program Files\Opsview Agent folder, edit the NSC.ini file and:
  • edit the "allowed_hosts" variable value, in the [Settings] section, to the IP address of the Opsview Core Monitoring server. This adds a little extra security, preventing the agent from answering requests from disallowed sources. Restart the NSClientpp service to apply the changes.

[Settings]
allowed_hosts=192.168.1.1

2.2. Run Opsview agent in test mode

While you're doing the initial configuration and testing, it's better to run the agent in test mode from a command prompt. This lets you view, in real time, all requests, responses and errors on the agent. To do this you should:

  • run the Services snap-in and stop the NSClientpp service
  • launch a command prompt window and go to the C:\Program Files\Opsview Agent folder
  • run the agent in test mode by entering the following command:
nsclient++ -test

2.3 Test connectivity between Opsview Monitoring Server and Opsview agent

Before going into a more complex scenario, verify that you can make a request to the agent and that it returns the proper response.

  • open an SSH session (e.g. with PuTTY) and connect to the Opsview Monitoring Server
  • go to the /usr/local/nagios/libexec folder
  • execute a simple CPU check by running the following command line:
./check_nrpe -H 192.168.1.20 -c CheckCPU -a warn=80 crit=90 time=20m time=10s time=4

  • on the client agent side, if you have the agent running in test mode, you should see the injected request and the agent's result
  • on the server side, you should now see the response returned.
Ok. The communication between server and agent is in place and working.


Don't miss the second part of this article. I hope I can publish it later this week.
Feel free to leave any questions or suggestions.


A little about me...

Hi,
As you've already figured out, my name is Marco and I am from Portugal.
My interest in technology began around 1985, when my "visionary" father got me a ZX Spectrum 48K. I was 12 years old. This was my first development platform; BASIC was the language.
Since then, and throughout my academic and professional journey, I have watched the evolution of technology and its spread into everyday life.
I am now the Head of IT at a private university in Portugal. Leading a small team has kept me close to the practical side of the job, doing all kinds of things: from software development to database management, IT strategy development to project change management, network management to server maintenance.
Having always been a consumer of other people's shared ideas, blogs and tips, I have now decided it is my turn to start sharing my solutions to day-to-day IT department problems.
I will soon publish my first article. Hope you enjoy!!!

Marco