Monday, June 18, 2012

[part II - The script] Using OpsView Core to check RPO through DPM monitoring

[If you haven't yet, check out Part I of this article here]

Ok. So far so good. All the base configuration between server and agent is done, and the communication between them has been verified.

2.4. Activate the "External Scripts" plugin on the agent side


The Opsview agent comes with several modules you can use to check many aspects of your Windows system. As far as I know, though, there is no module that deals specifically with monitoring Microsoft DPM 2010.
So I'll introduce you to the "External Scripts" module, which is not activated by default on the Opsview agent. This module lets you develop your own scripts (in whatever scripting language you feel comfortable with) to check whatever you want, and call them from the Opsview Monitoring server just like any other check. It is very flexible, as it can be used to monitor all sorts of things.

So, to activate the "External Scripts" functionality, go to the C:\Program Files\Opsview Agent folder, edit the NSC.ini file and:

  • add or uncomment the "CheckExternalScripts.dll" line, under the "[modules]" section.
This will let you, later on, declare any script you want and call it from the Opsview Monitoring server.
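
As a rough sketch, and assuming your NSC.ini marks disabled modules with a leading ";" as the stock file usually does, the relevant part of the file would end up looking something like this (other module lines omitted):

```ini
[modules]
; Remove the leading ";" from this line if it is commented out,
; or add the line if it is missing:
CheckExternalScripts.dll
```

The agent typically needs to be restarted before a change to NSC.ini takes effect.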

2.5. Develop a PowerShell script to check the last good recovery point


I should say that this is the critical step of the whole solution. If your script does not return the information you want, nothing else will matter. So, be sure to test it thoroughly.

I chose PowerShell because there are plenty of resources on the web that address similar problems, and with little work you can modify some of these scripts to suit your needs. Besides, most of the tools we use to control, monitor or configure Microsoft server applications use PowerShell underneath - which is enough for me...

The main purpose of this script is to check whether the RPO (Recovery Point Objective) is being met. The script, which I called "check_rpo.ps1", takes 4 parameters:

  • [String]$HostName - name of the DPM server
  • [String]$ProtGrpName - Protection group name
  • [String]$DataSrcName - Datasource name
  • [Int]$RpoHours - Recovery Point Objective (in number of hours) to be met
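
To test the script by hand before wiring it into Opsview, a call from the DPM Management Shell might look like the sketch below. The server name DPMSRV01 is a made-up placeholder, and the protection group and datasource are just the ones from my environment; the datasource is written with a forward slash for the reason explained further down.

```powershell
# Hypothetical values: replace DPMSRV01, the protection group and the
# datasource with names from your own environment.
.\check_rpo.ps1 -HostName "DPMSRV01" `
                -ProtGrpName "Virtual Servers" `
                -DataSrcName "Backup Using Child Partition Snapshot/SRVxyz01" `
                -RpoHours 24

# The script reports its status through the exit code (0 = OK, 2 = Critical).
Write-Host "Exit code: $LASTEXITCODE"
```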

So, what this script does is:
  • Uses the "Get-ProtectionGroup" PowerShell command to get the protection group on the given DPM server (both passed in as parameters - $ProtGrpName and $HostName)
  • Goes through every datasource in the protection group until it finds the requested one (passed in as a parameter - $DataSrcName)
  • Determines the last recovery point
  • Calculates the elapsed time since the last recovery point and stores it in the $hours_passed_since_last_rp variable
  • Calculates the size of the RP (in gigabytes)
  • Returns "Critical" (exit 2) if the $hours_passed_since_last_rp value is greater than the recovery point objective (passed in as a parameter - $RpoHours), or if it could not find the given datasource
  • Returns "OK" (exit 0) if the recovery point objective is being met
  • The performance data returned is: Hours_Passed_Since_Last_RP and Size
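
The decision at the heart of these steps can be tried out without a DPM server. The sketch below is only an illustration: Test-Rpo is a hypothetical helper that is not part of check_rpo.ps1, and it returns an object instead of calling exit so that it can be run interactively.

```powershell
# Hypothetical helper (not part of check_rpo.ps1): evaluates the RPO
# given the time of the last recovery point.
function Test-Rpo {
    param(
        [datetime]$LastRecoveryPoint,
        [int]$RpoHours,
        [datetime]$Now = (Get-Date)
    )

    # Elapsed time since the last recovery point, in hours (two decimals).
    $hoursPassed = [Math]::Round(($Now - $LastRecoveryPoint).TotalHours, 2)

    if ($hoursPassed -gt $RpoHours) {
        $status = 'CRITICAL'; $exitCode = 2
    } else {
        $status = 'OK'; $exitCode = 0
    }

    # Return an object instead of calling exit, so the logic can be tested.
    New-Object PSObject -Property @{
        Status      = $status
        ExitCode    = $exitCode
        HoursPassed = $hoursPassed
    }
}

# A recovery point taken 30 hours ago, checked against a 24 hour RPO.
$now = [datetime]'2012-06-18 12:00'
$result = Test-Rpo -LastRecoveryPoint ($now.AddHours(-30)) -RpoHours 24 -Now $now
$result.Status    # CRITICAL
```

Passing the current time in as a parameter is just a convenience for testing; the real script simply uses Get-Date.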



Important: If you analyse the script closely you'll see that I have included an "InvertSlash" function that is applied to the datasource name. This is because the datasources in my "Virtual Servers" protection group have a backslash in their names, which is not an allowed character in a service definition on Opsview. So if the datasource name is "Backup Using Child Partition Snapshot\SRVxyz01", I write it in the service definition with a forward slash, as "Backup Using Child Partition Snapshot/SRVxyz01", and the script later converts it back. I know!!! Not the best solution, but it does the trick...
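
To see the conversion in isolation, here is the same InvertSlash function from the script applied to the example name above:

```powershell
# Same conversion as the InvertSlash function in check_rpo.ps1.
function InvertSlash([string]$dtsrc)
{
    $dtsrc -replace "/","\"
}

# Name as written in the Opsview service definition (forward slash).
$fromOpsview = "Backup Using Child Partition Snapshot/SRVxyz01"

# Name as DPM knows it (backslash).
$forDpm = InvertSlash $fromOpsview
$forDpm   # Backup Using Child Partition Snapshot\SRVxyz01
```

Keep in mind that every forward slash is replaced, so this trick assumes none of your datasource names contain a genuine "/".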


Here is the complete code for the "check_rpo.ps1" script. I know it can be improved. If you have any suggestions, please leave a comment. I would very much welcome it.


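Param([String]$HostName, [String]$ProtGrpName, [String]$DataSrcName, [Int]$RpoHours)

Process
{

# Opsview service definitions cannot contain "\", so the datasource name
# arrives with "/" instead; convert it back before comparing.
function InvertSlash([string]$dtsrc)
{
    $dtsrc -replace "/","\"
}

$DataSrcName = InvertSlash $DataSrcName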
$pg = Get-ProtectionGroup $HostName | where {$_.FriendlyName -eq $ProtGrpName}
$dslist = Get-Datasource $pg
foreach($ds in $dslist)
 {
 if ($ds.datasourcename -eq $DataSrcName)
  {
  # Wait (up to $Timeout seconds) for DPM to fill in LatestRecoveryPoint.
  $Counter = 0
  $Incremental = 1
  $Timeout = 10
  While($ds.LatestRecoverypoint -EQ 0)
   {
   Start-Sleep $Incremental
   $Counter = $Counter + $Incremental
   If ($Counter -GT $Timeout)
    {
    Write-Warning "Timed out."
    Break
    }
   }
  # Walk all recovery points and keep the most recent one. The search
  # starts one year back, so older recovery points are ignored.
  $RPs = Get-RecoveryPoint -Datasource $ds
  [datetime]$LastGoodRP_datetime=(Get-Date).AddYears(-1)
  [Boolean]$LastGoodRP_isValid=$false
  $rp_Size=0
  foreach ($RP in $RPs)
   {
   if ($null -ne $RP)
    {
    if ($RP.RepresentedPointInTime -gt $LastGoodRP_datetime)
     {
     $LastGoodRP_datetime=$RP.RepresentedPointInTime
     $rp_Size=[Math]::Round($RP.Size / 1gb, 2)
     $LastGoodRP_isValid=$RP.IsValidRecoverySource
     }
    }
   }

  # Elapsed time since the last recovery point, in hours (two decimals).
  $elapsed = (Get-Date) - $LastGoodRP_datetime
  $hours_passed_since_last_rp = [Math]::Round($elapsed.TotalHours, 2)
  # Output is the status text, then "|", then the performance data as
  # space-separated label=value pairs.
  if ($hours_passed_since_last_rp -gt $RpoHours)
   {
   $Results = "***CRITICAL*** Hours since RP: " + $hours_passed_since_last_rp + " / Last RP Time:" + $LastGoodRP_datetime + " / Valid: " + $LastGoodRP_isValid + " / Size: " + $rp_Size
   $Results = $Results + " | Hours_Passed_Since_Last_RP=" + $hours_passed_since_last_rp + " Size=" + $rp_Size
   write-host $Results
   exit 2
   }
  else
   {
   $Results = "OK - Hours since RP: " + $hours_passed_since_last_rp + " / Last RP Time:" + $LastGoodRP_datetime + " / Valid: " + $LastGoodRP_isValid + " / Size: " + $rp_Size
   $Results = $Results + " | Hours_Passed_Since_Last_RP=" + $hours_passed_since_last_rp + " Size=" + $rp_Size
   write-host $Results
   exit 0
   }
  

  }
 } 

# Reached only when no datasource matched the requested name.
write-host "***CRITICAL*** - Datasource not found - " $DataSrcName
exit 2

}


Don't miss the third and last part of this article next week.
Feel free to leave any questions or suggestions.
