Friday, April 22, 2016

Creating a reactive alarm for an ESXi ROBO cluster, or as I call it, "DRS-light"

I have decided to start blogging again after a long time away. Eight years to be precise.

The first item I want to share is a solution to a problem that occurred in my company's retail implementation, which uses VMware ROBO licensing. Anyone familiar with ROBO knows that one major exclusion makes cluster management challenging: the lack of DRS. In any large-scale implementation without DRS, I think it is fairly common to end up with unbalanced workloads after maintenance or outages. This problem is compounded when you have a large support team of varying skill levels.

I chose to address this problem by creating an alarm based on Host Memory Usage, as host memory is generally the first resource to run short in our particular environment. This methodology would also work for CPU constraints, but that is not a problem we had.

My starting point was the excellent article from the PowerCLI blog on using PowerCLI scripts in alarm actions.

One thing to note: the syntax suggested in the article for the actual alarm action did not work for me, so I called PowerShell directly from the alarm action (instead of using the batch file method suggested in the original article) with the following path:

c:\windows\system32\WindowsPowerShell\v1.0\powershell.exe

Here is the full code of the script I am using. I will go through its purpose section by section, starting at the $hostId line, as the rest is covered pretty well in the original blog post by VMware:
# In response to a host memory alarm, move the largest VM off that host.
# This is just for demonstration purposes.
# Find more info at http://blogs.vmware.com/vipowershell
$basePath = "C:\PS_SCRIPTS\Alarms"
$ProgressPreference = "SilentlyContinue"
$env:APPDATA = "c:\Documents and Settings\All Users\Application Data"

## Import the admin credential.
#. "$basePath\credentialManagement.ps1"
#$credential = Import-PSCredential "$basePath\systemCredentials.enc.xml"

# Log in.
Add-PSSnapin VMware.Vimautomation.Core -ea SilentlyContinue 
$WarningPreference = "SilentlyContinue"
Connect-VIServer localhost -User "somedomain/someuser" -Password "xxxxxxx"
$hostId = "HostSystem-" + $env:VMWARE_ALARM_TARGET_ID 
$vmhost = Get-VMHost -Id $hostId 
$cluster = Get-VMHost $vmhost |Get-Cluster 
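# -notlike is required for the z* wildcard; -ne would compare against the literal string "z*".
$vm = Get-VMHost $vmhost | Get-VM | where {$_.Name -notlike "z*"} | Sort-Object MemoryGB -Descending | select -First 1
# Compare names with names when excluding the alarming host.
$destination = Get-Cluster $cluster | Get-VMHost | where {$_.Name -ne $vmhost.Name} | Sort-Object MemoryUsageGB | select -First 1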
$destinationfree = ($destination.MemoryTotalGB - $destination.MemoryusageGB) 
$vmhostfree = ($vmhost.MemoryTotalGB - $vmhost.MemoryUsageGB) 
$vmsize = ($vm.memoryGB) 
$difference = ($destinationfree - $vmhostfree)
# Gather the host count and date first; else must directly follow the if block.
$count = @($cluster | Get-VMHost).Count
$date = Get-Date   # referenced in the mail subject below
if (($difference - $vmsize) -gt 0) { Move-VM -VM $vm -Destination $destination -Confirm:$false }
else { Send-MailMessage -To "somepeople@somecompany.com" -Subject "Unable to Balance Retail Cluster $cluster - $date" -Body "Unable to balance Cluster $cluster due to insufficient resources available. There are currently $count Host(s) available in the Cluster." -SmtpServer "m.somecompany.com" -From "vmware_team@somecompany.com" }

Now I will break down what the various lines of the script are doing:
$hostId = "HostSystem-" + $env:VMWARE_ALARM_TARGET_ID 
$vmhost = Get-VMHost -Id $hostId 
$cluster = Get-VMHost $vmhost |Get-Cluster
These three lines take the alarm value ($env:VMWARE_ALARM_TARGET_ID) and convert it to a more usable form for PowerCLI commands. In this case the alarm returns the host ID, but without the prefix that Get-VMHost needs in order to query it. So the first operation is to prepend HostSystem- to the ID. Then I retrieve the host object and its cluster for later use.
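The ID handling can be made a little more defensive. The following is only a sketch: Get-AlarmHostId is a hypothetical helper name, and "host-1234" is a made-up example value. It fails early when the environment variable is empty, which is what happens if the script is run by hand rather than from an alarm action.

```powershell
# Hypothetical helper: build the PowerCLI Id string from the alarm's target ID.
function Get-AlarmHostId {
    param([string]$TargetId = $env:VMWARE_ALARM_TARGET_ID)

    # The variable is only populated when the script is launched by an alarm action.
    if ([string]::IsNullOrEmpty($TargetId)) {
        throw "VMWARE_ALARM_TARGET_ID is empty; was the script started by an alarm?"
    }

    # Same concatenation as in the script above.
    return "HostSystem-" + $TargetId
}

# "host-1234" is a made-up example value.
Get-AlarmHostId -TargetId "host-1234"   # returns HostSystem-host-1234
```

The result can then be passed to Get-VMHost -Id exactly as in the original lines.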
$vm = Get-VMHost $vmhost | Get-VM | where {$_.Name -notlike "z*"} | Sort-Object MemoryGB -Descending | select -First 1
The next objective is to get the list of VMs on the host that is under memory pressure, sort it by configured memory (MemoryGB), and pick the largest one. The filter where {$_.Name -notlike "z*"} is not necessary in most environments, but in our case we have a guest whose name starts with z that is considered "immobile". Note that the wildcard only works with -like and -notlike; -ne would compare against the literal string "z*" and exclude nothing.
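One detail worth checking in that filter is the comparison operator, because PowerShell only expands wildcards with -like and -notlike. A small sketch with made-up VM names (no vCenter connection needed) shows the difference:

```powershell
# Made-up VM names; "zappliance01" stands in for the immobile guest.
$names = "web01", "zappliance01", "db01"

# -ne compares against the literal string "z*", so nothing is filtered out.
$withNe = @($names | Where-Object { $_ -ne "z*" })

# -notlike treats * as a wildcard and drops every name starting with z.
$withNotLike = @($names | Where-Object { $_ -notlike "z*" })

"{0} name(s) kept by -ne, {1} kept by -notlike" -f $withNe.Count, $withNotLike.Count
# prints: 3 name(s) kept by -ne, 2 kept by -notlike
```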
$destination = Get-Cluster $cluster | Get-VMHost | where {$_.Name -ne $vmhost.Name} | Sort-Object MemoryUsageGB | select -First 1
This line finds the other host in the cluster with the lowest memory usage (which is also the host with the most free memory when all hosts have the same amount of RAM) and stores it for a later calculation.
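If the hosts in a cluster do not all have the same amount of RAM, lowest usage and most free memory can point at different hosts. The sketch below uses made-up host objects (the property names mirror the ones in the script, and [pscustomobject] assumes PowerShell 3 or later) to compare the two sort orders:

```powershell
# Made-up hosts; the property names mirror the ones used in the script.
$hosts = @(
    [pscustomobject]@{ Name = "esx01"; MemoryTotalGB = 128; MemoryUsageGB = 100 },
    [pscustomobject]@{ Name = "esx02"; MemoryTotalGB = 256; MemoryUsageGB = 120 },
    [pscustomobject]@{ Name = "esx03"; MemoryTotalGB = 128; MemoryUsageGB = 90 }
)
$source = "esx01"   # the host that raised the alarm

$candidates = $hosts | Where-Object { $_.Name -ne $source }

# Sort order used in the script: lowest usage first.
$byUsage = $candidates | Sort-Object MemoryUsageGB | Select-Object -First 1

# Alternative: sort on the computed free memory, largest first.
$byFree = $candidates |
    Sort-Object { $_.MemoryTotalGB - $_.MemoryUsageGB } -Descending |
    Select-Object -First 1

"Lowest usage: {0}; most free: {1}" -f $byUsage.Name, $byFree.Name
# prints: Lowest usage: esx03; most free: esx02
```

In a cluster of identical hosts both orders give the same answer, so the simpler sort is enough there.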
$destinationfree = ($destination.MemoryTotalGB - $destination.MemoryusageGB) 
$vmhostfree = ($vmhost.MemoryTotalGB - $vmhost.MemoryUsageGB)
$vmsize = ($vm.memoryGB)  
Then the free memory on the destination host, the free memory on the current host, and the VM's configured memory are stored as variables for the calculations.
$difference = ($destinationfree - $vmhostfree)
if (($difference - $vmsize) -gt 0) { Move-VM -VM $vm -Destination $destination -Confirm:$false} 

Now all the collected variables are used to decide whether to move the VM to another host. The if statement makes sure that a VM is only moved when the destination, after taking on the VM, would still have more free memory than the original host has now.
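To make the arithmetic concrete, here is the same test wrapped in a small function with made-up numbers. Test-MoveImprovesBalance is a hypothetical name used only for this illustration; it is not part of the script.

```powershell
# Hypothetical wrapper around the same comparison the script performs.
function Test-MoveImprovesBalance {
    param(
        [double]$DestinationFreeGB,
        [double]$SourceFreeGB,
        [double]$VmSizeGB
    )

    $difference = $DestinationFreeGB - $SourceFreeGB
    return (($difference - $VmSizeGB) -gt 0)
}

# 60 - 10 - 16 = 34, so the move is worthwhile.
Test-MoveImprovesBalance -DestinationFreeGB 60 -SourceFreeGB 10 -VmSizeGB 16   # True

# 20 - 10 - 16 = -6, so the VM stays put and the script falls through to the email.
Test-MoveImprovesBalance -DestinationFreeGB 20 -SourceFreeGB 10 -VmSizeGB 16   # False
```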
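$count = @($cluster | Get-VMHost).Count
$date = Get-Date
Here we gather some data for the case where we were unable to move a VM (such as when no other node has enough free RAM). In the script these lines belong above the if statement, because else must directly follow the closing brace of the if block. Wrapping the pipeline in @(...) makes sure .Count also works when the cluster has a single host, and $date is set because the mail subject references it.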
else { Send-MailMessage -To "somepeople@somecompany.com" -Subject "Unable to Balance Retail Cluster $cluster - $date" -Body "Unable to balance Cluster $cluster due to insufficient resources available. There are currently $count Host(s) available in the Cluster." -SmtpServer "m.somecompany.com" -From "vmware_team@somecompany.com" }
The last step is to send an email to the support team with some context around the failure so they can investigate. The host count is included to cover the case where a cluster is missing nodes.
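The long Send-MailMessage line can be easier to maintain when the parameters are built as a hashtable and splatted. This is only a sketch: New-BalanceFailureMail is a hypothetical helper, "Store-0001" is a made-up cluster name, and the addresses are the placeholders from the script.

```powershell
# Hypothetical helper that assembles the parameters for Send-MailMessage.
function New-BalanceFailureMail {
    param(
        [string]$ClusterName,
        [int]$HostCount,
        [datetime]$Date = (Get-Date)
    )

    return @{
        To         = "somepeople@somecompany.com"
        From       = "vmware_team@somecompany.com"
        SmtpServer = "m.somecompany.com"
        Subject    = "Unable to Balance Retail Cluster $ClusterName - $Date"
        Body       = "Unable to balance Cluster $ClusterName due to insufficient resources available. There are currently $HostCount Host(s) available in the Cluster."
    }
}

# "Store-0001" is a made-up cluster name.
$mail = New-BalanceFailureMail -ClusterName "Store-0001" -HostCount 1

# Uncomment to actually send; @mail splats the hashtable as named parameters.
# Send-MailMessage @mail
$mail.Body
```

Keeping the send commented out makes it possible to check the wording of the message without an SMTP server.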

So that is my solution for balancing memory in a DRS-less ROBO environment.

If you have any thoughts or suggestions, please comment.
