Thursday, May 19, 2016

Using a Dynamic Alarm to Rescan Unavailable Datastores

Hi All,

So in our particular retail environment we have had an ongoing issue with iSCSI datastore availability after a power outage. We have the same problem on in-store deployments when the onsite technicians power on the components in the wrong order (i.e. power on the hosts before the iSCSI array). The result, of course, was escalations for retail stores being down for a significant period of time following a power outage.

Of course I googled the heck out of this problem, and the only solution I found was to change the boot delay on the ESXi hosts to exceed the array boot time. That was not a great solution for us, because it would also delay startup on planned reboots where there was no power outage.

One day I was working on another reactive alarm and I got that light bulb over my head. I thought "I wonder if there is an alarm type for datastore connectivity to hosts that I could use to trigger an action?". It turns out there is a condition that can be monitored at the datastore level to alarm for this - it is called Datastore State to All Hosts. In the screenshots below, I created a custom rule with this condition. For a review on setting up PowerCLI based alarms, refer to my blog post here.


Now that we have the alarm set up, let's get into the PowerCLI code behind it. In our particular implementation we only have to worry about 3 nodes in our clusters, so my script is focused on that. The same methodology would work with a larger cluster and a foreach-object loop, but I am going to use my already written script to show how I solved my particular problem.
# In response to a datastore alarm, rescan storage on each host that cannot access the datastore.
# This is just for demonstration purposes.
# Find more info at http://blogs.vmware.com/vipowershell
$basePath = "C:\Documents and Settings\All Users\Application Data\VMware\VMware VirtualCenter\scripts"
$ProgressPreference = "SilentlyContinue"
$env:APPDATA = "c:\Documents and Settings\All Users\Application Data"

Add-PSSnapin VMware.Vimautomation.Core -ea SilentlyContinue
$WarningPreference = "SilentlyContinue"
Connect-VIServer localhost -User "nobody\someone" -Password "xxxxxxxxxxxxxxxxxxxxxx" | Out-Null
$email = "jason.willey@gmail.com"
# The alarm passes the bare object ID; PowerCLI expects it with a "Datastore-" prefix.
$datastoreId = "Datastore-" + $env:VMWARE_ALARM_TARGET_ID
$datastore = Get-Datastore -Id $datastoreId
$impactedhosts = $datastore |get-vmhost
$cluster = $impactedhosts |select -First 1 |get-cluster 
# host one check (only query the datastore if the host is connected to vCenter)
$Hostone = $impactedhosts |where {$_.name -match "x1"}  
$Hostoneconnected = $Hostone.ConnectionState
if ($Hostoneconnected -match "connected") {
$hostonestate = get-vmhost $hostone |get-datastore $datastore
$global:hostonestate = $hostonestate.Accessible }
#host two check
$Hosttwo = $impactedhosts |where {$_.name -match "x2"}  
$Hosttwoconnected = $Hosttwo.ConnectionState
if ($Hosttwoconnected -match "connected") {
$hosttwostate = get-vmhost $hosttwo |get-datastore $datastore
$global:hosttwostate = $hosttwostate.Accessible }
# Host Three check
$Hostthree = $impactedhosts |where {$_.name -match "x3"}  
$Hostthreeconnected = $Hostthree.ConnectionState
if ($Hostthreeconnected -match "connected") {
$hostthreestate = get-vmhost $hostthree |get-datastore $datastore
$global:hostthreestate = $hostthreestate.Accessible}
#  starting rescan loop
$global:counter = 0 
# Keep rescanning while any host reports the datastore as inaccessible, up to 10 passes.
while (($global:hostonestate -match "False" -or $global:hosttwostate -match "False" -or $global:hostthreestate -match "False") -and $global:counter -lt 10) {
if ($global:hostonestate -match "False") {Get-VMHostStorage -VMHost $Hostone -RescanAllHba -RescanVmfs |Out-Null
 $global:hostonestate = get-vmhost $hostone |get-datastore $datastore;
 $global:hostonestate = $global:hostonestate.Accessible  }
if ($global:hosttwostate -match "False") {Get-VMHostStorage -VMHost $Hosttwo -RescanAllHba -RescanVmfs |Out-Null
 $global:hosttwostate = get-vmhost $hosttwo |get-datastore $datastore;
 $global:hosttwostate = $global:hosttwostate.Accessible  }
if ($global:hostthreestate -match "False") {Get-VMHostStorage -VMHost $Hostthree -RescanAllHba -RescanVmfs |Out-Null
 $global:hostthreestate = get-vmhost $hostthree |get-datastore $datastore;
 $global:hostthreestate = $global:hostthreestate.Accessible  } 
sleep 120
$global:counter = $global:counter + 1
# On the tenth and final pass, record the error message for the email below.
if ($global:counter -eq 10) { $global:hosterror = "$datastore has been rescanned every 2 minutes for 20 minutes and is not available on all nodes of $cluster" }
}
if ($global:hosterror -ne $null) {Send-MailMessage -To $email -Subject "Unable to rescan ISCSI storage in cluster $cluster" -Bodyashtml -Body "$global:hosterror"  -SmtpServer "m.nobody.com" -From "VMware@nobody.com"}


So I will skip all the logic for interpreting the alarm received as that is detailed in another post here.

First things first - I translate the datastore ID presented by the alert into a datastore object that PowerCLI can work with, and get the hosts that are supposed to be connected to this datastore. I also get the cluster name for use later on, when we send an email alert in case we were unable to successfully rescan.
$datastoreId = "Datastore-" + $env:VMWARE_ALARM_TARGET_ID
$datastore = Get-Datastore -Id $datastoreId
$impactedhosts = $datastore |get-vmhost
$cluster = $impactedhosts |select -First 1 |get-cluster 
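To try this lookup outside of a real alarm, the environment variable can be set by hand. This is only a minimal sketch: it assumes vCenter passes a value shaped like datastore-123 (a made-up ID), and the helper name Get-AlarmTargetId is hypothetical, not part of PowerCLI.

```powershell
# Hypothetical helper: builds the PowerCLI-style Id from the alarm's target.
# Assumes VMWARE_ALARM_TARGET_ID holds a value such as "datastore-123".
function Get-AlarmTargetId {
    param(
        [Parameter(Mandatory = $true)][string]$TypePrefix,   # e.g. "Datastore" or "HostSystem"
        [string]$TargetId = $env:VMWARE_ALARM_TARGET_ID
    )
    if ([string]::IsNullOrEmpty($TargetId)) {
        throw "VMWARE_ALARM_TARGET_ID is not set; was the script started by an alarm?"
    }
    return "$TypePrefix-$TargetId"
}

# Simulate what the alarm action would provide (made-up value).
$env:VMWARE_ALARM_TARGET_ID = "datastore-123"
$datastoreId = Get-AlarmTargetId -TypePrefix "Datastore"
$datastoreId   # Datastore-datastore-123

# With a live vCenter connection the object could then be fetched:
# $datastore = Get-Datastore -Id $datastoreId
```

Failing loudly when the variable is empty makes it obvious when the script was started by hand rather than by the alarm.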
From there, I verify that each host is actually connected to vCenter, and if it is, I check whether the datastore is seen as available from the host side. I then repeat this logic for each possible host.
$Hostone = $impactedhosts |where {$_.name -match "x1"}  
$Hostoneconnected = $Hostone.ConnectionState
if ($Hostoneconnected -match "connected") {
$hostonestate = get-vmhost $hostone |get-datastore $datastore
$global:hostonestate = $hostonestate.Accessible }
Now I set up the rescanning loop. In our environment we have chosen to rescan every two minutes for twenty minutes and then send an email alert to second level support. I used $global: variables here so they will survive the loop. After each rescan, the script checks the status of the datastore again, and continues running the loop until the datastore is accessible from every host or the loop counter reaches 10. At the end of each pass there is a 120 second sleep to give us our two minutes between rescans. On the last iteration of the loop, it writes an error message stating that after 20 minutes of 2 minute rescans the cluster still has a datastore availability problem.
$global:counter = 0 
while (($global:hostonestate -match "False" -or $global:hosttwostate -match "False" -or $global:hostthreestate -match "False") -and $global:counter -lt 10) {
if ($global:hostonestate -match "False") {Get-VMHostStorage -VMHost $Hostone -RescanAllHba -RescanVmfs |Out-Null
 $global:hostonestate = get-vmhost $hostone |get-datastore $datastore;
 $global:hostonestate = $global:hostonestate.Accessible  }
if ($global:hosttwostate -match "False") {Get-VMHostStorage -VMHost $Hosttwo -RescanAllHba -RescanVmfs |Out-Null
 $global:hosttwostate = get-vmhost $hosttwo |get-datastore $datastore;
 $global:hosttwostate = $global:hosttwostate.Accessible  }
if ($global:hostthreestate -match "False") {Get-VMHostStorage -VMHost $Hostthree -RescanAllHba -RescanVmfs |Out-Null
 $global:hostthreestate = get-vmhost $hostthree |get-datastore $datastore;
 $global:hostthreestate = $global:hostthreestate.Accessible  } 
sleep 120
$global:counter = $global:counter + 1
if ($global:counter -eq 10) { $global:hosterror = "$datastore has been rescanned every 2 minutes for 20 minutes and is not available on all nodes of $cluster" }
}
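For a cluster with more than three nodes, the same retry logic could be written once and applied to every host. What follows is a rough sketch rather than the script I run in production: the function name Invoke-RescanUntilAccessible and its parameters are hypothetical, and the accessibility check and the rescan are passed in as script blocks so the loop itself does not depend on PowerCLI.

```powershell
# Sketch of the rescan loop for any number of hosts.
# -TestAccessible and -Rescan are caller-supplied script blocks (hypothetical
# parameters); each receives one host as its first argument.
function Invoke-RescanUntilAccessible {
    param(
        [Parameter(Mandatory = $true)][object[]]$VMHosts,
        [Parameter(Mandatory = $true)][scriptblock]$TestAccessible,
        [Parameter(Mandatory = $true)][scriptblock]$Rescan,
        [int]$MaxAttempts = 10,
        [int]$DelaySeconds = 120
    )
    # Hosts that cannot currently see the datastore.
    $pending = @($VMHosts | Where-Object { -not (& $TestAccessible $_) })
    $attempt = 0
    while ($pending.Count -gt 0 -and $attempt -lt $MaxAttempts) {
        foreach ($h in $pending) { & $Rescan $h | Out-Null }
        # Re-check only the hosts that were failing.
        $pending = @($pending | Where-Object { -not (& $TestAccessible $_) })
        $attempt++
        # Wait between passes, but not after the final one.
        if ($pending.Count -gt 0 -and $attempt -lt $MaxAttempts) {
            Start-Sleep -Seconds $DelaySeconds
        }
    }
    # Hosts that are still failing; an empty array means every host recovered.
    return ,$pending
}

# Possible use with the variables from the script above (needs a vCenter connection):
# $failed = Invoke-RescanUntilAccessible -VMHosts $impactedhosts `
#     -TestAccessible { param($h) (Get-VMHost $h | Get-Datastore $datastore).Accessible } `
#     -Rescan { param($h) Get-VMHostStorage -VMHost $h -RescanAllHba -RescanVmfs }
# if ($failed.Count -gt 0) { <send the email alert> }
```

Returning the list of hosts that never recovered means the email can name them, instead of only saying that the cluster has a problem.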
The last step, once the loop has terminated, is to check whether the error message has been populated. If it has, an email goes out to the second level support team letting them know that automatic rescanning has not been successful.
if ($global:hosterror -ne $null) {Send-MailMessage -To $email -Subject "Unable to rescan ISCSI storage in cluster $cluster" -Bodyashtml -Body "$global:hosterror"  -SmtpServer "m.nobody.com" -From "VMware@nobody.com"}
I am hoping that you find this useful, as I have heard this is a very common problem, especially with iSCSI implementations. Leave me a comment if you have any questions or suggestions, or if you had a similar problem and solved it in a different way.

Wednesday, May 4, 2016

PowerCLI VM hardening script

Someone on the LinkedIn PowerCLI Forum (a great group) asked if anyone had a VM hardening script. I was working on one based on the output of our vROps implementation. This may not contain all of the settings available in the hardening guide, but it did take care of most of the ones that vROps was alerting on.

One important caveat: the VM needs to be shut down when you run this script, as all the advanced settings are locked while the VM is running.


Param(
  [Parameter(Mandatory=$True,Position=1)]
  [string]$targetvm
)
# The VM must be powered off. The hardening guide value for the isolation.* keys is true.
# -Force overwrites a setting that already exists instead of raising an error.
$vm = Get-VM $targetvm
$vm  |New-AdvancedSetting -name 'log.keepOld' -Value 10 -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.ghi.launchmenu.change' -Value true -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.device.edit.disable' -Value true -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.hgfsServerSet.disable' -Value true -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.autoInstall.disable' -Value true -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.unity.push.update.disable' -Value true -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.diskWiper.disable' -Value true -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.ghi.protocolhandler.info.disable' -Value true -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'RemoteDisplay.maxConnections' -Value 2 -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.vmxDnDVersionGet.disable' -Value true -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.bios.bbs.disable' -Value true -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.unity.taskbar.disable' -Value true -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.diskShrink.disable' -Value true -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.unity.windowContents.disable' -Value true -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.unityInterlockOperation.disable' -Value true -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.ghi.trayicon.disable' -Value true -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.vixMessage.disable' -Value true -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.ghi.autologon.disable' -Value true -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.device.connectable.disable' -Value true -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.monitor.control.disable' -Value true -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.memSchedFakeSampleStats.disable' -Value true -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'log.rotateSize' -Value 1024000 -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.unityActive.disable' -Value true -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.getCreds.disable' -Value true -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.ghi.host.shellAction.disable' -Value true -Force -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.trashFolderState.disable' -Value true -Force -confirm:$false

I wrote this with -targetvm as a parameter, so I can call it on any subset of machines I choose, such as Get-folder dev |get-vm |foreach-object {./vmsecurityupdate $_.name} 

Most of the parameters above were recommended against "default build" VMs, so if you ran the vROps VM hardening alert you would likely see the same recommendations. You may want more settings, or maybe fewer, depending on many business factors. The easy way to plan your settings is to run Get-VM vmname |Get-AdvancedSetting |select * and find out which settings are important to you or your organization. My long term goal is to get this script into our build automation so every VM we push out has an improved security posture.
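One way to plan the settings is to compare what you want against what a VM already has before changing anything. The sketch below is an illustration only: Compare-HardeningSetting is a hypothetical helper (not a PowerCLI cmdlet), and it works on plain hashtables so it can be tried without a vCenter connection.

```powershell
# Hypothetical helper: reports desired settings that are missing or different.
# Both inputs are hashtables of setting name -> value.
function Compare-HardeningSetting {
    param(
        [hashtable]$Desired = @{},
        [hashtable]$Current = @{}
    )
    foreach ($name in ($Desired.Keys | Sort-Object)) {
        $want = [string]$Desired[$name]
        if (-not $Current.ContainsKey($name)) {
            New-Object PSObject -Property @{ Name = $name; Desired = $want; Actual = $null; Status = 'Missing' }
        }
        elseif ([string]$Current[$name] -ne $want) {
            # -ne ignores case, so TRUE and true count as the same value.
            New-Object PSObject -Property @{ Name = $name; Desired = $want; Actual = [string]$Current[$name]; Status = 'Different' }
        }
    }
}

# Example with made-up current values.
$desired = @{ 'isolation.tools.diskShrink.disable' = 'true'; 'log.keepOld' = 10; 'RemoteDisplay.maxConnections' = 2 }
$current = @{ 'isolation.tools.diskShrink.disable' = 'TRUE'; 'log.keepOld' = '5' }
$report  = @(Compare-HardeningSetting -Desired $desired -Current $current)
$report | Format-Table Name, Desired, Actual, Status -AutoSize

# Against a real VM, $current could be filled like this:
# $current = @{}
# Get-VM $targetvm | Get-AdvancedSetting | ForEach-Object { $current[$_.Name] = $_.Value }
```

Only the settings in the report would then need a New-AdvancedSetting call, which keeps the change list short when a VM is already mostly compliant.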

I hope this helps out.

Friday, April 22, 2016

Creating a reactive alarm for ESXi ROBO cluster or as I call it "DRS-light"

I have decided to start blogging again after a long time away. Eight years to be precise.

The first item I want to share is a solution for a problem that occurred in my company's retail implementation, which uses VMware ROBO licensing. Anyone familiar with ROBO knows there is one major exclusion that makes cluster management challenging, and that is the lack of DRS. In any large scale implementation without DRS, I think it is fairly common to end up with unbalanced workloads after maintenance or outages. This problem is compounded when you have a large support team of varying skill levels.

How I chose to address this problem was to create an alarm based on Host Memory Usage,  as Host memory is generally the first problem that occurs in our particular environment.  This methodology would also work for CPU constraints, but that is not a problem we had.

My starting point was the excellent article from the PowerCLI blog on using PowerCLI scripts in alarm actions. 

One thing to note :  the syntax suggested in the article for the actual alarm action did not work for me  - so I used the following path to execute the Powershell code with the alarm :

In case that type is too small to read, I used the following to call PowerShell (instead of the batch file method suggested in the original article):
c:\windows\system32\WindowsPowerShell\v1.0\powershell.exe 
  
Here is the full code of the script I am using. I will detail the purpose section by section (starting at line 1, as the rest is covered pretty well in the original blog post by VMware):
# In response to a host memory alarm, move the largest VM off that host.
# This is just for demonstration purposes.
# Find more info at http://blogs.vmware.com/vipowershell
$basePath = "C:\PS_SCRIPTS\Alarms"
$ProgressPreference = "SilentlyContinue"
$env:APPDATA = "c:\Documents and Settings\All Users\Application Data"

## Import the admin credential.
#. "$basePath\credentialManagement.ps1"
#$credential = Import-PSCredential "$basePath\systemCredentials.enc.xml"

# Log in.
Add-PSSnapin VMware.Vimautomation.Core -ea SilentlyContinue 
$WarningPreference = "SilentlyContinue"
Connect-VIServer localhost -User "somedomain/someuser" -Password "xxxxxxx"
$hostId = "HostSystem-" + $env:VMWARE_ALARM_TARGET_ID 
$vmhost = Get-VMHost -Id $hostId 
$cluster = Get-VMHost $vmhost |Get-Cluster 
# -notlike is needed here: -ne does not expand the * wildcard.
$vm = get-vmhost $vmhost |get-vm |where {$_.name -notlike "z*"}|sort-object memorygb -descending |select -first 1 
$destination = get-cluster $cluster |get-vmhost |where {$_.name -ne $vmhost.Name} |sort-object MemoryUsageGB |select -first 1 
$destinationfree = ($destination.MemoryTotalGB - $destination.MemoryusageGB) 
$vmhostfree = ($vmhost.MemoryTotalGB - $vmhost.MemoryUsageGB) 
$vmsize = ($vm.memoryGB) 
$difference = ($destinationfree - $vmhostfree)
# Gather the details for the email first: the else block must directly follow the if block.
$count = @($cluster |get-vmhost).Count
$date = Get-Date
if (($difference - $vmsize) -gt 0) { Move-VM -VM $vm -Destination $destination -Confirm:$false} 
else {Send-MailMessage -To "somepeople@somecompany.com" -Subject "Unable to Balance Retail Cluster $cluster - $date" -Body "Unable to balance Cluster $cluster due to insufficient resources available. There are currently $count Host(s) available in the Cluster."  -SmtpServer "m.somecompany.com" -From "vmware_team@somecompany.com" }

Now I will break down what the various lines of the script are doing :
$hostId = "HostSystem-" + $env:VMWARE_ALARM_TARGET_ID 
$vmhost = Get-VMHost -Id $hostId 
$cluster = Get-VMHost $vmhost |Get-Cluster
These three lines take the alarm value ($env:VMWARE_ALARM_TARGET_ID) and convert it to a more usable form for PowerCLI commands. In this case the alarm returns the host ID number, but without the prefix needed to query it using get-vmhost. So the first operation I do is add HostSystem- to the ID to make it easier to query. Then I retrieve the object for the vmhost and the cluster for future actions :
$vm = get-vmhost $vmhost |get-vm |where {$_.name -notlike "z*"}|sort-object memorygb -descending |select -first 1 
The next objective is to get the list of VMs on the host that is under memory pressure and sort it by configured memory, largest first. The filter where {$_.name -notlike "z*"} is not necessary in most environments, but in our case we have a guest that starts with z that is considered "immobile" (note that -ne would not expand the * wildcard, so -notlike is used) :
$destination = get-cluster $cluster |get-vmhost |where {$_.name -ne $vmhost.Name} |sort-object MemoryUsageGB |select -first 1 
This next line finds the other host in the cluster with the least memory in use (on identically sized hosts, the one with the most memory free) and stores it for a later calculation :
$destinationfree = ($destination.MemoryTotalGB - $destination.MemoryusageGB) 
$vmhostfree = ($vmhost.MemoryTotalGB - $vmhost.MemoryUsageGB)
$vmsize = ($vm.memoryGB)  
Here the free memory on the destination host, the free memory on the current host, and the configured memory of the VM are stored as variables for the calculations :
$difference = ($destinationfree - $vmhostfree)
$count = @($cluster |get-vmhost).Count
$date = Get-Date
Now all the collected variables are used to calculate whether the VM should move to another host. I also count the hosts in the cluster and capture the date before the if statement, because PowerShell requires the else block to follow the if block directly :
if (($difference - $vmsize) -gt 0) { Move-VM -VM $vm -Destination $destination -Confirm:$false} 
else {Send-MailMessage -To "somepeople@somecompany.com" -Subject "Unable to Balance Retail Cluster $cluster - $date" -Body "Unable to balance Cluster $cluster due to insufficient resources available. There are currently $count Host(s) available in the Cluster."  -SmtpServer "m.somecompany.com" -From "vmware_team@somecompany.com" }
The if statement makes sure a VM only moves when the destination, after taking on the VM, would still have more free memory than the source host has now. If no move is possible (such as when there is no other node with more free RAM, or the cluster is missing nodes), the else branch sends an email to the support team with some context so they can investigate.
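To make the move decision easier to reason about, here is the same arithmetic pulled out into a small function with made-up numbers. The function name Test-MoveImprovesBalance is hypothetical and only wraps the calculation from the script above.

```powershell
# Hypothetical helper: the same test the script uses before calling Move-VM.
# All values are in GB.
function Test-MoveImprovesBalance {
    param(
        [double]$SourceFreeGB,
        [double]$DestinationFreeGB,
        [double]$VMMemoryGB
    )
    $difference = $DestinationFreeGB - $SourceFreeGB
    return (($difference - $VMMemoryGB) -gt 0)
}

# Source has 4 GB free, destination has 40 GB free, the VM is 16 GB:
# (40 - 4) - 16 = 20, which is greater than 0, so the VM would move.
Test-MoveImprovesBalance -SourceFreeGB 4 -DestinationFreeGB 40 -VMMemoryGB 16   # True

# Destination has only 12 GB free: (12 - 4) - 16 = -8, so no move and the email is sent.
Test-MoveImprovesBalance -SourceFreeGB 4 -DestinationFreeGB 12 -VMMemoryGB 16   # False
```

Keeping the test strict (greater than zero) means a move that would only swap which host is under pressure is never attempted.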

So that is my solution for balancing memory in a DRS-less ROBO environment  

If you have any thoughts or suggestions -  please comment.