Thursday, May 19, 2016

Using a Dynamic Alarm to Rescan Unavailable Datastores

Hi All,

So in our particular retail environment we have had an ongoing issue with iSCSI datastore availability after a power outage. We have the same problem on in-store deployments when the onsite technicians power on the components in the wrong order (i.e. power on the hosts before the iSCSI array). The result was, of course, escalations for retail stores being down for a significant period of time following a power outage.

Of course I googled the heck out of this problem, and the only solution I found was to change the boot delay on the ESXi hosts to exceed the array boot time. That was not a great solution for us, because it would also delay startup on planned reboots where there was no power outage.

One day I was working on another reactive alarm and I got that light bulb over my head. I thought "I wonder if there is an alarm type for datastore connectivity to hosts that I could use to trigger an action?" It turns out there is a condition that can be monitored at the datastore level for this: it is called Datastore State to All Hosts. In the screenshots below, I created a custom rule with this condition. For a review on setting up PowerCLI based alarms, refer to my blog post here.


Now that we have the alarm set up, let's get into the PowerCLI code behind it. In our particular implementation we only have to worry about 3 nodes in our clusters, so my script is focused on that. The same methodology would work with a larger cluster and a foreach-object loop, but I am going to use my already written script to show how I solved my particular problem.
# In response to a datastore alarm, rescan storage on the hosts that have lost access to the datastore.
# This is just for demonstration purposes.
# Find more info at http://blogs.vmware.com/vipowershell
$basePath = "C:\Documents and Settings\All Users\Application Data\VMware\VMware VirtualCenter\scripts"
$ProgressPreference = "SilentlyContinue"
$env:APPDATA = "c:\Documents and Settings\All Users\Application Data"

Add-PSSnapin VMware.Vimautomation.Core -ea SilentlyContinue
$WarningPreference = "SilentlyContinue"
Connect-VIServer localhost -User "nobody\someone" -Password "xxxxxxxxxxxxxxxxxxxxxx" | Out-Null
$email = "jason.willey@gmail.com"
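# PowerCLI Ids take the form "Datastore-datastore-123", so prefix the alarm's target ID and look it up by Id
$datastoreId = "Datastore-" + $env:VMWARE_ALARM_TARGET_ID
$datastore = Get-Datastore -Id $datastoreId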
$impactedhosts = $datastore |get-vmhost
$cluster = $impactedhosts |select -First 1 |get-cluster 
# Host one check
$Hostone = $impactedhosts |where {$_.name -match "x1"}
$Hostoneconnected = $Hostone.ConnectionState
# Use -eq rather than -match here, because "Disconnected" also matches "connected"
if ($Hostoneconnected -eq "Connected") {
    $hostonestate = get-vmhost $hostone |get-datastore $datastore
    $global:hostonestate = $hostonestate.Accessible
}
# Host two check
$Hosttwo = $impactedhosts |where {$_.name -match "x2"}
$Hosttwoconnected = $Hosttwo.ConnectionState
if ($Hosttwoconnected -eq "Connected") {
    $hosttwostate = get-vmhost $hosttwo |get-datastore $datastore
    $global:hosttwostate = $hosttwostate.Accessible
}
# Host three check
$Hostthree = $impactedhosts |where {$_.name -match "x3"}
$Hostthreeconnected = $Hostthree.ConnectionState
if ($Hostthreeconnected -eq "Connected") {
    $hostthreestate = get-vmhost $hostthree |get-datastore $datastore
    $global:hostthreestate = $hostthreestate.Accessible
}
# Starting rescan loop: keep going while any host reports the datastore as inaccessible, up to 10 passes
$global:counter = 0
while (($global:hostonestate -match "False" -or $global:hosttwostate -match "False" -or $global:hostthreestate -match "False") -and $global:counter -lt 10) {
    if ($global:hostonestate -match "False") {
        Get-VMHostStorage -VMHost $Hostone -RescanAllHba -RescanVmfs |Out-Null
        $global:hostonestate = get-vmhost $hostone |get-datastore $datastore
        $global:hostonestate = $global:hostonestate.Accessible
    }
    if ($global:hosttwostate -match "False") {
        Get-VMHostStorage -VMHost $Hosttwo -RescanAllHba -RescanVmfs |Out-Null
        $global:hosttwostate = get-vmhost $hosttwo |get-datastore $datastore
        $global:hosttwostate = $global:hosttwostate.Accessible
    }
    if ($global:hostthreestate -match "False") {
        Get-VMHostStorage -VMHost $Hostthree -RescanAllHba -RescanVmfs |Out-Null
        $global:hostthreestate = get-vmhost $hostthree |get-datastore $datastore
        $global:hostthreestate = $global:hostthreestate.Accessible
    }
    sleep 120
    $global:counter = $global:counter + 1
    # On the last pass, record the error text for the email alert
    if ($global:counter -eq 10) {
        $global:hosterror = "$datastore has been rescanned every 2 minutes for 20 minutes and is not available on all nodes of $cluster"
    }
}
# Send the alert only if the loop ran out of passes
if ($global:hosterror -ne $null) {Send-MailMessage -To $email -Subject "Unable to rescan ISCSI storage in cluster $cluster" -BodyAsHtml -Body "$global:hosterror" -SmtpServer "m.nobody.com" -From "VMware@nobody.com"}


I will skip the logic for interpreting the alarm received, as that is detailed in another post here.

First things first: I translate the datastore ID presented by the alert into a more script-friendly datastore object, and get the hosts that are supposed to be connected to this datastore. I also get the cluster name for use later on, when we send an email alert if we were unable to rescan successfully.
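# PowerCLI Ids take the form "Datastore-datastore-123", so prefix the alarm's target ID and look it up by Id
$datastoreId = "Datastore-" + $env:VMWARE_ALARM_TARGET_ID
$datastore = Get-Datastore -Id $datastoreId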
$impactedhosts = $datastore |get-vmhost
$cluster = $impactedhosts |select -First 1 |get-cluster 
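The ID translation can also be wrapped in a small helper that fails loudly when the alarm hands over something unexpected. This is only a minimal sketch: the function name ConvertTo-DatastoreId is hypothetical, and it assumes the alarm passes a managed object reference value such as datastore-123 in VMWARE_ALARM_TARGET_ID, which then needs the object type prefixed to form a PowerCLI Id.

```powershell
# Hypothetical helper (not part of PowerCLI): turn the alarm's target ID
# into the Id format that Get-Datastore -Id expects.
function ConvertTo-DatastoreId {
    param(
        [Parameter(Mandatory = $true)]
        [string]$AlarmTargetId   # assumed to look like "datastore-123"
    )
    # Reject anything that is not a datastore reference, e.g. "vm-42".
    if ($AlarmTargetId -notmatch '^datastore-\d+$') {
        throw "Unexpected alarm target ID: '$AlarmTargetId'"
    }
    return "Datastore-$AlarmTargetId"
}
```

In the alarm script this could be used as Get-Datastore -Id (ConvertTo-DatastoreId $env:VMWARE_ALARM_TARGET_ID).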
From there, I verify that each host is actually connected to vCenter, and check whether the datastore is seen as accessible from the host side. I then repeat this logic for each possible host.
$Hostone = $impactedhosts |where {$_.name -match "x1"}
$Hostoneconnected = $Hostone.ConnectionState
# Use -eq rather than -match here, because "Disconnected" also matches "connected"
if ($Hostoneconnected -eq "Connected") {
    $hostonestate = get-vmhost $hostone |get-datastore $datastore
    $global:hostonestate = $hostonestate.Accessible
}
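For a cluster with more than three nodes, the same per-host check could be written once and looped over every host. The sketch below is an assumption about how that might look, not the script I run in production: Get-HostAccessState is a hypothetical name, and the actual accessibility test is passed in as a script block so the loop itself does not depend on PowerCLI. It compares the connection state with -eq, since a -match on "connected" would also accept "Disconnected".

```powershell
# Hypothetical helper: collect per-host accessibility into a hashtable keyed by host name.
# $TestAccess is a script block that receives one host and returns $true or $false.
function Get-HostAccessState {
    param(
        [Parameter(Mandatory = $true)] [object[]]$VMHost,
        [Parameter(Mandatory = $true)] [scriptblock]$TestAccess
    )
    $state = @{}
    foreach ($h in $VMHost) {
        # Only hosts connected to vCenter are checked; the rest are left out of the result.
        if ("$($h.ConnectionState)" -eq 'Connected') {
            $state[$h.Name] = [bool](& $TestAccess $h)
        }
    }
    return $state
}
```

With PowerCLI loaded, the call might look like Get-HostAccessState -VMHost $impactedhosts -TestAccess { param($h) ($h | Get-Datastore $datastore).Accessible }, which reuses the same check as the three-host version.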
Now I set up the rescanning loop. In our environment we have chosen to rescan every two minutes for twenty minutes and then send an email alert to second level support. I used $global: variables here so they will survive the loop. After each rescan, the script checks the status of the datastore again, and keeps running the loop until the datastore is back on all hosts or the loop counter reaches 10. At the end of each pass of the loop there is a 120 second sleep to give us our two minutes between rescans. On the last iteration of the loop, it writes an error message saying that after 20 minutes of 2 minute rescans the cluster still has a datastore availability problem.
$global:counter = 0
while (($global:hostonestate -match "False" -or $global:hosttwostate -match "False" -or $global:hostthreestate -match "False") -and $global:counter -lt 10) {
    if ($global:hostonestate -match "False") {
        Get-VMHostStorage -VMHost $Hostone -RescanAllHba -RescanVmfs |Out-Null
        $global:hostonestate = get-vmhost $hostone |get-datastore $datastore
        $global:hostonestate = $global:hostonestate.Accessible
    }
    if ($global:hosttwostate -match "False") {
        Get-VMHostStorage -VMHost $Hosttwo -RescanAllHba -RescanVmfs |Out-Null
        $global:hosttwostate = get-vmhost $hosttwo |get-datastore $datastore
        $global:hosttwostate = $global:hosttwostate.Accessible
    }
    if ($global:hostthreestate -match "False") {
        Get-VMHostStorage -VMHost $Hostthree -RescanAllHba -RescanVmfs |Out-Null
        $global:hostthreestate = get-vmhost $hostthree |get-datastore $datastore
        $global:hostthreestate = $global:hostthreestate.Accessible
    }
    sleep 120
    $global:counter = $global:counter + 1
    # On the last pass, record the error text for the email alert
    if ($global:counter -eq 10) {
        $global:hosterror = "$datastore has been rescanned every 2 minutes for 20 minutes and is not available on all nodes of $cluster"
    }
}
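The same rescan-and-retry idea can be written for any number of hosts. What follows is a rough sketch under my own assumptions rather than the script above: Invoke-RescanUntilAccessible is a hypothetical name, and both the rescan action and the accessibility test are passed in as script blocks. One deliberate difference is that it returns the hosts that are still failing once the loop ends, so the decision to alert is based on the final state instead of on the pass counter.

```powershell
# Hypothetical helper: rescan every host that still cannot see the datastore,
# wait, and repeat until all hosts recover or the attempt limit is reached.
function Invoke-RescanUntilAccessible {
    param(
        [Parameter(Mandatory = $true)] [object[]]$VMHost,
        [Parameter(Mandatory = $true)] [scriptblock]$Rescan,
        [Parameter(Mandatory = $true)] [scriptblock]$TestAccess,
        [int]$MaxAttempts = 10,
        [int]$DelaySeconds = 120
    )
    # Start with the hosts that cannot currently access the datastore.
    $pending = @($VMHost | Where-Object { -not (& $TestAccess $_) })
    $attempt = 0
    while ($pending.Count -gt 0 -and $attempt -lt $MaxAttempts) {
        $attempt++
        foreach ($h in $pending) { & $Rescan $h | Out-Null }
        # Keep only the hosts that are still failing after this rescan.
        $pending = @($pending | Where-Object { -not (& $TestAccess $_) })
        # Wait between passes, but not after the final one.
        if ($pending.Count -gt 0 -and $attempt -lt $MaxAttempts) {
            Start-Sleep -Seconds $DelaySeconds
        }
    }
    # An empty array means every host recovered.
    return ,$pending
}
```

In the alarm script, -Rescan could be { param($h) Get-VMHostStorage -VMHost $h -RescanAllHba -RescanVmfs } and -TestAccess the same accessibility check used earlier; if the returned array is not empty, that is the point to send the email.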
The last step, once the loop has terminated, is to check whether the error message has been populated. If it has, the script sends an email to the second level support team letting them know that automatic rescanning has not been successful.
if ($global:hosterror -ne $null) {Send-MailMessage -To $email -Subject "Unable to rescan ISCSI storage in cluster $cluster" -BodyAsHtml -Body "$global:hosterror" -SmtpServer "m.nobody.com" -From "VMware@nobody.com"}
I am hoping that you find this useful, as I have heard this is a very common problem, especially with iSCSI implementations. Leave me a comment if you have any questions or suggestions, or if you had a similar problem and solved it in a different way.

Wednesday, May 4, 2016

PowerCLI VM hardening script

Someone on the LinkedIn PowerCLI Forum (a great group) asked if anyone had a VM hardening script. I was working on one based on the output of our vROps implementation. This may not contain all of the settings available in the hardening guide, but it did take care of most of the ones that vROps was alerting on.

One important caveat: the VM needs to be shut down when you run this script, as all the advanced settings are locked while the VM is running.


Param(
  [Parameter(Mandatory=$True,Position=1)]
  [string]$targetvm
)
$vm = Get-VM $targetvm
# The isolation.* settings below switch a feature off, so they must be set to true to harden the VM.
# Setting names follow the vSphere hardening guide.
$vm  |New-AdvancedSetting -name 'log.keepOld' -Value 10 -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.ghi.launchmenu.change' -Value true -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.device.edit.disable' -Value true -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.hgfsServerSet.disable' -Value true -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.autoInstall.disable' -Value true -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.unity.push.update.disable' -Value true -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.diskWiper.disable' -Value true -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.ghi.protocolhandler.info.disable' -Value true -confirm:$false
$vm  |New-AdvancedSetting -name 'RemoteDisplay.maxConnections' -Value 2 -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.vmxDnDVersionGet.disable' -Value true -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.bios.bbs.disable' -Value true -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.unity.taskbar.disable' -Value true -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.diskShrink.disable' -Value true -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.unity.windowContents.disable' -Value true -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.unityInterlockOperation.disable' -Value true -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.ghi.trayicon.disable' -Value true -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.vixMessage.disable' -Value true -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.ghi.autologon.disable' -Value true -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.device.connectable.disable' -Value true -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.monitor.control.disable' -Value true -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.memSchedFakeSampleStats.disable' -Value true -confirm:$false
$vm  |New-AdvancedSetting -name 'log.rotateSize' -Value 1024000 -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.unityActive.disable' -Value true -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.getCreds.disable' -Value true -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.ghi.host.shellAction.disable' -Value true -confirm:$false
$vm  |New-AdvancedSetting -name 'isolation.tools.trashFolderState.disable' -Value true -confirm:$false

I wrote this with -targetvm as a mandatory parameter, so I can call it on any subset of machines I choose, such as Get-folder dev |get-vm |foreach-object {./vmsecurityupdate $_.name}

Most of the settings above were recommended against "default build" VMs, so if you run the vROps VM hardening alert you will likely see the same recommendations. You may want more settings, or maybe fewer, depending on many business factors. The easy way to plan your settings is to do a get-advancedsetting vmname |select * and find out which settings are important to you or your organization. My long term goal is to get this script into our build automation so every VM we push out has an improved security posture.
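When planning which settings to push, it can help to compare a list of desired values against what a VM already has. Here is a small sketch of that comparison, with the caveat that Get-SettingDrift is a hypothetical name and both inputs are plain hashtables of setting name to value, so the comparison itself does not need a vCenter connection. Values are compared as strings and ignore case, on the assumption that TRUE and true mean the same thing in the VM configuration.

```powershell
# Hypothetical helper: list the desired settings that are missing or different on a VM.
# $Current is a hashtable of name -> value, e.g. built from Get-AdvancedSetting output.
function Get-SettingDrift {
    param(
        [Parameter(Mandatory = $true)] [hashtable]$Desired,
        [Parameter(Mandatory = $true)] [hashtable]$Current
    )
    foreach ($name in ($Desired.Keys | Sort-Object)) {
        $want = [string]$Desired[$name]
        # A setting that does not exist on the VM is reported with a null current value.
        $have = if ($Current.ContainsKey($name)) { [string]$Current[$name] } else { $null }
        if ($have -ne $want) {
            [pscustomobject]@{ Name = $name; Desired = $want; Current = $have }
        }
    }
}
```

The current values could be gathered with something like Get-AdvancedSetting -Entity $vm | ForEach-Object { $current[$_.Name] = $_.Value } after creating $current as an empty hashtable, and only the settings reported as drift would then need to be created or changed.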

I hope this helps out.