In our particular retail environment we have had an ongoing issue with iSCSI datastore availability after a power outage. We see the same problem on in-store deployments when the onsite technicians power on the components in the wrong order (i.e. power on the hosts before the iSCSI array). The result, of course, was escalations for retail stores being down for a significant period of time following a power outage.
Of course I googled the heck out of this problem, and the only solution I found was to change the boot delay on the ESXi hosts to exceed the array boot time. That was not a great solution for us, because it would also delay startup on planned reboots that have nothing to do with a power outage.
One day I was working on another reactive alarm and the light bulb went on over my head. I thought, "I wonder if there is an alarm type for datastore connectivity to hosts that I could use to trigger an action?" It turns out there is a condition that can be monitored at the datastore level for exactly this: it is called Datastore State to All Hosts. In the screenshots below, I created a custom rule with this condition. For a review of setting up PowerCLI-based alarms, refer to my blog post here.
Now that we have the alarm set up, let's get into the PowerCLI code behind it. In our particular implementation we only have to worry about three nodes per cluster, so my script is focused on that. The same method would work for a larger cluster with a ForEach-Object loop, but I am just going to use my already-written script to show how I solved my particular problem.
    # In response to a datastore alarm, rescan storage on the hosts that lost the datastore.
    # This is just for demonstration purposes.
    # Find more info at http://blogs.vmware.com/vipowershell
    $basePath = "C:\Documents and Settings\All Users\Application Data\VMware\VMware VirtualCenter\scripts"
    $ProgressPreference = "SilentlyContinue"
    $env:APPDATA = "c:\Documents and Settings\All Users\Application Data"
    Add-PSSnapin VMware.Vimautomation.Core -ea SilentlyContinue
    $WarningPreference = "SilentlyContinue"
    Connect-VIServer localhost -User "nobody\someone" -Password "xxxxxxxxxxxxxxxxxxxxxx" | Out-Null

    $email = "jason.willey@gmail.com"
    $datastoreId = "Datastore" + $env:VMWARE_ALARM_TARGET_ID
    $datastore = Get-Datastore $datastoreId
    $impactedhosts = $datastore | Get-VMHost
    $cluster = $impactedhosts | select -First 1 | Get-Cluster

    # Host one check
    # -eq rather than -match, because "Disconnected" also matches "connected"
    $Hostone = $impactedhosts | where {$_.name -match "x1"}
    $Hostoneconnected = $Hostone.ConnectionState
    if ($Hostoneconnected -eq "Connected") {
        $hostonestate = Get-VMHost $Hostone | Get-Datastore $datastore
        $global:hostonestate = $hostonestate.Accessible
    }

    # Host two check
    $Hosttwo = $impactedhosts | where {$_.name -match "x2"}
    $Hosttwoconnected = $Hosttwo.ConnectionState
    if ($Hosttwoconnected -eq "Connected") {
        $hosttwostate = Get-VMHost $Hosttwo | Get-Datastore $datastore
        $global:hosttwostate = $hosttwostate.Accessible
    }

    # Host three check
    $Hostthree = $impactedhosts | where {$_.name -match "x3"}
    $Hostthreeconnected = $Hostthree.ConnectionState
    if ($Hostthreeconnected -eq "Connected") {
        $hostthreestate = Get-VMHost $Hostthree | Get-Datastore $datastore
        $global:hostthreestate = $hostthreestate.Accessible
    }

    # Starting rescan loop
    $global:counter = 0
    while (($global:hostonestate -match "False" -or $global:hosttwostate -match "False" -or $global:hostthreestate -match "False") -and $global:counter -lt 10) {
        if ($global:hostonestate -match "False") {
            Get-VMHostStorage -VMHost $Hostone -RescanAllHba -RescanVmfs | Out-Null
            $global:hostonestate = Get-VMHost $Hostone | Get-Datastore $datastore
            $global:hostonestate = $global:hostonestate.Accessible
        }
        if ($global:hosttwostate -match "False") {
            Get-VMHostStorage -VMHost $Hosttwo -RescanAllHba -RescanVmfs | Out-Null
            $global:hosttwostate = Get-VMHost $Hosttwo | Get-Datastore $datastore
            $global:hosttwostate = $global:hosttwostate.Accessible
        }
        if ($global:hostthreestate -match "False") {
            Get-VMHostStorage -VMHost $Hostthree -RescanAllHba -RescanVmfs | Out-Null
            $global:hostthreestate = Get-VMHost $Hostthree | Get-Datastore $datastore
            $global:hostthreestate = $global:hostthreestate.Accessible
        }
        sleep 120
        $global:counter = $global:counter + 1
        if ($global:counter -eq 9) {
            $global:hosterror = "$datastore has been rescanned every 2 minutes for 20 minutes and is not available on all nodes of $cluster"
        }
    }

    # Alert second level support if the rescans did not bring the datastore back
    if ($global:hosterror -ne $null) {
        Send-MailMessage -To $email -Subject "Unable to rescan ISCSI storage in cluster $cluster" -BodyAsHtml -Body "$global:hosterror" -SmtpServer "m.nobody.com" -From "VMware@nobody.com"
    }

I will skip all the logic for interpreting the alarm received, as that is detailed in another post here.
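For clusters with more than three nodes, the per-host checks and the rescan loop could be folded into one helper that loops over whatever hosts are attached to the datastore. The following is only a sketch of that idea, not what runs in our stores: Invoke-RescanUntilAccessible and its parameters are names made up for illustration, and the check and the rescan are passed in as script blocks so the retry logic can be tried out without a vCenter connection.

```powershell
# Hypothetical helper (not part of the original script): keeps rescanning every
# host that still cannot see the datastore, up to a maximum number of attempts.
function Invoke-RescanUntilAccessible {
    param(
        [Parameter(Mandatory)] [object[]]    $VMHost,
        [Parameter(Mandatory)] [scriptblock] $TestAccessible,  # returns $true when a host sees the datastore
        [Parameter(Mandatory)] [scriptblock] $Rescan,          # triggers a storage rescan on one host
        [int] $MaxAttempts  = 10,
        [int] $DelaySeconds = 120
    )

    # Start with the hosts that cannot currently see the datastore.
    $pending = @($VMHost | Where-Object { -not (& $TestAccessible $_) })

    for ($attempt = 1; $pending.Count -gt 0 -and $attempt -le $MaxAttempts; $attempt++) {
        foreach ($vmh in $pending) { & $Rescan $vmh | Out-Null }

        # Re-check and keep only the hosts that are still missing the datastore.
        $pending = @($pending | Where-Object { -not (& $TestAccessible $_) })

        # Wait between attempts, but not after the final one.
        if ($pending.Count -gt 0 -and $attempt -lt $MaxAttempts) {
            Start-Sleep -Seconds $DelaySeconds
        }
    }

    # Hosts that never recovered; nothing is returned when all of them did.
    return $pending
}

# Possible wiring with the cmdlets from the script above (assumption, untested):
# $connected = $impactedhosts | Where-Object { $_.ConnectionState -eq "Connected" }
# $stillDown = @(Invoke-RescanUntilAccessible -VMHost $connected `
#     -TestAccessible { param($vmh) (Get-VMHost $vmh | Get-Datastore $datastore).Accessible } `
#     -Rescan         { param($vmh) Get-VMHostStorage -VMHost $vmh -RescanAllHba -RescanVmfs })
# if ($stillDown.Count -gt 0) { <# send the alert email here #> }
```

Passing the check and the rescan in as script blocks keeps the loop itself independent of PowerCLI, and the helper decides success from the final state of the hosts rather than from a counter value. An empty result means every host can see the datastore again; anything that comes back is a candidate for the alert email.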
First things first: I translate the datastore ID presented by the alert into a more script-friendly datastore object and get the hosts that are supposed to be connected to this datastore. I also get the cluster name for use later on, when we send an email alert in the case that we were unable to rescan successfully.
    $datastoreId = "Datastore" + $env:VMWARE_ALARM_TARGET_ID
    $datastore = Get-Datastore $datastoreId
    $impactedhosts = $datastore | Get-VMHost
    $cluster = $impactedhosts | select -First 1 | Get-Cluster

From there, I verify that each host is actually connected to vCenter, and check whether the datastore is seen as available from the host side. I then repeat this logic for each possible host.
    # -eq rather than -match, because "Disconnected" also matches "connected"
    $Hostone = $impactedhosts | where {$_.name -match "x1"}
    $Hostoneconnected = $Hostone.ConnectionState
    if ($Hostoneconnected -eq "Connected") {
        $hostonestate = Get-VMHost $Hostone | Get-Datastore $datastore
        $global:hostonestate = $hostonestate.Accessible
    }

Now I set up the rescanning loop. In our environment we have chosen to rescan every two minutes for twenty minutes and then send an email alert to second level support. I used $global: variables here so they will survive the loop. After each rescan, the script checks the status of the datastore again, and keeps running the loop until the datastore is accessible from every host or the loop counter reaches 10. At the end of each pass of the loop there is a 120 second sleep to give us our two minutes between rescans. On the last iteration of the loop, it writes an error message saying that after 20 minutes of 2 minute rescans the cluster still has a datastore availability problem.
    $global:counter = 0
    while (($global:hostonestate -match "False" -or $global:hosttwostate -match "False" -or $global:hostthreestate -match "False") -and $global:counter -lt 10) {
        if ($global:hostonestate -match "False") {
            Get-VMHostStorage -VMHost $Hostone -RescanAllHba -RescanVmfs | Out-Null
            $global:hostonestate = Get-VMHost $Hostone | Get-Datastore $datastore
            $global:hostonestate = $global:hostonestate.Accessible
        }
        if ($global:hosttwostate -match "False") {
            Get-VMHostStorage -VMHost $Hosttwo -RescanAllHba -RescanVmfs | Out-Null
            $global:hosttwostate = Get-VMHost $Hosttwo | Get-Datastore $datastore
            $global:hosttwostate = $global:hosttwostate.Accessible
        }
        if ($global:hostthreestate -match "False") {
            Get-VMHostStorage -VMHost $Hostthree -RescanAllHba -RescanVmfs | Out-Null
            $global:hostthreestate = Get-VMHost $Hostthree | Get-Datastore $datastore
            $global:hostthreestate = $global:hostthreestate.Accessible
        }
        sleep 120
        $global:counter = $global:counter + 1
        if ($global:counter -eq 9) {
            $global:hosterror = "$datastore has been rescanned every 2 minutes for 20 minutes and is not available on all nodes of $cluster"
        }
    }

The last step, once the loop has terminated, is to check whether the error message has been populated and, if so, send an email to the second level support team letting them know that automatic rescanning has not been successful.
    if ($global:hosterror -ne $null) {
        Send-MailMessage -To $email -Subject "Unable to rescan ISCSI storage in cluster $cluster" -BodyAsHtml -Body "$global:hosterror" -SmtpServer "m.nobody.com" -From "VMware@nobody.com"
    }

I am hoping that you find this useful, as I have heard this is a very common problem, especially with iSCSI implementations. Leave me a comment if you have any questions or suggestions, or if you had a similar problem and solved it in a different way.
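One postscript for anyone adapting the connection checks or the loop condition: a few PowerShell operator behaviours are easy to trip over in a script like this. The snippet below is a small standalone illustration with made-up variable names; it runs in plain PowerShell and does not need PowerCLI.

```powershell
# -match is a case-insensitive regex test, so a substring is enough to match.
$matchesDisconnected = "Disconnected" -match "connected"   # True
$equalsConnected     = "Disconnected" -eq "Connected"      # False

# -and and -or share one precedence level and are evaluated left to right,
# so the first expression means ($true -or $false) -and $false.
$leftToRight = $true -or $false -and $false                # False
$grouped     = $true -or ($false -and $false)              # True

# A Boolean is converted to text before matching; $null matches nothing,
# so a host whose state was never set does not keep the loop running.
$falseMatches = $false -match "False"                      # True
$nullMatches  = $null  -match "False"                      # False
```

In practice this suggests comparing the connection state with -eq and putting explicit parentheses around the group of -or tests, so the retry limit applies to the whole group no matter how the condition is edited later.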