When troubleshooting APDs, you will quickly notice that gathering the forensics is quite burdensome. Further, if it is not done right, you could impact your environment by running out of space.
This article uses a recent bug I experienced to illustrate the handling of various components needed to determine root cause. We discuss vm-support with Performance Snapshots, sniffing network traffic, and gathering storage array logs. Your hardware may vary, but the methodology should be similar.
Warning: There is a critical bug on EMC 5600 storage arrays affecting vSphere ESXi 5.5 using NFS volumes. The bug is on the EMC side (EMC bug number 850730) and should be resolved in the next release of 8.1 code for the array. ESXi 5.1 and 6.0+ do not seem to be impacted. ESXi 5.5 may experience APDs due to the array closing the TCP window inappropriately.
About All Paths Down (APD)
When ESX stops getting NFS heartbeats from the array, it goes into APD_START, which begins a 140 second timer. If we do not recover within that time, the datastore in question goes into APD_TIMEOUT.
At this point, any VMs running on the affected datastore will still be powered on, but will be useless. The ESX host must be evacuated (right-click > Maintenance Mode), which allows healthy VMs to move to surviving ESX hosts in the cluster. The host must then be rebooted.
Note: In vSphere 6.0 there are new HA features to protect against this. On 5.5 you have to react manually.
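If you want to confirm the 140 second timer on a given host, it is exposed as an advanced setting. As a sketch (the option path /Misc/APDTimeout is what I have seen on 5.5 hosts; verify it on your build):

```shell
# Show the APD timeout (default 140 seconds) on the ESX host.
esxcli system settings advanced list -o /Misc/APDTimeout

# To change it (example only -- the 140s default is usually appropriate):
# esxcli system settings advanced set -o /Misc/APDTimeout -i 140
```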
Root cause can be quite difficult to determine unless there is an obvious environmental cause such as high utilization on the switches, the array, etc. If we must dig deeper, we will need logs from ESX and sniffer traces.
vm-support bundle (with Performance Snapshots)
Perform this from an SSH Session
vm-support -p -d <integer duration sec> -w <remote datastore/folder>
A normal vm-support bundle is only a few hundred MB. However, when you use the Performance Snapshot feature (not to be confused with virtual machine snapshots), the bundle can get very space intensive. For example, VMware GSS ideally wants a 10 minute run, which can easily be 1 GB in size depending on activity.
When using the Performance Snapshot facilities of vm-support, always point the output at shared or local storage (i.e. a VMFS volume or an NFS datastore such as an ISO datastore). The default is /var/tmp, and that will run out of space for sure with a 10 minute performance run, so choose wisely.
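As a concrete sketch (the datastore name ISO and the folder path are illustrative assumptions; the flags are from the usage above):

```shell
# 10 minute (600 second) vm-support run with performance snapshots (-p),
# written to a shared datastore instead of the /var/tmp default.
# "ISO/apd-debug" is an example path -- substitute your own datastore.
vm-support -p -d 600 -w /vmfs/volumes/ISO/apd-debug/
```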
Sniffing Network Traffic
Perform this from the DCUI
Ideally, capture from all points (ESX to switch, switch to array, etc.).
From my experience, simply sniffing the vmkernel interface used for NFS is sufficient. This should be done from the Direct Console User Interface (DCUI). That is one less thing (the SSH session itself) that the engineers have to filter out or explain in the trace.
The default tools on ESX (i.e. tcpdump-uw) are usually enough to dig into most problems. To learn more, see the related vmkdaily article How to Sniff Network Traffic on ESX.
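As a sketch (the interface name vmk0, datastore path, file sizes, and count are assumptions for illustration), a ring-buffer capture of NFS traffic on the vmkernel interface might look like:

```shell
# Capture NFS traffic (TCP 2049) on vmkernel interface vmk0, rotating
# through 10 files of roughly 100 MB each (-C size, -W count) so the
# capture cannot fill the filesystem. Write to a datastore, never to
# the ESX root filesystem.
tcpdump-uw -i vmk0 -s 1514 -C 100 -W 10 \
  -w /vmfs/volumes/ISO/apd-debug/nfs-vmk0.pcap port 2049
```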
Data vs Time
The timing of packet captures and log roll-over is critical. One can easily lose the objective by missing even one element: stopping the run too early or too late, or not getting the storage array logs in time after you did all of the hard work on the ESX side.
Choose a Host
The recommendation is to start with one ESX host, running one VM. Review and fine-tune your timings before attempting to sniff or log from multiple hosts, or hosts with heavier workloads.
We should be minimally invasive and ensure that we get what we want without impact. For example, choose a mix of local and remote datastores to offset the risk of running the ESX root filesystem out of space.
Putting it Together
- ESX Support Logs + Performance Snapshots
  - Rotate every 10 minutes
  - Run as a Scheduled Task or similar
  - The task must be stopped manually when you are done
  - Choose a local VMFS datastore or an ISO datastore (do not use the default of /var/tmp)
- Sniffer Traces
  - Depending on I/O, roll-over may be too fast
  - Tweak the sniffer roll-over (i.e. using the -W [int] option of tcpdump-uw)
  - Decrease the number of VMs on the host
  - Choose a local VMFS datastore or a remote ISO datastore instead of visorfs
- Array Support Logs
  - Must be captured in a timely fashion following the issue
  - In our case, VNX logs were rolling over after 16 hours
  - Collection should be run as su - or you could miss data
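The "rotate every 10 minutes" item above can be sketched as a small wrapper. This is a hypothetical helper (the function name and structure are mine, not a VMware tool) that runs any capture command a fixed number of times, so it stops on its own instead of needing a manual kill:

```shell
#!/bin/sh
# Hypothetical rotation helper: run a capture command <runs> times,
# pausing <pause_sec> between runs. Each run covers one window, e.g.
# a 10 minute vm-support performance snapshot.
#   rotate_capture <runs> <pause_sec> <command...>
rotate_capture() {
  runs=$1; shift
  pause=$1; shift
  i=1
  while [ "$i" -le "$runs" ]; do
    echo "run $i of $runs: $*"
    "$@" || return 1      # stop the loop if the capture command fails
    if [ "$i" -lt "$runs" ]; then
      sleep "$pause"
    fi
    i=$((i + 1))
  done
}
```

On an ESX host you might then call something like `rotate_capture 6 0 vm-support -p -d 600 -w /vmfs/volumes/ISO/apd-debug/` to cover an hour in six 10-minute windows (the datastore path is illustrative).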
We went over some basic tactics for capturing forensics during loss of network storage. You now understand that your timings and space consumption may vary widely across different workloads.
If you are experiencing APDs, you should upgrade to 6.0+. The VMCP features in vSphere 6.0 are excellent and worth checking out (granular reactions to APD_START and APD_TIMEOUT). Also, make sure you are using VMware Log Insight.
Use my Invoke-ApdCheck.ps1 function to health check your host and gather ESX logs.