Today I’d like to present an ESXi host crash we had in our environment that was due to a hardware failure. This time, we were “lucky” enough to capture its PSOD. In an earlier article about Machine Check Errors, I talked about what exactly they mean and how to debug them. Most of the time, when these are correctable Machine Check Errors, the host simply reboots itself without leaving any trace of why. I investigated one such case by identifying faulty memory after running a custom memory stress test on an ESXi host.
The Uncorrectable Machine Check Exception presented below was caused by an “Other TransBus Generic Error” – this could have been related to a CPU, to pathways on the motherboard… or both. Most of the VMkernel dumps pointed to the 2nd physical CPU, but there were some occurrences on the 1st CPU as well. Even the AHS log from the HP blade server was corrupted each time I tried to send it to a technician. Therefore they took action and replaced both the motherboard and the CPUs. Since then there has been no more trouble with this host.
Manual Debugging:
For those of you who are interested – the MCE codes reported were:
In iLO: FA001E8000020E0F
In vmkernel.log: c800008000310e0f ; 8800004000310e0f
Now, if we decode the message we got from iLO manually (so that we have another source of MCE to decode from):
1 1 1 1 1 0 1 0 0 00 0000000011110100 0 0000 0000000000000010 0000 1110 0000 1111
UC 1
PCC 1
S 0
AR 0
Signaling:
Uncorrected error (UC). RESET THE SYSTEM
Examples? None found.
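The flag extraction above can also be scripted. Below is a minimal sketch in Python, assuming the architectural IA32_MCi_STATUS bit layout from the Intel SDM (VAL at bit 63 down to AR at bit 55, the model-specific code in bits 31:16 and the MCA error code in bits 15:0). The names `STATUS_FLAGS` and `decode_status` are my own for illustration and do not come from VMkernel or iLO.

```python
# Assumed bit positions of the architectural IA32_MCi_STATUS flags
# (Intel SDM layout). S and AR only carry meaning on CPUs that support
# software error recovery.
STATUS_FLAGS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MISC register holds valid data
    "ADDRV": 58,  # ADDR register holds valid data
    "PCC": 57,    # processor context corrupt
    "S": 56,      # signaled via machine check exception
    "AR": 55,     # action required
}


def decode_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into flags and error codes."""
    fields = {name: (status >> bit) & 1 for name, bit in STATUS_FLAGS.items()}
    fields["model_specific_code"] = (status >> 16) & 0xFFFF
    fields["mca_error_code"] = status & 0xFFFF
    return fields


if __name__ == "__main__":
    # The value reported by iLO in this article.
    for name, value in decode_status(0xFA001E8000020E0F).items():
        print(f"{name:>20}: {value:#x}")
```

Feeding it the two values from vmkernel.log yields the same low 16 bits (0x0e0f), which is why all three codes decode to the same kind of error even though their flag bits differ.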
Compound error code found: Bus and Interconnect Errors.
BUS LL PP RRRR II T
BUS{11}_{11}_{0000}_{11}_{0}_ERR
Bus & Interconnect mnemonics:
Level: 11, Generic
Participation: 11, Generic
Request: 0000, Generic
Therefore: Generic Bus and Interconnect Error
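The compound error code breakdown above can be sketched in Python as well. This assumes the Intel SDM "Bus and Interconnect Errors" encoding `0000 1PPT RRRR IILL` for the low 16 bits of the status value. The lookup tables reflect my reading of the SDM mnemonic tables and should be checked against the manual for your CPU, and `decode_bus_error` is a hypothetical helper name.

```python
# Assumed mnemonic tables for the compound code 0000 1PPT RRRR IILL.
LEVEL = {0b00: "L0", 0b01: "L1", 0b10: "L2", 0b11: "generic"}
PARTICIPATION = {0b00: "SRC", 0b01: "RES", 0b10: "OBS", 0b11: "generic"}
REQUEST = {
    0b0000: "ERR (generic)", 0b0001: "RD", 0b0010: "WR", 0b0011: "DRD",
    0b0100: "DWR", 0b0101: "IRD", 0b0110: "PREFETCH", 0b0111: "EVICT",
    0b1000: "SNOOP",
}
MEM_IO = {0b00: "memory", 0b01: "reserved", 0b10: "I/O", 0b11: "other"}


def decode_bus_error(status: int) -> dict:
    """Decode the low 16 bits of MCi_STATUS as a bus/interconnect error."""
    code = status & 0xFFFF
    # Bits 15:12 must be 0000 and bit 11 must be 1 for this error class.
    if (code & 0xF800) != 0x0800:
        raise ValueError(f"{code:#06x} is not a bus/interconnect error code")
    return {
        "level": LEVEL[code & 0b11],                       # LL, bits 1:0
        "mem_io": MEM_IO[(code >> 2) & 0b11],              # II, bits 3:2
        "request": REQUEST.get((code >> 4) & 0b1111, "unknown"),  # RRRR
        "timeout": bool((code >> 8) & 1),                  # T, bit 8
        "participation": PARTICIPATION[(code >> 9) & 0b11],  # PP, bits 10:9
    }


if __name__ == "__main__":
    print(decode_bus_error(0xFA001E8000020E0F))
```

For the iLO value every field comes out as generic (with "other" for the memory or I/O field and no timeout), which agrees with the manual decode.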
As you can see, VMkernel is pretty good at decoding MCEs by itself, but it is also useful to be able to work out the real cause yourself when the decoded error is missing.