Tag Archives: PSOD

Stress Testing an ESXi Host – CPU and MCE Debugging

I have needed to stress test a component inside a physical server – this time it was CPU and I’d like to share my method here. I have done a Memory Stress Test using a Windows VM in a previous article. I will be using a Windows VM again, but this time it will be Windows  Server 2012 Standard Edition that can handle up to 4TB Memory and up to 64 Sockets containing 640 logical processors – a very nice bump from Windows Server 2008 R2 Standard that had a Compute configuration maximum of 4 sockets and 32 GB RAM.

The host has crashed several times into a PSOD with Uncorrectable Machine Check Errors. From the start I had a hunch that the second Physical CPU or a System Board are faulty – but these were replaced already and the host has crashed yet again. I have taken a closer look at the matter and went to stress thest this ill host’s CPUs. Continue reading

PSOD Caused by a Machine Check Error (MCE)

Today I’d like to present to you an ESXi host crash we had in our environment tha was due to a hardware failure. This time, we were “lucky” enough to capture its PSOD. In earlier article about Machine Check Errors, I was talking about what exactly do they mean and how to debug them. Also, most of the time, when these are correctable Machine Check Errors, the host only reboots itself without leaving any trace as of why. That I have investigated by determining faulty memory after running a custom memory stress test on an ESXi host.

The Uncorrectable Machine Check Exception presented below is caused by “Other TransBus Generic Error” – this could have been related either to a CPU, or pathways on the motherboard… or both. Most of VMkernel dumps was pointing out to 2nd Physical CPU, but there were some occurrences on 1st CPU as well. Even the AHS log from the HP blade server was corrupted each time I tried to send it to a technician. Therefore they took action and replaced both the motherboard and CPUs. Since then there were no more trouble with this host.

PSOD due to Uncorrectable MCE on CPU

Manual Debugging:

For those of you who are interested – the MCE codes reported were:

In iLO: FA001E8000020E0F
in vmkernel.log: c800008000310e0f ; 8800004000310e0f

Now, if we decode the message we got from iLO manually (so that we have another source of MCE to decode from):

1 1 1 1 1 0 1 0 0 00 0000000011110100 0 0000 0000000000000010 0000 1110 0000 1111

UC 1
PCC 1
S 0
AR 0

Signaling:
Uncorrected error (UC). RESET THE SYSTEM

Examples? None found.

Compound error code found: Bus and Interconnect Errors.

BUS LL PP RRRR II T
BUS{11}_{11}_{0000}_{11}_{0}_ERR

Level: 11, generic
Request: 0000, Generic

Bus & Interconnect mnemonics:
Participation: 11, Generic

Therefore: Generic Bus and Interconnect Error

Here you see VMkernel is pretty good at decoding the MCEs by itself, but it can also be very useful to see for yourself what the real cause was if your error decode is missing.