Tag Archives: VMware ESXi

VMware Workstation-Powered Lab Part 2: Installing ESXi

In the previous part of this article, I showed you where to download the packages required for the upcoming lab. In this part I will guide you through a basic VMware ESXi installation on VMware Workstation. For this you will need VMware Workstation installed on your system, along with the hypervisor ISO. Without further ado, let’s begin.

Continue reading

Stress Testing an ESXi Host – CPU and MCE Debugging

I needed to stress test a component inside a physical server – this time the CPU – and I’d like to share my method here. I did a memory stress test using a Windows VM in a previous article. I will be using a Windows VM again, but this time it will be Windows Server 2012 Standard Edition, which can handle up to 4 TB of memory and up to 64 sockets with 640 logical processors – a very nice bump from Windows Server 2008 R2 Standard, which maxed out at 4 sockets and 32 GB of RAM.

The host had crashed several times into a PSOD with Uncorrectable Machine Check Errors. From the start I had a hunch that the second physical CPU or the system board was faulty – but these had already been replaced and the host crashed yet again. I took a closer look at the matter and went on to stress test this ill host’s CPUs. Continue reading

HyperThreading: What is it and does it benefit ESXi?

I often come across the question of HyperThreading and its benefits – in personal computing, but more importantly over the last few years, in virtualization. I’d like to talk about what HyperThreading is for a moment, and show you whether it benefits a virtualized environment.

What is HyperThreading?

Today, HyperThreading (HT) technology is present on almost every Intel processor, be it a Xeon or a Core i3/i5/i7 series part. Basically, it splits one physical core into two logical cores, but the term “splitting” is somewhat inaccurate and confuses many consumers into thinking that when they run a 2.5 GHz 4-core HyperThreaded CPU, they immediately have 8 effective cores carrying a full processing capability of 20 GHz. Mainly because when you say you split something, you think it has been divided into two equal parts (or at least that’s what I think, anyway). Continue reading
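If you want to see the difference in numbers on your own host, here is a minimal PowerCLI sketch (my own addition, not part of the original post) that shows physical cores versus the logical CPUs ESXi exposes when HT is enabled. It assumes an active Connect-VIServer session, and the host name is a placeholder:

# Quick check of physical cores vs. logical CPUs on a host (sketch only).
$vmhost = Get-VMHost -Name "esx01.yourdomain.lab"
$cpu = $vmhost.ExtensionData.Hardware.CpuInfo

[pscustomobject]@{
    Host          = $vmhost.Name
    Sockets       = $cpu.NumCpuPackages    # physical CPU packages
    PhysicalCores = $cpu.NumCpuCores       # real cores
    LogicalCPUs   = $cpu.NumCpuThreads     # doubles when HyperThreading is active
    HTActive      = $vmhost.ExtensionData.Config.HyperThread.Active
}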

1GbE Intel NIC Throttled to 100Mbit By SmartSpeed

We had a case on one of our ESXi hosts equipped with an Intel 82571EB Gigabit Ethernet Controller – although it was a 1 Gbit NIC, autonegotiation would not go higher than 100 Mbit. When the speed was set manually to 1 Gbit, the NIC disconnected itself from the network. Every other setting worked – 10 Mbit and 100 Mbit, both half and full duplex. We investigated with our Network Team and tried forcing 1 Gbit on the switch side, which also brought the NIC down.

I delved deeper into this issue and watched the VMkernel log via tail -f while I forcibly disconnected the NIC and reconnected it again via esxcli. A few lines appeared that caught my attention:

vmnic6 NIC Link is Up 100 Mbps Full Duplex, Flow Control: None
e1000e 0000:07:00.0: vmnic6: Link Speed was downgraded by SmartSpeed
e1000e 0000:07:00.0: vmnic6: 10/100 speed: disabling TSO

I immediately picked up on SmartSpeed and tried to find a way to disable it – that is, until I found out from many discussion threads that SmartSpeed is an intelligent throttling mechanism meant to keep the connection running at a reduced link speed when an error is detected somewhere on the link path. The switches were working fine and the NIC didn’t detect any errors, so the next thing to check was the cabling.

I arranged a cable check with the Data Center operators and what do you know – replacing the cables with brand new ones eventually solved the issue! Sometimes the failing component causing you a headache for a good few hours can be a “mundane” piece of equipment such as patch cables.
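For reference, the same link-speed checks and forcing we did via esxcli and on the switch can also be driven from PowerCLI. A minimal sketch, my own addition: it assumes a connected session, the host name is a placeholder, and vmnic6 is the NIC from the log above:

# Look at the current negotiated speed/duplex of the affected uplink (sketch).
$vmhost = Get-VMHost -Name "esx01.yourdomain.lab"
$pnic = Get-VMHostNetworkAdapter -VMHost $vmhost -Physical -Name "vmnic6"
$pnic | Select-Object Name, BitRatePerSec, FullDuplex

# Force 1 Gbit full duplex (this is what kept bringing the link down in our case)
Set-VMHostNetworkAdapter -PhysicalNic $pnic -BitRatePerSecMb 1000 -Duplex "Full" -Confirm:$false

# Revert to autonegotiation once the cabling has been sorted out
Set-VMHostNetworkAdapter -PhysicalNic $pnic -AutoNegotiate -Confirm:$false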

How much memory should be free for VMkernel?

Recently I did some research to see how much free RAM the VMkernel needs in order to work without any hiccups, accounting for:

  • Memory Reclamation Techniques
  • Memory Reservation for the VMkernel itself

I gathered this data from a live environment. However, one very important metric is not included in the measurements and graph below: the virtual machine memory overhead, which is individual for each environment and depends on the VMs’ memory and vCPU configuration.

A quick explanation:

  • RAM [GB]: How many gigabytes of RAM are installed in the server.
  • VMKernel [MB]: How many MB are reserved for the VMkernel itself (you can find this value in Configuration -> System Resources tab).
  • Reclamation [MB]: Calculated with the memory reclamation formula (900 MB + 1% of memory above 24 GB); see the sketch below the table.
  • Total [MB]: Sum of the VMKernel and Reclamation values. This should be the governing baseline value.
  • Free [%]: What percentage of the server’s total memory should be kept free.
RAM [GB]   VMKernel [MB]   Reclamation [MB]   Total [MB]   Free [%]
8          1393            900.0              2293.0       28.0
16         1749            900.0              2649.0       16.2
24         2378            900.0              3278.0       13.3
48         2682            1145.8             3827.8       7.8
64         2745            1309.6             4054.6       6.2
96         3514.5          1637.3             5151.8       5.2
128        3612            1965.0             5577.0       4.3
192        5218.5          2620.3             7838.8       4.0
384        8220            4586.4             12806.4      3.3
512        9985            5897.1             15882.1      3.0
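For convenience, the formula behind the Reclamation, Total and Free columns can be wrapped into a small PowerShell function. This is just a sketch of the calculation above (the function name is my own); the VMkernel reservation itself still has to be read from the host’s System Resources tab, it is not computed:

# Reclamation = 900 MB + 1% of installed memory above 24 GB, added to the
# VMkernel reservation read from the host.
function Get-FreeMemoryBaseline {
    param(
        [int]$RamGB,          # installed memory in GB
        [double]$VmkernelMB   # "System Resources" reservation read from the host
    )
    $reclamationMB = 900 + [math]::Max(0, ($RamGB - 24) * 1024) * 0.01
    $totalMB = $VmkernelMB + $reclamationMB
    [pscustomobject]@{
        'RAM [GB]'         = $RamGB
        'VMkernel [MB]'    = $VmkernelMB
        'Reclamation [MB]' = [math]::Round($reclamationMB, 1)
        'Total [MB]'       = [math]::Round($totalMB, 1)
        'Free [%]'         = [math]::Round($totalMB / ($RamGB * 1024) * 100, 1)
    }
}

# Example: the 48 GB row from the table
Get-FreeMemoryBaseline -RamGB 48 -VmkernelMB 2682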

And a graph is below:

Memory Reservation Graph: installed memory (GB) vs. reserved memory (MB).

I hope this table comes in useful when deciding how much RAM your hosts actually have available to give to virtual machines.

Online ESXi Firmware and Driver Upgrade on HP Servers

When upgrading firmware and drivers on a large number of servers, it used to be time-consuming to reboot each and every ESXi host and perform the firmware upgrade offline to match the standard. Not anymore – since Service Pack for ProLiant 2014.09.0, the NIC firmware can be upgraded online as well, starting with its 10.x release (a bump from the 3.x and 4.x versions, which now share a unified firmware). A huge step forward – now all the applicable firmware can be upgraded in one go, and online! No need to wait to catch the boot menu and go through HP Smart Update Manager on each host individually.

Here’s a step by step walkthrough:

  1. Download the HP Service Pack for ProLiant you wish to apply. You will need an HP account with a server under warranty linked to it in order to download the newest releases.
  2. Stage the .iso file on a server that has a good connection to all the ESXi hosts you plan to upgrade (preferably a terminal server inside the data center) and unpack it to a location of your liking.
  3. Run \\spplocation\hp\swpackages\x64\hpsum_bin_x64.exe – the binary to use depends on your OS flavor.
  4. A console window will pop up stating that the HP SUM Web Service has been launched, and the default web browser will open on the machine at localhost:63001, automatically logging you in by passing through your credentials. You can also connect to the terminal server from any other computer that can reach ports 63001 or 63002 (which is more comfortable). I strongly suggest using Google Chrome.
  5. The web interface then loads.
  6. Start by clicking on the drop-down arrow in the top left corner and select Baseline Library.
  7. You will need to manually initiate the inventory process for the selected baseline, so click on the already present one for the process to begin. After a few minutes, the inventory completes.
  8. Now we need to add our ESXi hosts, so select VM hosts from the drop-down menu.
  9. Localhost is added automatically and unfortunately can’t be changed. Click on Add Node.
  10. You can either add a single node by its FQDN or a range of IP addresses separated by a dash. You need to specify the type of device you are adding and the package that is your baseline. Don’t forget to put in the root credentials, otherwise the initialization will fail.
  11. If you need to select specific nodes inside a range, the second entry under “Select the type of add” has just what you need. You enter the range, and after a scan you select the nodes you want. Shift+Click and Ctrl+Click work here like a charm.
  12. After you have added the nodes via the “Node Range” method, select the baseline to apply to them and enter the root credentials.
  13. Once the hosts have been added successfully, you can select multiple hosts with Shift+Click or Ctrl+Click and the right-hand frame will change to a multiple-selection view.
  14. Here you will need to select the baseline again by clicking on Select Baselines.
  15. Select the SPP and click on Add.
  16. Back in the multi-select frame, enter the root credentials in order to scan the hosts.
  17. You will see the inventory progress.
  18. Once HP SUM evaluates that an update is needed, input the root credentials again and deploy the components.
  19. You have reached the familiar deploy screen where you choose the components to upgrade. When you click Deploy, it will initialize and you will see a gray wheel spinning beside the chosen hosts.

When the deployment is complete, you will see a green light next to the hosts you applied updates to, and the updates will take effect on the next reboot – which is ideal in combination with VMware Update Manager, letting you apply patches and firmware in one take.
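If you also want to script the reboot part, a per-host restart can be driven from PowerCLI. A rough sketch of my own (the host name is a placeholder, and it assumes DRS takes care of evacuating the VMs when the host enters maintenance mode):

# Put the host into maintenance mode and reboot it so the staged firmware is applied (sketch).
$vmhost = Get-VMHost -Name "esx01.yourdomain.lab"
Set-VMHost -VMHost $vmhost -State Maintenance | Out-Null
Restart-VMHost -VMHost $vmhost -Confirm:$false | Out-Null

# Wait until the host reconnects in maintenance mode, then bring it back into the cluster
do { Start-Sleep -Seconds 60 } until ((Get-VMHost -Name $vmhost.Name).ConnectionState -eq "Maintenance")
Set-VMHost -VMHost $vmhost -State Connected | Out-Null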

PSOD Caused by a Machine Check Error (MCE)

Today I’d like to present an ESXi host crash we had in our environment that was due to a hardware failure. This time, we were “lucky” enough to capture its PSOD. In an earlier article about Machine Check Errors, I talked about what exactly they mean and how to debug them. Most of the time, when these are correctable Machine Check Errors, the host simply reboots itself without leaving any trace of why; in one such case I tracked the cause down to faulty memory after running a custom memory stress test on the ESXi host.

The Uncorrectable Machine Check Exception presented below was caused by an “Other TransBus Generic Error” – this could have been related to a CPU, to pathways on the motherboard, or both. Most of the VMkernel dumps pointed to the second physical CPU, but there were some occurrences on the first CPU as well. Even the AHS log from the HP blade server came out corrupted each time I tried to send it to a technician. The technicians therefore replaced both the motherboard and the CPUs, and since then there has been no more trouble with this host.

PSOD due to Uncorrectable MCE on CPU

Manual Debugging:

For those of you who are interested – the MCE codes reported were:

In iLO: FA001E8000020E0F
In vmkernel.log: c800008000310e0f ; 8800004000310e0f

Now, let’s decode the value we got from iLO manually (so that we have a second, independent source of MCE to decode from):

1 1 1 1 1 0 1 0 0 00 0000000011110100 0 0000 0000000000000010 0000 1110 0000 1111

UC 1
PCC 1
S 0
AR 0

Signaling:
Uncorrected error (UC). RESET THE SYSTEM

Examples? None found.

Compound error code found: Bus and Interconnect Errors.

Format: BUS{LL}_{PP}_{RRRR}_{II}_{T}_ERR
Value:  BUS{11}_{11}_{0000}_{11}_{0}_ERR

Bus & Interconnect mnemonics:
Level (LL): 11, generic
Participation (PP): 11, generic
Request (RRRR): 0000, generic

Therefore: Generic Bus and Interconnect Error

As you can see, the VMkernel is pretty good at decoding MCEs by itself, but it can also be very useful to decode them yourself when the automatic error decode is missing.
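If you want to script the first part of that manual decode, here is a minimal PowerShell sketch – my own addition, based on the standard Intel IA32_MCi_STATUS bit layout rather than any particular decode tool – that expands the raw iLO value into its bit pattern and reads the key flags:

# Expand the raw MCE status into its 64-bit pattern and read the key flags (sketch).
$raw  = "FA001E8000020E0F"                                       # MCE status reported by iLO
$bits = [Convert]::ToString([Convert]::ToInt64($raw, 16), 2).PadLeft(64, '0')

# Bit N of IA32_MCi_STATUS is character (63 - N) of the string
[ordered]@{
    'VAL (63)' = $bits[63 - 63]   # status register valid
    'UC  (61)' = $bits[63 - 61]   # uncorrected error
    'PCC (57)' = $bits[63 - 57]   # processor context corrupt
    'S   (56)' = $bits[63 - 56]   # signaled
    'AR  (55)' = $bits[63 - 55]   # action required
}

# The compound MCA error code is the low 16 bits (here 0000111000001111 = 0x0E0F)
'MCA error code: 0x{0:X4}' -f [Convert]::ToInt32($bits.Substring(48), 2)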

Host IPMI Event Status Alarm PowerCLI Fix

From time to time we are hit by a (supposedly firmware) bug that occasionally affects our ESXi hosts in vCenter, mainly HP BL460c blade servers. You get an alarm on IPMI Event Log Status, Other Hardware Objects, or Hardware Temperature Status, but everything appears okay on the Hardware Status screen and the IPMI log is clear (or claims 65535 entries are present when they aren’t). The information I gathered pointed me towards resetting the CIM service and the hardware monitoring agents.

This handy PowerCLI script does basically everything described above, but without the hassle of connecting to each host manually. It’s a bit modular, so take a look at the code first if you understand the mnemonics; if not, just run it, enter your vCenter and ESXi host name (don’t forget to check the $hosts value – it should contain your domain name!) and wait for a moment.

Write-Host "Reset sensors, refresh HW data & their views on an ESXi host" `n -ForegroundColor Green

<# Uncomment this to enable connection to vCenters defined in a text file

$vcenterlist = Get-Content .\vcenters.txt

ForEach ($vcenter in $vcenterlist) {
    # Connect to each vCenter from the list
    Write-Host `n"Connecting to vCenter $vcenter"
    Connect-VIServer $vcenter | Out-Null
}
#>

# Define a blank array for the hosts
$hosts = @()

# Input-checking loop: prompt for a vCenter if no list was loaded above.
if ($vcenterlist -eq $null) {
    do {
        [Boolean]$match = $false
        $vcenter = Read-Host "Define a vCenter where the host is located"
        $vcenter = $vcenter.Replace(' ','')   # strip accidental spaces
        if ($vcenter -notlike "") {
            $match = $true
        }
        else {
            Write-Host "The value must not be null. Try again or CTRL+C to break.`n" -ForegroundColor Red
        }
    } until ($match -eq $true)
}

# ESXi host definition
$hostName = Read-Host "Enter the name of the ESXi host where you want to reset the HW sensors"
# Generate the FQDN and store it in the array
$hosts += "${hostName}.yourdomain.lab"

# Connect to vCenter (either the list from the file or the one entered above)
if ($vcenterlist -ne $null) {
    ForEach ($vcenter in $vcenterlist) {
        Write-Host `n"Connecting to vCenter $vcenter..."
        Connect-VIServer $vcenter | Out-Null
    }
}
else {
    Write-Host `n"Connecting to vCenter $vcenter..."
    Connect-VIServer $vcenter | Out-Null
}

# The VMhost needs to be stored into an array with Get-VMhost for further processing
$vmhosts = Get-VMHost -Name $hosts

# Get all vmhosts for the connected vCenter sessions
#$vmhosts = Get-VMHost

ForEach ($vmhost in $vmhosts)
{
	Try
	{
		#initialize calls for refreshing hardware status..
		Write-Host "Restarting CIM Server service on $vmhost"
		Get-VMHost $vmhost | Get-VMHostService | Where { $_.Key -eq "sfcbd-watchdog" } | Restart-VMHostService -Confirm:$false | Out-Null
		Start-Sleep -Seconds 15

		Write-Host "Starting to refresh HW info on $vmhost (this can take a while)"

		# Define variables for system calls
		$hv = Get-View $vmhost
		$hss = get-view $hv.ConfigManager.HealthStatusSystem

		Write-Host "Resetting HW Sensors..."
		$hss.ResetSystemHealthInfo()
		Start-Sleep -Seconds 15

		Write-Host "Refreshing Data..."
		$hss.RefreshHealthStatusSystem()
		Start-Sleep -Seconds 15

		Write-Host "Refreshing Data View..."
		$hss.UpdateViewData()
		Start-Sleep -Seconds 15
	}
	Catch [System.Exception]
	{
		Write-Host "There was an error while trying to refresh the hardware data." `n `
					"Please check the ESXi host's Hardware Status Tab." -ForegroundColor 'Red'
	}
	Finally
	{
		Write-Host "Disconnecting from the vCenter Server..."
		Disconnect-VIServer $vcenter -Confirm:$false
		Write-Host "Done processing $vmhost." -ForegroundColor Green
	}
}

I hope this has alleviated at least one occurrence of this bug 🙂

What is Virtualization?

Since I’d also like to make this blog more accessible to people of different skillsets, here is a short introduction to virtualization, a technology that has advanced rapidly over the last few years.

Physical Servers

Servers are machines that, under ideal conditions, operate 24 hours a day, 7 days a week, 365 days a year. For stability and compatibility purposes, they often run only one application or service to achieve the highest possible uptime. However, we know that maintaining this level of service is an exceptional task that is very hard to reach. In an age where multi-core and multi-threaded processors are prominent, having a quad-core processor just sitting there as a file server or a simple application server would be a waste of resources. Here’s a short list.

Server operation cost:

  • Electricity that the servers consume and energy used for cooling the server room.
  • Actual physical space where 1 server = 1 slot in a rack.
  • Time and money if anything decides to break down at any time, requiring a technician’s visit to replace a faulty motherboard or a memory module that either crashed the server completely or caused an avalanche of BSODs (or any other kernel panic colors).
  • Money for space rental (if you don’t have your own data center).

Furthermore:

  • In most cases, the physical hardware is loaded to only a small percentage of its capacity when running.
  • Storage may not be easily expanded.

This is remedied with one very effective solution:

Virtualization

This technology enables multiple virtual machines running various operating systems to share one physical server under a so-called hypervisor. Depending on the server’s hardware capabilities and the applications’ needs, one physical server can run a few resource-demanding machines, or many lightweight ones.

The hypervisor is a “barebone OS” (in the case of VMware, a POSIX-like system with a Unix-like command line thanks to BusyBox) installed on bare metal. Once installed, properly configured and managed, it handles all the virtual machines’ requests for CPU, memory and I/O intelligently, scheduling resources when and where needed. In essence, it intercepts the instructions and I/O requests the virtual machines generate and maps them onto the physical hardware. Think of it as a robot that catches requests from many sources and sorts them onto a single output path.

A virtual machine is a set of files stored in a folder on a datastore, typically using a special file system called VMFS (Virtual Machine File System). These files contain the VM configuration, the BIOS NVRAM, the virtual machine disks, the virtual machine swap file, snapshot deltas and more.

Virtualization benefits:

  • One physical server runs multiple virtual machines – the hardware is used to its full potential if scaled well.
  • The hypervisor (an ESXi host) can run from local storage, a USB drive, an SD card, or directly from memory (with the Enterprise edition of VMware).
  • VMs can communicate with each other and with physical NICs by means of virtual network switches.
  • VMs on shared storage (FCoE / iSCSI / NFS) with clustered ESXi hosts can use High Availability or Fault Tolerance (VM mirroring) in case of failures.
  • VMs can be dynamically configured with how much memory and/or CPU resources they may use.
  • Devices attached to the computer you run the management console from (the vSphere Client) can be passed through to the VMs.
  • VMs can be moved from one host to another without the OS ever noticing a hardware change.

And the eventual outcome? Virtualization helps save resources: power, space, and time – above all the downtime that a failure of a single physical host, and the application running on it, would otherwise cause.

DELL Perc H710P Local Storage SSD RAID1 Benchmark

Recently we equipped one of our ESXi hosts with local SSD storage (Product Number: LB806M) to host a database VM; for redundancy we chose RAID1. I ran a small benchmark to compare it to the already present RAID10 array of 4x 1.2TB 10k RPM drives (PN: ST1200MM0007).

The RAID controller serving the drives was a DELL Perc H710P Mini (dual processor, 1GB DDR3 NV cache). I used the IOmeter application with an access specification file from my favorite tech-news site, TechPowerUp, and ran the test on a 1GB chunk of data. Without further ado, here are the results (click on an image to enlarge):

Throughput Benchmark

IOPs Benchmark

Latency Benchmark

Also, I captured a few interesting screenshots from esxtop over the course of the benchmarks. Notice that the controller doesn’t even break a sweat under that many IOPS:

Installing the Windows VM for benchmarking

Database benchmark on the mechanical hard drive and the SSD running simultaneously

Nice IOPS 🙂

Sequential read benchmark – the controller cache comes into play

Hope you enjoyed the numbers. See you around.