Tag Archives: VMware

Updating a UCS system – from FI to OS drivers

Update October 27th 2015
At the time of writing this article, I mentioned my experience upgrading from version 2.2(3a), and the unexpected reboots of “some” blades. It turns out this bug has been identified and fixed since 2.2(3g). I forgot to update this article as promised : thanks to Patrick for asking about it in the comments. 🙂

First, to narrow the scope : the problem affected only UCS B200 M4 blades. It was not obvious to me, as the deployment was a greenfield with only B200 M4 blades. It’s logged under bug number CSCut61527.

What is it about? The B200 M4’s BMC (Baseboard Management Controller, part of the IPMI/CIMC system) sometimes returns an invalid FRU and makes the blade reboot … Yeah, you read it right : a management controller potentially taking down production workloads …

This caveat has been around since 2.2(3a) and 2.2(1b), and was first fixed in release 2.2(3g). Here is the link to the proper section of the Release Notes for UCSM 2.2. Once there, just look for CSCut61527 in the 10th row of the table.


Lesson learned : always double-check your current UCSM version before adding B200 M4 blades if it’s not a greenfield deployment!

There is plenty of writing about “how to upgrade UCS” (from official Cisco documentation to independent blog posts), but I found none going all the way from UCSM to the ESXi drivers (disclaimer : I looked for less than 5 minutes :-).

So here is my 2 cents on the matter.

What do I need to update my UCS system?

The detailed list of objects you need to upgrade is the following, from top to bottom :

  1. UCSM itself, which is a cluster management software running in Active/Passive mode on Fabric Interconnects,
  2. The Fabric Interconnects,
  3. The IO Modules, aka FEXs,
  4. Servers (either Blade or Rack format), which can be separated into three major sections :
    1. BIOS,
    2. Controllers (SAS, CIMC),
    3. Adapters cards,
  5. Drivers, specific to your Operating System.

This document is old and some information may be outdated, but it still describes the “What” quite well : Cisco UCS Firmware Versioning.

Where should I look for the software pieces?

It is rare enough for Cisco products to be worth mentioning : you don’t need a system linked to your CCO ID to be able to download UCS-related software 🙂

Fortunately, all the pieces of software listed in the previous section are grouped into bundles and you no longer have to download each package separately :

  • Infrastructure Bundle : it contains the UCSM, FI and FEX software/firmware,
  • B-Series or C-Series Bundle : it contains the BIOS, controller and adapter card firmware,
  • An ISO with all the C-Series or B-Series drivers.

Note : If you are looking for drivers for a particular Operating System, instead of downloading 2GB of drivers it may be better to look for the Cisco UCS drivers on the OS vendor’s site. For example, if you are looking for the latest Cisco UCS enic and fnic drivers for VMware, you can find them on vmware.com. It’s a 2MB download versus 2GB …

Updating the UCS system

In this section, I will not go for a screen-by-screen explanation but will rather explain the key steps and possible warnings you need to be aware of before starting the upgrade.

First, the documentation you should definitely check :

At the time of writing this article, with the current version being 2.2(3e), the recommended upgrade path is Top-to-Bottom, and it’s generally the way to go. Yet some earlier versions (1.4 if I am correct) required Bottom-to-Top.

It’s really unlikely that this would change back again, but you should definitely check the documentation and the latest release notes to know what the current supported method is. Here is the Upgrading Cisco UCS from Release 2.1 to Release 2.2 document.

This doodle illustrates the updated parts and the actual order to follow.

Update UCS

Step 0 is about preparation. You need to upload the firmware packages to the Fabric Interconnect boot flash (the packages are copied to both fabric interconnects).
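If you prefer the UCSM CLI over the GUI for this staging step, the upload can be done along these lines (a sketch only : the SCP server, path and bundle file names below are placeholders for your own environment) :

UCS-A# scope firmware
UCS-A /firmware # download image scp://admin@10.0.0.10/images/ucs-k9-bundle-infra.2.2.3g.A.bin
UCS-A /firmware # download image scp://admin@10.0.0.10/images/ucs-k9-bundle-b-series.2.2.3g.B.bin
UCS-A /firmware # show download-task

The download-task output lets you follow the transfer; once both bundles show as downloaded, they are available on both fabric interconnects.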

  1. Upgrade the UCSM software. It’s supposed to be non-disruptive for the data path and you should only have to relaunch the UCSM client. My recent experience when upgrading from 2.2(3a) to 2.2(3d) was catastrophic : some blades rebooted randomly 4-5 times … Not so “non-disruptive”. I managed to reproduce the same behavior on another system and an SR is currently open. I may update this post later depending on the SR’s outcome.
  2. Stage the firmware (~10-20min) on all FEXs (“Update Firmware” under Equipment > Firmware Management > Installed Firmware) and set it to be activated on next reboot (“Activate Firmware” without forgetting the related “active on next reboot” checkbox). This will save you a reboot, as the FEX will reboot anyway when the Fabric Interconnect is upgraded,
  3. Upgrade the Fabric Interconnect holding the secondary role, wait for the reboot (~15min), then change the cluster lead to make the newly updated FI primary (see the CLI sketch after this list),
  4. Upgrade the remaining Fabric Interconnect and wait for the reboot (~15min), then move the cluster lead back to its initial state (there is no automatic fail-back for UCSM),
  5. Update the blades : best way is through Maintenance and Firmware Policies,
    1. Be sure that your service profile is set to “User Ack” for the maintenance policy,
    2. For ESXi nodes, put them in maintenance mode first from your vSphere Client,
    3. Ack the reboot request in UCSM once your ESXi nodes are in maintenance mode.
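For steps 3 and 4, here is a minimal sketch of the cluster lead handling from the FI CLI, assuming fabric interconnect B was upgraded first and you want to make it primary (run this from the current primary) :

UCS-A# connect local-mgmt
UCS-A(local-mgmt)# show cluster state
UCS-A(local-mgmt)# cluster lead b

Wait for show cluster state to report the cluster as HA READY before moving the lead or rebooting the other FI.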

Note : you can edit the default “Host Firmware Package” policy to use the right package version (for blade and rack), even without any service profile created. This way, any UCS server connected to the fabric will be automatically updated to the desired baseline. This will effectively prevent running different firmware due to different shipping/buying batches.

Most upgrade guides stop here, right after updating the hardware. Let me say that it’s the golden path to #fail 🙂. The next part is about updating your ESXi drivers to the most current version supported by your UCS firmware release.

 Updating VMware ESXi drivers

At the end of the day, what matters is how your Operating System handles your hardware. That is the driver’s job. If it’s obsolete, either it works unoptimized and without the “new features/enhancements” (that’s the best case) or it may lead to some unpredictable behavior …

Bets are high that you installed ESXi on your UCS server using a Custom ISO, available at vmware.com. Bets are even higher that, since the vETH and vHBA exposed by the VIC card are recognized, nobody has bothered to update the drivers since. If so, you are running with a 2-3 year old driver …

You can check your current enic (vETH) and fnic (vHBA) driver version on your ESXi host with the following commands :

#vmkload_mod -s enic
#vmkload_mod -s fnic

If you find enic version 2.12.42 and fnic version 1.6.0.5, you are running the ISO’s versions and I would highly recommend upgrading.
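You can also cross-check the installed driver packages with esxcli (a quick sketch ; the Cisco drivers are usually delivered as the net-enic and scsi-fnic VIBs, so the grep pattern below is an assumption based on that naming) :

#esxcli software vib list | grep -i -E "enic|fnic"

The VIB versions shown should match the driver versions reported by vmkload_mod.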

Download your drivers at vmware.com following this navigation path : vmware.com > download > vsphere > Driver & Tools.

  1. Select the relevant vSphere version (5.X, do not choose the “update 01, 02” links),
  2. Download Drivers for Cisco enic and fnic.


It’s 1MB per download on vmware.com, compared to the UCS Drivers ISO on Cisco.com which contains all the drivers for all systems but weighs 2GB …

To apply the update, you can choose between esxcli in the ESXi shell, or Update Manager.

No rocket science here, just follow this standard VMware KB and pick an option you are comfortable with : http://kb.vmware.com/kb/2005205 or KB1032936 for vSphere 4.x.
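For the esxcli route, it boils down to a couple of commands like these (a sketch only : the datastore path and offline bundle file names are placeholders, and the host should already be in maintenance mode) :

#esxcli software vib install -d /vmfs/volumes/datastore1/enic-offline_bundle.zip
#esxcli software vib install -d /vmfs/volumes/datastore1/fnic-offline_bundle.zip
#reboot

A reboot is needed for the new modules to be loaded ; you can then re-run vmkload_mod -s to confirm the versions.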

Reminder : you do NOT need a Windows-based vCenter to use Update Manager. You just need to plan a Windows system to install the VMware Update Manager utility on, and you can then enjoy using the vCenter Appliance.

In addition, this Troubleshooting TechNote goes into all the details regarding how to check and update UCS drivers for ESXi, Windows Server and Linux (Red Hat & SUSE).

http://www.cisco.com/c/en/us/support/docs/servers-unified-computing/ucs-manager/116349-technote-product-00.html

 

VMware doesn’t play nice with CoD. For now. Me neither …

No, I am not going to talk about Call of Duty. At least not today, not on this blog post, even maybe not during this life … 🙂

So what’s CoD and what the hell does it have to do with VMware? CoD stands for Cluster-on-Die, a new NUMA-related setting found on some shiny Intel Haswell processors, aka E5-v3.

It’s active by default.

Side note on this Feature
With Haswell, Intel has gone wild and each processor can hold up to 18 cores! That’s a lot, but nothing new here. The bottom line is that so many cores on a single socket may lead to higher memory access latency with some workloads.
So there is a bunch of new features to get you confused, give you choice, and let you find the option best suited to your context.

Here is a link to a Cisco white paper about the UCS B200 M4 and BIOS tuning for performance. You can find plenty of details there on each new BIOS setting. The following is an excerpt about the “Cluster on Die Snoop” functionality :

Cluster on Die Snoop
Cluster on Die (CoD) snoop is available on Intel Xeon processor E5-2600 v3 CPUs that have 10 or more cores. Note that some software packages do not support this setting: for example, at the time of writing, VMware vSphere does not support this setting. CoD snoop is the best setting to use when NUMA is enabled and the system is running a well-behaved NUMA application. A well-behaved NUMA application is one that generally accesses only memory attached to the local CPU. CoD snoop provides the best overall latency and bandwidth performance for memory access to the local CPU. For access to remote CPUs, however, this setting results in higher latency and lower bandwidth. This snoop mode is not advised when NUMA is disabled.

Problem : VMware doesn’t support this new option. It’s outlined in KB2087032, which mentions vSphere 5.1 up front, but you can read that the problem persists with vSphere 5.5.
I ran into this problem with a brand new UCS M4, and the only consequence I am aware of is pretty fun : the vSphere host sees 4 sockets instead of 2 … Licensing guys @VMware will be happy 🙂

Workaround : This is a well-known issue for vSphere today, and no doubt it will be fixed soon by VMware. In the meantime, the recommended action is to “disable” this setting. Actually, you cannot “disable” it, you just have to choose another CPU snoop mode. The best balanced mode seems to be “Home Snoop”.

Home Snoop
Home Snoop (HS) is also available on all Intel Xeon processor E5-2600 v3 CPUs and is excellent for NUMA applications that need to reach a remote CPU on a regular basis. Of the snoop modes, HS provides the best remote CPU bandwidth and latency, but with the penalty of slightly higher local latency, and is the best choice for memory and bandwidth-intensive applications: such as, certain decision support system (DSS) database workloads.

The third available CPU snoop mode defined by Intel on the E5-2600 v3 is “Early Snoop”, which is best suited for low-latency environments.

So to configure that parameter on a UCS system, the best option is to define a custom BIOS policy through UCSM, tweaking the QPI snoop mode and choosing “Home Snoop”.

Changing the BIOS policy for this value takes effect immediately, without needing a reboot. vSphere sees the change immediately and reports the correct socket count.
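To double-check from the ESXi side, a quick look at the CPU topology should do (a sketch, assuming ESXi 5.1/5.5 where esxcli exposes this information) :

#esxcli hardware cpu global get

The “CPU Packages” line should be back to 2 once “Home Snoop” is active.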

Just remember that you need to be at least on firmware version 2.2(3c) for that setting to be available, but my advice would be to run at least UCSM 2.2(3d), the recommended firmware version for the B200 M4 as of this writing.

UPDATE March 25th : “Home Snoop” is the default setting for the B200 M4 BIOS when your system ships with UCSM above 2.2(3c).

Links collection #2

Here is the Links Collection #2

This one is pretty short, and focused on VMFS & VMDK.
The thinking on the first link about all possible alternatives and the trade-offs is really interesting from a design point-of-view.

Next, some useful links about vCloud Director & vCloud Automation Center.

VMFS & VMDK

vCloud Director

Links collection #1

Here is a collection of some interesting articles & papers I stumbled upon last week.
This list is mainly for my own record and allows me not to keep a web browser with more than 20 tabs open for over a month. 🙂

Sure, “cloud bookmarks” solutions are great. There are plenty of them out there, and I use them.
But the drawbacks are the following :
– it’s quicker for me to write down a link rather than struggling with folders and tags,
– I would need my credentials or an agent to access my bookmarks,
– it could take a while to find that link again 2 years later,
– a link may only be needed on a temporary basis or in a particular context, and does not deserve a bookmark.

Finally, I think it’s easier to share with others through a blog (no need to sign up for a profile, etc …).
So, after this long introduction, here is “Link Collection #1”.

EMC VPLEX

  1. RecoverPoint Comes Clean with VPLEX (clearpathsg)
  2. Interesting use cases for VPLEX (vijay swami)
  3. A Deeper Look at VPLEX (Scott Lowe)

vMSC (vSphere Metro Stretched Cluster)

  1. vSphere Metro Stretched Cluster with vSphere 5.5 and PDL AutoRemove (longwhiteclouds)

VMFS & VMDK

  1. Support for virtual machine disks larger than 2 TB in vSphere 5.5 (2058287) (vmware)
  2. vsphere 5.5 Jumbo VMDK Deep dive (longwhiteclouds)

vSphere, vCenter & vCloud Director 5.5

  1. VMware vCenter Server Appliance (VCSA) 5.5 deployment tips and tricks (ivobeerens)
  2. Top 10 things you must read about vSphere 5.5 (vsphere land)
  3. Comparison of vCloud Director Maximums 1.5 / 5.1 / 5.5 (virtualizationexpress)
  4. Comparison of vSphere Maximums 5.1 / 5.5 (vsphere land)
  5. vCenter, vSphere & vCloud Director 5.5 configuration maximums (vmware)
  6. vCloud Director 5.1 Configuration Maximums (2036392) (vmware)
  7. vSphere 5.1 configuration maximums (vmware)

Open Source Clouds

  1. Beyond Chef and Puppet: Ten essential DevOps tools (TechTarget)

vCD, Red Hat, and the network : common pitfalls

This post could have been named “the good, the bad and the ugly”; you are free to choose which one maps to which : vCD, Red Hat and the network … 😀

The purpose here is to sum up common pitfalls when setting up a vCloud Director environment. As you will see, most of them are not related to vCloud, but rather to Linux.

What do you need for a vCloud cell?

Before starting, let’s do a quick refresh on what you need for a vCloud Cell:

  • a VM meeting the minimum hardware requirements and running a supported guest OS (see the installation guides),
  • a minimum of two NICs : one for the web portal, one for VM consoles (the VMRC proxy).

According to the installation guides, for all versions prior to 5.5 you will need:

  • 2GB of RAM,
  • ~1GB of disk space to install vCD binaries,
  • RHEL 5.x or 6.x depending on the version of vCD (no CentOS support).

Starting with vCD 5.5, you will need 4GB of RAM and a little more disk space (1350MB).
Note: version 5.5 officially supports CentOS 6.4.

There is no recommendation/requirement explicitly stated by VMware for CPU count. However, any decent vCD design should include a dedicated management cluster, and CPU will not be the bottleneck on that cluster. So I tend to set up vCD cells with 2 vCPUs.

But here I am digressing from the original matter of this post. Let’s leave the design advice for another post and focus on configuration errors.

The most common problems are the following:

  1. Hang on reboot of the vCD cell,
  2. Being unable to ping the 2nd NIC,
  3. Unable to access the vCloud web portal.

1. Hang on reboot of the vCD cell

It’s mainly related to vSphere 5.1 and RHEL/CentOS 6.x. The problem has been widely discussed on the VMware community forums, here are some links:

With this information, you should be able to get rid of this nasty reboot hang …

2. Unable to ping the 2nd NIC

If you chose RHEL 6 for your vCD cell guest OS and your NICs are on the same VLAN, you will likely have problems pinging the second NIC. This is due to RHEL’s default setting regarding “reverse path filtering”.

Basically, RHEL drops packets when the route for outbound traffic differs from the route of the incoming traffic. It’s a new “default behavior” with RHEL 6.
More details on the subject can be found by following this link to the related Red Hat KB.
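The usual workaround is to switch reverse path filtering to “loose” mode (value 2) instead of disabling it completely. A minimal sketch, to be validated against the Red Hat KB for your own security context :

echo "net.ipv4.conf.all.rp_filter = 2" >> /etc/sysctl.conf
echo "net.ipv4.conf.default.rp_filter = 2" >> /etc/sysctl.conf
sysctl -p

The sysctl -p command applies the change immediately, and the sysctl.conf entries make it persistent across reboots.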

3. Unable to access the vCloud web portal

Even if you have dodged all these traps, you could still be frustrated when trying to reach the vCloud web portal for the first configuration. And it’s mainly because of RHEL’s default security parameters: the iptables firewall is running.

Depending on your security context, you could just disable the firewall or, as a better approach, configure it …
By default the RHEL firewall does not accept connections on port 443, so you have no chance to reach the first-time setup wizard.

Fast spoiler : add the following rule and restart your firewall

iptables -A INPUT -p tcp -m state --state NEW -m tcp --dport 443 -j ACCEPT
service iptables save
service iptables restart

Obviously you will need to open other ports to get a fully functional vCD, but here we are focusing on what you need to get access to the main portal.
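Once the rule is in place, a quick test from another machine should show the portal answering (a sketch : the cell hostname is a placeholder, and /cloud/ is the path where the vCD web portal lives) :

curl -k -I https://vcd-cell.example.com/cloud/

An HTTP response here (even a redirect) means port 443 is reachable and you can carry on with the setup wizard from a browser.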

Kendrick Coleman wrote an exhaustive article about how to set up the RHEL firewall for the vCD cell. I strongly encourage you to read it if you decide not to turn iptables off completely.

And finally, the Yoda quote:
“To be an iptables jedi, only one way there is, Luke : RTFM of netfilter/iptables”

More links to help you configure your vCD cell right, on the first try: