New 2016 Hyper-V cluster hosts with sporadic sluggish performance/network latency (every two hours)

Ross Aveling

Hi All,

Tearing my hair out with strange Server 2016 Hyper-V host performance, wondering if anyone can help.

A bit of background. Had a perfectly happy 2008 R2 Enterprise Hyper-V cluster (two hosts) running for many years, no performance issues whatsoever;

  • 2 x Dell PowerEdge R710s
  • Dedicated LACP team (2 x 1GbE NICs) for host management
  • Dedicated 10GbE NICs for cluster communication and live migration (direct between hosts with jumbo frames)
  • Dedicated virtual switch for VM networking - 4 x 1GbE NICs in LACP team on each host
  • Dedicated 1GbE iSCSI NICs, 2 per host (using MPIO) for CSV access (CSV holds boot VHDs only)
  • Dedicated virtual switch for VM iSCSI volume access (2 x 1GbE NICs, using MPIO). For SQL and Exchange volumes etc.
  • HPE 2920 switch (2 x stacked). Hyper-V VM networking set up as LACP team, host management as dynamic LACP team
  • HPE Nimble storage array, dual controllers each with a total of 4 x 1GbE connections back to another dedicated HPE 2920 stack for iSCSI traffic only (no jumbo frames)


Recently upgraded the whole thing to run Server 2016 Datacenter (fully patched);

  • 2 x new Dell PowerEdge R640s (installed with latest drivers about a month ago)
  • Dedicated LACP team (2 x 1GbE NICs) for host management
  • Dedicated 10GbE NICs for cluster communication and live migration (direct between hosts with jumbo frames)
  • Dedicated virtual switch for VM networking - 2 x 10GbE NICs in LACP team on each host
  • Dedicated 1GbE iSCSI NICs, 2 per host (using MPIO) for CSV access (CSV holds boot VHDs only) (paths sanity-checked as sketched after this list)
  • Dedicated virtual switch for VM iSCSI volume access (2 x 1GbE NICs, using MPIO). For SQL and Exchange volumes etc.
  • HPE 2920 switch (2 x stacked). This has been upgraded with new 10GbE modules to accommodate the new Hyper-V virtual switch for VM networking (they remain LACP teamed as before). Host management remains dynamic LACP teamed.
  • Same HPE Nimble array, no changes
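
For anyone wanting to double-check the same areas, a quick host-side sanity check of the storage and teaming paths would look something like this (standard Windows cmdlets and commands, nothing Nimble-specific assumed):

    # Quick look at the iSCSI sessions from the host to the array
    Get-IscsiSession | Format-Table InitiatorPortalAddress, TargetNodeAddress, IsConnected

    # Summary of the MPIO-claimed disks and their load-balance policy
    mpclaim -s -d

    # Confirm the LBFO teams report as up
    Get-NetLbfoTeam | Format-Table Name, TeamingMode, LoadBalancingAlgorithm, Status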


The migration was smooth; all VMs were recreated on 2016 as there is no upgrade path from 2008 R2. VM settings were kept the same, with the same IP addressing for normal and iSCSI connections etc. Hosts and VMs were updated with the latest Nimble Windows Integration Toolkit (with all recommended Windows Server hotfixes). We're now using the Server 2016 in-built NIC teaming (LACP Dynamic) rather than the Broadcom drivers/Advanced Control Suite, and no team issues are reported by either Windows or the HPE switches.
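
For reference, the in-built (LBFO) teams and the VM virtual switch were built along these lines; the team, adapter and switch names here are placeholders rather than our actual names:

    # Create an LACP team using the Dynamic load-balancing algorithm (placeholder names)
    New-NetLbfoTeam -Name "VM-Team" -TeamMembers "10GbE-1","10GbE-2" `
        -TeamingMode Lacp -LoadBalancingAlgorithm Dynamic

    # Bind the dedicated VM-networking virtual switch to the team's interface
    New-VMSwitch -Name "VM-vSwitch" -NetAdapterName "VM-Team" -AllowManagementOS $false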

VMs are all 2008 R2 or later (bar two 2003 R2s), running the latest integration components, with all integration services enabled apart from Guest Services.
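
If it's useful, the integration services state can be confirmed per VM with something like this (the VM name below is just an example):

    # List integration services and whether they're enabled for every VM on the host
    Get-VM | Get-VMIntegrationService | Format-Table VMName, Name, Enabled

    # "Guest Service Interface" is the component shown as Guest Services in Hyper-V Manager
    Get-VM -Name "SQL01" | Disable-VMIntegrationService -Name "Guest Service Interface"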

Everything seemed to be fine and running well, and then we noticed that our VMs on either host were running sluggishly from time to time. It turns out that every two hours (on the dot) there is a period of between 30 seconds and a minute when VM network latency goes from the normal <1ms up to anything around 1,500ms, and then drops back down again. Further investigation showed the same latency on the host management network connections too! I experienced this first hand when the RDP session to a host would either stop responding completely or would be like trying to wade through treacle.
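
To put numbers against the spikes and line them up with event-log timestamps later, a crude latency logger along these lines is enough (the target address and log path are placeholders):

    # Ping a host/VM once a second and append a timestamped round-trip time to a log
    $target = "10.0.0.10"    # placeholder address
    while ($true) {
        $r  = Test-Connection -ComputerName $target -Count 1 -ErrorAction SilentlyContinue
        $ms = if ($r) { $r.ResponseTime } else { "timeout" }
        "$(Get-Date -Format 'yyyy-MM-dd HH:mm:ss')  $target  $ms ms" |
            Add-Content -Path "C:\Temp\latency-log.txt"
        Start-Sleep -Seconds 1
    }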

Things I’ve investigated;

  • The performance of the 2920 stack, bearing in mind the 10GbE module upgrades. This is fine, with an average CPU load of 8% and memory at 40%. Connectivity to other physical servers is fine, and they never experience similar high latency when the Hyper-V cluster does. I don't think the switches are the problem.
  • I know that Broadcom NICs were notoriously bad for Virtual Machine Queuing (despite us never having a problem ourselves on the R710s), but I can confirm that VMQ is turned off on the 1GbE NICs and on for the 10GbE NICs (checked with the snippet after this list). Receive Side Scaling is enabled everywhere. VM and network performance outside of these ‘blips’ is fantastic, and latency is almost always <1ms for the hosts and their VMs.
  • Nimble doesn’t report any spikes in iSCSI traffic/latency that correlate
  • Each host has dual 8-core Xeon Silver CPUs and I don't think host CPU usage is a problem. I'm trying to get some performance counters running to make sure, though (see the sketch after this list). Hosts have 256GB RAM and the VMs are using nowhere near that amount.
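
The VMQ/RSS check and the counter capture referred to above look something like this (the counter list is only a starting point and the output path is just an example):

    # Confirm VMQ/RSS state per physical adapter
    Get-NetAdapterVmq | Format-Table Name, Enabled, BaseProcessorNumber, MaxProcessors
    Get-NetAdapterRss | Format-Table Name, Enabled, Profile

    # Capture a few key host counters every 5 seconds for ~10 minutes, spanning a spike window
    Get-Counter -Counter @(
        '\Hyper-V Hypervisor Logical Processor(_Total)\% Total Run Time',
        '\Processor(_Total)\% Processor Time',
        '\Network Interface(*)\Output Queue Length'
    ) -SampleInterval 5 -MaxSamples 120 |
        Export-Counter -Path "C:\Temp\host-counters.blg" -FileFormat BLG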


The strongest lead I've got to go on is VSS on the hosts. We noticed that the Microsoft Software Shadow Copy Provider service appears to enter the running state (event 7036) and then stops roughly three minutes later after “shutting down due to idle timeout” (event 8224), and it's then that we get the high latency. Originally we were also seeing Nimble VSS Hardware Provider events at the same time, but we have since uninstalled the Nimble VSS/DSM hardware provider support on each host (we aren't taking snapshots directly from the Nimble).
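
In case it helps anyone reproduce the correlation, the two events can be pulled out and lined up against a latency log with something like this (the one-day window is arbitrary):

    # 7036 (service state change) is logged to the System log by Service Control Manager
    Get-WinEvent -FilterHashtable @{ LogName='System'; Id=7036; StartTime=(Get-Date).AddDays(-1) } |
        Where-Object { $_.Message -match 'Microsoft Software Shadow Copy Provider' } |
        Select-Object TimeCreated, Message

    # 8224 (VSS shutting down due to idle timeout) is logged to the Application log by VSS
    Get-WinEvent -FilterHashtable @{ LogName='Application'; ProviderName='VSS'; Id=8224; StartTime=(Get-Date).AddDays(-1) } |
        Select-Object TimeCreated, Message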

I thought that had fixed it, as we had a good day or two without the spikes; however, it has come back! I'm able to overcome the issue for now by moving the VMs between hosts and rebooting them – I suspect that rebooting them before, when dealing with the Nimble VSS hardware provider, was the same temporary fix and had nothing to do with the Nimble WIT itself.

The only other thing to mention is that we use Veritas Backup Exec 2016 for daily backups. Most VMs are backed up using the BE agent inside the VM, while the 2003 R2 VMs are backed up as clustered VMs using the BE agent on the hosts. There doesn't seem to be a correlation between the backups and this problem either – backups have run with no subsequent latency for a couple of days.

I keep coming back to the VSS shadow copy provider service. Is it supposed to run every two hours on 2016? I'm about to bring out the mothballed R710s to check whether they showed the same behaviour on 2008 R2, but I doubt it.
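
In case anyone wants to compare notes, these are the obvious things to check for something invoking VSS on a regular cadence (the task-name filter is just a guess at likely names):

    # List the VSS providers and writers registered on a host; with the Nimble hardware
    # provider uninstalled, no Nimble entry should appear
    vssadmin list providers
    vssadmin list writers

    # Look for scheduled tasks that could be triggering VSS on a schedule
    Get-ScheduledTask | Where-Object { $_.TaskName -match 'ShadowCopy|VSS|Backup' } |
        Get-ScheduledTaskInfo | Select-Object TaskName, LastRunTime, NextRunTime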

I've looked online for people experiencing similar issues but have found nothing. Can anyone provide any guidance on how to troubleshoot further, or has anyone had a similar problem? If I can whittle the problem down to a specific area I'll then be able to engage directly with professional support services – right now I'm at a loss as to who I should talk to.

Many thanks in advance for any advice, it’s really appreciated.
