Slow Windows Server 2016 Updates - some info we found - further discussion

DDSYSADMIN

Hello everybody,

With the December 2018 update we again reached the point (after migrating to our new hardware) where every cumulative update failed with a timeout.

I think there is enough information around to say that this is a very big problem on many systems and that it has nothing to do with the commonly presented "solutions" for updates not being offered, download trouble or similar. It is a problem in its own right.

To get those updates installed we tried a lot, but nothing helped apart from disabling AV. The cumulative updates still take very long to install, but at least it works (the downtime is still a shame and a big problem).

One thing I stumbled across that I think is worth mentioning: first check whether tracing is active and going mad. We had an older machine with this problem, but it was the only one. There is a registry key that should be deleted to revert tracing to the default (HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\WindowsUpdate\Trace).
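Before deleting anything it may be worth just looking at what is under that key. A minimal read-only sketch in Python, assuming it is run on the affected server with enough rights to read HKLM; the helper names split_hive and read_trace_values are made up for illustration, only the key path comes from the post:

```python
import sys

# Key path as given in the post.
TRACE_KEY = r"HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\WindowsUpdate\Trace"


def split_hive(path):
    """Split 'HKLM\\SOFTWARE\\...' into ('HKLM', 'SOFTWARE\\...')."""
    hive, _, subkey = path.partition("\\")
    return hive, subkey


def read_trace_values(path=TRACE_KEY):
    """Return {value_name: data} for the key, or None if the key does not exist.

    Windows only: winreg is imported here so the module still loads elsewhere.
    """
    import winreg

    hive_name, subkey = split_hive(path)
    hives = {"HKLM": winreg.HKEY_LOCAL_MACHINE}
    # KEY_WOW64_64KEY: read the 64-bit view even from a 32-bit Python.
    access = winreg.KEY_READ | winreg.KEY_WOW64_64KEY
    try:
        with winreg.OpenKey(hives[hive_name], subkey, 0, access) as key:
            _subkeys, value_count, _modified = winreg.QueryInfoKey(key)
            values = {}
            for index in range(value_count):
                name, data, _type = winreg.EnumValue(key, index)
                values[name] = data
            return values
    except FileNotFoundError:
        return None  # key absent, nothing to look at


if __name__ == "__main__" and sys.platform == "win32":
    print(read_trace_values())
```

If this prints None, the key is not there and tracing is presumably already at its default. The sketch does not delete or change anything.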

Apart from that we dug deeper and watched Task Manager and Resource Monitor for CPU and file activity while the updates are "prepared" (the point where it failed for us every time) and installed (where we only had problems on a few VMs and it worked on the second try). This was a big eye-opener on several things:

1.) I'm not sure what the official position is, but from the file activity it looks as if MS simply puts all those smaller updates together into one "zipped" package, which is far from truly merging them. I think the only "merge" that happens is that older updates are replaced by their replacement if there is one. That's why these packages have become so big and are still growing. That's also why they worked earlier on our system but now run into timeouts.

2.) So why did disabling AV help? Since the update is basically a zip file with an unknown number of smaller updates inside, they need to be unpacked, checked for consistency and so on. That leads to the following:
A process called TiWorker.exe unpacks them into its own folder under %windir%\SoftwareDistribution. That causes a massive amount of disk reads and writes (on our side against a brand-new DataCore iSCSI storage on 10GB/s fibrechannel with SSDs, which basically idles most of the time). That drove our AV solution mad. Other folders are used too, but I'm not sure at which point in the process and for what purpose (%windir%\servicing and %windir%\system32\catroot).
Later the reads and writes drop but do not stop (I think once all files are unpacked), and now TiWorker.exe generates a huge CPU load. On our system it completely maxed out one vCPU for over 30 minutes (brand-new HP DL380 Gen10 with Xeon Gold 5118 and VMware ESXi 6.5, up to date). I can only guess what happens there, but I think it's some kind of CRC checking plus checking the system for compatibility with all those smaller packages.
For us it helped to exclude the folders and TiWorker.exe from AV scanning. That speeds up the process enough to avoid the timeout, but it isn't really great: TiWorker.exe seems to live in a changing path, so we had to exclude it system-wide, which is a possible security problem. It's similar with the folders, which are of course known to malware authors, and it would be obvious to try to use them to get around AV.
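To make the security concern concrete, here is a small Python sketch of what such an exclusion set ends up covering. It only models the matching logic (folder prefix or process name); it is not how any particular AV product evaluates exclusions, and the function names and the C:\Windows default are assumptions for illustration:

```python
import ntpath  # Windows path rules, usable on any OS

# Folders and process name as mentioned in the post.
EXCLUDED_FOLDERS = (
    r"%windir%\SoftwareDistribution",
    r"%windir%\servicing",
    r"%windir%\system32\catroot",
)
EXCLUDED_PROCESS_NAMES = ("TiWorker.exe",)


def _norm(path):
    """Normalise case and separators the way Windows compares paths."""
    return ntpath.normcase(ntpath.normpath(path))


def is_under(path, folder):
    """True if path is the folder itself or lies somewhere below it."""
    path, folder = _norm(path), _norm(folder)
    return path == folder or path.startswith(folder + "\\")


def covered_by_exclusions(path, windir=r"C:\Windows"):
    """Return 'process', 'folder' or None for a given file path."""
    name = ntpath.basename(path).lower()
    if name in {n.lower() for n in EXCLUDED_PROCESS_NAMES}:
        return "process"  # name-only match: the location does not matter
    for template in EXCLUDED_FOLDERS:
        if is_under(path, template.replace("%windir%", windir)):
            return "folder"
    return None


if __name__ == "__main__":
    for candidate in (
        r"C:\Windows\SoftwareDistribution\Download\update.cab",
        r"C:\Users\Public\TiWorker.exe",
        r"C:\Temp\something.exe",
    ):
        print(candidate, "->", covered_by_exclusions(candidate))
```

The second candidate is the uncomfortable one: with a system-wide exclusion by name, any file called TiWorker.exe is skipped no matter where it sits.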

3.) Apart from our specific solution, which is basically only a workaround, the problem seems to be the sheer number of files packed together inside the cumulative update and whatever Windows Update does with them. This needs so many system resources that, if the process on MS's side stays unchanged, the package will in my opinion fail with a timeout sooner or later on every system. At the very least the problem will probably get worse with every month and every new fix.
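If you want numbers for the TiWorker.exe CPU phase from point 2 instead of watching Task Manager, something like the following Python sketch could be run while an update is being prepared. It assumes the column order of `tasklist /V /FO CSV /NH` is Image Name, PID, Session Name, Session#, Mem Usage, Status, User Name, CPU Time, Window Title, and that CPU Time is printed as H:MM:SS; please verify both on your system. The function names are made up for illustration:

```python
import csv
import io
import subprocess


def parse_cpu_seconds(tasklist_csv, image_name="TiWorker.exe"):
    """Sum the CPU Time column (assumed H:MM:SS, 8th column) for matching rows."""
    total = 0
    for row in csv.reader(io.StringIO(tasklist_csv)):
        # Skips short lines such as the "INFO: No tasks are running..." message.
        if len(row) < 8 or row[0].lower() != image_name.lower():
            continue
        hours, minutes, seconds = (int(part) for part in row[7].split(":"))
        total += hours * 3600 + minutes * 60 + seconds
    return total


def cpu_share(before_s, after_s, interval_s):
    """Fraction of one vCPU used between two samples (1.0 = one core maxed out)."""
    return (after_s - before_s) / interval_s


def sample_tiworker_cpu_seconds():
    """Windows only: ask tasklist for TiWorker.exe and return its CPU seconds."""
    result = subprocess.run(
        ["tasklist", "/V", "/FO", "CSV", "/NH",
         "/FI", "IMAGENAME eq TiWorker.exe"],
        capture_output=True, text=True, check=True,
    )
    return parse_cpu_seconds(result.stdout)
```

Taking two samples a minute apart and passing them to cpu_share gives a rough value; anything close to 1.0 over a long stretch matches the "one vCPU maxed out" behaviour described above.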


A completely different question is why it sometimes takes so long to restart the VM (the "windows is prepared" phase). Performance counters on our old platform did not show much during that time, and that phase hasn't failed since we migrated to the new hardware.
Maybe one of you has further information about what happens there and which folders are used (to check, for example, whether excluding them from AV helps), because it still takes plenty of time.

I also hope this can be a "kickoff" for collecting all known information about the process that has been dug up (helpful information is very much scattered across blogs atm). Maybe we can find out more and find more workarounds, since I'm pretty sure that MS won't act on this apart from saying "hey, it works on 2019. update." (which is possibly only because not so many fixes are included yet).

Some further information about our systems:
- ESXi 6.5 on HP DL380gen10 completely up2date (Soft-/Firmware)
- HPspecific ESXi-image
- DataCore-SAN also on DL380gen10 with SSDs (OS-virtualdiscs on SSD only)
- Bonded (LACP) 2x10GB/s for iSCSI
CrystalDiskMark 6.0.0 x64:
Sequential Read (Q= 32,T= 1) : 1094.799 MB/s
Sequential Write (Q= 32,T= 1) : 526.509 MB/s
Random Read 4KiB (Q= 8,T= 8) : 430.462 MB/s [ 105093.3 IOPS]
Random Write 4KiB (Q= 8,T= 8) : 168.259 MB/s [ 41078.9 IOPS]
Random Read 4KiB (Q= 32,T= 1) : 417.054 MB/s [ 101819.8 IOPS]
Random Write 4KiB (Q= 32,T= 1) : 157.304 MB/s [ 38404.3 IOPS]
Random Read 4KiB (Q= 1,T= 1) : 42.067 MB/s [ 10270.3 IOPS]
Random Write 4KiB (Q= 1,T= 1) : 14.260 MB/s [ 3481.4 IOPS]
- OS : Windows Server 2016 Datacenter (Full installation) [10.0 Build 14393] (x64)
- WSUS
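As a side note on reading the CrystalDiskMark figures above: the MB/s and IOPS columns of the 4KiB rows are the same measurement in two units. The numbers are consistent with MB meaning 10^6 bytes and a block size of 4096 bytes; that is inferred from the values in this post, not taken from the tool's documentation. A tiny Python check:

```python
def iops_from_throughput(mb_per_s, block_bytes=4096):
    """Convert MB/s (decimal megabytes, an assumption) to operations per second."""
    return mb_per_s * 1_000_000 / block_bytes


if __name__ == "__main__":
    # (MB/s, IOPS) pairs copied from the 4KiB rows in the post.
    rows = [(430.462, 105093.3), (168.259, 41078.9), (42.067, 10270.3), (14.260, 3481.4)]
    for mb_per_s, reported in rows:
        print(mb_per_s, "MB/s ->", round(iops_from_throughput(mb_per_s), 1), "IOPS, reported", reported)
```

The Q=1, T=1 rows are probably the most relevant ones for a mostly single-threaded unpack job, and they are roughly a tenth of the headline numbers.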

I know many of you are very frustrated, but when answering please keep in mind that it won't help much to simply blame MS for not getting this sorted out.
At least give your specs, describe your problem and say whether you found something that helped, even a bit.
