Win Server 2016 + Hyper-V + SQL 2016 random hang mystery


Berente Zoltán

Hi All,

We have a very hard problem to solve. We first ran into it about a year ago at one of our customers' sites. They have a Dell PowerEdge R530 server running Windows Server 2016 Std with the Hyper-V role and a few Windows guests. One of the guests is a Win 2016 Std VM running MSSQL (2016 version) for SAP, and it freezes randomly about once every 1-2 months, so relatively rarely. We have tried multiple things: installing the latest FW and driver updates on the system, installing all Windows Updates, updating MSSQL, etc. When the freeze happens, the VM guest is unresponsive. If I try to shut it down from the Hyper-V console, I get an error message saying it cannot shut down the machine. Ctrl+Alt+Delete does not bring up the login either, and the date and time on the lock screen are stale, so every element of the OS appears to be frozen. The last occurrence was yesterday evening. I pinged the VM and got replies, but nothing else worked; a telnet to port 3389 failed with a connection error. In the Hyper-V console the guest's screen showed 8:30 AM while the real time was 11:00 PM. They were using the server yesterday without any problems, so the frozen clock does not show when the problem actually hits, maybe only when it starts to develop.
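Since the guest still answers ping while port 3389 refuses connections, one way to narrow down when the freeze really sets in would be to probe the TCP port from another machine at a fixed interval and keep the timestamped results. The following is only a minimal sketch of that idea in Python; the function names are made up for illustration, and the host name in the usage note is a placeholder, not one of the real servers.

```python
import socket
from datetime import datetime


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def probe_line(host, port=3389, timeout=3.0):
    """Build one timestamped log line describing the current port state."""
    state = "open" if tcp_port_open(host, port, timeout) else "unreachable"
    stamp = datetime.now().isoformat(timespec="seconds")
    return f"{stamp} {host}:{port} {state}"
```

Calling `probe_line("sap-sql-vm.example.local")` (hypothetical name) once a minute from a scheduled task on a different machine and appending the result to a file would give a record whose first `unreachable` line brackets the onset of the hang more tightly than the frozen lock-screen clock does.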

On many occasions, after I turned the server off and on again, it was configuring updates during startup, even though automatic installation is off and we manage updates through WSUS. I thought about performance/resource issues, but there are other VMs on these machines that carry more load and do have some performance issues. Those only show occasional slowdowns and have no stability problems.

The biggest problem is that we now have 3 customers with the same problem on similar systems. Two of the systems were installed and are managed by us. The third customer has its own IT team, which installed and configured the server itself and has now contacted us for help with this problem. We have been searching for a solution for a year without any luck. The third customer is a larger company that needs 24/7 access to its SAP server. The three systems have a lot of similarities but a few differences too. I investigated them and noted these details:


Customer 1
DELL PowerEdge R530 H730 mini (6.604.06.00) SSD P/N: SSDSC2BB480G6R
Windows Server 2016 14393.2906
SQL Server 2016 (13.0.1742.0)

Customer 2
DELL PowerEdge R740 H730P mini (6.604.06.00) SSD P/N: SSDSC2KG480G8R
Windows Server 2016 14393.3025
SQL Server 2016 (13.0.1742.0)

Customer 3
DELL PowerEdge R730 H730 mini (6.603.06.00) SSD P/N: SSDSC2KG480G8R
Windows Server 2016 14393.2969
SQL Server 2016 (13.0.5337.0)


In all 3 cases the affected server runs as a Hyper-V VM on a Win 2016 host and runs SQL Server 2016 for SAP B1. At customer1 there are 3 other VMs, all Win 2016 Standard, and only the SAP DB server freezes. At customer2 there are 4 other VMs, 3 Win 2016 Standard and 1 Win 10 client, and only the SAP DB server freezes. At customer3 it is the only VM running on that server, with tons of resources: 2x12 CPU, 256GB RAM, RAID10 disks and a RAID1 SSD array.

The only difference between the SAP DB servers and the other VMs is that the SAP DB servers use SSDs in RAID1 for the VHD files that contain the database files. The RAID controllers are nearly the same, so my only guess is that a FW or driver issue is the key to the problem. Previously I suspected that Windows Defender, which is enabled, could also cause this kind of problem, but the freeze has occurred outside working hours every time, and I don't think an AV problem would pick a time outside working hours every time. I can imagine Windows Update as a source of the problem too, but every time I checked the update log I found nothing that had been installed automatically.

My main problem with the investigation is that the logs show almost nothing. The Windows logs look healthy: nothing special happens, and then a stretch of time is simply missing at the time of the freeze. The next entries tell me that everything started successfully and the server is operational again. The hardware logs do not show any problems either.
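Because the only trace in the logs is a stretch of missing time, the silent windows themselves could be listed and compared across the three systems. Below is a rough Python sketch of that idea. It assumes the event timestamps have already been exported from the guest and parsed into `datetime` objects (the export step is not shown), and the function name and the 30-minute threshold are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, threshold=timedelta(minutes=30)):
    """Return (last_event_before, first_event_after, gap_length) for every
    pair of consecutive events that lie further apart than threshold."""
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > threshold:
            gaps.append((earlier, later, later - earlier))
    return gaps
```

The start of each reported gap is the last moment the guest was still writing events, which may be a better estimate of the freeze onset than the lock-screen clock. If the gaps on all three systems begin at similar times of day, that would point toward a scheduled task rather than a random hardware fault.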

What could cause this? Has anyone seen the same kind of problem? Please help! Any ideas are welcome!
