System failing to boot - need a way to change the driver for my system drive

Doug_x64

A little background:


I have a homebuilt system based on an AMD Ryzen Threadripper and an AMD B450 motherboard. The motherboard has 3 M.2 NVMe sockets, and the BIOS is capable of doing software RAID on the NVMe devices. I installed this machine over a RAID0 array of 2 x 2TB Samsung EVO 970 NVMe devices. This worked great for a long time, but then it started issuing CRC errors on specific sectors. The errors only showed up when I was trying to do a backup; normal day-to-day operations were fine. Because the errors didn't impact day-to-day use, I ignored them for a long time, and at this point any backups are truly out of date. I spent a week, with help from my backup software vendor, trying to get a complete backup, but it kept dying on the CRC errors.


So, I got a new 2TB NVMe device and tried to clone from the RAID array to the single 2TB device (I had enough free space to shrink the filesystem down and fit it on the single 2TB drive). When I tried to clone from the running system, it would hit the same CRC errors (note: I was able to get sector numbers during the cloning process, look up which file contained each sector, and check that file, and I never found a problem in any of the files, yet it still gave CRC errors). I finally managed to boot into a rescue image built by my backup software tool (Acronis True Image), and once I was booted into it and did the clone, it completed with *no* CRC errors.


And just to be clear, I'm cloning from the software RAID device to a bare NVMe device with the intent of doing away with the software RAID stack entirely as it is clearly flaky.


There are two problems that limit my options here:


1) Linux has no support for the AMD NVMe software RAID, so many of the repair DVD/USB images you can get won't work on the RAID device at all, since they run proprietary software on top of a Linux boot image. That limits what I can boot into to even attempt to access the original RAID device.


2) When you enable NVMe software RAID on the motherboard, it *doesn't* change the PCI ID of the NVMe controller to anything other than the PCI ID of a standard NVMe controller. As a result, the same PCI ID can be bound either to the rcbottom/rcraid/rcconfig stack that makes up the AMD software RAID driver, or to the stornvme driver from Microsoft.


When I did the clone, the AMD RAID driver was handling the system volume, which means the cloned drive carries the same setting. Under normal conditions, when I attempted to boot the cloned drive, it would show the "Adjusting hardware settings" message, fix up the drivers, and the newly cloned system would boot up and run fine. Except problem #2 above completely stops that from happening. As far as Windows is concerned, no major changes have taken place on the computer. The motherboard/CPU are identical. The NVMe devices might have changed their bus addresses, but they're still the same PCI IDs. So I can't boot the cloned drive.


Then I ran into another problem. I went into the old copy of the system, opened Device Manager, and changed the underlying driver for the bare drive, planning to re-clone the system afterward. But I messed up the mapping of which devices were part of the RAID array and accidentally switched one of them to the NVMe driver, so now neither the new system nor the old system will boot.


The symptoms from both are the same:


If I enable NVMe RAID in the BIOS, the system finds the RAID volume and tries to boot it, does a hard reset, repeats, then goes into a recovery boot, fails to automatically fix the system bootup, and resets again. This loop goes on forever.


If I disable NVMe RAID, it does the same thing but from the single 2TB device.


If I were under Linux, I know what I'd need to do:


Boot up a rescue image, go to the root fs, use lspci to see the NVMe devices, make sure that the proper NVMe devices are attached to the right drivers in the dracut boot image, fix whatever is broken, reboot, done.
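
Roughly along these lines, written out as a Python sketch just to make the steps explicit (this assumes a dracut-based distro and that I've already chrooted into the installed root filesystem from the rescue shell):

import subprocess

def run(cmd):
    # Run a command, echo it, and return its output; raise if it fails.
    print("+", " ".join(cmd))
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Show the NVMe controllers (PCI class 0108) and which kernel driver each one is bound to.
print(run(["lspci", "-nnk", "-d", "::0108"]))

# Rebuild the initramfs so the right storage drivers get pulled in at boot.
run(["dracut", "--regenerate-all", "--force"])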


What I think needs to be done under Windows is: boot a rescue image, get into the system volume, detach both the rcbottom and rcraid drivers from the NVMe controllers, attach the stornvme driver to those PCI devices, then boot into a working system.
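
My best guess at the offline equivalent is the sketch below, and it really is just a guess: the C:\Windows path, the HKLM\OfflineSys mount name, and the idea that making stornvme boot-start (Start=0) while disabling rcbottom/rcraid (Start=4) is sufficient are all assumptions I haven't verified. It's written as Python around the built-in reg.exe only to spell the steps out; the same reg commands could be typed by hand at the WinRE command prompt:

import subprocess

OFFLINE_WINDOWS = r"C:\Windows"    # whatever drive letter WinRE assigns the cloned system volume
HIVE = OFFLINE_WINDOWS + r"\System32\config\SYSTEM"
MOUNT = r"HKLM\OfflineSys"         # temporary name to load the offline hive under

def reg(*args):
    # Call reg.exe; stop immediately if any step fails.
    subprocess.run(["reg", *args], check=True)

def svc_key(name):
    # Path to a driver/service entry inside the loaded offline hive
    # (offline hives use ControlSet001 rather than CurrentControlSet).
    return MOUNT + r"\ControlSet001\Services" + "\\" + name

# Load the cloned install's SYSTEM hive so its service entries can be edited.
reg("load", MOUNT, HIVE)
try:
    # Make stornvme a boot-start driver in the offline install.
    reg("add", svc_key("stornvme"), "/v", "Start", "/t", "REG_DWORD", "/d", "0", "/f")
    # Disable the AMD RAID drivers so they no longer claim the controllers.
    for svc in ("rcbottom", "rcraid"):
        reg("add", svc_key(svc), "/v", "Start", "/t", "REG_DWORD", "/d", "4", "/f")
finally:
    reg("unload", MOUNT)

No idea whether that alone is enough, or whether Windows needs more than a Start value change to treat stornvme as boot-critical for that volume.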


Any clues how I can make that happen? I am *not* a Windows expert. I can get the machine booted to the point of having a repair environment with access to the 2TB system volume, but I don't know what to actually do from there, or whether my guess above is even right.
