DevHeads.net

Server fails to boot

First some history. This is an Intel MB and processor some 6 years old,
initially running CentOS 6. It has 4 x 1TB SATA drives set up as two
mdraid RAID 1 mirrors. It has performed really well in a rural setting
with frequent power cuts: the UPS carries it through, shuts the server
down automatically after a few minutes, and restarts it when power is
restored.

The clients needed a Windoze server for a proprietary accounting package
they use, so I recently installed two SSD drives (500GB each), also as a
RAID 1 mirror, and installed CentOS 7 as the host with VirtualBox running
Windoze 10. The hard drives continue to hold their data files.

This appeared to work just fine until a few days ago. After a power cut
the server would not reboot.

It took a while to get in front of the machine and attach a monitor,
keyboard and mouse, only to find:

Warning: /dev/disk/by-id/md-uuid-xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx
does not exist

repeated three times - one for each of the /, /boot, and swap raid
member sets along with a

Warning: /dev/disk/by-uuid/xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx does not
exist

for /dev/md125, which is the actual RAID 1 / device.

The system is in a root shell of some sort as it has not made the
transition from initramfs to the mdraid root drive.

There are some other lines of info and a text file with hundreds of lines
of boot info, ending with the above warnings (as I recall).

I tried a reboot - same result. I rebooted and tried an earlier kernel -
same result. I then rebooted into the recovery kernel and all went well:
the system comes up, all RAID sets are up and in sync - no errors.

So, no apparent H/W issues, no mdraid issues apparently, but none of the
regular kernels will now boot.

Running blkid shows all the expected mdraid devices, with the UUIDs from
the error messages all in place as expected.
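
The comparison between the UUIDs the boot asks for and the arrays the
running system knows about can be scripted rather than done by eye. This
is only a minimal sketch, assuming the kernel boot line carries
rd.md.uuid= arguments (the dracut option that names the arrays to
assemble) and that mdadm is installed. The function names are made up
for illustration.

```shell
# Print each rd.md.uuid= value from a kernel command line, one per line.
cmdline_md_uuids() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^rd\.md\.uuid=//p'
}

# Print the UUID= value from each ARRAY line of "mdadm --detail --scan".
scan_md_uuids() {
    printf '%s\n' "$1" | sed -n 's/^ARRAY .*UUID=\([0-9a-fA-F:]*\).*/\1/p'
}

# Print every UUID the boot line asks for that mdadm does not report.
# $1 = kernel command line, $2 = "mdadm --detail --scan" output.
missing_md_uuids() {
    have=$(scan_md_uuids "$2")
    for u in $(cmdline_md_uuids "$1"); do
        printf '%s\n' "$have" | grep -qxF "$u" || printf '%s\n' "$u"
    done
}
```

From the working recovery boot, as root, it could be called as
missing_md_uuids "$(cat /proc/cmdline)" "$(mdadm --detail --scan)".
No output means every array named on the boot line is known to mdadm,
which would match what blkid already suggests.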

I did a yum reinstall of the most recent kernel, thinking that might
repair any /boot file system problems, particularly the initramfs, but it
made no difference: it will not boot and gives the exact same error
messages.

Thus I now have it running on the recovery kernel, with all the required
server functions being performed, albeit on an out of date kernel.

Google has one solved problem similar to mine, but the solution was to
change the BIOS from AHCI to IDE - that does not seem correct as I have
not changed the BIOS, although I have not checked it at this time.

Another solution talks about a race condition, with the md raid not being
ready when required during the boot process, and suggests adding a delay
to the kernel boot line in grub2, although no one indicated that this
actually worked.
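
For reference, the delay suggestion could be tried along these lines.
This is a hedged sketch only: the posts I found do not name the exact
argument, so rootdelay=10 (a standard kernel parameter) is my assumption,
and add_grub_arg is a made-up helper name. It appends one argument to the
GRUB_CMDLINE_LINUX line of a grub defaults file, and does nothing if the
argument is already there.

```shell
# Append an argument to GRUB_CMDLINE_LINUX in the given file.
# $1 = file in /etc/default/grub format, $2 = argument, e.g. rootdelay=10
add_grub_arg() {
    file=$1
    arg=$2
    # Nothing to do if the argument is already on the line.
    if grep -q "^GRUB_CMDLINE_LINUX=.*[\" ]$arg[\" ]" "$file"; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    # Insert the argument just before the closing quote of the value.
    sed "s|^\(GRUB_CMDLINE_LINUX=\".*\)\"|\1 $arg\"|" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    rc=$?
    rm -f "$tmp"
    return $rc
}
```

It is safest to run it on a copy of /etc/default/grub first and look at
the result. On a BIOS-booted CentOS 7 system the change would only take
effect after grub2-mkconfig -o /boot/grub2/grub.cfg is run as root.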

Another proposed solution is to mount the failed devices from a recovery
boot and rebuild the initramfs. Before I do this I would like to ask
those who know a little more about the boot process: what is going wrong?
I can believe the most recent initramfs being a problem, but all three
other kernels too?? Yet the recovery kernel works just fine.
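
If the rebuild route is taken, the commands involved would look roughly
like what the sketch below prints. It is a dry run: it only prints a
backup and a dracut rebuild command for each kernel it finds and runs
nothing itself. It assumes the usual CentOS 7 naming of vmlinuz-<version>
and initramfs-<version>.img, and it skips any entry with "rescue" in its
name so that the one boot option known to work is left alone. The
function name is made up for illustration.

```shell
# Print (do not run) backup and rebuild commands for each installed kernel.
# $1 = boot directory, defaults to /boot
plan_initramfs_rebuild() {
    bootdir=${1:-/boot}
    for k in "$bootdir"/vmlinuz-*; do
        [ -e "$k" ] || continue            # no kernels matched the glob
        ver=${k##*/vmlinuz-}               # strip path and prefix
        case $ver in
            *rescue*) continue ;;          # leave the working entry alone
        esac
        img="$bootdir/initramfs-$ver.img"
        printf 'cp -p %s %s.bak\n' "$img" "$img"
        printf 'dracut -f %s %s\n' "$img" "$ver"
    done
}
```

Running plan_initramfs_rebuild with no argument lists the commands for
/boot. Keeping the .bak copies means a rebuilt image that turns out worse
can be put back from the recovery boot.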

As the system is remote, I would like some understanding of what's up
before I do any changes - if a reboot occurs and fails, it will mean
another trip.

Oh, one other thing: it seems the UPS is not working correctly, so the
server may not have shut down cleanly. I am working on replacing the
batteries in the UPS.

TIA for your insight.