Server fails to boot

First, some history. This is an Intel motherboard and processor, some six
years old, initially running CentOS 6. It has 4 x 1TB SATA drives set up
as two mdraid 1 mirrors. It has performed really well in a rural setting
with frequent power cuts: the UPS deals with them, auto-shutting the
server down after a few minutes and auto-restarting it when power is
restored.

The clients needed a Windoze server for a proprietary accounting package
they use, so I recently installed two SSD drives (500GB each), also in a
raid 1 mirror, with CentOS 7 as the host and Windoze 10 running under
VirtualBox. The hard drives continue to hold their data files.

This appeared to work just fine until a few days ago. After a power cut
the server would not reboot.

It takes a while to get in front of the machine and add a monitor,
keyboard and mouse, only to find:

Warning: /dev/disk/by-id/md-uuid-xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx
does not exist

repeated three times - one for each of the /, /boot, and swap raid
member sets - along with a

Warning: /dev/disk/by-uuid/xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx does not
exist

for /dev/md125, which is the actual raid 1 / device.

The system is sitting in a root shell of some sort (presumably the
dracut emergency shell), as it has not made the transition from the
initramfs to the mdraid root drive.

There are some other lines of info and a text file with hundreds of
lines of boot info (presumably /run/initramfs/rdsosreport.txt, which
dracut writes out in its emergency shell), ending with the above info
(as I recall).

I tried a reboot - same result; rebooted and tried an earlier kernel -
same result; tried a reboot to the recovery kernel and all went well.
The system comes up, all raid sets are up and in sync - no errors.

So, no apparent H/W issues, no mdraid issues apparently, but none of the
regular kernels will now boot.

A blkid shows all the expected mdraid devices, with the UUIDs from the
error messages all in place as expected.

I did a yum reinstall of the most recent kernel, thinking that might
repair any /boot file system problems - particularly the initramfs -
but it made no difference: the system will not boot, with exactly the
same error messages.

Thus I now have it running on the recovery kernel, with all the
required server functions being performed, albeit on an out-of-date
kernel.

Google turned up one solved problem similar to mine, but the solution
was to change the BIOS from AHCI to IDE - that does not seem right, as
I have not changed the BIOS (although I have not checked it at this
time).

Another solution talks about a race condition - the md raid not being
ready when required during the boot process - and thus adding a delay
to the kernel boot line in grub2, although no one indicated this
actually worked.
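For reference, that workaround usually amounts to a couple of kernel-line
options rather than anything exotic; a sketch, assuming grub2 on CentOS 7
(the values are illustrative, not tested here):

```
# /etc/default/grub - append to the existing GRUB_CMDLINE_LINUX value:
#   rd.retry=30   dracut: seconds to keep waiting for devices to appear
#   rootdelay=10  kernel: fixed pause before the root device is needed
GRUB_CMDLINE_LINUX="... rd.retry=30 rootdelay=10"
```

followed by grub2-mkconfig -o /boot/grub2/grub.cfg to regenerate the
config. It only helps if a device-timing race actually exists, which the
identical failure across several kernels makes doubtful.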

Another proposed solution is to mount the failed devices from a
recovery boot and rebuild the initramfs. Before I do this I would like
to ask those who know a little more about the boot process: what is
going wrong? I can believe the most recent initramfs being a problem,
but all three other kernels too?? Yet the recovery kernel works just
fine.
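For the record, the rebuild route is usually something like the
following, run from the system while booted on the working recovery
kernel; a sketch, not tested on this box - KVER is a placeholder for the
non-booting kernel's version string (as listed by rpm -q kernel):

```shell
KVER=<kernel-version>

# Back up the current image so the change is reversible:
cp /boot/initramfs-${KVER}.img /boot/initramfs-${KVER}.img.bak

# Refresh mdadm.conf from the arrays as currently assembled, so the
# rebuilt initramfs carries the right UUIDs (merge by hand instead if
# /etc/mdadm.conf already holds curated entries):
mdadm --detail --scan > /etc/mdadm.conf

# Regenerate the initramfs (-f overwrites the existing image):
dracut -f /boot/initramfs-${KVER}.img ${KVER}

# Sanity check: the embedded mdadm.conf should show the same UUIDs
# that blkid reports:
lsinitrd /boot/initramfs-${KVER}.img -f etc/mdadm.conf
```

If the embedded mdadm.conf (or the lack of one) disagrees with blkid,
that would explain the same failure on every regular kernel while the
recovery image, built at install time, still works.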

As the system is remote, I would like some understanding of what's
going on before I make any changes - if a reboot occurs and fails, it
will mean another trip.

Oh, one other thing: it seems the UPS is not working correctly, so the
server may not have shut down cleanly. I am working to replace the UPS
batteries.

TIA for your insight.


Re: Server fails to boot

By Gordon Messmer at 07/13/2019 - 18:15

On 7/8/19 4:28 AM, Rob Kampen wrote:


It sounds like your kernels aren't assembling the RAID device on boot,
which *might* be related to the above bug if one of the devices is
broken.  It's hard to tell from your description.  You mentioned that
the rescue kernel boots, but I wonder if the array is degraded at that
point.

Otherwise, you might remove "rhgb" and "quiet" from the kernel boot
parameters and see if there's any useful information printed to the
console while booting a recent kernel.

Re: Server fails to boot

By Rob Kampen at 07/14/2019 - 04:53

On 14/07/19 10:15 AM, Gordon Messmer wrote:
I have no idea why the rescue kernel boots just fine, although it does
not establish the above links either; rather, it sets up the links
/dev/md/<hostname>:{boot,root,swap} pointing to the assembled /dev/md125

My particular problem is: how do I get it to boot the later kernels?
What should be my repair process?

I have tried a boot with the rhgb and quiet removed and got no
additional information.
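If another attended reboot is possible, dracut's own debug switches give
more than removing rhgb/quiet does; a sketch of temporary kernel-line
additions (edit the entry at the grub menu with 'e'), all documented in
dracut.cmdline(7):

```
rd.debug              # set -x tracing for dracut's initqueue scripts
rd.shell              # drop to a shell on failure instead of hanging
rd.break=pre-mount    # stop just before mounting the root filesystem
```

From the resulting shell, /run/initramfs/rdsosreport.txt collects the
full log, and cat /proc/mdstat there shows whether the arrays ever
assembled inside the initramfs.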

BTW, once booted, cat /proc/mdstat gives:

Personalities : [raid1]
md57 : active raid1 sdb7[1] sda7[0]
      554533696 blocks super 1.2 [2/2] [UU]

md99 : active raid1 sdd[1] sdc[0]
      976631360 blocks super 1.2 [2/2] [UU]

md121 : active raid1 sdb2[1] sda2[0]
      153500992 blocks [2/2] [UU]

md120 : active raid1 sda3[0] sdb3[1]
      263907712 blocks [2/2] [UU]

md125 : active raid1 sde1[0] sdf1[1]
      478813184 blocks super 1.2 [2/2] [UU]
      bitmap: 3/4 pages [12KB], 65536KB chunk

md126 : active raid1 sde2[0] sdf2[1]
      1046528 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md127 : active (auto-read-only) raid1 sde3[0] sdf3[1]
      8382464 blocks super 1.2 [2/2] [UU]

unused devices: <none>

no degraded raid devices .....
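That can also be checked mechanically: a degraded array shows up in
/proc/mdstat as an underscore in the status field ([U_] instead of
[UU]). A minimal sketch, run here against a pasted sample rather than
the live file, with the md121 line altered to fake a failed member:

```shell
# Flag degraded md arrays: a missing member shows as "_" in the
# status field. Sample input is inlined; on the live box substitute
#   mdstat=$(cat /proc/mdstat)
mdstat='md125 : active raid1 sde1[0] sdf1[1]
      478813184 blocks super 1.2 [2/2] [UU]
md121 : active raid1 sdb2[1] sda2[0]
      153500992 blocks [2/1] [U_]'

# grep finds the degraded status line, -B1 pulls in the device line
# above it, and awk keeps just the md device name.
degraded=$(printf '%s\n' "$mdstat" \
  | grep -B1 '\[[U_]*_[U_]*\]' \
  | awk '/^md/ {print $1}')
echo "degraded: ${degraded:-none}"   # prints: degraded: md121
```

Against the real output above it prints "degraded: none", matching the
eyeball check.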