
// RESEND // 7.6: Software RAID1 fails the only meaningful test

(I've posted the output files to a website to see; link below.)

The point of RAID1 is to allow for continued uptime in a failure scenario.
When I assemble servers with RAID1, I set up two HDDs to mirror each other,
and test by booting from each drive individually to verify that it works. For
the OS partitions, I use simple partitions and ext4 so it's as simple as
possible.
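Roughly, the setup looks like this; the partition and device names below are examples, not my actual layout:

# Create a two-disk RAID1 array for the root filesystem (example devices)
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
mkfs.ext4 /dev/md0
# Install the bootloader on both disks so either one can boot on its own
grub2-install /dev/sda
grub2-install /dev/sdb
# Wait for the initial sync to finish before any failure testing
cat /proc/mdstat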

Using the CentOS 7.6 installer (v 1810) I cannot get this test to pass in any
way, with or without LVM. Using an older installer (v 1611) it works fine and I
am able to boot from either drive, but as soon as I do a yum update it fails.
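For reference, since the breakage shows up right after the update (which pulls in a new kernel and initramfs), one quick thing to check on a reproduced system is whether the rebuilt initramfs still contains the mdraid pieces. This is only a diagnostic sketch, not a confirmed cause:

# Inspect the new initramfs for mdadm/mdraid content
lsinitrd /boot/initramfs-$(uname -r).img | grep -i -e mdadm -e mdraid
# Rebuild it by hand if anything looks missing
dracut --force /boot/initramfs-$(uname -r).img $(uname -r)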

I think this may be related to, or the same as, the issue reported in "LVM
failure after CentOS 7.6 upgrade", since that also involves booting from a
degraded RAID1 array.

This is a terrible bug.

See below for some (hopefully) useful output while in recovery mode after a
failed boot.

### output of fdisk -l

Disk /dev/sda: 500.1 GB, 500107862016 bytes, 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x000c1fd0

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048   629409791   314703872   fd  Linux raid autodetect
/dev/sda2   *    629409792   839256063   104923136   fd  Linux raid autodetect
/dev/sda3        839256064   944179199    52461568   fd  Linux raid autodetect
/dev/sda4        944179200   976773119    16296960    5  Extended
/dev/sda5        944181248   975654911    15736832   fd  Linux raid autodetect

### output of cat /proc/mdstat
Personalities :
md126 : inactive sda5[0](S)
15727616 blocks super 1.2

md127 : inactive sda2[0](S)
104856576 blocks super 1.2

unused devices: <none>
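For reference, this is the state where both arrays come up inactive with their single member flagged as a spare (S). From the dracut emergency shell they can usually be forced active by hand, which at least confirms the data is intact (md names match the output above):

# Stop the half-assembled arrays and re-assemble them degraded
mdadm --stop /dev/md126 /dev/md127
mdadm --assemble --scan --run
cat /proc/mdstat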

### content of rdsosreport.txt
It's big; see
http://chico.benjamindsmith.com/rdsosreport.txt

Comments

Re: // RESEND // 7.6: Software RAID1 fails the only meaningful test

By Gordon Messmer at 12/05/2018 - 23:07

On 12/5/18 11:55 AM, Benjamin Smith wrote:

I used my test system to test RAID failures.  It has a two-disk RAID1
mirror.  I pulled one drive, waited for the kernel to acknowledge the
missing drive, and then rebooted.  The system started up normally with
just one disk (which was originally sdb).

The thing that stands out as odd, to me, is that your kernel command
line includes "root=UUID=1b0d6168-50f1-4ceb-b6ac-85e55206e2d4" but that
UUID doesn't appear anywhere in the blkid output.  It should, as far as
I know.

Your root filesystem is in a RAID1 device that includes sda2 as a
member.  Its UUID is listed as an rd.md.uuid option on the command line
so it should be assembled (incomplete) during boot.  But I think your
kernel command line should include
"root=UUID=f127cce4-82f6-fa86-6bc5-2c6b8e3f8e7a" and not
"root=UUID=1b0d6168-50f1-4ceb-b6ac-85e55206e2d4"

Re: // RESEND // 7.6: Software RAID1 fails the only meaningful test

By Benjamin Smith at 12/07/2018 - 19:14

My procedure was to shut down with the system "whole", meaning both drives
working. Then, with the machine powered off, I removed either disk and started
the server back up. Regardless of which drive I tried to boot from, the failure
was consistent.

Except that the UUID does exist when both drives are present. And under an
earlier CentOS version the system booted fine from either drive on its own,
using the same procedure, before the yum update. To clarify my procedure:

1) Set up system with 7.3, RAID1 bare partitions.
2) Wait for mdstat sync to finish.
3) Shut down system.
4) Remove either drive.
5) System boots fine.
6) Resync drives (see the sketch after this list).
7) yum update -y to 7.6.
8) Shut down system.
9) Remove either drive.
10) Boot fails.
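For completeness, the resync in step 6 is just the usual re-add of the pulled disk's members; the md names and partitions here follow the output above, but treat them as examples:

# Re-add the returned disk's partitions and watch the rebuild
mdadm --manage /dev/md127 --add /dev/sdb2
mdadm --manage /dev/md126 --add /dev/sdb5
watch cat /proc/mdstat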

Unfortunately, I have since used this same system for other tests and no longer
have these UUIDs available. However, I can reproduce the problem as soon as
there is something specific to test.

I'm going to see if using EXT4 as the file system has any effect.