DevHeads.net

LVM failure after CentOS 7.6 upgrade -- possible corruption

I've started updating systems to CentOS 7.6, and so far I have one failure.

This system has two peculiarities which might have triggered the
problem. The first is that one of the software RAID arrays on this
system is degraded. While troubleshooting the problem, I saw similar
error messages mentioned in bug reports indicating that systems
would not boot with degraded software RAID arrays. The other
peculiar aspect is that the system uses dm-cache.

Logs from some of the early failed boots are not available, but before I
completely fixed the problem, I was able to bring the system up once,
and captured logs which look substantially similar to the initial boot.
The content of /var/log/messages is here:
https://paste.fedoraproject.org/paste/n-E6X76FWIKzIvzPOw97uw

The output of lsblk (minus some VM logical volumes) is here:
https://paste.fedoraproject.org/paste/OizFvMeGn81vF52VEvUbyg

As best I can tell, the LVM tools were treating software RAID component
devices as PVs, and detecting a conflict between those and the assembled
RAID volume. When running "pvs" on the broken system, no RAID volumes
were listed, only component devices. At the moment, I don't know if the
LVs that were activated by the initrd were backed by component devices
or the RAID devices, so it's possible that this bug might corrupt
software RAID arrays.

In order to correct the problem, I had to add a global_filter to
/etc/lvm/lvm.conf and rebuild the initrd (dracut -f):
global_filter = [ "r|vm_.*_data|", "a|sdd1|", "r|sd..|" ]

This filter excludes the LVs that contain VM data, accepts "/dev/sdd1"
which is the dm-cache device, and rejects all other partitions on
SCSI(SATA) device nodes, as all of those are RAID component devices.
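
To see how a filter like this sorts devices, here is a minimal Python sketch of the evaluation order as described in the lvm.conf documentation: patterns are tried in order, the first one that matches a device path decides accept ("a") or reject ("r"), and a path that matches nothing is accepted. This is not LVM's matcher. It assumes Python's `re` module behaves closely enough to LVM's own regex engine for these simple patterns, it looks at only one path name per device, and the LV path `/dev/VolGroup/vm_web_data` is a made-up example.

```python
import re

# The filter from the post, written as a Python list.
GLOBAL_FILTER = ["r|vm_.*_data|", "a|sdd1|", "r|sd..|"]


def parse_filter(entries):
    """Split entries such as "a|sdd1|" into (action, compiled regex) pairs."""
    rules = []
    for entry in entries:
        action, delim = entry[0], entry[1]
        if action not in ("a", "r"):
            raise ValueError("filter entry must start with 'a' or 'r': " + entry)
        # The pattern sits between the first and the last delimiter character.
        pattern = entry[2:entry.rindex(delim)]
        rules.append((action, re.compile(pattern)))
    return rules


def accepted(path, rules):
    """First matching pattern wins; a path matching no pattern is accepted."""
    for action, regex in rules:
        if regex.search(path):  # unanchored match anywhere in the path
            return action == "a"
    return True


if __name__ == "__main__":
    rules = parse_filter(GLOBAL_FILTER)
    for path in ["/dev/sdd1", "/dev/sda3", "/dev/md127",
                 "/dev/VolGroup/vm_web_data"]:
        print(path, "accepted" if accepted(path, rules) else "rejected")
```

Order matters here: "a|sdd1|" has to come before "r|sd..|", otherwise the dm-cache device would be rejected along with the RAID component partitions.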

I'm still working on the details of the problem, but I wanted to share
what I know now in case anyone else might be affected.

After updating, look at the output of "pvs" if you use LVM on software RAID.
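
That check can be roughed out in Python as follows. This is only a sketch: it assumes the PV paths have already been collected (for example from `pvs --noheadings -o pv_name`) and that the text of /proc/mdstat is at hand. The sample mdstat content is hypothetical, including the second member of each array.

```python
def md_components(mdstat_text):
    """Return member device names (e.g. 'sda3') listed in /proc/mdstat text."""
    members = set()
    for line in mdstat_text.splitlines():
        fields = line.split()
        # Array lines look like: "md127 : active raid1 sda3[0] sdb3[1]"
        if len(fields) >= 3 and fields[0].startswith("md") and fields[1] == ":":
            for field in fields[2:]:
                if "[" in field:  # member entries carry a "[slot]" suffix
                    members.add(field.split("[", 1)[0])
    return members


def pvs_on_components(pv_paths, mdstat_text):
    """Return the PV paths whose device name is a member of some md array."""
    members = md_components(mdstat_text)
    return sorted(p for p in pv_paths if p.rsplit("/", 1)[-1] in members)


# Hypothetical /proc/mdstat content, for illustration only.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md127 : active raid1 sda3[0] sdb3[1]
      2929071104 blocks [2/2] [UU]
md2 : active raid1 sdc1[0] sde1[1]
      2929071104 blocks [2/2] [UU]
unused devices: <none>
"""

if __name__ == "__main__":
    suspicious = pvs_on_components(["/dev/sda3", "/dev/sdc1"], SAMPLE_MDSTAT)
    print("PVs that are md component devices:", suspicious)
```

An empty result means every PV is either an assembled array or a device outside any array. Anything listed deserves a closer look before trusting the volume group.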

Comments

Re: LVM failure after CentOS 7.6 upgrade -- possible co

By Gordon Messmer at 12/05/2018 - 23:34

On 12/5/18 9:27 AM, Gordon Messmer wrote:
I don't have much new information, other than that I tested booting a
similar system with an intentionally degraded RAID volume. That one
booted properly, so I don't think that was the problem. The dm-cache
device still needs further investigation, but I'm going to wait for all
RAID arrays to re-sync before further testing.

Going through the log again, I'm looking at this line:
Dec 4 21:17:34 ascension lvm: WARNING: Device mismatch detected for
VolGroup/lv_root which is accessing /dev/md127 instead of /dev/sda3.

Since it says "is accessing /dev/md127", I think the kernel activated
the LVs properly, in which case there shouldn't be any corruption risk.

I still can't work out why the lvm tools were scanning the component
volumes to begin with.
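
For anyone checking their own logs for the same warning, a small Python sketch can pull the three names out of it. The regular expression is an assumption built from the single log line quoted above, not from the LVM source, so other versions may word the message differently.

```python
import re

# Pattern modelled on the one warning quoted in this thread (an assumption).
MISMATCH_RE = re.compile(
    r"Device mismatch detected for (\S+) which is accessing (\S+) "
    r"instead of (\S+?)\.?$"
)


def parse_mismatch(message):
    """Return the LV, the device in use, and the device LVM expected, or None."""
    text = " ".join(message.split())  # undo any line wrapping in the log
    match = MISMATCH_RE.search(text)
    if match is None:
        return None
    lv, in_use, lvm_expected = match.groups()
    return {"lv": lv, "in_use": in_use, "lvm_expected": lvm_expected}


if __name__ == "__main__":
    line = ("Dec 4 21:17:34 ascension lvm: WARNING: Device mismatch detected for\n"
            "VolGroup/lv_root which is accessing /dev/md127 instead of /dev/sda3.")
    print(parse_mismatch(line))
```

If "in_use" is the assembled md device, as it is here, the active LV is sitting on the array rather than on a bare component.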

Re: LVM failure after CentOS 7.6 upgrade -- possible co

By Gordon Messmer at 12/06/2018 - 00:57

On 12/5/18 8:34 PM, Gordon Messmer wrote:
I think I've figured it out. The new lvm-tools package appears to have
broken support for detecting md metadata version 0.90. The update
should be stable for anyone who did not upgrade from earlier versions of
CentOS. (I don't actually know when Anaconda last used 0.90 metadata.)
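
For background, the two metadata versions keep their superblock in very different places, as documented in md(4): version 1.2 puts it 4 KiB from the start of the component device, while version 0.90 puts it near the end, at the device size rounded down to a multiple of 64 KiB, minus 64 KiB. The Python sketch below only computes those offsets. Whether this difference is what tripped up the updated tools is an assumption, not something established in this thread.

```python
SECTOR_BYTES = 512
MD_RESERVED_SECTORS = 128  # 64 KiB expressed in 512-byte sectors


def superblock_offset_090(device_sectors):
    """Byte offset of a version 0.90 superblock on a component device.

    Round the device size down to a 64 KiB multiple, then step back 64 KiB.
    """
    aligned = device_sectors & ~(MD_RESERVED_SECTORS - 1)
    return (aligned - MD_RESERVED_SECTORS) * SECTOR_BYTES


def superblock_offset_12():
    """Byte offset of a version 1.2 superblock: 4 KiB from the device start."""
    return 4096


if __name__ == "__main__":
    # Size of /dev/sda3 on the broken system, taken from the verbose log.
    sectors = 5858142208
    print("0.90 superblock at byte", superblock_offset_090(sectors))
    print("1.2 superblock at byte", superblock_offset_12())
```

A scanner that reads only the first few KiB of a device can find a 1.2 superblock but has no chance of seeing a 0.90 one.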

On a working system with "verbose = 6" in lvm.conf:

# mdadm --detail /dev/md/primary
/dev/md/primary:
Version : 1.2
...
# pvs
...
#device/dev-io.c:609 Opened /dev/sda3 RO O_DIRECT
#device/dev-io.c:359 /dev/sda3: size is 1951133696 sectors
#device/dev-io.c:658 Closed /dev/sda3
#filters/filter-mpath.c:196 /dev/sda3: Device is a partition, using primary device sda for mpath component detection
#device/dev-io.c:336 /dev/sda3: using cached size 1951133696 sectors
#device/dev-md.c:163 Found md magic number at offset 4096 of /dev/sda3.
#filters/filter-md.c:108 /dev/sda3: Skipping md component device
...

On the broken system:

# mdadm --detail /dev/md/primary
/dev/md/primary:
Version : 0.90
...
# pvs
...
#device/dev-io.c:609 Opened /dev/sda3 RO O_DIRECT
#device/dev-io.c:359 /dev/sda3: size is 5858142208 sectors
#device/dev-io.c:658 Closed /dev/sda3
#filters/filter-mpath.c:196 /dev/sda3: Device is a partition, using primary device sda for mpath component detection
#filters/filter-partitioned.c:30 filter partitioned deferred /dev/sda3
#filters/filter-md.c:99 filter md deferred /dev/sda3
#filters/filter-persistent.c:346 filter caching good /dev/sda3
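
The difference between the two traces comes down to whether a "Skipping md component device" line appears for the partition. A short Python sketch can scan captured verbose output for that message. It assumes the wording shown in the working system's log above, which may differ between LVM versions.

```python
import re

# Message text copied from the working system's trace in this thread.
SKIP_RE = re.compile(r"(/dev/\S+): Skipping md component device")


def skipped_md_components(verbose_output):
    """Return the devices that the md filter reported skipping."""
    return sorted(set(SKIP_RE.findall(verbose_output)))


if __name__ == "__main__":
    working = "#filters/filter-md.c:108 /dev/sda3: Skipping md component device"
    broken = ("#filters/filter-md.c:99 filter md deferred /dev/sda3\n"
              "#filters/filter-persistent.c:346 filter caching good /dev/sda3")
    print("working:", skipped_md_components(working))
    print("broken: ", skipped_md_components(broken))
```

On an affected system the component partitions are missing from the result even though they belong to an array.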

Re: LVM failure after CentOS 7.6 upgrade -- possible co

By Benjamin Smith at 12/05/2018 - 14:27

My gut feeling is that this is related to a RAID1 issue I'm seeing with 7.6.
See email thread "CentOS 7.6: Software RAID1 fails the only meaningful test"

I suggest trying to boot from an earlier kernel. Good luck!

Ben S

On Wednesday, December 5, 2018 9:27:22 AM PST Gordon Messmer wrote:

Re: LVM failure after CentOS 7.6 upgrade -- possible co

By Stephen John Smoogen at 12/05/2018 - 14:38

On Wed, 5 Dec 2018 at 14:27, Benjamin Smith < ... at benjamindsmith dot com> wrote:
You might want to point out which list you posted it on since it
doesn't seem to be this one.

Re: LVM failure after CentOS 7.6 upgrade -- possible co

By Benjamin Smith at 12/05/2018 - 15:14

On Wednesday, December 5, 2018 11:38:50 AM PST Stephen John Smoogen wrote:
Apparently there's a size limit for emails. I've resent with one of the output
files hosted on a personal webserver and it went through.

For your issue, in my own testing, simply choosing an older kernel didn't
resolve the "boot from a single drive degraded" issue.

My suggestions for resolution:

1) Boot up on another system.
2) Install the degraded disk.
3) Install another drive to match the degraded disk.
4) Set up mdadm to pair the two drives, wait for them to sync.
5) Install the newly fixed pair back into the original system and see if it
boots.

In any event, I'd consider dd'ing the entirety of the drive in its current
form to another disk so you can recover to this point in time. Maybe even
work only on the dd'd copy so you don't put the current state of the
original at risk.

Good luck.

Ben

Re: LVM failure after CentOS 7.6 upgrade -- possible co

By Simon Matter at 12/05/2018 - 12:56

What exactly did `pvs' show, and what should it have shown instead?

Regards,
Simon

Re: LVM failure after CentOS 7.6 upgrade -- possible co

By Gordon Messmer at 12/05/2018 - 18:11

On 12/5/18 9:56 AM, Simon Matter wrote:
It should print:

# pvs
  PV         VG          Fmt  Attr PSize   PFree
  /dev/md127 VolGroup    lvm2 a--   <2.73t <768.41g
  /dev/md2   BackupGroup lvm2 a--   <2.73t       0
  /dev/sdd1  VolGroup    lvm2 a--  <55.88g       0

and IIRC, it printed:

# pvs
  PV         VG          Fmt  Attr PSize   PFree
  /dev/sda3  VolGroup    lvm2 a--   <2.73t <768.41g
  /dev/sdc1  BackupGroup lvm2 a--   <2.73t       0