
LVM failure after CentOS 7.6 upgrade -- possible corruption

I've started updating systems to CentOS 7.6, and so far I have one failure.

This system has two peculiarities which might have triggered the
problem. The first is that one of the software RAID arrays on this
system is degraded. While troubleshooting the problem, I saw similar
error messages mentioned in bug reports indicating that sGNU/Linux
ystems would not boot with degraded software RAID arrays. The other
peculiar aspect is that the system uses dm-cache.

Logs from some of the early failed boots are not available, but before I
completely fixed the problem, I was able to bring the system up once,
and captured logs which look substantially similar to the initial boot.
The content of /var/log/messages is here:
https://paste.fedoraproject.org/paste/n-E6X76FWIKzIvzPOw97uw

The output of lsblk (minus some VM logical volumes) is here:
https://paste.fedoraproject.org/paste/OizFvMeGn81vF52VEvUbyg

As best I can tell, the LVM tools were treating software RAID component
devices as PVs, and detecting a conflict between those and the assembled
RAID volume. When running "pvs" on the broken system, no RAID volumes
were listed, only component devices. At the moment, I don't know whether
the LVs activated by the initrd were backed by the component devices or
by the assembled RAID devices, so this bug might corrupt software RAID
arrays.

In order to correct the problem, I had to add a global_filter to
/etc/lvm/lvm.conf and rebuild the initrd (dracut -f):
global_filter = [ "r|vm_.*_data|", "a|sdd1|", "r|sd..|" ]

This filter excludes the LVs that contain VM data, accepts "/dev/sdd1",
which is the dm-cache device, and rejects all other partitions on
SCSI/SATA device nodes, since all of those are RAID component devices.
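For anyone unfamiliar with LVM filters: patterns are evaluated in order, the first match wins, and devices that match no pattern are accepted by default. The sketch below mimics that first-match logic for the filter above with a shell case statement; the device names other than /dev/sdd1 are examples I made up, not taken from the affected system.

```shell
#!/bin/sh
# Sketch of first-match evaluation for:
#   global_filter = [ "r|vm_.*_data|", "a|sdd1|", "r|sd..|" ]
# Patterns are tried in order; the first match decides; anything
# matching no pattern is accepted by default.
check() {
    case "$1" in
        *vm_*_data*) echo "$1: rejected (VM data LV)" ;;
        *sdd1*)      echo "$1: accepted (dm-cache device)" ;;
        /dev/sd??*)  echo "$1: rejected (RAID component partition)" ;;
        *)           echo "$1: accepted (no pattern matched)" ;;
    esac
}

check /dev/sdd1                  # accepted (dm-cache device)
check /dev/sda1                  # rejected (RAID component partition)
check /dev/md127                 # accepted (no pattern matched)
check /dev/mapper/vm_foo_data    # rejected (VM data LV)
```

Note that "sd.." only matches names with at least two characters after "sd", so partitions like sda1 are rejected while the filter still falls through for md devices.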

I'm still working on the details of the problem, but I wanted to share
what I know now in case anyone else might be affected.

After updating, look at the output of "pvs" if you use LVM on software RAID.
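One way to script that check: on a healthy LVM-on-RAID system, pvs should list assembled md (or dm) devices, not raw sdX partitions. The sample pvs output below is invented for illustration; substitute the real command output on your system.

```shell
#!/bin/sh
# Flag PVs that are raw SCSI/SATA partitions instead of md devices.
# This sample output is fabricated for illustration only; on a real
# system you would pipe `pvs` itself into the awk filter.
pvs_output='  PV         VG     Fmt  Attr PSize   PFree
  /dev/sda2  vg_sys lvm2 a--  931.00g    0
  /dev/sdb2  vg_sys lvm2 a--  931.00g    0'

echo "$pvs_output" | awk '$1 ~ /^\/dev\/sd/ { print "suspect PV: " $1 }'
```

Any "suspect PV" line means LVM is seeing a RAID component device directly, which is the symptom described above.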

Comments

Re: LVM failure after CentOS 7.6 upgrade -- possible corruption

By Simon Matter at 12/05/2018 - 12:56

What exactly did `pvs' show, and instead of what?

Regards,
Simon