DevHeads.net

Upgrade to F30 gone wrong

Hello,

I upgraded from a fully updated F29 to F30 today. Upon reboot I see

grub>

This seems like a pretty major bug in some component somewhere. I don't think
this is a recoverable bug for most people, meaning they would have to
reinstall and possibly lose everything.

What component is at fault? I'd like to report this bug and have it fixed asap
before other people experience a non-working system after upgrade. No idea
how to recover the system at this point. It does have valuable work related
code and docs on it.

-Steve

Comments

Re: Upgrade to F30 gone wrong

By Tom Hughes at 05/04/2019 - 10:29

https://fedoraproject.org/wiki/Common_F30_bugs#blscfg-fail

Tom

Re: Upgrade to F30 gone wrong

By Steve Grubb at 05/04/2019 - 10:54

On Saturday, May 4, 2019 10:29:18 AM EDT Tom Hughes wrote:
Thanks. It's nice that there is a writeup. But are non-technical people
expected to do this fix? (This is a rhetorical question to all of fedora-
devel.) Could dnf system-upgrade detect that it's running on a system that
will fail? Could it warn people beforehand or even apply the grub upgrade
first?

I have to think the issue could be detected before upgrading.

# Find the device backing /boot; fall back to / when /boot is not a separate mount.
d=$(mount | awk '$3 == "/boot" { print $1 }')
if [ -z "$d" ] ; then
    d=$(mount | awk '$3 == "/" { print $1 }')
fi
if [ -z "$d" ] ; then
    echo "Upgrading your grub2 install to a current version cannot be done." \
         "You should not proceed with upgrading your system."
    exit 1
fi
# Caveat: $d is a partition (e.g. /dev/sda1), while grub2-install expects the whole disk (e.g. /dev/sda).
if ! grub2-install "$d" ; then
    echo "grub2-install encountered an error. You should not proceed with upgrading your system."
    exit 1
fi

...

Anyways...following the instruction on that page...when
configfile /grub2/grub.cfg.rpmsave
is run, it immediately shows me the menu and boot commences as it did. It
then says: "execute the grub2-install /dev/X command (where X is the boot
device, i.e sda) to update the GRUB core and the module"

This is what happens:

# mount | grep boot
/dev/sda1 on /boot type ext4 (rw,relatime,seclabel)
[root ~]# grub2-install /dev/sda1
Installing for i386-pc platform.
grub2-install: warning: File system `ext2' doesn't support embedding.
grub2-install: warning: Embedding is not possible. GRUB can only be
installed in this setup by using blocklists. However, blocklists are
UNRELIABLE and their use is discouraged..
grub2-install: error: will not proceed with blocklists.

and rebooting the system now has:

grub rescue>

Which seems like a big step back from where I was.

-Steve

Re: Upgrade to F30 gone wrong

By Chris Murphy at 05/04/2019 - 13:43

On Sat, May 4, 2019 at 8:55 AM Steve Grubb < ... at redhat dot com> wrote:
Currently, there is no way to determine who owns the bootloader on a
BIOS computer, and therefore to avoid stepping on a bootloader that we
don't own, we never update it. But that means, we never update a
bootloader we do own as well. I've never been happy with this, but I
don't know if it's an upstream problem or a distribution problem.

I think the problem stems from a stale GRUB core.img (the part that's
embedded in either the MBR gap, or GPT's BIOSboot partition), but I'm
not 100% confident that's true. It could be a stale normal.mod, which
is found in /boot/grub2/i386-pc/ and, like all other modules, is
replaced by a 'grub2-install' invocation. But if it were true that
the problem stems only from modules, it's possible to copy them from
/usr to /boot, and script that. But unfortunately I think the core.img
probably needs updating, and that core.img is distribution specific.
So if we just replace it, we could be breaking the user's other
installed distributions.

*shrug*

If you take the point of view that multi-boot Linux (2+ linux distros,
as contrasted to dual-boot Fedora+macOS or Fedora+Windows) is
inherently technical, it's possible we could have a policy change
whereby we update the bootloader on Fedora upgrades. That would help
the non-technical, at the expense of certainly breaking the multi-boot
case.

It should be /dev/sda not /dev/sda1 - the advice note says /dev/sdX
where X is the boot device, not the boot device partition. So you did
it wrong. But you raise a valid point that this is obscure and
esoteric knowledge, and we're asking non-technical users to know such
things.

Maybe the wiki entry can be updated to make this clearer.
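
The disk-versus-partition distinction can be handled mechanically. The following is a minimal sketch, not anything Fedora ships: parent_disk is a hypothetical helper that assumes the usual Linux device naming (sda1, nvme0n1p2, mmcblk0p1) and works on the name alone, without inspecting the hardware.

```shell
# Hypothetical helper: map a partition device name to its parent disk.
# Assumption: standard Linux naming (sdXN, nvmeXnYpN, mmcblkXpN).
parent_disk() {
    case "$1" in
        /dev/nvme*|/dev/mmcblk*)
            # These disks end in a digit, so partitions use a "p<N>" suffix.
            # A whole-disk name has no such suffix and passes through unchanged.
            printf '%s\n' "$1" | sed 's/p[0-9][0-9]*$//' ;;
        *)
            # sda1 -> sda; a bare disk name such as sda passes through unchanged.
            printf '%s\n' "$1" | sed 's/[0-9][0-9]*$//' ;;
    esac
}
```

For example, parent_disk /dev/sda1 prints /dev/sda, which is the form of argument grub2-install expects. On a running system, lsblk -no PKNAME /dev/sda1 reports the parent device name from the kernel's view rather than guessing from the name.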

Re: Upgrade to F30 gone wrong

By Steve Grubb at 05/04/2019 - 14:44

On Saturday, May 4, 2019 1:43:56 PM EDT Chris Murphy wrote:
I would imagine it is trivial to examine the files to see that there is only
one system installed and handle that case. Multi-boot is more complicated.

grub rescue> set prefix=(hd0,1)/grub2
grub rescue> set root=(hd0,1)
grub rescue> insmod normal
error: symbol 'grub_file_progress_hook' not found.

Hmm...OK...let's try it from /usr as you said.

grub rescue> set prefix=(hd0,5)/usr/lib/grub
grub rescue> insmod normal
error: symbol 'grub_file_progress_hook' not found.

Hmm. On my main (working) system:

cd /usr/lib/grub/i386-pc/
# grep grub_file_progress_hook *
Binary file hfspluscomp.mod matches
Binary file kernel.exec matches
Binary file kernel.img matches
Binary file net.mod matches
Binary file ntfscomp.mod matches
Binary file ntfs.mod matches
Binary file progress.mod matches

# readelf -s hfspluscomp.mod | grep progress
11: 00000000 0 NOTYPE GLOBAL DEFAULT UND grub_file_progress_hook
# readelf -s net.mod | grep progress
17: 00000000 0 NOTYPE GLOBAL DEFAULT UND grub_file_progress_hook
# readelf -s ntfscomp.mod | grep progress
11: 00000000 0 NOTYPE GLOBAL DEFAULT UND grub_file_progress_hook
# readelf -s ntfs.mod | grep progress
11: 00000000 0 NOTYPE GLOBAL DEFAULT UND grub_file_progress_hook
# readelf -s progress.mod | grep progress
12: 00000000 0 NOTYPE GLOBAL DEFAULT UND grub_file_progress_hook
# readelf -s kernel.exec | grep progress
218: 0001cbe8 4 OBJECT GLOBAL DEFAULT 5 grub_file_progress_hook
# readelf -s kernel.img | grep progress
# readelf -s net.mod | grep progress
17: 00000000 0 NOTYPE GLOBAL DEFAULT UND grub_file_progress_hook

It appears to be defined in kernel.exec

Wonder if there is a stale kernel.exec and if so, how to get it loaded so
symbols are defined.

-Steve
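
The manual readelf checks above can be scripted. This is a rough sketch: symbol_status is a hypothetical helper that reads `readelf -s` output on stdin and relies on the symbol-table layout shown above, where the Ndx column reads UND for a symbol that is only referenced.

```shell
# Hypothetical helper: classify SYMBOL from `readelf -s` output on stdin.
# Prints "defined", "undefined" (referenced only), or "absent".
symbol_status() {
    awk -v sym="$1" '
        # The symbol name is the last field; the section index (Ndx) precedes it.
        $NF == sym { if ($(NF-1) == "UND") ref = 1; else def = 1 }
        END {
            if (def) print "defined"
            else if (ref) print "undefined"
            else print "absent"
        }
    '
}

# Possible usage over a module directory (paths as in the post above):
#   for f in /usr/lib/grub/i386-pc/*.mod /usr/lib/grub/i386-pc/kernel.exec; do
#       printf '%s: ' "$f"
#       readelf -s "$f" 2>/dev/null | symbol_status grub_file_progress_hook
#   done
```

Running the same loop against /boot/grub2/i386-pc/ and comparing the two reports would be one way to spot modules that expect a symbol the installed core does not provide.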

Re: Upgrade to F30 gone wrong

By Chris Murphy at 05/04/2019 - 15:58

On Sat, May 4, 2019 at 12:44 PM Steve Grubb < ... at redhat dot com> wrote:
It's trivial if you're willing to make assumptions rather than test
for conditions. If you start testing for conditions you quickly find a
ton of exceptions. I can easily create a multiboot system that
partitioning wise looks like a default/automatic Fedora only system.
So you need a test that doesn't only check number and type of
partitions. It'd have to be deeper than that. If you open it up to
custom configurations then it's even more tests you have to do to know
for sure it's a single OS system. I think that's super complicated and
not at all trivial, it'll hit all kinds of edge cases.

This bug itself was expected to be an edge case, that not many users
would be affected, in that not many would have a stale Fedora 20 or
older bootloader. Surely 'grub2-install' would have been manually run,
or the user has done a recent clean install since Fedora 20, right?!
Possibly the assumption about upgrades is wrong because they have been
so reliable that users trust them. In other words, the problem is the
result of our own success!

And if that's true, revisiting the 'don't update the bootloader on
BIOS' policy perhaps is in order.

There are in fact real security bugs that crop up in the bootloader from
time to time, and no one on BIOS gets the benefit of fixes for those
vulnerabilities if they do not run grub2-install manually. It's
insufficient to just update the RPMs on BIOS systems.

Anyway, I for one would strongly support changing this with the feature process.

Re: Upgrade to F30 gone wrong

By Sam Varshavchik at 05/04/2019 - 16:50

Chris Murphy writes:

One of my bricks that will soon get Fedora 30 was originally installed with
Fedora Core 4.

Obviously a minority; but you'll be surprised to learn how many systems
there are which have been running Fedora for a very long time. Fedora 20 is
what, about five years old? There are many, many systems which are at least
five years old. People don't really swap hardware every 2-3 years, any more.

Re: Upgrade to F30 gone wrong

By Marius Schwarz at 05/06/2019 - 13:38

On 04.05.19 at 22:50, Sam Varshavchik wrote:
You can switch hardware without needing to reinstall, as next to no one
compiles kernels for a specific system anymore. They simply boot
anything now.

Best regards,
Marius

Re: Upgrade to F30 gone wrong

By Roberto Ragusa at 05/05/2019 - 07:47

On 5/4/19 10:50 PM, Sam Varshavchik wrote:
My contribution to the surprise:

[root@localhost ~]# grep fedora-release /root/install.log
Installing fedora-release-3-8.i386.
[root@localhost ~]# uname -a
Linux localhost 5.0.4-200.fc29.x86_64 #1 SMP Mon Mar 25 02:27:33 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

This system was upgraded from Fedora 3 up to 29.
Also note it started as i386, but at Fedora 16 got transformed into x86_64, a kind of (manual) upgrade never
officially considered possible.

I don't understand the consideration about old or new hardware.
Why would I have to reinstall the system when getting new hardware?
My Fedora system has jumped across 4 machines and who knows how many HDD/SSD replacements.

Regards.

Re: Upgrade to F30 gone wrong

By Nico Kadel-Garcia at 05/05/2019 - 13:59

On Sun, May 5, 2019 at 7:49 AM Roberto Ragusa < ... at robertoragusa dot it> wrote:
You're kind of begging for pain, at this point. There have been enough
subtle, fundamental, and functionally incompatible updates to
filesystems such as ext4 and xfs that a surprise at upgrade should not
shock you too much.

*Ouch*. OK, now you're just hurting yourself. Definitely time to back
up your old system and do a fresh install.

It Depends(tm). One issue I've encountered is with disk controllers.
Anaconda is pretty good about detecting disk controllers at boot time
and loading up initrd appropriately on the new OS. Deducing the disk
controller, the order of discovery of such controllers, and the tuning
necessary to upgrade the OS reliably is an adventure. It also used to
be worse when the file systems were referred to in /etc/fstab by their
partition numbers, such as "/dev/sda1" which became "/dev/sde1" on the
old Promise RAID controllers depending on which support patch they had
applied. Ye *ghods*, I hated those controllers.....

Re: Upgrade to F30 gone wrong

By Roberto Ragusa at 05/06/2019 - 03:44

On 5/5/19 7:59 PM, Nico Kadel-Garcia wrote:
You are supposing that the filesystem has remained the same.
Instead, the content has been copied over a couple of times, so it is now a
fresh ext4 (on lvm, on dmcrypt, with SSD discard enabled,...).

Turning an i386 to x86_64 was easier than expected.
First you switch to 64 bit kernel (64 bit kernel with 32 bit userspace is
a good but not widely known idea); then you add the x86_64 libs (that can
often live in parallel to i686 ones); then you switch the real applications
to x86_64 and finally you can remove i686 libs you don't want anymore (possibly
every one of them). All done with yum and rpm on a live system.
This has been my daily work system for about 15 years, no reason to scratch it.
It probably contains no traces of the initial setup (apart from /root/install.log);
it evolved without any discontinuity.
Last time I ran anaconda for upgrading was in F14 (according to /root/upgrade.log);
after that it has always been yum/dnf.

Regards.

Re: Upgrade to F30 gone wrong

By Chris Murphy at 05/04/2019 - 20:19

On Sat, May 4, 2019 at 2:50 PM Sam Varshavchik <mrsam@courier-mta.com> wrote:
I'm not suggesting they do, but rather that they occasionally do a
clean install or manually run grub-install. Reinstalling GRUB
(pre-boot binaries) by invoking grub-install is the upstream
recommendation, however esoteric that sounds. There isn't an automatic
update mechanism. And I'm pretty sure it's the same for U-boot and extlinux.

Only on UEFI, and as a consequence of grubx64.efi being built by
Fedora's build system and included in the GRUB RPM, is the bootloader
updated.

And as I think about it, bootloader updates don't happen on rpm-ostree
(e.g. Silverblue) regardless of the type of firmware.

There is no release criteria that covers this, which is why this bug
was not a blocking bug. However, the Workstation Working Group's
Product Requirements Document says in part: "the upgrade process
should give a result that is the same as an original install of Fedora
Workstation" and that is clearly not the case as it pertains to the
bootloader.

The Server edition PRD reads more permissive saying "existing servers
and installed roles should be upgradable to new releases with minimal
involvement"

Re: Upgrade to F30 gone wrong

By Vitaly Zaitsev ... at 05/05/2019 - 11:39

It would be nice to have a robust upgradeable bootloader setup. I'm pretty sure that ranks before having a pretty flicker-free boot to Fedora users. Pretty boot has been a workstation priority for how many releases now?

Barring that, just having a reinstall bootloader option in rescue mode would go a long way to make this all less of a PITA. Fedora has been doing incompatible bootloader changes every few years for as long as I remember.

Re: Upgrade to F30 gone wrong

By Chris Murphy at 05/05/2019 - 17:59

On Sun, May 5, 2019 at 9:40 AM Nicolas Mailhot

GRUB 2 does not support bootloader reinstallation from the pre-boot
environment. The GRUB rescue code is in the stage 2 bootloader. If
you're at GRUB rescue, it means it can't read or can't find normal.mod
which means it can't read anything else it would need to construct a
stage 2 bootloader from scratch and install it.

So this ship sailed with GRUB Legacy a very long time ago.

Re: Upgrade to F30 gone wrong

By Steve Grubb at 05/05/2019 - 12:33

On Sunday, May 5, 2019 11:39:50 AM EDT Nicolas Mailhot via devel wrote:
Rescue mode? I couldn't find it. All references I could find to a rescue mode
date back to 2013 or later. I would have liked a rescue mode because it makes it
easy to just chroot into your actual system from the livecd. Seems like we've
lost something nice if it's really been dropped.

I wound up solving the problem by using the workstation livecd, mounting the
/boot partition, and issuing the grub2-install command. This worked, but the
instructions are completely undocumented (searching from Google) as best as I
can tell. There are docs for grub but not grub2. There are docs for Ubuntu
which has a different boot disk layout but not for Fedora. Following the
Ubuntu docs will not work. And speaking of which, seems like they have a disk
boot repair iso. No idea how robust it is, but it is something.

I am thinking that something could have been put into dnf system-upgrade
where it could have warned about the problem or did a workaround. Actually,
this could still be put into system-upgrade because not everyone switches
first week or two because they are waiting to see what problems people hit
before doing it themselves.

Best Regards,
-Steve
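
The live-media fix described above probably looks something like the sketch below. The post does not give the exact commands, so this is a reconstruction under stated assumptions: a BIOS system, /boot on its own partition, and placeholder device names. print_bios_grub_recovery is a hypothetical helper that only prints the commands so they can be reviewed before anything is run as root.

```shell
# Hypothetical helper: print (do not run) the recovery commands.
# $1 = partition holding /boot (assumed separate), $2 = whole boot disk.
print_bios_grub_recovery() {
    boot_part=$1
    disk=$2
    cat <<EOF
mount $boot_part /mnt
grub2-install --boot-directory=/mnt $disk
umount /mnt
EOF
}

# Example (device names are placeholders):
#   print_bios_grub_recovery /dev/sda1 /dev/sda
```

The --boot-directory option points grub2-install at the mounted /boot of the installed system rather than the live environment's own, and the target is the disk, not the partition.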

Re: Upgrade to F30 gone wrong

By Vitaly Zaitsev ... at 05/06/2019 - 03:46

On Sunday, 05 May 2019 at 12:33 -0400, Steve Grubb wrote:
You have a rescue mode on the generic (or network) install iso. That's
the Swiss Army knife for rescuing Fedora systems that do not boot. You need
to work from the iso because if your bootloader is DOA… you can't do
anything from the installed system.

Unfortunately, while the rescue mode will usually find the installed
system, it does not know how to install a bootloader with current
Fedora defaults. You always need to dig up the current Fedora magic
from the internet. Which is a pity since the main reason people rescue
from the iso are boot problems

Regards,

Re: Upgrade to F30 gone wrong

By Panu Matilainen at 05/06/2019 - 06:50

On 5/6/19 10:46 AM, Nicolas Mailhot via devel wrote:
Which is all good, if it works. The Fedora 30 netinstall iso tracebacked
and failed to find a single Linux filesystem out of ~10 on my system
which had this issue. Fortunately F29 rescue image worked.

Should've filed a bug, but when your system is in a fast reboot loop
prior to even getting to grub prompt, bug reports are not necessarily
the first priority. That system started life as Fedora 15 or so, and quite
possibly the bootloader was never updated since because ... well, there hasn't
been any reason to do so, AFAIR.

This was by far the worst outcome of a distro-upgrade that I can
remember, and I've done quite a few. Somebody said something about being
victims of our own success, and there's a point in there:
distro-upgrades have been so uneventful for such a long time that I
admit, it never even occurred to me to look at the common bugs page
before s*** hit the fan.

Yup. And when you add the confusion of UEFI and BIOS bootloader
differences (for one, do 'dnf reinstall ..', for the other, do
'grub2-install', wtf? and documented in a dark kernel-related corner
somewhere) and the less technical people are out the door already, and
even the more technical are halfway there.

Multiboot is always going to be a problem, but especially for simpler
setups, /root/anaconda-ks.cfg provides more than just an educated guess
as to where the bootloader might be lurking. The rescue mode could offer
something based on that.
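
As a rough illustration of that idea, the sketch below pulls the boot drive out of a kickstart file. ks_boot_drive is a hypothetical helper, and it assumes the file contains a bootloader line with a --boot-drive= option, which not every kickstart records; a system upgraded for many years may have no such file at all.

```shell
# Hypothetical helper: print the --boot-drive value from a kickstart file.
# Prints nothing when there is no bootloader line carrying that option.
ks_boot_drive() {
    sed -n 's/^bootloader .*--boot-drive=\([^ ]*\).*/\1/p' "$1" | head -n 1
}

# Example:
#   ks_boot_drive /root/anaconda-ks.cfg
```

An empty result would have to be treated as "unknown", at which point a tool should fall back to asking the user rather than guessing.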

Re: Upgrade to F30 gone wrong

By Chris Murphy at 05/05/2019 - 18:14

On Sun, May 5, 2019 at 10:33 AM Steve Grubb < ... at redhat dot com> wrote:
Impractical. That is a code change, it would need buy off from dnf
folks, it would need translations (probably), it would need a freeze
exception, it does not address the much more common upgrade path of
GNOME Software (graphical upgrades). Keep in mind this bug was
actually found rather late in the testing process. I only stumbled
into it because someone said on #fedora-qa that ppc64le installs
weren't booting in Open QA. And then we got suspicious that it might
be an issue that could happen on x86/x86_64 if the GRUB core.img
(stage 2) was too old, and then Javier narrowed it down.

So there's a case where alternate architecture helped us catch a bug.
There really aren't many users with very old installation that are
doing beta testing, a lot of the tests and release requirements are in
fact predicated on clean installs of the most recent two Fedora
versions.

Right and that's the same with beta testing, which is how bugs like
this can sometimes not even get found until after release. A lot of
tests are done on pristine systems that are throw away. It's entirely
understandable few people want to test Fedora pre-release on their
rock solid 5+ year old Fedora system, but we actually stumbled on this
in some sense by luck of alternate arch acting like a canary.

Re: Upgrade to F30 gone wrong

By Vitaly Zaitsev ... at 05/06/2019 - 03:52

On Sunday, 05 May 2019 at 16:14 -0600, Chris Murphy wrote:
That's not true, many boot problems are found quite early in the
process by rawhide users, but rawhide users feedback is not taken into
account by installer folks because they don't look at boot problems
before quite late in the cycle, when rawhide users have already moved
on manually, and the default solution is always to reinstall from
scratch.

So problems are found, just not fixed

Re: Upgrade to F30 gone wrong

By Chris Murphy at 05/06/2019 - 12:29

On Mon, May 6, 2019 at 1:52 AM Nicolas Mailhot
<nicolas. ... at laposte dot net> wrote:
This is out of scope because the context of the conversation is
upgrades. You're talking about the installer which means clean
installs.

Re: Upgrade to F30 gone wrong

By Vitaly Zaitsev ... at 05/07/2019 - 02:40

On May 6, 2019 4:29:22 PM UTC, Chris Murphy < ... at colorremedies dot com> wrote:
I'm talking about rescuing systems that do not boot anymore, and that means the install media. You can't rescue a system with broken boot from within this system.

Re: Upgrade to F30 gone wrong

By Chris Murphy at 05/07/2019 - 13:15

On Tue, May 7, 2019 at 12:40 AM Nicolas Mailhot
<nicolas. ... at laposte dot net> wrote:
I have no idea what you're talking about. The context of this thread
is a bug that happens during upgrades, and you do not need
rescue/install media to fix it. The Common Bugs page lists the steps you
need to successfully boot and fix the problem within this system.

Re: Upgrade to F30 gone wrong

By Panu Matilainen at 05/08/2019 - 06:49

On 5/7/19 8:15 PM, Chris Murphy wrote:
Nicolas' point is that the rescue boot entry only works in a limited
number of scenarios.

And in some cases the rescue media IS needed to fix this particular
issue as well. For example my box never got to the grub prompt at all,
it was busy reboot-looping, probably due to neglecting to reinstall grub
in almost a decade. Others have pointed out other "completely broken"
symptoms.

Really, if you don't even get a menu from grub, how many people are
going to be able to work it from there? Even if there was another
computer comfortably nearby for googling, I wouldn't bother even trying.
Much easier to grab that rescue image, which thank goodness is there still.

Re: Upgrade to F30 gone wrong

By Chris Murphy at 05/08/2019 - 17:55

I'm not sure which of three 'rescue' options is being discussed.

grub rescue appears when normal.mod can't be loaded, and means the menu
can't be displayed. This thread's bug results in this rescue.

Since circa Fedora 19, there is a GRUB menu entry 'Rescue' option which
boots a "no host only" initramfs. If you see the option, you don't have the
bug under discussion.

Install media, dvd and netinst, have a bootloader submenu option for
Troubleshooting, that leads to an option Rescue a Fedora system, which
actually launches anaconda with app option 'inst.rescue' that helps find,
assemble and mount a Fedora installation or even just provide a shell with
all the CLI programs normally available for installs including fsck, mkfs,
various FS debug tools, etc.

I don't think that's this bug. It suggests the core.img embedded in either
the MBR gap or BIOSBoot partition has a serious bug or that the computer
firmware has a serious bug. Somehow it can neither load blscfg.mod (this
bug) nor run its own rescue mode (not this bug).

That would be super tedious to track down. And interestingly, would
potentially be masked by always updating the bootloader between major
upgrades (makes bootloader a moving target). But if no one ever tracks it
down, which includes the even more tedious requirement of locally building
GRUB from source and reproducing the bug (because Fedora's GRUB is so
substantially modified from upstream that upstream will always reject
Fedora bugs), it won't get fixed.

I admit, we've already failed if we're at grub rescue, let alone in a
reboot cycle. But the latter sounds hardware (firmware) specific. Difficult
no matter what.

I think 1 in 10. Just a guess. That's probably generous.

I don't know what rescue image is.

The bug this thread is about, if you hit it, you do not get a GRUB menu.

To me, rescue image means install media with Rescue boot option i.e.
anaconda inst.rescue.

Re: Upgrade to F30 gone wrong

By Panu Matilainen at 05/09/2019 - 02:59

On 5/9/19 12:55 AM, Chris Murphy wrote:
Okay, I've no idea what that is, unless it simply means the grub prompt.

AIUI this was the rescue mode that was being "criticized" here, in that
it only works in limited scenarios. I for one have never found any use
for it, but of course that doesn't mean a thing, it's nice that it's there.

Sure. I don't know what the rebooting loop was, but clearly it was
related to "this bug" because the bootloader has been working fine for
years until it got completely broken by the upgrade to F30. I've never
seen anything like that, ever. And certainly it was affected by "this
bug" too because of the ages old boot loader.

I guess we'll never find out because I wouldn't know the exact steps to
reproduce it even if I wanted to.

Indeed, and without it I would've been lost on a few occasions,
including this. I don't think there's an actual disagreement anywhere in
here, just a side-track related to the awareness of these various rescue
modes and their relative powers of rescue, or lack thereof.

Re: Upgrade to F30 gone wrong

By Chris Murphy at 05/09/2019 - 12:32

On Thu, May 9, 2019 at 12:00 AM Panu Matilainen < ... at redhat dot com> wrote:
Oops, I confused myself. This thread's bug doesn't result in a grub
rescue prompt, but just the usual grub prompt.

The idea is with a 'no host only' initramfs, it should be able to boot
no matter what hardware is connected. In the case where hardware
changes, add or replace, the default "host only" initramfs might not
contain a necessary kernel module to boot the system, and boot will
fail. The rescue boot option does nothing else, no fsck, no additional
debugging - it's just using a giant initramfs with a bunch of modules
baked into it.

Yeah I'm not sure. It may be that the conditions for this bug can lead
to a different bug. Frankly I'm surprised we haven't run into a lot
more bootloader related problems, as stale as it can become on BIOS
systems.

Understood.

My take away from the conversation is: The Common Bugs steps to avoid
and get out from under the bug, and various rescue methods available,
are all appropriate and necessary, and all parties did the best they
could under the circumstances. But they are inadequate ways of
addressing BIOS GRUB staleness. Major version upgrades need to handle
this case in the best interest of most users, even if it means
inconveniencing some multibooters.

Re: Upgrade to F30 gone wrong

By Julen Landa Alustiza at 05/06/2019 - 07:40

Nicolas Mailhot via devel < ... at lists dot fedoraproject.org> wrote
(May 6, 2019, Mon. 09:59):

About this specific bug...

We found this bug before release, but it is not a release-blocking bug: the
upgrade criteria only cover clean installs of n and n-1 upgrading to n+1, and
this bug only happens with systems continuously upgraded since fc21 or older.
QA folks talked with the bootloader ones. It was a difficult bug to fix without
breaking things or overwriting other bootloaders, and there was no time to
announce, discuss and decide on policy changes, so we went ahead and
documented it on Common Bugs.

Re: Upgrade to F30 gone wrong

By Roberto Ragusa at 05/06/2019 - 09:25

Wait a moment, is n and n-1 defined to "installed from scratch n and n-1?".
Is this a precedent that n-installed is different than n-through-upgrades?

Regards.

Re: Upgrade to F30 gone wrong

By Chris Murphy at 05/06/2019 - 14:09

On Mon, May 6, 2019 at 7:25 AM Roberto Ragusa < ... at robertoragusa dot it> wrote:
Correct.

"For each one of the release-blocking package sets, it must be
possible to successfully complete a direct upgrade from a fully
updated, clean default installation of each of the last two stable
Fedora releases with that package set installed."

https://fedoraproject.org/wiki/Fedora_30_Beta_Release_Criteria#Upgrade_requirements

It is completely impractical for QA to, every cycle, do a clean
install of each version of Fedora, and upgrade them in sequence to the
current pre-release version, and if any of those get stuck somewhere,
suggest it would be release blocking. It's totally untenable.

And not least of which is the BIOS bootloader staleness issue, but
file systems are inherently non-deterministic data blobs. The older
they get, the more non-deterministic they become, and the more likely
problems are edge cases that require special handling. The older it
is, the less stable it is, and the more likely you'll run into
problems no one else has. It's just the way they are.

Re: Upgrade to F30 gone wrong

By Martin Kolman at 05/06/2019 - 14:27

On Mon, 2019-05-06 at 12:09 -0600, Chris Murphy wrote:
Grossly simplified description of such an automated test:
- create a VM
- install old fedora version N via kickstart
- setup SSH keys for remote access in the kickstart
- apply any other customizations needed
- SSH to machine, run commands to upgrade to N+1, repeat until desired current version is reached

Test considered successful if it is possible to login to the upgraded system after final reboot.
Test considered failed if a timeout is reached, with timeouts assigned to the individual parts
(installation, prepare for upgrade, upgrade).

More granular checks could be used, at the cost of more complicated test harness.

This would presumably run for many hours, but if multiple runs could be done for different starting
versions N in parallel, it should still be short enough to gate stuff like GO/NOGO decisions.
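
The "upgrade to N+1, repeat" step of the outline above can be sketched as a command generator. upgrade_plan is a hypothetical helper, not part of any existing test harness; it only prints the dnf commands a harness would run over SSH, and it assumes every release in the range is upgraded with dnf system-upgrade (older releases used different upgrade tools, which this sketch ignores).

```shell
# Hypothetical helper: print the per-release upgrade commands from
# release $1 up to release $2, one release at a time.
upgrade_plan() {
    from=$1
    to=$2
    v=$from
    while [ "$v" -lt "$to" ]; do
        next=$((v + 1))
        # Bring the current release fully up to date before moving on.
        echo "dnf -y upgrade --refresh"
        echo "dnf -y system-upgrade download --releasever=$next"
        # The reboot performs the offline upgrade; the harness then waits for SSH.
        echo "dnf system-upgrade reboot"
        v=$next
    done
}
```

For instance, upgrade_plan 28 30 prints the three commands twice, once targeting release 29 and once targeting release 30. The timeouts and login check described above would wrap each reboot step.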

Re: Upgrade to F30 gone wrong

By Chris Murphy at 05/06/2019 - 16:59

On Mon, May 6, 2019 at 12:27 PM Martin Kolman < ... at redhat dot com> wrote:

To what end? Someone would have to build this test in openqa, and then
when it fails, check it and try and figure out why it failed, which
itself could take hours.

The release criterion is clear, such a bug wouldn't block release.

That a thing can be done does not mean that thing should be done.

And the outcome in this particular instance would differ, how?
Someone would need to create these tests in openqa, then someone needs
to check the failing tests, and work out the cause of the problem.
That's exactly what happened in this case. And quite a lot of such
testing, including this one and the one you're proposing, when they
aren't release blocking someone may not do that. There's quite a lot
on QA's plate with just blockers and important freeze exception bugs.
Everything else depends on available bandwidth.

Re: Upgrade to F30 gone wrong

By Tomasz Torcz at 05/06/2019 - 10:47

On Mon, May 06, 2019 at 09:54:33AM -0400, Stephen John Smoogen wrote:
% dnf --releasever=31 system-upgrade download
Before you continue ensure that your system is fully upgraded by running
"dnf --refresh upgrade". Do you want to continue [y/N]:

So this message should read “Ensure your system is different than what
we test for”?

Re: Upgrade to F30 gone wrong

By Chris Murphy at 05/06/2019 - 14:13

On Mon, May 6, 2019 at 8:47 AM Tomasz Torcz < ... at pipebreaker dot pl> wrote:
No.

Different than release testing that occurred during the last final
freeze, yes. But still tested, as QA, and many other Fedora users and
testers, test the packages in updates-testing before those packages
move to the updates repo.

Re: Upgrade to F30 gone wrong

By Tomasz Torcz at 05/06/2019 - 15:15

On Mon, May 06, 2019 at 12:13:16PM -0600, Chris Murphy wrote:
Ah, ok. “through-the-upgrades” meant “dnf system-upgrade”, not plain
“dnf upgrade”. Even the criteria talk about a fully upgraded system.
I've misunderstood completely :(

Re: Upgrade to F30 gone wrong

By Julen Landa Alustiza at 05/06/2019 - 09:46

Roberto Ragusa < ... at robertoragusa dot it> wrote (May 6, 2019,
Mon. 15:34):

Technically speaking, and per our release criteria, yep: we block on
upgrades to fc30 from fresh default fc28/fc29 installations of the blocker
package sets; previously upgraded fc28/fc29 systems are out of scope.

We do our best to support upgrading from previously upgraded systems, but
bugs there are not considered blockers if clean default installations of n
or n-1 are not affected.

<a href="https://fedoraproject.org/wiki/Fedora_30_Beta_Release_Criteria#Upgrade_requirements" title="https://fedoraproject.org/wiki/Fedora_30_Beta_Release_Criteria#Upgrade_requirements">https://fedoraproject.org/wiki/Fedora_30_Beta_Release_Criteria#Upgrade_r...</a>

Re: Upgrade to F30 gone wrong

By Martin Kolman at 05/06/2019 - 07:15

On Mon, 2019-05-06 at 09:52 +0200, Nicolas Mailhot via devel wrote:

Re: Upgrade to F30 gone wrong

By Sam Varshavchik at 05/05/2019 - 12:45

Steve Grubb writes:

Somewhere around that era, installing Fedora added a grub menu entry for
"Rescue" mode. I don't remember exactly what it was supposed to rescue, and
how.

It's been sitting in the grub menu ever since.

I have a /boot/vmlinuz-0-rescue-f0fe67c2a80d43d2947358968ab5277e with a 2013
timestamp. No idea which kernel it is. It appears to be immune to
installonly_limit.

<a href="https://fedoraproject.org/wiki/Common_F30_bugs#GRUB_boot_menu_is_not_populated_after_an_upgrade" title="https://fedoraproject.org/wiki/Common_F30_bugs#GRUB_boot_menu_is_not_populated_after_an_upgrade">https://fedoraproject.org/wiki/Common_F30_bugs#GRUB_boot_menu_is_not_pop...</a>
offers a simpler band-aid.

Checking a "Common Fxx bugs" page before every upgrade has been a well-
established ritual for quite some time.

That's what I suggested yesterday.

Re: Upgrade to F30 gone wrong

By Chris Murphy at 05/05/2019 - 18:21

On Sun, May 5, 2019 at 10:45 AM Sam Varshavchik <mrsam@courier-mta.com> wrote:
If you see the rescue kernel+initramfs menu entry though, you do not
have the bug under discussion, and you don't need to run
'grub2-install' - your GRUB is by definition functioning fine if you
see this menu entry.

As for what it is, this is a copy of the original kernel installed at
the time Fedora was installed by Anaconda. What makes it "rescue" is
actually the "no host only" initramfs which is a kitchen sink
initramfs that in theory will boot any hardware. The idea is, if you
add new hardware and suddenly can't boot, it's likely because the
normal "host only" initramfs doesn't have a kernel module for that new
hardware; and you can use the "rescue" option as a fall back to boot,
and then build a new initramfs for one of your newer kernels.

And you're correct, it is never updated. I filed a bug/rfe for dnf
system-upgrade about this some time ago, but got some push back on
where such a function really belongs. And actually it shouldn't be just
any arbitrary kernel; ideally it'd be a reasonably well tested kernel,
perhaps the same one we actually release with.

Re: Upgrade to F30 gone wrong

By Sam Varshavchik at 05/05/2019 - 19:07

Chris Murphy writes:

I just inventoried my bricks. One, which has not been updated to F30 yet,
features this in its default.cfg:

menuentry 'Fedora 19 Rescue 929d17a456bf4083a935b6209da2ef46 (3.9.8-300.fc19.x86_64)' --class fedora --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-6a70e79b-78da-487a-acfe-dfba88996747' {

So, it's a rescue from Fedora 19, which, from what I understand, is covered
by the bug.

I have executed grub2-install manually on this machine, though, but I
wouldn't expect it to figure out which release originally installed this
kernel. And this machine was initially installed much, much earlier.

Re: Upgrade to F30 gone wrong

By Chris Murphy at 05/05/2019 - 19:15

On Sun, May 5, 2019 at 5:08 PM Sam Varshavchik <mrsam@courier-mta.com> wrote:
Only if all three are true:

a. It has BIOS firmware
b. grub2-install from Fedora 21 or newer has never been run on it
c. It has been upgraded to Fedora 30

That you can see this menu entry means one of those things is not
true; because the bug causes failure prior to menu entry parsing and
display.
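
Of these three conditions, the firmware type is the easiest one to check
from a running system. What follows is only a minimal sketch, assuming a
Linux kernel that exposes the /sys/firmware/efi directory only when it was
booted through UEFI. The function name firmware_type and its optional
sysfs-root argument are hypothetical and exist purely for illustration.
Condition (b) cannot be checked this way.

```shell
#!/bin/sh
# Sketch: report whether the running system was booted via UEFI or BIOS.
# Assumption: Linux creates /sys/firmware/efi only on UEFI boots.
# The optional argument (an alternative sysfs root) is hypothetical and
# only there so the function can be exercised against a scratch directory.
firmware_type() {
    sysfs_root="${1:-/sys}"
    if [ -d "$sysfs_root/firmware/efi" ]; then
        echo "UEFI"
    else
        echo "BIOS"
    fi
}

firmware_type
```

A result of "BIOS" only means condition (a) holds; it says nothing about
which grub2-install last wrote the boot sector.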

In that case I expect that the rescue kernel+initramfs feature first
appeared in dracut in Fedora 19, so that's the first time it would
have noticed the pair are missing, and it would have created them at
that time. But that's just a guess.

Re: Upgrade to F30 gone wrong

By Sam Varshavchik at 05/05/2019 - 20:46

Chris Murphy writes:

This seems to be it. Two of my older BIOS systems, one still on F29, and one
that's now on F30, both have a rescue image referencing Fedora 19 (the
upgraded one referenced the rescue image in grub.cfg.rpmsave, pretty sure it
still appears in the grub menu at boot time).

Re: Upgrade to F30 gone wrong

By Vitaly Zaitsev ... at 05/05/2019 - 15:54

On Sun, 05 May 2019 12:45:00 -0400

From an old message thread, here are two ways to update to a current
rescue kernel.

"""
Delete (or move out of "/boot") the rescue kernel and initramfs.

Run

/etc/kernel/postinst.d/51-dracut-rescue-postinst.sh f27_kern_ver /boot/f27_kern_img
"""

"""
You delete the rescue initramfs from /boot. Then the next time you
install a new kernel, it will create a new one. Possibly reinstalling
the existing kernel would work as well.
"""

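Before deleting or moving anything, it may help to see which files are
involved. The sketch below is a dry run only. It assumes the rescue files
follow the "-0-rescue-<machine-id>" naming seen earlier in the thread
(vmlinuz-0-rescue-... and a matching initramfs-0-rescue-...). The function
name list_rescue_files and its optional directory argument are hypothetical.

```shell
#!/bin/sh
# Sketch: list the rescue kernel and initramfs in a boot directory.
# Dry run: nothing is deleted or moved.
# Assumption: files are named vmlinuz-0-rescue-* and initramfs-0-rescue-*.
list_rescue_files() {
    boot_dir="${1:-/boot}"
    for f in "$boot_dir"/vmlinuz-0-rescue-* "$boot_dir"/initramfs-0-rescue-*; do
        # An unmatched glob stays literal, so keep only paths that exist.
        [ -e "$f" ] && echo "$f"
    done
    return 0
}

list_rescue_files
```

Once the listed files have been moved out of /boot, either of the two
quoted approaches above should apply.
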
Re: Upgrade to F30 gone wrong

By Chris Murphy at 05/04/2019 - 13:59

On Sat, May 4, 2019 at 11:43 AM Chris Murphy < ... at colorremedies dot com> wrote:
Actually, that's a problem too. The stale bootloader problem goes back
to an era where it was possible to install the bootloader into the
first sector of the boot partition, and in those cases, /dev/sda1 is
actually valid. And again, no practical way to discover this
automatically in advance.

Re: Upgrade to F30 gone wrong

By Sam Varshavchik at 05/04/2019 - 14:30

Chris Murphy writes:

It would be useful to have dnf system-upgrade emit a "say, you may need to
X first" message, before initiating a reboot, with an opportunity to bail
out. Just like the existing message that tells you to update the current
system first, before initiating an upgrade.

And, making this more generic, each new Fedora release could have a brief
upgrade message tucked away in it, somewhere, that dnf system-upgrade could
grab and show up front.

Re: Upgrade to F30 gone wrong

By Chris Murphy at 05/04/2019 - 15:35

On Sat, May 4, 2019 at 12:31 PM Sam Varshavchik <mrsam@courier-mta.com> wrote:
The reminder to update the current system applies to everyone. Whereas
the mitigation for this bug is for specific configurations that the
plug-in can't test for, and for other configurations it's not good
advice. Suggesting UEFI users run 'grub2-install' will have two
possible outcomes: non-standard GRUB behavior if the command works;
and confusion if 'grub2-install' doesn't work because it isn't
present. Either way, it's bad advice because that's a bad UX.

While handling this bug with a Common Bugs report is suboptimal, it
has long been expected that users should read Common Bugs before
installing or upgrading their systems. Making that advice more
prominent might be reasonable.

As simple as it sounds, someone has to build that and maintain it,
from upstream down to Fedora. And Fedora also supports two upgrade
mechanisms, dnf system-upgrade, and GNOME Software. And I think it's
reasonable that such messages to user space need to follow
localization guidelines, so now those messages need translation.
That's a lot of work to do when, again, we're all supposed to read
Common Bugs. It's not any different on Windows or macOS, which publish
release notes; if you're very lucky they report some gotchas, but they
never do that notification in the actual upgrade tool.

The reason why this bug exists in my opinion is because we're being
too accommodating to the technical users who want linux multiboot, and
want Fedora to not step on their bootloader. I'm not convinced that's
a very good policy anymore. I personally would flip it around and
forcibly update the bootloader by assuming we own it, and if it turns
out that's the wrong assumption the injured party is a technical user
who should be familiar with linux multi-boot madness and its esoteric
workarounds.

Re: Upgrade to F30 gone wrong

By Dridi Boukelmoune at 05/06/2019 - 03:14

On Sat, May 4, 2019 at 9:36 PM Chris Murphy < ... at colorremedies dot com> wrote:
I never heard of that, is it mentioned by the release announcement or
the blog entry on how to upgrade from fX to f{X+1}?

I think I would have remembered (and definitely checked the Common
Bugs) if that were the case. I will probably have plenty of time to
forget about it until f31 is out, so I'm all for advertising the
Common Bugs more!

Dridi

Re: Upgrade to F30 gone wrong

By Zbigniew Jędrzejewski-Szmek at 05/05/2019 - 07:42

On Sat, May 04, 2019 at 01:35:32PM -0600, Chris Murphy wrote:
There was the pre-upgrade tool (<a href="https://fedoraproject.org/wiki/How_to_use_PreUpgrade" title="https://fedoraproject.org/wiki/How_to_use_PreUpgrade">https://fedoraproject.org/wiki/How_to_use_PreUpgrade</a>),
but I don't think it ever went anywhere.

IMHO, we just don't have the manpower to maintain such a tool in the
face of the frequent releases and the myriad of possible installation
styles and combinations of upgrade paths. For something more stable
and limited and predictable like RHEL this might be possible, but I don't
see it ever happening for Fedora.

This makes the assumption, which was also made earlier in the thread,
that it's somehow impossible to check what bootloader is installed.
Why? My bootloader is happy to tell me its version:
$ bootctl
...
Current Boot Loader:
Product: systemd-boot 241-565-g43d51bb
Features: ✓ Boot counting
✓ Menu timeout control
✓ One-shot menu timeout control
✓ Default entry control
✓ One-shot entry control
File: /EFI/systemd/systemd-bootx64.efi
...
Nowadays it gives the exact git commit it's built from; in the past
it was just the release version, but either is enough. Therefore
'bootctl update' can fairly reliably *update*, i.e. do the installation
if the thing we have is newer than the version already installed.

In case of grub and MBR things are a bit more involved, but I don't
think there's any significant technical limitation to doing the same
check and conditional installation.
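
The "install only if newer" decision can be sketched as below. This is an
illustration under assumptions, not how bootctl implements it: it relies on
GNU sort's -V (version sort) option and on version strings shaped like the
one in the output above (release, commit count, commit hash). The function
name is_newer is hypothetical.

```shell
#!/bin/sh
# Sketch: succeed if the candidate version is newer than the installed one.
# Assumptions: GNU sort (-V) is available; versions look like
# "241-565-g43d51bb" or a bare release number such as "242".
is_newer() {
    candidate="$1"
    installed="$2"
    # Identical versions are never an update.
    [ "$candidate" = "$installed" ] && return 1
    # Under version sort, the larger version sorts last.
    newest=$(printf '%s\n%s\n' "$candidate" "$installed" | sort -V | tail -n 1)
    [ "$newest" = "$candidate" ]
}

if is_newer "242" "241-565-g43d51bb"; then
    echo "would install"
else
    echo "would skip"
fi
```

Two builds with the same release and commit count but different hashes would
be ordered arbitrarily by this comparison, so a real check would need more
than the string alone.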

Zbyszek

Re: Upgrade to F30 gone wrong

By Chris Murphy at 05/05/2019 - 17:55

On Sun, May 5, 2019 at 5:45 AM Zbigniew Jędrzejewski-Szmek
< ... at in dot waw.pl> wrote:
Yes, but this is out of scope for the conversation because your
bootloader is UEFI, and the bug under discussion is BIOS.

On BIOS, there are *three* stage 1 bootloaders in common use on
Linux distros, and there's no room for versioning or signatures in the
440 bytes available for this bootloader. The only way to know what
we're dealing with, is to read and parse those 440 bytes, and find out
where and what they jump to and then parse that stage 2 code.
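
To make the first step concrete, here is a rough sketch of reading LBA 0. It
assumes the classic MBR layout (boot code in bytes 0-439, boot signature
0x55 0xAA in bytes 510-511) and GNU grep for the -a option. The "GRUB"
marker test is only a heuristic about typical GRUB boot code, not a version
check, and the function name inspect_lba0 is hypothetical.

```shell
#!/bin/sh
# Sketch: inspect the first sector (LBA 0) of a disk or disk image.
# Heuristic only: finding the string "GRUB" in the boot code does not say
# whose GRUB it is, which version it is, or where stage 2 lives.
inspect_lba0() {
    dev="$1"
    # Bytes 510-511 carry the boot signature 0x55 0xAA.
    sig=$(dd if="$dev" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')
    if [ "$sig" != "55aa" ]; then
        echo "no-boot-signature"
        return 1
    fi
    # Bytes 0-439 are the stage 1 boot code.
    if dd if="$dev" bs=1 count=440 2>/dev/null | LC_ALL=C grep -a -q "GRUB"; then
        echo "grub-like"
    else
        echo "unknown"
    fi
}
```

On a real machine this would be called as root with the disk as argument,
for example inspect_lba0 /dev/sda. Even a "grub-like" answer leaves the hard
part untouched: working out where that code jumps to and what is there.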

As an example of common: openSUSE uses (or at least used to, it's been
a couple years since I checked this) the syslinux mbr.bin as their
stage 1 bootloader, and they use GRUB as their stage 2 bootloader.

As for version, GRUB has a version.mod but you have to be running GRUB
in the pre-boot environment to run it. There is no checker for what is
actually installed or who installed it. The facility doesn't exist. I
don't know why. Considering there's maybe one downstream that provides
an unmodified upstream, and everyone else provides heavily modified
GRUB such that referencing an upstream version is pointless, you'd
need a facility to inject a version+signature for the distribution's
naming scheme to know what the installed binary actually is. And that
includes Fedora, whose GRUB has hundreds of patches on top of
upstream.

There's definitely no room for this in stage 1 bootloaders. There
might be (I'm not sure) enough room to reference a package in the
stage 2 bootloader, so that we know what distro and their version of
the bootloader it is, at least know that it's not ours. But the ship
has sailed. This capability needed to be in Fedora 5 years ago.

It is insufficient to check the stage 2 bootloader. You have to start
at the beginning, LBA 0, and follow it just like the computer does.
You can't work backwards.

Both the 1MB MBR gap, and 1MB GPT BIOSBoot partition are big enough to
contain multiple stage 2 bootloaders, and grub-install has a facility
to avoid stepping on a stage 2 bootloader in those locations if it can.
So that stage 2 is written, and then it writes out a custom stage 1
bootloader in LBA 0 that tells the computer specifically what LBA to
jump to in either the MBR gap or BIOS Boot partition.

If you have a checker to look in those two locations, and it finds two
or more stage 2 bootloaders, which one is the one currently in use? You
don't know. You have to read LBA 0 and parse it.

Multiboot on BIOS was always this much of a clusterfuck.

Re: Upgrade to F30 gone wrong

By Steve Grubb at 05/04/2019 - 10:59

Hello,

One detail is missing, see below

On Saturday, May 4, 2019 10:54:49 AM EDT Steve Grubb wrote:
error: symbol 'grub_file_progress_hook' not found.
Entering rescue mode...
grub rescue>

Which seems like a big step back from where I was.

-Steve

Re: Upgrade to F30 gone wrong

By Tom Hughes at 05/04/2019 - 11:02

Obviously I don't know your setup, but it's more normal for
the bootloader to be installed at device level rather than at
partition level, so you would want:

grub2-install /dev/sda

Obviously this should have been fixed as part of the upgrade
process - that's why the page is called "common bugs" because
it's a bug.
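
For illustration, the device-versus-partition distinction can be sketched as
a name mapping. This assumes conventional Linux device names (sdXN,
nvmeXnYpN, mmcblkXpN) and is string manipulation only. It does not check
that the device exists, and, as noted elsewhere in the thread, the partition
itself is sometimes the right target. The function name parent_disk is
hypothetical.

```shell
#!/bin/sh
# Sketch: map a partition path to its whole-disk device by name alone.
# Assumption: conventional naming; md, device-mapper and similar devices
# are not handled.
parent_disk() {
    case "$1" in
        /dev/nvme*|/dev/mmcblk*)
            # Disk names end in a digit here, so partitions use a "pN" suffix.
            printf '%s\n' "$1" | sed 's/p[0-9][0-9]*$//' ;;
        *)
            printf '%s\n' "$1" | sed 's/[0-9][0-9]*$//' ;;
    esac
}

parent_disk /dev/sda1
```

This prints /dev/sda. On a live system, asking lsblk for the PKNAME column
of the partition is probably more reliable than guessing from the name.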

Tom

On 04/05/2019 15:59, Steve Grubb wrote: