Better interactivity in low-memory situations

This subject matches a Fedora Workstation Working Group issue of the
same name [1]. This post is intended as an independent summary of the
findings so far, and a call for additional testing and discussion, in
particular from subject matter experts.

Problem and thesis statement:
Certain workloads, such as building WebKitGTK from source, result in
heavy swap usage that eventually leaves the system totally
unresponsive. The proposal: look into switching from disk-based swap
to swap on a zram device.

Summary of findings (restated, but basically the same as found at [2]):
Test system: MacBook Pro, Intel Core i7-2820QM (4/8 cores), 8GiB RAM,
Samsung SSD 840 EVO, Fedora Rawhide Workstation.
Test case: build WebKitGTK from source.

$ cmake -DPORT=GTK -DCMAKE_BUILD_TYPE=RelWithDebInfo -GNinja
$ ninja

Case 1: 8GiB swap on SSD plain partition (not encrypted, not on LVM)
Case 2: 8GiB swap on /dev/zram0

In each case, that swap is exclusive, there are no other swap devices.
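For anyone reproducing this, the Case 2 device can be set up by hand roughly as follows (a sketch assuming the zram kernel module and util-linux's zramctl are available; run as root; the size matches the test case):

```shell
# Load the zram module and allocate an 8GiB RAM-backed compressed device.
modprobe zram
dev=$(zramctl --find --size 8G)   # typically prints /dev/zram0

# Make it the system's only swap device.
swapoff -a
mkswap "$dev"
swapon "$dev"

# Verify: only the zram device should be listed.
swapon --show
```

Fedora later grew a packaged zram-generator for this; the manual steps above are just for ad-hoc testing.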
Within ~30 minutes in the first case, and ~10 minutes in the second,
the GUI is completely unresponsive; the mouse pointer freezes and
doesn't recover after more than 30 minutes of waiting. Over remote
ssh, the first case is semi-responsive: updates that should arrive
every 5 seconds instead arrive every 2-5 minutes, and cancelling the
build process did not compel recovery even after another 30 minutes.
Over remote ssh, the second case is totally unresponsive, with no
updates for 30 minutes.

The system was manually forced off at that point, in both cases. The
oom killer never triggered.

NOTE: ninja, by default on this system, sets N concurrent jobs to
nrcpus + 2, which is 10 on this system. If I reboot with nr_cpus=4,
ninja sets N jobs to 6.

Case 3: 2GiB swap on /dev/zram0
In one test this resulted in a system hang (no pointer movement)
within 5 minutes of executing ninja; within another 6 minutes the oom
killer was invoked on a cc1plus process, which is fatal to the build
process. The remaining build-related processes quit on their own, and
the system eventually recovered.

But in two subsequent tests of this same configuration, the oom killer
wasn't invoked, and the system meandered between being responsive for
~1 minute and totally frozen for 5-6 minutes, in a cycle lasting
beyond 1 hour without ever triggering the oom killer.

Screenshot taken during one of the moments the remote ssh session updated

The state had not changed 45 minutes after the above screenshot, so I
forced power off on that system. But the point here is that this
slightly different configuration has some non-determinism to it, even
though in the end it's a bad UX either way: the default, unprivileged
build command effectively takes down the system all the same.

Case 4: 8GiB swap on SSD plain partition, `ninja -j 4`
This is the same setup as Case 1, except I manually set N jobs to 4.
Build succeeds, and except for a few mouse pointer stutters, the
system remains responsive, even Firefox with multiple tabs open, and
youtube video playing. Exactly the experience we'd like to see, albeit
not all CPU resources are used for the build. Clearly the limiting
factor is that this particular package requires more than ~14GiB to
build successfully, and the system + shell + Firefox just doesn't have
that much memory available.

Starter questions:
To what degree, and why, is this problem instigated by the build
application (ninja in this example) or its supporting configuration
files, including cmake? Or the kernel? Or the system configuration? Is
it a straightforward problem, or is this actually somewhat nuanced
with multiple components in suboptimal configuration coming together
as the cause? Is it expected that an unprivileged user can run a
command whose defaults eventually lead to a totally unrecoverable
system? From a security risk standpoint, the blame can't be entirely
on the user or the application configuration, but how should
application containment be enforced? Other than containerizing the
build programs, is there a practical way right now of enforcing CPU
and memory limits on unprivileged applications? Other alternatives? At
the very least, it seems like getting to the oom killer sooner would
result in a better experience: fail the process before the GUI becomes
unresponsive and hangs for 30+ minutes (possibly many hours).
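On the containment question specifically: systemd can already impose cgroup limits on an unprivileged command today. A sketch, assuming cgroup v2 with the cpu and memory controllers delegated to the user session (which Fedora does not do by default at the time of writing); the 6G/1G/300% figures are illustrative, not tuned values:

```shell
# Run the build in a transient scope with hard caps. When the scope
# exceeds MemoryMax, its processes are OOM-killed, instead of dragging
# the whole session into swap; CPUQuota caps it at 3 CPUs' worth of time.
systemd-run --user --scope \
    -p MemoryMax=6G \
    -p MemorySwapMax=1G \
    -p CPUQuota=300% \
    ninja
```

Whether the limits are actually enforced depends on controller delegation to the user manager, so this is a direction rather than a drop-in fix.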




Re: Better interactivity in low-memory situations

By Artem Tim at 08/15/2019 - 02:50

The BFQ scheduler helps a lot with this issue; I've been using it on
Fedora since the 4.19 kernel. There was also a previous discussion
about making it the default for Workstation.

Re: Better interactivity in low-memory situations

By Chris Murphy at 08/15/2019 - 15:19

On Thu, Aug 15, 2019 at 1:51 AM Artem Tim <ego. ... at gmail dot com> wrote:
It's mentioned in the workstation issue as having no effect in this case.

I just switched to it and repeated the test case; the GUI still hangs
and is unresponsive, even without substantial pressure on the SSD, and
swap isn't even half used.

But I am getting something new in kernel messages:

542 sysrq+t during a GUI freeze that lasted over 1 minute, and then:

[ 718.068633] fmac.local kernel: SLUB: Unable to allocate memory on
node -1, gfp=0x900(GFP_NOWAIT|__GFP_ZERO)
[ 718.068636] fmac.local kernel: cache: page->ptl, object size: 72,
buffer size: 72, default order: 0, min order: 0
[ 718.068639] fmac.local kernel: node 0: slabs: 296, objs: 16576, free: 0
[ 718.068704] fmac.local kernel: chronyd: page allocation failure:
order:0, mode:0x800(GFP_NOWAIT),

Not sure what to make of that. Complete 'journalctl -k' is here:

Re: Better interactivity in low-memory situations

By Chris Murphy at 08/15/2019 - 16:47

On Thu, Aug 15, 2019 at 2:19 PM Chris Murphy < ... at colorremedies dot com> wrote:
Asked on #fedora-kernel, it's a known issue with 5.3.0-rc4 and drm.

Re: Better interactivity in low-memory situations

By Dave Airlie at 08/15/2019 - 18:57

On Fri, Aug 16, 2019 at 7:48 AM Chris Murphy < ... at colorremedies dot com> wrote:
Nope it's not that.

Something has leaked all your memory (not drm).


Re: Better interactivity in low-memory situations

By S. at 08/14/2019 - 08:50

(Oops, sorry, re-post because I messed up the threading.)

I'm not a developer, nor do I pretend to understand the nuances of memory management. But I signed up for this list just to say "thanks" to all the devs and others that are finally discussing what I consider to be one of the biggest problems with Linux on the desktop.

My experience with desktop Linux distros with SSDs when a few processes start to leak memory, or if I launch a new program when my system is right at the limits, is a full system hang where only the mouse occasionally moves jerkily, and I can't switch to a virtual terminal. I recently learned the SysRq trick to evoke the OOM killer, but I personally think that the kernel should deal with that, not the user. As unfortunate as it is for the OOM killer to have to randomly kill something, I am of the opinion that the OS should *never* lock up, period. I would strongly prefer that one application get killed instead of losing all my applications and working data because of a necessary hard reboot.

I don't know if this helps, but anecdotally I started seeing this issue *after* SSDs became more common; i.e., I don't think I ever experienced it with spinning rust. Maybe it's something to do with the vastly faster I/O of an SSD, which allows it to more quickly saturate the RAM before the OOM killer has time to react?

Also, I've had relatively low-memory KVM guests running on a VPS under very high load, and they never lock up. The OOM killer does occasionally kick in, but the affected daemon or systemd service restarts and it's amazingly undramatic. It appears that this issue only occurs with Xorg (and I imagine Wayland) and "desktop" usage.

As for the problem of the randomness of the OOM killer, couldn't it be made to take into account the PID and/or how long the process has been running? Normally Xorg (and I assume Wayland stuff) gets started before the other desktop programs that tend to consume a lot of memory. So if it's a higher PID and/or has been running for less time, give it a higher score for killability.

In my experience on a system with 8GB of RAM and an SSD, the amount of swap space makes no difference. I've tried with no swap space, with 2GB, with 8GB, etc, and it still hangs under high memory usage. I've also tried tuning a lot of sysctl parameters such as vm.swappiness, vm.vfs_cache_pressure, and vm.min_free_kbytes, to no avail.

Don't know if this helps, but there are additional discussions of Linux unresponsiveness under low-memory situations from a layman's perspective elsewhere, often in blog comment threads.

Thanks again to everyone for looking into this!

Re: Better interactivity in low-memory situations

By Florian Weimer at 08/12/2019 - 02:01

* Chris Murphy:

Do you use the built-in Intel graphics? Can you test with something


Re: Better interactivity in low-memory situations

By Chris Murphy at 08/12/2019 - 10:45

On Mon, Aug 12, 2019 at 1:01 AM Florian Weimer < ... at redhat dot com> wrote:
Only intel graphics. The AMD GPU on the test system is
non-functional/defective. Other systems only have Intel graphics. I
have tested this in a VM which I think is qxl graphics (?), and I get
the same results, with minimal sample size. It seems like the oom
happens more often and sooner in the VM, but that might be because the
VM is necessarily even more resource constrained than the host. But I
have reproduced the total and seemingly indefinite hang. The results
aren't completely deterministic, whether bare metal or VM. They're all
"failures" in one form or another, but how they fail does differ run
to run. And that's expected, because the degree to which I'm
simultaneously browsing in Firefox, how many tabs are open, and which
other programs are being used varies; the user is a cause of that
non-determinism and is a relevant factor.
Re: Better interactivity in low-memory situations

By Petr Pisar at 08/12/2019 - 09:40

On 2019-08-12, Florian Weimer < ... at redhat dot com> wrote:
As far as I know, integrated graphics arrays do not share physical
memory from the point of view of the CPU address space. The physical
memory is split between GPU and CPU regions, and the CPU never sees
the GPU's physical memory. The IOMMU can be asked to map the GPU's
memory into the CPU's virtual space, as can be done with any PCI card,
but the physical memory is always separated (although it lives in the
same memory chip). Some BIOSes allow defining the UMA split (the ratio
between GPU and CPU memory), but that is out of the control of the
operating system and cannot be changed until reset.

What actually happens is that some CPU physical memory is used for GUI
program text and some for a block device I/O cache. Both purposes are
handled uniformly by Linux. When physical memory is exhausted, the
memory allocator starts paging to a swap device. The evil thing is how
memory pages are selected to be swapped out: the algorithm swaps out
the least recently used ones, and that is often the program text, not
the block cache. As a result your GUI becomes unresponsive, because
all the physical memory is filled with block cache and the program
text has to be reloaded from a block device. What's worse, this
happens even without swap space, because program text pages are backed
by a file and thus can be dropped and loaded from the file system
later. I.e., program text is always swappable.

A cure would be a fairer memory allocator that could magically
discover that a user is more interested in the few megabytes of his
window manager than in the gigabytes of a transferred file. The issue
is that the allocator does not discriminate. A process can actually
provide some hints using madvise(2) and mlock(2), but that applies
neither to the program text nor to the block cache in kernel space.
And even if processes provided hints, there could always be some
adversarial program abusing the others. Maybe if ulimit were augmented
with a maximal block cache usage, and an I/O scheduler accounted for
that, it could help.
-- Petr

Re: Better interactivity in low-memory situations

By Florian Weimer at 08/12/2019 - 09:43

* Petr Pisar:

I expect that the GEM shrinker (or rather, the reason why it is needed)
radically alters kernel memory management.


Re: Better interactivity in low-memory situations

By Georg Sauthoff at 08/10/2019 - 11:56

On Fri, Aug 09, 2019 at 03:50:43PM -0600, Chris Murphy wrote:
To avoid such issues I disable swap on my machines. I really don't see
the point of having a swap partition if you have 16 or 32 GiB RAM. Even
with 8 GiB I disable swap.

With - say - 8 GiB the build of a large project might fail (e.g. llvm,
e.g. during linking) but it then fails fast and I can just restart it
with `ninja -j2` or something like that.

Another source of I/O-related unresponsiveness is buffer bloat; I thus
apply this configuration on my machines:

$ cat /etc/sysctl.d/01-disk-bufferbloat.conf
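The file's contents didn't survive the archive formatting; a typical writeback-limiting configuration of this shape (my assumption, not necessarily Georg's actual settings) caps how much dirty data can pile up:

```shell
# /etc/sysctl.d/01-disk-bufferbloat.conf (illustrative values)
# Force writeback to start early and cap total dirty data, so a huge
# write cannot queue minutes' worth of I/O ahead of interactive reads.
vm.dirty_background_bytes = 67108864
vm.dirty_bytes = 134217728
```

Settings in /etc/sysctl.d are applied at boot, or immediately with `sysctl --system`.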

Best regards

Re: Better interactivity in low-memory situations

By Simon Farnsworth at 08/13/2019 - 07:02

Further, a sensible amount of swap (say 2 GiB or so) means that unused anonymous pages (e.g. data that's left over from initialization, or data that will only be needed when a process exits) can be swapped out and left on disk, freeing up valuable RAM for useful work.

Basically, a sane amount of swap is healthy - old advice about large amounts of swap is not.

Re: Better interactivity in low-memory situations

By Dave Airlie at 08/12/2019 - 19:27

On Sun, Aug 11, 2019 at 2:57 AM Georg Sauthoff < ... at georg dot so> wrote:
Disabling swap doesn't avoid the issues, it can in fact make them worse.

If apps allocate memory, they don't always OOM before the kernel tries
to evict text pages; and since SSDs are fast, the kernel then tries to
pull those text pages back in before realising what's happening (that
is what most of the latest round of articles has been about).
Something like firefox runs with no swap, starts to need more memory
than the system has, parts of the firefox executable get paged out but
are then needed for firefox to use the RAM, and round in circles it
goes.

Having swap is, still in this day and age, better for your system than
not having it.


Re: Better interactivity in low-memory situations

By Chris Murphy at 08/12/2019 - 20:48

On Mon, Aug 12, 2019 at 6:31 PM David Airlie < ... at redhat dot com> wrote:
I agree that it's better to have swap for incidental swap purposes,
rather than random things just getting abruptly hit with oom. I say
random, because I see the oom_score_adj is the same for every process
other than systemd-udev, auditd, sshd, and dbus. Plausibly the shell
could get oom killed without warning, taking out the entire user
session, all apps, and all the build processes.

I just discovered in the log from yesterday, that iotop was subject to
oom killer, rather than one of the large cc1plus processes, which is
what I'd previously consistently witnessed. So iotop and cc1plus must
be in the ballpark oom score wise and oom killer just so happens to
pick one or the other. iotop going away relieved just enough memory
that nothing else was subject to the oom killer, and yet processes
were clearly resource starved nevertheless: the GUI was frozen, and
other processes had already been dying due to timeouts, for example:

Aug 11 18:26:57 fmac.local systemd[1]: sssd-kcm.service: Control
process exited, code=killed, status=15/TERM
Aug 11 18:26:57 fmac.local systemd[1]: sssd-kcm.service: Failed with
result 'timeout'.

Aug 11 18:27:00 fmac.local systemd[1]: systemd-journald.service: State
'stop-sigterm' timed out. Killing.
Aug 11 18:27:00 fmac.local systemd[1]: systemd-journald.service:
Killing process 31010 (systemd-journal) with signal SIGKILL.
Aug 11 18:27:00 fmac.local systemd[1]: systemd-journald.service: Main
process exited, code=killed, status=9/KILL

This is like a train wreck where there are all sorts of interesting
sub failures happening. At one point I think, well we need better oom
scores so the truly lowest important process is killed off. But upon
big picture scrutiny, the system is failing before oom killer has been
triggered. Processes are dying with timeouts. The GUI including the
mouse pointer is frozen, even when swap is half full. Practically
speaking, it's a goner the moment the mouse pointer froze the very
first time. I might tolerate some stuttering here and there, but
minutes of frozen state? Nah - not interested in seeing if this is
another 5 minutes of choke, or 5 days.

And that's the bad side of swap: when the system is more than
incidentally using it, and is depending on it. And apparently nothing
is on a deadline timer if things can just start timing out on their
own, including the system journal! That was a surprise to see. If it
was that hung up, maybe I can't trust the journal entry times or
order; maybe important entries were lost.

Re: Better interactivity in low-memory situations

By Jan Kratochvil at 08/10/2019 - 04:07

On Fri, 09 Aug 2019 23:50:43 +0200, Chris Murphy wrote:
RelWithDebInfo is a -O2 -g build. That is not suitable for debugging;
for debugging you should use -DCMAKE_BUILD_TYPE=Debug (that is, -g).
RelWithDebInfo is useful for final rpm packages, but those are built
in Koji.

Debug build will have smaller debug info so the problem may go away.

If it does not go away, then tune the parallelism. A low -j makes the
build needlessly slow during the compilation phase, while a high -j
(up to about #cpus + 2 or so) will make the final linking phase with
debug info run out of memory. This is why LLVM has a separate "-j" for
the linking phase, though that is implemented only in LLVM's
CMakeLists.txt files. That way you leave the default -j high but set
LLVM_PARALLEL_LINK_JOBS to 1 or 2.

Other options for faster build times are also LLVM specific:
-DLLVM_USE_LINKER=gold (maybe also lld now?)
- gold or ld.lld are faster than ld.bfd
- the linking phase no longer deals with the huge debug info

Which should be applicable for other projects by something like (untested!):
-DCMAKE_EXE_LINKER_FLAGS="-fuse-ld=gold -Wl,--gdb-index"
-DCMAKE_SHARED_LINKER_FLAGS="-fuse-ld=gold -Wl,--gdb-index"

(That gdb-index is useful if you are really going to debug it using GDB as
I expect you are going to do when you want RelWithDebInfo and not Release; but
then I would recommend Debug in such case anyway as debugging optimized code
is very difficult.)

$ help ulimit
-m the maximum resident set size
-u the maximum number of user processes
-v the size of virtual memory

One can also run it with 'nice -n19', 'ionice -c3'
and/or "cgclassify -g '*':hammock" (config attached).
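As a concrete illustration of the ulimit route (my example values, not Jan's; the 12GiB cap is arbitrary):

```shell
# Cap the virtual memory of this shell and everything it spawns.
# ulimit -v takes KiB, so this is a 12GiB per-process ceiling: a
# cc1plus process that crosses it gets a failed allocation instead of
# swapping the whole system to death.
ulimit -v $((12 * 1024 * 1024))
ulimit -v    # prints 12582912

# Then start the build at low CPU and I/O priority:
nice -n19 ionice -c3 ninja
```

Note the limit is per process, not per build, so ten parallel compile jobs can still collectively exceed RAM.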

But after all, I recommend just more memory; it is cheap nowadays and
I find 64GB just about the right size.


Re: Better interactivity in low-memory situations

By Chris Murphy at 08/11/2019 - 10:50

On Sat, Aug 10, 2019 at 3:07 AM Jan Kratochvil
<jan. ... at redhat dot com> wrote:
I don't follow. You're saying RelWithDebInfo is never suitable for a
local build?

I'm not convinced that matters, because what the user-developer is
trying to accomplish post-build isn't relevant to getting a successful
build. And also, this is just one example of how apparently easy it is
to take down a system with an unprivileged task, per the various
discussions I've had with members of the Workstation WG.

Anyway, the build fails for a different reason when I use Debug
instead of RelWithDebInfo so I can't test it.

In file included from Source/JavaScriptCore/config.h:32,
from Source/JavaScriptCore/llint/LLIntSettingsExtractor.cpp:26:
Source/JavaScriptCore/runtime/JSExportMacros.h:32:10: fatal error:
wtf/ExportMacros.h: No such file or directory
32 | #include <wtf/ExportMacros.h>
| ^~~~~~~~~~~~~~~~~~~~
compilation terminated.
[1131/2911] Building CXX object
ninja: build stopped: subcommand failed.

Thanks. I'll have to defer to others about how to incorporate this so
that the default build more intelligently takes actual resources into
account. My strong bias is that the user-developer can't be burdened
with knowing esoteric things. The defaults should just work.

Let's take another argument: if the user manually specifies 'ninja -j
64' on this same system, is that sabotage? I'd say it is. So why isn't
it sabotage that the ninja default computes N jobs as nrcpus + 2, and
doesn't take available memory into account when deciding what
resources to demand? I can build Linux all day long on this system
with its defaults and never run into a concurrent usability problem.

There does seem to be a dual responsibility, somehow, between the
operating system and the application, to make sure sane requests are
made and honored.

That's an optimization. It can't be used as an excuse for an
unprivileged task taking down a system.

Re: Better interactivity in low-memory situations

By Jan Kratochvil at 08/11/2019 - 12:21

On Sun, 11 Aug 2019 17:50:17 +0200, Chris Murphy wrote:
Most of the time. What is your use case for it?

With a powerful enough machine everything is possible. Just be aware
that RelWithDebInfo is the most resource-demanding option compared to
Release and Debug, and at the same time it is the least useful one for
local builds.

You are reinventing the wheel the Fedora packager has already built
for this package. I guess you are missing some dependency. If you have
a problem, stick to the proven build (unless it is temporarily FTBFS,
which this package currently is not). I think Fedora recommends mock
for such a rebuild, but I find mock inconvenient for local
development, so I use this instead (I have some scripts for it):
dnf download --source webkit2gtk3
mkdir webkit2gtk3-2.24.3-1.fc30.src
cd webkit2gtk3-2.24.3-1.fc30.src
rpm2cpio ../webkit2gtk3-2.24.3-1.fc30.src.rpm|cpio -id
function rpmbuildlocal {
    time MAKEFLAGS= rpmbuild \
        --define "_topdir $PWD" --define "_builddir $PWD" \
        --define "_rpmdir $PWD" --define "_sourcedir $PWD" \
        --define "_specdir $PWD" --define "_srcrpmdir $PWD" \
        --define "_build_name_fmt %%{NAME}-%%{VERSION}-%%{RELEASE}.%%{ARCH}.rpm" \
        "$@"
    rmdir &>/dev/null BUILDROOT
}
# Is the .src.rpm rebuild still needed?
rpmbuildlocal -bs *.spec
sudo dnf builddep webkit2gtk3-2.24.3-1.fc30.src.rpm
rm webkit2gtk3-2.24.3-1.fc30.src.rpm
rpmbuildlocal -bc webkit2gtk3.spec 2>&1|tee log
# or -bb or what do you want.
It has built fine for me here now.

For untrusted users, Linux has given up on that; it is too big a can
of worms. Use a virtual machine (KVM) with specified resources (memory
size). Nowadays it should also be possible, with less overhead, using
Docker containers.

If you mean some local builds of your own causing a runaway, then:
(1) Turn off swap, as RAM is cheap enough today. If something really
runs out of RAM, it gets killed by the kernel OOM killer.
(2) Have the swap on NVMe; from my experience that does not kill the
machine.
(3) Use some reasonable ulimits in your ~/.bash_profile.
(4) When the machine is really unresponsive, log in from a different
box and kill the culprits. From my own experience the machine is still
able to accept a new SSH connection, albeit a bit slowly.
But yes, I agree this problem has, AFAIK, no perfect solution.


Re: Better interactivity in low-memory situations

By Chris Murphy at 08/11/2019 - 13:54

On Sun, Aug 11, 2019 at 11:21 AM Jan Kratochvil
<jan. ... at redhat dot com> wrote:
My use case is testing the responsiveness of Fedora Workstation under
CPU and memory pressure, as experienced by an ordinary user.

That's out of scope.

I said from the outset this is an example. The central topic is that
an unprivileged program is able to ask for resources that do not
exist, and the operating system tries and fails to supply those
resources, resulting not only in task failure but in the loss of the
entire system. In this example the user is doing other things
concurrently, and likely experiences data loss and possibly even file
system corruption as a direct consequence of having to force power off
on the machine, because for all practical purposes normal control has
been lost.

I don't think it's acceptable in 2019 that an unprivileged task takes
out the entire operating system. As I mentioned in the very first
post, remote ssh was not responsive for 30 minutes, at which point I
gave up and forced power off. It's a bit of a trap, though, to suggest
the user needs the ability and skill to ssh in remotely to kill off
runaway programs; I refuse that premise.

It's completely sane for an ordinary user to consider that control of
the system has been lost immediately upon experiencing a frozen mouse
pointer.

Re: Better interactivity in low-memory situations

By Jan Kratochvil at 08/11/2019 - 14:02

On Sun, 11 Aug 2019 20:54:28 +0200, Chris Murphy wrote:
Not really; this is what a journaling filesystem is there for.

But there can still be application-level data corruption if an
application does not handle its sudden termination properly.
That should be rare, but IIRC I did see it, for example with Firefox.


Re: Better interactivity in low-memory situations

By Chris Murphy at 08/11/2019 - 16:05

On Sun, Aug 11, 2019 at 1:02 PM Jan Kratochvil
<jan. ... at redhat dot com> wrote:
Successful journal replay obviates the need for fsck; it has nothing
to do with avoiding corruption. And in any case, anything the user is
working on that isn't already saved and committed to stable media
isn't going to survive the poweroff.

I think that at the point the mouse pointer has frozen, the user has
no practical means of controlling or interacting with the system; it's
a failure.

In the short term, is it reasonable and possible to get the oom killer
to trigger sooner, and thereby avoid the system becoming unresponsive
in the first place? The oom score for almost all processes is 0, and
niced processes have their oom score increased. I'm not seeing levers
to control how aggressive it is, only a way of hinting at which
processes can be more readily subject to being killed. In fact, a
requirement of the oom killer is that swap is completely consumed; and
if swap is on anything other than a fast SSD, swapping creates its own
performance problems well before oom can be a rescuer. I think I just
argued against my own question.
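For completeness, the hinting lever itself: any unprivileged process may raise (though not lower) its own /proc/&lt;pid&gt;/oom_score_adj, and util-linux ships choom(1) as a wrapper. A sketch of volunteering a build shell as the preferred victim; the value 500 is an arbitrary bias, not a recommendation:

```shell
# Raise this shell's OOM badness; children (the build jobs) inherit it,
# so cc1plus gets killed before the desktop session does. The range is
# -1000 (never kill) to 1000 (kill first); only lowering needs root.
echo 500 > /proc/self/oom_score_adj
cat /proc/self/oom_score_adj    # prints 500

# Equivalent with util-linux: choom -p $$ -n 500
```

As the surrounding discussion notes, this only changes which process dies; it does nothing to make the oom killer trigger earlier.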

Re: Better interactivity in low-memory situations

By Benjamin Kircher at 08/12/2019 - 01:29

Yes you just did :-)

From what I understand from this LKML thread [1], fast swap on NVMe is only part of the issue (or adds to it). The kernel really, really tries hard not to OOM kill anything and to keep the system going, and this overcommitment is where it eventually becomes unresponsive, to the extent that the machine needs to be hard rebooted.

The LKML thread also mentions that user-space OOM handling could help.

But what about cgroups? Isn't there a systemd utility that helps me wrap processes in resource-constrained groups? Something along the lines of

$ systemd-run -p MemoryLimit=1G firefox

(Not tested.) I imagine that a well-behaved program will handle a failed malloc by ending itself?

BTW, this happens not only on Linux. I'm used to dealing with quite big files in my day job, and if you accidentally write some… em… very unsophisticated code that attempts to read an entire file into memory at once, you can experience the same behavior on a recent macOS, too. You're left with no option but to force reboot your machine.

[1]


Re: Better interactivity in low-memory situations

By Chris Murphy at 08/12/2019 - 10:40

On Mon, Aug 12, 2019 at 12:30 AM Benjamin Kircher
<benjamin. ... at gmail dot com> wrote:
If I just run the example program with, let's say, the systemd
MemoryLimit set to /proc/meminfo's MemAvailable, the program is still
going to try to bust out of that and fail, and the failure reason is
non-obvious. Still, this is definitely an improvement in that the
system isn't taken down entirely.

How to do this automatically? Could there be a mechanism for the
system and the requesting application to negotiate resources?

One reality is, the system isn't a good estimator of system
responsiveness from the user's point of view. Anytime swap is under
significant pressure (what's the definition of significant?) the
system is effectively lost at that point, *if* this is a desktop
system (includes laptops). In the example case, once swap is being
heavily used on either the SSD, or on ZRAM, the mouse pointer is
frozen variably 50%-90% of the time. It's not a usable system, well
before swap is full. How does the system learn that a light swap rate
is OK, but a heavy swap rate will lead to an angry user? And even
heavy swap might be OK on NVMe, or on a server.

Right now the only lever to avoid swap, is to not create a swap
partition at installation time. Or create a smaller one instead of 1:1
ratio with RAM. Or use a 1/4 RAM sized swap on ZRAM. A consequence of
each of these alternatives, is hibernation can't be used. Fedora
already explicitly does not support hibernation, but strictly that
means we don't block release on hibernation related bugs. Fedora does
still create a swap that meets the minimum size for hibernation, and
also inserts the required 'resume' kernel parameter to locate the
hibernation image at the next boot. So we kinda sorta do support it.

Another reality is that the example program also doesn't have a good
way of estimating the resources it needs. It has some levers that just
aren't being used by default, including the -l option, which reads "do
not start new jobs if the load average is greater than N". But that's
different from "tell me the box sizes you can use", with the system
then supplying a matching box and the program working within it.
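Pending any real negotiation mechanism, a wrapper can at least derive -j from both CPU count and free memory. A rough sketch; the 2GiB-per-job estimate is my guess for WebKitGTK-sized C++ compile jobs, not a measured figure:

```shell
#!/bin/sh
# Jobs = min(nrcpus + 2, MemAvailable / 2GiB-per-job), floor of 1, so
# that memory as well as CPU count bounds the build's parallelism.
cpu_jobs=$(( $(nproc) + 2 ))
mem_kib=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
mem_jobs=$(( mem_kib / (2 * 1024 * 1024) ))
[ "$mem_jobs" -lt 1 ] && mem_jobs=1
jobs=$cpu_jobs
[ "$mem_jobs" -lt "$jobs" ] && jobs=$mem_jobs
echo "$jobs"
# e.g.: ninja -j "$jobs" -l "$(nproc)"
```

On the 8GiB test machine this would pick 2-3 jobs rather than 10, which matches the Case 4 observation that -j 4 keeps the system usable.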

Re: Better interactivity in low-memory situations

By Benjamin Kircher at 08/12/2019 - 11:44

Honestly, right now, doing this automatically is not possible.

Instead, we anticipate the workload or the nature of the work. Just as, when we connect remotely to a box and start some long-running process, we anticipate trouble with the network and use a terminal multiplexer, right? The same goes for resource-intensive processes.

But in the future, I could imagine this whole control group mechanism really paying off in a way where we distribute system resources automatically.

Isn't that what Silverblue is all about? Having a base system, and on top of that everything runs in a container that could potentially be resource constrained?


Re: Better interactivity in low-memory situations

By Lennart Poettering at 08/12/2019 - 11:16

Ideally, GNOME would run all its apps as systemd --user services. We
could then set DefaultMemoryHigh= globally for the systemd --user
instance to some percentage value (which is taken relative to the
physical RAM size). This would then mean every user app individually
could use — let's say — 75% of the physical RAM size and when it wants
more it would be penalized during reclaim compared to apps using less.

If GNOME ran all apps as user services we could do various other nice
things too. For example, it could dynamically assign the foreground
app more CPU/IO weight than the background apps, if the system is
starved of resources.

We could add a mode to systemd's hibernation support to only "swapon"
a swap partition immediately before hibernating, and "swapoff" it
right after coming back. This has been proposed before, but noone so
far did the work on it. But quite frankly this feels like just taping
over the fact that the Linux kernel is rubbish when it comes to
handling low-memory situations.

As suggested above, I think DefaultMemoryHigh=75% would be an OK
approach, which would allow us to adjust to the "beefiness" of a
machine automatically.

Re: Better interactivity in low-memory situations

By Daniel Xu at 09/13/2019 - 08:43

Running each app as systemd --user services is something we've been trying to encourage teams to do at FB. It lets monitor things much better using the cgroup control files.

In addition, it lets us configure oomd to do much more intelligent things than kill the entire session. oomd is being proposed as a Fedora package right now. I think the last missing piece for oomd to be really useful on desktop systems is the --user slice changes.

Re: Better interactivity in low-memory situations

By Chris Murphy at 08/19/2019 - 14:58

On Mon, Aug 12, 2019 at 10:20 AM Lennart Poettering
< ... at 0pointer dot de> wrote:
I'm skeptical as well. But to further explore this:

1. Does the kernel know better than to write a hibernation image (all
or part) to a /dev/zram device? e.g. a system with: 8GiB RAM, 8GiB
swap on ZRAM, 8GiB swap partition. We can use swap priority to use the
ZRAM device first, and conventional swap partition second. If the
user, today, were to hibernate, what happens?

2. Are you suggesting it would be possible to build support for
multiple swaps and have them dynamically enabled/disabled? e.g. the
same system as above, but the 8GiB swap on disk is actually made
across two partitions. i.e. a 2GiB partition and 6GiB partition.
Normal operation would call for swapon for /dev/zram *and* the small
on-disk swap. Only for hibernation would swapon happen for the larger
on-disk swap partition (the 2GiB one always stays on).
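The swap-priority arrangement described here can be sketched in /etc/fstab (device names hypothetical; a higher pri= value is used first):

```
/dev/zram0  none  swap  defaults,pri=100  0 0
/dev/sda2   none  swap  defaults,pri=10   0 0
# /dev/sda3 (the larger 6GiB partition) deliberately absent:
# it would only be swapon'd immediately before hibernation.
```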

That's... interesting. It sounds potentially complicated. I can't
estimate if it could be fragile.

Let's consider something else: Hibernation is subject to kernel
lockdown policy on UEFI Secure Boot enabled computers. What percentage
of Fedora users these days are likely subject to this lockdown? Are we
able to effectively support hibernation? On the one hand, Fedora does
not block on hibernation bugs (kernel or firmware), thus not
supported. But tacitly hibernation is supported because a bunch of
users pushed an effort with Anaconda folks to make sure the swap
device is set with the "resume=" boot parameter out of the box.

Another complicating issue: the Workstation working group has an issue
to explore better protecting user data by encrypting /home by default.
Of course, user data absolutely can and does leak into swap. Therefore
I think we're obligated to consider encrypting swap too. And if swap
is encrypted, how does resume from hibernation work? I guess
kernel+initramfs load, and plymouth asks for passphrase which unlocks
encrypted swap, and the kernel knows to resume from that device-mapper
device.

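A sketch of how that flow could look on disk, assuming LUKS-encrypted swap unlocked by passphrase (a plain dm-crypt swap keyed randomly at each boot could not be resumed from; device names hypothetical):

```
# /etc/crypttab
swap  /dev/sda3  none  luks

# kernel command line
resume=/dev/mapper/swap
```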
I'm really skeptical of pissing off users who want hibernation to
work. But I'm also very skeptical of compromising other priorities,
and diverting resources, just for hibernation.

If you wait long enough between replies, I will find another log to
throw on this fire, somewhere. :-D

Re: Better interactivity in low-memory situations

By Lennart Poettering at 08/20/2019 - 03:15

Userspace takes care of this. It tells the kernel which swap device to
hibernate to; nowadays it understands that zswap is not a
candidate, and picks the largest swap device with the highest priority:


Yes, that's what I was suggesting.

Yeah. It's an idea. Not sure it's a good one though.

We probably should look into supporting hibernation to encrypted swap
with a key tied to the TPM. That way hibernation should be fully safe.

I am pretty sure swap encryption really should be tied to the TPM. In
fact, it's one of the very few cases where tying things to the TPM
exclusively really makes sense.

So far no one has prepared convincing patches to do this though. If
anyone wants to look into this, I'd be happy to review a patch for
systemd-cryptsetup for example.


Re: Better interactivity in low-memory situations

By Chris Murphy at 08/20/2019 - 14:00

On Tue, Aug 20, 2019 at 2:15 AM Lennart Poettering < ... at 0pointer dot de> wrote:
For what it's worth, swap on /dev/zram is a totally different thing than zswap.

/dev/zram is just a compressed RAM disk. You can configure a size, but
it consumes memory only as it actually gets used (dynamic allocation).
This can be used for swap standalone; no conventional disk based swap
partition is needed. But if there is one, and it's set to a lower
priority than swap on /dev/zram, then it has the effect of spilling
over (but the spill over is uncompressed).

zswap basically always compresses all of swap, with a predefined size
memory pool "cache", and requires a conventional disk based swap
partition as the spill over. Spill over is also compressed.
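For reference, enabling zswap really is just a few parameters on the kernel command line; these are real zswap module options, though the compressor, allocator, and pool size shown are illustrative choices:

```
zswap.enabled=1 zswap.compressor=lz4 zswap.max_pool_percent=20 zswap.zpool=zbud
```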

They superficially sound very similar but the strategies differ in the
details. I've been using both strategies (separately), but have
the most experience with zswap even though above I was referring to
swap on a ZRAM device. I know, so many Z's. But the gist is, I can't
really discern any differences from a user point of view.

Zswap needs just a few kernel parameters to set up. Swap on zram, by
contrast, requires a service unit file to set up the block device, run
mkswap, and then swapon.
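Such a unit might look roughly like this; a hand-rolled sketch, not the actual Fedora zram package or zram-generator, with size and priority chosen for illustration:

```ini
# /etc/systemd/system/zram-swap.service
[Unit]
Description=Swap on compressed RAM (zram)
DefaultDependencies=no
Before=swap.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStartPre=/usr/sbin/modprobe zram
ExecStartPre=/bin/sh -c 'echo 8G > /sys/block/zram0/disksize'
ExecStartPre=/usr/sbin/mkswap /dev/zram0
ExecStart=/usr/sbin/swapon -p 100 /dev/zram0
ExecStop=/usr/sbin/swapoff /dev/zram0

[Install]
WantedBy=swap.target
```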

The swap on ZRAM thing is further complicated by multiple
implementations, and the preferred systemd zram-generator is
apparently broken.

IoT folks are using swap on ZRAM now, via the Fedora zram package
(systemd unit file to set everything up). Anaconda folks have their
own built-in swap on ZRAM setup that runs on low memory systems when
anaconda is launched. This happens on both Fedora netinstalls and
LiveOS. And it makes sense for those use cases where a disk based swap
partition doesn't exist, and maybe shouldn't.

Whereas for servers and workstations, zswap is well suited, as they're
perhaps more likely to have a conventional swap partition and have use
cases where spillover is likely.


So why not zswap? Well, kernel documentation still shows it as
experimental, but upstream considers it stable enough for production
use with the zbud allocator now, and they think the z3fold allocator
will be by the end of the summer.


Re: Better interactivity in low-memory situations

By Benjamin Kircher at 08/12/2019 - 12:06

I really like the ideas. Why isn’t this done this way anyway?

I don’t have a GNOME desktop at hand right now to investigate how GNOME starts applications and so on but aren’t new processes started by the user — GNOME or not — always children of the user.slice? Is there a difference if I start a GNOME application or a normal process from my shell?

And for the beginning, wouldn’t it be enough to differentiate between user slices and system slice and set DefaultMemoryHigh= in a way to make sure there is always some headroom left for the system?


(… I definitely need to play around with Silverblue to learn what they are doing.)

Re: Better interactivity in low-memory situations

By Emery Berger at 08/12/2019 - 12:56

For what it's worth, my research group attacked basically exactly this
problem some time ago. We built a modified Linux kernel that we called
Redline that was utterly resilient to fork bombs, malloc bombs, and so on.
No process could take down the system, much less unprivileged ones. I think
some of the ideas we described back then would be worth adopting / adapting
today (the code is of course hopelessly out of date: we published our paper
on this at OSDI 2008).

We had a demo where we would run two identical systems, side by side, with
the same workloads (a number of videos playing simultaneously), but with
one running Redline, and the other running stock Linux. We would launch a
fork/malloc bomb on both. The Redline system barely hiccuped. The stock
Linux kernel would freeze and become totally unresponsive (or panic). It
was a great demo, but also a pain, since we invariably had to restart the
stock Linux box :).

Redline: first class support for interactivity in commodity operating systems

While modern workloads are increasingly interactive and resource-intensive
(e.g., graphical user interfaces, browsers, and multimedia players),
current operating systems have not kept up. These operating systems, which
evolved from core designs that date to the 1970s and 1980s, provide good
support for batch and command-line applications, but their ad hoc attempts
to handle interactive workloads are poor. Their best-effort, priority-based
schedulers provide no bounds on delays, and their resource managers (e.g.,
memory managers and disk I/O schedulers) are mostly oblivious to response
time requirements. Pressure on any one of these resources can significantly
degrade application responsiveness.

We present Redline, a system that brings first-class support for
interactive applications to commodity operating systems. Redline works with
unaltered applications and standard APIs. It uses lightweight
specifications to orchestrate memory and disk I/O management so that they
serve the needs of interactive applications. Unlike realtime systems that
treat specifications as strict requirements and thus pessimistically limit
system utilization, Redline dynamically adapts to recent load, maximizing
responsiveness and system utilization. We show that Redline delivers
responsiveness to interactive applications even in the face of extreme
workloads including fork bombs, memory bombs and bursty, large disk I/O
requests, reducing application pauses by up to two orders of magnitude.

Paper here (in case the attachment fails):


And links to code here:


There has been some recent follow-on work in this direction: see this work
out of Remzi and Andrea's lab at Wisconsin:

-- emery

Re: Better interactivity in low-memory situations

By Chris Murphy at 09/01/2019 - 16:20

On Mon, Aug 12, 2019 at 5:47 PM Emery Berger <emery. ... at gmail dot com> wrote:
I'm unable to find any concurring or dissenting opinions on this. What
kind of peer review has it received? Was it ever raised with upstream
kernel developers? What were their responses?

I wonder if the question of interactivity is still just not a priority
upstream, as they see various competing user space solutions for
this problem, suggesting a generic solution is either not
practical to incorporate into the kernel, or maybe isn't desired?

Re: Better interactivity in low-memory situations

By Dan ... at 09/03/2019 - 16:17

Chris Murphy < ... at colorremedies dot com> writes:

I have only read parts of the Redline paper so I do not know if anyone
ever tried to submit it upstream.

Judging from the Redline webpage, it appears to only ever have been
implemented on i386 and nowhere else (albeit that shouldn't be hard
to fix). Furthermore it does not support NUMA, which might be a bigger
issue.

My guess is that Redline might clash with upstream Linux's general
idea of how processes should be scheduled. Redline solves the problem of
keeping interactive applications interactive even under severe memory
pressure by changing the way they are scheduled, allocated memory and
how much data they are allowed to read from disks. If an application is
classified as interactive (in contrast to best-effort tasks, which
corresponds to a process in the current Linux kernel), then it will get
a requested amount of CPU time each x ms (e.g. to be able to run at 25
fps). Something comparable is done with memory and disk usage.
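The closest analogue in today's cgroup v2 interface is CPU bandwidth control; note that cpu.max expresses a ceiling rather than Redline's guarantee, but the shape of the specification ("x CPU time every y ms") is similar:

```
# contents of /sys/fs/cgroup/<group>/cpu.max: "$MAX $PERIOD" in
# microseconds; this allows 25ms of CPU time every 100ms period
25000 100000
```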

This is a pretty nice approach in my opinion but it has certain
drawbacks:
- scheduling gets more complicated
- you need additional system calls to tell the kernel which processes
are interactive (otherwise they are treated the "old" way and you gain
nothing)
- you need a userspace component that has a database of interactive
tasks (with a small set of configs, e.g. how often does your process
need a chunk of the CPU time)

It could be that the kernel community would perceive that as a blocker
and would instead prefer a different and more generic solution (this is
just my personal guess). It could also very well be that no one had time
to actually upstream this, as it was an academic project (no offense
intended, I've been in academia myself and know how things go).
Unfortunately, Redline was developed more than a decade ago, so
upstreaming it nowadays is probably equivalent to a full rewrite, given
the kernel's development pace.



Re: Better interactivity in low-memory situations

By Daniel Xu at 09/13/2019 - 08:46

Our team at FB is working on a similar (but more generic) solution. All of our work is open source / upstreamed into the Linux kernel and we're running it in production at quite a large scale already. Results are very promising. We'll be presenting it at All Systems Go (multiple talks) this year.

We'd love to chat in-person if anyone is interested.

Re: Better interactivity in low-memory situations

By Chris Murphy at 08/12/2019 - 16:57

On Mon, Aug 12, 2019 at 11:07 AM Benjamin Kircher
<benjamin. ... at gmail dot com> wrote:
I'm pretty sure Silverblue will be rebased on Fedora CoreOS which
recently released a preview. I'm not sure what the time frame for that
is, but maybe that work will be concurrent with work on a release
version of Fedora CoreOS. The central means of installing/uninstalling
and running applications on a future immutable system is flatpak. But
you don't need to commit a system to Silverblue to use and test
flatpak applications on Fedora 29/30 Workstation. Containerization is
an option, not a requirement, of flatpaks, as is running one as a
systemd --user service.

Since layering is permitted with rpm-ostree based systems, using
overlayfs, there still needs to be some way for the per-user service
manager to enforce limits on unprivileged programs. The use of the
word "limit" might be misleading. Perhaps the focus should instead be
on defining and preserving the user interface responsiveness, whether
that's CLI or GUI, so that control isn't lost. i.e. the unprivileged
program gets the leftover resources, it's not a peer with the user
interface. Promoting the active user interfaces relative to the
unprivileged task would provide a way of effectively containing the
unprivileged tasks, by one always being able to preempt the other.
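That kind of promotion can already be expressed with systemd resource-control settings; a sketch that deprioritizes a hypothetical background slice rather than hard-limiting it (slice name and values illustrative):

```ini
# ~/.config/systemd/user/background.slice.d/50-deprioritize.conf
# Weights are relative: foreground units at the default weight (100)
# win under contention, while background work gets the leftovers.
[Slice]
CPUWeight=20
IOWeight=20
```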

Re: Better interactivity in low-memory situations

By Lennart Poettering at 08/12/2019 - 14:01

Well, let's just say certain popular container managers blocked
switching to cgroupsv2, and only with cgroupsv2 is delegating cgroup
subtrees to unprivileged users safe. Hence doing this kind of
resource management wasn't really doable without ugly hacks.

But it appears cgroupsv2 has a chance of becoming a reality on
Fedora now, and this opens a lot of doors.

Well, "user.slice" is a concept of the *system* service manager, but
desktop apps are if anything a concept of the *per-user* service
manager.

From the system service manager's PoV all user apps together make up
the user's 'user@.service' instance; it doesn't look below.

i.e. cgroups is hierarchical, and various components can manage their
own subtrees. PID 1 manages the top of the tree, and the per-user
service manager a subtree of it that is below it and arranges per-user
apps below that. But from PID1's PoV each of those per-user subtrees
is opaque and it won't do resource management beneath that
boundary. It's the job of the per-user service manager to do resource
management there.


Re: Better interactivity in low-memory situations

By Michael Catanzaro at 08/11/2019 - 11:35

On Sun, Aug 11, 2019 at 10:50 AM, Chris Murphy
< ... at colorremedies dot com> wrote:
This seems like a distraction from the real goal here, which is to
ensure Fedora remains responsive under heavy memory pressure, and to
ensure unprivileged processes cannot take down the system by allocating
large amounts of memory. Fixing ninja and make to dynamically scale the
number of parallel build processes based on memory pressure would be
wonderful, but it's not going to solve the underlying issue here, which
is that random user processes should never be able to hang the system.
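As an illustration of what such scaling could look like (not something ninja or make does today), a small shell helper that caps the job count by available memory, assuming a rule-of-thumb 2GiB per concurrent compile job:

```shell
#!/bin/sh
# Hypothetical helper: print a ninja/make -j value bounded by both the
# CPU count and available memory (~2GiB per compile job assumed).
mem_kib=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
mem_jobs=$(( mem_kib / (2 * 1024 * 1024) ))   # 2GiB expressed in KiB
if [ "$mem_jobs" -lt 1 ]; then mem_jobs=1; fi
cpu_jobs=$(nproc)
jobs=$(( mem_jobs < cpu_jobs ? mem_jobs : cpu_jobs ))
echo "$jobs"
```

Used as, e.g., `ninja -j "$(sh memjobs.sh)"` (script name hypothetical).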


Re: Better interactivity in low-memory situations

By Chris Murphy at 08/11/2019 - 13:56

On Sun, Aug 11, 2019 at 10:36 AM < ... at gnome dot org> wrote:
That's fair.

Re: Better interactivity in low-memory situations

By Chris Murphy at 08/09/2019 - 22:51

Just in case anyone wants to try to reproduce this particular example:

1. Grab latest stable from here and untar it
2. Run this included script, which is dnf aware, to install dependencies
3. Additional packages I had to install to get it to build
sudo dnf install ruby-devel openjpeg2-devel woff2-devel

Re: Better interactivity in low-memory situations

By Omair Majid at 08/09/2019 - 17:22


Chris Murphy < ... at colorremedies dot com> writes:

It sounds like the same issue that has been in the news recently:

- Phoronix: Linux-Does-Bad-Low-RAM


(I learned about this bug the hard way; my machine experienced this bug
in the middle of a public presentation a few years ago.)