From: Johannes Buchner <buchner.johannes@gmx.at>
To: linux-kernel@vger.kernel.org
Subject: Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
Date: Tue, 6 Aug 2019 10:57:06 +0200 [thread overview]
Message-ID: <5a42d32d-df03-85ae-d487-3faaa9f1fd9a@gmx.at> (raw)
> On Mon, Aug 5, 2019 at 12:31 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>>
>> On Mon, Aug 05, 2019 at 02:13:16PM +0200, Vlastimil Babka wrote:
>> > On 8/4/19 11:23 AM, Artem S. Tashkinov wrote:
>> > > Hello,
>> > >
>> > > There's this bug which has been bugging many people for many years
>> > > already and which is reproducible in less than a few minutes under the
>> > > latest and greatest kernel, 5.2.6. All the kernel parameters are set to
>> > > defaults.
>> > >
>> > > Steps to reproduce:
>> > >
>> > > 1) Boot with mem=4G
>> > > 2) Disable swap to make everything faster (sudo swapoff -a)
>> > > 3) Launch a web browser, e.g. Chrome/Chromium and/or Firefox
>> > > 4) Start opening tabs in either of them and watch your free RAM decrease
>> > >
>> > > Once you hit a situation when opening a new tab requires more RAM than
>> > > is currently available, the system will stall hard. You will barely be
>> > > able to move the mouse pointer. Your disk LED will be flashing
>> > > incessantly (I'm not entirely sure why). You will not be able to run new
>> > > applications or close currently running ones.
>> >
>> > > This little crisis may continue for minutes or even longer. I think
>> > > that's not how the system should behave in this situation. I believe
>> > > something must be done about that to avoid this stall.
>> >
>> > Yeah that's a known problem, made worse by SSDs in fact, as they are able
>> > to keep refaulting the last remaining file pages fast enough, so there
>> > is still apparent progress in reclaim and OOM doesn't kick in.
>> >
>> > At this point, the likely solution will be probably based on pressure
>> > stall monitoring (PSI). I don't know how far we are from a built-in
>> > monitor with reasonable defaults for a desktop workload, so CCing
>> > relevant folks.
>>
>> Yes, psi was specifically developed to address this problem. Before
>> it, the kernel had to make all decisions based on relative event rates
>> but had no notion of time. Whereas to the user, time is clearly an
>> issue, and in fact makes all the difference. So psi quantifies the
>> time the workload spends executing vs. spinning its wheels.
>>
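
The stall time psi quantifies is exported to userspace in /proc/pressure/memory
(since kernel 4.20). A minimal sketch of parsing one line of it:

```python
def parse_psi_line(line):
    """Parse one line of /proc/pressure/memory, e.g.
    'full avg10=12.34 avg60=5.67 avg300=1.23 total=123456'.
    Returns the line kind ('some' or 'full') and a dict of float fields."""
    kind, rest = line.split(None, 1)
    fields = dict(pair.split("=") for pair in rest.split())
    return kind, {k: float(v) for k, v in fields.items()}

# Example with a sample line; on a real system you would read
# the two lines of /proc/pressure/memory instead:
kind, stats = parse_psi_line(
    "full avg10=80.00 avg60=40.00 avg300=10.00 total=9876543")
print(kind, stats["avg10"])  # full 80.0
```

A monitor like oomd polls these averages and acts when they cross a
configured threshold.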
>> But choosing a universal cutoff for killing is not possible, since it
>> depends on the workload and the user's expectation: GUI and other
>> latency-sensitive applications care way before a compile job or video
>> encoding would care.
>>
>> Because of that, there are things like oomd and lmkd as mentioned, to
>> leave the exact policy decision to userspace.
>>
>> That being said, I think we should be able to provide a bare minimum
>> inside the kernel to avoid complete livelocks where the user does not
>> believe the machine would be able to recover without a reboot.
>>
>> The goal wouldn't be a glitch-free user experience - the kernel does
>> not know enough about the applications to even attempt that. It should
>> just not hang indefinitely. Maybe similar to the hung task detector.
>>
>> How about something like the below patch? With that, the kernel
>> catches excessive thrashing that happens before reclaim fails:
>>
>> [snip]
>>
>> +
>> +#define OOM_PRESSURE_LEVEL 80
>> +#define OOM_PRESSURE_PERIOD (10 * NSEC_PER_SEC)
>
> 80% of the last 10 seconds spent in full stall would definitely be a
> problem. If the system was already low on memory (which it probably
> is, or we would not be reclaiming so hard and registering such a big
> stall) then the OOM killer would probably kill something before 8 seconds
> have passed. If my line of thinking is correct, then do we really
> benefit from such an additional protection mechanism? I might be wrong
> here because my experience is limited to embedded systems with
> relatively small amounts of memory.
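
For concreteness, the cutoff being questioned above - 80% of a 10-second
window spent fully stalled - amounts to a check like this (a sketch
mirroring the defines from the snipped patch):

```python
# Sketch of the proposed in-kernel cutoff: trigger when the share of
# full-stall time over the sampling period reaches OOM_PRESSURE_LEVEL.
NSEC_PER_SEC = 10**9
OOM_PRESSURE_LEVEL = 80                   # percent
OOM_PRESSURE_PERIOD = 10 * NSEC_PER_SEC   # 10 seconds, in nanoseconds

def over_pressure(stall_ns, period_ns=OOM_PRESSURE_PERIOD):
    """True if full-stall time within the period crosses the level."""
    return stall_ns * 100 >= OOM_PRESSURE_LEVEL * period_ns

print(over_pressure(8 * NSEC_PER_SEC))  # 8s of 10s stalled -> True
print(over_pressure(5 * NSEC_PER_SEC))  # 5s of 10s stalled -> False
```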
When one or more processes fight for memory, much of the time is spent
stalling. Would an acceptable alternative strategy be, instead of
killing a process, to stop processes in proportion to their stall time
and memory usage? By stop I mean delay their scheduling (akin to kill
-STOP/sleep/kill -CONT), or interleave the scheduling of
large-memory-consuming processes so they do not have to fight against
each other.
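
A userspace sketch of that idea (hypothetical: the proportionality rule
in pause_for and the 2-second cap are made up for illustration; a real
tool would derive the shares from psi and /proc/<pid>/status):

```python
import os
import signal
import time

def pause_for(stall_share, mem_share, max_pause=2.0):
    """Pause duration proportional to a process's share of the measured
    stall time and of total memory use (both in [0, 1])."""
    return max_pause * stall_share * mem_share

def throttle(pid, stall_share, mem_share):
    """Briefly stop a process instead of killing it, then resume it,
    so heavy memory users take turns rather than thrash each other."""
    pause = pause_for(stall_share, mem_share)
    os.kill(pid, signal.SIGSTOP)   # akin to kill -STOP
    time.sleep(pause)
    os.kill(pid, signal.SIGCONT)   # akin to kill -CONT
    return pause
```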
Cheers,
Johannes