Re: [PATCH (resend)] mm,oom: Defer dump_tasks() output.

From: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
To: Michal Hocko <mhocko@kernel.org>
Cc: linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Petr Mladek <pmladek@suse.com>,
	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH (resend)] mm,oom: Defer dump_tasks() output.
Date: Mon, 9 Sep 2019 21:40:24 +0900	[thread overview]
Message-ID: <579a27d2-52fb-207e-9278-fc20a2154394@i-love.sakura.ne.jp> (raw)
In-Reply-To: <20190909113627.GJ27159@dhcp22.suse.cz>

On 2019/09/09 20:36, Michal Hocko wrote:
> On Sat 07-09-19 19:54:32, Tetsuo Handa wrote:
>> (Resending to LKML as linux-mm ML dropped my posts.)
>>
>> If /proc/sys/vm/oom_dump_tasks != 0, dump_header() can become very slow
>> because dump_tasks() synchronously reports all OOM victim candidates, and
>> as a result ratelimit test for dump_header() cannot work as expected.
>>
>> This patch defers dump_tasks() output till oom_lock is released. As a
>> result of this patch, the latency between out_of_memory() is called and
>> SIGKILL is sent (and the OOM reaper starts reclaiming memory) will be
>> significantly reduced.
>>
>> Since CONFIG_PRINTK_CALLER was introduced, concurrent printk() became less
>> problematic. But we still need to correlate synchronously printed messages
>> and asynchronously printed messages if we defer dump_tasks() messages.
>> Thus, this patch also prefixes OOM killer messages using "OOM[$serial]:"
>> format. As a result, OOM killer messages would look like below.
>>
>>   [   31.935015][   T71] OOM[1]: kworker/4:1 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=-1, oom_score_adj=0
>>   (...snipped....)
>>   [   32.052635][   T71] OOM[1]: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),global_oom,task_memcg=/,task=firewalld,pid=737,uid=0
>>   [   32.056886][   T71] OOM[1]: Out of memory: Killed process 737 (firewalld) total-vm:358672kB, anon-rss:22640kB, file-rss:12328kB, shmem-rss:0kB, UID:0 pgtables:421888kB oom_score_adj:0
>>   [   32.064291][   T71] OOM[1]: Tasks state (memory values in pages):
>>   [   32.067807][   T71] OOM[1]: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
>>   [   32.070057][   T54] oom_reaper: reaped process 737 (firewalld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>>   [   32.072417][   T71] OOM[1]: [    548]     0   548     9772     1172   110592        0             0 systemd-journal
>>   (...snipped....)
>>   [   32.139566][   T71] OOM[1]: [    737]     0   737    89668     8742   421888        0             0 firewalld
>>   (...snipped....)
>>   [   32.221990][   T71] OOM[1]: [   1300]    48  1300    63025     1788   532480        0             0 httpd
>>
>> This patch might affect panic behavior triggered by panic_on_oom or no
>> OOM-killable tasks, for dump_header(oc, NULL) will not report OOM victim
>> candidates if there are not-yet-reported OOM victim candidates from past
>> rounds of OOM killer invocations. I don't know if that matters.
>>
>> For now this patch embeds "struct oom_task_info" into each
>> "struct task_struct". In order to avoid bloating "struct task_struct",
>> future patch might detach from "struct task_struct" because one
>> "struct oom_task_info" for one "struct signal_struct" will be enough.
> 
> This is not an improvement. It detaches the oom report and tasks_dump
> for an arbitrary amount of time because the worder context might be
> stalled for an arbitrary time. Even long after the oom is resolved.

A new worker thread is created if all existing worker threads are busy
because this patch solves OOM situation quickly when a new worker thread
cannot be created due to OOM situation.

Also, if a worker thread cannot run due to CPU starvation, the same thing
applies to dump_tasks(). In other words, dump_tasks() cannot complete due
to CPU starvation, which results in more costly and serious consequences.
Being able to send SIGKILL and reclaim memory as soon as possible is
an improvement.

> Not to mention that 1:1 (oom to tasks) information dumping is
> fundamentally broken. Any task might be on an oom list of different
> OOM contexts in different oom scopes (think of OOM happening in disjunct
> NUMA sets).

I can't understand what you are talking about. This patch just defers
printk() from /proc/sys/vm/oom_dump_tasks != 0. Please look at the patch
carefully. If you are saying that it is bad that OOM victim candidates for
OOM domain B, C, D ... cannot be printed if printing of OOM victim candidates
for OOM domain A has not finished, I can update this patch to print them.

> 
> This is just adding more kludges and making the code more complex
> without trying to address an underlying problems. So
> Nacked-by: Michal Hocko <mhocko@suse.com>

Since I'm sure that you are misunderstanding, this Nacked-by is invalid.

> 
> And more importantly it is _known_ that dump_tasks might be costly.

So what? That's an unfortunate reason people have to use
/proc/sys/vm/oom_dump_tasks == 0. That's not a reason we must not
offload costly dump_tasks().

Preemption from dump_tasks() is _fatally_ costly (e.g. SCHED_IDLE tasks
calls out_of_memory() while there are many realtime tasks running around).
This patch minimizes latency between "an allocating thread decided to
invoke the OOM killer" and "the OOM killer sends SIGKILL (and the OOM
reaper starts reclaiming memory)". In other words, this patch reduces
unoperational duration due to OOM condition.

> People who really need that information have to live with that fact.
> There is the way to disable this functionality. I wish we could have it
> disabled by default but it seems I am alone in that thinking so I am not
> going to push for it.

Setting /proc/sys/vm/oom_dump_tasks == 0 is not a solution.
We need a solution which can work with /proc/sys/vm/oom_dump_tasks != 0.