Re: [PATCH (resend)] mm,oom: Defer dump_tasks() output.

From: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
To: Michal Hocko <mhocko@kernel.org>
Cc: linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Petr Mladek <pmladek@suse.com>,
	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH (resend)] mm,oom: Defer dump_tasks() output.
Date: Tue, 10 Sep 2019 20:00:22 +0900	[thread overview]
Message-ID: <5bbcd93f-aa42-6c62-897a-d7b94aacdb87@i-love.sakura.ne.jp> (raw)
In-Reply-To: <20190909130435.GO27159@dhcp22.suse.cz>

On 2019/09/09 22:04, Michal Hocko wrote:
> On Mon 09-09-19 21:40:24, Tetsuo Handa wrote:
>> On 2019/09/09 20:36, Michal Hocko wrote:
>>> This is not an improvement. It detaches the oom report and tasks_dump
>>> for an arbitrary amount of time because the worder context might be
>>> stalled for an arbitrary time. Even long after the oom is resolved.
>>
>> A new worker thread is created if all existing worker threads are busy
>> because this patch solves OOM situation quickly when a new worker thread
>> cannot be created due to OOM situation.
>>
>> Also, if a worker thread cannot run due to CPU starvation, the same thing
>> applies to dump_tasks(). In other words, dump_tasks() cannot complete due
>> to CPU starvation, which results in more costly and serious consequences.
>> Being able to send SIGKILL and reclaim memory as soon as possible is
>> an improvement.
> 
> There might be zillion workers waiting to make a forward progress and
> you cannot expect any timing here. Just remember your own experiments
> with xfs and low memory conditions.

Even if there were zillion workers waiting to make a forward progress, the
worker for processing dump_tasks() output can make a forward progress. That's
how workqueue works. (If you still don't trust workqueue, I can update my patch
to use a kernel thread.) And if there were zillion workers waiting to make a
forward progress, completing the OOM killer quickly will be more important than
keep blocking zillion workers waiting for the OOM killer to solve OOM situation.
Preempting a thread calling out_of_memory() by zillion workers is a nightmare. ;-)

> 
>>> Not to mention that 1:1 (oom to tasks) information dumping is
>>> fundamentally broken. Any task might be on an oom list of different
>>> OOM contexts in different oom scopes (think of OOM happening in disjunct
>>> NUMA sets).
>>
>> I can't understand what you are talking about. This patch just defers
>> printk() from /proc/sys/vm/oom_dump_tasks != 0. Please look at the patch
>> carefully. If you are saying that it is bad that OOM victim candidates for
>> OOM domain B, C, D ... cannot be printed if printing of OOM victim candidates
>> for OOM domain A has not finished, I can update this patch to print them.
> 
> You would have to track each ongoing oom context separately.

I can update my patch to track each OOM context separately. But

>                                                              And not
> only those from different oom scopes because as a matter of fact a new
> OOM might trigger before the previous dump_tasks managed to be handled.

please be aware that we are already dropping OOM messages from different scopes
due to __ratelimit(&oom_rs). The difference is, given that __ratelimit(&oom_rs)
can work, nothing but which OOM messages will be dropped when cluster of OOM
events from multiple different scopes happened.

And "OOM events from multiple different scopes can trivially happen" is a
violation for commit dc56401fc9f25e8f ("mm: oom_kill: simplify OOM killer
locking") saying

    However, the OOM killer is a fairly cold error path, there is really no
    reason to optimize for highly performant and concurrent OOM kills.

where we will need "per an OOM scope locking mechanism" in order to avoid
deferring OOM killer event in current thread's OOM scope due to processing
OOM killer events in other threads' OOM scopes.