From: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
To: Michal Hocko <mhocko@kernel.org>
Cc: linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Petr Mladek <pmladek@suse.com>,
	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH (resend)] mm,oom: Defer dump_tasks() output.
Date: Sat, 14 Sep 2019 15:15:22 +0900
Message-ID: <57be50b2-a97a-e559-e4bd-10d923895f83@i-love.sakura.ne.jp>
In-Reply-To: <5bbcd93f-aa42-6c62-897a-d7b94aacdb87@i-love.sakura.ne.jp>

On 2019/09/10 20:00, Tetsuo Handa wrote:
> On 2019/09/09 22:04, Michal Hocko wrote:
>> On Mon 09-09-19 21:40:24, Tetsuo Handa wrote:
>>> On 2019/09/09 20:36, Michal Hocko wrote:
>>>> This is not an improvement. It detaches the oom report and tasks_dump
>>>> for an arbitrary amount of time because the worker context might be
>>>> stalled for an arbitrary time. Even long after the oom is resolved.
>>>
>>> A new worker thread is created if all existing worker threads are busy. And
>>> even when a new worker thread cannot be created due to the OOM situation,
>>> this patch still resolves the OOM situation quickly.
>>>
>>> Also, if a worker thread cannot run due to CPU starvation, the same thing
>>> applies to dump_tasks(). In other words, dump_tasks() cannot complete due
>>> to CPU starvation, which results in more costly and serious consequences.
>>> Being able to send SIGKILL and reclaim memory as soon as possible is
>>> an improvement.
>>
>> There might be a zillion workers waiting to make forward progress and
>> you cannot expect any timing guarantees here. Just remember your own
>> experiments with xfs and low memory conditions.
> 
> Even if there were a zillion workers waiting to make forward progress, the
> worker processing the dump_tasks() output can still make forward progress.
> That's how workqueues work. (If you still don't trust workqueues, I can update
> my patch to use a kernel thread.) And if there were a zillion workers waiting
> to make forward progress, completing the OOM killer quickly would be more
> important than keeping a zillion workers blocked waiting for the OOM killer to
> resolve the OOM situation. Preempting a thread calling out_of_memory() with a
> zillion workers is a nightmare. ;-)
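
(A rough, self-contained sketch only, not part of the patch: a workqueue
allocated with WQ_MEM_RECLAIM keeps a dedicated rescuer thread around, so its
work items can always make forward progress even when new worker threads
cannot be forked under OOM. Names here are illustrative.)

#include <linux/init.h>
#include <linux/workqueue.h>

static struct workqueue_struct *oom_dump_wq;

static void oom_dump_work_fn(struct work_struct *work)
{
	/* print the deferred candidate list here */
}
static DECLARE_WORK(oom_dump_work, oom_dump_work_fn);

static int __init oom_dump_wq_init(void)
{
	/* WQ_MEM_RECLAIM gives this queue its own rescuer thread. */
	oom_dump_wq = alloc_workqueue("oom_dump", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
	return oom_dump_wq ? 0 : -ENOMEM;
}
subsys_initcall(oom_dump_wq_init);

/* At OOM time: queue_work(oom_dump_wq, &oom_dump_work); queueing itself
 * does not allocate memory and therefore cannot block on reclaim. */
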
> 
>>
>>>> Not to mention that 1:1 (oom to tasks) information dumping is
>>>> fundamentally broken. Any task might be on an oom list of different
>>>> OOM contexts in different oom scopes (think of OOM happening in disjunct
>>>> NUMA sets).
>>>
>>> I can't understand what you are talking about. This patch just defers the
>>> printk() output produced when /proc/sys/vm/oom_dump_tasks != 0. Please look at the patch
>>> carefully. If you are saying that it is bad that OOM victim candidates for
>>> OOM domain B, C, D ... cannot be printed if printing of OOM victim candidates
>>> for OOM domain A has not finished, I can update this patch to print them.
>>
>> You would have to track each ongoing oom context separately.
> 
> I can update my patch to track each OOM context separately. But
> 
>>                                                              And not
>> only those from different oom scopes because as a matter of fact a new
>> OOM might trigger before the previous dump_tasks managed to be handled.
> 
> please be aware that we are already dropping OOM messages from different scopes
> due to __ratelimit(&oom_rs). Given that __ratelimit(&oom_rs) works, the only
> difference is which OOM messages get dropped when a cluster of OOM events from
> multiple different scopes happens.
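
(For reference, the __ratelimit(&oom_rs) I mean is the one in
oom_kill_process(); roughly:)

	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
				      DEFAULT_RATELIMIT_BURST);
	...
	/* When the ratelimit fires, the whole report (summary, meminfo and,
	 * with oom_dump_tasks != 0, the task list) is skipped for that kill,
	 * no matter which OOM scope triggered it. */
	if (__ratelimit(&oom_rs))
		dump_header(oc, victim);
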
> 
> And "OOM events from multiple different scopes can trivially happen" is a
> violation for commit dc56401fc9f25e8f ("mm: oom_kill: simplify OOM killer
> locking") saying
> 
>     However, the OOM killer is a fairly cold error path, there is really no
>     reason to optimize for highly performant and concurrent OOM kills.
> 
> because we would need a "per OOM scope locking mechanism" in order to avoid
> deferring the OOM kill in the current thread's OOM scope while OOM killer
> events in other threads' OOM scopes are being processed.
> 

Here is a delta patch.
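
The delta builds on the per-task snapshot and the oom_dump_candidates_work
item that the base "mm,oom: Defer dump_tasks() output." patch introduces.
A rough sketch of the snapshot it assumes (field names taken from the code
below; exact types may differ from the base patch):

	/*
	 * Per-task snapshot taken under task_lock() at OOM time and printed
	 * later from workqueue context; embedded in task_struct as
	 * p->oom_task_info.
	 */
	struct oom_task_info {
		struct list_head list;		/* on oom_candidate_list */
		unsigned int seq;		/* oom_killer_seq at snapshot time */
		char comm[TASK_COMM_LEN];
		pid_t pid;
		pid_t tgid;
		uid_t uid;
		unsigned long total_vm;
		unsigned long mm_rss;
		unsigned long pgtables_bytes;
		unsigned long swapents;
		short score_adj;
	};
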

From d34ef26275635d14c746ed46e5478c1dc0319ca2 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 14 Sep 2019 15:11:02 +0900
Subject: [PATCH] mm,oom: Don't suppress dump_tasks() output from different OOM
 scopes.

Michal Hocko complained that "mm,oom: Defer dump_tasks() output."
needlessly suppresses output from concurrent OOM killer invocations in
different OOM scopes. This patch changes dump_tasks() so that such output
is no longer suppressed.

---
 kernel/fork.c |  1 +
 mm/oom_kill.c | 64 ++++++++++++++++++++++++++++++++++++++---------------------
 2 files changed, 42 insertions(+), 23 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index f601168..c86a126d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1854,6 +1854,7 @@ static __latent_entropy struct task_struct *copy_process(
 	p = dup_task_struct(current, node);
 	if (!p)
 		goto fork_out;
+	INIT_LIST_HEAD(&p->oom_task_info.list);
 
 	/*
 	 * This _must_ happen before we call free_task(), i.e. before we jump
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index db62f50..7f57cea 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -379,6 +379,7 @@ static void select_bad_process(struct oom_control *oc)
 
 static unsigned int oom_killer_seq; /* Serialized by oom_lock. */
 static LIST_HEAD(oom_candidate_list); /* Serialized by oom_lock. */
+static LIST_HEAD(oom_tmp_candidate_list); /* Serialized by oom_lock. */
 
 static int add_candidate_task(struct task_struct *p, void *arg)
 {
@@ -395,8 +396,13 @@ static int add_candidate_task(struct task_struct *p, void *arg)
 	p = find_lock_task_mm(p);
 	if (!p)
 		return 0;
-	get_task_struct(p);
 	oti = &p->oom_task_info;
+	/* p might be still waiting for dump_candidate_tasks(). */
+	if (!list_empty(&oti->list)) {
+		task_unlock(p);
+		return 1;
+	}
+	get_task_struct(p);
 	mm = p->mm;
 	oti->seq = oom_killer_seq;
 	memcpy(oti->comm, p->comm, sizeof(oti->comm));
@@ -409,14 +415,14 @@ static int add_candidate_task(struct task_struct *p, void *arg)
 	oti->swapents = get_mm_counter(mm, MM_SWAPENTS);
 	oti->score_adj = p->signal->oom_score_adj;
 	task_unlock(p);
-	list_add_tail(&oti->list, &oom_candidate_list);
+	list_add_tail(&oti->list, &oom_tmp_candidate_list);
 	return 0;
 }
 
 static void dump_candidate_tasks(struct work_struct *work)
 {
 	bool first = true;
-	unsigned int seq;
+	unsigned int seq = 0;
 	struct oom_task_info *oti;
 
 	if (work) /* Serialize only if asynchronous. */
@@ -424,7 +430,10 @@ static void dump_candidate_tasks(struct work_struct *work)
 	while (!list_empty(&oom_candidate_list)) {
 		oti = list_first_entry(&oom_candidate_list,
 				       struct oom_task_info, list);
-		seq = oti->seq;
+		if (seq != oti->seq) {
+			seq = oti->seq;
+			first = true;
+		}
 		if (first) {
 			pr_info("OOM[%u]: Tasks state (memory values in pages):\n",
 				seq);
@@ -436,7 +445,7 @@ static void dump_candidate_tasks(struct work_struct *work)
 			seq, oti->pid, oti->uid, oti->tgid, oti->total_vm,
 			oti->mm_rss, oti->pgtables_bytes, oti->swapents,
 			oti->score_adj, oti->comm);
-		list_del(&oti->list);
+		list_del_init(&oti->list);
 		if (work)
 			mutex_unlock(&oom_lock);
 		put_task_struct(container_of(oti, struct task_struct,
@@ -464,32 +473,41 @@ static void dump_candidate_tasks(struct work_struct *work)
 static void dump_tasks(struct oom_control *oc)
 {
 	struct task_struct *p;
+	int ret = 0;
 
 	/*
-	 * Suppress as long as there is any OOM victim candidate from past
-	 * rounds of OOM killer invocations. We could change this to suppress
-	 * only if there is an OOM victim candidate in the same OOM domain if
-	 * we want to see OOM victim candidates from different OOM domains.
-	 * But since dump_header() is already ratelimited, I don't know whether
-	 * it makes difference to suppress OOM victim candidates from different
-	 * OOM domains...
+	 * Suppress if OOM victim candidates in the same OOM scope from past
+	 * OOM killer invocations are still waiting for dump_candidate_tasks(),
+	 * for it is possible that the OOM reaper or exit_mmap() sets
+	 * MMF_OOM_SKIP before dump_candidate_tasks() completes. Otherwise,
+	 * call dump_candidate_tasks() after SIGKILL is sent to OOM victims and
+	 * the OOM reaper started reclaiming.
 	 */
-	if (!list_empty(&oom_candidate_list))
-		return;
 	if (is_memcg_oom(oc))
-		mem_cgroup_scan_tasks(oc->memcg, add_candidate_task, oc);
+		ret = mem_cgroup_scan_tasks(oc->memcg, add_candidate_task, oc);
 	else {
 		rcu_read_lock();
-		for_each_process(p)
-			add_candidate_task(p, oc);
+		for_each_process(p) {
+			ret = add_candidate_task(p, oc);
+			if (ret)
+				break;
+		}
 		rcu_read_unlock();
 	}
-	/*
-	 * Report OOM victim candidates after SIGKILL is sent to OOM victims
-	 * and the OOM reaper started reclaiming.
-	 */
-	if (!list_empty(&oom_candidate_list))
-		queue_work(system_long_wq, &oom_dump_candidates_work);
+	if (ret) {
+		while (!list_empty(&oom_tmp_candidate_list)) {
+			struct oom_task_info *oti =
+				list_first_entry(&oom_tmp_candidate_list,
+						 struct oom_task_info, list);
+
+			list_del_init(&oti->list);
+			put_task_struct(container_of(oti, struct task_struct,
+						     oom_task_info));
+		}
+		return;
+	}
+	list_splice_tail_init(&oom_tmp_candidate_list, &oom_candidate_list);
+	queue_work(system_unbound_wq, &oom_dump_candidates_work);
 }
 
 static void dump_oom_summary(struct oom_control *oc, struct task_struct *victim)
-- 
1.8.3.1
