Re: [RFC PATCH] printk: Introduce "store now but print later" prefix.

From: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
To: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>,
	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	John Ogness <john.ogness@linutronix.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH] printk: Introduce "store now but print later" prefix.
Date: Mon, 4 Mar 2019 20:40:37 +0900	[thread overview]
Message-ID: <6b97b4bb-a9b9-75b3-17a2-bff99ae7c526@i-love.sakura.ne.jp> (raw)
In-Reply-To: <20190304032202.GD23578@jagdpanzerIV>

On 2019/03/04 12:22, Sergey Senozhatsky wrote:
> On (02/23/19 13:42), Tetsuo Handa wrote:
> [..]
>> This patch tries to address "don't lockup the system" with minimal risk of
>> failing to "print out printk() messages", by allowing printk() callers to
>> tell printk() "store $body_text_lines lines into logbuf but start actual
>> printing after $trailer_text_line line is stored into logbuf". This patch
>> is different from existing printk_deferred(), for printk_deferred() is
>> intended for scheduler/timekeeping use only. Moreover, what this patch
>> wants to do is "do not try to print out printk() messages as soon as
>> possible", for accumulated stalling period cannot be decreased if
>> printk_deferred() from e.g. dump_tasks() from out_of_memory() immediately
>> prints out the messages. The point of this patch is to defer the stalling
>> duration to after leaving the critical section.
> 
> We can export printk deferred, I guess; but I'm not sure if it's going
> to be easy to switch OOM to printk_deferred - there are lots of direct
> printk callers: warn-s, dump_stacks, etc; it might even be simpler to
> start re-directing OOM printouts to printk_safe buffer.

I confirmed that printk_deferred() is not suitable for this purpose, for
it suddenly stalls for seconds at random locations flushing pending output
accumulated by printk_deferred(). Stalling inside critical section (e.g.
RCU read lock held) is what I don't like.

> 
> This is a bit of a strange issue, to be honest. If OOM prints too
> many messages then we might want to do some work on the OOM side.
> 
> But, to begin with, can you give an example of such a lockup? Just
> to understand how big/real the problem is.
> 
> What is that "OOM critical section" which printk can stall?

dump_task() is the OOM critical section from RCU perspective.
We can minimize RCU critical section by just getting a refcount on possible
candidates and then printing information and putting that refcount after
leaving RCU critical section.

----------

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f9b43c9..4781439 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1182,6 +1182,7 @@ struct task_struct {
 #ifdef CONFIG_MMU
 	struct task_struct		*oom_reaper_list;
 #endif
+	struct list_head		oom_candidate_list;
 #ifdef CONFIG_VMAP_STACK
 	struct vm_struct		*stack_vm_area;
 #endif
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 26ea863..6750b18 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -399,6 +399,7 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
 {
 	struct task_struct *p;
 	struct task_struct *task;
+	LIST_HEAD(candidates);
 
 	pr_info("Tasks state (memory values in pages):\n");
 	pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name\n");
@@ -407,6 +408,11 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
 		if (oom_unkillable_task(p, memcg, nodemask))
 			continue;
 
+		get_task_struct(p);
+		list_add_tail(&p->oom_candidate_list, &candidates);
+	}
+	rcu_read_unlock();
+	list_for_each_entry(p, &candidates, oom_candidate_list) {
 		task = find_lock_task_mm(p);
 		if (!task) {
 			/*
@@ -425,7 +431,8 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
 			task->signal->oom_score_adj, task->comm);
 		task_unlock(task);
 	}
-	rcu_read_unlock();
+	list_for_each_entry_safe(p, task, &candidates, oom_candidate_list)
+		put_task_struct(p);
 }
 
 static void dump_oom_summary(struct oom_control *oc, struct task_struct *victim)
----------

But almost all out_of_memory() (where oom_lock mutex is held) is the OOM critical
section from memory reclaiming perspective, for we cannot reclaim memory (and
other concurrently allocating threads are needlessly wasting CPU time) unless
SIGKILL is sent after all printk() completed. Therefore, despite out_of_memory()
prints a lot of messages, it is expected to complete quickly as if it is an interrupt
handler. We could disable preemption inside out_of_memory() if all printk() with
oom_lock mutex held is deferrable until oom_lock mutex is released.

> 
> [..]
>> The possibility of failing to store all printk() messages to logbuf might
>> be increased by using "async" printk(). But since we have a lot of RAM
>> nowadays, allocating large logbuf enough to hold the entire SysRq-t output
>> using log_buf_len= kernel command line parameter won't be difficult.
> 
> Note, logbuf size is limited - 2G. Might be not as large as people
> would want it to be.

Are "machines which want to use 2GB logbuf" hosting millions of threads such
that even 2GB is not enough for holding SysRq-t output? If yes, then I guess
that tasklist traversal under RCU read lock would lockup even without printk().