linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH] printk: Introduce "store now but print later" prefix.
@ 2019-02-23  4:42 Tetsuo Handa
  2019-03-04  3:22 ` Sergey Senozhatsky
  0 siblings, 1 reply; 17+ messages in thread
From: Tetsuo Handa @ 2019-02-23  4:42 UTC (permalink / raw)
  To: Petr Mladek, Sergey Senozhatsky, Steven Rostedt, John Ogness
  Cc: Andrew Morton, Linus Torvalds, linux-kernel, Tetsuo Handa

There are callers which print many lines in

  $header_text_line
  $body_text_lines
  $trailer_text_line

format, and some of them print $body_text_lines part from atomic context
(or context where spending too much time can result in a lockup). SysRq-t
(which might call printk() on all threads from interrupt context) and the
OOM killer (which might call printk() on all threads from RCU critical
section, and which cannot make forward progress by sending SIGKILL to OOM
victims unless all printk() completes) are such callers. For such callers,
storing all printk() messages to logbuf as soon as possible and then start
actual printing to consoles (after e.g. leaving critical section, sending
SIGKILL to OOM victims) is preferable.

There has been an effort to offload actual printing to a dedicated kernel
thread. But such offloading is not yet accepted due to "print out printk()
messages as soon as possible" versus "don't lockup the system by spending
too much time inside printk()" collision. Therefore, I propose per
printk() flag while utilizing currently available synchronous printing
logic.

This patch tries to address "don't lockup the system" with minimal risk of
failing to "print out printk() messages", by allowing printk() callers to
tell printk() "store $body_text_lines lines into logbuf but start actual
printing after $trailer_text_line line is stored into logbuf". This patch
is different from existing printk_deferred(), for printk_deferred() is
intended for scheduler/timekeeping use only. Moreover, what this patch
wants to do is "do not try to print out printk() messages as soon as
possible", for accumulated stalling period cannot be decreased if
printk_deferred() from e.g. dump_tasks() from out_of_memory() immediately
prints out the messages. The point of this patch is to defer the stalling
duration to after leaving the critical section.

Hereafter, I call printk() with "store now but print later" as "async"
printk(), and printk() with "store now and print now" as "sync" printk().

Since storing $body_text_lines into logbuf using "async" printk() becomes
very fast (latency is reduced from some milliseconds to some microseconds),
the risk of failing to loose $body_text_lines lines which would have been
printed out if "sync" printk() were used is small.

Since actual printing starts as soon as a "sync" printk() is called (which
might be from a CPU which is not processing a series of "async" printk()),
the risk of failing to print out critical messages won't increase provided
that printk() for critical messages (e.g. oops) are "sync" printk().

The possibility of failing to store all printk() messages to logbuf might
be increased by using "async" printk(). But since we have a lot of RAM
nowadays, allocating large logbuf enough to hold the entire SysRq-t output
using log_buf_len= kernel command line parameter won't be difficult.

There is a possibility that a thread which called "sync" printk() stalls
for unexpectedly long period due to previously called a lot of "async"
printk(). Such thread is expected to be the thread itself who did a lot of
"async" printk(), and accumulated stalling period will be there even if
that thread did "sync" printk(). This patch counts on console_waiter in
console_trylock_spinning() for transferring control back from a random
thread which did an innocent "sync" printk() to a thread which did a lot
of "async" printk() by recommending such thread to use "sync" printk() for
$trailer_text_line. Such expectation may fail, but it can't be solved until
we get a really asynchronous printing using lockless consoles. Until that
time, let's try whether failing such expectation leads to horrible
consequences. If it turns out to be horrible consequences, we could add
hints for judging which thread (console_owner or console_waiter) should
continue printing (e.g. prefer schedulable process context over atomic
process context, prefer atomic process context over soft IRQ context,
prefer soft IRQ context over hard IRQ context, and/or track number of
records stored by "async" printk() into "struct task_struct" and compare
it at console_trylock_spinning()). If that still fails, we can temporarily
disable "async" printk() inside vprintk_emit().

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/kern_levels.h | 18 ++++++++++++++++++
 include/linux/printk.h      | 28 ++++++++++++++++++++++++++++
 kernel/printk/internal.h    |  4 ++--
 kernel/printk/printk.c      | 16 +++++++++++++---
 kernel/printk/printk_safe.c |  2 +-
 5 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/include/linux/kern_levels.h b/include/linux/kern_levels.h
index bf2389c..dc924a3 100644
--- a/include/linux/kern_levels.h
+++ b/include/linux/kern_levels.h
@@ -23,6 +23,24 @@
  */
 #define KERN_CONT	KERN_SOH "c"
 
+/*
+ * Annotation for a "store now but print later" line of log printout.
+ * Such lines will be printed when printk() without this annotation is called.
+ * Users of this annotation are encouraged to do a final printk() without
+ * this annotation after a series of printk() with this annotation.
+ */
+#define KERN_ASYNC		KERN_SOH "a"
+#define KERN_EMERG_ASYNC	KERN_ASYNC KERN_EMERG
+#define KERN_ALERT_ASYNC	KERN_ASYNC KERN_ALERT
+#define KERN_CRIT_ASYNC		KERN_ASYNC KERN_CRIT
+#define KERN_ERR_ASYNC		KERN_ASYNC KERN_ERR
+#define KERN_WARNING_ASYNC	KERN_ASYNC KERN_WARNING
+#define KERN_NOTICE_ASYNC	KERN_ASYNC KERN_NOTICE
+#define KERN_INFO_ASYNC		KERN_ASYNC KERN_INFO
+#define KERN_DEBUG_ASYNC	KERN_ASYNC KERN_DEBUG
+#define KERN_DEFAULT_ASYNC	KERN_ASYNC KERN_DEFAULT
+#define KERN_CONT_ASYNC		KERN_ASYNC KERN_CONT
+
 /* integer equivalents of KERN_<LEVEL> */
 #define LOGLEVEL_SCHED		-2	/* Deferred messages from sched code
 					 * are set to this special level */
diff --git a/include/linux/printk.h b/include/linux/printk.h
index d7c77ed..5c27561 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -167,6 +167,7 @@ asmlinkage __printf(1, 0)
 
 asmlinkage __printf(1, 2) __cold
 int printk(const char *fmt, ...);
+#define printk_async(fmt, ...) printk(KERN_ASYNC fmt, ##__VA_ARGS__)
 
 /*
  * Special printk facility for scheduler/timekeeping use only, _DO_NOT_USE_ !
@@ -294,19 +295,34 @@ static inline void printk_safe_flush_on_panic(void)
  */
 #define pr_emerg(fmt, ...) \
 	printk(KERN_EMERG pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_emerg_async(fmt, ...) \
+	printk(KERN_EMERG_ASYNC pr_fmt(fmt), ##__VA_ARGS__)
 #define pr_alert(fmt, ...) \
 	printk(KERN_ALERT pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_alert_async(fmt, ...) \
+	printk(KERN_ALERT_ASYNC pr_fmt(fmt), ##__VA_ARGS__)
 #define pr_crit(fmt, ...) \
 	printk(KERN_CRIT pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_crit_async(fmt, ...) \
+	printk(KERN_CRIT_ASYNC pr_fmt(fmt), ##__VA_ARGS__)
 #define pr_err(fmt, ...) \
 	printk(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_err_async(fmt, ...) \
+	printk(KERN_ERR_ASYNC pr_fmt(fmt), ##__VA_ARGS__)
 #define pr_warning(fmt, ...) \
 	printk(KERN_WARNING pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_warning_async(fmt, ...) \
+	printk(KERN_WARNING_ASYNC pr_fmt(fmt), ##__VA_ARGS__)
 #define pr_warn pr_warning
+#define pr_warn_async pr_warning_async
 #define pr_notice(fmt, ...) \
 	printk(KERN_NOTICE pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_notice_async(fmt, ...) \
+	printk(KERN_NOTICE_ASYNC pr_fmt(fmt), ##__VA_ARGS__)
 #define pr_info(fmt, ...) \
 	printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_info_async(fmt, ...) \
+	printk(KERN_INFO_ASYNC pr_fmt(fmt), ##__VA_ARGS__)
 /*
  * Like KERN_CONT, pr_cont() should only be used when continuing
  * a line with no newline ('\n') enclosed. Otherwise it defaults
@@ -314,14 +330,20 @@ static inline void printk_safe_flush_on_panic(void)
  */
 #define pr_cont(fmt, ...) \
 	printk(KERN_CONT fmt, ##__VA_ARGS__)
+#define pr_cont_async(fmt, ...) \
+	printk(KERN_CONT_ASYNC fmt, ##__VA_ARGS__)
 
 /* pr_devel() should produce zero code unless DEBUG is defined */
 #ifdef DEBUG
 #define pr_devel(fmt, ...) \
 	printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_devel_async(fmt, ...) \
+	printk(KERN_DEBUG_ASYNC pr_fmt(fmt), ##__VA_ARGS__)
 #else
 #define pr_devel(fmt, ...) \
 	no_printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_devel_async(fmt, ...) \
+	no_printk(KERN_DEBUG_ASYNC pr_fmt(fmt), ##__VA_ARGS__)
 #endif
 
 
@@ -332,12 +354,18 @@ static inline void printk_safe_flush_on_panic(void)
 /* dynamic_pr_debug() uses pr_fmt() internally so we don't need it here */
 #define pr_debug(fmt, ...) \
 	dynamic_pr_debug(fmt, ##__VA_ARGS__)
+#define pr_debug_async(fmt, ...) \
+	dynamic_pr_debug(KERN_ASYNC fmt, ##__VA_ARGS__)
 #elif defined(DEBUG)
 #define pr_debug(fmt, ...) \
 	printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_debug_async(fmt, ...) \
+	printk(KERN_DEBUG_ASYNC pr_fmt(fmt), ##__VA_ARGS__)
 #else
 #define pr_debug(fmt, ...) \
 	no_printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_debug_async(fmt, ...) \
+	no_printk(KERN_DEBUG_ASYNC pr_fmt(fmt), ##__VA_ARGS__)
 #endif
 
 /*
diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
index 0f18988..a1c9b5c 100644
--- a/kernel/printk/internal.h
+++ b/kernel/printk/internal.h
@@ -24,8 +24,8 @@
 
 extern raw_spinlock_t logbuf_lock;
 
-__printf(5, 0)
-int vprintk_store(int facility, int level,
+__printf(6, 0)
+int vprintk_store(bool *async, int facility, int level,
 		  const char *dict, size_t dictlen,
 		  const char *fmt, va_list args);
 
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 02ca827..af365a7 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1884,7 +1884,7 @@ static size_t log_output(int facility, int level, enum log_flags lflags, const c
 }
 
 /* Must be called under logbuf_lock. */
-int vprintk_store(int facility, int level,
+int vprintk_store(bool *async, int facility, int level,
 		  const char *dict, size_t dictlen,
 		  const char *fmt, va_list args)
 {
@@ -1899,6 +1899,15 @@ int vprintk_store(int facility, int level,
 	 */
 	text_len = vscnprintf(text, sizeof(textbuf), fmt, args);
 
+	/* check "store now but print later" request */
+	if (text_len > 1 && text[0] == KERN_ASYNC[0] &&
+	    text[1] == KERN_ASYNC[1]) {
+		if (async)
+			*async = true;
+		text_len -= 2;
+		text += 2;
+	}
+
 	/* mark and strip a trailing newline */
 	if (text_len && text[text_len-1] == '\n') {
 		text_len--;
@@ -1942,6 +1951,7 @@ asmlinkage int vprintk_emit(int facility, int level,
 	bool in_sched = false, pending_output;
 	unsigned long flags;
 	u64 curr_log_seq;
+	bool async = false;
 
 	if (level == LOGLEVEL_SCHED) {
 		level = LOGLEVEL_DEFAULT;
@@ -1954,8 +1964,8 @@ asmlinkage int vprintk_emit(int facility, int level,
 	/* This stops the holder of console_sem just where we want him */
 	logbuf_lock_irqsave(flags);
 	curr_log_seq = log_next_seq;
-	printed_len = vprintk_store(facility, level, dict, dictlen, fmt, args);
-	pending_output = (curr_log_seq != log_next_seq);
+	printed_len = vprintk_store(&async, facility, level, dict, dictlen, fmt, args);
+	pending_output = !async && (curr_log_seq != log_next_seq);
 	logbuf_unlock_irqrestore(flags);
 
 	/* If called from the scheduler, we can not call up(). */
diff --git a/kernel/printk/printk_safe.c b/kernel/printk/printk_safe.c
index 0913b4d..979c7b4 100644
--- a/kernel/printk/printk_safe.c
+++ b/kernel/printk/printk_safe.c
@@ -380,7 +380,7 @@ void __printk_safe_exit(void)
 	    raw_spin_trylock(&logbuf_lock)) {
 		int len;
 
-		len = vprintk_store(0, LOGLEVEL_DEFAULT, NULL, 0, fmt, args);
+		len = vprintk_store(NULL, 0, LOGLEVEL_DEFAULT, NULL, 0, fmt, args);
 		raw_spin_unlock(&logbuf_lock);
 		defer_console_output();
 		return len;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] printk: Introduce "store now but print later" prefix.
  2019-02-23  4:42 [RFC PATCH] printk: Introduce "store now but print later" prefix Tetsuo Handa
@ 2019-03-04  3:22 ` Sergey Senozhatsky
  2019-03-04 11:40   ` Tetsuo Handa
  0 siblings, 1 reply; 17+ messages in thread
From: Sergey Senozhatsky @ 2019-03-04  3:22 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Petr Mladek, Sergey Senozhatsky, Steven Rostedt, John Ogness,
	Andrew Morton, Linus Torvalds, linux-kernel

On (02/23/19 13:42), Tetsuo Handa wrote:
[..]
> This patch tries to address "don't lockup the system" with minimal risk of
> failing to "print out printk() messages", by allowing printk() callers to
> tell printk() "store $body_text_lines lines into logbuf but start actual
> printing after $trailer_text_line line is stored into logbuf". This patch
> is different from existing printk_deferred(), for printk_deferred() is
> intended for scheduler/timekeeping use only. Moreover, what this patch
> wants to do is "do not try to print out printk() messages as soon as
> possible", for accumulated stalling period cannot be decreased if
> printk_deferred() from e.g. dump_tasks() from out_of_memory() immediately
> prints out the messages. The point of this patch is to defer the stalling
> duration to after leaving the critical section.

We can export printk deferred, I guess; but I'm not sure if it's going
to be easy to switch OOM to printk_deferred - there are lots of direct
printk callers: warn-s, dump_stacks, etc; it might even be simpler to
start re-directing OOM printouts to printk_safe buffer.

This is a bit of a strange issue, to be honest. If OOM prints too
many messages then we might want to do some work on the OOM side.

But, to begin with, can you give an example of such a lockup? Just
to understand how big/real the problem is.

What is that "OOM critical section" which printk can stall?

[..]
> The possibility of failing to store all printk() messages to logbuf might
> be increased by using "async" printk(). But since we have a lot of RAM
> nowadays, allocating large logbuf enough to hold the entire SysRq-t output
> using log_buf_len= kernel command line parameter won't be difficult.

Note, logbuf size is limited - 2G. Might be not as large as people
would want it to be.

	-ss

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] printk: Introduce "store now but print later" prefix.
  2019-03-04  3:22 ` Sergey Senozhatsky
@ 2019-03-04 11:40   ` Tetsuo Handa
  2019-03-04 12:09     ` Sergey Senozhatsky
  2019-03-04 14:23     ` Petr Mladek
  0 siblings, 2 replies; 17+ messages in thread
From: Tetsuo Handa @ 2019-03-04 11:40 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Petr Mladek, Sergey Senozhatsky, Steven Rostedt, John Ogness,
	Andrew Morton, Linus Torvalds, linux-kernel

On 2019/03/04 12:22, Sergey Senozhatsky wrote:
> On (02/23/19 13:42), Tetsuo Handa wrote:
> [..]
>> This patch tries to address "don't lockup the system" with minimal risk of
>> failing to "print out printk() messages", by allowing printk() callers to
>> tell printk() "store $body_text_lines lines into logbuf but start actual
>> printing after $trailer_text_line line is stored into logbuf". This patch
>> is different from existing printk_deferred(), for printk_deferred() is
>> intended for scheduler/timekeeping use only. Moreover, what this patch
>> wants to do is "do not try to print out printk() messages as soon as
>> possible", for accumulated stalling period cannot be decreased if
>> printk_deferred() from e.g. dump_tasks() from out_of_memory() immediately
>> prints out the messages. The point of this patch is to defer the stalling
>> duration to after leaving the critical section.
> 
> We can export printk deferred, I guess; but I'm not sure if it's going
> to be easy to switch OOM to printk_deferred - there are lots of direct
> printk callers: warn-s, dump_stacks, etc; it might even be simpler to
> start re-directing OOM printouts to printk_safe buffer.

I confirmed that printk_deferred() is not suitable for this purpose, for
it suddenly stalls for seconds at random locations flushing pending output
accumulated by printk_deferred(). Stalling inside critical section (e.g.
RCU read lock held) is what I don't like.

> 
> This is a bit of a strange issue, to be honest. If OOM prints too
> many messages then we might want to do some work on the OOM side.
> 
> But, to begin with, can you give an example of such a lockup? Just
> to understand how big/real the problem is.
> 
> What is that "OOM critical section" which printk can stall?

dump_task() is the OOM critical section from RCU perspective.
We can minimize RCU critical section by just getting a refcount on possible
candidates and then printing information and putting that refcount after
leaving RCU critical section.

----------
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f9b43c9..4781439 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1182,6 +1182,7 @@ struct task_struct {
 #ifdef CONFIG_MMU
 	struct task_struct		*oom_reaper_list;
 #endif
+	struct list_head		oom_candidate_list;
 #ifdef CONFIG_VMAP_STACK
 	struct vm_struct		*stack_vm_area;
 #endif
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 26ea863..6750b18 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -399,6 +399,7 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
 {
 	struct task_struct *p;
 	struct task_struct *task;
+	LIST_HEAD(candidates);
 
 	pr_info("Tasks state (memory values in pages):\n");
 	pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name\n");
@@ -407,6 +408,11 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
 		if (oom_unkillable_task(p, memcg, nodemask))
 			continue;
 
+		get_task_struct(p);
+		list_add_tail(&p->oom_candidate_list, &candidates);
+	}
+	rcu_read_unlock();
+	list_for_each_entry(p, &candidates, oom_candidate_list) {
 		task = find_lock_task_mm(p);
 		if (!task) {
 			/*
@@ -425,7 +431,8 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
 			task->signal->oom_score_adj, task->comm);
 		task_unlock(task);
 	}
-	rcu_read_unlock();
+	list_for_each_entry_safe(p, task, &candidates, oom_candidate_list)
+		put_task_struct(p);
 }
 
 static void dump_oom_summary(struct oom_control *oc, struct task_struct *victim)
----------

But almost all out_of_memory() (where oom_lock mutex is held) is the OOM critical
section from memory reclaiming perspective, for we cannot reclaim memory (and
other concurrently allocating threads are needlessly wasting CPU time) unless
SIGKILL is sent after all printk() completed. Therefore, despite out_of_memory()
prints a lot of messages, it is expected to complete quickly as if it is an interrupt
handler. We could disable preemption inside out_of_memory() if all printk() with
oom_lock mutex held is deferrable until oom_lock mutex is released.

> 
> [..]
>> The possibility of failing to store all printk() messages to logbuf might
>> be increased by using "async" printk(). But since we have a lot of RAM
>> nowadays, allocating large logbuf enough to hold the entire SysRq-t output
>> using log_buf_len= kernel command line parameter won't be difficult.
> 
> Note, logbuf size is limited - 2G. Might be not as large as people
> would want it to be.

Are "machines which want to use 2GB logbuf" hosting millions of threads such
that even 2GB is not enough for holding SysRq-t output? If yes, then I guess
that tasklist traversal under RCU read lock would lockup even without printk().


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] printk: Introduce "store now but print later" prefix.
  2019-03-04 11:40   ` Tetsuo Handa
@ 2019-03-04 12:09     ` Sergey Senozhatsky
  2019-03-04 14:23     ` Petr Mladek
  1 sibling, 0 replies; 17+ messages in thread
From: Sergey Senozhatsky @ 2019-03-04 12:09 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Sergey Senozhatsky, Petr Mladek, Sergey Senozhatsky,
	Steven Rostedt, John Ogness, Andrew Morton, Linus Torvalds,
	linux-kernel

On (03/04/19 20:40), Tetsuo Handa wrote:
> On 2019/03/04 12:22, Sergey Senozhatsky wrote:
> > On (02/23/19 13:42), Tetsuo Handa wrote:
> > [..]
> >> This patch tries to address "don't lockup the system" with minimal risk of
> >> failing to "print out printk() messages", by allowing printk() callers to
> >> tell printk() "store $body_text_lines lines into logbuf but start actual
> >> printing after $trailer_text_line line is stored into logbuf". This patch
> >> is different from existing printk_deferred(), for printk_deferred() is
> >> intended for scheduler/timekeeping use only. Moreover, what this patch
> >> wants to do is "do not try to print out printk() messages as soon as
> >> possible", for accumulated stalling period cannot be decreased if
> >> printk_deferred() from e.g. dump_tasks() from out_of_memory() immediately
> >> prints out the messages. The point of this patch is to defer the stalling
> >> duration to after leaving the critical section.
> > 
> > We can export printk deferred, I guess; but I'm not sure if it's going
> > to be easy to switch OOM to printk_deferred - there are lots of direct
> > printk callers: warn-s, dump_stacks, etc; it might even be simpler to
> > start re-directing OOM printouts to printk_safe buffer.
> 
> I confirmed that printk_deferred() is not suitable for this purpose, for
> it suddenly stalls for seconds at random locations flushing pending output
> accumulated by printk_deferred().

You are right. printk_deferred() is usually bad news. It may kill the
system, it doesn't care that much. If there is no other printk() caller
to hand off printing to then we can stuck in console_unlock() printing
loop from IRQ context. printk_safe() is, basically, same thing.

> Stalling inside critical section (e.g. RCU read lock held) is what I
> don't like.

I see. Yes, we might hold off grace periods when RCU read side
section is getting interrupted and then we stuck in printing
loop from IRQ.

> dump_task() is the OOM critical section from RCU perspective.
> We can minimize RCU critical section by just getting a refcount on possible
> candidates and then printing information and putting that refcount after
> leaving RCU critical section.

Can do, I guess.

[..]
> > Note, logbuf size is limited - 2G. Might be not as large as people
> > would want it to be.
> 
> Are "machines which want to use 2GB logbuf" hosting millions of threads such
> that even 2GB is not enough for holding SysRq-t output? If yes, then I guess
> that tasklist traversal under RCU read lock would lockup even without printk().

640K^W... 2G is probably enough.

	-ss

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] printk: Introduce "store now but print later" prefix.
  2019-03-04 11:40   ` Tetsuo Handa
  2019-03-04 12:09     ` Sergey Senozhatsky
@ 2019-03-04 14:23     ` Petr Mladek
  2019-03-04 14:37       ` Sergey Senozhatsky
  2019-03-05  1:23       ` Tetsuo Handa
  1 sibling, 2 replies; 17+ messages in thread
From: Petr Mladek @ 2019-03-04 14:23 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Steven Rostedt,
	John Ogness, Andrew Morton, Linus Torvalds, linux-kernel

On Mon 2019-03-04 20:40:37, Tetsuo Handa wrote:
> On 2019/03/04 12:22, Sergey Senozhatsky wrote:
> > On (02/23/19 13:42), Tetsuo Handa wrote:
> > [..]
> >> This patch tries to address "don't lockup the system" with minimal risk of
> >> failing to "print out printk() messages", by allowing printk() callers to
> >> tell printk() "store $body_text_lines lines into logbuf but start actual
> >> printing after $trailer_text_line line is stored into logbuf". This patch
> >> is different from existing printk_deferred(), for printk_deferred() is
> >> intended for scheduler/timekeeping use only. Moreover, what this patch
> >> wants to do is "do not try to print out printk() messages as soon as
> >> possible", for accumulated stalling period cannot be decreased if
> >> printk_deferred() from e.g. dump_tasks() from out_of_memory() immediately
> >> prints out the messages. The point of this patch is to defer the stalling
> >> duration to after leaving the critical section.
> > 
> > We can export printk deferred, I guess; but I'm not sure if it's going
> > to be easy to switch OOM to printk_deferred - there are lots of direct
> > printk callers: warn-s, dump_stacks, etc; it might even be simpler to
> > start re-directing OOM printouts to printk_safe buffer.

Exactly. OOM calls many functions that are called also in other situations.
The async messages are not pushed to the console unless someone calls
a non-async printk.

How do you want to guarantee that the non-async printk will get called
in all situations? The async printk() API is too error-prone.


> I confirmed that printk_deferred() is not suitable for this purpose, for
> it suddenly stalls for seconds at random locations flushing pending output
> accumulated by printk_deferred(). Stalling inside critical section (e.g.
> RCU read lock held) is what I don't like.

I still do not see why your async printk should be significantly
better than printk_deferred(). There is still a random victim
that would be responsible to flush the messages.

It might increase the chance that it will get printed from
normal context. But it also adds the risk that consoles
will not get handled at all.

BTW: The comment above printk_deferred() is there for a reason.
     It is a hack that should not be used widely.

     If you convert half printk() calls into a deferred/async
     module, you will just get into another problems. For example,
     not seeing the messages at all, more lost messages, random
     victims would spend even more time with flushing to the console.

     And not. Handling all messages only from normal context
     or from a dedicated kthread is not acceptable.


> > This is a bit of a strange issue, to be honest. If OOM prints too
> > many messages then we might want to do some work on the OOM side.

To be honest, I am still not sure what messages we are talking about.
Are the messages printed() from OOM killer code? Or are most of
the messages about allocation failures?

Well, both sources of messages would deserve a revision/regulation
if they cause such a big problem.

For example, I would stop printing allocation failures until
the currently running OOM killer succeeds in freeing some memory.
It might print a message about that all further allocation
failures will not get reported and then another message
about the success...

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] printk: Introduce "store now but print later" prefix.
  2019-03-04 14:23     ` Petr Mladek
@ 2019-03-04 14:37       ` Sergey Senozhatsky
  2019-03-05  1:23       ` Tetsuo Handa
  1 sibling, 0 replies; 17+ messages in thread
From: Sergey Senozhatsky @ 2019-03-04 14:37 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Tetsuo Handa, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, John Ogness, Andrew Morton, Linus Torvalds,
	linux-kernel

On (03/04/19 15:23), Petr Mladek wrote:
> > I confirmed that printk_deferred() is not suitable for this purpose, for
> > it suddenly stalls for seconds at random locations flushing pending output
> > accumulated by printk_deferred(). Stalling inside critical section (e.g.
> > RCU read lock held) is what I don't like.
> 
> I still do not see why your async printk should be significantly
> better than printk_deferred(). There is still a random victim
> that would be responsible to flush the messages.
> 
> It might increase the chance that it will get printed from
> normal context. But it also adds the risk that consoles
> will not get handled at all.
> 
> BTW: The comment above printk_deferred() is there for a reason.
>      It is a hack that should not be used widely.
> 
>      If you convert half printk() calls into a deferred/async
>      module, you will just get into another problems. For example,
>      not seeing the messages at all, more lost messages, random
>      victims would spend even more time with flushing to the console.

The thing with printk_deferred/printk_safe is that it always involves
IRQ->console_unlock()->loop. This thing is not welcome, we need some
other task to hand off printing to badly or we simply hard lockup the
system. I'd say that this probably should be the main reason those
those symbols are not exported.

> For example, I would stop printing allocation failures until
> the currently running OOM killer succeeds in freeing some memory.
> It might print a message about that all further allocation
> failures will not get reported and then another message
> about the success...

I'm all for it.

	-ss

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] printk: Introduce "store now but print later" prefix.
  2019-03-04 14:23     ` Petr Mladek
  2019-03-04 14:37       ` Sergey Senozhatsky
@ 2019-03-05  1:23       ` Tetsuo Handa
  2019-03-05  7:52         ` Sergey Senozhatsky
                           ` (2 more replies)
  1 sibling, 3 replies; 17+ messages in thread
From: Tetsuo Handa @ 2019-03-05  1:23 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Steven Rostedt,
	John Ogness, Andrew Morton, Linus Torvalds, linux-kernel

Petr Mladek wrote:
> On Mon 2019-03-04 20:40:37, Tetsuo Handa wrote:
> > On 2019/03/04 12:22, Sergey Senozhatsky wrote:
> > > On (02/23/19 13:42), Tetsuo Handa wrote:
> > > [..]
> > >> This patch tries to address "don't lockup the system" with minimal risk of
> > >> failing to "print out printk() messages", by allowing printk() callers to
> > >> tell printk() "store $body_text_lines lines into logbuf but start actual
> > >> printing after $trailer_text_line line is stored into logbuf". This patch
> > >> is different from existing printk_deferred(), for printk_deferred() is
> > >> intended for scheduler/timekeeping use only. Moreover, what this patch
> > >> wants to do is "do not try to print out printk() messages as soon as
> > >> possible", for accumulated stalling period cannot be decreased if
> > >> printk_deferred() from e.g. dump_tasks() from out_of_memory() immediately
> > >> prints out the messages. The point of this patch is to defer the stalling
> > >> duration to after leaving the critical section.
> > > 
> > > We can export printk deferred, I guess; but I'm not sure if it's going
> > > to be easy to switch OOM to printk_deferred - there are lots of direct
> > > printk callers: warn-s, dump_stacks, etc; it might even be simpler to
> > > start re-directing OOM printouts to printk_safe buffer.
> 
> Exactly. OOM calls many functions that are called also in other situations.
> The async messages are not pushed to the console unless someone calls
> a non-async printk.
> 
> How do you want to guarantee that the non-async printk will get called
> in all situations? The async printk() API is too error-prone.

I'm suggesting to use a non-async printk() for $trailer_text_line line. I think
that changing all printk() called from out_of_memory() to async is doable, if
the caller of out_of_memory() wakes up a dedicated kthread described below.

> 
> 
> > I confirmed that printk_deferred() is not suitable for this purpose, for
> > it suddenly stalls for seconds at random locations flushing pending output
> > accumulated by printk_deferred(). Stalling inside critical section (e.g.
> > RCU read lock held) is what I don't like.
> 
> I still do not see why your async printk should be significantly
> better than printk_deferred(). There is still a random victim
> that would be responsible to flush the messages.

I consider that a non-async printk() (or wake up a dedicated kthread) after
a series of async printk() is better than printk_deferred() because the caller
of a non-async printk() can control when to flush the messages.

> 
> It might increase the chance that it will get printed from
> normal context. But it also adds the risk that consoles
> will not get handled at all.

This "store now but print later" prefix does not decrease the chance of
failing to print, unless the messages are dropped due to out of logbuf space.
Out of logbuf space should be handled by how to keep enough interval between
calling an event that calls many printk().

We might need to tune the logic for judging which thread (console_owner
or console_waiter) should continue printing.

> 
> BTW: The comment above printk_deferred() is there for a reason.
>      It is a hack that should not be used widely.
> 
>      If you convert half printk() calls into a deferred/async
>      module, you will just get into another problems. For example,
>      not seeing the messages at all, more lost messages, random
>      victims would spend even more time with flushing to the console.
> 
>      And not. Handling all messages only from normal context
>      or from a dedicated kthread is not acceptable.

This "store now but print later" prefix does not offload to "only"
normal context or a dedicated kthread. Messages are handled by
console_owner or console_waiter. Given that the logic for judging which
thread (console_owner or console_waiter) should continue printing is
updated such that a dedicated kthread which does

	while (1) {
		wait_event();
		console_lock();
		console_unlock();
	}

will be used if the scheduler can run that kthread and that kthread can join
the judgement (and a caller which called a non-async() printk() will continue
printing if the scheduler cannot run that kthread or that kthread cannot join
the judgement for some reason).

> 
> 
> > > This is a bit of a strange issue, to be honest. If OOM prints too
> > > many messages then we might want to do some work on the OOM side.
> 
> To be honest, I am still not sure what messages we are talking about.
> Are the messages printed() from OOM killer code? Or are most of
> the messages about allocation failures?

I'm talking about the messages printed from out_of_memory().

> 
> Well, both sources of messages would deserve a revision/regulation
> if they cause such a big problem.

If the messages printed from warn_alloc() is problematic, you can propose
more drastic changes like "Remove __GFP_NOWARN and add __GFP_WARN_FAILURE"
(i.e. be silent unless explicitly be told to warn). Any !__GFP_DIRECT_RECLAIM
allocation has to be able to survive even if that allocation failed.

> 
> For example, I would stop printing allocation failures until
> the currently running OOM killer succeeds in freeing some memory.

We could serialize out_of_memory() and warn_alloc(), but since warn_alloc()
can be called from atomic context (e.g. GFP_ATOMIC from interrupt handler)
while out_of_memory() is always called from schedulable context (e.g.
GFP_KERNEL), we can't use oom_lock mutex.

Also, both nopage_rs in warn_alloc() and oom_rs in oom_kill_process() are not
working well. This is because ___ratelimit() function assumes that operations
to be ratelimited complete fast enough to be able to repeat many times within
a second. If one operation to be ratelimited takes many seconds (or even
minutes), ___ratelimit() becomes always true and can not ratelimit at all.

Using async printk() in out_of_memory() and warn_alloc() would allow such
serialization. And ratelimiting without using nopage_rs/oom_rs (e.g. monitor
how many printk() messages are waiting to be flushed) would allow moderate
output.

> It might print a message about that all further allocation
> failures will not get reported and then another message
> about the success...

Are you talking about allocation failure messages from console drivers
when called from printk() ?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] printk: Introduce "store now but print later" prefix.
  2019-03-05  1:23       ` Tetsuo Handa
@ 2019-03-05  7:52         ` Sergey Senozhatsky
  2019-03-05 12:57         ` Michal Hocko
  2019-03-06 10:04         ` Petr Mladek
  2 siblings, 0 replies; 17+ messages in thread
From: Sergey Senozhatsky @ 2019-03-05  7:52 UTC (permalink / raw)
  To: Tetsuo Handa, Petr Mladek
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Steven Rostedt,
	John Ogness, Andrew Morton, Linus Torvalds, linux-kernel

On (03/05/19 10:23), Tetsuo Handa wrote:
> > > > [..]
> > > >> This patch tries to address "don't lockup the system" with minimal risk of
> > > >> failing to "print out printk() messages", by allowing printk() callers to
> > > >> tell printk() "store $body_text_lines lines into logbuf but start actual
> > > >> printing after $trailer_text_line line is stored into logbuf". This patch
> > > >> is different from existing printk_deferred(), for printk_deferred() is
> > > >> intended for scheduler/timekeeping use only. Moreover, what this patch
> > > >> wants to do is "do not try to print out printk() messages as soon as
> > > >> possible", for accumulated stalling period cannot be decreased if
> > > >> printk_deferred() from e.g. dump_tasks() from out_of_memory() immediately
> > > >> prints out the messages. The point of this patch is to defer the stalling
> > > >> duration to after leaving the critical section.
> > > > 
> > > > We can export printk deferred, I guess; but I'm not sure if it's going
> > > > to be easy to switch OOM to printk_deferred - there are lots of direct
> > > > printk callers: warn-s, dump_stacks, etc; it might even be simpler to
> > > > start re-directing OOM printouts to printk_safe buffer.
> > 
> > Exactly. OOM calls many functions that are called also in other situations.
> > The async messages are not pushed to the console unless someone calls
> > a non-async printk.
> > 
> > How do you want to guarantee that the non-async printk will get called
> > in all situations? The async printk() API is too error-prone.
> 
> I'm suggesting to use a non-async printk() for $trailer_text_line line. I think
> that changing all printk() called from out_of_memory() to async is doable, if
> the caller of out_of_memory() wakes up a dedicated kthread described below.

I'm not objecting Petr's opinion on this.

But, personally, I think that this idea of "put this into logbuf,
but don't poke serial consoles now" is not bad. Sort of a buffering,
but not precisely buffering, because we keep messages in the logbuf,
so any other CPU can flush pending messages for us, and we don't mix
up messages' order and don't lose messages (unless we can't flush
logbuf).

This buffering is better than printk_safe(); first, because printk_safe()
buffers are much smaller in size, second, printk_safe() results in
out-of-order messages and timestamps (some people hate this) and, third,
printk_safe() eventually attempts to console_unlock() from IRQ.

And it's better than printk_deferred(), because printk_deferred() does
console_unlock() from IRQ.

But, at the same time, it's yet another buffering. I understand Petr's
concerns. And, in this particular case, this buffering might look like
we are fixing the problem on the wrong side (printk). Who knows, we
may find some use cases for this buffered printk, unrelated to OOM.
Just my 5 cents.

	-ss

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] printk: Introduce "store now but print later" prefix.
  2019-03-05  1:23       ` Tetsuo Handa
  2019-03-05  7:52         ` Sergey Senozhatsky
@ 2019-03-05 12:57         ` Michal Hocko
  2019-03-06 10:04         ` Petr Mladek
  2 siblings, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2019-03-05 12:57 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Petr Mladek, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, John Ogness, Andrew Morton, Linus Torvalds,
	linux-kernel

On Tue 05-03-19 10:23:03, Tetsuo Handa wrote:
[...]
> Also, both nopage_rs in warn_alloc() and oom_rs in oom_kill_process() are not
> working well. This is because ___ratelimit() function assumes that operations
> to be ratelimited complete fast enough to be able to repeat many times within
> a second. If one operation to be ratelimited takes many seconds (or even
> minutes), ___ratelimit() becomes always true and can not ratelimit at all.

And this is a fundamental problem of the ratelimiting and something to
think about and address. The underlying assumption of the ratelimit
implementation that the throttled operation is mostly too short to
consider is just wrong in this and potentially many other cases. Well
basically whenever the API is used for printing. So rather than working
around this and complicating the code around I would focus on making
ratelimit implementation more robust for long taking operations.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] printk: Introduce "store now but print later" prefix.
  2019-03-05  1:23       ` Tetsuo Handa
  2019-03-05  7:52         ` Sergey Senozhatsky
  2019-03-05 12:57         ` Michal Hocko
@ 2019-03-06 10:04         ` Petr Mladek
  2019-03-06 14:27           ` Sergey Senozhatsky
  2019-03-06 18:24           ` Tetsuo Handa
  2 siblings, 2 replies; 17+ messages in thread
From: Petr Mladek @ 2019-03-06 10:04 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Steven Rostedt,
	John Ogness, Andrew Morton, Linus Torvalds, linux-kernel

On Tue 2019-03-05 10:23:03, Tetsuo Handa wrote:
> Petr Mladek wrote:
> > On Mon 2019-03-04 20:40:37, Tetsuo Handa wrote:
> > > On 2019/03/04 12:22, Sergey Senozhatsky wrote:
> > > > On (02/23/19 13:42), Tetsuo Handa wrote:
> > > > [..]
> > > >> This patch tries to address "don't lockup the system" with minimal risk of
> > > >> failing to "print out printk() messages", by allowing printk() callers to
> > > >> tell printk() "store $body_text_lines lines into logbuf but start actual
> > > >> printing after $trailer_text_line line is stored into logbuf". This patch
> > > >> is different from existing printk_deferred(), for printk_deferred() is
> > > >> intended for scheduler/timekeeping use only. Moreover, what this patch
> > > >> wants to do is "do not try to print out printk() messages as soon as
> > > >> possible", for accumulated stalling period cannot be decreased if
> > > >> printk_deferred() from e.g. dump_tasks() from out_of_memory() immediately
> > > >> prints out the messages. The point of this patch is to defer the stalling
> > > >> duration to after leaving the critical section.
> > > > 
> > > > We can export printk deferred, I guess; but I'm not sure if it's going
> > > > to be easy to switch OOM to printk_deferred - there are lots of direct
> > > > printk callers: warn-s, dump_stacks, etc; it might even be simpler to
> > > > start re-directing OOM printouts to printk_safe buffer.
> > 
> > Exactly. OOM calls many functions that are called also in other situations.
> > The async messages are not pushed to the console unless someone calls
> > a non-async printk.
> > 
> > How do you want to guarantee that the non-async printk will get called
> > in all situations? The async printk() API is too error-prone.
> 
> I'm suggesting to use a non-async printk() for $trailer_text_line line. I think
> that changing all printk() called from out_of_memory() to async is doable, if
> the caller of out_of_memory() wakes up a dedicated kthread described below.

This is error prone. The trailing printk is not guaranteed:

    + It might get lost by code refactoring.

    + People would miss that it is needed when async printk is used
      in a shared function, e.g. dump_stack().

    + People will not know that it is needed because they
      are not familiar with the API. The would just use it
      as part of a cargo cult programming or so.

For this reason, this API is a "no no no" from my side.


I am open to bring back discussions about a forced offload when
a single task is blocked in console_unlock() for too long.

It might be configurable depending on the usecase. For example,
real time kernel would be more aggressive with offloading.
It might need to be completely disabled to debug certain
type of problems, e.g. sudden death.


But I would really like to avoid even more printk variants.
They have many problems:

    + Only a handful of people would know how to use them right.

    + They are hard to maintain. Any incompatible printk() in the call
      chain might break the intention.

    + They are problematic in a shared code. Incompatible printk()
      variants might be required/preferred according to the use case.

    + They might improve situation in one workload and make it
      worse in another one.


> > For example, I would stop printing allocation failures until
> > the currently running OOM killer succeeds in freeing some memory.
> 
> We could serialize out_of_memory() and warn_alloc(), but since warn_alloc()
> can be called from atomic context (e.g. GFP_ATOMIC from interrupt handler)
> while out_of_memory() is always called from schedulable context (e.g.
> GFP_KERNEL), we can't use oom_lock mutex.

I did not mean serializing. I meant to avoid printing the warnings
at all until OOM killer finishes its job.


> Also, both nopage_rs in warn_alloc() and oom_rs in oom_kill_process() are not
> working well. This is because ___ratelimit() function assumes that operations
> to be ratelimited complete fast enough to be able to repeat many times within
> a second. If one operation to be ratelimited takes many seconds (or even
> minutes), ___ratelimit() becomes always true and can not ratelimit at all.

The current ratelimiting is time driven. We might need an event
driven variant. It might even be done with the current
implementation if we add something like:

void ratelimit_reset(struct ratelimit_state *rs)
{
	unsigned long flags;

	raw_spin_irqsave(&rs->lock, flags);

	rs->begin   = jiffies;
	rs->printed = 0;

	raw_spin_unlock_irqrestore(&rs->lock, flags);
}

We could call this when some event "solved" the problem.
It means that it makes sense to enable the related
ratelimited messages again because they would describe
another problem.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] printk: Introduce "store now but print later" prefix.
  2019-03-06 10:04         ` Petr Mladek
@ 2019-03-06 14:27           ` Sergey Senozhatsky
  2019-03-06 18:24           ` Tetsuo Handa
  1 sibling, 0 replies; 17+ messages in thread
From: Sergey Senozhatsky @ 2019-03-06 14:27 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Tetsuo Handa, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, John Ogness, Andrew Morton, Linus Torvalds,
	linux-kernel

On (03/06/19 11:04), Petr Mladek wrote:
> > I'm suggesting to use a non-async printk() for $trailer_text_line line. I think
> > that changing all printk() called from out_of_memory() to async is doable, if
> > the caller of out_of_memory() wakes up a dedicated kthread described below.
> 
> This is error prone. The trailing printk is not guaranteed:
> 
>     + It might get lost by code refactoring.
> 
>     + People would miss that it is needed when async printk is used
>       in a shared function, e.g. dump_stack().
> 
>     + People will not know that it is needed because they
>       are not familiar with the API. The would just use it
>       as part of a cargo cult programming or so.
> 
> For this reason, this API is a "no no no" from my side.

Conceptually, this buffering is just pr_cont() with extension
from "1 line" to "N lines". pr_cont() also waits for a special
marker to flush the buffered data - that marker is pr_cont('\n'),
which forces printk to log_store() cont buffer. In Tetsuo's
proposal the marker is different. That's the only difference I
see. Should any other CPU printk() anything - it will print out
buffered data, same thing as pr_cont(), again.

	-ss

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] printk: Introduce "store now but print later" prefix.
  2019-03-06 10:04         ` Petr Mladek
  2019-03-06 14:27           ` Sergey Senozhatsky
@ 2019-03-06 18:24           ` Tetsuo Handa
  2019-03-15 10:49             ` Tetsuo Handa
                               ` (2 more replies)
  1 sibling, 3 replies; 17+ messages in thread
From: Tetsuo Handa @ 2019-03-06 18:24 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Steven Rostedt,
	John Ogness, Andrew Morton, Linus Torvalds, linux-kernel,
	Michal Hocko

On 2019/03/06 19:04, Petr Mladek wrote:
>> I'm suggesting to use a non-async printk() for $trailer_text_line line. I think
>> that changing all printk() called from out_of_memory() to async is doable, if
>> the caller of out_of_memory() wakes up a dedicated kthread described below.
> 
> This is error prone. The trailing printk is not guaranteed:
> 
>     + It might get lost by code refactoring.

Caller uses this async printk() with their own risk. This concern is same with
"buffering into on-stack 'char buf[80]' using snprintf() and forgetting to flush
it using printk()".

> 
>     + People would miss that it is needed when async printk is used
>       in a shared function, e.g. dump_stack().

I'm planning to allow async printk() for shared functions like dump_stack()
because there will be some $trailer_text_line line after such shared functions.
(By the way, for that reason, I expect that we won't do a tree-wide KERN_DEFAULT
removal so that callers can pass KERN_DEFAULT_ASYNC as an function argument to
such shared functions.)

> 
>     + People will not know that it is needed because they
>       are not familiar with the API. The would just use it
>       as part of a cargo cult programming or so.

Same with "buffering into on-stack buffer and forgetting to flush it" above.

I'm planning to allow any shared functions which can be called from e.g.
warn_alloc(), out_of_memory(), SysRq-t by allowing such shared functions
to accept loglevel argument.

> 
> For this reason, this API is a "no no no" from my side.
> 

There are two types of printk() data. One is "small/short messages but
partial/incomplete line (even if a few bytes) might be helpful" and
the other is "large/long messages but partial/incomplete line is hardly
helpful". The former uses "sync printk()s" and the latter uses "async
printk()s followed by a sync printk()". And this API is powerful for the
latter type which are called by for_each_*() pattern (e.g. dump_tasks()).

> 
> I am open to bring back discussions about a forced offload when
> a single task is blocked in console_unlock() for too long.

This is a matter of adding hints for judging which thread
(console_owner or console_waiter) should continue printing.

> 
> It might be configurable depending on the usecase. For example,
> real time kernel would be more aggressive with offloading.
> It might need to be completely disabled to debug certain
> type of problems, e.g. sudden death.
> 
> 
> But I would really like to avoid even more printk variants.
> They have many problems:
> 
>     + Only a handful of people would know how to use them right.

The caller of async printk() knows when/how to use async printk().

> 
>     + They are hard to maintain. Any incompatible printk() in the call
>       chain might break the intention.

There is no incompatible printk(). Only when to wake up a thread which
writes to consoles differs.

> 
>     + They are problematic in a shared code. Incompatible printk()
>       variants might be required/preferred according to the use case.

Again, there is no incompatible printk().

> 
>     + They might improve situation in one workload and make it
>       worse in another one.

Avoiding e.g. RCU stall / OOM stall by deferring write to consoles is
an improvement. Reducing the side effect of accumulated pending logs
is matters of "how we can try to offload to an idle thread (or utilize
idle CPUs) for writing to consoles" and "how we can try to ratelimit
appropriately".

To me, none of these concerns is a valid reason to keep "store now but
print later" prefix away.



> 
> 
>>> For example, I would stop printing allocation failures until
>>> the currently running OOM killer succeeds in freeing some memory.
>>
>> We could serialize out_of_memory() and warn_alloc(), but since warn_alloc()
>> can be called from atomic context (e.g. GFP_ATOMIC from interrupt handler)
>> while out_of_memory() is always called from schedulable context (e.g.
>> GFP_KERNEL), we can't use oom_lock mutex.
> 
> I did not mean serializing. I meant to avoid printing the warnings
> at all until OOM killer finishes its job.

But your ratelimit_reset() below requires serializing.

> 
> 
>> Also, both nopage_rs in warn_alloc() and oom_rs in oom_kill_process() are not
>> working well. This is because ___ratelimit() function assumes that operations
>> to be ratelimited complete fast enough to be able to repeat many times within
>> a second. If one operation to be ratelimited takes many seconds (or even
>> minutes), ___ratelimit() becomes always true and can not ratelimit at all.
> 
> The current ratelimiting is time driven. We might need an event
> driven variant. It might even be done with the current
> implementation if we add something like:
> 
> void ratelimit_reset(struct ratelimit_state *rs)
> {
> 	unsigned long flags;
> 
> 	raw_spin_irqsave(&rs->lock, flags);
> 
> 	rs->begin   = jiffies;
> 	rs->printed = 0;
> 
> 	raw_spin_unlock_irqrestore(&rs->lock, flags);
> }
> 
> We could call this when some event "solved" the problem.

This requires serialization among threads using "rs". I already
proposed ratelimit_reset() for memcg's OOM problem at
https://lkml.kernel.org/r/201810180246.w9I2koi3011358@www262.sakura.ne.jp
but it was not accepted.

> It means that it makes sense to enable the related
> ratelimited messages again because they would describe
> another problem.

___ratelimit() could also check number of not-yet-flushed
printk() records (e.g. log_next_seq - console_seq <= $some_threshold).


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] printk: Introduce "store now but print later" prefix.
  2019-03-06 18:24           ` Tetsuo Handa
@ 2019-03-15 10:49             ` Tetsuo Handa
  2019-03-20 15:04             ` Petr Mladek
  2019-03-20 15:25             ` ratelimit API: was: " Petr Mladek
  2 siblings, 0 replies; 17+ messages in thread
From: Tetsuo Handa @ 2019-03-15 10:49 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Steven Rostedt,
	John Ogness, Andrew Morton, Linus Torvalds, linux-kernel,
	Michal Hocko, Calvin Owens

On 2019/03/07 3:24, Tetsuo Handa wrote:
> On 2019/03/06 19:04, Petr Mladek wrote:
> I'm planning to allow any shared functions which can be called from e.g.
> warn_alloc(), out_of_memory(), SysRq-t by allowing such shared functions
> to accept loglevel argument.

Calvin Owens is proposing "printk: Introduce per-console loglevel setting"
which will allow reducing overall sequential call_console_drivers() delay.

> 
>>
>> For this reason, this API is a "no no no" from my side.
>>
> 
> There are two types of printk() data. One is "small/short messages but
> partial/incomplete line (even if a few bytes) might be helpful" and
> the other is "large/long messages but partial/incomplete line is hardly
> helpful". The former uses "sync printk()s" and the latter uses "async
> printk()s followed by a sync printk()". And this API is powerful for the
> latter type which are called by for_each_*() pattern (e.g. dump_tasks()).

I am proposing "printk: Introduce "store now but print later" prefix."
which will allow deferring call_console_drivers() until the caller leaves
critical sections.

> 
>>
>> I am open to bring back discussions about a forced offload when
>> a single task is blocked in console_unlock() for too long.
> 
> This is a matter of adding hints for judging which thread
> (console_owner or console_waiter) should continue printing.

And the patch shown bottom is an example of

  we could add hints for judging which thread (console_owner or
  console_waiter) should continue printing

tuning. It is "mostly offload" rather than "forced offload".

>>     + They might improve situation in one workload and make it
>>       worse in another one.
> 
> Avoiding e.g. RCU stall / OOM stall by deferring write to consoles is
> an improvement. Reducing the side effect of accumulated pending logs
> is matters of "how we can try to offload to an idle thread (or utilize
> idle CPUs) for writing to consoles" and "how we can try to ratelimit
> appropriately".
> 
> To me, none of these concerns is a valid reason to keep "store now but
> print later" prefix away.

In "Re: [PATCH v10 1/2] printk: Make printk() completely async" at
https://lkml.kernel.org/r/20160404155149.a3e3307def2d1315e2099c63@linux-foundation.org ,
Andrew Morton said

  The whole idea remains worrisome.

but I couldn't find the reason which still holds true. We already have
"printk: Console owner and waiter logic cleanup" which allows us to put
some limit on the number of characters which the printing task will print.

Worrying memory corruption sounds strange. What if memory parity error
inverted one bit of the address stored in log_buf, or inverted one bit of
the records stored in log_buf region? There are inevitable cases where
printk() can't survive memory corruption.

Forcing all messages to be offloaded might be bad. But I think that
allowing caller specified messages (i.e. "large/long messages but
partial/incomplete line is hardly helpful" case) to be offloaded
when possible is fine.

With these patches, apart from the console driver's locking dependency
problem remains, I think that we can reasonably avoid stalls caused by
flood of printk() from critical sections.

---
 kernel/printk/printk.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 81 insertions(+), 2 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 6b7654b..e1591dd 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -47,6 +47,7 @@
 #include <linux/sched/clock.h>
 #include <linux/sched/debug.h>
 #include <linux/sched/task_stack.h>
+#include <linux/kthread.h>
 
 #include <linux/uaccess.h>
 #include <asm/sections.h>
@@ -1616,6 +1617,12 @@ int do_syslog(int type, char __user *buf, int len, int source)
 static DEFINE_RAW_SPINLOCK(console_owner_lock);
 static struct task_struct *console_owner;
 static bool console_waiter;
+static struct task_struct *console_writer;
+
+static void console_writer_work_func(struct irq_work *irq_work)
+{
+	wake_up_process(console_writer);
+}
 
 /**
  * console_lock_spinning_enable - mark beginning of code where another
@@ -1703,7 +1710,7 @@ static int console_trylock_spinning(void)
 	raw_spin_lock(&console_owner_lock);
 	owner = READ_ONCE(console_owner);
 	waiter = READ_ONCE(console_waiter);
-	if (!waiter && owner && owner != current) {
+	if (!waiter && owner && owner != current && owner != console_writer) {
 		WRITE_ONCE(console_waiter, true);
 		spin = true;
 	}
@@ -1960,6 +1967,33 @@ asmlinkage int vprintk_emit(int facility, int level,
 	/* If called from the scheduler, we can not call up(). */
 	if (!in_sched && pending_output) {
 		/*
+		 * Try to wake up the console printing intercepting kernel
+		 * thread which tries to hand over from process context.
+		 *
+		 * If current thread can call wake_up_process(), we can wake up
+		 * the thread on any CPU which is ready to execute process
+		 * context. Otherwise, we will need to wake up the thread via
+		 * IRQ work. I wish we can wake up the thread from some CPU
+		 * which is ready to flush IRQ works so that this IRQ work can
+		 * wake up the thread as soon as possible...
+		 *
+		 * This attempt will fail if the thread cannot run for some
+		 * reason (e.g. all CPUs are busy executing interrupt handlers
+		 * or high priority threads). But this attempt is allowed to
+		 * remain best-effort, for current thread will try to hand over
+		 * from current context, by finding that
+		 * console_owner != console_writer when this attempt failed,
+		 */
+		if (console_writer && console_owner != console_writer) {
+			static DEFINE_IRQ_WORK(console_writer_work,
+					       console_writer_work_func);
+
+			if (in_task())
+				wake_up_process(console_writer);
+			else
+				irq_work_queue(&console_writer_work);
+		}
+		/*
 		 * Disable preemption to avoid being preempted while holding
 		 * console_sem which would prevent anyone from printing to
 		 * console
@@ -1981,6 +2015,43 @@ asmlinkage int vprintk_emit(int facility, int level,
 }
 EXPORT_SYMBOL(vprintk_emit);
 
+static int console_writer_thread(void *unused)
+{
+	unsigned long last_woken_up = jiffies;
+
+	console_writer = current;
+	while (true) {
+		/*
+		 * Since printk() is synchronous, it is possible that a thread
+		 * which called printk() already incremented the console_seq
+		 * and console_seq == log_next_seq became true by the moment
+		 * this kernel thread got woken up.
+		 * Therefore, speculatively keep trying to become the console
+		 * waiter thread for a while, in case a series of printk() is
+		 * called within that period.
+		 */
+		if (console_seq == log_next_seq &&
+		    time_after(jiffies, last_woken_up + HZ)) {
+			schedule_timeout_idle(MAX_SCHEDULE_TIMEOUT);
+			last_woken_up = jiffies;
+		} else {
+			schedule_timeout_idle(1);
+		}
+		preempt_disable();
+		if (console_trylock_spinning())
+			console_unlock();
+		preempt_enable();
+	}
+	return 0;
+}
+
+static int __init console_writer_init(void)
+{
+	kthread_run(console_writer_thread, NULL, "console_writer");
+	return 0;
+}
+subsys_initcall(console_writer_init)
+
 asmlinkage int vprintk(const char *fmt, va_list args)
 {
 	return vprintk_func(fmt, args);
@@ -2458,8 +2529,16 @@ void console_unlock(void)
 
 		printk_safe_exit_irqrestore(flags);
 
-		if (do_cond_resched)
+		if (do_cond_resched) {
+			/*
+			 * Try to wake up the console printing intercepting
+			 * kernel thread if current thread is about to complete
+			 * console_lock()/console_unlock() section.
+			 */
+			if (console_writer)
+				wake_up_process(console_writer);
 			cond_resched();
+		}
 	}
 
 	console_locked = 0;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] printk: Introduce "store now but print later" prefix.
  2019-03-06 18:24           ` Tetsuo Handa
  2019-03-15 10:49             ` Tetsuo Handa
@ 2019-03-20 15:04             ` Petr Mladek
  2019-03-20 15:25             ` ratelimit API: was: " Petr Mladek
  2 siblings, 0 replies; 17+ messages in thread
From: Petr Mladek @ 2019-03-20 15:04 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Steven Rostedt,
	John Ogness, Andrew Morton, Linus Torvalds, linux-kernel,
	Michal Hocko

On Thu 2019-03-07 03:24:25, Tetsuo Handa wrote:
> On 2019/03/06 19:04, Petr Mladek wrote:
> >> I'm suggesting to use a non-async printk() for $trailer_text_line line. I think
> >> that changing all printk() called from out_of_memory() to async is doable, if
> >> the caller of out_of_memory() wakes up a dedicated kthread described below.
> > 
> > This is error prone. The trailing printk is not guaranteed:
> > 
> >     + It might get lost by code refactoring.
> 
> Caller uses this async printk() with their own risk. This concern is same with
> "buffering into on-stack 'char buf[80]' using snprintf() and forgetting to flush
> it using printk()".

AFAIK, all pr_cont() users are self-contained. Most of them print all
pieces in the same function. Some of them print the part of the lines
using some helper scripts but these are defined in the same source
file, usually right above the caller.

> > 
> >     + People would miss that it is needed when async printk is used
> >       in a shared function, e.g. dump_stack().
> 
> I'm planning to allow async printk() for shared functions like dump_stack()
> because there will be some $trailer_text_line line after such shared functions.
> (By the way, for that reason, I expect that we won't do a tree-wide KERN_DEFAULT
> removal so that callers can pass KERN_DEFAULT_ASYNC as an function argument to
> such shared functions.)

Now, you are going to spread related lines all over the kernel sources.


> > But I would really like to avoid even more printk variants.
> > They have many problems:
> > 
> >     + Only a handful of people would know how to use them right.
> 
> The caller of async printk() knows when/how to use async printk().
> 
> > 
> >     + They are hard to maintain. Any incompatible printk() in the call
> >       chain might break the intention.
> 
> There is no incompatible printk(). Only when to wake up a thread which
> writes to consoles differs.

By incompatible I mean the following:

   + Adding any synchronous printk() into a chain of async printk's
     will break the chain. I mean that the consoles will get handled
     by the new normal printk() before the original author wanted.

   + Also the wanted normal printk() might get removed from some
     reason (by mistake). Then all the previous async lines will
     not get reliably flushed.

Both situations are bugs that might get fixed. But the API
is just too error prone from my POV.

We are repeating the same arguments by different words all over
again. I explained why the API is not acceptable for me and
suggested other approaches. This is my last mail about the API.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: ratelimit API: was: [RFC PATCH] printk: Introduce "store now but print later" prefix.
  2019-03-06 18:24           ` Tetsuo Handa
  2019-03-15 10:49             ` Tetsuo Handa
  2019-03-20 15:04             ` Petr Mladek
@ 2019-03-20 15:25             ` Petr Mladek
  2019-03-21  8:13               ` Tetsuo Handa
  2 siblings, 1 reply; 17+ messages in thread
From: Petr Mladek @ 2019-03-20 15:25 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Steven Rostedt,
	John Ogness, Andrew Morton, Linus Torvalds, linux-kernel,
	Michal Hocko

On Thu 2019-03-07 03:24:25, Tetsuo Handa wrote:
> On 2019/03/06 19:04, Petr Mladek wrote:
> > I did not mean serializing. I meant to avoid printing the warnings
> > at all until OOM killer finishes its job.
> 
> But your ratelimit_reset() below requires serializing.
> 
> > 
> > 
> >> Also, both nopage_rs in warn_alloc() and oom_rs in oom_kill_process() are not
> >> working well. This is because ___ratelimit() function assumes that operations
> >> to be ratelimited complete fast enough to be able to repeat many times within
> >> a second. If one operation to be ratelimited takes many seconds (or even
> >> minutes), ___ratelimit() becomes always true and can not ratelimit at all.
> > 
> > The current ratelimiting is time driven. We might need an event
> > driven variant. It might even be done with the current
> > implementation if we add something like:
> > 
> > void ratelimit_reset(struct ratelimit_state *rs)
> > {
> > 	unsigned long flags;
> > 
> > 	raw_spin_irqsave(&rs->lock, flags);
> > 
> > 	rs->begin   = jiffies;
> > 	rs->printed = 0;
> > 
> > 	raw_spin_unlock_irqrestore(&rs->lock, flags);
> > }
> > 
> > We could call this when some event "solved" the problem.
> 
> This requires serialization among threads using "rs". I already
> proposed ratelimit_reset() for memcg's OOM problem at
> https://lkml.kernel.org/r/201810180246.w9I2koi3011358@www262.sakura.ne.jp
> but it was not accepted.

IMHO, the main problem was that the patch tried to work around
the ratelimit API weakness by a custom code.

I believe that using an improved/extended ratelimit API with
a sane semantic would be more acceptable.


> > It means that it makes sense to enable the related
> > ratelimited messages again because they would describe
> > another problem.
> 
> ___ratelimit() could also check number of not-yet-flushed
> printk() records (e.g. log_next_seq - console_seq <= $some_threshold).

The number is almost useless without more information, for example,
how fast the consoles are, how many lines will get filtered
by a console_loglevel, if the console_sem owner is sleeping,
how many messages are being added by other CPUs.

I believe that we do not really need it. The ratelimit_reset()
user should know when the messages can get skipped because
they describe the same situation again and again.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: ratelimit API: was: [RFC PATCH] printk: Introduce "store now but print later" prefix.
  2019-03-20 15:25             ` ratelimit API: was: " Petr Mladek
@ 2019-03-21  8:13               ` Tetsuo Handa
  2019-03-21  8:49                 ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Tetsuo Handa @ 2019-03-21  8:13 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Steven Rostedt,
	John Ogness, Andrew Morton, Linus Torvalds, linux-kernel,
	Michal Hocko

On 2019/03/21 0:25, Petr Mladek wrote:
>> This requires serialization among threads using "rs". I already
>> proposed ratelimit_reset() for memcg's OOM problem at
>> https://lkml.kernel.org/r/201810180246.w9I2koi3011358@www262.sakura.ne.jp
>> but it was not accepted.
> 
> IMHO, the main problem was that the patch tried to work around
> the ratelimit API weakness by a custom code.
> 
> I believe that using an improved/extended ratelimit API with
> a sane semantic would be more acceptable.
> 

Michal, are you OK to use ratelimit_reset() in out_of_memory()
if ratelimit_reset() is accepted?

> 
>>> It means that it makes sense to enable the related
>>> ratelimited messages again because they would describe
>>> another problem.
>>
>> ___ratelimit() could also check number of not-yet-flushed
>> printk() records (e.g. log_next_seq - console_seq <= $some_threshold).
> 
> The number is almost useless without more information, for example,
> how fast the consoles are, how many lines will get filtered
> by a console_loglevel, if the console_sem owner is sleeping,
> how many messages are being added by other CPUs.
> 
> I believe that we do not really need it. The ratelimit_reset()
> user should know when the messages can get skipped because
> they describe the same situation again and again.

If printk() becomes asynchronous (either my "synchronous by default +
console_writer kernel thread" or John's "asynchronous by default +
printk kernel thread" is accepted), there will be little delay between
___ratelimit() and ratelimit_reset() enough to make ratelimit_reset()
unnecessary. Also, "number of not-yet-flushed printk() records" will
become meaningful if printk() becomes asynchronous.

But "how fast the consoles" depends on what consoles will be added/removed
during messages are flushed, "how many lines will get filtered by a
console_loglevel" depends on whether someone will change it during
messages are flushed, "how many messages are being added by other CPUs
during the console_sem owner is sleeping" (or "how many bytes which will
need to be written to consoles") depends on stress at that moment. None of
these example information will be reliable for making ___ratelimit() decision.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: ratelimit API: was: [RFC PATCH] printk: Introduce "store now but print later" prefix.
  2019-03-21  8:13               ` Tetsuo Handa
@ 2019-03-21  8:49                 ` Michal Hocko
  0 siblings, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2019-03-21  8:49 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Petr Mladek, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, John Ogness, Andrew Morton, Linus Torvalds,
	linux-kernel

On Thu 21-03-19 17:13:54, Tetsuo Handa wrote:
> On 2019/03/21 0:25, Petr Mladek wrote:
> >> This requires serialization among threads using "rs". I already
> >> proposed ratelimit_reset() for memcg's OOM problem at
> >> https://lkml.kernel.org/r/201810180246.w9I2koi3011358@www262.sakura.ne.jp
> >> but it was not accepted.
> > 
> > IMHO, the main problem was that the patch tried to work around
> > the ratelimit API weakness by a custom code.
> > 
> > I believe that using an improved/extended ratelimit API with
> > a sane semantic would be more acceptable.
> > 
> 
> Michal, are you OK to use ratelimit_reset() in out_of_memory()
> if ratelimit_reset() is accepted?

I do not know what ratelimit_reset is but if that is a new API for
a more reasonable ratelimiting then sure, I do not have any objections.
I have been objecting to one-off hacks to workaround problems of the
existing api.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2019-03-21  8:49 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-23  4:42 [RFC PATCH] printk: Introduce "store now but print later" prefix Tetsuo Handa
2019-03-04  3:22 ` Sergey Senozhatsky
2019-03-04 11:40   ` Tetsuo Handa
2019-03-04 12:09     ` Sergey Senozhatsky
2019-03-04 14:23     ` Petr Mladek
2019-03-04 14:37       ` Sergey Senozhatsky
2019-03-05  1:23       ` Tetsuo Handa
2019-03-05  7:52         ` Sergey Senozhatsky
2019-03-05 12:57         ` Michal Hocko
2019-03-06 10:04         ` Petr Mladek
2019-03-06 14:27           ` Sergey Senozhatsky
2019-03-06 18:24           ` Tetsuo Handa
2019-03-15 10:49             ` Tetsuo Handa
2019-03-20 15:04             ` Petr Mladek
2019-03-20 15:25             ` ratelimit API: was: " Petr Mladek
2019-03-21  8:13               ` Tetsuo Handa
2019-03-21  8:49                 ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).