linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm: don't warn about allocations which stall for too long
@ 2017-10-26 11:28 Tetsuo Handa
  2017-10-26 11:41 ` Michal Hocko
                   ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Tetsuo Handa @ 2017-10-26 11:28 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, linux-kernel, Tetsuo Handa, Cong Wang, Dave Hansen,
	Johannes Weiner, Mel Gorman, Michal Hocko, Petr Mladek,
	Sergey Senozhatsky, Vlastimil Babka, yuwang.yuwang

Commit 63f53dea0c9866e9 ("mm: warn about allocations which stall for too
long") was a great step toward reducing the possibility of silent hang-up
problems caused by memory allocation stalls. But this commit reverts it:
it is possible to trigger an OOM lockup and/or soft lockups when many
threads concurrently call warn_alloc() (in order to warn about memory
allocation stalls) due to the current implementation of printk(), and it
is difficult to obtain useful information due to the limitations of the
synchronous warning approach.

The current printk() implementation flushes all pending logs using the
context of the thread which called console_unlock(). printk() should be
able to flush all pending logs eventually, unless somebody keeps
appending to the printk() buffer.

Since warn_alloc() started appending to the printk() buffer while waiting
for oom_kill_process() to make forward progress, at a time when
oom_kill_process() is the thread processing the pending logs, it became
possible for warn_alloc() to force oom_kill_process() to loop inside
printk(). As a result, warn_alloc() significantly increased the
possibility of preventing oom_kill_process() from making forward
progress.

---------- Pseudo code start ----------
Before warn_alloc() was introduced:

  retry:
    if (mutex_trylock(&oom_lock)) {
      while (atomic_read(&printk_pending_logs) > 0) {
        atomic_dec(&printk_pending_logs);
        print_one_log();
      }
      // Send SIGKILL here.
      mutex_unlock(&oom_lock);
    }
    goto retry;

After warn_alloc() was introduced:

  retry:
    if (mutex_trylock(&oom_lock)) {
      while (atomic_read(&printk_pending_logs) > 0) {
        atomic_dec(&printk_pending_logs);
        print_one_log();
      }
      // Send SIGKILL here.
      mutex_unlock(&oom_lock);
    } else if (waited_for_10seconds()) {
      atomic_inc(&printk_pending_logs);
    }
    goto retry;
---------- Pseudo code end ----------

Although waited_for_10seconds() becomes true only once per 10 seconds, an
unbounded number of threads can call waited_for_10seconds() at the same
time. Also, since the threads in waited_for_10seconds() keep running an
almost pure busy loop, the thread doing print_one_log() gets little CPU
time. Therefore, this situation can be simplified to

---------- Pseudo code start ----------
  retry:
    if (mutex_trylock(&oom_lock)) {
      while (atomic_read(&printk_pending_logs) > 0) {
        atomic_dec(&printk_pending_logs);
        print_one_log();
      }
      // Send SIGKILL here.
      mutex_unlock(&oom_lock);
    } else {
      atomic_inc(&printk_pending_logs);
    }
    goto retry;
---------- Pseudo code end ----------

when printk() is called faster than print_one_log() can process a log.
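
For illustration, this starvation is easy to reproduce in user space
with an ordinary mutex and a counter (a stand-alone model, not kernel
code; the names only mimic the pseudo code above):

---------- Sample program start ----------
/* Build with: cc -O2 -pthread model.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static atomic_long pending_logs;
static atomic_bool stop;
static pthread_mutex_t oom_lock = PTHREAD_MUTEX_INITIALIZER;

/* The oom_lock holder drains the log counter slowly (like a serial
 * console); everybody else appends to it, as in the pseudo code. */
static void *alloc_thread(void *unused)
{
        while (!atomic_load(&stop)) {
                if (pthread_mutex_trylock(&oom_lock) == 0) {
                        while (atomic_load(&pending_logs) > 0 &&
                               !atomic_load(&stop)) {
                                atomic_fetch_sub(&pending_logs, 1);
                                usleep(100);    /* print_one_log() */
                        }
                        pthread_mutex_unlock(&oom_lock);
                } else {
                        atomic_fetch_add(&pending_logs, 1); /* warn_alloc() */
                }
        }
        return NULL;
}

int main(void)
{
        pthread_t th[8];
        int i;

        for (i = 0; i < 8; i++)
                pthread_create(&th[i], NULL, alloc_thread, NULL);
        sleep(10);
        atomic_store(&stop, true);
        for (i = 0; i < 8; i++)
                pthread_join(th[i], NULL);
        /* Seven producers easily outrun the single consumer. */
        printf("pending_logs = %ld\n", (long)atomic_load(&pending_logs));
        return 0;
}
---------- Sample program end ----------

As long as appending is faster than draining, pending_logs only grows,
which is exactly the position the oom_lock holder is in above.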

One possible mitigation would be to introduce a new lock in order to make
sure that no other series of printk() calls (from either oom_kill_process()
or warn_alloc()) can append to the printk() buffer while one series of
printk() calls (from either oom_kill_process() or warn_alloc()) is already
in progress. Such serialization would also help obtain kernel messages in
readable form.

---------- Pseudo code start ----------
  retry:
    if (mutex_trylock(&oom_lock)) {
      mutex_lock(&oom_printk_lock);
      while (atomic_read(&printk_pending_logs) > 0) {
        atomic_dec(&printk_pending_logs);
        print_one_log();
      }
      // Send SIGKILL here.
      mutex_unlock(&oom_printk_lock);
      mutex_unlock(&oom_lock);
    } else {
      if (mutex_trylock(&oom_printk_lock)) {
        atomic_inc(&printk_pending_logs);
        mutex_unlock(&oom_printk_lock);
      }
    }
    goto retry;
---------- Pseudo code end ----------

But this commit does not go in that direction, because we don't want to
introduce a new lock dependency, and because we are unlikely to obtain
useful information even if we serialized oom_kill_process() and
warn_alloc().

The synchronous approach is prone to unexpected results (e.g. too late [1],
too frequent [2], overlooked [3]). As far as I know, warn_alloc() never
helped with providing information other than "something is going wrong".
I want to consider an asynchronous approach which can obtain information
about possibly relevant threads (e.g. the owner of oom_lock and
kswapd-like threads) during stalls, and which can serve as a trigger for
actions (e.g. turning tracepoints on/off, asking a libvirt daemon to take
a memory dump of a stalling KVM guest for diagnostic purposes).

This commit temporarily loses the ability to report e.g. an OOM lockup
caused by being unable to invoke the OOM killer due to a !__GFP_FS
allocation request. But the asynchronous approach will be able to detect
such a situation and emit a warning. Thus, let's remove warn_alloc().
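
For example, the asynchronous watcher could be a dedicated kernel thread
along these lines (only a sketch of the direction; memalloc_info,
memalloc_list and the bookkeeping behind them are hypothetical and do
not exist today):

---------- Pseudo code start ----------
/* One thread reports stalls on behalf of all stalling threads,
 * so the allocating threads themselves never call printk(). */
static int memalloc_watchdog(void *unused)
{
        while (!kthread_should_stop()) {
                struct memalloc_info *m;

                schedule_timeout_interruptible(10 * HZ);
                rcu_read_lock();
                list_for_each_entry_rcu(m, &memalloc_list, list) {
                        if (time_after(jiffies, m->start + 10 * HZ))
                                pr_warn("%s/%u stalling for %ums\n",
                                        m->comm, m->pid,
                                        jiffies_to_msecs(jiffies - m->start));
                }
                rcu_read_unlock();
        }
        return 0;
}
---------- Pseudo code end ----------

Since only one thread would be appending such messages, the flooding
problem described above cannot happen, and the watchdog could dump the
state of the relevant threads instead of only its own.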

[1] https://bugzilla.kernel.org/show_bug.cgi?id=192981
[2] http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com
[3] commit db73ee0d46379922 ("mm, vmscan: do not loop on too_many_isolated for ever")

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Reported-by: yuwang.yuwang <yuwang.yuwang@alibaba-inc.com>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
---
 mm/page_alloc.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 97687b3..a4edfba 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3856,8 +3856,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	enum compact_result compact_result;
 	int compaction_retries;
 	int no_progress_loops;
-	unsigned long alloc_start = jiffies;
-	unsigned int stall_timeout = 10 * HZ;
 	unsigned int cpuset_mems_cookie;
 	int reserve_flags;
 
@@ -3989,14 +3987,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	if (!can_direct_reclaim)
 		goto nopage;
 
-	/* Make sure we know about allocations which stall for too long */
-	if (time_after(jiffies, alloc_start + stall_timeout)) {
-		warn_alloc(gfp_mask & ~__GFP_NOWARN, ac->nodemask,
-			"page allocation stalls for %ums, order:%u",
-			jiffies_to_msecs(jiffies-alloc_start), order);
-		stall_timeout += 10 * HZ;
-	}
-
 	/* Avoid recursion of direct reclaim */
 	if (current->flags & PF_MEMALLOC)
 		goto nopage;
-- 
1.8.3.1


* Re: [PATCH] mm: don't warn about allocations which stall for too long
  2017-10-26 11:28 [PATCH] mm: don't warn about allocations which stall for too long Tetsuo Handa
@ 2017-10-26 11:41 ` Michal Hocko
       [not found]   ` <91bdbdea-3f33-b7c0-8345-d0fa8c7f1cf1@sonymobile.com>
  2017-10-26 14:37 ` Johannes Weiner
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2017-10-26 11:41 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, linux-kernel, Cong Wang, Dave Hansen,
	Johannes Weiner, Mel Gorman, Petr Mladek, Sergey Senozhatsky,
	Vlastimil Babka, yuwang.yuwang

On Thu 26-10-17 20:28:59, Tetsuo Handa wrote:
> Commit 63f53dea0c9866e9 ("mm: warn about allocations which stall for too
> long") was a great step toward reducing the possibility of silent hang-up
> problems caused by memory allocation stalls. But this commit reverts it:
> it is possible to trigger an OOM lockup and/or soft lockups when many
> threads concurrently call warn_alloc() (in order to warn about memory
> allocation stalls) due to the current implementation of printk(), and it
> is difficult to obtain useful information due to the limitations of the
> synchronous warning approach.
> 
> [...]

The changelog is a bit excessive, but it points out that the real problem
is that printk is simply not ready for the sync warning, which is good to
document, and I hope that this will get addressed in the foreseeable
future. In the meantime it is simply better to remove the warning rather
than risk more issues.

Acked-by: Michal Hocko <mhocko@suse.com>

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: don't warn about allocations which stall for too long
  2017-10-26 11:28 [PATCH] mm: don't warn about allocations which stall for too long Tetsuo Handa
  2017-10-26 11:41 ` Michal Hocko
@ 2017-10-26 14:37 ` Johannes Weiner
  2017-10-31 19:32 ` Steven Rostedt
  2017-11-02 15:56 ` Steven Rostedt
  3 siblings, 0 replies; 25+ messages in thread
From: Johannes Weiner @ 2017-10-26 14:37 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, linux-kernel, Cong Wang, Dave Hansen, Mel Gorman,
	Michal Hocko, Petr Mladek, Sergey Senozhatsky, Vlastimil Babka,
	yuwang.yuwang

On Thu, Oct 26, 2017 at 08:28:59PM +0900, Tetsuo Handa wrote:
> [...] it is possible to trigger an OOM lockup and/or soft lockups when
> many threads concurrently call warn_alloc() (in order to warn about
> memory allocation stalls) due to the current implementation of printk(),
> and it is difficult to obtain useful information due to the limitations
> of the synchronous warning approach.
>
> [...]
>
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
> Reported-by: yuwang.yuwang <yuwang.yuwang@alibaba-inc.com>
> Reported-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
> Cc: Petr Mladek <pmladek@suse.com>

It would have been nice to be able to fix it instead, because there is
value in having the lockup detection. But it's true that it currently
causes more problems than it solves. Back to the drawing board for now.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks Tetsuo!


* Re: [PATCH] mm: don't warn about allocations which stall for too long
  2017-10-26 11:28 [PATCH] mm: don't warn about allocations which stall for too long Tetsuo Handa
  2017-10-26 11:41 ` Michal Hocko
  2017-10-26 14:37 ` Johannes Weiner
@ 2017-10-31 19:32 ` Steven Rostedt
  2017-11-01  8:30   ` Vlastimil Babka
                     ` (2 more replies)
  2017-11-02 15:56 ` Steven Rostedt
  3 siblings, 3 replies; 25+ messages in thread
From: Steven Rostedt @ 2017-10-31 19:32 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, linux-kernel, Cong Wang, Dave Hansen,
	Johannes Weiner, Mel Gorman, Michal Hocko, Petr Mladek,
	Sergey Senozhatsky, Vlastimil Babka, yuwang.yuwang


Thank you for the perfect timing. You posted this the day after I
proposed a new solution at Kernel Summit in Prague for the printk lock
loop that you experienced here.

I attached the pdf that I used for that discussion (ignore the last
slide, it was left over and I never went there).

My proposal is to do something like this with printk:

Three types of printk usages:

1) Active printer (actively writing to the console).
2) Waiter (active printer, first user)
3) Sees active printer and a waiter, and just adds to the log buffer
   and leaves.

(new globals)
static DEFINE_SPINLOCK(console_owner_lock);
static struct task_struct *console_owner;
static bool waiter;

console_unlock() {

[ Assumes this part cannot be preempted ]

	spin_lock(console_owner_lock);
	console_owner = current;
	spin_unlock(console_owner_lock);

	for each message
		write message out to console

		if (READ_ONCE(waiter))
			break;

	spin_lock(console_owner_lock);
	console_owner = NULL;
	spin_unlock(console_owner_lock);

[ preemption possible ]

	[ Needs to make sure waiter gets semaphore ]

	up(console_sem);
}


Then printk can have something like:


	if (console_trylock())
		console_unlock();
	else {
		struct task_struct *owner = NULL;

		spin_lock(console_owner_lock);
		if (waiter)
			goto out;
		WRITE_ONCE(waiter, true);
		owner = READ_ONCE(console_owner);		
	out:
		spin_unlock(console_owner_lock);
		if (owner) {
			while (!console_trylock())	
				cpu_relax();
			spin_lock(console_owner_lock);
			waiter = false;
			spin_unlock(console_owner_lock);
		}
	}

This way, only one CPU spins waiting to print, and only if the
console_lock owner is actively printing. If the console_lock owner
notices someone is waiting to print, it stops printing as a waiter will
always continue the prints. This will balance out the printks among all
the CPUs that are doing them and no one CPU will get stuck doing all
the printks.

This would solve your issue because the next warn_alloc() caller would
become the waiter, and take over the next message in the queue. This
would spread out the load of who does the actual printing, and not have
one printer be stuck doing the work.
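
To make the hand-off concrete, here is the same logic modeled in user
space (a sketch only: a pthread mutex stands in for console_sem, and all
the IRQ, preemption and lockdep details are ignored):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t console_sem = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t owner_lock = PTHREAD_MUTEX_INITIALIZER;
static bool have_owner;         /* models console_owner != NULL */
static atomic_bool waiter;
static atomic_int pending;      /* messages in the "log buffer" */

static void console_unlock_model(long cpu)
{
        pthread_mutex_lock(&owner_lock);
        have_owner = true;
        pthread_mutex_unlock(&owner_lock);

        while (atomic_load(&pending) > 0) {     /* for each message */
                atomic_fetch_sub(&pending, 1);
                printf("cpu%ld writes one message\n", cpu);
                if (atomic_load(&waiter))
                        break;                  /* a waiter takes over */
        }

        pthread_mutex_lock(&owner_lock);
        have_owner = false;
        pthread_mutex_unlock(&owner_lock);
        pthread_mutex_unlock(&console_sem);
}

static void printk_model(long cpu)
{
        bool spin = false;

        atomic_fetch_add(&pending, 1);          /* log_store() */
        if (pthread_mutex_trylock(&console_sem) == 0) {
                console_unlock_model(cpu);
                return;
        }
        pthread_mutex_lock(&owner_lock);
        if (have_owner && !atomic_load(&waiter)) {
                atomic_store(&waiter, true);    /* become the one waiter */
                spin = true;
        }
        pthread_mutex_unlock(&owner_lock);
        if (spin) {
                while (pthread_mutex_trylock(&console_sem) != 0)
                        ;                       /* cpu_relax() */
                atomic_store(&waiter, false);
                console_unlock_model(cpu);      /* continue the printing */
        }
        /* else: the message is left for the active printer */
}

static void *cpu_thread(void *arg)
{
        for (int i = 0; i < 200; i++)
                printk_model((long)arg);
        return NULL;
}

int main(void)
{
        pthread_t th[4];

        for (long i = 0; i < 4; i++)
                pthread_create(&th[i], NULL, cpu_thread, (void *)i);
        for (int i = 0; i < 4; i++)
                pthread_join(th[i], NULL);
        return 0;
}

The output should show the writing migrating between threads as soon as
a second thread starts waiting, instead of one thread being stuck with
everybody's backlog.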

-- Steve


On Thu, 26 Oct 2017 20:28:59 +0900
Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:

> [...]


[-- Attachment #2: printk-ks2017.pdf --]
[-- Type: application/pdf, Size: 75639 bytes --]


* Re: [PATCH] mm: don't warn about allocations which stall for too long
  2017-10-31 19:32 ` Steven Rostedt
@ 2017-11-01  8:30   ` Vlastimil Babka
  2017-11-01 13:38     ` Petr Mladek
  2017-11-01 15:33     ` Steven Rostedt
  2017-11-02  8:53   ` Sergey Senozhatsky
  2017-11-02 12:55   ` Michal Hocko
  2 siblings, 2 replies; 25+ messages in thread
From: Vlastimil Babka @ 2017-11-01  8:30 UTC (permalink / raw)
  To: Steven Rostedt, Tetsuo Handa
  Cc: akpm, linux-mm, linux-kernel, Cong Wang, Dave Hansen,
	Johannes Weiner, Mel Gorman, Michal Hocko, Petr Mladek,
	Sergey Senozhatsky, yuwang.yuwang

On 10/31/2017 08:32 PM, Steven Rostedt wrote:
> 
> Thank you for the perfect timing. You posted this the day after I
> proposed a new solution at Kernel Summit in Prague for the printk lock
> loop that you experienced here.
> 
> I attached the pdf that I used for that discussion (ignore the last
> slide, it was left over and I never went there).
> 
> My proposal is to do something like this with printk:
> 
> Three types of printk usages:
> 
> 1) Active printer (actively writing to the console).
> 2) Waiter (active printer, first user)
> 3) Sees active printer and a waiter, and just adds to the log buffer
>    and leaves.
> 
> (new globals)
> static DEFINE_SPINLOCK(console_owner_lock);
> static struct task_struct *console_owner;
> static bool waiter;
> 
> console_unlock() {
> 
> [ Assumes this part cannot be preempted ]
> 
> 	spin_lock(console_owner_lock);
> 	console_owner = current;
> 	spin_unlock(console_owner_lock);
> 
> 	for each message
> 		write message out to console
> 
> 		if (READ_ONCE(waiter))
> 			break;

Ah, these two lines clarified for me what I didn't get from your talk,
so I got the wrong impression that the new scheme is just postponing the
problem.

But still, it seems to me that the scheme only works as long as there
are printk()'s coming with some reasonable frequency. There's still a
corner case when a storm of printk()'s can come that will fill the ring
buffers, and while during the storm the printing will be distributed
between CPUs nicely, the last unfortunate CPU after the storm subsides
will be left with a large accumulated buffer to print, and there will be
no waiters to take over if there are no more printk()'s coming. What
then, should it detect such situation and defer the flushing?


* Re: [PATCH] mm: don't warn about allocations which stall for too long
  2017-11-01  8:30   ` Vlastimil Babka
@ 2017-11-01 13:38     ` Petr Mladek
  2017-11-01 15:36       ` Steven Rostedt
  2017-11-01 15:33     ` Steven Rostedt
  1 sibling, 1 reply; 25+ messages in thread
From: Petr Mladek @ 2017-11-01 13:38 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Steven Rostedt, Tetsuo Handa, akpm, linux-mm, linux-kernel,
	Cong Wang, Dave Hansen, Johannes Weiner, Mel Gorman,
	Michal Hocko, Sergey Senozhatsky, yuwang.yuwang

On Wed 2017-11-01 09:30:05, Vlastimil Babka wrote:
> On 10/31/2017 08:32 PM, Steven Rostedt wrote:
> > 
> > Thank you for the perfect timing. You posted this the day after I
> > proposed a new solution at Kernel Summit in Prague for the printk lock
> > loop that you experienced here.
> > 
> > I attached the pdf that I used for that discussion (ignore the last
> > slide, it was left over and I never went there).
> > 
> > My proposal is to do something like this with printk:
> > 
> > Three types of printk usages:
> > 
> > 1) Active printer (actively writing to the console).
> > 2) Waiter (active printer, first user)
> > 3) Sees active printer and a waiter, and just adds to the log buffer
> >    and leaves.
> > 
> > (new globals)
> > static DEFINE_SPINLOCK(console_owner_lock);
> > static struct task_struct *console_owner;
> > static bool waiter;
> > 
> > console_unlock() {
> > 
> > [ Assumes this part cannot be preempted ]
> > 
> > 	spin_lock(console_owner_lock);
> > 	console_owner = current;
> > 	spin_unlock(console_owner_lock);
> > 
> > 	for each message
> > 		write message out to console
> > 
> > 		if (READ_ONCE(waiter))
> > 			break;
> 
> Ah, these two lines clarified for me what I didn't get from your talk,
> so I got the wrong impression that the new scheme is just postponing the
> problem.
> 
> But still, it seems to me that the scheme only works as long as there
> are printk()'s coming with some reasonable frequency. There's still a
> corner case when a storm of printk()'s can come that will fill the ring
> buffers, and while during the storm the printing will be distributed
> between CPUs nicely, the last unfortunate CPU after the storm subsides
> will be left with a large accumulated buffer to print, and there will be
> no waiters to take over if there are no more printk()'s coming. What
> then, should it detect such situation and defer the flushing?

This was my fear as well. Steven argued that this was theoretical.
And I do not have real-life bullets against this argument at
the moment.

My current main worry with Steven's approach is a risk of deadlocks
that Jan Kara saw when he played with a similar solution.

Also I am afraid that it would add yet another twist to the console
locking operations. It is already quite hard to follow the logic,
see the games with:

	+ console_locked
	+ console_suspended
	+ can_use_console()
	+ exclusive_console

And Steven is going to add:

	+ console_owner
	+ waiter

But let's wait for the patch. It might look and work nicely
in the end.

Best Regards,
Petr


* Re: [PATCH] mm: don't warn about allocations which stall for too long
  2017-11-01  8:30   ` Vlastimil Babka
  2017-11-01 13:38     ` Petr Mladek
@ 2017-11-01 15:33     ` Steven Rostedt
  2017-11-01 17:42       ` Vlastimil Babka
  1 sibling, 1 reply; 25+ messages in thread
From: Steven Rostedt @ 2017-11-01 15:33 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Tetsuo Handa, akpm, linux-mm, linux-kernel, Cong Wang,
	Dave Hansen, Johannes Weiner, Mel Gorman, Michal Hocko,
	Petr Mladek, Sergey Senozhatsky, yuwang.yuwang

On Wed, 1 Nov 2017 09:30:05 +0100
Vlastimil Babka <vbabka@suse.cz> wrote:

> 
> But still, it seems to me that the scheme only works as long as there
> are printk()'s coming with some reasonable frequency. There's still a
> corner case when a storm of printk()'s can come that will fill the ring
> buffers, and while during the storm the printing will be distributed
> between CPUs nicely, the last unfortunate CPU after the storm subsides
> will be left with a large accumulated buffer to print, and there will be
> no waiters to take over if there are no more printk()'s coming. What
> then, should it detect such situation and defer the flushing?

No!

If such a case happened, that means the system is doing something
really stupid.

Btw, each printk that takes over does one message, so the last one to
take over shouldn't have a full buffer anyway.

But still, if you have such a hypothetical situation, the system should
just crash. The printk is still bounded by the length of the buffer.
Although it is slow, it will finish. Which is not the case with the
current situation. And the current situation (as this patch
demonstrates) does happen today and is not hypothetical.

-- Steve


* Re: [PATCH] mm: don't warn about allocations which stall for too long
  2017-11-01 13:38     ` Petr Mladek
@ 2017-11-01 15:36       ` Steven Rostedt
  2017-11-02 11:46         ` Petr Mladek
  0 siblings, 1 reply; 25+ messages in thread
From: Steven Rostedt @ 2017-11-01 15:36 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Vlastimil Babka, Tetsuo Handa, akpm, linux-mm, linux-kernel,
	Cong Wang, Dave Hansen, Johannes Weiner, Mel Gorman,
	Michal Hocko, Sergey Senozhatsky, yuwang.yuwang

On Wed, 1 Nov 2017 14:38:45 +0100
Petr Mladek <pmladek@suse.com> wrote:

> This was my fear as well. Steven argued that this was theoretical.
> And I do not have real-life bullets against this argument at
> the moment.

And my argument is still that if such a situation happens, the system is
so fscked up that it should just crash.

> 
> My current main worry with Steven's approach is a risk of deadlocks
> > that Jan Kara saw when he played with a similar solution.

And if there exists such a deadlock, then the deadlock exists today.

> 
> Also I am afraid that it would add yet another twist to the console
> locking operations. It is already quite hard to follow the logic,
> see the games with:
> 
> 	+ console_locked
> 	+ console_suspended
> 	+ can_use_console()
> 	+ exclusive_console
> 
> And Steven is going to add:
> 
> 	+ console_owner
> 	+ waiter

Agreed. Console_lock is just ugly. And this may just make it uglier :-/

> 
> But let's wait for the patch. It might look and work nicely
> in the end.

Oh, I need to write a patch? Bah, I guess I should. Where's all those
developers dying to do kernel programming where I can pass this off to?

-- Steve


* Re: [PATCH] mm: don't warn about allocations which stall for too long
  2017-11-01 15:33     ` Steven Rostedt
@ 2017-11-01 17:42       ` Vlastimil Babka
  2017-11-01 17:54         ` Steven Rostedt
  0 siblings, 1 reply; 25+ messages in thread
From: Vlastimil Babka @ 2017-11-01 17:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Tetsuo Handa, akpm, linux-mm, linux-kernel, Cong Wang,
	Dave Hansen, Johannes Weiner, Mel Gorman, Michal Hocko,
	Petr Mladek, Sergey Senozhatsky, yuwang.yuwang

On 11/01/2017 04:33 PM, Steven Rostedt wrote:
> On Wed, 1 Nov 2017 09:30:05 +0100
> Vlastimil Babka <vbabka@suse.cz> wrote:
> 
>>
>> But still, it seems to me that the scheme only works as long as there
>> are printk()'s coming with some reasonable frequency. There's still a
>> corner case when a storm of printk()'s can come that will fill the ring
>> buffers, and while during the storm the printing will be distributed
>> between CPUs nicely, the last unfortunate CPU after the storm subsides
>> will be left with a large accumulated buffer to print, and there will be
>> no waiters to take over if there are no more printk()'s coming. What
>> then, should it detect such situation and defer the flushing?
> 
> No!
> 
> If such a case happened, that means the system is doing something
> really stupid.

Hm, what about e.g. a soft lockup that triggers backtraces from all
CPUs? Yes, having softlockups is "stupid" but sometimes they do happen
and the system still recovers (just some looping operation is missing
cond_resched() and took longer than expected). It would be sad if it
didn't recover because of a printk() issue...

> Btw, each printk that takes over, does one message, so the last one to
> take over, shouldn't have a full buffer anyway.

There might be multiple messages per each CPU, e.g. the softlockup
backtraces.

> But still, if you have such a hypothetical situation, the system should
> just crash. The printk is still bounded by the length of the buffer.
> Although it is slow, it will finish.

Finish, but with a single CPU doing the printing, which is wrong?

> Which is not the case with the
> current situation. And the current situation (as this patch
> demonstrates) does happen today and is not hypothetical.

Yep, so ideally it can be fixed without corner cases :)

Vlastimil

> -- Steve
> 


* Re: [PATCH] mm: don't warn about allocations which stall for too long
  2017-11-01 17:42       ` Vlastimil Babka
@ 2017-11-01 17:54         ` Steven Rostedt
  0 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2017-11-01 17:54 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Tetsuo Handa, akpm, linux-mm, linux-kernel, Cong Wang,
	Dave Hansen, Johannes Weiner, Mel Gorman, Michal Hocko,
	Petr Mladek, Sergey Senozhatsky, yuwang.yuwang

On Wed, 1 Nov 2017 18:42:25 +0100
Vlastimil Babka <vbabka@suse.cz> wrote:

> On 11/01/2017 04:33 PM, Steven Rostedt wrote:
> > On Wed, 1 Nov 2017 09:30:05 +0100
> > Vlastimil Babka <vbabka@suse.cz> wrote:
> >   
> >>
> >> But still, it seems to me that the scheme only works as long as there
> >> are printk()'s coming with some reasonable frequency. There's still a
> >> corner case when a storm of printk()'s can come that will fill the ring
> >> buffers, and while during the storm the printing will be distributed
> >> between CPUs nicely, the last unfortunate CPU after the storm subsides
> >> will be left with a large accumulated buffer to print, and there will be
> >> no waiters to take over if there are no more printk()'s coming. What
> >> then, should it detect such situation and defer the flushing?  
> > 
> > No!
> > 
> > If such a case happened, that means the system is doing something
> > really stupid.  
> 
> Hm, what about e.g. a soft lockup that triggers backtraces from all
> CPUs? Yes, having softlockups is "stupid" but sometimes they do happen
> and the system still recovers (just some looping operation is missing
> cond_resched() and took longer than expected). It would be sad if it
> didn't recover because of a printk() issue...

I still think such a case would not be huge for the last printer.

> 
> > Btw, each printk that takes over, does one message, so the last one to
> > take over, shouldn't have a full buffer anyway.  
> 
> There might be multiple messages per each CPU, e.g. the softlockup
> backtraces.

And each one does multiple printks, still spreading the love around.

> 
> > But still, if you have such a hypothetical situation, the system should
> > just crash. The printk is still bounded by the length of the buffer.
> > Although it is slow, it will finish.  
> 
> Finish, but with a single CPU doing the printing, which is wrong?

I don't think so. This is all hypothetical anyway. I need to implement
my solution, and then lets see if this can actually happen.

> 
> > Which is not the case with the
> > current situation. And the current situation (as which this patch
> > demonstrates) does happen today and is not hypothetical.  
> 
> Yep, so ideally it can be fixed without corner cases :)

If there are any corner cases. I guess the test would be to trigger a
soft lockup on all CPUs to print out a dump at the same time. But then
again, how is a soft lockup on all CPUs not worse than a single CPU
finishing up the buffer output?

-- Steve


* Re: [PATCH] mm: don't warn about allocations which stall for too long
  2017-10-31 19:32 ` Steven Rostedt
  2017-11-01  8:30   ` Vlastimil Babka
@ 2017-11-02  8:53   ` Sergey Senozhatsky
  2017-11-02  9:14     ` Sergey Senozhatsky
  2017-11-02 14:55     ` Steven Rostedt
  2017-11-02 12:55   ` Michal Hocko
  2 siblings, 2 replies; 25+ messages in thread
From: Sergey Senozhatsky @ 2017-11-02  8:53 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Tetsuo Handa, akpm, linux-mm, linux-kernel, Cong Wang,
	Dave Hansen, Johannes Weiner, Mel Gorman, Michal Hocko,
	Petr Mladek, Sergey Senozhatsky, Vlastimil Babka, yuwang.yuwang

On (10/31/17 15:32), Steven Rostedt wrote:
[..]
> (new globals)
> static DEFINE_SPINLOCK(console_owner_lock);
> static struct task_struct *console_owner;
> static bool waiter;
> 
> console_unlock() {
> 
> [ Assumes this part cannot be preempted ]
>
> 	spin_lock(console_owner_lock);
> 	console_owner = current;
> 	spin_unlock(console_owner_lock);

 + disables IRQs?

> 	for each message
> 		write message out to console
> 
> 		if (READ_ONCE(waiter))
> 			break;
> 
> 	spin_lock(console_owner_lock);
> 	console_owner = NULL;
> 	spin_unlock(console_owner_lock);
> 
> [ preemption possible ]

otherwise

     printk()
      if (console_trylock())
        console_unlock()
         preempt_disable()
          spin_lock(console_owner_lock);
          console_owner = current;
          spin_unlock(console_owner_lock);
          .......
          spin_lock(console_owner_lock);
IRQ
    printk()
     console_trylock() // fails so we go to busy-loop part
      spin_lock(console_owner_lock);       << deadlock


even if we would replace spin_lock(console_owner_lock) with IRQ
spin_lock, we still would need to protect against IRQs on the very
same CPU. right? IOW, we need to store smp_processor_id() of a CPU
currently doing console_unlock() and check it in vprintk_emit()?
and we need to protect the entire console_unlock() function. not
just the printing loop, otherwise the IRQ CPU will spin forever
waiting for itself to up() the console_sem.
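
e.g. (a sketch; whether this covers all cases is exactly the question):

	unsigned long flags;

	spin_lock_irqsave(&console_owner_lock, flags);
	console_owner = current;
	spin_unlock_irqrestore(&console_owner_lock, flags);

and in vprintk_emit(), before the busy-loop part:

	/*
	 * in IRQ context "current" is the interrupted task, so this
	 * also catches "we interrupted our own printing loop on this
	 * CPU". the message is already in logbuf, so just leave.
	 */
	if (READ_ONCE(console_owner) == current)
		return;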

this somehow reminds me of "static unsigned int logbuf_cpu", which
we used to have in vprintk_emit() and were happy to remove it...


the whole "console_unlock() is non-preemptible" can bite, I'm
afraid. it's not always printk()->console_unlock(), sometimes
it's console_lock()->console_unlock() that has to flush the
logbuf.

CPU0					CPU1  ~  CPU99
console_lock();
					printk(); ... printk();
console_unlock()
 preempt_disable();
  for (;;)
    call_console_drivers();
    <<lockup>>


this pattern is not so unusual. _especially_ in the existing scheme
of things.

not to mention the problem of "the last printk()", which will take
over and do the flush.

CPU0					CPU1  ~  CPU99
console_lock();
					printk(); ... printk();
console_unlock();
					    IRQ on CPU2
					     printk()
					      // take over console_sem
					      console_unlock()

and so on.
seems that there will be lots of if-s.

	-ss


* Re: [PATCH] mm: don't warn about allocations which stall for too long
  2017-11-02  8:53   ` Sergey Senozhatsky
@ 2017-11-02  9:14     ` Sergey Senozhatsky
  2017-11-02 14:55     ` Steven Rostedt
  1 sibling, 0 replies; 25+ messages in thread
From: Sergey Senozhatsky @ 2017-11-02  9:14 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Steven Rostedt, Tetsuo Handa, akpm, linux-mm, linux-kernel,
	Cong Wang, Dave Hansen, Johannes Weiner, Mel Gorman,
	Michal Hocko, Petr Mladek, Sergey Senozhatsky, Vlastimil Babka,
	yuwang.yuwang

On (11/02/17 17:53), Sergey Senozhatsky wrote:
> On (10/31/17 15:32), Steven Rostedt wrote:
> [..]
> > (new globals)
> > static DEFINE_SPINLOCK(console_owner_lock);
> > static struct task_struct *console_owner;
> > static bool waiter;
> > 
> > console_unlock() {
> > 
> > [ Assumes this part cannot be preempted ]
> >
> > 	spin_lock(console_owner_lock);
> > 	console_owner = current;
> > 	spin_unlock(console_owner_lock);
> 
>  + disables IRQs?
> 
> > 	for each message
> > 		write message out to console
> > 
> > 		if (READ_ONCE(waiter))
> > 			break;
> > 
> > 	spin_lock(console_owner_lock);
> > 	console_owner = NULL;
> > 	spin_unlock(console_owner_lock);
> > 
> > [ preemption possible ]
> 
> otherwise
> 
>      printk()
>       if (console_trylock())
>         console_unlock()
>          preempt_disable()
>           spin_lock(console_owner_lock);
>           console_owner = current;
>           spin_unlock(console_owner_lock);
>           .......
>           spin_lock(console_owner_lock);
> IRQ
>     printk()
>      console_trylock() // fails so we go to busy-loop part
>       spin_lock(console_owner_lock);       << deadlock
> 
> 
> even if we would replace spin_lock(console_owner_lock) with IRQ
> spin_lock, we still would need to protect against IRQs on the very
> same CPU. right? IOW, we need to store smp_processor_id() of a CPU
> currently doing console_unlock() and check it in vprintk_emit()?


a major self-correction:

> and we need to protect the entire console_unlock() function. not
> just the printing loop, otherwise the IRQ CPU will spin forever
> waiting for itself to up() the console_sem.

this part is wrong. should have been
	"we need to protect the entire printing loop"


so now console_unlock()'s printing loop is going to run

a) under preempt_disable()
b) under local_irq_save()

which is risky.

	-ss


* Re: [PATCH] mm: don't warn about allocations which stall for too long
  2017-11-01 15:36       ` Steven Rostedt
@ 2017-11-02 11:46         ` Petr Mladek
  2017-11-02 14:49           ` Steven Rostedt
  0 siblings, 1 reply; 25+ messages in thread
From: Petr Mladek @ 2017-11-02 11:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Vlastimil Babka, Tetsuo Handa, akpm, linux-mm, linux-kernel,
	Cong Wang, Dave Hansen, Johannes Weiner, Mel Gorman,
	Michal Hocko, Sergey Senozhatsky, yuwang.yuwang

On Wed 2017-11-01 11:36:47, Steven Rostedt wrote:
> On Wed, 1 Nov 2017 14:38:45 +0100
> Petr Mladek <pmladek@suse.com> wrote:
> > My current main worry with Steven's approach is a risk of deadlocks
> > that Jan Kara saw when he played with similar solution.
> 
> And if there exists such a deadlock, then the deadlock exists today.

The patch is going to effectively change console_trylock() to
console_lock() and this might add problems.

The most simple example is:

       console_lock()
         printk()
	    console_trylock() was SAFE.

       console_lock()
         printk()
	   console_lock() cause DEADLOCK!

Sure, we could detect this and avoid waiting when
console_owner == current. But does this cover all
situations? What about?

CPU0			CPU1

console_lock()          func()
  console->write()        take_lockA()
    func()		    printk()
			      busy wait for console_lock()

      take_lockA()

In other words, it used to be safe to call printk() from
console->write() functions because printk() used console_trylock().
Your patch is going to change this. It is even worse because
you probably will not use console_lock() directly and therefore
this might be hidden from lockdep.

BTW: I am still not sure how to make the busy waiter preferred
over console_lock() callers. I mean that the busy waiter has
to get console_sem even if there are some tasks in the wait queue.


> > But let's wait for the patch. It might look and work nicely
> > in the end.
> 
> Oh, I need to write a patch? Bah, I guess I should. Where's all those
> > developers dying to do kernel programming where I can pass this off to?

Yes, where are those days when my primary task was to learn kernel
hacking? This would have been great training material.

I still have to invest time into fixing printk. But I personally
think that lazy offloading to kthreads is the more promising
way to go. It is pretty straightforward. The only problem is
the guarantee of the takeover. But there must be a reasonable
way to detect that the system's heart is still beating
and we are not the only working CPU.

Best Regards,
Petr


* Re: [PATCH] mm: don't warn about allocations which stall for too long
  2017-10-31 19:32 ` Steven Rostedt
  2017-11-01  8:30   ` Vlastimil Babka
  2017-11-02  8:53   ` Sergey Senozhatsky
@ 2017-11-02 12:55   ` Michal Hocko
  2 siblings, 0 replies; 25+ messages in thread
From: Michal Hocko @ 2017-11-02 12:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Tetsuo Handa, akpm, linux-mm, linux-kernel, Cong Wang,
	Dave Hansen, Johannes Weiner, Mel Gorman, Petr Mladek,
	Sergey Senozhatsky, Vlastimil Babka, yuwang.yuwang

On Tue 31-10-17 15:32:25, Steven Rostedt wrote:
> 
> [...]
> 
> This way, only one CPU spins waiting to print, and only if the
> console_lock owner is actively printing. If the console_lock owner
> notices someone is waiting to print, it stops printing as a waiter will
> always continue the prints. This will balance out the printks among all
> the CPUs that are doing them and no one CPU will get stuck doing all
> the printks.
> 
> This would solve your issue because the next warn_alloc() caller would
> become the waiter, and take over the next message in the queue. This
> would spread out the load of who does the actual printing, and not have
> one printer be stuck doing the work.

That also means that we would shift the overhead only to the first
waiter AFAIU. What if we have floods of warn_alloc from all CPUs?
Something that Tetsuo's test case simulates.

As Petr pointed out earlier in the thread, I do not think this is going
to help considerably, and offloading to a kernel thread sounds like a
more viable option. It sounds really wrong to have printk basically
non-deterministic wrt. call duration depending on who happens to do the
actual work. Either we make the call sync, or we offload it completely
to a dedicated kernel thread and make sure that the buffer gets flushed
unconditionally on panic. I haven't been following all the printk
discussion recently, so maybe this has been discussed and deemed not
viable due to implementation details, but in principle this should work.
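
A minimal sketch of that offloading direction (hypothetical names; the
hard parts, i.e. the guaranteed takeover and the unconditional flush on
panic, are not shown):

static DECLARE_WAIT_QUEUE_HEAD(printk_offload_wait);

static int printk_offload_fn(void *unused)
{
	while (!kthread_should_stop()) {
		/* logbuf_has_pending() is a hypothetical helper. */
		wait_event_interruptible(printk_offload_wait,
					 logbuf_has_pending());
		if (console_trylock())
			console_unlock();	/* all flushing happens here */
	}
	return 0;
}

/* vprintk_emit() would then only store the message and do: */
	wake_up(&printk_offload_wait);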
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: don't warn about allocations which stall for too long
  2017-11-02 11:46         ` Petr Mladek
@ 2017-11-02 14:49           ` Steven Rostedt
  0 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2017-11-02 14:49 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Vlastimil Babka, Tetsuo Handa, akpm, linux-mm, linux-kernel,
	Cong Wang, Dave Hansen, Johannes Weiner, Mel Gorman,
	Michal Hocko, Sergey Senozhatsky, yuwang.yuwang

On Thu, 2 Nov 2017 12:46:50 +0100
Petr Mladek <pmladek@suse.com> wrote:

> On Wed 2017-11-01 11:36:47, Steven Rostedt wrote:
> > On Wed, 1 Nov 2017 14:38:45 +0100
> > Petr Mladek <pmladek@suse.com> wrote:  
> > > My current main worry with Steven's approach is a risk of deadlocks
> > > that Jan Kara saw when he played with a similar solution.
> > 
> > And if there exists such a deadlock, then the deadlock exists today.  
> 
> The patch is going to effectively change console_trylock() to
> console_lock() and this might add problems.
> 
> The most simple example is:
> 
>        console_lock()
>          printk()
> 	    console_trylock() was SAFE.
> 
>        console_lock()
>          printk()
> 	   console_lock() cause DEADLOCK!
> 
> Sure, we could detect this and avoid waiting when
> console_owner == current. But does this cover all

Which I will do.

> situations? What about?
> 
> CPU0			CPU1
> 
> console_lock()          func()
>   console->write()        take_lockA()
>     func()		    printk()
> 			      busy wait for console_lock()
> 
>       take_lockA()

How does this not deadlock without my changes?

 func()
   take_lockA()
     printk()
       console_lock()
         console->write()
             func()
                take_lockA()

DEADLOCK!


> 
> In other words, it used to be safe to call printk() from
> console->write() functions because printk() used console_trylock().

I still don't see how this can be safe now.

> Your patch is going to change this. It is even worse because
> you probably will not use console_lock() directly and therefore
> this might be hidden from lockdep.

And no, my patch adds lockdep annotation for the spinner. And if I get
that wrong, I'm sure Peter Zijlstra will help.

> 
> BTW: I am still not sure how to make the busy waiter preferred
> over console_lock() callers. I mean that the busy waiter has
> to get console_sem even if there are some tasks in the wait queue.

I started struggling with this, then realized that console_sem is just
that: a semaphore. Which doesn't have a concept of ownership. I can
simply hand off the semaphore without ever letting it go. My RFC patch
is almost done, you'll see it soon.
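
Roughly like this at the end of the printing loop (just a sketch, not
the RFC):

	if (READ_ONCE(waiter)) {
		/*
		 * Hand the lock to the spinning waiter without touching
		 * the semaphore count, so sleepers in down() cannot race
		 * with the hand-off. The waiter side then spins on the
		 * waiter flag instead of on console_trylock().
		 */
		WRITE_ONCE(console_owner, NULL);
		WRITE_ONCE(waiter, false);	/* "you own console_sem now" */
		return;				/* skip up(console_sem) */
	}
	up(console_sem);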

> 
> 
> > > But let's wait for the patch. It might look and work nicely
> > > in the end.  
> > 
> > Oh, I need to write a patch? Bah, I guess I should. Where's all those
> > developers dying to do kernel programing where I can pass this off to?  
> 
> Yes, where are those days when my primary task was to learn kernel
> hacking? This would have been great training material.

:)

> 
> I still have to invest time into fixing printk. But I personally
> think that lazy offloading to kthreads is the more promising
> way to go. It is pretty straightforward. The only problem is
> the guarantee of the takeover. But there must be a reasonable
> way to detect that the system's heart is still beating
> and we are not the only working CPU.

My patch isn't that big. Let's talk more after I post it.

-- Steve


* Re: [PATCH] mm: don't warn about allocations which stall for too long
  2017-11-02  8:53   ` Sergey Senozhatsky
  2017-11-02  9:14     ` Sergey Senozhatsky
@ 2017-11-02 14:55     ` Steven Rostedt
  1 sibling, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2017-11-02 14:55 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Tetsuo Handa, akpm, linux-mm, linux-kernel, Cong Wang,
	Dave Hansen, Johannes Weiner, Mel Gorman, Michal Hocko,
	Petr Mladek, Sergey Senozhatsky, Vlastimil Babka, yuwang.yuwang

On Thu, 2 Nov 2017 17:53:13 +0900
Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> wrote:

> On (10/31/17 15:32), Steven Rostedt wrote:
> [..]
> > (new globals)
> > static DEFINE_SPINLOCK(console_owner_lock);
> > static struct task_struct *console_owner;
> > static bool waiter;
> > 
> > console_unlock() {
> > 
> > [ Assumes this part can not be preempted ]
> >
> > 	spin_lock(console_owner_lock);
> > 	console_owner = current;
> > 	spin_unlock(console_owner_lock);  
> 
>  + disables IRQs?

Yes, this was pseudo code, just to get an idea out. I'll have a patch
soon that will include all the nasty details.

> 
> > 	for each message
> > 		write message out to console
> > 
> > 		if (READ_ONCE(waiter))
> > 			break;
> > 
> > 	spin_lock(console_owner_lock);
> > 	console_owner = NULL;
> > 	spin_unlock(console_owner_lock);
> > 
> > [ preemption possible ]  
> 
> otherwise
> 
>      printk()
>       if (console_trylock())
>         console_unlock()
>          preempt_disable()
>           spin_lock(console_owner_lock);
>           console_owner = current;
>           spin_unlock(console_owner_lock);
>           .......
>           spin_lock(console_owner_lock);
> IRQ
>     printk()
>      console_trylock() // fails so we go to busy-loop part
>       spin_lock(console_owner_lock);       << deadlock

Yeah, I do disable interrupts. The pseudo code was just a way to
quickly convey the idea. I said "spin_lock" where I could have just
said "lock".
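
Spelled out with the IRQ handling, the owner bookkeeping looks roughly
like this (a sketch of the intent; the real patch wraps it in
printk_safe_enter_irqsave()):

	unsigned long flags;

	printk_safe_enter_irqsave(flags);	/* local IRQs off */
	raw_spin_lock(&console_owner_lock);
	console_owner = current;
	raw_spin_unlock(&console_owner_lock);

	/*
	 * An IRQ arriving on this CPU can no longer spin on
	 * console_owner_lock while we hold it (the scenario above).
	 */

	/* ... write one message out to the consoles ... */

	raw_spin_lock(&console_owner_lock);
	console_owner = NULL;
	raw_spin_unlock(&console_owner_lock);
	printk_safe_exit_irqrestore(flags);	/* IRQs back on */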

> 
> 
> even if we replaced spin_lock(console_owner_lock) with an IRQ-safe
> spin_lock, we would still need to protect against IRQs on the very
> same CPU, right? IOW, we need to store the smp_processor_id() of the CPU
> currently doing console_unlock() and check it in vprintk_emit()?
> and we need to protect the entire console_unlock() function. not
> just the printing loop, otherwise the IRQ CPU will spin forever
> waiting for itself to up() the console_sem.

Yes, and it will.

> 
> this somehow reminds me of "static unsigned int logbuf_cpu", which
> we used to have in vprintk_emit() and were happy to remove...
> 
> 
> the whole "console_unlock() is non-preemptible" thing can bite, I'm
> afraid. it's not always printk()->console_unlock(), sometimes
> it's console_lock()->console_unlock() that has to flush the
> logbuf.
> 
> CPU0					CPU1  ~  CPU99
> console_lock();
> 					printk(); ... printk();
> console_unlock()
>  preempt_disable();
>   for (;;)
>     call_console_drivers();
>     <<lockup>>
> 
> 
> this pattern is not so unusual. _especially_ in the existing scheme
> of things.
> 
> not to mention the problem of "the last printk()", which will take
> over and do the flush.
> 
> CPU0					CPU1  ~  CPU99
> console_lock();
> 					printk(); ... printk();
> console_unlock();
> 					    IRQ on CPU2
> 					     printk()
> 					      // take over console_sem
> 					      console_unlock()
> 
> and so on.
> seems that there will be lots of if-s.


Let's wait for the patch and talk more after I post it.

-- Steve


* Re: [PATCH] mm: don't warn about allocations which stall for too long
  2017-10-26 11:28 [PATCH] mm: don't warn about allocations which stall for too long Tetsuo Handa
                   ` (2 preceding siblings ...)
  2017-10-31 19:32 ` Steven Rostedt
@ 2017-11-02 15:56 ` Steven Rostedt
  2017-11-02 17:06   ` [PATCH v2] printk: Add console owner and waiter logic to load balance console writes Steven Rostedt
  3 siblings, 1 reply; 25+ messages in thread
From: Steven Rostedt @ 2017-11-02 15:56 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, linux-kernel, Cong Wang, Dave Hansen,
	Johannes Weiner, Mel Gorman, Michal Hocko, Petr Mladek,
	Sergey Senozhatsky, Vlastimil Babka, yuwang.yuwang,
	Peter Zijlstra, Linus Torvalds, Jan Kara


Hi Tetsuo,

Can you see if this patch helps your situation?

OK, for the rest of you: let's have the showdown ;-)

This patch implements what I discussed in Kernel Summit. I added
lockdep annotation (hopefully correctly), and it hasn't had any splats
(since I fixed some bugs in the first iterations). It did catch
problems when I had the owner covering too much. But now that the owner
is only set when actively calling the consoles, lockdep has stayed
quiet.


Here's the design again:

I added a "console_owner" which is set to a task that is actively
writing to the consoles. It is *not* the same as the owner of the
console_lock. It is only set when doing the calls to the console
functions. It is protected by a console_owner_lock which is a raw spin
lock.

There is a console_waiter. This is set when there is an active console
owner that is not current, and waiter is not set. This too is protected
by console_owner_lock.

In printk() when it tries to write to the consoles, we have:

	if (console_trylock())
		console_unlock();

Now I added an else, which will check if there is an active owner, and
no current waiter. If that is the case, then console_waiter is set, and
the task goes into a spin until it is no longer set.

When the active console owner finishes writing the current message to
the consoles, it grabs the console_owner_lock and sees if there is a
waiter, and clears console_owner.

If there is a waiter, then it breaks out of the loop, clears the waiter
flag (because that will release the waiter from its spin), and exits.
Note, it does *not* release the console semaphore. Because it is a
semaphore, there is no owner. Another task may release it. This means
that the waiter is guaranteed to be the new console owner! Which it
becomes.

Then the waiter calls console_unlock() and continues to write to the
consoles.

If another task comes along and does a printk(), it too can become the
new waiter, and we wash, rinse, and repeat!

OK, have at it :-)

-- Steve

Not-yet-signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
Index: linux-trace.git/kernel/printk/printk.c
===================================================================
--- linux-trace.git.orig/kernel/printk/printk.c
+++ linux-trace.git/kernel/printk/printk.c
@@ -86,8 +86,15 @@ EXPORT_SYMBOL_GPL(console_drivers);
 static struct lockdep_map console_lock_dep_map = {
 	.name = "console_lock"
 };
+static struct lockdep_map console_owner_dep_map = {
+	.name = "console_owner"
+};
 #endif
 
+static DEFINE_RAW_SPINLOCK(console_owner_lock);
+static struct task_struct *console_owner;
+static bool console_waiter;
+
 enum devkmsg_log_bits {
 	__DEVKMSG_LOG_BIT_ON = 0,
 	__DEVKMSG_LOG_BIT_OFF,
@@ -1753,8 +1760,56 @@ asmlinkage int vprintk_emit(int facility
 		 * semaphore.  The release will print out buffers and wake up
 		 * /dev/kmsg and syslog() users.
 		 */
-		if (console_trylock())
+		if (console_trylock()) {
 			console_unlock();
+		} else {
+			struct task_struct *owner = NULL;
+			bool waiter;
+			bool spin = false;
+
+			printk_safe_enter_irqsave(flags);
+
+			raw_spin_lock(&console_owner_lock);
+			owner = READ_ONCE(console_owner);
+			waiter = console_waiter;
+			if (!waiter && owner && owner != current) {
+				console_waiter = true;
+				spin = true;
+			}
+			raw_spin_unlock(&console_owner_lock);
+
+			/*
+			 * If there is an active printk() writing to the
+			 * consoles, instead of having it write our data too,
+			 * see if we can offload that load from the active
+			 * printer, and do some printing ourselves.
+			 * Go into a spin only if there isn't already a waiter
+			 * spinning, and there is an active printer, and
+			 * that active printer isn't us (recursive printk?).
+			 */
+			if (spin) {
+				/* We spin waiting for the owner to release us */
+				mutex_acquire(&console_owner_dep_map, 0, 0, _THIS_IP_);
+				/* Owner will clear console_waiter on hand off */
+				while (READ_ONCE(console_waiter))
+					cpu_relax();
+
+				mutex_release(&console_owner_dep_map, 1, _THIS_IP_);
+				printk_safe_exit_irqrestore(flags);
+
+				/*
+				 * The owner passed the console lock to us.
+				 * Since we did not spin on console lock, annotate
+				 * this as a trylock. Otherwise lockdep will
+				 * complain.
+				 */
+				mutex_acquire(&console_lock_dep_map, 0, 1, _THIS_IP_);
+				console_unlock();
+				printk_safe_enter_irqsave(flags);
+			}
+			printk_safe_exit_irqrestore(flags);
+
+		}
 	}
 
 	return printed_len;
@@ -2141,6 +2196,7 @@ void console_unlock(void)
 	static u64 seen_seq;
 	unsigned long flags;
 	bool wake_klogd = false;
+	bool waiter = false;
 	bool do_cond_resched, retry;
 
 	if (console_suspended) {
@@ -2215,6 +2271,20 @@ skip:
 			goto skip;
 		}
 
+		/*
+		 * While actively printing out messages, if another printk()
+		 * were to occur on another CPU, it may wait for this one to
+		 * finish. This task can not be preempted if there is a
+		 * waiter waiting to take over.
+		 */
+
+		/* The waiter may spin on us after this */
+		mutex_acquire(&console_owner_dep_map, 0, 0, _THIS_IP_);
+
+		raw_spin_lock(&console_owner_lock);
+		console_owner = current;
+		raw_spin_unlock(&console_owner_lock);
+
 		len += msg_print_text(msg, false, text + len, sizeof(text) - len);
 		if (nr_ext_console_drivers) {
 			ext_len = msg_print_ext_header(ext_text,
@@ -2232,11 +2302,48 @@ skip:
 		stop_critical_timings();	/* don't trace print latency */
 		call_console_drivers(ext_text, ext_len, text, len);
 		start_critical_timings();
+
+		raw_spin_lock(&console_owner_lock);
+		waiter = READ_ONCE(console_waiter);
+		console_owner = NULL;
+		raw_spin_unlock(&console_owner_lock);
+
+		/*
+		 * If there is a waiter waiting for us, then pass the
+		 * rest of the work load over to that waiter.
+		 */
+		if (waiter)
+			break;
+
+		/* There was no waiter, and nothing will spin on us here */
+		mutex_release(&console_owner_dep_map, 1, _THIS_IP_);
+
 		printk_safe_exit_irqrestore(flags);
 
 		if (do_cond_resched)
 			cond_resched();
 	}
+
+	/*
+	 * If there is an active waiter waiting on the console_lock.
+	 * Pass off the printing to the waiter, and the waiter
+	 * will continue printing on its CPU, and when all writing
+	 * has finished, the last printer will wake up klogd.
+	 */
+	if (waiter) {
+		console_waiter = false;
+		/* The waiter is now free to continue */
+		mutex_release(&console_owner_dep_map, 1, _THIS_IP_);
+		/*
+		 * Hand off console_lock to waiter. The waiter will perform
+		 * the up(). After this, the waiter is the console_lock owner.
+		 */
+		mutex_release(&console_lock_dep_map, 1, _THIS_IP_);
+		printk_safe_exit_irqrestore(flags);
+		/* Note, if waiter is set, logbuf_lock is not held */
+		return;
+	}
+
 	console_locked = 0;
 
 	/* Release the exclusive_console once it is used */


* [PATCH v2] printk: Add console owner and waiter logic to load balance console writes
  2017-11-02 15:56 ` Steven Rostedt
@ 2017-11-02 17:06   ` Steven Rostedt
  2017-11-02 17:10     ` Steven Rostedt
                       ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Steven Rostedt @ 2017-11-02 17:06 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, linux-kernel, Cong Wang, Dave Hansen,
	Johannes Weiner, Mel Gorman, Michal Hocko, Petr Mladek,
	Sergey Senozhatsky, Vlastimil Babka, yuwang.yuwang,
	Peter Zijlstra, Linus Torvalds, Jan Kara, Mathieu Desnoyers

From: Steven Rostedt (VMware) <rostedt@goodmis.org>

This patch implements what I discussed in Kernel Summit. I added
lockdep annotation (hopefully correctly), and it hasn't had any splats
(since I fixed some bugs in the first iterations). It did catch
problems when I had the owner covering too much. But now that the owner
is only set when actively calling the consoles, lockdep has stayed
quiet.
 
Here's the design again:

I added a "console_owner" which is set to a task that is actively
writing to the consoles. It is *not* the same as the owner of the
console_lock. It is only set when doing the calls to the console
functions. It is protected by a console_owner_lock which is a raw spin
lock.

There is a console_waiter. This is set when there is an active console
owner that is not current, and waiter is not set. This too is protected
by console_owner_lock.

In printk() when it tries to write to the consoles, we have:

	if (console_trylock())
		console_unlock();

Now I added an else, which will check if there is an active owner, and
no current waiter. If that is the case, then console_waiter is set, and
the task goes into a spin until it is no longer set.

When the active console owner finishes writing the current message to
the consoles, it grabs the console_owner_lock and sees if there is a
waiter, and clears console_owner.

If there is a waiter, then it breaks out of the loop, clears the waiter
flag (because that will release the waiter from its spin), and exits.
Note, it does *not* release the console semaphore. Because it is a
semaphore, there is no owner. Another task may release it. This means
that the waiter is guaranteed to be the new console owner! Which it
becomes.

Then the waiter calls console_unlock() and continues to write to the
consoles.

If another task comes along and does a printk(), it too can become the
new waiter, and we wash, rinse, and repeat!

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
Changes from v1:

  Use "spin_acquire" instead of "mutex_acquire" for the
  console_owner_dep_map annotation. It may make a difference in the future,
  and it does act like a spinlock and not a mutex (thanks to Peter
  Zijlstra).

  Remove all READ/WRITE_ONCE() from inside the spin lock protected
  areas, as they are unnecessary.

  Add a WRITE_ONCE() for clearing console_owner, because that's done
  outside the spin lock protection.

Index: linux-trace.git/kernel/printk/printk.c
===================================================================
--- linux-trace.git.orig/kernel/printk/printk.c
+++ linux-trace.git/kernel/printk/printk.c
@@ -86,8 +86,15 @@ EXPORT_SYMBOL_GPL(console_drivers);
 static struct lockdep_map console_lock_dep_map = {
 	.name = "console_lock"
 };
+static struct lockdep_map console_owner_dep_map = {
+	.name = "console_owner"
+};
 #endif
 
+static DEFINE_RAW_SPINLOCK(console_owner_lock);
+static struct task_struct *console_owner;
+static bool console_waiter;
+
 enum devkmsg_log_bits {
 	__DEVKMSG_LOG_BIT_ON = 0,
 	__DEVKMSG_LOG_BIT_OFF,
@@ -1753,8 +1760,56 @@ asmlinkage int vprintk_emit(int facility
 		 * semaphore.  The release will print out buffers and wake up
 		 * /dev/kmsg and syslog() users.
 		 */
-		if (console_trylock())
+		if (console_trylock()) {
 			console_unlock();
+		} else {
+			struct task_struct *owner = NULL;
+			bool waiter;
+			bool spin = false;
+
+			printk_safe_enter_irqsave(flags);
+
+			raw_spin_lock(&console_owner_lock);
+			owner = console_owner;
+			waiter = console_waiter;
+			if (!waiter && owner && owner != current) {
+				console_waiter = true;
+				spin = true;
+			}
+			raw_spin_unlock(&console_owner_lock);
+
+			/*
+			 * If there is an active printk() writing to the
+			 * consoles, instead of having it write our data too,
+			 * see if we can offload that load from the active
+			 * printer, and do some printing ourselves.
+			 * Go into a spin only if there isn't already a waiter
+			 * spinning, and there is an active printer, and
+			 * that active printer isn't us (recursive printk?).
+			 */
+			if (spin) {
+				/* We spin waiting for the owner to release us */
+				spin_acquire(&console_owner_dep_map, 0, 0, _THIS_IP_);
+				/* Owner will clear console_waiter on hand off */
+				while (READ_ONCE(console_waiter))
+					cpu_relax();
+
+				spin_release(&console_owner_dep_map, 1, _THIS_IP_);
+				printk_safe_exit_irqrestore(flags);
+
+				/*
+				 * The owner passed the console lock to us.
+				 * Since we did not spin on console lock, annotate
+				 * this as a trylock. Otherwise lockdep will
+				 * complain.
+				 */
+				mutex_acquire(&console_lock_dep_map, 0, 1, _THIS_IP_);
+				console_unlock();
+				printk_safe_enter_irqsave(flags);
+			}
+			printk_safe_exit_irqrestore(flags);
+
+		}
 	}
 
 	return printed_len;
@@ -2141,6 +2196,7 @@ void console_unlock(void)
 	static u64 seen_seq;
 	unsigned long flags;
 	bool wake_klogd = false;
+	bool waiter = false;
 	bool do_cond_resched, retry;
 
 	if (console_suspended) {
@@ -2215,6 +2271,20 @@ skip:
 			goto skip;
 		}
 
+		/*
+		 * While actively printing out messages, if another printk()
+		 * were to occur on another CPU, it may wait for this one to
+		 * finish. This task can not be preempted if there is a
+		 * waiter waiting to take over.
+		 */
+
+		/* The waiter may spin on us after this */
+		spin_acquire(&console_owner_dep_map, 0, 0, _THIS_IP_);
+
+		raw_spin_lock(&console_owner_lock);
+		console_owner = current;
+		raw_spin_unlock(&console_owner_lock);
+
 		len += msg_print_text(msg, false, text + len, sizeof(text) - len);
 		if (nr_ext_console_drivers) {
 			ext_len = msg_print_ext_header(ext_text,
@@ -2232,11 +2302,48 @@ skip:
 		stop_critical_timings();	/* don't trace print latency */
 		call_console_drivers(ext_text, ext_len, text, len);
 		start_critical_timings();
+
+		raw_spin_lock(&console_owner_lock);
+		waiter = console_waiter;
+		console_owner = NULL;
+		raw_spin_unlock(&console_owner_lock);
+
+		/*
+		 * If there is a waiter waiting for us, then pass the
+		 * rest of the work load over to that waiter.
+		 */
+		if (waiter)
+			break;
+
+		/* There was no waiter, and nothing will spin on us here */
+		spin_release(&console_owner_dep_map, 1, _THIS_IP_);
+
 		printk_safe_exit_irqrestore(flags);
 
 		if (do_cond_resched)
 			cond_resched();
 	}
+
+	/*
+	 * If there is an active waiter waiting on the console_lock.
+	 * Pass off the printing to the waiter, and the waiter
+	 * will continue printing on its CPU, and when all writing
+	 * has finished, the last printer will wake up klogd.
+	 */
+	if (waiter) {
+		WRITE_ONCE(console_waiter, false);
+		/* The waiter is now free to continue */
+		spin_release(&console_owner_dep_map, 1, _THIS_IP_);
+		/*
+		 * Hand off console_lock to waiter. The waiter will perform
+		 * the up(). After this, the waiter is the console_lock owner.
+		 */
+		mutex_release(&console_lock_dep_map, 1, _THIS_IP_);
+		printk_safe_exit_irqrestore(flags);
+		/* Note, if waiter is set, logbuf_lock is not held */
+		return;
+	}
+
 	console_locked = 0;
 
 	/* Release the exclusive_console once it is used */


* Re: [PATCH v2] printk: Add console owner and waiter logic to load balance console writes
  2017-11-02 17:06   ` [PATCH v2] printk: Add console owner and waiter logic to load balance console writes Steven Rostedt
@ 2017-11-02 17:10     ` Steven Rostedt
  2017-11-02 17:38     ` Steven Rostedt
  2017-11-03 10:19     ` Jan Kara
  2 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2017-11-02 17:10 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, linux-kernel, Cong Wang, Dave Hansen,
	Johannes Weiner, Mel Gorman, Michal Hocko, Petr Mladek,
	Sergey Senozhatsky, Vlastimil Babka, yuwang.yuwang,
	Peter Zijlstra, Linus Torvalds, Jan Kara, Mathieu Desnoyers

On Thu, 2 Nov 2017 13:06:05 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> +		raw_spin_lock(&console_owner_lock);
> +		waiter = console_waiter;
> +		console_owner = NULL;
> +		raw_spin_unlock(&console_owner_lock);
> +
> +		/*
> +		 * If there is a waiter waiting for us, then pass the
> +		 * rest of the work load over to that waiter.
> +		 */
> +		if (waiter)
> +			break;
> +

Hmm, do I need a READ_ONCE() here?

Can gcc move the load of console_waiter outside the spin lock, to where
if (waiter) is done?

Although it doesn't really matter, it just makes the code more
fragile if it can. Should this be:

		raw_spin_lock(&console_owner_lock);
		waiter = READ_ONCE(console_waiter);
		console_owner = NULL;
		raw_spin_unlock(&console_owner_lock);

		/*
		 * If there is a waiter waiting for us, then pass the
		 * rest of the work load over to that waiter.
		 */
		if (waiter)
			break;

 ?

-- Steve


* Re: [PATCH v2] printk: Add console owner and waiter logic to load balance console writes
  2017-11-02 17:06   ` [PATCH v2] printk: Add console owner and waiter logic to load balance console writes Steven Rostedt
  2017-11-02 17:10     ` Steven Rostedt
@ 2017-11-02 17:38     ` Steven Rostedt
  2017-11-03 10:19     ` Jan Kara
  2 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2017-11-02 17:38 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, linux-kernel, Cong Wang, Dave Hansen,
	Johannes Weiner, Mel Gorman, Michal Hocko, Petr Mladek,
	Sergey Senozhatsky, Vlastimil Babka, yuwang.yuwang,
	Peter Zijlstra, Linus Torvalds, Jan Kara, Mathieu Desnoyers

On Thu, 2 Nov 2017 13:06:05 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:


> +		raw_spin_lock(&console_owner_lock);
> +		waiter = console_waiter;
> +		console_owner = NULL;
> +		raw_spin_unlock(&console_owner_lock);
> +
> +		/*
> +		 * If there is a waiter waiting for us, then pass the
> +		 * rest of the work load over to that waiter.
> +		 */
> +		if (waiter)
> +			break;

OK, the spin_unlock() won't let the load leak. Thus it is fine as is.
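
In barrier terms (sketching the reasoning, not new code):

	raw_spin_lock(&console_owner_lock);
	waiter = console_waiter;	/* plain load, inside the section */
	console_owner = NULL;
	raw_spin_unlock(&console_owner_lock);	/* RELEASE: the load above
						 * cannot move below this */
	if (waiter)		/* sees the value read under the lock */
		break;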


> +			raw_spin_lock(&console_owner_lock);
> +			owner = console_owner;
> +			waiter = console_waiter;
> +			if (!waiter && owner && owner != current) {

But Mathieu Desnoyers pointed out that usage of shared variables within a
spinlock may be an issue. Although it shouldn't affect the code as is,
I think I'll add back READ/WRITE_ONCE() just to be on the safe side.

I may add the waiter = READ_ONCE(console_waiter) to the first one too,
more as documentation. It shouldn't cause any issues to add it.

-- Steve



> +				console_waiter = true;
> +				spin = true;
> +			}
> +			raw_spin_unlock(&console_owner_lock);


* Re: [PATCH v2] printk: Add console owner and waiter logic to load balance console writes
  2017-11-02 17:06   ` [PATCH v2] printk: Add console owner and waiter logic to load balance console writes Steven Rostedt
  2017-11-02 17:10     ` Steven Rostedt
  2017-11-02 17:38     ` Steven Rostedt
@ 2017-11-03 10:19     ` Jan Kara
  2017-11-03 11:18       ` Steven Rostedt
  2 siblings, 1 reply; 25+ messages in thread
From: Jan Kara @ 2017-11-03 10:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Tetsuo Handa, akpm, linux-mm, linux-kernel, Cong Wang,
	Dave Hansen, Johannes Weiner, Mel Gorman, Michal Hocko,
	Petr Mladek, Sergey Senozhatsky, Vlastimil Babka, yuwang.yuwang,
	Peter Zijlstra, Linus Torvalds, Jan Kara, Mathieu Desnoyers

Hi,

On Thu 02-11-17 13:06:05, Steven Rostedt wrote:
> +			if (spin) {
> +				/* We spin waiting for the owner to release us */
> +				spin_acquire(&console_owner_dep_map, 0, 0, _THIS_IP_);
> +				/* Owner will clear console_waiter on hand off */
> +				while (READ_ONCE(console_waiter))
> +					cpu_relax();

Hum, what prevents us from rescheduling here? And what if the process
stored in console_owner is scheduled out? Both seem to be possible with a
CONFIG_PREEMPT kernel? Unless I'm missing something, you will need to
disable preemption in some places...

Other than that I like the simplicity of your approach.

								Honza

> +
> +				spin_release(&console_owner_dep_map, 1, _THIS_IP_);
> +				printk_safe_exit_irqrestore(flags);
> +
> +				/*
> +				 * The owner passed the console lock to us.
> +				 * Since we did not spin on console lock, annotate
> +				 * this as a trylock. Otherwise lockdep will
> +				 * complain.
> +				 */
> +				mutex_acquire(&console_lock_dep_map, 0, 1, _THIS_IP_);
> +				console_unlock();
> +				printk_safe_enter_irqsave(flags);
> +			}
> +			printk_safe_exit_irqrestore(flags);
> +
> +		}
>  	}
>  
>  	return printed_len;
> @@ -2141,6 +2196,7 @@ void console_unlock(void)
>  	static u64 seen_seq;
>  	unsigned long flags;
>  	bool wake_klogd = false;
> +	bool waiter = false;
>  	bool do_cond_resched, retry;
>  
>  	if (console_suspended) {
> @@ -2215,6 +2271,20 @@ skip:
>  			goto skip;
>  		}
>  
> +		/*
> +		 * While actively printing out messages, if another printk()
> +		 * were to occur on another CPU, it may wait for this one to
> +		 * finish. This task can not be preempted if there is a
> +		 * waiter waiting to take over.
> +		 */
> +
> +		/* The waiter may spin on us after this */
> +		spin_acquire(&console_owner_dep_map, 0, 0, _THIS_IP_);
> +
> +		raw_spin_lock(&console_owner_lock);
> +		console_owner = current;
> +		raw_spin_unlock(&console_owner_lock);
> +
>  		len += msg_print_text(msg, false, text + len, sizeof(text) - len);
>  		if (nr_ext_console_drivers) {
>  			ext_len = msg_print_ext_header(ext_text,
> @@ -2232,11 +2302,48 @@ skip:
>  		stop_critical_timings();	/* don't trace print latency */
>  		call_console_drivers(ext_text, ext_len, text, len);
>  		start_critical_timings();
> +
> +		raw_spin_lock(&console_owner_lock);
> +		waiter = console_waiter;
> +		console_owner = NULL;
> +		raw_spin_unlock(&console_owner_lock);
> +
> +		/*
> +		 * If there is a waiter waiting for us, then pass the
> +		 * rest of the work load over to that waiter.
> +		 */
> +		if (waiter)
> +			break;
> +
> +		/* There was no waiter, and nothing will spin on us here */
> +		spin_release(&console_owner_dep_map, 1, _THIS_IP_);
> +
>  		printk_safe_exit_irqrestore(flags);
>  
>  		if (do_cond_resched)
>  			cond_resched();
>  	}
> +
> +	/*
> +	 * If there is an active waiter waiting on the console_lock.
> +	 * Pass off the printing to the waiter, and the waiter
> +	 * will continue printing on its CPU, and when all writing
> +	 * has finished, the last printer will wake up klogd.
> +	 */
> +	if (waiter) {
> +		WRITE_ONCE(console_waiter, false);
> +		/* The waiter is now free to continue */
> +		spin_release(&console_owner_dep_map, 1, _THIS_IP_);
> +		/*
> +		 * Hand off console_lock to waiter. The waiter will perform
> +		 * the up(). After this, the waiter is the console_lock owner.
> +		 */
> +		mutex_release(&console_lock_dep_map, 1, _THIS_IP_);
> +		printk_safe_exit_irqrestore(flags);
> +		/* Note, if waiter is set, logbuf_lock is not held */
> +		return;
> +	}
> +
>  	console_locked = 0;
>  
>  	/* Release the exclusive_console once it is used */
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2] printk: Add console owner and waiter logic to load balance console writes
  2017-11-03 10:19     ` Jan Kara
@ 2017-11-03 11:18       ` Steven Rostedt
  0 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2017-11-03 11:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tetsuo Handa, akpm, linux-mm, linux-kernel, Cong Wang,
	Dave Hansen, Johannes Weiner, Mel Gorman, Michal Hocko,
	Petr Mladek, Sergey Senozhatsky, Vlastimil Babka, yuwang.yuwang,
	Peter Zijlstra, Linus Torvalds, Mathieu Desnoyers

On Fri, 3 Nov 2017 11:19:53 +0100
Jan Kara <jack@suse.cz> wrote:

> Hi,
> 
> On Thu 02-11-17 13:06:05, Steven Rostedt wrote:
> > +			if (spin) {
> > +				/* We spin waiting for the owner to release us */
> > +				spin_acquire(&console_owner_dep_map, 0, 0, _THIS_IP_);
> > +				/* Owner will clear console_waiter on hand off */
> > +				while (READ_ONCE(console_waiter))
> > +					cpu_relax();  
> 
> Hum, what prevents us from rescheduling here? And what if the process
> stored in console_owner is scheduled out? Both seem to be possible with a
> CONFIG_PREEMPT kernel? Unless I'm missing something, you will need to
> disable preemption in some places...

Yes, you are missing something ;-)

> 
> Other than that I like the simplicity of your approach.
> 
> 								Honza
> 
> > +
> > +				spin_release(&console_owner_dep_map, 1, _THIS_IP_);
> > +				printk_safe_exit_irqrestore(flags);

The above line re-enables interrupts, and this is done for both the
console_owner and the console_waiter. These are only set with
interrupts disabled. Nothing will preempt them. In fact, if it could,
lockdep would complain (it did when I screwed it up at first ;-)
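
Compressed, the waiter side is (abridged from the v2 patch):

	printk_safe_enter_irqsave(flags);	/* local IRQs off */
	/* ... set console_waiter under console_owner_lock ... */
	while (READ_ONCE(console_waiter))	/* spins with IRQs disabled, */
		cpu_relax();			/* so it cannot be preempted */
	spin_release(&console_owner_dep_map, 1, _THIS_IP_);
	printk_safe_exit_irqrestore(flags);	/* IRQs on again only here */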


-- Steve


* Re: [PATCH] mm: don't warn about allocations which stall for too long
       [not found]   ` <91bdbdea-3f33-b7c0-8345-d0fa8c7f1cf1@sonymobile.com>
@ 2017-11-09  8:52     ` Michal Hocko
       [not found]       ` <ef81333e-0e19-c6f6-a960-093dc60fb75c@sony.com>
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2017-11-09  8:52 UTC (permalink / raw)
  To: peter enderborg
  Cc: Tetsuo Handa, akpm, linux-mm, linux-kernel, Cong Wang,
	Dave Hansen, Johannes Weiner, Mel Gorman, Petr Mladek,
	Sergey Senozhatsky, Vlastimil Babka, yuwang.yuwang

[Please try to trim the context you are replying to]

On Wed 08-11-17 11:30:23, peter enderborg wrote:
[...]
> What about the idea to keep the function, but instead of printing, only emit a trace event?

I am not sure. I would rather see a tracepoint to mark the allocator
entry. This would allow both 1) measuring the allocation latency (to
compare it to trace_mm_page_alloc) and 2) checking for stalls with an
arbitrary user-defined timeout (just print all allocations which haven't
passed trace_mm_page_alloc for the given amount of time).
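
Such an entry tracepoint could look like the sketch below
(mm_page_alloc_start is an assumed name; no such event exists in the
tree). Pairing it with the existing trace_mm_page_alloc exit event
gives the latency, and an entry with no matching exit within a chosen
timeout is a stall:

	TRACE_EVENT(mm_page_alloc_start,

		TP_PROTO(gfp_t gfp_mask, unsigned int order),

		TP_ARGS(gfp_mask, order),

		TP_STRUCT__entry(
			__field(gfp_t, gfp_mask)
			__field(unsigned int, order)
		),

		TP_fast_assign(
			__entry->gfp_mask = gfp_mask;
			__entry->order = order;
		),

		TP_printk("gfp_mask=%s order=%u",
			  show_gfp_flags(__entry->gfp_mask), __entry->order)
	);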
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: don't warn about allocations which stall for too long
       [not found]       ` <ef81333e-0e19-c6f6-a960-093dc60fb75c@sony.com>
@ 2017-11-09 10:09         ` Michal Hocko
  2017-11-09 10:19           ` Tetsuo Handa
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2017-11-09 10:09 UTC (permalink / raw)
  To: peter enderborg
  Cc: peter enderborg, Tetsuo Handa, akpm, linux-mm, linux-kernel,
	Cong Wang, Dave Hansen, Johannes Weiner, Mel Gorman, Petr Mladek,
	Sergey Senozhatsky, Vlastimil Babka, yuwang.yuwang

On Thu 09-11-17 10:34:46, peter enderborg wrote:
> On 11/09/2017 09:52 AM, Michal Hocko wrote:
> > I am not sure. I would rather see a tracepoint to mark the allocator
> > entry. This would allow both 1) measuring the allocation latency (to
> > compare it to trace_mm_page_alloc) and 2) checking for stalls with an
> > arbitrary user-defined timeout (just print all allocations which haven't
> > passed trace_mm_page_alloc for the given amount of time).
> 
> Traces are not that expensive, but there are more than a few
> calls in this path. And I'm trying to keep it small enough that it can
> be used for maintenance versions too.
>
> This suggestion is a quick way of keeping the current solution for
> the ones that are interested in the slow allocations. If we are going
> for a solution with a time-out parameter from the user, what interface
> do you suggest for this configuration? A filter parameter for the
> event?

I meant to do all that in postprocessing. So no specific API is needed,
just parse the output. Anyway, it seems that printk will be put in
shape in the foreseeable future, so we might preserve the stall warning
after all. It is the show_mem part which is interesting during that
warning.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: don't warn about allocations which stall for too long
  2017-11-09 10:09         ` Michal Hocko
@ 2017-11-09 10:19           ` Tetsuo Handa
  0 siblings, 0 replies; 25+ messages in thread
From: Tetsuo Handa @ 2017-11-09 10:19 UTC (permalink / raw)
  To: mhocko, peter.enderborg
  Cc: peter.enderborg, akpm, linux-mm, linux-kernel, xiyou.wangcong,
	dave.hansen, hannes, mgorman, pmladek, sergey.senozhatsky,
	vbabka, yuwang.yuwang

Michal Hocko wrote:
> On Thu 09-11-17 10:34:46, peter enderborg wrote:
> > On 11/09/2017 09:52 AM, Michal Hocko wrote:
> > > I am not sure. I would rather see a tracepoint to mark the allocator
> > > entry. This would allow both 1) measuring the allocation latency (to
> > > compare it to trace_mm_page_alloc) and 2) checking for stalls with an
> > > arbitrary user-defined timeout (just print all allocations which haven't
> > > passed trace_mm_page_alloc for the given amount of time).
> > 
> > Traces are not that expensive, but there are more than a few
> > calls in this path. And I'm trying to keep it small enough that it can
> > be used for maintenance versions too.
> >
> > This suggestion is a quick way of keeping the current solution for
> > the ones that are interested in the slow allocations. If we are going
> > for a solution with a time-out parameter from the user, what interface
> > do you suggest for this configuration? A filter parameter for the
> > event?
> 
> I meant to do all that in postprocessing. So no specific API is needed,
> just parse the output. Anyway, it seems that printk will be put in
> shape in the foreseeable future, so we might preserve the stall warning
> after all. It is the show_mem part which is interesting during that
> warning.

I don't know whether printk() will be put in shape in the foreseeable
future. The rule that "do not try to printk() faster than the kernel can
write to consoles" will remain no matter how printk() changes. Unless an
asynchronous approach like https://lwn.net/Articles/723447/ is used, I
think we can't obtain useful information.


Thread overview: 25+ messages
2017-10-26 11:28 [PATCH] mm: don't warn about allocations which stall for too long Tetsuo Handa
2017-10-26 11:41 ` Michal Hocko
     [not found]   ` <91bdbdea-3f33-b7c0-8345-d0fa8c7f1cf1@sonymobile.com>
2017-11-09  8:52     ` Michal Hocko
     [not found]       ` <ef81333e-0e19-c6f6-a960-093dc60fb75c@sony.com>
2017-11-09 10:09         ` Michal Hocko
2017-11-09 10:19           ` Tetsuo Handa
2017-10-26 14:37 ` Johannes Weiner
2017-10-31 19:32 ` Steven Rostedt
2017-11-01  8:30   ` Vlastimil Babka
2017-11-01 13:38     ` Petr Mladek
2017-11-01 15:36       ` Steven Rostedt
2017-11-02 11:46         ` Petr Mladek
2017-11-02 14:49           ` Steven Rostedt
2017-11-01 15:33     ` Steven Rostedt
2017-11-01 17:42       ` Vlastimil Babka
2017-11-01 17:54         ` Steven Rostedt
2017-11-02  8:53   ` Sergey Senozhatsky
2017-11-02  9:14     ` Sergey Senozhatsky
2017-11-02 14:55     ` Steven Rostedt
2017-11-02 12:55   ` Michal Hocko
2017-11-02 15:56 ` Steven Rostedt
2017-11-02 17:06   ` [PATCH v2] printk: Add console owner and waiter logic to load balance console writes Steven Rostedt
2017-11-02 17:10     ` Steven Rostedt
2017-11-02 17:38     ` Steven Rostedt
2017-11-03 10:19     ` Jan Kara
2017-11-03 11:18       ` Steven Rostedt
