* [PATCH v7] mm: Add memory allocation watchdog kernel thread.
@ 2017-02-28  1:21 Tetsuo Handa
  2017-03-09 10:46 ` Tetsuo Handa
  2017-03-09 22:35   ` Andrew Morton
  0 siblings, 2 replies; 22+ messages in thread
From: Tetsuo Handa @ 2017-02-28  1:21 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, linux-kernel, Tetsuo Handa, Michal Hocko,
	Johannes Weiner, Mel Gorman, Dave Chinner, Alexander Polakov

This patch adds a watchdog which periodically reports the number of
memory-allocating tasks, dying tasks and OOM victim tasks when some task
is spending too long inside __alloc_pages_slowpath(). This patch also
serves as a hook for obtaining additional information via SystemTap
(e.g. examining other variables using printk(), or capturing a crash
dump by calling panic()) by triggering a callback only when a stall is
detected. The ability to take administrator-controlled actions based on
a threshold is a major advantage gained by introducing state tracking.

Commit 63f53dea0c9866e9 ("mm: warn about allocations which stall for
too long") was a great step towards reducing the possibility of silent
hang-up problems caused by memory allocation stalls [1]. However, there
are reports of long stalls (e.g. [2] is over 30 minutes!) and lockups
(e.g. [3] is an "unable to invoke the OOM killer due to !__GFP_FS
allocation" lockup problem) where this patch is more useful than that
commit, because this patch can report possibly related tasks even if
the allocating tasks are unexpectedly blocked for that long. Regarding
premature OOM killer invocation, tracepoints which can accumulate
samples at short intervals would be useful. But for allocation stalls
that are reported too late, this patch, which can capture all tasks
(for reporting the overall situation) at a longer interval and act as
a trigger (for accumulating short-interval samples), would be useful.

Changes from v1: ( http://lkml.kernel.org/r/201510182105.AGA00839.FHVFFStLQOMOOJ@I-love.SAKURA.ne.jp )

  (1) Use per "struct task_struct" variables. This allows vmcore to
      remember information about the last memory allocation request,
      which is useful for understanding the last-minute behavior of
      the kernel.

  (2) Report using an accurate timeout. This increases the possibility
      of successfully reporting before watchdog timers reset the machine.

  (3) Show memory information (SysRq-m). This makes it easier to know
      the reason for stalling.

  (4) Show both $state_of_allocation and $state_of_task in the same
      line. This makes it easier to grep the output.

  (5) Minimize the duration for which the kernel thread holds the
      spinlock.

Changes from v2: ( http://lkml.kernel.org/r/201511222346.JBH48464.VFFtOLOOQJMFHS@I-love.SAKURA.ne.jp )

  (1) Print a sequence number. This makes it easier to know whether
      memory allocation is succeeding (a situation that looks like a
      livelock but is making forward progress) or not.

  (2) Replace the spinlock with a cheaper seqlock_t-like sequence-number
      based method. Callers no longer contend on a lock, and the major
      overhead on the caller side becomes two smp_wmb() calls instead of
      read_lock()/read_unlock().

  (3) Print "exiting" instead for "dying" if an OOM victim is stalling
      at do_exit(), for SIGKILL is removed before arriving at do_exit().

  (4) Moved the explanation to Documentation/malloc-watchdog.txt.

Changes from v3: ( http://lkml.kernel.org/r/201511250024.AAE78692.QVOtFFOSFOMLJH@I-love.SAKURA.ne.jp )

  (1) Avoid stalls even if there are very many tasks to report.

Changes from v4: ( http://lkml.kernel.org/r/201512130033.ABH90650.FtFOMOFLVOJHQS@I-love.SAKURA.ne.jp )

  (1) Use a per-CPU in-flight counter, reverting "Report using accurate
      timeout." from v2, in order to avoid walking the process list,
      which is costly when there are extremely many tasks in the system.

  (2) Updated Documentation/malloc-watchdog.txt to explain how this
      serves as a hook for dynamic probes.

Changes from v5: ( http://lkml.kernel.org/r/1462630604-23410-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp )

  (1) Disable commit 63f53dea0c9866e9 ("mm: warn about allocations which
      stall for too long") when CONFIG_DETECT_MEMALLOC_STALL_TASK is
      enabled.

  (2) Updated Documentation/malloc-watchdog.txt to reflect OOM related
      improvements up to Linux 4.9.

Changes from v6: ( http://lkml.kernel.org/r/1478416501-10104-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp )

  (1) Check __GFP_DIRECT_RECLAIM allocation requests rather than
      __GFP_RECLAIM == (__GFP_KSWAPD_RECLAIM|__GFP_DIRECT_RECLAIM)
      allocation requests.

  (2) Rename /proc/sys/kernel/memalloc_task_timeout_secs to
      /proc/sys/kernel/memalloc_task_warning_secs, because the name
      "timeout" is associated with "give up retrying, and OOM-kill or
      fail", whereas this variable is for emitting warning messages
      rather than giving up.

[1] http://lkml.kernel.org/r/201610182004.AEF87559.FOOHVLJOQFFtSM@I-love.SAKURA.ne.jp
[2] https://bugzilla.kernel.org/show_bug.cgi?id=192981
[3] http://lkml.kernel.org/r/201702261530.JDD56292.OFOLFHQtVMJSOF@I-love.SAKURA.ne.jp

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Alexander Polakov <apolyakov@beget.ru>
---
 Documentation/malloc-watchdog.txt | 155 ++++++++++++++++++++++++++
 include/linux/oom.h               |   4 +
 include/linux/sched.h             |  29 +++++
 kernel/fork.c                     |   4 +
 kernel/hung_task.c                | 221 +++++++++++++++++++++++++++++++++++++-
 kernel/sysctl.c                   |  10 ++
 lib/Kconfig.debug                 |  24 +++++
 mm/oom_kill.c                     |   3 +
 mm/page_alloc.c                   |  57 ++++++++++
 9 files changed, 505 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/malloc-watchdog.txt

diff --git a/Documentation/malloc-watchdog.txt b/Documentation/malloc-watchdog.txt
new file mode 100644
index 0000000..e4b24d77
--- /dev/null
+++ b/Documentation/malloc-watchdog.txt
@@ -0,0 +1,155 @@
+=================================
+Memory allocation stall watchdog.
+=================================
+
+
+- What is it?
+
+This is an extension to the khungtaskd kernel thread which warns that
+memory allocation requests are stalling, in order to catch unexplained
+hangups/reboots caused by memory allocation stalls.
+
+
+- Why use it?
+
+Currently, when something goes wrong inside a memory allocation request,
+the system might stall without printing any kernel messages.
+
+Although the khungtaskd kernel thread exists as an asynchronous
+monitoring approach, it is not always helpful because memory-allocating
+tasks are unlikely to sleep in uninterruptible state for
+/proc/sys/kernel/hung_task_timeout_secs seconds.
+
+Although warn_alloc() exists as a synchronous monitoring approach
+which emits a
+
+  "%s: page allocation stalls for %ums, order:%u, mode:%#x(%pGg)\n"
+
+line, warn_alloc() is not bulletproof, because allocating tasks can get
+stuck before calling warn_alloc(), allocating tasks may be using the
+__GFP_NOWARN flag, such lines may be suppressed by ratelimiting, and
+such lines may be corrupted due to collisions.
+
+Unless we use an asynchronous monitoring approach, we can fail to figure
+out that something went wrong inside memory allocation requests.
+
+People are reporting hang-up and/or slowdown problems inside memory
+allocation requests, but we are forcing them to use kernels without a
+means to find out what is happening. Such a means is expected to work
+without knowledge of the tracepoints functionality, to run without
+allocating memory, to dump output without administrator intervention,
+and to work before watchdog timers reset the machine.
+
+This extension adds a state tracking mechanism for memory allocation
+requests to the khungtaskd kernel thread, allowing administrators to
+figure out that the system hung up due to memory allocation stalls and/or
+to take administrator-controlled actions when memory allocation requests
+are stalling.
+
+
+- How to configure it?
+
+Build kernels with CONFIG_DETECT_HUNG_TASK=y and
+CONFIG_DETECT_MEMALLOC_STALL_TASK=y.
+
+The default scan interval is configured by
+CONFIG_DEFAULT_MEMALLOC_TASK_TIMEOUT. The scan interval can be changed at
+run time by writing a timeout in seconds to
+/proc/sys/kernel/memalloc_task_warning_secs. Writing 0 disables the scan.
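+
+For example (a sketch of the runtime knob described above; the threshold
+value is arbitrary):
+
+  # echo 30 > /proc/sys/kernel/memalloc_task_warning_secs  # warn after ~30s
+  # echo 0 > /proc/sys/kernel/memalloc_task_warning_secs   # disable the scan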
+
+Even if you disable this scan, information about the last memory allocation
+request is kept. That is, you will get some hint for understanding the
+last-minute behavior of the kernel when you analyze a vmcore (or a memory
+snapshot of a virtualized machine).
+
+
+- How are memory allocation stalls reported?
+
+This extension reports allocation stalls by printing a
+
+  MemAlloc-Info: stalling=$X dying=$Y1 exiting=$Y2 victim=$Z oom_count=$O
+
+line, where $X > 0, followed by
+
+  MemAlloc: $name($pid) flags=$flags switches=$switches $state_of_allocation $state_of_task
+
+lines and corresponding stack traces.
+
+$O is the number of times the OOM killer has been invoked. If $O does not
+increase over time, allocation requests got stuck before calling the OOM
+killer.
+
+$name is that task's comm name string ("struct task_struct"->comm).
+
+$pid is that task's pid value ("struct task_struct"->pid).
+
+$flags is that task's flags value ("struct task_struct"->flags).
+
+$switches is that task's context switch counter ("struct task_struct"->nvcsw +
+"struct task_struct"->nivcsw), which is also checked by the khungtaskd
+kernel thread for finding hung tasks.
+
+$state_of_allocation is reported only when that task is stalling inside
+__alloc_pages_slowpath(), in "seq=$seq gfp=$gfp order=$order delay=$delay"
+format, where $seq is the sequence number of the allocation request, $gfp
+is the gfp flags used for that allocation request, $order is the order,
+and $delay is the jiffies elapsed since entering __alloc_pages_slowpath().
+
+You can check the seq=$seq field of each reported process. If $seq is
+increasing over time, the system is simply overloaded (not a livelock,
+but progress too slow to wait for), unless the caller is doing open-coded
+__GFP_NOFAIL allocation requests (effectively a livelock).
+
+$state_of_task is reported only when that task is dying, as a combination
+of "uninterruptible" (that task is in uninterruptible sleep, likely due to
+an uninterruptible lock), "exiting" (that task arrived at the do_exit()
+function), "dying" (that task has a pending SIGKILL) and "victim" (that
+task has TIF_MEMDIE set; likely only one task).
+
+
+- What do the messages look like?
+
+An example of MemAlloc lines (a grep of dmesg output) is shown below.
+You can use a serial console and/or netconsole to save these messages
+when the system is stalling.
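+
+For example, a command like the following (assuming the messages remain
+in the kernel ring buffer) extracts them:
+
+  # dmesg | grep MemAlloc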
+
+  [  100.503284] MemAlloc-Info: stalling=8 dying=1 exiting=0 victim=1 oom_count=101421
+  [  100.505674] MemAlloc: kswapd0(54) flags=0xa40840 switches=84685
+  [  100.546645] MemAlloc: kworker/3:1(70) flags=0x4208060 switches=9462 seq=5 gfp=0x2400000(GFP_NOIO) order=0 delay=8207 uninterruptible
+  [  100.606034] MemAlloc: systemd-journal(469) flags=0x400100 switches=8380 seq=212 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=10620 uninterruptible
+  [  100.651766] MemAlloc: irqbalance(998) flags=0x400100 switches=4366 seq=5 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=5819
+  [  100.697590] MemAlloc: vmtoolsd(1928) flags=0x400100 switches=8542 seq=82 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=10620 uninterruptible
+  [  100.743312] MemAlloc: tuned(3737) flags=0x400040 switches=8220 seq=44 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=10620 uninterruptible
+  [  100.792038] MemAlloc: nmbd(3759) flags=0x400140 switches=8079 seq=198 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=10620 uninterruptible
+  [  100.839428] MemAlloc: oom-write(3814) flags=0x400000 switches=8126 seq=223446 gfp=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO) order=0 delay=10620 uninterruptible
+  [  100.878846] MemAlloc: write(3816) flags=0x400000 switches=7440 uninterruptible dying victim
+  [  100.917971] MemAlloc: write(3820) flags=0x400000 switches=16130 seq=8714 gfp=0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE) order=0 delay=10620 uninterruptible
+  [  101.190979] MemAlloc-Info: stalling=8 dying=1 exiting=0 victim=1 oom_count=107514
+  [  111.194055] MemAlloc-Info: stalling=9 dying=1 exiting=0 victim=1 oom_count=199825
+  [  111.196624] MemAlloc: kswapd0(54) flags=0xa40840 switches=168410
+  [  111.238096] MemAlloc: kworker/3:1(70) flags=0x4208060 switches=18592 seq=5 gfp=0x2400000(GFP_NOIO) order=0 delay=18898
+  [  111.296920] MemAlloc: systemd-journal(469) flags=0x400100 switches=15918 seq=212 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=21311
+  [  111.343129] MemAlloc: systemd-logind(973) flags=0x400100 switches=7786 seq=3 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=10476
+  [  111.390142] MemAlloc: irqbalance(998) flags=0x400100 switches=11965 seq=5 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=16510
+  [  111.435170] MemAlloc: vmtoolsd(1928) flags=0x400100 switches=16230 seq=82 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=21311 uninterruptible
+  [  111.479089] MemAlloc: tuned(3737) flags=0x400040 switches=15850 seq=44 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=21311 uninterruptible
+  [  111.528294] MemAlloc: nmbd(3759) flags=0x400140 switches=15682 seq=198 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=21311
+  [  111.576371] MemAlloc: oom-write(3814) flags=0x400000 switches=15378 seq=223446 gfp=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO) order=0 delay=21311 uninterruptible
+  [  111.617562] MemAlloc: write(3816) flags=0x400000 switches=7440 uninterruptible dying victim
+  [  111.661662] MemAlloc: write(3820) flags=0x400000 switches=24334 seq=8714 gfp=0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE) order=0 delay=21311 uninterruptible
+  [  111.956964] MemAlloc-Info: stalling=9 dying=1 exiting=0 victim=1 oom_count=206663
+
+You can check whether memory allocations are making forward progress.
+You can check where memory allocations are stalling using the stack trace
+of each reported task, which follows its MemAlloc: line. You can check
+memory information (SysRq-m) and stuck-workqueue information, which
+follow the end of the MemAlloc: lines. You can also check held locks
+(SysRq-d) if the kernel was built with CONFIG_PROVE_LOCKING=y and lockdep
+is still active.
+
+This extension also serves as a hook for triggering actions when the
+timeout expires. If you want to obtain more information, you can utilize
+dynamic probes using e.g. SystemTap. For example,
+
+  # stap -F -g -e 'probe kernel.function("check_memalloc_stalling_tasks").return { if ($return > 0) panic("MemAlloc stall detected."); }'
+
+will allow you to obtain a vmcore by triggering a kernel panic. Since
+the variables used by this extension are associated with
+"struct task_struct", you can obtain an accurate snapshot using the
+"foreach task" command of the crash utility.
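+
+For example, a crash session like the following (a sketch; it assumes
+crash's "task -R member" option, which restricts output to the named
+task_struct member) dumps the recorded allocation state of every task:
+
+  crash> foreach task -R memalloc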
diff --git a/include/linux/oom.h b/include/linux/oom.h
index b4e36e9..69556f3 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -79,8 +79,12 @@ extern unsigned long oom_badness(struct task_struct *p,
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern unsigned int out_of_memory_count;
+extern bool memalloc_maybe_stalling(void);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
 extern int sysctl_panic_on_oom;
+extern unsigned long sysctl_memalloc_task_warning_secs;
 #endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4a28deb..cd332f4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1478,6 +1478,32 @@ struct tlbflush_unmap_batch {
 	bool writable;
 };
 
+struct memalloc_info {
+	/*
+	 * 0: not doing __GFP_RECLAIM allocation.
+	 * 1: doing non-recursive __GFP_RECLAIM allocation.
+	 * 2: doing recursive __GFP_RECLAIM allocation.
+	 */
+	u8 valid;
+	/*
+	 * bit 0: Will be reported as OOM victim.
+	 * bit 1: Will be reported as dying task.
+	 * bit 2: Will be reported as stalling task.
+	 * bit 3: Will be reported as exiting task.
+	 * bit 7: Will be reported unconditionally.
+	 */
+	u8 type;
+	/* Index used for memalloc_in_flight[] counter. */
+	u8 idx;
+	/* For progress monitoring. */
+	unsigned int sequence;
+	/* Started time in jiffies as of valid == 1. */
+	unsigned long start;
+	/* Requested order and gfp flags as of valid == 1. */
+	unsigned int order;
+	gfp_t gfp;
+};
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/*
@@ -1970,6 +1996,9 @@ struct task_struct {
 	/* A live task holds one reference. */
 	atomic_t stack_refcount;
 #endif
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+	struct memalloc_info memalloc;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 246bf9a..8003a9b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1641,6 +1641,10 @@ static __latent_entropy struct task_struct *copy_process(
 	p->sequential_io_avg	= 0;
 #endif
 
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+	p->memalloc.sequence = 0;
+#endif
+
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
 	if (retval)
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 40c07e4..00140c8 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -16,6 +16,7 @@
 #include <linux/export.h>
 #include <linux/sysctl.h>
 #include <linux/utsname.h>
+#include <linux/oom.h>
 #include <trace/events/sched.h>
 
 /*
@@ -72,6 +73,206 @@ static int __init hung_task_panic_setup(char *str)
 	.notifier_call = hung_task_panic,
 };
 
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+/*
+ * Zero means infinite timeout - no checking done:
+ */
+unsigned long __read_mostly sysctl_memalloc_task_warning_secs =
+	CONFIG_DEFAULT_MEMALLOC_TASK_TIMEOUT;
+static struct memalloc_info memalloc; /* Filled by is_stalling_task(). */
+
+static long memalloc_timeout_jiffies(unsigned long last_checked, long timeout)
+{
+	/* timeout of 0 will disable the watchdog */
+	return timeout ? last_checked - jiffies + timeout * HZ :
+		MAX_SCHEDULE_TIMEOUT;
+}
+
+/**
+ * is_stalling_task - Check and copy a task's memalloc variable.
+ *
+ * @task:   A task to check.
+ * @expire: Deadline timestamp in jiffies (tasks whose allocation started
+ *          at or before this time are considered stalling).
+ *
+ * Returns true if a task is stalling, false otherwise.
+ */
+static bool is_stalling_task(const struct task_struct *task,
+			     const unsigned long expire)
+{
+	const struct memalloc_info *m = &task->memalloc;
+
+	if (!m->valid)
+		return false;
+	memalloc.sequence = m->sequence;
+	memalloc.start = m->start;
+	memalloc.order = m->order;
+	memalloc.gfp = m->gfp;
+	return time_after_eq(expire, memalloc.start);
+}
+
+/*
+ * check_memalloc_stalling_tasks - Check for memory allocation stalls.
+ *
+ * @timeout: Timeout in seconds.
+ *
+ * Returns number of stalling tasks.
+ *
+ * This function is marked "noinline" in order to allow inserting dynamic
+ * probes (e.g. printing more information as needed using SystemTap, or
+ * calling panic() if this function returned a non-zero value).
+ */
+static noinline int check_memalloc_stalling_tasks(unsigned long timeout)
+{
+	char buf[256];
+	struct task_struct *g, *p;
+	unsigned long now;
+	unsigned long expire;
+	unsigned int sigkill_pending = 0;
+	unsigned int exiting_tasks = 0;
+	unsigned int memdie_pending = 0;
+	unsigned int stalling_tasks = 0;
+
+	cond_resched();
+	now = jiffies;
+	/*
+	 * Report tasks that have stalled for more than half of the timeout
+	 * duration, because such tasks might be correlated with tasks that
+	 * have already stalled for the full timeout duration.
+	 */
+	expire = now - timeout * (HZ / 2);
+	/* Count stalling tasks, dying and victim tasks. */
+	preempt_disable();
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		u8 type = 0;
+
+		if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
+			type |= 1;
+			memdie_pending++;
+		}
+		if (fatal_signal_pending(p)) {
+			type |= 2;
+			sigkill_pending++;
+		}
+		if ((p->flags & PF_EXITING) && p->state != TASK_DEAD) {
+			type |= 8;
+			exiting_tasks++;
+		}
+		if (is_stalling_task(p, expire)) {
+			type |= 4;
+			stalling_tasks++;
+		}
+		if (p->flags & PF_KSWAPD)
+			type |= 128;
+		p->memalloc.type = type;
+	}
+	rcu_read_unlock();
+	preempt_enable();
+	if (!stalling_tasks)
+		return 0;
+	cond_resched();
+	/* Report stalling tasks, dying and victim tasks. */
+	pr_warn("MemAlloc-Info: stalling=%u dying=%u exiting=%u victim=%u oom_count=%u\n",
+		stalling_tasks, sigkill_pending, exiting_tasks, memdie_pending,
+		out_of_memory_count);
+	cond_resched();
+	sigkill_pending = 0;
+	exiting_tasks = 0;
+	memdie_pending = 0;
+	stalling_tasks = 0;
+	preempt_disable();
+	rcu_read_lock();
+ restart_report:
+	for_each_process_thread(g, p) {
+		bool can_cont;
+		u8 type;
+
+		if (likely(!p->memalloc.type))
+			continue;
+		p->memalloc.type = 0;
+		/* Recheck in case state changed meanwhile. */
+		type = 0;
+		if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
+			type |= 1;
+			memdie_pending++;
+		}
+		if (fatal_signal_pending(p)) {
+			type |= 2;
+			sigkill_pending++;
+		}
+		if ((p->flags & PF_EXITING) && p->state != TASK_DEAD) {
+			type |= 8;
+			exiting_tasks++;
+		}
+		if (is_stalling_task(p, expire)) {
+			type |= 4;
+			stalling_tasks++;
+			snprintf(buf, sizeof(buf),
+				 " seq=%u gfp=0x%x(%pGg) order=%u delay=%lu",
+				 memalloc.sequence, memalloc.gfp,
+				 &memalloc.gfp,
+				 memalloc.order, now - memalloc.start);
+		} else {
+			buf[0] = '\0';
+		}
+		if (p->flags & PF_KSWAPD)
+			type |= 128;
+		if (unlikely(!type))
+			continue;
+		/*
+		 * Victim tasks get pending SIGKILL removed before arriving at
+		 * do_exit(). Therefore, print " exiting" instead of " dying".
+		 */
+		pr_warn("MemAlloc: %s(%u) flags=0x%x switches=%lu%s%s%s%s%s\n",
+			p->comm, p->pid, p->flags, p->nvcsw + p->nivcsw, buf,
+			(p->state & TASK_UNINTERRUPTIBLE) ?
+			" uninterruptible" : "",
+			(type & 8) ? " exiting" : "",
+			(type & 2) ? " dying" : "",
+			(type & 1) ? " victim" : "");
+		sched_show_task(p);
+		/*
+		 * Since there could be thousands of tasks to report, we always
+		 * call cond_resched() after each report, in order to avoid RCU
+		 * stalls.
+		 *
+		 * Since not yet reported tasks have p->memalloc.type > 0, we
+		 * can simply restart this loop in case "g" or "p" went away.
+		 */
+		get_task_struct(g);
+		get_task_struct(p);
+		rcu_read_unlock();
+		preempt_enable_no_resched();
+		/*
+		 * TODO: Try to wait for a while (e.g. sleep until usage of
+		 * printk() buffer becomes less than 75%) in order to avoid
+		 * dropping messages.
+		 */
+		cond_resched();
+		preempt_disable();
+		rcu_read_lock();
+		can_cont = pid_alive(g) && pid_alive(p);
+		put_task_struct(p);
+		put_task_struct(g);
+		if (!can_cont)
+			goto restart_report;
+	}
+	rcu_read_unlock();
+	preempt_enable_no_resched();
+	cond_resched();
+	/* Show memory information. (SysRq-m) */
+	show_mem(0, NULL);
+	/* Show workqueue state. */
+	show_workqueue_state();
+	/* Show lock information. (SysRq-d) */
+	debug_show_all_locks();
+	pr_warn("MemAlloc-Info: stalling=%u dying=%u exiting=%u victim=%u oom_count=%u\n",
+		stalling_tasks, sigkill_pending, exiting_tasks, memdie_pending,
+		out_of_memory_count);
+	return stalling_tasks;
+}
+#endif /* CONFIG_DETECT_MEMALLOC_STALL_TASK */
+
 static void check_hung_task(struct task_struct *t, unsigned long timeout)
 {
 	unsigned long switch_count = t->nvcsw + t->nivcsw;
@@ -228,20 +429,36 @@ void reset_hung_task_detector(void)
 static int watchdog(void *dummy)
 {
 	unsigned long hung_last_checked = jiffies;
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+	unsigned long stall_last_checked = hung_last_checked;
+#endif
 
 	set_user_nice(current, 0);
 
 	for ( ; ; ) {
 		unsigned long timeout = sysctl_hung_task_timeout_secs;
 		long t = hung_timeout_jiffies(hung_last_checked, timeout);
-
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+		unsigned long timeout2 = sysctl_memalloc_task_warning_secs;
+		long t2 = memalloc_timeout_jiffies(stall_last_checked,
+						   timeout2);
+
+		if (t2 <= 0) {
+			if (memalloc_maybe_stalling())
+				check_memalloc_stalling_tasks(timeout2);
+			stall_last_checked = jiffies;
+			continue;
+		}
+#else
+		long t2 = t;
+#endif
 		if (t <= 0) {
 			if (!atomic_xchg(&reset_hung_task, 0))
 				check_hung_uninterruptible_tasks(timeout);
 			hung_last_checked = jiffies;
 			continue;
 		}
-		schedule_timeout_interruptible(t);
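+		/* Sleep until the earlier of the two check deadlines. */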
+		schedule_timeout_interruptible(min(t, t2));
 	}
 
 	return 0;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index bb260ce..a559ecb 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1069,6 +1069,16 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &neg_one,
 	},
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+	{
+		.procname	= "memalloc_task_warning_secs",
+		.data		= &sysctl_memalloc_task_warning_secs,
+		.maxlen		= sizeof(unsigned long),
+		.mode		= 0644,
+		.proc_handler	= proc_dohung_task_timeout_secs,
+		.extra2		= &hung_task_timeout_max,
+	},
+#endif
 #endif
 #ifdef CONFIG_RT_MUTEXES
 	{
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 97d62c2..ebdb7c0 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -912,6 +912,30 @@ config WQ_WATCHDOG
 	  state.  This can be configured through kernel parameter
 	  "workqueue.watchdog_thresh" and its sysfs counterpart.
 
+config DETECT_MEMALLOC_STALL_TASK
+	bool "Detect tasks stalling inside memory allocator"
+	default n
+	depends on DETECT_HUNG_TASK
+	help
+	  This option emits warning messages and traces when memory
+	  allocation requests are stalling, in order to catch unexplained
+	  hangups/reboots caused by memory allocation stalls.
+
+config DEFAULT_MEMALLOC_TASK_TIMEOUT
+	int "Default timeout for stalling task detection (in seconds)"
+	depends on DETECT_MEMALLOC_STALL_TASK
+	default 60
+	help
+	  This option controls the default timeout (in seconds) used
+	  to determine when a task has become non-responsive and should
+	  be considered stalling inside memory allocator.
+
+	  It can be adjusted at runtime via the kernel.memalloc_task_warning_secs
+	  sysctl or by writing a value to
+	  /proc/sys/kernel/memalloc_task_warning_secs.
+
+	  A timeout of 0 disables the check. The default is 60 seconds.
+
 endmenu # "Debug lockups and hangs"
 
 config PANIC_ON_OOPS
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 51c0918..116bd48 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -44,6 +44,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/oom.h>
 
+unsigned int out_of_memory_count;
+
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
@@ -979,6 +981,7 @@ bool out_of_memory(struct oom_control *oc)
 	unsigned long freed = 0;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 
+	out_of_memory_count++;
 	if (oom_killer_disabled)
 		return false;
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 66dd0dd..6adedbc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3637,8 +3637,10 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	enum compact_result compact_result;
 	int compaction_retries;
 	int no_progress_loops;
+#ifndef CONFIG_DETECT_MEMALLOC_STALL_TASK
 	unsigned long alloc_start = jiffies;
 	unsigned int stall_timeout = 10 * HZ;
+#endif
 	unsigned int cpuset_mems_cookie;
 
 	/*
@@ -3763,6 +3765,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	if (!can_direct_reclaim)
 		goto nopage;
 
+#ifndef CONFIG_DETECT_MEMALLOC_STALL_TASK
 	/* Make sure we know about allocations which stall for too long */
 	if (time_after(jiffies, alloc_start + stall_timeout)) {
 		warn_alloc(gfp_mask, ac->nodemask,
@@ -3770,6 +3773,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 			jiffies_to_msecs(jiffies-alloc_start), order);
 		stall_timeout += 10 * HZ;
 	}
+#endif
 
 	/* Avoid recursion of direct reclaim */
 	if (current->flags & PF_MEMALLOC)
@@ -3941,6 +3945,57 @@ static inline void finalise_ac(gfp_t gfp_mask,
 					ac->high_zoneidx, ac->nodemask);
 }
 
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+
+static DEFINE_PER_CPU_ALIGNED(int, memalloc_in_flight[2]);
+static u8 memalloc_active_index; /* Either 0 or 1. */
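+
+/*
+ * A request increments the slot of memalloc_in_flight[] that is active when
+ * it enters the slowpath and decrements that same slot when it leaves.
+ * memalloc_maybe_stalling() below sums the previously active slot: a
+ * non-zero sum means that some allocation which started before the previous
+ * scan has not finished yet, so a stall is possible.  If the sum is zero,
+ * the slots are swapped so that the next scan examines the allocations
+ * started during the current interval.
+ */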
+
+/* Called periodically at sysctl_memalloc_task_warning_secs intervals. */
+bool memalloc_maybe_stalling(void)
+{
+	int cpu;
+	int sum = 0;
+	const u8 idx = memalloc_active_index ^ 1;
+
+	for_each_possible_cpu(cpu)
+		sum += per_cpu(memalloc_in_flight[idx], cpu);
+	if (sum)
+		return true;
+	memalloc_active_index ^= 1;
+	return false;
+}
+
+static void start_memalloc_timer(const gfp_t gfp_mask, const int order)
+{
+	struct memalloc_info *m = &current->memalloc;
+
+	/* We don't check for stalls for !__GFP_DIRECT_RECLAIM allocations. */
+	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
+		return;
+	/* Check based on outermost __GFP_DIRECT_RECLAIM allocations. */
+	if (!m->valid) {
+		m->sequence++;
+		m->start = jiffies;
+		m->order = order;
+		m->gfp = gfp_mask;
+		m->idx = memalloc_active_index;
+		this_cpu_inc(memalloc_in_flight[m->idx]);
+	}
+	m->valid++;
+}
+
+static void stop_memalloc_timer(const gfp_t gfp_mask)
+{
+	struct memalloc_info *m = &current->memalloc;
+
+	if ((gfp_mask & __GFP_DIRECT_RECLAIM) && !--m->valid)
+		this_cpu_dec(memalloc_in_flight[m->idx]);
+}
+#else
+#define start_memalloc_timer(gfp_mask, order) do { } while (0)
+#define stop_memalloc_timer(gfp_mask) do { } while (0)
+#endif
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
@@ -3978,7 +4033,9 @@ struct page *
 	if (unlikely(ac.nodemask != nodemask))
 		ac.nodemask = nodemask;
 
+	start_memalloc_timer(alloc_mask, order);
 	page = __alloc_pages_slowpath(alloc_mask, order, &ac);
+	stop_memalloc_timer(alloc_mask);
 
 out:
 	if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v7] mm: Add memory allocation watchdog kernel thread.
  2017-02-28  1:21 [PATCH v7] mm: Add memory allocation watchdog kernel thread Tetsuo Handa
@ 2017-03-09 10:46 ` Tetsuo Handa
  2017-03-09 22:37     ` Andrew Morton
  2017-03-09 22:35   ` Andrew Morton
  1 sibling, 1 reply; 22+ messages in thread
From: Tetsuo Handa @ 2017-03-09 10:46 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-kernel, mhocko, hannes, mgorman, david, apolyakov

Andrew, do you have any questions on this patch?
I really need this patch for finding bugs which MM people overlook.

Tetsuo Handa wrote:
> This patch adds a watchdog which periodically reports the number of
> memory-allocating tasks, dying tasks and OOM victim tasks when some task
> is spending too long inside __alloc_pages_slowpath(). This patch also
> serves as a hook for obtaining additional information via SystemTap
> (e.g. examining other variables using printk(), or capturing a crash
> dump by calling panic()) by triggering a callback only when a stall is
> detected. The ability to take administrator-controlled actions based on
> a threshold is a major advantage gained by introducing state tracking.
> 
> Commit 63f53dea0c9866e9 ("mm: warn about allocations which stall for
> too long") was a great step towards reducing the possibility of silent
> hang-up problems caused by memory allocation stalls [1]. However, there
> are reports of long stalls (e.g. [2] is over 30 minutes!) and lockups
> (e.g. [3] is an "unable to invoke the OOM killer due to !__GFP_FS
> allocation" lockup problem) where this patch is more useful than that
> commit, because this patch can report possibly related tasks even if
> the allocating tasks are unexpectedly blocked for that long. Regarding
> premature OOM killer invocation, tracepoints which can accumulate
> samples at short intervals would be useful. But for allocation stalls
> that are reported too late, this patch, which can capture all tasks
> (for reporting the overall situation) at a longer interval and act as
> a trigger (for accumulating short-interval samples), would be useful.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7] mm: Add memory allocation watchdog kernel thread.
  2017-02-28  1:21 [PATCH v7] mm: Add memory allocation watchdog kernel thread Tetsuo Handa
@ 2017-03-09 22:35   ` Andrew Morton
  2017-03-09 22:35   ` Andrew Morton
  1 sibling, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2017-03-09 22:35 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, linux-kernel, Michal Hocko, Johannes Weiner,
	Mel Gorman, Dave Chinner, Alexander Polakov

On Tue, 28 Feb 2017 10:21:48 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:

> This patch adds a watchdog which periodically reports the number of
> memory-allocating tasks, dying tasks and OOM victim tasks when some task
> is spending too long inside __alloc_pages_slowpath(). This patch also
> serves as a hook for obtaining additional information via SystemTap
> (e.g. examining other variables using printk(), or capturing a crash
> dump by calling panic()) by triggering a callback only when a stall is
> detected. The ability to take administrator-controlled actions based on
> a threshold is a major advantage gained by introducing state tracking.
> 
> Commit 63f53dea0c9866e9 ("mm: warn about allocations which stall for
> too long") was a great step towards reducing the possibility of silent
> hang-up problems caused by memory allocation stalls [1]. However, there
> are reports of long stalls (e.g. [2] is over 30 minutes!) and lockups
> (e.g. [3] is an "unable to invoke the OOM killer due to !__GFP_FS
> allocation" lockup problem) where this patch is more useful than that
> commit, because this patch can report possibly related tasks even if
> the allocating tasks are unexpectedly blocked for that long. Regarding
> premature OOM killer invocation, tracepoints which can accumulate
> samples at short intervals would be useful. But for allocation stalls
> that are reported too late, this patch, which can capture all tasks
> (for reporting the overall situation) at a longer interval and act as
> a trigger (for accumulating short-interval samples), would be useful.
> 
> ...
>
> +Build kernels with CONFIG_DETECT_HUNG_TASK=y and
> +CONFIG_DETECT_MEMALLOC_STALL_TASK=y.
> +
> +The default scan interval is configured by
> +CONFIG_DEFAULT_MEMALLOC_TASK_TIMEOUT. The scan interval can be changed at
> +run time by writing a timeout in seconds to
> +/proc/sys/kernel/memalloc_task_warning_secs. Writing 0 disables the scan.

"seconds" seems needlessly coarse.  Maybe milliseconds?

> +Even if you disable this scan, information about the last memory allocation
> +request is kept. That is, you will get some hint for understanding the
> +last-minute behavior of the kernel when you analyze a vmcore (or a memory
> +snapshot of a virtualized machine).
> 
> ...
>
> +struct memalloc_info {
> +	/*
> +	 * 0: not doing __GFP_RECLAIM allocation.
> +	 * 1: doing non-recursive __GFP_RECLAIM allocation.
> +	 * 2: doing recursive __GFP_RECLAIM allocation.
> +	 */
> +	u8 valid;
> +	/*
> +	 * bit 0: Will be reported as OOM victim.
> +	 * bit 1: Will be reported as dying task.
> +	 * bit 2: Will be reported as stalling task.
> +	 * bit 3: Will be reported as exiting task.
> +	 * bit 7: Will be reported unconditionally.

Create enums for these rather than hard-coding magic numbers?

These values don't seem to be used anyway - as far as I can tell this
could be a simple boolean.

> +	 */
> +	u8 type;
> +	/* Index used for memalloc_in_flight[] counter. */
> +	u8 idx;
> +	/* For progress monitoring. */
> +	unsigned int sequence;
> +	/* Started time in jiffies as of valid == 1. */
> +	unsigned long start;
> +	/* Requested order and gfp flags as of valid == 1. */
> +	unsigned int order;
> +	gfp_t gfp;
> +};
> 
> ...
>
> +#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
> +/*
> + * Zero means infinite timeout - no checking done:
> + */
> +unsigned long __read_mostly sysctl_memalloc_task_warning_secs =
> +	CONFIG_DEFAULT_MEMALLOC_TASK_TIMEOUT;
> +static struct memalloc_info memalloc; /* Filled by is_stalling_task(). */

What locking protects `memalloc' from concurrent modifications and
holds it stable for readers?

> 
> ...
>
> +static noinline int check_memalloc_stalling_tasks(unsigned long timeout)
> +{
> +	char buf[256];
> +	struct task_struct *g, *p;
> +	unsigned long now;
> +	unsigned long expire;
> +	unsigned int sigkill_pending = 0;
> +	unsigned int exiting_tasks = 0;
> +	unsigned int memdie_pending = 0;
> +	unsigned int stalling_tasks = 0;
> +
> 
> ...
>
> +			goto restart_report;
> +	}
> +	rcu_read_unlock();
> +	preempt_enable_no_resched();
> +	cond_resched();

All the cond_resched()s in this function seem a bit random.

> +	/* Show memory information. (SysRq-m) */
> +	show_mem(0, NULL);
> +	/* Show workqueue state. */
> +	show_workqueue_state();
> +	/* Show lock information. (SysRq-d) */
> +	debug_show_all_locks();
> +	pr_warn("MemAlloc-Info: stalling=%u dying=%u exiting=%u victim=%u oom_count=%u\n",
> +		stalling_tasks, sigkill_pending, exiting_tasks, memdie_pending,
> +		out_of_memory_count);
> +	return stalling_tasks;
> +}
> +#endif /* CONFIG_DETECT_MEMALLOC_STALL_TASK */
> +
>  static void check_hung_task(struct task_struct *t, unsigned long timeout)
>  {
>  	unsigned long switch_count = t->nvcsw + t->nivcsw;
> @@ -228,20 +429,36 @@ void reset_hung_task_detector(void)
>  static int watchdog(void *dummy)
>  {
>  	unsigned long hung_last_checked = jiffies;
> +#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
> +	unsigned long stall_last_checked = hung_last_checked;
> +#endif
>  
>  	set_user_nice(current, 0);
>  
>  	for ( ; ; ) {
>  		unsigned long timeout = sysctl_hung_task_timeout_secs;
>  		long t = hung_timeout_jiffies(hung_last_checked, timeout);
> -
> +#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
> +		unsigned long timeout2 = sysctl_memalloc_task_warning_secs;
> +		long t2 = memalloc_timeout_jiffies(stall_last_checked,
> +						   timeout2);

Confused.  Shouldn't timeout2 be converted from seconds to jiffies
before being passed to memalloc_timeout_jiffies()?

> +		if (t2 <= 0) {
> +			if (memalloc_maybe_stalling())
> +				check_memalloc_stalling_tasks(timeout2);
> +			stall_last_checked = jiffies;
> +			continue;
> +		}
> +#else
> +		long t2 = t;
> +#endif
> 
> ...
>
> +bool memalloc_maybe_stalling(void)
> +{
> +	int cpu;
> +	int sum = 0;
> +	const u8 idx = memalloc_active_index ^ 1;
> +
> +	for_each_possible_cpu(cpu)

Do we really need to do this for offlined and not-present CPUs?

> +		sum += per_cpu(memalloc_in_flight[idx], cpu);
> +	if (sum)
> +		return true;
> +	memalloc_active_index ^= 1;
> +	return false;
> +}
> +
> 
> ...
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7] mm: Add memory allocation watchdog kernel thread.
  2017-03-09 10:46 ` Tetsuo Handa
@ 2017-03-09 22:37     ` Andrew Morton
  0 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2017-03-09 22:37 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, linux-kernel, mhocko, hannes, mgorman, david, apolyakov

On Thu, 9 Mar 2017 19:46:14 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:

> Tetsuo Handa wrote:
> > This patch adds a watchdog which periodically reports the number of
> > memory-allocating tasks, dying tasks and OOM victim tasks when some task
> > is spending too long inside __alloc_pages_slowpath(). This patch also
> > serves as a hook for obtaining additional information via SystemTap
> > (e.g. examining other variables using printk(), or capturing a crash
> > dump by calling panic()) by triggering a callback only when a stall is
> > detected. The ability to take administrator-controlled actions based on
> > a threshold is a major advantage gained by introducing state tracking.
> > 
> > Commit 63f53dea0c9866e9 ("mm: warn about allocations which stall for
> > too long") was a great step towards reducing the possibility of silent
> > hang-up problems caused by memory allocation stalls [1]. However, there
> > are reports of long stalls (e.g. [2] is over 30 minutes!) and lockups
> > (e.g. [3] is an "unable to invoke the OOM killer due to !__GFP_FS
> > allocation" lockup problem) where this patch is more useful than that
> > commit, because this patch can report possibly related tasks even if
> > the allocating tasks are unexpectedly blocked for that long. Regarding
> > premature OOM killer invocation, tracepoints which can accumulate
> > samples at short intervals would be useful. But for allocation stalls
> > that are reported too late, this patch, which can capture all tasks
> > (for reporting the overall situation) at a longer interval and act as
> > a trigger (for accumulating short-interval samples), would be useful.
> Andrew, do you have any questions on this patch?
> I really need this patch for finding bugs which MM people overlook.

(top-posting repaired - please don't do that)

Undecided.  I can see the need but it is indeed quite a large lump of
code.  Perhaps some additional examples of how this new code was used
to understand and improve real-world kernel problems would be persuasive.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7] mm: Add memory allocation watchdog kernel thread.
  2017-03-09 22:37     ` Andrew Morton
@ 2017-03-10 10:40       ` Michal Hocko
  -1 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2017-03-10 10:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tetsuo Handa, linux-mm, linux-kernel, hannes, mgorman, david, apolyakov

On Thu 09-03-17 14:37:51, Andrew Morton wrote:
> On Thu, 9 Mar 2017 19:46:14 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> 
> > Tetsuo Handa wrote:
> > > This patch adds a watchdog which periodically reports the number of
> > > memory-allocating tasks, dying tasks and OOM victim tasks when some task
> > > is spending too long inside __alloc_pages_slowpath(). This patch also
> > > serves as a hook for obtaining additional information via SystemTap
> > > (e.g. examining other variables using printk(), or capturing a crash
> > > dump by calling panic()) by triggering a callback only when a stall is
> > > detected. The ability to take administrator-controlled actions based on
> > > a threshold is a major advantage gained by introducing state tracking.
> > > 
> > > Commit 63f53dea0c9866e9 ("mm: warn about allocations which stall for
> > > too long") was a great step towards reducing the possibility of silent
> > > hang-up problems caused by memory allocation stalls [1]. However, there
> > > are reports of long stalls (e.g. [2] is over 30 minutes!) and lockups
> > > (e.g. [3] is an "unable to invoke the OOM killer due to !__GFP_FS
> > > allocation" lockup problem) where this patch is more useful than that
> > > commit, because this patch can report possibly related tasks even if
> > > the allocating tasks are unexpectedly blocked for that long. Regarding
> > > premature OOM killer invocation, tracepoints which can accumulate
> > > samples at short intervals would be useful. But for allocation stalls
> > > that are reported too late, this patch, which can capture all tasks
> > > (for reporting the overall situation) at a longer interval and act as
> > > a trigger (for accumulating short-interval samples), would be useful.
> )
> > Andrew, do you have any questions on this patch?
> > I really need this patch for finding bugs which MM people overlook.
> 
> (top-posting repaired - please don't do that)
> 
> Undecided.  I can see the need but it is indeed quite a large lump of
> code.  Perhaps some additional examples of how this new code was used
> to understand and improve real-world kernel problems would be persuasive.

Well, it is true that the watchdog can provide much more information
than we can gather with the other debugging options we currently have
when allocations run away.  So I am not questioning that this is useful
when doing memory stress testing and trying to trigger corner cases,
but I am not really sure how useful this will be on production systems.
Tracking is not free, both at runtime and long-term in maintenance.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7] mm: Add memory allocation watchdog kernel thread.
  2017-03-09 22:35   ` Andrew Morton
@ 2017-03-10 11:12     ` Tetsuo Handa
  -1 siblings, 0 replies; 22+ messages in thread
From: Tetsuo Handa @ 2017-03-10 11:12 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-kernel, mhocko, hannes, mgorman, david, apolyakov

Andrew Morton wrote:
> On Tue, 28 Feb 2017 10:21:48 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> 
> > This patch adds a watchdog which periodically reports number of memory
> > allocating tasks, dying tasks and OOM victim tasks when some task is
> > spending too long time inside __alloc_pages_slowpath(). This patch also
> > serves as a hook for obtaining additional information using SystemTap
> > (e.g. examine other variables using printk(), capture a crash dump by
> > calling panic()) by triggering a callback only when a stall is detected.
> > Ability to take administrator-controlled actions based on some threshold
> > is a big advantage gained by introducing a state tracking.
> > 
> > Commit 63f53dea0c9866e9 ("mm: warn about allocations which stall for
> > too long") was a great step for reducing possibility of silent hang up
> > problem caused by memory allocation stalls [1]. However, there are
> > reports of long stalls (e.g. [2] is over 30 minutes!) and lockups (e.g.
> > [3] is an "unable to invoke the OOM killer due to !__GFP_FS allocation"
> > lockup problem) where this patch is more useful than that commit, for
> > this patch can report possibly related tasks even if allocating tasks
> > are unexpectedly blocked for so long. Regarding premature OOM killer
> > invocation, tracepoints which can accumulate samples in short interval
> > would be useful. But regarding too late to report allocation stalls,
> > this patch which can capture all tasks (for reporting overall situation)
> > in longer interval and act as a trigger (for accumulating short interval
> > samples) would be useful.
> > 
> > ...
> >
> > +Build kernels with CONFIG_DETECT_HUNG_TASK=y and
> > +CONFIG_DETECT_MEMALLOC_STALL_TASK=y.
> > +
> > +Default scan interval is configured by CONFIG_DEFAULT_MEMALLOC_TASK_TIMEOUT.
> > +Scan interval can be changed at run time by writing timeout in seconds to
> > +/proc/sys/kernel/memalloc_task_warning_secs. Writing 0 disables this scan.
> 
> "seconds" seems needlessly coarse.  Maybe milliseconds?

I can change it to milliseconds, although I doubt anyone wants to set
e.g. 3500 milliseconds. This timeout means "more than ... seconds", and
the actual alert takes longer than this timeout, as with check_hung_task().

> 
> > +Even if you disable this scan, information about last memory allocation
> > +request is kept. That is, you will get some hint for understanding
> > +last-minute behavior of the kernel when you analyze vmcore (or memory
> > +snapshot of a virtualized machine).
> > 
> > ...
> >
> > +struct memalloc_info {
> > +	/*
> > +	 * 0: not doing __GFP_RECLAIM allocation.
> > +	 * 1: doing non-recursive __GFP_RECLAIM allocation.
> > +	 * 2: doing recursive __GFP_RECLAIM allocation.
> > +	 */
> > +	u8 valid;
> > +	/*
> > +	 * bit 0: Will be reported as OOM victim.
> > +	 * bit 1: Will be reported as dying task.
> > +	 * bit 2: Will be reported as stalling task.
> > +	 * bit 3: Will be reported as exiting task.
> > +	 * bit 7: Will be reported unconditionally.
> 
> Create enums for these rather than hard-coding magic numbers?

Sure.
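
Something along these lines, for example (the enum and its names below
are only illustrative, not taken from the patch):

	enum memalloc_report {
		MEMALLOC_REPORT_VICTIM   = 1 << 0, /* OOM victim */
		MEMALLOC_REPORT_DYING    = 1 << 1, /* dying task */
		MEMALLOC_REPORT_STALLING = 1 << 2, /* stalling task */
		MEMALLOC_REPORT_EXITING  = 1 << 3, /* exiting task */
		MEMALLOC_REPORT_ALWAYS   = 1 << 7, /* kswapd etc. */
	};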

> 
> These values don't seem to be used anyway - as far as I can tell this
> could be a simple boolean.

These values are set and used by the following code:

+		u8 type = 0;
+
+		if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
+			type |= 1;
+			memdie_pending++;
+		}
+		if (fatal_signal_pending(p)) {
+			type |= 2;
+			sigkill_pending++;
+		}
+		if ((p->flags & PF_EXITING) && p->state != TASK_DEAD) {
+			type |= 8;
+			exiting_tasks++;
+		}
+		if (is_stalling_task(p, expire)) {
+			type |= 4;
+			stalling_tasks++;
+		}
+		if (p->flags & PF_KSWAPD)
+			type |= 128;
+		p->memalloc.type = type;

Since writing the logs queued by printk() out to the consoles is a slow
operation, we will need to add some delays in order to avoid dropping
logs. But we can't sleep while traversing the tasklist, whereas tasks
used as a cursor can go away while we are sleeping with
rcu_lock_break(). Thus, kmallocwd marks whether and how each thread
should be reported in a first round without sleeping, and then reports
the marked threads in a second round. The second round can sleep, and it
completes without duplicates or skips because p->memalloc.type remembers
whether p should be reported.
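
A minimal sketch of the two-round flow (heavily simplified; classify(),
report_one_task() and need_to_sleep() are placeholder names, and the
real patch restarts the scan via "goto restart_report;" when the cursor
task goes away):

	/* Round 1: classify every thread without sleeping. */
	rcu_read_lock();
	for_each_process_thread(g, p)
		p->memalloc.type = classify(p, expire); /* sets the bits above */
	rcu_read_unlock();

	/* Round 2: report marked threads; sleeping is allowed here. */
	rcu_read_lock();
	for_each_process_thread(g, p) {
		if (!p->memalloc.type)
			continue;
		report_one_task(p);	/* printk() etc. */
		p->memalloc.type = 0;	/* never reported twice */
		if (need_to_sleep()) {	/* give the consoles a chance */
			get_task_struct(p);
			rcu_read_unlock();
			schedule_timeout_interruptible(HZ / 10);
			rcu_read_lock();
			put_task_struct(p);
		}
	}
	rcu_read_unlock();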

This approach could also be used by check_hung_uninterruptible_tasks()
to avoid RCU stall warnings, to avoid hung tasks being skipped due to
"goto unlock;", and to eliminate the need to call touch_nmi_watchdog():
mark whether each thread should be reported as hung in the first round,
then report the marked threads in the second round. Such changes are
outside of this patch's scope.

> 
> > +	 */
> > +	u8 type;
> > +	/* Index used for memalloc_in_flight[] counter. */
> > +	u8 idx;
> > +	/* For progress monitoring. */
> > +	unsigned int sequence;
> > +	/* Started time in jiffies as of valid == 1. */
> > +	unsigned long start;
> > +	/* Requested order and gfp flags as of valid == 1. */
> > +	unsigned int order;
> > +	gfp_t gfp;
> > +};
> > 
> > ...
> >
> > +#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
> > +/*
> > + * Zero means infinite timeout - no checking done:
> > + */
> > +unsigned long __read_mostly sysctl_memalloc_task_warning_secs =
> > +	CONFIG_DEFAULT_MEMALLOC_TASK_TIMEOUT;
> > +static struct memalloc_info memalloc; /* Filled by is_stalling_task(). */
> 
> What locking protects `memalloc' from concurrent modifications and
> holds it stable for readers?

No locking is needed because only the khungtaskd kernel thread accesses
this variable.

> 
> > 
> > ...
> >
> > +static noinline int check_memalloc_stalling_tasks(unsigned long timeout)
> > +{
> > +	char buf[256];
> > +	struct task_struct *g, *p;
> > +	unsigned long now;
> > +	unsigned long expire;
> > +	unsigned int sigkill_pending = 0;
> > +	unsigned int exiting_tasks = 0;
> > +	unsigned int memdie_pending = 0;
> > +	unsigned int stalling_tasks = 0;
> > +
> > 
> > ...
> >
> > +			goto restart_report;
> > +	}
> > +	rcu_read_unlock();
> > +	preempt_enable_no_resched();
> > +	cond_resched();
> 
> All the cond_resched()s in this function seem a bit random.

This function tries to be as safe as possible: it yields CPU time
whenever possible, because printing stall warnings is not urgent enough
to justify blocking other high priority threads. Since the khungtaskd
kernel thread does not run at SCHED_IDLE priority, we can expect that
stall warnings will be printed eventually (unless realtime threads run
away with the CPU).

> 
> > +	/* Show memory information. (SysRq-m) */
> > +	show_mem(0, NULL);
> > +	/* Show workqueue state. */
> > +	show_workqueue_state();
> > +	/* Show lock information. (SysRq-d) */
> > +	debug_show_all_locks();
> > +	pr_warn("MemAlloc-Info: stalling=%u dying=%u exiting=%u victim=%u oom_count=%u\n",
> > +		stalling_tasks, sigkill_pending, exiting_tasks, memdie_pending,
> > +		out_of_memory_count);
> > +	return stalling_tasks;
> > +}
> > +#endif /* CONFIG_DETECT_MEMALLOC_STALL_TASK */
> > +
> >  static void check_hung_task(struct task_struct *t, unsigned long timeout)
> >  {
> >  	unsigned long switch_count = t->nvcsw + t->nivcsw;
> > @@ -228,20 +429,36 @@ void reset_hung_task_detector(void)
> >  static int watchdog(void *dummy)
> >  {
> >  	unsigned long hung_last_checked = jiffies;
> > +#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
> > +	unsigned long stall_last_checked = hung_last_checked;
> > +#endif
> >  
> >  	set_user_nice(current, 0);
> >  
> >  	for ( ; ; ) {
> >  		unsigned long timeout = sysctl_hung_task_timeout_secs;
> >  		long t = hung_timeout_jiffies(hung_last_checked, timeout);
> > -
> > +#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
> > +		unsigned long timeout2 = sysctl_memalloc_task_warning_secs;
> > +		long t2 = memalloc_timeout_jiffies(stall_last_checked,
> > +						   timeout2);
> 
> Confused.  Shouldn't timeout2 be converted from seconds to jiffies
> before being passed to memalloc_timeout_jiffies()?

timeout2 is converted from seconds to jiffies by
memalloc_timeout_jiffies(), just as timeout is converted from seconds to
jiffies by hung_timeout_jiffies().

+static long memalloc_timeout_jiffies(unsigned long last_checked, long timeout)
+{
+	/* timeout of 0 will disable the watchdog */
+	return timeout ? last_checked - jiffies + timeout * HZ :
+		MAX_SCHEDULE_TIMEOUT;
+}

static long hung_timeout_jiffies(unsigned long last_checked,
                                 unsigned long timeout)
{
	/* timeout of 0 will disable the watchdog */
	return timeout ? last_checked - jiffies + timeout * HZ :
		MAX_SCHEDULE_TIMEOUT;
}

We can rename and share hung_timeout_jiffies() for both timeouts
if kmallocwd is acceptable.
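
For example (the shared name below is only a suggestion):

	static long watchdog_timeout_jiffies(unsigned long last_checked,
					     unsigned long timeout)
	{
		/* timeout of 0 will disable the watchdog */
		return timeout ? last_checked - jiffies + timeout * HZ :
			MAX_SCHEDULE_TIMEOUT;
	}

with both existing callers switched over to it.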

> 
> > +		if (t2 <= 0) {
> > +			if (memalloc_maybe_stalling())
> > +				check_memalloc_stalling_tasks(timeout2);
> > +			stall_last_checked = jiffies;
> > +			continue;
> > +		}
> > +#else
> > +		long t2 = t;
> > +#endif
> > 
> > ...
> >
> > +bool memalloc_maybe_stalling(void)
> > +{
> > +	int cpu;
> > +	int sum = 0;
> > +	const u8 idx = memalloc_active_index ^ 1;
> > +
> > +	for_each_possible_cpu(cpu)
> 
> Do we really need to do this for offlined and not-present CPUs?

I just did not want to touch other files within this patch, because I'm
not familiar with CPU online/offline handling and want to split
non-essential parts into followup patches.

I assume that a CPU can go offline while some task which entered
__alloc_pages_slowpath() on that CPU is still allocating. We can update
kmallocwd not to check offline and not-present CPUs by adding hooks
which reset the memalloc_in_flight counter of a CPU going offline, as
sketched below.
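
A possible shape for such a hook (illustrative only, not part of the
patch; it folds the dead CPU's counters into the current CPU instead of
just zeroing them, so that the global sum stays correct for tasks still
in the slowpath):

	static int memalloc_cpu_dead(unsigned int cpu)
	{
		int i;

		/* Move the dead CPU's in-flight counts to this CPU. */
		for (i = 0; i < 2; i++) {
			this_cpu_add(memalloc_in_flight[i],
				     per_cpu(memalloc_in_flight[i], cpu));
			per_cpu(memalloc_in_flight[i], cpu) = 0;
		}
		return 0;
	}

registered via something like cpuhp_setup_state_nocalls(..._DEAD, ...).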

> 
> > +		sum += per_cpu(memalloc_in_flight[idx], cpu);
> > +	if (sum)
> > +		return true;
> > +	memalloc_active_index ^= 1;
> > +	return false;
> > +}
> > +
> > 
> > ...
> >
> 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7] mm: Add memory allocation watchdog kernel thread.
  2017-03-10 10:40       ` Michal Hocko
@ 2017-03-10 11:19         ` Tetsuo Handa
  -1 siblings, 0 replies; 22+ messages in thread
From: Tetsuo Handa @ 2017-03-10 11:19 UTC (permalink / raw)
  To: mhocko, akpm; +Cc: linux-mm, linux-kernel, hannes, mgorman, david, apolyakov

Andrew Morton wrote:
> On Thu, 9 Mar 2017 19:46:14 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> > Tetsuo Handa wrote:
> > > This patch adds a watchdog which periodically reports number of memory
> > > allocating tasks, dying tasks and OOM victim tasks when some task is
> > > spending too long time inside __alloc_pages_slowpath(). This patch also
> > > serves as a hook for obtaining additional information using SystemTap
> > > (e.g. examine other variables using printk(), capture a crash dump by
> > > calling panic()) by triggering a callback only when a stall is detected.
> > > Ability to take administrator-controlled actions based on some threshold
> > > is a big advantage gained by introducing a state tracking.
> > > 
> > > Commit 63f53dea0c9866e9 ("mm: warn about allocations which stall for
> > > too long") was a great step for reducing possibility of silent hang up
> > > problem caused by memory allocation stalls [1]. However, there are
> > > reports of long stalls (e.g. [2] is over 30 minutes!) and lockups (e.g.
> > > [3] is an "unable to invoke the OOM killer due to !__GFP_FS allocation"
> > > lockup problem) where this patch is more useful than that commit, for
> > > this patch can report possibly related tasks even if allocating tasks
> > > are unexpectedly blocked for so long. Regarding premature OOM killer
> > > invocation, tracepoints which can accumulate samples in short interval
> > > would be useful. But regarding too late to report allocation stalls,
> > > this patch which can capture all tasks (for reporting overall situation)
> > > in longer interval and act as a trigger (for accumulating short interval
> > > samples) would be useful.
> > Andrew, do you have any questions on this patch?
> > I really need this patch for finding bugs which MM people overlook.
> 
> Undecided.  I can see the need but it is indeed quite a large lump of
> code.  Perhaps some additional examples of how this new code was used
> to understand and improve real-world kernel problems would be persuasive.

This patch is aimed at helping to determine whether unexpected hang
cases are related to memory allocation. By merging this patch (and
enabling this watchdog in enterprise systems via kernels supported by
distributors), we can identify patterns/cases of problems (if related to
memory allocation) and improve the quality of Linux kernels by fixing
problems related to memory allocation.

I worked at a support center for 3 years and saw many cases where Linux
systems encountered unexpected silent hangup/reboot problems. In most
cases, the only clue available was sar output (from the sysstat package)
showing that free memory was low just before the unexpected hang/reboot.

By the nature of such problems, it is very hard for administrators to
collect information for analysis, let alone identify the cause of the
problems. As a result, such problems are left unrecognized/unsolved at
the support center, and are seldom reported to distributors/developers
with a request for fixes.

Michal Hocko wrote:
> Well, it is true that the watchdog can provide much more information
> than we can gather with the other debugging options we currently have
> when allocations run away. So I am not questioning that this is useful
> when doing memory stress testing and trying to trigger corner cases,
> but I am not really sure how useful it will be in production systems.
> Tracking is not free, both at runtime and in long-term maintenance.

I don't think that my stress testing is hitting bugs which do not occur
in production systems, but I can't prove that without a watchdog
mechanism. Today I heard about a case where an enterprise system hung
last Sunday due to an XFS deadlock caused by memory allocation. That was
rather a lucky case, in that the administrator of that system was able
to capture a vmcore. In most cases, administrators can't capture even
SysRq-t, let alone a vmcore. Therefore, an automatic watchdog is highly
appreciated.



Or, did you mean "some additional examples of how this new code was used
to understand and improve real-world kernel problems" as asking how this
patch helped in [2] and [3]?

Regarding [3] (now continued as
http://lkml.kernel.org/r/201703031948.CHJ81278.VOHSFFFOOLJQMt@I-love.SAKURA.ne.jp ),
I used this patch for confirming the following things.

  (1) kswapd cannot make progress waiting for lock.

----------
[  518.900038] MemAlloc: kswapd0(69) flags=0xa40840 switches=23883 uninterruptible
[  518.902095] kswapd0         D10776    69      2 0x00000000
[  518.903784] Call Trace:
[  518.904849]  __schedule+0x336/0xe00
[  518.906118]  schedule+0x3d/0x90
[  518.907314]  io_schedule+0x16/0x40
[  518.908622]  __xfs_iflock+0x129/0x140 [xfs]
[  518.910027]  ? autoremove_wake_function+0x60/0x60
[  518.911559]  xfs_reclaim_inode+0x162/0x440 [xfs]
[  518.913068]  xfs_reclaim_inodes_ag+0x2cf/0x4f0 [xfs]
[  518.914611]  ? xfs_reclaim_inodes_ag+0xf2/0x4f0 [xfs]
[  518.916148]  ? trace_hardirqs_on+0xd/0x10
[  518.917465]  ? try_to_wake_up+0x59/0x7a0
[  518.918758]  ? wake_up_process+0x15/0x20
[  518.920067]  xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
[  518.921560]  xfs_fs_free_cached_objects+0x19/0x20 [xfs]
[  518.923114]  super_cache_scan+0x181/0x190
[  518.924435]  shrink_slab+0x29f/0x6d0
[  518.925683]  shrink_node+0x2fa/0x310
[  518.926909]  kswapd+0x362/0x9b0
[  518.928061]  kthread+0x10f/0x150
[  518.929218]  ? mem_cgroup_shrink_node+0x3b0/0x3b0
[  518.930953]  ? kthread_create_on_node+0x70/0x70
[  518.932380]  ret_from_fork+0x31/0x40

[ 1095.632984] MemAlloc: kswapd0(69) flags=0xa40840 switches=23883 uninterruptible
[ 1095.632985] kswapd0         D10776    69      2 0x00000000
[ 1095.632988] Call Trace:
[ 1095.632991]  __schedule+0x336/0xe00
[ 1095.632994]  schedule+0x3d/0x90
[ 1095.632996]  io_schedule+0x16/0x40
[ 1095.633017]  __xfs_iflock+0x129/0x140 [xfs]
[ 1095.633021]  ? autoremove_wake_function+0x60/0x60
[ 1095.633051]  xfs_reclaim_inode+0x162/0x440 [xfs]
[ 1095.633072]  xfs_reclaim_inodes_ag+0x2cf/0x4f0 [xfs]
[ 1095.633106]  ? xfs_reclaim_inodes_ag+0xf2/0x4f0 [xfs]
[ 1095.633114]  ? trace_hardirqs_on+0xd/0x10
[ 1095.633116]  ? try_to_wake_up+0x59/0x7a0
[ 1095.633120]  ? wake_up_process+0x15/0x20
[ 1095.633156]  xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
[ 1095.633178]  xfs_fs_free_cached_objects+0x19/0x20 [xfs]
[ 1095.633180]  super_cache_scan+0x181/0x190
[ 1095.633183]  shrink_slab+0x29f/0x6d0
[ 1095.633189]  shrink_node+0x2fa/0x310
[ 1095.633193]  kswapd+0x362/0x9b0
[ 1095.633200]  kthread+0x10f/0x150
[ 1095.633201]  ? mem_cgroup_shrink_node+0x3b0/0x3b0
[ 1095.633202]  ? kthread_create_on_node+0x70/0x70
[ 1095.633205]  ret_from_fork+0x31/0x40
----------

  (2) All WQ_MEM_RECLAIM threads shown by show_workqueue_state()
      cannot make progress waiting for memory allocation.

----------
[ 1095.633625] MemAlloc: xfs-data/sda1(451) flags=0x4228060 switches=45509 seq=1 gfp=0x1604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK) order=0 delay=652073
[ 1095.633626] xfs-data/sda1   R  running task    12696   451      2 0x00000000
[ 1095.633663] Workqueue: xfs-data/sda1 xfs_end_io [xfs]
[ 1095.633665] Call Trace:
[ 1095.633668]  __schedule+0x336/0xe00
[ 1095.633671]  schedule+0x3d/0x90
[ 1095.633672]  schedule_timeout+0x20d/0x510
[ 1095.633675]  ? lock_timer_base+0xa0/0xa0
[ 1095.633678]  schedule_timeout_uninterruptible+0x2a/0x30
[ 1095.633680]  __alloc_pages_slowpath+0x2b5/0xd95
[ 1095.633687]  __alloc_pages_nodemask+0x3e4/0x460
[ 1095.633699]  alloc_pages_current+0x97/0x1b0
[ 1095.633702]  new_slab+0x4cb/0x6b0
[ 1095.633706]  ___slab_alloc+0x3a3/0x620
[ 1095.633728]  ? kmem_alloc+0x96/0x120 [xfs]
[ 1095.633730]  ? ___slab_alloc+0x5c6/0x620
[ 1095.633732]  ? cpuacct_charge+0x38/0x1e0
[ 1095.633767]  ? kmem_alloc+0x96/0x120 [xfs]
[ 1095.633770]  __slab_alloc+0x46/0x7d
[ 1095.633773]  __kmalloc+0x301/0x3b0
[ 1095.633802]  kmem_alloc+0x96/0x120 [xfs]
[ 1095.633804]  ? kfree+0x1fa/0x330
[ 1095.633842]  xfs_log_commit_cil+0x489/0x710 [xfs]
[ 1095.633864]  __xfs_trans_commit+0x83/0x260 [xfs]
[ 1095.633883]  xfs_trans_commit+0x10/0x20 [xfs]
[ 1095.633901]  __xfs_setfilesize+0xdb/0x240 [xfs]
[ 1095.633936]  xfs_setfilesize_ioend+0x89/0xb0 [xfs]
[ 1095.633954]  ? xfs_setfilesize_ioend+0x5/0xb0 [xfs]
[ 1095.633971]  xfs_end_io+0x81/0x110 [xfs]
[ 1095.633973]  process_one_work+0x22b/0x760
[ 1095.633975]  ? process_one_work+0x194/0x760
[ 1095.633997]  rescuer_thread+0x1f2/0x3d0
[ 1095.634002]  kthread+0x10f/0x150
[ 1095.634003]  ? worker_thread+0x4b0/0x4b0
[ 1095.634004]  ? kthread_create_on_node+0x70/0x70
[ 1095.634007]  ret_from_fork+0x31/0x40
[ 1095.634013] MemAlloc: xfs-eofblocks/s(456) flags=0x4228860 switches=15435 seq=1 gfp=0x1400240(GFP_NOFS|__GFP_NOWARN) order=0 delay=293074
[ 1095.634014] xfs-eofblocks/s R  running task    12032   456      2 0x00000000
[ 1095.634037] Workqueue: xfs-eofblocks/sda1 xfs_eofblocks_worker [xfs]
[ 1095.634038] Call Trace:
[ 1095.634040]  ? _raw_spin_lock+0x3d/0x80
[ 1095.634042]  ? vmpressure+0xd0/0x120
[ 1095.634044]  ? vmpressure+0xd0/0x120
[ 1095.634047]  ? vmpressure_prio+0x21/0x30
[ 1095.634049]  ? do_try_to_free_pages+0x70/0x300
[ 1095.634052]  ? try_to_free_pages+0x131/0x3f0
[ 1095.634058]  ? __alloc_pages_slowpath+0x3ec/0xd95
[ 1095.634065]  ? __alloc_pages_nodemask+0x3e4/0x460
[ 1095.634069]  ? alloc_pages_current+0x97/0x1b0
[ 1095.634111]  ? xfs_buf_allocate_memory+0x160/0x2a3 [xfs]
[ 1095.634133]  ? xfs_buf_get_map+0x2be/0x480 [xfs]
[ 1095.634169]  ? xfs_buf_read_map+0x2c/0x400 [xfs]
[ 1095.634204]  ? xfs_trans_read_buf_map+0x186/0x830 [xfs]
[ 1095.634222]  ? xfs_btree_read_buf_block.constprop.34+0x78/0xc0 [xfs]
[ 1095.634239]  ? xfs_btree_lookup_get_block+0x8a/0x180 [xfs]
[ 1095.634257]  ? xfs_btree_lookup+0xd0/0x3f0 [xfs]
[ 1095.634296]  ? kmem_zone_alloc+0x96/0x120 [xfs]
[ 1095.634299]  ? _raw_spin_unlock+0x27/0x40
[ 1095.634315]  ? xfs_bmbt_lookup_eq+0x1f/0x30 [xfs]
[ 1095.634348]  ? xfs_bmap_del_extent+0x1b2/0x1610 [xfs]
[ 1095.634380]  ? kmem_zone_alloc+0x96/0x120 [xfs]
[ 1095.634400]  ? __xfs_bunmapi+0x4db/0xda0 [xfs]
[ 1095.634421]  ? xfs_bunmapi+0x2b/0x40 [xfs]
[ 1095.634459]  ? xfs_itruncate_extents+0x1df/0x780 [xfs]
[ 1095.634502]  ? xfs_rename+0xc70/0x1080 [xfs]
[ 1095.634525]  ? xfs_free_eofblocks+0x1c4/0x230 [xfs]
[ 1095.634546]  ? xfs_inode_free_eofblocks+0x18d/0x280 [xfs]
[ 1095.634565]  ? xfs_inode_ag_walk.isra.13+0x2b5/0x620 [xfs]
[ 1095.634582]  ? xfs_inode_ag_walk.isra.13+0x91/0x620 [xfs]
[ 1095.634618]  ? xfs_inode_clear_eofblocks_tag+0x1a0/0x1a0 [xfs]
[ 1095.634630]  ? radix_tree_next_chunk+0x10b/0x2d0
[ 1095.634635]  ? radix_tree_gang_lookup_tag+0xd7/0x150
[ 1095.634672]  ? xfs_perag_get_tag+0x11d/0x370 [xfs]
[ 1095.634690]  ? xfs_perag_get_tag+0x5/0x370 [xfs]
[ 1095.634709]  ? xfs_inode_ag_iterator_tag+0x71/0xa0 [xfs]
[ 1095.634726]  ? xfs_inode_clear_eofblocks_tag+0x1a0/0x1a0 [xfs]
[ 1095.634744]  ? __xfs_icache_free_eofblocks+0x3b/0x40 [xfs]
[ 1095.634759]  ? xfs_eofblocks_worker+0x27/0x40 [xfs]
[ 1095.634762]  ? process_one_work+0x22b/0x760
[ 1095.634763]  ? process_one_work+0x194/0x760
[ 1095.634784]  ? rescuer_thread+0x1f2/0x3d0
[ 1095.634788]  ? kthread+0x10f/0x150
[ 1095.634789]  ? worker_thread+0x4b0/0x4b0
[ 1095.634790]  ? kthread_create_on_node+0x70/0x70
[ 1095.634793]  ? ret_from_fork+0x31/0x40
----------

  (3) order-0 GFP_NOIO allocation request cannot make progress
      waiting for memory allocation.

----------
[ 1095.631687] MemAlloc: kworker/2:0(27) flags=0x4208860 switches=38727 seq=21 gfp=0x1400000(GFP_NOIO) order=0 delay=652160
[ 1095.631688] kworker/2:0     R  running task    12680    27      2 0x00000000
[ 1095.631739] Workqueue: events_freezable_power_ disk_events_workfn
[ 1095.631740] Call Trace:
[ 1095.631743]  __schedule+0x336/0xe00
[ 1095.631746]  preempt_schedule_common+0x1f/0x31
[ 1095.631747]  _cond_resched+0x1c/0x30
[ 1095.631749]  shrink_slab+0x339/0x6d0
[ 1095.631754]  shrink_node+0x2fa/0x310
[ 1095.631758]  do_try_to_free_pages+0xe1/0x300
[ 1095.631761]  try_to_free_pages+0x131/0x3f0
[ 1095.631765]  __alloc_pages_slowpath+0x3ec/0xd95
[ 1095.631771]  __alloc_pages_nodemask+0x3e4/0x460
[ 1095.631775]  alloc_pages_current+0x97/0x1b0
[ 1095.631779]  bio_copy_kern+0xc9/0x180
[ 1095.631830]  blk_rq_map_kern+0x70/0x140
[ 1095.631835]  __scsi_execute.isra.22+0x13a/0x1e0
[ 1095.631838]  scsi_execute_req_flags+0x94/0x100
[ 1095.631844]  sr_check_events+0xbf/0x2b0 [sr_mod]
[ 1095.631852]  cdrom_check_events+0x18/0x30 [cdrom]
[ 1095.631854]  sr_block_check_events+0x2a/0x30 [sr_mod]
[ 1095.631856]  disk_check_events+0x60/0x170
[ 1095.631859]  disk_events_workfn+0x1c/0x20
[ 1095.631862]  process_one_work+0x22b/0x760
[ 1095.631863]  ? process_one_work+0x194/0x760
[ 1095.631867]  worker_thread+0x137/0x4b0
[ 1095.631887]  kthread+0x10f/0x150
[ 1095.631889]  ? process_one_work+0x760/0x760
[ 1095.631890]  ? kthread_create_on_node+0x70/0x70
[ 1095.631893]  ret_from_fork+0x31/0x40
----------

  (4) Number of stalling threads does not decrease over time while
      number of out_of_memory() calls increases over time.

----------
[  518.090012] MemAlloc-Info: stalling=184 dying=1 exiting=0 victim=1 oom_count=8441307
[  553.070829] MemAlloc-Info: stalling=184 dying=1 exiting=0 victim=1 oom_count=10318507
[  616.394649] MemAlloc-Info: stalling=186 dying=1 exiting=0 victim=1 oom_count=13908219
[  642.266252] MemAlloc-Info: stalling=186 dying=1 exiting=0 victim=1 oom_count=15180673
[  702.412189] MemAlloc-Info: stalling=187 dying=1 exiting=0 victim=1 oom_count=18732529
[  736.787879] MemAlloc-Info: stalling=187 dying=1 exiting=0 victim=1 oom_count=20565244
[  800.715759] MemAlloc-Info: stalling=188 dying=1 exiting=0 victim=1 oom_count=24411576
[  837.571405] MemAlloc-Info: stalling=188 dying=1 exiting=0 victim=1 oom_count=26463562
[  899.021495] MemAlloc-Info: stalling=189 dying=1 exiting=0 victim=1 oom_count=30144879
[  936.282709] MemAlloc-Info: stalling=189 dying=1 exiting=0 victim=1 oom_count=32129234
[  997.328119] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=35657983
[ 1033.977265] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=37659912
[ 1095.630961] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=40639677
[ 1095.821248] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=40646791
----------

Further interpretation is up to the XFS people (Brian and Dave). So far,
boosting the priority (lowering the watermark) of WQ_MEM_RECLAIM threads
based on current_is_workqueue_rescuer() seems to be the fix under
consideration, because the writeback path already has special-case
handling for the rescuer. But I'm not sure whether it helps getting
writeback unstuck when the writeback path is blocked on a lock rather
than on memory allocation; see the sketch after the trace below.
(Threads doing writeback and/or mempool allocations might be candidates
for always being reported, as with kswapd. That would be in followup
patches.)

----------
[ 1044.701393] kworker/u256:0  D10024  8444      2 0x00000080
[ 1044.703017] Workqueue: writeback wb_workfn (flush-8:0)
[ 1044.704550] Call Trace:
[ 1044.705550]  __schedule+0x336/0xe00
[ 1044.706761]  schedule+0x3d/0x90
[ 1044.707891]  io_schedule+0x16/0x40
[ 1044.709073]  __lock_page+0x126/0x180
[ 1044.710272]  ? page_cache_tree_insert+0x120/0x120
[ 1044.711728]  write_cache_pages+0x39a/0x6b0
[ 1044.713077]  ? xfs_map_blocks+0x5a0/0x5a0 [xfs]
[ 1044.714499]  ? xfs_vm_writepages+0x48/0xa0 [xfs]
[ 1044.715944]  xfs_vm_writepages+0x6b/0xa0 [xfs]
[ 1044.717362]  do_writepages+0x21/0x40
[ 1044.718577]  __writeback_single_inode+0x72/0x9d0
[ 1044.719998]  ? _raw_spin_unlock+0x27/0x40
[ 1044.721308]  writeback_sb_inodes+0x322/0x750
[ 1044.722662]  __writeback_inodes_wb+0x8c/0xc0
[ 1044.724009]  wb_writeback+0x3be/0x6e0
[ 1044.725238]  wb_workfn+0x146/0x7d0
[ 1044.726413]  ? process_one_work+0x194/0x760
[ 1044.727736]  process_one_work+0x22b/0x760
[ 1044.729017]  ? process_one_work+0x194/0x760
[ 1044.730343]  worker_thread+0x137/0x4b0
[ 1044.731577]  kthread+0x10f/0x150
[ 1044.732702]  ? process_one_work+0x760/0x760
[ 1044.734021]  ? kthread_create_on_node+0x70/0x70
[ 1044.735415]  ? do_syscall_64+0x195/0x200
[ 1044.736684]  ret_from_fork+0x31/0x40
----------
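
The kind of watermark boost being discussed would look something like
this (illustrative only, not an actual proposed patch; ALLOC_HIGH is
reused here just to show the shape):

	/* e.g. somewhere in gfp_to_alloc_flags(): */
	if (current_is_workqueue_rescuer())
		alloc_flags |= ALLOC_HIGH;	/* allow deeper access to reserves */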

Regarding "infinite too_many_isolated() loop" case
( http://lkml.kernel.org/r/201702212335.DJB30777.JOFMHSFtVLQOOF@I-love.SAKURA.ne.jp ),
I used this patch for confirming the following things.
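
(For reference, the loop in question in shrink_inactive_list() looks
roughly like this around this kernel version; quoted from memory, so
treat it as a sketch:

	while (unlikely(too_many_isolated(pgdat, file, sc))) {
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/* We are about to die and free our memory. Return now. */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}

Allocators doing direct reclaim spin here for as long as the isolated
page counts stay high.)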

  (1) kswapd cannot make progress waiting for lock.

----------
[ 1209.790966] MemAlloc: kswapd0(67) flags=0xa60840 switches=51139 uninterruptible
[ 1209.799726] kswapd0         D10936    67      2 0x00000000
[ 1209.807326] Call Trace:
[ 1209.812581]  __schedule+0x336/0xe00
[ 1209.818599]  schedule+0x3d/0x90
[ 1209.823907]  schedule_timeout+0x26a/0x510
[ 1209.827218]  ? trace_hardirqs_on+0xd/0x10
[ 1209.830535]  __down_common+0xfb/0x131
[ 1209.833801]  ? _xfs_buf_find+0x2cb/0xc10 [xfs]
[ 1209.837372]  __down+0x1d/0x1f
[ 1209.840331]  down+0x41/0x50
[ 1209.843243]  xfs_buf_lock+0x64/0x370 [xfs]
[ 1209.846597]  _xfs_buf_find+0x2cb/0xc10 [xfs]
[ 1209.850031]  ? _xfs_buf_find+0xa4/0xc10 [xfs]
[ 1209.853514]  xfs_buf_get_map+0x2a/0x480 [xfs]
[ 1209.855831]  xfs_buf_read_map+0x2c/0x400 [xfs]
[ 1209.857388]  ? free_debug_processing+0x27d/0x2af
[ 1209.859037]  xfs_trans_read_buf_map+0x186/0x830 [xfs]
[ 1209.860707]  xfs_read_agf+0xc8/0x2b0 [xfs]
[ 1209.862184]  xfs_alloc_read_agf+0x7a/0x300 [xfs]
[ 1209.863728]  ? xfs_alloc_space_available+0x7b/0x120 [xfs]
[ 1209.865385]  xfs_alloc_fix_freelist+0x3bc/0x490 [xfs]
[ 1209.866974]  ? __radix_tree_lookup+0x84/0xf0
[ 1209.868374]  ? xfs_perag_get+0x1a0/0x310 [xfs]
[ 1209.869798]  ? xfs_perag_get+0x5/0x310 [xfs]
[ 1209.871288]  xfs_alloc_vextent+0x161/0xda0 [xfs]
[ 1209.872757]  xfs_bmap_btalloc+0x46c/0x8b0 [xfs]
[ 1209.874182]  ? save_stack_trace+0x1b/0x20
[ 1209.875542]  xfs_bmap_alloc+0x17/0x30 [xfs]
[ 1209.876847]  xfs_bmapi_write+0x74e/0x11d0 [xfs]
[ 1209.878190]  xfs_iomap_write_allocate+0x199/0x3a0 [xfs]
[ 1209.879632]  xfs_map_blocks+0x2cc/0x5a0 [xfs]
[ 1209.880909]  xfs_do_writepage+0x215/0x920 [xfs]
[ 1209.882255]  ? clear_page_dirty_for_io+0xb4/0x310
[ 1209.883598]  xfs_vm_writepage+0x3b/0x70 [xfs]
[ 1209.884841]  pageout.isra.54+0x1a4/0x460
[ 1209.886210]  shrink_page_list+0xa86/0xcf0
[ 1209.887441]  shrink_inactive_list+0x1c5/0x660
[ 1209.888682]  shrink_node_memcg+0x535/0x7f0
[ 1209.889975]  ? mem_cgroup_iter+0x14d/0x720
[ 1209.891197]  shrink_node+0xe1/0x310
[ 1209.892288]  kswapd+0x362/0x9b0
[ 1209.893308]  kthread+0x10f/0x150
[ 1209.894383]  ? mem_cgroup_shrink_node+0x3b0/0x3b0
[ 1209.895703]  ? kthread_create_on_node+0x70/0x70
[ 1209.896956]  ret_from_fork+0x31/0x40
----------

  (2) Both GFP_IO and GFP_KERNEL allocations are stuck at
      too_many_isolated() loop.

----------
[ 1209.898117] MemAlloc: systemd-journal(526) flags=0x400900 switches=33248 seq=121659 gfp=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=52772 uninterruptible
[ 1209.902154] systemd-journal D11240   526      1 0x00000000
[ 1209.903642] Call Trace:
[ 1209.904574]  __schedule+0x336/0xe00
[ 1209.905734]  schedule+0x3d/0x90
[ 1209.906817]  schedule_timeout+0x20d/0x510
[ 1209.908025]  ? prepare_to_wait+0x2b/0xc0
[ 1209.909268]  ? lock_timer_base+0xa0/0xa0
[ 1209.910460]  io_schedule_timeout+0x1e/0x50
[ 1209.911681]  congestion_wait+0x86/0x260
[ 1209.912853]  ? remove_wait_queue+0x60/0x60
[ 1209.914115]  shrink_inactive_list+0x5b4/0x660
[ 1209.915385]  ? __list_lru_count_one.isra.2+0x22/0x80
[ 1209.916768]  shrink_node_memcg+0x535/0x7f0
[ 1209.918173]  shrink_node+0xe1/0x310
[ 1209.919288]  do_try_to_free_pages+0xe1/0x300
[ 1209.920548]  try_to_free_pages+0x131/0x3f0
[ 1209.921827]  __alloc_pages_slowpath+0x3ec/0xd95
[ 1209.923137]  __alloc_pages_nodemask+0x3e4/0x460
[ 1209.924454]  ? __radix_tree_lookup+0x84/0xf0
[ 1209.925790]  alloc_pages_current+0x97/0x1b0
[ 1209.927021]  ? find_get_entry+0x5/0x300
[ 1209.928189]  __page_cache_alloc+0x15d/0x1a0
[ 1209.929471]  ? pagecache_get_page+0x2c/0x2b0
[ 1209.930716]  filemap_fault+0x4df/0x8b0
[ 1209.931867]  ? filemap_fault+0x373/0x8b0
[ 1209.933111]  ? xfs_ilock+0x22c/0x360 [xfs]
[ 1209.934510]  ? xfs_filemap_fault+0x64/0x1e0 [xfs]
[ 1209.935857]  ? down_read_nested+0x7b/0xc0
[ 1209.937123]  ? xfs_ilock+0x22c/0x360 [xfs]
[ 1209.938373]  xfs_filemap_fault+0x6c/0x1e0 [xfs]
[ 1209.939691]  __do_fault+0x1e/0xa0
[ 1209.940807]  ? _raw_spin_unlock+0x27/0x40
[ 1209.942002]  __handle_mm_fault+0xbb1/0xf40
[ 1209.943228]  ? mutex_unlock+0x12/0x20
[ 1209.944410]  ? devkmsg_read+0x15c/0x330
[ 1209.945912]  handle_mm_fault+0x16b/0x390
[ 1209.947297]  ? handle_mm_fault+0x49/0x390
[ 1209.948868]  __do_page_fault+0x24a/0x530
[ 1209.950351]  do_page_fault+0x30/0x80
[ 1209.951615]  page_fault+0x28/0x30

[ 1210.538496] MemAlloc: kworker/3:0(6345) flags=0x4208860 switches=10134 seq=22 gfp=0x1400000(GFP_NOIO) order=0 delay=45953 uninterruptible
[ 1210.541487] kworker/3:0     D12560  6345      2 0x00000080
[ 1210.542991] Workqueue: events_freezable_power_ disk_events_workfn
[ 1210.544577] Call Trace:
[ 1210.545468]  __schedule+0x336/0xe00
[ 1210.546606]  schedule+0x3d/0x90
[ 1210.547616]  schedule_timeout+0x20d/0x510
[ 1210.548778]  ? prepare_to_wait+0x2b/0xc0
[ 1210.550013]  ? lock_timer_base+0xa0/0xa0
[ 1210.551208]  io_schedule_timeout+0x1e/0x50
[ 1210.552519]  congestion_wait+0x86/0x260
[ 1210.553650]  ? remove_wait_queue+0x60/0x60
[ 1210.554900]  shrink_inactive_list+0x5b4/0x660
[ 1210.556119]  ? __list_lru_count_one.isra.2+0x22/0x80
[ 1210.557447]  shrink_node_memcg+0x535/0x7f0
[ 1210.558714]  shrink_node+0xe1/0x310
[ 1210.559803]  do_try_to_free_pages+0xe1/0x300
[ 1210.561009]  try_to_free_pages+0x131/0x3f0
[ 1210.562250]  __alloc_pages_slowpath+0x3ec/0xd95
[ 1210.563506]  __alloc_pages_nodemask+0x3e4/0x460
[ 1210.564777]  alloc_pages_current+0x97/0x1b0
[ 1210.566017]  bio_copy_kern+0xc9/0x180
[ 1210.567116]  blk_rq_map_kern+0x70/0x140
[ 1210.568356]  __scsi_execute.isra.22+0x13a/0x1e0
[ 1210.569839]  scsi_execute_req_flags+0x94/0x100
[ 1210.571218]  sr_check_events+0xbf/0x2b0 [sr_mod]
[ 1210.572500]  cdrom_check_events+0x18/0x30 [cdrom]
[ 1210.573934]  sr_block_check_events+0x2a/0x30 [sr_mod]
[ 1210.575335]  disk_check_events+0x60/0x170
[ 1210.576509]  disk_events_workfn+0x1c/0x20
[ 1210.577744]  process_one_work+0x22b/0x760
[ 1210.578934]  ? process_one_work+0x194/0x760
[ 1210.580147]  worker_thread+0x137/0x4b0
[ 1210.581336]  kthread+0x10f/0x150
[ 1210.582365]  ? process_one_work+0x760/0x760
[ 1210.583603]  ? kthread_create_on_node+0x70/0x70
[ 1210.584961]  ? do_syscall_64+0x6c/0x200
[ 1210.586343]  ret_from_fork+0x31/0x40
----------

  (3) Number of stalling threads does not decrease over time and
      number of out_of_memory() calls does not increase over time.

----------
[ 1209.781787] MemAlloc-Info: stalling=32 dying=1 exiting=0 victim=1 oom_count=45896
[ 1212.195351] MemAlloc-Info: stalling=32 dying=1 exiting=0 victim=1 oom_count=45896
[ 1242.551629] MemAlloc-Info: stalling=36 dying=1 exiting=0 victim=1 oom_count=45896
[ 1245.149165] MemAlloc-Info: stalling=36 dying=1 exiting=0 victim=1 oom_count=45896
[ 1275.319189] MemAlloc-Info: stalling=40 dying=1 exiting=0 victim=1 oom_count=45896
[ 1278.241813] MemAlloc-Info: stalling=40 dying=1 exiting=0 victim=1 oom_count=45896
----------

Regarding [2], Alexander, Brian and Dave can explain it better than I
can. But I think the principle of "check threads which cannot make
progress" is the same.



> (top-posting repaired - please don't do that)

Hmm, I don't see anything wrong with the mail headers of the message I
replied to that could cause top-posting. I think something weird
occurred outside of my control.

  Subject: Re: [PATCH v7] mm: Add memory allocation watchdog kernel thread.
  From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
  References: <1488244908-57586-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp>
  In-Reply-To: <1488244908-57586-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp>
  Message-Id: <201703091946.GDC21885.OQFFOtJHSOFVML@I-love.SAKURA.ne.jp>
  Date: Thu, 9 Mar 2017 19:46:14 +0900

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7] mm: Add memory allocation watchdog kernel thread.
@ 2017-03-10 11:19         ` Tetsuo Handa
  0 siblings, 0 replies; 22+ messages in thread
From: Tetsuo Handa @ 2017-03-10 11:19 UTC (permalink / raw)
  To: mhocko, akpm; +Cc: linux-mm, linux-kernel, hannes, mgorman, david, apolyakov

Andrew Morton wrote:
> On Thu, 9 Mar 2017 19:46:14 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> > Tetsuo Handa wrote:
> > > This patch adds a watchdog which periodically reports number of memory
> > > allocating tasks, dying tasks and OOM victim tasks when some task is
> > > spending too long time inside __alloc_pages_slowpath(). This patch also
> > > serves as a hook for obtaining additional information using SystemTap
> > > (e.g. examine other variables using printk(), capture a crash dump by
> > > calling panic()) by triggering a callback only when a stall is detected.
> > > Ability to take administrator-controlled actions based on some threshold
> > > is a big advantage gained by introducing a state tracking.
> > > 
> > > Commit 63f53dea0c9866e9 ("mm: warn about allocations which stall for
> > > too long") was a great step for reducing possibility of silent hang up
> > > problem caused by memory allocation stalls [1]. However, there are
> > > reports of long stalls (e.g. [2] is over 30 minutes!) and lockups (e.g.
> > > [3] is an "unable to invoke the OOM killer due to !__GFP_FS allocation"
> > > lockup problem) where this patch is more useful than that commit, for
> > > this patch can report possibly related tasks even if allocating tasks
> > > are unexpectedly blocked for so long. Regarding premature OOM killer
> > > invocation, tracepoints which can accumulate samples in short interval
> > > would be useful. But regarding too late to report allocation stalls,
> > > this patch which can capture all tasks (for reporting overall situation)
> > > in longer interval and act as a trigger (for accumulating short interval
> > > samples) would be useful.
> )
> > Andrew, do you have any questions on this patch?
> > I really need this patch for finding bugs which MM people overlook.
> 
> Undecided.  I can see the need but it is indeed quite a large lump of
> code.  Perhaps some additional examples of how this new code was used
> to understand and improve real-world kernel problems would be persuasive.

This patch is aimed for help bisecting whether unexpected hung cases are related to
memory allocation. By merging this patch (and enabling this watchdog in enterprise
systems via kernels supported by distributors), we can identify patterns/cases of
problems (if related to memory allocation) and improve quality of Linux kernels
by fixing problems related to memory allocation.

I was working at a support center for 3 years and had many cases where Linux
systems encountered unexpected silent hangup/reboot problem. In most cases, all
clue available was limited to sar (in sysstat package) showing that free memory
was low just before unexpected hung/reboot.

As a nature of such problems, it is very hard for administrators to collect
information for analysis; let alone identify the cause of problems. As a result,
such problems are left unrecognized/unsolved at the support center, and are
seldom reported to distributors/developers in order to ask for fixes.

Michal Hocko wrote:
> Well, it is true that the watchdog can provide much more information
> than we can gather with other debugging options we currently have when
> allocations run away.  So I am not questioning this is useful when doing
> memory stress testing and trying to trigger corner cases but I am not
> really sure how much this will be useful in production systems, though.
> Tracking is not for free both in runtime and longterm in maintenance.

I don't think that my stress testing is hitting bugs which do not occur in
production systems. But I can't prove it without using watchdog mechanism.
Today I heard about a case where an enterprise system hung last Sunday due to
XFS deadlock by memory allocation. This was rather a lucky case where
administrator of that system was able to capture vmcore. In most cases,
administrators can't capture even SysRq-t; let alone vmcore. Therefore,
automatic watchdog is highly appreciated.



Or, did you mean "some additional examples of how this new code was used
to understand and improve real-world kernel problems" as how this patch
helped in [2] [3] ?

Regarding [3] (now continued as
http://lkml.kernel.org/r/201703031948.CHJ81278.VOHSFFFOOLJQMt@I-love.SAKURA.ne.jp ),
I used this patch for confirming the following things.

  (1) kswapd cannot make progress waiting for lock.

----------
[  518.900038] MemAlloc: kswapd0(69) flags=0xa40840 switches=23883 uninterruptible
[  518.902095] kswapd0         D10776    69      2 0x00000000
[  518.903784] Call Trace:
[  518.904849]  __schedule+0x336/0xe00
[  518.906118]  schedule+0x3d/0x90
[  518.907314]  io_schedule+0x16/0x40
[  518.908622]  __xfs_iflock+0x129/0x140 [xfs]
[  518.910027]  ? autoremove_wake_function+0x60/0x60
[  518.911559]  xfs_reclaim_inode+0x162/0x440 [xfs]
[  518.913068]  xfs_reclaim_inodes_ag+0x2cf/0x4f0 [xfs]
[  518.914611]  ? xfs_reclaim_inodes_ag+0xf2/0x4f0 [xfs]
[  518.916148]  ? trace_hardirqs_on+0xd/0x10
[  518.917465]  ? try_to_wake_up+0x59/0x7a0
[  518.918758]  ? wake_up_process+0x15/0x20
[  518.920067]  xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
[  518.921560]  xfs_fs_free_cached_objects+0x19/0x20 [xfs]
[  518.923114]  super_cache_scan+0x181/0x190
[  518.924435]  shrink_slab+0x29f/0x6d0
[  518.925683]  shrink_node+0x2fa/0x310
[  518.926909]  kswapd+0x362/0x9b0
[  518.928061]  kthread+0x10f/0x150
[  518.929218]  ? mem_cgroup_shrink_node+0x3b0/0x3b0
[  518.930953]  ? kthread_create_on_node+0x70/0x70
[  518.932380]  ret_from_fork+0x31/0x40

[ 1095.632984] MemAlloc: kswapd0(69) flags=0xa40840 switches=23883 uninterruptible
[ 1095.632985] kswapd0         D10776    69      2 0x00000000
[ 1095.632988] Call Trace:
[ 1095.632991]  __schedule+0x336/0xe00
[ 1095.632994]  schedule+0x3d/0x90
[ 1095.632996]  io_schedule+0x16/0x40
[ 1095.633017]  __xfs_iflock+0x129/0x140 [xfs]
[ 1095.633021]  ? autoremove_wake_function+0x60/0x60
[ 1095.633051]  xfs_reclaim_inode+0x162/0x440 [xfs]
[ 1095.633072]  xfs_reclaim_inodes_ag+0x2cf/0x4f0 [xfs]
[ 1095.633106]  ? xfs_reclaim_inodes_ag+0xf2/0x4f0 [xfs]
[ 1095.633114]  ? trace_hardirqs_on+0xd/0x10
[ 1095.633116]  ? try_to_wake_up+0x59/0x7a0
[ 1095.633120]  ? wake_up_process+0x15/0x20
[ 1095.633156]  xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
[ 1095.633178]  xfs_fs_free_cached_objects+0x19/0x20 [xfs]
[ 1095.633180]  super_cache_scan+0x181/0x190
[ 1095.633183]  shrink_slab+0x29f/0x6d0
[ 1095.633189]  shrink_node+0x2fa/0x310
[ 1095.633193]  kswapd+0x362/0x9b0
[ 1095.633200]  kthread+0x10f/0x150
[ 1095.633201]  ? mem_cgroup_shrink_node+0x3b0/0x3b0
[ 1095.633202]  ? kthread_create_on_node+0x70/0x70
[ 1095.633205]  ret_from_fork+0x31/0x40
----------

  (2) All WQ_MEM_RECLAIM threads shown by show_workqueue_state()
      cannot make progress waiting for memory allocation.

----------
[ 1095.633625] MemAlloc: xfs-data/sda1(451) flags=0x4228060 switches=45509 seq=1 gfp=0x1604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK) order=0 delay=652073
[ 1095.633626] xfs-data/sda1   R  running task    12696   451      2 0x00000000
[ 1095.633663] Workqueue: xfs-data/sda1 xfs_end_io [xfs]
[ 1095.633665] Call Trace:
[ 1095.633668]  __schedule+0x336/0xe00
[ 1095.633671]  schedule+0x3d/0x90
[ 1095.633672]  schedule_timeout+0x20d/0x510
[ 1095.633675]  ? lock_timer_base+0xa0/0xa0
[ 1095.633678]  schedule_timeout_uninterruptible+0x2a/0x30
[ 1095.633680]  __alloc_pages_slowpath+0x2b5/0xd95
[ 1095.633687]  __alloc_pages_nodemask+0x3e4/0x460
[ 1095.633699]  alloc_pages_current+0x97/0x1b0
[ 1095.633702]  new_slab+0x4cb/0x6b0
[ 1095.633706]  ___slab_alloc+0x3a3/0x620
[ 1095.633728]  ? kmem_alloc+0x96/0x120 [xfs]
[ 1095.633730]  ? ___slab_alloc+0x5c6/0x620
[ 1095.633732]  ? cpuacct_charge+0x38/0x1e0
[ 1095.633767]  ? kmem_alloc+0x96/0x120 [xfs]
[ 1095.633770]  __slab_alloc+0x46/0x7d
[ 1095.633773]  __kmalloc+0x301/0x3b0
[ 1095.633802]  kmem_alloc+0x96/0x120 [xfs]
[ 1095.633804]  ? kfree+0x1fa/0x330
[ 1095.633842]  xfs_log_commit_cil+0x489/0x710 [xfs]
[ 1095.633864]  __xfs_trans_commit+0x83/0x260 [xfs]
[ 1095.633883]  xfs_trans_commit+0x10/0x20 [xfs]
[ 1095.633901]  __xfs_setfilesize+0xdb/0x240 [xfs]
[ 1095.633936]  xfs_setfilesize_ioend+0x89/0xb0 [xfs]
[ 1095.633954]  ? xfs_setfilesize_ioend+0x5/0xb0 [xfs]
[ 1095.633971]  xfs_end_io+0x81/0x110 [xfs]
[ 1095.633973]  process_one_work+0x22b/0x760
[ 1095.633975]  ? process_one_work+0x194/0x760
[ 1095.633997]  rescuer_thread+0x1f2/0x3d0
[ 1095.634002]  kthread+0x10f/0x150
[ 1095.634003]  ? worker_thread+0x4b0/0x4b0
[ 1095.634004]  ? kthread_create_on_node+0x70/0x70
[ 1095.634007]  ret_from_fork+0x31/0x40
[ 1095.634013] MemAlloc: xfs-eofblocks/s(456) flags=0x4228860 switches=15435 seq=1 gfp=0x1400240(GFP_NOFS|__GFP_NOWARN) order=0 delay=293074
[ 1095.634014] xfs-eofblocks/s R  running task    12032   456      2 0x00000000
[ 1095.634037] Workqueue: xfs-eofblocks/sda1 xfs_eofblocks_worker [xfs]
[ 1095.634038] Call Trace:
[ 1095.634040]  ? _raw_spin_lock+0x3d/0x80
[ 1095.634042]  ? vmpressure+0xd0/0x120
[ 1095.634044]  ? vmpressure+0xd0/0x120
[ 1095.634047]  ? vmpressure_prio+0x21/0x30
[ 1095.634049]  ? do_try_to_free_pages+0x70/0x300
[ 1095.634052]  ? try_to_free_pages+0x131/0x3f0
[ 1095.634058]  ? __alloc_pages_slowpath+0x3ec/0xd95
[ 1095.634065]  ? __alloc_pages_nodemask+0x3e4/0x460
[ 1095.634069]  ? alloc_pages_current+0x97/0x1b0
[ 1095.634111]  ? xfs_buf_allocate_memory+0x160/0x2a3 [xfs]
[ 1095.634133]  ? xfs_buf_get_map+0x2be/0x480 [xfs]
[ 1095.634169]  ? xfs_buf_read_map+0x2c/0x400 [xfs]
[ 1095.634204]  ? xfs_trans_read_buf_map+0x186/0x830 [xfs]
[ 1095.634222]  ? xfs_btree_read_buf_block.constprop.34+0x78/0xc0 [xfs]
[ 1095.634239]  ? xfs_btree_lookup_get_block+0x8a/0x180 [xfs]
[ 1095.634257]  ? xfs_btree_lookup+0xd0/0x3f0 [xfs]
[ 1095.634296]  ? kmem_zone_alloc+0x96/0x120 [xfs]
[ 1095.634299]  ? _raw_spin_unlock+0x27/0x40
[ 1095.634315]  ? xfs_bmbt_lookup_eq+0x1f/0x30 [xfs]
[ 1095.634348]  ? xfs_bmap_del_extent+0x1b2/0x1610 [xfs]
[ 1095.634380]  ? kmem_zone_alloc+0x96/0x120 [xfs]
[ 1095.634400]  ? __xfs_bunmapi+0x4db/0xda0 [xfs]
[ 1095.634421]  ? xfs_bunmapi+0x2b/0x40 [xfs]
[ 1095.634459]  ? xfs_itruncate_extents+0x1df/0x780 [xfs]
[ 1095.634502]  ? xfs_rename+0xc70/0x1080 [xfs]
[ 1095.634525]  ? xfs_free_eofblocks+0x1c4/0x230 [xfs]
[ 1095.634546]  ? xfs_inode_free_eofblocks+0x18d/0x280 [xfs]
[ 1095.634565]  ? xfs_inode_ag_walk.isra.13+0x2b5/0x620 [xfs]
[ 1095.634582]  ? xfs_inode_ag_walk.isra.13+0x91/0x620 [xfs]
[ 1095.634618]  ? xfs_inode_clear_eofblocks_tag+0x1a0/0x1a0 [xfs]
[ 1095.634630]  ? radix_tree_next_chunk+0x10b/0x2d0
[ 1095.634635]  ? radix_tree_gang_lookup_tag+0xd7/0x150
[ 1095.634672]  ? xfs_perag_get_tag+0x11d/0x370 [xfs]
[ 1095.634690]  ? xfs_perag_get_tag+0x5/0x370 [xfs]
[ 1095.634709]  ? xfs_inode_ag_iterator_tag+0x71/0xa0 [xfs]
[ 1095.634726]  ? xfs_inode_clear_eofblocks_tag+0x1a0/0x1a0 [xfs]
[ 1095.634744]  ? __xfs_icache_free_eofblocks+0x3b/0x40 [xfs]
[ 1095.634759]  ? xfs_eofblocks_worker+0x27/0x40 [xfs]
[ 1095.634762]  ? process_one_work+0x22b/0x760
[ 1095.634763]  ? process_one_work+0x194/0x760
[ 1095.634784]  ? rescuer_thread+0x1f2/0x3d0
[ 1095.634788]  ? kthread+0x10f/0x150
[ 1095.634789]  ? worker_thread+0x4b0/0x4b0
[ 1095.634790]  ? kthread_create_on_node+0x70/0x70
[ 1095.634793]  ? ret_from_fork+0x31/0x40
----------

  (3) order-0 GFP_NOIO allocation request cannot make progress
      waiting for memory allocation.

----------
[ 1095.631687] MemAlloc: kworker/2:0(27) flags=0x4208860 switches=38727 seq=21 gfp=0x1400000(GFP_NOIO) order=0 delay=652160
[ 1095.631688] kworker/2:0     R  running task    12680    27      2 0x00000000
[ 1095.631739] Workqueue: events_freezable_power_ disk_events_workfn
[ 1095.631740] Call Trace:
[ 1095.631743]  __schedule+0x336/0xe00
[ 1095.631746]  preempt_schedule_common+0x1f/0x31
[ 1095.631747]  _cond_resched+0x1c/0x30
[ 1095.631749]  shrink_slab+0x339/0x6d0
[ 1095.631754]  shrink_node+0x2fa/0x310
[ 1095.631758]  do_try_to_free_pages+0xe1/0x300
[ 1095.631761]  try_to_free_pages+0x131/0x3f0
[ 1095.631765]  __alloc_pages_slowpath+0x3ec/0xd95
[ 1095.631771]  __alloc_pages_nodemask+0x3e4/0x460
[ 1095.631775]  alloc_pages_current+0x97/0x1b0
[ 1095.631779]  bio_copy_kern+0xc9/0x180
[ 1095.631830]  blk_rq_map_kern+0x70/0x140
[ 1095.631835]  __scsi_execute.isra.22+0x13a/0x1e0
[ 1095.631838]  scsi_execute_req_flags+0x94/0x100
[ 1095.631844]  sr_check_events+0xbf/0x2b0 [sr_mod]
[ 1095.631852]  cdrom_check_events+0x18/0x30 [cdrom]
[ 1095.631854]  sr_block_check_events+0x2a/0x30 [sr_mod]
[ 1095.631856]  disk_check_events+0x60/0x170
[ 1095.631859]  disk_events_workfn+0x1c/0x20
[ 1095.631862]  process_one_work+0x22b/0x760
[ 1095.631863]  ? process_one_work+0x194/0x760
[ 1095.631867]  worker_thread+0x137/0x4b0
[ 1095.631887]  kthread+0x10f/0x150
[ 1095.631889]  ? process_one_work+0x760/0x760
[ 1095.631890]  ? kthread_create_on_node+0x70/0x70
[ 1095.631893]  ret_from_fork+0x31/0x40
----------
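
A side note on (3): with a v4.10-era kernel, an order-0 request is "too
small to fail" and keeps retrying reclaim, and a request without __GFP_FS
cannot even get help from the OOM killer. The relevant check, simplified
from the top of mm/oom_kill.c:out_of_memory() of that era (not verbatim),
is:

----------
	/*
	 * The OOM killer does not compensate for IO-less reclaim, so
	 * a GFP_NOIO/GFP_NOFS request bails out here without killing
	 * anything, and the allocator simply retries reclaim.
	 */
	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS | __GFP_NOFAIL)))
		return true;
----------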

  (4) Number of stalling threads does not decrease over time while
      number of out_of_memory() calls increases over time.

----------
[  518.090012] MemAlloc-Info: stalling=184 dying=1 exiting=0 victim=1 oom_count=8441307
[  553.070829] MemAlloc-Info: stalling=184 dying=1 exiting=0 victim=1 oom_count=10318507
[  616.394649] MemAlloc-Info: stalling=186 dying=1 exiting=0 victim=1 oom_count=13908219
[  642.266252] MemAlloc-Info: stalling=186 dying=1 exiting=0 victim=1 oom_count=15180673
[  702.412189] MemAlloc-Info: stalling=187 dying=1 exiting=0 victim=1 oom_count=18732529
[  736.787879] MemAlloc-Info: stalling=187 dying=1 exiting=0 victim=1 oom_count=20565244
[  800.715759] MemAlloc-Info: stalling=188 dying=1 exiting=0 victim=1 oom_count=24411576
[  837.571405] MemAlloc-Info: stalling=188 dying=1 exiting=0 victim=1 oom_count=26463562
[  899.021495] MemAlloc-Info: stalling=189 dying=1 exiting=0 victim=1 oom_count=30144879
[  936.282709] MemAlloc-Info: stalling=189 dying=1 exiting=0 victim=1 oom_count=32129234
[  997.328119] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=35657983
[ 1033.977265] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=37659912
[ 1095.630961] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=40639677
[ 1095.821248] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=40646791
----------

Further interpretation is up to the XFS people (Brian and Dave). So far, boosting the
priority (lowering the watermark) of WQ_MEM_RECLAIM threads based on
current_is_workqueue_rescuer() seems to be the fix under consideration, because the
writeback path already has special-case handling for the rescuer. But I'm not sure
whether it helps get writeback unstuck when the writeback path is blocked on a lock
rather than on memory allocation. (Threads doing writeback and/or mempool allocations
might be candidates for always being reported, the way kswapd is. That would be
material for follow-up patches.)

----------
[ 1044.701393] kworker/u256:0  D10024  8444      2 0x00000080
[ 1044.703017] Workqueue: writeback wb_workfn (flush-8:0)
[ 1044.704550] Call Trace:
[ 1044.705550]  __schedule+0x336/0xe00
[ 1044.706761]  schedule+0x3d/0x90
[ 1044.707891]  io_schedule+0x16/0x40
[ 1044.709073]  __lock_page+0x126/0x180
[ 1044.710272]  ? page_cache_tree_insert+0x120/0x120
[ 1044.711728]  write_cache_pages+0x39a/0x6b0
[ 1044.713077]  ? xfs_map_blocks+0x5a0/0x5a0 [xfs]
[ 1044.714499]  ? xfs_vm_writepages+0x48/0xa0 [xfs]
[ 1044.715944]  xfs_vm_writepages+0x6b/0xa0 [xfs]
[ 1044.717362]  do_writepages+0x21/0x40
[ 1044.718577]  __writeback_single_inode+0x72/0x9d0
[ 1044.719998]  ? _raw_spin_unlock+0x27/0x40
[ 1044.721308]  writeback_sb_inodes+0x322/0x750
[ 1044.722662]  __writeback_inodes_wb+0x8c/0xc0
[ 1044.724009]  wb_writeback+0x3be/0x6e0
[ 1044.725238]  wb_workfn+0x146/0x7d0
[ 1044.726413]  ? process_one_work+0x194/0x760
[ 1044.727736]  process_one_work+0x22b/0x760
[ 1044.729017]  ? process_one_work+0x194/0x760
[ 1044.730343]  worker_thread+0x137/0x4b0
[ 1044.731577]  kthread+0x10f/0x150
[ 1044.732702]  ? process_one_work+0x760/0x760
[ 1044.734021]  ? kthread_create_on_node+0x70/0x70
[ 1044.735415]  ? do_syscall_64+0x195/0x200
[ 1044.736684]  ret_from_fork+0x31/0x40
----------
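
For reference, here is a minimal sketch of the watermark-boosting idea
mentioned above. This is NOT part of the posted patch and is only my
reading of the proposal; current_is_workqueue_rescuer() is the existing
helper from include/linux/workqueue.h, while boost_rescuer_alloc_flags()
and the use of ALLOC_HARDER here are hypothetical:

----------
/*
 * Hypothetical sketch only: let a WQ_MEM_RECLAIM rescuer dip further
 * into the reserves, similar to what the allocator already does for
 * e.g. rt tasks, by loosening the watermark check in the slowpath.
 */
static unsigned int boost_rescuer_alloc_flags(unsigned int alloc_flags)
{
	/* true only while running on the pre-allocated rescuer
	 * thread of a WQ_MEM_RECLAIM workqueue */
	if (current_is_workqueue_rescuer())
		alloc_flags |= ALLOC_HARDER;	/* lower the watermark */
	return alloc_flags;
}
----------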

Regarding "infinite too_many_isolated() loop" case
( http://lkml.kernel.org/r/201702212335.DJB30777.JOFMHSFtVLQOOF@I-love.SAKURA.ne.jp ),
I used this patch for confirming the following things.

  (1) kswapd cannot make progress waiting for lock.

----------
[ 1209.790966] MemAlloc: kswapd0(67) flags=0xa60840 switches=51139 uninterruptible
[ 1209.799726] kswapd0         D10936    67      2 0x00000000
[ 1209.807326] Call Trace:
[ 1209.812581]  __schedule+0x336/0xe00
[ 1209.818599]  schedule+0x3d/0x90
[ 1209.823907]  schedule_timeout+0x26a/0x510
[ 1209.827218]  ? trace_hardirqs_on+0xd/0x10
[ 1209.830535]  __down_common+0xfb/0x131
[ 1209.833801]  ? _xfs_buf_find+0x2cb/0xc10 [xfs]
[ 1209.837372]  __down+0x1d/0x1f
[ 1209.840331]  down+0x41/0x50
[ 1209.843243]  xfs_buf_lock+0x64/0x370 [xfs]
[ 1209.846597]  _xfs_buf_find+0x2cb/0xc10 [xfs]
[ 1209.850031]  ? _xfs_buf_find+0xa4/0xc10 [xfs]
[ 1209.853514]  xfs_buf_get_map+0x2a/0x480 [xfs]
[ 1209.855831]  xfs_buf_read_map+0x2c/0x400 [xfs]
[ 1209.857388]  ? free_debug_processing+0x27d/0x2af
[ 1209.859037]  xfs_trans_read_buf_map+0x186/0x830 [xfs]
[ 1209.860707]  xfs_read_agf+0xc8/0x2b0 [xfs]
[ 1209.862184]  xfs_alloc_read_agf+0x7a/0x300 [xfs]
[ 1209.863728]  ? xfs_alloc_space_available+0x7b/0x120 [xfs]
[ 1209.865385]  xfs_alloc_fix_freelist+0x3bc/0x490 [xfs]
[ 1209.866974]  ? __radix_tree_lookup+0x84/0xf0
[ 1209.868374]  ? xfs_perag_get+0x1a0/0x310 [xfs]
[ 1209.869798]  ? xfs_perag_get+0x5/0x310 [xfs]
[ 1209.871288]  xfs_alloc_vextent+0x161/0xda0 [xfs]
[ 1209.872757]  xfs_bmap_btalloc+0x46c/0x8b0 [xfs]
[ 1209.874182]  ? save_stack_trace+0x1b/0x20
[ 1209.875542]  xfs_bmap_alloc+0x17/0x30 [xfs]
[ 1209.876847]  xfs_bmapi_write+0x74e/0x11d0 [xfs]
[ 1209.878190]  xfs_iomap_write_allocate+0x199/0x3a0 [xfs]
[ 1209.879632]  xfs_map_blocks+0x2cc/0x5a0 [xfs]
[ 1209.880909]  xfs_do_writepage+0x215/0x920 [xfs]
[ 1209.882255]  ? clear_page_dirty_for_io+0xb4/0x310
[ 1209.883598]  xfs_vm_writepage+0x3b/0x70 [xfs]
[ 1209.884841]  pageout.isra.54+0x1a4/0x460
[ 1209.886210]  shrink_page_list+0xa86/0xcf0
[ 1209.887441]  shrink_inactive_list+0x1c5/0x660
[ 1209.888682]  shrink_node_memcg+0x535/0x7f0
[ 1209.889975]  ? mem_cgroup_iter+0x14d/0x720
[ 1209.891197]  shrink_node+0xe1/0x310
[ 1209.892288]  kswapd+0x362/0x9b0
[ 1209.893308]  kthread+0x10f/0x150
[ 1209.894383]  ? mem_cgroup_shrink_node+0x3b0/0x3b0
[ 1209.895703]  ? kthread_create_on_node+0x70/0x70
[ 1209.896956]  ret_from_fork+0x31/0x40
----------

  (2) Both GFP_IO and GFP_KERNEL allocations are stuck at
      too_many_isolated() loop.

----------
[ 1209.898117] MemAlloc: systemd-journal(526) flags=0x400900 switches=33248 seq=121659 gfp=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=52772 uninterruptible
[ 1209.902154] systemd-journal D11240   526      1 0x00000000
[ 1209.903642] Call Trace:
[ 1209.904574]  __schedule+0x336/0xe00
[ 1209.905734]  schedule+0x3d/0x90
[ 1209.906817]  schedule_timeout+0x20d/0x510
[ 1209.908025]  ? prepare_to_wait+0x2b/0xc0
[ 1209.909268]  ? lock_timer_base+0xa0/0xa0
[ 1209.910460]  io_schedule_timeout+0x1e/0x50
[ 1209.911681]  congestion_wait+0x86/0x260
[ 1209.912853]  ? remove_wait_queue+0x60/0x60
[ 1209.914115]  shrink_inactive_list+0x5b4/0x660
[ 1209.915385]  ? __list_lru_count_one.isra.2+0x22/0x80
[ 1209.916768]  shrink_node_memcg+0x535/0x7f0
[ 1209.918173]  shrink_node+0xe1/0x310
[ 1209.919288]  do_try_to_free_pages+0xe1/0x300
[ 1209.920548]  try_to_free_pages+0x131/0x3f0
[ 1209.921827]  __alloc_pages_slowpath+0x3ec/0xd95
[ 1209.923137]  __alloc_pages_nodemask+0x3e4/0x460
[ 1209.924454]  ? __radix_tree_lookup+0x84/0xf0
[ 1209.925790]  alloc_pages_current+0x97/0x1b0
[ 1209.927021]  ? find_get_entry+0x5/0x300
[ 1209.928189]  __page_cache_alloc+0x15d/0x1a0
[ 1209.929471]  ? pagecache_get_page+0x2c/0x2b0
[ 1209.930716]  filemap_fault+0x4df/0x8b0
[ 1209.931867]  ? filemap_fault+0x373/0x8b0
[ 1209.933111]  ? xfs_ilock+0x22c/0x360 [xfs]
[ 1209.934510]  ? xfs_filemap_fault+0x64/0x1e0 [xfs]
[ 1209.935857]  ? down_read_nested+0x7b/0xc0
[ 1209.937123]  ? xfs_ilock+0x22c/0x360 [xfs]
[ 1209.938373]  xfs_filemap_fault+0x6c/0x1e0 [xfs]
[ 1209.939691]  __do_fault+0x1e/0xa0
[ 1209.940807]  ? _raw_spin_unlock+0x27/0x40
[ 1209.942002]  __handle_mm_fault+0xbb1/0xf40
[ 1209.943228]  ? mutex_unlock+0x12/0x20
[ 1209.944410]  ? devkmsg_read+0x15c/0x330
[ 1209.945912]  handle_mm_fault+0x16b/0x390
[ 1209.947297]  ? handle_mm_fault+0x49/0x390
[ 1209.948868]  __do_page_fault+0x24a/0x530
[ 1209.950351]  do_page_fault+0x30/0x80
[ 1209.951615]  page_fault+0x28/0x30

[ 1210.538496] MemAlloc: kworker/3:0(6345) flags=0x4208860 switches=10134 seq=22 gfp=0x1400000(GFP_NOIO) order=0 delay=45953 uninterruptible
[ 1210.541487] kworker/3:0     D12560  6345      2 0x00000080
[ 1210.542991] Workqueue: events_freezable_power_ disk_events_workfn
[ 1210.544577] Call Trace:
[ 1210.545468]  __schedule+0x336/0xe00
[ 1210.546606]  schedule+0x3d/0x90
[ 1210.547616]  schedule_timeout+0x20d/0x510
[ 1210.548778]  ? prepare_to_wait+0x2b/0xc0
[ 1210.550013]  ? lock_timer_base+0xa0/0xa0
[ 1210.551208]  io_schedule_timeout+0x1e/0x50
[ 1210.552519]  congestion_wait+0x86/0x260
[ 1210.553650]  ? remove_wait_queue+0x60/0x60
[ 1210.554900]  shrink_inactive_list+0x5b4/0x660
[ 1210.556119]  ? __list_lru_count_one.isra.2+0x22/0x80
[ 1210.557447]  shrink_node_memcg+0x535/0x7f0
[ 1210.558714]  shrink_node+0xe1/0x310
[ 1210.559803]  do_try_to_free_pages+0xe1/0x300
[ 1210.561009]  try_to_free_pages+0x131/0x3f0
[ 1210.562250]  __alloc_pages_slowpath+0x3ec/0xd95
[ 1210.563506]  __alloc_pages_nodemask+0x3e4/0x460
[ 1210.564777]  alloc_pages_current+0x97/0x1b0
[ 1210.566017]  bio_copy_kern+0xc9/0x180
[ 1210.567116]  blk_rq_map_kern+0x70/0x140
[ 1210.568356]  __scsi_execute.isra.22+0x13a/0x1e0
[ 1210.569839]  scsi_execute_req_flags+0x94/0x100
[ 1210.571218]  sr_check_events+0xbf/0x2b0 [sr_mod]
[ 1210.572500]  cdrom_check_events+0x18/0x30 [cdrom]
[ 1210.573934]  sr_block_check_events+0x2a/0x30 [sr_mod]
[ 1210.575335]  disk_check_events+0x60/0x170
[ 1210.576509]  disk_events_workfn+0x1c/0x20
[ 1210.577744]  process_one_work+0x22b/0x760
[ 1210.578934]  ? process_one_work+0x194/0x760
[ 1210.580147]  worker_thread+0x137/0x4b0
[ 1210.581336]  kthread+0x10f/0x150
[ 1210.582365]  ? process_one_work+0x760/0x760
[ 1210.583603]  ? kthread_create_on_node+0x70/0x70
[ 1210.584961]  ? do_syscall_64+0x6c/0x200
[ 1210.586343]  ret_from_fork+0x31/0x40
----------
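
For reference, the loop that (2) refers to looks roughly like this
(simplified from v4.10-era mm/vmscan.c:shrink_inactive_list(); note that
nothing here guarantees forward progress if the isolated counts never
drop):

----------
	while (unlikely(too_many_isolated(pgdat, file, sc))) {
		/* sleep 100ms and re-check, potentially forever */
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/* We are about to die and free our memory. Return now. */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}
----------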

  (3) Number of stalling threads does not decrease over time and
      number of out_of_memory() calls does not increase over time.

----------
[ 1209.781787] MemAlloc-Info: stalling=32 dying=1 exiting=0 victim=1 oom_count=45896
[ 1212.195351] MemAlloc-Info: stalling=32 dying=1 exiting=0 victim=1 oom_count=45896
[ 1242.551629] MemAlloc-Info: stalling=36 dying=1 exiting=0 victim=1 oom_count=45896
[ 1245.149165] MemAlloc-Info: stalling=36 dying=1 exiting=0 victim=1 oom_count=45896
[ 1275.319189] MemAlloc-Info: stalling=40 dying=1 exiting=0 victim=1 oom_count=45896
[ 1278.241813] MemAlloc-Info: stalling=40 dying=1 exiting=0 victim=1 oom_count=45896
----------

Regarding [2], Alexander, Brian and Dave can explain it better than I can. But
I think the "check threads which cannot make progress" principle is the same.



> (top-posting repaired - please don't do that)

Hmm, I don't see anything wrong with the mail headers of the message I replied
to that could cause top-posting. I think something weird occurred outside of my control.

  Subject: Re: [PATCH v7] mm: Add memory allocation watchdog kernel thread.
  From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
  References: <1488244908-57586-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp>
  In-Reply-To: <1488244908-57586-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp>
  Message-Id: <201703091946.GDC21885.OQFFOtJHSOFVML@I-love.SAKURA.ne.jp>
  Date: Thu, 9 Mar 2017 19:46:14 +0900

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7] mm: Add memory allocation watchdog kernel thread.
  2017-03-10 11:19         ` Tetsuo Handa
@ 2017-03-10 13:49           ` Brian Foster
  -1 siblings, 0 replies; 22+ messages in thread
From: Brian Foster @ 2017-03-10 13:49 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, akpm, linux-mm, linux-kernel, hannes, mgorman, david, apolyakov

On Fri, Mar 10, 2017 at 08:19:58PM +0900, Tetsuo Handa wrote:
> Andrew Morton wrote:
> > On Thu, 9 Mar 2017 19:46:14 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> > > Tetsuo Handa wrote:
> > > > This patch adds a watchdog which periodically reports number of memory
> > > > allocating tasks, dying tasks and OOM victim tasks when some task is
> > > > spending too long time inside __alloc_pages_slowpath(). This patch also
> > > > serves as a hook for obtaining additional information using SystemTap
> > > > (e.g. examine other variables using printk(), capture a crash dump by
> > > > calling panic()) by triggering a callback only when a stall is detected.
> > > > Ability to take administrator-controlled actions based on some threshold
> > > > is a big advantage gained by introducing a state tracking.
> > > > 
> > > > Commit 63f53dea0c9866e9 ("mm: warn about allocations which stall for
> > > > too long") was a great step for reducing possibility of silent hang up
> > > > problem caused by memory allocation stalls [1]. However, there are
> > > > reports of long stalls (e.g. [2] is over 30 minutes!) and lockups (e.g.
> > > > [3] is an "unable to invoke the OOM killer due to !__GFP_FS allocation"
> > > > lockup problem) where this patch is more useful than that commit, for
> > > > this patch can report possibly related tasks even if allocating tasks
> > > > are unexpectedly blocked for so long. Regarding premature OOM killer
> > > > invocation, tracepoints which can accumulate samples in short interval
> > > > would be useful. But regarding too late to report allocation stalls,
> > > > this patch which can capture all tasks (for reporting overall situation)
> > > > in longer interval and act as a trigger (for accumulating short interval
> > > > samples) would be useful.
> > )
> > > Andrew, do you have any questions on this patch?
> > > I really need this patch for finding bugs which MM people overlook.
> > 
> > Undecided.  I can see the need but it is indeed quite a large lump of
> > code.  Perhaps some additional examples of how this new code was used
> > to understand and improve real-world kernel problems would be persuasive.
> 
> This patch is aimed at helping bisect whether unexpected hang cases are related to
> memory allocation. By merging this patch (and enabling this watchdog in enterprise
> systems via kernels supported by distributors), we can identify patterns/cases of
> problems (if related to memory allocation) and improve the quality of Linux kernels
> by fixing problems related to memory allocation.
> 
> I worked at a support center for 3 years and saw many cases where Linux
> systems encountered unexpected silent hangup/reboot problems. In most cases, the
> only clue available was sar (in the sysstat package) showing that free memory
> was low just before the unexpected hang/reboot.
> 
> By the nature of such problems, it is very hard for administrators to collect
> information for analysis, let alone identify the cause. As a result,
> such problems are left unrecognized/unsolved at the support center, and are
> seldom reported to distributors/developers with a request for fixes.
> 
> Michal Hocko wrote:
> > Well, it is true that the watchdog can provide much more information
> > than we can gather with other debugging options we currently have when
> > allocations run away.  So I am not questioning this is useful when doing
> > memory stress testing and trying to trigger corner cases but I am not
> > really sure how much this will be useful in production systems, though.
> > Tracking is not free, both at runtime and long-term in maintenance.
> 
> I don't think that my stress testing is hitting bugs which do not occur in
> production systems. But I can't prove it without using a watchdog mechanism.
> Today I heard about a case where an enterprise system hung last Sunday due to
> an XFS deadlock involving memory allocation. This was rather a lucky case where
> the administrator of that system was able to capture a vmcore. In most cases,
> administrators can't capture even SysRq-t, let alone a vmcore. Therefore, an
> automatic watchdog is highly appreciated.
> 
> 
> 
> Or, did you mean "some additional examples of how this new code was used
> to understand and improve real-world kernel problems" as in how this patch
> helped in [2] and [3]?
> 
> Regarding [3] (now continued as
> http://lkml.kernel.org/r/201703031948.CHJ81278.VOHSFFFOOLJQMt@I-love.SAKURA.ne.jp ),
> I used this patch to confirm the following things.
> 
>   (1) kswapd cannot make progress waiting for lock.
> 
> ----------
> [  518.900038] MemAlloc: kswapd0(69) flags=0xa40840 switches=23883 uninterruptible
> [  518.902095] kswapd0         D10776    69      2 0x00000000
> [  518.903784] Call Trace:
> [  518.904849]  __schedule+0x336/0xe00
> [  518.906118]  schedule+0x3d/0x90
> [  518.907314]  io_schedule+0x16/0x40
> [  518.908622]  __xfs_iflock+0x129/0x140 [xfs]
> [  518.910027]  ? autoremove_wake_function+0x60/0x60
> [  518.911559]  xfs_reclaim_inode+0x162/0x440 [xfs]
> [  518.913068]  xfs_reclaim_inodes_ag+0x2cf/0x4f0 [xfs]
> [  518.914611]  ? xfs_reclaim_inodes_ag+0xf2/0x4f0 [xfs]
> [  518.916148]  ? trace_hardirqs_on+0xd/0x10
> [  518.917465]  ? try_to_wake_up+0x59/0x7a0
> [  518.918758]  ? wake_up_process+0x15/0x20
> [  518.920067]  xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
> [  518.921560]  xfs_fs_free_cached_objects+0x19/0x20 [xfs]
> [  518.923114]  super_cache_scan+0x181/0x190
> [  518.924435]  shrink_slab+0x29f/0x6d0
> [  518.925683]  shrink_node+0x2fa/0x310
> [  518.926909]  kswapd+0x362/0x9b0
> [  518.928061]  kthread+0x10f/0x150
> [  518.929218]  ? mem_cgroup_shrink_node+0x3b0/0x3b0
> [  518.930953]  ? kthread_create_on_node+0x70/0x70
> [  518.932380]  ret_from_fork+0x31/0x40
> 
> [ 1095.632984] MemAlloc: kswapd0(69) flags=0xa40840 switches=23883 uninterruptible
> [ 1095.632985] kswapd0         D10776    69      2 0x00000000
> [ 1095.632988] Call Trace:
> [ 1095.632991]  __schedule+0x336/0xe00
> [ 1095.632994]  schedule+0x3d/0x90
> [ 1095.632996]  io_schedule+0x16/0x40
> [ 1095.633017]  __xfs_iflock+0x129/0x140 [xfs]
> [ 1095.633021]  ? autoremove_wake_function+0x60/0x60
> [ 1095.633051]  xfs_reclaim_inode+0x162/0x440 [xfs]
> [ 1095.633072]  xfs_reclaim_inodes_ag+0x2cf/0x4f0 [xfs]
> [ 1095.633106]  ? xfs_reclaim_inodes_ag+0xf2/0x4f0 [xfs]
> [ 1095.633114]  ? trace_hardirqs_on+0xd/0x10
> [ 1095.633116]  ? try_to_wake_up+0x59/0x7a0
> [ 1095.633120]  ? wake_up_process+0x15/0x20
> [ 1095.633156]  xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
> [ 1095.633178]  xfs_fs_free_cached_objects+0x19/0x20 [xfs]
> [ 1095.633180]  super_cache_scan+0x181/0x190
> [ 1095.633183]  shrink_slab+0x29f/0x6d0
> [ 1095.633189]  shrink_node+0x2fa/0x310
> [ 1095.633193]  kswapd+0x362/0x9b0
> [ 1095.633200]  kthread+0x10f/0x150
> [ 1095.633201]  ? mem_cgroup_shrink_node+0x3b0/0x3b0
> [ 1095.633202]  ? kthread_create_on_node+0x70/0x70
> [ 1095.633205]  ret_from_fork+0x31/0x40
> ----------
> 
>   (2) All WQ_MEM_RECLAIM threads shown by show_workqueue_state()
>       cannot make progress waiting for memory allocation.
> 
> ----------
> [ 1095.633625] MemAlloc: xfs-data/sda1(451) flags=0x4228060 switches=45509 seq=1 gfp=0x1604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK) order=0 delay=652073
> [ 1095.633626] xfs-data/sda1   R  running task    12696   451      2 0x00000000
> [ 1095.633663] Workqueue: xfs-data/sda1 xfs_end_io [xfs]
> [ 1095.633665] Call Trace:
> [ 1095.633668]  __schedule+0x336/0xe00
> [ 1095.633671]  schedule+0x3d/0x90
> [ 1095.633672]  schedule_timeout+0x20d/0x510
> [ 1095.633675]  ? lock_timer_base+0xa0/0xa0
> [ 1095.633678]  schedule_timeout_uninterruptible+0x2a/0x30
> [ 1095.633680]  __alloc_pages_slowpath+0x2b5/0xd95
> [ 1095.633687]  __alloc_pages_nodemask+0x3e4/0x460
> [ 1095.633699]  alloc_pages_current+0x97/0x1b0
> [ 1095.633702]  new_slab+0x4cb/0x6b0
> [ 1095.633706]  ___slab_alloc+0x3a3/0x620
> [ 1095.633728]  ? kmem_alloc+0x96/0x120 [xfs]
> [ 1095.633730]  ? ___slab_alloc+0x5c6/0x620
> [ 1095.633732]  ? cpuacct_charge+0x38/0x1e0
> [ 1095.633767]  ? kmem_alloc+0x96/0x120 [xfs]
> [ 1095.633770]  __slab_alloc+0x46/0x7d
> [ 1095.633773]  __kmalloc+0x301/0x3b0
> [ 1095.633802]  kmem_alloc+0x96/0x120 [xfs]
> [ 1095.633804]  ? kfree+0x1fa/0x330
> [ 1095.633842]  xfs_log_commit_cil+0x489/0x710 [xfs]
> [ 1095.633864]  __xfs_trans_commit+0x83/0x260 [xfs]
> [ 1095.633883]  xfs_trans_commit+0x10/0x20 [xfs]
> [ 1095.633901]  __xfs_setfilesize+0xdb/0x240 [xfs]
> [ 1095.633936]  xfs_setfilesize_ioend+0x89/0xb0 [xfs]
> [ 1095.633954]  ? xfs_setfilesize_ioend+0x5/0xb0 [xfs]
> [ 1095.633971]  xfs_end_io+0x81/0x110 [xfs]
> [ 1095.633973]  process_one_work+0x22b/0x760
> [ 1095.633975]  ? process_one_work+0x194/0x760
> [ 1095.633997]  rescuer_thread+0x1f2/0x3d0
> [ 1095.634002]  kthread+0x10f/0x150
> [ 1095.634003]  ? worker_thread+0x4b0/0x4b0
> [ 1095.634004]  ? kthread_create_on_node+0x70/0x70
> [ 1095.634007]  ret_from_fork+0x31/0x40
> [ 1095.634013] MemAlloc: xfs-eofblocks/s(456) flags=0x4228860 switches=15435 seq=1 gfp=0x1400240(GFP_NOFS|__GFP_NOWARN) order=0 delay=293074
> [ 1095.634014] xfs-eofblocks/s R  running task    12032   456      2 0x00000000
> [ 1095.634037] Workqueue: xfs-eofblocks/sda1 xfs_eofblocks_worker [xfs]
> [ 1095.634038] Call Trace:
> [ 1095.634040]  ? _raw_spin_lock+0x3d/0x80
> [ 1095.634042]  ? vmpressure+0xd0/0x120
> [ 1095.634044]  ? vmpressure+0xd0/0x120
> [ 1095.634047]  ? vmpressure_prio+0x21/0x30
> [ 1095.634049]  ? do_try_to_free_pages+0x70/0x300
> [ 1095.634052]  ? try_to_free_pages+0x131/0x3f0
> [ 1095.634058]  ? __alloc_pages_slowpath+0x3ec/0xd95
> [ 1095.634065]  ? __alloc_pages_nodemask+0x3e4/0x460
> [ 1095.634069]  ? alloc_pages_current+0x97/0x1b0
> [ 1095.634111]  ? xfs_buf_allocate_memory+0x160/0x2a3 [xfs]
> [ 1095.634133]  ? xfs_buf_get_map+0x2be/0x480 [xfs]
> [ 1095.634169]  ? xfs_buf_read_map+0x2c/0x400 [xfs]
> [ 1095.634204]  ? xfs_trans_read_buf_map+0x186/0x830 [xfs]
> [ 1095.634222]  ? xfs_btree_read_buf_block.constprop.34+0x78/0xc0 [xfs]
> [ 1095.634239]  ? xfs_btree_lookup_get_block+0x8a/0x180 [xfs]
> [ 1095.634257]  ? xfs_btree_lookup+0xd0/0x3f0 [xfs]
> [ 1095.634296]  ? kmem_zone_alloc+0x96/0x120 [xfs]
> [ 1095.634299]  ? _raw_spin_unlock+0x27/0x40
> [ 1095.634315]  ? xfs_bmbt_lookup_eq+0x1f/0x30 [xfs]
> [ 1095.634348]  ? xfs_bmap_del_extent+0x1b2/0x1610 [xfs]
> [ 1095.634380]  ? kmem_zone_alloc+0x96/0x120 [xfs]
> [ 1095.634400]  ? __xfs_bunmapi+0x4db/0xda0 [xfs]
> [ 1095.634421]  ? xfs_bunmapi+0x2b/0x40 [xfs]
> [ 1095.634459]  ? xfs_itruncate_extents+0x1df/0x780 [xfs]
> [ 1095.634502]  ? xfs_rename+0xc70/0x1080 [xfs]
> [ 1095.634525]  ? xfs_free_eofblocks+0x1c4/0x230 [xfs]
> [ 1095.634546]  ? xfs_inode_free_eofblocks+0x18d/0x280 [xfs]
> [ 1095.634565]  ? xfs_inode_ag_walk.isra.13+0x2b5/0x620 [xfs]
> [ 1095.634582]  ? xfs_inode_ag_walk.isra.13+0x91/0x620 [xfs]
> [ 1095.634618]  ? xfs_inode_clear_eofblocks_tag+0x1a0/0x1a0 [xfs]
> [ 1095.634630]  ? radix_tree_next_chunk+0x10b/0x2d0
> [ 1095.634635]  ? radix_tree_gang_lookup_tag+0xd7/0x150
> [ 1095.634672]  ? xfs_perag_get_tag+0x11d/0x370 [xfs]
> [ 1095.634690]  ? xfs_perag_get_tag+0x5/0x370 [xfs]
> [ 1095.634709]  ? xfs_inode_ag_iterator_tag+0x71/0xa0 [xfs]
> [ 1095.634726]  ? xfs_inode_clear_eofblocks_tag+0x1a0/0x1a0 [xfs]
> [ 1095.634744]  ? __xfs_icache_free_eofblocks+0x3b/0x40 [xfs]
> [ 1095.634759]  ? xfs_eofblocks_worker+0x27/0x40 [xfs]
> [ 1095.634762]  ? process_one_work+0x22b/0x760
> [ 1095.634763]  ? process_one_work+0x194/0x760
> [ 1095.634784]  ? rescuer_thread+0x1f2/0x3d0
> [ 1095.634788]  ? kthread+0x10f/0x150
> [ 1095.634789]  ? worker_thread+0x4b0/0x4b0
> [ 1095.634790]  ? kthread_create_on_node+0x70/0x70
> [ 1095.634793]  ? ret_from_fork+0x31/0x40
> ----------
> 

Clearly we have some rescuer threads blocked on memory reclaim. What
about the rescuer threads that are not blocked in reclaim? For example,
the one that could very well be responsible for allowing xfsaild (and
thus kswapd) to proceed is here:

[ 1041.555227] xfs-buf/sda1    S14904   450      2 0x00000000
[ 1041.556813] Call Trace:
[ 1041.557796]  __schedule+0x336/0xe00
[ 1041.558983]  schedule+0x3d/0x90
[ 1041.560085]  rescuer_thread+0x322/0x3d0
[ 1041.561333]  kthread+0x10f/0x150
[ 1041.562464]  ? worker_thread+0x4b0/0x4b0
[ 1041.563732]  ? kthread_create_on_node+0x70/0x70
[ 1041.565123]  ret_from_fork+0x31/0x40

I've pointed this out at least twice now in the original thread. :/

>   (3) order-0 GFP_NOIO allocation request cannot make progress
>       waiting for memory allocation.
> 
> ----------
> [ 1095.631687] MemAlloc: kworker/2:0(27) flags=0x4208860 switches=38727 seq=21 gfp=0x1400000(GFP_NOIO) order=0 delay=652160
> [ 1095.631688] kworker/2:0     R  running task    12680    27      2 0x00000000
> [ 1095.631739] Workqueue: events_freezable_power_ disk_events_workfn
> [ 1095.631740] Call Trace:
> [ 1095.631743]  __schedule+0x336/0xe00
> [ 1095.631746]  preempt_schedule_common+0x1f/0x31
> [ 1095.631747]  _cond_resched+0x1c/0x30
> [ 1095.631749]  shrink_slab+0x339/0x6d0
> [ 1095.631754]  shrink_node+0x2fa/0x310
> [ 1095.631758]  do_try_to_free_pages+0xe1/0x300
> [ 1095.631761]  try_to_free_pages+0x131/0x3f0
> [ 1095.631765]  __alloc_pages_slowpath+0x3ec/0xd95
> [ 1095.631771]  __alloc_pages_nodemask+0x3e4/0x460
> [ 1095.631775]  alloc_pages_current+0x97/0x1b0
> [ 1095.631779]  bio_copy_kern+0xc9/0x180
> [ 1095.631830]  blk_rq_map_kern+0x70/0x140
> [ 1095.631835]  __scsi_execute.isra.22+0x13a/0x1e0
> [ 1095.631838]  scsi_execute_req_flags+0x94/0x100
> [ 1095.631844]  sr_check_events+0xbf/0x2b0 [sr_mod]
> [ 1095.631852]  cdrom_check_events+0x18/0x30 [cdrom]
> [ 1095.631854]  sr_block_check_events+0x2a/0x30 [sr_mod]
> [ 1095.631856]  disk_check_events+0x60/0x170
> [ 1095.631859]  disk_events_workfn+0x1c/0x20
> [ 1095.631862]  process_one_work+0x22b/0x760
> [ 1095.631863]  ? process_one_work+0x194/0x760
> [ 1095.631867]  worker_thread+0x137/0x4b0
> [ 1095.631887]  kthread+0x10f/0x150
> [ 1095.631889]  ? process_one_work+0x760/0x760
> [ 1095.631890]  ? kthread_create_on_node+0x70/0x70
> [ 1095.631893]  ret_from_fork+0x31/0x40
> ----------
> 
>   (4) Number of stalling threads does not decrease over time while
>       number of out_of_memory() calls increases over time.
> 
> ----------
> [  518.090012] MemAlloc-Info: stalling=184 dying=1 exiting=0 victim=1 oom_count=8441307
> [  553.070829] MemAlloc-Info: stalling=184 dying=1 exiting=0 victim=1 oom_count=10318507
> [  616.394649] MemAlloc-Info: stalling=186 dying=1 exiting=0 victim=1 oom_count=13908219
> [  642.266252] MemAlloc-Info: stalling=186 dying=1 exiting=0 victim=1 oom_count=15180673
> [  702.412189] MemAlloc-Info: stalling=187 dying=1 exiting=0 victim=1 oom_count=18732529
> [  736.787879] MemAlloc-Info: stalling=187 dying=1 exiting=0 victim=1 oom_count=20565244
> [  800.715759] MemAlloc-Info: stalling=188 dying=1 exiting=0 victim=1 oom_count=24411576
> [  837.571405] MemAlloc-Info: stalling=188 dying=1 exiting=0 victim=1 oom_count=26463562
> [  899.021495] MemAlloc-Info: stalling=189 dying=1 exiting=0 victim=1 oom_count=30144879
> [  936.282709] MemAlloc-Info: stalling=189 dying=1 exiting=0 victim=1 oom_count=32129234
> [  997.328119] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=35657983
> [ 1033.977265] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=37659912
> [ 1095.630961] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=40639677
> [ 1095.821248] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=40646791
> ----------
> 
> Further interpretation is up to the XFS people (Brian and Dave). So far, boosting the
> priority (lowering the watermark) of WQ_MEM_RECLAIM threads based on
> current_is_workqueue_rescuer() seems to be the fix under consideration, because the
> writeback path already has special-case handling for the rescuer. But I'm not sure
> whether it helps get writeback unstuck when the writeback path is blocked on a lock
> rather than on memory allocation. (Threads doing writeback and/or mempool allocations
> might be candidates for always being reported, the way kswapd is. That would be
> material for follow-up patches.)
> 

Using the rescuer context in those associated workqueues is probably a
good idea; it may even resolve (or work around) this problem. As noted
above, however, I don't think we've proven that it addresses this problem.

How would you expect current_is_workqueue_rescuer() to affect the idle
xfs-buf rescuer thread above? Is there something else going on that
could be blocking metadata buffer I/O completion via xfs-buf? We do have
the following, which I suppose suggests there is a pending work item:

[  273.412600] workqueue xfs-buf/sda1: flags=0xc
[  273.414486]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/1
[  273.416415]     pending: xfs_buf_ioend_work [xfs]

... but otherwise I don't see anything executing xfs_buf_ioend()..? Am I
missing something?
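
For anyone not familiar with the mechanism: an idle rescuer is woken only
via the mayday machinery, not by a pending work item per se. Heavily
abridged from v4.10-era kernel/workqueue.c:rescuer_thread() (locking and
pool attachment omitted; the simplification is mine):

----------
	for (;;) {
		while (!list_empty(&wq->maydays)) {
			struct pool_workqueue *pwq =
				list_first_entry(&wq->maydays,
					struct pool_workqueue, mayday_node);

			list_del_init(&pwq->mayday_node);
			/* attach to pwq->pool, splice its pending work
			 * onto the rescuer's list and execute it */
			process_scheduled_works(rescuer);
		}
		/* sleep until pool_mayday_timeout() sends the next
		 * mayday, i.e. until worker creation on some pool of
		 * this wq has been failing for MAYDAY_INITIAL_TIMEOUT */
		schedule();
	}
----------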

AFAICT, all the information we have so far shows that kswapd is
blocked in XFS, and that as a result we've ultimately deadlocked on
memory allocation.

Brian

> ----------
> [ 1044.701393] kworker/u256:0  D10024  8444      2 0x00000080
> [ 1044.703017] Workqueue: writeback wb_workfn (flush-8:0)
> [ 1044.704550] Call Trace:
> [ 1044.705550]  __schedule+0x336/0xe00
> [ 1044.706761]  schedule+0x3d/0x90
> [ 1044.707891]  io_schedule+0x16/0x40
> [ 1044.709073]  __lock_page+0x126/0x180
> [ 1044.710272]  ? page_cache_tree_insert+0x120/0x120
> [ 1044.711728]  write_cache_pages+0x39a/0x6b0
> [ 1044.713077]  ? xfs_map_blocks+0x5a0/0x5a0 [xfs]
> [ 1044.714499]  ? xfs_vm_writepages+0x48/0xa0 [xfs]
> [ 1044.715944]  xfs_vm_writepages+0x6b/0xa0 [xfs]
> [ 1044.717362]  do_writepages+0x21/0x40
> [ 1044.718577]  __writeback_single_inode+0x72/0x9d0
> [ 1044.719998]  ? _raw_spin_unlock+0x27/0x40
> [ 1044.721308]  writeback_sb_inodes+0x322/0x750
> [ 1044.722662]  __writeback_inodes_wb+0x8c/0xc0
> [ 1044.724009]  wb_writeback+0x3be/0x6e0
> [ 1044.725238]  wb_workfn+0x146/0x7d0
> [ 1044.726413]  ? process_one_work+0x194/0x760
> [ 1044.727736]  process_one_work+0x22b/0x760
> [ 1044.729017]  ? process_one_work+0x194/0x760
> [ 1044.730343]  worker_thread+0x137/0x4b0
> [ 1044.731577]  kthread+0x10f/0x150
> [ 1044.732702]  ? process_one_work+0x760/0x760
> [ 1044.734021]  ? kthread_create_on_node+0x70/0x70
> [ 1044.735415]  ? do_syscall_64+0x195/0x200
> [ 1044.736684]  ret_from_fork+0x31/0x40
> ----------
> 
> Regarding "infinite too_many_isolated() loop" case
> ( http://lkml.kernel.org/r/201702212335.DJB30777.JOFMHSFtVLQOOF@I-love.SAKURA.ne.jp ),
> I used this patch for confirming the following things.
> 
>   (1) kswapd cannot make progress waiting for lock.
> 
> ----------
> [ 1209.790966] MemAlloc: kswapd0(67) flags=0xa60840 switches=51139 uninterruptible
> [ 1209.799726] kswapd0         D10936    67      2 0x00000000
> [ 1209.807326] Call Trace:
> [ 1209.812581]  __schedule+0x336/0xe00
> [ 1209.818599]  schedule+0x3d/0x90
> [ 1209.823907]  schedule_timeout+0x26a/0x510
> [ 1209.827218]  ? trace_hardirqs_on+0xd/0x10
> [ 1209.830535]  __down_common+0xfb/0x131
> [ 1209.833801]  ? _xfs_buf_find+0x2cb/0xc10 [xfs]
> [ 1209.837372]  __down+0x1d/0x1f
> [ 1209.840331]  down+0x41/0x50
> [ 1209.843243]  xfs_buf_lock+0x64/0x370 [xfs]
> [ 1209.846597]  _xfs_buf_find+0x2cb/0xc10 [xfs]
> [ 1209.850031]  ? _xfs_buf_find+0xa4/0xc10 [xfs]
> [ 1209.853514]  xfs_buf_get_map+0x2a/0x480 [xfs]
> [ 1209.855831]  xfs_buf_read_map+0x2c/0x400 [xfs]
> [ 1209.857388]  ? free_debug_processing+0x27d/0x2af
> [ 1209.859037]  xfs_trans_read_buf_map+0x186/0x830 [xfs]
> [ 1209.860707]  xfs_read_agf+0xc8/0x2b0 [xfs]
> [ 1209.862184]  xfs_alloc_read_agf+0x7a/0x300 [xfs]
> [ 1209.863728]  ? xfs_alloc_space_available+0x7b/0x120 [xfs]
> [ 1209.865385]  xfs_alloc_fix_freelist+0x3bc/0x490 [xfs]
> [ 1209.866974]  ? __radix_tree_lookup+0x84/0xf0
> [ 1209.868374]  ? xfs_perag_get+0x1a0/0x310 [xfs]
> [ 1209.869798]  ? xfs_perag_get+0x5/0x310 [xfs]
> [ 1209.871288]  xfs_alloc_vextent+0x161/0xda0 [xfs]
> [ 1209.872757]  xfs_bmap_btalloc+0x46c/0x8b0 [xfs]
> [ 1209.874182]  ? save_stack_trace+0x1b/0x20
> [ 1209.875542]  xfs_bmap_alloc+0x17/0x30 [xfs]
> [ 1209.876847]  xfs_bmapi_write+0x74e/0x11d0 [xfs]
> [ 1209.878190]  xfs_iomap_write_allocate+0x199/0x3a0 [xfs]
> [ 1209.879632]  xfs_map_blocks+0x2cc/0x5a0 [xfs]
> [ 1209.880909]  xfs_do_writepage+0x215/0x920 [xfs]
> [ 1209.882255]  ? clear_page_dirty_for_io+0xb4/0x310
> [ 1209.883598]  xfs_vm_writepage+0x3b/0x70 [xfs]
> [ 1209.884841]  pageout.isra.54+0x1a4/0x460
> [ 1209.886210]  shrink_page_list+0xa86/0xcf0
> [ 1209.887441]  shrink_inactive_list+0x1c5/0x660
> [ 1209.888682]  shrink_node_memcg+0x535/0x7f0
> [ 1209.889975]  ? mem_cgroup_iter+0x14d/0x720
> [ 1209.891197]  shrink_node+0xe1/0x310
> [ 1209.892288]  kswapd+0x362/0x9b0
> [ 1209.893308]  kthread+0x10f/0x150
> [ 1209.894383]  ? mem_cgroup_shrink_node+0x3b0/0x3b0
> [ 1209.895703]  ? kthread_create_on_node+0x70/0x70
> [ 1209.896956]  ret_from_fork+0x31/0x40
> ----------
> 
>   (2) Both GFP_IO and GFP_KERNEL allocations are stuck at
>       too_many_isolated() loop.
> 
> ----------
> [ 1209.898117] MemAlloc: systemd-journal(526) flags=0x400900 switches=33248 seq=121659 gfp=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=52772 uninterruptible
> [ 1209.902154] systemd-journal D11240   526      1 0x00000000
> [ 1209.903642] Call Trace:
> [ 1209.904574]  __schedule+0x336/0xe00
> [ 1209.905734]  schedule+0x3d/0x90
> [ 1209.906817]  schedule_timeout+0x20d/0x510
> [ 1209.908025]  ? prepare_to_wait+0x2b/0xc0
> [ 1209.909268]  ? lock_timer_base+0xa0/0xa0
> [ 1209.910460]  io_schedule_timeout+0x1e/0x50
> [ 1209.911681]  congestion_wait+0x86/0x260
> [ 1209.912853]  ? remove_wait_queue+0x60/0x60
> [ 1209.914115]  shrink_inactive_list+0x5b4/0x660
> [ 1209.915385]  ? __list_lru_count_one.isra.2+0x22/0x80
> [ 1209.916768]  shrink_node_memcg+0x535/0x7f0
> [ 1209.918173]  shrink_node+0xe1/0x310
> [ 1209.919288]  do_try_to_free_pages+0xe1/0x300
> [ 1209.920548]  try_to_free_pages+0x131/0x3f0
> [ 1209.921827]  __alloc_pages_slowpath+0x3ec/0xd95
> [ 1209.923137]  __alloc_pages_nodemask+0x3e4/0x460
> [ 1209.924454]  ? __radix_tree_lookup+0x84/0xf0
> [ 1209.925790]  alloc_pages_current+0x97/0x1b0
> [ 1209.927021]  ? find_get_entry+0x5/0x300
> [ 1209.928189]  __page_cache_alloc+0x15d/0x1a0
> [ 1209.929471]  ? pagecache_get_page+0x2c/0x2b0
> [ 1209.930716]  filemap_fault+0x4df/0x8b0
> [ 1209.931867]  ? filemap_fault+0x373/0x8b0
> [ 1209.933111]  ? xfs_ilock+0x22c/0x360 [xfs]
> [ 1209.934510]  ? xfs_filemap_fault+0x64/0x1e0 [xfs]
> [ 1209.935857]  ? down_read_nested+0x7b/0xc0
> [ 1209.937123]  ? xfs_ilock+0x22c/0x360 [xfs]
> [ 1209.938373]  xfs_filemap_fault+0x6c/0x1e0 [xfs]
> [ 1209.939691]  __do_fault+0x1e/0xa0
> [ 1209.940807]  ? _raw_spin_unlock+0x27/0x40
> [ 1209.942002]  __handle_mm_fault+0xbb1/0xf40
> [ 1209.943228]  ? mutex_unlock+0x12/0x20
> [ 1209.944410]  ? devkmsg_read+0x15c/0x330
> [ 1209.945912]  handle_mm_fault+0x16b/0x390
> [ 1209.947297]  ? handle_mm_fault+0x49/0x390
> [ 1209.948868]  __do_page_fault+0x24a/0x530
> [ 1209.950351]  do_page_fault+0x30/0x80
> [ 1209.951615]  page_fault+0x28/0x30
> 
> [ 1210.538496] MemAlloc: kworker/3:0(6345) flags=0x4208860 switches=10134 seq=22 gfp=0x1400000(GFP_NOIO) order=0 delay=45953 uninterruptible
> [ 1210.541487] kworker/3:0     D12560  6345      2 0x00000080
> [ 1210.542991] Workqueue: events_freezable_power_ disk_events_workfn
> [ 1210.544577] Call Trace:
> [ 1210.545468]  __schedule+0x336/0xe00
> [ 1210.546606]  schedule+0x3d/0x90
> [ 1210.547616]  schedule_timeout+0x20d/0x510
> [ 1210.548778]  ? prepare_to_wait+0x2b/0xc0
> [ 1210.550013]  ? lock_timer_base+0xa0/0xa0
> [ 1210.551208]  io_schedule_timeout+0x1e/0x50
> [ 1210.552519]  congestion_wait+0x86/0x260
> [ 1210.553650]  ? remove_wait_queue+0x60/0x60
> [ 1210.554900]  shrink_inactive_list+0x5b4/0x660
> [ 1210.556119]  ? __list_lru_count_one.isra.2+0x22/0x80
> [ 1210.557447]  shrink_node_memcg+0x535/0x7f0
> [ 1210.558714]  shrink_node+0xe1/0x310
> [ 1210.559803]  do_try_to_free_pages+0xe1/0x300
> [ 1210.561009]  try_to_free_pages+0x131/0x3f0
> [ 1210.562250]  __alloc_pages_slowpath+0x3ec/0xd95
> [ 1210.563506]  __alloc_pages_nodemask+0x3e4/0x460
> [ 1210.564777]  alloc_pages_current+0x97/0x1b0
> [ 1210.566017]  bio_copy_kern+0xc9/0x180
> [ 1210.567116]  blk_rq_map_kern+0x70/0x140
> [ 1210.568356]  __scsi_execute.isra.22+0x13a/0x1e0
> [ 1210.569839]  scsi_execute_req_flags+0x94/0x100
> [ 1210.571218]  sr_check_events+0xbf/0x2b0 [sr_mod]
> [ 1210.572500]  cdrom_check_events+0x18/0x30 [cdrom]
> [ 1210.573934]  sr_block_check_events+0x2a/0x30 [sr_mod]
> [ 1210.575335]  disk_check_events+0x60/0x170
> [ 1210.576509]  disk_events_workfn+0x1c/0x20
> [ 1210.577744]  process_one_work+0x22b/0x760
> [ 1210.578934]  ? process_one_work+0x194/0x760
> [ 1210.580147]  worker_thread+0x137/0x4b0
> [ 1210.581336]  kthread+0x10f/0x150
> [ 1210.582365]  ? process_one_work+0x760/0x760
> [ 1210.583603]  ? kthread_create_on_node+0x70/0x70
> [ 1210.584961]  ? do_syscall_64+0x6c/0x200
> [ 1210.586343]  ret_from_fork+0x31/0x40
> ----------
> 
>   (3) Number of stalling threads does not decrease over time and
>       number of out_of_memory() calls does not increase over time.
> 
> ----------
> [ 1209.781787] MemAlloc-Info: stalling=32 dying=1 exiting=0 victim=1 oom_count=45896
> [ 1212.195351] MemAlloc-Info: stalling=32 dying=1 exiting=0 victim=1 oom_count=45896
> [ 1242.551629] MemAlloc-Info: stalling=36 dying=1 exiting=0 victim=1 oom_count=45896
> [ 1245.149165] MemAlloc-Info: stalling=36 dying=1 exiting=0 victim=1 oom_count=45896
> [ 1275.319189] MemAlloc-Info: stalling=40 dying=1 exiting=0 victim=1 oom_count=45896
> [ 1278.241813] MemAlloc-Info: stalling=40 dying=1 exiting=0 victim=1 oom_count=45896
> ----------
> 
> Regarding [2], Alexander, Brian and Dave can explain it better than I can. But
> I think the "check threads which cannot make progress" principle is the same.
> 
> 
> 
> > (top-posting repaired - please don't do that)
> 
> Hmm, I don't see anything wrong with the mail headers of the message I replied
> to that could cause top-posting. I think something weird occurred outside of my control.
> 
>   Subject: Re: [PATCH v7] mm: Add memory allocation watchdog kernel thread.
>   From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>   References: <1488244908-57586-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp>
>   In-Reply-To: <1488244908-57586-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp>
>   Message-Id: <201703091946.GDC21885.OQFFOtJHSOFVML@I-love.SAKURA.ne.jp>
>   Date: Thu, 9 Mar 2017 19:46:14 +0900
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7] mm: Add memory allocation watchdog kernel thread.
@ 2017-03-10 13:49           ` Brian Foster
  0 siblings, 0 replies; 22+ messages in thread
From: Brian Foster @ 2017-03-10 13:49 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, akpm, linux-mm, linux-kernel, hannes, mgorman, david, apolyakov

On Fri, Mar 10, 2017 at 08:19:58PM +0900, Tetsuo Handa wrote:
> Andrew Morton wrote:
> > On Thu, 9 Mar 2017 19:46:14 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> > > Tetsuo Handa wrote:
> > > > This patch adds a watchdog which periodically reports number of memory
> > > > allocating tasks, dying tasks and OOM victim tasks when some task is
> > > > spending too long time inside __alloc_pages_slowpath(). This patch also
> > > > serves as a hook for obtaining additional information using SystemTap
> > > > (e.g. examine other variables using printk(), capture a crash dump by
> > > > calling panic()) by triggering a callback only when a stall is detected.
> > > > Ability to take administrator-controlled actions based on some threshold
> > > > is a big advantage gained by introducing a state tracking.
> > > > 
> > > > Commit 63f53dea0c9866e9 ("mm: warn about allocations which stall for
> > > > too long") was a great step for reducing possibility of silent hang up
> > > > problem caused by memory allocation stalls [1]. However, there are
> > > > reports of long stalls (e.g. [2] is over 30 minutes!) and lockups (e.g.
> > > > [3] is an "unable to invoke the OOM killer due to !__GFP_FS allocation"
> > > > lockup problem) where this patch is more useful than that commit, for
> > > > this patch can report possibly related tasks even if allocating tasks
> > > > are unexpectedly blocked for so long. Regarding premature OOM killer
> > > > invocation, tracepoints which can accumulate samples in short interval
> > > > would be useful. But regarding too late to report allocation stalls,
> > > > this patch which can capture all tasks (for reporting overall situation)
> > > > in longer interval and act as a trigger (for accumulating short interval
> > > > samples) would be useful.
> > )
> > > Andrew, do you have any questions on this patch?
> > > I really need this patch for finding bugs which MM people overlook.
> > 
> > Undecided.  I can see the need but it is indeed quite a large lump of
> > code.  Perhaps some additional examples of how this new code was used
> > to understand and improve real-world kernel problems would be persuasive.
> 
> This patch is aimed for help bisecting whether unexpected hung cases are related to
> memory allocation. By merging this patch (and enabling this watchdog in enterprise
> systems via kernels supported by distributors), we can identify patterns/cases of
> problems (if related to memory allocation) and improve quality of Linux kernels
> by fixing problems related to memory allocation.
> 
> I was working at a support center for 3 years and had many cases where Linux
> systems encountered unexpected silent hangup/reboot problem. In most cases, all
> clue available was limited to sar (in sysstat package) showing that free memory
> was low just before unexpected hung/reboot.
> 
> As a nature of such problems, it is very hard for administrators to collect
> information for analysis; let alone identify the cause of problems. As a result,
> such problems are left unrecognized/unsolved at the support center, and are
> seldom reported to distributors/developers in order to ask for fixes.
> 
> Michal Hocko wrote:
> > Well, it is true that the watchdog can provide much more information
> > than we can gather with other debugging options we currently have when
> > allocations run away.  So I am not questioning this is useful when doing
> > memory stress testing and trying to trigger corner cases but I am not
> > really sure how much this will be useful in production systems, though.
> > Tracking is not for free both in runtime and longterm in maintenance.
> 
> I don't think that my stress testing is hitting bugs which do not occur in
> production systems. But I can't prove it without using watchdog mechanism.
> Today I heard about a case where an enterprise system hung last Sunday due to
> XFS deadlock by memory allocation. This was rather a lucky case where
> administrator of that system was able to capture vmcore. In most cases,
> administrators can't capture even SysRq-t; let alone vmcore. Therefore,
> automatic watchdog is highly appreciated.
> 
> 
> 
> Or, did you mean "some additional examples of how this new code was used
> to understand and improve real-world kernel problems" as how this patch
> helped in [2] [3] ?
> 
> Regarding [3] (now continued as
> http://lkml.kernel.org/r/201703031948.CHJ81278.VOHSFFFOOLJQMt@I-love.SAKURA.ne.jp ),
> I used this patch for confirming the following things.
> 
>   (1) kswapd cannot make progress waiting for lock.
> 
> ----------
> [  518.900038] MemAlloc: kswapd0(69) flags=0xa40840 switches=23883 uninterruptible
> [  518.902095] kswapd0         D10776    69      2 0x00000000
> [  518.903784] Call Trace:
> [  518.904849]  __schedule+0x336/0xe00
> [  518.906118]  schedule+0x3d/0x90
> [  518.907314]  io_schedule+0x16/0x40
> [  518.908622]  __xfs_iflock+0x129/0x140 [xfs]
> [  518.910027]  ? autoremove_wake_function+0x60/0x60
> [  518.911559]  xfs_reclaim_inode+0x162/0x440 [xfs]
> [  518.913068]  xfs_reclaim_inodes_ag+0x2cf/0x4f0 [xfs]
> [  518.914611]  ? xfs_reclaim_inodes_ag+0xf2/0x4f0 [xfs]
> [  518.916148]  ? trace_hardirqs_on+0xd/0x10
> [  518.917465]  ? try_to_wake_up+0x59/0x7a0
> [  518.918758]  ? wake_up_process+0x15/0x20
> [  518.920067]  xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
> [  518.921560]  xfs_fs_free_cached_objects+0x19/0x20 [xfs]
> [  518.923114]  super_cache_scan+0x181/0x190
> [  518.924435]  shrink_slab+0x29f/0x6d0
> [  518.925683]  shrink_node+0x2fa/0x310
> [  518.926909]  kswapd+0x362/0x9b0
> [  518.928061]  kthread+0x10f/0x150
> [  518.929218]  ? mem_cgroup_shrink_node+0x3b0/0x3b0
> [  518.930953]  ? kthread_create_on_node+0x70/0x70
> [  518.932380]  ret_from_fork+0x31/0x40
> 
> [ 1095.632984] MemAlloc: kswapd0(69) flags=0xa40840 switches=23883 uninterruptible
> [ 1095.632985] kswapd0         D10776    69      2 0x00000000
> [ 1095.632988] Call Trace:
> [ 1095.632991]  __schedule+0x336/0xe00
> [ 1095.632994]  schedule+0x3d/0x90
> [ 1095.632996]  io_schedule+0x16/0x40
> [ 1095.633017]  __xfs_iflock+0x129/0x140 [xfs]
> [ 1095.633021]  ? autoremove_wake_function+0x60/0x60
> [ 1095.633051]  xfs_reclaim_inode+0x162/0x440 [xfs]
> [ 1095.633072]  xfs_reclaim_inodes_ag+0x2cf/0x4f0 [xfs]
> [ 1095.633106]  ? xfs_reclaim_inodes_ag+0xf2/0x4f0 [xfs]
> [ 1095.633114]  ? trace_hardirqs_on+0xd/0x10
> [ 1095.633116]  ? try_to_wake_up+0x59/0x7a0
> [ 1095.633120]  ? wake_up_process+0x15/0x20
> [ 1095.633156]  xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
> [ 1095.633178]  xfs_fs_free_cached_objects+0x19/0x20 [xfs]
> [ 1095.633180]  super_cache_scan+0x181/0x190
> [ 1095.633183]  shrink_slab+0x29f/0x6d0
> [ 1095.633189]  shrink_node+0x2fa/0x310
> [ 1095.633193]  kswapd+0x362/0x9b0
> [ 1095.633200]  kthread+0x10f/0x150
> [ 1095.633201]  ? mem_cgroup_shrink_node+0x3b0/0x3b0
> [ 1095.633202]  ? kthread_create_on_node+0x70/0x70
> [ 1095.633205]  ret_from_fork+0x31/0x40
> ----------
> 
>   (2) All WQ_MEM_RECLAIM threads shown by show_workqueue_state()
>       cannot make progress waiting for memory allocation.
> 
> ----------
> [ 1095.633625] MemAlloc: xfs-data/sda1(451) flags=0x4228060 switches=45509 seq=1 gfp=0x1604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK) order=0 delay=652073
> [ 1095.633626] xfs-data/sda1   R  running task    12696   451      2 0x00000000
> [ 1095.633663] Workqueue: xfs-data/sda1 xfs_end_io [xfs]
> [ 1095.633665] Call Trace:
> [ 1095.633668]  __schedule+0x336/0xe00
> [ 1095.633671]  schedule+0x3d/0x90
> [ 1095.633672]  schedule_timeout+0x20d/0x510
> [ 1095.633675]  ? lock_timer_base+0xa0/0xa0
> [ 1095.633678]  schedule_timeout_uninterruptible+0x2a/0x30
> [ 1095.633680]  __alloc_pages_slowpath+0x2b5/0xd95
> [ 1095.633687]  __alloc_pages_nodemask+0x3e4/0x460
> [ 1095.633699]  alloc_pages_current+0x97/0x1b0
> [ 1095.633702]  new_slab+0x4cb/0x6b0
> [ 1095.633706]  ___slab_alloc+0x3a3/0x620
> [ 1095.633728]  ? kmem_alloc+0x96/0x120 [xfs]
> [ 1095.633730]  ? ___slab_alloc+0x5c6/0x620
> [ 1095.633732]  ? cpuacct_charge+0x38/0x1e0
> [ 1095.633767]  ? kmem_alloc+0x96/0x120 [xfs]
> [ 1095.633770]  __slab_alloc+0x46/0x7d
> [ 1095.633773]  __kmalloc+0x301/0x3b0
> [ 1095.633802]  kmem_alloc+0x96/0x120 [xfs]
> [ 1095.633804]  ? kfree+0x1fa/0x330
> [ 1095.633842]  xfs_log_commit_cil+0x489/0x710 [xfs]
> [ 1095.633864]  __xfs_trans_commit+0x83/0x260 [xfs]
> [ 1095.633883]  xfs_trans_commit+0x10/0x20 [xfs]
> [ 1095.633901]  __xfs_setfilesize+0xdb/0x240 [xfs]
> [ 1095.633936]  xfs_setfilesize_ioend+0x89/0xb0 [xfs]
> [ 1095.633954]  ? xfs_setfilesize_ioend+0x5/0xb0 [xfs]
> [ 1095.633971]  xfs_end_io+0x81/0x110 [xfs]
> [ 1095.633973]  process_one_work+0x22b/0x760
> [ 1095.633975]  ? process_one_work+0x194/0x760
> [ 1095.633997]  rescuer_thread+0x1f2/0x3d0
> [ 1095.634002]  kthread+0x10f/0x150
> [ 1095.634003]  ? worker_thread+0x4b0/0x4b0
> [ 1095.634004]  ? kthread_create_on_node+0x70/0x70
> [ 1095.634007]  ret_from_fork+0x31/0x40
> [ 1095.634013] MemAlloc: xfs-eofblocks/s(456) flags=0x4228860 switches=15435 seq=1 gfp=0x1400240(GFP_NOFS|__GFP_NOWARN) order=0 delay=293074
> [ 1095.634014] xfs-eofblocks/s R  running task    12032   456      2 0x00000000
> [ 1095.634037] Workqueue: xfs-eofblocks/sda1 xfs_eofblocks_worker [xfs]
> [ 1095.634038] Call Trace:
> [ 1095.634040]  ? _raw_spin_lock+0x3d/0x80
> [ 1095.634042]  ? vmpressure+0xd0/0x120
> [ 1095.634044]  ? vmpressure+0xd0/0x120
> [ 1095.634047]  ? vmpressure_prio+0x21/0x30
> [ 1095.634049]  ? do_try_to_free_pages+0x70/0x300
> [ 1095.634052]  ? try_to_free_pages+0x131/0x3f0
> [ 1095.634058]  ? __alloc_pages_slowpath+0x3ec/0xd95
> [ 1095.634065]  ? __alloc_pages_nodemask+0x3e4/0x460
> [ 1095.634069]  ? alloc_pages_current+0x97/0x1b0
> [ 1095.634111]  ? xfs_buf_allocate_memory+0x160/0x2a3 [xfs]
> [ 1095.634133]  ? xfs_buf_get_map+0x2be/0x480 [xfs]
> [ 1095.634169]  ? xfs_buf_read_map+0x2c/0x400 [xfs]
> [ 1095.634204]  ? xfs_trans_read_buf_map+0x186/0x830 [xfs]
> [ 1095.634222]  ? xfs_btree_read_buf_block.constprop.34+0x78/0xc0 [xfs]
> [ 1095.634239]  ? xfs_btree_lookup_get_block+0x8a/0x180 [xfs]
> [ 1095.634257]  ? xfs_btree_lookup+0xd0/0x3f0 [xfs]
> [ 1095.634296]  ? kmem_zone_alloc+0x96/0x120 [xfs]
> [ 1095.634299]  ? _raw_spin_unlock+0x27/0x40
> [ 1095.634315]  ? xfs_bmbt_lookup_eq+0x1f/0x30 [xfs]
> [ 1095.634348]  ? xfs_bmap_del_extent+0x1b2/0x1610 [xfs]
> [ 1095.634380]  ? kmem_zone_alloc+0x96/0x120 [xfs]
> [ 1095.634400]  ? __xfs_bunmapi+0x4db/0xda0 [xfs]
> [ 1095.634421]  ? xfs_bunmapi+0x2b/0x40 [xfs]
> [ 1095.634459]  ? xfs_itruncate_extents+0x1df/0x780 [xfs]
> [ 1095.634502]  ? xfs_rename+0xc70/0x1080 [xfs]
> [ 1095.634525]  ? xfs_free_eofblocks+0x1c4/0x230 [xfs]
> [ 1095.634546]  ? xfs_inode_free_eofblocks+0x18d/0x280 [xfs]
> [ 1095.634565]  ? xfs_inode_ag_walk.isra.13+0x2b5/0x620 [xfs]
> [ 1095.634582]  ? xfs_inode_ag_walk.isra.13+0x91/0x620 [xfs]
> [ 1095.634618]  ? xfs_inode_clear_eofblocks_tag+0x1a0/0x1a0 [xfs]
> [ 1095.634630]  ? radix_tree_next_chunk+0x10b/0x2d0
> [ 1095.634635]  ? radix_tree_gang_lookup_tag+0xd7/0x150
> [ 1095.634672]  ? xfs_perag_get_tag+0x11d/0x370 [xfs]
> [ 1095.634690]  ? xfs_perag_get_tag+0x5/0x370 [xfs]
> [ 1095.634709]  ? xfs_inode_ag_iterator_tag+0x71/0xa0 [xfs]
> [ 1095.634726]  ? xfs_inode_clear_eofblocks_tag+0x1a0/0x1a0 [xfs]
> [ 1095.634744]  ? __xfs_icache_free_eofblocks+0x3b/0x40 [xfs]
> [ 1095.634759]  ? xfs_eofblocks_worker+0x27/0x40 [xfs]
> [ 1095.634762]  ? process_one_work+0x22b/0x760
> [ 1095.634763]  ? process_one_work+0x194/0x760
> [ 1095.634784]  ? rescuer_thread+0x1f2/0x3d0
> [ 1095.634788]  ? kthread+0x10f/0x150
> [ 1095.634789]  ? worker_thread+0x4b0/0x4b0
> [ 1095.634790]  ? kthread_create_on_node+0x70/0x70
> [ 1095.634793]  ? ret_from_fork+0x31/0x40
> ----------
> 

Clearly we have some rescuer threads blocked on memory reclaim. What
about the rescuer threads that are not blocked in reclaim? For example,
the one that could very well be responsible for allowing xfsaild (and
thus kswapd) to proceed is here:

[ 1041.555227] xfs-buf/sda1    S14904   450      2 0x00000000
[ 1041.556813] Call Trace:
[ 1041.557796]  __schedule+0x336/0xe00
[ 1041.558983]  schedule+0x3d/0x90
[ 1041.560085]  rescuer_thread+0x322/0x3d0
[ 1041.561333]  kthread+0x10f/0x150
[ 1041.562464]  ? worker_thread+0x4b0/0x4b0
[ 1041.563732]  ? kthread_create_on_node+0x70/0x70
[ 1041.565123]  ret_from_fork+0x31/0x40

I've pointed this out at least twice now in the original thread. :/

>   (3) order-0 GFP_NOIO allocation request cannot make progress
>       waiting for memory allocation.
> 
> ----------
> [ 1095.631687] MemAlloc: kworker/2:0(27) flags=0x4208860 switches=38727 seq=21 gfp=0x1400000(GFP_NOIO) order=0 delay=652160
> [ 1095.631688] kworker/2:0     R  running task    12680    27      2 0x00000000
> [ 1095.631739] Workqueue: events_freezable_power_ disk_events_workfn
> [ 1095.631740] Call Trace:
> [ 1095.631743]  __schedule+0x336/0xe00
> [ 1095.631746]  preempt_schedule_common+0x1f/0x31
> [ 1095.631747]  _cond_resched+0x1c/0x30
> [ 1095.631749]  shrink_slab+0x339/0x6d0
> [ 1095.631754]  shrink_node+0x2fa/0x310
> [ 1095.631758]  do_try_to_free_pages+0xe1/0x300
> [ 1095.631761]  try_to_free_pages+0x131/0x3f0
> [ 1095.631765]  __alloc_pages_slowpath+0x3ec/0xd95
> [ 1095.631771]  __alloc_pages_nodemask+0x3e4/0x460
> [ 1095.631775]  alloc_pages_current+0x97/0x1b0
> [ 1095.631779]  bio_copy_kern+0xc9/0x180
> [ 1095.631830]  blk_rq_map_kern+0x70/0x140
> [ 1095.631835]  __scsi_execute.isra.22+0x13a/0x1e0
> [ 1095.631838]  scsi_execute_req_flags+0x94/0x100
> [ 1095.631844]  sr_check_events+0xbf/0x2b0 [sr_mod]
> [ 1095.631852]  cdrom_check_events+0x18/0x30 [cdrom]
> [ 1095.631854]  sr_block_check_events+0x2a/0x30 [sr_mod]
> [ 1095.631856]  disk_check_events+0x60/0x170
> [ 1095.631859]  disk_events_workfn+0x1c/0x20
> [ 1095.631862]  process_one_work+0x22b/0x760
> [ 1095.631863]  ? process_one_work+0x194/0x760
> [ 1095.631867]  worker_thread+0x137/0x4b0
> [ 1095.631887]  kthread+0x10f/0x150
> [ 1095.631889]  ? process_one_work+0x760/0x760
> [ 1095.631890]  ? kthread_create_on_node+0x70/0x70
> [ 1095.631893]  ret_from_fork+0x31/0x40
> ----------
> 
>   (4) Number of stalling threads does not decrease over time while
>       number of out_of_memory() calls increases over time.
> 
> ----------
> [  518.090012] MemAlloc-Info: stalling=184 dying=1 exiting=0 victim=1 oom_count=8441307
> [  553.070829] MemAlloc-Info: stalling=184 dying=1 exiting=0 victim=1 oom_count=10318507
> [  616.394649] MemAlloc-Info: stalling=186 dying=1 exiting=0 victim=1 oom_count=13908219
> [  642.266252] MemAlloc-Info: stalling=186 dying=1 exiting=0 victim=1 oom_count=15180673
> [  702.412189] MemAlloc-Info: stalling=187 dying=1 exiting=0 victim=1 oom_count=18732529
> [  736.787879] MemAlloc-Info: stalling=187 dying=1 exiting=0 victim=1 oom_count=20565244
> [  800.715759] MemAlloc-Info: stalling=188 dying=1 exiting=0 victim=1 oom_count=24411576
> [  837.571405] MemAlloc-Info: stalling=188 dying=1 exiting=0 victim=1 oom_count=26463562
> [  899.021495] MemAlloc-Info: stalling=189 dying=1 exiting=0 victim=1 oom_count=30144879
> [  936.282709] MemAlloc-Info: stalling=189 dying=1 exiting=0 victim=1 oom_count=32129234
> [  997.328119] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=35657983
> [ 1033.977265] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=37659912
> [ 1095.630961] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=40639677
> [ 1095.821248] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=40646791
> ----------
> 
> Further interpretation is up to XFS people (Brian and Dave). So far, boosting priority
> (lowering watermark) of WQ_MEM_RECLAIM threads based on current_is_workqueue_rescuer()
> seems to be considered as a fix because writeback path already has a special case
> handling for the rescuer. But I'm not sure whether it helps getting writeback unstuck
> when writeback path is blocked on a lock rather than memory allocation. (Threads doing
> writeback and/or mempool allocations might be candidates for always reporting as with
> kswapd. That would be in followup patches.)
> 

Using the rescuer context in those associated workqueues is probably a
good idea, it may even resolve (or workaround) this problem. As noted
above, however, I don't think we've proven that addresses this problem.

How would you expect current_is_workqueue_rescuer() to affect the idle
xfs-buf rescuer thread above? Is there something else going on that
could be blocking metadata buffer I/O completion via xfs-buf? We do have
the following, which I suppose suggests there is a pending work item:

[  273.412600] workqueue xfs-buf/sda1: flags=0xc
[  273.414486]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/1
[  273.416415]     pending: xfs_buf_ioend_work [xfs]

... but otherwise I don't see anything executing xfs_buf_ioend()...? Am I
missing something?

AFAICT, all the information we have so far shows that kswapd is blocked
in XFS and, as a result, we've ultimately deadlocked on memory
allocation.

Brian

> ----------
> [ 1044.701393] kworker/u256:0  D10024  8444      2 0x00000080
> [ 1044.703017] Workqueue: writeback wb_workfn (flush-8:0)
> [ 1044.704550] Call Trace:
> [ 1044.705550]  __schedule+0x336/0xe00
> [ 1044.706761]  schedule+0x3d/0x90
> [ 1044.707891]  io_schedule+0x16/0x40
> [ 1044.709073]  __lock_page+0x126/0x180
> [ 1044.710272]  ? page_cache_tree_insert+0x120/0x120
> [ 1044.711728]  write_cache_pages+0x39a/0x6b0
> [ 1044.713077]  ? xfs_map_blocks+0x5a0/0x5a0 [xfs]
> [ 1044.714499]  ? xfs_vm_writepages+0x48/0xa0 [xfs]
> [ 1044.715944]  xfs_vm_writepages+0x6b/0xa0 [xfs]
> [ 1044.717362]  do_writepages+0x21/0x40
> [ 1044.718577]  __writeback_single_inode+0x72/0x9d0
> [ 1044.719998]  ? _raw_spin_unlock+0x27/0x40
> [ 1044.721308]  writeback_sb_inodes+0x322/0x750
> [ 1044.722662]  __writeback_inodes_wb+0x8c/0xc0
> [ 1044.724009]  wb_writeback+0x3be/0x6e0
> [ 1044.725238]  wb_workfn+0x146/0x7d0
> [ 1044.726413]  ? process_one_work+0x194/0x760
> [ 1044.727736]  process_one_work+0x22b/0x760
> [ 1044.729017]  ? process_one_work+0x194/0x760
> [ 1044.730343]  worker_thread+0x137/0x4b0
> [ 1044.731577]  kthread+0x10f/0x150
> [ 1044.732702]  ? process_one_work+0x760/0x760
> [ 1044.734021]  ? kthread_create_on_node+0x70/0x70
> [ 1044.735415]  ? do_syscall_64+0x195/0x200
> [ 1044.736684]  ret_from_fork+0x31/0x40
> ----------
> 
> Regarding "infinite too_many_isolated() loop" case
> ( http://lkml.kernel.org/r/201702212335.DJB30777.JOFMHSFtVLQOOF@I-love.SAKURA.ne.jp ),
> I used this patch to confirm the following things.
> 
>   (1) kswapd cannot make progress waiting for lock.
> 
> ----------
> [ 1209.790966] MemAlloc: kswapd0(67) flags=0xa60840 switches=51139 uninterruptible
> [ 1209.799726] kswapd0         D10936    67      2 0x00000000
> [ 1209.807326] Call Trace:
> [ 1209.812581]  __schedule+0x336/0xe00
> [ 1209.818599]  schedule+0x3d/0x90
> [ 1209.823907]  schedule_timeout+0x26a/0x510
> [ 1209.827218]  ? trace_hardirqs_on+0xd/0x10
> [ 1209.830535]  __down_common+0xfb/0x131
> [ 1209.833801]  ? _xfs_buf_find+0x2cb/0xc10 [xfs]
> [ 1209.837372]  __down+0x1d/0x1f
> [ 1209.840331]  down+0x41/0x50
> [ 1209.843243]  xfs_buf_lock+0x64/0x370 [xfs]
> [ 1209.846597]  _xfs_buf_find+0x2cb/0xc10 [xfs]
> [ 1209.850031]  ? _xfs_buf_find+0xa4/0xc10 [xfs]
> [ 1209.853514]  xfs_buf_get_map+0x2a/0x480 [xfs]
> [ 1209.855831]  xfs_buf_read_map+0x2c/0x400 [xfs]
> [ 1209.857388]  ? free_debug_processing+0x27d/0x2af
> [ 1209.859037]  xfs_trans_read_buf_map+0x186/0x830 [xfs]
> [ 1209.860707]  xfs_read_agf+0xc8/0x2b0 [xfs]
> [ 1209.862184]  xfs_alloc_read_agf+0x7a/0x300 [xfs]
> [ 1209.863728]  ? xfs_alloc_space_available+0x7b/0x120 [xfs]
> [ 1209.865385]  xfs_alloc_fix_freelist+0x3bc/0x490 [xfs]
> [ 1209.866974]  ? __radix_tree_lookup+0x84/0xf0
> [ 1209.868374]  ? xfs_perag_get+0x1a0/0x310 [xfs]
> [ 1209.869798]  ? xfs_perag_get+0x5/0x310 [xfs]
> [ 1209.871288]  xfs_alloc_vextent+0x161/0xda0 [xfs]
> [ 1209.872757]  xfs_bmap_btalloc+0x46c/0x8b0 [xfs]
> [ 1209.874182]  ? save_stack_trace+0x1b/0x20
> [ 1209.875542]  xfs_bmap_alloc+0x17/0x30 [xfs]
> [ 1209.876847]  xfs_bmapi_write+0x74e/0x11d0 [xfs]
> [ 1209.878190]  xfs_iomap_write_allocate+0x199/0x3a0 [xfs]
> [ 1209.879632]  xfs_map_blocks+0x2cc/0x5a0 [xfs]
> [ 1209.880909]  xfs_do_writepage+0x215/0x920 [xfs]
> [ 1209.882255]  ? clear_page_dirty_for_io+0xb4/0x310
> [ 1209.883598]  xfs_vm_writepage+0x3b/0x70 [xfs]
> [ 1209.884841]  pageout.isra.54+0x1a4/0x460
> [ 1209.886210]  shrink_page_list+0xa86/0xcf0
> [ 1209.887441]  shrink_inactive_list+0x1c5/0x660
> [ 1209.888682]  shrink_node_memcg+0x535/0x7f0
> [ 1209.889975]  ? mem_cgroup_iter+0x14d/0x720
> [ 1209.891197]  shrink_node+0xe1/0x310
> [ 1209.892288]  kswapd+0x362/0x9b0
> [ 1209.893308]  kthread+0x10f/0x150
> [ 1209.894383]  ? mem_cgroup_shrink_node+0x3b0/0x3b0
> [ 1209.895703]  ? kthread_create_on_node+0x70/0x70
> [ 1209.896956]  ret_from_fork+0x31/0x40
> ----------
> 
>   (2) Both GFP_IO and GFP_KERNEL allocations are stuck at
>       too_many_isolated() loop.
> 
> ----------
> [ 1209.898117] MemAlloc: systemd-journal(526) flags=0x400900 switches=33248 seq=121659 gfp=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=52772 uninterruptible
> [ 1209.902154] systemd-journal D11240   526      1 0x00000000
> [ 1209.903642] Call Trace:
> [ 1209.904574]  __schedule+0x336/0xe00
> [ 1209.905734]  schedule+0x3d/0x90
> [ 1209.906817]  schedule_timeout+0x20d/0x510
> [ 1209.908025]  ? prepare_to_wait+0x2b/0xc0
> [ 1209.909268]  ? lock_timer_base+0xa0/0xa0
> [ 1209.910460]  io_schedule_timeout+0x1e/0x50
> [ 1209.911681]  congestion_wait+0x86/0x260
> [ 1209.912853]  ? remove_wait_queue+0x60/0x60
> [ 1209.914115]  shrink_inactive_list+0x5b4/0x660
> [ 1209.915385]  ? __list_lru_count_one.isra.2+0x22/0x80
> [ 1209.916768]  shrink_node_memcg+0x535/0x7f0
> [ 1209.918173]  shrink_node+0xe1/0x310
> [ 1209.919288]  do_try_to_free_pages+0xe1/0x300
> [ 1209.920548]  try_to_free_pages+0x131/0x3f0
> [ 1209.921827]  __alloc_pages_slowpath+0x3ec/0xd95
> [ 1209.923137]  __alloc_pages_nodemask+0x3e4/0x460
> [ 1209.924454]  ? __radix_tree_lookup+0x84/0xf0
> [ 1209.925790]  alloc_pages_current+0x97/0x1b0
> [ 1209.927021]  ? find_get_entry+0x5/0x300
> [ 1209.928189]  __page_cache_alloc+0x15d/0x1a0
> [ 1209.929471]  ? pagecache_get_page+0x2c/0x2b0
> [ 1209.930716]  filemap_fault+0x4df/0x8b0
> [ 1209.931867]  ? filemap_fault+0x373/0x8b0
> [ 1209.933111]  ? xfs_ilock+0x22c/0x360 [xfs]
> [ 1209.934510]  ? xfs_filemap_fault+0x64/0x1e0 [xfs]
> [ 1209.935857]  ? down_read_nested+0x7b/0xc0
> [ 1209.937123]  ? xfs_ilock+0x22c/0x360 [xfs]
> [ 1209.938373]  xfs_filemap_fault+0x6c/0x1e0 [xfs]
> [ 1209.939691]  __do_fault+0x1e/0xa0
> [ 1209.940807]  ? _raw_spin_unlock+0x27/0x40
> [ 1209.942002]  __handle_mm_fault+0xbb1/0xf40
> [ 1209.943228]  ? mutex_unlock+0x12/0x20
> [ 1209.944410]  ? devkmsg_read+0x15c/0x330
> [ 1209.945912]  handle_mm_fault+0x16b/0x390
> [ 1209.947297]  ? handle_mm_fault+0x49/0x390
> [ 1209.948868]  __do_page_fault+0x24a/0x530
> [ 1209.950351]  do_page_fault+0x30/0x80
> [ 1209.951615]  page_fault+0x28/0x30
> 
> [ 1210.538496] MemAlloc: kworker/3:0(6345) flags=0x4208860 switches=10134 seq=22 gfp=0x1400000(GFP_NOIO) order=0 delay=45953 uninterruptible
> [ 1210.541487] kworker/3:0     D12560  6345      2 0x00000080
> [ 1210.542991] Workqueue: events_freezable_power_ disk_events_workfn
> [ 1210.544577] Call Trace:
> [ 1210.545468]  __schedule+0x336/0xe00
> [ 1210.546606]  schedule+0x3d/0x90
> [ 1210.547616]  schedule_timeout+0x20d/0x510
> [ 1210.548778]  ? prepare_to_wait+0x2b/0xc0
> [ 1210.550013]  ? lock_timer_base+0xa0/0xa0
> [ 1210.551208]  io_schedule_timeout+0x1e/0x50
> [ 1210.552519]  congestion_wait+0x86/0x260
> [ 1210.553650]  ? remove_wait_queue+0x60/0x60
> [ 1210.554900]  shrink_inactive_list+0x5b4/0x660
> [ 1210.556119]  ? __list_lru_count_one.isra.2+0x22/0x80
> [ 1210.557447]  shrink_node_memcg+0x535/0x7f0
> [ 1210.558714]  shrink_node+0xe1/0x310
> [ 1210.559803]  do_try_to_free_pages+0xe1/0x300
> [ 1210.561009]  try_to_free_pages+0x131/0x3f0
> [ 1210.562250]  __alloc_pages_slowpath+0x3ec/0xd95
> [ 1210.563506]  __alloc_pages_nodemask+0x3e4/0x460
> [ 1210.564777]  alloc_pages_current+0x97/0x1b0
> [ 1210.566017]  bio_copy_kern+0xc9/0x180
> [ 1210.567116]  blk_rq_map_kern+0x70/0x140
> [ 1210.568356]  __scsi_execute.isra.22+0x13a/0x1e0
> [ 1210.569839]  scsi_execute_req_flags+0x94/0x100
> [ 1210.571218]  sr_check_events+0xbf/0x2b0 [sr_mod]
> [ 1210.572500]  cdrom_check_events+0x18/0x30 [cdrom]
> [ 1210.573934]  sr_block_check_events+0x2a/0x30 [sr_mod]
> [ 1210.575335]  disk_check_events+0x60/0x170
> [ 1210.576509]  disk_events_workfn+0x1c/0x20
> [ 1210.577744]  process_one_work+0x22b/0x760
> [ 1210.578934]  ? process_one_work+0x194/0x760
> [ 1210.580147]  worker_thread+0x137/0x4b0
> [ 1210.581336]  kthread+0x10f/0x150
> [ 1210.582365]  ? process_one_work+0x760/0x760
> [ 1210.583603]  ? kthread_create_on_node+0x70/0x70
> [ 1210.584961]  ? do_syscall_64+0x6c/0x200
> [ 1210.586343]  ret_from_fork+0x31/0x40
> ----------
> 
>   (3) Number of stalling threads does not decrease over time and
>       number of out_of_memory() calls does not increase over time.
> 
> ----------
> [ 1209.781787] MemAlloc-Info: stalling=32 dying=1 exiting=0 victim=1 oom_count=45896
> [ 1212.195351] MemAlloc-Info: stalling=32 dying=1 exiting=0 victim=1 oom_count=45896
> [ 1242.551629] MemAlloc-Info: stalling=36 dying=1 exiting=0 victim=1 oom_count=45896
> [ 1245.149165] MemAlloc-Info: stalling=36 dying=1 exiting=0 victim=1 oom_count=45896
> [ 1275.319189] MemAlloc-Info: stalling=40 dying=1 exiting=0 victim=1 oom_count=45896
> [ 1278.241813] MemAlloc-Info: stalling=40 dying=1 exiting=0 victim=1 oom_count=45896
> ----------
> 
> Regarding [2], Alexander, Brian and Dave can explain it better than I can.
> But I think the "check threads which cannot make progress" principle is the same.
> 
> 
> 
> > (top-posting repaired - please don't do that)
> 
> Hmm, I don't see anything wrong with the mail headers of the message I replied
> to that could cause top-posting. I think something weird occurred outside of my control.
> 
>   Subject: Re: [PATCH v7] mm: Add memory allocation watchdog kernel thread.
>   From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>   References: <1488244908-57586-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp>
>   In-Reply-To: <1488244908-57586-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp>
>   Message-Id: <201703091946.GDC21885.OQFFOtJHSOFVML@I-love.SAKURA.ne.jp>
>   Date: Thu, 9 Mar 2017 19:46:14 +0900

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7] mm: Add memory allocation watchdog kernel thread.
  2017-03-10 11:19         ` Tetsuo Handa
@ 2017-03-10 15:26           ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2017-03-10 15:26 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, linux-kernel, hannes, mgorman, david, apolyakov

On Fri 10-03-17 20:19:58, Tetsuo Handa wrote:
[...]
> Or, did you mean "some additional examples of how this new code was used
> to understand and improve real-world kernel problems" as in how this patch
> helped in [2] and [3]?
> 
> Regarding [3] (now continued as
> http://lkml.kernel.org/r/201703031948.CHJ81278.VOHSFFFOOLJQMt@I-love.SAKURA.ne.jp ),
> I used this patch to confirm the following things.
> 
>   (1) kswapd cannot make progress waiting for lock.
> 
> ----------
> [  518.900038] MemAlloc: kswapd0(69) flags=0xa40840 switches=23883 uninterruptible
> [  518.902095] kswapd0         D10776    69      2 0x00000000
> [  518.903784] Call Trace:
[...]
> 
> [ 1095.632984] MemAlloc: kswapd0(69) flags=0xa40840 switches=23883 uninterruptible
> [ 1095.632985] kswapd0         D10776    69      2 0x00000000
> [ 1095.632988] Call Trace:
[...]
> ----------
> 
>   (2) All WQ_MEM_RECLAIM threads shown by show_workqueue_state()
>       cannot make progress waiting for memory allocation.
> 
> ----------
> [ 1095.633625] MemAlloc: xfs-data/sda1(451) flags=0x4228060 switches=45509 seq=1 gfp=0x1604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK) order=0 delay=652073
> [ 1095.633626] xfs-data/sda1   R  running task    12696   451      2 0x00000000
> [ 1095.633663] Workqueue: xfs-data/sda1 xfs_end_io [xfs]
> [ 1095.633665] Call Trace:
[...]

We can get all this from the sysrq+t output, right? Well, except for the gfp
mask, order and delay, which would give us a hint that something is going on
with the allocation; but that all those tasks are in the page allocator is
already obvious from the call traces anyway.

> ----------
> 
>   (3) order-0 GFP_NOIO allocation request cannot make progress
>       waiting for memory allocation.

which is completely irrelevant to the issue...
 
>   (4) Number of stalling threads does not decrease over time while
>       number of out_of_memory() calls increases over time.

You would see the same from the allocation stall warning. Even though it is
not really ideal (we are getting there though), you would see that some
allocations simply do not make forward progress.

> ----------
> [  518.090012] MemAlloc-Info: stalling=184 dying=1 exiting=0 victim=1 oom_count=8441307
> [  553.070829] MemAlloc-Info: stalling=184 dying=1 exiting=0 victim=1 oom_count=10318507
> [  616.394649] MemAlloc-Info: stalling=186 dying=1 exiting=0 victim=1 oom_count=13908219
> [  642.266252] MemAlloc-Info: stalling=186 dying=1 exiting=0 victim=1 oom_count=15180673
> [  702.412189] MemAlloc-Info: stalling=187 dying=1 exiting=0 victim=1 oom_count=18732529
> [  736.787879] MemAlloc-Info: stalling=187 dying=1 exiting=0 victim=1 oom_count=20565244
> [  800.715759] MemAlloc-Info: stalling=188 dying=1 exiting=0 victim=1 oom_count=24411576
> [  837.571405] MemAlloc-Info: stalling=188 dying=1 exiting=0 victim=1 oom_count=26463562
> [  899.021495] MemAlloc-Info: stalling=189 dying=1 exiting=0 victim=1 oom_count=30144879
> [  936.282709] MemAlloc-Info: stalling=189 dying=1 exiting=0 victim=1 oom_count=32129234
> [  997.328119] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=35657983
> [ 1033.977265] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=37659912
> [ 1095.630961] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=40639677
> [ 1095.821248] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=40646791

Yeah, this is more than you can find out right now, but I am not really sure
how much more this tells you beyond what we already know above. Do not take
me wrong, this is all nice and dandy, but existing debugging tools would
give us a good clue about what is going on even without it.
 
[...]

> Regarding "infinite too_many_isolated() loop" case
> ( http://lkml.kernel.org/r/201702212335.DJB30777.JOFMHSFtVLQOOF@I-love.SAKURA.ne.jp ),
> I used this patch to confirm the following things.
> 
>   (1) kswapd cannot make progress waiting for lock.
[...]
> 
>   (2) Both GFP_IO and GFP_KERNEL allocations are stuck at
>       too_many_isolated() loop.
[...]
> ----------
> 
>   (3) Number of stalling threads does not decrease over time and
>       number of out_of_memory() calls does not increase over time.
> 
> ----------
> [ 1209.781787] MemAlloc-Info: stalling=32 dying=1 exiting=0 victim=1 oom_count=45896
> [ 1212.195351] MemAlloc-Info: stalling=32 dying=1 exiting=0 victim=1 oom_count=45896
> [ 1242.551629] MemAlloc-Info: stalling=36 dying=1 exiting=0 victim=1 oom_count=45896
> [ 1245.149165] MemAlloc-Info: stalling=36 dying=1 exiting=0 victim=1 oom_count=45896
> [ 1275.319189] MemAlloc-Info: stalling=40 dying=1 exiting=0 victim=1 oom_count=45896
> [ 1278.241813] MemAlloc-Info: stalling=40 dying=1 exiting=0 victim=1 oom_count=45896
> ----------

OK, this is a better example, but I would argue that multiple snapshots
showing CPUs basically idle while all memory allocations sit in the
congestion wait would signal this pretty clearly too, and we can see that
from sysrq+t as well.
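
For reference, the loop in question is essentially the following
(simplified from shrink_inactive_list() in mm/vmscan.c of this era);
nothing bounds the number of iterations, so if the isolating tasks cannot
make progress themselves, every direct reclaimer sits here indefinitely:

	/* simplified from shrink_inactive_list() */
	while (unlikely(too_many_isolated(pgdat, file, sc))) {
		/* sleep ~100ms and re-check the isolated counters */
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/* We are about to die and free our memory. Return now. */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}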

So, we have means to debug these issues. Some of them are rather coarse;
your watchdog can collect much more and may give us a clue much more
quickly, but we still have to judge whether all this is really needed,
because it doesn't come for free. Have you considered this aspect?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7] mm: Add memory allocation watchdog kernel thread.
  2017-03-10 15:26           ` Michal Hocko
@ 2017-03-11  1:46             ` Tetsuo Handa
  0 siblings, 0 replies; 22+ messages in thread
From: Tetsuo Handa @ 2017-03-11  1:46 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, linux-mm, linux-kernel, hannes, mgorman, david, apolyakov

Michal Hocko wrote:
> So, we have means to debug these issues. Some of them are rather coarse;
> your watchdog can collect much more and may give us a clue much more
> quickly, but we still have to judge whether all this is really needed,
> because it doesn't come for free. Have you considered this aspect?

Sigh... You are ultimately ignoring reality. Educating everybody to master
debugging tools does not come for free either. If I liken your argumentation
to security modules, it looks like the following.

  "There is already SELinux. SELinux can do everything. Thus, AppArmor is not needed.
   I don't care about users/customers who cannot administrate SELinux."

The reality is different. We need tools which users/customers can afford to
use. You had better get away from the existing debug tools which kernel
developers are using.

First of all, SysRq is an emergency tool and therefore requires an
administrator's intervention. Your argumentation sounds to me like "Give up
debugging unless you can sit in front of the console of your Linux systems
24-7", which is already impossible.

SysRq-t cannot print the seq= and delay= fields because information about the
in-flight allocation request is not accessible from "struct task_struct",
making it extremely difficult to judge whether progress is being made when
several SysRq-t snapshots are taken.
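
For reference, the per-task bookkeeping this patch adds is roughly the
following (field names here are illustrative; see the patch itself for the
actual layout):

	/*
	 * Illustrative sketch: the in-flight request is remembered in
	 * "struct task_struct" itself, so the watchdog (or a vmcore)
	 * can tell how long each allocation has been stalling (delay=)
	 * and whether new attempts are still starting (seq=).
	 */
	struct memalloc_info {
		unsigned int sequence;	/* seq=: bumped per allocation attempt */
		unsigned long start;	/* jiffies when the slowpath was entered */
		gfp_t gfp;		/* gfp=: flags of the stalling request */
		unsigned int order;	/* order=: order of the stalling request */
	};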

Also, year by year it is getting more difficult to use vmcore for analysis
because vmcore might include sensitive data (even after filtering out user
pages). I have seen cases where vmcore could not be sent to support centers
due to e.g. an organization's information control rules. Sometimes we have to
analyze from kernel messages alone. Some pieces of information extracted by
running scripts against /usr/bin/crash on the customer's side might be
available, but in general we can't assume that the whole memory image, which
includes whatever information we might need, is available.

In most cases, administrators can't capture even SysRq-t, let alone vmcore.
Therefore, an automatic watchdog is highly appreciated. Have you considered this aspect?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7] mm: Add memory allocation watchdog kernel thread.
  2017-03-11  1:46             ` Tetsuo Handa
@ 2017-03-13  9:45               ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2017-03-13  9:45 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, linux-kernel, hannes, mgorman, david, apolyakov

On Sat 11-03-17 10:46:58, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > So, we have means to debug these issues. Some of them are rather coarse;
> > your watchdog can collect much more and may give us a clue much more
> > quickly, but we still have to judge whether all this is really needed,
> > because it doesn't come for free. Have you considered this aspect?
> 
> Sigh... You are ultimately ignoring reality. Educating everybody to master
> debugging tools does not come for free either. If I liken your argumentation
> to security modules, it looks like the following.
> 
>   "There is already SELinux. SELinux can do everything. Thus, AppArmor is not needed.
>    I don't care about users/customers who cannot administrate SELinux."
> 
> The reality is different. We need tools which users/customers can afford to
> use. You had better get away from the existing debug tools which kernel
> developers are using.
> 
> First of all, SysRq is an emergency tool and therefore requires an
> administrator's intervention. Your argumentation sounds to me like "Give up
> debugging unless you can sit in front of the console of your Linux systems
> 24-7", which is already impossible.

My experience also tells me that the various soft/hard lockup detectors tend
to generate a non-trivial number of false positives, and those get reported
as bugs. We simply tend to underestimate how easy it is to hit paths without
scheduling, or how easy it is to trigger hard lockups on large machines just
due to lock bouncing, etc.

> SysRq-t cannot print the seq= and delay= fields because information about the
> in-flight allocation request is not accessible from "struct task_struct",
> making it extremely difficult to judge whether progress is being made when
> several SysRq-t snapshots are taken.
> 
> Also, year by year it is getting more difficult to use vmcore for analysis
> because vmcore might include sensitive data (even after filtering out user
> pages). I have seen cases where vmcore could not be sent to support centers
> due to e.g. an organization's information control rules. Sometimes we have to
> analyze from kernel messages alone. Some pieces of information extracted by
> running scripts against /usr/bin/crash on the customer's side might be
> available, but in general we can't assume that the whole memory image, which
> includes whatever information we might need, is available.
> 
> In most cases, administrators can't capture even SysRq-t, let alone vmcore.
> Therefore, an automatic watchdog is highly appreciated. Have you considered this aspect?

Yes, I have. I have worked with our SUSE L3 and enterprise customers a lot
over the last 10 years. And what I claim is that adding more watchdogs
doesn't necessarily mean we will get better bug reports. I do not have any
exact statistics, but my perception is that allocation lockups tend to be
less than 1% of reported bugs. You seem to make a huge issue of this
particular class of problems, basing your argumentation on "unknown issues
which might have been allocation lockups etc." I do not feel comfortable
with this kind of arguing, or with making any decision based on it.

So let me repeat (for the last time): I find your watchdog interesting for
stress testing, but I am not convinced that it is generally useful for real
workloads or that the maintenance burden is worth it. I _might_ be wrong
here, and that is why this is _not_ a NAK from me, but I feel uncomfortable
with how hard you are pushing this.

I expect this is my last word on this.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7] mm: Add memory allocation watchdog kernel thread.
  2017-03-13  9:45               ` Michal Hocko
@ 2017-03-13 13:45                 ` Tetsuo Handa
  0 siblings, 0 replies; 22+ messages in thread
From: Tetsuo Handa @ 2017-03-13 13:45 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, linux-mm, linux-kernel, hannes, mgorman, david, apolyakov

Michal Hocko wrote:
> On Sat 11-03-17 10:46:58, Tetsuo Handa wrote:
> > In most cases, administrators can't capture even SysRq-t, let alone vmcore.
> > Therefore, an automatic watchdog is highly appreciated. Have you considered this aspect?
> 
> Yes, I have. I have worked with our SUSE L3 and enterprise customers a lot
> over the last 10 years. And what I claim is that adding more watchdogs
> doesn't necessarily mean we will get better bug reports. I do not have any
> exact statistics, but my perception is that allocation lockups tend to be
> less than 1% of reported bugs. You seem to make a huge issue of this
> particular class of problems, basing your argumentation on "unknown issues
> which might have been allocation lockups etc." I do not feel comfortable
> with this kind of arguing, or with making any decision based on it.

Allocation lockups might be less than 1% of _reported_ bugs.
What I'm talking about is that there will be _unreported_ (and therefore
unrecognized/unsolved) bugs caused by memory allocation behavior.
You are refusing to make an attempt to prove/verify/handle them.

> 
> So let me repeat (for the last time): I find your watchdog interesting for
> stress testing, but I am not convinced that it is generally useful for real
> workloads or that the maintenance burden is worth it. I _might_ be wrong
> here, and that is why this is _not_ a NAK from me, but I feel uncomfortable
> with how hard you are pushing this.

If you worry about false positives and/or side effects of the watchdog, you
can disable it in your distribution (i.e. SUSE). There are
developers/users/customers who will be helped by it.

> 
> I expect this is my last word on this.

In the end, there is no real objection. Andrew, what do you think?

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2017-03-13 13:45 UTC | newest]

Thread overview: 22+ messages
2017-02-28  1:21 [PATCH v7] mm: Add memory allocation watchdog kernel thread Tetsuo Handa
2017-03-09 10:46 ` Tetsuo Handa
2017-03-09 22:37   ` Andrew Morton
2017-03-10 10:40     ` Michal Hocko
2017-03-10 11:19       ` Tetsuo Handa
2017-03-10 13:49         ` Brian Foster
2017-03-10 15:26         ` Michal Hocko
2017-03-11  1:46           ` Tetsuo Handa
2017-03-13  9:45             ` Michal Hocko
2017-03-13 13:45               ` Tetsuo Handa
2017-03-09 22:35 ` Andrew Morton
2017-03-10 11:12   ` Tetsuo Handa
