linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v6] mm: Add memory allocation watchdog kernel thread.
@ 2016-11-06  7:15 Tetsuo Handa
  2016-12-15 10:24 ` Tetsuo Handa
  2017-01-25 18:11 ` Johannes Weiner
  0 siblings, 2 replies; 12+ messages in thread
From: Tetsuo Handa @ 2016-11-06  7:15 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, linux-kernel, Tetsuo Handa

Commit 63f53dea0c9866e9 ("mm: warn about allocations which stall for
too long") was a great step for reducing possibility of silent hang up
problem caused by memory allocation stalls [1]. But it became clear
that Michal Hocko is not going to make warn_alloc() bullet proof [2].
Therefore, I again propose this patch in order to allow a bullet proof
reporting.

Michal Hocko agrees that reporting is important part of debugging of
problems but is thinking that state tracking is in general too complex
for something that doesn't happen in most properly configured systems.
But I assert that there _are_ systems which are bothered by low memory
situations. It is pointless to refer to "properly configured systems"
as a reason not to introduce a state tracking.

This patch adds a watchdog which periodically reports number of memory
allocating tasks, dying tasks and OOM victim tasks when some task is
spending too long time inside __alloc_pages_slowpath(). This patch also
serves as a hook for obtaining additional information using SystemTap
(e.g. examine other variables using printk(), capture a crash dump by
calling panic()) by triggering a callback only when an stall is detected.
Ability to take administrator-controlled actions based on some threshold
is a big advantage gained by introducing a state tracking.

It is administrators who decide whether to utilize debugging capability
with state tracking. Let's give administrators a choice and a chance.

Changes from v1: ( http://lkml.kernel.org/r/201510182105.AGA00839.FHVFFStLQOMOOJ@I-love.SAKURA.ne.jp )

  (1) Use per a "struct task_struct" variables. This allows vmcore to
      remember information about last memory allocation request, which
      is useful for understanding last-minute behavior of the kernel.

  (2) Report using accurate timeout. This increases possibility of
      successfully reporting before watchdog timers reset the machine.

  (3) Show memory information (SysRq-m). This makes it easier to know
      the reason of stalling.

  (4) Show both $state_of_allocation and $state_of_task in the same
      line. This makes it easier to grep the output.

  (5) Minimize duration of spinlock held by the kernel thread.

Changes from v2: ( http://lkml.kernel.org/r/201511222346.JBH48464.VFFtOLOOQJMFHS@I-love.SAKURA.ne.jp )

  (1) Print sequence number. This makes it easier to know whether
      memory allocation is succeeding (looks like a livelock but making
      forward progress) or not.

  (2) Replace spinlock with cheaper seqlock_t like sequence number based
      method. The caller no longer contend on lock, and major overhead
      for caller side will be two smp_wmb() instead for
      read_lock()/read_unlock().

  (3) Print "exiting" instead for "dying" if an OOM victim is stalling
      at do_exit(), for SIGKILL is removed before arriving at do_exit().

  (4) Moved explanation to Documentation/malloc-watchdog.txt .

Changes from v3: ( http://lkml.kernel.org/r/201511250024.AAE78692.QVOtFFOSFOMLJH@I-love.SAKURA.ne.jp )

  (1) Avoid stalls even if there are so many tasks to report.

Changes from v4: ( http://lkml.kernel.org/r/201512130033.ABH90650.FtFOMOFLVOJHQS@I-love.SAKURA.ne.jp )

  (1) Use per CPU in-flight counter by reverting "Report using accurate
      timeout." in v2, in order to avoid walking the process list which
      is costly when there are extremely so many tasks in the system.

  (2) Updated Documentation/malloc-watchdog.txt to add explanation for
      serving as a hook for dynamic probes.

Changes from v5: ( http://lkml.kernel.org/r/1462630604-23410-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp )

  (1) Disable commit 63f53dea0c9866e9 ("mm: warn about allocations which
      stall for too long") when CONFIG_DETECT_MEMALLOC_STALL_TASK is
      enabled.

  (2) Updated Documentation/malloc-watchdog.txt to reflect OOM related
      improvements up to Linux 4.9.

[1] http://lkml.kernel.org/r/201610182004.AEF87559.FOOHVLJOQFFtSM@I-love.SAKURA.ne.jp
[2] http://lkml.kernel.org/r/20161019115525.GH7517@dhcp22.suse.cz

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 Documentation/malloc-watchdog.txt | 155 ++++++++++++++++++++++++++
 include/linux/oom.h               |   4 +
 include/linux/sched.h             |  29 +++++
 kernel/fork.c                     |   4 +
 kernel/hung_task.c                | 221 +++++++++++++++++++++++++++++++++++++-
 kernel/sysctl.c                   |  10 ++
 lib/Kconfig.debug                 |  24 +++++
 mm/oom_kill.c                     |   3 +
 mm/page_alloc.c                   |  57 ++++++++++
 9 files changed, 505 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/malloc-watchdog.txt

diff --git a/Documentation/malloc-watchdog.txt b/Documentation/malloc-watchdog.txt
new file mode 100644
index 0000000..e4b24d77
--- /dev/null
+++ b/Documentation/malloc-watchdog.txt
@@ -0,0 +1,155 @@
+=================================
+Memory allocation stall watchdog.
+=================================
+
+
+- What is it?
+
+This is an extension to khungtaskd kernel thread, which is for warning
+that memory allocation requests are stalling, in order to catch unexplained
+hangups/reboots caused by memory allocation stalls.
+
+
+- Why need to use it?
+
+Currently, when something went wrong inside memory allocation request,
+the system might stall without any kernel messages.
+
+Although there is khungtaskd kernel thread as an asynchronous monitoring
+approach, khungtaskd kernel thread is not always helpful because memory
+allocating tasks unlikely sleep in uninterruptible state for
+/proc/sys/kernel/hung_task_timeout_secs seconds.
+
+Although there is warn_alloc() as a synchronous monitoring approach
+which emits
+
+  "%s: page allocation stalls for %ums, order:%u, mode:%#x(%pGg)\n"
+
+line, warn_alloc() is not bullet proof because allocating tasks can get
+stuck before calling warn_alloc() and/or allocating tasks are using
+__GFP_NOWARN flag and/or such lines are suppressed by ratelimiting and/or
+such lines are corrupted due to collisions.
+
+Unless we use asynchronous monitoring approach, we can fail to figure out
+that something went wrong inside memory allocation requests.
+
+People are reporting hang up problems and/or slowdown problem inside memory
+allocation request. But we are forcing people to use kernels without means
+to find out what was happening. The means are expected to work without
+knowledge to use trace points functionality, are expected to run without
+memory allocation, are expected to dump output without administrator's
+operation, are expected to work before watchdog timers reset the machine.
+
+This extension adds a state tracking mechanism for memory allocation requests
+to khungtaskd kernel thread, allowing administrators to figure out that the
+system hung up due to memory allocation stalls and/or to take administrator-
+controlled actions when memory allocation requests are stalling.
+
+
+- How to configure it?
+
+Build kernels with CONFIG_DETECT_HUNG_TASK=y and
+CONFIG_DETECT_MEMALLOC_STALL_TASK=y.
+
+Default scan interval is configured by CONFIG_DEFAULT_MEMALLOC_TASK_TIMEOUT.
+Scan interval can be changed at run time by writing timeout in seconds to
+/proc/sys/kernel/memalloc_task_timeout_secs. Writing 0 disables this scan.
+
+Even if you disable this scan, information about last memory allocation
+request is kept. That is, you will get some hint for understanding
+last-minute behavior of the kernel when you analyze vmcore (or memory
+snapshot of a virtualized machine).
+
+
+- How memory allocation stalls are reported?
+
+This extension will report allocation stalls by printing
+
+  MemAlloc-Info: stalling=$X dying=$Y1 exiting=$Y2 victim=$Z oom_count=$O
+
+line where $X > 0, followed by
+
+  MemAlloc: $name($pid) flags=$flags switches=$switches $state_of_allocation $state_of_task
+
+lines and corresponding stack traces.
+
+$O is number of times the OOM killer is invoked. If $O does not increase
+over time, allocation requests got stuck before calling the OOM killer.
+
+$name is that task's comm name string ("struct task_struct"->comm).
+
+$pid is that task's pid value ("struct task_struct"->pid).
+
+$flags is that task's flags value ("struct task_struct"->flags).
+
+$switches is that task's context switch counter ("struct task_struct"->nvcsw +
+"struct task_struct"->nivcsw) which is also checked by
+/proc/sys/kernel/hung_task_warnings for finding hung tasks.
+
+$state_of_allocation is reported only when that task is stalling inside
+__alloc_pages_slowpath(), in seq=$seq gfp=$gfp order=$order delay=$delay
+format where $seq is the sequence number for allocation request, $gfp is
+the gfp flags used for that allocation request, $order is the order,
+delay is jiffies elapsed since entering into __alloc_pages_slowpath().
+
+You can check for seq=$seq field for each reported process. If $seq is
+increasing over time, it will be simply overloaded (not a livelock but
+progress is too slow to wait) unless the caller is doing open-coded
+__GFP_NOFAIL allocation requests (effectively a livelock).
+
+$state_of_task is reported only when that task is dying, in combination
+of "uninterruptible" (where that task is in uninterruptible sleep,
+likely due to uninterruptible lock), "exiting" (where that task arrived
+at do_exit() function), "dying" (where that task has pending SIGKILL)
+and "victim" (where that task received TIF_MEMDIE, likely be only 1 task).
+
+
+- How the messages look like?
+
+An example of MemAlloc lines (grep of dmesg output) is shown below.
+You can use serial console and/or netconsole to save these messages
+when the system is stalling.
+
+  [  100.503284] MemAlloc-Info: stalling=8 dying=1 exiting=0 victim=1 oom_count=101421
+  [  100.505674] MemAlloc: kswapd0(54) flags=0xa40840 switches=84685
+  [  100.546645] MemAlloc: kworker/3:1(70) flags=0x4208060 switches=9462 seq=5 gfp=0x2400000(GFP_NOIO) order=0 delay=8207 uninterruptible
+  [  100.606034] MemAlloc: systemd-journal(469) flags=0x400100 switches=8380 seq=212 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=10620 uninterruptible
+  [  100.651766] MemAlloc: irqbalance(998) flags=0x400100 switches=4366 seq=5 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=5819
+  [  100.697590] MemAlloc: vmtoolsd(1928) flags=0x400100 switches=8542 seq=82 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=10620 uninterruptible
+  [  100.743312] MemAlloc: tuned(3737) flags=0x400040 switches=8220 seq=44 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=10620 uninterruptible
+  [  100.792038] MemAlloc: nmbd(3759) flags=0x400140 switches=8079 seq=198 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=10620 uninterruptible
+  [  100.839428] MemAlloc: oom-write(3814) flags=0x400000 switches=8126 seq=223446 gfp=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO) order=0 delay=10620 uninterruptible
+  [  100.878846] MemAlloc: write(3816) flags=0x400000 switches=7440 uninterruptible dying victim
+  [  100.917971] MemAlloc: write(3820) flags=0x400000 switches=16130 seq=8714 gfp=0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE) order=0 delay=10620 uninterruptible
+  [  101.190979] MemAlloc-Info: stalling=8 dying=1 exiting=0 victim=1 oom_count=107514
+  [  111.194055] MemAlloc-Info: stalling=9 dying=1 exiting=0 victim=1 oom_count=199825
+  [  111.196624] MemAlloc: kswapd0(54) flags=0xa40840 switches=168410
+  [  111.238096] MemAlloc: kworker/3:1(70) flags=0x4208060 switches=18592 seq=5 gfp=0x2400000(GFP_NOIO) order=0 delay=18898
+  [  111.296920] MemAlloc: systemd-journal(469) flags=0x400100 switches=15918 seq=212 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=21311
+  [  111.343129] MemAlloc: systemd-logind(973) flags=0x400100 switches=7786 seq=3 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=10476
+  [  111.390142] MemAlloc: irqbalance(998) flags=0x400100 switches=11965 seq=5 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=16510
+  [  111.435170] MemAlloc: vmtoolsd(1928) flags=0x400100 switches=16230 seq=82 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=21311 uninterruptible
+  [  111.479089] MemAlloc: tuned(3737) flags=0x400040 switches=15850 seq=44 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=21311 uninterruptible
+  [  111.528294] MemAlloc: nmbd(3759) flags=0x400140 switches=15682 seq=198 gfp=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=21311
+  [  111.576371] MemAlloc: oom-write(3814) flags=0x400000 switches=15378 seq=223446 gfp=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO) order=0 delay=21311 uninterruptible
+  [  111.617562] MemAlloc: write(3816) flags=0x400000 switches=7440 uninterruptible dying victim
+  [  111.661662] MemAlloc: write(3820) flags=0x400000 switches=24334 seq=8714 gfp=0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE) order=0 delay=21311 uninterruptible
+  [  111.956964] MemAlloc-Info: stalling=9 dying=1 exiting=0 victim=1 oom_count=206663
+
+You can check whether memory allocations are making forward progress.
+You can check where memory allocations are stalling using stack trace
+of reported task which follows each MemAlloc: line. You can check memory
+information (SysRq-m) and stuck workqueues information which follow the
+end of MemAlloc: lines. You can also check locks held (SysRq-d) if built
+with CONFIG_PROVE_LOCKING=y and lockdep is still active.
+
+This extension also serves as a hook for triggering actions when timeout
+expired. If you want to obtain more information, you can utilize dynamic
+probes using e.g. SystemTap. For example,
+
+  # stap -F -g -e 'probe kernel.function("check_memalloc_stalling_tasks").return { if ($return > 0) panic("MemAlloc stall detected."); }'
+
+will allow you to obtain vmcore by triggering the kernel panic. Since
+variables used by this extension is associated with "struct task_struct",
+you can obtain accurate snapshot using "foreach task" command from crash
+utility.
diff --git a/include/linux/oom.h b/include/linux/oom.h
index b4e36e9..69556f3 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -79,8 +79,12 @@ extern unsigned long oom_badness(struct task_struct *p,
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern unsigned int out_of_memory_count;
+extern bool memalloc_maybe_stalling(void);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
 extern int sysctl_panic_on_oom;
+extern unsigned long sysctl_memalloc_task_timeout_secs;
 #endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6aa5d8c..6e56317 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1496,6 +1496,32 @@ struct tlbflush_unmap_batch {
 	bool writable;
 };
 
+struct memalloc_info {
+	/*
+	 * 0: not doing __GFP_RECLAIM allocation.
+	 * 1: doing non-recursive __GFP_RECLAIM allocation.
+	 * 2: doing recursive __GFP_RECLAIM allocation.
+	 */
+	u8 valid;
+	/*
+	 * bit 0: Will be reported as OOM victim.
+	 * bit 1: Will be reported as dying task.
+	 * bit 2: Will be reported as stalling task.
+	 * bit 3: Will be reported as exiting task.
+	 * bit 7: Will be reported unconditionally.
+	 */
+	u8 type;
+	/* Index used for memalloc_in_flight[] counter. */
+	u8 idx;
+	/* For progress monitoring. */
+	unsigned int sequence;
+	/* Started time in jiffies as of valid == 1. */
+	unsigned long start;
+	/* Requested order and gfp flags as of valid == 1. */
+	unsigned int order;
+	gfp_t gfp;
+};
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/*
@@ -1979,6 +2005,9 @@ struct task_struct {
 	/* A live task holds one reference. */
 	atomic_t stack_refcount;
 #endif
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+	struct memalloc_info memalloc;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index fd85c68..9d5ebd1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1621,6 +1621,10 @@ static __latent_entropy struct task_struct *copy_process(
 	p->sequential_io_avg	= 0;
 #endif
 
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+	p->memalloc.sequence = 0;
+#endif
+
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
 	if (retval)
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 40c07e4..df95a20 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -16,6 +16,7 @@
 #include <linux/export.h>
 #include <linux/sysctl.h>
 #include <linux/utsname.h>
+#include <linux/oom.h>
 #include <trace/events/sched.h>
 
 /*
@@ -72,6 +73,206 @@ static int __init hung_task_panic_setup(char *str)
 	.notifier_call = hung_task_panic,
 };
 
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+/*
+ * Zero means infinite timeout - no checking done:
+ */
+unsigned long __read_mostly sysctl_memalloc_task_timeout_secs =
+	CONFIG_DEFAULT_MEMALLOC_TASK_TIMEOUT;
+static struct memalloc_info memalloc; /* Filled by is_stalling_task(). */
+
+static long memalloc_timeout_jiffies(unsigned long last_checked, long timeout)
+{
+	/* timeout of 0 will disable the watchdog */
+	return timeout ? last_checked - jiffies + timeout * HZ :
+		MAX_SCHEDULE_TIMEOUT;
+}
+
+/**
+ * is_stalling_task - Check and copy a task's memalloc variable.
+ *
+ * @task:   A task to check.
+ * @expire: Timeout in jiffies.
+ *
+ * Returns true if a task is stalling, false otherwise.
+ */
+static bool is_stalling_task(const struct task_struct *task,
+			     const unsigned long expire)
+{
+	const struct memalloc_info *m = &task->memalloc;
+
+	if (!m->valid)
+		return false;
+	memalloc.sequence = m->sequence;
+	memalloc.start = m->start;
+	memalloc.order = m->order;
+	memalloc.gfp = m->gfp;
+	return time_after_eq(expire, memalloc.start);
+}
+
+/*
+ * check_memalloc_stalling_tasks - Check for memory allocation stalls.
+ *
+ * @timeout: Timeout in jiffies.
+ *
+ * Returns number of stalling tasks.
+ *
+ * This function is marked as "noinline" in order to allow inserting dynamic
+ * probes (e.g. printing more information as needed using SystemTap, calling
+ * panic() if this function returned non 0 value).
+ */
+static noinline int check_memalloc_stalling_tasks(unsigned long timeout)
+{
+	char buf[256];
+	struct task_struct *g, *p;
+	unsigned long now;
+	unsigned long expire;
+	unsigned int sigkill_pending = 0;
+	unsigned int exiting_tasks = 0;
+	unsigned int memdie_pending = 0;
+	unsigned int stalling_tasks = 0;
+
+	cond_resched();
+	now = jiffies;
+	/*
+	 * Report tasks that stalled for more than half of timeout duration
+	 * because such tasks might be correlated with tasks that already
+	 * stalled for full timeout duration.
+	 */
+	expire = now - timeout * (HZ / 2);
+	/* Count stalling tasks, dying and victim tasks. */
+	preempt_disable();
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		u8 type = 0;
+
+		if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
+			type |= 1;
+			memdie_pending++;
+		}
+		if (fatal_signal_pending(p)) {
+			type |= 2;
+			sigkill_pending++;
+		}
+		if ((p->flags & PF_EXITING) && p->state != TASK_DEAD) {
+			type |= 8;
+			exiting_tasks++;
+		}
+		if (is_stalling_task(p, expire)) {
+			type |= 4;
+			stalling_tasks++;
+		}
+		if (p->flags & PF_KSWAPD)
+			type |= 128;
+		p->memalloc.type = type;
+	}
+	rcu_read_unlock();
+	preempt_enable();
+	if (!stalling_tasks)
+		return 0;
+	cond_resched();
+	/* Report stalling tasks, dying and victim tasks. */
+	pr_warn("MemAlloc-Info: stalling=%u dying=%u exiting=%u victim=%u oom_count=%u\n",
+		stalling_tasks, sigkill_pending, exiting_tasks, memdie_pending,
+		out_of_memory_count);
+	cond_resched();
+	sigkill_pending = 0;
+	exiting_tasks = 0;
+	memdie_pending = 0;
+	stalling_tasks = 0;
+	preempt_disable();
+	rcu_read_lock();
+ restart_report:
+	for_each_process_thread(g, p) {
+		bool can_cont;
+		u8 type;
+
+		if (likely(!p->memalloc.type))
+			continue;
+		p->memalloc.type = 0;
+		/* Recheck in case state changed meanwhile. */
+		type = 0;
+		if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
+			type |= 1;
+			memdie_pending++;
+		}
+		if (fatal_signal_pending(p)) {
+			type |= 2;
+			sigkill_pending++;
+		}
+		if ((p->flags & PF_EXITING) && p->state != TASK_DEAD) {
+			type |= 8;
+			exiting_tasks++;
+		}
+		if (is_stalling_task(p, expire)) {
+			type |= 4;
+			stalling_tasks++;
+			snprintf(buf, sizeof(buf),
+				 " seq=%u gfp=0x%x(%pGg) order=%u delay=%lu",
+				 memalloc.sequence, memalloc.gfp,
+				 &memalloc.gfp,
+				 memalloc.order, now - memalloc.start);
+		} else {
+			buf[0] = '\0';
+		}
+		if (p->flags & PF_KSWAPD)
+			type |= 128;
+		if (unlikely(!type))
+			continue;
+		/*
+		 * Victim tasks get pending SIGKILL removed before arriving at
+		 * do_exit(). Therefore, print " exiting" instead for " dying".
+		 */
+		pr_warn("MemAlloc: %s(%u) flags=0x%x switches=%lu%s%s%s%s%s\n",
+			p->comm, p->pid, p->flags, p->nvcsw + p->nivcsw, buf,
+			(p->state & TASK_UNINTERRUPTIBLE) ?
+			" uninterruptible" : "",
+			(type & 8) ? " exiting" : "",
+			(type & 2) ? " dying" : "",
+			(type & 1) ? " victim" : "");
+		sched_show_task(p);
+		/*
+		 * Since there could be thousands of tasks to report, we always
+		 * call cond_resched() after each report, in order to avoid RCU
+		 * stalls.
+		 *
+		 * Since not yet reported tasks have p->memalloc.type > 0, we
+		 * can simply restart this loop in case "g" or "p" went away.
+		 */
+		get_task_struct(g);
+		get_task_struct(p);
+		rcu_read_unlock();
+		preempt_enable_no_resched();
+		/*
+		 * TODO: Try to wait for a while (e.g. sleep until usage of
+		 * printk() buffer becomes less than 75%) in order to avoid
+		 * dropping messages.
+		 */
+		cond_resched();
+		preempt_disable();
+		rcu_read_lock();
+		can_cont = pid_alive(g) && pid_alive(p);
+		put_task_struct(p);
+		put_task_struct(g);
+		if (!can_cont)
+			goto restart_report;
+	}
+	rcu_read_unlock();
+	preempt_enable_no_resched();
+	cond_resched();
+	/* Show memory information. (SysRq-m) */
+	show_mem(0);
+	/* Show workqueue state. */
+	show_workqueue_state();
+	/* Show lock information. (SysRq-d) */
+	debug_show_all_locks();
+	pr_warn("MemAlloc-Info: stalling=%u dying=%u exiting=%u victim=%u oom_count=%u\n",
+		stalling_tasks, sigkill_pending, exiting_tasks, memdie_pending,
+		out_of_memory_count);
+	return stalling_tasks;
+}
+#endif /* CONFIG_DETECT_MEMALLOC_STALL_TASK */
+
 static void check_hung_task(struct task_struct *t, unsigned long timeout)
 {
 	unsigned long switch_count = t->nvcsw + t->nivcsw;
@@ -228,20 +429,36 @@ void reset_hung_task_detector(void)
 static int watchdog(void *dummy)
 {
 	unsigned long hung_last_checked = jiffies;
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+	unsigned long stall_last_checked = hung_last_checked;
+#endif
 
 	set_user_nice(current, 0);
 
 	for ( ; ; ) {
 		unsigned long timeout = sysctl_hung_task_timeout_secs;
 		long t = hung_timeout_jiffies(hung_last_checked, timeout);
-
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+		unsigned long timeout2 = sysctl_memalloc_task_timeout_secs;
+		long t2 = memalloc_timeout_jiffies(stall_last_checked,
+						   timeout2);
+
+		if (t2 <= 0) {
+			if (memalloc_maybe_stalling())
+				check_memalloc_stalling_tasks(timeout2);
+			stall_last_checked = jiffies;
+			continue;
+		}
+#else
+		long t2 = t;
+#endif
 		if (t <= 0) {
 			if (!atomic_xchg(&reset_hung_task, 0))
 				check_hung_uninterruptible_tasks(timeout);
 			hung_last_checked = jiffies;
 			continue;
 		}
-		schedule_timeout_interruptible(t);
+		schedule_timeout_interruptible(min(t, t2));
 	}
 
 	return 0;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 39b3368..d96952d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1069,6 +1069,16 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &neg_one,
 	},
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+	{
+		.procname	= "memalloc_task_timeout_secs",
+		.data		= &sysctl_memalloc_task_timeout_secs,
+		.maxlen		= sizeof(unsigned long),
+		.mode		= 0644,
+		.proc_handler	= proc_dohung_task_timeout_secs,
+		.extra2		= &hung_task_timeout_max,
+	},
+#endif
 #endif
 #ifdef CONFIG_RT_MUTEXES
 	{
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index fd60ace..bcd2494 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -880,6 +880,30 @@ config WQ_WATCHDOG
 	  state.  This can be configured through kernel parameter
 	  "workqueue.watchdog_thresh" and its sysfs counterpart.
 
+config DETECT_MEMALLOC_STALL_TASK
+	bool "Detect tasks stalling inside memory allocator"
+	default n
+	depends on DETECT_HUNG_TASK
+	help
+	  This option emits warning messages and traces when memory
+	  allocation requests are stalling, in order to catch unexplained
+	  hangups/reboots caused by memory allocation stalls.
+
+config DEFAULT_MEMALLOC_TASK_TIMEOUT
+	int "Default timeout for stalling task detection (in seconds)"
+	depends on DETECT_MEMALLOC_STALL_TASK
+	default 60
+	help
+	  This option controls the default timeout (in seconds) used
+	  to determine when a task has become non-responsive and should
+	  be considered stalling inside memory allocator.
+
+	  It can be adjusted at runtime via the kernel.memalloc_task_timeout_secs
+	  sysctl or by writing a value to
+	  /proc/sys/kernel/memalloc_task_timeout_secs.
+
+	  A timeout of 0 disables the check. The default is 60 seconds.
+
 endmenu # "Debug lockups and hangs"
 
 config PANIC_ON_OOPS
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ec9f11d..ab46d06 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -44,6 +44,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/oom.h>
 
+unsigned int out_of_memory_count;
+
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
@@ -986,6 +988,7 @@ bool out_of_memory(struct oom_control *oc)
 	unsigned long freed = 0;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 
+	out_of_memory_count++;
 	if (oom_killer_disabled)
 		return false;
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 806ada3..ae76920 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3548,8 +3548,10 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	enum compact_result compact_result;
 	int compaction_retries = 0;
 	int no_progress_loops = 0;
+#ifndef CONFIG_DETECT_MEMALLOC_STALL_TASK
 	unsigned long alloc_start = jiffies;
 	unsigned int stall_timeout = 10 * HZ;
+#endif
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3705,6 +3707,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
 		goto nopage;
 
+#ifndef CONFIG_DETECT_MEMALLOC_STALL_TASK
 	/* Make sure we know about allocations which stall for too long */
 	if (time_after(jiffies, alloc_start + stall_timeout)) {
 		warn_alloc(gfp_mask,
@@ -3712,6 +3715,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 			jiffies_to_msecs(jiffies-alloc_start), order);
 		stall_timeout += 10 * HZ;
 	}
+#endif
 
 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
 				 did_some_progress > 0, &no_progress_loops))
@@ -3747,6 +3751,57 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return page;
 }
 
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+
+static DEFINE_PER_CPU_ALIGNED(int, memalloc_in_flight[2]);
+static u8 memalloc_active_index; /* Either 0 or 1. */
+
+/* Called periodically with sysctl_memalloc_task_timeout_secs interval. */
+bool memalloc_maybe_stalling(void)
+{
+	int cpu;
+	int sum = 0;
+	const u8 idx = memalloc_active_index ^ 1;
+
+	for_each_possible_cpu(cpu)
+		sum += per_cpu(memalloc_in_flight[idx], cpu);
+	if (sum)
+		return true;
+	memalloc_active_index ^= 1;
+	return false;
+}
+
+static void start_memalloc_timer(const gfp_t gfp_mask, const int order)
+{
+	struct memalloc_info *m = &current->memalloc;
+
+	/* We don't check for stalls for !__GFP_RECLAIM allocations. */
+	if (!(gfp_mask & __GFP_RECLAIM))
+		return;
+	/* We don't check for stalls for nested __GFP_RECLAIM allocations */
+	if (!m->valid) {
+		m->sequence++;
+		m->start = jiffies;
+		m->order = order;
+		m->gfp = gfp_mask;
+		m->idx = memalloc_active_index;
+		this_cpu_inc(memalloc_in_flight[m->idx]);
+	}
+	m->valid++;
+}
+
+static void stop_memalloc_timer(const gfp_t gfp_mask)
+{
+	struct memalloc_info *m = &current->memalloc;
+
+	if ((gfp_mask & __GFP_RECLAIM) && !--m->valid)
+		this_cpu_dec(memalloc_in_flight[m->idx]);
+}
+#else
+#define start_memalloc_timer(gfp_mask, order) do { } while (0)
+#define stop_memalloc_timer(gfp_mask) do { } while (0)
+#endif
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
@@ -3828,7 +3883,9 @@ struct page *
 	 */
 	if (cpusets_enabled())
 		ac.nodemask = nodemask;
+	start_memalloc_timer(alloc_mask, order);
 	page = __alloc_pages_slowpath(alloc_mask, order, &ac);
+	stop_memalloc_timer(alloc_mask);
 
 no_zone:
 	/*
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v6] mm: Add memory allocation watchdog kernel thread.
  2016-11-06  7:15 [PATCH v6] mm: Add memory allocation watchdog kernel thread Tetsuo Handa
@ 2016-12-15 10:24 ` Tetsuo Handa
  2016-12-28 11:42   ` Tetsuo Handa
  2017-01-25 18:11 ` Johannes Weiner
  1 sibling, 1 reply; 12+ messages in thread
From: Tetsuo Handa @ 2016-12-15 10:24 UTC (permalink / raw)
  To: mhocko
  Cc: pmladek, sergey.senozhatsky.work, vegard.nossum, linux-mm, linux-kernel

Tetsuo Handa wrote:
> This patch adds a watchdog which periodically reports number of memory
> allocating tasks, dying tasks and OOM victim tasks when some task is
> spending too long time inside __alloc_pages_slowpath(). This patch also
> serves as a hook for obtaining additional information using SystemTap
> (e.g. examine other variables using printk(), capture a crash dump by
> calling panic()) by triggering a callback only when an stall is detected.
> Ability to take administrator-controlled actions based on some threshold
> is a big advantage gained by introducing a state tracking.

Resuming this thread as an answer to Michal's question at [1]
in order to focus on watchdog discussion.

http://I-love.SAKURA.ne.jp/tmp/serial-20161215-4.txt.xz (second run with
one-liner stall report per each second at [2]) showed that warn_alloc()
not reporting __GFP_NOWARN allocation stalls misses a chance to tell the
administrator that the system got stuck inside page allocator.

--------------------
[  177.666356] Out of memory: Kill process 5337 (a.out) score 999 or sacrifice child
[  177.670005] Killed process 5337 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
[  189.898315] a.out(5337): page allocation stalls for 10139ms, order:1 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  199.759537] a.out(5337): page allocation stalls for 20001ms, order:1 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  209.970568] a.out(5337): page allocation stalls for 30211ms, order:1 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  219.763326] a.out(5337): page allocation stalls for 40005ms, order:1 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  229.759599] a.out(5337): page allocation stalls for 50001ms, order:1 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
(...snipped...)
[  336.436299] Out of memory: Kill process 5491 (a.out) score 999 or sacrifice child
[  336.439684] Killed process 5491 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  336.447065] oom_reaper: reaped process 5491 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  339.759355] a.out(5337): page allocation stalls for 160001ms, order:1 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  349.760561] a.out(5337): page allocation stalls for 170002ms, order:1 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  359.760531] a.out(5337): page allocation stalls for 180002ms, order:1 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
(...snipped...)
[  789.759370] a.out(5337): page allocation stalls for 610001ms, order:1 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  799.760557] a.out(5337): page allocation stalls for 620002ms, order:1 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  800.879431] sysrq: SysRq : Show State
[  800.880939]   task                        PC stack   pid father
[  800.883078] systemd         D11320     1      0 0x00000000
[  800.885028]  ffff88006d58d840 ffff88006d58ac40 ffff88006baf8040 ffff880074450040
[  800.887757]  ffff8800757da218 ffffc90000017590 ffffffff817524d8 ffffffff813a55b6
[  800.890420]  0000000000000000 ffff8800757da218 0000000000000296 ffff880074450040
[  800.893148] Call Trace:
[  800.894114]  [<ffffffff817524d8>] ? __schedule+0x2f8/0xbc0
[  800.896169]  [<ffffffff813a55b6>] ? debug_object_activate+0x166/0x210
[  800.898420]  [<ffffffff81752ddd>] schedule+0x3d/0x90
[  800.900215]  [<ffffffff8175806a>] schedule_timeout+0x22a/0x540
[  800.902346]  [<ffffffff81127910>] ? lock_timer_base+0xa0/0xa0
[  800.904492]  [<ffffffff811332ac>] ? ktime_get+0xac/0x140
[  800.906453]  [<ffffffff81752176>] io_schedule_timeout+0xa6/0x110
[  800.908587]  [<ffffffff81209c76>] congestion_wait+0x86/0x260
[  800.910691]  [<ffffffff810eee70>] ? prepare_to_wait_event+0xf0/0xf0
[  800.912948]  [<ffffffff811fad79>] shrink_inactive_list+0x639/0x680
[  800.915200]  [<ffffffff811fb656>] shrink_node_memcg+0x526/0x7d0
[  800.917689]  [<ffffffff811fb9e1>] shrink_node+0xe1/0x310
[  800.919774]  [<ffffffff811fbf4d>] do_try_to_free_pages+0xed/0x380
[  800.922350]  [<ffffffff811fc311>] try_to_free_pages+0x131/0x3f0
[  800.924567]  [<ffffffff81281af8>] __alloc_pages_slowpath+0x3ac/0x9db
[  800.926962]  [<ffffffff811e9966>] __alloc_pages_nodemask+0x456/0x4e0
[  800.929316]  [<ffffffff81247507>] alloc_pages_current+0x97/0x1b0
[  800.931586]  [<ffffffff811dc485>] ? find_get_entry+0x5/0x300
[  800.933721]  [<ffffffff811dbc2d>] __page_cache_alloc+0x15d/0x1a0
[  800.936030]  [<ffffffff811dddbc>] ? pagecache_get_page+0x2c/0x2b0
[  800.938314]  [<ffffffff811e06ce>] filemap_fault+0x48e/0x6d0
[  800.940394]  [<ffffffff811e0579>] ? filemap_fault+0x339/0x6d0
[  800.942717]  [<ffffffffa0272781>] xfs_filemap_fault+0x71/0x1e0 [xfs]
[  800.945106]  [<ffffffff811dcfd3>] ? filemap_map_pages+0x2d3/0x5d0
[  800.947649]  [<ffffffff8121a700>] __do_fault+0x80/0x130
[  800.949708]  [<ffffffff8121ec9b>] do_fault+0x4cb/0x6d0
[  800.951729]  [<ffffffff81220e4b>] handle_mm_fault+0x74b/0x1010
[  800.953979]  [<ffffffff8122075b>] ? handle_mm_fault+0x5b/0x1010
[  800.956294]  [<ffffffff810783a5>] ? __do_page_fault+0x175/0x530
[  800.958626]  [<ffffffff8107847a>] __do_page_fault+0x24a/0x530
[  800.960875]  [<ffffffff81078790>] do_page_fault+0x30/0x80
[  800.962956]  [<ffffffff8175b598>] page_fault+0x28/0x30
(...snipped...)
[  802.637484] kswapd0         D10584    56      2 0x00000000
[  802.639758]  0000000000000000 ffff88004b572c40 ffff88007048ca40 ffff880072174a40
[  802.642735]  ffff8800757da218 ffffc900006ff6b0 ffffffff817524d8 ffffc900006ff660
[  802.645755]  ffffc900006ff670 ffff8800757da218 0000000044e61594 ffff880072174a40
[  802.648778] Call Trace:
[  802.650042]  [<ffffffff817524d8>] ? __schedule+0x2f8/0xbc0
[  802.652339]  [<ffffffff81752ddd>] schedule+0x3d/0x90
[  802.654425]  [<ffffffff817573fd>] rwsem_down_read_failed+0xfd/0x180
[  802.657362]  [<ffffffffa0261b9f>] ? xfs_map_blocks+0x9f/0x550 [xfs]
[  802.660493]  [<ffffffff81395b78>] call_rwsem_down_read_failed+0x18/0x30
[  802.663313]  [<ffffffffa02858f4>] ? xfs_ilock+0x1a4/0x350 [xfs]
[  802.665936]  [<ffffffff810f714f>] down_read_nested+0xaf/0xc0
[  802.668457]  [<ffffffffa02858f4>] ? xfs_ilock+0x1a4/0x350 [xfs]
[  802.670988]  [<ffffffffa02858f4>] xfs_ilock+0x1a4/0x350 [xfs]
[  802.673534]  [<ffffffffa0261b9f>] xfs_map_blocks+0x9f/0x550 [xfs]
[  802.676124]  [<ffffffffa0262293>] xfs_do_writepage+0x243/0x940 [xfs]
[  802.678773]  [<ffffffff811eca54>] ? clear_page_dirty_for_io+0xb4/0x310
[  802.681872]  [<ffffffffa02629cb>] xfs_vm_writepage+0x3b/0x70 [xfs]
[  802.684442]  [<ffffffff811f77a4>] pageout.isra.49+0x1a4/0x460
[  802.686880]  [<ffffffff811f9e00>] shrink_page_list+0x8e0/0xbd0
[  802.689343]  [<ffffffff811fa94f>] shrink_inactive_list+0x20f/0x680
[  802.691864]  [<ffffffff811fb656>] shrink_node_memcg+0x526/0x7d0
[  802.694240]  [<ffffffff811fb9e1>] shrink_node+0xe1/0x310
[  802.696405]  [<ffffffff811fd1a2>] kswapd+0x362/0x9b0
[  802.698472]  [<ffffffff811fce40>] ? mem_cgroup_shrink_node+0x3b0/0x3b0
[  802.701031]  [<ffffffff810bfe92>] kthread+0x102/0x120
[  802.703175]  [<ffffffff810fbca9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[  802.705797]  [<ffffffff810bfd90>] ? kthread_park+0x60/0x60
[  802.708074]  [<ffffffff8175a33a>] ret_from_fork+0x2a/0x40
(...snipped...)
[  809.550458] a.out           R  running task    11240  5337   4599 0x00000086
[  809.553340]  ffff88006e77f850 0000000000000000 0000000000000000 0000000000000000
[  809.556327]  0000000000000000 ffff8800740d7498 ffffc90007e57268 ffffffff817596b7
[  809.559340]  ffff88006e77f838 ffffc90007e57288 ffffffff812177ea ffffffff81ccf160
[  809.562334] Call Trace:
[  809.563607]  [<ffffffff817596b7>] ? _raw_spin_unlock+0x27/0x40
[  809.565997]  [<ffffffff812177ea>] ? __list_lru_count_one.isra.2+0x4a/0x80
[  809.568709]  [<ffffffff811f729e>] ? shrink_slab+0x31e/0x680
[  809.570994]  [<ffffffff81187c8e>] ? delayacct_end+0x3e/0x60
[  809.573319]  [<ffffffff810fdcd9>] ? lock_acquire+0xc9/0x250
[  809.575595]  [<ffffffff811fbbfc>] ? shrink_node+0x2fc/0x310
[  809.577864]  [<ffffffff811fbf4d>] ? do_try_to_free_pages+0xed/0x380
[  809.580396]  [<ffffffff811fc311>] ? try_to_free_pages+0x131/0x3f0
[  809.582831]  [<ffffffff81281af8>] ? __alloc_pages_slowpath+0x3ac/0x9db
[  809.585433]  [<ffffffff811e9966>] ? __alloc_pages_nodemask+0x456/0x4e0
[  809.588026]  [<ffffffff81247507>] ? alloc_pages_current+0x97/0x1b0
[  809.590476]  [<ffffffff81252e9a>] ? new_slab+0x4ca/0x6a0
[  809.592664]  [<ffffffff81255091>] ? ___slab_alloc+0x3a1/0x620
[  809.594982]  [<ffffffffa029a836>] ? kmem_alloc+0x96/0x120 [xfs]
[  809.597347]  [<ffffffffa029a836>] ? kmem_alloc+0x96/0x120 [xfs]
[  809.599736]  [<ffffffff812841ec>] ? __slab_alloc+0x46/0x7d
[  809.601961]  [<ffffffff812566c1>] ? __kmalloc+0x301/0x3b0
[  809.604177]  [<ffffffffa029a836>] ? kmem_alloc+0x96/0x120 [xfs]
[  809.606583]  [<ffffffff81254639>] ? kfree+0x1f9/0x330
[  809.608683]  [<ffffffffa02a227b>] ? xfs_log_commit_cil+0x54b/0x690 [xfs]
[  809.611335]  [<ffffffffa029ab86>] ? kmem_zone_alloc+0x96/0x120 [xfs]
[  809.613853]  [<ffffffffa0299bd7>] ? __xfs_trans_commit+0x97/0x250 [xfs]
[  809.616430]  [<ffffffffa029a2cc>] ? __xfs_trans_roll+0x6c/0xe0 [xfs]
[  809.618963]  [<ffffffffa029a365>] ? xfs_trans_roll+0x25/0x40 [xfs]
[  809.621401]  [<ffffffffa028969d>] ? xfs_itruncate_extents+0x2bd/0x730 [xfs]
[  809.624332]  [<ffffffffa02673e3>] ? xfs_free_eofblocks+0x1e3/0x240 [xfs]
[  809.626963]  [<ffffffffa0289cf4>] ? xfs_release+0x94/0x150 [xfs]
[  809.629384]  [<ffffffffa02720a5>] ? xfs_file_release+0x15/0x20 [xfs]
[  809.631887]  [<ffffffff81289ff8>] ? __fput+0xf8/0x200
[  809.633954]  [<ffffffff8128a13e>] ? ____fput+0xe/0x10
[  809.636023]  [<ffffffff810be103>] ? task_work_run+0x83/0xc0
[  809.638305]  [<ffffffff8109d43f>] ? do_exit+0x31f/0xcd0
[  809.640431]  [<ffffffff810abcfe>] ? get_signal+0xde/0x9b0
[  809.642609]  [<ffffffff8109de7c>] ? do_group_exit+0x4c/0xc0
[  809.644856]  [<ffffffff810abf7f>] ? get_signal+0x35f/0x9b0
[  809.647073]  [<ffffffffa02744f9>] ? xfs_file_buffered_aio_write+0xa9/0x3b0 [xfs]
[  809.649904]  [<ffffffff81036687>] ? do_signal+0x37/0x6c0
[  809.652104]  [<ffffffffa0274890>] ? xfs_file_write_iter+0x90/0x130 [xfs]
[  809.654725]  [<ffffffff81254639>] ? kfree+0x1f9/0x330
[  809.656805]  [<ffffffff810904cc>] ? exit_to_usermode_loop+0x51/0x92
[  809.659293]  [<ffffffff81003d65>] ? do_syscall_64+0x195/0x200
[  809.661596]  [<ffffffff8175a189>] ? entry_SYSCALL64_slow_path+0x25/0x25
(...snipped...)
[  889.760595] a.out(5337): page allocation stalls for 710002ms, order:1 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  899.760518] a.out(5337): page allocation stalls for 720002ms, order:1 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  909.760429] a.out(5337): page allocation stalls for 730002ms, order:1 mode:0x2604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK)
[  916.234927] sysrq: SysRq : Resetting
[  916.237156] ACPI MEMORY or I/O RESET_REG.
--------------------

kmallocwd would have told the administrator that the system got stuck
inside page allocator due to kswapd v.s. shrink_inactive_list() trap.

Scattering around tracepoints like [3] will not replace kmallocwd.
We don't need to enable tracepoints when allocation request is not stalling,
but we want to enable console output when allocation request got stuck.
Synchronous watchdog like warn_alloc() cannot do this because we can't
predict which line of which function gets stuck. Only asynchronous watchdog
like kmallocwd can serve as a trigger for enabling console output.

Although Michal thinks that this kmallocwd is too complex, I believe that
this complexity is inevitable for overcoming unsolvable deficiency of
warn_alloc() and tracepoints. Since it seems that nobody has objection
against the idea of having an asynchronous watchdog, I want to have it
merged. Are there any ideas/suggestions/improvements to this kmallocwd ?



[1] http://lkml.kernel.org/r/20161214181850.GC16763@dhcp22.suse.cz
[2] http://lkml.kernel.org/r/201612151921.CBE43202.SFLtOFJMOFOQVH@I-love.SAKURA.ne.jp
[3] http://lkml.kernel.org/r/20161214145324.26261-1-mhocko@kernel.org

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v6] mm: Add memory allocation watchdog kernel thread.
  2016-12-15 10:24 ` Tetsuo Handa
@ 2016-12-28 11:42   ` Tetsuo Handa
  2017-01-25 14:03     ` Tetsuo Handa
  0 siblings, 1 reply; 12+ messages in thread
From: Tetsuo Handa @ 2016-12-28 11:42 UTC (permalink / raw)
  To: mgorman, hannes, vdavydov.dev, mhocko
  Cc: pmladek, sergey.senozhatsky.work, vegard.nossum, linux-mm, linux-kernel

Michal Hocko wrote at http://lkml.kernel.org/r/20161227105715.GE1308@dhcp22.suse.cz :
> On Tue 27-12-16 19:39:28, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > I am not saying that the current code works perfectly when we are
> > > hitting the direct reclaim close to the OOM but improving that requires
> > > much more than slapping a global lock there.
> > 
> > So, we finally agreed that there are problems when we are hitting the direct
> > reclaim close to the OOM. Good.
> 
> There has never been a disagreement here. The point we seem to be
> disagreeing is how much those issues you are seeing matter. I do not
> consider them top priority because they are not happening in real life
> enough.

There is no evidence to prove "they are not happening in real life enough", for
there is no catch-all reporting mechanism. I consider that offering a mean to
find and report problems is top priority as a troubleshooting staff.

> > > Just try to remember how you were pushing really hard for oom timeouts
> > > one year back because the OOM killer was suboptimal and could lockup. It
> > > took some redesign and many changes to fix that. The result is
> > > imho a better, more predictable and robust code which wouldn't be the
> > > case if we just went your way to have a fix quickly...
> > 
> > I agree that the result is good for users who can update kernels. But that
> > change was too large to backport. Any approach which did not in time for
> > customers' deadline of deciding their kernels to use for 10 years is
> > useless for them. Lack of catch-all reporting/triggering mechanism is
> > unhappy for both customers and troubleshooting staffs at support centers.
> 
> Then implement whatever you find appropriate on those old kernels and
> deal with the follow up reports. This is the fair deal you have cope
> with when using and supporting old kernels.

Customers are using distributor's kernels. Due to out-of-tree vendor's prebuilt
modules which can be loaded into only prebuilt distributor's kernels, it is
impossible for me to make changes to those old kernels. Also, that distributor's
policy is that "offer no support even if just rebuilt from source" which prevents
customers from testing changes made by me to those old kernels. Thus, implement
whatever I find appropriate on those old kernels is not an option. Merging
upstream-first, in accordance with that distributor's policy, is the only option.

>  
> > Improving the direct reclaim close to the OOM requires a lot of effort.
> > We might add new bugs during that effort. So, where is valid reason that
> > we can not have asynchronous watchdog like kmallocwd? Please do explain
> > at kmallocwd thread. You have never persuaded me about keeping kmallocwd
> > out of tree.
> 
> I am not going to repeat my arguments over again. I haven't nacked that
> patch and it seems there is no great interest in it so do not try to
> claim that it is me who is blocking this feature. I just do not think it
> is worth it.

OK. I was assuming that Acked-by: or Reviewed-by: from you is essential.

So far, nobody has objections about having asynchronous watchdog.
Mel, Johannes and Vladimir, what do you think about this version of
kmallocwd? If no objections, I think we can start with this version
with a fix shown below folded.

----------------------------------------
>From 5adc8d9bfb31dce1954667cabf65842df31d4ed7 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 28 Dec 2016 09:52:03 +0900
Subject: [PATCH] mm: Don't check __GFP_KSWAPD_RECLAIM by memory allocation
 watchdog.

There are some __GFP_KSWAPD_RECLAIM && !__GFP_DIRECT_RECLAIM callers.
Since such callers do not sleep, we should check only __GFP_DIRECT_RECLAIM
callers than __GFP_RECLAIM == (__GFP_KSWAPD_RECLAIM|__GFP_DIRECT_RECLAIM)
callers.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/page_alloc.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6478f44..58c1238 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3769,10 +3769,10 @@ static void start_memalloc_timer(const gfp_t gfp_mask, const int order)
 {
 	struct memalloc_info *m = &current->memalloc;
 
-	/* We don't check for stalls for !__GFP_RECLAIM allocations. */
-	if (!(gfp_mask & __GFP_RECLAIM))
+	/* We don't check for stalls for !__GFP_DIRECT_RECLAIM allocations. */
+	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
 		return;
-	/* We don't check for stalls for nested __GFP_RECLAIM allocations */
+	/* Check based on outermost __GFP_DIRECT_RECLAIM allocations. */
 	if (!m->valid) {
 		m->sequence++;
 		m->start = jiffies;
@@ -3788,7 +3788,7 @@ static void stop_memalloc_timer(const gfp_t gfp_mask)
 {
 	struct memalloc_info *m = &current->memalloc;
 
-	if ((gfp_mask & __GFP_RECLAIM) && !--m->valid)
+	if ((gfp_mask & __GFP_DIRECT_RECLAIM) && !--m->valid)
 		this_cpu_dec(memalloc_in_flight[m->idx]);
 }
 #else
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v6] mm: Add memory allocation watchdog kernel thread.
  2016-12-28 11:42   ` Tetsuo Handa
@ 2017-01-25 14:03     ` Tetsuo Handa
  2017-01-25 14:21       ` Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: Tetsuo Handa @ 2017-01-25 14:03 UTC (permalink / raw)
  To: akpm
  Cc: mgorman, hannes, vdavydov.dev, mhocko, pmladek,
	sergey.senozhatsky.work, vegard.nossum, linux-mm, linux-kernel

Andrew, what do you think about this version? There seems to be no objections.

This patch should be helpful for initial research of minutes-lasting stalls (e.g.
http://lkml.kernel.org/r/20170123135111.13ac3e47110de10a4bd503ef@linux-foundation.org )
and proving whether what Michal does not think are not happening
(e.g. http://lkml.kernel.org/r/20170125095337.GF32377@dhcp22.suse.cz ).
I think we can start testing this version at linux-next tree.

Tetsuo Handa wrote:
> Michal Hocko wrote at http://lkml.kernel.org/r/20161227105715.GE1308@dhcp22.suse.cz :
> > On Tue 27-12-16 19:39:28, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > I am not saying that the current code works perfectly when we are
> > > > hitting the direct reclaim close to the OOM but improving that requires
> > > > much more than slapping a global lock there.
> > > 
> > > So, we finally agreed that there are problems when we are hitting the direct
> > > reclaim close to the OOM. Good.
> > 
> > There has never been a disagreement here. The point we seem to be
> > disagreeing is how much those issues you are seeing matter. I do not
> > consider them top priority because they are not happening in real life
> > enough.
> 
> There is no evidence to prove "they are not happening in real life enough", for
> there is no catch-all reporting mechanism. I consider that offering a mean to
> find and report problems is top priority as a troubleshooting staff.
> 
> > > > Just try to remember how you were pushing really hard for oom timeouts
> > > > one year back because the OOM killer was suboptimal and could lockup. It
> > > > took some redesign and many changes to fix that. The result is
> > > > imho a better, more predictable and robust code which wouldn't be the
> > > > case if we just went your way to have a fix quickly...
> > > 
> > > I agree that the result is good for users who can update kernels. But that
> > > change was too large to backport. Any approach which did not in time for
> > > customers' deadline of deciding their kernels to use for 10 years is
> > > useless for them. Lack of catch-all reporting/triggering mechanism is
> > > unhappy for both customers and troubleshooting staffs at support centers.
> > 
> > Then implement whatever you find appropriate on those old kernels and
> > deal with the follow up reports. This is the fair deal you have cope
> > with when using and supporting old kernels.
> 
> Customers are using distributor's kernels. Due to out-of-tree vendor's prebuilt
> modules which can be loaded into only prebuilt distributor's kernels, it is
> impossible for me to make changes to those old kernels. Also, that distributor's
> policy is that "offer no support even if just rebuilt from source" which prevents
> customers from testing changes made by me to those old kernels. Thus, implement
> whatever I find appropriate on those old kernels is not an option. Merging
> upstream-first, in accordance with that distributor's policy, is the only option.
> 
> >  
> > > Improving the direct reclaim close to the OOM requires a lot of effort.
> > > We might add new bugs during that effort. So, where is valid reason that
> > > we can not have asynchronous watchdog like kmallocwd? Please do explain
> > > at kmallocwd thread. You have never persuaded me about keeping kmallocwd
> > > out of tree.
> > 
> > I am not going to repeat my arguments over again. I haven't nacked that
> > patch and it seems there is no great interest in it so do not try to
> > claim that it is me who is blocking this feature. I just do not think it
> > is worth it.
> 
> OK. I was assuming that Acked-by: or Reviewed-by: from you is essential.
> 
> So far, nobody has objections about having asynchronous watchdog.
> Mel, Johannes and Vladimir, what do you think about this version of
> kmallocwd? If no objections, I think we can start with this version
> with a fix shown below folded.
> 
> ----------------------------------------
> >From 5adc8d9bfb31dce1954667cabf65842df31d4ed7 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Wed, 28 Dec 2016 09:52:03 +0900
> Subject: [PATCH] mm: Don't check __GFP_KSWAPD_RECLAIM by memory allocation
>  watchdog.
> 
> There are some __GFP_KSWAPD_RECLAIM && !__GFP_DIRECT_RECLAIM callers.
> Since such callers do not sleep, we should check only __GFP_DIRECT_RECLAIM
> callers than __GFP_RECLAIM == (__GFP_KSWAPD_RECLAIM|__GFP_DIRECT_RECLAIM)
> callers.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  mm/page_alloc.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6478f44..58c1238 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3769,10 +3769,10 @@ static void start_memalloc_timer(const gfp_t gfp_mask, const int order)
>  {
>  	struct memalloc_info *m = &current->memalloc;
>  
> -	/* We don't check for stalls for !__GFP_RECLAIM allocations. */
> -	if (!(gfp_mask & __GFP_RECLAIM))
> +	/* We don't check for stalls for !__GFP_DIRECT_RECLAIM allocations. */
> +	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
>  		return;
> -	/* We don't check for stalls for nested __GFP_RECLAIM allocations */
> +	/* Check based on outermost __GFP_DIRECT_RECLAIM allocations. */
>  	if (!m->valid) {
>  		m->sequence++;
>  		m->start = jiffies;
> @@ -3788,7 +3788,7 @@ static void stop_memalloc_timer(const gfp_t gfp_mask)
>  {
>  	struct memalloc_info *m = &current->memalloc;
>  
> -	if ((gfp_mask & __GFP_RECLAIM) && !--m->valid)
> +	if ((gfp_mask & __GFP_DIRECT_RECLAIM) && !--m->valid)
>  		this_cpu_dec(memalloc_in_flight[m->idx]);
>  }
>  #else
> -- 
> 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v6] mm: Add memory allocation watchdog kernel thread.
  2017-01-25 14:03     ` Tetsuo Handa
@ 2017-01-25 14:21       ` Michal Hocko
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2017-01-25 14:21 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, mgorman, hannes, vdavydov.dev, pmladek,
	sergey.senozhatsky.work, vegard.nossum, linux-mm, linux-kernel

On Wed 25-01-17 23:03:43, Tetsuo Handa wrote:
> Andrew, what do you think about this version? There seems to be no objections.

Well, this is not true. My main objection still holds. I simply do not
think all the additional code is really worth it. Not enough to nack the
patch, though, of course. I am not questioning that the watchdog might
be useful for debugging. I just think that what we have currently tells
us enough to debug issues. Sure, you will tell that this is not
reliable, but I will argue that this should be fixed because that makes
sense regardless of the stall warning.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v6] mm: Add memory allocation watchdog kernel thread.
  2016-11-06  7:15 [PATCH v6] mm: Add memory allocation watchdog kernel thread Tetsuo Handa
  2016-12-15 10:24 ` Tetsuo Handa
@ 2017-01-25 18:11 ` Johannes Weiner
  2017-01-25 18:45   ` Michal Hocko
  2017-01-25 23:44   ` Minchan Kim
  1 sibling, 2 replies; 12+ messages in thread
From: Johannes Weiner @ 2017-01-25 18:11 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: mhocko, linux-mm, linux-kernel

On Sun, Nov 06, 2016 at 04:15:01PM +0900, Tetsuo Handa wrote:
> +- Why need to use it?
> +
> +Currently, when something went wrong inside memory allocation request,
> +the system might stall without any kernel messages.
> +
> +Although there is khungtaskd kernel thread as an asynchronous monitoring
> +approach, khungtaskd kernel thread is not always helpful because memory
> +allocating tasks unlikely sleep in uninterruptible state for
> +/proc/sys/kernel/hung_task_timeout_secs seconds.
> +
> +Although there is warn_alloc() as a synchronous monitoring approach
> +which emits
> +
> +  "%s: page allocation stalls for %ums, order:%u, mode:%#x(%pGg)\n"
> +
> +line, warn_alloc() is not bullet proof because allocating tasks can get
> +stuck before calling warn_alloc() and/or allocating tasks are using
> +__GFP_NOWARN flag and/or such lines are suppressed by ratelimiting and/or
> +such lines are corrupted due to collisions.

I'm not fully convinced by this explanation. Do you have a real life
example where the warn_alloc() stall info is not enough? If yes, this
should be included here and in the changelog. If not, the extra code,
the task_struct overhead etc. don't seem justified.

__GFP_NOWARN shouldn't suppress stall warnings, IMO. It's for whether
the caller expects allocation failure and is prepared to handle it; an
allocation stalling out for 10s is an issue regardless of the callsite.

---

>From 6420cae52cac8167bd5fb19f45feed2d540bc11d Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 25 Jan 2017 12:57:20 -0500
Subject: [PATCH] mm: page_alloc: __GFP_NOWARN shouldn't suppress stall
 warnings

__GFP_NOWARN, which is usually added to avoid warnings from callsites
that expect to fail and have fallbacks, currently also suppresses
allocation stall warnings. These trigger when an allocation is stuck
inside the allocator for 10 seconds or longer.

But there is no class of allocations that can get legitimately stuck
in the allocator for this long. This always indicates a problem.

Always emit stall warnings. Restrict __GFP_NOWARN to alloc failures.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f3e0c69a97b7..7ce051d1d575 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3704,7 +3704,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
 	/* Make sure we know about allocations which stall for too long */
 	if (time_after(jiffies, alloc_start + stall_timeout)) {
-		warn_alloc(gfp_mask,
+		warn_alloc(gfp_mask & ~__GFP_NOWARN,
 			"page allocation stalls for %ums, order:%u",
 			jiffies_to_msecs(jiffies-alloc_start), order);
 		stall_timeout += 10 * HZ;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v6] mm: Add memory allocation watchdog kernel thread.
  2017-01-25 18:11 ` Johannes Weiner
@ 2017-01-25 18:45   ` Michal Hocko
  2017-01-25 19:22     ` Johannes Weiner
  2017-01-26 10:28     ` Tetsuo Handa
  2017-01-25 23:44   ` Minchan Kim
  1 sibling, 2 replies; 12+ messages in thread
From: Michal Hocko @ 2017-01-25 18:45 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Tetsuo Handa, linux-mm, linux-kernel

On Wed 25-01-17 13:11:50, Johannes Weiner wrote:
[...]
> >From 6420cae52cac8167bd5fb19f45feed2d540bc11d Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Wed, 25 Jan 2017 12:57:20 -0500
> Subject: [PATCH] mm: page_alloc: __GFP_NOWARN shouldn't suppress stall
>  warnings
> 
> __GFP_NOWARN, which is usually added to avoid warnings from callsites
> that expect to fail and have fallbacks, currently also suppresses
> allocation stall warnings. These trigger when an allocation is stuck
> inside the allocator for 10 seconds or longer.
> 
> But there is no class of allocations that can get legitimately stuck
> in the allocator for this long. This always indicates a problem.
> 
> Always emit stall warnings. Restrict __GFP_NOWARN to alloc failures.

Tetsuo has already suggested something like this and I didn't really
like it because it makes the semantic of the flag confusing. The mask
says to not warn while the kernel log might contain an allocation splat.
You are right that stalling for 10s seconds means a problem on its own
but on the other hand I can imagine somebody might really want to have
clean logs and the last thing we want is to have another gfp flag for
that purpose.

I also do not think that this change would make a big difference because
most allocations simply use this flag along with __GFP_NORETRY or
GFP_NOWAIT resp GFP_ATOMIC. Have we ever seen a stall with this
allocation requests?

I haven't nacked Tetsuo's patch AFAIR and will not nack this one either
I just do not think we should tweak __GFP_NOWARN.
 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f3e0c69a97b7..7ce051d1d575 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3704,7 +3704,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  
>  	/* Make sure we know about allocations which stall for too long */
>  	if (time_after(jiffies, alloc_start + stall_timeout)) {
> -		warn_alloc(gfp_mask,
> +		warn_alloc(gfp_mask & ~__GFP_NOWARN,
>  			"page allocation stalls for %ums, order:%u",
>  			jiffies_to_msecs(jiffies-alloc_start), order);
>  		stall_timeout += 10 * HZ;
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v6] mm: Add memory allocation watchdog kernel thread.
  2017-01-25 18:45   ` Michal Hocko
@ 2017-01-25 19:22     ` Johannes Weiner
  2017-01-26  7:57       ` Michal Hocko
  2017-01-26 10:28     ` Tetsuo Handa
  1 sibling, 1 reply; 12+ messages in thread
From: Johannes Weiner @ 2017-01-25 19:22 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tetsuo Handa, linux-mm, linux-kernel

On Wed, Jan 25, 2017 at 07:45:49PM +0100, Michal Hocko wrote:
> On Wed 25-01-17 13:11:50, Johannes Weiner wrote:
> [...]
> > >From 6420cae52cac8167bd5fb19f45feed2d540bc11d Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Date: Wed, 25 Jan 2017 12:57:20 -0500
> > Subject: [PATCH] mm: page_alloc: __GFP_NOWARN shouldn't suppress stall
> >  warnings
> > 
> > __GFP_NOWARN, which is usually added to avoid warnings from callsites
> > that expect to fail and have fallbacks, currently also suppresses
> > allocation stall warnings. These trigger when an allocation is stuck
> > inside the allocator for 10 seconds or longer.
> > 
> > But there is no class of allocations that can get legitimately stuck
> > in the allocator for this long. This always indicates a problem.
> > 
> > Always emit stall warnings. Restrict __GFP_NOWARN to alloc failures.
> 
> Tetsuo has already suggested something like this and I didn't really
> like it because it makes the semantic of the flag confusing. The mask
> says to not warn while the kernel log might contain an allocation splat.
> You are right that stalling for 10s seconds means a problem on its own
> but on the other hand I can imagine somebody might really want to have
> clean logs and the last thing we want is to have another gfp flag for
> that purpose.

I don't think it's confusing. __GFP_NOWARN tells the allocator whether
an allocation failure can be handled or whether it constitutes a bug.

If we agree that stalling for 10s is a bug, then we should emit the
warnings. Tying this to whether the caller can handle an allocation
failure is non-sensical. Not warning about a bug because the user
would prefer clean logs is... somewhat out there.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v6] mm: Add memory allocation watchdog kernel thread.
  2017-01-25 18:11 ` Johannes Weiner
  2017-01-25 18:45   ` Michal Hocko
@ 2017-01-25 23:44   ` Minchan Kim
  1 sibling, 0 replies; 12+ messages in thread
From: Minchan Kim @ 2017-01-25 23:44 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Tetsuo Handa, mhocko, linux-mm, linux-kernel

On Wed, Jan 25, 2017 at 01:11:50PM -0500, Johannes Weiner wrote:
> On Sun, Nov 06, 2016 at 04:15:01PM +0900, Tetsuo Handa wrote:
> > +- Why need to use it?
> > +
> > +Currently, when something went wrong inside memory allocation request,
> > +the system might stall without any kernel messages.
> > +
> > +Although there is khungtaskd kernel thread as an asynchronous monitoring
> > +approach, khungtaskd kernel thread is not always helpful because memory
> > +allocating tasks unlikely sleep in uninterruptible state for
> > +/proc/sys/kernel/hung_task_timeout_secs seconds.
> > +
> > +Although there is warn_alloc() as a synchronous monitoring approach
> > +which emits
> > +
> > +  "%s: page allocation stalls for %ums, order:%u, mode:%#x(%pGg)\n"
> > +
> > +line, warn_alloc() is not bullet proof because allocating tasks can get
> > +stuck before calling warn_alloc() and/or allocating tasks are using
> > +__GFP_NOWARN flag and/or such lines are suppressed by ratelimiting and/or
> > +such lines are corrupted due to collisions.
> 
> I'm not fully convinced by this explanation. Do you have a real life
> example where the warn_alloc() stall info is not enough? If yes, this
> should be included here and in the changelog. If not, the extra code,
> the task_struct overhead etc. don't seem justified.
> 
> __GFP_NOWARN shouldn't suppress stall warnings, IMO. It's for whether
> the caller expects allocation failure and is prepared to handle it; an
> allocation stalling out for 10s is an issue regardless of the callsite.
> 
> ---
> 
> From 6420cae52cac8167bd5fb19f45feed2d540bc11d Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Wed, 25 Jan 2017 12:57:20 -0500
> Subject: [PATCH] mm: page_alloc: __GFP_NOWARN shouldn't suppress stall
>  warnings
> 
> __GFP_NOWARN, which is usually added to avoid warnings from callsites
> that expect to fail and have fallbacks, currently also suppresses
> allocation stall warnings. These trigger when an allocation is stuck
> inside the allocator for 10 seconds or longer.
> 
> But there is no class of allocations that can get legitimately stuck
> in the allocator for this long. This always indicates a problem.
> 
> Always emit stall warnings. Restrict __GFP_NOWARN to alloc failures.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v6] mm: Add memory allocation watchdog kernel thread.
  2017-01-25 19:22     ` Johannes Weiner
@ 2017-01-26  7:57       ` Michal Hocko
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2017-01-26  7:57 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Tetsuo Handa, linux-mm, linux-kernel

On Wed 25-01-17 14:22:45, Johannes Weiner wrote:
> On Wed, Jan 25, 2017 at 07:45:49PM +0100, Michal Hocko wrote:
> > On Wed 25-01-17 13:11:50, Johannes Weiner wrote:
> > [...]
> > > >From 6420cae52cac8167bd5fb19f45feed2d540bc11d Mon Sep 17 00:00:00 2001
> > > From: Johannes Weiner <hannes@cmpxchg.org>
> > > Date: Wed, 25 Jan 2017 12:57:20 -0500
> > > Subject: [PATCH] mm: page_alloc: __GFP_NOWARN shouldn't suppress stall
> > >  warnings
> > > 
> > > __GFP_NOWARN, which is usually added to avoid warnings from callsites
> > > that expect to fail and have fallbacks, currently also suppresses
> > > allocation stall warnings. These trigger when an allocation is stuck
> > > inside the allocator for 10 seconds or longer.
> > > 
> > > But there is no class of allocations that can get legitimately stuck
> > > in the allocator for this long. This always indicates a problem.
> > > 
> > > Always emit stall warnings. Restrict __GFP_NOWARN to alloc failures.
> > 
> > Tetsuo has already suggested something like this and I didn't really
> > like it because it makes the semantic of the flag confusing. The mask
> > says to not warn while the kernel log might contain an allocation splat.
> > You are right that stalling for 10s seconds means a problem on its own
> > but on the other hand I can imagine somebody might really want to have
> > clean logs and the last thing we want is to have another gfp flag for
> > that purpose.
> 
> I don't think it's confusing. __GFP_NOWARN tells the allocator whether
> an allocation failure can be handled or whether it constitutes a bug.
> 
> If we agree that stalling for 10s is a bug, then we should emit the
> warnings.

Yes, in many cases it would be a bug in the MM. Some of them would be
inherent because the allocator doesn't implement any fairness and
starvation cannot be ruled out (would that be a bug?). In general,
looping/spending a lot of time in kernel can be seen as a bug. We have
watchdogs to report those cases and the time has told us that we had to
develop ways to silent those lockups because in some cases we couldn't
handle them. I am worried we will eventually find cases like that for
allocation stalls as well. I might be over sensitive because we have
made some mistakes in the gfp flags land already and I would like to
prevent more to come.

That being said, I will not stand in the way...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v6] mm: Add memory allocation watchdog kernel thread.
  2017-01-25 18:45   ` Michal Hocko
  2017-01-25 19:22     ` Johannes Weiner
@ 2017-01-26 10:28     ` Tetsuo Handa
  2017-02-22  2:11       ` Tetsuo Handa
  1 sibling, 1 reply; 12+ messages in thread
From: Tetsuo Handa @ 2017-01-26 10:28 UTC (permalink / raw)
  To: mhocko, hannes; +Cc: linux-mm, linux-kernel

Michal Hocko wrote:
> On Wed 25-01-17 13:11:50, Johannes Weiner wrote:
> [...]
> > >From 6420cae52cac8167bd5fb19f45feed2d540bc11d Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Date: Wed, 25 Jan 2017 12:57:20 -0500
> > Subject: [PATCH] mm: page_alloc: __GFP_NOWARN shouldn't suppress stall
> >  warnings
> > 
> > __GFP_NOWARN, which is usually added to avoid warnings from callsites
> > that expect to fail and have fallbacks, currently also suppresses
> > allocation stall warnings. These trigger when an allocation is stuck
> > inside the allocator for 10 seconds or longer.
> > 
> > But there is no class of allocations that can get legitimately stuck
> > in the allocator for this long. This always indicates a problem.
> > 
> > Always emit stall warnings. Restrict __GFP_NOWARN to alloc failures.
> 
> Tetsuo has already suggested something like this and I didn't really

Yes, I already suggested it at
http://lkml.kernel.org/r/1484132120-35288-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .

> like it because it makes the semantic of the flag confusing. The mask
> says to not warn while the kernel log might contain an allocation splat.
> You are right that stalling for 10s seconds means a problem on its own
> but on the other hand I can imagine somebody might really want to have
> clean logs and the last thing we want is to have another gfp flag for
> that purpose.

I agree with Johannes about that __GFP_NOWARN should not suppress allocation
stall warnings. But having another gfp flag for that purpose is not useful.
Given that someone really wants not to have allocation stall warnings in the
kernel logs, where is the switch for enabling/disabling allocation stall
warnings (because gfp flags are constant determined at build time)? We will
need to have either a kernel command line option or a sysctl (or sysfs)
variable. khungtaskd uses sysctl variables for those who really wants not
to have TASK_UNINTERRUPTIBLE warnings; so does kmallocwd.

> 
> I also do not think that this change would make a big difference because
> most allocations simply use this flag along with __GFP_NORETRY or
> GFP_NOWAIT resp GFP_ATOMIC. Have we ever seen a stall with this
> allocation requests?

You are totally ignoring what I explained in the "[PATCH] mm: Ignore
__GFP_NOWARN when reporting stalls" thread shown above.

  Majority of __GFP_DIRECT_RECLAIM allocation requests are tolerable with
  allocation failure (and they will be willing to give up upon SIGKILL if
  they are from syscall) and do not need to alarm the admin to do any action.
  If they are not tolerable with allocation failure, they will add __GFP_NOFAIL.
  
  Apart from the reality that they are not tested well because they are
  currently protected by the "too small to fail" memory-allocation rule,
  they are ready to add __GFP_NOWARN. And current behavior (i.e. !costly
  __GFP_DIRECT_RECLAIM allocation requests won't fail unless __GFP_NORETRY
  is set or TIF_MEMDIE is set after SIGKILL was delivered) keeps them away
  from adding __GFP_NOFAIL.

> 
> I haven't nacked Tetsuo's patch AFAIR and will not nack this one either
> I just do not think we should tweak __GFP_NOWARN.

Leaving this proposal as it is is counterproductive. I already said

  The discussion at this stage should not be "whether we need such
  watchdog and debugging code" but should be "how we can make the impact
  of watchdog and debugging code as small as possible".

at http://lkml.kernel.org/r/1462630604-23410-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .
And there had been no response.

Johannes Weiner wrote:
> On Sun, Nov 06, 2016 at 04:15:01PM +0900, Tetsuo Handa wrote:
> > +- Why need to use it?
> > +
> > +Currently, when something went wrong inside memory allocation request,
> > +the system might stall without any kernel messages.
> > +
> > +Although there is khungtaskd kernel thread as an asynchronous monitoring
> > +approach, khungtaskd kernel thread is not always helpful because memory
> > +allocating tasks unlikely sleep in uninterruptible state for
> > +/proc/sys/kernel/hung_task_timeout_secs seconds.
> > +
> > +Although there is warn_alloc() as a synchronous monitoring approach
> > +which emits
> > +
> > +  "%s: page allocation stalls for %ums, order:%u, mode:%#x(%pGg)\n"
> > +
> > +line, warn_alloc() is not bullet proof because allocating tasks can get
> > +stuck before calling warn_alloc() and/or allocating tasks are using
> > +__GFP_NOWARN flag and/or such lines are suppressed by ratelimiting and/or
> > +such lines are corrupted due to collisions.
> 
> I'm not fully convinced by this explanation. Do you have a real life
> example where the warn_alloc() stall info is not enough? If yes, this
> should be included here and in the changelog. If not, the extra code,
> the task_struct overhead etc. don't seem justified.

warn_alloc() stall info cannot provide overall analyses. I said

  If you meant (b), it is because synchronous watchdog is not reliable and
  cannot provide enough diagnostic information. Since allocation livelock
  involves several threads due to dependency, it is important to take a
  snapshot of possibly relevant threads. By using asynchronous watchdog,
  we can not only take a snapshot but also take more actions for obtaining
  diagnostic information (e.g. enabling tracepoints when allocation stalls
  are detected).

in the same thread shown above. For example, the cause of allocation stall
might be due to out of idle workqueue thread; e.g. commit 373ccbe5927034b5
("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make
any progress"). Without reporting all possibly relevant threads, we might
fail to obtain enough diagnostic information. Changing warn_alloc() to
also report workqueues/kswapd/locks etc. will be noisy and incomplete
because warn_alloc() cares only current thread.

I welcome suggestions for "how we can make the impact of watchdog and
debugging code as small as possible". But I don't have environment for
evaluating the task_struct overhead. I wonder whether this overhead matters
because this is allocation slowpath which will consume a lot of CPU cycles
for scanning and/or sleep for many jiffies waiting for I/O. It will be far
cheaper than keeping mm related tracepoints enabled until something happens
which might be uptime of months. Sending the v6 patch to linux-next and
start evaluating the overhead will be the way to avoid leaving this proposal
as it is.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v6] mm: Add memory allocation watchdog kernel thread.
  2017-01-26 10:28     ` Tetsuo Handa
@ 2017-02-22  2:11       ` Tetsuo Handa
  0 siblings, 0 replies; 12+ messages in thread
From: Tetsuo Handa @ 2017-02-22  2:11 UTC (permalink / raw)
  To: mhocko, hannes; +Cc: linux-mm, linux-kernel

Johannes, any questions/comments on this patch?
I'd like to send to linux-next because this patch will be useful for
examining problems like "Bug 192981 - page allocation stalls". I think
that this patch will remain be useful as long as we use direct reclaim.

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 25-01-17 13:11:50, Johannes Weiner wrote:
> > [...]
> > > >From 6420cae52cac8167bd5fb19f45feed2d540bc11d Mon Sep 17 00:00:00 2001
> > > From: Johannes Weiner <hannes@cmpxchg.org>
> > > Date: Wed, 25 Jan 2017 12:57:20 -0500
> > > Subject: [PATCH] mm: page_alloc: __GFP_NOWARN shouldn't suppress stall
> > >  warnings
> > > 
> > > __GFP_NOWARN, which is usually added to avoid warnings from callsites
> > > that expect to fail and have fallbacks, currently also suppresses
> > > allocation stall warnings. These trigger when an allocation is stuck
> > > inside the allocator for 10 seconds or longer.
> > > 
> > > But there is no class of allocations that can get legitimately stuck
> > > in the allocator for this long. This always indicates a problem.
> > > 
> > > Always emit stall warnings. Restrict __GFP_NOWARN to alloc failures.
> > 
> > Tetsuo has already suggested something like this and I didn't really
> 
> Yes, I already suggested it at
> http://lkml.kernel.org/r/1484132120-35288-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .
> 
> > like it because it makes the semantic of the flag confusing. The mask
> > says to not warn while the kernel log might contain an allocation splat.
> > You are right that stalling for 10s seconds means a problem on its own
> > but on the other hand I can imagine somebody might really want to have
> > clean logs and the last thing we want is to have another gfp flag for
> > that purpose.
> 
> I agree with Johannes about that __GFP_NOWARN should not suppress allocation
> stall warnings. But having another gfp flag for that purpose is not useful.
> Given that someone really wants not to have allocation stall warnings in the
> kernel logs, where is the switch for enabling/disabling allocation stall
> warnings (because gfp flags are constant determined at build time)? We will
> need to have either a kernel command line option or a sysctl (or sysfs)
> variable. khungtaskd uses sysctl variables for those who really wants not
> to have TASK_UNINTERRUPTIBLE warnings; so does kmallocwd.
> 
> > 
> > I also do not think that this change would make a big difference because
> > most allocations simply use this flag along with __GFP_NORETRY or
> > GFP_NOWAIT resp GFP_ATOMIC. Have we ever seen a stall with this
> > allocation requests?
> 
> You are totally ignoring what I explained in the "[PATCH] mm: Ignore
> __GFP_NOWARN when reporting stalls" thread shown above.
> 
>   Majority of __GFP_DIRECT_RECLAIM allocation requests are tolerable with
>   allocation failure (and they will be willing to give up upon SIGKILL if
>   they are from syscall) and do not need to alarm the admin to do any action.
>   If they are not tolerable with allocation failure, they will add __GFP_NOFAIL.
>   
>   Apart from the reality that they are not tested well because they are
>   currently protected by the "too small to fail" memory-allocation rule,
>   they are ready to add __GFP_NOWARN. And current behavior (i.e. !costly
>   __GFP_DIRECT_RECLAIM allocation requests won't fail unless __GFP_NORETRY
>   is set or TIF_MEMDIE is set after SIGKILL was delivered) keeps them away
>   from adding __GFP_NOFAIL.
> 
> > 
> > I haven't nacked Tetsuo's patch AFAIR and will not nack this one either
> > I just do not think we should tweak __GFP_NOWARN.
> 
> Leaving this proposal as it is is counterproductive. I already said
> 
>   The discussion at this stage should not be "whether we need such
>   watchdog and debugging code" but should be "how we can make the impact
>   of watchdog and debugging code as small as possible".
> 
> at http://lkml.kernel.org/r/1462630604-23410-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .
> And there had been no response.
> 
> Johannes Weiner wrote:
> > On Sun, Nov 06, 2016 at 04:15:01PM +0900, Tetsuo Handa wrote:
> > > +- Why need to use it?
> > > +
> > > +Currently, when something went wrong inside memory allocation request,
> > > +the system might stall without any kernel messages.
> > > +
> > > +Although there is khungtaskd kernel thread as an asynchronous monitoring
> > > +approach, khungtaskd kernel thread is not always helpful because memory
> > > +allocating tasks unlikely sleep in uninterruptible state for
> > > +/proc/sys/kernel/hung_task_timeout_secs seconds.
> > > +
> > > +Although there is warn_alloc() as a synchronous monitoring approach
> > > +which emits
> > > +
> > > +  "%s: page allocation stalls for %ums, order:%u, mode:%#x(%pGg)\n"
> > > +
> > > +line, warn_alloc() is not bullet proof because allocating tasks can get
> > > +stuck before calling warn_alloc() and/or allocating tasks are using
> > > +__GFP_NOWARN flag and/or such lines are suppressed by ratelimiting and/or
> > > +such lines are corrupted due to collisions.
> > 
> > I'm not fully convinced by this explanation. Do you have a real life
> > example where the warn_alloc() stall info is not enough? If yes, this
> > should be included here and in the changelog. If not, the extra code,
> > the task_struct overhead etc. don't seem justified.
> 
> warn_alloc() stall info cannot provide overall analyses. I said
> 
>   If you meant (b), it is because synchronous watchdog is not reliable and
>   cannot provide enough diagnostic information. Since allocation livelock
>   involves several threads due to dependency, it is important to take a
>   snapshot of possibly relevant threads. By using asynchronous watchdog,
>   we can not only take a snapshot but also take more actions for obtaining
>   diagnostic information (e.g. enabling tracepoints when allocation stalls
>   are detected).
> 
> in the same thread shown above. For example, the cause of allocation stall
> might be due to out of idle workqueue thread; e.g. commit 373ccbe5927034b5
> ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make
> any progress"). Without reporting all possibly relevant threads, we might
> fail to obtain enough diagnostic information. Changing warn_alloc() to
> also report workqueues/kswapd/locks etc. will be noisy and incomplete
> because warn_alloc() cares only current thread.
> 
> I welcome suggestions for "how we can make the impact of watchdog and
> debugging code as small as possible". But I don't have environment for
> evaluating the task_struct overhead. I wonder whether this overhead matters
> because this is allocation slowpath which will consume a lot of CPU cycles
> for scanning and/or sleep for many jiffies waiting for I/O. It will be far
> cheaper than keeping mm related tracepoints enabled until something happens
> which might be uptime of months. Sending the v6 patch to linux-next and
> start evaluating the overhead will be the way to avoid leaving this proposal
> as it is.
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2017-02-22  2:12 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-06  7:15 [PATCH v6] mm: Add memory allocation watchdog kernel thread Tetsuo Handa
2016-12-15 10:24 ` Tetsuo Handa
2016-12-28 11:42   ` Tetsuo Handa
2017-01-25 14:03     ` Tetsuo Handa
2017-01-25 14:21       ` Michal Hocko
2017-01-25 18:11 ` Johannes Weiner
2017-01-25 18:45   ` Michal Hocko
2017-01-25 19:22     ` Johannes Weiner
2017-01-26  7:57       ` Michal Hocko
2017-01-26 10:28     ` Tetsuo Handa
2017-02-22  2:11       ` Tetsuo Handa
2017-01-25 23:44   ` Minchan Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).