* [RFC PATCH 0/5] mm: Patches for mitigating memory allocation stalls.
@ 2014-11-23  4:49 Tetsuo Handa
  2014-11-23  4:50 ` [PATCH 1/5] mm: Introduce OOM kill timeout Tetsuo Handa
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Tetsuo Handa @ 2014-11-23  4:49 UTC (permalink / raw)
  To: linux-mm

This patchset serves two purposes.

  (a) Mitigate one of the phenomena

       "On many Linux kernel versions (from some unknown release up to the
        present), any local user can apply a certain type of memory pressure
        which causes __alloc_pages_nodemask() to keep trying to reclaim
        memory, presumably forever. As a consequence, such a user can disturb
        other users' activities by keeping the system stalled at 0% or 100%
        CPU usage. On systems where XFS is used, SysRq-f (forced OOM kill)
        may become unresponsive because the kernel worker thread which is
        supposed to process the SysRq-f request is blocked by a previous
        request's __GFP_WAIT allocation."

      This phenomenon is triggered by a vulnerability which has existed
      since (if I didn't miss something) Linux 2.0, 18 years ago.

      I reported this vulnerability last year and a CVE number was assigned,
      but no progress has been made. Once a patchset that mitigates/fixes
      this vulnerability is posted, a malicious local user can learn from it
      and attack existing Linux systems. Therefore, I propose this mitigation
      patchset before any patchset that fixes the root cause is posted.

  (b) Help debug memory allocation stall problems which are not caused by
      malicious attacks. Since I provide a technical support service for
      troubleshooting RHEL systems, I sometimes encounter cases where a
      stalled memory allocation is suspected. But neither SysRq nor the
      hung task check reports how long a thread has been stalled in memory
      allocation. Therefore, I propose this patchset for reporting on and
      responding to memory allocation stalls.

This patchset does the following things.

  [PATCH 1/5] mm: Introduce OOM kill timeout.

    Introduce a timeout for TIF_MEMDIE threads, in case they cannot
    terminate immediately for some reason.

  [PATCH 2/5] mm: Kill shrinker's global semaphore.

    Don't respond with "try again" when we need to call out_of_memory().

  [PATCH 3/5] mm: Remember ongoing memory allocation status.

    Remember the start time of an ongoing memory allocation, and let the
    thread dump print how long the allocation has been stalled.

  [PATCH 4/5] mm: Drop __GFP_WAIT flag when allocating from shrinker functions.

    Avoid potential deadlocks and kernel stack overflows caused by recursive
    __alloc_pages_nodemask() calls from shrinker functions.

  [PATCH 5/5] mm: Insert some delay if ongoing memory allocation stalls.

    Introduce a small sleep to save CPU time when a memory allocation is
    taking too long.

This patchset is designed to be easy to backport, because fixing the root
cause requires fundamental changes which could prevent Linux systems from
working unless carefully implemented and appropriately configured.

  drivers/staging/android/lowmemorykiller.c |    2
  include/linux/mm.h                        |    2
  include/linux/sched.h                     |    5 +
  include/linux/shrinker.h                  |    4 +
  kernel/sched/core.c                       |   17 ++++++
  mm/memcontrol.c                           |    2
  mm/oom_kill.c                             |   35 ++++++++++++-
  mm/page_alloc.c                           |   68 +++++++++++++++++++++++++-
  mm/vmscan.c                               |   78 +++++++++++++++++++++---------
  9 files changed, 184 insertions(+), 29 deletions(-)

Regards.


* [PATCH 1/5] mm: Introduce OOM kill timeout.
  2014-11-23  4:49 [RFC PATCH 0/5] mm: Patches for mitigating memory allocation stalls Tetsuo Handa
@ 2014-11-23  4:50 ` Tetsuo Handa
  2014-11-24 16:50   ` Michal Hocko
  2014-11-23  4:50 ` [PATCH 2/5] mm: Kill shrinker's global semaphore Tetsuo Handa
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Tetsuo Handa @ 2014-11-23  4:50 UTC (permalink / raw)
  To: linux-mm

From ca8b3ee4bea5bcc6f8ec5e8496a97fd4cab5a440 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sun, 23 Nov 2014 13:38:53 +0900
Subject: [PATCH 1/5] mm: Introduce OOM kill timeout.

On many Linux kernel versions (from some unknown release up to the present),
any local user can apply a certain type of memory pressure which causes
__alloc_pages_nodemask() to keep trying to reclaim memory, presumably
forever. As a consequence, such a user can disturb other users' activities
by keeping the system stalled at 0% or 100% CPU usage.

On systems where XFS is used, SysRq-f (forced OOM kill) may become
unresponsive because the kernel worker thread which is supposed to process
the SysRq-f request is blocked by a previous request's __GFP_WAIT allocation.

The problem described above is one of the phenomena triggered by a
vulnerability which has existed since (if I didn't miss something)
Linux 2.0, 18 years ago. However, backporting the patches which fix the
vulnerability itself is too difficult.

Setting TIF_MEMDIE on a SIGKILL'ed and/or PF_EXITING thread disables the
OOM killer. But the TIF_MEMDIE thread may not be able to terminate within
a reasonable duration for some reason. Therefore, in order to avoid keeping
the OOM killer disabled forever, this patch introduces a 5 second timeout
for TIF_MEMDIE threads, which are supposed to terminate shortly.

The Android platform's low memory killer already uses a 1 second timeout
for TIF_MEMDIE threads. This patch is for generic platforms.

Note that this patch does not help unless out_of_memory() is called.
For example, if all threads are looping in

  while (unlikely(too_many_isolated(zone, file, sc))) {
          congestion_wait(BLK_RW_ASYNC, HZ/10);

          /* We are about to die and free our memory. Return now. */
          if (fatal_signal_pending(current))
                  return SWAP_CLUSTER_MAX;
  }

in shrink_inactive_list() when kswapd is sleeping inside shrinker
functions, the system will stall forever with 0% CPU usage.
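
For reference, TIF_MEMDIE disables further OOM kills because the victim
selection loop aborts as soon as it finds a TIF_MEMDIE task. A simplified
sketch of that check (not the exact kernel code) looks like:

  /* Simplified sketch of the scan logic whose timeout this patch adds. */
  if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
          /*
           * An earlier victim already has access to memory reserves;
           * abort the scan and wait for that victim to exit instead
           * of selecting another one.
           */
          return OOM_SCAN_ABORT;
  }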

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 drivers/staging/android/lowmemorykiller.c |  2 +-
 include/linux/mm.h                        |  2 ++
 include/linux/sched.h                     |  2 ++
 mm/memcontrol.c                           |  2 +-
 mm/oom_kill.c                             | 35 ++++++++++++++++++++++++++++---
 5 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index b545d3d..819bc36 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -160,7 +160,7 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
 			     selected->pid, selected->comm,
 			     selected_oom_score_adj, selected_tasksize);
 		lowmem_deathpending_timeout = jiffies + HZ;
-		set_tsk_thread_flag(selected, TIF_MEMDIE);
+		set_memdie_flag(selected);
 		send_sig(SIGKILL, selected, 0);
 		rem += selected_tasksize;
 	}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b464611..8b187fe 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2161,5 +2161,7 @@ void __init setup_nr_node_ids(void);
 static inline void setup_nr_node_ids(void) {}
 #endif
 
+void set_memdie_flag(struct task_struct *task);
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e344bb..f1626c3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1661,6 +1661,8 @@ struct task_struct {
 	unsigned int	sequential_io;
 	unsigned int	sequential_io_avg;
 #endif
+	/* Set when TIF_MEMDIE flag is set to this thread. */
+	unsigned long memdie_start;
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d6ac0e3..bf51518 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1735,7 +1735,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * quickly exit and free its memory.
 	 */
 	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
-		set_thread_flag(TIF_MEMDIE);
+		set_memdie_flag(current);
 		return;
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5340f6b..678c431 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -134,6 +134,19 @@ static bool oom_unkillable_task(struct task_struct *p,
 	if (!has_intersects_mems_allowed(p, nodemask))
 		return true;
 
+	/* p may not be terminated within a reasonable duration */
+	if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
+		smp_rmb(); /* set_memdie_flag() uses smp_wmb(). */
+		if (time_after(jiffies, p->memdie_start + 5 * HZ)) {
+			static unsigned char warn = 255;
+			char comm[sizeof(p->comm)];
+
+			if (warn && warn--)
+				pr_err("Process %d (%s) was not killed within 5 seconds.\n",
+				       task_pid_nr(p), get_task_comm(comm, p));
+			return true;
+		}
+	}
 	return false;
 }
 
@@ -444,7 +457,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
 	if (p->flags & PF_EXITING) {
-		set_tsk_thread_flag(p, TIF_MEMDIE);
+		set_memdie_flag(p);
 		put_task_struct(p);
 		return;
 	}
@@ -527,7 +540,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		}
 	rcu_read_unlock();
 
-	set_tsk_thread_flag(victim, TIF_MEMDIE);
+	set_memdie_flag(victim);
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	put_task_struct(victim);
 }
@@ -650,7 +663,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * quickly exit and free its memory.
 	 */
 	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
-		set_thread_flag(TIF_MEMDIE);
+		set_memdie_flag(current);
 		return;
 	}
 
@@ -711,3 +724,19 @@ void pagefault_out_of_memory(void)
 		oom_zonelist_unlock(zonelist, GFP_KERNEL);
 	}
 }
+
+void set_memdie_flag(struct task_struct *task)
+{
+	if (test_tsk_thread_flag(task, TIF_MEMDIE))
+		return;
+	/*
+	 * Allow oom_unkillable_task() to take into account whether
+	 * the thread cannot be terminated immediately for some reason
+	 * (e.g. waiting on unkillable lock, waiting for completion by
+	 * other thread).
+	 */
+	task->memdie_start = jiffies;
+	smp_wmb(); /* oom_unkillable_task() uses smp_rmb(). */
+	set_tsk_thread_flag(task, TIF_MEMDIE);
+}
+EXPORT_SYMBOL(set_memdie_flag);
-- 
1.8.3.1


* [PATCH 2/5] mm: Kill shrinker's global semaphore.
  2014-11-23  4:49 [RFC PATCH 0/5] mm: Patches for mitigating memory allocation stalls Tetsuo Handa
  2014-11-23  4:50 ` [PATCH 1/5] mm: Introduce OOM kill timeout Tetsuo Handa
@ 2014-11-23  4:50 ` Tetsuo Handa
  2014-11-24 16:55   ` Michal Hocko
  2014-11-23  4:51 ` [PATCH 3/5] mm: Remember ongoing memory allocation status Tetsuo Handa
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Tetsuo Handa @ 2014-11-23  4:50 UTC (permalink / raw)
  To: linux-mm

From 92aec48e3b2e21c3716654670a24890f34c58683 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sun, 23 Nov 2014 13:39:25 +0900
Subject: [PATCH 2/5] mm: Kill shrinker's global semaphore.

Currently register_shrinker()/unregister_shrinker() call down_write()
while shrink_slab() calls down_read_trylock(). As a result, if somebody
calls register_shrinker()/unregister_shrinker() while one of the shrinker
functions is allocating memory and/or holding a mutex which may take an
unpredictably long time to complete, shrink_slab() pretends "we reclaimed
some slab memory" even though no slab memory could be reclaimed, and the
OOM killer is effectively kept disabled.

This patch replaces the global semaphore with a per-shrinker refcounter
so that shrink_slab() can report "we could not reclaim slab memory"
when out_of_memory() needs to be called.

Before this patch, the response time of shrinker addition/removal is
unpredictable when one of the shrinkers is in use by shrink_slab(), and
nearly 0 otherwise.

After this patch, the response time of addition is nearly 0. The response
time of removal remains unpredictable when the shrinker being removed is
in use by shrink_slab(), and is nearly two RCU grace periods otherwise.
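
For readers unfamiliar with the shrinker API, the callbacks that
shrink_slab() walks look roughly like the minimal sketch below (all names
are hypothetical). The point is that scan_objects() may take mutexes or
allocate memory, which is what makes the hold time unpredictable:

  /* Minimal hypothetical shrinker, for illustration only. */
  static unsigned long my_cache_count(struct shrinker *s,
                                      struct shrink_control *sc)
  {
          return my_cached_object_count();        /* hypothetical helper */
  }

  static unsigned long my_cache_scan(struct shrinker *s,
                                     struct shrink_control *sc)
  {
          /* May sleep, take locks or allocate memory. */
          return my_cache_free_objects(sc->nr_to_scan);   /* hypothetical */
  }

  static struct shrinker my_cache_shrinker = {
          .count_objects = my_cache_count,
          .scan_objects  = my_cache_scan,
          .seeks         = DEFAULT_SEEKS,
  };

  /* Registration and removal, typically from module init/exit paths: */
  static int __init my_cache_init(void)
  {
          return register_shrinker(&my_cache_shrinker);
  }

  static void __exit my_cache_exit(void)
  {
          unregister_shrinker(&my_cache_shrinker);
  }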

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/shrinker.h |  4 +++
 mm/vmscan.c              | 78 ++++++++++++++++++++++++++++++++++--------------
 2 files changed, 60 insertions(+), 22 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 68c0970..745246a 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -59,6 +59,10 @@ struct shrinker {
 	struct list_head list;
 	/* objs pending delete, per node */
 	atomic_long_t *nr_deferred;
+	/* Number of users holding reference to this object. */
+	atomic_t usage;
+	/* Used for tracking concurrent unregistrations. */
+	struct list_head gc_list;
 };
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dcb4707..54d2638 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -144,7 +144,7 @@ int vm_swappiness = 60;
 unsigned long vm_total_pages;
 
 static LIST_HEAD(shrinker_list);
-static DECLARE_RWSEM(shrinker_rwsem);
+static DEFINE_SPINLOCK(shrinker_list_lock);
 
 #ifdef CONFIG_MEMCG
 static bool global_reclaim(struct scan_control *sc)
@@ -208,9 +208,16 @@ int register_shrinker(struct shrinker *shrinker)
 	if (!shrinker->nr_deferred)
 		return -ENOMEM;
 
-	down_write(&shrinker_rwsem);
-	list_add_tail(&shrinker->list, &shrinker_list);
-	up_write(&shrinker_rwsem);
+	/*
+	 * Make it possible for list_for_each_entry_rcu(shrinker,
+	 * &shrinker_list, list) in shrink_slab() to find this shrinker.
+	 * We assume that this shrinker is not under unregister_shrinker()
+	 * call.
+	 */
+	atomic_set(&shrinker->usage, 0);
+	spin_lock(&shrinker_list_lock);
+	list_add_tail_rcu(&shrinker->list, &shrinker_list);
+	spin_unlock(&shrinker_list_lock);
 	return 0;
 }
 EXPORT_SYMBOL(register_shrinker);
@@ -220,9 +227,41 @@ EXPORT_SYMBOL(register_shrinker);
  */
 void unregister_shrinker(struct shrinker *shrinker)
 {
-	down_write(&shrinker_rwsem);
-	list_del(&shrinker->list);
-	up_write(&shrinker_rwsem);
+	static LIST_HEAD(shrinker_gc_list);
+	struct shrinker *gc;
+	unsigned int i = 0;
+	int usage;
+
+	/*
+	 * Make it impossible for shrinkers on shrinker_list and shrinkers
+	 * on shrinker_gc_list to call atomic_inc(&shrinker->usage) after
+	 * RCU grace period expires.
+	 */
+	spin_lock(&shrinker_list_lock);
+	list_del_rcu(&shrinker->list);
+	list_for_each_entry(gc, &shrinker_gc_list, gc_list) {
+		if (gc->list.next == &shrinker->list)
+			rcu_assign_pointer(gc->list.next, shrinker->list.next);
+	}
+	list_add_tail(&shrinker->gc_list, &shrinker_gc_list);
+	spin_unlock(&shrinker_list_lock);
+	synchronize_rcu();
+	/*
+	 * Wait for readers until RCU grace period expires after the last
+	 * atomic_dec(&shrinker->usage). Warn if it is taking too long.
+	 */
+	while (1) {
+		usage = atomic_read(&shrinker->usage);
+		if (!usage)
+			break;
+		msleep(100);
+		WARN(++i % 600 == 0, "Shrinker usage=%d\n", usage);
+	}
+	synchronize_rcu();
+	/* Now, nobody is using this shrinker. */
+	spin_lock(&shrinker_list_lock);
+	list_del(&shrinker->gc_list);
+	spin_unlock(&shrinker_list_lock);
 	kfree(shrinker->nr_deferred);
 }
 EXPORT_SYMBOL(unregister_shrinker);
@@ -369,23 +408,15 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
 	if (nr_pages_scanned == 0)
 		nr_pages_scanned = SWAP_CLUSTER_MAX;
 
-	if (!down_read_trylock(&shrinker_rwsem)) {
-		/*
-		 * If we would return 0, our callers would understand that we
-		 * have nothing else to shrink and give up trying. By returning
-		 * 1 we keep it going and assume we'll be able to shrink next
-		 * time.
-		 */
-		freed = 1;
-		goto out;
-	}
-
-	list_for_each_entry(shrinker, &shrinker_list, list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
+		atomic_inc(&shrinker->usage);
+		rcu_read_unlock();
 		if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
 			shrinkctl->nid = 0;
 			freed += shrink_slab_node(shrinkctl, shrinker,
 					nr_pages_scanned, lru_pages);
-			continue;
+			goto next_entry;
 		}
 
 		for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
@@ -394,9 +425,12 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
 						nr_pages_scanned, lru_pages);
 
 		}
+next_entry:
+		rcu_read_lock();
+		atomic_dec(&shrinker->usage);
 	}
-	up_read(&shrinker_rwsem);
-out:
+	rcu_read_unlock();
+
 	cond_resched();
 	return freed;
 }
-- 
1.8.3.1


* [PATCH 3/5] mm: Remember ongoing memory allocation status.
  2014-11-23  4:49 [RFC PATCH 0/5] mm: Patches for mitigating memory allocation stalls Tetsuo Handa
  2014-11-23  4:50 ` [PATCH 1/5] mm: Introduce OOM kill timeout Tetsuo Handa
  2014-11-23  4:50 ` [PATCH 2/5] mm: Kill shrinker's global semaphore Tetsuo Handa
@ 2014-11-23  4:51 ` Tetsuo Handa
  2014-11-24 17:01   ` Michal Hocko
  2014-11-23  4:52 ` [PATCH 4/5] mm: Drop __GFP_WAIT flag when allocating from shrinker functions Tetsuo Handa
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Tetsuo Handa @ 2014-11-23  4:51 UTC (permalink / raw)
  To: linux-mm

From 0c6d4e0ac9fc5964fdd09849c99e4f6497b7a37e Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sun, 23 Nov 2014 13:40:20 +0900
Subject: [PATCH 3/5] mm: Remember ongoing memory allocation status.

When a stall caused by a memory allocation problem occurs, it is useful
to print how long a thread has been blocked in memory allocation.

This patch remembers how many jiffies have been spent in an ongoing
__alloc_pages_nodemask() call, and makes that value readable both by
printing a backtrace and by analyzing a vmcore.

If the system is rebooted due to a SoftDog watchdog timeout, this patch
is helpful because we can check whether the thread writing to the
/dev/watchdog interface was blocked in memory allocation.

If the system is running on QEMU (KVM) managed via the libvirt interface,
this patch is helpful because we can check the status of ongoing memory
allocations by comparing several vmcore snapshots obtained via the
"virsh dump" command.
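
With this patch applied, a stalled thread's entry in a SysRq-t (or SysRq-w)
dump gains a line of the form

  MemAlloc: <jiffies spent so far> jiffies on 0x<gfp_mask>

(see the example traces in patch 5/5's changelog).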

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/sched.h |  3 +++
 kernel/sched/core.c   | 17 +++++++++++++++++
 mm/page_alloc.c       | 20 ++++++++++++++++++--
 3 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f1626c3..83ac0c2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1663,6 +1663,9 @@ struct task_struct {
 #endif
 	/* Set when TIF_MEMDIE flag is set to this thread. */
 	unsigned long memdie_start;
+	/* Set when outermost memory allocation starts. */
+	unsigned long gfp_start;
+	gfp_t gfp_flags;
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 24beb9b..f8d0192 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4518,6 +4518,22 @@ out_unlock:
 	return retval;
 }
 
+static void print_memalloc_info(const struct task_struct *p)
+{
+	const gfp_t gfp = p->gfp_flags;
+
+	/*
+	 * __alloc_pages_nodemask() doesn't use smp_wmb() between
+	 * updating ->gfp_start and ->gfp_flags. But reading stale
+	 * ->gfp_start value harms nothing but printing bogus duration.
+	 * Correct duration will be printed when this function is
+	 * called for the next time.
+	 */
+	if (unlikely(gfp))
+		printk(KERN_INFO "MemAlloc: %ld jiffies on 0x%x\n",
+			jiffies - p->gfp_start, gfp);
+}
+
 static const char stat_nam[] = TASK_STATE_TO_CHAR_STR;
 
 void sched_show_task(struct task_struct *p)
@@ -4550,6 +4566,7 @@ void sched_show_task(struct task_struct *p)
 		task_pid_nr(p), ppid,
 		(unsigned long)task_thread_info(p)->flags);
 
+	print_memalloc_info(p);
 	print_worker_info(KERN_INFO, p);
 	show_stack(p, NULL);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 616a2c9..11cc37d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2790,6 +2790,18 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	unsigned int cpuset_mems_cookie;
 	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
 	int classzone_idx;
+	const bool omit_timestamp = !(gfp_mask & __GFP_WAIT) ||
+		current->gfp_flags;
+
+	if (!omit_timestamp) {
+		/*
+		 * Since omit_timestamp == false depends on
+		 * (gfp_mask & __GFP_WAIT) != 0 , the current->gfp_flags is
+		 * updated from zero to non-zero value.
+		 */
+		current->gfp_start = jiffies;
+		current->gfp_flags = gfp_mask;
+	}
 
 	gfp_mask &= gfp_allowed_mask;
 
@@ -2798,7 +2810,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
 	if (should_fail_alloc_page(gfp_mask, order))
-		return NULL;
+		goto nopage;
 
 	/*
 	 * Check the zones suitable for the gfp_mask contain at least one
@@ -2806,7 +2818,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	 * of GFP_THISNODE and a memoryless node
 	 */
 	if (unlikely(!zonelist->_zonerefs->zone))
-		return NULL;
+		goto nopage;
 
 	if (IS_ENABLED(CONFIG_CMA) && migratetype == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
@@ -2850,6 +2862,10 @@ out:
 	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 
+nopage:
+	if (!omit_timestamp)
+		current->gfp_flags = 0;
+
 	return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
-- 
1.8.3.1


* [PATCH 4/5] mm: Drop __GFP_WAIT flag when allocating from shrinker functions.
  2014-11-23  4:49 [RFC PATCH 0/5] mm: Patches for mitigating memory allocation stalls Tetsuo Handa
                   ` (2 preceding siblings ...)
  2014-11-23  4:51 ` [PATCH 3/5] mm: Remember ongoing memory allocation status Tetsuo Handa
@ 2014-11-23  4:52 ` Tetsuo Handa
  2014-11-24 17:14   ` Michal Hocko
  2014-11-23  4:53 ` [PATCH 5/5] mm: Insert some delay if ongoing memory allocation stalls Tetsuo Handa
  2014-11-24 17:25 ` [RFC PATCH 0/5] mm: Patches for mitigating " Michal Hocko
  5 siblings, 1 reply; 20+ messages in thread
From: Tetsuo Handa @ 2014-11-23  4:52 UTC (permalink / raw)
  To: linux-mm

From b248c31988ea582d2d4f4093fb8b649be91174bb Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sun, 23 Nov 2014 13:40:47 +0900
Subject: [PATCH 4/5] mm: Drop __GFP_WAIT flag when allocating from shrinker functions.

Memory allocations from shrinker functions are complicated.
If unexpected flags are stored in "struct shrink_control"->gfp_mask and
used inside shrinker functions, it can cause difficult-to-trigger bugs
like https://bugzilla.kernel.org/show_bug.cgi?id=87891 .

Also, the stack usage of __alloc_pages_nodemask() is large. If we allow
unlimited recursive __alloc_pages_nodemask() calls, the kernel stack could
overflow under extreme memory pressure.

Some shrinker functions use sleepable locks which could make kswapd sleep
for an unpredictable duration. If kswapd is unexpectedly blocked inside a
shrinker function while somebody is expecting kswapd to be running and
reclaiming memory (e.g.

  while (unlikely(too_many_isolated(zone, file, sc))) {
          congestion_wait(BLK_RW_ASYNC, HZ/10);

          /* We are about to die and free our memory. Return now. */
          if (fatal_signal_pending(current))
                  return SWAP_CLUSTER_MAX;
  }

in shrink_inactive_list()), the result is a memory allocation deadlock.

This patch drops the __GFP_WAIT flag when allocating from shrinker
functions so that recursive __alloc_pages_nodemask() calls will not cause
trouble such as recursive locking and/or unpredictable sleeps. The comments
added by this patch suggest that shrinker authors try to avoid sleepable
locks and memory allocations in shrinker functions, as the TTM driver's
shrinker functions already do.
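
For illustration, a sketch (all names hypothetical) of a shrinker which
must allocate but honours the reclaim context it was given:

  static unsigned long my_scan(struct shrinker *s, struct shrink_control *sc)
  {
          struct my_batch *b;
          unsigned long freed;

          /* Use the caller-provided mask minus __GFP_WAIT so the allocation
           * can neither sleep nor recurse into direct reclaim. */
          b = kmalloc(sizeof(*b), sc->gfp_mask & ~__GFP_WAIT);
          if (!b)
                  return SHRINK_STOP;     /* give up rather than block */

          freed = my_free_some_objects(b, sc->nr_to_scan); /* hypothetical */
          kfree(b);
          return freed;
  }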

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/page_alloc.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 11cc37d..c77418e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2801,6 +2801,41 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		 */
 		current->gfp_start = jiffies;
 		current->gfp_flags = gfp_mask;
+	} else {
+		/*
+		 * When this function is called from interrupt context,
+		 * the caller must not include __GFP_WAIT flag.
+		 *
+		 * When this function is called by recursive
+		 * __alloc_pages_nodemask() calls from shrinker functions,
+		 * the context might allow __GFP_WAIT flag. But since this
+		 * function consumes a lot of kernel stack, kernel stack
+		 * could overflow under extreme memory pressure if we
+		 * unlimitedly allow recursive __alloc_pages_nodemask() calls.
+		 * Also, if kswapd is unexpectedly blocked for unpredictable
+		 * duration inside shrinker functions, and somebody is
+		 * expecting that kswapd is running for reclaiming memory,
+		 * it is a memory allocation deadlock.
+		 *
+		 * If current->gfp_flags != 0 here, it means that this function
+		 * is called from either interrupt context or shrinker
+		 * functions. Thus, it should be safe to drop __GFP_WAIT flag.
+		 *
+		 * Moreover, we don't need to check for current->gfp_flags != 0
+		 * here because omit_timestamp == true is equivalent to
+		 * (gfp_mask & __GFP_WAIT) == 0 and/or current->gfp_flags != 0.
+		 * Dropping __GFP_WAIT flag when (gfp_mask & __GFP_WAIT) == 0
+		 * is a no-op.
+		 *
+		 * By dropping the __GFP_WAIT flag, kswapd will no longer be blocked
+		 * by recursive __alloc_pages_nodemask() calls from shrinker
+		 * functions. Note that kswapd could still be blocked for
+		 * unpredictable duration if sleepable locks are used inside
+		 * shrinker functions. Therefore, please try to avoid use of
+		 * sleepable locks and memory allocations from shrinker
+		 * functions.
+		 */
+		gfp_mask &= ~__GFP_WAIT;
 	}
 
 	gfp_mask &= gfp_allowed_mask;
-- 
1.8.3.1


* [PATCH 5/5] mm: Insert some delay if ongoing memory allocation stalls.
  2014-11-23  4:49 [RFC PATCH 0/5] mm: Patches for mitigating memory allocation stalls Tetsuo Handa
                   ` (3 preceding siblings ...)
  2014-11-23  4:52 ` [PATCH 4/5] mm: Drop __GFP_WAIT flag when allocating from shrinker functions Tetsuo Handa
@ 2014-11-23  4:53 ` Tetsuo Handa
  2014-11-24 17:19   ` Michal Hocko
  2014-11-24 17:25 ` [RFC PATCH 0/5] mm: Patches for mitigating " Michal Hocko
  5 siblings, 1 reply; 20+ messages in thread
From: Tetsuo Handa @ 2014-11-23  4:53 UTC (permalink / raw)
  To: linux-mm

From 4fad86f7a653dbbaec3ba2389f74f97a6705a558 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sun, 23 Nov 2014 13:41:24 +0900
Subject: [PATCH 5/5] mm: Insert some delay if ongoing memory allocation stalls.

This patch introduces 1 ms of unkillable sleep before retrying when a
sleepable __alloc_pages_nodemask() call has been running for more than
5 seconds. According to Documentation/timers/timers-howto.txt, msleep()
with a value below 20 ms can sleep for up to 20 ms, but this should not
be a problem because msleep(1) is called only when there is no choice
but to retry.

This patch is intended for two purposes.

(1) Reduce CPU usage when a memory allocation deadlock has occurred, by
    avoiding a useless busy retry loop.

(2) Allow SysRq-w (or SysRq-t) to report how long each thread has been
    blocked in memory allocation.

  kworker/0:2     D ffff88007a2d8cf8     0    61      2 0x00000000
  MemAlloc: 69851 jiffies on 0x10
  Workqueue: events_freezable_power_ disk_events_workfn
   ffff88007a2e3898 0000000000000046 ffff88007a2e38f8 ffff88007a2d88d0
   0000000000013500 ffff88007a2e3fd8 0000000000013500 ffff88007a2d88d0
   ffff88007fffdb08 0000000100052ae5 ffff88007a2e38c8 ffffffff819d44c0
  Call Trace:
   [<ffffffff815951e4>] schedule+0x24/0x70
   [<ffffffff815982b1>] schedule_timeout+0x111/0x1a0
   [<ffffffff810b7470>] ? migrate_timer_list+0x60/0x60
   [<ffffffff810b778f>] msleep+0x2f/0x40
   [<ffffffff81110ecb>] __alloc_pages_nodemask+0x7eb/0xad0
   [<ffffffff81150dae>] alloc_pages_current+0x8e/0x100
   [<ffffffff81252156>] bio_copy_user_iov+0x1d6/0x380
   [<ffffffff8125474d>] ? blk_rq_init+0xed/0x160
   [<ffffffff81252399>] bio_copy_kern+0x49/0x100
   [<ffffffff8109a370>] ? prepare_to_wait_event+0x100/0x100
   [<ffffffff8125c0ef>] blk_rq_map_kern+0x6f/0x130
   [<ffffffff81159e1e>] ? kmem_cache_alloc+0x48e/0x4b0
   [<ffffffff8139c50f>] scsi_execute+0x12f/0x160
   [<ffffffff8139dd54>] scsi_execute_req_flags+0x84/0xf0
   [<ffffffffa01e19cc>] sr_check_events+0xbc/0x2e0 [sr_mod]
   [<ffffffff810912ac>] ? put_prev_entity+0x2c/0x3b0
   [<ffffffffa01d5177>] cdrom_check_events+0x17/0x30 [cdrom]
   [<ffffffffa01e1e5d>] sr_block_check_events+0x2d/0x30 [sr_mod]
   [<ffffffff81266236>] disk_check_events+0x56/0x1b0
   [<ffffffff812663a1>] disk_events_workfn+0x11/0x20
   [<ffffffff81076aef>] process_one_work+0x13f/0x370
   [<ffffffff81077ad9>] worker_thread+0x119/0x500
   [<ffffffff810779c0>] ? rescuer_thread+0x350/0x350
   [<ffffffff8107cbbc>] kthread+0xdc/0x100
   [<ffffffff8107cae0>] ? kthread_create_on_node+0x1b0/0x1b0
   [<ffffffff815995bc>] ret_from_fork+0x7c/0xb0
   [<ffffffff8107cae0>] ? kthread_create_on_node+0x1b0/0x1b0

  kworker/u16:28  D ffff8800793d0638     0  9950    346 0x00000080
  MemAlloc: 13014 jiffies on 0x250
   ffff880052777618 0000000000000046 ffff880052777678 ffff8800793d0210
   0000000000013500 ffff880052777fd8 0000000000013500 ffff8800793d0210
   ffff88007fffdb08 00000001000534b2 ffff880052777648 ffff88007c920000
  Call Trace:
   [<ffffffff815951e4>] schedule+0x24/0x70
   [<ffffffff815982b1>] schedule_timeout+0x111/0x1a0
   [<ffffffff810b7470>] ? migrate_timer_list+0x60/0x60
   [<ffffffff810b778f>] msleep+0x2f/0x40
   [<ffffffff81110ecb>] __alloc_pages_nodemask+0x7eb/0xad0
   [<ffffffff81150dae>] alloc_pages_current+0x8e/0x100
   [<ffffffffa0269f97>] xfs_buf_allocate_memory+0x168/0x247 [xfs]
   [<ffffffffa0235f62>] xfs_buf_get_map+0xd2/0x130 [xfs]
   [<ffffffffa0236534>] xfs_buf_read_map+0x24/0xc0 [xfs]
   [<ffffffffa025fdb9>] xfs_trans_read_buf_map+0x119/0x300 [xfs]
   [<ffffffffa022b9f9>] xfs_imap_to_bp+0x69/0xf0 [xfs]
   [<ffffffffa022bee9>] xfs_iread+0x79/0x410 [xfs]
   [<ffffffffa0251c8f>] ? kmem_zone_alloc+0x6f/0x100 [xfs]
   [<ffffffffa023d8ff>] xfs_iget+0x18f/0x530 [xfs]
   [<ffffffffa024589e>] xfs_lookup+0xae/0xd0 [xfs]
   [<ffffffffa0242cf3>] xfs_vn_lookup+0x73/0xc0 [xfs]
   [<ffffffff8117f1a8>] lookup_real+0x18/0x50
   [<ffffffff811848cc>] do_last+0x98c/0x1250
   [<ffffffff81180123>] ? inode_permission+0x13/0x40
   [<ffffffff81182699>] ? link_path_walk+0x79/0x850
   [<ffffffff81185253>] path_openat+0xc3/0x670
   [<ffffffff81186984>] do_filp_open+0x44/0xb0
   [<ffffffff81213991>] ? security_prepare_creds+0x11/0x20
   [<ffffffff8107e871>] ? prepare_creds+0xf1/0x1b0
   [<ffffffff8117c491>] do_open_exec+0x21/0xe0
   [<ffffffff8117d1eb>] do_execve_common.isra.27+0x1bb/0x5e0
   [<ffffffff8117d623>] do_execve+0x13/0x20
   [<ffffffff81073e56>] ____call_usermodehelper+0x126/0x1c0
   [<ffffffff81073ef0>] ? ____call_usermodehelper+0x1c0/0x1c0
   [<ffffffff81073f09>] call_helper+0x19/0x20
   [<ffffffff815995bc>] ret_from_fork+0x7c/0xb0
   [<ffffffff81073ef0>] ? ____call_usermodehelper+0x1c0/0x1c0

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/page_alloc.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c77418e..9e80b9f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -59,6 +59,7 @@
 #include <linux/page-debug-flags.h>
 #include <linux/hugetlb.h>
 #include <linux/sched/rt.h>
+#include <linux/delay.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -2738,6 +2739,12 @@ rebalance:
 					goto nopage;
 			}
 
+			/*
+			 * If wait == true and it is taking more than 5
+			 * seconds, sleep for 1ms for reducing CPU usage.
+			 */
+			if (time_after(jiffies, current->gfp_start + 5 * HZ))
+				msleep(1);
 			goto restart;
 		}
 	}
@@ -2748,6 +2755,12 @@ rebalance:
 						pages_reclaimed)) {
 		/* Wait for some write requests to complete then retry */
 		wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
+		/*
+		 * If wait == true and it is taking more than 5 seconds,
+		 * sleep for 1ms for reducing CPU usage.
+		 */
+		if (time_after(jiffies, current->gfp_start + 5 * HZ))
+			msleep(1);
 		goto rebalance;
 	} else {
 		/*
-- 
1.8.3.1


* Re: [PATCH 1/5] mm: Introduce OOM kill timeout.
  2014-11-23  4:50 ` [PATCH 1/5] mm: Introduce OOM kill timeout Tetsuo Handa
@ 2014-11-24 16:50   ` Michal Hocko
  2014-11-24 22:29     ` David Rientjes
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2014-11-24 16:50 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Sun 23-11-14 13:50:07, Tetsuo Handa wrote:
> From ca8b3ee4bea5bcc6f8ec5e8496a97fd4cab5a440 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Sun, 23 Nov 2014 13:38:53 +0900
> Subject: [PATCH 1/5] mm: Introduce OOM kill timeout.
> 
> Regarding many of Linux kernel versions (from unknown till now), any
> local user can give a certain type of memory pressure which causes
> __alloc_pages_nodemask() to keep trying to reclaim memory for presumably
> forever.

Retrying forever might be intentional (see __GFP_NOFAIL).
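
For example, a caller that passes __GFP_NOFAIL explicitly asks the allocator
never to give up; a minimal sketch of such a call site:

  struct page *page;

  /* Sketch: the allocator must retry indefinitely rather than return NULL. */
  page = alloc_page(GFP_NOFS | __GFP_NOFAIL);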

> As a consequence, such user can disturb any users' activities
> by keeping the system stalled with 0% or 100% CPU usage.

But the above doesn't make much sense to me. Sure, reclaim can burn a lot
of CPU cycles, but most direct reclaimers are simply stuck waiting for
something (congestion_wait() or similar).

> On systems where XFS is used, SysRq-f (forced OOM killer) may become
> unresponsive because kernel worker thread which is supposed to process
> SysRq-f request is blocked by previous request's GFP_WAIT allocation.

How is XFS relevant here? Besides, the workqueue code has a fallback mode,
the rescuer thread, which processes work items that cannot be handled by
the worker threads because new workers cannot be created due to allocation
failures. Using workqueues for sysrq-triggered OOM is quite suboptimal,
but that should be handled in the sysrq layer.

> The problem described above is one of phenomena which is triggered by
> a vulnerability which exists since (if I didn't miss something)
> Linux 2.0 (18 years ago). However, it is too difficult to backport
> patches which fix the vulnerability.

What is the vulnerability?

> Setting TIF_MEMDIE to SIGKILL'ed and/or PF_EXITING thread disables
> the OOM killer. But the TIF_MEMDIE thread may not be able to terminate
> within reasonable duration for some reason. Therefore, in order to avoid
> keeping the OOM killer disabled forever, this patch introduces 5 seconds
> timeout for TIF_MEMDIE threads which are supposed to terminate shortly.

I really do not like this. The timeout sounds arbitrarily random. Besides,
how would it solve the problem? We would go after another task which might
be blocked on the very same lock. How far should we go? What happens when
all of them wake up and consume all the memory on the way out because they
now have access to the memory reserves?

Also have you actually seen something like that happening?

We had a similar problem in the memory cgroup controller because the OOM
was handled in the allocation path, which might sit on many locks and had
to wait for the victim. So waiting for the OOM victim to finish would
simply deadlock if the killed task was stuck on any of the locks held by
the memcg OOM killer. But this is not the case anymore (we are processing
memcg OOM from the fault path).

The global OOM killer didn't have this kind of problem because the OOM
killer doesn't wait for the victim to finish. If the victim waits for
something else that cannot make any progress because memory is short,
then I would call that a bug which should be fixed properly rather than
papered over.

The OOM killer code is already quite complex and subtle, so I really do not
think we should be adding ad-hoc heuristics without really good reasons and
without all other options having been found not viable. I do not see any
real-life problem stated here and, what is worse, the changelog is
misleading in several ways. So NAK to this patch.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/5] mm: Kill shrinker's global semaphore.
  2014-11-23  4:50 ` [PATCH 2/5] mm: Kill shrinker's global semaphore Tetsuo Handa
@ 2014-11-24 16:55   ` Michal Hocko
  0 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2014-11-24 16:55 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Sun 23-11-14 13:50:50, Tetsuo Handa wrote:
> From 92aec48e3b2e21c3716654670a24890f34c58683 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Sun, 23 Nov 2014 13:39:25 +0900
> Subject: [PATCH 2/5] mm: Kill shrinker's global semaphore.
> 
> Currently register_shrinker()/unregister_shrinker() calls down_write()
> while shrink_slab() calls down_read_trylock().

> This implies that the OOM killer becomes disabled because
> shrink_slab() pretends "we reclaimed some slab memory" even
> if "no slab memory can be reclaimed" when somebody calls
> register_shrinker()/unregister_shrinker() while one of shrinker
> functions allocates memory and/or holds mutex which may take
> unpredictably long duration to complete.

Which workload would be so slab-heavy that this would matter?

Other than that, I thought {un}register_shrinker() are really unlikely
paths, called during initialization and tear-down, which usually do not
happen during OOM conditions.

> This patch replaces global semaphore with per a shrinker refcounter
> so that shrink_slab() can respond "we could not reclaim slab memory"
> when out_of_memory() needs to be called.
> 
> Before this patch, response time of addition/removal are unpredictable
> when one of shrinkers are in use by shrink_slab(), nearly 0 otherwise.
> 
> After this patch, response time of addition is nearly 0. Response time of
> removal remains unpredictable when the shrinker to remove is in use by
> shrink_slab(), nearly two RCU grace periods otherwise.

I cannot judge the patch itself as this is outside my area, but is the
complexity worth it? I think the OOM argument is bogus because, in my
experience, slab usually doesn't dominate memory consumption.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 3/5] mm: Remember ongoing memory allocation status.
  2014-11-23  4:51 ` [PATCH 3/5] mm: Remember ongoing memory allocation status Tetsuo Handa
@ 2014-11-24 17:01   ` Michal Hocko
  0 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2014-11-24 17:01 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Sun 23-11-14 13:51:31, Tetsuo Handa wrote:
> From 0c6d4e0ac9fc5964fdd09849c99e4f6497b7a37e Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Sun, 23 Nov 2014 13:40:20 +0900
> Subject: [PATCH 3/5] mm: Remember ongoing memory allocation status.
> 
> When a stall by memory allocation problem occurs, printing how long
> a thread was blocked for memory allocation will be useful.

Why are tracepoints not suitable for this kind of debugging?

> This patch allows remembering how many jiffies was spent for ongoing
> __alloc_pages_nodemask() and reading it by printing backtrace and by
> analyzing vmcore.

__alloc_pages_nodemask() is an allocation hot path, and it is not really
acceptable to add debugging code there which will have only very limited
use.

[...]
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 4/5] mm: Drop __GFP_WAIT flag when allocating from shrinker functions.
  2014-11-23  4:52 ` [PATCH 4/5] mm: Drop __GFP_WAIT flag when allocating from shrinker functions Tetsuo Handa
@ 2014-11-24 17:14   ` Michal Hocko
  0 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2014-11-24 17:14 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Sun 23-11-14 13:52:48, Tetsuo Handa wrote:
[...]
> This patch drops __GFP_WAIT flag when allocating from shrinker functions
> so that recursive __alloc_pages_nodemask() calls will not cause troubles
> like recursive locks and/or unpredictable sleep. The comments in this patch
> suggest shrinker functions users to try to avoid use of sleepable locks
> and memory allocations from shrinker functions, as with TTM driver's
> shrinker functions.

Again, you are just papering over potential bugs. Those bugs should be
identified and fixed _properly_ (e.g. by not calling kmalloc in the bug
referenced in your changelog) rather than by dropping gfp flags behind
the requester's back.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 5/5] mm: Insert some delay if ongoing memory allocation stalls.
  2014-11-23  4:53 ` [PATCH 5/5] mm: Insert some delay if ongoing memory allocation stalls Tetsuo Handa
@ 2014-11-24 17:19   ` Michal Hocko
  0 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2014-11-24 17:19 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Sun 23-11-14 13:53:41, Tetsuo Handa wrote:
> From 4fad86f7a653dbbaec3ba2389f74f97a6705a558 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Sun, 23 Nov 2014 13:41:24 +0900
> Subject: [PATCH 5/5] mm: Insert some delay if ongoing memory allocation stalls.
> 
> This patch introduces 1ms of unkillable sleep before retrying when
> sleepable __alloc_pages_nodemask() is taking more than 5 seconds.
> According to Documentation/timers/timers-howto.txt, msleep < 20ms
> can sleep for up to 20ms, but this should not be a problem because
> msleep(1) is called only when there is no choice but retrying.
> 
> This patch is intended for two purposes.
> 
> (1) Reduce CPU usage when memory allocation deadlock occurred, by
>     avoiding useless busy retry loop.
> 
> (2) Allow SysRq-w (or SysRq-t) to report how long each thread is
>     blocked for memory allocation.

Neither makes any sense to me whatsoever. If there is a deadlock then we
cannot consume CPU, as the deadlocked tasks are _blocked_. I guess you
meant livelocked, but even then, how does a random timeout help?

Why would a timeout help sysrq to proceed?
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 0/5] mm: Patches for mitigating memory allocation stalls.
  2014-11-23  4:49 [RFC PATCH 0/5] mm: Patches for mitigating memory allocation stalls Tetsuo Handa
                   ` (4 preceding siblings ...)
  2014-11-23  4:53 ` [PATCH 5/5] mm: Insert some delay if ongoing memory allocation stalls Tetsuo Handa
@ 2014-11-24 17:25 ` Michal Hocko
  5 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2014-11-24 17:25 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Sun 23-11-14 13:49:27, Tetsuo Handa wrote:
[...]
>       I reported this vulnerability last year and a CVE number was assigned,
>       but no progress has been made. If a malicious local user notices a
>       patchset that mitigates/fixes this vulnerability, the user is free to
>       attack existing Linux systems. Therefore, I propose this patchset before
>       any patchset that mitigates/fixes this vulnerability is proposed.

I have looked at the patches and I do not believe they address anything.
They seem like random, ad-hoc hacks which pretend to solve a class of
problems but in fact only paper over potentially real ones.
[...]
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 1/5] mm: Introduce OOM kill timeout.
  2014-11-24 16:50   ` Michal Hocko
@ 2014-11-24 22:29     ` David Rientjes
  2014-11-25 10:38       ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: David Rientjes @ 2014-11-24 22:29 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tetsuo Handa, linux-mm

On Mon, 24 Nov 2014, Michal Hocko wrote:

> > The problem described above is one of phenomena which is triggered by
> > a vulnerability which exists since (if I didn't miss something)
> > Linux 2.0 (18 years ago). However, it is too difficult to backport
> > patches which fix the vulnerability.
> 
> What is the vulnerability?
> 

There have historically been issues when oom killed processes fail to 
exit, so this is probably trying to address one of those issues.

The most notable example is when an oom killed process is waiting on a 
lock that is held by another thread that is trying to allocate memory and 
looping indefinitely since reclaim fails and the oom killer keeps finding 
the oom killed process waiting to exit.  This is a consequence of the page 
allocator looping forever for small order allocations.  Memcg oom kills 
typically see this much more often when you do complete kmem accounting: 
any combination of mutex + kmalloc(GFP_KERNEL) becomes a potential 
livelock.  For the system oom killer, I would imagine this would be 
difficult to trigger since it would require a process holding the mutex to 
never be able to allocate memory.
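
A sketch of the mutex + kmalloc(GFP_KERNEL) pattern described above (all
names are hypothetical):

  static DEFINE_MUTEX(foo_lock);

  /* Task A: the OOM victim, stuck waiting for the lock. */
  static void task_a(void)
  {
          mutex_lock(&foo_lock);          /* sleeps, so it cannot exit */
          do_something();                 /* hypothetical */
          mutex_unlock(&foo_lock);
  }

  /* Task B: holds the lock while allocating. */
  static void task_b(void)
  {
          void *p;

          mutex_lock(&foo_lock);
          /*
           * A small-order GFP_KERNEL allocation does not fail: the
           * allocator keeps retrying, and the OOM killer keeps skipping
           * new victims because the TIF_MEMDIE victim (task A) never
           * gets to exit.
           */
          p = kmalloc(128, GFP_KERNEL);
          kfree(p);
          mutex_unlock(&foo_lock);
  }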

The oom killer timeout is always an attractive remedy to this situation 
and gets proposed quite often.  Several problems: (1) you can needlessly 
panic the machine because no other processes are eligible for oom kill 
after declaring that the first oom kill victim cannot make progress, (2) 
it can lead to unnecessary oom killing if the oom kill victim can exit but 
hasn't been scheduled or is in the process of exiting, (3) you can easily 
turn the oom killer into a serial oom killer since there's no guarantee 
the next process that is chosen won't be affected by the same problem, and 
(4) this doesn't fix the problem if an oom disabled process is wedged 
trying to allocate memory while holding a mutex that others are waiting 
on.

The general approach has always been to fix the actual issue in whatever 
code is causing the wedge.  We lack specific examples in this changelog 
and I agree that it seems to be papering over issues that could otherwise 
be fixed, so I agree with your NACK.

> We had a kind of similar problem in Memory cgroup controller because the
> OOM was handled in the allocation path which might sit on many locks and
> had to wait for the victim . So waiting for OOM victim to finish would
> simply deadlock if the killed task was stuck on any of the locks held by
> memcg OOM killer. But this is not the case anymore (we are processing
> memcg OOM from the fault path).
> 

I'm painfully aware of it happening with complete kmem accounting, however 
:)  I'm sure you can imagine the scenario that it causes, and unfortunately 
our complete support isn't upstream, so there's no code that I can point 
to.


* Re: [PATCH 1/5] mm: Introduce OOM kill timeout.
  2014-11-24 22:29     ` David Rientjes
@ 2014-11-25 10:38       ` Michal Hocko
  2014-11-25 12:54         ` Tetsuo Handa
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2014-11-25 10:38 UTC (permalink / raw)
  To: David Rientjes; +Cc: Tetsuo Handa, linux-mm

On Mon 24-11-14 14:29:00, David Rientjes wrote:
> On Mon, 24 Nov 2014, Michal Hocko wrote:
> 
> > > The problem described above is one of phenomena which is triggered by
> > > a vulnerability which exists since (if I didn't miss something)
> > > Linux 2.0 (18 years ago). However, it is too difficult to backport
> > > patches which fix the vulnerability.
> > 
> > What is the vulnerability?
> > 
> 
> There have historically been issues when oom killed processes fail to 
> exit, so this is probably trying to address one of those issues.

Let me clarify. The patch is sold as a security fix. In that context, a
vulnerability means behavior which might be abused by a user. I was merely
interested in whether there are known scenarios which would turn a
potential OOM killer deadlock into an exploitable bug. The changelog was
rather unclear about that, and rather strong in claiming that any user
might trigger an OOM deadlock.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 1/5] mm: Introduce OOM kill timeout.
  2014-11-25 10:38       ` Michal Hocko
@ 2014-11-25 12:54         ` Tetsuo Handa
  2014-11-25 13:45           ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Tetsuo Handa @ 2014-11-25 12:54 UTC (permalink / raw)
  To: mhocko, rientjes; +Cc: linux-mm

Michal Hocko wrote:
> On Mon 24-11-14 14:29:00, David Rientjes wrote:
> > On Mon, 24 Nov 2014, Michal Hocko wrote:
> > 
> > > > The problem described above is one of phenomena which is triggered by
> > > > a vulnerability which exists since (if I didn't miss something)
> > > > Linux 2.0 (18 years ago). However, it is too difficult to backport
> > > > patches which fix the vulnerability.
> > > 
> > > What is the vulnerability?
> > > 
> > 
> > There have historically been issues when oom killed processes fail to 
> > exit, so this is probably trying to address one of those issues.

Exactly.

> 
> Let me clarify. The patch is sold as a security fix. In that context
> vulnerability means a behavior which might be abused by a user. I was
> merely interested whether there are some known scenarios which would
> turn a potential OOM killer deadlock into an exploitable bug. The
> changelog was rather unclear about it and rather strong in claims that
> any user might trigger OOM deadlock.

Well, both of you are on the CC: list of my mail from Thu, 26 Jun 2014
21:02:36 +0900, which includes a reproducer program.

Please prepare two VMs, one with XFS and one without XFS. Compile and run
the reproducer program as a local unprivileged user and see what happens.
You will see stalled traces like those cited in this patchset.


* Re: [PATCH 1/5] mm: Introduce OOM kill timeout.
  2014-11-25 12:54         ` Tetsuo Handa
@ 2014-11-25 13:45           ` Michal Hocko
  2014-11-26 11:58             ` Tetsuo Handa
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2014-11-25 13:45 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: rientjes, linux-mm

On Tue 25-11-14 21:54:23, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > Let me clarify. The patch is sold as a security fix. In that context
> > vulnerability means a behavior which might be abused by a user. I was
> > merely interested whether there are some known scenarios which would
> > turn a potential OOM killer deadlock into an exploitable bug. The
> > changelog was rather unclear about it and rather strong in claims that
> > any user might trigger OOM deadlock.
> 
> Well, both of you are in the CC: list of my mail which includes a reproducer
> program which I sent on Thu, 26 Jun 2014 21:02:36 +0900.

OK, I found the emails. There were more issues mentioned there. The one
below is from 24 Apr.
[   42.904325] Out of memory: Kill process 316 (firewalld) score 29 or sacrifice child
[   42.908797] Killed process 316 (firewalld) total-vm:327624kB, anon-rss:14900kB, file-rss:4kB
[   46.137191] SysRq : Changing Loglevel
[   46.138143] Loglevel set to 9
[   72.028990] SysRq : Show State
[...]
[   72.029945] systemd         R  running task        0     1      0 0x00000000
[   72.029945]  ffff88001efbb908 0000000000000086 ffff88001efbbfd8 0000000000014580
[   72.029945]  ffff88001efbbfd8 0000000000014580 ffff88001dc18000 ffff88001efba000
[   72.029945]  ffff88001efbba60 ffffffff8194eaa8 0000000000000034 0000000000000000
[   72.029945] Call Trace:
[   72.029945]  [<ffffffff81094ed6>] __cond_resched+0x26/0x30
[   72.029945]  [<ffffffff815f1cba>] _cond_resched+0x3a/0x50
[   72.029945]  [<ffffffff811538ec>] shrink_slab+0x1dc/0x300
[   72.029945]  [<ffffffff811a9721>] ? vmpressure+0x21/0x90
[   72.029945]  [<ffffffff81156982>] do_try_to_free_pages+0x3c2/0x4e0
[   72.029945]  [<ffffffff81156b9c>] try_to_free_pages+0xfc/0x180
[   72.029945]  [<ffffffff8114b2ce>] __alloc_pages_nodemask+0x75e/0xb10
[   72.029945]  [<ffffffff81188689>] alloc_pages_current+0xa9/0x170
[   72.029945]  [<ffffffff811419f7>] __page_cache_alloc+0x87/0xb0
[   72.029945]  [<ffffffff81143d48>] filemap_fault+0x188/0x430
[   72.029945]  [<ffffffff811682ce>] __do_fault+0x7e/0x520
[   72.029945]  [<ffffffff8116c615>] handle_mm_fault+0x3e5/0xd90
[   72.029945]  [<ffffffff810712d6>] ? dequeue_signal+0x86/0x180
[   72.029945]  [<ffffffff811f76e4>] ? ep_send_events_proc+0x174/0x1d0
[   72.029945]  [<ffffffff811f983c>] ? signalfd_copyinfo+0x1c/0x250
[   72.029945]  [<ffffffff815f7886>] __do_page_fault+0x156/0x540
[   72.029945]  [<ffffffff815f7c8a>] do_page_fault+0x1a/0x70
[   72.029945]  [<ffffffff811b03e8>] ? SyS_read+0x58/0xb0
[   72.029945]  [<ffffffff815f3ec8>] page_fault+0x28/0x30
[...]
[   72.029945] firewalld       x ffff88001fc14580     0   316      1 0x00100084
[   72.029945]  ffff88001cc1bcd0 0000000000000046 ffff88001cc1bfd8 0000000000014580
[   72.029945]  ffff88001cc1bfd8 0000000000014580 ffff88001b3cdb00 ffff88001b3ce300
[   72.029945]  ffff88001cc1b858 ffff88001cc1b858 ffff88001b3cdaf0 ffff88001b3cdb00
[   72.029945] Call Trace:
[   72.029945]  [<ffffffff815f18b9>] schedule+0x29/0x70
[   72.029945]  [<ffffffff81064207>] do_exit+0x6e7/0xa60
[   72.029945]  [<ffffffff811c3a40>] ? poll_select_copy_remaining+0x150/0x150
[   72.029945]  [<ffffffff810645ff>] do_group_exit+0x3f/0xa0
[   72.029945]  [<ffffffff81074000>] get_signal_to_deliver+0x1d0/0x6e0
[   72.029945]  [<ffffffff81012437>] do_signal+0x57/0x600
[   72.029945]  [<ffffffff811fb457>] ? eventfd_ctx_read+0x67/0x260
[   72.029945]  [<ffffffff81012a49>] do_notify_resume+0x69/0xb0
[   72.029945]  [<ffffffff815fcad2>] int_signal+0x12/0x17

So the task has been killed and it is waiting for parent to handle its
signal but that is blocked on memory allocation. The OOM victim is
TASK_DEAD so it has already passed exit_mm and should have released its
memory and it has dropped TIF_MEMDIE so it is ignored by OOM killer. It
is still holding some resources but those should be restricted and
shouldn't keep OOM condition normally.

The OOM report was not complete so it is hard to say why the OOM
condition wasn't resolved by the OOM killer but other OOM report you
have posted (26 Apr) in that thread suggested that the system doesn't
have any swap and the page cache is full of shmem. The process list
didn't contain any large memory consumer so killing somebody wouldn't
help much. But the OOM victim died normally in that case:
[  945.823514] kworker/u64:0 invoked oom-killer: gfp_mask=0x2000d0, order=2, oom_score_adj=0
[...]
[  945.907809] active_anon:1743 inactive_anon:24451 isolated_anon:0
[  945.907809]  active_file:49 inactive_file:215 isolated_file:0
[  945.907809]  unevictable:0 dirty:0 writeback:0 unstable:0
[  945.907809]  free:13233 slab_reclaimable:3264 slab_unreclaimable:6369
[  945.907809]  mapped:27 shmem:24795 pagetables:177 bounce:0
[  945.907809]  free_cma:0
[...]
[  945.959966] 25060 total pagecache pages
[  945.961567] 0 pages in swap cache
[  945.963053] Swap cache stats: add 0, delete 0, find 0/0
[  945.964930] Free swap  = 0kB
[  945.966324] Total swap = 0kB
[  945.967717] 524158 pages RAM
[  945.969103] 0 pages HighMem/MovableOnly
[  945.970692] 12583 pages reserved
[  945.972144] 0 pages hwpoisoned
[  945.973564] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[  945.976012] [  464]     0   464    10364      248      22        0         -1000 systemd-udevd
[  945.978636] [  554]     0   554    12791      119      25        0         -1000 auditd
[  945.981118] [  661]    81   661     6850      276      19        0          -900 dbus-daemon
[  945.983689] [ 1409]     0  1409    20740      210      43        0         -1000 sshd
[  945.986124] [ 9393]     0  9393    27502       33      12        0             0 agetty
[  945.988611] [ 9641]  1000  9641     1042       21       7        0             0 a.out
[  945.991059] Out of memory: Kill process 9393 (agetty) score 0 or sacrifice child
[...]
[ 1048.924249] SysRq : Changing Loglevel
[ 1048.926059] Loglevel set to 9
[ 1050.892055] SysRq : Show State

Pid 9393 is not present in the following list.

So I really do not see any real issue here. Btw. it would be really
helpful if this was in the changelog (without the reproducer if you really
believe it could be abused).
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 1/5] mm: Introduce OOM kill timeout.
  2014-11-25 13:45           ` Michal Hocko
@ 2014-11-26 11:58             ` Tetsuo Handa
  2014-11-26 18:43               ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Tetsuo Handa @ 2014-11-26 11:58 UTC (permalink / raw)
  To: mhocko, rientjes; +Cc: linux-mm

Michal Hocko wrote:
> On Tue 25-11-14 21:54:23, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > Let me clarify. The patch is sold as a security fix. In that context
> > > vulnerability means a behavior which might be abused by a user. I was
> > > merely interested whether there are some known scenarios which would
> > > turn a potential OOM killer deadlock into an exploitable bug. The
> > > changelog was rather unclear about it and rather strong in claims that
> > > any user might trigger OOM deadlock.
> > 
> > Well, both of you are in the CC: list of my mail which includes a reproducer
> > program which I sent on Thu, 26 Jun 2014 21:02:36 +0900.
> 
> OK, found the emails. There were more issues mentioned there. The below
> one is from 24 Apr.

I posted various traces in that thread.

> So the task has been killed and it is waiting for parent to handle its
> signal but that is blocked on memory allocation. The OOM victim is
> TASK_DEAD so it has already passed exit_mm and should have released its
> memory and it has dropped TIF_MEMDIE so it is ignored by OOM killer. It
> is still holding some resources but those should be restricted and
> shouldn't keep OOM condition normally.

Here is an example trace of 3.10.0-121.el7-test. Two of the OOM-killed processes
are inside task_work_run() from do_exit() and got stuck in memory allocation.
Processes past exit_mm() in do_exit() contribute to the OOM deadlock.
(Am I using the wrong word? Should I say livelock rather than deadlock?)

[  234.118200] in:imjournal    R  running task        0   672      1 0x0000008e
[  234.120266]  0000000000009bf4 ffffffff81dc2ee0 ffffffffffffff10 ffffffff815dde57
[  234.122488]  0000000000000010 0000000000000202 ffff8800346593a0 0000000000000000
[  234.124712]  ffff8800346593b8 ffffffff811ae769 0000000000000000 ffff8800346593f0
[  234.126956] Call Trace:
[  234.128030]  [<ffffffff815dde57>] ? _raw_spin_lock+0x37/0x50
[  234.129765]  [<ffffffff811ae769>] ? put_super+0x19/0x40
[  234.131410]  [<ffffffff811af8d4>] ? prune_super+0x144/0x1a0
[  234.133125]  [<ffffffff8115103b>] ? shrink_slab+0xab/0x300
[  234.134838]  [<ffffffff811a5ae1>] ? vmpressure+0x21/0x90
[  234.136502]  [<ffffffff81154192>] ? do_try_to_free_pages+0x3c2/0x4e0
[  234.138370]  [<ffffffff811543ac>] ? try_to_free_pages+0xfc/0x180
[  234.140178]  [<ffffffff81148b4e>] ? __alloc_pages_nodemask+0x75e/0xb10
[  234.142090]  [<ffffffff811855a9>] ? alloc_pages_current+0xa9/0x170
[  234.143972]  [<ffffffffa0211b11>] ? xfs_buf_allocate_memory+0x16d/0x24a [xfs]
[  234.146068]  [<ffffffffa01a23b5>] ? xfs_buf_get_map+0x125/0x180 [xfs]
[  234.148008]  [<ffffffffa01a2d4c>] ? xfs_buf_read_map+0x2c/0x140 [xfs]
[  234.149933]  [<ffffffffa0206089>] ? xfs_trans_read_buf_map+0x2d9/0x4a0 [xfs]
[  234.151977]  [<ffffffffa01d3698>] ? xfs_btree_read_buf_block.isra.18.constprop.29+0x78/0xc0 [xfs]
[  234.154399]  [<ffffffffa01a2dfa>] ? xfs_buf_read_map+0xda/0x140 [xfs]
[  234.156330]  [<ffffffffa01d3760>] ? xfs_btree_lookup_get_block+0x80/0x100 [xfs]
[  234.158438]  [<ffffffffa01d78e7>] ? xfs_btree_lookup+0xd7/0x4b0 [xfs]
[  234.160362]  [<ffffffffa01bba0b>] ? xfs_alloc_lookup_eq+0x1b/0x20 [xfs]
[  234.162318]  [<ffffffffa01be52e>] ? xfs_free_ag_extent+0x30e/0x750 [xfs]
[  234.164286]  [<ffffffffa01bfa65>] ? xfs_free_extent+0xe5/0x120 [xfs]
[  234.166187]  [<ffffffffa019eb2f>] ? xfs_bmap_finish+0x15f/0x1b0 [xfs]
[  234.168101]  [<ffffffffa01ef5ed>] ? xfs_itruncate_extents+0x17d/0x2b0 [xfs]
[  234.170112]  [<ffffffffa019fa0e>] ? xfs_free_eofblocks+0x1ee/0x270 [xfs]
[  234.172081]  [<ffffffffa01ef97b>] ? xfs_release+0x13b/0x1e0 [xfs]
[  234.173915]  [<ffffffffa01a6425>] ? xfs_file_release+0x15/0x20 [xfs]
[  234.175807]  [<ffffffff811ad7a9>] ? __fput+0xe9/0x270
[  234.177413]  [<ffffffff811ada7e>] ? ____fput+0xe/0x10
[  234.179017]  [<ffffffff81082404>] ? task_work_run+0xc4/0xe0
[  234.180778]  [<ffffffff81063ddb>] ? do_exit+0x2cb/0xa60
[  234.182433]  [<ffffffff81094ebd>] ? ttwu_do_activate.constprop.87+0x5d/0x70
[  234.184438]  [<ffffffff81097506>] ? try_to_wake_up+0x1b6/0x280
[  234.186196]  [<ffffffff810645ef>] ? do_group_exit+0x3f/0xa0
[  234.187887]  [<ffffffff81073ff0>] ? get_signal_to_deliver+0x1d0/0x6e0
[  234.189773]  [<ffffffff81012437>] ? do_signal+0x57/0x600
[  234.191423]  [<ffffffff81086ae0>] ? wake_up_bit+0x30/0x30
[  234.193085]  [<ffffffff81012a41>] ? do_notify_resume+0x61/0xb0
[  234.194840]  [<ffffffff815e7152>] ? int_signal+0x12/0x17

[  234.221720] abrt-watch-log  D ffff88007fa54540     0   587      1 0x00100086
[  234.223804]  ffff88007be65a98 0000000000000046 ffff88007be65fd8 0000000000014540
[  234.226018]  ffff88007be65fd8 0000000000014540 ffff880076fa71c0 ffff88007acf71c0
[  234.228229]  ffff88007acf71c0 ffff8800757ee090 fffffffeffffffff ffff8800757ee098
[  234.230453] Call Trace:
[  234.231509]  [<ffffffff815dbf29>] schedule+0x29/0x70
[  234.233091]  [<ffffffff815dda45>] rwsem_down_read_failed+0xf5/0x165
[  234.234964]  [<ffffffffa019f8d2>] ? xfs_free_eofblocks+0xb2/0x270 [xfs]
[  234.236890]  [<ffffffff812c27b4>] call_rwsem_down_read_failed+0x14/0x30
[  234.238824]  [<ffffffff815db300>] ? down_read+0x20/0x30
[  234.240505]  [<ffffffffa01ecfcc>] xfs_ilock+0xbc/0xe0 [xfs]
[  234.242221]  [<ffffffffa019f8d2>] xfs_free_eofblocks+0xb2/0x270 [xfs]
[  234.244114]  [<ffffffff81190c22>] ? kmem_cache_free+0x1b2/0x1d0
[  234.245891]  [<ffffffff811c1a1f>] ? __d_free+0x3f/0x60
[  234.247519]  [<ffffffffa01ef97b>] xfs_release+0x13b/0x1e0 [xfs]
[  234.249300]  [<ffffffffa01a6425>] xfs_file_release+0x15/0x20 [xfs]
[  234.251141]  [<ffffffff811ad7a9>] __fput+0xe9/0x270
[  234.252699]  [<ffffffff811ada7e>] ____fput+0xe/0x10
[  234.254254]  [<ffffffff81082404>] task_work_run+0xc4/0xe0
[  234.255958]  [<ffffffff81063ddb>] do_exit+0x2cb/0xa60
[  234.257552]  [<ffffffff811656ee>] ? __do_fault+0x7e/0x520
[  234.259213]  [<ffffffff810645ef>] do_group_exit+0x3f/0xa0
[  234.260877]  [<ffffffff81073ff0>] get_signal_to_deliver+0x1d0/0x6e0
[  234.262723]  [<ffffffff81012437>] do_signal+0x57/0x600
[  234.264398]  [<ffffffff8108a7ed>] ? hrtimer_nanosleep+0xad/0x170
[  234.266199]  [<ffffffff81089780>] ? hrtimer_get_res+0x50/0x50
[  234.267935]  [<ffffffff81012a41>] do_notify_resume+0x61/0xb0
[  234.269665]  [<ffffffff815de33c>] retint_signal+0x48/0x8c

> The OOM report was not complete so it is hard to say why the OOM
> condition wasn't resolved by the OOM killer but other OOM report you
> have posted (26 Apr) in that thread suggested that the system doesn't
> have any swap and the page cache is full of shmem. The process list
> didn't contain any large memory consumer so killing somebody wouldn't
> help much. But the OOM victim died normally in that case:

The problem is that a.out invoked by a local unprivileged user is the only
and the biggest memory consumer which the OOM killer thinks the least memory
consumer. Killing a.out would solve the OOM condition, but the OOM killer
instead waits forever for OOM-killable processes other than a.out, while
those processes (including ones already OOM-killed) cannot resume their
memory allocations until a.out is killed.

And here is an example of stalled traces with and without swap space.

  https://lkml.org/lkml/2014/7/2/249

A 0x10 allocation and a 0x250 allocation spun for 10 minutes
(and I gave up waiting) when no swap partition was available.
A 0x2000d0 allocation slept for more than 20 minutes (and I gave
up waiting) when a swap partition was available.

There are many processes running but there is no load except a.out when
the OOM killer is triggered for the first time. The OOM killer should have
OOM-killed a.out rather than waiting forever for unkillable OOM-killed
processes.


* Re: [PATCH 1/5] mm: Introduce OOM kill timeout.
  2014-11-26 11:58             ` Tetsuo Handa
@ 2014-11-26 18:43               ` Michal Hocko
  2014-11-27 14:49                 ` Tetsuo Handa
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2014-11-26 18:43 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: rientjes, linux-mm

On Wed 26-11-14 20:58:52, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 25-11-14 21:54:23, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > [...]
> > > > Let me clarify. The patch is sold as a security fix. In that context
> > > > vulnerability means a behavior which might be abused by a user. I was
> > > > merely interested whether there are some known scenarios which would
> > > > turn a potential OOM killer deadlock into an exploitable bug. The
> > > > changelog was rather unclear about it and rather strong in claims that
> > > > any user might trigger OOM deadlock.
> > > 
> > > Well, both of you are in the CC: list of my mail which includes a reproducer
> > > program which I sent on Thu, 26 Jun 2014 21:02:36 +0900.
> > 
> > OK, found the emails. There were more issues mentioned there. The below
> > one is from 24 Apr.
> 
> I posted various traces in that thread.
> 
> > So the task has been killed and it is waiting for parent to handle its
> > signal but that is blocked on memory allocation. The OOM victim is
> > TASK_DEAD so it has already passed exit_mm and should have released its
> > memory and it has dropped TIF_MEMDIE so it is ignored by OOM killer. It
> > is still holding some resources but those should be restricted and
> > shouldn't keep OOM condition normally.
> 
> Here is an example trace of 3.10.0-121.el7-test. Two of the OOM-killed processes
> are inside task_work_run() from do_exit() and got stuck in memory allocation.
> Processes past exit_mm() in do_exit() contribute to the OOM deadlock.

If the OOM victim passed exit_mm then it is usually not interesting for
the OOM killer as it has already unmapped and freed its memory (assuming
that mm_users is not elevated). It also doesn't have TIF_MEMDIE anymore
so it doesn't block OOM killer from killing other tasks.

> (Am I using the wrong word? Should I say livelock rather than deadlock?)

This depends on the situation. If the OOM victim were blocked on a
lock held by another task which cannot proceed with its allocation, then it
would be a deadlock, while permanent retries because of a memory shortage
would be closer to a livelock.
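
As an editor-added illustration (not code from the reproducer or from the kernel), a minimal sketch of the deadlock case could look like the following: one task allocates with a __GFP_WAIT allocation while holding a lock that the OOM victim must take on its exit path before it can free its memory. All identifiers here are invented.

#include <linux/mutex.h>
#include <linux/slab.h>

static DEFINE_MUTEX(shared_lock);

/* Task A: allocates with GFP_KERNEL (__GFP_WAIT) while holding the lock. */
static void task_a_allocates_under_lock(void)
{
	void *buf;

	mutex_lock(&shared_lock);
	buf = kmalloc(4096, GFP_KERNEL);	/* reclaim may retry here for a very long time */
	kfree(buf);
	mutex_unlock(&shared_lock);
}

/* OOM victim: needs the same lock on its exit path before it can free memory. */
static void victim_exit_path(void)
{
	mutex_lock(&shared_lock);
	/* ... release per-task resources ... */
	mutex_unlock(&shared_lock);
}

If A's allocation cannot succeed until the victim's memory is freed, and the victim cannot free memory until A drops the lock, neither side makes progress; that is the deadlock case, as opposed to mere endless retrying.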

> [  234.118200] in:imjournal    R  running task        0   672      1 0x0000008e
> [  234.120266]  0000000000009bf4 ffffffff81dc2ee0 ffffffffffffff10 ffffffff815dde57
> [  234.122488]  0000000000000010 0000000000000202 ffff8800346593a0 0000000000000000
> [  234.124712]  ffff8800346593b8 ffffffff811ae769 0000000000000000 ffff8800346593f0
> [  234.126956] Call Trace:
> [  234.128030]  [<ffffffff815dde57>] ? _raw_spin_lock+0x37/0x50
> [  234.129765]  [<ffffffff811ae769>] ? put_super+0x19/0x40
> [  234.131410]  [<ffffffff811af8d4>] ? prune_super+0x144/0x1a0
> [  234.133125]  [<ffffffff8115103b>] ? shrink_slab+0xab/0x300
> [  234.134838]  [<ffffffff811a5ae1>] ? vmpressure+0x21/0x90
> [  234.136502]  [<ffffffff81154192>] ? do_try_to_free_pages+0x3c2/0x4e0
> [  234.138370]  [<ffffffff811543ac>] ? try_to_free_pages+0xfc/0x180
> [  234.140178]  [<ffffffff81148b4e>] ? __alloc_pages_nodemask+0x75e/0xb10
> [  234.142090]  [<ffffffff811855a9>] ? alloc_pages_current+0xa9/0x170
> [  234.143972]  [<ffffffffa0211b11>] ? xfs_buf_allocate_memory+0x16d/0x24a [xfs]
> [  234.146068]  [<ffffffffa01a23b5>] ? xfs_buf_get_map+0x125/0x180 [xfs]
> [  234.148008]  [<ffffffffa01a2d4c>] ? xfs_buf_read_map+0x2c/0x140 [xfs]
> [  234.149933]  [<ffffffffa0206089>] ? xfs_trans_read_buf_map+0x2d9/0x4a0 [xfs]
> [  234.151977]  [<ffffffffa01d3698>] ? xfs_btree_read_buf_block.isra.18.constprop.29+0x78/0xc0 [xfs]
> [  234.154399]  [<ffffffffa01a2dfa>] ? xfs_buf_read_map+0xda/0x140 [xfs]
> [  234.156330]  [<ffffffffa01d3760>] ? xfs_btree_lookup_get_block+0x80/0x100 [xfs]
> [  234.158438]  [<ffffffffa01d78e7>] ? xfs_btree_lookup+0xd7/0x4b0 [xfs]
> [  234.160362]  [<ffffffffa01bba0b>] ? xfs_alloc_lookup_eq+0x1b/0x20 [xfs]
> [  234.162318]  [<ffffffffa01be52e>] ? xfs_free_ag_extent+0x30e/0x750 [xfs]
> [  234.164286]  [<ffffffffa01bfa65>] ? xfs_free_extent+0xe5/0x120 [xfs]
> [  234.166187]  [<ffffffffa019eb2f>] ? xfs_bmap_finish+0x15f/0x1b0 [xfs]
> [  234.168101]  [<ffffffffa01ef5ed>] ? xfs_itruncate_extents+0x17d/0x2b0 [xfs]
> [  234.170112]  [<ffffffffa019fa0e>] ? xfs_free_eofblocks+0x1ee/0x270 [xfs]
> [  234.172081]  [<ffffffffa01ef97b>] ? xfs_release+0x13b/0x1e0 [xfs]
> [  234.173915]  [<ffffffffa01a6425>] ? xfs_file_release+0x15/0x20 [xfs]
> [  234.175807]  [<ffffffff811ad7a9>] ? __fput+0xe9/0x270
> [  234.177413]  [<ffffffff811ada7e>] ? ____fput+0xe/0x10
> [  234.179017]  [<ffffffff81082404>] ? task_work_run+0xc4/0xe0
> [  234.180778]  [<ffffffff81063ddb>] ? do_exit+0x2cb/0xa60
> [  234.182433]  [<ffffffff81094ebd>] ? ttwu_do_activate.constprop.87+0x5d/0x70
> [  234.184438]  [<ffffffff81097506>] ? try_to_wake_up+0x1b6/0x280
> [  234.186196]  [<ffffffff810645ef>] ? do_group_exit+0x3f/0xa0
> [  234.187887]  [<ffffffff81073ff0>] ? get_signal_to_deliver+0x1d0/0x6e0
> [  234.189773]  [<ffffffff81012437>] ? do_signal+0x57/0x600
> [  234.191423]  [<ffffffff81086ae0>] ? wake_up_bit+0x30/0x30
> [  234.193085]  [<ffffffff81012a41>] ? do_notify_resume+0x61/0xb0
> [  234.194840]  [<ffffffff815e7152>] ? int_signal+0x12/0x17
> 
> [  234.221720] abrt-watch-log  D ffff88007fa54540     0   587      1 0x00100086
> [  234.223804]  ffff88007be65a98 0000000000000046 ffff88007be65fd8 0000000000014540
> [  234.226018]  ffff88007be65fd8 0000000000014540 ffff880076fa71c0 ffff88007acf71c0
> [  234.228229]  ffff88007acf71c0 ffff8800757ee090 fffffffeffffffff ffff8800757ee098
> [  234.230453] Call Trace:
> [  234.231509]  [<ffffffff815dbf29>] schedule+0x29/0x70
> [  234.233091]  [<ffffffff815dda45>] rwsem_down_read_failed+0xf5/0x165
> [  234.234964]  [<ffffffffa019f8d2>] ? xfs_free_eofblocks+0xb2/0x270 [xfs]
> [  234.236890]  [<ffffffff812c27b4>] call_rwsem_down_read_failed+0x14/0x30
> [  234.238824]  [<ffffffff815db300>] ? down_read+0x20/0x30
> [  234.240505]  [<ffffffffa01ecfcc>] xfs_ilock+0xbc/0xe0 [xfs]
> [  234.242221]  [<ffffffffa019f8d2>] xfs_free_eofblocks+0xb2/0x270 [xfs]
> [  234.244114]  [<ffffffff81190c22>] ? kmem_cache_free+0x1b2/0x1d0
> [  234.245891]  [<ffffffff811c1a1f>] ? __d_free+0x3f/0x60
> [  234.247519]  [<ffffffffa01ef97b>] xfs_release+0x13b/0x1e0 [xfs]
> [  234.249300]  [<ffffffffa01a6425>] xfs_file_release+0x15/0x20 [xfs]
> [  234.251141]  [<ffffffff811ad7a9>] __fput+0xe9/0x270
> [  234.252699]  [<ffffffff811ada7e>] ____fput+0xe/0x10
> [  234.254254]  [<ffffffff81082404>] task_work_run+0xc4/0xe0
> [  234.255958]  [<ffffffff81063ddb>] do_exit+0x2cb/0xa60
> [  234.257552]  [<ffffffff811656ee>] ? __do_fault+0x7e/0x520
> [  234.259213]  [<ffffffff810645ef>] do_group_exit+0x3f/0xa0
> [  234.260877]  [<ffffffff81073ff0>] get_signal_to_deliver+0x1d0/0x6e0
> [  234.262723]  [<ffffffff81012437>] do_signal+0x57/0x600
> [  234.264398]  [<ffffffff8108a7ed>] ? hrtimer_nanosleep+0xad/0x170
> [  234.266199]  [<ffffffff81089780>] ? hrtimer_get_res+0x50/0x50
> [  234.267935]  [<ffffffff81012a41>] do_notify_resume+0x61/0xb0
> [  234.269665]  [<ffffffff815de33c>] retint_signal+0x48/0x8c

Without the OOM report these traces are not very useful. They are both
somewhere in exit_files and deferred fput. I am not sure how much memory
the process might hold at that time. I would be quite surprised if this
was the majority of the OOM victim's memory.

> > The OOM report was not complete so it is hard to say why the OOM
> > condition wasn't resolved by the OOM killer but other OOM report you
> > have posted (26 Apr) in that thread suggested that the system doesn't
> > have any swap and the page cache is full of shmem. The process list
> > didn't contain any large memory consumer so killing somebody wouldn't
> > help much. But the OOM victim died normally in that case:
> 
> The problem is that a.out invoked by a local unprivileged user is the only
> and the biggest memory consumer which the OOM killer thinks the least memory
> consumer.

Yes, because a.out doesn't consume too much of per-process accounted
memory. Its rss, ptes and swapped-out memory are negligible compared to
the memory allocated on behalf of processes for in-kernel data
structures. This is quite unfortunate but this is basically "an
untrusted user on your computer has to be contained" scenario. Ulimits
should help to a certain degree and kmem accounting from memory cgroup
controller should help for dentries, inodes and fork bombs but there
might be other resources that might be unrestricted. If this is the case
then the OOM killer should be taught to consider them or a restriction
should be added for them. The latter is preferable IMO. But adding a timeout to
the OOM killer and hoping that the next attempt will be more successful is
definitely not the right approach.
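
For reference, the per-process accounting mentioned above is essentially what the OOM killer's badness heuristic sums up. The following is an editor-added, simplified paraphrase of the oom_badness() calculation in kernels of that era (the function name here is invented, oom_score_adj and the root bonus are omitted, and details vary by version); it shows why a task whose footprint is mostly unaccounted kernel memory ends up with a tiny score:

#include <linux/mm.h>
#include <linux/sched.h>

/* Only rss, page-table pages and swap entries are counted; kernel memory
 * allocated on the task's behalf (dentries, inodes, ...) is invisible here. */
static unsigned long simplified_oom_badness(struct task_struct *p)
{
	return get_mm_rss(p->mm) + p->mm->nr_ptes +
	       get_mm_counter(p->mm, MM_SWAPENTS);
}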

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 1/5] mm: Introduce OOM kill timeout.
  2014-11-26 18:43               ` Michal Hocko
@ 2014-11-27 14:49                 ` Tetsuo Handa
  2014-11-28 16:17                   ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Tetsuo Handa @ 2014-11-27 14:49 UTC (permalink / raw)
  To: mhocko, rientjes; +Cc: linux-mm

Michal Hocko wrote:
> On Wed 26-11-14 20:58:52, Tetsuo Handa wrote:
> > Here is an example trace of 3.10.0-121.el7-test. Two of the OOM-killed processes
> > are inside task_work_run() from do_exit() and got stuck in memory allocation.
> > Processes past exit_mm() in do_exit() contribute to the OOM deadlock.
> 
> If the OOM victim passed exit_mm then it is usually not interesting for
> the OOM killer as it has already unmapped and freed its memory (assuming
> that mm_users is not elevated). It also doesn't have TIF_MEMDIE anymore
> so it doesn't block OOM killer from killing other tasks.

Then, why did the stall last for many minutes without making any progress?
I think that some lock held by a process past exit_mm() can prevent another
process chosen by the OOM killer from holding the lock (and therefore make
it impossible for another process to terminate).

> Without the OOM report these traces are not very useful. They are both
> somewhere in exit_files and deferred fput. I am not sure how much memory
> the process might hold at that time. I would be quite surprised if this
> was the majority of the OOM victim's memory.

I don't mean to attach any OOM reports here because attaching the OOM report
is equivalent to posting the reproducer program to LKML because the trace
of a.out will tell how to trigger the OOM deadlock/livelock. You already have
the source code of a.out and you are free to compile it and run a.out in
your environment.

> > > The OOM report was not complete so it is hard to say why the OOM
> > > condition wasn't resolved by the OOM killer but other OOM report you
> > > have posted (26 Apr) in that thread suggested that the system doesn't
> > > have any swap and the page cache is full of shmem. The process list
> > > didn't contain any large memory consumer so killing somebody wouldn't
> > > help much. But the OOM victim died normally in that case:
> > 
> > The problem is that a.out invoked by a local unprivileged user is the only
> > and the biggest memory consumer which the OOM killer thinks the least memory
> > consumer.
> 
> Yes, because a.out doesn't consume too much of per-process accounted
> memory. Its rss, ptes and swapped-out memory are negligible compared to
> the memory allocated on behalf of processes for in-kernel data
> structures. This is quite unfortunate but this is basically "an
> untrusted user on your computer has to be contained" scenario.

Why do you think about only containing an untrusted user? I'm using a.out as
a memory stressing tester for finding bugs under extreme memory pressure.
This is quite unfortunate but this is basically "any unreasonably lasting
stalls under extreme memory pressure have to be fixed" scenario.

>                                                                Ulimits
> should help to a certain degree and kmem accounting from memory cgroup
> controller should help for dentries, inodes and fork bombs but there
> might be other resources that might be unrestricted. If this is the case
> then the OOM killer should be taught to consider them or a restriction
> should be added for them. The latter is preferable IMO.

Ulimits do not help at all because a.out consumes kernel memory that only
kmem accounting can account for. But the kmem accounting helps little for me
because what I want is kmem accounting based on UID rather than memory cgroup.

I agree that teaching the OOM killer to consider them is preferable.
This vulnerability resembles "CVE-2010-4243 kernel: mm: mem allocated invisible
to oom_kill() when not attached to any threads", but it is much harder to fix and
backport. No patches have ever been proposed due to the performance hit and complexity.

>                                                But adding a timeout to
> the OOM killer and hoping that the next attempt will be more successful is
> definitely not the right approach.

I saw a case where an innocent administrator unexpectedly hit
"CVE-2012-4398 kernel: request_module() OOM local DoS" and his system
stalled for many hours until he manually issued SysRq-c.
I fixed request_module() and kthread_create(), but there are dozens of
memory allocations done with locks held which may cause unexpected OOM stalls.
If the change below is available, I will no longer see similar cases even if
the cause of the OOM stall is an out-of-tree kernel module.

 	/* p may not be terminated within reasonale duration */
-	if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
+	if (sysctl_memdie_timeout_jiffies &&
+	    test_tsk_thread_flag(p, TIF_MEMDIE)) {
 		smp_rmb(); /* set_memdie_flag() uses smp_wmb(). */
-		if (time_after(jiffies, p->memdie_start + 5 * HZ)) {
-			static unsigned char warn = 255;
-			char comm[sizeof(p->comm)];
-
-			if (warn && warn--)
-				pr_err("Process %d (%s) was not killed within 5 seconds.\n",
-				       task_pid_nr(p), get_task_comm(comm, p));
-			return true;
-		}
+		if (time_after(jiffies, p->memdie_start + sysctl_memdie_timeout_jiffies))
+			panic("Process %d (%s) did not die within %lu jiffies.\n",
+			      task_pid_nr(p), get_task_comm(comm, p),
+			      sysctl_memdie_timeout_jiffies);
 	}
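
For context, a knob like sysctl_memdie_timeout_jiffies used above would presumably be exposed somewhere under /proc/sys/vm/. The following editor-added sketch shows one possible way to wire it up (it is not part of the posted patches; the table name is invented):

#include <linux/sysctl.h>

unsigned long sysctl_memdie_timeout_jiffies;	/* 0 == timeout disabled */

static struct ctl_table memdie_sysctl_table[] = {
	{
		.procname	= "memdie_timeout_jiffies",
		.data		= &sysctl_memdie_timeout_jiffies,
		.maxlen		= sizeof(sysctl_memdie_timeout_jiffies),
		.mode		= 0644,
		.proc_handler	= proc_doulongvec_minmax,
	},
	{ }
};

The table would then be registered under "vm" (for example via register_sysctl("vm", memdie_sysctl_table)) so that an administrator could tune or disable the timeout at run time.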

If a timeout for the next OOM kill is not acceptable, what about a timeout for
a kernel panic (followed by kdump and automatic reboot) like the diff above?
If still NACK, what alternatives can you propose for distributions using
2.6.18 / 2.6.32 / 3.2 kernels which do not have the kmem accounting?


* Re: [PATCH 1/5] mm: Introduce OOM kill timeout.
  2014-11-27 14:49                 ` Tetsuo Handa
@ 2014-11-28 16:17                   ` Michal Hocko
  0 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2014-11-28 16:17 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: rientjes, linux-mm

On Thu 27-11-14 23:49:38, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 26-11-14 20:58:52, Tetsuo Handa wrote:
> > > Here is an example trace of 3.10.0-121.el7-test. Two of the OOM-killed processes
> > > are inside task_work_run() from do_exit() and got stuck in memory allocation.
> > > Processes past exit_mm() in do_exit() contribute to the OOM deadlock.
> > 
> > If the OOM victim passed exit_mm then it is usually not interesting for
> > the OOM killer as it has already unmapped and freed its memory (assuming
> > that mm_users is not elevated). It also doesn't have TIF_MEMDIE anymore
> > so it doesn't block OOM killer from killing other tasks.
> 
> Then, why did the stall last for many minutes without making any progress?
> I think that some lock held by a process past exit_mm() can prevent another
> process chosen by the OOM killer from holding the lock (and therefore make
> it impossible for another process to terminate).

Now that I am looking closer it seems probable that the victim got
TIF_MEMDIE set again because it is still PF_EXITING, so that it can dive
into memory reserves and continue. That didn't help in your particular
case, most probably because the memory seems depleted beyond any hope.
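
The shortcut referred to here corresponds roughly to a check like the following near the top of out_of_memory() in kernels of that era (an editor-added, simplified paraphrase; the exact code differs between versions):

	/*
	 * If the current task already has SIGKILL pending or is exiting, give
	 * it TIF_MEMDIE (access to memory reserves) instead of selecting a new
	 * victim, in the hope that it exits and frees its memory quickly.
	 */
	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
		set_thread_flag(TIF_MEMDIE);
		return;
	}

With memory depleted beyond even the reserves, this shortcut cannot guarantee forward progress, which matches the behaviour described above.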

Both of your tasks are blocked on a lock, but it is not 100% clear
whether this is just the case at the time of the sysrq or permanent; I
would expect the soft lockup watchdog to complain, as at least one of them
is a spin_lock. Anyway this looks like a livelock due to depleted memory,
because even the memory reserves didn't help to make any progress.

> > Without the OOM report these traces are not very useful. They are both
> > somewhere in exit_files and deferred fput. I am not sure how much memory
> > the process might hold at that time. I would be quite surprised if this
> > was the majority of the OOM victim's memory.
> 
> I don't mean to attach any OOM reports here because attaching the OOM report
> is equivalent to posting the reproducer program to LKML because the trace
> of a.out will tell how to trigger the OOM deadlock/livelock.

The trace is not really that interesting. The memory counters and the
list of eligible tasks are...

> You already have the source code of a.out and you are free to compile
> it and run a.out in your environment.
> 
> > > > The OOM report was not complete so it is hard to say why the OOM
> > > > condition wasn't resolved by the OOM killer but other OOM report you
> > > > have posted (26 Apr) in that thread suggested that the system doesn't
> > > > have any swap and the page cache is full of shmem. The process list
> > > > didn't contain any large memory consumer so killing somebody wouldn't
> > > > help much. But the OOM victim died normally in that case:
> > > 
> > > The problem is that a.out invoked by a local unprivileged user is the only
> > > and the biggest memory consumer which the OOM killer thinks the least memory
> > > consumer.
> > 
> > Yes, because a.out doesn't consume too much of per-process accounted
> > memory. Its rss, ptes and swapped-out memory are negligible compared to
> > the memory allocated on behalf of processes for in-kernel data
> > structures. This is quite unfortunate but this is basically "an
> > untrusted user on your computer has to be contained" scenario.
> 
> Why do you think about only containing an untrusted user?

Because non-malicious users usually do not shoot themselves in the foot.
This includes both the configuration of the system and running a load
which doesn't eat up unaccounted kernel memory to death.

> I'm using a.out as a memory stressing tester for finding bugs under
> extreme memory pressure.

And I agree that unbounded kernel memory usage on behalf of
a user is a bug which should be fixed properly. I will have a look at
your reproducer again and try to think about a potential fix.

> This is quite unfortunate but this is basically "any unreasonably lasting
> stalls under extreme memory pressure have to be fixed" scenario.
> 
> >                                                                Ulimits
> > should help to a certain degree and kmem accounting from memory cgroup
> > controller should help for dentries, inodes and fork bombs but there
> > might be other resources that might be unrestricted. If this is the case
> > then the OOM killer should be taught to consider them or a restriction
> > should be added for them. The latter is preferable IMO.
> 
> Ulimits do not help at all because a.out consumes kernel memory that only
> kmem accounting can account for.

Normally ulimit would cap the user-visible end of the resource.

> But the kmem accounting helps little for me because what I want is
> kmem accounting based on UID rather than memory cgroup.
>
> I agree that teaching the OOM killer to consider them is preferable.
> This vulnerability resembles "CVE-2010-4243 kernel: mm: mem allocated invisible
> to oom_kill() when not attached to any threads", but it is much harder to fix and
> backport. No patches have ever been proposed due to the performance hit and complexity.
> 
> >                                                But adding a timeout to
> > the OOM killer and hoping that the next attempt will be more successful is
> > definitely not the right approach.
> 
> I saw a case where an innocent administrator unexpectedly hit
> "CVE-2012-4398 kernel: request_module() OOM local DoS" and his system
> stalled for many hours until he manually issued SysRq-c.
> I fixed request_module() and kthread_create(), but there are dozens of
> memory allocations done with locks held which may cause unexpected OOM stalls.
> If the change below is available, I will no longer see similar cases even if
> the cause of the OOM stall is an out-of-tree kernel module.
> 
>  	/* p may not be terminated within reasonale duration */
> -	if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
> +	if (sysctl_memdie_timeout_jiffies &&
> +	    test_tsk_thread_flag(p, TIF_MEMDIE)) {
>  		smp_rmb(); /* set_memdie_flag() uses smp_wmb(). */
> -		if (time_after(jiffies, p->memdie_start + 5 * HZ)) {
> -			static unsigned char warn = 255;
> -			char comm[sizeof(p->comm)];
> -
> -			if (warn && warn--)
> -				pr_err("Process %d (%s) was not killed within 5 seconds.\n",
> -				       task_pid_nr(p), get_task_comm(comm, p));
> -			return true;
> -		}
> +		if (time_after(jiffies, p->memdie_start + sysctl_memdie_timeout_jiffies))
> +			panic("Process %d (%s) did not die within %lu jiffies.\n",
> +			      task_pid_nr(p), get_task_comm(comm, p),
> +			      sysctl_memdie_timeout_jiffies);
>  	}
>
> If a timeout for the next OOM kill is not acceptable, what about a timeout for
> a kernel panic (followed by kdump and automatic reboot) like the diff above?

This is basically the same thing and it is already too late to do anything. Your
machine is already DoSed and the reboot is only a marginally better
approach. What would be a safe timeout which wouldn't panic a system
that is struggling but would eventually make progress?
Why can't the admin trigger sysrq+c manually?

I am not saying that this is an absolute no-go but I would _really_ like
to have a fix rather than a workaround.

> If still NACK, what alternatives can you propose for distributions using
> 2.6.18 / 2.6.32 / 3.2 kernels which do not have the kmem accounting?

Feel free to use your specific, out-of-tree workarounds if you
believe they will suit your users better.
-- 
Michal Hocko
SUSE Labs


Thread overview: 20+ messages
2014-11-23  4:49 [RFC PATCH 0/5] mm: Patches for mitigating memory allocation stalls Tetsuo Handa
2014-11-23  4:50 ` [PATCH 1/5] mm: Introduce OOM kill timeout Tetsuo Handa
2014-11-24 16:50   ` Michal Hocko
2014-11-24 22:29     ` David Rientjes
2014-11-25 10:38       ` Michal Hocko
2014-11-25 12:54         ` Tetsuo Handa
2014-11-25 13:45           ` Michal Hocko
2014-11-26 11:58             ` Tetsuo Handa
2014-11-26 18:43               ` Michal Hocko
2014-11-27 14:49                 ` Tetsuo Handa
2014-11-28 16:17                   ` Michal Hocko
2014-11-23  4:50 ` [PATCH 2/5] mm: Kill shrinker's global semaphore Tetsuo Handa
2014-11-24 16:55   ` Michal Hocko
2014-11-23  4:51 ` [PATCH 3/5] mm: Remember ongoing memory allocation status Tetsuo Handa
2014-11-24 17:01   ` Michal Hocko
2014-11-23  4:52 ` [PATCH 4/5] mm: Drop __GFP_WAIT flag when allocating from shrinker functions Tetsuo Handa
2014-11-24 17:14   ` Michal Hocko
2014-11-23  4:53 ` [PATCH 5/5] mm: Insert some delay if ongoing memory allocation stalls Tetsuo Handa
2014-11-24 17:19   ` Michal Hocko
2014-11-24 17:25 ` [RFC PATCH 0/5] mm: Patches for mitigating " Michal Hocko
