linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH] mm/slub: fix a deadlock due to incomplete patching of cpusets_enabled()
@ 2017-07-26 16:50 Dima Zavin
  2017-07-26 17:02 ` Christopher Lameter
  0 siblings, 1 reply; 18+ messages in thread
From: Dima Zavin @ 2017-07-26 16:50 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Li Zefan, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, cgroups, linux-kernel, linux-mm, Cliff Spradlin

In codepaths that use the begin/retry interface for reading
mems_allowed_seq with irqs disabled, there exists a race condition that
stalls the patch process after only modifying a subset of the
static_branch call sites.

This problem manifested itself as a deadlock in the slub
allocator, inside get_any_partial. The loop reads the
mems_allowed_seq value (via read_mems_allowed_begin),
performs the defrag operation, and then verifies the consistency
of mems_allowed via read_mems_allowed_retry and the cookie
returned by xxx_begin. The issue here is that both begin and retry
first check if cpusets are enabled via the cpusets_enabled() static branch.
This branch can be rewritten dynamically (via cpuset_inc) if a new
cpuset is created. The x86 jump label code fully synchronizes across
all CPUs for every entry it rewrites. If it rewrites only one of the
callsites (specifically the one in read_mems_allowed_retry) and then
waits for the smp_call_function(do_sync_core) to complete while a CPU is
inside the begin/retry section with IRQs off and the mems_allowed value
is changed, we can hang. This is because begin() will always return 0
(since it wasn't patched yet) while retry() will compare that 0 against
the actual value of the seq counter.

The fix is to cache the value that's returned by cpusets_enabled() at the
top of the loop, and only operate on the seqcount (both begin and retry) if
it was true.
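
To make the failure mode concrete, a rough sketch of the half-patched state
(illustrative pseudo-code, not part of the patch; assume the jump label in
retry() has already been rewritten while the one in begin() has not):

  unsigned int cookie;

  do {
          /* branch not patched yet: cpusets_enabled() is false, returns 0 */
          cookie = read_mems_allowed_begin();

          /*
           * ... allocation attempt with IRQs off; meanwhile mems_allowed
           * is changed and mems_allowed_seq is bumped ...
           */

          /*
           * branch already patched: compares the stale 0 cookie against the
           * live seqcount, keeps asking for a retry, so the loop never exits
           * and this CPU never services the do_sync_core() IPI.
           */
  } while (read_mems_allowed_retry(cookie));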

The relevant stack traces of the two stuck threads:

  CPU: 107 PID: 1415 Comm: mkdir Tainted: G L  4.9.36-00104-g540c51286237 #4
  Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
  task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
  RIP: smp_call_function_many+0x1f9/0x260
  Call Trace:
    ? setup_data_read+0xa0/0xa0
    ? ___slab_alloc+0x28b/0x5a0
    smp_call_function+0x3b/0x70
    ? setup_data_read+0xa0/0xa0
    on_each_cpu+0x2f/0x90
    ? ___slab_alloc+0x28a/0x5a0
    ? ___slab_alloc+0x28b/0x5a0
    text_poke_bp+0x87/0xd0
    ? ___slab_alloc+0x28a/0x5a0
    arch_jump_label_transform+0x93/0x100
    __jump_label_update+0x77/0x90
    jump_label_update+0xaa/0xc0
    static_key_slow_inc+0x9e/0xb0
    cpuset_css_online+0x70/0x2e0
    online_css+0x2c/0xa0
    cgroup_apply_control_enable+0x27f/0x3d0
    cgroup_mkdir+0x2b7/0x420
    kernfs_iop_mkdir+0x5a/0x80
    vfs_mkdir+0xf6/0x1a0
    SyS_mkdir+0xb7/0xe0
    entry_SYSCALL_64_fastpath+0x18/0xad

  ...

  CPU: 22 PID: 1 Comm: init Tainted: G L  4.9.36-00104-g540c51286237 #4
  Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
  task: ffff8818087c0000 task.stack: ffffc90000030000
  RIP: int3+0x39/0x70
  Call Trace:
    <#DB> ? ___slab_alloc+0x28b/0x5a0
    <EOE> ? copy_process.part.40+0xf7/0x1de0
    ? __slab_alloc.isra.80+0x54/0x90
    ? copy_process.part.40+0xf7/0x1de0
    ? copy_process.part.40+0xf7/0x1de0
    ? kmem_cache_alloc_node+0x8a/0x280
    ? copy_process.part.40+0xf7/0x1de0
    ? _do_fork+0xe7/0x6c0
    ? _raw_spin_unlock_irq+0x2d/0x60
    ? trace_hardirqs_on_caller+0x136/0x1d0
    ? entry_SYSCALL_64_fastpath+0x5/0xad
    ? do_syscall_64+0x27/0x350
    ? SyS_clone+0x19/0x20
    ? do_syscall_64+0x60/0x350
    ? entry_SYSCALL64_slow_path+0x25/0x25

Reported-by: Cliff Spradlin <cspradlin@waymo.com>
Signed-off-by: Dima Zavin <dmitriyz@waymo.com>
---

We were reproducing the issue here with some regularity on ubuntu 14.04
running v4.9 (v4.9.36 at the time). The patch applies cleanly to 4.12 but
was only compile-tested there.

This is kind of a hacky solution that solves our immediate issue, but this
looks like a more general problem that can affect other unsuspecting users of
these APIs. I suppose an irqs-off seqcount loop that is optimized away via
static_branch rewrites is rare. And, technically, it actually would be ok
except for the all-cpu sync in the x86 jump-label code between each entry
rewrite. I don't know enough about all the implications of changing that,
or anything else in this path, so I went with a targeted "fix" and rely on
the collective wisdom here to sort out what the correct solution to the
problem should be.

 include/linux/cpuset.h | 14 ++++++++++++--
 mm/slub.c              | 13 +++++++++++--
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index bfc204e70338..2a0f217413c6 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -111,12 +111,17 @@ extern void cpuset_print_current_mems_allowed(void);
  * causing process failure. A retry loop with read_mems_allowed_begin and
  * read_mems_allowed_retry prevents these artificial failures.
  */
+static inline unsigned int raw_read_mems_allowed_begin(void)
+{
+	return read_seqcount_begin(&current->mems_allowed_seq);
+}
+
 static inline unsigned int read_mems_allowed_begin(void)
 {
 	if (!cpusets_enabled())
 		return 0;
 
-	return read_seqcount_begin(&current->mems_allowed_seq);
+	return raw_read_mems_allowed_begin();
 }
 
 /*
@@ -125,12 +130,17 @@ static inline unsigned int read_mems_allowed_begin(void)
  * update of mems_allowed. It is up to the caller to retry the operation if
  * appropriate.
  */
+static inline bool raw_read_mems_allowed_retry(unsigned int seq)
+{
+	return read_seqcount_retry(&current->mems_allowed_seq, seq);
+}
+
 static inline bool read_mems_allowed_retry(unsigned int seq)
 {
 	if (!cpusets_enabled())
 		return false;
 
-	return read_seqcount_retry(&current->mems_allowed_seq, seq);
+	return raw_read_mems_allowed_retry(seq);
 }
 
 static inline void set_mems_allowed(nodemask_t nodemask)
diff --git a/mm/slub.c b/mm/slub.c
index edc79ca3c6d5..7a6c74851250 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1847,6 +1847,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
 	enum zone_type high_zoneidx = gfp_zone(flags);
 	void *object;
 	unsigned int cpuset_mems_cookie;
+	bool csets_enabled;
 
 	/*
 	 * The defrag ratio allows a configuration of the tradeoffs between
@@ -1871,7 +1872,14 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
 		return NULL;
 
 	do {
-		cpuset_mems_cookie = read_mems_allowed_begin();
+		if (cpusets_enabled()) {
+			csets_enabled = true;
+			cpuset_mems_cookie = raw_read_mems_allowed_begin();
+		} else {
+			csets_enabled = false;
+			cpuset_mems_cookie = 0;
+		}
+
 		zonelist = node_zonelist(mempolicy_slab_node(), flags);
 		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 			struct kmem_cache_node *n;
@@ -1893,7 +1901,8 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
 				}
 			}
 		}
-	} while (read_mems_allowed_retry(cpuset_mems_cookie));
+	} while (csets_enabled &&
+		 raw_read_mems_allowed_retry(cpuset_mems_cookie));
 #endif
 	return NULL;
 }
-- 
2.14.0.rc0.400.g1c36432dff-goog

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH] mm/slub: fix a deadlock due to incomplete patching of cpusets_enabled()
  2017-07-26 16:50 [RFC PATCH] mm/slub: fix a deadlock due to incomplete patching of cpusets_enabled() Dima Zavin
@ 2017-07-26 17:02 ` Christopher Lameter
  2017-07-26 19:54   ` Dima Zavin
  2017-07-27 16:46   ` [PATCH v2] cpuset: " Dima Zavin
  0 siblings, 2 replies; 18+ messages in thread
From: Christopher Lameter @ 2017-07-26 17:02 UTC (permalink / raw)
  To: Dima Zavin
  Cc: Mel Gorman, Andrew Morton, Li Zefan, Pekka Enberg,
	David Rientjes, Joonsoo Kim, cgroups, linux-kernel, linux-mm,
	Cliff Spradlin

On Wed, 26 Jul 2017, Dima Zavin wrote:

> The fix is to cache the value that's returned by cpusets_enabled() at the
> top of the loop, and only operate on the seqlock (both begin and retry) if
> it was true.

I think the proper fix would be to ensure that the calls to
read_mems_allowed_{begin,retry} cannot cause the deadlock. Otherwise you
have to fix this in multiple places.

Maybe read_mems_allowed_* can do some form of synchronization, or *_retry
can implicitly rely on the result of cpusets_enabled() as seen by *_begin?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH] mm/slub: fix a deadlock due to incomplete patching of cpusets_enabled()
  2017-07-26 17:02 ` Christopher Lameter
@ 2017-07-26 19:54   ` Dima Zavin
  2017-07-27 16:46   ` [PATCH v2] cpuset: " Dima Zavin
  1 sibling, 0 replies; 18+ messages in thread
From: Dima Zavin @ 2017-07-26 19:54 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Mel Gorman, Andrew Morton, Li Zefan, Pekka Enberg,
	David Rientjes, Joonsoo Kim, cgroups, LKML, linux-mm,
	Cliff Spradlin

On Wed, Jul 26, 2017 at 10:02 AM, Christopher Lameter <cl@linux.com> wrote:
> On Wed, 26 Jul 2017, Dima Zavin wrote:
>
>> The fix is to cache the value that's returned by cpusets_enabled() at the
>> top of the loop, and only operate on the seqlock (both begin and retry) if
>> it was true.
>
> I think the proper fix would be to ensure that the calls to
> read_mems_allowed_{begin,retry} cannot cause the deadlock. Otherwise you
> have to fix this in multiple places.
>
> Maybe read_mems_allowed_* can do some form of synchronization or *_retry
> can implictly rely on the results of cpusets_enabled() by *_begin?
>

(re-sending because gmail hates me, sorry).

Thanks for the quick reply!

I can turn the cookie into a uint64, put the sequence into the low
order 32 bits and put the enabled state into bit 33 (or 63 :) ). Then
retry will not query cpusets_enabled() and will just look at the
enabled bit. This means that *_retry will always have a conditional
jump (i.e. lose the whole static_branch optimization) but maybe that's
ok since that's pretty rare and the *_begin() will still benefit from
it?
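
Something along these lines, roughly (just a sketch; the mask name is made
up):

  #define MEMS_COOKIE_ENABLED     (1ULL << 63)

  static inline u64 read_mems_allowed_begin(void)
  {
          if (!cpusets_enabled())
                  return 0;

          /* enabled state in the top bit, sequence in the low 32 bits */
          return MEMS_COOKIE_ENABLED |
                 read_seqcount_begin(&current->mems_allowed_seq);
  }

  static inline bool read_mems_allowed_retry(u64 cookie)
  {
          /* only look at the cached bit, never at the static branch */
          if (!(cookie & MEMS_COOKIE_ENABLED))
                  return false;

          return read_seqcount_retry(&current->mems_allowed_seq,
                                     (unsigned int)cookie);
  }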

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
  2017-07-26 17:02 ` Christopher Lameter
  2017-07-26 19:54   ` Dima Zavin
@ 2017-07-27 16:46   ` Dima Zavin
  2017-07-27 19:48     ` Andrew Morton
                       ` (3 more replies)
  1 sibling, 4 replies; 18+ messages in thread
From: Dima Zavin @ 2017-07-27 16:46 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Li Zefan, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, cgroups, linux-kernel, linux-mm, Cliff Spradlin,
	Mel Gorman

In codepaths that use the begin/retry interface for reading
mems_allowed_seq with irqs disabled, there exists a race condition that
stalls the patch process after only modifying a subset of the
static_branch call sites.

This problem manifested itself as a deadlock in the slub
allocator, inside get_any_partial. The loop reads the
mems_allowed_seq value (via read_mems_allowed_begin),
performs the defrag operation, and then verifies the consistency
of mems_allowed via read_mems_allowed_retry and the cookie
returned by xxx_begin. The issue here is that both begin and retry
first check if cpusets are enabled via the cpusets_enabled() static branch.
This branch can be rewritten dynamically (via cpuset_inc) if a new
cpuset is created. The x86 jump label code fully synchronizes across
all CPUs for every entry it rewrites. If it rewrites only one of the
callsites (specifically the one in read_mems_allowed_retry) and then
waits for the smp_call_function(do_sync_core) to complete while a CPU is
inside the begin/retry section with IRQs off and the mems_allowed value
is changed, we can hang. This is because begin() will always return 0
(since it wasn't patched yet) while retry() will compare that 0 against
the actual value of the seq counter.

The fix is to cache the value that's returned by cpusets_enabled() at the
top of the loop, and only operate on the seqcount (both begin and retry) if
it was true.

The relevant stack traces of the two stuck threads:

  CPU: 107 PID: 1415 Comm: mkdir Tainted: G L  4.9.36-00104-g540c51286237 #4
  Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
  task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
  RIP: smp_call_function_many+0x1f9/0x260
  Call Trace:
    ? setup_data_read+0xa0/0xa0
    ? ___slab_alloc+0x28b/0x5a0
    smp_call_function+0x3b/0x70
    ? setup_data_read+0xa0/0xa0
    on_each_cpu+0x2f/0x90
    ? ___slab_alloc+0x28a/0x5a0
    ? ___slab_alloc+0x28b/0x5a0
    text_poke_bp+0x87/0xd0
    ? ___slab_alloc+0x28a/0x5a0
    arch_jump_label_transform+0x93/0x100
    __jump_label_update+0x77/0x90
    jump_label_update+0xaa/0xc0
    static_key_slow_inc+0x9e/0xb0
    cpuset_css_online+0x70/0x2e0
    online_css+0x2c/0xa0
    cgroup_apply_control_enable+0x27f/0x3d0
    cgroup_mkdir+0x2b7/0x420
    kernfs_iop_mkdir+0x5a/0x80
    vfs_mkdir+0xf6/0x1a0
    SyS_mkdir+0xb7/0xe0
    entry_SYSCALL_64_fastpath+0x18/0xad

  ...

  CPU: 22 PID: 1 Comm: init Tainted: G L  4.9.36-00104-g540c51286237 #4
  Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
  task: ffff8818087c0000 task.stack: ffffc90000030000
  RIP: int3+0x39/0x70
  Call Trace:
    <#DB> ? ___slab_alloc+0x28b/0x5a0
    <EOE> ? copy_process.part.40+0xf7/0x1de0
    ? __slab_alloc.isra.80+0x54/0x90
    ? copy_process.part.40+0xf7/0x1de0
    ? copy_process.part.40+0xf7/0x1de0
    ? kmem_cache_alloc_node+0x8a/0x280
    ? copy_process.part.40+0xf7/0x1de0
    ? _do_fork+0xe7/0x6c0
    ? _raw_spin_unlock_irq+0x2d/0x60
    ? trace_hardirqs_on_caller+0x136/0x1d0
    ? entry_SYSCALL_64_fastpath+0x5/0xad
    ? do_syscall_64+0x27/0x350
    ? SyS_clone+0x19/0x20
    ? do_syscall_64+0x60/0x350
    ? entry_SYSCALL64_slow_path+0x25/0x25

Reported-by: Cliff Spradlin <cspradlin@waymo.com>
Signed-off-by: Dima Zavin <dmitriyz@waymo.com>
---

v2:
 - Moved the cached cpusets_enabled() state into the cookie, turned
   the cookie into a struct and updated all the other call sites.
 - Applied on top of v4.12 since one of the callers in page_alloc.c changed.
   Still only tested on v4.9.36 and compile tested against v4.12.

 include/linux/cpuset.h | 27 +++++++++++++++++----------
 mm/filemap.c           |  6 +++---
 mm/hugetlb.c           | 12 ++++++------
 mm/mempolicy.c         | 12 ++++++------
 mm/page_alloc.c        |  8 ++++----
 mm/slab.c              |  6 +++---
 mm/slub.c              |  6 +++---
 7 files changed, 42 insertions(+), 35 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 119a3f9604b0..f64f6d3b1dce 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -16,6 +16,11 @@
 #include <linux/mm.h>
 #include <linux/jump_label.h>
 
+struct cpuset_mems_cookie {
+	unsigned int seq;
+	bool was_enabled;
+};
+
 #ifdef CONFIG_CPUSETS
 
 extern struct static_key_false cpusets_enabled_key;
@@ -113,12 +118,15 @@ extern void cpuset_print_current_mems_allowed(void);
  * causing process failure. A retry loop with read_mems_allowed_begin and
  * read_mems_allowed_retry prevents these artificial failures.
  */
-static inline unsigned int read_mems_allowed_begin(void)
+static inline void read_mems_allowed_begin(struct cpuset_mems_cookie *cookie)
 {
-	if (!cpusets_enabled())
-		return 0;
+	if (!cpusets_enabled()) {
+		cookie->was_enabled = false;
+		return;
+	}
 
-	return read_seqcount_begin(&current->mems_allowed_seq);
+	cookie->was_enabled = true;
+	cookie->seq = read_seqcount_begin(&current->mems_allowed_seq);
 }
 
 /*
@@ -127,12 +135,11 @@ static inline unsigned int read_mems_allowed_begin(void)
  * update of mems_allowed. It is up to the caller to retry the operation if
  * appropriate.
  */
-static inline bool read_mems_allowed_retry(unsigned int seq)
+static inline bool read_mems_allowed_retry(struct cpuset_mems_cookie *cookie)
 {
-	if (!cpusets_enabled())
+	if (!cookie->was_enabled)
 		return false;
-
-	return read_seqcount_retry(&current->mems_allowed_seq, seq);
+	return read_seqcount_retry(&current->mems_allowed_seq, cookie->seq);
 }
 
 static inline void set_mems_allowed(nodemask_t nodemask)
@@ -249,12 +256,12 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 {
 }
 
-static inline unsigned int read_mems_allowed_begin(void)
+static inline void read_mems_allowed_begin(struct cpuset_mems_cookie *cookie)
 {
 	return 0;
 }
 
-static inline bool read_mems_allowed_retry(unsigned int seq)
+static inline bool read_mems_allowed_retry(struct cpuset_mems_cookie *cookie)
 {
 	return false;
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index 6f1be573a5e6..c0730b377519 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -716,12 +716,12 @@ struct page *__page_cache_alloc(gfp_t gfp)
 	struct page *page;
 
 	if (cpuset_do_page_mem_spread()) {
-		unsigned int cpuset_mems_cookie;
+		struct cpuset_mems_cookie cpuset_mems_cookie;
 		do {
-			cpuset_mems_cookie = read_mems_allowed_begin();
+			read_mems_allowed_begin(&cpuset_mems_cookie);
 			n = cpuset_mem_spread_node();
 			page = __alloc_pages_node(n, gfp, 0);
-		} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));
+		} while (!page && read_mems_allowed_retry(&cpuset_mems_cookie));
 
 		return page;
 	}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3eedb187e549..1defa44f4fe6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -907,7 +907,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 	struct zonelist *zonelist;
 	struct zone *zone;
 	struct zoneref *z;
-	unsigned int cpuset_mems_cookie;
+	struct cpuset_mems_cookie cpuset_mems_cookie;
 
 	/*
 	 * A child process with MAP_PRIVATE mappings created by their parent
@@ -923,7 +923,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 		goto err;
 
 retry_cpuset:
-	cpuset_mems_cookie = read_mems_allowed_begin();
+	read_mems_allowed_begin(&cpuset_mems_cookie);
 	zonelist = huge_zonelist(vma, address,
 					htlb_alloc_mask(h), &mpol, &nodemask);
 
@@ -945,7 +945,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 	}
 
 	mpol_cond_put(mpol);
-	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
+	if (unlikely(!page && read_mems_allowed_retry(&cpuset_mems_cookie)))
 		goto retry_cpuset;
 	return page;
 
@@ -1511,7 +1511,7 @@ static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
 {
 	int order = huge_page_order(h);
 	gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_REPEAT|__GFP_NOWARN;
-	unsigned int cpuset_mems_cookie;
+	struct cpuset_mems_cookie cpuset_mems_cookie;
 
 	/*
 	 * We need a VMA to get a memory policy.  If we do not
@@ -1548,13 +1548,13 @@ static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
 		struct zonelist *zl;
 		nodemask_t *nodemask;
 
-		cpuset_mems_cookie = read_mems_allowed_begin();
+		read_mems_allowed_begin(&cpuset_mems_cookie);
 		zl = huge_zonelist(vma, addr, gfp, &mpol, &nodemask);
 		mpol_cond_put(mpol);
 		page = __alloc_pages_nodemask(gfp, order, zl, nodemask);
 		if (page)
 			return page;
-	} while (read_mems_allowed_retry(cpuset_mems_cookie));
+	} while (read_mems_allowed_retry(&cpuset_mems_cookie));
 
 	return NULL;
 }
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 37d0b334bfe9..b4f2513a2296 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1971,13 +1971,13 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 {
 	struct mempolicy *pol;
 	struct page *page;
-	unsigned int cpuset_mems_cookie;
+	struct cpuset_mems_cookie cpuset_mems_cookie;
 	struct zonelist *zl;
 	nodemask_t *nmask;
 
 retry_cpuset:
 	pol = get_vma_policy(vma, addr);
-	cpuset_mems_cookie = read_mems_allowed_begin();
+	read_mems_allowed_begin(&cpuset_mems_cookie);
 
 	if (pol->mode == MPOL_INTERLEAVE) {
 		unsigned nid;
@@ -2019,7 +2019,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 	page = __alloc_pages_nodemask(gfp, order, zl, nmask);
 	mpol_cond_put(pol);
 out:
-	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
+	if (unlikely(!page && read_mems_allowed_retry(&cpuset_mems_cookie)))
 		goto retry_cpuset;
 	return page;
 }
@@ -2047,13 +2047,13 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 {
 	struct mempolicy *pol = &default_policy;
 	struct page *page;
-	unsigned int cpuset_mems_cookie;
+	struct cpuset_mems_cookie cpuset_mems_cookie;
 
 	if (!in_interrupt() && !(gfp & __GFP_THISNODE))
 		pol = get_task_policy(current);
 
 retry_cpuset:
-	cpuset_mems_cookie = read_mems_allowed_begin();
+	read_mems_allowed_begin(&cpuset_mems_cookie);
 
 	/*
 	 * No reference counting needed for current->mempolicy
@@ -2066,7 +2066,7 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 				policy_zonelist(gfp, pol, numa_node_id()),
 				policy_nodemask(gfp, pol));
 
-	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
+	if (unlikely(!page && read_mems_allowed_retry(&cpuset_mems_cookie)))
 		goto retry_cpuset;
 
 	return page;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2302f250d6b1..36cd4e95fb38 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3688,7 +3688,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	int no_progress_loops;
 	unsigned long alloc_start = jiffies;
 	unsigned int stall_timeout = 10 * HZ;
-	unsigned int cpuset_mems_cookie;
+	struct cpuset_mems_cookie cpuset_mems_cookie;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3713,7 +3713,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	compaction_retries = 0;
 	no_progress_loops = 0;
 	compact_priority = DEF_COMPACT_PRIORITY;
-	cpuset_mems_cookie = read_mems_allowed_begin();
+	read_mems_allowed_begin(&cpuset_mems_cookie);
 
 	/*
 	 * The fast path uses conservative alloc_flags to succeed only until
@@ -3872,7 +3872,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * It's possible we raced with cpuset update so the OOM would be
 	 * premature (see below the nopage: label for full explanation).
 	 */
-	if (read_mems_allowed_retry(cpuset_mems_cookie))
+	if (read_mems_allowed_retry(&cpuset_mems_cookie))
 		goto retry_cpuset;
 
 	/* Reclaim has failed us, start killing things */
@@ -3900,7 +3900,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * to fail, check if the cpuset changed during allocation and if so,
 	 * retry.
 	 */
-	if (read_mems_allowed_retry(cpuset_mems_cookie))
+	if (read_mems_allowed_retry(&cpuset_mems_cookie))
 		goto retry_cpuset;
 
 	/*
diff --git a/mm/slab.c b/mm/slab.c
index 2a31ee3c5814..391fe9d9d24e 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3195,13 +3195,13 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
 	void *obj = NULL;
 	struct page *page;
 	int nid;
-	unsigned int cpuset_mems_cookie;
+	struct cpuset_mems_cookie cpuset_mems_cookie;
 
 	if (flags & __GFP_THISNODE)
 		return NULL;
 
 retry_cpuset:
-	cpuset_mems_cookie = read_mems_allowed_begin();
+	read_mems_allowed_begin(&cpuset_mems_cookie);
 	zonelist = node_zonelist(mempolicy_slab_node(), flags);
 
 retry:
@@ -3245,7 +3245,7 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
 		}
 	}
 
-	if (unlikely(!obj && read_mems_allowed_retry(cpuset_mems_cookie)))
+	if (unlikely(!obj && read_mems_allowed_retry(&cpuset_mems_cookie)))
 		goto retry_cpuset;
 	return obj;
 }
diff --git a/mm/slub.c b/mm/slub.c
index 8addc535bcdc..55c4862852ec 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1849,7 +1849,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
 	struct zone *zone;
 	enum zone_type high_zoneidx = gfp_zone(flags);
 	void *object;
-	unsigned int cpuset_mems_cookie;
+	struct cpuset_mems_cookie cpuset_mems_cookie;
 
 	/*
 	 * The defrag ratio allows a configuration of the tradeoffs between
@@ -1874,7 +1874,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
 		return NULL;
 
 	do {
-		cpuset_mems_cookie = read_mems_allowed_begin();
+		read_mems_allowed_begin(&cpuset_mems_cookie);
 		zonelist = node_zonelist(mempolicy_slab_node(), flags);
 		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 			struct kmem_cache_node *n;
@@ -1896,7 +1896,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
 				}
 			}
 		}
-	} while (read_mems_allowed_retry(cpuset_mems_cookie));
+	} while (read_mems_allowed_retry(&cpuset_mems_cookie));
 #endif
 	return NULL;
 }
-- 
2.14.0.rc0.400.g1c36432dff-goog

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
  2017-07-27 16:46   ` [PATCH v2] cpuset: " Dima Zavin
@ 2017-07-27 19:48     ` Andrew Morton
  2017-07-27 21:41       ` Dima Zavin
  2017-07-27 19:51     ` Andrew Morton
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2017-07-27 19:48 UTC (permalink / raw)
  To: Dima Zavin
  Cc: Christopher Lameter, Li Zefan, Pekka Enberg, David Rientjes,
	Joonsoo Kim, cgroups, linux-kernel, linux-mm, Cliff Spradlin,
	Mel Gorman

On Thu, 27 Jul 2017 09:46:08 -0700 Dima Zavin <dmitriyz@waymo.com> wrote:

> In codepaths that use the begin/retry interface for reading
> mems_allowed_seq with irqs disabled, there exists a race condition that
> stalls the patch process after only modifying a subset of the
> static_branch call sites.
> 
> This problem manifested itself as a dead lock in the slub
> allocator, inside get_any_partial. The loop reads
> mems_allowed_seq value (via read_mems_allowed_begin),
> performs the defrag operation, and then verifies the consistency
> of mem_allowed via the read_mems_allowed_retry and the cookie
> returned by xxx_begin. The issue here is that both begin and retry
> first check if cpusets are enabled via cpusets_enabled() static branch.
> This branch can be rewritted dynamically (via cpuset_inc) if a new
> cpuset is created. The x86 jump label code fully synchronizes across
> all CPUs for every entry it rewrites. If it rewrites only one of the
> callsites (specifically the one in read_mems_allowed_retry) and then
> waits for the smp_call_function(do_sync_core) to complete while a CPU is
> inside the begin/retry section with IRQs off and the mems_allowed value
> is changed, we can hang. This is because begin() will always return 0
> (since it wasn't patched yet) while retry() will test the 0 against
> the actual value of the seq counter.
> 
> The fix is to cache the value that's returned by cpusets_enabled() at the
> top of the loop, and only operate on the seqcount (both begin and retry) if
> it was true.

Tricky.  Hence we should have a nice code comment somewhere describing
all of this.

> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -16,6 +16,11 @@
>  #include <linux/mm.h>
>  #include <linux/jump_label.h>
>  
> +struct cpuset_mems_cookie {
> +	unsigned int seq;
> +	bool was_enabled;
> +};

cpuset_mems_cookie would be a good site - explain why it exists, what it
does, when it is used and how.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
  2017-07-27 16:46   ` [PATCH v2] cpuset: " Dima Zavin
  2017-07-27 19:48     ` Andrew Morton
@ 2017-07-27 19:51     ` Andrew Morton
  2017-07-27 21:41       ` Dima Zavin
  2017-07-28  7:45     ` Vlastimil Babka
  2017-07-29  4:56     ` [PATCH v2] " kbuild test robot
  3 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2017-07-27 19:51 UTC (permalink / raw)
  To: Dima Zavin
  Cc: Christopher Lameter, Li Zefan, Pekka Enberg, David Rientjes,
	Joonsoo Kim, cgroups, linux-kernel, linux-mm, Cliff Spradlin,
	Mel Gorman

On Thu, 27 Jul 2017 09:46:08 -0700 Dima Zavin <dmitriyz@waymo.com> wrote:

>  - Applied on top of v4.12 since one of the callers in page_alloc.c changed.
>    Still only tested on v4.9.36 and compile tested against v4.12.

That's a problem - this doesn't come close to applying on current
mainline.  I can fix that I guess, but the result should be tested
well.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
  2017-07-27 19:48     ` Andrew Morton
@ 2017-07-27 21:41       ` Dima Zavin
  0 siblings, 0 replies; 18+ messages in thread
From: Dima Zavin @ 2017-07-27 21:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christopher Lameter, Li Zefan, Pekka Enberg, David Rientjes,
	Joonsoo Kim, cgroups, LKML, linux-mm, Cliff Spradlin, Mel Gorman

On Thu, Jul 27, 2017 at 12:48 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Thu, 27 Jul 2017 09:46:08 -0700 Dima Zavin <dmitriyz@waymo.com> wrote:
>
>> In codepaths that use the begin/retry interface for reading
>> mems_allowed_seq with irqs disabled, there exists a race condition that
>> stalls the patch process after only modifying a subset of the
>> static_branch call sites.
>>
>> This problem manifested itself as a dead lock in the slub
>> allocator, inside get_any_partial. The loop reads
>> mems_allowed_seq value (via read_mems_allowed_begin),
>> performs the defrag operation, and then verifies the consistency
>> of mem_allowed via the read_mems_allowed_retry and the cookie
>> returned by xxx_begin. The issue here is that both begin and retry
>> first check if cpusets are enabled via cpusets_enabled() static branch.
>> This branch can be rewritted dynamically (via cpuset_inc) if a new
>> cpuset is created. The x86 jump label code fully synchronizes across
>> all CPUs for every entry it rewrites. If it rewrites only one of the
>> callsites (specifically the one in read_mems_allowed_retry) and then
>> waits for the smp_call_function(do_sync_core) to complete while a CPU is
>> inside the begin/retry section with IRQs off and the mems_allowed value
>> is changed, we can hang. This is because begin() will always return 0
>> (since it wasn't patched yet) while retry() will test the 0 against
>> the actual value of the seq counter.
>>
>> The fix is to cache the value that's returned by cpusets_enabled() at the
>> top of the loop, and only operate on the seqcount (both begin and retry) if
>> it was true.
>
> Tricky.  Hence we should have a nice code comment somewhere describing
> all of this.
>
>> --- a/include/linux/cpuset.h
>> +++ b/include/linux/cpuset.h
>> @@ -16,6 +16,11 @@
>>  #include <linux/mm.h>
>>  #include <linux/jump_label.h>
>>
>> +struct cpuset_mems_cookie {
>> +     unsigned int seq;
>> +     bool was_enabled;
>> +};
>
> At cpuset_mems_cookie would be a good site - why it exists, what it
> does, when it is used and how.

Will do. I actually had a comment here but removed it in lieu of
the commit message :) Will put it back.
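
For example, something along these lines (wording is just a sketch):

  /*
   * cpuset_mems_cookie - snapshot taken by read_mems_allowed_begin().
   * @seq:         value of current->mems_allowed_seq at the start of the
   *               read section.
   * @was_enabled: result of cpusets_enabled() at that same point.
   *               read_mems_allowed_retry() must use this cached value
   *               instead of re-evaluating the static branch, which may
   *               have been patched between begin() and retry().
   */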

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
  2017-07-27 19:51     ` Andrew Morton
@ 2017-07-27 21:41       ` Dima Zavin
  0 siblings, 0 replies; 18+ messages in thread
From: Dima Zavin @ 2017-07-27 21:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christopher Lameter, Li Zefan, Pekka Enberg, David Rientjes,
	Joonsoo Kim, cgroups, LKML, linux-mm, Cliff Spradlin, Mel Gorman

On Thu, Jul 27, 2017 at 12:51 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Thu, 27 Jul 2017 09:46:08 -0700 Dima Zavin <dmitriyz@waymo.com> wrote:
>
>>  - Applied on top of v4.12 since one of the callers in page_alloc.c changed.
>>    Still only tested on v4.9.36 and compile tested against v4.12.
>
> That's a problem - this doesn't come close to applying on current
> mainline.  I can fix that I guess, but the result should be tested
> well.
>

I'll fix up for latest, and see if I can test it. I should be able to
boot vanilla with not too much trouble. May take a few days.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
  2017-07-27 16:46   ` [PATCH v2] cpuset: " Dima Zavin
  2017-07-27 19:48     ` Andrew Morton
  2017-07-27 19:51     ` Andrew Morton
@ 2017-07-28  7:45     ` Vlastimil Babka
  2017-07-28  8:48       ` Dima Zavin
  2017-07-28  9:30       ` Peter Zijlstra
  2017-07-29  4:56     ` [PATCH v2] " kbuild test robot
  3 siblings, 2 replies; 18+ messages in thread
From: Vlastimil Babka @ 2017-07-28  7:45 UTC (permalink / raw)
  To: Dima Zavin, Christopher Lameter
  Cc: Li Zefan, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, cgroups, linux-kernel, linux-mm, Cliff Spradlin,
	Mel Gorman, Peter Zijlstra

[+CC PeterZ]

On 07/27/2017 06:46 PM, Dima Zavin wrote:
> In codepaths that use the begin/retry interface for reading
> mems_allowed_seq with irqs disabled, there exists a race condition that
> stalls the patch process after only modifying a subset of the
> static_branch call sites.
> 
> This problem manifested itself as a dead lock in the slub
> allocator, inside get_any_partial. The loop reads
> mems_allowed_seq value (via read_mems_allowed_begin),
> performs the defrag operation, and then verifies the consistency
> of mem_allowed via the read_mems_allowed_retry and the cookie
> returned by xxx_begin. The issue here is that both begin and retry
> first check if cpusets are enabled via cpusets_enabled() static branch.
> This branch can be rewritted dynamically (via cpuset_inc) if a new
> cpuset is created. The x86 jump label code fully synchronizes across
> all CPUs for every entry it rewrites. If it rewrites only one of the
> callsites (specifically the one in read_mems_allowed_retry) and then
> waits for the smp_call_function(do_sync_core) to complete while a CPU is
> inside the begin/retry section with IRQs off and the mems_allowed value
> is changed, we can hang. This is because begin() will always return 0
> (since it wasn't patched yet) while retry() will test the 0 against
> the actual value of the seq counter.

Hm, I wonder if there are other static branch users potentially having a
similar problem. Then it would be best to fix this at the static branch
level. Any idea, Peter? An inelegant solution would be to let
static_branch_(un)likely() callsites indicate an ordering for the patching,
i.e. here we would make sure that read_mems_allowed_begin() callsites are
patched before read_mems_allowed_retry() when enabling the static key,
and the opposite order when disabling the static key.

> The fix is to cache the value that's returned by cpusets_enabled() at the
> top of the loop, and only operate on the seqcount (both begin and retry) if
> it was true.

Maybe we could just return e.g. -1 in read_mems_allowed_begin() when
cpusets are disabled, and test it in read_mems_allowed_retry() before
doing a proper seqcount retry check? Also I think you can still do the
cpusets_enabled() check in read_mems_allowed_retry() before the
was_enabled (or cookie == -1) test?
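
A rough sketch of that variant (illustrative only; -1 should be a safe
sentinel since read_seqcount_begin() waits for an even value):

  static inline unsigned int read_mems_allowed_begin(void)
  {
          if (!cpusets_enabled())
                  return -1;      /* sentinel: cpusets were off at begin() */

          return read_seqcount_begin(&current->mems_allowed_seq);
  }

  static inline bool read_mems_allowed_retry(unsigned int seq)
  {
          if (!cpusets_enabled())         /* keep the static-branch fast path */
                  return false;

          if (seq == (unsigned int)-1)    /* begin() ran before the key was patched */
                  return false;

          return read_seqcount_retry(&current->mems_allowed_seq, seq);
  }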

> The relevant stack traces of the two stuck threads:
> 
>   CPU: 107 PID: 1415 Comm: mkdir Tainted: G L  4.9.36-00104-g540c51286237 #4
>   Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
>   task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
>   RIP: smp_call_function_many+0x1f9/0x260
>   Call Trace:
>     ? setup_data_read+0xa0/0xa0
>     ? ___slab_alloc+0x28b/0x5a0
>     smp_call_function+0x3b/0x70
>     ? setup_data_read+0xa0/0xa0
>     on_each_cpu+0x2f/0x90
>     ? ___slab_alloc+0x28a/0x5a0
>     ? ___slab_alloc+0x28b/0x5a0
>     text_poke_bp+0x87/0xd0
>     ? ___slab_alloc+0x28a/0x5a0
>     arch_jump_label_transform+0x93/0x100
>     __jump_label_update+0x77/0x90
>     jump_label_update+0xaa/0xc0
>     static_key_slow_inc+0x9e/0xb0
>     cpuset_css_online+0x70/0x2e0
>     online_css+0x2c/0xa0
>     cgroup_apply_control_enable+0x27f/0x3d0
>     cgroup_mkdir+0x2b7/0x420
>     kernfs_iop_mkdir+0x5a/0x80
>     vfs_mkdir+0xf6/0x1a0
>     SyS_mkdir+0xb7/0xe0
>     entry_SYSCALL_64_fastpath+0x18/0xad
> 
>   ...
> 
>   CPU: 22 PID: 1 Comm: init Tainted: G L  4.9.36-00104-g540c51286237 #4
>   Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
>   task: ffff8818087c0000 task.stack: ffffc90000030000
>   RIP: int3+0x39/0x70
>   Call Trace:
>     <#DB> ? ___slab_alloc+0x28b/0x5a0
>     <EOE> ? copy_process.part.40+0xf7/0x1de0
>     ? __slab_alloc.isra.80+0x54/0x90
>     ? copy_process.part.40+0xf7/0x1de0
>     ? copy_process.part.40+0xf7/0x1de0
>     ? kmem_cache_alloc_node+0x8a/0x280
>     ? copy_process.part.40+0xf7/0x1de0
>     ? _do_fork+0xe7/0x6c0
>     ? _raw_spin_unlock_irq+0x2d/0x60
>     ? trace_hardirqs_on_caller+0x136/0x1d0
>     ? entry_SYSCALL_64_fastpath+0x5/0xad
>     ? do_syscall_64+0x27/0x350
>     ? SyS_clone+0x19/0x20
>     ? do_syscall_64+0x60/0x350
>     ? entry_SYSCALL64_slow_path+0x25/0x25
> 
> Reported-by: Cliff Spradlin <cspradlin@waymo.com>
> Signed-off-by: Dima Zavin <dmitriyz@waymo.com>
> ---
> 
> v2:
>  - Moved the cached cpusets_enabled() state into the cookie, turned
>    the cookie into a struct and updated all the other call sites.
>  - Applied on top of v4.12 since one of the callers in page_alloc.c changed.
>    Still only tested on v4.9.36 and compile tested against v4.12.
> 
>  include/linux/cpuset.h | 27 +++++++++++++++++----------
>  mm/filemap.c           |  6 +++---
>  mm/hugetlb.c           | 12 ++++++------
>  mm/mempolicy.c         | 12 ++++++------
>  mm/page_alloc.c        |  8 ++++----
>  mm/slab.c              |  6 +++---
>  mm/slub.c              |  6 +++---
>  7 files changed, 42 insertions(+), 35 deletions(-)
> 
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 119a3f9604b0..f64f6d3b1dce 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -16,6 +16,11 @@
>  #include <linux/mm.h>
>  #include <linux/jump_label.h>
>  
> +struct cpuset_mems_cookie {
> +	unsigned int seq;
> +	bool was_enabled;
> +};
> +
>  #ifdef CONFIG_CPUSETS
>  
>  extern struct static_key_false cpusets_enabled_key;
> @@ -113,12 +118,15 @@ extern void cpuset_print_current_mems_allowed(void);
>   * causing process failure. A retry loop with read_mems_allowed_begin and
>   * read_mems_allowed_retry prevents these artificial failures.
>   */
> -static inline unsigned int read_mems_allowed_begin(void)
> +static inline void read_mems_allowed_begin(struct cpuset_mems_cookie *cookie)
>  {
> -	if (!cpusets_enabled())
> -		return 0;
> +	if (!cpusets_enabled()) {
> +		cookie->was_enabled = false;
> +		return;
> +	}
>  
> -	return read_seqcount_begin(&current->mems_allowed_seq);
> +	cookie->was_enabled = true;
> +	cookie->seq = read_seqcount_begin(&current->mems_allowed_seq);
>  }
>  
>  /*
> @@ -127,12 +135,11 @@ static inline unsigned int read_mems_allowed_begin(void)
>   * update of mems_allowed. It is up to the caller to retry the operation if
>   * appropriate.
>   */
> -static inline bool read_mems_allowed_retry(unsigned int seq)
> +static inline bool read_mems_allowed_retry(struct cpuset_mems_cookie *cookie)
>  {
> -	if (!cpusets_enabled())
> +	if (!cookie->was_enabled)
>  		return false;
> -
> -	return read_seqcount_retry(&current->mems_allowed_seq, seq);
> +	return read_seqcount_retry(&current->mems_allowed_seq, cookie->seq);
>  }
>  
>  static inline void set_mems_allowed(nodemask_t nodemask)
> @@ -249,12 +256,12 @@ static inline void set_mems_allowed(nodemask_t nodemask)
>  {
>  }
>  
> -static inline unsigned int read_mems_allowed_begin(void)
> +static inline void read_mems_allowed_begin(struct cpuset_mems_cookie *cookie)
>  {
>  	return 0;
>  }
>  
> -static inline bool read_mems_allowed_retry(unsigned int seq)
> +static inline bool read_mems_allowed_retry(struct cpuset_mems_cookie *cookie)
>  {
>  	return false;
>  }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 6f1be573a5e6..c0730b377519 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -716,12 +716,12 @@ struct page *__page_cache_alloc(gfp_t gfp)
>  	struct page *page;
>  
>  	if (cpuset_do_page_mem_spread()) {
> -		unsigned int cpuset_mems_cookie;
> +		struct cpuset_mems_cookie cpuset_mems_cookie;
>  		do {
> -			cpuset_mems_cookie = read_mems_allowed_begin();
> +			read_mems_allowed_begin(&cpuset_mems_cookie);
>  			n = cpuset_mem_spread_node();
>  			page = __alloc_pages_node(n, gfp, 0);
> -		} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));
> +		} while (!page && read_mems_allowed_retry(&cpuset_mems_cookie));
>  
>  		return page;
>  	}
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3eedb187e549..1defa44f4fe6 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -907,7 +907,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
>  	struct zonelist *zonelist;
>  	struct zone *zone;
>  	struct zoneref *z;
> -	unsigned int cpuset_mems_cookie;
> +	struct cpuset_mems_cookie cpuset_mems_cookie;
>  
>  	/*
>  	 * A child process with MAP_PRIVATE mappings created by their parent
> @@ -923,7 +923,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
>  		goto err;
>  
>  retry_cpuset:
> -	cpuset_mems_cookie = read_mems_allowed_begin();
> +	read_mems_allowed_begin(&cpuset_mems_cookie);
>  	zonelist = huge_zonelist(vma, address,
>  					htlb_alloc_mask(h), &mpol, &nodemask);
>  
> @@ -945,7 +945,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
>  	}
>  
>  	mpol_cond_put(mpol);
> -	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
> +	if (unlikely(!page && read_mems_allowed_retry(&cpuset_mems_cookie)))
>  		goto retry_cpuset;
>  	return page;
>  
> @@ -1511,7 +1511,7 @@ static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
>  {
>  	int order = huge_page_order(h);
>  	gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_REPEAT|__GFP_NOWARN;
> -	unsigned int cpuset_mems_cookie;
> +	struct cpuset_mems_cookie cpuset_mems_cookie;
>  
>  	/*
>  	 * We need a VMA to get a memory policy.  If we do not
> @@ -1548,13 +1548,13 @@ static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
>  		struct zonelist *zl;
>  		nodemask_t *nodemask;
>  
> -		cpuset_mems_cookie = read_mems_allowed_begin();
> +		read_mems_allowed_begin(&cpuset_mems_cookie);
>  		zl = huge_zonelist(vma, addr, gfp, &mpol, &nodemask);
>  		mpol_cond_put(mpol);
>  		page = __alloc_pages_nodemask(gfp, order, zl, nodemask);
>  		if (page)
>  			return page;
> -	} while (read_mems_allowed_retry(cpuset_mems_cookie));
> +	} while (read_mems_allowed_retry(&cpuset_mems_cookie));
>  
>  	return NULL;
>  }
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 37d0b334bfe9..b4f2513a2296 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1971,13 +1971,13 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  {
>  	struct mempolicy *pol;
>  	struct page *page;
> -	unsigned int cpuset_mems_cookie;
> +	struct cpuset_mems_cookie cpuset_mems_cookie;
>  	struct zonelist *zl;
>  	nodemask_t *nmask;
>  
>  retry_cpuset:
>  	pol = get_vma_policy(vma, addr);
> -	cpuset_mems_cookie = read_mems_allowed_begin();
> +	read_mems_allowed_begin(&cpuset_mems_cookie);
>  
>  	if (pol->mode == MPOL_INTERLEAVE) {
>  		unsigned nid;
> @@ -2019,7 +2019,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  	page = __alloc_pages_nodemask(gfp, order, zl, nmask);
>  	mpol_cond_put(pol);
>  out:
> -	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
> +	if (unlikely(!page && read_mems_allowed_retry(&cpuset_mems_cookie)))
>  		goto retry_cpuset;
>  	return page;
>  }
> @@ -2047,13 +2047,13 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
>  {
>  	struct mempolicy *pol = &default_policy;
>  	struct page *page;
> -	unsigned int cpuset_mems_cookie;
> +	struct cpuset_mems_cookie cpuset_mems_cookie;
>  
>  	if (!in_interrupt() && !(gfp & __GFP_THISNODE))
>  		pol = get_task_policy(current);
>  
>  retry_cpuset:
> -	cpuset_mems_cookie = read_mems_allowed_begin();
> +	read_mems_allowed_begin(&cpuset_mems_cookie);
>  
>  	/*
>  	 * No reference counting needed for current->mempolicy
> @@ -2066,7 +2066,7 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
>  				policy_zonelist(gfp, pol, numa_node_id()),
>  				policy_nodemask(gfp, pol));
>  
> -	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
> +	if (unlikely(!page && read_mems_allowed_retry(&cpuset_mems_cookie)))
>  		goto retry_cpuset;
>  
>  	return page;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2302f250d6b1..36cd4e95fb38 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3688,7 +3688,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	int no_progress_loops;
>  	unsigned long alloc_start = jiffies;
>  	unsigned int stall_timeout = 10 * HZ;
> -	unsigned int cpuset_mems_cookie;
> +	struct cpuset_mems_cookie cpuset_mems_cookie;
>  
>  	/*
>  	 * In the slowpath, we sanity check order to avoid ever trying to
> @@ -3713,7 +3713,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	compaction_retries = 0;
>  	no_progress_loops = 0;
>  	compact_priority = DEF_COMPACT_PRIORITY;
> -	cpuset_mems_cookie = read_mems_allowed_begin();
> +	read_mems_allowed_begin(&cpuset_mems_cookie);
>  
>  	/*
>  	 * The fast path uses conservative alloc_flags to succeed only until
> @@ -3872,7 +3872,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	 * It's possible we raced with cpuset update so the OOM would be
>  	 * premature (see below the nopage: label for full explanation).
>  	 */
> -	if (read_mems_allowed_retry(cpuset_mems_cookie))
> +	if (read_mems_allowed_retry(&cpuset_mems_cookie))
>  		goto retry_cpuset;
>  
>  	/* Reclaim has failed us, start killing things */
> @@ -3900,7 +3900,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	 * to fail, check if the cpuset changed during allocation and if so,
>  	 * retry.
>  	 */
> -	if (read_mems_allowed_retry(cpuset_mems_cookie))
> +	if (read_mems_allowed_retry(&cpuset_mems_cookie))
>  		goto retry_cpuset;
>  
>  	/*
> diff --git a/mm/slab.c b/mm/slab.c
> index 2a31ee3c5814..391fe9d9d24e 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -3195,13 +3195,13 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
>  	void *obj = NULL;
>  	struct page *page;
>  	int nid;
> -	unsigned int cpuset_mems_cookie;
> +	struct cpuset_mems_cookie cpuset_mems_cookie;
>  
>  	if (flags & __GFP_THISNODE)
>  		return NULL;
>  
>  retry_cpuset:
> -	cpuset_mems_cookie = read_mems_allowed_begin();
> +	read_mems_allowed_begin(&cpuset_mems_cookie);
>  	zonelist = node_zonelist(mempolicy_slab_node(), flags);
>  
>  retry:
> @@ -3245,7 +3245,7 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
>  		}
>  	}
>  
> -	if (unlikely(!obj && read_mems_allowed_retry(cpuset_mems_cookie)))
> +	if (unlikely(!obj && read_mems_allowed_retry(&cpuset_mems_cookie)))
>  		goto retry_cpuset;
>  	return obj;
>  }
> diff --git a/mm/slub.c b/mm/slub.c
> index 8addc535bcdc..55c4862852ec 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1849,7 +1849,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
>  	struct zone *zone;
>  	enum zone_type high_zoneidx = gfp_zone(flags);
>  	void *object;
> -	unsigned int cpuset_mems_cookie;
> +	struct cpuset_mems_cookie cpuset_mems_cookie;
>  
>  	/*
>  	 * The defrag ratio allows a configuration of the tradeoffs between
> @@ -1874,7 +1874,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
>  		return NULL;
>  
>  	do {
> -		cpuset_mems_cookie = read_mems_allowed_begin();
> +		read_mems_allowed_begin(&cpuset_mems_cookie);
>  		zonelist = node_zonelist(mempolicy_slab_node(), flags);
>  		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
>  			struct kmem_cache_node *n;
> @@ -1896,7 +1896,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
>  				}
>  			}
>  		}
> -	} while (read_mems_allowed_retry(cpuset_mems_cookie));
> +	} while (read_mems_allowed_retry(&cpuset_mems_cookie));
>  #endif
>  	return NULL;
>  }
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
  2017-07-28  7:45     ` Vlastimil Babka
@ 2017-07-28  8:48       ` Dima Zavin
  2017-07-28  9:30       ` Peter Zijlstra
  1 sibling, 0 replies; 18+ messages in thread
From: Dima Zavin @ 2017-07-28  8:48 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Christopher Lameter, Li Zefan, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, cgroups, LKML, linux-mm,
	Cliff Spradlin, Mel Gorman, Peter Zijlstra

On Fri, Jul 28, 2017 at 12:45 AM, Vlastimil Babka <vbabka@suse.cz> wrote:
> [+CC PeterZ]
>
> On 07/27/2017 06:46 PM, Dima Zavin wrote:
>> In codepaths that use the begin/retry interface for reading
>> mems_allowed_seq with irqs disabled, there exists a race condition that
>> stalls the patch process after only modifying a subset of the
>> static_branch call sites.
>>
>> This problem manifested itself as a dead lock in the slub
>> allocator, inside get_any_partial. The loop reads
>> mems_allowed_seq value (via read_mems_allowed_begin),
>> performs the defrag operation, and then verifies the consistency
>> of mem_allowed via the read_mems_allowed_retry and the cookie
>> returned by xxx_begin. The issue here is that both begin and retry
>> first check if cpusets are enabled via cpusets_enabled() static branch.
>> This branch can be rewritted dynamically (via cpuset_inc) if a new
>> cpuset is created. The x86 jump label code fully synchronizes across
>> all CPUs for every entry it rewrites. If it rewrites only one of the
>> callsites (specifically the one in read_mems_allowed_retry) and then
>> waits for the smp_call_function(do_sync_core) to complete while a CPU is
>> inside the begin/retry section with IRQs off and the mems_allowed value
>> is changed, we can hang. This is because begin() will always return 0
>> (since it wasn't patched yet) while retry() will test the 0 against
>> the actual value of the seq counter.
>
> Hm I wonder if there are other static branch users potentially having
> similar problem. Then it would be best to fix this at static branch
> level. Any idea, Peter? An inelegant solution would be to have indicate
> static_branch_(un)likely() callsites ordering for the patching. I.e.
> here we would make sure that read_mems_allowed_begin() callsites are
> patched before read_mems_allowed_retry() when enabling the static key,
> and the opposite order when disabling the static key.
>

This was my main worry, that I'm just patching up one incarnation of
this problem and other clients will eventually trip over this.

>> The fix is to cache the value that's returned by cpusets_enabled() at the
>> top of the loop, and only operate on the seqcount (both begin and retry) if
>> it was true.
>
> Maybe we could just return e.g. -1 in read_mems_allowed_begin() when
> cpusets are disabled, and test it in read_mems_allowed_retry() before
> doing a proper seqcount retry check? Also I think you can still do the
> cpusets_enabled() check in read_mems_allowed_retry() before the
> was_enabled (or cookie == -1) test?

Hmm, good point! If cpusets_enabled() is true, then we can still test against
was_enabled and do the right thing (adds one extra branch in that case). When
it's false, we still benefit from the static_branch fanciness. Thanks!
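
So retry() would look something like this (sketch against the v2 cookie
struct):

  static inline bool read_mems_allowed_retry(struct cpuset_mems_cookie *cookie)
  {
          if (!cpusets_enabled())         /* static-branch fast path */
                  return false;

          if (!cookie->was_enabled)       /* begin() ran before the key was patched */
                  return false;

          return read_seqcount_retry(&current->mems_allowed_seq, cookie->seq);
  }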

Re setting the cookie to -1, I didn't really want to overload the
cookie value but rather just make the state explicit so it's easier to
grok, as this is all already subtle enough.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
  2017-07-28  7:45     ` Vlastimil Babka
  2017-07-28  8:48       ` Dima Zavin
@ 2017-07-28  9:30       ` Peter Zijlstra
  2017-07-28 14:05         ` Vlastimil Babka
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2017-07-28  9:30 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Dima Zavin, Christopher Lameter, Li Zefan, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, cgroups,
	linux-kernel, linux-mm, Cliff Spradlin, Mel Gorman

On Fri, Jul 28, 2017 at 09:45:16AM +0200, Vlastimil Babka wrote:
> [+CC PeterZ]
> 
> On 07/27/2017 06:46 PM, Dima Zavin wrote:
> > In codepaths that use the begin/retry interface for reading
> > mems_allowed_seq with irqs disabled, there exists a race condition that
> > stalls the patch process after only modifying a subset of the
> > static_branch call sites.
> > 
> > This problem manifested itself as a dead lock in the slub
> > allocator, inside get_any_partial. The loop reads
> > mems_allowed_seq value (via read_mems_allowed_begin),
> > performs the defrag operation, and then verifies the consistency
> > of mem_allowed via the read_mems_allowed_retry and the cookie
> > returned by xxx_begin. The issue here is that both begin and retry
> > first check if cpusets are enabled via cpusets_enabled() static branch.
> > This branch can be rewritted dynamically (via cpuset_inc) if a new
> > cpuset is created. The x86 jump label code fully synchronizes across
> > all CPUs for every entry it rewrites. If it rewrites only one of the
> > callsites (specifically the one in read_mems_allowed_retry) and then
> > waits for the smp_call_function(do_sync_core) to complete while a CPU is
> > inside the begin/retry section with IRQs off and the mems_allowed value
> > is changed, we can hang. This is because begin() will always return 0
> > (since it wasn't patched yet) while retry() will test the 0 against
> > the actual value of the seq counter.
> 
> Hm I wonder if there are other static branch users potentially having
> similar problem. Then it would be best to fix this at static branch
> level. Any idea, Peter? An inelegant solution would be to have indicate
> static_branch_(un)likely() callsites ordering for the patching. I.e.
> here we would make sure that read_mems_allowed_begin() callsites are
> patched before read_mems_allowed_retry() when enabling the static key,
> and the opposite order when disabling the static key.

I'm not aware of any other such ordering requirements. But you can
manually create this order by using 2 static keys, then flip them in the
desired order.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
  2017-07-28  9:30       ` Peter Zijlstra
@ 2017-07-28 14:05         ` Vlastimil Babka
  2017-07-28 16:52           ` Dima Zavin
  2017-07-31  4:01           ` [PATCH v3] " Dima Zavin
  0 siblings, 2 replies; 18+ messages in thread
From: Vlastimil Babka @ 2017-07-28 14:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dima Zavin, Christopher Lameter, Li Zefan, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, cgroups,
	linux-kernel, linux-mm, Cliff Spradlin, Mel Gorman

On 07/28/2017 11:30 AM, Peter Zijlstra wrote:
> On Fri, Jul 28, 2017 at 09:45:16AM +0200, Vlastimil Babka wrote:
>> [+CC PeterZ]
>>
>> On 07/27/2017 06:46 PM, Dima Zavin wrote:
>>> In codepaths that use the begin/retry interface for reading
>>> mems_allowed_seq with irqs disabled, there exists a race condition that
>>> stalls the patch process after only modifying a subset of the
>>> static_branch call sites.
>>>
>>> This problem manifested itself as a dead lock in the slub
>>> allocator, inside get_any_partial. The loop reads
>>> mems_allowed_seq value (via read_mems_allowed_begin),
>>> performs the defrag operation, and then verifies the consistency
>>> of mem_allowed via the read_mems_allowed_retry and the cookie
>>> returned by xxx_begin. The issue here is that both begin and retry
>>> first check if cpusets are enabled via cpusets_enabled() static branch.
>>> This branch can be rewritted dynamically (via cpuset_inc) if a new
>>> cpuset is created. The x86 jump label code fully synchronizes across
>>> all CPUs for every entry it rewrites. If it rewrites only one of the
>>> callsites (specifically the one in read_mems_allowed_retry) and then
>>> waits for the smp_call_function(do_sync_core) to complete while a CPU is
>>> inside the begin/retry section with IRQs off and the mems_allowed value
>>> is changed, we can hang. This is because begin() will always return 0
>>> (since it wasn't patched yet) while retry() will test the 0 against
>>> the actual value of the seq counter.
>>
>> Hm I wonder if there are other static branch users potentially having
>> similar problem. Then it would be best to fix this at static branch
>> level. Any idea, Peter? An inelegant solution would be to have indicate
>> static_branch_(un)likely() callsites ordering for the patching. I.e.
>> here we would make sure that read_mems_allowed_begin() callsites are
>> patched before read_mems_allowed_retry() when enabling the static key,
>> and the opposite order when disabling the static key.
> 
> I'm not aware of any other such ordering requirements. But you can
> manually create this order by using 2 static keys. Then flip them in the
> desired order.

Right, thanks for the suggestion. I think that would be preferable to
complicating the cookie handling. Add a new key next to
cpusets_enabled_key, let's say "cpusets_enabled_pre_key". Make
read_mems_allowed_begin() check this key instead of cpusets_enabled().
Change cpuset_inc/dec to also inc/dec this new key in the right order
and that should be it. Dima, can you try that or should I?
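
Roughly, a minimal sketch of that ordering (using the cpusets_enabled_pre_key
name from above; names and exact placement are only illustrative, the eventual
patch is the authoritative version):

static inline void cpuset_inc(void)
{
        /* patch the key that begin() checks first ... */
        static_branch_inc(&cpusets_enabled_pre_key);
        /* ... so retry() only sees "enabled" once begin() already does */
        static_branch_inc(&cpusets_enabled_key);
}

static inline void cpuset_dec(void)
{
        /* reverse order: retry() must stop comparing the seqcount
         * before begin() goes back to returning 0 */
        static_branch_dec(&cpusets_enabled_key);
        static_branch_dec(&cpusets_enabled_pre_key);
}

with read_mems_allowed_begin() testing cpusets_enabled_pre_key and
read_mems_allowed_retry() left testing cpusets_enabled_key as before.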

Thanks,
Vlastimil


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
  2017-07-28 14:05         ` Vlastimil Babka
@ 2017-07-28 16:52           ` Dima Zavin
  2017-07-31  4:01           ` [PATCH v3] " Dima Zavin
  1 sibling, 0 replies; 18+ messages in thread
From: Dima Zavin @ 2017-07-28 16:52 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Peter Zijlstra, Christopher Lameter, Li Zefan, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, cgroups, LKML,
	linux-mm, Cliff Spradlin, Mel Gorman

On Fri, Jul 28, 2017 at 7:05 AM, Vlastimil Babka <vbabka@suse.cz> wrote:
> On 07/28/2017 11:30 AM, Peter Zijlstra wrote:
>> On Fri, Jul 28, 2017 at 09:45:16AM +0200, Vlastimil Babka wrote:
>>> [+CC PeterZ]
>>>
>>> On 07/27/2017 06:46 PM, Dima Zavin wrote:
>>>> In codepaths that use the begin/retry interface for reading
>>>> mems_allowed_seq with irqs disabled, there exists a race condition that
>>>> stalls the patch process after only modifying a subset of the
>>>> static_branch call sites.
>>>>
>>>> This problem manifested itself as a deadlock in the slub
>>>> allocator, inside get_any_partial. The loop reads the
>>>> mems_allowed_seq value (via read_mems_allowed_begin),
>>>> performs the defrag operation, and then verifies the consistency
>>>> of mems_allowed via read_mems_allowed_retry and the cookie
>>>> returned by xxx_begin. The issue here is that both begin and retry
>>>> first check if cpusets are enabled via the cpusets_enabled() static branch.
>>>> This branch can be rewritten dynamically (via cpuset_inc) if a new
>>>> cpuset is created. The x86 jump label code fully synchronizes across
>>>> all CPUs for every entry it rewrites. If it rewrites only one of the
>>>> callsites (specifically the one in read_mems_allowed_retry) and then
>>>> waits for the smp_call_function(do_sync_core) to complete while a CPU is
>>>> inside the begin/retry section with IRQs off and the mems_allowed value
>>>> is changed, we can hang. This is because begin() will always return 0
>>>> (since it wasn't patched yet) while retry() will test the 0 against
>>>> the actual value of the seq counter.
>>>
>>> Hm, I wonder if there are other static branch users potentially having a
>>> similar problem. Then it would be best to fix this at the static branch
>>> level. Any idea, Peter? An inelegant solution would be to have a way to
>>> indicate static_branch_(un)likely() callsite ordering for the patching. I.e.
>>> here we would make sure that read_mems_allowed_begin() callsites are
>>> patched before read_mems_allowed_retry() when enabling the static key,
>>> and the opposite order when disabling the static key.
>>
>> I'm not aware of any other such ordering requirements. But you can
>> manually create this order by using 2 static keys. Then flip them in the
>> desired order.
>
> Right, thanks for the suggestion. I think that would be preferable to
> complicating the cookie handling. Add a new key next to
> cpusets_enabled_key, let's say "cpusets_enabled_pre_key". Make
> read_mems_allowed_begin() check this key instead of cpusets_enabled().
> Change cpuset_inc/dec to also inc/dec this new key in the right order
> and that should be it. Dima, can you try that or should I?

Yeah, I like that approach much better. I'll re-spin a new version in a bit.

--Dima

>
> Thanks,
> Vlastimil
>
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
  2017-07-27 16:46   ` [PATCH v2] cpuset: " Dima Zavin
                       ` (2 preceding siblings ...)
  2017-07-28  7:45     ` Vlastimil Babka
@ 2017-07-29  4:56     ` kbuild test robot
  3 siblings, 0 replies; 18+ messages in thread
From: kbuild test robot @ 2017-07-29  4:56 UTC (permalink / raw)
  To: Dima Zavin
  Cc: kbuild-all, Christopher Lameter, Li Zefan, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, cgroups,
	linux-kernel, linux-mm, Cliff Spradlin, Mel Gorman

[-- Attachment #1: Type: text/plain, Size: 1827 bytes --]

Hi Dima,

[auto build test WARNING on v4.12]
[cannot apply to cgroup/for-next linus/master v4.13-rc2 v4.13-rc1 next-20170728]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Dima-Zavin/cpuset-fix-a-deadlock-due-to-incomplete-patching-of-cpusets_enabled/20170729-123852
config: i386-randconfig-x019-201730 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   In file included from kernel/sched/core.c:13:0:
   include/linux/cpuset.h: In function 'read_mems_allowed_begin':
>> include/linux/cpuset.h:261:9: warning: 'return' with a value, in function returning void
     return 0;
            ^
   include/linux/cpuset.h:259:20: note: declared here
    static inline void read_mems_allowed_begin(struct cpuset_mems_cookie *cookie)
                       ^~~~~~~~~~~~~~~~~~~~~~~

vim +/return +261 include/linux/cpuset.h

58568d2a Miao Xie   2009-06-16  258  
6b828cc9 Dima Zavin 2017-07-27  259  static inline void read_mems_allowed_begin(struct cpuset_mems_cookie *cookie)
c0ff7453 Miao Xie   2010-05-24  260  {
cc9a6c87 Mel Gorman 2012-03-21 @261  	return 0;
c0ff7453 Miao Xie   2010-05-24  262  }
c0ff7453 Miao Xie   2010-05-24  263  

:::::: The code at line 261 was first introduced by commit
:::::: cc9a6c8776615f9c194ccf0b63a0aa5628235545 cpuset: mm: reduce large amounts of memory barrier related damage v3

:::::: TO: Mel Gorman <mgorman@suse.de>
:::::: CC: Linus Torvalds <torvalds@linux-foundation.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 30137 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v3] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
  2017-07-28 14:05         ` Vlastimil Babka
  2017-07-28 16:52           ` Dima Zavin
@ 2017-07-31  4:01           ` Dima Zavin
  2017-07-31  4:04             ` Dima Zavin
  2017-07-31  8:02             ` Vlastimil Babka
  1 sibling, 2 replies; 18+ messages in thread
From: Dima Zavin @ 2017-07-31  4:01 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Peter Zijlstra, Christopher Lameter, Li Zefan, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, cgroups,
	linux-kernel, linux-mm, Mel Gorman, Cliff Spradlin

In codepaths that use the begin/retry interface for reading
mems_allowed_seq with irqs disabled, there exists a race condition that
stalls the patch process after only modifying a subset of the
static_branch call sites.

This problem manifested itself as a deadlock in the slub
allocator, inside get_any_partial. The loop reads the
mems_allowed_seq value (via read_mems_allowed_begin),
performs the defrag operation, and then verifies the consistency
of mems_allowed via read_mems_allowed_retry and the cookie
returned by xxx_begin. The issue here is that both begin and retry
first check if cpusets are enabled via the cpusets_enabled() static branch.
This branch can be rewritten dynamically (via cpuset_inc) if a new
cpuset is created. The x86 jump label code fully synchronizes across
all CPUs for every entry it rewrites. If it rewrites only one of the
callsites (specifically the one in read_mems_allowed_retry) and then
waits for the smp_call_function(do_sync_core) to complete while a CPU is
inside the begin/retry section with IRQs off and the mems_allowed value
is changed, we can hang. This is because begin() will always return 0
(since it wasn't patched yet) while retry() will test the 0 against
the actual value of the seq counter.

The fix is to use two different static keys: one for begin
(pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we
first bump the pre_enable key to ensure that read_mems_allowed_begin()
always returns a valid seqcount if we are enabling cpusets. Similarly,
when disabling cpusets via cpuset_dec(), we first ensure that callers
of read_mems_allowed_retry() will start ignoring the seqcount
value before we let read_mems_allowed_begin() return 0.

The relevant stack traces of the two stuck threads:

  CPU: 1 PID: 1415 Comm: mkdir Tainted: G L  4.9.36-00104-g540c51286237 #4
  Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
  task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
  RIP: smp_call_function_many+0x1f9/0x260
  Call Trace:
    ? setup_data_read+0xa0/0xa0
    ? ___slab_alloc+0x28b/0x5a0
    smp_call_function+0x3b/0x70
    ? setup_data_read+0xa0/0xa0
    on_each_cpu+0x2f/0x90
    ? ___slab_alloc+0x28a/0x5a0
    ? ___slab_alloc+0x28b/0x5a0
    text_poke_bp+0x87/0xd0
    ? ___slab_alloc+0x28a/0x5a0
    arch_jump_label_transform+0x93/0x100
    __jump_label_update+0x77/0x90
    jump_label_update+0xaa/0xc0
    static_key_slow_inc+0x9e/0xb0
    cpuset_css_online+0x70/0x2e0
    online_css+0x2c/0xa0
    cgroup_apply_control_enable+0x27f/0x3d0
    cgroup_mkdir+0x2b7/0x420
    kernfs_iop_mkdir+0x5a/0x80
    vfs_mkdir+0xf6/0x1a0
    SyS_mkdir+0xb7/0xe0
    entry_SYSCALL_64_fastpath+0x18/0xad

  ...

  CPU: 2 PID: 1 Comm: init Tainted: G L  4.9.36-00104-g540c51286237 #4
  Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
  task: ffff8818087c0000 task.stack: ffffc90000030000
  RIP: int3+0x39/0x70
  Call Trace:
    <#DB> ? ___slab_alloc+0x28b/0x5a0
    <EOE> ? copy_process.part.40+0xf7/0x1de0
    ? __slab_alloc.isra.80+0x54/0x90
    ? copy_process.part.40+0xf7/0x1de0
    ? copy_process.part.40+0xf7/0x1de0
    ? kmem_cache_alloc_node+0x8a/0x280
    ? copy_process.part.40+0xf7/0x1de0
    ? _do_fork+0xe7/0x6c0
    ? _raw_spin_unlock_irq+0x2d/0x60
    ? trace_hardirqs_on_caller+0x136/0x1d0
    ? entry_SYSCALL_64_fastpath+0x5/0xad
    ? do_syscall_64+0x27/0x350
    ? SyS_clone+0x19/0x20
    ? do_syscall_64+0x60/0x350
    ? entry_SYSCALL64_slow_path+0x25/0x25

Reported-by: Cliff Spradlin <cspradlin@waymo.com>
Signed-off-by: Dima Zavin <dmitriyz@waymo.com>
---

v3:
 - Changed the implementation based on Peter Zijlstra's suggestion. Now
   using two keys for begin/retry instead of hacking the state into the
   cookie.
 - Rebased and tested on top of v4.13-rc3.

v4:
 - Moved the cached cpusets_enabled() state into the cookie, turned
   the cookie into a struct and updated all the other call sites.
 - Applied on top of v4.12 since one of the callers in page_alloc.c changed.
   Still only tested on v4.9.36 and compile tested against v4.12.

 include/linux/cpuset.h | 19 +++++++++++++++++--
 kernel/cgroup/cpuset.c |  1 +
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 119a3f9604b0..e5a684c04c70 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -18,6 +18,19 @@
 
 #ifdef CONFIG_CPUSETS
 
+/*
+ * Static branch rewrites can happen in an arbitrary order for a given
+ * key. In code paths where we need to loop with read_mems_allowed_begin() and
+ * read_mems_allowed_retry() to get a consistent view of mems_allowed, we need
+ * to ensure that begin() always gets rewritten before retry() in the
+ * disabled -> enabled transition. If not, then if local irqs are disabled
+ * around the loop, we can deadlock since retry() would always be
+ * comparing the latest value of the mems_allowed seqcount against 0 as
+ * begin() still would see cpusets_enabled() as false. The enabled -> disabled
+ * transition should happen in reverse order for the same reasons (want to stop
+ * looking at real value of mems_allowed.sequence in retry() first).
+ */
+extern struct static_key_false cpusets_pre_enable_key;
 extern struct static_key_false cpusets_enabled_key;
 static inline bool cpusets_enabled(void)
 {
@@ -32,12 +45,14 @@ static inline int nr_cpusets(void)
 
 static inline void cpuset_inc(void)
 {
+	static_branch_inc(&cpusets_pre_enable_key);
 	static_branch_inc(&cpusets_enabled_key);
 }
 
 static inline void cpuset_dec(void)
 {
 	static_branch_dec(&cpusets_enabled_key);
+	static_branch_dec(&cpusets_pre_enable_key);
 }
 
 extern int cpuset_init(void);
@@ -115,7 +130,7 @@ extern void cpuset_print_current_mems_allowed(void);
  */
 static inline unsigned int read_mems_allowed_begin(void)
 {
-	if (!cpusets_enabled())
+	if (!static_branch_unlikely(&cpusets_pre_enable_key))
 		return 0;
 
 	return read_seqcount_begin(&current->mems_allowed_seq);
@@ -129,7 +144,7 @@ static inline unsigned int read_mems_allowed_begin(void)
  */
 static inline bool read_mems_allowed_retry(unsigned int seq)
 {
-	if (!cpusets_enabled())
+	if (!static_branch_unlikely(&cpusets_enabled_key))
 		return false;
 
 	return read_seqcount_retry(&current->mems_allowed_seq, seq);
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index ca8376e5008c..8d5151688504 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -63,6 +63,7 @@
 #include <linux/cgroup.h>
 #include <linux/wait.h>
 
+DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
 DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
 
 /* See "Frequency meter" comments, below. */
-- 
2.14.0.rc0.400.g1c36432dff-goog

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
  2017-07-31  4:01           ` [PATCH v3] " Dima Zavin
@ 2017-07-31  4:04             ` Dima Zavin
  2017-07-31  8:02             ` Vlastimil Babka
  1 sibling, 0 replies; 18+ messages in thread
From: Dima Zavin @ 2017-07-31  4:04 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Peter Zijlstra, Christopher Lameter, Li Zefan, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, cgroups, LKML,
	linux-mm, Mel Gorman, Cliff Spradlin

On Sun, Jul 30, 2017 at 9:01 PM, Dima Zavin <dmitriyz@waymo.com> wrote:
> In codepaths that use the begin/retry interface for reading
> mems_allowed_seq with irqs disabled, there exists a race condition that
> stalls the patch process after only modifying a subset of the
> static_branch call sites.
>
> This problem manifested itself as a deadlock in the slub
> allocator, inside get_any_partial. The loop reads the
> mems_allowed_seq value (via read_mems_allowed_begin),
> performs the defrag operation, and then verifies the consistency
> of mems_allowed via read_mems_allowed_retry and the cookie
> returned by xxx_begin. The issue here is that both begin and retry
> first check if cpusets are enabled via the cpusets_enabled() static branch.
> This branch can be rewritten dynamically (via cpuset_inc) if a new
> cpuset is created. The x86 jump label code fully synchronizes across
> all CPUs for every entry it rewrites. If it rewrites only one of the
> callsites (specifically the one in read_mems_allowed_retry) and then
> waits for the smp_call_function(do_sync_core) to complete while a CPU is
> inside the begin/retry section with IRQs off and the mems_allowed value
> is changed, we can hang. This is because begin() will always return 0
> (since it wasn't patched yet) while retry() will test the 0 against
> the actual value of the seq counter.
>
> The fix is to use two different static keys: one for begin
> (pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we
> first bump the pre_enable key to ensure that read_mems_allowed_begin()
> always returns a valid seqcount if we are enabling cpusets. Similarly,
> when disabling cpusets via cpuset_dec(), we first ensure that callers
> of read_mems_allowed_retry() will start ignoring the seqcount
> value before we let read_mems_allowed_begin() return 0.
>
> The relevant stack traces of the two stuck threads:
>
>   CPU: 1 PID: 1415 Comm: mkdir Tainted: G L  4.9.36-00104-g540c51286237 #4
>   Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
>   task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
>   RIP: smp_call_function_many+0x1f9/0x260
>   Call Trace:
>     ? setup_data_read+0xa0/0xa0
>     ? ___slab_alloc+0x28b/0x5a0
>     smp_call_function+0x3b/0x70
>     ? setup_data_read+0xa0/0xa0
>     on_each_cpu+0x2f/0x90
>     ? ___slab_alloc+0x28a/0x5a0
>     ? ___slab_alloc+0x28b/0x5a0
>     text_poke_bp+0x87/0xd0
>     ? ___slab_alloc+0x28a/0x5a0
>     arch_jump_label_transform+0x93/0x100
>     __jump_label_update+0x77/0x90
>     jump_label_update+0xaa/0xc0
>     static_key_slow_inc+0x9e/0xb0
>     cpuset_css_online+0x70/0x2e0
>     online_css+0x2c/0xa0
>     cgroup_apply_control_enable+0x27f/0x3d0
>     cgroup_mkdir+0x2b7/0x420
>     kernfs_iop_mkdir+0x5a/0x80
>     vfs_mkdir+0xf6/0x1a0
>     SyS_mkdir+0xb7/0xe0
>     entry_SYSCALL_64_fastpath+0x18/0xad
>
>   ...
>
>   CPU: 2 PID: 1 Comm: init Tainted: G L  4.9.36-00104-g540c51286237 #4
>   Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
>   task: ffff8818087c0000 task.stack: ffffc90000030000
>   RIP: int3+0x39/0x70
>   Call Trace:
>     <#DB> ? ___slab_alloc+0x28b/0x5a0
>     <EOE> ? copy_process.part.40+0xf7/0x1de0
>     ? __slab_alloc.isra.80+0x54/0x90
>     ? copy_process.part.40+0xf7/0x1de0
>     ? copy_process.part.40+0xf7/0x1de0
>     ? kmem_cache_alloc_node+0x8a/0x280
>     ? copy_process.part.40+0xf7/0x1de0
>     ? _do_fork+0xe7/0x6c0
>     ? _raw_spin_unlock_irq+0x2d/0x60
>     ? trace_hardirqs_on_caller+0x136/0x1d0
>     ? entry_SYSCALL_64_fastpath+0x5/0xad
>     ? do_syscall_64+0x27/0x350
>     ? SyS_clone+0x19/0x20
>     ? do_syscall_64+0x60/0x350
>     ? entry_SYSCALL64_slow_path+0x25/0x25
>
> Reported-by: Cliff Spradlin <cspradlin@waymo.com>
> Signed-off-by: Dima Zavin <dmitriyz@waymo.com>
> ---
>
> v3:
>  - Changed the implementation based on Peter Zijlstra's suggestion. Now
>    using two keys for begin/retry instead of hacking the state into the
>    cookie.
>  - Rebased and tested on top of v4.13-rc3.
>
> v4:

Doh, latest patch is v3, I obviously meant v2 here instead of v4. Sigh. Sorry.

--Dima

>  - Moved the cached cpusets_enabled() state into the cookie, turned
>    the cookie into a struct and updated all the other call sites.
>  - Applied on top of v4.12 since one of the callers in page_alloc.c changed.
>    Still only tested on v4.9.36 and compile tested against v4.12.
>
>  include/linux/cpuset.h | 19 +++++++++++++++++--
>  kernel/cgroup/cpuset.c |  1 +
>  2 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 119a3f9604b0..e5a684c04c70 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -18,6 +18,19 @@
>
>  #ifdef CONFIG_CPUSETS
>
> +/*
> + * Static branch rewrites can happen in an arbitrary order for a given
> + * key. In code paths where we need to loop with read_mems_allowed_begin() and
> + * read_mems_allowed_retry() to get a consistent view of mems_allowed, we need
> + * to ensure that begin() always gets rewritten before retry() in the
> + * disabled -> enabled transition. If not, then if local irqs are disabled
> + * around the loop, we can deadlock since retry() would always be
> + * comparing the latest value of the mems_allowed seqcount against 0 as
> + * begin() still would see cpusets_enabled() as false. The enabled -> disabled
> + * transition should happen in reverse order for the same reasons (want to stop
> + * looking at real value of mems_allowed.sequence in retry() first).
> + */
> +extern struct static_key_false cpusets_pre_enable_key;
>  extern struct static_key_false cpusets_enabled_key;
>  static inline bool cpusets_enabled(void)
>  {
> @@ -32,12 +45,14 @@ static inline int nr_cpusets(void)
>
>  static inline void cpuset_inc(void)
>  {
> +       static_branch_inc(&cpusets_pre_enable_key);
>         static_branch_inc(&cpusets_enabled_key);
>  }
>
>  static inline void cpuset_dec(void)
>  {
>         static_branch_dec(&cpusets_enabled_key);
> +       static_branch_dec(&cpusets_pre_enable_key);
>  }
>
>  extern int cpuset_init(void);
> @@ -115,7 +130,7 @@ extern void cpuset_print_current_mems_allowed(void);
>   */
>  static inline unsigned int read_mems_allowed_begin(void)
>  {
> -       if (!cpusets_enabled())
> +       if (!static_branch_unlikely(&cpusets_pre_enable_key))
>                 return 0;
>
>         return read_seqcount_begin(&current->mems_allowed_seq);
> @@ -129,7 +144,7 @@ static inline unsigned int read_mems_allowed_begin(void)
>   */
>  static inline bool read_mems_allowed_retry(unsigned int seq)
>  {
> -       if (!cpusets_enabled())
> +       if (!static_branch_unlikely(&cpusets_enabled_key))
>                 return false;
>
>         return read_seqcount_retry(&current->mems_allowed_seq, seq);
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index ca8376e5008c..8d5151688504 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -63,6 +63,7 @@
>  #include <linux/cgroup.h>
>  #include <linux/wait.h>
>
> +DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
>  DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
>
>  /* See "Frequency meter" comments, below. */
> --
> 2.14.0.rc0.400.g1c36432dff-goog
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
  2017-07-31  4:01           ` [PATCH v3] " Dima Zavin
  2017-07-31  4:04             ` Dima Zavin
@ 2017-07-31  8:02             ` Vlastimil Babka
  2017-07-31  9:05               ` Dima Zavin
  1 sibling, 1 reply; 18+ messages in thread
From: Vlastimil Babka @ 2017-07-31  8:02 UTC (permalink / raw)
  To: Dima Zavin
  Cc: Peter Zijlstra, Christopher Lameter, Li Zefan, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, cgroups,
	linux-kernel, linux-mm, Mel Gorman, Cliff Spradlin

On 07/31/2017 06:01 AM, Dima Zavin wrote:
> In codepaths that use the begin/retry interface for reading
> mems_allowed_seq with irqs disabled, there exists a race condition that
> stalls the patch process after only modifying a subset of the
> static_branch call sites.
> 
> This problem manifested itself as a deadlock in the slub
> allocator, inside get_any_partial. The loop reads the
> mems_allowed_seq value (via read_mems_allowed_begin),
> performs the defrag operation, and then verifies the consistency
> of mems_allowed via read_mems_allowed_retry and the cookie
> returned by xxx_begin. The issue here is that both begin and retry
> first check if cpusets are enabled via the cpusets_enabled() static branch.
> This branch can be rewritten dynamically (via cpuset_inc) if a new
> cpuset is created. The x86 jump label code fully synchronizes across
> all CPUs for every entry it rewrites. If it rewrites only one of the
> callsites (specifically the one in read_mems_allowed_retry) and then
> waits for the smp_call_function(do_sync_core) to complete while a CPU is
> inside the begin/retry section with IRQs off and the mems_allowed value
> is changed, we can hang. This is because begin() will always return 0
> (since it wasn't patched yet) while retry() will test the 0 against
> the actual value of the seq counter.
> 
> The fix is to use two different static keys: one for begin
> (pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we
> first bump the pre_enable key to ensure that read_mems_allowed_begin()
> always returns a valid seqcount if we are enabling cpusets. Similarly,
> when disabling cpusets via cpuset_dec(), we first ensure that callers
> of read_mems_allowed_retry() will start ignoring the seqcount
> value before we let read_mems_allowed_begin() return 0.
> 
> The relevant stack traces of the two stuck threads:
> 
>   CPU: 1 PID: 1415 Comm: mkdir Tainted: G L  4.9.36-00104-g540c51286237 #4
>   Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
>   task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
>   RIP: smp_call_function_many+0x1f9/0x260
>   Call Trace:
>     ? setup_data_read+0xa0/0xa0
>     ? ___slab_alloc+0x28b/0x5a0
>     smp_call_function+0x3b/0x70
>     ? setup_data_read+0xa0/0xa0
>     on_each_cpu+0x2f/0x90
>     ? ___slab_alloc+0x28a/0x5a0
>     ? ___slab_alloc+0x28b/0x5a0
>     text_poke_bp+0x87/0xd0
>     ? ___slab_alloc+0x28a/0x5a0
>     arch_jump_label_transform+0x93/0x100
>     __jump_label_update+0x77/0x90
>     jump_label_update+0xaa/0xc0
>     static_key_slow_inc+0x9e/0xb0
>     cpuset_css_online+0x70/0x2e0
>     online_css+0x2c/0xa0
>     cgroup_apply_control_enable+0x27f/0x3d0
>     cgroup_mkdir+0x2b7/0x420
>     kernfs_iop_mkdir+0x5a/0x80
>     vfs_mkdir+0xf6/0x1a0
>     SyS_mkdir+0xb7/0xe0
>     entry_SYSCALL_64_fastpath+0x18/0xad
> 
>   ...
> 
>   CPU: 2 PID: 1 Comm: init Tainted: G L  4.9.36-00104-g540c51286237 #4
>   Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
>   task: ffff8818087c0000 task.stack: ffffc90000030000
>   RIP: int3+0x39/0x70
>   Call Trace:
>     <#DB> ? ___slab_alloc+0x28b/0x5a0
>     <EOE> ? copy_process.part.40+0xf7/0x1de0
>     ? __slab_alloc.isra.80+0x54/0x90
>     ? copy_process.part.40+0xf7/0x1de0
>     ? copy_process.part.40+0xf7/0x1de0
>     ? kmem_cache_alloc_node+0x8a/0x280
>     ? copy_process.part.40+0xf7/0x1de0
>     ? _do_fork+0xe7/0x6c0
>     ? _raw_spin_unlock_irq+0x2d/0x60
>     ? trace_hardirqs_on_caller+0x136/0x1d0
>     ? entry_SYSCALL_64_fastpath+0x5/0xad
>     ? do_syscall_64+0x27/0x350
>     ? SyS_clone+0x19/0x20
>     ? do_syscall_64+0x60/0x350
>     ? entry_SYSCALL64_slow_path+0x25/0x25
> 
> Reported-by: Cliff Spradlin <cspradlin@waymo.com>
> Signed-off-by: Dima Zavin <dmitriyz@waymo.com>

Looks good. Could you verify it fixes the issue, or was it too hard to
reproduce? Also is this a stable candidate patch, and can you identify
an exact commit hash it fixes?

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
> 
> v3:
>  - Changed the implementation based on Peter Zijlstra's suggestion. Now
>    using two keys for begin/retry instead of hacking the state into the
>    cookie.
>  - Rebased and tested on top of v4.13-rc3.
> 
> v4:
>  - Moved the cached cpusets_enabled() state into the cookie, turned
>    the cookie into a struct and updated all the other call sites.
>  - Applied on top of v4.12 since one of the callers in page_alloc.c changed.
>    Still only tested on v4.9.36 and compile tested against v4.12.
> 
>  include/linux/cpuset.h | 19 +++++++++++++++++--
>  kernel/cgroup/cpuset.c |  1 +
>  2 files changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 119a3f9604b0..e5a684c04c70 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -18,6 +18,19 @@
>  
>  #ifdef CONFIG_CPUSETS
>  
> +/*
> + * Static branch rewrites can happen in an arbitrary order for a given
> + * key. In code paths where we need to loop with read_mems_allowed_begin() and
> + * read_mems_allowed_retry() to get a consistent view of mems_allowed, we need
> + * to ensure that begin() always gets rewritten before retry() in the
> + * disabled -> enabled transition. If not, then if local irqs are disabled
> + * around the loop, we can deadlock since retry() would always be
> + * comparing the latest value of the mems_allowed seqcount against 0 as
> + * begin() still would see cpusets_enabled() as false. The enabled -> disabled
> + * transition should happen in reverse order for the same reasons (want to stop
> + * looking at real value of mems_allowed.sequence in retry() first).
> + */
> +extern struct static_key_false cpusets_pre_enable_key;
>  extern struct static_key_false cpusets_enabled_key;
>  static inline bool cpusets_enabled(void)
>  {
> @@ -32,12 +45,14 @@ static inline int nr_cpusets(void)
>  
>  static inline void cpuset_inc(void)
>  {
> +	static_branch_inc(&cpusets_pre_enable_key);
>  	static_branch_inc(&cpusets_enabled_key);
>  }
>  
>  static inline void cpuset_dec(void)
>  {
>  	static_branch_dec(&cpusets_enabled_key);
> +	static_branch_dec(&cpusets_pre_enable_key);
>  }
>  
>  extern int cpuset_init(void);
> @@ -115,7 +130,7 @@ extern void cpuset_print_current_mems_allowed(void);
>   */
>  static inline unsigned int read_mems_allowed_begin(void)
>  {
> -	if (!cpusets_enabled())
> +	if (!static_branch_unlikely(&cpusets_pre_enable_key))
>  		return 0;
>  
>  	return read_seqcount_begin(&current->mems_allowed_seq);
> @@ -129,7 +144,7 @@ static inline unsigned int read_mems_allowed_begin(void)
>   */
>  static inline bool read_mems_allowed_retry(unsigned int seq)
>  {
> -	if (!cpusets_enabled())
> +	if (!static_branch_unlikely(&cpusets_enabled_key))
>  		return false;
>  
>  	return read_seqcount_retry(&current->mems_allowed_seq, seq);
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index ca8376e5008c..8d5151688504 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -63,6 +63,7 @@
>  #include <linux/cgroup.h>
>  #include <linux/wait.h>
>  
> +DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
>  DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
>  
>  /* See "Frequency meter" comments, below. */
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
  2017-07-31  8:02             ` Vlastimil Babka
@ 2017-07-31  9:05               ` Dima Zavin
  0 siblings, 0 replies; 18+ messages in thread
From: Dima Zavin @ 2017-07-31  9:05 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Peter Zijlstra, Christopher Lameter, Li Zefan, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, cgroups, LKML,
	linux-mm, Mel Gorman, Cliff Spradlin

On Mon, Jul 31, 2017 at 1:02 AM, Vlastimil Babka <vbabka@suse.cz> wrote:
> On 07/31/2017 06:01 AM, Dima Zavin wrote:
>> In codepaths that use the begin/retry interface for reading
>> mems_allowed_seq with irqs disabled, there exists a race condition that
>> stalls the patch process after only modifying a subset of the
>> static_branch call sites.
>>
>> This problem manifested itself as a deadlock in the slub
>> allocator, inside get_any_partial. The loop reads the
>> mems_allowed_seq value (via read_mems_allowed_begin),
>> performs the defrag operation, and then verifies the consistency
>> of mems_allowed via read_mems_allowed_retry and the cookie
>> returned by xxx_begin. The issue here is that both begin and retry
>> first check if cpusets are enabled via the cpusets_enabled() static branch.
>> This branch can be rewritten dynamically (via cpuset_inc) if a new
>> cpuset is created. The x86 jump label code fully synchronizes across
>> all CPUs for every entry it rewrites. If it rewrites only one of the
>> callsites (specifically the one in read_mems_allowed_retry) and then
>> waits for the smp_call_function(do_sync_core) to complete while a CPU is
>> inside the begin/retry section with IRQs off and the mems_allowed value
>> is changed, we can hang. This is because begin() will always return 0
>> (since it wasn't patched yet) while retry() will test the 0 against
>> the actual value of the seq counter.
>>
>> The fix is to use two different static keys: one for begin
>> (pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we
>> first bump the pre_enable key to ensure that read_mems_allowed_begin()
>> always returns a valid seqcount if we are enabling cpusets. Similarly,
>> when disabling cpusets via cpuset_dec(), we first ensure that callers
>> of read_mems_allowed_retry() will start ignoring the seqcount
>> value before we let read_mems_allowed_begin() return 0.
>>
>> The relevant stack traces of the two stuck threads:
>>
>>   CPU: 1 PID: 1415 Comm: mkdir Tainted: G L  4.9.36-00104-g540c51286237 #4
>>   Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
>>   task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
>>   RIP: smp_call_function_many+0x1f9/0x260
>>   Call Trace:
>>     ? setup_data_read+0xa0/0xa0
>>     ? ___slab_alloc+0x28b/0x5a0
>>     smp_call_function+0x3b/0x70
>>     ? setup_data_read+0xa0/0xa0
>>     on_each_cpu+0x2f/0x90
>>     ? ___slab_alloc+0x28a/0x5a0
>>     ? ___slab_alloc+0x28b/0x5a0
>>     text_poke_bp+0x87/0xd0
>>     ? ___slab_alloc+0x28a/0x5a0
>>     arch_jump_label_transform+0x93/0x100
>>     __jump_label_update+0x77/0x90
>>     jump_label_update+0xaa/0xc0
>>     static_key_slow_inc+0x9e/0xb0
>>     cpuset_css_online+0x70/0x2e0
>>     online_css+0x2c/0xa0
>>     cgroup_apply_control_enable+0x27f/0x3d0
>>     cgroup_mkdir+0x2b7/0x420
>>     kernfs_iop_mkdir+0x5a/0x80
>>     vfs_mkdir+0xf6/0x1a0
>>     SyS_mkdir+0xb7/0xe0
>>     entry_SYSCALL_64_fastpath+0x18/0xad
>>
>>   ...
>>
>>   CPU: 2 PID: 1 Comm: init Tainted: G L  4.9.36-00104-g540c51286237 #4
>>   Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
>>   task: ffff8818087c0000 task.stack: ffffc90000030000
>>   RIP: int3+0x39/0x70
>>   Call Trace:
>>     <#DB> ? ___slab_alloc+0x28b/0x5a0
>>     <EOE> ? copy_process.part.40+0xf7/0x1de0
>>     ? __slab_alloc.isra.80+0x54/0x90
>>     ? copy_process.part.40+0xf7/0x1de0
>>     ? copy_process.part.40+0xf7/0x1de0
>>     ? kmem_cache_alloc_node+0x8a/0x280
>>     ? copy_process.part.40+0xf7/0x1de0
>>     ? _do_fork+0xe7/0x6c0
>>     ? _raw_spin_unlock_irq+0x2d/0x60
>>     ? trace_hardirqs_on_caller+0x136/0x1d0
>>     ? entry_SYSCALL_64_fastpath+0x5/0xad
>>     ? do_syscall_64+0x27/0x350
>>     ? SyS_clone+0x19/0x20
>>     ? do_syscall_64+0x60/0x350
>>     ? entry_SYSCALL64_slow_path+0x25/0x25
>>
>> Reported-by: Cliff Spradlin <cspradlin@waymo.com>
>> Signed-off-by: Dima Zavin <dmitriyz@waymo.com>
>
> Looks good. Could you verify it fixes the issue, or was it too hard to
> reproduce? Also is this a stable candidate patch, and can you identify
> an exact commit hash it fixes?

It's tough to reproduce but this fix works as well as the original
hack. I think the problematic commit must have been:

commit 46e700abc44ce215acb4341d9702ce3972eda571
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri Nov 6 16:28:15 2015 -0800

    mm, page_alloc: remove unnecessary taking of a seqlock when
    cpusets are disabled

It's probably stable worthy, but I don't know who can make the call on that.
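
For reference, the shape of the loop that can wedge is the irq-disabled
begin/retry pattern; get_any_partial() does roughly this (simplified sketch,
not the literal code):

        unsigned int cpuset_mems_cookie;

        do {
                cpuset_mems_cookie = read_mems_allowed_begin();
                /* walk the zonelist, try to take a partial slab from a node */
        } while (read_mems_allowed_retry(cpuset_mems_cookie));

With irqs off around that, an unpatched begin() keeps handing back 0 while an
already-patched retry() compares it against the live seqcount, so the loop
never exits and the CPU never acks the text_poke_bp() sync IPI.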

Thanks for the reviews!

--Dima

>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>
>> ---
>>
>> v3:
>>  - Changed the implementation based on Peter Zijlstra's suggestion. Now
>>    using two keys for begin/retry instead of hacking the state into the
>>    cookie.
>>  - Rebased and tested on top of v4.13-rc3.
>>
>> v4:
>>  - Moved the cached cpusets_enabled() state into the cookie, turned
>>    the cookie into a struct and updated all the other call sites.
>>  - Applied on top of v4.12 since one of the callers in page_alloc.c changed.
>>    Still only tested on v4.9.36 and compile tested against v4.12.
>>
>>  include/linux/cpuset.h | 19 +++++++++++++++++--
>>  kernel/cgroup/cpuset.c |  1 +
>>  2 files changed, 18 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
>> index 119a3f9604b0..e5a684c04c70 100644
>> --- a/include/linux/cpuset.h
>> +++ b/include/linux/cpuset.h
>> @@ -18,6 +18,19 @@
>>
>>  #ifdef CONFIG_CPUSETS
>>
>> +/*
>> + * Static branch rewrites can happen in an arbitrary order for a given
>> + * key. In code paths where we need to loop with read_mems_allowed_begin() and
>> + * read_mems_allowed_retry() to get a consistent view of mems_allowed, we need
>> + * to ensure that begin() always gets rewritten before retry() in the
>> + * disabled -> enabled transition. If not, then if local irqs are disabled
>> + * around the loop, we can deadlock since retry() would always be
>> + * comparing the latest value of the mems_allowed seqcount against 0 as
>> + * begin() still would see cpusets_enabled() as false. The enabled -> disabled
>> + * transition should happen in reverse order for the same reasons (want to stop
>> + * looking at real value of mems_allowed.sequence in retry() first).
>> + */
>> +extern struct static_key_false cpusets_pre_enable_key;
>>  extern struct static_key_false cpusets_enabled_key;
>>  static inline bool cpusets_enabled(void)
>>  {
>> @@ -32,12 +45,14 @@ static inline int nr_cpusets(void)
>>
>>  static inline void cpuset_inc(void)
>>  {
>> +     static_branch_inc(&cpusets_pre_enable_key);
>>       static_branch_inc(&cpusets_enabled_key);
>>  }
>>
>>  static inline void cpuset_dec(void)
>>  {
>>       static_branch_dec(&cpusets_enabled_key);
>> +     static_branch_dec(&cpusets_pre_enable_key);
>>  }
>>
>>  extern int cpuset_init(void);
>> @@ -115,7 +130,7 @@ extern void cpuset_print_current_mems_allowed(void);
>>   */
>>  static inline unsigned int read_mems_allowed_begin(void)
>>  {
>> -     if (!cpusets_enabled())
>> +     if (!static_branch_unlikely(&cpusets_pre_enable_key))
>>               return 0;
>>
>>       return read_seqcount_begin(&current->mems_allowed_seq);
>> @@ -129,7 +144,7 @@ static inline unsigned int read_mems_allowed_begin(void)
>>   */
>>  static inline bool read_mems_allowed_retry(unsigned int seq)
>>  {
>> -     if (!cpusets_enabled())
>> +     if (!static_branch_unlikely(&cpusets_enabled_key))
>>               return false;
>>
>>       return read_seqcount_retry(&current->mems_allowed_seq, seq);
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index ca8376e5008c..8d5151688504 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -63,6 +63,7 @@
>>  #include <linux/cgroup.h>
>>  #include <linux/wait.h>
>>
>> +DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
>>  DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
>>
>>  /* See "Frequency meter" comments, below. */
>>
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-07-31  9:05 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-26 16:50 [RFC PATCH] mm/slub: fix a deadlock due to incomplete patching of cpusets_enabled() Dima Zavin
2017-07-26 17:02 ` Christopher Lameter
2017-07-26 19:54   ` Dima Zavin
2017-07-27 16:46   ` [PATCH v2] cpuset: " Dima Zavin
2017-07-27 19:48     ` Andrew Morton
2017-07-27 21:41       ` Dima Zavin
2017-07-27 19:51     ` Andrew Morton
2017-07-27 21:41       ` Dima Zavin
2017-07-28  7:45     ` Vlastimil Babka
2017-07-28  8:48       ` Dima Zavin
2017-07-28  9:30       ` Peter Zijlstra
2017-07-28 14:05         ` Vlastimil Babka
2017-07-28 16:52           ` Dima Zavin
2017-07-31  4:01           ` [PATCH v3] " Dima Zavin
2017-07-31  4:04             ` Dima Zavin
2017-07-31  8:02             ` Vlastimil Babka
2017-07-31  9:05               ` Dima Zavin
2017-07-29  4:56     ` [PATCH v2] " kbuild test robot
