* [RFC PATCH] mm/slub: fix a deadlock due to incomplete patching of cpusets_enabled() @ 2017-07-26 16:50 Dima Zavin 2017-07-26 17:02 ` Christopher Lameter 0 siblings, 1 reply; 18+ messages in thread From: Dima Zavin @ 2017-07-26 16:50 UTC (permalink / raw) To: Mel Gorman, Andrew Morton Cc: Li Zefan, Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim, cgroups, linux-kernel, linux-mm, Cliff Spradlin In codepaths that use the begin/retry interface for reading mems_allowed_seq with irqs disabled, there exists a race condition that stalls the patch process after only modifying a subset of the static_branch call sites. This problem manifested itself as a dead lock in the slub allocator, inside get_any_partial. The loop reads mems_allowed_seq value (via read_mems_allowed_begin), performs the defrag operation, and then verifies the consistency of mem_allowed via the read_mems_allowed_retry and the cookie returned by xxx_begin. The issue here is that both begin and retry first check if cpusets are enabled via cpusets_enabled() static branch. This branch can be rewritted dynamically (via cpuset_inc) if a new cpuset is created. The x86 jump label code fully synchronizes across all CPUs for every entry it rewrites. If it rewrites only one of the callsites (specifically the one in read_mems_allowed_retry) and then waits for the smp_call_function(do_sync_core) to complete while a CPU is inside the begin/retry section with IRQs off and the mems_allowed value is changed, we can hang. This is because begin() will always return 0 (since it wasn't patched yet) while retry() will test the 0 against the actual value of the seq counter. The fix is to cache the value that's returned by cpusets_enabled() at the top of the loop, and only operate on the seqlock (both begin and retry) if it was true. The relevant stack traces of the two stuck threads: CPU: 107 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4 Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017 task: ffff8817f9c28000 task.stack: ffffc9000ffa4000 RIP: smp_call_function_many+0x1f9/0x260 Call Trace: ? setup_data_read+0xa0/0xa0 ? ___slab_alloc+0x28b/0x5a0 smp_call_function+0x3b/0x70 ? setup_data_read+0xa0/0xa0 on_each_cpu+0x2f/0x90 ? ___slab_alloc+0x28a/0x5a0 ? ___slab_alloc+0x28b/0x5a0 text_poke_bp+0x87/0xd0 ? ___slab_alloc+0x28a/0x5a0 arch_jump_label_transform+0x93/0x100 __jump_label_update+0x77/0x90 jump_label_update+0xaa/0xc0 static_key_slow_inc+0x9e/0xb0 cpuset_css_online+0x70/0x2e0 online_css+0x2c/0xa0 cgroup_apply_control_enable+0x27f/0x3d0 cgroup_mkdir+0x2b7/0x420 kernfs_iop_mkdir+0x5a/0x80 vfs_mkdir+0xf6/0x1a0 SyS_mkdir+0xb7/0xe0 entry_SYSCALL_64_fastpath+0x18/0xad ... CPU: 22 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4 Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017 task: ffff8818087c0000 task.stack: ffffc90000030000 RIP: int3+0x39/0x70 Call Trace: <#DB> ? ___slab_alloc+0x28b/0x5a0 <EOE> ? copy_process.part.40+0xf7/0x1de0 ? __slab_alloc.isra.80+0x54/0x90 ? copy_process.part.40+0xf7/0x1de0 ? copy_process.part.40+0xf7/0x1de0 ? kmem_cache_alloc_node+0x8a/0x280 ? copy_process.part.40+0xf7/0x1de0 ? _do_fork+0xe7/0x6c0 ? _raw_spin_unlock_irq+0x2d/0x60 ? trace_hardirqs_on_caller+0x136/0x1d0 ? entry_SYSCALL_64_fastpath+0x5/0xad ? do_syscall_64+0x27/0x350 ? SyS_clone+0x19/0x20 ? do_syscall_64+0x60/0x350 ? entry_SYSCALL64_slow_path+0x25/0x25 Reported-by: Cliff Spradlin <cspradlin@waymo.com> Signed-off-by: Dima Zavin <dmitriyz@waymo.com> --- We were reproducing the issue here with some regularity on ubuntu 14.04 running v4.9 (v4.9.36 at the time). The patch applies cleanly to 4.12 but was only compile-tested there. This is kind of a hacky solution that solves our immediate issue, but looks like a more general problem and can affect other unsuspecting users of these APIs. I suppose an irqs-off seqlock loop that is optimized away via static_branch rewrites is rare. And, technically, it actually would be ok except for the all-cpu sync in the x86 jump-label code between each entry re-write. I don't know enough about all the implications of changing that, or anything else in this path so I went with a targeted "fix" and rely on the collective wisdom here to sort out what the correct solution to the problem should be. include/linux/cpuset.h | 14 ++++++++++++-- mm/slub.c | 13 +++++++++++-- 2 files changed, 23 insertions(+), 4 deletions(-) diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h index bfc204e70338..2a0f217413c6 100644 --- a/include/linux/cpuset.h +++ b/include/linux/cpuset.h @@ -111,12 +111,17 @@ extern void cpuset_print_current_mems_allowed(void); * causing process failure. A retry loop with read_mems_allowed_begin and * read_mems_allowed_retry prevents these artificial failures. */ +static inline unsigned int raw_read_mems_allowed_begin(void) +{ + return read_seqcount_begin(¤t->mems_allowed_seq); +} + static inline unsigned int read_mems_allowed_begin(void) { if (!cpusets_enabled()) return 0; - return read_seqcount_begin(¤t->mems_allowed_seq); + return raw_read_mems_allowed_begin(); } /* @@ -125,12 +130,17 @@ static inline unsigned int read_mems_allowed_begin(void) * update of mems_allowed. It is up to the caller to retry the operation if * appropriate. */ +static inline bool raw_read_mems_allowed_retry(unsigned int seq) +{ + return read_seqcount_retry(¤t->mems_allowed_seq, seq); +} + static inline bool read_mems_allowed_retry(unsigned int seq) { if (!cpusets_enabled()) return false; - return read_seqcount_retry(¤t->mems_allowed_seq, seq); + return raw_read_mems_allowed_retry(seq); } static inline void set_mems_allowed(nodemask_t nodemask) diff --git a/mm/slub.c b/mm/slub.c index edc79ca3c6d5..7a6c74851250 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1847,6 +1847,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags, enum zone_type high_zoneidx = gfp_zone(flags); void *object; unsigned int cpuset_mems_cookie; + bool csets_enabled; /* * The defrag ratio allows a configuration of the tradeoffs between @@ -1871,7 +1872,14 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags, return NULL; do { - cpuset_mems_cookie = read_mems_allowed_begin(); + if (cpusets_enabled()) { + csets_enabled = true; + cpuset_mems_cookie = raw_read_mems_allowed_begin(); + } else { + csets_enabled = false; + cpuset_mems_cookie = 0; + } + zonelist = node_zonelist(mempolicy_slab_node(), flags); for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { struct kmem_cache_node *n; @@ -1893,7 +1901,8 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags, } } } - } while (read_mems_allowed_retry(cpuset_mems_cookie)); + } while (csets_enabled && + raw_read_mems_allowed_retry(cpuset_mems_cookie)); #endif return NULL; } -- 2.14.0.rc0.400.g1c36432dff-goog ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [RFC PATCH] mm/slub: fix a deadlock due to incomplete patching of cpusets_enabled() 2017-07-26 16:50 [RFC PATCH] mm/slub: fix a deadlock due to incomplete patching of cpusets_enabled() Dima Zavin @ 2017-07-26 17:02 ` Christopher Lameter 2017-07-26 19:54 ` Dima Zavin 2017-07-27 16:46 ` [PATCH v2] cpuset: " Dima Zavin 0 siblings, 2 replies; 18+ messages in thread From: Christopher Lameter @ 2017-07-26 17:02 UTC (permalink / raw) To: Dima Zavin Cc: Mel Gorman, Andrew Morton, Li Zefan, Pekka Enberg, David Rientjes, Joonsoo Kim, cgroups, linux-kernel, linux-mm, Cliff Spradlin On Wed, 26 Jul 2017, Dima Zavin wrote: > The fix is to cache the value that's returned by cpusets_enabled() at the > top of the loop, and only operate on the seqlock (both begin and retry) if > it was true. I think the proper fix would be to ensure that the calls to read_mems_allowed_{begin,retry} cannot cause the deadlock. Otherwise you have to fix this in multiple places. Maybe read_mems_allowed_* can do some form of synchronization or *_retry can implictly rely on the results of cpusets_enabled() by *_begin? ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH] mm/slub: fix a deadlock due to incomplete patching of cpusets_enabled() 2017-07-26 17:02 ` Christopher Lameter @ 2017-07-26 19:54 ` Dima Zavin 2017-07-27 16:46 ` [PATCH v2] cpuset: " Dima Zavin 1 sibling, 0 replies; 18+ messages in thread From: Dima Zavin @ 2017-07-26 19:54 UTC (permalink / raw) To: Christopher Lameter Cc: Mel Gorman, Andrew Morton, Li Zefan, Pekka Enberg, David Rientjes, Joonsoo Kim, cgroups, LKML, linux-mm, Cliff Spradlin On Wed, Jul 26, 2017 at 10:02 AM, Christopher Lameter <cl@linux.com> wrote: > On Wed, 26 Jul 2017, Dima Zavin wrote: > >> The fix is to cache the value that's returned by cpusets_enabled() at the >> top of the loop, and only operate on the seqlock (both begin and retry) if >> it was true. > > I think the proper fix would be to ensure that the calls to > read_mems_allowed_{begin,retry} cannot cause the deadlock. Otherwise you > have to fix this in multiple places. > > Maybe read_mems_allowed_* can do some form of synchronization or *_retry > can implictly rely on the results of cpusets_enabled() by *_begin? > (res-ending because gmail hates me, sorry). Thanks for the quick reply! I can turn the cookie into a uint64, put the sequence into the low order 32 bits and put the enabled state into bit 33 (or 63 :) ). Then retry will not query cpusets_enabled() and will just look at the enabled bit. This means that *_retry will always have a conditional jump (i.e. lose the whole static_branch optimization) but maybe that's ok since that's pretty rare and the *_begin() will still benefit from it? ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() 2017-07-26 17:02 ` Christopher Lameter 2017-07-26 19:54 ` Dima Zavin @ 2017-07-27 16:46 ` Dima Zavin 2017-07-27 19:48 ` Andrew Morton ` (3 more replies) 1 sibling, 4 replies; 18+ messages in thread From: Dima Zavin @ 2017-07-27 16:46 UTC (permalink / raw) To: Christopher Lameter Cc: Li Zefan, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, cgroups, linux-kernel, linux-mm, Cliff Spradlin, Mel Gorman In codepaths that use the begin/retry interface for reading mems_allowed_seq with irqs disabled, there exists a race condition that stalls the patch process after only modifying a subset of the static_branch call sites. This problem manifested itself as a dead lock in the slub allocator, inside get_any_partial. The loop reads mems_allowed_seq value (via read_mems_allowed_begin), performs the defrag operation, and then verifies the consistency of mem_allowed via the read_mems_allowed_retry and the cookie returned by xxx_begin. The issue here is that both begin and retry first check if cpusets are enabled via cpusets_enabled() static branch. This branch can be rewritted dynamically (via cpuset_inc) if a new cpuset is created. The x86 jump label code fully synchronizes across all CPUs for every entry it rewrites. If it rewrites only one of the callsites (specifically the one in read_mems_allowed_retry) and then waits for the smp_call_function(do_sync_core) to complete while a CPU is inside the begin/retry section with IRQs off and the mems_allowed value is changed, we can hang. This is because begin() will always return 0 (since it wasn't patched yet) while retry() will test the 0 against the actual value of the seq counter. The fix is to cache the value that's returned by cpusets_enabled() at the top of the loop, and only operate on the seqcount (both begin and retry) if it was true. The relevant stack traces of the two stuck threads: CPU: 107 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4 Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017 task: ffff8817f9c28000 task.stack: ffffc9000ffa4000 RIP: smp_call_function_many+0x1f9/0x260 Call Trace: ? setup_data_read+0xa0/0xa0 ? ___slab_alloc+0x28b/0x5a0 smp_call_function+0x3b/0x70 ? setup_data_read+0xa0/0xa0 on_each_cpu+0x2f/0x90 ? ___slab_alloc+0x28a/0x5a0 ? ___slab_alloc+0x28b/0x5a0 text_poke_bp+0x87/0xd0 ? ___slab_alloc+0x28a/0x5a0 arch_jump_label_transform+0x93/0x100 __jump_label_update+0x77/0x90 jump_label_update+0xaa/0xc0 static_key_slow_inc+0x9e/0xb0 cpuset_css_online+0x70/0x2e0 online_css+0x2c/0xa0 cgroup_apply_control_enable+0x27f/0x3d0 cgroup_mkdir+0x2b7/0x420 kernfs_iop_mkdir+0x5a/0x80 vfs_mkdir+0xf6/0x1a0 SyS_mkdir+0xb7/0xe0 entry_SYSCALL_64_fastpath+0x18/0xad ... CPU: 22 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4 Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017 task: ffff8818087c0000 task.stack: ffffc90000030000 RIP: int3+0x39/0x70 Call Trace: <#DB> ? ___slab_alloc+0x28b/0x5a0 <EOE> ? copy_process.part.40+0xf7/0x1de0 ? __slab_alloc.isra.80+0x54/0x90 ? copy_process.part.40+0xf7/0x1de0 ? copy_process.part.40+0xf7/0x1de0 ? kmem_cache_alloc_node+0x8a/0x280 ? copy_process.part.40+0xf7/0x1de0 ? _do_fork+0xe7/0x6c0 ? _raw_spin_unlock_irq+0x2d/0x60 ? trace_hardirqs_on_caller+0x136/0x1d0 ? entry_SYSCALL_64_fastpath+0x5/0xad ? do_syscall_64+0x27/0x350 ? SyS_clone+0x19/0x20 ? do_syscall_64+0x60/0x350 ? entry_SYSCALL64_slow_path+0x25/0x25 Reported-by: Cliff Spradlin <cspradlin@waymo.com> Signed-off-by: Dima Zavin <dmitriyz@waymo.com> --- v2: - Moved the cached cpusets_enabled() state into the cookie, turned the cookie into a struct and updated all the other call sites. - Applied on top of v4.12 since one of the callers in page_alloc.c changed. Still only tested on v4.9.36 and compile tested against v4.12. include/linux/cpuset.h | 27 +++++++++++++++++---------- mm/filemap.c | 6 +++--- mm/hugetlb.c | 12 ++++++------ mm/mempolicy.c | 12 ++++++------ mm/page_alloc.c | 8 ++++---- mm/slab.c | 6 +++--- mm/slub.c | 6 +++--- 7 files changed, 42 insertions(+), 35 deletions(-) diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h index 119a3f9604b0..f64f6d3b1dce 100644 --- a/include/linux/cpuset.h +++ b/include/linux/cpuset.h @@ -16,6 +16,11 @@ #include <linux/mm.h> #include <linux/jump_label.h> +struct cpuset_mems_cookie { + unsigned int seq; + bool was_enabled; +}; + #ifdef CONFIG_CPUSETS extern struct static_key_false cpusets_enabled_key; @@ -113,12 +118,15 @@ extern void cpuset_print_current_mems_allowed(void); * causing process failure. A retry loop with read_mems_allowed_begin and * read_mems_allowed_retry prevents these artificial failures. */ -static inline unsigned int read_mems_allowed_begin(void) +static inline void read_mems_allowed_begin(struct cpuset_mems_cookie *cookie) { - if (!cpusets_enabled()) - return 0; + if (!cpusets_enabled()) { + cookie->was_enabled = false; + return; + } - return read_seqcount_begin(¤t->mems_allowed_seq); + cookie->was_enabled = true; + cookie->seq = read_seqcount_begin(¤t->mems_allowed_seq); } /* @@ -127,12 +135,11 @@ static inline unsigned int read_mems_allowed_begin(void) * update of mems_allowed. It is up to the caller to retry the operation if * appropriate. */ -static inline bool read_mems_allowed_retry(unsigned int seq) +static inline bool read_mems_allowed_retry(struct cpuset_mems_cookie *cookie) { - if (!cpusets_enabled()) + if (!cookie->was_enabled) return false; - - return read_seqcount_retry(¤t->mems_allowed_seq, seq); + return read_seqcount_retry(¤t->mems_allowed_seq, cookie->seq); } static inline void set_mems_allowed(nodemask_t nodemask) @@ -249,12 +256,12 @@ static inline void set_mems_allowed(nodemask_t nodemask) { } -static inline unsigned int read_mems_allowed_begin(void) +static inline void read_mems_allowed_begin(struct cpuset_mems_cookie *cookie) { return 0; } -static inline bool read_mems_allowed_retry(unsigned int seq) +static inline bool read_mems_allowed_retry(struct cpuset_mems_cookie *cookie) { return false; } diff --git a/mm/filemap.c b/mm/filemap.c index 6f1be573a5e6..c0730b377519 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -716,12 +716,12 @@ struct page *__page_cache_alloc(gfp_t gfp) struct page *page; if (cpuset_do_page_mem_spread()) { - unsigned int cpuset_mems_cookie; + struct cpuset_mems_cookie cpuset_mems_cookie; do { - cpuset_mems_cookie = read_mems_allowed_begin(); + read_mems_allowed_begin(&cpuset_mems_cookie); n = cpuset_mem_spread_node(); page = __alloc_pages_node(n, gfp, 0); - } while (!page && read_mems_allowed_retry(cpuset_mems_cookie)); + } while (!page && read_mems_allowed_retry(&cpuset_mems_cookie)); return page; } diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 3eedb187e549..1defa44f4fe6 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -907,7 +907,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h, struct zonelist *zonelist; struct zone *zone; struct zoneref *z; - unsigned int cpuset_mems_cookie; + struct cpuset_mems_cookie cpuset_mems_cookie; /* * A child process with MAP_PRIVATE mappings created by their parent @@ -923,7 +923,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h, goto err; retry_cpuset: - cpuset_mems_cookie = read_mems_allowed_begin(); + read_mems_allowed_begin(&cpuset_mems_cookie); zonelist = huge_zonelist(vma, address, htlb_alloc_mask(h), &mpol, &nodemask); @@ -945,7 +945,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h, } mpol_cond_put(mpol); - if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie))) + if (unlikely(!page && read_mems_allowed_retry(&cpuset_mems_cookie))) goto retry_cpuset; return page; @@ -1511,7 +1511,7 @@ static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h, { int order = huge_page_order(h); gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_REPEAT|__GFP_NOWARN; - unsigned int cpuset_mems_cookie; + struct cpuset_mems_cookie cpuset_mems_cookie; /* * We need a VMA to get a memory policy. If we do not @@ -1548,13 +1548,13 @@ static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h, struct zonelist *zl; nodemask_t *nodemask; - cpuset_mems_cookie = read_mems_allowed_begin(); + read_mems_allowed_begin(&cpuset_mems_cookie); zl = huge_zonelist(vma, addr, gfp, &mpol, &nodemask); mpol_cond_put(mpol); page = __alloc_pages_nodemask(gfp, order, zl, nodemask); if (page) return page; - } while (read_mems_allowed_retry(cpuset_mems_cookie)); + } while (read_mems_allowed_retry(&cpuset_mems_cookie)); return NULL; } diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 37d0b334bfe9..b4f2513a2296 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1971,13 +1971,13 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, { struct mempolicy *pol; struct page *page; - unsigned int cpuset_mems_cookie; + struct cpuset_mems_cookie cpuset_mems_cookie; struct zonelist *zl; nodemask_t *nmask; retry_cpuset: pol = get_vma_policy(vma, addr); - cpuset_mems_cookie = read_mems_allowed_begin(); + read_mems_allowed_begin(&cpuset_mems_cookie); if (pol->mode == MPOL_INTERLEAVE) { unsigned nid; @@ -2019,7 +2019,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, page = __alloc_pages_nodemask(gfp, order, zl, nmask); mpol_cond_put(pol); out: - if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie))) + if (unlikely(!page && read_mems_allowed_retry(&cpuset_mems_cookie))) goto retry_cpuset; return page; } @@ -2047,13 +2047,13 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order) { struct mempolicy *pol = &default_policy; struct page *page; - unsigned int cpuset_mems_cookie; + struct cpuset_mems_cookie cpuset_mems_cookie; if (!in_interrupt() && !(gfp & __GFP_THISNODE)) pol = get_task_policy(current); retry_cpuset: - cpuset_mems_cookie = read_mems_allowed_begin(); + read_mems_allowed_begin(&cpuset_mems_cookie); /* * No reference counting needed for current->mempolicy @@ -2066,7 +2066,7 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order) policy_zonelist(gfp, pol, numa_node_id()), policy_nodemask(gfp, pol)); - if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie))) + if (unlikely(!page && read_mems_allowed_retry(&cpuset_mems_cookie))) goto retry_cpuset; return page; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 2302f250d6b1..36cd4e95fb38 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3688,7 +3688,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, int no_progress_loops; unsigned long alloc_start = jiffies; unsigned int stall_timeout = 10 * HZ; - unsigned int cpuset_mems_cookie; + struct cpuset_mems_cookie cpuset_mems_cookie; /* * In the slowpath, we sanity check order to avoid ever trying to @@ -3713,7 +3713,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, compaction_retries = 0; no_progress_loops = 0; compact_priority = DEF_COMPACT_PRIORITY; - cpuset_mems_cookie = read_mems_allowed_begin(); + read_mems_allowed_begin(&cpuset_mems_cookie); /* * The fast path uses conservative alloc_flags to succeed only until @@ -3872,7 +3872,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, * It's possible we raced with cpuset update so the OOM would be * premature (see below the nopage: label for full explanation). */ - if (read_mems_allowed_retry(cpuset_mems_cookie)) + if (read_mems_allowed_retry(&cpuset_mems_cookie)) goto retry_cpuset; /* Reclaim has failed us, start killing things */ @@ -3900,7 +3900,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, * to fail, check if the cpuset changed during allocation and if so, * retry. */ - if (read_mems_allowed_retry(cpuset_mems_cookie)) + if (read_mems_allowed_retry(&cpuset_mems_cookie)) goto retry_cpuset; /* diff --git a/mm/slab.c b/mm/slab.c index 2a31ee3c5814..391fe9d9d24e 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -3195,13 +3195,13 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags) void *obj = NULL; struct page *page; int nid; - unsigned int cpuset_mems_cookie; + struct cpuset_mems_cookie cpuset_mems_cookie; if (flags & __GFP_THISNODE) return NULL; retry_cpuset: - cpuset_mems_cookie = read_mems_allowed_begin(); + read_mems_allowed_begin(&cpuset_mems_cookie); zonelist = node_zonelist(mempolicy_slab_node(), flags); retry: @@ -3245,7 +3245,7 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags) } } - if (unlikely(!obj && read_mems_allowed_retry(cpuset_mems_cookie))) + if (unlikely(!obj && read_mems_allowed_retry(&cpuset_mems_cookie))) goto retry_cpuset; return obj; } diff --git a/mm/slub.c b/mm/slub.c index 8addc535bcdc..55c4862852ec 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1849,7 +1849,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags, struct zone *zone; enum zone_type high_zoneidx = gfp_zone(flags); void *object; - unsigned int cpuset_mems_cookie; + struct cpuset_mems_cookie cpuset_mems_cookie; /* * The defrag ratio allows a configuration of the tradeoffs between @@ -1874,7 +1874,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags, return NULL; do { - cpuset_mems_cookie = read_mems_allowed_begin(); + read_mems_allowed_begin(&cpuset_mems_cookie); zonelist = node_zonelist(mempolicy_slab_node(), flags); for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { struct kmem_cache_node *n; @@ -1896,7 +1896,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags, } } } - } while (read_mems_allowed_retry(cpuset_mems_cookie)); + } while (read_mems_allowed_retry(&cpuset_mems_cookie)); #endif return NULL; } -- 2.14.0.rc0.400.g1c36432dff-goog ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() 2017-07-27 16:46 ` [PATCH v2] cpuset: " Dima Zavin @ 2017-07-27 19:48 ` Andrew Morton 2017-07-27 21:41 ` Dima Zavin 2017-07-27 19:51 ` Andrew Morton ` (2 subsequent siblings) 3 siblings, 1 reply; 18+ messages in thread From: Andrew Morton @ 2017-07-27 19:48 UTC (permalink / raw) To: Dima Zavin Cc: Christopher Lameter, Li Zefan, Pekka Enberg, David Rientjes, Joonsoo Kim, cgroups, linux-kernel, linux-mm, Cliff Spradlin, Mel Gorman On Thu, 27 Jul 2017 09:46:08 -0700 Dima Zavin <dmitriyz@waymo.com> wrote: > In codepaths that use the begin/retry interface for reading > mems_allowed_seq with irqs disabled, there exists a race condition that > stalls the patch process after only modifying a subset of the > static_branch call sites. > > This problem manifested itself as a dead lock in the slub > allocator, inside get_any_partial. The loop reads > mems_allowed_seq value (via read_mems_allowed_begin), > performs the defrag operation, and then verifies the consistency > of mem_allowed via the read_mems_allowed_retry and the cookie > returned by xxx_begin. The issue here is that both begin and retry > first check if cpusets are enabled via cpusets_enabled() static branch. > This branch can be rewritted dynamically (via cpuset_inc) if a new > cpuset is created. The x86 jump label code fully synchronizes across > all CPUs for every entry it rewrites. If it rewrites only one of the > callsites (specifically the one in read_mems_allowed_retry) and then > waits for the smp_call_function(do_sync_core) to complete while a CPU is > inside the begin/retry section with IRQs off and the mems_allowed value > is changed, we can hang. This is because begin() will always return 0 > (since it wasn't patched yet) while retry() will test the 0 against > the actual value of the seq counter. > > The fix is to cache the value that's returned by cpusets_enabled() at the > top of the loop, and only operate on the seqcount (both begin and retry) if > it was true. Tricky. Hence we should have a nice code comment somewhere describing all of this. > --- a/include/linux/cpuset.h > +++ b/include/linux/cpuset.h > @@ -16,6 +16,11 @@ > #include <linux/mm.h> > #include <linux/jump_label.h> > > +struct cpuset_mems_cookie { > + unsigned int seq; > + bool was_enabled; > +}; At cpuset_mems_cookie would be a good site - why it exists, what it does, when it is used and how. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() 2017-07-27 19:48 ` Andrew Morton @ 2017-07-27 21:41 ` Dima Zavin 0 siblings, 0 replies; 18+ messages in thread From: Dima Zavin @ 2017-07-27 21:41 UTC (permalink / raw) To: Andrew Morton Cc: Christopher Lameter, Li Zefan, Pekka Enberg, David Rientjes, Joonsoo Kim, cgroups, LKML, linux-mm, Cliff Spradlin, Mel Gorman On Thu, Jul 27, 2017 at 12:48 PM, Andrew Morton <akpm@linux-foundation.org> wrote: > On Thu, 27 Jul 2017 09:46:08 -0700 Dima Zavin <dmitriyz@waymo.com> wrote: > >> In codepaths that use the begin/retry interface for reading >> mems_allowed_seq with irqs disabled, there exists a race condition that >> stalls the patch process after only modifying a subset of the >> static_branch call sites. >> >> This problem manifested itself as a dead lock in the slub >> allocator, inside get_any_partial. The loop reads >> mems_allowed_seq value (via read_mems_allowed_begin), >> performs the defrag operation, and then verifies the consistency >> of mem_allowed via the read_mems_allowed_retry and the cookie >> returned by xxx_begin. The issue here is that both begin and retry >> first check if cpusets are enabled via cpusets_enabled() static branch. >> This branch can be rewritted dynamically (via cpuset_inc) if a new >> cpuset is created. The x86 jump label code fully synchronizes across >> all CPUs for every entry it rewrites. If it rewrites only one of the >> callsites (specifically the one in read_mems_allowed_retry) and then >> waits for the smp_call_function(do_sync_core) to complete while a CPU is >> inside the begin/retry section with IRQs off and the mems_allowed value >> is changed, we can hang. This is because begin() will always return 0 >> (since it wasn't patched yet) while retry() will test the 0 against >> the actual value of the seq counter. >> >> The fix is to cache the value that's returned by cpusets_enabled() at the >> top of the loop, and only operate on the seqcount (both begin and retry) if >> it was true. > > Tricky. Hence we should have a nice code comment somewhere describing > all of this. > >> --- a/include/linux/cpuset.h >> +++ b/include/linux/cpuset.h >> @@ -16,6 +16,11 @@ >> #include <linux/mm.h> >> #include <linux/jump_label.h> >> >> +struct cpuset_mems_cookie { >> + unsigned int seq; >> + bool was_enabled; >> +}; > > At cpuset_mems_cookie would be a good site - why it exists, what it > does, when it is used and how. Will do. I actually had a comment here but removed it in lieu of commit message :) Will put it back. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() 2017-07-27 16:46 ` [PATCH v2] cpuset: " Dima Zavin 2017-07-27 19:48 ` Andrew Morton @ 2017-07-27 19:51 ` Andrew Morton 2017-07-27 21:41 ` Dima Zavin 2017-07-28 7:45 ` Vlastimil Babka 2017-07-29 4:56 ` [PATCH v2] " kbuild test robot 3 siblings, 1 reply; 18+ messages in thread From: Andrew Morton @ 2017-07-27 19:51 UTC (permalink / raw) To: Dima Zavin Cc: Christopher Lameter, Li Zefan, Pekka Enberg, David Rientjes, Joonsoo Kim, cgroups, linux-kernel, linux-mm, Cliff Spradlin, Mel Gorman On Thu, 27 Jul 2017 09:46:08 -0700 Dima Zavin <dmitriyz@waymo.com> wrote: > - Applied on top of v4.12 since one of the callers in page_alloc.c changed. > Still only tested on v4.9.36 and compile tested against v4.12. That's a problem - this doesn't come close to applying on current mainline. I can fix that I guess, but the result should be tested well. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() 2017-07-27 19:51 ` Andrew Morton @ 2017-07-27 21:41 ` Dima Zavin 0 siblings, 0 replies; 18+ messages in thread From: Dima Zavin @ 2017-07-27 21:41 UTC (permalink / raw) To: Andrew Morton Cc: Christopher Lameter, Li Zefan, Pekka Enberg, David Rientjes, Joonsoo Kim, cgroups, LKML, linux-mm, Cliff Spradlin, Mel Gorman On Thu, Jul 27, 2017 at 12:51 PM, Andrew Morton <akpm@linux-foundation.org> wrote: > On Thu, 27 Jul 2017 09:46:08 -0700 Dima Zavin <dmitriyz@waymo.com> wrote: > >> - Applied on top of v4.12 since one of the callers in page_alloc.c changed. >> Still only tested on v4.9.36 and compile tested against v4.12. > > That's a problem - this doesn't come close to applying on current > mainline. I can fix that I guess, but the result should be tested > well. > I'll fix up for latest, and see if I can test it. I should be able to boot vanilla with not too much trouble. May take a few days. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() 2017-07-27 16:46 ` [PATCH v2] cpuset: " Dima Zavin 2017-07-27 19:48 ` Andrew Morton 2017-07-27 19:51 ` Andrew Morton @ 2017-07-28 7:45 ` Vlastimil Babka 2017-07-28 8:48 ` Dima Zavin 2017-07-28 9:30 ` Peter Zijlstra 2017-07-29 4:56 ` [PATCH v2] " kbuild test robot 3 siblings, 2 replies; 18+ messages in thread From: Vlastimil Babka @ 2017-07-28 7:45 UTC (permalink / raw) To: Dima Zavin, Christopher Lameter Cc: Li Zefan, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, cgroups, linux-kernel, linux-mm, Cliff Spradlin, Mel Gorman, Peter Zijlstra [+CC PeterZ] On 07/27/2017 06:46 PM, Dima Zavin wrote: > In codepaths that use the begin/retry interface for reading > mems_allowed_seq with irqs disabled, there exists a race condition that > stalls the patch process after only modifying a subset of the > static_branch call sites. > > This problem manifested itself as a dead lock in the slub > allocator, inside get_any_partial. The loop reads > mems_allowed_seq value (via read_mems_allowed_begin), > performs the defrag operation, and then verifies the consistency > of mem_allowed via the read_mems_allowed_retry and the cookie > returned by xxx_begin. The issue here is that both begin and retry > first check if cpusets are enabled via cpusets_enabled() static branch. > This branch can be rewritted dynamically (via cpuset_inc) if a new > cpuset is created. The x86 jump label code fully synchronizes across > all CPUs for every entry it rewrites. If it rewrites only one of the > callsites (specifically the one in read_mems_allowed_retry) and then > waits for the smp_call_function(do_sync_core) to complete while a CPU is > inside the begin/retry section with IRQs off and the mems_allowed value > is changed, we can hang. This is because begin() will always return 0 > (since it wasn't patched yet) while retry() will test the 0 against > the actual value of the seq counter. Hm I wonder if there are other static branch users potentially having similar problem. Then it would be best to fix this at static branch level. Any idea, Peter? An inelegant solution would be to have indicate static_branch_(un)likely() callsites ordering for the patching. I.e. here we would make sure that read_mems_allowed_begin() callsites are patched before read_mems_allowed_retry() when enabling the static key, and the opposite order when disabling the static key. > The fix is to cache the value that's returned by cpusets_enabled() at the > top of the loop, and only operate on the seqcount (both begin and retry) if > it was true. Maybe we could just return e.g. -1 in read_mems_allowed_begin() when cpusets are disabled, and test it in read_mems_allowed_retry() before doing a proper seqcount retry check? Also I think you can still do the cpusets_enabled() check in read_mems_allowed_retry() before the was_enabled (or cookie == -1) test? > The relevant stack traces of the two stuck threads: > > CPU: 107 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4 > Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017 > task: ffff8817f9c28000 task.stack: ffffc9000ffa4000 > RIP: smp_call_function_many+0x1f9/0x260 > Call Trace: > ? setup_data_read+0xa0/0xa0 > ? ___slab_alloc+0x28b/0x5a0 > smp_call_function+0x3b/0x70 > ? setup_data_read+0xa0/0xa0 > on_each_cpu+0x2f/0x90 > ? ___slab_alloc+0x28a/0x5a0 > ? ___slab_alloc+0x28b/0x5a0 > text_poke_bp+0x87/0xd0 > ? ___slab_alloc+0x28a/0x5a0 > arch_jump_label_transform+0x93/0x100 > __jump_label_update+0x77/0x90 > jump_label_update+0xaa/0xc0 > static_key_slow_inc+0x9e/0xb0 > cpuset_css_online+0x70/0x2e0 > online_css+0x2c/0xa0 > cgroup_apply_control_enable+0x27f/0x3d0 > cgroup_mkdir+0x2b7/0x420 > kernfs_iop_mkdir+0x5a/0x80 > vfs_mkdir+0xf6/0x1a0 > SyS_mkdir+0xb7/0xe0 > entry_SYSCALL_64_fastpath+0x18/0xad > > ... > > CPU: 22 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4 > Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017 > task: ffff8818087c0000 task.stack: ffffc90000030000 > RIP: int3+0x39/0x70 > Call Trace: > <#DB> ? ___slab_alloc+0x28b/0x5a0 > <EOE> ? copy_process.part.40+0xf7/0x1de0 > ? __slab_alloc.isra.80+0x54/0x90 > ? copy_process.part.40+0xf7/0x1de0 > ? copy_process.part.40+0xf7/0x1de0 > ? kmem_cache_alloc_node+0x8a/0x280 > ? copy_process.part.40+0xf7/0x1de0 > ? _do_fork+0xe7/0x6c0 > ? _raw_spin_unlock_irq+0x2d/0x60 > ? trace_hardirqs_on_caller+0x136/0x1d0 > ? entry_SYSCALL_64_fastpath+0x5/0xad > ? do_syscall_64+0x27/0x350 > ? SyS_clone+0x19/0x20 > ? do_syscall_64+0x60/0x350 > ? entry_SYSCALL64_slow_path+0x25/0x25 > > Reported-by: Cliff Spradlin <cspradlin@waymo.com> > Signed-off-by: Dima Zavin <dmitriyz@waymo.com> > --- > > v2: > - Moved the cached cpusets_enabled() state into the cookie, turned > the cookie into a struct and updated all the other call sites. > - Applied on top of v4.12 since one of the callers in page_alloc.c changed. > Still only tested on v4.9.36 and compile tested against v4.12. > > include/linux/cpuset.h | 27 +++++++++++++++++---------- > mm/filemap.c | 6 +++--- > mm/hugetlb.c | 12 ++++++------ > mm/mempolicy.c | 12 ++++++------ > mm/page_alloc.c | 8 ++++---- > mm/slab.c | 6 +++--- > mm/slub.c | 6 +++--- > 7 files changed, 42 insertions(+), 35 deletions(-) > > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h > index 119a3f9604b0..f64f6d3b1dce 100644 > --- a/include/linux/cpuset.h > +++ b/include/linux/cpuset.h > @@ -16,6 +16,11 @@ > #include <linux/mm.h> > #include <linux/jump_label.h> > > +struct cpuset_mems_cookie { > + unsigned int seq; > + bool was_enabled; > +}; > + > #ifdef CONFIG_CPUSETS > > extern struct static_key_false cpusets_enabled_key; > @@ -113,12 +118,15 @@ extern void cpuset_print_current_mems_allowed(void); > * causing process failure. A retry loop with read_mems_allowed_begin and > * read_mems_allowed_retry prevents these artificial failures. > */ > -static inline unsigned int read_mems_allowed_begin(void) > +static inline void read_mems_allowed_begin(struct cpuset_mems_cookie *cookie) > { > - if (!cpusets_enabled()) > - return 0; > + if (!cpusets_enabled()) { > + cookie->was_enabled = false; > + return; > + } > > - return read_seqcount_begin(¤t->mems_allowed_seq); > + cookie->was_enabled = true; > + cookie->seq = read_seqcount_begin(¤t->mems_allowed_seq); > } > > /* > @@ -127,12 +135,11 @@ static inline unsigned int read_mems_allowed_begin(void) > * update of mems_allowed. It is up to the caller to retry the operation if > * appropriate. > */ > -static inline bool read_mems_allowed_retry(unsigned int seq) > +static inline bool read_mems_allowed_retry(struct cpuset_mems_cookie *cookie) > { > - if (!cpusets_enabled()) > + if (!cookie->was_enabled) > return false; > - > - return read_seqcount_retry(¤t->mems_allowed_seq, seq); > + return read_seqcount_retry(¤t->mems_allowed_seq, cookie->seq); > } > > static inline void set_mems_allowed(nodemask_t nodemask) > @@ -249,12 +256,12 @@ static inline void set_mems_allowed(nodemask_t nodemask) > { > } > > -static inline unsigned int read_mems_allowed_begin(void) > +static inline void read_mems_allowed_begin(struct cpuset_mems_cookie *cookie) > { > return 0; > } > > -static inline bool read_mems_allowed_retry(unsigned int seq) > +static inline bool read_mems_allowed_retry(struct cpuset_mems_cookie *cookie) > { > return false; > } > diff --git a/mm/filemap.c b/mm/filemap.c > index 6f1be573a5e6..c0730b377519 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -716,12 +716,12 @@ struct page *__page_cache_alloc(gfp_t gfp) > struct page *page; > > if (cpuset_do_page_mem_spread()) { > - unsigned int cpuset_mems_cookie; > + struct cpuset_mems_cookie cpuset_mems_cookie; > do { > - cpuset_mems_cookie = read_mems_allowed_begin(); > + read_mems_allowed_begin(&cpuset_mems_cookie); > n = cpuset_mem_spread_node(); > page = __alloc_pages_node(n, gfp, 0); > - } while (!page && read_mems_allowed_retry(cpuset_mems_cookie)); > + } while (!page && read_mems_allowed_retry(&cpuset_mems_cookie)); > > return page; > } > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 3eedb187e549..1defa44f4fe6 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -907,7 +907,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h, > struct zonelist *zonelist; > struct zone *zone; > struct zoneref *z; > - unsigned int cpuset_mems_cookie; > + struct cpuset_mems_cookie cpuset_mems_cookie; > > /* > * A child process with MAP_PRIVATE mappings created by their parent > @@ -923,7 +923,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h, > goto err; > > retry_cpuset: > - cpuset_mems_cookie = read_mems_allowed_begin(); > + read_mems_allowed_begin(&cpuset_mems_cookie); > zonelist = huge_zonelist(vma, address, > htlb_alloc_mask(h), &mpol, &nodemask); > > @@ -945,7 +945,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h, > } > > mpol_cond_put(mpol); > - if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie))) > + if (unlikely(!page && read_mems_allowed_retry(&cpuset_mems_cookie))) > goto retry_cpuset; > return page; > > @@ -1511,7 +1511,7 @@ static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h, > { > int order = huge_page_order(h); > gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_REPEAT|__GFP_NOWARN; > - unsigned int cpuset_mems_cookie; > + struct cpuset_mems_cookie cpuset_mems_cookie; > > /* > * We need a VMA to get a memory policy. If we do not > @@ -1548,13 +1548,13 @@ static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h, > struct zonelist *zl; > nodemask_t *nodemask; > > - cpuset_mems_cookie = read_mems_allowed_begin(); > + read_mems_allowed_begin(&cpuset_mems_cookie); > zl = huge_zonelist(vma, addr, gfp, &mpol, &nodemask); > mpol_cond_put(mpol); > page = __alloc_pages_nodemask(gfp, order, zl, nodemask); > if (page) > return page; > - } while (read_mems_allowed_retry(cpuset_mems_cookie)); > + } while (read_mems_allowed_retry(&cpuset_mems_cookie)); > > return NULL; > } > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 37d0b334bfe9..b4f2513a2296 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1971,13 +1971,13 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, > { > struct mempolicy *pol; > struct page *page; > - unsigned int cpuset_mems_cookie; > + struct cpuset_mems_cookie cpuset_mems_cookie; > struct zonelist *zl; > nodemask_t *nmask; > > retry_cpuset: > pol = get_vma_policy(vma, addr); > - cpuset_mems_cookie = read_mems_allowed_begin(); > + read_mems_allowed_begin(&cpuset_mems_cookie); > > if (pol->mode == MPOL_INTERLEAVE) { > unsigned nid; > @@ -2019,7 +2019,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, > page = __alloc_pages_nodemask(gfp, order, zl, nmask); > mpol_cond_put(pol); > out: > - if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie))) > + if (unlikely(!page && read_mems_allowed_retry(&cpuset_mems_cookie))) > goto retry_cpuset; > return page; > } > @@ -2047,13 +2047,13 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order) > { > struct mempolicy *pol = &default_policy; > struct page *page; > - unsigned int cpuset_mems_cookie; > + struct cpuset_mems_cookie cpuset_mems_cookie; > > if (!in_interrupt() && !(gfp & __GFP_THISNODE)) > pol = get_task_policy(current); > > retry_cpuset: > - cpuset_mems_cookie = read_mems_allowed_begin(); > + read_mems_allowed_begin(&cpuset_mems_cookie); > > /* > * No reference counting needed for current->mempolicy > @@ -2066,7 +2066,7 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order) > policy_zonelist(gfp, pol, numa_node_id()), > policy_nodemask(gfp, pol)); > > - if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie))) > + if (unlikely(!page && read_mems_allowed_retry(&cpuset_mems_cookie))) > goto retry_cpuset; > > return page; > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 2302f250d6b1..36cd4e95fb38 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -3688,7 +3688,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > int no_progress_loops; > unsigned long alloc_start = jiffies; > unsigned int stall_timeout = 10 * HZ; > - unsigned int cpuset_mems_cookie; > + struct cpuset_mems_cookie cpuset_mems_cookie; > > /* > * In the slowpath, we sanity check order to avoid ever trying to > @@ -3713,7 +3713,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > compaction_retries = 0; > no_progress_loops = 0; > compact_priority = DEF_COMPACT_PRIORITY; > - cpuset_mems_cookie = read_mems_allowed_begin(); > + read_mems_allowed_begin(&cpuset_mems_cookie); > > /* > * The fast path uses conservative alloc_flags to succeed only until > @@ -3872,7 +3872,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > * It's possible we raced with cpuset update so the OOM would be > * premature (see below the nopage: label for full explanation). > */ > - if (read_mems_allowed_retry(cpuset_mems_cookie)) > + if (read_mems_allowed_retry(&cpuset_mems_cookie)) > goto retry_cpuset; > > /* Reclaim has failed us, start killing things */ > @@ -3900,7 +3900,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > * to fail, check if the cpuset changed during allocation and if so, > * retry. > */ > - if (read_mems_allowed_retry(cpuset_mems_cookie)) > + if (read_mems_allowed_retry(&cpuset_mems_cookie)) > goto retry_cpuset; > > /* > diff --git a/mm/slab.c b/mm/slab.c > index 2a31ee3c5814..391fe9d9d24e 100644 > --- a/mm/slab.c > +++ b/mm/slab.c > @@ -3195,13 +3195,13 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags) > void *obj = NULL; > struct page *page; > int nid; > - unsigned int cpuset_mems_cookie; > + struct cpuset_mems_cookie cpuset_mems_cookie; > > if (flags & __GFP_THISNODE) > return NULL; > > retry_cpuset: > - cpuset_mems_cookie = read_mems_allowed_begin(); > + read_mems_allowed_begin(&cpuset_mems_cookie); > zonelist = node_zonelist(mempolicy_slab_node(), flags); > > retry: > @@ -3245,7 +3245,7 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags) > } > } > > - if (unlikely(!obj && read_mems_allowed_retry(cpuset_mems_cookie))) > + if (unlikely(!obj && read_mems_allowed_retry(&cpuset_mems_cookie))) > goto retry_cpuset; > return obj; > } > diff --git a/mm/slub.c b/mm/slub.c > index 8addc535bcdc..55c4862852ec 100644 > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -1849,7 +1849,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags, > struct zone *zone; > enum zone_type high_zoneidx = gfp_zone(flags); > void *object; > - unsigned int cpuset_mems_cookie; > + struct cpuset_mems_cookie cpuset_mems_cookie; > > /* > * The defrag ratio allows a configuration of the tradeoffs between > @@ -1874,7 +1874,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags, > return NULL; > > do { > - cpuset_mems_cookie = read_mems_allowed_begin(); > + read_mems_allowed_begin(&cpuset_mems_cookie); > zonelist = node_zonelist(mempolicy_slab_node(), flags); > for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { > struct kmem_cache_node *n; > @@ -1896,7 +1896,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags, > } > } > } > - } while (read_mems_allowed_retry(cpuset_mems_cookie)); > + } while (read_mems_allowed_retry(&cpuset_mems_cookie)); > #endif > return NULL; > } > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() 2017-07-28 7:45 ` Vlastimil Babka @ 2017-07-28 8:48 ` Dima Zavin 2017-07-28 9:30 ` Peter Zijlstra 1 sibling, 0 replies; 18+ messages in thread From: Dima Zavin @ 2017-07-28 8:48 UTC (permalink / raw) To: Vlastimil Babka Cc: Christopher Lameter, Li Zefan, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, cgroups, LKML, linux-mm, Cliff Spradlin, Mel Gorman, Peter Zijlstra On Fri, Jul 28, 2017 at 12:45 AM, Vlastimil Babka <vbabka@suse.cz> wrote: > [+CC PeterZ] > > On 07/27/2017 06:46 PM, Dima Zavin wrote: >> In codepaths that use the begin/retry interface for reading >> mems_allowed_seq with irqs disabled, there exists a race condition that >> stalls the patch process after only modifying a subset of the >> static_branch call sites. >> >> This problem manifested itself as a dead lock in the slub >> allocator, inside get_any_partial. The loop reads >> mems_allowed_seq value (via read_mems_allowed_begin), >> performs the defrag operation, and then verifies the consistency >> of mem_allowed via the read_mems_allowed_retry and the cookie >> returned by xxx_begin. The issue here is that both begin and retry >> first check if cpusets are enabled via cpusets_enabled() static branch. >> This branch can be rewritted dynamically (via cpuset_inc) if a new >> cpuset is created. The x86 jump label code fully synchronizes across >> all CPUs for every entry it rewrites. If it rewrites only one of the >> callsites (specifically the one in read_mems_allowed_retry) and then >> waits for the smp_call_function(do_sync_core) to complete while a CPU is >> inside the begin/retry section with IRQs off and the mems_allowed value >> is changed, we can hang. This is because begin() will always return 0 >> (since it wasn't patched yet) while retry() will test the 0 against >> the actual value of the seq counter. > > Hm I wonder if there are other static branch users potentially having > similar problem. Then it would be best to fix this at static branch > level. Any idea, Peter? An inelegant solution would be to have indicate > static_branch_(un)likely() callsites ordering for the patching. I.e. > here we would make sure that read_mems_allowed_begin() callsites are > patched before read_mems_allowed_retry() when enabling the static key, > and the opposite order when disabling the static key. > This was my main worry, that I'm just patching up one incarnation of this problem and other clients will eventually trip over this. >> The fix is to cache the value that's returned by cpusets_enabled() at the >> top of the loop, and only operate on the seqcount (both begin and retry) if >> it was true. > > Maybe we could just return e.g. -1 in read_mems_allowed_begin() when > cpusets are disabled, and test it in read_mems_allowed_retry() before > doing a proper seqcount retry check? Also I think you can still do the > cpusets_enabled() check in read_mems_allowed_retry() before the > was_enabled (or cookie == -1) test? Hmm, good point! If cpusets_enabled() is true, then we can still test against was_enabled and do the right thing (adds one extra branch in that case). When it's false, we still benefit from the static_branch fanciness. Thanks! Re setting the cookie to -1, I didn't really want to overload the cookie value but rather just make the state explicit so it's easier to grawk as this is all already subtle enough. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() 2017-07-28 7:45 ` Vlastimil Babka 2017-07-28 8:48 ` Dima Zavin @ 2017-07-28 9:30 ` Peter Zijlstra 2017-07-28 14:05 ` Vlastimil Babka 1 sibling, 1 reply; 18+ messages in thread From: Peter Zijlstra @ 2017-07-28 9:30 UTC (permalink / raw) To: Vlastimil Babka Cc: Dima Zavin, Christopher Lameter, Li Zefan, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, cgroups, linux-kernel, linux-mm, Cliff Spradlin, Mel Gorman On Fri, Jul 28, 2017 at 09:45:16AM +0200, Vlastimil Babka wrote: > [+CC PeterZ] > > On 07/27/2017 06:46 PM, Dima Zavin wrote: > > In codepaths that use the begin/retry interface for reading > > mems_allowed_seq with irqs disabled, there exists a race condition that > > stalls the patch process after only modifying a subset of the > > static_branch call sites. > > > > This problem manifested itself as a dead lock in the slub > > allocator, inside get_any_partial. The loop reads > > mems_allowed_seq value (via read_mems_allowed_begin), > > performs the defrag operation, and then verifies the consistency > > of mem_allowed via the read_mems_allowed_retry and the cookie > > returned by xxx_begin. The issue here is that both begin and retry > > first check if cpusets are enabled via cpusets_enabled() static branch. > > This branch can be rewritted dynamically (via cpuset_inc) if a new > > cpuset is created. The x86 jump label code fully synchronizes across > > all CPUs for every entry it rewrites. If it rewrites only one of the > > callsites (specifically the one in read_mems_allowed_retry) and then > > waits for the smp_call_function(do_sync_core) to complete while a CPU is > > inside the begin/retry section with IRQs off and the mems_allowed value > > is changed, we can hang. This is because begin() will always return 0 > > (since it wasn't patched yet) while retry() will test the 0 against > > the actual value of the seq counter. > > Hm I wonder if there are other static branch users potentially having > similar problem. Then it would be best to fix this at static branch > level. Any idea, Peter? An inelegant solution would be to have indicate > static_branch_(un)likely() callsites ordering for the patching. I.e. > here we would make sure that read_mems_allowed_begin() callsites are > patched before read_mems_allowed_retry() when enabling the static key, > and the opposite order when disabling the static key. I'm not aware of any other sure ordering requirements. But you can manually create this order by using 2 static keys. Then flip them in the desired order. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() 2017-07-28 9:30 ` Peter Zijlstra @ 2017-07-28 14:05 ` Vlastimil Babka 2017-07-28 16:52 ` Dima Zavin 2017-07-31 4:01 ` [PATCH v3] " Dima Zavin 0 siblings, 2 replies; 18+ messages in thread From: Vlastimil Babka @ 2017-07-28 14:05 UTC (permalink / raw) To: Peter Zijlstra Cc: Dima Zavin, Christopher Lameter, Li Zefan, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, cgroups, linux-kernel, linux-mm, Cliff Spradlin, Mel Gorman On 07/28/2017 11:30 AM, Peter Zijlstra wrote: > On Fri, Jul 28, 2017 at 09:45:16AM +0200, Vlastimil Babka wrote: >> [+CC PeterZ] >> >> On 07/27/2017 06:46 PM, Dima Zavin wrote: >>> In codepaths that use the begin/retry interface for reading >>> mems_allowed_seq with irqs disabled, there exists a race condition that >>> stalls the patch process after only modifying a subset of the >>> static_branch call sites. >>> >>> This problem manifested itself as a dead lock in the slub >>> allocator, inside get_any_partial. The loop reads >>> mems_allowed_seq value (via read_mems_allowed_begin), >>> performs the defrag operation, and then verifies the consistency >>> of mem_allowed via the read_mems_allowed_retry and the cookie >>> returned by xxx_begin. The issue here is that both begin and retry >>> first check if cpusets are enabled via cpusets_enabled() static branch. >>> This branch can be rewritted dynamically (via cpuset_inc) if a new >>> cpuset is created. The x86 jump label code fully synchronizes across >>> all CPUs for every entry it rewrites. If it rewrites only one of the >>> callsites (specifically the one in read_mems_allowed_retry) and then >>> waits for the smp_call_function(do_sync_core) to complete while a CPU is >>> inside the begin/retry section with IRQs off and the mems_allowed value >>> is changed, we can hang. This is because begin() will always return 0 >>> (since it wasn't patched yet) while retry() will test the 0 against >>> the actual value of the seq counter. >> >> Hm I wonder if there are other static branch users potentially having >> similar problem. Then it would be best to fix this at static branch >> level. Any idea, Peter? An inelegant solution would be to have indicate >> static_branch_(un)likely() callsites ordering for the patching. I.e. >> here we would make sure that read_mems_allowed_begin() callsites are >> patched before read_mems_allowed_retry() when enabling the static key, >> and the opposite order when disabling the static key. > > I'm not aware of any other sure ordering requirements. But you can > manually create this order by using 2 static keys. Then flip them in the > desired order. Right, thanks for the suggestion. I think that would be preferable to complicating the cookie handling. Add a new key next to cpusets_enabled_key, let's say "cpusets_enabled_pre_key". Make read_mems_allowed_begin() check this key instead of cpusets_enabled(). Change cpuset_inc/dec to inc/dec also this new key in the right order and that should be it. Dima, can you try that or should I? Thanks, Vlastimil > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() 2017-07-28 14:05 ` Vlastimil Babka @ 2017-07-28 16:52 ` Dima Zavin 2017-07-31 4:01 ` [PATCH v3] " Dima Zavin 1 sibling, 0 replies; 18+ messages in thread From: Dima Zavin @ 2017-07-28 16:52 UTC (permalink / raw) To: Vlastimil Babka Cc: Peter Zijlstra, Christopher Lameter, Li Zefan, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, cgroups, LKML, linux-mm, Cliff Spradlin, Mel Gorman On Fri, Jul 28, 2017 at 7:05 AM, Vlastimil Babka <vbabka@suse.cz> wrote: > On 07/28/2017 11:30 AM, Peter Zijlstra wrote: >> On Fri, Jul 28, 2017 at 09:45:16AM +0200, Vlastimil Babka wrote: >>> [+CC PeterZ] >>> >>> On 07/27/2017 06:46 PM, Dima Zavin wrote: >>>> In codepaths that use the begin/retry interface for reading >>>> mems_allowed_seq with irqs disabled, there exists a race condition that >>>> stalls the patch process after only modifying a subset of the >>>> static_branch call sites. >>>> >>>> This problem manifested itself as a dead lock in the slub >>>> allocator, inside get_any_partial. The loop reads >>>> mems_allowed_seq value (via read_mems_allowed_begin), >>>> performs the defrag operation, and then verifies the consistency >>>> of mem_allowed via the read_mems_allowed_retry and the cookie >>>> returned by xxx_begin. The issue here is that both begin and retry >>>> first check if cpusets are enabled via cpusets_enabled() static branch. >>>> This branch can be rewritted dynamically (via cpuset_inc) if a new >>>> cpuset is created. The x86 jump label code fully synchronizes across >>>> all CPUs for every entry it rewrites. If it rewrites only one of the >>>> callsites (specifically the one in read_mems_allowed_retry) and then >>>> waits for the smp_call_function(do_sync_core) to complete while a CPU is >>>> inside the begin/retry section with IRQs off and the mems_allowed value >>>> is changed, we can hang. This is because begin() will always return 0 >>>> (since it wasn't patched yet) while retry() will test the 0 against >>>> the actual value of the seq counter. >>> >>> Hm I wonder if there are other static branch users potentially having >>> similar problem. Then it would be best to fix this at static branch >>> level. Any idea, Peter? An inelegant solution would be to have indicate >>> static_branch_(un)likely() callsites ordering for the patching. I.e. >>> here we would make sure that read_mems_allowed_begin() callsites are >>> patched before read_mems_allowed_retry() when enabling the static key, >>> and the opposite order when disabling the static key. >> >> I'm not aware of any other sure ordering requirements. But you can >> manually create this order by using 2 static keys. Then flip them in the >> desired order. > > Right, thanks for the suggestion. I think that would be preferable to > complicating the cookie handling. Add a new key next to > cpusets_enabled_key, let's say "cpusets_enabled_pre_key". Make > read_mems_allowed_begin() check this key instead of cpusets_enabled(). > Change cpuset_inc/dec to inc/dec also this new key in the right order > and that should be it. Dima, can you try that or should I? Yeah, I like that approach much better. I'll re-spin a new version in a bit. --Dima > > Thanks, > Vlastimil > >> -- >> To unsubscribe, send a message with 'unsubscribe linux-mm' in >> the body to majordomo@kvack.org. For more info on Linux MM, >> see: http://www.linux-mm.org/ . >> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> >> > ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v3] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() 2017-07-28 14:05 ` Vlastimil Babka 2017-07-28 16:52 ` Dima Zavin @ 2017-07-31 4:01 ` Dima Zavin 2017-07-31 4:04 ` Dima Zavin 2017-07-31 8:02 ` Vlastimil Babka 1 sibling, 2 replies; 18+ messages in thread From: Dima Zavin @ 2017-07-31 4:01 UTC (permalink / raw) To: Vlastimil Babka Cc: Peter Zijlstra, Christopher Lameter, Li Zefan, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, cgroups, linux-kernel, linux-mm, Mel Gorman, Cliff Spradlin In codepaths that use the begin/retry interface for reading mems_allowed_seq with irqs disabled, there exists a race condition that stalls the patch process after only modifying a subset of the static_branch call sites. This problem manifested itself as a dead lock in the slub allocator, inside get_any_partial. The loop reads mems_allowed_seq value (via read_mems_allowed_begin), performs the defrag operation, and then verifies the consistency of mem_allowed via the read_mems_allowed_retry and the cookie returned by xxx_begin. The issue here is that both begin and retry first check if cpusets are enabled via cpusets_enabled() static branch. This branch can be rewritted dynamically (via cpuset_inc) if a new cpuset is created. The x86 jump label code fully synchronizes across all CPUs for every entry it rewrites. If it rewrites only one of the callsites (specifically the one in read_mems_allowed_retry) and then waits for the smp_call_function(do_sync_core) to complete while a CPU is inside the begin/retry section with IRQs off and the mems_allowed value is changed, we can hang. This is because begin() will always return 0 (since it wasn't patched yet) while retry() will test the 0 against the actual value of the seq counter. The fix is to use two different static keys: one for begin (pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we first bump the pre_enable key to ensure that cpuset_mems_allowed_begin() always return a valid seqcount if are enabling cpusets. Similarly, when disabling cpusets via cpuset_dec(), we first ensure that callers of cpuset_mems_allowed_retry() will start ignoring the seqcount value before we let cpuset_mems_allowed_begin() return 0. The relevant stack traces of the two stuck threads: CPU: 1 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4 Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017 task: ffff8817f9c28000 task.stack: ffffc9000ffa4000 RIP: smp_call_function_many+0x1f9/0x260 Call Trace: ? setup_data_read+0xa0/0xa0 ? ___slab_alloc+0x28b/0x5a0 smp_call_function+0x3b/0x70 ? setup_data_read+0xa0/0xa0 on_each_cpu+0x2f/0x90 ? ___slab_alloc+0x28a/0x5a0 ? ___slab_alloc+0x28b/0x5a0 text_poke_bp+0x87/0xd0 ? ___slab_alloc+0x28a/0x5a0 arch_jump_label_transform+0x93/0x100 __jump_label_update+0x77/0x90 jump_label_update+0xaa/0xc0 static_key_slow_inc+0x9e/0xb0 cpuset_css_online+0x70/0x2e0 online_css+0x2c/0xa0 cgroup_apply_control_enable+0x27f/0x3d0 cgroup_mkdir+0x2b7/0x420 kernfs_iop_mkdir+0x5a/0x80 vfs_mkdir+0xf6/0x1a0 SyS_mkdir+0xb7/0xe0 entry_SYSCALL_64_fastpath+0x18/0xad ... CPU: 2 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4 Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017 task: ffff8818087c0000 task.stack: ffffc90000030000 RIP: int3+0x39/0x70 Call Trace: <#DB> ? ___slab_alloc+0x28b/0x5a0 <EOE> ? copy_process.part.40+0xf7/0x1de0 ? __slab_alloc.isra.80+0x54/0x90 ? copy_process.part.40+0xf7/0x1de0 ? copy_process.part.40+0xf7/0x1de0 ? kmem_cache_alloc_node+0x8a/0x280 ? copy_process.part.40+0xf7/0x1de0 ? _do_fork+0xe7/0x6c0 ? _raw_spin_unlock_irq+0x2d/0x60 ? trace_hardirqs_on_caller+0x136/0x1d0 ? entry_SYSCALL_64_fastpath+0x5/0xad ? do_syscall_64+0x27/0x350 ? SyS_clone+0x19/0x20 ? do_syscall_64+0x60/0x350 ? entry_SYSCALL64_slow_path+0x25/0x25 Reported-by: Cliff Spradlin <cspradlin@waymo.com> Signed-off-by: Dima Zavin <dmitriyz@waymo.com> --- v3: - Changed the implementation based on Peter Zijlstra's suggestion. Now using two keys for begin/retry instead of hacking the state into the cookie. - Rebased and tested on top of v4.13-rc3. v4: - Moved the cached cpusets_enabled() state into the cookie, turned the cookie into a struct and updated all the other call sites. - Applied on top of v4.12 since one of the callers in page_alloc.c changed. Still only tested on v4.9.36 and compile tested against v4.12. include/linux/cpuset.h | 19 +++++++++++++++++-- kernel/cgroup/cpuset.c | 1 + 2 files changed, 18 insertions(+), 2 deletions(-) diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h index 119a3f9604b0..e5a684c04c70 100644 --- a/include/linux/cpuset.h +++ b/include/linux/cpuset.h @@ -18,6 +18,19 @@ #ifdef CONFIG_CPUSETS +/* + * Static branch rewrites can happen in an arbitrary order for a given + * key. In code paths where we need to loop with read_mems_allowed_begin() and + * read_mems_allowed_retry() to get a consistent view of mems_allowed, we need + * to ensure that begin() always gets rewritten before retry() in the + * disabled -> enabled transition. If not, then if local irqs are disabled + * around the loop, we can deadlock since retry() would always be + * comparing the latest value of the mems_allowed seqcount against 0 as + * begin() still would see cpusets_enabled() as false. The enabled -> disabled + * transition should happen in reverse order for the same reasons (want to stop + * looking at real value of mems_allowed.sequence in retry() first). + */ +extern struct static_key_false cpusets_pre_enable_key; extern struct static_key_false cpusets_enabled_key; static inline bool cpusets_enabled(void) { @@ -32,12 +45,14 @@ static inline int nr_cpusets(void) static inline void cpuset_inc(void) { + static_branch_inc(&cpusets_pre_enable_key); static_branch_inc(&cpusets_enabled_key); } static inline void cpuset_dec(void) { static_branch_dec(&cpusets_enabled_key); + static_branch_dec(&cpusets_pre_enable_key); } extern int cpuset_init(void); @@ -115,7 +130,7 @@ extern void cpuset_print_current_mems_allowed(void); */ static inline unsigned int read_mems_allowed_begin(void) { - if (!cpusets_enabled()) + if (!static_branch_unlikely(&cpusets_pre_enable_key)) return 0; return read_seqcount_begin(¤t->mems_allowed_seq); @@ -129,7 +144,7 @@ static inline unsigned int read_mems_allowed_begin(void) */ static inline bool read_mems_allowed_retry(unsigned int seq) { - if (!cpusets_enabled()) + if (!static_branch_unlikely(&cpusets_enabled_key)) return false; return read_seqcount_retry(¤t->mems_allowed_seq, seq); diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index ca8376e5008c..8d5151688504 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -63,6 +63,7 @@ #include <linux/cgroup.h> #include <linux/wait.h> +DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key); DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key); /* See "Frequency meter" comments, below. */ -- 2.14.0.rc0.400.g1c36432dff-goog ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v3] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() 2017-07-31 4:01 ` [PATCH v3] " Dima Zavin @ 2017-07-31 4:04 ` Dima Zavin 2017-07-31 8:02 ` Vlastimil Babka 1 sibling, 0 replies; 18+ messages in thread From: Dima Zavin @ 2017-07-31 4:04 UTC (permalink / raw) To: Vlastimil Babka Cc: Peter Zijlstra, Christopher Lameter, Li Zefan, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, cgroups, LKML, linux-mm, Mel Gorman, Cliff Spradlin On Sun, Jul 30, 2017 at 9:01 PM, Dima Zavin <dmitriyz@waymo.com> wrote: > In codepaths that use the begin/retry interface for reading > mems_allowed_seq with irqs disabled, there exists a race condition that > stalls the patch process after only modifying a subset of the > static_branch call sites. > > This problem manifested itself as a dead lock in the slub > allocator, inside get_any_partial. The loop reads > mems_allowed_seq value (via read_mems_allowed_begin), > performs the defrag operation, and then verifies the consistency > of mem_allowed via the read_mems_allowed_retry and the cookie > returned by xxx_begin. The issue here is that both begin and retry > first check if cpusets are enabled via cpusets_enabled() static branch. > This branch can be rewritted dynamically (via cpuset_inc) if a new > cpuset is created. The x86 jump label code fully synchronizes across > all CPUs for every entry it rewrites. If it rewrites only one of the > callsites (specifically the one in read_mems_allowed_retry) and then > waits for the smp_call_function(do_sync_core) to complete while a CPU is > inside the begin/retry section with IRQs off and the mems_allowed value > is changed, we can hang. This is because begin() will always return 0 > (since it wasn't patched yet) while retry() will test the 0 against > the actual value of the seq counter. > > The fix is to use two different static keys: one for begin > (pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we > first bump the pre_enable key to ensure that cpuset_mems_allowed_begin() > always return a valid seqcount if are enabling cpusets. Similarly, > when disabling cpusets via cpuset_dec(), we first ensure that callers > of cpuset_mems_allowed_retry() will start ignoring the seqcount > value before we let cpuset_mems_allowed_begin() return 0. > > The relevant stack traces of the two stuck threads: > > CPU: 1 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4 > Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017 > task: ffff8817f9c28000 task.stack: ffffc9000ffa4000 > RIP: smp_call_function_many+0x1f9/0x260 > Call Trace: > ? setup_data_read+0xa0/0xa0 > ? ___slab_alloc+0x28b/0x5a0 > smp_call_function+0x3b/0x70 > ? setup_data_read+0xa0/0xa0 > on_each_cpu+0x2f/0x90 > ? ___slab_alloc+0x28a/0x5a0 > ? ___slab_alloc+0x28b/0x5a0 > text_poke_bp+0x87/0xd0 > ? ___slab_alloc+0x28a/0x5a0 > arch_jump_label_transform+0x93/0x100 > __jump_label_update+0x77/0x90 > jump_label_update+0xaa/0xc0 > static_key_slow_inc+0x9e/0xb0 > cpuset_css_online+0x70/0x2e0 > online_css+0x2c/0xa0 > cgroup_apply_control_enable+0x27f/0x3d0 > cgroup_mkdir+0x2b7/0x420 > kernfs_iop_mkdir+0x5a/0x80 > vfs_mkdir+0xf6/0x1a0 > SyS_mkdir+0xb7/0xe0 > entry_SYSCALL_64_fastpath+0x18/0xad > > ... > > CPU: 2 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4 > Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017 > task: ffff8818087c0000 task.stack: ffffc90000030000 > RIP: int3+0x39/0x70 > Call Trace: > <#DB> ? ___slab_alloc+0x28b/0x5a0 > <EOE> ? copy_process.part.40+0xf7/0x1de0 > ? __slab_alloc.isra.80+0x54/0x90 > ? copy_process.part.40+0xf7/0x1de0 > ? copy_process.part.40+0xf7/0x1de0 > ? kmem_cache_alloc_node+0x8a/0x280 > ? copy_process.part.40+0xf7/0x1de0 > ? _do_fork+0xe7/0x6c0 > ? _raw_spin_unlock_irq+0x2d/0x60 > ? trace_hardirqs_on_caller+0x136/0x1d0 > ? entry_SYSCALL_64_fastpath+0x5/0xad > ? do_syscall_64+0x27/0x350 > ? SyS_clone+0x19/0x20 > ? do_syscall_64+0x60/0x350 > ? entry_SYSCALL64_slow_path+0x25/0x25 > > Reported-by: Cliff Spradlin <cspradlin@waymo.com> > Signed-off-by: Dima Zavin <dmitriyz@waymo.com> > --- > > v3: > - Changed the implementation based on Peter Zijlstra's suggestion. Now > using two keys for begin/retry instead of hacking the state into the > cookie. > - Rebased and tested on top of v4.13-rc3. > > v4: Doh, latest patch is v3, I obviously meant v2 here instead of v4. Sigh. Sorry. --Dima > - Moved the cached cpusets_enabled() state into the cookie, turned > the cookie into a struct and updated all the other call sites. > - Applied on top of v4.12 since one of the callers in page_alloc.c changed. > Still only tested on v4.9.36 and compile tested against v4.12. > > include/linux/cpuset.h | 19 +++++++++++++++++-- > kernel/cgroup/cpuset.c | 1 + > 2 files changed, 18 insertions(+), 2 deletions(-) > > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h > index 119a3f9604b0..e5a684c04c70 100644 > --- a/include/linux/cpuset.h > +++ b/include/linux/cpuset.h > @@ -18,6 +18,19 @@ > > #ifdef CONFIG_CPUSETS > > +/* > + * Static branch rewrites can happen in an arbitrary order for a given > + * key. In code paths where we need to loop with read_mems_allowed_begin() and > + * read_mems_allowed_retry() to get a consistent view of mems_allowed, we need > + * to ensure that begin() always gets rewritten before retry() in the > + * disabled -> enabled transition. If not, then if local irqs are disabled > + * around the loop, we can deadlock since retry() would always be > + * comparing the latest value of the mems_allowed seqcount against 0 as > + * begin() still would see cpusets_enabled() as false. The enabled -> disabled > + * transition should happen in reverse order for the same reasons (want to stop > + * looking at real value of mems_allowed.sequence in retry() first). > + */ > +extern struct static_key_false cpusets_pre_enable_key; > extern struct static_key_false cpusets_enabled_key; > static inline bool cpusets_enabled(void) > { > @@ -32,12 +45,14 @@ static inline int nr_cpusets(void) > > static inline void cpuset_inc(void) > { > + static_branch_inc(&cpusets_pre_enable_key); > static_branch_inc(&cpusets_enabled_key); > } > > static inline void cpuset_dec(void) > { > static_branch_dec(&cpusets_enabled_key); > + static_branch_dec(&cpusets_pre_enable_key); > } > > extern int cpuset_init(void); > @@ -115,7 +130,7 @@ extern void cpuset_print_current_mems_allowed(void); > */ > static inline unsigned int read_mems_allowed_begin(void) > { > - if (!cpusets_enabled()) > + if (!static_branch_unlikely(&cpusets_pre_enable_key)) > return 0; > > return read_seqcount_begin(¤t->mems_allowed_seq); > @@ -129,7 +144,7 @@ static inline unsigned int read_mems_allowed_begin(void) > */ > static inline bool read_mems_allowed_retry(unsigned int seq) > { > - if (!cpusets_enabled()) > + if (!static_branch_unlikely(&cpusets_enabled_key)) > return false; > > return read_seqcount_retry(¤t->mems_allowed_seq, seq); > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c > index ca8376e5008c..8d5151688504 100644 > --- a/kernel/cgroup/cpuset.c > +++ b/kernel/cgroup/cpuset.c > @@ -63,6 +63,7 @@ > #include <linux/cgroup.h> > #include <linux/wait.h> > > +DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key); > DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key); > > /* See "Frequency meter" comments, below. */ > -- > 2.14.0.rc0.400.g1c36432dff-goog > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() 2017-07-31 4:01 ` [PATCH v3] " Dima Zavin 2017-07-31 4:04 ` Dima Zavin @ 2017-07-31 8:02 ` Vlastimil Babka 2017-07-31 9:05 ` Dima Zavin 1 sibling, 1 reply; 18+ messages in thread From: Vlastimil Babka @ 2017-07-31 8:02 UTC (permalink / raw) To: Dima Zavin Cc: Peter Zijlstra, Christopher Lameter, Li Zefan, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, cgroups, linux-kernel, linux-mm, Mel Gorman, Cliff Spradlin On 07/31/2017 06:01 AM, Dima Zavin wrote: > In codepaths that use the begin/retry interface for reading > mems_allowed_seq with irqs disabled, there exists a race condition that > stalls the patch process after only modifying a subset of the > static_branch call sites. > > This problem manifested itself as a dead lock in the slub > allocator, inside get_any_partial. The loop reads > mems_allowed_seq value (via read_mems_allowed_begin), > performs the defrag operation, and then verifies the consistency > of mem_allowed via the read_mems_allowed_retry and the cookie > returned by xxx_begin. The issue here is that both begin and retry > first check if cpusets are enabled via cpusets_enabled() static branch. > This branch can be rewritted dynamically (via cpuset_inc) if a new > cpuset is created. The x86 jump label code fully synchronizes across > all CPUs for every entry it rewrites. If it rewrites only one of the > callsites (specifically the one in read_mems_allowed_retry) and then > waits for the smp_call_function(do_sync_core) to complete while a CPU is > inside the begin/retry section with IRQs off and the mems_allowed value > is changed, we can hang. This is because begin() will always return 0 > (since it wasn't patched yet) while retry() will test the 0 against > the actual value of the seq counter. > > The fix is to use two different static keys: one for begin > (pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we > first bump the pre_enable key to ensure that cpuset_mems_allowed_begin() > always return a valid seqcount if are enabling cpusets. Similarly, > when disabling cpusets via cpuset_dec(), we first ensure that callers > of cpuset_mems_allowed_retry() will start ignoring the seqcount > value before we let cpuset_mems_allowed_begin() return 0. > > The relevant stack traces of the two stuck threads: > > CPU: 1 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4 > Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017 > task: ffff8817f9c28000 task.stack: ffffc9000ffa4000 > RIP: smp_call_function_many+0x1f9/0x260 > Call Trace: > ? setup_data_read+0xa0/0xa0 > ? ___slab_alloc+0x28b/0x5a0 > smp_call_function+0x3b/0x70 > ? setup_data_read+0xa0/0xa0 > on_each_cpu+0x2f/0x90 > ? ___slab_alloc+0x28a/0x5a0 > ? ___slab_alloc+0x28b/0x5a0 > text_poke_bp+0x87/0xd0 > ? ___slab_alloc+0x28a/0x5a0 > arch_jump_label_transform+0x93/0x100 > __jump_label_update+0x77/0x90 > jump_label_update+0xaa/0xc0 > static_key_slow_inc+0x9e/0xb0 > cpuset_css_online+0x70/0x2e0 > online_css+0x2c/0xa0 > cgroup_apply_control_enable+0x27f/0x3d0 > cgroup_mkdir+0x2b7/0x420 > kernfs_iop_mkdir+0x5a/0x80 > vfs_mkdir+0xf6/0x1a0 > SyS_mkdir+0xb7/0xe0 > entry_SYSCALL_64_fastpath+0x18/0xad > > ... > > CPU: 2 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4 > Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017 > task: ffff8818087c0000 task.stack: ffffc90000030000 > RIP: int3+0x39/0x70 > Call Trace: > <#DB> ? ___slab_alloc+0x28b/0x5a0 > <EOE> ? copy_process.part.40+0xf7/0x1de0 > ? __slab_alloc.isra.80+0x54/0x90 > ? copy_process.part.40+0xf7/0x1de0 > ? copy_process.part.40+0xf7/0x1de0 > ? kmem_cache_alloc_node+0x8a/0x280 > ? copy_process.part.40+0xf7/0x1de0 > ? _do_fork+0xe7/0x6c0 > ? _raw_spin_unlock_irq+0x2d/0x60 > ? trace_hardirqs_on_caller+0x136/0x1d0 > ? entry_SYSCALL_64_fastpath+0x5/0xad > ? do_syscall_64+0x27/0x350 > ? SyS_clone+0x19/0x20 > ? do_syscall_64+0x60/0x350 > ? entry_SYSCALL64_slow_path+0x25/0x25 > > Reported-by: Cliff Spradlin <cspradlin@waymo.com> > Signed-off-by: Dima Zavin <dmitriyz@waymo.com> Looks good. Could you verify it fixes the issue, or was it too hard to reproduce? Also is this a stable candidate patch, and can you identify an exact commit hash it fixes? Acked-by: Vlastimil Babka <vbabka@suse.cz> > --- > > v3: > - Changed the implementation based on Peter Zijlstra's suggestion. Now > using two keys for begin/retry instead of hacking the state into the > cookie. > - Rebased and tested on top of v4.13-rc3. > > v4: > - Moved the cached cpusets_enabled() state into the cookie, turned > the cookie into a struct and updated all the other call sites. > - Applied on top of v4.12 since one of the callers in page_alloc.c changed. > Still only tested on v4.9.36 and compile tested against v4.12. > > include/linux/cpuset.h | 19 +++++++++++++++++-- > kernel/cgroup/cpuset.c | 1 + > 2 files changed, 18 insertions(+), 2 deletions(-) > > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h > index 119a3f9604b0..e5a684c04c70 100644 > --- a/include/linux/cpuset.h > +++ b/include/linux/cpuset.h > @@ -18,6 +18,19 @@ > > #ifdef CONFIG_CPUSETS > > +/* > + * Static branch rewrites can happen in an arbitrary order for a given > + * key. In code paths where we need to loop with read_mems_allowed_begin() and > + * read_mems_allowed_retry() to get a consistent view of mems_allowed, we need > + * to ensure that begin() always gets rewritten before retry() in the > + * disabled -> enabled transition. If not, then if local irqs are disabled > + * around the loop, we can deadlock since retry() would always be > + * comparing the latest value of the mems_allowed seqcount against 0 as > + * begin() still would see cpusets_enabled() as false. The enabled -> disabled > + * transition should happen in reverse order for the same reasons (want to stop > + * looking at real value of mems_allowed.sequence in retry() first). > + */ > +extern struct static_key_false cpusets_pre_enable_key; > extern struct static_key_false cpusets_enabled_key; > static inline bool cpusets_enabled(void) > { > @@ -32,12 +45,14 @@ static inline int nr_cpusets(void) > > static inline void cpuset_inc(void) > { > + static_branch_inc(&cpusets_pre_enable_key); > static_branch_inc(&cpusets_enabled_key); > } > > static inline void cpuset_dec(void) > { > static_branch_dec(&cpusets_enabled_key); > + static_branch_dec(&cpusets_pre_enable_key); > } > > extern int cpuset_init(void); > @@ -115,7 +130,7 @@ extern void cpuset_print_current_mems_allowed(void); > */ > static inline unsigned int read_mems_allowed_begin(void) > { > - if (!cpusets_enabled()) > + if (!static_branch_unlikely(&cpusets_pre_enable_key)) > return 0; > > return read_seqcount_begin(¤t->mems_allowed_seq); > @@ -129,7 +144,7 @@ static inline unsigned int read_mems_allowed_begin(void) > */ > static inline bool read_mems_allowed_retry(unsigned int seq) > { > - if (!cpusets_enabled()) > + if (!static_branch_unlikely(&cpusets_enabled_key)) > return false; > > return read_seqcount_retry(¤t->mems_allowed_seq, seq); > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c > index ca8376e5008c..8d5151688504 100644 > --- a/kernel/cgroup/cpuset.c > +++ b/kernel/cgroup/cpuset.c > @@ -63,6 +63,7 @@ > #include <linux/cgroup.h> > #include <linux/wait.h> > > +DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key); > DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key); > > /* See "Frequency meter" comments, below. */ > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() 2017-07-31 8:02 ` Vlastimil Babka @ 2017-07-31 9:05 ` Dima Zavin 0 siblings, 0 replies; 18+ messages in thread From: Dima Zavin @ 2017-07-31 9:05 UTC (permalink / raw) To: Vlastimil Babka Cc: Peter Zijlstra, Christopher Lameter, Li Zefan, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, cgroups, LKML, linux-mm, Mel Gorman, Cliff Spradlin On Mon, Jul 31, 2017 at 1:02 AM, Vlastimil Babka <vbabka@suse.cz> wrote: > On 07/31/2017 06:01 AM, Dima Zavin wrote: >> In codepaths that use the begin/retry interface for reading >> mems_allowed_seq with irqs disabled, there exists a race condition that >> stalls the patch process after only modifying a subset of the >> static_branch call sites. >> >> This problem manifested itself as a dead lock in the slub >> allocator, inside get_any_partial. The loop reads >> mems_allowed_seq value (via read_mems_allowed_begin), >> performs the defrag operation, and then verifies the consistency >> of mem_allowed via the read_mems_allowed_retry and the cookie >> returned by xxx_begin. The issue here is that both begin and retry >> first check if cpusets are enabled via cpusets_enabled() static branch. >> This branch can be rewritted dynamically (via cpuset_inc) if a new >> cpuset is created. The x86 jump label code fully synchronizes across >> all CPUs for every entry it rewrites. If it rewrites only one of the >> callsites (specifically the one in read_mems_allowed_retry) and then >> waits for the smp_call_function(do_sync_core) to complete while a CPU is >> inside the begin/retry section with IRQs off and the mems_allowed value >> is changed, we can hang. This is because begin() will always return 0 >> (since it wasn't patched yet) while retry() will test the 0 against >> the actual value of the seq counter. >> >> The fix is to use two different static keys: one for begin >> (pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we >> first bump the pre_enable key to ensure that cpuset_mems_allowed_begin() >> always return a valid seqcount if are enabling cpusets. Similarly, >> when disabling cpusets via cpuset_dec(), we first ensure that callers >> of cpuset_mems_allowed_retry() will start ignoring the seqcount >> value before we let cpuset_mems_allowed_begin() return 0. >> >> The relevant stack traces of the two stuck threads: >> >> CPU: 1 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4 >> Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017 >> task: ffff8817f9c28000 task.stack: ffffc9000ffa4000 >> RIP: smp_call_function_many+0x1f9/0x260 >> Call Trace: >> ? setup_data_read+0xa0/0xa0 >> ? ___slab_alloc+0x28b/0x5a0 >> smp_call_function+0x3b/0x70 >> ? setup_data_read+0xa0/0xa0 >> on_each_cpu+0x2f/0x90 >> ? ___slab_alloc+0x28a/0x5a0 >> ? ___slab_alloc+0x28b/0x5a0 >> text_poke_bp+0x87/0xd0 >> ? ___slab_alloc+0x28a/0x5a0 >> arch_jump_label_transform+0x93/0x100 >> __jump_label_update+0x77/0x90 >> jump_label_update+0xaa/0xc0 >> static_key_slow_inc+0x9e/0xb0 >> cpuset_css_online+0x70/0x2e0 >> online_css+0x2c/0xa0 >> cgroup_apply_control_enable+0x27f/0x3d0 >> cgroup_mkdir+0x2b7/0x420 >> kernfs_iop_mkdir+0x5a/0x80 >> vfs_mkdir+0xf6/0x1a0 >> SyS_mkdir+0xb7/0xe0 >> entry_SYSCALL_64_fastpath+0x18/0xad >> >> ... >> >> CPU: 2 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4 >> Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017 >> task: ffff8818087c0000 task.stack: ffffc90000030000 >> RIP: int3+0x39/0x70 >> Call Trace: >> <#DB> ? ___slab_alloc+0x28b/0x5a0 >> <EOE> ? copy_process.part.40+0xf7/0x1de0 >> ? __slab_alloc.isra.80+0x54/0x90 >> ? copy_process.part.40+0xf7/0x1de0 >> ? copy_process.part.40+0xf7/0x1de0 >> ? kmem_cache_alloc_node+0x8a/0x280 >> ? copy_process.part.40+0xf7/0x1de0 >> ? _do_fork+0xe7/0x6c0 >> ? _raw_spin_unlock_irq+0x2d/0x60 >> ? trace_hardirqs_on_caller+0x136/0x1d0 >> ? entry_SYSCALL_64_fastpath+0x5/0xad >> ? do_syscall_64+0x27/0x350 >> ? SyS_clone+0x19/0x20 >> ? do_syscall_64+0x60/0x350 >> ? entry_SYSCALL64_slow_path+0x25/0x25 >> >> Reported-by: Cliff Spradlin <cspradlin@waymo.com> >> Signed-off-by: Dima Zavin <dmitriyz@waymo.com> > > Looks good. Could you verify it fixes the issue, or was it too hard to > reproduce? Also is this a stable candidate patch, and can you identify > an exact commit hash it fixes? It's tough to reproduce but this fix works as well as the original hack. I think the problematic commit must have been: commit 46e700abc44ce215acb4341d9702ce3972eda571 Author: Mel Gorman <mgorman@techsingularity.net> Date: Fri Nov 6 16:28:15 2015 -0800 mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled It's probably stable worthy, but I don't know who can make the call on that. Thanks for the reviews! --Dima > > Acked-by: Vlastimil Babka <vbabka@suse.cz> > >> --- >> >> v3: >> - Changed the implementation based on Peter Zijlstra's suggestion. Now >> using two keys for begin/retry instead of hacking the state into the >> cookie. >> - Rebased and tested on top of v4.13-rc3. >> >> v4: >> - Moved the cached cpusets_enabled() state into the cookie, turned >> the cookie into a struct and updated all the other call sites. >> - Applied on top of v4.12 since one of the callers in page_alloc.c changed. >> Still only tested on v4.9.36 and compile tested against v4.12. >> >> include/linux/cpuset.h | 19 +++++++++++++++++-- >> kernel/cgroup/cpuset.c | 1 + >> 2 files changed, 18 insertions(+), 2 deletions(-) >> >> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h >> index 119a3f9604b0..e5a684c04c70 100644 >> --- a/include/linux/cpuset.h >> +++ b/include/linux/cpuset.h >> @@ -18,6 +18,19 @@ >> >> #ifdef CONFIG_CPUSETS >> >> +/* >> + * Static branch rewrites can happen in an arbitrary order for a given >> + * key. In code paths where we need to loop with read_mems_allowed_begin() and >> + * read_mems_allowed_retry() to get a consistent view of mems_allowed, we need >> + * to ensure that begin() always gets rewritten before retry() in the >> + * disabled -> enabled transition. If not, then if local irqs are disabled >> + * around the loop, we can deadlock since retry() would always be >> + * comparing the latest value of the mems_allowed seqcount against 0 as >> + * begin() still would see cpusets_enabled() as false. The enabled -> disabled >> + * transition should happen in reverse order for the same reasons (want to stop >> + * looking at real value of mems_allowed.sequence in retry() first). >> + */ >> +extern struct static_key_false cpusets_pre_enable_key; >> extern struct static_key_false cpusets_enabled_key; >> static inline bool cpusets_enabled(void) >> { >> @@ -32,12 +45,14 @@ static inline int nr_cpusets(void) >> >> static inline void cpuset_inc(void) >> { >> + static_branch_inc(&cpusets_pre_enable_key); >> static_branch_inc(&cpusets_enabled_key); >> } >> >> static inline void cpuset_dec(void) >> { >> static_branch_dec(&cpusets_enabled_key); >> + static_branch_dec(&cpusets_pre_enable_key); >> } >> >> extern int cpuset_init(void); >> @@ -115,7 +130,7 @@ extern void cpuset_print_current_mems_allowed(void); >> */ >> static inline unsigned int read_mems_allowed_begin(void) >> { >> - if (!cpusets_enabled()) >> + if (!static_branch_unlikely(&cpusets_pre_enable_key)) >> return 0; >> >> return read_seqcount_begin(¤t->mems_allowed_seq); >> @@ -129,7 +144,7 @@ static inline unsigned int read_mems_allowed_begin(void) >> */ >> static inline bool read_mems_allowed_retry(unsigned int seq) >> { >> - if (!cpusets_enabled()) >> + if (!static_branch_unlikely(&cpusets_enabled_key)) >> return false; >> >> return read_seqcount_retry(¤t->mems_allowed_seq, seq); >> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c >> index ca8376e5008c..8d5151688504 100644 >> --- a/kernel/cgroup/cpuset.c >> +++ b/kernel/cgroup/cpuset.c >> @@ -63,6 +63,7 @@ >> #include <linux/cgroup.h> >> #include <linux/wait.h> >> >> +DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key); >> DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key); >> >> /* See "Frequency meter" comments, below. */ >> > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() 2017-07-27 16:46 ` [PATCH v2] cpuset: " Dima Zavin ` (2 preceding siblings ...) 2017-07-28 7:45 ` Vlastimil Babka @ 2017-07-29 4:56 ` kbuild test robot 3 siblings, 0 replies; 18+ messages in thread From: kbuild test robot @ 2017-07-29 4:56 UTC (permalink / raw) To: Dima Zavin Cc: kbuild-all, Christopher Lameter, Li Zefan, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, cgroups, linux-kernel, linux-mm, Cliff Spradlin, Mel Gorman [-- Attachment #1: Type: text/plain, Size: 1827 bytes --] Hi Dima, [auto build test WARNING on v4.12] [cannot apply to cgroup/for-next linus/master v4.13-rc2 v4.13-rc1 next-20170728] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Dima-Zavin/cpuset-fix-a-deadlock-due-to-incomplete-patching-of-cpusets_enabled/20170729-123852 config: i386-randconfig-x019-201730 (attached as .config) compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901 reproduce: # save the attached .config to linux build tree make ARCH=i386 All warnings (new ones prefixed by >>): In file included from kernel/sched/core.c:13:0: include/linux/cpuset.h: In function 'read_mems_allowed_begin': >> include/linux/cpuset.h:261:9: warning: 'return' with a value, in function returning void return 0; ^ include/linux/cpuset.h:259:20: note: declared here static inline void read_mems_allowed_begin(struct cpuset_mems_cookie *cookie) ^~~~~~~~~~~~~~~~~~~~~~~ vim +/return +261 include/linux/cpuset.h 58568d2a Miao Xie 2009-06-16 258 6b828cc9 Dima Zavin 2017-07-27 259 static inline void read_mems_allowed_begin(struct cpuset_mems_cookie *cookie) c0ff7453 Miao Xie 2010-05-24 260 { cc9a6c87 Mel Gorman 2012-03-21 @261 return 0; c0ff7453 Miao Xie 2010-05-24 262 } c0ff7453 Miao Xie 2010-05-24 263 :::::: The code at line 261 was first introduced by commit :::::: cc9a6c8776615f9c194ccf0b63a0aa5628235545 cpuset: mm: reduce large amounts of memory barrier related damage v3 :::::: TO: Mel Gorman <mgorman@suse.de> :::::: CC: Linus Torvalds <torvalds@linux-foundation.org> --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 30137 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2017-07-31 9:05 UTC | newest] Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-07-26 16:50 [RFC PATCH] mm/slub: fix a deadlock due to incomplete patching of cpusets_enabled() Dima Zavin 2017-07-26 17:02 ` Christopher Lameter 2017-07-26 19:54 ` Dima Zavin 2017-07-27 16:46 ` [PATCH v2] cpuset: " Dima Zavin 2017-07-27 19:48 ` Andrew Morton 2017-07-27 21:41 ` Dima Zavin 2017-07-27 19:51 ` Andrew Morton 2017-07-27 21:41 ` Dima Zavin 2017-07-28 7:45 ` Vlastimil Babka 2017-07-28 8:48 ` Dima Zavin 2017-07-28 9:30 ` Peter Zijlstra 2017-07-28 14:05 ` Vlastimil Babka 2017-07-28 16:52 ` Dima Zavin 2017-07-31 4:01 ` [PATCH v3] " Dima Zavin 2017-07-31 4:04 ` Dima Zavin 2017-07-31 8:02 ` Vlastimil Babka 2017-07-31 9:05 ` Dima Zavin 2017-07-29 4:56 ` [PATCH v2] " kbuild test robot
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).