* [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
@ 2012-03-07 18:08 ` Mel Gorman
  0 siblings, 0 replies; 39+ messages in thread
From: Mel Gorman @ 2012-03-07 18:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Miao Xie, David Rientjes, Peter Zijlstra, Christoph Lameter,
	linux-mm, linux-kernel

Changelog since V2
  o Documentation						(akpm)
  o Do not retry hugetlb allocations in the event of an error	(akpm)
  o Properly initialise seqcount				(akpm)

Changelog since V1
  o Use seqcount with rmb instead of atomics (Peter, Christoph)

Commit [c0ff7453: cpuset,mm: fix no node to alloc memory when changing
cpuset's mems] wins a super prize for the largest number of memory
barriers entered into fast paths for one commit. [get|put]_mems_allowed
is incredibly heavy with pairs of full memory barriers inserted into a
number of hot paths. This was detected while investigating a large page
allocator slowdown introduced some time after 2.6.32. The largest portion
of this overhead was shown by oprofile to be at an mfence introduced by
this commit into the page allocator hot path.

For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.

This patch replaces the full memory barriers on both the read and write
sides with a sequence counter that requires only read barriers on the
fast path side. This is much cheaper on some architectures, including
x86. The bulk of the patch is the retry logic for when the nodemask
changes in a manner that could cause a false allocation failure.

While updating the nodemask, a check is made to see if a false failure is
a risk. If it is, the sequence number gets bumped and parallel allocators
will briefly stall while the nodemask update takes place.
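
As an aside, and purely for illustration (this is not part of the patch),
the seqcount pattern the new helpers are built on looks roughly like the
sketch below. The names example_seq, example_mask, example_reader and
example_writer are invented for the sketch, and writers are assumed to be
serialised externally, as the patch does with task_lock().

#include <linux/seqlock.h>
#include <linux/nodemask.h>

static seqcount_t example_seq = SEQCNT_ZERO;	/* same init as the patch uses */
static nodemask_t example_mask;			/* hypothetical protected data */

static void example_reader(nodemask_t *snapshot)
{
	unsigned int seq;

	do {
		seq = read_seqcount_begin(&example_seq);
		*snapshot = example_mask;	/* read-side work, no full barrier */
	} while (read_seqcount_retry(&example_seq, seq));
}

static void example_writer(const nodemask_t *newmask)
{
	/* caller must hold whatever lock serialises writers */
	write_seqcount_begin(&example_seq);
	example_mask = *newmask;
	write_seqcount_end(&example_seq);
}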

In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The actual
results were

                         3.3.0-rc3          3.3.0-rc3
                         rc3-vanilla        nobarrier-v2r1
Clients   1 UserTime       0.07 (  0.00%)   0.08 (-14.19%)
Clients   2 UserTime       0.07 (  0.00%)   0.07 (  2.72%)
Clients   4 UserTime       0.08 (  0.00%)   0.07 (  3.29%)
Clients   1 SysTime        0.70 (  0.00%)   0.65 (  6.65%)
Clients   2 SysTime        0.85 (  0.00%)   0.82 (  3.65%)
Clients   4 SysTime        1.41 (  0.00%)   1.41 (  0.32%)
Clients   1 WallTime       0.77 (  0.00%)   0.74 (  4.19%)
Clients   2 WallTime       0.47 (  0.00%)   0.45 (  3.73%)
Clients   4 WallTime       0.38 (  0.00%)   0.37 (  1.58%)
Clients   1 Flt/sec/cpu  497620.28 (  0.00%) 520294.53 (  4.56%)
Clients   2 Flt/sec/cpu  414639.05 (  0.00%) 429882.01 (  3.68%)
Clients   4 Flt/sec/cpu  257959.16 (  0.00%) 258761.48 (  0.31%)
Clients   1 Flt/sec      495161.39 (  0.00%) 517292.87 (  4.47%)
Clients   2 Flt/sec      820325.95 (  0.00%) 850289.77 (  3.65%)
Clients   4 Flt/sec      1020068.93 (  0.00%) 1022674.06 (  0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds)             135.68    132.17
User+Sys Time Running Test (seconds)         164.2    160.13
Total Elapsed Time (seconds)                123.46    120.87

The overall improvement is small but the System CPU time is much improved
and roughly in line with what oprofile reported (these performance figures
were collected without profiling, so some skew is expected). The actual
number of page faults is noticeably improved.

For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.

To test the actual bug the commit fixed, I opened two terminals. The first
ran within a cpuset and continually ran a small program that faulted 100M
of anonymous data. In a second window, the nodemask of the cpuset was
continually randomised in a loop. Without the commit, the program would
fail every so often (usually within 10 seconds); with the commit, everything
worked fine. With this patch applied, it also worked fine, so the fix should
be functionally equivalent.
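
For reference, the faulting program was nothing special. A minimal
userspace sketch of that kind of test (a reconstruction for illustration,
not the actual source used) is simply:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define MAPSIZE (100UL << 20)		/* 100M of anonymous memory */

int main(void)
{
	for (;;) {
		char *buf = mmap(NULL, MAPSIZE, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED) {
			perror("mmap");
			exit(EXIT_FAILURE);
		}

		/* touch every page so each one is actually faulted in */
		memset(buf, 1, MAPSIZE);
		munmap(buf, MAPSIZE);
	}

	return 0;
}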

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/cpuset.h    |   47 +++++++++++++++++++--------------------------
 include/linux/init_task.h |    8 ++++++++
 include/linux/sched.h     |    2 +-
 kernel/cpuset.c           |   43 ++++++++---------------------------------
 kernel/fork.c             |    1 +
 mm/filemap.c              |   11 +++++++----
 mm/hugetlb.c              |   15 +++++++++++----
 mm/mempolicy.c            |   28 ++++++++++++++++++++-------
 mm/page_alloc.c           |   33 +++++++++++++++++++++----------
 mm/slab.c                 |   13 ++++++++-----
 mm/slub.c                 |   40 +++++++++++++++++++++++---------------
 mm/vmscan.c               |    2 --
 12 files changed, 133 insertions(+), 110 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index e9eaec5..7a7e5fd 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -89,42 +89,33 @@ extern void rebuild_sched_domains(void);
 extern void cpuset_print_task_mems_allowed(struct task_struct *p);
 
 /*
- * reading current mems_allowed and mempolicy in the fastpath must protected
- * by get_mems_allowed()
+ * get_mems_allowed is required when making decisions involving mems_allowed
+ * such as during page allocation. mems_allowed can be updated in parallel
+ * and depending on the new value an operation can fail potentially causing
+ * process failure. A retry loop with get_mems_allowed and put_mems_allowed
+ * prevents these artificial failures.
  */
-static inline void get_mems_allowed(void)
+static inline unsigned int get_mems_allowed(void)
 {
-	current->mems_allowed_change_disable++;
-
-	/*
-	 * ensure that reading mems_allowed and mempolicy happens after the
-	 * update of ->mems_allowed_change_disable.
-	 *
-	 * the write-side task finds ->mems_allowed_change_disable is not 0,
-	 * and knows the read-side task is reading mems_allowed or mempolicy,
-	 * so it will clear old bits lazily.
-	 */
-	smp_mb();
+	return read_seqcount_begin(&current->mems_allowed_seq);
 }
 
-static inline void put_mems_allowed(void)
+/*
+ * If this returns false, the operation that took place after get_mems_allowed
+ * may have failed. It is up to the caller to retry the operation if
+ * appropriate.
+ */
+static inline bool put_mems_allowed(unsigned int seq)
 {
-	/*
-	 * ensure that reading mems_allowed and mempolicy before reducing
-	 * mems_allowed_change_disable.
-	 *
-	 * the write-side task will know that the read-side task is still
-	 * reading mems_allowed or mempolicy, don't clears old bits in the
-	 * nodemask.
-	 */
-	smp_mb();
-	--ACCESS_ONCE(current->mems_allowed_change_disable);
+	return !read_seqcount_retry(&current->mems_allowed_seq, seq);
 }
 
 static inline void set_mems_allowed(nodemask_t nodemask)
 {
 	task_lock(current);
+	write_seqcount_begin(&current->mems_allowed_seq);
 	current->mems_allowed = nodemask;
+	write_seqcount_end(&current->mems_allowed_seq);
 	task_unlock(current);
 }
 
@@ -234,12 +225,14 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 {
 }
 
-static inline void get_mems_allowed(void)
+static inline unsigned int get_mems_allowed(void)
 {
+	return 0;
 }
 
-static inline void put_mems_allowed(void)
+static inline bool put_mems_allowed(unsigned int seq)
 {
+	return true;
 }
 
 #endif /* !CONFIG_CPUSETS */
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 9c66b1a..fa7d16d 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -29,6 +29,13 @@ extern struct fs_struct init_fs;
 #define INIT_GROUP_RWSEM(sig)
 #endif
 
+#ifdef CONFIG_CPUSETS
+#define INIT_CPUSET_SEQ							\
+	.mems_allowed_seq = SEQCNT_ZERO,
+#else
+#define INIT_CPUSET_SEQ
+#endif
+
 #define INIT_SIGNALS(sig) {						\
 	.nr_threads	= 1,						\
 	.wait_chldexit	= __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\
@@ -192,6 +199,7 @@ extern struct cred init_cred;
 	INIT_FTRACE_GRAPH						\
 	INIT_TRACE_RECURSION						\
 	INIT_TASK_RCU_PREEMPT(tsk)					\
+	INIT_CPUSET_SEQ							\
 }
 
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7d379a6..a0bb87a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1498,7 +1498,7 @@ struct task_struct {
 #endif
 #ifdef CONFIG_CPUSETS
 	nodemask_t mems_allowed;	/* Protected by alloc_lock */
-	int mems_allowed_change_disable;
+	seqcount_t mems_allowed_seq;	/* Seqence no to catch updates */
 	int cpuset_mem_spread_rotor;
 	int cpuset_slab_spread_rotor;
 #endif
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index a09ac2b..5014493 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -964,7 +964,6 @@ static void cpuset_change_task_nodemask(struct task_struct *tsk,
 {
 	bool need_loop;
 
-repeat:
 	/*
 	 * Allow tasks that have access to memory reserves because they have
 	 * been OOM killed to get memory anywhere.
@@ -983,45 +982,19 @@ repeat:
 	 */
 	need_loop = task_has_mempolicy(tsk) ||
 			!nodes_intersects(*newmems, tsk->mems_allowed);
-	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
-	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
 
-	/*
-	 * ensure checking ->mems_allowed_change_disable after setting all new
-	 * allowed nodes.
-	 *
-	 * the read-side task can see an nodemask with new allowed nodes and
-	 * old allowed nodes. and if it allocates page when cpuset clears newly
-	 * disallowed ones continuous, it can see the new allowed bits.
-	 *
-	 * And if setting all new allowed nodes is after the checking, setting
-	 * all new allowed nodes and clearing newly disallowed ones will be done
-	 * continuous, and the read-side task may find no node to alloc page.
-	 */
-	smp_mb();
+	if (need_loop)
+		write_seqcount_begin(&tsk->mems_allowed_seq);
 
-	/*
-	 * Allocation of memory is very fast, we needn't sleep when waiting
-	 * for the read-side.
-	 */
-	while (need_loop && ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
-		task_unlock(tsk);
-		if (!task_curr(tsk))
-			yield();
-		goto repeat;
-	}
-
-	/*
-	 * ensure checking ->mems_allowed_change_disable before clearing all new
-	 * disallowed nodes.
-	 *
-	 * if clearing newly disallowed bits before the checking, the read-side
-	 * task may find no node to alloc page.
-	 */
-	smp_mb();
+	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
+	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
 
 	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2);
 	tsk->mems_allowed = *newmems;
+
+	if (need_loop)
+		write_seqcount_end(&tsk->mems_allowed_seq);
+
 	task_unlock(tsk);
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index e2cd3e2..d79526b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1195,6 +1195,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 #ifdef CONFIG_CPUSETS
 	p->cpuset_mem_spread_rotor = NUMA_NO_NODE;
 	p->cpuset_slab_spread_rotor = NUMA_NO_NODE;
+	seqcount_init(&p->mems_allowed_seq);
 #endif
 #ifdef CONFIG_TRACE_IRQFLAGS
 	p->irq_events = 0;
diff --git a/mm/filemap.c b/mm/filemap.c
index b662757..dee0860 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -500,10 +500,13 @@ struct page *__page_cache_alloc(gfp_t gfp)
 	struct page *page;
 
 	if (cpuset_do_page_mem_spread()) {
-		get_mems_allowed();
-		n = cpuset_mem_spread_node();
-		page = alloc_pages_exact_node(n, gfp, 0);
-		put_mems_allowed();
+		unsigned int cpuset_mems_cookie;
+		do {
+			cpuset_mems_cookie = get_mems_allowed();
+			n = cpuset_mem_spread_node();
+			page = alloc_pages_exact_node(n, gfp, 0);
+		} while (!put_mems_allowed(cpuset_mems_cookie) && !page);
+
 		return page;
 	}
 	return alloc_pages(gfp, 0);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5f34bd8..1434b7f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -454,14 +454,16 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 				struct vm_area_struct *vma,
 				unsigned long address, int avoid_reserve)
 {
-	struct page *page = NULL;
+	struct page *page;
 	struct mempolicy *mpol;
 	nodemask_t *nodemask;
 	struct zonelist *zonelist;
 	struct zone *zone;
 	struct zoneref *z;
+	unsigned int cpuset_mems_cookie;
 
-	get_mems_allowed();
+retry_cpuset:
+	cpuset_mems_cookie = get_mems_allowed();
 	zonelist = huge_zonelist(vma, address,
 					htlb_alloc_mask, &mpol, &nodemask);
 	/*
@@ -488,10 +490,15 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 			}
 		}
 	}
-err:
+
 	mpol_cond_put(mpol);
-	put_mems_allowed();
+	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+		goto retry_cpuset;
 	return page;
+
+err:
+	mpol_cond_put(mpol);
+	return NULL;
 }
 
 static void update_and_free_page(struct hstate *h, struct page *page)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 06b145f..013d981 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1843,18 +1843,24 @@ struct page *
 alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		unsigned long addr, int node)
 {
-	struct mempolicy *pol = get_vma_policy(current, vma, addr);
+	struct mempolicy *pol;
 	struct zonelist *zl;
 	struct page *page;
+	unsigned int cpuset_mems_cookie;
+
+retry_cpuset:
+	pol = get_vma_policy(current, vma, addr);
+	cpuset_mems_cookie = get_mems_allowed();
 
-	get_mems_allowed();
 	if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
 		unsigned nid;
 
 		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
 		mpol_cond_put(pol);
 		page = alloc_page_interleave(gfp, order, nid);
-		put_mems_allowed();
+		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+			goto retry_cpuset;
+
 		return page;
 	}
 	zl = policy_zonelist(gfp, pol, node);
@@ -1865,7 +1871,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		struct page *page =  __alloc_pages_nodemask(gfp, order,
 						zl, policy_nodemask(gfp, pol));
 		__mpol_put(pol);
-		put_mems_allowed();
+		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+			goto retry_cpuset;
 		return page;
 	}
 	/*
@@ -1873,7 +1880,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 	 */
 	page = __alloc_pages_nodemask(gfp, order, zl,
 				      policy_nodemask(gfp, pol));
-	put_mems_allowed();
+	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+		goto retry_cpuset;
 	return page;
 }
 
@@ -1900,11 +1908,14 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 {
 	struct mempolicy *pol = current->mempolicy;
 	struct page *page;
+	unsigned int cpuset_mems_cookie;
 
 	if (!pol || in_interrupt() || (gfp & __GFP_THISNODE))
 		pol = &default_policy;
 
-	get_mems_allowed();
+retry_cpuset:
+	cpuset_mems_cookie = get_mems_allowed();
+
 	/*
 	 * No reference counting needed for current->mempolicy
 	 * nor system default_policy
@@ -1915,7 +1926,10 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 		page = __alloc_pages_nodemask(gfp, order,
 				policy_zonelist(gfp, pol, numa_node_id()),
 				policy_nodemask(gfp, pol));
-	put_mems_allowed();
+
+	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+		goto retry_cpuset;
+
 	return page;
 }
 EXPORT_SYMBOL(alloc_pages_current);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a13ded1..453c348 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2378,8 +2378,9 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 {
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	struct zone *preferred_zone;
-	struct page *page;
+	struct page *page = NULL;
 	int migratetype = allocflags_to_migratetype(gfp_mask);
+	unsigned int cpuset_mems_cookie;
 
 	gfp_mask &= gfp_allowed_mask;
 
@@ -2398,15 +2399,15 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	if (unlikely(!zonelist->_zonerefs->zone))
 		return NULL;
 
-	get_mems_allowed();
+retry_cpuset:
+	cpuset_mems_cookie = get_mems_allowed();
+
 	/* The preferred zone is used for statistics later */
 	first_zones_zonelist(zonelist, high_zoneidx,
 				nodemask ? : &cpuset_current_mems_allowed,
 				&preferred_zone);
-	if (!preferred_zone) {
-		put_mems_allowed();
-		return NULL;
-	}
+	if (!preferred_zone)
+		goto out;
 
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
@@ -2416,9 +2417,19 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
-	put_mems_allowed();
 
 	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
+
+out:
+	/*
+	 * When updating a task's mems_allowed, it is possible to race with
+	 * parallel threads in such a way that an allocation can fail while
+	 * the mask is being updated. If a page allocation is about to fail,
+	 * check if the cpuset changed during allocation and if so, retry.
+	 */
+	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+		goto retry_cpuset;
+
 	return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
@@ -2632,13 +2643,15 @@ void si_meminfo_node(struct sysinfo *val, int nid)
 bool skip_free_areas_node(unsigned int flags, int nid)
 {
 	bool ret = false;
+	unsigned int cpuset_mems_cookie;
 
 	if (!(flags & SHOW_MEM_FILTER_NODES))
 		goto out;
 
-	get_mems_allowed();
-	ret = !node_isset(nid, cpuset_current_mems_allowed);
-	put_mems_allowed();
+	do {
+		cpuset_mems_cookie = get_mems_allowed();
+		ret = !node_isset(nid, cpuset_current_mems_allowed);
+	} while (!put_mems_allowed(cpuset_mems_cookie));
 out:
 	return ret;
 }
diff --git a/mm/slab.c b/mm/slab.c
index f0bd785..29c8716 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3284,12 +3284,10 @@ static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags)
 	if (in_interrupt() || (flags & __GFP_THISNODE))
 		return NULL;
 	nid_alloc = nid_here = numa_mem_id();
-	get_mems_allowed();
 	if (cpuset_do_slab_mem_spread() && (cachep->flags & SLAB_MEM_SPREAD))
 		nid_alloc = cpuset_slab_spread_node();
 	else if (current->mempolicy)
 		nid_alloc = slab_node(current->mempolicy);
-	put_mems_allowed();
 	if (nid_alloc != nid_here)
 		return ____cache_alloc_node(cachep, flags, nid_alloc);
 	return NULL;
@@ -3312,14 +3310,17 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
 	enum zone_type high_zoneidx = gfp_zone(flags);
 	void *obj = NULL;
 	int nid;
+	unsigned int cpuset_mems_cookie;
 
 	if (flags & __GFP_THISNODE)
 		return NULL;
 
-	get_mems_allowed();
-	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
 	local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
 
+retry_cpuset:
+	cpuset_mems_cookie = get_mems_allowed();
+	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
+
 retry:
 	/*
 	 * Look through allowed nodes for objects available
@@ -3372,7 +3373,9 @@ retry:
 			}
 		}
 	}
-	put_mems_allowed();
+
+	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !obj))
+		goto retry_cpuset;
 	return obj;
 }
 
diff --git a/mm/slub.c b/mm/slub.c
index 4907563..f4a6229 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1581,6 +1581,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags,
 	struct zone *zone;
 	enum zone_type high_zoneidx = gfp_zone(flags);
 	void *object;
+	unsigned int cpuset_mems_cookie;
 
 	/*
 	 * The defrag ratio allows a configuration of the tradeoffs between
@@ -1604,23 +1605,32 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags,
 			get_cycles() % 1024 > s->remote_node_defrag_ratio)
 		return NULL;
 
-	get_mems_allowed();
-	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-		struct kmem_cache_node *n;
-
-		n = get_node(s, zone_to_nid(zone));
-
-		if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
-				n->nr_partial > s->min_partial) {
-			object = get_partial_node(s, n, c);
-			if (object) {
-				put_mems_allowed();
-				return object;
+	do {
+		cpuset_mems_cookie = get_mems_allowed();
+		zonelist = node_zonelist(slab_node(current->mempolicy), flags);
+		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+			struct kmem_cache_node *n;
+
+			n = get_node(s, zone_to_nid(zone));
+
+			if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
+					n->nr_partial > s->min_partial) {
+				object = get_partial_node(s, n, c);
+				if (object) {
+					/*
+					 * Return the object even if
+					 * put_mems_allowed indicated that
+					 * the cpuset mems_allowed was
+					 * updated in parallel. It's a
+					 * harmless race between the alloc
+					 * and the cpuset update.
+					 */
+					put_mems_allowed(cpuset_mems_cookie);
+					return object;
+				}
 			}
 		}
-	}
-	put_mems_allowed();
+	} while (!put_mems_allowed(cpuset_mems_cookie));
 #endif
 	return NULL;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c52b235..fccc048 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2337,7 +2337,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	unsigned long writeback_threshold;
 	bool aborted_reclaim;
 
-	get_mems_allowed();
 	delayacct_freepages_start();
 
 	if (global_reclaim(sc))
@@ -2401,7 +2400,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 out:
 	delayacct_freepages_end();
-	put_mems_allowed();
 
 	if (sc->nr_reclaimed)
 		return sc->nr_reclaimed;

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
  2012-03-07 18:08 ` Mel Gorman
@ 2012-03-26 10:56   ` Peter Zijlstra
  -1 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2012-03-26 10:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Miao Xie, David Rientjes, Christoph Lameter,
	linux-mm, linux-kernel

On Wed, 2012-03-07 at 18:08 +0000, Mel Gorman wrote:
> +               } while (!put_mems_allowed(cpuset_mems_cookie) && !page);

Sorry for only noticing this now, but wouldn't it be better to first
check page and only then bother with the put_mems_allowed() thing? That
avoids the smp_rmb() and seqcount conditional altogether in the likely
case the allocation actually succeeded.

---
Subject: mm: Optimize put_mems_allowed() usage

Avoid calling put_mems_allowed() when the page allocation succeeded.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/cpuset.h |    2 +-
 mm/filemap.c           |    2 +-
 mm/hugetlb.c           |    2 +-
 mm/mempolicy.c         |    6 +++---
 mm/page_alloc.c        |    2 +-
 mm/slab.c              |    2 +-
 6 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 7a7e5fd..f666b99 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -107,7 +107,7 @@ static inline unsigned int get_mems_allowed(void)
  */
 static inline bool put_mems_allowed(unsigned int seq)
 {
-	return !read_seqcount_retry(&current->mems_allowed_seq, seq);
+	return likely(!read_seqcount_retry(&current->mems_allowed_seq, seq));
 }
 
 static inline void set_mems_allowed(nodemask_t nodemask)
diff --git a/mm/filemap.c b/mm/filemap.c
index c3811bc..3b41553 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -504,7 +504,7 @@ struct page *__page_cache_alloc(gfp_t gfp)
 			cpuset_mems_cookie = get_mems_allowed();
 			n = cpuset_mem_spread_node();
 			page = alloc_pages_exact_node(n, gfp, 0);
-		} while (!put_mems_allowed(cpuset_mems_cookie) && !page);
+		} while (!page && !put_mems_allowed(cpuset_mems_cookie));
 
 		return page;
 	}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b8ce6f4..25250c9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -570,7 +570,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 	}
 
 	mpol_cond_put(mpol);
-	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+	if (unlikely(!page && !put_mems_allowed(cpuset_mems_cookie)))
 		goto retry_cpuset;
 	return page;
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index cfb6c86..6010eef 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1865,7 +1865,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
 		mpol_cond_put(pol);
 		page = alloc_page_interleave(gfp, order, nid);
-		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+		if (unlikely(!page && !put_mems_allowed(cpuset_mems_cookie)))
 			goto retry_cpuset;
 
 		return page;
@@ -1878,7 +1878,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		struct page *page =  __alloc_pages_nodemask(gfp, order,
 						zl, policy_nodemask(gfp, pol));
 		__mpol_put(pol);
-		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+		if (unlikely(!page && !put_mems_allowed(cpuset_mems_cookie)))
 			goto retry_cpuset;
 		return page;
 	}
@@ -1887,7 +1887,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 	 */
 	page = __alloc_pages_nodemask(gfp, order, zl,
 				      policy_nodemask(gfp, pol));
-	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+	if (unlikely(!page && !put_mems_allowed(cpuset_mems_cookie)))
 		goto retry_cpuset;
 	return page;
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index caea788..96acea4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2429,7 +2429,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	 * the mask is being updated. If a page allocation is about to fail,
 	 * check if the cpuset changed during allocation and if so, retry.
 	 */
-	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+	if (unlikely(!page && !put_mems_allowed(cpuset_mems_cookie)))
 		goto retry_cpuset;
 
 	return page;
diff --git a/mm/slab.c b/mm/slab.c
index 29c8716..7d320b5 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3374,7 +3374,7 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
 		}
 	}
 
-	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !obj))
+	if (unlikely(!obj && !put_mems_allowed(cpuset_mems_cookie)))
 		goto retry_cpuset;
 	return obj;
 }


* Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
  2012-03-26 10:56   ` Peter Zijlstra
@ 2012-03-26 11:07     ` Peter Zijlstra
  -1 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2012-03-26 11:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Miao Xie, David Rientjes, Christoph Lameter,
	linux-mm, linux-kernel

On Mon, 2012-03-26 at 12:56 +0200, Peter Zijlstra wrote:
>  static inline bool put_mems_allowed(unsigned int seq)
>  {
> -       return !read_seqcount_retry(&current->mems_allowed_seq, seq);
> +       return likely(!read_seqcount_retry(&current->mems_allowed_seq, seq));
>  } 

Ignore this hunk, read_seqcount_retry() already has a branch hint in it.
I'll send a new version if people think the rest of the patch is worth
it.

* Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
  2012-03-26 10:56   ` Peter Zijlstra
@ 2012-03-26 15:50     ` Mel Gorman
  -1 siblings, 0 replies; 39+ messages in thread
From: Mel Gorman @ 2012-03-26 15:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Miao Xie, David Rientjes, Christoph Lameter,
	linux-mm, linux-kernel

On Mon, Mar 26, 2012 at 12:56:24PM +0200, Peter Zijlstra wrote:
> On Wed, 2012-03-07 at 18:08 +0000, Mel Gorman wrote:
> > +               } while (!put_mems_allowed(cpuset_mems_cookie) && !page);
> 
> Sorry for only noticing this now, but wouldn't it be better to first
> check page and only then bother with the put_mems_allowed() thing? That
> avoids the smp_rmb() and seqcount conditional all together in the likely
> case the allocation actually succeeded.
> 
> <SNIP>
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c3811bc..3b41553 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -504,7 +504,7 @@ struct page *__page_cache_alloc(gfp_t gfp)
>  			cpuset_mems_cookie = get_mems_allowed();
>  			n = cpuset_mem_spread_node();
>  			page = alloc_pages_exact_node(n, gfp, 0);
> -		} while (!put_mems_allowed(cpuset_mems_cookie) && !page);
> +		} while (!page && !put_mems_allowed(cpuset_mems_cookie));
>  
>  		return page;
>  	}

I think such a change would be better but should also rename the API.
If developers see a get_foo type call, they will expect to see a put_foo
call or assume it's a bug even though the implementation happens to be ok
with that. Any suggestion on what a good new name would be?

How about read_mems_allowed_begin() and read_mems_allowed_retry()?

read_mems_allowed_begin would be a rename of get_mems_allowed().  In an
error path, read_mems_allowed_retry() would be documented as *optionally*
called when deciding whether to retry the operation or not. In this scheme,
!put_mems_allowed would become read_mems_allowed_retry(), which might be
a bit easier to read overall.
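
To make that concrete, the __page_cache_alloc() loop quoted above would
then read something like the following (just a sketch of the proposed
naming, using the same local variables as the existing function; nothing
here is merged code):

	/* sketch only: the same retry loop with the proposed helper names */
	do {
		cpuset_mems_cookie = read_mems_allowed_begin();
		n = cpuset_mem_spread_node();
		page = alloc_pages_exact_node(n, gfp, 0);
	} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));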

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
  2012-03-26 15:50     ` Mel Gorman
@ 2012-03-26 16:20       ` Peter Zijlstra
  -1 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2012-03-26 16:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Miao Xie, David Rientjes, Christoph Lameter,
	linux-mm, linux-kernel

On Mon, 2012-03-26 at 16:50 +0100, Mel Gorman wrote:
> On Mon, Mar 26, 2012 at 12:56:24PM +0200, Peter Zijlstra wrote:
> > On Wed, 2012-03-07 at 18:08 +0000, Mel Gorman wrote:
> > > +               } while (!put_mems_allowed(cpuset_mems_cookie) && !page);
> > 
> > Sorry for only noticing this now, but wouldn't it be better to first
> > check page and only then bother with the put_mems_allowed() thing? That
> > avoids the smp_rmb() and seqcount conditional all together in the likely
> > case the allocation actually succeeded.
> > 
> > <SNIP>
> >
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index c3811bc..3b41553 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -504,7 +504,7 @@ struct page *__page_cache_alloc(gfp_t gfp)
> >  			cpuset_mems_cookie = get_mems_allowed();
> >  			n = cpuset_mem_spread_node();
> >  			page = alloc_pages_exact_node(n, gfp, 0);
> > -		} while (!put_mems_allowed(cpuset_mems_cookie) && !page);
> > +		} while (!page && !put_mems_allowed(cpuset_mems_cookie));
> >  
> >  		return page;
> >  	}
> 
> I think such a change would be better but should also rename the API.
> If developers see a get_foo type call, they will expect to see a put_foo
> call or assume it's a bug even though the implementation happens to be ok
> with that. Any suggestion on what a good new name would be?
> 
> How about read_mems_allowed_begin() and read_mems_allowed_retry()?
> 
> read_mems_allowed_begin would be a rename of get_mems_allowed().  In an
> error path, read_mems_allowed_retry() would documented to be *optionally*
> called when deciding whether to retry the operation or not. In this scheme,
> !put_mems_allowed would become read_mems_allowed_retry() which might be
> a bit easier to read overall.

One:

git grep -l "\(get\|put\)_mems_allowed" | while read file; do
	sed -i -e 's/\<get_mems_allowed\>/read_mems_allowed_begin/g' \
	       -e 's/\<put_mems_allowed\>/read_mems_allowed_retry/g' $file
done

and a few edits later..

---
 include/linux/cpuset.h |   18 +++++++++---------
 kernel/cpuset.c        |    2 +-
 mm/filemap.c           |    4 ++--
 mm/hugetlb.c           |    4 ++--
 mm/mempolicy.c         |   14 +++++++-------
 mm/page_alloc.c        |    8 ++++----
 mm/slab.c              |    4 ++--
 mm/slub.c              |   16 +++-------------
 8 files changed, 30 insertions(+), 40 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 7a7e5fd..d008b03 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -89,25 +89,25 @@ extern void rebuild_sched_domains(void);
 extern void cpuset_print_task_mems_allowed(struct task_struct *p);
 
 /*
- * get_mems_allowed is required when making decisions involving mems_allowed
+ * read_mems_allowed_begin is required when making decisions involving mems_allowed
  * such as during page allocation. mems_allowed can be updated in parallel
  * and depending on the new value an operation can fail potentially causing
- * process failure. A retry loop with get_mems_allowed and put_mems_allowed
+ * process failure. A retry loop with read_mems_allowed_begin and read_mems_allowed_retry
  * prevents these artificial failures.
  */
-static inline unsigned int get_mems_allowed(void)
+static inline unsigned int read_mems_allowed_begin(void)
 {
 	return read_seqcount_begin(&current->mems_allowed_seq);
 }
 
 /*
- * If this returns false, the operation that took place after get_mems_allowed
+ * If this returns false, the operation that took place after read_mems_allowed_begin
  * may have failed. It is up to the caller to retry the operation if
  * appropriate.
  */
-static inline bool put_mems_allowed(unsigned int seq)
+static inline bool read_mems_allowed_retry(unsigned int seq)
 {
-	return !read_seqcount_retry(&current->mems_allowed_seq, seq);
+	return read_seqcount_retry(&current->mems_allowed_seq, seq);
 }
 
 static inline void set_mems_allowed(nodemask_t nodemask)
@@ -225,14 +225,14 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 {
 }
 
-static inline unsigned int get_mems_allowed(void)
+static inline unsigned int read_mems_allowed_begin(void)
 {
 	return 0;
 }
 
-static inline bool put_mems_allowed(unsigned int seq)
+static inline bool read_mems_allowed_retry(unsigned int seq)
 {
-	return true;
+	return false;
 }
 
 #endif /* !CONFIG_CPUSETS */
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 1010cc6..703df59 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -976,7 +976,7 @@ static void cpuset_change_task_nodemask(struct task_struct *tsk,
 	task_lock(tsk);
 	/*
 	 * Determine if a loop is necessary if another thread is doing
-	 * get_mems_allowed().  If at least one node remains unchanged and
+	 * read_mems_allowed_begin().  If at least one node remains unchanged and
 	 * tsk does not have a mempolicy, then an empty nodemask will not be
 	 * possible when mems_allowed is larger than a word.
 	 */
diff --git a/mm/filemap.c b/mm/filemap.c
index c3811bc..5694807 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -501,10 +501,10 @@ struct page *__page_cache_alloc(gfp_t gfp)
 	if (cpuset_do_page_mem_spread()) {
 		unsigned int cpuset_mems_cookie;
 		do {
-			cpuset_mems_cookie = get_mems_allowed();
+			cpuset_mems_cookie = read_mems_allowed_begin();
 			n = cpuset_mem_spread_node();
 			page = alloc_pages_exact_node(n, gfp, 0);
-		} while (!put_mems_allowed(cpuset_mems_cookie) && !page);
+		} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));
 
 		return page;
 	}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b8ce6f4..6c52f6a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -541,7 +541,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 	unsigned int cpuset_mems_cookie;
 
 retry_cpuset:
-	cpuset_mems_cookie = get_mems_allowed();
+	cpuset_mems_cookie = read_mems_allowed_begin();
 	zonelist = huge_zonelist(vma, address,
 					htlb_alloc_mask, &mpol, &nodemask);
 	/*
@@ -570,7 +570,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 	}
 
 	mpol_cond_put(mpol);
-	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 	return page;
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index cfb6c86..ee5f48c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1692,7 +1692,7 @@ int node_random(const nodemask_t *maskp)
  * If the effective policy is 'BIND, returns a pointer to the mempolicy's
  * @nodemask for filtering the zonelist.
  *
- * Must be protected by get_mems_allowed()
+ * Must be protected by read_mems_allowed_begin()
  */
 struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
 				gfp_t gfp_flags, struct mempolicy **mpol,
@@ -1857,7 +1857,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 
 retry_cpuset:
 	pol = get_vma_policy(current, vma, addr);
-	cpuset_mems_cookie = get_mems_allowed();
+	cpuset_mems_cookie = read_mems_allowed_begin();
 
 	if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
 		unsigned nid;
@@ -1865,7 +1865,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
 		mpol_cond_put(pol);
 		page = alloc_page_interleave(gfp, order, nid);
-		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+		if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 			goto retry_cpuset;
 
 		return page;
@@ -1878,7 +1878,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		struct page *page =  __alloc_pages_nodemask(gfp, order,
 						zl, policy_nodemask(gfp, pol));
 		__mpol_put(pol);
-		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+		if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 			goto retry_cpuset;
 		return page;
 	}
@@ -1887,7 +1887,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 	 */
 	page = __alloc_pages_nodemask(gfp, order, zl,
 				      policy_nodemask(gfp, pol));
-	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 	return page;
 }
@@ -1921,7 +1921,7 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 		pol = &default_policy;
 
 retry_cpuset:
-	cpuset_mems_cookie = get_mems_allowed();
+	cpuset_mems_cookie = read_mems_allowed_begin();
 
 	/*
 	 * No reference counting needed for current->mempolicy
@@ -1934,7 +1934,7 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 				policy_zonelist(gfp, pol, numa_node_id()),
 				policy_nodemask(gfp, pol));
 
-	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 
 	return page;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index caea788..b586d96 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2402,7 +2402,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		return NULL;
 
 retry_cpuset:
-	cpuset_mems_cookie = get_mems_allowed();
+	cpuset_mems_cookie = read_mems_allowed_begin();
 
 	/* The preferred zone is used for statistics later */
 	first_zones_zonelist(zonelist, high_zoneidx,
@@ -2429,7 +2429,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	 * the mask is being updated. If a page allocation is about to fail,
 	 * check if the cpuset changed during allocation and if so, retry.
 	 */
-	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 
 	return page;
@@ -2651,9 +2651,9 @@ bool skip_free_areas_node(unsigned int flags, int nid)
 		goto out;
 
 	do {
-		cpuset_mems_cookie = get_mems_allowed();
+		cpuset_mems_cookie = read_mems_allowed_begin();
 		ret = !node_isset(nid, cpuset_current_mems_allowed);
-	} while (!put_mems_allowed(cpuset_mems_cookie));
+	} while (read_mems_allowed_retry(cpuset_mems_cookie));
 out:
 	return ret;
 }
diff --git a/mm/slab.c b/mm/slab.c
index 29c8716..e5a4533 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3318,7 +3318,7 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
 	local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
 
 retry_cpuset:
-	cpuset_mems_cookie = get_mems_allowed();
+	cpuset_mems_cookie = read_mems_allowed_begin();
 	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
 
 retry:
@@ -3374,7 +3374,7 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
 		}
 	}
 
-	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !obj))
+	if (unlikely(!obj && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 	return obj;
 }
diff --git a/mm/slub.c b/mm/slub.c
index f4a6229..7a158be 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1606,7 +1606,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags,
 		return NULL;
 
 	do {
-		cpuset_mems_cookie = get_mems_allowed();
+		cpuset_mems_cookie = read_mems_allowed_begin();
 		zonelist = node_zonelist(slab_node(current->mempolicy), flags);
 		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 			struct kmem_cache_node *n;
@@ -1616,21 +1616,11 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags,
 			if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
 					n->nr_partial > s->min_partial) {
 				object = get_partial_node(s, n, c);
-				if (object) {
-					/*
-					 * Return the object even if
-					 * put_mems_allowed indicated that
-					 * the cpuset mems_allowed was
-					 * updated in parallel. It's a
-					 * harmless race between the alloc
-					 * and the cpuset update.
-					 */
-					put_mems_allowed(cpuset_mems_cookie);
+				if (object)
 					return object;
-				}
 			}
 		}
-	} while (!put_mems_allowed(cpuset_mems_cookie));
+	} while (read_mems_allowed_retry(cpuset_mems_cookie));
 #endif
 	return NULL;
 }


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
  2012-03-26 16:20       ` Peter Zijlstra
@ 2012-03-27 12:47         ` Mel Gorman
  -1 siblings, 0 replies; 39+ messages in thread
From: Mel Gorman @ 2012-03-27 12:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Miao Xie, David Rientjes, Christoph Lameter,
	linux-mm, linux-kernel

On Mon, Mar 26, 2012 at 06:20:52PM +0200, Peter Zijlstra wrote:
> > <SNIP>
> > 
> > I think such a change would be better but should also rename the API.
> > If developers see a get_foo type call, they will expect to see a put_foo
> > call or assume it's a bug even though the implementation happens to be ok
> > with that. Any suggestion on what a good new name would be?
> > 
> > How about read_mems_allowed_begin() and read_mems_allowed_retry()?
> > 
> > read_mems_allowed_begin would be a rename of get_mems_allowed().  In an
> > error path, read_mems_allowed_retry() would documented to be *optionally*
> > called when deciding whether to retry the operation or not. In this scheme,
> > !put_mems_allowed would become read_mems_allowed_retry() which might be
> > a bit easier to read overall.
> 
> One:
> 
> git grep -l "\(get\|put\)_mems_allowed" | while read file; do sed -i -e
> 's/\<get_mems_allowed\>/read_mems_allowed_begin/g' -e
> 's/\<put_mems_allowed\>/read_mems_allowed_retry/g' $file; done
> 
> and a few edits later..
> 
> ---
>  include/linux/cpuset.h |   18 +++++++++---------
>  kernel/cpuset.c        |    2 +-
>  mm/filemap.c           |    4 ++--
>  mm/hugetlb.c           |    4 ++--
>  mm/mempolicy.c         |   14 +++++++-------
>  mm/page_alloc.c        |    8 ++++----
>  mm/slab.c              |    4 ++--
>  mm/slub.c              |   16 +++-------------
>  8 files changed, 30 insertions(+), 40 deletions(-)
> 
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 7a7e5fd..d008b03 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -89,25 +89,25 @@ extern void rebuild_sched_domains(void);
>  extern void cpuset_print_task_mems_allowed(struct task_struct *p);
>  
>  /*
> - * get_mems_allowed is required when making decisions involving mems_allowed
> + * read_mems_allowed_begin is required when making decisions involving mems_allowed
>   * such as during page allocation. mems_allowed can be updated in parallel
>   * and depending on the new value an operation can fail potentially causing
> - * process failure. A retry loop with get_mems_allowed and put_mems_allowed
> + * process failure. A retry loop with read_mems_allowed_begin and read_mems_allowed_retry
>   * prevents these artificial failures.
>   */

Going over 80 columns there. This happens in other places in the patch
but the alternative in those cases is less readable.

> -static inline unsigned int get_mems_allowed(void)
> +static inline unsigned int read_mems_allowed_begin(void)
>  {
>  	return read_seqcount_begin(&current->mems_allowed_seq);
>  }
>  
>  /*
> - * If this returns false, the operation that took place after get_mems_allowed
> + * If this returns false, the operation that took place after read_mems_allowed_begin
>   * may have failed. It is up to the caller to retry the operation if
>   * appropriate.
>   */

80 cols again and it should be "returns true". Something like this?

/*
 * If this returns true, the operation that took place after
 * read_mems_allowed_begin may have failed artificially due to a parallel
 * update of mems_allowed. It is up to the caller to retry the operation
 * if appropriate.
 */

Other than that, the changes looked good and I agree that it is better
overall.

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH] mm: Optimize put_mems_allowed() usage
  2012-03-27 12:47         ` Mel Gorman
@ 2012-03-27 13:14           ` Peter Zijlstra
  -1 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2012-03-27 13:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Miao Xie, David Rientjes, Christoph Lameter,
	linux-mm, linux-kernel

Subject: mm: Optimize put_mems_allowed() usage
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Mon Mar 26 14:13:05 CEST 2012

Since put_mems_allowed() is strictly optional (it is just a seqcount
retry), we don't need to evaluate the function if the allocation was in
fact successful, saving an smp_rmb(), some loads and comparisons on some
relatively fast paths.

Since the naming of get/put_mems_allowed() does suggest a mandatory
pairing, rename the interface, as suggested by Mel, to resemble the
seqcount interface.

This gives us: read_mems_allowed_begin() and
read_mems_allowed_retry(), where it is important to note that the
return value of the latter call is inverted from its previous
incarnation.
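
To make the inversion concrete, the calling idiom changes roughly as below
(this mirrors the mm/filemap.c hunk further down; the variable names are
taken from there):

	/* Old idiom: put_mems_allowed() returned true when no retry was needed. */
	do {
		cpuset_mems_cookie = get_mems_allowed();
		page = alloc_pages_exact_node(n, gfp, 0);
	} while (!put_mems_allowed(cpuset_mems_cookie) && !page);

	/*
	 * New idiom: read_mems_allowed_retry() returns true when a retry is
	 * needed, and is only evaluated if the allocation failed.
	 */
	do {
		cpuset_mems_cookie = read_mems_allowed_begin();
		page = alloc_pages_exact_node(n, gfp, 0);
	} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));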

Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/cpuset.h |   27 ++++++++++++++-------------
 kernel/cpuset.c        |    2 +-
 mm/filemap.c           |    4 ++--
 mm/hugetlb.c           |    4 ++--
 mm/mempolicy.c         |   14 +++++++-------
 mm/page_alloc.c        |    8 ++++----
 mm/slab.c              |    4 ++--
 mm/slub.c              |   16 +++-------------
 8 files changed, 35 insertions(+), 44 deletions(-)

--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -89,25 +89,26 @@ extern void rebuild_sched_domains(void);
 extern void cpuset_print_task_mems_allowed(struct task_struct *p);
 
 /*
- * get_mems_allowed is required when making decisions involving mems_allowed
- * such as during page allocation. mems_allowed can be updated in parallel
- * and depending on the new value an operation can fail potentially causing
- * process failure. A retry loop with get_mems_allowed and put_mems_allowed
- * prevents these artificial failures.
+ * read_mems_allowed_begin is required when making decisions involving
+ * mems_allowed such as during page allocation. mems_allowed can be updated in
+ * parallel and depending on the new value an operation can fail potentially
+ * causing process failure. A retry loop with read_mems_allowed_begin and
+ * read_mems_allowed_retry prevents these artificial failures.
  */
-static inline unsigned int get_mems_allowed(void)
+static inline unsigned int read_mems_allowed_begin(void)
 {
 	return read_seqcount_begin(&current->mems_allowed_seq);
 }
 
 /*
- * If this returns false, the operation that took place after get_mems_allowed
- * may have failed. It is up to the caller to retry the operation if
+ * If this returns true, the operation that took place after
+ * read_mems_allowed_begin may have failed artificially due to a concurrent
+ * update of mems_allowed. It is up to the caller to retry the operation if
  * appropriate.
  */
-static inline bool put_mems_allowed(unsigned int seq)
+static inline bool read_mems_allowed_retry(unsigned int seq)
 {
-	return !read_seqcount_retry(&current->mems_allowed_seq, seq);
+	return read_seqcount_retry(&current->mems_allowed_seq, seq);
 }
 
 static inline void set_mems_allowed(nodemask_t nodemask)
@@ -225,14 +226,14 @@ static inline void set_mems_allowed(node
 {
 }
 
-static inline unsigned int get_mems_allowed(void)
+static inline unsigned int read_mems_allowed_begin(void)
 {
 	return 0;
 }
 
-static inline bool put_mems_allowed(unsigned int seq)
+static inline bool read_mems_allowed_retry(unsigned int seq)
 {
-	return true;
+	return false;
 }
 
 #endif /* !CONFIG_CPUSETS */
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -976,7 +976,7 @@ static void cpuset_change_task_nodemask(
 	task_lock(tsk);
 	/*
 	 * Determine if a loop is necessary if another thread is doing
-	 * get_mems_allowed().  If at least one node remains unchanged and
+	 * read_mems_allowed_begin().  If at least one node remains unchanged and
 	 * tsk does not have a mempolicy, then an empty nodemask will not be
 	 * possible when mems_allowed is larger than a word.
 	 */
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -501,10 +501,10 @@ struct page *__page_cache_alloc(gfp_t gf
 	if (cpuset_do_page_mem_spread()) {
 		unsigned int cpuset_mems_cookie;
 		do {
-			cpuset_mems_cookie = get_mems_allowed();
+			cpuset_mems_cookie = read_mems_allowed_begin();
 			n = cpuset_mem_spread_node();
 			page = alloc_pages_exact_node(n, gfp, 0);
-		} while (!put_mems_allowed(cpuset_mems_cookie) && !page);
+		} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));
 
 		return page;
 	}
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -541,7 +541,7 @@ static struct page *dequeue_huge_page_vm
 	unsigned int cpuset_mems_cookie;
 
 retry_cpuset:
-	cpuset_mems_cookie = get_mems_allowed();
+	cpuset_mems_cookie = read_mems_allowed_begin();
 	zonelist = huge_zonelist(vma, address,
 					htlb_alloc_mask, &mpol, &nodemask);
 	/*
@@ -570,7 +570,7 @@ static struct page *dequeue_huge_page_vm
 	}
 
 	mpol_cond_put(mpol);
-	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 	return page;
 
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1692,7 +1692,7 @@ int node_random(const nodemask_t *maskp)
  * If the effective policy is 'BIND, returns a pointer to the mempolicy's
  * @nodemask for filtering the zonelist.
  *
- * Must be protected by get_mems_allowed()
+ * Must be protected by read_mems_allowed_begin()
  */
 struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
 				gfp_t gfp_flags, struct mempolicy **mpol,
@@ -1857,7 +1857,7 @@ alloc_pages_vma(gfp_t gfp, int order, st
 
 retry_cpuset:
 	pol = get_vma_policy(current, vma, addr);
-	cpuset_mems_cookie = get_mems_allowed();
+	cpuset_mems_cookie = read_mems_allowed_begin();
 
 	if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
 		unsigned nid;
@@ -1865,7 +1865,7 @@ alloc_pages_vma(gfp_t gfp, int order, st
 		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
 		mpol_cond_put(pol);
 		page = alloc_page_interleave(gfp, order, nid);
-		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+		if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 			goto retry_cpuset;
 
 		return page;
@@ -1878,7 +1878,7 @@ alloc_pages_vma(gfp_t gfp, int order, st
 		struct page *page =  __alloc_pages_nodemask(gfp, order,
 						zl, policy_nodemask(gfp, pol));
 		__mpol_put(pol);
-		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+		if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 			goto retry_cpuset;
 		return page;
 	}
@@ -1887,7 +1887,7 @@ alloc_pages_vma(gfp_t gfp, int order, st
 	 */
 	page = __alloc_pages_nodemask(gfp, order, zl,
 				      policy_nodemask(gfp, pol));
-	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 	return page;
 }
@@ -1921,7 +1921,7 @@ struct page *alloc_pages_current(gfp_t g
 		pol = &default_policy;
 
 retry_cpuset:
-	cpuset_mems_cookie = get_mems_allowed();
+	cpuset_mems_cookie = read_mems_allowed_begin();
 
 	/*
 	 * No reference counting needed for current->mempolicy
@@ -1934,7 +1934,7 @@ struct page *alloc_pages_current(gfp_t g
 				policy_zonelist(gfp, pol, numa_node_id()),
 				policy_nodemask(gfp, pol));
 
-	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 
 	return page;
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2402,7 +2402,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, u
 		return NULL;
 
 retry_cpuset:
-	cpuset_mems_cookie = get_mems_allowed();
+	cpuset_mems_cookie = read_mems_allowed_begin();
 
 	/* The preferred zone is used for statistics later */
 	first_zones_zonelist(zonelist, high_zoneidx,
@@ -2429,7 +2429,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, u
 	 * the mask is being updated. If a page allocation is about to fail,
 	 * check if the cpuset changed during allocation and if so, retry.
 	 */
-	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 
 	return page;
@@ -2651,9 +2651,9 @@ bool skip_free_areas_node(unsigned int f
 		goto out;
 
 	do {
-		cpuset_mems_cookie = get_mems_allowed();
+		cpuset_mems_cookie = read_mems_allowed_begin();
 		ret = !node_isset(nid, cpuset_current_mems_allowed);
-	} while (!put_mems_allowed(cpuset_mems_cookie));
+	} while (read_mems_allowed_retry(cpuset_mems_cookie));
 out:
 	return ret;
 }
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3318,7 +3318,7 @@ static void *fallback_alloc(struct kmem_
 	local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
 
 retry_cpuset:
-	cpuset_mems_cookie = get_mems_allowed();
+	cpuset_mems_cookie = read_mems_allowed_begin();
 	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
 
 retry:
@@ -3374,7 +3374,7 @@ static void *fallback_alloc(struct kmem_
 		}
 	}
 
-	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !obj))
+	if (unlikely(!obj && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 	return obj;
 }
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1606,7 +1606,7 @@ static struct page *get_any_partial(stru
 		return NULL;
 
 	do {
-		cpuset_mems_cookie = get_mems_allowed();
+		cpuset_mems_cookie = read_mems_allowed_begin();
 		zonelist = node_zonelist(slab_node(current->mempolicy), flags);
 		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 			struct kmem_cache_node *n;
@@ -1616,21 +1616,11 @@ static struct page *get_any_partial(stru
 			if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
 					n->nr_partial > s->min_partial) {
 				object = get_partial_node(s, n, c);
-				if (object) {
-					/*
-					 * Return the object even if
-					 * put_mems_allowed indicated that
-					 * the cpuset mems_allowed was
-					 * updated in parallel. It's a
-					 * harmless race between the alloc
-					 * and the cpuset update.
-					 */
-					put_mems_allowed(cpuset_mems_cookie);
+				if (object)
 					return object;
-				}
 			}
 		}
-	} while (!put_mems_allowed(cpuset_mems_cookie));
+	} while (read_mems_allowed_retry(cpuset_mems_cookie));
 #endif
 	return NULL;
 }


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] mm: Optimize put_mems_allowed() usage
  2012-03-27 13:14           ` Peter Zijlstra
@ 2012-05-17 10:33             ` Peter Zijlstra
  -1 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2012-05-17 10:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Miao Xie, David Rientjes, Christoph Lameter,
	linux-mm, linux-kernel

Anybody care to pick this one up? Andrew?

On Tue, 2012-03-27 at 15:14 +0200, Peter Zijlstra wrote:
> Subject: mm: Optimize put_mems_allowed() usage
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Mon Mar 26 14:13:05 CEST 2012
> 
> Since put_mems_allowed() is strictly optional, its a seqcount retry,
> we don't need to evaluate the function if the allocation was in fact
> successful, saving a smp_rmb some loads and comparisons on some
> relative fast-paths.
> 
> Since the naming, get/put_mems_allowed() does suggest a mandatory
> pairing, rename the interface, as suggested by Mel, to resemble the
> seqcount interface.
> 
> This gives us: read_mems_allowed_begin() and
> read_mems_allowed_retry(), where it is important to note that the
> return value of the latter call is inverted from its previous
> incarnation.
> 
> Acked-by: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  include/linux/cpuset.h |   27 ++++++++++++++-------------
>  kernel/cpuset.c        |    2 +-
>  mm/filemap.c           |    4 ++--
>  mm/hugetlb.c           |    4 ++--
>  mm/mempolicy.c         |   14 +++++++-------
>  mm/page_alloc.c        |    8 ++++----
>  mm/slab.c              |    4 ++--
>  mm/slub.c              |   16 +++-------------
>  8 files changed, 35 insertions(+), 44 deletions(-)
> 
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -89,25 +89,26 @@ extern void rebuild_sched_domains(void);
>  extern void cpuset_print_task_mems_allowed(struct task_struct *p);
>  
>  /*
> - * get_mems_allowed is required when making decisions involving mems_allowed
> - * such as during page allocation. mems_allowed can be updated in parallel
> - * and depending on the new value an operation can fail potentially causing
> - * process failure. A retry loop with get_mems_allowed and put_mems_allowed
> - * prevents these artificial failures.
> + * read_mems_allowed_begin is required when making decisions involving
> + * mems_allowed such as during page allocation. mems_allowed can be updated in
> + * parallel and depending on the new value an operation can fail potentially
> + * causing process failure. A retry loop with read_mems_allowed_begin and
> + * read_mems_allowed_retry prevents these artificial failures.
>   */
> -static inline unsigned int get_mems_allowed(void)
> +static inline unsigned int read_mems_allowed_begin(void)
>  {
>  	return read_seqcount_begin(&current->mems_allowed_seq);
>  }
>  
>  /*
> - * If this returns false, the operation that took place after get_mems_allowed
> - * may have failed. It is up to the caller to retry the operation if
> + * If this returns true, the operation that took place after
> + * read_mems_allowed_begin may have failed artificially due to a concurrent
> + * update of mems_allowed. It is up to the caller to retry the operation if
>   * appropriate.
>   */
> -static inline bool put_mems_allowed(unsigned int seq)
> +static inline bool read_mems_allowed_retry(unsigned int seq)
>  {
> -	return !read_seqcount_retry(&current->mems_allowed_seq, seq);
> +	return read_seqcount_retry(&current->mems_allowed_seq, seq);
>  }
>  
>  static inline void set_mems_allowed(nodemask_t nodemask)
> @@ -225,14 +226,14 @@ static inline void set_mems_allowed(node
>  {
>  }
>  
> -static inline unsigned int get_mems_allowed(void)
> +static inline unsigned int read_mems_allowed_begin(void)
>  {
>  	return 0;
>  }
>  
> -static inline bool put_mems_allowed(unsigned int seq)
> +static inline bool read_mems_allowed_retry(unsigned int seq)
>  {
> -	return true;
> +	return false;
>  }
>  
>  #endif /* !CONFIG_CPUSETS */
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -976,7 +976,7 @@ static void cpuset_change_task_nodemask(
>  	task_lock(tsk);
>  	/*
>  	 * Determine if a loop is necessary if another thread is doing
> -	 * get_mems_allowed().  If at least one node remains unchanged and
> +	 * read_mems_allowed_begin().  If at least one node remains unchanged and
>  	 * tsk does not have a mempolicy, then an empty nodemask will not be
>  	 * possible when mems_allowed is larger than a word.
>  	 */
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -501,10 +501,10 @@ struct page *__page_cache_alloc(gfp_t gf
>  	if (cpuset_do_page_mem_spread()) {
>  		unsigned int cpuset_mems_cookie;
>  		do {
> -			cpuset_mems_cookie = get_mems_allowed();
> +			cpuset_mems_cookie = read_mems_allowed_begin();
>  			n = cpuset_mem_spread_node();
>  			page = alloc_pages_exact_node(n, gfp, 0);
> -		} while (!put_mems_allowed(cpuset_mems_cookie) && !page);
> +		} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));
>  
>  		return page;
>  	}
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -541,7 +541,7 @@ static struct page *dequeue_huge_page_vm
>  	unsigned int cpuset_mems_cookie;
>  
>  retry_cpuset:
> -	cpuset_mems_cookie = get_mems_allowed();
> +	cpuset_mems_cookie = read_mems_allowed_begin();
>  	zonelist = huge_zonelist(vma, address,
>  					htlb_alloc_mask, &mpol, &nodemask);
>  	/*
> @@ -570,7 +570,7 @@ static struct page *dequeue_huge_page_vm
>  	}
>  
>  	mpol_cond_put(mpol);
> -	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
> +	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
>  		goto retry_cpuset;
>  	return page;
>  
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1692,7 +1692,7 @@ int node_random(const nodemask_t *maskp)
>   * If the effective policy is 'BIND, returns a pointer to the mempolicy's
>   * @nodemask for filtering the zonelist.
>   *
> - * Must be protected by get_mems_allowed()
> + * Must be protected by read_mems_allowed_begin()
>   */
>  struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
>  				gfp_t gfp_flags, struct mempolicy **mpol,
> @@ -1857,7 +1857,7 @@ alloc_pages_vma(gfp_t gfp, int order, st
>  
>  retry_cpuset:
>  	pol = get_vma_policy(current, vma, addr);
> -	cpuset_mems_cookie = get_mems_allowed();
> +	cpuset_mems_cookie = read_mems_allowed_begin();
>  
>  	if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
>  		unsigned nid;
> @@ -1865,7 +1865,7 @@ alloc_pages_vma(gfp_t gfp, int order, st
>  		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
>  		mpol_cond_put(pol);
>  		page = alloc_page_interleave(gfp, order, nid);
> -		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
> +		if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
>  			goto retry_cpuset;
>  
>  		return page;
> @@ -1878,7 +1878,7 @@ alloc_pages_vma(gfp_t gfp, int order, st
>  		struct page *page =  __alloc_pages_nodemask(gfp, order,
>  						zl, policy_nodemask(gfp, pol));
>  		__mpol_put(pol);
> -		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
> +		if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
>  			goto retry_cpuset;
>  		return page;
>  	}
> @@ -1887,7 +1887,7 @@ alloc_pages_vma(gfp_t gfp, int order, st
>  	 */
>  	page = __alloc_pages_nodemask(gfp, order, zl,
>  				      policy_nodemask(gfp, pol));
> -	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
> +	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
>  		goto retry_cpuset;
>  	return page;
>  }
> @@ -1921,7 +1921,7 @@ struct page *alloc_pages_current(gfp_t g
>  		pol = &default_policy;
>  
>  retry_cpuset:
> -	cpuset_mems_cookie = get_mems_allowed();
> +	cpuset_mems_cookie = read_mems_allowed_begin();
>  
>  	/*
>  	 * No reference counting needed for current->mempolicy
> @@ -1934,7 +1934,7 @@ struct page *alloc_pages_current(gfp_t g
>  				policy_zonelist(gfp, pol, numa_node_id()),
>  				policy_nodemask(gfp, pol));
>  
> -	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
> +	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
>  		goto retry_cpuset;
>  
>  	return page;
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2402,7 +2402,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, u
>  		return NULL;
>  
>  retry_cpuset:
> -	cpuset_mems_cookie = get_mems_allowed();
> +	cpuset_mems_cookie = read_mems_allowed_begin();
>  
>  	/* The preferred zone is used for statistics later */
>  	first_zones_zonelist(zonelist, high_zoneidx,
> @@ -2429,7 +2429,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, u
>  	 * the mask is being updated. If a page allocation is about to fail,
>  	 * check if the cpuset changed during allocation and if so, retry.
>  	 */
> -	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
> +	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
>  		goto retry_cpuset;
>  
>  	return page;
> @@ -2651,9 +2651,9 @@ bool skip_free_areas_node(unsigned int f
>  		goto out;
>  
>  	do {
> -		cpuset_mems_cookie = get_mems_allowed();
> +		cpuset_mems_cookie = read_mems_allowed_begin();
>  		ret = !node_isset(nid, cpuset_current_mems_allowed);
> -	} while (!put_mems_allowed(cpuset_mems_cookie));
> +	} while (read_mems_allowed_retry(cpuset_mems_cookie));
>  out:
>  	return ret;
>  }
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -3318,7 +3318,7 @@ static void *fallback_alloc(struct kmem_
>  	local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
>  
>  retry_cpuset:
> -	cpuset_mems_cookie = get_mems_allowed();
> +	cpuset_mems_cookie = read_mems_allowed_begin();
>  	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
>  
>  retry:
> @@ -3374,7 +3374,7 @@ static void *fallback_alloc(struct kmem_
>  		}
>  	}
>  
> -	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !obj))
> +	if (unlikely(!obj && read_mems_allowed_retry(cpuset_mems_cookie)))
>  		goto retry_cpuset;
>  	return obj;
>  }
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1606,7 +1606,7 @@ static struct page *get_any_partial(stru
>  		return NULL;
>  
>  	do {
> -		cpuset_mems_cookie = get_mems_allowed();
> +		cpuset_mems_cookie = read_mems_allowed_begin();
>  		zonelist = node_zonelist(slab_node(current->mempolicy), flags);
>  		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
>  			struct kmem_cache_node *n;
> @@ -1616,21 +1616,11 @@ static struct page *get_any_partial(stru
>  			if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
>  					n->nr_partial > s->min_partial) {
>  				object = get_partial_node(s, n, c);
> -				if (object) {
> -					/*
> -					 * Return the object even if
> -					 * put_mems_allowed indicated that
> -					 * the cpuset mems_allowed was
> -					 * updated in parallel. It's a
> -					 * harmless race between the alloc
> -					 * and the cpuset update.
> -					 */
> -					put_mems_allowed(cpuset_mems_cookie);
> +				if (object)
>  					return object;
> -				}
>  			}
>  		}
> -	} while (!put_mems_allowed(cpuset_mems_cookie));
> +	} while (read_mems_allowed_retry(cpuset_mems_cookie));
>  #endif
>  	return NULL;
>  }
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] mm: Optimize put_mems_allowed() usage
  2012-03-27 13:14           ` Peter Zijlstra
@ 2012-05-17 20:16             ` Andrew Morton
  -1 siblings, 0 replies; 39+ messages in thread
From: Andrew Morton @ 2012-05-17 20:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Miao Xie, David Rientjes, Christoph Lameter,
	linux-mm, linux-kernel, Pekka Enberg

On Tue, 27 Mar 2012 15:14:30 +0200
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Subject: mm: Optimize put_mems_allowed() usage
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Mon Mar 26 14:13:05 CEST 2012
> 
> Since put_mems_allowed() is strictly optional (it is just a seqcount
> retry), we don't need to evaluate the function if the allocation was
> in fact successful, saving an smp_rmb(), some loads and some
> comparisons on some relatively hot fast paths.
> 
> Since the naming get/put_mems_allowed() suggests a mandatory pairing,
> rename the interface, as suggested by Mel, to resemble the seqcount
> interface.
> 
> This gives us: read_mems_allowed_begin() and
> read_mems_allowed_retry(), where it is important to note that the
> return value of the latter call is inverted from its previous
> incarnation.
> 
> ...
>
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1606,7 +1606,7 @@ static struct page *get_any_partial(stru
>  		return NULL;
>  
>  	do {
> -		cpuset_mems_cookie = get_mems_allowed();
> +		cpuset_mems_cookie = read_mems_allowed_begin();
>  		zonelist = node_zonelist(slab_node(current->mempolicy), flags);
>  		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
>  			struct kmem_cache_node *n;
> @@ -1616,21 +1616,11 @@ static struct page *get_any_partial(stru
>  			if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
>  					n->nr_partial > s->min_partial) {
>  				object = get_partial_node(s, n, c);
> -				if (object) {
> -					/*
> -					 * Return the object even if
> -					 * put_mems_allowed indicated that
> -					 * the cpuset mems_allowed was
> -					 * updated in parallel. It's a
> -					 * harmless race between the alloc
> -					 * and the cpuset update.
> -					 */
> -					put_mems_allowed(cpuset_mems_cookie);
> +				if (object)
>  					return object;
> -				}
>  			}
>  		}
> -	} while (!put_mems_allowed(cpuset_mems_cookie));
> +	} while (read_mems_allowed_retry(cpuset_mems_cookie));

I do think it was a bad idea to remove that comment.  As it stands, the
reader will be wondering why we did the read_mems_allowed_begin() at
all, and whether failing to check for a change is a bug.

--- a/mm/slub.c~mm-optimize-put_mems_allowed-usage-fix
+++ a/mm/slub.c
@@ -1624,8 +1624,16 @@ static struct page *get_any_partial(stru
 			if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
 					n->nr_partial > s->min_partial) {
 				object = get_partial_node(s, n, c);
-				if (object)
+				if (object) {
+					/*
+					 * Don't check read_mems_allowed_retry()
+					 * here - if mems_allowed was updated in
+					 * parallel, that was a harmless race
+					 * between allocation and the cpuset
+					 * update
+					 */
 					return object;
+				}
 			}
 		}
 	} while (read_mems_allowed_retry(cpuset_mems_cookie));
_


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] mm: Optimize put_mems_allowed() usage
  2012-05-17 20:16             ` Andrew Morton
@ 2012-05-17 20:23               ` Peter Zijlstra
  -1 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2012-05-17 20:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Miao Xie, David Rientjes, Christoph Lameter,
	linux-mm, linux-kernel, Pekka Enberg

On Thu, 2012-05-17 at 13:16 -0700, Andrew Morton wrote:
> I do think it was a bad idea to remove that comment.  As it stands, the
> reader will be wondering why we did the read_mems_allowed_begin() at
> all, and whether failing to check for a change is a bug.
> 
> --- a/mm/slub.c~mm-optimize-put_mems_allowed-usage-fix
> +++ a/mm/slub.c
> @@ -1624,8 +1624,16 @@ static struct page *get_any_partial(stru
>                         if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
>                                         n->nr_partial > s->min_partial) {
>                                 object = get_partial_node(s, n, c);
> -                               if (object)
> +                               if (object) {
> +                                       /*
> +                                        * Don't check read_mems_allowed_retry()
> +                                        * here - if mems_allowed was updated in
> +                                        * parallel, that was a harmless race
> +                                        * between allocation and the cpuset
> +                                        * update
> +                                        */
>                                         return object;
> +                               }
>                         }
>                 }
>         } while (read_mems_allowed_retry(cpuset_mems_cookie)); 

OK, it just seemed weird to have that comment in this one place when it
describes the general pattern of this construct.

The whole read_mems_allowed_retry() check should only ever be attempted
when the allocation has failed.

But sure..
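
For illustration only (this snippet is not taken from any patch in this
thread, and try_to_allocate() is just a placeholder for the real
allocation call), the two shapes of the caller loop look like this:

	/* old: put_mems_allowed(), and its smp_rmb(), ran on every pass */
	do {
		cpuset_mems_cookie = get_mems_allowed();
		page = try_to_allocate();	/* placeholder allocation */
	} while (!put_mems_allowed(cpuset_mems_cookie) && !page);

	/* new: the seqcount recheck is short-circuited away on success */
	do {
		cpuset_mems_cookie = read_mems_allowed_begin();
		page = try_to_allocate();
	} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));

Note that read_mems_allowed_retry() returns true when a retry is needed,
the inverse of the old put_mems_allowed() return value, and that testing
!page first means the seqcount read is skipped entirely on the common
success path.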

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [tip:sched/numa] mm: Optimize put_mems_allowed() usage
  2012-03-26 10:56   ` Peter Zijlstra
                     ` (2 preceding siblings ...)
  (?)
@ 2012-05-18 10:20   ` tip-bot for Peter Zijlstra
  -1 siblings, 0 replies; 39+ messages in thread
From: tip-bot for Peter Zijlstra @ 2012-05-18 10:20 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, cl, akpm, mgorman, tglx,
	rientjes

Commit-ID:  2ab41dd599228532afc0d5920f82e6378ec1c5f3
Gitweb:     http://git.kernel.org/tip/2ab41dd599228532afc0d5920f82e6378ec1c5f3
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Mon, 26 Mar 2012 14:13:05 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 17 May 2012 14:06:12 +0200

mm: Optimize put_mems_allowed() usage

Since put_mems_allowed() is strictly optional (it is just a seqcount
retry), we don't need to evaluate the function if the allocation was
in fact successful, saving an smp_rmb(), some loads and some
comparisons on some relatively hot fast paths.

Since the naming get/put_mems_allowed() suggests a mandatory pairing,
rename the interface, as suggested by Mel, to resemble the seqcount
interface.

This gives us: read_mems_allowed_begin() and
read_mems_allowed_retry(), where it is important to note that the
return value of the latter call is inverted from its previous
incarnation.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Link: http://lkml.kernel.org/r/1332759384.16159.92.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/cpuset.h |   27 ++++++++++++++-------------
 kernel/cpuset.c        |    2 +-
 mm/filemap.c           |    4 ++--
 mm/hugetlb.c           |    4 ++--
 mm/mempolicy.c         |   14 +++++++-------
 mm/page_alloc.c        |    8 ++++----
 mm/slab.c              |    4 ++--
 mm/slub.c              |   16 +++-------------
 8 files changed, 35 insertions(+), 44 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 668f66b..4052564 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -89,25 +89,26 @@ extern void rebuild_sched_domains(void);
 extern void cpuset_print_task_mems_allowed(struct task_struct *p);
 
 /*
- * get_mems_allowed is required when making decisions involving mems_allowed
- * such as during page allocation. mems_allowed can be updated in parallel
- * and depending on the new value an operation can fail potentially causing
- * process failure. A retry loop with get_mems_allowed and put_mems_allowed
- * prevents these artificial failures.
+ * read_mems_allowed_begin is required when making decisions involving
+ * mems_allowed such as during page allocation. mems_allowed can be updated in
+ * parallel and depending on the new value an operation can fail potentially
+ * causing process failure. A retry loop with read_mems_allowed_begin and
+ * read_mems_allowed_retry prevents these artificial failures.
  */
-static inline unsigned int get_mems_allowed(void)
+static inline unsigned int read_mems_allowed_begin(void)
 {
 	return read_seqcount_begin(&current->mems_allowed_seq);
 }
 
 /*
- * If this returns false, the operation that took place after get_mems_allowed
- * may have failed. It is up to the caller to retry the operation if
+ * If this returns true, the operation that took place after
+ * read_mems_allowed_begin may have failed artificially due to a concurrent
+ * update of mems_allowed. It is up to the caller to retry the operation if
  * appropriate.
  */
-static inline bool put_mems_allowed(unsigned int seq)
+static inline bool read_mems_allowed_retry(unsigned int seq)
 {
-	return !read_seqcount_retry(&current->mems_allowed_seq, seq);
+	return read_seqcount_retry(&current->mems_allowed_seq, seq);
 }
 
 static inline void set_mems_allowed(nodemask_t nodemask)
@@ -223,14 +224,14 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 {
 }
 
-static inline unsigned int get_mems_allowed(void)
+static inline unsigned int read_mems_allowed_begin(void)
 {
 	return 0;
 }
 
-static inline bool put_mems_allowed(unsigned int seq)
+static inline bool read_mems_allowed_retry(unsigned int seq)
 {
-	return true;
+	return false;
 }
 
 #endif /* !CONFIG_CPUSETS */
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 14f7070..a9ac50d 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -976,7 +976,7 @@ static void cpuset_change_task_nodemask(struct task_struct *tsk,
 	task_lock(tsk);
 	/*
 	 * Determine if a loop is necessary if another thread is doing
-	 * get_mems_allowed().  If at least one node remains unchanged and
+	 * read_mems_allowed_begin().  If at least one node remains unchanged and
 	 * tsk does not have a mempolicy, then an empty nodemask will not be
 	 * possible when mems_allowed is larger than a word.
 	 */
diff --git a/mm/filemap.c b/mm/filemap.c
index 79c4b2b..800c592 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -501,10 +501,10 @@ struct page *__page_cache_alloc(gfp_t gfp)
 	if (cpuset_do_page_mem_spread()) {
 		unsigned int cpuset_mems_cookie;
 		do {
-			cpuset_mems_cookie = get_mems_allowed();
+			cpuset_mems_cookie = read_mems_allowed_begin();
 			n = cpuset_mem_spread_node();
 			page = alloc_pages_exact_node(n, gfp, 0);
-		} while (!put_mems_allowed(cpuset_mems_cookie) && !page);
+		} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));
 
 		return page;
 	}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5a16423..35f315c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -541,7 +541,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 	unsigned int cpuset_mems_cookie;
 
 retry_cpuset:
-	cpuset_mems_cookie = get_mems_allowed();
+	cpuset_mems_cookie = read_mems_allowed_begin();
 	zonelist = huge_zonelist(vma, address,
 					htlb_alloc_mask, &mpol, &nodemask);
 	/*
@@ -570,7 +570,7 @@ retry_cpuset:
 	}
 
 	mpol_cond_put(mpol);
-	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 	return page;
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b195691..1a51b7f 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1695,7 +1695,7 @@ int node_random(const nodemask_t *maskp)
  * If the effective policy is 'BIND, returns a pointer to the mempolicy's
  * @nodemask for filtering the zonelist.
  *
- * Must be protected by get_mems_allowed()
+ * Must be protected by read_mems_allowed_begin()
  */
 struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
 				gfp_t gfp_flags, struct mempolicy **mpol,
@@ -1860,7 +1860,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 
 retry_cpuset:
 	pol = get_vma_policy(current, vma, addr);
-	cpuset_mems_cookie = get_mems_allowed();
+	cpuset_mems_cookie = read_mems_allowed_begin();
 
 	if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
 		unsigned nid;
@@ -1868,7 +1868,7 @@ retry_cpuset:
 		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
 		mpol_cond_put(pol);
 		page = alloc_page_interleave(gfp, order, nid);
-		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+		if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 			goto retry_cpuset;
 
 		return page;
@@ -1881,7 +1881,7 @@ retry_cpuset:
 		struct page *page =  __alloc_pages_nodemask(gfp, order,
 						zl, policy_nodemask(gfp, pol));
 		__mpol_put(pol);
-		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+		if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 			goto retry_cpuset;
 		return page;
 	}
@@ -1890,7 +1890,7 @@ retry_cpuset:
 	 */
 	page = __alloc_pages_nodemask(gfp, order, zl,
 				      policy_nodemask(gfp, pol));
-	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 	return page;
 }
@@ -1924,7 +1924,7 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 		pol = &default_policy;
 
 retry_cpuset:
-	cpuset_mems_cookie = get_mems_allowed();
+	cpuset_mems_cookie = read_mems_allowed_begin();
 
 	/*
 	 * No reference counting needed for current->mempolicy
@@ -1937,7 +1937,7 @@ retry_cpuset:
 				policy_zonelist(gfp, pol, numa_node_id()),
 				policy_nodemask(gfp, pol));
 
-	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 
 	return page;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a712fb9..ccc75cc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2442,7 +2442,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		return NULL;
 
 retry_cpuset:
-	cpuset_mems_cookie = get_mems_allowed();
+	cpuset_mems_cookie = read_mems_allowed_begin();
 
 	/* The preferred zone is used for statistics later */
 	first_zones_zonelist(zonelist, high_zoneidx,
@@ -2469,7 +2469,7 @@ out:
 	 * the mask is being updated. If a page allocation is about to fail,
 	 * check if the cpuset changed during allocation and if so, retry.
 	 */
-	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 
 	return page;
@@ -2691,9 +2691,9 @@ bool skip_free_areas_node(unsigned int flags, int nid)
 		goto out;
 
 	do {
-		cpuset_mems_cookie = get_mems_allowed();
+		cpuset_mems_cookie = read_mems_allowed_begin();
 		ret = !node_isset(nid, cpuset_current_mems_allowed);
-	} while (!put_mems_allowed(cpuset_mems_cookie));
+	} while (read_mems_allowed_retry(cpuset_mems_cookie));
 out:
 	return ret;
 }
diff --git a/mm/slab.c b/mm/slab.c
index e901a36..5070f96 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3367,7 +3367,7 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
 	local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
 
 retry_cpuset:
-	cpuset_mems_cookie = get_mems_allowed();
+	cpuset_mems_cookie = read_mems_allowed_begin();
 	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
 
 retry:
@@ -3423,7 +3423,7 @@ retry:
 		}
 	}
 
-	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !obj))
+	if (unlikely(!obj && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 	return obj;
 }
diff --git a/mm/slub.c b/mm/slub.c
index ffe13fd..4b0e61f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1613,7 +1613,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags,
 		return NULL;
 
 	do {
-		cpuset_mems_cookie = get_mems_allowed();
+		cpuset_mems_cookie = read_mems_allowed_begin();
 		zonelist = node_zonelist(slab_node(current->mempolicy), flags);
 		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 			struct kmem_cache_node *n;
@@ -1623,21 +1623,11 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags,
 			if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
 					n->nr_partial > s->min_partial) {
 				object = get_partial_node(s, n, c);
-				if (object) {
-					/*
-					 * Return the object even if
-					 * put_mems_allowed indicated that
-					 * the cpuset mems_allowed was
-					 * updated in parallel. It's a
-					 * harmless race between the alloc
-					 * and the cpuset update.
-					 */
-					put_mems_allowed(cpuset_mems_cookie);
+				if (object)
 					return object;
-				}
 			}
 		}
-	} while (!put_mems_allowed(cpuset_mems_cookie));
+	} while (read_mems_allowed_retry(cpuset_mems_cookie));
 #endif
 	return NULL;
 }

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
  2012-03-07 18:08 ` Mel Gorman
@ 2013-08-23 13:03   ` Peter Zijlstra
  -1 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2013-08-23 13:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Miao Xie, David Rientjes, Christoph Lameter,
	linux-mm, linux-kernel, riel

On Wed, Mar 07, 2012 at 06:08:52PM +0000, Mel Gorman wrote:
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 06b145f..013d981 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1843,18 +1843,24 @@ struct page *
>  alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  		unsigned long addr, int node)
>  {
> -	struct mempolicy *pol = get_vma_policy(current, vma, addr);
> +	struct mempolicy *pol;
>  	struct zonelist *zl;
>  	struct page *page;
> +	unsigned int cpuset_mems_cookie;
> +
> +retry_cpuset:
> +	pol = get_vma_policy(current, vma, addr);
> +	cpuset_mems_cookie = get_mems_allowed();
>  
> -	get_mems_allowed();
>  	if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
>  		unsigned nid;
>  
>  		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
>  		mpol_cond_put(pol);
>  		page = alloc_page_interleave(gfp, order, nid);
> -		put_mems_allowed();
> +		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
> +			goto retry_cpuset;
> +
>  		return page;
>  	}
>  	zl = policy_zonelist(gfp, pol, node);

So I think this patch is broken (still). Suppose we have an
INTERLEAVE mempol like 0x3 and change it to 0xc.

Original:	0x3
Rebind Step 1:	0xf /* set bits */
Rebind Step 2:	0xc /* clear bits */

Now look at what can happen with offset_il_node() when it is run
concurrently with step 2:

  nnodes = nodes_weight(pol->v.nodes); /* observes 0xf and returns 4 */

  /* now we clear the actual bits */
  
  target = (unsigned int)off % nnodes; /* assume target >= 2 */
  c = 0;
  do {
  	nid = next_node(nid, pol->v.nodes);
	c++;
  } while (c <= target);

  /* here nid := MAX_NUMNODES */


This nid is then blindly inserted into node_zonelist(), which does a
NODE_DATA() array access out of bounds, and off we go.

This would suggest we put the whole seqcount thing inside
offset_il_node().
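
To make the failure mode concrete, here is a self-contained user-space
sketch (not kernel code: next_node() is re-implemented over a plain
64-bit mask, and the off value of 6 is an arbitrary choice) that walks
the mask exactly as offset_il_node() does while the rebind is half way
through:

#include <stdio.h>

#define MAX_NUMNODES 64

/* next set bit after n, or MAX_NUMNODES if there is none */
static int next_node(int n, unsigned long mask)
{
	for (n++; n < MAX_NUMNODES; n++)
		if (mask & (1UL << n))
			return n;
	return MAX_NUMNODES;
}

int main(void)
{
	unsigned long nodes = 0xf;			/* seen during rebind step 1 */
	unsigned nnodes = __builtin_popcountl(nodes);	/* 4 */
	unsigned target = 6 % nnodes;			/* off = 6 -> target = 2 */
	int c = 0, nid = -1;

	nodes = 0xc;					/* rebind step 2 clears bits 0-1 */

	do {
		nid = next_node(nid, nodes);
		c++;
	} while (c <= target);

	/* the walk visits nodes 2 and 3, then falls off the end */
	printf("nid = %d, MAX_NUMNODES = %d\n", nid, MAX_NUMNODES);
	return 0;
}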

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
  2013-08-23 13:03   ` Peter Zijlstra
@ 2013-08-23 18:15     ` Peter Zijlstra
  -1 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2013-08-23 18:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Miao Xie, David Rientjes, Christoph Lameter,
	linux-mm, linux-kernel, riel

On Fri, Aug 23, 2013 at 03:03:32PM +0200, Peter Zijlstra wrote:
> So I think this patch is broken (still). Suppose we have an
> INTERLEAVE mempol like 0x3 and change it to 0xc.
> 
> Original:	0x3
> Rebind Step 1:	0xf /* set bits */
> Rebind Step 2:	0xc /* clear bits */
> 
> Now look at what can happen with offset_il_node() when it is run
> concurrently with step 2:
> 
>   nnodes = nodes_weight(pol->v.nodes); /* observes 0xf and returns 4 */
> 
>   /* now we clear the actual bits */
>   
>   target = (unsigned int)off % nnodes; /* assume target >= 2 */
>   c = 0;
>   do {
>   	nid = next_node(nid, pol->v.nodes);
> 	c++;
>   } while (c <= target);
> 
>   /* here nid := MAX_NUMNODES */
> 
> 
> This nid is then blindly inserted into node_zonelist(), which does a
> NODE_DATA() array access out of bounds, and off we go.
> 
> This would suggest we put the whole seqcount thing inside
> offset_il_node().

Oh bloody grrr. It's not directly related at all: the patch in question
fixes a cpuset task_struct::mems_allowed problem, while the above is a
mempolicy issue, and of course the cpuset and mempolicy code are
completely bloody different :/

So I guess the quick and ugly solution is something like the below. 

--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1762,19 +1762,21 @@ unsigned slab_node(void)
 static unsigned offset_il_node(struct mempolicy *pol,
 		struct vm_area_struct *vma, unsigned long off)
 {
-	unsigned nnodes = nodes_weight(pol->v.nodes);
-	unsigned target;
-	int c;
-	int nid = -1;
+	unsigned nnodes, target;
+	int c, nid;
 
+again:
+	nnodes = nodes_weight(pol->v.nodes);
 	if (!nnodes)
 		return numa_node_id();
+
 	target = (unsigned int)off % nnodes;
-	c = 0;
-	do {
+	for (c = 0, nid = -1; c <= target; c++)
 		nid = next_node(nid, pol->v.nodes);
-		c++;
-	} while (c <= target);
+
+	if (unlikely((unsigned)nid >= MAX_NUMNODES))
+		goto again;
+
 	return nid;
 }
 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
  2013-08-23 18:15     ` Peter Zijlstra
@ 2013-08-26  5:32       ` Rik van Riel
  -1 siblings, 0 replies; 39+ messages in thread
From: Rik van Riel @ 2013-08-26  5:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Andrew Morton, Miao Xie, David Rientjes,
	Christoph Lameter, linux-mm, linux-kernel

On 08/23/2013 02:15 PM, Peter Zijlstra wrote:

> So I guess the quick and ugly solution is something like the below. 

This still crashes :)

> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1762,19 +1762,21 @@ unsigned slab_node(void)
>  static unsigned offset_il_node(struct mempolicy *pol,
>  		struct vm_area_struct *vma, unsigned long off)
>  {
> -	unsigned nnodes = nodes_weight(pol->v.nodes);
> -	unsigned target;
> -	int c;
> -	int nid = -1;
> +	unsigned nnodes, target;
> +	int c, nid;
>  
> +again:
> +	nnodes = nodes_weight(pol->v.nodes);
>  	if (!nnodes)
>  		return numa_node_id();
> +
>  	target = (unsigned int)off % nnodes;
> -	c = 0;
> -	do {
> +	for (c = 0, nid = -1; c <= target; c++)
>  		nid = next_node(nid, pol->v.nodes);
> -		c++;
> -	} while (c <= target);
> +
> +	if (unlikely((unsigned)nid >= MAX_NUMNODES))
> +		goto again;

I'll go kick off a compile that replaces the conditional above with:

	if (unlikely(!node_online(nid)))
		goto again;

>  	return nid;
>  }


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
  2013-08-23 18:15     ` Peter Zijlstra
@ 2013-08-29  9:28       ` Mel Gorman
  -1 siblings, 0 replies; 39+ messages in thread
From: Mel Gorman @ 2013-08-29  9:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Miao Xie, David Rientjes, Christoph Lameter,
	linux-mm, linux-kernel, riel

On Fri, Aug 23, 2013 at 08:15:46PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 23, 2013 at 03:03:32PM +0200, Peter Zijlstra wrote:
> > So I think this patch is broken (still).

I am assuming the lack of complaints is because it is not a heavily executed
path. I expect that you (and Rik) are hitting this as part of automatic
NUMA balancing. Still a bug, just slightly less urgent if NUMA balancing
is the reproduction case.

> > Suppose we have an
> > INTERLEAVE mempol like 0x3 and change it to 0xc.
> > 
> > Original:	0x3
> > Rebind Step 1:	0xf /* set bits */
> > Rebind Step 2:	0xc /* clear bits */
> > 
> > Now look at what can happen with offset_il_node() when its ran
> > concurrently with step 2:
> > 
> >   nnodes = nodes_weight(pol->v.nodes); /* observes 0xf and returns 4 */
> > 
> >   /* now we clear the actual bits */
> >   
> >   target = (unsigned int)off % nnodes; /* assume target >= 2 */
> >   c = 0;
> >   do {
> >   	nid = next_node(nid, pol->v.nodes);
> > 	c++;
> >   } while (c <= target);
> > 
> >   /* here nid := MAX_NUMNODES */
> > 
> > 
> > This nid is then blindly inserted into node_zonelist() which does an
> > NODE_DATA() array access out of bounds and off we go.
> > 
> > This would suggest we put the whole seqcount thing inside
> > offset_il_node().
> 
> Oh bloody grrr. Its not directly related at all, the patch in question
> fixes a cpuset task_struct::mems_allowed problem while the above is a
> mempolicy issue and of course the cpuset and mempolicy code are
> completely bloody different :/
> 
> So I guess the quick and ugly solution is something like the below. 
> 
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1762,19 +1762,21 @@ unsigned slab_node(void)
>  static unsigned offset_il_node(struct mempolicy *pol,
>  		struct vm_area_struct *vma, unsigned long off)
>  {
> -	unsigned nnodes = nodes_weight(pol->v.nodes);
> -	unsigned target;
> -	int c;
> -	int nid = -1;
> +	unsigned nnodes, target;
> +	int c, nid;
>  
> +again:
> +	nnodes = nodes_weight(pol->v.nodes);
>  	if (!nnodes)
>  		return numa_node_id();
> +
>  	target = (unsigned int)off % nnodes;
> -	c = 0;
> -	do {
> +	for (c = 0, nid = -1; c <= target; c++)
>  		nid = next_node(nid, pol->v.nodes);
> -		c++;
> -	} while (c <= target);
> +
> +	if (unlikely((unsigned)nid >= MAX_NUMNODES))
> +		goto again;
> +

MAX_NUMNODES is unrelated to anything except that it might prevent a crash,
and even then nr_online_nodes is probably what you wanted, and even that
assumes the NUMA node numbering is contiguous. The real concern is whether
the updated mask is an allowed target for the updated memory policy. If
it's not then "nid" can be pointing off the deep end somewhere. With this
conversion to a for loop there is a race after you check nnodes where
target gets set to 0 and a nid of -1 is returned, which I suppose will
just blow up differently, but it's fixable.

This? Untested. Fixes implicit types while it's there. Note the use of
first_node() and (i < target) to guarantee that nid gets set and that the
first potential node is still used as an interleave target.

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7431001..ae880c3 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1755,22 +1755,24 @@ unsigned slab_node(void)
 }
 
 /* Do static interleaving for a VMA with known offset. */
-static unsigned offset_il_node(struct mempolicy *pol,
+static unsigned int offset_il_node(struct mempolicy *pol,
 		struct vm_area_struct *vma, unsigned long off)
 {
-	unsigned nnodes = nodes_weight(pol->v.nodes);
-	unsigned target;
-	int c;
-	int nid = -1;
+	unsigned int nr_nodes, target;
+	int i, nid;
 
-	if (!nnodes)
+again:
+	nr_nodes = nodes_weight(pol->v.nodes);
+	if (!nr_nodes)
 		return numa_node_id();
-	target = (unsigned int)off % nnodes;
-	c = 0;
-	do {
+	target = (unsigned int)off % nr_nodes;
+	for (i = 0, nid = first_node(pol->v.nodes); i < target; i++)
 		nid = next_node(nid, pol->v.nodes);
-		c++;
-	} while (c <= target);
+
+	/* Policy nodemask can potentially update in parallel */
+	if (unlikely(!node_isset(nid, pol->v.nodes)))
+		goto again;
+
 	return nid;
 }
 


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
  2013-08-29  9:28       ` Mel Gorman
@ 2013-08-29  9:43         ` Peter Zijlstra
  -1 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2013-08-29  9:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Miao Xie, David Rientjes, Christoph Lameter,
	linux-mm, linux-kernel, riel

On Thu, Aug 29, 2013 at 10:28:29AM +0100, Mel Gorman wrote:
> On Fri, Aug 23, 2013 at 08:15:46PM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 23, 2013 at 03:03:32PM +0200, Peter Zijlstra wrote:
> > > So I think this patch is broken (still).
> 
> I am assuming the lack of complaints is that it is not a heavily executed
> path. I expect that you (and Rik) are hitting this as part of automatic
> NUMA balancing. Still a bug, just slightly less urgent if NUMA balancing
> is the reproduction case.

I thought it was; we crashed somewhere suspiciously close, but no. You
need shared mpols for this to actually trigger and the NUMA stuff
doesn't use those.

> > +	if (unlikely((unsigned)nid >= MAX_NUMNODES))
> > +		goto again;
> > +
> 
> MAX_NUMNODES is unrelated to anything except that it might prevent a crash
> and even then nr_online_nodes is probably what you wanted and even that
> assumes the NUMA node numbering is contiguous. 

I used whatever nodemask.h did to detect end-of-bitmap, and they use
MAX_NUMNODES. See __next_node(), for_each_node() and the like.

MAX_NUMNODES doesn't assume contiguous numbers since it's the actual size
of the bitmap; nr_online_nodes would, however.
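
For reference, the helpers in question look roughly like this (paraphrased
from include/linux/nodemask.h of that era, trimmed for brevity -- the
min_t() clamp is what makes MAX_NUMNODES the end-of-bitmap value):

	static inline int __next_node(int n, const nodemask_t *srcp)
	{
		return min_t(int, MAX_NUMNODES,
			     find_next_bit(srcp->bits, MAX_NUMNODES, n + 1));
	}

	#define for_each_node_mask(node, mask)			\
		for ((node) = first_node(mask);			\
			(node) < MAX_NUMNODES;			\
			(node) = next_node((node), (mask)))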

> The real concern is whether
> the updated mask is an allowed target for the updated memory policy. If
> it's not then "nid" can be pointing off the deep end somewhere.  With this
> conversion to a for loop there is race after you check nnodes where target
> gets set to 0 and then return a nid of -1 which I suppose will just blow
> up differently but it's fixable.

But but but, I did i <= target, which will match when target == 0 so
you'll get at least a single iteration and nid will be set.

> This? Untested. Fixes implicit types while it's there. Note the use of
> first node and (c < target) to guarantee nid gets set and that the first
> potential node is still used as an interleave target.
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 7431001..ae880c3 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1755,22 +1755,24 @@ unsigned slab_node(void)
>  }
>  
>  /* Do static interleaving for a VMA with known offset. */
> -static unsigned offset_il_node(struct mempolicy *pol,
> +static unsigned int offset_il_node(struct mempolicy *pol,
>  		struct vm_area_struct *vma, unsigned long off)
>  {
> -	unsigned nnodes = nodes_weight(pol->v.nodes);
> -	unsigned target;
> -	int c;
> -	int nid = -1;
> +	unsigned int nr_nodes, target;
> +	int i, nid;
>  
> -	if (!nnodes)
> +again:
> +	nr_nodes = nodes_weight(pol->v.nodes);
> +	if (!nr_nodes)
>  		return numa_node_id();
> -	target = (unsigned int)off % nnodes;
> -	c = 0;
> -	do {
> +	target = (unsigned int)off % nr_nodes;
> +	for (i = 0, nid = first_node(pol->v.nodes); i < target; i++)
>  		nid = next_node(nid, pol->v.nodes);
> -		c++;
> -	} while (c <= target);
> +
> +	/* Policy nodemask can potentially update in parallel */
> +	if (unlikely(!node_isset(nid, pol->v.nodes)))
> +		goto again;
> +
>  	return nid;
>  }

So I explicitly didn't use the node_isset() test because that's more
likely to trigger than the nid >= MAX_NUMNODES test. It's fine to return
a node that isn't actually part of the mask anymore -- a race is a race
anyway.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
  2013-08-29  9:43         ` Peter Zijlstra
@ 2013-08-29  9:45           ` Peter Zijlstra
  -1 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2013-08-29  9:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Miao Xie, David Rientjes, Christoph Lameter,
	linux-mm, linux-kernel, riel

On Thu, Aug 29, 2013 at 11:43:42AM +0200, Peter Zijlstra wrote:
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 7431001..ae880c3 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1755,22 +1755,24 @@ unsigned slab_node(void)
> >  }
> >  
> >  /* Do static interleaving for a VMA with known offset. */
> > -static unsigned offset_il_node(struct mempolicy *pol,
> > +static unsigned int offset_il_node(struct mempolicy *pol,
> >  		struct vm_area_struct *vma, unsigned long off)
> >  {
> > -	unsigned nnodes = nodes_weight(pol->v.nodes);
> > -	unsigned target;
> > -	int c;
> > -	int nid = -1;
> > +	unsigned int nr_nodes, target;
> > +	int i, nid;
> >  
> > -	if (!nnodes)
> > +again:
> > +	nr_nodes = nodes_weight(pol->v.nodes);
> > +	if (!nr_nodes)
> >  		return numa_node_id();
> > -	target = (unsigned int)off % nnodes;
> > -	c = 0;
> > -	do {
> > +	target = (unsigned int)off % nr_nodes;
> > +	for (i = 0, nid = first_node(pol->v.nodes); i < target; i++)
> >  		nid = next_node(nid, pol->v.nodes);
> > -		c++;
> > -	} while (c <= target);
> > +
> > +	/* Policy nodemask can potentially update in parallel */
> > +	if (unlikely(!node_isset(nid, pol->v.nodes)))
> > +		goto again;
> > +
> >  	return nid;
> >  }
> 
> So I explicitly didn't use the node_isset() test because that's more
> likely to trigger than the nid >= MAX_NUMNODES test. Its fine to return
> a node that isn't actually part of the mask anymore -- a race is a race
> anyway.

Oh more importantly, if nid does indeed end up being >= MAX_NUMNODES, as
is possible with next_node(), the node_isset() test will be out of bounds
and can crash itself.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
  2013-08-29  9:43         ` Peter Zijlstra
@ 2013-08-29 10:56           ` Mel Gorman
  -1 siblings, 0 replies; 39+ messages in thread
From: Mel Gorman @ 2013-08-29 10:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Miao Xie, David Rientjes, Christoph Lameter,
	linux-mm, linux-kernel, riel

On Thu, Aug 29, 2013 at 11:43:42AM +0200, Peter Zijlstra wrote:
> On Thu, Aug 29, 2013 at 10:28:29AM +0100, Mel Gorman wrote:
> > On Fri, Aug 23, 2013 at 08:15:46PM +0200, Peter Zijlstra wrote:
> > > On Fri, Aug 23, 2013 at 03:03:32PM +0200, Peter Zijlstra wrote:
> > > > So I think this patch is broken (still).
> > 
> > I am assuming the lack of complaints is that it is not a heavily executed
> > path. I expect that you (and Rik) are hitting this as part of automatic
> > NUMA balancing. Still a bug, just slightly less urgent if NUMA balancing
> > is the reproduction case.
> 
> I thought it was, we crashed somewhere suspiciously close, but no. You
> need shared mpols for this to actually trigger and the NUMA stuff
> doesn't use that.
> 

Ah, so this is a red herring?

> > > +	if (unlikely((unsigned)nid >= MAX_NUMNODES))
> > > +		goto again;
> > > +
> > 
> > MAX_NUMNODES is unrelated to anything except that it might prevent a crash
> > and even then nr_online_nodes is probably what you wanted and even that
> > assumes the NUMA node numbering is contiguous. 
> 
> I used whatever nodemask.h did to detect end-of-bitmap and they use
> MAX_NUMNODES. See __next_node() and for_each_node() like.
> 

The check does prevent us going off the end of the bitmap but does not
necessarily return an online node.

> MAX_NUMNODES doesn't assume contiguous numbers since its the actual size
> of the bitmap, nr_online_nodes would hoever.
> 

I intended to say nr_node_ids, the same size as buffers such as the
task_numa_buffers. If we ever return a nid > nr_node_ids here then
task_numa_fault would corrupt memory. However, it should be possible for
node_weight to exceed nr_node_ids except maybe during node hot-remove so
it's not the problem.
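
To spell out the corruption concern (purely illustrative, not code from
any tree; the names are made up): the fault statistics are per-task
buffers with one slot per node id, indexed directly by the nid that the
mempolicy code hands back, e.g.

	/* Hypothetical sketch only: nid is used as a raw index into a
	 * buffer sized by nr_node_ids, with no bounds check on this path,
	 * so any nid >= nr_node_ids writes past the allocation.
	 */
	static void account_numa_fault(unsigned long *numa_faults, int nid)
	{
		numa_faults[nid]++;
	}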

> > The real concern is whether
> > the updated mask is an allowed target for the updated memory policy. If
> > it's not then "nid" can be pointing off the deep end somewhere.  With this
> > conversion to a for loop there is race after you check nnodes where target
> > gets set to 0 and then return a nid of -1 which I suppose will just blow
> > up differently but it's fixable.
> 
> But but but, I did i <= target, which will match when target == 0 so
> you'll get at least a single iteration and nid will be set.
> 

True.

> > This? Untested. Fixes implicit types while it's there. Note the use of
> > first node and (c < target) to guarantee nid gets set and that the first
> > potential node is still used as an interleave target.
> > 
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 7431001..ae880c3 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1755,22 +1755,24 @@ unsigned slab_node(void)
> >  }
> >  
> >  /* Do static interleaving for a VMA with known offset. */
> > -static unsigned offset_il_node(struct mempolicy *pol,
> > +static unsigned int offset_il_node(struct mempolicy *pol,
> >  		struct vm_area_struct *vma, unsigned long off)
> >  {
> > -	unsigned nnodes = nodes_weight(pol->v.nodes);
> > -	unsigned target;
> > -	int c;
> > -	int nid = -1;
> > +	unsigned int nr_nodes, target;
> > +	int i, nid;
> >  
> > -	if (!nnodes)
> > +again:
> > +	nr_nodes = nodes_weight(pol->v.nodes);
> > +	if (!nr_nodes)
> >  		return numa_node_id();
> > -	target = (unsigned int)off % nnodes;
> > -	c = 0;
> > -	do {
> > +	target = (unsigned int)off % nr_nodes;
> > +	for (i = 0, nid = first_node(pol->v.nodes); i < target; i++)
> >  		nid = next_node(nid, pol->v.nodes);
> > -		c++;
> > -	} while (c <= target);
> > +
> > +	/* Policy nodemask can potentially update in parallel */
> > +	if (unlikely(!node_isset(nid, pol->v.nodes)))
> > +		goto again;
> > +
> >  	return nid;
> >  }
> 
> So I explicitly didn't use the node_isset() test because that's more
> likely to trigger than the nid >= MAX_NUMNODES test. Its fine to return
> a node that isn't actually part of the mask anymore -- a race is a race
> anyway.

Yeah and as long as it's < nr_node_ids it should be ok within the task
numa fault handling as well.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
  2013-08-29 10:56           ` Mel Gorman
@ 2013-08-29 11:14             ` Peter Zijlstra
  -1 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2013-08-29 11:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Miao Xie, David Rientjes, Christoph Lameter,
	linux-mm, linux-kernel, riel

On Thu, Aug 29, 2013 at 11:56:57AM +0100, Mel Gorman wrote:
> > I thought it was, we crashed somewhere suspiciously close, but no. You
> > need shared mpols for this to actually trigger and the NUMA stuff
> > doesn't use that.
> > 
> 
> Ah, so this is a red herring?

Yeah, but I still think it's an actual bug. It seems the easiest way to
trigger this would be to:

 create a task that constantly allocates pages
 have said task have an MPOL_INTERLEAVE task policy
 put said task into a cpuset
 using a different task (your shell for example) flip the cpuset's
   mems_allowed back and forth.

This would have the shell task constantly rebind (in two steps) our
allocating task's INTERLEAVE policy.
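
For what it's worth, a rough sketch of the allocating-task half of that
recipe (untested and hedged: it assumes libnuma's set_mempolicy() wrapper,
and the node masks, sizes and cpuset paths are made up and would need
adjusting for the machine):

	/* interleave-alloc.c: fault anonymous memory through an
	 * MPOL_INTERLEAVE task policy in a loop, while something else
	 * rebinds the cpuset's mems_allowed.
	 */
	#include <numaif.h>		/* set_mempolicy(), MPOL_INTERLEAVE */
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		unsigned long mask = 0x3;	/* interleave over nodes 0-1 */
		size_t len = 64UL << 20;

		if (set_mempolicy(MPOL_INTERLEAVE, &mask, sizeof(mask) * 8))
			return 1;

		for (;;) {
			char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (p == MAP_FAILED)
				return 1;
			memset(p, 1, len);	/* fault every page */
			munmap(p, len);
		}
	}

	/* From a shell, with the task above in a cpuset (paths assumed):
	 *
	 *   while :; do
	 *     echo 0-1 > /sys/fs/cgroup/cpuset/test/cpuset.mems
	 *     echo 2-3 > /sys/fs/cgroup/cpuset/test/cpuset.mems
	 *   done
	 */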

> > I used whatever nodemask.h did to detect end-of-bitmap and they use
> > MAX_NUMNODES. See __next_node() and for_each_node() like.
> > 
> 
> The check does prevent us going off the end of the bitmap but does not
> necessarily return an online node.

Right, but it's guaranteed to return a 'valid' node. I don't think it
returning an offline node is a problem, we'll find it empty and fail the
page allocation.

> > MAX_NUMNODES doesn't assume contiguous numbers since its the actual size
> > of the bitmap, nr_online_nodes would hoever.
> > 
> 
> I intended to say nr_node_ids, the same size as buffers such as the
> task_numa_buffers. If we ever return a nid > nr_node_ids here then
> task_numa_fault would corrupt memory. However, it should be possible for
> node_weight to exceed nr_node_ids except maybe during node hot-remove so
> it's not the problem.

The nodemask situation seems somewhat more confused than the cpumask
case; how would we ever return a nid > nr_node_ids? Corrupt nodemask?

In the cpumask case we use the runtime limit nr_cpu_ids for all bitmap
operations, arguably we should make the nodemask stuff do the same.

Fewer bits to iterate is always good; a MAX_NUMNODES=64
(x86_64-defconfig) will still iterate all 64 bits, even though it's
unlikely to have more than 1, let alone more than 8, nodes.
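
By way of contrast, the cpumask iterator already stops at the runtime
limit (roughly, paraphrasing include/linux/cpumask.h):

	#define for_each_cpu(cpu, mask)				\
		for ((cpu) = -1;				\
			(cpu) = cpumask_next((cpu), (mask)),	\
			(cpu) < nr_cpu_ids;)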

> > So I explicitly didn't use the node_isset() test because that's more
> > likely to trigger than the nid >= MAX_NUMNODES test. Its fine to return
> > a node that isn't actually part of the mask anymore -- a race is a race
> > anyway.
> 
> Yeah and as long as it's < nr_node_ids it should be ok within the task
> numa fault handling as well.

Right, I'm just a tad confused on how we could ever get a nid >=
nr_node_ids except from a prior bug (corrupted nodemask).

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3
  2013-08-29 11:14             ` Peter Zijlstra
@ 2013-08-29 12:10               ` Mel Gorman
  -1 siblings, 0 replies; 39+ messages in thread
From: Mel Gorman @ 2013-08-29 12:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Miao Xie, David Rientjes, Christoph Lameter,
	linux-mm, linux-kernel, riel

On Thu, Aug 29, 2013 at 01:14:19PM +0200, Peter Zijlstra wrote:
> > I intended to say nr_node_ids, the same size as buffers such as the
> > task_numa_buffers. If we ever return a nid > nr_node_ids here then
> > task_numa_fault would corrupt memory. However, it should be possible for
> > node_weight to exceed nr_node_ids except maybe during node hot-remove so
> > it's not the problem.
> 
> The nodemask situation seems somewhat more confused than the cpumask
> case; how would we ever return a nid > nr_node_ids? Corrupt nodemask?
> 

ARGH. I meant impossible. The exact bloody opposite of what I wrote.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2013-08-29 12:11 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-03-07 18:08 [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3 Mel Gorman
2012-03-07 18:08 ` Mel Gorman
2012-03-26 10:56 ` Peter Zijlstra
2012-03-26 10:56   ` Peter Zijlstra
2012-03-26 11:07   ` Peter Zijlstra
2012-03-26 11:07     ` Peter Zijlstra
2012-03-26 15:50   ` Mel Gorman
2012-03-26 15:50     ` Mel Gorman
2012-03-26 16:20     ` Peter Zijlstra
2012-03-26 16:20       ` Peter Zijlstra
2012-03-27 12:47       ` Mel Gorman
2012-03-27 12:47         ` Mel Gorman
2012-03-27 13:14         ` [PATCH] mm: Optimize put_mems_allowed() usage Peter Zijlstra
2012-03-27 13:14           ` Peter Zijlstra
2012-05-17 10:33           ` Peter Zijlstra
2012-05-17 10:33             ` Peter Zijlstra
2012-05-17 20:16           ` Andrew Morton
2012-05-17 20:16             ` Andrew Morton
2012-05-17 20:23             ` Peter Zijlstra
2012-05-17 20:23               ` Peter Zijlstra
2012-05-18 10:20   ` [tip:sched/numa] " tip-bot for Peter Zijlstra
2013-08-23 13:03 ` [PATCH] cpuset: mm: Reduce large amounts of memory barrier related damage v3 Peter Zijlstra
2013-08-23 13:03   ` Peter Zijlstra
2013-08-23 18:15   ` Peter Zijlstra
2013-08-23 18:15     ` Peter Zijlstra
2013-08-26  5:32     ` Rik van Riel
2013-08-26  5:32       ` Rik van Riel
2013-08-29  9:28     ` Mel Gorman
2013-08-29  9:28       ` Mel Gorman
2013-08-29  9:43       ` Peter Zijlstra
2013-08-29  9:43         ` Peter Zijlstra
2013-08-29  9:45         ` Peter Zijlstra
2013-08-29  9:45           ` Peter Zijlstra
2013-08-29 10:56         ` Mel Gorman
2013-08-29 10:56           ` Mel Gorman
2013-08-29 11:14           ` Peter Zijlstra
2013-08-29 11:14             ` Peter Zijlstra
2013-08-29 12:10             ` Mel Gorman
2013-08-29 12:10               ` Mel Gorman
