* [PATCH -mm] slab: fix cpuset check in fallback_alloc
@ 2014-08-10 17:48 ` Vladimir Davydov
  0 siblings, 0 replies; 16+ messages in thread
From: Vladimir Davydov @ 2014-08-10 17:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim

fallback_alloc is called on kmalloc if the preferred node doesn't have
free or partial slabs and there are no pages on the node's free list
(GFP_THISNODE allocations fail). Before invoking the reclaimer it tries
to locate a free or partial slab on other allowed nodes' lists. While
iterating over the preferred node's zonelist it skips those zones for
which cpuset_zone_allowed_hardwall returns false. That means that for a
task bound to a specific node using cpusets, fallback_alloc will always
ignore free slabs on other nodes and go directly to the reclaimer,
which, however, may allocate from other nodes if cpuset.mem_hardwall is
unset (the default). As a result, the lists of free slabs on other
nodes may grow without bound, which is bad, because inactive slabs are
evicted only by cache_reap, at a very slow rate, and cannot be dropped
forcefully.

To reproduce the issue, run a process that will walk over a directory
tree with lots of files inside a cpuset bound to a node that constantly
experiences memory pressure. Look at num_slabs vs active_slabs growth as
reported by /proc/slabinfo.
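
For concreteness, the walker itself can be a trivial userspace program
along the following lines (a sketch only: the cpuset binding and the
memory pressure on the node are assumed to be set up separately, and the
tree to walk is arbitrary):

#define _XOPEN_SOURCE 500	/* for nftw() */
#include <ftw.h>
#include <sys/stat.h>
#include <stdlib.h>

/* nftw() stat()s every entry it visits, which is enough to populate the
 * dentry and inode slab caches for the whole tree.
 */
static int visit(const char *path, const struct stat *st, int type,
		 struct FTW *ftw)
{
	return 0;
}

int main(int argc, char **argv)
{
	const char *root = argc > 1 ? argv[1] : ".";	/* any big tree */

	for (;;)
		if (nftw(root, visit, 64, FTW_PHYS))
			return EXIT_FAILURE;
}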

We should use cpuset_zone_allowed_softwall in fallback_alloc. Since it
can sleep, we call it only for __GFP_WAIT allocations. For atomic
allocations we simply ignore cpusets, which is in agreement with the
cpuset documentation (see the comment to __cpuset_node_allowed_softwall).

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/slab.c |   23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 2e60bf3dedbb..1d77a4df7ee1 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3049,14 +3049,23 @@ retry:
 	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 		nid = zone_to_nid(zone);
 
-		if (cpuset_zone_allowed_hardwall(zone, flags) &&
-			get_node(cache, nid) &&
-			get_node(cache, nid)->free_objects) {
-				obj = ____cache_alloc_node(cache,
-					flags | GFP_THISNODE, nid);
-				if (obj)
-					break;
+		if (!get_node(cache, nid) ||
+		    !get_node(cache, nid)->free_objects)
+			continue;
+
+		if (local_flags & __GFP_WAIT) {
+			bool allowed;
+
+			local_irq_enable();
+			allowed = cpuset_zone_allowed_softwall(zone, flags);
+			local_irq_disable();
+			if (!allowed)
+				continue;
 		}
+
+		obj = ____cache_alloc_node(cache, flags | GFP_THISNODE, nid);
+		if (obj)
+			break;
 	}
 
 	if (!obj) {
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH -mm] slab: fix cpuset check in fallback_alloc
  2014-08-10 17:48 ` Vladimir Davydov
@ 2014-08-10 22:43   ` David Rientjes
  -1 siblings, 0 replies; 16+ messages in thread
From: David Rientjes @ 2014-08-10 22:43 UTC (permalink / raw)
  To: Vladimir Davydov, Li Zefan
  Cc: Andrew Morton, linux-mm, linux-kernel, cgroups,
	Christoph Lameter, Pekka Enberg, Joonsoo Kim

On Sun, 10 Aug 2014, Vladimir Davydov wrote:

> fallback_alloc is called on kmalloc if the preferred node doesn't have
> free or partial slabs and there's no pages on the node's free list
> (GFP_THISNODE allocations fail). Before invoking the reclaimer it tries
> to locate a free or partial slab on other allowed nodes' lists. While
> iterating over the preferred node's zonelist it skips those zones which
> cpuset_zone_allowed_hardwall returns false for. That means that for a
> task bound to a specific node using cpusets fallback_alloc will always
> ignore free slabs on other nodes and go directly to the reclaimer,
> which, however, may allocate from other nodes if cpuset.mem_hardwall is
> unset (default). As a result, we may get lists of free slabs grow
> without bounds on other nodes, which is bad, because inactive slabs are
> only evicted by cache_reap at a very slow rate and cannot be dropped
> forcefully.
> 
> To reproduce the issue, run a process that will walk over a directory
> tree with lots of files inside a cpuset bound to a node that constantly
> experiences memory pressure. Look at num_slabs vs active_slabs growth as
> reported by /proc/slabinfo.
> 
> We should use cpuset_zone_allowed_softwall in fallback_alloc. Since it
> can sleep, we only call it on __GFP_WAIT allocations. For atomic
> allocations we simply ignore cpusets, which is in agreement with the
> cpuset documenation (see the comment to __cpuset_node_allowed_softwall).
> 

If that rule were ever changed, nobody would think to modify the 
fallback_alloc() behavior in the slab allocator.  Why can't 
cpuset_zone_allowed_hardwall() just return 1 for !__GFP_WAIT?

I don't think this issue is restricted to slab; it applies to all callers 
of cpuset_zone_allowed_softwall() that could possibly be atomic.  I think 
it would be better to determine whether cpuset_zone_allowed() should be 
hardwall or softwall based on the gfp flags.

Let's add Li, the cpuset maintainer.  Any reason we can't do this?
---
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -48,29 +48,16 @@ extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
 void cpuset_init_current_mems_allowed(void);
 int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
 
-extern int __cpuset_node_allowed_softwall(int node, gfp_t gfp_mask);
-extern int __cpuset_node_allowed_hardwall(int node, gfp_t gfp_mask);
+extern int __cpuset_node_allowed(int node, const gfp_t gfp_mask);
 
-static inline int cpuset_node_allowed_softwall(int node, gfp_t gfp_mask)
+static inline int cpuset_node_allowed(int node, gfp_t gfp_mask)
 {
-	return nr_cpusets() <= 1 ||
-		__cpuset_node_allowed_softwall(node, gfp_mask);
+	return nr_cpusets() <= 1 || __cpuset_node_allowed(node, gfp_mask);
 }
 
-static inline int cpuset_node_allowed_hardwall(int node, gfp_t gfp_mask)
+static inline int cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
 {
-	return nr_cpusets() <= 1 ||
-		__cpuset_node_allowed_hardwall(node, gfp_mask);
-}
-
-static inline int cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask)
-{
-	return cpuset_node_allowed_softwall(zone_to_nid(z), gfp_mask);
-}
-
-static inline int cpuset_zone_allowed_hardwall(struct zone *z, gfp_t gfp_mask)
-{
-	return cpuset_node_allowed_hardwall(zone_to_nid(z), gfp_mask);
+	return cpuset_node_allowed(zone_to_nid(z), gfp_mask);
 }
 
 extern int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
@@ -178,22 +165,12 @@ static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
 	return 1;
 }
 
-static inline int cpuset_node_allowed_softwall(int node, gfp_t gfp_mask)
-{
-	return 1;
-}
-
-static inline int cpuset_node_allowed_hardwall(int node, gfp_t gfp_mask)
-{
-	return 1;
-}
-
-static inline int cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask)
+static inline int cpuset_node_allowed(int node, gfp_t gfp_mask)
 {
 	return 1;
 }
 
-static inline int cpuset_zone_allowed_hardwall(struct zone *z, gfp_t gfp_mask)
+static inline int cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
 {
 	return 1;
 }
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2449,7 +2449,7 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
 }
 
 /**
- * cpuset_node_allowed_softwall - Can we allocate on a memory node?
+ * __cpuset_node_allowed - Can we allocate on a memory node?
  * @node: is this an allowed node?
  * @gfp_mask: memory allocation flags
  *
@@ -2461,12 +2461,8 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
  * flag, yes.
  * Otherwise, no.
  *
- * If __GFP_HARDWALL is set, cpuset_node_allowed_softwall() reduces to
- * cpuset_node_allowed_hardwall().  Otherwise, cpuset_node_allowed_softwall()
- * might sleep, and might allow a node from an enclosing cpuset.
- *
- * cpuset_node_allowed_hardwall() only handles the simpler case of hardwall
- * cpusets, and never sleeps.
+ * If __GFP_HARDWALL is not set, this might sleep and might allow a node from an
+ * enclosing cpuset.
  *
  * The __GFP_THISNODE placement logic is really handled elsewhere,
  * by forcibly using a zonelist starting at a specified node, and by
@@ -2495,7 +2491,7 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
  *
  * The second pass through get_page_from_freelist() doesn't even call
  * here for GFP_ATOMIC calls.  For those calls, the __alloc_pages()
- * variable 'wait' is not set, and the bit ALLOC_CPUSET is not set
+ * variable 'atomic' is set, and the bit ALLOC_CPUSET is not set
  * in alloc_flags.  That logic and the checks below have the combined
  * affect that:
  *	in_interrupt - any node ok (current task context irrelevant)
@@ -2505,18 +2501,22 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
  *	GFP_USER     - only nodes in current tasks mems allowed ok.
  *
  * Rule:
- *    Don't call cpuset_node_allowed_softwall if you can't sleep, unless you
+ *    Don't call __cpuset_node_allowed if you can't sleep, unless you
  *    pass in the __GFP_HARDWALL flag set in gfp_flag, which disables
  *    the code that might scan up ancestor cpusets and sleep.
  */
-int __cpuset_node_allowed_softwall(int node, gfp_t gfp_mask)
+int __cpuset_node_allowed(int node, const gfp_t gfp_mask)
 {
 	struct cpuset *cs;		/* current cpuset ancestors */
 	int allowed;			/* is allocation in zone z allowed? */
 
-	if (in_interrupt() || (gfp_mask & __GFP_THISNODE))
+	if (in_interrupt())
 		return 1;
 	might_sleep_if(!(gfp_mask & __GFP_HARDWALL));
+	if (gfp_mask & __GFP_THISNODE)
+		return 1;
+	if (!(gfp_mask & __GFP_WAIT))
+		return 1;
 	if (node_isset(node, current->mems_allowed))
 		return 1;
 	/*
@@ -2543,44 +2543,6 @@ int __cpuset_node_allowed_softwall(int node, gfp_t gfp_mask)
 	return allowed;
 }
 
-/*
- * cpuset_node_allowed_hardwall - Can we allocate on a memory node?
- * @node: is this an allowed node?
- * @gfp_mask: memory allocation flags
- *
- * If we're in interrupt, yes, we can always allocate.  If __GFP_THISNODE is
- * set, yes, we can always allocate.  If node is in our task's mems_allowed,
- * yes.  If the task has been OOM killed and has access to memory reserves as
- * specified by the TIF_MEMDIE flag, yes.
- * Otherwise, no.
- *
- * The __GFP_THISNODE placement logic is really handled elsewhere,
- * by forcibly using a zonelist starting at a specified node, and by
- * (in get_page_from_freelist()) refusing to consider the zones for
- * any node on the zonelist except the first.  By the time any such
- * calls get to this routine, we should just shut up and say 'yes'.
- *
- * Unlike the cpuset_node_allowed_softwall() variant, above,
- * this variant requires that the node be in the current task's
- * mems_allowed or that we're in interrupt.  It does not scan up the
- * cpuset hierarchy for the nearest enclosing mem_exclusive cpuset.
- * It never sleeps.
- */
-int __cpuset_node_allowed_hardwall(int node, gfp_t gfp_mask)
-{
-	if (in_interrupt() || (gfp_mask & __GFP_THISNODE))
-		return 1;
-	if (node_isset(node, current->mems_allowed))
-		return 1;
-	/*
-	 * Allow tasks that have access to memory reserves because they have
-	 * been OOM killed to get memory anywhere.
-	 */
-	if (unlikely(test_thread_flag(TIF_MEMDIE)))
-		return 1;
-	return 0;
-}
-
 /**
  * cpuset_mem_spread_node() - On which node to begin search for a file page
  * cpuset_slab_spread_node() - On which node to begin search for a slab page
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -582,7 +582,7 @@ retry_cpuset:
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 						MAX_NR_ZONES - 1, nodemask) {
-		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask(h))) {
+		if (cpuset_zone_allowed(zone, htlb_alloc_mask(h))) {
 			page = dequeue_huge_page_node(h, zone_to_nid(zone));
 			if (page) {
 				if (avoid_reserve)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -231,7 +231,7 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
 	/* Check this allocation failure is caused by cpuset's wall function */
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 			high_zoneidx, nodemask)
-		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
+		if (!cpuset_zone_allowed(zone, gfp_mask))
 			cpuset_limited = true;
 
 	if (cpuset_limited) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1963,7 +1963,7 @@ zonelist_scan:
 
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
-	 * See also __cpuset_node_allowed_softwall() comment in kernel/cpuset.c.
+	 * See __cpuset_node_allowed() comment in kernel/cpuset.c.
 	 */
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 						high_zoneidx, nodemask) {
@@ -1974,7 +1974,7 @@ zonelist_scan:
 				continue;
 		if (cpusets_enabled() &&
 			(alloc_flags & ALLOC_CPUSET) &&
-			!cpuset_zone_allowed_softwall(zone, gfp_mask))
+			!cpuset_zone_allowed(zone, gfp_mask))
 				continue;
 		/*
 		 * Distribute pages in proportion to the individual
@@ -2492,7 +2492,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 			alloc_flags |= ALLOC_HARDER;
 		/*
 		 * Ignore cpuset mems for GFP_ATOMIC rather than fail, see the
-		 * comment for __cpuset_node_allowed_softwall().
+		 * comment for __cpuset_node_allowed().
 		 */
 		alloc_flags &= ~ALLOC_CPUSET;
 	} else if (unlikely(rt_task(current)) && !in_interrupt())
diff --git a/mm/slab.c b/mm/slab.c
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3047,16 +3047,19 @@ retry:
 	 * from existing per node queues.
 	 */
 	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-		nid = zone_to_nid(zone);
+		struct kmem_cache_node *n;
 
-		if (cpuset_zone_allowed_hardwall(zone, flags) &&
-			get_node(cache, nid) &&
-			get_node(cache, nid)->free_objects) {
-				obj = ____cache_alloc_node(cache,
-					flags | GFP_THISNODE, nid);
-				if (obj)
-					break;
-		}
+		nid = zone_to_nid(zone);
+		if (!cpuset_zone_allowed(zone, flags | __GFP_HARDWALL))
+			continue;
+		n = get_node(cache, nid);
+		if (!n)
+			continue;
+		if (!n->free_objects)
+			continue;
+		obj = ____cache_alloc_node(cache, flags | GFP_THISNODE, nid);
+		if (obj)
+			break;
 	}
 
 	if (!obj) {
diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1671,20 +1671,22 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
 			struct kmem_cache_node *n;
 
 			n = get_node(s, zone_to_nid(zone));
+			if (!n)
+				continue;
+			if (!cpuset_zone_allowed(zone, flags | __GFP_HARDWALL))
+				continue;
+			if (n->nr_partial <= s->min_partial)
+				continue;
 
-			if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
-					n->nr_partial > s->min_partial) {
-				object = get_partial_node(s, n, c, flags);
-				if (object) {
-					/*
-					 * Don't check read_mems_allowed_retry()
-					 * here - if mems_allowed was updated in
-					 * parallel, that was a harmless race
-					 * between allocation and the cpuset
-					 * update
-					 */
-					return object;
-				}
+			object = get_partial_node(s, n, c, flags);
+			if (object) {
+				/*
+				 * Don't check read_mems_allowed_retry() here -
+				 * if mems_allowed was updated in parallel,
+				 * that was a harmless race between allocation
+				 * and the cpuset update.
+				 */
+				return object;
 			}
 		}
 	} while (read_mems_allowed_retry(cpuset_mems_cookie));
diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2399,7 +2399,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 		 * to global LRU.
 		 */
 		if (global_reclaim(sc)) {
-			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
+			if (!cpuset_zone_allowed(zone,
+						 GFP_KERNEL | __GFP_HARDWALL))
 				continue;
 
 			lru_pages += zone_reclaimable_pages(zone);
@@ -3381,7 +3382,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	if (!populated_zone(zone))
 		return;
 
-	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
+	if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
 		return;
 	pgdat = zone->zone_pgdat;
 	if (pgdat->kswapd_max_order < order) {

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -mm] slab: fix cpuset check in fallback_alloc
  2014-08-10 22:43   ` David Rientjes
@ 2014-08-11  7:13     ` Vladimir Davydov
  -1 siblings, 0 replies; 16+ messages in thread
From: Vladimir Davydov @ 2014-08-11  7:13 UTC (permalink / raw)
  To: David Rientjes
  Cc: Li Zefan, Andrew Morton, linux-mm, linux-kernel, cgroups,
	Christoph Lameter, Pekka Enberg, Joonsoo Kim

On Sun, Aug 10, 2014 at 03:43:21PM -0700, David Rientjes wrote:
> On Sun, 10 Aug 2014, Vladimir Davydov wrote:
> 
> > fallback_alloc is called on kmalloc if the preferred node doesn't have
> > free or partial slabs and there's no pages on the node's free list
> > (GFP_THISNODE allocations fail). Before invoking the reclaimer it tries
> > to locate a free or partial slab on other allowed nodes' lists. While
> > iterating over the preferred node's zonelist it skips those zones which
> > cpuset_zone_allowed_hardwall returns false for. That means that for a
> > task bound to a specific node using cpusets fallback_alloc will always
> > ignore free slabs on other nodes and go directly to the reclaimer,
> > which, however, may allocate from other nodes if cpuset.mem_hardwall is
> > unset (default). As a result, we may get lists of free slabs grow
> > without bounds on other nodes, which is bad, because inactive slabs are
> > only evicted by cache_reap at a very slow rate and cannot be dropped
> > forcefully.
> > 
> > To reproduce the issue, run a process that will walk over a directory
> > tree with lots of files inside a cpuset bound to a node that constantly
> > experiences memory pressure. Look at num_slabs vs active_slabs growth as
> > reported by /proc/slabinfo.
> > 
> > We should use cpuset_zone_allowed_softwall in fallback_alloc. Since it
> > can sleep, we only call it on __GFP_WAIT allocations. For atomic
> > allocations we simply ignore cpusets, which is in agreement with the
> > cpuset documenation (see the comment to __cpuset_node_allowed_softwall).
> > 
> 
> If that rule were ever changed, nobody would think to modify the 
> fallback_alloc() behavior in the slab allocator.  Why can't 
> cpuset_zone_allowed_hardwall() just return 1 for !__GFP_WAIT?
> 
> I don't think this issue is restricted only to slab, it's for all callers 
> of cpuset_zone_allowed_softwall() that could possibly be atomic.  I think 
> it would be better to determine if cpuset_zone_allowed() should be 
> hardwall or softwall depending on the gfp flags.
> 
> Let's add Li, the cpuset maintainer.  Any reason we can't do this?
> ---
[...]
> diff --git a/mm/slab.c b/mm/slab.c
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -3047,16 +3047,19 @@ retry:
>  	 * from existing per node queues.
>  	 */
>  	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> -		nid = zone_to_nid(zone);
> +		struct kmem_cache_node *n;
>  
> -		if (cpuset_zone_allowed_hardwall(zone, flags) &&
> -			get_node(cache, nid) &&
> -			get_node(cache, nid)->free_objects) {
> -				obj = ____cache_alloc_node(cache,
> -					flags | GFP_THISNODE, nid);
> -				if (obj)
> -					break;
> -		}
> +		nid = zone_to_nid(zone);
> +		if (!cpuset_zone_allowed(zone, flags | __GFP_HARDWALL))

We must use the softwall check here; otherwise we will proceed to
alloc_pages even if there are lots of free slabs on other nodes.
alloc_pages, in turn, may allocate from other nodes when
cpuset.mem_hardwall=0, because it uses the softwall check, so it may add
yet another free slab to another node's list even if that list isn't
empty. As a result, we may get free-list bloat on other nodes. I've seen
a machine with one of its nodes almost completely filled with inactive
slabs for buffer_heads (dozens of GBs), with no way to drop them. So
this is a bug that must be fixed.

Note that for SLUB, using the hardwall check in get_any_partial won't
lead to such a problem, because once added, a new slab is loaded onto a
per-cpu list, forcing any further users to allocate from it. Strictly
speaking, though, we should use the softwall check there as well.

> +			continue;
> +		n = get_node(cache, nid);
> +		if (!n)
> +			continue;
> +		if (!n->free_objects)
> +			continue;
> +		obj = ____cache_alloc_node(cache, flags | GFP_THISNODE, nid);
> +		if (obj)
> +			break;
>  	}
>  
>  	if (!obj) {
> diff --git a/mm/slub.c b/mm/slub.c
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1671,20 +1671,22 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
>  			struct kmem_cache_node *n;
>  
>  			n = get_node(s, zone_to_nid(zone));
> +			if (!n)
> +				continue;
> +			if (!cpuset_zone_allowed(zone, flags | __GFP_HARDWALL))
> +				continue;
> +			if (n->nr_parial <= s->min_partial)
> +				continue;
>  
> -			if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
> -					n->nr_partial > s->min_partial) {
> -				object = get_partial_node(s, n, c, flags);
> -				if (object) {
> -					/*
> -					 * Don't check read_mems_allowed_retry()
> -					 * here - if mems_allowed was updated in
> -					 * parallel, that was a harmless race
> -					 * between allocation and the cpuset
> -					 * update
> -					 */
> -					return object;
> -				}
> +			object = get_partial_node(s, n, c, flags);
> +			if (object) {
> +				/*
> +				 * Don't check read_mems_allowed_retry() here -
> +				 * if mems_allowed was updated in parallel,
> +				 * that was a harmless race between allocation
> +				 * and the cpuset update.
> +				 */
> +				return object;
>  			}
>  		}
>  	} while (read_mems_allowed_retry(cpuset_mems_cookie));
[...]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -mm] slab: fix cpuset check in fallback_alloc
  2014-08-11  7:13     ` Vladimir Davydov
@ 2014-08-11 11:37       ` David Rientjes
  -1 siblings, 0 replies; 16+ messages in thread
From: David Rientjes @ 2014-08-11 11:37 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Li Zefan, Andrew Morton, linux-mm, linux-kernel, cgroups,
	Christoph Lameter, Pekka Enberg, Joonsoo Kim

On Mon, 11 Aug 2014, Vladimir Davydov wrote:

> > diff --git a/mm/slab.c b/mm/slab.c
> > --- a/mm/slab.c
> > +++ b/mm/slab.c
> > @@ -3047,16 +3047,19 @@ retry:
> >  	 * from existing per node queues.
> >  	 */
> >  	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> > -		nid = zone_to_nid(zone);
> > +		struct kmem_cache_node *n;
> >  
> > -		if (cpuset_zone_allowed_hardwall(zone, flags) &&
> > -			get_node(cache, nid) &&
> > -			get_node(cache, nid)->free_objects) {
> > -				obj = ____cache_alloc_node(cache,
> > -					flags | GFP_THISNODE, nid);
> > -				if (obj)
> > -					break;
> > -		}
> > +		nid = zone_to_nid(zone);
> > +		if (!cpuset_zone_allowed(zone, flags | __GFP_HARDWALL))
> 
> We must use softwall check here, otherwise we will proceed to
> alloc_pages even if there are lots of free slabs on other nodes.
> alloc_pages, in turn, may allocate from other nodes in case
> cpuset.mem_hardwall=0, because it uses softwall check, so it may add yet
> another free slab to another node's list even if it isn't empty. As a
> result, we may get free list bloating on other nodes. I've seen a
> machine with one of its nodes almost completely filled with inactive
> slabs for buffer_heads (dozens of GBs) w/o any chance to drop them. So,
> this is a bug that must be fixed.
> 

Right, I understand, and my patch makes no attempt to fix that issue; it
simply collapses the code down into a single cpuset_zone_allowed()
function, where the context for the allocation is controlled by the gfp
flags (and hardwall is controlled by setting __GFP_HARDWALL), as it
should be.  I understand the issue you face, but I can't combine a
cleanup with a fix, and I would prefer to have your patch keep your
commit description.

The diffstat for my proposal removes many more lines than it adds, and I
think it will avoid this type of issue for new callers in the future.
Your patch could then be based on the single cpuset_zone_allowed()
function, where you would simply have to remove the __GFP_HARDWALL
above.  Or, your patch could be merged first and my cleanup applied on
top, but it seems like your one-liner would be clearer if it were based
on mine.
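
For concreteness, rebased on the single cpuset_zone_allowed() helper, the
softwall variant of the check would presumably reduce to dropping
__GFP_HARDWALL from the fallback_alloc() hunk quoted above, i.e. roughly:

	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
		struct kmem_cache_node *n;

		nid = zone_to_nid(zone);
		/* softwall: no __GFP_HARDWALL, so enclosing cpusets may be
		 * consulted and free slabs on other allowed nodes get used
		 * before falling back to the page allocator */
		if (!cpuset_zone_allowed(zone, flags))
			continue;
		n = get_node(cache, nid);
		if (!n || !n->free_objects)
			continue;
		obj = ____cache_alloc_node(cache, flags | GFP_THISNODE, nid);
		if (obj)
			break;
	}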

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -mm] slab: fix cpuset check in fallback_alloc
  2014-08-11 11:37       ` David Rientjes
@ 2014-08-11 12:17         ` Vladimir Davydov
  -1 siblings, 0 replies; 16+ messages in thread
From: Vladimir Davydov @ 2014-08-11 12:17 UTC (permalink / raw)
  To: David Rientjes
  Cc: Li Zefan, Andrew Morton, linux-mm, linux-kernel, cgroups,
	Christoph Lameter, Pekka Enberg, Joonsoo Kim

On Mon, Aug 11, 2014 at 04:37:15AM -0700, David Rientjes wrote:
> On Mon, 11 Aug 2014, Vladimir Davydov wrote:
> 
> > > diff --git a/mm/slab.c b/mm/slab.c
> > > --- a/mm/slab.c
> > > +++ b/mm/slab.c
> > > @@ -3047,16 +3047,19 @@ retry:
> > >  	 * from existing per node queues.
> > >  	 */
> > >  	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> > > -		nid = zone_to_nid(zone);
> > > +		struct kmem_cache_node *n;
> > >  
> > > -		if (cpuset_zone_allowed_hardwall(zone, flags) &&
> > > -			get_node(cache, nid) &&
> > > -			get_node(cache, nid)->free_objects) {
> > > -				obj = ____cache_alloc_node(cache,
> > > -					flags | GFP_THISNODE, nid);
> > > -				if (obj)
> > > -					break;
> > > -		}
> > > +		nid = zone_to_nid(zone);
> > > +		if (!cpuset_zone_allowed(zone, flags | __GFP_HARDWALL))
> > 
> > We must use softwall check here, otherwise we will proceed to
> > alloc_pages even if there are lots of free slabs on other nodes.
> > alloc_pages, in turn, may allocate from other nodes in case
> > cpuset.mem_hardwall=0, because it uses softwall check, so it may add yet
> > another free slab to another node's list even if it isn't empty. As a
> > result, we may get free list bloating on other nodes. I've seen a
> > machine with one of its nodes almost completely filled with inactive
> > slabs for buffer_heads (dozens of GBs) w/o any chance to drop them. So,
> > this is a bug that must be fixed.
> > 
> 
> Right, I understand, and my patch makes no attempt to fix that issue, it's 
> simply collapsing the code down into a single cpuset_zone_allowed() 
> function and the context for the allocation is controlled by the gfp 
> flags (and hardwall is controlled by setting __GFP_HARDWALL) as it should 
> be.  I understand the issue you face, but I can't combine a cleanup with a 
> fix and I would prefer to have your patch keep your commit description.  

Sorry, I misunderstood you.

> The diffstat for my proposal removes many more lines than it adds and I 
> think it will avoid this type of issue in the future for new callers.  
> Your patch could then be based on the single cpuset_zone_allowed() 
> function where you would simply have to remove the __GFP_HARDWALL above.  
> Or, your patch could be merged first and then my cleanup on top, but it 
> seems like your one-liner would be more clear if it is based on mine.

Having one function instead of two doing a similar thing is usually better
IMO, but AFAIU your patch isn't a mere cleanup - it also slightly
changes the logic behind the !__GFP_WAIT vs cpuset interaction:

> @@ -2505,18 +2501,22 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
>   *	GFP_USER     - only nodes in current tasks mems allowed ok.
>   *
>   * Rule:
> - *    Don't call cpuset_node_allowed_softwall if you can't sleep, unless you
> + *    Don't call __cpuset_node_allowed if you can't sleep, unless you
>   *    pass in the __GFP_HARDWALL flag set in gfp_flag, which disables
>   *    the code that might scan up ancestor cpusets and sleep.
>   */
> -int __cpuset_node_allowed_softwall(int node, gfp_t gfp_mask)
> +int __cpuset_node_allowed(int node, const gfp_t gfp_mask)
>  {
>  	struct cpuset *cs;		/* current cpuset ancestors */
>  	int allowed;			/* is allocation in zone z allowed? */
>  
> -	if (in_interrupt() || (gfp_mask & __GFP_THISNODE))
> +	if (in_interrupt())
>  		return 1;
>  	might_sleep_if(!(gfp_mask & __GFP_HARDWALL));
> +	if (gfp_mask & __GFP_THISNODE)
> +		return 1;
> +	if (!(gfp_mask & __GFP_WAIT))
> +		return 1;

This means cpuset_zone_allowed will now always return true for
!__GFP_WAIT allocations.

>  	if (node_isset(node, current->mems_allowed))
>  		return 1;
>  	/*
[...]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1963,7 +1963,7 @@ zonelist_scan:
>  
>  	/*
>  	 * Scan zonelist, looking for a zone with enough free.
> -	 * See also __cpuset_node_allowed_softwall() comment in kernel/cpuset.c.
> +	 * See __cpuset_node_allowed() comment in kernel/cpuset.c.
>  	 */
>  	for_each_zone_zonelist_nodemask(zone, z, zonelist,
>  						high_zoneidx, nodemask) {
> @@ -1974,7 +1974,7 @@ zonelist_scan:
>  				continue;
>  		if (cpusets_enabled() &&
>  			(alloc_flags & ALLOC_CPUSET) &&
> -			!cpuset_zone_allowed_softwall(zone, gfp_mask))
> +			!cpuset_zone_allowed(zone, gfp_mask))
>  				continue;

So, this is get_page_from_freelist. It's called from
__alloc_pages_nodemask with alloc_flags always having ALLOC_CPUSET bit
set and from __alloc_pages_slowpath with alloc_flags having ALLOC_CPUSET
bit set only for __GFP_WAIT allocations. That said, w/o your patch we
try to respect cpusets for all allocations, including atomic, and only
ignore cpusets if tight on memory (freelist's empty) for !__GFP_WAIT
allocations, while with your patch we always ignore cpusets for
!__GFP_WAIT allocations. Not sure if it really matters though, because
usually one uses cpuset.mems in conjunction with cpuset.cpus and it
won't make any difference then. It also doesn't conflict with any cpuset
documentation.

>  		/*
>  		 * Distribute pages in proportion to the individual
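
To make the above concrete, here is a rough sketch of the flow I mean
(paraphrased from memory, not verbatim source; the helper names and the exact
argument lists may differ in detail):

struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, ...)
{
	struct page *page;

	/* fast path: ALLOC_CPUSET is always set here */
	page = get_page_from_freelist(gfp_mask, ...,
				      ALLOC_WMARK_LOW | ALLOC_CPUSET, ...);
	if (!page)
		/*
		 * slow path: alloc_flags = gfp_to_alloc_flags(gfp_mask), which
		 * keeps ALLOC_CPUSET only for __GFP_WAIT, so atomic allocations
		 * are retried with cpusets ignored
		 */
		page = __alloc_pages_slowpath(gfp_mask, order, ...);
	return page;
}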

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -mm] slab: fix cpuset check in fallback_alloc
  2014-08-11 12:17         ` Vladimir Davydov
@ 2014-08-11 21:05           ` David Rientjes
  -1 siblings, 0 replies; 16+ messages in thread
From: David Rientjes @ 2014-08-11 21:05 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Li Zefan, Andrew Morton, linux-mm, linux-kernel, cgroups,
	Christoph Lameter, Pekka Enberg, Joonsoo Kim

On Mon, 11 Aug 2014, Vladimir Davydov wrote:

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1963,7 +1963,7 @@ zonelist_scan:
> >  
> >  	/*
> >  	 * Scan zonelist, looking for a zone with enough free.
> > -	 * See also __cpuset_node_allowed_softwall() comment in kernel/cpuset.c.
> > +	 * See __cpuset_node_allowed() comment in kernel/cpuset.c.
> >  	 */
> >  	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> >  						high_zoneidx, nodemask) {
> > @@ -1974,7 +1974,7 @@ zonelist_scan:
> >  				continue;
> >  		if (cpusets_enabled() &&
> >  			(alloc_flags & ALLOC_CPUSET) &&
> > -			!cpuset_zone_allowed_softwall(zone, gfp_mask))
> > +			!cpuset_zone_allowed(zone, gfp_mask))
> >  				continue;
> 
> So, this is get_page_from_freelist. It's called from
> __alloc_pages_nodemask with alloc_flags always having ALLOC_CPUSET bit
> set and from __alloc_pages_slowpath with alloc_flags having ALLOC_CPUSET
> bit set only for __GFP_WAIT allocations. That said, w/o your patch we
> try to respect cpusets for all allocations, including atomic, and only
> ignore cpusets if tight on memory (freelist's empty) for !__GFP_WAIT
> allocations, while with your patch we always ignore cpusets for
> !__GFP_WAIT allocations. Not sure if it really matters though, because
> usually one uses cpuset.mems in conjunction with cpuset.cpus and it
> won't make any difference then. It also doesn't conflict with any cpuset
> documentation.
> 

Yeah, that's why I'm asking Li, the cpuset maintainer, if we can do this.  
The only thing that we get by falling back to the page allocator slowpath 
is that kswapd gets woken up before the allocation is attempted without 
ALLOC_CPUSET.  It seems pointless to wakeup kswapd when the allocation can 
succeed on any node.  Even with the patch, if the allocation fails because 
all nodes are below their min watermark, then we still fallback to the 
slowpath and wake up kswapd but there's nothing much else we can do 
because it's !__GFP_WAIT.
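
Roughly, from memory (a sketch, not verbatim source; the guard around the
wakeup has changed over time):

static struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, ...)
{
	...
	/* the only thing gained by falling back here for !__GFP_WAIT */
	wake_all_kswapds(order, zonelist, high_zoneidx, preferred_zone);

	/* gfp_to_alloc_flags() drops ALLOC_CPUSET for !__GFP_WAIT */
	alloc_flags = gfp_to_alloc_flags(gfp_mask);
	page = get_page_from_freelist(gfp_mask, ..., alloc_flags, ...);
	...
}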

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -mm] slab: fix cpuset check in fallback_alloc
  2014-08-11 21:05           ` David Rientjes
  (?)
@ 2014-08-15  3:13             ` Li Zefan
  -1 siblings, 0 replies; 16+ messages in thread
From: Li Zefan @ 2014-08-15  3:13 UTC (permalink / raw)
  To: David Rientjes
  Cc: Vladimir Davydov, Andrew Morton, linux-mm, linux-kernel, cgroups,
	Christoph Lameter, Pekka Enberg, Joonsoo Kim

On 2014/8/12 5:05, David Rientjes wrote:
> On Mon, 11 Aug 2014, Vladimir Davydov wrote:
> 
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -1963,7 +1963,7 @@ zonelist_scan:
>>>  
>>>  	/*
>>>  	 * Scan zonelist, looking for a zone with enough free.
>>> -	 * See also __cpuset_node_allowed_softwall() comment in kernel/cpuset.c.
>>> +	 * See __cpuset_node_allowed() comment in kernel/cpuset.c.
>>>  	 */
>>>  	for_each_zone_zonelist_nodemask(zone, z, zonelist,
>>>  						high_zoneidx, nodemask) {
>>> @@ -1974,7 +1974,7 @@ zonelist_scan:
>>>  				continue;
>>>  		if (cpusets_enabled() &&
>>>  			(alloc_flags & ALLOC_CPUSET) &&
>>> -			!cpuset_zone_allowed_softwall(zone, gfp_mask))
>>> +			!cpuset_zone_allowed(zone, gfp_mask))
>>>  				continue;
>>
>> So, this is get_page_from_freelist. It's called from
>> __alloc_pages_nodemask with alloc_flags always having ALLOC_CPUSET bit
>> set and from __alloc_pages_slowpath with alloc_flags having ALLOC_CPUSET
>> bit set only for __GFP_WAIT allocations. That said, w/o your patch we
>> try to respect cpusets for all allocations, including atomic, and only
>> ignore cpusets if tight on memory (freelist's empty) for !__GFP_WAIT
>> allocations, while with your patch we always ignore cpusets for
>> !__GFP_WAIT allocations. Not sure if it really matters though, because
>> usually one uses cpuset.mems in conjunction with cpuset.cpus and it
>> won't make any difference then. It also doesn't conflict with any cpuset
>> documentation.
>>
> 
> Yeah, that's why I'm asking Li, the cpuset maintainer, if we can do this.  

I'm not quite sure. That code has been there since before I got involved in cpuset.

> The only thing that we get by falling back to the page allocator slowpath 
> is that kswapd gets woken up before the allocation is attempted without 
> ALLOC_CPUSET.  It seems pointless to wakeup kswapd when the allocation can 
> succeed on any node.  Even with the patch, if the allocation fails because 
> all nodes are below their min watermark, then we still fallback to the 
> slowpath and wake up kswapd but there's nothing much else we can do 
> because it's !__GFP_WAIT.
> .

I tend to agree with you, though. But if we want to do this, we should split
this change from the cleanup.

Regarding the cleanup, I found there used to be a single cpuset_node_allowed(),
and your cleanup is exactly a revert of that ancient commit:

commit 02a0e53d8227aff5e62e0433f82c12c1c2805fd6
Author: Paul Jackson <pj@sgi.com>
Date:   Wed Dec 13 00:34:25 2006 -0800

    [PATCH] cpuset: rework cpuset_zone_allowed api

It seems the major intention was to avoid accidental sleep-in-atomic bugs,
because callback_mutex might be held.
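
(For reference, the kind of hazard that split was guarding against - a
hypothetical caller, not code from this thread; some_lock, nid, allowed and
flags are made up for the example:)

/* atomic context, e.g. under a spinlock with irqs off */
spin_lock_irqsave(&some_lock, flags);
/*
 * The softwall variant may take callback_mutex and therefore sleep, which
 * is illegal here; such callers had to use the hardwall variant instead.
 */
allowed = __cpuset_node_allowed_softwall(nid, GFP_KERNEL);
spin_unlock_irqrestore(&some_lock, flags);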

I don't see any reason callback_mutex can't be a spinlock. I thought
about this when Gu Zhen fixed the bug where callback_mutex was nested inside
rcu_read_lock().

--
 kernel/cpuset.c | 81 ++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 49 insertions(+), 32 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index baa155c..9d9e239 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -284,7 +284,7 @@ static struct cpuset top_cpuset = {
  */
 
 static DEFINE_MUTEX(cpuset_mutex);
-static DEFINE_MUTEX(callback_mutex);
+static DEFINE_SPINLOCK(callback_lock);
 
 /*
  * CPU / memory hotplug is handled asynchronously.
@@ -848,6 +848,7 @@ static void update_tasks_cpumask(struct cpuset *cs)
  */
 static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 {
+	unsigned long flags;
 	struct cpuset *cp;
 	struct cgroup_subsys_state *pos_css;
 	bool need_rebuild_sched_domains = false;
@@ -875,9 +876,9 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 			continue;
 		rcu_read_unlock();
 
-		mutex_lock(&callback_mutex);
+		spin_lock_irqsave(&callback_lock, flags);
 		cpumask_copy(cp->effective_cpus, new_cpus);
-		mutex_unlock(&callback_mutex);
+		spin_unlock_irqrestore(&callback_lock, flags);
 
 		WARN_ON(!cgroup_on_dfl(cp->css.cgroup) &&
 			!cpumask_equal(cp->cpus_allowed, cp->effective_cpus));
@@ -910,6 +911,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 			  const char *buf)
 {
+	unsigned long flags;
 	int retval;
 
 	/* top_cpuset.cpus_allowed tracks cpu_online_mask; it's read-only */
@@ -942,9 +944,9 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	if (retval < 0)
 		return retval;
 
-	mutex_lock(&callback_mutex);
+	spin_lock_irqsave(&callback_lock, flags);
 	cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
-	mutex_unlock(&callback_mutex);
+	spin_unlock_irqrestore(&callback_lock, flags);
 
 	/* use trialcs->cpus_allowed as a temp variable */
 	update_cpumasks_hier(cs, trialcs->cpus_allowed);
@@ -1105,6 +1107,7 @@ static void update_tasks_nodemask(struct cpuset *cs)
  */
 static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
 {
+	unsigned long flags;
 	struct cpuset *cp;
 	struct cgroup_subsys_state *pos_css;
 
@@ -1132,8 +1135,8 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
 		rcu_read_unlock();
 
-		mutex_lock(&callback_mutex);
+		spin_lock_irqsave(&callback_lock, flags);
 		cp->effective_mems = *new_mems;
-		mutex_unlock(&callback_mutex);
+		spin_unlock_irqrestore(&callback_lock, flags);
 
 		WARN_ON(!cgroup_on_dfl(cp->css.cgroup) &&
 			!nodes_equal(cp->mems_allowed, cp->effective_mems));
@@ -1162,6 +1166,7 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
 static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
 			   const char *buf)
 {
+	unsigned long flags;
 	int retval;
 
 	/*
@@ -1201,9 +1206,9 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
 	if (retval < 0)
 		goto done;
 
-	mutex_lock(&callback_mutex);
+	spin_lock_irqsave(&callback_lock, flags);
 	cs->mems_allowed = trialcs->mems_allowed;
-	mutex_unlock(&callback_mutex);
+	spin_unlock_irqrestore(&callback_lock, flags);
 
 	/* use trialcs->mems_allowed as a temp variable */
 	update_nodemasks_hier(cs, &cs->mems_allowed);
@@ -1264,6 +1269,7 @@ static void update_tasks_flags(struct cpuset *cs)
 static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 		       int turning_on)
 {
+	unsigned long flags;
 	struct cpuset *trialcs;
 	int balance_flag_changed;
 	int spread_flag_changed;
@@ -1288,9 +1294,9 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
 			|| (is_spread_page(cs) != is_spread_page(trialcs)));
 
-	mutex_lock(&callback_mutex);
+	spin_lock_irqsave(&callback_lock, flags);
 	cs->flags = trialcs->flags;
-	mutex_unlock(&callback_mutex);
+	spin_unlock_irqrestore(&callback_lock, flags);
 
 	if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed)
 		rebuild_sched_domains_locked();
@@ -1690,11 +1696,12 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
 	ssize_t count;
 	char *buf, *s;
 	int ret = 0;
+	unsigned long flags;
 
 	count = seq_get_buf(sf, &buf);
 	s = buf;
 
-	mutex_lock(&callback_mutex);
+	spin_lock_irqsave(&callback_lock, flags);
 
 	switch (type) {
 	case FILE_CPULIST:
@@ -1721,7 +1728,7 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
 		seq_commit(sf, -1);
 	}
 out_unlock:
-	mutex_unlock(&callback_mutex);
+	spin_unlock_irqrestore(&callback_lock, flags);
 	return ret;
 }
 
@@ -1924,6 +1931,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	struct cpuset *parent = parent_cs(cs);
 	struct cpuset *tmp_cs;
 	struct cgroup_subsys_state *pos_css;
+	unsigned long flags;
 
 	if (!parent)
 		return 0;
@@ -1938,12 +1946,12 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 
 	cpuset_inc();
 
-	mutex_lock(&callback_mutex);
+	spin_lock_irqsave(&callback_lock, flags);
 	if (cgroup_on_dfl(cs->css.cgroup)) {
 		cpumask_copy(cs->effective_cpus, parent->effective_cpus);
 		cs->effective_mems = parent->effective_mems;
 	}
-	mutex_unlock(&callback_mutex);
+	spin_unlock_irqrestore(&callback_lock, flags);
 
 	if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags))
 		goto out_unlock;
@@ -1970,10 +1978,10 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	}
 	rcu_read_unlock();
 
-	mutex_lock(&callback_mutex);
+	spin_lock_irqsave(&callback_lock, flags);
 	cs->mems_allowed = parent->mems_allowed;
 	cpumask_copy(cs->cpus_allowed, parent->cpus_allowed);
-	mutex_unlock(&callback_mutex);
+	spin_unlock_irqrestore(&callback_lock, flags);
 out_unlock:
 	mutex_unlock(&cpuset_mutex);
 	return 0;
@@ -2011,8 +2019,10 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
 
 static void cpuset_bind(struct cgroup_subsys_state *root_css)
 {
+	unsigned long flags;
+
 	mutex_lock(&cpuset_mutex);
-	mutex_lock(&callback_mutex);
+	spin_lock_irqsave(&callback_lock, flags);
 
 	if (cgroup_on_dfl(root_css->cgroup)) {
 		cpumask_copy(top_cpuset.cpus_allowed, cpu_possible_mask);
@@ -2023,7 +2033,7 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
 		top_cpuset.mems_allowed = top_cpuset.effective_mems;
 	}
 
-	mutex_unlock(&callback_mutex);
+	spin_unlock_irqrestore(&callback_lock, flags);
 	mutex_unlock(&cpuset_mutex);
 }
 
@@ -2107,13 +2117,14 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
 			    bool cpus_updated, bool mems_updated)
 {
 	bool is_empty;
+	unsigned long flags;
 
-	mutex_lock(&callback_mutex);
+	spin_lock_irqsave(&callback_lock, flags);
 	cpumask_copy(cs->cpus_allowed, new_cpus);
 	cpumask_copy(cs->effective_cpus, new_cpus);
 	cs->mems_allowed = *new_mems;
 	cs->effective_mems = *new_mems;
-	mutex_unlock(&callback_mutex);
+	spin_unlock_irqrestore(&callback_lock, flags);
 
 	/*
 	 * Don't call update_tasks_cpumask() if the cpuset becomes empty,
@@ -2145,15 +2156,17 @@ hotplug_update_tasks(struct cpuset *cs,
 		     struct cpumask *new_cpus, nodemask_t *new_mems,
 		     bool cpus_updated, bool mems_updated)
 {
+	unsigned long flags;
+
 	if (cpumask_empty(new_cpus))
 		cpumask_copy(new_cpus, parent_cs(cs)->effective_cpus);
 	if (nodes_empty(*new_mems))
 		*new_mems = parent_cs(cs)->effective_mems;
 
-	mutex_lock(&callback_mutex);
+	spin_lock_irqsave(&callback_lock, flags);
 	cpumask_copy(cs->effective_cpus, new_cpus);
 	cs->effective_mems = *new_mems;
-	mutex_unlock(&callback_mutex);
+	spin_unlock_irqrestore(&callback_lock, flags);
 
 	if (cpus_updated)
 		update_tasks_cpumask(cs);
@@ -2227,6 +2240,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 	static nodemask_t new_mems;
 	bool cpus_updated, mems_updated;
 	bool on_dfl = cgroup_on_dfl(top_cpuset.css.cgroup);
+	unsigned long flags;
 
 	mutex_lock(&cpuset_mutex);
 
@@ -2239,21 +2253,21 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 
 	/* synchronize cpus_allowed to cpu_active_mask */
 	if (cpus_updated) {
-		mutex_lock(&callback_mutex);
+		spin_lock_irqsave(&callback_lock, flags);
 		if (!on_dfl)
 			cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
 		cpumask_copy(top_cpuset.effective_cpus, &new_cpus);
-		mutex_unlock(&callback_mutex);
+		spin_unlock_irqrestore(&callback_lock, flags);
 		/* we don't mess with cpumasks of tasks in top_cpuset */
 	}
 
 	/* synchronize mems_allowed to N_MEMORY */
 	if (mems_updated) {
-		mutex_lock(&callback_mutex);
+		spin_lock_irqsave(&callback_lock, flags);
 		if (!on_dfl)
 			top_cpuset.mems_allowed = new_mems;
 		top_cpuset.effective_mems = new_mems;
-		mutex_unlock(&callback_mutex);
+		spin_unlock_irqrestore(&callback_lock, flags);
 		update_tasks_nodemask(&top_cpuset);
 	}
 
@@ -2346,11 +2360,13 @@ void __init cpuset_init_smp(void)
 
 void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 {
-	mutex_lock(&callback_mutex);
+	unsigned long flags;
+
+	spin_lock_irqsave(&callback_lock, flags);
 	rcu_read_lock();
 	guarantee_online_cpus(task_cs(tsk), pmask);
 	rcu_read_unlock();
-	mutex_unlock(&callback_mutex);
+	spin_unlock_irqrestore(&callback_lock, flags);
 }
 
 void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
@@ -2396,12 +2412,13 @@ void cpuset_init_current_mems_allowed(void)
 nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
 {
 	nodemask_t mask;
+	unsigned long flags;
 
-	mutex_lock(&callback_mutex);
+	spin_lock_irqsave(&callback_lock, flags);
 	rcu_read_lock();
 	guarantee_online_mems(task_cs(tsk), &mask);
 	rcu_read_unlock();
-	mutex_unlock(&callback_mutex);
+	spin_unlock_irqrestore(&callback_lock, flags);
 
 	return mask;
 }
@@ -2514,14 +2531,14 @@ int __cpuset_node_allowed_softwall(int node, gfp_t gfp_mask)
 		return 1;
 
 	/* Not hardwall and node outside mems_allowed: scan up cpusets */
-	mutex_lock(&callback_mutex);
+	spin_lock_irqsave(&callback_lock, flags);
 
 	rcu_read_lock();
 	cs = nearest_hardwall_ancestor(task_cs(current));
 	allowed = node_isset(node, cs->mems_allowed);
 	rcu_read_unlock();
 
-	mutex_unlock(&callback_mutex);
+	spin_unlock_irqrestore(&callback_lock, flags);
 	return allowed;
 }
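
(Note that __cpuset_node_allowed_softwall() would also need a local flags
variable for the irqsave variant - something like the following additional
hunk, illustrative only:)

@@ ... @@ int __cpuset_node_allowed_softwall(int node, gfp_t gfp_mask)
 {
 	struct cpuset *cs;		/* current cpuset ancestors */
 	int allowed;			/* is allocation in zone z allowed? */
+	unsigned long flags;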
 


^ permalink raw reply related	[flat|nested] 16+ messages in thread
