* [patch -mm v2] cpusets: add memory_slab_hardwall flag
@ 2009-03-10  2:22 David Rientjes
  2009-03-10  2:28 ` David Rientjes
  2009-03-10 20:59 ` Christoph Lameter
  0 siblings, 2 replies; 13+ messages in thread
From: David Rientjes @ 2009-03-10  2:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, Matt Mackall, Paul Menage,
	Randy Dunlap, KOSAKI Motohiro, linux-kernel

Adds a per-cpuset `memory_slab_hardwall' flag.

The slab allocator interface for determining whether an object is allowed
is

	int current_cpuset_object_allowed(int node, gfp_t flags)

This returns non-zero when the object is allowed, either because
current's cpuset does not have memory_slab_hardwall enabled or because
it allows allocation on the node.  Otherwise, it returns zero.

There are two motivations for requiring that objects originate from the
allocating task's set of allowable nodes: memory isolation between
disjoint cpusets, and NUMA optimizations for cpu affinity to the memory
the object is being allocated from.

This interface is lockless and very quick in the slab allocator fastpath
when not enabled because a new task flag, PF_SLAB_HARDWALL, is added to
determine whether or not its cpuset has mandated objects be allocated on
the set of allowed nodes.  If the option is not set for a task's cpuset
(or only a single cpuset exists), this reduces to only checking for a
specific bit in current->flags.
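
As an illustrative sketch only (not part of the patch below; the actual
helper and its call sites are in the diff), the check described above
amounts to roughly:

	/*
	 * Sketch: reduces to a single bit test on current->flags unless
	 * the task's cpuset has enabled memory_slab_hardwall.
	 */
	static inline int current_cpuset_object_allowed(int node, gfp_t flags)
	{
		if (!(current->flags & PF_SLAB_HARDWALL))
			return 1;
		return cpuset_node_allowed_hardwall(node, flags);
	}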

For slab, if the physical node id of the cpu cache is not from an
allowable node, the allocation will fail.  If an allocation is targeted
for a node that is not allowed, we allocate from an appropriate one
instead of failing.

For slob, if the page from the slob list is not from an allowable node,
we continue to scan for an appropriate slab.  If none can be used, a new
slab is allocated.

For slub, if the cpu slab is not from an allowable node, the partial list
is scanned for a replacement.  If none can be used, a new slab is
allocated.

Tasks that allocate objects from cpusets that do not have
memory_slab_hardwall set can still allocate from cpu slabs that were
allocated in a disjoint cpuset.

Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/cgroups/cpusets.txt |   54 ++++++++++++++++++++++++-------------
 include/linux/cpuset.h            |   11 +++++++
 include/linux/sched.h             |    1 +
 kernel/cpuset.c                   |   26 +++++++++++++++++
 mm/slab.c                         |    4 +++
 mm/slob.c                         |    6 +++-
 mm/slub.c                         |   12 +++++---
 7 files changed, 89 insertions(+), 25 deletions(-)

diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -14,20 +14,21 @@ CONTENTS:
 =========
 
 1. Cpusets
-  1.1 What are cpusets ?
-  1.2 Why are cpusets needed ?
-  1.3 How are cpusets implemented ?
-  1.4 What are exclusive cpusets ?
-  1.5 What is memory_pressure ?
-  1.6 What is memory spread ?
-  1.7 What is sched_load_balance ?
-  1.8 What is sched_relax_domain_level ?
-  1.9 How do I use cpusets ?
+  1.1  What are cpusets ?
+  1.2  Why are cpusets needed ?
+  1.3  How are cpusets implemented ?
+  1.4  What are exclusive cpusets ?
+  1.5  What is memory_pressure ?
+  1.6  What is memory spread ?
+  1.7  What is sched_load_balance ?
+  1.8  What is sched_relax_domain_level ?
+  1.9  What is memory_slab_hardwall ?
+  1.10 How do I use cpusets ?
 2. Usage Examples and Syntax
-  2.1 Basic Usage
-  2.2 Adding/removing cpus
-  2.3 Setting flags
-  2.4 Attaching processes
+  2.1  Basic Usage
+  2.2  Adding/removing cpus
+  2.3  Setting flags
+  2.4  Attaching processes
 3. Questions
 4. Contact
 
@@ -581,8 +582,22 @@ If your situation is:
 then increasing 'sched_relax_domain_level' would benefit you.
 
 
-1.9 How do I use cpusets ?
---------------------------
+1.9 What is memory_slab_hardwall ?
+----------------------------------
+
+A cpuset may require that slab object allocations all originate from
+its set of mems, either for memory isolation or NUMA optimizations.  Slab
+allocators normally optimize allocations in the fastpath by returning
+objects from a cpu slab.  These objects do not necessarily originate from
+slabs allocated on a cpuset's mems.
+
+When memory_slab_hardwall is set, all objects are allocated from slabs on
+the cpuset's set of mems.  This may incur a performance penalty if the
+cpu slab must be swapped for a different slab.
+
+
+1.10 How do I use cpusets ?
+---------------------------
 
 In order to minimize the impact of cpusets on critical kernel
 code, such as the scheduler, and due to the fact that the kernel
@@ -725,10 +740,11 @@ Now you want to do something with this cpuset.
 
 In this directory you can find several files:
 # ls
-cpu_exclusive  memory_migrate      mems                      tasks
-cpus           memory_pressure     notify_on_release
-mem_exclusive  memory_spread_page  sched_load_balance
-mem_hardwall   memory_spread_slab  sched_relax_domain_level
+cpu_exclusive		memory_pressure			notify_on_release
+cpus			memory_slab_hardwall		sched_load_balance
+mem_exclusive		memory_spread_page		sched_relax_domain_level
+mem_hardwall		memory_spread_slab		tasks
+memory_migrate		mems
 
 Reading them will give you information about the state of this cpuset:
 the CPUs and Memory Nodes it can use, the processes that are using
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -86,6 +86,12 @@ static inline int cpuset_do_slab_mem_spread(void)
 	return current->flags & PF_SPREAD_SLAB;
 }
 
+static inline int current_cpuset_object_allowed(int node, gfp_t flags)
+{
+	return !(current->flags & PF_SPREAD_SLAB) ||
+	       cpuset_node_allowed_hardwall(node, flags);
+}
+
 extern int current_cpuset_is_being_rebound(void);
 
 extern void rebuild_sched_domains(void);
@@ -174,6 +180,11 @@ static inline int cpuset_do_slab_mem_spread(void)
 	return 0;
 }
 
+static inline int current_cpuset_object_allowed(int node, gfp_t flags)
+{
+	return 1;
+}
+
 static inline int current_cpuset_is_being_rebound(void)
 {
 	return 0;
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1623,6 +1623,7 @@ extern cputime_t task_gtime(struct task_struct *p);
 #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
 #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
 #define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpu */
+#define PF_SLAB_HARDWALL 0x08000000	/* Allocate slab objects only in cpuset */
 #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
 #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezeable */
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -142,6 +142,7 @@ typedef enum {
 	CS_SCHED_LOAD_BALANCE,
 	CS_SPREAD_PAGE,
 	CS_SPREAD_SLAB,
+	CS_SLAB_HARDWALL,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -180,6 +181,11 @@ static inline int is_spread_slab(const struct cpuset *cs)
 	return test_bit(CS_SPREAD_SLAB, &cs->flags);
 }
 
+static inline int is_slab_hardwall(const struct cpuset *cs)
+{
+	return test_bit(CS_SLAB_HARDWALL, &cs->flags);
+}
+
 /*
  * Increment this integer everytime any cpuset changes its
  * mems_allowed value.  Users of cpusets can track this generation
@@ -400,6 +406,10 @@ void cpuset_update_task_memory_state(void)
 			tsk->flags |= PF_SPREAD_SLAB;
 		else
 			tsk->flags &= ~PF_SPREAD_SLAB;
+		if (is_slab_hardwall(cs))
+			tsk->flags |= PF_SLAB_HARDWALL;
+		else
+			tsk->flags &= ~PF_SLAB_HARDWALL;
 		task_unlock(tsk);
 		mutex_unlock(&callback_mutex);
 		mpol_rebind_task(tsk, &tsk->mems_allowed);
@@ -1417,6 +1427,7 @@ typedef enum {
 	FILE_MEMORY_PRESSURE,
 	FILE_SPREAD_PAGE,
 	FILE_SPREAD_SLAB,
+	FILE_SLAB_HARDWALL,
 } cpuset_filetype_t;
 
 static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
@@ -1458,6 +1469,10 @@ static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
 		retval = update_flag(CS_SPREAD_SLAB, cs, val);
 		cs->mems_generation = cpuset_mems_generation++;
 		break;
+	case FILE_SLAB_HARDWALL:
+		retval = update_flag(CS_SLAB_HARDWALL, cs, val);
+		cs->mems_generation = cpuset_mems_generation++;
+		break;
 	default:
 		retval = -EINVAL;
 		break;
@@ -1614,6 +1629,8 @@ static u64 cpuset_read_u64(struct cgroup *cont, struct cftype *cft)
 		return is_spread_page(cs);
 	case FILE_SPREAD_SLAB:
 		return is_spread_slab(cs);
+	case FILE_SLAB_HARDWALL:
+		return is_slab_hardwall(cs);
 	default:
 		BUG();
 	}
@@ -1721,6 +1738,13 @@ static struct cftype files[] = {
 		.write_u64 = cpuset_write_u64,
 		.private = FILE_SPREAD_SLAB,
 	},
+
+	{
+		.name = "memory_slab_hardwall",
+		.read_u64 = cpuset_read_u64,
+		.write_u64 = cpuset_write_u64,
+		.private = FILE_SLAB_HARDWALL,
+	},
 };
 
 static struct cftype cft_memory_pressure_enabled = {
@@ -1814,6 +1838,8 @@ static struct cgroup_subsys_state *cpuset_create(
 		set_bit(CS_SPREAD_PAGE, &cs->flags);
 	if (is_spread_slab(parent))
 		set_bit(CS_SPREAD_SLAB, &cs->flags);
+	if (is_slab_hardwall(parent))
+		set_bit(CS_SLAB_HARDWALL, &cs->flags);
 	set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
 	cpumask_clear(cs->cpus_allowed);
 	nodes_clear(cs->mems_allowed);
diff --git a/mm/slab.c b/mm/slab.c
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3124,6 +3124,8 @@ static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 	check_irq_off();
 
 	ac = cpu_cache_get(cachep);
+	if (!current_cpuset_object_allowed(numa_node_id(), flags))
+		return NULL;
 	if (likely(ac->avail)) {
 		STATS_INC_ALLOCHIT(cachep);
 		ac->touched = 1;
@@ -3249,6 +3251,8 @@ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
 	void *obj;
 	int x;
 
+	if (!current_cpuset_object_allowed(nodeid, flags))
+		nodeid = cpuset_mem_spread_node();
 	l3 = cachep->nodelists[nodeid];
 	BUG_ON(!l3);
 
diff --git a/mm/slob.c b/mm/slob.c
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -319,14 +319,18 @@ static void *slob_alloc(size_t size, gfp_t gfp, int align, int node)
 	spin_lock_irqsave(&slob_lock, flags);
 	/* Iterate through each partially free page, try to find room */
 	list_for_each_entry(sp, slob_list, list) {
+		int slab_node = page_to_nid(&sp->page);
+
 #ifdef CONFIG_NUMA
 		/*
 		 * If there's a node specification, search for a partial
 		 * page with a matching node id in the freelist.
 		 */
-		if (node != -1 && page_to_nid(&sp->page) != node)
+		if (node != -1 && slab_node != node)
 			continue;
 #endif
+		if (!current_cpuset_object_allowed(slab_node, gfp))
+			continue;
 		/* Enough room on this page? */
 		if (sp->units < SLOB_UNITS(size))
 			continue;
diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1353,6 +1353,8 @@ static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
 	struct page *page;
 	int searchnode = (node == -1) ? numa_node_id() : node;
 
+	if (!current_cpuset_object_allowed(node, flags))
+		searchnode = cpuset_mem_spread_node();
 	page = get_partial_node(get_node(s, searchnode));
 	if (page || (flags & __GFP_THISNODE))
 		return page;
@@ -1475,15 +1477,15 @@ static void flush_all(struct kmem_cache *s)
 
 /*
  * Check if the objects in a per cpu structure fit numa
- * locality expectations.
+ * locality expectations and is allowed in current's cpuset.
  */
-static inline int node_match(struct kmem_cache_cpu *c, int node)
+static inline int check_node(struct kmem_cache_cpu *c, int node, gfp_t flags)
 {
 #ifdef CONFIG_NUMA
 	if (node != -1 && c->node != node)
 		return 0;
 #endif
-	return 1;
+	return current_cpuset_object_allowed(node, flags);
 }
 
 /*
@@ -1517,7 +1519,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		goto new_slab;
 
 	slab_lock(c->page);
-	if (unlikely(!node_match(c, node)))
+	if (unlikely(!check_node(c, node, gfpflags)))
 		goto another_slab;
 
 	stat(c, ALLOC_REFILL);
@@ -1604,7 +1606,7 @@ static __always_inline void *slab_alloc(struct kmem_cache *s,
 	local_irq_save(flags);
 	c = get_cpu_slab(s, smp_processor_id());
 	objsize = c->objsize;
-	if (unlikely(!c->freelist || !node_match(c, node)))
+	if (unlikely(!c->freelist || !check_node(c, node, gfpflags)))
 
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 

* Re: [patch -mm v2] cpusets: add memory_slab_hardwall flag
  2009-03-10  2:22 [patch -mm v2] cpusets: add memory_slab_hardwall flag David Rientjes
@ 2009-03-10  2:28 ` David Rientjes
  2009-03-10 20:59 ` Christoph Lameter
  1 sibling, 0 replies; 13+ messages in thread
From: David Rientjes @ 2009-03-10  2:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, Matt Mackall, Paul Menage,
	Randy Dunlap, KOSAKI Motohiro, linux-kernel

On Mon, 9 Mar 2009, David Rientjes wrote:

> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -86,6 +86,12 @@ static inline int cpuset_do_slab_mem_spread(void)
>  	return current->flags & PF_SPREAD_SLAB;
>  }
>  
> +static inline int current_cpuset_object_allowed(int node, gfp_t flags)
> +{
> +	return !(current->flags & PF_SPREAD_SLAB) ||
> +	       cpuset_node_allowed_hardwall(node, flags);
> +}
> +
>  extern int current_cpuset_is_being_rebound(void);
>  
>  extern void rebuild_sched_domains(void);

That should be PF_SLAB_HARDWALL.
---
 include/linux/cpuset.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index d255e68..4db22d1 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -88,7 +88,7 @@ static inline int cpuset_do_slab_mem_spread(void)
 
 static inline int current_cpuset_object_allowed(int node, gfp_t flags)
 {
-	return !(current->flags & PF_SPREAD_SLAB) ||
+	return !(current->flags & PF_SLAB_HARDWALL) ||
 	       cpuset_node_allowed_hardwall(node, flags);
 }
 

* Re: [patch -mm v2] cpusets: add memory_slab_hardwall flag
  2009-03-10  2:22 [patch -mm v2] cpusets: add memory_slab_hardwall flag David Rientjes
  2009-03-10  2:28 ` David Rientjes
@ 2009-03-10 20:59 ` Christoph Lameter
  2009-03-10 21:24   ` David Rientjes
  1 sibling, 1 reply; 13+ messages in thread
From: Christoph Lameter @ 2009-03-10 20:59 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Pekka Enberg, Matt Mackall, Paul Menage,
	Randy Dunlap, KOSAKI Motohiro, linux-kernel

On Mon, 9 Mar 2009, David Rientjes wrote:

> This interface is lockless and very quick in the slab allocator fastpath
> when not enabled because a new task flag, PF_SLAB_HARDWALL, is added to
> determine whether or not its cpuset has mandated objects be allocated on
> the set of allowed nodes.  If the option is not set for a task's cpuset
> (or only a single cpuset exists), this reduces to only checking for a
> specific bit in current->flags.

We already have PF_SPREAD_PAGE, PF_SPREAD_SLAB, and PF_MEMPOLICY.
PF_MEMPOLICY in slab can have the same role as PF_SLAB_HARDWALL. It
attempts what you describe. On the one hand you duplicate functionality
that is already there and on the other you want to put code in the hot
paths that we have intentionally avoided for ages.

The description is not accurate. This feature is only useful if someone
comes up with a crummy cpuset definition in which a processor is a member
of multiple cpusets and thus the per cpu queues of multiple subsystems get
objects depending on which cpuset is active.

If a processor is only used from one cpuset (natural use) then these
problems do not occur.

There is still no use case for this on a NUMA platform. NUMA jobs that I
know about where people care about latencies have cpusets that do not
share processors and thus this problem does not occur.


* Re: [patch -mm v2] cpusets: add memory_slab_hardwall flag
  2009-03-10 20:59 ` Christoph Lameter
@ 2009-03-10 21:24   ` David Rientjes
  2009-03-12 15:47     ` Christoph Lameter
  0 siblings, 1 reply; 13+ messages in thread
From: David Rientjes @ 2009-03-10 21:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Pekka Enberg, Matt Mackall, Paul Menage,
	Randy Dunlap, KOSAKI Motohiro, linux-kernel

On Tue, 10 Mar 2009, Christoph Lameter wrote:

> We already have PF_SPREAD_PAGE, PF_SPREAD_SLAB, and PF_MEMPOLICY.
> PF_MEMPOLICY in slab can have the same role as PF_SLAB_HARDWALL. It
> attempts what you describe. On the one hand you duplicate functionality
> that is already there and on the other you want to put code in the hot
> paths that we have intentionally avoided for ages.
> 

For slab, PF_MEMPOLICY is a viable alternative for the functionality that
is being added with this patch.  The difference is that it is only
enforced when the allocating task is a member of that cpuset and does not
require the overhead of scanning the MPOL_BIND zonelist for every
allocation to determine whether a node is acceptable.

For slub, there is no current alternative to memory_slab_hardwall.

> The description is not accurate. This feature is only useful if someone
> comes up with a crummy cpuset definition in which a processor is a member
> of multiple cpusets and thus the per cpu queues of multiple subsystems get
> objects depending on which cpuset is active.
> 

Cpusets are hierarchical, so it is quite possible that a parent cpuset 
will include a group of cpus that has affinity to a specific group of 
mems.  This isolates that cpuset and all of its children for NUMA 
optimizations.  Within that, there can be several descendant cpusets that
include disjoint subsets of mems to isolate the memory that can be used 
for specific jobs.

* Re: [patch -mm v2] cpusets: add memory_slab_hardwall flag
  2009-03-10 21:24   ` David Rientjes
@ 2009-03-12 15:47     ` Christoph Lameter
  2009-03-12 18:43       ` David Rientjes
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Lameter @ 2009-03-12 15:47 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Pekka Enberg, Matt Mackall, Paul Menage,
	Randy Dunlap, KOSAKI Motohiro, linux-kernel

On Tue, 10 Mar 2009, David Rientjes wrote:

> > The description is not accurate. This feature is only useful if someone
> > comes up with a crummy cpuset definition in which a processor is a member
> > of multiple cpusets and thus the per cpu queues of multiple subsystems get
> > objects depending on which cpuset is active.
> >
>
> Cpusets are hierarchical, so it is quite possible that a parent cpuset
> will include a group of cpus that has affinity to a specific group of
> mems.  This isolates that cpuset and all of its children for NUMA
> optimizations.  Within that, there can be several descendant cpusets that
> include disjoint subsets of mems to isolate the memory that can be used
> for specific jobs.

Yes, cpusets are hierarchical for management purposes, but it is well known
that overlaying cpusets for running applications can cause issues with the
scheduler etc. Jobs run in the leaf, not in the higher levels that may
overlap.

* Re: [patch -mm v2] cpusets: add memory_slab_hardwall flag
  2009-03-12 15:47     ` Christoph Lameter
@ 2009-03-12 18:43       ` David Rientjes
  2009-03-12 19:12         ` Christoph Lameter
  0 siblings, 1 reply; 13+ messages in thread
From: David Rientjes @ 2009-03-12 18:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Pekka Enberg, Matt Mackall, Paul Menage,
	Randy Dunlap, KOSAKI Motohiro, linux-kernel

On Thu, 12 Mar 2009, Christoph Lameter wrote:

> > Cpusets are hierarchical, so it is quite possible that a parent cpuset
> > will include a group of cpus that has affinity to a specific group of
> > mems.  This isolates that cpuset and all of its children for NUMA
> > optimizations.  Within that, there can be several descendant cpusets that
> > include disjoint subsets of mems to isolate the memory that can be used
> > for specific jobs.
> 
> Yes, cpusets are hierarchical for management purposes, but it is well known
> that overlaying cpusets for running applications can cause issues with the
> scheduler etc. Jobs run in the leaf, not in the higher levels that may
> overlap.
> 

Yes, jobs are running in the leaf with my above example.  And it's quite 
possible that the higher level has segmented the machine for NUMA locality 
and then further divided that memory for individual jobs.  When a job 
completes or is killed, the slab cache that it has allocated can be freed 
in its entirety with no partial slab fragmentation (i.e. there are no 
objects allocated from its slabs for disjoint, still running jobs).  That 
cpuset may then serve another job.

* Re: [patch -mm v2] cpusets: add memory_slab_hardwall flag
  2009-03-12 18:43       ` David Rientjes
@ 2009-03-12 19:12         ` Christoph Lameter
  2009-03-12 19:32           ` David Rientjes
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Lameter @ 2009-03-12 19:12 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Pekka Enberg, Matt Mackall, Paul Menage,
	Randy Dunlap, KOSAKI Motohiro, linux-kernel

On Thu, 12 Mar 2009, David Rientjes wrote:

> Yes, jobs are running in the leaf with my above example.  And it's quite
> possible that the higher level has segmented the machine for NUMA locality
> and then further divided that memory for individual jobs.  When a job
> completes or is killed, the slab cache that it has allocated can be freed
> in its entirety with no partial slab fragmentation (i.e. there are no
> objects allocated from its slabs for disjoint, still running jobs).  That
> cpuset may then serve another job.

Looks like we are talking about a different project here. Partial slabs
are shared between all processors with SLUB. Slab shares the partial slabs
for the processors on the same node.



* Re: [patch -mm v2] cpusets: add memory_slab_hardwall flag
  2009-03-12 19:12         ` Christoph Lameter
@ 2009-03-12 19:32           ` David Rientjes
  2009-03-13 20:34             ` Christoph Lameter
  0 siblings, 1 reply; 13+ messages in thread
From: David Rientjes @ 2009-03-12 19:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Pekka Enberg, Matt Mackall, Paul Menage,
	Randy Dunlap, KOSAKI Motohiro, linux-kernel

On Thu, 12 Mar 2009, Christoph Lameter wrote:

> > Yes, jobs are running in the leaf with my above example.  And it's quite
> > possible that the higher level has segmented the machine for NUMA locality
> > and then further divided that memory for individual jobs.  When a job
> > completes or is killed, the slab cache that it has allocated can be freed
> > in its entirety with no partial slab fragmentation (i.e. there are no
> > objects allocated from its slabs for disjoint, still running jobs).  That
> > cpuset may then serve another job.
> 
> Looks like we are talking about a different project here. Partial slabs
> are shared between all processors with SLUB. Slab shares the partial slabs
> for the processors on the same node.
> 

If `memory_slab_hardwall' is set for a cpuset, its tasks will only pull a 
slab off the partial list that was allocated on an allowed node.  So in my 
earlier example which segments the machine via cpusets for NUMA locality 
and then divides those cpusets further for exclusive memory to provide to 
individual jobs, slab allocations will be constrained within the cpuset of 
the task that allocated them.  When a job dies, all slab allocations are 
freed so that no objects remain on the memory allowed to that job and, 
thus, no partial slabs remain (i.e. there were no object allocations on 
the job's slabs from disjoint cpusets because of the exclusivity).

* Re: [patch -mm v2] cpusets: add memory_slab_hardwall flag
  2009-03-12 19:32           ` David Rientjes
@ 2009-03-13 20:34             ` Christoph Lameter
  2009-03-13 23:28               ` David Rientjes
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Lameter @ 2009-03-13 20:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Pekka Enberg, Matt Mackall, Paul Menage,
	Randy Dunlap, KOSAKI Motohiro, linux-kernel

On Thu, 12 Mar 2009, David Rientjes wrote:

> If `memory_slab_hardwall' is set for a cpuset, its tasks will only pull a
> slab off the partial list that was allocated on an allowed node.  So in my
> earlier example which segments the machine via cpusets for NUMA locality
> and then divides those cpusets further for exclusive memory to provide to
> individual jobs, slab allocations will be constrained within the cpuset of
> the task that allocated them.  When a job dies, all slab allocations are
> freed so that no objects remain on the memory allowed to that job and,
> thus, no partial slabs remain (i.e. there were no object allocations on
> the job's slabs from disjoint cpusets because of the exclusivity).

In order to do that you would need to duplicate the partial lists for
each job and then guarantee that only the job uses objects from these
partial pages.

Usually some partially allocated pages are kept around and are
indiscriminately used by various system components of the OS due to other
processing. Creating partial slabs cannot be avoided like that.






* Re: [patch -mm v2] cpusets: add memory_slab_hardwall flag
  2009-03-13 20:34             ` Christoph Lameter
@ 2009-03-13 23:28               ` David Rientjes
  2009-03-16 16:41                 ` Christoph Lameter
  0 siblings, 1 reply; 13+ messages in thread
From: David Rientjes @ 2009-03-13 23:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Pekka Enberg, Matt Mackall, Paul Menage,
	Randy Dunlap, KOSAKI Motohiro, linux-kernel

On Fri, 13 Mar 2009, Christoph Lameter wrote:

> > If `memory_slab_hardwall' is set for a cpuset, its tasks will only pull a
> > slab off the partial list that was allocated on an allowed node.  So in my
> > earlier example which segments the machine via cpusets for NUMA locality
> > and then divides those cpusets further for exclusive memory to provide to
> > individual jobs, slab allocations will be constrained within the cpuset of
> > the task that allocated them.  When a job dies, all slab allocations are
> > freed so that no objects remain on the memory allowed to that job and,
> > thus, no partial slabs remain (i.e. there were no object allocations on
> > the job's slabs from disjoint cpusets because of the exclusivity).
> 
> In order to do that you would need to duplicate the partial lists for
> each job and then guarantee that only the job uses objects from these
> partial pages.
> 

Each job is running in its own cpuset with exclusive memory attached to it
in my example.  With memory_slab_hardwall, we simply don't refill the cpu 
slab with partial slabs that are on s->node[] for nodes that are not 
allowed.  When that job dies, all of its partial slabs will be freed if 
all cpusets within that parent cpuset are memory_slab_hardwall.

This isn't that uncommon a use case.  The machine is being partitioned via
cpusets by its cpus and mems, depending on their affinity to one another.
Within those cpusets, which already supply memory locality to
the cpus that the attached tasks are executing on, we can create 
descendant cpusets to provide memory exclusivity.

I think this is fairly straightforward in my implementation.

* Re: [patch -mm v2] cpusets: add memory_slab_hardwall flag
  2009-03-13 23:28               ` David Rientjes
@ 2009-03-16 16:41                 ` Christoph Lameter
  2009-03-16 22:17                   ` David Rientjes
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Lameter @ 2009-03-16 16:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Pekka Enberg, Matt Mackall, Paul Menage,
	Randy Dunlap, KOSAKI Motohiro, linux-kernel

If the nodes are exclusive to a load then the cpus attached to those nodes
are also exclusive? If so then there is no problem since the percpu queues
are only in use for a specific load with a consistent restriction on
cpusets and a consistent memory policy. Thus there is no need for
memory_slab_hardwall.

* Re: [patch -mm v2] cpusets: add memory_slab_hardwall flag
  2009-03-16 16:41                 ` Christoph Lameter
@ 2009-03-16 22:17                   ` David Rientjes
  2009-03-17 19:41                     ` [patch -mm v2] cpusets: add memory_slab_hardwall fla Christoph Lameter
  0 siblings, 1 reply; 13+ messages in thread
From: David Rientjes @ 2009-03-16 22:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Pekka Enberg, Matt Mackall, Paul Menage,
	Randy Dunlap, KOSAKI Motohiro, linux-kernel

On Mon, 16 Mar 2009, Christoph Lameter wrote:

> If the nodes are exclusive to a load then the cpus attached to those nodes
> are also exclusive?

No, they are not exclusive.

Here is my example (for the third time) if, for example, mems are grouped 
by the cpus for which they have affinity:

/dev/cpuset
	--> cpuset_A (cpus 0-1, mems 0-3)
	--> cpuset_B (cpus 2-3, mems 4-7)
	--> cpuset_C (cpus 4-5, mems 8-11)
	--> ...

Within that, we isolate mems for specific jobs:

/dev/cpuset
	--> cpuset_A (cpus 0-1, mems 0-3)
		--> job_1 (mem 0)
		--> job_2 (mem 1-2)
		--> job_3 (mem 3)
	--> ...

> If so then there is no problem since the percpu queues
> are only in use for a specific load with a consistent restriction on
> cpusets and a consistent memory policy. Thus there is no need for
> memory_slab_hardwall.
> 

All of those jobs may have different mempolicy requirements.  
Specifically, some cpusets may require slab hardwall behavior while 
others do not for true memory isolation or NUMA optimizations.

In other words, there is _no_ way with slub to isolate slab allocations 
for job_1 from job_2, job_3, etc.  That is what memory_slab_hardwall 
intends to address.
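
As a purely hypothetical usage sketch (the /dev/cpuset paths and the job_1
directory are taken from my example above, not from a real system), the
flag could be enabled for one job's cpuset with something like:

	/* Hypothetical: enable memory_slab_hardwall for job_1's cpuset. */
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		const char *path =
			"/dev/cpuset/cpuset_A/job_1/memory_slab_hardwall";
		int fd = open(path, O_WRONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		if (write(fd, "1", 1) != 1) {
			perror("write");
			close(fd);
			return 1;
		}
		close(fd);
		return 0;
	}

(The usual shell equivalent would simply echo 1 into that file.)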

* Re: [patch -mm v2] cpusets: add memory_slab_hardwall fla
  2009-03-16 22:17                   ` David Rientjes
@ 2009-03-17 19:41                     ` Christoph Lameter
  0 siblings, 0 replies; 13+ messages in thread
From: Christoph Lameter @ 2009-03-17 19:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Pekka Enberg, Matt Mackall, Paul Menage,
	Randy Dunlap, KOSAKI Motohiro, linux-kernel

On Mon, 16 Mar 2009, David Rientjes wrote:

> Here is my example (for the third time) if, for example, mems are grouped
> by the cpus for which they have affinity:

Well, I wish I had been sent the messages the first two times.

> /dev/cpuset
> 	--> cpuset_A (cpus 0-1, mems 0-3)
> 	--> cpuset_B (cpus 2-3, mems 4-7)
> 	--> cpuset_C (cpus 4-5, mems 8-11)
> 	--> ...

A cpu can only be assigned to a single numa node. cpuset_A/B/C have 2
cpus but 4 nodes. What is the mapping of nodes to cpus?

> Within that, we isolate mems for specific jobs:
>
> /dev/cpuset
> 	--> cpuset_A (cpus 0-1, mems 0-3)
> 		--> job_1 (mem 0)
> 		--> job_2 (mem 1-2)
> 		--> job_3 (mem 3)
> 	--> ...

Nobody would do that since usually multiple cpus are associated with a
single node. Memory and processors are assigned together in order to
reduce latencies. Randomly assigning processors to nodes will cause
latencies and significant traffic on the interconnect. This does not look
sane and would not be optimized in any way.

Are these memoryless nodes? What are you trying to accomplish?

