linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/5] cpuset memory spread basic implementation
@ 2006-02-04  7:19 Paul Jackson
  2006-02-04  7:19 ` [PATCH 2/5] cpuset memory spread page cache implementation and hooks Paul Jackson
                   ` (7 more replies)
  0 siblings, 8 replies; 102+ messages in thread
From: Paul Jackson @ 2006-02-04  7:19 UTC (permalink / raw)
  To: akpm; +Cc: dgc, steiner, Simon.Derr, ak, linux-kernel, Paul Jackson, clameter

From: Paul Jackson <pj@sgi.com>

This patch provides the implementation and cpuset interface for
an alternative memory allocation policy that can be applied to
certain kinds of memory allocations, such as the page cache (file
system buffers) and some slab caches (such as inode caches).

The policy is called "memory spreading."  If enabled, it
spreads out these kinds of memory allocations over all the
nodes allowed to a task, instead of preferring to place them
on the node where the task is executing.

All other kinds of allocations, including anonymous pages for
a task's stack and data regions, are not affected by this policy
choice, and continue to be allocated preferring the node local
to execution, as modified by the NUMA mempolicy.

A new per-cpuset file, "memory_spread", is defined.  This is
a boolean flag file, containing a "0" (off) or "1" (on).
By default it is off, and the kernel allocation placement
is unchanged.  If it is turned on for a given cpuset (write a
"1" to that cpusets memory_spread file) then the alternative
policy applies to all tasks in that cpuset.
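
For example, a job launcher could turn this on from user space with
something like the following sketch (assuming the cpuset file system
is mounted at /dev/cpuset and that a cpuset named "big_job" already
exists; both the mount point and the cpuset name are illustrative):

	#include <fcntl.h>
	#include <unistd.h>

	/* Write "1" to the cpuset's memory_spread flag file. */
	static int enable_memory_spread(const char *flag_file)
	{
		int fd = open(flag_file, O_WRONLY);

		if (fd < 0)
			return -1;
		if (write(fd, "1", 1) != 1) {
			close(fd);
			return -1;
		}
		return close(fd);
	}

	/* e.g. enable_memory_spread("/dev/cpuset/big_job/memory_spread"); */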

The implementation is simple.  Setting the cpuset flag
"memory_spread" turns on a per-process flag PF_MEM_SPREAD for
each task that is in that cpuset or subsequently joins that
cpuset.  In subsequent patches, the page allocation calls for
the affected page cache and slab caches are modified to perform
an inline check for this PF_MEM_SPREAD task flag, and if set,
a call to a new routine cpuset_mem_spread_node() returns the
node to prefer for the allocation.
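
In sketch form, an affected allocation site then looks something
like this (the real page cache and slab hooks are in patches 2/5
and 3/5; gfp_mask here just stands for the caller's gfp flags):

	if (cpuset_mem_spread_check()) {
		int nid = cpuset_mem_spread_node();	/* rotor over mems_allowed */

		return alloc_pages_node(nid, gfp_mask, 0);
	}
	return alloc_pages(gfp_mask, 0);	/* default: prefer local node */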

The cpuset_mem_spread_node() routine is also simple.  It uses
the value of a per-task rotor cpuset_mem_spread_rotor to select
the next node in the current task's mems_allowed to prefer for
the allocation.

This policy can provide substantial improvements for jobs that
need to place thread local data on the corresponding node, but
that also access large file system data sets that must be
spread across the several nodes in the job's cpuset in order
to fit.  Without this patch, especially for jobs that might
have one thread reading in the data set, the memory allocation
across the nodes in the job's cpuset can become very uneven.

A couple of Copyright year ranges are updated as well.  And a
couple of email addresses that can be found in the MAINTAINERS
file are removed.

Signed-off-by: Paul Jackson <pj@sgi.com>

---

 Documentation/cpusets.txt |   64 ++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/cpuset.h    |   18 ++++++++++++
 include/linux/sched.h     |    2 +
 kernel/cpuset.c           |   63 ++++++++++++++++++++++++++++++++++++++++-----
 4 files changed, 138 insertions(+), 9 deletions(-)

--- 2.6.16-rc1-mm5.orig/Documentation/cpusets.txt	2006-02-03 16:38:09.613742481 -0800
+++ 2.6.16-rc1-mm5/Documentation/cpusets.txt	2006-02-03 21:28:32.568213724 -0800
@@ -17,7 +17,8 @@ CONTENTS:
   1.4 What are exclusive cpusets ?
   1.5 What does notify_on_release do ?
   1.6 What is memory_pressure ?
-  1.7 How do I use cpusets ?
+  1.7 What is memory_spread ?
+  1.8 How do I use cpusets ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Adding/removing cpus
@@ -315,7 +316,66 @@ the tasks in the cpuset, in units of rec
 times 1000.
 
 
-1.7 How do I use cpusets ?
+1.7 What is memory_spread ?
+---------------------------
+If the per-cpuset boolean flag file 'memory_spread' is set,
+then the kernel will spread the file system buffers (page cache)
+evenly over all the nodes that the faulting task is allowed to
+use, instead of preferring to put those pages on the node where
+the task is running.  Some file system related slab caches,
+such as for inodes and dentries are also affected.  A tasks
+private (anonymous) data and stack regions are not affected.
+
+By default, memory_spread is off, and memory pages are allocated
+on the node local to where the task is running, except perhaps
+as modified by the tasks NUMA mempolicy or cpuset configuration,
+so long as sufficient free memory pages are available.
+
+When new cpusets are created, they inherit the memory_spread
+setting of their parent.
+
+Setting memory_spread causes allocations of the affected page
+and slab caches to ignore the tasks NUMA mempolicy and be spread
+instead.    Tasks using mbind() or set_mempolicy() calls to set
+NUMA mempolicies will not notice any change in these calls as a
+result of their containing tasks memory_spread setting.  If
+memory spreading is turned off, then the currently specified
+NUMA mempolicy once again applies to memory page allocations.
+
+A new per-cpuset file, 'memory_spread', is defined.  This is
+a boolean flag file, containing a "0" (off) or "1" (on).
+By default it is off, and the kernel allocation placement
+is unchanged.  If it is turned on for a given cpuset (write a
+"1" to that cpusets memory_spread file) then the alternative
+policy applies to all tasks in that cpuset.
+
+The implementation is simple.  Setting the cpuset flag
+'memory_spread' turns on a per-process flag PF_MEM_SPREAD
+for each task that is in that cpuset or subsequently joins
+that cpuset.  The page allocation calls for the affected page
+cache and slab caches are modified to perform an inline check
+for this PF_MEM_SPREAD task flag, and if set, a call to a new
+routine cpuset_mem_spread_node() returns the node to prefer
+for the allocation.
+
+The cpuset_mem_spread_node() routine is also simple.  It uses
+the value of a per-task rotor cpuset_mem_spread_rotor to select
+the next node in the current tasks mems_allowed to prefer for
+the allocation.
+
+This memory placement policy is also known (in other contexts)
+as round-robin or interleave.
+
+This policy can provide substantial improvements for jobs that
+need to place thread local data on the corresponding node, but
+that need to access large file system data sets that need to
+be spread across the several nodes in the jobs cpuset in order
+to fit.  Without this policy, especially for jobs that might
+have one thread reading in the data set, the memory allocation
+across the nodes in the jobs cpuset can become very uneven.
+
+
+1.8 How do I use cpusets ?
 --------------------------
 
 In order to minimize the impact of cpusets on critical kernel
--- 2.6.16-rc1-mm5.orig/include/linux/cpuset.h	2006-02-03 16:44:23.433333121 -0800
+++ 2.6.16-rc1-mm5/include/linux/cpuset.h	2006-02-03 21:34:56.388980845 -0800
@@ -4,7 +4,7 @@
  *  cpuset interface
  *
  *  Copyright (C) 2003 BULL SA
- *  Copyright (C) 2004 Silicon Graphics, Inc.
+ *  Copyright (C) 2004-2006 Silicon Graphics, Inc.
  *
  */
 
@@ -51,6 +51,12 @@ extern char *cpuset_task_status_allowed(
 extern void cpuset_lock(void);
 extern void cpuset_unlock(void);
 
+extern int cpuset_mem_spread_node(void);
+static inline int cpuset_mem_spread_check(void)
+{
+	return current->flags & PF_MEM_SPREAD;
+}
+
 #else /* !CONFIG_CPUSETS */
 
 static inline int cpuset_init_early(void) { return 0; }
@@ -99,6 +105,16 @@ static inline char *cpuset_task_status_a
 static inline void cpuset_lock(void) {}
 static inline void cpuset_unlock(void) {}
 
+static inline int cpuset_mem_spread_node(void)
+{
+	return 0;
+}
+
+static inline int cpuset_mem_spread_check(void)
+{
+	return 0;
+}
+
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
--- 2.6.16-rc1-mm5.orig/kernel/cpuset.c	2006-02-03 20:14:30.533135654 -0800
+++ 2.6.16-rc1-mm5/kernel/cpuset.c	2006-02-03 21:38:56.833115432 -0800
@@ -4,15 +4,14 @@
  *  Processor and Memory placement constraints for sets of tasks.
  *
  *  Copyright (C) 2003 BULL SA.
- *  Copyright (C) 2004 Silicon Graphics, Inc.
+ *  Copyright (C) 2004-2006 Silicon Graphics, Inc.
  *
  *  Portions derived from Patrick Mochel's sysfs code.
  *  sysfs is Copyright (c) 2001-3 Patrick Mochel
- *  Portions Copyright (c) 2004 Silicon Graphics, Inc.
  *
- *  2003-10-10 Written by Simon Derr <simon.derr@bull.net>
+ *  2003-10-10 Written by Simon Derr.
  *  2003-10-22 Updates by Stephen Hemminger.
- *  2004 May-July Rework by Paul Jackson <pj@sgi.com>
+ *  2004 May-July Rework by Paul Jackson.
  *
  *  This file is subject to the terms and conditions of the GNU General Public
  *  License.  See the file COPYING in the main directory of the Linux
@@ -108,7 +107,8 @@ typedef enum {
 	CS_MEM_EXCLUSIVE,
 	CS_MEMORY_MIGRATE,
 	CS_REMOVED,
-	CS_NOTIFY_ON_RELEASE
+	CS_NOTIFY_ON_RELEASE,
+	CS_MEM_SPREAD,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -137,6 +137,11 @@ static inline int is_memory_migrate(cons
 	return !!test_bit(CS_MEMORY_MIGRATE, &cs->flags);
 }
 
+static inline int is_mem_spread(const struct cpuset *cs)
+{
+	return !!test_bit(CS_MEM_SPREAD, &cs->flags);
+}
+
 /*
  * Increment this atomic integer everytime any cpuset changes its
  * mems_allowed value.  Users of cpusets can track this generation
@@ -657,6 +662,10 @@ void cpuset_update_task_memory_state(voi
 		cs = tsk->cpuset;	/* Maybe changed when task not locked */
 		guarantee_online_mems(cs, &tsk->mems_allowed);
 		tsk->cpuset_mems_generation = cs->mems_generation;
+		if (is_mem_spread(cs))
+			tsk->flags |= PF_MEM_SPREAD;
+		else
+			tsk->flags &= ~PF_MEM_SPREAD;
 		task_unlock(tsk);
 		mutex_unlock(&callback_mutex);
 		mpol_rebind_task(tsk, &tsk->mems_allowed);
@@ -957,7 +966,8 @@ static int update_memory_pressure_enable
 /*
  * update_flag - read a 0 or a 1 in a file and update associated flag
  * bit:	the bit to update (CS_CPU_EXCLUSIVE, CS_MEM_EXCLUSIVE,
- *				CS_NOTIFY_ON_RELEASE, CS_MEMORY_MIGRATE)
+ *				CS_NOTIFY_ON_RELEASE, CS_MEMORY_MIGRATE,
+ *				CS_MEM_SPREAD)
  * cs:	the cpuset to update
  * buf:	the buffer where we read the 0 or 1
  *
@@ -1188,6 +1198,7 @@ typedef enum {
 	FILE_NOTIFY_ON_RELEASE,
 	FILE_MEMORY_PRESSURE_ENABLED,
 	FILE_MEMORY_PRESSURE,
+	FILE_MEM_SPREAD,
 	FILE_TASKLIST,
 } cpuset_filetype_t;
 
@@ -1247,6 +1258,11 @@ static ssize_t cpuset_common_file_write(
 	case FILE_MEMORY_PRESSURE:
 		retval = -EACCES;
 		break;
+	case FILE_MEM_SPREAD:
+		retval = update_flag(CS_MEM_SPREAD, cs, buffer);
+		atomic_inc(&cpuset_mems_generation);
+		cs->mems_generation = atomic_read(&cpuset_mems_generation);
+		break;
 	case FILE_TASKLIST:
 		retval = attach_task(cs, buffer, &pathbuf);
 		break;
@@ -1356,6 +1372,9 @@ static ssize_t cpuset_common_file_read(s
 	case FILE_MEMORY_PRESSURE:
 		s += sprintf(s, "%d", fmeter_getrate(&cs->fmeter));
 		break;
+	case FILE_MEM_SPREAD:
+		*s++ = is_mem_spread(cs) ? '1' : '0';
+		break;
 	default:
 		retval = -EINVAL;
 		goto out;
@@ -1719,6 +1738,11 @@ static struct cftype cft_memory_pressure
 	.private = FILE_MEMORY_PRESSURE,
 };
 
+static struct cftype cft_mem_spread = {
+	.name = "memory_spread",
+	.private = FILE_MEM_SPREAD,
+};
+
 static int cpuset_populate_dir(struct dentry *cs_dentry)
 {
 	int err;
@@ -1737,6 +1761,8 @@ static int cpuset_populate_dir(struct de
 		return err;
 	if ((err = cpuset_add_file(cs_dentry, &cft_memory_pressure)) < 0)
 		return err;
+	if ((err = cpuset_add_file(cs_dentry, &cft_mem_spread)) < 0)
+		return err;
 	if ((err = cpuset_add_file(cs_dentry, &cft_tasks)) < 0)
 		return err;
 	return 0;
@@ -1765,6 +1791,8 @@ static long cpuset_create(struct cpuset 
 	cs->flags = 0;
 	if (notify_on_release(parent))
 		set_bit(CS_NOTIFY_ON_RELEASE, &cs->flags);
+	if (is_mem_spread(parent))
+		set_bit(CS_MEM_SPREAD, &cs->flags);
 	cs->cpus_allowed = CPU_MASK_NONE;
 	cs->mems_allowed = NODE_MASK_NONE;
 	atomic_set(&cs->count, 0);
@@ -2171,6 +2199,29 @@ void cpuset_unlock(void)
 }
 
 /**
+ * cpuset_mem_spread_node() - Decide which cpuset node gets this allocation.
+ *
+ * If a task is marked PF_MEM_SPREAD (which it will be if the task is
+ * in a cpuset for which is_mem_spread() is true), and if the memory
+ * allocation used cpuset_mem_spread_node() to determine on which node
+ * to start looking, as it will for certain page cache or slab cache
+ * pages such as used for file system buffers and inode caches, then
+ * instead of starting on the local node to look for a free page,
+ * rather spread the starting node around the tasks mems_allowed nodes.
+ */
+
+int cpuset_mem_spread_node(void)
+{
+	int node;
+
+	node = next_node(current->cpuset_mem_spread_rotor, current->mems_allowed);
+	if (node == MAX_NUMNODES)
+		node = first_node(current->mems_allowed);
+	current->cpuset_mem_spread_rotor = node;
+	return node;
+}
+
+/**
  * cpuset_excl_nodes_overlap - Do we overlap @p's mem_exclusive ancestors?
  * @p: pointer to task_struct of some other task.
  *
--- 2.6.16-rc1-mm5.orig/include/linux/sched.h	2006-02-03 20:14:45.524512883 -0800
+++ 2.6.16-rc1-mm5/include/linux/sched.h	2006-02-03 20:35:14.431690522 -0800
@@ -886,6 +886,7 @@ struct task_struct {
 	struct cpuset *cpuset;
 	nodemask_t mems_allowed;
 	int cpuset_mems_generation;
+	int cpuset_mem_spread_rotor;
 #endif
 	atomic_t fs_excl;	/* holding fs exclusive resources */
 	struct rcu_head rcu;
@@ -947,6 +948,7 @@ static inline void put_task_struct(struc
 #define PF_BORROWED_MM	0x00400000	/* I am a kthread doing use_mm */
 #define PF_RANDOMIZE	0x00800000	/* randomize virtual address space */
 #define PF_SWAPWRITE	0x01000000	/* Allowed to write to swap */
+#define PF_MEM_SPREAD	0x04000000	/* Spread some memory over cpuset */
 
 /*
  * Only the _current_ task can read/write to tsk->flags, but other

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* [PATCH 2/5] cpuset memory spread page cache implementation and hooks
  2006-02-04  7:19 [PATCH 1/5] cpuset memory spread basic implementation Paul Jackson
@ 2006-02-04  7:19 ` Paul Jackson
  2006-02-04 23:49   ` Andrew Morton
  2006-02-04  7:19 ` [PATCH 3/5] cpuset memory spread slab cache implementation Paul Jackson
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 102+ messages in thread
From: Paul Jackson @ 2006-02-04  7:19 UTC (permalink / raw)
  To: akpm; +Cc: steiner, dgc, Simon.Derr, ak, linux-kernel, Paul Jackson, clameter

From: Paul Jackson <pj@sgi.com>

Change the page cache allocation calls to support cpuset memory
spreading.

See the previous patch, cpuset_mem_spread, for an explanation
of cpuset memory spreading.

On systems without cpusets configured in the kernel, there is
no change.

On systems with cpusets configured in the kernel, but the
"memory_spread" cpuset option not enabled for the current tasks
cpuset, this adds one failed bit test of the processor state
flag PF_MEM_SPREAD.

For tasks in cpusets with "memory_spread" enabled, this adds
a call to a cpuset routine that computes which of the task's
mems_allowed nodes should be preferred for this allocation.

If memory spreading applies to a particular allocation, then
any other NUMA mempolicy does not apply.

Signed-off-by: Paul Jackson <pj@sgi.com>

---

 include/linux/pagemap.h |    9 +++++++++
 1 files changed, 9 insertions(+)

--- 2.6.16-rc1-mm3.orig/include/linux/pagemap.h	2006-01-31 08:55:27.238487776 -0800
+++ 2.6.16-rc1-mm3/include/linux/pagemap.h	2006-01-31 08:58:25.459267700 -0800
@@ -10,6 +10,7 @@
 #include <linux/highmem.h>
 #include <linux/compiler.h>
 #include <asm/uaccess.h>
+#include <linux/cpuset.h>
 #include <linux/gfp.h>
 
 /*
@@ -53,11 +54,19 @@ void release_pages(struct page **pages, 
 
 static inline struct page *page_cache_alloc(struct address_space *x)
 {
+	if (cpuset_mem_spread_check()) {
+		int n = cpuset_mem_spread_node();
+		return alloc_pages_node(n, mapping_gfp_mask(x), 0);
+	}
 	return alloc_pages(mapping_gfp_mask(x), 0);
 }
 
 static inline struct page *page_cache_alloc_cold(struct address_space *x)
 {
+	if (cpuset_mem_spread_check()) {
+		int n = cpuset_mem_spread_node();
+		return alloc_pages_node(n, mapping_gfp_mask(x)|__GFP_COLD, 0);
+	}
 	return alloc_pages(mapping_gfp_mask(x)|__GFP_COLD, 0);
 }
 

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* [PATCH 3/5] cpuset memory spread slab cache implementation
  2006-02-04  7:19 [PATCH 1/5] cpuset memory spread basic implementation Paul Jackson
  2006-02-04  7:19 ` [PATCH 2/5] cpuset memory spread page cache implementation and hooks Paul Jackson
@ 2006-02-04  7:19 ` Paul Jackson
  2006-02-04 23:49   ` Andrew Morton
  2006-02-04  7:19 ` [PATCH 4/5] cpuset memory spread slab cache optimizations Paul Jackson
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 102+ messages in thread
From: Paul Jackson @ 2006-02-04  7:19 UTC (permalink / raw)
  To: akpm; +Cc: dgc, steiner, Simon.Derr, ak, linux-kernel, Paul Jackson, clameter

From: Paul Jackson <pj@sgi.com>

Provide the slab cache infrastructure to support cpuset memory
spreading.

See the previous patches, cpuset_mem_spread, for an explanation
of cpuset memory spreading.

This patch provides a slab cache SLAB_MEM_SPREAD flag.  If set
in the kmem_cache_create() call defining a slab cache, then
any task marked with the process state flag PF_MEM_SPREAD will
spread memory page allocations for that cache over all the
allowed nodes, instead of preferring the local (faulting) node.
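
For example, a cache opting in to this policy would be created
roughly as follows (the cache name and object type are made up
here; the caches actually marked are in patch 5/5):

	example_cachep = kmem_cache_create("example_cache",
				sizeof(struct example_object),
				0,
				(SLAB_RECLAIM_ACCOUNT | SLAB_PANIC |
				SLAB_MEM_SPREAD),
				NULL, NULL);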

On systems not configured with CONFIG_NUMA, this results in no
change to the page allocation code path for slab caches.

On systems with cpusets configured in the kernel, but the
"memory_spread" cpuset option not enabled for the current tasks
cpuset, this adds one failed bit test of the processor state
flag PF_MEM_SPREAD on each page allocation for slab caches.

For tasks so marked, a second inline test is done for the
slab cache flag SLAB_MEM_SPREAD, and if that is set and if
the allocation is not in_interrupt(), this adds a call to a
cpuset routine that computes which of the task's mems_allowed
nodes should be preferred for this allocation.

==> This patch adds another hook into the performance critical
    code path for allocating objects from the slab cache, in the
    ____cache_alloc() chunk, below.  The next patch optimizes this
    hook, reducing the impact of the combined mempolicy plus memory
    spreading hooks on this critical code path to a single check
    against the task's task_struct flags word.

This patch provides the generic slab flags and logic needed to
apply memory spreading to a particular slab.

A subsequent patch will mark a few specific slab caches for this
placement policy.

Signed-off-by: Paul Jackson <pj@sgi.com>

---

 include/linux/slab.h |    1 +
 mm/slab.c            |   13 +++++++++++--
 2 files changed, 12 insertions(+), 2 deletions(-)

--- 2.6.16-rc1-mm5.orig/include/linux/slab.h	2006-02-03 22:17:31.772404695 -0800
+++ 2.6.16-rc1-mm5/include/linux/slab.h	2006-02-03 22:17:33.545862215 -0800
@@ -47,6 +47,7 @@ typedef struct kmem_cache kmem_cache_t;
 						   what is reclaimable later*/
 #define SLAB_PANIC		0x00040000UL	/* panic if kmem_cache_create() fails */
 #define SLAB_DESTROY_BY_RCU	0x00080000UL	/* defer freeing pages to RCU */
+#define SLAB_MEM_SPREAD		0x00100000UL	/* Spread some memory over cpuset */
 
 /* flags passed to a constructor func */
 #define	SLAB_CTOR_CONSTRUCTOR	0x001UL		/* if not set, then deconstructor */
--- 2.6.16-rc1-mm5.orig/mm/slab.c	2006-02-03 22:17:31.772404695 -0800
+++ 2.6.16-rc1-mm5/mm/slab.c	2006-02-03 22:17:33.549768509 -0800
@@ -94,6 +94,7 @@
 #include	<linux/interrupt.h>
 #include	<linux/init.h>
 #include	<linux/compiler.h>
+#include	<linux/cpuset.h>
 #include	<linux/seq_file.h>
 #include	<linux/notifier.h>
 #include	<linux/kallsyms.h>
@@ -173,12 +174,12 @@
 			 SLAB_NO_REAP | SLAB_CACHE_DMA | \
 			 SLAB_MUST_HWCACHE_ALIGN | SLAB_STORE_USER | \
 			 SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | \
-			 SLAB_DESTROY_BY_RCU)
+			 SLAB_DESTROY_BY_RCU | SLAB_MEM_SPREAD)
 #else
 # define CREATE_MASK	(SLAB_HWCACHE_ALIGN | SLAB_NO_REAP | \
 			 SLAB_CACHE_DMA | SLAB_MUST_HWCACHE_ALIGN | \
 			 SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | \
-			 SLAB_DESTROY_BY_RCU)
+			 SLAB_DESTROY_BY_RCU | SLAB_MEM_SPREAD)
 #endif
 
 /*
@@ -2708,6 +2709,14 @@ static inline void *____cache_alloc(stru
 		if (nid != numa_node_id())
 			return __cache_alloc_node(cachep, flags, nid);
 	}
+	if (unlikely(cpuset_mem_spread_check() &&
+					(cachep->flags & SLAB_MEM_SPREAD) &&
+					!in_interrupt())) {
+		int nid = cpuset_mem_spread_node();
+
+		if (nid != numa_node_id())
+			return __cache_alloc_node(cachep, flags, nid);
+	}
 #endif
 
 	check_irq_off();

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* [PATCH 4/5] cpuset memory spread slab cache optimizations
  2006-02-04  7:19 [PATCH 1/5] cpuset memory spread basic implementation Paul Jackson
  2006-02-04  7:19 ` [PATCH 2/5] cpuset memory spread page cache implementation and hooks Paul Jackson
  2006-02-04  7:19 ` [PATCH 3/5] cpuset memory spread slab cache implementation Paul Jackson
@ 2006-02-04  7:19 ` Paul Jackson
  2006-02-04 23:50   ` Andrew Morton
  2006-02-04 23:50   ` Andrew Morton
  2006-02-04  7:19 ` [PATCH 5/5] cpuset memory spread slab cache hooks Paul Jackson
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 102+ messages in thread
From: Paul Jackson @ 2006-02-04  7:19 UTC (permalink / raw)
  To: akpm; +Cc: steiner, dgc, Simon.Derr, ak, linux-kernel, Paul Jackson, clameter

From: Paul Jackson <pj@sgi.com>

The hooks in the slab cache allocator code path for support
of NUMA mempolicies and cpuset memory spreading are in an
important code path.  Many systems will use neither feature.

This patch optimizes those hooks down to a single check of
some bits in the current task's task_struct flags.  For non-NUMA
systems, this hook and related code are already ifdef'd
out.

The optimization is done by using another task flag, set if
the task is using a non-default NUMA mempolicy.  Taking this
flag bit along with the PF_MEM_SPREAD flag bit added earlier
in this 'cpuset memory spreading' patch set, one can check
for either of these special case memory placement
mechanisms with a single test of the current task's
task_struct flags.

This patch also tightens up the code, to save a few bytes
of kernel text space, and moves some of it out of line.
Due to the nested inlines called from multiple places,
we were ending up with three copies of this code, which
once we get off the main code path (for local node
allocation) seems a bit wasteful of instruction memory.

Signed-off-by: Paul Jackson <pj@sgi.com>

---

 include/linux/mempolicy.h |    5 +++++
 include/linux/sched.h     |    1 +
 kernel/fork.c             |    1 +
 mm/mempolicy.c            |   18 ++++++++++++++++++
 mm/slab.c                 |   37 +++++++++++++++++++++++--------------
 5 files changed, 48 insertions(+), 14 deletions(-)

--- 2.6.16-rc1-mm5.orig/include/linux/sched.h	2006-02-03 22:17:26.705941251 -0800
+++ 2.6.16-rc1-mm5/include/linux/sched.h	2006-02-03 22:17:43.293042557 -0800
@@ -949,6 +949,7 @@ static inline void put_task_struct(struc
 #define PF_RANDOMIZE	0x00800000	/* randomize virtual address space */
 #define PF_SWAPWRITE	0x01000000	/* Allowed to write to swap */
 #define PF_MEM_SPREAD	0x04000000	/* Spread some memory over cpuset */
+#define PF_MEMPOLICY	0x08000000	/* Non-default NUMA mempolicy */
 
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
--- 2.6.16-rc1-mm5.orig/kernel/fork.c	2006-02-03 22:17:26.707894398 -0800
+++ 2.6.16-rc1-mm5/kernel/fork.c	2006-02-03 22:17:43.304761439 -0800
@@ -1018,6 +1018,7 @@ static task_t *copy_process(unsigned lon
  		p->mempolicy = NULL;
  		goto bad_fork_cleanup_cpuset;
  	}
+	mpol_set_task_struct_flag(p);
 #endif
 
 #ifdef CONFIG_DEBUG_MUTEXES
--- 2.6.16-rc1-mm5.orig/mm/mempolicy.c	2006-02-03 22:17:26.707894398 -0800
+++ 2.6.16-rc1-mm5/mm/mempolicy.c	2006-02-03 22:17:43.323316336 -0800
@@ -423,6 +423,7 @@ long do_set_mempolicy(int mode, nodemask
 		return PTR_ERR(new);
 	mpol_free(current->mempolicy);
 	current->mempolicy = new;
+	mpol_set_task_struct_flag(current);
 	if (new && new->policy == MPOL_INTERLEAVE)
 		current->il_next = first_node(new->v.nodes);
 	return 0;
@@ -1666,6 +1667,23 @@ void mpol_rebind_mm(struct mm_struct *mm
 }
 
 /*
+ * Update task->flags PF_MEMPOLICY bit: set iff non-default mempolicy.
+ * Allows more rapid checking of this (combined perhaps with other
+ * PF_* flag bits) on memory allocation hot code paths.
+ *
+ * The task struct 'p' should either be current or a newly
+ * forked child that is not visible on the task list yet.
+ */
+
+void mpol_set_task_struct_flag(struct task_struct *p)
+{
+	if (p->mempolicy)
+		p->flags |= PF_MEMPOLICY;
+	else
+		p->flags &= ~PF_MEMPOLICY;
+}
+
+/*
  * Display pages allocated per node and memory policy via /proc.
  */
 
--- 2.6.16-rc1-mm5.orig/include/linux/mempolicy.h	2006-02-03 22:17:26.706917824 -0800
+++ 2.6.16-rc1-mm5/include/linux/mempolicy.h	2006-02-03 22:17:43.323316336 -0800
@@ -147,6 +147,7 @@ extern void mpol_rebind_policy(struct me
 extern void mpol_rebind_task(struct task_struct *tsk,
 					const nodemask_t *new);
 extern void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new);
+extern void mpol_set_task_struct_flag(struct task_struct *p);
 #define set_cpuset_being_rebound(x) (cpuset_being_rebound = (x))
 
 #ifdef CONFIG_CPUSET
@@ -248,6 +249,10 @@ static inline void mpol_rebind_mm(struct
 {
 }
 
+static inline void mpol_set_task_struct_flag(struct task_struct *p)
+{
+}
+
 #define set_cpuset_being_rebound(x) do {} while (0)
 
 static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
--- 2.6.16-rc1-mm5.orig/mm/slab.c	2006-02-03 22:17:33.549768509 -0800
+++ 2.6.16-rc1-mm5/mm/slab.c	2006-02-03 22:17:43.327222630 -0800
@@ -850,6 +850,7 @@ static struct array_cache *alloc_arrayca
 
 #ifdef CONFIG_NUMA
 static void *__cache_alloc_node(struct kmem_cache *, gfp_t, int);
+static void *alternate_node_alloc(struct kmem_cache *, gfp_t);
 
 static struct array_cache **alloc_alien_cache(int node, int limit)
 {
@@ -2703,20 +2704,9 @@ static inline void *____cache_alloc(stru
 	struct array_cache *ac;
 
 #ifdef CONFIG_NUMA
-	if (unlikely(current->mempolicy && !in_interrupt())) {
-		int nid = slab_node(current->mempolicy);
-
-		if (nid != numa_node_id())
-			return __cache_alloc_node(cachep, flags, nid);
-	}
-	if (unlikely(cpuset_mem_spread_check() &&
-					(cachep->flags & SLAB_MEM_SPREAD) &&
-					!in_interrupt())) {
-		int nid = cpuset_mem_spread_node();
-
-		if (nid != numa_node_id())
-			return __cache_alloc_node(cachep, flags, nid);
-	}
+	if (unlikely(current->flags & (PF_MEM_SPREAD|PF_MEMPOLICY)))
+		if ((objp = alternate_node_alloc(cachep, flags)) != NULL)
+			return objp;
 #endif
 
 	check_irq_off();
@@ -2751,6 +2741,25 @@ __cache_alloc(struct kmem_cache *cachep,
 
 #ifdef CONFIG_NUMA
 /*
+ * Try allocating on another node if PF_MEM_SPREAD or PF_MEMPOLICY.
+ */
+static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags)
+{
+	int nid_alloc, nid_here;
+
+	if (in_interrupt())
+		return NULL;
+	nid_alloc = nid_here = numa_node_id();
+	if (cpuset_mem_spread_check() && (cachep->flags & SLAB_MEM_SPREAD))
+		nid_alloc = cpuset_mem_spread_node();
+	else if (current->mempolicy)
+		nid_alloc = slab_node(current->mempolicy);
+	if (nid_alloc != nid_here)
+		return __cache_alloc_node(cachep, flags, nid_alloc);
+	return NULL;
+}
+
+/*
  * A interface to enable slab creation on nodeid
  */
 static void *__cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* [PATCH 5/5] cpuset memory spread slab cache hooks
  2006-02-04  7:19 [PATCH 1/5] cpuset memory spread basic implementation Paul Jackson
                   ` (2 preceding siblings ...)
  2006-02-04  7:19 ` [PATCH 4/5] cpuset memory spread slab cache optimizations Paul Jackson
@ 2006-02-04  7:19 ` Paul Jackson
  2006-02-06  4:37   ` Andrew Morton
  2006-02-04 23:49 ` [PATCH 1/5] cpuset memory spread basic implementation Andrew Morton
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 102+ messages in thread
From: Paul Jackson @ 2006-02-04  7:19 UTC (permalink / raw)
  To: akpm; +Cc: dgc, steiner, Simon.Derr, ak, linux-kernel, Paul Jackson, clameter

From: Paul Jackson <pj@sgi.com>

Change the kmem_cache_create calls for certain slab caches to support
cpuset memory spreading.

See the previous patches, cpuset_mem_spread, for an explanation of
cpuset memory spreading, and cpuset_mem_spread_slab_cache for the
slab cache support for memory spreading.

The slab caches marked for now are: dentry_cache, inode_cache,
and buffer_head.  This list may change over time.

Signed-off-by: Paul Jackson <pj@sgi.com>

---

 fs/buffer.c |    7 +++++--
 fs/dcache.c |    3 ++-
 fs/inode.c  |    9 +++++++--
 3 files changed, 14 insertions(+), 5 deletions(-)

--- 2.6.16-rc1-mm5.orig/fs/dcache.c	2006-02-03 20:14:45.616310776 -0800
+++ 2.6.16-rc1-mm5/fs/dcache.c	2006-02-03 21:56:36.605864543 -0800
@@ -1683,7 +1683,8 @@ static void __init dcache_init(unsigned 
 	dentry_cache = kmem_cache_create("dentry_cache",
 					 sizeof(struct dentry),
 					 0,
-					 SLAB_RECLAIM_ACCOUNT|SLAB_PANIC,
+					 (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
+					 SLAB_MEM_SPREAD),
 					 NULL, NULL);
 	
 	shrinker = set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);
--- 2.6.16-rc1-mm5.orig/fs/inode.c	2006-02-03 20:14:45.619240496 -0800
+++ 2.6.16-rc1-mm5/fs/inode.c	2006-02-03 21:56:36.606841116 -0800
@@ -1376,8 +1376,13 @@ void __init inode_init(unsigned long mem
 	struct shrinker *shrinker;
 
 	/* inode slab cache */
-	inode_cachep = kmem_cache_create("inode_cache", sizeof(struct inode),
-				0, SLAB_RECLAIM_ACCOUNT|SLAB_PANIC, init_once, NULL);
+	inode_cachep = kmem_cache_create("inode_cache",
+					 sizeof(struct inode),
+					 0,
+					 (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
+					 SLAB_MEM_SPREAD),
+					 init_once,
+					 NULL);
 	shrinker = set_shrinker(DEFAULT_SEEKS, shrink_icache_memory);
 	kmem_set_shrinker(inode_cachep, shrinker);
 
--- 2.6.16-rc1-mm5.orig/fs/buffer.c	2006-02-03 20:14:32.642534053 -0800
+++ 2.6.16-rc1-mm5/fs/buffer.c	2006-02-03 21:56:36.608794263 -0800
@@ -3203,8 +3203,11 @@ void __init buffer_init(void)
 	int nrpages;
 
 	bh_cachep = kmem_cache_create("buffer_head",
-			sizeof(struct buffer_head), 0,
-			SLAB_RECLAIM_ACCOUNT|SLAB_PANIC, init_buffer_head, NULL);
+					sizeof(struct buffer_head), 0,
+					(SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
+					SLAB_MEM_SPREAD),
+					init_buffer_head,
+					NULL);
 
 	/*
 	 * Limit the bh occupancy to 10% of ZONE_NORMAL

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-04  7:19 [PATCH 1/5] cpuset memory spread basic implementation Paul Jackson
                   ` (3 preceding siblings ...)
  2006-02-04  7:19 ` [PATCH 5/5] cpuset memory spread slab cache hooks Paul Jackson
@ 2006-02-04 23:49 ` Andrew Morton
  2006-02-05  3:35   ` Christoph Lameter
  2006-02-06  4:33   ` Andrew Morton
  2006-02-04 23:50 ` Andrew Morton
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 102+ messages in thread
From: Andrew Morton @ 2006-02-04 23:49 UTC (permalink / raw)
  To: Paul Jackson; +Cc: dgc, steiner, Simon.Derr, ak, linux-kernel, pj, clameter

Paul Jackson <pj@sgi.com> wrote:
>
> From: Paul Jackson <pj@sgi.com>
> 
> This patch provides the implementation and cpuset interface for
> an alternative memory allocation policy that can be applied to
> certain kinds of memory allocations, such as the page cache (file
> system buffers) and some slab caches (such as inode caches).
> 
> ...
>
> A new per-cpuset file, "memory_spread", is defined.  This is
> a boolean flag file, containing a "0" (off) or "1" (on).
> By default it is off, and the kernel allocation placement
> is unchanged.  If it is turned on for a given cpuset (write a
> "1" to that cpusets memory_spread file) then the alternative
> policy applies to all tasks in that cpuset.

I'd have thought it would be saner to split these things apart:
"slab_spread", "pagecache_spread", etc.

> +static inline int cpuset_mem_spread_check(void)
> +{
> +	return current->flags & PF_MEM_SPREAD;
> +}

That's not a terribly assertive name.  cpuset_mem_spread_needed()?

> +		if (is_mem_spread(cs))
> +			tsk->flags |= PF_MEM_SPREAD;
> +		else
> +			tsk->flags &= ~PF_MEM_SPREAD;
>  		task_unlock(tsk);

OT: do we ever set PF_foo on a task other than `current'?  I have a feeling
that we do...

> +	case FILE_MEM_SPREAD:
> +		retval = update_flag(CS_MEM_SPREAD, cs, buffer);
> +		atomic_inc(&cpuset_mems_generation);
> +		cs->mems_generation = atomic_read(&cpuset_mems_generation);

atomic_inc_return()
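
i.e. fold the increment and the read into a single op, something
like:

	cs->mems_generation = atomic_inc_return(&cpuset_mems_generation);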

> +int cpuset_mem_spread_node(void)
> +{
> +	int node;
> +
> +	node = next_node(current->cpuset_mem_spread_rotor, current->mems_allowed);
> +	if (node == MAX_NUMNODES)
> +		node = first_node(current->mems_allowed);
> +	current->cpuset_mem_spread_rotor = node;
> +	return node;
> +}

hm.  What guarantees that a node which is in current->mems_allowed is still
online?



* Re: [PATCH 2/5] cpuset memory spread page cache implementation and hooks
  2006-02-04  7:19 ` [PATCH 2/5] cpuset memory spread page cache implementation and hooks Paul Jackson
@ 2006-02-04 23:49   ` Andrew Morton
  2006-02-05  1:42     ` Paul Jackson
  0 siblings, 1 reply; 102+ messages in thread
From: Andrew Morton @ 2006-02-04 23:49 UTC (permalink / raw)
  To: Paul Jackson; +Cc: steiner, dgc, Simon.Derr, ak, linux-kernel, pj, clameter

Paul Jackson <pj@sgi.com> wrote:
>
>   static inline struct page *page_cache_alloc(struct address_space *x)
>   {
>  +	if (cpuset_mem_spread_check()) {
>  +		int n = cpuset_mem_spread_node();
>  +		return alloc_pages_node(n, mapping_gfp_mask(x), 0);
>  +	}
>   	return alloc_pages(mapping_gfp_mask(x), 0);
>   }
>   
>   static inline struct page *page_cache_alloc_cold(struct address_space *x)
>   {
>  +	if (cpuset_mem_spread_check()) {
>  +		int n = cpuset_mem_spread_node();
>  +		return alloc_pages_node(n, mapping_gfp_mask(x)|__GFP_COLD, 0);
>  +	}
>   	return alloc_pages(mapping_gfp_mask(x)|__GFP_COLD, 0);
>   }

This is starting to get a bit bloaty.  Might be worth thinking about
uninlining these for certain Kconfig combinations.



* Re: [PATCH 3/5] cpuset memory spread slab cache implementation
  2006-02-04  7:19 ` [PATCH 3/5] cpuset memory spread slab cache implementation Paul Jackson
@ 2006-02-04 23:49   ` Andrew Morton
  2006-02-05  3:37     ` Christoph Lameter
  0 siblings, 1 reply; 102+ messages in thread
From: Andrew Morton @ 2006-02-04 23:49 UTC (permalink / raw)
  To: Paul Jackson; +Cc: dgc, steiner, Simon.Derr, ak, linux-kernel, pj, clameter

Paul Jackson <pj@sgi.com> wrote:
>
> +	if (unlikely(cpuset_mem_spread_check() &&
>  +					(cachep->flags & SLAB_MEM_SPREAD) &&
>  +					!in_interrupt())) {
>  +		int nid = cpuset_mem_spread_node();
>  +
>  +		if (nid != numa_node_id())
>  +			return __cache_alloc_node(cachep, flags, nid);
>  +	}

Need a comment here explaining the mysterious !in_interrupt() check.


* Re: [PATCH 4/5] cpuset memory spread slab cache optimizations
  2006-02-04  7:19 ` [PATCH 4/5] cpuset memory spread slab cache optimizations Paul Jackson
@ 2006-02-04 23:50   ` Andrew Morton
  2006-02-05  3:18     ` Paul Jackson
  2006-02-04 23:50   ` Andrew Morton
  1 sibling, 1 reply; 102+ messages in thread
From: Andrew Morton @ 2006-02-04 23:50 UTC (permalink / raw)
  To: Paul Jackson; +Cc: steiner, dgc, Simon.Derr, ak, linux-kernel, pj, clameter

Paul Jackson <pj@sgi.com> wrote:
>
> +void mpol_set_task_struct_flag(struct task_struct *p)
>  +{
>  +	if (p->mempolicy)
>  +		p->flags |= PF_MEMPOLICY;
>  +	else
>  +		p->flags &= ~PF_MEMPOLICY;
>  +}

As mentioned before, if we ever modify tsk->flags, where tsk != current, we
have a nasty race.  So this function's interface really does invite that
race and hence is not very good.

As we do seem to be only calling it for current or for a newly-created task
I guess the access is OK, so perhaps a weaselly comment would cover that
worry.



* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-04  7:19 [PATCH 1/5] cpuset memory spread basic implementation Paul Jackson
                   ` (4 preceding siblings ...)
  2006-02-04 23:49 ` [PATCH 1/5] cpuset memory spread basic implementation Andrew Morton
@ 2006-02-04 23:50 ` Andrew Morton
  2006-02-04 23:57   ` David S. Miller
  2006-02-06  4:37 ` Andrew Morton
  2006-02-06  9:18 ` Simon Derr
  7 siblings, 1 reply; 102+ messages in thread
From: Andrew Morton @ 2006-02-04 23:50 UTC (permalink / raw)
  To: Paul Jackson; +Cc: dgc, steiner, Simon.Derr, ak, linux-kernel, pj, clameter

Paul Jackson <pj@sgi.com> wrote:
>
>  +static inline int is_mem_spread(const struct cpuset *cs)
>  +{
>  +	return !!test_bit(CS_MEM_SPREAD, &cs->flags);
>  +}

The !!  doesn't seem needed.  The name of this function implies that it
returns a boolean, not a scalar.



* Re: [PATCH 4/5] cpuset memory spread slab cache optimizations
  2006-02-04  7:19 ` [PATCH 4/5] cpuset memory spread slab cache optimizations Paul Jackson
  2006-02-04 23:50   ` Andrew Morton
@ 2006-02-04 23:50   ` Andrew Morton
  2006-02-05  4:10     ` Paul Jackson
  1 sibling, 1 reply; 102+ messages in thread
From: Andrew Morton @ 2006-02-04 23:50 UTC (permalink / raw)
  To: Paul Jackson; +Cc: steiner, dgc, Simon.Derr, ak, linux-kernel, pj, clameter

Paul Jackson <pj@sgi.com> wrote:
>
>  @@ -2703,20 +2704,9 @@ static inline void *____cache_alloc(stru
>   	struct array_cache *ac;
>   
>   #ifdef CONFIG_NUMA
>  -	if (unlikely(current->mempolicy && !in_interrupt())) {
>  -		int nid = slab_node(current->mempolicy);
>  -
>  -		if (nid != numa_node_id())
>  -			return __cache_alloc_node(cachep, flags, nid);
>  -	}
>  -	if (unlikely(cpuset_mem_spread_check() &&
>  -					(cachep->flags & SLAB_MEM_SPREAD) &&
>  -					!in_interrupt())) {
>  -		int nid = cpuset_mem_spread_node();
>  -
>  -		if (nid != numa_node_id())
>  -			return __cache_alloc_node(cachep, flags, nid);
>  -	}
>  +	if (unlikely(current->flags & (PF_MEM_SPREAD|PF_MEMPOLICY)))
>  +		if ((objp = alternate_node_alloc(cachep, flags)) != NULL)
>  +			return objp;
>   #endif
>   
>   	check_irq_off();
>  @@ -2751,6 +2741,25 @@ __cache_alloc(struct kmem_cache *cachep,
>   
>   #ifdef CONFIG_NUMA
>   /*
>  + * Try allocating on another node if PF_MEM_SPREAD or PF_MEMPOLICY.
>  + */
>  +static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags)
>  +{
>  +	int nid_alloc, nid_here;
>  +
>  +	if (in_interrupt())
>  +		return NULL;
>  +	nid_alloc = nid_here = numa_node_id();
>  +	if (cpuset_mem_spread_check() && (cachep->flags & SLAB_MEM_SPREAD))
>  +		nid_alloc = cpuset_mem_spread_node();
>  +	else if (current->mempolicy)
>  +		nid_alloc = slab_node(current->mempolicy);
>  +	if (nid_alloc != nid_here)
>  +		return __cache_alloc_node(cachep, flags, nid_alloc);
>  +	return NULL;
>  +}
>  +

Why not move the PF_MEM_SPREAD|PF_MEMPOLICY test into
alternate_node_alloc(), inline the whole thing and nuke the #ifdef in
__cache_alloc()?

We're adding even more goop into the NUMA __cache_alloc() fastpath.  This is bad.


* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-04 23:50 ` Andrew Morton
@ 2006-02-04 23:57   ` David S. Miller
  0 siblings, 0 replies; 102+ messages in thread
From: David S. Miller @ 2006-02-04 23:57 UTC (permalink / raw)
  To: akpm; +Cc: pj, dgc, steiner, Simon.Derr, ak, linux-kernel, clameter

From: Andrew Morton <akpm@osdl.org>
Date: Sat, 4 Feb 2006 15:50:27 -0800

> The !!  doesn't seem needed.  The name of this function implies that it
> returns a boolean, not a scalar.

As a historical note it used to be a common implementation error to
return "flag & bit" from this function instead of the correct
"(flag & bit) != 0".



* Re: [PATCH 2/5] cpuset memory spread page cache implementation and hooks
  2006-02-04 23:49   ` Andrew Morton
@ 2006-02-05  1:42     ` Paul Jackson
  2006-02-05  1:54       ` Andrew Morton
  0 siblings, 1 reply; 102+ messages in thread
From: Paul Jackson @ 2006-02-05  1:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: steiner, dgc, Simon.Derr, ak, linux-kernel, clameter

Andrew, responding to pj:
> >   static inline struct page *page_cache_alloc_cold(struct address_space *x)
> >   {
> >  +	if (cpuset_mem_spread_check()) {
> >  +		int n = cpuset_mem_spread_node();
> >  +		return alloc_pages_node(n, mapping_gfp_mask(x)|__GFP_COLD, 0);
> >  +	}
> >   	return alloc_pages(mapping_gfp_mask(x)|__GFP_COLD, 0);
> >   }
> 
> This is starting to get a bit bloaty.  Might be worth thinking about
> uninlining these for certain Kconfig combinations.

Good point.

I can easily imagine doing something like the following, to move some
of the logic out of line, rather in the same manner as I did the slab
cache hooks, in "[PATCH 4/5] cpuset memory spread slab cache
optimizations"

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
static struct page *page_cache_alloc_mem_spread_cold(struct address_space *x)
{
	int n = cpuset_mem_spread_node();
	return alloc_pages_node(n, mapping_gfp_mask(x)|__GFP_COLD, 0);
}

static inline struct page *page_cache_alloc_cold(struct address_space *x)
{
	if (cpuset_mem_spread_check())
		return page_cache_alloc_mem_spread_cold(x);
	return alloc_pages(mapping_gfp_mask(x)|__GFP_COLD, 0);
}
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

But I am not sure what you mean by "uninline for certain Kconfig
combinations."  Do you mean uninline these two page_cache_alloc*()
routines, for all configs that enable CONFIG_CPUSET?

    The configs w/o CONFIG_CPUSET have "cpuset_mem_spread_check()"
    defined as a constant 0, so for them, this bloat will disappear,
    so they would not gain any bloat reduction by uninlining these
    page_cache_alloc*() routines, in any case.

    The configs with CONFIG_CPUSET might include future major
    desktop PC distros, which might not want these page_cache_alloc*()
    routines uninlined (though I am sure they would like them to be
    non-bloaty.)

Tell me more.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH 2/5] cpuset memory spread page cache implementation and hooks
  2006-02-05  1:42     ` Paul Jackson
@ 2006-02-05  1:54       ` Andrew Morton
  2006-02-05  3:28         ` Christoph Lameter
  0 siblings, 1 reply; 102+ messages in thread
From: Andrew Morton @ 2006-02-05  1:54 UTC (permalink / raw)
  To: Paul Jackson; +Cc: steiner, dgc, Simon.Derr, ak, linux-kernel, clameter

Paul Jackson <pj@sgi.com> wrote:
>
>  But I am not sure what you mean by "uninline for certain Kconfig
>  combinations."

This function:

static inline struct page *page_cache_alloc(struct address_space *x)
{
	if (cpuset_mem_spread_check()) {
		int n = cpuset_mem_spread_node();
		return alloc_pages_node(n, mapping_gfp_mask(x), 0);
	}
	return alloc_pages(mapping_gfp_mask(x), 0);
}

Really has two forms, depending upon Kconfig.

1:

static inline struct page *page_cache_alloc(struct address_space *x)
{
	return alloc_pages(mapping_gfp_mask(x), 0);
}

That should be inlined.

2:

static inline struct page *page_cache_alloc(struct address_space *x)
{
	if (cpuset_mem_spread_check()) {
		int n = cpuset_mem_spread_node();
		return alloc_pages_node(n, mapping_gfp_mask(x), 0);
	}
	return alloc_pages(mapping_gfp_mask(x), 0);
}

That shouldn't be inlined.

That's all.   One would have to fiddle a bit, work out how many callsites
there are, gauge the impact on text size, etc.  page_cache_alloc() seems
to have a single callsite, and page_cache_alloc_cold() four, so it's
a quite minor issue.


* Re: [PATCH 4/5] cpuset memory spread slab cache optimizations
  2006-02-04 23:50   ` Andrew Morton
@ 2006-02-05  3:18     ` Paul Jackson
  0 siblings, 0 replies; 102+ messages in thread
From: Paul Jackson @ 2006-02-05  3:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: steiner, dgc, Simon.Derr, ak, linux-kernel, clameter

Andrew wrote:
> perhaps a weaselly comment would cover that worry.

Well ... the comment was there, but the problem with comments
is no one reads them ;)

+ * The task struct 'p' should either be current or a newly
+ * forked child that is not visible on the task list yet.
+ */
+
+void mpol_set_task_struct_flag(struct task_struct *p)


> this function's interface really does invite that
> race and hence is not very good

Agreed.  I'm still scratching my head coming up with a better way.

Hmmm ... except for the call from fork, all calls to this are from
within mm/mempolicy.c.  I could make the routine within mempolicy.c
static, and provide an exported wrapper with a name like:

	mpol_fix_fork_child_flag()

that wrapped it.  With a name like that, there seems less risk of
abusing this.
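
Roughly (a sketch only, not compiled):

	/* mm/mempolicy.c */
	static void mpol_set_task_struct_flag(struct task_struct *p)
	{
		if (p->mempolicy)
			p->flags |= PF_MEMPOLICY;
		else
			p->flags &= ~PF_MEMPOLICY;
	}

	/* exported wrapper - only for a newly forked child that is
	 * not yet visible on the task list */
	void mpol_fix_fork_child_flag(struct task_struct *child)
	{
		mpol_set_task_struct_flag(child);
	}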

Any other suggestions?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH 2/5] cpuset memory spread page cache implementation and hooks
  2006-02-05  1:54       ` Andrew Morton
@ 2006-02-05  3:28         ` Christoph Lameter
  2006-02-05  5:06           ` Andrew Morton
  0 siblings, 1 reply; 102+ messages in thread
From: Christoph Lameter @ 2006-02-05  3:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Paul Jackson, steiner, dgc, Simon.Derr, ak, linux-kernel

Hmm... Make this


static inline struct page *page_cache_alloc(struct address_space *x)
{
#ifdef CONFIG_NUMA
 	if (cpuset_mem_spread_check()) {
 		int n = cpuset_mem_spread_node();
 		return alloc_pages_node(n, mapping_gfp_mask(x), 0);
 	}
#endif
 	return alloc_pages(mapping_gfp_mask(x), 0);
}


* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-04 23:49 ` [PATCH 1/5] cpuset memory spread basic implementation Andrew Morton
@ 2006-02-05  3:35   ` Christoph Lameter
  2006-02-06  4:33   ` Andrew Morton
  1 sibling, 0 replies; 102+ messages in thread
From: Christoph Lameter @ 2006-02-05  3:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Paul Jackson, dgc, steiner, Simon.Derr, ak, linux-kernel

On Sat, 4 Feb 2006, Andrew Morton wrote:

> > +int cpuset_mem_spread_node(void)
> > +{
> > +	int node;
> > +
> > +	node = next_node(current->cpuset_mem_spread_rotor, current->mems_allowed);
> > +	if (node == MAX_NUMNODES)
> > +		node = first_node(current->mems_allowed);
> > +	current->cpuset_mem_spread_rotor = node;
> > +	return node;
> > +}
> 
> hm.  What guarantees that a node which is in current->mems_allowed is still
> online?

If a node is not online then the slab allocator will fall back to the 
local node. See kmem_cache_alloc_node.

The page allocator will refer to the zonelist of the node that is offline. 
Hmm... Isn't current->mems_allowed restricted by the available nodes?





* Re: [PATCH 3/5] cpuset memory spread slab cache implementation
  2006-02-04 23:49   ` Andrew Morton
@ 2006-02-05  3:37     ` Christoph Lameter
  0 siblings, 0 replies; 102+ messages in thread
From: Christoph Lameter @ 2006-02-05  3:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Paul Jackson, dgc, steiner, Simon.Derr, ak, linux-kernel

On Sat, 4 Feb 2006, Andrew Morton wrote:

> Paul Jackson <pj@sgi.com> wrote:
> >
> > +	if (unlikely(cpuset_mem_spread_check() &&
> >  +					(cachep->flags & SLAB_MEM_SPREAD) &&
> >  +					!in_interrupt())) {
> >  +		int nid = cpuset_mem_spread_node();
> >  +
> >  +		if (nid != numa_node_id())
> >  +			return __cache_alloc_node(cachep, flags, nid);
> >  +	}
> 
> Need a comment here explaining the mysterious !in_interrupt() check.

If we are in interrupt context then the current pointer may not be
meaningful, so cpuset settings should not be applied to memory
allocations made in interrupt context.
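
Something along these lines as the comment, then:

	/*
	 * Skip the cpuset spread check when in_interrupt(): 'current'
	 * is whatever task happened to be interrupted, so its cpuset
	 * placement settings are not meaningful for this allocation.
	 */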


* Re: [PATCH 4/5] cpuset memory spread slab cache optimizations
  2006-02-04 23:50   ` Andrew Morton
@ 2006-02-05  4:10     ` Paul Jackson
  0 siblings, 0 replies; 102+ messages in thread
From: Paul Jackson @ 2006-02-05  4:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: steiner, dgc, Simon.Derr, ak, linux-kernel, clameter

> We're adding even more goop into the NUMA __cache_alloc() fastpath.  This bad.

Huh?  I'm only adding more goop (beyond a single inline bit test
of current->flags) in the:
	NUMA and (MEMPOLICY or MEM_SPREAD)
path.

>  @@ -2703,20 +2704,9 @@ static inline void *____cache_alloc(stru
> ...
>  +	if (unlikely(current->flags & (PF_MEM_SPREAD|PF_MEMPOLICY)))
>  +		if ((objp = alternate_node_alloc(cachep, flags)) != NULL)
>  +			return objp;

There are three copies of ____cache_alloc() in mm/slab.c, once
compiled.  Do you really want three copies of the alternate_node_alloc()
routine in the kernel, just to avoid a subroutine call in the "NUMA and
(MEMPOLICY or MEM_SPREAD)" case?

I doubt you want that.

In other words, I don't understand yet.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH 2/5] cpuset memory spread page cache implementation and hooks
  2006-02-05  3:28         ` Christoph Lameter
@ 2006-02-05  5:06           ` Andrew Morton
  2006-02-05  6:08             ` Paul Jackson
  0 siblings, 1 reply; 102+ messages in thread
From: Andrew Morton @ 2006-02-05  5:06 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: pj, steiner, dgc, Simon.Derr, ak, linux-kernel

Christoph Lameter <clameter@engr.sgi.com> wrote:
>
> Hmm... Make this
> 
> 
> static inline struct page *page_cache_alloc(struct address_space *x)
> {
> #ifdef CONFIG_NUMA
>  	if (cpuset_mem_spread_check()) {
>  		int n = cpuset_mem_spread_node();
>  		return alloc_pages_node(n, mapping_gfp_mask(x), 0);
>  	}
> #endif
>  	return alloc_pages(mapping_gfp_mask(x), 0);
> }

That's a no-op.

The problem remains that for CONFIG_NUMA=y, this function is too big to inline.

It's a minor thing.  But it's a thing.



* Re: [PATCH 2/5] cpuset memory spread page cache implementation and hooks
  2006-02-05  5:06           ` Andrew Morton
@ 2006-02-05  6:08             ` Paul Jackson
  2006-02-05  6:15               ` Andrew Morton
  0 siblings, 1 reply; 102+ messages in thread
From: Paul Jackson @ 2006-02-05  6:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: clameter, steiner, dgc, Simon.Derr, ak, linux-kernel

Andrew wrote:
> That's a no-op.

agreed.

> The problem remains that for CONFIG_NUMA=y, this function is too big to inline.

A clear statement of the problem.  Good.

But I'm still being a stupid git.  Is the following variant of
page_cache_alloc_cold() still bigger than you would prefer inlined
(where cpuset_mem_spread_check() is an inline current->flags test)
(ditto for page_cache_alloc())?

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
static struct page *page_cache_alloc_mem_spread_cold(struct address_space *x)
{
	int n = cpuset_mem_spread_node();
	return alloc_pages_node(n, mapping_gfp_mask(x)|__GFP_COLD, 0);
}

static inline struct page *page_cache_alloc_cold(struct address_space *x)
{
	if (cpuset_mem_spread_check())
		return page_cache_alloc_mem_spread_cold(x);
	return alloc_pages(mapping_gfp_mask(x)|__GFP_COLD, 0);
}
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Are you recommending taking the whole thing, both page_cache_alloc*()
calls, for the CONFIG_NUMA case, out of line, instead of even the above?

If so, fine ... then the rest of your explanations make sense to
me on how to go about coding this, and I'll try coding it up.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH 2/5] cpuset memory spread page cache implementation and hooks
  2006-02-05  6:08             ` Paul Jackson
@ 2006-02-05  6:15               ` Andrew Morton
  2006-02-05  6:28                 ` Paul Jackson
                                   ` (2 more replies)
  0 siblings, 3 replies; 102+ messages in thread
From: Andrew Morton @ 2006-02-05  6:15 UTC (permalink / raw)
  To: Paul Jackson; +Cc: clameter, steiner, dgc, Simon.Derr, ak, linux-kernel

Paul Jackson <pj@sgi.com> wrote:
>
> Andrew wrote:
> > That's a no-op.
> 
> agreed.
> 
> > The problem remains that for CONFIG_NUMA=y, this function is too big to inline.
> 
> A clear statement of the problem.  Good.
> 
> But I'm still being a stupid git.  Is the following variant of
> page_cache_alloc_cold() still bigger than you would prefer inlined
> (where cpuset_mem_spread_check() is an inline current->flags test)
> (ditto for page_cache_alloc())?
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> static struct page *page_cache_alloc_mem_spread_cold(struct address_space *x)
> {
> 	int n = cpuset_mem_spread_node();
> 	return alloc_pages_node(n, mapping_gfp_mask(x)|__GFP_COLD, 0);
> }

That's an almost-equivalent transformation.  If the compiler's good enough,
it'll generate the same code here I think.

If so then there's probably not much point in optimising it - but one needs
to look at the numbers.

> static inline struct page *page_cache_alloc_cold(struct address_space *x)
> {
> 	if (cpuset_mem_spread_check())
> 		return page_cache_alloc_mem_spread_cold(x);
> 	return alloc_pages(mapping_gfp_mask(x)|__GFP_COLD, 0);
> }
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> Are you recommending taking the whole thing, both page_cache_alloc*()
> calls, for the CONFIG_NUMA case, out of line, instead of even the above?

I'm saying "gee, that looks big.  Do you have time to investigate possible
improvements?"   They may come to naught.

> If so, fine ... then the rest of your explanations make sense to
> me on how to go about coding this, and I'll try coding it up.

Neato.  Please also have a think about __cache_alloc(), see if we can
improve it further - that's a real hotspot.


* Re: [PATCH 2/5] cpuset memory spread page cache implementation and hooks
  2006-02-05  6:15               ` Andrew Morton
@ 2006-02-05  6:28                 ` Paul Jackson
  2006-02-06  0:20                 ` Paul Jackson
  2006-02-06  5:51                 ` Paul Jackson
  2 siblings, 0 replies; 102+ messages in thread
From: Paul Jackson @ 2006-02-05  6:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: clameter, steiner, dgc, Simon.Derr, ak, linux-kernel

Andrew wrote:
> If so then there's probably not much point in optimising it - but one needs
> to look at the numbers.

Ok - I'll play around with it.

> Please also have a think about __cache_alloc() ..

I'll give it shot - no telling if I'll hit anything yet.

> but one needs to look at the numbers.

Definitely.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 2/5] cpuset memory spread page cache implementation and hooks
  2006-02-05  6:15               ` Andrew Morton
  2006-02-05  6:28                 ` Paul Jackson
@ 2006-02-06  0:20                 ` Paul Jackson
  2006-02-06  5:51                 ` Paul Jackson
  2 siblings, 0 replies; 102+ messages in thread
From: Paul Jackson @ 2006-02-06  0:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: clameter, steiner, dgc, Simon.Derr, ak, linux-kernel, Pekka Enberg

Andrew wrote:
> Please also have a think about __cache_alloc(), see if we can
> improve it further - that's a real hotspot.

I won't be able to get at __cache_alloc() at least until Pekka Enberg
layers his "slab: consolidate allocation paths" patches on top of this
cpuset memory spread patchset.  Indeed, it might be Pekka who does more
for __cache_alloc() than me ... that has yet to play out.

The entangled, ifdef'd piece of code in mm/slab.c is more convoluted
than I can get my head around on this attempt, except for narrowly
focused changes.  Perhaps Pekka's work and another week will be enough.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-04 23:49 ` [PATCH 1/5] cpuset memory spread basic implementation Andrew Morton
  2006-02-05  3:35   ` Christoph Lameter
@ 2006-02-06  4:33   ` Andrew Morton
  2006-02-06  5:50     ` Paul Jackson
  1 sibling, 1 reply; 102+ messages in thread
From: Andrew Morton @ 2006-02-06  4:33 UTC (permalink / raw)
  To: pj, dgc, steiner, Simon.Derr, ak, linux-kernel, clameter

Andrew Morton <akpm@osdl.org> wrote:
>
> Paul Jackson <pj@sgi.com> wrote:
>  >
>  > From: Paul Jackson <pj@sgi.com>
>  > 
>  > This patch provides the implementation and cpuset interface for
>  > an alternative memory allocation policy that can be applied to
>  > certain kinds of memory allocations, such as the page cache (file
>  > system buffers) and some slab caches (such as inode caches).
>  > 
>  > ...
>  >
>  > A new per-cpuset file, "memory_spread", is defined.  This is
>  > a boolean flag file, containing a "0" (off) or "1" (on).
>  > By default it is off, and the kernel allocation placement
>  > is unchanged.  If it is turned on for a given cpuset (write a
>  > "1" to that cpusets memory_spread file) then the alternative
>  > policy applies to all tasks in that cpuset.
> 
>  I'd have thought it would be saner to split these things apart:
>  "slab_spread", "pagecache_spread", etc.

This, please.   It impacts the design of the whole thing.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-04  7:19 [PATCH 1/5] cpuset memory spread basic implementation Paul Jackson
                   ` (5 preceding siblings ...)
  2006-02-04 23:50 ` Andrew Morton
@ 2006-02-06  4:37 ` Andrew Morton
  2006-02-06  6:02   ` Ingo Molnar
  2006-02-06  6:56   ` Paul Jackson
  2006-02-06  9:18 ` Simon Derr
  7 siblings, 2 replies; 102+ messages in thread
From: Andrew Morton @ 2006-02-06  4:37 UTC (permalink / raw)
  To: Paul Jackson; +Cc: dgc, steiner, Simon.Derr, ak, linux-kernel, pj, clameter

Paul Jackson <pj@sgi.com> wrote:
>
> This policy can provide substantial improvements for jobs that
>  need to place thread local data on the corresponding node, but
>  that need to access large file system data sets that need to
>  be spread across the several nodes in the jobs cpuset in order
>  to fit.  Without this patch, especially for jobs that might
>  have one thread reading in the data set, the memory allocation
>  across the nodes in the jobs cpuset can become very uneven.


It all seems rather ironic.  We do vast amounts of development to make
certain microbenchmarks look good, then run a real workload on the thing,
find that all those microbenchmark-inspired tweaks actually deoptimised the
real workload?  So now we need to add per-task knobs to turn off the
previously-added microbenchmark-tweaks.

What happens if one process does lots of filesystem activity and another
one (concurrent or subsequent) wants lots of thread-local storage?  Won't
the same thing happen?

IOW: this patch seems to be a highly specific bandaid which is repairing an
ill-advised problem of our own making, does it not?

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 5/5] cpuset memory spread slab cache hooks
  2006-02-04  7:19 ` [PATCH 5/5] cpuset memory spread slab cache hooks Paul Jackson
@ 2006-02-06  4:37   ` Andrew Morton
  0 siblings, 0 replies; 102+ messages in thread
From: Andrew Morton @ 2006-02-06  4:37 UTC (permalink / raw)
  To: Paul Jackson; +Cc: dgc, steiner, Simon.Derr, ak, linux-kernel, pj, clameter

Paul Jackson <pj@sgi.com> wrote:
>
> Change the kmem_cache_create calls for certain slab caches to support
>  cpuset memory spreading.
> 
>  See the previous patches, cpuset_mem_spread, for an explanation of
>  cpuset memory spreading, and cpuset_mem_spread_slab_cache for the
>  slab cache support for memory spreading.
> 
>  The slab caches marked for now are: dentry_cache, inode_cache,
>  and buffer_head.  This list may change over time.

inode_cache is practically unused.  You'll be wanting to patch
ext3_inode_cache, xfs-inode_cache, etc.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  4:33   ` Andrew Morton
@ 2006-02-06  5:50     ` Paul Jackson
  2006-02-06  6:02       ` Andrew Morton
  0 siblings, 1 reply; 102+ messages in thread
From: Paul Jackson @ 2006-02-06  5:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: dgc, steiner, Simon.Derr, ak, linux-kernel, clameter

> >  I'd have thought it would be saner to split these things apart:
> >  "slab_spread", "pagecache_spread", etc.
> 
> This, please.   It impacts the design of the whole thing.

It was still in my queue to respond to, yes.

As far as I am aware, all that is needed is to distinguish between:

  (1) application space pages, such as data and stack space,
      which the applications can page and place under their
      detailed control, and

  (2) what from the application's viewpoint is "kernel stuff"
      such as large amounts of pages required by file i/o,
      and their associated inode/dentry structures.

The application space pages are typically anonymous pages
which go away when the owning task exits, while the kernel
space pages are typically accessible by multiple tasks and
can stay around long after the initial faulting task exits.

I prefer to keep the tunable knobs to a minimum.  One boolean
was sufficient for this.

Just because a distinction seems substantial from the kernel
internals perspective, doesn't mean we should reflect that in
the tunable knobs.  We should have an actual need first, not
a strawman.

If there is some reason, or preference, for adding two knobs
(slab and page) instead of one, I can certainly do it.

I am not yet aware that such is useful.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 2/5] cpuset memory spread page cache implementation and hooks
  2006-02-05  6:15               ` Andrew Morton
  2006-02-05  6:28                 ` Paul Jackson
  2006-02-06  0:20                 ` Paul Jackson
@ 2006-02-06  5:51                 ` Paul Jackson
  2006-02-06  7:14                   ` Pekka J Enberg
  2 siblings, 1 reply; 102+ messages in thread
From: Paul Jackson @ 2006-02-06  5:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: clameter, steiner, dgc, Simon.Derr, ak, linux-kernel, Pekka Enberg

Earlier Andrew wrote:
> Really has two forms, depending upon Kconfig.
> 
> 1:
> 
> static inline struct page *page_cache_alloc(struct address_space *x)
> {
> 	return alloc_pages(mapping_gfp_mask(x), 0);
> }
> 
> That should be inlined.
> 
> 2:
> 
> static inline struct page *page_cache_alloc(struct address_space *x)
> {
> 	if (cpuset_mem_spread_check()) {
> 		int n = cpuset_mem_spread_node();
> 		return alloc_pages_node(n, mapping_gfp_mask(x), 0);
> 	}
> 	return alloc_pages(mapping_gfp_mask(x), 0);
> }

Later on, he wrote:
> I'm saying "gee, that looks big.  Do you have time to investigate possible
> improvements?"   They may come to naught.

After playing around with the variations we've considered on this
thread, the results are simple enough.  I experimented with just
the 3 calls to page_cache_alloc_cold() in mm/filemap.c, because that
was easy, and all these calls have the same shape.

    For non-NUMA, removing 'inline' from the three page_cache_alloc_cold()
    calls in mm/filemap.c would cost a total of 16 bytes of text size.

    For NUMA+CPUSET, removing it would _save_ 583 bytes total over the
    three calls.

    The "nm -S" size of the uninlined page_cache_alloc_cold() is 448 bytes
    (it was 96 bytes before this cpuset patchset).

    This is all on ia64 sn2_defconfig gcc 3.3.3.

The conclusion is straightforward, and as Andrew suspected.

We want these two page_cache_alloc*() routines out of line in the
NUMA case, but left inline for the non-NUMA case.

I will follow up with a simple patch that makes it easy to mark
routines that should be inline for UMA, out of line for NUMA.

These two page_cache_alloc*(), and perhaps also __cache_alloc() when
Pekka or I get a handle on it, are candidates for this marking, as
routines to inline on UMA, out of line on NUMA.
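
Roughly, the marking I have in mind would look something like the
following (sketch only, not the actual follow-up patch):

/*
 * Sketch: the helper stays a trivial inline on UMA builds; on NUMA it
 * becomes an ordinary function, defined once out of line (e.g. in
 * mm/filemap.c), so the cpuset spread logic is not expanded at every
 * call site.
 */
#ifdef CONFIG_NUMA
extern struct page *page_cache_alloc_cold(struct address_space *x);
#else
static inline struct page *page_cache_alloc_cold(struct address_space *x)
{
	return alloc_pages(mapping_gfp_mask(x) | __GFP_COLD, 0);
}
#endif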

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  5:50     ` Paul Jackson
@ 2006-02-06  6:02       ` Andrew Morton
  2006-02-06  6:17         ` Ingo Molnar
  0 siblings, 1 reply; 102+ messages in thread
From: Andrew Morton @ 2006-02-06  6:02 UTC (permalink / raw)
  To: Paul Jackson; +Cc: dgc, steiner, Simon.Derr, ak, linux-kernel, clameter

Paul Jackson <pj@sgi.com> wrote:
>
> > >  I'd have thought it would be saner to split these things apart:
> > >  "slab_spread", "pagecache_spread", etc.
> > 
> > This, please.   It impacts the design of the whole thing.
> 
> It was still in my queue to respond to, yes.
> 
> As far as I am aware, all that is needed is to distinguish between:
> 
>   (1) application space pages, such as data and stack space,
>       which the applications can page and place under their
>       detailed control, and
> 
>   (2) what from the application's viewpoint is "kernel stuff"
>       such as large amounts of pages required by file i/o,
>       and their associated inode/dentry structures.
> 
> The application space pages are typically anonymous pages
> which go away when the owning task exits, while the kernel
> space pages are typically accessible by multiple tasks and
> can stay around long after the initial faulting task exits.
> 
> I prefer to keep the tunable knobs to a minimum.  One boolean
> was sufficient for this.
> 
> Just because a distinction seems substantial from the kernel
> internals perspective, doesn't mean we should reflect that in
> the tunable knobs.  We should have an actual need first, not
> a strawman.
> 
> If there is some reason, or preference, for adding two knobs
> (slab and page) instead of one, I can certainly do it.
> 
> I am not yet aware that such is useful.
> 

I suspect that you'll find different workload patterns and sequences which
cause similar regressions in the future.  It'd be useful to sit down and
try to think of some and try them out.

I think the bottom line here is that the kernel just cannot predict the
future and it will need help from the applications and/or administrators to
be able to do optimal things.  For that, finer-grained one-knob-per-concept
controls would be better.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  4:37 ` Andrew Morton
@ 2006-02-06  6:02   ` Ingo Molnar
  2006-02-06  6:56   ` Paul Jackson
  1 sibling, 0 replies; 102+ messages in thread
From: Ingo Molnar @ 2006-02-06  6:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Paul Jackson, dgc, steiner, Simon.Derr, ak, linux-kernel, clameter


* Andrew Morton <akpm@osdl.org> wrote:

> Paul Jackson <pj@sgi.com> wrote:
> >
> > This policy can provide substantial improvements for jobs that
> >  need to place thread local data on the corresponding node, but
> >  that need to access large file system data sets that need to
> >  be spread across the several nodes in the jobs cpuset in order
> >  to fit.  Without this patch, especially for jobs that might
> >  have one thread reading in the data set, the memory allocation
> >  across the nodes in the jobs cpuset can become very uneven.
> 
> 
> It all seems rather ironic.  We do vast amounts of development to make 
> certain microbenchmarks look good, then run a real workload on the 
> thing, find that all those microbenchmark-inspired tweaks actually 
> deoptimised the real workload?  So now we need to add per-task knobs 
> to turn off the previously-added microbenchmark-tweaks.
> 
> What happens if one process does lots of filesystem activity and 
> another one (concurrent or subsequent) wants lots of thread-local 
> storage?  Won't the same thing happen?
> 
> IOW: this patch seems to be a highly specific bandaid which is 
> repairing an ill-advised problem of our own making, does it not?

i suspect it all depends on whether the workload is 'global' or 'local'.  
Let's consider the hypothetical case of a 16-node box with 64 CPUs and 1 
TB of RAM, which could have two fundamental types of workloads:

- lots of per-CPU tasks which are highly independent and each does its
  own stuff. For this case we really want to allocate everything per-CPU
  and as close to the task as possible.

- 90% of the 1 TB of RAM is in a shared 'database' that is accessed by 
  all nodes in a nonpredictable pattern, from any CPU. For this case we 
  want to 'spread out' the pagecache as much as possible. If we don't 
  spread it out then one task - e.g. an initialization process - could 
  create a really bad distribution for the pagecache: big continuous 
  ranges allocated on the same node. If the workload has randomness but 
  also occasional "medium range" locality, an uneven portion of the 
  accesses could go to the same node, hosting some big continuous chunk 
  of the database. So we want to spread out in an as finegrained way as 
  possible. (perhaps not too finegrained though, to let block IO still 
  be reasonably batched.)

we normally optimize for the first case, and it works pretty well on 
both SMP and NUMA. We do pretty well with the second workload on SMP, 
but on NUMA, the non-spreading can hurt. So it makes sense to 
artificially 'interleave' all the RAM that goes into the pagecache, to 
have a good statistical distribution of pages.

neither workload is broken, nor did we make a design mistake to optimize 
the SMP case for the first one - it is really the common thing on most 
boxes. But the second workload does happen too, and it conflicts with 
the first workload's needs. The difference between the workloads cannot 
be bridged by the kernel: it is two very different access patterns that 
result from the problem the application is trying to solve - the kernel 
cannot influence that.

I suspect there is no other way but to let the application tell the 
kernel which strategy it wants to be utilized.

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  6:02       ` Andrew Morton
@ 2006-02-06  6:17         ` Ingo Molnar
  2006-02-06  7:22           ` Paul Jackson
  0 siblings, 1 reply; 102+ messages in thread
From: Ingo Molnar @ 2006-02-06  6:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Paul Jackson, dgc, steiner, Simon.Derr, ak, linux-kernel, clameter


* Andrew Morton <akpm@osdl.org> wrote:

> I think the bottom line here is that the kernel just cannot predict 
> the future and it will need help from the applications and/or 
> administrators to be able to do optimal things.  For that, 
> finer-grained one-knob-per-concept controls would be better.

yep. The cleanest would be to let tasks identify the fundamental access 
pattern with different granularity. I'm wondering whether it would be 
enough to simply extend madvise and fadvise to 'task' scope as well, and 
change the pagecache allocation pattern to 'spread out' pages on NUMA, 
if POSIX_FADV_RANDOM / MADV_RANDOM is specified.

hence 'global' workloads could set the per-task [and perhaps per-cpuset] 
access-pattern default to POSIX_FADV_RANDOM, while 'local' workloads 
could set it to POSIX_FADV_SEQUENTIAL [or leave it at the default].
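
for reference, the hint that exists today is per open file descriptor
only - e.g. (illustrative user-space snippet, error handling omitted):

#define _XOPEN_SOURCE 600
#include <fcntl.h>

/* Applies to this one open file; the extension discussed above would
 * lift the same hint to task (or per-cpuset / per-mountpoint) scope. */
static void hint_random_access(int fd)
{
	posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
}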

another API solution: perhaps there should be a per-mountpoint 
fadvise/madvise hint? Thus the database in question could set the access 
pattern for the object itself. (or an ACL tag could achieve the same) 
That approach would have the advantage of being quite finegrained, and 
would limit the 'interleaving' strategy to the affected objects alone.

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  4:37 ` Andrew Morton
  2006-02-06  6:02   ` Ingo Molnar
@ 2006-02-06  6:56   ` Paul Jackson
  2006-02-06  7:08     ` Andrew Morton
  1 sibling, 1 reply; 102+ messages in thread
From: Paul Jackson @ 2006-02-06  6:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: dgc, steiner, Simon.Derr, ak, linux-kernel, clameter

Andrew wrote:
> IOW: this patch seems to be a highly specific bandaid which is repairing an
> ill-advised problem of our own making, does it not?


I am mystified.  I am unable to imagine how you see this memory
spreading patchset as a response to some damage caused by previous
work.

Nothing we have ever done has deprived us of the ability to run a
job in a cpuset, where the application code of that job manages the
per-node placement of thread local storage, while the kernel evenly
distributes the placement of file system activity.

We never had that ability, in the mainline Linux kernel.

Ingo describes one workload where this alternative
strategy is useful.

What Ingo described involved a particular job on a 64 CPU, 1 TB system.
We have systems with multiple cpusets of such sizes, each running
such a job, all on the same system, at once.

Big shared systems, running performance critical jobs simultaneously,
present different challenges than seen on embedded systems, workstations or
smaller multi-use servers.

The driving force here is not our prior kernel design decisions.

The driving force is the economics of big systems, paid for by larger
organizations for use across multiple divisions or departments or
universities or commands, whatever the unit.  Systems obtained for running
performance critical, highly parallel, data and computationally
intensive applications.  They require job isolation of cpu and memory,
application management of memory use for thread local storage, and
uniform behaviour across the cpuset of kernel memory usage.

Each such job -may- require this alternative page and slab cache
memory spreading strategy, which is why it's a per-cpuset choice.

> What happens if one process does lots of filesystem activity and another
> one (concurrent or subsequent) wants lots of thread-local storage?  Won't
> the same thing happen?

Don't run two jobs in the same cpuset that have conflicting
memory requirements.

We're talking dedicated cpusets, with dedicated cpus and memory
nodes, for a given job.  Or, in Ingo's example, essentially a
single cpuset, covering the entire system, running one job.

In either case, some workloads require a different strategy for
such kernel memory placement, which would be the wrong default for most
uses.

So, the user must tell the kernel it needs this.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  6:56   ` Paul Jackson
@ 2006-02-06  7:08     ` Andrew Morton
  2006-02-06  7:39       ` Ingo Molnar
  2006-02-06  9:32       ` Paul Jackson
  0 siblings, 2 replies; 102+ messages in thread
From: Andrew Morton @ 2006-02-06  7:08 UTC (permalink / raw)
  To: Paul Jackson; +Cc: dgc, steiner, Simon.Derr, ak, linux-kernel, clameter

Paul Jackson <pj@sgi.com> wrote:
>
> Andrew wrote:
> > IOW: this patch seems to be a highly specific bandaid which is repairing an
> > ill-advised problem of our own making, does it not?
> 
> 
> I am mystified.  I am unable to imagine how you see this memory
> spreading patchset as a response to some damage caused by previous
> work.

Node-local allocation.

> 
> So, the user must tell the kernel it needs this.
>

Well I agree.  And I think that the only way we'll get peak performance for
an acceptably broad range of applications is to provide many fine-grained
controls and the appropriate documentation and instrumentation to help
developers and administrators use those controls.

We're all on the same page here.  I'm questioning whether slab and
pagecache should be inextricably lumped together though.

Is it possible to integrate the slab and pagecache allocation policies more
cleanly into a process's mempolicy?  Right now, MPOL_* are disjoint.

(Why is the spreading policy part of cpusets at all?  Shouldn't it be part
of the mempolicy layer?)

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 2/5] cpuset memory spread page cache implementation and hooks
  2006-02-06  5:51                 ` Paul Jackson
@ 2006-02-06  7:14                   ` Pekka J Enberg
  2006-02-06  7:42                     ` Pekka J Enberg
  0 siblings, 1 reply; 102+ messages in thread
From: Pekka J Enberg @ 2006-02-06  7:14 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Andrew Morton, clameter, steiner, dgc, Simon.Derr, ak, linux-kernel

On Sun, 5 Feb 2006, Paul Jackson wrote:
> These two page_cache_alloc*(), and perhaps also __cache_alloc() when
> Pekka or I get a handle on it, are candidates for this marking, as
> routines to inline on UMA, out of line on NUMA.

For slab, I found that the following two patches reduce text size most 
(for i386 NUMAQ config) while keeping UMA path the same. I don't have 
actual NUMA-capable hardware so I have no way to benchmark them. Both 
patches move code out-of-line and thus introduce new function calls which 
might affect performance negatively.

http://www.cs.helsinki.fi/u/penberg/linux/penberg-2.6/penberg-01-slab/slab-alloc-path-cleanup.patch
http://www.cs.helsinki.fi/u/penberg/linux/penberg-2.6/penberg-01-slab/slab-reduce-text-size.patch

			Pekka

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  6:17         ` Ingo Molnar
@ 2006-02-06  7:22           ` Paul Jackson
  2006-02-06  7:43             ` Ingo Molnar
  0 siblings, 1 reply; 102+ messages in thread
From: Paul Jackson @ 2006-02-06  7:22 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: akpm, dgc, steiner, Simon.Derr, ak, linux-kernel, clameter

> I'm wondering whether it would be 
> enough to simply extend madvise and fadvise to 'task' scope as well, and 
> change the pagecache allocation pattern to 'spread out' pages on NUMA, 
> if POSIX_FADV_RANDOM / MADV_RANDOM is specified.

This would not seem to work, at least for the needs I am aware of.

The tasks we are talking about do -not- want a default RANDOM
policy.  They want node-local allocation for per-thread data
(data and stack for example), and at the same time spread
allocation for kernel space (page and slab cache).

For another thing, memory spreading is not the same as RANDOM
policies.  RANDOM policies apply to page read ahead and retention
strategies, not to page placement (what node they are faulted into)
policies.

For a third thing, madvise takes a virtual address range, which
is irrelevant for specifying kernel address space page and slab
cache pages.

But the biggest difficulty, from my perspective, would be that
this strategy is normally selected per-cpuset, not per-task.

We are managing memory placement across the cpuset, on a per-job
basis.  It's the system administrator or batch scheduler, not the
individual application coder, who will likely want to enforce this
alternative memory placement strategy.

The madvise and posix_fadvise calls have no provision for one task
to affect another.  They just apply to the current task.  As Andrew
noted, it doesn't make much sense for different tasks in the same
cpuset to disagree on this placement policy choice.  We need a cpuset
wide policy choice, and a cpuset wide mechanism for making the choice,
not a per-task task internal mechanism.

Or at the very least, an attribute that is inherited across fork
and exec, so that a job grandfather (founding father) task can
set the policy for all descendants.
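
A minimal sketch of why fork inheritance is enough here (illustrative
only; just the PF_MEM_SPREAD name comes from patch 1/5):

/*
 * A policy bit kept in task_struct->flags is copied wholesale when
 * fork() duplicates the parent and can be left alone across exec(),
 * so whatever the founding task of a job has set reaches every
 * descendant with no per-task system call needed.
 */
static inline int task_spreads_memory(struct task_struct *tsk)
{
	return (tsk->flags & PF_MEM_SPREAD) != 0;
}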

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  7:08     ` Andrew Morton
@ 2006-02-06  7:39       ` Ingo Molnar
  2006-02-06  8:22         ` Paul Jackson
  2006-02-06  8:35         ` Paul Jackson
  2006-02-06  9:32       ` Paul Jackson
  1 sibling, 2 replies; 102+ messages in thread
From: Ingo Molnar @ 2006-02-06  7:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Paul Jackson, dgc, steiner, Simon.Derr, ak, linux-kernel, clameter


* Andrew Morton <akpm@osdl.org> wrote:

> We're all on the same page here.  I'm questioning whether slab and 
> pagecache should be inextricably lumped together though.
> 
> Is it possible to integrate the slab and pagecache allocation policies 
> more cleanly into a process's mempolicy?  Right now, MPOL_* are 
> disjoint.
> 
> (Why is the spreading policy part of cpusets at all?  Shouldn't it be 
> part of the mempolicy layer?)

the whole mempolicy design seems to be too coarse: it is a fundamentally 
per-node thing, while workloads often share nodes. So it seems to me the 
approach Paul took was to make things more finegrained via cpusets - as 
that seems to be the preferred method to isolate workloads anyway.  
Cpusets are a limited form of virtualization / resource allocation, they 
allow the partitioning of a workload to a set of CPUs and a workload's 
memory allocations to a set of nodes.

in that sense, if we accept cpusets as the main abstraction for workload 
isolation on NUMA systems, it would be a natural and minimal extension 
to attach an access pattern hint to the cpuset - which is the broadest 
container of the workload. Mempolicies are pretty orthogonal to this and 
do not allow the separate handling of two workloads living in two 
different cpusets.

once we accept cpusets as the main abstraction, i don't think there is 
any fundamentally cleaner solution than the one presented by Paul. The 
advantage of having a 'global, per-cpuset' hint is obvious: the 
administrator can set it without having to change applications. Since it 
is global for the "virtual machine" (that is represented by the cpuset), 
the natural controls are limited to kernel entities: slab caches, 
pagecache, anonymous allocations.

what feels hacky is the knowledge about kernel-internal caches, but 
there's nothing else to control i think. Making it finegrained to the 
object level would make it impractical to use in the cpuset abstraction.

if we do not accept cpusets as the main abstraction, then per-task and
per-object hints seem to be the right control - which would have to be
used by the application.

the cpuset solution is certainly simpler to implement: the cpuset is 
already available to the memory allocator, so it's a simple step to 
extend it. Object-level flags would have to be passed down to the 
allocators - we dont have those right now as allocations are mostly 
anonymous.

also, maybe application / object level hints are _too_ finegrained: if a 
cpuset is used as a container for a 'project', then it's easy and 
straightforward to attach an allocation policy to it. Modifying hundreds 
of apps, some of which might be legacy, seems impractical - and the 
access pattern might very much depend on the project it is used in.

so to me the cpuset level seems to be the most natural point to control 
this: it is the level where resources are partitioned, and hence anyone 
configuring them should have a good idea about the expected access 
patterns of the project the cpuset belongs to. The application writer 
has little idea about the circumstances the app gets used in.

if we want to reduce complexity, i'd suggest consolidating the MPOL_* 
mechanism into cpusets and phasing out the mempolicy syscalls. (The sysfs 
interface to cpusets is much cleaner anyway.)

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 2/5] cpuset memory spread page cache implementation and hooks
  2006-02-06  7:14                   ` Pekka J Enberg
@ 2006-02-06  7:42                     ` Pekka J Enberg
  2006-02-06  7:51                       ` Pekka J Enberg
  0 siblings, 1 reply; 102+ messages in thread
From: Pekka J Enberg @ 2006-02-06  7:42 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Andrew Morton, clameter, steiner, dgc, Simon.Derr, ak,
	linux-kernel, manfred

On Mon, 6 Feb 2006, Pekka J Enberg wrote:
> For slab, I found that the following two patches reduce text size most 
> (for i386 NUMAQ config) while keeping UMA path the same. I don't have 
> actual NUMA-capable hardware so I have no way to benchmark them. Both 
> patches move code out-of-line and thus introduce new function calls which 
> might affect performance negatively.
> 
> http://www.cs.helsinki.fi/u/penberg/linux/penberg-2.6/penberg-01-slab/slab-alloc-path-cleanup.patch

Actually, the above patch probably isn't any good as it moves 
cache_alloc_cpucache() out-of-line which should be the common case for 
NUMA too (it's hurting kmem_cache_alloc and kmalloc). The following should 
be better.

Subject: slab: consolidate allocation paths
From: Pekka Enberg <penberg@cs.helsinki.fi>

This patch consolidates the UMA and NUMA memory allocation paths in the
slab allocator. This is accomplished by making the UMA-path look like
we are on NUMA but always allocating from the current node.

NUMA text size:

   text    data     bss     dec     hex filename
  16227    2640      24   18891    49cb mm/slab.o (before)
  16196    2640      24   18860    49ac mm/slab.o (after)

UMA text size stays the same.

Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
---

Index: 2.6-git/mm/slab.c
===================================================================
--- 2.6-git.orig/mm/slab.c
+++ 2.6-git/mm/slab.c
@@ -829,8 +829,6 @@ static struct array_cache *alloc_arrayca
 }
 
 #ifdef CONFIG_NUMA
-static void *__cache_alloc_node(struct kmem_cache *, gfp_t, int);
-
 static struct array_cache **alloc_alien_cache(int node, int limit)
 {
 	struct array_cache **ac_ptr;
@@ -2715,20 +2713,12 @@ static void *cache_alloc_debugcheck_afte
 #define cache_alloc_debugcheck_after(a,b,objp,d) (objp)
 #endif
 
-static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
+static __always_inline void *cache_alloc_cpucache(struct kmem_cache *cachep,
+						  gfp_t flags)
 {
 	void *objp;
 	struct array_cache *ac;
 
-#ifdef CONFIG_NUMA
-	if (unlikely(current->mempolicy && !in_interrupt())) {
-		int nid = slab_node(current->mempolicy);
-
-		if (nid != numa_node_id())
-			return __cache_alloc_node(cachep, flags, nid);
-	}
-#endif
-
 	check_irq_off();
 	ac = cpu_cache_get(cachep);
 	if (likely(ac->avail)) {
@@ -2742,23 +2732,6 @@ static inline void *____cache_alloc(stru
 	return objp;
 }
 
-static __always_inline void *
-__cache_alloc(struct kmem_cache *cachep, gfp_t flags, void *caller)
-{
-	unsigned long save_flags;
-	void *objp;
-
-	cache_alloc_debugcheck_before(cachep, flags);
-
-	local_irq_save(save_flags);
-	objp = ____cache_alloc(cachep, flags);
-	local_irq_restore(save_flags);
-	objp = cache_alloc_debugcheck_after(cachep, flags, objp,
-					    caller);
-	prefetchw(objp);
-	return objp;
-}
-
 #ifdef CONFIG_NUMA
 /*
  * A interface to enable slab creation on nodeid
@@ -2821,8 +2794,58 @@ static void *__cache_alloc_node(struct k
       done:
 	return obj;
 }
+
+static void *__cache_alloc(struct kmem_cache *cachep, gfp_t flags, int nodeid)
+{
+
+	if (nodeid != numa_node_id() && cachep->nodelists[nodeid])
+		return __cache_alloc_node(cachep, flags, nodeid);
+
+	if (unlikely(current->mempolicy && !in_interrupt())) {
+		nodeid = slab_node(current->mempolicy);
+
+		if (nodeid != numa_node_id() && cachep->nodelists[nodeid])
+			return __cache_alloc_node(cachep, flags, nodeid);
+	}
+
+	return cache_alloc_cpucache(cachep, flags);
+}
+
+#else
+
+/*
+ * On UMA, we always allocate directly from the per-CPU cache.
+ */
+
+static __always_inline void *__cache_alloc(struct kmem_cache *cachep,
+					   gfp_t flags, int nodeid)
+{
+	return NULL;
+}
+
 #endif
 
+static __always_inline void *cache_alloc(struct kmem_cache *cachep,
+					 gfp_t flags, int nodeid,
+					 void *caller)
+{
+	unsigned long save_flags;
+	void *objp;
+
+	cache_alloc_debugcheck_before(cachep, flags);
+	local_irq_save(save_flags);
+
+	if (likely(nodeid == -1))
+		objp = cache_alloc_cpucache(cachep, flags);
+	else
+		objp = __cache_alloc(cachep, flags, nodeid);
+
+	local_irq_restore(save_flags);
+	objp = cache_alloc_debugcheck_after(cachep, flags, objp, caller);
+	prefetchw(objp);
+	return objp;
+}
+
 /*
  * Caller needs to acquire correct kmem_list's list_lock
  */
@@ -2984,7 +3007,7 @@ static inline void __cache_free(struct k
  */
 void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 {
-	return __cache_alloc(cachep, flags, __builtin_return_address(0));
+	return cache_alloc(cachep, flags, -1, __builtin_return_address(0));
 }
 EXPORT_SYMBOL(kmem_cache_alloc);
 
@@ -3045,23 +3068,7 @@ int fastcall kmem_ptr_validate(struct km
  */
 void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 {
-	unsigned long save_flags;
-	void *ptr;
-
-	cache_alloc_debugcheck_before(cachep, flags);
-	local_irq_save(save_flags);
-
-	if (nodeid == -1 || nodeid == numa_node_id() ||
-	    !cachep->nodelists[nodeid])
-		ptr = ____cache_alloc(cachep, flags);
-	else
-		ptr = __cache_alloc_node(cachep, flags, nodeid);
-	local_irq_restore(save_flags);
-
-	ptr = cache_alloc_debugcheck_after(cachep, flags, ptr,
-					   __builtin_return_address(0));
-
-	return ptr;
+	return cache_alloc(cachep, flags, nodeid, __builtin_return_address(0));
 }
 EXPORT_SYMBOL(kmem_cache_alloc_node);
 
@@ -3111,7 +3118,7 @@ static __always_inline void *__do_kmallo
 	cachep = __find_general_cachep(size, flags);
 	if (unlikely(cachep == NULL))
 		return NULL;
-	return __cache_alloc(cachep, flags, caller);
+	return cache_alloc(cachep, flags, -1, caller);
 }
 
 #ifndef CONFIG_DEBUG_SLAB

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  7:22           ` Paul Jackson
@ 2006-02-06  7:43             ` Ingo Molnar
  2006-02-06  8:19               ` Paul Jackson
  0 siblings, 1 reply; 102+ messages in thread
From: Ingo Molnar @ 2006-02-06  7:43 UTC (permalink / raw)
  To: Paul Jackson; +Cc: akpm, dgc, steiner, Simon.Derr, ak, linux-kernel, clameter


* Paul Jackson <pj@sgi.com> wrote:

> The tasks we are talking about do -not- want a default RANDOM policy.  
> They want node-local allocation for per-thread data (data and stack 
> for example), and at the same time spread allocation for kernel space 
> (page and slab cache).

ok, i think i got that. Could you perhaps outline two actual use-cases 
that would need two cpusets with different policies, on the same box? Or 
would you set the spread-out policy for every cpuset that is used in a 
box with lots of cpusets, to achieve fairness? (and not bother about it 
on boxes with dedicated workloads)

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 2/5] cpuset memory spread page cache implementation and hooks
  2006-02-06  7:42                     ` Pekka J Enberg
@ 2006-02-06  7:51                       ` Pekka J Enberg
  2006-02-06 17:32                         ` Pekka Enberg
  0 siblings, 1 reply; 102+ messages in thread
From: Pekka J Enberg @ 2006-02-06  7:51 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Andrew Morton, clameter, steiner, dgc, Simon.Derr, ak,
	linux-kernel, manfred

On Mon, 6 Feb 2006, Pekka J Enberg wrote:
> Actually, the above patch isn't probably any good as it moves 
> cache_alloc_cpucache() out-of-line which should be the common case for 
> NUMA too (it's hurting kmem_cache_alloc and kmalloc). The following should 
> be better.

Hmm. Strike that. It's wrong as it no longer respects mempolicy.

		Pekka

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  7:43             ` Ingo Molnar
@ 2006-02-06  8:19               ` Paul Jackson
  2006-02-06  8:22                 ` Ingo Molnar
  0 siblings, 1 reply; 102+ messages in thread
From: Paul Jackson @ 2006-02-06  8:19 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: akpm, dgc, steiner, Simon.Derr, ak, linux-kernel, clameter

Ingo wrote:
> Could you perhaps outline two actual use-cases 
> that would need two cpusets with different policies,
> on the same box?

We normally run with different policies, in the same box, on different
cpusets at the same time.  But this might be because some cpusets
-need- the memory spreading, and the others that don't are left to the
default policy.

In my immediate experience, I can only outline a hypothetical case,
not an actual case, where the default node-local policy would be sorely
needed, as opposed to just preferred:

    If a job were running several threads, each of which did some
    file i/o in roughly equal amounts, for processing (reading and
    writing) in that thread, it could need the performance that
    depended on these pages being placed node local.

In cpusets running classic Unix loads, such as the daemon processes or
the login sessions, the default node-local would certainly be
preferred, as that policy is well tuned for that sort of load.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  7:39       ` Ingo Molnar
@ 2006-02-06  8:22         ` Paul Jackson
  2006-02-06  8:35         ` Paul Jackson
  1 sibling, 0 replies; 102+ messages in thread
From: Paul Jackson @ 2006-02-06  8:22 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: akpm, dgc, steiner, Simon.Derr, ak, linux-kernel, clameter

Ingo wrote:
> Modifying hundreds 
> of apps, some of which might be legacy, seems impractical - and the 
> access pattern might very much depend on the project it is used in.

Well said.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  8:19               ` Paul Jackson
@ 2006-02-06  8:22                 ` Ingo Molnar
  2006-02-06  8:40                   ` Ingo Molnar
                                     ` (2 more replies)
  0 siblings, 3 replies; 102+ messages in thread
From: Ingo Molnar @ 2006-02-06  8:22 UTC (permalink / raw)
  To: Paul Jackson; +Cc: akpm, dgc, steiner, Simon.Derr, ak, linux-kernel, clameter


* Paul Jackson <pj@sgi.com> wrote:

> Ingo wrote:
> > Could you perhaps outline two actual use-cases 
> > that would need two cpusets with different policies,
> > on the same box?
> 
> We normally run with different policies, in the same box, on different 
> cpusets at the same time.  But this might be because some cpusets 
> -need- the memory spreading, and the others that don't are left to the 
> default policy.

so in practice, the memory spreading is in fact a global setting, used
by all cpusets that matter? That seems to support Andrew's observation
that our assumptions / defaults are bad, pretty much independently of
the workload.

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  7:39       ` Ingo Molnar
  2006-02-06  8:22         ` Paul Jackson
@ 2006-02-06  8:35         ` Paul Jackson
  1 sibling, 0 replies; 102+ messages in thread
From: Paul Jackson @ 2006-02-06  8:35 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: akpm, dgc, steiner, Simon.Derr, ak, linux-kernel, clameter

Ingo wrote:
> if we want to reduce complexity, i'd suggest to consolidate the MPOL_* 
> mechanism into cpusets, and phase out the mempolicy syscalls. (The sysfs 
> interface to cpusets is much cleaner anyway.)

I think that there is an essential place for both interfaces.

Individual tasks need to be able to micromanage their memory placement
and (with sched_setaffinity) cpu scheduling.  For instance, the cpuset
interface would be ill equipped to express the virtual address-range
placement that the mbind(2) system call can express.

Also the cpuset interface affects all tasks in that cpuset equally,
which is simply not enough.  Individual threads have their own
special needs, which they are prepared to express in code.

We might have details of the mempolicy system calls that we don't like;
I've complained about such myself in times long past.  But it is quite
serviceable, and the API details are probably better left as they are.
Incompatible changes would cause more problems than they would fix.

The two separate interfaces really do fit the end-usage pattern
rather well.  We have cpusets for the sysadmins and batch schedulers,
and we have the sched_setaffinity and mempolicy system calls for the
applications.

I will grant that it's a pleasure, after all these years, to be
arguing that "we need mempolicy too", rather than arguing "we
need cpusets in addition."

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  8:22                 ` Ingo Molnar
@ 2006-02-06  8:40                   ` Ingo Molnar
  2006-02-06  9:03                     ` Paul Jackson
  2006-02-06 20:22                     ` Paul Jackson
  2006-02-06  8:47                   ` Paul Jackson
  2006-02-06 10:09                   ` Andi Kleen
  2 siblings, 2 replies; 102+ messages in thread
From: Ingo Molnar @ 2006-02-06  8:40 UTC (permalink / raw)
  To: Paul Jackson; +Cc: akpm, dgc, steiner, Simon.Derr, ak, linux-kernel, clameter


* Ingo Molnar <mingo@elte.hu> wrote:

> > We normally run with different policies, in the same box, on different 
> > cpusets at the same time.  But this might be because some cpusets 
> > -need- the memory spreading, and the others that don't are left to the 
> > default policy.
> 
> so in practice, the memory spreading is in fact a global setting, used 
> by all cpusets that matter? That seems to support Andrew's observation 
> that our assumptions / defaults are bad, pretty much independently of 
> the workload.

in other words: the spreading out likely _hurts_ performance in the 
typical case (which prefers node-locality), but when you are using 
multiple cpusets you want to opt for fairness between projects, over 
opportunistic optimizations such as node-local allocations. I.e. the 
spreading out, as it is used today, is rather a global fairness setting 
for the kernel, and not really a workload-specific access-pattern thing.  
Right?

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  8:22                 ` Ingo Molnar
  2006-02-06  8:40                   ` Ingo Molnar
@ 2006-02-06  8:47                   ` Paul Jackson
  2006-02-06  8:51                     ` Ingo Molnar
  2006-02-06 10:09                   ` Andi Kleen
  2 siblings, 1 reply; 102+ messages in thread
From: Paul Jackson @ 2006-02-06  8:47 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: akpm, dgc, steiner, Simon.Derr, ak, linux-kernel, clameter

Ingo asked:
> so in practice, the memory spreading is in fact a global setting, used
> by all cpusets that matter? 

I don't know if that is true or not.

I'll have to ask my field engineers, who actually have experience
with a variety of customer workloads.

... well, I do have partial knowledge of this.

When I was coding this, I suggested that instead of picking some of the
slab caches to memory spread, we pick them all, as that would be easier
to code.

That suggestion was shot down by others more experienced within SGI, as
some slab caches hold what is essentially per-thread data, that is
fairly hot in the thread context that allocated it.  Spreading that data
would quite predictably increase cross-node bus traffic, which is bad.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  8:47                   ` Paul Jackson
@ 2006-02-06  8:51                     ` Ingo Molnar
  2006-02-06  9:09                       ` Paul Jackson
  0 siblings, 1 reply; 102+ messages in thread
From: Ingo Molnar @ 2006-02-06  8:51 UTC (permalink / raw)
  To: Paul Jackson; +Cc: akpm, dgc, steiner, Simon.Derr, ak, linux-kernel, clameter


* Paul Jackson <pj@sgi.com> wrote:

> Ingo asked:
> > so in practice, the memory spreading is in fact a global setting, used
> > by all cpusets that matter? 
> 
> I don't know if that is true or not.
> 
> I'll have to ask my field engineers, who actually have experience
> with a variety of customer workloads.
> 
> ... well, I do have partial knowledge of this.
> 
> When I was coding this, I suggested that instead of picking some of the
> slab caches to memory spread, we pick them all, as that would be easier
> to code.
> 
> That suggestion was shot down by others more experienced within SGI, 
> as some slab caches hold what is essentially per-thread data, that is 
> fairly hot in the thread context that allocated it.  Spreading that 
> data would quite predictably increase cross-node bus traffic, which is 
> bad.

yes, but still that is a global attribute: we know that those slabs are 
fundamentally per-thread. They won't ever be non-per-thread. So the 
decision could be made via a per-slab attribute that is picked by 
kernel developers (initially you). The pagecache would be spread-out if
this .config option is specified. This makes it a much cleaner static
'kernel fairness policy' thing, instead of a fuzzier userspace thing. 
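
to illustrate, such a per-slab attribute could be a creation-time flag
(the SLAB_MEM_SPREAD name and this call site are illustrative only, not
from the posted patches):

	/* fs/dcache.c, hypothetical: tag this cache as holding objects
	 * that track a shared data set rather than a single thread, so
	 * the allocator may spread it across the allowed nodes. */
	dentry_cache = kmem_cache_create("dentry_cache",
					 sizeof(struct dentry), 0,
					 SLAB_RECLAIM_ACCOUNT | SLAB_PANIC |
					 SLAB_MEM_SPREAD,
					 NULL, NULL);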

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  8:40                   ` Ingo Molnar
@ 2006-02-06  9:03                     ` Paul Jackson
  2006-02-06  9:09                       ` Ingo Molnar
  2006-02-06 20:22                     ` Paul Jackson
  1 sibling, 1 reply; 102+ messages in thread
From: Paul Jackson @ 2006-02-06  9:03 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: akpm, dgc, steiner, Simon.Derr, ak, linux-kernel, clameter

Ingo wrote:
> I.e. the 
> spreading out, as it is used today, is rather a global fairness setting 
> for the kernel, and not really a workload-specific access-pattern thing.  
> Right?

I'm not quite sure where you're going with this, but I doubt I agree.
It's job specific, and cache specific.

If the job has a number of threads hitting the same data set and:
 1) the data set is faulted in non-uniformly (perhaps some
    job init task reads it in), and
 2) the data set is accessed with little thread locality
    (one thread is as likely as the next to read or write
    a particular page),
then for that job spreading makes sense.

If the cache is one that goes with a data set, such as file system
buffers (page cache) and inode and dentry slab caches, then for that
cache spreading makes sense.  (Yes Andrew, your xfs query is still in my
queue.)

But for many (most?) other jobs and other caches, the default node-local
policy is better.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  9:03                     ` Paul Jackson
@ 2006-02-06  9:09                       ` Ingo Molnar
  2006-02-06  9:27                         ` Paul Jackson
  0 siblings, 1 reply; 102+ messages in thread
From: Ingo Molnar @ 2006-02-06  9:09 UTC (permalink / raw)
  To: Paul Jackson; +Cc: akpm, dgc, steiner, Simon.Derr, ak, linux-kernel, clameter


* Paul Jackson <pj@sgi.com> wrote:

> It's job specific, and cache specific.
> 
> If the job has a number of threads hitting the same data set and:
>  1) the data set is faulted in non-uniformly (perhaps some
>     job init task reads it in), and
>  2) the data set is accessed with little thread locality
>     (one thread is as likely as the next to read or write
>     a particular page),
> then for that job spreading makes sense.
> 
> If the cache is one that goes with a data set, such as file system 
> buffers (page cache) and inode and dentry slab caches, then for that 
> cache spreading makes sense.  (Yes Andrew, your xfs query is still in 
> my queue.)
> 
> But for many (most?) other jobs and other caches, the default 
> node-local policy is better.

what type of objects need to be spread (currently)? It seems that your 
current focus is on filesystem related objects: pagecache, inodes, 
dentries - correct? Is there anything else that needs to be spread? In 
particular, does any userspace mapped memory need to be spread - or is 
it handled with other mechanisms?

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  8:51                     ` Ingo Molnar
@ 2006-02-06  9:09                       ` Paul Jackson
  0 siblings, 0 replies; 102+ messages in thread
From: Paul Jackson @ 2006-02-06  9:09 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: akpm, dgc, steiner, Simon.Derr, ak, linux-kernel, clameter

> yes, but still that is a global attribute:

Ok ... I am starting to see where you are going with this.

Well, certainly not global in the sense that a selected cache would be
spread over the whole system.  The data set read in by the job in one
cpuset must not pollute the memory of another cpuset.

But it -might- work to mark certain caches to be memory spread across
the current cpuset (to be precise, across current->mems_allowed), as
the default kernel placement policy for those selected caches, with
no per-cpuset mechanism to specify otherwise.

Or it might not work.

I don't know tonight.

I will have to wait for others to chime in.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-04  7:19 [PATCH 1/5] cpuset memory spread basic implementation Paul Jackson
                   ` (6 preceding siblings ...)
  2006-02-06  4:37 ` Andrew Morton
@ 2006-02-06  9:18 ` Simon Derr
  7 siblings, 0 replies; 102+ messages in thread
From: Simon Derr @ 2006-02-06  9:18 UTC (permalink / raw)
  To: Paul Jackson; +Cc: akpm, dgc, steiner, Simon.Derr, ak, linux-kernel, clameter

On Fri, 3 Feb 2006, Paul Jackson wrote:

> This policy can provide substantial improvements for jobs that
> need to place thread local data on the corresponding node, but
> that need to access large file system data sets that need to
> be spread across the several nodes in the jobs cpuset in order
> to fit.  Without this patch, especially for jobs that might
> have one thread reading in the data set, the memory allocation
> across the nodes in the jobs cpuset can become very uneven.
> 

Oh, that's good news for me.
I was receiving more and more complaints about this kind of issue.
I feel this is really a good answer to the typical "page cache ate all my 
node memory" case, which is *really* a pain for many HPC apps that access 
large files.

Thanks Paul.

AKPM wrote:

> IOW: this patch seems to be a highly specific bandaid which is repairing 
> an ill-advised problem of our own making, does it not?

I'm not sure about the 'ill-advised' part. All our efforts to let the 
kernel do the Right Thing by itself in all situations should not prevent 
us from remembering that Linux is not a time machine, and that sometimes, 
it is just a lot easier and probably better to give the kernel a few hints 
about what it should do.

And even if this can seem 'specific', this kind of workload is NOT rare, 
at least in HPC.

	Simon.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  9:09                       ` Ingo Molnar
@ 2006-02-06  9:27                         ` Paul Jackson
  2006-02-06  9:37                           ` Ingo Molnar
  0 siblings, 1 reply; 102+ messages in thread
From: Paul Jackson @ 2006-02-06  9:27 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: akpm, dgc, steiner, Simon.Derr, ak, linux-kernel, clameter

Ingo asked:
> what type of objects need to be spread (currently)? It seems that your 
> current focus is on filesystem related objects: 

In addition to the filesystem related objects called out in
this current patch set, we also have some xfs directory
and inode caches.  An xfs patch is winding its way toward
lkml that will enhance the xfs cache creation calls a little,
so that we can pick off the particular slab caches we need to
be able to spread, while leaving other xfs slab caches with
the default node-local policy.

>  does any userspace mapped memory need to be spread 

I don't think so, but I am not entirely confident of my answer
tonight.  I would expect the applications I care about to place mapped
pages by being careful to make the first access (load or store) of that
page from a cpu on the node where they wanted that page placed.

So, yes, either mostly filesystem related objects, or all such.

I'm not sure which.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  7:08     ` Andrew Morton
  2006-02-06  7:39       ` Ingo Molnar
@ 2006-02-06  9:32       ` Paul Jackson
  2006-02-06  9:57         ` Andrew Morton
  1 sibling, 1 reply; 102+ messages in thread
From: Paul Jackson @ 2006-02-06  9:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: dgc, steiner, Simon.Derr, ak, linux-kernel, clameter

Andrew wrote:
> Well I agree.

Good.


> And I think that the only way we'll get peak performance for
> an acceptably broad range of applications is to provide many fine-grained
> controls and the appropriate documentation and instrumentation to help
> developers and administrators use those controls.
> 
> We're all on the same page here.  I'm questioning whether slab and
> pagecache should be inextricably lumped together though.

They certainly don't need to be lumped.  I just don't go about
creating additional mechanism or apparatus until I smell the need.
(Well, sometimes I do -- too much fun. ;)

When Andrew Morton, who has far more history with this code than I,
recommends such additional mechanism, that's all the smelling I need.

How fine grained would you recommend, Andrew?

Is page vs slab cache the appropriate level of granularity?



> Is it possible to integrate the slab and pagecache allocation policies more
> cleanly into a process's mempolicy?  Right now, MPOL_* are disjoint.
> 
> (Why is the spreading policy part of cpusets at all?  Shouldn't it be part
> of the mempolicy layer?)

The NUMA mempolicy code handles per-task, task internal memory placement
policy, and the cpuset code handles cpuset-wide cpu and memory placement
policy.

In actual usage, spreading the kernel caches of a job is very much a
decision that is made per-job(*), by the system administrator or batch
scheduler, not by the application coder.  The application code may well
be -very- aware of the placement of its data pages in user address
space, and to manage this will use calls such as mbind and
set_mempolicy, in addition to using node-local placement (arranging to
fault in each page from a thread running on the node that is to receive
that page).  The application has no interest in micromanaging the
kernel's placement of page and slab caches, other than choosing between
node-local and cpuset spread strategies.

(*) Actually, made per-cpuset, not per-job.  But where this matters,
    that tends to be the same thing.
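
For readers who have not used those calls, a minimal userspace sketch of
the mbind-plus-first-touch pattern described above; it assumes libnuma's
<numaif.h> is installed and that node 1 exists, both purely for
illustration:

#define _GNU_SOURCE
#include <numaif.h>		/* mbind(), MPOL_BIND; link with -lnuma */
#include <sys/mman.h>
#include <string.h>

int main(void)
{
	size_t len = 4UL << 20;			/* 4 MB of thread-local data */
	unsigned long nodemask = 1UL << 1;	/* node 1, for illustration */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	/* Bind this region's future pages to node 1 ... */
	if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0))
		return 1;

	/* ... and fault them in now: placement happens at first touch. */
	memset(buf, 0, len);
	return 0;
}

The pure first-touch variant needs no policy call at all: pin the thread
to a cpu on the target node and simply touch the pages first.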

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  9:27                         ` Paul Jackson
@ 2006-02-06  9:37                           ` Ingo Molnar
  0 siblings, 0 replies; 102+ messages in thread
From: Ingo Molnar @ 2006-02-06  9:37 UTC (permalink / raw)
  To: Paul Jackson; +Cc: akpm, dgc, steiner, Simon.Derr, ak, linux-kernel, clameter


* Paul Jackson <pj@sgi.com> wrote:

> Ingo asked:
> > what type of objects need to be spread (currently)? It seems that your 
> > current focus is on filesystem related objects: 
> 
> In addition to the filesystem related objects called out in this 
> current patch set, we also have some xfs directory and inode caches.  
> An xfs patch is winding its way toward lkml that will enhance the xfs 
> cache creation calls a little, so that we can pick off the particular 
> slab caches we need to be able to spread, while leaving other xfs slab 
> caches with the default node-local policy.
> 
> >  does any userspace mapped memory need to be spread 
> 
> I don't think so, but I am not entirely confident of my answer 
> tonight.  I would expect the applications I care about to place mapped 
> pages by being careful to make the first access (load or store) of 
> that page from a cpu on the node where they wanted that page placed.
> 
> So, yes, either mostly filesystem related objects, or all such.
> 
> I'm not sure which.

if that's the case, then i think the best way to express this would be 
to categorize file objects into two groups: "global" (spread-out) and 
"local". Since filesystem space is already categorized per-project, this 
is also practical for the admin to do.

i.e. mountpoints/directories/files would get a 'locality of reference' 
attribute, and whenever the VFS allocates memory related to those files, 
it will do so based on the attribute. (The attribute is inherited deeper 
in the hierarchy - i.e. setting a 'global' attribute for a mountpoint 
makes all files within that filesystem spread-out.)
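
A toy model of that inheritance rule, runnable as plain C; nothing below
is an existing VFS interface, it only demonstrates "nearest ancestor with
an explicit flag wins":

#include <stdio.h>

enum cache_policy { POLICY_INHERIT, POLICY_LOCAL, POLICY_GLOBAL };

struct tree_node {
	const char *name;
	enum cache_policy flag;		/* POLICY_INHERIT == "no flag set" */
	struct tree_node *parent;
};

/* Effective policy: walk towards the root until an explicit flag is found. */
static enum cache_policy effective_policy(const struct tree_node *n)
{
	while (n && n->flag == POLICY_INHERIT)
		n = n->parent;
	return n ? n->flag : POLICY_LOCAL;	/* assumed global default */
}

int main(void)
{
	struct tree_node root = { "/", POLICY_LOCAL, NULL };
	struct tree_node home = { "/home", POLICY_INHERIT, &root };
	struct tree_node proj = { "/home/proj1", POLICY_GLOBAL, &home };
	struct tree_node db   = { "/home/proj1/db", POLICY_INHERIT, &proj };

	printf("%s -> %s\n", db.name,
	       effective_policy(&db) == POLICY_GLOBAL ? "spread out" : "local");
	return 0;
}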

this is much cleaner i think, and easy/intuitive to configure. This 
would have performance advantages over your current approach as well: 
e.g. /tmp would always stay "local", in all cpusets - while with your 
current patch it would be spread out. Bigger applications (like databases) 
would set this attribute themselves - but sysadmins would do it too, on 
shared boxes.

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  9:32       ` Paul Jackson
@ 2006-02-06  9:57         ` Andrew Morton
  0 siblings, 0 replies; 102+ messages in thread
From: Andrew Morton @ 2006-02-06  9:57 UTC (permalink / raw)
  To: Paul Jackson; +Cc: dgc, steiner, Simon.Derr, ak, linux-kernel, clameter

Paul Jackson <pj@sgi.com> wrote:
>
>  Is page vs slab cache the appropriate level of granularity?
> 

I guess so.  Doing it on a per-cpuset+per-slab or per-cpuset+per-inode
basis would get a bit complex implementation-wise, I expect.  And a smart
application could roughly implement that itself anyway by turning spreading
on and off as it goes.
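
What "turning spreading on and off as it goes" might look like from the
application side, as a sketch; it assumes the cpuset filesystem is mounted
at /dev/cpuset and that the task runs in a cpuset named "bigjob", both of
which are illustrative:

#include <fcntl.h>
#include <unistd.h>

/* Hypothetical path; depends on where the admin mounted the cpuset fs. */
#define SPREAD_FILE "/dev/cpuset/bigjob/memory_spread"

static int set_memory_spread(int on)
{
	int fd = open(SPREAD_FILE, O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, on ? "1" : "0", 1);
	close(fd);
	return n == 1 ? 0 : -1;
}

/* e.g.: set_memory_spread(1); read_big_shared_files(); set_memory_spread(0); */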

One does wonder about the buffer-head slab - it's pretty tightly bound to
pagecache..


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  8:22                 ` Ingo Molnar
  2006-02-06  8:40                   ` Ingo Molnar
  2006-02-06  8:47                   ` Paul Jackson
@ 2006-02-06 10:09                   ` Andi Kleen
  2006-02-06 10:11                     ` Ingo Molnar
  2 siblings, 1 reply; 102+ messages in thread
From: Andi Kleen @ 2006-02-06 10:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul Jackson, akpm, dgc, steiner, Simon.Derr, linux-kernel, clameter

On Monday 06 February 2006 09:22, Ingo Molnar wrote:
> 
> * Paul Jackson <pj@sgi.com> wrote:
> 
> > Ingo wrote:
> > > Could you perhaps outline two actual use-cases 
> > > that would need two cpusets with different policies,
> > > on the same box?
> > 
> > We normally run with different policies, in the same box, on different 
> > cpusets at the same time.  But this might be because some cpusets 
> > -need- the memory spreading, and the others that don't are left to the 
> > default policy.
> 
> so in practice, the memory spreading is in fact a global setting, used
> by all cpusets that matter? That seems to support Andrew's observation
> that our assumptions / defaults are bad, pretty much independently of
> the workload.

Yes in general page cache and other global caches should be spread 
around all nodes by default. There was a patch to do this for the page 
cache in SLES9 for a long time, but d/icache and possibly other slab
caches have of course the same problem. For doing IO caching it doesn't matter much if 
the cache is local or not because memory access in read/write is not that
critical. But it's a big problem when one node fills up with 
IO (or dentry or inode) caches because an IO intensive process ran
on it at some time and later a process running there cannot get local
memory.

Of course there might be some corner cases where using local
memory for caching is still better (like mmap file IO), but my 
guess is that it isn't a good default. 

-Andi

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 10:09                   ` Andi Kleen
@ 2006-02-06 10:11                     ` Ingo Molnar
  2006-02-06 10:16                       ` Andi Kleen
  0 siblings, 1 reply; 102+ messages in thread
From: Ingo Molnar @ 2006-02-06 10:11 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Paul Jackson, akpm, dgc, steiner, Simon.Derr, linux-kernel, clameter


* Andi Kleen <ak@suse.de> wrote:

> Of course there might be some corner cases where using local memory 
> for caching is still better (like mmap file IO), but my guess is that 
> it isn't a good default.

/tmp is almost certainly one where local memory is better.

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 10:11                     ` Ingo Molnar
@ 2006-02-06 10:16                       ` Andi Kleen
  2006-02-06 10:23                         ` Ingo Molnar
  2006-02-06 14:35                         ` Paul Jackson
  0 siblings, 2 replies; 102+ messages in thread
From: Andi Kleen @ 2006-02-06 10:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul Jackson, akpm, dgc, steiner, Simon.Derr, linux-kernel, clameter

On Monday 06 February 2006 11:11, Ingo Molnar wrote:
> 
> * Andi Kleen <ak@suse.de> wrote:
> 
> > Of course there might be some corner cases where using local memory 
> > for caching is still better (like mmap file IO), but my guess is that 
> > it isn't a good default.
> 
> /tmp is almost certainly one where local memory is better.

Not sure. What happens if someone writes a 1GB /tmp file on a 1GB node?

Christoph recently added some changes to the page allocator to 
try harder to get local memory to work around this problem, but
attacking it at the root might be better.

Perhaps one could do a "near" caching policy for big machines: e.g. 
if on a big Altix prefer to put it on a not too far away node, but
spread it out evenly. But it's not clear yet such complexity is needed.
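
A userspace toy model of such a "near" policy, purely to show the idea;
the distance table stands in for the firmware-provided node distances and
the factor-of-2 cutoff echoes the number above:

#include <stdio.h>

#define NNODES		4
#define LOCAL_DIST	10	/* conventional distance to the local node */

/* Illustrative distance table (row: from, column: to). */
static const int dist[NNODES][NNODES] = {
	{ 10, 20, 40, 40 },
	{ 20, 10, 40, 40 },
	{ 40, 40, 10, 20 },
	{ 40, 40, 20, 10 },
};

/* Round-robin over nodes no farther than 'factor' times the local latency. */
static int near_spread_node(int local, int *rotor, int factor)
{
	int tries, n = *rotor;

	for (tries = 0; tries < NNODES; tries++) {
		n = (n + 1) % NNODES;
		if (dist[local][n] <= factor * LOCAL_DIST)
			break;
	}
	*rotor = n;
	return n;
}

int main(void)
{
	int rotor = 0, i;

	for (i = 0; i < 6; i++)
		printf("cache page %d -> node %d\n",
		       i, near_spread_node(0, &rotor, 2));
	return 0;
}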

-Andi

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 10:16                       ` Andi Kleen
@ 2006-02-06 10:23                         ` Ingo Molnar
  2006-02-06 10:35                           ` Andi Kleen
  2006-02-06 14:42                           ` Paul Jackson
  2006-02-06 14:35                         ` Paul Jackson
  1 sibling, 2 replies; 102+ messages in thread
From: Ingo Molnar @ 2006-02-06 10:23 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Paul Jackson, akpm, dgc, steiner, Simon.Derr, linux-kernel, clameter


* Andi Kleen <ak@suse.de> wrote:

> On Monday 06 February 2006 11:11, Ingo Molnar wrote:
> > 
> > * Andi Kleen <ak@suse.de> wrote:
> > 
> > > Of course there might be some corner cases where using local memory 
> > > for caching is still better (like mmap file IO), but my guess is that 
> > > it isn't a good default.
> > 
> > /tmp is almost certainly one where local memory is better.
> 
> Not sure. What happens if someone writes a 1GB /tmp file on a 1GB 
> node?

well, if the pagecache is filled on a node above a certain ratio then 
one would have to spread it out forcibly. But otherwise, try to keep 
things as local as possible, because that will perform best. This is 
different from the case Paul's patch is addressing: workloads which are 
known to be global (and hence spreading out is the best-performing 
allocation).
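
A toy model of that "spread only when the local node is too full of cache"
idea; the per-node ratios and the threshold are invented, it is only meant
to show the shape of such a policy:

#include <stdio.h>

#define NNODES 4

/* Illustrative per-node state: fraction of node memory holding pagecache. */
static double pagecache_ratio[NNODES] = { 0.85, 0.30, 0.25, 0.40 };

static int place_cache_page(int local_node, int *rotor, double threshold)
{
	if (pagecache_ratio[local_node] < threshold)
		return local_node;		/* normal case: stay node-local */

	*rotor = (*rotor + 1) % NNODES;		/* overflow case: round-robin */
	return *rotor;
}

int main(void)
{
	int rotor = 0, i;

	for (i = 0; i < 4; i++)
		printf("page %d -> node %d\n",
		       i, place_cache_page(0, &rotor, 0.80));
	return 0;
}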

(for which problem i suggested a per-mount/directory/file 
locality-of-reference attribute in another post.)

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 10:23                         ` Ingo Molnar
@ 2006-02-06 10:35                           ` Andi Kleen
  2006-02-06 14:42                           ` Paul Jackson
  1 sibling, 0 replies; 102+ messages in thread
From: Andi Kleen @ 2006-02-06 10:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul Jackson, akpm, dgc, steiner, Simon.Derr, linux-kernel, clameter

On Monday 06 February 2006 11:23, Ingo Molnar wrote:

> 
> well, if the pagecache is filled on a node above a certain ratio then 
> one would have to spread it out forcibly. 

In theory yes. In practice it doesn't work that well ...

> But otherwise, try to keep  
> things as local as possible, because that will perform best. 

Experience teaches differently. For IO caches (and d/icache) strict local 
caching doesn't seem to be the best policy because it competes with more
important mapped memory too much.

> This is  
> different from the case Paul's patch is addressing: workloads which are 
> known to be global (and hence spreading out is the best-performing 
> allocation).
> 
> (for which problem i suggested a per-mount/directory/file 
> locality-of-reference attribute in another post.)

iirc there is already a patch around for tmpfs to do that. But the 
interesting point here is what the default should be. And what
to do with the d/icaches by default.

-Andi
 

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 10:16                       ` Andi Kleen
  2006-02-06 10:23                         ` Ingo Molnar
@ 2006-02-06 14:35                         ` Paul Jackson
  2006-02-06 16:48                           ` Christoph Lameter
  1 sibling, 1 reply; 102+ messages in thread
From: Paul Jackson @ 2006-02-06 14:35 UTC (permalink / raw)
  To: Andi Kleen; +Cc: mingo, akpm, dgc, steiner, Simon.Derr, linux-kernel, clameter

Andi wrote:
> Perhaps one could do a "near" caching policy for big machines: e.g. 
> if on a big Altix prefer to put it on a not too far away node, but
> spread it out evenly. But it's not clear yet such complexity is needed.

I suspect that spreading evenly over the current task's mems_allowed is
just what is needed.

There is nothing special about big Altix systems here; just use
task->mems_allowed.  For all but tasks using MPOL_BIND, this means
spreading the caching over the nodes in the task's cpuset.
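
One simple way to spread evenly over an allowed-node mask is a per-task
round-robin cursor; the plain-C model below is runnable in userspace and
is only an illustration of that shape, not the kernel code:

#include <stdio.h>

#define MAX_NODES 8

struct task_model {
	unsigned long mems_allowed;	/* bit i set => node i allowed */
	int spread_rotor;		/* last node handed out */
};

static int mem_spread_node(struct task_model *t)
{
	int i, n = t->spread_rotor;

	for (i = 0; i < MAX_NODES; i++) {
		n = (n + 1) % MAX_NODES;
		if (t->mems_allowed & (1UL << n))
			break;
	}
	t->spread_rotor = n;
	return n;
}

int main(void)
{
	struct task_model t = { 0xb /* nodes 0, 1, 3 */, 0 };
	int i;

	for (i = 0; i < 6; i++)
		printf("allocation %d -> node %d\n", i, mem_spread_node(&t));
	return 0;
}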

We don't want to spread over a larger area, because cpusets need to
maintain a fairly aggressive containment.  One job's big data set must
not pollute the cpuset of another job, except in cases involving kernel
distress.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 10:23                         ` Ingo Molnar
  2006-02-06 10:35                           ` Andi Kleen
@ 2006-02-06 14:42                           ` Paul Jackson
  1 sibling, 0 replies; 102+ messages in thread
From: Paul Jackson @ 2006-02-06 14:42 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: ak, akpm, dgc, steiner, Simon.Derr, linux-kernel, clameter

Ingo wrote:
> the case Paul's patch is addressing: workloads which are 
> known to be global (and hence spreading out is the best-performing 
> allocation).

Just to be clear, note that even in my case, we require the node-local
policy for some allocations (into the user's portion of the address
space) at the same time as we require the spread policy (not
sure I like 'global' here) for other allocations (file stuff).

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 14:35                         ` Paul Jackson
@ 2006-02-06 16:48                           ` Christoph Lameter
  2006-02-06 17:11                             ` Andi Kleen
  0 siblings, 1 reply; 102+ messages in thread
From: Christoph Lameter @ 2006-02-06 16:48 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Andi Kleen, mingo, akpm, dgc, steiner, Simon.Derr, linux-kernel

Just some words of clarification here:

Memory spreading is useful if multiple processes are running on 
multiple nodes that access the same set of files (and therefore use the 
same dentries, inodes, etc.). This is only true for very large applications.
Users typically segment a machine using cpusets to give these apps a 
range of nodes and processors to use. The rest of the system may be used
for other needs. The spreading should therefore be restricted to the set 
of nodes in use by that application.

This is very different from the typical case of a single threaded process 
roaming across some data and then terminating. In that case we always want 
placement of memory as near to the process as possible. In cases where we 
are not sure about future application behavior it is best to assume that 
node local is best. Spreading memory allocations for storage that is only 
accessed from one processor will reduce the performance of an application.

So the default operating mode needs to be node local.

There is one exception to this during system bootup. In that case we are 
allocating structures that will finally be used by processes running on 
all nodes on the system. It is therefore advantageous to spread these 
allocations out. That is currently accomplished by setting the default 
memory allocation policy to MPOL_INTERLEAVE while a kernel boots. Memory 
allocation policies revert to default (node local) when init starts.
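
The userspace analogue of that interleave-then-revert sequence, using the
real set_mempolicy(2) call from <numaif.h>; the node mask and allocation
sizes are illustrative:

#include <numaif.h>		/* set_mempolicy(), MPOL_*; link with -lnuma */
#include <stdlib.h>
#include <string.h>

int main(void)
{
	unsigned long nodes = 0xf;	/* nodes 0-3, illustrative */
	char *shared, *priv;

	/* Phase 1: long-lived, globally used structures -> interleave them. */
	if (set_mempolicy(MPOL_INTERLEAVE, &nodes, sizeof(nodes) * 8))
		return 1;
	shared = malloc(16UL << 20);
	if (!shared)
		return 1;
	memset(shared, 0, 16UL << 20);	/* fault in while interleave is active */

	/* Phase 2: revert to the default node-local policy. */
	if (set_mempolicy(MPOL_DEFAULT, NULL, 0))
		return 1;
	priv = malloc(1UL << 20);
	if (!priv)
		return 1;
	memset(priv, 0, 1UL << 20);	/* these pages land node-local */

	free(priv);
	free(shared);
	return 0;
}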

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 16:48                           ` Christoph Lameter
@ 2006-02-06 17:11                             ` Andi Kleen
  2006-02-06 18:21                               ` Christoph Lameter
  0 siblings, 1 reply; 102+ messages in thread
From: Andi Kleen @ 2006-02-06 17:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul Jackson, mingo, akpm, dgc, steiner, Simon.Derr, linux-kernel

On Monday 06 February 2006 17:48, Christoph Lameter wrote:

> This is very different from the typical case of a single threaded process 
> roaming across some data and then terminating. In that case we always want 
> placement of memory as near to the process as possible. In cases where we 
> are not sure about future application behavior it is best to assume that 
> node local is best. Spreading memory allocations for storage that is only 
> accessed from one processor will reduce the performance of an application.
> 
> So the default operating mode needs to be node local.

I still don't quite agree. As long as the latency penalty of going
off node is not too bad (let's say < factor 2) i think it's better
to spread out the caches than to always locate them locally.
That is because kernel object/data cache accesses are far less frequent
than user mapped memory accesses. And it's a good idea to give
the latter a head start on local memory.

If you have a much worse worst case NUMA factor it might be different,
but even there it would be a good idea to at least spread it out
to nearby nodes.

-Andi


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 2/5] cpuset memory spread page cache implementation and hooks
  2006-02-06  7:51                       ` Pekka J Enberg
@ 2006-02-06 17:32                         ` Pekka Enberg
  0 siblings, 0 replies; 102+ messages in thread
From: Pekka Enberg @ 2006-02-06 17:32 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Andrew Morton, clameter, steiner, dgc, Simon.Derr, ak,
	linux-kernel, manfred

Hi,

On Mon, 6 Feb 2006, Pekka J Enberg wrote:
> > Actually, the above patch isn't probably any good as it moves 
> > cache_alloc_cpucache() out-of-line which should be the common case for 
> > NUMA too (it's hurting kmem_cache_alloc and kmalloc). The following should 
> > be better.

Paul, as requested, here's one on top of your cpuset changes. This one
preserves kernel text size for both UMA and NUMA so it's code
restructuring only. The optimizations you did in the cpuset patches
already make the NUMA fastpath so small that I wasn't able to improve it
on i386.

			Pekka

Subject: slab: alloc path consolidation
From: Pekka Enberg <penberg@cs.helsinki.fi>

This patch consolidates the UMA and NUMA memory allocation paths in the
slab allocator. This is accomplished by making the UMA-path look like
we are on NUMA but always allocating from the current node. The patch has
no functional changes, only code restructuring.

Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Christoph Lameter <christoph@lameter.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
---

 mm/slab.c |   98 ++++++++++++++++++++++++++++++++++----------------------------
 1 file changed, 54 insertions(+), 44 deletions(-)

Index: 2.6-cpuset/mm/slab.c
===================================================================
--- 2.6-cpuset.orig/mm/slab.c
+++ 2.6-cpuset/mm/slab.c
@@ -830,7 +830,6 @@ static struct array_cache *alloc_arrayca
 
 #ifdef CONFIG_NUMA
 static void *__cache_alloc_node(struct kmem_cache *, gfp_t, int);
-static void *alternate_node_alloc(struct kmem_cache *, gfp_t);
 
 static struct array_cache **alloc_alien_cache(int node, int limit)
 {
@@ -2667,17 +2666,11 @@ static void *cache_alloc_debugcheck_afte
 #define cache_alloc_debugcheck_after(a,b,objp,d) (objp)
 #endif
 
-static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
+static inline void *cache_alloc_cpu_cache(struct kmem_cache *cachep, gfp_t flags)
 {
 	void *objp;
 	struct array_cache *ac;
 
-#ifdef CONFIG_NUMA
-	if (unlikely(current->flags & (PF_MEM_SPREAD|PF_MEMPOLICY)))
-		if ((objp = alternate_node_alloc(cachep, flags)) != NULL)
-			return objp;
-#endif
-
 	check_irq_off();
 	ac = cpu_cache_get(cachep);
 	if (likely(ac->avail)) {
@@ -2691,23 +2684,6 @@ static inline void *____cache_alloc(stru
 	return objp;
 }
 
-static __always_inline void *
-__cache_alloc(struct kmem_cache *cachep, gfp_t flags, void *caller)
-{
-	unsigned long save_flags;
-	void *objp;
-
-	cache_alloc_debugcheck_before(cachep, flags);
-
-	local_irq_save(save_flags);
-	objp = ____cache_alloc(cachep, flags);
-	local_irq_restore(save_flags);
-	objp = cache_alloc_debugcheck_after(cachep, flags, objp,
-					    caller);
-	prefetchw(objp);
-	return objp;
-}
-
 #ifdef CONFIG_NUMA
 /*
  * Try allocating on another node if PF_MEM_SPREAD or PF_MEMPOLICY.
@@ -2788,8 +2764,58 @@ static void *__cache_alloc_node(struct k
       done:
 	return obj;
 }
+
+static __always_inline void *__cache_alloc(struct kmem_cache *cachep,
+					   gfp_t flags, int nodeid)
+{
+	void *objp;
+
+	if (nodeid == -1 || nodeid == numa_node_id() ||
+	    !cachep->nodelists[nodeid]) {
+		if (unlikely(current->flags & (PF_MEM_SPREAD|PF_MEMPOLICY))) {
+			objp = alternate_node_alloc(cachep, flags);
+			if (objp)
+				goto out;
+		}
+		objp = cache_alloc_cpu_cache(cachep, flags);
+	} else
+		objp = __cache_alloc_node(cachep, flags, nodeid);
+  out:
+	return objp;
+}
+
+#else
+
+static __always_inline void *__cache_alloc(struct kmem_cache *cachep,
+					   gfp_t flags, int nodeid)
+{
+	return cache_alloc_cpu_cache(cachep, flags);
+}
+
 #endif
 
+static __always_inline void *cache_alloc(struct kmem_cache *cache,
+					 gfp_t flags, int nodeid, void *caller)
+{
+	unsigned long save_flags;
+	void *obj;
+
+	cache_alloc_debugcheck_before(cache, flags);
+	local_irq_save(save_flags);
+	obj = __cache_alloc(cache, flags, nodeid);
+	local_irq_restore(save_flags);
+	obj = cache_alloc_debugcheck_after(cache, flags, obj, caller);
+	return obj;
+}
+
+static __always_inline void *
+cache_alloc_current(struct kmem_cache *cache, gfp_t flags, void *caller)
+{
+	void *obj = cache_alloc(cache, flags, -1, caller);
+	prefetchw(obj);
+	return obj;
+}
+
 /*
  * Caller needs to acquire correct kmem_list's list_lock
  */
@@ -2951,7 +2977,7 @@ static inline void __cache_free(struct k
  */
 void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 {
-	return __cache_alloc(cachep, flags, __builtin_return_address(0));
+	return cache_alloc_current(cachep, flags, __builtin_return_address(0));
 }
 EXPORT_SYMBOL(kmem_cache_alloc);
 
@@ -3012,23 +3038,7 @@ int fastcall kmem_ptr_validate(struct km
  */
 void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 {
-	unsigned long save_flags;
-	void *ptr;
-
-	cache_alloc_debugcheck_before(cachep, flags);
-	local_irq_save(save_flags);
-
-	if (nodeid == -1 || nodeid == numa_node_id() ||
-	    !cachep->nodelists[nodeid])
-		ptr = ____cache_alloc(cachep, flags);
-	else
-		ptr = __cache_alloc_node(cachep, flags, nodeid);
-	local_irq_restore(save_flags);
-
-	ptr = cache_alloc_debugcheck_after(cachep, flags, ptr,
-					   __builtin_return_address(0));
-
-	return ptr;
+	return cache_alloc(cachep, flags, nodeid, __builtin_return_address(0));
 }
 EXPORT_SYMBOL(kmem_cache_alloc_node);
 
@@ -3078,7 +3088,7 @@ static __always_inline void *__do_kmallo
 	cachep = __find_general_cachep(size, flags);
 	if (unlikely(cachep == NULL))
 		return NULL;
-	return __cache_alloc(cachep, flags, caller);
+	return cache_alloc_current(cachep, flags, caller);
 }
 
 #ifndef CONFIG_DEBUG_SLAB



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 17:11                             ` Andi Kleen
@ 2006-02-06 18:21                               ` Christoph Lameter
  2006-02-06 18:36                                 ` Andi Kleen
  0 siblings, 1 reply; 102+ messages in thread
From: Christoph Lameter @ 2006-02-06 18:21 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Paul Jackson, mingo, akpm, dgc, steiner, Simon.Derr, linux-kernel

On Mon, 6 Feb 2006, Andi Kleen wrote:

> I still don't quite agree. As long as the latency penalty of going
> off node is not too bad (let's say < factor 2) i think it's better
> to spread out the caches than to always locate them locally.

AFAIK you can reach these low latency factors only if multiple nodes are 
on the same motherboard. Likely Opteron specific?

> If you have a much worse worst case NUMA factor it might be different,
> but even there it would be a good idea to at least spread it out
> to nearby nodes.

I dont understand you here. What would be the benefit of selecting more 
distant memory over local? I can only imagine that this would be 
beneficial if we know that the data would be used later by other 
processes.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 18:21                               ` Christoph Lameter
@ 2006-02-06 18:36                                 ` Andi Kleen
  2006-02-06 18:43                                   ` Christoph Lameter
  2006-02-06 18:43                                   ` Ingo Molnar
  0 siblings, 2 replies; 102+ messages in thread
From: Andi Kleen @ 2006-02-06 18:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul Jackson, mingo, akpm, dgc, steiner, Simon.Derr, linux-kernel

On Monday 06 February 2006 19:21, Christoph Lameter wrote:
> On Mon, 6 Feb 2006, Andi Kleen wrote:
> 
> > I still don't quite agree. As long as the latency penalty of going
> > off node is not too bad (let's say < factor 2) i think it's better
> > to spread out the caches than to always locate them locally.
> 
> AFAIK you can reach these low latency factors only if multiple nodes are 
> on the same motherboard. Likely Opteron specific?

Should be true for most CPUs with integrated memory controller.

Anyway, the 2 was just an example; the true number would probably
need to be found with benchmarks.

> 
> > If you have a much worse worst case NUMA factor it might be different,
> > but even there it would be a good idea to at least spread it out
> > to nearby nodes.
> 
> I dont understand you here. What would be the benefit of selecting more 
> distant memory over local? I can only imagine that this would be 
> beneficial if we know that the data would be used later by other 
> processes.

The benefit would be to not fill up the local node as quickly when
you do something IO (or dcache intensive).  And on contrary when you
do something local memory intensive on that node then you won't need
to throw away all the IO caches if they are already spread out.

The kernel uses of these cached objects are not really _that_ latency 
sensitive and not that frequent so it makes sense to spread it out a 
bit to nearby nodes.

-Andi

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 18:36                                 ` Andi Kleen
@ 2006-02-06 18:43                                   ` Christoph Lameter
  2006-02-06 18:48                                     ` Andi Kleen
  2006-02-06 18:43                                   ` Ingo Molnar
  1 sibling, 1 reply; 102+ messages in thread
From: Christoph Lameter @ 2006-02-06 18:43 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Paul Jackson, mingo, akpm, dgc, steiner, Simon.Derr, linux-kernel

On Mon, 6 Feb 2006, Andi Kleen wrote:

> > AFAIK you can reach these low latency factors only if multiple nodes are 
> > on the same motherboard. Likely Opteron specific?
> 
> Should be true for most CPUs with integrated memory controller.

Even the best memory controller cannot violate the laws of physics 
(electrons can travel at most at the speed of light). Therefore
cable lengths have a major influence on the latency of signals.

> > I dont understand you here. What would be the benefit of selecting more 
> > distant memory over local? I can only imagine that this would be 
> > beneficial if we know that the data would be used later by other 
> > processes.
> 
> The benefit would be to not fill up the local node as quickly when
> you do something IO (or dcache intensive).  And on contrary when you
> do something local memory intensive on that node then you won't need
> to throw away all the IO caches if they are already spread out.

An efficient local reclaim should deal with that situation. zone_reclaim 
will free up portions of memory in order to stay on node.

> The kernel uses of these cached objects are not really _that_ latency 
> sensitive and not that frequent so it makes sense to spread it out a 
> bit to nearby nodes.

The impact of spreading cached objects will depend on the application and 
the NUMA latencies in the system.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 18:36                                 ` Andi Kleen
  2006-02-06 18:43                                   ` Christoph Lameter
@ 2006-02-06 18:43                                   ` Ingo Molnar
  2006-02-06 20:01                                     ` Paul Jackson
  1 sibling, 1 reply; 102+ messages in thread
From: Ingo Molnar @ 2006-02-06 18:43 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Lameter, Paul Jackson, akpm, dgc, steiner, Simon.Derr,
	linux-kernel


* Andi Kleen <ak@suse.de> wrote:

> > > If you have a much worse worst case NUMA factor it might be different,
> > > but even there it would be a good idea to at least spread it out
> > > to nearby nodes.
> > 
> > I dont understand you here. What would be the benefit of selecting more 
> > distant memory over local? I can only imagine that this would be 
> > beneficial if we know that the data would be used later by other 
> > processes.
> 
> The benefit would be to not fill up the local node as quickly when you 
> do something IO (or dcache intensive).  And on contrary when you do 
> something local memory intensive on that node then you won't need to 
> throw away all the IO caches if they are already spread out.
> 
> The kernel uses of these cached objects are not really _that_ latency 
> sensitive and not that frequent so it makes sense to spread it out a 
> bit to nearby nodes.

I'm not sure i agree. If a cache isnt that important, then there wont be 
that many of them (hence they cannot interact with user pages that 
much), and it wont be used that frequently -> the VM will discard it 
faster. If there's tons of dentries and inodes and pagecache around, 
then there must be a reason it's around: it was actively used. In that 
case we should spread them out only if we know in advance that their use 
is global, not local - and we should default to local.

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 18:43                                   ` Christoph Lameter
@ 2006-02-06 18:48                                     ` Andi Kleen
  2006-02-06 19:19                                       ` Christoph Lameter
  2006-02-06 20:27                                       ` Paul Jackson
  0 siblings, 2 replies; 102+ messages in thread
From: Andi Kleen @ 2006-02-06 18:48 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul Jackson, mingo, akpm, dgc, steiner, Simon.Derr, linux-kernel

On Monday 06 February 2006 19:43, Christoph Lameter wrote:

> The impact of spreading cached objects will depend on the application and 
> the NUMA latencies in the system.

Yes I can see it not working well when a dentry is put at the other 
end of a 256 node altix. That is why just spreading it to nearby
nodes might be an alternative.

On the other hand global interleaving actually worked for the page cache 
in production in SLES9, so it can't be that bad.

Also I'm sure you can construct some workload where it is a major loss.
For those one has NUMA policy to adjust it (although I don't know yet
how to apply separate numa policy to the d/i/file page cache - but if 
it should be a real problem it could surely be solved somehow)

The question is just if it's a common situation. My guess is that just
giving local memory priority but not throwing away all IO caches
when the local node fills up would be a generally useful default policy.

-Andi


>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 18:48                                     ` Andi Kleen
@ 2006-02-06 19:19                                       ` Christoph Lameter
  2006-02-06 20:27                                       ` Paul Jackson
  1 sibling, 0 replies; 102+ messages in thread
From: Christoph Lameter @ 2006-02-06 19:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Paul Jackson, mingo, akpm, dgc, steiner, Simon.Derr, linux-kernel

On Mon, 6 Feb 2006, Andi Kleen wrote:
> On the other hand global interleaving actually worked for the page cache 
> in production in SLES9, so it can't be that bad.

I would see it as an emergency measure given the bad control over locality 
in SLES9 and the lack of an efficient zone reclaim.

> The question is just if it's a common situation. My guess is that just
> giving local memory priority but not throwing away all IO caches
> when the local node fills up would be a generally useful default policy.

We do not throw away "all IO caches". We take a portion of the inactive 
list and scan for freeable pages.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 18:43                                   ` Ingo Molnar
@ 2006-02-06 20:01                                     ` Paul Jackson
  2006-02-06 20:05                                       ` Ingo Molnar
  0 siblings, 1 reply; 102+ messages in thread
From: Paul Jackson @ 2006-02-06 20:01 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: ak, clameter, akpm, dgc, steiner, Simon.Derr, linux-kernel

Ingo wrote:
> we should default to local.

Agreed.  There is a lot of software, and there are many systems management
expectations, sitting on top of this that assume a certain default
memory placement behaviour, to a rough degree, from the system.

They are expecting node-local placement.

We would only change that default if it was shown to be substantially
wrong headed in a substantial number of cases.  It has not been
so shown.  It is either an adequate or quite desirable default for
most cases.

Rather we need to consider optional behaviour, for use on workloads
for which other policies are worth developing and invoking.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 20:01                                     ` Paul Jackson
@ 2006-02-06 20:05                                       ` Ingo Molnar
  2006-02-06 20:27                                         ` Christoph Lameter
  2006-02-06 23:45                                         ` Paul Jackson
  0 siblings, 2 replies; 102+ messages in thread
From: Ingo Molnar @ 2006-02-06 20:05 UTC (permalink / raw)
  To: Paul Jackson; +Cc: ak, clameter, akpm, dgc, steiner, Simon.Derr, linux-kernel


* Paul Jackson <pj@sgi.com> wrote:

> Ingo wrote:
> > we should default to local.
> 
> Agreed.  There is a lot of software, and there are many systems management
> expectations, sitting on top of this that assume a certain default
> memory placement behaviour, to a rough degree, from the system.
> 
> They are expecting node-local placement.
> 
> We would only change that default if it was shown to be substantially 
> wrong headed in a substantial number of cases.  It has not been so 
> shown.  It is either an adequate or quite desirable default for most 
> cases.
> 
> Rather we need to consider optional behaviour, for use on workloads 
> for which other policies are worth developing and invoking.

yes. And it seems that for the workloads you cited, the most natural 
direction to drive the 'spreading' of resources is from the VFS side.  
That would also avoid the problem Andrew observed: the ugliness of a 
sysadmin configuring the placement strategy of kernel-internal slab 
caches. It also feels a much more robust choice from the conceptual POV.

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06  8:40                   ` Ingo Molnar
  2006-02-06  9:03                     ` Paul Jackson
@ 2006-02-06 20:22                     ` Paul Jackson
  1 sibling, 0 replies; 102+ messages in thread
From: Paul Jackson @ 2006-02-06 20:22 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: akpm, dgc, steiner, Simon.Derr, ak, linux-kernel, clameter

Ingo wrote:
> in other words: the spreading out likely _hurts_ performance in the 
> typical case (which prefers node-locality), but when you are using 
> multiple cpusets you want to opt for fairness between projects, over 
> opportunistic optimizations such as node-local allocations.

Spreading might be useful for this, but it is not what is driving my
interest in it.

My immediate goal is to obtain spreading across the nodes within a
single cpuset, running a single job, not "fairness between projects."

I have little interest in cross (between) project memory management
beyond simply isolating them from each other.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 18:48                                     ` Andi Kleen
  2006-02-06 19:19                                       ` Christoph Lameter
@ 2006-02-06 20:27                                       ` Paul Jackson
  1 sibling, 0 replies; 102+ messages in thread
From: Paul Jackson @ 2006-02-06 20:27 UTC (permalink / raw)
  To: Andi Kleen; +Cc: clameter, mingo, akpm, dgc, steiner, Simon.Derr, linux-kernel

> Yes I can see it not working well when a dentry is put at the other 
> end of a 256 node altix. That is why just spreading it to nearby
> nodes might be an alternative.

As I've suggested earlier in this thread:
=========================================

I suspect that spreading evenly over the current task's mems_allowed is
just what is needed.

There is nothing special about big Altix systems here; just use
task->mems_allowed.  For all but tasks using MPOL_BIND, this means
spreading the caching over the nodes in the task's cpuset.

We don't want to spread over a larger area, because cpusets need to
maintain a fairly aggressive containment.  One job's big data set must
not pollute the cpuset of another job, except in cases involving kernel
distress.

Let the cpusets define what is "the right size" to spread across.
We do not need additional kernel heuristics or options to decide this.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 20:05                                       ` Ingo Molnar
@ 2006-02-06 20:27                                         ` Christoph Lameter
  2006-02-06 20:41                                           ` Ingo Molnar
  2006-02-06 23:45                                         ` Paul Jackson
  1 sibling, 1 reply; 102+ messages in thread
From: Christoph Lameter @ 2006-02-06 20:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul Jackson, ak, akpm, dgc, steiner, Simon.Derr, linux-kernel

On Mon, 6 Feb 2006, Ingo Molnar wrote:

> yes. And it seems that for the workloads you cited, the most natural 
> direction to drive the 'spreading' of resources is from the VFS side.  
> That would also avoid the problem Andrew observed: the ugliness of a 
> sysadmin configuring the placement strategy of kernel-internal slab 
> caches. It also feels a much more robust choice from the conceptual POV.

A sysadmin currently simply configures the memory policy or cpuset policy. 
He has no knowledge of the underlying slab.

Moving this to the VFS will give rise to all sorts of weird effects. F.e. 
doing a grep on a file will spread the pages all over the system. 
Performance will drop for simple single thread processes.

What happens if a filesystem is exported? Is the spreading also exported?

Seems that the allocation policy is application dependent and not related 
to file access. Also some of the slabs have no underlying files that 
could determine their allocation strategy (network subsystem etc). 

AFAIK the cleanest solution is that an application controls its memory
allocation policy (which has been available for a long time via memory 
policies which even in early 2.6.x controlled the slab page allocation 
policy).

Cpusets are simply a convenience so that larger groups of applications 
can implement the same policy, and they may allow one to avoid running numactl 
or modifying an existing application.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 20:27                                         ` Christoph Lameter
@ 2006-02-06 20:41                                           ` Ingo Molnar
  2006-02-06 20:49                                             ` Christoph Lameter
  0 siblings, 1 reply; 102+ messages in thread
From: Ingo Molnar @ 2006-02-06 20:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul Jackson, ak, akpm, dgc, steiner, Simon.Derr, linux-kernel


* Christoph Lameter <clameter@engr.sgi.com> wrote:

> On Mon, 6 Feb 2006, Ingo Molnar wrote:
> 
> > yes. And it seems that for the workloads you cited, the most natural 
> > direction to drive the 'spreading' of resources is from the VFS side.  
> > That would also avoid the problem Andrew observed: the ugliness of a 
> > sysadmin configuring the placement strategy of kernel-internal slab 
> > caches. It also feels a much more robust choice from the conceptual POV.
> 
> A sysadmin currently simply configures the memory policy or cpuset 
> policy.  He has no knowledge of the underlying slab.
> 
> Moving this to the VFS will give rise to all sorts of weird effects. 
> F.e.  doing a grep on a file will spread the pages all over the 
> system.  Performance will drop for simple single thread processes.

it's a feature, not a weird effect! Under the VFS-driven scheme, if two 
projects (one 'local' and one 'global') can access the same (presumably 
big) file, then the sysadmin has to make up his mind and determine which 
policy to use for that file. The file will either be local, or global - 
consistently.

[ I dont think most policies would be set on the file level though - 
  directory level seems sufficient. E.g. /usr and /tmp would probably 
  default to 'local', while /home/bigproject1/ would default to 
  'global', while /home/bigproject2/ would default to 'local' [depending 
  on the project's need]. Single-file would be used if there is an 
  exception: e.g. if /home/bigproject3/ defaults to 'local', it could 
  still mark /home/bigproject3/big-shared-db/ as 'global'. ]
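
If the per-directory flag were stored as, say, an extended attribute,
marking a project tree could look roughly like this; the attribute name
"user.cache_policy" and the path are made up for illustration:

#include <sys/xattr.h>		/* setxattr(); <attr/xattr.h> on older systems */
#include <stdio.h>

int main(void)
{
	if (setxattr("/home/bigproject1", "user.cache_policy",
		     "global", sizeof("global") - 1, 0)) {
		perror("setxattr");
		return 1;
	}
	return 0;
}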

with the per-cpuset policy approach on the other hand it would be 
non-deterministic which policy the file gets allocated under: whichever 
cpuset first manages to touch that file. That is what i'd call a weird 
and undesirable effect. This weirdness comes from the conceptual hiccup 
of attaching the object-allocation policy to the workload, not to the 
file objects of the workload - hence conflicts can arise if two 
workloads share file objects.

> What happens if a filesystem is exported? Is the spreading also 
> exported?

what do you mean? The policy matters at the import point, so i doubt 
knfsd would have to be taught to pass policies around. But it could do 
it, if the need arises. Alternatively, the sysadmin on the importing 
side can/should set the policy based on the needs of the application 
using the imported file objects. It is that box that is doing the 
allocations after all, not the server. In fact the same filesystem could 
easily be 'global' on the serving system, and 'local' on the importing 
system.

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 20:41                                           ` Ingo Molnar
@ 2006-02-06 20:49                                             ` Christoph Lameter
  2006-02-06 21:07                                               ` Ingo Molnar
  0 siblings, 1 reply; 102+ messages in thread
From: Christoph Lameter @ 2006-02-06 20:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul Jackson, ak, akpm, dgc, steiner, Simon.Derr, linux-kernel

On Mon, 6 Feb 2006, Ingo Molnar wrote:

> it's a feature, not a weird effect! Under the VFS-driven scheme, if two 
> projects (one 'local' and one 'global') can access the same (presumably 
> big) file, then the sysadmin has to make up his mind and determine which 
> policy to use for that file. The file will either be local, or global - 
> consistently.

But that local or global allocation policy depends on what task is 
accessing the data at what time. A simple grep should not result in 
interleaving. A big application accessing the same data from multiple 
processes should have interleaving for shared data. Both may not be active 
at the same time.
 
> with the per-cpuset policy approach on the other hand it would be 
> non-deterministic which policy the file gets allocated under: whichever 
> cpuset first manages to touch that file. That is what i'd call a weird 
> and undesirable effect. This weirdness comes from the conceptual hiccup 
> of attaching the object-allocation policy to the workload, not to the 
> file objects of the workload - hence conflicts can arise if two 
> workloads share file objects.

Well these weird effects are then at least expected since there was a 
cpuset set up for applications to activate this effect and the 
processes running in that cpuset will behave in the weird way we want.

The mountpoint option means that reading the contents of a file in some 
filesystems is slower than in others because some files spread their pages 
all over the system while others are node local. Again if the process is 
single threaded the node local is always the right approach. These single 
threaded processes will no longer be able to run with full pagecache 
speed. Memory will be used in other nodes that may have been reserved for 
other purposes by the user.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 20:49                                             ` Christoph Lameter
@ 2006-02-06 21:07                                               ` Ingo Molnar
  2006-02-06 22:10                                                 ` Christoph Lameter
  0 siblings, 1 reply; 102+ messages in thread
From: Ingo Molnar @ 2006-02-06 21:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul Jackson, ak, akpm, dgc, steiner, Simon.Derr, linux-kernel

* Christoph Lameter <clameter@engr.sgi.com> wrote:

> On Mon, 6 Feb 2006, Ingo Molnar wrote:
> 
> > it's a feature, not a weird effect! Under the VFS-driven scheme, if two 
> > projects (one 'local' and one 'global') can access the same (presumably 
> > big) file, then the sysadmin has to make up his mind and determine which 
> > policy to use for that file. The file will either be local, or global - 
> > consistently.
> 
> But that local or global allocation policy depends on what task is 
> accessing the data at what time. A simple grep should not result in 
> interleaving. A big application accessing the same data from multiple 
> processes should have interleaving for shared data. Both may not be 
> active at the same time.

the grep faults in the pagecache, and depending on which job is active 
first, the placement of the pages will either be spread out or local, 
depending on the timing of those jobs. How do you expect this to behave 
deterministically?

> > with the per-cpuset policy approach on the other hand it would be 
> > non-deterministic which policy the file gets allocated under: whichever 
> > cpuset first manages to touch that file. That is what i'd call a weird 
> > and undesirable effect. This weirdness comes from the conceptual hiccup 
> > of attaching the object-allocation policy to the workload, not to the 
> > file objects of the workload - hence conflicts can arise if two 
> > workloads share file objects.
> 
> Well these weird effects are then at least expected since there was a 
> cpuset set up for applications to activate this effect and the 
> processes running in that cpuset will behave in the weird way we want.

nondeterministic placement of pagecache pages sure looks nasty. In most 
cases i suspect what matters are project-specific data files - which 
will be allocated deterministically because they are mostly private to 
the cpuset. But e.g. /usr files want to be local in most cases, even for 
a 'spread out' cpuset. Why would you want to allocate them globally?

> The mountpoint option means that reading the contents of a file in 
> some filesystems is slower than in others because some files spread 
> their pages all over the system while others are node local. Again if 
> the process is single threaded the node local is always the right 
> approach. These single threaded processes will no longer be able to 
> run with full pagecache speed. Memory will be used in other nodes that 
> may have been reserved for other purposes by the user.

but a single object cannot be allocated both locally and globally!  
(well, it could be, for read-mostly workloads, but lets ignore that 
possibility) So instead of letting chance determine it, it is the most 
natural thing to let the object (or its container) determine which 
strategy to use - not the workload. This avoids the ambiguity at its 
core.

so if two projects want to use the same file in two different ways at 
the same time then there is no solution either under the VFS-based or 
under the cpuset-based approach - but at least the VFS-based method is 
fully predictable, and wont depend on which science department starts 
its simulation job first ...

if two projects want to use the same file in two different ways but not 
at the same time, then again the VFS-based method is better: each 
project, when it starts to run, could/should set the policy of that 
shared data. (which setting of policy would also flush all pagecache 
pages of the affected file[s], flushing any prior incorrect placement of 
pages)

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 21:07                                               ` Ingo Molnar
@ 2006-02-06 22:10                                                 ` Christoph Lameter
  2006-02-06 23:29                                                   ` Ingo Molnar
  0 siblings, 1 reply; 102+ messages in thread
From: Christoph Lameter @ 2006-02-06 22:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul Jackson, ak, akpm, dgc, steiner, Simon.Derr, linux-kernel

On Mon, 6 Feb 2006, Ingo Molnar wrote:

> the grep faults in the pagecache, and depending on which job is active 
> first, the placement of the pages will either be spread out or local, 
> depending on the timing of those jobs. How do you expect this to behave 
> deterministically?

It behaves using node local allocation as expected. The determinism is 
only broken when the user sets up a cpuset. This is an unusual activity 
by the sysadmin and he will be fully aware of what is going on.

> cases i suspect what matters are project-specific data files - which 
> will be allocated deterministically because they are mostly private to 
> the cpuset. But e.g. /usr files want to be local in most cases, even for 
> a 'spread out' cpuset. Why would you want to allocate them globally?

We allocate nothing globally.

> but a single object cannot be allocated both locally and globally!  
> (well, it could be, for read-mostly workloads, but lets ignore that 
> possibility) So instead of letting chance determine it, it is the most 
> natural thing to let the object (or its container) determine which 
> strategy to use - not the workload. This avoids the ambiguity at its 
> core.

We want cpusets to make a round robin allocation within the memory 
assigned to the cpuset. There is no global allocation that I 
am aware of.

> so if two projects want to use the same file in two different ways at 
> the same time then there is no solution either under the VFS-based or 
> under the cpuset-based approach - but at least the VFS-based method is 
> fully predictable, and wont depend on which science department starts 
> its simulation job first ...

It will just reduce performance by ~20% for selected files. Surely nobody 
will notice ;-)

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 22:10                                                 ` Christoph Lameter
@ 2006-02-06 23:29                                                   ` Ingo Molnar
  0 siblings, 0 replies; 102+ messages in thread
From: Ingo Molnar @ 2006-02-06 23:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul Jackson, ak, akpm, dgc, steiner, Simon.Derr, linux-kernel


* Christoph Lameter <clameter@engr.sgi.com> wrote:

> > but a single object cannot be allocated both locally and globally!  
> > (well, it could be, for read-mostly workloads, but lets ignore that 
> > possibility) So instead of letting chance determine it, it is the most 
> > natural thing to let the object (or its container) determine which 
> > strategy to use - not the workload. This avoids the ambiguity at its 
> > core.
> 
> We want cpusets to make a round robin allocation within the memory 
> assigned to the cpuset. There is no global allocation that I am aware 
> of.

i think we might be talking about separate things, so lets go one step
back.

firstly, i think what you call roundrobin is what i call 'global'.  
[roundrobin allocation is what is best for a cache that is accessed in a 
'global' way - as opposed to cached data that is accessed in a 'local' 
way.]

secondly, i'm not sure i understood it correctly why you want to have 
all (mostly filesystem related) allocations within selected cpusets go 
in a roundrobin way. My understanding so far was that you wanted this 
because the workload attached to that cpuset was using the filesystem 
objects in a 'global' way: i.e. from many different nodes, with no 
particular locality of reference. Am i mistaken about this?

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 20:05                                       ` Ingo Molnar
  2006-02-06 20:27                                         ` Christoph Lameter
@ 2006-02-06 23:45                                         ` Paul Jackson
  2006-02-07  0:19                                           ` Ingo Molnar
  1 sibling, 1 reply; 102+ messages in thread
From: Paul Jackson @ 2006-02-06 23:45 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: ak, clameter, akpm, dgc, steiner, Simon.Derr, linux-kernel

Ingo wrote:
> And it seems that for the workloads you cited, the most natural 
> direction to drive the 'spreading' of resources is from the VFS side.  
> That would also avoid the problem Andrew observed: the ugliness of a 
> sysadmin configuring the placement strategy of kernel-internal slab 
> caches. It also feels a much more robust choice from the conceptual POV.

Arrghh ...

I'm confused, on several points.

I've discussed this some with my SGI colleagues, and think I understand
where they are coming from.

But I can't make sense of your recommendation, Ingo.

I don't yet see why you find this more natural or robust, but let me
deal with some details first.

I don't recall Andrew observing ugliness in a sysadmin configuring
a kernel slab.  I recall him asking to add such ugliness.  My proposal
just had a "memory_spread" boolean, which asked the kernel (1) to
spread memory, at least for the big kinds of allocations that, from the
app's perspective, are done within the kernel, but (2) to leave the user
address space pages to be placed by the default node-local policy.
No mention there of slab caches.  It was Andrew who wanted to add
such details.

First it might be most useful to explain a detail of your proposal that
I don't get, which is blocking me from considering it seriously.

I understand mount options, but I don't know what mechanisms (at the
kernel-user API) you have in mind to manage per-directory and per-file
options.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-06 23:45                                         ` Paul Jackson
@ 2006-02-07  0:19                                           ` Ingo Molnar
  2006-02-07  1:17                                             ` David Chinner
  2006-02-07  9:31                                             ` Andi Kleen
  0 siblings, 2 replies; 102+ messages in thread
From: Ingo Molnar @ 2006-02-07  0:19 UTC (permalink / raw)
  To: Paul Jackson; +Cc: ak, clameter, akpm, dgc, steiner, Simon.Derr, linux-kernel


* Paul Jackson <pj@sgi.com> wrote:

> First it might be most useful to explain a detail of your proposal 
> that I don't get, which is blocking me from considering it seriously.
> 
> I understand mount options, but I don't know what mechanisms (at the 
> kernel-user API) you have in mind to manage per-directory and per-file 
> options.

well, i thought of nothing overly complex: it would have to be a 
persistent flag attached to the physical inode. Lets assume XFS added 
this - e.g. as an extended attribute. That raw inode attribute/flag gets 
inherited by dentries, and propagates down into child dentries. (There 
is a global default that the root dentry starts with, and mountpoints 
may override the flag too.) If any directory down in the hierarchy has a 
different flag, it overrides the current one. No flag means "inherit 
parent's flag". So there would be 3 possible states for every inode:

 - default: the vast majority of inodes would have no flag set

 - some would have a 'cache:local' flag

 - some would have a 'cache:global' flag

which would result in every inode getting flagged as either 'local' or 
'global'. When the pagecache (and inode/dentry cache) gets populated, 
the kernel will always know what the current allocation strategy is for 
any given object:

- if an inode ends up being flagged as 'global', then all its pagecache 
  allocations will be roundrobined across nodes.

- if an inode is flagged 'local', it will be allocated to the node/cpu 
  that makes use of it.

workloads may share the same object and may want to use it in different 
ways. E.g. there's one big central database file, and one job uses it in 
a 'local' way, another one uses it in a 'global' way. Each job would 
have to set the attribute to the right value. Setting the flag for the 
inode results in all existing pages for that inode to be flushed. The 
jobs need to serialize their access to the object, as the kernel can 
only allocate according to one policy.
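
A rough sketch of that inheritance rule, with made-up types standing in
for the dentry tree (it only illustrates the "no flag means inherit from
parent" lookup, nothing else):

#include <stdio.h>

/* Illustrative only: the three possible per-object states. */
enum cache_policy { CACHE_INHERIT, CACHE_LOCAL, CACHE_GLOBAL };

struct object {
	const char *name;
	enum cache_policy policy;	/* CACHE_INHERIT == no flag set */
	struct object *parent;		/* NULL at the root */
};

/* Walk towards the root until an explicit flag (or the root default) is found. */
static enum cache_policy effective_policy(const struct object *o,
					   enum cache_policy root_default)
{
	for (; o; o = o->parent)
		if (o->policy != CACHE_INHERIT)
			return o->policy;
	return root_default;
}

int main(void)
{
	struct object root    = { "/", CACHE_INHERIT, NULL };
	struct object data    = { "/data", CACHE_GLOBAL, &root };
	struct object scratch = { "/data/scratch", CACHE_LOCAL, &data };
	struct object file    = { "/data/bigfile", CACHE_INHERIT, &data };

	/* bigfile inherits 'global' from /data; scratch overrides with 'local'. */
	printf("%s -> %s\n", file.name,
	       effective_policy(&file, CACHE_LOCAL) == CACHE_GLOBAL ? "global" : "local");
	printf("%s -> %s\n", scratch.name,
	       effective_policy(&scratch, CACHE_LOCAL) == CACHE_GLOBAL ? "global" : "local");
	return 0;
}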

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-07  0:19                                           ` Ingo Molnar
@ 2006-02-07  1:17                                             ` David Chinner
  2006-02-07  9:31                                             ` Andi Kleen
  1 sibling, 0 replies; 102+ messages in thread
From: David Chinner @ 2006-02-07  1:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul Jackson, ak, clameter, akpm, dgc, steiner, Simon.Derr, linux-kernel

On Tue, Feb 07, 2006 at 01:19:02AM +0100, Ingo Molnar wrote:
> 
> * Paul Jackson <pj@sgi.com> wrote:
> 
> > First it might be most useful to explain a detail of your proposal 
> > that I don't get, which is blocking me from considering it seriously.
> > 
> > I understand mount options, but I don't know what mechanisms (at the 
> > kernel-user API) you have in mind to manage per-directory and per-file 
> > options.
> 
> well, i thought of nothing overly complex: it would have to be a 
> persistent flag attached to the physical inode. Lets assume XFS added 
> this - e.g. as an extended attribute. That raw inode attribute/flag gets 
> inherited by dentries, and propagates down into child dentries.

XFS already has inheritable inode attributes, and they work in a
different (conflicting) manner. Currently, when you set certain
attributes on a directory inode, all directories and files created
within that directory inherit those attributes at create time.
See xfsctl(3).

Having no flag set means to use the filesystem default, not "use
the parent config". Flags are also kept on the inode, not the
dentries. We should make sure that the interfaces do
not conflict.

>  - default: the vast majority of inodes would have no flag set
> 
>  - some would have a 'cache:local' flag
> 
>  - some would have a 'cache:global' flag

Easy to add to XFS and to propagate back to the Linux inode if
you use the current semantics that XFS supports. How the
memory allocator works from there is up to you...

Cheers,

Dave.
-- 
Dave Chinner
R&D Software Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-07  0:19                                           ` Ingo Molnar
  2006-02-07  1:17                                             ` David Chinner
@ 2006-02-07  9:31                                             ` Andi Kleen
  2006-02-07 11:53                                               ` Ingo Molnar
  1 sibling, 1 reply; 102+ messages in thread
From: Andi Kleen @ 2006-02-07  9:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul Jackson, clameter, akpm, dgc, steiner, Simon.Derr, linux-kernel

On Tuesday 07 February 2006 01:19, Ingo Molnar wrote:
> 
> * Paul Jackson <pj@sgi.com> wrote:
> 
> > First it might be most useful to explain a detail of your proposal 
> > that I don't get, which is blocking me from considering it seriously.
> > 
> > I understand mount options, but I don't know what mechanisms (at the 
> > kernel-user API) you have in mind to manage per-directory and per-file 
> > options.
> 
> well, i thought of nothing overly complex: it would have to be a 
> persistent flag attached to the physical inode. Lets assume XFS added 
> this - e.g. as an extended attribute.

There used to be a patch floating around to do policy for file caches
(or rather arbitrary address spaces); it used special ELF headers to set
the policy. I thought about these policy EAs long ago. The main reason
I never liked them much is that on some EA implementations you have to
fetch a separate block to get at the EA. And this policy EA would need
to be read all the time, thus adding lots of additional seeks. That
didn't seem worth it.


>  - default: the vast majority of inodes would have no flag set
> 
>  - some would have a 'cache:local' flag
> 
>  - some would have a 'cache:global' flag

If you do policy you could as well do the full policy states from
mempolicy.c. Both cache:local and cache:global can be expressed in it.
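
For illustration, the mapping could be as small as this (the cache_*
enum is made up; MPOL_PREFERRED and MPOL_INTERLEAVE are the existing
mempolicy modes as exposed in numaif.h from the libnuma headers):

#include <stdio.h>
#include <numaif.h>	/* MPOL_PREFERRED, MPOL_INTERLEAVE */

/* Illustrative: express the proposed per-file cache flags as mempolicy modes. */
enum cache_policy { CACHE_LOCAL, CACHE_GLOBAL };

static int cache_policy_to_mpol(enum cache_policy p)
{
	return p == CACHE_GLOBAL ? MPOL_INTERLEAVE	/* spread across allowed nodes */
				 : MPOL_PREFERRED;	/* prefer the local node */
}

int main(void)
{
	printf("cache:local  -> MPOL mode %d\n", cache_policy_to_mpol(CACHE_LOCAL));
	printf("cache:global -> MPOL mode %d\n", cache_policy_to_mpol(CACHE_GLOBAL));
	return 0;
}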

> which would result in every inode getting flagged as either 'local' or 
> 'global'. When the pagecache (and inode/dentry cache) gets populated, 
> the kernel will always know what the current allocation strategy is for 
> any given object:

In practice it will probably only be set for a small minority of objects,
if at all. I could imagine administering this policy could be a PITA too.

> workloads may share the same object and may want to use it in different 
> ways. E.g. there's one big central database file, and one job uses it in 
> a 'local' way another one uses it in a 'global' way. Each job would  
> have to set the attribute to the right value. Setting the flag for the 
> inode results in all existing pages for that inode to be flushed. The 
> jobs need to serialize their access to the object, as the kernel can 
> only allocate according to one policy.

I think we are much better off with some sensible defaults for the file cache:

- global or "nearby" for read/write
- global for inode/dcache
- local for mmap file data

I bet that will cover most cases quite nicely.

-Andi


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-07  9:31                                             ` Andi Kleen
@ 2006-02-07 11:53                                               ` Ingo Molnar
  2006-02-07 12:14                                                 ` Andi Kleen
  0 siblings, 1 reply; 102+ messages in thread
From: Ingo Molnar @ 2006-02-07 11:53 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Paul Jackson, clameter, akpm, dgc, steiner, Simon.Derr, linux-kernel


* Andi Kleen <ak@suse.de> wrote:

> On Tuesday 07 February 2006 01:19, Ingo Molnar wrote:
> > 
> > * Paul Jackson <pj@sgi.com> wrote:
> > 
> > > First it might be most useful to explain a detail of your proposal 
> > > that I don't get, which is blocking me from considering it seriously.
> > > 
> > > I understand mount options, but I don't know what mechanisms (at the 
> > > kernel-user API) you have in mind to manage per-directory and per-file 
> > > options.
> > 
> > well, i thought of nothing overly complex: it would have to be a 
> > persistent flag attached to the physical inode. Lets assume XFS added 
> > this - e.g. as an extended attribute.
> 
> There used to be a patch floating around to do policy for file caches 
> (or rather arbitary address spaces) It used special ELF headers to set 
> the policy. I thought about these policy EAs long ago. The main reason 
> I never liked them much is that on some EA implementations you have to 
> fetch an separate block to get at the EA. And this policy EA would 
> need to be read all the time, thus adding lots of additional seeks. 
> That didn't seem worth it.

EAs would be fine - and they don't have to be propagated down into the
hierarchy. There's no reason to propagate them if the lack of an EA means
'inherit from parent'. Btw., not all filesystems need an extra block
seek - e.g. ext3 embeds them nicely into the raw inode.
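
On the user-space side something as small as this would do, assuming the
flag were exposed as a (hypothetical) "user.cache_policy" extended
attribute and set/read with the generic xattr syscalls:

#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

int main(int argc, char **argv)
{
	char value[16];
	ssize_t len;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file> [local|global]\n", argv[0]);
		return 1;
	}

	/* Optionally tag the file; the attribute name is made up for this sketch. */
	if (argc > 2 && setxattr(argv[1], "user.cache_policy",
				 argv[2], strlen(argv[2]), 0) != 0) {
		perror("setxattr");
		return 1;
	}

	len = getxattr(argv[1], "user.cache_policy", value, sizeof(value) - 1);
	if (len < 0) {
		printf("%s: no flag set (would inherit from parent)\n", argv[1]);
	} else {
		value[len] = '\0';
		printf("%s: cache policy '%s'\n", argv[1], value);
	}
	return 0;
}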

> >  - default: the vast majority of inodes would have no flag set
> > 
> >  - some would have a 'cache:local' flag
> > 
> >  - some would have a 'cache:global' flag
> 
> If you do policy you could as well do the full policy states from
> mempolicy.c. Both cache:local and cache:global can be expressed in it.

sure.

> > which would result in every inode getting flagged as either 'local' or 
> > 'global'. When the pagecache (and inode/dentry cache) gets populated, 
> > the kernel will always know what the current allocation strategy is for 
> > any given object:
> 
> In practice it will probably only set for a small minority of objects 
> if at all. I could imagine admining this policy could be a PITA too.

It's so much cleaner and more flexible. I bet it's even a noticeable
speedup for users of the current cpuset-based approach: the cpuset
method forces _all_ 'large' objects to be allocated in a spread-out
form, while with the EA method you can pinpoint those few files (or
directories) that include the data that needs the spreading out. E.g.
/tmp would default to 'local' (unless the app creates a big /tmp file,
for which it should set the spread-out attribute).

another thing: on NUMA, are the pagecache portions of readonly files 
(such as /usr binaries, etc.) duplicated across nodes in current 
kernels, or is it still random which node gets it? This too could be an 
EA caching attribute: whether to create per-node caches for file 
content.

This kind of stuff is precisely what EAs were invented for.

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-07 11:53                                               ` Ingo Molnar
@ 2006-02-07 12:14                                                 ` Andi Kleen
  2006-02-07 12:30                                                   ` Ingo Molnar
  0 siblings, 1 reply; 102+ messages in thread
From: Andi Kleen @ 2006-02-07 12:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul Jackson, clameter, akpm, dgc, steiner, Simon.Derr, linux-kernel

On Tuesday 07 February 2006 12:53, Ingo Molnar wrote:
> 
> * Andi Kleen <ak@suse.de> wrote:
> 
> > On Tuesday 07 February 2006 01:19, Ingo Molnar wrote:
> > > 
> > > * Paul Jackson <pj@sgi.com> wrote:
> > > 
> > > > First it might be most useful to explain a detail of your proposal 
> > > > that I don't get, which is blocking me from considering it seriously.
> > > > 
> > > > I understand mount options, but I don't know what mechanisms (at the 
> > > > kernel-user API) you have in mind to manage per-directory and per-file 
> > > > options.
> > > 
> > > well, i thought of nothing overly complex: it would have to be a 
> > > persistent flag attached to the physical inode. Lets assume XFS added 
> > > this - e.g. as an extended attribute.
> > 
> > There used to be a patch floating around to do policy for file caches 
> > (or rather arbitary address spaces) It used special ELF headers to set 
> > the policy. I thought about these policy EAs long ago. The main reason 
> > I never liked them much is that on some EA implementations you have to 
> > fetch an separate block to get at the EA. And this policy EA would 
> > need to be read all the time, thus adding lots of additional seeks. 
> > That didn't seem worth it.
> 
> EAs would be fine - and they dont have to be propagated down into the 
> hierarchy. There's no reason to propagate them if the lack of EA means 
> 'inherit from parent'. Btw., not all filesystems need an extra block 
> seek - e.g. ext3 embedds them nicely into the raw inode.

If it's big enough. But to make it anywhere near usable for an admin
you would need to support inheritance from directories too
[I know from personal experience that ACLs without directory inheritance
are extremely messy to use], and with that it would get messy quickly
and require more overhead. 

I'm not really sure EAs are a good approach here.

Perhaps just a per-superblock setting, if anything (tmpfs supports that
already).


> > > which would result in every inode getting flagged as either 'local' or 
> > > 'global'. When the pagecache (and inode/dentry cache) gets populated, 
> > > the kernel will always know what the current allocation strategy is for 
> > > any given object:
> > 
> > In practice it will probably only set for a small minority of objects 
> > if at all. I could imagine admining this policy could be a PITA too.
> 
> It's so much cleaner and more flexible. I bet it's even a noticeable 
> speedup for users of the current cpuset-based approach: the cpuset 
> method forces _all_ 'large' objects to be allocated in a spread-out 
> form. While with the EA method you can pinpoint those few files (or 
> directories) that include the data that needs the spreading out. E.g.  
> /tmp would default to 'local' (unless the app creates a big /tmp file, 
> for which it should set the spread-out attribute).

It's unrealistic - no app will do that. It just needs good defaults.

I still don't really think it will make much difference if the file
cache is local or global. Compared to disk IO it is still infinitely
faster, so a relatively small slowdown from going off node is not
that big an issue.

> another thing: on NUMA, are the pagecache portions of readonly files 
> (such as /usr binaries, etc.) duplicated across nodes in current 
> kernels, or is it still random which node gets it? 

Random.

> This too could be an  
> EA caching attribute: whether to create per-node caches for file 
> content.

There were (ugly) patches floating around for text duplication, but iirc
the benchmarkers were still trying to figure out if it's even a good idea.
My guess is that it is not, because CPUs tend to have very aggressive
prefetching for code streams, which can deal with latency well.
 
> This kind of stuff is precisely what EAs were invented for.

I'm not so sure.

-Andi

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-07 12:14                                                 ` Andi Kleen
@ 2006-02-07 12:30                                                   ` Ingo Molnar
  2006-02-07 12:43                                                     ` Andi Kleen
  2006-02-07 17:06                                                     ` Christoph Lameter
  0 siblings, 2 replies; 102+ messages in thread
From: Ingo Molnar @ 2006-02-07 12:30 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Paul Jackson, clameter, akpm, dgc, steiner, Simon.Derr, linux-kernel


* Andi Kleen <ak@suse.de> wrote:

> I still don't really think it will make much difference if the file 
> cache is local or global. Compare to disk IO it is still infinitely 
> faster, so a relatively small slowdown from going off node is not that 
> big an issue.

well, maybe the SGI folks can give us some numbers?

> > another thing: on NUMA, are the pagecache portions of readonly files 
> > (such as /usr binaries, etc.) duplicated across nodes in current 
> > kernels, or is it still random which node gets it? 
> 
> Random.
> 
> > This too could be an  
> > EA caching attribute: whether to create per-node caches for file 
> > content.
> 
> There were (ugly) patches floating around for text duplication, but 
> iirc the benchmarkers were still trying to figure out if it's even a 
> good idea. My guess it is not because CPUs tend to have very 
> aggressive prefetching for code streams which can deal with latency 
> well.

you are a bit biased towards low-latency NUMA setups i guess (read: 
Opterons) :-) Obviously with a low NUMA factor, we don't have to deal
with memory access asymmetries all that much.

But i think we should expand our file caching architecture into those
caching details nevertheless: it's directly applicable to software-driven
clusters as well. There, pagecache replication across nodes is a must,
and obviously there it makes a big difference whether files are cached
locally or remotely.

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-07 12:30                                                   ` Ingo Molnar
@ 2006-02-07 12:43                                                     ` Andi Kleen
  2006-02-07 12:58                                                       ` Ingo Molnar
  2006-02-07 17:10                                                       ` Christoph Lameter
  2006-02-07 17:06                                                     ` Christoph Lameter
  1 sibling, 2 replies; 102+ messages in thread
From: Andi Kleen @ 2006-02-07 12:43 UTC (permalink / raw)
  To: Ingo Molnar, steiner
  Cc: Paul Jackson, clameter, akpm, dgc, Simon.Derr, linux-kernel

On Tuesday 07 February 2006 13:30, Ingo Molnar wrote:

> you are a bit biased towards low-latency NUMA setups i guess (read: 
> Opterons) :-) 

Well they are the vast majority of NUMA systems Linux runs on.
And there are more than just Opterons, e.g. IBM Summit. And even
the majority of Altixes are not _that_ big.

Of course we need to deal somehow with the big systems, but
for the good defaults the smaller systems are more important.
Big systems tend to have capable administrators who
are willing to tweak them. But that's rarely the case with
the small systems. So I think as long as the big system
can somehow be made to work with special configuration,
ignoring corner cases, that's fine. But for the low-NUMA
systems it should perform as well as possible out of the box.

> Obviously with a low NUMA factor, we dont have to deal  
> with memory access assymetries all that much.

That is why I proposed "nearby policy". It can turn a system
with a large NUMA factor into a system with a small NUMA factor.

-Andi

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-07 12:43                                                     ` Andi Kleen
@ 2006-02-07 12:58                                                       ` Ingo Molnar
  2006-02-07 13:14                                                         ` Andi Kleen
  2006-02-07 17:10                                                       ` Christoph Lameter
  1 sibling, 1 reply; 102+ messages in thread
From: Ingo Molnar @ 2006-02-07 12:58 UTC (permalink / raw)
  To: Andi Kleen
  Cc: steiner, Paul Jackson, clameter, akpm, dgc, Simon.Derr, linux-kernel


* Andi Kleen <ak@suse.de> wrote:

> On Tuesday 07 February 2006 13:30, Ingo Molnar wrote:
> 
> > you are a bit biased towards low-latency NUMA setups i guess (read: 
> > Opterons) :-) 
> 
> Well they are the vast majority of NUMA systems Linux runs on.
>
> And there are more than just Opterons, e.g. IBM Summit. And even the 
> majority of Altixes are not _that_ big.
> 
> Of course we need to deal somehow with the big systems, but for the 
> good defaults the smaller systems are more important.

i'm not sure i understand your point. You said that for small systems
with a low NUMA factor it doesn't really matter where the pagecache is
placed. I mostly agree with that. And since placement makes no
difference there, we can freely shape things for the systems where it
does make a difference. It will probably be a small win on smaller
systems too, as a bonus. Ok?

> Big systems tend to have capable administrators who are willing to 
> tweak them. But that's rarely the case with the small systems. So I 
> think as long as the big system can be somehow made to work with 
> special configuration and ignoring corner cases that's fine. But for 
> the low NUMA systems it should perform as well as possibly out of the 
> box.

i also mentioned software-based clusters in the previous mail, so it's 
not only about big systems. Caching attributes are very much relevant 
there. Tightly integrated clusters can be considered NUMA systems with a 
NUMA factor of 1000 or so (or worse).

> > Obviously with a low NUMA factor, we dont have to deal  
> > with memory access assymetries all that much.
> 
> That is why I proposed "nearby policy". It can turn a system with a 
> large NUMA factor into a system with a small NUMA factor.

well, would the "nearby policy" make a difference on the small systems?  
Small systems (to me) are just a flat and symmetric hierarchy of nodes - 
the next step from SMP. So there's really just two distances: local to 
the node, and one level of 'alien'. Or do you include systems in this 
category that have bigger assymetries?

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-07 12:58                                                       ` Ingo Molnar
@ 2006-02-07 13:14                                                         ` Andi Kleen
  2006-02-07 14:11                                                           ` Ingo Molnar
  0 siblings, 1 reply; 102+ messages in thread
From: Andi Kleen @ 2006-02-07 13:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: steiner, Paul Jackson, clameter, akpm, dgc, Simon.Derr, linux-kernel

On Tuesday 07 February 2006 13:58, Ingo Molnar wrote:
> 
> * Andi Kleen <ak@suse.de> wrote:
> 
> > On Tuesday 07 February 2006 13:30, Ingo Molnar wrote:
> > 
> > > you are a bit biased towards low-latency NUMA setups i guess (read: 
> > > Opterons) :-) 
> > 
> > Well they are the vast majority of NUMA systems Linux runs on.
> >
> > And there are more than just Opterons, e.g. IBM Summit. And even the 
> > majority of Altixes are not _that_ big.
> > 
> > Of course we need to deal somehow with the big systems, but for the 
> > good defaults the smaller systems are more important.
> 
> i'm not sure i understand your point. You said that for small systems 
> with a low NUMA factor it doesnt really matter where the pagecache is 
> placed. 

I meant it's not that big an issue if it's remote, but it's
bad if it fills up the local node.

Also pagecache = unmapped file cache here. And d/icache. 
For mapped anonymous memory it's different of course.


> I mostly agree with that. And since placement makes no  
> difference there, we can freely shape things for the systems where it 
> does make a difference. It will probably make a small win on smaller 
> systems too, as a bonus. Ok?

No, because filling up the local node is a problem on the small systems
too: it prevents the processes from getting enough anonymous memory.
For anonymous memory, placement is important even for small systems.

Basically you have to consider the frequency of access:

Mapped memory is very frequently accessed. For it, memory placement
is really important. Optimizing it at the cost of everything else is
a good default strategy.

File cache is much less frequently accessed (most programs buffer
read/write well), and when it is accessed it goes through functions
that are relatively latency tolerant (kernel memcpy). So memory
placement is much less important here.

And d/inode caches are also very infrequently accessed compared to
local memory, so the occasional additional latency is better than
competing too much with local memory allocation.

> > Big systems tend to have capable administrators who are willing to 
> > tweak them. But that's rarely the case with the small systems. So I 
> > think as long as the big system can be somehow made to work with 
> > special configuration and ignoring corner cases that's fine. But for 
> > the low NUMA systems it should perform as well as possibly out of the 
> > box.
> 
> i also mentioned software-based clusters in the previous mail, so it's 
> not only about big systems. Caching attributes are very much relevant 
> there. Tightly integrated clusters can be considered NUMA systems with a 
> NUMA factor of 1000 or so (or worse).

To be honest I don't think systems with such a NUMA factor will ever work 
well in the general case. So I wouldn't recommend considering them
much if at all in your design thoughts. The result would likely not
be a good balanced design.

> > > Obviously with a low NUMA factor, we dont have to deal  
> > > with memory access assymetries all that much.
> > 
> > That is why I proposed "nearby policy". It can turn a system with a 
> > large NUMA factor into a system with a small NUMA factor.
> 
> well, would the "nearby policy" make a difference on the small systems? 

Probably not.
> 
> Small systems (to me) are just a flat and symmetric hierarchy of nodes - 
> the next step from SMP. So there's really just two distances: local to 
> the node, and one level of 'alien'. Or do you include systems in this 
> category that have bigger assymetries?

Even a 4-way Opteron has two hierarchies (although the kernel often
doesn't know about them because most BIOSes have broken SLIT tables),
and an 8-way Opteron has three. Similarly for, say, a 4-node x460.
But these systems still have a reasonably small NUMA factor.
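
For illustration only, a SLIT-style node distance table for such a 4-way
box could look roughly like this (10 = local, 16 = one hop; an 8-way box
would add a third, larger value for two-hop nodes; the exact figures are
assumptions and vary per board and BIOS):

/* Hypothetical node distances for a small, fully connected 4-node box. */
static const int node_distance[4][4] = {
	{ 10, 16, 16, 16 },
	{ 16, 10, 16, 16 },
	{ 16, 16, 10, 16 },
	{ 16, 16, 16, 10 },
};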

-Andi

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-07 13:14                                                         ` Andi Kleen
@ 2006-02-07 14:11                                                           ` Ingo Molnar
  2006-02-07 14:23                                                             ` Andi Kleen
  0 siblings, 1 reply; 102+ messages in thread
From: Ingo Molnar @ 2006-02-07 14:11 UTC (permalink / raw)
  To: Andi Kleen
  Cc: steiner, Paul Jackson, clameter, akpm, dgc, Simon.Derr, linux-kernel


* Andi Kleen <ak@suse.de> wrote:

> I meant it's not that big an issue if it's remote, but it's bad if it 
> fills up the local node.

are you sure this is not some older VM issue? Unless it's a fundamental 
property of NUMA systems, it would be bad to factor in some VM artifact 
into the caching design.

> Basically you have to consider the frequency of access:
> 
> Mapped memory is very frequently accessed. For it memory placement is 
> really important. Optimizing it at the cost of everything else is a 
> good default strategy
> 
> File cache is much less frequently accessed (most programs buffer 
> read/write well) and when it is accessed it is using functions that 
> are relatively latency tolerant (kernel memcpy). So memory placement 
> is much less important here.
> 
> And d/inode are also very infrequently accessed compared to local 
> memory, so the occasionally additional latency is better than 
> competing too much with local memory allocation.

Most pagecache pages are clean, and it's easy and fast to zap a clean
page when a new anonymous page needs space. So i don't really see why the
pagecache is such a big issue - it should in essence be invisible to the
rest of the VM (barring the extreme case of lots of dirty pages in the
pagecache). What am i missing?

> > i also mentioned software-based clusters in the previous mail, so it's 
> > not only about big systems. Caching attributes are very much relevant 
> > there. Tightly integrated clusters can be considered NUMA systems with a 
> > NUMA factor of 1000 or so (or worse).
> 
> To be honest I don't think systems with such a NUMA factor will ever 
> work well in the general case. So I wouldn't recommend considering 
> them much if at all in your design thoughts. The result would likely 
> not be a good balanced design.

loosely coupled clusters do seem to work quite well, since the
overwhelming majority of computing jobs tend to deal with easily
partitionable workloads. Making clusters more seamless via software
(a.k.a. OpenMosix) is still quite tempting i think.

> > Small systems (to me) are just a flat and symmetric hierarchy of nodes - 
> > the next step from SMP. So there's really just two distances: local to 
> > the node, and one level of 'alien'. Or do you include systems in this 
> > category that have bigger assymetries?
> 
> Even a 4 way Opteron has two hierarchies [...]

yeah, you are right.

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-07 14:11                                                           ` Ingo Molnar
@ 2006-02-07 14:23                                                             ` Andi Kleen
  2006-02-07 17:11                                                               ` Christoph Lameter
  0 siblings, 1 reply; 102+ messages in thread
From: Andi Kleen @ 2006-02-07 14:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: steiner, Paul Jackson, clameter, akpm, dgc, Simon.Derr, linux-kernel

On Tuesday 07 February 2006 15:11, Ingo Molnar wrote:
> 
> * Andi Kleen <ak@suse.de> wrote:
> 
> > I meant it's not that big an issue if it's remote, but it's bad if it 
> > fills up the local node.
> 
> are you sure this is not some older VM issue? 

Unless you implement page migration for all caches it's still there.
The only way to get rid of caches on a node currently is to throw them
away.  And refetching them from disk is quite costly.

> Unless it's a fundamental  
> property of NUMA systems, it would be bad to factor in some VM artifact 
> into the caching design.

Why not? It has to work with the real existing VM, not some imaginary perfect
one.
 
> > Basically you have to consider the frequency of access:
> > 
> > Mapped memory is very frequently accessed. For it memory placement is 
> > really important. Optimizing it at the cost of everything else is a 
> > good default strategy
> > 
> > File cache is much less frequently accessed (most programs buffer 
> > read/write well) and when it is accessed it is using functions that 
> > are relatively latency tolerant (kernel memcpy). So memory placement 
> > is much less important here.
> > 
> > And d/inode are also very infrequently accessed compared to local 
> > memory, so the occasionally additional latency is better than 
> > competing too much with local memory allocation.
> 
> Most pagecache pages are clean, 


... unless you've just written a lot of data.

> and it's easy and fast to zap a clean  
> page when a new anonymous page needs space. So i dont really see why the 
> pagecache is such a big issue - it should in essence be invisible to the 
> rest of the VM. (barring the extreme case of lots of dirty pages in the 
> pagecache) What am i missing?

d/icaches, for one, don't work this way. Do a "find /" and watch the
results on your local node.

And in practice your assumption of everything being clean and nice in the
page cache is also often not true.
 
> > > i also mentioned software-based clusters in the previous mail, so it's 
> > > not only about big systems. Caching attributes are very much relevant 
> > > there. Tightly integrated clusters can be considered NUMA systems with a 
> > > NUMA factor of 1000 or so (or worse).
> > 
> > To be honest I don't think systems with such a NUMA factor will ever 
> > work well in the general case. So I wouldn't recommend considering 
> > them much if at all in your design thoughts. The result would likely 
> > not be a good balanced design.
> 
> loosely coupled clusters do seem to work quite well, since the 
> overwhelming majority of computing jobs tend to deal with easily 
> partitionable workloads. 

Yes, but with message passing and without any kind of shared memory.

> Making clusters more seemless via software  
> (a'ka OpenMosix) is still quite tempting i think.

Ok we agree on that then. Great.

-Andi


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-07 12:30                                                   ` Ingo Molnar
  2006-02-07 12:43                                                     ` Andi Kleen
@ 2006-02-07 17:06                                                     ` Christoph Lameter
  2006-02-07 17:26                                                       ` Andi Kleen
  1 sibling, 1 reply; 102+ messages in thread
From: Christoph Lameter @ 2006-02-07 17:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Paul Jackson, akpm, dgc, steiner, Simon.Derr, linux-kernel

On Tue, 7 Feb 2006, Ingo Molnar wrote:

> * Andi Kleen <ak@suse.de> wrote:
> 
> > I still don't really think it will make much difference if the file 
> > cache is local or global. Compare to disk IO it is still infinitely 
> > faster, so a relatively small slowdown from going off node is not that 
> > big an issue.
> 
> well, maybe the SGI folks can give us some numbers?

The latency may grow (on average) by a factor of 4 (same throughput,
though, on our boxes). On some architectures it is significantly more,
and the bandwidth is reduced as well.

This is a significant factor. Applications that do not manage locality
correctly lose at least 30-40% performance.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-07 12:43                                                     ` Andi Kleen
  2006-02-07 12:58                                                       ` Ingo Molnar
@ 2006-02-07 17:10                                                       ` Christoph Lameter
  2006-02-07 17:28                                                         ` Andi Kleen
  1 sibling, 1 reply; 102+ messages in thread
From: Christoph Lameter @ 2006-02-07 17:10 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, steiner, Paul Jackson, akpm, dgc, Simon.Derr, linux-kernel

On Tue, 7 Feb 2006, Andi Kleen wrote:

> On Tuesday 07 February 2006 13:30, Ingo Molnar wrote:
> 
> > you are a bit biased towards low-latency NUMA setups i guess (read: 
> > Opterons) :-) 
> 
> Well they are the vast majority of NUMA systems Linux runs on.

The opterons are some strange mix of SMP and NUMA system. The NUMA "nodes" 
are on the same motherboard and therefore there are only small latencies 
involved. NUMA only gives small benefits.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-07 14:23                                                             ` Andi Kleen
@ 2006-02-07 17:11                                                               ` Christoph Lameter
  2006-02-07 17:29                                                                 ` Andi Kleen
  0 siblings, 1 reply; 102+ messages in thread
From: Christoph Lameter @ 2006-02-07 17:11 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, steiner, Paul Jackson, akpm, dgc, Simon.Derr, linux-kernel

On Tue, 7 Feb 2006, Andi Kleen wrote:

> > are you sure this is not some older VM issue? 
> 
> Unless you implement page migration for all caches it's still there.
> The only way to get rid of caches on a node currently is to throw them
> away.  And refetching them from disk is quite costly.

The caches on a node are shrunk dynamically; see the zone_reclaim
functionality introduced in 2.6.16-rc2.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-07 17:06                                                     ` Christoph Lameter
@ 2006-02-07 17:26                                                       ` Andi Kleen
  0 siblings, 0 replies; 102+ messages in thread
From: Andi Kleen @ 2006-02-07 17:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ingo Molnar, Paul Jackson, akpm, dgc, steiner, Simon.Derr, linux-kernel

On Tuesday 07 February 2006 18:06, Christoph Lameter wrote:
> On Tue, 7 Feb 2006, Ingo Molnar wrote:
> 
> > * Andi Kleen <ak@suse.de> wrote:
> > 
> > > I still don't really think it will make much difference if the file 
> > > cache is local or global. Compare to disk IO it is still infinitely 
> > > faster, so a relatively small slowdown from going off node is not that 
> > > big an issue.
> > 
> > well, maybe the SGI folks can give us some numbers?
> 
> The latency may grow (average) by a factor of 4 (same thoughput though on 
> our boxes). On some architectures it is significantly more and also the 
> bandwidth is reduced.
> 
> This is a significant factor. Applications that do not manage locality 
> correctly loose at least 30-40% performance.

This number is for local mapped memory, I assume.

But do you have any numbers for file caches or dentry/inode caches?
My guess is that if an application lost that much in read/write or
readdir/stat, it would be calling them too often :) But it's unlikely,
I guess.

-Andi

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-07 17:10                                                       ` Christoph Lameter
@ 2006-02-07 17:28                                                         ` Andi Kleen
  2006-02-07 17:42                                                           ` Christoph Lameter
  0 siblings, 1 reply; 102+ messages in thread
From: Andi Kleen @ 2006-02-07 17:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ingo Molnar, steiner, Paul Jackson, akpm, dgc, Simon.Derr, linux-kernel

On Tuesday 07 February 2006 18:10, Christoph Lameter wrote:
> On Tue, 7 Feb 2006, Andi Kleen wrote:
> 
> > On Tuesday 07 February 2006 13:30, Ingo Molnar wrote:
> > 
> > > you are a bit biased towards low-latency NUMA setups i guess (read: 
> > > Opterons) :-) 
> > 
> > Well they are the vast majority of NUMA systems Linux runs on.
> 
> The opterons are some strange mix of SMP and NUMA system. The NUMA "nodes" 
> are on the same motherboard 

Actually that's not true - 8-socket systems are built out of two
boards. And there are much bigger systems upcoming.

> and therefore there are only small latencies  
> involved. NUMA only gives small benefits.

That's also not true. Every time I get memory placement for
process memory wrong, users complain _very_ loudly, and there
are clear benefits in benchmarks too.
 
-Andi

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-07 17:11                                                               ` Christoph Lameter
@ 2006-02-07 17:29                                                                 ` Andi Kleen
  2006-02-07 17:39                                                                   ` Christoph Lameter
  0 siblings, 1 reply; 102+ messages in thread
From: Andi Kleen @ 2006-02-07 17:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ingo Molnar, steiner, Paul Jackson, akpm, dgc, Simon.Derr, linux-kernel

On Tuesday 07 February 2006 18:11, Christoph Lameter wrote:
> On Tue, 7 Feb 2006, Andi Kleen wrote:
> 
> > > are you sure this is not some older VM issue? 
> > 
> > Unless you implement page migration for all caches it's still there.
> > The only way to get rid of caches on a node currently is to throw them
> > away.  And refetching them from disk is quite costly.
> 
> The caches on a node are shrunk dynamically see the zone_reclaim 
> functionality introduced in 2.6.16-rc2.

Yes, they're thrown away, which is wasteful. If they were spread
around in the first place, that often wouldn't be needed.

-Andi


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-07 17:29                                                                 ` Andi Kleen
@ 2006-02-07 17:39                                                                   ` Christoph Lameter
  0 siblings, 0 replies; 102+ messages in thread
From: Christoph Lameter @ 2006-02-07 17:39 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, steiner, Paul Jackson, akpm, dgc, Simon.Derr, linux-kernel

On Tue, 7 Feb 2006, Andi Kleen wrote:

> > The caches on a node are shrunk dynamically see the zone_reclaim 
> > functionality introduced in 2.6.16-rc2.
> 
> Yes, they're thrown away which is wasteful. If they were spread
> around in the first place that often wouldn't be needed.

That would reduce performance for a process running on the node, and it
would contaminate other nodes that may have other processes running that
also want optimal access to their files.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-07 17:28                                                         ` Andi Kleen
@ 2006-02-07 17:42                                                           ` Christoph Lameter
  2006-02-07 17:51                                                             ` Andi Kleen
  0 siblings, 1 reply; 102+ messages in thread
From: Christoph Lameter @ 2006-02-07 17:42 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, steiner, Paul Jackson, akpm, dgc, Simon.Derr, linux-kernel

On Tue, 7 Feb 2006, Andi Kleen wrote:

> > The opterons are some strange mix of SMP and NUMA system. The NUMA "nodes" 
> > are on the same motherboard 
> 
> Actually it's not true - 8 socket systems are built out of two
> boards. And there are much bigger systems upcomming.

But they are still next to one another... no distance to cover.
> 
> > and therefore there are only small latencies  
> > involved. NUMA only gives small benefits.
> 
> That's also not true. Everytime I get memory placement for 
> process memory wrong users complain _very_ loudly and there 
> are clear benefits in benchmarks too.

What are the latencies in an 8-way Opteron system? I.e. local memory, next
processor, most distant processor?



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 1/5] cpuset memory spread basic implementation
  2006-02-07 17:42                                                           ` Christoph Lameter
@ 2006-02-07 17:51                                                             ` Andi Kleen
  0 siblings, 0 replies; 102+ messages in thread
From: Andi Kleen @ 2006-02-07 17:51 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ingo Molnar, steiner, Paul Jackson, akpm, dgc, Simon.Derr, linux-kernel

On Tuesday 07 February 2006 18:42, Christoph Lameter wrote:

> > > and therefore there are only small latencies  
> > > involved. NUMA only gives small benefits.
> > 
> > That's also not true. Everytime I get memory placement for 
> > process memory wrong users complain _very_ loudly and there 
> > are clear benefits in benchmarks too.
> 
> What are the latencies in an 8 way opteron system? I.e. Local memory, next 
> processor, most distant processor?

The NUMA factor is surprisingly good: because of the way the cache coherency
works, even local memory access gets slower with more nodes :) iirc it's <3.
Worst-case latency tends to be <200ns.

-Andi

^ permalink raw reply	[flat|nested] 102+ messages in thread

end of thread (latest message: ~2006-02-07 18:02 UTC)

Thread overview: 102+ messages
2006-02-04  7:19 [PATCH 1/5] cpuset memory spread basic implementation Paul Jackson
2006-02-04  7:19 ` [PATCH 2/5] cpuset memory spread page cache implementation and hooks Paul Jackson
2006-02-04 23:49   ` Andrew Morton
2006-02-05  1:42     ` Paul Jackson
2006-02-05  1:54       ` Andrew Morton
2006-02-05  3:28         ` Christoph Lameter
2006-02-05  5:06           ` Andrew Morton
2006-02-05  6:08             ` Paul Jackson
2006-02-05  6:15               ` Andrew Morton
2006-02-05  6:28                 ` Paul Jackson
2006-02-06  0:20                 ` Paul Jackson
2006-02-06  5:51                 ` Paul Jackson
2006-02-06  7:14                   ` Pekka J Enberg
2006-02-06  7:42                     ` Pekka J Enberg
2006-02-06  7:51                       ` Pekka J Enberg
2006-02-06 17:32                         ` Pekka Enberg
2006-02-04  7:19 ` [PATCH 3/5] cpuset memory spread slab cache implementation Paul Jackson
2006-02-04 23:49   ` Andrew Morton
2006-02-05  3:37     ` Christoph Lameter
2006-02-04  7:19 ` [PATCH 4/5] cpuset memory spread slab cache optimizations Paul Jackson
2006-02-04 23:50   ` Andrew Morton
2006-02-05  3:18     ` Paul Jackson
2006-02-04 23:50   ` Andrew Morton
2006-02-05  4:10     ` Paul Jackson
2006-02-04  7:19 ` [PATCH 5/5] cpuset memory spread slab cache hooks Paul Jackson
2006-02-06  4:37   ` Andrew Morton
2006-02-04 23:49 ` [PATCH 1/5] cpuset memory spread basic implementation Andrew Morton
2006-02-05  3:35   ` Christoph Lameter
2006-02-06  4:33   ` Andrew Morton
2006-02-06  5:50     ` Paul Jackson
2006-02-06  6:02       ` Andrew Morton
2006-02-06  6:17         ` Ingo Molnar
2006-02-06  7:22           ` Paul Jackson
2006-02-06  7:43             ` Ingo Molnar
2006-02-06  8:19               ` Paul Jackson
2006-02-06  8:22                 ` Ingo Molnar
2006-02-06  8:40                   ` Ingo Molnar
2006-02-06  9:03                     ` Paul Jackson
2006-02-06  9:09                       ` Ingo Molnar
2006-02-06  9:27                         ` Paul Jackson
2006-02-06  9:37                           ` Ingo Molnar
2006-02-06 20:22                     ` Paul Jackson
2006-02-06  8:47                   ` Paul Jackson
2006-02-06  8:51                     ` Ingo Molnar
2006-02-06  9:09                       ` Paul Jackson
2006-02-06 10:09                   ` Andi Kleen
2006-02-06 10:11                     ` Ingo Molnar
2006-02-06 10:16                       ` Andi Kleen
2006-02-06 10:23                         ` Ingo Molnar
2006-02-06 10:35                           ` Andi Kleen
2006-02-06 14:42                           ` Paul Jackson
2006-02-06 14:35                         ` Paul Jackson
2006-02-06 16:48                           ` Christoph Lameter
2006-02-06 17:11                             ` Andi Kleen
2006-02-06 18:21                               ` Christoph Lameter
2006-02-06 18:36                                 ` Andi Kleen
2006-02-06 18:43                                   ` Christoph Lameter
2006-02-06 18:48                                     ` Andi Kleen
2006-02-06 19:19                                       ` Christoph Lameter
2006-02-06 20:27                                       ` Paul Jackson
2006-02-06 18:43                                   ` Ingo Molnar
2006-02-06 20:01                                     ` Paul Jackson
2006-02-06 20:05                                       ` Ingo Molnar
2006-02-06 20:27                                         ` Christoph Lameter
2006-02-06 20:41                                           ` Ingo Molnar
2006-02-06 20:49                                             ` Christoph Lameter
2006-02-06 21:07                                               ` Ingo Molnar
2006-02-06 22:10                                                 ` Christoph Lameter
2006-02-06 23:29                                                   ` Ingo Molnar
2006-02-06 23:45                                         ` Paul Jackson
2006-02-07  0:19                                           ` Ingo Molnar
2006-02-07  1:17                                             ` David Chinner
2006-02-07  9:31                                             ` Andi Kleen
2006-02-07 11:53                                               ` Ingo Molnar
2006-02-07 12:14                                                 ` Andi Kleen
2006-02-07 12:30                                                   ` Ingo Molnar
2006-02-07 12:43                                                     ` Andi Kleen
2006-02-07 12:58                                                       ` Ingo Molnar
2006-02-07 13:14                                                         ` Andi Kleen
2006-02-07 14:11                                                           ` Ingo Molnar
2006-02-07 14:23                                                             ` Andi Kleen
2006-02-07 17:11                                                               ` Christoph Lameter
2006-02-07 17:29                                                                 ` Andi Kleen
2006-02-07 17:39                                                                   ` Christoph Lameter
2006-02-07 17:10                                                       ` Christoph Lameter
2006-02-07 17:28                                                         ` Andi Kleen
2006-02-07 17:42                                                           ` Christoph Lameter
2006-02-07 17:51                                                             ` Andi Kleen
2006-02-07 17:06                                                     ` Christoph Lameter
2006-02-07 17:26                                                       ` Andi Kleen
2006-02-04 23:50 ` Andrew Morton
2006-02-04 23:57   ` David S. Miller
2006-02-06  4:37 ` Andrew Morton
2006-02-06  6:02   ` Ingo Molnar
2006-02-06  6:56   ` Paul Jackson
2006-02-06  7:08     ` Andrew Morton
2006-02-06  7:39       ` Ingo Molnar
2006-02-06  8:22         ` Paul Jackson
2006-02-06  8:35         ` Paul Jackson
2006-02-06  9:32       ` Paul Jackson
2006-02-06  9:57         ` Andrew Morton
2006-02-06  9:18 ` Simon Derr
