* [PATCH 1/5] cpuset: better bitmap remap defaults
@ 2005-11-04  5:31 Paul Jackson
  2005-11-04  5:31 ` [PATCH 2/5] cpuset: rebind numa vma mempolicy Paul Jackson
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Paul Jackson @ 2005-11-04  5:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Simon Derr, Andi Kleen, Paul Jackson, linux-kernel, Christoph Lameter

Fix the default behaviour for the remap operators in bitmap,
cpumask and nodemask.

As previously submitted, the pair of masks <A, B> defined a map
of the positions of the set bits in A to the corresponding bits
in B.  This is still true.

The issue is how to map the other positions, corresponding to
the unset (0) bits in A.  As previously submitted, they were
all mapped to the first set bit position in B, a constant map.

When I tried to code per-vma mempolicy rebinding using these
remap operators, I realized this was wrong.

This patch changes the default to map all the unset bit positions
in A to the same positions in B, the identity map.

For example, if A has bits 4-7 set, and B has bits 9-12 set,
then the map defined by the pair <A, B> maps each bit position
in the first 32 bits as follows:
	0 ==> 0
	  ...
	3 ==> 3
	4 ==> 9
	  ...
	7 ==> 12
	8 ==> 8
	9 ==> 9
	  ...
	31 ==> 31

This now corresponds to the typical behaviour desired when
migrating pages and policies from one cpuset to another.

The pages on nodes within the original cpuset, and the
references in memory policies to nodes within the original
cpuset, are migrated to the corresponding cpuset-relative nodes
in the destination cpuset.  Other pages and node references
are left untouched.

Signed-off-by: Paul Jackson <pj@sgi.com>

---

 lib/bitmap.c   |   89 ++++++++++++++++++++++++++++-----------------------------
 2 files changed, 53 insertions(+), 47 deletions(-)

--- 2.6.14-rc4-mm1-cpuset-patches.orig/lib/bitmap.c	2005-10-25 00:46:28.750910463 -0700
+++ 2.6.14-rc4-mm1-cpuset-patches/lib/bitmap.c	2005-10-25 06:50:03.755941035 -0700
@@ -519,7 +519,7 @@ EXPORT_SYMBOL(bitmap_parselist);
  *
  * Map the bit at position @pos in @buf (of length @bits) to the
  * ordinal of which set bit it is.  If it is not set or if @pos
- * is not a valid bit position, map to zero (0).
+ * is not a valid bit position, map to -1.
  *
  * If for example, just bits 4 through 7 are set in @buf, then @pos
  * values 4 through 7 will get mapped to 0 through 3, respectively,
@@ -531,18 +531,19 @@ EXPORT_SYMBOL(bitmap_parselist);
  */
 static int bitmap_pos_to_ord(const unsigned long *buf, int pos, int bits)
 {
-	int ord = 0;
+	int i, ord;
 
-	if (pos >= 0 && pos < bits) {
-		int i;
+	if (pos < 0 || pos >= bits || !test_bit(pos, buf))
+		return -1;
 
-		for (i = find_first_bit(buf, bits);
-		     i < pos;
-		     i = find_next_bit(buf, bits, i + 1))
-	     		ord++;
-		if (i > pos)
-			ord = 0;
+	i = find_first_bit(buf, bits);
+	ord = 0;
+	while (i < pos) {
+		i = find_next_bit(buf, bits, i + 1);
+	     	ord++;
 	}
+	BUG_ON(i != pos);
+
 	return ord;
 }
 
@@ -553,11 +554,12 @@ static int bitmap_pos_to_ord(const unsig
  *	@bits: number of valid bit positions in @buf
  *
  * Map the ordinal offset of bit @ord in @buf to its position in @buf.
- * If @ord is not the ordinal offset of a set bit in @buf, map to zero (0).
+ * Value of @ord should be in range 0 <= @ord < weight(buf), else
+ * results are undefined.
  *
  * If for example, just bits 4 through 7 are set in @buf, then @ord
  * values 0 through 3 will get mapped to 4 through 7, respectively,
- * and all other @ord valuds will get mapped to 0.  When @ord value 3
+ * and all other @ord values return undefined values.  When @ord value 3
  * gets mapped to (returns) @pos value 7 in this example, that means
  * that the 3rd set bit (starting with 0th) is at position 7 in @buf.
  *
@@ -583,8 +585,8 @@ static int bitmap_ord_to_pos(const unsig
 
 /**
  * bitmap_remap - Apply map defined by a pair of bitmaps to another bitmap
- *	@src: subset to be remapped
  *	@dst: remapped result
+ *	@src: subset to be remapped
  *	@old: defines domain of map
  *	@new: defines range of map
  *	@bits: number of bits in each of these bitmaps
@@ -596,49 +598,42 @@ static int bitmap_ord_to_pos(const unsig
  * weight of @old, map the position of the n-th set bit in @old to
  * the position of the m-th set bit in @new, where m == n % w.
  *
- * If either of the @old and @new bitmaps are empty, or if@src and @dst
- * point to the same location, then this routine does nothing.
+ * If either of the @old and @new bitmaps are empty, or if @src and
+ * @dst point to the same location, then this routine copies @src
+ * to @dst.
  *
- * The positions of unset bits in @old are mapped to the position of
- * the first set bit in @new.
+ * The positions of unset bits in @old are mapped to themselves
+ * (the identity map).
  *
  * Apply the above specified mapping to @src, placing the result in
  * @dst, clearing any bits previously set in @dst.
  *
- * The resulting value of @dst will have either the same weight as
- * @src, or less weight in the general case that the mapping wasn't
- * injective due to the weight of @new being less than that of @old.
- * The resulting value of @dst will never have greater weight than
- * that of @src, except perhaps in the case that one of the above
- * conditions was not met and this routine just returned.
- *
  * For example, lets say that @old has bits 4 through 7 set, and
  * @new has bits 12 through 15 set.  This defines the mapping of bit
  * position 4 to 12, 5 to 13, 6 to 14 and 7 to 15, and of all other
- * bit positions to 12 (the first set bit in @new.  So if say @src
- * comes into this routine with bits 1, 5 and 7 set, then @dst should
- * leave with bits 12, 13 and 15 set.
+ * bit positions unchanged.  So if say @src comes into this routine
+ * with bits 1, 5 and 7 set, then @dst should leave with bits 1,
+ * 13 and 15 set.
  */
 void bitmap_remap(unsigned long *dst, const unsigned long *src,
 		const unsigned long *old, const unsigned long *new,
 		int bits)
 {
-	int s;
+	int oldbit, w;
 
-	if (bitmap_weight(old, bits) == 0)
-		return;
-	if (bitmap_weight(new, bits) == 0)
-		return;
 	if (dst == src)		/* following doesn't handle inplace remaps */
 		return;
-
 	bitmap_zero(dst, bits);
-	for (s = find_first_bit(src, bits);
-	     s < bits;
-	     s = find_next_bit(src, bits, s + 1)) {
-	     	int x = bitmap_pos_to_ord(old, s, bits);
-		int y = bitmap_ord_to_pos(new, x, bits);
-		set_bit(y, dst);
+
+	w = bitmap_weight(new, bits);
+	for (oldbit = find_first_bit(src, bits);
+	     oldbit < bits;
+	     oldbit = find_next_bit(src, bits, oldbit + 1)) {
+	     	int n = bitmap_pos_to_ord(old, oldbit, bits);
+		if (n < 0 || w == 0)
+			set_bit(oldbit, dst);	/* identity map */
+		else
+			set_bit(bitmap_ord_to_pos(new, n % w, bits), dst);
 	}
 }
 EXPORT_SYMBOL(bitmap_remap);
@@ -657,8 +652,8 @@ EXPORT_SYMBOL(bitmap_remap);
  * weight of @old, map the position of the n-th set bit in @old to
  * the position of the m-th set bit in @new, where m == n % w.
  *
- * The positions of unset bits in @old are mapped to the position of
- * the first set bit in @new.
+ * The positions of unset bits in @old are mapped to themselves
+ * (the identity map).
  *
  * Apply the above specified mapping to bit position @oldbit, returning
  * the new bit position.
@@ -666,14 +661,18 @@ EXPORT_SYMBOL(bitmap_remap);
  * For example, lets say that @old has bits 4 through 7 set, and
  * @new has bits 12 through 15 set.  This defines the mapping of bit
  * position 4 to 12, 5 to 13, 6 to 14 and 7 to 15, and of all other
- * bit positions to 12 (the first set bit in @new.  So if say @oldbit
- * is 5, then this routine returns 13.
+ * bit positions unchanged.  So if say @oldbit is 5, then this routine
+ * returns 13.
  */
 int bitmap_bitremap(int oldbit, const unsigned long *old,
 				const unsigned long *new, int bits)
 {
-	int x = bitmap_pos_to_ord(old, oldbit, bits);
-	return bitmap_ord_to_pos(new, x, bits);
+	int w = bitmap_weight(new, bits);
+	int n = bitmap_pos_to_ord(old, oldbit, bits);
+	if (n < 0 || w == 0)
+		return oldbit;
+	else
+		return bitmap_ord_to_pos(new, n % w, bits);
 }
 EXPORT_SYMBOL(bitmap_bitremap);
 

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 2/5] cpuset: rebind numa vma mempolicy
  2005-11-04  5:31 [PATCH 1/5] cpuset: better bitmap remap defaults Paul Jackson
@ 2005-11-04  5:31 ` Paul Jackson
  2005-11-07  6:10   ` Paul Jackson
  2005-11-04  5:31 ` [PATCH 3/5] cpuset: change marker for relative numbering Paul Jackson
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Paul Jackson @ 2005-11-04  5:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Paul Jackson, Simon Derr, Andi Kleen, Christoph Lameter, linux-kernel

When automatically rebinding the NUMA mempolicy of a task
that is moved to a different cpuset, or of the tasks in
a cpuset whose memory placement 'mems' is changed, this
patch adds the code to rebind the mempolicies of the
vmas in the task's memory space as well.

Signed-off-by: Paul Jackson <pj@sgi.com>

---

 mm/mempolicy.c |   13 +++++++++++--
 1 files changed, 11 insertions(+), 2 deletions(-)

--- 2.6.14-rc5-mm1-cpuset-patches.orig/mm/mempolicy.c	2005-10-29 21:22:39.068225953 -0700
+++ 2.6.14-rc5-mm1-cpuset-patches/mm/mempolicy.c	2005-10-29 21:30:56.420406908 -0700
@@ -1262,10 +1262,19 @@ static void rebind_policy(struct mempoli
 /*
  * Someone moved this task to different nodes.  Fixup mempolicies.
  *
- * TODO - fixup current->mm->vma and shmfs/tmpfs/hugetlbfs policies as well,
- * once we have a cpuset mechanism to mark which cpuset subtree is migrating.
+ * TODO - fixup shmfs/tmpfs/hugetlbfs policies as well.
  */
 void numa_policy_rebind(const nodemask_t *old, const nodemask_t *new)
 {
+	struct mm_struct *mm = current->mm;
+
+	if (mm) {
+		struct vm_area_struct *vma;
+
+		down_write(&mm->mmap_sem);
+		for (vma = mm->mmap; vma; vma = vma->vm_next)
+			rebind_policy(vma->vm_policy, old, new);
+		up_write(&mm->mmap_sem);
+	}
 	rebind_policy(current->mempolicy, old, new);
 }



* [PATCH 3/5] cpuset: change marker for relative numbering
  2005-11-04  5:31 [PATCH 1/5] cpuset: better bitmap remap defaults Paul Jackson
  2005-11-04  5:31 ` [PATCH 2/5] cpuset: rebind numa vma mempolicy Paul Jackson
@ 2005-11-04  5:31 ` Paul Jackson
  2005-11-05  7:08   ` Andrew Morton
  2005-11-04  5:31 ` [PATCH 4/5] cpuset: mempolicy one more nodemask conversion Paul Jackson
  2005-11-04  5:31 ` [PATCH 5/5] cpuset: memory reclaim rate meter Paul Jackson
  3 siblings, 1 reply; 14+ messages in thread
From: Paul Jackson @ 2005-11-04  5:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Simon Derr, Paul Jackson, Andi Kleen, linux-kernel, Christoph Lameter

This patch provides a minimal mechanism to support the safe
cpuset-relative management of CPU and Memory placement from
user library code, in the face of possible external migration
to different CPUs and Memory Nodes.

The interface presented to user space for cpusets uses system wide
numbering of CPUs and Memory Nodes.   It is the responsibility of
user level code, presumably in a library, to present cpuset-relative
numbering to applications when that would be more useful to them.

However if a task is moved to a different cpuset, or if the 'cpus'
or 'mems' of a cpuset are changed, then we need a way for such
library code to detect that its cpuset-relative numbering has
changed, when expressed using system wide numbering.

The kernel cannot safely allow user code to lock kernel resources.
The kernel could deliver out-of-band notice of cpuset changes by
such mechanisms as signals or usermodehelper callbacks, however
this can't be delivered to library code linked in applications
without intruding on the IPC mechanisms available to the app.
The kernel could require user level code to do all the work,
tracking the cpuset state before and during changes, to verify no
unexpected change occurred, but this becomes an onerous task.

The "marker_pid" cpuset field provides a simple way to make this
task less onerous on user library code.  The code writes its pid
to a cpuset's "marker_pid" at the start of a sequence of queries
and updates, and checks as it goes that the cpuset's marker_pid
doesn't change.  The pread(2) system call does a seek and read in
a single call.  If the marker_pid changes, the library code should
retry the required sequence of operations.

Anytime that a task modifies the "cpus" or "mems" of a cpuset,
unless its pid is in the cpuset's marker_pid field, the kernel
zeros this field.

The above was inspired by the load linked and store conditional
(ll/sc) instructions in the MIPS II instruction set.

Signed-off-by: Paul Jackson <pj@sgi.com>

---

 kernel/cpuset.c |   74 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 74 insertions(+)

--- 2.6.14-rc5-mm1-cpuset-patches.orig/kernel/cpuset.c	2005-11-02 23:16:21.532227825 -0800
+++ 2.6.14-rc5-mm1-cpuset-patches/kernel/cpuset.c	2005-11-02 23:17:07.102068196 -0800
@@ -60,6 +60,7 @@ struct cpuset {
 	unsigned long flags;		/* "unsigned long" so bitops work */
 	cpumask_t cpus_allowed;		/* CPUs allowed to tasks in cpuset */
 	nodemask_t mems_allowed;	/* Memory Nodes allowed to tasks */
+	pid_t marker_pid;		/* pid of task doing marked updates */
 
 	/*
 	 * Count is atomic so can incr (fork) or decr (exit) without a lock.
@@ -130,10 +131,49 @@ static inline int notify_on_release(cons
  */
 static atomic_t cpuset_mems_generation = ATOMIC_INIT(1);
 
+/*
+ * marker_pid -- managing cpuset changes safely from user space.
+ *
+ * The interface presented to user space for cpusets uses system wide
+ * numbering of CPUs and Memory Nodes.   It is the responsibility of
+ * user level code, presumably in a library, to present cpuset-relative
+ * numbering to applications when that would be more useful to them.
+ *
+ * However if a task is moved to a different cpuset, or if the 'cpus'
+ * or 'mems' of a cpuset are changed, then we need a way for such
+ * library code to detect that its cpuset-relative numbering has
+ * changed, when expressed using system wide numbering.
+ *
+ * The kernel cannot safely allow user code to lock kernel resources.
+ * The kernel could deliver out-of-band notice of cpuset changes by
+ * such mechanisms as signals or usermodehelper callbacks, however
+ * this can't be delivered to library code linked in applications
+ * without intruding on the IPC mechanisms available to the app.
+ * The kernel could require user level code to do all the work,
+ * tracking the cpuset state before and during changes, to verify no
+ * unexpected change occurred, but this becomes an onerous task.
+ *
+ * The "marker_pid" cpuset field provides a simple way to make this
+ * task less onerous on user library code.  A task writes its pid
+ * to a cpuset's "marker_pid" at the start of a sequence of queries
+ * and updates, and checks as it goes that the cpuset's marker_pid
+ * doesn't change.  The pread(2) system call does a seek and read in
+ * a single call.  If the marker_pid changes, the user code should
+ * retry the required sequence of operations.
+ *
+ * Anytime that a task modifies the "cpus" or "mems" of a cpuset,
+ * unless its pid is in the cpuset's marker_pid field, the kernel
+ * zeros this field.
+ *
+ * The above was inspired by the load linked and store conditional
+ * (ll/sc) instructions in the MIPS II instruction set.
+ */
+
 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
 	.cpus_allowed = CPU_MASK_ALL,
 	.mems_allowed = NODE_MASK_ALL,
+	.marker_pid = 0,
 	.count = ATOMIC_INIT(0),
 	.sibling = LIST_HEAD_INIT(top_cpuset.sibling),
 	.children = LIST_HEAD_INIT(top_cpuset.children),
@@ -793,6 +833,19 @@ static int update_nodemask(struct cpuset
 }
 
 /*
+ * Call with manage_sem held.
+ */
+
+static int update_marker_pid(struct cpuset *cs, char *buf)
+{
+	if (simple_strtoul(buf, NULL, 10) != 0)
+		cs->marker_pid = current->pid;
+	else
+		cs->marker_pid = 0;
+	return 0;
+}
+
+/*
  * update_flag - read a 0 or a 1 in a file and update associated flag
  * bit:	the bit to update (CS_CPU_EXCLUSIVE, CS_MEM_EXCLUSIVE,
  *						CS_NOTIFY_ON_RELEASE)
@@ -910,6 +963,7 @@ typedef enum {
 	FILE_CPU_EXCLUSIVE,
 	FILE_MEM_EXCLUSIVE,
 	FILE_NOTIFY_ON_RELEASE,
+	FILE_MARKER_PID,
 	FILE_TASKLIST,
 } cpuset_filetype_t;
 
@@ -922,6 +976,7 @@ static ssize_t cpuset_common_file_write(
 	char *buffer;
 	char *pathbuf = NULL;
 	int retval = 0;
+	int marked_change;
 
 	/* Crude upper limit on largest legitimate cpulist user might write. */
 	if (nbytes > 100 + 6 * NR_CPUS)
@@ -944,12 +999,15 @@ static ssize_t cpuset_common_file_write(
 		goto out2;
 	}
 
+	marked_change = 0;
 	switch (type) {
 	case FILE_CPULIST:
 		retval = update_cpumask(cs, buffer);
+		marked_change = 1;
 		break;
 	case FILE_MEMLIST:
 		retval = update_nodemask(cs, buffer);
+		marked_change = 1;
 		break;
 	case FILE_CPU_EXCLUSIVE:
 		retval = update_flag(CS_CPU_EXCLUSIVE, cs, buffer);
@@ -960,6 +1018,9 @@ static ssize_t cpuset_common_file_write(
 	case FILE_NOTIFY_ON_RELEASE:
 		retval = update_flag(CS_NOTIFY_ON_RELEASE, cs, buffer);
 		break;
+	case FILE_MARKER_PID:
+		retval = update_marker_pid(cs, buffer);
+		break;
 	case FILE_TASKLIST:
 		retval = attach_task(cs, buffer, &pathbuf);
 		break;
@@ -968,6 +1029,9 @@ static ssize_t cpuset_common_file_write(
 		goto out2;
 	}
 
+	if (marked_change && retval == 0 && cs->marker_pid != current->pid)
+		cs->marker_pid = 0;
+
 	if (retval == 0)
 		retval = nbytes;
 out2:
@@ -1060,6 +1124,9 @@ static ssize_t cpuset_common_file_read(s
 	case FILE_NOTIFY_ON_RELEASE:
 		*s++ = notify_on_release(cs) ? '1' : '0';
 		break;
+	case FILE_MARKER_PID:
+		s += sprintf(s, "%d", cs->marker_pid);
+		break;
 	default:
 		retval = -EINVAL;
 		goto out;
@@ -1408,6 +1475,11 @@ static struct cftype cft_notify_on_relea
 	.private = FILE_NOTIFY_ON_RELEASE,
 };
 
+static struct cftype cft_marker_pid = {
+	.name = "marker_pid",
+	.private = FILE_MARKER_PID,
+};
+
 static int cpuset_populate_dir(struct dentry *cs_dentry)
 {
 	int err;
@@ -1422,6 +1494,8 @@ static int cpuset_populate_dir(struct de
 		return err;
 	if ((err = cpuset_add_file(cs_dentry, &cft_notify_on_release)) < 0)
 		return err;
+	if ((err = cpuset_add_file(cs_dentry, &cft_marker_pid)) < 0)
+		return err;
 	if ((err = cpuset_add_file(cs_dentry, &cft_tasks)) < 0)
 		return err;
 	return 0;



* [PATCH 4/5] cpuset: mempolicy one more nodemask conversion
  2005-11-04  5:31 [PATCH 1/5] cpuset: better bitmap remap defaults Paul Jackson
  2005-11-04  5:31 ` [PATCH 2/5] cpuset: rebind numa vma mempolicy Paul Jackson
  2005-11-04  5:31 ` [PATCH 3/5] cpuset: change marker for relative numbering Paul Jackson
@ 2005-11-04  5:31 ` Paul Jackson
  2005-11-04  5:31 ` [PATCH 5/5] cpuset: memory reclaim rate meter Paul Jackson
  3 siblings, 0 replies; 14+ messages in thread
From: Paul Jackson @ 2005-11-04  5:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Paul Jackson, Simon Derr, Andi Kleen, Christoph Lameter, linux-kernel

Finish converting mm/mempolicy.c from bitmaps to nodemasks.
The previous conversion had left one routine using bitmaps,
since it involved a corresponding change to kernel/cpuset.c.

Fix that interface by replacing it with a simple macro that
calls nodes_subset(), or, if !CONFIG_CPUSETS, returns (1).

Signed-off-by: Paul Jackson <pj@sgi.com>

---

 include/linux/cpuset.h |    5 +++--
 kernel/cpuset.c        |   10 ----------
 mm/mempolicy.c         |    5 ++---
 3 files changed, 5 insertions(+), 15 deletions(-)

--- 2.6.14-rc5-mm1-cpuset-patches.orig/include/linux/cpuset.h	2005-11-02 23:18:21.015972401 -0800
+++ 2.6.14-rc5-mm1-cpuset-patches/include/linux/cpuset.h	2005-11-02 23:18:22.437863143 -0800
@@ -21,7 +21,8 @@ extern void cpuset_exit(struct task_stru
 extern cpumask_t cpuset_cpus_allowed(const struct task_struct *p);
 void cpuset_init_current_mems_allowed(void);
 void cpuset_update_current_mems_allowed(void);
-void cpuset_restrict_to_mems_allowed(unsigned long *nodes);
+#define cpuset_nodes_subset_current_mems_allowed(nodes) \
+		nodes_subset((nodes), current->mems_allowed)
 int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
 extern int cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask);
 extern int cpuset_excl_nodes_overlap(const struct task_struct *p);
@@ -42,7 +43,7 @@ static inline cpumask_t cpuset_cpus_allo
 
 static inline void cpuset_init_current_mems_allowed(void) {}
 static inline void cpuset_update_current_mems_allowed(void) {}
-static inline void cpuset_restrict_to_mems_allowed(unsigned long *nodes) {}
+#define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
 
 static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
 {
--- 2.6.14-rc5-mm1-cpuset-patches.orig/kernel/cpuset.c	2005-11-02 23:18:21.015972401 -0800
+++ 2.6.14-rc5-mm1-cpuset-patches/kernel/cpuset.c	2005-11-02 23:18:22.438839716 -0800
@@ -1790,16 +1790,6 @@ done:
 }
 
 /**
- * cpuset_restrict_to_mems_allowed - limit nodes to current mems_allowed
- * @nodes: pointer to a node bitmap that is and-ed with mems_allowed
- */
-void cpuset_restrict_to_mems_allowed(unsigned long *nodes)
-{
-	bitmap_and(nodes, nodes, nodes_addr(current->mems_allowed),
-							MAX_NUMNODES);
-}
-
-/**
  * cpuset_zonelist_valid_mems_allowed - check zonelist vs. curremt mems_allowed
  * @zl: the zonelist to be checked
  *
--- 2.6.14-rc5-mm1-cpuset-patches.orig/mm/mempolicy.c	2005-11-02 23:18:21.015972401 -0800
+++ 2.6.14-rc5-mm1-cpuset-patches/mm/mempolicy.c	2005-11-02 23:18:22.440792863 -0800
@@ -342,10 +342,9 @@ static int contextualize_policy(int mode
 	if (!nodes)
 		return 0;
 
-	/* Update current mems_allowed */
 	cpuset_update_current_mems_allowed();
-	/* Ignore nodes not set in current->mems_allowed */
-	cpuset_restrict_to_mems_allowed(nodes->bits);
+	if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
+		return -EINVAL;
 	return mpol_check_policy(mode, nodes);
 }
 



* [PATCH 5/5] cpuset: memory reclaim rate meter
  2005-11-04  5:31 [PATCH 1/5] cpuset: better bitmap remap defaults Paul Jackson
                   ` (2 preceding siblings ...)
  2005-11-04  5:31 ` [PATCH 4/5] cpuset: mempolicy one more nodemask conversion Paul Jackson
@ 2005-11-04  5:31 ` Paul Jackson
  2005-11-05  7:13   ` Andrew Morton
  2005-11-05  7:27   ` Paul Jackson
  3 siblings, 2 replies; 14+ messages in thread
From: Paul Jackson @ 2005-11-04  5:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Simon Derr, Paul Jackson, Andi Kleen, linux-kernel, Christoph Lameter

Provide a simple per-cpuset metric of memory stress, tracking the
-rate- that the tasks in a cpuset call try_to_free_pages(), the
synchronous (direct) memory reclaim code.

This enables batch managers monitoring jobs running in dedicated
cpusets to efficiently detect the level of memory stress a job
is encountering.

This is useful both on tightly managed systems running a wide mix
of submitted jobs, where the manager may choose to terminate or
reprioritize jobs that are trying to use more memory than allowed
on the nodes assigned to them, and with tightly coupled, long
running, massively parallel scientific computing jobs that will
dramatically fail to meet required performance goals if they
start to swap.

This patch just provides a very economical way for the batch manager
to monitor a cpuset for signs of memory distress.  It's up to the
batch manager or other user code to decide what to do about it and
take action.

A simple per-cpuset digital filter (requiring a spinlock and three
words of data per cpuset) is kept, and is updated by any task
attached to that cpuset, if it enters the synchronous (direct)
page reclaim code.

A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by
the tasks in the cpuset, in units of reclaims attempted per second,
times 1000.

Signed-off-by: Paul Jackson <pj@sgi.com>

---

An earlier patch that tried to address this same problem
is in the thread:

	http://lkml.org/lkml/2005/3/19/148
	Date	Sat, 19 Mar 2005 17:48:46 -0800 (PST)
	Subject	[Patch] cpusets policy kill no swap

It was rejected, as it hardwired policy, mechanism and
detection together, in a manner that would only have
been useful to very specialized customer needs.

 include/linux/cpuset.h |    3 
 kernel/cpuset.c        |  158 ++++++++++++++++++++++++++++++++++++++++++++++++-
 mm/page_alloc.c        |    1 
 3 files changed, 161 insertions(+), 1 deletion(-)

--- 2.6.14-rc5-mm1-cpuset-patches.orig/include/linux/cpuset.h	2005-11-03 19:07:00.026535184 -0800
+++ 2.6.14-rc5-mm1-cpuset-patches/include/linux/cpuset.h	2005-11-03 19:07:00.130051971 -0800
@@ -26,6 +26,7 @@ void cpuset_update_current_mems_allowed(
 int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
 extern int cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask);
 extern int cpuset_excl_nodes_overlap(const struct task_struct *p);
+extern void cpuset_synchronous_page_reclaim_bump(void);
 extern struct file_operations proc_cpuset_operations;
 extern char *cpuset_task_status_allowed(struct task_struct *task, char *buffer);
 
@@ -60,6 +61,8 @@ static inline int cpuset_excl_nodes_over
 	return 1;
 }
 
+static inline void cpuset_synchronous_page_reclaim_bump(void) {}
+
 static inline char *cpuset_task_status_allowed(struct task_struct *task,
 							char *buffer)
 {
--- 2.6.14-rc5-mm1-cpuset-patches.orig/kernel/cpuset.c	2005-11-03 19:07:00.050949521 -0800
+++ 2.6.14-rc5-mm1-cpuset-patches/kernel/cpuset.c	2005-11-03 19:28:12.507607967 -0800
@@ -56,6 +56,15 @@
 
 #define CPUSET_SUPER_MAGIC 		0x27e0eb
 
+/* See "Frequency meter" comments, below. */
+
+struct fmeter {
+	int cnt;		/* unprocessed events count */
+	int val;		/* most recent output value */
+	time_t time;		/* clock (secs) when val computed */
+	spinlock_t lock;	/* guards read or write of above */
+};
+
 struct cpuset {
 	unsigned long flags;		/* "unsigned long" so bitops work */
 	cpumask_t cpus_allowed;		/* CPUs allowed to tasks in cpuset */
@@ -81,7 +90,9 @@ struct cpuset {
 	 * Copy of global cpuset_mems_generation as of the most
 	 * recent time this cpuset changed its mems_allowed.
 	 */
-	 int mems_generation;
+	int mems_generation;
+
+	struct fmeter fmeter;		/* memory_reclaim_rate filter */
 };
 
 /* bits in struct cpuset flags field */
@@ -174,6 +185,9 @@ static struct cpuset top_cpuset = {
 	.cpus_allowed = CPU_MASK_ALL,
 	.mems_allowed = NODE_MASK_ALL,
 	.marker_pid = 0,
+	.fmeter.cnt = 0,
+	.fmeter.val = 0,
+	.fmeter.time = 0,
 	.count = ATOMIC_INIT(0),
 	.sibling = LIST_HEAD_INIT(top_cpuset.sibling),
 	.children = LIST_HEAD_INIT(top_cpuset.children),
@@ -887,6 +901,104 @@ static int update_flag(cpuset_flagbits_t
 }
 
 /*
+ * Frequency meter - How fast is some event occurring?
+ *
+ * These routines manage a digitally filtered, constant time based,
+ * event frequency meter.  There are four routines:
+ *   fmeter_init() - initialize a frequency meter.
+ *   fmeter_markevent() - called each time the event happens.
+ *   fmeter_getrate() - returns the recent rate of such events.
+ *   fmeter_update() - internal routine used to update fmeter.
+ *
+ * A common data structure is passed to each of these routines,
+ * which is used to keep track of the state required to manage the
+ * frequency meter and its digital filter.
+ *
+ * The filter works on the number of events marked per unit time.
+ * The filter is single-pole low-pass recursive (IIR).  The time unit
+ * is 1 second.  Arithmetic is done using 32-bit integers scaled to
+ * simulate 3 decimal digits of precision (multiplied by 1000).
+ *
+ * With an FM_COEF of 933, and a time base of 1 second, the filter
+ * has a half-life of 10 seconds, meaning that if the events quit
+ * happening, then the rate returned from the fmeter_getrate()
+ * will be cut in half each 10 seconds, until it converges to zero.
+ *
+ * It is not worth doing a real infinitely recursive filter.  If more
+ * than FM_MAXTICKS ticks have elapsed since the last filter event,
+ * just compute FM_MAXTICKS ticks worth, by which point the level
+ * will be stable.
+ *
+ * Limit the count of unprocessed events to FM_MAXCNT, so as to avoid
+ * arithmetic overflow in the fmeter_update() routine.
+ *
+ * Given the simple 32 bit integer arithmetic used, this meter works
+ * best for reporting rates between one per millisecond (msec) and
+ * one per 32 (approx) seconds.  At constant rates faster than one
+ * per msec it maxes out at values just under 1,000,000.  At constant
+ * rates between one per msec, and one per second it will stabilize
+ * to a value N*1000, where N is the rate of events per second.
+ * At constant rates between one per second and one per 32 seconds,
+ * it will be choppy, moving up on the seconds that have an event,
+ * and then decaying until the next event.  At rates slower than
+ * about one in 32 seconds, it decays all the way back to zero between
+ * each event.
+ */
+
+#define FM_COEF 933		/* coefficient for half-life of 10 secs */
+#define FM_MAXTICKS ((time_t)99) /* useless computing more ticks than this */
+#define FM_MAXCNT 1000000	/* limit cnt to avoid overflow */
+#define FM_SCALE 1000		/* faux fixed point scale */
+
+/* Initialize a frequency meter */
+static void fmeter_init(struct fmeter *fmp)
+{
+	fmp->cnt = 0;
+	fmp->val = 0;
+	fmp->time = 0;
+	spin_lock_init(&fmp->lock);
+}
+
+/* Internal meter update - process cnt events and update value */
+static void fmeter_update(struct fmeter *fmp)
+{
+	time_t now = get_seconds();
+	time_t ticks = now - fmp->time;
+
+	if (ticks == 0)
+		return;
+
+	ticks = min(FM_MAXTICKS, ticks);
+	while (ticks-- > 0)
+		fmp->val = (FM_COEF * fmp->val) / FM_SCALE;
+	fmp->time = now;
+
+	fmp->val += ((FM_SCALE - FM_COEF) * fmp->cnt) / FM_SCALE;
+	fmp->cnt = 0;
+}
+
+/* Process any previous ticks, then bump cnt by one (times scale). */
+static void fmeter_markevent(struct fmeter *fmp)
+{
+	spin_lock(&fmp->lock);
+	fmeter_update(fmp);
+	fmp->cnt = min(FM_MAXCNT, fmp->cnt + FM_SCALE);
+	spin_unlock(&fmp->lock);
+}
+
+/* Process any previous ticks, then return current value. */
+static int fmeter_getrate(struct fmeter *fmp)
+{
+	int val;
+
+	spin_lock(&fmp->lock);
+	fmeter_update(fmp);
+	val = fmp->val;
+	spin_unlock(&fmp->lock);
+	return val;
+}
+
+/*
  * Attack task specified by pid in 'pidbuf' to cpuset 'cs', possibly
  * writing the path of the old cpuset in 'ppathbuf' if it needs to be
  * notified on release.
@@ -964,6 +1076,7 @@ typedef enum {
 	FILE_MEM_EXCLUSIVE,
 	FILE_NOTIFY_ON_RELEASE,
 	FILE_MARKER_PID,
+	FILE_RECLAIM_RATE,
 	FILE_TASKLIST,
 } cpuset_filetype_t;
 
@@ -1021,6 +1134,9 @@ static ssize_t cpuset_common_file_write(
 	case FILE_MARKER_PID:
 		retval = update_marker_pid(cs, buffer);
 		break;
+	case FILE_RECLAIM_RATE:
+		retval = -EACCES;
+		break;
 	case FILE_TASKLIST:
 		retval = attach_task(cs, buffer, &pathbuf);
 		break;
@@ -1127,6 +1243,9 @@ static ssize_t cpuset_common_file_read(s
 	case FILE_MARKER_PID:
 		s += sprintf(s, "%d", cs->marker_pid);
 		break;
+	case FILE_RECLAIM_RATE:
+		s += sprintf(s, "%d", fmeter_getrate(&cs->fmeter));
+		break;
 	default:
 		retval = -EINVAL;
 		goto out;
@@ -1480,6 +1599,11 @@ static struct cftype cft_marker_pid = {
 	.private = FILE_MARKER_PID,
 };
 
+static struct cftype cft_reclaim_rate = {
+	.name = "memory_reclaim_rate",
+	.private = FILE_RECLAIM_RATE,
+};
+
 static int cpuset_populate_dir(struct dentry *cs_dentry)
 {
 	int err;
@@ -1496,6 +1620,8 @@ static int cpuset_populate_dir(struct de
 		return err;
 	if ((err = cpuset_add_file(cs_dentry, &cft_marker_pid)) < 0)
 		return err;
+	if ((err = cpuset_add_file(cs_dentry, &cft_reclaim_rate)) < 0)
+		return err;
 	if ((err = cpuset_add_file(cs_dentry, &cft_tasks)) < 0)
 		return err;
 	return 0;
@@ -1531,6 +1657,7 @@ static long cpuset_create(struct cpuset 
 	INIT_LIST_HEAD(&cs->children);
 	atomic_inc(&cpuset_mems_generation);
 	cs->mems_generation = atomic_read(&cpuset_mems_generation);
+	fmeter_init(&cs->fmeter);
 
 	cs->parent = parent;
 
@@ -1620,6 +1747,7 @@ int __init cpuset_init(void)
 	top_cpuset.cpus_allowed = CPU_MASK_ALL;
 	top_cpuset.mems_allowed = NODE_MASK_ALL;
 
+	fmeter_init(&top_cpuset.fmeter);
 	atomic_inc(&cpuset_mems_generation);
 	top_cpuset.mems_generation = atomic_read(&cpuset_mems_generation);
 
@@ -1929,6 +2057,34 @@ done:
 	return overlap;
 }
 
+/**
+ * cpuset_synchronous_page_reclaim_bump - keep stats of per-cpuset reclaims.
+ *
+ * Keep a running average of the rate of synchronous (direct)
+ * page reclaim efforts initiated by tasks in each cpuset.
+ *
+ * This represents the rate at which some task in the cpuset
+ * ran low on memory on all nodes it was allowed to use, and
+ * had to enter the kernel's page reclaim code in an effort to
+ * create more free memory by tossing clean pages or swapping
+ * or writing dirty pages.
+ *
+ * Display to user space in the per-cpuset read-only file
+ * "memory_reclaim_rate".  Value displayed is an integer
+ * representing the recent rate of entry into the synchronous
+ * (direct) page reclaim by any task attached to the cpuset.
+ **/
+
+void cpuset_synchronous_page_reclaim_bump(void)
+{
+	struct cpuset *cs;
+
+	task_lock(current);
+	cs = current->cpuset;
+	fmeter_markevent(&cs->fmeter);
+	task_unlock(current);
+}
+
 /*
  * proc_cpuset_show()
  *  - Print tasks cpuset path into seq_file.
--- 2.6.14-rc5-mm1-cpuset-patches.orig/mm/page_alloc.c	2005-11-03 19:06:53.301850334 -0800
+++ 2.6.14-rc5-mm1-cpuset-patches/mm/page_alloc.c	2005-11-03 19:26:02.267868104 -0800
@@ -976,6 +976,7 @@ rebalance:
 	cond_resched();
 
 	/* We now go into synchronous reclaim */
+	cpuset_synchronous_page_reclaim_bump();
 	p->flags |= PF_MEMALLOC;
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/5] cpuset: change marker for relative numbering
  2005-11-04  5:31 ` [PATCH 3/5] cpuset: change marker for relative numbering Paul Jackson
@ 2005-11-05  7:08   ` Andrew Morton
  2005-11-06 10:04     ` Paul Jackson
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2005-11-05  7:08 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Simon.Derr, pj, ak, linux-kernel, clameter

Paul Jackson <pj@sgi.com> wrote:
>
>  This patch provides a minimal mechanism to support the safe
>  cpuset-relative management of CPU and Memory placement from
>  user library code, in the face of possible external migration
>  to different CPU's and Memory Nodes.

I guess you mean external migration of a cpuset to different CPUs and
memory nodes via FILE_CPULIST and FILE_MEMLIST?

>  The interface presented to user space for cpusets uses system wide
>  numbering of CPUs and Memory Nodes.   It is the responsibility of
>  user level code, presumably in a library, to present cpuset-relative
>  numbering to applications when that would be more useful to them.
> 
>  However if a task is moved to a different cpuset, or if the 'cpus'
>  or 'mems' of a cpuset are changed, then we need a way for such
>  library code to detect that its cpuset-relative numbering has
>  changed, when expressed using system wide numbering.

Why?  If someone calls into that library for a query, then that library can
call into the kernel to query the current state, no?  Are you assuming that
this library will wish to cache state which belongs to the kernel?

>  The kernel cannot safely allow user code to lock kernel resources.

Well yes - in the presence of other processes dinking with a cpuset's
settings, such a library is always racy.  Unless it provides
kernel-triggered callbacks when something changes.  Or unless it does
locking.

>  The kernel could deliver out-of-band notice of cpuset changes by
>  such mechanisms as signals or usermodehelper callbacks, however
>  this can't be delivered to library code linked in applications
>  without intruding on the IPC mechanisms available to the app.

connector?

>  The kernel could require user level code to do all the work,
>  tracking the cpuset state before and during changes, to verify no
>  unexpected change occurred, but this becomes an onerous task.

Not sure I understand this, but if you're saying that maintaining state in
a single place is good then yup.

>  The "marker_pid" cpuset field provides a simple way to make this
>  task less onerous on user library code.  The code writes its pid
>  to a cpuset's "marker_pid" at the start of a sequence of queries
>  and updates, and checks as it goes that the cpuset's marker_pid
>  doesn't change.  The pread(2) system call does a seek and read in
>  a single call.  If the marker_pid changes, the library code should
>  retry the required sequence of operations.

<looks for API documentation in cpusets.txt, gives up>

Shouldn't all those files in cpusetfs be documented?

So what is the <undocumented> programming interface which you are proposing
here?

a) process writes a non-zero number to file "marker_pid" in cpusetfs. 
   Kernel remembers that the calling task "owns" the cpuset for updates.

b) cpuset updates proceed.  If a different task does an update, marker_pid
   gets cleared.

c) process reads "marker_pid" and if that's still equal to getpid(), we
   know that nobody raced with us.  If someone _did_ race with us, what? 
   re-read everything?

d) process writes zero to "marker_pid" to release the marker.


Given that you're developing a library to do all this, why not do proper
locking in userspace and require that all cpuset updates go via the
library?

Does this code work correctly if different threads stomp on each others'
toes?  I think so.


hm, cpuset_common_file_write() likes to return 0 for the number of bytes
written.  This could cause some userspace tools to hang, repeating the
write all the time, or to report a write error.  It should return nbytes on
success.

>  Anytime that a task modifies the "cpus" or "mems" of a cpuset,
>  unless its pid is in the cpuset's marker_pid field, the kernel
>  zeros this field.


* Re: [PATCH 5/5] cpuset: memory reclaim rate meter
  2005-11-04  5:31 ` [PATCH 5/5] cpuset: memory reclaim rate meter Paul Jackson
@ 2005-11-05  7:13   ` Andrew Morton
  2005-11-05  7:34     ` Paul Jackson
  2005-11-05  7:43     ` Paul Jackson
  2005-11-05  7:27   ` Paul Jackson
  1 sibling, 2 replies; 14+ messages in thread
From: Andrew Morton @ 2005-11-05  7:13 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Simon.Derr, pj, ak, linux-kernel, clameter

Paul Jackson <pj@sgi.com> wrote:
>
> +static ineline void cpuset_synchronous_page_reclaim_bump(void) {}

Get down and give me fifty.

I guess I'll give up and merge this thing.


* Re: [PATCH 5/5] cpuset: memory reclaim rate meter
  2005-11-04  5:31 ` [PATCH 5/5] cpuset: memory reclaim rate meter Paul Jackson
  2005-11-05  7:13   ` Andrew Morton
@ 2005-11-05  7:27   ` Paul Jackson
  1 sibling, 0 replies; 14+ messages in thread
From: Paul Jackson @ 2005-11-05  7:27 UTC (permalink / raw)
  To: Paul Jackson; +Cc: akpm, Simon.Derr, ak, linux-kernel, clameter

pj wrote:
> Provide a simple per-cpuset metric of memory stress, tracking the
> -rate- that the tasks in a cpuset call try_to_free_pages(), the
> synchronous (direct) memory reclaim code.

On another thread
	"[PATCH 0/7] Fragmentation Avoidance V19" (Mel Gorman's)
Andrew rejected this memory reclaim meter patch, for several reasons.

I hope the other 4 patches in this set will go into *-mm, and be
considered for 2.6.15:

    [PATCH 1/5] cpuset: better bitmap remap defaults
    [PATCH 2/5] cpuset: rebind numa vma mempolicy
    [PATCH 3/5] cpuset: change marker for relative numbering
    [PATCH 4/5] cpuset: mempolicy one more nodemask conversion

(Well - I see 2/5, and its fix 6/5, are raising issues in my inbox ;)

I am about to resubmit this 5/5 memory reclaim rate, with a change
to make it essentially zero cost (test a global flag) on systems that
don't use it.

We continue to have a critical need on large systems under batch
management for the batch manager to detect, economically and quickly,
when a job is applying significant memory pressure to its cpuset.

Per-task or per-mm statistics can provide more detail, on a variety
of specific metrics.  But batch managers on big systems neither want,
nor can afford, these statistics.

To access some attribute of each task in a cpuset first requires
reading the cpuset's 'tasks' file, to list the tasks in the cpuset.
This scans the task list to see which tasks reference that cpuset.
Since the tasks in a cpuset may change frequently, a batch manager
would have to relist the tasks each time it examined that cpuset for
memory pressure.  Frequent task list scans and accessing a file
per-task gets expensive.

From what I can tell, the direct reclaim code, try_to_free_pages(),
is not called at a high rate.  In my simple tests, it was called 10 to
20 times per second, by the threads in a cpuset allocating memory as
fast as they can, past what fit easily.  I suspect that it is rate
limited by disk speeds, to a rough approximation.

Experience on prior systems, and my simple tests now, show that
watching for entry into direct reclaim provides a good indication
of memory pressure.  Certainly other events happen under pressure as
well; batch managers really don't care.  They just want to know when
a job is trying to blow out of its memory confinement.

There may well be a need for more specific memory management counters
on a per-task or per-mm basis.  I don't have the expertise, experience
or franchise to drive that effort.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH 5/5] cpuset: memory reclaim rate meter
  2005-11-05  7:13   ` Andrew Morton
@ 2005-11-05  7:34     ` Paul Jackson
  2005-11-05  7:43     ` Paul Jackson
  1 sibling, 0 replies; 14+ messages in thread
From: Paul Jackson @ 2005-11-05  7:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Simon.Derr, ak, linux-kernel, clameter

> Get down and give me fifty.

No chance.  The best I could do in basic training was 15.
Almost got drummed out of the Military on that account.

I am far older and weaker now.


> I guess I'll give up and merge this thing.

If you'd like one that avoids doing any work on the 99.3% of the world's
systems that don't need this, hang on a little bit and I will send you
the new improved version.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH 5/5] cpuset: memory reclaim rate meter
  2005-11-05  7:13   ` Andrew Morton
  2005-11-05  7:34     ` Paul Jackson
@ 2005-11-05  7:43     ` Paul Jackson
  1 sibling, 0 replies; 14+ messages in thread
From: Paul Jackson @ 2005-11-05  7:43 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Simon.Derr, ak, linux-kernel, clameter

> I guess I'll give up and merge this thing.

Actually - definitely wait for the next version.

Kill the version you have of this patch.

I just realized that "memory_reclaim_rate" was a bogus
name for this feature.  The users, batch managers, don't
give a dang that it's hooked into some direct reclaim
point in the kernel.

They just want a decent measure of memory pressure in
a cpuset.

So I am going to rename the flag, to "memory_pressure".

Kill the current 'memory reclaim rate' patch.  I will
resend in perhaps an hour, under this new name.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH 3/5] cpuset: change marker for relative numbering
  2005-11-05  7:08   ` Andrew Morton
@ 2005-11-06 10:04     ` Paul Jackson
  2005-11-06 20:57       ` Andrew Morton
  0 siblings, 1 reply; 14+ messages in thread
From: Paul Jackson @ 2005-11-06 10:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Simon.Derr, ak, linux-kernel, clameter

Andrew wrote:
> >  This patch provides a minimal mechanism to support the safe
> >  cpuset-relative management of CPU and Memory placement from
> >  user library code, in the face of possible external migration
> >  to different CPU's and Memory Nodes.
> 
> I guess you mean external migration of a cpuset to different CPUs and
> memory nodes via FILE_CPULIST and FILE_MEMLIST?

Yes.

> >  However if a task is moved to a different cpuset, or if the 'cpus'
> >  or 'mems' of a cpuset are changed, then we need a way for such
> >  library code to detect that its cpuset-relative numbering has
> >  changed, when expressed using system wide numbering.
> 
> Why?  If someone calls into that library for a query, then that library can
> call into the kernel to query the current state, no?  Are you assuming that
> this library will wish to cache state which belongs to the kernel?

A single shot query is no problem.  Many of the libcpuset calls
involve a sequence of operations on cpusets.

Let's say, for example, the application asks to have a thread pinned on
the 3rd CPU in its cpuset.  The library code must read the current
cpuset placement, calculate what CPU is 3rd, and issue a sched_setaffinity
call.

Or in a little more complex operation, the app might construct an
internal specification of a cpuset, specifying certain cpus and
mems, using numbering relative to its parent cpuset, then ask for
a child cpuset to be created, in accordance with that specification.

The library will have to work overtime to have this work correctly,
in the face of possible movement of the parent cpuset at any time.

> >  The kernel could deliver out-of-band notice of cpuset changes by
> >  such mechanisms as signals or usermodehelper callbacks, however
> >  this can't be delivered to library code linked in applications
> >  without intruding on the IPC mechanisms available to the app.
> 
> connector?

True - connector/netlink technology, or other mechanisms that
delivered asynchronous notice via a file or socket, would not
intrude noticeably on the app's IPC mechanisms.

Other criteria that I didn't mention ... Registering for a
connector takes more kernel text space than the simple marker_pid.

And the connector delivers its message asynchronously, so using
it here would be racy.  If task A is performing a series of operations
on a cpuset that task B is messing with, and if task B sends off notice
of its changes via some connector or similar mechanism, it might not
arrive at task A prior to A completing its operations and returning to
the calling user code, thinking that all was well.  If task A checks
for the cpuset's marker_pid being unmodified at the end of its
operations, that is sequenced on the cpuset 'manage_sem' with task B's
changes.

> >  The kernel could require user level code to do all the work,
> >  tracking the cpuset state before and during changes, to verify no
> >  unexpected change occurred, but this becomes an onerous task.
> 
> Not sure I understand this,

The library code could read in the entire state of a cpuset,
perform its operations (noting the affect of these ops on
the cpuset state) and then re-read the cpuset state to see if
it had exactly the intended state.  Not impossible, but onerous.

> Shouldn't all those files in cpusetfs be documented?

Yup.  It's on my short list of things to do, real soon now.

> c) process reads "marker_pid" and if that's still equal to getpid(), we
>    know that nobody raced with us.  If someone _did_ race with us, what? 
>    re-read everything?

If someone did race us, then yes, re-read and redo the operation.

> d) process writes zero to "marker_pid" to release the marker.

No need for that, which is an important detail.  It is good that
no one cares if we leave our stale pid in a marker_pid past our
caring about it.

> Given that you're developing a library to do all this, why not do proper
> locking in userspace and require that all cpuset updates go via the
> library?

The superficial reason why not is that I can't prevent someone from
doing "echo pid > /dev/cpuset/tasks", moving a task to another cpuset
even as that task is in the middle of dorking with its cpuset.  Not all
cpuset operations will involve using my blessed library.

The actual reason why not is that I have a strong bias toward localized
"every man for himself" solutions, over globally coordinated solutions.

> hm, cpuset_common_file_write() likes to return 0 for the number of bytes
> written. 

Good catch - thanks.  I'll look into it.

On second look - I don't think so.  The following code, near the
end of kernel/cpuset.c:cpuset_common_file_write() seems to set
the return number of bytes correctly:

        if (retval == 0)
                retval = nbytes;

And a simple test doing "strace /bin/echo 0 > mems" showed that the
write system call returned 2, for the length of the string "0\n",
as desired.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH 3/5] cpuset: change marker for relative numbering
  2005-11-06 10:04     ` Paul Jackson
@ 2005-11-06 20:57       ` Andrew Morton
  2005-11-06 22:49         ` Paul Jackson
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2005-11-06 20:57 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Simon.Derr, ak, linux-kernel, clameter

Paul Jackson <pj@sgi.com> wrote:
>
> > Given that you're developing a library to do all this, why not do proper
>  > locking in userspace and require that all cpuset updates go via the
>  > library?
> 
>  The superficial reason why not is that I can't prevent someone from
>  doing "echo pid > /dev/cpuset/tasks", moving a task to another cpuset
>  even as that task in the middle of dorking with its cpusets.  Not all
>  cpuset operations will involve using my blessed library.

If someone modifies a library-managed cpuset via the backdoor then the
library (and its caller) are out of sync with reality _anyway_.  Why does
an encounter with this very small race window worsen things?


* Re: [PATCH 3/5] cpuset: change marker for relative numbering
  2005-11-06 20:57       ` Andrew Morton
@ 2005-11-06 22:49         ` Paul Jackson
  0 siblings, 0 replies; 14+ messages in thread
From: Paul Jackson @ 2005-11-06 22:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Simon.Derr, ak, linux-kernel, clameter

Andrew wrote:
> If someone modifies a library-managed cpuset via the backdoor then the
> library (and its caller) are out of sync with reality _anyway_.

Yes, for system-wide operations.  No - for cpuset-relative operations.

For task migration (Christoph Lameter's patches) to work, I need to
provide a safe way for jobs to manage placement within their assigned
cpuset.  This means providing wrappers to sched_setaffinity, mbind and
set_mempolicy that take cpuset-relative cpu/mem numbers, and provide a
robust, cpuset-relative API to applications, that hides any migrations
from the application.

A year ago, Simon Derr pushed hard to get cpuset-relative numbering
support into the kernel, anticipating these sorts of problems.  I and
others pushed back, saying that this was the work of libraries, and
that the kernel-user API needed to use one simple, system-wide numbering.

Enforcing a system-wide synchronization of library code, using just
user code, is expensive, difficult and scales poorly on large systems.

A trivial, code-wise, hook in the kernel will enable each independent
library routine to efficiently detect any parallel changes and redo
their operation sequence.  It enables providing applications with a
cpuset-relative API for internal job memory and cpu placement that is
efficient and robustly safe in the face of migrations.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH 2/5] cpuset: rebind numa vma mempolicy
  2005-11-04  5:31 ` [PATCH 2/5] cpuset: rebind numa vma mempolicy Paul Jackson
@ 2005-11-07  6:10   ` Paul Jackson
  0 siblings, 0 replies; 14+ messages in thread
From: Paul Jackson @ 2005-11-07  6:10 UTC (permalink / raw)
  To: akpm; +Cc: Simon.Derr, ak, clameter, linux-kernel

Andrew - please remove these two patches from *-mm for now.  I need to
think more about the mmap_sem locking issues.  The questions you asked
were good ones - thanks.  My answers so far suck.

	[PATCH 2/5] cpuset: rebind numa vma mempolicy
	[PATCH 6/5] cpuset: rebind numa vma mempolicy fix
aka
	cpuset-rebind-numa-vma-mempolicy.patch
	cpuset-rebind-numa-vma-mempolicy-fix.patch

So far as I can tell, nothing is actually broken with this pair of
patches.  My cpuset function and stress tests still passed.  But they
are too hacky^W non-deterministic.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


end of thread, other threads:[~2005-11-07  6:11 UTC | newest]

Thread overview: 14+ messages
2005-11-04  5:31 [PATCH 1/5] cpuset: better bitmap remap defaults Paul Jackson
2005-11-04  5:31 ` [PATCH 2/5] cpuset: rebind numa vma mempolicy Paul Jackson
2005-11-07  6:10   ` Paul Jackson
2005-11-04  5:31 ` [PATCH 3/5] cpuset: change marker for relative numbering Paul Jackson
2005-11-05  7:08   ` Andrew Morton
2005-11-06 10:04     ` Paul Jackson
2005-11-06 20:57       ` Andrew Morton
2005-11-06 22:49         ` Paul Jackson
2005-11-04  5:31 ` [PATCH 4/5] cpuset: mempolicy one more nodemask conversion Paul Jackson
2005-11-04  5:31 ` [PATCH 5/5] cpuset: memory reclaim rate meter Paul Jackson
2005-11-05  7:13   ` Andrew Morton
2005-11-05  7:34     ` Paul Jackson
2005-11-05  7:43     ` Paul Jackson
2005-11-05  7:27   ` Paul Jackson
