linux-kernel.vger.kernel.org archive mirror
* [Patch 0/4] four small cpuset patches ...
@ 2004-09-11  8:28 Paul Jackson
  2004-09-11  8:28 ` [Patch 1/4] cpusets display allowed masks in proc status Paul Jackson
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Paul Jackson @ 2004-09-11  8:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Simon Derr, Paul Jackson, linux-kernel

Here are four small patches that I would like included in the
cpuset patches in *-mm.  Simon Derr has reviewed them and found
them reasonable.  I have built, booted and unit tested them.

They are:
 1) Add cpus_allowed, mems_allowed to /proc/<pid>/status
 2) Simplify recalculation of new cpus_allowed when task changes cpuset
 3) Remove a bit of slightly broken, completely useless, code
 4) Tweak top cpuset to have just online, not all possible, cpus and nodes.


-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* [Patch 1/4] cpusets display allowed masks in proc status
  2004-09-11  8:28 [Patch 0/4] four small cpuset patches Paul Jackson
@ 2004-09-11  8:28 ` Paul Jackson
  2004-09-11  8:28 ` [Patch 2/4] cpusets simplify cpus_allowed setting in attach Paul Jackson
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Paul Jackson @ 2004-09-11  8:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Paul Jackson, Simon Derr, linux-kernel

Should have done this earlier - the cpus_allowed and mems_allowed
values for each task are not always obvious from the task's cpuset
settings, for various reasons, including:

  1) a task might use sched_setaffinity, mbind or set_mempolicy
     to restrain its placement to less than its cpuset, or
  2) various temporary changes to cpus_allowed are done by
     kernel internal code, or
  3) attaching a task to a cpuset doesn't change its mems_allowed
     until the next time that task needs kernel memory, or
  4) changing a cpuset's 'cpus' doesn't change the cpus_allowed of
     the tasks attached to it until those tasks are reattached
     to that cpuset (to avoid a hook in the hotpath scheduler
     code in the kernel).

So it is useful, when learning about and making new uses of cpusets
and placement, to be able to see the current values of a task's
cpus_allowed and mems_allowed, which are the actual placement used
by the kernel scheduler and memory allocator.

The cpus_allowed and mems_allowed values are needed by user space
apps that are micromanaging placement, such as when moving an app to a
different cpuset, and trying to recreate the cpuset-relative placement
obtained by that app within its cpuset using sched_setaffinity,
mbind and set_mempolicy.

The cpus_allowed value is also available via the sched_getaffinity
system call.  But since the entire rest of the cpuset API, including
the display of mems_allowed added here, uses an ASCII-style
presentation in /proc and /dev/cpuset, it is worth the extra couple
of lines of code to display cpus_allowed in the same way.

This patch adds the display of these two fields to the 'status'
file in the /proc/<pid> directory of each task.  The fields are only
added if CONFIG_CPUSETS is enabled (which is also needed to define
the mems_allowed field of each task).  The new output lines look like:

  $ tail -2 /proc/1/status
  Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
  Mems_allowed:   ffffffff,ffffffff

Signed-off-by: Paul Jackson <pj@sgi.com>

Index: 2.6.9-rc1-mm4/fs/proc/array.c
===================================================================
--- 2.6.9-rc1-mm4.orig/fs/proc/array.c	2004-09-10 09:11:33.000000000 -0700
+++ 2.6.9-rc1-mm4/fs/proc/array.c	2004-09-10 15:27:23.000000000 -0700
@@ -73,6 +73,7 @@
 #include <linux/highmem.h>
 #include <linux/file.h>
 #include <linux/times.h>
+#include <linux/cpuset.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -295,6 +296,7 @@ int proc_pid_status(struct task_struct *
 	}
 	buffer = task_sig(task, buffer);
 	buffer = task_cap(task, buffer);
+	buffer = cpuset_task_status_allowed(task, buffer);
 #if defined(CONFIG_ARCH_S390)
 	buffer = task_show_regs(task, buffer);
 #endif
Index: 2.6.9-rc1-mm4/include/linux/cpuset.h
===================================================================
--- 2.6.9-rc1-mm4.orig/include/linux/cpuset.h	2004-09-10 09:11:33.000000000 -0700
+++ 2.6.9-rc1-mm4/include/linux/cpuset.h	2004-09-10 15:27:23.000000000 -0700
@@ -25,6 +25,7 @@ void cpuset_restrict_to_mems_allowed(uns
 int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
 int cpuset_zone_allowed(struct zone *z);
 extern struct file_operations proc_cpuset_operations;
+extern char *cpuset_task_status_allowed(struct task_struct *task, char *buffer);
 
 #else /* !CONFIG_CPUSETS */
 
@@ -52,6 +53,12 @@ static inline int cpuset_zone_allowed(st
 	return 1;
 }
 
+static inline char *cpuset_task_status_allowed(struct task_struct *task,
+							char *buffer)
+{
+	return buffer;
+}
+
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
Index: 2.6.9-rc1-mm4/kernel/cpuset.c
===================================================================
--- 2.6.9-rc1-mm4.orig/kernel/cpuset.c	2004-09-10 15:27:03.000000000 -0700
+++ 2.6.9-rc1-mm4/kernel/cpuset.c	2004-09-10 15:27:23.000000000 -0700
@@ -1495,3 +1495,15 @@ struct file_operations proc_cpuset_opera
 	.llseek		= seq_lseek,
 	.release	= single_release,
 };
+
+/* Display task cpus_allowed, mems_allowed in /proc/<pid>/status file. */
+char *cpuset_task_status_allowed(struct task_struct *task, char *buffer)
+{
+	buffer += sprintf(buffer, "Cpus_allowed:\t");
+	buffer += cpumask_scnprintf(buffer, PAGE_SIZE, task->cpus_allowed);
+	buffer += sprintf(buffer, "\n");
+	buffer += sprintf(buffer, "Mems_allowed:\t");
+	buffer += nodemask_scnprintf(buffer, PAGE_SIZE, task->mems_allowed);
+	buffer += sprintf(buffer, "\n");
+	return buffer;
+}

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* [Patch 2/4] cpusets simplify cpus_allowed setting in attach
  2004-09-11  8:28 [Patch 0/4] four small cpuset patches Paul Jackson
  2004-09-11  8:28 ` [Patch 1/4] cpusets display allowed masks in proc status Paul Jackson
@ 2004-09-11  8:28 ` Paul Jackson
  2004-09-11  8:28 ` [Patch 3/4] cpusets remove useless validation check Paul Jackson
  2004-09-11  8:28 ` [Patch 4/4] cpusets top mask just online, not all possible Paul Jackson
  3 siblings, 0 replies; 15+ messages in thread
From: Paul Jackson @ 2004-09-11  8:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Simon Derr, Paul Jackson, linux-kernel

When reattaching a task to a different cpuset, I had
code that attempted to preserve the cpuset relative
placement of the task (such as might have been done
using sched_setaffinity, mbind or set_mempolicy) by
rebinding the task to the intersection of its old
and new placement, if not empty.

This resulted in strange and puzzling behaviour, such
as having a task's cpus_allowed not change if the task
was reattached to the root cpuset.

Simplify the code - when attaching a task to a cpuset,
simply set its cpus_allowed to that of the cpuset.
This is much less surprising in practice.

Signed-off-by: Paul Jackson <pj@sgi.com>

Index: 2.6.9-rc1-mm4/kernel/cpuset.c
===================================================================
--- 2.6.9-rc1-mm4.orig/kernel/cpuset.c	2004-09-08 15:22:59.000000000 -0700
+++ 2.6.9-rc1-mm4/kernel/cpuset.c	2004-09-08 18:44:33.000000000 -0700
@@ -605,7 +605,6 @@ static int attach_task(struct cpuset *cs
 	pid_t pid;
 	struct task_struct *tsk;
 	struct cpuset *oldcs;
-	cpumask_t cpus;
 
 	if (sscanf(buf, "%d", &pid) != 1)
 		return -EIO;
@@ -645,16 +644,7 @@ static int attach_task(struct cpuset *cs
 	tsk->cpuset = cs;
 	task_unlock(tsk);
 
-	/*
-	 * If the tasks current CPU placement overlaps with its new cpuset,
-	 * then let it run in that overlap.  Otherwise fallback to simply
-	 * letting it have the run of the CPUs in the new cpuset.
-	 */
-	if (cpus_intersects(tsk->cpus_allowed, cs->cpus_allowed))
-		cpus_and(cpus, tsk->cpus_allowed, cs->cpus_allowed);
-	else
-		cpus = cs->cpus_allowed;
-	set_cpus_allowed(tsk, cpus);
+	set_cpus_allowed(tsk, cs->cpus_allowed);
 
 	put_task_struct(tsk);
 	if (atomic_dec_and_test(&oldcs->count))

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* [Patch 3/4] cpusets remove useless validation check
  2004-09-11  8:28 [Patch 0/4] four small cpuset patches Paul Jackson
  2004-09-11  8:28 ` [Patch 1/4] cpusets display allowed masks in proc status Paul Jackson
  2004-09-11  8:28 ` [Patch 2/4] cpusets simplify cpus_allowed setting in attach Paul Jackson
@ 2004-09-11  8:28 ` Paul Jackson
  2004-09-11  8:28 ` [Patch 4/4] cpusets top mask just online, not all possible Paul Jackson
  3 siblings, 0 replies; 15+ messages in thread
From: Paul Jackson @ 2004-09-11  8:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Paul Jackson, Simon Derr, linux-kernel

As reported to me by Simon Derr, there was an off-by-one
error in checking the usage count in validate_change().
However, after realizing that there was no way to exploit
this error, I determined that the check (for an empty
cpuset) could never fail here.  This is because:
 1) other code ensures we only attach a task to non-empty
    cpusets, and
 2) there is no way to make a non-empty cpuset empty again.

So, rather than fix the useless check, just remove it.

Signed-off-by: Paul Jackson <pj@sgi.com>

Index: 2.6.9-rc1-mm4/kernel/cpuset.c
===================================================================
--- 2.6.9-rc1-mm4.orig/kernel/cpuset.c	2004-09-08 18:44:33.000000000 -0700
+++ 2.6.9-rc1-mm4/kernel/cpuset.c	2004-09-08 18:45:40.000000000 -0700
@@ -501,14 +501,6 @@ static int validate_change(const struct 
 	if (cur == &top_cpuset)
 		return -EPERM;
 
-	/* Any in-use cpuset must have at least ONE cpu and mem */
-	if (atomic_read(&trial->count) > 1) {
-		if (cpus_empty(trial->cpus_allowed))
-			return -ENOSPC;
-		if (nodes_empty(trial->mems_allowed))
-			return -ENOSPC;
-	}
-
 	/* We must be a subset of our parent cpuset */
 	if (!is_cpuset_subset(trial, par))
 		return -EACCES;

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* [Patch 4/4] cpusets top mask just online, not all possible
  2004-09-11  8:28 [Patch 0/4] four small cpuset patches Paul Jackson
                   ` (2 preceding siblings ...)
  2004-09-11  8:28 ` [Patch 3/4] cpusets remove useless validation check Paul Jackson
@ 2004-09-11  8:28 ` Paul Jackson
  2004-09-11 14:10   ` Anton Blanchard
  3 siblings, 1 reply; 15+ messages in thread
From: Paul Jackson @ 2004-09-11  8:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Simon Derr, Paul Jackson, linux-kernel

Initialize the top cpuset to only have the online
CPUs and Nodes, rather than all possible.  This
seems more natural to the observer.

Signed-off-by: Paul Jackson <pj@sgi.com>

Index: 2.6.9-rc1-mm4/kernel/cpuset.c
===================================================================
--- 2.6.9-rc1-mm4.orig/kernel/cpuset.c	2004-09-10 15:27:30.000000000 -0700
+++ 2.6.9-rc1-mm4/kernel/cpuset.c	2004-09-10 15:27:32.000000000 -0700
@@ -1271,7 +1271,9 @@ int __init cpuset_init(void)
 	struct dentry *root;
 	int err;
 
-	top_cpuset.mems_allowed = node_possible_map;
+	top_cpuset.cpus_allowed = CPU_MASK_ALL;
+	top_cpuset.mems_allowed = NODE_MASK_ALL;
+
 	atomic_inc(&cpuset_mems_generation);
 	top_cpuset.mems_generation = atomic_read(&cpuset_mems_generation);
 
@@ -1300,12 +1302,13 @@ out:
 /**
  * cpuset_init_smp - initialize cpus_allowed
  *
- * Description: Initialize cpus_allowed after cpu_possible_map is initialized
+ * Description: Finish top cpuset after cpu, node maps are initialized
  **/
 
 void __init cpuset_init_smp(void)
 {
-	top_cpuset.cpus_allowed = cpu_possible_map;
+	top_cpuset.cpus_allowed = cpu_online_map;
+	top_cpuset.mems_allowed = node_online_map;
 }
 
 /**

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* Re: [Patch 4/4] cpusets top mask just online, not all possible
  2004-09-11  8:28 ` [Patch 4/4] cpusets top mask just online, not all possible Paul Jackson
@ 2004-09-11 14:10   ` Anton Blanchard
  2004-09-11 17:07     ` Paul Jackson
  0 siblings, 1 reply; 15+ messages in thread
From: Anton Blanchard @ 2004-09-11 14:10 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Andrew Morton, Simon Derr, linux-kernel


Hi Paul,

> Initialize the top cpuset to only have the online
> CPUs and Nodes, rather than all possible.  This
> seems more natural to the observer.

How does this change interact with CPU hotplug?

Anton


* Re: [Patch 4/4] cpusets top mask just online, not all possible
  2004-09-11 14:10   ` Anton Blanchard
@ 2004-09-11 17:07     ` Paul Jackson
  2004-09-11 17:28       ` Dave Hansen
                         ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Paul Jackson @ 2004-09-11 17:07 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: akpm, Simon.Derr, linux-kernel, Andi Kleen, IWAMOTO Toshihiro,
	Dave Hansen

I'm adding Andi Kleen, Iwamoto-san and Dave Hansen to the cc list, since
the numa code and other hotplug work might have similar considerations.

Anton asks:
> How does this change interact with CPU hotplug?

You beat my estimate ;).  I figured it would be a day before someone
asked this question.  It only took you six hours.  Good.

Cpusets and hotplug (CPU or Memory) aren't friends, yet.

Cpusets builds up additional data structures, used to manage a task's CPU
and Memory placement.  If more CPUs or Memory are added later on,
cpusets won't know of them nor let you use them.  If CPUs or Memory are
removed later on, cpusets will still think it is ok to use them, and
potentially starve a task if that task's cpuset had been configured to
_only_ allow using the now departed CPU or Memory.

When the move_task_off_dead_cpu() code in kernel/sched.c catches this,
the following code can break cpuset exclusive semantics if it decides
that a task has to be allowed to run anywhere because none of the places
it had been allowed are online anymore:

  kernel/sched.c: move_task_off_dead_cpu()

        /* No more Mr. Nice Guy. */
        if (dest_cpu == NR_CPUS) {
                tsk->cpus_allowed = cpuset_cpus_allowed(tsk);
                if (!cpus_intersects(tsk->cpus_allowed, cpu_online_map))
                        cpus_setall(tsk->cpus_allowed);
                dest_cpu = any_online_cpu(tsk->cpus_allowed);

It wouldn't surprise me if Andi Kleen's numa code, especially the
MPOL_BIND which builds up special restricted zonelists holding only the
bound Memory Nodes, has the same sorts of interactions with Memory
hotplug.  However, I have not given this suspicion any careful thought,
so could easily be wrong.

The CPU placement code, prior to cpusets, had just been horsing the
task->cpus_allowed field around, which the CPU hotplug guys have been
able to deal with, adding or modifying code such as the above.  But the
numa Memory placement code (I suspect), and with cpusets now the CPU
placement code, build up additional structures that think they know
what's available and who can use it.  This assumption is violated when
stuff is plugged in and out.

Here's my current best shot at how to deal with this:

 1) For now, CONFIG_CPUSETS (and CONFIG_NUMA?) marked incompatible
    with CONFIG_HOTPLUG.

 2) Someday soon, the cpuset (and numa?) placement code needs to add an
    internal kernel call that the hotplug code can call to inform the
    placement code that a CPU or Memory resource has gone on or offline,
    so that the placement code can "deal with it", somehow.

 3) The way that I anticipate the cpuset code will "deal with it"
    will be:

      a] When a CPU or Memory is added, just add it to the top cpuset.
         User code can take it from there, adding the new resource
         to whatever lower level cpusets it wants to.

      b] When a CPU or Memory is about to be removed, walk the cpuset
	 tree from the bottom up, removing the resource, and if
	 that causes a particular cpuset to become empty (no more
	 online CPU or no more online Memory), then automatically
	 reassign any task attached to that cpuset to its parent
	 cpuset, and remove the soon to be empty cpuset.  If the
	 fine user doesn't like this default forced re-placement
	 to parent cpuset, they should have emptied that soon to be
	 useless cpuset of tasks before unplugging the hardware, and
	 attached those tasks to whatever cpuset met their fancy.
	 Or, if the removal could not have been anticipated, the
	 user code will just have to move tasks around after the
	 fact.

      c] The hotplug code should never add a CPU or Memory to what
         a task can use, without co-ordinating with the cpuset code.
         So fallback code such as the above call to cpus_setall()
         should involve cpusets somehow - whether asking it for the
         fallback CPUs to use, or telling it that this task just got
         force migrated to CPU_MASK_ALL - not sure which.

 4) Andi and the memory hotplug folks will have to speak to the question
    of whether this matters to the numa placement code, and what that
    might mean.

Suggestions?

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* Re: [Patch 4/4] cpusets top mask just online, not all possible
  2004-09-11 17:07     ` Paul Jackson
@ 2004-09-11 17:28       ` Dave Hansen
  2004-09-12  2:21         ` Paul Jackson
  2004-09-11 17:39       ` Dave Hansen
  2004-09-11 18:55       ` Andi Kleen
  2 siblings, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2004-09-11 17:28 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Anton Blanchard, Andrew Morton, Simon.Derr,
	Linux Kernel Mailing List, Andi Kleen, IWAMOTO Toshihiro

On Sat, 2004-09-11 at 10:07, Paul Jackson wrote:
> I'm adding Andi Kleen, Iwamoto-san and Dave Hansen to the cc list, since
> the numa code and other hotplug work might have similar considerations.
> 
> Anton asks:
> > How does this change interact with CPU hotplug?
> 
> You beat my estimate ;).  I figured it would be a day before someone
> asked this question.  It only took you six hours.  Good.
> 
> Cpusets and hotplug (CPU or Memory) aren't friends, yet.

Heh.  I assumed that cpusets didn't deal with memory, and I ignored the
code when I saw it.  Oh well... :)

> It wouldn't surprise me if Andi Kleen's numa code, especially the
> MPOL_BIND which builds up special restricted zonelists holding only the
> bound Memory Nodes, has the same sorts of interactions with Memory
> hotplug.  However, I have not given this suspicion any careful thought,
> so could easily be wrong.

It shouldn't interact yet, because we don't add any NUMA nodes yet; we
only grow existing ones.  The only case that would be a problem is if an
entire node's memory was removed, but that would probably be similar to
trying to bind to an empty node now, or what happens during CPU hotplug
with cpus_allowed set to a now-unplugged CPU.

>  2) Someday soon, the cpuset (and numa?) placement code needs to add an
>     internal kernel call that the hotplug code can call to inform the
>     placement code that a CPU or Memory resource has gone on or offline,
>     so that the placement code can "deal with it", somehow.

See kernel/cpu.c:25
/* Need to know about CPUs going up/down? */
int register_cpu_notifier(struct notifier_block *nb)

We'll do the same for memory, I promise.

>  3) The way that I anticipate the cpuset code will "deal with it"
>     will be:

Other than providing the notifiers, I don't think there's a whole lot
else the hotplug code can do.  That's left for the poor souls working on
the binding APIs.  Memory will probably need some slightly more
informative things than the CPU notifiers, because we might have some
more properties that the handlers might care about, but it shouldn't be
fundamentally different.

-- Dave



* Re: [Patch 4/4] cpusets top mask just online, not all possible
  2004-09-11 17:07     ` Paul Jackson
  2004-09-11 17:28       ` Dave Hansen
@ 2004-09-11 17:39       ` Dave Hansen
  2004-09-11 18:55       ` Andi Kleen
  2 siblings, 0 replies; 15+ messages in thread
From: Dave Hansen @ 2004-09-11 17:39 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Anton Blanchard, Andrew Morton, Simon.Derr,
	Linux Kernel Mailing List, Andi Kleen, IWAMOTO Toshihiro,
	Martin J. Bligh

Now that I've actually _looked_ at your code, I noticed that the
mems_allowed really is just a nodemask_t.

So, you shouldn't have much to worry about from the current memory
hotplug stuff, because we don't mess with hotplugging actual NUMA nodes
yet.  As I alluded to before, you might need notification if we ever get
down to a zero-sized node, but that should be it.  You'll certainly need
notification when entire nodes go offline, so I'm adding Martin Bligh to
the cc list (we just talked about NUMA hotplug yesterday).

Martin, you might want to look back in your LKML archives for where this
discussion came from.

-- Dave



* Re: [Patch 4/4] cpusets top mask just online, not all possible
  2004-09-11 17:07     ` Paul Jackson
  2004-09-11 17:28       ` Dave Hansen
  2004-09-11 17:39       ` Dave Hansen
@ 2004-09-11 18:55       ` Andi Kleen
  2004-09-12  2:29         ` Paul Jackson
  2 siblings, 1 reply; 15+ messages in thread
From: Andi Kleen @ 2004-09-11 18:55 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Anton Blanchard, akpm, Simon.Derr, linux-kernel, Andi Kleen,
	IWAMOTO Toshihiro, Dave Hansen

On Sat, Sep 11, 2004 at 10:07:31AM -0700, Paul Jackson wrote:
> Cpusets builds up additional data structures, used to manage a task's CPU
> and Memory placement.  If more CPUs or Memory are added later on,
> cpusets won't know of them nor let you use them.  If CPUs or Memory are
> removed later on, cpusets will still think it is ok to use them, and
> potentially starve a task if that task's cpuset had been configured to
> _only_ allow using the now departed CPU or Memory.

MPOL_BIND uses direct pointers to zones, the others just node numbers
and fall back to other zones.  Lost node numbers should be pretty easy to 
deal with because there is enough redirection. MPOL_BIND is a bit more
difficult.

My preferred solution would be to never actually remove the zone data
structure, but just make them zero size when their memory is gone.  Then
the mempolicies could still keep pointers to them, but any allocations
will fail.

This may require putting them into different memory, though (currently
they are usually in the node itself).

This should also allow cpumemset to work.

Of course the applications may not be very happy when all the memory
they are allowed to touch is gone, but fixing that is imho more
a high level user space management problem, nothing the kernel
should try to resolve.

-Andi


* Re: [Patch 4/4] cpusets top mask just online, not all possible
  2004-09-11 17:28       ` Dave Hansen
@ 2004-09-12  2:21         ` Paul Jackson
  2004-09-12  4:43           ` Dave Hansen
  0 siblings, 1 reply; 15+ messages in thread
From: Paul Jackson @ 2004-09-12  2:21 UTC (permalink / raw)
  To: Dave Hansen; +Cc: anton, akpm, Simon.Derr, linux-kernel, ak, iwamoto

Andrew,

  Discard this 4/4 patch for "online" - I've got a better one
  coming in a few hours.  (If you want to keep this 4/4 patch
  and have me send the next one relative to it, that works too -
  just ask.)

The problem is that what's online, and what cpusets decree, might
conflict.

  The solution is simple: online wins.

This is for the simple reason that it is better to run something in the
wrong place, than to refuse to run it.  The kernel, the apps and the
users all cope better with misplacement than with starvation.

There are only a handful of places that cpusets hooks into the kernel,
none of them on critical code paths.  My upcoming patch will guarantee
to always provide 'online' CPUs and Memory Nodes in these places, using
the current cpu_online_map and node_online_map.

If what the user configured in their cpuset is still online, then no
problem - they get what they asked for.

If they setup a cpuset, then unplug the CPUs or Memory Nodes that cpuset
uses, but don't update their cpuset config, then tough - I'll give them
some CPUs and Memory Nodes that are online instead.  If they don't like
my online choices, then they should fix their cpuset config to be valid
again.

Fix should require a couple lines of cpumask and nodemask logic in a
handful of cpuset routines, and a few other related cpuset code tweaks
... right up my alley.

Dave,

  Just as code in kernel/sched.c:move_task_off_dead_cpu()
  updates the affected task's cpus_allowed to move off a dead
  CPU, calling cpuset_cpus_allowed() to get a new list of
  candidate CPUs to be set in task->cpus_allowed, similarly I
  expect that disabling a Memory Node should result in some
  "move_task_off_dead_memory()" routine, that calls a cpuset
  routine, call it "cpuset_mems_allowed()", to get a new
  list of Memory Nodes, to be set in task->mems_allowed.

  Does this expectation fit for you?  If so, let me know when
  you want the "cpuset_mems_allowed()" routine - or take a stab
  at it yourself if you find that easier.

  There are other ways to skin this cat - feel free to offer
  such up if that fits your work better.

  But, for now, I leave the issue of how to update the tasks
  mems_allowed field, when a Memory Node goes offline, in your
  hands, for the next step.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* Re: [Patch 4/4] cpusets top mask just online, not all possible
  2004-09-11 18:55       ` Andi Kleen
@ 2004-09-12  2:29         ` Paul Jackson
  0 siblings, 0 replies; 15+ messages in thread
From: Paul Jackson @ 2004-09-12  2:29 UTC (permalink / raw)
  To: Andi Kleen; +Cc: anton, akpm, Simon.Derr, linux-kernel, ak, iwamoto, haveblue

Andi wrote:
> My prefered solution would be to never actually remove the zone datastructure,
> but just make them zero size when their memory is gone.  ...
> 
> This should also allow cpumemset to work.

Since cpusets doesn't stash zone pointers, it doesn't care.
It just stashes cpu masks (bitmaps) and node masks.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* Re: [Patch 4/4] cpusets top mask just online, not all possible
  2004-09-12  2:21         ` Paul Jackson
@ 2004-09-12  4:43           ` Dave Hansen
  2004-09-12  5:35             ` Paul Jackson
  2004-09-12  5:42             ` Paul Jackson
  0 siblings, 2 replies; 15+ messages in thread
From: Dave Hansen @ 2004-09-12  4:43 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Anton Blanchard, Andrew Morton, Simon.Derr,
	Linux Kernel Mailing List, ak, IWAMOTO Toshihiro

On Sat, 2004-09-11 at 19:21, Paul Jackson wrote:
>   Just as code in kernel/sched.c:move_task_off_dead_cpu()
>   updates the affected task's cpus_allowed to move off a dead
>   CPU, calling cpuset_cpus_allowed() to get a new list of
>   candidate CPUs to be set in task->cpus_allowed, similarly I
>   expect that disabling a Memory Node should result in some
>   "move_task_off_dead_memory()" routine, that calls a cpuset
>   routine, call it "cpuset_mems_allowed()", to get a new
>   list of Memory Nodes, to be set in task->mems_allowed.
> 
>   Does this expectation fit for you?  If so, let me know when
>   you want the "cpuset_mems_allowed()" routine - or take a stab
>   at it yourself if you find that easier.

That seems pretty reasonable to me.  The fallback order that's
implemented in move_task_off_dead_cpu() seems very similar to what needs
to be done for memory.  However, those operations might have to occur a
bit earlier in the process for memory.

Consider a case where a zone is shrinking, and memory is being moved out
of a section inside of it.  We should try to allocate the memory to move
pages from the shrinking area in the same zone, the same node, then in
other ever-less-optimal NUMA nodes.

But, this has to occur in any removal case, even before the node is
completely gone.  So, for now, I think that we probably need to go about
the normal removal process, and make sure to update mems_allowed if a
process's page is migrated to a place that wasn't in its mems_allowed.

Is it easy to tell if a given page was influenced by a cpusets
allocation?  How would the memory removal code know which task[s] to go
and update?

>   There are other ways to skin this cat - feel free to offer
>   such up if that fits your work better.

My only other idea is just to leave it alone for now.  We're still a
good bit of time away from removing NUMA nodes and I'll probably have a
much better idea about what we need when that occurs.  

Somebody correct me if you disagree, but I don't see merging memory
removal as something that's going to happen in the next couple of
months.  The work that I'm doing on it now is more so that I can
robustly test addition over, and over, and over again and get
_addition_ in a mergeable state.

I hate to have you or anyone else do a bunch of work for something that
isn't in-tree or planned to be in-tree for a while.  We'll fix it when
we get there :)

-- Dave



* Re: [Patch 4/4] cpusets top mask just online, not all possible
  2004-09-12  4:43           ` Dave Hansen
@ 2004-09-12  5:35             ` Paul Jackson
  2004-09-12  5:42             ` Paul Jackson
  1 sibling, 0 replies; 15+ messages in thread
From: Paul Jackson @ 2004-09-12  5:35 UTC (permalink / raw)
  To: Dave Hansen; +Cc: anton, akpm, Simon.Derr, linux-kernel, ak, iwamoto

Dave wrote:
> Is it easy to tell if a given page was influenced by a cpusets
> allocation?  How would the memory removal code know which task[s] to go
> and update?

It is not nearly as easy as for cpus to know which tasks to update,
because cpus have a list of tasks on them.

And it is not nearly as easy as for cpus to _do_ the update, because a
task's memory placement, unlike its cpu placement, can not be changed by
a third party.  So one has to leave a hint lying in wait for the affected
task, which the task can trip over and use to trigger its own
memory placement update.  Grep for 'generation' in kernel/cpuset.c for
the cpuset version of this hint.

We will need two pieces:

 1) We will need some user space code, that walks through
    all cpusets, and for each cpuset that includes the node
    in question, updates that cpuset to not have that node
    (which will bump that cpuset's generation).

 2) Then each task that is attached to that cpuset, the next
    time that task is about to ask for memory (in one of the
    mm/mempolicy routines, just before entering __alloc_pages())
    will notice the generation number on its cpuset has changed,
    and the cpuset_update_current_mems_allowed() code will
    update the task's mems_allowed to the correct, online, value.

With what I expect to submit to lkml for *-mm in a few hours,
piece (2) will be complete and ready to go.  Piece (1) can wait.

> We'll fix it when we get there :)

Yup.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* Re: [Patch 4/4] cpusets top mask just online, not all possible
  2004-09-12  4:43           ` Dave Hansen
  2004-09-12  5:35             ` Paul Jackson
@ 2004-09-12  5:42             ` Paul Jackson
  1 sibling, 0 replies; 15+ messages in thread
From: Paul Jackson @ 2004-09-12  5:42 UTC (permalink / raw)
  To: Dave Hansen; +Cc: anton, akpm, Simon.Derr, linux-kernel, ak, iwamoto

pj wrote:
>   I expect that disabling a Memory Node should result in some
>   "move_task_off_dead_memory()" routine, ...
>   Does this expectation fit for you?

Dave replied:
> That seems pretty reasonable to me.  ...

In other words, as is apparent in the reply I just sent a few minutes
ago, what I wrote about the "move_task_off_dead_memory()" routine no longer
seems at all reasonable to me ;).

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

