* [patch 1/8] cpusets v3 - Table of Contents
@ 2004-06-29 11:21 Paul Jackson
  2004-06-29 11:22 ` [patch 2/8] cpusets v3 - Overview Paul Jackson
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: Paul Jackson @ 2004-06-29 11:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Jack Steiner, Jesse Barnes, Sylvain,
	Dan Higgins, Matthew Dobson, Andi Kleen, Paul Jackson, Simon

The following patch set is being offered for review and comment.

Shortly (though not just yet) I expect to ask Andrew to consider it
for inclusion in *-mm.  Andrew will be glad to hear that this patch
set (outside of Matthew Dobson's nodemask patch) has _much_ less
impact on existing kernel code than my previous cpumask patch set ;).

First I still need to perform further testing, obtain more feedback,
and write more documentation and a man page, before asking to get
it into *-mm.  My thanks to those who have reviewed it so far.

The bulk of the code and much of the design work in this patch has
been done by Simon Derr <Simon.Derr@bull.net> of Bull (France).
The nodemask patch is a preliminary draft of work by Matthew
Dobson, based on my cpumask patches.

This version of the cpuset patch set is against 2.6.7-mm4.

These patches provide the essential kernel support for cpusets, which
enable identifying a hierarchy of subsets of system CPU and Memory Node
resources and attaching tasks to these subsets.  A task may only request
to use (via sched_setaffinity, mbind and set_mempolicy) the CPUs and
Memory Nodes allowed to it by its cpuset.  Cpusets may be strictly
exclusive (other non-ancestral cpusets may not overlap).  One can list
which tasks are in which cpusets, and change which cpuset a task is in.
No new system calls are used; all access and modification is via a
cpuset virtual file system.

See further the Cpuset Overview, item [2/8] of this email set.

==> I recommend that first-time readers look first at items (2) Overview
    and (7) Kernel Hooks PATCH, for a better understanding of what this
    cpuset kernel patch is intended to do, and the very small kernel
    footprint required to accomplish this.

Now I have several email messages to present.  Items 2 through 8
will be sent as replies to this first message.

  1) This table of contents.
  2) Overview of kernel cpusets - a small text document.
  3) [patch] cpumask_consts - minor fix to my cpumask patch set
  4) [patch] nodemask - nodemask patch (draft of Matthew Dobson's patch)
  5) [patch] cpuset_bitmap_lists - add bitmap lists format
  6) [patch] cpuset_new_files - Main cpuset patch - cpuset.c, cpuset.h
  7) [patch] cpuset_kernel_hooks - The few, small kernel hooks needed
  8) [patch] cpuset_proc_hooks - One more hook, for /proc/<pid>/cpuset.

Your feedback is welcome.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* [patch 2/8] cpusets v3 - Overview
  2004-06-29 11:21 [patch 1/8] cpusets v3 - Table of Contents Paul Jackson
@ 2004-06-29 11:22 ` Paul Jackson
  2004-06-29 11:22 ` [patch 3/8] cpusets v3 - cpumask_t - additional const qualifiers Paul Jackson
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Paul Jackson @ 2004-06-29 11:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Jack Steiner, Jesse Barnes, Paul Jackson,
	Dan Higgins, Matthew Dobson, Andi Kleen, Sylvain, Simon

The management of large computer systems, with many processors (CPUs),
complex memory cache hierarchies, and multiple Memory Nodes having
non-uniform access times (NUMA), presents additional challenges for
the efficient scheduling and memory placement of processes.

Frequently, more modestly sized systems can be operated with adequate
efficiency just by letting the operating system automatically share
the available CPU and Memory resources amongst the requesting tasks.

But larger systems, which typically represent a larger investment for
the customer, benefit more from careful processor and memory placement
to reduce memory access times and contention.  Such systems gain from
explicitly placing jobs on properly sized subsets of the system,
especially when running large applications with demanding performance
characteristics.

These subsets, or "soft partitions", must be able to be dynamically
adjusted, as the job mix changes, without impacting other concurrently
executing jobs.

                              Cpusets

Cpusets provide a Linux kernel (2.6.7 and above) mechanism to constrain
which CPUs and Memory Nodes are used by a process or set of processes.

The Linux kernel already has a pair of mechanisms to specify on which
CPUs a task may be scheduled (sched_setaffinity) and on which Memory
Nodes it may obtain memory (mbind, set_mempolicy).

Cpusets extend these two mechanisms as follows:

 - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
   kernel.
 - Each task in the system is attached to a cpuset, via a pointer
   in the task structure to a reference counted cpuset structure.
 - Calls to sched_setaffinity are filtered to just those CPUs
   allowed in that task's cpuset.
 - Calls to mbind and set_mempolicy are filtered to just
   those Memory Nodes allowed in that task's cpuset.
 - The "top_cpuset" contains all the system's CPUs and Memory
   Nodes.
 - For any cpuset, one can define child cpusets containing a subset
   of the parent's CPU and Memory Node resources.
 - The hierarchy of cpusets can be mounted at /dev/cpuset, for
   browsing and manipulation from user space.
 - A cpuset may be marked strictly exclusive, which ensures that
   no other cpuset (except direct ancestors and descendants) may
   contain any overlapping CPUs or Memory Nodes.
 - You can list all the tasks (by pid) attached to any cpuset.

The implementation of cpusets requires a few simple hooks
into the rest of the kernel, none in performance-critical paths:

 - in init/main.c, to initialize the top_cpuset at system boot.
 - in fork and exit, to attach and detach a task from its cpuset.
 - in sched_setaffinity, to mask the requested CPUs by what's
   allowed in that task's cpuset.
 - in sched.c migrate_all_tasks(), to keep migrating tasks within
   the CPUs allowed by their cpuset, if possible.
 - in the mbind and set_mempolicy system calls, to mask the requested
   Memory Nodes by what's allowed in that task's cpuset.

In addition, a new filesystem of type "cpuset" may be mounted,
typically at /dev/cpuset, to enable browsing and modifying the cpusets
presently known to the kernel.  No new system calls are added for
cpusets - all support for querying and modifying cpusets is via
this cpuset file system.

Each task under /proc has an added file displaying its cpuset
name, as the path relative to the root of the cpuset filesystem.

Each cpuset is represented by a directory in the cpuset file system
containing the following files describing that cpuset:

 - cpus: list of CPUs in that cpuset
 - mems: list of Memory Nodes in that cpuset
 - cpustrict flag: are this cpuset's CPUs strictly exclusive?
 - memstrict flag: are this cpuset's Memory Nodes strictly exclusive?
 - autoclean flag: is this cpuset automatically deleted when no longer used?
 - tasks: list of tasks (by pid) attached to that cpuset

New cpusets are created using the mkdir system call or shell
command.  The properties of a cpuset, such as its flags, allowed
CPUs and Memory Nodes, and attached tasks, are modified by writing
to the appropriate file in that cpuset's directory, as listed above.

The named hierarchical structure of nested cpusets allows partitioning
a large system into nested, dynamically changeable, "soft-partitions".

The attachment of each task to a cpuset, automatically inherited at
fork by any children of that task, allows organizing the work load on
a system into related sets of tasks, such that each set is constrained
to using the CPUs and Memory Nodes of a particular cpuset.  A task
may be reattached to any other cpuset, if allowed by the permissions
on the necessary cpuset file system directories.

Such management of a system "in the large" integrates smoothly with
the detailed placement done on individual tasks and memory regions
using the sched_setaffinity, mbind and set_mempolicy system calls.

The following rules apply to each cpuset:

 - Its CPUs and Memory Nodes must be a subset of its parent's.
 - It can only be marked strictly exclusive (strict) if its parent is.

These rules, and the natural hierarchy of cpusets, enable efficient
enforcement of the strictly exclusive guarantee, without having to scan
all cpusets every time any of them change to ensure nothing overlaps a
strictly exclusive cpuset.

The use of a Linux virtual file system (vfs) to represent the cpuset
hierarchy provides for a familiar permission and name space for cpusets,
with a minimum of additional kernel code.

If a cpuset is marked autoclean, then when the last task attached to it
is detached or exits, that cpuset is automatically removed.

For example, the following sequence of commands will set up a cpuset
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
and then move the current shell to that cpuset:

  mount -t cpuset none /dev/cpuset
  cd /dev/cpuset/top_cpuset
  mkdir Charlie
  cd Charlie
  /bin/echo 2-3 > cpus
  /bin/echo 1 > mems
  /bin/echo $$ > tasks

That's pretty much it - cpusets constrain the existing CPU and
Memory placement calls to only request resources within a task's
current cpuset, and they form a nestable hierarchy visible in
a virtual file system.  These are the essential hooks, beyond
what is already present, required to manage dynamic job placement
on large systems.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* [patch 3/8] cpusets v3 - cpumask_t - additional const qualifiers
  2004-06-29 11:21 [patch 1/8] cpusets v3 - Table of Contents Paul Jackson
  2004-06-29 11:22 ` [patch 2/8] cpusets v3 - Overview Paul Jackson
@ 2004-06-29 11:22 ` Paul Jackson
  2004-06-29 11:22 ` [patch 4/8] cpusets v3 - nodemask patch (draft of Matthew Dobson's patch) Paul Jackson
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Paul Jackson @ 2004-06-29 11:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Jack Steiner, Jesse Barnes, Sylvain,
	Dan Higgins, Matthew Dobson, Andi Kleen, Paul Jackson, Simon

The remainder of the const qualifiers on cpumask ops.

Index: 2.6.7-mm4/include/linux/cpumask.h
===================================================================
--- 2.6.7-mm4.orig/include/linux/cpumask.h	2004-06-28 19:40:25.000000000 -0700
+++ 2.6.7-mm4/include/linux/cpumask.h	2004-06-28 19:54:11.000000000 -0700
@@ -114,58 +114,58 @@
 }
 
 #define cpus_and(dst, src1, src2) __cpus_and(&(dst), &(src1), &(src2), NR_CPUS)
-static inline void __cpus_and(cpumask_t *dstp, cpumask_t *src1p,
-					cpumask_t *src2p, int nbits)
+static inline void __cpus_and(cpumask_t *dstp, const cpumask_t *src1p,
+					const cpumask_t *src2p, int nbits)
 {
 	bitmap_and(dstp->bits, src1p->bits, src2p->bits, nbits);
 }
 
 #define cpus_or(dst, src1, src2) __cpus_or(&(dst), &(src1), &(src2), NR_CPUS)
-static inline void __cpus_or(cpumask_t *dstp, cpumask_t *src1p,
-					cpumask_t *src2p, int nbits)
+static inline void __cpus_or(cpumask_t *dstp, const cpumask_t *src1p,
+					const cpumask_t *src2p, int nbits)
 {
 	bitmap_or(dstp->bits, src1p->bits, src2p->bits, nbits);
 }
 
 #define cpus_xor(dst, src1, src2) __cpus_xor(&(dst), &(src1), &(src2), NR_CPUS)
-static inline void __cpus_xor(cpumask_t *dstp, cpumask_t *src1p,
-					cpumask_t *src2p, int nbits)
+static inline void __cpus_xor(cpumask_t *dstp, const cpumask_t *src1p,
+					const cpumask_t *src2p, int nbits)
 {
 	bitmap_xor(dstp->bits, src1p->bits, src2p->bits, nbits);
 }
 
 #define cpus_andnot(dst, src1, src2) \
 				__cpus_andnot(&(dst), &(src1), &(src2), NR_CPUS)
-static inline void __cpus_andnot(cpumask_t *dstp, cpumask_t *src1p,
-					cpumask_t *src2p, int nbits)
+static inline void __cpus_andnot(cpumask_t *dstp, const cpumask_t *src1p,
+					const cpumask_t *src2p, int nbits)
 {
 	bitmap_andnot(dstp->bits, src1p->bits, src2p->bits, nbits);
 }
 
 #define cpus_complement(dst, src) __cpus_complement(&(dst), &(src), NR_CPUS)
 static inline void __cpus_complement(cpumask_t *dstp,
-					cpumask_t *srcp, int nbits)
+					const cpumask_t *srcp, int nbits)
 {
 	bitmap_complement(dstp->bits, srcp->bits, nbits);
 }
 
 #define cpus_equal(src1, src2) __cpus_equal(&(src1), &(src2), NR_CPUS)
-static inline int __cpus_equal(cpumask_t *src1p,
-					cpumask_t *src2p, int nbits)
+static inline int __cpus_equal(const cpumask_t *src1p,
+					const cpumask_t *src2p, int nbits)
 {
 	return bitmap_equal(src1p->bits, src2p->bits, nbits);
 }
 
 #define cpus_intersects(src1, src2) __cpus_intersects(&(src1), &(src2), NR_CPUS)
-static inline int __cpus_intersects(cpumask_t *src1p,
-					cpumask_t *src2p, int nbits)
+static inline int __cpus_intersects(const cpumask_t *src1p,
+					const cpumask_t *src2p, int nbits)
 {
 	return bitmap_intersects(src1p->bits, src2p->bits, nbits);
 }
 
 #define cpus_subset(src1, src2) __cpus_subset(&(src1), &(src2), NR_CPUS)
-static inline int __cpus_subset(cpumask_t *src1p,
-					cpumask_t *src2p, int nbits)
+static inline int __cpus_subset(const cpumask_t *src1p,
+					const cpumask_t *src2p, int nbits)
 {
 	return bitmap_subset(src1p->bits, src2p->bits, nbits);
 }
@@ -257,7 +257,7 @@
 #define cpumask_scnprintf(buf, len, src) \
 			__cpumask_scnprintf((buf), (len), &(src), NR_CPUS)
 static inline int __cpumask_scnprintf(char *buf, int len,
-					cpumask_t *srcp, int nbits)
+					const cpumask_t *srcp, int nbits)
 {
 	return bitmap_scnprintf(buf, len, srcp->bits, nbits);
 }
@@ -265,9 +265,9 @@
 #define cpumask_parse(ubuf, ulen, src) \
 			__cpumask_parse((ubuf), (ulen), &(src), NR_CPUS)
 static inline int __cpumask_parse(const char __user *buf, int len,
-					cpumask_t *srcp, int nbits)
+					cpumask_t *dstp, int nbits)
 {
-	return bitmap_parse(buf, len, srcp->bits, nbits);
+	return bitmap_parse(buf, len, dstp->bits, nbits);
 }
 
 #if NR_CPUS > 1

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* [patch 4/8] cpusets v3 - nodemask patch (draft of Matthew Dobson's patch)
  2004-06-29 11:21 [patch 1/8] cpusets v3 - Table of Contents Paul Jackson
  2004-06-29 11:22 ` [patch 2/8] cpusets v3 - Overview Paul Jackson
  2004-06-29 11:22 ` [patch 3/8] cpusets v3 - cpumask_t - additional const qualifiers Paul Jackson
@ 2004-06-29 11:22 ` Paul Jackson
  2004-06-29 11:22 ` [patch 5/8] cpusets v3 - New bitmap lists format Paul Jackson
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Paul Jackson @ 2004-06-29 11:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Jack Steiner, Jesse Barnes, Paul Jackson,
	Dan Higgins, Matthew Dobson, Andi Kleen, Sylvain, Simon

This patch provides a nodemask_t type, similar to cpumask_t,
except it contains MAX_NUMNODES bits, instead of NR_CPUS.

The node_online_map is made to be of this nodemask_t type,
and some minor changes in users of node_online_map are
required.

This is a draft of a nodemask_t patch that I anticipate
Matthew Dobson of IBM will be submitting.

Index: 2.6.7-mm4/arch/i386/kernel/numaq.c
===================================================================
--- 2.6.7-mm4.orig/arch/i386/kernel/numaq.c	2004-06-28 19:34:33.000000000 -0700
+++ 2.6.7-mm4/arch/i386/kernel/numaq.c	2004-06-28 20:14:26.000000000 -0700
@@ -28,6 +28,7 @@
 #include <linux/bootmem.h>
 #include <linux/mmzone.h>
 #include <linux/module.h>
+#include <linux/nodemask.h>
 #include <asm/numaq.h>
 
 /* These are needed before the pgdat's are created */
Index: 2.6.7-mm4/arch/i386/kernel/srat.c
===================================================================
--- 2.6.7-mm4.orig/arch/i386/kernel/srat.c	2004-06-28 19:34:33.000000000 -0700
+++ 2.6.7-mm4/arch/i386/kernel/srat.c	2004-06-28 20:14:37.000000000 -0700
@@ -28,6 +28,7 @@
 #include <linux/bootmem.h>
 #include <linux/mmzone.h>
 #include <linux/acpi.h>
+#include <linux/nodemask.h>
 #include <asm/srat.h>
 
 /*
Index: 2.6.7-mm4/arch/i386/mm/discontig.c
===================================================================
--- 2.6.7-mm4.orig/arch/i386/mm/discontig.c	2004-06-28 19:40:38.000000000 -0700
+++ 2.6.7-mm4/arch/i386/mm/discontig.c	2004-06-28 20:14:47.000000000 -0700
@@ -28,6 +28,7 @@
 #include <linux/mmzone.h>
 #include <linux/highmem.h>
 #include <linux/initrd.h>
+#include <linux/nodemask.h>
 #include <asm/e820.h>
 #include <asm/setup.h>
 #include <asm/mmzone.h>
Index: 2.6.7-mm4/arch/ia64/kernel/acpi.c
===================================================================
--- 2.6.7-mm4.orig/arch/ia64/kernel/acpi.c	2004-06-28 20:00:54.000000000 -0700
+++ 2.6.7-mm4/arch/ia64/kernel/acpi.c	2004-06-28 20:00:57.000000000 -0700
@@ -43,6 +43,7 @@
 #include <linux/acpi.h>
 #include <linux/efi.h>
 #include <linux/mmzone.h>
+#include <linux/nodemask.h>
 #include <asm/io.h>
 #include <asm/iosapic.h>
 #include <asm/machvec.h>
Index: 2.6.7-mm4/arch/ia64/mm/discontig.c
===================================================================
--- 2.6.7-mm4.orig/arch/ia64/mm/discontig.c	2004-06-28 20:00:54.000000000 -0700
+++ 2.6.7-mm4/arch/ia64/mm/discontig.c	2004-06-28 20:00:57.000000000 -0700
@@ -16,6 +16,7 @@
 #include <linux/bootmem.h>
 #include <linux/acpi.h>
 #include <linux/efi.h>
+#include <linux/nodemask.h>
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
 #include <asm/meminit.h>
Index: 2.6.7-mm4/arch/ppc64/mm/numa.c
===================================================================
--- 2.6.7-mm4.orig/arch/ppc64/mm/numa.c	2004-06-28 19:40:19.000000000 -0700
+++ 2.6.7-mm4/arch/ppc64/mm/numa.c	2004-06-28 20:14:56.000000000 -0700
@@ -14,6 +14,7 @@
 #include <linux/mm.h>
 #include <linux/mmzone.h>
 #include <linux/module.h>
+#include <linux/nodemask.h>
 #include <asm/lmb.h>
 #include <asm/machdep.h>
 #include <asm/abs_addr.h>
Index: 2.6.7-mm4/arch/x86_64/mm/numa.c
===================================================================
--- 2.6.7-mm4.orig/arch/x86_64/mm/numa.c	2004-06-28 19:40:19.000000000 -0700
+++ 2.6.7-mm4/arch/x86_64/mm/numa.c	2004-06-28 20:15:06.000000000 -0700
@@ -10,6 +10,7 @@
 #include <linux/mmzone.h>
 #include <linux/ctype.h>
 #include <linux/module.h>
+#include <linux/nodemask.h>
 #include <asm/e820.h>
 #include <asm/proto.h>
 #include <asm/dma.h>
Index: 2.6.7-mm4/include/asm-i386/node.h
===================================================================
--- 2.6.7-mm4.orig/include/asm-i386/node.h	2004-06-28 20:00:54.000000000 -0700
+++ 2.6.7-mm4/include/asm-i386/node.h	2004-06-28 20:00:57.000000000 -0700
@@ -5,6 +5,7 @@
 #include <linux/mmzone.h>
 #include <linux/node.h>
 #include <linux/topology.h>
+#include <linux/nodemask.h>
 
 struct i386_node {
 	struct node node;
Index: 2.6.7-mm4/include/linux/mmzone.h
===================================================================
--- 2.6.7-mm4.orig/include/linux/mmzone.h	2004-06-28 20:00:54.000000000 -0700
+++ 2.6.7-mm4/include/linux/mmzone.h	2004-06-28 20:00:57.000000000 -0700
@@ -411,35 +411,6 @@
 #error ZONES_SHIFT > MAX_ZONES_SHIFT
 #endif
 
-extern DECLARE_BITMAP(node_online_map, MAX_NUMNODES);
-
-#if defined(CONFIG_DISCONTIGMEM) || defined(CONFIG_NUMA)
-
-#define node_online(node)	test_bit(node, node_online_map)
-#define node_set_online(node)	set_bit(node, node_online_map)
-#define node_set_offline(node)	clear_bit(node, node_online_map)
-static inline unsigned int num_online_nodes(void)
-{
-	int i, num = 0;
-
-	for(i = 0; i < MAX_NUMNODES; i++){
-		if (node_online(i))
-			num++;
-	}
-	return num;
-}
-
-#else /* !CONFIG_DISCONTIGMEM && !CONFIG_NUMA */
-
-#define node_online(node) \
-	({ BUG_ON((node) != 0); test_bit(node, node_online_map); })
-#define node_set_online(node) \
-	({ BUG_ON((node) != 0); set_bit(node, node_online_map); })
-#define node_set_offline(node) \
-	({ BUG_ON((node) != 0); clear_bit(node, node_online_map); })
-#define num_online_nodes()	1
-
-#endif /* CONFIG_DISCONTIGMEM || CONFIG_NUMA */
 #endif /* !__ASSEMBLY__ */
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MMZONE_H */
Index: 2.6.7-mm4/include/linux/nodemask.h
===================================================================
--- 2.6.7-mm4.orig/include/linux/nodemask.h	2003-03-14 05:07:09.000000000 -0800
+++ 2.6.7-mm4/include/linux/nodemask.h	2004-06-28 20:13:13.000000000 -0700
@@ -0,0 +1,336 @@
+#ifndef __LINUX_NODEMASK_H
+#define __LINUX_NODEMASK_H
+
+/*
+ * Nodemasks provide a bitmap suitable for representing the
+ * set of Node's in a system, one bit position per Node number.
+ *
+ * See detailed comments in the file linux/bitmap.h describing the
+ * data type on which these nodemasks are based.
+ *
+ * For details of nodemask_scnprintf() and nodemask_parse(),
+ * see bitmap_scnprintf() and bitmap_parse() in lib/bitmap.c.
+ *
+ * The available nodemask operations are:
+ *
+ * void node_set(node, mask)		turn on bit 'node' in mask
+ * void node_clear(node, mask)		turn off bit 'node' in mask
+ * void nodes_setall(mask)		set all bits
+ * void nodes_clear(mask)		clear all bits
+ * int node_isset(node, mask)		true iff bit 'node' set in mask
+ * int node_test_and_set(node, mask)	test and set bit 'node' in mask
+ *
+ * void nodes_and(dst, src1, src2)	dst = src1 & src2  [intersection]
+ * void nodes_or(dst, src1, src2)	dst = src1 | src2  [union]
+ * void nodes_xor(dst, src1, src2)	dst = src1 ^ src2
+ * void nodes_andnot(dst, src1, src2)	dst = src1 & ~src2
+ * void nodes_complement(dst, src)	dst = ~src
+ *
+ * int nodes_equal(mask1, mask2)	Does mask1 == mask2?
+ * int nodes_intersects(mask1, mask2)	Do mask1 and mask2 intersect?
+ * int nodes_subset(mask1, mask2)	Is mask1 a subset of mask2?
+ * int nodes_empty(mask)		Is mask empty (no bits sets)?
+ * int nodes_full(mask)			Is mask full (all bits sets)?
+ * int nodes_weight(mask)		Hamming weigh - number of set bits
+ *
+ * void nodes_shift_right(dst, src, n)	Shift right
+ * void nodes_shift_left(dst, src, n)	Shift left
+ *
+ * int first_node(mask)			Number lowest set bit, or MAX_NUMNODES
+ * int next_node(node, mask)		Next node past 'node', or MAX_NUMNODES
+ *
+ * nodemask_t nodemask_of_node(node)	Return nodemask with bit 'node' set
+ * NODE_MASK_ALL			Initializer - all bits set
+ * NODE_MASK_NONE			Initializer - no bits set
+ * unsigned long *nodes_addr(mask)	Array of unsigned long's in mask
+ *
+ * int nodemask_scnprintf(buf, len, mask) Format nodemask for printing
+ * int nodemask_parse(ubuf, ulen, mask)	Parse ascii string as nodemask
+ *
+ * for_each_node_mask(node, mask)	for-loop node over mask
+ *
+ * int num_online_nodes()		Number of online Nodes
+ * int num_possible_nodes()		Number of all possible Nodes
+ * int num_present_nodes()		Number of present Nodes
+ *
+ * int node_online(node)		Is some node online?
+ * int node_possible(node)		Is some node possible?
+ * int node_present(node)		Is some node present (can schedule)?
+ *
+ * int any_online_node(mask)		First online node in mask
+ *
+ * for_each_node(node)			for-loop node over node_possible_map
+ * for_each_online_node(node)		for-loop node over node_online_map
+ * for_each_present_node(node)		for-loop node over node_present_map
+ *
+ * int any_online_node(mask)		First online node in mask
+ * node_set_online(node)		set bit 'node' in node_online_map
+ * node_set_offline(node)		clear bit 'node' in node_online_map
+ *
+ *
+ * Subtlety:
+ * 1) The 'type-checked' form of node_isset() causes gcc (3.3.2, anyway)
+ *    to generate slightly worse code.  Note for example the additional
+ *    40 lines of assembly code compiling the "for each possible node"
+ *    loops buried in the disk_stat_read() macros calls when compiling
+ *    drivers/block/genhd.c (arch i386, CONFIG_SMP=y).  So use a simple
+ *    one-line #define for node_isset(), instead of wrapping an inline
+ *    inside a macro, the way we do the other calls.
+ */
+
+#include <linux/threads.h>
+#include <linux/bitmap.h>
+#include <asm/bug.h>
+
+typedef struct { DECLARE_BITMAP(bits, MAX_NUMNODES); } nodemask_t;
+extern nodemask_t _unused_nodemask_arg_;
+
+#define node_set(node, dst) __node_set((node), &(dst))
+static inline void __node_set(int node, volatile nodemask_t *dstp)
+{
+	set_bit(node, dstp->bits);
+}
+
+#define node_clear(node, dst) __node_clear((node), &(dst))
+static inline void __node_clear(int node, volatile nodemask_t *dstp)
+{
+	clear_bit(node, dstp->bits);
+}
+
+#define nodes_setall(dst) __nodes_setall(&(dst), MAX_NUMNODES)
+static inline void __nodes_setall(nodemask_t *dstp, int nbits)
+{
+	bitmap_fill(dstp->bits, nbits);
+}
+
+#define nodes_clear(dst) __nodes_clear(&(dst), MAX_NUMNODES)
+static inline void __nodes_clear(nodemask_t *dstp, int nbits)
+{
+	bitmap_zero(dstp->bits, nbits);
+}
+
+/* No static inline type checking - see Subtlety (1) above. */
+#define node_isset(node, nodemask) test_bit((node), (nodemask).bits)
+
+#define node_test_and_set(node, nodemask) \
+			__node_test_and_set((node), &(nodemask))
+static inline int __node_test_and_set(int node, nodemask_t *addr)
+{
+	return test_and_set_bit(node, addr->bits);
+}
+
+#define nodes_and(dst, src1, src2) \
+			__nodes_and(&(dst), &(src1), &(src2), MAX_NUMNODES)
+static inline void __nodes_and(nodemask_t *dstp, const nodemask_t *src1p,
+					const nodemask_t *src2p, int nbits)
+{
+	bitmap_and(dstp->bits, src1p->bits, src2p->bits, nbits);
+}
+
+#define nodes_or(dst, src1, src2) \
+			__nodes_or(&(dst), &(src1), &(src2), MAX_NUMNODES)
+static inline void __nodes_or(nodemask_t *dstp, const nodemask_t *src1p,
+					const nodemask_t *src2p, int nbits)
+{
+	bitmap_or(dstp->bits, src1p->bits, src2p->bits, nbits);
+}
+
+#define nodes_xor(dst, src1, src2) \
+			__nodes_xor(&(dst), &(src1), &(src2), MAX_NUMNODES)
+static inline void __nodes_xor(nodemask_t *dstp, const nodemask_t *src1p,
+					const nodemask_t *src2p, int nbits)
+{
+	bitmap_xor(dstp->bits, src1p->bits, src2p->bits, nbits);
+}
+
+#define nodes_andnot(dst, src1, src2) \
+			__nodes_andnot(&(dst), &(src1), &(src2), MAX_NUMNODES)
+static inline void __nodes_andnot(nodemask_t *dstp, const nodemask_t *src1p,
+					const nodemask_t *src2p, int nbits)
+{
+	bitmap_andnot(dstp->bits, src1p->bits, src2p->bits, nbits);
+}
+
+#define nodes_complement(dst, src) \
+			__nodes_complement(&(dst), &(src), MAX_NUMNODES)
+static inline void __nodes_complement(nodemask_t *dstp,
+					const nodemask_t *srcp, int nbits)
+{
+	bitmap_complement(dstp->bits, srcp->bits, nbits);
+}
+
+#define nodes_equal(src1, src2) \
+			__nodes_equal(&(src1), &(src2), MAX_NUMNODES)
+static inline int __nodes_equal(const nodemask_t *src1p,
+					const nodemask_t *src2p, int nbits)
+{
+	return bitmap_equal(src1p->bits, src2p->bits, nbits);
+}
+
+#define nodes_intersects(src1, src2) \
+			__nodes_intersects(&(src1), &(src2), MAX_NUMNODES)
+static inline int __nodes_intersects(const nodemask_t *src1p,
+					const nodemask_t *src2p, int nbits)
+{
+	return bitmap_intersects(src1p->bits, src2p->bits, nbits);
+}
+
+#define nodes_subset(src1, src2) \
+			__nodes_subset(&(src1), &(src2), MAX_NUMNODES)
+static inline int __nodes_subset(const nodemask_t *src1p,
+					const nodemask_t *src2p, int nbits)
+{
+	return bitmap_subset(src1p->bits, src2p->bits, nbits);
+}
+
+#define nodes_empty(src) __nodes_empty(&(src), MAX_NUMNODES)
+static inline int __nodes_empty(const nodemask_t *srcp, int nbits)
+{
+	return bitmap_empty(srcp->bits, nbits);
+}
+
+#define nodes_full(nodemask) __nodes_full(&(nodemask), MAX_NUMNODES)
+static inline int __nodes_full(const nodemask_t *srcp, int nbits)
+{
+	return bitmap_full(srcp->bits, nbits);
+}
+
+#define nodes_weight(nodemask) __nodes_weight(&(nodemask), MAX_NUMNODES)
+static inline int __nodes_weight(const nodemask_t *srcp, int nbits)
+{
+	return bitmap_weight(srcp->bits, nbits);
+}
+
+#define nodes_shift_right(dst, src, n) \
+			__nodes_shift_right(&(dst), &(src), (n), MAX_NUMNODES)
+static inline void __nodes_shift_right(nodemask_t *dstp,
+					const nodemask_t *srcp, int n, int nbits)
+{
+	bitmap_shift_right(dstp->bits, srcp->bits, n, nbits);
+}
+
+#define nodes_shift_left(dst, src, n) \
+			__nodes_shift_left(&(dst), &(src), (n), MAX_NUMNODES)
+static inline void __nodes_shift_left(nodemask_t *dstp,
+					const nodemask_t *srcp, int n, int nbits)
+{
+	bitmap_shift_left(dstp->bits, srcp->bits, n, nbits);
+}
+
+#define first_node(src) __first_node(&(src), MAX_NUMNODES)
+static inline int __first_node(const nodemask_t *srcp, int nbits)
+{
+	return find_first_bit(srcp->bits, nbits);
+}
+
+#define next_node(n, src) __next_node((n), &(src), MAX_NUMNODES)
+static inline int __next_node(int n, const nodemask_t *srcp, int nbits)
+{
+	return find_next_bit(srcp->bits, nbits, n+1);
+}
+
+#define nodemask_of_node(node)						\
+({									\
+	typeof(_unused_nodemask_arg_) m;				\
+	if (sizeof(m) == sizeof(unsigned long)) {			\
+		m.bits[0] = 1UL<<(node);				\
+	} else {							\
+		nodes_clear(m);						\
+		node_set((node), m);					\
+	}								\
+	m;								\
+})
+
+#define NODE_MASK_LAST_WORD BITMAP_LAST_WORD_MASK(MAX_NUMNODES)
+
+#if MAX_NUMNODES <= BITS_PER_LONG
+
+#define NODE_MASK_ALL							\
+((nodemask_t) { {							\
+	[BITS_TO_LONGS(MAX_NUMNODES)-1] = NODE_MASK_LAST_WORD		\
+} })
+
+#else
+
+#define NODE_MASK_ALL							\
+((nodemask_t) { {							\
+	[0 ... BITS_TO_LONGS(MAX_NUMNODES)-2] = ~0UL,			\
+	[BITS_TO_LONGS(MAX_NUMNODES)-1] = NODE_MASK_LAST_WORD		\
+} })
+
+#endif
+
+#define NODE_MASK_NONE							\
+((nodemask_t) { {							\
+	[0 ... BITS_TO_LONGS(MAX_NUMNODES)-1] =  0UL			\
+} })
+
+#define nodes_addr(src) ((src).bits)
+
+#define nodemask_scnprintf(buf, len, src) \
+			__nodemask_scnprintf((buf), (len), &(src), MAX_NUMNODES)
+static inline int __nodemask_scnprintf(char *buf, int len,
+					const nodemask_t *srcp, int nbits)
+{
+	return bitmap_scnprintf(buf, len, srcp->bits, nbits);
+}
+
+#define nodemask_parse(ubuf, ulen, src) \
+			__nodemask_parse((ubuf), (ulen), &(src), MAX_NUMNODES)
+static inline int __nodemask_parse(const char __user *buf, int len,
+					nodemask_t *dstp, int nbits)
+{
+	return bitmap_parse(buf, len, dstp->bits, nbits);
+}
+
+#if MAX_NUMNODES > 1
+#define for_each_node_mask(node, mask)			\
+	for ((node) = first_node(mask);			\
+		(node) < MAX_NUMNODES;			\
+		(node) = next_node((node), (mask)))
+#else /* MAX_NUMNODES == 1 */
+#define for_each_node_mask(node, mask) for ((node) = 0; (node) < 1; (node)++)
+#endif /* MAX_NUMNODES */
+
+/*
+ * The following particular system nodemasks and operations
+ * on them manage all possible, online and present nodes.
+ */
+
+extern nodemask_t node_online_map;
+extern nodemask_t node_possible_map;
+extern nodemask_t node_present_map;
+
+#if MAX_NUMNODES > 1
+#define num_online_nodes()	nodes_weight(node_online_map)
+#define num_possible_nodes()	nodes_weight(node_possible_map)
+#define num_present_nodes()	nodes_weight(node_present_map)
+#define node_online(node)	node_isset((node), node_online_map)
+#define node_possible(node)	node_isset((node), node_possible_map)
+#define node_present(node)	node_isset((node), node_present_map)
+#else
+#define num_online_nodes()	1
+#define num_possible_nodes()	1
+#define num_present_nodes()	1
+#define node_online(node)	((node) == 0)
+#define node_possible(node)	((node) == 0)
+#define node_present(node)	((node) == 0)
+#endif
+
+#define any_online_node(mask)			\
+({						\
+	int node;				\
+	for_each_node_mask(node, (mask))	\
+		if (node_online(node))		\
+			break;			\
+	node;					\
+})
+
+#define node_set_online(node)	   set_bit((node), node_online_map.bits)
+#define node_set_offline(node)	   clear_bit((node), node_online_map.bits)
+
+#define for_each_node(node)	   for_each_node_mask((node), node_possible_map)
+#define for_each_online_node(node) for_each_node_mask((node), node_online_map)
+#define for_each_present_node(node) for_each_node_mask((node), node_present_map)
+
+#endif /* __LINUX_NODEMASK_H */
Index: 2.6.7-mm4/mm/mempolicy.c
===================================================================
--- 2.6.7-mm4.orig/mm/mempolicy.c	2004-06-28 20:00:54.000000000 -0700
+++ 2.6.7-mm4/mm/mempolicy.c	2004-06-28 20:13:13.000000000 -0700
@@ -66,6 +66,7 @@
 #include <linux/kernel.h>
 #include <linux/sched.h>
 #include <linux/mm.h>
+#include <linux/nodemask.h>
 #include <linux/gfp.h>
 #include <linux/slab.h>
 #include <linux/string.h>
@@ -95,7 +96,7 @@
 {
 	DECLARE_BITMAP(online2, MAX_NUMNODES);
 
-	bitmap_copy(online2, node_online_map, MAX_NUMNODES);
+	bitmap_copy(online2, nodes_addr(node_online_map), MAX_NUMNODES);
 	if (bitmap_empty(online2, MAX_NUMNODES))
 		set_bit(0, online2);
 	if (!bitmap_subset(nodes, online2, MAX_NUMNODES))
@@ -422,7 +423,7 @@
 	case MPOL_PREFERRED:
 		/* or use current node instead of online map? */
 		if (p->v.preferred_node < 0)
-			bitmap_copy(nodes, node_online_map, MAX_NUMNODES);
+			bitmap_copy(nodes, nodes_addr(node_online_map), MAX_NUMNODES);
 		else
 			__set_bit(p->v.preferred_node, nodes);
 		break;
@@ -628,7 +629,7 @@
 	struct zonelist *zl;
 	struct page *page;
 
-	BUG_ON(!test_bit(nid, node_online_map));
+	BUG_ON(!test_bit(nid, nodes_addr(node_online_map)));
 	zl = NODE_DATA(nid)->node_zonelists + (gfp & GFP_ZONEMASK);
 	page = __alloc_pages(gfp, order, zl);
 	if (page && page_zone(page) == zl->zones[0]) {
@@ -1017,7 +1018,8 @@
 	/* Set interleaving policy for system init. This way not all
 	   the data structures allocated at system boot end up in node zero. */
 
-	if (sys_set_mempolicy(MPOL_INTERLEAVE, node_online_map, MAX_NUMNODES) < 0)
+	if (sys_set_mempolicy(MPOL_INTERLEAVE, nodes_addr(node_online_map),
+							MAX_NUMNODES) < 0)
 		printk("numa_policy_init: interleaving failed\n");
 }
 
Index: 2.6.7-mm4/mm/page_alloc.c
===================================================================
--- 2.6.7-mm4.orig/mm/page_alloc.c	2004-06-28 20:00:54.000000000 -0700
+++ 2.6.7-mm4/mm/page_alloc.c	2004-06-28 20:00:57.000000000 -0700
@@ -31,10 +31,12 @@
 #include <linux/topology.h>
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
+#include <linux/nodemask.h>
 
 #include <asm/tlbflush.h>
 
-DECLARE_BITMAP(node_online_map, MAX_NUMNODES);
+nodemask_t node_online_map = NODE_MASK_NONE;
+nodemask_t node_possible_map = NODE_MASK_ALL;
 struct pglist_data *pgdat_list;
 unsigned long totalram_pages;
 unsigned long totalhigh_pages;

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [patch 5/8] cpusets v3 - New bitmap lists format
  2004-06-29 11:21 [patch 1/8] cpusets v3 - Table of Contents Paul Jackson
                   ` (2 preceding siblings ...)
  2004-06-29 11:22 ` [patch 4/8] cpusets v3 - nodemask patch (draft of Matthew Dobson's patch) Paul Jackson
@ 2004-06-29 11:22 ` Paul Jackson
  2004-06-29 11:22 ` [patch 6/8] cpusets v3 - The main new files: cpuset.c, cpuset.h Paul Jackson
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Paul Jackson @ 2004-06-29 11:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Jack Steiner, Jesse Barnes, Sylvain,
	Dan Higgins, Matthew Dobson, Andi Kleen, Paul Jackson, Simon

A bitmap print and parse format that provides lists of ranges
of numbers, to be first used by cpusets.

Index: 2.6.7-mm1/include/linux/bitmap.h
===================================================================
--- 2.6.7-mm1.orig/include/linux/bitmap.h	2004-06-23 02:44:49.000000000 -0700
+++ 2.6.7-mm1/include/linux/bitmap.h	2004-06-23 02:44:56.000000000 -0700
@@ -41,7 +41,8 @@
  * bitmap_shift_right(dst, src, n, nbits)	*dst = *src >> n
  * bitmap_shift_left(dst, src, n, nbits)	*dst = *src << n
  * bitmap_scnprintf(buf, len, src, nbits)	Print bitmap src to buf
- * bitmap_parse(ubuf, ulen, dst, nbits)		Parse bitmap dst from buf
+ * bitmap_parse(ubuf, ulen, dst, nbits)		Parse bitmap dst from user buf
+ * bitmap_parselist(buf, dst, nbits)		Parse bitmap dst from list
  */
 
 /*
@@ -98,6 +99,8 @@ extern int bitmap_scnprintf(char *buf, u
 			const unsigned long *src, int nbits);
 extern int bitmap_parse(const char __user *ubuf, unsigned int ulen,
 			unsigned long *dst, int nbits);
+extern int bitmap_parselist(const char *buf, unsigned long *maskp,
+			int nmaskbits);
 
 #define BITMAP_LAST_WORD_MASK(nbits)					\
 (									\
Index: 2.6.7-mm1/include/linux/cpumask.h
===================================================================
--- 2.6.7-mm1.orig/include/linux/cpumask.h	2004-06-23 02:44:49.000000000 -0700
+++ 2.6.7-mm1/include/linux/cpumask.h	2004-06-23 02:44:56.000000000 -0700
@@ -10,6 +10,8 @@
  *
  * For details of cpumask_scnprintf() and cpumask_parse(),
  * see bitmap_scnprintf() and bitmap_parse() in lib/bitmap.c.
+ * For details of cpulist_parse(), see bitmap_parselist(), also
+ * in lib/bitmap.c.
  *
  * The available cpumask operations are:
  *
@@ -46,6 +48,7 @@
  *
  * int cpumask_scnprintf(buf, len, mask) Format cpumask for printing
  * int cpumask_parse(ubuf, ulen, mask)	Parse ascii string as cpumask
+ * int cpulist_parse(buf, map)		Parse ascii string as cpulist
  *
  * for_each_cpu_mask(cpu, mask)		for-loop cpu over mask
  *
@@ -262,14 +265,20 @@ static inline int __cpumask_scnprintf(ch
 	return bitmap_scnprintf(buf, len, srcp->bits, nbits);
 }
 
-#define cpumask_parse(ubuf, ulen, src) \
-			__cpumask_parse((ubuf), (ulen), &(src), NR_CPUS)
+#define cpumask_parse(ubuf, ulen, dst) \
+			__cpumask_parse((ubuf), (ulen), &(dst), NR_CPUS)
 static inline int __cpumask_parse(const char __user *buf, int len,
 					cpumask_t *dstp, int nbits)
 {
 	return bitmap_parse(buf, len, dstp->bits, nbits);
 }
 
+#define cpulist_parse(buf, dst) __cpulist_parse((buf), &(dst), NR_CPUS)
+static inline int __cpulist_parse(const char *buf, cpumask_t *dstp, int nbits)
+{
+	return bitmap_parselist(buf, dstp->bits, nbits);
+}
+
 #if NR_CPUS > 1
 #define for_each_cpu_mask(cpu, mask)		\
 	for ((cpu) = first_cpu(mask);		\
Index: 2.6.7-mm1/include/linux/nodemask.h
===================================================================
--- 2.6.7-mm1.orig/include/linux/nodemask.h	2004-06-23 02:44:49.000000000 -0700
+++ 2.6.7-mm1/include/linux/nodemask.h	2004-06-23 02:44:56.000000000 -0700
@@ -10,7 +10,9 @@
  *
  * For details of nodemask_scnprintf() and nodemask_parse(),
  * see bitmap_scnprintf() and bitmap_parse() in lib/bitmap.c.
- *
+ * For details of nodelist_parse(), see bitmap_parselist(), also
+ * in lib/bitmap.c.
+ *
  * The available nodemask operations are:
  *
  * void node_set(node, mask)		turn on bit 'node' in mask
@@ -46,6 +48,7 @@
  *
  * int nodemask_scnprintf(buf, len, mask) Format nodemask for printing
  * int nodemask_parse(ubuf, ulen, mask)	Parse ascii string as nodemask
+ * int nodelist_parse(buf, map)		Parse ascii string as nodelist
  *
  * for_each_node_mask(node, mask)	for-loop node over mask
  *
@@ -275,14 +278,20 @@ static inline int __nodemask_scnprintf(c
 	return bitmap_scnprintf(buf, len, srcp->bits, nbits);
 }
 
-#define nodemask_parse(ubuf, ulen, src) \
-			__nodemask_parse((ubuf), (ulen), &(src), MAX_NUMNODES)
+#define nodemask_parse(ubuf, ulen, dst) \
+			__nodemask_parse((ubuf), (ulen), &(dst), MAX_NUMNODES)
 static inline int __nodemask_parse(const char __user *buf, int len,
 					nodemask_t *dstp, int nbits)
 {
 	return bitmap_parse(buf, len, dstp->bits, nbits);
 }
 
+#define nodelist_parse(buf, dst) __nodelist_parse((buf), &(dst), MAX_NUMNODES)
+static inline int __nodelist_parse(const char *buf, nodemask_t *dstp, int nbits)
+{
+	return bitmap_parselist(buf, dstp->bits, nbits);
+}
+
 #if MAX_NUMNODES > 1
 #define for_each_node_mask(node, mask)			\
 	for ((node) = first_node(mask);			\
Index: 2.6.7-mm1/lib/bitmap.c
===================================================================
--- 2.6.7-mm1.orig/lib/bitmap.c	2004-06-23 02:44:49.000000000 -0700
+++ 2.6.7-mm1/lib/bitmap.c	2004-06-23 02:44:56.000000000 -0700
@@ -291,6 +291,7 @@ EXPORT_SYMBOL(__bitmap_weight);
 #define nbits_to_hold_value(val)	fls(val)
 #define roundup_power2(val,modulus)	(((val) + (modulus) - 1) & ~((modulus) - 1))
 #define unhex(c)			(isdigit(c) ? (c - '0') : (toupper(c) - 'A' + 10))
+#define BASEDEC 10		/* fancier cpuset lists input in decimal */
 
 /**
  * bitmap_scnprintf - convert bitmap to an ASCII hex string.
@@ -408,3 +409,86 @@ int bitmap_parse(const char __user *ubuf
 	return 0;
 }
 EXPORT_SYMBOL(bitmap_parse);
+
+/**
+ * bitmap_parselist - parses a more flexible format for inputting bit masks
+ * @buf: read nul-terminated string from this buffer
+ * @maskp: write resulting mask here
+ * @nmaskbits: number of bits in mask to be written
+ * @mask: write resulting mask here
+ * @nmaskbits: number of bits in mask to be written
+ *
+ * The input format supports a space separated list of one or more comma
+ * separated sequences of ascii decimal bit numbers and ranges.  Each
+ * sequence may be preceded by one of the prefix characters '=',
+ * '-', '+', or '!', which have the following meanings:
+ *    '=': rewrite the mask to have only the bits specified in this sequence
+ *    '-': turn off the bits specified in this sequence
+ *    '+': turn on the bits specified in this sequence
+ *    '!': same as '-'.
+ *
+ * If no such initial character is specified, then the default prefix '='
+ * is presumed.  The list is evaluated and applied in left to right order.
+ *
+ * Examples of input format:
+ *	0-4,9				# rewrites to 0,1,2,3,4,9
+ *	-9				# removes 9
+ *	+6-8				# adds 6,7,8
+ *	1-6 -0,2-4 +11-14,16-19 -14-16	# same as 1,5,6,11-13,17-19
+ *	1-6 -0,2-4 +11-14,16-19 =14-16	# same as just 14,15,16
+ *
+ * Possible errno values returned for invalid input strings:
+ *      -EINVAL:   second number in range smaller than first
+ *      -ERANGE:   bit number specified too large for mask
+ *      -EINVAL:   invalid prefix char (not '=', '-', '+', or '!')
+ */
+
+int bitmap_parselist(const char *buf, unsigned long *maskp, int nmaskbits)
+{
+	char *p, *q;
+	int masklen = BITS_TO_LONGS(nmaskbits);
+
+	while ((p = strsep((char **)(&buf), " ")) != NULL) { /* blows const XXX */
+		char op = isdigit(*p) ? '=' : *p++;
+		unsigned long m[masklen];
+		int maskbytes = sizeof(m);
+		int i;
+
+		if (op == ' ')
+			continue;
+		memset(m, 0, maskbytes);
+
+		while ((q = strsep(&p, ",")) != NULL) {
+			unsigned a = simple_strtoul(q, 0, BASEDEC);
+			unsigned b = a;
+			char *cp = strchr(q, '-');
+			if (cp)
+				b = simple_strtoul(cp + 1, 0, BASEDEC);
+			if (!(a <= b))
+				return -EINVAL;
+			if (b >= nmaskbits)
+				return -ERANGE;
+			while (a <= b) {
+				set_bit(a, m);
+				a++;
+			}
+		}
+
+		switch (op) {
+			case '=':
+				memcpy(maskp, m, maskbytes);
+				break;
+			case '!':
+			case '-':
+				for (i = 0; i < masklen; i++)
+					maskp[i] &= ~m[i];
+				break;
+			case '+':
+				for (i = 0; i < masklen; i++)
+					maskp[i] |= m[i];
+				break;
+			default:
+				return -EINVAL;
+		}
+	}
+	return 0;
+}
+EXPORT_SYMBOL(bitmap_parselist);

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* [patch 6/8] cpusets v3 - The main new files: cpuset.c, cpuset.h
  2004-06-29 11:21 [patch 1/8] cpusets v3 - Table of Contents Paul Jackson
                   ` (3 preceding siblings ...)
  2004-06-29 11:22 ` [patch 5/8] cpusets v3 - New bitmap lists format Paul Jackson
@ 2004-06-29 11:22 ` Paul Jackson
  2004-06-29 11:22 ` [patch 7/8] cpusets v3 - The few, small kernel hooks needed Paul Jackson
  2004-06-29 11:22 ` [patch 8/8] cpusets v3 - One more hook, for /proc/<pid>/cpuset Paul Jackson
  6 siblings, 0 replies; 8+ messages in thread
From: Paul Jackson @ 2004-06-29 11:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Jack Steiner, Jesse Barnes, Paul Jackson,
	Dan Higgins, Matthew Dobson, Andi Kleen, Sylvain, Simon

The main cpuset patch - including Documentation, kernel/cpuset.c,
cpuset.h, and the kernel/Makefile hook.

The main cpuset code establishes a hierarchy of cpusets, visible
in a pseudo filesystem.  Each task links to the cpuset that controls
its CPU and Memory placement.  The CPUs and Memory Nodes allowed
to any particular cpuset are always a subset of that cpuset's
parent, with the top cpuset containing all CPUs and Memory Nodes
in the system.

This hierarchy is required in order to efficiently provide for strictly
exclusive cpusets - the CPUs and Memory Nodes of a strictly exclusive
cpuset are guaranteed by the kernel to not be part of any other cpuset
that is not a direct ancestor or descendent.

Follow on patches will add the necessary kernel hooks to connect
cpusets with the rest of the kernel.


Index: 2.6.7-mm4/Documentation/cpusets.txt
===================================================================
--- /dev/null
+++ 2.6.7-mm4/Documentation/cpusets.txt
@@ -0,0 +1,174 @@
+				CPUSETS
+				-------
+
+Copyright (C) 2004 BULL SA.
+Written by Simon.Derr@bull.net
+
+CONTENTS:
+=========
+
+1. cpusets
+  1.1 What are CPUSETS?
+  1.2 What can I do with CPUSETS?
+2. CSFS
+  2.1 Overview
+  2.2 Adding/removing cpus
+  2.3 Setting flags
+  2.4 Attaching processes
+3. Questions
+4. Contact
+
+1. cpusets
+==========
+
+1.1 What are CPUSETS?
+---------------------
+
+Cpusets provide a simple mechanism for organizing the set of CPUs and
+Memory Nodes assigned to a set of tasks on a large SMP or NUMA system.
+
+Each task has a single cpuset pointer, to the cpuset assigned to it.
+A task may only use CPUs (sched_setaffinity) and Memory Nodes (mbind
+or set_mempolicy) contained in its assigned cpuset.  If a cpuset is
+strictly exclusive, no other unrelated (not ancestor or descendent)
+cpuset may overlap (have the same CPU or Memory Node).
+
+User level code may create and destroy cpusets by name in the cpuset
+pseudo filesystem, manage the attributes and permissions of these
+cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
+specify and query to which cpuset a task is assigned, and list the
+task pids assigned to a cpuset.
+
+Some domains in which cpusets can be useful:
+
+    * Web Servers running multiple instances of the same web application.
+    * Servers running different applications (for instance, a web server and a
+      database).
+    * HPC applications, especially in NUMA machines.
+
+
+1.2 What can I do with CPUSETS?
+-------------------------------
+
+CPUSETS allow you to:
+
+   1. create sets of CPUs on the system, and bind applications to them
+
+   2. translate the CPU masks given to sched_setaffinity() so that they stay
+inside the set of CPUs. With this mechanism, processors are virtualized for
+the purposes of sched_setaffinity() and /proc information. Thus, any existing
+application that uses this syscall to bind processes to processors will work
+with virtual CPUs without any change.
+
+   3. provide a way to create sets of cpus *inside* a set of cpus: hence a
+system administrator can partition a system among users, and users can
+partition their partition among their applications.
+
+   4. change the execution area of a whole set of processes on the fly (to
+give more resources to a critical application, for example).
+
+
+2. CSFS
+=======
+
+You already guessed that CSFS stands for 'CpuSets FileSystem'...
+
+2.1 Overview
+------------
+
+Creating, modifying and using cpusets is done through the csfs pseudo
+filesystem.
+
+To mount it, type:
+# mount -t csfs none /proc/cpusets
+
+Then under /proc/cpusets you can find a tree that corresponds to the tree of the
+cpusets in the system. For instance, /proc/cpusets/top_cpuset is the cpuset that
+holds the whole system.
+
+If you want to create a new cpuset under top_cpuset:
+# cd /proc/cpusets/top_cpuset
+# mkdir my_cpuset
+
+Now you want to do something with this cpuset.
+# cd my_cpuset
+
+In this directory you can find several files:
+# ls
+autoclean  debug  reserved_cpus  strictly_reserved_cpus
+cpus       id     strict         tasks
+
+Reading them will give you information about the state of this cpuset: the CPUs
+it can use, the processes that are using it, its properties.  By writing to
+these files you can manipulate the cpuset.
+
+Set some flags:
+# /bin/echo 1 > autoclean
+
+Add some cpus:
+# /bin/echo 0-7 > cpus
+
+Now attach your shell to this cpuset:
+# /bin/echo $$ > tasks
+
+You can also create cpusets inside your cpuset by using mkdir in this directory.
+# mkdir my_sub_cs
+
+To remove a cpuset, just use rmdir:
+# rmdir my_sub_cs
+This will fail if the cpuset is in use (has cpusets inside, or has processes
+attached).
+
+2.2 Adding/removing cpus
+------------------------
+
+This is the syntax to use when writing in /proc/cpusets/top_cpuset/foo/bar/cpus
+
+# /bin/echo 1-4 > cpus		-> set cpus list to cpus 1,2,3,4
+# /bin/echo 1,2,3,4 > cpus	-> set cpus list to cpus 1,2,3,4
+# /bin/echo +1 > cpus		-> add cpu 1 to the cpus list
+# /bin/echo -1-4 > cpus		-> remove cpus 1,2,3,4 from the cpus list
+# /bin/echo -1,2,3,4 > cpus	-> remove cpus 1,2,3,4 from the cpus list
+
+All these can be mixed together:
+# /bin/echo 1-7 -6 +9,10 > cpus	-> set cpus list to 1,2,3,4,5,7,9,10
+
+2.3 Setting flags
+-----------------
+
+The syntax is very simple:
+
+# /bin/echo 1 > strict 		-> set flag 'strict'
+# /bin/echo 0 > strict 		-> unset flag 'strict'
+# /bin/echo 1 > autoclean 	-> set flag 'autoclean'
+
+2.4 Attaching processes
+-----------------------
+
+# /bin/echo PID > tasks
+
+Note that it is PID, not PIDs. You can only attach ONE task at a time.
+If you have several tasks to attach, you have to do it one after another:
+
+# /bin/echo PID1 > tasks
+# /bin/echo PID2 > tasks
+	...
+# /bin/echo PIDn > tasks
+
+
+3. Questions
+============
+
+Q: What's up with this '/bin/echo'?
+A: bash's builtin 'echo' command does not check its calls to write() for
+   errors. If you use it in the csfs cpusets interface, you won't be
+   able to tell whether a command succeeded or failed.
+
+Q: When I attach processes, only the first one on the line actually gets attached!
+A: We can only return one error code per call to write(), so you should
+   put only ONE pid per write.
+
+4. Contact
+==========
+
+Web: http://www.bullopensource.org/cpuset
Index: 2.6.7-mm4/include/linux/cpuset.h
===================================================================
--- /dev/null
+++ 2.6.7-mm4/include/linux/cpuset.h
@@ -0,0 +1,41 @@
+#ifndef _LINUX_CPUSET_H
+#define _LINUX_CPUSET_H
+/*
+ *  cpuset interface
+ *
+ *  Copyright (C) 2003 BULL SA
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/cpumask.h>
+#include <linux/nodemask.h>
+
+#ifdef CONFIG_CPUSETS
+
+extern int cpuset_init(void);
+extern void cpuset_fork(struct task_struct *p);
+extern void cpuset_exit(struct task_struct *p);
+extern const cpumask_t cpuset_cpus_allowed(const struct task_struct *p);
+extern const nodemask_t cpuset_mems_allowed(const struct task_struct *p);
+extern int proc_pid_cspath(struct task_struct *p, char *buf, int len);
+
+#else /* !CONFIG_CPUSETS */
+
+static inline int cpuset_init(void) { return 0; }
+static inline void cpuset_fork(struct task_struct *p) {}
+static inline void cpuset_exit(struct task_struct *p) {}
+
+static inline const cpumask_t cpuset_cpus_allowed(struct task_struct *p)
+{
+	return cpu_possible_map;
+}
+
+static inline const nodemask_t cpuset_mems_allowed(struct task_struct *p)
+{
+	return node_possible_map;
+}
+
+#endif /* !CONFIG_CPUSETS */
+
+#endif /* _LINUX_CPUSET_H */
Index: 2.6.7-mm4/kernel/Makefile
===================================================================
--- 2.6.7-mm4.orig/kernel/Makefile
+++ 2.6.7-mm4/kernel/Makefile
@@ -24,6 +24,7 @@ obj-$(CONFIG_IKCONFIG_PROC) += configs.o
 obj-$(CONFIG_STOP_MACHINE) += stop_machine.o
 obj-$(CONFIG_AUDIT) += audit.o
 obj-$(CONFIG_AUDITSYSCALL) += auditsc.o
+obj-$(CONFIG_CPUSETS) += cpuset.o
 
 ifneq ($(CONFIG_IA64),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
Index: 2.6.7-mm4/kernel/cpuset.c
===================================================================
--- /dev/null
+++ 2.6.7-mm4/kernel/cpuset.c
@@ -0,0 +1,1531 @@
+/*
+ *  kernel/cpuset.c
+ *
+ *  Processor and Memory placement constraints for sets of tasks.
+ *
+ *  Copyright (C) 2003 BULL SA.
+ *
+ *  Portions derived from Patrick Mochel's sysfs code.
+ *  sysfs is Copyright (c) 2001-3 Patrick Mochel
+ *
+ *  2003-10-10 Written by Simon Derr <simon.derr@bull.net>
+ *  2003-10-22 Updates by Stephen Hemminger.
+ *  2004 May-June Some rework by Paul Jackson <pj@sgi.com>
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/config.h>
+#include <linux/cpu.h>
+#include <linux/cpumask.h>
+#include <linux/cpuset.h>
+#include <linux/errno.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/pagemap.h>
+#include <linux/proc_fs.h>
+#include <linux/sched.h>
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/spinlock.h>
+#include <linux/stat.h>
+#include <linux/string.h>
+#include <linux/time.h>
+#include <linux/backing-dev.h>
+
+#include <asm/uaccess.h>
+#include <asm/atomic.h>
+#include <asm/semaphore.h>
+
+#define CPUSET_SUPER_MAGIC 		0x27e0eb
+
+struct cpuset {
+	unsigned long flags;		/* "unsigned long" so bitops work */
+	cpumask_t cpus_allowed;		/* CPUs allowed to tasks in cpuset */
+	nodemask_t mems_allowed;	/* Memory Nodes allowed to tasks */
+	struct cpuset *parent;		/* my parent */
+	/*
+	 * We link our 'sibling' struct into our parent's 'children'.
+	 * Our children link their 'sibling' into our 'children'.
+	 */
+	struct list_head sibling;	/* my parent's children */
+	struct list_head children;	/* my children */
+	atomic_t count;			/* count all users (tasks + children) */
+	struct dentry *dentry;		/* cpuset fs entry */
+};
+
+/* bits in struct cpuset flags field */
+typedef enum {
+	CS_CPUSTRICT,
+	CS_MEMSTRICT,
+	CS_AUTOCLEAN,
+	CS_HAS_BEEN_ATTACHED,
+	CS_REMOVED,
+} cpuset_flagbits_t;
+
+/* convenient tests for these bits */
+static inline int is_cpustrict(const struct cpuset *cs)
+{
+	return !!test_bit(CS_CPUSTRICT, &cs->flags);
+}
+
+static inline int is_memstrict(const struct cpuset *cs)
+{
+	return !!test_bit(CS_MEMSTRICT, &cs->flags);
+}
+
+static inline int is_autoclean(const struct cpuset *cs)
+{
+	return !!test_bit(CS_AUTOCLEAN, &cs->flags);
+}
+
+static inline int has_been_attached(const struct cpuset *cs)
+{
+	return !!test_bit(CS_HAS_BEEN_ATTACHED, &cs->flags);
+}
+
+static inline int is_removed(const struct cpuset *cs)
+{
+	return !!test_bit(CS_REMOVED, &cs->flags);
+}
+
+static struct cpuset top_cpuset = {
+	.flags = ((1 << CS_CPUSTRICT) | (1 << CS_MEMSTRICT)),
+	.parent = NULL,
+	.sibling = LIST_HEAD_INIT(top_cpuset.sibling),
+	.children = LIST_HEAD_INIT(top_cpuset.children),
+	.count = ATOMIC_INIT(1),	/* top_cpuset can't be deleted */
+	.cpus_allowed = CPU_MASK_ALL,
+	.mems_allowed = NODE_MASK_ALL,
+};
+
+static struct vfsmount *cpuset_mount;
+static struct super_block *cpuset_sb = NULL;
+
+/*
+ * cpuset_sem should be held by anyone who is depending on the children
+ * or sibling lists of any cpuset, or performing non-atomic operations
+ * on the flags or *_allowed values of a cpuset, such as raising the
+ * CS_REMOVED flag bit iff it is not already raised, or reading and
+ * conditionally modifying the *_allowed values.  One kernel global
+ * cpuset semaphore should be sufficient - these things don't change
+ * that much.
+ */
+static DECLARE_MUTEX(cpuset_sem);
+
+static long cpuset_create(struct cpuset *parent, const char *name, int mode);
+static int cpuset_populate_dir(struct dentry *cs_dentry);
+static void release_cpuset_unlocked(struct cpuset *cs);
+static int cpuset_destroy(struct cpuset *cs);
+static int create_dir(struct cpuset *cs, struct dentry *p, const char *n,
+						struct dentry **d, int mode);
+
+static struct backing_dev_info cpuset_backing_dev_info = {
+	.ra_pages = 0,		/* No readahead */
+	.memory_backed = 1,	/* Does not contribute to dirty memory */
+};
+
+static struct inode *cpuset_new_inode(mode_t mode)
+{
+	struct inode *inode = new_inode(cpuset_sb);
+	if (inode) {
+		inode->i_mode = mode;
+		inode->i_uid = current->fsuid;
+		inode->i_gid = current->fsgid;
+		inode->i_blksize = PAGE_CACHE_SIZE;
+		inode->i_blocks = 0;
+		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+		inode->i_mapping->backing_dev_info = &cpuset_backing_dev_info;
+	}
+	return inode;
+}
+
+static void cpuset_diput(struct dentry *dentry, struct inode *inode)
+{
+	/* is dentry a directory? if so, kfree() the associated cpuset */
+	if (S_ISDIR(inode->i_mode)) {
+		struct cpuset *cs = (struct cpuset *)dentry->d_fsdata;
+		BUG_ON(!(is_removed(cs)));
+		kfree(cs);
+	}
+	iput(inode);
+}
+
+static struct dentry_operations cpuset_dops = {
+	.d_iput = cpuset_diput,
+};
+
+static struct dentry *cpuset_get_dentry(struct dentry *parent, const char *name)
+{
+	struct qstr qstr;
+	struct dentry *d;
+
+	qstr.name = name;
+	qstr.len = strlen(name);
+	qstr.hash = full_name_hash(name, qstr.len);
+	d = lookup_hash(&qstr, parent);
+	if (d)
+		d->d_op = &cpuset_dops;
+	return d;
+}
+
+static int cpuset_mkdir(struct inode *dir, struct dentry *dentry, int mode)
+{
+	struct dentry *d_parent = dentry->d_parent;
+	struct cpuset *c_parent = (struct cpuset *)d_parent->d_fsdata;
+
+	/* the vfs holds inode->i_sem already */
+	return cpuset_create(c_parent, dentry->d_name.name, mode | S_IFDIR);
+}
+
+static void use_cpuset(struct cpuset * cs)
+{
+	atomic_inc(&cs->count);
+}
+
+static void remove_dir(struct dentry *d)
+{
+	struct dentry *parent = dget(d->d_parent);
+
+	d_delete(d);
+	simple_rmdir(parent->d_inode, d);
+	dput(parent);
+}
+
+/*
+ * NOTE: the dentry must have been dget()'ed
+ */
+static void cpuset_d_remove_dir(struct dentry *dentry)
+{
+	struct list_head *node;
+
+	spin_lock(&dcache_lock);
+	node = dentry->d_subdirs.next;
+	while (node != &dentry->d_subdirs) {
+		struct dentry *d = list_entry(node, struct dentry, d_child);
+		list_del_init(node);
+		if (d->d_inode) {
+			d = dget_locked(d);
+			spin_unlock(&dcache_lock);
+			d_delete(d);
+			simple_unlink(dentry->d_inode, d);
+			dput(d);
+			spin_lock(&dcache_lock);
+		}
+		node = dentry->d_subdirs.next;
+	}
+	list_del_init(&dentry->d_child);
+	spin_unlock(&dcache_lock);
+	remove_dir(dentry);
+}
+
+/*
+ *	cpuset_remove - remove a cpuset
+ *	cs:		the cpuset to remove
+ *
+ *	The cpuset struct is cleaned and removed from the lists.
+ *	The cpuset still exists, so if there are files still open in
+ *	the cpuset we are not in trouble -- but it is of no further use.
+ *	It will be freed when all the dentries' use counts drop to zero.
+ */
+
+static void cpuset_remove(struct cpuset *cs)
+{
+	struct dentry *d;
+
+	set_bit(CS_REMOVED, &cs->flags);
+	list_del(&cs->sibling);
+	d = dget(cs->dentry);
+	cs->dentry = NULL;
+	spin_unlock(&d->d_lock);
+	cpuset_d_remove_dir(d);
+	dput(d);
+}
+
+/*
+ *	check_cpuset_autoclean - check if an unused cpuset must be deleted.
+ *	cs: 		the cpuset to check and maybe delete.
+ *	take_sem:	non-zero means we were called by
+ *			release_cpuset_unlocked and thus need to lock
+ *			the cpuset inode.
+ *			zero means we were called by release_cpuset_locked
+ *			and thus the cpuset inode is already locked.
+ *			(locked = inode->i_sem is taken)
+ *
+ *	Called when usage count of a cpuset drops to zero.
+ *	If cs has CS_AUTOCLEAN flag set, delete cs.
+ */
+
+static void check_cpuset_autoclean(struct cpuset *cs, int take_sem)
+{
+	struct inode *inode = NULL;	/* damn gcc warning */
+	struct inode *inodep;
+	struct cpuset *parent = NULL;
+
+	/* always hold parent's inode semaphore, in all cases */
+	inodep = igrab(cs->dentry->d_parent->d_inode);
+	down(&inodep->i_sem);
+
+	/* hold cs' inode's semaphore only if caller didn't */
+	if (take_sem) {
+		inode = igrab(cs->dentry->d_inode);
+		down(&inode->i_sem);
+	}
+
+	down(&cpuset_sem);
+
+	if (atomic_read(&cs->count) == 0 &&
+	    !is_removed(cs) &&
+	    is_autoclean(cs) &&
+	    has_been_attached(cs)
+	) {
+		parent = cs->parent;
+
+		/* Actual deletion occurs here : */
+		cpuset_remove(cs);
+	}
+
+	up(&cpuset_sem);
+
+	if (take_sem) {
+		up(&inode->i_sem);
+		iput(inode);
+	}
+
+	up(&inodep->i_sem);
+	iput(inodep);
+	if (parent)
+		release_cpuset_unlocked(parent);
+}
+
+/*
+ *	Why two flavours of release_cpuset() ?
+ *	When doing a rmdir(), the vfs will take the semaphore on the inode of
+ *	the directory to remove, and also on the inode of its parent. Now if
+ *	the parent is autoclean, and has to be removed, we don't need to down()
+ *	its inode because the vfs did it. Fine.
+ *	But if an autoclean cpuset has to be removed because a process exited,
+ *	the semaphore still has to be taken.
+ *	So we call:
+ *	release_cpuset_locked   : if the semaphore on the inode is already held
+ *	release_cpuset_unlocked : otherwise.
+ */
+
+static void release_cpuset_locked(struct cpuset *cs)
+{
+	if (atomic_dec_and_test(&cs->count))
+		check_cpuset_autoclean(cs, 0);
+}
+
+/*
+ * release_cpuset_unlocked(cs) - decrement cpuset 'cs' use count
+ * If use count drops to zero, check_cpuset_autoclean() is called.
+ * Locking: cpuset_sem MUST NOT BE HELD, as check_cpuset_autoclean
+ * will grab it.
+ */
+
+static void release_cpuset_unlocked(struct cpuset *cs)
+{
+	if (atomic_dec_and_test(&cs->count))
+		check_cpuset_autoclean(cs, 1);
+}
+
+static int cpuset_rmdir(struct inode *unused_dir, struct dentry *dentry)
+{
+	struct cpuset *cs = (struct cpuset *)dentry->d_fsdata;
+	int retval;
+
+	use_cpuset(cs);
+	/* the vfs holds both inode->i_sem already */
+	retval = cpuset_destroy(cs);
+	/*
+	 * Even if deletion succeeded, we can call release_cpuset_unlocked(),
+	 * because the cpuset will be freed only when the dentry is.
+	 */
+	release_cpuset_unlocked(cs);
+	return retval;
+}
+
+static struct inode_operations cpuset_dir_inode_operations = {
+	.lookup = simple_lookup,
+	.mkdir = cpuset_mkdir,
+	.rmdir = cpuset_rmdir,
+};
+
+/*
+ *	cpuset_create_dir - create a directory for an object.
+ *	cs: 	the cpuset we create the directory for.
+ *		It must have a valid ->parent field
+ *		And we are going to fill its ->dentry field.
+ *	name:	The name to give to the cpuset directory. Will be copied.
+ *	mode:	mode to set on new directory.
+ */
+
+static int cpuset_create_dir(struct cpuset *cs, const char *name, int mode)
+{
+	struct dentry *dentry = NULL;
+	struct dentry *parent;
+	int error = 0;
+	struct cpuset *csp;
+
+	if (!cs)
+		return -EINVAL;
+
+	csp = cs->parent;
+
+	/* find parent dentry: parent cpuset or fs root? */
+	if (csp)
+		parent = csp->dentry;
+	else if (cpuset_mount && cpuset_mount->mnt_sb)
+		parent = cpuset_mount->mnt_sb->s_root;
+	else
+		return -EFAULT;
+
+	error = create_dir(cs, parent, name, &dentry, mode);
+	if (!error)
+		cs->dentry = dentry;
+	return error;
+}
+
+static struct super_operations cpuset_ops = {
+	.statfs = simple_statfs,
+	.drop_inode = generic_delete_inode,
+};
+
+static int cpuset_fill_super(struct super_block *sb, void *unused_data,
+							int unused_silent)
+{
+	struct inode *inode;
+	struct dentry *root;
+
+	sb->s_blocksize = PAGE_CACHE_SIZE;
+	sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
+	sb->s_magic = CPUSET_SUPER_MAGIC;
+	sb->s_op = &cpuset_ops;
+	cpuset_sb = sb;
+
+	inode = cpuset_new_inode(S_IFDIR | S_IRUGO | S_IXUGO | S_IWUSR);
+	if (inode) {
+		inode->i_op = &simple_dir_inode_operations;
+		inode->i_fop = &simple_dir_operations;
+		/* directories start off with i_nlink == 2 (for "." entry) */
+		inode->i_nlink++;
+	} else {
+		return -ENOMEM;
+	}
+
+	root = d_alloc_root(inode);
+	if (!root) {
+		iput(inode);
+		return -ENOMEM;
+	}
+	sb->s_root = root;
+	return 0;
+}
+
+static struct super_block *cpuset_get_sb(struct file_system_type *fs_type,
+					int flags, const char *unused_dev_name,
+					void *data)
+{
+	return get_sb_single(fs_type, flags, data, cpuset_fill_super);
+}
+
+static struct file_system_type cpuset_fs_type = {
+	.name = "cpuset",
+	.get_sb = cpuset_get_sb,
+	.kill_sb = kill_litter_super,
+};
+
+/* struct cftype:
+ *
+ * Most files in the cpuset filesystem share a simple read/write
+ * handling, provided by common functions below.  Some cases (such as
+ * reading the tasks file) are special, so each kind of file gets a
+ * cftype describing its handlers.
+ *
+ * When reading/writing to a file:
+ *	- the cpuset to use is in file->f_dentry->d_parent->d_fsdata
+ *	- the 'cftype' of the file is file->f_dentry->d_fsdata
+ */
+
+struct cftype {
+	char *name;
+	int private;
+	int (*open) (struct inode *inode, struct file *file);
+	ssize_t (*read) (struct file *file, char __user *buf, size_t nbytes,
+							loff_t *ppos);
+	int (*write) (struct file *file, const char *buf, size_t nbytes,
+							loff_t *ppos);
+	int (*release) (struct inode *inode, struct file *file);
+};
+
+static inline struct cpuset *__d_cs(struct dentry *dentry)
+{
+	return (struct cpuset *)dentry->d_fsdata;
+}
+
+static inline struct cftype *__d_cft(struct dentry *dentry)
+{
+	return (struct cftype *)dentry->d_fsdata;
+}
+
+/*
+ * is_cpuset_subset(p, q) - Is cpuset p a subset of cpuset q?
+ *
+ * One cpuset is a subset of another if all its allowed CPUs and
+ * Memory Nodes are a subset of the other, and its strict flags are
+ * only set if the other's are set.
+ */
+
+static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q)
+{
+	return	cpus_subset(p->cpus_allowed, q->cpus_allowed) &&
+		nodes_subset(p->mems_allowed, q->mems_allowed) &&
+		is_cpustrict(p) <= is_cpustrict(q) &&
+		is_memstrict(p) <= is_memstrict(q);
+}
+
+/*
+ * validate_change() - Used to validate that any proposed cpuset change
+ *		       follows the structural rules for cpusets.
+ *
+ * If we replaced the flag and mask values of the current cpuset
+ * (cur) with those values in the trial cpuset (trial), would
+ * our various subset and strict rules still be valid?  Presumes
+ * cpuset_sem held.
+ *
+ * 'cur' is the address of an actual, in-use cpuset.  Operations
+ * such as list traversal that depend on the actual address of the
+ * cpuset in the list must use cur below, not trial.
+ *
+ * 'trial' is the address of a bulk structure copy of cur, with
+ * perhaps one or more of the fields cpus_allowed, mems_allowed,
+ * or flags changed to new, trial values.
+ *
+ * Return 0 if valid, -errno if not.
+ */
+
+static int validate_change(const struct cpuset *cur, const struct cpuset *trial)
+{
+	struct cpuset *c, *par = cur->parent;
+
+	/*
+	 * Don't mess with Big Daddy - top_cpuset must remain maximal.
+	 * And besides, the rest of this routine blows chunks if par == 0.
+	 */
+	if (cur == &top_cpuset)
+		return -EPERM;
+
+	/* Any in-use cpuset must have at least ONE cpu and mem */
+	if (atomic_read(&trial->count) > 1) {
+		if (cpus_empty(trial->cpus_allowed))
+			return -ENOSPC;
+		if (nodes_empty(trial->mems_allowed))
+			return -ENOSPC;
+	}
+
+	/* We must be a subset of our parent cpuset */
+	if (!is_cpuset_subset(trial, par))
+		return -EACCES;
+
+	/* Each of our child cpusets must be a subset of us */
+	list_for_each_entry(c, &cur->children, sibling) {
+		if (!is_cpuset_subset(c, trial))
+			return -EBUSY;
+	}
+
+	/* If either I or some sibling (!= me) is strict, we can't overlap */
+	list_for_each_entry(c, &par->children, sibling) {
+		if ((is_cpustrict(trial) || is_cpustrict(c)) &&
+		    c != cur &&
+		    cpus_intersects(trial->cpus_allowed, c->cpus_allowed)
+		) {
+			return -EINVAL;
+		}
+		if ((is_memstrict(trial) || is_memstrict(c)) &&
+		    c != cur &&
+		    nodes_intersects(trial->mems_allowed, c->mems_allowed)
+		) {
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+/*
+ *	update_cpus - change the list of cpus of a cpuset
+ *	cs:		the cpuset
+ *	new_mask:	the new list of cpus
+ */
+
+static int update_cpus(struct cpuset *cs, const cpumask_t *new_mask)
+{
+	int err;
+	struct cpuset trialcs;
+
+	down(&cpuset_sem);
+
+	trialcs = *cs;
+	trialcs.cpus_allowed = *new_mask;
+
+	err = validate_change(cs, &trialcs);
+	if (err == 0)
+		cs->cpus_allowed = *new_mask;
+
+	up(&cpuset_sem);
+	return err;
+}
+
+static int update_cpumask_from_str(struct cpuset *cs, char *buf)
+{
+	cpumask_t new_mask;
+	int retval;
+
+	new_mask = cs->cpus_allowed;
+	retval = cpulist_parse(buf, new_mask);
+	if (retval < 0)
+		return retval;
+
+	use_cpuset(cs);
+	if (is_removed(cs))
+		retval = -ENODEV;
+	else
+		retval = update_cpus(cs, &new_mask);
+	release_cpuset_unlocked(cs);
+	return retval;
+}
+
+/*
+ *	update_mems - change the list of mems of a cpuset
+ *	cs:		the cpuset
+ *	new_mask:	the new list of mems
+ */
+
+static int update_mems(struct cpuset *cs, const nodemask_t *new_mask)
+{
+	int err;
+	struct cpuset trialcs;
+
+	down(&cpuset_sem);
+
+	trialcs = *cs;
+	trialcs.mems_allowed = *new_mask;
+
+	err = validate_change(cs, &trialcs);
+	if (err == 0)
+		cs->mems_allowed = *new_mask;
+
+	up(&cpuset_sem);
+	return err;
+}
+
+static int update_nodemask_from_str(struct cpuset *cs, char *buf)
+{
+	nodemask_t new_mask;
+	int retval;
+
+	new_mask = cs->mems_allowed;
+	retval = nodelist_parse(buf, new_mask);
+	if (retval < 0)
+		return retval;
+
+	use_cpuset(cs);
+	if (is_removed(cs))
+		retval = -ENODEV;
+	else
+		retval = update_mems(cs, &new_mask);
+	release_cpuset_unlocked(cs);
+	return retval;
+}
+
+/*
+ * update_flag - parse a 0 or a 1 from the buffer and update the flag
+ * bit:	the bit to update (CPUSTRICT, MEMSTRICT, AUTOCLEAN)
+ * cs:	the cpuset to update
+ * buf:	the buffer from which we read the 0 or 1
+ */
+
+static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs, char *buf)
+{
+	int turning_on;
+	struct cpuset trialcs;
+	int err;
+
+	turning_on = (simple_strtoul(buf, NULL, 10) != 0);
+
+	use_cpuset(cs);
+	down(&cpuset_sem);
+	trialcs = *cs;
+	if (turning_on)
+		set_bit(bit, &trialcs.flags);
+	else
+		clear_bit(bit, &trialcs.flags);
+
+	err = validate_change(cs, &trialcs);
+	if (err == 0) {
+		if (turning_on)
+			set_bit(bit, &cs->flags);
+		else
+			clear_bit(bit, &cs->flags);
+	}
+
+	up(&cpuset_sem);
+	/*
+	 * If cs has been attached, and we add the autoclean flag,
+	 * the cs will be destroyed on this call to release_cpuset_unlocked().
+	 */
+	release_cpuset_unlocked(cs);
+	return err;
+}
+
+
+static int cpuset_attach_task_by_pid(struct cpuset *cs, int pid)
+{
+	struct task_struct *tsk;
+	struct cpuset *oldcs;
+
+	if (cpus_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
+		return -ENOSPC;
+
+	if (pid) {
+		read_lock(&tasklist_lock);
+
+		tsk = find_task_by_pid(pid);
+		if (!tsk) {
+			read_unlock(&tasklist_lock);
+			return -ESRCH;
+		}
+
+		get_task_struct(tsk);
+		read_unlock(&tasklist_lock);
+
+		if ((current->euid) && (current->euid != tsk->uid)
+		    && (current->euid != tsk->suid)) {
+			put_task_struct(tsk);
+			return -EACCES;
+		}
+	} else {
+		tsk = current;
+		get_task_struct(tsk);
+	}
+
+	task_lock(tsk);
+	oldcs = tsk->cpuset;
+	if (!oldcs) {
+		task_unlock(tsk);
+		put_task_struct(tsk);
+		return -ESRCH;
+	}
+	use_cpuset(cs);
+	set_bit(CS_HAS_BEEN_ATTACHED, &cs->flags);
+	tsk->cpuset = cs;
+	task_unlock(tsk);
+
+	put_task_struct(tsk);
+	release_cpuset_unlocked(oldcs);
+	return 0;
+}
+
+static int attach_task_from_str(struct cpuset *cs, char *buf)
+{
+	int pid;
+
+	int err = -EIO;
+	int conv = sscanf(buf, "%d", &pid);
+
+	if (conv == 1) {
+		use_cpuset(cs);
+		if (is_removed(cs))
+			err = -ENODEV;
+		else
+			err = cpuset_attach_task_by_pid(cs, pid);
+		release_cpuset_unlocked(cs);
+	}
+
+	if (err < 0)
+		return err;
+	return 0;
+}
+
+/* The various types of files and directories in a cpuset file system */
+
+typedef enum {
+	FILE_ROOT,
+	FILE_DIR,
+	FILE_CPULIST,
+	FILE_MEMLIST,
+	FILE_CPUSTRICT,
+	FILE_MEMSTRICT,
+	FILE_AUTOCLEAN,
+	FILE_TASKLIST,
+} cpuset_filetype_t;
+
+static ssize_t cpuset_common_file_write(struct file *file, const char *userbuf,
+					size_t nbytes, loff_t *unused_ppos)
+{
+	struct cpuset *cs = __d_cs(file->f_dentry->d_parent);
+	struct cftype *cft = __d_cft(file->f_dentry);
+	cpuset_filetype_t type = cft->private;
+	char *buffer;
+	int retval = 0;
+
+	/* Crude upper limit on the largest legitimate cpulist a user might write. */
+	if (nbytes > 100 + 6 * NR_CPUS)
+		return -E2BIG;
+
+	/* +1 for nul-terminator */
+	if ((buffer = kmalloc(nbytes + 1, GFP_KERNEL)) == NULL)
+		return -ENOMEM;
+
+	if (copy_from_user(buffer, userbuf, nbytes)) {
+		retval = -EFAULT;
+		goto out;
+	}
+	buffer[nbytes] = 0;	/* nul-terminate */
+
+	switch (type) {
+	default:
+		retval = -EINVAL;
+		goto out;
+	case FILE_CPULIST:
+		retval = update_cpumask_from_str(cs, buffer);
+		break;
+	case FILE_MEMLIST:
+		retval = update_nodemask_from_str(cs, buffer);
+		break;
+	case FILE_CPUSTRICT:
+		retval = update_flag(CS_CPUSTRICT, cs, buffer);
+		break;
+	case FILE_MEMSTRICT:
+		retval = update_flag(CS_MEMSTRICT, cs, buffer);
+		break;
+	case FILE_AUTOCLEAN:
+		retval = update_flag(CS_AUTOCLEAN, cs, buffer);
+		break;
+	case FILE_TASKLIST:
+		retval = attach_task_from_str(cs, buffer);
+		break;
+	}
+	if (retval == 0)
+		retval = nbytes;
+out:
+	kfree(buffer);
+	return retval;
+}
+
+static ssize_t cpuset_file_write(struct file *file, const char *buf,
+						size_t nbytes, loff_t *ppos)
+{
+	ssize_t retval = 0;
+	struct cftype *cft = __d_cft(file->f_dentry);
+	if (!cft)
+		return -ENODEV;
+
+	/* special function? */
+	if (cft->write)
+		retval = cft->write(file, buf, nbytes, ppos);
+	else
+		retval = cpuset_common_file_write(file, buf, nbytes, ppos);
+
+	return retval;
+}
+
+static int cpuset_sprintf_list(char *page, unsigned long *mask, int max)
+{
+	char *s = page;
+	int i;
+	int p = 0;		/* length of the current run of present cpus */
+	int f = 0;		/* 1 -> at least one cpu was found */
+	/* loop one extra time so a range ending at the last cpu is closed */
+	for (i = 0; i < max + 1; i++) {
+		if ((i < max) && (test_bit(i, mask))) {
+			if (!p) {
+				if (f)
+					*(s++) = ',';
+				s += sprintf(s, "%d", i);
+			}
+			f = 1;
+			p++;
+		} else {
+			if (f && (p > 1))
+				s += sprintf(s, "-%d", i - 1);
+			p = 0;
+		}
+	}
+	return s - page;
+}
+
+static int cpuset_sprintf_cpulist(char *page, cpumask_t mask)
+{
+	return cpuset_sprintf_list(page, cpus_addr(mask), NR_CPUS);
+}
+
+static int cpuset_sprintf_memlist(char *page, nodemask_t mask)
+{
+	return cpuset_sprintf_list(page, nodes_addr(mask), MAX_NUMNODES);
+}
+
+static ssize_t cpuset_common_file_read(struct file *file, char __user *buf,
+				size_t nbytes, loff_t *ppos)
+{
+	struct cftype *cft = __d_cft(file->f_dentry);
+	struct cpuset *cs = __d_cs(file->f_dentry->d_parent);
+	cpuset_filetype_t type = cft->private;
+	char *page;
+	ssize_t retval = 0;
+	char *s;
+	char *start;
+	size_t n;
+
+	if (!(page = (char *)__get_free_page(GFP_KERNEL)))
+		return -ENOMEM;
+
+	s = page;
+
+	switch (type) {
+	default:
+		retval = -EINVAL;
+		goto out;
+	case FILE_CPULIST:
+		s += cpuset_sprintf_cpulist(s, cs->cpus_allowed);
+		break;
+	case FILE_MEMLIST:
+		s += cpuset_sprintf_memlist(s, cs->mems_allowed);
+		break;
+	case FILE_CPUSTRICT:
+		*s++ = is_cpustrict(cs) ? '1' : '0';
+		break;
+	case FILE_MEMSTRICT:
+		*s++ = is_memstrict(cs) ? '1' : '0';
+		break;
+	case FILE_AUTOCLEAN:
+		*s++ = is_autoclean(cs) ? '1' : '0';
+		break;
+	}
+	*s++ = '\n';
+	*s = '\0';
+
+	if (*ppos >= s - page) {
+		retval = 0;
+		goto out;
+	}
+	start = page + *ppos;
+	n = min_t(size_t, s - start, nbytes);
+	if (copy_to_user(buf, start, n)) {
+		retval = -EFAULT;
+		goto out;
+	}
+	retval = n;
+	*ppos += n;
+out:
+	free_page((unsigned long)page);
+	return retval;
+}
+
+static ssize_t cpuset_file_read(struct file *file, char *buf, size_t nbytes,
+								loff_t *ppos)
+{
+	ssize_t retval = 0;
+	struct cftype *cft = __d_cft(file->f_dentry);
+	if (!cft)
+		return -ENODEV;
+
+	/* special function? */
+	if (cft->read)
+		retval = cft->read(file, buf, nbytes, ppos);
+	else
+		retval = cpuset_common_file_read(file, buf, nbytes, ppos);
+
+	return retval;
+}
+
+static int cpuset_file_open(struct inode *inode, struct file *file)
+{
+	int err;
+	struct cftype *cft;
+
+	err = generic_file_open(inode, file);
+	if (err)
+		return err;
+
+	cft = __d_cft(file->f_dentry);
+	if (!cft)
+		return -ENODEV;
+	if (cft->open)
+		err = cft->open(inode, file);
+	else
+		err = 0;
+
+	return err;
+}
+
+static int cpuset_file_release(struct inode *inode, struct file *file)
+{
+	struct cftype *cft = __d_cft(file->f_dentry);
+	if (cft->release)
+		return cft->release(inode, file);
+	return 0;
+}
+
+static struct file_operations cpuset_file_operations = {
+	.read = cpuset_file_read,
+	.write = cpuset_file_write,
+	.llseek = generic_file_llseek,
+	.open = cpuset_file_open,
+	.release = cpuset_file_release,
+};
+
+static int cpuset_create_file(struct dentry *dentry, int mode)
+{
+	struct inode *inode;
+
+	if (!dentry)
+		return -ENOENT;
+	if (dentry->d_inode)
+		return -EEXIST;
+
+	inode = cpuset_new_inode(mode);
+	if (!inode)
+		return -ENOMEM;
+
+	if (S_ISDIR(mode)) {
+		inode->i_op = &cpuset_dir_inode_operations;
+		inode->i_fop = &simple_dir_operations;
+
+		/* start off with i_nlink == 2 (for "." entry) */
+		inode->i_nlink++;
+	} else if (S_ISREG(mode)) {
+		inode->i_size = PAGE_SIZE;
+		inode->i_fop = &cpuset_file_operations;
+	}
+
+	d_instantiate(dentry, inode);
+	dget(dentry);	/* Extra count - pin the dentry in core */
+	return 0;
+}
+
+static int create_dir(struct cpuset *cs, struct dentry *p, const char *n,
+						struct dentry **d, int mode)
+{
+	int error;
+
+	*d = cpuset_get_dentry(p, n);
+	if (!IS_ERR(*d)) {
+		error = cpuset_create_file(*d, S_IFDIR | mode);
+		if (!error) {
+			(*d)->d_fsdata = cs;
+			p->d_inode->i_nlink++;
+		}
+		dput(*d);
+	} else
+		error = PTR_ERR(*d);
+	return error;
+}
+
+/* MUST be called with dir->d_inode->i_sem held */
+
+static int cpuset_add_file(struct dentry *dir, const struct cftype *cft)
+{
+	struct dentry *dentry;
+	int error;
+
+	dentry = cpuset_get_dentry(dir, cft->name);
+	if (!IS_ERR(dentry)) {
+		error = cpuset_create_file(dentry, 0644 | S_IFREG);
+		if (!error)
+			dentry->d_fsdata = (void *)cft;
+		dput(dentry);
+	} else
+		error = PTR_ERR(dentry);
+	return error;
+}
+
+/*
+ * Stuff for reading the 'tasks' file.
+ *
+ * Reading this file can return large amounts of data if a cpuset has
+ * *lots* of attached tasks.  It may take several calls to read(), but
+ * the information is only consistent if generated in one atomic pass,
+ * so it is buffered on the first read.
+ *
+ * Upon first file read(), a struct ctr_struct is allocated, that
+ * will have a pointer to an array (also allocated here).  The struct
+ * ctr_struct * is stored in file->private_data.  Its resources will
+ * be freed by release() when the file is closed.  The array is used
+ * to sprintf the PIDs and then used by read().
+ */
+
+/* cpusets_tasks_read array */
+
+struct ctr_struct {
+	int *array;
+	int count;
+};
+
+static struct ctr_struct *cpuset_tasks_mkctr(struct file *file)
+{
+	struct cpuset *cs = __d_cs(file->f_dentry->d_parent);
+	struct ctr_struct *ctr;
+	pid_t *array;
+	int n, max;
+	pid_t i, j, last;
+	struct task_struct *g, *p;
+
+	ctr = kmalloc(sizeof(*ctr), GFP_KERNEL);
+	if (!ctr)
+		return NULL;
+
+	/*
+	 * If cpuset gets more users after we read count, we won't have
+	 * enough space - tough.  This race is indistinguishable to the
+	 * caller from the case that the additional cpuset users didn't
+	 * show up until sometime later on.
+	 */
+
+	max = atomic_read(&cs->count);
+	array = kmalloc(max * sizeof(pid_t), GFP_KERNEL);
+	if (!array) {
+		kfree(ctr);
+		return NULL;
+	}
+
+	n = 0;
+	read_lock(&tasklist_lock);
+	do_each_thread(g, p) {
+		if (p->cpuset == cs) {
+			array[n++] = p->pid;
+			if (unlikely(n == max))
+				goto array_full;
+		}
+	}
+	while_each_thread(g, p);
+array_full:
+	read_unlock(&tasklist_lock);
+
+	/* stupid bubble sort */
+	for (i = 0; i < n - 1; i++) {
+		for (j = 0; j < n - 1 - i; j++)
+			if (array[j + 1] < array[j]) {
+				pid_t tmp = array[j];
+				array[j] = array[j + 1];
+				array[j + 1] = tmp;
+			}
+	}
+
+	/*
+	 * Collapse the sorted array by grouping runs of consecutive pids.
+	 * A range is encoded by negating its second (final) pid.
+	 * Read from array[i]; write to array[j]; j <= i always.
+	 */
+	j = -1;
+	if (n > 0) {
+		last = array[0] - 2;	/* any value != array[0] - 1 */
+		for (i = 0; i < n; i++) {
+			pid_t curr = array[i];
+			/* consecutive pids ? */
+			if (curr - last == 1) {
+				/* move destination index if not already done */
+				if (array[j] > 0)
+					j++;
+				array[j] = -curr;
+			} else
+				array[++j] = curr;
+			last = curr;
+		}
+	}
+
+	ctr->array = array;
+	ctr->count = j + 1;
+	file->private_data = (void *)ctr;
+	return ctr;
+}
+
+/*
+ * Print one pid from the array.  The formatting depends on whether the
+ * pid is positive or negative (negative marks the end of a range), and
+ * on whether it is the first or the last record.
+ */
+static int array_pid_sprintf(char *buf, pid_t *array, int idx, int last)
+{
+	pid_t v = array[idx];
+	int l = 0;
+
+	if (v < 0) {		/* second pid of a range of pids */
+		v = -v;
+		buf[l++] = '-';
+	} else {		/* first pid of a range, or not a range */
+		if (idx)	/* comma only if it's not the first */
+			buf[l++] = ',';
+	}
+	l += sprintf(buf + l, "%d", v);
+	/* newline after last record */
+	if (idx == last)
+		l += sprintf(buf + l, "\n");
+	return l;
+}
+
+static ssize_t cpuset_tasks_read(struct file *file, char __user *buf,
+						size_t nbytes, loff_t *ppos)
+{
+	struct ctr_struct *ctr = (struct ctr_struct *)file->private_data;
+	int *array, nr_pids, i;
+	size_t len, lastlen = 0;
+	char *page;
+
+	/* allocate buffer and fill it on first call to read() */
+	if (!ctr) {
+		ctr = cpuset_tasks_mkctr(file);
+		if (!ctr)
+			return -ENOMEM;
+	}
+
+	array = ctr->array;
+	nr_pids = ctr->count;
+
+	if (!(page = (char *)__get_free_page(GFP_KERNEL)))
+		return -ENOMEM;
+
+	i = *ppos;		/* index of pid being printed */
+	len = 0;		/* length of data sprintf'ed in the page */
+
+	while ((len < PAGE_SIZE - 10) && (i < nr_pids) && (len < nbytes)) {
+		lastlen = array_pid_sprintf(page + len, array, i++, nr_pids - 1);
+		len += lastlen;
+	}
+
+	/* if we wrote too much, remove last record */
+	if (len > nbytes) {
+		len -= lastlen;
+		i--;
+	}
+
+	*ppos = i;
+
+	if (copy_to_user(buf, page, len))
+		len = -EFAULT;
+	free_page((unsigned long)page);
+	return len;
+}
+
+static int cpuset_tasks_release(struct inode *unused_inode, struct file *file)
+{
+	struct ctr_struct *ctr;
+
+	/* nothing was allocated if the file was not open for reading */
+	if (!(file->f_mode & FMODE_READ))
+		return 0;
+
+	/* the buffer is only allocated on the first read() */
+	ctr = (struct ctr_struct *)file->private_data;
+	if (!ctr)
+		return 0;
+	kfree(ctr->array);
+	kfree(ctr);
+	return 0;
+}
+
+/*
+ *	cpuset_create - create a cpuset
+ *	parent:	cpuset that will be parent of the new cpuset.
+ *	name:		name of the new cpuset. Will be strcpy'ed.
+ *	mode:		mode to set on new inode
+ *
+ *	Must be called with the semaphore on the parent inode held
+ */
+
+static long cpuset_create(struct cpuset *parent, const char *name, int mode)
+{
+	struct cpuset *cs;
+	int err;
+
+	cs = kmalloc(sizeof(*cs), GFP_KERNEL);
+	if (!cs)
+		return -ENOMEM;
+
+	use_cpuset(parent);	/* child needs parent to live */
+	down(&cpuset_sem);
+	cs->flags = 0;
+	cs->cpus_allowed = parent->cpus_allowed;
+	cs->mems_allowed = parent->mems_allowed;
+	atomic_set(&cs->count, 0);
+	INIT_LIST_HEAD(&cs->sibling);
+	INIT_LIST_HEAD(&cs->children);
+
+	cs->parent = parent;
+
+	list_add(&cs->sibling, &cs->parent->children);
+
+	err = cpuset_create_dir(cs, name, mode);
+	if (err < 0)
+		goto err;
+	err = cpuset_populate_dir(cs->dentry);
+	up(&cpuset_sem);
+	return err;
+err:
+	list_del(&cs->sibling);
+	up(&cpuset_sem);
+	release_cpuset_unlocked(parent);
+	kfree(cs);
+	return err;
+}
+
+/*
+ *	cpuset_destroy - destroy a cpuset
+ *	cs:	cpuset to be destroyed
+ *
+ *	Calls cpuset_remove().  The memory is only freed when the
+ *	cpuset's dentries are dropped.
+ *
+ *	This is how cpuset removal actually occurs:
+ *	cpuset_destroy -> cpuset_remove -> list_del, removed = 1, ...
+ *	... later (depending on filesystem use) ...
+ *	dput(dentry) -> iput(inode) -> kfree().
+ *
+ *	cpuset_destroy MUST be called with:
+ *		* use count at least 1
+ *		* semaphores on the inode and the parent's inode held.
+ *
+ *	Return: zero on success.
+ */
+
+static int cpuset_destroy(struct cpuset *cs)
+{
+	struct cpuset *parent = cs->parent;
+
+	down(&cpuset_sem);
+	spin_lock(&cs->dentry->d_lock);
+	/* whoever called us incremented count, so it is at least 1 */
+	if (atomic_read(&cs->count) > 1) {
+		spin_unlock(&cs->dentry->d_lock);
+		up(&cpuset_sem);
+		return -EBUSY;
+	}
+	/* everything OK, now we pull the trigger */
+
+	/* make this cpuset unusable, remove it from the lists */
+	cpuset_remove(cs); /* also unlocks &cs->dentry->d_lock */
+
+	up(&cpuset_sem);
+
+	/*
+	 * If the parent also has to be deleted (autoclean), we already hold
+	 * the inode semaphore => hence the call to release_cpuset_locked.
+	 */
+	release_cpuset_locked(parent);
+	return 0;
+}
+
+/*
+ * for the common functions, 'private' gives the type of file
+ */
+
+static struct cftype cft_tasks = {
+	.name = "tasks",
+	.read = cpuset_tasks_read,
+	.release = cpuset_tasks_release,
+	.private = FILE_TASKLIST,
+};
+
+static struct cftype cft_cpus = {
+	.name = "cpus",
+	.private = FILE_CPULIST,
+};
+
+static struct cftype cft_mems = {
+	.name = "mems",
+	.private = FILE_MEMLIST,
+};
+
+static struct cftype cft_cpustrict = {
+	.name = "cpustrict",
+	.private = FILE_CPUSTRICT,
+};
+
+static struct cftype cft_memstrict = {
+	.name = "memstrict",
+	.private = FILE_MEMSTRICT,
+};
+
+static struct cftype cft_autoclean = {
+	.name = "autoclean",
+	.private = FILE_AUTOCLEAN,
+};
+
+/* MUST be called with ->d_inode->i_sem held */
+static int cpuset_populate_dir(struct dentry *cs_dentry)
+{
+	int err;
+
+	if ((err = cpuset_add_file(cs_dentry, &cft_cpus)) < 0)
+		return err;
+	if ((err = cpuset_add_file(cs_dentry, &cft_mems)) < 0)
+		return err;
+	if ((err = cpuset_add_file(cs_dentry, &cft_cpustrict)) < 0)
+		return err;
+	if ((err = cpuset_add_file(cs_dentry, &cft_memstrict)) < 0)
+		return err;
+	if ((err = cpuset_add_file(cs_dentry, &cft_autoclean)) < 0)
+		return err;
+	if ((err = cpuset_add_file(cs_dentry, &cft_tasks)) < 0)
+		return err;
+	return 0;
+}
+
+/**
+ * cpuset_init - initialize cpusets at system boot
+ *
+ * Description: Initialize top_cpuset and the cpuset internal file system,
+ **/
+
+int __init cpuset_init(void)
+{
+	int err;
+
+	top_cpuset.cpus_allowed = cpu_possible_map;
+	top_cpuset.mems_allowed = node_possible_map;
+
+	init_task.cpuset = &top_cpuset;
+	set_bit(CS_HAS_BEEN_ATTACHED, &init_task.cpuset->flags);
+
+	err = register_filesystem(&cpuset_fs_type);
+	if (err < 0)
+		goto out;
+	cpuset_mount = kern_mount(&cpuset_fs_type);
+	if (IS_ERR(cpuset_mount)) {
+		printk(KERN_ERR "cpuset: could not mount!\n");
+		err = PTR_ERR(cpuset_mount);
+		cpuset_mount = NULL;
+		goto out;
+	}
+	err = cpuset_create_dir(&top_cpuset, "top_cpuset", 0644);
+	if (err < 0)
+		goto out;
+	err = cpuset_populate_dir(top_cpuset.dentry);
+out:
+	return err;
+}
+
+/**
+ * cpuset_fork - attach newly forked task to its parent's cpuset.
+ * @tsk: pointer to task_struct of the newly forked task.
+ *
+ * Description: By default, on fork, a task inherits its
+ * parent's cpuset.  The pointer to the shared cpuset is
+ * automatically copied in fork.c by dup_task_struct().
+ * This cpuset_fork() routine need only increment the usage
+ * counter in that cpuset.
+ **/
+
+void cpuset_fork(struct task_struct *tsk)
+{
+	atomic_inc(&tsk->cpuset->count);
+}
+
+/**
+ * cpuset_exit - detach exiting task from cpuset
+ * @tsk: pointer to task_struct of exiting process
+ *
+ * Description: Detach @tsk from its cpuset and release
+ * that cpuset (which decrements its usage count and perhaps
+ * removes it).
+ **/
+
+void cpuset_exit(struct task_struct *tsk)
+{
+	struct cpuset *cs;
+
+	task_lock(tsk);
+	cs = tsk->cpuset;
+	tsk->cpuset = NULL;
+	task_unlock(tsk);
+
+	release_cpuset_unlocked(cs);
+}
+
+/**
+ * cpuset_cpus_allowed - return cpus_allowed mask from a task's cpuset.
+ * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
+ *
+ * Description: Returns the cpumask_t cpus_allowed of the cpuset
+ * attached to the specified @tsk.
+ **/
+
+const cpumask_t cpuset_cpus_allowed(const struct task_struct *tsk)
+{
+	cpumask_t mask;
+
+	task_lock((struct task_struct *)tsk);
+	if (tsk->cpuset)
+		mask = tsk->cpuset->cpus_allowed;
+	else
+		mask = CPU_MASK_ALL;
+	task_unlock((struct task_struct *)tsk);
+
+	return mask;
+}
+
+/**
+ * cpuset_mems_allowed - return mems_allowed mask from a task's cpuset.
+ * @tsk: pointer to task_struct from which to obtain cpuset->mems_allowed.
+ *
+ * Description: Returns the nodemask_t mems_allowed of the cpuset
+ * attached to the specified @tsk.
+ **/
+
+const nodemask_t cpuset_mems_allowed(const struct task_struct *tsk)
+{
+	nodemask_t mask;
+
+	task_lock((struct task_struct *)tsk);
+	if (tsk->cpuset)
+		mask = tsk->cpuset->mems_allowed;
+	else
+		mask = NODE_MASK_ALL;
+	task_unlock((struct task_struct *)tsk);
+
+	return mask;
+}
+
+/**
+ * proc_pid_cspath - print a task's cpuset path into a buffer
+ * @tsk: pointer to task_struct of task whose cpuset's path to print
+ * @buf: pointer to buffer into which to print path
+ * @buflen: length of @buf
+ *
+ * Description: Print the task's cpuset path (without the mountpoint)
+ * into the given buffer.  Used for /proc/<pid>/cpuset.
+ */
+
+int proc_pid_cspath(struct task_struct *tsk, char *buf, int buflen)
+{
+	char *endbuf = buf + buflen;
+	char *start = endbuf;
+	struct cpuset *cs, *bottomcs;
+	int count;
+
+	task_lock(tsk);
+	bottomcs = cs = tsk->cpuset;
+	use_cpuset(bottomcs);
+	task_unlock(tsk);
+	*--start = '\n';
+	for (;;) {
+		int l = cs->dentry->d_name.len;
+		start -= l;
+		if (start < buf)
+			goto toolong;
+		memcpy(start, cs->dentry->d_name.name, l);
+		cs = cs->parent;
+		if (!cs)
+			break;
+		if (--start < buf)
+			goto toolong;
+		*start = '/';
+	}
+	count = endbuf - start;
+	memmove(buf, start, count);
+	release_cpuset_unlocked(bottomcs);
+	return count;
+toolong:
+	release_cpuset_unlocked(bottomcs);
+	return -ENAMETOOLONG;
+}

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [patch 7/8] cpusets v3 - The few, small kernel hooks needed
  2004-06-29 11:21 [patch 1/8] cpusets v3 - Table of Contents Paul Jackson
                   ` (4 preceding siblings ...)
  2004-06-29 11:22 ` [patch 6/8] cpusets v3 - The main new files: cpuset.c, cpuset.h Paul Jackson
@ 2004-06-29 11:22 ` Paul Jackson
  2004-06-29 11:22 ` [patch 8/8] cpusets v3 - One more hook, for /proc/<pid>/cpuset Paul Jackson
  6 siblings, 0 replies; 8+ messages in thread
From: Paul Jackson @ 2004-06-29 11:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Jack Steiner, Jesse Barnes, Sylvain,
	Dan Higgins, Matthew Dobson, Andi Kleen, Paul Jackson, Simon

cpuset kernel hooks in init, exit, fork, sched_setaffinity,
Kconfig and sched.h.

These hooks establish and propagate cpusets, enforce their CPU
placement limitations on sched_setaffinity, and enforce their
Memory Node placement limitations on mbind and sys_set_mempolicy.

Index: 2.6.7-mm4/include/linux/sched.h
===================================================================
--- 2.6.7-mm4.orig/include/linux/sched.h	2004-06-29 03:54:46.000000000 -0700
+++ 2.6.7-mm4/include/linux/sched.h	2004-06-29 03:55:38.000000000 -0700
@@ -366,6 +366,7 @@ struct k_itimer {
 
 struct io_context;			/* See blkdev.h */
 void exit_io_context(void);
+struct cpuset;
 
 #define NGROUPS_SMALL		32
 #define NGROUPS_PER_BLOCK	((int)(PAGE_SIZE / sizeof(gid_t)))
@@ -535,6 +536,10 @@ struct task_struct {
   	struct mempolicy *mempolicy;
   	short il_next;		/* could be shared with used_math */
 #endif
+
+#ifdef CONFIG_CPUSETS
+	struct cpuset *cpuset;
+#endif
 };
 
 static inline pid_t process_group(struct task_struct *tsk)
Index: 2.6.7-mm4/init/Kconfig
===================================================================
--- 2.6.7-mm4.orig/init/Kconfig	2004-06-29 03:54:39.000000000 -0700
+++ 2.6.7-mm4/init/Kconfig	2004-06-29 03:55:38.000000000 -0700
@@ -279,6 +279,16 @@ config EPOLL
 	  Disabling this option will cause the kernel to be built without
 	  support for epoll family of system calls.
 
+config CPUSETS
+	bool "Cpuset support"
+	help
+	  This option will let you create and manage cpusets, which
+	  allow dynamically partitioning a system into sets of CPUs and
+	  Memory Nodes and assigning tasks to run only within those sets.
+	  This is primarily useful on large SMP or NUMA systems.
+
+	  Say N if unsure.
+
 source "drivers/block/Kconfig.iosched"
 
 config CC_OPTIMIZE_FOR_SIZE
Index: 2.6.7-mm4/init/main.c
===================================================================
--- 2.6.7-mm4.orig/init/main.c	2004-06-29 03:54:43.000000000 -0700
+++ 2.6.7-mm4/init/main.c	2004-06-29 03:55:38.000000000 -0700
@@ -41,6 +41,7 @@
 #include <linux/writeback.h>
 #include <linux/cpu.h>
 #include <linux/efi.h>
+#include <linux/cpuset.h>
 #include <linux/unistd.h>
 #include <linux/rmap.h>
 #include <linux/mempolicy.h>
@@ -536,6 +537,8 @@ asmlinkage void __init start_kernel(void
 #ifdef CONFIG_PROC_FS
 	proc_root_init();
 #endif
+	cpuset_init();
+
 	check_bugs();
 
 	/* Do the rest non-__init'ed, we're now alive */
Index: 2.6.7-mm4/kernel/exit.c
===================================================================
--- 2.6.7-mm4.orig/kernel/exit.c	2004-06-29 03:54:39.000000000 -0700
+++ 2.6.7-mm4/kernel/exit.c	2004-06-29 03:55:38.000000000 -0700
@@ -28,6 +28,7 @@
 #include <asm/unistd.h>
 #include <asm/pgtable.h>
 #include <asm/mmu_context.h>
+#include <linux/cpuset.h>
 
 extern void sem_exit (void);
 extern struct task_struct *child_reaper;
@@ -800,6 +801,7 @@ asmlinkage NORET_TYPE void do_exit(long 
 	__exit_fs(tsk);
 	exit_namespace(tsk);
 	exit_thread();
+	cpuset_exit(tsk);
 #ifdef CONFIG_NUMA
 	mpol_free(tsk->mempolicy);
 #endif
Index: 2.6.7-mm4/kernel/fork.c
===================================================================
--- 2.6.7-mm4.orig/kernel/fork.c	2004-06-29 03:54:43.000000000 -0700
+++ 2.6.7-mm4/kernel/fork.c	2004-06-29 03:55:38.000000000 -0700
@@ -36,6 +36,7 @@
 #include <linux/mount.h>
 #include <linux/audit.h>
 #include <linux/rmap.h>
+#include <linux/cpuset.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1093,6 +1094,8 @@ struct task_struct *copy_process(unsigne
 	if (p->ptrace & PT_PTRACED)
 		__ptrace_link(p, current->parent);
 
+	cpuset_fork(p);
+
 	attach_pid(p, PIDTYPE_PID, p->pid);
 	if (thread_group_leader(p)) {
 		attach_pid(p, PIDTYPE_TGID, p->tgid);
Index: 2.6.7-mm4/kernel/sched.c
===================================================================
--- 2.6.7-mm4.orig/kernel/sched.c	2004-06-29 03:54:43.000000000 -0700
+++ 2.6.7-mm4/kernel/sched.c	2004-06-29 03:55:38.000000000 -0700
@@ -41,6 +41,7 @@
 #include <linux/percpu.h>
 #include <linux/perfctr.h>
 #include <linux/kthread.h>
+#include <linux/cpuset.h>
 #include <asm/tlb.h>
 
 #include <asm/unistd.h>
@@ -2908,7 +2909,7 @@ out_unlock:
 asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
 				      unsigned long __user *user_mask_ptr)
 {
-	cpumask_t new_mask;
+	cpumask_t new_mask, cpus_allowed;
 	int retval;
 	task_t *p;
 
@@ -2941,6 +2942,8 @@ asmlinkage long sys_sched_setaffinity(pi
 			!capable(CAP_SYS_NICE))
 		goto out_unlock;
 
+	cpus_allowed = cpuset_cpus_allowed(p);
+	cpus_and(new_mask, new_mask, cpus_allowed);
 	retval = set_cpus_allowed(p, new_mask);
 
 out_unlock:
@@ -3520,7 +3523,9 @@ static void migrate_all_tasks(int src_cp
 		if (dest_cpu == NR_CPUS)
 			dest_cpu = any_online_cpu(tsk->cpus_allowed);
 		if (dest_cpu == NR_CPUS) {
-			cpus_setall(tsk->cpus_allowed);
+			tsk->cpus_allowed = cpuset_cpus_allowed(tsk);
+			if (!cpus_intersects(tsk->cpus_allowed, cpu_online_map))
+				cpus_setall(tsk->cpus_allowed);
 			dest_cpu = any_online_cpu(tsk->cpus_allowed);
 
 			/* Don't tell them about moving exiting tasks
Index: 2.6.7-mm4/mm/mempolicy.c
===================================================================
--- 2.6.7-mm4.orig/mm/mempolicy.c	2004-06-29 03:55:33.000000000 -0700
+++ 2.6.7-mm4/mm/mempolicy.c	2004-06-29 03:55:38.000000000 -0700
@@ -67,6 +67,7 @@
 #include <linux/sched.h>
 #include <linux/mm.h>
 #include <linux/nodemask.h>
+#include <linux/cpuset.h>
 #include <linux/gfp.h>
 #include <linux/slab.h>
 #include <linux/string.h>
@@ -132,6 +133,7 @@ static int get_nodes(unsigned long *node
 	unsigned long k;
 	unsigned long nlongs;
 	unsigned long endmask;
+	nodemask_t mems_allowed;
 
 	--maxnode;
 	bitmap_zero(nodes, MAX_NUMNODES);
@@ -164,6 +166,9 @@ static int get_nodes(unsigned long *node
 	if (copy_from_user(nodes, nmask, nlongs*sizeof(unsigned long)))
 		return -EFAULT;
 	nodes[nlongs-1] &= endmask;
+	/* Ignore nodes not allowed in current cpuset */
+	mems_allowed = cpuset_mems_allowed(current);
+	bitmap_and(nodes, nodes, nodes_addr(mems_allowed), MAX_NUMNODES);
 	return mpol_check_policy(mode, nodes);
 }
 

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* [patch 8/8] cpusets v3 - One more hook, for /proc/<pid>/cpuset.
From: Paul Jackson @ 2004-06-29 11:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Jack Steiner, Jesse Barnes, Paul Jackson,
	Dan Higgins, Matthew Dobson, Andi Kleen, Sylvain, Simon

Add cpuset reporting to /proc: a per-task /proc/<pid>/cpuset file giving
the path of the cpuset each task is attached to, plus the top-level
/proc/cpusets directory.

Index: 2.6.7-mm1/fs/proc/base.c
===================================================================
--- 2.6.7-mm1.orig/fs/proc/base.c	2004-06-28 15:08:12.000000000 -0700
+++ 2.6.7-mm1/fs/proc/base.c	2004-06-28 15:10:43.000000000 -0700
@@ -32,6 +32,7 @@
 #include <linux/mount.h>
 #include <linux/security.h>
 #include <linux/ptrace.h>
+#include <linux/cpuset.h>
 
 /*
  * For hysterical raisins we keep the same inumbers as in the old procfs.
@@ -60,6 +61,9 @@ enum pid_directory_inos {
 	PROC_TGID_MAPS,
 	PROC_TGID_MOUNTS,
 	PROC_TGID_WCHAN,
+#ifdef CONFIG_CPUSETS
+	PROC_TGID_CPUSET,
+#endif
 #ifdef CONFIG_SECURITY
 	PROC_TGID_ATTR,
 	PROC_TGID_ATTR_CURRENT,
@@ -83,6 +87,9 @@ enum pid_directory_inos {
 	PROC_TID_MAPS,
 	PROC_TID_MOUNTS,
 	PROC_TID_WCHAN,
+#ifdef CONFIG_CPUSETS
+	PROC_TID_CPUSET,
+#endif
 #ifdef CONFIG_SECURITY
 	PROC_TID_ATTR,
 	PROC_TID_ATTR_CURRENT,
@@ -123,6 +130,9 @@ static struct pid_entry tgid_base_stuff[
 #ifdef CONFIG_KALLSYMS
 	E(PROC_TGID_WCHAN,     "wchan",   S_IFREG|S_IRUGO),
 #endif
+#ifdef CONFIG_CPUSETS
+	E(PROC_TGID_CPUSET,    "cpuset", S_IFREG|S_IRUGO),
+#endif
 	{0,0,NULL,0}
 };
 static struct pid_entry tid_base_stuff[] = {
@@ -145,6 +155,9 @@ static struct pid_entry tid_base_stuff[]
 #ifdef CONFIG_KALLSYMS
 	E(PROC_TID_WCHAN,      "wchan",   S_IFREG|S_IRUGO),
 #endif
+#ifdef CONFIG_CPUSETS
+	E(PROC_TID_CPUSET,     "cpuset", S_IFREG|S_IRUGO),
+#endif
 	{0,0,NULL,0}
 };
 
@@ -767,6 +780,14 @@ static struct inode_operations proc_pid_
 	.follow_link	= proc_pid_follow_link
 };
 
+
+#ifdef CONFIG_CPUSETS
+static int proc_pid_cpuset(struct task_struct *task, char *buffer)
+{
+	return proc_pid_cspath(task, buffer, PAGE_SIZE);
+}
+#endif /* CONFIG_CPUSETS */
+
 static int pid_alive(struct task_struct *p)
 {
 	BUG_ON(p->pids[PIDTYPE_PID].pidptr != &p->pids[PIDTYPE_PID].pid);
@@ -1375,6 +1396,13 @@ static struct dentry *proc_pident_lookup
 			ei->op.proc_read = proc_pid_wchan;
 			break;
 #endif
+#ifdef CONFIG_CPUSETS
+		case PROC_TID_CPUSET:
+		case PROC_TGID_CPUSET:
+			inode->i_fop = &proc_info_file_operations;
+			ei->op.proc_read = proc_pid_cpuset;
+			break;
+#endif
 		default:
 			printk("procfs: impossible type (%d)",p->type);
 			iput(inode);
Index: 2.6.7-mm1/fs/proc/root.c
===================================================================
--- 2.6.7-mm1.orig/fs/proc/root.c	2004-06-28 15:08:12.000000000 -0700
+++ 2.6.7-mm1/fs/proc/root.c	2004-06-28 15:10:43.000000000 -0700
@@ -75,6 +75,9 @@ void __init proc_root_init(void)
 	proc_device_tree_init();
 #endif
 	proc_bus = proc_mkdir("bus", 0);
+#ifdef CONFIG_CPUSETS
+	proc_mkdir("cpusets", 0);
+#endif
 }
 
 static struct dentry *proc_root_lookup(struct inode * dir, struct dentry * dentry, struct nameidata *nd)


