* [patch -mm 1/7] x86_64: configurable fake numa node sizes
@ 2007-03-01 17:12 David Rientjes
  2007-03-01 17:13 ` [patch -mm 2/7] x86_64: split remaining fake nodes equally David Rientjes
  0 siblings, 1 reply; 7+ messages in thread
From: David Rientjes @ 2007-03-01 17:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, linux-kernel

Extends the numa=fake x86_64 command-line option to allow for configurable
node sizes.  These nodes can be used in conjunction with cpusets for
coarse memory resource management.

The old command-line option is still supported:
  numa=fake=32	gives 32 fake NUMA nodes, ignoring the NUMA setup of the
		actual machine.

But now you may configure your system for the node sizes of your choice:
  numa=fake=2*512,1024,2*256
		gives two 512M nodes, one 1024M node, and two 256M nodes; the
		rest of system memory goes to a sixth node.

The existing hash function is maintained to support the various node sizes
that are possible with this implementation.

Each node of the same size receives roughly the same number of available
pages, regardless of any reserved memory within its address range.  The
total number of available pages on the system is calculated and divided by
the number of equal-sized nodes to allocate.  These nodes are then allocated
dynamically, and their borders are extended until each contains the required
number of available pages.
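
As a rough illustration of the equal-split case (the numbers below assume a
hypothetical machine with 4096M of RAM and 128M of e820 holes; they are not
taken from a real boot):

  numa=fake=4	leaves 3968M of available memory, or 992M per node.  Each
		node is rounded down to a multiple of FAKE_NODE_MIN_SIZE
		(64M), giving 960M, and the 4x32M left over is consolidated
		into two extra 64M granules, so two nodes end up with roughly
		1024M and two with roughly 960M of available memory.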

Configurable node sizes are recommended when used in conjunction with
cpusets for memory control because they eliminate the overhead associated
with scanning the zonelists of many smaller full nodes on page_alloc().
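
For reference, the following is a minimal userspace sketch (not the kernel
code itself, and omitting the 64M rounding and the actual node range setup)
of how a string such as "2*512,1024,2*256" decomposes into (count, size)
pairs; it follows the shape of the parsing loop added to numa_emulation()
below:

	#include <ctype.h>
	#include <stdio.h>

	/* Mock parser for a numa=fake style string. */
	int main(void)
	{
		const char *cmdline = "2*512,1024,2*256";
		unsigned int coeff = 1, num = 0;

		for (;; cmdline++) {
			if (*cmdline && isdigit(*cmdline)) {
				num = num * 10 + *cmdline - '0';
				continue;
			}
			if (*cmdline == '*')
				coeff = num;
			if (!*cmdline || *cmdline == ',') {
				/* the kernel also rounds num down to 64M here */
				if (num)
					printf("%u node(s) of %uMB\n", coeff, num);
				coeff = 1;
			}
			if (!*cmdline)
				break;
			num = 0;
		}
		return 0;
	}

Compiled and run, it prints "2 node(s) of 512MB", "1 node(s) of 1024MB" and
"2 node(s) of 256MB".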

Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/x86_64/boot-options.txt |    8 +-
 arch/x86_64/mm/numa.c                 |  255 +++++++++++++++++++--------------
 include/asm-x86_64/mmzone.h           |    2 +-
 3 files changed, 155 insertions(+), 110 deletions(-)

diff --git a/Documentation/x86_64/boot-options.txt b/Documentation/x86_64/boot-options.txt
--- a/Documentation/x86_64/boot-options.txt
+++ b/Documentation/x86_64/boot-options.txt
@@ -149,7 +149,13 @@ NUMA
 
   numa=noacpi   Don't parse the SRAT table for NUMA setup
 
-  numa=fake=X   Fake X nodes and ignore NUMA setup of the actual machine.
+  numa=fake=CMDLINE
+		If a number, fakes CMDLINE nodes and ignores NUMA setup of the
+		actual machine.  Otherwise, system memory is configured
+		depending on the sizes and coefficients listed.  For example:
+			numa=fake=2*512,1024,4*256
+		gives two 512M nodes, a 1024M node, and four 256M nodes.  The
+		remaining system RAM is allocated to an additional node.
 
   numa=hotadd=percent
 		Only allow hotadd memory to preallocate page structures upto
diff --git a/arch/x86_64/mm/numa.c b/arch/x86_64/mm/numa.c
--- a/arch/x86_64/mm/numa.c
+++ b/arch/x86_64/mm/numa.c
@@ -276,125 +276,166 @@ void __init numa_init_array(void)
 
 #ifdef CONFIG_NUMA_EMU
 /* Numa emulation */
-int numa_fake __initdata = 0;
+#define E820_ADDR_HOLE_SIZE(start, end)					\
+	(e820_hole_size((start) >> PAGE_SHIFT, (end) >> PAGE_SHIFT) <<	\
+	PAGE_SHIFT)
+char *cmdline __initdata;
 
 /*
- * This function is used to find out if the start and end correspond to
- * different zones.
+ * Setups up nid to range from addr to addr + size.  If the end boundary is
+ * greater than max_addr, then max_addr is used instead.  The return value is 0
+ * if there is additional memory left for allocation past addr and -1 otherwise.
+ * addr is adjusted to be at the end of the node.
  */
-int zone_cross_over(unsigned long start, unsigned long end)
+static int __init setup_node_range(int nid, struct bootnode *nodes, u64 *addr,
+				   u64 size, u64 max_addr)
 {
-	if ((start < (MAX_DMA32_PFN << PAGE_SHIFT)) &&
-			(end >= (MAX_DMA32_PFN << PAGE_SHIFT)))
-		return 1;
-	return 0;
+	int ret = 0;
+	nodes[nid].start = *addr;
+	*addr += size;
+	if (*addr >= max_addr) {
+		*addr = max_addr;
+		ret = -1;
+	}
+	nodes[nid].end = *addr;
+	node_set_online(nid);
+	printk(KERN_INFO "Faking node %d at %016Lx-%016Lx (%LuMB)\n", nid,
+	       nodes[nid].start, nodes[nid].end,
+	       (nodes[nid].end - nodes[nid].start) >> 20);
+	return ret;
 }
 
-static int __init numa_emulation(unsigned long start_pfn, unsigned long end_pfn)
+/*
+ * Splits num_nodes nodes up equally starting at node_start.  The return value
+ * is the number of nodes split up and addr is adjusted to be at the end of the
+ * last node allocated.
+ */
+static int __init split_nodes_equally(struct bootnode *nodes, u64 *addr,
+				      u64 max_addr, int node_start,
+				      int num_nodes)
 {
- 	int i, big;
- 	struct bootnode nodes[MAX_NUMNODES];
- 	unsigned long sz, old_sz;
-	unsigned long hole_size;
-	unsigned long start, end;
-	unsigned long max_addr = (end_pfn << PAGE_SHIFT);
-
-	start = (start_pfn << PAGE_SHIFT);
-	hole_size = e820_hole_size(start, max_addr);
-	sz = (max_addr - start - hole_size) / numa_fake;
-
- 	/* Kludge needed for the hash function */
-
-	old_sz = sz;
-	/*
-	 * Round down to the nearest FAKE_NODE_MIN_SIZE.
-	 */
-	sz &= FAKE_NODE_MIN_HASH_MASK;
+	unsigned int big;
+	u64 size;
+	int i;
 
+	if (num_nodes <= 0)
+		return -1;
+	if (num_nodes > MAX_NUMNODES)
+		num_nodes = MAX_NUMNODES;
+	size = (max_addr - *addr - E820_ADDR_HOLE_SIZE(*addr, max_addr)) /
+	       num_nodes;
 	/*
-	 * We ensure that each node is at least 64MB big.  Smaller than this
-	 * size can cause VM hiccups.
+	 * Calculate the number of big nodes that can be allocated as a result
+	 * of consolidating the leftovers.
 	 */
-	if (sz == 0) {
-		printk(KERN_INFO "Not enough memory for %d nodes.  Reducing "
-				"the number of nodes\n", numa_fake);
-		numa_fake = (max_addr - start - hole_size) / FAKE_NODE_MIN_SIZE;
-		printk(KERN_INFO "Number of fake nodes will be = %d\n",
-				numa_fake);
-		sz = FAKE_NODE_MIN_SIZE;
+	big = ((size & ~FAKE_NODE_MIN_HASH_MASK) * num_nodes) /
+	      FAKE_NODE_MIN_SIZE;
+
+	/* Round down to nearest FAKE_NODE_MIN_SIZE. */
+	size &= FAKE_NODE_MIN_HASH_MASK;
+	if (!size) {
+		printk(KERN_ERR "Not enough memory for each node.  "
+		       "NUMA emulation disabled.\n");
+		return -1;
 	}
-	/*
-	 * Find out how many nodes can get an extra NODE_MIN_SIZE granule.
-	 * This logic ensures the extra memory gets distributed among as many
-	 * nodes as possible (as compared to one single node getting all that
-	 * extra memory.
-	 */
-	big = ((old_sz - sz) * numa_fake) / FAKE_NODE_MIN_SIZE;
-	printk(KERN_INFO "Fake node Size: %luMB hole_size: %luMB big nodes: "
-			"%d\n",
-			(sz >> 20), (hole_size >> 20), big);
- 	memset(&nodes,0,sizeof(nodes));
-	end = start;
- 	for (i = 0; i < numa_fake; i++) {
-		/*
-		 * In case we are not able to allocate enough memory for all
-		 * the nodes, we reduce the number of fake nodes.
-		 */
-		if (end >= max_addr) {
-			numa_fake = i - 1;
-			break;
-		}
- 		start = nodes[i].start = end;
-		/*
-		 * Final node can have all the remaining memory.
-		 */
- 		if (i == numa_fake-1)
- 			sz = max_addr - start;
- 		end = nodes[i].start + sz;
-		/*
-		 * Fir "big" number of nodes get extra granule.
-		 */
+
+	for (i = node_start; i < num_nodes + node_start; i++) {
+		u64 end = *addr + size;
 		if (i < big)
 			end += FAKE_NODE_MIN_SIZE;
 		/*
-		 * Iterate over the range to ensure that this node gets at
-		 * least sz amount of RAM (excluding holes)
+		 * The final node can have the remaining system RAM.  Other
+		 * nodes receive roughly the same amount of available pages.
 		 */
-		while ((end - start - e820_hole_size(start, end)) < sz) {
-			end += FAKE_NODE_MIN_SIZE;
-			if (end >= max_addr)
-				break;
+		if (i == num_nodes + node_start - 1)
+			end = max_addr;
+		else
+			while (end - *addr - E820_ADDR_HOLE_SIZE(*addr, end) <
+			       size) {
+				end += FAKE_NODE_MIN_SIZE;
+				if (end > max_addr) {
+					end = max_addr;
+					break;
+				}
+			}
+		if (setup_node_range(i, nodes, addr, end - *addr, max_addr) < 0)
+			break;
+	}
+	return i - node_start + 1;
+}
+
+/*
+ * Sets up the system RAM area from start_pfn to end_pfn according to the
+ * numa=fake command-line option.
+ */
+static int __init numa_emulation(unsigned long start_pfn, unsigned long end_pfn)
+{
+	struct bootnode nodes[MAX_NUMNODES];
+	u64 addr = start_pfn << PAGE_SHIFT;
+	u64 max_addr = end_pfn << PAGE_SHIFT;
+	unsigned int coeff;
+	unsigned int num = 0;
+	int num_nodes = 0;
+	u64 size;
+	int i;
+
+	memset(&nodes, 0, sizeof(nodes));
+	/*
+	 * If the numa=fake command-line is just a single number N, split the
+	 * system RAM into N fake nodes.
+	 */
+	if (!strchr(cmdline, '*') && !strchr(cmdline, ',')) {
+		num_nodes = split_nodes_equally(nodes, &addr, max_addr, 0,
+						simple_strtol(cmdline, NULL, 0));
+		if (num_nodes < 0)
+			return num_nodes;
+		goto out;
+	}
+
+	/* Parse the command line. */
+	for (coeff = 1; ; cmdline++) {
+		if (*cmdline && isdigit(*cmdline)) {
+			num = num * 10 + *cmdline - '0';
+			continue;
 		}
-		/*
-		 * Look at the next node to make sure there is some real memory
-		 * to map.  Bad things happen when the only memory present
-		 * in a zone on a fake node is IO hole.
-		 */
-		while (e820_hole_size(end, end + FAKE_NODE_MIN_SIZE) > 0) {
-			if (zone_cross_over(start, end + sz)) {
-				end = (MAX_DMA32_PFN << PAGE_SHIFT);
-				break;
+		if (*cmdline == '*')
+			coeff = num;
+		if (!*cmdline || *cmdline == ',') {
+			/*
+			 * Round down to the nearest FAKE_NODE_MIN_SIZE.
+			 * Command-line coefficients are in megabytes.
+			 */
+			size = ((u64)num << 20) & FAKE_NODE_MIN_HASH_MASK;
+			if (size) {
+				for (i = 0; i < coeff; i++, num_nodes++)
+					if (setup_node_range(num_nodes, nodes,
+						&addr, size, max_addr) < 0)
+						goto done;
+				coeff = 1;
 			}
-			if (end >= max_addr)
-				break;
-			end += FAKE_NODE_MIN_SIZE;
 		}
-		if (end > max_addr)
-			end = max_addr;
-		nodes[i].end = end;
- 		printk(KERN_INFO "Faking node %d at %016Lx-%016Lx (%LuMB)\n",
- 		       i,
- 		       nodes[i].start, nodes[i].end,
- 		       (nodes[i].end - nodes[i].start) >> 20);
-		node_set_online(i);
- 	}
- 	memnode_shift = compute_hash_shift(nodes, numa_fake);
- 	if (memnode_shift < 0) {
- 		memnode_shift = 0;
- 		printk(KERN_ERR "No NUMA hash function found. Emulation disabled.\n");
- 		return -1;
- 	}
- 	for_each_online_node(i) {
+		if (!*cmdline)
+			break;
+		num = 0;
+	}
+done:
+	if (!num_nodes)
+		return -1;
+	/* Fill remainder of system RAM with a final node, if appropriate. */
+	if (addr < max_addr) {
+		setup_node_range(num_nodes, nodes, &addr, max_addr - addr,
+				 max_addr);
+		num_nodes++;
+	}
+out:
+	memnode_shift = compute_hash_shift(nodes, num_nodes);
+	if (memnode_shift < 0) {
+		memnode_shift = 0;
+		printk(KERN_ERR "No NUMA hash function found.  NUMA emulation "
+		       "disabled.\n");
+		return -1;
+	}
+	for_each_online_node(i) {
 		e820_register_active_regions(i, nodes[i].start >> PAGE_SHIFT,
 						nodes[i].end >> PAGE_SHIFT);
  		setup_node_bootmem(i, nodes[i].start, nodes[i].end);
@@ -402,14 +443,15 @@ static int __init numa_emulation(unsigned long start_pfn, unsigned long end_pfn)
  	numa_init_array();
  	return 0;
 }
-#endif
+#undef E820_ADDR_HOLE_SIZE
+#endif /* CONFIG_NUMA_EMU */
 
 void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
 { 
 	int i;
 
 #ifdef CONFIG_NUMA_EMU
-	if (numa_fake && !numa_emulation(start_pfn, end_pfn))
+	if (cmdline && !numa_emulation(start_pfn, end_pfn))
  		return;
 #endif
 
@@ -489,11 +531,8 @@ static __init int numa_setup(char *opt)
 	if (!strncmp(opt,"off",3))
 		numa_off = 1;
 #ifdef CONFIG_NUMA_EMU
-	if(!strncmp(opt, "fake=", 5)) {
-		numa_fake = simple_strtoul(opt+5,NULL,0); ;
-		if (numa_fake >= MAX_NUMNODES)
-			numa_fake = MAX_NUMNODES;
-	}
+	if (!strncmp(opt, "fake=", 5))
+		cmdline = opt + 5;
 #endif
 #ifdef CONFIG_ACPI_NUMA
  	if (!strncmp(opt,"noacpi",6))
diff --git a/include/asm-x86_64/mmzone.h b/include/asm-x86_64/mmzone.h
--- a/include/asm-x86_64/mmzone.h
+++ b/include/asm-x86_64/mmzone.h
@@ -49,7 +49,7 @@ extern int pfn_valid(unsigned long pfn);
 
 #ifdef CONFIG_NUMA_EMU
 #define FAKE_NODE_MIN_SIZE	(64*1024*1024)
-#define FAKE_NODE_MIN_HASH_MASK	(~(FAKE_NODE_MIN_SIZE - 1ul))
+#define FAKE_NODE_MIN_HASH_MASK	(~(FAKE_NODE_MIN_SIZE - 1uL))
 #endif
 
 #endif

* [patch -mm 2/7] x86_64: split remaining fake nodes equally
  2007-03-01 17:12 [patch -mm 1/7] x86_64: configurable fake numa node sizes David Rientjes
@ 2007-03-01 17:13 ` David Rientjes
  2007-03-01 17:13   ` [patch -mm 3/7] x86_64: fixed size remaining fake nodes David Rientjes
  0 siblings, 1 reply; 7+ messages in thread
From: David Rientjes @ 2007-03-01 17:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, linux-kernel

Extends the numa=fake x86_64 command-line option to split the remaining
system memory into equal-sized nodes.

For example:
numa=fake=2*512,4*	gives two 512M nodes and the remaining system
			memory is split into four approximately equal
			chunks.

This is beneficial for systems where the exact size of RAM is unknown or
not necessarily relevant, but the granularity with which nodes shall be
allocated is known.
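
As a hypothetical illustration (assuming roughly 4G of usable RAM and
ignoring e820 holes):

  numa=fake=2*512,4*	gives two 512M nodes, and the remaining ~3G is split
			into four nodes of roughly 768M each.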

Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/x86_64/boot-options.txt |    4 +++-
 arch/x86_64/mm/numa.c                 |   22 ++++++++++++++++++----
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/Documentation/x86_64/boot-options.txt b/Documentation/x86_64/boot-options.txt
--- a/Documentation/x86_64/boot-options.txt
+++ b/Documentation/x86_64/boot-options.txt
@@ -155,7 +155,9 @@ NUMA
 		depending on the sizes and coefficients listed.  For example:
 			numa=fake=2*512,1024,4*256
 		gives two 512M nodes, a 1024M node, and four 256M nodes.  The
-		remaining system RAM is allocated to an additional node.
+		remaining system RAM is allocated to an additional node.  If
+		the last character of CMDLINE is a *, the remaining system RAM
+		is instead divided up equally among its coefficient.
 
   numa=hotadd=percent
 		Only allow hotadd memory to preallocate page structures upto
diff --git a/arch/x86_64/mm/numa.c b/arch/x86_64/mm/numa.c
--- a/arch/x86_64/mm/numa.c
+++ b/arch/x86_64/mm/numa.c
@@ -421,11 +421,25 @@ static int __init numa_emulation(unsigned long start_pfn, unsigned long end_pfn)
 done:
 	if (!num_nodes)
 		return -1;
-	/* Fill remainder of system RAM with a final node, if appropriate. */
+	/* Fill remainder of system RAM, if appropriate. */
 	if (addr < max_addr) {
-		setup_node_range(num_nodes, nodes, &addr, max_addr - addr,
-				 max_addr);
-		num_nodes++;
+		switch (*(cmdline - 1)) {
+		case '*':
+			/* Split remaining nodes into coeff chunks */
+			if (coeff <= 0)
+				break;
+			num_nodes += split_nodes_equally(nodes, &addr, max_addr,
+							 num_nodes, coeff);
+			break;
+		case ',':
+			/* Do not allocate remaining system RAM */
+			break;
+		default:
+			/* Give one final node */
+			setup_node_range(num_nodes, nodes, &addr,
+					 max_addr - addr, max_addr);
+			num_nodes++;
+		}
 	}
 out:
 	memnode_shift = compute_hash_shift(nodes, num_nodes);

* [patch -mm 3/7] x86_64: fixed size remaining fake nodes
  2007-03-01 17:13 ` [patch -mm 2/7] x86_64: split remaining fake nodes equally David Rientjes
@ 2007-03-01 17:13   ` David Rientjes
  2007-03-01 17:13     ` [patch -mm 4/7] x86_64: map fake nodes to real nodes David Rientjes
  0 siblings, 1 reply; 7+ messages in thread
From: David Rientjes @ 2007-03-01 17:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, linux-kernel

Extends the numa=fake x86_64 command-line option to split the remaining
system memory into nodes of fixed size.  Any leftover memory is allocated
to a final node unless the command-line ends with a comma.

For example:
  numa=fake=2*512,*128	gives two 512M nodes and the remaining system
			memory is split into nodes of 128M each.

This is beneficial for systems where the exact size of RAM is unknown or
not necessarily relevant, but the size of the remaining nodes to be
allocated is known, based on the capacity each must provide for resource
management.
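
As a hypothetical illustration (assuming roughly 2G of usable RAM and
ignoring e820 holes):

  numa=fake=2*512,*128	gives two 512M nodes and roughly eight 128M nodes;
			the final node absorbs whatever remainder is left and
			so may differ in size from the others.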

Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/x86_64/boot-options.txt |   14 ++++++---
 arch/x86_64/mm/numa.c                 |   47 ++++++++++++++++++++++++++-------
 2 files changed, 46 insertions(+), 15 deletions(-)

diff --git a/Documentation/x86_64/boot-options.txt b/Documentation/x86_64/boot-options.txt
--- a/Documentation/x86_64/boot-options.txt
+++ b/Documentation/x86_64/boot-options.txt
@@ -153,11 +153,15 @@ NUMA
 		If a number, fakes CMDLINE nodes and ignores NUMA setup of the
 		actual machine.  Otherwise, system memory is configured
 		depending on the sizes and coefficients listed.  For example:
-			numa=fake=2*512,1024,4*256
-		gives two 512M nodes, a 1024M node, and four 256M nodes.  The
-		remaining system RAM is allocated to an additional node.  If
-		the last character of CMDLINE is a *, the remaining system RAM
-		is instead divided up equally among its coefficient.
+			numa=fake=2*512,1024,4*256,*128
+		gives two 512M nodes, a 1024M node, four 256M nodes, and the
+		rest split into 128M chunks.  If the last character of CMDLINE
+		is a *, the remaining memory is divided up equally among its
+		coefficient:
+			numa=fake=2*512,2*
+		gives two 512M nodes and the rest split into two nodes.
+		Otherwise, the remaining system RAM is allocated to an
+		additional node.
 
   numa=hotadd=percent
 		Only allow hotadd memory to preallocate page structures upto
diff --git a/arch/x86_64/mm/numa.c b/arch/x86_64/mm/numa.c
--- a/arch/x86_64/mm/numa.c
+++ b/arch/x86_64/mm/numa.c
@@ -365,6 +365,21 @@ static int __init split_nodes_equally(struct bootnode *nodes, u64 *addr,
 }
 
 /*
+ * Splits the remaining system RAM into chunks of size.  The remaining memory is
+ * always assigned to a final node and can be asymmetric.  Returns the number of
+ * nodes split.
+ */
+static int __init split_nodes_by_size(struct bootnode *nodes, u64 *addr,
+				      u64 max_addr, int node_start, u64 size)
+{
+	int i = node_start;
+	size = (size << 20) & FAKE_NODE_MIN_HASH_MASK;
+	while (!setup_node_range(i++, nodes, addr, size, max_addr))
+		;
+	return i - node_start;
+}
+
+/*
  * Sets up the system RAM area from start_pfn to end_pfn according to the
  * numa=fake command-line option.
  */
@@ -373,9 +388,10 @@ static int __init numa_emulation(unsigned long start_pfn, unsigned long end_pfn)
 	struct bootnode nodes[MAX_NUMNODES];
 	u64 addr = start_pfn << PAGE_SHIFT;
 	u64 max_addr = end_pfn << PAGE_SHIFT;
-	unsigned int coeff;
-	unsigned int num = 0;
 	int num_nodes = 0;
+	int coeff_flag;
+	int coeff = -1;
+	int num = 0;
 	u64 size;
 	int i;
 
@@ -393,29 +409,34 @@ static int __init numa_emulation(unsigned long start_pfn, unsigned long end_pfn)
 	}
 
 	/* Parse the command line. */
-	for (coeff = 1; ; cmdline++) {
+	for (coeff_flag = 0; ; cmdline++) {
 		if (*cmdline && isdigit(*cmdline)) {
 			num = num * 10 + *cmdline - '0';
 			continue;
 		}
-		if (*cmdline == '*')
-			coeff = num;
+		if (*cmdline == '*') {
+			if (num > 0)
+				coeff = num;
+			coeff_flag = 1;
+		}
 		if (!*cmdline || *cmdline == ',') {
+			if (!coeff_flag)
+				coeff = 1;
 			/*
 			 * Round down to the nearest FAKE_NODE_MIN_SIZE.
 			 * Command-line coefficients are in megabytes.
 			 */
 			size = ((u64)num << 20) & FAKE_NODE_MIN_HASH_MASK;
-			if (size) {
+			if (size)
 				for (i = 0; i < coeff; i++, num_nodes++)
 					if (setup_node_range(num_nodes, nodes,
 						&addr, size, max_addr) < 0)
 						goto done;
-				coeff = 1;
-			}
+			if (!*cmdline)
+				break;
+			coeff_flag = 0;
+			coeff = -1;
 		}
-		if (!*cmdline)
-			break;
 		num = 0;
 	}
 done:
@@ -423,6 +444,12 @@ done:
 		return -1;
 	/* Fill remainder of system RAM, if appropriate. */
 	if (addr < max_addr) {
+		if (coeff_flag && coeff < 0) {
+			/* Split remaining nodes into num-sized chunks */
+			num_nodes += split_nodes_by_size(nodes, &addr, max_addr,
+							 num_nodes, num);
+			goto out;
+		}
 		switch (*(cmdline - 1)) {
 		case '*':
 			/* Split remaining nodes into coeff chunks */

* [patch -mm 4/7] x86_64: map fake nodes to real nodes
  2007-03-01 17:13   ` [patch -mm 3/7] x86_64: fixed size remaining fake nodes David Rientjes
@ 2007-03-01 17:13     ` David Rientjes
  2007-03-01 17:13       ` [patch -mm 5/7] x86_64: disable alien cache for fake numa David Rientjes
  0 siblings, 1 reply; 7+ messages in thread
From: David Rientjes @ 2007-03-01 17:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, Rohit Seth, linux-kernel

Exports the struct bootnode array globally so that the physical NUMA layout
can be saved when NUMA emulation is used.  That layout is copied and stored
for later reference so that get_phys_node() can map each fake node back to
the real node it resides on.

physical_node_map is a new struct bootnode array that is used to save the
physical mapping in the emulation case.  There is no effect when
CONFIG_NUMA_EMU is disabled or numa=fake is not used.

The emulation case is now handled after the K8 and ACPI scans so that the
physical mapping they detect can be saved for later use.

__node_distance() is modified to measure distances between the physical
nodes that the given fake nodes reside on.
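
As a hypothetical example (the exact mapping depends on the memory layout),
booting with numa=fake=8 on a two-node machine holding 2G per physical node
would place fake nodes 0-3 on physical node 0 and fake nodes 4-7 on physical
node 1, so __node_distance(1, 2) reports a local distance while
__node_distance(1, 5) reports a remote one (10 and 20, respectively, when no
SLIT is provided).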

Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Rohit Seth <rohitseth@google.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 arch/x86_64/mm/k8topology.c   |   23 +++++---
 arch/x86_64/mm/numa.c         |  113 +++++++++++++++++++++++++++--------------
 arch/x86_64/mm/srat.c         |    9 +++-
 include/asm-x86_64/numa.h     |    4 +-
 include/asm-x86_64/proto.h    |    2 +-
 include/asm-x86_64/topology.h |    1 +
 6 files changed, 100 insertions(+), 52 deletions(-)

diff --git a/arch/x86_64/mm/k8topology.c b/arch/x86_64/mm/k8topology.c
--- a/arch/x86_64/mm/k8topology.c
+++ b/arch/x86_64/mm/k8topology.c
@@ -40,10 +40,9 @@ static __init int find_northbridge(void)
 	return -1; 	
 }
 
-int __init k8_scan_nodes(unsigned long start, unsigned long end)
+int __init k8_scan_nodes(unsigned long start, unsigned long end, int fake)
 { 
 	unsigned long prevbase;
-	struct bootnode nodes[8];
 	int nodeid, i, nb; 
 	unsigned char nodeids[8];
 	int found = 0;
@@ -161,19 +160,25 @@ int __init k8_scan_nodes(unsigned long start, unsigned long end)
 	if (!found)
 		return -1; 
 
-	memnode_shift = compute_hash_shift(nodes, 8);
-	if (memnode_shift < 0) { 
-		printk(KERN_ERR "No NUMA node hash function found. Contact maintainer\n"); 
-		return -1; 
-	} 
-	printk(KERN_INFO "Using node hash shift of %d\n", memnode_shift); 
+	if (!fake) {
+		memnode_shift = compute_hash_shift(8);
+		if (memnode_shift < 0) {
+			printk(KERN_ERR "No NUMA node hash function found. "
+					"Contact maintainer\n");
+			return -1;
+		}
+		printk(KERN_INFO "Using node hash shift of %d\n",
+		       memnode_shift);
+	}
 
 	for (i = 0; i < 8; i++) {
 		if (nodes[i].start != nodes[i].end) { 
 			nodeid = nodeids[i];
 			apicid_to_node[nodeid << dualcore] = i;
 			apicid_to_node[(nodeid << dualcore) + dualcore] = i;
-			setup_node_bootmem(i, nodes[i].start, nodes[i].end); 
+			if (!fake)
+				setup_node_bootmem(i, nodes[i].start,
+						   nodes[i].end);
 		} 
 	}
 
diff --git a/arch/x86_64/mm/numa.c b/arch/x86_64/mm/numa.c
--- a/arch/x86_64/mm/numa.c
+++ b/arch/x86_64/mm/numa.c
@@ -34,6 +34,7 @@ unsigned char apicid_to_node[MAX_LOCAL_APIC] __cpuinitdata = {
  	[0 ... MAX_LOCAL_APIC-1] = NUMA_NO_NODE
 };
 cpumask_t node_to_cpumask[MAX_NUMNODES] __read_mostly;
+struct bootnode nodes[MAX_NUMNODES] __read_mostly;
 
 int numa_off __initdata;
 unsigned long __initdata nodemap_addr;
@@ -47,8 +48,7 @@ unsigned long __initdata nodemap_size;
  * 0 if memnodmap[] too small (of shift too small)
  * -1 if node overlap or lost ram (shift too big)
  */
-static int __init
-populate_memnodemap(const struct bootnode *nodes, int numnodes, int shift)
+static int __init populate_memnodemap(int numnodes, int shift)
 {
 	int i; 
 	int res = -1;
@@ -104,8 +104,7 @@ static int __init allocate_cachealigned_memnodemap(void)
  * The LSB of all start and end addresses in the node map is the value of the
  * maximum possible shift.
  */
-static int __init
-extract_lsb_from_nodes (const struct bootnode *nodes, int numnodes)
+static int __init extract_lsb_from_nodes(int numnodes)
 {
 	int i, nodes_used = 0;
 	unsigned long start, end;
@@ -129,17 +128,17 @@ extract_lsb_from_nodes (const struct bootnode *nodes, int numnodes)
 	return i;
 }
 
-int __init compute_hash_shift(struct bootnode *nodes, int numnodes)
+int __init compute_hash_shift(int numnodes)
 {
 	int shift;
 
-	shift = extract_lsb_from_nodes(nodes, numnodes);
+	shift = extract_lsb_from_nodes(numnodes);
 	if (allocate_cachealigned_memnodemap())
 		return -1;
 	printk(KERN_DEBUG "NUMA: Using %d for the hash shift.\n",
 		shift);
 
-	if (populate_memnodemap(nodes, numnodes, shift) != 1) {
+	if (populate_memnodemap(numnodes, shift) != 1) {
 		printk(KERN_INFO
 	"Your memory is not aligned you need to rebuild your kernel "
 	"with a bigger NODEMAPSIZE shift=%d\n",
@@ -279,7 +278,37 @@ void __init numa_init_array(void)
 #define E820_ADDR_HOLE_SIZE(start, end)					\
 	(e820_hole_size((start) >> PAGE_SHIFT, (end) >> PAGE_SHIFT) <<	\
 	PAGE_SHIFT)
+
+static struct bootnode physical_node_map[MAX_NUMNODES];
 char *cmdline __initdata;
+int numa_emu;
+
+/*
+ * Returns the physical NUMA node that fake node nid resides on.  If NUMA
+ * emulation is disabled, then this is the same as nid.
+ */
+int get_phys_node(int nid)
+{
+	pg_data_t *pgdat;
+	u64 node_start_addr;
+	unsigned int i;
+	int ret = 0;
+
+	if (!numa_emu)
+		return nid;
+
+	pgdat = NODE_DATA(nid);
+	node_start_addr = pgdat->node_start_pfn << PAGE_SHIFT;
+
+	for (i = 0; i < MAX_NUMNODES; i++)
+		if (node_start_addr >= physical_node_map[i].start &&
+		    node_start_addr < physical_node_map[i].end) {
+			ret = i;
+			break;
+		}
+
+	return ret;
+}
 
 /*
  * Setups up nid to range from addr to addr + size.  If the end boundary is
@@ -287,8 +316,7 @@ char *cmdline __initdata;
  * if there is additional memory left for allocation past addr and -1 otherwise.
  * addr is adjusted to be at the end of the node.
  */
-static int __init setup_node_range(int nid, struct bootnode *nodes, u64 *addr,
-				   u64 size, u64 max_addr)
+static int __init setup_node_range(int nid, u64 *addr, u64 size, u64 max_addr)
 {
 	int ret = 0;
 	nodes[nid].start = *addr;
@@ -310,8 +338,7 @@ static int __init setup_node_range(int nid, struct bootnode *nodes, u64 *addr,
  * is the number of nodes split up and addr is adjusted to be at the end of the
  * last node allocated.
  */
-static int __init split_nodes_equally(struct bootnode *nodes, u64 *addr,
-				      u64 max_addr, int node_start,
+static int __init split_nodes_equally(u64 *addr, u64 max_addr, int node_start,
 				      int num_nodes)
 {
 	unsigned int big;
@@ -358,7 +385,7 @@ static int __init split_nodes_equally(struct bootnode *nodes, u64 *addr,
 					break;
 				}
 			}
-		if (setup_node_range(i, nodes, addr, end - *addr, max_addr) < 0)
+		if (setup_node_range(i, addr, end - *addr, max_addr) < 0)
 			break;
 	}
 	return i - node_start + 1;
@@ -369,12 +396,12 @@ static int __init split_nodes_equally(struct bootnode *nodes, u64 *addr,
  * always assigned to a final node and can be asymmetric.  Returns the number of
  * nodes split.
  */
-static int __init split_nodes_by_size(struct bootnode *nodes, u64 *addr,
-				      u64 max_addr, int node_start, u64 size)
+static int __init split_nodes_by_size(u64 *addr, u64 max_addr, int node_start,
+				      u64 size)
 {
 	int i = node_start;
 	size = (size << 20) & FAKE_NODE_MIN_HASH_MASK;
-	while (!setup_node_range(i++, nodes, addr, size, max_addr))
+	while (!setup_node_range(i++, addr, size, max_addr))
 		;
 	return i - node_start;
 }
@@ -385,7 +412,6 @@ static int __init split_nodes_by_size(struct bootnode *nodes, u64 *addr,
  */
 static int __init numa_emulation(unsigned long start_pfn, unsigned long end_pfn)
 {
-	struct bootnode nodes[MAX_NUMNODES];
 	u64 addr = start_pfn << PAGE_SHIFT;
 	u64 max_addr = end_pfn << PAGE_SHIFT;
 	int num_nodes = 0;
@@ -395,13 +421,18 @@ static int __init numa_emulation(unsigned long start_pfn, unsigned long end_pfn)
 	u64 size;
 	int i;
 
+	/*
+	 * Map the existing real NUMA toplogy to physical_node_map before the
+	 * information is cleared.
+	 */
+	memcpy(physical_node_map, nodes, sizeof(nodes));
 	memset(&nodes, 0, sizeof(nodes));
 	/*
 	 * If the numa=fake command-line is just a single number N, split the
 	 * system RAM into N fake nodes.
 	 */
 	if (!strchr(cmdline, '*') && !strchr(cmdline, ',')) {
-		num_nodes = split_nodes_equally(nodes, &addr, max_addr, 0,
+		num_nodes = split_nodes_equally(&addr, max_addr, 0,
 						simple_strtol(cmdline, NULL, 0));
 		if (num_nodes < 0)
 			return num_nodes;
@@ -429,8 +460,8 @@ static int __init numa_emulation(unsigned long start_pfn, unsigned long end_pfn)
 			size = ((u64)num << 20) & FAKE_NODE_MIN_HASH_MASK;
 			if (size)
 				for (i = 0; i < coeff; i++, num_nodes++)
-					if (setup_node_range(num_nodes, nodes,
-						&addr, size, max_addr) < 0)
+					if (setup_node_range(num_nodes, &addr,
+						size, max_addr) < 0)
 						goto done;
 			if (!*cmdline)
 				break;
@@ -446,7 +477,7 @@ done:
 	if (addr < max_addr) {
 		if (coeff_flag && coeff < 0) {
 			/* Split remaining nodes into num-sized chunks */
-			num_nodes += split_nodes_by_size(nodes, &addr, max_addr,
+			num_nodes += split_nodes_by_size(&addr, max_addr,
 							 num_nodes, num);
 			goto out;
 		}
@@ -455,7 +486,7 @@ done:
 			/* Split remaining nodes into coeff chunks */
 			if (coeff <= 0)
 				break;
-			num_nodes += split_nodes_equally(nodes, &addr, max_addr,
+			num_nodes += split_nodes_equally(&addr, max_addr,
 							 num_nodes, coeff);
 			break;
 		case ',':
@@ -463,13 +494,13 @@ done:
 			break;
 		default:
 			/* Give one final node */
-			setup_node_range(num_nodes, nodes, &addr,
-					 max_addr - addr, max_addr);
+			setup_node_range(num_nodes, &addr, max_addr - addr,
+					 max_addr);
 			num_nodes++;
 		}
 	}
 out:
-	memnode_shift = compute_hash_shift(nodes, num_nodes);
+	memnode_shift = compute_hash_shift(num_nodes);
 	if (memnode_shift < 0) {
 		memnode_shift = 0;
 		printk(KERN_ERR "No NUMA hash function found.  NUMA emulation "
@@ -489,30 +520,36 @@ out:
 
 void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
 { 
+	unsigned long start_addr = start_pfn << PAGE_SHIFT;
+	unsigned long end_addr = end_pfn << PAGE_SHIFT;
 	int i;
 
-#ifdef CONFIG_NUMA_EMU
-	if (cmdline && !numa_emulation(start_pfn, end_pfn))
- 		return;
-#endif
-
 #ifdef CONFIG_ACPI_NUMA
-	if (!numa_off && !acpi_scan_nodes(start_pfn << PAGE_SHIFT,
-					  end_pfn << PAGE_SHIFT))
+	if (!numa_off && !cmdline && !acpi_scan_nodes(start_addr, end_addr))
  		return;
 #endif
 
 #ifdef CONFIG_K8_NUMA
-	if (!numa_off && !k8_scan_nodes(start_pfn<<PAGE_SHIFT, end_pfn<<PAGE_SHIFT))
-		return;
+	if (!numa_off && !k8_scan_nodes(start_addr, end_addr, cmdline != 0))
+		if (cmdline == 0)
+			return;
+#endif
+
+#ifdef CONFIG_NUMA_EMU
+	if (cmdline)
+	{
+		numa_emu = !numa_emulation(start_pfn, end_pfn);
+		if (numa_emu)
+			return;
+	}
 #endif
+
 	printk(KERN_INFO "%s\n",
 	       numa_off ? "NUMA turned off" : "No NUMA configuration found");
 
-	printk(KERN_INFO "Faking a node at %016lx-%016lx\n", 
-	       start_pfn << PAGE_SHIFT,
-	       end_pfn << PAGE_SHIFT); 
-		/* setup dummy node covering all memory */ 
+	printk(KERN_INFO "Faking a node at %016lx-%016lx\n", start_addr,
+	       end_addr);
+	/* setup dummy node covering all memory */
 	memnode_shift = 63; 
 	memnodemap = memnode.embedded_map;
 	memnodemap[0] = 0;
@@ -522,7 +559,7 @@ void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
 		numa_set_node(i, 0);
 	node_to_cpumask[0] = cpumask_of_cpu(0);
 	e820_register_active_regions(0, start_pfn, end_pfn);
-	setup_node_bootmem(0, start_pfn << PAGE_SHIFT, end_pfn << PAGE_SHIFT);
+	setup_node_bootmem(0, start_addr, end_addr);
 }
 
 __cpuinit void numa_add_cpu(int cpu)
diff --git a/arch/x86_64/mm/srat.c b/arch/x86_64/mm/srat.c
--- a/arch/x86_64/mm/srat.c
+++ b/arch/x86_64/mm/srat.c
@@ -26,7 +26,6 @@ int acpi_numa __initdata;
 static struct acpi_table_slit *acpi_slit;
 
 static nodemask_t nodes_parsed __initdata;
-static struct bootnode nodes[MAX_NUMNODES] __initdata;
 static struct bootnode nodes_add[MAX_NUMNODES];
 static int found_add_area __initdata;
 int hotadd_percent __initdata = 0;
@@ -411,7 +410,7 @@ int __init acpi_scan_nodes(unsigned long start, unsigned long end)
 		return -1;
 	}
 
-	memnode_shift = compute_hash_shift(nodes, MAX_NUMNODES);
+	memnode_shift = compute_hash_shift(MAX_NUMNODES);
 	if (memnode_shift < 0) {
 		printk(KERN_ERR
 		     "SRAT: No NUMA node hash function found. Contact maintainer\n");
@@ -461,6 +460,12 @@ int __node_distance(int a, int b)
 {
 	int index;
 
+#ifdef CONFIG_NUMA_EMU
+	/* In fake NUMA, the physical node is used for node distance. */
+	a = get_phys_node(a);
+	b = get_phys_node(b);
+#endif
+
 	if (!acpi_slit)
 		return a == b ? 10 : 20;
 	index = acpi_slit->locality_count * node_to_pxm(a);
diff --git a/include/asm-x86_64/numa.h b/include/asm-x86_64/numa.h
--- a/include/asm-x86_64/numa.h
+++ b/include/asm-x86_64/numa.h
@@ -6,8 +6,8 @@
 struct bootnode {
 	u64 start,end; 
 };
-
-extern int compute_hash_shift(struct bootnode *nodes, int numnodes);
+extern struct bootnode nodes[MAX_NUMNODES];
+extern int compute_hash_shift(int numnodes);
 
 #define ZONE_ALIGN (1UL << (MAX_ORDER+PAGE_SHIFT))
 
diff --git a/include/asm-x86_64/proto.h b/include/asm-x86_64/proto.h
--- a/include/asm-x86_64/proto.h
+++ b/include/asm-x86_64/proto.h
@@ -51,7 +51,7 @@ extern void early_printk(const char *fmt, ...) __attribute__((format(printf,1,2)
 
 extern void early_identify_cpu(struct cpuinfo_x86 *c);
 
-extern int k8_scan_nodes(unsigned long start, unsigned long end);
+extern int k8_scan_nodes(unsigned long start, unsigned long end, int fake);
 
 extern void numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn);
 extern unsigned long numa_free_all_bootmem(void);
diff --git a/include/asm-x86_64/topology.h b/include/asm-x86_64/topology.h
--- a/include/asm-x86_64/topology.h
+++ b/include/asm-x86_64/topology.h
@@ -68,5 +68,6 @@ extern int __node_distance(int, int);
 #include <asm-generic/topology.h>
 
 extern cpumask_t cpu_coregroup_map(int cpu);
+extern int get_phys_node(int nid);
 
 #endif

* [patch -mm 5/7] x86_64: disable alien cache for fake numa
  2007-03-01 17:13     ` [patch -mm 4/7] x86_64: map fake nodes to real nodes David Rientjes
@ 2007-03-01 17:13       ` David Rientjes
  2007-03-01 17:13         ` [patch -mm 6/7] x86_64: export physnode mapping to userspace David Rientjes
  0 siblings, 1 reply; 7+ messages in thread
From: David Rientjes @ 2007-03-01 17:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, Paul Menage, linux-kernel

Disables the use of the slab alien caches when NUMA emulation is being used.

Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Paul Menage <menage@google.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 arch/x86_64/mm/numa.c |    2 ++
 mm/slab.c             |    2 +-
 2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/arch/x86_64/mm/numa.c b/arch/x86_64/mm/numa.c
--- a/arch/x86_64/mm/numa.c
+++ b/arch/x86_64/mm/numa.c
@@ -280,6 +280,7 @@ void __init numa_init_array(void)
 	PAGE_SHIFT)
 
 static struct bootnode physical_node_map[MAX_NUMNODES];
+extern int use_alien_caches;
 char *cmdline __initdata;
 int numa_emu;
 
@@ -512,6 +513,7 @@ out:
 						nodes[i].end >> PAGE_SHIFT);
  		setup_node_bootmem(i, nodes[i].start, nodes[i].end);
 	}
+	use_alien_caches = 0;
  	numa_init_array();
  	return 0;
 }
diff --git a/mm/slab.c b/mm/slab.c
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -889,7 +889,7 @@ static void __slab_error(const char *function, struct kmem_cache *cachep,
  * line
   */
 
-static int use_alien_caches __read_mostly = 1;
+int use_alien_caches __read_mostly = 1;
 static int __init noaliencache_setup(char *s)
 {
 	use_alien_caches = 0;

* [patch -mm 6/7] x86_64: export physnode mapping to userspace
  2007-03-01 17:13       ` [patch -mm 5/7] x86_64: disable alien cache for fake numa David Rientjes
@ 2007-03-01 17:13         ` David Rientjes
  2007-03-01 17:13           ` [patch -mm 7/7] x86_64: fake numa for cpusets document David Rientjes
  0 siblings, 1 reply; 7+ messages in thread
From: David Rientjes @ 2007-03-01 17:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, Rohit Seth, linux-kernel

Exports the mapping from each fake NUMA node to the physical node it resides
on to user-space via sysfs.  The information is now accessible through each
node's 'physnode' file.
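
For example, assuming sysfs is mounted at /sys and the usual node class
location, the physical node backing fake node 2 could be read with:

	# cat /sys/devices/system/node/node2/physnode
	0

The value shown depends on how the fake nodes map onto the real topology.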

Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Rohit Seth <rohitseth@google.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 drivers/base/node.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -133,6 +133,13 @@ static ssize_t node_read_distance(struct sys_device * dev, char * buf)
 }
 static SYSDEV_ATTR(distance, S_IRUGO, node_read_distance, NULL);
 
+#ifdef CONFIG_NUMA_EMU
+static ssize_t node_read_physnode(struct sys_device *dev, char *buf)
+{
+	return sprintf(buf, "%d\n", get_phys_node(dev->id));
+}
+static SYSDEV_ATTR(physnode, S_IRUGO, node_read_physnode, NULL);
+#endif
 
 /*
  * register_node - Setup a driverfs device for a node.
@@ -153,6 +160,9 @@ int register_node(struct node *node, int num, struct node *parent)
 		sysdev_create_file(&node->sysdev, &attr_meminfo);
 		sysdev_create_file(&node->sysdev, &attr_numastat);
 		sysdev_create_file(&node->sysdev, &attr_distance);
+#ifdef CONFIG_NUMA_EMU
+		sysdev_create_file(&node->sysdev, &attr_physnode);
+#endif
 	}
 	return error;
 }
@@ -170,6 +180,9 @@ void unregister_node(struct node *node)
 	sysdev_remove_file(&node->sysdev, &attr_meminfo);
 	sysdev_remove_file(&node->sysdev, &attr_numastat);
 	sysdev_remove_file(&node->sysdev, &attr_distance);
+#ifdef CONFIG_NUMA_EMU
+	sysdev_remove_file(&node->sysdev, &attr_physnode);
+#endif
 
 	sysdev_unregister(&node->sysdev);
 }

* [patch -mm 7/7] x86_64: fake numa for cpusets document
  2007-03-01 17:13         ` [patch -mm 6/7] x86_64: export physnode mapping to userspace David Rientjes
@ 2007-03-01 17:13           ` David Rientjes
  0 siblings, 0 replies; 7+ messages in thread
From: David Rientjes @ 2007-03-01 17:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, linux-kernel

Creates a document explaining how to use numa=fake in conjunction with
cpusets for coarse memory resource management.

This is an attempt to get more awareness of, and more testing for, this
feature.

Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/x86_64/fake-numa-for-cpusets |   66 ++++++++++++++++++++++++++++
 1 files changed, 66 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/x86_64/fake-numa-for-cpusets

diff --git a/Documentation/x86_64/fake-numa-for-cpusets b/Documentation/x86_64/fake-numa-for-cpusets
new file mode 100644
--- /dev/null
+++ b/Documentation/x86_64/fake-numa-for-cpusets
@@ -0,0 +1,66 @@
+Using numa=fake and CPUSets for Resource Management
+Written by David Rientjes <rientjes@cs.washington.edu>
+
+This document describes how the numa=fake x86_64 command-line option can be used
+in conjunction with cpusets for coarse memory management.  Using this feature,
+you can create fake NUMA nodes that represent contiguous chunks of memory and
+assign them to cpusets and their attached tasks.  This is a way of limiting the
+amount of system memory that are available to a certain class of tasks.
+
+For more information on the features of cpusets, see Documentation/cpusets.txt.
+There are a number of different configurations you can use for your needs.  For
+more information on the numa=fake command line option and its various ways of
+configuring fake nodes, see Documentation/x86_64/boot-options.txt.
+
+For the purposes of this introduction, we'll assume a very primitive NUMA
+emulation setup of "numa=fake=4*512,".  This will split our system memory into
+four equal chunks of 512M each that we can now use to assign to cpusets.  As
+you become more familiar with using this combination for resource control,
+you'll determine a better setup to minimize the number of nodes you have to deal
+with.
+
+A machine may be split as follows with "numa=fake=4*512," as reported by dmesg:
+
+	Faking node 0 at 0000000000000000-0000000020000000 (512MB)
+	Faking node 1 at 0000000020000000-0000000040000000 (512MB)
+	Faking node 2 at 0000000040000000-0000000060000000 (512MB)
+	Faking node 3 at 0000000060000000-0000000080000000 (512MB)
+	...
+	On node 0 totalpages: 130975
+	On node 1 totalpages: 131072
+	On node 2 totalpages: 131072
+	On node 3 totalpages: 131072
+
+Now following the instructions for mounting the cpusets filesystem from
+Documentation/cpusets.txt, you can assign fake nodes (i.e. contiguous memory
+address spaces) to individual cpusets:
+
+	[root@xroads /]# mkdir exampleset
+	[root@xroads /]# mount -t cpuset none exampleset
+	[root@xroads /]# mkdir exampleset/ddset
+	[root@xroads /]# cd exampleset/ddset
+	[root@xroads /exampleset/ddset]# echo 0-1 > cpus
+	[root@xroads /exampleset/ddset]# echo 0-1 > mems
+
+Now this cpuset, 'ddset', will only allowed access to fake nodes 0 and 1 for
+memory allocations (1G).
+
+You can now assign tasks to these cpusets to limit the memory resources
+available to them according to the fake nodes assigned as mems:
+
+	[root@xroads /exampleset/ddset]# echo $$ > tasks
+	[root@xroads /exampleset/ddset]# dd if=/dev/zero of=tmp bs=1024 count=1G
+	[1] 13425
+
+Notice the difference between the system memory usage as reported by
+/proc/meminfo between the restricted cpuset case above and the unrestricted
+case (i.e. running the same 'dd' command without assigning it to a fake NUMA
+cpuset):
+				Unrestricted	Restricted
+	MemTotal:		3091900 kB	3091900 kB
+	MemFree:		  42113 kB	1513236 kB
+
+This allows for coarse memory management for the tasks you assign to particular
+cpusets.  Since cpusets can form a hierarchy, you can create some pretty
+interesting combinations of use-cases for various classes of tasks for your
+memory management needs.
