All of lore.kernel.org
 help / color / mirror / Atom feed
* [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
@ 2007-02-06 11:23 KAMEZAWA Hiroyuki
  2007-02-06 17:26 ` Christoph Lameter
  2007-02-07 10:20 ` Andi Kleen
  0 siblings, 2 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-02-06 11:23 UTC (permalink / raw)
  To: LKML; +Cc: GOTO, Christoph Lameter, Andrew Morton

current mempolicy just checks whether a node is online or not.
If there is memory-less-node, mempolicy's target node can be
invalid.
This patch adds a check whether a node has memory or not.

This is an back-trace in which a program uses MPOL_MBIND just includes
memory-less-node.

== backtrace from crash (linux-2.6.20) ==
 #0 [BSP:e000000121f412d8] schedule at a00000010061ccc0
 #1 [BSP:e000000121f41280] rwsem_down_failed_common at a000000100290490
 #2 [BSP:e000000121f41260] rwsem_down_read_failed at a000000100620d30
 #3 [BSP:e000000121f41240] down_read at a0000001000b01a0
 #4 [BSP:e000000121f411e8] ia64_do_page_fault at a000000100625710
 #5 [BSP:e000000121f411e8] ia64_leave_kernel at a00000010000c660
  EFRAME: e000000121f47100
      B0: a00000010013cc40      CR_IIP: a00000010012aa30
 CR_IPSR: 0000101008022018      CR_IFS: 8000000000000205
  AR_PFS: 0000000000000309      AR_RSC: 0000000000000003
 AR_UNAT: 0000000000000000     AR_RNAT: 0000000000000000
  AR_CCV: 0000000000000000     AR_FPSR: 0009804c8a70033f
  LOADRS: 0000000000000000 AR_BSPSTORE: 0000000000000000
      B6: a00000010003f040          B7: a00000010000ccd0
      PR: 000000000055a9a5          R1: a000000100d5a5b0
      R2: e00000010c50df7c          R3: 0000000000000030
      R8: 0000000000000000          R9: e00000011dc52930
     R10: e00000011dc52928         R11: e00000010c50df80
     R12: e000000121f472c0         R13: e000000121f40000
     R14: 0000000000000002         R15: 000000003fffff00
     R16: 0000000010400000         R17: e000000121f40000
     R18: a000000100b5a9d0         R19: e000000121f40018
     R20: e000000121f40c84         R21: 0000000000000000
     R22: e000000121f47330         R23: e000000121f47334
     R24: e000000121f40b88         R25: e000000121f47340
     R26: e000000121f47334         R27: 0000000000000000
     R28: 0000000000000000         R29: e000000121f47338
     R30: 000000007fffffff         R31: a000000100b5b5e0
      F6: 1003eccd55056199632ec     F7: 1003e9e3779b97f4a7c16
      F8: 1003e0a00000010001422     F9: 1003e000000000fa00000
     F10: 1003e000000003b9aca00    F11: 1003e431bde82d7b634db
 #6 [BSP:e000000121f411c0] slab_node at a00000010012aa30
 #7 [BSP:e000000121f41190] alternate_node_alloc at a00000010013cc40
 #8 [BSP:e000000121f41160] kmem_cache_alloc at a00000010013dc40
 #9 [BSP:e000000121f41100] desc_prologue at a00000010003ee00
#10 [BSP:e000000121f410c0] unw_decode_r2 at a00000010003f0c0
#11 [BSP:e000000121f41068] find_save_locs at a00000010003fbf0
#12 [BSP:e000000121f41038] unw_init_frame_info at a000000100040900
#13 [BSP:e000000121f41010] unw_init_running at a00000010000ccf0
==
This panic(hang) was found by a numa test-set on a system with 3 nodes, where
node(2) was memory-less-node.

This means an access to NULL,here.
==
unsigned slab_node(struct mempolicy *policy)
{
        case MPOL_BIND:
                /*
                 * Follow bind policy behavior and start allocation at the
                 * first node.
                 */
                return zone_to_nid(policy->v.zonelist->zones[0]);
}
==
length of this zonelist was 0.
It seems fixing a NULL access here is also O.K. 
This patch is just an idea.

Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



Index: linux-2.6.20/mm/mempolicy.c
===================================================================
--- linux-2.6.20.orig/mm/mempolicy.c	2007-02-06 20:02:31.000000000 +0900
+++ linux-2.6.20/mm/mempolicy.c	2007-02-06 20:09:47.000000000 +0900
@@ -116,6 +116,8 @@
 static int mpol_check_policy(int mode, nodemask_t *nodes)
 {
 	int empty = nodes_empty(*nodes);
+	int nd;
+	nodemask_t node_with_memory;
 
 	switch (mode) {
 	case MPOL_DEFAULT:
@@ -130,7 +132,12 @@
 			return -EINVAL;
 		break;
 	}
-	return nodes_subset(*nodes, node_online_map) ? 0 : -EINVAL;
+	nodes_clear(node_with_memory);
+	for_each_online_node(nd) {
+		if (node_present_pages(nd))
+			node_set(nd, node_with_memory);
+	}
+	return nodes_subset(*nodes, node_with_memory) ? 0 : -EINVAL;
 }
 
 /* Generate a custom zonelist for the BIND policy. */


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
  2007-02-06 11:23 [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node KAMEZAWA Hiroyuki
@ 2007-02-06 17:26 ` Christoph Lameter
  2007-02-07  1:14   ` KAMEZAWA Hiroyuki
  2007-02-07 10:20 ` Andi Kleen
  1 sibling, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2007-02-06 17:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: LKML, GOTO, Christoph Lameter, Andrew Morton

On Tue, 6 Feb 2007, KAMEZAWA Hiroyuki wrote:

> This means an access to NULL,here.
> ==
> unsigned slab_node(struct mempolicy *policy)
> {
>         case MPOL_BIND:
>                 /*
>                  * Follow bind policy behavior and start allocation at the
>                  * first node.
>                  */
>                 return zone_to_nid(policy->v.zonelist->zones[0]);
> }
> ==
> length of this zonelist was 0.
> It seems fixing a NULL access here is also O.K. 
> This patch is just an idea.

Hmmm... Remove the node from the node_online_map instead?

>  	}
> -	return nodes_subset(*nodes, node_online_map) ? 0 : -EINVAL;
> +	nodes_clear(node_with_memory);
> +	for_each_online_node(nd) {
> +		if (node_present_pages(nd))
> +			node_set(nd, node_with_memory);
> +	}
> +	return nodes_subset(*nodes, node_with_memory) ? 0 : -EINVAL;
>  }

Uhh. No loops here please. If we really need this then we need to have 
some sort of set operation here.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
  2007-02-06 17:26 ` Christoph Lameter
@ 2007-02-07  1:14   ` KAMEZAWA Hiroyuki
  2007-02-07  8:04     ` Christoph Lameter
  0 siblings, 1 reply; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-02-07  1:14 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, y-goto, clameter, akpm

On Tue, 6 Feb 2007 09:26:53 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> On Tue, 6 Feb 2007, KAMEZAWA Hiroyuki wrote:
> 
> > This means an access to NULL,here.
> > ==
> > unsigned slab_node(struct mempolicy *policy)
> > {
> >         case MPOL_BIND:
> >                 /*
> >                  * Follow bind policy behavior and start allocation at the
> >                  * first node.
> >                  */
> >                 return zone_to_nid(policy->v.zonelist->zones[0]);
> > }
> > ==
> > length of this zonelist was 0.
> > It seems fixing a NULL access here is also O.K. 
> > This patch is just an idea.
> 
> Hmmm... Remove the node from the node_online_map instead?
> 
Changing defintion of node_online_map is harmfil. (there are cpu-only-nodes.)
How about adding  nodemask for nodes equips memory ?

===
There are memory-less-nodes (i.e. cpu only node.) on some systems.
mempolicy, which requires nodemask as its arg, should compare user's nodemask
with nodemask for node-with-memory-mask instead of node_online_map.

This patch adds node_with_memory_map and rewrite mempolicy's node_online_map
to node_with_memory_map.
(This patch supports node-hot-add.)

tested on ia64 NUMA
- 3nodes, node(2) is memory-less
- 3nodes, all nodes have memory.


Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Index: linux-2.6.20/include/linux/nodemask.h
===================================================================
--- linux-2.6.20.orig/include/linux/nodemask.h	2007-02-05 03:44:54.000000000 +0900
+++ linux-2.6.20/include/linux/nodemask.h	2007-02-07 10:03:53.000000000 +0900
@@ -344,6 +344,7 @@
 
 extern nodemask_t node_online_map;
 extern nodemask_t node_possible_map;
+extern nodemask_t node_with_memory_map;
 
 #if MAX_NUMNODES > 1
 #define num_online_nodes()	nodes_weight(node_online_map)
Index: linux-2.6.20/mm/page_alloc.c
===================================================================
--- linux-2.6.20.orig/mm/page_alloc.c	2007-02-05 03:44:54.000000000 +0900
+++ linux-2.6.20/mm/page_alloc.c	2007-02-07 09:48:35.000000000 +0900
@@ -59,6 +59,8 @@
 long nr_swap_pages;
 int percpu_pagelist_fraction;
 
+nodemask_t	node_with_memory_map = NODE_MASK_NONE;
+
 static void __free_pages_ok(struct page *page, unsigned int order);
 
 /*
@@ -1861,6 +1863,8 @@
 	for_each_online_node(nid) {
 		build_zonelists(NODE_DATA(nid));
 		build_zonelist_cache(NODE_DATA(nid));
+		if (node_present_pages(nid))
+			node_set(nid, node_with_memory_map);
 	}
 	return 0;
 }
Index: linux-2.6.20/mm/mempolicy.c
===================================================================
--- linux-2.6.20.orig/mm/mempolicy.c	2007-02-07 09:29:01.000000000 +0900
+++ linux-2.6.20/mm/mempolicy.c	2007-02-07 09:50:51.000000000 +0900
@@ -130,7 +130,7 @@
 			return -EINVAL;
 		break;
 	}
-	return nodes_subset(*nodes, node_online_map) ? 0 : -EINVAL;
+	return nodes_subset(*nodes, node_with_memory_map) ? 0 : -EINVAL;
 }
 
 /* Generate a custom zonelist for the BIND policy. */
@@ -500,7 +500,7 @@
 	case MPOL_PREFERRED:
 		/* or use current node instead of online map? */
 		if (p->v.preferred_node < 0)
-			*nodes = node_online_map;
+			*nodes = node_with_memory_map;
 		else
 			node_set(p->v.preferred_node, *nodes);
 		break;
@@ -1612,7 +1612,7 @@
 	/* Set interleaving policy for system init. This way not all
 	   the data structures allocated at system boot end up in node zero. */
 
-	if (do_set_mempolicy(MPOL_INTERLEAVE, &node_online_map))
+	if (do_set_mempolicy(MPOL_INTERLEAVE, &node_with_memory_map))
 		printk("numa_policy_init: interleaving failed\n");
 }
 












^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
  2007-02-07  1:14   ` KAMEZAWA Hiroyuki
@ 2007-02-07  8:04     ` Christoph Lameter
  2007-02-07  8:36         ` [2.6.20][PATCH] fix mempolicy error check on a system with KAMEZAWA Hiroyuki
  2007-02-07 10:20       ` Andi Kleen
  0 siblings, 2 replies; 28+ messages in thread
From: Christoph Lameter @ 2007-02-07  8:04 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel, y-goto, clameter, akpm

On Wed, 7 Feb 2007, KAMEZAWA Hiroyuki wrote:

> > Hmmm... Remove the node from the node_online_map instead?
> > 
> Changing defintion of node_online_map is harmfil. (there are cpu-only-nodes.)
> How about adding  nodemask for nodes equips memory ?

Ok that is better but...

Would it be possible to attach the cpus to the 
next nodes with memory and mark the node offline? That way we could avoid 
another mask that we constantly have to check?

Or fix the location where the error occurred to be able to tolerate a node 
with no zones?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
  2007-02-07  8:04     ` Christoph Lameter
@ 2007-02-07  8:36         ` KAMEZAWA Hiroyuki
  2007-02-07 10:20       ` Andi Kleen
  1 sibling, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-02-07  8:36 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, y-goto, clameter, akpm, linux-ia64

On Wed, 7 Feb 2007 00:04:41 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> On Wed, 7 Feb 2007, KAMEZAWA Hiroyuki wrote:
> 
> > > Hmmm... Remove the node from the node_online_map instead?
> > > 
> > Changing defintion of node_online_map is harmfil. (there are cpu-only-nodes.)
> > How about adding  nodemask for nodes equips memory ?
> 
> Ok that is better but...
> 
> Would it be possible to attach the cpus to the 
> next nodes with memory and mark the node offline? That way we could avoid 
> another mask that we constantly have to check?
> 
Added ia64 list to CC.
I know ia64 kernel did what you say in old days (I know RHEL4/2.6.9 kernel does it).
Someone changed it and created cpu-only-node for some purpose, I don't know why.


> Or fix the location where the error occurred to be able to tolerate a node 
> with no zones?
> 
Hmm, 
In this case, MPOL_MBIND, the user requests to allocate memory from specified nodes.
I think it's better to tell him "you can't do that" than silently allocating memory
from other places.

-Kame






^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with
@ 2007-02-07  8:36         ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-02-07  8:36 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, y-goto, clameter, akpm, linux-ia64

On Wed, 7 Feb 2007 00:04:41 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> On Wed, 7 Feb 2007, KAMEZAWA Hiroyuki wrote:
> 
> > > Hmmm... Remove the node from the node_online_map instead?
> > > 
> > Changing defintion of node_online_map is harmfil. (there are cpu-only-nodes.)
> > How about adding  nodemask for nodes equips memory ?
> 
> Ok that is better but...
> 
> Would it be possible to attach the cpus to the 
> next nodes with memory and mark the node offline? That way we could avoid 
> another mask that we constantly have to check?
> 
Added ia64 list to CC.
I know ia64 kernel did what you say in old days (I know RHEL4/2.6.9 kernel does it).
Someone changed it and created cpu-only-node for some purpose, I don't know why.


> Or fix the location where the error occurred to be able to tolerate a node 
> with no zones?
> 
Hmm, 
In this case, MPOL_MBIND, the user requests to allocate memory from specified nodes.
I think it's better to tell him "you can't do that" than silently allocating memory
from other places.

-Kame






^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
  2007-02-07 10:20 ` Andi Kleen
@ 2007-02-07 10:07   ` KAMEZAWA Hiroyuki
  2007-02-07 10:19     ` Andi Kleen
  2007-02-07 14:05     ` Christoph Lameter
  2007-02-07 14:03   ` Christoph Lameter
  2007-02-07 16:23   ` Andrew Morton
  2 siblings, 2 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-02-07 10:07 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, y-goto, clameter, akpm

On 07 Feb 2007 11:20:06 +0100
Andi Kleen <ak@suse.de> wrote:

> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
> 
> > current mempolicy just checks whether a node is online or not.
> > If there is memory-less-node, mempolicy's target node can be
> > invalid.
> > This patch adds a check whether a node has memory or not.
> 
> IMHO there shouldn't be any memory less nodes. The architecture code
> should not create them. The CPU should be assigned to a nearby node instead.
> At least x86-64 ensures that.
> 
AFAIK, ia64 creates nodes just depends on SRAT's possible resource information.
Then, ia64 can create cpu-memory-less-node(node with no available resource.).
(*)I don't like this.

If we don't allow memory-less-node, we may have to add several codes for cpu-hot-add.
cpus should be moved to nearby node at hotadd .
And node-hot-add have to care that cpus mustn't be added before memory, cpu-driven 
node-hot-add will never occur. (ACPI's 'container' device spec can't guaranntee this.)

-Kame




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
  2007-02-07 10:07   ` KAMEZAWA Hiroyuki
@ 2007-02-07 10:19     ` Andi Kleen
  2007-02-07 10:37       ` KAMEZAWA Hiroyuki
  2007-02-07 14:05     ` Christoph Lameter
  1 sibling, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2007-02-07 10:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel, y-goto, clameter, akpm


> AFAIK, ia64 creates nodes just depends on SRAT's possible resource information.
> Then, ia64 can create cpu-memory-less-node(node with no available resource.).
> (*)I don't like this.
> 
> If we don't allow memory-less-node, we may have to add several codes for cpu-hot-add.
> cpus should be moved to nearby node at hotadd .
> And node-hot-add have to care that cpus mustn't be added before memory, cpu-driven 
> node-hot-add will never occur. (ACPI's 'container' device spec can't guaranntee this.)

You can also alias node numbers to solve this: just point multiple node numbers 
to the same pgdat. For a memory less node this would be a nearby one.

-Andi

 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
  2007-02-06 11:23 [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node KAMEZAWA Hiroyuki
  2007-02-06 17:26 ` Christoph Lameter
@ 2007-02-07 10:20 ` Andi Kleen
  2007-02-07 10:07   ` KAMEZAWA Hiroyuki
                     ` (2 more replies)
  1 sibling, 3 replies; 28+ messages in thread
From: Andi Kleen @ 2007-02-07 10:20 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: LKML, GOTO, Christoph Lameter, Andrew Morton

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> current mempolicy just checks whether a node is online or not.
> If there is memory-less-node, mempolicy's target node can be
> invalid.
> This patch adds a check whether a node has memory or not.

IMHO there shouldn't be any memory less nodes. The architecture code
should not create them. The CPU should be assigned to a nearby node instead.
At least x86-64 ensures that.

-Andi

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
  2007-02-07  8:04     ` Christoph Lameter
  2007-02-07  8:36         ` [2.6.20][PATCH] fix mempolicy error check on a system with KAMEZAWA Hiroyuki
@ 2007-02-07 10:20       ` Andi Kleen
  1 sibling, 0 replies; 28+ messages in thread
From: Andi Kleen @ 2007-02-07 10:20 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: KAMEZAWA Hiroyuki, linux-kernel, y-goto, clameter, akpm

Christoph Lameter <clameter@sgi.com> writes:
> 
> Would it be possible to attach the cpus to the 
> next nodes with memory and mark the node offline? That way we could avoid 
> another mask that we constantly have to check?

That is what x86-64 does and I believe is the right solution to this.

-Andi


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
  2007-02-07 10:19     ` Andi Kleen
@ 2007-02-07 10:37       ` KAMEZAWA Hiroyuki
  2007-02-07 10:41         ` Andi Kleen
  0 siblings, 1 reply; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-02-07 10:37 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, y-goto, clameter, akpm

On Wed, 7 Feb 2007 11:19:02 +0100
Andi Kleen <ak@suse.de> wrote:

> 
> > AFAIK, ia64 creates nodes just depends on SRAT's possible resource information.
> > Then, ia64 can create cpu-memory-less-node(node with no available resource.).
> > (*)I don't like this.
> > 
> > If we don't allow memory-less-node, we may have to add several codes for cpu-hot-add.
> > cpus should be moved to nearby node at hotadd .
> > And node-hot-add have to care that cpus mustn't be added before memory, cpu-driven 
> > node-hot-add will never occur. (ACPI's 'container' device spec can't guaranntee this.)
> 
> You can also alias node numbers to solve this: just point multiple node numbers 
> to the same pgdat. For a memory less node this would be a nearby one.
> 
Hmm, interesting...the 'alias' means follwing ?
==
NODE_DATA(A) = pgdat_for_A
NODE_DATA(B) = pgdat_for_A // B is memory-less.
- NODE_DATA(B) is valid but B is not online.
==
looks complicated..and we have to care /sys/devices/system/node handling.

-Kame


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
  2007-02-07 10:37       ` KAMEZAWA Hiroyuki
@ 2007-02-07 10:41         ` Andi Kleen
  2007-02-07 10:49           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2007-02-07 10:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel, y-goto, clameter, akpm

On Wednesday 07 February 2007 11:37, KAMEZAWA Hiroyuki wrote:
> On Wed, 7 Feb 2007 11:19:02 +0100
> Andi Kleen <ak@suse.de> wrote:
> 
> > 
> > > AFAIK, ia64 creates nodes just depends on SRAT's possible resource information.
> > > Then, ia64 can create cpu-memory-less-node(node with no available resource.).
> > > (*)I don't like this.
> > > 
> > > If we don't allow memory-less-node, we may have to add several codes for cpu-hot-add.
> > > cpus should be moved to nearby node at hotadd .
> > > And node-hot-add have to care that cpus mustn't be added before memory, cpu-driven 
> > > node-hot-add will never occur. (ACPI's 'container' device spec can't guaranntee this.)
> > 
> > You can also alias node numbers to solve this: just point multiple node numbers 
> > to the same pgdat. For a memory less node this would be a nearby one.
> > 
> Hmm, interesting...the 'alias' means follwing ?

Yes.

> NODE_DATA(A) = pgdat_for_A
> NODE_DATA(B) = pgdat_for_A // B is memory-less.
> - NODE_DATA(B) is valid but B is not online.

Well it is online because A is. For all practical purposes
it is A, just under a different name.

> ==
> looks complicated..and we have to care /sys/devices/system/node handling.

x86-64 used to do that when it still only did 1:1 cpu<->memory mappings.
I don't remember any problems with it.

-Andi


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
  2007-02-07 10:41         ` Andi Kleen
@ 2007-02-07 10:49           ` KAMEZAWA Hiroyuki
  2007-02-07 11:32             ` Andi Kleen
  0 siblings, 1 reply; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-02-07 10:49 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, y-goto, clameter, akpm

On Wed, 7 Feb 2007 11:41:25 +0100
Andi Kleen <ak@suse.de> wrote:

> On Wednesday 07 February 2007 11:37, KAMEZAWA Hiroyuki wrote:
> > On Wed, 7 Feb 2007 11:19:02 +0100
> > Andi Kleen <ak@suse.de> wrote:
> > > You can also alias node numbers to solve this: just point multiple node numbers 
> > > to the same pgdat. For a memory less node this would be a nearby one.
> > > 
> > Hmm, interesting...the 'alias' means follwing ?
> 
> Yes.
> 
> > NODE_DATA(A) = pgdat_for_A
> > NODE_DATA(B) = pgdat_for_A // B is memory-less.
> > - NODE_DATA(B) is valid but B is not online.
> 
> Well it is online because A is. For all practical purposes
> it is A, just under a different name.
> 
> > ==
> > looks complicated..and we have to care /sys/devices/system/node handling.
> 
> x86-64 used to do that when it still only did 1:1 cpu<->memory mappings.
> I don't remember any problems with it.
> 

How for_each_online_node(nid) works ? it can handle alias-nid ?

==
	for_each_online_node(nid) {
		pgdat = NODE_DATA(nid);
==
This code never accesses pgdat_for_A twice ?

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
  2007-02-07 10:49           ` KAMEZAWA Hiroyuki
@ 2007-02-07 11:32             ` Andi Kleen
  2007-02-07 12:27               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2007-02-07 11:32 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel, y-goto, clameter, akpm


> How for_each_online_node(nid) works ? it can handle alias-nid ?
> 
> ==
> 	for_each_online_node(nid) {
> 		pgdat = NODE_DATA(nid);
> ==
> This code never accesses pgdat_for_A twice ?

It would. If there's a problem it could be changed to walk the pgdat lists instead,
but at least when x86-64 still did it there wasn't.

-Andi

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
  2007-02-07 11:32             ` Andi Kleen
@ 2007-02-07 12:27               ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-02-07 12:27 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, y-goto, clameter, akpm

On Wed, 7 Feb 2007 12:32:36 +0100
Andi Kleen <ak@suse.de> wrote:

> 
> > How for_each_online_node(nid) works ? it can handle alias-nid ?
> > 
> > ==
> > 	for_each_online_node(nid) {
> > 		pgdat = NODE_DATA(nid);
> > ==
> > This code never accesses pgdat_for_A twice ?
> 
> It would. If there's a problem it could be changed to walk the pgdat lists instead,
> but at least when x86-64 still did it there wasn't.
> 

Thank you for discussion. 

I'll try another hot-fix just for this NULL pointer access problem,
which is not invasive and not costly.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
  2007-02-07 10:20 ` Andi Kleen
  2007-02-07 10:07   ` KAMEZAWA Hiroyuki
@ 2007-02-07 14:03   ` Christoph Lameter
  2007-02-07 16:23   ` Andrew Morton
  2 siblings, 0 replies; 28+ messages in thread
From: Christoph Lameter @ 2007-02-07 14:03 UTC (permalink / raw)
  To: Andi Kleen
  Cc: KAMEZAWA Hiroyuki, LKML, GOTO, Christoph Lameter, Andrew Morton

On Wed, 7 Feb 2007, Andi Kleen wrote:

> IMHO there shouldn't be any memory less nodes. The architecture code
> should not create them. The CPU should be assigned to a nearby node instead.
> At least x86-64 ensures that.

Yes I wish we would do it that way on all platforms. SGI's SN2 does that 
too.




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
  2007-02-07 10:07   ` KAMEZAWA Hiroyuki
  2007-02-07 10:19     ` Andi Kleen
@ 2007-02-07 14:05     ` Christoph Lameter
  2007-02-07 15:28       ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2007-02-07 14:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Andi Kleen, linux-kernel, y-goto, clameter, akpm

On Wed, 7 Feb 2007, KAMEZAWA Hiroyuki wrote:

> > IMHO there shouldn't be any memory less nodes. The architecture code
> > should not create them. The CPU should be assigned to a nearby node instead.
> > At least x86-64 ensures that.
> > 
> AFAIK, ia64 creates nodes just depends on SRAT's possible resource information.
> Then, ia64 can create cpu-memory-less-node(node with no available resource.).
> (*)I don't like this.

I think that is only true for !SN2 platforms? Could we fix this?

> If we don't allow memory-less-node, we may have to add several codes for cpu-hot-add.
> cpus should be moved to nearby node at hotadd .
> And node-hot-add have to care that cpus mustn't be added before memory, cpu-driven 
> node-hot-add will never occur. (ACPI's 'container' device spec can't guaranntee this.)

Well you could bring down the cpu and bring it up again? This would also 
assure the best placement of the runtime structures for node?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
  2007-02-07 14:05     ` Christoph Lameter
@ 2007-02-07 15:28       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-02-07 15:28 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: ak, linux-kernel, y-goto, clameter, akpm

On Wed, 7 Feb 2007 06:05:56 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> On Wed, 7 Feb 2007, KAMEZAWA Hiroyuki wrote:
> 
> > > IMHO there shouldn't be any memory less nodes. The architecture code
> > > should not create them. The CPU should be assigned to a nearby node instead.
> > > At least x86-64 ensures that.
> > > 
> > AFAIK, ia64 creates nodes just depends on SRAT's possible resource information.
> > Then, ia64 can create cpu-memory-less-node(node with no available resource.).
> > (*)I don't like this.
> 
> I think that is only true for !SN2 platforms? Could we fix this?
> 
AFAIK, some vendor(HP?) has following configraion
- node0 .... cpu only node
- node1 .... cpu only node
- node2 .... memory only node.
This is because of their memory-interleave technique.

Our 64cpu socket NUMA system also has a config
- node0 cpu+memory node
- node 1 - 7 cpu only node.
for deviding scheduler domain.(old kernel had problem with big-sched-domain)

To fix memory-less-node, we have to test the performance of
"very-big-scheduler-domain" and to define the rule for cpu-hot-add, as
"a new cpu will be added to the most nearby node" 
(node-hot-add will have to add some hook..)

I don't know someone who created memory-less-node in past may have some other issues.

There may be some complicated topology system with complicated PXM map.


> > If we don't allow memory-less-node, we may have to add several codes for cpu-hot-add.
> > cpus should be moved to nearby node at hotadd .
> > And node-hot-add have to care that cpus mustn't be added before memory, cpu-driven 
> > node-hot-add will never occur. (ACPI's 'container' device spec can't guaranntee this.)
> 
> Well you could bring down the cpu and bring it up again? This would also 
> assure the best placement of the runtime structures for node?
> 
cpu-to-node relationship is fixed in the early stage of cpu hotplug.
I'm not sure we can bring down/up cpu again in clean way. After a cpu is added,
the kernel losts its original PXM value now.

about runtime structures:
The runtime structure placement for a hot-added-node is another issue here.
I and Goto-san have a plan for optimized placement of structures and will 
try when we can do. (We are now assgined to RHEL5 stabilization tasks...)

Moving per-cpu-area at hotadd does not look easy.
IMHO, maybe we have to use stop_machine_run() to move it.

Anyway, I'll post an another *easy* patch just for fix the NULL pointer access.
please review.

Thanks,
-Kame




 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
  2007-02-07 10:20 ` Andi Kleen
  2007-02-07 10:07   ` KAMEZAWA Hiroyuki
  2007-02-07 14:03   ` Christoph Lameter
@ 2007-02-07 16:23   ` Andrew Morton
  2007-02-07 16:50     ` Andi Kleen
  2 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2007-02-07 16:23 UTC (permalink / raw)
  To: Andi Kleen; +Cc: KAMEZAWA Hiroyuki, LKML, GOTO, Christoph Lameter

On 07 Feb 2007 11:20:06 +0100 Andi Kleen <ak@suse.de> wrote:

> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
> 
> > current mempolicy just checks whether a node is online or not.
> > If there is memory-less-node, mempolicy's target node can be
> > invalid.
> > This patch adds a check whether a node has memory or not.
> 
> IMHO there shouldn't be any memory less nodes. The architecture code
> should not create them. The CPU should be assigned to a nearby node instead.

umm, why?

A node which has CPUs and no memory is obviously physically possible and
isn't a completely insane thing for a user to do.  I'd have thought that
the kernel should be able to cleanly and clearly handle it, and to
accurately present the machine's topology to the user without us having to
go adding falsehoods like this?

> At least x86-64 ensures that.
> 
> -Andi

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
  2007-02-07 16:23   ` Andrew Morton
@ 2007-02-07 16:50     ` Andi Kleen
  2007-02-07 17:43       ` Andrew Morton
  0 siblings, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2007-02-07 16:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: KAMEZAWA Hiroyuki, LKML, GOTO, Christoph Lameter

On Wednesday 07 February 2007 17:23, Andrew Morton wrote:
> On 07 Feb 2007 11:20:06 +0100 Andi Kleen <ak@suse.de> wrote:
> 
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
> > 
> > > current mempolicy just checks whether a node is online or not.
> > > If there is memory-less-node, mempolicy's target node can be
> > > invalid.
> > > This patch adds a check whether a node has memory or not.
> > 
> > IMHO there shouldn't be any memory less nodes. The architecture code
> > should not create them. The CPU should be assigned to a nearby node instead.
> 
> umm, why?
> 
> A node which has CPUs and no memory is obviously physically possible and
> isn't a completely insane thing for a user to do.  I'd have thought that
> the kernel should be able to cleanly and clearly handle it,

It doesn't.

> and to 
> accurately present the machine's topology to the user without us having to
> go adding falsehoods like this?

a node is a piece of memory. Without memory it doesn't make sense.

-Andi
 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
  2007-02-07 16:50     ` Andi Kleen
@ 2007-02-07 17:43       ` Andrew Morton
  2007-02-07 18:15         ` [PATCH] FS : Speedup rw_verify_area() Eric Dumazet
                           ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Andrew Morton @ 2007-02-07 17:43 UTC (permalink / raw)
  To: Andi Kleen; +Cc: KAMEZAWA Hiroyuki, LKML, GOTO, Christoph Lameter

On Wed, 7 Feb 2007 17:50:55 +0100 Andi Kleen <ak@suse.de> wrote:

> On Wednesday 07 February 2007 17:23, Andrew Morton wrote:
> > On 07 Feb 2007 11:20:06 +0100 Andi Kleen <ak@suse.de> wrote:
> > 
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
> > > 
> > > > current mempolicy just checks whether a node is online or not.
> > > > If there is memory-less-node, mempolicy's target node can be
> > > > invalid.
> > > > This patch adds a check whether a node has memory or not.
> > > 
> > > IMHO there shouldn't be any memory less nodes. The architecture code
> > > should not create them. The CPU should be assigned to a nearby node instead.
> > 
> > umm, why?
> > 
> > A node which has CPUs and no memory is obviously physically possible and
> > isn't a completely insane thing for a user to do.  I'd have thought that
> > the kernel should be able to cleanly and clearly handle it,
> 
> It doesn't.

Fix it?

> > and to 
> > accurately present the machine's topology to the user without us having to
> > go adding falsehoods like this?
> 
> a node is a piece of memory. Without memory it doesn't make sense.

Who said?  I can pick up a piece of circuitry which has four CPUs and no
RAM, wave it about then stick it in a computer.  The kernel is just wrong,
surely?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH] FS : Speedup rw_verify_area()
  2007-02-07 17:43       ` Andrew Morton
@ 2007-02-07 18:15         ` Eric Dumazet
  2007-02-08  0:37         ` [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node KAMEZAWA Hiroyuki
  2007-02-08 19:09         ` Christoph Lameter
  2 siblings, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2007-02-07 18:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1138 bytes --]

oprofile hunting showed a stall in rw_verify_area(), because of triple 
indirection and potential cache misses.
(file->f_path.dentry->d_inode->i_flock)

By moving initialization of 'struct inode' pointer before the pos/count sanity 
tests, we allow the compiler and processor to perform two loads by 
anticipation, reducing stall, without prefetch() hints. Even x86 arch has 
enough registers to not use temporary variables and not increase text size.

I validated this patch running a bench and studied oprofile changes, and 
absolute perf of the test program.

Results of my epoll_pipe_bench (source available on request) on a Pentium-M  
1.6 GHz machine

Before :
# ./epoll_pipe_bench -l 30 -t 20
Avg: 436089 evts/sec read_count=8843037 write_count=8843040 21.218390 samples 
per call
(best value out of 10 runs)

After :
# ./epoll_pipe_bench -l 30 -t 20
Avg: 470980 evts/sec read_count=9549871 write_count=9549894 21.216694 samples 
per call
(best value out of 10 runs)

oprofile CPU_CLK_UNHALTED events gave a reduction from 5.3401 % to 2.5851 % 
for the rw_verify_area() function.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

[-- Attachment #2: rw_verify_area.patch --]
[-- Type: text/plain, Size: 628 bytes --]

--- linux-2.6.20/fs/read_write.c	2007-02-07 18:21:59.000000000 +0100
+++ linux-2.6.20-ed/fs/read_write.c	2007-02-07 18:23:41.000000000 +0100
@@ -197,13 +197,13 @@ int rw_verify_area(int read_write, struc
 	struct inode *inode;
 	loff_t pos;
 
+	inode = file->f_path.dentry->d_inode;
 	if (unlikely((ssize_t) count < 0))
 		goto Einval;
 	pos = *ppos;
 	if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
 		goto Einval;
 
-	inode = file->f_path.dentry->d_inode;
 	if (unlikely(inode->i_flock && MANDATORY_LOCK(inode))) {
 		int retval = locks_mandatory_area(
 			read_write == READ ? FLOCK_VERIFY_READ : FLOCK_VERIFY_WRITE,

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
  2007-02-07 17:43       ` Andrew Morton
  2007-02-07 18:15         ` [PATCH] FS : Speedup rw_verify_area() Eric Dumazet
@ 2007-02-08  0:37         ` KAMEZAWA Hiroyuki
  2007-02-08 19:09         ` Christoph Lameter
  2 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-02-08  0:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: ak, linux-kernel, y-goto, clameter

On Wed, 7 Feb 2007 09:43:44 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> > > and to 
> > > accurately present the machine's topology to the user without us having to
> > > go adding falsehoods like this?
> > 
> > a node is a piece of memory. Without memory it doesn't make sense.
> 
> Who said?  I can pick up a piece of circuitry which has four CPUs and no
> RAM, wave it about then stick it in a computer.  The kernel is just wrong,
> surely?
> 
As far as I remember, this was the first e-mail which tells me that there are
- cpu only node
- device only node
- memory only node
during node-hot-plug discussion.

http://www.gelato.unsw.edu.au/archives/linux-ia64/0405/9679.html

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
  2007-02-07  8:36         ` [2.6.20][PATCH] fix mempolicy error check on a system with KAMEZAWA Hiroyuki
@ 2007-02-08 11:49           ` Bob Picco
  -1 siblings, 0 replies; 28+ messages in thread
From: Bob Picco @ 2007-02-08 11:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Christoph Lameter, linux-kernel, y-goto, clameter, akpm, linux-ia64

Hiroyuki KAMEZAWA wrote:	[Wed Feb 07 2007, 03:36:47AM EST]
> On Wed, 7 Feb 2007 00:04:41 -0800 (PST)
> Christoph Lameter <clameter@sgi.com> wrote:
> 
> > On Wed, 7 Feb 2007, KAMEZAWA Hiroyuki wrote:
> > 
> > > > Hmmm... Remove the node from the node_online_map instead?
> > > > 
> > > Changing defintion of node_online_map is harmfil. (there are cpu-only-nodes.)
> > > How about adding  nodemask for nodes equips memory ?
> > 
> > Ok that is better but...
> > 
> > Would it be possible to attach the cpus to the 
> > next nodes with memory and mark the node offline? That way we could avoid 
> > another mask that we constantly have to check?
> > 
> Added ia64 list to CC.
> I know ia64 kernel did what you say in old days (I know RHEL4/2.6.9 kernel does it).
> Someone changed it and created cpu-only-node for some purpose, I don't know why.
That was me. It will probably be later today or Friday before I've had
time to review the code.  For reference look for string memory_less in
arch/ia64/mm/discontig.c.

The short story is HP ships NUMA boxes with interleaved memory only by
default which is represented by a single memory only node. Originally all
the CPU nodes where assigned to the memory node. The code was very
complicated and incorrect to me. Subsequently, and what we have now, the CPU
only nodes are revealed and the memory only node too. I do believe that a
cpu only nodes should be possible but now there seems to be a new issue.

bob
> 
> 
> > Or fix the location where the error occurred to be able to tolerate a node 
> > with no zones?
> > 
> Hmm, 
> In this case, MPOL_MBIND, the user requests to allocate memory from specified nodes.
> I think it's better to tell him "you can't do that" than silently allocating memory
> from other places.
> 
> -Kame
> 
> 
> 
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
@ 2007-02-08 11:49           ` Bob Picco
  0 siblings, 0 replies; 28+ messages in thread
From: Bob Picco @ 2007-02-08 11:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Christoph Lameter, linux-kernel, y-goto, clameter, akpm, linux-ia64

Hiroyuki KAMEZAWA wrote:	[Wed Feb 07 2007, 03:36:47AM EST]
> On Wed, 7 Feb 2007 00:04:41 -0800 (PST)
> Christoph Lameter <clameter@sgi.com> wrote:
> 
> > On Wed, 7 Feb 2007, KAMEZAWA Hiroyuki wrote:
> > 
> > > > Hmmm... Remove the node from the node_online_map instead?
> > > > 
> > > Changing defintion of node_online_map is harmfil. (there are cpu-only-nodes.)
> > > How about adding  nodemask for nodes equips memory ?
> > 
> > Ok that is better but...
> > 
> > Would it be possible to attach the cpus to the 
> > next nodes with memory and mark the node offline? That way we could avoid 
> > another mask that we constantly have to check?
> > 
> Added ia64 list to CC.
> I know ia64 kernel did what you say in old days (I know RHEL4/2.6.9 kernel does it).
> Someone changed it and created cpu-only-node for some purpose, I don't know why.
That was me. It will probably be later today or Friday before I've had
time to review the code.  For reference look for string memory_less in
arch/ia64/mm/discontig.c.

The short story is HP ships NUMA boxes with interleaved memory only by
default which is represented by a single memory only node. Originally all
the CPU nodes where assigned to the memory node. The code was very
complicated and incorrect to me. Subsequently, and what we have now, the CPU
only nodes are revealed and the memory only node too. I do believe that a
cpu only nodes should be possible but now there seems to be a new issue.

bob
> 
> 
> > Or fix the location where the error occurred to be able to tolerate a node 
> > with no zones?
> > 
> Hmm, 
> In this case, MPOL_MBIND, the user requests to allocate memory from specified nodes.
> I think it's better to tell him "you can't do that" than silently allocating memory
> from other places.
> 
> -Kame
> 
> 
> 
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
  2007-02-07 17:43       ` Andrew Morton
  2007-02-07 18:15         ` [PATCH] FS : Speedup rw_verify_area() Eric Dumazet
  2007-02-08  0:37         ` [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node KAMEZAWA Hiroyuki
@ 2007-02-08 19:09         ` Christoph Lameter
  2007-02-08 19:26           ` Andrew Morton
  2 siblings, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2007-02-08 19:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, KAMEZAWA Hiroyuki, LKML, GOTO, Christoph Lameter

On Wed, 7 Feb 2007, Andrew Morton wrote:

> > > A node which has CPUs and no memory is obviously physically possible and
> > > isn't a completely insane thing for a user to do.  I'd have thought that
> > > the kernel should be able to cleanly and clearly handle it,
> > 
> > It doesn't.
> 
> Fix it?

Why? It is a bad idea.
 
> > > and to 
> > > accurately present the machine's topology to the user without us having to
> > > go adding falsehoods like this?
> > 
> > a node is a piece of memory. Without memory it doesn't make sense.
> 
> Who said?  I can pick up a piece of circuitry which has four CPUs and no
> RAM, wave it about then stick it in a computer.  The kernel is just wrong,
> surely?

Surely your computer has some memory so attach it to that memory (which 
in a NUMA system would be one or the other node).

Cpu only "nodes" would mean that all memory would be off node. Meaning 
whatever interconnect one has would be heavily used. Operating system and 
application performance will suffer.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
  2007-02-08 19:09         ` Christoph Lameter
@ 2007-02-08 19:26           ` Andrew Morton
  2007-02-08 19:35             ` Christoph Lameter
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2007-02-08 19:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, KAMEZAWA Hiroyuki, LKML, GOTO, Christoph Lameter

On Thu, 8 Feb 2007 11:09:40 -0800 (PST) Christoph Lameter <clameter@sgi.com> wrote:

> > > > and to 
> > > > accurately present the machine's topology to the user without us having to
> > > > go adding falsehoods like this?
> > > 
> > > a node is a piece of memory. Without memory it doesn't make sense.
> > 
> > Who said?  I can pick up a piece of circuitry which has four CPUs and no
> > RAM, wave it about then stick it in a computer.  The kernel is just wrong,
> > surely?
> 
> Surely your computer has some memory so attach it to that memory (which 
> in a NUMA system would be one or the other node).

"attach it".  But it _isn't_ attached.  There is no memory on this node. 
We seem to be saying that we should misrepresent the physical topology
because the kernel doesn't handle it appropriately.

> Cpu only "nodes" would mean that all memory would be off node. Meaning 
> whatever interconnect one has would be heavily used. Operating system and 
> application performance will suffer.

>From this a logical step would be to change the kernel to refuse to bring
memoryless nodes online at all.

If that's not an approproate solution, then there must be a legtimate
reason for using memoryless nodes.  

Which is it?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node
  2007-02-08 19:26           ` Andrew Morton
@ 2007-02-08 19:35             ` Christoph Lameter
  0 siblings, 0 replies; 28+ messages in thread
From: Christoph Lameter @ 2007-02-08 19:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, KAMEZAWA Hiroyuki, LKML, GOTO, Christoph Lameter

On Thu, 8 Feb 2007, Andrew Morton wrote:

> > Surely your computer has some memory so attach it to that memory (which 
> > in a NUMA system would be one or the other node).
> 
> "attach it".  But it _isn't_ attached.  There is no memory on this node. 
> We seem to be saying that we should misrepresent the physical topology
> because the kernel doesn't handle it appropriately.

I think we are have a disagreement on what node means. Andi is saying that 
a node is bound to memory. The affinity of that chunk of memory has to be 
considered by the Kernel. If there is no memory then there is no memory 
management issue. This memory may have assigned I/O devices and cpus that 
use that memory for their operations. I think this is the working 
definition of the kernel.

You are saying a node means something that is connected to a NUMA 
interlink that may have cpus / memory / IO devices.

It seems that both definitions have been used.

> > Cpu only "nodes" would mean that all memory would be off node. Meaning 
> > whatever interconnect one has would be heavily used. Operating system and 
> > application performance will suffer.
> 
> >From this a logical step would be to change the kernel to refuse to bring
> memoryless nodes online at all.

This is the case.

Arch code typically ignores memory less nodes and assigns processors and 
I/O devices with no memory to the next node. F.e. the SGI I/O "node" only 
exists on the arch level. They are mapped to the next memory node by 
arch coide and the locality information returned to the Kernel for devices 
on that I/O node is the next memory node.


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2007-02-08 19:35 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-02-06 11:23 [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node KAMEZAWA Hiroyuki
2007-02-06 17:26 ` Christoph Lameter
2007-02-07  1:14   ` KAMEZAWA Hiroyuki
2007-02-07  8:04     ` Christoph Lameter
2007-02-07  8:36       ` KAMEZAWA Hiroyuki
2007-02-07  8:36         ` [2.6.20][PATCH] fix mempolicy error check on a system with KAMEZAWA Hiroyuki
2007-02-08 11:49         ` [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node Bob Picco
2007-02-08 11:49           ` Bob Picco
2007-02-07 10:20       ` Andi Kleen
2007-02-07 10:20 ` Andi Kleen
2007-02-07 10:07   ` KAMEZAWA Hiroyuki
2007-02-07 10:19     ` Andi Kleen
2007-02-07 10:37       ` KAMEZAWA Hiroyuki
2007-02-07 10:41         ` Andi Kleen
2007-02-07 10:49           ` KAMEZAWA Hiroyuki
2007-02-07 11:32             ` Andi Kleen
2007-02-07 12:27               ` KAMEZAWA Hiroyuki
2007-02-07 14:05     ` Christoph Lameter
2007-02-07 15:28       ` KAMEZAWA Hiroyuki
2007-02-07 14:03   ` Christoph Lameter
2007-02-07 16:23   ` Andrew Morton
2007-02-07 16:50     ` Andi Kleen
2007-02-07 17:43       ` Andrew Morton
2007-02-07 18:15         ` [PATCH] FS : Speedup rw_verify_area() Eric Dumazet
2007-02-08  0:37         ` [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node KAMEZAWA Hiroyuki
2007-02-08 19:09         ` Christoph Lameter
2007-02-08 19:26           ` Andrew Morton
2007-02-08 19:35             ` Christoph Lameter

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.