All of lore.kernel.org
 help / color / mirror / Atom feed
* [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
@ 2008-02-02  8:12 ` KOSAKI Motohiro
  0 siblings, 0 replies; 90+ messages in thread
From: KOSAKI Motohiro @ 2008-02-02  8:12 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton, Christoph Lameter, Andi Kleen
  Cc: kosaki.motohiro

Hi

I tested numactl on 2.6.24-rc8-mm1.
and I found strange behavior.

test method and result.

	$ numactl --interleave=all ls
	set_mempolicy: Invalid argument
	setting interleave mask: Invalid argument

numactl command download from
	ftp://ftp.suse.com/pub/people/ak/numa/
	(I choice numactl-1.0.2)


Of course, older kernel(RHEL5.1) works good.



more detail:

1. my machine node and memory.

$ numactl --hardware
available: 16 nodes (0-15)
node 0 size: 0 MB
node 0 free: 0 MB
node 1 size: 0 MB
node 1 free: 0 MB
node 2 size: 3872 MB
node 2 free: 1487 MB
node 3 size: 4032 MB
node 3 free: 3671 MB
node 4 size: 0 MB
node 4 free: 0 MB
node 5 size: 0 MB
node 5 free: 0 MB
node 6 size: 0 MB
node 6 free: 0 MB
node 7 size: 0 MB
node 7 free: 0 MB
node 8 size: 0 MB
node 8 free: 0 MB
node 9 size: 0 MB
node 9 free: 0 MB
node 10 size: 0 MB
node 10 free: 0 MB
node 11 size: 0 MB
node 11 free: 0 MB
node 12 size: 0 MB
node 12 free: 0 MB
node 13 size: 0 MB
node 13 free: 0 MB
node 14 size: 0 MB
node 14 free: 0 MB
node 15 size: 0 MB
node 15 free: 0 MB


2. numactl behavior of --interleave=all
   2.1  scan "/sys/devices/system/node" dir
   2.2  calculate max node number
   2.3  all bit turn on of existing node.
        (i.e. 0xFF generated on my environment.)
   2.4  call set_mempolicy()

3. 2.6.24-rc8-mm1 set_mempolicy(2) behavior
   3.1 check nodesubset(nodemask argument, node_states[N_HIGH_MEMORY])
       in mpol_check_policy()

	-> check failed when memmoryless node exist.
           (i.e. node_states[N_HIGH_MEMORY] of my machine is 0xc)

4. RHEL5.1 set_mempolicy(2) behavior
   4.1 check nodesubset(nodemask argument, node_online_map)
       in mpol_check_policy().

	-> check success.


I don't know wrong either kernel or libnuma.
Please any comments!


- kosaki




^ permalink raw reply	[flat|nested] 90+ messages in thread

* [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
@ 2008-02-02  8:12 ` KOSAKI Motohiro
  0 siblings, 0 replies; 90+ messages in thread
From: KOSAKI Motohiro @ 2008-02-02  8:12 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton, Christoph Lameter, Andi Kleen
  Cc: kosaki.motohiro

Hi

I tested numactl on 2.6.24-rc8-mm1.
and I found strange behavior.

test method and result.

	$ numactl --interleave=all ls
	set_mempolicy: Invalid argument
	setting interleave mask: Invalid argument

numactl command download from
	ftp://ftp.suse.com/pub/people/ak/numa/
	(I choice numactl-1.0.2)


Of course, older kernel(RHEL5.1) works good.



more detail:

1. my machine node and memory.

$ numactl --hardware
available: 16 nodes (0-15)
node 0 size: 0 MB
node 0 free: 0 MB
node 1 size: 0 MB
node 1 free: 0 MB
node 2 size: 3872 MB
node 2 free: 1487 MB
node 3 size: 4032 MB
node 3 free: 3671 MB
node 4 size: 0 MB
node 4 free: 0 MB
node 5 size: 0 MB
node 5 free: 0 MB
node 6 size: 0 MB
node 6 free: 0 MB
node 7 size: 0 MB
node 7 free: 0 MB
node 8 size: 0 MB
node 8 free: 0 MB
node 9 size: 0 MB
node 9 free: 0 MB
node 10 size: 0 MB
node 10 free: 0 MB
node 11 size: 0 MB
node 11 free: 0 MB
node 12 size: 0 MB
node 12 free: 0 MB
node 13 size: 0 MB
node 13 free: 0 MB
node 14 size: 0 MB
node 14 free: 0 MB
node 15 size: 0 MB
node 15 free: 0 MB


2. numactl behavior of --interleave=all
   2.1  scan "/sys/devices/system/node" dir
   2.2  calculate max node number
   2.3  all bit turn on of existing node.
        (i.e. 0xFF generated on my environment.)
   2.4  call set_mempolicy()

3. 2.6.24-rc8-mm1 set_mempolicy(2) behavior
   3.1 check nodesubset(nodemask argument, node_states[N_HIGH_MEMORY])
       in mpol_check_policy()

	-> check failed when memmoryless node exist.
           (i.e. node_states[N_HIGH_MEMORY] of my machine is 0xc)

4. RHEL5.1 set_mempolicy(2) behavior
   4.1 check nodesubset(nodemask argument, node_online_map)
       in mpol_check_policy().

	-> check success.


I don't know wrong either kernel or libnuma.
Please any comments!


- kosaki



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
  2008-02-02  8:12 ` KOSAKI Motohiro
@ 2008-02-02  9:09   ` Andi Kleen
  -1 siblings, 0 replies; 90+ messages in thread
From: Andi Kleen @ 2008-02-02  9:09 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, linux-kernel, Andrew Morton, Christoph Lameter,
	Andi Kleen, Lee.Schermerhorn

[intentional full quote]

On Sat, Feb 02, 2008 at 05:12:30PM +0900, KOSAKI Motohiro wrote:
> I tested numactl on 2.6.24-rc8-mm1.
> and I found strange behavior.
> 
> test method and result.
> 
> 	$ numactl --interleave=all ls
> 	set_mempolicy: Invalid argument
> 	setting interleave mask: Invalid argument
> 
> numactl command download from
> 	ftp://ftp.suse.com/pub/people/ak/numa/
> 	(I choice numactl-1.0.2)
> 
> 
> Of course, older kernel(RHEL5.1) works good.
> 
> 
> 
> more detail:
> 
> 1. my machine node and memory.
> 
> $ numactl --hardware
> available: 16 nodes (0-15)
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 size: 0 MB
> node 1 free: 0 MB
> node 2 size: 3872 MB
> node 2 free: 1487 MB
> node 3 size: 4032 MB
> node 3 free: 3671 MB
> node 4 size: 0 MB
> node 4 free: 0 MB
> node 5 size: 0 MB
> node 5 free: 0 MB
> node 6 size: 0 MB
> node 6 free: 0 MB
> node 7 size: 0 MB
> node 7 free: 0 MB
> node 8 size: 0 MB
> node 8 free: 0 MB
> node 9 size: 0 MB
> node 9 free: 0 MB
> node 10 size: 0 MB
> node 10 free: 0 MB
> node 11 size: 0 MB
> node 11 free: 0 MB
> node 12 size: 0 MB
> node 12 free: 0 MB
> node 13 size: 0 MB
> node 13 free: 0 MB
> node 14 size: 0 MB
> node 14 free: 0 MB
> node 15 size: 0 MB
> node 15 free: 0 MB
> 
> 
> 2. numactl behavior of --interleave=all
>    2.1  scan "/sys/devices/system/node" dir
>    2.2  calculate max node number
>    2.3  all bit turn on of existing node.
>         (i.e. 0xFF generated on my environment.)
>    2.4  call set_mempolicy()
> 
> 3. 2.6.24-rc8-mm1 set_mempolicy(2) behavior
>    3.1 check nodesubset(nodemask argument, node_states[N_HIGH_MEMORY])
>        in mpol_check_policy()
> 
> 	-> check failed when memmoryless node exist.
>            (i.e. node_states[N_HIGH_MEMORY] of my machine is 0xc)
> 
> 4. RHEL5.1 set_mempolicy(2) behavior
>    4.1 check nodesubset(nodemask argument, node_online_map)
>        in mpol_check_policy().
> 
> 	-> check success.
> 
> 
> I don't know wrong either kernel or libnuma.

When the kernel behaviour changes and breaks user space then the kernel
is usually wrong. Cc'ed Lee S. who maintains the kernel code now.

-Andi

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
@ 2008-02-02  9:09   ` Andi Kleen
  0 siblings, 0 replies; 90+ messages in thread
From: Andi Kleen @ 2008-02-02  9:09 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, linux-kernel, Andrew Morton, Christoph Lameter,
	Andi Kleen, Lee.Schermerhorn

[intentional full quote]

On Sat, Feb 02, 2008 at 05:12:30PM +0900, KOSAKI Motohiro wrote:
> I tested numactl on 2.6.24-rc8-mm1.
> and I found strange behavior.
> 
> test method and result.
> 
> 	$ numactl --interleave=all ls
> 	set_mempolicy: Invalid argument
> 	setting interleave mask: Invalid argument
> 
> numactl command download from
> 	ftp://ftp.suse.com/pub/people/ak/numa/
> 	(I choice numactl-1.0.2)
> 
> 
> Of course, older kernel(RHEL5.1) works good.
> 
> 
> 
> more detail:
> 
> 1. my machine node and memory.
> 
> $ numactl --hardware
> available: 16 nodes (0-15)
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 size: 0 MB
> node 1 free: 0 MB
> node 2 size: 3872 MB
> node 2 free: 1487 MB
> node 3 size: 4032 MB
> node 3 free: 3671 MB
> node 4 size: 0 MB
> node 4 free: 0 MB
> node 5 size: 0 MB
> node 5 free: 0 MB
> node 6 size: 0 MB
> node 6 free: 0 MB
> node 7 size: 0 MB
> node 7 free: 0 MB
> node 8 size: 0 MB
> node 8 free: 0 MB
> node 9 size: 0 MB
> node 9 free: 0 MB
> node 10 size: 0 MB
> node 10 free: 0 MB
> node 11 size: 0 MB
> node 11 free: 0 MB
> node 12 size: 0 MB
> node 12 free: 0 MB
> node 13 size: 0 MB
> node 13 free: 0 MB
> node 14 size: 0 MB
> node 14 free: 0 MB
> node 15 size: 0 MB
> node 15 free: 0 MB
> 
> 
> 2. numactl behavior of --interleave=all
>    2.1  scan "/sys/devices/system/node" dir
>    2.2  calculate max node number
>    2.3  all bit turn on of existing node.
>         (i.e. 0xFF generated on my environment.)
>    2.4  call set_mempolicy()
> 
> 3. 2.6.24-rc8-mm1 set_mempolicy(2) behavior
>    3.1 check nodesubset(nodemask argument, node_states[N_HIGH_MEMORY])
>        in mpol_check_policy()
> 
> 	-> check failed when memmoryless node exist.
>            (i.e. node_states[N_HIGH_MEMORY] of my machine is 0xc)
> 
> 4. RHEL5.1 set_mempolicy(2) behavior
>    4.1 check nodesubset(nodemask argument, node_online_map)
>        in mpol_check_policy().
> 
> 	-> check success.
> 
> 
> I don't know wrong either kernel or libnuma.

When the kernel behaviour changes and breaks user space then the kernel
is usually wrong. Cc'ed Lee S. who maintains the kernel code now.

-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
  2008-02-02  9:09   ` Andi Kleen
@ 2008-02-02  9:37     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 90+ messages in thread
From: KOSAKI Motohiro @ 2008-02-02  9:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: kosaki.motohiro, linux-mm, linux-kernel, Andrew Morton,
	Christoph Lameter, Lee.Schermerhorn

Hi Andi,

> > 3. 2.6.24-rc8-mm1 set_mempolicy(2) behavior
> >    3.1 check nodesubset(nodemask argument, node_states[N_HIGH_MEMORY])
> >        in mpol_check_policy()
> > 
> > 	-> check failed when memmoryless node exist.
> >            (i.e. node_states[N_HIGH_MEMORY] of my machine is 0xc)
> > 
> > 4. RHEL5.1 set_mempolicy(2) behavior
> >    4.1 check nodesubset(nodemask argument, node_online_map)
> >        in mpol_check_policy().
> > 
> > 	-> check success.
> > 
> > I don't know wrong either kernel or libnuma.
> 
> When the kernel behaviour changes and breaks user space then the kernel
> is usually wrong. Cc'ed Lee S. who maintains the kernel code now.

may be yes, may be no.

I have 1 simple question. 
Why do libnuma generate bitpattern of all bit on instead
check /sys/devices/system/node/has_high_memory nor 
check /sys/devices/system/node/online?

Do you know it?

and I made simple patch that has_high_memory exposed however CONFIG_HIGHMEM disabled.
if CONFIG_HIGHMEM disabled, the has_high_memory file show 
the same as the has_normal_memory.

may be, userland process should check has_high_memory file.

but, I am not confident.
Thanks.


- kosaki



Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

---
 drivers/base/node.c |    7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

Index: b/drivers/base/node.c
===================================================================
--- a/drivers/base/node.c       2008-02-02 17:52:32.000000000 +0900
+++ b/drivers/base/node.c       2008-02-02 18:32:38.000000000 +0900
@@ -276,7 +276,6 @@ static SYSDEV_CLASS_ATTR(has_normal_memo
                                                                        NULL);
 static SYSDEV_CLASS_ATTR(has_cpu, 0444, print_nodes_has_cpu, NULL);

-#ifdef CONFIG_HIGHMEM
 static ssize_t print_nodes_has_high_memory(struct sysdev_class *class,
                                                 char *buf)
 {
@@ -285,15 +284,11 @@ static ssize_t print_nodes_has_high_memo

 static SYSDEV_CLASS_ATTR(has_high_memory, 0444, print_nodes_has_high_memory,
                                                                         NULL);
-#endif
-
 struct sysdev_class_attribute *node_state_attr[] = {
        &attr_possible,
        &attr_online,
        &attr_has_normal_memory,
-#ifdef CONFIG_HIGHMEM
        &attr_has_high_memory,
-#endif
        &attr_has_cpu,
 };

@@ -302,7 +297,7 @@ static int node_states_init(void)
        int i;
        int err = 0;

-       for (i = 0;  i < NR_NODE_STATES; i++) {
+       for (i = 0;  i < ARRAY_SIZE(node_state_attr); i++) {
                int ret;
                ret = sysdev_class_create_file(&node_class, node_state_attr[i]);
                if (!err)


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
@ 2008-02-02  9:37     ` KOSAKI Motohiro
  0 siblings, 0 replies; 90+ messages in thread
From: KOSAKI Motohiro @ 2008-02-02  9:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: kosaki.motohiro, linux-mm, linux-kernel, Andrew Morton,
	Christoph Lameter, Lee.Schermerhorn

Hi Andi,

> > 3. 2.6.24-rc8-mm1 set_mempolicy(2) behavior
> >    3.1 check nodesubset(nodemask argument, node_states[N_HIGH_MEMORY])
> >        in mpol_check_policy()
> > 
> > 	-> check failed when memmoryless node exist.
> >            (i.e. node_states[N_HIGH_MEMORY] of my machine is 0xc)
> > 
> > 4. RHEL5.1 set_mempolicy(2) behavior
> >    4.1 check nodesubset(nodemask argument, node_online_map)
> >        in mpol_check_policy().
> > 
> > 	-> check success.
> > 
> > I don't know wrong either kernel or libnuma.
> 
> When the kernel behaviour changes and breaks user space then the kernel
> is usually wrong. Cc'ed Lee S. who maintains the kernel code now.

may be yes, may be no.

I have 1 simple question. 
Why do libnuma generate bitpattern of all bit on instead
check /sys/devices/system/node/has_high_memory nor 
check /sys/devices/system/node/online?

Do you know it?

and I made simple patch that has_high_memory exposed however CONFIG_HIGHMEM disabled.
if CONFIG_HIGHMEM disabled, the has_high_memory file show 
the same as the has_normal_memory.

may be, userland process should check has_high_memory file.

but, I am not confident.
Thanks.


- kosaki



Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

---
 drivers/base/node.c |    7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

Index: b/drivers/base/node.c
===================================================================
--- a/drivers/base/node.c       2008-02-02 17:52:32.000000000 +0900
+++ b/drivers/base/node.c       2008-02-02 18:32:38.000000000 +0900
@@ -276,7 +276,6 @@ static SYSDEV_CLASS_ATTR(has_normal_memo
                                                                        NULL);
 static SYSDEV_CLASS_ATTR(has_cpu, 0444, print_nodes_has_cpu, NULL);

-#ifdef CONFIG_HIGHMEM
 static ssize_t print_nodes_has_high_memory(struct sysdev_class *class,
                                                 char *buf)
 {
@@ -285,15 +284,11 @@ static ssize_t print_nodes_has_high_memo

 static SYSDEV_CLASS_ATTR(has_high_memory, 0444, print_nodes_has_high_memory,
                                                                         NULL);
-#endif
-
 struct sysdev_class_attribute *node_state_attr[] = {
        &attr_possible,
        &attr_online,
        &attr_has_normal_memory,
-#ifdef CONFIG_HIGHMEM
        &attr_has_high_memory,
-#endif
        &attr_has_cpu,
 };

@@ -302,7 +297,7 @@ static int node_states_init(void)
        int i;
        int err = 0;

-       for (i = 0;  i < NR_NODE_STATES; i++) {
+       for (i = 0;  i < ARRAY_SIZE(node_state_attr); i++) {
                int ret;
                ret = sysdev_class_create_file(&node_class, node_state_attr[i]);
                if (!err)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
  2008-02-02  9:37     ` KOSAKI Motohiro
@ 2008-02-02 11:30       ` Andi Kleen
  -1 siblings, 0 replies; 90+ messages in thread
From: Andi Kleen @ 2008-02-02 11:30 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andi Kleen, linux-mm, linux-kernel, Andrew Morton,
	Christoph Lameter, Lee.Schermerhorn

> I have 1 simple question. 
> Why do libnuma generate bitpattern of all bit on instead
> check /sys/devices/system/node/has_high_memory nor 
> check /sys/devices/system/node/online?
> 
> Do you know it?

It's far simpler and cheaper (sysfs is expensive) to do this in the kernel 
and besides the kernel can do more easily keep up with dynamic topology
changes.

> 
> and I made simple patch that has_high_memory exposed however CONFIG_HIGHMEM disabled.
> if CONFIG_HIGHMEM disabled, the has_high_memory file show 
> the same as the has_normal_memory.
> 
> may be, userland process should check has_high_memory file.

To be honest I've never tried seriously to make 32bit NUMA policy
(with highmem) work well; just kept it at a "should not break"
level. That is because with highmem the kernel's choices at 
placing memory are seriously limited anyways so I doubt 32bit
NUMA will ever work very well.

-Andi


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
@ 2008-02-02 11:30       ` Andi Kleen
  0 siblings, 0 replies; 90+ messages in thread
From: Andi Kleen @ 2008-02-02 11:30 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andi Kleen, linux-mm, linux-kernel, Andrew Morton,
	Christoph Lameter, Lee.Schermerhorn

> I have 1 simple question. 
> Why do libnuma generate bitpattern of all bit on instead
> check /sys/devices/system/node/has_high_memory nor 
> check /sys/devices/system/node/online?
> 
> Do you know it?

It's far simpler and cheaper (sysfs is expensive) to do this in the kernel 
and besides the kernel can do more easily keep up with dynamic topology
changes.

> 
> and I made simple patch that has_high_memory exposed however CONFIG_HIGHMEM disabled.
> if CONFIG_HIGHMEM disabled, the has_high_memory file show 
> the same as the has_normal_memory.
> 
> may be, userland process should check has_high_memory file.

To be honest I've never tried seriously to make 32bit NUMA policy
(with highmem) work well; just kept it at a "should not break"
level. That is because with highmem the kernel's choices at 
placing memory are seriously limited anyways so I doubt 32bit
NUMA will ever work very well.

-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
  2008-02-02  9:37     ` KOSAKI Motohiro
@ 2008-02-04 18:20       ` Lee Schermerhorn
  -1 siblings, 0 replies; 90+ messages in thread
From: Lee Schermerhorn @ 2008-02-04 18:20 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andi Kleen, linux-mm, linux-kernel, Andrew Morton,
	Christoph Lameter, Paul Jackson, David Rientjes, Mel Gorman

On Sat, 2008-02-02 at 18:37 +0900, KOSAKI Motohiro wrote:
> Hi Andi,
> 
> > > 3. 2.6.24-rc8-mm1 set_mempolicy(2) behavior
> > >    3.1 check nodesubset(nodemask argument, node_states[N_HIGH_MEMORY])
> > >        in mpol_check_policy()
> > > 
> > > 	-> check failed when memmoryless node exist.
> > >            (i.e. node_states[N_HIGH_MEMORY] of my machine is 0xc)
> > > 
> > > 4. RHEL5.1 set_mempolicy(2) behavior
> > >    4.1 check nodesubset(nodemask argument, node_online_map)
> > >        in mpol_check_policy().
> > > 
> > > 	-> check success.
> > > 
> > > I don't know wrong either kernel or libnuma.
> > 
> > When the kernel behaviour changes and breaks user space then the kernel
> > is usually wrong. Cc'ed Lee S. who maintains the kernel code now.

The memoryless nodes patch series changed a lot of things, so just
reverting this one area [mpol_check_policy()] probably won't restore the
prior behavior.  A fully populated node mask is not necessarily a proper
subset of node_online_map().  And contextualize_policy() also requires
the mask to be a subset of mems_allowed which also defaults to nodes
with memory.

I don't know how Mel Gorman's "two zonelist" series, which is still
awaiting a window into the -mm tree, affects this behavior.  Those
patches will certainly be affected by whatever we decide here.

I don't know the current state of Paul's rework of cpusets and
mems_allowed.  That probably resolves this issue, if he still plans on
allowing a fully populated mask to indicate interleaving over all
allowed nodes.

I have a patch that takes a different approach to "interleave=all" that
doesn't solve Paul's and David's requirements.  I also have patches to
libnuma and numactl that work with my patches, but I saw no sense in
posting them unless my kernel patches got some traction.  If interested,
you can find them at:

 http://free.linux.hp.com/~lts/Patches/Numactl/



 
> 
> may be yes, may be no.
> 
> I have 1 simple question. 
> Why do libnuma generate bitpattern of all bit on instead
> check /sys/devices/system/node/has_high_memory nor 
> check /sys/devices/system/node/online?
> 
> Do you know it?

In addition to Andi's answer about simplicity, libnuma and numactl
predate the sysfs node masks.  There was no way to query what the valid
set of nodes would be, but the kernel allowed a fully populated map.  We
broke that with the memoryless nodes rework.

> 
> and I made simple patch that has_high_memory exposed however CONFIG_HIGHMEM disabled.
> if CONFIG_HIGHMEM disabled, the has_high_memory file show 
> the same as the has_normal_memory.
> may be, userland process should check has_high_memory file.
> 
> but, I am not confident.


Regarding the patch itself:  If others have no problems with displaying
a "has_high_memory" node mask for systems w/o HIGH_MEM configured, I can
live with it.  

The current upstream kernel [2.6.24] supports a MPOL_MEMS_ALLOWED flag
to get_mempolicy() to return the nodes allowed in the caller's cpuset.
My numactl patches, mentioned above, support this.

However, as Andi says, we really can't break application behavior.  All
applications that use mempolicy don't necessarily use libnuma APIs.  So,
a fully populated interleave node mask should be allowed and should
probably mean "all allowed nodes with memory". 

I think we'd still need to reduce the interleave policy mask to nodes
with memory when it's installed or find another way to skip memoryless
nodes when interleaving, else we don't get even distribution of
interleaved pages over the nodes that do have memory.  This is one of
the memoryless nodes fixes.  I THINK this is one of the areas that Paul
and David are investigating.

Christoph, Mel, Paul:  any suggestions for a [relatively quick] fix that
doesn't break the memoryless nodes work and doesn't violate cpuset
constraints?



> Thanks.
> 
> 
> - kosaki
> 
> 
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> ---
>  drivers/base/node.c |    7 +------
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> Index: b/drivers/base/node.c
> ===================================================================
> --- a/drivers/base/node.c       2008-02-02 17:52:32.000000000 +0900
> +++ b/drivers/base/node.c       2008-02-02 18:32:38.000000000 +0900
> @@ -276,7 +276,6 @@ static SYSDEV_CLASS_ATTR(has_normal_memo
>                                                                         NULL);
>  static SYSDEV_CLASS_ATTR(has_cpu, 0444, print_nodes_has_cpu, NULL);
> 
> -#ifdef CONFIG_HIGHMEM
>  static ssize_t print_nodes_has_high_memory(struct sysdev_class *class,
>                                                  char *buf)
>  {
> @@ -285,15 +284,11 @@ static ssize_t print_nodes_has_high_memo
> 
>  static SYSDEV_CLASS_ATTR(has_high_memory, 0444, print_nodes_has_high_memory,
>                                                                          NULL);
> -#endif
> -
>  struct sysdev_class_attribute *node_state_attr[] = {
>         &attr_possible,
>         &attr_online,
>         &attr_has_normal_memory,
> -#ifdef CONFIG_HIGHMEM
>         &attr_has_high_memory,
> -#endif
>         &attr_has_cpu,
>  };
> 
> @@ -302,7 +297,7 @@ static int node_states_init(void)
>         int i;
>         int err = 0;
> 
> -       for (i = 0;  i < NR_NODE_STATES; i++) {
> +       for (i = 0;  i < ARRAY_SIZE(node_state_attr); i++) {
>                 int ret;
>                 ret = sysdev_class_create_file(&node_class, node_state_attr[i]);
>                 if (!err)
> 


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
@ 2008-02-04 18:20       ` Lee Schermerhorn
  0 siblings, 0 replies; 90+ messages in thread
From: Lee Schermerhorn @ 2008-02-04 18:20 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andi Kleen, linux-mm, linux-kernel, Andrew Morton,
	Christoph Lameter, Paul Jackson, David Rientjes, Mel Gorman

On Sat, 2008-02-02 at 18:37 +0900, KOSAKI Motohiro wrote:
> Hi Andi,
> 
> > > 3. 2.6.24-rc8-mm1 set_mempolicy(2) behavior
> > >    3.1 check nodesubset(nodemask argument, node_states[N_HIGH_MEMORY])
> > >        in mpol_check_policy()
> > > 
> > > 	-> check failed when memmoryless node exist.
> > >            (i.e. node_states[N_HIGH_MEMORY] of my machine is 0xc)
> > > 
> > > 4. RHEL5.1 set_mempolicy(2) behavior
> > >    4.1 check nodesubset(nodemask argument, node_online_map)
> > >        in mpol_check_policy().
> > > 
> > > 	-> check success.
> > > 
> > > I don't know wrong either kernel or libnuma.
> > 
> > When the kernel behaviour changes and breaks user space then the kernel
> > is usually wrong. Cc'ed Lee S. who maintains the kernel code now.

The memoryless nodes patch series changed a lot of things, so just
reverting this one area [mpol_check_policy()] probably won't restore the
prior behavior.  A fully populated node mask is not necessarily a proper
subset of node_online_map().  And contextualize_policy() also requires
the mask to be a subset of mems_allowed which also defaults to nodes
with memory.

I don't know how Mel Gorman's "two zonelist" series, which is still
awaiting a window into the -mm tree, affects this behavior.  Those
patches will certainly be affected by whatever we decide here.

I don't know the current state of Paul's rework of cpusets and
mems_allowed.  That probably resolves this issue, if he still plans on
allowing a fully populated mask to indicate interleaving over all
allowed nodes.

I have a patch that takes a different approach to "interleave=all" that
doesn't solve Paul's and David's requirements.  I also have patches to
libnuma and numactl that work with my patches, but I saw no sense in
posting them unless my kernel patches got some traction.  If interested,
you can find them at:

 http://free.linux.hp.com/~lts/Patches/Numactl/



 
> 
> may be yes, may be no.
> 
> I have 1 simple question. 
> Why do libnuma generate bitpattern of all bit on instead
> check /sys/devices/system/node/has_high_memory nor 
> check /sys/devices/system/node/online?
> 
> Do you know it?

In addition to Andi's answer about simplicity, libnuma and numactl
predate the sysfs node masks.  There was no way to query what the valid
set of nodes would be, but the kernel allowed a fully populated map.  We
broke that with the memoryless nodes rework.

> 
> and I made simple patch that has_high_memory exposed however CONFIG_HIGHMEM disabled.
> if CONFIG_HIGHMEM disabled, the has_high_memory file show 
> the same as the has_normal_memory.
> may be, userland process should check has_high_memory file.
> 
> but, I am not confident.


Regarding the patch itself:  If others have no problems with displaying
a "has_high_memory" node mask for systems w/o HIGH_MEM configured, I can
live with it.  

The current upstream kernel [2.6.24] supports a MPOL_MEMS_ALLOWED flag
to get_mempolicy() to return the nodes allowed in the caller's cpuset.
My numactl patches, mentioned above, support this.

However, as Andi says, we really can't break application behavior.  All
applications that use mempolicy don't necessarily use libnuma APIs.  So,
a fully populated interleave node mask should be allowed and should
probably mean "all allowed nodes with memory". 

I think we'd still need to reduce the interleave policy mask to nodes
with memory when it's installed or find another way to skip memoryless
nodes when interleaving, else we don't get even distribution of
interleaved pages over the nodes that do have memory.  This is one of
the memoryless nodes fixes.  I THINK this is one of the areas that Paul
and David are investigating.

Christoph, Mel, Paul:  any suggestions for a [relatively quick] fix that
doesn't break the memoryless nodes work and doesn't violate cpuset
constraints?



> Thanks.
> 
> 
> - kosaki
> 
> 
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> ---
>  drivers/base/node.c |    7 +------
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> Index: b/drivers/base/node.c
> ===================================================================
> --- a/drivers/base/node.c       2008-02-02 17:52:32.000000000 +0900
> +++ b/drivers/base/node.c       2008-02-02 18:32:38.000000000 +0900
> @@ -276,7 +276,6 @@ static SYSDEV_CLASS_ATTR(has_normal_memo
>                                                                         NULL);
>  static SYSDEV_CLASS_ATTR(has_cpu, 0444, print_nodes_has_cpu, NULL);
> 
> -#ifdef CONFIG_HIGHMEM
>  static ssize_t print_nodes_has_high_memory(struct sysdev_class *class,
>                                                  char *buf)
>  {
> @@ -285,15 +284,11 @@ static ssize_t print_nodes_has_high_memo
> 
>  static SYSDEV_CLASS_ATTR(has_high_memory, 0444, print_nodes_has_high_memory,
>                                                                          NULL);
> -#endif
> -
>  struct sysdev_class_attribute *node_state_attr[] = {
>         &attr_possible,
>         &attr_online,
>         &attr_has_normal_memory,
> -#ifdef CONFIG_HIGHMEM
>         &attr_has_high_memory,
> -#endif
>         &attr_has_cpu,
>  };
> 
> @@ -302,7 +297,7 @@ static int node_states_init(void)
>         int i;
>         int err = 0;
> 
> -       for (i = 0;  i < NR_NODE_STATES; i++) {
> +       for (i = 0;  i < ARRAY_SIZE(node_state_attr); i++) {
>                 int ret;
>                 ret = sysdev_class_create_file(&node_class, node_state_attr[i]);
>                 if (!err)
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
  2008-02-02 11:30       ` Andi Kleen
@ 2008-02-04 19:03         ` Christoph Lameter
  -1 siblings, 0 replies; 90+ messages in thread
From: Christoph Lameter @ 2008-02-04 19:03 UTC (permalink / raw)
  To: Andi Kleen
  Cc: KOSAKI Motohiro, linux-mm, linux-kernel, Andrew Morton, Lee.Schermerhorn

On Sat, 2 Feb 2008, Andi Kleen wrote:

> To be honest I've never tried seriously to make 32bit NUMA policy
> (with highmem) work well; just kept it at a "should not break"
> level. That is because with highmem the kernel's choices at 
> placing memory are seriously limited anyways so I doubt 32bit
> NUMA will ever work very well.

Memory policies do not work reliably with config highmem (I have never 
seen such usage because large memory systems are typically 64 bit 
which have no highmem, but there are some 32bit numa uses of HIGHMEM) ....

Memory policies are only applied to the highest zone. So if a system has 
highmem on some nodes and not on the others then policies will only be 
applied if allocations happen to occur on the highmem nodes.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
@ 2008-02-04 19:03         ` Christoph Lameter
  0 siblings, 0 replies; 90+ messages in thread
From: Christoph Lameter @ 2008-02-04 19:03 UTC (permalink / raw)
  To: Andi Kleen
  Cc: KOSAKI Motohiro, linux-mm, linux-kernel, Andrew Morton, Lee.Schermerhorn

On Sat, 2 Feb 2008, Andi Kleen wrote:

> To be honest I've never tried seriously to make 32bit NUMA policy
> (with highmem) work well; just kept it at a "should not break"
> level. That is because with highmem the kernel's choices at 
> placing memory are seriously limited anyways so I doubt 32bit
> NUMA will ever work very well.

Memory policies do not work reliably with config highmem (I have never 
seen such usage because large memory systems are typically 64 bit 
which have no highmem, but there are some 32bit numa uses of HIGHMEM) ....

Memory policies are only applied to the highest zone. So if a system has 
highmem on some nodes and not on the others then policies will only be 
applied if allocations happen to occur on the highmem nodes.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't works on memoryless node.
  2008-02-04 18:20       ` Lee Schermerhorn
@ 2008-02-05  9:26         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 90+ messages in thread
From: KOSAKI Motohiro @ 2008-02-05  9:26 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: kosaki.motohiro, Andi Kleen, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter, Paul Jackson, David Rientjes,
	Mel Gorman, torvalds

Hi Lee-san

I change subject because 2.6.24 reproduce too.


> I have a patch that takes a different approach to "interleave=all" that
> doesn't solve Paul's and David's requirements.  I also have patches to
> libnuma and numactl that work with my patches, but I saw no sense in
> posting them unless my kernel patches got some traction.  If interested,
> you can find them at:
> 
>  http://free.linux.hp.com/~lts/Patches/Numactl/

unfortunately it doesn't works on my test environment ;-)

				numactl-orig		numactl-with-lee-patch
	2.6.24			  failed		  failed
	2.6.24-rc8-mm1		  failed		  failed


I got below error messages by all case.

	$ numactl --interleave=all ls
	set_mempolicy: Invalid argument
	setting interleave mask: Invalid argument


I think kernel is need changed too.
I attached bellow.
kernel2.6.24-rc8-mm1 + mypatch + numactl-1.0.2 + leepatch works good.

> > and I made simple patch that has_high_memory exposed however CONFIG_HIGHMEM disabled.
> > if CONFIG_HIGHMEM disabled, the has_high_memory file show 
> > the same as the has_normal_memory.
> > may be, userland process should check has_high_memory file.
> 
> Regarding the patch itself:  If others have no problems with displaying
> a "has_high_memory" node mask for systems w/o HIGH_MEM configured, I can
> live with it.  
> 
> The current upstream kernel [2.6.24] supports a MPOL_MEMS_ALLOWED flag
> to get_mempolicy() to return the nodes allowed in the caller's cpuset.
> My numactl patches, mentioned above, support this.

OK, I cancel my previous has_high_memory patch.
and, I understood anyone doesn't use 32bit numa.


> However, as Andi says, we really can't break application behavior.  All
> applications that use mempolicy don't necessarily use libnuma APIs.  So,
> a fully populated interleave node mask should be allowed and should
> probably mean "all allowed nodes with memory". 

Agreed.

> I think we'd still need to reduce the interleave policy mask to nodes
> with memory when it's installed or find another way to skip memoryless
> nodes when interleaving, else we don't get even distribution of
> interleaved pages over the nodes that do have memory.  This is one of
> the memoryless nodes fixes.  I THINK this is one of the areas that Paul
> and David are investigating.

this is good news for me :)
I'll wait his patch post.



Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

---
 mm/mempolicy.c |   11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

Index: b/mm/mempolicy.c
===================================================================
--- a/mm/mempolicy.c    2008-02-02 17:54:33.000000000 +0900
+++ b/mm/mempolicy.c    2008-02-05 17:49:47.000000000 +0900
@@ -187,9 +187,12 @@ static struct mempolicy *mpol_new(int mo
        atomic_set(&policy->refcnt, 1);
        switch (mode) {
        case MPOL_INTERLEAVE:
-               policy->v.nodes = *nodes;
-               nodes_and(policy->v.nodes, policy->v.nodes,
-                                       node_states[N_HIGH_MEMORY]);
+               if (nodes) {
+                       policy->v.nodes = *nodes;
+                       nodes_and(policy->v.nodes, policy->v.nodes,
+                                 node_states[N_HIGH_MEMORY]);
+               } else
+                       policy->v.nodes = node_states[N_HIGH_MEMORY];
                if (nodes_weight(policy->v.nodes) == 0) {
                        kmem_cache_free(policy_cache, policy);
                        return ERR_PTR(-EINVAL);
@@ -934,7 +937,7 @@ asmlinkage long sys_set_mempolicy(int mo
        err = get_nodes(&nodes, nmask, maxnode);
        if (err)
                return err;
-       return do_set_mempolicy(mode, &nodes);
+       return do_set_mempolicy(mode, nodes_empty(nodes) ? NULL : &nodes);
 }

 asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,




^ permalink raw reply	[flat|nested] 90+ messages in thread

* [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't works on memoryless node.
@ 2008-02-05  9:26         ` KOSAKI Motohiro
  0 siblings, 0 replies; 90+ messages in thread
From: KOSAKI Motohiro @ 2008-02-05  9:26 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: kosaki.motohiro, Andi Kleen, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter, Paul Jackson, David Rientjes,
	Mel Gorman, torvalds

Hi Lee-san

I change subject because 2.6.24 reproduce too.


> I have a patch that takes a different approach to "interleave=all" that
> doesn't solve Paul's and David's requirements.  I also have patches to
> libnuma and numactl that work with my patches, but I saw no sense in
> posting them unless my kernel patches got some traction.  If interested,
> you can find them at:
> 
>  http://free.linux.hp.com/~lts/Patches/Numactl/

unfortunately it doesn't works on my test environment ;-)

				numactl-orig		numactl-with-lee-patch
	2.6.24			  failed		  failed
	2.6.24-rc8-mm1		  failed		  failed


I got below error messages by all case.

	$ numactl --interleave=all ls
	set_mempolicy: Invalid argument
	setting interleave mask: Invalid argument


I think kernel is need changed too.
I attached bellow.
kernel2.6.24-rc8-mm1 + mypatch + numactl-1.0.2 + leepatch works good.

> > and I made simple patch that has_high_memory exposed however CONFIG_HIGHMEM disabled.
> > if CONFIG_HIGHMEM disabled, the has_high_memory file show 
> > the same as the has_normal_memory.
> > may be, userland process should check has_high_memory file.
> 
> Regarding the patch itself:  If others have no problems with displaying
> a "has_high_memory" node mask for systems w/o HIGH_MEM configured, I can
> live with it.  
> 
> The current upstream kernel [2.6.24] supports a MPOL_MEMS_ALLOWED flag
> to get_mempolicy() to return the nodes allowed in the caller's cpuset.
> My numactl patches, mentioned above, support this.

OK, I cancel my previous has_high_memory patch.
and, I understood anyone doesn't use 32bit numa.


> However, as Andi says, we really can't break application behavior.  All
> applications that use mempolicy don't necessarily use libnuma APIs.  So,
> a fully populated interleave node mask should be allowed and should
> probably mean "all allowed nodes with memory". 

Agreed.

> I think we'd still need to reduce the interleave policy mask to nodes
> with memory when it's installed or find another way to skip memoryless
> nodes when interleaving, else we don't get even distribution of
> interleaved pages over the nodes that do have memory.  This is one of
> the memoryless nodes fixes.  I THINK this is one of the areas that Paul
> and David are investigating.

this is good news for me :)
I'll wait his patch post.



Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

---
 mm/mempolicy.c |   11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

Index: b/mm/mempolicy.c
===================================================================
--- a/mm/mempolicy.c    2008-02-02 17:54:33.000000000 +0900
+++ b/mm/mempolicy.c    2008-02-05 17:49:47.000000000 +0900
@@ -187,9 +187,12 @@ static struct mempolicy *mpol_new(int mo
        atomic_set(&policy->refcnt, 1);
        switch (mode) {
        case MPOL_INTERLEAVE:
-               policy->v.nodes = *nodes;
-               nodes_and(policy->v.nodes, policy->v.nodes,
-                                       node_states[N_HIGH_MEMORY]);
+               if (nodes) {
+                       policy->v.nodes = *nodes;
+                       nodes_and(policy->v.nodes, policy->v.nodes,
+                                 node_states[N_HIGH_MEMORY]);
+               } else
+                       policy->v.nodes = node_states[N_HIGH_MEMORY];
                if (nodes_weight(policy->v.nodes) == 0) {
                        kmem_cache_free(policy_cache, policy);
                        return ERR_PTR(-EINVAL);
@@ -934,7 +937,7 @@ asmlinkage long sys_set_mempolicy(int mo
        err = get_nodes(&nodes, nmask, maxnode);
        if (err)
                return err;
-       return do_set_mempolicy(mode, &nodes);
+       return do_set_mempolicy(mode, nodes_empty(nodes) ? NULL : &nodes);
 }

 asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
  2008-02-04 18:20       ` Lee Schermerhorn
@ 2008-02-05 10:17         ` Paul Jackson
  -1 siblings, 0 replies; 90+ messages in thread
From: Paul Jackson @ 2008-02-05 10:17 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: kosaki.motohiro, andi, linux-mm, linux-kernel, akpm, clameter,
	rientjes, mel

Lee wrote:
> I don't know the current state of Paul's rework of cpusets and
> mems_allowed.  That probably resolves this issue, if he still plans on
> allowing a fully populated mask to indicate interleaving over all
> allowed nodes.

It got a bit stalled out for the last month (my employer had other
designs on my time.)  But I'd really like to drive it home.

What happened so far, in December 2007 and earlier, is that a few of us:

  David Rientjes <rientjes@google.com>
  Lee.Schermerhorn@hp.com
  Christoph Lameter <clameter@sgi.com>
  Andi Kleen <ak@suse.de>

had a discussion, motivated in good part by the need to allow a
mempolicy of MPOL_INTERLEAVE over all nodes currently available in
the cpuset, where that interleave policy was robustly preserved if
the cpuset changed (without requiring the application to somehow
"know" its cpuset had changed and reissuing the set_mempolicy call.)

But that discussion touched on some other long standing deficiencies
in the way that I had originally glued cpusets and memory policies
together.  The current mechanism doesn't handle changing cpusets very
well, especially if the number of nodes in the cpuset increases.

Obviously, I can't change the current behaviour, especially of the
mempolicy system calls.  I can only add new options that provide new
alternatives.

The patchset I'd like to drive home addresses these issues with a
couple of additional MPOL_* flags, upward compatible, that alter the
way that nodemasks are mapped into cpusets, and remapped if the cpuset
subsequently changes.

The next two steps I need to take are:
 1) propose this patch, with careful explanation (it's easy to lose
    one's bearings in the mappings and remappings of node numberings)
    to a wider audience, such as linux-mm or linux-kernel, and
 2) carefully test this, especially on each code path I touched in
    mm/mempolicy.c, where the changes were delicate, to ensure I
    didn't break any existing code.

There were also some other, smaller patches proposed, by myself and
others.  I was preferring to address a wider set of the long standing
issues in this area, but the others above mostly preferred the smaller
patches.  This needs to be discussed in a wider forum, and a concensus
reached.

Hopefully this week or next, I will publish this patch proposal.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
@ 2008-02-05 10:17         ` Paul Jackson
  0 siblings, 0 replies; 90+ messages in thread
From: Paul Jackson @ 2008-02-05 10:17 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: kosaki.motohiro, andi, linux-mm, linux-kernel, akpm, clameter,
	rientjes, mel

Lee wrote:
> I don't know the current state of Paul's rework of cpusets and
> mems_allowed.  That probably resolves this issue, if he still plans on
> allowing a fully populated mask to indicate interleaving over all
> allowed nodes.

It got a bit stalled out for the last month (my employer had other
designs on my time.)  But I'd really like to drive it home.

What happened so far, in December 2007 and earlier, is that a few of us:

  David Rientjes <rientjes@google.com>
  Lee.Schermerhorn@hp.com
  Christoph Lameter <clameter@sgi.com>
  Andi Kleen <ak@suse.de>

had a discussion, motivated in good part by the need to allow a
mempolicy of MPOL_INTERLEAVE over all nodes currently available in
the cpuset, where that interleave policy was robustly preserved if
the cpuset changed (without requiring the application to somehow
"know" its cpuset had changed and reissuing the set_mempolicy call.)

But that discussion touched on some other long standing deficiencies
in the way that I had originally glued cpusets and memory policies
together.  The current mechanism doesn't handle changing cpusets very
well, especially if the number of nodes in the cpuset increases.

Obviously, I can't change the current behaviour, especially of the
mempolicy system calls.  I can only add new options that provide new
alternatives.

The patchset I'd like to drive home addresses these issues with a
couple of additional MPOL_* flags, upward compatible, that alter the
way that nodemasks are mapped into cpusets, and remapped if the cpuset
subsequently changes.

The next two steps I need to take are:
 1) propose this patch, with careful explanation (it's easy to lose
    one's bearings in the mappings and remappings of node numberings)
    to a wider audience, such as linux-mm or linux-kernel, and
 2) carefully test this, especially on each code path I touched in
    mm/mempolicy.c, where the changes were delicate, to ensure I
    didn't break any existing code.

There were also some other, smaller patches proposed, by myself and
others.  I was preferring to address a wider set of the long standing
issues in this area, but the others above mostly preferred the smaller
patches.  This needs to be discussed in a wider forum, and a concensus
reached.

Hopefully this week or next, I will publish this patch proposal.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
  2008-02-05 10:17         ` Paul Jackson
@ 2008-02-05 11:14           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 90+ messages in thread
From: KOSAKI Motohiro @ 2008-02-05 11:14 UTC (permalink / raw)
  To: Paul Jackson
  Cc: kosaki.motohiro, Lee Schermerhorn, andi, linux-mm, linux-kernel,
	akpm, clameter, rientjes, mel

Hi Paul

> Hopefully this week or next, I will publish this patch proposal.

Great.

at that time, I will join review the patch with presure :)


- kosaki



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
@ 2008-02-05 11:14           ` KOSAKI Motohiro
  0 siblings, 0 replies; 90+ messages in thread
From: KOSAKI Motohiro @ 2008-02-05 11:14 UTC (permalink / raw)
  To: Paul Jackson
  Cc: kosaki.motohiro, Lee Schermerhorn, andi, linux-mm, linux-kernel,
	akpm, clameter, rientjes, mel

Hi Paul

> Hopefully this week or next, I will publish this patch proposal.

Great.

at that time, I will join review the patch with presure :)


- kosaki


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
  2008-02-04 18:20       ` Lee Schermerhorn
@ 2008-02-05 14:31         ` Mel Gorman
  -1 siblings, 0 replies; 90+ messages in thread
From: Mel Gorman @ 2008-02-05 14:31 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: KOSAKI Motohiro, Andi Kleen, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter, Paul Jackson, David Rientjes

On (04/02/08 13:20), Lee Schermerhorn didst pronounce:
> > > When the kernel behaviour changes and breaks user space then the kernel
> > > is usually wrong. Cc'ed Lee S. who maintains the kernel code now.
> 
> The memoryless nodes patch series changed a lot of things, so just
> reverting this one area [mpol_check_policy()] probably won't restore the
> prior behavior.  A fully populated node mask is not necessarily a proper
> subset of node_online_map().  And contextualize_policy() also requires
> the mask to be a subset of mems_allowed which also defaults to nodes
> with memory.
> 
> I don't know how Mel Gorman's "two zonelist" series, which is still
> awaiting a window into the -mm tree, affects this behavior.  Those
> patches will certainly be affected by whatever we decide here.
> 

I doubt they'd make a difference to this particular problem.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
@ 2008-02-05 14:31         ` Mel Gorman
  0 siblings, 0 replies; 90+ messages in thread
From: Mel Gorman @ 2008-02-05 14:31 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: KOSAKI Motohiro, Andi Kleen, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter, Paul Jackson, David Rientjes

On (04/02/08 13:20), Lee Schermerhorn didst pronounce:
> > > When the kernel behaviour changes and breaks user space then the kernel
> > > is usually wrong. Cc'ed Lee S. who maintains the kernel code now.
> 
> The memoryless nodes patch series changed a lot of things, so just
> reverting this one area [mpol_check_policy()] probably won't restore the
> prior behavior.  A fully populated node mask is not necessarily a proper
> subset of node_online_map().  And contextualize_policy() also requires
> the mask to be a subset of mems_allowed which also defaults to nodes
> with memory.
> 
> I don't know how Mel Gorman's "two zonelist" series, which is still
> awaiting a window into the -mm tree, affects this behavior.  Those
> patches will certainly be affected by whatever we decide here.
> 

I doubt they'd make a difference to this particular problem.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
  2008-02-05 14:31         ` Mel Gorman
@ 2008-02-05 15:23           ` Lee Schermerhorn
  -1 siblings, 0 replies; 90+ messages in thread
From: Lee Schermerhorn @ 2008-02-05 15:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Andi Kleen, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter, Paul Jackson, David Rientjes

On Tue, 2008-02-05 at 14:31 +0000, Mel Gorman wrote:
> On (04/02/08 13:20), Lee Schermerhorn didst pronounce:
> > > > When the kernel behaviour changes and breaks user space then the kernel
> > > > is usually wrong. Cc'ed Lee S. who maintains the kernel code now.
> > 
> > The memoryless nodes patch series changed a lot of things, so just
> > reverting this one area [mpol_check_policy()] probably won't restore the
> > prior behavior.  A fully populated node mask is not necessarily a proper
> > subset of node_online_map().  And contextualize_policy() also requires
> > the mask to be a subset of mems_allowed which also defaults to nodes
> > with memory.
> > 
> > I don't know how Mel Gorman's "two zonelist" series, which is still
> > awaiting a window into the -mm tree, affects this behavior.  Those
> > patches will certainly be affected by whatever we decide here.
> > 
> 
> I doubt they'd make a difference to this particular problem.

I didn't really think so, but I wanted to give you a heads up regarding
this, as I think it will affect your patches.  I'm hoping we'll see them
in -mm soon after the .25 merge window closes.  If you get there before
a fix to this issue, so much the better, IMO :-).

Lee


> 


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
@ 2008-02-05 15:23           ` Lee Schermerhorn
  0 siblings, 0 replies; 90+ messages in thread
From: Lee Schermerhorn @ 2008-02-05 15:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Andi Kleen, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter, Paul Jackson, David Rientjes

On Tue, 2008-02-05 at 14:31 +0000, Mel Gorman wrote:
> On (04/02/08 13:20), Lee Schermerhorn didst pronounce:
> > > > When the kernel behaviour changes and breaks user space then the kernel
> > > > is usually wrong. Cc'ed Lee S. who maintains the kernel code now.
> > 
> > The memoryless nodes patch series changed a lot of things, so just
> > reverting this one area [mpol_check_policy()] probably won't restore the
> > prior behavior.  A fully populated node mask is not necessarily a proper
> > subset of node_online_map().  And contextualize_policy() also requires
> > the mask to be a subset of mems_allowed which also defaults to nodes
> > with memory.
> > 
> > I don't know how Mel Gorman's "two zonelist" series, which is still
> > awaiting a window into the -mm tree, affects this behavior.  Those
> > patches will certainly be affected by whatever we decide here.
> > 
> 
> I doubt they'd make a difference to this particular problem.

I didn't really think so, but I wanted to give you a heads up regarding
this, as I think it will affect your patches.  I'm hoping we'll see them
in -mm soon after the .25 merge window closes.  If you get there before
a fix to this issue, so much the better, IMO :-).

Lee


> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
  2008-02-05 15:23           ` Lee Schermerhorn
@ 2008-02-05 18:12             ` Christoph Lameter
  -1 siblings, 0 replies; 90+ messages in thread
From: Christoph Lameter @ 2008-02-05 18:12 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Mel Gorman, KOSAKI Motohiro, Andi Kleen, linux-mm, linux-kernel,
	Andrew Morton, Paul Jackson, David Rientjes

Could we focus on the problem instead of discussion of new patches under 
development? Can we confirm that what Kosaki sees is a bug?

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
@ 2008-02-05 18:12             ` Christoph Lameter
  0 siblings, 0 replies; 90+ messages in thread
From: Christoph Lameter @ 2008-02-05 18:12 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Mel Gorman, KOSAKI Motohiro, Andi Kleen, linux-mm, linux-kernel,
	Andrew Morton, Paul Jackson, David Rientjes

Could we focus on the problem instead of discussion of new patches under 
development? Can we confirm that what Kosaki sees is a bug?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
  2008-02-05 18:12             ` Christoph Lameter
@ 2008-02-05 18:27               ` Lee Schermerhorn
  -1 siblings, 0 replies; 90+ messages in thread
From: Lee Schermerhorn @ 2008-02-05 18:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, KOSAKI Motohiro, Andi Kleen, linux-mm, linux-kernel,
	Andrew Morton, Paul Jackson, David Rientjes

On Tue, 2008-02-05 at 10:12 -0800, Christoph Lameter wrote:
> Could we focus on the problem instead of discussion of new patches under 
> development? 

Christoph:  you are free to ignore any part of this discussion that you
wish...

> Can we confirm that what Kosaki sees is a bug?

by definition, right?  we broke user space.  

Lee


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
@ 2008-02-05 18:27               ` Lee Schermerhorn
  0 siblings, 0 replies; 90+ messages in thread
From: Lee Schermerhorn @ 2008-02-05 18:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, KOSAKI Motohiro, Andi Kleen, linux-mm, linux-kernel,
	Andrew Morton, Paul Jackson, David Rientjes

On Tue, 2008-02-05 at 10:12 -0800, Christoph Lameter wrote:
> Could we focus on the problem instead of discussion of new patches under 
> development? 

Christoph:  you are free to ignore any part of this discussion that you
wish...

> Can we confirm that what Kosaki sees is a bug?

by definition, right?  we broke user space.  

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
  2008-02-05 18:27               ` Lee Schermerhorn
@ 2008-02-05 19:04                 ` Christoph Lameter
  -1 siblings, 0 replies; 90+ messages in thread
From: Christoph Lameter @ 2008-02-05 19:04 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Mel Gorman, KOSAKI Motohiro, Andi Kleen, linux-mm, linux-kernel,
	Andrew Morton, Paul Jackson, David Rientjes

On Tue, 5 Feb 2008, Lee Schermerhorn wrote:

> Christoph:  you are free to ignore any part of this discussion that you
> wish...

Had the impression that we are ignoring Kosaki's fix to the problem. Can 
we fix up his patch to address the immediate issue?


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
@ 2008-02-05 19:04                 ` Christoph Lameter
  0 siblings, 0 replies; 90+ messages in thread
From: Christoph Lameter @ 2008-02-05 19:04 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Mel Gorman, KOSAKI Motohiro, Andi Kleen, linux-mm, linux-kernel,
	Andrew Morton, Paul Jackson, David Rientjes

On Tue, 5 Feb 2008, Lee Schermerhorn wrote:

> Christoph:  you are free to ignore any part of this discussion that you
> wish...

Had the impression that we are ignoring Kosaki's fix to the problem. Can 
we fix up his patch to address the immediate issue?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
  2008-02-05 19:04                 ` Christoph Lameter
@ 2008-02-05 19:15                   ` Paul Jackson
  -1 siblings, 0 replies; 90+ messages in thread
From: Paul Jackson @ 2008-02-05 19:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Lee.Schermerhorn, mel, kosaki.motohiro, andi, linux-mm,
	linux-kernel, akpm, rientjes

Christoph wrote:
> Can we fix up his patch to address the immediate issue?

Since any of those future patches only add optional modes
with new flags, while preserving current behaviour if you
don't use one of the new flags, therefore the current behavior
has to work as best it can.

Therefore fixes such as this to address immediate issues
are probably needed.  Yup.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
@ 2008-02-05 19:15                   ` Paul Jackson
  0 siblings, 0 replies; 90+ messages in thread
From: Paul Jackson @ 2008-02-05 19:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Lee.Schermerhorn, mel, kosaki.motohiro, andi, linux-mm,
	linux-kernel, akpm, rientjes

Christoph wrote:
> Can we fix up his patch to address the immediate issue?

Since any of those future patches only add optional modes
with new flags, while preserving current behaviour if you
don't use one of the new flags, therefore the current behavior
has to work as best it can.

Therefore fixes such as this to address immediate issues
are probably needed.  Yup.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
  2008-02-05 10:17         ` Paul Jackson
@ 2008-02-05 19:56           ` David Rientjes
  -1 siblings, 0 replies; 90+ messages in thread
From: David Rientjes @ 2008-02-05 19:56 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Lee Schermerhorn, kosaki.motohiro, andi, linux-mm, linux-kernel,
	akpm, clameter, mel

On Tue, 5 Feb 2008, Paul Jackson wrote:

> But that discussion touched on some other long standing deficiencies
> in the way that I had originally glued cpusets and memory policies
> together.  The current mechanism doesn't handle changing cpusets very
> well, especially if the number of nodes in the cpuset increases.
> 

That's because of the nodemask remaps that are done for the various 
mempolicy cases when rebinding the policy.  I agree we cannot change that 
implementation now even though it is undocumented.

The more alarming result of these remaps is in the MPOL_BIND case, as 
we've talked about before.  The language in set_mempolicy(2):

	The MPOL_BIND policy is a strict policy that restricts memory
	allocation to the nodes specified in nodemask. There won't be
	allocations on other nodes.

makes it pretty clear that allocations will not be done on other nodes not 
provided in the set_mempolicy() nodemask if the task is not swapped out.  

But the current implementation allows that if the task is either moved to 
a different cpuset or its cpuset's mems change.  For example, consider a 
task that is allowed nodes 1-3 by its cpuset and asks for a MPOL_BIND 
mempolicy of node 2.  If that cpuset's mems change to 4-6, the mempolicy 
is now effectively a bind on node 5.

> The next two steps I need to take are:
>  1) propose this patch, with careful explanation (it's easy to lose
>     one's bearings in the mappings and remappings of node numberings)
>     to a wider audience, such as linux-mm or linux-kernel, and

Thanks.

>  2) carefully test this, especially on each code path I touched in
>     mm/mempolicy.c, where the changes were delicate, to ensure I
>     didn't break any existing code.
> 
> There were also some other, smaller patches proposed, by myself and
> others.  I was preferring to address a wider set of the long standing
> issues in this area, but the others above mostly preferred the smaller
> patches.  This needs to be discussed in a wider forum, and a concensus
> reached.
> 

I think if these MPOL_* flags that you're proposing are made as generic as 
possible for all possible mempolicies (current and future), it would be 
the optimal change.  It would prevent us from having to add new flags for 
corner-cases in the future and would allow us to keep the flag set as 
small as possible.  My suggestion of MPOL_F_STATIC_NODEMASK goes a long 
way to solve these issues both for MPOL_INTERLEAVE (in conjunction with 
storing the set_mempolicy() intent) and the MPOL_BIND discrepency I 
mentioned above.

		David

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
@ 2008-02-05 19:56           ` David Rientjes
  0 siblings, 0 replies; 90+ messages in thread
From: David Rientjes @ 2008-02-05 19:56 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Lee Schermerhorn, kosaki.motohiro, andi, linux-mm, linux-kernel,
	akpm, clameter, mel

On Tue, 5 Feb 2008, Paul Jackson wrote:

> But that discussion touched on some other long standing deficiencies
> in the way that I had originally glued cpusets and memory policies
> together.  The current mechanism doesn't handle changing cpusets very
> well, especially if the number of nodes in the cpuset increases.
> 

That's because of the nodemask remaps that are done for the various 
mempolicy cases when rebinding the policy.  I agree we cannot change that 
implementation now even though it is undocumented.

The more alarming result of these remaps is in the MPOL_BIND case, as 
we've talked about before.  The language in set_mempolicy(2):

	The MPOL_BIND policy is a strict policy that restricts memory
	allocation to the nodes specified in nodemask. There won't be
	allocations on other nodes.

makes it pretty clear that allocations will not be done on other nodes not 
provided in the set_mempolicy() nodemask if the task is not swapped out.  

But the current implementation allows that if the task is either moved to 
a different cpuset or its cpuset's mems change.  For example, consider a 
task that is allowed nodes 1-3 by its cpuset and asks for a MPOL_BIND 
mempolicy of node 2.  If that cpuset's mems change to 4-6, the mempolicy 
is now effectively a bind on node 5.

> The next two steps I need to take are:
>  1) propose this patch, with careful explanation (it's easy to lose
>     one's bearings in the mappings and remappings of node numberings)
>     to a wider audience, such as linux-mm or linux-kernel, and

Thanks.

>  2) carefully test this, especially on each code path I touched in
>     mm/mempolicy.c, where the changes were delicate, to ensure I
>     didn't break any existing code.
> 
> There were also some other, smaller patches proposed, by myself and
> others.  I was preferring to address a wider set of the long standing
> issues in this area, but the others above mostly preferred the smaller
> patches.  This needs to be discussed in a wider forum, and a concensus
> reached.
> 

I think if these MPOL_* flags that you're proposing are made as generic as 
possible for all possible mempolicies (current and future), it would be 
the optimal change.  It would prevent us from having to add new flags for 
corner-cases in the future and would allow us to keep the flag set as 
small as possible.  My suggestion of MPOL_F_STATIC_NODEMASK goes a long 
way to solve these issues both for MPOL_INTERLEAVE (in conjunction with 
storing the set_mempolicy() intent) and the MPOL_BIND discrepency I 
mentioned above.

		David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
  2008-02-05 19:15                   ` Paul Jackson
@ 2008-02-05 20:06                     ` David Rientjes
  -1 siblings, 0 replies; 90+ messages in thread
From: David Rientjes @ 2008-02-05 20:06 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Christoph Lameter, Lee.Schermerhorn, mel, kosaki.motohiro, andi,
	linux-mm, linux-kernel, akpm

On Tue, 5 Feb 2008, Paul Jackson wrote:

> Since any of those future patches only add optional modes
> with new flags, while preserving current behaviour if you
> don't use one of the new flags, therefore the current behavior
> has to work as best it can.
> 

There's a subtlety to this issue that allows it to be fixed and easily 
extended for two upcoming changes:

 - Paul Jackson's mempolicy and cpuset interactions change that will
   probably allow set_mempolicy() callers to specify with a MPOL_*
   flag whether they are referring to "dynamic" or "static" nodemasks[*],
   and

 - node hotplug (both add and remove) that will change the state of a
   node with an identical id.

Paul, with his patch, will need to preserve the "intent" of the mempolicy 
as the nodemask that was passed by the user and attempt on all successive 
rebinds to accomodate that intent as much as possible.

So at the time of rebind it is quite simple to intersect the set of system 
nodes that have memory with the intent of the mempolicy to yield the 
effected nodemask.  This nodemask is saved in the mempolicy (pol->v.nodes 
in this case for interleave) and only steps through the set of nodes that 
can allow interleaved allocations.

When the available nodes changes, either by cpuset change or node hotplug, 
the rebind is quite simple when the intent is preserved.  So we're going 
to need an additional nodemask_t added to struct mempolicy that saves this 
intent and modify contextualize_policy() to allow it.  This will basically 
make any set_mempolicy() call succeed even if the application does not 
have access to any of the mempolicy nodes because it is possible that they 
will become accessible in the future.  In that case the mempolicy is 
effectively MPOL_DEFAULT until the desired nodes become available and it 
is effected.

		David

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
@ 2008-02-05 20:06                     ` David Rientjes
  0 siblings, 0 replies; 90+ messages in thread
From: David Rientjes @ 2008-02-05 20:06 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Christoph Lameter, Lee.Schermerhorn, mel, kosaki.motohiro, andi,
	linux-mm, linux-kernel, akpm

On Tue, 5 Feb 2008, Paul Jackson wrote:

> Since any of those future patches only add optional modes
> with new flags, while preserving current behaviour if you
> don't use one of the new flags, therefore the current behavior
> has to work as best it can.
> 

There's a subtlety to this issue that allows it to be fixed and easily 
extended for two upcoming changes:

 - Paul Jackson's mempolicy and cpuset interactions change that will
   probably allow set_mempolicy() callers to specify with a MPOL_*
   flag whether they are referring to "dynamic" or "static" nodemasks[*],
   and

 - node hotplug (both add and remove) that will change the state of a
   node with an identical id.

Paul, with his patch, will need to preserve the "intent" of the mempolicy 
as the nodemask that was passed by the user and attempt on all successive 
rebinds to accomodate that intent as much as possible.

So at the time of rebind it is quite simple to intersect the set of system 
nodes that have memory with the intent of the mempolicy to yield the 
effected nodemask.  This nodemask is saved in the mempolicy (pol->v.nodes 
in this case for interleave) and only steps through the set of nodes that 
can allow interleaved allocations.

When the available nodes changes, either by cpuset change or node hotplug, 
the rebind is quite simple when the intent is preserved.  So we're going 
to need an additional nodemask_t added to struct mempolicy that saves this 
intent and modify contextualize_policy() to allow it.  This will basically 
make any set_mempolicy() call succeed even if the application does not 
have access to any of the mempolicy nodes because it is possible that they 
will become accessible in the future.  In that case the mempolicy is 
effectively MPOL_DEFAULT until the desired nodes become available and it 
is effected.

		David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
  2008-02-05 19:56           ` David Rientjes
@ 2008-02-05 20:51             ` Paul Jackson
  -1 siblings, 0 replies; 90+ messages in thread
From: Paul Jackson @ 2008-02-05 20:51 UTC (permalink / raw)
  To: David Rientjes
  Cc: Lee.Schermerhorn, kosaki.motohiro, andi, linux-mm, linux-kernel,
	akpm, clameter, mel

David wrote:
> The more alarming result of these remaps is in the MPOL_BIND case, as 
> we've talked about before.  The language in set_mempolicy(2):

You're diving into the middle of a rather involved discussion
we had on the other various patches proposed to extend the
interaction of mempolicy's with cpusets and hotplug.

I choose not to hijack this current thread with my rebuttal,
which you've seen before, of your points here.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
@ 2008-02-05 20:51             ` Paul Jackson
  0 siblings, 0 replies; 90+ messages in thread
From: Paul Jackson @ 2008-02-05 20:51 UTC (permalink / raw)
  To: David Rientjes
  Cc: Lee.Schermerhorn, kosaki.motohiro, andi, linux-mm, linux-kernel,
	akpm, clameter, mel

David wrote:
> The more alarming result of these remaps is in the MPOL_BIND case, as 
> we've talked about before.  The language in set_mempolicy(2):

You're diving into the middle of a rather involved discussion
we had on the other various patches proposed to extend the
interaction of mempolicy's with cpusets and hotplug.

I choose not to hijack this current thread with my rebuttal,
which you've seen before, of your points here.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
  2008-02-05 20:51             ` Paul Jackson
@ 2008-02-05 21:03               ` David Rientjes
  -1 siblings, 0 replies; 90+ messages in thread
From: David Rientjes @ 2008-02-05 21:03 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Lee.Schermerhorn, kosaki.motohiro, andi, linux-mm, linux-kernel,
	akpm, clameter, mel

On Tue, 5 Feb 2008, Paul Jackson wrote:

> David wrote:
> > The more alarming result of these remaps is in the MPOL_BIND case, as 
> > we've talked about before.  The language in set_mempolicy(2):
> 
> You're diving into the middle of a rather involved discussion
> we had on the other various patches proposed to extend the
> interaction of mempolicy's with cpusets and hotplug.
> 

I've simply identified that MPOL_BIND mempolicy interactions with a task's 
changing mems_allowed as a result of a cpuset move or mems change is also 
an issue that can be addressed at the same time as the interleave problem.  

And it can be done with the addition of a single MPOL_F_* flag.

> I choose not to hijack this current thread with my rebuttal,
> which you've seen before, of your points here.
> 

The issues of mempolicies working over memoryless nodes and supporting 
changing cpusets are very closely related and can be addressed in the same 
way.  It would be disappointing to see a lot of work done to fix the 
memoryless node issue or the changing cpuset mems issue and then realize 
both could have been fixed quite simply with a relatively small set of 
changes.

		David

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
@ 2008-02-05 21:03               ` David Rientjes
  0 siblings, 0 replies; 90+ messages in thread
From: David Rientjes @ 2008-02-05 21:03 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Lee.Schermerhorn, kosaki.motohiro, andi, linux-mm, linux-kernel,
	akpm, clameter, mel

On Tue, 5 Feb 2008, Paul Jackson wrote:

> David wrote:
> > The more alarming result of these remaps is in the MPOL_BIND case, as 
> > we've talked about before.  The language in set_mempolicy(2):
> 
> You're diving into the middle of a rather involved discussion
> we had on the other various patches proposed to extend the
> interaction of mempolicy's with cpusets and hotplug.
> 

I've simply identified that MPOL_BIND mempolicy interactions with a task's 
changing mems_allowed as a result of a cpuset move or mems change is also 
an issue that can be addressed at the same time as the interleave problem.  

And it can be done with the addition of a single MPOL_F_* flag.

> I choose not to hijack this current thread with my rebuttal,
> which you've seen before, of your points here.
> 

The issues of mempolicies working over memoryless nodes and supporting 
changing cpusets are very closely related and can be addressed in the same 
way.  It would be disappointing to see a lot of work done to fix the 
memoryless node issue or the changing cpuset mems issue and then realize 
both could have been fixed quite simply with a relatively small set of 
changes.

		David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
  2008-02-05 21:03               ` David Rientjes
@ 2008-02-05 21:33                 ` Paul Jackson
  -1 siblings, 0 replies; 90+ messages in thread
From: Paul Jackson @ 2008-02-05 21:33 UTC (permalink / raw)
  To: David Rientjes
  Cc: Lee.Schermerhorn, kosaki.motohiro, andi, linux-mm, linux-kernel,
	akpm, clameter, mel

David wrote:
> It would be disappointing to see a lot of work done to fix

The suggested patch of KOSAKI Motohiro didn't look like a lot of work to me.

I continue to prefer not to hijack this thread for that other discussion.
Just presenting your position and calling it "simple" is misleading.
The discussion so far has involved over a hundred messages over months,
and certainly your position, nor mine for that matter, obtained concensus.

How does the patch of KOSAKI Motohiro, earlier in this thread, look to you?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
@ 2008-02-05 21:33                 ` Paul Jackson
  0 siblings, 0 replies; 90+ messages in thread
From: Paul Jackson @ 2008-02-05 21:33 UTC (permalink / raw)
  To: David Rientjes
  Cc: Lee.Schermerhorn, kosaki.motohiro, andi, linux-mm, linux-kernel,
	akpm, clameter, mel

David wrote:
> It would be disappointing to see a lot of work done to fix

The suggested patch of KOSAKI Motohiro didn't look like a lot of work to me.

I continue to prefer not to hijack this thread for that other discussion.
Just presenting your position and calling it "simple" is misleading.
The discussion so far has involved over a hundred messages over months,
and certainly your position, nor mine for that matter, obtained concensus.

How does the patch of KOSAKI Motohiro, earlier in this thread, look to you?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't works on memoryless node.
  2008-02-05  9:26         ` KOSAKI Motohiro
  (?)
@ 2008-02-05 21:57         ` Lee Schermerhorn
  2008-02-05 22:12           ` Christoph Lameter
                             ` (3 more replies)
  -1 siblings, 4 replies; 90+ messages in thread
From: Lee Schermerhorn @ 2008-02-05 21:57 UTC (permalink / raw)
  To: KOSAKI Motohiro, linux-kernel, Andrew Morton
  Cc: Christoph Lameter, Paul Jackson, David Rientjes, Mel Gorman,
	torvalds, Eric Whitney

Here's a patch that addresses the problem w/o requiring change to
numactl or libnuma.  It DOES have side affects, discussed in the
description.

Tested with memoryless nodes and restricted cpusets using the numactl
installed with RHEL5.1.

Altho' nominally against 24-mm1, applies cleanly to 2.6.24.  Should be
suitable for 'stable' if everyone agrees.

Lee
----------------------------------

[PATCH] 2.6.24-mm1 - mempolicy:  silently restrict to allowed nodes

Kosaki-san noted that "numactl --interleave=all ..." failed in the
presence of memoryless nodes.  This patch attempts to fix that 
problem.

Some background:  

numactl --interleave=all calls set_mempolicy(2) with a fully
populated [out to MAXNUMNODES] nodemask.  set_mempolicy()
[in do_set_mempolicy()] calls contextualize_policy() which
requires that the nodemask be a subset of the current task's
mems_allowed; else EINVAL will be returned.  A task's
mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]--
i.e., nodes with memory.  So, a fully populated nodemask will
be declared invalid if it includes memoryless nodes.

  NOTE:  the same thing will occur when running in a cpuset
         with restricted mem_allowed--for the same reason:
         node mask contains dis-allowed nodes.

mbind(2), on the other hand, just masks off any nodes in the 
nodemask that are not included in the caller's mems_allowed.

In each case [mbind() and set_mempolicy()], mpol_check_policy()
will complain [again, resulting in EINVAL] if the nodemask contains 
any memoryless nodes.  This is somewhat redundant as mpol_new() 
will remove memoryless nodes for interleave policy, as will 
bind_zonelist()--called by mpol_new() for BIND policy.

Proposed fix:

1) modify contextualize_policy to just remove the non-allowed
   nodes, as is currently done in-line for mbind().  This
   guarantees that the resulting mask includes only nodes with
   memory.

   NOTE:  this is a [benign, IMO] change in behavior for
          set_mempolicy().  Dis-allowed nodes will be silently
          ignored, rather than returning an error.

          Another, perhaps less benign, change in behavior:
          MPOL_PREFERRED policy that specifies only memoryless nodes
          or nodes that are disallowed in the cpuset will be interpreted
          as "local allocation" as the nodemask will be empty after
          the masking in contextualize_policy().  With a bit of
          additional hackery I can make this return EINVAL.

          Comments?

2) modify mbind() to use contextualize_policy(), like set_mempolicy(),
   instead of masking nodes in-line.

3) remove the now redundant check for memoryless nodes from
   mpol_check_policy().

4) remove the masking of policy nodes for interleave policy from
   mpol_new().

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 mm/mempolicy.c |   18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c	2008-02-05 11:25:17.000000000 -0500
+++ Linux/mm/mempolicy.c	2008-02-05 16:03:11.000000000 -0500
@@ -131,7 +131,7 @@ static int mpol_check_policy(int mode, n
 			return -EINVAL;
 		break;
 	}
- 	return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
+ 	return 0;
 }
 
 /* Generate a custom zonelist for the BIND policy. */
@@ -188,8 +188,6 @@ static struct mempolicy *mpol_new(int mo
 	switch (mode) {
 	case MPOL_INTERLEAVE:
 		policy->v.nodes = *nodes;
-		nodes_and(policy->v.nodes, policy->v.nodes,
-					node_states[N_HIGH_MEMORY]);
 		if (nodes_weight(policy->v.nodes) == 0) {
 			kmem_cache_free(policy_cache, policy);
 			return ERR_PTR(-EINVAL);
@@ -426,9 +424,13 @@ static int contextualize_policy(int mode
 	if (!nodes)
 		return 0;
 
+	/*
+	 * Restrict the nodes to the allowed nodes in the cpuset.
+	 * This is guaranteed to be a subset of nodes with memory.
+	 */
 	cpuset_update_task_memory_state();
-	if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
-		return -EINVAL;
+	nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
+
 	return mpol_check_policy(mode, nodes);
 }
 
@@ -797,7 +799,7 @@ static long do_mbind(unsigned long start
 	if (end == start)
 		return 0;
 
-	if (mpol_check_policy(mode, nmask))
+	if (contextualize_policy(mode, nmask))
 		return -EINVAL;
 
 	new = mpol_new(mode, nmask);
@@ -915,10 +917,6 @@ asmlinkage long sys_mbind(unsigned long 
 	err = get_nodes(&nodes, nmask, maxnode);
 	if (err)
 		return err;
-#ifdef CONFIG_CPUSETS
-	/* Restrict the nodes to the allowed nodes in the cpuset */
-	nodes_and(nodes, nodes, current->mems_allowed);
-#endif
 	return do_mbind(start, len, mode, &nodes, flags);
 }
 




^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
  2008-02-05 21:33                 ` Paul Jackson
@ 2008-02-05 22:04                   ` Lee Schermerhorn
  -1 siblings, 0 replies; 90+ messages in thread
From: Lee Schermerhorn @ 2008-02-05 22:04 UTC (permalink / raw)
  To: Paul Jackson
  Cc: David Rientjes, kosaki.motohiro, andi, linux-mm, linux-kernel,
	akpm, clameter, mel

On Tue, 2008-02-05 at 15:33 -0600, Paul Jackson wrote:
> David wrote:
> > It would be disappointing to see a lot of work done to fix
> 
> The suggested patch of KOSAKI Motohiro didn't look like a lot of work to me.
> 
> I continue to prefer not to hijack this thread for that other discussion.
> Just presenting your position and calling it "simple" is misleading.
> The discussion so far has involved over a hundred messages over months,
> and certainly your position, nor mine for that matter, obtained concensus.
> 
> How does the patch of KOSAKI Motohiro, earlier in this thread, look to you?
> 

Paul,

It wasn't clear to me whether Kosaki-san's patch required a modified
numactl/libnuma or not.   I think so, because that patch doesn't change
the error return in contextualize_policy() and in mpol_check_policy().
My modified numactl/libnuma avoids this by only passing in allowed mems
fetch via get_mempolicy() with the new MEMS_ALLOWED flags.

The patch I just posted doesn't depend on the numactl changes and seems
quite minimal to me.  I think it cleans up the differences between
set_mempolicy() and mbind(), as well.  However, some may take exception
to the change in behavior--silently ignoring dis-allowed nodes in
set_mempolicy().

Also, your cpuset/mempolicy work will probably need to undo the
unconditional masking in contextualize_policy() and/or save the original
node mask somewhere...

Lee


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
@ 2008-02-05 22:04                   ` Lee Schermerhorn
  0 siblings, 0 replies; 90+ messages in thread
From: Lee Schermerhorn @ 2008-02-05 22:04 UTC (permalink / raw)
  To: Paul Jackson
  Cc: David Rientjes, kosaki.motohiro, andi, linux-mm, linux-kernel,
	akpm, clameter, mel

On Tue, 2008-02-05 at 15:33 -0600, Paul Jackson wrote:
> David wrote:
> > It would be disappointing to see a lot of work done to fix
> 
> The suggested patch of KOSAKI Motohiro didn't look like a lot of work to me.
> 
> I continue to prefer not to hijack this thread for that other discussion.
> Just presenting your position and calling it "simple" is misleading.
> The discussion so far has involved over a hundred messages over months,
> and certainly your position, nor mine for that matter, obtained concensus.
> 
> How does the patch of KOSAKI Motohiro, earlier in this thread, look to you?
> 

Paul,

It wasn't clear to me whether Kosaki-san's patch required a modified
numactl/libnuma or not.   I think so, because that patch doesn't change
the error return in contextualize_policy() and in mpol_check_policy().
My modified numactl/libnuma avoids this by only passing in allowed mems
fetch via get_mempolicy() with the new MEMS_ALLOWED flags.

The patch I just posted doesn't depend on the numactl changes and seems
quite minimal to me.  I think it cleans up the differences between
set_mempolicy() and mbind(), as well.  However, some may take exception
to the change in behavior--silently ignoring dis-allowed nodes in
set_mempolicy().

Also, your cpuset/mempolicy work will probably need to undo the
unconditional masking in contextualize_policy() and/or save the original
node mask somewhere...

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't works on memoryless node.
  2008-02-05 21:57         ` Lee Schermerhorn
@ 2008-02-05 22:12           ` Christoph Lameter
  2008-02-06 16:00             ` Lee Schermerhorn
  2008-02-05 22:15           ` Paul Jackson
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 90+ messages in thread
From: Christoph Lameter @ 2008-02-05 22:12 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: KOSAKI Motohiro, linux-kernel, Andrew Morton, Paul Jackson,
	David Rientjes, Mel Gorman, torvalds, Eric Whitney

On Tue, 5 Feb 2008, Lee Schermerhorn wrote:

> mbind(2), on the other hand, just masks off any nodes in the 
> nodemask that are not included in the caller's mems_allowed.

Ok so we temporarily adopt these semantics for set_mempolicy.

> 1) modify contextualize_policy to just remove the non-allowed
>    nodes, as is currently done in-line for mbind().  This
>    guarantees that the resulting mask includes only nodes with
>    memory.

Right make ssense. we already contextualize for cpusets.

> Index: Linux/mm/mempolicy.c
> ===================================================================
> --- Linux.orig/mm/mempolicy.c	2008-02-05 11:25:17.000000000 -0500
> +++ Linux/mm/mempolicy.c	2008-02-05 16:03:11.000000000 -0500
> @@ -131,7 +131,7 @@ static int mpol_check_policy(int mode, n
>  			return -EINVAL;
>  		break;
>  	}
> - 	return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
> + 	return 0;
>  }

Hmmm... That is a pretty drastic change.

> @@ -188,8 +188,6 @@ static struct mempolicy *mpol_new(int mo
>  	switch (mode) {
>  	case MPOL_INTERLEAVE:
>  		policy->v.nodes = *nodes;
> -		nodes_and(policy->v.nodes, policy->v.nodes,
> -					node_states[N_HIGH_MEMORY]);
>  		if (nodes_weight(policy->v.nodes) == 0) {
>  			kmem_cache_free(policy_cache, policy);
>  			return ERR_PTR(-EINVAL);

Do we really need to remove these lines if we change set_mempolicy?

> @@ -426,9 +424,13 @@ static int contextualize_policy(int mode
>  	if (!nodes)
>  		return 0;
>  
> +	/*
> +	 * Restrict the nodes to the allowed nodes in the cpuset.
> +	 * This is guaranteed to be a subset of nodes with memory.
> +	 */
>  	cpuset_update_task_memory_state();
> -	if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
> -		return -EINVAL;
> +	nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
> +
>  	return mpol_check_policy(mode, nodes);
>  }
>  

Ditto?

> @@ -797,7 +799,7 @@ static long do_mbind(unsigned long start
>  	if (end == start)
>  		return 0;
>  
> -	if (mpol_check_policy(mode, nmask))
> +	if (contextualize_policy(mode, nmask))
>  		return -EINVAL;
>  
>  	new = mpol_new(mode, nmask);
> @@ -915,10 +917,6 @@ asmlinkage long sys_mbind(unsigned long 
>  	err = get_nodes(&nodes, nmask, maxnode);
>  	if (err)
>  		return err;
> -#ifdef CONFIG_CPUSETS
> -	/* Restrict the nodes to the allowed nodes in the cpuset */
> -	nodes_and(nodes, nodes, current->mems_allowed);
> -#endif

Would just removing #ifdef CONFIG_CPUSETS work? mems_allowed falls back to 
node_possible_map.... Shouldnt that be node_online_map?


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't works on memoryless node.
  2008-02-05 21:57         ` Lee Schermerhorn
  2008-02-05 22:12           ` Christoph Lameter
@ 2008-02-05 22:15           ` Paul Jackson
  2008-02-06  2:17           ` David Rientjes
  2008-02-06  6:49           ` KOSAKI Motohiro
  3 siblings, 0 replies; 90+ messages in thread
From: Paul Jackson @ 2008-02-05 22:15 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: kosaki.motohiro, linux-kernel, akpm, clameter, rientjes, mel,
	torvalds, eric.whitney

Lee wrote:
> [PATCH] 2.6.24-mm1 - mempolicy:  silently restrict to allowed nodes

At first glance, I like it.  Thanks.

The changes in the exact behaviour of set_mempolicy (and mbind?) seem
to me to be changes for the better -- subtle improvements in the
consistency of handling corner cases.

However I don't have code that depends all that elaborately on the
fine details of these system calls, so I'm easy.  If others know of
an existing or likely usage pattern that this patch would break,
that would be interesting input.

Thanks, Lee.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
  2008-02-05 22:04                   ` Lee Schermerhorn
@ 2008-02-05 22:44                     ` David Rientjes
  -1 siblings, 0 replies; 90+ messages in thread
From: David Rientjes @ 2008-02-05 22:44 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Paul Jackson, kosaki.motohiro, andi, linux-mm, linux-kernel,
	akpm, clameter, mel

On Tue, 5 Feb 2008, Lee Schermerhorn wrote:

> The patch I just posted doesn't depend on the numactl changes and seems
> quite minimal to me.  I think it cleans up the differences between
> set_mempolicy() and mbind(), as well.  However, some may take exception
> to the change in behavior--silently ignoring dis-allowed nodes in
> set_mempolicy().
> 

If the intent of the set_mempolicy() call is going to be preserved in the 
struct mempolicy with Paul's change, then we're going to allow disallowed 
nodes anyway.  So the only nodemask errors that we should return are ones 
that are empty; nodemasks that include offlined nodes should be allowed to 
support node hotplug.  Likewise, memoryless nodes should still be saved as 
the intent of the syscall.

The change to save the intent or silently ignore disallowed nodes would 
also require applications to issue a successive get_mempolicy() call to 
know what their current mempolicy is, since it will be able to change with 
a cpusets change or node hotplug event.  There is no longer this assurance 
that if set_mempolicy() returns without an error that the memory policy is 
effected.  But in the presence of subsystems such as cpusets that allow 
those mempolicies to change from beneath the application, there is no way 
around that: the nodemask that the mempolicy acts on can dynamically 
change at any time.

So I don't see any problem with silently ignoring disallowed nodes and 
encourage it so that the kernel accomodates the intent of the mempolicy in 
the future if and when it can be effected.

		David

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
@ 2008-02-05 22:44                     ` David Rientjes
  0 siblings, 0 replies; 90+ messages in thread
From: David Rientjes @ 2008-02-05 22:44 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Paul Jackson, kosaki.motohiro, andi, linux-mm, linux-kernel,
	akpm, clameter, mel

On Tue, 5 Feb 2008, Lee Schermerhorn wrote:

> The patch I just posted doesn't depend on the numactl changes and seems
> quite minimal to me.  I think it cleans up the differences between
> set_mempolicy() and mbind(), as well.  However, some may take exception
> to the change in behavior--silently ignoring dis-allowed nodes in
> set_mempolicy().
> 

If the intent of the set_mempolicy() call is going to be preserved in the 
struct mempolicy with Paul's change, then we're going to allow disallowed 
nodes anyway.  So the only nodemask errors that we should return are ones 
that are empty; nodemasks that include offlined nodes should be allowed to 
support node hotplug.  Likewise, memoryless nodes should still be saved as 
the intent of the syscall.

The change to save the intent or silently ignore disallowed nodes would 
also require applications to issue a successive get_mempolicy() call to 
know what their current mempolicy is, since it will be able to change with 
a cpusets change or node hotplug event.  There is no longer this assurance 
that if set_mempolicy() returns without an error that the memory policy is 
effected.  But in the presence of subsystems such as cpusets that allow 
those mempolicies to change from beneath the application, there is no way 
around that: the nodemask that the mempolicy acts on can dynamically 
change at any time.

So I don't see any problem with silently ignoring disallowed nodes and 
encourage it so that the kernel accomodates the intent of the mempolicy in 
the future if and when it can be effected.

		David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
  2008-02-05 22:04                   ` Lee Schermerhorn
@ 2008-02-05 22:50                     ` Paul Jackson
  -1 siblings, 0 replies; 90+ messages in thread
From: Paul Jackson @ 2008-02-05 22:50 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: rientjes, kosaki.motohiro, andi, linux-mm, linux-kernel, akpm,
	clameter, mel

Lee wrote:
> Also, your cpuset/mempolicy work will probably need to undo the
> unconditional masking in contextualize_policy() and/or save the original
> node mask somewhere...

Yeah, something like that ... just a small matter of code.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node.
@ 2008-02-05 22:50                     ` Paul Jackson
  0 siblings, 0 replies; 90+ messages in thread
From: Paul Jackson @ 2008-02-05 22:50 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: rientjes, kosaki.motohiro, andi, linux-mm, linux-kernel, akpm,
	clameter, mel

Lee wrote:
> Also, your cpuset/mempolicy work will probably need to undo the
> unconditional masking in contextualize_policy() and/or save the original
> node mask somewhere...

Yeah, something like that ... just a small matter of code.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't works on memoryless node.
  2008-02-05 21:57         ` Lee Schermerhorn
  2008-02-05 22:12           ` Christoph Lameter
  2008-02-05 22:15           ` Paul Jackson
@ 2008-02-06  2:17           ` David Rientjes
  2008-02-06 16:11             ` Lee Schermerhorn
  2008-02-06  6:49           ` KOSAKI Motohiro
  3 siblings, 1 reply; 90+ messages in thread
From: David Rientjes @ 2008-02-06  2:17 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: KOSAKI Motohiro, linux-kernel, Andrew Morton, Christoph Lameter,
	Paul Jackson, Mel Gorman, torvalds, Eric Whitney

On Tue, 5 Feb 2008, Lee Schermerhorn wrote:

> Index: Linux/mm/mempolicy.c
> ===================================================================
> --- Linux.orig/mm/mempolicy.c	2008-02-05 11:25:17.000000000 -0500
> +++ Linux/mm/mempolicy.c	2008-02-05 16:03:11.000000000 -0500
> @@ -131,7 +131,7 @@ static int mpol_check_policy(int mode, n
>  			return -EINVAL;
>  		break;
>  	}
> - 	return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
> + 	return 0;
>  }
>  
>  /* Generate a custom zonelist for the BIND policy. */

This change will be necessary when the nodemask passed from the syscall is 
saved in the struct mempolicy as the intent of the application as well.

> @@ -188,8 +188,6 @@ static struct mempolicy *mpol_new(int mo
>  	switch (mode) {
>  	case MPOL_INTERLEAVE:
>  		policy->v.nodes = *nodes;
> -		nodes_and(policy->v.nodes, policy->v.nodes,
> -					node_states[N_HIGH_MEMORY]);
>  		if (nodes_weight(policy->v.nodes) == 0) {
>  			kmem_cache_free(policy_cache, policy);
>  			return ERR_PTR(-EINVAL);
> @@ -426,9 +424,13 @@ static int contextualize_policy(int mode
>  	if (!nodes)
>  		return 0;
>  
> +	/*
> +	 * Restrict the nodes to the allowed nodes in the cpuset.
> +	 * This is guaranteed to be a subset of nodes with memory.
> +	 */
>  	cpuset_update_task_memory_state();
> -	if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
> -		return -EINVAL;
> +	nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
> +
>  	return mpol_check_policy(mode, nodes);
>  }
>  

I would defer the intersection until later because contextualize_policy() 
is called before mpol_new() so we have no struct mempolicy to save the 
intent in.  It doesn't matter for the sake of this change, I know, but you 
could move this intersection to mpol_new() and give us an opportunity to 
store the user's nodemask in the mempolicy with a one-line change and get 
the same desired result.

You can now remove cpuset_nodes_subset_current_mems_allowed() from 
linux/cpuset.h.

> @@ -797,7 +799,7 @@ static long do_mbind(unsigned long start
>  	if (end == start)
>  		return 0;
>  
> -	if (mpol_check_policy(mode, nmask))
> +	if (contextualize_policy(mode, nmask))
>  		return -EINVAL;
>  
>  	new = mpol_new(mode, nmask);
> @@ -915,10 +917,6 @@ asmlinkage long sys_mbind(unsigned long 
>  	err = get_nodes(&nodes, nmask, maxnode);
>  	if (err)
>  		return err;
> -#ifdef CONFIG_CPUSETS
> -	/* Restrict the nodes to the allowed nodes in the cpuset */
> -	nodes_and(nodes, nodes, current->mems_allowed);
> -#endif
>  	return do_mbind(start, len, mode, &nodes, flags);
>  }
>  

Looks good, thanks for doing this.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't works on memoryless node.
  2008-02-05 21:57         ` Lee Schermerhorn
                             ` (2 preceding siblings ...)
  2008-02-06  2:17           ` David Rientjes
@ 2008-02-06  6:49           ` KOSAKI Motohiro
  3 siblings, 0 replies; 90+ messages in thread
From: KOSAKI Motohiro @ 2008-02-06  6:49 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: kosaki.motohiro, linux-kernel, Andrew Morton, Christoph Lameter,
	Paul Jackson, David Rientjes, Mel Gorman, torvalds, Eric Whitney

Hi Lee-san

> Here's a patch that addresses the problem w/o requiring change to
> numactl or libnuma.  It DOES have side affects, discussed in the
> description.

Thank you!

but unfortunately, My machine is broken phisically today ;-)
I will test it tommorow or later.




^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't works on memoryless node.
  2008-02-05 22:12           ` Christoph Lameter
@ 2008-02-06 16:00             ` Lee Schermerhorn
  0 siblings, 0 replies; 90+ messages in thread
From: Lee Schermerhorn @ 2008-02-06 16:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: KOSAKI Motohiro, linux-kernel, Andrew Morton, Paul Jackson,
	David Rientjes, Mel Gorman, torvalds, Eric Whitney

On Tue, 2008-02-05 at 14:12 -0800, Christoph Lameter wrote:
> On Tue, 5 Feb 2008, Lee Schermerhorn wrote:
> 
> > mbind(2), on the other hand, just masks off any nodes in the 
> > nodemask that are not included in the caller's mems_allowed.
> 
> Ok so we temporarily adopt these semantics for set_mempolicy.
> 
> > 1) modify contextualize_policy to just remove the non-allowed
> >    nodes, as is currently done in-line for mbind().  This
> >    guarantees that the resulting mask includes only nodes with
> >    memory.
> 
> Right make ssense. we already contextualize for cpusets.

Only for mbind().  set_mempolicy(), via contextualize_policy() was just
returning EINVAL for invalid nodes in the mask.  I don't know if it
always worked like this, or if we did this in the memoryless nodes
series...

> 
> > Index: Linux/mm/mempolicy.c
> > ===================================================================
> > --- Linux.orig/mm/mempolicy.c	2008-02-05 11:25:17.000000000 -0500
> > +++ Linux/mm/mempolicy.c	2008-02-05 16:03:11.000000000 -0500
> > @@ -131,7 +131,7 @@ static int mpol_check_policy(int mode, n
> >  			return -EINVAL;
> >  		break;
> >  	}
> > - 	return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
> > + 	return 0;
> >  }
> 
> Hmmm... That is a pretty drastic change.

the nodes_subset() would always return true, once we mask it with
cpuset_current_mems_allowed(), right?  mems_allowed can now only contain
nodes with memory and if cpusets are not configured,
cpuset_current_mems_allowed() just returns node_states[N_HIGH_MEMORY].

So, I think this is a no-op.

> 
> > @@ -188,8 +188,6 @@ static struct mempolicy *mpol_new(int mo
> >  	switch (mode) {
> >  	case MPOL_INTERLEAVE:
> >  		policy->v.nodes = *nodes;
> > -		nodes_and(policy->v.nodes, policy->v.nodes,
> > -					node_states[N_HIGH_MEMORY]);
> >  		if (nodes_weight(policy->v.nodes) == 0) {
> >  			kmem_cache_free(policy_cache, policy);
> >  			return ERR_PTR(-EINVAL);
> 
> Do we really need to remove these lines if we change set_mempolicy?

Again, with the change to contextualize_policy(), the nodemask is
guaranteed to only contain nodes with memory, so this was redundant.

> 
> > @@ -426,9 +424,13 @@ static int contextualize_policy(int mode
> >  	if (!nodes)
> >  		return 0;
> >  
> > +	/*
> > +	 * Restrict the nodes to the allowed nodes in the cpuset.
> > +	 * This is guaranteed to be a subset of nodes with memory.
> > +	 */
> >  	cpuset_update_task_memory_state();
> > -	if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
> > -		return -EINVAL;
> > +	nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
> > +
> >  	return mpol_check_policy(mode, nodes);
> >  }
> >  
> 
> Ditto?

This is the main change in the patch:  masking off the invalid nodes
[like sys_mbind() did inline] rather than complaining about them.

However, after I finish testing, I will post an update to this patch
which restores some of the error checks that this change lost.

> 
> > @@ -797,7 +799,7 @@ static long do_mbind(unsigned long start
> >  	if (end == start)
> >  		return 0;
> >  
> > -	if (mpol_check_policy(mode, nmask))
> > +	if (contextualize_policy(mode, nmask))
> >  		return -EINVAL;
> >  
> >  	new = mpol_new(mode, nmask);
> > @@ -915,10 +917,6 @@ asmlinkage long sys_mbind(unsigned long 
> >  	err = get_nodes(&nodes, nmask, maxnode);
> >  	if (err)
> >  		return err;
> > -#ifdef CONFIG_CPUSETS
> > -	/* Restrict the nodes to the allowed nodes in the cpuset */
> > -	nodes_and(nodes, nodes, current->mems_allowed);
> > -#endif
> 
> Would just removing #ifdef CONFIG_CPUSETS work? mems_allowed falls back to 
> node_possible_map.... Shouldnt that be node_online_map?

I removed this because I changed do_mbind() to call the revised
contextualize_policy() that does exactly this masking.  I didn't see any
reason to leave the duplicate code there.

I think that mems_allowed now falls back to nodes with memory.  Or it
should in the current code.  When Paul adds his new magic, that might
change.

Lee
> 


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't works on memoryless node.
  2008-02-06  2:17           ` David Rientjes
@ 2008-02-06 16:11             ` Lee Schermerhorn
  0 siblings, 0 replies; 90+ messages in thread
From: Lee Schermerhorn @ 2008-02-06 16:11 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, linux-kernel, Andrew Morton, Christoph Lameter,
	Paul Jackson, Mel Gorman, torvalds, Eric Whitney

On Tue, 2008-02-05 at 18:17 -0800, David Rientjes wrote:
> On Tue, 5 Feb 2008, Lee Schermerhorn wrote:
> 
> > Index: Linux/mm/mempolicy.c
> > ===================================================================
> > --- Linux.orig/mm/mempolicy.c	2008-02-05 11:25:17.000000000 -0500
> > +++ Linux/mm/mempolicy.c	2008-02-05 16:03:11.000000000 -0500
> > @@ -131,7 +131,7 @@ static int mpol_check_policy(int mode, n
> >  			return -EINVAL;
> >  		break;
> >  	}
> > - 	return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
> > + 	return 0;
> >  }
> >  
> >  /* Generate a custom zonelist for the BIND policy. */
> 
> This change will be necessary when the nodemask passed from the syscall is 
> saved in the struct mempolicy as the intent of the application as well.
> 
> > @@ -188,8 +188,6 @@ static struct mempolicy *mpol_new(int mo
> >  	switch (mode) {
> >  	case MPOL_INTERLEAVE:
> >  		policy->v.nodes = *nodes;
> > -		nodes_and(policy->v.nodes, policy->v.nodes,
> > -					node_states[N_HIGH_MEMORY]);
> >  		if (nodes_weight(policy->v.nodes) == 0) {
> >  			kmem_cache_free(policy_cache, policy);
> >  			return ERR_PTR(-EINVAL);
> > @@ -426,9 +424,13 @@ static int contextualize_policy(int mode
> >  	if (!nodes)
> >  		return 0;
> >  
> > +	/*
> > +	 * Restrict the nodes to the allowed nodes in the cpuset.
> > +	 * This is guaranteed to be a subset of nodes with memory.
> > +	 */
> >  	cpuset_update_task_memory_state();
> > -	if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
> > -		return -EINVAL;
> > +	nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
> > +
> >  	return mpol_check_policy(mode, nodes);
> >  }
> >  
> 
> I would defer the intersection until later because contextualize_policy() 
> is called before mpol_new() so we have no struct mempolicy to save the 
> intent in.  It doesn't matter for the sake of this change, I know, but you 
> could move this intersection to mpol_new() and give us an opportunity to 
> store the user's nodemask in the mempolicy with a one-line change and get 
> the same desired result.

Hi, David:

I wanted to avoid a major restructuring of the code for this patch.
However, now that both do_mbind() and do_set_mempolicy() both call
contextualize_policy() [which calls mpol_check_policy()] immediately
before calling mpol_new(), I agree we can push this "contextualization"
down there.  I would like to defer this to another patch--perhaps as
part of Paul's rework of mempolicy and cpusets.  

Note that there is another caller of mpol_new() --
mpol_shared_policy_init().  We'll need to decide whether that call needs
to be contextualized, as it constructs a policy from the tmpfs or
hugetlbfs superblock, as specified on the mount command [or kernel
command line?].  As this is a privileged operation, one could argue that
it should be exempt from cpuset constraints.

> 
> You can now remove cpuset_nodes_subset_current_mems_allowed() from 
> linux/cpuset.h.
> 
> > @@ -797,7 +799,7 @@ static long do_mbind(unsigned long start
> >  	if (end == start)
> >  		return 0;
> >  
> > -	if (mpol_check_policy(mode, nmask))
> > +	if (contextualize_policy(mode, nmask))
> >  		return -EINVAL;
> >  
> >  	new = mpol_new(mode, nmask);
> > @@ -915,10 +917,6 @@ asmlinkage long sys_mbind(unsigned long 
> >  	err = get_nodes(&nodes, nmask, maxnode);
> >  	if (err)
> >  		return err;
> > -#ifdef CONFIG_CPUSETS
> > -	/* Restrict the nodes to the allowed nodes in the cpuset */
> > -	nodes_and(nodes, nodes, current->mems_allowed);
> > -#endif
> >  	return do_mbind(start, len, mode, &nodes, flags);
> >  }
> >  
> 
> Looks good, thanks for doing this.

As I mentioned to Christoph, I'll post a new version that I think
handles the error conditions better.

Later,
Lee


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't works on memoryless node.
  2008-02-05  9:26         ` KOSAKI Motohiro
  (?)
  (?)
@ 2008-02-06 17:38         ` Lee Schermerhorn
  2008-02-07  8:31           ` KOSAKI Motohiro
  -1 siblings, 1 reply; 90+ messages in thread
From: Lee Schermerhorn @ 2008-02-06 17:38 UTC (permalink / raw)
  To: KOSAKI Motohiro, linux-kernel, Andrew Morton
  Cc: Christoph Lameter, Paul Jackson, David Rientjes, Mel Gorman,
	torvalds, Eric Whitney

I've updated the patch to restore some error checking that my previous
version and the memoryless-nodes series lost.  

Again, tested with "numactl --interleave=all" and memtoy on ia64 using
mem= command line argument to simulate memoryless node.

Lee
----------------------------------
[PATCH] 2.6.24-mm1 - mempolicy:  silently restrict to allowed nodes

V1 -> V2:
+ Communicate whether or not incoming node mask was empty to
  mpol_check_policy() for better error checking.
+ As suggested by David Rientjes, remove the now unused
   cpuset_nodes_subset_current_mems_allowed() from cpuset.h

Kosaki Motohito noted that "numactl --interleave=all ..." failed in the
presence of memoryless nodes.  This patch attempts to fix that problem.

Some background:  

numactl --interleave=all calls set_mempolicy(2) with a fully
populated [out to MAXNUMNODES] nodemask.  set_mempolicy()
[in do_set_mempolicy()] calls contextualize_policy() which
requires that the nodemask be a subset of the current task's
mems_allowed; else EINVAL will be returned.  A task's
mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]--
i.e., nodes with memory.  So, a fully populated nodemask will
be declared invalid if it includes memoryless nodes.

  NOTE:  the same thing will occur when running in a cpuset
         with restricted mem_allowed--for the same reason:
         node mask contains dis-allowed nodes.

mbind(2), on the other hand, just masks off any nodes in the 
nodemask that are not included in the caller's mems_allowed.

In each case [mbind() and set_mempolicy()], mpol_check_policy()
will complain [again, resulting in EINVAL] if the nodemask contains 
any memoryless nodes.  This is somewhat redundant as mpol_new() 
will remove memoryless nodes for interleave policy, as will 
bind_zonelist()--called by mpol_new() for BIND policy.

Proposed fix:

1) modify contextualize_policy to:
   a) remember whether the incoming node mask is empty.
   b) if not, restrict the nodemask to allowed nodes, as is
      currently done in-line for mbind().  This guarantees
      that the resulting mask includes only nodes with memory.

      NOTE:  this is a [benign, IMO] change in behavior for
             set_mempolicy().  Dis-allowed nodes will be
             silently ignored, rather than returning an error.

   c) pass the "was_empty" state of the incoming mask to
      mpol_check_policy() for vetting the user specified 
      nodemask for MPOL_DEFAULT and MPOL_PREFERRED

2) In mpol_check_policy():
   a) MPOL_DEFAULT:  require that in coming mask "was_empty"
   a) add a case for MPOL_PREFERRED:  if in coming was not empty
      and resulting mask IS empty, user specified invalid nodes.
      Return EINVAL.
   b) remove the now redundant check for memoryless nodes

3) modify mbind() to use contextualize_policy(), like set_mempolicy(),
   instead of masking nodes in-line.  This preserves the current
   behavior of mbind() and restores some error checking that the
   memoryless-nodes series lost when restricting node masks to
   allowed_nodes [== subset of nodes with memory].

4) remove the now redundant masking of policy nodes for interleave
   policy from mpol_new().

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/cpuset.h |    3 --
 mm/mempolicy.c         |   50 ++++++++++++++++++++++++++++++++-----------------
 2 files changed, 33 insertions(+), 20 deletions(-)

Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c	2008-02-05 16:51:19.000000000 -0500
+++ Linux/mm/mempolicy.c	2008-02-06 10:58:57.000000000 -0500
@@ -114,24 +114,37 @@ static void mpol_rebind_policy(struct me
                                const nodemask_t *newmask);
 
 /* Do sanity checking on a policy */
-static int mpol_check_policy(int mode, nodemask_t *nodes)
+static int mpol_check_policy(int mode, nodemask_t *nodes, int was_empty)
 {
-	int empty = nodes_empty(*nodes);
+	int is_empty = nodes_empty(*nodes);
 
 	switch (mode) {
 	case MPOL_DEFAULT:
-		if (!empty)
+		/*
+		 * require caller to specify an empty nodemask
+		 * before "contextualization"
+		 */
+		if (!was_empty)
 			return -EINVAL;
 		break;
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
-		/* Preferred will only use the first bit, but allow
-		   more for now. */
-		if (empty)
+		/*
+		 * require at least 1 valid node after "contextualization"
+		 */
+		if (is_empty)
+			return -EINVAL;
+		break;
+	case MPOL_PREFERRED:
+		/*
+		 * Did caller specify invalid nodes?
+		 * Don't silently accept this as "local allocation".
+		 */
+		if (!was_empty && is_empty)
 			return -EINVAL;
 		break;
 	}
- 	return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
+ 	return 0;
 }
 
 /* Generate a custom zonelist for the BIND policy. */
@@ -188,8 +201,6 @@ static struct mempolicy *mpol_new(int mo
 	switch (mode) {
 	case MPOL_INTERLEAVE:
 		policy->v.nodes = *nodes;
-		nodes_and(policy->v.nodes, policy->v.nodes,
-					node_states[N_HIGH_MEMORY]);
 		if (nodes_weight(policy->v.nodes) == 0) {
 			kmem_cache_free(policy_cache, policy);
 			return ERR_PTR(-EINVAL);
@@ -423,13 +434,22 @@ static int mbind_range(struct vm_area_st
 
 static int contextualize_policy(int mode, nodemask_t *nodes)
 {
+	int was_empty;
+
 	if (!nodes)
 		return 0;
 
+	/*
+	 * Remember whether in coming nodemask was empty,  If not,
+	 * restrict the nodes to the allowed nodes in the cpuset.
+	 * This is guaranteed to be a subset of nodes with memory.
+	 */
 	cpuset_update_task_memory_state();
-	if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
-		return -EINVAL;
-	return mpol_check_policy(mode, nodes);
+	was_empty = nodes_empty(*nodes);
+	if (!was_empty)
+		nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
+
+	return mpol_check_policy(mode, nodes, was_empty);
 }
 
 
@@ -797,7 +817,7 @@ static long do_mbind(unsigned long start
 	if (end == start)
 		return 0;
 
-	if (mpol_check_policy(mode, nmask))
+	if (contextualize_policy(mode, nmask))
 		return -EINVAL;
 
 	new = mpol_new(mode, nmask);
@@ -915,10 +935,6 @@ asmlinkage long sys_mbind(unsigned long 
 	err = get_nodes(&nodes, nmask, maxnode);
 	if (err)
 		return err;
-#ifdef CONFIG_CPUSETS
-	/* Restrict the nodes to the allowed nodes in the cpuset */
-	nodes_and(nodes, nodes, current->mems_allowed);
-#endif
 	return do_mbind(start, len, mode, &nodes, flags);
 }
 
Index: Linux/include/linux/cpuset.h
===================================================================
--- Linux.orig/include/linux/cpuset.h	2008-02-05 16:05:15.000000000 -0500
+++ Linux/include/linux/cpuset.h	2008-02-06 10:47:48.000000000 -0500
@@ -26,8 +26,6 @@ extern nodemask_t cpuset_mems_allowed(st
 #define cpuset_current_mems_allowed (current->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
 void cpuset_update_task_memory_state(void);
-#define cpuset_nodes_subset_current_mems_allowed(nodes) \
-		nodes_subset((nodes), current->mems_allowed)
 int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
 
 extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
@@ -103,7 +101,6 @@ static inline nodemask_t cpuset_mems_all
 #define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
 static inline void cpuset_init_current_mems_allowed(void) {}
 static inline void cpuset_update_task_memory_state(void) {}
-#define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
 
 static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
 {



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't works on memoryless node.
  2008-02-06 17:38         ` Lee Schermerhorn
@ 2008-02-07  8:31           ` KOSAKI Motohiro
  0 siblings, 0 replies; 90+ messages in thread
From: KOSAKI Motohiro @ 2008-02-07  8:31 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: kosaki.motohiro, linux-kernel, Andrew Morton, Christoph Lameter,
	Paul Jackson, David Rientjes, Mel Gorman, torvalds, Eric Whitney

Hi Lee-san

Unfortunately, 2.6.24-mm1 can't boot on fujitsu machine.
(hmm, origin.patch cause regression to pci initialization ;-)

instead, I tested 2.6.24 + your patch.
it seem work good :)

Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

and, I have a bit comment.


>  /* Do sanity checking on a policy */
> -static int mpol_check_policy(int mode, nodemask_t *nodes)
> +static int mpol_check_policy(int mode, nodemask_t *nodes, int was_empty)

was_empty argument is a bit ugly.
Could we unify mpol_check_policy and contextualize_policy?
mpol_check_policy only called from contextualize_policy.

> - 	return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
> + 	return 0;

Could we N_POSSIBLE check?

I attached the patch for my idea explain.
on my test environment, your patch and mine works good both.

- kosaki


---
 mm/mempolicy.c |   47 +++++++++++++++++++++--------------------------
 1 file changed, 21 insertions(+), 26 deletions(-)

Index: b/mm/mempolicy.c
===================================================================
--- a/mm/mempolicy.c	2008-02-07 17:19:09.000000000 +0900
+++ b/mm/mempolicy.c	2008-02-07 17:24:28.000000000 +0900
@@ -114,9 +114,25 @@ static void mpol_rebind_policy(struct me
                                const nodemask_t *newmask);
 
 /* Do sanity checking on a policy */
-static int mpol_check_policy(int mode, nodemask_t *nodes, int was_empty)
+static int mpol_check_policy(int mode, nodemask_t *nodes)
 {
-	int is_empty = nodes_empty(*nodes);
+	int was_empty;
+	int is_empty;
+
+	if (!nodes)
+		return 0;
+
+	/*
+	 * Remember whether in coming nodemask was empty,  If not,
+	 * restrict the nodes to the allowed nodes in the cpuset.
+	 * This is guaranteed to be a subset of nodes with memory.
+	 */
+	cpuset_update_task_memory_state();
+	was_empty = nodes_empty(*nodes);
+	if (!was_empty)
+		nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
+
+	is_empty = nodes_empty(*nodes);
 
 	switch (mode) {
 	case MPOL_DEFAULT:
@@ -144,7 +160,7 @@ static int mpol_check_policy(int mode, n
 			return -EINVAL;
 		break;
 	}
- 	return 0;
+ 	return nodes_subset(*nodes, node_states[N_POSSIBLE]) ? 0 : -EINVAL;
 }
 
 /* Generate a custom zonelist for the BIND policy. */
@@ -432,27 +448,6 @@ static int mbind_range(struct vm_area_st
 	return err;
 }
 
-static int contextualize_policy(int mode, nodemask_t *nodes)
-{
-	int was_empty;
-
-	if (!nodes)
-		return 0;
-
-	/*
-	 * Remember whether in coming nodemask was empty,  If not,
-	 * restrict the nodes to the allowed nodes in the cpuset.
-	 * This is guaranteed to be a subset of nodes with memory.
-	 */
-	cpuset_update_task_memory_state();
-	was_empty = nodes_empty(*nodes);
-	if (!was_empty)
-		nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
-
-	return mpol_check_policy(mode, nodes, was_empty);
-}
-
-
 /*
  * Update task->flags PF_MEMPOLICY bit: set iff non-default
  * mempolicy.  Allows more rapid checking of this (combined perhaps
@@ -488,7 +483,7 @@ static long do_set_mempolicy(int mode, n
 {
 	struct mempolicy *new;
 
-	if (contextualize_policy(mode, nodes))
+	if (mpol_check_policy(mode, nodes))
 		return -EINVAL;
 	new = mpol_new(mode, nodes);
 	if (IS_ERR(new))
@@ -817,7 +812,7 @@ static long do_mbind(unsigned long start
 	if (end == start)
 		return 0;
 
-	if (contextualize_policy(mode, nmask))
+	if (mpol_check_policy(mode, nmask))
 		return -EINVAL;
 
 	new = mpol_new(mode, nmask);



^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
  2008-02-05  9:26         ` KOSAKI Motohiro
@ 2008-02-08 19:45           ` Lee Schermerhorn
  -1 siblings, 0 replies; 90+ messages in thread
From: Lee Schermerhorn @ 2008-02-08 19:45 UTC (permalink / raw)
  To: KOSAKI Motohiro, linux-kernel, Andrew Morton
  Cc: Christoph Lameter, Paul Jackson, David Rientjes, Mel Gorman,
	linux-mm, torvalds, Eric Whitney

Was "Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't
works on memoryless node."

[Aside:  I noticed there were two slightly different distributions for
this topic.  I've unified the distribution lists w/o dropping anyone, I
think.  Apologies if you'd rather have been dropped...]

Here's V3 of the patch, accomodating Kosaki Motohiro's suggestion for
folding contextualize_policy() into mpol_check_policy() [because my
"was_empty" argument "was ugly" ;-)].  It does seem to clean up the
code.

I'm still deferring David Rientjes' suggestion to fold
mpol_check_policy() into mpol_new().  We need to sort out whether
mempolicies specified for tmpfs and hugetlbfs mounts always need the
same "contextualization" as user/application installed policies.  I
don't want to hold up this bug fix for that discussion.  This is
something Paul J will need to address with his cpuset/mempolicy rework,
so we can sort it out in that context.

Again, tested with "numactl --interleave=all" and memtoy on ia64 using
mem= command line argument to simulate memoryless node.


Lee

============================
[PATCH] 2.6.24-mm1 - mempolicy:  silently restrict nodemask to allowed nodes

V2 -> V3:
+ As suggested by Kosaki Motohito, fold the "contextualization"
  of policy nodemask into mpol_check_policy().  Looks a little
  cleaner. 

V1 -> V2:
+ Communicate whether or not incoming node mask was empty to
  mpol_check_policy() for better error checking.
+ As suggested by David Rientjes, remove the now unused
   cpuset_nodes_subset_current_mems_allowed() from cpuset.h

Kosaki Motohito noted that "numactl --interleave=all ..." failed in the
presence of memoryless nodes.  This patch attempts to fix that problem.

Some background:  

numactl --interleave=all calls set_mempolicy(2) with a fully
populated [out to MAXNUMNODES] nodemask.  set_mempolicy()
[in do_set_mempolicy()] calls contextualize_policy() which
requires that the nodemask be a subset of the current task's
mems_allowed; else EINVAL will be returned.  A task's
mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]--
i.e., nodes with memory.  So, a fully populated nodemask will
be declared invalid if it includes memoryless nodes.

  NOTE:  the same thing will occur when running in a cpuset
         with restricted mem_allowed--for the same reason:
         node mask contains dis-allowed nodes.

mbind(2), on the other hand, just masks off any nodes in the 
nodemask that are not included in the caller's mems_allowed.

In each case [mbind() and set_mempolicy()], mpol_check_policy()
will complain [again, resulting in EINVAL] if the nodemask contains 
any memoryless nodes.  This is somewhat redundant as mpol_new() 
will remove memoryless nodes for interleave policy, as will 
bind_zonelist()--called by mpol_new() for BIND policy.

Proposed fix:

1) modify contextualize_policy logic to:
   a) remember whether the incoming node mask is empty.
   b) if not, restrict the nodemask to allowed nodes, as is
      currently done in-line for mbind().  This guarantees
      that the resulting mask includes only nodes with memory.

      NOTE:  this is a [benign, IMO] change in behavior for
             set_mempolicy().  Dis-allowed nodes will be
             silently ignored, rather than returning an error.

   c) fold this code into mpol_check_policy(), replace 2 calls to
      contextualize_policy() to call mpol_check_policy() directly
      and remove contextualize_policy().

2) In existing mpol_check_policy() logic, after "contextualization":
   a) MPOL_DEFAULT:  require that in coming mask "was_empty"
   b) MPOL_{BIND|INTERLEAVE}:  require that contextualized nodemask
      contains at least one node.
   c) add a case for MPOL_PREFERRED:  if in coming was not empty
      and resulting mask IS empty, user specified invalid nodes.
      Return EINVAL.
   c) remove the now redundant check for memoryless nodes

3) remove the now redundant masking of policy nodes for interleave
   policy from mpol_new().

4) Now that mpol_check_policy() contextualizes the nodemask, remove
   the in-line nodes_and() from sys_mbind().  I believe that this
   restores mbind() to the behavior before the memoryless-nodes
   patch series.  E.g., we'll no longer treat an invalid nodemask
   with MPOL_PREFERRED as local allocation.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/cpuset.h |    3 --
 mm/mempolicy.c         |   61 ++++++++++++++++++++++++++++---------------------
 2 files changed, 36 insertions(+), 28 deletions(-)

Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c	2008-02-08 11:11:34.000000000 -0500
+++ Linux/mm/mempolicy.c	2008-02-08 13:40:40.000000000 -0500
@@ -116,22 +116,51 @@ static void mpol_rebind_policy(struct me
 /* Do sanity checking on a policy */
 static int mpol_check_policy(int mode, nodemask_t *nodes)
 {
-	int empty = nodes_empty(*nodes);
+	int was_empty, is_empty;
+
+	if (!nodes)
+		return 0;
+
+	/*
+	 * "Contextualize" the in-coming nodemast for cpusets:
+	 * Remember whether in-coming nodemask was empty,  If not,
+	 * restrict the nodes to the allowed nodes in the cpuset.
+	 * This is guaranteed to be a subset of nodes with memory.
+	 */
+	cpuset_update_task_memory_state();
+	is_empty = was_empty = nodes_empty(*nodes);
+	if (!was_empty) {
+		nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
+		is_empty = nodes_empty(*nodes);	/* after "contextualization" */
+	}
 
 	switch (mode) {
 	case MPOL_DEFAULT:
-		if (!empty)
+		/*
+		 * require caller to specify an empty nodemask
+		 * before "contextualization"
+		 */
+		if (!was_empty)
 			return -EINVAL;
 		break;
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
-		/* Preferred will only use the first bit, but allow
-		   more for now. */
-		if (empty)
+		/*
+		 * require at least 1 valid node after "contextualization"
+		 */
+		if (is_empty)
+			return -EINVAL;
+		break;
+	case MPOL_PREFERRED:
+		/*
+		 * Did caller specify invalid nodes?
+		 * Don't silently accept this as "local allocation".
+		 */
+		if (!was_empty && is_empty)
 			return -EINVAL;
 		break;
 	}
- 	return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
+ 	return 0;
 }
 
 /* Generate a custom zonelist for the BIND policy. */
@@ -188,8 +217,6 @@ static struct mempolicy *mpol_new(int mo
 	switch (mode) {
 	case MPOL_INTERLEAVE:
 		policy->v.nodes = *nodes;
-		nodes_and(policy->v.nodes, policy->v.nodes,
-					node_states[N_HIGH_MEMORY]);
 		if (nodes_weight(policy->v.nodes) == 0) {
 			kmem_cache_free(policy_cache, policy);
 			return ERR_PTR(-EINVAL);
@@ -421,18 +448,6 @@ static int mbind_range(struct vm_area_st
 	return err;
 }
 
-static int contextualize_policy(int mode, nodemask_t *nodes)
-{
-	if (!nodes)
-		return 0;
-
-	cpuset_update_task_memory_state();
-	if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
-		return -EINVAL;
-	return mpol_check_policy(mode, nodes);
-}
-
-
 /*
  * Update task->flags PF_MEMPOLICY bit: set iff non-default
  * mempolicy.  Allows more rapid checking of this (combined perhaps
@@ -468,7 +483,7 @@ static long do_set_mempolicy(int mode, n
 {
 	struct mempolicy *new;
 
-	if (contextualize_policy(mode, nodes))
+	if (mpol_check_policy(mode, nodes))
 		return -EINVAL;
 	new = mpol_new(mode, nodes);
 	if (IS_ERR(new))
@@ -915,10 +930,6 @@ asmlinkage long sys_mbind(unsigned long 
 	err = get_nodes(&nodes, nmask, maxnode);
 	if (err)
 		return err;
-#ifdef CONFIG_CPUSETS
-	/* Restrict the nodes to the allowed nodes in the cpuset */
-	nodes_and(nodes, nodes, current->mems_allowed);
-#endif
 	return do_mbind(start, len, mode, &nodes, flags);
 }
 
Index: Linux/include/linux/cpuset.h
===================================================================
--- Linux.orig/include/linux/cpuset.h	2008-02-08 11:11:34.000000000 -0500
+++ Linux/include/linux/cpuset.h	2008-02-08 11:12:43.000000000 -0500
@@ -26,8 +26,6 @@ extern nodemask_t cpuset_mems_allowed(st
 #define cpuset_current_mems_allowed (current->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
 void cpuset_update_task_memory_state(void);
-#define cpuset_nodes_subset_current_mems_allowed(nodes) \
-		nodes_subset((nodes), current->mems_allowed)
 int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
 
 extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
@@ -103,7 +101,6 @@ static inline nodemask_t cpuset_mems_all
 #define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
 static inline void cpuset_init_current_mems_allowed(void) {}
 static inline void cpuset_update_task_memory_state(void) {}
-#define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
 
 static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
 {



^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
@ 2008-02-08 19:45           ` Lee Schermerhorn
  0 siblings, 0 replies; 90+ messages in thread
From: Lee Schermerhorn @ 2008-02-08 19:45 UTC (permalink / raw)
  To: KOSAKI Motohiro, linux-kernel, Andrew Morton
  Cc: Christoph Lameter, Paul Jackson, David Rientjes, Mel Gorman,
	linux-mm, torvalds, Eric Whitney

Was "Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't
works on memoryless node."

[Aside:  I noticed there were two slightly different distributions for
this topic.  I've unified the distribution lists w/o dropping anyone, I
think.  Apologies if you'd rather have been dropped...]

Here's V3 of the patch, accomodating Kosaki Motohiro's suggestion for
folding contextualize_policy() into mpol_check_policy() [because my
"was_empty" argument "was ugly" ;-)].  It does seem to clean up the
code.

I'm still deferring David Rientjes' suggestion to fold
mpol_check_policy() into mpol_new().  We need to sort out whether
mempolicies specified for tmpfs and hugetlbfs mounts always need the
same "contextualization" as user/application installed policies.  I
don't want to hold up this bug fix for that discussion.  This is
something Paul J will need to address with his cpuset/mempolicy rework,
so we can sort it out in that context.

Again, tested with "numactl --interleave=all" and memtoy on ia64 using
mem= command line argument to simulate memoryless node.


Lee

============================
[PATCH] 2.6.24-mm1 - mempolicy:  silently restrict nodemask to allowed nodes

V2 -> V3:
+ As suggested by Kosaki Motohito, fold the "contextualization"
  of policy nodemask into mpol_check_policy().  Looks a little
  cleaner. 

V1 -> V2:
+ Communicate whether or not incoming node mask was empty to
  mpol_check_policy() for better error checking.
+ As suggested by David Rientjes, remove the now unused
   cpuset_nodes_subset_current_mems_allowed() from cpuset.h

Kosaki Motohito noted that "numactl --interleave=all ..." failed in the
presence of memoryless nodes.  This patch attempts to fix that problem.

Some background:  

numactl --interleave=all calls set_mempolicy(2) with a fully
populated [out to MAXNUMNODES] nodemask.  set_mempolicy()
[in do_set_mempolicy()] calls contextualize_policy() which
requires that the nodemask be a subset of the current task's
mems_allowed; else EINVAL will be returned.  A task's
mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]--
i.e., nodes with memory.  So, a fully populated nodemask will
be declared invalid if it includes memoryless nodes.

  NOTE:  the same thing will occur when running in a cpuset
         with restricted mem_allowed--for the same reason:
         node mask contains dis-allowed nodes.

mbind(2), on the other hand, just masks off any nodes in the 
nodemask that are not included in the caller's mems_allowed.

In each case [mbind() and set_mempolicy()], mpol_check_policy()
will complain [again, resulting in EINVAL] if the nodemask contains 
any memoryless nodes.  This is somewhat redundant as mpol_new() 
will remove memoryless nodes for interleave policy, as will 
bind_zonelist()--called by mpol_new() for BIND policy.

Proposed fix:

1) modify contextualize_policy logic to:
   a) remember whether the incoming node mask is empty.
   b) if not, restrict the nodemask to allowed nodes, as is
      currently done in-line for mbind().  This guarantees
      that the resulting mask includes only nodes with memory.

      NOTE:  this is a [benign, IMO] change in behavior for
             set_mempolicy().  Dis-allowed nodes will be
             silently ignored, rather than returning an error.

   c) fold this code into mpol_check_policy(), replace 2 calls to
      contextualize_policy() to call mpol_check_policy() directly
      and remove contextualize_policy().

2) In existing mpol_check_policy() logic, after "contextualization":
   a) MPOL_DEFAULT:  require that in coming mask "was_empty"
   b) MPOL_{BIND|INTERLEAVE}:  require that contextualized nodemask
      contains at least one node.
   c) add a case for MPOL_PREFERRED:  if in coming was not empty
      and resulting mask IS empty, user specified invalid nodes.
      Return EINVAL.
   c) remove the now redundant check for memoryless nodes

3) remove the now redundant masking of policy nodes for interleave
   policy from mpol_new().

4) Now that mpol_check_policy() contextualizes the nodemask, remove
   the in-line nodes_and() from sys_mbind().  I believe that this
   restores mbind() to the behavior before the memoryless-nodes
   patch series.  E.g., we'll no longer treat an invalid nodemask
   with MPOL_PREFERRED as local allocation.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/cpuset.h |    3 --
 mm/mempolicy.c         |   61 ++++++++++++++++++++++++++++---------------------
 2 files changed, 36 insertions(+), 28 deletions(-)

Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c	2008-02-08 11:11:34.000000000 -0500
+++ Linux/mm/mempolicy.c	2008-02-08 13:40:40.000000000 -0500
@@ -116,22 +116,51 @@ static void mpol_rebind_policy(struct me
 /* Do sanity checking on a policy */
 static int mpol_check_policy(int mode, nodemask_t *nodes)
 {
-	int empty = nodes_empty(*nodes);
+	int was_empty, is_empty;
+
+	if (!nodes)
+		return 0;
+
+	/*
+	 * "Contextualize" the in-coming nodemast for cpusets:
+	 * Remember whether in-coming nodemask was empty,  If not,
+	 * restrict the nodes to the allowed nodes in the cpuset.
+	 * This is guaranteed to be a subset of nodes with memory.
+	 */
+	cpuset_update_task_memory_state();
+	is_empty = was_empty = nodes_empty(*nodes);
+	if (!was_empty) {
+		nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
+		is_empty = nodes_empty(*nodes);	/* after "contextualization" */
+	}
 
 	switch (mode) {
 	case MPOL_DEFAULT:
-		if (!empty)
+		/*
+		 * require caller to specify an empty nodemask
+		 * before "contextualization"
+		 */
+		if (!was_empty)
 			return -EINVAL;
 		break;
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
-		/* Preferred will only use the first bit, but allow
-		   more for now. */
-		if (empty)
+		/*
+		 * require at least 1 valid node after "contextualization"
+		 */
+		if (is_empty)
+			return -EINVAL;
+		break;
+	case MPOL_PREFERRED:
+		/*
+		 * Did caller specify invalid nodes?
+		 * Don't silently accept this as "local allocation".
+		 */
+		if (!was_empty && is_empty)
 			return -EINVAL;
 		break;
 	}
- 	return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
+ 	return 0;
 }
 
 /* Generate a custom zonelist for the BIND policy. */
@@ -188,8 +217,6 @@ static struct mempolicy *mpol_new(int mo
 	switch (mode) {
 	case MPOL_INTERLEAVE:
 		policy->v.nodes = *nodes;
-		nodes_and(policy->v.nodes, policy->v.nodes,
-					node_states[N_HIGH_MEMORY]);
 		if (nodes_weight(policy->v.nodes) == 0) {
 			kmem_cache_free(policy_cache, policy);
 			return ERR_PTR(-EINVAL);
@@ -421,18 +448,6 @@ static int mbind_range(struct vm_area_st
 	return err;
 }
 
-static int contextualize_policy(int mode, nodemask_t *nodes)
-{
-	if (!nodes)
-		return 0;
-
-	cpuset_update_task_memory_state();
-	if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
-		return -EINVAL;
-	return mpol_check_policy(mode, nodes);
-}
-
-
 /*
  * Update task->flags PF_MEMPOLICY bit: set iff non-default
  * mempolicy.  Allows more rapid checking of this (combined perhaps
@@ -468,7 +483,7 @@ static long do_set_mempolicy(int mode, n
 {
 	struct mempolicy *new;
 
-	if (contextualize_policy(mode, nodes))
+	if (mpol_check_policy(mode, nodes))
 		return -EINVAL;
 	new = mpol_new(mode, nodes);
 	if (IS_ERR(new))
@@ -915,10 +930,6 @@ asmlinkage long sys_mbind(unsigned long 
 	err = get_nodes(&nodes, nmask, maxnode);
 	if (err)
 		return err;
-#ifdef CONFIG_CPUSETS
-	/* Restrict the nodes to the allowed nodes in the cpuset */
-	nodes_and(nodes, nodes, current->mems_allowed);
-#endif
 	return do_mbind(start, len, mode, &nodes, flags);
 }
 
Index: Linux/include/linux/cpuset.h
===================================================================
--- Linux.orig/include/linux/cpuset.h	2008-02-08 11:11:34.000000000 -0500
+++ Linux/include/linux/cpuset.h	2008-02-08 11:12:43.000000000 -0500
@@ -26,8 +26,6 @@ extern nodemask_t cpuset_mems_allowed(st
 #define cpuset_current_mems_allowed (current->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
 void cpuset_update_task_memory_state(void);
-#define cpuset_nodes_subset_current_mems_allowed(nodes) \
-		nodes_subset((nodes), current->mems_allowed)
 int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
 
 extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
@@ -103,7 +101,6 @@ static inline nodemask_t cpuset_mems_all
 #define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
 static inline void cpuset_init_current_mems_allowed(void) {}
 static inline void cpuset_update_task_memory_state(void) {}
-#define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
 
 static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
 {


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1] Mempolicy: silently restrict nodemask to allowed nodes V3
  2008-02-08 19:45           ` Lee Schermerhorn
@ 2008-02-09 18:11             ` KOSAKI Motohiro
  -1 siblings, 0 replies; 90+ messages in thread
From: KOSAKI Motohiro @ 2008-02-09 18:11 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-kernel, Andrew Morton, Christoph Lameter, Paul Jackson,
	David Rientjes, Mel Gorman, linux-mm, torvalds, Eric Whitney

Hi Lee-san

looks good for me.
I'll test about the head of week and report it by another mail.

Thanks!

> Was "Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't
> works on memoryless node."
>
> [Aside:  I noticed there were two slightly different distributions for
> this topic.  I've unified the distribution lists w/o dropping anyone, I
> think.  Apologies if you'd rather have been dropped...]
>
> Here's V3 of the patch, accomodating Kosaki Motohiro's suggestion for
> folding contextualize_policy() into mpol_check_policy() [because my
> "was_empty" argument "was ugly" ;-)].  It does seem to clean up the
> code.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1] Mempolicy: silently restrict nodemask to allowed nodes V3
@ 2008-02-09 18:11             ` KOSAKI Motohiro
  0 siblings, 0 replies; 90+ messages in thread
From: KOSAKI Motohiro @ 2008-02-09 18:11 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-kernel, Andrew Morton, Christoph Lameter, Paul Jackson,
	David Rientjes, Mel Gorman, linux-mm, torvalds, Eric Whitney

Hi Lee-san

looks good for me.
I'll test about the head of week and report it by another mail.

Thanks!

> Was "Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't
> works on memoryless node."
>
> [Aside:  I noticed there were two slightly different distributions for
> this topic.  I've unified the distribution lists w/o dropping anyone, I
> think.  Apologies if you'd rather have been dropped...]
>
> Here's V3 of the patch, accomodating Kosaki Motohiro's suggestion for
> folding contextualize_policy() into mpol_check_policy() [because my
> "was_empty" argument "was ugly" ;-)].  It does seem to clean up the
> code.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
  2008-02-08 19:45           ` Lee Schermerhorn
@ 2008-02-10  5:29             ` KOSAKI Motohiro
  -1 siblings, 0 replies; 90+ messages in thread
From: KOSAKI Motohiro @ 2008-02-10  5:29 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: kosaki.motohiro, linux-kernel, Andrew Morton, Christoph Lameter,
	Paul Jackson, David Rientjes, Mel Gorman, linux-mm, torvalds,
	Eric Whitney, Greg KH

CC'd Greg KH <greg@kroah.com>

I tested this patch on fujitsu memoryless node.
(2.6.24 + silently-restrict-nodemask-to-allowed-nodes-V3 insted 2.6.24-mm1)
it seems works good.

Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>


Greg, I hope this patch merge to 2.6.24.x stable tree because
this patch is regression fixed patch.
Please tell me what do i doing for it.


[intentional full quote]

> Was "Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't
> works on memoryless node."
> 
> [Aside:  I noticed there were two slightly different distributions for
> this topic.  I've unified the distribution lists w/o dropping anyone, I
> think.  Apologies if you'd rather have been dropped...]
> 
> Here's V3 of the patch, accomodating Kosaki Motohiro's suggestion for
> folding contextualize_policy() into mpol_check_policy() [because my
> "was_empty" argument "was ugly" ;-)].  It does seem to clean up the
> code.
> 
> I'm still deferring David Rientjes' suggestion to fold
> mpol_check_policy() into mpol_new().  We need to sort out whether
> mempolicies specified for tmpfs and hugetlbfs mounts always need the
> same "contextualization" as user/application installed policies.  I
> don't want to hold up this bug fix for that discussion.  This is
> something Paul J will need to address with his cpuset/mempolicy rework,
> so we can sort it out in that context.
> 
> Again, tested with "numactl --interleave=all" and memtoy on ia64 using
> mem= command line argument to simulate memoryless node.
> 
> 
> Lee
> 
> ============================
> [PATCH] 2.6.24-mm1 - mempolicy:  silently restrict nodemask to allowed nodes
> 
> V2 -> V3:
> + As suggested by Kosaki Motohito, fold the "contextualization"
>   of policy nodemask into mpol_check_policy().  Looks a little
>   cleaner. 
> 
> V1 -> V2:
> + Communicate whether or not incoming node mask was empty to
>   mpol_check_policy() for better error checking.
> + As suggested by David Rientjes, remove the now unused
>    cpuset_nodes_subset_current_mems_allowed() from cpuset.h
> 
> Kosaki Motohito noted that "numactl --interleave=all ..." failed in the
> presence of memoryless nodes.  This patch attempts to fix that problem.
> 
> Some background:  
> 
> numactl --interleave=all calls set_mempolicy(2) with a fully
> populated [out to MAXNUMNODES] nodemask.  set_mempolicy()
> [in do_set_mempolicy()] calls contextualize_policy() which
> requires that the nodemask be a subset of the current task's
> mems_allowed; else EINVAL will be returned.  A task's
> mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]--
> i.e., nodes with memory.  So, a fully populated nodemask will
> be declared invalid if it includes memoryless nodes.
> 
>   NOTE:  the same thing will occur when running in a cpuset
>          with restricted mem_allowed--for the same reason:
>          node mask contains dis-allowed nodes.
> 
> mbind(2), on the other hand, just masks off any nodes in the 
> nodemask that are not included in the caller's mems_allowed.
> 
> In each case [mbind() and set_mempolicy()], mpol_check_policy()
> will complain [again, resulting in EINVAL] if the nodemask contains 
> any memoryless nodes.  This is somewhat redundant as mpol_new() 
> will remove memoryless nodes for interleave policy, as will 
> bind_zonelist()--called by mpol_new() for BIND policy.
> 
> Proposed fix:
> 
> 1) modify contextualize_policy logic to:
>    a) remember whether the incoming node mask is empty.
>    b) if not, restrict the nodemask to allowed nodes, as is
>       currently done in-line for mbind().  This guarantees
>       that the resulting mask includes only nodes with memory.
> 
>       NOTE:  this is a [benign, IMO] change in behavior for
>              set_mempolicy().  Dis-allowed nodes will be
>              silently ignored, rather than returning an error.
> 
>    c) fold this code into mpol_check_policy(), replace 2 calls to
>       contextualize_policy() to call mpol_check_policy() directly
>       and remove contextualize_policy().
> 
> 2) In existing mpol_check_policy() logic, after "contextualization":
>    a) MPOL_DEFAULT:  require that in coming mask "was_empty"
>    b) MPOL_{BIND|INTERLEAVE}:  require that contextualized nodemask
>       contains at least one node.
>    c) add a case for MPOL_PREFERRED:  if in coming was not empty
>       and resulting mask IS empty, user specified invalid nodes.
>       Return EINVAL.
>    c) remove the now redundant check for memoryless nodes
> 
> 3) remove the now redundant masking of policy nodes for interleave
>    policy from mpol_new().
> 
> 4) Now that mpol_check_policy() contextualizes the nodemask, remove
>    the in-line nodes_and() from sys_mbind().  I believe that this
>    restores mbind() to the behavior before the memoryless-nodes
>    patch series.  E.g., we'll no longer treat an invalid nodemask
>    with MPOL_PREFERRED as local allocation.
> 
> Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
>  include/linux/cpuset.h |    3 --
>  mm/mempolicy.c         |   61 ++++++++++++++++++++++++++++---------------------
>  2 files changed, 36 insertions(+), 28 deletions(-)
> 
> Index: Linux/mm/mempolicy.c
> ===================================================================
> --- Linux.orig/mm/mempolicy.c	2008-02-08 11:11:34.000000000 -0500
> +++ Linux/mm/mempolicy.c	2008-02-08 13:40:40.000000000 -0500
> @@ -116,22 +116,51 @@ static void mpol_rebind_policy(struct me
>  /* Do sanity checking on a policy */
>  static int mpol_check_policy(int mode, nodemask_t *nodes)
>  {
> -	int empty = nodes_empty(*nodes);
> +	int was_empty, is_empty;
> +
> +	if (!nodes)
> +		return 0;
> +
> +	/*
> +	 * "Contextualize" the in-coming nodemast for cpusets:
> +	 * Remember whether in-coming nodemask was empty,  If not,
> +	 * restrict the nodes to the allowed nodes in the cpuset.
> +	 * This is guaranteed to be a subset of nodes with memory.
> +	 */
> +	cpuset_update_task_memory_state();
> +	is_empty = was_empty = nodes_empty(*nodes);
> +	if (!was_empty) {
> +		nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
> +		is_empty = nodes_empty(*nodes);	/* after "contextualization" */
> +	}
>  
>  	switch (mode) {
>  	case MPOL_DEFAULT:
> -		if (!empty)
> +		/*
> +		 * require caller to specify an empty nodemask
> +		 * before "contextualization"
> +		 */
> +		if (!was_empty)
>  			return -EINVAL;
>  		break;
>  	case MPOL_BIND:
>  	case MPOL_INTERLEAVE:
> -		/* Preferred will only use the first bit, but allow
> -		   more for now. */
> -		if (empty)
> +		/*
> +		 * require at least 1 valid node after "contextualization"
> +		 */
> +		if (is_empty)
> +			return -EINVAL;
> +		break;
> +	case MPOL_PREFERRED:
> +		/*
> +		 * Did caller specify invalid nodes?
> +		 * Don't silently accept this as "local allocation".
> +		 */
> +		if (!was_empty && is_empty)
>  			return -EINVAL;
>  		break;
>  	}
> - 	return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
> + 	return 0;
>  }
>  
>  /* Generate a custom zonelist for the BIND policy. */
> @@ -188,8 +217,6 @@ static struct mempolicy *mpol_new(int mo
>  	switch (mode) {
>  	case MPOL_INTERLEAVE:
>  		policy->v.nodes = *nodes;
> -		nodes_and(policy->v.nodes, policy->v.nodes,
> -					node_states[N_HIGH_MEMORY]);
>  		if (nodes_weight(policy->v.nodes) == 0) {
>  			kmem_cache_free(policy_cache, policy);
>  			return ERR_PTR(-EINVAL);
> @@ -421,18 +448,6 @@ static int mbind_range(struct vm_area_st
>  	return err;
>  }
>  
> -static int contextualize_policy(int mode, nodemask_t *nodes)
> -{
> -	if (!nodes)
> -		return 0;
> -
> -	cpuset_update_task_memory_state();
> -	if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
> -		return -EINVAL;
> -	return mpol_check_policy(mode, nodes);
> -}
> -
> -
>  /*
>   * Update task->flags PF_MEMPOLICY bit: set iff non-default
>   * mempolicy.  Allows more rapid checking of this (combined perhaps
> @@ -468,7 +483,7 @@ static long do_set_mempolicy(int mode, n
>  {
>  	struct mempolicy *new;
>  
> -	if (contextualize_policy(mode, nodes))
> +	if (mpol_check_policy(mode, nodes))
>  		return -EINVAL;
>  	new = mpol_new(mode, nodes);
>  	if (IS_ERR(new))
> @@ -915,10 +930,6 @@ asmlinkage long sys_mbind(unsigned long 
>  	err = get_nodes(&nodes, nmask, maxnode);
>  	if (err)
>  		return err;
> -#ifdef CONFIG_CPUSETS
> -	/* Restrict the nodes to the allowed nodes in the cpuset */
> -	nodes_and(nodes, nodes, current->mems_allowed);
> -#endif
>  	return do_mbind(start, len, mode, &nodes, flags);
>  }
>  
> Index: Linux/include/linux/cpuset.h
> ===================================================================
> --- Linux.orig/include/linux/cpuset.h	2008-02-08 11:11:34.000000000 -0500
> +++ Linux/include/linux/cpuset.h	2008-02-08 11:12:43.000000000 -0500
> @@ -26,8 +26,6 @@ extern nodemask_t cpuset_mems_allowed(st
>  #define cpuset_current_mems_allowed (current->mems_allowed)
>  void cpuset_init_current_mems_allowed(void);
>  void cpuset_update_task_memory_state(void);
> -#define cpuset_nodes_subset_current_mems_allowed(nodes) \
> -		nodes_subset((nodes), current->mems_allowed)
>  int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
>  
>  extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
> @@ -103,7 +101,6 @@ static inline nodemask_t cpuset_mems_all
>  #define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
>  static inline void cpuset_init_current_mems_allowed(void) {}
>  static inline void cpuset_update_task_memory_state(void) {}
> -#define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
>  
>  static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
>  {
> 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>




^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
@ 2008-02-10  5:29             ` KOSAKI Motohiro
  0 siblings, 0 replies; 90+ messages in thread
From: KOSAKI Motohiro @ 2008-02-10  5:29 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: kosaki.motohiro, linux-kernel, Andrew Morton, Christoph Lameter,
	Paul Jackson, David Rientjes, Mel Gorman, linux-mm, torvalds,
	Eric Whitney, Greg KH

CC'd Greg KH <greg@kroah.com>

I tested this patch on fujitsu memoryless node.
(2.6.24 + silently-restrict-nodemask-to-allowed-nodes-V3 insted 2.6.24-mm1)
it seems works good.

Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>


Greg, I hope this patch merge to 2.6.24.x stable tree because
this patch is regression fixed patch.
Please tell me what do i doing for it.


[intentional full quote]

> Was "Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't
> works on memoryless node."
> 
> [Aside:  I noticed there were two slightly different distributions for
> this topic.  I've unified the distribution lists w/o dropping anyone, I
> think.  Apologies if you'd rather have been dropped...]
> 
> Here's V3 of the patch, accomodating Kosaki Motohiro's suggestion for
> folding contextualize_policy() into mpol_check_policy() [because my
> "was_empty" argument "was ugly" ;-)].  It does seem to clean up the
> code.
> 
> I'm still deferring David Rientjes' suggestion to fold
> mpol_check_policy() into mpol_new().  We need to sort out whether
> mempolicies specified for tmpfs and hugetlbfs mounts always need the
> same "contextualization" as user/application installed policies.  I
> don't want to hold up this bug fix for that discussion.  This is
> something Paul J will need to address with his cpuset/mempolicy rework,
> so we can sort it out in that context.
> 
> Again, tested with "numactl --interleave=all" and memtoy on ia64 using
> mem= command line argument to simulate memoryless node.
> 
> 
> Lee
> 
> ============================
> [PATCH] 2.6.24-mm1 - mempolicy:  silently restrict nodemask to allowed nodes
> 
> V2 -> V3:
> + As suggested by Kosaki Motohito, fold the "contextualization"
>   of policy nodemask into mpol_check_policy().  Looks a little
>   cleaner. 
> 
> V1 -> V2:
> + Communicate whether or not incoming node mask was empty to
>   mpol_check_policy() for better error checking.
> + As suggested by David Rientjes, remove the now unused
>    cpuset_nodes_subset_current_mems_allowed() from cpuset.h
> 
> Kosaki Motohito noted that "numactl --interleave=all ..." failed in the
> presence of memoryless nodes.  This patch attempts to fix that problem.
> 
> Some background:  
> 
> numactl --interleave=all calls set_mempolicy(2) with a fully
> populated [out to MAXNUMNODES] nodemask.  set_mempolicy()
> [in do_set_mempolicy()] calls contextualize_policy() which
> requires that the nodemask be a subset of the current task's
> mems_allowed; else EINVAL will be returned.  A task's
> mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]--
> i.e., nodes with memory.  So, a fully populated nodemask will
> be declared invalid if it includes memoryless nodes.
> 
>   NOTE:  the same thing will occur when running in a cpuset
>          with restricted mem_allowed--for the same reason:
>          node mask contains dis-allowed nodes.
> 
> mbind(2), on the other hand, just masks off any nodes in the 
> nodemask that are not included in the caller's mems_allowed.
> 
> In each case [mbind() and set_mempolicy()], mpol_check_policy()
> will complain [again, resulting in EINVAL] if the nodemask contains 
> any memoryless nodes.  This is somewhat redundant as mpol_new() 
> will remove memoryless nodes for interleave policy, as will 
> bind_zonelist()--called by mpol_new() for BIND policy.
> 
> Proposed fix:
> 
> 1) modify contextualize_policy logic to:
>    a) remember whether the incoming node mask is empty.
>    b) if not, restrict the nodemask to allowed nodes, as is
>       currently done in-line for mbind().  This guarantees
>       that the resulting mask includes only nodes with memory.
> 
>       NOTE:  this is a [benign, IMO] change in behavior for
>              set_mempolicy().  Dis-allowed nodes will be
>              silently ignored, rather than returning an error.
> 
>    c) fold this code into mpol_check_policy(), replace 2 calls to
>       contextualize_policy() to call mpol_check_policy() directly
>       and remove contextualize_policy().
> 
> 2) In existing mpol_check_policy() logic, after "contextualization":
>    a) MPOL_DEFAULT:  require that in coming mask "was_empty"
>    b) MPOL_{BIND|INTERLEAVE}:  require that contextualized nodemask
>       contains at least one node.
>    c) add a case for MPOL_PREFERRED:  if in coming was not empty
>       and resulting mask IS empty, user specified invalid nodes.
>       Return EINVAL.
>    c) remove the now redundant check for memoryless nodes
> 
> 3) remove the now redundant masking of policy nodes for interleave
>    policy from mpol_new().
> 
> 4) Now that mpol_check_policy() contextualizes the nodemask, remove
>    the in-line nodes_and() from sys_mbind().  I believe that this
>    restores mbind() to the behavior before the memoryless-nodes
>    patch series.  E.g., we'll no longer treat an invalid nodemask
>    with MPOL_PREFERRED as local allocation.
> 
> Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
>  include/linux/cpuset.h |    3 --
>  mm/mempolicy.c         |   61 ++++++++++++++++++++++++++++---------------------
>  2 files changed, 36 insertions(+), 28 deletions(-)
> 
> Index: Linux/mm/mempolicy.c
> ===================================================================
> --- Linux.orig/mm/mempolicy.c	2008-02-08 11:11:34.000000000 -0500
> +++ Linux/mm/mempolicy.c	2008-02-08 13:40:40.000000000 -0500
> @@ -116,22 +116,51 @@ static void mpol_rebind_policy(struct me
>  /* Do sanity checking on a policy */
>  static int mpol_check_policy(int mode, nodemask_t *nodes)
>  {
> -	int empty = nodes_empty(*nodes);
> +	int was_empty, is_empty;
> +
> +	if (!nodes)
> +		return 0;
> +
> +	/*
> +	 * "Contextualize" the in-coming nodemast for cpusets:
> +	 * Remember whether in-coming nodemask was empty,  If not,
> +	 * restrict the nodes to the allowed nodes in the cpuset.
> +	 * This is guaranteed to be a subset of nodes with memory.
> +	 */
> +	cpuset_update_task_memory_state();
> +	is_empty = was_empty = nodes_empty(*nodes);
> +	if (!was_empty) {
> +		nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
> +		is_empty = nodes_empty(*nodes);	/* after "contextualization" */
> +	}
>  
>  	switch (mode) {
>  	case MPOL_DEFAULT:
> -		if (!empty)
> +		/*
> +		 * require caller to specify an empty nodemask
> +		 * before "contextualization"
> +		 */
> +		if (!was_empty)
>  			return -EINVAL;
>  		break;
>  	case MPOL_BIND:
>  	case MPOL_INTERLEAVE:
> -		/* Preferred will only use the first bit, but allow
> -		   more for now. */
> -		if (empty)
> +		/*
> +		 * require at least 1 valid node after "contextualization"
> +		 */
> +		if (is_empty)
> +			return -EINVAL;
> +		break;
> +	case MPOL_PREFERRED:
> +		/*
> +		 * Did caller specify invalid nodes?
> +		 * Don't silently accept this as "local allocation".
> +		 */
> +		if (!was_empty && is_empty)
>  			return -EINVAL;
>  		break;
>  	}
> - 	return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
> + 	return 0;
>  }
>  
>  /* Generate a custom zonelist for the BIND policy. */
> @@ -188,8 +217,6 @@ static struct mempolicy *mpol_new(int mo
>  	switch (mode) {
>  	case MPOL_INTERLEAVE:
>  		policy->v.nodes = *nodes;
> -		nodes_and(policy->v.nodes, policy->v.nodes,
> -					node_states[N_HIGH_MEMORY]);
>  		if (nodes_weight(policy->v.nodes) == 0) {
>  			kmem_cache_free(policy_cache, policy);
>  			return ERR_PTR(-EINVAL);
> @@ -421,18 +448,6 @@ static int mbind_range(struct vm_area_st
>  	return err;
>  }
>  
> -static int contextualize_policy(int mode, nodemask_t *nodes)
> -{
> -	if (!nodes)
> -		return 0;
> -
> -	cpuset_update_task_memory_state();
> -	if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
> -		return -EINVAL;
> -	return mpol_check_policy(mode, nodes);
> -}
> -
> -
>  /*
>   * Update task->flags PF_MEMPOLICY bit: set iff non-default
>   * mempolicy.  Allows more rapid checking of this (combined perhaps
> @@ -468,7 +483,7 @@ static long do_set_mempolicy(int mode, n
>  {
>  	struct mempolicy *new;
>  
> -	if (contextualize_policy(mode, nodes))
> +	if (mpol_check_policy(mode, nodes))
>  		return -EINVAL;
>  	new = mpol_new(mode, nodes);
>  	if (IS_ERR(new))
> @@ -915,10 +930,6 @@ asmlinkage long sys_mbind(unsigned long 
>  	err = get_nodes(&nodes, nmask, maxnode);
>  	if (err)
>  		return err;
> -#ifdef CONFIG_CPUSETS
> -	/* Restrict the nodes to the allowed nodes in the cpuset */
> -	nodes_and(nodes, nodes, current->mems_allowed);
> -#endif
>  	return do_mbind(start, len, mode, &nodes, flags);
>  }
>  
> Index: Linux/include/linux/cpuset.h
> ===================================================================
> --- Linux.orig/include/linux/cpuset.h	2008-02-08 11:11:34.000000000 -0500
> +++ Linux/include/linux/cpuset.h	2008-02-08 11:12:43.000000000 -0500
> @@ -26,8 +26,6 @@ extern nodemask_t cpuset_mems_allowed(st
>  #define cpuset_current_mems_allowed (current->mems_allowed)
>  void cpuset_init_current_mems_allowed(void);
>  void cpuset_update_task_memory_state(void);
> -#define cpuset_nodes_subset_current_mems_allowed(nodes) \
> -		nodes_subset((nodes), current->mems_allowed)
>  int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
>  
>  extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
> @@ -103,7 +101,6 @@ static inline nodemask_t cpuset_mems_all
>  #define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
>  static inline void cpuset_init_current_mems_allowed(void) {}
>  static inline void cpuset_update_task_memory_state(void) {}
> -#define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
>  
>  static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
>  {
> 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
  2008-02-10  5:29             ` KOSAKI Motohiro
@ 2008-02-10  5:49               ` Greg KH
  -1 siblings, 0 replies; 90+ messages in thread
From: Greg KH @ 2008-02-10  5:49 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Lee Schermerhorn, linux-kernel, Andrew Morton, Christoph Lameter,
	Paul Jackson, David Rientjes, Mel Gorman, linux-mm, torvalds,
	Eric Whitney

On Sun, Feb 10, 2008 at 02:29:24PM +0900, KOSAKI Motohiro wrote:
> CC'd Greg KH <greg@kroah.com>
> 
> I tested this patch on fujitsu memoryless node.
> (2.6.24 + silently-restrict-nodemask-to-allowed-nodes-V3 insted 2.6.24-mm1)
> it seems works good.
> 
> Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> 
> Greg, I hope this patch merge to 2.6.24.x stable tree because
> this patch is regression fixed patch.
> Please tell me what do i doing for it.

Once the patch goes into Linus's tree, feel free to send it to the
stable@kernel.org address so that we can include it in the 2.6.24.x
tree.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
@ 2008-02-10  5:49               ` Greg KH
  0 siblings, 0 replies; 90+ messages in thread
From: Greg KH @ 2008-02-10  5:49 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Lee Schermerhorn, linux-kernel, Andrew Morton, Christoph Lameter,
	Paul Jackson, David Rientjes, Mel Gorman, linux-mm, torvalds,
	Eric Whitney

On Sun, Feb 10, 2008 at 02:29:24PM +0900, KOSAKI Motohiro wrote:
> CC'd Greg KH <greg@kroah.com>
> 
> I tested this patch on fujitsu memoryless node.
> (2.6.24 + silently-restrict-nodemask-to-allowed-nodes-V3 insted 2.6.24-mm1)
> it seems works good.
> 
> Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> 
> Greg, I hope this patch merge to 2.6.24.x stable tree because
> this patch is regression fixed patch.
> Please tell me what do i doing for it.

Once the patch goes into Linus's tree, feel free to send it to the
stable@kernel.org address so that we can include it in the 2.6.24.x
tree.

thanks,

greg k-h

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
  2008-02-10  5:49               ` Greg KH
@ 2008-02-10  7:42                 ` Linus Torvalds
  -1 siblings, 0 replies; 90+ messages in thread
From: Linus Torvalds @ 2008-02-10  7:42 UTC (permalink / raw)
  To: Greg KH
  Cc: KOSAKI Motohiro, Lee Schermerhorn, linux-kernel, Andrew Morton,
	Christoph Lameter, Paul Jackson, David Rientjes, Mel Gorman,
	linux-mm, Eric Whitney



On Sat, 9 Feb 2008, Greg KH wrote:
> 
> Once the patch goes into Linus's tree, feel free to send it to the
> stable@kernel.org address so that we can include it in the 2.6.24.x
> tree.

I've been ignoring the patches because they say "PATCH 2.6.24-mm1", and so 
I simply don't know whether it's supposed to go into *my* kernel or just 
-mm.

There's also been several versions and discussions, so I'd really like to 
have somebody send me a final patch with all the acks etc.. One that is 
clearly for me, not for -mm.

			Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
@ 2008-02-10  7:42                 ` Linus Torvalds
  0 siblings, 0 replies; 90+ messages in thread
From: Linus Torvalds @ 2008-02-10  7:42 UTC (permalink / raw)
  To: Greg KH
  Cc: KOSAKI Motohiro, Lee Schermerhorn, linux-kernel, Andrew Morton,
	Christoph Lameter, Paul Jackson, David Rientjes, Mel Gorman,
	linux-mm, Eric Whitney


On Sat, 9 Feb 2008, Greg KH wrote:
> 
> Once the patch goes into Linus's tree, feel free to send it to the
> stable@kernel.org address so that we can include it in the 2.6.24.x
> tree.

I've been ignoring the patches because they say "PATCH 2.6.24-mm1", and so 
I simply don't know whether it's supposed to go into *my* kernel or just 
-mm.

There's also been several versions and discussions, so I'd really like to 
have somebody send me a final patch with all the acks etc.. One that is 
clearly for me, not for -mm.

			Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
  2008-02-10  7:42                 ` Linus Torvalds
@ 2008-02-10 10:31                   ` Andrew Morton
  -1 siblings, 0 replies; 90+ messages in thread
From: Andrew Morton @ 2008-02-10 10:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Greg KH, KOSAKI Motohiro, Lee Schermerhorn, linux-kernel,
	Christoph Lameter, Paul Jackson, David Rientjes, Mel Gorman,
	linux-mm, Eric Whitney

On Sat, 9 Feb 2008 23:42:21 -0800 (PST) Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Sat, 9 Feb 2008, Greg KH wrote:
> > 
> > Once the patch goes into Linus's tree, feel free to send it to the
> > stable@kernel.org address so that we can include it in the 2.6.24.x
> > tree.
> 
> I've been ignoring the patches because they say "PATCH 2.6.24-mm1", and so 
> I simply don't know whether it's supposed to go into *my* kernel or just 
> -mm.
> 
> There's also been several versions and discussions, so I'd really like to 
> have somebody send me a final patch with all the acks etc.. One that is 
> clearly for me, not for -mm.
> 

fyi, I won't be able to do much patch-wrangling until Tuesday or Wednesday.
All the big machines are disconnected and mothballed due to domestic
s/carpet/hardwood/g.


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
@ 2008-02-10 10:31                   ` Andrew Morton
  0 siblings, 0 replies; 90+ messages in thread
From: Andrew Morton @ 2008-02-10 10:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Greg KH, KOSAKI Motohiro, Lee Schermerhorn, linux-kernel,
	Christoph Lameter, Paul Jackson, David Rientjes, Mel Gorman,
	linux-mm, Eric Whitney

On Sat, 9 Feb 2008 23:42:21 -0800 (PST) Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Sat, 9 Feb 2008, Greg KH wrote:
> > 
> > Once the patch goes into Linus's tree, feel free to send it to the
> > stable@kernel.org address so that we can include it in the 2.6.24.x
> > tree.
> 
> I've been ignoring the patches because they say "PATCH 2.6.24-mm1", and so 
> I simply don't know whether it's supposed to go into *my* kernel or just 
> -mm.
> 
> There's also been several versions and discussions, so I'd really like to 
> have somebody send me a final patch with all the acks etc.. One that is 
> clearly for me, not for -mm.
> 

fyi, I won't be able to do much patch-wrangling until Tuesday or Wednesday.
All the big machines are disconnected and mothballed due to domestic
s/carpet/hardwood/g.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
  2008-02-10  7:42                 ` Linus Torvalds
@ 2008-02-11 16:47                   ` Lee Schermerhorn
  -1 siblings, 0 replies; 90+ messages in thread
From: Lee Schermerhorn @ 2008-02-11 16:47 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linus Torvalds, Greg KH, linux-kernel, Andrew Morton,
	Christoph Lameter, Paul Jackson, David Rientjes, Mel Gorman,
	linux-mm, Eric Whitney

On Sat, 2008-02-09 at 23:42 -0800, Linus Torvalds wrote:
> 
> On Sat, 9 Feb 2008, Greg KH wrote:
> > 
> > Once the patch goes into Linus's tree, feel free to send it to the
> > stable@kernel.org address so that we can include it in the 2.6.24.x
> > tree.
> 
> I've been ignoring the patches because they say "PATCH 2.6.24-mm1", and so 
> I simply don't know whether it's supposed to go into *my* kernel or just 
> -mm.
> 
> There's also been several versions and discussions, so I'd really like to 
> have somebody send me a final patch with all the acks etc.. One that is 
> clearly for me, not for -mm.
> 

Kosaki-san:  You've tested V3 on '.24.  Do you want to repost the patch
refreshed against .24, adding your "Tested-by:"  [and "Signed-off-by:",
as the folding of the contextualization into mpol_check_policy() is
based on your code--apologies for not adding it myself]?  I'm tied up
with something else for most of this week and won't get to it until
Friday, earliest.

Regards,
Lee

P.S., As Andrew pointed out, I forgot to run checkpatch and the patch
does include a violation thereof.


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
@ 2008-02-11 16:47                   ` Lee Schermerhorn
  0 siblings, 0 replies; 90+ messages in thread
From: Lee Schermerhorn @ 2008-02-11 16:47 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linus Torvalds, Greg KH, linux-kernel, Andrew Morton,
	Christoph Lameter, Paul Jackson, David Rientjes, Mel Gorman,
	linux-mm, Eric Whitney

On Sat, 2008-02-09 at 23:42 -0800, Linus Torvalds wrote:
> 
> On Sat, 9 Feb 2008, Greg KH wrote:
> > 
> > Once the patch goes into Linus's tree, feel free to send it to the
> > stable@kernel.org address so that we can include it in the 2.6.24.x
> > tree.
> 
> I've been ignoring the patches because they say "PATCH 2.6.24-mm1", and so 
> I simply don't know whether it's supposed to go into *my* kernel or just 
> -mm.
> 
> There's also been several versions and discussions, so I'd really like to 
> have somebody send me a final patch with all the acks etc.. One that is 
> clearly for me, not for -mm.
> 

Kosaki-san:  You've tested V3 on '.24.  Do you want to repost the patch
refreshed against .24, adding your "Tested-by:"  [and "Signed-off-by:",
as the folding of the contextualization into mpol_check_policy() is
based on your code--apologies for not adding it myself]?  I'm tied up
with something else for most of this week and won't get to it until
Friday, earliest.

Regards,
Lee

P.S., As Andrew pointed out, I forgot to run checkpatch and the patch
does include a violation thereof.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
  2008-02-11 16:47                   ` Lee Schermerhorn
  (?)
@ 2008-02-12  0:43                   ` KOSAKI Motohiro
  2008-02-12  1:00                     ` David Rientjes
  -1 siblings, 1 reply; 90+ messages in thread
From: KOSAKI Motohiro @ 2008-02-12  0:43 UTC (permalink / raw)
  To: Lee Schermerhorn, Andrew Morton; +Cc: kosaki.motohiro, linux-mm

Hi Andrew Lee-san

# remove almost CC'd

> Kosaki-san:  You've tested V3 on '.24.  Do you want to repost the patch
> refreshed against .24, adding your "Tested-by:"  [and "Signed-off-by:",
> as the folding of the contextualization into mpol_check_policy() is
> based on your code--apologies for not adding it myself]?  I'm tied up
> with something else for most of this week and won't get to it until
> Friday, earliest.

OK.
I append my Tested-by.(but not Singed-off-by because my work is very little).

and, I attached .24 adjusted patch.
my change is only line number change and remove extra space.


> P.S., As Andrew pointed out, I forgot to run checkpatch and the patch
> does include a violation thereof.

fixed.
the violation is only 1 extra white space :)


-------------------------------------------------------------------
Was "Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't
works on memoryless node."

[Aside:  I noticed there were two slightly different distributions for
this topic.  I've unified the distribution lists w/o dropping anyone, I
think.  Apologies if you'd rather have been dropped...]

Here's V3 of the patch, accomodating Kosaki Motohiro's suggestion for
folding contextualize_policy() into mpol_check_policy() [because my
"was_empty" argument "was ugly" ;-)].  It does seem to clean up the
code.

I'm still deferring David Rientjes' suggestion to fold
mpol_check_policy() into mpol_new().  We need to sort out whether
mempolicies specified for tmpfs and hugetlbfs mounts always need the
same "contextualization" as user/application installed policies.  I
don't want to hold up this bug fix for that discussion.  This is
something Paul J will need to address with his cpuset/mempolicy rework,
so we can sort it out in that context.

Again, tested with "numactl --interleave=all" and memtoy on ia64 using
mem= command line argument to simulate memoryless node.


Lee

============================
[PATCH] 2.6.24-mm1 - mempolicy:  silently restrict nodemask to allowed nodes

V2 -> V3:
+ As suggested by Kosaki Motohito, fold the "contextualization"
  of policy nodemask into mpol_check_policy().  Looks a little
  cleaner. 

V1 -> V2:
+ Communicate whether or not incoming node mask was empty to
  mpol_check_policy() for better error checking.
+ As suggested by David Rientjes, remove the now unused
   cpuset_nodes_subset_current_mems_allowed() from cpuset.h

Kosaki Motohito noted that "numactl --interleave=all ..." failed in the
presence of memoryless nodes.  This patch attempts to fix that problem.

Some background:  

numactl --interleave=all calls set_mempolicy(2) with a fully
populated [out to MAXNUMNODES] nodemask.  set_mempolicy()
[in do_set_mempolicy()] calls contextualize_policy() which
requires that the nodemask be a subset of the current task's
mems_allowed; else EINVAL will be returned.  A task's
mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]--
i.e., nodes with memory.  So, a fully populated nodemask will
be declared invalid if it includes memoryless nodes.

  NOTE:  the same thing will occur when running in a cpuset
         with restricted mem_allowed--for the same reason:
         node mask contains dis-allowed nodes.

mbind(2), on the other hand, just masks off any nodes in the 
nodemask that are not included in the caller's mems_allowed.

In each case [mbind() and set_mempolicy()], mpol_check_policy()
will complain [again, resulting in EINVAL] if the nodemask contains 
any memoryless nodes.  This is somewhat redundant as mpol_new() 
will remove memoryless nodes for interleave policy, as will 
bind_zonelist()--called by mpol_new() for BIND policy.

Proposed fix:

1) modify contextualize_policy logic to:
   a) remember whether the incoming node mask is empty.
   b) if not, restrict the nodemask to allowed nodes, as is
      currently done in-line for mbind().  This guarantees
      that the resulting mask includes only nodes with memory.

      NOTE:  this is a [benign, IMO] change in behavior for
             set_mempolicy().  Dis-allowed nodes will be
             silently ignored, rather than returning an error.

   c) fold this code into mpol_check_policy(), replace 2 calls to
      contextualize_policy() to call mpol_check_policy() directly
      and remove contextualize_policy().

2) In existing mpol_check_policy() logic, after "contextualization":
   a) MPOL_DEFAULT:  require that in coming mask "was_empty"
   b) MPOL_{BIND|INTERLEAVE}:  require that contextualized nodemask
      contains at least one node.
   c) add a case for MPOL_PREFERRED:  if in coming was not empty
      and resulting mask IS empty, user specified invalid nodes.
      Return EINVAL.
   c) remove the now redundant check for memoryless nodes

3) remove the now redundant masking of policy nodes for interleave
   policy from mpol_new().

4) Now that mpol_check_policy() contextualizes the nodemask, remove
   the in-line nodes_and() from sys_mbind().  I believe that this
   restores mbind() to the behavior before the memoryless-nodes
   patch series.  E.g., we'll no longer treat an invalid nodemask
   with MPOL_PREFERRED as local allocation.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Tested-by:      KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

 include/linux/cpuset.h |    3 --
 mm/mempolicy.c         |   61 ++++++++++++++++++++++++++++---------------------
 2 files changed, 36 insertions(+), 28 deletions(-)

Index: b/mm/mempolicy.c
===================================================================
--- a/mm/mempolicy.c	2008-02-10 14:27:58.000000000 +0900
+++ b/mm/mempolicy.c	2008-02-12 09:37:35.000000000 +0900
@@ -116,22 +116,51 @@ static void mpol_rebind_policy(struct me
 /* Do sanity checking on a policy */
 static int mpol_check_policy(int mode, nodemask_t *nodes)
 {
-	int empty = nodes_empty(*nodes);
+	int was_empty, is_empty;
+
+	if (!nodes)
+		return 0;
+
+	/*
+	 * "Contextualize" the in-coming nodemast for cpusets:
+	 * Remember whether in-coming nodemask was empty,  If not,
+	 * restrict the nodes to the allowed nodes in the cpuset.
+	 * This is guaranteed to be a subset of nodes with memory.
+	 */
+	cpuset_update_task_memory_state();
+	is_empty = was_empty = nodes_empty(*nodes);
+	if (!was_empty) {
+		nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
+		is_empty = nodes_empty(*nodes);	/* after "contextualization" */
+	}
 
 	switch (mode) {
 	case MPOL_DEFAULT:
-		if (!empty)
+		/*
+		 * require caller to specify an empty nodemask
+		 * before "contextualization"
+		 */
+		if (!was_empty)
 			return -EINVAL;
 		break;
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
-		/* Preferred will only use the first bit, but allow
-		   more for now. */
-		if (empty)
+		/*
+		 * require at least 1 valid node after "contextualization"
+		 */
+		if (is_empty)
+			return -EINVAL;
+		break;
+	case MPOL_PREFERRED:
+		/*
+		 * Did caller specify invalid nodes?
+		 * Don't silently accept this as "local allocation".
+		 */
+		if (!was_empty && is_empty)
 			return -EINVAL;
 		break;
 	}
- 	return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
+	return 0;
 }
 
 /* Generate a custom zonelist for the BIND policy. */
@@ -188,8 +217,6 @@ static struct mempolicy *mpol_new(int mo
 	switch (mode) {
 	case MPOL_INTERLEAVE:
 		policy->v.nodes = *nodes;
-		nodes_and(policy->v.nodes, policy->v.nodes,
-					node_states[N_HIGH_MEMORY]);
 		if (nodes_weight(policy->v.nodes) == 0) {
 			kmem_cache_free(policy_cache, policy);
 			return ERR_PTR(-EINVAL);
@@ -421,18 +448,6 @@ static int mbind_range(struct vm_area_st
 	return err;
 }
 
-static int contextualize_policy(int mode, nodemask_t *nodes)
-{
-	if (!nodes)
-		return 0;
-
-	cpuset_update_task_memory_state();
-	if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
-		return -EINVAL;
-	return mpol_check_policy(mode, nodes);
-}
-
-
 /*
  * Update task->flags PF_MEMPOLICY bit: set iff non-default
  * mempolicy.  Allows more rapid checking of this (combined perhaps
@@ -468,7 +483,7 @@ static long do_set_mempolicy(int mode, n
 {
 	struct mempolicy *new;
 
-	if (contextualize_policy(mode, nodes))
+	if (mpol_check_policy(mode, nodes))
 		return -EINVAL;
 	new = mpol_new(mode, nodes);
 	if (IS_ERR(new))
@@ -915,10 +930,6 @@ asmlinkage long sys_mbind(unsigned long 
 	err = get_nodes(&nodes, nmask, maxnode);
 	if (err)
 		return err;
-#ifdef CONFIG_CPUSETS
-	/* Restrict the nodes to the allowed nodes in the cpuset */
-	nodes_and(nodes, nodes, current->mems_allowed);
-#endif
 	return do_mbind(start, len, mode, &nodes, flags);
 }
 
Index: b/include/linux/cpuset.h
===================================================================
--- a/include/linux/cpuset.h	2008-02-10 14:27:58.000000000 +0900
+++ b/include/linux/cpuset.h	2008-02-10 14:33:40.000000000 +0900
@@ -26,8 +26,6 @@ extern nodemask_t cpuset_mems_allowed(st
 #define cpuset_current_mems_allowed (current->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
 void cpuset_update_task_memory_state(void);
-#define cpuset_nodes_subset_current_mems_allowed(nodes) \
-		nodes_subset((nodes), current->mems_allowed)
 int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
 
 extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
@@ -101,7 +99,6 @@ static inline nodemask_t cpuset_mems_all
 #define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
 static inline void cpuset_init_current_mems_allowed(void) {}
 static inline void cpuset_update_task_memory_state(void) {}
-#define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
 
 static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
 {




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
  2008-02-12  0:43                   ` KOSAKI Motohiro
@ 2008-02-12  1:00                     ` David Rientjes
  2008-02-12  1:56                       ` KOSAKI Motohiro
  2008-02-12 15:08                       ` Lee Schermerhorn
  0 siblings, 2 replies; 90+ messages in thread
From: David Rientjes @ 2008-02-12  1:00 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Lee Schermerhorn, Andrew Morton, linux-mm

On Tue, 12 Feb 2008, KOSAKI Motohiro wrote:

> Hi Andrew Lee-san
> 
> # remove almost CC'd
> 

Please don't remove cc's that were included on the original posting if 
you're passing the patch along.

> OK.
> I append my Tested-by.(but not Singed-off-by because my work is very little).
> 
> and, I attached .24 adjusted patch.
> my change is only line number change and remove extra space.
> 

Andrew may clarify this, but I believe you need to include a sign-off line 
even if you alter just that one whitespace.

 [ I edited that whitespace in my own copy of this patch when I applied it 
   to my tree because git complained about it (and my patchset removes the 
   same line with the whitespace removed). ]

> -------------------------------------------------------------------
> Was "Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't
> works on memoryless node."
> 
> [Aside:  I noticed there were two slightly different distributions for
> this topic.  I've unified the distribution lists w/o dropping anyone, I
> think.  Apologies if you'd rather have been dropped...]
> 
> Here's V3 of the patch, accomodating Kosaki Motohiro's suggestion for
> folding contextualize_policy() into mpol_check_policy() [because my
> "was_empty" argument "was ugly" ;-)].  It does seem to clean up the
> code.
> 
> I'm still deferring David Rientjes' suggestion to fold
> mpol_check_policy() into mpol_new().  We need to sort out whether
> mempolicies specified for tmpfs and hugetlbfs mounts always need the
> same "contextualization" as user/application installed policies.  I
> don't want to hold up this bug fix for that discussion.  This is
> something Paul J will need to address with his cpuset/mempolicy rework,
> so we can sort it out in that context.
> 

I took care of this in my patchset from this morning, so I think we can 
drop this disclaimer now.

> 2) In existing mpol_check_policy() logic, after "contextualization":
>    a) MPOL_DEFAULT:  require that in coming mask "was_empty"

While my patchset effectively obsoletes this patch (but is nonetheless 
based on top of it), I don't understand why you require that MPOL_DEFAULT 
nodemasks are empty.

mpol_new() will not dynamically allocate a new mempolicy in that case 
anyway since it is the system default so the only reason why 
set_mempolicy(MPOL_DEFAULT, numa_no_nodes, ...) won't work is because of 
this addition to mpol_check_policy().

In other words, what is the influence to dismiss a MPOL_DEFAULT mempolicy 
request from a user as invalid simply because it includes set nodes in the 
nodemask?

The warning in the man page that nodemask should be NULL is irrelevant 
here because the user did pass a nodemask, it just happened not to be 
empty.  There seems to be no compelling reason to consider this as invalid 
since MPOL_DEFAULT explicitly does not require a nodemask.

		David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
  2008-02-12  1:00                     ` David Rientjes
@ 2008-02-12  1:56                       ` KOSAKI Motohiro
  2008-02-12  2:05                         ` David Rientjes
  2008-02-12 15:08                       ` Lee Schermerhorn
  1 sibling, 1 reply; 90+ messages in thread
From: KOSAKI Motohiro @ 2008-02-12  1:56 UTC (permalink / raw)
  To: David Rientjes; +Cc: kosaki.motohiro, Lee Schermerhorn, Andrew Morton, linux-mm

Hi

> > # remove almost CC'd
> 
> Please don't remove cc's that were included on the original posting if 
> you're passing the patch along.

Oops, sorry.
I was not worth of solicitude. 


> > OK.
> > I append my Tested-by.(but not Singed-off-by because my work is very little).
> > 
> > and, I attached .24 adjusted patch.
> > my change is only line number change and remove extra space.
> 
> Andrew may clarify this, but I believe you need to include a sign-off line 
> even if you alter just that one whitespace.
> 
>  [ I edited that whitespace in my own copy of this patch when I applied it 
>    to my tree because git complained about it (and my patchset removes the 
>    same line with the whitespace removed). ]

Hmm..
OK


Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>


> > I'm still deferring David Rientjes' suggestion to fold
> > mpol_check_policy() into mpol_new().  We need to sort out whether
> > mempolicies specified for tmpfs and hugetlbfs mounts always need the
> > same "contextualization" as user/application installed policies.  I
> > don't want to hold up this bug fix for that discussion.  This is
> > something Paul J will need to address with his cpuset/mempolicy rework,
> > so we can sort it out in that context.
> > 
> 
> I took care of this in my patchset from this morning, so I think we can 
> drop this disclaimer now.

Disagreed.

this patch is regression fixed patch.
regression should fixed ASAP.

your patch is very nice patch.
but it is feature enhancement.
the feature enhancement should tested by many people in -mm tree for a while.

end up, timing of mainline merge is large different.


> > 2) In existing mpol_check_policy() logic, after "contextualization":
> >    a) MPOL_DEFAULT:  require that in coming mask "was_empty"
> 
> While my patchset effectively obsoletes this patch (but is nonetheless 
> based on top of it), I don't understand why you require that MPOL_DEFAULT 
> nodemasks are empty.
> 
> mpol_new() will not dynamically allocate a new mempolicy in that case 
> anyway since it is the system default so the only reason why 
> set_mempolicy(MPOL_DEFAULT, numa_no_nodes, ...) won't work is because of 
> this addition to mpol_check_policy().
> 
> In other words, what is the influence to dismiss a MPOL_DEFAULT mempolicy 
> request from a user as invalid simply because it includes set nodes in the 
> nodemask?

Hmm..
By which version are you testing?


my testing result was

                       set_mempolicy(MPOL_DEFAULT,NULL)  set_mempolicy(MPOL_DEFAULT,numa_no_nodes)
                      ------------------------------------------------------------------------------
2.6.24                      success                                   success
2.6.24 + lee-patch          success                                   success



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
  2008-02-12  1:56                       ` KOSAKI Motohiro
@ 2008-02-12  2:05                         ` David Rientjes
  2008-02-12  3:05                           ` KOSAKI Motohiro
  0 siblings, 1 reply; 90+ messages in thread
From: David Rientjes @ 2008-02-12  2:05 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Lee Schermerhorn, Andrew Morton, linux-mm

On Tue, 12 Feb 2008, KOSAKI Motohiro wrote:

> > > I'm still deferring David Rientjes' suggestion to fold
> > > mpol_check_policy() into mpol_new().  We need to sort out whether
> > > mempolicies specified for tmpfs and hugetlbfs mounts always need the
> > > same "contextualization" as user/application installed policies.  I
> > > don't want to hold up this bug fix for that discussion.  This is
> > > something Paul J will need to address with his cpuset/mempolicy rework,
> > > so we can sort it out in that context.
> > > 
> > 
> > I took care of this in my patchset from this morning, so I think we can 
> > drop this disclaimer now.
> 
> Disagreed.
> 
> this patch is regression fixed patch.
> regression should fixed ASAP.
> 
> your patch is very nice patch.
> but it is feature enhancement.
> the feature enhancement should tested by many people in -mm tree for a while.
> 
> end up, timing of mainline merge is large different.
> 

I'm talking about the disclaimer that I quoted above in the changelog of 
this patch.  Lee was stating that he deferred my suggestion to move the 
logic into mpol_new(), which I did in my patchset, but I don't think that 
needs to be included in this patch's changelog.

I'm all for the merging of this patch (once my concern below is addressed) 
but the section of the changelog that is quoted above is unnecessary.

> > mpol_new() will not dynamically allocate a new mempolicy in that case 
> > anyway since it is the system default so the only reason why 
> > set_mempolicy(MPOL_DEFAULT, numa_no_nodes, ...) won't work is because of 
> > this addition to mpol_check_policy().
> > 
> > In other words, what is the influence to dismiss a MPOL_DEFAULT mempolicy 
> > request from a user as invalid simply because it includes set nodes in the 
> > nodemask?
> 
> Hmm..
> By which version are you testing?
> 

I'm talking about this section of the patch:

> @@ -116,22 +116,51 @@ static void mpol_rebind_policy(struct me
>  /* Do sanity checking on a policy */
>  static int mpol_check_policy(int mode, nodemask_t *nodes)
>  {
> -	int empty = nodes_empty(*nodes);
> +	int was_empty, is_empty;
> +
> +	if (!nodes)
> +		return 0;
> +
> +	/*
> +	 * "Contextualize" the in-coming nodemast for cpusets:
> +	 * Remember whether in-coming nodemask was empty,  If not,
> +	 * restrict the nodes to the allowed nodes in the cpuset.
> +	 * This is guaranteed to be a subset of nodes with memory.
> +	 */
> +	cpuset_update_task_memory_state();
> +	is_empty = was_empty = nodes_empty(*nodes);
> +	if (!was_empty) {
> +		nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
> +		is_empty = nodes_empty(*nodes);	/* after "contextualization" */
> +	}
>  
>  	switch (mode) {
>  	case MPOL_DEFAULT:
> -		if (!empty)
> +		/*
> +		 * require caller to specify an empty nodemask
> +		 * before "contextualization"
> +		 */
> +		if (!was_empty)
>  			return -EINVAL;

Even though it is obviously the old behavior as well, I want to know why 
we are rejecting MPOL_DEFAULT policies that are passed to either 
set_mempolicy() or mbind() with nodemasks that aren't empty.  MPOL_DEFAULT 
is a mempolicy in itself; it does not act on any passed nodemask.

So my question is why we consider this invalid:

	nodemask_t nodes;

	nodes_clear(&nodes);
	node_set(1, &nodes);
	set_mempolicy(MPOL_DEFAULT, nodes, 1 << CONFIG_NODES_SHIFT);

The nodemask doesn't matter at all with a MPOL_DEFAULT policy.

		David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
  2008-02-12  2:05                         ` David Rientjes
@ 2008-02-12  3:05                           ` KOSAKI Motohiro
  2008-02-12  3:17                             ` David Rientjes
  0 siblings, 1 reply; 90+ messages in thread
From: KOSAKI Motohiro @ 2008-02-12  3:05 UTC (permalink / raw)
  To: David Rientjes; +Cc: kosaki.motohiro, Lee Schermerhorn, Andrew Morton, linux-mm

Hi

> I'm talking about the disclaimer that I quoted above in the changelog of 
> this patch.  Lee was stating that he deferred my suggestion to move the 
> logic into mpol_new(), which I did in my patchset, but I don't think that 
> needs to be included in this patch's changelog.
> 
> I'm all for the merging of this patch (once my concern below is addressed) 
> but the section of the changelog that is quoted above is unnecessary.

OK. I obey you.

> So my question is why we consider this invalid:
> 
> 	nodemask_t nodes;
> 
> 	nodes_clear(&nodes);
> 	node_set(1, &nodes);
> 	set_mempolicy(MPOL_DEFAULT, nodes, 1 << CONFIG_NODES_SHIFT);
> 
> The nodemask doesn't matter at all with a MPOL_DEFAULT policy.

Hmmmmmm
sorry, I don't understand yet.

My test result was

RHEL5(initrd-2.6.18 + rhel patch)	EINVAL
2.6.24					EINVAL
2.6.24 + lee-patch			EINVAL


I don't know current behavior good or wrong.
but I think it is not regression.


-------------------------------------------------------
#include <numaif.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <numa.h>

main(){
        int err;
        nodemask_t nodes;

        nodemask_zero(&nodes);
        nodemask_set(&nodes, 1);
        err = set_mempolicy(MPOL_DEFAULT, &nodes.n[0], 2048);
        if (err < 0) {
                perror("set_mempolicy");
        }

        return 0;
}


- kosaki


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
  2008-02-12  3:05                           ` KOSAKI Motohiro
@ 2008-02-12  3:17                             ` David Rientjes
  0 siblings, 0 replies; 90+ messages in thread
From: David Rientjes @ 2008-02-12  3:17 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Lee Schermerhorn, Andrew Morton, linux-mm

On Tue, 12 Feb 2008, KOSAKI Motohiro wrote:

> Hmmmmmm
> sorry, I don't understand yet.
> 
> My test result was
> 
> RHEL5(initrd-2.6.18 + rhel patch)	EINVAL
> 2.6.24					EINVAL
> 2.6.24 + lee-patch			EINVAL
> 
> 
> I don't know current behavior good or wrong.
> but I think it is not regression.
> 

Yes, it's not a regression, but I'm asking why we can't allow this:

	nodemask_t nodes = NODE_MASK_NONE;

	node_set(1, &nodes);
	set_mempolicy(MPOL_DEFAULT, &nodes, MAX_NUMNODES);

It seems like that should not return -EINVAL and that it should just 
effect the system default of a MPOL_DEFAULT policy.

It's not a problem that I'm complaining about specifically in this patch, 
I'm just raising the concern that returning -EINVAL here is really 
unnecessary since mpol_new() will readily accept it.

So you can add my

	Acked-by: David Rientjes <rientjes@google.com>

to this patch, but I would like some counter-arguments presented that show 
why we shouldn't allow the above code to work later on.

		David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH for 2.6.24][regression fix] Mempolicy: silently restrict nodemask to allowed nodes V3
  2008-02-11 16:47                   ` Lee Schermerhorn
@ 2008-02-12  4:30                     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 90+ messages in thread
From: KOSAKI Motohiro @ 2008-02-12  4:30 UTC (permalink / raw)
  To: Lee Schermerhorn, Andrew Morton
  Cc: kosaki.motohiro, Linus Torvalds, Greg KH, linux-kernel,
	Christoph Lameter, Paul Jackson, David Rientjes, Mel Gorman,
	linux-mm, Eric Whitney

Hi Andrew

# this is second post of the same patch.

this is backport from -mm to mainline.
original patch is http://marc.info/?l=linux-kernel&m=120250000001182&w=2

my change is only line number change and remove extra space.
please ack.


============================
[PATCH] 2.6.24 - mempolicy:  silently restrict nodemask to allowed nodes

V2 -> V3:
+ As suggested by Kosaki Motohito, fold the "contextualization"
  of policy nodemask into mpol_check_policy().  Looks a little
  cleaner. 

V1 -> V2:
+ Communicate whether or not incoming node mask was empty to
  mpol_check_policy() for better error checking.
+ As suggested by David Rientjes, remove the now unused
   cpuset_nodes_subset_current_mems_allowed() from cpuset.h

Kosaki Motohito noted that "numactl --interleave=all ..." failed in the
presence of memoryless nodes.  This patch attempts to fix that problem.

Some background:  

numactl --interleave=all calls set_mempolicy(2) with a fully
populated [out to MAXNUMNODES] nodemask.  set_mempolicy()
[in do_set_mempolicy()] calls contextualize_policy() which
requires that the nodemask be a subset of the current task's
mems_allowed; else EINVAL will be returned.  A task's
mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]--
i.e., nodes with memory.  So, a fully populated nodemask will
be declared invalid if it includes memoryless nodes.

  NOTE:  the same thing will occur when running in a cpuset
         with restricted mem_allowed--for the same reason:
         node mask contains dis-allowed nodes.

mbind(2), on the other hand, just masks off any nodes in the 
nodemask that are not included in the caller's mems_allowed.

In each case [mbind() and set_mempolicy()], mpol_check_policy()
will complain [again, resulting in EINVAL] if the nodemask contains 
any memoryless nodes.  This is somewhat redundant as mpol_new() 
will remove memoryless nodes for interleave policy, as will 
bind_zonelist()--called by mpol_new() for BIND policy.

Proposed fix:

1) modify contextualize_policy logic to:
   a) remember whether the incoming node mask is empty.
   b) if not, restrict the nodemask to allowed nodes, as is
      currently done in-line for mbind().  This guarantees
      that the resulting mask includes only nodes with memory.

      NOTE:  this is a [benign, IMO] change in behavior for
             set_mempolicy().  Dis-allowed nodes will be
             silently ignored, rather than returning an error.

   c) fold this code into mpol_check_policy(), replace 2 calls to
      contextualize_policy() to call mpol_check_policy() directly
      and remove contextualize_policy().

2) In existing mpol_check_policy() logic, after "contextualization":
   a) MPOL_DEFAULT:  require that in coming mask "was_empty"
   b) MPOL_{BIND|INTERLEAVE}:  require that contextualized nodemask
      contains at least one node.
   c) add a case for MPOL_PREFERRED:  if in coming was not empty
      and resulting mask IS empty, user specified invalid nodes.
      Return EINVAL.
   c) remove the now redundant check for memoryless nodes

3) remove the now redundant masking of policy nodes for interleave
   policy from mpol_new().

4) Now that mpol_check_policy() contextualizes the nodemask, remove
   the in-line nodes_and() from sys_mbind().  I believe that this
   restores mbind() to the behavior before the memoryless-nodes
   patch series.  E.g., we'll no longer treat an invalid nodemask
   with MPOL_PREFERRED as local allocation.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Tested-by:      KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by:       David Rientjes <rientjes@google.com>

 include/linux/cpuset.h |    3 --
 mm/mempolicy.c         |   61 ++++++++++++++++++++++++++++---------------------
 2 files changed, 36 insertions(+), 28 deletions(-)

Index: b/mm/mempolicy.c
===================================================================
--- a/mm/mempolicy.c	2008-02-10 14:27:58.000000000 +0900
+++ b/mm/mempolicy.c	2008-02-12 09:37:35.000000000 +0900
@@ -116,22 +116,51 @@ static void mpol_rebind_policy(struct me
 /* Do sanity checking on a policy */
 static int mpol_check_policy(int mode, nodemask_t *nodes)
 {
-	int empty = nodes_empty(*nodes);
+	int was_empty, is_empty;
+
+	if (!nodes)
+		return 0;
+
+	/*
+	 * "Contextualize" the in-coming nodemast for cpusets:
+	 * Remember whether in-coming nodemask was empty,  If not,
+	 * restrict the nodes to the allowed nodes in the cpuset.
+	 * This is guaranteed to be a subset of nodes with memory.
+	 */
+	cpuset_update_task_memory_state();
+	is_empty = was_empty = nodes_empty(*nodes);
+	if (!was_empty) {
+		nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
+		is_empty = nodes_empty(*nodes);	/* after "contextualization" */
+	}
 
 	switch (mode) {
 	case MPOL_DEFAULT:
-		if (!empty)
+		/*
+		 * require caller to specify an empty nodemask
+		 * before "contextualization"
+		 */
+		if (!was_empty)
 			return -EINVAL;
 		break;
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
-		/* Preferred will only use the first bit, but allow
-		   more for now. */
-		if (empty)
+		/*
+		 * require at least 1 valid node after "contextualization"
+		 */
+		if (is_empty)
+			return -EINVAL;
+		break;
+	case MPOL_PREFERRED:
+		/*
+		 * Did caller specify invalid nodes?
+		 * Don't silently accept this as "local allocation".
+		 */
+		if (!was_empty && is_empty)
 			return -EINVAL;
 		break;
 	}
- 	return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
+	return 0;
 }
 
 /* Generate a custom zonelist for the BIND policy. */
@@ -188,8 +217,6 @@ static struct mempolicy *mpol_new(int mo
 	switch (mode) {
 	case MPOL_INTERLEAVE:
 		policy->v.nodes = *nodes;
-		nodes_and(policy->v.nodes, policy->v.nodes,
-					node_states[N_HIGH_MEMORY]);
 		if (nodes_weight(policy->v.nodes) == 0) {
 			kmem_cache_free(policy_cache, policy);
 			return ERR_PTR(-EINVAL);
@@ -421,18 +448,6 @@ static int mbind_range(struct vm_area_st
 	return err;
 }
 
-static int contextualize_policy(int mode, nodemask_t *nodes)
-{
-	if (!nodes)
-		return 0;
-
-	cpuset_update_task_memory_state();
-	if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
-		return -EINVAL;
-	return mpol_check_policy(mode, nodes);
-}
-
-
 /*
  * Update task->flags PF_MEMPOLICY bit: set iff non-default
  * mempolicy.  Allows more rapid checking of this (combined perhaps
@@ -468,7 +483,7 @@ static long do_set_mempolicy(int mode, n
 {
 	struct mempolicy *new;
 
-	if (contextualize_policy(mode, nodes))
+	if (mpol_check_policy(mode, nodes))
 		return -EINVAL;
 	new = mpol_new(mode, nodes);
 	if (IS_ERR(new))
@@ -915,10 +930,6 @@ asmlinkage long sys_mbind(unsigned long 
 	err = get_nodes(&nodes, nmask, maxnode);
 	if (err)
 		return err;
-#ifdef CONFIG_CPUSETS
-	/* Restrict the nodes to the allowed nodes in the cpuset */
-	nodes_and(nodes, nodes, current->mems_allowed);
-#endif
 	return do_mbind(start, len, mode, &nodes, flags);
 }
 
Index: b/include/linux/cpuset.h
===================================================================
--- a/include/linux/cpuset.h	2008-02-10 14:27:58.000000000 +0900
+++ b/include/linux/cpuset.h	2008-02-10 14:33:40.000000000 +0900
@@ -26,8 +26,6 @@ extern nodemask_t cpuset_mems_allowed(st
 #define cpuset_current_mems_allowed (current->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
 void cpuset_update_task_memory_state(void);
-#define cpuset_nodes_subset_current_mems_allowed(nodes) \
-		nodes_subset((nodes), current->mems_allowed)
 int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
 
 extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
@@ -101,7 +99,6 @@ static inline nodemask_t cpuset_mems_all
 #define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
 static inline void cpuset_init_current_mems_allowed(void) {}
 static inline void cpuset_update_task_memory_state(void) {}
-#define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
 
 static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
 {







^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH for 2.6.24][regression fix] Mempolicy: silently restrict nodemask to allowed nodes V3
@ 2008-02-12  4:30                     ` KOSAKI Motohiro
  0 siblings, 0 replies; 90+ messages in thread
From: KOSAKI Motohiro @ 2008-02-12  4:30 UTC (permalink / raw)
  To: Lee Schermerhorn, Andrew Morton
  Cc: kosaki.motohiro, Linus Torvalds, Greg KH, linux-kernel,
	Christoph Lameter, Paul Jackson, David Rientjes, Mel Gorman,
	linux-mm, Eric Whitney

Hi Andrew

# this is second post of the same patch.

this is backport from -mm to mainline.
original patch is http://marc.info/?l=linux-kernel&m=120250000001182&w=2

my change is only line number change and remove extra space.
please ack.


============================
[PATCH] 2.6.24 - mempolicy:  silently restrict nodemask to allowed nodes

V2 -> V3:
+ As suggested by Kosaki Motohito, fold the "contextualization"
  of policy nodemask into mpol_check_policy().  Looks a little
  cleaner. 

V1 -> V2:
+ Communicate whether or not incoming node mask was empty to
  mpol_check_policy() for better error checking.
+ As suggested by David Rientjes, remove the now unused
   cpuset_nodes_subset_current_mems_allowed() from cpuset.h

Kosaki Motohito noted that "numactl --interleave=all ..." failed in the
presence of memoryless nodes.  This patch attempts to fix that problem.

Some background:  

numactl --interleave=all calls set_mempolicy(2) with a fully
populated [out to MAXNUMNODES] nodemask.  set_mempolicy()
[in do_set_mempolicy()] calls contextualize_policy() which
requires that the nodemask be a subset of the current task's
mems_allowed; else EINVAL will be returned.  A task's
mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]--
i.e., nodes with memory.  So, a fully populated nodemask will
be declared invalid if it includes memoryless nodes.

  NOTE:  the same thing will occur when running in a cpuset
         with restricted mem_allowed--for the same reason:
         node mask contains dis-allowed nodes.

mbind(2), on the other hand, just masks off any nodes in the 
nodemask that are not included in the caller's mems_allowed.

In each case [mbind() and set_mempolicy()], mpol_check_policy()
will complain [again, resulting in EINVAL] if the nodemask contains 
any memoryless nodes.  This is somewhat redundant as mpol_new() 
will remove memoryless nodes for interleave policy, as will 
bind_zonelist()--called by mpol_new() for BIND policy.

Proposed fix:

1) modify contextualize_policy logic to:
   a) remember whether the incoming node mask is empty.
   b) if not, restrict the nodemask to allowed nodes, as is
      currently done in-line for mbind().  This guarantees
      that the resulting mask includes only nodes with memory.

      NOTE:  this is a [benign, IMO] change in behavior for
             set_mempolicy().  Dis-allowed nodes will be
             silently ignored, rather than returning an error.

   c) fold this code into mpol_check_policy(), replace 2 calls to
      contextualize_policy() to call mpol_check_policy() directly
      and remove contextualize_policy().

2) In existing mpol_check_policy() logic, after "contextualization":
   a) MPOL_DEFAULT:  require that in coming mask "was_empty"
   b) MPOL_{BIND|INTERLEAVE}:  require that contextualized nodemask
      contains at least one node.
   c) add a case for MPOL_PREFERRED:  if in coming was not empty
      and resulting mask IS empty, user specified invalid nodes.
      Return EINVAL.
   c) remove the now redundant check for memoryless nodes

3) remove the now redundant masking of policy nodes for interleave
   policy from mpol_new().

4) Now that mpol_check_policy() contextualizes the nodemask, remove
   the in-line nodes_and() from sys_mbind().  I believe that this
   restores mbind() to the behavior before the memoryless-nodes
   patch series.  E.g., we'll no longer treat an invalid nodemask
   with MPOL_PREFERRED as local allocation.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Tested-by:      KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by:       David Rientjes <rientjes@google.com>

 include/linux/cpuset.h |    3 --
 mm/mempolicy.c         |   61 ++++++++++++++++++++++++++++---------------------
 2 files changed, 36 insertions(+), 28 deletions(-)

Index: b/mm/mempolicy.c
===================================================================
--- a/mm/mempolicy.c	2008-02-10 14:27:58.000000000 +0900
+++ b/mm/mempolicy.c	2008-02-12 09:37:35.000000000 +0900
@@ -116,22 +116,51 @@ static void mpol_rebind_policy(struct me
 /* Do sanity checking on a policy */
 static int mpol_check_policy(int mode, nodemask_t *nodes)
 {
-	int empty = nodes_empty(*nodes);
+	int was_empty, is_empty;
+
+	if (!nodes)
+		return 0;
+
+	/*
+	 * "Contextualize" the in-coming nodemast for cpusets:
+	 * Remember whether in-coming nodemask was empty,  If not,
+	 * restrict the nodes to the allowed nodes in the cpuset.
+	 * This is guaranteed to be a subset of nodes with memory.
+	 */
+	cpuset_update_task_memory_state();
+	is_empty = was_empty = nodes_empty(*nodes);
+	if (!was_empty) {
+		nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
+		is_empty = nodes_empty(*nodes);	/* after "contextualization" */
+	}
 
 	switch (mode) {
 	case MPOL_DEFAULT:
-		if (!empty)
+		/*
+		 * require caller to specify an empty nodemask
+		 * before "contextualization"
+		 */
+		if (!was_empty)
 			return -EINVAL;
 		break;
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
-		/* Preferred will only use the first bit, but allow
-		   more for now. */
-		if (empty)
+		/*
+		 * require at least 1 valid node after "contextualization"
+		 */
+		if (is_empty)
+			return -EINVAL;
+		break;
+	case MPOL_PREFERRED:
+		/*
+		 * Did caller specify invalid nodes?
+		 * Don't silently accept this as "local allocation".
+		 */
+		if (!was_empty && is_empty)
 			return -EINVAL;
 		break;
 	}
- 	return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
+	return 0;
 }
 
 /* Generate a custom zonelist for the BIND policy. */
@@ -188,8 +217,6 @@ static struct mempolicy *mpol_new(int mo
 	switch (mode) {
 	case MPOL_INTERLEAVE:
 		policy->v.nodes = *nodes;
-		nodes_and(policy->v.nodes, policy->v.nodes,
-					node_states[N_HIGH_MEMORY]);
 		if (nodes_weight(policy->v.nodes) == 0) {
 			kmem_cache_free(policy_cache, policy);
 			return ERR_PTR(-EINVAL);
@@ -421,18 +448,6 @@ static int mbind_range(struct vm_area_st
 	return err;
 }
 
-static int contextualize_policy(int mode, nodemask_t *nodes)
-{
-	if (!nodes)
-		return 0;
-
-	cpuset_update_task_memory_state();
-	if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
-		return -EINVAL;
-	return mpol_check_policy(mode, nodes);
-}
-
-
 /*
  * Update task->flags PF_MEMPOLICY bit: set iff non-default
  * mempolicy.  Allows more rapid checking of this (combined perhaps
@@ -468,7 +483,7 @@ static long do_set_mempolicy(int mode, n
 {
 	struct mempolicy *new;
 
-	if (contextualize_policy(mode, nodes))
+	if (mpol_check_policy(mode, nodes))
 		return -EINVAL;
 	new = mpol_new(mode, nodes);
 	if (IS_ERR(new))
@@ -915,10 +930,6 @@ asmlinkage long sys_mbind(unsigned long 
 	err = get_nodes(&nodes, nmask, maxnode);
 	if (err)
 		return err;
-#ifdef CONFIG_CPUSETS
-	/* Restrict the nodes to the allowed nodes in the cpuset */
-	nodes_and(nodes, nodes, current->mems_allowed);
-#endif
 	return do_mbind(start, len, mode, &nodes, flags);
 }
 
Index: b/include/linux/cpuset.h
===================================================================
--- a/include/linux/cpuset.h	2008-02-10 14:27:58.000000000 +0900
+++ b/include/linux/cpuset.h	2008-02-10 14:33:40.000000000 +0900
@@ -26,8 +26,6 @@ extern nodemask_t cpuset_mems_allowed(st
 #define cpuset_current_mems_allowed (current->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
 void cpuset_update_task_memory_state(void);
-#define cpuset_nodes_subset_current_mems_allowed(nodes) \
-		nodes_subset((nodes), current->mems_allowed)
 int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
 
 extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
@@ -101,7 +99,6 @@ static inline nodemask_t cpuset_mems_all
 #define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
 static inline void cpuset_init_current_mems_allowed(void) {}
 static inline void cpuset_update_task_memory_state(void) {}
-#define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
 
 static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
 {






--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH for 2.6.24][regression fix] Mempolicy: silently restrict nodemask to allowed nodes V3
  2008-02-12  4:30                     ` KOSAKI Motohiro
@ 2008-02-12  5:06                       ` David Rientjes
  -1 siblings, 0 replies; 90+ messages in thread
From: David Rientjes @ 2008-02-12  5:06 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Lee Schermerhorn, Andrew Morton, Linus Torvalds, Greg KH,
	linux-kernel, Christoph Lameter, Paul Jackson, Mel Gorman,
	linux-mm, Eric Whitney

On Tue, 12 Feb 2008, KOSAKI Motohiro wrote:

> [PATCH] 2.6.24 - mempolicy:  silently restrict nodemask to allowed nodes
> 

Linus has already merged this patch into his tree, but next time you 
pass along a contribution to a maintainer the first line should read:

From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

so the person who actually wrote the patch is listed as the author in the 
git commit.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH for 2.6.24][regression fix] Mempolicy: silently restrict nodemask to allowed nodes V3
@ 2008-02-12  5:06                       ` David Rientjes
  0 siblings, 0 replies; 90+ messages in thread
From: David Rientjes @ 2008-02-12  5:06 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Lee Schermerhorn, Andrew Morton, Linus Torvalds, Greg KH,
	linux-kernel, Christoph Lameter, Paul Jackson, Mel Gorman,
	linux-mm, Eric Whitney

On Tue, 12 Feb 2008, KOSAKI Motohiro wrote:

> [PATCH] 2.6.24 - mempolicy:  silently restrict nodemask to allowed nodes
> 

Linus has already merged this patch into his tree, but next time you 
pass along a contribution to a maintainer the first line should read:

From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

so the person who actually wrote the patch is listed as the author in the 
git commit.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH for 2.6.24][regression fix] Mempolicy: silently restrict nodemask to allowed nodes V3
  2008-02-12  4:30                     ` KOSAKI Motohiro
@ 2008-02-12  5:07                       ` Andrew Morton
  -1 siblings, 0 replies; 90+ messages in thread
From: Andrew Morton @ 2008-02-12  5:07 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Lee Schermerhorn, Linus Torvalds, Greg KH, linux-kernel,
	Christoph Lameter, Paul Jackson, David Rientjes, Mel Gorman,
	linux-mm, Eric Whitney

On Tue, 12 Feb 2008 13:30:22 +0900 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> Hi Andrew
> 
> # this is second post of the same patch.
> 
> this is backport from -mm to mainline.
> original patch is http://marc.info/?l=linux-kernel&m=120250000001182&w=2
> 
> my change is only line number change and remove extra space.

This is identical to what I have now.

> please ack.

As it's now post -rc1 and not a 100% obvious thing, I tend to hang onto
such patches for a week or so before sending up to Linus

Should this be backported to 2.6.24.x?  If so, the reasons for such a
relatively stern step should be spelled out in the changelog for the
-stable maintiners to evaluate.

Thanks.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH for 2.6.24][regression fix] Mempolicy: silently restrict nodemask to allowed nodes V3
@ 2008-02-12  5:07                       ` Andrew Morton
  0 siblings, 0 replies; 90+ messages in thread
From: Andrew Morton @ 2008-02-12  5:07 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Lee Schermerhorn, Linus Torvalds, Greg KH, linux-kernel,
	Christoph Lameter, Paul Jackson, David Rientjes, Mel Gorman,
	linux-mm, Eric Whitney

On Tue, 12 Feb 2008 13:30:22 +0900 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> Hi Andrew
> 
> # this is second post of the same patch.
> 
> this is backport from -mm to mainline.
> original patch is http://marc.info/?l=linux-kernel&m=120250000001182&w=2
> 
> my change is only line number change and remove extra space.

This is identical to what I have now.

> please ack.

As it's now post -rc1 and not a 100% obvious thing, I tend to hang onto
such patches for a week or so before sending up to Linus

Should this be backported to 2.6.24.x?  If so, the reasons for such a
relatively stern step should be spelled out in the changelog for the
-stable maintiners to evaluate.

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH for 2.6.24][regression fix] Mempolicy: silently restrict nodemask to allowed nodes V3
  2008-02-12  5:07                       ` Andrew Morton
@ 2008-02-12 13:18                         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 90+ messages in thread
From: KOSAKI Motohiro @ 2008-02-12 13:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, Lee Schermerhorn, Linus Torvalds, Greg KH,
	linux-kernel, Christoph Lameter, Paul Jackson, David Rientjes,
	Mel Gorman, linux-mm, Eric Whitney

Hi

> > please ack.
> 
> As it's now post -rc1 and not a 100% obvious thing, I tend to hang onto
> such patches for a week or so before sending up to Linus

Thanks, really thanks.


> Should this be backported to 2.6.24.x?  If so, the reasons for such a
> relatively stern step should be spelled out in the changelog for the
> -stable maintiners to evaluate.

Oh,
you think below reason is not enough, really?

1. it is regression.
2. it is very easy reprodusable on memoryless node machine.


if so, i back down on my backport reclaim.
I don't hope increase your headache ;-)

thanks.

-kosaki



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH for 2.6.24][regression fix] Mempolicy: silently restrict nodemask to allowed nodes V3
@ 2008-02-12 13:18                         ` KOSAKI Motohiro
  0 siblings, 0 replies; 90+ messages in thread
From: KOSAKI Motohiro @ 2008-02-12 13:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, Lee Schermerhorn, Linus Torvalds, Greg KH,
	linux-kernel, Christoph Lameter, Paul Jackson, David Rientjes,
	Mel Gorman, linux-mm, Eric Whitney

Hi

> > please ack.
> 
> As it's now post -rc1 and not a 100% obvious thing, I tend to hang onto
> such patches for a week or so before sending up to Linus

Thanks, really thanks.


> Should this be backported to 2.6.24.x?  If so, the reasons for such a
> relatively stern step should be spelled out in the changelog for the
> -stable maintiners to evaluate.

Oh,
you think below reason is not enough, really?

1. it is regression.
2. it is very easy reprodusable on memoryless node machine.


if so, i back down on my backport reclaim.
I don't hope increase your headache ;-)

thanks.

-kosaki


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
  2008-02-12  1:00                     ` David Rientjes
  2008-02-12  1:56                       ` KOSAKI Motohiro
@ 2008-02-12 15:08                       ` Lee Schermerhorn
  2008-02-12 19:06                         ` David Rientjes
  1 sibling, 1 reply; 90+ messages in thread
From: Lee Schermerhorn @ 2008-02-12 15:08 UTC (permalink / raw)
  To: David Rientjes; +Cc: KOSAKI Motohiro, Andrew Morton, linux-mm

On Mon, 2008-02-11 at 17:00 -0800, David Rientjes wrote:
> On Tue, 12 Feb 2008, KOSAKI Motohiro wrote:
> 
> > Hi Andrew Lee-san
> > 
> > # remove almost CC'd
> > 
> 
> Please don't remove cc's that were included on the original posting if 
> you're passing the patch along.
> 
> > OK.
> > I append my Tested-by.(but not Singed-off-by because my work is very little).
> > 
> > and, I attached .24 adjusted patch.
> > my change is only line number change and remove extra space.
> > 
> 
> Andrew may clarify this, but I believe you need to include a sign-off line 
> even if you alter just that one whitespace.
> 
>  [ I edited that whitespace in my own copy of this patch when I applied it 
>    to my tree because git complained about it (and my patchset removes the 
>    same line with the whitespace removed). ]
> 
> > -------------------------------------------------------------------
> > Was "Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't
> > works on memoryless node."
> > 
> > [Aside:  I noticed there were two slightly different distributions for
> > this topic.  I've unified the distribution lists w/o dropping anyone, I
> > think.  Apologies if you'd rather have been dropped...]
> > 
> > Here's V3 of the patch, accomodating Kosaki Motohiro's suggestion for
> > folding contextualize_policy() into mpol_check_policy() [because my
> > "was_empty" argument "was ugly" ;-)].  It does seem to clean up the
> > code.
> > 
> > I'm still deferring David Rientjes' suggestion to fold
> > mpol_check_policy() into mpol_new().  We need to sort out whether
> > mempolicies specified for tmpfs and hugetlbfs mounts always need the
> > same "contextualization" as user/application installed policies.  I
> > don't want to hold up this bug fix for that discussion.  This is
> > something Paul J will need to address with his cpuset/mempolicy rework,
> > so we can sort it out in that context.
> > 
> 
> I took care of this in my patchset from this morning, so I think we can 
> drop this disclaimer now.

David: 

I'm fine with removing this.  I didn't consider it part of the patch
description anyway.  

> > 2) In existing mpol_check_policy() logic, after "contextualization":
> >    a) MPOL_DEFAULT:  require that in coming mask "was_empty"
> 
> While my patchset effectively obsoletes this patch (but is nonetheless 
> based on top of it), I don't understand why you require that MPOL_DEFAULT 
> nodemasks are empty.

Firstly, because this was the original API. 

Secondly, I consider this key to extensible API design.  Perhaps,
someday, we might want to assign some semantic to certain non-empty
nodemasks to MPOL_DEFAULT.  If we're allowing applications to pass
arbitrary nodemask now, and just ignoring them, that becomes difficult.
Just like dis-allowing unassigned flag values.

> 
> mpol_new() will not dynamically allocate a new mempolicy in that case 
> anyway since it is the system default so the only reason why 
> set_mempolicy(MPOL_DEFAULT, numa_no_nodes, ...) won't work is because of 
> this addition to mpol_check_policy().

??? Isn't numa_no_nodes an empty node mask?  If this worked before the
memoryless nodes patch set went in [I believe it did], it should still
work.

> 
> In other words, what is the influence to dismiss a MPOL_DEFAULT mempolicy 
> request from a user as invalid simply because it includes set nodes in the 
> nodemask?
> 
> The warning in the man page that nodemask should be NULL is irrelevant 
> here because the user did pass a nodemask, it just happened not to be 
> empty.  There seems to be no compelling reason to consider this as invalid 
> since MPOL_DEFAULT explicitly does not require a nodemask.

See above.  If you have some use--i.e., as defined semantic--for a
non-empty node mask, then by all means remove the check.  But, until we
do, best not to let correct applications do this.  That way, they won't
break when/if someone DOES assign meaning to non-empty masks.

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
  2008-02-12 15:08                       ` Lee Schermerhorn
@ 2008-02-12 19:06                         ` David Rientjes
  2008-02-13  0:07                           ` Lee Schermerhorn
  0 siblings, 1 reply; 90+ messages in thread
From: David Rientjes @ 2008-02-12 19:06 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: KOSAKI Motohiro, Andrew Morton, linux-mm

On Tue, 12 Feb 2008, Lee Schermerhorn wrote:

> Firstly, because this was the original API. 
> 
> Secondly, I consider this key to extensible API design.  Perhaps,
> someday, we might want to assign some semantic to certain non-empty
> nodemasks to MPOL_DEFAULT.  If we're allowing applications to pass
> arbitrary nodemask now, and just ignoring them, that becomes difficult.
> Just like dis-allowing unassigned flag values.
> 

I allow it with my patchset because there's simply no reason not to.

MPOL_DEFAULT is the default system-wide policy that does not require a 
nodemask as a parameter.  Both the man page (set_mempolicy(2)) and the 
documentation (Documentation/vm/numa_memory_policy.txt) state that.

It makes no sense in the future to assign a meaning to a nodemask passed 
along with MPOL_DEFAULT.  None at all.  The policy is simply the 
equivalent of default_policy and, as the system default, a nodemask 
parameter to the system default policy is wrong be definition.

So, logically, we can either allow all nodemasks to be passed with a 
MPOL_DEFAULT policy or none at all (it must be NULL).  Empty nodemasks 
don't have any logical relationship with MPOL_DEFAULT.

		David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
  2008-02-12 19:06                         ` David Rientjes
@ 2008-02-13  0:07                           ` Lee Schermerhorn
  2008-02-13  0:42                             ` David Rientjes
  0 siblings, 1 reply; 90+ messages in thread
From: Lee Schermerhorn @ 2008-02-13  0:07 UTC (permalink / raw)
  To: David Rientjes; +Cc: KOSAKI Motohiro, Andrew Morton, linux-mm

On Tue, 2008-02-12 at 11:06 -0800, David Rientjes wrote:
> On Tue, 12 Feb 2008, Lee Schermerhorn wrote:
> 
> > Firstly, because this was the original API. 
> > 
> > Secondly, I consider this key to extensible API design.  Perhaps,
> > someday, we might want to assign some semantic to certain non-empty
> > nodemasks to MPOL_DEFAULT.  If we're allowing applications to pass
> > arbitrary nodemask now, and just ignoring them, that becomes difficult.
> > Just like dis-allowing unassigned flag values.
> > 
> 
> I allow it with my patchset because there's simply no reason not to.

I'm interpreting this as "because I [David] simply see no reason not
to."   

> 
> MPOL_DEFAULT is the default system-wide policy that does not require a 
> nodemask as a parameter.  Both the man page (set_mempolicy(2)) and the 
> documentation (Documentation/vm/numa_memory_policy.txt) state that.
> 
> It makes no sense in the future to assign a meaning to a nodemask passed 
> along with MPOL_DEFAULT.  None at all.  

Again, you're stating an opinion, to which you're entitled, or
expressing a limitation to your clairvoyance, for which I can't fault
you.  Indeed, I tend to agree with you on this particular point--my own
opinion and/or lack of vision.  However, I've been burned in the past by
just this scenario--wanting to assign meaning to something that was
ignored--because it could break existing applications.  So, on general
principle, I like to be fairly strict with argument checking [despite my
natural libertarian tendencies].

> The policy is simply the 
> equivalent of default_policy and, as the system default, a nodemask 
> parameter to the system default policy is wrong be definition.  
> 
> So, logically, we can either allow all nodemasks to be passed with a 
> MPOL_DEFAULT policy or none at all (it must be NULL).  Empty nodemasks 
> don't have any logical relationship with MPOL_DEFAULT.

Ah, maybe this explains our disconnect.  Internally, a NULL nodemask
pointer specified by the application is equivalent to an empty nodemask
is equivalent to maxnode == 0.  See get_nodes().  By the time
mpol_check_policy() or mpol_new() get called, all they have is a pointer
to the cleared nodemask in the stack frame of sys_set_mempolicy() or
sys_mbind().  So, the existing code's error checking doesn't require one
to specify a non-NULL, but empty nodemask.  It just requires that one
does not specify any nodes with MPOL_DEFAULT.  

Does this help clarify things?

Lee

P.S.,

I've had time to review the patches and have comments queued up.  I'll
send along comments shortly [wherein I do mention my preference for the
error checking].


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
  2008-02-13  0:07                           ` Lee Schermerhorn
@ 2008-02-13  0:42                             ` David Rientjes
  2008-02-13 16:32                               ` Lee Schermerhorn
  0 siblings, 1 reply; 90+ messages in thread
From: David Rientjes @ 2008-02-13  0:42 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: KOSAKI Motohiro, Andrew Morton, linux-mm

On Tue, 12 Feb 2008, Lee Schermerhorn wrote:

> > MPOL_DEFAULT is the default system-wide policy that does not require a 
> > nodemask as a parameter.  Both the man page (set_mempolicy(2)) and the 
> > documentation (Documentation/vm/numa_memory_policy.txt) state that.
> > 
> > It makes no sense in the future to assign a meaning to a nodemask passed 
> > along with MPOL_DEFAULT.  None at all.  
> 
> Again, you're stating an opinion, to which you're entitled, or
> expressing a limitation to your clairvoyance, for which I can't fault
> you.  Indeed, I tend to agree with you on this particular point--my own
> opinion and/or lack of vision.  However, I've been burned in the past by
> just this scenario--wanting to assign meaning to something that was
> ignored--because it could break existing applications.  So, on general
> principle, I like to be fairly strict with argument checking [despite my
> natural libertarian tendencies].
> 

It's currently undefined behavior.  Neither the Linux documentation 
(Documentation/vm/numa_memory_policy.txt) nor the man page 
(set_mempolicy(2)) state the meaning of a nodemask passed with 
MPOL_DEFAULT.

The man page simply says the nodemask should be passed as NULL and the 
documentation state that MPOL_DEFAULT "does not use the optional set of 
nodes."

So what we do with that nodemask is an implementation detail that does not 
need to conform to any pre-defined API or even the possibility that one 
day it will become useful.  In the context of the documentation, it is 
logical that any nodemask that is passed with MPOL_DEFAULT is valid since 
it's not used at all.

As you know, mempolicies can already morph into being effected over a 
subset of nodes that was passed with set_mempolicy() or mbind() without 
knowledge to the user.  That requires get_mempolicy() to determine.  
Changing a non-empty nodemask passed with MPOL_DEFAULT to an empty 
nodemask because it has no logical meaning is nothing new.

> > The policy is simply the 
> > equivalent of default_policy and, as the system default, a nodemask 
> > parameter to the system default policy is wrong be definition.  
> > 
> > So, logically, we can either allow all nodemasks to be passed with a 
> > MPOL_DEFAULT policy or none at all (it must be NULL).  Empty nodemasks 
> > don't have any logical relationship with MPOL_DEFAULT.
> 
> Ah, maybe this explains our disconnect.  Internally, a NULL nodemask
> pointer specified by the application is equivalent to an empty nodemask
> is equivalent to maxnode == 0.  See get_nodes().  By the time
> mpol_check_policy() or mpol_new() get called, all they have is a pointer
> to the cleared nodemask in the stack frame of sys_set_mempolicy() or
> sys_mbind().  So, the existing code's error checking doesn't require one
> to specify a non-NULL, but empty nodemask.  It just requires that one
> does not specify any nodes with MPOL_DEFAULT.  
> 

You were previously arguing from an API or "reserved for future-use" 
standpoint and now you're arguing from an implementation standpoint.  Both 
of which are very different from each other.

The implementation can change to deal with this however we want (as I did 
in my patchset), so arguing in support of what mpol_new() or 
mpol_check_policy() currently do is irrelevant.

		David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
  2008-02-13  0:42                             ` David Rientjes
@ 2008-02-13 16:32                               ` Lee Schermerhorn
  2008-02-13 18:32                                 ` David Rientjes
  0 siblings, 1 reply; 90+ messages in thread
From: Lee Schermerhorn @ 2008-02-13 16:32 UTC (permalink / raw)
  To: David Rientjes; +Cc: KOSAKI Motohiro, Andrew Morton, linux-mm

On Tue, 2008-02-12 at 16:42 -0800, David Rientjes wrote:
> On Tue, 12 Feb 2008, Lee Schermerhorn wrote:
> 
> > > MPOL_DEFAULT is the default system-wide policy that does not require a 
> > > nodemask as a parameter.  Both the man page (set_mempolicy(2)) and the 
> > > documentation (Documentation/vm/numa_memory_policy.txt) state that.
> > > 
> > > It makes no sense in the future to assign a meaning to a nodemask passed 
> > > along with MPOL_DEFAULT.  None at all.  
> > 
> > Again, you're stating an opinion, to which you're entitled, or
> > expressing a limitation to your clairvoyance, for which I can't fault
> > you.  Indeed, I tend to agree with you on this particular point--my own
> > opinion and/or lack of vision.  However, I've been burned in the past by
> > just this scenario--wanting to assign meaning to something that was
> > ignored--because it could break existing applications.  So, on general
> > principle, I like to be fairly strict with argument checking [despite my
> > natural libertarian tendencies].
> > 
> 
> It's currently undefined behavior.  Neither the Linux documentation 
> (Documentation/vm/numa_memory_policy.txt) nor the man page 
> (set_mempolicy(2)) state the meaning of a nodemask passed with 
> MPOL_DEFAULT.
> 
> The man page simply says the nodemask should be passed as NULL and the 
> documentation state that MPOL_DEFAULT "does not use the optional set of 
> nodes."
> 
> So what we do with that nodemask is an implementation detail that does not 
> need to conform to any pre-defined API or even the possibility that one 
> day it will become useful.  In the context of the documentation, it is 
> logical that any nodemask that is passed with MPOL_DEFAULT is valid since 
> it's not used at all.
> 
> As you know, mempolicies can already morph into being effected over a 
> subset of nodes that was passed with set_mempolicy() or mbind() without 
> knowledge to the user.  That requires get_mempolicy() to determine.  
> Changing a non-empty nodemask passed with MPOL_DEFAULT to an empty 
> nodemask because it has no logical meaning is nothing new.

I think we're beating a dead horse here.  However, one more
consideration:

I'm not sure why you don't want to require the nodemask to be NULL/empty
in the case of MPOL_DEFAULT.  Perhaps it's from a code complexity
viewpoint.  Or maybe you think we're being kind to the programmer by
cutting them some slack.  Vis a vis the latter, I would argue that we're
not doing a programmer any favor by letting this slide by.  MPOL_DEFAULT
takes no nodemask.  So, if a non-empty nodemask is passed, the
programmer has done something wrong. 

Perhaps this was intentional:  Say, they did a cut and paste of another
set_mempolicy() call, changed the policy to DEFAULT and left the
nodemask args refering to a non-empty node mask.   In that case,
allowing it does no harm.

But, perhaps they intended a different policy and forgot to change the
MPOL_DEFAULT to that policy.  If we didn't complain about the non-empty
node mask, this could be a very tricky bug to diagnose.  

Since it's easy [and, IMO, advisable] for the kernel to be strict about
argument checking, I say do it.  I understand if you or other don't
agree.  I just want us to understand where each other is coming from
[and own up to our opinions as such].  I think this will be useful in
future interactions, of which I hope there are many.

> 
> > > The policy is simply the 
> > > equivalent of default_policy and, as the system default, a nodemask 
> > > parameter to the system default policy is wrong be definition.  
> > > 
> > > So, logically, we can either allow all nodemasks to be passed with a 
> > > MPOL_DEFAULT policy or none at all (it must be NULL).  Empty nodemasks 
> > > don't have any logical relationship with MPOL_DEFAULT.
> > 
> > Ah, maybe this explains our disconnect.  Internally, a NULL nodemask
> > pointer specified by the application is equivalent to an empty nodemask
> > is equivalent to maxnode == 0.  See get_nodes().  By the time
> > mpol_check_policy() or mpol_new() get called, all they have is a pointer
> > to the cleared nodemask in the stack frame of sys_set_mempolicy() or
> > sys_mbind().  So, the existing code's error checking doesn't require one
> > to specify a non-NULL, but empty nodemask.  It just requires that one
> > does not specify any nodes with MPOL_DEFAULT.  
> > 
> 
> You were previously arguing from an API or "reserved for future-use" 
> standpoint and now you're arguing from an implementation standpoint.  Both 
> of which are very different from each other.
> 
> The implementation can change to deal with this however we want (as I did 
> in my patchset), so arguing in support of what mpol_new() or 
> mpol_check_policy() currently do is irrelevant.  


Sorry.  I wasn't "arguing" here.  I interpreted your last couple of
paragraphs as indicating that you thought I was advocating a non-NULL,
but empty nodemask.  Based on that interpretation, I was pointing out
that the implementation hands mpol_check_policy() [now in mpol_new()]
and empty nodemask in all three equivalent cases mentioned above.  So,
by testing that, we can determine whether or not the user specified a
non-NULL, non-empty nodemask with MPOL_DEFAULT.  That's all I meant.

Lee
> 
> 		David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
  2008-02-13 16:32                               ` Lee Schermerhorn
@ 2008-02-13 18:32                                 ` David Rientjes
  2008-02-13 18:56                                   ` Lee Schermerhorn
  0 siblings, 1 reply; 90+ messages in thread
From: David Rientjes @ 2008-02-13 18:32 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: KOSAKI Motohiro, Andrew Morton, linux-mm

On Wed, 13 Feb 2008, Lee Schermerhorn wrote:

> I'm not sure why you don't want to require the nodemask to be NULL/empty
> in the case of MPOL_DEFAULT.  Perhaps it's from a code complexity
> viewpoint.  Or maybe you think we're being kind to the programmer by
> cutting them some slack.  Vis a vis the latter, I would argue that we're
> not doing a programmer any favor by letting this slide by.  MPOL_DEFAULT
> takes no nodemask.  So, if a non-empty nodemask is passed, the
> programmer has done something wrong. 
> 

I mentioned on LKML that I've currently folded all the current logic of 
mpol_check_policy() as it stands this minute in Linus' tree into 
mpol_new() so that non-empty nodemasks are no longer accepted for 
MPOL_DEFAULT.

		David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2.6.24-mm1]  Mempolicy:  silently restrict nodemask to allowed nodes V3
  2008-02-13 18:32                                 ` David Rientjes
@ 2008-02-13 18:56                                   ` Lee Schermerhorn
  0 siblings, 0 replies; 90+ messages in thread
From: Lee Schermerhorn @ 2008-02-13 18:56 UTC (permalink / raw)
  To: David Rientjes; +Cc: KOSAKI Motohiro, Andrew Morton, linux-mm

On Wed, 2008-02-13 at 10:32 -0800, David Rientjes wrote:
> On Wed, 13 Feb 2008, Lee Schermerhorn wrote:
> 
> > I'm not sure why you don't want to require the nodemask to be NULL/empty
> > in the case of MPOL_DEFAULT.  Perhaps it's from a code complexity
> > viewpoint.  Or maybe you think we're being kind to the programmer by
> > cutting them some slack.  Vis a vis the latter, I would argue that we're
> > not doing a programmer any favor by letting this slide by.  MPOL_DEFAULT
> > takes no nodemask.  So, if a non-empty nodemask is passed, the
> > programmer has done something wrong. 
> > 
> 
> I mentioned on LKML that I've currently folded all the current logic of 
> mpol_check_policy() as it stands this minute in Linus' tree into 
> mpol_new() so that non-empty nodemasks are no longer accepted for 
> MPOL_DEFAULT.
> 

Right.  That's why I mentioned "beating a dead horse".  I was answering
mail this morning in the order I read them.  This seemed to me to
warrant a response, independent of the lkml stream.  Again, just to get
on the same page.  Honest.  Not just to be a <insert whatever you might
be thinking here>... :-)

Thanks,
Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 90+ messages in thread

end of thread, other threads:[~2008-02-13 18:56 UTC | newest]

Thread overview: 90+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-02-02  8:12 [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node KOSAKI Motohiro
2008-02-02  8:12 ` KOSAKI Motohiro
2008-02-02  9:09 ` Andi Kleen
2008-02-02  9:09   ` Andi Kleen
2008-02-02  9:37   ` KOSAKI Motohiro
2008-02-02  9:37     ` KOSAKI Motohiro
2008-02-02 11:30     ` Andi Kleen
2008-02-02 11:30       ` Andi Kleen
2008-02-04 19:03       ` Christoph Lameter
2008-02-04 19:03         ` Christoph Lameter
2008-02-04 18:20     ` Lee Schermerhorn
2008-02-04 18:20       ` Lee Schermerhorn
2008-02-05  9:26       ` [2.6.24 regression][BUGFIX] " KOSAKI Motohiro
2008-02-05  9:26         ` KOSAKI Motohiro
2008-02-05 21:57         ` Lee Schermerhorn
2008-02-05 22:12           ` Christoph Lameter
2008-02-06 16:00             ` Lee Schermerhorn
2008-02-05 22:15           ` Paul Jackson
2008-02-06  2:17           ` David Rientjes
2008-02-06 16:11             ` Lee Schermerhorn
2008-02-06  6:49           ` KOSAKI Motohiro
2008-02-06 17:38         ` Lee Schermerhorn
2008-02-07  8:31           ` KOSAKI Motohiro
2008-02-08 19:45         ` [PATCH 2.6.24-mm1] Mempolicy: silently restrict nodemask to allowed nodes V3 Lee Schermerhorn
2008-02-08 19:45           ` Lee Schermerhorn
2008-02-09 18:11           ` KOSAKI Motohiro
2008-02-09 18:11             ` KOSAKI Motohiro
2008-02-10  5:29           ` KOSAKI Motohiro
2008-02-10  5:29             ` KOSAKI Motohiro
2008-02-10  5:49             ` Greg KH
2008-02-10  5:49               ` Greg KH
2008-02-10  7:42               ` Linus Torvalds
2008-02-10  7:42                 ` Linus Torvalds
2008-02-10 10:31                 ` Andrew Morton
2008-02-10 10:31                   ` Andrew Morton
2008-02-11 16:47                 ` Lee Schermerhorn
2008-02-11 16:47                   ` Lee Schermerhorn
2008-02-12  0:43                   ` KOSAKI Motohiro
2008-02-12  1:00                     ` David Rientjes
2008-02-12  1:56                       ` KOSAKI Motohiro
2008-02-12  2:05                         ` David Rientjes
2008-02-12  3:05                           ` KOSAKI Motohiro
2008-02-12  3:17                             ` David Rientjes
2008-02-12 15:08                       ` Lee Schermerhorn
2008-02-12 19:06                         ` David Rientjes
2008-02-13  0:07                           ` Lee Schermerhorn
2008-02-13  0:42                             ` David Rientjes
2008-02-13 16:32                               ` Lee Schermerhorn
2008-02-13 18:32                                 ` David Rientjes
2008-02-13 18:56                                   ` Lee Schermerhorn
2008-02-12  4:30                   ` [PATCH for 2.6.24][regression fix] " KOSAKI Motohiro
2008-02-12  4:30                     ` KOSAKI Motohiro
2008-02-12  5:06                     ` David Rientjes
2008-02-12  5:06                       ` David Rientjes
2008-02-12  5:07                     ` Andrew Morton
2008-02-12  5:07                       ` Andrew Morton
2008-02-12 13:18                       ` KOSAKI Motohiro
2008-02-12 13:18                         ` KOSAKI Motohiro
2008-02-05 10:17       ` [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node Paul Jackson
2008-02-05 10:17         ` Paul Jackson
2008-02-05 11:14         ` KOSAKI Motohiro
2008-02-05 11:14           ` KOSAKI Motohiro
2008-02-05 19:56         ` David Rientjes
2008-02-05 19:56           ` David Rientjes
2008-02-05 20:51           ` Paul Jackson
2008-02-05 20:51             ` Paul Jackson
2008-02-05 21:03             ` David Rientjes
2008-02-05 21:03               ` David Rientjes
2008-02-05 21:33               ` Paul Jackson
2008-02-05 21:33                 ` Paul Jackson
2008-02-05 22:04                 ` Lee Schermerhorn
2008-02-05 22:04                   ` Lee Schermerhorn
2008-02-05 22:44                   ` David Rientjes
2008-02-05 22:44                     ` David Rientjes
2008-02-05 22:50                   ` Paul Jackson
2008-02-05 22:50                     ` Paul Jackson
2008-02-05 14:31       ` Mel Gorman
2008-02-05 14:31         ` Mel Gorman
2008-02-05 15:23         ` Lee Schermerhorn
2008-02-05 15:23           ` Lee Schermerhorn
2008-02-05 18:12           ` Christoph Lameter
2008-02-05 18:12             ` Christoph Lameter
2008-02-05 18:27             ` Lee Schermerhorn
2008-02-05 18:27               ` Lee Schermerhorn
2008-02-05 19:04               ` Christoph Lameter
2008-02-05 19:04                 ` Christoph Lameter
2008-02-05 19:15                 ` Paul Jackson
2008-02-05 19:15                   ` Paul Jackson
2008-02-05 20:06                   ` David Rientjes
2008-02-05 20:06                     ` David Rientjes

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.