linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] mm: mempolicy: fix the absence of the last bit of nodemask
@ 2019-10-12 12:19 Pan Zhang
  2019-10-14  9:12 ` Michal Hocko
  0 siblings, 1 reply; 5+ messages in thread
From: Pan Zhang @ 2019-10-12 12:19 UTC (permalink / raw)
  To: akpm, vbabka, rientjes, mhocko, jgg, aarcange, yang.shi, zhongjiang
  Cc: linux-mm, linux-kernel

    When I want to use set_mempolicy to get the memory from each node on the numa machine,
    and the MPOL_INTERLEAVE flag seems to achieve this goal.
    However, during the test, it was found that the use result of node was unbalanced.
    The memory was allocated evenly from the nodes except the last node,
    which obviously did not match the expectations.

    You can test as follows:
1.  Create a file that needs to be mmap ped:
    dd if=/dev/zero of=./test count=1024 bs=1M

2.  Use `numactl -H` to see that your test machine has several nodes,
    and then change the macro NUM_NODES to the corresponding number of nodes
    in the test program.

3.  Compile the following program:
    gcc numa_alloc_test.c -lnuma

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <numaif.h>
    #include <unistd.h>
    #include <numaif.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>

    // rewrite these macro as `numactl -H` showed
    // The number of nodes on which machine the program runs
    #define NUM_NODES 2

    // memory we want to alloc from multinode averagely
    #define ALLOC_MEM_SIZE (1 << 30)
    void print_node_memusage()
    {
        for (int i=0; i < NUM_NODES; i++) {
            FILE *fp;
            char buf[1024];

            snprintf(buf, sizeof(buf),
                "cat /sys/devices/system/node/node%lu/meminfo | grep MemUsed", i);

            if ((fp = popen(buf, "r")) == NULL) {
                perror("popen");
                exit(-1);
            }

            while(fgets(buf, sizeof(buf), fp) != NULL) {
                printf("%s", buf);
            }

            if(pclose(fp))  {
                perror("pclose");
                exit(-1);
            }
        }
    }

    int main()
    {
        unsigned long num_nodes = NUM_NODES;
        unsigned long nodes_mask = (1 << NUM_NODES) - 1;
        // use MPOL_INTERLEAVE flag in order to balanced memory allocation
        set_mempolicy(MPOL_INTERLEAVE, &nodes_mask, num_nodes);

        // print info of nodes' memused before memory allocation
        print_node_memusage();

        int fd = open("./test", O_RDWR);
        unsigned long *addr = mmap(NULL, ALLOC_MEM_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

        // trigger page fault and page alloc
        for (unsigned long i=0; i < ALLOC_MEM_SIZE/sizeof(unsigned long); i++) {
            addr[i] = i;
        }

        // print info of nodes' memused before memory allocation
        print_node_memusage();
        munmap(addr, ALLOC_MEM_SIZE);
        return 0;
    }

4.  execution procedures:
    ./a.out
5.  observe the output:
    On my `2 nodes` arm64 test environment, the test result is as follows:
    # ./a.out
    Node 0 MemUsed:         1313952 kB
    Node 1 MemUsed:          267620 kB
    Node 0 MemUsed:         2365500 kB (use 1GB)
    Node 1 MemUsed:          267832 kB (do not used)

    Besides, I found the same problem at https://bugzilla.kernel.org/show_bug.cgi?id=201433,
    so I feel it is necessary to track and fix this issue.

    I tracked the impact of set_mempolicy and memory allocation strategy on the alloc_pages
    process (MPOL_INTERLEAVE node pages allocation is implemented in `alloc_page_interleave`),
    and found that the memory allocation is based on nodemask (`interleave_nodes` -> `next_node_in`),
    so the problem may be in the nodemask setting: evetually, i found the nodemask is set
    in the `get_nodes` function.

    mm/mempolicy.c: `get_nodes` function
    --maxnode causes nodemask to ignore the last node. I think this needs to be changed,
    except that it also handles the case where the maxnode that the user passed in is 1.

    After the modification, the test result is normal.
    # ./a.out
    Node 0 MemUsed:          508044 kB
    Node 1 MemUsed:         1239276 kB
    Node 0 MemUsed:         1034196 kB (use 513MB)
    Node 1 MemUsed:         1768492 kB (use 516MB)

Signed-off-by: z00417012 <zhangpan26@huawei.com>
---
 mm/mempolicy.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4ae967b..a23509f 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1328,9 +1328,11 @@ static int get_nodes(nodemask_t *nodes, const unsigned long __user *nmask,
 	unsigned long nlongs;
 	unsigned long endmask;
 
-	--maxnode;
 	nodes_clear(*nodes);
-	if (maxnode == 0 || !nmask)
+	/*
+	 * If the user specified only one node, no need to set nodemask
+	 */
+	if (maxnode - 1 == 0 || !nmask)
 		return 0;
 	if (maxnode > PAGE_SIZE*BITS_PER_BYTE)
 		return -EINVAL;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] mm: mempolicy: fix the absence of the last bit of nodemask
  2019-10-12 12:19 [PATCH] mm: mempolicy: fix the absence of the last bit of nodemask Pan Zhang
@ 2019-10-14  9:12 ` Michal Hocko
  2019-10-14  9:35   ` Vlastimil Babka
  0 siblings, 1 reply; 5+ messages in thread
From: Michal Hocko @ 2019-10-14  9:12 UTC (permalink / raw)
  To: Pan Zhang
  Cc: akpm, vbabka, rientjes, jgg, aarcange, yang.shi, zhongjiang,
	linux-mm, linux-kernel, Cristopher Lameter

[Cc Christopher - th initial emails is http://lkml.kernel.org/r/1570882789-20579-1-git-send-email-zhangpan26@huawei.com]

On Sat 12-10-19 20:19:48, Pan Zhang wrote:
>     When I want to use set_mempolicy to get the memory from each node on the numa machine,
>     and the MPOL_INTERLEAVE flag seems to achieve this goal.
>     However, during the test, it was found that the use result of node was unbalanced.
>     The memory was allocated evenly from the nodes except the last node,
>     which obviously did not match the expectations.
> 
>     You can test as follows:
> 1.  Create a file that needs to be mmap ped:
>     dd if=/dev/zero of=./test count=1024 bs=1M

This will already poppulate the page cache and if it fits into memory
(which seems to be the case in your example output) then your mmap later
will not allocate any new memory.

I suspect that using numactl --interleave 0,1 dd if=/dev/zero of=./test count=1024 bs=1M

will produce an output much closer to your expectation. Right?

[...]
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 4ae967b..a23509f 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1328,9 +1328,11 @@ static int get_nodes(nodemask_t *nodes, const unsigned long __user *nmask,
>  	unsigned long nlongs;
>  	unsigned long endmask;
>  
> -	--maxnode;
>  	nodes_clear(*nodes);
> -	if (maxnode == 0 || !nmask)
> +	/*
> +	 * If the user specified only one node, no need to set nodemask
> +	 */
> +	if (maxnode - 1 == 0 || !nmask)
>  		return 0;
>  	if (maxnode > PAGE_SIZE*BITS_PER_BYTE)
>  		return -EINVAL;

I am afraid this is a wrong fix. It is really hard to grasp the code but my
understanding is that the caller is supposed to provide maxnode larger
than than the nodemask. So if you want 2 nodes then maxnode should be 3.
Have a look at the libnuma (which is a reference implementation)

static void setpol(int policy, struct bitmask *bmp)
{
	if (set_mempolicy(policy, bmp->maskp, bmp->size + 1) < 0)
		numa_error("set_mempolicy");
}

The semantic is quite awkward but it is that way for years.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] mm: mempolicy: fix the absence of the last bit of nodemask
  2019-10-14  9:12 ` Michal Hocko
@ 2019-10-14  9:35   ` Vlastimil Babka
  2019-10-14 13:49     ` Pan Zhang
  0 siblings, 1 reply; 5+ messages in thread
From: Vlastimil Babka @ 2019-10-14  9:35 UTC (permalink / raw)
  To: Michal Hocko, Pan Zhang
  Cc: akpm, rientjes, jgg, aarcange, yang.shi, zhongjiang, linux-mm,
	linux-kernel, Cristopher Lameter, Linux API, Alexander Viro

On 10/14/19 11:12 AM, Michal Hocko wrote:
>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> index 4ae967b..a23509f 100644
>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -1328,9 +1328,11 @@ static int get_nodes(nodemask_t *nodes, const unsigned long __user *nmask,
>>  	unsigned long nlongs;
>>  	unsigned long endmask;
>>  
>> -	--maxnode;
>>  	nodes_clear(*nodes);
>> -	if (maxnode == 0 || !nmask)
>> +	/*
>> +	 * If the user specified only one node, no need to set nodemask
>> +	 */
>> +	if (maxnode - 1 == 0 || !nmask)
>>  		return 0;
>>  	if (maxnode > PAGE_SIZE*BITS_PER_BYTE)
>>  		return -EINVAL;
> 
> I am afraid this is a wrong fix. It is really hard to grasp the code but my
> understanding is that the caller is supposed to provide maxnode larger
> than than the nodemask. So if you want 2 nodes then maxnode should be 3.
> Have a look at the libnuma (which is a reference implementation)
> 
> static void setpol(int policy, struct bitmask *bmp)
> {
> 	if (set_mempolicy(policy, bmp->maskp, bmp->size + 1) < 0)
> 		numa_error("set_mempolicy");
> }
> 
> The semantic is quite awkward but it is that way for years.

Yes, unfortunately. Too late to change. We could just update the
manpages at this point.

get_mempolicy(2) says:
 maxnode specifies the number of node IDs that can be stored into
nodemask—that is, the maximum node ID plus one.

- Since node ID starts with 0, it should be actually "plus two".

set_mempolicy(2) says:
 nodemask  points to a bit mask of node IDs that contains up to maxnode
bits.

- should be also clarified.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Re: [PATCH] mm: mempolicy: fix the absence of the last bit of nodemask
  2019-10-14  9:35   ` Vlastimil Babka
@ 2019-10-14 13:49     ` Pan Zhang
  0 siblings, 0 replies; 5+ messages in thread
From: Pan Zhang @ 2019-10-14 13:49 UTC (permalink / raw)
  To: akpm, vbabka, rientjes, mhocko, jgg, aarcange, yang.shi, zhongjiang
  Cc: linux-mm, linux-kernel

    On Mon 14-10:19 17:12:52, Michal Hocko wrote:
    >>     When I want to use set_mempolicy to get the memory from each node on the numa machine,
    >>     and the MPOL_INTERLEAVE flag seems to achieve this goal.
    >>     However, during the test, it was found that the use result of node was unbalanced.
    >>     The memory was allocated evenly from the nodes except the last node,
    >>     which obviously did not match the expectations.
    >> 
    >>     You can test as follows:
    >> 1.  Create a file that needs to be mmap ped:
    >>     dd if=/dev/zero of=./test count=1024 bs=1M

    >This will already poppulate the page cache and if it fits into memory (which seems to be the case in your example output) then your mmap later will not allocate any new memory.
    >
    >I suspect that using numactl --interleave 0,1 dd if=/dev/zero of=./test count=1024 bs=1M
    >
    >will produce an output much closer to your expectation. Right?

    Yes, you are right. `dd` command will 'populate the page cache and if it fits into memory'.
    As a newcomer who is studying hard in this field,
    I am sorry for this and I don't know much about the mechanism of memory management.
    
    I used `malloc` again in my program to allocate memory and produced the same `confusing` result.

    But as you and Vlastimil Babka said, historical reasons have made the implementation of this interface less intuitive.
    Modifying manual may be a better option.

    Thank you both for your reply and explanation.


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] mm: mempolicy: fix the absence of the last bit of nodemask
@ 2019-10-12 12:08 z00417012
  0 siblings, 0 replies; 5+ messages in thread
From: z00417012 @ 2019-10-12 12:08 UTC (permalink / raw)
  To: akpm, vbabka, rientjes, mhocko, jgg, aarcange, yang.shi, zhongjiang
  Cc: linux-mm, linux-kernel

    When I want to use set_mempolicy to get the memory from each node on the numa machine,
    and the MPOL_INTERLEAVE flag seems to achieve this goal.
    However, during the test, it was found that the use result of node was unbalanced.
    The memory was allocated evenly from the nodes except the last node,
    which obviously did not match the expectations.

    You can test as follows:
1.  Create a file that needs to be mmap ped:
    dd if=/dev/zero of=./test count=1024 bs=1M

2.  Use `numactl -H` to see that your test machine has several nodes,
    and then change the macro NUM_NODES to the corresponding number of nodes
    in the test program.

3.  Compile the following program:
    gcc numa_alloc_test.c -lnuma

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <numaif.h>
    #include <unistd.h>
    #include <numaif.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>

    // rewrite these macro as `numactl -H` showed
    // The number of nodes on which machine the program runs
    #define NUM_NODES 2

    // memory we want to alloc from multinode averagely
    #define ALLOC_MEM_SIZE (1 << 30)
    void print_node_memusage()
    {
        for (int i=0; i < NUM_NODES; i++) {
            FILE *fp;
            char buf[1024];

            snprintf(buf, sizeof(buf),
                "cat /sys/devices/system/node/node%lu/meminfo | grep MemUsed", i);

            if ((fp = popen(buf, "r")) == NULL) {
                perror("popen");
                exit(-1);
            }

            while(fgets(buf, sizeof(buf), fp) != NULL) {
                printf("%s", buf);
            }

            if(pclose(fp))  {
                perror("pclose");
                exit(-1);
            }
        }
    }

    int main()
    {
        unsigned long num_nodes = NUM_NODES;
        unsigned long nodes_mask = (1 << NUM_NODES) - 1;
        // use MPOL_INTERLEAVE flag in order to balanced memory allocation
        set_mempolicy(MPOL_INTERLEAVE, &nodes_mask, num_nodes);

        // print info of nodes' memused before memory allocation
        print_node_memusage();

        int fd = open("./test", O_RDWR);
        unsigned long *addr = mmap(NULL, ALLOC_MEM_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

        // trigger page fault and page alloc
        for (unsigned long i=0; i < ALLOC_MEM_SIZE/sizeof(unsigned long); i++) {
            addr[i] = i;
        }

        // print info of nodes' memused before memory allocation
        print_node_memusage();
        munmap(addr, ALLOC_MEM_SIZE);
        return 0;
    }

4.  execution procedures:
    ./a.out
5.  observe the output:
    On my `2 nodes` arm64 test environment, the test result is as follows:
    # ./a.out
    Node 0 MemUsed:         1313952 kB
    Node 1 MemUsed:          267620 kB
    Node 0 MemUsed:         2365500 kB (use 1GB)
    Node 1 MemUsed:          267832 kB (do not used)

    Besides, I found the same problem at https://bugzilla.kernel.org/show_bug.cgi?id=201433,
    so I feel it is necessary to track and fix this issue.

    I tracked the impact of set_mempolicy and memory allocation strategy on the alloc_pages
    process (MPOL_INTERLEAVE node pages allocation is implemented in `alloc_page_interleave`),
    and found that the memory allocation is based on nodemask (`interleave_nodes` -> `next_node_in`),
    so the problem may be in the nodemask setting: evetually, i found the nodemask is set
    in the `get_nodes` function.

    mm/mempolicy.c: `get_nodes` function
    --maxnode causes nodemask to ignore the last node. I think this needs to be changed,
    except that it also handles the case where the maxnode that the user passed in is 1.

    After the modification, the test result is normal.
    # ./a.out
    Node 0 MemUsed:          508044 kB
    Node 1 MemUsed:         1239276 kB
    Node 0 MemUsed:         1034196 kB (use 513MB)
    Node 1 MemUsed:         1768492 kB (use 516MB)

Signed-off-by: z00417012 <zhangpan26@huawei.com>
---
 mm/mempolicy.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4ae967b..a23509f 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1328,9 +1328,11 @@ static int get_nodes(nodemask_t *nodes, const unsigned long __user *nmask,
 	unsigned long nlongs;
 	unsigned long endmask;
 
-	--maxnode;
 	nodes_clear(*nodes);
-	if (maxnode == 0 || !nmask)
+	/*
+	 * If the user specified only one node, no need to set nodemask
+	 */
+	if (maxnode - 1 == 0 || !nmask)
 		return 0;
 	if (maxnode > PAGE_SIZE*BITS_PER_BYTE)
 		return -EINVAL;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2019-10-14 13:48 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-12 12:19 [PATCH] mm: mempolicy: fix the absence of the last bit of nodemask Pan Zhang
2019-10-14  9:12 ` Michal Hocko
2019-10-14  9:35   ` Vlastimil Babka
2019-10-14 13:49     ` Pan Zhang
  -- strict thread matches above, loose matches on Subject: below --
2019-10-12 12:08 z00417012

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).