bug: move_pages(2) does not udpate "status" if no pages are moved

All of lore.kernel.org
 help / color / mirror / Atom feed

* bug: move_pages(2) does not udpate "status" if no pages are moved
@ 2019-12-04 19:01 Felix Abecassis
  2019-12-04 19:21 ` John Hubbard
  2019-12-04 20:17 ` Yang Shi
  0 siblings, 2 replies; 10+ messages in thread
From: Felix Abecassis @ 2019-12-04 19:01 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton

Hello all,

On kernel 5.3, when using the move_pages syscall (wrapped by libnuma) and all
pages happen to be on the right node already, this function returns 0 but the
"status" array is not updated. This array potentially contains garbage values
(e.g. from malloc(3)), and I don't see a way to detect this.

Looking at the kernel code, we are probably exiting do_pages_move here:
out_flush:
    if (list_empty(&pagelist))
        return err;

Here is a sample C program to reproduce the problem:
/* $ gcc move_pages_bug.c -lnuma -o move_pages_bug */

#define _DEFAULT_SOURCE
#include <sys/mman.h>

#include <numaif.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const long node_id = 1;
    const long page_size = sysconf(_SC_PAGESIZE);
    const int64_t num_pages = 8;

    unsigned long nodemask =  1 << node_id;
    long ret = set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask));
    if (ret < 0)
        return (EXIT_FAILURE);

    void **pages = malloc(sizeof(void*) * num_pages);
    for (int i = 0; i < num_pages; ++i) {
        pages[i] = mmap(NULL, page_size, PROT_WRITE | PROT_READ,
                MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS,
                -1, 0);
        if (pages[i] == MAP_FAILED)
            return (EXIT_FAILURE);
    }

    ret = set_mempolicy(MPOL_DEFAULT, NULL, 0);
    if (ret < 0)
        return (EXIT_FAILURE);

    int *nodes = malloc(sizeof(int) * num_pages);
    int *status = malloc(sizeof(int) * num_pages);
    for (int i = 0; i < num_pages; ++i) {
        nodes[i] = node_id;
        status[i] = 0xd0; /* simulate garbage values */
    }

    ret = move_pages(0, num_pages, pages, nodes, status, MPOL_MF_MOVE);
    printf("move_pages: %ld\n", ret);
    for (int i = 0; i < num_pages; ++i)
        printf("status[%d] = %d\n", i, status[i]);
}

And here is a sample output showing the "garbage" values:
$ ./move_pages_bug
move_pages: 0
status[0] = 208
status[1] = 208
status[2] = 208
status[3] = 208
status[4] = 208
status[5] = 208
status[6] = 208
status[7] = 208

Note that passing NULL as the "nodes" argument works as expected here.

Also, it seems that it's the last "run-length" of pages on the right node(s)
that triggers this problem, e.g. if I add "nodes[0] = nodes[1] = 0", then the
output becomes:
$ ./move_pages_bug
move_pages: 0
status[0] = 0
status[1] = 0
status[2] = 208
status[3] = 208
status[4] = 208
status[5] = 208
status[6] = 208
status[7] = 208

And with just "nodes[7] = 0;", the first run-length of pages gets assigned
correctly:
$ ./move_pages_bug
move_pages: 0
status[0] = 1
status[1] = 1
status[2] = 1
status[3] = 1
status[4] = 1
status[5] = 1
status[6] = 1
status[7] = 0


Thank you,


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: bug: move_pages(2) does not udpate "status" if no pages are moved
  2019-12-04 19:01 bug: move_pages(2) does not udpate "status" if no pages are moved Felix Abecassis
@ 2019-12-04 19:21 ` John Hubbard
  2019-12-04 20:04   ` Yang Shi
  2019-12-04 20:17 ` Yang Shi
  1 sibling, 1 reply; 10+ messages in thread
From: John Hubbard @ 2019-12-04 19:21 UTC (permalink / raw)
  To: Felix Abecassis, linux-mm; +Cc: Andrew Morton

On 12/4/19 11:01 AM, Felix Abecassis wrote:
> Hello all,
> 

Hi Felix,

Thanks for writing up a very clear description of the problem.

> On kernel 5.3, when using the move_pages syscall (wrapped by libnuma) and all
> pages happen to be on the right node already, this function returns 0 but the
> "status" array is not updated. This array potentially contains garbage values
> (e.g. from malloc(3)), and I don't see a way to detect this.

The way to detect this case would be to zero the array before calling move_pages().
Then, if move_pages() returns 0, and the array remains full of zeroes, you can
conclude that move_pages() "succeeded", and that there were no errors for any
of the pages. So the pages are where you requested them to end up.

> 
> Looking at the kernel code, we are probably exiting do_pages_move here:
> out_flush:
>     if (list_empty(&pagelist))
>         return err;

I looked at that code and the surrounding function, and it's been pretty much
unchanged for quite a while. The above was last touched in April, 2018, for 
example.

Yes, we could change the kernel code to fill in the array with zeroes in that
situation, but the man page doesn't actually cover this case at all. We'd have
to also change the man page, to say something like, "if pages were not moved 
because they were already in the requested location, then the status array
will contain <SOME_VALUE> for such pages". In other words, the kernel matches
the requirements (the man page) as it stands today, at least as I'm reading it.

And given that one can already figure all this out with the existing kernel
and libnuma behavior, I'm guessing that the linux-mm folks will not see any
reason to make such a change--but maybe I'm guessing wrong. Anyone on CC want
to weigh in there?

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: bug: move_pages(2) does not udpate "status" if no pages are moved
  2019-12-04 19:21 ` John Hubbard
@ 2019-12-04 20:04   ` Yang Shi
  2019-12-04 23:45     ` John Hubbard
  0 siblings, 1 reply; 10+ messages in thread
From: Yang Shi @ 2019-12-04 20:04 UTC (permalink / raw)
  To: John Hubbard; +Cc: Felix Abecassis, Linux MM, Andrew Morton

On Wed, Dec 4, 2019 at 11:21 AM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 12/4/19 11:01 AM, Felix Abecassis wrote:
> > Hello all,
> >
>
> Hi Felix,
>
> Thanks for writing up a very clear description of the problem.
>
> > On kernel 5.3, when using the move_pages syscall (wrapped by libnuma) and all
> > pages happen to be on the right node already, this function returns 0 but the
> > "status" array is not updated. This array potentially contains garbage values
> > (e.g. from malloc(3)), and I don't see a way to detect this.
>
>
> The way to detect this case would be to zero the array before calling move_pages().
> Then, if move_pages() returns 0, and the array remains full of zeroes, you can
> conclude that move_pages() "succeeded", and that there were no errors for any
> of the pages. So the pages are where you requested them to end up.

I don't think we can just simply return all zeros here. It looks the
status should contain error code or the target node id if the page is
moved to that node successfully. So, if the page is already on the
requested node, the status should contain the current node id, but the
current node maybe not 0.

So, IMHO it sounds like a valid bug.

>
>
> >
> > Looking at the kernel code, we are probably exiting do_pages_move here:
> > out_flush:
> >     if (list_empty(&pagelist))
> >         return err;
>
>
> I looked at that code and the surrounding function, and it's been pretty much
> unchanged for quite a while. The above was last touched in April, 2018, for
> example.
>
> Yes, we could change the kernel code to fill in the array with zeroes in that
> situation, but the man page doesn't actually cover this case at all. We'd have
> to also change the man page, to say something like, "if pages were not moved
> because they were already in the requested location, then the status array
> will contain <SOME_VALUE> for such pages". In other words, the kernel matches
> the requirements (the man page) as it stands today, at least as I'm reading it.
>
> And given that one can already figure all this out with the existing kernel
> and libnuma behavior, I'm guessing that the linux-mm folks will not see any
> reason to make such a change--but maybe I'm guessing wrong. Anyone on CC want
> to weigh in there?
>
>
> thanks,
> --
> John Hubbard
> NVIDIA
>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: bug: move_pages(2) does not udpate "status" if no pages are moved
  2019-12-04 19:01 bug: move_pages(2) does not udpate "status" if no pages are moved Felix Abecassis
  2019-12-04 19:21 ` John Hubbard
@ 2019-12-04 20:17 ` Yang Shi
  2019-12-04 22:54   ` Felix Abecassis
  2019-12-05  0:03   ` John Hubbard
  1 sibling, 2 replies; 10+ messages in thread
From: Yang Shi @ 2019-12-04 20:17 UTC (permalink / raw)
  To: Felix Abecassis; +Cc: Linux MM, Andrew Morton

On Wed, Dec 4, 2019 at 11:01 AM Felix Abecassis <fabecassis@nvidia.com> wrote:
>
> Hello all,
>
> On kernel 5.3, when using the move_pages syscall (wrapped by libnuma) and all
> pages happen to be on the right node already, this function returns 0 but the
> "status" array is not updated. This array potentially contains garbage values
> (e.g. from malloc(3)), and I don't see a way to detect this.
>
> Looking at the kernel code, we are probably exiting do_pages_move here:
> out_flush:
>     if (list_empty(&pagelist))
>         return err;

May you please give the below patch a try? I just did build test.

diff --git a/mm/migrate.c b/mm/migrate.c
index a8f87cb..f2f1279 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1517,7 +1517,8 @@ static int do_move_pages_to_node(struct mm_struct *mm,
  * the target node
  */
 static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
- int node, struct list_head *pagelist, bool migrate_all)
+ int node, struct list_head *pagelist, bool migrate_all,
+ int __user *status, int start)
 {
  struct vm_area_struct *vma;
  struct page *page;
@@ -1543,8 +1544,10 @@ static int add_page_for_migration(struct
mm_struct *mm, unsigned long addr,
  goto out;

  err = 0;
- if (page_to_nid(page) == node)
+ if (page_to_nid(page) == node) {
+ err = store_status(status, start, node, 1);
  goto out_putpage;
+ }

  err = -EACCES;
  if (page_mapcount(page) > 1 && !migrate_all)
@@ -1639,7 +1642,9 @@ static int do_pages_move(struct mm_struct *mm,
nodemask_t task_nodes,
  * report them via status
  */
  err = add_page_for_migration(mm, addr, current_node,
- &pagelist, flags & MPOL_MF_MOVE_ALL);
+ &pagelist, flags & MPOL_MF_MOVE_ALL, status,
+ i);
+
  if (!err)
  continue;

>
> Here is a sample C program to reproduce the problem:
> /* $ gcc move_pages_bug.c -lnuma -o move_pages_bug */
>
> #define _DEFAULT_SOURCE
> #include <sys/mman.h>
>
> #include <numaif.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <unistd.h>
>
> int main(void)
> {
>     const long node_id = 1;
>     const long page_size = sysconf(_SC_PAGESIZE);
>     const int64_t num_pages = 8;
>
>     unsigned long nodemask =  1 << node_id;
>     long ret = set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask));
>     if (ret < 0)
>         return (EXIT_FAILURE);
>
>     void **pages = malloc(sizeof(void*) * num_pages);
>     for (int i = 0; i < num_pages; ++i) {
>         pages[i] = mmap(NULL, page_size, PROT_WRITE | PROT_READ,
>                 MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS,
>                 -1, 0);
>         if (pages[i] == MAP_FAILED)
>             return (EXIT_FAILURE);
>     }
>
>     ret = set_mempolicy(MPOL_DEFAULT, NULL, 0);
>     if (ret < 0)
>         return (EXIT_FAILURE);
>
>     int *nodes = malloc(sizeof(int) * num_pages);
>     int *status = malloc(sizeof(int) * num_pages);
>     for (int i = 0; i < num_pages; ++i) {
>         nodes[i] = node_id;
>         status[i] = 0xd0; /* simulate garbage values */
>     }
>
>     ret = move_pages(0, num_pages, pages, nodes, status, MPOL_MF_MOVE);
>     printf("move_pages: %ld\n", ret);
>     for (int i = 0; i < num_pages; ++i)
>         printf("status[%d] = %d\n", i, status[i]);
> }
>
> And here is a sample output showing the "garbage" values:
> $ ./move_pages_bug
> move_pages: 0
> status[0] = 208
> status[1] = 208
> status[2] = 208
> status[3] = 208
> status[4] = 208
> status[5] = 208
> status[6] = 208
> status[7] = 208
>
> Note that passing NULL as the "nodes" argument works as expected here.
>
> Also, it seems that it's the last "run-length" of pages on the right node(s)
> that triggers this problem, e.g. if I add "nodes[0] = nodes[1] = 0", then the
> output becomes:
> $ ./move_pages_bug
> move_pages: 0
> status[0] = 0
> status[1] = 0
> status[2] = 208
> status[3] = 208
> status[4] = 208
> status[5] = 208
> status[6] = 208
> status[7] = 208
>
> And with just "nodes[7] = 0;", the first run-length of pages gets assigned
> correctly:
> $ ./move_pages_bug
> move_pages: 0
> status[0] = 1
> status[1] = 1
> status[2] = 1
> status[3] = 1
> status[4] = 1
> status[5] = 1
> status[6] = 1
> status[7] = 0
>
>
> Thank you,
>


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: bug: move_pages(2) does not udpate "status" if no pages are moved
  2019-12-04 20:17 ` Yang Shi
@ 2019-12-04 22:54   ` Felix Abecassis
  2019-12-04 23:37     ` Yang Shi
  2019-12-05  0:03   ` John Hubbard
  1 sibling, 1 reply; 10+ messages in thread
From: Felix Abecassis @ 2019-12-04 22:54 UTC (permalink / raw)
  To: Yang Shi; +Cc: Linux MM, Andrew Morton

On Wed, Dec 04, 2019 at 12:17:59PM -0800, Yang Shi wrote:
> On Wed, Dec 4, 2019 at 11:01 AM Felix Abecassis <fabecassis@nvidia.com> wrote:
> >
> > Hello all,
> >
> > On kernel 5.3, when using the move_pages syscall (wrapped by libnuma) and all
> > pages happen to be on the right node already, this function returns 0 but the
> > "status" array is not updated. This array potentially contains garbage values
> > (e.g. from malloc(3)), and I don't see a way to detect this.
> >
> > Looking at the kernel code, we are probably exiting do_pages_move here:
> > out_flush:
> >     if (list_empty(&pagelist))
> >         return err;
> 
> May you please give the below patch a try? I just did build test.
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index a8f87cb..f2f1279 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1517,7 +1517,8 @@ static int do_move_pages_to_node(struct mm_struct *mm,
>   * the target node
>   */
>  static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
> - int node, struct list_head *pagelist, bool migrate_all)
> + int node, struct list_head *pagelist, bool migrate_all,
> + int __user *status, int start)
>  {
>   struct vm_area_struct *vma;
>   struct page *page;
> @@ -1543,8 +1544,10 @@ static int add_page_for_migration(struct
> mm_struct *mm, unsigned long addr,
>   goto out;
> 
>   err = 0;
> - if (page_to_nid(page) == node)
> + if (page_to_nid(page) == node) {
> + err = store_status(status, start, node, 1);
>   goto out_putpage;
> + }
> 
>   err = -EACCES;
>   if (page_mapcount(page) > 1 && !migrate_all)
> @@ -1639,7 +1642,9 @@ static int do_pages_move(struct mm_struct *mm,
> nodemask_t task_nodes,
>   * report them via status
>   */
>   err = add_page_for_migration(mm, addr, current_node,
> - &pagelist, flags & MPOL_MF_MOVE_ALL);
> + &pagelist, flags & MPOL_MF_MOVE_ALL, status,
> + i);
> +
>   if (!err)
>   continue;
>

Thanks! I have tested this patch on top of 5.4-rc8, and I can comfirm it fixes
this particular test case.

> >
> > Here is a sample C program to reproduce the problem:
> > /* $ gcc move_pages_bug.c -lnuma -o move_pages_bug */
> >
> > #define _DEFAULT_SOURCE
> > #include <sys/mman.h>
> >
> > #include <numaif.h>
> > #include <stdlib.h>
> > #include <stdio.h>
> > #include <unistd.h>
> >
> > int main(void)
> > {
> >     const long node_id = 1;
> >     const long page_size = sysconf(_SC_PAGESIZE);
> >     const int64_t num_pages = 8;
> >
> >     unsigned long nodemask =  1 << node_id;
> >     long ret = set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask));
> >     if (ret < 0)
> >         return (EXIT_FAILURE);
> >
> >     void **pages = malloc(sizeof(void*) * num_pages);
> >     for (int i = 0; i < num_pages; ++i) {
> >         pages[i] = mmap(NULL, page_size, PROT_WRITE | PROT_READ,
> >                 MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS,
> >                 -1, 0);
> >         if (pages[i] == MAP_FAILED)
> >             return (EXIT_FAILURE);
> >     }
> >
> >     ret = set_mempolicy(MPOL_DEFAULT, NULL, 0);
> >     if (ret < 0)
> >         return (EXIT_FAILURE);
> >
> >     int *nodes = malloc(sizeof(int) * num_pages);
> >     int *status = malloc(sizeof(int) * num_pages);
> >     for (int i = 0; i < num_pages; ++i) {
> >         nodes[i] = node_id;
> >         status[i] = 0xd0; /* simulate garbage values */
> >     }
> >
> >     ret = move_pages(0, num_pages, pages, nodes, status, MPOL_MF_MOVE);
> >     printf("move_pages: %ld\n", ret);
> >     for (int i = 0; i < num_pages; ++i)
> >         printf("status[%d] = %d\n", i, status[i]);
> > }
> >
> > And here is a sample output showing the "garbage" values:
> > $ ./move_pages_bug
> > move_pages: 0
> > status[0] = 208
> > status[1] = 208
> > status[2] = 208
> > status[3] = 208
> > status[4] = 208
> > status[5] = 208
> > status[6] = 208
> > status[7] = 208
> >
> > Note that passing NULL as the "nodes" argument works as expected here.
> >
> > Also, it seems that it's the last "run-length" of pages on the right node(s)
> > that triggers this problem, e.g. if I add "nodes[0] = nodes[1] = 0", then the
> > output becomes:
> > $ ./move_pages_bug
> > move_pages: 0
> > status[0] = 0
> > status[1] = 0
> > status[2] = 208
> > status[3] = 208
> > status[4] = 208
> > status[5] = 208
> > status[6] = 208
> > status[7] = 208
> >
> > And with just "nodes[7] = 0;", the first run-length of pages gets assigned
> > correctly:
> > $ ./move_pages_bug
> > move_pages: 0
> > status[0] = 1
> > status[1] = 1
> > status[2] = 1
> > status[3] = 1
> > status[4] = 1
> > status[5] = 1
> > status[6] = 1
> > status[7] = 0
> >
> >
> > Thank you,
> >


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: bug: move_pages(2) does not udpate "status" if no pages are moved
  2019-12-04 22:54   ` Felix Abecassis
@ 2019-12-04 23:37     ` Yang Shi
  0 siblings, 0 replies; 10+ messages in thread
From: Yang Shi @ 2019-12-04 23:37 UTC (permalink / raw)
  To: Felix Abecassis; +Cc: Linux MM, Andrew Morton

On Wed, Dec 4, 2019 at 2:54 PM Felix Abecassis <fabecassis@nvidia.com> wrote:
>
> On Wed, Dec 04, 2019 at 12:17:59PM -0800, Yang Shi wrote:
> > On Wed, Dec 4, 2019 at 11:01 AM Felix Abecassis <fabecassis@nvidia.com> wrote:
> > >
> > > Hello all,
> > >
> > > On kernel 5.3, when using the move_pages syscall (wrapped by libnuma) and all
> > > pages happen to be on the right node already, this function returns 0 but the
> > > "status" array is not updated. This array potentially contains garbage values
> > > (e.g. from malloc(3)), and I don't see a way to detect this.
> > >
> > > Looking at the kernel code, we are probably exiting do_pages_move here:
> > > out_flush:
> > >     if (list_empty(&pagelist))
> > >         return err;
> >
> > May you please give the below patch a try? I just did build test.
> >
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index a8f87cb..f2f1279 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1517,7 +1517,8 @@ static int do_move_pages_to_node(struct mm_struct *mm,
> >   * the target node
> >   */
> >  static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
> > - int node, struct list_head *pagelist, bool migrate_all)
> > + int node, struct list_head *pagelist, bool migrate_all,
> > + int __user *status, int start)
> >  {
> >   struct vm_area_struct *vma;
> >   struct page *page;
> > @@ -1543,8 +1544,10 @@ static int add_page_for_migration(struct
> > mm_struct *mm, unsigned long addr,
> >   goto out;
> >
> >   err = 0;
> > - if (page_to_nid(page) == node)
> > + if (page_to_nid(page) == node) {
> > + err = store_status(status, start, node, 1);
> >   goto out_putpage;
> > + }
> >
> >   err = -EACCES;
> >   if (page_mapcount(page) > 1 && !migrate_all)
> > @@ -1639,7 +1642,9 @@ static int do_pages_move(struct mm_struct *mm,
> > nodemask_t task_nodes,
> >   * report them via status
> >   */
> >   err = add_page_for_migration(mm, addr, current_node,
> > - &pagelist, flags & MPOL_MF_MOVE_ALL);
> > + &pagelist, flags & MPOL_MF_MOVE_ALL, status,
> > + i);
> > +
> >   if (!err)
> >   continue;
> >
>
> Thanks! I have tested this patch on top of 5.4-rc8, and I can comfirm it fixes
> this particular test case.

Thanks for confirming this. I will send out formal patch soon.

>
> > >
> > > Here is a sample C program to reproduce the problem:
> > > /* $ gcc move_pages_bug.c -lnuma -o move_pages_bug */
> > >
> > > #define _DEFAULT_SOURCE
> > > #include <sys/mman.h>
> > >
> > > #include <numaif.h>
> > > #include <stdlib.h>
> > > #include <stdio.h>
> > > #include <unistd.h>
> > >
> > > int main(void)
> > > {
> > >     const long node_id = 1;
> > >     const long page_size = sysconf(_SC_PAGESIZE);
> > >     const int64_t num_pages = 8;
> > >
> > >     unsigned long nodemask =  1 << node_id;
> > >     long ret = set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask));
> > >     if (ret < 0)
> > >         return (EXIT_FAILURE);
> > >
> > >     void **pages = malloc(sizeof(void*) * num_pages);
> > >     for (int i = 0; i < num_pages; ++i) {
> > >         pages[i] = mmap(NULL, page_size, PROT_WRITE | PROT_READ,
> > >                 MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS,
> > >                 -1, 0);
> > >         if (pages[i] == MAP_FAILED)
> > >             return (EXIT_FAILURE);
> > >     }
> > >
> > >     ret = set_mempolicy(MPOL_DEFAULT, NULL, 0);
> > >     if (ret < 0)
> > >         return (EXIT_FAILURE);
> > >
> > >     int *nodes = malloc(sizeof(int) * num_pages);
> > >     int *status = malloc(sizeof(int) * num_pages);
> > >     for (int i = 0; i < num_pages; ++i) {
> > >         nodes[i] = node_id;
> > >         status[i] = 0xd0; /* simulate garbage values */
> > >     }
> > >
> > >     ret = move_pages(0, num_pages, pages, nodes, status, MPOL_MF_MOVE);
> > >     printf("move_pages: %ld\n", ret);
> > >     for (int i = 0; i < num_pages; ++i)
> > >         printf("status[%d] = %d\n", i, status[i]);
> > > }
> > >
> > > And here is a sample output showing the "garbage" values:
> > > $ ./move_pages_bug
> > > move_pages: 0
> > > status[0] = 208
> > > status[1] = 208
> > > status[2] = 208
> > > status[3] = 208
> > > status[4] = 208
> > > status[5] = 208
> > > status[6] = 208
> > > status[7] = 208
> > >
> > > Note that passing NULL as the "nodes" argument works as expected here.
> > >
> > > Also, it seems that it's the last "run-length" of pages on the right node(s)
> > > that triggers this problem, e.g. if I add "nodes[0] = nodes[1] = 0", then the
> > > output becomes:
> > > $ ./move_pages_bug
> > > move_pages: 0
> > > status[0] = 0
> > > status[1] = 0
> > > status[2] = 208
> > > status[3] = 208
> > > status[4] = 208
> > > status[5] = 208
> > > status[6] = 208
> > > status[7] = 208
> > >
> > > And with just "nodes[7] = 0;", the first run-length of pages gets assigned
> > > correctly:
> > > $ ./move_pages_bug
> > > move_pages: 0
> > > status[0] = 1
> > > status[1] = 1
> > > status[2] = 1
> > > status[3] = 1
> > > status[4] = 1
> > > status[5] = 1
> > > status[6] = 1
> > > status[7] = 0
> > >
> > >
> > > Thank you,
> > >


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: bug: move_pages(2) does not udpate "status" if no pages are moved
  2019-12-04 20:04   ` Yang Shi
@ 2019-12-04 23:45     ` John Hubbard
  2019-12-05  0:43       ` Yang Shi
  0 siblings, 1 reply; 10+ messages in thread
From: John Hubbard @ 2019-12-04 23:45 UTC (permalink / raw)
  To: Yang Shi; +Cc: Felix Abecassis, Linux MM, Andrew Morton

On 12/4/19 12:04 PM, Yang Shi wrote:
> On Wed, Dec 4, 2019 at 11:21 AM John Hubbard <jhubbard@nvidia.com> wrote:
>>
>> On 12/4/19 11:01 AM, Felix Abecassis wrote:
>>> Hello all,
>>>
>>
>> Hi Felix,
>>
>> Thanks for writing up a very clear description of the problem.
>>
>>> On kernel 5.3, when using the move_pages syscall (wrapped by libnuma) and all
>>> pages happen to be on the right node already, this function returns 0 but the
>>> "status" array is not updated. This array potentially contains garbage values
>>> (e.g. from malloc(3)), and I don't see a way to detect this.
>>
>>
>> The way to detect this case would be to zero the array before calling move_pages().
>> Then, if move_pages() returns 0, and the array remains full of zeroes, you can
>> conclude that move_pages() "succeeded", and that there were no errors for any
>> of the pages. So the pages are where you requested them to end up.
> 
> I don't think we can just simply return all zeros here. It looks the
> status should contain error code or the target node id if the page is
> moved to that node successfully. So, if the page is already on the
> requested node, the status should contain the current node id, but the
> current node maybe not 0.
> 
> So, IMHO it sounds like a valid bug.
> 

Yes, you're right, a more precise reading of the man page does support that: 
if move_pages() returns 0, then the status array *must* contain valid node IDs. 
I see. (Felix also mentioned the same thing, in a side note.)

Looking some more at both the man page and Felix's report (and the kernel 
implementation), it seems like there are maybe two bugs here:

1) Not setting the status array in some cases, if some pages were not moved for
"non fatal" reasons, and

2) Returning success if no pages were moved. The "ERRORS" section of the man
page seems to require that ENOENT be returned in that case. Although, you could
perhaps argue that this statement is only unidirectional. In other words, maybe
ENOENT happens, but it doesn't *have* to happen, if all pages were already on the
target node.

ERRORS

ENOENT No pages were found that require moving.  All pages are either already 
       on the target node, not present, had an  invalid  address  or could not
       be moved because they were mapped by multiple processes.


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: bug: move_pages(2) does not udpate "status" if no pages are moved
  2019-12-04 20:17 ` Yang Shi
  2019-12-04 22:54   ` Felix Abecassis
@ 2019-12-05  0:03   ` John Hubbard
  2019-12-05  0:40     ` Yang Shi
  1 sibling, 1 reply; 10+ messages in thread
From: John Hubbard @ 2019-12-05  0:03 UTC (permalink / raw)
  To: Yang Shi, Felix Abecassis; +Cc: Linux MM, Andrew Morton

On 12/4/19 12:17 PM, Yang Shi wrote:
> On Wed, Dec 4, 2019 at 11:01 AM Felix Abecassis <fabecassis@nvidia.com> wrote:
>>
>> Hello all,
>>
>> On kernel 5.3, when using the move_pages syscall (wrapped by libnuma) and all
>> pages happen to be on the right node already, this function returns 0 but the
>> "status" array is not updated. This array potentially contains garbage values
>> (e.g. from malloc(3)), and I don't see a way to detect this.
>>
>> Looking at the kernel code, we are probably exiting do_pages_move here:
>> out_flush:
>>     if (list_empty(&pagelist))
>>         return err;
> 
> May you please give the below patch a try? I just did build test.
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index a8f87cb..f2f1279 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1517,7 +1517,8 @@ static int do_move_pages_to_node(struct mm_struct *mm,
>   * the target node
>   */
>  static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
> - int node, struct list_head *pagelist, bool migrate_all)
> + int node, struct list_head *pagelist, bool migrate_all,
> + int __user *status, int start)
>  {
>   struct vm_area_struct *vma;
>   struct page *page;
> @@ -1543,8 +1544,10 @@ static int add_page_for_migration(struct
> mm_struct *mm, unsigned long addr,
>   goto out;
> 
>   err = 0;
> - if (page_to_nid(page) == node)
> + if (page_to_nid(page) == node) {
> + err = store_status(status, start, node, 1);
>   goto out_putpage;
> + }
> 
>   err = -EACCES;
>   if (page_mapcount(page) > 1 && !migrate_all)
> @@ -1639,7 +1642,9 @@ static int do_pages_move(struct mm_struct *mm,
> nodemask_t task_nodes,
>   * report them via status
>   */
>   err = add_page_for_migration(mm, addr, current_node,
> - &pagelist, flags & MPOL_MF_MOVE_ALL);
> + &pagelist, flags & MPOL_MF_MOVE_ALL, status,
> + i);
> +
>   if (!err)
>   continue;
> 

Hi Yang,

The patch looks correct, and I *think* the following lockdep report
is a pre-existing problem, but it happened with your patch applied to today's
linux.git (commit aedc0650f9135f3b92b39cbed1a8fe98d8088825), using the
unmodified version of Felix's test program:

============================================
WARNING: possible recursive locking detected
5.4.0-hubbard-github+ #552 Not tainted
--------------------------------------------
move_pages_bug/1286 is trying to acquire lock:
ffff8882a365ab18 (&mm->mmap_sem#2){++++}, at: __might_fault+0x3e/0x90

but task is already holding lock:
ffff8882a365ab18 (&mm->mmap_sem#2){++++}, at: do_pages_move+0x129/0x6a0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&mm->mmap_sem#2);
  lock(&mm->mmap_sem#2);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

1 lock held by move_pages_bug/1286:
 #0: ffff8882a365ab18 (&mm->mmap_sem#2){++++}, at: do_pages_move+0x129/0x6a0

stack backtrace:
CPU: 6 PID: 1286 Comm: move_pages_bug Not tainted 5.4.0-hubbard-github+ #552
Hardware name: ASUS X299-A/PRIME X299-A, BIOS 2002 09/25/2019
Call Trace:
 dump_stack+0x71/0xa0
 validate_chain.cold+0x122/0x15f
 ? find_held_lock+0x2b/0x80
 __lock_acquire+0x39c/0x790
 lock_acquire+0x95/0x190
 ? __might_fault+0x3e/0x90
 __might_fault+0x68/0x90
 ? __might_fault+0x3e/0x90
 do_pages_move+0x2c4/0x6a0
 kernel_move_pages+0x1f5/0x3e0
 ? do_syscall_64+0x1c/0x230
 __x64_sys_move_pages+0x25/0x30
 do_syscall_64+0x5a/0x230
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7efd42f581ad
Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f08
RSP: 002b:00007ffffb207c78 EFLAGS: 00000216 ORIG_RAX: 0000000000000117
RAX: ffffffffffffffda RBX: 0000556eb240cd28 RCX: 00007efd42f581ad
RDX: 0000556eb240ccf0 RSI: 0000000000000008 RDI: 0000000000000000
RBP: 00007ffffb207d10 R08: 0000556eb240cd70 R09: 0000000000000002
R10: 0000556eb240cd40 R11: 0000000000000216 R12: 0000556eb04b70a0
R13: 00007ffffb207df0 R14: 0000000000000000 R15: 0000000000000000


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: bug: move_pages(2) does not udpate "status" if no pages are moved
  2019-12-05  0:03   ` John Hubbard
@ 2019-12-05  0:40     ` Yang Shi
  0 siblings, 0 replies; 10+ messages in thread
From: Yang Shi @ 2019-12-05  0:40 UTC (permalink / raw)
  To: John Hubbard; +Cc: Felix Abecassis, Linux MM, Andrew Morton

On Wed, Dec 4, 2019 at 4:03 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 12/4/19 12:17 PM, Yang Shi wrote:
> > On Wed, Dec 4, 2019 at 11:01 AM Felix Abecassis <fabecassis@nvidia.com> wrote:
> >>
> >> Hello all,
> >>
> >> On kernel 5.3, when using the move_pages syscall (wrapped by libnuma) and all
> >> pages happen to be on the right node already, this function returns 0 but the
> >> "status" array is not updated. This array potentially contains garbage values
> >> (e.g. from malloc(3)), and I don't see a way to detect this.
> >>
> >> Looking at the kernel code, we are probably exiting do_pages_move here:
> >> out_flush:
> >>     if (list_empty(&pagelist))
> >>         return err;
> >
> > May you please give the below patch a try? I just did build test.
> >
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index a8f87cb..f2f1279 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1517,7 +1517,8 @@ static int do_move_pages_to_node(struct mm_struct *mm,
> >   * the target node
> >   */
> >  static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
> > - int node, struct list_head *pagelist, bool migrate_all)
> > + int node, struct list_head *pagelist, bool migrate_all,
> > + int __user *status, int start)
> >  {
> >   struct vm_area_struct *vma;
> >   struct page *page;
> > @@ -1543,8 +1544,10 @@ static int add_page_for_migration(struct
> > mm_struct *mm, unsigned long addr,
> >   goto out;
> >
> >   err = 0;
> > - if (page_to_nid(page) == node)
> > + if (page_to_nid(page) == node) {
> > + err = store_status(status, start, node, 1);
> >   goto out_putpage;
> > + }
> >
> >   err = -EACCES;
> >   if (page_mapcount(page) > 1 && !migrate_all)
> > @@ -1639,7 +1642,9 @@ static int do_pages_move(struct mm_struct *mm,
> > nodemask_t task_nodes,
> >   * report them via status
> >   */
> >   err = add_page_for_migration(mm, addr, current_node,
> > - &pagelist, flags & MPOL_MF_MOVE_ALL);
> > + &pagelist, flags & MPOL_MF_MOVE_ALL, status,
> > + i);
> > +
> >   if (!err)
> >   continue;
> >
>
> Hi Yang,
>
> The patch looks correct, and I *think* the following lockdep report
> is a pre-existing problem, but it happened with your patch applied to today's
> linux.git (commit aedc0650f9135f3b92b39cbed1a8fe98d8088825), using the
> unmodified version of Felix's test program:

Thanks for catching this. It looks it is caused by my patch. The patch
calls store_status() in add_page_for_migration() which holds mmap_sem.
I will take a deeper look.

>
> ============================================
> WARNING: possible recursive locking detected
> 5.4.0-hubbard-github+ #552 Not tainted
> --------------------------------------------
> move_pages_bug/1286 is trying to acquire lock:
> ffff8882a365ab18 (&mm->mmap_sem#2){++++}, at: __might_fault+0x3e/0x90
>
> but task is already holding lock:
> ffff8882a365ab18 (&mm->mmap_sem#2){++++}, at: do_pages_move+0x129/0x6a0
>
> other info that might help us debug this:
>  Possible unsafe locking scenario:
>
>        CPU0
>        ----
>   lock(&mm->mmap_sem#2);
>   lock(&mm->mmap_sem#2);
>
>  *** DEADLOCK ***
>
>  May be due to missing lock nesting notation
>
> 1 lock held by move_pages_bug/1286:
>  #0: ffff8882a365ab18 (&mm->mmap_sem#2){++++}, at: do_pages_move+0x129/0x6a0
>
> stack backtrace:
> CPU: 6 PID: 1286 Comm: move_pages_bug Not tainted 5.4.0-hubbard-github+ #552
> Hardware name: ASUS X299-A/PRIME X299-A, BIOS 2002 09/25/2019
> Call Trace:
>  dump_stack+0x71/0xa0
>  validate_chain.cold+0x122/0x15f
>  ? find_held_lock+0x2b/0x80
>  __lock_acquire+0x39c/0x790
>  lock_acquire+0x95/0x190
>  ? __might_fault+0x3e/0x90
>  __might_fault+0x68/0x90
>  ? __might_fault+0x3e/0x90
>  do_pages_move+0x2c4/0x6a0
>  kernel_move_pages+0x1f5/0x3e0
>  ? do_syscall_64+0x1c/0x230
>  __x64_sys_move_pages+0x25/0x30
>  do_syscall_64+0x5a/0x230
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x7efd42f581ad
> Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f08
> RSP: 002b:00007ffffb207c78 EFLAGS: 00000216 ORIG_RAX: 0000000000000117
> RAX: ffffffffffffffda RBX: 0000556eb240cd28 RCX: 00007efd42f581ad
> RDX: 0000556eb240ccf0 RSI: 0000000000000008 RDI: 0000000000000000
> RBP: 00007ffffb207d10 R08: 0000556eb240cd70 R09: 0000000000000002
> R10: 0000556eb240cd40 R11: 0000000000000216 R12: 0000556eb04b70a0
> R13: 00007ffffb207df0 R14: 0000000000000000 R15: 0000000000000000
>
>
> thanks,
> --
> John Hubbard
> NVIDIA


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: bug: move_pages(2) does not udpate "status" if no pages are moved
  2019-12-04 23:45     ` John Hubbard
@ 2019-12-05  0:43       ` Yang Shi
  0 siblings, 0 replies; 10+ messages in thread
From: Yang Shi @ 2019-12-05  0:43 UTC (permalink / raw)
  To: John Hubbard, Christoph Lameter; +Cc: Felix Abecassis, Linux MM, Andrew Morton

On Wed, Dec 4, 2019 at 3:45 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 12/4/19 12:04 PM, Yang Shi wrote:
> > On Wed, Dec 4, 2019 at 11:21 AM John Hubbard <jhubbard@nvidia.com> wrote:
> >>
> >> On 12/4/19 11:01 AM, Felix Abecassis wrote:
> >>> Hello all,
> >>>
> >>
> >> Hi Felix,
> >>
> >> Thanks for writing up a very clear description of the problem.
> >>
> >>> On kernel 5.3, when using the move_pages syscall (wrapped by libnuma) and all
> >>> pages happen to be on the right node already, this function returns 0 but the
> >>> "status" array is not updated. This array potentially contains garbage values
> >>> (e.g. from malloc(3)), and I don't see a way to detect this.
> >>
> >>
> >> The way to detect this case would be to zero the array before calling move_pages().
> >> Then, if move_pages() returns 0, and the array remains full of zeroes, you can
> >> conclude that move_pages() "succeeded", and that there were no errors for any
> >> of the pages. So the pages are where you requested them to end up.
> >
> > I don't think we can just simply return all zeros here. It looks the
> > status should contain error code or the target node id if the page is
> > moved to that node successfully. So, if the page is already on the
> > requested node, the status should contain the current node id, but the
> > current node maybe not 0.
> >
> > So, IMHO it sounds like a valid bug.
> >
>
> Yes, you're right, a more precise reading of the man page does support that:
> if move_pages() returns 0, then the status array *must* contain valid node IDs.
> I see. (Felix also mentioned the same thing, in a side note.)
>
> Looking some more at both the man page and Felix's report (and the kernel
> implementation), it seems like there are maybe two bugs here:
>
> 1) Not setting the status array in some cases, if some pages were not moved for
> "non fatal" reasons, and
>
> 2) Returning success if no pages were moved. The "ERRORS" section of the man
> page seems to require that ENOENT be returned in that case. Although, you could
> perhaps argue that this statement is only unidirectional. In other words, maybe
> ENOENT happens, but it doesn't *have* to happen, if all pages were already on the
> target node.
>
> ERRORS
>
> ENOENT No pages were found that require moving.  All pages are either already
>        on the target node, not present, had an  invalid  address  or could not
>        be moved because they were mapped by multiple processes.

Thanks for pointing this out. Yes, the man page says so, but the code
doesn't do so at all. I dig into the very first commit
742755a1d8ce2b548428f7aacf1758b4bba50080 ("[PATCH] page migration:
sys_move_pages(): support moving of individual pages"), however, it
didn't return -ENOENT if the pages are already on the target node if I
read the code correctly.

Added Chris in this thread who is the original author of this API.


>
>
> thanks,
> --
> John Hubbard
> NVIDIA


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-12-05  0:43 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-04 19:01 bug: move_pages(2) does not udpate "status" if no pages are moved Felix Abecassis
2019-12-04 19:21 ` John Hubbard
2019-12-04 20:04   ` Yang Shi
2019-12-04 23:45     ` John Hubbard
2019-12-05  0:43       ` Yang Shi
2019-12-04 20:17 ` Yang Shi
2019-12-04 22:54   ` Felix Abecassis
2019-12-04 23:37     ` Yang Shi
2019-12-05  0:03   ` John Hubbard
2019-12-05  0:40     ` Yang Shi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.