* bug: move_pages(2) does not udpate "status" if no pages are moved @ 2019-12-04 19:01 Felix Abecassis 2019-12-04 19:21 ` John Hubbard 2019-12-04 20:17 ` Yang Shi 0 siblings, 2 replies; 10+ messages in thread From: Felix Abecassis @ 2019-12-04 19:01 UTC (permalink / raw) To: linux-mm; +Cc: Andrew Morton Hello all, On kernel 5.3, when using the move_pages syscall (wrapped by libnuma) and all pages happen to be on the right node already, this function returns 0 but the "status" array is not updated. This array potentially contains garbage values (e.g. from malloc(3)), and I don't see a way to detect this. Looking at the kernel code, we are probably exiting do_pages_move here: out_flush: if (list_empty(&pagelist)) return err; Here is a sample C program to reproduce the problem: /* $ gcc move_pages_bug.c -lnuma -o move_pages_bug */ #define _DEFAULT_SOURCE #include <sys/mman.h> #include <numaif.h> #include <stdlib.h> #include <stdio.h> #include <unistd.h> int main(void) { const long node_id = 1; const long page_size = sysconf(_SC_PAGESIZE); const int64_t num_pages = 8; unsigned long nodemask = 1 << node_id; long ret = set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask)); if (ret < 0) return (EXIT_FAILURE); void **pages = malloc(sizeof(void*) * num_pages); for (int i = 0; i < num_pages; ++i) { pages[i] = mmap(NULL, page_size, PROT_WRITE | PROT_READ, MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS, -1, 0); if (pages[i] == MAP_FAILED) return (EXIT_FAILURE); } ret = set_mempolicy(MPOL_DEFAULT, NULL, 0); if (ret < 0) return (EXIT_FAILURE); int *nodes = malloc(sizeof(int) * num_pages); int *status = malloc(sizeof(int) * num_pages); for (int i = 0; i < num_pages; ++i) { nodes[i] = node_id; status[i] = 0xd0; /* simulate garbage values */ } ret = move_pages(0, num_pages, pages, nodes, status, MPOL_MF_MOVE); printf("move_pages: %ld\n", ret); for (int i = 0; i < num_pages; ++i) printf("status[%d] = %d\n", i, status[i]); } And here is a sample output showing the "garbage" values: $ ./move_pages_bug move_pages: 0 status[0] = 208 status[1] = 208 status[2] = 208 status[3] = 208 status[4] = 208 status[5] = 208 status[6] = 208 status[7] = 208 Note that passing NULL as the "nodes" argument works as expected here. Also, it seems that it's the last "run-length" of pages on the right node(s) that triggers this problem, e.g. if I add "nodes[0] = nodes[1] = 0", then the output becomes: $ ./move_pages_bug move_pages: 0 status[0] = 0 status[1] = 0 status[2] = 208 status[3] = 208 status[4] = 208 status[5] = 208 status[6] = 208 status[7] = 208 And with just "nodes[7] = 0;", the first run-length of pages gets assigned correctly: $ ./move_pages_bug move_pages: 0 status[0] = 1 status[1] = 1 status[2] = 1 status[3] = 1 status[4] = 1 status[5] = 1 status[6] = 1 status[7] = 0 Thank you, ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: bug: move_pages(2) does not udpate "status" if no pages are moved 2019-12-04 19:01 bug: move_pages(2) does not udpate "status" if no pages are moved Felix Abecassis @ 2019-12-04 19:21 ` John Hubbard 2019-12-04 20:04 ` Yang Shi 2019-12-04 20:17 ` Yang Shi 1 sibling, 1 reply; 10+ messages in thread From: John Hubbard @ 2019-12-04 19:21 UTC (permalink / raw) To: Felix Abecassis, linux-mm; +Cc: Andrew Morton On 12/4/19 11:01 AM, Felix Abecassis wrote: > Hello all, > Hi Felix, Thanks for writing up a very clear description of the problem. > On kernel 5.3, when using the move_pages syscall (wrapped by libnuma) and all > pages happen to be on the right node already, this function returns 0 but the > "status" array is not updated. This array potentially contains garbage values > (e.g. from malloc(3)), and I don't see a way to detect this. The way to detect this case would be to zero the array before calling move_pages(). Then, if move_pages() returns 0, and the array remains full of zeroes, you can conclude that move_pages() "succeeded", and that there were no errors for any of the pages. So the pages are where you requested them to end up. > > Looking at the kernel code, we are probably exiting do_pages_move here: > out_flush: > if (list_empty(&pagelist)) > return err; I looked at that code and the surrounding function, and it's been pretty much unchanged for quite a while. The above was last touched in April, 2018, for example. Yes, we could change the kernel code to fill in the array with zeroes in that situation, but the man page doesn't actually cover this case at all. We'd have to also change the man page, to say something like, "if pages were not moved because they were already in the requested location, then the status array will contain <SOME_VALUE> for such pages". In other words, the kernel matches the requirements (the man page) as it stands today, at least as I'm reading it. And given that one can already figure all this out with the existing kernel and libnuma behavior, I'm guessing that the linux-mm folks will not see any reason to make such a change--but maybe I'm guessing wrong. Anyone on CC want to weigh in there? thanks, -- John Hubbard NVIDIA ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: bug: move_pages(2) does not udpate "status" if no pages are moved 2019-12-04 19:21 ` John Hubbard @ 2019-12-04 20:04 ` Yang Shi 2019-12-04 23:45 ` John Hubbard 0 siblings, 1 reply; 10+ messages in thread From: Yang Shi @ 2019-12-04 20:04 UTC (permalink / raw) To: John Hubbard; +Cc: Felix Abecassis, Linux MM, Andrew Morton On Wed, Dec 4, 2019 at 11:21 AM John Hubbard <jhubbard@nvidia.com> wrote: > > On 12/4/19 11:01 AM, Felix Abecassis wrote: > > Hello all, > > > > Hi Felix, > > Thanks for writing up a very clear description of the problem. > > > On kernel 5.3, when using the move_pages syscall (wrapped by libnuma) and all > > pages happen to be on the right node already, this function returns 0 but the > > "status" array is not updated. This array potentially contains garbage values > > (e.g. from malloc(3)), and I don't see a way to detect this. > > > The way to detect this case would be to zero the array before calling move_pages(). > Then, if move_pages() returns 0, and the array remains full of zeroes, you can > conclude that move_pages() "succeeded", and that there were no errors for any > of the pages. So the pages are where you requested them to end up. I don't think we can just simply return all zeros here. It looks the status should contain error code or the target node id if the page is moved to that node successfully. So, if the page is already on the requested node, the status should contain the current node id, but the current node maybe not 0. So, IMHO it sounds like a valid bug. > > > > > > Looking at the kernel code, we are probably exiting do_pages_move here: > > out_flush: > > if (list_empty(&pagelist)) > > return err; > > > I looked at that code and the surrounding function, and it's been pretty much > unchanged for quite a while. The above was last touched in April, 2018, for > example. > > Yes, we could change the kernel code to fill in the array with zeroes in that > situation, but the man page doesn't actually cover this case at all. We'd have > to also change the man page, to say something like, "if pages were not moved > because they were already in the requested location, then the status array > will contain <SOME_VALUE> for such pages". In other words, the kernel matches > the requirements (the man page) as it stands today, at least as I'm reading it. > > And given that one can already figure all this out with the existing kernel > and libnuma behavior, I'm guessing that the linux-mm folks will not see any > reason to make such a change--but maybe I'm guessing wrong. Anyone on CC want > to weigh in there? > > > thanks, > -- > John Hubbard > NVIDIA > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: bug: move_pages(2) does not udpate "status" if no pages are moved 2019-12-04 20:04 ` Yang Shi @ 2019-12-04 23:45 ` John Hubbard 2019-12-05 0:43 ` Yang Shi 0 siblings, 1 reply; 10+ messages in thread From: John Hubbard @ 2019-12-04 23:45 UTC (permalink / raw) To: Yang Shi; +Cc: Felix Abecassis, Linux MM, Andrew Morton On 12/4/19 12:04 PM, Yang Shi wrote: > On Wed, Dec 4, 2019 at 11:21 AM John Hubbard <jhubbard@nvidia.com> wrote: >> >> On 12/4/19 11:01 AM, Felix Abecassis wrote: >>> Hello all, >>> >> >> Hi Felix, >> >> Thanks for writing up a very clear description of the problem. >> >>> On kernel 5.3, when using the move_pages syscall (wrapped by libnuma) and all >>> pages happen to be on the right node already, this function returns 0 but the >>> "status" array is not updated. This array potentially contains garbage values >>> (e.g. from malloc(3)), and I don't see a way to detect this. >> >> >> The way to detect this case would be to zero the array before calling move_pages(). >> Then, if move_pages() returns 0, and the array remains full of zeroes, you can >> conclude that move_pages() "succeeded", and that there were no errors for any >> of the pages. So the pages are where you requested them to end up. > > I don't think we can just simply return all zeros here. It looks the > status should contain error code or the target node id if the page is > moved to that node successfully. So, if the page is already on the > requested node, the status should contain the current node id, but the > current node maybe not 0. > > So, IMHO it sounds like a valid bug. > Yes, you're right, a more precise reading of the man page does support that: if move_pages() returns 0, then the status array *must* contain valid node IDs. I see. (Felix also mentioned the same thing, in a side note.) Looking some more at both the man page and Felix's report (and the kernel implementation), it seems like there are maybe two bugs here: 1) Not setting the status array in some cases, if some pages were not moved for "non fatal" reasons, and 2) Returning success if no pages were moved. The "ERRORS" section of the man page seems to require that ENOENT be returned in that case. Although, you could perhaps argue that this statement is only unidirectional. In other words, maybe ENOENT happens, but it doesn't *have* to happen, if all pages were already on the target node. ERRORS ENOENT No pages were found that require moving. All pages are either already on the target node, not present, had an invalid address or could not be moved because they were mapped by multiple processes. thanks, -- John Hubbard NVIDIA ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: bug: move_pages(2) does not udpate "status" if no pages are moved 2019-12-04 23:45 ` John Hubbard @ 2019-12-05 0:43 ` Yang Shi 0 siblings, 0 replies; 10+ messages in thread From: Yang Shi @ 2019-12-05 0:43 UTC (permalink / raw) To: John Hubbard, Christoph Lameter; +Cc: Felix Abecassis, Linux MM, Andrew Morton On Wed, Dec 4, 2019 at 3:45 PM John Hubbard <jhubbard@nvidia.com> wrote: > > On 12/4/19 12:04 PM, Yang Shi wrote: > > On Wed, Dec 4, 2019 at 11:21 AM John Hubbard <jhubbard@nvidia.com> wrote: > >> > >> On 12/4/19 11:01 AM, Felix Abecassis wrote: > >>> Hello all, > >>> > >> > >> Hi Felix, > >> > >> Thanks for writing up a very clear description of the problem. > >> > >>> On kernel 5.3, when using the move_pages syscall (wrapped by libnuma) and all > >>> pages happen to be on the right node already, this function returns 0 but the > >>> "status" array is not updated. This array potentially contains garbage values > >>> (e.g. from malloc(3)), and I don't see a way to detect this. > >> > >> > >> The way to detect this case would be to zero the array before calling move_pages(). > >> Then, if move_pages() returns 0, and the array remains full of zeroes, you can > >> conclude that move_pages() "succeeded", and that there were no errors for any > >> of the pages. So the pages are where you requested them to end up. > > > > I don't think we can just simply return all zeros here. It looks the > > status should contain error code or the target node id if the page is > > moved to that node successfully. So, if the page is already on the > > requested node, the status should contain the current node id, but the > > current node maybe not 0. > > > > So, IMHO it sounds like a valid bug. > > > > Yes, you're right, a more precise reading of the man page does support that: > if move_pages() returns 0, then the status array *must* contain valid node IDs. > I see. (Felix also mentioned the same thing, in a side note.) > > Looking some more at both the man page and Felix's report (and the kernel > implementation), it seems like there are maybe two bugs here: > > 1) Not setting the status array in some cases, if some pages were not moved for > "non fatal" reasons, and > > 2) Returning success if no pages were moved. The "ERRORS" section of the man > page seems to require that ENOENT be returned in that case. Although, you could > perhaps argue that this statement is only unidirectional. In other words, maybe > ENOENT happens, but it doesn't *have* to happen, if all pages were already on the > target node. > > ERRORS > > ENOENT No pages were found that require moving. All pages are either already > on the target node, not present, had an invalid address or could not > be moved because they were mapped by multiple processes. Thanks for pointing this out. Yes, the man page says so, but the code doesn't do so at all. I dig into the very first commit 742755a1d8ce2b548428f7aacf1758b4bba50080 ("[PATCH] page migration: sys_move_pages(): support moving of individual pages"), however, it didn't return -ENOENT if the pages are already on the target node if I read the code correctly. Added Chris in this thread who is the original author of this API. > > > thanks, > -- > John Hubbard > NVIDIA ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: bug: move_pages(2) does not udpate "status" if no pages are moved 2019-12-04 19:01 bug: move_pages(2) does not udpate "status" if no pages are moved Felix Abecassis 2019-12-04 19:21 ` John Hubbard @ 2019-12-04 20:17 ` Yang Shi 2019-12-04 22:54 ` Felix Abecassis 2019-12-05 0:03 ` John Hubbard 1 sibling, 2 replies; 10+ messages in thread From: Yang Shi @ 2019-12-04 20:17 UTC (permalink / raw) To: Felix Abecassis; +Cc: Linux MM, Andrew Morton On Wed, Dec 4, 2019 at 11:01 AM Felix Abecassis <fabecassis@nvidia.com> wrote: > > Hello all, > > On kernel 5.3, when using the move_pages syscall (wrapped by libnuma) and all > pages happen to be on the right node already, this function returns 0 but the > "status" array is not updated. This array potentially contains garbage values > (e.g. from malloc(3)), and I don't see a way to detect this. > > Looking at the kernel code, we are probably exiting do_pages_move here: > out_flush: > if (list_empty(&pagelist)) > return err; May you please give the below patch a try? I just did build test. diff --git a/mm/migrate.c b/mm/migrate.c index a8f87cb..f2f1279 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1517,7 +1517,8 @@ static int do_move_pages_to_node(struct mm_struct *mm, * the target node */ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr, - int node, struct list_head *pagelist, bool migrate_all) + int node, struct list_head *pagelist, bool migrate_all, + int __user *status, int start) { struct vm_area_struct *vma; struct page *page; @@ -1543,8 +1544,10 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr, goto out; err = 0; - if (page_to_nid(page) == node) + if (page_to_nid(page) == node) { + err = store_status(status, start, node, 1); goto out_putpage; + } err = -EACCES; if (page_mapcount(page) > 1 && !migrate_all) @@ -1639,7 +1642,9 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes, * report them via status */ err = add_page_for_migration(mm, addr, current_node, - &pagelist, flags & MPOL_MF_MOVE_ALL); + &pagelist, flags & MPOL_MF_MOVE_ALL, status, + i); + if (!err) continue; > > Here is a sample C program to reproduce the problem: > /* $ gcc move_pages_bug.c -lnuma -o move_pages_bug */ > > #define _DEFAULT_SOURCE > #include <sys/mman.h> > > #include <numaif.h> > #include <stdlib.h> > #include <stdio.h> > #include <unistd.h> > > int main(void) > { > const long node_id = 1; > const long page_size = sysconf(_SC_PAGESIZE); > const int64_t num_pages = 8; > > unsigned long nodemask = 1 << node_id; > long ret = set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask)); > if (ret < 0) > return (EXIT_FAILURE); > > void **pages = malloc(sizeof(void*) * num_pages); > for (int i = 0; i < num_pages; ++i) { > pages[i] = mmap(NULL, page_size, PROT_WRITE | PROT_READ, > MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS, > -1, 0); > if (pages[i] == MAP_FAILED) > return (EXIT_FAILURE); > } > > ret = set_mempolicy(MPOL_DEFAULT, NULL, 0); > if (ret < 0) > return (EXIT_FAILURE); > > int *nodes = malloc(sizeof(int) * num_pages); > int *status = malloc(sizeof(int) * num_pages); > for (int i = 0; i < num_pages; ++i) { > nodes[i] = node_id; > status[i] = 0xd0; /* simulate garbage values */ > } > > ret = move_pages(0, num_pages, pages, nodes, status, MPOL_MF_MOVE); > printf("move_pages: %ld\n", ret); > for (int i = 0; i < num_pages; ++i) > printf("status[%d] = %d\n", i, status[i]); > } > > And here is a sample output showing the "garbage" values: > $ ./move_pages_bug > move_pages: 0 > status[0] = 208 > status[1] = 208 > status[2] = 208 > status[3] = 208 > status[4] = 208 > status[5] = 208 > status[6] = 208 > status[7] = 208 > > Note that passing NULL as the "nodes" argument works as expected here. > > Also, it seems that it's the last "run-length" of pages on the right node(s) > that triggers this problem, e.g. if I add "nodes[0] = nodes[1] = 0", then the > output becomes: > $ ./move_pages_bug > move_pages: 0 > status[0] = 0 > status[1] = 0 > status[2] = 208 > status[3] = 208 > status[4] = 208 > status[5] = 208 > status[6] = 208 > status[7] = 208 > > And with just "nodes[7] = 0;", the first run-length of pages gets assigned > correctly: > $ ./move_pages_bug > move_pages: 0 > status[0] = 1 > status[1] = 1 > status[2] = 1 > status[3] = 1 > status[4] = 1 > status[5] = 1 > status[6] = 1 > status[7] = 0 > > > Thank you, > ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: bug: move_pages(2) does not udpate "status" if no pages are moved 2019-12-04 20:17 ` Yang Shi @ 2019-12-04 22:54 ` Felix Abecassis 2019-12-04 23:37 ` Yang Shi 2019-12-05 0:03 ` John Hubbard 1 sibling, 1 reply; 10+ messages in thread From: Felix Abecassis @ 2019-12-04 22:54 UTC (permalink / raw) To: Yang Shi; +Cc: Linux MM, Andrew Morton On Wed, Dec 04, 2019 at 12:17:59PM -0800, Yang Shi wrote: > On Wed, Dec 4, 2019 at 11:01 AM Felix Abecassis <fabecassis@nvidia.com> wrote: > > > > Hello all, > > > > On kernel 5.3, when using the move_pages syscall (wrapped by libnuma) and all > > pages happen to be on the right node already, this function returns 0 but the > > "status" array is not updated. This array potentially contains garbage values > > (e.g. from malloc(3)), and I don't see a way to detect this. > > > > Looking at the kernel code, we are probably exiting do_pages_move here: > > out_flush: > > if (list_empty(&pagelist)) > > return err; > > May you please give the below patch a try? I just did build test. > > diff --git a/mm/migrate.c b/mm/migrate.c > index a8f87cb..f2f1279 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -1517,7 +1517,8 @@ static int do_move_pages_to_node(struct mm_struct *mm, > * the target node > */ > static int add_page_for_migration(struct mm_struct *mm, unsigned long addr, > - int node, struct list_head *pagelist, bool migrate_all) > + int node, struct list_head *pagelist, bool migrate_all, > + int __user *status, int start) > { > struct vm_area_struct *vma; > struct page *page; > @@ -1543,8 +1544,10 @@ static int add_page_for_migration(struct > mm_struct *mm, unsigned long addr, > goto out; > > err = 0; > - if (page_to_nid(page) == node) > + if (page_to_nid(page) == node) { > + err = store_status(status, start, node, 1); > goto out_putpage; > + } > > err = -EACCES; > if (page_mapcount(page) > 1 && !migrate_all) > @@ -1639,7 +1642,9 @@ static int do_pages_move(struct mm_struct *mm, > nodemask_t task_nodes, > * report them via status > */ > err = add_page_for_migration(mm, addr, current_node, > - &pagelist, flags & MPOL_MF_MOVE_ALL); > + &pagelist, flags & MPOL_MF_MOVE_ALL, status, > + i); > + > if (!err) > continue; > Thanks! I have tested this patch on top of 5.4-rc8, and I can comfirm it fixes this particular test case. > > > > Here is a sample C program to reproduce the problem: > > /* $ gcc move_pages_bug.c -lnuma -o move_pages_bug */ > > > > #define _DEFAULT_SOURCE > > #include <sys/mman.h> > > > > #include <numaif.h> > > #include <stdlib.h> > > #include <stdio.h> > > #include <unistd.h> > > > > int main(void) > > { > > const long node_id = 1; > > const long page_size = sysconf(_SC_PAGESIZE); > > const int64_t num_pages = 8; > > > > unsigned long nodemask = 1 << node_id; > > long ret = set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask)); > > if (ret < 0) > > return (EXIT_FAILURE); > > > > void **pages = malloc(sizeof(void*) * num_pages); > > for (int i = 0; i < num_pages; ++i) { > > pages[i] = mmap(NULL, page_size, PROT_WRITE | PROT_READ, > > MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS, > > -1, 0); > > if (pages[i] == MAP_FAILED) > > return (EXIT_FAILURE); > > } > > > > ret = set_mempolicy(MPOL_DEFAULT, NULL, 0); > > if (ret < 0) > > return (EXIT_FAILURE); > > > > int *nodes = malloc(sizeof(int) * num_pages); > > int *status = malloc(sizeof(int) * num_pages); > > for (int i = 0; i < num_pages; ++i) { > > nodes[i] = node_id; > > status[i] = 0xd0; /* simulate garbage values */ > > } > > > > ret = move_pages(0, num_pages, pages, nodes, status, MPOL_MF_MOVE); > > printf("move_pages: %ld\n", ret); > > for (int i = 0; i < num_pages; ++i) > > printf("status[%d] = %d\n", i, status[i]); > > } > > > > And here is a sample output showing the "garbage" values: > > $ ./move_pages_bug > > move_pages: 0 > > status[0] = 208 > > status[1] = 208 > > status[2] = 208 > > status[3] = 208 > > status[4] = 208 > > status[5] = 208 > > status[6] = 208 > > status[7] = 208 > > > > Note that passing NULL as the "nodes" argument works as expected here. > > > > Also, it seems that it's the last "run-length" of pages on the right node(s) > > that triggers this problem, e.g. if I add "nodes[0] = nodes[1] = 0", then the > > output becomes: > > $ ./move_pages_bug > > move_pages: 0 > > status[0] = 0 > > status[1] = 0 > > status[2] = 208 > > status[3] = 208 > > status[4] = 208 > > status[5] = 208 > > status[6] = 208 > > status[7] = 208 > > > > And with just "nodes[7] = 0;", the first run-length of pages gets assigned > > correctly: > > $ ./move_pages_bug > > move_pages: 0 > > status[0] = 1 > > status[1] = 1 > > status[2] = 1 > > status[3] = 1 > > status[4] = 1 > > status[5] = 1 > > status[6] = 1 > > status[7] = 0 > > > > > > Thank you, > > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: bug: move_pages(2) does not udpate "status" if no pages are moved 2019-12-04 22:54 ` Felix Abecassis @ 2019-12-04 23:37 ` Yang Shi 0 siblings, 0 replies; 10+ messages in thread From: Yang Shi @ 2019-12-04 23:37 UTC (permalink / raw) To: Felix Abecassis; +Cc: Linux MM, Andrew Morton On Wed, Dec 4, 2019 at 2:54 PM Felix Abecassis <fabecassis@nvidia.com> wrote: > > On Wed, Dec 04, 2019 at 12:17:59PM -0800, Yang Shi wrote: > > On Wed, Dec 4, 2019 at 11:01 AM Felix Abecassis <fabecassis@nvidia.com> wrote: > > > > > > Hello all, > > > > > > On kernel 5.3, when using the move_pages syscall (wrapped by libnuma) and all > > > pages happen to be on the right node already, this function returns 0 but the > > > "status" array is not updated. This array potentially contains garbage values > > > (e.g. from malloc(3)), and I don't see a way to detect this. > > > > > > Looking at the kernel code, we are probably exiting do_pages_move here: > > > out_flush: > > > if (list_empty(&pagelist)) > > > return err; > > > > May you please give the below patch a try? I just did build test. > > > > diff --git a/mm/migrate.c b/mm/migrate.c > > index a8f87cb..f2f1279 100644 > > --- a/mm/migrate.c > > +++ b/mm/migrate.c > > @@ -1517,7 +1517,8 @@ static int do_move_pages_to_node(struct mm_struct *mm, > > * the target node > > */ > > static int add_page_for_migration(struct mm_struct *mm, unsigned long addr, > > - int node, struct list_head *pagelist, bool migrate_all) > > + int node, struct list_head *pagelist, bool migrate_all, > > + int __user *status, int start) > > { > > struct vm_area_struct *vma; > > struct page *page; > > @@ -1543,8 +1544,10 @@ static int add_page_for_migration(struct > > mm_struct *mm, unsigned long addr, > > goto out; > > > > err = 0; > > - if (page_to_nid(page) == node) > > + if (page_to_nid(page) == node) { > > + err = store_status(status, start, node, 1); > > goto out_putpage; > > + } > > > > err = -EACCES; > > if (page_mapcount(page) > 1 && !migrate_all) > > @@ -1639,7 +1642,9 @@ static int do_pages_move(struct mm_struct *mm, > > nodemask_t task_nodes, > > * report them via status > > */ > > err = add_page_for_migration(mm, addr, current_node, > > - &pagelist, flags & MPOL_MF_MOVE_ALL); > > + &pagelist, flags & MPOL_MF_MOVE_ALL, status, > > + i); > > + > > if (!err) > > continue; > > > > Thanks! I have tested this patch on top of 5.4-rc8, and I can comfirm it fixes > this particular test case. Thanks for confirming this. I will send out formal patch soon. > > > > > > > Here is a sample C program to reproduce the problem: > > > /* $ gcc move_pages_bug.c -lnuma -o move_pages_bug */ > > > > > > #define _DEFAULT_SOURCE > > > #include <sys/mman.h> > > > > > > #include <numaif.h> > > > #include <stdlib.h> > > > #include <stdio.h> > > > #include <unistd.h> > > > > > > int main(void) > > > { > > > const long node_id = 1; > > > const long page_size = sysconf(_SC_PAGESIZE); > > > const int64_t num_pages = 8; > > > > > > unsigned long nodemask = 1 << node_id; > > > long ret = set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask)); > > > if (ret < 0) > > > return (EXIT_FAILURE); > > > > > > void **pages = malloc(sizeof(void*) * num_pages); > > > for (int i = 0; i < num_pages; ++i) { > > > pages[i] = mmap(NULL, page_size, PROT_WRITE | PROT_READ, > > > MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS, > > > -1, 0); > > > if (pages[i] == MAP_FAILED) > > > return (EXIT_FAILURE); > > > } > > > > > > ret = set_mempolicy(MPOL_DEFAULT, NULL, 0); > > > if (ret < 0) > > > return (EXIT_FAILURE); > > > > > > int *nodes = malloc(sizeof(int) * num_pages); > > > int *status = malloc(sizeof(int) * num_pages); > > > for (int i = 0; i < num_pages; ++i) { > > > nodes[i] = node_id; > > > status[i] = 0xd0; /* simulate garbage values */ > > > } > > > > > > ret = move_pages(0, num_pages, pages, nodes, status, MPOL_MF_MOVE); > > > printf("move_pages: %ld\n", ret); > > > for (int i = 0; i < num_pages; ++i) > > > printf("status[%d] = %d\n", i, status[i]); > > > } > > > > > > And here is a sample output showing the "garbage" values: > > > $ ./move_pages_bug > > > move_pages: 0 > > > status[0] = 208 > > > status[1] = 208 > > > status[2] = 208 > > > status[3] = 208 > > > status[4] = 208 > > > status[5] = 208 > > > status[6] = 208 > > > status[7] = 208 > > > > > > Note that passing NULL as the "nodes" argument works as expected here. > > > > > > Also, it seems that it's the last "run-length" of pages on the right node(s) > > > that triggers this problem, e.g. if I add "nodes[0] = nodes[1] = 0", then the > > > output becomes: > > > $ ./move_pages_bug > > > move_pages: 0 > > > status[0] = 0 > > > status[1] = 0 > > > status[2] = 208 > > > status[3] = 208 > > > status[4] = 208 > > > status[5] = 208 > > > status[6] = 208 > > > status[7] = 208 > > > > > > And with just "nodes[7] = 0;", the first run-length of pages gets assigned > > > correctly: > > > $ ./move_pages_bug > > > move_pages: 0 > > > status[0] = 1 > > > status[1] = 1 > > > status[2] = 1 > > > status[3] = 1 > > > status[4] = 1 > > > status[5] = 1 > > > status[6] = 1 > > > status[7] = 0 > > > > > > > > > Thank you, > > > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: bug: move_pages(2) does not udpate "status" if no pages are moved 2019-12-04 20:17 ` Yang Shi 2019-12-04 22:54 ` Felix Abecassis @ 2019-12-05 0:03 ` John Hubbard 2019-12-05 0:40 ` Yang Shi 1 sibling, 1 reply; 10+ messages in thread From: John Hubbard @ 2019-12-05 0:03 UTC (permalink / raw) To: Yang Shi, Felix Abecassis; +Cc: Linux MM, Andrew Morton On 12/4/19 12:17 PM, Yang Shi wrote: > On Wed, Dec 4, 2019 at 11:01 AM Felix Abecassis <fabecassis@nvidia.com> wrote: >> >> Hello all, >> >> On kernel 5.3, when using the move_pages syscall (wrapped by libnuma) and all >> pages happen to be on the right node already, this function returns 0 but the >> "status" array is not updated. This array potentially contains garbage values >> (e.g. from malloc(3)), and I don't see a way to detect this. >> >> Looking at the kernel code, we are probably exiting do_pages_move here: >> out_flush: >> if (list_empty(&pagelist)) >> return err; > > May you please give the below patch a try? I just did build test. > > diff --git a/mm/migrate.c b/mm/migrate.c > index a8f87cb..f2f1279 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -1517,7 +1517,8 @@ static int do_move_pages_to_node(struct mm_struct *mm, > * the target node > */ > static int add_page_for_migration(struct mm_struct *mm, unsigned long addr, > - int node, struct list_head *pagelist, bool migrate_all) > + int node, struct list_head *pagelist, bool migrate_all, > + int __user *status, int start) > { > struct vm_area_struct *vma; > struct page *page; > @@ -1543,8 +1544,10 @@ static int add_page_for_migration(struct > mm_struct *mm, unsigned long addr, > goto out; > > err = 0; > - if (page_to_nid(page) == node) > + if (page_to_nid(page) == node) { > + err = store_status(status, start, node, 1); > goto out_putpage; > + } > > err = -EACCES; > if (page_mapcount(page) > 1 && !migrate_all) > @@ -1639,7 +1642,9 @@ static int do_pages_move(struct mm_struct *mm, > nodemask_t task_nodes, > * report them via status > */ > err = add_page_for_migration(mm, addr, current_node, > - &pagelist, flags & MPOL_MF_MOVE_ALL); > + &pagelist, flags & MPOL_MF_MOVE_ALL, status, > + i); > + > if (!err) > continue; > Hi Yang, The patch looks correct, and I *think* the following lockdep report is a pre-existing problem, but it happened with your patch applied to today's linux.git (commit aedc0650f9135f3b92b39cbed1a8fe98d8088825), using the unmodified version of Felix's test program: ============================================ WARNING: possible recursive locking detected 5.4.0-hubbard-github+ #552 Not tainted -------------------------------------------- move_pages_bug/1286 is trying to acquire lock: ffff8882a365ab18 (&mm->mmap_sem#2){++++}, at: __might_fault+0x3e/0x90 but task is already holding lock: ffff8882a365ab18 (&mm->mmap_sem#2){++++}, at: do_pages_move+0x129/0x6a0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&mm->mmap_sem#2); lock(&mm->mmap_sem#2); *** DEADLOCK *** May be due to missing lock nesting notation 1 lock held by move_pages_bug/1286: #0: ffff8882a365ab18 (&mm->mmap_sem#2){++++}, at: do_pages_move+0x129/0x6a0 stack backtrace: CPU: 6 PID: 1286 Comm: move_pages_bug Not tainted 5.4.0-hubbard-github+ #552 Hardware name: ASUS X299-A/PRIME X299-A, BIOS 2002 09/25/2019 Call Trace: dump_stack+0x71/0xa0 validate_chain.cold+0x122/0x15f ? find_held_lock+0x2b/0x80 __lock_acquire+0x39c/0x790 lock_acquire+0x95/0x190 ? __might_fault+0x3e/0x90 __might_fault+0x68/0x90 ? __might_fault+0x3e/0x90 do_pages_move+0x2c4/0x6a0 kernel_move_pages+0x1f5/0x3e0 ? do_syscall_64+0x1c/0x230 __x64_sys_move_pages+0x25/0x30 do_syscall_64+0x5a/0x230 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7efd42f581ad Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f08 RSP: 002b:00007ffffb207c78 EFLAGS: 00000216 ORIG_RAX: 0000000000000117 RAX: ffffffffffffffda RBX: 0000556eb240cd28 RCX: 00007efd42f581ad RDX: 0000556eb240ccf0 RSI: 0000000000000008 RDI: 0000000000000000 RBP: 00007ffffb207d10 R08: 0000556eb240cd70 R09: 0000000000000002 R10: 0000556eb240cd40 R11: 0000000000000216 R12: 0000556eb04b70a0 R13: 00007ffffb207df0 R14: 0000000000000000 R15: 0000000000000000 thanks, -- John Hubbard NVIDIA ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: bug: move_pages(2) does not udpate "status" if no pages are moved 2019-12-05 0:03 ` John Hubbard @ 2019-12-05 0:40 ` Yang Shi 0 siblings, 0 replies; 10+ messages in thread From: Yang Shi @ 2019-12-05 0:40 UTC (permalink / raw) To: John Hubbard; +Cc: Felix Abecassis, Linux MM, Andrew Morton On Wed, Dec 4, 2019 at 4:03 PM John Hubbard <jhubbard@nvidia.com> wrote: > > On 12/4/19 12:17 PM, Yang Shi wrote: > > On Wed, Dec 4, 2019 at 11:01 AM Felix Abecassis <fabecassis@nvidia.com> wrote: > >> > >> Hello all, > >> > >> On kernel 5.3, when using the move_pages syscall (wrapped by libnuma) and all > >> pages happen to be on the right node already, this function returns 0 but the > >> "status" array is not updated. This array potentially contains garbage values > >> (e.g. from malloc(3)), and I don't see a way to detect this. > >> > >> Looking at the kernel code, we are probably exiting do_pages_move here: > >> out_flush: > >> if (list_empty(&pagelist)) > >> return err; > > > > May you please give the below patch a try? I just did build test. > > > > diff --git a/mm/migrate.c b/mm/migrate.c > > index a8f87cb..f2f1279 100644 > > --- a/mm/migrate.c > > +++ b/mm/migrate.c > > @@ -1517,7 +1517,8 @@ static int do_move_pages_to_node(struct mm_struct *mm, > > * the target node > > */ > > static int add_page_for_migration(struct mm_struct *mm, unsigned long addr, > > - int node, struct list_head *pagelist, bool migrate_all) > > + int node, struct list_head *pagelist, bool migrate_all, > > + int __user *status, int start) > > { > > struct vm_area_struct *vma; > > struct page *page; > > @@ -1543,8 +1544,10 @@ static int add_page_for_migration(struct > > mm_struct *mm, unsigned long addr, > > goto out; > > > > err = 0; > > - if (page_to_nid(page) == node) > > + if (page_to_nid(page) == node) { > > + err = store_status(status, start, node, 1); > > goto out_putpage; > > + } > > > > err = -EACCES; > > if (page_mapcount(page) > 1 && !migrate_all) > > @@ -1639,7 +1642,9 @@ static int do_pages_move(struct mm_struct *mm, > > nodemask_t task_nodes, > > * report them via status > > */ > > err = add_page_for_migration(mm, addr, current_node, > > - &pagelist, flags & MPOL_MF_MOVE_ALL); > > + &pagelist, flags & MPOL_MF_MOVE_ALL, status, > > + i); > > + > > if (!err) > > continue; > > > > Hi Yang, > > The patch looks correct, and I *think* the following lockdep report > is a pre-existing problem, but it happened with your patch applied to today's > linux.git (commit aedc0650f9135f3b92b39cbed1a8fe98d8088825), using the > unmodified version of Felix's test program: Thanks for catching this. It looks it is caused by my patch. The patch calls store_status() in add_page_for_migration() which holds mmap_sem. I will take a deeper look. > > ============================================ > WARNING: possible recursive locking detected > 5.4.0-hubbard-github+ #552 Not tainted > -------------------------------------------- > move_pages_bug/1286 is trying to acquire lock: > ffff8882a365ab18 (&mm->mmap_sem#2){++++}, at: __might_fault+0x3e/0x90 > > but task is already holding lock: > ffff8882a365ab18 (&mm->mmap_sem#2){++++}, at: do_pages_move+0x129/0x6a0 > > other info that might help us debug this: > Possible unsafe locking scenario: > > CPU0 > ---- > lock(&mm->mmap_sem#2); > lock(&mm->mmap_sem#2); > > *** DEADLOCK *** > > May be due to missing lock nesting notation > > 1 lock held by move_pages_bug/1286: > #0: ffff8882a365ab18 (&mm->mmap_sem#2){++++}, at: do_pages_move+0x129/0x6a0 > > stack backtrace: > CPU: 6 PID: 1286 Comm: move_pages_bug Not tainted 5.4.0-hubbard-github+ #552 > Hardware name: ASUS X299-A/PRIME X299-A, BIOS 2002 09/25/2019 > Call Trace: > dump_stack+0x71/0xa0 > validate_chain.cold+0x122/0x15f > ? find_held_lock+0x2b/0x80 > __lock_acquire+0x39c/0x790 > lock_acquire+0x95/0x190 > ? __might_fault+0x3e/0x90 > __might_fault+0x68/0x90 > ? __might_fault+0x3e/0x90 > do_pages_move+0x2c4/0x6a0 > kernel_move_pages+0x1f5/0x3e0 > ? do_syscall_64+0x1c/0x230 > __x64_sys_move_pages+0x25/0x30 > do_syscall_64+0x5a/0x230 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > RIP: 0033:0x7efd42f581ad > Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f08 > RSP: 002b:00007ffffb207c78 EFLAGS: 00000216 ORIG_RAX: 0000000000000117 > RAX: ffffffffffffffda RBX: 0000556eb240cd28 RCX: 00007efd42f581ad > RDX: 0000556eb240ccf0 RSI: 0000000000000008 RDI: 0000000000000000 > RBP: 00007ffffb207d10 R08: 0000556eb240cd70 R09: 0000000000000002 > R10: 0000556eb240cd40 R11: 0000000000000216 R12: 0000556eb04b70a0 > R13: 00007ffffb207df0 R14: 0000000000000000 R15: 0000000000000000 > > > thanks, > -- > John Hubbard > NVIDIA ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2019-12-05 0:43 UTC | newest] Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-12-04 19:01 bug: move_pages(2) does not udpate "status" if no pages are moved Felix Abecassis 2019-12-04 19:21 ` John Hubbard 2019-12-04 20:04 ` Yang Shi 2019-12-04 23:45 ` John Hubbard 2019-12-05 0:43 ` Yang Shi 2019-12-04 20:17 ` Yang Shi 2019-12-04 22:54 ` Felix Abecassis 2019-12-04 23:37 ` Yang Shi 2019-12-05 0:03 ` John Hubbard 2019-12-05 0:40 ` Yang Shi
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.