* hugetlb page migration vs. overcommit

From: Michal Hocko @ 2017-11-22 15:28 UTC
To: linux-mm; +Cc: Mike Kravetz, Naoya Horiguchi, LKML

Hi,
is there any reason why we enforce the overcommit limit during hugetlb
page migration? It is in the alloc_huge_page_node->__alloc_buddy_huge_page
path. I am wondering whether this is really intentional behavior.
The page migration allocates a page only temporarily, so we should be
able to go over the overcommit limit for the duration of the migration.
The reason I am asking is that hugetlb pages tend to be fully utilized
(otherwise the memory would just be wasted and the pool shrunk), and in
that case the migration simply fails, which breaks memory hotplug and
other migration-dependent functionality. That is quite suboptimal. You
can work around it by increasing the overcommit limit.

Why don't we simply migrate as long as we are able to allocate the
target hugetlb page? I have a half-baked patch to remove this
restriction; would there be any opposition to doing something like that?
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body
to majordomo@kvack.org. For more info on Linux MM, see:
http://www.linux-mm.org/ . Don't email: email@kvack.org
* Re: hugetlb page migration vs. overcommit

From: Mike Kravetz @ 2017-11-22 19:11 UTC
To: Michal Hocko, linux-mm; +Cc: Naoya Horiguchi, LKML

On 11/22/2017 07:28 AM, Michal Hocko wrote:
> Hi,
> is there any reason why we enforce the overcommit limit during hugetlb
> pages migration? It's in alloc_huge_page_node->__alloc_buddy_huge_page
> path. I am wondering whether this is really an intentional behavior.

I do not think it was intentional. But, I was not around when that
code was added.

> The page migration allocates a page just temporarily so we should be
> able to go over the overcommit limit for the migration duration. The
> reason I am asking is that hugetlb pages tend to be utilized usually
> (otherwise the memory would be just wasted and pool shrunk) but then
> the migration simply fails which breaks memory hotplug and other
> migration dependent functionality which is quite suboptimal. You can
> workaround that by increasing the overcommit limit.

Yes. In an environment making optimal use of huge pages, you are unlikely
to have 'spare pages' set aside for a potential migration operation. So
I agree that it would make sense to try to allocate overcommit pages for
this purpose.

> Why don't we simply migrate as long as we are able to allocate the
> target hugetlb page? I have a half baked patch to remove this
> restriction, would there be an opposition to do something like that?

I would not be opposed and would help with this effort. My concern would
be any subtle hugetlb accounting issues once you start messing with
additional overcommit pages.

Since Naoya was originally involved in huge page migration, I would
welcome his comments.
-- 
Mike Kravetz
* Re: hugetlb page migration vs. overcommit

From: Michal Hocko @ 2017-11-23 9:21 UTC
To: Mike Kravetz; +Cc: linux-mm, Naoya Horiguchi, LKML

On Wed 22-11-17 11:11:38, Mike Kravetz wrote:
> On 11/22/2017 07:28 AM, Michal Hocko wrote:
[...]
> > Why don't we simply migrate as long as we are able to allocate the
> > target hugetlb page? I have a half baked patch to remove this
> > restriction, would there be an opposition to do something like that?
>
> I would not be opposed and would help with this effort. My concern would
> be any subtle hugetlb accounting issues once you start messing with
> additional overcommit pages.

Well, my current (crude) patch checks for overcommit in the destructor
and releases the page if we are over the limit. That should deal with
the accounting, AFAICS.
-- 
Michal Hocko
SUSE Labs
* Re: hugetlb page migration vs. overcommit

From: Naoya Horiguchi @ 2017-11-27 6:27 UTC
To: Mike Kravetz, Michal Hocko, linux-mm; +Cc: Naoya Horiguchi, LKML

On 11/23/2017 04:11 AM, Mike Kravetz wrote:
> On 11/22/2017 07:28 AM, Michal Hocko wrote:
>> Hi,
>> is there any reason why we enforce the overcommit limit during hugetlb
>> pages migration? It's in alloc_huge_page_node->__alloc_buddy_huge_page
>> path. I am wondering whether this is really an intentional behavior.
>
> I do not think it was intentional. But, I was not around when that
> code was added.
>
>> The page migration allocates a page just temporarily so we should be
>> able to go over the overcommit limit for the migration duration. The
>> reason I am asking is that hugetlb pages tend to be utilized usually
>> (otherwise the memory would be just wasted and pool shrunk) but then
>> the migration simply fails which breaks memory hotplug and other
>> migration dependent functionality which is quite suboptimal. You can
>> workaround that by increasing the overcommit limit.
>
> Yes. In an environment making optimal use of huge pages, you are unlikely
> to have 'spare pages' set aside for a potential migration operation. So
> I agree that it would make sense to try and allocate overcommit pages for
> this purpose.

Thank you for pointing this out, Michal, Mike. Doing overcommit in
hugepage migration seems totally right to me; I just didn't notice it
when I wrote the code.

>> Why don't we simply migrate as long as we are able to allocate the
>> target hugetlb page? I have a half baked patch to remove this
>> restriction, would there be an opposition to do something like that?
>
> I would not be opposed and would help with this effort. My concern would
> be any subtle hugetlb accounting issues once you start messing with
> additional overcommit pages.

Yes, hugetlb accounting always needs care when touching related code.
I can help with testing.

Thanks,
Naoya Horiguchi
* Re: hugetlb page migration vs. overcommit

From: Michal Hocko @ 2017-11-28 10:19 UTC
To: linux-mm; +Cc: Mike Kravetz, Naoya Horiguchi, LKML

On Wed 22-11-17 16:28:32, Michal Hocko wrote:
> Hi,
> is there any reason why we enforce the overcommit limit during hugetlb
> pages migration? It's in alloc_huge_page_node->__alloc_buddy_huge_page
> path. I am wondering whether this is really an intentional behavior.
> The page migration allocates a page just temporarily so we should be
> able to go over the overcommit limit for the migration duration. The
> reason I am asking is that hugetlb pages tend to be utilized usually
> (otherwise the memory would be just wasted and pool shrunk) but then
> the migration simply fails which breaks memory hotplug and other
> migration dependent functionality which is quite suboptimal. You can
> workaround that by increasing the overcommit limit.
>
> Why don't we simply migrate as long as we are able to allocate the
> target hugetlb page? I have a half baked patch to remove this
> restriction, would there be an opposition to do something like that?

So I finally got to think about this some more and looked more
thoroughly at how we actually account things. And it is, as both of
you expected, quite subtle and not easy to get around. The per-NUMA
pools make things quite complicated. Why? Migration can really increase
the overall pool size. Say we are migrating from Node1 to Node2. Node2
doesn't have any pre-allocated pages, but assume that the overcommit
limit allows us to move on. All good. Except that the original page
will return to the pool, because free_huge_page will see Node1 without
any surplus pages and therefore move the page back to the pool. Node2
will release the surplus page only after it is freed, which can take an
unbounded amount of time.

While we are still effectively under the overcommit limit, the semantic
is kind of strange and I am not sure the behavior is really intended.
I see why the per-node surplus counter is used here: we simply want to
maintain per-node counts after a regular page free. So I was thinking
of adding a temporary/migrate state to the huge page for migration
pages (start with the new page, state transferred to the old page on
success) and freeing such a page to the allocator regardless of the
surplus counters.

This would mean that page migration might change inter-node pool sizes,
but I guess that should be acceptable. What do you guys think? I can
send a draft patch if that helps you to understand the idea.
-- 
Michal Hocko
SUSE Labs
* Re: hugetlb page migration vs. overcommit

From: Michal Hocko @ 2017-11-28 14:12 UTC
To: linux-mm; +Cc: Mike Kravetz, Naoya Horiguchi, LKML

On Tue 28-11-17 11:19:07, Michal Hocko wrote:
> On Wed 22-11-17 16:28:32, Michal Hocko wrote:
> > Hi,
> > is there any reason why we enforce the overcommit limit during hugetlb
> > pages migration? It's in alloc_huge_page_node->__alloc_buddy_huge_page
> > path. I am wondering whether this is really an intentional behavior.
> > The page migration allocates a page just temporarily so we should be
> > able to go over the overcommit limit for the migration duration. The
> > reason I am asking is that hugetlb pages tend to be utilized usually
> > (otherwise the memory would be just wasted and pool shrunk) but then
> > the migration simply fails which breaks memory hotplug and other
> > migration dependent functionality which is quite suboptimal. You can
> > workaround that by increasing the overcommit limit.
> >
> > Why don't we simply migrate as long as we are able to allocate the
> > target hugetlb page? I have a half baked patch to remove this
> > restriction, would there be an opposition to do something like that?
>
> So I finally got to think about this some more and looked at how we
> actually account things more thoroughly. And it is, as both of you
> expected, quite subtle and not easy to get around. Per-NUMA pools make
> things quite complicated. Why? Migration can really increase the
> overall pool size. Say we are migrating from Node1 to Node2. Node2
> doesn't have any pre-allocated pages but assume that the overcommit
> allows us to move on. All good. Except that the original page will
> return to the pool because free_huge_page will see Node1 without any
> surplus pages and therefore moves the page back to the pool. Node2
> will release the surplus page only after it is freed which can take an
> unbounded amount of time.
>
> While we are still effectively under the overcommit limit the semantic
> is kind of strange and I am not sure the behavior is really intended.
> I see why the per-node surplus counter is used here. We simply want to
> maintain per-node counts after a regular page free. So I was thinking
> of adding a temporary/migrate state to the huge page for migration
> pages (start with the new page, state transferred to the old page on
> success) and freeing such a page to the allocator regardless of the
> surplus counters.
>
> This would mean that page migration might change inter-node pool sizes
> but I guess that should be acceptable. What do you guys think?
> I can send a draft patch if that helps you to understand the idea.

This is what I have currently and it seems to work (or at least it
doesn't blow up immediately). The first patch is a cleanup and patch 2
implements the temporary page idea. Does this make any sense to you at
all?
-- 
Michal Hocko
* [PATCH RFC 1/2] mm, hugetlb: unify core page allocation accounting and initialization

From: Michal Hocko @ 2017-11-28 14:12 UTC
To: linux-mm; +Cc: Mike Kravetz, Naoya Horiguchi, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

The hugetlb allocator has two entry points to the page allocator:
- alloc_fresh_huge_page_node
- __hugetlb_alloc_buddy_huge_page

The two differ very subtly in two aspects. The first one doesn't care
about HTLB_BUDDY_* stats and it doesn't initialize the huge page.
prep_new_huge_page is not used because it not only initializes hugetlb
specific stuff but also does put_page, which releases the page to the
hugetlb pool; that is not what is required in some contexts. This makes
things more complicated than necessary.

Simplify things by a) removing the duplicated page allocator entry
point and keeping only __hugetlb_alloc_buddy_huge_page, and b) making
prep_new_huge_page more reusable by removing the put_page which moves
the page to the allocator pool. All current callers are updated to call
put_page explicitly. Later patches will add new callers which won't
need it.

This patch shouldn't introduce any functional change.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/hugetlb.c | 61 +++++++++++++++++++++++++++++-------------------------
 1 file changed, 29 insertions(+), 32 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2c9033d39bfe..8189c92fac82 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1157,6 +1157,7 @@ static struct page *alloc_fresh_gigantic_page_node(struct hstate *h, int nid)
 	if (page) {
 		prep_compound_gigantic_page(page, huge_page_order(h));
 		prep_new_huge_page(h, page, nid);
+		put_page(page); /* free it into the hugepage allocator */
 	}
 
 	return page;
@@ -1304,7 +1305,6 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 	h->nr_huge_pages++;
 	h->nr_huge_pages_node[nid]++;
 	spin_unlock(&hugetlb_lock);
-	put_page(page); /* free it into the hugepage allocator */
 }
 
 static void prep_compound_gigantic_page(struct page *page, unsigned int order)
@@ -1381,41 +1381,49 @@ pgoff_t __basepage_index(struct page *page)
 	return (index << compound_order(page_head)) + compound_idx;
 }
 
-static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
+static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
+		gfp_t gfp_mask, int nid, nodemask_t *nmask)
 {
+	int order = huge_page_order(h);
 	struct page *page;
 
-	page = __alloc_pages_node(nid,
-		htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
-				__GFP_RETRY_MAYFAIL|__GFP_NOWARN,
-		huge_page_order(h));
-	if (page) {
-		prep_new_huge_page(h, page, nid);
-	}
+	gfp_mask |= __GFP_COMP|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
+	if (nid == NUMA_NO_NODE)
+		nid = numa_mem_id();
+	page = __alloc_pages_nodemask(gfp_mask, order, nid, nmask);
+	if (page)
+		__count_vm_event(HTLB_BUDDY_PGALLOC);
+	else
+		__count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
 
 	return page;
 }
 
+/*
+ * Allocates a fresh page to the hugetlb allocator pool in the node interleaved
+ * manner.
+ */
 static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
 {
 	struct page *page;
 	int nr_nodes, node;
-	int ret = 0;
+	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
 
 	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
-		page = alloc_fresh_huge_page_node(h, node);
-		if (page) {
-			ret = 1;
+		page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask,
+				node, nodes_allowed);
+		if (page)
 			break;
-		}
+
 	}
 
-	if (ret)
-		count_vm_event(HTLB_BUDDY_PGALLOC);
-	else
-		count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
+	if (!page)
+		return 0;
 
-	return ret;
+	prep_new_huge_page(h, page, page_to_nid(page));
+	put_page(page); /* free it into the hugepage allocator */
+
+	return 1;
 }
 
 /*
@@ -1523,17 +1531,6 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
 	return rc;
 }
 
-static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
-		gfp_t gfp_mask, int nid, nodemask_t *nmask)
-{
-	int order = huge_page_order(h);
-
-	gfp_mask |= __GFP_COMP|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
-	if (nid == NUMA_NO_NODE)
-		nid = numa_mem_id();
-	return __alloc_pages_nodemask(gfp_mask, order, nid, nmask);
-}
-
 static struct page *__alloc_buddy_huge_page(struct hstate *h, gfp_t gfp_mask,
 		int nid, nodemask_t *nmask)
 {
@@ -1589,11 +1586,9 @@ static struct page *__alloc_buddy_huge_page(struct hstate *h, gfp_t gfp_mask,
 		 */
 		h->nr_huge_pages_node[r_nid]++;
 		h->surplus_huge_pages_node[r_nid]++;
-		__count_vm_event(HTLB_BUDDY_PGALLOC);
 	} else {
 		h->nr_huge_pages--;
 		h->surplus_huge_pages--;
-		__count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
 	}
 	spin_unlock(&hugetlb_lock);
 
@@ -2148,6 +2143,8 @@ static void __init gather_bootmem_prealloc(void)
 		prep_compound_huge_page(page, h->order);
 		WARN_ON(PageReserved(page));
 		prep_new_huge_page(h, page, page_to_nid(page));
+		put_page(page); /* free it into the hugepage allocator */
+
 		/*
 		 * If we had gigantic hugepages allocated at boot time, we need
 		 * to restore the 'stolen' pages to totalram_pages in order to
-- 
2.15.0
* Re: [PATCH RFC 1/2] mm, hugetlb: unify core page allocation accounting and initialization

From: Mike Kravetz @ 2017-11-28 21:34 UTC
To: Michal Hocko, linux-mm; +Cc: Naoya Horiguchi, LKML, Michal Hocko

On 11/28/2017 06:12 AM, Michal Hocko wrote:
[...]
> +/*
> + * Allocates a fresh page to the hugetlb allocator pool in the node interleaved
> + * manner.
> + */
>  static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
>  {
>  	struct page *page;
>  	int nr_nodes, node;
> -	int ret = 0;
> +	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
> 
>  	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
> -		page = alloc_fresh_huge_page_node(h, node);
> -		if (page) {
> -			ret = 1;
> +		page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask,
> +				node, nodes_allowed);

I don't have the greatest understanding of node/nodemasks, but ...
Since __hugetlb_alloc_buddy_huge_page calls __alloc_pages_nodemask(), do
we still need to explicitly iterate over nodes with
for_each_node_mask_to_alloc() here?

[...]
-- 
Mike Kravetz
* Re: [PATCH RFC 1/2] mm, hugetlb: unify core page allocation accounting and initialization

From: Michal Hocko @ 2017-11-29 6:57 UTC
To: Mike Kravetz; +Cc: linux-mm, Naoya Horiguchi, LKML

On Tue 28-11-17 13:34:53, Mike Kravetz wrote:
> On 11/28/2017 06:12 AM, Michal Hocko wrote:
[...]
> > +/*
> > + * Allocates a fresh page to the hugetlb allocator pool in the node interleaved
> > + * manner.
> > + */
> >  static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
> >  {
> >  	struct page *page;
> >  	int nr_nodes, node;
> > -	int ret = 0;
> > +	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
> > 
> >  	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
> > -		page = alloc_fresh_huge_page_node(h, node);
> > -		if (page) {
> > -			ret = 1;
> > +		page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask,
> > +				node, nodes_allowed);
>
> I don't have the greatest understanding of node/nodemasks, but ...
> Since __hugetlb_alloc_buddy_huge_page calls __alloc_pages_nodemask(), do
> we still need to explicitly iterate over nodes with
> for_each_node_mask_to_alloc() here?

Yes we do, because callers depend on the round-robin allocation policy,
which is implemented by the ugly for_each_node_mask_to_alloc. I am not
saying I like the way this is done, but this is a user-visible thing.

Or maybe I've misunderstood the whole thing...
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH RFC 1/2] mm, hugetlb: unify core page allocation accounting and initialization

From: Mike Kravetz @ 2017-11-29 19:09 UTC
To: Michal Hocko; +Cc: linux-mm, Naoya Horiguchi, LKML

On 11/28/2017 10:57 PM, Michal Hocko wrote:
> On Tue 28-11-17 13:34:53, Mike Kravetz wrote:
>> On 11/28/2017 06:12 AM, Michal Hocko wrote:
> [...]
>> I don't have the greatest understanding of node/nodemasks, but ...
>> Since __hugetlb_alloc_buddy_huge_page calls __alloc_pages_nodemask(), do
>> we still need to explicitly iterate over nodes with
>> for_each_node_mask_to_alloc() here?
>
> Yes we do, because callers depend on the round robin allocation policy
> which is implemented by the ugly for_each_node_mask_to_alloc. I am not
> saying I like the way this is done but this is user visible thing.

Ah, thanks. I missed the __GFP_THISNODE. Because of that, the
nodes_allowed mask is not used in the allocation attempts. So, cycling
through the nodes with for_each_node_mask_to_alloc makes sense.

> Or maybe I've missunderstood the whole thing...

No, this should preserve the original behavior.
-- 
Mike Kravetz
* [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration

From: Michal Hocko @ 2017-11-28 14:12 UTC
To: linux-mm; +Cc: Mike Kravetz, Naoya Horiguchi, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

hugepage migration relies on __alloc_buddy_huge_page to get a new page.
This has 2 main disadvantages.
1) it doesn't allow migrating any huge page if the pool is completely
used, which is not an exceptional case as the pool is static and unused
memory is just wasted.
2) it leads to a weird semantic where migration between two NUMA nodes
might increase the pool size of the destination NUMA node while the
page is in use. The issue is caused by the per NUMA node surplus page
tracking (see free_huge_page).

Address both issues by changing the way we allocate and account pages
allocated for migration. Those are temporary by definition. So we mark
them that way (we will abuse page flags in the 3rd page) and update
free_huge_page to free such pages to the page allocator. The page
migration path then just transfers the temporary status from the new
page to the old one, which will be freed on the last reference. The
global surplus count never changes during this path, but we still have
to be careful when freeing a page from a node with surplus pages on
that node.

Rename __alloc_buddy_huge_page to __alloc_surplus_huge_page to better
reflect its purpose. The new allocation routine for the migration path
is __alloc_migrate_huge_page.

The user-visible effect of this patch is that migrated pages are really
temporary and they travel between NUMA nodes as per the migration
request:

Before migration
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0

After
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0

With the previous implementation, both nodes would have nr_hugepages:1
until the page is freed.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/hugetlb.h | 35 +++++++++++++++++++++++++++++
 mm/hugetlb.c            | 58 +++++++++++++++++++++++++++++++++++--------------
 mm/migrate.c            | 13 +++++++++++
 3 files changed, 90 insertions(+), 16 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 6e3696c7b35a..1b6d7783c717 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -157,8 +157,43 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end, pgprot_t newprot);
 
 bool is_hugetlb_entry_migration(pte_t pte);
+
+/*
+ * Internal hugetlb specific page flag. Do not use outside of the hugetlb
+ * code
+ */
+static inline bool PageHugeTemporary(struct page *page)
+{
+	if (!PageHuge(page))
+		return false;
+
+	return page[2].flags == -1U;
+}
+
+static inline void SetPageHugeTemporary(struct page *page)
+{
+	page[2].flags = -1U;
+}
+
+static inline void ClearPageHugeTemporary(struct page *page)
+{
+	page[2].flags = 0;
+}
 #else	/* !CONFIG_HUGETLB_PAGE */
+static inline bool PageHugeTemporary(struct page *page)
+{
+	return false;
+}
+
+static inline void SetPageHugeTemporary(struct page *page)
+{
+}
+
+static inline void ClearPageHugeTemporary(struct page *page)
+{
+}
+
 static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
 {
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8189c92fac82..037bf0f89463 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1283,7 +1283,13 @@ void free_huge_page(struct page *page)
 	if (restore_reserve)
 		h->resv_huge_pages++;
 
-	if (h->surplus_huge_pages_node[nid]) {
+	if (PageHugeTemporary(page)) {
+		list_del(&page->lru);
+		ClearPageHugeTemporary(page);
+		update_and_free_page(h, page);
+		if (h->surplus_huge_pages_node[nid])
+			h->surplus_huge_pages_node[nid]--;
+	} else if (h->surplus_huge_pages_node[nid]) {
 		/* remove the page from active list */
 		list_del(&page->lru);
 		update_and_free_page(h, page);
@@ -1531,7 +1537,11 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
 	return rc;
 }
 
-static struct page *__alloc_buddy_huge_page(struct hstate *h, gfp_t gfp_mask,
+/*
+ * Allocates a fresh surplus page from the page allocator. Temporary
+ * requests (e.g. page migration) can pass enforce_overcommit == false
+ */
+static struct page *__alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 		int nid, nodemask_t *nmask)
 {
 	struct page *page;
@@ -1595,6 +1605,28 @@ static struct page *__alloc_buddy_huge_page(struct hstate *h, gfp_t gfp_mask,
 	return page;
 }
 
+static struct page *__alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
+		int nid, nodemask_t *nmask)
+{
+	struct page *page;
+
+	if (hstate_is_gigantic(h))
+		return NULL;
+
+	page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask, nid, nmask);
+	if (!page)
+		return NULL;
+
+	/*
+	 * We do not account these pages as surplus because they are only
+	 * temporary and will be released properly on the last reference
+	 */
+	prep_new_huge_page(h, page, page_to_nid(page));
+	SetPageHugeTemporary(page);
+
+	return page;
+}
+
 /*
  * Use the VMA's mpolicy to allocate a huge page from the buddy.
  */
@@ -1609,17 +1641,13 @@ struct page *__alloc_buddy_huge_page_with_mpol(struct hstate *h,
 	nodemask_t *nodemask;
 
 	nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask);
-	page = __alloc_buddy_huge_page(h, gfp_mask, nid, nodemask);
+	page = __alloc_surplus_huge_page(h, gfp_mask, nid, nodemask);
 	mpol_cond_put(mpol);
 
 	return page;
 }
 
-/*
- * This allocation function is useful in the context where vma is irrelevant.
- * E.g. soft-offlining uses this function because it only cares physical
- * address of error page.
- */ +/* page migration callback function */ struct page *alloc_huge_page_node(struct hstate *h, int nid) { gfp_t gfp_mask = htlb_alloc_mask(h); @@ -1634,12 +1662,12 @@ struct page *alloc_huge_page_node(struct hstate *h, int nid) spin_unlock(&hugetlb_lock); if (!page) - page = __alloc_buddy_huge_page(h, gfp_mask, nid, NULL); + page = __alloc_migrate_huge_page(h, gfp_mask, nid, NULL); return page; } - +/* page migration callback function */ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, nodemask_t *nmask) { @@ -1657,9 +1685,7 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, } spin_unlock(&hugetlb_lock); - /* No reservations, try to overcommit */ - - return __alloc_buddy_huge_page(h, gfp_mask, preferred_nid, nmask); + return __alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask); } /* @@ -1687,7 +1713,7 @@ static int gather_surplus_pages(struct hstate *h, int delta) retry: spin_unlock(&hugetlb_lock); for (i = 0; i < needed; i++) { - page = __alloc_buddy_huge_page(h, htlb_alloc_mask(h), + page = __alloc_surplus_huge_page(h, htlb_alloc_mask(h), NUMA_NO_NODE, NULL); if (!page) { alloc_ok = false; @@ -2284,7 +2310,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count, * First take pages out of surplus state. Then make up the * remaining difference by allocating fresh huge pages. * - * We might race with __alloc_buddy_huge_page() here and be unable + * We might race with __alloc_surplus_huge_page() here and be unable * to convert a surplus huge page to a normal huge page. That is * not critical, though, it just means the overall size of the * pool might be one hugepage larger than it needs to be, but @@ -2330,7 +2356,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count, * By placing pages into the surplus state independent of the * overcommit value, we are allowing the surplus pool size to * exceed overcommit. There are few sane options here. 
Since - * __alloc_buddy_huge_page() is checking the global counter, + * __alloc_surplus_huge_page() is checking the global counter, * though, we'll note that we're not allowed to exceed surplus * and won't grow the pool anywhere else. Not until one of the * sysctls are changed, or the surplus pages go out of use. diff --git a/mm/migrate.c b/mm/migrate.c index 4d0be47a322a..b3345f8174a9 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1326,6 +1326,19 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, hugetlb_cgroup_migrate(hpage, new_hpage); put_new_page = NULL; set_page_owner_migrate_reason(new_hpage, reason); + + /* + * transfer temporary state of the new huge page. This is + * reverse to other transitions because the newpage is going to + * be final while the old one will be freed so it takes over + * the temporary status. + * No need for any locking here because destructor cannot race + * with us. + */ + if (PageHugeTemporary(new_hpage)) { + SetPageHugeTemporary(hpage); + ClearPageHugeTemporary(new_hpage); + } } unlock_page(hpage); -- 2.15.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration 2017-11-28 14:12 ` [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration Michal Hocko @ 2017-11-29 1:39 ` Mike Kravetz 2017-11-29 7:17 ` Michal Hocko 2017-11-29 9:22 ` Michal Hocko ` (2 subsequent siblings) 3 siblings, 1 reply; 23+ messages in thread From: Mike Kravetz @ 2017-11-29 1:39 UTC (permalink / raw) To: Michal Hocko, linux-mm; +Cc: Naoya Horiguchi, LKML, Michal Hocko On 11/28/2017 06:12 AM, Michal Hocko wrote: > From: Michal Hocko <mhocko@suse.com> > > hugepage migration relies on __alloc_buddy_huge_page to get a new page. > This has 2 main disadvantages. > 1) it doesn't allow to migrate any huge page if the pool is used > completely which is not an exceptional case as the pool is static and > unused memory is just wasted. > 2) it leads to a weird semantic when migration between two numa nodes > might increase the pool size of the destination NUMA node while the page > is in use. The issue is caused by per NUMA node surplus pages tracking > (see free_huge_page). > > Address both issues by changing the way how we allocate and account > pages allocated for migration. Those should temporal by definition. > So we mark them that way (we will abuse page flags in the 3rd page) > and update free_huge_page to free such pages to the page allocator. > Page migration path then just transfers the temporal status from the > new page to the old one which will be freed on the last reference. In general, I think this will work. Some questions below. > The global surplus count will never change during this path but we still > have to be careful when freeing a page from a node with surplus pages > on the node. Not sure about the "freeing page from a node with surplus pages" comment. If allocating PageHugeTemporary pages does not adjust surplus counts, then there should be no concern at the time of freeing. 
Could this comment be a hold over from a previous implementation attempt? > > Rename __alloc_buddy_huge_page to __alloc_surplus_huge_page to better > reflect its purpose. The new allocation routine for the migration path > is __alloc_migrate_huge_page. > > The user visible effect of this patch is that migrated pages are really > temporal and they travel between NUMA nodes as per the migration > request: > Before migration > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0 > > After > > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:0 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0 > > with the previous implementation, both nodes would have nr_hugepages:1 > until the page is freed. 
> > Signed-off-by: Michal Hocko <mhocko@suse.com> > --- > include/linux/hugetlb.h | 35 +++++++++++++++++++++++++++++ > mm/hugetlb.c | 58 +++++++++++++++++++++++++++++++++++-------------- > mm/migrate.c | 13 +++++++++++ > 3 files changed, 90 insertions(+), 16 deletions(-) > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h > index 6e3696c7b35a..1b6d7783c717 100644 > --- a/include/linux/hugetlb.h > +++ b/include/linux/hugetlb.h > @@ -157,8 +157,43 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, > unsigned long address, unsigned long end, pgprot_t newprot); > > bool is_hugetlb_entry_migration(pte_t pte); > + > +/* > + * Internal hugetlb specific page flag. Do not use outside of the hugetlb > + * code > + */ > +static inline bool PageHugeTemporary(struct page *page) > +{ > + if (!PageHuge(page)) > + return false; > + > + return page[2].flags == -1U; > +} > + > +static inline void SetPageHugeTemporary(struct page *page) > +{ > + page[2].flags = -1U; > +} > + > +static inline void ClearPageHugeTemporary(struct page *page) > +{ > + page[2].flags = 0; > +} > #else /* !CONFIG_HUGETLB_PAGE */ > > +static inline bool PageHugeTemporary(struct page *page) > +{ > + return false; > +} > + > +static inline void SetPageHugeTemporary(struct page *page) > +{ > +} > + > +static inline void ClearPageHugeTemporary(struct page *page) > +{ > +} > + > static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma) > { > } > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 8189c92fac82..037bf0f89463 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -1283,7 +1283,13 @@ void free_huge_page(struct page *page) > if (restore_reserve) > h->resv_huge_pages++; > > - if (h->surplus_huge_pages_node[nid]) { > + if (PageHugeTemporary(page)) { > + list_del(&page->lru); > + ClearPageHugeTemporary(page); > + update_and_free_page(h, page); > + if (h->surplus_huge_pages_node[nid]) > + h->surplus_huge_pages_node[nid]--; I think this is not correct. 
Should the lines dealing with per-node surplus counts even be here? If the lines above are correct, then it implies that the sum of per node surplus counts could exceed (or get out of sync with) the global surplus count. > + } else if (h->surplus_huge_pages_node[nid]) { > /* remove the page from active list */ > list_del(&page->lru); > update_and_free_page(h, page); > @@ -1531,7 +1537,11 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn) > return rc; > } > > -static struct page *__alloc_buddy_huge_page(struct hstate *h, gfp_t gfp_mask, > +/* > + * Allocates a fresh surplus page from the page allocator. Temporary > + * requests (e.g. page migration) can pass enforce_overcommit == false 'enforce_overcommit == false' perhaps part of an earlier implementation attempt? > + */ > +static struct page *__alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask, > int nid, nodemask_t *nmask) > { > struct page *page; > @@ -1595,6 +1605,28 @@ static struct page *__alloc_buddy_huge_page(struct hstate *h, gfp_t gfp_mask, > return page; > } > > +static struct page *__alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask, > + int nid, nodemask_t *nmask) > +{ > + struct page *page; > + > + if (hstate_is_gigantic(h)) > + return NULL; > + > + page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask, nid, nmask); > + if (!page) > + return NULL; > + > + /* > + * We do not account these pages as surplus because they are only > + * temporary and will be released properly on the last reference > + */ > + prep_new_huge_page(h, page, page_to_nid(page)); > + SetPageHugeTemporary(page); > + > + return page; > +} > + > /* > * Use the VMA's mpolicy to allocate a huge page from the buddy. 
> */ > @@ -1609,17 +1641,13 @@ struct page *__alloc_buddy_huge_page_with_mpol(struct hstate *h, > nodemask_t *nodemask; > > nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask); > - page = __alloc_buddy_huge_page(h, gfp_mask, nid, nodemask); > + page = __alloc_surplus_huge_page(h, gfp_mask, nid, nodemask); > mpol_cond_put(mpol); > > return page; > } > > -/* > - * This allocation function is useful in the context where vma is irrelevant. > - * E.g. soft-offlining uses this function because it only cares physical > - * address of error page. > - */ > +/* page migration callback function */ > struct page *alloc_huge_page_node(struct hstate *h, int nid) > { > gfp_t gfp_mask = htlb_alloc_mask(h); > @@ -1634,12 +1662,12 @@ struct page *alloc_huge_page_node(struct hstate *h, int nid) > spin_unlock(&hugetlb_lock); > > if (!page) > - page = __alloc_buddy_huge_page(h, gfp_mask, nid, NULL); > + page = __alloc_migrate_huge_page(h, gfp_mask, nid, NULL); > > return page; > } > > - > +/* page migration callback function */ > struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, > nodemask_t *nmask) > { > @@ -1657,9 +1685,7 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, > } > spin_unlock(&hugetlb_lock); > > - /* No reservations, try to overcommit */ > - > - return __alloc_buddy_huge_page(h, gfp_mask, preferred_nid, nmask); > + return __alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask); > } > > /* > @@ -1687,7 +1713,7 @@ static int gather_surplus_pages(struct hstate *h, int delta) > retry: > spin_unlock(&hugetlb_lock); > for (i = 0; i < needed; i++) { > - page = __alloc_buddy_huge_page(h, htlb_alloc_mask(h), > + page = __alloc_surplus_huge_page(h, htlb_alloc_mask(h), > NUMA_NO_NODE, NULL); > if (!page) { > alloc_ok = false; > @@ -2284,7 +2310,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count, > * First take pages out of surplus state. 
Then make up the > * remaining difference by allocating fresh huge pages. > * > - * We might race with __alloc_buddy_huge_page() here and be unable > + * We might race with __alloc_surplus_huge_page() here and be unable > * to convert a surplus huge page to a normal huge page. That is > * not critical, though, it just means the overall size of the > * pool might be one hugepage larger than it needs to be, but > @@ -2330,7 +2356,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count, > * By placing pages into the surplus state independent of the > * overcommit value, we are allowing the surplus pool size to > * exceed overcommit. There are few sane options here. Since > - * __alloc_buddy_huge_page() is checking the global counter, > + * __alloc_surplus_huge_page() is checking the global counter, > * though, we'll note that we're not allowed to exceed surplus > * and won't grow the pool anywhere else. Not until one of the > * sysctls are changed, or the surplus pages go out of use. > diff --git a/mm/migrate.c b/mm/migrate.c > index 4d0be47a322a..b3345f8174a9 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -1326,6 +1326,19 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, > hugetlb_cgroup_migrate(hpage, new_hpage); > put_new_page = NULL; > set_page_owner_migrate_reason(new_hpage, reason); > + > + /* > + * transfer temporary state of the new huge page. This is > + * reverse to other transitions because the newpage is going to > + * be final while the old one will be freed so it takes over > + * the temporary status. > + * No need for any locking here because destructor cannot race > + * with us. > + */ > + if (PageHugeTemporary(new_hpage)) { > + SetPageHugeTemporary(hpage); > + ClearPageHugeTemporary(new_hpage); > + } > } > > unlock_page(hpage); > I'm still trying to wrap my head around all the different scenarios. In general, this new code only 'kicks in' if the there is not a free pre-allocated huge page for migration. 
Right? So, if there are free huge pages they are 'consumed' during migration and the number of available pre-allocated huge pages is reduced? Or, is that not exactly how it works? Or does it depend on the purpose of the migration? The only reason I ask is because this new method of allocating a surplus page (if successful) results in no decrease of available huge pages. Perhaps all migrations should attempt to allocate surplus pages and not impact the pre-allocated number of available huge pages. Or, perhaps I am just confused. :) -- Mike Kravetz ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration 2017-11-29 1:39 ` Mike Kravetz @ 2017-11-29 7:17 ` Michal Hocko 0 siblings, 0 replies; 23+ messages in thread From: Michal Hocko @ 2017-11-29 7:17 UTC (permalink / raw) To: Mike Kravetz; +Cc: linux-mm, Naoya Horiguchi, LKML On Tue 28-11-17 17:39:50, Mike Kravetz wrote: > On 11/28/2017 06:12 AM, Michal Hocko wrote: > > From: Michal Hocko <mhocko@suse.com> > > > > hugepage migration relies on __alloc_buddy_huge_page to get a new page. > > This has 2 main disadvantages. > > 1) it doesn't allow to migrate any huge page if the pool is used > > completely which is not an exceptional case as the pool is static and > > unused memory is just wasted. > > 2) it leads to a weird semantic when migration between two numa nodes > > might increase the pool size of the destination NUMA node while the page > > is in use. The issue is caused by per NUMA node surplus pages tracking > > (see free_huge_page). > > > > Address both issues by changing the way how we allocate and account > > pages allocated for migration. Those should temporal by definition. > > So we mark them that way (we will abuse page flags in the 3rd page) > > and update free_huge_page to free such pages to the page allocator. > > Page migration path then just transfers the temporal status from the > > new page to the old one which will be freed on the last reference. > > In general, I think this will work. Some questions below. > > > The global surplus count will never change during this path but we still > > have to be careful when freeing a page from a node with surplus pages > > on the node. > > Not sure about the "freeing page from a node with surplus pages" comment. > If allocating PageHugeTemporary pages does not adjust surplus counts, then > there should be no concern at the time of freeing. > > Could this comment be a hold over from a previous implementation attempt? > Not really. 
You have to realize that the original page could be surplus on its node. More on that below. [...] > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index 8189c92fac82..037bf0f89463 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -1283,7 +1283,13 @@ void free_huge_page(struct page *page) > > if (restore_reserve) > > h->resv_huge_pages++; > > > > - if (h->surplus_huge_pages_node[nid]) { > > + if (PageHugeTemporary(page)) { > > + list_del(&page->lru); > > + ClearPageHugeTemporary(page); > > + update_and_free_page(h, page); > > + if (h->surplus_huge_pages_node[nid]) > > + h->surplus_huge_pages_node[nid]--; > > I think this is not correct. Should the lines dealing with per-node > surplus counts even be here? If the lines above are correct, then it > implies that the sum of per node surplus counts could exceed (or get out > of sync with) the global surplus count. You are right, I guess. This per-node accounting makes the whole thing real pain. I am worried that we will free next page from the same node and reduce the overal pool size. I will think about it some more. > > + } else if (h->surplus_huge_pages_node[nid]) { > > /* remove the page from active list */ > > list_del(&page->lru); > > update_and_free_page(h, page); > > @@ -1531,7 +1537,11 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn) > > return rc; > > } > > > > -static struct page *__alloc_buddy_huge_page(struct hstate *h, gfp_t gfp_mask, > > +/* > > + * Allocates a fresh surplus page from the page allocator. Temporary > > + * requests (e.g. page migration) can pass enforce_overcommit == false > > 'enforce_overcommit == false' perhaps part of an earlier implementation > attempt? yeah. [...] 
> > diff --git a/mm/migrate.c b/mm/migrate.c > > index 4d0be47a322a..b3345f8174a9 100644 > > --- a/mm/migrate.c > > +++ b/mm/migrate.c > > @@ -1326,6 +1326,19 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, > > hugetlb_cgroup_migrate(hpage, new_hpage); > > put_new_page = NULL; > > set_page_owner_migrate_reason(new_hpage, reason); > > + > > + /* > > + * transfer temporary state of the new huge page. This is > > + * reverse to other transitions because the newpage is going to > > + * be final while the old one will be freed so it takes over > > + * the temporary status. > > + * No need for any locking here because destructor cannot race > > + * with us. > > + */ > > + if (PageHugeTemporary(new_hpage)) { > > + SetPageHugeTemporary(hpage); > > + ClearPageHugeTemporary(new_hpage); > > + } > > } > > > > unlock_page(hpage); > > > > I'm still trying to wrap my head around all the different scenarios. > In general, this new code only 'kicks in' if the there is not a free > pre-allocated huge page for migration. Right? yes > So, if there are free huge pages they are 'consumed' during migration > and the number of available pre-allocated huge pages is reduced? Or, > is that not exactly how it works? Or does it depend in the purpose > of the migration? Well, if we have pre-allocated pages then we just consume them and they will not get Temporary status so the additional code doesn't kick in. > The only reason I ask is because this new method of allocating a surplus > page (if successful) results in no decrease of available huge pages. > Perhaps all migrations should attempt to allocate surplus pages and not > impact the pre-allocated number of available huge pages. That could reduce the chances of the migration success because allocating a fresh huge page can fail. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration 2017-11-28 14:12 ` [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration Michal Hocko 2017-11-29 1:39 ` Mike Kravetz @ 2017-11-29 9:22 ` Michal Hocko 2017-11-29 9:40 ` Michal Hocko ` (2 more replies) 2017-11-29 9:51 ` Michal Hocko 2017-11-29 11:33 ` [PATCH RFC v2 " Michal Hocko 3 siblings, 3 replies; 23+ messages in thread From: Michal Hocko @ 2017-11-29 9:22 UTC (permalink / raw) To: linux-mm; +Cc: Mike Kravetz, Naoya Horiguchi, LKML What about this on top. I haven't tested this yet though. --- diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 1b6d7783c717..f5fcd4e355dc 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -119,6 +119,7 @@ long hugetlb_unreserve_pages(struct inode *inode, long start, long end, long freed); bool isolate_huge_page(struct page *page, struct list_head *list); void putback_active_hugepage(struct page *page); +void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason); void free_huge_page(struct page *page); void hugetlb_fix_reserve_counts(struct inode *inode); extern struct mutex *hugetlb_fault_mutex_table; @@ -232,6 +233,7 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list) return false; } #define putback_active_hugepage(p) do {} while (0) +#define move_hugetlb_state(old, new, reason) do {} while (0) static inline unsigned long hugetlb_change_protection(struct vm_area_struct *vma, unsigned long address, unsigned long end, pgprot_t newprot) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 037bf0f89463..30601c1c62f3 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -34,6 +34,7 @@ #include <linux/hugetlb_cgroup.h> #include <linux/node.h> #include <linux/userfaultfd_k.h> +#include <linux/page_owner.h> #include "internal.h" int hugetlb_max_hstate __read_mostly; @@ -4830,3 +4831,34 @@ void putback_active_hugepage(struct page *page) 
spin_unlock(&hugetlb_lock); put_page(page); } + +void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason) +{ + struct hstate *h = page_hstate(oldpage); + + hugetlb_cgroup_migrate(oldpage, newpage); + set_page_owner_migrate_reason(newpage, reason); + + /* + * transfer temporary state of the new huge page. This is + * reverse to other transitions because the newpage is going to + * be final while the old one will be freed so it takes over + * the temporary status. + * + * Also note that we have to transfer the per-node surplus state + * here as well otherwise the global surplus count will not match + * the per-node's. + */ + if (PageHugeTemporary(newpage)) { + int old_nid = page_to_nid(oldpage); + int new_nid = page_to_nid(newpage); + + SetPageHugeTemporary(oldpage); + ClearPageHugeTemporary(newpage); + + if (h->surplus_huge_pages_node[old_nid]) { + h->surplus_huge_pages_node[old_nid]--; + h->surplus_huge_pages_node[new_nid]++; + } + } +} diff --git a/mm/migrate.c b/mm/migrate.c index b3345f8174a9..1e5525a25691 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1323,22 +1323,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, put_anon_vma(anon_vma); if (rc == MIGRATEPAGE_SUCCESS) { - hugetlb_cgroup_migrate(hpage, new_hpage); + move_hugetlb_state(hpage, new_hpage, reason); put_new_page = NULL; - set_page_owner_migrate_reason(new_hpage, reason); - - /* - * transfer temporary state of the new huge page. This is - * reverse to other transitions because the newpage is going to - * be final while the old one will be freed so it takes over - * the temporary status. - * No need for any locking here because destructor cannot race - * with us. - */ - if (PageHugeTemporary(new_hpage)) { - SetPageHugeTemporary(hpage); - ClearPageHugeTemporary(new_hpage); - } } unlock_page(hpage); -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration 2017-11-29 9:22 ` Michal Hocko @ 2017-11-29 9:40 ` Michal Hocko 2017-11-29 11:23 ` Michal Hocko 2017-11-29 19:52 ` Mike Kravetz 2 siblings, 0 replies; 23+ messages in thread From: Michal Hocko @ 2017-11-29 9:40 UTC (permalink / raw) To: linux-mm; +Cc: Mike Kravetz, Naoya Horiguchi, LKML On Wed 29-11-17 10:22:34, Michal Hocko wrote: > What about this on top. I haven't tested this yet though. > --- We will need to drop surplus_huge_pages_node handling from the free path obviously as well diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 1be43563e226..756833f9ef8b 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1312,8 +1312,6 @@ void free_huge_page(struct page *page) list_del(&page->lru); ClearPageHugeTemporary(page); update_and_free_page(h, page); - if (h->surplus_huge_pages_node[nid]) - h->surplus_huge_pages_node[nid]--; } else if (h->surplus_huge_pages_node[nid]) { /* remove the page from active list */ list_del(&page->lru); -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration 2017-11-29 9:22 ` Michal Hocko 2017-11-29 9:40 ` Michal Hocko @ 2017-11-29 11:23 ` Michal Hocko 2017-11-29 19:52 ` Mike Kravetz 2 siblings, 0 replies; 23+ messages in thread From: Michal Hocko @ 2017-11-29 11:23 UTC (permalink / raw) To: linux-mm; +Cc: Mike Kravetz, Naoya Horiguchi, LKML On Wed 29-11-17 10:22:34, Michal Hocko wrote: > What about this on top. I haven't tested this yet though. OK, it seems to work: root@test1:~# echo 1 > /proc/sys/vm/nr_hugepages root@test1:~# echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_overcommit_hugepages root@test1:~# grep . /sys/devices/system/node/node*/hugepages/hugepages-2048kB/* /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0 /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:0 /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0 /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:1 /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1 /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0 # mmap 2 huge pages root@test1:~# grep . 
/sys/devices/system/node/node*/hugepages/hugepages-2048kB/* /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0 /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:0 /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0 /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0 /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:2 /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:1 root@test1:~# migratepages $(pidof map_hugetlb) 1 0 /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0 /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:2 /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:1 /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0 /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0 /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0 and exit the mmap root@test1:~# grep . /sys/devices/system/node/node*/hugepages/hugepages-2048kB/* /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:1 /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1 /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0 /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0 /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0 /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration
  2017-11-29  9:22   ` Michal Hocko
  2017-11-29  9:40     ` Michal Hocko
  2017-11-29 11:23     ` Michal Hocko
@ 2017-11-29 19:52     ` Mike Kravetz
  2017-11-30  7:57       ` Michal Hocko
  2 siblings, 1 reply; 23+ messages in thread
From: Mike Kravetz @ 2017-11-29 19:52 UTC (permalink / raw)
To: Michal Hocko, linux-mm; +Cc: Naoya Horiguchi, LKML

On 11/29/2017 01:22 AM, Michal Hocko wrote:
> What about this on top. I haven't tested this yet though.

Yes, this would work.

However, I think a simple modification to your previous free_huge_page
changes would make this unnecessary.  I was confused in your previous
patch because you decremented the per-node surplus page count, but not
the global count.  I think it would have been correct (and made this
patch unnecessary) if you decremented the global counter there as well.

Of course, this patch makes the surplus accounting more explicit.

If we move forward with this patch, one issue below.

> ---
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1b6d7783c717..f5fcd4e355dc 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -119,6 +119,7 @@ long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
> 						long freed);
>  bool isolate_huge_page(struct page *page, struct list_head *list);
>  void putback_active_hugepage(struct page *page);
> +void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
>  void free_huge_page(struct page *page);
>  void hugetlb_fix_reserve_counts(struct inode *inode);
>  extern struct mutex *hugetlb_fault_mutex_table;
> @@ -232,6 +233,7 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list)
>  	return false;
>  }
>  #define putback_active_hugepage(p)	do {} while (0)
> +#define move_hugetlb_state(old, new, reason)	do {} while (0)
>
>  static inline unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  			unsigned long address, unsigned long end, pgprot_t newprot)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 037bf0f89463..30601c1c62f3 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -34,6 +34,7 @@
>  #include <linux/hugetlb_cgroup.h>
>  #include <linux/node.h>
>  #include <linux/userfaultfd_k.h>
> +#include <linux/page_owner.h>
>  #include "internal.h"
>
>  int hugetlb_max_hstate __read_mostly;
> @@ -4830,3 +4831,34 @@ void putback_active_hugepage(struct page *page)
>  	spin_unlock(&hugetlb_lock);
>  	put_page(page);
>  }
> +
> +void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
> +{
> +	struct hstate *h = page_hstate(oldpage);
> +
> +	hugetlb_cgroup_migrate(oldpage, newpage);
> +	set_page_owner_migrate_reason(newpage, reason);
> +
> +	/*
> +	 * transfer temporary state of the new huge page. This is
> +	 * reverse to other transitions because the newpage is going to
> +	 * be final while the old one will be freed so it takes over
> +	 * the temporary status.
> +	 *
> +	 * Also note that we have to transfer the per-node surplus state
> +	 * here as well otherwise the global surplus count will not match
> +	 * the per-node's.
> +	 */
> +	if (PageHugeTemporary(newpage)) {
> +		int old_nid = page_to_nid(oldpage);
> +		int new_nid = page_to_nid(newpage);
> +
> +		SetPageHugeTemporary(oldpage);
> +		ClearPageHugeTemporary(newpage);
> +
> +		if (h->surplus_huge_pages_node[old_nid]) {
> +			h->surplus_huge_pages_node[old_nid]--;
> +			h->surplus_huge_pages_node[new_nid]++;
> +		}

You need to take hugetlb_lock before adjusting the surplus counts.

--
Mike Kravetz

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration
  2017-11-29 19:52     ` Mike Kravetz
@ 2017-11-30  7:57       ` Michal Hocko
  2017-11-30 19:35         ` Mike Kravetz
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2017-11-30 7:57 UTC (permalink / raw)
To: Mike Kravetz; +Cc: linux-mm, Naoya Horiguchi, LKML

On Wed 29-11-17 11:52:53, Mike Kravetz wrote:
> On 11/29/2017 01:22 AM, Michal Hocko wrote:
> > What about this on top. I haven't tested this yet though.
>
> Yes, this would work.
>
> However, I think a simple modification to your previous free_huge_page
> changes would make this unnecessary.  I was confused in your previous
> patch because you decremented the per-node surplus page count, but not
> the global count.  I think it would have been correct (and made this
> patch unnecessary) if you decremented the global counter there as well.

We cannot really increment the global counter because the overall number
of surplus pages during migration doesn't increase.

> Of course, this patch makes the surplus accounting more explicit.
>
> If we move forward with this patch, one issue below.
>
> > ---
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 1b6d7783c717..f5fcd4e355dc 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -119,6 +119,7 @@ long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
> > 						long freed);
> >  bool isolate_huge_page(struct page *page, struct list_head *list);
> >  void putback_active_hugepage(struct page *page);
> > +void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
> >  void free_huge_page(struct page *page);
> >  void hugetlb_fix_reserve_counts(struct inode *inode);
> >  extern struct mutex *hugetlb_fault_mutex_table;
> > @@ -232,6 +233,7 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list)
> >  	return false;
> >  }
> >  #define putback_active_hugepage(p)	do {} while (0)
> > +#define move_hugetlb_state(old, new, reason)	do {} while (0)
> >
> >  static inline unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> >  			unsigned long address, unsigned long end, pgprot_t newprot)
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 037bf0f89463..30601c1c62f3 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -34,6 +34,7 @@
> >  #include <linux/hugetlb_cgroup.h>
> >  #include <linux/node.h>
> >  #include <linux/userfaultfd_k.h>
> > +#include <linux/page_owner.h>
> >  #include "internal.h"
> >
> >  int hugetlb_max_hstate __read_mostly;
> > @@ -4830,3 +4831,34 @@ void putback_active_hugepage(struct page *page)
> >  	spin_unlock(&hugetlb_lock);
> >  	put_page(page);
> >  }
> > +
> > +void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
> > +{
> > +	struct hstate *h = page_hstate(oldpage);
> > +
> > +	hugetlb_cgroup_migrate(oldpage, newpage);
> > +	set_page_owner_migrate_reason(newpage, reason);
> > +
> > +	/*
> > +	 * transfer temporary state of the new huge page. This is
> > +	 * reverse to other transitions because the newpage is going to
> > +	 * be final while the old one will be freed so it takes over
> > +	 * the temporary status.
> > +	 *
> > +	 * Also note that we have to transfer the per-node surplus state
> > +	 * here as well otherwise the global surplus count will not match
> > +	 * the per-node's.
> > +	 */
> > +	if (PageHugeTemporary(newpage)) {
> > +		int old_nid = page_to_nid(oldpage);
> > +		int new_nid = page_to_nid(newpage);
> > +
> > +		SetPageHugeTemporary(oldpage);
> > +		ClearPageHugeTemporary(newpage);
> > +
> > +		if (h->surplus_huge_pages_node[old_nid]) {
> > +			h->surplus_huge_pages_node[old_nid]--;
> > +			h->surplus_huge_pages_node[new_nid]++;
> > +		}
>
> You need to take hugetlb_lock before adjusting the surplus counts.

You are right. Actually moving the code to hugetlb.c was exactly because
I didn't want to take the lock outside of the hugetlb proper. I just
forgot to add it here. Thanks for spotting.

--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration
  2017-11-30  7:57       ` Michal Hocko
@ 2017-11-30 19:35         ` Mike Kravetz
  2017-11-30 19:57           ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Mike Kravetz @ 2017-11-30 19:35 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-mm, Naoya Horiguchi, LKML

On 11/29/2017 11:57 PM, Michal Hocko wrote:
> On Wed 29-11-17 11:52:53, Mike Kravetz wrote:
>> On 11/29/2017 01:22 AM, Michal Hocko wrote:
>>> What about this on top. I haven't tested this yet though.
>>
>> Yes, this would work.
>>
>> However, I think a simple modification to your previous free_huge_page
>> changes would make this unnecessary.  I was confused in your previous
>> patch because you decremented the per-node surplus page count, but not
>> the global count.  I think it would have been correct (and made this
>> patch unnecessary) if you decremented the global counter there as well.
>
> We cannot really increment the global counter because the overall number
> of surplus pages during migration doesn't increase.

I was not suggesting we increment the global surplus count.  Rather,
your previous patch should have decremented the global surplus count in
free_huge_page.  Something like:

@@ -1283,7 +1283,13 @@ void free_huge_page(struct page *page)
 	if (restore_reserve)
 		h->resv_huge_pages++;

-	if (h->surplus_huge_pages_node[nid]) {
+	if (PageHugeTemporary(page)) {
+		list_del(&page->lru);
+		ClearPageHugeTemporary(page);
+		update_and_free_page(h, page);
+		if (h->surplus_huge_pages_node[nid]) {
+			h->surplus_huge_pages--;
+			h->surplus_huge_pages_node[nid]--;
+		}
+	} else if (h->surplus_huge_pages_node[nid]) {
 		/* remove the page from active list */
 		list_del(&page->lru);
 		update_and_free_page(h, page);

When we allocate one of these 'PageHugeTemporary' pages, we only increment
the global and node specific nr_huge_pages counters.  To me, this makes all
the huge page counters be the same as if there were simply one additional
pre-allocated huge page.  This 'extra' (PageHugeTemporary) page will go
away when free_huge_page is called.  So, my thought is that it is not
necessary to transfer per-node counts from the original to target node.
Of course, I may be missing something.

When thinking about transferring per-node counts as is done in your latest
patch, I took another look at all the per-node counts.  This may show my
ignorance of huge page migration, but do we need to handle the case where
the page being migrated is 'free'?  Is that possible?  If so, there will
be a count for free_huge_pages_node and the page will be on the per node
hugepage_freelists that must be handled.

--
Mike Kravetz

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration
  2017-11-30 19:35         ` Mike Kravetz
@ 2017-11-30 19:57           ` Michal Hocko
  2017-11-30 20:06             ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2017-11-30 19:57 UTC (permalink / raw)
To: Mike Kravetz; +Cc: linux-mm, Naoya Horiguchi, LKML

On Thu 30-11-17 11:35:11, Mike Kravetz wrote:
> On 11/29/2017 11:57 PM, Michal Hocko wrote:
> > On Wed 29-11-17 11:52:53, Mike Kravetz wrote:
> >> On 11/29/2017 01:22 AM, Michal Hocko wrote:
> >>> What about this on top. I haven't tested this yet though.
> >>
> >> Yes, this would work.
> >>
> >> However, I think a simple modification to your previous free_huge_page
> >> changes would make this unnecessary.  I was confused in your previous
> >> patch because you decremented the per-node surplus page count, but not
> >> the global count.  I think it would have been correct (and made this
> >> patch unnecessary) if you decremented the global counter there as well.
> >
> > We cannot really increment the global counter because the overall number
> > of surplus pages during migration doesn't increase.
>
> I was not suggesting we increment the global surplus count.  Rather,
> your previous patch should have decremented the global surplus count in
> free_huge_page.  Something like:

Sorry, I meant to say decrement. The point is that the overall surplus
count doesn't change after the migration. The only thing that _might_
change is the per-node distribution of surplus pages. That is why I
think we should handle that during the migration.

> @@ -1283,7 +1283,13 @@ void free_huge_page(struct page *page)
>  	if (restore_reserve)
>  		h->resv_huge_pages++;
>
> -	if (h->surplus_huge_pages_node[nid]) {
> +	if (PageHugeTemporary(page)) {
> +		list_del(&page->lru);
> +		ClearPageHugeTemporary(page);
> +		update_and_free_page(h, page);
> +		if (h->surplus_huge_pages_node[nid]) {
> +			h->surplus_huge_pages--;
> +			h->surplus_huge_pages_node[nid]--;
> +		}
> +	} else if (h->surplus_huge_pages_node[nid]) {
>  		/* remove the page from active list */
>  		list_del(&page->lru);
>  		update_and_free_page(h, page);
>
> When we allocate one of these 'PageHugeTemporary' pages, we only increment
> the global and node specific nr_huge_pages counters.  To me, this makes all
> the huge page counters be the same as if there were simply one additional
> pre-allocated huge page.  This 'extra' (PageHugeTemporary) page will go
> away when free_huge_page is called.  So, my thought is that it is not
> necessary to transfer per-node counts from the original to target node.
> Of course, I may be missing something.

The thing is that we do not know whether the original page is surplus
until the deallocation time.

> When thinking about transferring per-node counts as is done in your latest
> patch, I took another look at all the per-node counts.  This may show my
> ignorance of huge page migration, but do we need to handle the case where
> the page being migrated is 'free'?  Is that possible?  If so, there will
> be a count for free_huge_pages_node and the page will be on the per node
> hugepage_freelists that must be handled.

I do not understand. What do you mean by free? Sitting in the pool? I do
not think we ever try to migrate those. They simply do not have any
state to migrate. We could very well just allocate fresh pages on the
remote node and dissolve free ones. I am not sure we do that during the
memory hotplug to preserve the pool size and I am too tired to check
that now. This would be a different topic I guess.

--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration
  2017-11-30 19:57           ` Michal Hocko
@ 2017-11-30 20:06             ` Michal Hocko
  0 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2017-11-30 20:06 UTC (permalink / raw)
To: Mike Kravetz; +Cc: linux-mm, Naoya Horiguchi, LKML

On Thu 30-11-17 20:57:43, Michal Hocko wrote:
> On Thu 30-11-17 11:35:11, Mike Kravetz wrote:
> > On 11/29/2017 11:57 PM, Michal Hocko wrote:
> > > On Wed 29-11-17 11:52:53, Mike Kravetz wrote:
> > >> On 11/29/2017 01:22 AM, Michal Hocko wrote:
> > >>> What about this on top. I haven't tested this yet though.
> > >>
> > >> Yes, this would work.
> > >>
> > >> However, I think a simple modification to your previous free_huge_page
> > >> changes would make this unnecessary.  I was confused in your previous
> > >> patch because you decremented the per-node surplus page count, but not
> > >> the global count.  I think it would have been correct (and made this
> > >> patch unnecessary) if you decremented the global counter there as well.
> > >
> > > We cannot really increment the global counter because the overall number
> > > of surplus pages during migration doesn't increase.
> >
> > I was not suggesting we increment the global surplus count.  Rather,
> > your previous patch should have decremented the global surplus count in
> > free_huge_page.  Something like:
>
> Sorry, I meant to say decrement. The point is that the overall surplus
> count doesn't change after the migration. The only thing that _might_
> change is the per-node distribution of surplus pages. That is why I
> think we should handle that during the migration.

Let me clarify. The migration context is the only place where we have
both the old and the new page, so this sounds like the only place to
know that we need to transfer the per-node surplus state.

--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration
  2017-11-28 14:12 ` [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration Michal Hocko
  2017-11-29  1:39   ` Mike Kravetz
  2017-11-29  9:22   ` Michal Hocko
@ 2017-11-29  9:51   ` Michal Hocko
  2017-11-29 11:33   ` [PATCH RFC v2 " Michal Hocko
  3 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2017-11-29 9:51 UTC (permalink / raw)
To: linux-mm; +Cc: Mike Kravetz, Naoya Horiguchi, LKML

On Tue 28-11-17 15:12:11, Michal Hocko wrote:
[...]
> +/*
> + * Internal hugetlb specific page flag. Do not use outside of the hugetlb
> + * code
> + */
> +static inline bool PageHugeTemporary(struct page *page)
> +{
> +	if (!PageHuge(page))
> +		return false;
> +
> +	return page[2].flags == -1U;
> +}
> +
> +static inline void SetPageHugeTemporary(struct page *page)
> +{
> +	page[2].flags = -1U;
> +}
> +
> +static inline void ClearPageHugeTemporary(struct page *page)
> +{
> +	page[2].flags = 0;
> +}

Oops, this is obviously not OK. I was just lucky not to hit BUG_ONs
because I am clearly overwriting node/zone data in flags. I will have to
find something else to abuse. I will go with my favorite mapping pointer
which is not used at all.

---
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1be43563e226..db7544a0b7b6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1259,17 +1259,17 @@ static inline bool PageHugeTemporary(struct page *page)
 	if (!PageHuge(page))
 		return false;

-	return page[2].flags == -1U;
+	return (unsigned long)page[2].mapping == -1U;
 }

 static inline void SetPageHugeTemporary(struct page *page)
 {
-	page[2].flags = -1U;
+	page[2].mapping = (void *)-1U;
 }

 static inline void ClearPageHugeTemporary(struct page *page)
 {
-	page[2].flags = 0;
+	page[2].mapping = NULL;
 }

--
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 23+ messages in thread
* [PATCH RFC v2 2/2] mm, hugetlb: do not rely on overcommit limit during migration
  2017-11-28 14:12 ` [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration Michal Hocko
  ` (2 preceding siblings ...)
  2017-11-29  9:51   ` Michal Hocko
@ 2017-11-29 11:33   ` Michal Hocko
  3 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2017-11-29 11:33 UTC (permalink / raw)
To: linux-mm; +Cc: Mike Kravetz, Naoya Horiguchi, LKML

OK, so this is the v2 with all the fixups folded in. It doesn't blow up
immediately and even seems to work.

---

^ permalink raw reply	[flat|nested] 23+ messages in thread
end of thread, other threads:[~2017-11-30 20:06 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-22 15:28 hugetlb page migration vs. overcommit Michal Hocko
2017-11-22 19:11 ` Mike Kravetz
2017-11-23  9:21   ` Michal Hocko
2017-11-27  6:27   ` Naoya Horiguchi
2017-11-28 10:19 ` Michal Hocko
2017-11-28 14:12   ` Michal Hocko
2017-11-28 14:12     ` [PATCH RFC 1/2] mm, hugetlb: unify core page allocation accounting and initialization Michal Hocko
2017-11-28 21:34       ` Mike Kravetz
2017-11-29  6:57         ` Michal Hocko
2017-11-29 19:09           ` Mike Kravetz
2017-11-28 14:12     ` [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration Michal Hocko
2017-11-29  1:39       ` Mike Kravetz
2017-11-29  7:17         ` Michal Hocko
2017-11-29  9:22           ` Michal Hocko
2017-11-29  9:40             ` Michal Hocko
2017-11-29 11:23             ` Michal Hocko
2017-11-29 19:52             ` Mike Kravetz
2017-11-30  7:57               ` Michal Hocko
2017-11-30 19:35                 ` Mike Kravetz
2017-11-30 19:57                   ` Michal Hocko
2017-11-30 20:06                     ` Michal Hocko
2017-11-29  9:51           ` Michal Hocko
2017-11-29 11:33           ` [PATCH RFC v2 " Michal Hocko