* [PATCH 0/2] alloc_huge_page/hugetlb_reserve_pages race @ 2015-05-18 17:49 Mike Kravetz 2015-05-18 17:49 ` [PATCH 1/2] mm/hugetlb: compute/return the number of regions added by region_add() Mike Kravetz 2015-05-18 17:49 ` [PATCH 2/2] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages Mike Kravetz 0 siblings, 2 replies; 6+ messages in thread From: Mike Kravetz @ 2015-05-18 17:49 UTC (permalink / raw) To: linux-mm, linux-kernel Cc: Naoya Horiguchi, Davidlohr Bueso, David Rientjes, Luiz Capitulino, Andrew Morton, Mike Kravetz While working on hugetlbfs fallocate support, I noticed the following race in the existing code. It is unlikely that this race is hit very often in the current code. However, if more functionality to add and remove pages from hugetlbfs mappings (such as fallocate) is added, the likelihood of hitting this race will increase. alloc_huge_page and hugetlb_reserve_pages use information from the reserve map to determine if there are enough available huge pages to complete the operation, as well as adjust global reserve and subpool usage counts. The order of operations is as follows: - call region_chg() to determine the expected change based on the reserve map - determine if enough resources are available for this operation - adjust global counts based on the expected change - call region_add() to update the reserve map The issue is that the reserve map could change between the calls to region_chg and region_add. In this case, the counters which were adjusted based on the output of region_chg will not be correct. In order to hit this race today, there must be an existing shared hugetlb mmap created with the MAP_NORESERVE flag. A page fault to allocate a huge page via this mapping must occur at the same time another task is mapping the same region without the MAP_NORESERVE flag. The patch set does not prevent the race from happening. Rather, it adds simple functionality to detect when the race has occurred. 
If a race is detected, then the incorrect counts are adjusted. Mike Kravetz (2): mm/hugetlb: compute/return the number of regions added by region_add() mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages mm/hugetlb.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 48 insertions(+), 8 deletions(-) -- 2.1.0 ^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH 1/2] mm/hugetlb: compute/return the number of regions added by region_add() 2015-05-18 17:49 [PATCH 0/2] alloc_huge_page/hugetlb_reserve_pages race Mike Kravetz @ 2015-05-18 17:49 ` Mike Kravetz 2015-05-21 23:30 ` Andrew Morton 2015-05-18 17:49 ` [PATCH 2/2] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages Mike Kravetz 1 sibling, 1 reply; 6+ messages in thread From: Mike Kravetz @ 2015-05-18 17:49 UTC (permalink / raw) To: linux-mm, linux-kernel Cc: Naoya Horiguchi, Davidlohr Bueso, David Rientjes, Luiz Capitulino, Andrew Morton, Mike Kravetz Modify region_add() to keep track of regions (pages) added to the reserve map and return this value. The return value can be compared to the return value of region_chg() to determine if the map was modified between calls. Make vma_commit_reservation() also pass along the return value of region_add(). The special case return values of vma_needs_reservation() should also be taken into account when determining the return value of vma_commit_reservation(). Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> --- mm/hugetlb.c | 19 +++++++++++++++---- 1 file changed, 15 insertions(+), 4 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index c41b2a0..7f64034 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -156,6 +156,7 @@ static long region_add(struct resv_map *resv, long f, long t) { struct list_head *head = &resv->regions; struct file_region *rg, *nrg, *trg; + long chg = 0; spin_lock(&resv->lock); /* Locate the region we are either in or before. 
*/ @@ -181,14 +182,17 @@ static long region_add(struct resv_map *resv, long f, long t) if (rg->to > t) t = rg->to; if (rg != nrg) { + chg -= (rg->to - rg->from); list_del(&rg->link); kfree(rg); } } + chg += (nrg->from - f); nrg->from = f; + chg += t - nrg->to; nrg->to = t; spin_unlock(&resv->lock); - return 0; + return chg; } static long region_chg(struct resv_map *resv, long f, long t) @@ -1349,18 +1353,25 @@ static long vma_needs_reservation(struct hstate *h, else return chg < 0 ? chg : 0; } -static void vma_commit_reservation(struct hstate *h, + +static long vma_commit_reservation(struct hstate *h, struct vm_area_struct *vma, unsigned long addr) { struct resv_map *resv; pgoff_t idx; + long add; resv = vma_resv_map(vma); if (!resv) - return; + return 1; idx = vma_hugecache_offset(h, vma, addr); - region_add(resv, idx, idx + 1); + add = region_add(resv, idx, idx + 1); + + if (vma->vm_flags & VM_MAYSHARE) + return add; + else + return 0; } static struct page *alloc_huge_page(struct vm_area_struct *vma, -- 2.1.0 ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH 1/2] mm/hugetlb: compute/return the number of regions added by region_add() 2015-05-18 17:49 ` [PATCH 1/2] mm/hugetlb: compute/return the number of regions added by region_add() Mike Kravetz @ 2015-05-21 23:30 ` Andrew Morton 2015-05-21 23:43 ` Mike Kravetz 0 siblings, 1 reply; 6+ messages in thread From: Andrew Morton @ 2015-05-21 23:30 UTC (permalink / raw) To: Mike Kravetz Cc: linux-mm, linux-kernel, Naoya Horiguchi, Davidlohr Bueso, David Rientjes, Luiz Capitulino On Mon, 18 May 2015 10:49:08 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote: > Modify region_add() to keep track of regions (pages) added to the > reserve map and return this value. The return value can be > compared to the return value of region_chg() to determine if the > map was modified between calls. Make vma_commit_reservation() > also pass along the return value of region_add(). The special > case return values of vma_needs_reservation() should also be > taken into account when determining the return value of > vma_commit_reservation(). Could we please get this code slightly documented while it's hot in your mind? - One has to do an extraordinary amount of reading to discover that the units of file_region.from and .to are "multiples of 1<<huge_page_order(h)" (where "h" is imponderable). Let's get this written down? - Is file_region.to inclusive or exclusive? - Why are they called "from" and "to" anyway? We usually use "start" and "end" for such things. > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -156,6 +156,7 @@ static long region_add(struct resv_map *resv, long f, long t) > { > struct list_head *head = &resv->regions; > struct file_region *rg, *nrg, *trg; > + long chg = 0; > > spin_lock(&resv->lock); > /* Locate the region we are either in or before. 
*/ > @@ -181,14 +182,17 @@ static long region_add(struct resv_map *resv, long f, long t) > if (rg->to > t) > t = rg->to; > if (rg != nrg) { > + chg -= (rg->to - rg->from); > list_del(&rg->link); > kfree(rg); > } > } > + chg += (nrg->from - f); > nrg->from = f; > + chg += t - nrg->to; > nrg->to = t; > spin_unlock(&resv->lock); > - return 0; > + return chg; > } Let's document the return value. It appears that this function is designed to return a negative number (units?) on a successful addition. Why, and what does that number represent. > static long region_chg(struct resv_map *resv, long f, long t) > @@ -1349,18 +1353,25 @@ static long vma_needs_reservation(struct hstate *h, > else > return chg < 0 ? chg : 0; > } > -static void vma_commit_reservation(struct hstate *h, > + > +static long vma_commit_reservation(struct hstate *h, > struct vm_area_struct *vma, unsigned long addr) > { > struct resv_map *resv; > pgoff_t idx; > + long add; > > resv = vma_resv_map(vma); > if (!resv) > - return; > + return 1; > > idx = vma_hugecache_offset(h, vma, addr); > - region_add(resv, idx, idx + 1); > + add = region_add(resv, idx, idx + 1); > + > + if (vma->vm_flags & VM_MAYSHARE) > + return add; > + else > + return 0; > } Let's document the return value here as well please. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 1/2] mm/hugetlb: compute/return the number of regions added by region_add() 2015-05-21 23:30 ` Andrew Morton @ 2015-05-21 23:43 ` Mike Kravetz 0 siblings, 0 replies; 6+ messages in thread From: Mike Kravetz @ 2015-05-21 23:43 UTC (permalink / raw) To: Andrew Morton Cc: linux-mm, linux-kernel, Naoya Horiguchi, Davidlohr Bueso, David Rientjes, Luiz Capitulino On 05/21/2015 04:30 PM, Andrew Morton wrote: > On Mon, 18 May 2015 10:49:08 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote: > >> Modify region_add() to keep track of regions(pages) added to the >> reserve map and return this value. The return value can be >> compared to the return value of region_chg() to determine if the >> map was modified between calls. Make vma_commit_reservation() >> also pass along the return value of region_add(). The special >> case return values of vma_needs_reservation() should also be >> taken into account when determining the return value of >> vma_commit_reservation(). > > Could we please get this code slightly documented while it's hot in > your mind? Will do. I'll provide an updated patch to address your questions and better document this whole region/reserve map stuff. It took me a few days to wrap my head around how it actually works. -- Mike Kravetz > - One has to do an extraordinary amount of reading to discover that > the units of file_region.from and .to are "multiples of > 1<<huge_page_order(h)" (where "h" is imponderable). > > Let's get this written down? > > - Is file_region.to inclusive or exclusive? > > - What are they called "from" and "to" anyway? We usually use > "start" and "end" for such things. > > >> --- a/mm/hugetlb.c >> +++ b/mm/hugetlb.c >> @@ -156,6 +156,7 @@ static long region_add(struct resv_map *resv, long f, long t) >> { >> struct list_head *head = &resv->regions; >> struct file_region *rg, *nrg, *trg; >> + long chg = 0; >> >> spin_lock(&resv->lock); >> /* Locate the region we are either in or before. 
*/ >> @@ -181,14 +182,17 @@ static long region_add(struct resv_map *resv, long f, long t) >> if (rg->to > t) >> t = rg->to; >> if (rg != nrg) { >> + chg -= (rg->to - rg->from); >> list_del(&rg->link); >> kfree(rg); >> } >> } >> + chg += (nrg->from - f); >> nrg->from = f; >> + chg += t - nrg->to; >> nrg->to = t; >> spin_unlock(&resv->lock); >> - return 0; >> + return chg; >> } > > Let's document the return value. It appears that this function is > designed to return a negative number (units?) on a successful addition. > Why, and what does that number represent. > > >> static long region_chg(struct resv_map *resv, long f, long t) >> @@ -1349,18 +1353,25 @@ static long vma_needs_reservation(struct hstate *h, >> else >> return chg < 0 ? chg : 0; >> } >> -static void vma_commit_reservation(struct hstate *h, >> + >> +static long vma_commit_reservation(struct hstate *h, >> struct vm_area_struct *vma, unsigned long addr) >> { >> struct resv_map *resv; >> pgoff_t idx; >> + long add; >> >> resv = vma_resv_map(vma); >> if (!resv) >> - return; >> + return 1; >> >> idx = vma_hugecache_offset(h, vma, addr); >> - region_add(resv, idx, idx + 1); >> + add = region_add(resv, idx, idx + 1); >> + >> + if (vma->vm_flags & VM_MAYSHARE) >> + return add; >> + else >> + return 0; >> } > > Let's document the return value here as well please. > ^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH 2/2] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages 2015-05-18 17:49 [PATCH 0/2] alloc_huge_page/hugetlb_reserve_pages race Mike Kravetz 2015-05-18 17:49 ` [PATCH 1/2] mm/hugetlb: compute/return the number of regions added by region_add() Mike Kravetz @ 2015-05-18 17:49 ` Mike Kravetz 2015-05-21 23:35 ` Andrew Morton 1 sibling, 1 reply; 6+ messages in thread From: Mike Kravetz @ 2015-05-18 17:49 UTC (permalink / raw) To: linux-mm, linux-kernel Cc: Naoya Horiguchi, Davidlohr Bueso, David Rientjes, Luiz Capitulino, Andrew Morton, Mike Kravetz alloc_huge_page and hugetlb_reserve_pages use region_chg to calculate the number of pages which will be added to the reserve map. Subpool and global reserve counts are adjusted based on the output of region_chg. Before the pages are actually added to the reserve map, these routines could race and add fewer pages than expected. If this happens, the subpool and global reserve counts are not correct. Compare the number of pages actually added (region_add) to those expected to be added (region_chg). If fewer pages are actually added, this indicates a race; adjust the counters accordingly. 
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> --- mm/hugetlb.c | 37 +++++++++++++++++++++++++++++++++---- 1 file changed, 33 insertions(+), 4 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 7f64034..63f6d43 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1374,13 +1374,16 @@ static long vma_commit_reservation(struct hstate *h, return 0; } +/* Forward declaration */ +static int hugetlb_acct_memory(struct hstate *h, long delta); + static struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr, int avoid_reserve) { struct hugepage_subpool *spool = subpool_vma(vma); struct hstate *h = hstate_vma(vma); struct page *page; - long chg; + long chg, commit; int ret, idx; struct hugetlb_cgroup *h_cg; @@ -1421,7 +1424,20 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma, set_page_private(page, (unsigned long)spool); - vma_commit_reservation(h, vma, addr); + commit = vma_commit_reservation(h, vma, addr); + if (unlikely(chg > commit)) { + /* + * The page was added to the reservation map between + * vma_needs_reservation and vma_commit_reservation. + * This indicates a race with hugetlb_reserve_pages. + * Adjust for the subpool count incremented above AND + * in hugetlb_reserve_pages for the same page. Also, + * the reservation count added in hugetlb_reserve_pages + * no longer applies. + */ + hugepage_subpool_put_pages(spool, 1); + hugetlb_acct_memory(h, -1); + } return page; out_uncharge_cgroup: @@ -3512,8 +3528,21 @@ int hugetlb_reserve_pages(struct inode *inode, * consumed reservations are stored in the map. Hence, nothing * else has to be done for private mappings here */ - if (!vma || vma->vm_flags & VM_MAYSHARE) - region_add(resv_map, from, to); + if (!vma || vma->vm_flags & VM_MAYSHARE) { + long add = region_add(resv_map, from, to); + + if (unlikely(chg > add)) { + /* + * pages in this range were added to the reserve + * map between region_chg and region_add. This + * indicates a race with alloc_huge_page. 
Adjust + * the subpool and reserve counts modified above + * based on the difference. + */ + hugepage_subpool_put_pages(spool, chg - add); + hugetlb_acct_memory(h, -(chg - add)); + } + } return 0; out_err: if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER)) -- 2.1.0 ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH 2/2] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages 2015-05-18 17:49 ` [PATCH 2/2] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages Mike Kravetz @ 2015-05-21 23:35 ` Andrew Morton 0 siblings, 0 replies; 6+ messages in thread From: Andrew Morton @ 2015-05-21 23:35 UTC (permalink / raw) To: Mike Kravetz Cc: linux-mm, linux-kernel, Naoya Horiguchi, Davidlohr Bueso, David Rientjes, Luiz Capitulino On Mon, 18 May 2015 10:49:09 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote: > alloc_huge_page and hugetlb_reserve_pages use region_chg to > calculate the number of pages which will be added to the reserve > map. Subpool and global reserve counts are adjusted based on > the output of region_chg. Before the pages are actually added > to the reserve map, these routines could race and add fewer > pages than expected. If this happens, the subpool and global > reserve counts are not correct. > > Compare the number of pages actually added (region_add) to those > expected to be added (region_chg). If fewer pages are actually added, > this indicates a race; adjust the counters accordingly. > > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -1374,13 +1374,16 @@ static long vma_commit_reservation(struct hstate *h, > return 0; > } > > +/* Forward declaration */ > +static int hugetlb_acct_memory(struct hstate *h, long delta); > + It's best to put forward declarations at top-of-file. Otherwise we can end up with multiple forward declarations if someone later needs the symbol at an earlier site in the file. 
Had you done that you might have noticed that hugetlb_acct_memory() was already declared ;) --- a/mm/hugetlb.c~mm-hugetlb-handle-races-in-alloc_huge_page-and-hugetlb_reserve_pages-fix +++ a/mm/hugetlb.c @@ -1475,9 +1475,6 @@ static long vma_commit_reservation(struc return 0; } -/* Forward declaration */ -static int hugetlb_acct_memory(struct hstate *h, long delta); - static struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr, int avoid_reserve) { _ ^ permalink raw reply [flat|nested] 6+ messages in thread