linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/2] alloc_huge_page/hugetlb_reserve_pages race
@ 2015-05-18 17:49 Mike Kravetz
  2015-05-18 17:49 ` [PATCH 1/2] mm/hugetlb: compute/return the number of regions added by region_add() Mike Kravetz
  2015-05-18 17:49 ` [PATCH 2/2] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages Mike Kravetz
  0 siblings, 2 replies; 6+ messages in thread
From: Mike Kravetz @ 2015-05-18 17:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Naoya Horiguchi, Davidlohr Bueso, David Rientjes,
	Luiz Capitulino, Andrew Morton, Mike Kravetz

While working on hugetlbfs fallocate support, I noticed the following
race in the existing code.  It is unlikely that this race is hit very
often in the current code.  However, if more functionality to add and
remove pages in hugetlbfs mappings (such as fallocate) is added, the
likelihood of hitting this race will increase.

alloc_huge_page and hugetlb_reserve_pages use information from the
reserve map to determine if there are enough available huge pages to
complete the operation, as well as adjust global reserve and subpool
usage counts.  The order of operations is as follows:
- call region_chg() to determine the expected change based on reserve map
- determine if enough resources are available for this operation
- adjust global counts based on the expected change
- call region_add() to update the reserve map
The issue is that reserve map could change between the call to region_chg
and region_add.  In this case, the counters which were adjusted based on
the output of region_chg will not be correct.

In order to hit this race today, there must be an existing shared hugetlb
mapping created with the MAP_NORESERVE flag.  A page fault to allocate a
huge page via this mapping must occur at the same time that another task
is mapping the same region without the MAP_NORESERVE flag.

The patch set does not prevent the race from happening.  Rather, it adds
simple functionality to detect when the race has occurred.  If a race is
detected, then the incorrect counts are adjusted.

Mike Kravetz (2):
  mm/hugetlb: compute/return the number of regions added by region_add()
  mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages

 mm/hugetlb.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 48 insertions(+), 8 deletions(-)

-- 
2.1.0



* [PATCH 1/2] mm/hugetlb: compute/return the number of regions added by region_add()
  2015-05-18 17:49 [PATCH 0/2] alloc_huge_page/hugetlb_reserve_pages race Mike Kravetz
@ 2015-05-18 17:49 ` Mike Kravetz
  2015-05-21 23:30   ` Andrew Morton
  2015-05-18 17:49 ` [PATCH 2/2] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages Mike Kravetz
  1 sibling, 1 reply; 6+ messages in thread
From: Mike Kravetz @ 2015-05-18 17:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Naoya Horiguchi, Davidlohr Bueso, David Rientjes,
	Luiz Capitulino, Andrew Morton, Mike Kravetz

Modify region_add() to keep track of regions (pages) added to the
reserve map and return this value.  The return value can be
compared to the return value of region_chg() to determine if the
map was modified between calls.  Make vma_commit_reservation()
also pass along the return value of region_add().  The special
case return values of vma_needs_reservation() should also be
taken into account when determining the return value of
vma_commit_reservation().

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c41b2a0..7f64034 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -156,6 +156,7 @@ static long region_add(struct resv_map *resv, long f, long t)
 {
 	struct list_head *head = &resv->regions;
 	struct file_region *rg, *nrg, *trg;
+	long chg = 0;
 
 	spin_lock(&resv->lock);
 	/* Locate the region we are either in or before. */
@@ -181,14 +182,17 @@ static long region_add(struct resv_map *resv, long f, long t)
 		if (rg->to > t)
 			t = rg->to;
 		if (rg != nrg) {
+			chg -= (rg->to - rg->from);
 			list_del(&rg->link);
 			kfree(rg);
 		}
 	}
+	chg += (nrg->from - f);
 	nrg->from = f;
+	chg += t - nrg->to;
 	nrg->to = t;
 	spin_unlock(&resv->lock);
-	return 0;
+	return chg;
 }
 
 static long region_chg(struct resv_map *resv, long f, long t)
@@ -1349,18 +1353,25 @@ static long vma_needs_reservation(struct hstate *h,
 	else
 		return chg < 0 ? chg : 0;
 }
-static void vma_commit_reservation(struct hstate *h,
+
+static long vma_commit_reservation(struct hstate *h,
 			struct vm_area_struct *vma, unsigned long addr)
 {
 	struct resv_map *resv;
 	pgoff_t idx;
+	long add;
 
 	resv = vma_resv_map(vma);
 	if (!resv)
-		return;
+		return 1;
 
 	idx = vma_hugecache_offset(h, vma, addr);
-	region_add(resv, idx, idx + 1);
+	add = region_add(resv, idx, idx + 1);
+
+	if (vma->vm_flags & VM_MAYSHARE)
+		return add;
+	else
+		return 0;
 }
 
 static struct page *alloc_huge_page(struct vm_area_struct *vma,
-- 
2.1.0



* [PATCH 2/2] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages
  2015-05-18 17:49 [PATCH 0/2] alloc_huge_page/hugetlb_reserve_pages race Mike Kravetz
  2015-05-18 17:49 ` [PATCH 1/2] mm/hugetlb: compute/return the number of regions added by region_add() Mike Kravetz
@ 2015-05-18 17:49 ` Mike Kravetz
  2015-05-21 23:35   ` Andrew Morton
  1 sibling, 1 reply; 6+ messages in thread
From: Mike Kravetz @ 2015-05-18 17:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Naoya Horiguchi, Davidlohr Bueso, David Rientjes,
	Luiz Capitulino, Andrew Morton, Mike Kravetz

alloc_huge_page and hugetlb_reserve_pages use region_chg to
calculate the number of pages which will be added to the reserve
map.  Subpool and global reserve counts are adjusted based on
the output of region_chg.  Before the pages are actually added
to the reserve map, these routines could race and add fewer
pages than expected.  If this happens, the subpool and global
reserve counts are not correct.

Compare the number of pages actually added (region_add) to the number
expected to be added (region_chg).  If fewer pages are actually added,
this indicates a race; adjust the counters accordingly.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 37 +++++++++++++++++++++++++++++++++----
 1 file changed, 33 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7f64034..63f6d43 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1374,13 +1374,16 @@ static long vma_commit_reservation(struct hstate *h,
 		return 0;
 }
 
+/* Forward declaration */
+static int hugetlb_acct_memory(struct hstate *h, long delta);
+
 static struct page *alloc_huge_page(struct vm_area_struct *vma,
 				    unsigned long addr, int avoid_reserve)
 {
 	struct hugepage_subpool *spool = subpool_vma(vma);
 	struct hstate *h = hstate_vma(vma);
 	struct page *page;
-	long chg;
+	long chg, commit;
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg;
 
@@ -1421,7 +1424,20 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 
 	set_page_private(page, (unsigned long)spool);
 
-	vma_commit_reservation(h, vma, addr);
+	commit = vma_commit_reservation(h, vma, addr);
+	if (unlikely(chg > commit)) {
+		/*
+		 * The page was added to the reservation map between
+		 * vma_needs_reservation and vma_commit_reservation.
+		 * This indicates a race with hugetlb_reserve_pages.
+		 * Adjust for the subpool count incremented above AND
+		 * in hugetlb_reserve_pages for the same page.  Also,
+		 * the reservation count added in hugetlb_reserve_pages
+		 * no longer applies.
+		 */
+		hugepage_subpool_put_pages(spool, 1);
+		hugetlb_acct_memory(h, -1);
+	}
 	return page;
 
 out_uncharge_cgroup:
@@ -3512,8 +3528,21 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * consumed reservations are stored in the map. Hence, nothing
 	 * else has to be done for private mappings here
 	 */
-	if (!vma || vma->vm_flags & VM_MAYSHARE)
-		region_add(resv_map, from, to);
+	if (!vma || vma->vm_flags & VM_MAYSHARE) {
+		long add = region_add(resv_map, from, to);
+
+		if (unlikely(chg > add)) {
+			/*
+			 * pages in this range were added to the reserve
+			 * map between region_chg and region_add.  This
+			 * indicates a race with alloc_huge_page.  Adjust
+			 * the subpool and reserve counts modified above
+			 * based on the difference.
+			 */
+			hugepage_subpool_put_pages(spool, chg - add);
 +			hugetlb_acct_memory(h, -(chg - add));
+		}
+	}
 	return 0;
 out_err:
 	if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
-- 
2.1.0



* Re: [PATCH 1/2] mm/hugetlb: compute/return the number of regions added by region_add()
  2015-05-18 17:49 ` [PATCH 1/2] mm/hugetlb: compute/return the number of regions added by region_add() Mike Kravetz
@ 2015-05-21 23:30   ` Andrew Morton
  2015-05-21 23:43     ` Mike Kravetz
  0 siblings, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2015-05-21 23:30 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Naoya Horiguchi, Davidlohr Bueso,
	David Rientjes, Luiz Capitulino

On Mon, 18 May 2015 10:49:08 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> Modify region_add() to keep track of regions (pages) added to the
> reserve map and return this value.  The return value can be
> compared to the return value of region_chg() to determine if the
> map was modified between calls.  Make vma_commit_reservation()
> also pass along the return value of region_add().  The special
> case return values of vma_needs_reservation() should also be
> taken into account when determining the return value of
> vma_commit_reservation().

Could we please get this code slightly documented while it's hot in
your mind?

- One has to do an extraordinary amount of reading to discover that
  the units of file_region.from and .to are "multiples of
  1<<huge_page_order(h)" (where "h" is imponderable).

  Let's get this written down?

- Is file_region.to inclusive or exclusive?

- Why are they called "from" and "to" anyway?  We usually use
  "start" and "end" for such things.


> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -156,6 +156,7 @@ static long region_add(struct resv_map *resv, long f, long t)
>  {
>  	struct list_head *head = &resv->regions;
>  	struct file_region *rg, *nrg, *trg;
> +	long chg = 0;
>  
>  	spin_lock(&resv->lock);
>  	/* Locate the region we are either in or before. */
> @@ -181,14 +182,17 @@ static long region_add(struct resv_map *resv, long f, long t)
>  		if (rg->to > t)
>  			t = rg->to;
>  		if (rg != nrg) {
> +			chg -= (rg->to - rg->from);
>  			list_del(&rg->link);
>  			kfree(rg);
>  		}
>  	}
> +	chg += (nrg->from - f);
>  	nrg->from = f;
> +	chg += t - nrg->to;
>  	nrg->to = t;
>  	spin_unlock(&resv->lock);
> -	return 0;
> +	return chg;
>  }

Let's document the return value.  It appears that this function is
designed to return a negative number (units?) on a successful addition.
Why, and what does that number represent.


>  static long region_chg(struct resv_map *resv, long f, long t)
> @@ -1349,18 +1353,25 @@ static long vma_needs_reservation(struct hstate *h,
>  	else
>  		return chg < 0 ? chg : 0;
>  }
> -static void vma_commit_reservation(struct hstate *h,
> +
> +static long vma_commit_reservation(struct hstate *h,
>  			struct vm_area_struct *vma, unsigned long addr)
>  {
>  	struct resv_map *resv;
>  	pgoff_t idx;
> +	long add;
>  
>  	resv = vma_resv_map(vma);
>  	if (!resv)
> -		return;
> +		return 1;
>  
>  	idx = vma_hugecache_offset(h, vma, addr);
> -	region_add(resv, idx, idx + 1);
> +	add = region_add(resv, idx, idx + 1);
> +
> +	if (vma->vm_flags & VM_MAYSHARE)
> +		return add;
> +	else
> +		return 0;
>  }

Let's document the return value here as well please.




* Re: [PATCH 2/2] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages
  2015-05-18 17:49 ` [PATCH 2/2] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages Mike Kravetz
@ 2015-05-21 23:35   ` Andrew Morton
  0 siblings, 0 replies; 6+ messages in thread
From: Andrew Morton @ 2015-05-21 23:35 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Naoya Horiguchi, Davidlohr Bueso,
	David Rientjes, Luiz Capitulino

On Mon, 18 May 2015 10:49:09 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> alloc_huge_page and hugetlb_reserve_pages use region_chg to
> calculate the number of pages which will be added to the reserve
> map.  Subpool and global reserve counts are adjusted based on
> the output of region_chg.  Before the pages are actually added
> to the reserve map, these routines could race and add fewer
> pages than expected.  If this happens, the subpool and global
> reserve counts are not correct.
> 
> Compare the number of pages actually added (region_add) to the number
> expected to be added (region_chg).  If fewer pages are actually added,
> this indicates a race; adjust the counters accordingly.
> 
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1374,13 +1374,16 @@ static long vma_commit_reservation(struct hstate *h,
>  		return 0;
>  }
>  
> +/* Forward declaration */
> +static int hugetlb_acct_memory(struct hstate *h, long delta);
> +

It's best to put forward declarations at top-of-file.  Otherwise we can
end up with multiple forward declarations if someone later needs the
symbol at an earlier site in the file.

Had you done that you might have noticed that hugetlb_acct_memory() was
already declared ;)

--- a/mm/hugetlb.c~mm-hugetlb-handle-races-in-alloc_huge_page-and-hugetlb_reserve_pages-fix
+++ a/mm/hugetlb.c
@@ -1475,9 +1475,6 @@ static long vma_commit_reservation(struc
 		return 0;
 }
 
-/* Forward declaration */
-static int hugetlb_acct_memory(struct hstate *h, long delta);
-
 static struct page *alloc_huge_page(struct vm_area_struct *vma,
 				    unsigned long addr, int avoid_reserve)
 {
_




* Re: [PATCH 1/2] mm/hugetlb: compute/return the number of regions added by region_add()
  2015-05-21 23:30   ` Andrew Morton
@ 2015-05-21 23:43     ` Mike Kravetz
  0 siblings, 0 replies; 6+ messages in thread
From: Mike Kravetz @ 2015-05-21 23:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Naoya Horiguchi, Davidlohr Bueso,
	David Rientjes, Luiz Capitulino

On 05/21/2015 04:30 PM, Andrew Morton wrote:
> On Mon, 18 May 2015 10:49:08 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
>> Modify region_add() to keep track of regions (pages) added to the
>> reserve map and return this value.  The return value can be
>> compared to the return value of region_chg() to determine if the
>> map was modified between calls.  Make vma_commit_reservation()
>> also pass along the return value of region_add().  The special
>> case return values of vma_needs_reservation() should also be
>> taken into account when determining the return value of
>> vma_commit_reservation().
>
> Could we please get this code slightly documented while it's hot in
> your mind?

Will do.  I'll provide an updated patch to address your questions and
better document this whole region/reserve map stuff.  It took me a few
days to wrap my head around how it actually works.

-- 
Mike Kravetz

> - One has to do an extraordinary amount of reading to discover that
>    the units of file_region.from and .to are "multiples of
>    1<<huge_page_order(h)" (where "h" is imponderable).
>
>    Let's get this written down?
>
> - Is file_region.to inclusive or exclusive?
>
> - Why are they called "from" and "to" anyway?  We usually use
>    "start" and "end" for such things.
>
>
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -156,6 +156,7 @@ static long region_add(struct resv_map *resv, long f, long t)
>>   {
>>   	struct list_head *head = &resv->regions;
>>   	struct file_region *rg, *nrg, *trg;
>> +	long chg = 0;
>>
>>   	spin_lock(&resv->lock);
>>   	/* Locate the region we are either in or before. */
>> @@ -181,14 +182,17 @@ static long region_add(struct resv_map *resv, long f, long t)
>>   		if (rg->to > t)
>>   			t = rg->to;
>>   		if (rg != nrg) {
>> +			chg -= (rg->to - rg->from);
>>   			list_del(&rg->link);
>>   			kfree(rg);
>>   		}
>>   	}
>> +	chg += (nrg->from - f);
>>   	nrg->from = f;
>> +	chg += t - nrg->to;
>>   	nrg->to = t;
>>   	spin_unlock(&resv->lock);
>> -	return 0;
>> +	return chg;
>>   }
>
> Let's document the return value.  It appears that this function is
> designed to return a negative number (units?) on a successful addition.
> Why, and what does that number represent.
>
>
>>   static long region_chg(struct resv_map *resv, long f, long t)
>> @@ -1349,18 +1353,25 @@ static long vma_needs_reservation(struct hstate *h,
>>   	else
>>   		return chg < 0 ? chg : 0;
>>   }
>> -static void vma_commit_reservation(struct hstate *h,
>> +
>> +static long vma_commit_reservation(struct hstate *h,
>>   			struct vm_area_struct *vma, unsigned long addr)
>>   {
>>   	struct resv_map *resv;
>>   	pgoff_t idx;
>> +	long add;
>>
>>   	resv = vma_resv_map(vma);
>>   	if (!resv)
>> -		return;
>> +		return 1;
>>
>>   	idx = vma_hugecache_offset(h, vma, addr);
>> -	region_add(resv, idx, idx + 1);
>> +	add = region_add(resv, idx, idx + 1);
>> +
>> +	if (vma->vm_flags & VM_MAYSHARE)
>> +		return add;
>> +	else
>> +		return 0;
>>   }
>
> Let's document the return value here as well please.
>


