linux-mm.kvack.org archive mirror
* [PATCH v3 0/3] alloc_huge_page/hugetlb_reserve_pages race
@ 2015-05-27 17:56 Mike Kravetz
  2015-05-27 17:56 ` [PATCH v3 1/3] mm/hugetlb: document the reserve map/region tracking routines Mike Kravetz
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Mike Kravetz @ 2015-05-27 17:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Naoya Horiguchi, Davidlohr Bueso, David Rientjes,
	Luiz Capitulino, Andrew Morton, Mike Kravetz

While working on hugetlbfs fallocate support, I noticed the following
race in the existing code.  It is unlikely that this race is hit very
often today.  However, if more functionality to add and remove pages
from hugetlbfs mappings (such as fallocate) is added, the likelihood
of hitting this race will increase.

alloc_huge_page and hugetlb_reserve_pages use information from the
reserve map to determine if there are enough available huge pages to
complete the operation, as well as adjust global reserve and subpool
usage counts.  The order of operations is as follows:
- call region_chg() to determine the expected change based on reserve map
- determine if enough resources are available for this operation
- adjust global counts based on the expected change
- call region_add() to update the reserve map
The issue is that the reserve map could change between the calls to
region_chg and region_add.  In this case, the counters which were
adjusted based on the output of region_chg will not be correct.
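
The window can be seen in a simplified sketch of the sequence (the
availability check and count adjustment below are hypothetical
stand-ins for the real subpool/global accounting):

	chg = region_chg(resv, f, t);	/* expected change from reserve map */
	if (!enough_huge_pages(chg))	/* hypothetical availability check */
		return -ENOSPC;
	adjust_global_counts(chg);	/* hypothetical count adjustment */
	/* another task may modify the reserve map in this window */
	region_add(resv, f, t);		/* may add fewer than chg pages */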

In order to hit this race today, there must be an existing shared
hugetlb mapping created with the MAP_NORESERVE flag.  A page fault
that allocates a huge page via this mapping must occur at the same
time another task is mapping the same region without the
MAP_NORESERVE flag.
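
For reference, here is a minimal userspace sketch of those conditions
(assuming a hugetlbfs mount at /mnt/huge and a 2MB huge page size; the
file name is invented and error handling is omitted):

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		size_t sz = 2UL * 1024 * 1024;	/* one 2MB huge page */
		int fd = open("/mnt/huge/racefile", O_CREAT | O_RDWR, 0644);
		char *p;

		ftruncate(fd, sz);
		/* shared mapping that takes no reservation up front */
		p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_NORESERVE, fd, 0);
		/*
		 * Meanwhile, another task maps the same range without
		 * MAP_NORESERVE, which calls hugetlb_reserve_pages():
		 *
		 *	mmap(NULL, sz, PROT_READ | PROT_WRITE,
		 *	     MAP_SHARED, fd, 0);
		 */
		p[0] = 1;	/* this fault calls alloc_huge_page() */

		munmap(p, sz);
		close(fd);
		return 0;
	}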

The patch set does not prevent the race from happening.  Rather, it adds
simple functionality to detect when the race has occurred.  If a race is
detected, then the incorrect counts are adjusted.

Review comments pointed out the need for documentation of the existing
region/reserve map routines.  This patch set also adds documentation
in this area.

v3:
  Created a separate patch for the new documentation added in v2
  Added VM_BUG_ON() to region_add() at the suggestion of Naoya Horiguchi
  __vma_reservation_common keys off the parameter commit for easier reading
v2:
  Added documentation for the region/reserve map routines
  Created a common routine for vma_needs_reservation and
    vma_commit_reservation to help prevent them from drifting
    apart in the future.

Mike Kravetz (3):
  mm/hugetlb: document the reserve map/region tracking routines
  mm/hugetlb: compute/return the number of regions added by region_add()
  mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages

 mm/hugetlb.c | 158 +++++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 128 insertions(+), 30 deletions(-)

-- 
2.1.0


* [PATCH v3 1/3] mm/hugetlb: document the reserve map/region tracking routines
  2015-05-27 17:56 [PATCH v3 0/3] alloc_huge_page/hugetlb_reserve_pages race Mike Kravetz
@ 2015-05-27 17:56 ` Mike Kravetz
  2015-05-27 17:56 ` [PATCH v3 2/3] mm/hugetlb: compute/return the number of regions added by region_add() Mike Kravetz
  2015-05-27 17:56 ` [PATCH v3 3/3] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages Mike Kravetz
  2 siblings, 0 replies; 9+ messages in thread
From: Mike Kravetz @ 2015-05-27 17:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Naoya Horiguchi, Davidlohr Bueso, David Rientjes,
	Luiz Capitulino, Andrew Morton, Mike Kravetz

This is a documentation-only patch; it does not modify any code.  It
adds descriptions of the routines used for reserve map/region
tracking.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 50 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 54f129d..ad2c628 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -212,8 +212,20 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
  * Region tracking -- allows tracking of reservations and instantiated pages
  *                    across the pages in a mapping.
  *
- * The region data structures are embedded into a resv_map and
- * protected by a resv_map's lock
+ * The region data structures are embedded into a resv_map and protected
+ * by a resv_map's lock.  The set of regions within the resv_map represents
+ * reservations for huge pages, or huge pages that have already been
+ * instantiated within the map.  The from and to elements are huge page
+ * indices into the associated mapping.  from indicates the starting index
+ * of the region.  to represents the first index past the end of the region.
+ *
+ * For example, a file region structure with from == 0 and to == 4 represents
+ * four huge pages in a mapping.  It is important to note that the to element
+ * represents the first index past the end of the region.  This is used in
+ * arithmetic as 4(to) - 0(from) = 4 huge pages in the region.
+ *
+ * Interval notation of the form [from, to) will be used to indicate that
+ * the endpoint from is inclusive and to is exclusive.
  */
 struct file_region {
 	struct list_head link;
@@ -221,6 +233,14 @@ struct file_region {
 	long to;
 };
 
+/*
+ * Add the huge page range represented by [f, t) to the reserve
+ * map.  Existing regions will be expanded to accommodate the
+ * specified range.  We know only existing regions need to be
+ * expanded, because region_add is only called after region_chg
+ * with the same range.  If a new file_region structure must
+ * be allocated, it is done in region_chg.
+ */
 static long region_add(struct resv_map *resv, long f, long t)
 {
 	struct list_head *head = &resv->regions;
@@ -260,6 +280,25 @@ static long region_add(struct resv_map *resv, long f, long t)
 	return 0;
 }
 
+/*
+ * Examine the existing reserve map and determine how many
+ * huge pages in the specified range [f, t) are NOT currently
+ * represented.  This routine is called before a subsequent
+ * call to region_add that will actually modify the reserve
+ * map to add the specified range [f, t).  region_chg does
+ * not change the number of huge pages represented by the
+ * map.  However, if the existing regions in the map cannot
+ * be expanded to represent the new range, a new file_region
+ * structure is added to the map as a placeholder.  This is
+ * so that the subsequent region_add call will have all the
+ * regions it needs and will not fail.
+ *
+ * Returns the number of huge pages that need to be added
+ * to the existing reservation map for the range [f, t).
+ * This number is greater than or equal to zero.  -ENOMEM is
+ * returned if a new file_region structure is needed and
+ * cannot be allocated.
+ */
 static long region_chg(struct resv_map *resv, long f, long t)
 {
 	struct list_head *head = &resv->regions;
@@ -326,6 +365,11 @@ out_nrg:
 	return chg;
 }
 
+/*
+ * Truncate the reserve map at index 'end'.  Modify/truncate any
+ * region which contains end.  Delete any regions past end.
+ * Return the number of huge pages removed from the map.
+ */
 static long region_truncate(struct resv_map *resv, long end)
 {
 	struct list_head *head = &resv->regions;
@@ -361,6 +405,10 @@ out:
 	return chg;
 }
 
+/*
+ * Count and return the number of huge pages in the reserve map
+ * that intersect with the range [f, t).
+ */
 static long region_count(struct resv_map *resv, long f, long t)
 {
 	struct list_head *head = &resv->regions;
-- 
2.1.0


* [PATCH v3 2/3] mm/hugetlb: compute/return the number of regions added by region_add()
  2015-05-27 17:56 [PATCH v3 0/3] alloc_huge_page/hugetlb_reserve_pages race Mike Kravetz
  2015-05-27 17:56 ` [PATCH v3 1/3] mm/hugetlb: document the reserve map/region tracking routines Mike Kravetz
@ 2015-05-27 17:56 ` Mike Kravetz
  2015-05-29  1:48   ` Naoya Horiguchi
  2015-05-27 17:56 ` [PATCH v3 3/3] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages Mike Kravetz
  2 siblings, 1 reply; 9+ messages in thread
From: Mike Kravetz @ 2015-05-27 17:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Naoya Horiguchi, Davidlohr Bueso, David Rientjes,
	Luiz Capitulino, Andrew Morton, Mike Kravetz

Modify region_add() to keep track of regions (pages) added to the
reserve map and return this value.  The return value can be
compared to the return value of region_chg() to determine if the
map was modified between calls.

Make vma_commit_reservation() also pass along the return value of
region_add().  In the normal case, we want vma_commit_reservation
to return the same value as the preceding call to vma_needs_reservation.
Create a common __vma_reservation_common routine to help keep the
special case return values in sync.
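
To illustrate the new return value (numbers invented for this
example), suppose the reserve map holds the single region [0, 2):

	chg = region_chg(resv, 0, 4);	/* returns 2: pages 2 and 3 missing */
	add = region_add(resv, 0, 4);	/* returns 2: map becomes [0, 4) */

If another task instantiates the page at index 2 between the two
calls, the map already contains [0, 3) when region_add runs; it then
adds only one page and returns 1, so the caller sees add < chg.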

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 72 ++++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 48 insertions(+), 24 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ad2c628..b3d3d59 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -240,11 +240,15 @@ struct file_region {
  * expanded, because region_add is only called after region_chg
  * with the same range.  If a new file_region structure must
  * be allocated, it is done in region_chg.
+ *
+ * Return the number of new huge pages added to the map.  This
+ * number is greater than or equal to zero.
  */
 static long region_add(struct resv_map *resv, long f, long t)
 {
 	struct list_head *head = &resv->regions;
 	struct file_region *rg, *nrg, *trg;
+	long add = 0;
 
 	spin_lock(&resv->lock);
 	/* Locate the region we are either in or before. */
@@ -270,14 +274,24 @@ static long region_add(struct resv_map *resv, long f, long t)
 		if (rg->to > t)
 			t = rg->to;
 		if (rg != nrg) {
+			/* Decrement the return value by the deleted
+			 * range.  Another range will span this area, so
+			 * by the end of the routine add will be >= zero.
+			 */
+			add -= (rg->to - rg->from);
 			list_del(&rg->link);
 			kfree(rg);
 		}
 	}
+
+	add += (nrg->from - f);		/* Added to beginning of region */
 	nrg->from = f;
+	add += t - nrg->to;		/* Added to end of region */
 	nrg->to = t;
+
 	spin_unlock(&resv->lock);
-	return 0;
+	VM_BUG_ON(add < 0);
+	return add;
 }
 
 /*
@@ -1472,46 +1486,56 @@ static void return_unused_surplus_pages(struct hstate *h,
 }
 
 /*
- * Determine if the huge page at addr within the vma has an associated
- * reservation.  Where it does not we will need to logically increase
- * reservation and actually increase subpool usage before an allocation
- * can occur.  Where any new reservation would be required the
- * reservation change is prepared, but not committed.  Once the page
- * has been allocated from the subpool and instantiated the change should
- * be committed via vma_commit_reservation.  No action is required on
- * failure.
+ * vma_needs_reservation and vma_commit_reservation are used by the huge
+ * page allocation routines to manage reservations.
+ *
+ * vma_needs_reservation is called to determine if the huge page at addr
+ * within the vma has an associated reservation.  If a reservation is
+ * needed, the value 1 is returned.  The caller is then responsible for
+ * managing the global reservation and subpool usage counts.  After
+ * the huge page has been allocated, vma_commit_reservation is called
+ * to add the page to the reservation map.
+ *
+ * In the normal case, vma_commit_reservation returns the same value
+ * as the preceding vma_needs_reservation call.  The only time this
+ * is not the case is if a reserve map was changed between calls.  It
+ * is the responsibility of the caller to notice the difference and
+ * take appropriate action.
  */
-static long vma_needs_reservation(struct hstate *h,
-			struct vm_area_struct *vma, unsigned long addr)
+static long __vma_reservation_common(struct hstate *h,
+				struct vm_area_struct *vma, unsigned long addr,
+				bool commit)
 {
 	struct resv_map *resv;
 	pgoff_t idx;
-	long chg;
+	long ret;
 
 	resv = vma_resv_map(vma);
 	if (!resv)
 		return 1;
 
 	idx = vma_hugecache_offset(h, vma, addr);
-	chg = region_chg(resv, idx, idx + 1);
+	if (commit)
+		ret = region_add(resv, idx, idx + 1);
+	else
+		ret = region_chg(resv, idx, idx + 1);
 
 	if (vma->vm_flags & VM_MAYSHARE)
-		return chg;
+		return ret;
 	else
-		return chg < 0 ? chg : 0;
+		return ret < 0 ? ret : 0;
 }
-static void vma_commit_reservation(struct hstate *h,
+
+static long vma_needs_reservation(struct hstate *h,
 			struct vm_area_struct *vma, unsigned long addr)
 {
-	struct resv_map *resv;
-	pgoff_t idx;
-
-	resv = vma_resv_map(vma);
-	if (!resv)
-		return;
+	return __vma_reservation_common(h, vma, addr, false);
+}
 
-	idx = vma_hugecache_offset(h, vma, addr);
-	region_add(resv, idx, idx + 1);
+static long vma_commit_reservation(struct hstate *h,
+			struct vm_area_struct *vma, unsigned long addr)
+{
+	return __vma_reservation_common(h, vma, addr, true);
 }
 
 static struct page *alloc_huge_page(struct vm_area_struct *vma,
-- 
2.1.0


* [PATCH v3 3/3] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages
  2015-05-27 17:56 [PATCH v3 0/3] alloc_huge_page/hugetlb_reserve_pages race Mike Kravetz
  2015-05-27 17:56 ` [PATCH v3 1/3] mm/hugetlb: document the reserve map/region tracking routines Mike Kravetz
  2015-05-27 17:56 ` [PATCH v3 2/3] mm/hugetlb: compute/return the number of regions added by region_add() Mike Kravetz
@ 2015-05-27 17:56 ` Mike Kravetz
  2015-05-28 14:01   ` Davidlohr Bueso
                     ` (2 more replies)
  2 siblings, 3 replies; 9+ messages in thread
From: Mike Kravetz @ 2015-05-27 17:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Naoya Horiguchi, Davidlohr Bueso, David Rientjes,
	Luiz Capitulino, Andrew Morton, Mike Kravetz

alloc_huge_page and hugetlb_reserve_pages use region_chg to
calculate the number of pages which will be added to the reserve
map.  Subpool and global reserve counts are adjusted based on
the output of region_chg.  Before the pages are actually added
to the reserve map, these routines could race and add fewer
pages than expected.  If this happens, the subpool and global
reserve counts are not correct.

Compare the number of pages actually added (region_add) to those
expected to be added (region_chg).  If fewer pages are actually added,
this indicates a race, and the counters are adjusted accordingly.
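
For example (with invented numbers): if region_chg reported chg = 4
but a racing fault instantiated one of those pages first, region_add
returns add = 3.  The difference, chg - add = 1, is both the number
of subpool pages to put back and the number of global reservations
to release, since that page was already accounted by the racing path.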

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 34 ++++++++++++++++++++++++++++++----
 1 file changed, 30 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b3d3d59..038c84e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1544,7 +1544,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	struct hugepage_subpool *spool = subpool_vma(vma);
 	struct hstate *h = hstate_vma(vma);
 	struct page *page;
-	long chg;
+	long chg, commit;
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg;
 
@@ -1585,7 +1585,20 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 
 	set_page_private(page, (unsigned long)spool);
 
-	vma_commit_reservation(h, vma, addr);
+	commit = vma_commit_reservation(h, vma, addr);
+	if (unlikely(chg > commit)) {
+		/*
+		 * The page was added to the reservation map between
+		 * vma_needs_reservation and vma_commit_reservation.
+		 * This indicates a race with hugetlb_reserve_pages.
+		 * Adjust for the subpool count incremented above AND
+		 * in hugetlb_reserve_pages for the same page.  Also,
+		 * the reservation count added in hugetlb_reserve_pages
+		 * no longer applies.
+		 */
+		hugepage_subpool_put_pages(spool, 1);
+		hugetlb_acct_memory(h, -1);
+	}
 	return page;
 
 out_uncharge_cgroup:
@@ -3699,8 +3712,21 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * consumed reservations are stored in the map. Hence, nothing
 	 * else has to be done for private mappings here
 	 */
-	if (!vma || vma->vm_flags & VM_MAYSHARE)
-		region_add(resv_map, from, to);
+	if (!vma || vma->vm_flags & VM_MAYSHARE) {
+		long add = region_add(resv_map, from, to);
+
+		if (unlikely(chg > add)) {
+			/*
+			 * pages in this range were added to the reserve
+			 * map between region_chg and region_add.  This
+			 * indicates a race with alloc_huge_page.  Adjust
+			 * the subpool and reserve counts modified above
+			 * based on the difference.
+			 */
+			hugepage_subpool_put_pages(spool, chg - add);
+			hugetlb_acct_memory(h, -(chg - add));
+		}
+	}
 	return 0;
 out_err:
 	if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
-- 
2.1.0


* Re: [PATCH v3 3/3] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages
  2015-05-27 17:56 ` [PATCH v3 3/3] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages Mike Kravetz
@ 2015-05-28 14:01   ` Davidlohr Bueso
  2015-05-28 21:00     ` Mike Kravetz
  2015-05-29  1:49   ` Naoya Horiguchi
  2015-06-01 16:53   ` Mike Kravetz
  2 siblings, 1 reply; 9+ messages in thread
From: Davidlohr Bueso @ 2015-05-28 14:01 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Naoya Horiguchi, David Rientjes,
	Luiz Capitulino, Andrew Morton

On Wed, 2015-05-27 at 10:56 -0700, Mike Kravetz wrote:
> alloc_huge_page and hugetlb_reserve_pages use region_chg to
> calculate the number of pages which will be added to the reserve
> map.  Subpool and global reserve counts are adjusted based on
> the output of region_chg.  Before the pages are actually added
> to the reserve map, these routines could race and add fewer
> pages than expected.  If this happens, the subpool and global
> reserve counts are not correct.
> 
> Compare the number of pages actually added (region_add) to those
> expected to be added (region_chg).  If fewer pages are actually added,
> this indicates a race, and the counters are adjusted accordingly.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>

With one nit below.

> ---
>  mm/hugetlb.c | 34 ++++++++++++++++++++++++++++++----
>  1 file changed, 30 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index b3d3d59..038c84e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1544,7 +1544,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>  	struct hugepage_subpool *spool = subpool_vma(vma);
>  	struct hstate *h = hstate_vma(vma);
>  	struct page *page;
> -	long chg;
> +	long chg, commit;
>  	int ret, idx;
>  	struct hugetlb_cgroup *h_cg;
>  
> @@ -1585,7 +1585,20 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>  
>  	set_page_private(page, (unsigned long)spool);
>  
> -	vma_commit_reservation(h, vma, addr);
> +	commit = vma_commit_reservation(h, vma, addr);
> +	if (unlikely(chg > commit)) {
> +		/*
> +		 * The page was added to the reservation map between
> +		 * vma_needs_reservation and vma_commit_reservation.
> +		 * This indicates a race with hugetlb_reserve_pages.
> +		 * Adjust for the subpool count incremented above AND
> +		 * in hugetlb_reserve_pages for the same page.  Also,
> +		 * the reservation count added in hugetlb_reserve_pages
> +		 * no longer applies.
> +		 */
> +		hugepage_subpool_put_pages(spool, 1);
> +		hugetlb_acct_memory(h, -1);

Should these fixups be encapsulated in a helper? The comment is the same
for both alloc_huge_page and hugetlb_reserve_pages.

Thanks,
Davidlohr

> +	}
>  	return page;
>  
>  out_uncharge_cgroup:
> @@ -3699,8 +3712,21 @@ int hugetlb_reserve_pages(struct inode *inode,
>  	 * consumed reservations are stored in the map. Hence, nothing
>  	 * else has to be done for private mappings here
>  	 */
> -	if (!vma || vma->vm_flags & VM_MAYSHARE)
> -		region_add(resv_map, from, to);
> +	if (!vma || vma->vm_flags & VM_MAYSHARE) {
> +		long add = region_add(resv_map, from, to);
> +
> +		if (unlikely(chg > add)) {
> +			/*
> +			 * pages in this range were added to the reserve
> +			 * map between region_chg and region_add.  This
> +			 * indicates a race with alloc_huge_page.  Adjust
> +			 * the subpool and reserve counts modified above
> +			 * based on the difference.
> +			 */
> +			hugepage_subpool_put_pages(spool, chg - add);
> +			hugetlb_acct_memory(h, -(chg - add));
> +		}
> +	}
>  	return 0;
>  out_err:
>  	if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))



* Re: [PATCH v3 3/3] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages
  2015-05-28 14:01   ` Davidlohr Bueso
@ 2015-05-28 21:00     ` Mike Kravetz
  0 siblings, 0 replies; 9+ messages in thread
From: Mike Kravetz @ 2015-05-28 21:00 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-mm, linux-kernel, Naoya Horiguchi, David Rientjes,
	Luiz Capitulino, Andrew Morton

On 05/28/2015 07:01 AM, Davidlohr Bueso wrote:
> On Wed, 2015-05-27 at 10:56 -0700, Mike Kravetz wrote:
>> alloc_huge_page and hugetlb_reserve_pages use region_chg to
>> calculate the number of pages which will be added to the reserve
>> map.  Subpool and global reserve counts are adjusted based on
>> the output of region_chg.  Before the pages are actually added
>> to the reserve map, these routines could race and add fewer
>> pages than expected.  If this happens, the subpool and global
>> reserve counts are not correct.
>>
>> Compare the number of pages actually added (region_add) to those
>> expected to be added (region_chg).  If fewer pages are actually added,
>> this indicates a race, and the counters are adjusted accordingly.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>
> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
>
> With one nit below.
>
>> ---
>>   mm/hugetlb.c | 34 ++++++++++++++++++++++++++++++----
>>   1 file changed, 30 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index b3d3d59..038c84e 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1544,7 +1544,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>>   	struct hugepage_subpool *spool = subpool_vma(vma);
>>   	struct hstate *h = hstate_vma(vma);
>>   	struct page *page;
>> -	long chg;
>> +	long chg, commit;
>>   	int ret, idx;
>>   	struct hugetlb_cgroup *h_cg;
>>
>> @@ -1585,7 +1585,20 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>>
>>   	set_page_private(page, (unsigned long)spool);
>>
>> -	vma_commit_reservation(h, vma, addr);
>> +	commit = vma_commit_reservation(h, vma, addr);
>> +	if (unlikely(chg > commit)) {
>> +		/*
>> +		 * The page was added to the reservation map between
>> +		 * vma_needs_reservation and vma_commit_reservation.
>> +		 * This indicates a race with hugetlb_reserve_pages.
>> +		 * Adjust for the subpool count incremented above AND
>> +		 * in hugetlb_reserve_pages for the same page.  Also,
>> +		 * the reservation count added in hugetlb_reserve_pages
>> +		 * no longer applies.
>> +		 */
>> +		hugepage_subpool_put_pages(spool, 1);
>> +		hugetlb_acct_memory(h, -1);
>
> Should these fixups be encapsulated in a helper? The comment is the same
> for both alloc_huge_page and hugetlb_reserve_pages.

I would like to leave things as they are right now.  It makes it
pretty explicit what fixup is needed and performed.

As you know, discovery of this bug came out of my hugetlbfs fallocate
work.  In that patchset, I created a race fixup routine.  If fallocate
moves forward, I'll reexamine the above fixups and look into helpers.

Thanks for the review,
-- 
Mike Kravetz

>
> Thanks,
> Davidlohr
>
>> +	}
>>   	return page;
>>
>>   out_uncharge_cgroup:
>> @@ -3699,8 +3712,21 @@ int hugetlb_reserve_pages(struct inode *inode,
>>   	 * consumed reservations are stored in the map. Hence, nothing
>>   	 * else has to be done for private mappings here
>>   	 */
>> -	if (!vma || vma->vm_flags & VM_MAYSHARE)
>> -		region_add(resv_map, from, to);
>> +	if (!vma || vma->vm_flags & VM_MAYSHARE) {
>> +		long add = region_add(resv_map, from, to);
>> +
>> +		if (unlikely(chg > add)) {
>> +			/*
>> +			 * pages in this range were added to the reserve
>> +			 * map between region_chg and region_add.  This
>> +			 * indicates a race with alloc_huge_page.  Adjust
>> +			 * the subpool and reserve counts modified above
>> +			 * based on the difference.
>> +			 */
>> +			hugepage_subpool_put_pages(spool, chg - add);
>> +			hugetlb_acct_memory(h, -(chg - add));
>> +		}
>> +	}
>>   	return 0;
>>   out_err:
>>   	if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
>


* Re: [PATCH v3 2/3] mm/hugetlb: compute/return the number of regions added by region_add()
  2015-05-27 17:56 ` [PATCH v3 2/3] mm/hugetlb: compute/return the number of regions added by region_add() Mike Kravetz
@ 2015-05-29  1:48   ` Naoya Horiguchi
  0 siblings, 0 replies; 9+ messages in thread
From: Naoya Horiguchi @ 2015-05-29  1:48 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Davidlohr Bueso, David Rientjes,
	Luiz Capitulino, Andrew Morton

On Wed, May 27, 2015 at 10:56:10AM -0700, Mike Kravetz wrote:
> Modify region_add() to keep track of regions (pages) added to the
> reserve map and return this value.  The return value can be
> compared to the return value of region_chg() to determine if the
> map was modified between calls.
> 
> Make vma_commit_reservation() also pass along the return value of
> region_add().  In the normal case, we want vma_commit_reservation
> to return the same value as the preceding call to vma_needs_reservation.
> Create a common __vma_reservation_common routine to help keep the
> special case return values in sync.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

* Re: [PATCH v3 3/3] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages
  2015-05-27 17:56 ` [PATCH v3 3/3] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages Mike Kravetz
  2015-05-28 14:01   ` Davidlohr Bueso
@ 2015-05-29  1:49   ` Naoya Horiguchi
  2015-06-01 16:53   ` Mike Kravetz
  2 siblings, 0 replies; 9+ messages in thread
From: Naoya Horiguchi @ 2015-05-29  1:49 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Davidlohr Bueso, David Rientjes,
	Luiz Capitulino, Andrew Morton

On Wed, May 27, 2015 at 10:56:11AM -0700, Mike Kravetz wrote:
> alloc_huge_page and hugetlb_reserve_pages use region_chg to
> calculate the number of pages which will be added to the reserve
> map.  Subpool and global reserve counts are adjusted based on
> the output of region_chg.  Before the pages are actually added
> to the reserve map, these routines could race and add fewer
> pages than expected.  If this happens, the subpool and global
> reserve counts are not correct.
> 
> Compare the number of pages actually added (region_add) to those
> expected to be added (region_chg).  If fewer pages are actually added,
> this indicates a race, and the counters are adjusted accordingly.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

* Re: [PATCH v3 3/3] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages
  2015-05-27 17:56 ` [PATCH v3 3/3] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages Mike Kravetz
  2015-05-28 14:01   ` Davidlohr Bueso
  2015-05-29  1:49   ` Naoya Horiguchi
@ 2015-06-01 16:53   ` Mike Kravetz
  2 siblings, 0 replies; 9+ messages in thread
From: Mike Kravetz @ 2015-06-01 16:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Naoya Horiguchi, Davidlohr Bueso, Andrew Morton
  Cc: David Rientjes, Luiz Capitulino

On 05/27/2015 10:56 AM, Mike Kravetz wrote:
> alloc_huge_page and hugetlb_reserve_pages use region_chg to
> calculate the number of pages which will be added to the reserve
> map.  Subpool and global reserve counts are adjusted based on
> the output of region_chg.  Before the pages are actually added
> to the reserve map, these routines could race and add fewer
> pages than expected.  If this happens, the subpool and global
> reserve counts are not correct.
>
> Compare the number of pages actually added (region_add) to those
> expected to be added (region_chg).  If fewer pages are actually added,
> this indicates a race, and the counters are adjusted accordingly.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>   mm/hugetlb.c | 34 ++++++++++++++++++++++++++++++----
>   1 file changed, 30 insertions(+), 4 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index b3d3d59..038c84e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1544,7 +1544,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>   	struct hugepage_subpool *spool = subpool_vma(vma);
>   	struct hstate *h = hstate_vma(vma);
>   	struct page *page;
> -	long chg;
> +	long chg, commit;
>   	int ret, idx;
>   	struct hugetlb_cgroup *h_cg;
>
> @@ -1585,7 +1585,20 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>
>   	set_page_private(page, (unsigned long)spool);
>
> -	vma_commit_reservation(h, vma, addr);
> +	commit = vma_commit_reservation(h, vma, addr);
> +	if (unlikely(chg > commit)) {
> +		/*
> +		 * The page was added to the reservation map between
> +		 * vma_needs_reservation and vma_commit_reservation.
> +		 * This indicates a race with hugetlb_reserve_pages.
> +		 * Adjust for the subpool count incremented above AND
> +		 * in hugetlb_reserve_pages for the same page.  Also,
> +		 * the reservation count added in hugetlb_reserve_pages
> +		 * no longer applies.
> +		 */
> +		hugepage_subpool_put_pages(spool, 1);
> +		hugetlb_acct_memory(h, -1);
> +	}
>   	return page;

Well, this is embarrassing.  The code to fix up counts is incomplete
and incorrect.  It does not take into account the hugetlbfs min_size
option to reserve a set of pages for a filesystem.  The code should
look like:

		long rsv_adjust;

		rsv_adjust = hugepage_subpool_put_pages(spool, 1);
		hugetlb_acct_memory(h, -rsv_adjust);

A similar change is required in hugetlb_reserve_pages().

I have verified that the global reserve count is not properly
adjusted with the original patch, and verified that it is properly
adjusted with the above change.

Should I create a new patch set, or create a patch on top of the
existing set?

I think this oversight points out the need to encapsulate the
subpool and global reserve count adjustments.  There are too
many counters/adjustments in the code and it is pretty easy not
to take them all into account.  I'll start work on some
encapsulation routines.
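
For example, a hypothetical shape for such a routine (the name is
invented here, and it is built only from the calls shown above):

	static void hugetlb_fixup_reserve_counts(struct hugepage_subpool *spool,
						 struct hstate *h, long delta)
	{
		long rsv_adjust;

		/*
		 * put_pages returns the number of global reservations
		 * to release, taking the subpool minimum size into
		 * account.
		 */
		rsv_adjust = hugepage_subpool_put_pages(spool, delta);
		hugetlb_acct_memory(h, -rsv_adjust);
	}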

-- 
Mike Kravetz

>
>   out_uncharge_cgroup:
> @@ -3699,8 +3712,21 @@ int hugetlb_reserve_pages(struct inode *inode,
>   	 * consumed reservations are stored in the map. Hence, nothing
>   	 * else has to be done for private mappings here
>   	 */
> -	if (!vma || vma->vm_flags & VM_MAYSHARE)
> -		region_add(resv_map, from, to);
> +	if (!vma || vma->vm_flags & VM_MAYSHARE) {
> +		long add = region_add(resv_map, from, to);
> +
> +		if (unlikely(chg > add)) {
> +			/*
> +			 * pages in this range were added to the reserve
> +			 * map between region_chg and region_add.  This
> +			 * indicates a race with alloc_huge_page.  Adjust
> +			 * the subpool and reserve counts modified above
> +			 * based on the difference.
> +			 */
> +			hugepage_subpool_put_pages(spool, chg - add);
> +			hugetlb_acct_memory(h, -(chg - ret));
> +		}
> +	}
>   	return 0;
>   out_err:
>   	if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
>

