linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/10] hugetlbfs: add fallocate support
@ 2015-07-02 15:38 Mike Kravetz
  2015-07-02 15:38 ` [PATCH 01/10] mm/hugetlb: add cache of descriptors to resv_map for region_add Mike Kravetz
                   ` (9 more replies)
  0 siblings, 10 replies; 15+ messages in thread
From: Mike Kravetz @ 2015-07-02 15:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-api
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

This patch set has gone through several revisions as an RFC, as
documented in the log below.  The major change in this version
is the addition of code to handle the region_chg/region_add calling
sequence in the presence of fallocate hole punch.  This is the
first patch in the series.  As suggested during the RFC process,
tests have been proposed for libhugetlbfs as described at:
http://librelist.com/browser//libhugetlbfs/2015/6/25/patch-tests-add-tests-for-fallocate-system-call/
fallocate(2) man page modifications are also necessary to specify
that fallocate for hugetlbfs operates only on whole huge pages.
This change will be submitted once the code has stabilized and been
proposed for merging.

hugetlbfs is used today by applications that want a high degree of
control over huge page usage.  Often, large hugetlbfs files are used
to map a large number of huge pages into the application's processes.
The applications know when page ranges within these large files will
no longer be used, and ideally would like to release them back to
the subpool or global pools for other uses.  The fallocate() system
call provides an interface for preallocation and hole punching within
files.  This patch set adds fallocate functionality to hugetlbfs.

v1:
  Add a cache of region descriptors to the resv_map for use by
    region_add in case hole punch deletes entries necessary for
    a successful operation.
RFC v4:
  Removed alloc_huge_page/hugetlb_reserve_pages race patches as already
    in mmotm
  Moved hugetlb_fix_reserve_counts in series as suggested by Naoya Horiguchi
  Inline'ed hugetlb_fault_mutex routines as suggested by Davidlohr Bueso and
    existing code changed to use new interfaces as suggested by Naoya
  fallocate preallocation code cleaned up and made simpler
  Modified alloc_huge_page to handle special case where allocation is
    for a hole punched area with spool reserves
RFC v3:
  Folded in patch for alloc_huge_page/hugetlb_reserve_pages race
    in current code
  fallocate allocation and hole punch is synchronized with page
    faults via existing mutex table
  hole punch uses existing hugetlb_vmtruncate_list instead of more
    generic unmap_mapping_range for unmapping
  Error handling added for the case when region_del() fails
RFC v2:
  Addressed alignment and error handling issues noticed by Hillf Danton
  New region_del() routine for region tracking/resv_map of ranges
  Fixed several issues found during more extensive testing
  Error handling in region_del() when kmalloc() fails still needs
    to be addressed
  madvise remove support remains

Mike Kravetz (10):
  mm/hugetlb: add cache of descriptors to resv_map for region_add
  mm/hugetlb: add region_del() to delete a specific range of entries
  mm/hugetlb: expose hugetlb fault mutex for use by fallocate
  hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete
  hugetlbfs: truncate_hugepages() takes a range of pages
  mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch
  mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate
  hugetlbfs: New huge_add_to_page_cache helper routine
  hugetlbfs: add hugetlbfs_fallocate()
  mm: madvise allow remove operation for hugetlbfs

 fs/hugetlbfs/inode.c    | 281 +++++++++++++++++++++++++++++---
 include/linux/hugetlb.h |  17 +-
 mm/hugetlb.c            | 422 ++++++++++++++++++++++++++++++++++++++----------
 mm/madvise.c            |   2 +-
 4 files changed, 618 insertions(+), 104 deletions(-)

-- 
2.1.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 01/10] mm/hugetlb: add cache of descriptors to resv_map for region_add
  2015-07-02 15:38 [PATCH 00/10] hugetlbfs: add fallocate support Mike Kravetz
@ 2015-07-02 15:38 ` Mike Kravetz
  2015-07-03  4:21   ` Hillf Danton
  2015-07-02 15:38 ` [PATCH 02/10] mm/hugetlb: add region_del() to delete a specific range of entries Mike Kravetz
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 15+ messages in thread
From: Mike Kravetz @ 2015-07-02 15:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-api
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

fallocate hole punch will want to remove a specific range of
pages.  When pages are removed, their associated entries in
the region/reserve map will also be removed.  This will break
an assumption in the region_chg/region_add calling sequence.
If a new region descriptor must be allocated, it is done as
part of the region_chg processing.  In this way, region_add
can not fail because it does not need to attempt an allocation.

To prepare for fallocate hole punch, create a "cache" of
descriptors that can be used by region_add if necessary.
region_chg will ensure there are sufficient entries in the
cache.  The number of in-progress add operations is tracked to
ensure that a sufficient number of descriptors reside in the
cache.  A new routine, region_abort, is added to adjust this
in-progress count when add operations are aborted.
vma_abort_reservation is also added for callers creating
reservations with vma_needs_reservation/vma_commit_reservation.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h |   3 +
 mm/hugetlb.c            | 166 ++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 150 insertions(+), 19 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 2050261..f9c8cb2 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -35,6 +35,9 @@ struct resv_map {
 	struct kref refs;
 	spinlock_t lock;
 	struct list_head regions;
+	long adds_in_progress;
+	struct list_head rgn_cache;
+	long rgn_cache_count;
 };
 extern struct resv_map *resv_map_alloc(void);
 void resv_map_release(struct kref *ref);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a8c3087..189c0d7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -240,11 +240,14 @@ struct file_region {
 
 /*
  * Add the huge page range represented by [f, t) to the reserve
- * map.  Existing regions will be expanded to accommodate the
- * specified range.  We know only existing regions need to be
- * expanded, because region_add is only called after region_chg
- * with the same range.  If a new file_region structure must
- * be allocated, it is done in region_chg.
+ * map.  In the normal case, existing regions will be expanded
+ * to accommodate the specified range.  Sufficient regions should
+ * exist for expansion due to the previous call to region_chg
+ * with the same range.  However, it is possible that region_del
+ * could have been called after region_chg and modified the map
+ * in such a way that no region exists to be expanded.  In this
+ * case, pull a region descriptor from the cache associated with
+ * the map and use that for the new range.
  *
  * Return the number of new huge pages added to the map.  This
  * number is greater than or equal to zero.
@@ -261,6 +264,27 @@ static long region_add(struct resv_map *resv, long f, long t)
 		if (f <= rg->to)
 			break;
 
+	if (&rg->link == head || t < rg->from) {
+		/*
+		 * No region exists that can be expanded to include the
+		 * specified range.  Pull a region descriptor from the
+		 * cache, and use it for this range.
+		 */
+		VM_BUG_ON(!resv->rgn_cache_count);
+
+		resv->rgn_cache_count--;
+		nrg = list_first_entry(&resv->rgn_cache, struct file_region,
+					link);
+		list_del(&nrg->link);
+
+		nrg->from = f;
+		nrg->to = t;
+		list_add(&nrg->link, rg->link.prev);
+
+		add += t - f;
+		goto out_locked;
+	}
+
 	/* Round our left edge to the current segment if it encloses us. */
 	if (f > rg->from)
 		f = rg->from;
@@ -294,6 +318,8 @@ static long region_add(struct resv_map *resv, long f, long t)
 	add += t - nrg->to;		/* Added to end of region */
 	nrg->to = t;
 
+out_locked:
+	resv->adds_in_progress--;
 	spin_unlock(&resv->lock);
 	VM_BUG_ON(add < 0);
 	return add;
@@ -312,11 +338,16 @@ static long region_add(struct resv_map *resv, long f, long t)
  * so that the subsequent region_add call will have all the
  * regions it needs and will not fail.
  *
+ * Upon entry, region_chg will also examine the cache of
+ * region descriptors associated with the map.  If there are
+ * not enough descriptors cached, one will be allocated
+ * for the in progress add operation.
+ *
  * Returns the number of huge pages that need to be added
  * to the existing reservation map for the range [f, t).
  * This number is greater or equal to zero.  -ENOMEM is
- * returned if a new file_region structure is needed and can
- * not be allocated.
+ * returned if a new file_region structure or cache entry
+ * is needed and can not be allocated.
  */
 static long region_chg(struct resv_map *resv, long f, long t)
 {
@@ -326,6 +357,30 @@ static long region_chg(struct resv_map *resv, long f, long t)
 
 retry:
 	spin_lock(&resv->lock);
+	resv->adds_in_progress++;
+
+	/*
+	 * Check for sufficient descriptors in the cache to accommodate
+	 * the number of in progress add operations.
+	 */
+	if (resv->adds_in_progress > resv->rgn_cache_count) {
+		struct file_region *trg;
+
+		VM_BUG_ON(resv->adds_in_progress - resv->rgn_cache_count > 1);
+		/* Must drop lock to allocate a new descriptor. */
+		resv->adds_in_progress--;
+		spin_unlock(&resv->lock);
+
+		trg = kmalloc(sizeof(*trg), GFP_KERNEL);
+		if (!trg)
+			return -ENOMEM;
+
+		spin_lock(&resv->lock);
+		resv->adds_in_progress++;
+		list_add(&trg->link, &resv->rgn_cache);
+		resv->rgn_cache_count++;
+	}
+
 	/* Locate the region we are before or in. */
 	list_for_each_entry(rg, head, link)
 		if (f <= rg->to)
@@ -336,6 +391,7 @@ retry:
 	 * size such that we can guarantee to record the reservation. */
 	if (&rg->link == head || t < rg->from) {
 		if (!nrg) {
+			resv->adds_in_progress--;
 			spin_unlock(&resv->lock);
 			nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
 			if (!nrg)
@@ -385,6 +441,25 @@ out_nrg:
 }
 
 /*
+ * Abort the in progress add operation.  The adds_in_progress field
+ * of the resv_map keeps track of the operations in progress between
+ * calls to region_chg and region_add.  Operations are sometimes
+ * aborted after the call to region_chg.  In such cases, region_abort
+ * is called to decrement the adds_in_progress counter.
+ *
+ * NOTE: The range arguments [f, t) are not needed or used in this
+ * routine.  They are kept to make reading the calling code easier as
+ * arguments will match the associated region_chg call.
+ */
+static void region_abort(struct resv_map *resv, long f, long t)
+{
+	spin_lock(&resv->lock);
+	VM_BUG_ON(!resv->rgn_cache_count);
+	resv->adds_in_progress--;
+	spin_unlock(&resv->lock);
+}
+
+/*
  * Truncate the reserve map at index 'end'.  Modify/truncate any
  * region which contains end.  Delete any regions past end.
  * Return the number of huge pages removed from the map.
@@ -544,22 +619,42 @@ static void set_vma_private_data(struct vm_area_struct *vma,
 struct resv_map *resv_map_alloc(void)
 {
 	struct resv_map *resv_map = kmalloc(sizeof(*resv_map), GFP_KERNEL);
-	if (!resv_map)
+	struct file_region *rg = kmalloc(sizeof(*rg), GFP_KERNEL);
+
+	if (!resv_map || !rg) {
+		kfree(resv_map);
+		kfree(rg);
 		return NULL;
+	}
 
 	kref_init(&resv_map->refs);
 	spin_lock_init(&resv_map->lock);
 	INIT_LIST_HEAD(&resv_map->regions);
 
+	resv_map->adds_in_progress = 0;
+
+	INIT_LIST_HEAD(&resv_map->rgn_cache);
+	list_add(&rg->link, &resv_map->rgn_cache);
+	resv_map->rgn_cache_count = 1;
+
 	return resv_map;
 }
 
 void resv_map_release(struct kref *ref)
 {
 	struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
+	struct list_head *head = &resv_map->regions;
+	struct file_region *rg, *trg;
 
 	/* Clear out any active regions before we release the map. */
 	region_truncate(resv_map, 0);
+
+	/* ... and any entries left in the cache */
+	list_for_each_entry_safe(rg, trg, head, link)
+		list_del(&rg->link);
+
+	VM_BUG_ON(resv_map->adds_in_progress);
+
 	kfree(resv_map);
 }
 
@@ -1473,16 +1568,18 @@ static void return_unused_surplus_pages(struct hstate *h,
 	}
 }
 
+
 /*
- * vma_needs_reservation and vma_commit_reservation are used by the huge
- * page allocation routines to manage reservations.
+ * vma_needs_reservation, vma_commit_reservation and vma_abort_reservation
+ * are used by the huge page allocation routines to manage reservations.
  *
  * vma_needs_reservation is called to determine if the huge page at addr
  * within the vma has an associated reservation.  If a reservation is
  * needed, the value 1 is returned.  The caller is then responsible for
  * managing the global reservation and subpool usage counts.  After
  * the huge page has been allocated, vma_commit_reservation is called
- * to add the page to the reservation map.
+ * to add the page to the reservation map.  If the reservation must be
+ * aborted instead of committed, vma_abort_reservation is called.
  *
  * In the normal case, vma_commit_reservation returns the same value
  * as the preceding vma_needs_reservation call.  The only time this
@@ -1490,9 +1587,14 @@ static void return_unused_surplus_pages(struct hstate *h,
  * is the responsibility of the caller to notice the difference and
  * take appropriate action.
  */
+enum vma_resv_mode {
+	VMA_NEEDS_RESV,
+	VMA_COMMIT_RESV,
+	VMA_ABORT_RESV,
+};
 static long __vma_reservation_common(struct hstate *h,
 				struct vm_area_struct *vma, unsigned long addr,
-				bool commit)
+				enum vma_resv_mode mode)
 {
 	struct resv_map *resv;
 	pgoff_t idx;
@@ -1503,10 +1605,20 @@ static long __vma_reservation_common(struct hstate *h,
 		return 1;
 
 	idx = vma_hugecache_offset(h, vma, addr);
-	if (commit)
-		ret = region_add(resv, idx, idx + 1);
-	else
+	switch (mode) {
+	case VMA_NEEDS_RESV:
 		ret = region_chg(resv, idx, idx + 1);
+		break;
+	case VMA_COMMIT_RESV:
+		ret = region_add(resv, idx, idx + 1);
+		break;
+	case VMA_ABORT_RESV:
+		region_abort(resv, idx, idx + 1);
+		ret = 0;
+		break;
+	default:
+		BUG();
+	}
 
 	if (vma->vm_flags & VM_MAYSHARE)
 		return ret;
@@ -1517,13 +1629,19 @@ static long __vma_reservation_common(struct hstate *h,
 static long vma_needs_reservation(struct hstate *h,
 			struct vm_area_struct *vma, unsigned long addr)
 {
-	return __vma_reservation_common(h, vma, addr, false);
+	return __vma_reservation_common(h, vma, addr, VMA_NEEDS_RESV);
 }
 
 static long vma_commit_reservation(struct hstate *h,
 			struct vm_area_struct *vma, unsigned long addr)
 {
-	return __vma_reservation_common(h, vma, addr, true);
+	return __vma_reservation_common(h, vma, addr, VMA_COMMIT_RESV);
+}
+
+static void vma_abort_reservation(struct hstate *h,
+			struct vm_area_struct *vma, unsigned long addr)
+{
+	(void)__vma_reservation_common(h, vma, addr, VMA_ABORT_RESV);
 }
 
 static struct page *alloc_huge_page(struct vm_area_struct *vma,
@@ -1549,8 +1667,10 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	if (chg < 0)
 		return ERR_PTR(-ENOMEM);
 	if (chg || avoid_reserve)
-		if (hugepage_subpool_get_pages(spool, 1) < 0)
+		if (hugepage_subpool_get_pages(spool, 1) < 0) {
+			vma_abort_reservation(h, vma, addr);
 			return ERR_PTR(-ENOSPC);
+		}
 
 	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
 	if (ret)
@@ -1596,6 +1716,7 @@ out_uncharge_cgroup:
 out_subpool_put:
 	if (chg || avoid_reserve)
 		hugepage_subpool_put_pages(spool, 1);
+	vma_abort_reservation(h, vma, addr);
 	return ERR_PTR(-ENOSPC);
 }
 
@@ -3236,11 +3357,14 @@ retry:
 	 * any allocations necessary to record that reservation occur outside
 	 * the spinlock.
 	 */
-	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
+	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
 		if (vma_needs_reservation(h, vma, address) < 0) {
 			ret = VM_FAULT_OOM;
 			goto backout_unlocked;
 		}
+		/* Just decrements count, does not deallocate */
+		vma_abort_reservation(h, vma, address);
+	}
 
 	ptl = huge_pte_lockptr(h, mm, ptep);
 	spin_lock(ptl);
@@ -3387,6 +3511,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			ret = VM_FAULT_OOM;
 			goto out_mutex;
 		}
+		/* Just decrements count, does not deallocate */
+		vma_abort_reservation(h, vma, address);
 
 		if (!(vma->vm_flags & VM_MAYSHARE))
 			pagecache_page = hugetlbfs_pagecache_page(h,
@@ -3726,6 +3852,8 @@ int hugetlb_reserve_pages(struct inode *inode,
 	}
 	return 0;
 out_err:
+	if (!vma || vma->vm_flags & VM_MAYSHARE)
+		region_abort(resv_map, from, to);
 	if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
 		kref_put(&resv_map->refs, resv_map_release);
 	return ret;
-- 
2.1.0



* [PATCH 02/10] mm/hugetlb: add region_del() to delete a specific range of entries
  2015-07-02 15:38 [PATCH 00/10] hugetlbfs: add fallocate support Mike Kravetz
  2015-07-02 15:38 ` [PATCH 01/10] mm/hugetlb: add cache of descriptors to resv_map for region_add Mike Kravetz
@ 2015-07-02 15:38 ` Mike Kravetz
  2015-07-03  7:20   ` Hillf Danton
  2015-07-02 15:38 ` [PATCH 03/10] mm/hugetlb: expose hugetlb fault mutex for use by fallocate Mike Kravetz
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 15+ messages in thread
From: Mike Kravetz @ 2015-07-02 15:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-api
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

fallocate hole punch will want to remove a specific range of pages.
The existing region_truncate() routine deletes all region/reserve
map entries after a specified offset.  region_del() will provide
this same functionality if the end of region is specified as -1.
Hence, region_del() can replace region_truncate().

Unlike region_truncate(), region_del() can return an error in the
rare case where it can not allocate memory for a region descriptor.
This ONLY happens in the case where an existing region must be split.
Current callers passing -1 as end of range will never experience
this error and do not need to deal with error handling.  Future
callers of region_del() (such as fallocate hole punch) will need to
handle this error.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 101 ++++++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 75 insertions(+), 26 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 189c0d7..e8c7f68 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -460,43 +460,92 @@ static void region_abort(struct resv_map *resv, long f, long t)
 }
 
 /*
- * Truncate the reserve map at index 'end'.  Modify/truncate any
- * region which contains end.  Delete any regions past end.
- * Return the number of huge pages removed from the map.
+ * Delete the specified range [f, t) from the reserve map.  If the
+ * t parameter is -1, this indicates that ALL regions after f should
+ * be deleted.  Locate the regions which intersect [f, t) and either
+ * trim, delete or split the existing regions.
+ *
+ * Returns the number of huge pages deleted from the reserve map.
+ * In the normal case, the return value is zero or more.  In the
+ * case where a region must be split, a new region descriptor must
+ * be allocated.  If the allocation fails, -ENOMEM will be returned.
+ * NOTE: If the parameter t == -1, we will never split a region and
+ * therefore can never return -ENOMEM.  Callers specifying t == -1
+ * do not need to check for -ENOMEM.
  */
-static long region_truncate(struct resv_map *resv, long end)
+static long region_del(struct resv_map *resv, long f, long t)
 {
 	struct list_head *head = &resv->regions;
 	struct file_region *rg, *trg;
-	long chg = 0;
+	struct file_region *nrg = NULL;
+	long del = 0;
 
+	if (t == -1)
+		t = LONG_MAX;
+retry:
 	spin_lock(&resv->lock);
-	/* Locate the region we are either in or before. */
-	list_for_each_entry(rg, head, link)
-		if (end <= rg->to)
+	list_for_each_entry_safe(rg, trg, head, link) {
+		if (rg->to <= f)
+			continue;
+		if (rg->from >= t)
 			break;
-	if (&rg->link == head)
-		goto out;
 
-	/* If we are in the middle of a region then adjust it. */
-	if (end > rg->from) {
-		chg = rg->to - end;
-		rg->to = end;
-		rg = list_entry(rg->link.next, typeof(*rg), link);
-	}
+		if (f > rg->from && t < rg->to) { /* Must split region */
+			/*
+			 * Check for an entry in the cache before dropping
+			 * lock and attempting allocation.
+			 */
+			if (!nrg &&
+			    resv->rgn_cache_count > resv->adds_in_progress) {
+				nrg = list_first_entry(&resv->rgn_cache,
+							struct file_region,
+							link);
+				list_del(&nrg->link);
+				resv->rgn_cache_count--;
+			}
 
-	/* Drop any remaining regions. */
-	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
-		if (&rg->link == head)
+			if (!nrg) {
+				spin_unlock(&resv->lock);
+				nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+				if (!nrg)
+					return -ENOMEM;
+				goto retry;
+			}
+
+			del += t - f;
+
+			/* New entry for end of split region */
+			nrg->from = t;
+			nrg->to = rg->to;
+			INIT_LIST_HEAD(&nrg->link);
+
+			/* Original entry is trimmed */
+			rg->to = f;
+
+			list_add(&nrg->link, &rg->link);
+			nrg = NULL;
 			break;
-		chg += rg->to - rg->from;
-		list_del(&rg->link);
-		kfree(rg);
+		}
+
+		if (f <= rg->from && t >= rg->to) { /* Remove entire region */
+			del += rg->to - rg->from;
+			list_del(&rg->link);
+			kfree(rg);
+			continue;
+		}
+
+		if (f <= rg->from) {	/* Trim beginning of region */
+			del += t - rg->from;
+			rg->from = t;
+		} else {		/* Trim end of region */
+			del += rg->to - f;
+			rg->to = f;
+		}
 	}
 
-out:
 	spin_unlock(&resv->lock);
-	return chg;
+	kfree(nrg);
+	return del;
 }
 
 /*
@@ -647,7 +696,7 @@ void resv_map_release(struct kref *ref)
 	struct file_region *rg, *trg;
 
 	/* Clear out any active regions before we release the map. */
-	region_truncate(resv_map, 0);
+	region_del(resv_map, 0, -1);
 
 	/* ... and any entries left in the cache */
 	list_for_each_entry_safe(rg, trg, head, link)
@@ -3868,7 +3917,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 	long gbl_reserve;
 
 	if (resv_map)
-		chg = region_truncate(resv_map, offset);
+		chg = region_del(resv_map, offset, -1);
 	spin_lock(&inode->i_lock);
 	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
 	spin_unlock(&inode->i_lock);
-- 
2.1.0



* [PATCH 03/10] mm/hugetlb: expose hugetlb fault mutex for use by fallocate
  2015-07-02 15:38 [PATCH 00/10] hugetlbfs: add fallocate support Mike Kravetz
  2015-07-02 15:38 ` [PATCH 01/10] mm/hugetlb: add cache of descriptors to resv_map for region_add Mike Kravetz
  2015-07-02 15:38 ` [PATCH 02/10] mm/hugetlb: add region_del() to delete a specific range of entries Mike Kravetz
@ 2015-07-02 15:38 ` Mike Kravetz
  2015-07-02 15:38 ` [PATCH 04/10] hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete Mike Kravetz
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2015-07-02 15:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-api
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

hugetlb page faults are currently synchronized by the table of
mutexes (htlb_fault_mutex_table).  fallocate code will need to
synchronize with the page fault code when it allocates or
deletes pages.  Expose interfaces so that fallocate operations
can be synchronized with page faults.  Minor name changes are
made for consistency with other global hugetlb symbols.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h |  5 +++++
 mm/hugetlb.c            | 20 ++++++++++----------
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index f9c8cb2..43e8511f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -88,6 +88,11 @@ int dequeue_hwpoisoned_huge_page(struct page *page);
 bool isolate_huge_page(struct page *page, struct list_head *list);
 void putback_active_hugepage(struct page *page);
 void free_huge_page(struct page *page);
+extern struct mutex *hugetlb_fault_mutex_table;
+u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+				struct vm_area_struct *vma,
+				struct address_space *mapping,
+				pgoff_t idx, unsigned long address);
 
 #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
 pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e8c7f68..21e957b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -64,7 +64,7 @@ DEFINE_SPINLOCK(hugetlb_lock);
  * prevent spurious OOMs when the hugepage pool is fully utilized.
  */
 static int num_fault_mutexes;
-static struct mutex *htlb_fault_mutex_table ____cacheline_aligned_in_smp;
+struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
 
 /* Forward declaration */
 static int hugetlb_acct_memory(struct hstate *h, long delta);
@@ -2481,7 +2481,7 @@ static void __exit hugetlb_exit(void)
 	}
 
 	kobject_put(hugepages_kobj);
-	kfree(htlb_fault_mutex_table);
+	kfree(hugetlb_fault_mutex_table);
 }
 module_exit(hugetlb_exit);
 
@@ -2514,12 +2514,12 @@ static int __init hugetlb_init(void)
 #else
 	num_fault_mutexes = 1;
 #endif
-	htlb_fault_mutex_table =
+	hugetlb_fault_mutex_table =
 		kmalloc(sizeof(struct mutex) * num_fault_mutexes, GFP_KERNEL);
-	BUG_ON(!htlb_fault_mutex_table);
+	BUG_ON(!hugetlb_fault_mutex_table);
 
 	for (i = 0; i < num_fault_mutexes; i++)
-		mutex_init(&htlb_fault_mutex_table[i]);
+		mutex_init(&hugetlb_fault_mutex_table[i]);
 	return 0;
 }
 module_init(hugetlb_init);
@@ -3453,7 +3453,7 @@ backout_unlocked:
 }
 
 #ifdef CONFIG_SMP
-static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
 			    struct vm_area_struct *vma,
 			    struct address_space *mapping,
 			    pgoff_t idx, unsigned long address)
@@ -3478,7 +3478,7 @@ static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
  * For uniprocesor systems we always use a single mutex, so just
  * return 0 and avoid the hashing overhead.
  */
-static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
 			    struct vm_area_struct *vma,
 			    struct address_space *mapping,
 			    pgoff_t idx, unsigned long address)
@@ -3526,8 +3526,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
-	hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
-	mutex_lock(&htlb_fault_mutex_table[hash]);
+	hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, address);
+	mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 	entry = huge_ptep_get(ptep);
 	if (huge_pte_none(entry)) {
@@ -3612,7 +3612,7 @@ out_ptl:
 		put_page(pagecache_page);
 	}
 out_mutex:
-	mutex_unlock(&htlb_fault_mutex_table[hash]);
+	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 	/*
 	 * Generally it's safe to hold refcount during waiting page lock. But
 	 * here we just wait to defer the next page fault to avoid busy loop and
-- 
2.1.0



* [PATCH 04/10] hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete
  2015-07-02 15:38 [PATCH 00/10] hugetlbfs: add fallocate support Mike Kravetz
                   ` (2 preceding siblings ...)
  2015-07-02 15:38 ` [PATCH 03/10] mm/hugetlb: expose hugetlb fault mutex for use by fallocate Mike Kravetz
@ 2015-07-02 15:38 ` Mike Kravetz
  2015-07-02 15:38 ` [PATCH 05/10] hugetlbfs: truncate_hugepages() takes a range of pages Mike Kravetz
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2015-07-02 15:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-api
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

fallocate hole punch will want to unmap a specific range of pages.
Modify the existing hugetlb_vmtruncate_list() routine to take a
start/end range.  If end is 0, this indicates all pages after start
should be unmapped.  This is the same as the existing truncate
functionality.  Modify existing callers to add 0 as end of range.

Since the routine will be used in hole punch as well as truncate
operations, it is more appropriately renamed to hugetlb_vmdelete_list().

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index e5a93d8..e9d4c8d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -374,11 +374,15 @@ static void hugetlbfs_evict_inode(struct inode *inode)
 }
 
 static inline void
-hugetlb_vmtruncate_list(struct rb_root *root, pgoff_t pgoff)
+hugetlb_vmdelete_list(struct rb_root *root, pgoff_t start, pgoff_t end)
 {
 	struct vm_area_struct *vma;
 
-	vma_interval_tree_foreach(vma, root, pgoff, ULONG_MAX) {
+	/*
+	 * end == 0 indicates that the entire range after
+	 * start should be unmapped.
+	 */
+	vma_interval_tree_foreach(vma, root, start, end ? end : ULONG_MAX) {
 		unsigned long v_offset;
 
 		/*
@@ -387,13 +391,20 @@ hugetlb_vmtruncate_list(struct rb_root *root, pgoff_t pgoff)
 		 * which overlap the truncated area starting at pgoff,
 		 * and no vma on a 32-bit arch can span beyond the 4GB.
 		 */
-		if (vma->vm_pgoff < pgoff)
-			v_offset = (pgoff - vma->vm_pgoff) << PAGE_SHIFT;
+		if (vma->vm_pgoff < start)
+			v_offset = (start - vma->vm_pgoff) << PAGE_SHIFT;
 		else
 			v_offset = 0;
 
-		unmap_hugepage_range(vma, vma->vm_start + v_offset,
-				     vma->vm_end, NULL);
+		if (end) {
+			end = ((end - start) << PAGE_SHIFT) +
+			       vma->vm_start + v_offset;
+			if (end > vma->vm_end)
+				end = vma->vm_end;
+		} else
+			end = vma->vm_end;
+
+		unmap_hugepage_range(vma, vma->vm_start + v_offset, end, NULL);
 	}
 }
 
@@ -409,7 +420,7 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	i_size_write(inode, offset);
 	i_mmap_lock_write(mapping);
 	if (!RB_EMPTY_ROOT(&mapping->i_mmap))
-		hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
+		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
 	i_mmap_unlock_write(mapping);
 	truncate_hugepages(inode, offset);
 	return 0;
-- 
2.1.0



* [PATCH 05/10] hugetlbfs: truncate_hugepages() takes a range of pages
  2015-07-02 15:38 [PATCH 00/10] hugetlbfs: add fallocate support Mike Kravetz
                   ` (3 preceding siblings ...)
  2015-07-02 15:38 ` [PATCH 04/10] hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete Mike Kravetz
@ 2015-07-02 15:38 ` Mike Kravetz
  2015-07-02 15:38 ` [PATCH 06/10] mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch Mike Kravetz
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2015-07-02 15:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-api
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

Modify truncate_hugepages() to take a range of pages (start, end)
instead of just a start value. If an end value of -1 is passed, the
current "truncate" functionality is maintained. Existing callers
are modified to pass -1 as end of range. By keying off end == -1,
the routine behaves differently for truncate and hole punch.
Page removal is now synchronized with page allocation via faults
by using the fault mutex table. The hole punch case can experience
the rare region_del error and must handle it accordingly.

Add the routine hugetlb_fix_reserve_counts to fix up reserve counts
in the case where region_del returns an error.

Since the routine handles more than just the truncate case, it is
renamed to remove_inode_hugepages().  To be consistent, the routine
truncate_huge_page() is renamed remove_huge_page().

Downstream of remove_inode_hugepages(), the routine
hugetlb_unreserve_pages() is also modified to take a range of pages,
and to detect an error from region_del and pass it back to the
caller.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c    | 98 ++++++++++++++++++++++++++++++++++++++++++++-----
 include/linux/hugetlb.h |  4 +-
 mm/hugetlb.c            | 40 ++++++++++++++++++--
 3 files changed, 128 insertions(+), 14 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index e9d4c8d..160d58d4 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -318,26 +318,61 @@ static int hugetlbfs_write_end(struct file *file, struct address_space *mapping,
 	return -EINVAL;
 }
 
-static void truncate_huge_page(struct page *page)
+static void remove_huge_page(struct page *page)
 {
 	ClearPageDirty(page);
 	ClearPageUptodate(page);
 	delete_from_page_cache(page);
 }
 
-static void truncate_hugepages(struct inode *inode, loff_t lstart)
+
+/*
+ * remove_inode_hugepages handles two distinct cases: truncation and hole
+ * punch.  There are subtle differences in operation for each case.
+ *
+ * truncation is indicated by end of range being -1
+ *	In this case, we first scan the range and release found pages.
+ *	After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
+ *	maps and global counts.
+ * hole punch is indicated if end is not -1
+ *	In the hole punch case we scan the range and release found pages.
+ *	Only when releasing a page is the associated region/reserv map
+ *	entry deleted.  The region/reserv map for ranges without
+ *	associated pages is not modified.
+ * Note: If the passed end of range value is beyond the end of file, but
+ * not -1, this routine still performs a hole punch operation.
+ */
+static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
+				   loff_t lend)
 {
 	struct hstate *h = hstate_inode(inode);
 	struct address_space *mapping = &inode->i_data;
 	const pgoff_t start = lstart >> huge_page_shift(h);
+	const pgoff_t end = lend >> huge_page_shift(h);
+	struct vm_area_struct pseudo_vma;
 	struct pagevec pvec;
 	pgoff_t next;
 	int i, freed = 0;
+	long lookup_nr = PAGEVEC_SIZE;
+	bool truncate_op = (lend == -1);
 
+	memset(&pseudo_vma, 0, sizeof(struct vm_area_struct));
+	pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
 	pagevec_init(&pvec, 0);
 	next = start;
-	while (1) {
-		if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+	while (next < end) {
+		/*
+		 * Make sure to never grab more pages than we
+		 * might possibly need.
+		 */
+		if (end - next < lookup_nr)
+			lookup_nr = end - next;
+
+		/*
+		 * This pagevec_lookup() may return pages past 'end',
+		 * so we must check for page->index > end.
+		 */
+		if (!pagevec_lookup(&pvec, mapping, next, lookup_nr)) {
 			if (next == start)
 				break;
 			next = start;
@@ -346,26 +381,69 @@ static void truncate_hugepages(struct inode *inode, loff_t lstart)
 
 		for (i = 0; i < pagevec_count(&pvec); ++i) {
 			struct page *page = pvec.pages[i];
+			u32 hash;
+
+			hash = hugetlb_fault_mutex_hash(h, current->mm,
+							&pseudo_vma,
+							mapping, next, 0);
+			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 			lock_page(page);
+			if (page->index >= end) {
+				unlock_page(page);
+				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+				next = end;	/* we are done */
+				break;
+			}
+
+			/*
+			 * If page is mapped, it was faulted in after being
+			 * unmapped.  Do nothing in this race case.  In the
+			 * normal case page is not mapped.
+			 */
+			if (!page_mapped(page)) {
+				bool rsv_on_error = !PagePrivate(page);
+				/*
+				 * We must free the huge page and remove
+				 * from page cache (remove_huge_page) BEFORE
+				 * removing the region/reserve map
+				 * (hugetlb_unreserve_pages).  In rare out
+				 * of memory conditions, removal of the
+				 * region/reserve map could fail.  Before
+				 * free'ing the page, note PagePrivate which
+				 * is used in case of error.
+				 */
+				remove_huge_page(page);
+				freed++;
+				if (!truncate_op) {
+					if (unlikely(hugetlb_unreserve_pages(
+							inode, next,
+							next + 1, 1)))
+						hugetlb_fix_reserve_counts(
+							inode, rsv_on_error);
+				}
+			}
+
 			if (page->index > next)
 				next = page->index;
+
 			++next;
-			truncate_huge_page(page);
 			unlock_page(page);
-			freed++;
+
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}
 		huge_pagevec_release(&pvec);
 	}
-	BUG_ON(!lstart && mapping->nrpages);
-	hugetlb_unreserve_pages(inode, start, freed);
+
+	if (truncate_op)
+		(void)hugetlb_unreserve_pages(inode, start, end, freed);
 }
 
 static void hugetlbfs_evict_inode(struct inode *inode)
 {
 	struct resv_map *resv_map;
 
-	truncate_hugepages(inode, 0);
+	remove_inode_hugepages(inode, 0, -1);
 	resv_map = (struct resv_map *)inode->i_mapping->private_data;
 	/* root inode doesn't have the resv_map, so we should check it */
 	if (resv_map)
@@ -422,7 +500,7 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	if (!RB_EMPTY_ROOT(&mapping->i_mmap))
 		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
 	i_mmap_unlock_write(mapping);
-	truncate_hugepages(inode, offset);
+	remove_inode_hugepages(inode, offset, -1);
 	return 0;
 }
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 43e8511f..95cd716 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -83,11 +83,13 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 int hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						struct vm_area_struct *vma,
 						vm_flags_t vm_flags);
-void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
+long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
+						long freed);
 int dequeue_hwpoisoned_huge_page(struct page *page);
 bool isolate_huge_page(struct page *page, struct list_head *list);
 void putback_active_hugepage(struct page *page);
 void free_huge_page(struct page *page);
+void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve);
 extern struct mutex *hugetlb_fault_mutex_table;
 u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
 				struct vm_area_struct *vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 21e957b..1c3049c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -549,6 +549,28 @@ retry:
 }
 
 /*
+ * A rare out of memory error was encountered which prevented removal of
+ * the reserve map region for a page.  The huge page itself was free'ed
+ * and removed from the page cache.  This routine will adjust the subpool
+ * usage count, and the global reserve count if needed.  By incrementing
+ * these counts, the reserve map entry which could not be deleted will
+ * appear as a "reserved" entry instead of simply dangling with incorrect
+ * counts.
+ */
+void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve)
+{
+	struct hugepage_subpool *spool = subpool_inode(inode);
+	long rsv_adjust;
+
+	rsv_adjust = hugepage_subpool_get_pages(spool, 1);
+	if (restore_reserve && rsv_adjust) {
+		struct hstate *h = hstate_inode(inode);
+
+		hugetlb_acct_memory(h, 1);
+	}
+}
+
+/*
  * Count and return the number of huge pages in the reserve map
  * that intersect with the range [f, t).
  */
@@ -3908,7 +3930,8 @@ out_err:
 	return ret;
 }
 
-void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
+long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
+								long freed)
 {
 	struct hstate *h = hstate_inode(inode);
 	struct resv_map *resv_map = inode_resv_map(inode);
@@ -3916,8 +3939,17 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 	struct hugepage_subpool *spool = subpool_inode(inode);
 	long gbl_reserve;
 
-	if (resv_map)
-		chg = region_del(resv_map, offset, -1);
+	if (resv_map) {
+		chg = region_del(resv_map, start, end);
+		/*
+		 * region_del() can fail in the rare case where a region
+		 * must be split and another region descriptor can not be
+		 * allocated.  If end == -1, it will not fail.
+		 */
+		if (chg < 0)
+			return chg;
+	}
+
 	spin_lock(&inode->i_lock);
 	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
 	spin_unlock(&inode->i_lock);
@@ -3928,6 +3960,8 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 	 */
 	gbl_reserve = hugepage_subpool_put_pages(spool, (chg - freed));
 	hugetlb_acct_memory(h, -gbl_reserve);
+
+	return 0;
 }
 
 #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 06/10] mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch
  2015-07-02 15:38 [PATCH 00/10] hugetlbfs: add fallocate support Mike Kravetz
                   ` (4 preceding siblings ...)
  2015-07-02 15:38 ` [PATCH 05/10] hugetlbfs: truncate_hugepages() takes a range of pages Mike Kravetz
@ 2015-07-02 15:38 ` Mike Kravetz
  2015-07-02 15:38 ` [PATCH 07/10] mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate Mike Kravetz
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2015-07-02 15:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-api
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

In vma_has_reserves(), the current assumption is that reserves are
always present for shared mappings.  However, this will not be the
case with fallocate hole punch.  When punching a hole, the present
page will be deleted as well as the region/reserve map entry (and
hence any reservation).  vma_has_reserves is passed "chg" which
indicates whether or not a region/reserve map is present.  Use
this to determine if reserves are actually present or were removed
via hole punch.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1c3049c..5923c21 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -800,9 +800,19 @@ static int vma_has_reserves(struct vm_area_struct *vma, long chg)
 			return 0;
 	}
 
-	/* Shared mappings always use reserves */
-	if (vma->vm_flags & VM_MAYSHARE)
-		return 1;
+	if (vma->vm_flags & VM_MAYSHARE) {
+		/*
+		 * We know VM_NORESERVE is not set.  Therefore, there SHOULD
+		 * be a region map for all pages.  The only situation where
+		 * there is no region map is if a hole was punched via
+		 * fallocate.  In this case, there really are no reserves to
+		 * use.  This situation is indicated if chg != 0.
+		 */
+		if (chg)
+			return 0;
+		else
+			return 1;
+	}
 
 	/*
 	 * Only the process that called mmap() has reserves for
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 07/10] mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate
  2015-07-02 15:38 [PATCH 00/10] hugetlbfs: add fallocate support Mike Kravetz
                   ` (5 preceding siblings ...)
  2015-07-02 15:38 ` [PATCH 06/10] mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch Mike Kravetz
@ 2015-07-02 15:38 ` Mike Kravetz
  2015-07-02 15:38 ` [PATCH 08/10] hugetlbfs: New huge_add_to_page_cache helper routine Mike Kravetz
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2015-07-02 15:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-api
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

Areas hole punched by fallocate will not have entries in the
region/reserve map.  However, shared mappings with min_size subpool
reservations may still have reserved pages.  alloc_huge_page needs
to handle this special case and do the proper accounting.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 54 +++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 39 insertions(+), 15 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5923c21..90d8250 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1731,34 +1731,58 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	struct hugepage_subpool *spool = subpool_vma(vma);
 	struct hstate *h = hstate_vma(vma);
 	struct page *page;
-	long chg, commit;
+	long map_chg, map_commit;
+	long gbl_chg;
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg;
 
 	idx = hstate_index(h);
 	/*
-	 * Processes that did not create the mapping will have no
-	 * reserves and will not have accounted against subpool
-	 * limit. Check that the subpool limit can be made before
-	 * satisfying the allocation MAP_NORESERVE mappings may also
-	 * need pages and subpool limit allocated allocated if no reserve
-	 * mapping overlaps.
+	 * Examine the region/reserve map to determine if the process
+	 * has a reservation for the page to be allocated.  A return
+	 * code of zero indicates a reservation exists (no change).
 	 */
-	chg = vma_needs_reservation(h, vma, addr);
-	if (chg < 0)
+	map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
+	if (map_chg < 0)
 		return ERR_PTR(-ENOMEM);
-	if (chg || avoid_reserve)
-		if (hugepage_subpool_get_pages(spool, 1) < 0) {
+
+	/*
+	 * Processes that did not create the mapping will have no
+	 * reserves as indicated by the region/reserve map. Check
+	 * that the allocation will not exceed the subpool limit.
+	 * Allocations for MAP_NORESERVE mappings also need to be
+	 * checked against any subpool limit.
+	 */
+	if (map_chg || avoid_reserve) {
+		gbl_chg = hugepage_subpool_get_pages(spool, 1);
+		if (gbl_chg < 0) {
 			vma_abort_reservation(h, vma, addr);
 			return ERR_PTR(-ENOSPC);
 		}
 
+		/*
+		 * Even though there was no reservation in the region/reserve
+		 * map, there could be reservations associated with the
+		 * subpool that can be used.  This would be indicated if the
+		 * return value of hugepage_subpool_get_pages() is zero.
+		 * However, if avoid_reserve is specified we still avoid even
+		 * the subpool reservations.
+		 */
+		if (avoid_reserve)
+			gbl_chg = 1;
+	}
+
 	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
 	if (ret)
 		goto out_subpool_put;
 
 	spin_lock(&hugetlb_lock);
-	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg);
+	/*
+	 * gbl_chg is passed to indicate whether or not a page must be taken
+	 * from the global free pool (global change).  gbl_chg == 0 indicates
+	 * a reservation exists for the allocation.
+	 */
+	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
 	if (!page) {
 		spin_unlock(&hugetlb_lock);
 		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
@@ -1774,8 +1798,8 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 
 	set_page_private(page, (unsigned long)spool);
 
-	commit = vma_commit_reservation(h, vma, addr);
-	if (unlikely(chg > commit)) {
+	map_commit = vma_commit_reservation(h, vma, addr);
+	if (unlikely(map_chg > map_commit)) {
 		/*
 		 * The page was added to the reservation map between
 		 * vma_needs_reservation and vma_commit_reservation.
@@ -1795,7 +1819,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 out_uncharge_cgroup:
 	hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
 out_subpool_put:
-	if (chg || avoid_reserve)
+	if (map_chg || avoid_reserve)
 		hugepage_subpool_put_pages(spool, 1);
 	vma_abort_reservation(h, vma, addr);
 	return ERR_PTR(-ENOSPC);
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 08/10] hugetlbfs: New huge_add_to_page_cache helper routine
  2015-07-02 15:38 [PATCH 00/10] hugetlbfs: add fallocate support Mike Kravetz
                   ` (6 preceding siblings ...)
  2015-07-02 15:38 ` [PATCH 07/10] mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate Mike Kravetz
@ 2015-07-02 15:38 ` Mike Kravetz
  2015-07-02 15:38 ` [PATCH 09/10] hugetlbfs: add hugetlbfs_fallocate() Mike Kravetz
  2015-07-02 15:38 ` [PATCH 10/10] mm: madvise allow remove operation for hugetlbfs Mike Kravetz
  9 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2015-07-02 15:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-api
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

Currently, there is only a single place where hugetlbfs pages are
added to the page cache.  The new fallocate code will be adding a
second one, so break the functionality out into its own helper.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h |  2 ++
 mm/hugetlb.c            | 27 ++++++++++++++++++---------
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 95cd716..12e90f1 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -333,6 +333,8 @@ struct huge_bootmem_page {
 struct page *alloc_huge_page_node(struct hstate *h, int nid);
 struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
+int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
+			pgoff_t idx);
 
 /* arch callback */
 int __init alloc_bootmem_huge_page(struct hstate *h);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 90d8250..d6e47ac 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3373,6 +3373,23 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
 	return page != NULL;
 }
 
+int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
+			   pgoff_t idx)
+{
+	struct inode *inode = mapping->host;
+	struct hstate *h = hstate_inode(inode);
+	int err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
+
+	if (err)
+		return err;
+	ClearPagePrivate(page);
+
+	spin_lock(&inode->i_lock);
+	inode->i_blocks += blocks_per_huge_page(h);
+	spin_unlock(&inode->i_lock);
+	return 0;
+}
+
 static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			   struct address_space *mapping, pgoff_t idx,
 			   unsigned long address, pte_t *ptep, unsigned int flags)
@@ -3420,21 +3437,13 @@ retry:
 		set_page_huge_active(page);
 
 		if (vma->vm_flags & VM_MAYSHARE) {
-			int err;
-			struct inode *inode = mapping->host;
-
-			err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
+			int err = huge_add_to_page_cache(page, mapping, idx);
 			if (err) {
 				put_page(page);
 				if (err == -EEXIST)
 					goto retry;
 				goto out;
 			}
-			ClearPagePrivate(page);
-
-			spin_lock(&inode->i_lock);
-			inode->i_blocks += blocks_per_huge_page(h);
-			spin_unlock(&inode->i_lock);
 		} else {
 			lock_page(page);
 			if (unlikely(anon_vma_prepare(vma))) {
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 09/10] hugetlbfs: add hugetlbfs_fallocate()
  2015-07-02 15:38 [PATCH 00/10] hugetlbfs: add fallocate support Mike Kravetz
                   ` (7 preceding siblings ...)
  2015-07-02 15:38 ` [PATCH 08/10] hugetlbfs: New huge_add_to_page_cache helper routine Mike Kravetz
@ 2015-07-02 15:38 ` Mike Kravetz
  2015-07-02 15:38 ` [PATCH 10/10] mm: madvise allow remove operation for hugetlbfs Mike Kravetz
  9 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2015-07-02 15:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-api
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

This is based on the shmem version, but it has diverged quite
a bit.  We have no swap to worry about, nor the new file sealing.
Add synchronization via the fault mutex table to coordinate
page faults, fallocate allocation and fallocate hole punch.

What this allows us to do is move physical memory in and out of
a hugetlbfs file without having it mapped.  This also gives us
the ability to support MADV_REMOVE since it is currently
implemented using fallocate().  MADV_REMOVE lets madvise() remove
pages from the middle of a hugetlbfs file, which wasn't possible
before.

hugetlbfs fallocate only operates on whole huge pages.

Based-on code-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c    | 158 +++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/hugetlb.h |   3 +
 mm/hugetlb.c            |   2 +-
 3 files changed, 161 insertions(+), 2 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 160d58d4..05b9d47 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -12,6 +12,7 @@
 #include <linux/thread_info.h>
 #include <asm/current.h>
 #include <linux/sched.h>		/* remove ASAP */
+#include <linux/falloc.h>
 #include <linux/fs.h>
 #include <linux/mount.h>
 #include <linux/file.h>
@@ -504,6 +505,160 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	return 0;
 }
 
+static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
+{
+	struct hstate *h = hstate_inode(inode);
+	loff_t hpage_size = huge_page_size(h);
+	loff_t hole_start, hole_end;
+
+	/*
+	 * For hole punch, round up the beginning offset of the hole and
+	 * round down the end.
+	 */
+	hole_start = round_up(offset, hpage_size);
+	hole_end = round_down(offset + len, hpage_size);
+
+	if (hole_end > hole_start) {
+		struct address_space *mapping = inode->i_mapping;
+
+		mutex_lock(&inode->i_mutex);
+		i_mmap_lock_write(mapping);
+		if (!RB_EMPTY_ROOT(&mapping->i_mmap))
+			hugetlb_vmdelete_list(&mapping->i_mmap,
+						hole_start >> PAGE_SHIFT,
+						hole_end  >> PAGE_SHIFT);
+		i_mmap_unlock_write(mapping);
+		remove_inode_hugepages(inode, hole_start, hole_end);
+		mutex_unlock(&inode->i_mutex);
+	}
+
+	return 0;
+}
+
+static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
+				loff_t len)
+{
+	struct inode *inode = file_inode(file);
+	struct address_space *mapping = inode->i_mapping;
+	struct hstate *h = hstate_inode(inode);
+	struct vm_area_struct pseudo_vma;
+	struct mm_struct *mm = current->mm;
+	loff_t hpage_size = huge_page_size(h);
+	unsigned long hpage_shift = huge_page_shift(h);
+	pgoff_t start, index, end;
+	int error;
+	u32 hash;
+
+	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+		return -EOPNOTSUPP;
+
+	if (mode & FALLOC_FL_PUNCH_HOLE)
+		return hugetlbfs_punch_hole(inode, offset, len);
+
+	/*
+	 * Default preallocate case.
+	 * For this range, start is rounded down and end is rounded up
+	 * as well as being converted to page offsets.
+	 */
+	start = offset >> hpage_shift;
+	end = (offset + len + hpage_size - 1) >> hpage_shift;
+
+	mutex_lock(&inode->i_mutex);
+
+	/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
+	error = inode_newsize_ok(inode, offset + len);
+	if (error)
+		goto out;
+
+	/*
+	 * Initialize a pseudo vma that just contains the policy used
+	 * when allocating the huge pages.  The actual policy field
+	 * (vm_policy) is determined based on the index in the loop below.
+	 */
+	memset(&pseudo_vma, 0, sizeof(struct vm_area_struct));
+	pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
+	pseudo_vma.vm_file = file;
+
+	for (index = start; index < end; index++) {
+		/*
+		 * This is supposed to be the vaddr where the page is being
+		 * faulted in, but we have no vaddr here.
+		 */
+		struct page *page;
+		unsigned long addr;
+		int avoid_reserve = 0;
+
+		cond_resched();
+
+		/*
+		 * fallocate(2) manpage permits EINTR; we may have been
+		 * interrupted because we are using up too much memory.
+		 */
+		if (signal_pending(current)) {
+			error = -EINTR;
+			break;
+		}
+
+		/* Get policy based on index */
+		pseudo_vma.vm_policy =
+			mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy,
+							index);
+
+		/* addr is the offset within the file (zero based) */
+		addr = index * hpage_size;
+
+		/* mutex taken here, fault path and hole punch */
+		hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
+						index, addr);
+		mutex_lock(&hugetlb_fault_mutex_table[hash]);
+
+		/* See if already present in mapping to avoid alloc/free */
+		page = find_get_page(mapping, index);
+		if (page) {
+			put_page(page);
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			mpol_cond_put(pseudo_vma.vm_policy);
+			continue;
+		}
+
+		/* Allocate page and add to page cache */
+		page = alloc_huge_page(&pseudo_vma, addr, avoid_reserve);
+		mpol_cond_put(pseudo_vma.vm_policy);
+		if (IS_ERR(page)) {
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			error = PTR_ERR(page);
+			goto out;
+		}
+		clear_huge_page(page, addr, pages_per_huge_page(h));
+		__SetPageUptodate(page);
+		error = huge_add_to_page_cache(page, mapping, index);
+		if (unlikely(error)) {
+			put_page(page);
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			goto out;
+		}
+
+		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+
+		/*
+		 * put_page due to reference from alloc_huge_page()
+		 * unlock_page because locked by add_to_page_cache()
+		 */
+		put_page(page);
+		unlock_page(page);
+	}
+
+	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
+		i_size_write(inode, offset + len);
+	inode->i_ctime = CURRENT_TIME;
+	spin_lock(&inode->i_lock);
+	inode->i_private = NULL;
+	spin_unlock(&inode->i_lock);
+out:
+	mutex_unlock(&inode->i_mutex);
+	return error;
+}
+
 static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
 {
 	struct inode *inode = dentry->d_inode;
@@ -815,7 +970,8 @@ const struct file_operations hugetlbfs_file_operations = {
 	.mmap			= hugetlbfs_file_mmap,
 	.fsync			= noop_fsync,
 	.get_unmapped_area	= hugetlb_get_unmapped_area,
-	.llseek		= default_llseek,
+	.llseek			= default_llseek,
+	.fallocate		= hugetlbfs_fallocate,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 12e90f1..1fae5cd 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -330,6 +330,8 @@ struct huge_bootmem_page {
 #endif
 };
 
+struct page *alloc_huge_page(struct vm_area_struct *vma,
+				unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_node(struct hstate *h, int nid);
 struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
@@ -484,6 +486,7 @@ static inline bool hugepages_supported(void)
 
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
+#define alloc_huge_page(v, a, r) NULL
 #define alloc_huge_page_node(h, nid) NULL
 #define alloc_huge_page_noerr(v, a, r) NULL
 #define alloc_bootmem_huge_page(h) NULL
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d6e47ac..0e478c5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1725,7 +1725,7 @@ static void vma_abort_reservation(struct hstate *h,
 	(void)__vma_reservation_common(h, vma, addr, VMA_ABORT_RESV);
 }
 
-static struct page *alloc_huge_page(struct vm_area_struct *vma,
+struct page *alloc_huge_page(struct vm_area_struct *vma,
 				    unsigned long addr, int avoid_reserve)
 {
 	struct hugepage_subpool *spool = subpool_vma(vma);
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 10/10] mm: madvise allow remove operation for hugetlbfs
  2015-07-02 15:38 [PATCH 00/10] hugetlbfs: add fallocate support Mike Kravetz
                   ` (8 preceding siblings ...)
  2015-07-02 15:38 ` [PATCH 09/10] hugetlbfs: add hugetlbfs_fallocate() Mike Kravetz
@ 2015-07-02 15:38 ` Mike Kravetz
  9 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2015-07-02 15:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-api
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

Now that we have hole punching support for hugetlbfs, we can
also support the MADV_REMOVE interface to it.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/madvise.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index d215ea9..3c1b7f0 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -467,7 +467,7 @@ static long madvise_remove(struct vm_area_struct *vma,
 
 	*prev = NULL;	/* tell sys_madvise we drop mmap_sem */
 
-	if (vma->vm_flags & (VM_LOCKED | VM_HUGETLB))
+	if (vma->vm_flags & VM_LOCKED)
 		return -EINVAL;
 
 	f = vma->vm_file;
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 01/10] mm/hugetlb: add cache of descriptors to resv_map for region_add
  2015-07-02 15:38 ` [PATCH 01/10] mm/hugetlb: add cache of descriptors to resv_map for region_add Mike Kravetz
@ 2015-07-03  4:21   ` Hillf Danton
  2015-07-03 17:12     ` Mike Kravetz
  0 siblings, 1 reply; 15+ messages in thread
From: Hillf Danton @ 2015-07-03  4:21 UTC (permalink / raw)
  To: 'Mike Kravetz', linux-mm, linux-kernel, linux-api
  Cc: 'Dave Hansen', 'Naoya Horiguchi',
	'David Rientjes', 'Hugh Dickins',
	'Davidlohr Bueso', 'Aneesh Kumar',
	'Christoph Hellwig'

> 
> fallocate hole punch will want to remove a specific range of
> pages.  When pages are removed, their associated entries in
> the region/reserve map will also be removed.  This will break
> an assumption in the region_chg/region_add calling sequence.
> If a new region descriptor must be allocated, it is done as
> part of the region_chg processing.  In this way, region_add
> can not fail because it does not need to attempt an allocation.
> 
> To prepare for fallocate hole punch, create a "cache" of
> descriptors that can be used by region_add if necessary.
> region_chg will ensure there are sufficient entries in the
> cache.  It will be necessary to track the number of in progress
> add operations to know a sufficient number of descriptors
> reside in the cache.  A new routine region_abort is added to
> adjust this in progress count when add operations are aborted.
> vma_abort_reservation is also added for callers creating
> reservations with vma_needs_reservation/vma_commit_reservation.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  include/linux/hugetlb.h |   3 +
>  mm/hugetlb.c            | 166 ++++++++++++++++++++++++++++++++++++++++++------
>  2 files changed, 150 insertions(+), 19 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 2050261..f9c8cb2 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -35,6 +35,9 @@ struct resv_map {
>  	struct kref refs;
>  	spinlock_t lock;
>  	struct list_head regions;
> +	long adds_in_progress;
> +	struct list_head rgn_cache;
> +	long rgn_cache_count;
>  };
>  extern struct resv_map *resv_map_alloc(void);
>  void resv_map_release(struct kref *ref);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a8c3087..189c0d7 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -240,11 +240,14 @@ struct file_region {
> 
>  /*
>   * Add the huge page range represented by [f, t) to the reserve
> - * map.  Existing regions will be expanded to accommodate the
> - * specified range.  We know only existing regions need to be
> - * expanded, because region_add is only called after region_chg
> - * with the same range.  If a new file_region structure must
> - * be allocated, it is done in region_chg.
> + * map.  In the normal case, existing regions will be expanded
> + * to accommodate the specified range.  Sufficient regions should
> + * exist for expansion due to the previous call to region_chg
> + * with the same range.  However, it is possible that region_del
> + * could have been called after region_chg and modified the map
> + * in such a way that no region exists to be expanded.  In this
> + * case, pull a region descriptor from the cache associated with
> + * the map and use that for the new range.
>   *
>   * Return the number of new huge pages added to the map.  This
>   * number is greater than or equal to zero.
> @@ -261,6 +264,27 @@ static long region_add(struct resv_map *resv, long f, long t)
>  		if (f <= rg->to)
>  			break;
> 
> +	if (&rg->link == head || t < rg->from) {
> +		/*
> +		 * No region exits which can be expanded to include the

s/exits/exists/ ?
> +		 * specified range.  Pull a region descriptor from the
> +		 * cache, and use it for this range.
> +		 */
> +		VM_BUG_ON(!resv->rgn_cache_count);
> +
> +		resv->rgn_cache_count--;
> +		nrg = list_first_entry(&resv->rgn_cache, struct file_region,
> +					link);
> +		list_del(&nrg->link);
> +
> +		nrg->from = f;
> +		nrg->to = t;
> +		list_add(&nrg->link, rg->link.prev);
> +
> +		add += t - f;
> +		goto out_locked;
> +	}
> +
>  	/* Round our left edge to the current segment if it encloses us. */
>  	if (f > rg->from)
>  		f = rg->from;
> @@ -294,6 +318,8 @@ static long region_add(struct resv_map *resv, long f, long t)
>  	add += t - nrg->to;		/* Added to end of region */
>  	nrg->to = t;
> 
> +out_locked:
> +	resv->adds_in_progress--;
>  	spin_unlock(&resv->lock);
>  	VM_BUG_ON(add < 0);
>  	return add;
> @@ -312,11 +338,16 @@ static long region_add(struct resv_map *resv, long f, long t)
>   * so that the subsequent region_add call will have all the
>   * regions it needs and will not fail.
>   *
> + * Upon entry, region_chg will also examine the cache of
> + * region descriptors associated with the map.  If there are
> + * not enough descriptors cached, one will be allocated
> + * for the in progress add operation.
> + *
>   * Returns the number of huge pages that need to be added
>   * to the existing reservation map for the range [f, t).
>   * This number is greater or equal to zero.  -ENOMEM is
> - * returned if a new file_region structure is needed and can
> - * not be allocated.
> + * returned if a new file_region structure or cache entry
> + * is needed and can not be allocated.
>   */
>  static long region_chg(struct resv_map *resv, long f, long t)
>  {
> @@ -326,6 +357,30 @@ static long region_chg(struct resv_map *resv, long f, long t)
> 
>  retry:
>  	spin_lock(&resv->lock);
> +	resv->adds_in_progress++;
> +
> +	/*
> +	 * Check for sufficient descriptors in the cache to accommodate
> +	 * the number of in progress add operations.
> +	 */
> +	if (resv->adds_in_progress > resv->rgn_cache_count) {
> +		struct file_region *trg;
> +
> +		VM_BUG_ON(resv->adds_in_progress - resv->rgn_cache_count > 1);
> +		/* Must drop lock to allocate a new descriptor. */
> +		resv->adds_in_progress--;
> +		spin_unlock(&resv->lock);
> +
> +		trg = kmalloc(sizeof(*trg), GFP_KERNEL);
> +		if (!trg)
> +			return -ENOMEM;
> +
> +		spin_lock(&resv->lock);
> +		resv->adds_in_progress++;
> +		list_add(&trg->link, &resv->rgn_cache);
> +		resv->rgn_cache_count++;
> +	}
> +
>  	/* Locate the region we are before or in. */
>  	list_for_each_entry(rg, head, link)
>  		if (f <= rg->to)
> @@ -336,6 +391,7 @@ retry:
>  	 * size such that we can guarantee to record the reservation. */
>  	if (&rg->link == head || t < rg->from) {
>  		if (!nrg) {
> +			resv->adds_in_progress--;
>  			spin_unlock(&resv->lock);
>  			nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
>  			if (!nrg)
> @@ -385,6 +441,25 @@ out_nrg:
>  }
> 
>  /*
> + * Abort the in progress add operation.  The adds_in_progress field
> + * of the resv_map keeps track of the operations in progress between
> + * calls to region_chg and region_add.  Operations are sometimes
> + * aborted after the call to region_chg.  In such cases, region_abort
> + * is called to decrement the adds_in_progress counter.
> + *
> + * NOTE: The range arguments [f, t) are not needed or used in this
> + * routine.  They are kept to make reading the calling code easier as
> + * arguments will match the associated region_chg call.
> + */
> +static void region_abort(struct resv_map *resv, long f, long t)
> +{
> +	spin_lock(&resv->lock);
> +	VM_BUG_ON(!resv->rgn_cache_count);
> +	resv->adds_in_progress--;
> +	spin_unlock(&resv->lock);
> +}
> +
> +/*
>   * Truncate the reserve map at index 'end'.  Modify/truncate any
>   * region which contains end.  Delete any regions past end.
>   * Return the number of huge pages removed from the map.
> @@ -544,22 +619,42 @@ static void set_vma_private_data(struct vm_area_struct *vma,
>  struct resv_map *resv_map_alloc(void)
>  {
>  	struct resv_map *resv_map = kmalloc(sizeof(*resv_map), GFP_KERNEL);
> -	if (!resv_map)
> +	struct file_region *rg = kmalloc(sizeof(*rg), GFP_KERNEL);
> +
> +	if (!resv_map || !rg) {
> +		kfree(resv_map);
> +		kfree(rg);
>  		return NULL;
> +	}
> 
>  	kref_init(&resv_map->refs);
>  	spin_lock_init(&resv_map->lock);
>  	INIT_LIST_HEAD(&resv_map->regions);
> 
> +	resv_map->adds_in_progress = 0;
> +
> +	INIT_LIST_HEAD(&resv_map->rgn_cache);
> +	list_add(&rg->link, &resv_map->rgn_cache);
> +	resv_map->rgn_cache_count = 1;
> +
>  	return resv_map;
>  }
> 
>  void resv_map_release(struct kref *ref)
>  {
>  	struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
> +	struct list_head *head = &resv_map->regions;
> +	struct file_region *rg, *trg;
> 
>  	/* Clear out any active regions before we release the map. */
>  	region_truncate(resv_map, 0);
> +
> +	/* ... and any entries left in the cache */
> +	list_for_each_entry_safe(rg, trg, head, link)
> +		list_del(&rg->link);
> +

I am wondering if the `cache' here is the rgn_cache list.
If so, rg should also be freed.
> +	VM_BUG_ON(resv_map->adds_in_progress);
> +
>  	kfree(resv_map);
>  }
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 02/10] mm/hugetlb: add region_del() to delete a specific range of entries
  2015-07-02 15:38 ` [PATCH 02/10] mm/hugetlb: add region_del() to delete a specific range of entries Mike Kravetz
@ 2015-07-03  7:20   ` Hillf Danton
  2015-07-03 17:22     ` Mike Kravetz
  0 siblings, 1 reply; 15+ messages in thread
From: Hillf Danton @ 2015-07-03  7:20 UTC (permalink / raw)
  To: 'Mike Kravetz', linux-mm, linux-kernel, linux-api
  Cc: 'Dave Hansen', 'Naoya Horiguchi',
	'David Rientjes', 'Hugh Dickins',
	'Davidlohr Bueso', 'Aneesh Kumar',
	'Christoph Hellwig'

> fallocate hole punch will want to remove a specific range of pages.
> The existing region_truncate() routine deletes all region/reserve
> map entries after a specified offset.  region_del() will provide
> this same functionality if the end of region is specified as -1.
> Hence, region_del() can replace region_truncate().
> 
> Unlike region_truncate(), region_del() can return an error in the
> rare case where it can not allocate memory for a region descriptor.
> This ONLY happens in the case where an existing region must be split.
> Current callers passing -1 as end of range will never experience
> this error and do not need to deal with error handling.  Future
> callers of region_del() (such as fallocate hole punch) will need to
> handle this error.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/hugetlb.c | 101 ++++++++++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 75 insertions(+), 26 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 189c0d7..e8c7f68 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -460,43 +460,92 @@ static void region_abort(struct resv_map *resv, long f, long t)
>  }
> 
>  /*
> - * Truncate the reserve map at index 'end'.  Modify/truncate any
> - * region which contains end.  Delete any regions past end.
> - * Return the number of huge pages removed from the map.
> + * Delete the specified range [f, t) from the reserve map.  If the
> + * t parameter is -1, this indicates that ALL regions after f should
> + * be deleted.  Locate the regions which intersect [f, t) and either
> + * trim, delete or split the existing regions.
> + *
> + * Returns the number of huge pages deleted from the reserve map.
> + * In the normal case, the return value is zero or more.  In the
> + * case where a region must be split, a new region descriptor must
> + * be allocated.  If the allocation fails, -ENOMEM will be returned.
> + * NOTE: If the parameter t == -1, then we will never split a region
> + * and possibly return -ENOMEM.  Callers specifying t == -1 do not
> + * need to check for -ENOMEM error.
>   */
> -static long region_truncate(struct resv_map *resv, long end)
> +static long region_del(struct resv_map *resv, long f, long t)
>  {
>  	struct list_head *head = &resv->regions;
>  	struct file_region *rg, *trg;
> -	long chg = 0;
> +	struct file_region *nrg = NULL;
> +	long del = 0;
> 
> +	if (t == -1)
> +		t = LONG_MAX;

Why not use 
> +retry:
>  	spin_lock(&resv->lock);
> -	/* Locate the region we are either in or before. */
> -	list_for_each_entry(rg, head, link)
> -		if (end <= rg->to)
> +	list_for_each_entry_safe(rg, trg, head, link) {
> +		if (rg->to <= f)
> +			continue;
> +		if (rg->from >= t)
>  			break;
> -	if (&rg->link == head)
> -		goto out;
> 
> -	/* If we are in the middle of a region then adjust it. */
> -	if (end > rg->from) {
> -		chg = rg->to - end;
> -		rg->to = end;
> -		rg = list_entry(rg->link.next, typeof(*rg), link);
> -	}
> +		if (f > rg->from && t < rg->to) { /* Must split region */
> +			/*
> +			 * Check for an entry in the cache before dropping
> +			 * lock and attempting allocation.
> +			 */
> +			if (!nrg &&
> +			    resv->rgn_cache_count > resv->adds_in_progress) {
> +				nrg = list_first_entry(&resv->rgn_cache,
> +							struct file_region,
> +							link);
> +				list_del(&nrg->link);
> +				resv->rgn_cache_count--;
> +			}
> 
> -	/* Drop any remaining regions. */
> -	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
> -		if (&rg->link == head)
> +			if (!nrg) {
> +				spin_unlock(&resv->lock);
> +				nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> +				if (!nrg)
> +					return -ENOMEM;
> +				goto retry;
> +			}
> +
> +			del += t - f;
> +
> +			/* New entry for end of split region */
> +			nrg->from = t;
> +			nrg->to = rg->to;
> +			INIT_LIST_HEAD(&nrg->link);
> +
> +			/* Original entry is trimmed */
> +			rg->to = f;
> +
> +			list_add(&nrg->link, &rg->link);
> +			nrg = NULL;
>  			break;
> -		chg += rg->to - rg->from;
> -		list_del(&rg->link);
> -		kfree(rg);
> +		}
> +
> +		if (f <= rg->from && t >= rg->to) { /* Remove entire region */
> +			del += rg->to - rg->from;
> +			list_del(&rg->link);
> +			kfree(rg);
> +			continue;
> +		}
> +
> +		if (f <= rg->from) {	/* Trim beginning of region */
> +			del += t - rg->from;
> +			rg->from = t;
> +		} else {		/* Trim end of region */
> +			del += rg->to - f;
> +			rg->to = f;
> +		}
>  	}
> 
> -out:
>  	spin_unlock(&resv->lock);
> -	return chg;
> +	kfree(nrg);
> +	return del;
>  }
> 
>  /*
> @@ -647,7 +696,7 @@ void resv_map_release(struct kref *ref)
>  	struct file_region *rg, *trg;
> 
>  	/* Clear out any active regions before we release the map. */
> -	region_truncate(resv_map, 0);
> +	region_del(resv_map, 0, -1);

LONG_MAX is not selected, why?
> 
>  	/* ... and any entries left in the cache */
>  	list_for_each_entry_safe(rg, trg, head, link)
> @@ -3868,7 +3917,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
>  	long gbl_reserve;
> 
>  	if (resv_map)
> -		chg = region_truncate(resv_map, offset);
> +		chg = region_del(resv_map, offset, -1);
>  	spin_lock(&inode->i_lock);
>  	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
>  	spin_unlock(&inode->i_lock);
> --
> 2.1.0



* Re: [PATCH 01/10] mm/hugetlb: add cache of descriptors to resv_map for region_add
  2015-07-03  4:21   ` Hillf Danton
@ 2015-07-03 17:12     ` Mike Kravetz
  0 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2015-07-03 17:12 UTC (permalink / raw)
  To: Hillf Danton, linux-mm, linux-kernel, linux-api
  Cc: 'Dave Hansen', 'Naoya Horiguchi',
	'David Rientjes', 'Hugh Dickins',
	'Davidlohr Bueso', 'Aneesh Kumar',
	'Christoph Hellwig'

On 07/02/2015 09:21 PM, Hillf Danton wrote:
>>
>> fallocate hole punch will want to remove a specific range of
>> pages.  When pages are removed, their associated entries in
>> the region/reserve map will also be removed.  This will break
>> an assumption in the region_chg/region_add calling sequence.
>> If a new region descriptor must be allocated, it is done as
>> part of the region_chg processing.  In this way, region_add
>> can not fail because it does not need to attempt an allocation.
>>
>> To prepare for fallocate hole punch, create a "cache" of
>> descriptors that can be used by region_add if necessary.
>> region_chg will ensure there are sufficient entries in the
>> cache.  It will be necessary to track the number of in progress
>> add operations to know a sufficient number of descriptors
>> reside in the cache.  A new routine region_abort is added to
>> adjust this in progress count when add operations are aborted.
>> vma_abort_reservation is also added for callers creating
>> reservations with vma_needs_reservation/vma_commit_reservation.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>   include/linux/hugetlb.h |   3 +
>>   mm/hugetlb.c            | 166 ++++++++++++++++++++++++++++++++++++++++++------
>>   2 files changed, 150 insertions(+), 19 deletions(-)
>>
>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>> index 2050261..f9c8cb2 100644
>> --- a/include/linux/hugetlb.h
>> +++ b/include/linux/hugetlb.h
>> @@ -35,6 +35,9 @@ struct resv_map {
>>   	struct kref refs;
>>   	spinlock_t lock;
>>   	struct list_head regions;
>> +	long adds_in_progress;
>> +	struct list_head rgn_cache;
>> +	long rgn_cache_count;
>>   };
>>   extern struct resv_map *resv_map_alloc(void);
>>   void resv_map_release(struct kref *ref);
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index a8c3087..189c0d7 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -240,11 +240,14 @@ struct file_region {
>>
>>   /*
>>    * Add the huge page range represented by [f, t) to the reserve
>> - * map.  Existing regions will be expanded to accommodate the
>> - * specified range.  We know only existing regions need to be
>> - * expanded, because region_add is only called after region_chg
>> - * with the same range.  If a new file_region structure must
>> - * be allocated, it is done in region_chg.
>> + * map.  In the normal case, existing regions will be expanded
>> + * to accommodate the specified range.  Sufficient regions should
>> + * exist for expansion due to the previous call to region_chg
>> + * with the same range.  However, it is possible that region_del
>> + * could have been called after region_chg and modified the map
>> + * in such a way that no region exists to be expanded.  In this
>> + * case, pull a region descriptor from the cache associated with
>> + * the map and use that for the new range.
>>    *
>>    * Return the number of new huge pages added to the map.  This
>>    * number is greater than or equal to zero.
>> @@ -261,6 +264,27 @@ static long region_add(struct resv_map *resv, long f, long t)
>>   		if (f <= rg->to)
>>   			break;
>>
>> +	if (&rg->link == head || t < rg->from) {
>> +		/*
>> +		 * No region exits which can be expanded to include the
>
> s/exits/exists/ ?

Yes, that is a typo and should be exists.

>> +		 * specified range.  Pull a region descriptor from the
>> +		 * cache, and use it for this range.
>> +		 */
>> +		VM_BUG_ON(!resv->rgn_cache_count);
>> +
>> +		resv->rgn_cache_count--;
>> +		nrg = list_first_entry(&resv->rgn_cache, struct file_region,
>> +					link);
>> +		list_del(&nrg->link);
>> +
>> +		nrg->from = f;
>> +		nrg->to = t;
>> +		list_add(&nrg->link, rg->link.prev);
>> +
>> +		add += t - f;
>> +		goto out_locked;
>> +	}
>> +
>>   	/* Round our left edge to the current segment if it encloses us. */
>>   	if (f > rg->from)
>>   		f = rg->from;
>> @@ -294,6 +318,8 @@ static long region_add(struct resv_map *resv, long f, long t)
>>   	add += t - nrg->to;		/* Added to end of region */
>>   	nrg->to = t;
>>
>> +out_locked:
>> +	resv->adds_in_progress--;
>>   	spin_unlock(&resv->lock);
>>   	VM_BUG_ON(add < 0);
>>   	return add;
>> @@ -312,11 +338,16 @@ static long region_add(struct resv_map *resv, long f, long t)
>>    * so that the subsequent region_add call will have all the
>>    * regions it needs and will not fail.
>>    *
>> + * Upon entry, region_chg will also examine the cache of
>> + * region descriptors associated with the map.  If there are
>> + * not enough descriptors cached, one will be allocated
>> + * for the in progress add operation.
>> + *
>>    * Returns the number of huge pages that need to be added
>>    * to the existing reservation map for the range [f, t).
>>    * This number is greater or equal to zero.  -ENOMEM is
>> - * returned if a new file_region structure is needed and can
>> - * not be allocated.
>> + * returned if a new file_region structure or cache entry
>> + * is needed and can not be allocated.
>>    */
>>   static long region_chg(struct resv_map *resv, long f, long t)
>>   {
>> @@ -326,6 +357,30 @@ static long region_chg(struct resv_map *resv, long f, long t)
>>
>>   retry:
>>   	spin_lock(&resv->lock);
>> +	resv->adds_in_progress++;
>> +
>> +	/*
>> +	 * Check for sufficient descriptors in the cache to accommodate
>> +	 * the number of in progress add operations.
>> +	 */
>> +	if (resv->adds_in_progress > resv->rgn_cache_count) {
>> +		struct file_region *trg;
>> +
>> +		VM_BUG_ON(resv->adds_in_progress - resv->rgn_cache_count > 1);
>> +		/* Must drop lock to allocate a new descriptor. */
>> +		resv->adds_in_progress--;
>> +		spin_unlock(&resv->lock);
>> +
>> +		trg = kmalloc(sizeof(*trg), GFP_KERNEL);
>> +		if (!trg)
>> +			return -ENOMEM;
>> +
>> +		spin_lock(&resv->lock);
>> +		resv->adds_in_progress++;
>> +		list_add(&trg->link, &resv->rgn_cache);
>> +		resv->rgn_cache_count++;
>> +	}
>> +
>>   	/* Locate the region we are before or in. */
>>   	list_for_each_entry(rg, head, link)
>>   		if (f <= rg->to)
>> @@ -336,6 +391,7 @@ retry:
>>   	 * size such that we can guarantee to record the reservation. */
>>   	if (&rg->link == head || t < rg->from) {
>>   		if (!nrg) {
>> +			resv->adds_in_progress--;
>>   			spin_unlock(&resv->lock);
>>   			nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
>>   			if (!nrg)
>> @@ -385,6 +441,25 @@ out_nrg:
>>   }
>>
>>   /*
>> + * Abort the in progress add operation.  The adds_in_progress field
>> + * of the resv_map keeps track of the operations in progress between
>> + * calls to region_chg and region_add.  Operations are sometimes
>> + * aborted after the call to region_chg.  In such cases, region_abort
>> + * is called to decrement the adds_in_progress counter.
>> + *
>> + * NOTE: The range arguments [f, t) are not needed or used in this
>> + * routine.  They are kept to make reading the calling code easier as
>> + * arguments will match the associated region_chg call.
>> + */
>> +static void region_abort(struct resv_map *resv, long f, long t)
>> +{
>> +	spin_lock(&resv->lock);
>> +	VM_BUG_ON(!resv->rgn_cache_count);
>> +	resv->adds_in_progress--;
>> +	spin_unlock(&resv->lock);
>> +}
>> +
>> +/*
>>    * Truncate the reserve map at index 'end'.  Modify/truncate any
>>    * region which contains end.  Delete any regions past end.
>>    * Return the number of huge pages removed from the map.
>> @@ -544,22 +619,42 @@ static void set_vma_private_data(struct vm_area_struct *vma,
>>   struct resv_map *resv_map_alloc(void)
>>   {
>>   	struct resv_map *resv_map = kmalloc(sizeof(*resv_map), GFP_KERNEL);
>> -	if (!resv_map)
>> +	struct file_region *rg = kmalloc(sizeof(*rg), GFP_KERNEL);
>> +
>> +	if (!resv_map || !rg) {
>> +		kfree(resv_map);
>> +		kfree(rg);
>>   		return NULL;
>> +	}
>>
>>   	kref_init(&resv_map->refs);
>>   	spin_lock_init(&resv_map->lock);
>>   	INIT_LIST_HEAD(&resv_map->regions);
>>
>> +	resv_map->adds_in_progress = 0;
>> +
>> +	INIT_LIST_HEAD(&resv_map->rgn_cache);
>> +	list_add(&rg->link, &resv_map->rgn_cache);
>> +	resv_map->rgn_cache_count = 1;
>> +
>>   	return resv_map;
>>   }
>>
>>   void resv_map_release(struct kref *ref)
>>   {
>>   	struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
>> +	struct list_head *head = &resv_map->regions;
>> +	struct file_region *rg, *trg;
>>
>>   	/* Clear out any active regions before we release the map. */
>>   	region_truncate(resv_map, 0);
>> +
>> +	/* ... and any entries left in the cache */
>> +	list_for_each_entry_safe(rg, trg, head, link)
>> +		list_del(&rg->link);
>> +
>
> I am wondering if the `cache' here is the rgn_cache list.
> If so, rg should also be freed.

Good catch.

It should be:

	struct list_head *head = &resv_map->rgn_cache;

and

	list_for_each_entry_safe(rg, trg, head, link) {
		list_del(&rg->link);
		kfree(rg);
	}

As is, it will leak memory and not free entries in the cache.

I will fix this in the next revision.

-- 
Mike Kravetz

>> +	VM_BUG_ON(resv_map->adds_in_progress);
>> +
>>   	kfree(resv_map);
>>   }
>>


* Re: [PATCH 02/10] mm/hugetlb: add region_del() to delete a specific range of entries
  2015-07-03  7:20   ` Hillf Danton
@ 2015-07-03 17:22     ` Mike Kravetz
  0 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2015-07-03 17:22 UTC (permalink / raw)
  To: Hillf Danton, linux-mm, linux-kernel, linux-api
  Cc: 'Dave Hansen', 'Naoya Horiguchi',
	'David Rientjes', 'Hugh Dickins',
	'Davidlohr Bueso', 'Aneesh Kumar',
	'Christoph Hellwig'

On 07/03/2015 12:20 AM, Hillf Danton wrote:
>> fallocate hole punch will want to remove a specific range of pages.
>> The existing region_truncate() routine deletes all region/reserve
>> map entries after a specified offset.  region_del() will provide
>> this same functionality if the end of region is specified as -1.
>> Hence, region_del() can replace region_truncate().
>>
>> Unlike region_truncate(), region_del() can return an error in the
>> rare case where it can not allocate memory for a region descriptor.
>> This ONLY happens in the case where an existing region must be split.
>> Current callers passing -1 as end of range will never experience
>> this error and do not need to deal with error handling.  Future
>> callers of region_del() (such as fallocate hole punch) will need to
>> handle this error.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>   mm/hugetlb.c | 101 ++++++++++++++++++++++++++++++++++++++++++++---------------
>>   1 file changed, 75 insertions(+), 26 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 189c0d7..e8c7f68 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -460,43 +460,92 @@ static void region_abort(struct resv_map *resv, long f, long t)
>>   }
>>
>>   /*
>> - * Truncate the reserve map at index 'end'.  Modify/truncate any
>> - * region which contains end.  Delete any regions past end.
>> - * Return the number of huge pages removed from the map.
>> + * Delete the specified range [f, t) from the reserve map.  If the
>> + * t parameter is -1, this indicates that ALL regions after f should
>> + * be deleted.  Locate the regions which intersect [f, t) and either
>> + * trim, delete or split the existing regions.
>> + *
>> + * Returns the number of huge pages deleted from the reserve map.
>> + * In the normal case, the return value is zero or more.  In the
>> + * case where a region must be split, a new region descriptor must
>> + * be allocated.  If the allocation fails, -ENOMEM will be returned.
>> + * NOTE: If the parameter t == -1, then we will never split a region
>> + * and possibly return -ENOMEM.  Callers specifying t == -1 do not
>> + * need to check for -ENOMEM error.
>>    */
>> -static long region_truncate(struct resv_map *resv, long end)
>> +static long region_del(struct resv_map *resv, long f, long t)
>>   {
>>   	struct list_head *head = &resv->regions;
>>   	struct file_region *rg, *trg;
>> -	long chg = 0;
>> +	struct file_region *nrg = NULL;
>> +	long del = 0;
>>
>> +	if (t == -1)
>> +		t = LONG_MAX;
>
> Why not use

See below

>> +retry:
>>   	spin_lock(&resv->lock);
>> -	/* Locate the region we are either in or before. */
>> -	list_for_each_entry(rg, head, link)
>> -		if (end <= rg->to)
>> +	list_for_each_entry_safe(rg, trg, head, link) {
>> +		if (rg->to <= f)
>> +			continue;
>> +		if (rg->from >= t)
>>   			break;
>> -	if (&rg->link == head)
>> -		goto out;
>>
>> -	/* If we are in the middle of a region then adjust it. */
>> -	if (end > rg->from) {
>> -		chg = rg->to - end;
>> -		rg->to = end;
>> -		rg = list_entry(rg->link.next, typeof(*rg), link);
>> -	}
>> +		if (f > rg->from && t < rg->to) { /* Must split region */
>> +			/*
>> +			 * Check for an entry in the cache before dropping
>> +			 * lock and attempting allocation.
>> +			 */
>> +			if (!nrg &&
>> +			    resv->rgn_cache_count > resv->adds_in_progress) {
>> +				nrg = list_first_entry(&resv->rgn_cache,
>> +							struct file_region,
>> +							link);
>> +				list_del(&nrg->link);
>> +				resv->rgn_cache_count--;
>> +			}
>>
>> -	/* Drop any remaining regions. */
>> -	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
>> -		if (&rg->link == head)
>> +			if (!nrg) {
>> +				spin_unlock(&resv->lock);
>> +				nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
>> +				if (!nrg)
>> +					return -ENOMEM;
>> +				goto retry;
>> +			}
>> +
>> +			del += t - f;
>> +
>> +			/* New entry for end of split region */
>> +			nrg->from = t;
>> +			nrg->to = rg->to;
>> +			INIT_LIST_HEAD(&nrg->link);
>> +
>> +			/* Original entry is trimmed */
>> +			rg->to = f;
>> +
>> +			list_add(&nrg->link, &rg->link);
>> +			nrg = NULL;
>>   			break;
>> -		chg += rg->to - rg->from;
>> -		list_del(&rg->link);
>> -		kfree(rg);
>> +		}
>> +
>> +		if (f <= rg->from && t >= rg->to) { /* Remove entire region */
>> +			del += rg->to - rg->from;
>> +			list_del(&rg->link);
>> +			kfree(rg);
>> +			continue;
>> +		}
>> +
>> +		if (f <= rg->from) {	/* Trim beginning of region */
>> +			del += t - rg->from;
>> +			rg->from = t;
>> +		} else {		/* Trim end of region */
>> +			del += rg->to - f;
>> +			rg->to = f;
>> +		}
>>   	}
>>
>> -out:
>>   	spin_unlock(&resv->lock);
>> -	return chg;
>> +	kfree(nrg);
>> +	return del;
>>   }
>>
>>   /*
>> @@ -647,7 +696,7 @@ void resv_map_release(struct kref *ref)
>>   	struct file_region *rg, *trg;
>>
>>   	/* Clear out any active regions before we release the map. */
>> -	region_truncate(resv_map, 0);
>> +	region_del(resv_map, 0, -1);
>
> LONG_MAX is not selected, why?

I think your question is "Why not use LONG_MAX to indicate that all
remaining regions in the map should be deleted (a truncate
operation)?".

The two values passed to the routine are essentially huge page
offsets.  My first thought when creating the routine was that a
negative value (such as -1) is an invalid offset and a good
choice to indicate a "special" case operation.  LONG_MAX is also
an invalid huge page offset, so it could be used for the same
purpose and would avoid the "if (t == -1) t = LONG_MAX" at the
beginning of the routine.  I have no objection to changing this
to LONG_MAX if it makes the code simpler/easier to understand.

-- 
Mike Kravetz

>>
>>   	/* ... and any entries left in the cache */
>>   	list_for_each_entry_safe(rg, trg, head, link)
>> @@ -3868,7 +3917,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
>>   	long gbl_reserve;
>>
>>   	if (resv_map)
>> -		chg = region_truncate(resv_map, offset);
>> +		chg = region_del(resv_map, offset, -1);
>>   	spin_lock(&inode->i_lock);
>>   	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
>>   	spin_unlock(&inode->i_lock);
>> --
>> 2.1.0


end of thread, other threads:[~2015-07-03 17:23 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
2015-07-02 15:38 [PATCH 00/10] hugetlbfs: add fallocate support Mike Kravetz
2015-07-02 15:38 ` [PATCH 01/10] mm/hugetlb: add cache of descriptors to resv_map for region_add Mike Kravetz
2015-07-03  4:21   ` Hillf Danton
2015-07-03 17:12     ` Mike Kravetz
2015-07-02 15:38 ` [PATCH 02/10] mm/hugetlb: add region_del() to delete a specific range of entries Mike Kravetz
2015-07-03  7:20   ` Hillf Danton
2015-07-03 17:22     ` Mike Kravetz
2015-07-02 15:38 ` [PATCH 03/10] mm/hugetlb: expose hugetlb fault mutex for use by fallocate Mike Kravetz
2015-07-02 15:38 ` [PATCH 04/10] hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete Mike Kravetz
2015-07-02 15:38 ` [PATCH 05/10] hugetlbfs: truncate_hugepages() takes a range of pages Mike Kravetz
2015-07-02 15:38 ` [PATCH 06/10] mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch Mike Kravetz
2015-07-02 15:38 ` [PATCH 07/10] mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate Mike Kravetz
2015-07-02 15:38 ` [PATCH 08/10] hugetlbfs: New huge_add_to_page_cache helper routine Mike Kravetz
2015-07-02 15:38 ` [PATCH 09/10] hugetlbfs: add hugetlbfs_fallocate() Mike Kravetz
2015-07-02 15:38 ` [PATCH 10/10] mm: madvise allow remove operation for hugetlbfs Mike Kravetz
