linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [RFC v5 PATCH 0/9] hugetlbfs: add fallocate support
@ 2015-06-23  0:38 Mike Kravetz
  2015-06-23  0:38 ` [RFC v5 PATCH 1/9] mm/hugetlb: add region_del() to delete a specific range of entries Mike Kravetz
                   ` (8 more replies)
  0 siblings, 9 replies; 14+ messages in thread
From: Mike Kravetz @ 2015-06-23  0:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

No major changes since last RFC.  Externalized existing hugetlb
fault mutex code.  Added more comments to alloc_huge_page reservation
handling as I could not come up with a good way to encapsulate.
libhugetlbfs test cases and man page updates will be included with
a later request for patch inclusion.

hugetlbfs is used today by applications that want a high degree of
control over huge page usage.  Often, large hugetlbfs files are used
to map a large number huge pages into the application processes.
The applications know when page ranges within these large files will
no longer be used, and ideally would like to release them back to
the subpool or global pools for other uses.  The fallocate() system
call provides an interface for preallocation and hole punching within
files.  This patch set adds fallocate functionality to hugetlbfs.

RFC v5:
  Simply made existing hugetlb fault mutex hash routine available for
    use by fallocate.
  Unable to come up with good reservation encapsulation routines for
    alloc_huge_page, so attempted to comment better.
RFC v4:
  Removed alloc_huge_page/hugetlb_reserve_pages race patches as already
    in mmotm
  Moved hugetlb_fix_reserve_counts in series as suggested by Naoya Horiguchi
  Inline'ed hugetlb_fault_mutex routines as suggested by Davidlohr Bueso and
    existing code changed to use new interfaces as suggested by Naoya
  fallocate preallocation code cleaned up and made simpler
  Modified alloc_huge_page to handle special case where allocation is
    for a hole punched area with spool reserves
RFC v3:
  Folded in patch for alloc_huge_page/hugetlb_reserve_pages race
    in current code
  fallocate allocation and hole punch is synchronized with page
    faults via existing mutex table
   hole punch uses existing hugetlb_vmtruncate_list instead of more
    generic unmap_mapping_range for unmapping
   Error handling for the case when region_del() fails
RFC v2:
  Addressed alignment and error handling issues noticed by Hillf Danton
  New region_del() routine for region tracking/resv_map of ranges
  Fixed several issues found during more extensive testing
  Error handling in region_del() when kmalloc() fails stills needs
    to be addressed
  madvise remove support remains

Mike Kravetz (9):
  mm/hugetlb: add region_del() to delete a specific range of entries
  mm/hugetlb: expose hugetlb fault mutex for use by fallocate
  hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete
  hugetlbfs: truncate_hugepages() takes a range of pages
  mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch
  mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate
  hugetlbfs: New huge_add_to_page_cache helper routine
  hugetlbfs: add hugetlbfs_fallocate()
  mm: madvise allow remove operation for hugetlbfs

 fs/hugetlbfs/inode.c    | 281 ++++++++++++++++++++++++++++++++++++++++++++----
 include/linux/hugetlb.h |  14 ++-
 mm/hugetlb.c            | 245 +++++++++++++++++++++++++++++------------
 mm/madvise.c            |   2 +-
 4 files changed, 456 insertions(+), 86 deletions(-)

-- 
2.1.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC v5 PATCH 1/9] mm/hugetlb: add region_del() to delete a specific range of entries
  2015-06-23  0:38 [RFC v5 PATCH 0/9] hugetlbfs: add fallocate support Mike Kravetz
@ 2015-06-23  0:38 ` Mike Kravetz
  2015-06-29 21:47   ` Mike Kravetz
  2015-06-23  0:38 ` [RFC v5 PATCH 2/9] mm/hugetlb: expose hugetlb fault mutex for use by fallocate Mike Kravetz
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 14+ messages in thread
From: Mike Kravetz @ 2015-06-23  0:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

fallocate hole punch will want to remove a specific range of pages.
The existing region_truncate() routine deletes all region/reserve
map entries after a specified offset.  region_del() will provide
this same functionality if the end of region is specified as -1.
Hence, region_del() can replace region_truncate().

Unlike region_truncate(), region_del() can return an error in the
rare case where it can not allocate memory for a region descriptor.
This ONLY happens in the case where an existing region must be split.
Current callers passing -1 as end of range will never experience
this error and do not need to deal with error handling.  Future
callers of region_del() (such as fallocate hole punch) will need to
handle this error.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 88 ++++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 62 insertions(+), 26 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a8c3087..3fc2359 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -385,43 +385,79 @@ out_nrg:
 }
 
 /*
- * Truncate the reserve map at index 'end'.  Modify/truncate any
- * region which contains end.  Delete any regions past end.
- * Return the number of huge pages removed from the map.
+ * Delete the specified range [f, t) from the reserve map.  If the
+ * t parameter is -1, this indicates that ALL regions after f should
+ * be deleted.  Locate the regions which intersect [f, t) and either
+ * trim, delete or split the existing regions.
+ *
+ * Returns the number of huge pages deleted from the reserve map.
+ * In the normal case, the return value is zero or more.  In the
+ * case where a region must be split, a new region descriptor must
+ * be allocated.  If the allocation fails, -ENOMEM will be returned.
+ * NOTE: If the parameter t == -1, then we will never split a region
+ * and possibly return -ENOMEM.  Callers specifying t == -1 do not
+ * need to check for -ENOMEM error.
  */
-static long region_truncate(struct resv_map *resv, long end)
+static long region_del(struct resv_map *resv, long f, long t)
 {
 	struct list_head *head = &resv->regions;
 	struct file_region *rg, *trg;
-	long chg = 0;
+	struct file_region *nrg = NULL;
+	long del = 0;
 
+	if (t == -1)
+		t = LONG_MAX;
+retry:
 	spin_lock(&resv->lock);
-	/* Locate the region we are either in or before. */
-	list_for_each_entry(rg, head, link)
-		if (end <= rg->to)
+	list_for_each_entry_safe(rg, trg, head, link) {
+		if (rg->to <= f)
+			continue;
+		if (rg->from >= t)
 			break;
-	if (&rg->link == head)
-		goto out;
 
-	/* If we are in the middle of a region then adjust it. */
-	if (end > rg->from) {
-		chg = rg->to - end;
-		rg->to = end;
-		rg = list_entry(rg->link.next, typeof(*rg), link);
-	}
+		if (f > rg->from && t < rg->to) { /* Must split region */
+			if (!nrg) {
+				spin_unlock(&resv->lock);
+				nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+				if (!nrg)
+					return -ENOMEM;
+				goto retry;
+			}
 
-	/* Drop any remaining regions. */
-	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
-		if (&rg->link == head)
+			del += t - f;
+
+			/* New entry for end of split region */
+			nrg->from = t;
+			nrg->to = rg->to;
+			INIT_LIST_HEAD(&nrg->link);
+
+			/* Original entry is trimmed */
+			rg->to = f;
+
+			list_add(&nrg->link, &rg->link);
+			nrg = NULL;
 			break;
-		chg += rg->to - rg->from;
-		list_del(&rg->link);
-		kfree(rg);
+		}
+
+		if (f <= rg->from && t >= rg->to) { /* Remove entire region */
+			del += rg->to - rg->from;
+			list_del(&rg->link);
+			kfree(rg);
+			continue;
+		}
+
+		if (f <= rg->from) {	/* Trim beginning of region */
+			del += t - rg->from;
+			rg->from = t;
+		} else {		/* Trim end of region */
+			del += rg->to - f;
+			rg->to = f;
+		}
 	}
 
-out:
 	spin_unlock(&resv->lock);
-	return chg;
+	kfree(nrg);
+	return del;
 }
 
 /*
@@ -559,7 +595,7 @@ void resv_map_release(struct kref *ref)
 	struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
 
 	/* Clear out any active regions before we release the map. */
-	region_truncate(resv_map, 0);
+	region_del(resv_map, 0, -1);
 	kfree(resv_map);
 }
 
@@ -3740,7 +3776,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 	long gbl_reserve;
 
 	if (resv_map)
-		chg = region_truncate(resv_map, offset);
+		chg = region_del(resv_map, offset, -1);
 	spin_lock(&inode->i_lock);
 	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
 	spin_unlock(&inode->i_lock);
-- 
2.1.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC v5 PATCH 2/9] mm/hugetlb: expose hugetlb fault mutex for use by fallocate
  2015-06-23  0:38 [RFC v5 PATCH 0/9] hugetlbfs: add fallocate support Mike Kravetz
  2015-06-23  0:38 ` [RFC v5 PATCH 1/9] mm/hugetlb: add region_del() to delete a specific range of entries Mike Kravetz
@ 2015-06-23  0:38 ` Mike Kravetz
  2015-06-23  0:38 ` [RFC v5 PATCH 3/9] hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete Mike Kravetz
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Mike Kravetz @ 2015-06-23  0:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

hugetlb page faults are currently synchronized by the table of
mutexes (htlb_fault_mutex_table).  fallocate code will need to
synchronize with the page fault code when it allocates or
deletes pages.  Expose interfaces so that fallocate operations
can be synchronized with page faults.  Minor name changes to
be more consistent with other global hugetlb symbols.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h |  5 +++++
 mm/hugetlb.c            | 20 ++++++++++----------
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 2050261..33c6b32 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -85,6 +85,11 @@ int dequeue_hwpoisoned_huge_page(struct page *page);
 bool isolate_huge_page(struct page *page, struct list_head *list);
 void putback_active_hugepage(struct page *page);
 void free_huge_page(struct page *page);
+extern struct mutex *hugetlb_fault_mutex_table;
+u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+				struct vm_area_struct *vma,
+				struct address_space *mapping,
+				pgoff_t idx, unsigned long address);
 
 #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
 pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3fc2359..9657cca 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -64,7 +64,7 @@ DEFINE_SPINLOCK(hugetlb_lock);
  * prevent spurious OOMs when the hugepage pool is fully utilized.
  */
 static int num_fault_mutexes;
-static struct mutex *htlb_fault_mutex_table ____cacheline_aligned_in_smp;
+struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
 
 /* Forward declaration */
 static int hugetlb_acct_memory(struct hstate *h, long delta);
@@ -2347,7 +2347,7 @@ static void __exit hugetlb_exit(void)
 	}
 
 	kobject_put(hugepages_kobj);
-	kfree(htlb_fault_mutex_table);
+	kfree(hugetlb_fault_mutex_table);
 }
 module_exit(hugetlb_exit);
 
@@ -2380,12 +2380,12 @@ static int __init hugetlb_init(void)
 #else
 	num_fault_mutexes = 1;
 #endif
-	htlb_fault_mutex_table =
+	hugetlb_fault_mutex_table =
 		kmalloc(sizeof(struct mutex) * num_fault_mutexes, GFP_KERNEL);
-	BUG_ON(!htlb_fault_mutex_table);
+	BUG_ON(!hugetlb_fault_mutex_table);
 
 	for (i = 0; i < num_fault_mutexes; i++)
-		mutex_init(&htlb_fault_mutex_table[i]);
+		mutex_init(&hugetlb_fault_mutex_table[i]);
 	return 0;
 }
 module_init(hugetlb_init);
@@ -3316,7 +3316,7 @@ backout_unlocked:
 }
 
 #ifdef CONFIG_SMP
-static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
 			    struct vm_area_struct *vma,
 			    struct address_space *mapping,
 			    pgoff_t idx, unsigned long address)
@@ -3341,7 +3341,7 @@ static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
  * For uniprocesor systems we always use a single mutex, so just
  * return 0 and avoid the hashing overhead.
  */
-static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
 			    struct vm_area_struct *vma,
 			    struct address_space *mapping,
 			    pgoff_t idx, unsigned long address)
@@ -3389,8 +3389,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
-	hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
-	mutex_lock(&htlb_fault_mutex_table[hash]);
+	hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, address);
+	mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 	entry = huge_ptep_get(ptep);
 	if (huge_pte_none(entry)) {
@@ -3473,7 +3473,7 @@ out_ptl:
 		put_page(pagecache_page);
 	}
 out_mutex:
-	mutex_unlock(&htlb_fault_mutex_table[hash]);
+	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 	/*
 	 * Generally it's safe to hold refcount during waiting page lock. But
 	 * here we just wait to defer the next page fault to avoid busy loop and
-- 
2.1.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC v5 PATCH 3/9] hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete
  2015-06-23  0:38 [RFC v5 PATCH 0/9] hugetlbfs: add fallocate support Mike Kravetz
  2015-06-23  0:38 ` [RFC v5 PATCH 1/9] mm/hugetlb: add region_del() to delete a specific range of entries Mike Kravetz
  2015-06-23  0:38 ` [RFC v5 PATCH 2/9] mm/hugetlb: expose hugetlb fault mutex for use by fallocate Mike Kravetz
@ 2015-06-23  0:38 ` Mike Kravetz
  2015-06-23  0:38 ` [RFC v5 PATCH 4/9] hugetlbfs: truncate_hugepages() takes a range of pages Mike Kravetz
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Mike Kravetz @ 2015-06-23  0:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

fallocate hole punch will want to unmap a specific range of pages.
Modify the existing hugetlb_vmtruncate_list() routine to take a
start/end range.  If end is 0, this indicates all pages after start
should be unmapped.  This is the same as the existing truncate
functionality.  Modify existing callers to add 0 as end of range.

Since the routine will be used in hole punch as well as truncate
operations, it is more appropriately renamed to hugetlb_vmdelete_list().

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index e5a93d8..e9d4c8d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -374,11 +374,15 @@ static void hugetlbfs_evict_inode(struct inode *inode)
 }
 
 static inline void
-hugetlb_vmtruncate_list(struct rb_root *root, pgoff_t pgoff)
+hugetlb_vmdelete_list(struct rb_root *root, pgoff_t start, pgoff_t end)
 {
 	struct vm_area_struct *vma;
 
-	vma_interval_tree_foreach(vma, root, pgoff, ULONG_MAX) {
+	/*
+	 * end == 0 indicates that the entire range after
+	 * start should be unmapped.
+	 */
+	vma_interval_tree_foreach(vma, root, start, end ? end : ULONG_MAX) {
 		unsigned long v_offset;
 
 		/*
@@ -387,13 +391,20 @@ hugetlb_vmtruncate_list(struct rb_root *root, pgoff_t pgoff)
 		 * which overlap the truncated area starting at pgoff,
 		 * and no vma on a 32-bit arch can span beyond the 4GB.
 		 */
-		if (vma->vm_pgoff < pgoff)
-			v_offset = (pgoff - vma->vm_pgoff) << PAGE_SHIFT;
+		if (vma->vm_pgoff < start)
+			v_offset = (start - vma->vm_pgoff) << PAGE_SHIFT;
 		else
 			v_offset = 0;
 
-		unmap_hugepage_range(vma, vma->vm_start + v_offset,
-				     vma->vm_end, NULL);
+		if (end) {
+			end = ((end - start) << PAGE_SHIFT) +
+			       vma->vm_start + v_offset;
+			if (end > vma->vm_end)
+				end = vma->vm_end;
+		} else
+			end = vma->vm_end;
+
+		unmap_hugepage_range(vma, vma->vm_start + v_offset, end, NULL);
 	}
 }
 
@@ -409,7 +420,7 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	i_size_write(inode, offset);
 	i_mmap_lock_write(mapping);
 	if (!RB_EMPTY_ROOT(&mapping->i_mmap))
-		hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
+		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
 	i_mmap_unlock_write(mapping);
 	truncate_hugepages(inode, offset);
 	return 0;
-- 
2.1.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC v5 PATCH 4/9] hugetlbfs: truncate_hugepages() takes a range of pages
  2015-06-23  0:38 [RFC v5 PATCH 0/9] hugetlbfs: add fallocate support Mike Kravetz
                   ` (2 preceding siblings ...)
  2015-06-23  0:38 ` [RFC v5 PATCH 3/9] hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete Mike Kravetz
@ 2015-06-23  0:38 ` Mike Kravetz
  2015-06-23  0:38 ` [RFC v5 PATCH 5/9] mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch Mike Kravetz
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Mike Kravetz @ 2015-06-23  0:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

Modify truncate_hugepages() to take a range of pages (start, end)
instead of simply start. If an end value of -1 is passed, the
current "truncate" functionality is maintained. Existing callers
are modified to pass -1 as end of range. By keying off end == -1,
the routine behaves differently for truncate and hole punch.
Page removal is now synchronized with page allocation via faults
by using the fault mutex table. The hole punch case can experience
the rare region_del error and must handle accordingly.

Add the routine hugetlb_fix_reserve_counts to fix up reserve counts
in the case where region_del returns an error.

Since the routine handles more than just the truncate case, it is
renamed to remove_inode_hugepages().  To be consistent, the routine
truncate_huge_page() is renamed remove_huge_page().

Downstream of remove_inode_hugepages(), the routine
hugetlb_unreserve_pages() is also modified to take a range of pages.
hugetlb_unreserve_pages is modified to detect an error from
region_del and pass it back to the caller.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c    | 98 ++++++++++++++++++++++++++++++++++++++++++++-----
 include/linux/hugetlb.h |  4 +-
 mm/hugetlb.c            | 40 ++++++++++++++++++--
 3 files changed, 128 insertions(+), 14 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index e9d4c8d..160d58d4 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -318,26 +318,61 @@ static int hugetlbfs_write_end(struct file *file, struct address_space *mapping,
 	return -EINVAL;
 }
 
-static void truncate_huge_page(struct page *page)
+static void remove_huge_page(struct page *page)
 {
 	ClearPageDirty(page);
 	ClearPageUptodate(page);
 	delete_from_page_cache(page);
 }
 
-static void truncate_hugepages(struct inode *inode, loff_t lstart)
+
+/*
+ * remove_inode_hugepages handles two distinct cases: truncation and hole
+ * punch.  There are subtle differences in operation for each case.
+
+ * truncation is indicated by end of range being -1
+ *	In this case, we first scan the range and release found pages.
+ *	After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
+ *	maps and global counts.
+ * hole punch is indicated if end is not -1
+ *	In the hole punch case we scan the range and release found pages.
+ *	Only when releasing a page is the associated region/reserv map
+ *	deleted.  The region/reserv map for ranges without associated
+ *	pages are not modified.
+ * Note: If the passed end of range value is beyond the end of file, but
+ * not -1 this routine still performs a hole punch operation.
+ */
+static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
+				   loff_t lend)
 {
 	struct hstate *h = hstate_inode(inode);
 	struct address_space *mapping = &inode->i_data;
 	const pgoff_t start = lstart >> huge_page_shift(h);
+	const pgoff_t end = lend >> huge_page_shift(h);
+	struct vm_area_struct pseudo_vma;
 	struct pagevec pvec;
 	pgoff_t next;
 	int i, freed = 0;
+	long lookup_nr = PAGEVEC_SIZE;
+	bool truncate_op = (lend == -1);
 
+	memset(&pseudo_vma, 0, sizeof(struct vm_area_struct));
+	pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
 	pagevec_init(&pvec, 0);
 	next = start;
-	while (1) {
-		if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+	while (next < end) {
+		/*
+		 * Make sure to never grab more pages that we
+		 * might possibly need.
+		 */
+		if (end - next < lookup_nr)
+			lookup_nr = end - next;
+
+		/*
+		 * This pagevec_lookup() may return pages past 'end',
+		 * so we must check for page->index > end.
+		 */
+		if (!pagevec_lookup(&pvec, mapping, next, lookup_nr)) {
 			if (next == start)
 				break;
 			next = start;
@@ -346,26 +381,69 @@ static void truncate_hugepages(struct inode *inode, loff_t lstart)
 
 		for (i = 0; i < pagevec_count(&pvec); ++i) {
 			struct page *page = pvec.pages[i];
+			u32 hash;
+
+			hash = hugetlb_fault_mutex_hash(h, current->mm,
+							&pseudo_vma,
+							mapping, next, 0);
+			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 			lock_page(page);
+			if (page->index >= end) {
+				unlock_page(page);
+				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+				next = end;	/* we are done */
+				break;
+			}
+
+			/*
+			 * If page is mapped, it was faulted in after being
+			 * unmapped.  Do nothing in this race case.  In the
+			 * normal case page is not mapped.
+			 */
+			if (!page_mapped(page)) {
+				bool rsv_on_error = !PagePrivate(page);
+				/*
+				 * We must free the huge page and remove
+				 * from page cache (remove_huge_page) BEFORE
+				 * removing the region/reserve map
+				 * (hugetlb_unreserve_pages).  In rare out
+				 * of memory conditions, removal of the
+				 * region/reserve map could fail.  Before
+				 * free'ing the page, note PagePrivate which
+				 * is used in case of error.
+				 */
+				remove_huge_page(page);
+				freed++;
+				if (!truncate_op) {
+					if (unlikely(hugetlb_unreserve_pages(
+							inode, next,
+							next + 1, 1)))
+						hugetlb_fix_reserve_counts(
+							inode, rsv_on_error);
+				}
+			}
+
 			if (page->index > next)
 				next = page->index;
+
 			++next;
-			truncate_huge_page(page);
 			unlock_page(page);
-			freed++;
+
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}
 		huge_pagevec_release(&pvec);
 	}
-	BUG_ON(!lstart && mapping->nrpages);
-	hugetlb_unreserve_pages(inode, start, freed);
+
+	if (truncate_op)
+		(void)hugetlb_unreserve_pages(inode, start, end, freed);
 }
 
 static void hugetlbfs_evict_inode(struct inode *inode)
 {
 	struct resv_map *resv_map;
 
-	truncate_hugepages(inode, 0);
+	remove_inode_hugepages(inode, 0, -1);
 	resv_map = (struct resv_map *)inode->i_mapping->private_data;
 	/* root inode doesn't have the resv_map, so we should check it */
 	if (resv_map)
@@ -422,7 +500,7 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	if (!RB_EMPTY_ROOT(&mapping->i_mmap))
 		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
 	i_mmap_unlock_write(mapping);
-	truncate_hugepages(inode, offset);
+	remove_inode_hugepages(inode, offset, -1);
 	return 0;
 }
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 33c6b32..e6ed378 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -80,11 +80,13 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 int hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						struct vm_area_struct *vma,
 						vm_flags_t vm_flags);
-void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
+long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
+						long freed);
 int dequeue_hwpoisoned_huge_page(struct page *page);
 bool isolate_huge_page(struct page *page, struct list_head *list);
 void putback_active_hugepage(struct page *page);
 void free_huge_page(struct page *page);
+void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve);
 extern struct mutex *hugetlb_fault_mutex_table;
 u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
 				struct vm_area_struct *vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9657cca..7cd4b67 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -461,6 +461,28 @@ retry:
 }
 
 /*
+ * A rare out of memory error was encountered which prevented removal of
+ * the reserve map region for a page.  The huge page itself was free'ed
+ * and removed from the page cache.  This routine will adjust the subpool
+ * usage count, and the global reserve count if needed.  By incrementing
+ * these counts, the reserve map entry which could not be deleted will
+ * appear as a "reserved" entry instead of simply dangling with incorrect
+ * counts.
+ */
+void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve)
+{
+	struct hugepage_subpool *spool = subpool_inode(inode);
+	long rsv_adjust;
+
+	rsv_adjust = hugepage_subpool_get_pages(spool, 1);
+	if (restore_reserve && rsv_adjust) {
+		struct hstate *h = hstate_inode(inode);
+
+		hugetlb_acct_memory(h, 1);
+	}
+}
+
+/*
  * Count and return the number of huge pages in the reserve map
  * that intersect with the range [f, t).
  */
@@ -3767,7 +3789,8 @@ out_err:
 	return ret;
 }
 
-void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
+long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
+								long freed)
 {
 	struct hstate *h = hstate_inode(inode);
 	struct resv_map *resv_map = inode_resv_map(inode);
@@ -3775,8 +3798,17 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 	struct hugepage_subpool *spool = subpool_inode(inode);
 	long gbl_reserve;
 
-	if (resv_map)
-		chg = region_del(resv_map, offset, -1);
+	if (resv_map) {
+		chg = region_del(resv_map, start, end);
+		/*
+		 * region_del() can fail in the rare case where a region
+		 * must be split and another region descriptor can not be
+		 * allocated.  If end == -1, it will not fail.
+		 */
+		if (chg < 0)
+			return chg;
+	}
+
 	spin_lock(&inode->i_lock);
 	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
 	spin_unlock(&inode->i_lock);
@@ -3787,6 +3819,8 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 	 */
 	gbl_reserve = hugepage_subpool_put_pages(spool, (chg - freed));
 	hugetlb_acct_memory(h, -gbl_reserve);
+
+	return 0;
 }
 
 #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
-- 
2.1.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC v5 PATCH 5/9] mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch
  2015-06-23  0:38 [RFC v5 PATCH 0/9] hugetlbfs: add fallocate support Mike Kravetz
                   ` (3 preceding siblings ...)
  2015-06-23  0:38 ` [RFC v5 PATCH 4/9] hugetlbfs: truncate_hugepages() takes a range of pages Mike Kravetz
@ 2015-06-23  0:38 ` Mike Kravetz
  2015-06-23  0:38 ` [RFC v5 PATCH 6/9] mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate Mike Kravetz
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Mike Kravetz @ 2015-06-23  0:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

In vma_has_reserves(), the current assumption is that reserves are
always present for shared mappings.  However, this will not be the
case with fallocate hole punch.  When punching a hole, the present
page will be deleted as well as the region/reserve map entry (and
hence any reservation).  vma_has_reserves is passed "chg" which
indicates whether or not a region/reserve map is present.  Use
this to determine if reserves are actually present or were removed
via hole punch.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7cd4b67..9d2859f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -692,9 +692,19 @@ static int vma_has_reserves(struct vm_area_struct *vma, long chg)
 			return 0;
 	}
 
-	/* Shared mappings always use reserves */
-	if (vma->vm_flags & VM_MAYSHARE)
-		return 1;
+	if (vma->vm_flags & VM_MAYSHARE) {
+		/*
+		 * We know VM_NORESERVE is not set.  Therefore, there SHOULD
+		 * be a region map for all pages.  The only situation where
+		 * there is no region map is if a hole was punched via
+		 * fallocate.  In this case, there really are no reverves to
+		 * use.  This situation is indicated if chg != 0.
+		 */
+		if (chg)
+			return 0;
+		else
+			return 1;
+	}
 
 	/*
 	 * Only the process that called mmap() has reserves for
-- 
2.1.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC v5 PATCH 6/9] mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate
  2015-06-23  0:38 [RFC v5 PATCH 0/9] hugetlbfs: add fallocate support Mike Kravetz
                   ` (4 preceding siblings ...)
  2015-06-23  0:38 ` [RFC v5 PATCH 5/9] mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch Mike Kravetz
@ 2015-06-23  0:38 ` Mike Kravetz
  2015-06-23  0:38 ` [RFC v5 PATCH 7/9] hugetlbfs: New huge_add_to_page_cache helper routine Mike Kravetz
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Mike Kravetz @ 2015-06-23  0:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

Areas hole punched by fallocate will not have entries in the
region/reserve map.  However, shared mappings with min_size subpool
reservations may still have reserved pages.  alloc_huge_page needs
to handle this special case and do the proper accounting.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 54 +++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 39 insertions(+), 15 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9d2859f..a485d9c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1600,32 +1600,56 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	struct hugepage_subpool *spool = subpool_vma(vma);
 	struct hstate *h = hstate_vma(vma);
 	struct page *page;
-	long chg, commit;
+	long map_chg, map_commit;
+	long gbl_chg;
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg;
 
 	idx = hstate_index(h);
 	/*
-	 * Processes that did not create the mapping will have no
-	 * reserves and will not have accounted against subpool
-	 * limit. Check that the subpool limit can be made before
-	 * satisfying the allocation MAP_NORESERVE mappings may also
-	 * need pages and subpool limit allocated allocated if no reserve
-	 * mapping overlaps.
+	 * Examine the region/reserve map to determine if the process
+	 * has a reservation for the page to be allocated.  A return
+	 * code of zero indicates a reservation exists (no change).
 	 */
-	chg = vma_needs_reservation(h, vma, addr);
-	if (chg < 0)
+	map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
+	if (map_chg < 0)
 		return ERR_PTR(-ENOMEM);
-	if (chg || avoid_reserve)
-		if (hugepage_subpool_get_pages(spool, 1) < 0)
+
+	/*
+	 * Processes that did not create the mapping will have no
+	 * reserves as indicated by the region/reserve map. Check
+	 * that the allocation will not exceed the subpool limit.
+	 * Allocations for MAP_NORESERVE mappings also need to be
+	 * checked against any subpool limit.
+	 */
+	if (map_chg || avoid_reserve) {
+		gbl_chg = hugepage_subpool_get_pages(spool, 1);
+		if (gbl_chg < 0)
 			return ERR_PTR(-ENOSPC);
 
+		/*
+		 * Even though there was no reservation in the region/reserve
+		 * map, there could be reservations associated with the
+		 * subpool that can be used.  This would be indicated if the
+		 * return value of hugepage_subpool_get_pages() is zero.
+		 * However, if avoid_reserve is specified we still avoid even
+		 * the subpool reservations.
+		 */
+		if (avoid_reserve)
+			gbl_chg = 1;
+	}
+
 	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
 	if (ret)
 		goto out_subpool_put;
 
 	spin_lock(&hugetlb_lock);
-	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg);
+	/*
+	 * glb_chg is passed to indicate whether or not a page must be taken
+	 * from the global free pool (global change).  gbl_chg == 0 indicates
+	 * a reservation exists for the allocation.
+	 */
+	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
 	if (!page) {
 		spin_unlock(&hugetlb_lock);
 		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
@@ -1641,8 +1665,8 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 
 	set_page_private(page, (unsigned long)spool);
 
-	commit = vma_commit_reservation(h, vma, addr);
-	if (unlikely(chg > commit)) {
+	map_commit = vma_commit_reservation(h, vma, addr);
+	if (unlikely(map_chg > map_commit)) {
 		/*
 		 * The page was added to the reservation map between
 		 * vma_needs_reservation and vma_commit_reservation.
@@ -1662,7 +1686,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 out_uncharge_cgroup:
 	hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
 out_subpool_put:
-	if (chg || avoid_reserve)
+	if (map_chg || avoid_reserve)
 		hugepage_subpool_put_pages(spool, 1);
 	return ERR_PTR(-ENOSPC);
 }
-- 
2.1.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC v5 PATCH 7/9] hugetlbfs: New huge_add_to_page_cache helper routine
  2015-06-23  0:38 [RFC v5 PATCH 0/9] hugetlbfs: add fallocate support Mike Kravetz
                   ` (5 preceding siblings ...)
  2015-06-23  0:38 ` [RFC v5 PATCH 6/9] mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate Mike Kravetz
@ 2015-06-23  0:38 ` Mike Kravetz
  2015-06-23  0:38 ` [RFC v5 PATCH 8/9] hugetlbfs: add hugetlbfs_fallocate() Mike Kravetz
  2015-06-23  0:38 ` [RFC v5 PATCH 9/9] mm: madvise allow remove operation for hugetlbfs Mike Kravetz
  8 siblings, 0 replies; 14+ messages in thread
From: Mike Kravetz @ 2015-06-23  0:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

Currently, there is  only a single place where hugetlbfs pages are
added to the page cache.  The new fallocate code be adding a second
one, so break the functionality out into its own helper.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h |  2 ++
 mm/hugetlb.c            | 27 ++++++++++++++++++---------
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e6ed378..71a24a7 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -330,6 +330,8 @@ struct huge_bootmem_page {
 struct page *alloc_huge_page_node(struct hstate *h, int nid);
 struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
+int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
+			pgoff_t idx);
 
 /* arch callback */
 int __init alloc_bootmem_huge_page(struct hstate *h);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a485d9c..77712c8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3239,6 +3239,23 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
 	return page != NULL;
 }
 
+int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
+			   pgoff_t idx)
+{
+	struct inode *inode = mapping->host;
+	struct hstate *h = hstate_inode(inode);
+	int err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
+
+	if (err)
+		return err;
+	ClearPagePrivate(page);
+
+	spin_lock(&inode->i_lock);
+	inode->i_blocks += blocks_per_huge_page(h);
+	spin_unlock(&inode->i_lock);
+	return 0;
+}
+
 static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			   struct address_space *mapping, pgoff_t idx,
 			   unsigned long address, pte_t *ptep, unsigned int flags)
@@ -3286,21 +3303,13 @@ retry:
 		set_page_huge_active(page);
 
 		if (vma->vm_flags & VM_MAYSHARE) {
-			int err;
-			struct inode *inode = mapping->host;
-
-			err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
+			int err = huge_add_to_page_cache(page, mapping, idx);
 			if (err) {
 				put_page(page);
 				if (err == -EEXIST)
 					goto retry;
 				goto out;
 			}
-			ClearPagePrivate(page);
-
-			spin_lock(&inode->i_lock);
-			inode->i_blocks += blocks_per_huge_page(h);
-			spin_unlock(&inode->i_lock);
 		} else {
 			lock_page(page);
 			if (unlikely(anon_vma_prepare(vma))) {
-- 
2.1.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC v5 PATCH 8/9] hugetlbfs: add hugetlbfs_fallocate()
  2015-06-23  0:38 [RFC v5 PATCH 0/9] hugetlbfs: add fallocate support Mike Kravetz
                   ` (6 preceding siblings ...)
  2015-06-23  0:38 ` [RFC v5 PATCH 7/9] hugetlbfs: New huge_add_to_page_cache helper routine Mike Kravetz
@ 2015-06-23  0:38 ` Mike Kravetz
  2015-07-24  6:25   ` Michal Hocko
  2015-06-23  0:38 ` [RFC v5 PATCH 9/9] mm: madvise allow remove operation for hugetlbfs Mike Kravetz
  8 siblings, 1 reply; 14+ messages in thread
From: Mike Kravetz @ 2015-06-23  0:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

This is based on the shmem version, but it has diverged quite
a bit.  We have no swap to worry about, nor the new file sealing.
Add synchronication via the fault mutex table to coordinate
page faults,  fallocate allocation and fallocate hole punch.

What this allows us to do is move physical memory in and out of
a hugetlbfs file without having it mapped.  This also gives us
the ability to support MADV_REMOVE since it is currently
implemented using fallocate().  MADV_REMOVE lets madvise() remove
pages from the middle of a hugetlbfs file, which wasn't possible
before.

hugetlbfs fallocate only operates on whole huge pages.

Based-on code-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c    | 158 +++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/hugetlb.h |   3 +
 mm/hugetlb.c            |   2 +-
 3 files changed, 161 insertions(+), 2 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 160d58d4..05b9d47 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -12,6 +12,7 @@
 #include <linux/thread_info.h>
 #include <asm/current.h>
 #include <linux/sched.h>		/* remove ASAP */
+#include <linux/falloc.h>
 #include <linux/fs.h>
 #include <linux/mount.h>
 #include <linux/file.h>
@@ -504,6 +505,160 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	return 0;
 }
 
+static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
+{
+	struct hstate *h = hstate_inode(inode);
+	loff_t hpage_size = huge_page_size(h);
+	loff_t hole_start, hole_end;
+
+	/*
+	 * For hole punch round up the beginning offset of the hole and
+	 * round down the end.
+	 */
+	hole_start = round_up(offset, hpage_size);
+	hole_end = round_down(offset + len, hpage_size);
+
+	if (hole_end > hole_start) {
+		struct address_space *mapping = inode->i_mapping;
+
+		mutex_lock(&inode->i_mutex);
+		i_mmap_lock_write(mapping);
+		if (!RB_EMPTY_ROOT(&mapping->i_mmap))
+			hugetlb_vmdelete_list(&mapping->i_mmap,
+						hole_start >> PAGE_SHIFT,
+						hole_end  >> PAGE_SHIFT);
+		i_mmap_unlock_write(mapping);
+		remove_inode_hugepages(inode, hole_start, hole_end);
+		mutex_unlock(&inode->i_mutex);
+	}
+
+	return 0;
+}
+
+static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
+				loff_t len)
+{
+	struct inode *inode = file_inode(file);
+	struct address_space *mapping = inode->i_mapping;
+	struct hstate *h = hstate_inode(inode);
+	struct vm_area_struct pseudo_vma;
+	struct mm_struct *mm = current->mm;
+	loff_t hpage_size = huge_page_size(h);
+	unsigned long hpage_shift = huge_page_shift(h);
+	pgoff_t start, index, end;
+	int error;
+	u32 hash;
+
+	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+		return -EOPNOTSUPP;
+
+	if (mode & FALLOC_FL_PUNCH_HOLE)
+		return hugetlbfs_punch_hole(inode, offset, len);
+
+	/*
+	 * Default preallocate case.
+	 * For this range, start is rounded down and end is rounded up
+	 * as well as being converted to page offsets.
+	 */
+	start = offset >> hpage_shift;
+	end = (offset + len + hpage_size - 1) >> hpage_shift;
+
+	mutex_lock(&inode->i_mutex);
+
+	/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
+	error = inode_newsize_ok(inode, offset + len);
+	if (error)
+		goto out;
+
+	/*
+	 * Initialize a pseudo vma that just contains the policy used
+	 * when allocating the huge pages.  The actual policy field
+	 * (vm_policy) is determined based on the index in the loop below.
+	 */
+	memset(&pseudo_vma, 0, sizeof(struct vm_area_struct));
+	pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
+	pseudo_vma.vm_file = file;
+
+	for (index = start; index < end; index++) {
+		/*
+		 * This is supposed to be the vaddr where the page is being
+		 * faulted in, but we have no vaddr here.
+		 */
+		struct page *page;
+		unsigned long addr;
+		int avoid_reserve = 0;
+
+		cond_resched();
+
+		/*
+		 * fallocate(2) manpage permits EINTR; we may have been
+		 * interrupted because we are using up too much memory.
+		 */
+		if (signal_pending(current)) {
+			error = -EINTR;
+			break;
+		}
+
+		/* Get policy based on index */
+		pseudo_vma.vm_policy =
+			mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy,
+							index);
+
+		/* addr is the offset within the file (zero based) */
+		addr = index * hpage_size;
+
+		/* mutex taken here, fault path and hole punch */
+		hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
+						index, addr);
+		mutex_lock(&hugetlb_fault_mutex_table[hash]);
+
+		/* See if already present in mapping to avoid alloc/free */
+		page = find_get_page(mapping, index);
+		if (page) {
+			put_page(page);
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			mpol_cond_put(pseudo_vma.vm_policy);
+			continue;
+		}
+
+		/* Allocate page and add to page cache */
+		page = alloc_huge_page(&pseudo_vma, addr, avoid_reserve);
+		mpol_cond_put(pseudo_vma.vm_policy);
+		if (IS_ERR(page)) {
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			error = PTR_ERR(page);
+			goto out;
+		}
+		clear_huge_page(page, addr, pages_per_huge_page(h));
+		__SetPageUptodate(page);
+		error = huge_add_to_page_cache(page, mapping, index);
+		if (unlikely(error)) {
+			put_page(page);
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			goto out;
+		}
+
+		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+
+		/*
+		 * page_put due to reference from alloc_huge_page()
+		 * unlock_page because locked by add_to_page_cache()
+		 */
+		put_page(page);
+		unlock_page(page);
+	}
+
+	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
+		i_size_write(inode, offset + len);
+	inode->i_ctime = CURRENT_TIME;
+	spin_lock(&inode->i_lock);
+	inode->i_private = NULL;
+	spin_unlock(&inode->i_lock);
+out:
+	mutex_unlock(&inode->i_mutex);
+	return error;
+}
+
 static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
 {
 	struct inode *inode = dentry->d_inode;
@@ -815,7 +970,8 @@ const struct file_operations hugetlbfs_file_operations = {
 	.mmap			= hugetlbfs_file_mmap,
 	.fsync			= noop_fsync,
 	.get_unmapped_area	= hugetlb_get_unmapped_area,
-	.llseek		= default_llseek,
+	.llseek			= default_llseek,
+	.fallocate		= hugetlbfs_fallocate,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 71a24a7..ad7da43 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -327,6 +327,8 @@ struct huge_bootmem_page {
 #endif
 };
 
+struct page *alloc_huge_page(struct vm_area_struct *vma,
+				unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_node(struct hstate *h, int nid);
 struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
@@ -481,6 +483,7 @@ static inline bool hugepages_supported(void)
 
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
+#define alloc_huge_page(v, a, r) NULL
 #define alloc_huge_page_node(h, nid) NULL
 #define alloc_huge_page_noerr(v, a, r) NULL
 #define alloc_bootmem_huge_page(h) NULL
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 77712c8..d5c5bbd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1594,7 +1594,7 @@ static long vma_commit_reservation(struct hstate *h,
 	return __vma_reservation_common(h, vma, addr, true);
 }
 
-static struct page *alloc_huge_page(struct vm_area_struct *vma,
+struct page *alloc_huge_page(struct vm_area_struct *vma,
 				    unsigned long addr, int avoid_reserve)
 {
 	struct hugepage_subpool *spool = subpool_vma(vma);
-- 
2.1.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC v5 PATCH 9/9] mm: madvise allow remove operation for hugetlbfs
  2015-06-23  0:38 [RFC v5 PATCH 0/9] hugetlbfs: add fallocate support Mike Kravetz
                   ` (7 preceding siblings ...)
  2015-06-23  0:38 ` [RFC v5 PATCH 8/9] hugetlbfs: add hugetlbfs_fallocate() Mike Kravetz
@ 2015-06-23  0:38 ` Mike Kravetz
  8 siblings, 0 replies; 14+ messages in thread
From: Mike Kravetz @ 2015-06-23  0:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

Now that we have hole punching support for hugetlbfs, we can
also support the MADV_REMOVE interface to it.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/madvise.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index d215ea9..3c1b7f0 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -467,7 +467,7 @@ static long madvise_remove(struct vm_area_struct *vma,
 
 	*prev = NULL;	/* tell sys_madvise we drop mmap_sem */
 
-	if (vma->vm_flags & (VM_LOCKED | VM_HUGETLB))
+	if (vma->vm_flags & VM_LOCKED)
 		return -EINVAL;
 
 	f = vma->vm_file;
-- 
2.1.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC v5 PATCH 1/9] mm/hugetlb: add region_del() to delete a specific range of entries
  2015-06-23  0:38 ` [RFC v5 PATCH 1/9] mm/hugetlb: add region_del() to delete a specific range of entries Mike Kravetz
@ 2015-06-29 21:47   ` Mike Kravetz
  0 siblings, 0 replies; 14+ messages in thread
From: Mike Kravetz @ 2015-06-29 21:47 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig

On 06/22/2015 05:38 PM, Mike Kravetz wrote:
> fallocate hole punch will want to remove a specific range of pages.
> The existing region_truncate() routine deletes all region/reserve
> map entries after a specified offset.  region_del() will provide
> this same functionality if the end of region is specified as -1.
> Hence, region_del() can replace region_truncate().
>
> Unlike region_truncate(), region_del() can return an error in the
> rare case where it can not allocate memory for a region descriptor.
> This ONLY happens in the case where an existing region must be split.
> Current callers passing -1 as end of range will never experience
> this error and do not need to deal with error handling.  Future
> callers of region_del() (such as fallocate hole punch) will need to
> handle this error.

Unfortunately, this new region_del() functionality required for hole
punch conflicts with existing region_chg()/region_add() assumptions.

region_chg/region_add is something like a two step commit process for
adding new region entries.  region_chg is first called to determine
the changes required for the new entry.  If the new entry can be
represented by expanding an existing region, no changes are made to
the map in region_chg.  If the new entry is not adjacent to an
existing region, a placeholder is created during region_chg.  Later
when region_add is called, the assumption is that a region (real or
placeholder) can be expanded to represent the new entry.  Since
all required entries already exist in the map, region_add can not
fail.

It is possible for the new region_del to modify the map between the
region_chg and region_add calls.  It can not modify the same map
entry being added by region_chg/region_add as that is protected by
the fault mutex.  However, it can modify an entry adjacent to the
new entry.  The entry could be modified so that it is no longer
adjacent to the new entry.  As a result, when region_add is called
it will not find a region which can be expanded to represent the
new entry.

In this situation, region_add only needs to add a new region to
the map.  However, to do so would require allocating a new region
descriptor.  The allocation could fail which would result in
region_add failing.

I'm thinking about having a cache of region descriptors pre-allocated
to handle this (rare) situation.  The number of descriptors needed
in the cache would correspond to the number of page faults in
progress (between region_chg and region_add).  region_chg would make
sure there are sufficient descriptors and allocate one if needed.
Error handling for region_chg ENOMEM already exists.  A sufficient
number of entries would be pre-allocated such that in the normal
case no allocation would be necessary.

Thoughts?
-- 
Mike Kravetz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC v5 PATCH 8/9] hugetlbfs: add hugetlbfs_fallocate()
  2015-06-23  0:38 ` [RFC v5 PATCH 8/9] hugetlbfs: add hugetlbfs_fallocate() Mike Kravetz
@ 2015-07-24  6:25   ` Michal Hocko
  2015-07-24 16:18     ` Mike Kravetz
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2015-07-24  6:25 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Dave Hansen, Naoya Horiguchi,
	David Rientjes, Hugh Dickins, Davidlohr Bueso, Aneesh Kumar,
	Hillf Danton, Christoph Hellwig

I hope this is the current version of the pathc - I somehow got lost in
last submissions where the discussion happens in v4 thread. This version
seems to have the same issue:
---

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC v5 PATCH 8/9] hugetlbfs: add hugetlbfs_fallocate()
  2015-07-24  6:25   ` Michal Hocko
@ 2015-07-24 16:18     ` Mike Kravetz
  2015-07-27  7:07       ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Mike Kravetz @ 2015-07-24 16:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Dave Hansen, Naoya Horiguchi,
	David Rientjes, Hugh Dickins, Davidlohr Bueso, Aneesh Kumar,
	Hillf Danton, Christoph Hellwig

On 07/23/2015 11:25 PM, Michal Hocko wrote:
> I hope this is the current version of the pathc - I somehow got lost in
> last submissions where the discussion happens in v4 thread. This version
> seems to have the same issue:

Yes, Michal this issue exists in the version put into mmotm and was
noticed by kbuild test robot and Stephen in linux-next build.

Your patch below is the most obvious.  Thanks!  However, is this
the preferred method of handling this type of issue?  Is it
preferred to create wrappers for the code which handles numa
policy?  Then there could be two versions of the wrapper:  one
if CONFIG_NUMA is defined and one (a no-op) if not.  I am happy
with either, but am a relative newbie in this area so am looking
for a little guidance.

-- 
Mike Kravetz

> ---
>  From 04c37a979c5ce8cd39d3243e4e2c12905e4f1e6e Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Fri, 24 Jul 2015 08:14:32 +0200
> Subject: [PATCH] mmotm:
>   hugetlbfs-new-huge_add_to_page_cache-helper-routine-fix
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> hugetlbfs_fallocate relies on CONFIG_NUMA and fails to compile otherwise.
> This just makes the code compile but it seems it begs for a better solution.
>
> fs/hugetlbfs/inode.c: In function a??hugetlbfs_fallocatea??:
> fs/hugetlbfs/inode.c:578:13: error: a??struct vm_area_structa?? has no member named a??vm_policya??
>     pseudo_vma.vm_policy =
>               ^
> fs/hugetlbfs/inode.c:579:4: error: implicit declaration of function a??mpol_shared_policy_lookupa?? [-Werror=implicit-function-declaration]
>      mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy,
>      ^
> fs/hugetlbfs/inode.c:595:28: error: a??struct vm_area_structa?? has no member named a??vm_policya??
>      mpol_cond_put(pseudo_vma.vm_policy);
>                              ^
> fs/hugetlbfs/inode.c:601:27: error: a??struct vm_area_structa?? has no member named a??vm_policya??
>     mpol_cond_put(pseudo_vma.vm_policy);
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>   fs/hugetlbfs/inode.c | 6 ++++++
>   1 file changed, 6 insertions(+)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index d977cae89d29..dfca09218d77 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -575,9 +575,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>   		}
>
>   		/* Get policy based on index */
> +#ifdef CONFIG_NUMA
>   		pseudo_vma.vm_policy =
>   			mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy,
>   							index);
> +#endif
>
>   		/* addr is the offset within the file (zero based) */
>   		addr = index * hpage_size;
> @@ -592,13 +594,17 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>   		if (page) {
>   			put_page(page);
>   			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> +#ifdef CONFIG_NUMA
>   			mpol_cond_put(pseudo_vma.vm_policy);
> +#endif
>   			continue;
>   		}
>
>   		/* Allocate page and add to page cache */
>   		page = alloc_huge_page(&pseudo_vma, addr, avoid_reserve);
> +#ifdef CONFIG_NUMA
>   		mpol_cond_put(pseudo_vma.vm_policy);
> +#endif
>   		if (IS_ERR(page)) {
>   			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>   			error = PTR_ERR(page);
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC v5 PATCH 8/9] hugetlbfs: add hugetlbfs_fallocate()
  2015-07-24 16:18     ` Mike Kravetz
@ 2015-07-27  7:07       ` Michal Hocko
  0 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2015-07-27  7:07 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Dave Hansen, Naoya Horiguchi,
	David Rientjes, Hugh Dickins, Davidlohr Bueso, Aneesh Kumar,
	Hillf Danton, Christoph Hellwig

On Fri 24-07-15 09:18:35, Mike Kravetz wrote:
> On 07/23/2015 11:25 PM, Michal Hocko wrote:
> >I hope this is the current version of the pathc - I somehow got lost in
> >last submissions where the discussion happens in v4 thread. This version
> >seems to have the same issue:
> 
> Yes, Michal this issue exists in the version put into mmotm and was
> noticed by kbuild test robot and Stephen in linux-next build.
> 
> Your patch below is the most obvious.  Thanks!  However, is this
> the preferred method of handling this type of issue?  Is it
> preferred to create wrappers for the code which handles numa
> policy?

Yes that would be preferable. My "work around" was just to make my git
mirror of mmotm compilable. I am not sure whether Andrew picked my patch
but it will certainly get replaced by a cleaner version.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2015-07-27  7:07 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-23  0:38 [RFC v5 PATCH 0/9] hugetlbfs: add fallocate support Mike Kravetz
2015-06-23  0:38 ` [RFC v5 PATCH 1/9] mm/hugetlb: add region_del() to delete a specific range of entries Mike Kravetz
2015-06-29 21:47   ` Mike Kravetz
2015-06-23  0:38 ` [RFC v5 PATCH 2/9] mm/hugetlb: expose hugetlb fault mutex for use by fallocate Mike Kravetz
2015-06-23  0:38 ` [RFC v5 PATCH 3/9] hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete Mike Kravetz
2015-06-23  0:38 ` [RFC v5 PATCH 4/9] hugetlbfs: truncate_hugepages() takes a range of pages Mike Kravetz
2015-06-23  0:38 ` [RFC v5 PATCH 5/9] mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch Mike Kravetz
2015-06-23  0:38 ` [RFC v5 PATCH 6/9] mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate Mike Kravetz
2015-06-23  0:38 ` [RFC v5 PATCH 7/9] hugetlbfs: New huge_add_to_page_cache helper routine Mike Kravetz
2015-06-23  0:38 ` [RFC v5 PATCH 8/9] hugetlbfs: add hugetlbfs_fallocate() Mike Kravetz
2015-07-24  6:25   ` Michal Hocko
2015-07-24 16:18     ` Mike Kravetz
2015-07-27  7:07       ` Michal Hocko
2015-06-23  0:38 ` [RFC v5 PATCH 9/9] mm: madvise allow remove operation for hugetlbfs Mike Kravetz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).