* [RFC v4 PATCH 0/9] hugetlbfs: add fallocate support
@ 2015-06-11 21:01 Mike Kravetz
  2015-06-11 21:01 ` [RFC v4 PATCH 1/9] mm/hugetlb: add region_del() to delete a specific range of entries Mike Kravetz
                   ` (8 more replies)
  0 siblings, 9 replies; 15+ messages in thread
From: Mike Kravetz @ 2015-06-11 21:01 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

Most changes since the last RFC are code cleanup and restructuring
as suggested by review comments.  One bug was fixed in the alloc_huge_page
accounting for hole punched areas.  man pages have not yet been updated,
and test cases have not yet been added to libhugetlbfs as suggested.
Looking for any additional review comments before proposing the code
for inclusion.

hugetlbfs is used today by applications that want a high degree of
control over huge page usage.  Often, large hugetlbfs files are used
to map a large number of huge pages into the application processes.
The applications know when page ranges within these large files will
no longer be used, and ideally would like to release them back to
the subpool or global pools for other uses.  The fallocate() system
call provides an interface for preallocation and hole punching within
files.  This patch set adds fallocate functionality to hugetlbfs.
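
For anyone unfamiliar with the interface, usage from an application
looks roughly like this (the path and sizes below are only for
illustration, error checking omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <linux/falloc.h>

int main(void)
{
	int fd = open("/dev/hugepages/example", O_CREAT | O_RDWR, 0755);

	/* preallocate huge pages for the first 1GB of the file */
	fallocate(fd, 0, 0, 1UL << 30);

	/* later, give a 512MB range in the middle back to the pools */
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  1UL << 28, 1UL << 29);

	close(fd);
	return 0;
}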

RFC v4:
  Removed alloc_huge_page/hugetlb_reserve_pages race patches as already
    in mmotm
  Moved hugetlb_fix_reserve_counts in series as suggested by Naoya Horiguchi
  Inline'ed hugetlb_fault_mutex routines as suggested by Davidlohr Bueso and
    existing code changed to use new interfaces as suggested by Naoya
  fallocate preallocation code cleaned up and made simpler
  Modified alloc_huge_page to handle special case where allocation is
    for a hole punched area with spool reserves
RFC v3:
  Folded in patch for alloc_huge_page/hugetlb_reserve_pages race
    in current code
  fallocate allocation and hole punch are synchronized with page
    faults via the existing mutex table
  hole punch uses the existing hugetlb_vmtruncate_list instead of the more
    generic unmap_mapping_range for unmapping
  Error handling added for the case when region_del() fails
RFC v2:
  Addressed alignment and error handling issues noticed by Hillf Danton
  New region_del() routine for region tracking/resv_map of ranges
  Fixed several issues found during more extensive testing
  Error handling in region_del() when kmalloc() fails still needs
    to be addressed
  madvise remove support remains

Mike Kravetz (9):
  mm/hugetlb: add region_del() to delete a specific range of entries
  mm/hugetlb: expose hugetlb fault mutex for use by fallocate
  hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete
  hugetlbfs: truncate_hugepages() takes a range of pages
  mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch
  mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate
  hugetlbfs: New huge_add_to_page_cache helper routine
  hugetlbfs: add hugetlbfs_fallocate()
  mm: madvise allow remove operation for hugetlbfs

 fs/hugetlbfs/inode.c    | 274 ++++++++++++++++++++++++++++++++++++++++++++----
 include/linux/hugetlb.h |  19 +++-
 mm/hugetlb.c            | 207 +++++++++++++++++++++++++++---------
 mm/madvise.c            |   2 +-
 4 files changed, 432 insertions(+), 70 deletions(-)

-- 
2.1.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC v4 PATCH 1/9] mm/hugetlb: add region_del() to delete a specific range of entries
  2015-06-11 21:01 [RFC v4 PATCH 0/9] hugetlbfs: add fallocate support Mike Kravetz
@ 2015-06-11 21:01 ` Mike Kravetz
  2015-06-11 21:01 ` [RFC v4 PATCH 2/9] mm/hugetlb: expose hugetlb fault mutex for use by fallocate Mike Kravetz
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2015-06-11 21:01 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

fallocate hole punch will want to remove a specific range of pages.
The existing region_truncate() routine deletes all region/reserve
map entries after a specified offset.  region_del() will provide
this same functionality if the end of region is specified as -1.
Hence, region_del() can replace region_truncate().

Unlike region_truncate(), region_del() can return an error in the
rare case where it can not allocate memory for a region descriptor.
This ONLY happens in the case where an existing region must be split.
Current callers passing -1 as end of range will never experience
this error and do not need to deal with error handling.  Future
callers of region_del() (such as fallocate hole punch) will need to
handle this error.
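
For illustration (values invented), starting from a reserve map that
contains the single region [0, 10):

  region_del(resv, 3, 7)  returns 4, map becomes [0, 3) [7, 10)
                          (the new descriptor for [7, 10) is the
                           allocation that can fail with -ENOMEM)
  region_del(resv, 3, -1) returns 7, map becomes [0, 3)
                          (no split is ever needed, so it cannot fail)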

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 88 ++++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 62 insertions(+), 26 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a8c3087..3fc2359 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -385,43 +385,79 @@ out_nrg:
 }
 
 /*
- * Truncate the reserve map at index 'end'.  Modify/truncate any
- * region which contains end.  Delete any regions past end.
- * Return the number of huge pages removed from the map.
+ * Delete the specified range [f, t) from the reserve map.  If the
+ * t parameter is -1, this indicates that ALL regions after f should
+ * be deleted.  Locate the regions which intersect [f, t) and either
+ * trim, delete or split the existing regions.
+ *
+ * Returns the number of huge pages deleted from the reserve map.
+ * In the normal case, the return value is zero or more.  In the
+ * case where a region must be split, a new region descriptor must
+ * be allocated.  If the allocation fails, -ENOMEM will be returned.
+ * NOTE: If the parameter t == -1, then we will never split a region
+ * and possibly return -ENOMEM.  Callers specifying t == -1 do not
+ * need to check for -ENOMEM error.
  */
-static long region_truncate(struct resv_map *resv, long end)
+static long region_del(struct resv_map *resv, long f, long t)
 {
 	struct list_head *head = &resv->regions;
 	struct file_region *rg, *trg;
-	long chg = 0;
+	struct file_region *nrg = NULL;
+	long del = 0;
 
+	if (t == -1)
+		t = LONG_MAX;
+retry:
 	spin_lock(&resv->lock);
-	/* Locate the region we are either in or before. */
-	list_for_each_entry(rg, head, link)
-		if (end <= rg->to)
+	list_for_each_entry_safe(rg, trg, head, link) {
+		if (rg->to <= f)
+			continue;
+		if (rg->from >= t)
 			break;
-	if (&rg->link == head)
-		goto out;
 
-	/* If we are in the middle of a region then adjust it. */
-	if (end > rg->from) {
-		chg = rg->to - end;
-		rg->to = end;
-		rg = list_entry(rg->link.next, typeof(*rg), link);
-	}
+		if (f > rg->from && t < rg->to) { /* Must split region */
+			if (!nrg) {
+				spin_unlock(&resv->lock);
+				nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+				if (!nrg)
+					return -ENOMEM;
+				goto retry;
+			}
 
-	/* Drop any remaining regions. */
-	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
-		if (&rg->link == head)
+			del += t - f;
+
+			/* New entry for end of split region */
+			nrg->from = t;
+			nrg->to = rg->to;
+			INIT_LIST_HEAD(&nrg->link);
+
+			/* Original entry is trimmed */
+			rg->to = f;
+
+			list_add(&nrg->link, &rg->link);
+			nrg = NULL;
 			break;
-		chg += rg->to - rg->from;
-		list_del(&rg->link);
-		kfree(rg);
+		}
+
+		if (f <= rg->from && t >= rg->to) { /* Remove entire region */
+			del += rg->to - rg->from;
+			list_del(&rg->link);
+			kfree(rg);
+			continue;
+		}
+
+		if (f <= rg->from) {	/* Trim beginning of region */
+			del += t - rg->from;
+			rg->from = t;
+		} else {		/* Trim end of region */
+			del += rg->to - f;
+			rg->to = f;
+		}
 	}
 
-out:
 	spin_unlock(&resv->lock);
-	return chg;
+	kfree(nrg);
+	return del;
 }
 
 /*
@@ -559,7 +595,7 @@ void resv_map_release(struct kref *ref)
 	struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
 
 	/* Clear out any active regions before we release the map. */
-	region_truncate(resv_map, 0);
+	region_del(resv_map, 0, -1);
 	kfree(resv_map);
 }
 
@@ -3740,7 +3776,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 	long gbl_reserve;
 
 	if (resv_map)
-		chg = region_truncate(resv_map, offset);
+		chg = region_del(resv_map, offset, -1);
 	spin_lock(&inode->i_lock);
 	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
 	spin_unlock(&inode->i_lock);
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC v4 PATCH 2/9] mm/hugetlb: expose hugetlb fault mutex for use by fallocate
  2015-06-11 21:01 [RFC v4 PATCH 0/9] hugetlbfs: add fallocate support Mike Kravetz
  2015-06-11 21:01 ` [RFC v4 PATCH 1/9] mm/hugetlb: add region_del() to delete a specific range of entries Mike Kravetz
@ 2015-06-11 21:01 ` Mike Kravetz
  2015-06-11 22:46   ` Davidlohr Bueso
  2015-06-11 21:01 ` [RFC v4 PATCH 3/9] hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete Mike Kravetz
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 15+ messages in thread
From: Mike Kravetz @ 2015-06-11 21:01 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

hugetlb page faults are currently synchronized by the table of
mutexes (htlb_fault_mutex_table).  fallocate code will need to
synchronize with the page fault code when it allocates or
deletes pages.  Expose interfaces so that fallocate operations
can be synchronized with page faults.
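
A minimal sketch of the intended fallocate side usage (mirroring how
later patches in this series use the interfaces):

	u32 hash;

	hash = hugetlb_fault_mutex_shared_hash(mapping, index);
	hugetlb_fault_mutex_lock(hash);
	/* add (fallocate) or remove (hole punch) the page at 'index' */
	hugetlb_fault_mutex_unlock(hash);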

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h | 10 ++++++++++
 mm/hugetlb.c            | 20 ++++++++++++++++----
 2 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 2050261..bbd072e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -85,6 +85,16 @@ int dequeue_hwpoisoned_huge_page(struct page *page);
 bool isolate_huge_page(struct page *page, struct list_head *list);
 void putback_active_hugepage(struct page *page);
 void free_huge_page(struct page *page);
+u32 hugetlb_fault_mutex_shared_hash(struct address_space *mapping, pgoff_t idx);
+extern struct mutex *htlb_fault_mutex_table;
+static inline void hugetlb_fault_mutex_lock(u32 hash)
+{
+	mutex_lock(&htlb_fault_mutex_table[hash]);
+}
+static inline void hugetlb_fault_mutex_unlock(u32 hash)
+{
+	mutex_unlock(&htlb_fault_mutex_table[hash]);
+}
 
 #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
 pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3fc2359..f617cb6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -64,7 +64,7 @@ DEFINE_SPINLOCK(hugetlb_lock);
  * prevent spurious OOMs when the hugepage pool is fully utilized.
  */
 static int num_fault_mutexes;
-static struct mutex *htlb_fault_mutex_table ____cacheline_aligned_in_smp;
+struct mutex *htlb_fault_mutex_table ____cacheline_aligned_in_smp;
 
 /* Forward declaration */
 static int hugetlb_acct_memory(struct hstate *h, long delta);
@@ -3324,7 +3324,8 @@ static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
 	unsigned long key[2];
 	u32 hash;
 
-	if (vma->vm_flags & VM_SHARED) {
+	/* !vma implies this was called from hugetlbfs fallocate code */
+	if (!vma || vma->vm_flags & VM_SHARED) {
 		key[0] = (unsigned long) mapping;
 		key[1] = idx;
 	} else {
@@ -3350,6 +3351,17 @@ static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
 }
 #endif
 
+/*
+ * Interface for use by hugetlbfs fallocate code.  Faults must be
+ * synchronized with page adds or deletes by fallocate.  fallocate
+ * only deals with shared mappings.  See also hugetlb_fault_mutex_lock
+ * and hugetlb_fault_mutex_unlock.
+ */
+u32 hugetlb_fault_mutex_shared_hash(struct address_space *mapping, pgoff_t idx)
+{
+	return fault_mutex_hash(NULL, NULL, NULL, mapping, idx, 0);
+}
+
 int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags)
 {
@@ -3390,7 +3402,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * the same page in the page cache.
 	 */
 	hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
-	mutex_lock(&htlb_fault_mutex_table[hash]);
+	hugetlb_fault_mutex_lock(hash);
 
 	entry = huge_ptep_get(ptep);
 	if (huge_pte_none(entry)) {
@@ -3473,7 +3485,7 @@ out_ptl:
 		put_page(pagecache_page);
 	}
 out_mutex:
-	mutex_unlock(&htlb_fault_mutex_table[hash]);
+	hugetlb_fault_mutex_unlock(hash);
 	/*
 	 * Generally it's safe to hold refcount during waiting page lock. But
 	 * here we just wait to defer the next page fault to avoid busy loop and
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC v4 PATCH 3/9] hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete
  2015-06-11 21:01 [RFC v4 PATCH 0/9] hugetlbfs: add fallocate support Mike Kravetz
  2015-06-11 21:01 ` [RFC v4 PATCH 1/9] mm/hugetlb: add region_del() to delete a specific range of entries Mike Kravetz
  2015-06-11 21:01 ` [RFC v4 PATCH 2/9] mm/hugetlb: expose hugetlb fault mutex for use by fallocate Mike Kravetz
@ 2015-06-11 21:01 ` Mike Kravetz
  2015-06-11 21:01 ` [RFC v4 PATCH 4/9] hugetlbfs: truncate_hugepages() takes a range of pages Mike Kravetz
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2015-06-11 21:01 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

fallocate hole punch will want to unmap a specific range of pages.
Modify the existing hugetlb_vmtruncate_list() routine to take a
start/end range.  If end is 0, this indicates all pages after start
should be unmapped.  This is the same as the existing truncate
functionality.  Modify existing callers to add 0 as end of range.

Since the routine will be used in hole punch as well as truncate
operations, it is more appropriately renamed to hugetlb_vmdelete_list().
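
For illustration, the two kinds of callers end up looking like this
(the hole punch caller does not appear until later in the series):

	/* truncate: unmap everything from pgoff to the end of the file */
	hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);

	/* hole punch: unmap only the punched range */
	hugetlb_vmdelete_list(&mapping->i_mmap, hole_start >> PAGE_SHIFT,
				hole_end >> PAGE_SHIFT);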

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index e5a93d8..e9d4c8d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -374,11 +374,15 @@ static void hugetlbfs_evict_inode(struct inode *inode)
 }
 
 static inline void
-hugetlb_vmtruncate_list(struct rb_root *root, pgoff_t pgoff)
+hugetlb_vmdelete_list(struct rb_root *root, pgoff_t start, pgoff_t end)
 {
 	struct vm_area_struct *vma;
 
-	vma_interval_tree_foreach(vma, root, pgoff, ULONG_MAX) {
+	/*
+	 * end == 0 indicates that the entire range after
+	 * start should be unmapped.
+	 */
+	vma_interval_tree_foreach(vma, root, start, end ? end : ULONG_MAX) {
 		unsigned long v_offset;
 
 		/*
@@ -387,13 +391,20 @@ hugetlb_vmtruncate_list(struct rb_root *root, pgoff_t pgoff)
 		 * which overlap the truncated area starting at pgoff,
 		 * and no vma on a 32-bit arch can span beyond the 4GB.
 		 */
-		if (vma->vm_pgoff < pgoff)
-			v_offset = (pgoff - vma->vm_pgoff) << PAGE_SHIFT;
+		if (vma->vm_pgoff < start)
+			v_offset = (start - vma->vm_pgoff) << PAGE_SHIFT;
 		else
 			v_offset = 0;
 
-		unmap_hugepage_range(vma, vma->vm_start + v_offset,
-				     vma->vm_end, NULL);
+		if (end) {
+			end = ((end - start) << PAGE_SHIFT) +
+			       vma->vm_start + v_offset;
+			if (end > vma->vm_end)
+				end = vma->vm_end;
+		} else
+			end = vma->vm_end;
+
+		unmap_hugepage_range(vma, vma->vm_start + v_offset, end, NULL);
 	}
 }
 
@@ -409,7 +420,7 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	i_size_write(inode, offset);
 	i_mmap_lock_write(mapping);
 	if (!RB_EMPTY_ROOT(&mapping->i_mmap))
-		hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
+		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
 	i_mmap_unlock_write(mapping);
 	truncate_hugepages(inode, offset);
 	return 0;
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC v4 PATCH 4/9] hugetlbfs: truncate_hugepages() takes a range of pages
  2015-06-11 21:01 [RFC v4 PATCH 0/9] hugetlbfs: add fallocate support Mike Kravetz
                   ` (2 preceding siblings ...)
  2015-06-11 21:01 ` [RFC v4 PATCH 3/9] hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete Mike Kravetz
@ 2015-06-11 21:01 ` Mike Kravetz
  2015-06-11 21:01 ` [RFC v4 PATCH 5/9] mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch Mike Kravetz
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2015-06-11 21:01 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

Modify truncate_hugepages() to take a range of pages (start, end)
instead of simply start. If an end value of -1 is passed, the
current "truncate" functionality is maintained. Existing callers
are modified to pass -1 as end of range. By keying off end == -1,
the routine behaves differently for truncate and hole punch.
Page removal is now synchronized with page allocation via faults
by using the fault mutex table. The hole punch case can experience
the rare region_del error and must handle it accordingly.

Add the routine hugetlb_fix_reserve_counts to fix up reserve counts
in the case where region_del returns an error.

Since the routine handles more than just the truncate case, it is
renamed to remove_inode_hugepages().  To be consistent, the routine
truncate_huge_page() is renamed remove_huge_page().

Downstream of remove_inode_hugepages(), the routine
hugetlb_unreserve_pages() is also modified to take a range of pages.
hugetlb_unreserve_pages is modified to detect an error from
region_del and pass it back to the caller.
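
The two modes are easiest to see from how the routine is called; the
hole punch caller is added later in the series:

	/* truncate: remove everything from offset to the end of the file */
	remove_inode_hugepages(inode, offset, -1);

	/* hole punch: remove only the pages backing [hole_start, hole_end) */
	remove_inode_hugepages(inode, hole_start, hole_end);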

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c    | 93 +++++++++++++++++++++++++++++++++++++++++++------
 include/linux/hugetlb.h |  4 ++-
 mm/hugetlb.c            | 40 +++++++++++++++++++--
 3 files changed, 123 insertions(+), 14 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index e9d4c8d..728d758 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -318,26 +318,58 @@ static int hugetlbfs_write_end(struct file *file, struct address_space *mapping,
 	return -EINVAL;
 }
 
-static void truncate_huge_page(struct page *page)
+static void remove_huge_page(struct page *page)
 {
 	ClearPageDirty(page);
 	ClearPageUptodate(page);
 	delete_from_page_cache(page);
 }
 
-static void truncate_hugepages(struct inode *inode, loff_t lstart)
+
+/*
+ * remove_inode_hugepages handles two distinct cases: truncation and hole
+ * punch.  There are subtle differences in operation for each case.
+ *
+ * truncation is indicated by end of range being -1
+ *	In this case, we first scan the range and release found pages.
+ *	After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
+ *	maps and global counts.
+ * hole punch is indicated if end is not -1
+ *	In the hole punch case we scan the range and release found pages.
+ *	Only when releasing a page is the associated region/reserv map
+ *	deleted.  The region/reserv map for ranges without associated
+ *	pages are not modified.
+ * Note: If the passed end of range value is beyond the end of file, but
+ * not -1 this routine still performs a hole punch operation.
+ */
+static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
+				   loff_t lend)
 {
 	struct hstate *h = hstate_inode(inode);
 	struct address_space *mapping = &inode->i_data;
 	const pgoff_t start = lstart >> huge_page_shift(h);
+	const pgoff_t end = lend >> huge_page_shift(h);
 	struct pagevec pvec;
 	pgoff_t next;
 	int i, freed = 0;
+	long lookup_nr = PAGEVEC_SIZE;
+	bool truncate_op = (lend == -1);
 
 	pagevec_init(&pvec, 0);
 	next = start;
-	while (1) {
-		if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+	while (next < end) {
+		/*
+		 * Make sure to never grab more pages than we
+		 * might possibly need.
+		 */
+		if (end - next < lookup_nr)
+			lookup_nr = end - next;
+
+		/*
+		 * This pagevec_lookup() may return pages past 'end',
+		 * so we must check for page->index > end.
+		 */
+		if (!pagevec_lookup(&pvec, mapping, next, lookup_nr)) {
 			if (next == start)
 				break;
 			next = start;
@@ -346,26 +378,67 @@ static void truncate_hugepages(struct inode *inode, loff_t lstart)
 
 		for (i = 0; i < pagevec_count(&pvec); ++i) {
 			struct page *page = pvec.pages[i];
+			u32 hash;
+
+			hash = hugetlb_fault_mutex_shared_hash(mapping, next);
+			hugetlb_fault_mutex_lock(hash);
 
 			lock_page(page);
+			if (page->index >= end) {
+				unlock_page(page);
+				hugetlb_fault_mutex_unlock(hash);
+				next = end;	/* we are done */
+				break;
+			}
+
+			/*
+			 * If page is mapped, it was faulted in after being
+			 * unmapped.  Do nothing in this race case.  In the
+			 * normal case page is not mapped.
+			 */
+			if (!page_mapped(page)) {
+				bool rsv_on_error = !PagePrivate(page);
+				/*
+				 * We must free the huge page and remove
+				 * from page cache (remove_huge_page) BEFORE
+				 * removing the region/reserve map
+				 * (hugetlb_unreserve_pages).  In rare out
+				 * of memory conditions, removal of the
+				 * region/reserve map could fail.  Before
+				 * free'ing the page, note PagePrivate which
+				 * is used in case of error.
+				 */
+				remove_huge_page(page);
+				freed++;
+				if (!truncate_op) {
+					if (unlikely(hugetlb_unreserve_pages(
+							inode, next,
+							next + 1, 1)))
+						hugetlb_fix_reserve_counts(
+							inode, rsv_on_error);
+				}
+			}
+
 			if (page->index > next)
 				next = page->index;
+
 			++next;
-			truncate_huge_page(page);
 			unlock_page(page);
-			freed++;
+
+			hugetlb_fault_mutex_unlock(hash);
 		}
 		huge_pagevec_release(&pvec);
 	}
-	BUG_ON(!lstart && mapping->nrpages);
-	hugetlb_unreserve_pages(inode, start, freed);
+
+	if (truncate_op)
+		(void)hugetlb_unreserve_pages(inode, start, end, freed);
 }
 
 static void hugetlbfs_evict_inode(struct inode *inode)
 {
 	struct resv_map *resv_map;
 
-	truncate_hugepages(inode, 0);
+	remove_inode_hugepages(inode, 0, -1);
 	resv_map = (struct resv_map *)inode->i_mapping->private_data;
 	/* root inode doesn't have the resv_map, so we should check it */
 	if (resv_map)
@@ -422,7 +495,7 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	if (!RB_EMPTY_ROOT(&mapping->i_mmap))
 		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
 	i_mmap_unlock_write(mapping);
-	truncate_hugepages(inode, offset);
+	remove_inode_hugepages(inode, offset, -1);
 	return 0;
 }
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index bbd072e..4da75b7 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -80,11 +80,13 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 int hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						struct vm_area_struct *vma,
 						vm_flags_t vm_flags);
-void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
+long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
+						long freed);
 int dequeue_hwpoisoned_huge_page(struct page *page);
 bool isolate_huge_page(struct page *page, struct list_head *list);
 void putback_active_hugepage(struct page *page);
 void free_huge_page(struct page *page);
+void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve);
 u32 hugetlb_fault_mutex_shared_hash(struct address_space *mapping, pgoff_t idx);
 extern struct mutex *htlb_fault_mutex_table;
 static inline void hugetlb_fault_mutex_lock(u32 hash)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f617cb6..6881097 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -461,6 +461,28 @@ retry:
 }
 
 /*
+ * A rare out of memory error was encountered which prevented removal of
+ * the reserve map region for a page.  The huge page itself was free''ed
+ * and removed from the page cache.  This routine will adjust the global
+ * reserve count if needed, and the subpool usage count.  By incrementing
+ * these counts, the reserve map entry which could not be deleted will
+ * appear as a "reserved" entry instead of simply dangling with incorrect
+ * counts.
+ */
+void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve)
+{
+	struct hugepage_subpool *spool = subpool_inode(inode);
+	long rsv_adjust;
+
+	rsv_adjust = hugepage_subpool_get_pages(spool, 1);
+	if (restore_reserve && rsv_adjust) {
+		struct hstate *h = hstate_inode(inode);
+
+		hugetlb_acct_memory(h, 1);
+	}
+}
+
+/*
  * Count and return the number of huge pages in the reserve map
  * that intersect with the range [f, t).
  */
@@ -3779,7 +3801,8 @@ out_err:
 	return ret;
 }
 
-void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
+long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
+								long freed)
 {
 	struct hstate *h = hstate_inode(inode);
 	struct resv_map *resv_map = inode_resv_map(inode);
@@ -3787,8 +3810,17 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 	struct hugepage_subpool *spool = subpool_inode(inode);
 	long gbl_reserve;
 
-	if (resv_map)
-		chg = region_del(resv_map, offset, -1);
+	if (resv_map) {
+		chg = region_del(resv_map, start, end);
+		/*
+		 * region_del() can fail in the rare case where a region
+		 * must be split and another region descriptor can not be
+		 * allocated.  If end == -1, it will not fail.
+		 */
+		if (chg < 0)
+			return chg;
+	}
+
 	spin_lock(&inode->i_lock);
 	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
 	spin_unlock(&inode->i_lock);
@@ -3799,6 +3831,8 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 	 */
 	gbl_reserve = hugepage_subpool_put_pages(spool, (chg - freed));
 	hugetlb_acct_memory(h, -gbl_reserve);
+
+	return 0;
 }
 
 #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC v4 PATCH 5/9] mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch
  2015-06-11 21:01 [RFC v4 PATCH 0/9] hugetlbfs: add fallocate support Mike Kravetz
                   ` (3 preceding siblings ...)
  2015-06-11 21:01 ` [RFC v4 PATCH 4/9] hugetlbfs: truncate_hugepages() takes a range of pages Mike Kravetz
@ 2015-06-11 21:01 ` Mike Kravetz
  2015-06-11 21:01 ` [RFC v4 PATCH 6/9] mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate Mike Kravetz
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2015-06-11 21:01 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

In vma_has_reserves(), the current assumption is that reserves are
always present for shared mappings.  However, this will not be the case
with fallocate hole punch.  When punching a hole, the present page
will be deleted as well as the region/reserve map entry (and hence
any reservation).  vma_has_reserves is passed "chg" which indicates
whether or not a region/reserve map is present.  Use this to determine
if reserves are actually present or were removed via hole punch.
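
As a sketch of how that chg value reaches this check (mirroring the
existing alloc_huge_page code):

	chg = vma_needs_reservation(h, vma, addr);	/* 1 if no region map entry */
	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg);
		/* which in turn calls vma_has_reserves(vma, chg) */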

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6881097..ecbaffe 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -692,9 +692,19 @@ static int vma_has_reserves(struct vm_area_struct *vma, long chg)
 			return 0;
 	}
 
-	/* Shared mappings always use reserves */
-	if (vma->vm_flags & VM_MAYSHARE)
-		return 1;
+	if (vma->vm_flags & VM_MAYSHARE) {
+		/*
+		 * We know VM_NORESERVE is not set.  Therefore, there SHOULD
+		 * be a region map for all pages.  The only situation where
+		 * there is no region map is if a hole was punched via
+		 * fallocate.  In this case, there really are no reserves to
+		 * use.  This situation is indicated if chg != 0.
+		 */
+		if (chg)
+			return 0;
+		else
+			return 1;
+	}
 
 	/*
 	 * Only the process that called mmap() has reserves for
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC v4 PATCH 6/9] mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate
  2015-06-11 21:01 [RFC v4 PATCH 0/9] hugetlbfs: add fallocate support Mike Kravetz
                   ` (4 preceding siblings ...)
  2015-06-11 21:01 ` [RFC v4 PATCH 5/9] mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch Mike Kravetz
@ 2015-06-11 21:01 ` Mike Kravetz
  2015-06-15  6:34   ` Naoya Horiguchi
  2015-06-11 21:01 ` [RFC v4 PATCH 7/9] hugetlbfs: New huge_add_to_page_cache helper routine Mike Kravetz
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 15+ messages in thread
From: Mike Kravetz @ 2015-06-11 21:01 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

Areas hole punched by fallocate will not have entries in the
region/reserve map.  However, shared mappings with min_size subpool
reservations may still have reserved pages.  alloc_huge_page needs
to handle this special case and do the proper accounting.
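
A rough trace of the special case being handled (assuming the normal
shared mapping path with avoid_reserve == 0):

	chg     = vma_needs_reservation(h, vma, addr);	/* 1: no reserve map entry (hole punched) */
	gbl_chg = hugepage_subpool_get_pages(spool, 1);	/* 0: the min_size subpool still holds a
							   reserve for this page */
	page    = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
							/* gbl_chg == 0: do not charge the page
							   against the global reserve count */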

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 48 +++++++++++++++++++++++++++---------------------
 1 file changed, 27 insertions(+), 21 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ecbaffe..9c295c9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -692,19 +692,9 @@ static int vma_has_reserves(struct vm_area_struct *vma, long chg)
 			return 0;
 	}
 
-	if (vma->vm_flags & VM_MAYSHARE) {
-		/*
-		 * We know VM_NORESERVE is not set.  Therefore, there SHOULD
-		 * be a region map for all pages.  The only situation where
-		 * there is no region map is if a hole was punched via
-		 * fallocate.  In this case, there really are no reserves to
-		 * use.  This situation is indicated if chg != 0.
-		 */
-		if (chg)
-			return 0;
-		else
-			return 1;
-	}
+	/* Shared mappings always use reserves */
+	if (vma->vm_flags & VM_MAYSHARE)
+		return 1;
 
 	/*
 	 * Only the process that called mmap() has reserves for
@@ -1601,6 +1591,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	struct hstate *h = hstate_vma(vma);
 	struct page *page;
 	long chg, commit;
+	long gbl_chg;
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg;
 
@@ -1608,24 +1599,39 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	/*
 	 * Processes that did not create the mapping will have no
 	 * reserves and will not have accounted against subpool
-	 * limit. Check that the subpool limit can be made before
-	 * satisfying the allocation MAP_NORESERVE mappings may also
-	 * need pages and subpool limit allocated allocated if no reserve
-	 * mapping overlaps.
+	 * limit. Check that the subpool limit will not be exceeded
+	 * before performing the allocation.  Allocations for
+	 * MAP_NORESERVE mappings also need to be checked against
+	 * any subpool limit.
+	 *
+	 * NOTE: Shared mappings with holes punched via fallocate
+	 * may still have reservations, even without entries in the
+	 * reserve map as indicated by vma_needs_reservation.  This
+	 * would be the case if hugepage_subpool_get_pages returns
+	 * zero to indicate no changes to the global reservation count
+	 * are necessary.  In this case, pass the output of
+	 * hugepage_subpool_get_pages (zero) to dequeue_huge_page_vma
+	 * so that the page is not counted against the global limit.
+	 * For MAP_NORESERVE mappings always pass the output of
+	 * vma_needs_reservation.  For race detection and error cleanup
+	 * use output of vma_needs_reservation as well.
 	 */
-	chg = vma_needs_reservation(h, vma, addr);
+	chg = gbl_chg = vma_needs_reservation(h, vma, addr);
 	if (chg < 0)
 		return ERR_PTR(-ENOMEM);
-	if (chg || avoid_reserve)
-		if (hugepage_subpool_get_pages(spool, 1) < 0)
+	if (chg || avoid_reserve) {
+		gbl_chg = hugepage_subpool_get_pages(spool, 1);
+		if (gbl_chg < 0)
 			return ERR_PTR(-ENOSPC);
+	}
 
 	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
 	if (ret)
 		goto out_subpool_put;
 
 	spin_lock(&hugetlb_lock);
-	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg);
+	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve,
+					avoid_reserve ? chg : gbl_chg);
 	if (!page) {
 		spin_unlock(&hugetlb_lock);
 		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC v4 PATCH 7/9] hugetlbfs: New huge_add_to_page_cache helper routine
  2015-06-11 21:01 [RFC v4 PATCH 0/9] hugetlbfs: add fallocate support Mike Kravetz
                   ` (5 preceding siblings ...)
  2015-06-11 21:01 ` [RFC v4 PATCH 6/9] mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate Mike Kravetz
@ 2015-06-11 21:01 ` Mike Kravetz
  2015-06-11 21:01 ` [RFC v4 PATCH 8/9] hugetlbfs: add hugetlbfs_fallocate() Mike Kravetz
  2015-06-11 21:01 ` [RFC v4 PATCH 9/9] mm: madvise allow remove operation for hugetlbfs Mike Kravetz
  8 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2015-06-11 21:01 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

Currently, there is only a single place where hugetlbfs pages are
added to the page cache.  The new fallocate code will be adding a second
one, so break the functionality out into its own helper.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h |  2 ++
 mm/hugetlb.c            | 27 ++++++++++++++++++---------
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 4da75b7..0ea36bd 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -335,6 +335,8 @@ struct huge_bootmem_page {
 struct page *alloc_huge_page_node(struct hstate *h, int nid);
 struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
+int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
+			pgoff_t idx);
 
 /* arch callback */
 int __init alloc_bootmem_huge_page(struct hstate *h);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9c295c9..2cc33ad 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3221,6 +3221,23 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
 	return page != NULL;
 }
 
+int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
+			   pgoff_t idx)
+{
+	struct inode *inode = mapping->host;
+	struct hstate *h = hstate_inode(inode);
+	int err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
+
+	if (err)
+		return err;
+	ClearPagePrivate(page);
+
+	spin_lock(&inode->i_lock);
+	inode->i_blocks += blocks_per_huge_page(h);
+	spin_unlock(&inode->i_lock);
+	return 0;
+}
+
 static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			   struct address_space *mapping, pgoff_t idx,
 			   unsigned long address, pte_t *ptep, unsigned int flags)
@@ -3268,21 +3285,13 @@ retry:
 		set_page_huge_active(page);
 
 		if (vma->vm_flags & VM_MAYSHARE) {
-			int err;
-			struct inode *inode = mapping->host;
-
-			err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
+			int err = huge_add_to_page_cache(page, mapping, idx);
 			if (err) {
 				put_page(page);
 				if (err == -EEXIST)
 					goto retry;
 				goto out;
 			}
-			ClearPagePrivate(page);
-
-			spin_lock(&inode->i_lock);
-			inode->i_blocks += blocks_per_huge_page(h);
-			spin_unlock(&inode->i_lock);
 		} else {
 			lock_page(page);
 			if (unlikely(anon_vma_prepare(vma))) {
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC v4 PATCH 8/9] hugetlbfs: add hugetlbfs_fallocate()
  2015-06-11 21:01 [RFC v4 PATCH 0/9] hugetlbfs: add fallocate support Mike Kravetz
                   ` (6 preceding siblings ...)
  2015-06-11 21:01 ` [RFC v4 PATCH 7/9] hugetlbfs: New huge_add_to_page_cache helper routine Mike Kravetz
@ 2015-06-11 21:01 ` Mike Kravetz
  2015-06-11 21:01 ` [RFC v4 PATCH 9/9] mm: madvise allow remove operation for hugetlbfs Mike Kravetz
  8 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2015-06-11 21:01 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

This is based on the shmem version, but it has diverged quite
a bit.  We have no swap to worry about, nor the new file sealing.
Add synchronization via the fault mutex table to coordinate
page faults, fallocate allocation and fallocate hole punch.

What this allows us to do is move physical memory in and out of
a hugetlbfs file without having it mapped.  This also gives us
the ability to support MADV_REMOVE since it is currently
implemented using fallocate().  MADV_REMOVE lets madvise() remove
pages from the middle of a hugetlbfs file, which wasn't possible
before.

hugetlbfs fallocate only operates on whole huge pages.
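
For example, with 2MB huge pages (fd and sizes below are only
illustrative):

	/* hole punch shrinks to whole pages: [3MB, 6MB) punches [4MB, 6MB) */
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  3UL << 20, 3UL << 20);

	/* preallocation expands to whole pages: [3MB, 5MB) allocates [2MB, 6MB) */
	fallocate(fd, 0, 3UL << 20, 2UL << 20);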

Based-on code-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c    | 156 +++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/hugetlb.h |   3 +
 mm/hugetlb.c            |   8 +--
 3 files changed, 162 insertions(+), 5 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 728d758..830f782 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -12,6 +12,7 @@
 #include <linux/thread_info.h>
 #include <asm/current.h>
 #include <linux/sched.h>		/* remove ASAP */
+#include <linux/falloc.h>
 #include <linux/fs.h>
 #include <linux/mount.h>
 #include <linux/file.h>
@@ -499,6 +500,158 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	return 0;
 }
 
+static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
+{
+	struct hstate *h = hstate_inode(inode);
+	loff_t hpage_size = huge_page_size(h);
+	loff_t hole_start, hole_end;
+
+	/*
+	 * For hole punch round up the beginning offset of the hole and
+	 * round down the end.
+	 */
+	hole_start = round_up(offset, hpage_size);
+	hole_end = round_down(offset + len, hpage_size);
+
+	if (hole_end > hole_start) {
+		struct address_space *mapping = inode->i_mapping;
+
+		mutex_lock(&inode->i_mutex);
+		i_mmap_lock_write(mapping);
+		if (!RB_EMPTY_ROOT(&mapping->i_mmap))
+			hugetlb_vmdelete_list(&mapping->i_mmap,
+						hole_start >> PAGE_SHIFT,
+						hole_end  >> PAGE_SHIFT);
+		i_mmap_unlock_write(mapping);
+		remove_inode_hugepages(inode, hole_start, hole_end);
+		mutex_unlock(&inode->i_mutex);
+	}
+
+	return 0;
+}
+
+static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
+				loff_t len)
+{
+	struct inode *inode = file_inode(file);
+	struct address_space *mapping = inode->i_mapping;
+	struct hstate *h = hstate_inode(inode);
+	struct vm_area_struct pseudo_vma;
+	loff_t hpage_size = huge_page_size(h);
+	unsigned long hpage_shift = huge_page_shift(h);
+	pgoff_t start, index, end;
+	int error;
+	u32 hash;
+
+	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+		return -EOPNOTSUPP;
+
+	if (mode & FALLOC_FL_PUNCH_HOLE)
+		return hugetlbfs_punch_hole(inode, offset, len);
+
+	/*
+	 * Default preallocate case.
+	 * For this range, start is rounded down and end is rounded up
+	 * as well as being converted to page offsets.
+	 */
+	start = offset >> hpage_shift;
+	end = (offset + len + hpage_size - 1) >> hpage_shift;
+
+	mutex_lock(&inode->i_mutex);
+
+	/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
+	error = inode_newsize_ok(inode, offset + len);
+	if (error)
+		goto out;
+
+	/*
+	 * Initialize a pseudo vma that just contains the policy used
+	 * when allocating the huge pages.  The actual policy field
+	 * (vm_policy) is determined based on the index in the loop below.
+	 */
+	memset(&pseudo_vma, 0, sizeof(struct vm_area_struct));
+	pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
+	pseudo_vma.vm_file = file;
+
+	for (index = start; index < end; index++) {
+		/*
+		 * This is supposed to be the vaddr where the page is being
+		 * faulted in, but we have no vaddr here.
+		 */
+		struct page *page;
+		unsigned long addr;
+		int avoid_reserve = 0;
+
+		cond_resched();
+
+		/*
+		 * fallocate(2) manpage permits EINTR; we may have been
+		 * interrupted because we are using up too much memory.
+		 */
+		if (signal_pending(current)) {
+			error = -EINTR;
+			break;
+		}
+
+		/* Get policy based on index */
+		pseudo_vma.vm_policy =
+			mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy,
+							index);
+
+		/* addr is the offset within the file (zero based) */
+		addr = index * hpage_size;
+
+		/* mutex taken here, fault path and hole punch */
+		hash = hugetlb_fault_mutex_shared_hash(mapping, index);
+		hugetlb_fault_mutex_lock(hash);
+
+		/* See if already present in mapping to avoid alloc/free */
+		page = find_get_page(mapping, index);
+		if (page) {
+			put_page(page);
+			hugetlb_fault_mutex_unlock(hash);
+			mpol_cond_put(pseudo_vma.vm_policy);
+			continue;
+		}
+
+		/* Allocate page and add to page cache */
+		page = alloc_huge_page(&pseudo_vma, addr, avoid_reserve);
+		mpol_cond_put(pseudo_vma.vm_policy);
+		if (IS_ERR(page)) {
+			hugetlb_fault_mutex_unlock(hash);
+			error = PTR_ERR(page);
+			goto out;
+		}
+		clear_huge_page(page, addr, pages_per_huge_page(h));
+		__SetPageUptodate(page);
+		error = huge_add_to_page_cache(page, mapping, index);
+		if (unlikely(error)) {
+			put_page(page);
+			hugetlb_fault_mutex_unlock(hash);
+			goto out;
+		}
+
+		hugetlb_fault_mutex_unlock(hash);
+
+		/*
+		 * put_page due to reference from alloc_huge_page()
+		 * unlock_page because locked by add_to_page_cache()
+		 */
+		put_page(page);
+		unlock_page(page);
+	}
+
+	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
+		i_size_write(inode, offset + len);
+	inode->i_ctime = CURRENT_TIME;
+	spin_lock(&inode->i_lock);
+	inode->i_private = NULL;
+	spin_unlock(&inode->i_lock);
+out:
+	mutex_unlock(&inode->i_mutex);
+	return error;
+}
+
 static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
 {
 	struct inode *inode = dentry->d_inode;
@@ -810,7 +963,8 @@ const struct file_operations hugetlbfs_file_operations = {
 	.mmap			= hugetlbfs_file_mmap,
 	.fsync			= noop_fsync,
 	.get_unmapped_area	= hugetlb_get_unmapped_area,
-	.llseek		= default_llseek,
+	.llseek			= default_llseek,
+	.fallocate		= hugetlbfs_fallocate,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 0ea36bd..0567aa7 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -332,6 +332,8 @@ struct huge_bootmem_page {
 #endif
 };
 
+struct page *alloc_huge_page(struct vm_area_struct *vma,
+				unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_node(struct hstate *h, int nid);
 struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
@@ -486,6 +488,7 @@ static inline bool hugepages_supported(void)
 
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
+#define alloc_huge_page(v, a, r) NULL
 #define alloc_huge_page_node(h, nid) NULL
 #define alloc_huge_page_noerr(v, a, r) NULL
 #define alloc_bootmem_huge_page(h) NULL
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2cc33ad..6c4bbef 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -462,9 +462,9 @@ retry:
 
 /*
  * A rare out of memory error was encountered which prevented removal of
- * the reserve map region for a page.  The huge page itself was free''ed
- * and removed from the page cache.  This routine will adjust the global
- * reserve count if needed, and the subpool usage count.  By incrementing
+ * the reserve map region for a page.  The huge page itself was free'ed
+ * and removed from the page cache.  This routine will adjust the subpool
+ * usage count, and the global reserve count if needed.  By incrementing
  * these counts, the reserve map entry which could not be deleted will
  * appear as a "reserved" entry instead of simply dangling with incorrect
  * counts.
@@ -1584,7 +1584,7 @@ static long vma_commit_reservation(struct hstate *h,
 	return __vma_reservation_common(h, vma, addr, true);
 }
 
-static struct page *alloc_huge_page(struct vm_area_struct *vma,
+struct page *alloc_huge_page(struct vm_area_struct *vma,
 				    unsigned long addr, int avoid_reserve)
 {
 	struct hugepage_subpool *spool = subpool_vma(vma);
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC v4 PATCH 9/9] mm: madvise allow remove operation for hugetlbfs
  2015-06-11 21:01 [RFC v4 PATCH 0/9] hugetlbfs: add fallocate support Mike Kravetz
                   ` (7 preceding siblings ...)
  2015-06-11 21:01 ` [RFC v4 PATCH 8/9] hugetlbfs: add hugetlbfs_fallocate() Mike Kravetz
@ 2015-06-11 21:01 ` Mike Kravetz
  8 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2015-06-11 21:01 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

Now that we have hole punching support for hugetlbfs, we can
also support the MADV_REMOVE interface to it.
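
A rough userspace sketch (the mapping setup and sizes are only
illustrative, error checking omitted):

	char *addr;

	/* fd is an open hugetlbfs file backed by 2MB huge pages */
	addr = mmap(NULL, 8UL << 20, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* release the huge pages backing [4MB, 6MB) of the file */
	madvise(addr + (4UL << 20), 2UL << 20, MADV_REMOVE);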

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/madvise.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index d215ea9..3c1b7f0 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -467,7 +467,7 @@ static long madvise_remove(struct vm_area_struct *vma,
 
 	*prev = NULL;	/* tell sys_madvise we drop mmap_sem */
 
-	if (vma->vm_flags & (VM_LOCKED | VM_HUGETLB))
+	if (vma->vm_flags & VM_LOCKED)
 		return -EINVAL;
 
 	f = vma->vm_file;
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC v4 PATCH 2/9] mm/hugetlb: expose hugetlb fault mutex for use by fallocate
  2015-06-11 21:01 ` [RFC v4 PATCH 2/9] mm/hugetlb: expose hugetlb fault mutex for use by fallocate Mike Kravetz
@ 2015-06-11 22:46   ` Davidlohr Bueso
  2015-06-11 23:09     ` Mike Kravetz
  2015-06-17 22:05     ` Mike Kravetz
  0 siblings, 2 replies; 15+ messages in thread
From: Davidlohr Bueso @ 2015-06-11 22:46 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Dave Hansen, Naoya Horiguchi,
	David Rientjes, Hugh Dickins, Aneesh Kumar, Hillf Danton,
	Christoph Hellwig

On Thu, 2015-06-11 at 14:01 -0700, Mike Kravetz wrote:
>  /* Forward declaration */
>  static int hugetlb_acct_memory(struct hstate *h, long delta);
> @@ -3324,7 +3324,8 @@ static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
>  	unsigned long key[2];
>  	u32 hash;
>  
> -	if (vma->vm_flags & VM_SHARED) {
> +	/* !vma implies this was called from hugetlbfs fallocate code */
> +	if (!vma || vma->vm_flags & VM_SHARED) {

That !vma is icky, and really no need for it: hugetlbfs_fallocate(), for
example, already passes [pseudo]vma->vm_flags with VM_SHARED, and you
say it yourself in the comment. Do you see any reason why we cannot just
keep the vma->vm_flags & VM_SHARED check?

> +/*
> + * Interface for use by hugetlbfs fallocate code.  Faults must be
> + * synchronized with page adds or deletes by fallocate.  fallocate
> + * only deals with shared mappings.  See also hugetlb_fault_mutex_lock
> + * and hugetlb_fault_mutex_unlock.
> + */
> +u32 hugetlb_fault_mutex_shared_hash(struct address_space *mapping, pgoff_t idx)
> +{
> +	return fault_mutex_hash(NULL, NULL, NULL, mapping, idx, 0);
> +}

It strikes me that this too should be static inlined. But I really
dislike the nil params thing, which should be addressed by my comment
above.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v4 PATCH 2/9] mm/hugetlb: expose hugetlb fault mutex for use by fallocate
  2015-06-11 22:46   ` Davidlohr Bueso
@ 2015-06-11 23:09     ` Mike Kravetz
  2015-06-17 22:05     ` Mike Kravetz
  1 sibling, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2015-06-11 23:09 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-mm, linux-kernel, Dave Hansen, Naoya Horiguchi,
	David Rientjes, Hugh Dickins, Aneesh Kumar, Hillf Danton,
	Christoph Hellwig

On 06/11/2015 03:46 PM, Davidlohr Bueso wrote:
> On Thu, 2015-06-11 at 14:01 -0700, Mike Kravetz wrote:
>>   /* Forward declaration */
>>   static int hugetlb_acct_memory(struct hstate *h, long delta);
>> @@ -3324,7 +3324,8 @@ static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
>>   	unsigned long key[2];
>>   	u32 hash;
>>
>> -	if (vma->vm_flags & VM_SHARED) {
>> +	/* !vma implies this was called from hugetlbfs fallocate code */
>> +	if (!vma || vma->vm_flags & VM_SHARED) {
>
> That !vma is icky, and really no need for it: hugetlbfs_fallocate(), for
> example, already passes [pseudo]vma->vm_flags with VM_SHARED, and you
> say it yourself in the comment. Do you see any reason why we cannot just
> keep the vma->vm_flags & VM_SHARED check?
>
>> +/*
>> + * Interface for use by hugetlbfs fallocate code.  Faults must be
>> + * synchronized with page adds or deletes by fallocate.  fallocate
>> + * only deals with shared mappings.  See also hugetlb_fault_mutex_lock
>> + * and hugetlb_fault_mutex_unlock.
>> + */
>> +u32 hugetlb_fault_mutex_shared_hash(struct address_space *mapping, pgoff_t idx)
>> +{
>> +	return fault_mutex_hash(NULL, NULL, NULL, mapping, idx, 0);
>> +}
>
> It strikes me that this too should be static inlined. But I really
> dislike the nil params thing, which should be addressed by my comment
> above.

In the previous RFC, I was trying not to make all the fault mutex data
global while still allowing it to be accessed from outside hugetlb.c.
That was the original reason for the wrapper interfaces.  That may just
be too ugly, and does not buy us much.

Now that the mutex table is global for inlining, I might as well make
fault_mutex_hash() global.  I can then get rid of the wrappers.  However,
I'm guessing it would be a good idea to change the name(s) to something
hugetlb specific since they will be global.
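
Something roughly like the following, just as a sketch (the final name
is certainly up for debate):

	/* name is just a placeholder for the now-global hash function */
	u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
				     struct vm_area_struct *vma,
				     struct address_space *mapping,
				     pgoff_t idx, unsigned long address);
	extern struct mutex *htlb_fault_mutex_table;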

-- 
Mike Kravetz

>
> Thanks,
> Davidlohr
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v4 PATCH 6/9] mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate
  2015-06-11 21:01 ` [RFC v4 PATCH 6/9] mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate Mike Kravetz
@ 2015-06-15  6:34   ` Naoya Horiguchi
  2015-06-15 18:42     ` Mike Kravetz
  0 siblings, 1 reply; 15+ messages in thread
From: Naoya Horiguchi @ 2015-06-15  6:34 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Dave Hansen, David Rientjes,
	Hugh Dickins, Davidlohr Bueso, Aneesh Kumar, Hillf Danton,
	Christoph Hellwig

On Thu, Jun 11, 2015 at 02:01:37PM -0700, Mike Kravetz wrote:
> Areas hole punched by fallocate will not have entries in the
> region/reserve map.  However, shared mappings with min_size subpool
> reservations may still have reserved pages.  alloc_huge_page needs
> to handle this special case and do the proper accounting.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/hugetlb.c | 48 +++++++++++++++++++++++++++---------------------
>  1 file changed, 27 insertions(+), 21 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ecbaffe..9c295c9 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -692,19 +692,9 @@ static int vma_has_reserves(struct vm_area_struct *vma, long chg)
>  			return 0;
>  	}
>  
> -	if (vma->vm_flags & VM_MAYSHARE) {
> -		/*
> -		 * We know VM_NORESERVE is not set.  Therefore, there SHOULD
> -		 * be a region map for all pages.  The only situation where
> -		 * there is no region map is if a hole was punched via
> -		 * fallocate.  In this case, there really are no reserves to
> -		 * use.  This situation is indicated if chg != 0.
> -		 */
> -		if (chg)
> -			return 0;
> -		else
> -			return 1;
> -	}
> +	/* Shared mappings always use reserves */
> +	if (vma->vm_flags & VM_MAYSHARE)
> +		return 1;

This change completely reverts 5/9, so can you omit 5/9?

>  
>  	/*
>  	 * Only the process that called mmap() has reserves for
> @@ -1601,6 +1591,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>  	struct hstate *h = hstate_vma(vma);
>  	struct page *page;
>  	long chg, commit;
> +	long gbl_chg;
>  	int ret, idx;
>  	struct hugetlb_cgroup *h_cg;
>  
> @@ -1608,24 +1599,39 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>  	/*
>  	 * Processes that did not create the mapping will have no
>  	 * reserves and will not have accounted against subpool
> -	 * limit. Check that the subpool limit can be made before
> -	 * satisfying the allocation MAP_NORESERVE mappings may also
> -	 * need pages and subpool limit allocated allocated if no reserve
> -	 * mapping overlaps.
> +	 * limit. Check that the subpool limit will not be exceeded
> +	 * before performing the allocation.  Allocations for
> +	 * MAP_NORESERVE mappings also need to be checked against
> +	 * any subpool limit.
> +	 *
> +	 * NOTE: Shared mappings with holes punched via fallocate
> +	 * may still have reservations, even without entries in the
> +	 * reserve map as indicated by vma_needs_reservation.  This
> +	 * would be the case if hugepage_subpool_get_pages returns
> +	 * zero to indicate no changes to the global reservation count
> +	 * are necessary.  In this case, pass the output of
> +	 * hugepage_subpool_get_pages (zero) to dequeue_huge_page_vma
> +	 * so that the page is not counted against the global limit.
> +	 * For MAP_NORESERVE mappings always pass the output of
> +	 * vma_needs_reservation.  For race detection and error cleanup
> +	 * use output of vma_needs_reservation as well.
>  	 */
> -	chg = vma_needs_reservation(h, vma, addr);
> +	chg = gbl_chg = vma_needs_reservation(h, vma, addr);
>  	if (chg < 0)
>  		return ERR_PTR(-ENOMEM);
> -	if (chg || avoid_reserve)
> -		if (hugepage_subpool_get_pages(spool, 1) < 0)
> +	if (chg || avoid_reserve) {
> +		gbl_chg = hugepage_subpool_get_pages(spool, 1);
> +		if (gbl_chg < 0)
>  			return ERR_PTR(-ENOSPC);
> +	}
>  
>  	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
>  	if (ret)
>  		goto out_subpool_put;
>  
>  	spin_lock(&hugetlb_lock);
> -	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg);
> +	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve,
> +					avoid_reserve ? chg : gbl_chg);

You use chg or gbl_chg depending on avoid_reserve here, and below this
line there's code like the following

	commit = vma_commit_reservation(h, vma, addr);
	if (unlikely(chg > commit)) {
		...
	}

This also needs to be changed to use chg or gbl_chg depending on avoid_reserve?

# I feel that this reserve-handling code in alloc_huge_page() is too complicated
# and hard to understand, so some cleanup like separating reserve parts into
# other new routine(s) might be helpful...

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v4 PATCH 6/9] mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate
  2015-06-15  6:34   ` Naoya Horiguchi
@ 2015-06-15 18:42     ` Mike Kravetz
  0 siblings, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2015-06-15 18:42 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, linux-kernel, Dave Hansen, David Rientjes,
	Hugh Dickins, Davidlohr Bueso, Aneesh Kumar, Hillf Danton,
	Christoph Hellwig

On 06/14/2015 11:34 PM, Naoya Horiguchi wrote:
> On Thu, Jun 11, 2015 at 02:01:37PM -0700, Mike Kravetz wrote:
>> Areas hole punched by fallocate will not have entries in the
>> region/reserve map.  However, shared mappings with min_size subpool
>> reservations may still have reserved pages.  alloc_huge_page needs
>> to handle this special case and do the proper accounting.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>   mm/hugetlb.c | 48 +++++++++++++++++++++++++++---------------------
>>   1 file changed, 27 insertions(+), 21 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index ecbaffe..9c295c9 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -692,19 +692,9 @@ static int vma_has_reserves(struct vm_area_struct *vma, long chg)
>>   			return 0;
>>   	}
>>   
>> -	if (vma->vm_flags & VM_MAYSHARE) {
>> -		/*
>> -		 * We know VM_NORESERVE is not set.  Therefore, there SHOULD
>> -		 * be a region map for all pages.  The only situation where
>> -		 * there is no region map is if a hole was punched via
>> -		 * fallocate.  In this case, there really are no reverves to
>> -		 * use.  This situation is indicated if chg != 0.
>> -		 */
>> -		if (chg)
>> -			return 0;
>> -		else
>> -			return 1;
>> -	}
>> +	/* Shared mappings always use reserves */
>> +	if (vma->vm_flags & VM_MAYSHARE)
>> +		return 1;
> 
> This change completely reverts 5/9, so can you omit 5/9?

That was a mistake.  This change should not be in the patch.  The
change from 5/9 needs to remain.  Sorry for the confusion, and thanks
for catching it.
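
To be explicit, the check from 5/9 that should stay in vma_has_reserves()
is roughly the following (restating the removed lines above, not a new hunk):

	if (vma->vm_flags & VM_MAYSHARE) {
		/*
		 * VM_NORESERVE is known to be clear, so there should be a
		 * reserve map entry for every page.  The only case with no
		 * entry is a hole punched via fallocate, and then there
		 * really are no reserves to use.  chg != 0 indicates that.
		 */
		return chg ? 0 : 1;
	}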

>>   	/*
>>   	 * Only the process that called mmap() has reserves for
>> @@ -1601,6 +1591,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>>   	struct hstate *h = hstate_vma(vma);
>>   	struct page *page;
>>   	long chg, commit;
>> +	long gbl_chg;
>>   	int ret, idx;
>>   	struct hugetlb_cgroup *h_cg;
>>   
>> @@ -1608,24 +1599,39 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>>   	/*
>>   	 * Processes that did not create the mapping will have no
>>   	 * reserves and will not have accounted against subpool
>> -	 * limit. Check that the subpool limit can be made before
>> -	 * satisfying the allocation MAP_NORESERVE mappings may also
>> -	 * need pages and subpool limit allocated allocated if no reserve
>> -	 * mapping overlaps.
>> +	 * limit. Check that the subpool limit will not be exceeded
>> +	 * before performing the allocation.  Allocations for
>> +	 * MAP_NORESERVE mappings also need to be checked against
>> +	 * any subpool limit.
>> +	 *
>> +	 * NOTE: Shared mappings with holes punched via fallocate
>> +	 * may still have reservations, even without entries in the
>> +	 * reserve map as indicated by vma_needs_reservation.  This
>> +	 * would be the case if hugepage_subpool_get_pages returns
>> +	 * zero to indicate no changes to the global reservation count
>> +	 * are necessary.  In this case, pass the output of
>> +	 * hugepage_subpool_get_pages (zero) to dequeue_huge_page_vma
>> +	 * so that the page is not counted against the global limit.
>> +	 * For MAP_NORESERVE mappings always pass the output of
>> +	 * vma_needs_reservation.  For race detection and error cleanup
>> +	 * use output of vma_needs_reservation as well.
>>   	 */
>> -	chg = vma_needs_reservation(h, vma, addr);
>> +	chg = gbl_chg = vma_needs_reservation(h, vma, addr);
>>   	if (chg < 0)
>>   		return ERR_PTR(-ENOMEM);
>> -	if (chg || avoid_reserve)
>> -		if (hugepage_subpool_get_pages(spool, 1) < 0)
>> +	if (chg || avoid_reserve) {
>> +		gbl_chg = hugepage_subpool_get_pages(spool, 1);
>> +		if (gbl_chg < 0)
>>   			return ERR_PTR(-ENOSPC);
>> +	}
>>   
>>   	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
>>   	if (ret)
>>   		goto out_subpool_put;
>>   
>>   	spin_lock(&hugetlb_lock);
>> -	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg);
>> +	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve,
>> +					avoid_reserve ? chg : gbl_chg);
> 
> You use chg or gbl_chg here depending on avoid_reserve, and below this line
> there is code like the following:
> 
> 	commit = vma_commit_reservation(h, vma, addr);
> 	if (unlikely(chg > commit)) {
> 		...
> 	}
> 
> Does this also need to be changed to use chg or gbl_chg depending on avoid_reserve?

It should use chg only.  I attempted to address this at the end of the
Note above.
" For race detection and error cleanup use output of vma_needs_reservation
  as well."
I will add more comments to make it clear.
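
To spell out the intended flow (an illustrative fragment based on the
current patch, not a compilable hunk):

	chg = gbl_chg = vma_needs_reservation(h, vma, addr);
	if (chg < 0)
		return ERR_PTR(-ENOMEM);
	if (chg || avoid_reserve) {
		gbl_chg = hugepage_subpool_get_pages(spool, 1);
		if (gbl_chg < 0)
			return ERR_PTR(-ENOSPC);
	}
	...
	/* global pool accounting is driven by what the subpool said */
	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve,
					avoid_reserve ? chg : gbl_chg);
	...
	/* race detection and error cleanup stay on the reserve map view */
	commit = vma_commit_reservation(h, vma, addr);
	if (unlikely(chg > commit)) {
		/* a racing truncate/hole punch removed the map entry */
	}

gbl_chg only differs from chg in the hole punch case, where the subpool
already holds reserves and the page must not be charged to the global
pools.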

> # I feel that this reserve-handling code in alloc_huge_page() is too complicated
> # and hard to understand, so some cleanup, like separating the reserve handling
> # into new routine(s), might be helpful...

I agree, let me think about ways to split this up and hopefully make
it easier to understand.
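
One possible shape, with the helper name invented here purely for
illustration:

	/*
	 * Hypothetical helper (name made up): charge the reserve map and
	 * the subpool, and report via *gbl_chg whether the page must
	 * still be counted against the global pools.
	 */
	static long alloc_huge_page_charge(struct hstate *h,
					   struct vm_area_struct *vma,
					   unsigned long addr,
					   int avoid_reserve, long *gbl_chg)
	{
		struct hugepage_subpool *spool = subpool_vma(vma);
		long chg;

		chg = *gbl_chg = vma_needs_reservation(h, vma, addr);
		if (chg < 0)
			return -ENOMEM;

		if (chg || avoid_reserve) {
			*gbl_chg = hugepage_subpool_get_pages(spool, 1);
			if (*gbl_chg < 0)
				return -ENOSPC;
		}

		return chg;
	}

A matching uncharge helper for the error paths would keep the chg vs
gbl_chg distinction in one place.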

-- 
Mike Kravetz

> 
> Thanks,
> Naoya Horiguchi
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v4 PATCH 2/9] mm/hugetlb: expose hugetlb fault mutex for use by fallocate
  2015-06-11 22:46   ` Davidlohr Bueso
  2015-06-11 23:09     ` Mike Kravetz
@ 2015-06-17 22:05     ` Mike Kravetz
  1 sibling, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2015-06-17 22:05 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-mm, linux-kernel, Dave Hansen, Naoya Horiguchi,
	David Rientjes, Hugh Dickins, Aneesh Kumar, Hillf Danton,
	Christoph Hellwig

On 06/11/2015 03:46 PM, Davidlohr Bueso wrote:
> On Thu, 2015-06-11 at 14:01 -0700, Mike Kravetz wrote:
>>   /* Forward declaration */
>>   static int hugetlb_acct_memory(struct hstate *h, long delta);
>> @@ -3324,7 +3324,8 @@ static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
>>   	unsigned long key[2];
>>   	u32 hash;
>>
>> -	if (vma->vm_flags & VM_SHARED) {
>> +	/* !vma implies this was called from hugetlbfs fallocate code */
>> +	if (!vma || vma->vm_flags & VM_SHARED) {
>
> That !vma is icky, and really no need for it: hugetlbfs_fallocate(), for
> example, already passes [pseudo]vma->vm_flags with VM_SHARED, and you
> say it yourself in the comment. Do you see any reason why we cannot just
> keep the vma->vm_flags & VM_SHARED check?
>

Ah, I did not recall all the users of this code until I went to change
it.  The other user is truncate_hugepages(), which will now also be used
for fallocate hole punch.  Truncate, like fallocate, is an inode operation,
so there is no specific vma.  I can create a pseudo-vma here as well
just to pass the flag.  I guess that would at least be consistent with
the other user.
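
Something along these lines in truncate_hugepages(), mirroring the
pseudo-vma that hugetlbfs_fallocate() already sets up (sketch only; the
hash routine name and arguments are whatever patch 2/9 ends up exposing):

	struct vm_area_struct pseudo_vma;

	/*
	 * A throw-away vma so the hash routine can keep its plain
	 * vma->vm_flags & VM_SHARED check.
	 */
	memset(&pseudo_vma, 0, sizeof(struct vm_area_struct));
	pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
	...
	hash = hugetlb_fault_mutex_hash(h, current->mm, &pseudo_vma,
					mapping, next, 0);
	mutex_lock(&hugetlb_fault_mutex_table[hash]);
	...
	mutex_unlock(&hugetlb_fault_mutex_table[hash]);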

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 15+ messages in thread

Thread overview: 15+ messages
2015-06-11 21:01 [RFC v4 PATCH 0/9] hugetlbfs: add fallocate support Mike Kravetz
2015-06-11 21:01 ` [RFC v4 PATCH 1/9] mm/hugetlb: add region_del() to delete a specific range of entries Mike Kravetz
2015-06-11 21:01 ` [RFC v4 PATCH 2/9] mm/hugetlb: expose hugetlb fault mutex for use by fallocate Mike Kravetz
2015-06-11 22:46   ` Davidlohr Bueso
2015-06-11 23:09     ` Mike Kravetz
2015-06-17 22:05     ` Mike Kravetz
2015-06-11 21:01 ` [RFC v4 PATCH 3/9] hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete Mike Kravetz
2015-06-11 21:01 ` [RFC v4 PATCH 4/9] hugetlbfs: truncate_hugepages() takes a range of pages Mike Kravetz
2015-06-11 21:01 ` [RFC v4 PATCH 5/9] mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch Mike Kravetz
2015-06-11 21:01 ` [RFC v4 PATCH 6/9] mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate Mike Kravetz
2015-06-15  6:34   ` Naoya Horiguchi
2015-06-15 18:42     ` Mike Kravetz
2015-06-11 21:01 ` [RFC v4 PATCH 7/9] hugetlbfs: New huge_add_to_page_cache helper routine Mike Kravetz
2015-06-11 21:01 ` [RFC v4 PATCH 8/9] hugetlbfs: add hugetlbfs_fallocate() Mike Kravetz
2015-06-11 21:01 ` [RFC v4 PATCH 9/9] mm: madvise allow remove operation for hugetlbfs Mike Kravetz
