* [RFC v3 PATCH 00/10] hugetlbfs: add fallocate support
@ 2015-05-21 15:47 ` Mike Kravetz
  0 siblings, 0 replies; 46+ messages in thread
From: Mike Kravetz @ 2015-05-21 15:47 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

There have been numerous changes to this patch set since the previous
RFC, most notably in the areas of synchronization and race condition
handling.  The patch set now stands up to a series of functionality
and stress tests.

hugetlbfs is used today by applications that want a high degree of
control over huge page usage.  Often, large hugetlbfs files are used
to map a large number of huge pages into application processes.
The applications know when page ranges within these large files will
no longer be used, and ideally would like to release them back to
the subpool or global pools for other uses.  The fallocate() system
call provides an interface for preallocation and hole punching within
files.  This patch set adds fallocate functionality to hugetlbfs.
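
As an illustration (not part of this series), here is a minimal sketch of
how an application might use the new support.  It assumes a hugetlbfs
mount at /dev/hugepages, a hypothetical file name, and a 2MB huge page
size; offsets and lengths are kept huge page aligned:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdlib.h>
	#include <unistd.h>

	#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* assumed huge page size */

	int main(void)
	{
		/* hypothetical file on an assumed hugetlbfs mount point */
		int fd = open("/dev/hugepages/example", O_CREAT | O_RDWR, 0644);

		if (fd < 0)
			exit(1);

		/* preallocate 16 huge pages worth of space */
		if (fallocate(fd, 0, 0, 16 * HPAGE_SIZE))
			exit(1);

		/* ... mmap() and use portions of the file ... */

		/* punch a hole: release huge pages 4-7 back to the pools */
		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			      4 * HPAGE_SIZE, 4 * HPAGE_SIZE))
			exit(1);

		close(fd);
		return 0;
	}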

RFC v3:
  Folded in patch for alloc_huge_page/hugetlb_reserve_pages race
    in current code
  fallocate allocation and hole punch are synchronized with page
    faults via the existing mutex table
  Hole punch uses the existing hugetlb_vmtruncate_list() instead of the
    more generic unmap_mapping_range() for unmapping
  Error handling for the case when region_del() fails
RFC v2:
  Addressed alignment and error handling issues noticed by Hillf Danton
  New region_del() routine for region tracking/resv_map of ranges
  Fixed several issues found during more extensive testing
  Error handling in region_del() when kmalloc() fails still needs
    to be addressed
  madvise remove support remains


Mike Kravetz (10):
  mm/hugetlb: compute/return the number of regions added by region_add()
  mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages
  mm/hugetlb: add region_del() to delete a specific range of entries
  mm/hugetlb: expose hugetlb fault mutex for use by fallocate
  hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete
  hugetlbfs: truncate_hugepages() takes a range of pages
  hugetlbfs: New huge_add_to_page_cache helper routine
  mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch
  hugetlbfs: add hugetlbfs_fallocate()
  mm: madvise allow remove operation for hugetlbfs

 fs/hugetlbfs/inode.c    | 282 ++++++++++++++++++++++++++++++++++++++++++++----
 include/linux/hugetlb.h |  12 ++-
 mm/hugetlb.c            | 238 ++++++++++++++++++++++++++++++++--------
 mm/madvise.c            |   2 +-
 4 files changed, 469 insertions(+), 65 deletions(-)

-- 
2.1.0


* [RFC v3 PATCH 01/10] mm/hugetlb: compute/return the number of regions added by region_add()
  2015-05-21 15:47 ` Mike Kravetz
@ 2015-05-21 15:47   ` Mike Kravetz
  -1 siblings, 0 replies; 46+ messages in thread
From: Mike Kravetz @ 2015-05-21 15:47 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

Modify region_add() to keep track of the number of regions (pages)
added to the reserve map and return this value.  The return value can be
compared to the return value of region_chg() to determine if the
map was modified between calls.  Make vma_commit_reservation()
also pass along the return value of region_add().  The special
case return values of vma_needs_reservation() should also be
taken into account when determining the return value of
vma_commit_reservation().

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c41b2a0..7f64034 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -156,6 +156,7 @@ static long region_add(struct resv_map *resv, long f, long t)
 {
 	struct list_head *head = &resv->regions;
 	struct file_region *rg, *nrg, *trg;
+	long chg = 0;
 
 	spin_lock(&resv->lock);
 	/* Locate the region we are either in or before. */
@@ -181,14 +182,17 @@ static long region_add(struct resv_map *resv, long f, long t)
 		if (rg->to > t)
 			t = rg->to;
 		if (rg != nrg) {
+			chg -= (rg->to - rg->from);
 			list_del(&rg->link);
 			kfree(rg);
 		}
 	}
+	chg += (nrg->from - f);
 	nrg->from = f;
+	chg += t - nrg->to;
 	nrg->to = t;
 	spin_unlock(&resv->lock);
-	return 0;
+	return chg;
 }
 
 static long region_chg(struct resv_map *resv, long f, long t)
@@ -1349,18 +1353,25 @@ static long vma_needs_reservation(struct hstate *h,
 	else
 		return chg < 0 ? chg : 0;
 }
-static void vma_commit_reservation(struct hstate *h,
+
+static long vma_commit_reservation(struct hstate *h,
 			struct vm_area_struct *vma, unsigned long addr)
 {
 	struct resv_map *resv;
 	pgoff_t idx;
+	long add;
 
 	resv = vma_resv_map(vma);
 	if (!resv)
-		return;
+		return 1;
 
 	idx = vma_hugecache_offset(h, vma, addr);
-	region_add(resv, idx, idx + 1);
+	add = region_add(resv, idx, idx + 1);
+
+	if (vma->vm_flags & VM_MAYSHARE)
+		return add;
+	else
+		return 0;
 }
 
 static struct page *alloc_huge_page(struct vm_area_struct *vma,
-- 
2.1.0


* [RFC v3 PATCH 02/10] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages
  2015-05-21 15:47 ` Mike Kravetz
@ 2015-05-21 15:47   ` Mike Kravetz
  -1 siblings, 0 replies; 46+ messages in thread
From: Mike Kravetz @ 2015-05-21 15:47 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

alloc_huge_page and hugetlb_reserve_pages use region_chg to
calculate the number of pages which will be added to the reserve
map.  Subpool and global reserve counts are adjusted based on
the output of region_chg.  Before the pages are actually added
to the reserve map, these routines could race and add fewer
pages than expected.  If this happens, the subpool and global
reserve counts are not correct.

Compare the number of pages actually added (region_add) to those
expected to be added (region_chg).  If fewer pages are actually added,
this indicates a race; adjust the counters accordingly.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 37 +++++++++++++++++++++++++++++++++----
 1 file changed, 33 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7f64034..63f6d43 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1374,13 +1374,16 @@ static long vma_commit_reservation(struct hstate *h,
 		return 0;
 }
 
+/* Forward declaration */
+static int hugetlb_acct_memory(struct hstate *h, long delta);
+
 static struct page *alloc_huge_page(struct vm_area_struct *vma,
 				    unsigned long addr, int avoid_reserve)
 {
 	struct hugepage_subpool *spool = subpool_vma(vma);
 	struct hstate *h = hstate_vma(vma);
 	struct page *page;
-	long chg;
+	long chg, commit;
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg;
 
@@ -1421,7 +1424,20 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 
 	set_page_private(page, (unsigned long)spool);
 
-	vma_commit_reservation(h, vma, addr);
+	commit = vma_commit_reservation(h, vma, addr);
+	if (unlikely(chg > commit)) {
+		/*
+		 * The page was added to the reservation map between
+		 * vma_needs_reservation and vma_commit_reservation.
+		 * This indicates a race with hugetlb_reserve_pages.
+		 * Adjust for the subpool count incremented above AND
+		 * in hugetlb_reserve_pages for the same page.  Also,
+		 * the reservation count added in hugetlb_reserve_pages
+		 * no longer applies.
+		 */
+		hugepage_subpool_put_pages(spool, 1);
+		hugetlb_acct_memory(h, -1);
+	}
 	return page;
 
 out_uncharge_cgroup:
@@ -3512,8 +3528,21 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * consumed reservations are stored in the map. Hence, nothing
 	 * else has to be done for private mappings here
 	 */
-	if (!vma || vma->vm_flags & VM_MAYSHARE)
-		region_add(resv_map, from, to);
+	if (!vma || vma->vm_flags & VM_MAYSHARE) {
+		long add = region_add(resv_map, from, to);
+
+		if (unlikely(chg > add)) {
+			/*
+			 * pages in this range were added to the reserve
+			 * map between region_chg and region_add.  This
+			 * indicates a race with alloc_huge_page.  Adjust
+			 * the subpool and reserve counts modified above
+			 * based on the difference.
+			 */
+			hugepage_subpool_put_pages(spool, chg - add);
+			hugetlb_acct_memory(h, -(chg - add));
+		}
+	}
 	return 0;
 out_err:
 	if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
-- 
2.1.0


* [RFC v3 PATCH 03/10] mm/hugetlb: add region_del() to delete a specific range of entries
  2015-05-21 15:47 ` Mike Kravetz
@ 2015-05-21 15:47   ` Mike Kravetz
  -1 siblings, 0 replies; 46+ messages in thread
From: Mike Kravetz @ 2015-05-21 15:47 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

fallocate hole punch will want to remove a specific range of pages.
The existing region_truncate() routine deletes all region/reserve
map entries after a specified offset.  region_del() will provide
this same functionality if the end of region is specified as -1.
Hence, region_del() can replace region_truncate().

Unlike region_truncate(), region_del() can return an error in the
rare case where it can not allocate memory for a region descriptor.
This ONLY happens in the case where an existing region must be split.
Current callers passing -1 as end of range will never experience
this error and do not need to deal with error handling.  Future
callers of region_del() (such as fallocate hole punch) will need to
handle this error.  A routine hugetlb_fix_reserve_counts() is added
to assist in cleaning up if fallocate hole punch experiences this
type of error in region_del().

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h |  1 +
 mm/hugetlb.c            | 99 ++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 79 insertions(+), 21 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 7b57850..fd337f2 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -81,6 +81,7 @@ bool isolate_huge_page(struct page *page, struct list_head *list);
 void putback_active_hugepage(struct page *page);
 bool is_hugepage_active(struct page *page);
 void free_huge_page(struct page *page);
+void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve);
 
 #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
 pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 63f6d43..620cc9e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -261,38 +261,74 @@ out_nrg:
 	return chg;
 }
 
-static long region_truncate(struct resv_map *resv, long end)
+static long region_del(struct resv_map *resv, long f, long t)
 {
 	struct list_head *head = &resv->regions;
 	struct file_region *rg, *trg;
+	struct file_region *nrg = NULL;
 	long chg = 0;
 
+	/*
+	 * Locate segments we overlap and either split, remove or
+	 * trim the existing regions.  The end of region (t) == -1
+	 * indicates all remaining regions.  Special case t == -1 as
+	 * all comparisons are signed.  Also, when t == -1 it is not
+	 * possible to return an error (-ENOMEM) as this only happens
+	 * when splitting a region.  Callers take advantage of this
+	 * when calling with -1.
+	 */
+	if (t == -1)
+		t = LONG_MAX;
+retry:
 	spin_lock(&resv->lock);
-	/* Locate the region we are either in or before. */
-	list_for_each_entry(rg, head, link)
-		if (end <= rg->to)
+	list_for_each_entry_safe(rg, trg, head, link) {
+		if (rg->to <= f)
+			continue;
+		if (rg->from >= t)
 			break;
-	if (&rg->link == head)
-		goto out;
 
-	/* If we are in the middle of a region then adjust it. */
-	if (end > rg->from) {
-		chg = rg->to - end;
-		rg->to = end;
-		rg = list_entry(rg->link.next, typeof(*rg), link);
-	}
+		if (f > rg->from && t < rg->to) { /* must split region */
+			if (!nrg) {
+				spin_unlock(&resv->lock);
+				nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+				if (!nrg)
+					return -ENOMEM;
+				goto retry;
+			}
 
-	/* Drop any remaining regions. */
-	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
-		if (&rg->link == head)
+			chg += t - f;
+
+			/* new entry for end of split region */
+			nrg->from = t;
+			nrg->to = rg->to;
+			INIT_LIST_HEAD(&nrg->link);
+
+			/* original entry is trimmed */
+			rg->to = f;
+
+			list_add(&nrg->link, &rg->link);
+			nrg = NULL;
 			break;
-		chg += rg->to - rg->from;
-		list_del(&rg->link);
-		kfree(rg);
+		}
+
+		if (f <= rg->from && t >= rg->to) { /* remove entire region */
+			chg += rg->to - rg->from;
+			list_del(&rg->link);
+			kfree(rg);
+			continue;
+		}
+
+		if (f <= rg->from) {	/* trim beginning of region */
+			chg += t - rg->from;
+			rg->from = t;
+		} else {		/* trim end of region */
+			chg += rg->to - f;
+			rg->to = f;
+		}
 	}
 
-out:
 	spin_unlock(&resv->lock);
+	kfree(nrg);
 	return chg;
 }
 
@@ -324,6 +360,27 @@ static long region_count(struct resv_map *resv, long f, long t)
 }
 
 /*
+ * A rare out of memory error was encountered which prevented removal of
+ * the reserve map region for a page.  The huge page itself was freed
+ * and removed from the page cache.  This routine will adjust the global
+ * reserve count if needed, and the subpool usage count.  By incrementing
+ * these counts, the reserve map entry which could not be deleted will
+ * appear as a "reserved" entry instead of simply dangling with incorrect
+ * counts.
+ */
+void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve)
+{
+	struct hugepage_subpool *spool = subpool_inode(inode);
+
+	if (restore_reserve) {
+		struct hstate *h = hstate_inode(inode);
+
+		h->resv_huge_pages++;
+	}
+	hugepage_subpool_get_pages(spool, 1);
+}
+
+/*
  * Convert the address within this vma to the page offset within
  * the mapping, in pagecache page units; huge pages here.
  */
@@ -427,7 +484,7 @@ void resv_map_release(struct kref *ref)
 	struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
 
 	/* Clear out any active regions before we release the map. */
-	region_truncate(resv_map, 0);
+	region_del(resv_map, 0, -1);
 	kfree(resv_map);
 }
 
@@ -3558,7 +3615,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 	struct hugepage_subpool *spool = subpool_inode(inode);
 
 	if (resv_map)
-		chg = region_truncate(resv_map, offset);
+		chg = region_del(resv_map, offset, -1);
 	spin_lock(&inode->i_lock);
 	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
 	spin_unlock(&inode->i_lock);
-- 
2.1.0


* [RFC v3 PATCH 04/10] mm/hugetlb: expose hugetlb fault mutex for use by fallocate
  2015-05-21 15:47 ` Mike Kravetz
@ 2015-05-21 15:47   ` Mike Kravetz
  -1 siblings, 0 replies; 46+ messages in thread
From: Mike Kravetz @ 2015-05-21 15:47 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

hugetlb page faults are currently synchronized by the table of
mutexes (htlb_fault_mutex_table).  fallocate code will need to
synchronize with the page fault code when it allocates or
deletes pages.  Expose interfaces so that fallocate operations
can be synchronized with page faults.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h |  3 +++
 mm/hugetlb.c            | 23 ++++++++++++++++++++++-
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index fd337f2..d0d033e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -82,6 +82,9 @@ void putback_active_hugepage(struct page *page);
 bool is_hugepage_active(struct page *page);
 void free_huge_page(struct page *page);
 void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve);
+u32 hugetlb_fault_mutex_shared_hash(struct address_space *mapping, pgoff_t idx);
+void hugetlb_fault_mutex_lock(u32 hash);
+void hugetlb_fault_mutex_unlock(u32 hash);
 
 #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
 pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 620cc9e..df0d32a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3183,7 +3183,8 @@ static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
 	unsigned long key[2];
 	u32 hash;
 
-	if (vma->vm_flags & VM_SHARED) {
+	/* !vma implies this was called from hugetlbfs fallocate code */
+	if (!vma || vma->vm_flags & VM_SHARED) {
 		key[0] = (unsigned long) mapping;
 		key[1] = idx;
 	} else {
@@ -3209,6 +3210,26 @@ static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
 }
 #endif
 
+/*
+ * Interfaces to the fault mutex routines for use by hugetlbfs
+ * fallocate code.  Faults must be synchronized with page adds or
+ * deletes by fallocate.  fallocate only deals with shared mappings.
+ */
+u32 hugetlb_fault_mutex_shared_hash(struct address_space *mapping, pgoff_t idx)
+{
+	return fault_mutex_hash(NULL, NULL, NULL, mapping, idx, 0);
+}
+
+void hugetlb_fault_mutex_lock(u32 hash)
+{
+	mutex_lock(&htlb_fault_mutex_table[hash]);
+}
+
+void hugetlb_fault_mutex_unlock(u32 hash)
+{
+	mutex_unlock(&htlb_fault_mutex_table[hash]);
+}
+
 int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags)
 {
-- 
2.1.0


* [RFC v3 PATCH 05/10] hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete
  2015-05-21 15:47 ` Mike Kravetz
@ 2015-05-21 15:47   ` Mike Kravetz
  -1 siblings, 0 replies; 46+ messages in thread
From: Mike Kravetz @ 2015-05-21 15:47 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

fallocate hole punch will want to unmap a specific range of pages.
Modify the existing hugetlb_vmtruncate_list() routine to take a
start/end range.  If end is 0, this indicates all pages after start
should be unmapped.  This is the same as the existing truncate
functionality.  Modify existing callers to add 0 as end of range.

Since the routine will be used in hole punch as well as truncate
operations, it is more appropriately renamed to hugetlb_vmdelete_list().

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index c274aca..dda529c 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -373,11 +373,15 @@ static void hugetlbfs_evict_inode(struct inode *inode)
 }
 
 static inline void
-hugetlb_vmtruncate_list(struct rb_root *root, pgoff_t pgoff)
+hugetlb_vmdelete_list(struct rb_root *root, pgoff_t start, pgoff_t end)
 {
 	struct vm_area_struct *vma;
 
-	vma_interval_tree_foreach(vma, root, pgoff, ULONG_MAX) {
+	/*
+	 * end == 0 indicates that the entire range after
+	 * start should be unmapped.
+	 */
+	vma_interval_tree_foreach(vma, root, start, end ? end : ULONG_MAX) {
 		unsigned long v_offset;
 
 		/*
@@ -386,13 +390,20 @@ hugetlb_vmtruncate_list(struct rb_root *root, pgoff_t pgoff)
 		 * which overlap the truncated area starting at pgoff,
 		 * and no vma on a 32-bit arch can span beyond the 4GB.
 		 */
-		if (vma->vm_pgoff < pgoff)
-			v_offset = (pgoff - vma->vm_pgoff) << PAGE_SHIFT;
+		if (vma->vm_pgoff < start)
+			v_offset = (start - vma->vm_pgoff) << PAGE_SHIFT;
 		else
 			v_offset = 0;
 
-		unmap_hugepage_range(vma, vma->vm_start + v_offset,
-				     vma->vm_end, NULL);
+		if (end) {
+			end = ((end - start) << PAGE_SHIFT) +
+			       vma->vm_start + v_offset;
+			if (end > vma->vm_end)
+				end = vma->vm_end;
+		} else
+			end = vma->vm_end;
+
+		unmap_hugepage_range(vma, vma->vm_start + v_offset, end, NULL);
 	}
 }
 
@@ -408,7 +419,7 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	i_size_write(inode, offset);
 	i_mmap_lock_write(mapping);
 	if (!RB_EMPTY_ROOT(&mapping->i_mmap))
-		hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
+		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
 	i_mmap_unlock_write(mapping);
 	truncate_hugepages(inode, offset);
 	return 0;
-- 
2.1.0


* [RFC v3 PATCH 06/10] hugetlbfs: truncate_hugepages() takes a range of pages
  2015-05-21 15:47 ` Mike Kravetz
@ 2015-05-21 15:47   ` Mike Kravetz
  -1 siblings, 0 replies; 46+ messages in thread
From: Mike Kravetz @ 2015-05-21 15:47 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

Modify truncate_hugepages() to take a range of pages (start, end)
instead of simply start. If an end value of -1 is passed, the
current "truncate" functionality is maintained. Existing callers
are modified to pass -1 as end of range. By keying off end == -1,
the routine behaves differently for truncate and hole punch.
Page removal is now synchronized with page allocation via faults
by using the fault mutex table. The hole punch case can experience
the rare region_del error and must handle it accordingly.

Since the routine handles more than just the truncate case, it is
renamed to remove_inode_hugepages().  To be consistent, the routine
truncate_huge_page() is renamed remove_huge_page().

Downstream of remove_inode_hugepages(), the routine
hugetlb_unreserve_pages() is also modified to take a range of pages.
hugetlb_unreserve_pages is modified to detect an error from
region_del and pass it back to the caller.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c    | 88 +++++++++++++++++++++++++++++++++++++++++++------
 include/linux/hugetlb.h |  3 +-
 mm/hugetlb.c            | 17 ++++++++--
 3 files changed, 94 insertions(+), 14 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index dda529c..dfa88a5 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -317,26 +317,53 @@ static int hugetlbfs_write_end(struct file *file, struct address_space *mapping,
 	return -EINVAL;
 }
 
-static void truncate_huge_page(struct page *page)
+static void remove_huge_page(struct page *page)
 {
 	cancel_dirty_page(page, /* No IO accounting for huge pages? */0);
 	ClearPageUptodate(page);
 	delete_from_page_cache(page);
 }
 
-static void truncate_hugepages(struct inode *inode, loff_t lstart)
+/*
+ * remove_inode_hugepages handles two distinct cases: truncation and hole punch
+ * truncation is indicated by end of range being -1
+ *	In this case, we first scan the range and release found pages.
+ *	After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
+ *	maps and global counts.
+ * hole punch is indicated if end is not -1
+ *	In the hole punch case we scan the range and release found pages.
+ *	Only when releasing a page is the associated region/reserv map
+ *	deleted.  The region/reserv map for ranges without associated
+ *	pages are not modified.
+ */
+static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
+				   loff_t lend)
 {
 	struct hstate *h = hstate_inode(inode);
 	struct address_space *mapping = &inode->i_data;
 	const pgoff_t start = lstart >> huge_page_shift(h);
+	const pgoff_t end = lend >> huge_page_shift(h);
 	struct pagevec pvec;
 	pgoff_t next;
 	int i, freed = 0;
+	long lookup_nr = PAGEVEC_SIZE;
+	bool truncate_op = (lend == -1);
 
 	pagevec_init(&pvec, 0);
 	next = start;
-	while (1) {
-		if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+	while (next < end) {
+		/*
+		 * Make sure to never grab more pages than we
+		 * might possibly need.
+		 */
+		if (end - next < lookup_nr)
+			lookup_nr = end - next;
+
+		/*
+		 * This pagevec_lookup() may return pages past 'end',
+		 * so we must check for page->index > end.
+		 */
+		if (!pagevec_lookup(&pvec, mapping, next, lookup_nr)) {
 			if (next == start)
 				break;
 			next = start;
@@ -345,26 +372,67 @@ static void truncate_hugepages(struct inode *inode, loff_t lstart)
 
 		for (i = 0; i < pagevec_count(&pvec); ++i) {
 			struct page *page = pvec.pages[i];
+			u32 hash;
+
+			hash = hugetlb_fault_mutex_shared_hash(mapping, next);
+			hugetlb_fault_mutex_lock(hash);
 
 			lock_page(page);
+			if (page->index >= end) {
+				unlock_page(page);
+				hugetlb_fault_mutex_unlock(hash);
+				next = end;	/* we are done */
+				break;
+			}
+
+			/*
+			 * If page is mapped, it was faulted in after being
+			 * unmapped.  Do nothing in this race case.  In the
+			 * normal case page is not mapped.
+			 */
+			if (!page_mapped(page)) {
+				bool rsv_on_error = !PagePrivate(page);
+				/*
+				 * We must free the huge page and remove
+				 * from page cache (remove_huge_page) BEFORE
+				 * removing the region/reserve map
+				 * (hugetlb_unreserve_pages).  In rare out
+				 * of memory conditions, removal of the
+				 * region/reserve map could fail.  Before
+				 * free'ing the page, note PagePrivate which
+				 * is used in case of error.
+				 */
+				remove_huge_page(page);
+				freed++;
+				if (!truncate_op) {
+					if (unlikely(hugetlb_unreserve_pages(
+							inode, next,
+							next + 1, 1)))
+						hugetlb_fix_reserve_counts(
+							inode, rsv_on_error);
+				}
+			}
+
 			if (page->index > next)
 				next = page->index;
+
 			++next;
-			truncate_huge_page(page);
 			unlock_page(page);
-			freed++;
+
+			hugetlb_fault_mutex_unlock(hash);
 		}
 		huge_pagevec_release(&pvec);
 	}
-	BUG_ON(!lstart && mapping->nrpages);
-	hugetlb_unreserve_pages(inode, start, freed);
+
+	if (truncate_op)
+		(void)hugetlb_unreserve_pages(inode, start, end, freed);
 }
 
 static void hugetlbfs_evict_inode(struct inode *inode)
 {
 	struct resv_map *resv_map;
 
-	truncate_hugepages(inode, 0);
+	remove_inode_hugepages(inode, 0, -1);
 	resv_map = (struct resv_map *)inode->i_mapping->private_data;
 	/* root inode doesn't have the resv_map, so we should check it */
 	if (resv_map)
@@ -421,7 +489,7 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	if (!RB_EMPTY_ROOT(&mapping->i_mmap))
 		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
 	i_mmap_unlock_write(mapping);
-	truncate_hugepages(inode, offset);
+	remove_inode_hugepages(inode, offset, -1);
 	return 0;
 }
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d0d033e..4c2856e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -75,7 +75,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 int hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						struct vm_area_struct *vma,
 						vm_flags_t vm_flags);
-void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
+long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
+						long freed);
 int dequeue_hwpoisoned_huge_page(struct page *page);
 bool isolate_huge_page(struct page *page, struct list_head *list);
 void putback_active_hugepage(struct page *page);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index df0d32a..0cf0622 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3628,21 +3628,32 @@ out_err:
 	return ret;
 }
 
-void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
+long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
+								long freed)
 {
 	struct hstate *h = hstate_inode(inode);
 	struct resv_map *resv_map = inode_resv_map(inode);
 	long chg = 0;
 	struct hugepage_subpool *spool = subpool_inode(inode);
 
-	if (resv_map)
-		chg = region_del(resv_map, offset, -1);
+	if (resv_map) {
+		chg = region_del(resv_map, start, end);
+		/*
+		 * region_del() can fail in the rare case where a region
+		 * must be split and another region descriptor can not be
+		 * allocated.  If end == -1, it will not fail.
+		 */
+		if (chg < 0)
+			return chg;
+	}
 	spin_lock(&inode->i_lock);
 	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
 	spin_unlock(&inode->i_lock);
 
 	hugepage_subpool_put_pages(spool, (chg - freed));
 	hugetlb_acct_memory(h, -(chg - freed));
+
+	return 0;
 }
 
 #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v3 PATCH 06/10] hugetlbfs: truncate_hugepages() takes a range of pages
@ 2015-05-21 15:47   ` Mike Kravetz
  0 siblings, 0 replies; 46+ messages in thread
From: Mike Kravetz @ 2015-05-21 15:47 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

Modify truncate_hugepages() to take a range of pages (start, end)
instead of simply start. If an end value of -1 is passed, the
current "truncate" functionality is maintained. Existing callers
are modified to pass -1 as end of range. By keying off end == -1,
the routine behaves differently for truncate and hole punch.
Page removal is now synchronized with page allocation via faults
by using the fault mutex table. The hole punch case can experience
the rare region_del error and must handle accordingly.

Since the routine handles more than just the truncate case, it is
renamed to remove_inode_hugepages().  To be consistent, the routine
truncate_huge_page() is renamed remove_huge_page().

Downstream of remove_inode_hugepages(), the routine
hugetlb_unreserve_pages() is also modified to take a range of pages.
hugetlb_unreserve_pages is modified to detect an error from
region_del and pass it back to the caller.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c    | 88 +++++++++++++++++++++++++++++++++++++++++++------
 include/linux/hugetlb.h |  3 +-
 mm/hugetlb.c            | 17 ++++++++--
 3 files changed, 94 insertions(+), 14 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index dda529c..dfa88a5 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -317,26 +317,53 @@ static int hugetlbfs_write_end(struct file *file, struct address_space *mapping,
 	return -EINVAL;
 }
 
-static void truncate_huge_page(struct page *page)
+static void remove_huge_page(struct page *page)
 {
 	cancel_dirty_page(page, /* No IO accounting for huge pages? */0);
 	ClearPageUptodate(page);
 	delete_from_page_cache(page);
 }
 
-static void truncate_hugepages(struct inode *inode, loff_t lstart)
+/*
+ * remove_inode_hugepages handles two distinct cases: truncation and hole punch
+ * truncation is indicated by end of range being -1
+ *	In this case, we first scan the range and release found pages.
+ *	After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
+ *	maps and global counts.
+ * hole punch is indicated if end is not -1
+ *	In the hole punch case we scan the range and release found pages.
+ *	Only when releasing a page is the associated region/reserv map
+ *	deleted.  The region/reserv map for ranges without associated
+ *	pages are not modified.
+ */
+static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
+				   loff_t lend)
 {
 	struct hstate *h = hstate_inode(inode);
 	struct address_space *mapping = &inode->i_data;
 	const pgoff_t start = lstart >> huge_page_shift(h);
+	const pgoff_t end = lend >> huge_page_shift(h);
 	struct pagevec pvec;
 	pgoff_t next;
 	int i, freed = 0;
+	long lookup_nr = PAGEVEC_SIZE;
+	bool truncate_op = (lend == -1);
 
 	pagevec_init(&pvec, 0);
 	next = start;
-	while (1) {
-		if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+	while (next < end) {
+		/*
+		 * Make sure to never grab more pages that we
+		 * might possibly need.
+		 */
+		if (end - next < lookup_nr)
+			lookup_nr = end - next;
+
+		/*
+		 * This pagevec_lookup() may return pages past 'end',
+		 * so we must check for page->index > end.
+		 */
+		if (!pagevec_lookup(&pvec, mapping, next, lookup_nr)) {
 			if (next == start)
 				break;
 			next = start;
@@ -345,26 +372,67 @@ static void truncate_hugepages(struct inode *inode, loff_t lstart)
 
 		for (i = 0; i < pagevec_count(&pvec); ++i) {
 			struct page *page = pvec.pages[i];
+			u32 hash;
+
+			hash = hugetlb_fault_mutex_shared_hash(mapping, next);
+			hugetlb_fault_mutex_lock(hash);
 
 			lock_page(page);
+			if (page->index >= end) {
+				unlock_page(page);
+				hugetlb_fault_mutex_unlock(hash);
+				next = end;	/* we are done */
+				break;
+			}
+
+			/*
+			 * If page is mapped, it was faulted in after being
+			 * unmapped.  Do nothing in this race case.  In the
+			 * normal case page is not mapped.
+			 */
+			if (!page_mapped(page)) {
+				bool rsv_on_error = !PagePrivate(page);
+				/*
+				 * We must free the huge page and remove
+				 * from page cache (remove_huge_page) BEFORE
+				 * removing the region/reserve map
+				 * (hugetlb_unreserve_pages).  In rare out
+				 * of memory conditions, removal of the
+				 * region/reserve map could fail.  Before
+				 * free'ing the page, note PagePrivate which
+				 * is used in case of error.
+				 */
+				remove_huge_page(page);
+				freed++;
+				if (!truncate_op) {
+					if (unlikely(hugetlb_unreserve_pages(
+							inode, next,
+							next + 1, 1)))
+						hugetlb_fix_reserve_counts(
+							inode, rsv_on_error);
+				}
+			}
+
 			if (page->index > next)
 				next = page->index;
+
 			++next;
-			truncate_huge_page(page);
 			unlock_page(page);
-			freed++;
+
+			hugetlb_fault_mutex_unlock(hash);
 		}
 		huge_pagevec_release(&pvec);
 	}
-	BUG_ON(!lstart && mapping->nrpages);
-	hugetlb_unreserve_pages(inode, start, freed);
+
+	if (truncate_op)
+		(void)hugetlb_unreserve_pages(inode, start, end, freed);
 }
 
 static void hugetlbfs_evict_inode(struct inode *inode)
 {
 	struct resv_map *resv_map;
 
-	truncate_hugepages(inode, 0);
+	remove_inode_hugepages(inode, 0, -1);
 	resv_map = (struct resv_map *)inode->i_mapping->private_data;
 	/* root inode doesn't have the resv_map, so we should check it */
 	if (resv_map)
@@ -421,7 +489,7 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	if (!RB_EMPTY_ROOT(&mapping->i_mmap))
 		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
 	i_mmap_unlock_write(mapping);
-	truncate_hugepages(inode, offset);
+	remove_inode_hugepages(inode, offset, -1);
 	return 0;
 }
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d0d033e..4c2856e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -75,7 +75,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 int hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						struct vm_area_struct *vma,
 						vm_flags_t vm_flags);
-void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
+long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
+						long freed);
 int dequeue_hwpoisoned_huge_page(struct page *page);
 bool isolate_huge_page(struct page *page, struct list_head *list);
 void putback_active_hugepage(struct page *page);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index df0d32a..0cf0622 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3628,21 +3628,32 @@ out_err:
 	return ret;
 }
 
-void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
+long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
+								long freed)
 {
 	struct hstate *h = hstate_inode(inode);
 	struct resv_map *resv_map = inode_resv_map(inode);
 	long chg = 0;
 	struct hugepage_subpool *spool = subpool_inode(inode);
 
-	if (resv_map)
-		chg = region_del(resv_map, offset, -1);
+	if (resv_map) {
+		chg = region_del(resv_map, start, end);
+		/*
+		 * region_del() can fail in the rare case where a region
+		 * must be split and another region descriptor can not be
+		 * allocated.  If end == -1, it will not fail.
+		 */
+		if (chg < 0)
+			return chg;
+	}
 	spin_lock(&inode->i_lock);
 	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
 	spin_unlock(&inode->i_lock);
 
 	hugepage_subpool_put_pages(spool, (chg - freed));
 	hugetlb_acct_memory(h, -(chg - freed));
+
+	return 0;
 }
 
 #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v3 PATCH 07/10] hugetlbfs: New huge_add_to_page_cache helper routine
  2015-05-21 15:47 ` Mike Kravetz
@ 2015-05-21 15:47   ` Mike Kravetz
  -1 siblings, 0 replies; 46+ messages in thread
From: Mike Kravetz @ 2015-05-21 15:47 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

Currently, there is only a single place where hugetlbfs pages are
added to the page cache.  The new fallocate code will be adding a
second one, so break the functionality out into its own helper.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h |  2 ++
 mm/hugetlb.c            | 27 ++++++++++++++++++---------
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 4c2856e..934f339 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -330,6 +330,8 @@ struct huge_bootmem_page {
 struct page *alloc_huge_page_node(struct hstate *h, int nid);
 struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
+int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
+			pgoff_t idx);
 
 /* arch callback */
 int __init alloc_bootmem_huge_page(struct hstate *h);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0cf0622..449cf5f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3043,6 +3043,23 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
 	return page != NULL;
 }
 
+int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
+			   pgoff_t idx)
+{
+	struct inode *inode = mapping->host;
+	struct hstate *h = hstate_inode(inode);
+	int err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
+
+	if (err)
+		return err;
+	ClearPagePrivate(page);
+
+	spin_lock(&inode->i_lock);
+	inode->i_blocks += blocks_per_huge_page(h);
+	spin_unlock(&inode->i_lock);
+	return 0;
+}
+
 static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			   struct address_space *mapping, pgoff_t idx,
 			   unsigned long address, pte_t *ptep, unsigned int flags)
@@ -3089,21 +3106,13 @@ retry:
 		__SetPageUptodate(page);
 
 		if (vma->vm_flags & VM_MAYSHARE) {
-			int err;
-			struct inode *inode = mapping->host;
-
-			err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
+			int err = huge_add_to_page_cache(page, mapping, idx);
 			if (err) {
 				put_page(page);
 				if (err == -EEXIST)
 					goto retry;
 				goto out;
 			}
-			ClearPagePrivate(page);
-
-			spin_lock(&inode->i_lock);
-			inode->i_blocks += blocks_per_huge_page(h);
-			spin_unlock(&inode->i_lock);
 		} else {
 			lock_page(page);
 			if (unlikely(anon_vma_prepare(vma))) {
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v3 PATCH 08/10] mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch
  2015-05-21 15:47 ` Mike Kravetz
@ 2015-05-21 15:47   ` Mike Kravetz
  -1 siblings, 0 replies; 46+ messages in thread
From: Mike Kravetz @ 2015-05-21 15:47 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

In vma_has_reserves(), the current assumption is that reserves are
always present for shared mappings.  However, this will not be the
case with fallocate hole punch.  When punching a hole, the present
page will be deleted as well as the region/reserve map entry (and
hence any reservation).  vma_has_reserves() is passed "chg", which
indicates whether or not a region/reserve map entry is present.  Use
this to determine if reserves are actually present or were removed
via hole punch.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 449cf5f..94c6154 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -559,9 +559,19 @@ static int vma_has_reserves(struct vm_area_struct *vma, long chg)
 			return 0;
 	}
 
-	/* Shared mappings always use reserves */
-	if (vma->vm_flags & VM_MAYSHARE)
-		return 1;
+	if (vma->vm_flags & VM_MAYSHARE) {
+		/*
+		 * We know VM_NORESERVE is not set.  Therefore, there SHOULD
+		 * be a region map for all pages.  The only situation where
+		 * there is no region map is if a hole was punched via
+		 * fallocate.  In this case, there really are no reserves to
+		 * use.  This situation is indicated if chg != 0.
+		 */
+		if (chg)
+			return 0;
+		else
+			return 1;
+	}
 
 	/*
 	 * Only the process that called mmap() has reserves for
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v3 PATCH 09/10] hugetlbfs: add hugetlbfs_fallocate()
  2015-05-21 15:47 ` Mike Kravetz
@ 2015-05-21 15:47   ` Mike Kravetz
  -1 siblings, 0 replies; 46+ messages in thread
From: Mike Kravetz @ 2015-05-21 15:47 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

This is based on the shmem version, but it has diverged quite
a bit.  We have no swap to worry about, nor the new file sealing.
Add synchronization via the fault mutex table to coordinate
page faults, fallocate allocation and fallocate hole punch.

What this allows us to do is move physical memory in and out of
a hugetlbfs file without having it mapped.  This also gives us
the ability to support MADV_REMOVE since it is currently
implemented using fallocate().  MADV_REMOVE lets madvise() remove
pages from the middle of a hugetlbfs file, which wasn't possible
before.

hugetlbfs fallocate only operates on whole huge pages.
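
As an illustration only (not part of this patch), a minimal userspace
sketch of the new interface might look as follows.  The mount point,
file name and the 2MB huge page size are assumptions, and error
handling is reduced to perror():

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	off_t hpage = 2 * 1024 * 1024;	/* assumed huge page size */
	int fd = open("/mnt/huge/testfile", O_CREAT | O_RDWR, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Preallocate the first four huge pages of the file. */
	if (fallocate(fd, 0, 0, 4 * hpage))
		perror("fallocate(preallocate)");

	/*
	 * Punch an unaligned hole.  The start is rounded up and the end
	 * rounded down, so only the one whole huge page fully inside
	 * [hpage/2, hpage/2 + 2*hpage) is released back to the pool.
	 */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      hpage / 2, 2 * hpage))
		perror("fallocate(punch hole)");

	close(fd);
	return 0;
}

The VFS requires FALLOC_FL_KEEP_SIZE together with FALLOC_FL_PUNCH_HOLE,
so the file size is unchanged by the punch.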

Based-on code-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c    | 169 +++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/hugetlb.h |   3 +
 mm/hugetlb.c            |   2 +-
 3 files changed, 172 insertions(+), 2 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index dfa88a5..4b1535f 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -12,6 +12,7 @@
 #include <linux/thread_info.h>
 #include <asm/current.h>
 #include <linux/sched.h>		/* remove ASAP */
+#include <linux/falloc.h>
 #include <linux/fs.h>
 #include <linux/mount.h>
 #include <linux/file.h>
@@ -493,6 +494,171 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	return 0;
 }
 
+static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
+{
+	struct hstate *h = hstate_inode(inode);
+	unsigned long hpage_size = huge_page_size(h);
+	loff_t hole_start, hole_end;
+
+	/*
+	 * For hole punch round up the beginning offset of the hole and
+	 * round down the end.
+	 */
+	hole_start = (offset + hpage_size - 1) & huge_page_mask(h);
+	hole_end = (offset + len) & huge_page_mask(h);
+
+	if ((u64)hole_end > (u64)hole_start) {
+		struct address_space *mapping = inode->i_mapping;
+
+		mutex_lock(&inode->i_mutex);
+		i_mmap_lock_write(mapping);
+		if (!RB_EMPTY_ROOT(&mapping->i_mmap))
+			hugetlb_vmdelete_list(&mapping->i_mmap,
+						hole_start >> PAGE_SHIFT,
+						hole_end  >> PAGE_SHIFT);
+		i_mmap_unlock_write(mapping);
+		remove_inode_hugepages(inode, hole_start, hole_end);
+		mutex_unlock(&inode->i_mutex);
+	}
+
+	return 0;
+}
+
+static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
+				loff_t len)
+{
+	struct inode *inode = file_inode(file);
+	struct address_space *mapping = inode->i_mapping;
+	struct hstate *h = hstate_inode(inode);
+	struct vm_area_struct pseudo_vma;
+	unsigned long hpage_size = huge_page_size(h);
+	unsigned long hpage_shift = huge_page_shift(h);
+	pgoff_t start, index, end;
+	int error;
+	u32 hash;
+
+	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+		return -EOPNOTSUPP;
+
+	if (mode & FALLOC_FL_PUNCH_HOLE)
+		return hugetlbfs_punch_hole(inode, offset, len);
+
+	/*
+	 * Default preallocate case.
+	 * For this range, start is rounded down and end is rounded up.
+	 */
+	start = offset >> hpage_shift;
+	end = (offset + len + hpage_size - 1) >> hpage_shift;
+
+	mutex_lock(&inode->i_mutex);
+
+	/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
+	error = inode_newsize_ok(inode, offset + len);
+	if (error)
+		goto out;
+
+	/*
+	 * Initialize a pseudo vma that just contains the policy used
+	 * when allocating the huge pages.  The actual policy field
+	 * (vm_policy) is determined based on the index in the loop below.
+	 */
+	memset(&pseudo_vma, 0, sizeof(struct vm_area_struct));
+	pseudo_vma.vm_start = 0;
+	pseudo_vma.vm_flags |= (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
+	pseudo_vma.vm_file = file;
+
+	for (index = start; index < end; index++) {
+		/*
+		 * This is supposed to be the vaddr where the page is being
+		 * faulted in, but we have no vaddr here.
+		 */
+		struct page *page;
+		unsigned long addr;
+		int avoid_reserve = 0;
+
+		cond_resched();
+
+		/*
+		 * fallocate(2) manpage permits EINTR; we may have been
+		 * interrupted because we are using up too much memory.
+		 */
+		if (signal_pending(current)) {
+			error = -EINTR;
+			break;
+		}
+
+		/* Get policy based on index */
+		pseudo_vma.vm_policy =
+			mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy,
+							index);
+
+		/* addr is the offset within the file (zero based) */
+		addr = index * hpage_size;
+
+		/* mutex taken here, fault path and hole punch */
+		hash = hugetlb_fault_mutex_shared_hash(mapping, index);
+		hugetlb_fault_mutex_lock(hash);
+
+		/* see if page already exists to avoid alloc/free */
+		page = find_get_page(mapping, index);
+		if (page) {
+			put_page(page);
+			hugetlb_fault_mutex_unlock(hash);
+			continue;
+		}
+
+		page = alloc_huge_page(&pseudo_vma, addr, avoid_reserve);
+		mpol_cond_put(pseudo_vma.vm_policy);
+		if (IS_ERR(page)) {
+			hugetlb_fault_mutex_unlock(hash);
+			error = PTR_ERR(page);
+			goto out;
+		}
+		clear_huge_page(page, addr, pages_per_huge_page(h));
+		__SetPageUptodate(page);
+		error = huge_add_to_page_cache(page, mapping, index);
+		if (error) {
+			/*
+			 * An entry already exists in the cache.  This implies
+			 * a region also existed in the reserve map at the time
+			 * the page was allocated above.  Therefore, no use
+			 * count was added to the subpool for the page.  Before
+			 * freeing the page, clear the subpool reference so
+			 * that the count is not decremented.
+			 */
+			set_page_private(page, 0);/* clear spool reference */
+			put_page(page);
+
+			hugetlb_fault_mutex_unlock(hash);
+			/* Keep going if we see an -EEXIST */
+			if (error == -EEXIST) {
+				error = 0;	/* do not return to user */
+				continue;
+			} else
+				goto out;
+		}
+
+		hugetlb_fault_mutex_unlock(hash);
+
+		/*
+		 * put_page due to reference from alloc_huge_page()
+		 * unlock_page because locked by add_to_page_cache()
+		 */
+		put_page(page);
+		unlock_page(page);
+	}
+
+	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
+		i_size_write(inode, offset + len);
+	inode->i_ctime = CURRENT_TIME;
+	spin_lock(&inode->i_lock);
+	inode->i_private = NULL;
+	spin_unlock(&inode->i_lock);
+out:
+	mutex_unlock(&inode->i_mutex);
+	return error;
+}
+
 static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
 {
 	struct inode *inode = dentry->d_inode;
@@ -804,7 +970,8 @@ const struct file_operations hugetlbfs_file_operations = {
 	.mmap			= hugetlbfs_file_mmap,
 	.fsync			= noop_fsync,
 	.get_unmapped_area	= hugetlb_get_unmapped_area,
-	.llseek		= default_llseek,
+	.llseek			= default_llseek,
+	.fallocate		= hugetlbfs_fallocate,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 934f339..fa36b7a 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -327,6 +327,8 @@ struct huge_bootmem_page {
 #endif
 };
 
+struct page *alloc_huge_page(struct vm_area_struct *vma,
+				unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_node(struct hstate *h, int nid);
 struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
@@ -481,6 +483,7 @@ static inline bool hugepages_supported(void)
 
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
+#define alloc_huge_page(v, a, r) NULL
 #define alloc_huge_page_node(h, nid) NULL
 #define alloc_huge_page_noerr(v, a, r) NULL
 #define alloc_bootmem_huge_page(h) NULL
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 94c6154..1e95038 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1444,7 +1444,7 @@ static long vma_commit_reservation(struct hstate *h,
 /* Forward declaration */
 static int hugetlb_acct_memory(struct hstate *h, long delta);
 
-static struct page *alloc_huge_page(struct vm_area_struct *vma,
+struct page *alloc_huge_page(struct vm_area_struct *vma,
 				    unsigned long addr, int avoid_reserve)
 {
 	struct hugepage_subpool *spool = subpool_vma(vma);
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC v3 PATCH 10/10] mm: madvise allow remove operation for hugetlbfs
  2015-05-21 15:47 ` Mike Kravetz
@ 2015-05-21 15:47   ` Mike Kravetz
  -1 siblings, 0 replies; 46+ messages in thread
From: Mike Kravetz @ 2015-05-21 15:47 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Naoya Horiguchi, David Rientjes, Hugh Dickins,
	Davidlohr Bueso, Aneesh Kumar, Hillf Danton, Christoph Hellwig,
	Mike Kravetz

Now that we have hole punching support for hugetlbfs, we can
also support the MADV_REMOVE interface to it.
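
For illustration only (not part of this patch), a userspace sketch of
MADV_REMOVE on a mapped hugetlbfs file might look as follows.  The
mount point, file name and the 2MB huge page size are assumptions, and
the sketch assumes enough free huge pages to fault in the mapping:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t hpage = 2 * 1024 * 1024;	/* assumed huge page size */
	size_t len = 4 * hpage;
	int fd = open("/mnt/huge/testfile", O_CREAT | O_RDWR, 0644);
	char *p;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* hugetlbfs requires the size to be huge page aligned */
	if (ftruncate(fd, len)) {
		perror("ftruncate");
		return 1;
	}

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	p[0] = 1;		/* fault in the first huge page */
	p[hpage] = 1;		/* and the second */

	/*
	 * Previously -EINVAL for hugetlb VMAs; with this series the
	 * second huge page is punched out of the file.
	 */
	if (madvise(p + hpage, hpage, MADV_REMOVE))
		perror("madvise(MADV_REMOVE)");

	munmap(p, len);
	close(fd);
	return 0;
}

madvise_remove() implements this via vfs_fallocate() with
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, which is why the hole punch
support added earlier in the series is the prerequisite.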

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/madvise.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index d551475..c4a1027 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -299,7 +299,7 @@ static long madvise_remove(struct vm_area_struct *vma,
 
 	*prev = NULL;	/* tell sys_madvise we drop mmap_sem */
 
-	if (vma->vm_flags & (VM_LOCKED | VM_HUGETLB))
+	if (vma->vm_flags & VM_LOCKED)
 		return -EINVAL;
 
 	f = vma->vm_file;
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [RFC v3 PATCH 03/10] mm/hugetlb: add region_del() to delete a specific range of entries
  2015-05-21 15:47   ` Mike Kravetz
@ 2015-05-22  6:21     ` Naoya Horiguchi
  -1 siblings, 0 replies; 46+ messages in thread
From: Naoya Horiguchi @ 2015-05-22  6:21 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Dave Hansen, David Rientjes,
	Hugh Dickins, Davidlohr Bueso, Aneesh Kumar, Hillf Danton,
	Christoph Hellwig

On Thu, May 21, 2015 at 08:47:37AM -0700, Mike Kravetz wrote:
> fallocate hole punch will want to remove a specific range of pages.
> The existing region_truncate() routine deletes all region/reserve
> map entries after a specified offset.  region_del() will provide
> this same functionality if the end of region is specified as -1.
> Hence, region_del() can replace region_truncate().
> 
> Unlike region_truncate(), region_del() can return an error in the
> rare case where it can not allocate memory for a region descriptor.
> This ONLY happens in the case where an existing region must be split.
> Current callers passing -1 as end of range will never experience
> this error and do not need to deal with error handling.  Future
> callers of region_del() (such as fallocate hole punch) will need to
> handle this error.  A routine hugetlb_fix_reserve_counts() is added
> to assist in cleaning up if fallocate hole punch experiences this
> type of error in region_del().
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  include/linux/hugetlb.h |  1 +
>  mm/hugetlb.c            | 99 ++++++++++++++++++++++++++++++++++++++-----------
>  2 files changed, 79 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 7b57850..fd337f2 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -81,6 +81,7 @@ bool isolate_huge_page(struct page *page, struct list_head *list);
>  void putback_active_hugepage(struct page *page);
>  bool is_hugepage_active(struct page *page);
>  void free_huge_page(struct page *page);
> +void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve);

This function is used in patch 6/10 for the first time,
so is it better to move the definition to that patch?
(this temporarily introduces "defined but not used" warning...)

>  #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
>  pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 63f6d43..620cc9e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -261,38 +261,74 @@ out_nrg:
>  	return chg;
>  }
>  
> -static long region_truncate(struct resv_map *resv, long end)
> +static long region_del(struct resv_map *resv, long f, long t)
>  {
>  	struct list_head *head = &resv->regions;
>  	struct file_region *rg, *trg;
> +	struct file_region *nrg = NULL;
>  	long chg = 0;
>  
> +	/*
> +	 * Locate segments we overlap and either split, remove or
> +	 * trim the existing regions.  The end of region (t) == -1
> +	 * indicates all remaining regions.  Special case t == -1 as
> +	 * all comparisons are signed.  Also, when t == -1 it is not
> +	 * possible to return an error (-ENOMEM) as this only happens
> +	 * when splitting a region.  Callers take advantage of this
> +	 * when calling with -1.
> +	 */
> +	if (t == -1)
> +		t = LONG_MAX;
> +retry:
>  	spin_lock(&resv->lock);
> -	/* Locate the region we are either in or before. */
> -	list_for_each_entry(rg, head, link)
> -		if (end <= rg->to)
> +	list_for_each_entry_safe(rg, trg, head, link) {
> +		if (rg->to <= f)
> +			continue;
> +		if (rg->from >= t)
>  			break;
> -	if (&rg->link == head)
> -		goto out;
>  
> -	/* If we are in the middle of a region then adjust it. */
> -	if (end > rg->from) {
> -		chg = rg->to - end;
> -		rg->to = end;
> -		rg = list_entry(rg->link.next, typeof(*rg), link);
> -	}
> +		if (f > rg->from && t < rg->to) { /* must split region */
> +			if (!nrg) {
> +				spin_unlock(&resv->lock);
> +				nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> +				if (!nrg)
> +					return -ENOMEM;
> +				goto retry;
> +			}
>  
> -	/* Drop any remaining regions. */
> -	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
> -		if (&rg->link == head)
> +			chg += t - f;
> +
> +			/* new entry for end of split region */
> +			nrg->from = t;
> +			nrg->to = rg->to;
> +			INIT_LIST_HEAD(&nrg->link);
> +
> +			/* original entry is trimmed */
> +			rg->to = f;
> +
> +			list_add(&nrg->link, &rg->link);
> +			nrg = NULL;
>  			break;
> -		chg += rg->to - rg->from;
> -		list_del(&rg->link);
> -		kfree(rg);
> +		}
> +
> +		if (f <= rg->from && t >= rg->to) { /* remove entire region */
> +			chg += rg->to - rg->from;
> +			list_del(&rg->link);
> +			kfree(rg);
> +			continue;
> +		}
> +
> +		if (f <= rg->from) {	/* trim beginning of region */
> +			chg += t - rg->from;
> +			rg->from = t;
> +		} else {		/* trim end of region */
> +			chg += rg->to - f;
> +			rg->to = f;

Is it better to put "break" here?

Thanks,
Naoya Horiguchi

> +		}
>  	}
>  
> -out:
>  	spin_unlock(&resv->lock);
> +	kfree(nrg);
>  	return chg;
>  }
>  
> @@ -324,6 +360,27 @@ static long region_count(struct resv_map *resv, long f, long t)
>  }
>  
>  /*
> + * A rare out of memory error was encountered which prevented removal of
> + * the reserve map region for a page.  The huge page itself was freed
> + * and removed from the page cache.  This routine will adjust the global
> + * reserve count if needed, and the subpool usage count.  By incrementing
> + * these counts, the reserve map entry which could not be deleted will
> + * appear as a "reserved" entry instead of simply dangling with incorrect
> + * counts.
> + */
> +void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve)
> +{
> +	struct hugepage_subpool *spool = subpool_inode(inode);
> +
> +	if (restore_reserve) {
> +		struct hstate *h = hstate_inode(inode);
> +
> +		h->resv_huge_pages++;
> +	}
> +	hugepage_subpool_get_pages(spool, 1);
> +}
> +
> +/*
>   * Convert the address within this vma to the page offset within
>   * the mapping, in pagecache page units; huge pages here.
>   */
> @@ -427,7 +484,7 @@ void resv_map_release(struct kref *ref)
>  	struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
>  
>  	/* Clear out any active regions before we release the map. */
> -	region_truncate(resv_map, 0);
> +	region_del(resv_map, 0, -1);
>  	kfree(resv_map);
>  }
>  
> @@ -3558,7 +3615,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
>  	struct hugepage_subpool *spool = subpool_inode(inode);
>  
>  	if (resv_map)
> -		chg = region_truncate(resv_map, offset);
> +		chg = region_del(resv_map, offset, -1);
>  	spin_lock(&inode->i_lock);
>  	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
>  	spin_unlock(&inode->i_lock);
> -- 
> 2.1.0
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread
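
To make the split/trim cases discussed in this review easier to follow,
here is a small standalone userspace model of the interval-deletion
logic.  It is illustrative only, not the kernel code: the resv_map
locking, the kmalloc retry path and region_add() are all omitted, and
the names are made up.

#include <stdio.h>
#include <stdlib.h>

struct region { long from, to; struct region *next; };

/* Delete [f, t) from a sorted, non-overlapping list; returns pages removed. */
static long region_del_model(struct region **head, long f, long t)
{
	struct region **pp = head, *rg;
	long chg = 0;

	while ((rg = *pp) != NULL) {
		if (rg->to <= f) {			/* entirely before the hole */
			pp = &rg->next;
			continue;
		}
		if (rg->from >= t)			/* entirely after the hole */
			break;

		if (f > rg->from && t < rg->to) {	/* hole inside one region: split */
			struct region *nrg = malloc(sizeof(*nrg));

			if (!nrg)
				return -1;		/* models the -ENOMEM case */
			chg += t - f;
			nrg->from = t;
			nrg->to = rg->to;
			nrg->next = rg->next;
			rg->to = f;
			rg->next = nrg;
			break;
		}
		if (f <= rg->from && t >= rg->to) {	/* region fully covered: remove */
			chg += rg->to - rg->from;
			*pp = rg->next;
			free(rg);
			continue;
		}
		if (f <= rg->from) {			/* trim the beginning */
			chg += t - rg->from;
			rg->from = t;
		} else {				/* trim the end */
			chg += rg->to - f;
			rg->to = f;
		}
		pp = &rg->next;
	}
	return chg;
}

int main(void)
{
	struct region *r1 = malloc(sizeof(*r1)), *r0 = malloc(sizeof(*r0));
	struct region *head = r0;

	r0->from = 0;  r0->to = 10;  r0->next = r1;
	r1->from = 20; r1->to = 30;  r1->next = NULL;

	/* punch [5, 25): trims the end of [0,10) and the beginning of [20,30) */
	printf("removed %ld pages\n", region_del_model(&head, 5, 25));
	return 0;
}

Running it prints "removed 10 pages", the sum of the two trimmed pieces;
only the split case ever needs a new descriptor, which is where the
-ENOMEM discussed above can occur.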

* Re: [RFC v3 PATCH 04/10] mm/hugetlb: expose hugetlb fault mutex for use by fallocate
  2015-05-21 15:47   ` Mike Kravetz
@ 2015-05-22  6:23     ` Naoya Horiguchi
  -1 siblings, 0 replies; 46+ messages in thread
From: Naoya Horiguchi @ 2015-05-22  6:23 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Dave Hansen, David Rientjes,
	Hugh Dickins, Davidlohr Bueso, Aneesh Kumar, Hillf Danton,
	Christoph Hellwig

On Thu, May 21, 2015 at 08:47:38AM -0700, Mike Kravetz wrote:
> hugetlb page faults are currently synchronized by the table of
> mutexes (htlb_fault_mutex_table).  fallocate code will need to
> synchronize with the page fault code when it allocates or
> deletes pages.  Expose interfaces so that fallocate operations
> can be synchronized with page faults.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  include/linux/hugetlb.h |  3 +++
>  mm/hugetlb.c            | 23 ++++++++++++++++++++++-
>  2 files changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index fd337f2..d0d033e 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -82,6 +82,9 @@ void putback_active_hugepage(struct page *page);
>  bool is_hugepage_active(struct page *page);
>  void free_huge_page(struct page *page);
>  void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve);
> +u32 hugetlb_fault_mutex_shared_hash(struct address_space *mapping, pgoff_t idx);
> +void hugetlb_fault_mutex_lock(u32 hash);
> +void hugetlb_fault_mutex_unlock(u32 hash);
>  
>  #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
>  pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 620cc9e..df0d32a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3183,7 +3183,8 @@ static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
>  	unsigned long key[2];
>  	u32 hash;
>  
> -	if (vma->vm_flags & VM_SHARED) {
> +	/* !vma implies this was called from hugetlbfs fallocate code */
> +	if (!vma || vma->vm_flags & VM_SHARED) {
>  		key[0] = (unsigned long) mapping;
>  		key[1] = idx;
>  	} else {
> @@ -3209,6 +3210,26 @@ static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
>  }
>  #endif
>  
> +/*
> + * Interfaces to the fault mutex routines for use by hugetlbfs
> + * fallocate code.  Faults must be synchronized with page adds or
> + * deletes by fallocate.  fallocate only deals with shared mappings.
> + */
> +u32 hugetlb_fault_mutex_shared_hash(struct address_space *mapping, pgoff_t idx)
> +{
> +	return fault_mutex_hash(NULL, NULL, NULL, mapping, idx, 0);
> +}
> +
> +void hugetlb_fault_mutex_lock(u32 hash)
> +{
> +	mutex_lock(&htlb_fault_mutex_table[hash]);
> +}
> +
> +void hugetlb_fault_mutex_unlock(u32 hash)
> +{
> +	mutex_unlock(&htlb_fault_mutex_table[hash]);
> +}
> +
>  int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  			unsigned long address, unsigned int flags)
>  {

You introduce new lock/unlock interfaces, so how about making the existing
user of this lock (i.e. hugetlb_fault()) use them?

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 46+ messages in thread
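
For readers who have not looked at the existing fault path, a rough
userspace analogue of the fault mutex table pattern being exported here
might look like the following.  It is illustrative only: the table
size, hash function and names are made up, and the real kernel also
hashes on mm/vma state for private mappings.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define FAULT_MUTEX_TABLE_SIZE 64	/* assumed; the kernel sizes its table at boot */

/* One mutex per slot; (mapping, index) pairs that hash together serialize. */
static pthread_mutex_t fault_mutex_table[FAULT_MUTEX_TABLE_SIZE];

static uint32_t fault_mutex_shared_hash(const void *mapping, unsigned long idx)
{
	/* toy hash over the (mapping, idx) key */
	return (uint32_t)(((uintptr_t)mapping ^ (idx * 2654435761UL)) %
			  FAULT_MUTEX_TABLE_SIZE);
}

static void fault_mutex_lock(uint32_t hash)
{
	pthread_mutex_lock(&fault_mutex_table[hash]);
}

static void fault_mutex_unlock(uint32_t hash)
{
	pthread_mutex_unlock(&fault_mutex_table[hash]);
}

int main(void)
{
	uint32_t hash;
	int i;

	for (i = 0; i < FAULT_MUTEX_TABLE_SIZE; i++)
		pthread_mutex_init(&fault_mutex_table[i], NULL);

	hash = fault_mutex_shared_hash((void *)0x1234, 42);
	fault_mutex_lock(hash);
	printf("page index 42 is protected by slot %u\n", hash);
	/* ... add or remove the page cache entry here ... */
	fault_mutex_unlock(hash);
	return 0;
}

A caller brackets its page cache insert or delete with the lock/unlock
pair for the hashed slot, which is the pattern hugetlbfs_fallocate()
and remove_inode_hugepages() follow with the exported
hugetlb_fault_mutex_* wrappers.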

* Re: [RFC v3 PATCH 06/10] hugetlbfs: truncate_hugepages() takes a range of pages
  2015-05-21 15:47   ` Mike Kravetz
@ 2015-05-22  8:08     ` Naoya Horiguchi
  -1 siblings, 0 replies; 46+ messages in thread
From: Naoya Horiguchi @ 2015-05-22  8:08 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Dave Hansen, David Rientjes,
	Hugh Dickins, Davidlohr Bueso, Aneesh Kumar, Hillf Danton,
	Christoph Hellwig

On Thu, May 21, 2015 at 08:47:40AM -0700, Mike Kravetz wrote:
> Modify truncate_hugepages() to take a range of pages (start, end)
> instead of simply start. If an end value of -1 is passed, the
> current "truncate" functionality is maintained. Existing callers
> are modified to pass -1 as end of range. By keying off end == -1,
> the routine behaves differently for truncate and hole punch.
> Page removal is now synchronized with page allocation via faults
> by using the fault mutex table. The hole punch case can experience
> the rare region_del error and must handle accordingly.
> 
> Since the routine handles more than just the truncate case, it is
> renamed to remove_inode_hugepages().  To be consistent, the routine
> truncate_huge_page() is renamed remove_huge_page().
> 
> Downstream of remove_inode_hugepages(), the routine
> hugetlb_unreserve_pages() is also modified to take a range of pages.
> hugetlb_unreserve_pages is modified to detect an error from
> region_del and pass it back to the caller.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  fs/hugetlbfs/inode.c    | 88 +++++++++++++++++++++++++++++++++++++++++++------
>  include/linux/hugetlb.h |  3 +-
>  mm/hugetlb.c            | 17 ++++++++--
>  3 files changed, 94 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index dda529c..dfa88a5 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -317,26 +317,53 @@ static int hugetlbfs_write_end(struct file *file, struct address_space *mapping,
>  	return -EINVAL;
>  }
>  
> -static void truncate_huge_page(struct page *page)
> +static void remove_huge_page(struct page *page)
>  {
>  	cancel_dirty_page(page, /* No IO accounting for huge pages? */0);
>  	ClearPageUptodate(page);
>  	delete_from_page_cache(page);
>  }
>  
> -static void truncate_hugepages(struct inode *inode, loff_t lstart)
> +/*
> + * remove_inode_hugepages handles two distinct cases: truncation and hole punch
> + * truncation is indicated by end of range being -1
> + *	In this case, we first scan the range and release found pages.
> + *	After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
> + *	maps and global counts.
> + * hole punch is indicated if end is not -1
> + *	In the hole punch case we scan the range and release found pages.
> + *	Only when releasing a page is the associated region/reserv map
> + *	deleted.  The region/reserv map for ranges without associated
> + *	pages are not modified.

If lend is not -1 but is large enough to go beyond the end of the file,
should it be handled as a truncate operation or as a hole punch
operation?  If it makes no difference or can never happen, it's OK as
long as there is a comment about it.

Thanks,
Naoya Horiguchi

> + */
> +static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
> +				   loff_t lend)
>  {
>  	struct hstate *h = hstate_inode(inode);
>  	struct address_space *mapping = &inode->i_data;
>  	const pgoff_t start = lstart >> huge_page_shift(h);
> +	const pgoff_t end = lend >> huge_page_shift(h);
>  	struct pagevec pvec;
>  	pgoff_t next;
>  	int i, freed = 0;
> +	long lookup_nr = PAGEVEC_SIZE;
> +	bool truncate_op = (lend == -1);
>  
>  	pagevec_init(&pvec, 0);
>  	next = start;
> -	while (1) {
> -		if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> +	while (next < end) {
> +		/*
> +		 * Make sure to never grab more pages than we
> +		 * might possibly need.
> +		 */
> +		if (end - next < lookup_nr)
> +			lookup_nr = end - next;
> +
> +		/*
> +		 * This pagevec_lookup() may return pages past 'end',
> +		 * so we must check for page->index > end.
> +		 */
> +		if (!pagevec_lookup(&pvec, mapping, next, lookup_nr)) {
>  			if (next == start)
>  				break;
>  			next = start;
> @@ -345,26 +372,67 @@ static void truncate_hugepages(struct inode *inode, loff_t lstart)
>  
>  		for (i = 0; i < pagevec_count(&pvec); ++i) {
>  			struct page *page = pvec.pages[i];
> +			u32 hash;
> +
> +			hash = hugetlb_fault_mutex_shared_hash(mapping, next);
> +			hugetlb_fault_mutex_lock(hash);
>  
>  			lock_page(page);
> +			if (page->index >= end) {
> +				unlock_page(page);
> +				hugetlb_fault_mutex_unlock(hash);
> +				next = end;	/* we are done */
> +				break;
> +			}
> +
> +			/*
> +			 * If page is mapped, it was faulted in after being
> +			 * unmapped.  Do nothing in this race case.  In the
> +			 * normal case page is not mapped.
> +			 */
> +			if (!page_mapped(page)) {
> +				bool rsv_on_error = !PagePrivate(page);
> +				/*
> +				 * We must free the huge page and remove
> +				 * from page cache (remove_huge_page) BEFORE
> +				 * removing the region/reserve map
> +				 * (hugetlb_unreserve_pages).  In rare out
> +				 * of memory conditions, removal of the
> +				 * region/reserve map could fail.  Before
> +				 * free'ing the page, note PagePrivate which
> +				 * is used in case of error.
> +				 */
> +				remove_huge_page(page);
> +				freed++;
> +				if (!truncate_op) {
> +					if (unlikely(hugetlb_unreserve_pages(
> +							inode, next,
> +							next + 1, 1)))
> +						hugetlb_fix_reserve_counts(
> +							inode, rsv_on_error);
> +				}
> +			}
> +
>  			if (page->index > next)
>  				next = page->index;
> +
>  			++next;
> -			truncate_huge_page(page);
>  			unlock_page(page);
> -			freed++;
> +
> +			hugetlb_fault_mutex_unlock(hash);
>  		}
>  		huge_pagevec_release(&pvec);
>  	}
> -	BUG_ON(!lstart && mapping->nrpages);
> -	hugetlb_unreserve_pages(inode, start, freed);
> +
> +	if (truncate_op)
> +		(void)hugetlb_unreserve_pages(inode, start, end, freed);
>  }
>  
>  static void hugetlbfs_evict_inode(struct inode *inode)
>  {
>  	struct resv_map *resv_map;
>  
> -	truncate_hugepages(inode, 0);
> +	remove_inode_hugepages(inode, 0, -1);
>  	resv_map = (struct resv_map *)inode->i_mapping->private_data;
>  	/* root inode doesn't have the resv_map, so we should check it */
>  	if (resv_map)
> @@ -421,7 +489,7 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
>  	if (!RB_EMPTY_ROOT(&mapping->i_mmap))
>  		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
>  	i_mmap_unlock_write(mapping);
> -	truncate_hugepages(inode, offset);
> +	remove_inode_hugepages(inode, offset, -1);
>  	return 0;
>  }
>  
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index d0d033e..4c2856e 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -75,7 +75,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  int hugetlb_reserve_pages(struct inode *inode, long from, long to,
>  						struct vm_area_struct *vma,
>  						vm_flags_t vm_flags);
> -void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
> +long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
> +						long freed);
>  int dequeue_hwpoisoned_huge_page(struct page *page);
>  bool isolate_huge_page(struct page *page, struct list_head *list);
>  void putback_active_hugepage(struct page *page);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index df0d32a..0cf0622 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3628,21 +3628,32 @@ out_err:
>  	return ret;
>  }
>  
> -void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
> +long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
> +								long freed)
>  {
>  	struct hstate *h = hstate_inode(inode);
>  	struct resv_map *resv_map = inode_resv_map(inode);
>  	long chg = 0;
>  	struct hugepage_subpool *spool = subpool_inode(inode);
>  
> -	if (resv_map)
> -		chg = region_del(resv_map, offset, -1);
> +	if (resv_map) {
> +		chg = region_del(resv_map, start, end);
> +		/*
> +		 * region_del() can fail in the rare case where a region
> +		 * must be split and another region descriptor can not be
> +		 * allocated.  If end == -1, it will not fail.
> +		 */
> +		if (chg < 0)
> +			return chg;
> +	}
>  	spin_lock(&inode->i_lock);
>  	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
>  	spin_unlock(&inode->i_lock);
>  
>  	hugepage_subpool_put_pages(spool, (chg - freed));
>  	hugetlb_acct_memory(h, -(chg - freed));
> +
> +	return 0;
>  }
>  
>  #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
> -- 
> 2.1.0
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v3 PATCH 06/10] hugetlbfs: truncate_hugepages() takes a range of pages
@ 2015-05-22  8:08     ` Naoya Horiguchi
  0 siblings, 0 replies; 46+ messages in thread
From: Naoya Horiguchi @ 2015-05-22  8:08 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Dave Hansen, David Rientjes,
	Hugh Dickins, Davidlohr Bueso, Aneesh Kumar, Hillf Danton,
	Christoph Hellwig

On Thu, May 21, 2015 at 08:47:40AM -0700, Mike Kravetz wrote:
> Modify truncate_hugepages() to take a range of pages (start, end)
> instead of simply start. If an end value of -1 is passed, the
> current "truncate" functionality is maintained. Existing callers
> are modified to pass -1 as end of range. By keying off end == -1,
> the routine behaves differently for truncate and hole punch.
> Page removal is now synchronized with page allocation via faults
> by using the fault mutex table. The hole punch case can experience
> the rare region_del error and must handle accordingly.
> 
> Since the routine handles more than just the truncate case, it is
> renamed to remove_inode_hugepages().  To be consistent, the routine
> truncate_huge_page() is renamed remove_huge_page().
> 
> Downstream of remove_inode_hugepages(), the routine
> hugetlb_unreserve_pages() is also modified to take a range of pages.
> hugetlb_unreserve_pages is modified to detect an error from
> region_del and pass it back to the caller.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  fs/hugetlbfs/inode.c    | 88 +++++++++++++++++++++++++++++++++++++++++++------
>  include/linux/hugetlb.h |  3 +-
>  mm/hugetlb.c            | 17 ++++++++--
>  3 files changed, 94 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index dda529c..dfa88a5 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -317,26 +317,53 @@ static int hugetlbfs_write_end(struct file *file, struct address_space *mapping,
>  	return -EINVAL;
>  }
>  
> -static void truncate_huge_page(struct page *page)
> +static void remove_huge_page(struct page *page)
>  {
>  	cancel_dirty_page(page, /* No IO accounting for huge pages? */0);
>  	ClearPageUptodate(page);
>  	delete_from_page_cache(page);
>  }
>  
> -static void truncate_hugepages(struct inode *inode, loff_t lstart)
> +/*
> + * remove_inode_hugepages handles two distinct cases: truncation and hole punch
> + * truncation is indicated by end of range being -1
> + *	In this case, we first scan the range and release found pages.
> + *	After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
> + *	maps and global counts.
> + * hole punch is indicated if end is not -1
> + *	In the hole punch case we scan the range and release found pages.
> + *	Only when releasing a page is the associated region/reserv map
> + *	deleted.  Region/reserv map entries for ranges without associated
> + *	pages are not modified.

If lend is not -1 but large enough to go beyond the end of file, which
should it be handled by truncate operation or hole punch operation?
If it makes no difference or never happens, it's OK with some comments.

Thanks,
Naoya Horiguchi

> + */
> +static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
> +				   loff_t lend)
>  {
>  	struct hstate *h = hstate_inode(inode);
>  	struct address_space *mapping = &inode->i_data;
>  	const pgoff_t start = lstart >> huge_page_shift(h);
> +	const pgoff_t end = lend >> huge_page_shift(h);
>  	struct pagevec pvec;
>  	pgoff_t next;
>  	int i, freed = 0;
> +	long lookup_nr = PAGEVEC_SIZE;
> +	bool truncate_op = (lend == -1);
>  
>  	pagevec_init(&pvec, 0);
>  	next = start;
> -	while (1) {
> -		if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> +	while (next < end) {
> +		/*
> +		 * Make sure to never grab more pages than we
> +		 * might possibly need.
> +		 */
> +		if (end - next < lookup_nr)
> +			lookup_nr = end - next;
> +
> +		/*
> +		 * This pagevec_lookup() may return pages past 'end',
> +		 * so we must check for page->index > end.
> +		 */
> +		if (!pagevec_lookup(&pvec, mapping, next, lookup_nr)) {
>  			if (next == start)
>  				break;
>  			next = start;
> @@ -345,26 +372,67 @@ static void truncate_hugepages(struct inode *inode, loff_t lstart)
>  
>  		for (i = 0; i < pagevec_count(&pvec); ++i) {
>  			struct page *page = pvec.pages[i];
> +			u32 hash;
> +
> +			hash = hugetlb_fault_mutex_shared_hash(mapping, next);
> +			hugetlb_fault_mutex_lock(hash);
>  
>  			lock_page(page);
> +			if (page->index >= end) {
> +				unlock_page(page);
> +				hugetlb_fault_mutex_unlock(hash);
> +				next = end;	/* we are done */
> +				break;
> +			}
> +
> +			/*
> +			 * If page is mapped, it was faulted in after being
> +			 * unmapped.  Do nothing in this race case.  In the
> +			 * normal case page is not mapped.
> +			 */
> +			if (!page_mapped(page)) {
> +				bool rsv_on_error = !PagePrivate(page);
> +				/*
> +				 * We must free the huge page and remove
> +				 * from page cache (remove_huge_page) BEFORE
> +				 * removing the region/reserve map
> +				 * (hugetlb_unreserve_pages).  In rare out
> +				 * of memory conditions, removal of the
> +				 * region/reserve map could fail.  Before
> +				 * free'ing the page, note PagePrivate which
> +				 * is used in case of error.
> +				 */
> +				remove_huge_page(page);
> +				freed++;
> +				if (!truncate_op) {
> +					if (unlikely(hugetlb_unreserve_pages(
> +							inode, next,
> +							next + 1, 1)))
> +						hugetlb_fix_reserve_counts(
> +							inode, rsv_on_error);
> +				}
> +			}
> +
>  			if (page->index > next)
>  				next = page->index;
> +
>  			++next;
> -			truncate_huge_page(page);
>  			unlock_page(page);
> -			freed++;
> +
> +			hugetlb_fault_mutex_unlock(hash);
>  		}
>  		huge_pagevec_release(&pvec);
>  	}
> -	BUG_ON(!lstart && mapping->nrpages);
> -	hugetlb_unreserve_pages(inode, start, freed);
> +
> +	if (truncate_op)
> +		(void)hugetlb_unreserve_pages(inode, start, end, freed);
>  }
>  
>  static void hugetlbfs_evict_inode(struct inode *inode)
>  {
>  	struct resv_map *resv_map;
>  
> -	truncate_hugepages(inode, 0);
> +	remove_inode_hugepages(inode, 0, -1);
>  	resv_map = (struct resv_map *)inode->i_mapping->private_data;
>  	/* root inode doesn't have the resv_map, so we should check it */
>  	if (resv_map)
> @@ -421,7 +489,7 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
>  	if (!RB_EMPTY_ROOT(&mapping->i_mmap))
>  		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
>  	i_mmap_unlock_write(mapping);
> -	truncate_hugepages(inode, offset);
> +	remove_inode_hugepages(inode, offset, -1);
>  	return 0;
>  }
>  
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index d0d033e..4c2856e 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -75,7 +75,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  int hugetlb_reserve_pages(struct inode *inode, long from, long to,
>  						struct vm_area_struct *vma,
>  						vm_flags_t vm_flags);
> -void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
> +long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
> +						long freed);
>  int dequeue_hwpoisoned_huge_page(struct page *page);
>  bool isolate_huge_page(struct page *page, struct list_head *list);
>  void putback_active_hugepage(struct page *page);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index df0d32a..0cf0622 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3628,21 +3628,32 @@ out_err:
>  	return ret;
>  }
>  
> -void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
> +long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
> +								long freed)
>  {
>  	struct hstate *h = hstate_inode(inode);
>  	struct resv_map *resv_map = inode_resv_map(inode);
>  	long chg = 0;
>  	struct hugepage_subpool *spool = subpool_inode(inode);
>  
> -	if (resv_map)
> -		chg = region_del(resv_map, offset, -1);
> +	if (resv_map) {
> +		chg = region_del(resv_map, start, end);
> +		/*
> +		 * region_del() can fail in the rare case where a region
> +		 * must be split and another region descriptor can not be
> +		 * allocated.  If end == -1, it will not fail.
> +		 */
> +		if (chg < 0)
> +			return chg;
> +	}
>  	spin_lock(&inode->i_lock);
>  	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
>  	spin_unlock(&inode->i_lock);
>  
>  	hugepage_subpool_put_pages(spool, (chg - freed));
>  	hugetlb_acct_memory(h, -(chg - freed));
> +
> +	return 0;
>  }
>  
>  #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
> -- 
> 2.1.0
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v3 PATCH 03/10] mm/hugetlb: add region_del() to delete a specific range of entries
  2015-05-22  6:21     ` Naoya Horiguchi
@ 2015-05-22 16:48       ` Mike Kravetz
  -1 siblings, 0 replies; 46+ messages in thread
From: Mike Kravetz @ 2015-05-22 16:48 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, linux-kernel, Dave Hansen, David Rientjes,
	Hugh Dickins, Davidlohr Bueso, Aneesh Kumar, Hillf Danton,
	Christoph Hellwig

On 05/21/2015 11:21 PM, Naoya Horiguchi wrote:
> On Thu, May 21, 2015 at 08:47:37AM -0700, Mike Kravetz wrote:
>> fallocate hole punch will want to remove a specific range of pages.
>> The existing region_truncate() routine deletes all region/reserve
>> map entries after a specified offset.  region_del() will provide
>> this same functionality if the end of region is specified as -1.
>> Hence, region_del() can replace region_truncate().
>>
>> Unlike region_truncate(), region_del() can return an error in the
>> rare case where it can not allocate memory for a region descriptor.
>> This ONLY happens in the case where an existing region must be split.
>> Current callers passing -1 as end of range will never experience
>> this error and do not need to deal with error handling.  Future
>> callers of region_del() (such as fallocate hole punch) will need to
>> handle this error.  A routine hugetlb_fix_reserve_counts() is added
>> to assist in cleaning up if fallocate hole punch experiences this
>> type of error in region_del().
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>   include/linux/hugetlb.h |  1 +
>>   mm/hugetlb.c            | 99 ++++++++++++++++++++++++++++++++++++++-----------
>>   2 files changed, 79 insertions(+), 21 deletions(-)
>>
>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>> index 7b57850..fd337f2 100644
>> --- a/include/linux/hugetlb.h
>> +++ b/include/linux/hugetlb.h
>> @@ -81,6 +81,7 @@ bool isolate_huge_page(struct page *page, struct list_head *list);
>>   void putback_active_hugepage(struct page *page);
>>   bool is_hugepage_active(struct page *page);
>>   void free_huge_page(struct page *page);
>> +void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve);
> 
> This function is used in patch 6/10 for the first time,
> so is it better to move the definition to that patch?
> (this temporarily introduces "defined but not used" warning...)

Yes, I do think it would be better to move it to patch 6.  The existing
callers/users of region_del() will never encounter an error return value.
As you mention, it is only the new use case in patch 6/10 that needs
to deal with the error.  So, it makes sense to move it there.

>>   #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
>>   pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 63f6d43..620cc9e 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -261,38 +261,74 @@ out_nrg:
>>   	return chg;
>>   }
>>   
>> -static long region_truncate(struct resv_map *resv, long end)
>> +static long region_del(struct resv_map *resv, long f, long t)
>>   {
>>   	struct list_head *head = &resv->regions;
>>   	struct file_region *rg, *trg;
>> +	struct file_region *nrg = NULL;
>>   	long chg = 0;
>>   
>> +	/*
>> +	 * Locate segments we overlap and either split, remove or
>> +	 * trim the existing regions.  The end of region (t) == -1
>> +	 * indicates all remaining regions.  Special case t == -1 as
>> +	 * all comparisons are signed.  Also, when t == -1 it is not
>> +	 * possible to return an error (-ENOMEM) as this only happens
>> +	 * when splitting a region.  Callers take advantage of this
>> +	 * when calling with -1.
>> +	 */
>> +	if (t == -1)
>> +		t = LONG_MAX;
>> +retry:
>>   	spin_lock(&resv->lock);
>> -	/* Locate the region we are either in or before. */
>> -	list_for_each_entry(rg, head, link)
>> -		if (end <= rg->to)
>> +	list_for_each_entry_safe(rg, trg, head, link) {
>> +		if (rg->to <= f)
>> +			continue;
>> +		if (rg->from >= t)
>>   			break;
>> -	if (&rg->link == head)
>> -		goto out;
>>   
>> -	/* If we are in the middle of a region then adjust it. */
>> -	if (end > rg->from) {
>> -		chg = rg->to - end;
>> -		rg->to = end;
>> -		rg = list_entry(rg->link.next, typeof(*rg), link);
>> -	}
>> +		if (f > rg->from && t < rg->to) { /* must split region */
>> +			if (!nrg) {
>> +				spin_unlock(&resv->lock);
>> +				nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
>> +				if (!nrg)
>> +					return -ENOMEM;
>> +				goto retry;
>> +			}
>>   
>> -	/* Drop any remaining regions. */
>> -	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
>> -		if (&rg->link == head)
>> +			chg += t - f;
>> +
>> +			/* new entry for end of split region */
>> +			nrg->from = t;
>> +			nrg->to = rg->to;
>> +			INIT_LIST_HEAD(&nrg->link);
>> +
>> +			/* original entry is trimmed */
>> +			rg->to = f;
>> +
>> +			list_add(&nrg->link, &rg->link);
>> +			nrg = NULL;
>>   			break;
>> -		chg += rg->to - rg->from;
>> -		list_del(&rg->link);
>> -		kfree(rg);
>> +		}
>> +
>> +		if (f <= rg->from && t >= rg->to) { /* remove entire region */
>> +			chg += rg->to - rg->from;
>> +			list_del(&rg->link);
>> +			kfree(rg);
>> +			continue;
>> +		}
>> +
>> +		if (f <= rg->from) {	/* trim beginning of region */
>> +			chg += t - rg->from;
>> +			rg->from = t;
>> +		} else {		/* trim end of region */
>> +			chg += rg->to - f;
>> +			rg->to = f;
> 
> Is it better to put "break" here?

Yes, I think a break would be appropriate.  At this point we know the
range to be deleted will not intersect any other regions in the map.
So, the break is appropriate as we do not need to examine the remaining
regions.

I'll add these changes as well as more documentation for region_del in
the next version of this patch set.

-- 
Mike Kravetz

> 
> Thanks,
> Naoya Horiguchi
> 
>> +		}
>>   	}
>>   
>> -out:
>>   	spin_unlock(&resv->lock);
>> +	kfree(nrg);
>>   	return chg;
>>   }
>>   
>> @@ -324,6 +360,27 @@ static long region_count(struct resv_map *resv, long f, long t)
>>   }
>>   
>>   /*
>> + * A rare out of memory error was encountered which prevented removal of
>> + * the reserve map region for a page.  The huge page itself was freed
>> + * and removed from the page cache.  This routine will adjust the global
>> + * reserve count if needed, and the subpool usage count.  By incrementing
>> + * these counts, the reserve map entry which could not be deleted will
>> + * appear as a "reserved" entry instead of simply dangling with incorrect
>> + * counts.
>> + */
>> +void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve)
>> +{
>> +	struct hugepage_subpool *spool = subpool_inode(inode);
>> +
>> +	if (restore_reserve) {
>> +		struct hstate *h = hstate_inode(inode);
>> +
>> +		h->resv_huge_pages++;
>> +	}
>> +	hugepage_subpool_get_pages(spool, 1);
>> +}
>> +
>> +/*
>>    * Convert the address within this vma to the page offset within
>>    * the mapping, in pagecache page units; huge pages here.
>>    */
>> @@ -427,7 +484,7 @@ void resv_map_release(struct kref *ref)
>>   	struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
>>   
>>   	/* Clear out any active regions before we release the map. */
>> -	region_truncate(resv_map, 0);
>> +	region_del(resv_map, 0, -1);
>>   	kfree(resv_map);
>>   }
>>   
>> @@ -3558,7 +3615,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
>>   	struct hugepage_subpool *spool = subpool_inode(inode);
>>   
>>   	if (resv_map)
>> -		chg = region_truncate(resv_map, offset);
>> +		chg = region_del(resv_map, offset, -1);
>>   	spin_lock(&inode->i_lock);
>>   	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
>>   	spin_unlock(&inode->i_lock);
>> -- 
>> 2.1.0

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v3 PATCH 03/10] mm/hugetlb: add region_del() to delete a specific range of entries
@ 2015-05-22 16:48       ` Mike Kravetz
  0 siblings, 0 replies; 46+ messages in thread
From: Mike Kravetz @ 2015-05-22 16:48 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, linux-kernel, Dave Hansen, David Rientjes,
	Hugh Dickins, Davidlohr Bueso, Aneesh Kumar, Hillf Danton,
	Christoph Hellwig

On 05/21/2015 11:21 PM, Naoya Horiguchi wrote:
> On Thu, May 21, 2015 at 08:47:37AM -0700, Mike Kravetz wrote:
>> fallocate hole punch will want to remove a specific range of pages.
>> The existing region_truncate() routine deletes all region/reserve
>> map entries after a specified offset.  region_del() will provide
>> this same functionality if the end of region is specified as -1.
>> Hence, region_del() can replace region_truncate().
>>
>> Unlike region_truncate(), region_del() can return an error in the
>> rare case where it can not allocate memory for a region descriptor.
>> This ONLY happens in the case where an existing region must be split.
>> Current callers passing -1 as end of range will never experience
>> this error and do not need to deal with error handling.  Future
>> callers of region_del() (such as fallocate hole punch) will need to
>> handle this error.  A routine hugetlb_fix_reserve_counts() is added
>> to assist in cleaning up if fallocate hole punch experiences this
>> type of error in region_del().
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>   include/linux/hugetlb.h |  1 +
>>   mm/hugetlb.c            | 99 ++++++++++++++++++++++++++++++++++++++-----------
>>   2 files changed, 79 insertions(+), 21 deletions(-)
>>
>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>> index 7b57850..fd337f2 100644
>> --- a/include/linux/hugetlb.h
>> +++ b/include/linux/hugetlb.h
>> @@ -81,6 +81,7 @@ bool isolate_huge_page(struct page *page, struct list_head *list);
>>   void putback_active_hugepage(struct page *page);
>>   bool is_hugepage_active(struct page *page);
>>   void free_huge_page(struct page *page);
>> +void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve);
> 
> This function is used in patch 6/10 for the first time,
> so is it better to move the definition to that patch?
> (this temporarily introduces "defined but not used" warning...)

Yes, I do think it would be better to move it to patch 6.  The existing
callers/users of region_del() will never encounter an error return value.
As you mention, it is only the new use case in patch 6/10 that needs
to deal with the error.  So, it makes sense to move it there.

>>   #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
>>   pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 63f6d43..620cc9e 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -261,38 +261,74 @@ out_nrg:
>>   	return chg;
>>   }
>>   
>> -static long region_truncate(struct resv_map *resv, long end)
>> +static long region_del(struct resv_map *resv, long f, long t)
>>   {
>>   	struct list_head *head = &resv->regions;
>>   	struct file_region *rg, *trg;
>> +	struct file_region *nrg = NULL;
>>   	long chg = 0;
>>   
>> +	/*
>> +	 * Locate segments we overlap and either split, remove or
>> +	 * trim the existing regions.  The end of region (t) == -1
>> +	 * indicates all remaining regions.  Special case t == -1 as
>> +	 * all comparisons are signed.  Also, when t == -1 it is not
>> +	 * possible to return an error (-ENOMEM) as this only happens
>> +	 * when splitting a region.  Callers take advantage of this
>> +	 * when calling with -1.
>> +	 */
>> +	if (t == -1)
>> +		t = LONG_MAX;
>> +retry:
>>   	spin_lock(&resv->lock);
>> -	/* Locate the region we are either in or before. */
>> -	list_for_each_entry(rg, head, link)
>> -		if (end <= rg->to)
>> +	list_for_each_entry_safe(rg, trg, head, link) {
>> +		if (rg->to <= f)
>> +			continue;
>> +		if (rg->from >= t)
>>   			break;
>> -	if (&rg->link == head)
>> -		goto out;
>>   
>> -	/* If we are in the middle of a region then adjust it. */
>> -	if (end > rg->from) {
>> -		chg = rg->to - end;
>> -		rg->to = end;
>> -		rg = list_entry(rg->link.next, typeof(*rg), link);
>> -	}
>> +		if (f > rg->from && t < rg->to) { /* must split region */
>> +			if (!nrg) {
>> +				spin_unlock(&resv->lock);
>> +				nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
>> +				if (!nrg)
>> +					return -ENOMEM;
>> +				goto retry;
>> +			}
>>   
>> -	/* Drop any remaining regions. */
>> -	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
>> -		if (&rg->link == head)
>> +			chg += t - f;
>> +
>> +			/* new entry for end of split region */
>> +			nrg->from = t;
>> +			nrg->to = rg->to;
>> +			INIT_LIST_HEAD(&nrg->link);
>> +
>> +			/* original entry is trimmed */
>> +			rg->to = f;
>> +
>> +			list_add(&nrg->link, &rg->link);
>> +			nrg = NULL;
>>   			break;
>> -		chg += rg->to - rg->from;
>> -		list_del(&rg->link);
>> -		kfree(rg);
>> +		}
>> +
>> +		if (f <= rg->from && t >= rg->to) { /* remove entire region */
>> +			chg += rg->to - rg->from;
>> +			list_del(&rg->link);
>> +			kfree(rg);
>> +			continue;
>> +		}
>> +
>> +		if (f <= rg->from) {	/* trim beginning of region */
>> +			chg += t - rg->from;
>> +			rg->from = t;
>> +		} else {		/* trim end of region */
>> +			chg += rg->to - f;
>> +			rg->to = f;
> 
> Is it better to put "break" here?

Yes, I think a break would be appropriate.  At this point we know the
range to be deleted will not intersect any other regions in the map.
So, the break is appropriate as we do not need to examine the remaining
regions.

I'll add these changes as well as more documentation for region_del in
the next version of this patch set.

-- 
Mike Kravetz

> 
> Thanks,
> Naoya Horiguchi
> 
>> +		}
>>   	}
>>   
>> -out:
>>   	spin_unlock(&resv->lock);
>> +	kfree(nrg);
>>   	return chg;
>>   }
>>   
>> @@ -324,6 +360,27 @@ static long region_count(struct resv_map *resv, long f, long t)
>>   }
>>   
>>   /*
>> + * A rare out of memory error was encountered which prevented removal of
>> + * the reserve map region for a page.  The huge page itself was freed
>> + * and removed from the page cache.  This routine will adjust the global
>> + * reserve count if needed, and the subpool usage count.  By incrementing
>> + * these counts, the reserve map entry which could not be deleted will
>> + * appear as a "reserved" entry instead of simply dangling with incorrect
>> + * counts.
>> + */
>> +void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve)
>> +{
>> +	struct hugepage_subpool *spool = subpool_inode(inode);
>> +
>> +	if (restore_reserve) {
>> +		struct hstate *h = hstate_inode(inode);
>> +
>> +		h->resv_huge_pages++;
>> +	}
>> +	hugepage_subpool_get_pages(spool, 1);
>> +}
>> +
>> +/*
>>    * Convert the address within this vma to the page offset within
>>    * the mapping, in pagecache page units; huge pages here.
>>    */
>> @@ -427,7 +484,7 @@ void resv_map_release(struct kref *ref)
>>   	struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
>>   
>>   	/* Clear out any active regions before we release the map. */
>> -	region_truncate(resv_map, 0);
>> +	region_del(resv_map, 0, -1);
>>   	kfree(resv_map);
>>   }
>>   
>> @@ -3558,7 +3615,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
>>   	struct hugepage_subpool *spool = subpool_inode(inode);
>>   
>>   	if (resv_map)
>> -		chg = region_truncate(resv_map, offset);
>> +		chg = region_del(resv_map, offset, -1);
>>   	spin_lock(&inode->i_lock);
>>   	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
>>   	spin_unlock(&inode->i_lock);
>> -- 
>> 2.1.0


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v3 PATCH 04/10] mm/hugetlb: expose hugetlb fault mutex for use by fallocate
  2015-05-22  6:23     ` Naoya Horiguchi
@ 2015-05-22 16:50       ` Mike Kravetz
  -1 siblings, 0 replies; 46+ messages in thread
From: Mike Kravetz @ 2015-05-22 16:50 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, linux-kernel, Dave Hansen, David Rientjes,
	Hugh Dickins, Davidlohr Bueso, Aneesh Kumar, Hillf Danton,
	Christoph Hellwig

On 05/21/2015 11:23 PM, Naoya Horiguchi wrote:
> On Thu, May 21, 2015 at 08:47:38AM -0700, Mike Kravetz wrote:
>> hugetlb page faults are currently synchronized by the table of
>> mutexes (htlb_fault_mutex_table).  fallocate code will need to
>> synchronize with the page fault code when it allocates or
>> deletes pages.  Expose interfaces so that fallocate operations
>> can be synchronized with page faults.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>   include/linux/hugetlb.h |  3 +++
>>   mm/hugetlb.c            | 23 ++++++++++++++++++++++-
>>   2 files changed, 25 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>> index fd337f2..d0d033e 100644
>> --- a/include/linux/hugetlb.h
>> +++ b/include/linux/hugetlb.h
>> @@ -82,6 +82,9 @@ void putback_active_hugepage(struct page *page);
>>   bool is_hugepage_active(struct page *page);
>>   void free_huge_page(struct page *page);
>>   void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve);
>> +u32 hugetlb_fault_mutex_shared_hash(struct address_space *mapping, pgoff_t idx);
>> +void hugetlb_fault_mutex_lock(u32 hash);
>> +void hugetlb_fault_mutex_unlock(u32 hash);
>>   
>>   #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
>>   pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 620cc9e..df0d32a 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3183,7 +3183,8 @@ static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
>>   	unsigned long key[2];
>>   	u32 hash;
>>   
>> -	if (vma->vm_flags & VM_SHARED) {
>> +	/* !vma implies this was called from hugetlbfs fallocate code */
>> +	if (!vma || vma->vm_flags & VM_SHARED) {
>>   		key[0] = (unsigned long) mapping;
>>   		key[1] = idx;
>>   	} else {
>> @@ -3209,6 +3210,26 @@ static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
>>   }
>>   #endif
>>   
>> +/*
>> + * Interfaces to the fault mutex routines for use by hugetlbfs
>> + * fallocate code.  Faults must be synchronized with page adds or
>> + * deletes by fallocate.  fallocate only deals with shared mappings.
>> + */
>> +u32 hugetlb_fault_mutex_shared_hash(struct address_space *mapping, pgoff_t idx)
>> +{
>> +	return fault_mutex_hash(NULL, NULL, NULL, mapping, idx, 0);
>> +}
>> +
>> +void hugetlb_fault_mutex_lock(u32 hash)
>> +{
>> +	mutex_lock(&htlb_fault_mutex_table[hash]);
>> +}
>> +
>> +void hugetlb_fault_mutex_unlock(u32 hash)
>> +{
>> +	mutex_unlock(&htlb_fault_mutex_table[hash]);
>> +}
>> +
>>   int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>   			unsigned long address, unsigned int flags)
>>   {
> 
> You introduce new lock/unlock interfaces, so how about making the existing
> user of this lock (i.e. hugetlb_fault()) use them?
> 

I will make that change in the next version of the patch set.
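
For reference, a minimal sketch of what that conversion in hugetlb_fault()
could look like (illustration only, not the posted code; the hash still has
to come from fault_mutex_hash() since faults must also cover private
mappings -- only the lock/unlock calls go through the new wrappers):

	hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
	hugetlb_fault_mutex_lock(hash);

	/* ... existing fault handling, unchanged ... */

out_mutex:
	hugetlb_fault_mutex_unlock(hash);
	return ret;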

Thanks,
-- 
Mike Kravetz


> Thanks,
> Naoya Horiguchi
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v3 PATCH 04/10] mm/hugetlb: expose hugetlb fault mutex for use by fallocate
@ 2015-05-22 16:50       ` Mike Kravetz
  0 siblings, 0 replies; 46+ messages in thread
From: Mike Kravetz @ 2015-05-22 16:50 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, linux-kernel, Dave Hansen, David Rientjes,
	Hugh Dickins, Davidlohr Bueso, Aneesh Kumar, Hillf Danton,
	Christoph Hellwig

On 05/21/2015 11:23 PM, Naoya Horiguchi wrote:
> On Thu, May 21, 2015 at 08:47:38AM -0700, Mike Kravetz wrote:
>> hugetlb page faults are currently synchronized by the table of
>> mutexes (htlb_fault_mutex_table).  fallocate code will need to
>> synchronize with the page fault code when it allocates or
>> deletes pages.  Expose interfaces so that fallocate operations
>> can be synchronized with page faults.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>   include/linux/hugetlb.h |  3 +++
>>   mm/hugetlb.c            | 23 ++++++++++++++++++++++-
>>   2 files changed, 25 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>> index fd337f2..d0d033e 100644
>> --- a/include/linux/hugetlb.h
>> +++ b/include/linux/hugetlb.h
>> @@ -82,6 +82,9 @@ void putback_active_hugepage(struct page *page);
>>   bool is_hugepage_active(struct page *page);
>>   void free_huge_page(struct page *page);
>>   void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve);
>> +u32 hugetlb_fault_mutex_shared_hash(struct address_space *mapping, pgoff_t idx);
>> +void hugetlb_fault_mutex_lock(u32 hash);
>> +void hugetlb_fault_mutex_unlock(u32 hash);
>>   
>>   #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
>>   pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 620cc9e..df0d32a 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3183,7 +3183,8 @@ static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
>>   	unsigned long key[2];
>>   	u32 hash;
>>   
>> -	if (vma->vm_flags & VM_SHARED) {
>> +	/* !vma implies this was called from hugetlbfs fallocate code */
>> +	if (!vma || vma->vm_flags & VM_SHARED) {
>>   		key[0] = (unsigned long) mapping;
>>   		key[1] = idx;
>>   	} else {
>> @@ -3209,6 +3210,26 @@ static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
>>   }
>>   #endif
>>   
>> +/*
>> + * Interfaces to the fault mutex routines for use by hugetlbfs
>> + * fallocate code.  Faults must be synchronized with page adds or
>> + * deletes by fallocate.  fallocate only deals with shared mappings.
>> + */
>> +u32 hugetlb_fault_mutex_shared_hash(struct address_space *mapping, pgoff_t idx)
>> +{
>> +	return fault_mutex_hash(NULL, NULL, NULL, mapping, idx, 0);
>> +}
>> +
>> +void hugetlb_fault_mutex_lock(u32 hash)
>> +{
>> +	mutex_lock(&htlb_fault_mutex_table[hash]);
>> +}
>> +
>> +void hugetlb_fault_mutex_unlock(u32 hash)
>> +{
>> +	mutex_unlock(&htlb_fault_mutex_table[hash]);
>> +}
>> +
>>   int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>   			unsigned long address, unsigned int flags)
>>   {
> 
> You introduce new lock/unlock interfaces, so how about making the existing
> user of this lock (i.e. hugetlb_fault()) use them?
> 

I will make that change in the next version of the patch set.

Thanks,
-- 
Mike Kravetz


> Thanks,
> Naoya Horiguchi
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v3 PATCH 04/10] mm/hugetlb: expose hugetlb fault mutex for use by fallocate
  2015-05-21 15:47   ` Mike Kravetz
@ 2015-05-22 17:01     ` Davidlohr Bueso
  -1 siblings, 0 replies; 46+ messages in thread
From: Davidlohr Bueso @ 2015-05-22 17:01 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Dave Hansen, Naoya Horiguchi,
	David Rientjes, Hugh Dickins, Aneesh Kumar, Hillf Danton,
	Christoph Hellwig

On Thu, 2015-05-21 at 08:47 -0700, Mike Kravetz wrote:
> +/*
> + * Interfaces to the fault mutex routines for use by hugetlbfs
> + * fallocate code.  Faults must be synchronized with page adds or
> + * deletes by fallocate.  fallocate only deals with shared mappings.
> + */
> +u32 hugetlb_fault_mutex_shared_hash(struct address_space *mapping, pgoff_t idx)
> +{
> +	return fault_mutex_hash(NULL, NULL, NULL, mapping, idx, 0);
> +}
> +
> +void hugetlb_fault_mutex_lock(u32 hash)
> +{
> +	mutex_lock(&htlb_fault_mutex_table[hash]);
> +}
> +
> +void hugetlb_fault_mutex_unlock(u32 hash)
> +{
> +	mutex_unlock(&htlb_fault_mutex_table[hash]);
> +}

These should really be inlined -- maybe add them to hugetlb.h along with
the mutex hashtable bits.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v3 PATCH 04/10] mm/hugetlb: expose hugetlb fault mutex for use by fallocate
@ 2015-05-22 17:01     ` Davidlohr Bueso
  0 siblings, 0 replies; 46+ messages in thread
From: Davidlohr Bueso @ 2015-05-22 17:01 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Dave Hansen, Naoya Horiguchi,
	David Rientjes, Hugh Dickins, Aneesh Kumar, Hillf Danton,
	Christoph Hellwig

On Thu, 2015-05-21 at 08:47 -0700, Mike Kravetz wrote:
> +/*
> + * Interfaces to the fault mutex routines for use by hugetlbfs
> + * fallocate code.  Faults must be synchronized with page adds or
> + * deletes by fallocate.  fallocate only deals with shared mappings.
> + */
> +u32 hugetlb_fault_mutex_shared_hash(struct address_space *mapping, pgoff_t idx)
> +{
> +	return fault_mutex_hash(NULL, NULL, NULL, mapping, idx, 0);
> +}
> +
> +void hugetlb_fault_mutex_lock(u32 hash)
> +{
> +	mutex_lock(&htlb_fault_mutex_table[hash]);
> +}
> +
> +void hugetlb_fault_mutex_unlock(u32 hash)
> +{
> +	mutex_unlock(&htlb_fault_mutex_table[hash]);
> +}

These should really be inlined -- maybe add them to hugetlb.h along with
the mutex hashtable bits.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v3 PATCH 06/10] hugetlbfs: truncate_hugepages() takes a range of pages
  2015-05-22  8:08     ` Naoya Horiguchi
@ 2015-05-22 17:07       ` Mike Kravetz
  -1 siblings, 0 replies; 46+ messages in thread
From: Mike Kravetz @ 2015-05-22 17:07 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, linux-kernel, Dave Hansen, David Rientjes,
	Hugh Dickins, Davidlohr Bueso, Aneesh Kumar, Hillf Danton,
	Christoph Hellwig

On 05/22/2015 01:08 AM, Naoya Horiguchi wrote:
> On Thu, May 21, 2015 at 08:47:40AM -0700, Mike Kravetz wrote:
>> Modify truncate_hugepages() to take a range of pages (start, end)
>> instead of simply start. If an end value of -1 is passed, the
>> current "truncate" functionality is maintained. Existing callers
>> are modified to pass -1 as end of range. By keying off end == -1,
>> the routine behaves differently for truncate and hole punch.
>> Page removal is now synchronized with page allocation via faults
>> by using the fault mutex table. The hole punch case can experience
>> the rare region_del error and must handle accordingly.
>>
>> Since the routine handles more than just the truncate case, it is
>> renamed to remove_inode_hugepages().  To be consistent, the routine
>> truncate_huge_page() is renamed remove_huge_page().
>>
>> Downstream of remove_inode_hugepages(), the routine
>> hugetlb_unreserve_pages() is also modified to take a range of pages.
>> hugetlb_unreserve_pages is modified to detect an error from
>> region_del and pass it back to the caller.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>   fs/hugetlbfs/inode.c    | 88 +++++++++++++++++++++++++++++++++++++++++++------
>>   include/linux/hugetlb.h |  3 +-
>>   mm/hugetlb.c            | 17 ++++++++--
>>   3 files changed, 94 insertions(+), 14 deletions(-)
>>
>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>> index dda529c..dfa88a5 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -317,26 +317,53 @@ static int hugetlbfs_write_end(struct file *file, struct address_space *mapping,
>>   	return -EINVAL;
>>   }
>>   
>> -static void truncate_huge_page(struct page *page)
>> +static void remove_huge_page(struct page *page)
>>   {
>>   	cancel_dirty_page(page, /* No IO accounting for huge pages? */0);
>>   	ClearPageUptodate(page);
>>   	delete_from_page_cache(page);
>>   }
>>   
>> -static void truncate_hugepages(struct inode *inode, loff_t lstart)
>> +/*
>> + * remove_inode_hugepages handles two distinct cases: truncation and hole punch
>> + * truncation is indicated by end of range being -1
>> + *	In this case, we first scan the range and release found pages.
>> + *	After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
>> + *	maps and global counts.
>> + * hole punch is indicated if end is not -1
>> + *	In the hole punch case we scan the range and release found pages.
>> + *	Only when releasing a page is the associated region/reserv map
>> + *	deleted.  Region/reserv map entries for ranges without associated
>> + *	pages are not modified.
> 
> If lend is not -1 but is large enough to go beyond the end of the file,
> should that be handled as a truncate operation or as a hole punch operation?
> If it makes no difference or can never happen, it's OK as long as a comment
> explains it.

Interesting question.  I did not think of this case.

If lend is not -1, then it must have been called by fallocate hole punch.
The other use cases/callers pass in -1 to truncate the file or entirely
remove the file.  remove_inode_hugepages really wants to distinguish
between hole punch and other operations.  In the hole punch case, the
size of the file does not change.  Therefore, we need to handle races with
faults for pages in the "hole" while deleting the hole.

The case where lend is beyond end of file and not -1 will be handled as
a hole punch.  I believe this is the desired behavior.  I will try to
make the documentation more clear.
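
As an aside, the user space call pattern makes the distinction concrete:
PUNCH_HOLE must always be combined with KEEP_SIZE, so a hole punch never
changes i_size, while only the truncate/evict paths pass -1 down to
remove_inode_hugepages().  A minimal sketch, assuming the series is applied
and a hugetlbfs mount at /dev/hugepages backed by 2MB huge pages (both
assumptions for the example):

	#define _GNU_SOURCE		/* for fallocate() and FALLOC_FL_* */
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	#define HPAGE_SIZE	(2UL * 1024 * 1024)

	int main(void)
	{
		int fd = open("/dev/hugepages/demo", O_CREAT | O_RDWR, 0600);

		if (fd < 0) {
			perror("open");
			return 1;
		}

		/* Preallocate 8 huge pages; i_size becomes 16MB. */
		if (fallocate(fd, 0, 0, 8 * HPAGE_SIZE))
			perror("fallocate preallocate");

		/*
		 * Punch out huge pages 2 and 3.  PUNCH_HOLE requires
		 * KEEP_SIZE, so i_size stays at 16MB; only the pages
		 * and their reservations are released.
		 */
		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			      2 * HPAGE_SIZE, 2 * HPAGE_SIZE))
			perror("fallocate punch hole");

		close(fd);
		return 0;
	}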

-- 
Mike Kravetz

> 
> Thanks,
> Naoya Horiguchi
> 
>> + */
>> +static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
>> +				   loff_t lend)
>>   {
>>   	struct hstate *h = hstate_inode(inode);
>>   	struct address_space *mapping = &inode->i_data;
>>   	const pgoff_t start = lstart >> huge_page_shift(h);
>> +	const pgoff_t end = lend >> huge_page_shift(h);
>>   	struct pagevec pvec;
>>   	pgoff_t next;
>>   	int i, freed = 0;
>> +	long lookup_nr = PAGEVEC_SIZE;
>> +	bool truncate_op = (lend == -1);
>>   
>>   	pagevec_init(&pvec, 0);
>>   	next = start;
>> -	while (1) {
>> -		if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
>> +	while (next < end) {
>> +		/*
>> +		 * Make sure to never grab more pages than we
>> +		 * might possibly need.
>> +		 */
>> +		if (end - next < lookup_nr)
>> +			lookup_nr = end - next;
>> +
>> +		/*
>> +		 * This pagevec_lookup() may return pages past 'end',
>> +		 * so we must check for page->index > end.
>> +		 */
>> +		if (!pagevec_lookup(&pvec, mapping, next, lookup_nr)) {
>>   			if (next == start)
>>   				break;
>>   			next = start;
>> @@ -345,26 +372,67 @@ static void truncate_hugepages(struct inode *inode, loff_t lstart)
>>   
>>   		for (i = 0; i < pagevec_count(&pvec); ++i) {
>>   			struct page *page = pvec.pages[i];
>> +			u32 hash;
>> +
>> +			hash = hugetlb_fault_mutex_shared_hash(mapping, next);
>> +			hugetlb_fault_mutex_lock(hash);
>>   
>>   			lock_page(page);
>> +			if (page->index >= end) {
>> +				unlock_page(page);
>> +				hugetlb_fault_mutex_unlock(hash);
>> +				next = end;	/* we are done */
>> +				break;
>> +			}
>> +
>> +			/*
>> +			 * If page is mapped, it was faulted in after being
>> +			 * unmapped.  Do nothing in this race case.  In the
>> +			 * normal case page is not mapped.
>> +			 */
>> +			if (!page_mapped(page)) {
>> +				bool rsv_on_error = !PagePrivate(page);
>> +				/*
>> +				 * We must free the huge page and remove
>> +				 * from page cache (remove_huge_page) BEFORE
>> +				 * removing the region/reserve map
>> +				 * (hugetlb_unreserve_pages).  In rare out
>> +				 * of memory conditions, removal of the
>> +				 * region/reserve map could fail.  Before
>> +				 * free'ing the page, note PagePrivate which
>> +				 * is used in case of error.
>> +				 */
>> +				remove_huge_page(page);
>> +				freed++;
>> +				if (!truncate_op) {
>> +					if (unlikely(hugetlb_unreserve_pages(
>> +							inode, next,
>> +							next + 1, 1)))
>> +						hugetlb_fix_reserve_counts(
>> +							inode, rsv_on_error);
>> +				}
>> +			}
>> +
>>   			if (page->index > next)
>>   				next = page->index;
>> +
>>   			++next;
>> -			truncate_huge_page(page);
>>   			unlock_page(page);
>> -			freed++;
>> +
>> +			hugetlb_fault_mutex_unlock(hash);
>>   		}
>>   		huge_pagevec_release(&pvec);
>>   	}
>> -	BUG_ON(!lstart && mapping->nrpages);
>> -	hugetlb_unreserve_pages(inode, start, freed);
>> +
>> +	if (truncate_op)
>> +		(void)hugetlb_unreserve_pages(inode, start, end, freed);
>>   }
>>   
>>   static void hugetlbfs_evict_inode(struct inode *inode)
>>   {
>>   	struct resv_map *resv_map;
>>   
>> -	truncate_hugepages(inode, 0);
>> +	remove_inode_hugepages(inode, 0, -1);
>>   	resv_map = (struct resv_map *)inode->i_mapping->private_data;
>>   	/* root inode doesn't have the resv_map, so we should check it */
>>   	if (resv_map)
>> @@ -421,7 +489,7 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
>>   	if (!RB_EMPTY_ROOT(&mapping->i_mmap))
>>   		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
>>   	i_mmap_unlock_write(mapping);
>> -	truncate_hugepages(inode, offset);
>> +	remove_inode_hugepages(inode, offset, -1);
>>   	return 0;
>>   }
>>   
>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>> index d0d033e..4c2856e 100644
>> --- a/include/linux/hugetlb.h
>> +++ b/include/linux/hugetlb.h
>> @@ -75,7 +75,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>   int hugetlb_reserve_pages(struct inode *inode, long from, long to,
>>   						struct vm_area_struct *vma,
>>   						vm_flags_t vm_flags);
>> -void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
>> +long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
>> +						long freed);
>>   int dequeue_hwpoisoned_huge_page(struct page *page);
>>   bool isolate_huge_page(struct page *page, struct list_head *list);
>>   void putback_active_hugepage(struct page *page);
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index df0d32a..0cf0622 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3628,21 +3628,32 @@ out_err:
>>   	return ret;
>>   }
>>   
>> -void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
>> +long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
>> +								long freed)
>>   {
>>   	struct hstate *h = hstate_inode(inode);
>>   	struct resv_map *resv_map = inode_resv_map(inode);
>>   	long chg = 0;
>>   	struct hugepage_subpool *spool = subpool_inode(inode);
>>   
>> -	if (resv_map)
>> -		chg = region_del(resv_map, offset, -1);
>> +	if (resv_map) {
>> +		chg = region_del(resv_map, start, end);
>> +		/*
>> +		 * region_del() can fail in the rare case where a region
>> +		 * must be split and another region descriptor can not be
>> +		 * allocated.  If end == -1, it will not fail.
>> +		 */
>> +		if (chg < 0)
>> +			return chg;
>> +	}
>>   	spin_lock(&inode->i_lock);
>>   	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
>>   	spin_unlock(&inode->i_lock);
>>   
>>   	hugepage_subpool_put_pages(spool, (chg - freed));
>>   	hugetlb_acct_memory(h, -(chg - freed));
>> +
>> +	return 0;
>>   }
>>   
>>   #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
>> -- 
>> 2.1.0

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v3 PATCH 06/10] hugetlbfs: truncate_hugepages() takes a range of pages
@ 2015-05-22 17:07       ` Mike Kravetz
  0 siblings, 0 replies; 46+ messages in thread
From: Mike Kravetz @ 2015-05-22 17:07 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, linux-kernel, Dave Hansen, David Rientjes,
	Hugh Dickins, Davidlohr Bueso, Aneesh Kumar, Hillf Danton,
	Christoph Hellwig

On 05/22/2015 01:08 AM, Naoya Horiguchi wrote:
> On Thu, May 21, 2015 at 08:47:40AM -0700, Mike Kravetz wrote:
>> Modify truncate_hugepages() to take a range of pages (start, end)
>> instead of simply start. If an end value of -1 is passed, the
>> current "truncate" functionality is maintained. Existing callers
>> are modified to pass -1 as end of range. By keying off end == -1,
>> the routine behaves differently for truncate and hole punch.
>> Page removal is now synchronized with page allocation via faults
>> by using the fault mutex table. The hole punch case can experience
>> the rare region_del error and must handle accordingly.
>>
>> Since the routine handles more than just the truncate case, it is
>> renamed to remove_inode_hugepages().  To be consistent, the routine
>> truncate_huge_page() is renamed remove_huge_page().
>>
>> Downstream of remove_inode_hugepages(), the routine
>> hugetlb_unreserve_pages() is also modified to take a range of pages.
>> hugetlb_unreserve_pages is modified to detect an error from
>> region_del and pass it back to the caller.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>   fs/hugetlbfs/inode.c    | 88 +++++++++++++++++++++++++++++++++++++++++++------
>>   include/linux/hugetlb.h |  3 +-
>>   mm/hugetlb.c            | 17 ++++++++--
>>   3 files changed, 94 insertions(+), 14 deletions(-)
>>
>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>> index dda529c..dfa88a5 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -317,26 +317,53 @@ static int hugetlbfs_write_end(struct file *file, struct address_space *mapping,
>>   	return -EINVAL;
>>   }
>>   
>> -static void truncate_huge_page(struct page *page)
>> +static void remove_huge_page(struct page *page)
>>   {
>>   	cancel_dirty_page(page, /* No IO accounting for huge pages? */0);
>>   	ClearPageUptodate(page);
>>   	delete_from_page_cache(page);
>>   }
>>   
>> -static void truncate_hugepages(struct inode *inode, loff_t lstart)
>> +/*
>> + * remove_inode_hugepages handles two distinct cases: truncation and hole punch
>> + * truncation is indicated by end of range being -1
>> + *	In this case, we first scan the range and release found pages.
>> + *	After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
>> + *	maps and global counts.
>> + * hole punch is indicated if end is not -1
>> + *	In the hole punch case we scan the range and release found pages.
>> + *	Only when releasing a page is the associated region/reserv map
>> + *	deleted.  Region/reserv map entries for ranges without associated
>> + *	pages are not modified.
> 
> If lend is not -1 but is large enough to go beyond the end of the file,
> should that be handled as a truncate operation or as a hole punch operation?
> If it makes no difference or can never happen, it's OK as long as a comment
> explains it.

Interesting question.  I did not think of this case.

If lend is not -1, then it must have been called by fallocate hole punch.
The other use cases/callers pass in -1 to truncate the file or entirely
remove the file.  remove_inode_hugepages really wants to distinguish
between hole punch and other operations.  In the hole punch case, the
size of the file does not change.  Therefore, we need to handle races with
faults for pages in the "hole" while deleting the hole.

The case where lend is beyond end of file and not -1 will be handled as
a hole punch.  I believe this is the desired behavior.  I will try to
make the documentation more clear.

-- 
Mike Kravetz

> 
> Thanks,
> Naoya Horiguchi
> 
>> + */
>> +static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
>> +				   loff_t lend)
>>   {
>>   	struct hstate *h = hstate_inode(inode);
>>   	struct address_space *mapping = &inode->i_data;
>>   	const pgoff_t start = lstart >> huge_page_shift(h);
>> +	const pgoff_t end = lend >> huge_page_shift(h);
>>   	struct pagevec pvec;
>>   	pgoff_t next;
>>   	int i, freed = 0;
>> +	long lookup_nr = PAGEVEC_SIZE;
>> +	bool truncate_op = (lend == -1);
>>   
>>   	pagevec_init(&pvec, 0);
>>   	next = start;
>> -	while (1) {
>> -		if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
>> +	while (next < end) {
>> +		/*
>> +		 * Make sure to never grab more pages than we
>> +		 * might possibly need.
>> +		 */
>> +		if (end - next < lookup_nr)
>> +			lookup_nr = end - next;
>> +
>> +		/*
>> +		 * This pagevec_lookup() may return pages past 'end',
>> +		 * so we must check for page->index > end.
>> +		 */
>> +		if (!pagevec_lookup(&pvec, mapping, next, lookup_nr)) {
>>   			if (next == start)
>>   				break;
>>   			next = start;
>> @@ -345,26 +372,67 @@ static void truncate_hugepages(struct inode *inode, loff_t lstart)
>>   
>>   		for (i = 0; i < pagevec_count(&pvec); ++i) {
>>   			struct page *page = pvec.pages[i];
>> +			u32 hash;
>> +
>> +			hash = hugetlb_fault_mutex_shared_hash(mapping, next);
>> +			hugetlb_fault_mutex_lock(hash);
>>   
>>   			lock_page(page);
>> +			if (page->index >= end) {
>> +				unlock_page(page);
>> +				hugetlb_fault_mutex_unlock(hash);
>> +				next = end;	/* we are done */
>> +				break;
>> +			}
>> +
>> +			/*
>> +			 * If page is mapped, it was faulted in after being
>> +			 * unmapped.  Do nothing in this race case.  In the
>> +			 * normal case page is not mapped.
>> +			 */
>> +			if (!page_mapped(page)) {
>> +				bool rsv_on_error = !PagePrivate(page);
>> +				/*
>> +				 * We must free the huge page and remove
>> +				 * from page cache (remove_huge_page) BEFORE
>> +				 * removing the region/reserve map
>> +				 * (hugetlb_unreserve_pages).  In rare out
>> +				 * of memory conditions, removal of the
>> +				 * region/reserve map could fail.  Before
>> +				 * free'ing the page, note PagePrivate which
>> +				 * is used in case of error.
>> +				 */
>> +				remove_huge_page(page);
>> +				freed++;
>> +				if (!truncate_op) {
>> +					if (unlikely(hugetlb_unreserve_pages(
>> +							inode, next,
>> +							next + 1, 1)))
>> +						hugetlb_fix_reserve_counts(
>> +							inode, rsv_on_error);
>> +				}
>> +			}
>> +
>>   			if (page->index > next)
>>   				next = page->index;
>> +
>>   			++next;
>> -			truncate_huge_page(page);
>>   			unlock_page(page);
>> -			freed++;
>> +
>> +			hugetlb_fault_mutex_unlock(hash);
>>   		}
>>   		huge_pagevec_release(&pvec);
>>   	}
>> -	BUG_ON(!lstart && mapping->nrpages);
>> -	hugetlb_unreserve_pages(inode, start, freed);
>> +
>> +	if (truncate_op)
>> +		(void)hugetlb_unreserve_pages(inode, start, end, freed);
>>   }
>>   
>>   static void hugetlbfs_evict_inode(struct inode *inode)
>>   {
>>   	struct resv_map *resv_map;
>>   
>> -	truncate_hugepages(inode, 0);
>> +	remove_inode_hugepages(inode, 0, -1);
>>   	resv_map = (struct resv_map *)inode->i_mapping->private_data;
>>   	/* root inode doesn't have the resv_map, so we should check it */
>>   	if (resv_map)
>> @@ -421,7 +489,7 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
>>   	if (!RB_EMPTY_ROOT(&mapping->i_mmap))
>>   		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
>>   	i_mmap_unlock_write(mapping);
>> -	truncate_hugepages(inode, offset);
>> +	remove_inode_hugepages(inode, offset, -1);
>>   	return 0;
>>   }
>>   
>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>> index d0d033e..4c2856e 100644
>> --- a/include/linux/hugetlb.h
>> +++ b/include/linux/hugetlb.h
>> @@ -75,7 +75,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>   int hugetlb_reserve_pages(struct inode *inode, long from, long to,
>>   						struct vm_area_struct *vma,
>>   						vm_flags_t vm_flags);
>> -void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
>> +long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
>> +						long freed);
>>   int dequeue_hwpoisoned_huge_page(struct page *page);
>>   bool isolate_huge_page(struct page *page, struct list_head *list);
>>   void putback_active_hugepage(struct page *page);
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index df0d32a..0cf0622 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3628,21 +3628,32 @@ out_err:
>>   	return ret;
>>   }
>>   
>> -void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
>> +long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
>> +								long freed)
>>   {
>>   	struct hstate *h = hstate_inode(inode);
>>   	struct resv_map *resv_map = inode_resv_map(inode);
>>   	long chg = 0;
>>   	struct hugepage_subpool *spool = subpool_inode(inode);
>>   
>> -	if (resv_map)
>> -		chg = region_del(resv_map, offset, -1);
>> +	if (resv_map) {
>> +		chg = region_del(resv_map, start, end);
>> +		/*
>> +		 * region_del() can fail in the rare case where a region
>> +		 * must be split and another region descriptor can not be
>> +		 * allocated.  If end == -1, it will not fail.
>> +		 */
>> +		if (chg < 0)
>> +			return chg;
>> +	}
>>   	spin_lock(&inode->i_lock);
>>   	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
>>   	spin_unlock(&inode->i_lock);
>>   
>>   	hugepage_subpool_put_pages(spool, (chg - freed));
>>   	hugetlb_acct_memory(h, -(chg - freed));
>> +
>> +	return 0;
>>   }
>>   
>>   #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
>> -- 
>> 2.1.0

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v3 PATCH 04/10] mm/hugetlb: expose hugetlb fault mutex for use by fallocate
  2015-05-22 17:01     ` Davidlohr Bueso
@ 2015-05-22 17:10       ` Mike Kravetz
  -1 siblings, 0 replies; 46+ messages in thread
From: Mike Kravetz @ 2015-05-22 17:10 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-mm, linux-kernel, Dave Hansen, Naoya Horiguchi,
	David Rientjes, Hugh Dickins, Aneesh Kumar, Hillf Danton,
	Christoph Hellwig

On 05/22/2015 10:01 AM, Davidlohr Bueso wrote:
> On Thu, 2015-05-21 at 08:47 -0700, Mike Kravetz wrote:
>> +/*
>> + * Interfaces to the fault mutex routines for use by hugetlbfs
>> + * fallocate code.  Faults must be synchronized with page adds or
>> + * deletes by fallocate.  fallocate only deals with shared mappings.
>> + */
>> +u32 hugetlb_fault_mutex_shared_hash(struct address_space *mapping, pgoff_t idx)
>> +{
>> +	return fault_mutex_hash(NULL, NULL, NULL, mapping, idx, 0);
>> +}
>> +
>> +void hugetlb_fault_mutex_lock(u32 hash)
>> +{
>> +	mutex_lock(&htlb_fault_mutex_table[hash]);
>> +}
>> +
>> +void hugetlb_fault_mutex_unlock(u32 hash)
>> +{
>> +	mutex_unlock(&htlb_fault_mutex_table[hash]);
>> +}
>
> These should really be inlined -- maybe add them to hugetlb.h along with
> the mutex hashtable bits.

Thanks.  I'll figure out some way to inline them in the next version
of the patch set.
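
For illustration only (not part of the posted series), the inlined form in
include/linux/hugetlb.h might look roughly like the sketch below, assuming
the fault mutex table is exported as a pointer (it could equally be an
array); the hash helper still needs the internal fault_mutex_hash(), so it
stays out of line here:

	/* sketch only: export the table and inline the trivial lock/unlock */
	extern struct mutex *htlb_fault_mutex_table;

	u32 hugetlb_fault_mutex_shared_hash(struct address_space *mapping,
						pgoff_t idx);

	static inline void hugetlb_fault_mutex_lock(u32 hash)
	{
		mutex_lock(&htlb_fault_mutex_table[hash]);
	}

	static inline void hugetlb_fault_mutex_unlock(u32 hash)
	{
		mutex_unlock(&htlb_fault_mutex_table[hash]);
	}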

-- 
Mike Kravetz

>
> Thanks,
> Davidlohr
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v3 PATCH 00/10] hugetlbfs: add fallocate support
  2015-05-21 15:47 ` Mike Kravetz
@ 2015-05-22 21:50   ` Davidlohr Bueso
  -1 siblings, 0 replies; 46+ messages in thread
From: Davidlohr Bueso @ 2015-05-22 21:50 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Dave Hansen, Naoya Horiguchi,
	David Rientjes, Hugh Dickins, Aneesh Kumar, Hillf Danton,
	Christoph Hellwig

On Thu, 2015-05-21 at 08:47 -0700, Mike Kravetz wrote:
> This patch set adds fallocate functionality to hugetlbfs.

It would be good to also have proper testcases in, say, libhugetlbfs.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v3 PATCH 00/10] hugetlbfs: add fallocate support
  2015-05-22 21:50   ` Davidlohr Bueso
@ 2015-05-23  2:32     ` Mike Kravetz
  -1 siblings, 0 replies; 46+ messages in thread
From: Mike Kravetz @ 2015-05-23  2:32 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-mm, linux-kernel, Dave Hansen, Naoya Horiguchi,
	David Rientjes, Hugh Dickins, Aneesh Kumar, Hillf Danton,
	Christoph Hellwig

On 05/22/2015 02:50 PM, Davidlohr Bueso wrote:
> On Thu, 2015-05-21 at 08:47 -0700, Mike Kravetz wrote:
>> This patch set adds fallocate functionality to hugetlbfs.
>
> It would be good to also have proper testcases in, say, libhugetlbfs.

Makes sense.  I have some functionality and stress tests I have been
using during development.  I'll start work on adding them to the
libhugetlbfs test harness.
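
For illustration only, a minimal standalone sketch of the sort of case such
a test might cover; the mount point, file name, and 2MB huge page size below
are assumptions, not details from this series:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <linux/falloc.h>

	int main(void)
	{
		long hpage_size = 2 * 1024 * 1024;	/* assumed huge page size */
		int fd = open("/dev/hugepages/falloc-test", O_CREAT | O_RDWR, 0600);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* preallocate one huge page, then punch it back out */
		if (fallocate(fd, 0, 0, hpage_size))
			perror("preallocate");
		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
				0, hpage_size))
			perror("hole punch");
		close(fd);
		unlink("/dev/hugepages/falloc-test");
		return 0;
	}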

-- 
Mike Kravetz

> Thanks,
> Davidlohr
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v3 PATCH 09/10] hugetlbfs: add hugetlbfs_fallocate()
  2015-05-21 15:47   ` Mike Kravetz
@ 2015-05-26  6:54     ` Naoya Horiguchi
  -1 siblings, 0 replies; 46+ messages in thread
From: Naoya Horiguchi @ 2015-05-26  6:54 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Dave Hansen, David Rientjes,
	Hugh Dickins, Davidlohr Bueso, Aneesh Kumar, Hillf Danton,
	Christoph Hellwig

On Thu, May 21, 2015 at 08:47:43AM -0700, Mike Kravetz wrote:
> This is based on the shmem version, but it has diverged quite
> a bit.  We have no swap to worry about, nor the new file sealing.
> Add synchronization via the fault mutex table to coordinate
> page faults, fallocate allocation, and fallocate hole punch.
> 
> What this allows us to do is move physical memory in and out of
> a hugetlbfs file without having it mapped.  This also gives us
> the ability to support MADV_REMOVE since it is currently
> implemented using fallocate().  MADV_REMOVE lets madvise() remove
> pages from the middle of a hugetlbfs file, which wasn't possible
> before.
> 
> hugetlbfs fallocate only operates on whole huge pages.
> 
> Based-on code-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

This patch changes the behavior of the user API, so please update the
manpage of fallocate(2).

> ---
>  fs/hugetlbfs/inode.c    | 169 +++++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/hugetlb.h |   3 +
>  mm/hugetlb.c            |   2 +-
>  3 files changed, 172 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index dfa88a5..4b1535f 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -12,6 +12,7 @@
>  #include <linux/thread_info.h>
>  #include <asm/current.h>
>  #include <linux/sched.h>		/* remove ASAP */
> +#include <linux/falloc.h>
>  #include <linux/fs.h>
>  #include <linux/mount.h>
>  #include <linux/file.h>
> @@ -493,6 +494,171 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
>  	return 0;
>  }
>  
> +static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> +{
> +	struct hstate *h = hstate_inode(inode);
> +	unsigned long hpage_size = huge_page_size(h);
> +	loff_t hole_start, hole_end;
> +
> +	/*
> +	 * For hole punch round up the beginning offset of the hole and
> +	 * round down the end.
> +	 */
> +	hole_start = (offset + hpage_size - 1) & huge_page_mask(h);
> +	hole_end = (offset + len) & huge_page_mask(h);

We have the round_up/round_down macros, so please use them here.
Then it's self-descriptive, so you don't have to write the comment.
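
For example (a sketch only):

	hole_start = round_up(offset, hpage_size);
	hole_end = round_down(offset + len, hpage_size);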

> +
> +	if ((u64)hole_end > (u64)hole_start) {

Why is this casting to u64 necessary?

> +		struct address_space *mapping = inode->i_mapping;
> +
> +		mutex_lock(&inode->i_mutex);
> +		i_mmap_lock_write(mapping);
> +		if (!RB_EMPTY_ROOT(&mapping->i_mmap))
> +			hugetlb_vmdelete_list(&mapping->i_mmap,
> +						hole_start >> PAGE_SHIFT,
> +						hole_end  >> PAGE_SHIFT);
> +		i_mmap_unlock_write(mapping);
> +		remove_inode_hugepages(inode, hole_start, hole_end);
> +		mutex_unlock(&inode->i_mutex);
> +	}
> +
> +	return 0;
> +}
> +
> +static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
> +				loff_t len)
> +{
> +	struct inode *inode = file_inode(file);
> +	struct address_space *mapping = inode->i_mapping;
> +	struct hstate *h = hstate_inode(inode);
> +	struct vm_area_struct pseudo_vma;
> +	unsigned long hpage_size = huge_page_size(h);
> +	unsigned long hpage_shift = huge_page_shift(h);
> +	pgoff_t start, index, end;
> +	int error;
> +	u32 hash;
> +
> +	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> +		return -EOPNOTSUPP;
> +
> +	if (mode & FALLOC_FL_PUNCH_HOLE)
> +		return hugetlbfs_punch_hole(inode, offset, len);
> +
> +	/*
> +	 * Default preallocate case.
> +	 * For this range, start is rounded down and end is rounded up.
> +	 */
> +	start = offset >> hpage_shift;
> +	end = (offset + len + hpage_size - 1) >> hpage_shift;
> +
> +	mutex_lock(&inode->i_mutex);
> +
> +	/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
> +	error = inode_newsize_ok(inode, offset + len);
> +	if (error)
> +		goto out;
> +
> +	/*
> +	 * Initialize a pseudo vma that just contains the policy used
> +	 * when allocating the huge pages.  The actual policy field
> +	 * (vm_policy) is determined based on the index in the loop below.
> +	 */
> +	memset(&pseudo_vma, 0, sizeof(struct vm_area_struct));
> +	pseudo_vma.vm_start = 0;
> +	pseudo_vma.vm_flags |= (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);

Maybe '|' isn't necessary.
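
That is, since the pseudo vma was just memset() to zero, a plain assignment
would be enough, e.g.:

	pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);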

> +	pseudo_vma.vm_file = file;
> +
> +	for (index = start; index < end; index++) {
> +		/*
> +		 * This is supposed to be the vaddr where the page is being
> +		 * faulted in, but we have no vaddr here.
> +		 */
> +		struct page *page;
> +		unsigned long addr;
> +		int avoid_reserve = 0;

avoid_reserve is referenced only once and never changed, so there's no
need for the variable?

> +
> +		cond_resched();
> +
> +		/*
> +		 * fallocate(2) manpage permits EINTR; we may have been
> +		 * interrupted because we are using up too much memory.
> +		 */
> +		if (signal_pending(current)) {
> +			error = -EINTR;
> +			break;
> +		}
> +
> +		/* Get policy based on index */
> +		pseudo_vma.vm_policy =
> +			mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy,
> +							index);
> +
> +		/* addr is the offset within the file (zero based) */
> +		addr = index * hpage_size;
> +
> +		/* mutex taken here, fault path and hole punch */
> +		hash = hugetlb_fault_mutex_shared_hash(mapping, index);
> +		hugetlb_fault_mutex_lock(hash);
> +
> +		/* see if page already exists to avoid alloc/free */
> +		page = find_get_page(mapping, index);
> +		if (page) {
> +			put_page(page);
> +			hugetlb_fault_mutex_unlock(hash);

Don't you need mpol_cond_put() here?
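
Something along these lines before the continue, for instance:

		if (page) {
			put_page(page);
			mpol_cond_put(pseudo_vma.vm_policy);	/* drop policy ref from the lookup */
			hugetlb_fault_mutex_unlock(hash);
			continue;
		}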

> +			continue;
> +		}
> +
> +		page = alloc_huge_page(&pseudo_vma, addr, avoid_reserve);
> +		mpol_cond_put(pseudo_vma.vm_policy);
> +		if (IS_ERR(page)) {
> +			hugetlb_fault_mutex_unlock(hash);
> +			error = PTR_ERR(page);
> +			goto out;
> +		}
> +		clear_huge_page(page, addr, pages_per_huge_page(h));
> +		__SetPageUptodate(page);

Note that I recently added page_huge_active() to mark hugepages as active,
so when you rebase to v4.1-rc1+, please insert set_page_huge_active(page) here.
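
That is, the suggested placement would be:

		clear_huge_page(page, addr, pages_per_huge_page(h));
		__SetPageUptodate(page);
		set_page_huge_active(page);	/* needed on v4.1-rc1+ */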

> +		error = huge_add_to_page_cache(page, mapping, index);
> +		if (error) {
> +			/*
> +			 * An entry already exists in the cache.  This implies
> +			 * a region also existed in the reserve map at the time
> +			 * the page was allocated above.  Therefore, no use
> +			 * count was added to the subpool for the page.  Before
> +			 * freeing the page, clear the subpool reference so
> +			 * that the count is not decremented.
> +			 */
> +			set_page_private(page, 0);/* clear spool reference */

This looks unclear to me. Which "count" do you refer to in the comment,
"no use count was added to the subpool" or "the count is not decremented"?
I guess spool->used_hpages or spool->rsv_hpages, but alloc_huge_page() above
should call hugepage_subpool_get_pages(), so it's accounted, right?
Could you write the comment more specifically?

Thanks,
Naoya Horiguchi

> +			put_page(page);
> +
> +			hugetlb_fault_mutex_unlock(hash);
> +			/* Keep going if we see an -EEXIST */
> +			if (error == -EEXIST) {
> +				error = 0;	/* do not return to user */
> +				continue;
> +			} else
> +				goto out;
> +		}
> +
> +		hugetlb_fault_mutex_unlock(hash);
> +
> +		/*
> +		 * page_put due to reference from alloc_huge_page()
> +		 * unlock_page because locked by add_to_page_cache()
> +		 */
> +		put_page(page);
> +		unlock_page(page);
> +	}
> +
> +	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
> +		i_size_write(inode, offset + len);
> +	inode->i_ctime = CURRENT_TIME;
> +	spin_lock(&inode->i_lock);
> +	inode->i_private = NULL;
> +	spin_unlock(&inode->i_lock);
> +out:
> +	mutex_unlock(&inode->i_mutex);
> +	return error;
> +}
> +
>  static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
>  {
>  	struct inode *inode = dentry->d_inode;
> @@ -804,7 +970,8 @@ const struct file_operations hugetlbfs_file_operations = {
>  	.mmap			= hugetlbfs_file_mmap,
>  	.fsync			= noop_fsync,
>  	.get_unmapped_area	= hugetlb_get_unmapped_area,
> -	.llseek		= default_llseek,
> +	.llseek			= default_llseek,
> +	.fallocate		= hugetlbfs_fallocate,
>  };
>  
>  static const struct inode_operations hugetlbfs_dir_inode_operations = {
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 934f339..fa36b7a 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -327,6 +327,8 @@ struct huge_bootmem_page {
>  #endif
>  };
>  
> +struct page *alloc_huge_page(struct vm_area_struct *vma,
> +				unsigned long addr, int avoid_reserve);
>  struct page *alloc_huge_page_node(struct hstate *h, int nid);
>  struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
>  				unsigned long addr, int avoid_reserve);
> @@ -481,6 +483,7 @@ static inline bool hugepages_supported(void)
>  
>  #else	/* CONFIG_HUGETLB_PAGE */
>  struct hstate {};
> +#define alloc_huge_page(v, a, r) NULL
>  #define alloc_huge_page_node(h, nid) NULL
>  #define alloc_huge_page_noerr(v, a, r) NULL
>  #define alloc_bootmem_huge_page(h) NULL
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 94c6154..1e95038 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1444,7 +1444,7 @@ static long vma_commit_reservation(struct hstate *h,
>  /* Forward declaration */
>  static int hugetlb_acct_memory(struct hstate *h, long delta);
>  
> -static struct page *alloc_huge_page(struct vm_area_struct *vma,
> +struct page *alloc_huge_page(struct vm_area_struct *vma,
>  				    unsigned long addr, int avoid_reserve)
>  {
>  	struct hugepage_subpool *spool = subpool_vma(vma);
> -- 
> 2.1.0
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v3 PATCH 09/10] hugetlbfs: add hugetlbfs_fallocate()
  2015-05-26  6:54     ` Naoya Horiguchi
@ 2015-05-26 17:53       ` Mike Kravetz
  -1 siblings, 0 replies; 46+ messages in thread
From: Mike Kravetz @ 2015-05-26 17:53 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, linux-kernel, Dave Hansen, David Rientjes,
	Hugh Dickins, Davidlohr Bueso, Aneesh Kumar, Hillf Danton,
	Christoph Hellwig

On 05/25/2015 11:54 PM, Naoya Horiguchi wrote:
> On Thu, May 21, 2015 at 08:47:43AM -0700, Mike Kravetz wrote:
>> This is based on the shmem version, but it has diverged quite
>> a bit.  We have no swap to worry about, nor the new file sealing.
>> Add synchronization via the fault mutex table to coordinate
>> page faults, fallocate allocation, and fallocate hole punch.
>>
>> What this allows us to do is move physical memory in and out of
>> a hugetlbfs file without having it mapped.  This also gives us
>> the ability to support MADV_REMOVE since it is currently
>> implemented using fallocate().  MADV_REMOVE lets madvise() remove
>> pages from the middle of a hugetlbfs file, which wasn't possible
>> before.
>>
>> hugetlbfs fallocate only operates on whole huge pages.
>>
>> Based-on code-by: Dave Hansen <dave.hansen@linux.intel.com>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> This patch changes the behavior of the user API, so please update the
> manpage of fallocate(2).

Will do.

Unfortunately, I believe hugetlbfs does not follow the man page
for ftruncate.  So, I will look to get that updated as well.

>> ---
>>   fs/hugetlbfs/inode.c    | 169 +++++++++++++++++++++++++++++++++++++++++++++++-
>>   include/linux/hugetlb.h |   3 +
>>   mm/hugetlb.c            |   2 +-
>>   3 files changed, 172 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>> index dfa88a5..4b1535f 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -12,6 +12,7 @@
>>   #include <linux/thread_info.h>
>>   #include <asm/current.h>
>>   #include <linux/sched.h>		/* remove ASAP */
>> +#include <linux/falloc.h>
>>   #include <linux/fs.h>
>>   #include <linux/mount.h>
>>   #include <linux/file.h>
>> @@ -493,6 +494,171 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
>>   	return 0;
>>   }
>>   
>> +static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>> +{
>> +	struct hstate *h = hstate_inode(inode);
>> +	unsigned long hpage_size = huge_page_size(h);
>> +	loff_t hole_start, hole_end;
>> +
>> +	/*
>> +	 * For hole punch round up the beginning offset of the hole and
>> +	 * round down the end.
>> +	 */
>> +	hole_start = (offset + hpage_size - 1) & huge_page_mask(h);
>> +	hole_end = (offset + len) & huge_page_mask(h);
> 
> We have the round_up/round_down macros, so please use them here.
> Then it's self-descriptive, so you don't have to write the comment.
> 
>> +
>> +	if ((u64)hole_end > (u64)hole_start) {
> 
> Why is this casting to u64 necessary?

It is not necessary.  I will remove it.

>> +		struct address_space *mapping = inode->i_mapping;
>> +
>> +		mutex_lock(&inode->i_mutex);
>> +		i_mmap_lock_write(mapping);
>> +		if (!RB_EMPTY_ROOT(&mapping->i_mmap))
>> +			hugetlb_vmdelete_list(&mapping->i_mmap,
>> +						hole_start >> PAGE_SHIFT,
>> +						hole_end  >> PAGE_SHIFT);
>> +		i_mmap_unlock_write(mapping);
>> +		remove_inode_hugepages(inode, hole_start, hole_end);
>> +		mutex_unlock(&inode->i_mutex);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>> +				loff_t len)
>> +{
>> +	struct inode *inode = file_inode(file);
>> +	struct address_space *mapping = inode->i_mapping;
>> +	struct hstate *h = hstate_inode(inode);
>> +	struct vm_area_struct pseudo_vma;
>> +	unsigned long hpage_size = huge_page_size(h);
>> +	unsigned long hpage_shift = huge_page_shift(h);
>> +	pgoff_t start, index, end;
>> +	int error;
>> +	u32 hash;
>> +
>> +	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
>> +		return -EOPNOTSUPP;
>> +
>> +	if (mode & FALLOC_FL_PUNCH_HOLE)
>> +		return hugetlbfs_punch_hole(inode, offset, len);
>> +
>> +	/*
>> +	 * Default preallocate case.
>> +	 * For this range, start is rounded down and end is rounded up.
>> +	 */
>> +	start = offset >> hpage_shift;
>> +	end = (offset + len + hpage_size - 1) >> hpage_shift;
>> +
>> +	mutex_lock(&inode->i_mutex);
>> +
>> +	/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
>> +	error = inode_newsize_ok(inode, offset + len);
>> +	if (error)
>> +		goto out;
>> +
>> +	/*
>> +	 * Initialize a pseudo vma that just contains the policy used
>> +	 * when allocating the huge pages.  The actual policy field
>> +	 * (vm_policy) is determined based on the index in the loop below.
>> +	 */
>> +	memset(&pseudo_vma, 0, sizeof(struct vm_area_struct));
>> +	pseudo_vma.vm_start = 0;
>> +	pseudo_vma.vm_flags |= (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
> 
> Maybe '|' isn't necessary.

No, it is not necessary.  I will remove.

>> +	pseudo_vma.vm_file = file;
>> +
>> +	for (index = start; index < end; index++) {
>> +		/*
>> +		 * This is supposed to be the vaddr where the page is being
>> +		 * faulted in, but we have no vaddr here.
>> +		 */
>> +		struct page *page;
>> +		unsigned long addr;
>> +		int avoid_reserve = 0;
> 
> avoid_reserve is referenced only once and never changed, so there's no
> need for the variable?

It is not necessary.  I will remove it.

>> +
>> +		cond_resched();
>> +
>> +		/*
>> +		 * fallocate(2) manpage permits EINTR; we may have been
>> +		 * interrupted because we are using up too much memory.
>> +		 */
>> +		if (signal_pending(current)) {
>> +			error = -EINTR;
>> +			break;
>> +		}
>> +
>> +		/* Get policy based on index */
>> +		pseudo_vma.vm_policy =
>> +			mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy,
>> +							index);
>> +
>> +		/* addr is the offset within the file (zero based) */
>> +		addr = index * hpage_size;
>> +
>> +		/* mutex taken here, fault path and hole punch */
>> +		hash = hugetlb_fault_mutex_shared_hash(mapping, index);
>> +		hugetlb_fault_mutex_lock(hash);
>> +
>> +		/* see if page already exists to avoid alloc/free */
>> +		page = find_get_page(mapping, index);
>> +		if (page) {
>> +			put_page(page);
>> +			hugetlb_fault_mutex_unlock(hash);
> 
> Don't you need mpol_cond_put() here?

Yes.  Thank you, I will add it.

>> +			continue;
>> +		}
>> +
>> +		page = alloc_huge_page(&pseudo_vma, addr, avoid_reserve);
>> +		mpol_cond_put(pseudo_vma.vm_policy);
>> +		if (IS_ERR(page)) {
>> +			hugetlb_fault_mutex_unlock(hash);
>> +			error = PTR_ERR(page);
>> +			goto out;
>> +		}
>> +		clear_huge_page(page, addr, pages_per_huge_page(h));
>> +		__SetPageUptodate(page);
> 
> Note that I recently added page_huge_active() to mark hugepages as active,
> so when you rebase to v4.1-rc1+, please insert set_page_huge_active(page) here.
> 

Yes, I noticed your change.

>> +		error = huge_add_to_page_cache(page, mapping, index);
>> +		if (error) {
>> +			/*
>> +			 * An entry already exists in the cache.  This implies
>> +			 * a region also existed in the reserve map at the time
>> +			 * the page was allocated above.  Therefore, no use
>> +			 * count was added to the subpool for the page.  Before
>> +			 * freeing the page, clear the subpool reference so
>> +			 * that the count is not decremented.
>> +			 */
>> +			set_page_private(page, 0);/* clear spool reference */
> 
> This looks unclear to me. Which "count" do you refer to in the comment,
> "no use count was added to the subpool" or "the count is not decremented"?
> I guess spool->used_hpages or spool->rsv_hpages, but alloc_huge_page() above
> should call hugepage_subpool_get_pages(), so it's accounted, right?
> Could you write the comment more specifically?

Yes, this is confusing.  As I am reexamining the code, I see that
it is incorrect.  This code may not be necessary.  It was there to
handle a race with page faults.  The code now uses the hugetlb fault
mutex table to synchronize with page faults.  I will do some more
work here and expect this confusing code to go away.

Thank you for your comments,
-- 
Mike Kravetz

> 
> Thanks,
> Naoya Horiguchi
> 
>> +			put_page(page);
>> +
>> +			hugetlb_fault_mutex_unlock(hash);
>> +			/* Keep going if we see an -EEXIST */
>> +			if (error == -EEXIST) {
>> +				error = 0;	/* do not return to user */
>> +				continue;
>> +			} else
>> +				goto out;
>> +		}
>> +
>> +		hugetlb_fault_mutex_unlock(hash);
>> +
>> +		/*
>> +		 * page_put due to reference from alloc_huge_page()
>> +		 * unlock_page because locked by add_to_page_cache()
>> +		 */
>> +		put_page(page);
>> +		unlock_page(page);
>> +	}
>> +
>> +	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
>> +		i_size_write(inode, offset + len);
>> +	inode->i_ctime = CURRENT_TIME;
>> +	spin_lock(&inode->i_lock);
>> +	inode->i_private = NULL;
>> +	spin_unlock(&inode->i_lock);
>> +out:
>> +	mutex_unlock(&inode->i_mutex);
>> +	return error;
>> +}
>> +
>>   static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
>>   {
>>   	struct inode *inode = dentry->d_inode;
>> @@ -804,7 +970,8 @@ const struct file_operations hugetlbfs_file_operations = {
>>   	.mmap			= hugetlbfs_file_mmap,
>>   	.fsync			= noop_fsync,
>>   	.get_unmapped_area	= hugetlb_get_unmapped_area,
>> -	.llseek		= default_llseek,
>> +	.llseek			= default_llseek,
>> +	.fallocate		= hugetlbfs_fallocate,
>>   };
>>   
>>   static const struct inode_operations hugetlbfs_dir_inode_operations = {
>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>> index 934f339..fa36b7a 100644
>> --- a/include/linux/hugetlb.h
>> +++ b/include/linux/hugetlb.h
>> @@ -327,6 +327,8 @@ struct huge_bootmem_page {
>>   #endif
>>   };
>>   
>> +struct page *alloc_huge_page(struct vm_area_struct *vma,
>> +				unsigned long addr, int avoid_reserve);
>>   struct page *alloc_huge_page_node(struct hstate *h, int nid);
>>   struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
>>   				unsigned long addr, int avoid_reserve);
>> @@ -481,6 +483,7 @@ static inline bool hugepages_supported(void)
>>   
>>   #else	/* CONFIG_HUGETLB_PAGE */
>>   struct hstate {};
>> +#define alloc_huge_page(v, a, r) NULL
>>   #define alloc_huge_page_node(h, nid) NULL
>>   #define alloc_huge_page_noerr(v, a, r) NULL
>>   #define alloc_bootmem_huge_page(h) NULL
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 94c6154..1e95038 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1444,7 +1444,7 @@ static long vma_commit_reservation(struct hstate *h,
>>   /* Forward declaration */
>>   static int hugetlb_acct_memory(struct hstate *h, long delta);
>>   
>> -static struct page *alloc_huge_page(struct vm_area_struct *vma,
>> +struct page *alloc_huge_page(struct vm_area_struct *vma,
>>   				    unsigned long addr, int avoid_reserve)
>>   {
>>   	struct hugepage_subpool *spool = subpool_vma(vma);
>> -- 
>> 2.1.0

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2015-05-26 17:55 UTC | newest]

Thread overview: 46+ messages
2015-05-21 15:47 [RFC v3 PATCH 00/10] hugetlbfs: add fallocate support Mike Kravetz
2015-05-21 15:47 ` Mike Kravetz
2015-05-21 15:47 ` [RFC v3 PATCH 01/10] mm/hugetlb: compute/return the number of regions added by region_add() Mike Kravetz
2015-05-21 15:47   ` Mike Kravetz
2015-05-21 15:47 ` [RFC v3 PATCH 02/10] mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages Mike Kravetz
2015-05-21 15:47   ` Mike Kravetz
2015-05-21 15:47 ` [RFC v3 PATCH 03/10] mm/hugetlb: add region_del() to delete a specific range of entries Mike Kravetz
2015-05-21 15:47   ` Mike Kravetz
2015-05-22  6:21   ` Naoya Horiguchi
2015-05-22  6:21     ` Naoya Horiguchi
2015-05-22 16:48     ` Mike Kravetz
2015-05-22 16:48       ` Mike Kravetz
2015-05-21 15:47 ` [RFC v3 PATCH 04/10] mm/hugetlb: expose hugetlb fault mutex for use by fallocate Mike Kravetz
2015-05-21 15:47   ` Mike Kravetz
2015-05-22  6:23   ` Naoya Horiguchi
2015-05-22  6:23     ` Naoya Horiguchi
2015-05-22 16:50     ` Mike Kravetz
2015-05-22 16:50       ` Mike Kravetz
2015-05-22 17:01   ` Davidlohr Bueso
2015-05-22 17:01     ` Davidlohr Bueso
2015-05-22 17:10     ` Mike Kravetz
2015-05-22 17:10       ` Mike Kravetz
2015-05-21 15:47 ` [RFC v3 PATCH 05/10] hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete Mike Kravetz
2015-05-21 15:47   ` Mike Kravetz
2015-05-21 15:47 ` [RFC v3 PATCH 06/10] hugetlbfs: truncate_hugepages() takes a range of pages Mike Kravetz
2015-05-21 15:47   ` Mike Kravetz
2015-05-22  8:08   ` Naoya Horiguchi
2015-05-22  8:08     ` Naoya Horiguchi
2015-05-22 17:07     ` Mike Kravetz
2015-05-22 17:07       ` Mike Kravetz
2015-05-21 15:47 ` [RFC v3 PATCH 07/10] hugetlbfs: New huge_add_to_page_cache helper routine Mike Kravetz
2015-05-21 15:47   ` Mike Kravetz
2015-05-21 15:47 ` [RFC v3 PATCH 08/10] mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch Mike Kravetz
2015-05-21 15:47   ` Mike Kravetz
2015-05-21 15:47 ` [RFC v3 PATCH 09/10] hugetlbfs: add hugetlbfs_fallocate() Mike Kravetz
2015-05-21 15:47   ` Mike Kravetz
2015-05-26  6:54   ` Naoya Horiguchi
2015-05-26  6:54     ` Naoya Horiguchi
2015-05-26 17:53     ` Mike Kravetz
2015-05-26 17:53       ` Mike Kravetz
2015-05-21 15:47 ` [RFC v3 PATCH 10/10] mm: madvise allow remove operation for hugetlbfs Mike Kravetz
2015-05-21 15:47   ` Mike Kravetz
2015-05-22 21:50 ` [RFC v3 PATCH 00/10] hugetlbfs: add fallocate support Davidlohr Bueso
2015-05-22 21:50   ` Davidlohr Bueso
2015-05-23  2:32   ` Mike Kravetz
2015-05-23  2:32     ` Mike Kravetz
