* [PATCH v2 0/6] mm, hugetlb: fixes and fault scalability
@ 2014-01-31 17:36 Davidlohr Bueso
  2014-01-31 17:36 ` [PATCH v2 1/6] mm, hugetlb: unify region structure handling Davidlohr Bueso
                   ` (5 more replies)
  0 siblings, 6 replies; 10+ messages in thread
From: Davidlohr Bueso @ 2014-01-31 17:36 UTC (permalink / raw)
  To: akpm, iamjoonsoo.kim
  Cc: riel, mgorman, mhocko, aneesh.kumar, kamezawa.hiroyu, hughd,
	david, js1304, liwanp, n-horiguchi, dhillf, rientjes, davidlohr,
	aswin, scott.norton, linux-mm, linux-kernel

Changes from v1 (http://lkml.org/lkml/2014/1/26/219), based on feedback 
from Naoya Horiguchi:
- Dropped cleanup patches 6 & 7.

- Redid patch 3, fixing some potential use-after-free issues for new
  regions.

- Cleaned up patch 5.

- Added review tags.

This patchset resumes the work to improve scalability of the whole hugepage
fault path. Previous efforts can be found here:

https://lkml.org/lkml/2013/7/26/299
https://lkml.org/lkml/2013/12/18/50

The latest attempt to address the big-fat hugetlb instantiation mutex, by
removing the need for it altogether, ended up introducing too much overhead
to be worth the scalability gains. The discussion can be found at:
https://lkml.org/lkml/2014/1/3/244

This patchset is divided into three parts. The first five patches, mostly
from Joonsoo, have been included and reviewed in previous patchsets. The
last patch is the actual performance one.

Part 1. (1-3) Introduce a new protection method for the region tracking
        data structure, replacing the hugetlb_instantiation_mutex. There
        is a race condition when the same hugetlbfs file is mapped by two
        different processes; preventing it requires a new protection
        method such as the one introduced here.

Part 2. (4-5) Clean-ups.

        These make the code considerably simpler, so they are worth
        merging into mainline separately.

Part 3. (6) Use a table of mutexes instead of a single one, and allow
        faults to be handled in parallel. Benefits and caveats of this
        approach are in the patch; a condensed sketch follows below.
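
The sketch below shows roughly where the series ends up (simplified, not the
literal diffs; see the individual patches for the real code):

/* Region tracking: one resv_map per mapping, with its own lock (patches 1-3) */
struct resv_map {
	struct kref refs;
	spinlock_t lock;		/* protects the regions list */
	struct list_head regions;
};

/* Fault path: a hashed table of mutexes instead of one global mutex (patch 6) */
hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
mutex_lock(&htlb_fault_mutex_table[hash]);
/* ... allocate/instantiate the hugepage ... */
mutex_unlock(&htlb_fault_mutex_table[hash]);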

All changes have passed the libhugetlbfs test cases.
This patchset applies on top of Linus' current tree (3.13-e7651b81).

  mm, hugetlb: unify region structure handling
  mm, hugetlb: improve, cleanup resv_map parameters
  mm, hugetlb: fix race in region tracking
  mm, hugetlb: remove resv_map_put
  mm, hugetlb: use vma_resv_map() map types
  mm, hugetlb: improve page-fault scalability

 fs/hugetlbfs/inode.c    |  17 ++-
 include/linux/hugetlb.h |  10 ++
 mm/hugetlb.c            | 286 ++++++++++++++++++++++++++++++------------------
 3 files changed, 204 insertions(+), 109 deletions(-)

-- 
1.8.1.4



* [PATCH v2 1/6] mm, hugetlb: unify region structure handling
  2014-01-31 17:36 [PATCH v2 0/6] mm, hugetlb: fixes and fault scalability Davidlohr Bueso
@ 2014-01-31 17:36 ` Davidlohr Bueso
  2014-01-31 17:36 ` [PATCH v2 2/6] mm, hugetlb: improve, cleanup resv_map parameters Davidlohr Bueso
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Davidlohr Bueso @ 2014-01-31 17:36 UTC (permalink / raw)
  To: akpm, iamjoonsoo.kim
  Cc: riel, mgorman, mhocko, aneesh.kumar, kamezawa.hiroyu, hughd,
	david, js1304, liwanp, n-horiguchi, dhillf, rientjes, davidlohr,
	aswin, scott.norton, linux-mm, linux-kernel

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Currently, reserved and allocated regions are tracked in two different ways,
depending on the mapping: for MAP_SHARED we use the address_space's
private_list, while for MAP_PRIVATE we use a resv_map.

We are preparing to replace the coarse-grained lock that protects the region
structure with a fine-grained one, and this difference gets in the way. So,
before changing it, unify region structure handling by consistently using a
resv_map regardless of the kind of mapping.
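
Concretely, for shared mappings the region calls change from

	return region_chg(&inode->i_mapping->private_list, idx, idx + 1);

to

	struct resv_map *resv = inode->i_mapping->private_data;

	return region_chg(&resv->regions, idx, idx + 1);

(taken from the vma_needs_reservation() hunk below); private mappings already
worked this way.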

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
[Updated changelog]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 fs/hugetlbfs/inode.c    | 17 +++++++++++++++--
 include/linux/hugetlb.h |  9 +++++++++
 mm/hugetlb.c            | 37 +++++++++++++++++++++----------------
 3 files changed, 45 insertions(+), 18 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index d19b30a..2040275 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -366,7 +366,13 @@ static void truncate_hugepages(struct inode *inode, loff_t lstart)
 
 static void hugetlbfs_evict_inode(struct inode *inode)
 {
+	struct resv_map *resv_map;
+
 	truncate_hugepages(inode, 0);
+	resv_map = (struct resv_map *)inode->i_mapping->private_data;
+	/* root inode doesn't have the resv_map, so we should check it */
+	if (resv_map)
+		resv_map_release(&resv_map->refs);
 	clear_inode(inode);
 }
 
@@ -476,6 +482,11 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 					umode_t mode, dev_t dev)
 {
 	struct inode *inode;
+	struct resv_map *resv_map;
+
+	resv_map = resv_map_alloc();
+	if (!resv_map)
+		return NULL;
 
 	inode = new_inode(sb);
 	if (inode) {
@@ -487,7 +498,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 		inode->i_mapping->a_ops = &hugetlbfs_aops;
 		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
-		INIT_LIST_HEAD(&inode->i_mapping->private_list);
+		inode->i_mapping->private_data = resv_map;
 		info = HUGETLBFS_I(inode);
 		/*
 		 * The policy is initialized here even if we are creating a
@@ -517,7 +528,9 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 			break;
 		}
 		lockdep_annotate_inode_mutex_key(inode);
-	}
+	} else
+		kref_put(&resv_map->refs, resv_map_release);
+
 	return inode;
 }
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8c43cc4..f62c2f6 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -6,6 +6,8 @@
 #include <linux/fs.h>
 #include <linux/hugetlb_inline.h>
 #include <linux/cgroup.h>
+#include <linux/list.h>
+#include <linux/kref.h>
 
 struct ctl_table;
 struct user_struct;
@@ -23,6 +25,13 @@ struct hugepage_subpool {
 	long max_hpages, used_hpages;
 };
 
+struct resv_map {
+	struct kref refs;
+	struct list_head regions;
+};
+extern struct resv_map *resv_map_alloc(void);
+void resv_map_release(struct kref *ref);
+
 extern spinlock_t hugetlb_lock;
 extern int hugetlb_max_hstate __read_mostly;
 #define for_each_hstate(h) \
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c01cb9f..138987f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -376,12 +376,7 @@ static void set_vma_private_data(struct vm_area_struct *vma,
 	vma->vm_private_data = (void *)value;
 }
 
-struct resv_map {
-	struct kref refs;
-	struct list_head regions;
-};
-
-static struct resv_map *resv_map_alloc(void)
+struct resv_map *resv_map_alloc(void)
 {
 	struct resv_map *resv_map = kmalloc(sizeof(*resv_map), GFP_KERNEL);
 	if (!resv_map)
@@ -393,7 +388,7 @@ static struct resv_map *resv_map_alloc(void)
 	return resv_map;
 }
 
-static void resv_map_release(struct kref *ref)
+void resv_map_release(struct kref *ref)
 {
 	struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
 
@@ -1155,8 +1150,9 @@ static long vma_needs_reservation(struct hstate *h,
 
 	if (vma->vm_flags & VM_MAYSHARE) {
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		return region_chg(&inode->i_mapping->private_list,
-							idx, idx + 1);
+		struct resv_map *resv = inode->i_mapping->private_data;
+
+		return region_chg(&resv->regions, idx, idx + 1);
 
 	} else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
 		return 1;
@@ -1180,7 +1176,9 @@ static void vma_commit_reservation(struct hstate *h,
 
 	if (vma->vm_flags & VM_MAYSHARE) {
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		region_add(&inode->i_mapping->private_list, idx, idx + 1);
+		struct resv_map *resv = inode->i_mapping->private_data;
+
+		region_add(&resv->regions, idx, idx + 1);
 
 	} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
@@ -3161,6 +3159,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	long ret, chg;
 	struct hstate *h = hstate_inode(inode);
 	struct hugepage_subpool *spool = subpool_inode(inode);
+	struct resv_map *resv_map;
 
 	/*
 	 * Only apply hugepage reservation if asked. At fault time, an
@@ -3176,10 +3175,13 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * to reserve the full area even if read-only as mprotect() may be
 	 * called to make the mapping read-write. Assume !vma is a shm mapping
 	 */
-	if (!vma || vma->vm_flags & VM_MAYSHARE)
-		chg = region_chg(&inode->i_mapping->private_list, from, to);
-	else {
-		struct resv_map *resv_map = resv_map_alloc();
+	if (!vma || vma->vm_flags & VM_MAYSHARE) {
+		resv_map = inode->i_mapping->private_data;
+
+		chg = region_chg(&resv_map->regions, from, to);
+
+	} else {
+		resv_map = resv_map_alloc();
 		if (!resv_map)
 			return -ENOMEM;
 
@@ -3222,7 +3224,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * else has to be done for private mappings here
 	 */
 	if (!vma || vma->vm_flags & VM_MAYSHARE)
-		region_add(&inode->i_mapping->private_list, from, to);
+		region_add(&resv_map->regions, from, to);
 	return 0;
 out_err:
 	if (vma)
@@ -3233,9 +3235,12 @@ out_err:
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 {
 	struct hstate *h = hstate_inode(inode);
-	long chg = region_truncate(&inode->i_mapping->private_list, offset);
+	struct resv_map *resv_map = inode->i_mapping->private_data;
+	long chg = 0;
 	struct hugepage_subpool *spool = subpool_inode(inode);
 
+	if (resv_map)
+		chg = region_truncate(&resv_map->regions, offset);
 	spin_lock(&inode->i_lock);
 	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
 	spin_unlock(&inode->i_lock);
-- 
1.8.1.4



* [PATCH v2 2/6] mm, hugetlb: improve, cleanup resv_map parameters
  2014-01-31 17:36 [PATCH v2 0/6] mm, hugetlb: fixes and fault scalability Davidlohr Bueso
  2014-01-31 17:36 ` [PATCH v2 1/6] mm, hugetlb: unify region structure handling Davidlohr Bueso
@ 2014-01-31 17:36 ` Davidlohr Bueso
  2014-01-31 17:36 ` [PATCH v2 3/6] mm, hugetlb: fix race in region tracking Davidlohr Bueso
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Davidlohr Bueso @ 2014-01-31 17:36 UTC (permalink / raw)
  To: akpm, iamjoonsoo.kim
  Cc: riel, mgorman, mhocko, aneesh.kumar, kamezawa.hiroyu, hughd,
	david, js1304, liwanp, n-horiguchi, dhillf, rientjes, davidlohr,
	aswin, scott.norton, linux-mm, linux-kernel

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

To change the protection scheme for the region tracking structure to a
fine-grained one, pass the resv_map, instead of the list_head, to the region
manipulation functions. This doesn't introduce any functional change; it
merely prepares for the next step.
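
For example, a typical call site goes from

	chg = region_chg(&resv_map->regions, from, to);

to

	chg = region_chg(resv_map, from, to);

with region_chg() itself now deriving the list head via &resv->regions (both
lines taken from the hunks below).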

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 mm/hugetlb.c | 30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 138987f..dca03a6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -151,8 +151,9 @@ struct file_region {
 	long to;
 };
 
-static long region_add(struct list_head *head, long f, long t)
+static long region_add(struct resv_map *resv, long f, long t)
 {
+	struct list_head *head = &resv->regions;
 	struct file_region *rg, *nrg, *trg;
 
 	/* Locate the region we are either in or before. */
@@ -187,8 +188,9 @@ static long region_add(struct list_head *head, long f, long t)
 	return 0;
 }
 
-static long region_chg(struct list_head *head, long f, long t)
+static long region_chg(struct resv_map *resv, long f, long t)
 {
+	struct list_head *head = &resv->regions;
 	struct file_region *rg, *nrg;
 	long chg = 0;
 
@@ -236,8 +238,9 @@ static long region_chg(struct list_head *head, long f, long t)
 	return chg;
 }
 
-static long region_truncate(struct list_head *head, long end)
+static long region_truncate(struct resv_map *resv, long end)
 {
+	struct list_head *head = &resv->regions;
 	struct file_region *rg, *trg;
 	long chg = 0;
 
@@ -266,8 +269,9 @@ static long region_truncate(struct list_head *head, long end)
 	return chg;
 }
 
-static long region_count(struct list_head *head, long f, long t)
+static long region_count(struct resv_map *resv, long f, long t)
 {
+	struct list_head *head = &resv->regions;
 	struct file_region *rg;
 	long chg = 0;
 
@@ -393,7 +397,7 @@ void resv_map_release(struct kref *ref)
 	struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
 
 	/* Clear out any active regions before we release the map. */
-	region_truncate(&resv_map->regions, 0);
+	region_truncate(resv_map, 0);
 	kfree(resv_map);
 }
 
@@ -1152,7 +1156,7 @@ static long vma_needs_reservation(struct hstate *h,
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		struct resv_map *resv = inode->i_mapping->private_data;
 
-		return region_chg(&resv->regions, idx, idx + 1);
+		return region_chg(resv, idx, idx + 1);
 
 	} else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
 		return 1;
@@ -1162,7 +1166,7 @@ static long vma_needs_reservation(struct hstate *h,
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		struct resv_map *resv = vma_resv_map(vma);
 
-		err = region_chg(&resv->regions, idx, idx + 1);
+		err = region_chg(resv, idx, idx + 1);
 		if (err < 0)
 			return err;
 		return 0;
@@ -1178,14 +1182,14 @@ static void vma_commit_reservation(struct hstate *h,
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		struct resv_map *resv = inode->i_mapping->private_data;
 
-		region_add(&resv->regions, idx, idx + 1);
+		region_add(resv, idx, idx + 1);
 
 	} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		struct resv_map *resv = vma_resv_map(vma);
 
 		/* Mark this page used in the map. */
-		region_add(&resv->regions, idx, idx + 1);
+		region_add(resv, idx, idx + 1);
 	}
 }
 
@@ -2276,7 +2280,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 		end = vma_hugecache_offset(h, vma, vma->vm_end);
 
 		reserve = (end - start) -
-			region_count(&resv->regions, start, end);
+			region_count(resv, start, end);
 
 		resv_map_put(vma);
 
@@ -3178,7 +3182,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	if (!vma || vma->vm_flags & VM_MAYSHARE) {
 		resv_map = inode->i_mapping->private_data;
 
-		chg = region_chg(&resv_map->regions, from, to);
+		chg = region_chg(resv_map, from, to);
 
 	} else {
 		resv_map = resv_map_alloc();
@@ -3224,7 +3228,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * else has to be done for private mappings here
 	 */
 	if (!vma || vma->vm_flags & VM_MAYSHARE)
-		region_add(&resv_map->regions, from, to);
+		region_add(resv_map, from, to);
 	return 0;
 out_err:
 	if (vma)
@@ -3240,7 +3244,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 	struct hugepage_subpool *spool = subpool_inode(inode);
 
 	if (resv_map)
-		chg = region_truncate(&resv_map->regions, offset);
+		chg = region_truncate(resv_map, offset);
 	spin_lock(&inode->i_lock);
 	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
 	spin_unlock(&inode->i_lock);
-- 
1.8.1.4



* [PATCH v2 3/6] mm, hugetlb: fix race in region tracking
  2014-01-31 17:36 [PATCH v2 0/6] mm, hugetlb: fixes and fault scalability Davidlohr Bueso
  2014-01-31 17:36 ` [PATCH v2 1/6] mm, hugetlb: unify region structure handling Davidlohr Bueso
  2014-01-31 17:36 ` [PATCH v2 2/6] mm, hugetlb: improve, cleanup resv_map parameters Davidlohr Bueso
@ 2014-01-31 17:36 ` Davidlohr Bueso
  2014-01-31 17:36 ` [PATCH v2 4/6] mm, hugetlb: remove resv_map_put Davidlohr Bueso
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Davidlohr Bueso @ 2014-01-31 17:36 UTC (permalink / raw)
  To: akpm, iamjoonsoo.kim
  Cc: riel, mgorman, mhocko, aneesh.kumar, kamezawa.hiroyu, hughd,
	david, js1304, liwanp, n-horiguchi, dhillf, rientjes, davidlohr,
	aswin, scott.norton, linux-mm, linux-kernel

From: Davidlohr Bueso <davidlohr@hp.com>

There is a race condition when the same file is mapped by different processes.
Region tracking is protected by mmap_sem and the hugetlb_instantiation_mutex.
At mmap time we don't take the hugetlb_instantiation_mutex, only the
mmap_sem (exclusively). This doesn't prevent other tasks from modifying the
region structure, so it can be modified by two processes concurrently.

To solve this, add a spinlock to resv_map and make the region manipulation
functions take it before doing the actual work.
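
The only subtle case is region_chg(): it may need to allocate a new
file_region, and kmalloc(GFP_KERNEL) can sleep, so the allocation is done
with the lock dropped and the lookup retried. A condensed sketch of the
pattern (simplified from the diff below; "need_new_region" stands in for the
actual list check):

	struct file_region *nrg = NULL;
retry:
	spin_lock(&resv->lock);
	/* ... locate the region we are before or in ... */
	if (need_new_region) {
		if (!nrg) {
			spin_unlock(&resv->lock);
			nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
			if (!nrg)
				return -ENOMEM;
			goto retry;	/* the list may have changed meanwhile */
		}
		/* ... insert nrg, compute chg, unlock and return ... */
	}
	/* ... otherwise accumulate chg from the existing regions ... */
	spin_unlock(&resv->lock);
	kfree(nrg);	/* allocated but unused: another task raced with us */
	return chg;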

Acked-by: David Gibson <david@gibson.dropbear.id.au>
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/hugetlb.h |  1 +
 mm/hugetlb.c            | 58 ++++++++++++++++++++++++++++++++-----------------
 2 files changed, 39 insertions(+), 20 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index f62c2f6..5b337cf 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -27,6 +27,7 @@ struct hugepage_subpool {
 
 struct resv_map {
 	struct kref refs;
+	spinlock_t lock;
 	struct list_head regions;
 };
 extern struct resv_map *resv_map_alloc(void);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dca03a6..3c03c7f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -135,15 +135,8 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
  * Region tracking -- allows tracking of reservations and instantiated pages
  *                    across the pages in a mapping.
  *
- * The region data structures are protected by a combination of the mmap_sem
- * and the hugetlb_instantiation_mutex.  To access or modify a region the caller
- * must either hold the mmap_sem for write, or the mmap_sem for read and
- * the hugetlb_instantiation_mutex:
- *
- *	down_write(&mm->mmap_sem);
- * or
- *	down_read(&mm->mmap_sem);
- *	mutex_lock(&hugetlb_instantiation_mutex);
+ * The region data structures are embedded into a resv_map and
+ * protected by a resv_map's lock
  */
 struct file_region {
 	struct list_head link;
@@ -156,6 +149,7 @@ static long region_add(struct resv_map *resv, long f, long t)
 	struct list_head *head = &resv->regions;
 	struct file_region *rg, *nrg, *trg;
 
+	spin_lock(&resv->lock);
 	/* Locate the region we are either in or before. */
 	list_for_each_entry(rg, head, link)
 		if (f <= rg->to)
@@ -185,15 +179,18 @@ static long region_add(struct resv_map *resv, long f, long t)
 	}
 	nrg->from = f;
 	nrg->to = t;
+	spin_unlock(&resv->lock);
 	return 0;
 }
 
 static long region_chg(struct resv_map *resv, long f, long t)
 {
 	struct list_head *head = &resv->regions;
-	struct file_region *rg, *nrg;
+	struct file_region *rg, *nrg = NULL;
 	long chg = 0;
 
+retry:
+	spin_lock(&resv->lock);
 	/* Locate the region we are before or in. */
 	list_for_each_entry(rg, head, link)
 		if (f <= rg->to)
@@ -203,15 +200,21 @@ static long region_chg(struct resv_map *resv, long f, long t)
 	 * Subtle, allocate a new region at the position but make it zero
 	 * size such that we can guarantee to record the reservation. */
 	if (&rg->link == head || t < rg->from) {
-		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
-		if (!nrg)
-			return -ENOMEM;
-		nrg->from = f;
-		nrg->to   = f;
-		INIT_LIST_HEAD(&nrg->link);
-		list_add(&nrg->link, rg->link.prev);
+		if (!nrg) {
+			spin_unlock(&resv->lock);
+			nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+			if (!nrg)
+				return -ENOMEM;
+
+			nrg->from = f;
+			nrg->to   = f;
+			INIT_LIST_HEAD(&nrg->link);
+			goto retry;
+		}
 
-		return t - f;
+		list_add(&nrg->link, rg->link.prev);
+		chg = t - f;
+		goto out_nrg;
 	}
 
 	/* Round our left edge to the current segment if it encloses us. */
@@ -224,7 +227,7 @@ static long region_chg(struct resv_map *resv, long f, long t)
 		if (&rg->link == head)
 			break;
 		if (rg->from > t)
-			return chg;
+			goto out;
 
 		/* We overlap with this area, if it extends further than
 		 * us then we must extend ourselves.  Account for its
@@ -235,6 +238,14 @@ static long region_chg(struct resv_map *resv, long f, long t)
 		}
 		chg -= rg->to - rg->from;
 	}
+
+out:
+	spin_unlock(&resv->lock);
+	/*  We already know we raced and no longer need the new region */
+	kfree(nrg);
+	return chg;
+out_nrg:
+	spin_unlock(&resv->lock);
 	return chg;
 }
 
@@ -244,12 +255,13 @@ static long region_truncate(struct resv_map *resv, long end)
 	struct file_region *rg, *trg;
 	long chg = 0;
 
+	spin_lock(&resv->lock);
 	/* Locate the region we are either in or before. */
 	list_for_each_entry(rg, head, link)
 		if (end <= rg->to)
 			break;
 	if (&rg->link == head)
-		return 0;
+		goto out;
 
 	/* If we are in the middle of a region then adjust it. */
 	if (end > rg->from) {
@@ -266,6 +278,9 @@ static long region_truncate(struct resv_map *resv, long end)
 		list_del(&rg->link);
 		kfree(rg);
 	}
+
+out:
+	spin_unlock(&resv->lock);
 	return chg;
 }
 
@@ -275,6 +290,7 @@ static long region_count(struct resv_map *resv, long f, long t)
 	struct file_region *rg;
 	long chg = 0;
 
+	spin_lock(&resv->lock);
 	/* Locate each segment we overlap with, and count that overlap. */
 	list_for_each_entry(rg, head, link) {
 		long seg_from;
@@ -290,6 +306,7 @@ static long region_count(struct resv_map *resv, long f, long t)
 
 		chg += seg_to - seg_from;
 	}
+	spin_unlock(&resv->lock);
 
 	return chg;
 }
@@ -387,6 +404,7 @@ struct resv_map *resv_map_alloc(void)
 		return NULL;
 
 	kref_init(&resv_map->refs);
+	spin_lock_init(&resv_map->lock);
 	INIT_LIST_HEAD(&resv_map->regions);
 
 	return resv_map;
-- 
1.8.1.4



* [PATCH v2 4/6] mm, hugetlb: remove resv_map_put
  2014-01-31 17:36 [PATCH v2 0/6] mm, hugetlb: fixes and fault scalability Davidlohr Bueso
                   ` (2 preceding siblings ...)
  2014-01-31 17:36 ` [PATCH v2 3/6] mm, hugetlb: fix race in region tracking Davidlohr Bueso
@ 2014-01-31 17:36 ` Davidlohr Bueso
  2014-01-31 17:36 ` [PATCH v2 5/6] mm, hugetlb: use vma_resv_map() map types Davidlohr Bueso
  2014-01-31 17:36 ` [PATCH v2 6/6] mm, hugetlb: improve page-fault scalability Davidlohr Bueso
  5 siblings, 0 replies; 10+ messages in thread
From: Davidlohr Bueso @ 2014-01-31 17:36 UTC (permalink / raw)
  To: akpm, iamjoonsoo.kim
  Cc: riel, mgorman, mhocko, aneesh.kumar, kamezawa.hiroyu, hughd,
	david, js1304, liwanp, n-horiguchi, dhillf, rientjes, davidlohr,
	aswin, scott.norton, linux-mm, linux-kernel

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

This is a preparation patch to unify the use of vma_resv_map() regardless
of the map type. It does so by removing resv_map_put(), which only works
for HPAGE_RESV_OWNER's resv_map, not for all resv_maps.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
[Updated changelog]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 mm/hugetlb.c | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3c03c7f..dfe81b4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2275,15 +2275,6 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
 		kref_get(&resv->refs);
 }
 
-static void resv_map_put(struct vm_area_struct *vma)
-{
-	struct resv_map *resv = vma_resv_map(vma);
-
-	if (!resv)
-		return;
-	kref_put(&resv->refs, resv_map_release);
-}
-
 static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 {
 	struct hstate *h = hstate_vma(vma);
@@ -2300,7 +2291,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 		reserve = (end - start) -
 			region_count(resv, start, end);
 
-		resv_map_put(vma);
+		kref_put(&resv->refs, resv_map_release);
 
 		if (reserve) {
 			hugetlb_acct_memory(h, -reserve);
@@ -3249,8 +3240,8 @@ int hugetlb_reserve_pages(struct inode *inode,
 		region_add(resv_map, from, to);
 	return 0;
 out_err:
-	if (vma)
-		resv_map_put(vma);
+	if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
+		kref_put(&resv_map->refs, resv_map_release);
 	return ret;
 }
 
-- 
1.8.1.4



* [PATCH v2 5/6] mm, hugetlb: use vma_resv_map() map types
  2014-01-31 17:36 [PATCH v2 0/6] mm, hugetlb: fixes and fault scalability Davidlohr Bueso
                   ` (3 preceding siblings ...)
  2014-01-31 17:36 ` [PATCH v2 4/6] mm, hugetlb: remove resv_map_put Davidlohr Bueso
@ 2014-01-31 17:36 ` Davidlohr Bueso
  2014-01-31 17:36 ` [PATCH v2 6/6] mm, hugetlb: improve page-fault scalability Davidlohr Bueso
  5 siblings, 0 replies; 10+ messages in thread
From: Davidlohr Bueso @ 2014-01-31 17:36 UTC (permalink / raw)
  To: akpm, iamjoonsoo.kim
  Cc: riel, mgorman, mhocko, aneesh.kumar, kamezawa.hiroyu, hughd,
	david, js1304, liwanp, n-horiguchi, dhillf, rientjes, davidlohr,
	aswin, scott.norton, linux-mm, linux-kernel

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Until now, we obtained a resv_map in one of two ways, depending on the
mapping type. This makes the code messy and hard to read. Unify it.
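
After this patch both mapping types are handled through vma_resv_map(), so a
caller such as vma_commit_reservation() reduces to (condensed from the diff
below):

	struct resv_map *resv = vma_resv_map(vma);

	if (!resv)
		return;

	idx = vma_hugecache_offset(h, vma, addr);
	region_add(resv, idx, idx + 1);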

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
[code cleanups]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 mm/hugetlb.c | 95 ++++++++++++++++++++++++++++--------------------------------
 1 file changed, 45 insertions(+), 50 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dfe81b4..7ab913c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -419,13 +419,24 @@ void resv_map_release(struct kref *ref)
 	kfree(resv_map);
 }
 
+static inline struct resv_map *inode_resv_map(struct inode *inode)
+{
+	return inode->i_mapping->private_data;
+}
+
 static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
 {
 	VM_BUG_ON(!is_vm_hugetlb_page(vma));
-	if (!(vma->vm_flags & VM_MAYSHARE))
+	if (vma->vm_flags & VM_MAYSHARE) {
+		struct address_space *mapping = vma->vm_file->f_mapping;
+		struct inode *inode = mapping->host;
+
+		return inode_resv_map(inode);
+
+	} else {
 		return (struct resv_map *)(get_vma_private_data(vma) &
 							~HPAGE_RESV_MASK);
-	return NULL;
+	}
 }
 
 static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map)
@@ -1167,48 +1178,34 @@ static void return_unused_surplus_pages(struct hstate *h,
 static long vma_needs_reservation(struct hstate *h,
 			struct vm_area_struct *vma, unsigned long addr)
 {
-	struct address_space *mapping = vma->vm_file->f_mapping;
-	struct inode *inode = mapping->host;
-
-	if (vma->vm_flags & VM_MAYSHARE) {
-		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		struct resv_map *resv = inode->i_mapping->private_data;
-
-		return region_chg(resv, idx, idx + 1);
+	struct resv_map *resv;
+	pgoff_t idx;
+	long chg;
 
-	} else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
+	resv = vma_resv_map(vma);
+	if (!resv)
 		return 1;
 
-	} else  {
-		long err;
-		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		struct resv_map *resv = vma_resv_map(vma);
+	idx = vma_hugecache_offset(h, vma, addr);
+	chg = region_chg(resv, idx, idx + 1);
 
-		err = region_chg(resv, idx, idx + 1);
-		if (err < 0)
-			return err;
-		return 0;
-	}
+	if (vma->vm_flags & VM_MAYSHARE)
+		return chg;
+	else
+		return chg < 0 ? chg : 0;
 }
 static void vma_commit_reservation(struct hstate *h,
 			struct vm_area_struct *vma, unsigned long addr)
 {
-	struct address_space *mapping = vma->vm_file->f_mapping;
-	struct inode *inode = mapping->host;
-
-	if (vma->vm_flags & VM_MAYSHARE) {
-		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		struct resv_map *resv = inode->i_mapping->private_data;
-
-		region_add(resv, idx, idx + 1);
+	struct resv_map *resv;
+	pgoff_t idx;
 
-	} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
-		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
-		struct resv_map *resv = vma_resv_map(vma);
+	resv = vma_resv_map(vma);
+	if (!resv)
+		return;
 
-		/* Mark this page used in the map. */
-		region_add(resv, idx, idx + 1);
-	}
+	idx = vma_hugecache_offset(h, vma, addr);
+	region_add(resv, idx, idx + 1);
 }
 
 static struct page *alloc_huge_page(struct vm_area_struct *vma,
@@ -2271,7 +2268,7 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
 	 * after this open call completes.  It is therefore safe to take a
 	 * new reference here without additional locking.
 	 */
-	if (resv)
+	if (resv && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
 		kref_get(&resv->refs);
 }
 
@@ -2280,23 +2277,21 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 	struct hstate *h = hstate_vma(vma);
 	struct resv_map *resv = vma_resv_map(vma);
 	struct hugepage_subpool *spool = subpool_vma(vma);
-	unsigned long reserve;
-	unsigned long start;
-	unsigned long end;
+	unsigned long reserve, start, end;
 
-	if (resv) {
-		start = vma_hugecache_offset(h, vma, vma->vm_start);
-		end = vma_hugecache_offset(h, vma, vma->vm_end);
+	if (!resv || !is_vma_resv_set(vma, HPAGE_RESV_OWNER))
+		return;
 
-		reserve = (end - start) -
-			region_count(resv, start, end);
+	start = vma_hugecache_offset(h, vma, vma->vm_start);
+	end = vma_hugecache_offset(h, vma, vma->vm_end);
 
-		kref_put(&resv->refs, resv_map_release);
+	reserve = (end - start) - region_count(resv, start, end);
 
-		if (reserve) {
-			hugetlb_acct_memory(h, -reserve);
-			hugepage_subpool_put_pages(spool, reserve);
-		}
+	kref_put(&resv->refs, resv_map_release);
+
+	if (reserve) {
+		hugetlb_acct_memory(h, -reserve);
+		hugepage_subpool_put_pages(spool, reserve);
 	}
 }
 
@@ -3189,7 +3184,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * called to make the mapping read-write. Assume !vma is a shm mapping
 	 */
 	if (!vma || vma->vm_flags & VM_MAYSHARE) {
-		resv_map = inode->i_mapping->private_data;
+		resv_map = inode_resv_map(inode);
 
 		chg = region_chg(resv_map, from, to);
 
@@ -3248,7 +3243,7 @@ out_err:
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 {
 	struct hstate *h = hstate_inode(inode);
-	struct resv_map *resv_map = inode->i_mapping->private_data;
+	struct resv_map *resv_map = inode_resv_map(inode);
 	long chg = 0;
 	struct hugepage_subpool *spool = subpool_inode(inode);
 
-- 
1.8.1.4



* [PATCH v2 6/6] mm, hugetlb: improve page-fault scalability
  2014-01-31 17:36 [PATCH v2 0/6] mm, hugetlb: fixes and fault scalability Davidlohr Bueso
                   ` (4 preceding siblings ...)
  2014-01-31 17:36 ` [PATCH v2 5/6] mm, hugetlb: use vma_resv_map() map types Davidlohr Bueso
@ 2014-01-31 17:36 ` Davidlohr Bueso
  2014-01-31 21:01   ` Andrew Morton
  2014-02-03  6:41   ` Joonsoo Kim
  5 siblings, 2 replies; 10+ messages in thread
From: Davidlohr Bueso @ 2014-01-31 17:36 UTC (permalink / raw)
  To: akpm, iamjoonsoo.kim
  Cc: riel, mgorman, mhocko, aneesh.kumar, kamezawa.hiroyu, hughd,
	david, js1304, liwanp, n-horiguchi, dhillf, rientjes, davidlohr,
	aswin, scott.norton, linux-mm, linux-kernel

From: Davidlohr Bueso <davidlohr@hp.com>

The kernel can currently only handle a single hugetlb page fault at a time.
This is due to a single mutex that serializes the entire path. This lock
protects from spurious OOM errors under conditions of low availability
of free hugepages. This problem is specific to hugepages, because it is
normal to want to use every single hugepage in the system - with normal pages
we simply assume there will always be a few spare pages which can be used
temporarily until the race is resolved.

Address this problem by using a table of mutexes, allowing a better chance of
parallelization, where each hugepage is individually serialized. The hash key
is selected depending on the mapping type. For shared mappings it consists of
the address space and the file offset being faulted, while for private ones
the mm and virtual address are used. The size of the table is selected as a
compromise between collisions and memory footprint, based on a series of
database workloads.

Large database workloads that make heavy use of hugepages can be particularly
exposed to this issue, causing start-up times to be painfully slow. This patch
reduces the startup time of a 10 Gb Oracle DB (with ~5000 faults) from 37.5 secs
to 25.7 secs. Larger workloads will naturally benefit even more.
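
Condensed, a fault picks its mutex like this (the full fault_mutex_hash() is
in the diff below):

	if (vma->vm_flags & VM_SHARED) {
		key[0] = (unsigned long) mapping;	/* address space */
		key[1] = idx;				/* file offset */
	} else {
		key[0] = (unsigned long) mm;
		key[1] = address >> huge_page_shift(h);
	}
	hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0);

	mutex_lock(&htlb_fault_mutex_table[hash & (num_fault_mutexes - 1)]);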

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---

NOTE:
The only downside to this patch, detected by Joonsoo Kim, is that a small race
is possible in private mappings: a child process (with its own mm, after COW)
can instantiate a page that is already being handled by the parent in a COW
fault. When low on pages, this can trigger spurious OOMs. I have not been able
to think of an efficient way of handling this... but do we really care about
such a tiny window? We already maintain another theoretical race with normal
pages. If not, one possible way is to keep a single mutex for private mappings
-- any workloads that *really* suffer from this scaling problem should already
use shared mappings.

 mm/hugetlb.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 72 insertions(+), 13 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7ab913c..9b77686 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -22,6 +22,7 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 #include <linux/page-isolation.h>
+#include <linux/jhash.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -53,6 +54,13 @@ static unsigned long __initdata default_hstate_size;
  */
 DEFINE_SPINLOCK(hugetlb_lock);
 
+/*
++ * Serializes faults on the same logical page.  This is used to
++ * prevent spurious OOMs when the hugepage pool is fully utilized.
++ */
+static int num_fault_mutexes;
+static struct mutex *htlb_fault_mutex_table ____cacheline_aligned_in_smp;
+
 static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
 {
 	bool free = (spool->count == 0) && (spool->used_hpages == 0);
@@ -1961,11 +1969,14 @@ static void __exit hugetlb_exit(void)
 	}
 
 	kobject_put(hugepages_kobj);
+	kfree(htlb_fault_mutex_table);
 }
 module_exit(hugetlb_exit);
 
 static int __init hugetlb_init(void)
 {
+	int i;
+
 	/* Some platform decide whether they support huge pages at boot
 	 * time. On these, such as powerpc, HPAGE_SHIFT is set to 0 when
 	 * there is no such support
@@ -1990,6 +2001,18 @@ static int __init hugetlb_init(void)
 	hugetlb_register_all_nodes();
 	hugetlb_cgroup_file_init();
 
+#ifdef CONFIG_SMP
+	num_fault_mutexes = roundup_pow_of_two(8 * num_possible_cpus());
+#else
+	num_fault_mutexes = 1;
+#endif
+	htlb_fault_mutex_table =
+		kmalloc(sizeof(struct mutex) * num_fault_mutexes, GFP_KERNEL);
+	if (!htlb_fault_mutex_table)
+		return -ENOMEM;
+
+	for (i = 0; i < num_fault_mutexes; i++)
+		mutex_init(&htlb_fault_mutex_table[i]);
 	return 0;
 }
 module_init(hugetlb_init);
@@ -2767,15 +2790,14 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
 }
 
 static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, pte_t *ptep, unsigned int flags)
+			   struct address_space *mapping, pgoff_t idx,
+			   unsigned long address, pte_t *ptep, unsigned int flags)
 {
 	struct hstate *h = hstate_vma(vma);
 	int ret = VM_FAULT_SIGBUS;
 	int anon_rmap = 0;
-	pgoff_t idx;
 	unsigned long size;
 	struct page *page;
-	struct address_space *mapping;
 	pte_t new_pte;
 	spinlock_t *ptl;
 
@@ -2790,9 +2812,6 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return ret;
 	}
 
-	mapping = vma->vm_file->f_mapping;
-	idx = vma_hugecache_offset(h, vma, address);
-
 	/*
 	 * Use page lock to guard against racing truncation
 	 * before we get page_table_lock.
@@ -2902,17 +2921,53 @@ backout_unlocked:
 	goto out;
 }
 
+#ifdef CONFIG_SMP
+static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+			    struct vm_area_struct *vma,
+			    struct address_space *mapping,
+			    pgoff_t idx, unsigned long address)
+{
+	unsigned long key[2];
+	u32 hash;
+
+	if (vma->vm_flags & VM_SHARED) {
+		key[0] = (unsigned long) mapping;
+		key[1] = idx;
+	} else {
+		key[0] = (unsigned long) mm;
+		key[1] = address >> huge_page_shift(h);
+	}
+
+	hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0);
+
+	return hash & (num_fault_mutexes - 1);
+}
+#else
+/*
+ * For uniprocesor systems we always use a single mutex, so just
+ * return 0 and avoid the hashing overhead.
+ */
+static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+			    struct vm_area_struct *vma,
+			    struct address_space *mapping,
+			    pgoff_t idx, unsigned long address)
+{
+	return 0;
+}
+#endif
+
 int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags)
 {
-	pte_t *ptep;
-	pte_t entry;
+	pte_t *ptep, entry;
 	spinlock_t *ptl;
 	int ret;
+	u32 hash;
+	pgoff_t idx;
 	struct page *page = NULL;
 	struct page *pagecache_page = NULL;
-	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
 	struct hstate *h = hstate_vma(vma);
+	struct address_space *mapping;
 
 	address &= huge_page_mask(h);
 
@@ -2931,15 +2986,20 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!ptep)
 		return VM_FAULT_OOM;
 
+	mapping = vma->vm_file->f_mapping;
+	idx = vma_hugecache_offset(h, vma, address);
+
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
-	mutex_lock(&hugetlb_instantiation_mutex);
+	hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
+	mutex_lock(&htlb_fault_mutex_table[hash]);
+
 	entry = huge_ptep_get(ptep);
 	if (huge_pte_none(entry)) {
-		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
+		ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
 		goto out_mutex;
 	}
 
@@ -3008,8 +3068,7 @@ out_ptl:
 	put_page(page);
 
 out_mutex:
-	mutex_unlock(&hugetlb_instantiation_mutex);
-
+	mutex_unlock(&htlb_fault_mutex_table[hash]);
 	return ret;
 }
 
-- 
1.8.1.4



* Re: [PATCH v2 6/6] mm, hugetlb: improve page-fault scalability
  2014-01-31 17:36 ` [PATCH v2 6/6] mm, hugetlb: improve page-fault scalability Davidlohr Bueso
@ 2014-01-31 21:01   ` Andrew Morton
  2014-01-31 21:52     ` Davidlohr Bueso
  2014-02-03  6:41   ` Joonsoo Kim
  1 sibling, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2014-01-31 21:01 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: iamjoonsoo.kim, riel, mgorman, mhocko, aneesh.kumar,
	kamezawa.hiroyu, hughd, david, js1304, liwanp, n-horiguchi,
	dhillf, rientjes, aswin, scott.norton, linux-mm, linux-kernel

On Fri, 31 Jan 2014 09:36:46 -0800 Davidlohr Bueso <davidlohr@hp.com> wrote:

> From: Davidlohr Bueso <davidlohr@hp.com>
> 
> The kernel can currently only handle a single hugetlb page fault at a time.
> This is due to a single mutex that serializes the entire path. This lock
> protects from spurious OOM errors under conditions of low of low availability
> of free hugepages. This problem is specific to hugepages, because it is
> normal to want to use every single hugepage in the system - with normal pages
> we simply assume there will always be a few spare pages which can be used
> temporarily until the race is resolved.
> 
> Address this problem by using a table of mutexes, allowing a better chance of
> parallelization, where each hugepage is individually serialized. The hash key
> is selected depending on the mapping type. For shared ones it consists of the
> address space and file offset being faulted; while for private ones the mm and
> virtual address are used. The size of the table is selected based on a compromise
> of collisions and memory footprint of a series of database workloads.
> 
> Large database workloads that make heavy use of hugepages can be particularly
> exposed to this issue, causing start-up times to be painfully slow. This patch
> reduces the startup time of a 10 Gb Oracle DB (with ~5000 faults) from 37.5 secs
> to 25.7 secs. Larger workloads will naturally benefit even more.

hm, no magic bullet.  Where's the rest of the time being spent?

> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> ---
> 
> NOTE:
> The only downside to this patch, detected by Joonsoo Kim, is that a small race
> is possible in private mappings: A child process (with its own mm, after cow)
> can instantiate a page that is already being handled by the parent in a cow
> fault. When low on pages, can trigger spurious OOMs. I have not been able to
> think of a efficient way of handling this... but do we really care about such
> a tiny window? We already maintain another theoretical race with normal pages.
> If not, one possible way to is to maintain the single hash for private mappings
> -- any workloads that *really* suffer from this scaling problem should already
> use shared mappings.
> 
>  mm/hugetlb.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 72 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 7ab913c..9b77686 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -22,6 +22,7 @@
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
>  #include <linux/page-isolation.h>
> +#include <linux/jhash.h>
>  
>  #include <asm/page.h>
>  #include <asm/pgtable.h>
> @@ -53,6 +54,13 @@ static unsigned long __initdata default_hstate_size;
>   */
>  DEFINE_SPINLOCK(hugetlb_lock);
>  
> +/*
> ++ * Serializes faults on the same logical page.  This is used to
> ++ * prevent spurious OOMs when the hugepage pool is fully utilized.
> ++ */

Strangeness.  I'll clean it up.

> +static int num_fault_mutexes;
> +static struct mutex *htlb_fault_mutex_table ____cacheline_aligned_in_smp;
> +
>  static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
>  {
>  	bool free = (spool->count == 0) && (spool->used_hpages == 0);
> @@ -1961,11 +1969,14 @@ static void __exit hugetlb_exit(void)
>  	}
>  
>  	kobject_put(hugepages_kobj);
> +	kfree(htlb_fault_mutex_table);
>  }
>  module_exit(hugetlb_exit);
>  
>  static int __init hugetlb_init(void)
>  {
> +	int i;
> +
>  	/* Some platform decide whether they support huge pages at boot
>  	 * time. On these, such as powerpc, HPAGE_SHIFT is set to 0 when
>  	 * there is no such support
> @@ -1990,6 +2001,18 @@ static int __init hugetlb_init(void)
>  	hugetlb_register_all_nodes();
>  	hugetlb_cgroup_file_init();
>  
> +#ifdef CONFIG_SMP
> +	num_fault_mutexes = roundup_pow_of_two(8 * num_possible_cpus());
> +#else
> +	num_fault_mutexes = 1;
> +#endif
> +	htlb_fault_mutex_table =
> +		kmalloc(sizeof(struct mutex) * num_fault_mutexes, GFP_KERNEL);
> +	if (!htlb_fault_mutex_table)
> +		return -ENOMEM;

If htlb_fault_mutex_table==NULL, the kernel will later oops.  Let's
just go BUG here.

> +	for (i = 0; i < num_fault_mutexes; i++)
> +		mutex_init(&htlb_fault_mutex_table[i]);
>  	return 0;
>  }
>  module_init(hugetlb_init);
> @@ -2767,15 +2790,14 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
>  }
>  
>  static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
> -			unsigned long address, pte_t *ptep, unsigned int flags)
> +			   struct address_space *mapping, pgoff_t idx,
> +			   unsigned long address, pte_t *ptep, unsigned int flags)
>  {
>  	struct hstate *h = hstate_vma(vma);
>  	int ret = VM_FAULT_SIGBUS;
>  	int anon_rmap = 0;
> -	pgoff_t idx;
>  	unsigned long size;
>  	struct page *page;
> -	struct address_space *mapping;
>  	pte_t new_pte;
>  	spinlock_t *ptl;
>  
> @@ -2790,9 +2812,6 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		return ret;
>  	}
>  
> -	mapping = vma->vm_file->f_mapping;
> -	idx = vma_hugecache_offset(h, vma, address);
> -
>  	/*
>  	 * Use page lock to guard against racing truncation
>  	 * before we get page_table_lock.
> @@ -2902,17 +2921,53 @@ backout_unlocked:
>  	goto out;
>  }
>  
> +#ifdef CONFIG_SMP
> +static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
> +			    struct vm_area_struct *vma,
> +			    struct address_space *mapping,
> +			    pgoff_t idx, unsigned long address)
> +{
> +	unsigned long key[2];
> +	u32 hash;
> +
> +	if (vma->vm_flags & VM_SHARED) {
> +		key[0] = (unsigned long) mapping;
> +		key[1] = idx;
> +	} else {
> +		key[0] = (unsigned long) mm;
> +		key[1] = address >> huge_page_shift(h);
> +	}
> +
> +	hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0);

This looks a bit overengineered to me.  What happens if we just do

	hash = jhash2((u32 *)vma, sizeof(vma)/sizeof(u32), 0);

?

> +	return hash & (num_fault_mutexes - 1);
> +}
> +#else
> +/*
> + * For uniprocesor systems we always use a single mutex, so just
> + * return 0 and avoid the hashing overhead.
> + */
> +static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
> +			    struct vm_area_struct *vma,
> +			    struct address_space *mapping,
> +			    pgoff_t idx, unsigned long address)
> +{
> +	return 0;
> +}
> +#endif
> +
>  int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  			unsigned long address, unsigned int flags)
>  {
> -	pte_t *ptep;
> -	pte_t entry;
> +	pte_t *ptep, entry;
>  	spinlock_t *ptl;
>  	int ret;
> +	u32 hash;
> +	pgoff_t idx;
>  	struct page *page = NULL;
>  	struct page *pagecache_page = NULL;
> -	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
>  	struct hstate *h = hstate_vma(vma);
> +	struct address_space *mapping;
>  
>  	address &= huge_page_mask(h);
>  
> @@ -2931,15 +2986,20 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (!ptep)
>  		return VM_FAULT_OOM;
>  
> +	mapping = vma->vm_file->f_mapping;
> +	idx = vma_hugecache_offset(h, vma, address);
> +
>  	/*
>  	 * Serialize hugepage allocation and instantiation, so that we don't
>  	 * get spurious allocation failures if two CPUs race to instantiate
>  	 * the same page in the page cache.
>  	 */
> -	mutex_lock(&hugetlb_instantiation_mutex);
> +	hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
> +	mutex_lock(&htlb_fault_mutex_table[hash]);
> +
>  	entry = huge_ptep_get(ptep);
>  	if (huge_pte_none(entry)) {
> -		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
> +		ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
>  		goto out_mutex;
>  	}
>  
> @@ -3008,8 +3068,7 @@ out_ptl:
>  	put_page(page);
>  
>  out_mutex:
> -	mutex_unlock(&hugetlb_instantiation_mutex);
> -
> +	mutex_unlock(&htlb_fault_mutex_table[hash]);
>  	return ret;

That's nice and simple.



From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm-hugetlb-improve-page-fault-scalability-fix

remove stray + characters, go BUG if hugetlb_init() kmalloc fails

Cc: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |    9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff -puN mm/hugetlb.c~mm-hugetlb-improve-page-fault-scalability-fix mm/hugetlb.c
--- a/mm/hugetlb.c~mm-hugetlb-improve-page-fault-scalability-fix
+++ a/mm/hugetlb.c
@@ -55,9 +55,9 @@ static unsigned long __initdata default_
 DEFINE_SPINLOCK(hugetlb_lock);
 
 /*
-+ * Serializes faults on the same logical page.  This is used to
-+ * prevent spurious OOMs when the hugepage pool is fully utilized.
-+ */
+ * Serializes faults on the same logical page.  This is used to
+ * prevent spurious OOMs when the hugepage pool is fully utilized.
+ */
 static int num_fault_mutexes;
 static struct mutex *htlb_fault_mutex_table ____cacheline_aligned_in_smp;
 
@@ -2008,8 +2008,7 @@ static int __init hugetlb_init(void)
 #endif
 	htlb_fault_mutex_table =
 		kmalloc(sizeof(struct mutex) * num_fault_mutexes, GFP_KERNEL);
-	if (!htlb_fault_mutex_table)
-		return -ENOMEM;
+	BUG_ON(!htlb_fault_mutex_table);
 
 	for (i = 0; i < num_fault_mutexes; i++)
 		mutex_init(&htlb_fault_mutex_table[i]);
_



* Re: [PATCH v2 6/6] mm, hugetlb: improve page-fault scalability
  2014-01-31 21:01   ` Andrew Morton
@ 2014-01-31 21:52     ` Davidlohr Bueso
  0 siblings, 0 replies; 10+ messages in thread
From: Davidlohr Bueso @ 2014-01-31 21:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: iamjoonsoo.kim, riel, mgorman, mhocko, aneesh.kumar,
	kamezawa.hiroyu, hughd, david, js1304, liwanp, n-horiguchi,
	dhillf, rientjes, aswin, scott.norton, linux-mm, linux-kernel

On Fri, 2014-01-31 at 13:01 -0800, Andrew Morton wrote:
> On Fri, 31 Jan 2014 09:36:46 -0800 Davidlohr Bueso <davidlohr@hp.com> wrote:
> 
> > From: Davidlohr Bueso <davidlohr@hp.com>
> > 
> > The kernel can currently only handle a single hugetlb page fault at a time.
> > This is due to a single mutex that serializes the entire path. This lock
> > protects from spurious OOM errors under conditions of low of low availability
> > of free hugepages. This problem is specific to hugepages, because it is
> > normal to want to use every single hugepage in the system - with normal pages
> > we simply assume there will always be a few spare pages which can be used
> > temporarily until the race is resolved.
> > 
> > Address this problem by using a table of mutexes, allowing a better chance of
> > parallelization, where each hugepage is individually serialized. The hash key
> > is selected depending on the mapping type. For shared ones it consists of the
> > address space and file offset being faulted; while for private ones the mm and
> > virtual address are used. The size of the table is selected based on a compromise
> > of collisions and memory footprint of a series of database workloads.
> > 
> > Large database workloads that make heavy use of hugepages can be particularly
> > exposed to this issue, causing start-up times to be painfully slow. This patch
> > reduces the startup time of a 10 Gb Oracle DB (with ~5000 faults) from 37.5 secs
> > to 25.7 secs. Larger workloads will naturally benefit even more.
> 
> hm, no magic bullet.  Where's the rest of the time being spent?

Parallel to this mutex issue, I'm seeing enormous amounts of time being
spent when zeroing pages. This of course also affects thp. For instance,
with these patches applied, I now see things like the following when
instantiating 10Gbs worth of hugepages:

-  21.44%           oracle  [kernel.kallsyms]     [k] clear_page_c
   - clear_page_c
      - 98.51% hugetlb_no_page
           hugetlb_fault
           __handle_mm_fault
           handle_mm_fault
           __do_page_fault
           do_page_fault
         + page_fault
      + 1.20% get_page_from_freelist

And this can only get worse: 10Gb DBs are rather small nowadays. Other
than not dirtying the cacheline with single-usage variables (rep stos
family), I haven't found a solution/workaround yet.

> >  
> > +#ifdef CONFIG_SMP
> > +	num_fault_mutexes = roundup_pow_of_two(8 * num_possible_cpus());
> > +#else
> > +	num_fault_mutexes = 1;
> > +#endif
> > +	htlb_fault_mutex_table =
> > +		kmalloc(sizeof(struct mutex) * num_fault_mutexes, GFP_KERNEL);
> > +	if (!htlb_fault_mutex_table)
> > +		return -ENOMEM;
> 
> If htlb_fault_mutex_table==NULL, the kernel will later oops.  Let's
> just go BUG here.

Good point.

> 
> > +	for (i = 0; i < num_fault_mutexes; i++)
> > +		mutex_init(&htlb_fault_mutex_table[i]);
> >  	return 0;
> >  }
> >  module_init(hugetlb_init);
> > @@ -2767,15 +2790,14 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
> >  }
> >  
> >  static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
> > -			unsigned long address, pte_t *ptep, unsigned int flags)
> > +			   struct address_space *mapping, pgoff_t idx,
> > +			   unsigned long address, pte_t *ptep, unsigned int flags)
> >  {
> >  	struct hstate *h = hstate_vma(vma);
> >  	int ret = VM_FAULT_SIGBUS;
> >  	int anon_rmap = 0;
> > -	pgoff_t idx;
> >  	unsigned long size;
> >  	struct page *page;
> > -	struct address_space *mapping;
> >  	pte_t new_pte;
> >  	spinlock_t *ptl;
> >  
> > @@ -2790,9 +2812,6 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >  		return ret;
> >  	}
> >  
> > -	mapping = vma->vm_file->f_mapping;
> > -	idx = vma_hugecache_offset(h, vma, address);
> > -
> >  	/*
> >  	 * Use page lock to guard against racing truncation
> >  	 * before we get page_table_lock.
> > @@ -2902,17 +2921,53 @@ backout_unlocked:
> >  	goto out;
> >  }
> >  
> > +#ifdef CONFIG_SMP
> > +static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
> > +			    struct vm_area_struct *vma,
> > +			    struct address_space *mapping,
> > +			    pgoff_t idx, unsigned long address)
> > +{
> > +	unsigned long key[2];
> > +	u32 hash;
> > +
> > +	if (vma->vm_flags & VM_SHARED) {
> > +		key[0] = (unsigned long) mapping;
> > +		key[1] = idx;
> > +	} else {
> > +		key[0] = (unsigned long) mm;
> > +		key[1] = address >> huge_page_shift(h);
> > +	}
> > +
> > +	hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0);
> 
> This looks a bit overengineered to me.  What happens if we just do

IMO the key method better documents the hashing.

> 	hash = jhash2((u32 *)vma, sizeof(vma)/sizeof(u32), 0);
> 
> ?
> 
> > +	return hash & (num_fault_mutexes - 1);
> > +}
> > +#else
> > +/*
> > + * For uniprocesor systems we always use a single mutex, so just
> > + * return 0 and avoid the hashing overhead.
> > + */
> > +static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
> > +			    struct vm_area_struct *vma,
> > +			    struct address_space *mapping,
> > +			    pgoff_t idx, unsigned long address)
> > +{
> > +	return 0;
> > +}
> > +#endif
> > +
> >  int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  			unsigned long address, unsigned int flags)
> >  {
> > -	pte_t *ptep;
> > -	pte_t entry;
> > +	pte_t *ptep, entry;
> >  	spinlock_t *ptl;
> >  	int ret;
> > +	u32 hash;
> > +	pgoff_t idx;
> >  	struct page *page = NULL;
> >  	struct page *pagecache_page = NULL;
> > -	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
> >  	struct hstate *h = hstate_vma(vma);
> > +	struct address_space *mapping;
> >  
> >  	address &= huge_page_mask(h);
> >  
> > @@ -2931,15 +2986,20 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	if (!ptep)
> >  		return VM_FAULT_OOM;
> >  
> > +	mapping = vma->vm_file->f_mapping;
> > +	idx = vma_hugecache_offset(h, vma, address);
> > +
> >  	/*
> >  	 * Serialize hugepage allocation and instantiation, so that we don't
> >  	 * get spurious allocation failures if two CPUs race to instantiate
> >  	 * the same page in the page cache.
> >  	 */
> > -	mutex_lock(&hugetlb_instantiation_mutex);
> > +	hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
> > +	mutex_lock(&htlb_fault_mutex_table[hash]);
> > +
> >  	entry = huge_ptep_get(ptep);
> >  	if (huge_pte_none(entry)) {
> > -		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
> > +		ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
> >  		goto out_mutex;
> >  	}
> >  
> > @@ -3008,8 +3068,7 @@ out_ptl:
> >  	put_page(page);
> >  
> >  out_mutex:
> > -	mutex_unlock(&hugetlb_instantiation_mutex);
> > -
> > +	mutex_unlock(&htlb_fault_mutex_table[hash]);
> >  	return ret;
> 
> That's nice and simple.
> 
> 
> 
> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: mm-hugetlb-improve-page-fault-scalability-fix
> 
> remove stray + characters, go BUG if hugetlb_init() kmalloc fails
> 
> Cc: Davidlohr Bueso <davidlohr@hp.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---

Looks good, thanks Andrew!



* Re: [PATCH v2 6/6] mm, hugetlb: improve page-fault scalability
  2014-01-31 17:36 ` [PATCH v2 6/6] mm, hugetlb: improve page-fault scalability Davidlohr Bueso
  2014-01-31 21:01   ` Andrew Morton
@ 2014-02-03  6:41   ` Joonsoo Kim
  1 sibling, 0 replies; 10+ messages in thread
From: Joonsoo Kim @ 2014-02-03  6:41 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: akpm, riel, mgorman, mhocko, aneesh.kumar, kamezawa.hiroyu,
	hughd, david, liwanp, n-horiguchi, dhillf, rientjes, aswin,
	scott.norton, linux-mm, linux-kernel

On Fri, Jan 31, 2014 at 09:36:46AM -0800, Davidlohr Bueso wrote:
> From: Davidlohr Bueso <davidlohr@hp.com>
> 
> The kernel can currently only handle a single hugetlb page fault at a time.
> This is due to a single mutex that serializes the entire path. This lock
> protects from spurious OOM errors under conditions of low of low availability
> of free hugepages. This problem is specific to hugepages, because it is
> normal to want to use every single hugepage in the system - with normal pages
> we simply assume there will always be a few spare pages which can be used
> temporarily until the race is resolved.
> 
> Address this problem by using a table of mutexes, allowing a better chance of
> parallelization, where each hugepage is individually serialized. The hash key
> is selected depending on the mapping type. For shared ones it consists of the
> address space and file offset being faulted; while for private ones the mm and
> virtual address are used. The size of the table is selected based on a compromise
> of collisions and memory footprint of a series of database workloads.

Hello,

Thanks for doing this patchset. :)
Just one question!
Why do we need a separate hash key depending on the mapping type?

Thanks.
