* [PATCH 0/2] hugepage: optimize page fault path locking
@ 2013-07-26 14:27 ` Davidlohr Bueso
  0 siblings, 0 replies; 32+ messages in thread
From: Davidlohr Bueso @ 2013-07-26 14:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michel Lespinasse, Mel Gorman, Michal Hocko,
	AneeshKumarK.V, KAMEZAWA Hiroyuki, Hillf Danton, Hugh Dickins,
	Joonsoo Kim, David Gibson, Eric B Munson, Anton Blanchard,
	Konstantin Khlebnikov, linux-mm, linux-kernel, Davidlohr Bueso

This patchset attempts to reduce the amount of contention we impose
on the hugetlb_instantiation_mutex by replacing the global mutex with
a table of mutexes, selected based on a hash. The original discussion can 
be found here: http://lkml.org/lkml/2013/7/12/428

Patch 1: Allows the file region tracking list to be serialized by its own rwsem.
This is necessary because the next patch allows concurrent hugepage fault paths,
getting rid of the hugetlb_instantiation_mutex - which currently protects the
chains of struct file_region in inode->i_mapping->private_list (VM_MAYSHARE) or
vma_resv_map(vma)->regions (!VM_MAYSHARE).

Patch 2: Originally from David Gibson; for some reason it never made it into
the kernel. Further cleanups and enhancements from Anton Blanchard and myself.
Details of how the hash key is selected are in the patch.
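
To make the idea concrete, a rough userspace sketch of the hashed-mutex
scheme (all names here are invented for illustration; the actual kernel
changes are in patch 2): unrelated pages hash to different mutexes and can
be instantiated concurrently, while two faults racing on the same page
still serialize on the same mutex.

#include <pthread.h>
#include <stdint.h>

#define NUM_FAULT_MUTEXES 64	/* power of two, as in the patch */

static pthread_mutex_t fault_mutex_table[NUM_FAULT_MUTEXES];

static void fault_mutex_init(void)
{
	int i;

	for (i = 0; i < NUM_FAULT_MUTEXES; i++)
		pthread_mutex_init(&fault_mutex_table[i], NULL);
}

/* Toy hash of (mapping, page index); the real patch uses jhash2(). */
static unsigned int fault_hash(const void *mapping, unsigned long idx)
{
	uintptr_t k = (uintptr_t)mapping ^ (idx * 2654435761UL);

	return (unsigned int)(k & (NUM_FAULT_MUTEXES - 1));
}

static void fault_one_page(const void *mapping, unsigned long idx)
{
	pthread_mutex_t *m = &fault_mutex_table[fault_hash(mapping, idx)];

	pthread_mutex_lock(m);
	/* allocate/clear/instantiate the hugepage for (mapping, idx) */
	pthread_mutex_unlock(m);
}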

Davidlohr Bueso (2):
  hugepage: protect file regions with rwsem
  hugepage: allow parallelization of the hugepage fault path

 mm/hugetlb.c | 134 ++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 106 insertions(+), 28 deletions(-)

-- 
1.7.11.7


* [PATCH 1/2] hugepage: protect file regions with rwsem
  2013-07-26 14:27 ` Davidlohr Bueso
@ 2013-07-26 14:27   ` Davidlohr Bueso
  -1 siblings, 0 replies; 32+ messages in thread
From: Davidlohr Bueso @ 2013-07-26 14:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michel Lespinasse, Mel Gorman, Michal Hocko,
	AneeshKumarK.V, KAMEZAWA Hiroyuki, Hillf Danton, Hugh Dickins,
	Joonsoo Kim, David Gibson, Eric B Munson, Anton Blanchard,
	Konstantin Khlebnikov, linux-mm, linux-kernel, Davidlohr Bueso

The next patch will introduce parallel hugepage fault paths, replacing
the global hugetlb_instantiation_mutex with a table of hashed mutexes.
In order to prevent races, introduce an rw-semaphore that serializes
access to the file region tracking structures.

Thanks to Konstantin Khlebnikov and Joonsoo Kim for pointing this issue out.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
---
 mm/hugetlb.c | 47 +++++++++++++++++++++++++++++++++--------------
 1 file changed, 33 insertions(+), 14 deletions(-)
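
As a rough userspace illustration of the locking rule introduced below
(names invented for the example; the kernel code in the diff uses an
rw_semaphore): functions that modify the region list take the lock
exclusively, while read-only walkers such as region_count() may share it.

#include <pthread.h>

static pthread_rwlock_t region_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Writers (region_add/chg/truncate in the patch) take the lock exclusively. */
static void region_modify_example(void)
{
	pthread_rwlock_wrlock(&region_lock);
	/* walk and modify the file_region list */
	pthread_rwlock_unlock(&region_lock);
}

/* Read-only walkers (region_count in the patch) may run concurrently. */
static long region_count_example(void)
{
	long chg = 0;

	pthread_rwlock_rdlock(&region_lock);
	/* walk the list and accumulate overlap into chg */
	pthread_rwlock_unlock(&region_lock);
	return chg;
}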

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 83aff0a..4c3f4f0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -134,16 +134,12 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
  * Region tracking -- allows tracking of reservations and instantiated pages
  *                    across the pages in a mapping.
  *
- * The region data structures are protected by a combination of the mmap_sem
- * and the hugetlb_instantion_mutex.  To access or modify a region the caller
- * must either hold the mmap_sem for write, or the mmap_sem for read and
- * the hugetlb_instantiation mutex:
- *
- *	down_write(&mm->mmap_sem);
- * or
- *	down_read(&mm->mmap_sem);
- *	mutex_lock(&hugetlb_instantiation_mutex);
+ * With the parallelization of the hugepage fault path, the region_rwsem replaces
+ * the original hugetlb_instantiation_mutex, serializing access to the chains of
+ * file regions.
  */
+DECLARE_RWSEM(region_rwsem);
+
 struct file_region {
 	struct list_head link;
 	long from;
@@ -154,6 +150,8 @@ static long region_add(struct list_head *head, long f, long t)
 {
 	struct file_region *rg, *nrg, *trg;
 
+	down_write(&region_rwsem);
+
 	/* Locate the region we are either in or before. */
 	list_for_each_entry(rg, head, link)
 		if (f <= rg->to)
@@ -183,6 +181,8 @@ static long region_add(struct list_head *head, long f, long t)
 	}
 	nrg->from = f;
 	nrg->to = t;
+
+	up_write(&region_rwsem);
 	return 0;
 }
 
@@ -191,6 +191,8 @@ static long region_chg(struct list_head *head, long f, long t)
 	struct file_region *rg, *nrg;
 	long chg = 0;
 
+	down_write(&region_rwsem);
+
 	/* Locate the region we are before or in. */
 	list_for_each_entry(rg, head, link)
 		if (f <= rg->to)
@@ -201,17 +203,21 @@ static long region_chg(struct list_head *head, long f, long t)
 	 * size such that we can guarantee to record the reservation. */
 	if (&rg->link == head || t < rg->from) {
 		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
-		if (!nrg)
-			return -ENOMEM;
+		if (!nrg) {
+			chg = -ENOMEM;
+			goto done_write;
+		}
 		nrg->from = f;
 		nrg->to   = f;
 		INIT_LIST_HEAD(&nrg->link);
 		list_add(&nrg->link, rg->link.prev);
 
-		return t - f;
+		chg = t - f;
+		goto done_write;
 	}
 
 	/* Round our left edge to the current segment if it encloses us. */
+	downgrade_write(&region_rwsem);
 	if (f > rg->from)
 		f = rg->from;
 	chg = t - f;
@@ -221,7 +227,7 @@ static long region_chg(struct list_head *head, long f, long t)
 		if (&rg->link == head)
 			break;
 		if (rg->from > t)
-			return chg;
+			break;
 
 		/* We overlap with this area, if it extends further than
 		 * us then we must extend ourselves.  Account for its
@@ -232,6 +238,11 @@ static long region_chg(struct list_head *head, long f, long t)
 		}
 		chg -= rg->to - rg->from;
 	}
+
+	up_read(&region_rwsem);
+	return chg;
+done_write:
+	up_write(&region_rwsem);
 	return chg;
 }
 
@@ -240,12 +251,14 @@ static long region_truncate(struct list_head *head, long end)
 	struct file_region *rg, *trg;
 	long chg = 0;
 
+	down_write(&region_rwsem);
+
 	/* Locate the region we are either in or before. */
 	list_for_each_entry(rg, head, link)
 		if (end <= rg->to)
 			break;
 	if (&rg->link == head)
-		return 0;
+		goto done;
 
 	/* If we are in the middle of a region then adjust it. */
 	if (end > rg->from) {
@@ -262,6 +275,9 @@ static long region_truncate(struct list_head *head, long end)
 		list_del(&rg->link);
 		kfree(rg);
 	}
+
+done:
+	up_write(&region_rwsem);
 	return chg;
 }
 
@@ -270,6 +286,8 @@ static long region_count(struct list_head *head, long f, long t)
 	struct file_region *rg;
 	long chg = 0;
 
+	down_read(&region_rwsem);
+
 	/* Locate each segment we overlap with, and count that overlap. */
 	list_for_each_entry(rg, head, link) {
 		long seg_from;
@@ -286,6 +304,7 @@ static long region_count(struct list_head *head, long f, long t)
 		chg += seg_to - seg_from;
 	}
 
+	up_read(&region_rwsem);
 	return chg;
 }
 
-- 
1.7.11.7



* [PATCH 2/2] hugepage: allow parallelization of the hugepage fault path
  2013-07-26 14:27 ` Davidlohr Bueso
@ 2013-07-26 14:27   ` Davidlohr Bueso
  -1 siblings, 0 replies; 32+ messages in thread
From: Davidlohr Bueso @ 2013-07-26 14:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michel Lespinasse, Mel Gorman, Michal Hocko,
	AneeshKumarK.V, KAMEZAWA Hiroyuki, Hillf Danton, Hugh Dickins,
	Joonsoo Kim, David Gibson, Eric B Munson, Anton Blanchard,
	Konstantin Khlebnikov, linux-mm, linux-kernel, Davidlohr Bueso

From: David Gibson <david@gibson.dropbear.id.au>

At present, the page fault path for hugepages is serialized by a
single mutex.  This is used to avoid spurious out-of-memory conditions
when the hugepage pool is fully utilized (two processes or threads can
race to instantiate the same mapping with the last hugepage from the
pool, the race loser returning VM_FAULT_OOM).  This problem is
specific to hugepages, because it is normal to want to use every
single hugepage in the system - with normal pages we simply assume
there will always be a few spare pages which can be used temporarily
until the race is resolved.

Unfortunately this serialization also means that clearing of hugepages
cannot be parallelized across multiple CPUs, which can lead to very
long process startup times when using large numbers of hugepages.

This patch improves the situation by replacing the single mutex with a
table of mutexes, selected based on a hash of which page in the file
we're instantiating. For shared mappings, the hash key is derived from
the address space and the file offset being faulted. Similarly, for
private mappings, the mm and virtual address are used.

From: Anton Blanchard <anton@samba.org>
[https://lkml.org/lkml/2011/7/15/31]
Forward ported and made a few changes:

- Use the Jenkins hash to scatter the hash values, which is better than
  using just the low bits.

- Always round num_fault_mutexes to a power of two to avoid an
  expensive modulus in the hash calculation.

I also tested this patch on a large POWER7 box using a simple parallel
fault testcase:

http://ozlabs.org/~anton/junkcode/parallel_fault.c

Command line options:

parallel_fault <nr_threads> <size in kB> <skip in kB>

First the time taken to fault 128GB of 16MB hugepages:

40.68 seconds

Now the same test with 64 concurrent threads:
39.34 seconds

Hardly any speedup. Finally the 64 concurrent threads test with
this patch applied:
0.85 seconds

We go from 40.68 seconds to 0.85 seconds, an improvement of 47.9x.

This was tested with the libhugetlbfs test suite, and the PASS/FAIL
count was the same before and after this patch.

From: Davidlohr Bueso <davidlohr.bueso@hp.com>
- Cleaned up and forward ported to Linus' latest.
- Cache-aligned the mutexes.
- Keep non-SMP systems using a single mutex.

It was found that this mutex can become quite contended during the early
phases of large databases that make use of huge pages - for instance,
startup and initial runs. One clear example is a 1.5Gb Oracle database,
where lockstat reports that this mutex can be one of the top 5 most
contended locks in the kernel during the first few minutes:

             hugetlb_instantiation_mutex:   10678     10678
             ---------------------------
             hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
             ---------------------------
             hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340

contentions:          10678
acquisitions:         99476
waittime-total: 76888911.01 us

With this patch we see much less contention and wait time:

              &htlb_fault_mutex_table[i]:   383
              --------------------------
              &htlb_fault_mutex_table[i]    383   [<ffffffff8115e27b>] hugetlb_fault+0x1eb/0x440
              --------------------------
              &htlb_fault_mutex_table[i]    383   [<ffffffff8115e27b>] hugetlb_fault+0x1eb/0x440

contentions:        383
acquisitions:    120546
waittime-total: 1381.72 us

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Anton Blanchard <anton@samba.org>
Tested-by: Eric B Munson <emunson@mgebm.net>
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
---
 mm/hugetlb.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 73 insertions(+), 14 deletions(-)
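
As a side note on the power-of-two rounding mentioned above, a small
standalone check (illustrative only, not part of the patch) shows why
masking with (num_fault_mutexes - 1) can replace the modulus once the
table size is a power of two:

#include <assert.h>
#include <stdint.h>

int main(void)
{
	const uint32_t n = 64;	/* any power of two */
	uint32_t hash;

	for (hash = 0; hash < 100000; hash++)
		assert((hash & (n - 1)) == (hash % n));
	return 0;
}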

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4c3f4f0..1426e44 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -21,6 +21,7 @@
 #include <linux/rmap.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/jhash.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -52,6 +53,13 @@ static unsigned long __initdata default_hstate_size;
  */
 DEFINE_SPINLOCK(hugetlb_lock);
 
+/*
+ * Serializes faults on the same logical page.  This is used to
+ * prevent spurious OOMs when the hugepage pool is fully utilized.
+ */
+static int num_fault_mutexes;
+static struct mutex *htlb_fault_mutex_table ____cacheline_aligned_in_smp;
+
 static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
 {
 	bool free = (spool->count == 0) && (spool->used_hpages == 0);
@@ -1915,13 +1923,15 @@ static void __exit hugetlb_exit(void)
 	for_each_hstate(h) {
 		kobject_put(hstate_kobjs[hstate_index(h)]);
 	}
-
+	kfree(htlb_fault_mutex_table);
 	kobject_put(hugepages_kobj);
 }
 module_exit(hugetlb_exit);
 
 static int __init hugetlb_init(void)
 {
+	int i;
+
 	/* Some platform decide whether they support huge pages at boot
 	 * time. On these, such as powerpc, HPAGE_SHIFT is set to 0 when
 	 * there is no such support
@@ -1946,6 +1956,19 @@ static int __init hugetlb_init(void)
 	hugetlb_register_all_nodes();
 	hugetlb_cgroup_file_init();
 
+#ifdef CONFIG_SMP
+	num_fault_mutexes = roundup_pow_of_two(2 * num_possible_cpus());
+#else
+	num_fault_mutexes = 1;
+#endif
+	htlb_fault_mutex_table =
+		kmalloc(sizeof(struct mutex) * num_fault_mutexes, GFP_KERNEL);
+	if (!htlb_fault_mutex_table)
+		return -ENOMEM;
+
+	for (i = 0; i < num_fault_mutexes; i++)
+		mutex_init(&htlb_fault_mutex_table[i]);
+
 	return 0;
 }
 module_init(hugetlb_init);
@@ -2728,15 +2751,14 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
 }
 
 static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, pte_t *ptep, unsigned int flags)
+			   struct address_space *mapping, pgoff_t idx,
+			   unsigned long address, pte_t *ptep, unsigned int flags)
 {
 	struct hstate *h = hstate_vma(vma);
 	int ret = VM_FAULT_SIGBUS;
 	int anon_rmap = 0;
-	pgoff_t idx;
 	unsigned long size;
 	struct page *page;
-	struct address_space *mapping;
 	pte_t new_pte;
 
 	/*
@@ -2750,9 +2772,6 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return ret;
 	}
 
-	mapping = vma->vm_file->f_mapping;
-	idx = vma_hugecache_offset(h, vma, address);
-
 	/*
 	 * Use page lock to guard against racing truncation
 	 * before we get page_table_lock.
@@ -2858,15 +2877,51 @@ backout_unlocked:
 	goto out;
 }
 
+#ifdef CONFIG_SMP
+static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+			    struct vm_area_struct *vma,
+			    struct address_space *mapping,
+			    pgoff_t idx, unsigned long address)
+{
+	unsigned long key[2];
+	u32 hash;
+
+	if (vma->vm_flags & VM_SHARED) {
+		key[0] = (unsigned long)mapping;
+		key[1] = idx;
+	} else {
+		key[0] = (unsigned long)mm;
+		key[1] = address >> huge_page_shift(h);
+	}
+
+	hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0);
+
+	return hash & (num_fault_mutexes - 1);
+}
+#else
+/*
+ * For uniprocesor systems we always use a single mutex, so just
+ * return 0 and avoid the hashing overhead.
+ */
+static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+			    struct vm_area_struct *vma,
+			    struct address_space *mapping,
+			    pgoff_t idx, unsigned long address)
+{
+	return 0;
+}
+#endif
+
 int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags)
 {
-	pte_t *ptep;
-	pte_t entry;
+	pgoff_t idx;
 	int ret;
+	u32 hash;
+	pte_t *ptep, entry;
 	struct page *page = NULL;
+	struct address_space *mapping;
 	struct page *pagecache_page = NULL;
-	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
 	struct hstate *h = hstate_vma(vma);
 
 	address &= huge_page_mask(h);
@@ -2886,15 +2941,20 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!ptep)
 		return VM_FAULT_OOM;
 
+	mapping = vma->vm_file->f_mapping;
+	idx = vma_hugecache_offset(h, vma, address);
+
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
-	mutex_lock(&hugetlb_instantiation_mutex);
+	hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
+	mutex_lock(&htlb_fault_mutex_table[hash]);
+
 	entry = huge_ptep_get(ptep);
 	if (huge_pte_none(entry)) {
-		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
+		ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
 		goto out_mutex;
 	}
 
@@ -2962,8 +3022,7 @@ out_page_table_lock:
 	put_page(page);
 
 out_mutex:
-	mutex_unlock(&hugetlb_instantiation_mutex);
-
+	mutex_unlock(&htlb_fault_mutex_table[hash]);
 	return ret;
 }
 
-- 
1.7.11.7



* Re: [PATCH 2/2] hugepage: allow parallelization of the hugepage fault path
  2013-07-26 14:27   ` Davidlohr Bueso
@ 2013-07-28  6:00     ` Hillf Danton
  -1 siblings, 0 replies; 32+ messages in thread
From: Hillf Danton @ 2013-07-28  6:00 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Andrew Morton, Rik van Riel, Michel Lespinasse, Mel Gorman,
	Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Joonsoo Kim, David Gibson, Eric B Munson, Anton Blanchard,
	Konstantin Khlebnikov, linux-mm, linux-kernel

On Fri, Jul 26, 2013 at 10:27 PM, Davidlohr Bueso
<davidlohr.bueso@hp.com> wrote:
> From: David Gibson <david@gibson.dropbear.id.au>
>
> At present, the page fault path for hugepages is serialized by a
> single mutex.  This is used to avoid spurious out-of-memory conditions
> when the hugepage pool is fully utilized (two processes or threads can
> race to instantiate the same mapping with the last hugepage from the
> pool, the race loser returning VM_FAULT_OOM).  This problem is
> specific to hugepages, because it is normal to want to use every
> single hugepage in the system - with normal pages we simply assume
> there will always be a few spare pages which can be used temporarily
> until the race is resolved.
>
> Unfortunately this serialization also means that clearing of hugepages
> cannot be parallelized across multiple CPUs, which can lead to very
> long process startup times when using large numbers of hugepages.
>
> This patch improves the situation by replacing the single mutex with a
> table of mutexes, selected based on a hash, which allows us to know
> which page in the file we're instantiating. For shared mappings, the
> hash key is selected based on the address space and file offset being faulted.
> Similarly, for private mappings, the mm and virtual address are used.
>
> From: Anton Blanchard <anton@samba.org>
> [https://lkml.org/lkml/2011/7/15/31]
> Forward ported and made a few changes:
>
> - Use the Jenkins hash to scatter the hash, better than using just the
>   low bits.
>
> - Always round num_fault_mutexes to a power of two to avoid an
>   expensive modulus in the hash calculation.
>
> I also tested this patch on a large POWER7 box using a simple parallel
> fault testcase:
>
> http://ozlabs.org/~anton/junkcode/parallel_fault.c
>
> Command line options:
>
> parallel_fault <nr_threads> <size in kB> <skip in kB>
>
> First the time taken to fault 128GB of 16MB hugepages:
>
> 40.68 seconds
>
> Now the same test with 64 concurrent threads:
> 39.34 seconds
>
> Hardly any speedup. Finally the 64 concurrent threads test with
> this patch applied:
> 0.85 seconds
>
> We go from 40.68 seconds to 0.85 seconds, an improvement of 47.9x
>
> This was tested with the libhugetlbfs test suite, and the PASS/FAIL
> count was the same before and after this patch.
>
> From: Davidlohr Bueso <davidlohr.bueso@hp.com>
> - Cleaned up and forward ported to Linus' latest.
> - Cache aligned mutexes.
> - Keep non SMP systems using a single mutex.
>
> It was found that this mutex can become quite contended
> during the early phases of large databases which make use of huge pages - for instance
> startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat
> reports that this mutex can be one of the top 5 most contended locks in the kernel during
> the first few minutes:
>
>              hugetlb_instantiation_mutex:   10678     10678
>              ---------------------------
>              hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
>              ---------------------------
>              hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
>
> contentions:          10678
> acquisitions:         99476
> waittime-total: 76888911.01 us
>
> With this patch we see a much less contention and wait time:
>
>               &htlb_fault_mutex_table[i]:   383
>               --------------------------
>               &htlb_fault_mutex_table[i]    383   [<ffffffff8115e27b>] hugetlb_fault+0x1eb/0x440
>               --------------------------
>               &htlb_fault_mutex_table[i]    383   [<ffffffff8115e27b>] hugetlb_fault+0x1eb/0x440
>
> contentions:        383
> acquisitions:    120546
> waittime-total: 1381.72 us
>
I see the same figures in the message of Jul 18,
contentions:          10678
acquisitions:         99476
waittime-total: 76888911.01 us
and
contentions:        383
acquisitions:    120546
waittime-total: 1381.72 us
if I copy and paste correctly.

Were they measured with the global semaphore introduced in 1/8 for
serializing changes in file regions?


* Re: [PATCH 0/2] hugepage: optimize page fault path locking
  2013-07-26 14:27 ` Davidlohr Bueso
@ 2013-07-29  6:18   ` Joonsoo Kim
  -1 siblings, 0 replies; 32+ messages in thread
From: Joonsoo Kim @ 2013-07-29  6:18 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Andrew Morton, Rik van Riel, Michel Lespinasse, Mel Gorman,
	Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki, Hillf Danton,
	Hugh Dickins, David Gibson, Eric B Munson, Anton Blanchard,
	Konstantin Khlebnikov, linux-mm, linux-kernel

On Fri, Jul 26, 2013 at 07:27:23AM -0700, Davidlohr Bueso wrote:
> This patchset attempts to reduce the amount of contention we impose
> on the hugetlb_instantiation_mutex by replacing the global mutex with
> a table of mutexes, selected based on a hash. The original discussion can 
> be found here: http://lkml.org/lkml/2013/7/12/428

Hello, Davidlohr.

I recently sent a patchset which removes the hugetlb_instantiation_mutex
entirely ('mm, hugetlb: remove a hugetlb_instantiation_mutex').
This patchset can be found here: https://lkml.org/lkml/2013/7/29/54

If possible, could you review and test it to see whether your problem
disappears with it or not?

Thanks.

> 
> Patch 1: Allows the file region tracking list to be serialized by its own rwsem.
> This is necessary because the next patch allows concurrent hugepage fault paths,
> getting rid of the hugetlb_instantiation_mutex - which protects chains of struct 
> file_regionin inode->i_mapping->private_list (VM_MAYSHARE) or vma_resv_map(vma)->regions 
> (!VM_MAYSHARE).
> 
> Patch 2: From David Gibson, for some reason never made it into the kernel. 
> Further cleanups and enhancements from Anton Blanchard and myself.
> Details of how the hash key is selected is in the patch.
> 
> Davidlohr Bueso (2):
>   hugepage: protect file regions with rwsem
>   hugepage: allow parallelization of the hugepage fault path
> 
>  mm/hugetlb.c | 134 ++++++++++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 106 insertions(+), 28 deletions(-)
> 
> -- 
> 1.7.11.7
> 


* Re: [PATCH 2/2] hugepage: allow parallelization of the hugepage fault path
  2013-07-28  6:00     ` Hillf Danton
@ 2013-07-29 19:16       ` Davidlohr Bueso
  -1 siblings, 0 replies; 32+ messages in thread
From: Davidlohr Bueso @ 2013-07-29 19:16 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Andrew Morton, Rik van Riel, Michel Lespinasse, Mel Gorman,
	Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Joonsoo Kim, David Gibson, Eric B Munson, Anton Blanchard,
	Konstantin Khlebnikov, linux-mm, linux-kernel

On Sun, 2013-07-28 at 14:00 +0800, Hillf Danton wrote:
> On Fri, Jul 26, 2013 at 10:27 PM, Davidlohr Bueso
> <davidlohr.bueso@hp.com> wrote:
> > From: David Gibson <david@gibson.dropbear.id.au>
> >
> > At present, the page fault path for hugepages is serialized by a
> > single mutex.  This is used to avoid spurious out-of-memory conditions
> > when the hugepage pool is fully utilized (two processes or threads can
> > race to instantiate the same mapping with the last hugepage from the
> > pool, the race loser returning VM_FAULT_OOM).  This problem is
> > specific to hugepages, because it is normal to want to use every
> > single hugepage in the system - with normal pages we simply assume
> > there will always be a few spare pages which can be used temporarily
> > until the race is resolved.
> >
> > Unfortunately this serialization also means that clearing of hugepages
> > cannot be parallelized across multiple CPUs, which can lead to very
> > long process startup times when using large numbers of hugepages.
> >
> > This patch improves the situation by replacing the single mutex with a
> > table of mutexes, selected based on a hash, which allows us to know
> > which page in the file we're instantiating. For shared mappings, the
> > hash key is selected based on the address space and file offset being faulted.
> > Similarly, for private mappings, the mm and virtual address are used.
> >
> > From: Anton Blanchard <anton@samba.org>
> > [https://lkml.org/lkml/2011/7/15/31]
> > Forward ported and made a few changes:
> >
> > - Use the Jenkins hash to scatter the hash, better than using just the
> >   low bits.
> >
> > - Always round num_fault_mutexes to a power of two to avoid an
> >   expensive modulus in the hash calculation.
> >
> > I also tested this patch on a large POWER7 box using a simple parallel
> > fault testcase:
> >
> > http://ozlabs.org/~anton/junkcode/parallel_fault.c
> >
> > Command line options:
> >
> > parallel_fault <nr_threads> <size in kB> <skip in kB>
> >
> > First the time taken to fault 128GB of 16MB hugepages:
> >
> > 40.68 seconds
> >
> > Now the same test with 64 concurrent threads:
> > 39.34 seconds
> >
> > Hardly any speedup. Finally the 64 concurrent threads test with
> > this patch applied:
> > 0.85 seconds
> >
> > We go from 40.68 seconds to 0.85 seconds, an improvement of 47.9x
> >
> > This was tested with the libhugetlbfs test suite, and the PASS/FAIL
> > count was the same before and after this patch.
> >
> > From: Davidlohr Bueso <davidlohr.bueso@hp.com>
> > - Cleaned up and forward ported to Linus' latest.
> > - Cache aligned mutexes.
> > - Keep non SMP systems using a single mutex.
> >
> > It was found that this mutex can become quite contended
> > during the early phases of large databases which make use of huge pages - for instance
> > startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat
> > reports that this mutex can be one of the top 5 most contended locks in the kernel during
> > the first few minutes:
> >
> >              hugetlb_instantiation_mutex:   10678     10678
> >              ---------------------------
> >              hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> >              ---------------------------
> >              hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> >
> > contentions:          10678
> > acquisitions:         99476
> > waittime-total: 76888911.01 us
> >
> > With this patch we see a much less contention and wait time:
> >
> >               &htlb_fault_mutex_table[i]:   383
> >               --------------------------
> >               &htlb_fault_mutex_table[i]    383   [<ffffffff8115e27b>] hugetlb_fault+0x1eb/0x440
> >               --------------------------
> >               &htlb_fault_mutex_table[i]    383   [<ffffffff8115e27b>] hugetlb_fault+0x1eb/0x440
> >
> > contentions:        383
> > acquisitions:    120546
> > waittime-total: 1381.72 us
> >
> I see same figures in the message of Jul 18,
> contentions:          10678
> acquisitions:         99476
> waittime-total: 76888911.01 us
> and
> contentions:        383
> acquisitions:    120546
> waittime-total: 1381.72 us
> if I copy and paste correctly.
> 
> Were they measured with the global semaphore introduced in 1/8 for
> serializing changes in file regions?

They were, but I copied the wrong text:

for the htlb mutex:

contentions: 453
acquisitions: 154786
waittime-total: 117765.59 us

For the new lock, this particular workload only uses region_add() and
region_chg() calls:

region_rwsem-W:
contentions: 4
acquisitions: 20077
waittime-total: 2244.64 us

region_rwsem-R:
contentions: 0
acquisitions: 2
waittime-total: 0 us



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] hugepage: allow parallelization of the hugepage fault path
@ 2013-07-29 19:16       ` Davidlohr Bueso
  0 siblings, 0 replies; 32+ messages in thread
From: Davidlohr Bueso @ 2013-07-29 19:16 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Andrew Morton, Rik van Riel, Michel Lespinasse, Mel Gorman,
	Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki, Hugh Dickins,
	Joonsoo Kim, David Gibson, Eric B Munson, Anton Blanchard,
	Konstantin Khlebnikov, linux-mm, linux-kernel

On Sun, 2013-07-28 at 14:00 +0800, Hillf Danton wrote:
> On Fri, Jul 26, 2013 at 10:27 PM, Davidlohr Bueso
> <davidlohr.bueso@hp.com> wrote:
> > From: David Gibson <david@gibson.dropbear.id.au>
> >
> > At present, the page fault path for hugepages is serialized by a
> > single mutex.  This is used to avoid spurious out-of-memory conditions
> > when the hugepage pool is fully utilized (two processes or threads can
> > race to instantiate the same mapping with the last hugepage from the
> > pool, the race loser returning VM_FAULT_OOM).  This problem is
> > specific to hugepages, because it is normal to want to use every
> > single hugepage in the system - with normal pages we simply assume
> > there will always be a few spare pages which can be used temporarily
> > until the race is resolved.
> >
> > Unfortunately this serialization also means that clearing of hugepages
> > cannot be parallelized across multiple CPUs, which can lead to very
> > long process startup times when using large numbers of hugepages.
> >
> > This patch improves the situation by replacing the single mutex with a
> > table of mutexes, selected based on a hash, which allows us to know
> > which page in the file we're instantiating. For shared mappings, the
> > hash key is selected based on the address space and file offset being faulted.
> > Similarly, for private mappings, the mm and virtual address are used.
> >
> > From: Anton Blanchard <anton@samba.org>
> > [https://lkml.org/lkml/2011/7/15/31]
> > Forward ported and made a few changes:
> >
> > - Use the Jenkins hash to scatter the hash, better than using just the
> >   low bits.
> >
> > - Always round num_fault_mutexes to a power of two to avoid an
> >   expensive modulus in the hash calculation.
> >
> > I also tested this patch on a large POWER7 box using a simple parallel
> > fault testcase:
> >
> > http://ozlabs.org/~anton/junkcode/parallel_fault.c
> >
> > Command line options:
> >
> > parallel_fault <nr_threads> <size in kB> <skip in kB>
> >
> > First the time taken to fault 128GB of 16MB hugepages:
> >
> > 40.68 seconds
> >
> > Now the same test with 64 concurrent threads:
> > 39.34 seconds
> >
> > Hardly any speedup. Finally the 64 concurrent threads test with
> > this patch applied:
> > 0.85 seconds
> >
> > We go from 40.68 seconds to 0.85 seconds, an improvement of 47.9x
> >
> > This was tested with the libhugetlbfs test suite, and the PASS/FAIL
> > count was the same before and after this patch.
> >
> > From: Davidlohr Bueso <davidlohr.bueso@hp.com>
> > - Cleaned up and forward ported to Linus' latest.
> > - Cache aligned mutexes.
> > - Keep non SMP systems using a single mutex.
> >
> > It was found that this mutex can become quite contended
> > during the early phases of large databases which make use of huge pages - for instance
> > startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat
> > reports that this mutex can be one of the top 5 most contended locks in the kernel during
> > the first few minutes:
> >
> >              hugetlb_instantiation_mutex:   10678     10678
> >              ---------------------------
> >              hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> >              ---------------------------
> >              hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> >
> > contentions:          10678
> > acquisitions:         99476
> > waittime-total: 76888911.01 us
> >
> > With this patch we see a much less contention and wait time:
> >
> >               &htlb_fault_mutex_table[i]:   383
> >               --------------------------
> >               &htlb_fault_mutex_table[i]    383   [<ffffffff8115e27b>] hugetlb_fault+0x1eb/0x440
> >               --------------------------
> >               &htlb_fault_mutex_table[i]    383   [<ffffffff8115e27b>] hugetlb_fault+0x1eb/0x440
> >
> > contentions:        383
> > acquisitions:    120546
> > waittime-total: 1381.72 us
> >
> I see the same figures in the message of Jul 18,
> contentions:          10678
> acquisitions:         99476
> waittime-total: 76888911.01 us
> and
> contentions:        383
> acquisitions:    120546
> waittime-total: 1381.72 us
> if I copy and paste correctly.
> 
> Were they measured with the global semaphore introduced in 1/8 for
> serializing changes in file regions?

They were, but I copied the wrong text:

for the htlb mutex:

contentions: 453
acquisitions: 154786
waittime-total: 117765.59 us

For the new lock, this particular workload only uses region_add() and
region_chg() calls:

region_rwsem-W:
contentions: 4
acquisitions: 20077
waittime-total: 2244.64 us

region_rwsem-R:
contentions: 0
acquisitions: 2
waittime-total: 0 us



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/2] hugepage: optimize page fault path locking
  2013-07-29  6:18   ` Joonsoo Kim
@ 2013-08-07  0:08     ` Davidlohr Bueso
  -1 siblings, 0 replies; 32+ messages in thread
From: Davidlohr Bueso @ 2013-08-07  0:08 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Davidlohr Bueso, Andrew Morton, Rik van Riel, Michel Lespinasse,
	Mel Gorman, Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki,
	Hillf Danton, Hugh Dickins, David Gibson, Eric B Munson,
	Anton Blanchard, Konstantin Khlebnikov, linux-mm, linux-kernel

On Mon, 2013-07-29 at 15:18 +0900, Joonsoo Kim wrote:
> On Fri, Jul 26, 2013 at 07:27:23AM -0700, Davidlohr Bueso wrote:
> > This patchset attempts to reduce the amount of contention we impose
> > on the hugetlb_instantiation_mutex by replacing the global mutex with
> > a table of mutexes, selected based on a hash. The original discussion can 
> > be found here: http://lkml.org/lkml/2013/7/12/428
> 
> Hello, Davidlohr.
> 
> I recently sent a patchset which removes the hugetlb_instantiation_mutex
> entirely ('mm, hugetlb: remove a hugetlb_instantiation_mutex').
> This patchset can be found here: https://lkml.org/lkml/2013/7/29/54
> 
> If possible, could you review and test it to see whether your problem
> disappears with it or not?

This patchset applies on top of https://lkml.org/lkml/2013/7/22/96
"[PATCH v2 00/10] mm, hugetlb: clean-up and possible bug fix", right?

AFAIK those changes are the ones Andrew picked up a few weeks ago and
are now in linux-next, right? I was able to apply those just fine, but
couldn't apply your 'remove a hugetlb_instantiation_mutex' series (IIRC
patch 1/18 failed). I guess you'll send out a v2 anyway so I'll wait
until then.

In any case I'm not seeing an actual performance issue with the
hugetlb_instantiation_mutex; all I noticed was that under large DB
workloads that make use of hugepages, such as Oracle, this lock becomes
quite hot during the first few minutes of startup, which makes sense
since it is contended in the fault path. So I'll try out your patches,
but in this particular case I just cannot compare the with-lock vs
without-lock situations.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/2] hugepage: optimize page fault path locking
  2013-08-07  0:08     ` Davidlohr Bueso
@ 2013-08-07  9:21       ` Joonsoo Kim
  -1 siblings, 0 replies; 32+ messages in thread
From: Joonsoo Kim @ 2013-08-07  9:21 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Davidlohr Bueso, Andrew Morton, Rik van Riel, Michel Lespinasse,
	Mel Gorman, Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki,
	Hillf Danton, Hugh Dickins, David Gibson, Eric B Munson,
	Anton Blanchard, Konstantin Khlebnikov, linux-mm, linux-kernel

On Tue, Aug 06, 2013 at 05:08:04PM -0700, Davidlohr Bueso wrote:
> On Mon, 2013-07-29 at 15:18 +0900, Joonsoo Kim wrote:
> > On Fri, Jul 26, 2013 at 07:27:23AM -0700, Davidlohr Bueso wrote:
> > > This patchset attempts to reduce the amount of contention we impose
> > > on the hugetlb_instantiation_mutex by replacing the global mutex with
> > > a table of mutexes, selected based on a hash. The original discussion can 
> > > be found here: http://lkml.org/lkml/2013/7/12/428
> > 
> > Hello, Davidlohr.
> > 
> > I recently sent a patchset which removes the hugetlb_instantiation_mutex
> > entirely ('mm, hugetlb: remove a hugetlb_instantiation_mutex').
> > This patchset can be found here: https://lkml.org/lkml/2013/7/29/54
> > 
> > If possible, could you review and test it to see whether your problem
> > disappears with it or not?
> 
> This patchset applies on top of https://lkml.org/lkml/2013/7/22/96
> "[PATCH v2 00/10] mm, hugetlb: clean-up and possible bug fix", right?
> 
> AFAIK those changes are the ones Andrew picked up a few weeks ago and
> are now in linux-next, right? I was able to apply those just fine, but
> couldn't apply your 'remove a hugetlb_instantiation_mutex' series (IIRC
> patch 1/18 failed). I guess you'll send out a v2 anyway so I'll wait
> until then.
> 
> In any case I'm not seeing an actual performance issue with the
> hugetlb_instantiation_mutex; all I noticed was that under large DB
> workloads that make use of hugepages, such as Oracle, this lock becomes
> quite hot during the first few minutes of startup, which makes sense
> since it is contended in the fault path. So I'll try out your patches,
> but in this particular case I just cannot compare the with-lock vs
> without-lock situations.

Okay. I just want to know whether lock contention is reduced by my patches
in the first few minutes of startup. I will send v2 soon.

Thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] hugepage: Allow parallelization of the hugepage fault path
  2011-07-15  6:06       ` Anton Blanchard
@ 2011-07-21 10:17         ` Mel Gorman
  -1 siblings, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2011-07-21 10:17 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: dwg, akpm, hughd, linux-mm, linux-kernel

On Fri, Jul 15, 2011 at 04:06:50PM +1000, Anton Blanchard wrote:
> 
> Hi Mel,
> 
> > I haven't tested this patch yet but typically how I would test it is
> > multiple parallel instances of make func from libhugetlbfs. In
> > particular I would be looking out for counter corruption. Has
> > something like this been done? I know hugetlb_lock protects the
> > counters but the locking in there has turned into a bit of a mess so
> > it's easy to miss something.
> 
> Thanks for the suggestion and sorry for taking so long. Make check has
> the same PASS/FAIL count before and after the patches.
> 
> I also ran 16 copies of make func on a large box with 896 HW threads.
> Some of the tests that use shared memory were a bit upset, but that
> seems to be because we use a static key. It seems the tests were also
> fighting over the number of huge pages they wanted the system set to.
> 
> It got up to a load average of 13207, and heap-overflow consumed all my
> memory, a pretty good effort considering I have over 1TB of it.
> 
> After things settled down things were OK, apart from the fact that we
> have 20 huge pages unaccounted for:
> 
> HugePages_Total:   10000
> HugePages_Free:     9980
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> 
> I verified there were no shared memory segments, and no files in the
> hugetlbfs filesystem (I double checked by unmounting it).
> 
> I can't see how this patch set would cause this. It seems like we can
> leak huge pages, perhaps in an error path. Anyway, I'll repost the
> patch set for comments.
> 

I didn't see any problem with the patch either. The locking should
be functionally equivalent for both private and shared mappings.

Out of curiosity, can you trigger the bug without the patches? It
could be a race on faulting the shared regions that is causing the
leakage.  Any chance this can be debugged minimally by finding out if
this is an accounting bug or if a hugepage is leaked? Considering
the stress of the machine and its size, I'm guessing it's not trivially
reproducible anywhere else.

I think one possible place for the bug is the updating of the hugepage
pool size by "make func". This is a scenario that does not normally
occur, as hugepage pool resizing is an administrative task that happens
rarely. See this chunk for instance:

        spin_lock(&hugetlb_lock);
        if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
                spin_unlock(&hugetlb_lock);
                return NULL;
        } else {
                h->nr_huge_pages++;
                h->surplus_huge_pages++;
        }
        spin_unlock(&hugetlb_lock);

        page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
                                        __GFP_REPEAT|__GFP_NOWARN,
                                        huge_page_order(h));

        if (page && arch_prepare_hugepage(page)) {
                __free_pages(page, huge_page_order(h));
                return NULL;
        }

That chunk does not undo the counter updates if arch_prepare_hugepage()
fails. That function is normally a no-op on powerpc. It wasn't changed for
any reason, was it?
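
A minimal sketch of the kind of rollback being hinted at here (illustration
only, grounded just in the chunk quoted above and not taken from any posted
patch): the failure branch would also need to undo the two increments made
before alloc_pages(), e.g.:

        if (page && arch_prepare_hugepage(page)) {
                __free_pages(page, huge_page_order(h));
                /* roll back the counters bumped before alloc_pages() */
                spin_lock(&hugetlb_lock);
                h->nr_huge_pages--;
                h->surplus_huge_pages--;
                spin_unlock(&hugetlb_lock);
                return NULL;
        }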

Another possibility is a regression of
[4f16fc10: mm: hugetlb: fix hugepage memory leak in mincore()] so maybe
try a similar reproduction case of mincore?

Maybe also try putting assert_spin_locked() at every point where nr_free_pages
or nr_huge_pages is updated and see if one of them triggers?
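
As a rough illustration of that suggestion, a hypothetical debug helper
(the name and invariant are only an example) could be called right before
each counter update:

/* Hypothetical: call immediately before touching the pool counters. */
static inline void hugetlb_assert_counters_locked(struct hstate *h)
{
        assert_spin_locked(&hugetlb_lock);
        VM_BUG_ON(h->free_huge_pages > h->nr_huge_pages);
}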


-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] hugepage: Allow parallelization of the hugepage fault path
  2011-07-15  6:10         ` Anton Blanchard
  (?)
@ 2011-07-18 15:24         ` Eric B Munson
  -1 siblings, 0 replies; 32+ messages in thread
From: Eric B Munson @ 2011-07-18 15:24 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: Mel Gorman, dwg, akpm, hughd, linux-mm, linux-kernel

On Fri, 15 Jul 2011, Anton Blanchard wrote:

> From: David Gibson <dwg@au1.ibm.com>
> 
> At present, the page fault path for hugepages is serialized by a
> single mutex.  This is used to avoid spurious out-of-memory conditions
> when the hugepage pool is fully utilized (two processes or threads can
> race to instantiate the same mapping with the last hugepage from the
> pool, the race loser returning VM_FAULT_OOM).  This problem is
> specific to hugepages, because it is normal to want to use every
> single hugepage in the system - with normal pages we simply assume
> there will always be a few spare pages which can be used temporarily
> until the race is resolved.
> 
> Unfortunately this serialization also means that clearing of hugepages
> cannot be parallelized across multiple CPUs, which can lead to very
> long process startup times when using large numbers of hugepages.
> 
> This patch improves the situation by replacing the single mutex with a
> table of mutexes, selected based on a hash of the address_space and
> file offset being faulted (or mm and virtual address for MAP_PRIVATE
> mappings).
> 
> From: Anton Blanchard <anton@samba.org>
> 
> Forward ported and made a few changes:
> 
> - Use the Jenkins hash to scatter the hash, better than using just the
>   low bits.
> 
> - Always round num_fault_mutexes to a power of two to avoid an
>   expensive modulus in the hash calculation.
> 
> I also tested this patch on a large POWER7 box using a simple parallel
> fault testcase:
> 
> http://ozlabs.org/~anton/junkcode/parallel_fault.c
> 
> Command line options:
> 
> parallel_fault <nr_threads> <size in kB> <skip in kB>
> 
> 
> First the time taken to fault 128GB of 16MB hugepages:
> 
> # time hugectl --heap ./parallel_fault 1 134217728 16384
> 40.68 seconds
> 
> Now the same test with 64 concurrent threads:
> # time hugectl --heap ./parallel_fault 64 134217728 16384
> 39.34 seconds
> 
> Hardly any speedup. Finally the 64 concurrent threads test with
> this patch applied:
> # time hugectl --heap ./parallel_fault 64 134217728 16384
> 0.85 seconds
> 
> We go from 40.68 seconds to 0.85 seconds, an improvement of 47.9x
> 
> This was tested with the libhugetlbfs test suite, and the PASS/FAIL
> count was the same before and after this patch.
> 
> 
> Signed-off-by: David Gibson <dwg@au1.ibm.com>
> Signed-off-by: Anton Blanchard <anton@samba.org>

Tested-by: Eric B Munson <emunson@mgebm.net>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] hugepage: Allow parallelization of the hugepage fault path
  2011-07-15  7:52     ` Andi Kleen
@ 2011-07-15 13:10       ` David Gibson
  -1 siblings, 0 replies; 32+ messages in thread
From: David Gibson @ 2011-07-15 13:10 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Anton Blanchard, mel, akpm, hughd, linux-mm, linux-kernel

On Fri, Jul 15, 2011 at 12:52:38AM -0700, Andi Kleen wrote:
> Anton Blanchard <anton@samba.org> writes:
> 
> 
> > This patch improves the situation by replacing the single mutex with a
> > table of mutexes, selected based on a hash of the address_space and
> > file offset being faulted (or mm and virtual address for MAP_PRIVATE
> > mappings).
> 
> It's unclear to me how this solves the original OOM problem.
> But then you can still have early oom over all the hugepages if they
> happen to hash to different pages, can't you? 

The spurious OOM case only occurs when the two processes or threads
are racing to instantiate the same page (that is the same page within
an address_space for SHARED or the same virtual address for PRIVATE).
In other cases the OOM is correct behaviour (because we really don't
have enough hugepages to satisfy the requests).

Because of the hash's construction, we're guaranteed that in the
spurious OOM case, both processes or threads will use the same mutex.

> I think it would be better to move the clearing out of the lock,

We really can't.  The lock has to be taken before we grab a page from
the pool, and can't be released until after the page is "committed"
either by updating the address space's radix tree (SHARED) or the page
tables (PRIVATE).  I can't see any way the clearing can be moved out of
that.
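
Roughly, the shape of the critical section being described is the following
(a simplified illustration only, not the actual hugetlb_no_page() code):

        mutex_lock(&htlb_fault_mutex_table[hash]);

        page = alloc_huge_page(vma, address, 0);       /* grab from the pool */
        clear_huge_page(page, address, pages_per_huge_page(h));

        /*
         * The page only becomes visible to other racing faults once it is
         * "committed": via the mapping's radix tree for VM_SHARED, or via
         * the page tables for MAP_PRIVATE.  Only then is it safe to drop
         * the mutex without risking a spurious VM_FAULT_OOM elsewhere.
         */
        add_to_page_cache(page, mapping, idx, GFP_KERNEL);     /* VM_SHARED  */
        /* ...or set_huge_pte_at(mm, address, ptep, new_pte) for MAP_PRIVATE */

        mutex_unlock(&htlb_fault_mutex_table[hash]);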

> and possibly take the lock only when the hugepages are about to 
> go OOM.

This is much easier said than done.  

At one stage I did attempt a more theoretically elegant approach which
is to keep a count of the number of "in-flight" hugepages - OOMs
should be retried if it is non-zero.  I believe that approach can
work, but it turns out to be pretty darn hairy to implement.
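
As a bare-bones sketch of that in-flight counter idea, just to make its
shape concrete (hypothetical, and glossing over all of the hairy parts
mentioned above):

/* Hypothetical: pages taken from the pool but not yet committed. */
static atomic_t hugepages_in_flight = ATOMIC_INIT(0);

        /* in the fault path, around taking and committing a page */
        atomic_inc(&hugepages_in_flight);
        page = alloc_huge_page(vma, address, 0);
        if (IS_ERR(page)) {
                atomic_dec(&hugepages_in_flight);
                /*
                 * Rather than returning VM_FAULT_OOM right away, retry while
                 * another fault still has a page in flight - it may be
                 * instantiating the very page we are racing against.
                 */
                if (atomic_read(&hugepages_in_flight))
                        return 0;       /* let the caller re-fault */
                return VM_FAULT_OOM;
        }
        /* ... clear and commit the page as usual ... */
        atomic_dec(&hugepages_in_flight);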

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] hugepage: Allow parallelization of the hugepage fault path
  2011-01-25  3:34   ` Anton Blanchard
@ 2011-07-15  7:52     ` Andi Kleen
  -1 siblings, 0 replies; 32+ messages in thread
From: Andi Kleen @ 2011-07-15  7:52 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: dwg, mel, akpm, hughd, linux-mm, linux-kernel

Anton Blanchard <anton@samba.org> writes:


> This patch improves the situation by replacing the single mutex with a
> table of mutexes, selected based on a hash of the address_space and
> file offset being faulted (or mm and virtual address for MAP_PRIVATE
> mappings).

It's unclear to me how this solves the original OOM problem.
But then you can still have early oom over all the hugepages if they
happen to hash to different pages, can't you? 

I think it would be better to move the clearing out of the lock,
and possibly take the lock only when the hugepages are about to 
go OOM.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 2/2] hugepage: Allow parallelization of the hugepage fault path
  2011-07-15  6:06       ` Anton Blanchard
@ 2011-07-15  6:10         ` Anton Blanchard
  -1 siblings, 0 replies; 32+ messages in thread
From: Anton Blanchard @ 2011-07-15  6:10 UTC (permalink / raw)
  To: Mel Gorman, dwg, akpm, hughd; +Cc: linux-mm, linux-kernel

From: David Gibson <dwg@au1.ibm.com>

At present, the page fault path for hugepages is serialized by a
single mutex.  This is used to avoid spurious out-of-memory conditions
when the hugepage pool is fully utilized (two processes or threads can
race to instantiate the same mapping with the last hugepage from the
pool, the race loser returning VM_FAULT_OOM).  This problem is
specific to hugepages, because it is normal to want to use every
single hugepage in the system - with normal pages we simply assume
there will always be a few spare pages which can be used temporarily
until the race is resolved.

Unfortunately this serialization also means that clearing of hugepages
cannot be parallelized across multiple CPUs, which can lead to very
long process startup times when using large numbers of hugepages.

This patch improves the situation by replacing the single mutex with a
table of mutexes, selected based on a hash of the address_space and
file offset being faulted (or mm and virtual address for MAP_PRIVATE
mappings).

From: Anton Blanchard <anton@samba.org>

Forward ported and made a few changes:

- Use the Jenkins hash to scatter the hash, better than using just the
  low bits.

- Always round num_fault_mutexes to a power of two to avoid an
  expensive modulus in the hash calculation.

I also tested this patch on a large POWER7 box using a simple parallel
fault testcase:

http://ozlabs.org/~anton/junkcode/parallel_fault.c

Command line options:

parallel_fault <nr_threads> <size in kB> <skip in kB>


First the time taken to fault 128GB of 16MB hugepages:

# time hugectl --heap ./parallel_fault 1 134217728 16384
40.68 seconds

Now the same test with 64 concurrent threads:
# time hugectl --heap ./parallel_fault 64 134217728 16384
39.34 seconds

Hardly any speedup. Finally the 64 concurrent threads test with
this patch applied:
# time hugectl --heap ./parallel_fault 64 134217728 16384
0.85 seconds

We go from 40.68 seconds to 0.85 seconds, an improvement of 47.9x

This was tested with the libhugetlbfs test suite, and the PASS/FAIL
count was the same before and after this patch.


Signed-off-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: linux-2.6-work/mm/hugetlb.c
===================================================================
--- linux-2.6-work.orig/mm/hugetlb.c	2011-07-15 09:17:23.724410080 +1000
+++ linux-2.6-work/mm/hugetlb.c	2011-07-15 09:17:24.584425505 +1000
@@ -21,6 +21,7 @@
 #include <linux/rmap.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/jhash.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -54,6 +55,13 @@ static unsigned long __initdata default_
 static DEFINE_SPINLOCK(hugetlb_lock);
 
 /*
+ * Serializes faults on the same logical page.  This is used to
+ * prevent spurious OOMs when the hugepage pool is fully utilized.
+ */
+static unsigned int num_fault_mutexes;
+static struct mutex *htlb_fault_mutex_table;
+
+/*
  * Region tracking -- allows tracking of reservations and instantiated pages
  *                    across the pages in a mapping.
  */
@@ -1772,6 +1780,8 @@ module_exit(hugetlb_exit);
 
 static int __init hugetlb_init(void)
 {
+	int i;
+
 	/* Some platform decide whether they support huge pages at boot
 	 * time. On these, such as powerpc, HPAGE_SHIFT is set to 0 when
 	 * there is no such support
@@ -1798,6 +1808,12 @@ static int __init hugetlb_init(void)
 
 	hugetlb_register_all_nodes();
 
+	num_fault_mutexes = roundup_pow_of_two(2 * num_possible_cpus());
+	htlb_fault_mutex_table =
+		kmalloc(num_fault_mutexes * sizeof(struct mutex), GFP_KERNEL);
+	for (i = 0; i < num_fault_mutexes; i++)
+		mutex_init(&htlb_fault_mutex_table[i]);
+
 	return 0;
 }
 module_init(hugetlb_init);
@@ -2622,6 +2638,27 @@ backout_unlocked:
 	goto out;
 }
 
+static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+			    struct vm_area_struct *vma,
+			    struct address_space *mapping,
+			    unsigned long pagenum, unsigned long address)
+{
+	unsigned long key[2];
+	u32 hash;
+
+	if ((vma->vm_flags & VM_SHARED)) {
+		key[0] = (unsigned long)mapping;
+		key[1] = pagenum;
+	} else {
+		key[0] = (unsigned long)mm;
+		key[1] = address >> huge_page_shift(h);
+	}
+
+	hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0);
+
+	return hash & (num_fault_mutexes - 1);
+}
+
 int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags)
 {
@@ -2630,8 +2667,10 @@ int hugetlb_fault(struct mm_struct *mm,
 	int ret;
 	struct page *page = NULL;
 	struct page *pagecache_page = NULL;
-	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
 	struct hstate *h = hstate_vma(vma);
+	struct address_space *mapping;
+	unsigned long idx;
+	u32 hash;
 
 	ptep = huge_pte_offset(mm, address);
 	if (ptep) {
@@ -2648,12 +2687,16 @@ int hugetlb_fault(struct mm_struct *mm,
 	if (!ptep)
 		return VM_FAULT_OOM;
 
+	mapping = vma->vm_file->f_mapping;
+	idx = vma_hugecache_offset(h, vma, address);
+
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
-	mutex_lock(&hugetlb_instantiation_mutex);
+	hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
+	mutex_lock(&htlb_fault_mutex_table[hash]);
 	entry = huge_ptep_get(ptep);
 	if (huge_pte_none(entry)) {
 		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
@@ -2722,7 +2765,7 @@ out_page_table_lock:
 		unlock_page(page);
 
 out_mutex:
-	mutex_unlock(&hugetlb_instantiation_mutex);
+	mutex_unlock(&htlb_fault_mutex_table[hash]);
 
 	return ret;
 }

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] hugepage: Allow parallelization of the hugepage fault path
  2011-01-26  9:24     ` Mel Gorman
@ 2011-07-15  6:06       ` Anton Blanchard
  -1 siblings, 0 replies; 32+ messages in thread
From: Anton Blanchard @ 2011-07-15  6:06 UTC (permalink / raw)
  To: Mel Gorman; +Cc: dwg, akpm, hughd, linux-mm, linux-kernel


Hi Mel,

> I haven't tested this patch yet but typically how I would test it is
> multiple parallel instances of make func from libhugetlbfs. In
> particular I would be looking out for counter corruption. Has
> something like this been done? I know hugetlb_lock protects the
> counters but the locking in there has turned into a bit of a mess so
> it's easy to miss something.

Thanks for the suggestion and sorry for taking so long. Make check has
the same PASS/FAIL count before and after the patches.

I also ran 16 copies of make func on a large box with 896 HW threads.
Some of the tests that use shared memory were a bit upset, but that
seems to be because we use a static key. It seems the tests were also
fighting over the number of huge pages they wanted the system set to.

It got up to a load average of 13207, and heap-overflow consumed all my
memory, a pretty good effort considering I have over 1TB of it.

After things settled down things were OK, apart from the fact that we
have 20 huge pages unaccounted for:

HugePages_Total:   10000
HugePages_Free:     9980
HugePages_Rsvd:        0
HugePages_Surp:        0

I verified there were no shared memory segments, and no files in the
hugetlbfs filesystem (I double checked by unmounting it).

I can't see how this patch set would cause this. It seems like we can
leak huge pages, perhaps in an error path. Anyway, I'll repost the
patch set for comments.

Anton

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] hugepage: Allow parallelization of the hugepage fault path
  2011-01-25  3:34   ` Anton Blanchard
@ 2011-01-26  9:24     ` Mel Gorman
  -1 siblings, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2011-01-26  9:24 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: dwg, akpm, hughd, linux-mm, linux-kernel

On Tue, Jan 25, 2011 at 02:34:14PM +1100, Anton Blanchard wrote:
> From: David Gibson <dwg@au1.ibm.com>
> 
> At present, the page fault path for hugepages is serialized by a
> single mutex.  This is used to avoid spurious out-of-memory conditions
> when the hugepage pool is fully utilized (two processes or threads can
> race to instantiate the same mapping with the last hugepage from the
> pool, the race loser returning VM_FAULT_OOM). This problem is
> specific to hugepages, because it is normal to want to use every
> single hugepage in the system - with normal pages we simply assume
> there will always be a few spare pages which can be used temporarily
> until the race is resolved.
> 
> Unfortunately this serialization also means that clearing of hugepages
> cannot be parallelized across multiple CPUs, which can lead to very
> long process startup times when using large numbers of hugepages.
> 
> This patch improves the situation by replacing the single mutex with a
> table of mutexes, selected based on a hash of the address_space and
> file offset being faulted (or mm and virtual address for MAP_PRIVATE
> mappings).
> 
> From: Anton Blanchard <anton@samba.org>
> 
> Forward ported and made a few changes:
> 
> - Use the Jenkins hash to scatter the hash, better than using just the
>   low bits.
> 
> - Always round num_fault_mutexes to a power of two to avoid an expensive
>   modulus in the hash calculation.
> 
> I also tested this patch on a 64 thread POWER6 box using a simple parallel
> fault testcase:
> 
> http://ozlabs.org/~anton/junkcode/parallel_fault.c
> 
> Command line options:
> 
> parallel_fault <nr_threads> <size in kB> <skip in kB>
> 
> First the time taken to fault 48GB of 16MB hugepages:
> # time hugectl --heap ./parallel_fault 1 50331648 16384
> 11.1 seconds
> 
> Now the same test with 64 concurrent threads:
> # time hugectl --heap ./parallel_fault 64 50331648 16384
> 8.8 seconds
> 
> Hardly any speedup. Finally the 64 concurrent threads test with this patch
> applied:
> # time hugectl --heap ./parallel_fault 64 50331648 16384
> 0.7 seconds
> 
> We go from 8.8 seconds to 0.7 seconds, an improvement of 12.6x.
> 
> Signed-off-by: David Gibson <dwg@au1.ibm.com>
> Signed-off-by: Anton Blanchard <anton@samba.org>

I haven't tested this patch yet but typically how I would test it is multiple
parallel instances of make func from libhugetlbfs. In particular I would
be looking out for counter corruption. Has something like this been done?
I know hugetlb_lock protects the counters but the locking in there has turned
into a bit of a mess so it's easy to miss something.

> ---
> 
> Index: powerpc.git/mm/hugetlb.c
> ===================================================================
> --- powerpc.git.orig/mm/hugetlb.c	2011-01-25 13:20:49.311405902 +1100
> +++ powerpc.git/mm/hugetlb.c	2011-01-25 13:45:54.437235053 +1100
> @@ -21,6 +21,7 @@
>  #include <linux/rmap.h>
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
> +#include <linux/jhash.h>
>  
>  #include <asm/page.h>
>  #include <asm/pgtable.h>
> @@ -54,6 +55,13 @@ static unsigned long __initdata default_
>  static DEFINE_SPINLOCK(hugetlb_lock);
>  
>  /*
> + * Serializes faults on the same logical page.  This is used to
> + * prevent spurious OOMs when the hugepage pool is fully utilized.
> + */
> +static unsigned int num_fault_mutexes;
> +static struct mutex *htlb_fault_mutex_table;
> +
> +/*
>   * Region tracking -- allows tracking of reservations and instantiated pages
>   *                    across the pages in a mapping.
>   */
> @@ -1764,6 +1772,8 @@ module_exit(hugetlb_exit);
>  
>  static int __init hugetlb_init(void)
>  {
> +	int i;
> +
>  	/* Some platform decide whether they support huge pages at boot
>  	 * time. On these, such as powerpc, HPAGE_SHIFT is set to 0 when
>  	 * there is no such support
> @@ -1790,6 +1800,12 @@ static int __init hugetlb_init(void)
>  
>  	hugetlb_register_all_nodes();
>  
> +	num_fault_mutexes = roundup_pow_of_two(2 * num_possible_cpus());
> +	htlb_fault_mutex_table =
> +		kmalloc(num_fault_mutexes * sizeof(struct mutex), GFP_KERNEL);

and if this fails? It'd be unusual, I know, but num_possible_cpus() could
conceivably be large enough to prevent kmalloc() from granting the request.
Do you need to do something similar to profile_init() here instead?
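
For illustration, one way that failure could be handled, in the spirit of
the profile_init() fallback (a hypothetical sketch, assuming a file-scope
"static struct mutex single_fault_mutex;" as the last-resort slot; not part
of the posted patch):

        num_fault_mutexes = roundup_pow_of_two(2 * num_possible_cpus());
        htlb_fault_mutex_table =
                kmalloc(num_fault_mutexes * sizeof(struct mutex), GFP_KERNEL);
        if (!htlb_fault_mutex_table) {
                /* fall back to a single mutex, i.e. the old behaviour */
                num_fault_mutexes = 1;
                htlb_fault_mutex_table = &single_fault_mutex;
        }
        for (i = 0; i < num_fault_mutexes; i++)
                mutex_init(&htlb_fault_mutex_table[i]);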

> +	for (i = 0; i < num_fault_mutexes; i++)
> +		mutex_init(&htlb_fault_mutex_table[i]);
> +
>  	return 0;
>  }
>  module_init(hugetlb_init);
> @@ -2616,6 +2632,27 @@ backout_unlocked:
>  	goto out;
>  }
>  
> +static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
> +			    struct vm_area_struct *vma,
> +			    struct address_space *mapping,
> +			    unsigned long pagenum, unsigned long address)

pagenum could be anything. Leave it as idx or index because it's easier
to guess it's the result of vma_hugecache_offset().

> +{
> +	unsigned long key[2];
> +	u32 hash;
> +
> +	if ((vma->vm_flags & VM_SHARED)) {
> +		key[0] = (unsigned long)mapping;
> +		key[1] = pagenum;
> +	} else {
> +		key[0] = (unsigned long)mm;
> +		key[1] = address >> huge_page_shift(h);
> +	}
> +
> +	hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0);
> +
> +	return hash & (num_fault_mutexes - 1);
> +}
> +
>  int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  			unsigned long address, unsigned int flags)
>  {
> @@ -2624,8 +2661,10 @@ int hugetlb_fault(struct mm_struct *mm,
>  	int ret;
>  	struct page *page = NULL;
>  	struct page *pagecache_page = NULL;
> -	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
>  	struct hstate *h = hstate_vma(vma);
> +	struct address_space *mapping;
> +	unsigned long idx;
> +	u32 hash;
>  
>  	ptep = huge_pte_offset(mm, address);
>  	if (ptep) {
> @@ -2642,12 +2681,16 @@ int hugetlb_fault(struct mm_struct *mm,
>  	if (!ptep)
>  		return VM_FAULT_OOM;
>  
> +	mapping = vma->vm_file->f_mapping;
> +	idx = vma_hugecache_offset(h, vma, address);
> +
>  	/*
>  	 * Serialize hugepage allocation and instantiation, so that we don't
>  	 * get spurious allocation failures if two CPUs race to instantiate
>  	 * the same page in the page cache.
>  	 */
> -	mutex_lock(&hugetlb_instantiation_mutex);
> +	hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
> +	mutex_lock(&htlb_fault_mutex_table[hash]);
>  	entry = huge_ptep_get(ptep);
>  	if (huge_pte_none(entry)) {
>  		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
> @@ -2716,7 +2759,7 @@ out_page_table_lock:
>  		unlock_page(page);
>  
>  out_mutex:
> -	mutex_unlock(&hugetlb_instantiation_mutex);
> +	mutex_unlock(&htlb_fault_mutex_table[hash]);
>  
>  	return ret;
>  }
> 

I didn't spot anything wrong, but I'd be happier if I knew multiple parallel
"make func" runs of the libhugetlbfs tests were done as well.

-- 
Mel Gorman
Linux Technology Center
IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] hugepage: Allow parallelization of the hugepage fault path
@ 2011-01-26  9:24     ` Mel Gorman
  0 siblings, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2011-01-26  9:24 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: dwg, akpm, hughd, linux-mm, linux-kernel

On Tue, Jan 25, 2011 at 02:34:14PM +1100, Anton Blanchard wrote:
> From: David Gibson <dwg@au1.ibm.com>
> 
> At present, the page fault path for hugepages is serialized by a
> single mutex.  This is used to avoid spurious out-of-memory conditions
> when the hugepage pool is fully utilized (two processes or threads can
> race to instantiate the same mapping with the last hugepage from the
> pool, the race loser returning VM_FAULT_OOM). This problem is
> specific to hugepages, because it is normal to want to use every
> single hugepage in the system - with normal pages we simply assume
> there will always be a few spare pages which can be used temporarily
> until the race is resolved.
> 
> Unfortunately this serialization also means that clearing of hugepages
> cannot be parallelized across multiple CPUs, which can lead to very
> long process startup times when using large numbers of hugepages.
> 
> This patch improves the situation by replacing the single mutex with a
> table of mutexes, selected based on a hash of the address_space and
> file offset being faulted (or mm and virtual address for MAP_PRIVATE
> mappings).
> 
> From: Anton Blanchard <anton@samba.org>
> 
> Forward ported and made a few changes:
> 
> - Use the Jenkins hash to scatter the hash, better than using just the
>   low bits.
> 
> - Always round num_fault_mutexes to a power of two to avoid an expensive
>   modulus in the hash calculation.
> 
> I also tested this patch on a 64 thread POWER6 box using a simple parallel
> fault testcase:
> 
> http://ozlabs.org/~anton/junkcode/parallel_fault.c
> 
> Command line options:
> 
> parallel_fault <nr_threads> <size in kB> <skip in kB>
> 
> First the time taken to fault 48GB of 16MB hugepages:
> # time hugectl --heap ./parallel_fault 1 50331648 16384
> 11.1 seconds
> 
> Now the same test with 64 concurrent threads:
> # time hugectl --heap ./parallel_fault 64 50331648 16384
> 8.8 seconds
> 
> Hardly any speedup. Finally the 64 concurrent threads test with this patch
> applied:
> # time hugectl --heap ./parallel_fault 64 50331648 16384
> 0.7 seconds
> 
> We go from 8.8 seconds to 0.7 seconds, an improvement of 12.6x.
> 
> Signed-off-by: David Gibson <dwg@au1.ibm.com>
> Signed-off-by: Anton Blanchard <anton@samba.org>

I haven't tested this patch yet but typically how I would test it is multiple
parallel instances of make func from libhugetlbfs. In particular I would
be looking out for counter corruption. Has something like this been done?
I know hugetlb_lock protects the counters but the locking in there has turned
into a bit of a mess so it's easy to miss something.

> ---
> 
> Index: powerpc.git/mm/hugetlb.c
> ===================================================================
> --- powerpc.git.orig/mm/hugetlb.c	2011-01-25 13:20:49.311405902 +1100
> +++ powerpc.git/mm/hugetlb.c	2011-01-25 13:45:54.437235053 +1100
> @@ -21,6 +21,7 @@
>  #include <linux/rmap.h>
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
> +#include <linux/jhash.h>
>  
>  #include <asm/page.h>
>  #include <asm/pgtable.h>
> @@ -54,6 +55,13 @@ static unsigned long __initdata default_
>  static DEFINE_SPINLOCK(hugetlb_lock);
>  
>  /*
> + * Serializes faults on the same logical page.  This is used to
> + * prevent spurious OOMs when the hugepage pool is fully utilized.
> + */
> +static unsigned int num_fault_mutexes;
> +static struct mutex *htlb_fault_mutex_table;
> +
> +/*
>   * Region tracking -- allows tracking of reservations and instantiated pages
>   *                    across the pages in a mapping.
>   */
> @@ -1764,6 +1772,8 @@ module_exit(hugetlb_exit);
>  
>  static int __init hugetlb_init(void)
>  {
> +	int i;
> +
>  	/* Some platform decide whether they support huge pages at boot
>  	 * time. On these, such as powerpc, HPAGE_SHIFT is set to 0 when
>  	 * there is no such support
> @@ -1790,6 +1800,12 @@ static int __init hugetlb_init(void)
>  
>  	hugetlb_register_all_nodes();
>  
> +	num_fault_mutexes = roundup_pow_of_two(2 * num_possible_cpus());
> +	htlb_fault_mutex_table =
> +		kmalloc(num_fault_mutexes * sizeof(struct mutex), GFP_KERNEL);

and if this fails? It'd be unusual, I know, but num_possible_cpus() could
conceivably be large enough to prevent kmalloc() from granting the request.
Do you need to do something similar to profile_init() here instead?
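
For illustration, a minimal sketch of one way the init path could guard
against that allocation failure, falling back to a single mutex so the fault
path still serializes rather than dereferencing a NULL table. This is not part
of the patch; the helper name and the fallback strategy are assumptions:

/* Hypothetical fallback, not from the patch: if the mutex table cannot be
 * allocated, degrade to a single statically allocated mutex, which behaves
 * like the old global hugetlb_instantiation_mutex.
 */
static struct mutex htlb_single_fault_mutex;

static void __init hugetlb_fault_mutex_init(void)
{
	int i;

	num_fault_mutexes = roundup_pow_of_two(2 * num_possible_cpus());
	htlb_fault_mutex_table =
		kmalloc(num_fault_mutexes * sizeof(struct mutex), GFP_KERNEL);
	if (!htlb_fault_mutex_table) {
		/* Degraded mode: one mutex, same behaviour as before. */
		num_fault_mutexes = 1;
		htlb_fault_mutex_table = &htlb_single_fault_mutex;
	}
	for (i = 0; i < num_fault_mutexes; i++)
		mutex_init(&htlb_fault_mutex_table[i]);
}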

> +	for (i = 0; i < num_fault_mutexes; i++)
> +		mutex_init(&htlb_fault_mutex_table[i]);
> +
>  	return 0;
>  }
>  module_init(hugetlb_init);
> @@ -2616,6 +2632,27 @@ backout_unlocked:
>  	goto out;
>  }
>  
> +static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
> +			    struct vm_area_struct *vma,
> +			    struct address_space *mapping,
> +			    unsigned long pagenum, unsigned long address)

pagenum could be anything. Call it idx or index instead, so it's easier
to guess that it's the result of vma_hugecache_offset().

> +{
> +	unsigned long key[2];
> +	u32 hash;
> +
> +	if ((vma->vm_flags & VM_SHARED)) {
> +		key[0] = (unsigned long)mapping;
> +		key[1] = pagenum;
> +	} else {
> +		key[0] = (unsigned long)mm;
> +		key[1] = address >> huge_page_shift(h);
> +	}
> +
> +	hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0);
> +
> +	return hash & (num_fault_mutexes - 1);
> +}
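
For readers outside the kernel tree, a rough userspace model of the same idea,
a table of mutexes indexed by a hash of the faulting object and page index,
might look like the sketch below. It is illustrative only: the names are
invented and a simple multiplicative hash stands in for jhash2().

#include <pthread.h>
#include <stdint.h>

#define NUM_FAULT_MUTEXES 64	/* power of two, so a mask replaces '%' */

static pthread_mutex_t fault_mutex[NUM_FAULT_MUTEXES];

/* Call once before any faults are simulated. */
static void fault_mutex_table_init(void)
{
	int i;

	for (i = 0; i < NUM_FAULT_MUTEXES; i++)
		pthread_mutex_init(&fault_mutex[i], NULL);
}

/* Faults on the same (object, index) pair always pick the same mutex;
 * unrelated pages usually pick different ones. */
static unsigned int fault_mutex_pick(uintptr_t object, unsigned long index)
{
	uint64_t key = (uint64_t)object ^
		       ((uint64_t)index * 0x9e3779b97f4a7c15ULL);

	return (unsigned int)(key >> 32) & (NUM_FAULT_MUTEXES - 1);
}

static void fault_one_page(uintptr_t object, unsigned long index)
{
	unsigned int h = fault_mutex_pick(object, index);

	pthread_mutex_lock(&fault_mutex[h]);
	/* ... allocate and instantiate the page here ... */
	pthread_mutex_unlock(&fault_mutex[h]);
}

The power-of-two table size is what makes the final step cheap: the modulus
reduces to a single AND with NUM_FAULT_MUTEXES - 1, which is exactly why the
patch rounds num_fault_mutexes up.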
> +
>  int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  			unsigned long address, unsigned int flags)
>  {
> @@ -2624,8 +2661,10 @@ int hugetlb_fault(struct mm_struct *mm,
>  	int ret;
>  	struct page *page = NULL;
>  	struct page *pagecache_page = NULL;
> -	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
>  	struct hstate *h = hstate_vma(vma);
> +	struct address_space *mapping;
> +	unsigned long idx;
> +	u32 hash;
>  
>  	ptep = huge_pte_offset(mm, address);
>  	if (ptep) {
> @@ -2642,12 +2681,16 @@ int hugetlb_fault(struct mm_struct *mm,
>  	if (!ptep)
>  		return VM_FAULT_OOM;
>  
> +	mapping = vma->vm_file->f_mapping;
> +	idx = vma_hugecache_offset(h, vma, address);
> +
>  	/*
>  	 * Serialize hugepage allocation and instantiation, so that we don't
>  	 * get spurious allocation failures if two CPUs race to instantiate
>  	 * the same page in the page cache.
>  	 */
> -	mutex_lock(&hugetlb_instantiation_mutex);
> +	hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
> +	mutex_lock(&htlb_fault_mutex_table[hash]);
>  	entry = huge_ptep_get(ptep);
>  	if (huge_pte_none(entry)) {
>  		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
> @@ -2716,7 +2759,7 @@ out_page_table_lock:
>  		unlock_page(page);
>  
>  out_mutex:
> -	mutex_unlock(&hugetlb_instantiation_mutex);
> +	mutex_unlock(&htlb_fault_mutex_table[hash]);
>  
>  	return ret;
>  }
> 

I didn't spot anything wrong, but I'd be happier if I knew multiple parallel
runs of "make func" from libhugetlbfs had been done as well.

-- 
Mel Gorman
Linux Technology Center
IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] hugepage: Allow parallelization of the hugepage fault path
  2011-01-25  3:34   ` Anton Blanchard
  (?)
@ 2011-01-25 19:44   ` Eric B Munson
  -1 siblings, 0 replies; 32+ messages in thread
From: Eric B Munson @ 2011-01-25 19:44 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: dwg, mel, akpm, hughd, linux-mm, linux-kernel

On Tue, 25 Jan 2011, Anton Blanchard wrote:

> From: David Gibson <dwg@au1.ibm.com>
> 
> At present, the page fault path for hugepages is serialized by a
> single mutex.  This is used to avoid spurious out-of-memory conditions
> when the hugepage pool is fully utilized (two processes or threads can
> race to instantiate the same mapping with the last hugepage from the
> pool, the race loser returning VM_FAULT_OOM).  This problem is
> specific to hugepages, because it is normal to want to use every
> single hugepage in the system - with normal pages we simply assume
> there will always be a few spare pages which can be used temporarily
> until the race is resolved.
> 
> Unfortunately this serialization also means that clearing of hugepages
> cannot be parallelized across multiple CPUs, which can lead to very
> long process startup times when using large numbers of hugepages.
> 
> This patch improves the situation by replacing the single mutex with a
> table of mutexes, selected based on a hash of the address_space and
> file offset being faulted (or mm and virtual address for MAP_PRIVATE
> mappings).
> 
> 
> From: Anton Blanchard <anton@samba.org>
> 
> Forward ported and made a few changes:
> 
> - Use the Jenkins hash to scatter the hash, better than using just the
>   low bits.
> 
> - Always round num_fault_mutexes to a power of two to avoid an expensive
>   modulus in the hash calculation.
> 
> I also tested this patch on a 64 thread POWER6 box using a simple parallel
> fault testcase:
> 
> http://ozlabs.org/~anton/junkcode/parallel_fault.c
> 
> Command line options:
> 
> parallel_fault <nr_threads> <size in kB> <skip in kB>
> 
> First the time taken to fault 48GB of 16MB hugepages:
> # time hugectl --heap ./parallel_fault 1 50331648 16384
> 11.1 seconds
> 
> Now the same test with 64 concurrent threads:
> # time hugectl --heap ./parallel_fault 64 50331648 16384
> 8.8 seconds
> 
> Hardly any speedup. Finally the 64 concurrent threads test with this patch
> applied:
> # time hugectl --heap ./parallel_fault 64 50331648 16384
> 0.7 seconds
> 
> We go from 8.8 seconds to 0.7 seconds, an improvement of 12.6x.
> 
> Signed-off-by: David Gibson <dwg@au1.ibm.com>
> Signed-off-by: Anton Blanchard <anton@samba.org>

Reviewed-by: Eric B Munson <emunson@mgebm.net>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 2/2] hugepage: Allow parallelization of the hugepage fault path
  2011-01-25  3:32 [PATCH 1/2] hugepage: Protect region tracking lists with its own spinlock Anton Blanchard
@ 2011-01-25  3:34   ` Anton Blanchard
  0 siblings, 0 replies; 32+ messages in thread
From: Anton Blanchard @ 2011-01-25  3:34 UTC (permalink / raw)
  To: dwg, mel, akpm, hughd; +Cc: linux-mm, linux-kernel

From: David Gibson <dwg@au1.ibm.com>

At present, the page fault path for hugepages is serialized by a
single mutex.  This is used to avoid spurious out-of-memory conditions
when the hugepage pool is fully utilized (two processes or threads can
race to instantiate the same mapping with the last hugepage from the
pool, the race loser returning VM_FAULT_OOM).  This problem is
specific to hugepages, because it is normal to want to use every
single hugepage in the system - with normal pages we simply assume
there will always be a few spare pages which can be used temporarily
until the race is resolved.

Unfortunately this serialization also means that clearing of hugepages
cannot be parallelized across multiple CPUs, which can lead to very
long process startup times when using large numbers of hugepages.

This patch improves the situation by replacing the single mutex with a
table of mutexes, selected based on a hash of the address_space and
file offset being faulted (or mm and virtual address for MAP_PRIVATE
mappings).


From: Anton Blanchard <anton@samba.org>

Forward ported and made a few changes:

- Use the Jenkins hash to scatter the hash, better than using just the
  low bits.

- Always round num_fault_mutexes to a power of two to avoid an expensive
  modulus in the hash calculation.

I also tested this patch on a 64 thread POWER6 box using a simple parallel
fault testcase:

http://ozlabs.org/~anton/junkcode/parallel_fault.c

Command line options:

parallel_fault <nr_threads> <size in kB> <skip in kB>

First the time taken to fault 48GB of 16MB hugepages:
# time hugectl --heap ./parallel_fault 1 50331648 16384
11.1 seconds

Now the same test with 64 concurrent threads:
# time hugectl --heap ./parallel_fault 64 50331648 16384
8.8 seconds

Hardly any speedup. Finally the 64 concurrent threads test with this patch
applied:
# time hugectl --heap ./parallel_fault 64 50331648 16384
0.7 seconds

We go from 8.8 seconds to 0.7 seconds, an improvement of 12.6x.

Signed-off-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: powerpc.git/mm/hugetlb.c
===================================================================
--- powerpc.git.orig/mm/hugetlb.c	2011-01-25 13:20:49.311405902 +1100
+++ powerpc.git/mm/hugetlb.c	2011-01-25 13:45:54.437235053 +1100
@@ -21,6 +21,7 @@
 #include <linux/rmap.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/jhash.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -54,6 +55,13 @@ static unsigned long __initdata default_
 static DEFINE_SPINLOCK(hugetlb_lock);
 
 /*
+ * Serializes faults on the same logical page.  This is used to
+ * prevent spurious OOMs when the hugepage pool is fully utilized.
+ */
+static unsigned int num_fault_mutexes;
+static struct mutex *htlb_fault_mutex_table;
+
+/*
  * Region tracking -- allows tracking of reservations and instantiated pages
  *                    across the pages in a mapping.
  */
@@ -1764,6 +1772,8 @@ module_exit(hugetlb_exit);
 
 static int __init hugetlb_init(void)
 {
+	int i;
+
 	/* Some platform decide whether they support huge pages at boot
 	 * time. On these, such as powerpc, HPAGE_SHIFT is set to 0 when
 	 * there is no such support
@@ -1790,6 +1800,12 @@ static int __init hugetlb_init(void)
 
 	hugetlb_register_all_nodes();
 
+	num_fault_mutexes = roundup_pow_of_two(2 * num_possible_cpus());
+	htlb_fault_mutex_table =
+		kmalloc(num_fault_mutexes * sizeof(struct mutex), GFP_KERNEL);
+	for (i = 0; i < num_fault_mutexes; i++)
+		mutex_init(&htlb_fault_mutex_table[i]);
+
 	return 0;
 }
 module_init(hugetlb_init);
@@ -2616,6 +2632,27 @@ backout_unlocked:
 	goto out;
 }
 
+static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+			    struct vm_area_struct *vma,
+			    struct address_space *mapping,
+			    unsigned long pagenum, unsigned long address)
+{
+	unsigned long key[2];
+	u32 hash;
+
+	if ((vma->vm_flags & VM_SHARED)) {
+		key[0] = (unsigned long)mapping;
+		key[1] = pagenum;
+	} else {
+		key[0] = (unsigned long)mm;
+		key[1] = address >> huge_page_shift(h);
+	}
+
+	hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0);
+
+	return hash & (num_fault_mutexes - 1);
+}
+
 int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags)
 {
@@ -2624,8 +2661,10 @@ int hugetlb_fault(struct mm_struct *mm,
 	int ret;
 	struct page *page = NULL;
 	struct page *pagecache_page = NULL;
-	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
 	struct hstate *h = hstate_vma(vma);
+	struct address_space *mapping;
+	unsigned long idx;
+	u32 hash;
 
 	ptep = huge_pte_offset(mm, address);
 	if (ptep) {
@@ -2642,12 +2681,16 @@ int hugetlb_fault(struct mm_struct *mm,
 	if (!ptep)
 		return VM_FAULT_OOM;
 
+	mapping = vma->vm_file->f_mapping;
+	idx = vma_hugecache_offset(h, vma, address);
+
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
-	mutex_lock(&hugetlb_instantiation_mutex);
+	hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
+	mutex_lock(&htlb_fault_mutex_table[hash]);
 	entry = huge_ptep_get(ptep);
 	if (huge_pte_none(entry)) {
 		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
@@ -2716,7 +2759,7 @@ out_page_table_lock:
 		unlock_page(page);
 
 out_mutex:
-	mutex_unlock(&hugetlb_instantiation_mutex);
+	mutex_unlock(&htlb_fault_mutex_table[hash]);
 
 	return ret;
 }

^ permalink raw reply	[flat|nested] 32+ messages in thread


end of thread, other threads:[~2013-08-07  9:21 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-26 14:27 [PATCH 0/2] hugepage: optimize page fault path locking Davidlohr Bueso
2013-07-26 14:27 ` Davidlohr Bueso
2013-07-26 14:27 ` [PATCH 1/2] hugepage: protect file regions with rwsem Davidlohr Bueso
2013-07-26 14:27   ` Davidlohr Bueso
2013-07-26 14:27 ` [PATCH 2/2] hugepage: allow parallelization of the hugepage fault path Davidlohr Bueso
2013-07-26 14:27   ` Davidlohr Bueso
2013-07-28  6:00   ` Hillf Danton
2013-07-28  6:00     ` Hillf Danton
2013-07-29 19:16     ` Davidlohr Bueso
2013-07-29 19:16       ` Davidlohr Bueso
2013-07-29  6:18 ` [PATCH 0/2] hugepage: optimize page fault path locking Joonsoo Kim
2013-07-29  6:18   ` Joonsoo Kim
2013-08-07  0:08   ` Davidlohr Bueso
2013-08-07  0:08     ` Davidlohr Bueso
2013-08-07  9:21     ` Joonsoo Kim
2013-08-07  9:21       ` Joonsoo Kim
  -- strict thread matches above, loose matches on Subject: below --
2011-01-25  3:32 [PATCH 1/2] hugepage: Protect region tracking lists with its own spinlock Anton Blanchard
2011-01-25  3:34 ` [PATCH 2/2] hugepage: Allow parallelization of the hugepage fault path Anton Blanchard
2011-01-25  3:34   ` Anton Blanchard
2011-01-25 19:44   ` Eric B Munson
2011-01-26  9:24   ` Mel Gorman
2011-01-26  9:24     ` Mel Gorman
2011-07-15  6:06     ` Anton Blanchard
2011-07-15  6:06       ` Anton Blanchard
2011-07-15  6:10       ` Anton Blanchard
2011-07-15  6:10         ` Anton Blanchard
2011-07-18 15:24         ` Eric B Munson
2011-07-21 10:17       ` Mel Gorman
2011-07-21 10:17         ` Mel Gorman
2011-07-15  7:52   ` Andi Kleen
2011-07-15  7:52     ` Andi Kleen
2011-07-15 13:10     ` David Gibson
2011-07-15 13:10       ` David Gibson
