* [PATCH] mm/hugetlb: per-vma instantiation mutexes
@ 2013-07-12 23:28 ` Davidlohr Bueso
  0 siblings, 0 replies; 41+ messages in thread
From: Davidlohr Bueso @ 2013-07-12 23:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michel Lespinasse, Mel Gorman,
	Konstantin Khlebnikov, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hillf Danton, Hugh Dickins, linux-mm, LKML

The hugetlb_instantiation_mutex serializes hugepage allocation and instantiation
in the page directory entry. It was found that this mutex can become quite contended
during the early phases of large databases which make use of huge pages - for instance
startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat
reports that this mutex can be one of the top 5 most contended locks in the kernel during
the first few minutes:

hugetlb_instantiation_mutex:      10678     10678
             ---------------------------
             hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
             ---------------------------
             hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340

contentions:          10678
acquisitions:         99476
waittime-total: 76888911.01 us

Instead of serializing each hugetlb fault, we can deal with concurrent faults for pages
in different vmas. The per-vma mutex is initialized when creating a new vma. So, back to
the example above, we now get much less contention:

 &vma->hugetlb_instantiation_mutex:  1         1
       ---------------------------------
       &vma->hugetlb_instantiation_mutex       1   [<ffffffff8115e216>] hugetlb_fault+0xa6/0x350
       ---------------------------------
       &vma->hugetlb_instantiation_mutex       1    [<ffffffff8115e216>] hugetlb_fault+0xa6/0x350

contentions:          1
acquisitions:    108092
waittime-total:  621.24 us

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
---
 include/linux/mm_types.h |  3 +++
 mm/hugetlb.c             | 12 +++++-------
 mm/mmap.c                |  3 +++
 3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fb425aa..b45fd87 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -289,6 +289,9 @@ struct vm_area_struct {
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
+#ifdef CONFIG_HUGETLB_PAGE
+	struct mutex hugetlb_instantiation_mutex;
+#endif
 };
 
 struct core_thread {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 83aff0a..12e665b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -137,12 +137,12 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
  * The region data structures are protected by a combination of the mmap_sem
  * and the hugetlb_instantion_mutex.  To access or modify a region the caller
  * must either hold the mmap_sem for write, or the mmap_sem for read and
- * the hugetlb_instantiation mutex:
+ * the vma's hugetlb_instantiation mutex:
  *
  *	down_write(&mm->mmap_sem);
  * or
  *	down_read(&mm->mmap_sem);
- *	mutex_lock(&hugetlb_instantiation_mutex);
+ *	mutex_lock(&vma->hugetlb_instantiation_mutex);
  */
 struct file_region {
 	struct list_head link;
@@ -2547,7 +2547,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 
 /*
  * Hugetlb_cow() should be called with page lock of the original hugepage held.
- * Called with hugetlb_instantiation_mutex held and pte_page locked so we
+ * Called with the vma's hugetlb_instantiation_mutex held and pte_page locked so we
  * cannot race with other handlers or page migration.
  * Keep the pte_same checks anyway to make transition from the mutex easier.
  */
@@ -2847,7 +2847,6 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	int ret;
 	struct page *page = NULL;
 	struct page *pagecache_page = NULL;
-	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
 	struct hstate *h = hstate_vma(vma);
 
 	address &= huge_page_mask(h);
@@ -2872,7 +2871,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
-	mutex_lock(&hugetlb_instantiation_mutex);
+	mutex_lock(&vma->hugetlb_instantiation_mutex);
 	entry = huge_ptep_get(ptep);
 	if (huge_pte_none(entry)) {
 		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
@@ -2943,8 +2942,7 @@ out_page_table_lock:
 	put_page(page);
 
 out_mutex:
-	mutex_unlock(&hugetlb_instantiation_mutex);
-
+	mutex_unlock(&vma->hugetlb_instantiation_mutex);
 	return ret;
 }
 
diff --git a/mm/mmap.c b/mm/mmap.c
index fbad7b0..8f0b034 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1543,6 +1543,9 @@ munmap_back:
 	vma->vm_page_prot = vm_get_page_prot(vm_flags);
 	vma->vm_pgoff = pgoff;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
+#ifdef CONFIG_HUGETLB_PAGE
+	mutex_init(&vma->hugetlb_instantiation_mutex);
+#endif
 
 	error = -EINVAL;	/* when rejecting VM_GROWSDOWN|VM_GROWSUP */
 
-- 
1.7.11.7




^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
  2013-07-12 23:28 ` Davidlohr Bueso
@ 2013-07-13  0:54   ` Hugh Dickins
  -1 siblings, 0 replies; 41+ messages in thread
From: Hugh Dickins @ 2013-07-13  0:54 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: David Gibson, Andrew Morton, Rik van Riel, Michel Lespinasse,
	Mel Gorman, Konstantin Khlebnikov, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hillf Danton, linux-mm, LKML

Adding the essential David Gibson to the Cc list.

On Fri, 12 Jul 2013, Davidlohr Bueso wrote:

> The hugetlb_instantiation_mutex serializes hugepage allocation and instantiation
> in the page directory entry. It was found that this mutex can become quite contended
> during the early phases of large databases which make use of huge pages - for instance
> startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat
> reports that this mutex can be one of the top 5 most contended locks in the kernel during
> the first few minutes:
> 
> hugetlb_instantiation_mutex:      10678     10678
>              ---------------------------
>              hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
>              ---------------------------
>              hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> 
> contentions:          10678
> acquisitions:         99476
> waittime-total: 76888911.01 us
> 
> Instead of serializing each hugetlb fault, we can deal with concurrent faults for pages
> in different vmas. The per-vma mutex is initialized when creating a new vma. So, back to
> the example above, we now get much less contention:
> 
>  &vma->hugetlb_instantiation_mutex:  1         1
>        ---------------------------------
>        &vma->hugetlb_instantiation_mutex       1   [<ffffffff8115e216>] hugetlb_fault+0xa6/0x350
>        ---------------------------------
>        &vma->hugetlb_instantiation_mutex       1    [<ffffffff8115e216>] hugetlb_fault+0xa6/0x350
> 
> contentions:          1
> acquisitions:    108092
> waittime-total:  621.24 us
> 
> Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>

I agree this is a problem worth solving,
but I doubt this patch is the right solution.

> ---
>  include/linux/mm_types.h |  3 +++
>  mm/hugetlb.c             | 12 +++++-------
>  mm/mmap.c                |  3 +++
>  3 files changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index fb425aa..b45fd87 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -289,6 +289,9 @@ struct vm_area_struct {
>  #ifdef CONFIG_NUMA
>  	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
>  #endif
> +#ifdef CONFIG_HUGETLB_PAGE
> +	struct mutex hugetlb_instantiation_mutex;
> +#endif
>  };

Bloating every vm_area_struct with a rarely useful mutex:
I'm sure you can construct cases where per-vma mutex would win over
per-mm mutex, but they will have to be very common to justify the bloat.

>  
>  struct core_thread {
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 83aff0a..12e665b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -137,12 +137,12 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
>   * The region data structures are protected by a combination of the mmap_sem
>   * and the hugetlb_instantion_mutex.  To access or modify a region the caller
>   * must either hold the mmap_sem for write, or the mmap_sem for read and
> - * the hugetlb_instantiation mutex:
> + * the vma's hugetlb_instantiation mutex:

Reading the existing comment, this change looks very suspicious to me.
A per-vma mutex is just not going to provide the necessary exclusion, is
it?  (But I recall next to nothing about these regions and reservations.)

>   *
>   *	down_write(&mm->mmap_sem);
>   * or
>   *	down_read(&mm->mmap_sem);
> - *	mutex_lock(&hugetlb_instantiation_mutex);
> + *	mutex_lock(&vma->hugetlb_instantiation_mutex);
>   */
>  struct file_region {
>  	struct list_head link;
> @@ -2547,7 +2547,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
>  
>  /*
>   * Hugetlb_cow() should be called with page lock of the original hugepage held.
> - * Called with hugetlb_instantiation_mutex held and pte_page locked so we
> + * Called with the vma's hugetlb_instantiation_mutex held and pte_page locked so we
>   * cannot race with other handlers or page migration.
>   * Keep the pte_same checks anyway to make transition from the mutex easier.
>   */
> @@ -2847,7 +2847,6 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	int ret;
>  	struct page *page = NULL;
>  	struct page *pagecache_page = NULL;
> -	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
>  	struct hstate *h = hstate_vma(vma);
>  
>  	address &= huge_page_mask(h);
> @@ -2872,7 +2871,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	 * get spurious allocation failures if two CPUs race to instantiate
>  	 * the same page in the page cache.
>  	 */
> -	mutex_lock(&hugetlb_instantiation_mutex);
> +	mutex_lock(&vma->hugetlb_instantiation_mutex);
>  	entry = huge_ptep_get(ptep);
>  	if (huge_pte_none(entry)) {
>  		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
> @@ -2943,8 +2942,7 @@ out_page_table_lock:
>  	put_page(page);
>  
>  out_mutex:
> -	mutex_unlock(&hugetlb_instantiation_mutex);
> -
> +	mutex_unlock(&vma->hugetlb_instantiation_mutex);
>  	return ret;
>  }
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index fbad7b0..8f0b034 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1543,6 +1543,9 @@ munmap_back:
>  	vma->vm_page_prot = vm_get_page_prot(vm_flags);
>  	vma->vm_pgoff = pgoff;
>  	INIT_LIST_HEAD(&vma->anon_vma_chain);
> +#ifdef CONFIG_HUGETLB_PAGE
> +	mutex_init(&vma->hugetlb_instantiation_mutex);
> +#endif
>  
>  	error = -EINVAL;	/* when rejecting VM_GROWSDOWN|VM_GROWSUP */
>  
> -- 
> 1.7.11.7

The hugetlb_instantiation_mutex has always been rather an embarrassment:
it would be much more satisfying to remove the need for it, than to split
it in this way.  (Maybe a technique like THP sometimes uses, marking an
entry as in transition while the new entry is prepared.)
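
(For illustration only, a very rough and entirely hypothetical sketch of what
"marking an entry as in transition" could look like in hugetlb_fault() - the
*_inflight_* helpers below do not exist, and error paths, pagecache insertion
and clearing the marker on allocation failure are all omitted:)

	spin_lock(&mm->page_table_lock);
	entry = huge_ptep_get(ptep);
	if (huge_pte_none(entry)) {
		/* claim the slot before sleeping in the allocator */
		set_huge_pte_at(mm, address, ptep, make_inflight_huge_pte());
		spin_unlock(&mm->page_table_lock);

		page = alloc_huge_page(vma, address, 0);	/* may sleep */

		spin_lock(&mm->page_table_lock);
		set_huge_pte_at(mm, address, ptep,
				make_huge_pte(vma, page, writable));
	} else if (is_inflight_huge_pte(entry)) {
		/* someone else is instantiating this entry: back off */
		spin_unlock(&mm->page_table_lock);
		return VM_FAULT_RETRY;	/* hand-waving: retry or wait */
	}
	spin_unlock(&mm->page_table_lock);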

But I suppose it would not have survived so long if that were easy,
and I think it may have grown some subtle dependants over the years -
as the region comment indicates.

Hugh

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
  2013-07-13  0:54   ` Hugh Dickins
@ 2013-07-15  3:16     ` Davidlohr Bueso
  -1 siblings, 0 replies; 41+ messages in thread
From: Davidlohr Bueso @ 2013-07-15  3:16 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Gibson, Andrew Morton, Rik van Riel, Michel Lespinasse,
	Mel Gorman, Konstantin Khlebnikov, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hillf Danton, linux-mm, LKML

On Fri, 2013-07-12 at 17:54 -0700, Hugh Dickins wrote:
> Adding the essential David Gibson to the Cc list.
> 
> On Fri, 12 Jul 2013, Davidlohr Bueso wrote:
> 
> > The hugetlb_instantiation_mutex serializes hugepage allocation and instantiation
> > in the page directory entry. It was found that this mutex can become quite contended
> > during the early phases of large databases which make use of huge pages - for instance
> > startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat
> > reports that this mutex can be one of the top 5 most contended locks in the kernel during
> > the first few minutes:
> > 
> > hugetlb_instantiation_mutex:      10678     10678
> >              ---------------------------
> >              hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> >              ---------------------------
> >              hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> > 
> > contentions:          10678
> > acquisitions:         99476
> > waittime-total: 76888911.01 us
> > 
> > Instead of serializing each hugetlb fault, we can deal with concurrent faults for pages
> > in different vmas. The per-vma mutex is initialized when creating a new vma. So, back to
> > the example above, we now get much less contention:
> > 
> >  &vma->hugetlb_instantiation_mutex:  1         1
> >        ---------------------------------
> >        &vma->hugetlb_instantiation_mutex       1   [<ffffffff8115e216>] hugetlb_fault+0xa6/0x350
> >        ---------------------------------
> >        &vma->hugetlb_instantiation_mutex       1    [<ffffffff8115e216>] hugetlb_fault+0xa6/0x350
> > 
> > contentions:          1
> > acquisitions:    108092
> > waittime-total:  621.24 us
> > 
> > Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
> 
> I agree this is a problem worth solving,
> but I doubt this patch is the right solution.
> 
> > ---
> >  include/linux/mm_types.h |  3 +++
> >  mm/hugetlb.c             | 12 +++++-------
> >  mm/mmap.c                |  3 +++
> >  3 files changed, 11 insertions(+), 7 deletions(-)
> > 
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index fb425aa..b45fd87 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -289,6 +289,9 @@ struct vm_area_struct {
> >  #ifdef CONFIG_NUMA
> >  	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
> >  #endif
> > +#ifdef CONFIG_HUGETLB_PAGE
> > +	struct mutex hugetlb_instantiation_mutex;
> > +#endif
> >  };
> 
> Bloating every vm_area_struct with a rarely useful mutex:
> I'm sure you can construct cases where per-vma mutex would win over
> per-mm mutex, but they will have to be very common to justify the bloat.
> 

I cannot disagree here, this was my main concern about this patch, and,
as you mentioned, if we can just get rid of the need for the lock, much
better.

> >  
> >  struct core_thread {
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 83aff0a..12e665b 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -137,12 +137,12 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
> >   * The region data structures are protected by a combination of the mmap_sem
> >   * and the hugetlb_instantion_mutex.  To access or modify a region the caller
> >   * must either hold the mmap_sem for write, or the mmap_sem for read and
> > - * the hugetlb_instantiation mutex:
> > + * the vma's hugetlb_instantiation mutex:
> 
> Reading the existing comment, this change looks very suspicious to me.
> A per-vma mutex is just not going to provide the necessary exclusion, is
> it?  (But I recall next to nothing about these regions and reservations.)
> 
> >   *
> >   *	down_write(&mm->mmap_sem);
> >   * or
> >   *	down_read(&mm->mmap_sem);
> > - *	mutex_lock(&hugetlb_instantiation_mutex);
> > + *	mutex_lock(&vma->hugetlb_instantiation_mutex);
> >   */
> >  struct file_region {
> >  	struct list_head link;
> > @@ -2547,7 +2547,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
> >  
> >  /*
> >   * Hugetlb_cow() should be called with page lock of the original hugepage held.
> > - * Called with hugetlb_instantiation_mutex held and pte_page locked so we
> > + * Called with the vma's hugetlb_instantiation_mutex held and pte_page locked so we
> >   * cannot race with other handlers or page migration.
> >   * Keep the pte_same checks anyway to make transition from the mutex easier.
> >   */
> > @@ -2847,7 +2847,6 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	int ret;
> >  	struct page *page = NULL;
> >  	struct page *pagecache_page = NULL;
> > -	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
> >  	struct hstate *h = hstate_vma(vma);
> >  
> >  	address &= huge_page_mask(h);
> > @@ -2872,7 +2871,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	 * get spurious allocation failures if two CPUs race to instantiate
> >  	 * the same page in the page cache.
> >  	 */
> > -	mutex_lock(&hugetlb_instantiation_mutex);
> > +	mutex_lock(&vma->hugetlb_instantiation_mutex);
> >  	entry = huge_ptep_get(ptep);
> >  	if (huge_pte_none(entry)) {
> >  		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
> > @@ -2943,8 +2942,7 @@ out_page_table_lock:
> >  	put_page(page);
> >  
> >  out_mutex:
> > -	mutex_unlock(&hugetlb_instantiation_mutex);
> > -
> > +	mutex_unlock(&vma->hugetlb_instantiation_mutex);
> >  	return ret;
> >  }
> >  
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index fbad7b0..8f0b034 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1543,6 +1543,9 @@ munmap_back:
> >  	vma->vm_page_prot = vm_get_page_prot(vm_flags);
> >  	vma->vm_pgoff = pgoff;
> >  	INIT_LIST_HEAD(&vma->anon_vma_chain);
> > +#ifdef CONFIG_HUGETLB_PAGE
> > +	mutex_init(&vma->hugetlb_instantiation_mutex);
> > +#endif
> >  
> >  	error = -EINVAL;	/* when rejecting VM_GROWSDOWN|VM_GROWSUP */
> >  
> > -- 
> > 1.7.11.7
> 
> The hugetlb_instantiation_mutex has always been rather an embarrassment:
> it would be much more satisfying to remove the need for it, than to split
> it in this way.  (Maybe a technique like THP sometimes uses, marking an
> entry as in transition while the new entry is prepared.)

I didn't realize this was a known issue. Doing some googling I can see
some additional alternatives to getting rid of the lock:

- [PATCH] remove hugetlb_instantiation_mutex:
https://lkml.org/lkml/2007/7/27/46

- Commit 3935baa (hugepage: serialize hugepage allocation and
instantiation): David mentioned a way to possibly avoid the need for
this lock.

> 
> But I suppose it would not have survived so long if that were easy,
> and I think it may have grown some subtle dependants over the years -
> as the region comment indicates.

Indeed. I'm not very acquainted with the hugetlb code, so, assuming this
patch's approach isn't valid, I'd like to get some mm folks' feedback on
whether we can figure out some way of getting rid of the mutex.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
  2013-07-12 23:28 ` Davidlohr Bueso
@ 2013-07-15  4:18   ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 41+ messages in thread
From: Konstantin Khlebnikov @ 2013-07-15  4:18 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Andrew Morton, Rik van Riel, Michel Lespinasse, Mel Gorman,
	Michal Hocko, Aneesh Kumar K.V, KAMEZAWA Hiroyuki, Hillf Danton,
	Hugh Dickins, linux-mm, LKML

This seems incorrect. hugetlb_instantiation_mutex protects chains of struct file_region
in inode->i_mapping->private_list (VM_MAYSHARE) or vma_resv_map(vma)->regions (!VM_MAYSHARE).
These chains obviously can be shared between several vmas, so a per-vma lock cannot protect them.
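
(To spell that out, roughly which chain a fault ends up touching - this is
only an illustration written against the mm/hugetlb.c helpers, not code from
the patch:)

	struct list_head *regions;

	if (vma->vm_flags & VM_MAYSHARE)
		/* shared mapping: one chain per inode, reachable from every
		 * process that maps the file */
		regions = &vma->vm_file->f_mapping->private_list;
	else
		/* private mapping: the resv_map hangs off the vma, but it can
		 * still be shared, e.g. by parent and child after fork() */
		regions = &vma_resv_map(vma)->regions;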

Davidlohr Bueso wrote:
> The hugetlb_instantiation_mutex serializes hugepage allocation and instantiation
> in the page directory entry. It was found that this mutex can become quite contended
> during the early phases of large databases which make use of huge pages - for instance
> startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat
> reports that this mutex can be one of the top 5 most contended locks in the kernel during
> the first few minutes:
>
> hugetlb_instantiation_mutex:      10678     10678
>               ---------------------------
>               hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
>               ---------------------------
>               hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
>
> contentions:          10678
> acquisitions:         99476
> waittime-total: 76888911.01 us
>
> Instead of serializing each hugetlb fault, we can deal with concurrent faults for pages
> in different vmas. The per-vma mutex is initialized when creating a new vma. So, back to
> the example above, we now get much less contention:
>
>   &vma->hugetlb_instantiation_mutex:  1         1
>         ---------------------------------
>         &vma->hugetlb_instantiation_mutex       1   [<ffffffff8115e216>] hugetlb_fault+0xa6/0x350
>         ---------------------------------
>         &vma->hugetlb_instantiation_mutex       1    [<ffffffff8115e216>] hugetlb_fault+0xa6/0x350
>
> contentions:          1
> acquisitions:    108092
> waittime-total:  621.24 us
>
> Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
> ---
>   include/linux/mm_types.h |  3 +++
>   mm/hugetlb.c             | 12 +++++-------
>   mm/mmap.c                |  3 +++
>   3 files changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index fb425aa..b45fd87 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -289,6 +289,9 @@ struct vm_area_struct {
>   #ifdef CONFIG_NUMA
>   	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
>   #endif
> +#ifdef CONFIG_HUGETLB_PAGE
> +	struct mutex hugetlb_instantiation_mutex;
> +#endif
>   };
>
>   struct core_thread {
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 83aff0a..12e665b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -137,12 +137,12 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
>    * The region data structures are protected by a combination of the mmap_sem
>    * and the hugetlb_instantion_mutex.  To access or modify a region the caller
>    * must either hold the mmap_sem for write, or the mmap_sem for read and
> - * the hugetlb_instantiation mutex:
> + * the vma's hugetlb_instantiation mutex:
>    *
>    *	down_write(&mm->mmap_sem);
>    * or
>    *	down_read(&mm->mmap_sem);
> - *	mutex_lock(&hugetlb_instantiation_mutex);
> + *	mutex_lock(&vma->hugetlb_instantiation_mutex);
>    */
>   struct file_region {
>   	struct list_head link;
> @@ -2547,7 +2547,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
>
>   /*
>    * Hugetlb_cow() should be called with page lock of the original hugepage held.
> - * Called with hugetlb_instantiation_mutex held and pte_page locked so we
> + * Called with the vma's hugetlb_instantiation_mutex held and pte_page locked so we
>    * cannot race with other handlers or page migration.
>    * Keep the pte_same checks anyway to make transition from the mutex easier.
>    */
> @@ -2847,7 +2847,6 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   	int ret;
>   	struct page *page = NULL;
>   	struct page *pagecache_page = NULL;
> -	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
>   	struct hstate *h = hstate_vma(vma);
>
>   	address &= huge_page_mask(h);
> @@ -2872,7 +2871,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   	 * get spurious allocation failures if two CPUs race to instantiate
>   	 * the same page in the page cache.
>   	 */
> -	mutex_lock(&hugetlb_instantiation_mutex);
> +	mutex_lock(&vma->hugetlb_instantiation_mutex);
>   	entry = huge_ptep_get(ptep);
>   	if (huge_pte_none(entry)) {
>   		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
> @@ -2943,8 +2942,7 @@ out_page_table_lock:
>   	put_page(page);
>
>   out_mutex:
> -	mutex_unlock(&hugetlb_instantiation_mutex);
> -
> +	mutex_unlock(&vma->hugetlb_instantiation_mutex);
>   	return ret;
>   }
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index fbad7b0..8f0b034 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1543,6 +1543,9 @@ munmap_back:
>   	vma->vm_page_prot = vm_get_page_prot(vm_flags);
>   	vma->vm_pgoff = pgoff;
>   	INIT_LIST_HEAD(&vma->anon_vma_chain);
> +#ifdef CONFIG_HUGETLB_PAGE
> +	mutex_init(&vma->hugetlb_instantiation_mutex);
> +#endif
>
>   	error = -EINVAL;	/* when rejecting VM_GROWSDOWN|VM_GROWSUP */
>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
  2013-07-15  3:16     ` Davidlohr Bueso
@ 2013-07-15  7:24     ` David Gibson
  2013-07-15 23:08         ` Andrew Morton
  2013-07-16  1:51         ` Rik van Riel
  -1 siblings, 2 replies; 41+ messages in thread
From: David Gibson @ 2013-07-15  7:24 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Hugh Dickins, Andrew Morton, Rik van Riel, Michel Lespinasse,
	Mel Gorman, Konstantin Khlebnikov, Michal Hocko, Aneesh Kumar K.V,
	KAMEZAWA Hiroyuki, Hillf Danton, linux-mm, LKML

On Sun, Jul 14, 2013 at 08:16:44PM -0700, Davidlohr Bueso wrote:
> On Fri, 2013-07-12 at 17:54 -0700, Hugh Dickins wrote:
> > Adding the essential David Gibson to the Cc list.
> > 
> > On Fri, 12 Jul 2013, Davidlohr Bueso wrote:
> > 
> > > The hugetlb_instantiation_mutex serializes hugepage allocation and instantiation
> > > in the page directory entry. It was found that this mutex can become quite contended
> > > during the early phases of large databases which make use of huge pages - for instance
> > > startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat
> > > reports that this mutex can be one of the top 5 most contended locks in the kernel during
> > > the first few minutes:
> > > 
> > > hugetlb_instantiation_mutex:      10678     10678
> > >              ---------------------------
> > >              hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> > >              ---------------------------
> > >              hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> > > 
> > > contentions:          10678
> > > acquisitions:         99476
> > > waittime-total: 76888911.01 us
> > > 
> > > Instead of serializing each hugetlb fault, we can deal with concurrent faults for pages
> > > in different vmas. The per-vma mutex is initialized when creating a new vma. So, back to
> > > the example above, we now get much less contention:
> > > 
> > >  &vma->hugetlb_instantiation_mutex:  1         1
> > >        ---------------------------------
> > >        &vma->hugetlb_instantiation_mutex       1   [<ffffffff8115e216>] hugetlb_fault+0xa6/0x350
> > >        ---------------------------------
> > >        &vma->hugetlb_instantiation_mutex       1    [<ffffffff8115e216>] hugetlb_fault+0xa6/0x350
> > > 
> > > contentions:          1
> > > acquisitions:    108092
> > > waittime-total:  621.24 us
> > > 
> > > Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
> > 
> > I agree this is a problem worth solving,
> > but I doubt this patch is the right solution.

It's not.

> > > ---
> > >  include/linux/mm_types.h |  3 +++
> > >  mm/hugetlb.c             | 12 +++++-------
> > >  mm/mmap.c                |  3 +++
> > >  3 files changed, 11 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index fb425aa..b45fd87 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -289,6 +289,9 @@ struct vm_area_struct {
> > >  #ifdef CONFIG_NUMA
> > >  	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
> > >  #endif
> > > +#ifdef CONFIG_HUGETLB_PAGE
> > > +	struct mutex hugetlb_instantiation_mutex;
> > > +#endif
> > >  };
> > 
> > Bloating every vm_area_struct with a rarely useful mutex:
> > I'm sure you can construct cases where per-vma mutex would win over
> > per-mm mutex, but they will have to be very common to justify the bloat.
> > 
> 
> I cannot disagree here, this was my main concern about this patch, and,
> as you mentioned, if we can just get rid of the need for the lock, much
> better.

So, by all means try to get rid of the mutex, but doing so is
surprisingly difficult - I made several failed attempts a long time
ago, before giving up and creating the mutex in the first place.

Note that there is no analogous handling in the normal page case,
because for normal pages we always assume we have a few pages of
"slush" and can temporarily overallocate without problems.  Because
hugepages are a more precious resource, it's usual to want to allocate
and use every single one - spurious OOMs when racing to allocate the
very last hugepage are what the mutex protects against.

> > >  struct core_thread {
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 83aff0a..12e665b 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -137,12 +137,12 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
> > >   * The region data structures are protected by a combination of the mmap_sem
> > >   * and the hugetlb_instantion_mutex.  To access or modify a region the caller
> > >   * must either hold the mmap_sem for write, or the mmap_sem for read and
> > > - * the hugetlb_instantiation mutex:
> > > + * the vma's hugetlb_instantiation mutex:
> > 
> > Reading the existing comment, this change looks very suspicious to me.
> > A per-vma mutex is just not going to provide the necessary exclusion, is
> > it?  (But I recall next to nothing about these regions and
> > reservations.)

A per-VMA lock is definitely wrong.  I think it handles one form of
the race, between threads sharing a VM on a MAP_PRIVATE mapping.
However, another form of the race can and does occur between different
MAP_SHARED VMAs in the same or different processes.  I think there may
be edge cases involving mremap() and MAP_PRIVATE that will also be
missed by a per-VMA lock.

Note that the libhugetlbfs testsuite contains tests for both PRIVATE
and SHARED variants of the race.
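
(The SHARED variant is easy to picture with a throwaway userspace program -
the real tests live in libhugetlbfs; the hugetlbfs mount point, the 2MB page
size and the missing error checking below are all assumptions of this sketch:)

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/mnt/huge/raceme", O_CREAT | O_RDWR, 0600);
		char *p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);

		if (fork() == 0) {	/* the child gets its own vm_area_struct */
			p[0] = 1;	/* fault on the shared page */
			_exit(0);
		}
		p[0] = 1;	/* the parent faults the same page concurrently;
			 	 * a per-vma mutex cannot order these two faults */
		wait(NULL);
		return 0;
	}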

> > >   *
> > >   *	down_write(&mm->mmap_sem);
> > >   * or
> > >   *	down_read(&mm->mmap_sem);
> > > - *	mutex_lock(&hugetlb_instantiation_mutex);
> > > + *	mutex_lock(&vma->hugetlb_instantiation_mutex);
> > >   */
> > >  struct file_region {
> > >  	struct list_head link;
> > > @@ -2547,7 +2547,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
> > >  
> > >  /*
> > >   * Hugetlb_cow() should be called with page lock of the original hugepage held.
> > > - * Called with hugetlb_instantiation_mutex held and pte_page locked so we
> > > + * Called with the vma's hugetlb_instantiation_mutex held and pte_page locked so we
> > >   * cannot race with other handlers or page migration.
> > >   * Keep the pte_same checks anyway to make transition from the mutex easier.
> > >   */
> > > @@ -2847,7 +2847,6 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > >  	int ret;
> > >  	struct page *page = NULL;
> > >  	struct page *pagecache_page = NULL;
> > > -	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
> > >  	struct hstate *h = hstate_vma(vma);
> > >  
> > >  	address &= huge_page_mask(h);
> > > @@ -2872,7 +2871,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > >  	 * get spurious allocation failures if two CPUs race to instantiate
> > >  	 * the same page in the page cache.
> > >  	 */
> > > -	mutex_lock(&hugetlb_instantiation_mutex);
> > > +	mutex_lock(&vma->hugetlb_instantiation_mutex);
> > >  	entry = huge_ptep_get(ptep);
> > >  	if (huge_pte_none(entry)) {
> > >  		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
> > > @@ -2943,8 +2942,7 @@ out_page_table_lock:
> > >  	put_page(page);
> > >  
> > >  out_mutex:
> > > -	mutex_unlock(&hugetlb_instantiation_mutex);
> > > -
> > > +	mutex_unlock(&vma->hugetlb_instantiation_mutex);
> > >  	return ret;
> > >  }
> > >  
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index fbad7b0..8f0b034 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1543,6 +1543,9 @@ munmap_back:
> > >  	vma->vm_page_prot = vm_get_page_prot(vm_flags);
> > >  	vma->vm_pgoff = pgoff;
> > >  	INIT_LIST_HEAD(&vma->anon_vma_chain);
> > > +#ifdef CONFIG_HUGETLB_PAGE
> > > +	mutex_init(&vma->hugetlb_instantiation_mutex);
> > > +#endif
> > >  
> > >  	error = -EINVAL;	/* when rejecting VM_GROWSDOWN|VM_GROWSUP */
> > >  
> > 
> > The hugetlb_instantiation_mutex has always been rather an embarrassment:
> > it would be much more satisfying to remove the need for it, than to split
> > it in this way.  (Maybe a technique like THP sometimes uses, marking an
> > entry as in transition while the new entry is prepared.)
> 
> I didn't realize this was a known issue. Doing some googling I can see
> some additional alternatives to getting rid of the lock:
> 
> - [PATCH] remove hugetlb_instantiation_mutex:
> https://lkml.org/lkml/2007/7/27/46
> 
> - Commit 3935baa (hugepage: serialize hugepage allocation and
> instantiation): David mentioned a way to possibly avoid the need for
> this lock.
> 
> > 
> > But I suppose it would not have survived so long if that were easy,
> > and I think it may have grown some subtle dependants over the years -
> > as the region comment indicates.
> 
> Indeed. I'm not very acquainted with the hugetlb code, so, assuming this
> patch's approach isn't valid and that we can figure out some way of
> getting rid of the mutex, I'd like to get some mm folks' feedback on this.

I have previously proposed a correct method of improving scalability,
although it doesn't eliminate the lock.  That's to use a set of hashed
mutexes.  It wasn't merged before, but I don't recall the reasons
why.  Old and probably bitrotted patch below for reference:

hugepage: Allow parallelization of the hugepage fault path

At present, the page fault path for hugepages is serialized by a
single mutex.  This is used to avoid spurious out-of-memory conditions
when the hugepage pool is fully utilized (two processes or threads can
race to instantiate the same mapping with the last hugepage from the
pool, the race loser returning VM_FAULT_OOM).  This problem is
specific to hugepages, because it is normal to want to use every
single hugepage in the system - with normal pages we simply assume
there will always be a few spare pages which can be used temporarily
until the race is resolved.

Unfortunately this serialization also means that clearing of hugepages
cannot be parallelized across multiple CPUs, which can lead to very
long process startup times when using large numbers of hugepages.

This patch improves the situation by replacing the single mutex with a
table of mutexes, selected based on a hash of the address_space and
file offset being faulted (or mm and virtual address for MAP_PRIVATE
mappings).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

Index: working-2.6/mm/hugetlb.c
===================================================================
--- working-2.6.orig/mm/hugetlb.c	2006-10-25 16:30:00.000000000 +1000
+++ working-2.6/mm/hugetlb.c	2006-10-26 14:38:08.000000000 +1000
@@ -32,6 +32,13 @@ static unsigned int free_huge_pages_node
  */
 static DEFINE_SPINLOCK(hugetlb_lock);
 
+/*
+ * Serializes faults on the same logical page.  This is used to
+ * prevent spurious OOMs when the hugepage pool is fully utilized.
+ */
+static int num_fault_mutexes;
+static struct mutex *htlb_fault_mutex_table;
+
 static void clear_huge_page(struct page *page, unsigned long addr)
 {
 	int i;
@@ -160,6 +167,13 @@ static int __init hugetlb_init(void)
 	}
 	max_huge_pages = free_huge_pages = nr_huge_pages = i;
 	printk("Total HugeTLB memory allocated, %ld\n", free_huge_pages);
+
+	num_fault_mutexes = 2 * num_possible_cpus() - 1;
+	htlb_fault_mutex_table =
+		kmalloc(sizeof(struct mutex) * num_fault_mutexes, GFP_KERNEL);
+	for (i = 0; i < num_fault_mutexes; i++)
+		mutex_init(&htlb_fault_mutex_table[i]);
+
 	return 0;
 }
 module_init(hugetlb_init);
@@ -458,19 +472,14 @@ static int hugetlb_cow(struct mm_struct 
 }
 
 int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, pte_t *ptep, int write_access)
+		    struct address_space *mapping, unsigned long idx,
+		    unsigned long address, pte_t *ptep, int write_access)
 {
 	int ret = VM_FAULT_SIGBUS;
-	unsigned long idx;
 	unsigned long size;
 	struct page *page;
-	struct address_space *mapping;
 	pte_t new_pte;
 
-	mapping = vma->vm_file->f_mapping;
-	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
-		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
-
 	/*
 	 * Use page lock to guard against racing truncation
 	 * before we get page_table_lock.
@@ -545,28 +554,50 @@ out:
 	return ret;
 }
 
+static int fault_mutex_hash(struct mm_struct *mm, struct vm_area_struct *vma,
+			    struct address_space *mapping,
+			    unsigned long pagenum, unsigned long address)
+{
+	void *p = mapping;
+
+	if (! (vma->vm_flags & VM_SHARED)) {
+		p = mm;
+		pagenum = address >> HPAGE_SHIFT;
+	}
+
+	return ((unsigned long)p + pagenum) % num_fault_mutexes;
+}
+
 int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, int write_access)
 {
 	pte_t *ptep;
 	pte_t entry;
 	int ret;
-	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
+	struct address_space *mapping;
+	unsigned long idx;
+	int hash;
 
 	ptep = huge_pte_alloc(mm, address);
 	if (!ptep)
 		return VM_FAULT_OOM;
 
+	mapping = vma->vm_file->f_mapping;
+	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
+		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
-	mutex_lock(&hugetlb_instantiation_mutex);
+	hash = fault_mutex_hash(mm, vma, mapping, idx, address);
+	mutex_lock(&htlb_fault_mutex_table[hash]);
 	entry = *ptep;
 	if (pte_none(entry)) {
-		ret = hugetlb_no_page(mm, vma, address, ptep, write_access);
-		mutex_unlock(&hugetlb_instantiation_mutex);
+		ret = hugetlb_no_page(mm, vma, mapping, idx,
+				      address, ptep, write_access);
+		mutex_unlock(&htlb_fault_mutex_table[hash]);
 		return ret;
 	}
 
@@ -578,7 +609,7 @@ int hugetlb_fault(struct mm_struct *mm, 
 		if (write_access && !pte_write(entry))
 			ret = hugetlb_cow(mm, vma, address, ptep, entry);
 	spin_unlock(&mm->page_table_lock);
-	mutex_unlock(&hugetlb_instantiation_mutex);
+	mutex_unlock(&htlb_fault_mutex_table[hash]);
 
 	return ret;
 }


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
  2013-07-15  7:24     ` David Gibson
@ 2013-07-15 23:08         ` Andrew Morton
  2013-07-16  1:51         ` Rik van Riel
  1 sibling, 0 replies; 41+ messages in thread
From: Andrew Morton @ 2013-07-15 23:08 UTC (permalink / raw)
  To: David Gibson
  Cc: Davidlohr Bueso, Hugh Dickins, Rik van Riel, Michel Lespinasse,
	Mel Gorman, Konstantin Khlebnikov, Michal Hocko, AneeshKumarK.V,
	KAMEZAWA Hiroyuki, Hillf Danton, linux-mm, LKML

On Mon, 15 Jul 2013 17:24:32 +1000 David Gibson <david@gibson.dropbear.id.au> wrote:

> I have previously proposed a correct method of improving scalability,
> although it doesn't eliminate the lock.  That's to use a set of hashed
> mutexes.

Yep - hashing the mutexes is an obvious and nicely localized way of
improving this.  It's a tweak, not a design change.

The changelog should describe the choice of the hash key with great
precision, please.  It's important and is the first thing which
reviewers and readers will zoom in on.

Should the individual mutexes be cacheline aligned?  Depends on the
acquisition frequency, I guess.  Please let's work through that.
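
For illustration only (a sketch, not something in the patch as posted),
per-mutex alignment could look roughly like this, trading memory for
the absence of false sharing between adjacent table entries:

	/* illustrative only: pad each fault mutex to its own cache line */
	struct htlb_fault_mutex {
		struct mutex lock;
	} ____cacheline_aligned_in_smp;

	static struct htlb_fault_mutex *htlb_fault_mutex_table;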

Let's not damage uniprocessor kernels too much.  AFAICT the main offender
here is fault_mutex_hash(), which is the world's most obfuscated "return
0;".

>  It wasn't merged before, but I don't recall the reasons
> why. 

Me either.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
  2013-07-15 23:08         ` Andrew Morton
@ 2013-07-16  0:12           ` Davidlohr Bueso
  -1 siblings, 0 replies; 41+ messages in thread
From: Davidlohr Bueso @ 2013-07-16  0:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Gibson, Hugh Dickins, Rik van Riel, Michel Lespinasse,
	Mel Gorman, Konstantin Khlebnikov, Michal Hocko, AneeshKumarK.V,
	KAMEZAWA Hiroyuki, Hillf Danton, linux-mm, LKML, Anton Blanchard

On Mon, 2013-07-15 at 16:08 -0700, Andrew Morton wrote:
> On Mon, 15 Jul 2013 17:24:32 +1000 David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > I have previously proposed a correct method of improving scalability,
> > although it doesn't eliminate the lock.  That's to use a set of hashed
> > mutexes.
> 
> Yep - hashing the mutexes is an obvious and nicely localized way of
> improving this.  It's a tweak, not a design change.
> 
> The changelog should describe the choice of the hash key with great
> precision, please.  It's important and is the first thing which
> reviewers and readers will zoom in on.
> 
> Should the individual mutexes be cacheline aligned?  Depends on the
> acquisition frequency, I guess.  Please let's work through that.

In my test cases, involving different RDBMS, I'm getting around 114k
acquisitions.

> 
> Let's not damage uniprocesor kernels too much.  AFACIT the main offender
> here is fault_mutex_hash(), which is the world's most obfuscated "return
> 0;".

I guess we could add an ifndef CONFIG_SMP check to the function and
return 0 right away. That would eliminate any overhead in
fault_mutex_hash().

> 
> >  It wasn't merged before, but I don't recall the reasons
> > why. 

So I've forward ported the patch (will send once everyone agrees that
the matter is settled), including the changes Anton Blanchard added
exactly two years ago:

https://lkml.org/lkml/2011/7/15/31

My tests show that the number of lock contentions drops from ~11k to
around 500. So this approach alleviates a lot of the bottleneck. I've
also run it against libhugetlbfs without any regressions.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
  2013-07-15  7:24     ` David Gibson
@ 2013-07-16  1:51         ` Rik van Riel
  2013-07-16  1:51         ` Rik van Riel
  1 sibling, 0 replies; 41+ messages in thread
From: Rik van Riel @ 2013-07-16  1:51 UTC (permalink / raw)
  To: David Gibson
  Cc: Davidlohr Bueso, Hugh Dickins, Andrew Morton, Michel Lespinasse,
	Mel Gorman, Konstantin Khlebnikov, Michal Hocko, AneeshKumarK.V,
	KAMEZAWA Hiroyuki, Hillf Danton, linux-mm, LKML

On 07/15/2013 03:24 AM, David Gibson wrote:
> On Sun, Jul 14, 2013 at 08:16:44PM -0700, Davidlohr Bueso wrote:

>>> Reading the existing comment, this change looks very suspicious to me.
>>> A per-vma mutex is just not going to provide the necessary exclusion, is
>>> it?  (But I recall next to nothing about these regions and
>>> reservations.)
>
> A per-VMA lock is definitely wrong.  I think it handles one form of
> the race, between threads sharing a VM on a MAP_PRIVATE mapping.
> However another form of the race can and does occur between different
> MAP_SHARED VMAs in the same or different processes.  I think there may
> be edge cases involving mremap() and MAP_PRIVATE that will also be
> missed by a per-VMA lock.
>
> Note that the libhugetlbfs testsuite contains tests for both PRIVATE
> and SHARED variants of the race.

Can we get away with simply using a mutex in the file?
Say vma->vm_file->mapping->i_mmap_mutex?
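
Roughly (sketch only, untested; the field on struct file is f_mapping):

	/* in hugetlb_fault(), instead of the single global mutex */
	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
	...
	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);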

That might help with multiple processes initializing
multiple shared memory segments at the same time, and
should not hurt the case of a process mapping its own
hugetlbfs area.

It might have the potential to hurt when getting private
copies on a MAP_PRIVATE area, though.  I have no idea
how common it is for multiple processes to MAP_PRIVATE
the same hugetlbfs file, though...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
  2013-07-16  1:51         ` Rik van Riel
@ 2013-07-16  5:34           ` Joonsoo Kim
  -1 siblings, 0 replies; 41+ messages in thread
From: Joonsoo Kim @ 2013-07-16  5:34 UTC (permalink / raw)
  To: Rik van Riel
  Cc: David Gibson, Davidlohr Bueso, Hugh Dickins, Andrew Morton,
	Michel Lespinasse, Mel Gorman, Konstantin Khlebnikov,
	Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki, Hillf Danton,
	linux-mm, LKML

On Mon, Jul 15, 2013 at 09:51:21PM -0400, Rik van Riel wrote:
> On 07/15/2013 03:24 AM, David Gibson wrote:
> >On Sun, Jul 14, 2013 at 08:16:44PM -0700, Davidlohr Bueso wrote:
> 
> >>>Reading the existing comment, this change looks very suspicious to me.
> >>>A per-vma mutex is just not going to provide the necessary exclusion, is
> >>>it?  (But I recall next to nothing about these regions and
> >>>reservations.)
> >
> >A per-VMA lock is definitely wrong.  I think it handles one form of
> >the race, between threads sharing a VM on a MAP_PRIVATE mapping.
> >However another form of the race can and does occur between different
> >MAP_SHARED VMAs in the same or different processes.  I think there may
> >be edge cases involving mremap() and MAP_PRIVATE that will also be
> >missed by a per-VMA lock.
> >
> >Note that the libhugetlbfs testsuite contains tests for both PRIVATE
> >and SHARED variants of the race.
> 
> Can we get away with simply using a mutex in the file?
> Say vma->vm_file->mapping->i_mmap_mutex?

I totally agree with this approach :)

> 
> That might help with multiple processes initializing
> multiple shared memory segments at the same time, and
> should not hurt the case of a process mapping its own
> hugetlbfs area.
> 
> It might have the potential to hurt when getting private
> copies on a MAP_PRIVATE area, though.  I have no idea
> how common it is for multiple processes to MAP_PRIVATE
> the same hugetlbfs file, though...

Currently, getting private copies on a MAP_PRIVATE area is also
serialized by hugetlb_instantiation_mutex.
How do we get worse with your approach?

BTW, we have one race problem related to hugetlb_instantiation_mutex.
It is not the right protection for region structure handling. We map the
area without holding the hugetlb_instantiation_mutex, so there is a
race condition between mapping a new area and faulting the other area.
Am I missing something?

Thanks.

> 
> -- 
> All rights reversed
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
  2013-07-16  0:12           ` Davidlohr Bueso
  (?)
@ 2013-07-16  8:00           ` David Gibson
  -1 siblings, 0 replies; 41+ messages in thread
From: David Gibson @ 2013-07-16  8:00 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Andrew Morton, Hugh Dickins, Rik van Riel, Michel Lespinasse,
	Mel Gorman, Konstantin Khlebnikov, Michal Hocko, AneeshKumarK.V,
	KAMEZAWA Hiroyuki, Hillf Danton, linux-mm, LKML, Anton Blanchard

[-- Attachment #1: Type: text/plain, Size: 1237 bytes --]

On Mon, Jul 15, 2013 at 05:12:31PM -0700, Davidlohr Bueso wrote:
> On Mon, 2013-07-15 at 16:08 -0700, Andrew Morton wrote:
> > On Mon, 15 Jul 2013 17:24:32 +1000 David Gibson <david@gibson.dropbear.id.au> wrote:
> > 
> > > I have previously proposed a correct method of improving scalability,
> > > although it doesn't eliminate the lock.  That's to use a set of hashed
> > > mutexes.
> > 
> > Yep - hashing the mutexes is an obvious and nicely localized way of
> > improving this.  It's a tweak, not a design change.
> > 
> > The changelog should describe the choice of the hash key with great
> > precision, please.  It's important and is the first thing which
> > reviewers and readers will zoom in on.

Yeah, that is important.

I no longer have much interest in the result of this patch, so I'll
leave it to others to do the forward port and cleanup.

But I will point out the gotcha here is that the hash key needs to be
based on (address_space & file offset) for MAP_SHARED, but (mm &
address) for MAP_PRIVATE.
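
Roughly (sketch only, mirroring the hash in the old patch earlier in
the thread):

	if (vma->vm_flags & VM_SHARED) {
		/* shared: all mappings of the same file page hash together */
		key = hash(mapping, file_offset);
	} else {
		/* private: faults on one address in one mm hash together */
		key = hash(mm, address);
	}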

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
  2013-07-16  1:51         ` Rik van Riel
  (?)
  (?)
@ 2013-07-16  8:20         ` David Gibson
  -1 siblings, 0 replies; 41+ messages in thread
From: David Gibson @ 2013-07-16  8:20 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Davidlohr Bueso, Hugh Dickins, Andrew Morton, Michel Lespinasse,
	Mel Gorman, Konstantin Khlebnikov, Michal Hocko, AneeshKumarK.V,
	KAMEZAWA Hiroyuki, Hillf Danton, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 1986 bytes --]

On Mon, Jul 15, 2013 at 09:51:21PM -0400, Rik van Riel wrote:
> On 07/15/2013 03:24 AM, David Gibson wrote:
> >On Sun, Jul 14, 2013 at 08:16:44PM -0700, Davidlohr Bueso wrote:
> 
> >>>Reading the existing comment, this change looks very suspicious to me.
> >>>A per-vma mutex is just not going to provide the necessary exclusion, is
> >>>it?  (But I recall next to nothing about these regions and
> >>>reservations.)
> >
> >A per-VMA lock is definitely wrong.  I think it handles one form of
> >the race, between threads sharing a VM on a MAP_PRIVATE mapping.
> >However another form of the race can and does occur between different
> >MAP_SHARED VMAs in the same or different processes.  I think there may
> >be edge cases involving mremap() and MAP_PRIVATE that will also be
> >missed by a per-VMA lock.
> >
> >Note that the libhugetlbfs testsuite contains tests for both PRIVATE
> >and SHARED variants of the race.
> 
> Can we get away with simply using a mutex in the file?
> Say vma->vm_file->mapping->i_mmap_mutex?

So I don't know the VM well enough to know if this could conflict with
other usages of i_mmap_mutex.  But unfortunately, whether or not it's
otherwise correct, that approach won't address the scalability issue at
hand here.

In the case where the race matters, we're always dealing with the same
file.  Otherwise, we'd end up with a genuine, rather than spurious,
out-of-memory error, regardless of how the race turned out.

So in the case with the performance bottleneck we're considering, the
i_mmap_mutex approach degenerates to serialization on a single mutex,
just as before.

In order to improve scalability, we need to consider which page within
the file we're instantiating, which is what the hash function in my
patch does.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
  2013-07-16  5:34           ` Joonsoo Kim
  (?)
@ 2013-07-16 10:01           ` David Gibson
  2013-07-18  6:50               ` Joonsoo Kim
  -1 siblings, 1 reply; 41+ messages in thread
From: David Gibson @ 2013-07-16 10:01 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Rik van Riel, Davidlohr Bueso, Hugh Dickins, Andrew Morton,
	Michel Lespinasse, Mel Gorman, Konstantin Khlebnikov,
	Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki, Hillf Danton,
	linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 2458 bytes --]

On Tue, Jul 16, 2013 at 02:34:24PM +0900, Joonsoo Kim wrote:
> On Mon, Jul 15, 2013 at 09:51:21PM -0400, Rik van Riel wrote:
> > On 07/15/2013 03:24 AM, David Gibson wrote:
> > >On Sun, Jul 14, 2013 at 08:16:44PM -0700, Davidlohr Bueso wrote:
> > 
> > >>>Reading the existing comment, this change looks very suspicious to me.
> > >>>A per-vma mutex is just not going to provide the necessary exclusion, is
> > >>>it?  (But I recall next to nothing about these regions and
> > >>>reservations.)
> > >
> > >A per-VMA lock is definitely wrong.  I think it handles one form of
> > >the race, between threads sharing a VM on a MAP_PRIVATE mapping.
> > >However another form of the race can and does occur between different
> > >MAP_SHARED VMAs in the same or different processes.  I think there may
> > >be edge cases involving mremap() and MAP_PRIVATE that will also be
> > >missed by a per-VMA lock.
> > >
> > >Note that the libhugetlbfs testsuite contains tests for both PRIVATE
> > >and SHARED variants of the race.
> > 
> > Can we get away with simply using a mutex in the file?
> > Say vma->vm_file->mapping->i_mmap_mutex?
> 
> I totally agree with this approach :)
> 
> > 
> > That might help with multiple processes initializing
> > multiple shared memory segments at the same time, and
> > should not hurt the case of a process mapping its own
> > hugetlbfs area.
> > 
> > It might have the potential to hurt when getting private
> > copies on a MAP_PRIVATE area, though.  I have no idea
> > how common it is for multiple processes to MAP_PRIVATE
> > the same hugetlbfs file, though...
> 
> Currently, getting private copies on a MAP_PRIVATE area is also
> serialized by hugetlb_instantiation_mutex.
> How do we get worse with your approach?
> 
> BTW, we have one race problem related to hugetlb_instantiation_mutex.
> It is not the right protection for region structure handling. We map the
> area without holding the hugetlb_instantiation_mutex, so there is a
> race condition between mapping a new area and faulting the other area.
> Am I missing something?

The hugetlb_instantiation_mutex has nothing to do with protecting
region structures.  It exists only to address one very specific and
frequently misunderstood race.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH] hugepage: allow parallelization of the hugepage fault path
  2013-07-15 23:08         ` Andrew Morton
@ 2013-07-17 19:50           ` Davidlohr Bueso
  -1 siblings, 0 replies; 41+ messages in thread
From: Davidlohr Bueso @ 2013-07-17 19:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Gibson, Hugh Dickins, Rik van Riel, Michel Lespinasse,
	Mel Gorman, Konstantin Khlebnikov, Michal Hocko, AneeshKumarK.V,
	KAMEZAWA Hiroyuki, Hillf Danton, linux-mm, LKML, Eric B Munson,
	Anton Blanchard

From: David Gibson <david@gibson.dropbear.id.au>

At present, the page fault path for hugepages is serialized by a
single mutex. This is used to avoid spurious out-of-memory conditions
when the hugepage pool is fully utilized (two processes or threads can
race to instantiate the same mapping with the last hugepage from the
pool, the race loser returning VM_FAULT_OOM).  This problem is
specific to hugepages, because it is normal to want to use every
single hugepage in the system - with normal pages we simply assume
there will always be a few spare pages which can be used temporarily
until the race is resolved.

Unfortunately this serialization also means that clearing of hugepages
cannot be parallelized across multiple CPUs, which can lead to very
long process startup times when using large numbers of hugepages.

This patch improves the situation by replacing the single mutex with a
table of mutexes, selected based on a hash of the page being
instantiated. For shared mappings, the hash key is the address_space
and file offset being faulted; for private mappings, it is the mm and
virtual address.

From: Anton Blanchard <anton@samba.org>
[https://lkml.org/lkml/2011/7/15/31]
Forward ported and made a few changes:

- Use the Jenkins hash to scatter the hash, better than using just the
  low bits.

- Always round num_fault_mutexes to a power of two to avoid an
  expensive modulus in the hash calculation.

I also tested this patch on a large POWER7 box using a simple parallel
fault testcase:

http://ozlabs.org/~anton/junkcode/parallel_fault.c

Command line options:

parallel_fault <nr_threads> <size in kB> <skip in kB>

First the time taken to fault 128GB of 16MB hugepages:

40.68 seconds

Now the same test with 64 concurrent threads:
39.34 seconds

Hardly any speedup. Finally the 64 concurrent threads test with
this patch applied:
0.85 seconds

We go from 40.68 seconds to 0.85 seconds, an improvement of 47.9x

This was tested with the libhugetlbfs test suite, and the PASS/FAIL
count was the same before and after this patch.

From: Davidlohr Bueso <davidlohr.bueso@hp.com>

- Cleaned up and forward ported to Linus' latest.
- Cache aligned mutexes.
- Keep non SMP systems using a single mutex.

It was found that this mutex can become quite contended
during the early phases of large databases which make use of huge pages - for instance
startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat
reports that this mutex can be one of the top 5 most contended locks in the kernel during
the first few minutes:

    	     hugetlb_instantiation_mutex:   10678     10678
             ---------------------------
             hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
             ---------------------------
             hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340

contentions:          10678
acquisitions:         99476
waittime-total: 76888911.01 us

With this patch we see much less contention and wait time:

              &htlb_fault_mutex_table[i]:   383
              --------------------------
              &htlb_fault_mutex_table[i]    383   [<ffffffff8115e27b>] hugetlb_fault+0x1eb/0x440
              --------------------------
              &htlb_fault_mutex_table[i]    383   [<ffffffff8115e27b>] hugetlb_fault+0x1eb/0x440

contentions:        383
acquisitions:    120546
waittime-total: 1381.72 us

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Anton Blanchard <anton@samba.org>
Tested-by: Eric B Munson <emunson@mgebm.net>
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
---
 mm/hugetlb.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 73 insertions(+), 14 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 83aff0a..1f6e564 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -21,6 +21,7 @@
 #include <linux/rmap.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/jhash.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -52,6 +53,13 @@ static unsigned long __initdata default_hstate_size;
  */
 DEFINE_SPINLOCK(hugetlb_lock);
 
+/*
+ * Serializes faults on the same logical page.  This is used to
+ * prevent spurious OOMs when the hugepage pool is fully utilized.
+ */
+static int num_fault_mutexes;
+static struct mutex *htlb_fault_mutex_table ____cacheline_aligned_in_smp;
+
 static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
 {
 	bool free = (spool->count == 0) && (spool->used_hpages == 0);
@@ -1896,13 +1904,15 @@ static void __exit hugetlb_exit(void)
 	for_each_hstate(h) {
 		kobject_put(hstate_kobjs[hstate_index(h)]);
 	}
-
+	kfree(htlb_fault_mutex_table);
 	kobject_put(hugepages_kobj);
 }
 module_exit(hugetlb_exit);
 
 static int __init hugetlb_init(void)
 {
+	int i;
+
 	/* Some platform decide whether they support huge pages at boot
 	 * time. On these, such as powerpc, HPAGE_SHIFT is set to 0 when
 	 * there is no such support
@@ -1927,6 +1937,19 @@ static int __init hugetlb_init(void)
 	hugetlb_register_all_nodes();
 	hugetlb_cgroup_file_init();
 
+#ifdef CONFIG_SMP
+	num_fault_mutexes = roundup_pow_of_two(2 * num_possible_cpus());
+#else
+	num_fault_mutexes = 1;
+#endif
+	htlb_fault_mutex_table =
+		kmalloc(sizeof(struct mutex) * num_fault_mutexes, GFP_KERNEL);
+	if (!htlb_fault_mutex_table)
+		return -ENOMEM;
+
+	for (i = 0; i < num_fault_mutexes; i++)
+		mutex_init(&htlb_fault_mutex_table[i]);
+
 	return 0;
 }
 module_init(hugetlb_init);
@@ -2709,15 +2732,14 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
 }
 
 static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, pte_t *ptep, unsigned int flags)
+			   struct address_space *mapping, pgoff_t idx,
+			   unsigned long address, pte_t *ptep, unsigned int flags)
 {
 	struct hstate *h = hstate_vma(vma);
 	int ret = VM_FAULT_SIGBUS;
 	int anon_rmap = 0;
-	pgoff_t idx;
 	unsigned long size;
 	struct page *page;
-	struct address_space *mapping;
 	pte_t new_pte;
 
 	/*
@@ -2731,9 +2753,6 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return ret;
 	}
 
-	mapping = vma->vm_file->f_mapping;
-	idx = vma_hugecache_offset(h, vma, address);
-
 	/*
 	 * Use page lock to guard against racing truncation
 	 * before we get page_table_lock.
@@ -2839,15 +2858,51 @@ backout_unlocked:
 	goto out;
 }
 
+#ifdef CONFIG_SMP
+static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+			    struct vm_area_struct *vma,
+			    struct address_space *mapping,
+			    pgoff_t idx, unsigned long address)
+{
+	unsigned long key[2];
+	u32 hash;
+
+	if (vma->vm_flags & VM_SHARED) {
+		key[0] = (unsigned long)mapping;
+		key[1] = idx;
+	} else {
+		key[0] = (unsigned long)mm;
+		key[1] = address >> huge_page_shift(h);
+	}
+
+	hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0);
+
+	return hash & (num_fault_mutexes - 1);
+}
+#else
+/*
+ * For uniprocessor systems we always use a single mutex, so just
+ * return 0 and avoid the hashing overhead.
+ */
+static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+			    struct vm_area_struct *vma,
+			    struct address_space *mapping,
+			    pgoff_t idx, unsigned long address)
+{
+	return 0;
+}
+#endif
+
 int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags)
 {
-	pte_t *ptep;
-	pte_t entry;
+	pgoff_t idx;
 	int ret;
+	u32 hash;
+	pte_t *ptep, entry;
 	struct page *page = NULL;
+	struct address_space *mapping;
 	struct page *pagecache_page = NULL;
-	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
 	struct hstate *h = hstate_vma(vma);
 
 	address &= huge_page_mask(h);
@@ -2867,15 +2922,20 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!ptep)
 		return VM_FAULT_OOM;
 
+	mapping = vma->vm_file->f_mapping;
+	idx = vma_hugecache_offset(h, vma, address);
+
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
-	mutex_lock(&hugetlb_instantiation_mutex);
+	hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
+	mutex_lock(&htlb_fault_mutex_table[hash]);
+
 	entry = huge_ptep_get(ptep);
 	if (huge_pte_none(entry)) {
-		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
+		ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
 		goto out_mutex;
 	}
 
@@ -2943,8 +3003,7 @@ out_page_table_lock:
 	put_page(page);
 
 out_mutex:
-	mutex_unlock(&hugetlb_instantiation_mutex);
-
+	mutex_unlock(&htlb_fault_mutex_table[hash]);
 	return ret;
 }
 
-- 
1.7.11.7




^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
  2013-07-16 10:01           ` David Gibson
@ 2013-07-18  6:50               ` Joonsoo Kim
  0 siblings, 0 replies; 41+ messages in thread
From: Joonsoo Kim @ 2013-07-18  6:50 UTC (permalink / raw)
  To: David Gibson
  Cc: Rik van Riel, Davidlohr Bueso, Hugh Dickins, Andrew Morton,
	Michel Lespinasse, Mel Gorman, Konstantin Khlebnikov,
	Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki, Hillf Danton,
	linux-mm, LKML

On Tue, Jul 16, 2013 at 08:01:46PM +1000, David Gibson wrote:
> On Tue, Jul 16, 2013 at 02:34:24PM +0900, Joonsoo Kim wrote:
> > On Mon, Jul 15, 2013 at 09:51:21PM -0400, Rik van Riel wrote:
> > > On 07/15/2013 03:24 AM, David Gibson wrote:
> > > >On Sun, Jul 14, 2013 at 08:16:44PM -0700, Davidlohr Bueso wrote:
> > > 
> > > >>>Reading the existing comment, this change looks very suspicious to me.
> > > >>>A per-vma mutex is just not going to provide the necessary exclusion, is
> > > >>>it?  (But I recall next to nothing about these regions and
> > > >>>reservations.)
> > > >
> > > >A per-VMA lock is definitely wrong.  I think it handles one form of
> > > >the race, between threads sharing a VM on a MAP_PRIVATE mapping.
> > > >However another form of the race can and does occur between different
> > > >MAP_SHARED VMAs in the same or different processes.  I think there may
> > > >be edge cases involving mremap() and MAP_PRIVATE that will also be
> > > >missed by a per-VMA lock.
> > > >
> > > >Note that the libhugetlbfs testsuite contains tests for both PRIVATE
> > > >and SHARED variants of the race.
> > > 
> > > Can we get away with simply using a mutex in the file?
> > > Say vma->vm_file->mapping->i_mmap_mutex?
> > 
> > I totally agree with this approach :)
> > 
> > > 
> > > That might help with multiple processes initializing
> > > multiple shared memory segments at the same time, and
> > > should not hurt the case of a process mapping its own
> > > hugetlbfs area.
> > > 
> > > It might have the potential to hurt when getting private
> > > copies on a MAP_PRIVATE area, though.  I have no idea
> > > how common it is for multiple processes to MAP_PRIVATE
> > > the same hugetlbfs file, though...
> > 
> > Currently, getting private copies on a MAP_PRIVATE area is also
> > serialized by hugetlb_instantiation_mutex.
> > How do we get worse with your approach?
> > 
> > BTW, we have one race problem related to hugetlb_instantiation_mutex.
> > It is not the right protection for region structure handling. We map the
> > area without holding the hugetlb_instantiation_mutex, so there is a
> > race condition between mapping a new area and faulting the other area.
> > Am I missing something?
> 
> The hugetlb_instantiation_mutex has nothing to do with protecting
> region structures.  It exists only to address one very specific and
> frequently misunderstood race.

Yes, it was introduced for that purpose, but, currently, it is also
used for protecting the region structures. You can see the comment
below in mm/hugetlb.c:

 * The region data structures are protected by a combination of the mmap_sem
 * and the hugetlb_instantion_mutex.  To access or modify a region the caller
 * must either hold the mmap_sem for write, or the mmap_sem for read and
 * the hugetlb_instantiation mutex:
 *
 *      down_write(&mm->mmap_sem);
 * or
 *      down_read(&mm->mmap_sem);
 *      mutex_lock(&hugetlb_instantiation_mutex);
 */

Thanks.

> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
@ 2013-07-18  6:50               ` Joonsoo Kim
  0 siblings, 0 replies; 41+ messages in thread
From: Joonsoo Kim @ 2013-07-18  6:50 UTC (permalink / raw)
  To: David Gibson
  Cc: Rik van Riel, Davidlohr Bueso, Hugh Dickins, Andrew Morton,
	Michel Lespinasse, Mel Gorman, Konstantin Khlebnikov,
	Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki, Hillf Danton,
	linux-mm, LKML

On Tue, Jul 16, 2013 at 08:01:46PM +1000, David Gibson wrote:
> On Tue, Jul 16, 2013 at 02:34:24PM +0900, Joonsoo Kim wrote:
> > On Mon, Jul 15, 2013 at 09:51:21PM -0400, Rik van Riel wrote:
> > > On 07/15/2013 03:24 AM, David Gibson wrote:
> > > >On Sun, Jul 14, 2013 at 08:16:44PM -0700, Davidlohr Bueso wrote:
> > > 
> > > >>>Reading the existing comment, this change looks very suspicious to me.
> > > >>>A per-vma mutex is just not going to provide the necessary exclusion, is
> > > >>>it?  (But I recall next to nothing about these regions and
> > > >>>reservations.)
> > > >
> > > >A per-VMA lock is definitely wrong.  I think it handles one form of
> > > >the race, between threads sharing a VM on a MAP_PRIVATE mapping.
> > > >However another form of the race can and does occur between different
> > > >MAP_SHARED VMAs in the same or different processes.  I think there may
> > > >be edge cases involving mremap() and MAP_PRIVATE that will also be
> > > >missed by a per-VMA lock.
> > > >
> > > >Note that the libhugetlbfs testsuite contains tests for both PRIVATE
> > > >and SHARED variants of the race.
> > > 
> > > Can we get away with simply using a mutex in the file?
> > > Say vma->vm_file->mapping->i_mmap_mutex?
> > 
> > I totally agree with this approach :)
> > 
> > > 
> > > That might help with multiple processes initializing
> > > multiple shared memory segments at the same time, and
> > > should not hurt the case of a process mapping its own
> > > hugetlbfs area.
> > > 
> > > It might have the potential to hurt when getting private
> > > copies on a MAP_PRIVATE area, though.  I have no idea
> > > how common it is for multiple processes to MAP_PRIVATE
> > > the same hugetlbfs file, though...
> > 
> > Currently, getting private copies on a MAP_PRIVATE area is also
> > serialized by hugetlb_instantiation_mutex.
> > How do we get worse with your approach?
> > 
> > BTW, we have one race problem related to hugetlb_instantiation_mutex.
> > It is not right protection for region structure handling. We map the
> > area without holding a hugetlb_instantiation_mutex, so there is
> > race condition between mapping a new area and faulting the other area.
> > Am I missing?
> 
> The hugetlb_instantiation_mutex has nothing to do with protecting
> region structures.  It exists only to address one very specific and
> frequently misunderstood race.

Yes, it was introduced for that purpose, but currently it is also
used for protecting the region structures. You can see the comment
below in mm/hugetlb.c:

 * The region data structures are protected by a combination of the mmap_sem
 * and the hugetlb_instantion_mutex.  To access or modify a region the caller
 * must either hold the mmap_sem for write, or the mmap_sem for read and
 * the hugetlb_instantiation mutex:
 *
 *      down_write(&mm->mmap_sem);
 * or
 *      down_read(&mm->mmap_sem);
 *      mutex_lock(&hugetlb_instantiation_mutex);
 */

Thanks.

> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] hugepage: allow parallelization of the hugepage fault path
  2013-07-17 19:50           ` Davidlohr Bueso
@ 2013-07-18  8:42             ` Joonsoo Kim
  -1 siblings, 0 replies; 41+ messages in thread
From: Joonsoo Kim @ 2013-07-18  8:42 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Andrew Morton, David Gibson, Hugh Dickins, Rik van Riel,
	Michel Lespinasse, Mel Gorman, Konstantin Khlebnikov,
	Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki, Hillf Danton,
	linux-mm, LKML, Eric B Munson, Anton Blanchard

On Wed, Jul 17, 2013 at 12:50:25PM -0700, Davidlohr Bueso wrote:
> From: David Gibson <david@gibson.dropbear.id.au>
> 
> At present, the page fault path for hugepages is serialized by a
> single mutex. This is used to avoid spurious out-of-memory conditions
> when the hugepage pool is fully utilized (two processes or threads can
> race to instantiate the same mapping with the last hugepage from the
> pool, the race loser returning VM_FAULT_OOM).  This problem is
> specific to hugepages, because it is normal to want to use every
> single hugepage in the system - with normal pages we simply assume
> there will always be a few spare pages which can be used temporarily
> until the race is resolved.
> 
> Unfortunately this serialization also means that clearing of hugepages
> cannot be parallelized across multiple CPUs, which can lead to very
> long process startup times when using large numbers of hugepages.
> 
> This patch improves the situation by replacing the single mutex with a
> table of mutexes, selected based on a hash, which allows us to know
> which page in the file we're instantiating. For shared mappings, the
> hash key is selected based on the address space and file offset being faulted.
> Similarly, for private mappings, the mm and virtual address are used.
> 

Hello.

With this table of mutexes, we cannot protect the region tracking
structures. See the comment below.

/*
 * Region tracking -- allows tracking of reservations and instantiated pages
 *                    across the pages in a mapping.
 *
 * The region data structures are protected by a combination of the mmap_sem
 * and the hugetlb_instantion_mutex.  To access or modify a region the caller
 * must either hold the mmap_sem for write, or the mmap_sem for read and
 * the hugetlb_instantiation mutex:
 *
 *      down_write(&mm->mmap_sem);
 * or
 *      down_read(&mm->mmap_sem);
 *      mutex_lock(&hugetlb_instantiation_mutex);
 */
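
To make the concern concrete: with a table of hashed mutexes, two faults at
different offsets of the *same* mapping will usually take different mutexes,
so nothing serializes their updates to that mapping's shared region list.
Below is a small userspace sketch of the bucket selection - a trivial mix
instead of jhash2(), and all names are made up for illustration rather than
taken from the patch:

#include <stdio.h>
#include <stdint.h>
#include <pthread.h>

#define NUM_FAULT_MUTEXES 16	/* power of two, as in the patch */

static pthread_mutex_t fault_mutex[NUM_FAULT_MUTEXES];

/* pick a mutex from (mapping, page index), roughly like fault_mutex_hash() */
static unsigned int bucket(const void *mapping, unsigned long idx)
{
	uint64_t h = (uint64_t)(uintptr_t)mapping * 0x9e3779b97f4a7c15ULL ^ idx;

	h ^= h >> 33;
	return (unsigned int)(h & (NUM_FAULT_MUTEXES - 1));
}

int main(void)
{
	int mapping;	/* its address stands in for a struct address_space * */
	int i;

	for (i = 0; i < NUM_FAULT_MUTEXES; i++)
		pthread_mutex_init(&fault_mutex[i], NULL);

	/* idx 0 and idx 1 of the same mapping normally land in different
	 * buckets, so two faulting threads can hold different mutexes while
	 * both touch the mapping's region list. */
	printf("idx 0 -> mutex %u\n", bucket(&mapping, 0));
	printf("idx 1 -> mutex %u\n", bucket(&mapping, 1));
	return 0;
}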

Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] hugepage: allow parallelization of the hugepage fault path
@ 2013-07-18  8:42             ` Joonsoo Kim
  0 siblings, 0 replies; 41+ messages in thread
From: Joonsoo Kim @ 2013-07-18  8:42 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Andrew Morton, David Gibson, Hugh Dickins, Rik van Riel,
	Michel Lespinasse, Mel Gorman, Konstantin Khlebnikov,
	Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki, Hillf Danton,
	linux-mm, LKML, Eric B Munson, Anton Blanchard

On Wed, Jul 17, 2013 at 12:50:25PM -0700, Davidlohr Bueso wrote:
> From: David Gibson <david@gibson.dropbear.id.au>
> 
> At present, the page fault path for hugepages is serialized by a
> single mutex. This is used to avoid spurious out-of-memory conditions
> when the hugepage pool is fully utilized (two processes or threads can
> race to instantiate the same mapping with the last hugepage from the
> pool, the race loser returning VM_FAULT_OOM).  This problem is
> specific to hugepages, because it is normal to want to use every
> single hugepage in the system - with normal pages we simply assume
> there will always be a few spare pages which can be used temporarily
> until the race is resolved.
> 
> Unfortunately this serialization also means that clearing of hugepages
> cannot be parallelized across multiple CPUs, which can lead to very
> long process startup times when using large numbers of hugepages.
> 
> This patch improves the situation by replacing the single mutex with a
> table of mutexes, selected based on a hash, which allows us to know
> which page in the file we're instantiating. For shared mappings, the
> hash key is selected based on the address space and file offset being faulted.
> Similarly, for private mappings, the mm and virtual address are used.
> 

Hello.

With this table of mutexes, we cannot protect the region tracking
structures. See the comment below.

/*
 * Region tracking -- allows tracking of reservations and instantiated pages
 *                    across the pages in a mapping.
 *
 * The region data structures are protected by a combination of the mmap_sem
 * and the hugetlb_instantion_mutex.  To access or modify a region the caller
 * must either hold the mmap_sem for write, or the mmap_sem for read and
 * the hugetlb_instantiation mutex:
 *
 *      down_write(&mm->mmap_sem);
 * or
 *      down_read(&mm->mmap_sem);
 *      mutex_lock(&hugetlb_instantiation_mutex);
 */

Thanks.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] hugepage: allow parallelization of the hugepage fault path
  2013-07-17 19:50           ` Davidlohr Bueso
@ 2013-07-18  9:07             ` Joonsoo Kim
  -1 siblings, 0 replies; 41+ messages in thread
From: Joonsoo Kim @ 2013-07-18  9:07 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Andrew Morton, David Gibson, Hugh Dickins, Rik van Riel,
	Michel Lespinasse, Mel Gorman, Konstantin Khlebnikov,
	Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki, Hillf Danton,
	linux-mm, LKML, Eric B Munson, Anton Blanchard

On Wed, Jul 17, 2013 at 12:50:25PM -0700, Davidlohr Bueso wrote:

> From: Davidlohr Bueso <davidlohr.bueso@hp.com>
> 
> - Cleaned up and forward ported to Linus' latest.
> - Cache aligned mutexes.
> - Keep non SMP systems using a single mutex.
> 
> It was found that this mutex can become quite contended
> during the early phases of large databases which make use of huge pages - for instance
> startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat
> reports that this mutex can be one of the top 5 most contended locks in the kernel during
> the first few minutes:
> 
>     	     hugetlb_instantiation_mutex:   10678     10678
>              ---------------------------
>              hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
>              ---------------------------
>              hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> 
> contentions:          10678
> acquisitions:         99476
> waittime-total: 76888911.01 us

Hello,
I have a question :)

So, each contention takes 7.6 ms in your result.
Do you map this area with VM_NORESERVE?
If we map with reservations (i.e. without VM_NORESERVE), then on a page fault we just
dequeue a huge page from the pool, clear it, and map it into the page table.
So I guess it shouldn't take that long, and I'm wondering why it does.

And do you use 16KB-sized hugepages?
If so, region handling could take some time. If you access the area in random
order, the number of regions can grow to more than 90000 (a 1.5GB mapping holds
roughly 98,000 16KB pages), and I guess this could be one reason for such a long
wait time.

Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] hugepage: allow parallelization of the hugepage fault path
@ 2013-07-18  9:07             ` Joonsoo Kim
  0 siblings, 0 replies; 41+ messages in thread
From: Joonsoo Kim @ 2013-07-18  9:07 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Andrew Morton, David Gibson, Hugh Dickins, Rik van Riel,
	Michel Lespinasse, Mel Gorman, Konstantin Khlebnikov,
	Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki, Hillf Danton,
	linux-mm, LKML, Eric B Munson, Anton Blanchard

On Wed, Jul 17, 2013 at 12:50:25PM -0700, Davidlohr Bueso wrote:

> From: Davidlohr Bueso <davidlohr.bueso@hp.com>
> 
> - Cleaned up and forward ported to Linus' latest.
> - Cache aligned mutexes.
> - Keep non SMP systems using a single mutex.
> 
> It was found that this mutex can become quite contended
> during the early phases of large databases which make use of huge pages - for instance
> startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat
> reports that this mutex can be one of the top 5 most contended locks in the kernel during
> the first few minutes:
> 
>     	     hugetlb_instantiation_mutex:   10678     10678
>              ---------------------------
>              hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
>              ---------------------------
>              hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> 
> contentions:          10678
> acquisitions:         99476
> waittime-total: 76888911.01 us

Hello,
I have a question :)

So, each contention takes 7.6 ms in your result.
Do you map this area with VM_NORESERVE?
If we map with reservations (i.e. without VM_NORESERVE), then on a page fault we just
dequeue a huge page from the pool, clear it, and map it into the page table.
So I guess it shouldn't take that long, and I'm wondering why it does.

And do you use 16KB-sized hugepages?
If so, region handling could take some time. If you access the area in random
order, the number of regions can grow to more than 90000 (a 1.5GB mapping holds
roughly 98,000 16KB pages), and I guess this could be one reason for such a long
wait time.

Thanks.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] hugepage: allow parallelization of the hugepage fault path
  2013-07-18  9:07             ` Joonsoo Kim
@ 2013-07-19  0:19               ` Davidlohr Bueso
  -1 siblings, 0 replies; 41+ messages in thread
From: Davidlohr Bueso @ 2013-07-19  0:19 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, David Gibson, Hugh Dickins, Rik van Riel,
	Michel Lespinasse, Mel Gorman, Konstantin Khlebnikov,
	Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki, Hillf Danton,
	linux-mm, LKML, Eric B Munson, Anton Blanchard

On Thu, 2013-07-18 at 18:07 +0900, Joonsoo Kim wrote:
> On Wed, Jul 17, 2013 at 12:50:25PM -0700, Davidlohr Bueso wrote:
> 
> > From: Davidlohr Bueso <davidlohr.bueso@hp.com>
> > 
> > - Cleaned up and forward ported to Linus' latest.
> > - Cache aligned mutexes.
> > - Keep non SMP systems using a single mutex.
> > 
> > It was found that this mutex can become quite contended
> > during the early phases of large databases which make use of huge pages - for instance
> > startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat
> > reports that this mutex can be one of the top 5 most contended locks in the kernel during
> > the first few minutes:
> > 
> >     	     hugetlb_instantiation_mutex:   10678     10678
> >              ---------------------------
> >              hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> >              ---------------------------
> >              hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> > 
> > contentions:          10678
> > acquisitions:         99476
> > waittime-total: 76888911.01 us
> 
> Hello,
> I have a question :)
> 
> So, each contention takes 7.6 ms in your result.

Well, that's the total wait time. I can see your concern, but no, things
aren't *that* bad. The average amount of time spent waiting for the lock
would be 76888911.01/10678 = 7200us

> Do you map this area with VM_NORESERVE?
> If we map with VM_RESERVE, when page fault, we just dequeue a huge page from a queue and clear
> a page and then map it to a page table. So I guess, it shouldn't take so long.
> I'm wondering why it takes so long.
> 

I cannot really say; this is proprietary software. AFAICT, if Oracle is
anything like Postgres, then probably not.


> And do you use 16KB-size hugepage?

No, 2MB pages.

> If so, region handling could takes some times. If you access the area as random order,
> the number of region can be more than 90000. I guess, this can be one reason to too long
> waittime.
> 
> Thanks.

Thanks,
Davidlohr



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] hugepage: allow parallelization of the hugepage fault path
@ 2013-07-19  0:19               ` Davidlohr Bueso
  0 siblings, 0 replies; 41+ messages in thread
From: Davidlohr Bueso @ 2013-07-19  0:19 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, David Gibson, Hugh Dickins, Rik van Riel,
	Michel Lespinasse, Mel Gorman, Konstantin Khlebnikov,
	Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki, Hillf Danton,
	linux-mm, LKML, Eric B Munson, Anton Blanchard

On Thu, 2013-07-18 at 18:07 +0900, Joonsoo Kim wrote:
> On Wed, Jul 17, 2013 at 12:50:25PM -0700, Davidlohr Bueso wrote:
> 
> > From: Davidlohr Bueso <davidlohr.bueso@hp.com>
> > 
> > - Cleaned up and forward ported to Linus' latest.
> > - Cache aligned mutexes.
> > - Keep non SMP systems using a single mutex.
> > 
> > It was found that this mutex can become quite contended
> > during the early phases of large databases which make use of huge pages - for instance
> > startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat
> > reports that this mutex can be one of the top 5 most contended locks in the kernel during
> > the first few minutes:
> > 
> >     	     hugetlb_instantiation_mutex:   10678     10678
> >              ---------------------------
> >              hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> >              ---------------------------
> >              hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> > 
> > contentions:          10678
> > acquisitions:         99476
> > waittime-total: 76888911.01 us
> 
> Hello,
> I have a question :)
> 
> So, each contention takes 7.6 ms in your result.

Well, that's the total wait time. I can see your concern, but no, things
aren't *that* bad. The average amount of time spent waiting for the lock
would be 76888911.01/10678 = 7200us

> Do you map this area with VM_NORESERVE?
> If we map with VM_RESERVE, when page fault, we just dequeue a huge page from a queue and clear
> a page and then map it to a page table. So I guess, it shouldn't take so long.
> I'm wondering why it takes so long.
> 

I cannot really say; this is proprietary software. AFAICT, if Oracle is
anything like Postgres, then probably not.


> And do you use 16KB-size hugepage?

No, 2MB pages.

> If so, region handling could takes some times. If you access the area as random order,
> the number of region can be more than 90000. I guess, this can be one reason to too long
> waittime.
> 
> Thanks.

Thanks,
Davidlohr



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] hugepage: allow parallelization of the hugepage fault path
  2013-07-19  0:19               ` Davidlohr Bueso
@ 2013-07-19  0:35                 ` Davidlohr Bueso
  -1 siblings, 0 replies; 41+ messages in thread
From: Davidlohr Bueso @ 2013-07-19  0:35 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, David Gibson, Hugh Dickins, Rik van Riel,
	Michel Lespinasse, Mel Gorman, Konstantin Khlebnikov,
	Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki, Hillf Danton,
	linux-mm, LKML, Eric B Munson, Anton Blanchard

On Thu, 2013-07-18 at 17:19 -0700, Davidlohr Bueso wrote:
> On Thu, 2013-07-18 at 18:07 +0900, Joonsoo Kim wrote:
> > On Wed, Jul 17, 2013 at 12:50:25PM -0700, Davidlohr Bueso wrote:
> > 
> > > From: Davidlohr Bueso <davidlohr.bueso@hp.com>
> > > 
> > > - Cleaned up and forward ported to Linus' latest.
> > > - Cache aligned mutexes.
> > > - Keep non SMP systems using a single mutex.
> > > 
> > > It was found that this mutex can become quite contended
> > > during the early phases of large databases which make use of huge pages - for instance
> > > startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat
> > > reports that this mutex can be one of the top 5 most contended locks in the kernel during
> > > the first few minutes:
> > > 
> > >     	     hugetlb_instantiation_mutex:   10678     10678
> > >              ---------------------------
> > >              hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> > >              ---------------------------
> > >              hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> > > 
> > > contentions:          10678
> > > acquisitions:         99476
> > > waittime-total: 76888911.01 us
> > 
> > Hello,
> > I have a question :)
> > 
> > So, each contention takes 7.6 ms in your result.
> 
> Well, that's the total wait time. I can see your concern, but no, things
> aren't *that* bad. The average amount of time spent waiting for the lock
> would be 76888911.01/10678 = 7200us
> 

Long day, I think this is what you were saying from the start :)

> > Do you map this area with VM_NORESERVE?
> > If we map with VM_RESERVE, when page fault, we just dequeue a huge page from a queue and clear
> > a page and then map it to a page table. So I guess, it shouldn't take so long.
> > I'm wondering why it takes so long.
> > 
> 
> I cannot really say. This is proprietary software. AFAICT if Oracle is
> anything like Posgres, than probably no.
> 
> 
> > And do you use 16KB-size hugepage?
> 
> No, 2Mb pages.
> 
> > If so, region handling could takes some times. If you access the area as random order,
> > the number of region can be more than 90000. I guess, this can be one reason to too long
> > waittime.
> > 
> > Thanks.
> 
> Thanks,
> Davidlohr
> 



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] hugepage: allow parallelization of the hugepage fault path
@ 2013-07-19  0:35                 ` Davidlohr Bueso
  0 siblings, 0 replies; 41+ messages in thread
From: Davidlohr Bueso @ 2013-07-19  0:35 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, David Gibson, Hugh Dickins, Rik van Riel,
	Michel Lespinasse, Mel Gorman, Konstantin Khlebnikov,
	Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki, Hillf Danton,
	linux-mm, LKML, Eric B Munson, Anton Blanchard

On Thu, 2013-07-18 at 17:19 -0700, Davidlohr Bueso wrote:
> On Thu, 2013-07-18 at 18:07 +0900, Joonsoo Kim wrote:
> > On Wed, Jul 17, 2013 at 12:50:25PM -0700, Davidlohr Bueso wrote:
> > 
> > > From: Davidlohr Bueso <davidlohr.bueso@hp.com>
> > > 
> > > - Cleaned up and forward ported to Linus' latest.
> > > - Cache aligned mutexes.
> > > - Keep non SMP systems using a single mutex.
> > > 
> > > It was found that this mutex can become quite contended
> > > during the early phases of large databases which make use of huge pages - for instance
> > > startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat
> > > reports that this mutex can be one of the top 5 most contended locks in the kernel during
> > > the first few minutes:
> > > 
> > >     	     hugetlb_instantiation_mutex:   10678     10678
> > >              ---------------------------
> > >              hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> > >              ---------------------------
> > >              hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> > > 
> > > contentions:          10678
> > > acquisitions:         99476
> > > waittime-total: 76888911.01 us
> > 
> > Hello,
> > I have a question :)
> > 
> > So, each contention takes 7.6 ms in your result.
> 
> Well, that's the total wait time. I can see your concern, but no, things
> aren't *that* bad. The average amount of time spent waiting for the lock
> would be 76888911.01/10678 = 7200us
> 

Long day, I think this is what you were saying from the start :)

> > Do you map this area with VM_NORESERVE?
> > If we map with VM_RESERVE, when page fault, we just dequeue a huge page from a queue and clear
> > a page and then map it to a page table. So I guess, it shouldn't take so long.
> > I'm wondering why it takes so long.
> > 
> 
> I cannot really say. This is proprietary software. AFAICT if Oracle is
> anything like Posgres, than probably no.
> 
> 
> > And do you use 16KB-size hugepage?
> 
> No, 2Mb pages.
> 
> > If so, region handling could takes some times. If you access the area as random order,
> > the number of region can be more than 90000. I guess, this can be one reason to too long
> > waittime.
> > 
> > Thanks.
> 
> Thanks,
> Davidlohr
> 



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] hugepage: allow parallelization of the hugepage fault path
  2013-07-18  8:42             ` Joonsoo Kim
  (?)
@ 2013-07-19  7:14             ` David Gibson
  2013-07-19 21:24                 ` Davidlohr Bueso
  -1 siblings, 1 reply; 41+ messages in thread
From: David Gibson @ 2013-07-19  7:14 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Davidlohr Bueso, Andrew Morton, Hugh Dickins, Rik van Riel,
	Michel Lespinasse, Mel Gorman, Konstantin Khlebnikov,
	Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki, Hillf Danton,
	linux-mm, LKML, Eric B Munson, Anton Blanchard

[-- Attachment #1: Type: text/plain, Size: 2508 bytes --]

On Thu, Jul 18, 2013 at 05:42:35PM +0900, Joonsoo Kim wrote:
> On Wed, Jul 17, 2013 at 12:50:25PM -0700, Davidlohr Bueso wrote:
> > From: David Gibson <david@gibson.dropbear.id.au>
> > 
> > At present, the page fault path for hugepages is serialized by a
> > single mutex. This is used to avoid spurious out-of-memory conditions
> > when the hugepage pool is fully utilized (two processes or threads can
> > race to instantiate the same mapping with the last hugepage from the
> > pool, the race loser returning VM_FAULT_OOM).  This problem is
> > specific to hugepages, because it is normal to want to use every
> > single hugepage in the system - with normal pages we simply assume
> > there will always be a few spare pages which can be used temporarily
> > until the race is resolved.
> > 
> > Unfortunately this serialization also means that clearing of hugepages
> > cannot be parallelized across multiple CPUs, which can lead to very
> > long process startup times when using large numbers of hugepages.
> > 
> > This patch improves the situation by replacing the single mutex with a
> > table of mutexes, selected based on a hash, which allows us to know
> > which page in the file we're instantiating. For shared mappings, the
> > hash key is selected based on the address space and file offset being faulted.
> > Similarly, for private mappings, the mm and virtual address are used.
> > 
> 
> Hello.
> 
> With this table mutex, we cannot protect region tracking structure.
> See below comment.
> 
> /*
>  * Region tracking -- allows tracking of reservations and instantiated pages
>  *                    across the pages in a mapping.
>  *
>  * The region data structures are protected by a combination of the mmap_sem
>  * and the hugetlb_instantion_mutex.  To access or modify a region the caller
>  * must either hold the mmap_sem for write, or the mmap_sem for read and
>  * the hugetlb_instantiation mutex:
>  *
>  *      down_write(&mm->mmap_sem);
>  * or
>  *      down_read(&mm->mmap_sem);
>  *      mutex_lock(&hugetlb_instantiation_mutex);
>  */

Ugh.  Who the hell added that?  I guess you'll need to split off
another mutex for that purpose; afaict there should be no interaction
with the actual, intended purpose of the instantiation mutex.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] hugepage: allow parallelization of the hugepage fault path
  2013-07-19  7:14             ` David Gibson
@ 2013-07-19 21:24                 ` Davidlohr Bueso
  0 siblings, 0 replies; 41+ messages in thread
From: Davidlohr Bueso @ 2013-07-19 21:24 UTC (permalink / raw)
  To: David Gibson
  Cc: Joonsoo Kim, Andrew Morton, Hugh Dickins, Rik van Riel,
	Michel Lespinasse, Mel Gorman, Konstantin Khlebnikov,
	Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki, Hillf Danton,
	linux-mm, LKML, Eric B Munson, Anton Blanchard

On Fri, 2013-07-19 at 17:14 +1000, David Gibson wrote:
> On Thu, Jul 18, 2013 at 05:42:35PM +0900, Joonsoo Kim wrote:
> > On Wed, Jul 17, 2013 at 12:50:25PM -0700, Davidlohr Bueso wrote:
> > > From: David Gibson <david@gibson.dropbear.id.au>
> > > 
> > > At present, the page fault path for hugepages is serialized by a
> > > single mutex. This is used to avoid spurious out-of-memory conditions
> > > when the hugepage pool is fully utilized (two processes or threads can
> > > race to instantiate the same mapping with the last hugepage from the
> > > pool, the race loser returning VM_FAULT_OOM).  This problem is
> > > specific to hugepages, because it is normal to want to use every
> > > single hugepage in the system - with normal pages we simply assume
> > > there will always be a few spare pages which can be used temporarily
> > > until the race is resolved.
> > > 
> > > Unfortunately this serialization also means that clearing of hugepages
> > > cannot be parallelized across multiple CPUs, which can lead to very
> > > long process startup times when using large numbers of hugepages.
> > > 
> > > This patch improves the situation by replacing the single mutex with a
> > > table of mutexes, selected based on a hash, which allows us to know
> > > which page in the file we're instantiating. For shared mappings, the
> > > hash key is selected based on the address space and file offset being faulted.
> > > Similarly, for private mappings, the mm and virtual address are used.
> > > 
> > 
> > Hello.
> > 
> > With this table mutex, we cannot protect region tracking structure.
> > See below comment.
> > 
> > /*
> >  * Region tracking -- allows tracking of reservations and instantiated pages
> >  *                    across the pages in a mapping.
> >  *
> >  * The region data structures are protected by a combination of the mmap_sem
> >  * and the hugetlb_instantion_mutex.  To access or modify a region the caller
> >  * must either hold the mmap_sem for write, or the mmap_sem for read and
> >  * the hugetlb_instantiation mutex:
> >  *
> >  *      down_write(&mm->mmap_sem);
> >  * or
> >  *      down_read(&mm->mmap_sem);
> >  *      mutex_lock(&hugetlb_instantiation_mutex);
> >  */
> 
> Ugh.  Who the hell added that.  I guess you'll need to split of
> another mutex for that purpose, afaict there should be no interaction
> with the actual, intended purpose of the instantiation mutex.

This was added in commit 84afd99b. One way to go would be to add a
spinlock to protect changes to the regions - however reading the
changelog, and based on David's previous explanation for the
instantiation mutex, I don't see why it was added. In fact several
places modify regions without holding the instantiation mutex, e.g.
hugetlb_reserve_pages().

Am I missing something here?
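
For illustration, here is roughly what that could look like, written as a
self-contained userspace sketch rather than a kernel patch - resv_map and
file_region mirror the names in mm/hugetlb.c, but the structure and the lock
are my own simplification:

#include <pthread.h>
#include <stdlib.h>

struct file_region {
	struct file_region *next;
	long from, to;
};

struct resv_map {
	pthread_spinlock_t lock;	/* protects the region list below */
	struct file_region *regions;
};

static void resv_map_init(struct resv_map *map)
{
	pthread_spin_init(&map->lock, PTHREAD_PROCESS_PRIVATE);
	map->regions = NULL;
}

/* Simplified region_add(): the real one merges overlapping ranges; the
 * point here is only that every modifier takes map->lock, no matter which
 * fault mutex (if any) it happens to hold. */
static void region_add_locked(struct resv_map *map, long from, long to)
{
	struct file_region *rg = malloc(sizeof(*rg));

	if (!rg)
		return;
	rg->from = from;
	rg->to = to;

	pthread_spin_lock(&map->lock);
	rg->next = map->regions;
	map->regions = rg;
	pthread_spin_unlock(&map->lock);
}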

Thanks,
Davidlohr





^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] hugepage: allow parallelization of the hugepage fault path
@ 2013-07-19 21:24                 ` Davidlohr Bueso
  0 siblings, 0 replies; 41+ messages in thread
From: Davidlohr Bueso @ 2013-07-19 21:24 UTC (permalink / raw)
  To: David Gibson
  Cc: Joonsoo Kim, Andrew Morton, Hugh Dickins, Rik van Riel,
	Michel Lespinasse, Mel Gorman, Konstantin Khlebnikov,
	Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki, Hillf Danton,
	linux-mm, LKML, Eric B Munson, Anton Blanchard

On Fri, 2013-07-19 at 17:14 +1000, David Gibson wrote:
> On Thu, Jul 18, 2013 at 05:42:35PM +0900, Joonsoo Kim wrote:
> > On Wed, Jul 17, 2013 at 12:50:25PM -0700, Davidlohr Bueso wrote:
> > > From: David Gibson <david@gibson.dropbear.id.au>
> > > 
> > > At present, the page fault path for hugepages is serialized by a
> > > single mutex. This is used to avoid spurious out-of-memory conditions
> > > when the hugepage pool is fully utilized (two processes or threads can
> > > race to instantiate the same mapping with the last hugepage from the
> > > pool, the race loser returning VM_FAULT_OOM).  This problem is
> > > specific to hugepages, because it is normal to want to use every
> > > single hugepage in the system - with normal pages we simply assume
> > > there will always be a few spare pages which can be used temporarily
> > > until the race is resolved.
> > > 
> > > Unfortunately this serialization also means that clearing of hugepages
> > > cannot be parallelized across multiple CPUs, which can lead to very
> > > long process startup times when using large numbers of hugepages.
> > > 
> > > This patch improves the situation by replacing the single mutex with a
> > > table of mutexes, selected based on a hash, which allows us to know
> > > which page in the file we're instantiating. For shared mappings, the
> > > hash key is selected based on the address space and file offset being faulted.
> > > Similarly, for private mappings, the mm and virtual address are used.
> > > 
> > 
> > Hello.
> > 
> > With this table mutex, we cannot protect region tracking structure.
> > See below comment.
> > 
> > /*
> >  * Region tracking -- allows tracking of reservations and instantiated pages
> >  *                    across the pages in a mapping.
> >  *
> >  * The region data structures are protected by a combination of the mmap_sem
> >  * and the hugetlb_instantion_mutex.  To access or modify a region the caller
> >  * must either hold the mmap_sem for write, or the mmap_sem for read and
> >  * the hugetlb_instantiation mutex:
> >  *
> >  *      down_write(&mm->mmap_sem);
> >  * or
> >  *      down_read(&mm->mmap_sem);
> >  *      mutex_lock(&hugetlb_instantiation_mutex);
> >  */
> 
> Ugh.  Who the hell added that.  I guess you'll need to split of
> another mutex for that purpose, afaict there should be no interaction
> with the actual, intended purpose of the instantiation mutex.

This was added in commit 84afd99b. One way to go would be to add a
spinlock to protect changes to the regions - however reading the
changelog, and based on David's previous explanation for the
instantiation mutex, I don't see why it was added. In fact several
places modify regions without holding the instantiation mutex, e.g.
hugetlb_reserve_pages().

Am I missing something here?

Thanks,
Davidlohr





^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] hugepage: allow parallelization of the hugepage fault path
  2013-07-19 21:24                 ` Davidlohr Bueso
@ 2013-07-22  0:59                   ` Joonsoo Kim
  -1 siblings, 0 replies; 41+ messages in thread
From: Joonsoo Kim @ 2013-07-22  0:59 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: David Gibson, Andrew Morton, Hugh Dickins, Rik van Riel,
	Michel Lespinasse, Mel Gorman, Konstantin Khlebnikov,
	Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki, Hillf Danton,
	linux-mm, LKML, Eric B Munson, Anton Blanchard

On Fri, Jul 19, 2013 at 02:24:15PM -0700, Davidlohr Bueso wrote:
> On Fri, 2013-07-19 at 17:14 +1000, David Gibson wrote:
> > On Thu, Jul 18, 2013 at 05:42:35PM +0900, Joonsoo Kim wrote:
> > > On Wed, Jul 17, 2013 at 12:50:25PM -0700, Davidlohr Bueso wrote:
> > > > From: David Gibson <david@gibson.dropbear.id.au>
> > > > 
> > > > At present, the page fault path for hugepages is serialized by a
> > > > single mutex. This is used to avoid spurious out-of-memory conditions
> > > > when the hugepage pool is fully utilized (two processes or threads can
> > > > race to instantiate the same mapping with the last hugepage from the
> > > > pool, the race loser returning VM_FAULT_OOM).  This problem is
> > > > specific to hugepages, because it is normal to want to use every
> > > > single hugepage in the system - with normal pages we simply assume
> > > > there will always be a few spare pages which can be used temporarily
> > > > until the race is resolved.
> > > > 
> > > > Unfortunately this serialization also means that clearing of hugepages
> > > > cannot be parallelized across multiple CPUs, which can lead to very
> > > > long process startup times when using large numbers of hugepages.
> > > > 
> > > > This patch improves the situation by replacing the single mutex with a
> > > > table of mutexes, selected based on a hash, which allows us to know
> > > > which page in the file we're instantiating. For shared mappings, the
> > > > hash key is selected based on the address space and file offset being faulted.
> > > > Similarly, for private mappings, the mm and virtual address are used.
> > > > 
> > > 
> > > Hello.
> > > 
> > > With this table mutex, we cannot protect region tracking structure.
> > > See below comment.
> > > 
> > > /*
> > >  * Region tracking -- allows tracking of reservations and instantiated pages
> > >  *                    across the pages in a mapping.
> > >  *
> > >  * The region data structures are protected by a combination of the mmap_sem
> > >  * and the hugetlb_instantion_mutex.  To access or modify a region the caller
> > >  * must either hold the mmap_sem for write, or the mmap_sem for read and
> > >  * the hugetlb_instantiation mutex:
> > >  *
> > >  *      down_write(&mm->mmap_sem);
> > >  * or
> > >  *      down_read(&mm->mmap_sem);
> > >  *      mutex_lock(&hugetlb_instantiation_mutex);
> > >  */
> > 
> > Ugh.  Who the hell added that.  I guess you'll need to split of
> > another mutex for that purpose, afaict there should be no interaction
> > with the actual, intended purpose of the instantiation mutex.
> 
> This was added in commit 84afd99b. One way to go would be to add a
> spinlock to protect changes to the regions - however reading the
> changelog, and based on David's previous explanation for the
> instantiation mutex, I don't see why it was added. In fact several
> places modify regions without holding the instantiation mutex, ie:
> hugetlb_reserve_pages()
> 
> Am I missing something here?

hugetlb_reserve_pages() is called with down_write(mmap_sem),
so the fault path, which requires down_read(mmap_sem), cannot
interfere with changes to the regions.
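
In other words, the rule in that comment is the usual "hold the lock for
write by itself, or hold it for read plus a single mutex" pattern, so any
two region modifiers still exclude each other. A userspace sketch of the
pattern (assumed shape only, not kernel code):

#include <pthread.h>

static pthread_rwlock_t mmap_sem = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t instantiation_mutex = PTHREAD_MUTEX_INITIALIZER;

static void modify_regions(void)
{
	/* stand-in for touching the region list */
}

/* e.g. hugetlb_reserve_pages(): called with mmap_sem held for write,
 * which already excludes every read-side holder below */
static void writer_path(void)
{
	pthread_rwlock_wrlock(&mmap_sem);
	modify_regions();
	pthread_rwlock_unlock(&mmap_sem);
}

/* e.g. the fault path: mmap_sem only held for read, so the mutex is what
 * serializes it against other read-side modifiers */
static void fault_path(void)
{
	pthread_rwlock_rdlock(&mmap_sem);
	pthread_mutex_lock(&instantiation_mutex);
	modify_regions();
	pthread_mutex_unlock(&instantiation_mutex);
	pthread_rwlock_unlock(&mmap_sem);
}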

Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] hugepage: allow parallelization of the hugepage fault path
@ 2013-07-22  0:59                   ` Joonsoo Kim
  0 siblings, 0 replies; 41+ messages in thread
From: Joonsoo Kim @ 2013-07-22  0:59 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: David Gibson, Andrew Morton, Hugh Dickins, Rik van Riel,
	Michel Lespinasse, Mel Gorman, Konstantin Khlebnikov,
	Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki, Hillf Danton,
	linux-mm, LKML, Eric B Munson, Anton Blanchard

On Fri, Jul 19, 2013 at 02:24:15PM -0700, Davidlohr Bueso wrote:
> On Fri, 2013-07-19 at 17:14 +1000, David Gibson wrote:
> > On Thu, Jul 18, 2013 at 05:42:35PM +0900, Joonsoo Kim wrote:
> > > On Wed, Jul 17, 2013 at 12:50:25PM -0700, Davidlohr Bueso wrote:
> > > > From: David Gibson <david@gibson.dropbear.id.au>
> > > > 
> > > > At present, the page fault path for hugepages is serialized by a
> > > > single mutex. This is used to avoid spurious out-of-memory conditions
> > > > when the hugepage pool is fully utilized (two processes or threads can
> > > > race to instantiate the same mapping with the last hugepage from the
> > > > pool, the race loser returning VM_FAULT_OOM).  This problem is
> > > > specific to hugepages, because it is normal to want to use every
> > > > single hugepage in the system - with normal pages we simply assume
> > > > there will always be a few spare pages which can be used temporarily
> > > > until the race is resolved.
> > > > 
> > > > Unfortunately this serialization also means that clearing of hugepages
> > > > cannot be parallelized across multiple CPUs, which can lead to very
> > > > long process startup times when using large numbers of hugepages.
> > > > 
> > > > This patch improves the situation by replacing the single mutex with a
> > > > table of mutexes, selected based on a hash, which allows us to know
> > > > which page in the file we're instantiating. For shared mappings, the
> > > > hash key is selected based on the address space and file offset being faulted.
> > > > Similarly, for private mappings, the mm and virtual address are used.
> > > > 
> > > 
> > > Hello.
> > > 
> > > With this table mutex, we cannot protect region tracking structure.
> > > See below comment.
> > > 
> > > /*
> > >  * Region tracking -- allows tracking of reservations and instantiated pages
> > >  *                    across the pages in a mapping.
> > >  *
> > >  * The region data structures are protected by a combination of the mmap_sem
> > >  * and the hugetlb_instantion_mutex.  To access or modify a region the caller
> > >  * must either hold the mmap_sem for write, or the mmap_sem for read and
> > >  * the hugetlb_instantiation mutex:
> > >  *
> > >  *      down_write(&mm->mmap_sem);
> > >  * or
> > >  *      down_read(&mm->mmap_sem);
> > >  *      mutex_lock(&hugetlb_instantiation_mutex);
> > >  */
> > 
> > Ugh.  Who the hell added that.  I guess you'll need to split of
> > another mutex for that purpose, afaict there should be no interaction
> > with the actual, intended purpose of the instantiation mutex.
> 
> This was added in commit 84afd99b. One way to go would be to add a
> spinlock to protect changes to the regions - however reading the
> changelog, and based on David's previous explanation for the
> instantiation mutex, I don't see why it was added. In fact several
> places modify regions without holding the instantiation mutex, ie:
> hugetlb_reserve_pages()
> 
> Am I missing something here?

hugetlb_reserve_pages() is called with down_write(mmap_sem),
so the fault path, which requires down_read(mmap_sem), cannot
interfere with changes to the regions.

Thanks.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] hugepage: allow parallelization of the hugepage fault path
  2013-07-17 19:50           ` Davidlohr Bueso
@ 2013-07-23  6:55             ` Hush Bensen
  -1 siblings, 0 replies; 41+ messages in thread
From: Hush Bensen @ 2013-07-23  6:55 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Andrew Morton, David Gibson, Hugh Dickins, Rik van Riel,
	Michel Lespinasse, Mel Gorman, Konstantin Khlebnikov,
	Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki, Hillf Danton,
	linux-mm, LKML, Eric B Munson, Anton Blanchard

On 07/18/2013 03:50 AM, Davidlohr Bueso wrote:
> From: David Gibson <david@gibson.dropbear.id.au>
>
> At present, the page fault path for hugepages is serialized by a
> single mutex. This is used to avoid spurious out-of-memory conditions
> when the hugepage pool is fully utilized (two processes or threads can
> race to instantiate the same mapping with the last hugepage from the
> pool, the race loser returning VM_FAULT_OOM).  This problem is
> specific to hugepages, because it is normal to want to use every
> single hugepage in the system - with normal pages we simply assume
> there will always be a few spare pages which can be used temporarily
> until the race is resolved.
>
> Unfortunately this serialization also means that clearing of hugepages
> cannot be parallelized across multiple CPUs, which can lead to very
> long process startup times when using large numbers of hugepages.
>
> This patch improves the situation by replacing the single mutex with a
> table of mutexes, selected based on a hash, which allows us to know
> which page in the file we're instantiating. For shared mappings, the
> hash key is selected based on the address space and file offset being faulted.
> Similarly, for private mappings, the mm and virtual address are used.
>
> From: Anton Blanchard <anton@samba.org>
> [https://lkml.org/lkml/2011/7/15/31]
> Forward ported and made a few changes:
>
> - Use the Jenkins hash to scatter the hash, better than using just the
>    low bits.
>
> - Always round num_fault_mutexes to a power of two to avoid an
>    expensive modulus in the hash calculation.
>
> I also tested this patch on a large POWER7 box using a simple parallel
> fault testcase:
>
> http://ozlabs.org/~anton/junkcode/parallel_fault.c
>
> Command line options:
>
> parallel_fault <nr_threads> <size in kB> <skip in kB>

Could you explain the meaning of <size in kB> <skip in kB> here?

>
> First the time taken to fault 128GB of 16MB hugepages:
>
> 40.68 seconds

I can't get any timing result after running parallel_fault; how did you get
the number?

>
> Now the same test with 64 concurrent threads:
> 39.34 seconds
>
> Hardly any speedup. Finally the 64 concurrent threads test with
> this patch applied:
> 0.85 seconds
>
> We go from 40.68 seconds to 0.85 seconds, an improvement of 47.9x
>
> This was tested with the libhugetlbfs test suite, and the PASS/FAIL
> count was the same before and after this patch.
>
> From: Davidlohr Bueso <davidlohr.bueso@hp.com>
>
> - Cleaned up and forward ported to Linus' latest.
> - Cache aligned mutexes.
> - Keep non SMP systems using a single mutex.
>
> It was found that this mutex can become quite contended
> during the early phases of large databases which make use of huge pages - for instance
> startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat
> reports that this mutex can be one of the top 5 most contended locks in the kernel during
> the first few minutes:
>
>      	     hugetlb_instantiation_mutex:   10678     10678
>               ---------------------------
>               hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
>               ---------------------------
>               hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
>
> contentions:          10678
> acquisitions:         99476
> waittime-total: 76888911.01 us
>
> With this patch we see a much less contention and wait time:
>
>                &htlb_fault_mutex_table[i]:   383
>                --------------------------
>                &htlb_fault_mutex_table[i]    383   [<ffffffff8115e27b>] hugetlb_fault+0x1eb/0x440
>                --------------------------
>                &htlb_fault_mutex_table[i]    383   [<ffffffff8115e27b>] hugetlb_fault+0x1eb/0x440
>
> contentions:        383
> acquisitions:    120546
> waittime-total: 1381.72 us
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Anton Blanchard <anton@samba.org>
> Tested-by: Eric B Munson <emunson@mgebm.net>
> Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
> ---
>   mm/hugetlb.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++----------
>   1 file changed, 73 insertions(+), 14 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 83aff0a..1f6e564 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -21,6 +21,7 @@
>   #include <linux/rmap.h>
>   #include <linux/swap.h>
>   #include <linux/swapops.h>
> +#include <linux/jhash.h>
>   
>   #include <asm/page.h>
>   #include <asm/pgtable.h>
> @@ -52,6 +53,13 @@ static unsigned long __initdata default_hstate_size;
>    */
>   DEFINE_SPINLOCK(hugetlb_lock);
>   
> +/*
> + * Serializes faults on the same logical page.  This is used to
> + * prevent spurious OOMs when the hugepage pool is fully utilized.
> + */
> +static int num_fault_mutexes;
> +static struct mutex *htlb_fault_mutex_table ____cacheline_aligned_in_smp;
> +
>   static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
>   {
>   	bool free = (spool->count == 0) && (spool->used_hpages == 0);
> @@ -1896,13 +1904,15 @@ static void __exit hugetlb_exit(void)
>   	for_each_hstate(h) {
>   		kobject_put(hstate_kobjs[hstate_index(h)]);
>   	}
> -
> +	kfree(htlb_fault_mutex_table);
>   	kobject_put(hugepages_kobj);
>   }
>   module_exit(hugetlb_exit);
>   
>   static int __init hugetlb_init(void)
>   {
> +	int i;
> +
>   	/* Some platform decide whether they support huge pages at boot
>   	 * time. On these, such as powerpc, HPAGE_SHIFT is set to 0 when
>   	 * there is no such support
> @@ -1927,6 +1937,19 @@ static int __init hugetlb_init(void)
>   	hugetlb_register_all_nodes();
>   	hugetlb_cgroup_file_init();
>   
> +#ifdef CONFIG_SMP
> +	num_fault_mutexes = roundup_pow_of_two(2 * num_possible_cpus());
> +#else
> +	num_fault_mutexes = 1;
> +#endif
> +	htlb_fault_mutex_table =
> +		kmalloc(sizeof(struct mutex) * num_fault_mutexes, GFP_KERNEL);
> +	if (!htlb_fault_mutex_table)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < num_fault_mutexes; i++)
> +		mutex_init(&htlb_fault_mutex_table[i]);
> +
>   	return 0;
>   }
>   module_init(hugetlb_init);
> @@ -2709,15 +2732,14 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
>   }
>   
>   static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
> -			unsigned long address, pte_t *ptep, unsigned int flags)
> +			   struct address_space *mapping, pgoff_t idx,
> +			   unsigned long address, pte_t *ptep, unsigned int flags)
>   {
>   	struct hstate *h = hstate_vma(vma);
>   	int ret = VM_FAULT_SIGBUS;
>   	int anon_rmap = 0;
> -	pgoff_t idx;
>   	unsigned long size;
>   	struct page *page;
> -	struct address_space *mapping;
>   	pte_t new_pte;
>   
>   	/*
> @@ -2731,9 +2753,6 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   		return ret;
>   	}
>   
> -	mapping = vma->vm_file->f_mapping;
> -	idx = vma_hugecache_offset(h, vma, address);
> -
>   	/*
>   	 * Use page lock to guard against racing truncation
>   	 * before we get page_table_lock.
> @@ -2839,15 +2858,51 @@ backout_unlocked:
>   	goto out;
>   }
>   
> +#ifdef CONFIG_SMP
> +static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
> +			    struct vm_area_struct *vma,
> +			    struct address_space *mapping,
> +			    pgoff_t idx, unsigned long address)
> +{
> +	unsigned long key[2];
> +	u32 hash;
> +
> +	if (vma->vm_flags & VM_SHARED) {
> +		key[0] = (unsigned long)mapping;
> +		key[1] = idx;
> +	} else {
> +		key[0] = (unsigned long)mm;
> +		key[1] = address >> huge_page_shift(h);
> +	}
> +
> +	hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0);
> +
> +	return hash & (num_fault_mutexes - 1);
> +}
> +#else
> +/*
> + * For uniprocesor systems we always use a single mutex, so just
> + * return 0 and avoid the hashing overhead.
> + */
> +static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
> +			    struct vm_area_struct *vma,
> +			    struct address_space *mapping,
> +			    pgoff_t idx, unsigned long address)
> +{
> +	return 0;
> +}
> +#endif
> +
>   int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   			unsigned long address, unsigned int flags)
>   {
> -	pte_t *ptep;
> -	pte_t entry;
> +	pgoff_t idx;
>   	int ret;
> +	u32 hash;
> +	pte_t *ptep, entry;
>   	struct page *page = NULL;
> +	struct address_space *mapping;
>   	struct page *pagecache_page = NULL;
> -	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
>   	struct hstate *h = hstate_vma(vma);
>   
>   	address &= huge_page_mask(h);
> @@ -2867,15 +2922,20 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   	if (!ptep)
>   		return VM_FAULT_OOM;
>   
> +	mapping = vma->vm_file->f_mapping;
> +	idx = vma_hugecache_offset(h, vma, address);
> +
>   	/*
>   	 * Serialize hugepage allocation and instantiation, so that we don't
>   	 * get spurious allocation failures if two CPUs race to instantiate
>   	 * the same page in the page cache.
>   	 */
> -	mutex_lock(&hugetlb_instantiation_mutex);
> +	hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
> +	mutex_lock(&htlb_fault_mutex_table[hash]);
> +
>   	entry = huge_ptep_get(ptep);
>   	if (huge_pte_none(entry)) {
> -		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
> +		ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
>   		goto out_mutex;
>   	}
>   
> @@ -2943,8 +3003,7 @@ out_page_table_lock:
>   	put_page(page);
>   
>   out_mutex:
> -	mutex_unlock(&hugetlb_instantiation_mutex);
> -
> +	mutex_unlock(&htlb_fault_mutex_table[hash]);
>   	return ret;
>   }
>   


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] hugepage: allow parallelization of the hugepage fault path
@ 2013-07-23  6:55             ` Hush Bensen
  0 siblings, 0 replies; 41+ messages in thread
From: Hush Bensen @ 2013-07-23  6:55 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Andrew Morton, David Gibson, Hugh Dickins, Rik van Riel,
	Michel Lespinasse, Mel Gorman, Konstantin Khlebnikov,
	Michal Hocko, AneeshKumarK.V, KAMEZAWA Hiroyuki, Hillf Danton,
	linux-mm, LKML, Eric B Munson, Anton Blanchard

On 07/18/2013 03:50 AM, Davidlohr Bueso wrote:
> From: David Gibson <david@gibson.dropbear.id.au>
>
> At present, the page fault path for hugepages is serialized by a
> single mutex. This is used to avoid spurious out-of-memory conditions
> when the hugepage pool is fully utilized (two processes or threads can
> race to instantiate the same mapping with the last hugepage from the
> pool, the race loser returning VM_FAULT_OOM).  This problem is
> specific to hugepages, because it is normal to want to use every
> single hugepage in the system - with normal pages we simply assume
> there will always be a few spare pages which can be used temporarily
> until the race is resolved.
>
> Unfortunately this serialization also means that clearing of hugepages
> cannot be parallelized across multiple CPUs, which can lead to very
> long process startup times when using large numbers of hugepages.
>
> This patch improves the situation by replacing the single mutex with a
> table of mutexes, selected based on a hash, which allows us to know
> which page in the file we're instantiating. For shared mappings, the
> hash key is selected based on the address space and file offset being faulted.
> Similarly, for private mappings, the mm and virtual address are used.
>
> From: Anton Blanchard <anton@samba.org>
> [https://lkml.org/lkml/2011/7/15/31]
> Forward ported and made a few changes:
>
> - Use the Jenkins hash to scatter the hash, better than using just the
>    low bits.
>
> - Always round num_fault_mutexes to a power of two to avoid an
>    expensive modulus in the hash calculation.
>
> I also tested this patch on a large POWER7 box using a simple parallel
> fault testcase:
>
> http://ozlabs.org/~anton/junkcode/parallel_fault.c
>
> Command line options:
>
> parallel_fault <nr_threads> <size in kB> <skip in kB>

Could you explain the meaning of <size in kB> <skip in kB> here?

>
> First the time taken to fault 128GB of 16MB hugepages:
>
> 40.68 seconds

I can't get any timing result after running parallel_fault; how did you get
the number?

>
> Now the same test with 64 concurrent threads:
> 39.34 seconds
>
> Hardly any speedup. Finally the 64 concurrent threads test with
> this patch applied:
> 0.85 seconds
>
> We go from 40.68 seconds to 0.85 seconds, an improvement of 47.9x
>
> This was tested with the libhugetlbfs test suite, and the PASS/FAIL
> count was the same before and after this patch.
>
> From: Davidlohr Bueso <davidlohr.bueso@hp.com>
>
> - Cleaned up and forward ported to Linus' latest.
> - Cache aligned mutexes.
> - Keep non SMP systems using a single mutex.
>
> It was found that this mutex can become quite contended
> during the early phases of large databases which make use of huge pages - for instance
> startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat
> reports that this mutex can be one of the top 5 most contended locks in the kernel during
> the first few minutes:
>
>      	     hugetlb_instantiation_mutex:   10678     10678
>               ---------------------------
>               hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
>               ---------------------------
>               hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
>
> contentions:          10678
> acquisitions:         99476
> waittime-total: 76888911.01 us
>
> With this patch we see a much less contention and wait time:
>
>                &htlb_fault_mutex_table[i]:   383
>                --------------------------
>                &htlb_fault_mutex_table[i]    383   [<ffffffff8115e27b>] hugetlb_fault+0x1eb/0x440
>                --------------------------
>                &htlb_fault_mutex_table[i]    383   [<ffffffff8115e27b>] hugetlb_fault+0x1eb/0x440
>
> contentions:        383
> acquisitions:    120546
> waittime-total: 1381.72 us
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Anton Blanchard <anton@samba.org>
> Tested-by: Eric B Munson <emunson@mgebm.net>
> Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
> ---
>   mm/hugetlb.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++----------
>   1 file changed, 73 insertions(+), 14 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 83aff0a..1f6e564 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -21,6 +21,7 @@
>   #include <linux/rmap.h>
>   #include <linux/swap.h>
>   #include <linux/swapops.h>
> +#include <linux/jhash.h>
>   
>   #include <asm/page.h>
>   #include <asm/pgtable.h>
> @@ -52,6 +53,13 @@ static unsigned long __initdata default_hstate_size;
>    */
>   DEFINE_SPINLOCK(hugetlb_lock);
>   
> +/*
> + * Serializes faults on the same logical page.  This is used to
> + * prevent spurious OOMs when the hugepage pool is fully utilized.
> + */
> +static int num_fault_mutexes;
> +static struct mutex *htlb_fault_mutex_table ____cacheline_aligned_in_smp;
> +
>   static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
>   {
>   	bool free = (spool->count == 0) && (spool->used_hpages == 0);
> @@ -1896,13 +1904,15 @@ static void __exit hugetlb_exit(void)
>   	for_each_hstate(h) {
>   		kobject_put(hstate_kobjs[hstate_index(h)]);
>   	}
> -
> +	kfree(htlb_fault_mutex_table);
>   	kobject_put(hugepages_kobj);
>   }
>   module_exit(hugetlb_exit);
>   
>   static int __init hugetlb_init(void)
>   {
> +	int i;
> +
>   	/* Some platform decide whether they support huge pages at boot
>   	 * time. On these, such as powerpc, HPAGE_SHIFT is set to 0 when
>   	 * there is no such support
> @@ -1927,6 +1937,19 @@ static int __init hugetlb_init(void)
>   	hugetlb_register_all_nodes();
>   	hugetlb_cgroup_file_init();
>   
> +#ifdef CONFIG_SMP
> +	num_fault_mutexes = roundup_pow_of_two(2 * num_possible_cpus());
> +#else
> +	num_fault_mutexes = 1;
> +#endif
> +	htlb_fault_mutex_table =
> +		kmalloc(sizeof(struct mutex) * num_fault_mutexes, GFP_KERNEL);
> +	if (!htlb_fault_mutex_table)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < num_fault_mutexes; i++)
> +		mutex_init(&htlb_fault_mutex_table[i]);
> +
>   	return 0;
>   }
>   module_init(hugetlb_init);
> @@ -2709,15 +2732,14 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
>   }
>   
>   static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
> -			unsigned long address, pte_t *ptep, unsigned int flags)
> +			   struct address_space *mapping, pgoff_t idx,
> +			   unsigned long address, pte_t *ptep, unsigned int flags)
>   {
>   	struct hstate *h = hstate_vma(vma);
>   	int ret = VM_FAULT_SIGBUS;
>   	int anon_rmap = 0;
> -	pgoff_t idx;
>   	unsigned long size;
>   	struct page *page;
> -	struct address_space *mapping;
>   	pte_t new_pte;
>   
>   	/*
> @@ -2731,9 +2753,6 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   		return ret;
>   	}
>   
> -	mapping = vma->vm_file->f_mapping;
> -	idx = vma_hugecache_offset(h, vma, address);
> -
>   	/*
>   	 * Use page lock to guard against racing truncation
>   	 * before we get page_table_lock.
> @@ -2839,15 +2858,51 @@ backout_unlocked:
>   	goto out;
>   }
>   
> +#ifdef CONFIG_SMP
> +static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
> +			    struct vm_area_struct *vma,
> +			    struct address_space *mapping,
> +			    pgoff_t idx, unsigned long address)
> +{
> +	unsigned long key[2];
> +	u32 hash;
> +
> +	if (vma->vm_flags & VM_SHARED) {
> +		key[0] = (unsigned long)mapping;
> +		key[1] = idx;
> +	} else {
> +		key[0] = (unsigned long)mm;
> +		key[1] = address >> huge_page_shift(h);
> +	}
> +
> +	hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0);
> +
> +	return hash & (num_fault_mutexes - 1);
> +}
> +#else
> +/*
> + * For uniprocessor systems we always use a single mutex, so just
> + * return 0 and avoid the hashing overhead.
> + */
> +static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
> +			    struct vm_area_struct *vma,
> +			    struct address_space *mapping,
> +			    pgoff_t idx, unsigned long address)
> +{
> +	return 0;
> +}
> +#endif
> +
>   int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   			unsigned long address, unsigned int flags)
>   {
> -	pte_t *ptep;
> -	pte_t entry;
> +	pgoff_t idx;
>   	int ret;
> +	u32 hash;
> +	pte_t *ptep, entry;
>   	struct page *page = NULL;
> +	struct address_space *mapping;
>   	struct page *pagecache_page = NULL;
> -	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
>   	struct hstate *h = hstate_vma(vma);
>   
>   	address &= huge_page_mask(h);
> @@ -2867,15 +2922,20 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   	if (!ptep)
>   		return VM_FAULT_OOM;
>   
> +	mapping = vma->vm_file->f_mapping;
> +	idx = vma_hugecache_offset(h, vma, address);
> +
>   	/*
>   	 * Serialize hugepage allocation and instantiation, so that we don't
>   	 * get spurious allocation failures if two CPUs race to instantiate
>   	 * the same page in the page cache.
>   	 */
> -	mutex_lock(&hugetlb_instantiation_mutex);
> +	hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
> +	mutex_lock(&htlb_fault_mutex_table[hash]);
> +
>   	entry = huge_ptep_get(ptep);
>   	if (huge_pte_none(entry)) {
> -		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
> +		ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
>   		goto out_mutex;
>   	}
>   
> @@ -2943,8 +3003,7 @@ out_page_table_lock:
>   	put_page(page);
>   
>   out_mutex:
> -	mutex_unlock(&hugetlb_instantiation_mutex);
> -
> +	mutex_unlock(&htlb_fault_mutex_table[hash]);
>   	return ret;
>   }
>   


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] hugepage: allow parallelization of the hugepage fault path
  2013-07-18  9:07             ` Joonsoo Kim
@ 2013-07-23  7:04               ` Hush Bensen
  -1 siblings, 0 replies; 41+ messages in thread
From: Hush Bensen @ 2013-07-23  7:04 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Davidlohr Bueso, Andrew Morton, David Gibson, Hugh Dickins,
	Rik van Riel, Michel Lespinasse, Mel Gorman,
	Konstantin Khlebnikov, Michal Hocko, AneeshKumarK.V,
	KAMEZAWA Hiroyuki, Hillf Danton, linux-mm, LKML, Eric B Munson,
	Anton Blanchard

On 07/18/2013 05:07 PM, Joonsoo Kim wrote:
> On Wed, Jul 17, 2013 at 12:50:25PM -0700, Davidlohr Bueso wrote:
>
>> From: Davidlohr Bueso <davidlohr.bueso@hp.com>
>>
>> - Cleaned up and forward ported to Linus' latest.
>> - Cache aligned mutexes.
>> - Keep non SMP systems using a single mutex.
>>
>> It was found that this mutex can become quite contended
>> during the early phases of large databases which make use of huge pages - for instance
>> startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat
>> reports that this mutex can be one of the top 5 most contended locks in the kernel during
>> the first few minutes:
>>
>>      	     hugetlb_instantiation_mutex:   10678     10678
>>               ---------------------------
>>               hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
>>               ---------------------------
>>               hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
>>
>> contentions:          10678
>> acquisitions:         99476
>> waittime-total: 76888911.01 us
> Hello,
> I have a question :)
>
> So, each contention takes about 7.2 ms in your result.
> Do you map this area with VM_NORESERVE?
> If we map with reservations (i.e. without VM_NORESERVE), then on a page fault we just dequeue
> a huge page from the free list, clear it, and map it into the page table. So I guess it shouldn't take so long.

I don't think the page is cleared right after it is dequeued; in fact it isn't 
done during hugetlb_reserve_pages() either, do you know why? The only clearing 
I can see is in hugetlb_no_page().

> I'm wondering why it takes so long.
>
> And are you using 16KB hugepages?
> If so, region handling could take some time. If you access the area in random order,
> the number of regions can be more than 90000. I guess this can be one reason for the long
> wait time.
>
> Thanks.
>


^ permalink raw reply	[flat|nested] 41+ messages in thread


end of thread, other threads:[~2013-07-23  7:05 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-12 23:28 [PATCH] mm/hugetlb: per-vma instantiation mutexes Davidlohr Bueso
2013-07-12 23:28 ` Davidlohr Bueso
2013-07-13  0:54 ` Hugh Dickins
2013-07-13  0:54   ` Hugh Dickins
2013-07-15  3:16   ` Davidlohr Bueso
2013-07-15  3:16     ` Davidlohr Bueso
2013-07-15  7:24     ` David Gibson
2013-07-15 23:08       ` Andrew Morton
2013-07-15 23:08         ` Andrew Morton
2013-07-16  0:12         ` Davidlohr Bueso
2013-07-16  0:12           ` Davidlohr Bueso
2013-07-16  8:00           ` David Gibson
2013-07-17 19:50         ` [PATCH] hugepage: allow parallelization of the hugepage fault path Davidlohr Bueso
2013-07-17 19:50           ` Davidlohr Bueso
2013-07-18  8:42           ` Joonsoo Kim
2013-07-18  8:42             ` Joonsoo Kim
2013-07-19  7:14             ` David Gibson
2013-07-19 21:24               ` Davidlohr Bueso
2013-07-19 21:24                 ` Davidlohr Bueso
2013-07-22  0:59                 ` Joonsoo Kim
2013-07-22  0:59                   ` Joonsoo Kim
2013-07-18  9:07           ` Joonsoo Kim
2013-07-18  9:07             ` Joonsoo Kim
2013-07-19  0:19             ` Davidlohr Bueso
2013-07-19  0:19               ` Davidlohr Bueso
2013-07-19  0:35               ` Davidlohr Bueso
2013-07-19  0:35                 ` Davidlohr Bueso
2013-07-23  7:04             ` Hush Bensen
2013-07-23  7:04               ` Hush Bensen
2013-07-23  6:55           ` Hush Bensen
2013-07-23  6:55             ` Hush Bensen
2013-07-16  1:51       ` [PATCH] mm/hugetlb: per-vma instantiation mutexes Rik van Riel
2013-07-16  1:51         ` Rik van Riel
2013-07-16  5:34         ` Joonsoo Kim
2013-07-16  5:34           ` Joonsoo Kim
2013-07-16 10:01           ` David Gibson
2013-07-18  6:50             ` Joonsoo Kim
2013-07-18  6:50               ` Joonsoo Kim
2013-07-16  8:20         ` David Gibson
2013-07-15  4:18 ` Konstantin Khlebnikov
2013-07-15  4:18   ` Konstantin Khlebnikov
