* [PATCH 0/2] hugetlbfs: support split page table lock
From: Naoya Horiguchi @ 2013-05-28 19:52 UTC
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, linux-kernel

Hi,

In the previous discussion [1] of the "extend hugepage migration" patches,
Michal and Kosaki-san commented that the patch "migrate: add
migrate_entry_wait_huge()" should solve the problem in an arch-independent
manner and should also take care of the USE_SPLIT_PTLOCKS=y case. This
patch set does that.

I confirmed that the patched kernel shows no regressions in the
libhugetlbfs functional tests.

[1]: http://thread.gmane.org/gmane.linux.kernel.mm/96665/focus=96661

Thanks,
Naoya Horiguchi

* [PATCH 1/2] hugetlbfs: support split page table lock
From: Naoya Horiguchi @ 2013-05-28 19:52 UTC
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, linux-kernel

Currently, all page table handling in the hugetlbfs code is done under
mm->page_table_lock. This is not optimal because there can be lock
contention between unrelated components using this lock.

This patch makes hugepages support the split page table lock, so that we
use the page->ptl of the leaf node of the page table tree, which is the
pte for normal pages but can be the pmd and/or pud for hugepages on some
architectures.
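
For illustration only (not part of the patch itself), a caller walking a
hugetlb mapping would take the new lock roughly as in the sketch below;
walk_one_hugepage() is a hypothetical helper that exists only for this
example:

/* Hypothetical example: lock the leaf page table entry of one hugepage. */
static void walk_one_hugepage(struct mm_struct *mm, unsigned long address)
{
	spinlock_t *ptl;
	pte_t *ptep;

	/*
	 * huge_pte_offset_lock() returns the "leaf" entry and, if it exists,
	 * takes the lock that guards it: page->ptl of the page containing
	 * the entry when USE_SPLIT_PTLOCKS, otherwise mm->page_table_lock.
	 */
	ptep = huge_pte_offset_lock(mm, address, &ptl);
	if (!ptep)
		return;

	/* ... read or modify the entry through ptep while holding ptl ... */

	spin_unlock(ptl);
}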

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 arch/x86/mm/hugetlbpage.c |  6 ++--
 include/linux/hugetlb.h   | 18 ++++++++++
 mm/hugetlb.c              | 84 ++++++++++++++++++++++++++++-------------------
 3 files changed, 73 insertions(+), 35 deletions(-)

diff --git v3.10-rc3.orig/arch/x86/mm/hugetlbpage.c v3.10-rc3/arch/x86/mm/hugetlbpage.c
index ae1aa71..0e4a396 100644
--- v3.10-rc3.orig/arch/x86/mm/hugetlbpage.c
+++ v3.10-rc3/arch/x86/mm/hugetlbpage.c
@@ -75,6 +75,7 @@ huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	unsigned long saddr;
 	pte_t *spte = NULL;
 	pte_t *pte;
+	spinlock_t *ptl;
 
 	if (!vma_shareable(vma, addr))
 		return (pte_t *)pmd_alloc(mm, pud, addr);
@@ -89,6 +90,7 @@ huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 			spte = huge_pte_offset(svma->vm_mm, saddr);
 			if (spte) {
 				get_page(virt_to_page(spte));
+				ptl = huge_pte_lockptr(mm, spte);
 				break;
 			}
 		}
@@ -97,12 +99,12 @@ huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	if (!spte)
 		goto out;
 
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	if (pud_none(*pud))
 		pud_populate(mm, pud, (pmd_t *)((unsigned long)spte & PAGE_MASK));
 	else
 		put_page(virt_to_page(spte));
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 out:
 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
 	mutex_unlock(&mapping->i_mmap_mutex);
diff --git v3.10-rc3.orig/include/linux/hugetlb.h v3.10-rc3/include/linux/hugetlb.h
index a639c87..40f3215 100644
--- v3.10-rc3.orig/include/linux/hugetlb.h
+++ v3.10-rc3/include/linux/hugetlb.h
@@ -32,6 +32,24 @@ void hugepage_put_subpool(struct hugepage_subpool *spool);
 
 int PageHuge(struct page *page);
 
+#if USE_SPLIT_PTLOCKS
+#define huge_pte_lockptr(mm, ptep) ({__pte_lockptr(virt_to_page(ptep)); })
+#else	/* !USE_SPLIT_PTLOCKS */
+#define huge_pte_lockptr(mm, ptep) ({&(mm)->page_table_lock; })
+#endif	/* USE_SPLIT_PTLOCKS */
+
+#define huge_pte_offset_lock(mm, address, ptlp)		\
+({							\
+	pte_t *__pte = huge_pte_offset(mm, address);	\
+	spinlock_t *__ptl = NULL;			\
+	if (__pte) {					\
+		__ptl = huge_pte_lockptr(mm, __pte);	\
+		*(ptlp) = __ptl;			\
+		spin_lock(__ptl);			\
+	}						\
+	__pte;						\
+})
+
 void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
 int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
 int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
diff --git v3.10-rc3.orig/mm/hugetlb.c v3.10-rc3/mm/hugetlb.c
index 463fb5e..8e1af32 100644
--- v3.10-rc3.orig/mm/hugetlb.c
+++ v3.10-rc3/mm/hugetlb.c
@@ -2325,6 +2325,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
+		spinlock_t *srcptl, *dstptl;
 		src_pte = huge_pte_offset(src, addr);
 		if (!src_pte)
 			continue;
@@ -2336,8 +2337,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		if (dst_pte == src_pte)
 			continue;
 
-		spin_lock(&dst->page_table_lock);
-		spin_lock_nested(&src->page_table_lock, SINGLE_DEPTH_NESTING);
+		dstptl = huge_pte_lockptr(dst, dst_pte);
+		srcptl = huge_pte_lockptr(src, src_pte);
+		spin_lock(dstptl);
+		spin_lock_nested(srcptl, SINGLE_DEPTH_NESTING);
 		if (!huge_pte_none(huge_ptep_get(src_pte))) {
 			if (cow)
 				huge_ptep_set_wrprotect(src, addr, src_pte);
@@ -2347,8 +2350,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			page_dup_rmap(ptepage);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 		}
-		spin_unlock(&src->page_table_lock);
-		spin_unlock(&dst->page_table_lock);
+		spin_unlock(srcptl);
+		spin_unlock(dstptl);
 	}
 	return 0;
 
@@ -2391,6 +2394,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	unsigned long address;
 	pte_t *ptep;
 	pte_t pte;
+	spinlock_t *ptl;
 	struct page *page;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
@@ -2404,25 +2408,24 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	tlb_start_vma(tlb, vma);
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 again:
-	spin_lock(&mm->page_table_lock);
 	for (address = start; address < end; address += sz) {
-		ptep = huge_pte_offset(mm, address);
+		ptep = huge_pte_offset_lock(mm, address, &ptl);
 		if (!ptep)
 			continue;
 
 		if (huge_pmd_unshare(mm, &address, ptep))
-			continue;
+			goto unlock;
 
 		pte = huge_ptep_get(ptep);
 		if (huge_pte_none(pte))
-			continue;
+			goto unlock;
 
 		/*
 		 * HWPoisoned hugepage is already unmapped and dropped reference
 		 */
 		if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
 			huge_pte_clear(mm, address, ptep);
-			continue;
+			goto unlock;
 		}
 
 		page = pte_page(pte);
@@ -2433,7 +2436,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 */
 		if (ref_page) {
 			if (page != ref_page)
-				continue;
+				goto unlock;
 
 			/*
 			 * Mark the VMA as having unmapped its page so that
@@ -2450,13 +2453,18 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 		page_remove_rmap(page);
 		force_flush = !__tlb_remove_page(tlb, page);
-		if (force_flush)
+		if (force_flush) {
+			spin_unlock(ptl);
 			break;
+		}
 		/* Bail out after unmapping reference page if supplied */
-		if (ref_page)
+		if (ref_page) {
+			spin_unlock(ptl);
 			break;
+		}
+unlock:
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	/*
 	 * mmu_gather ran out of room to batch pages, we break out of
 	 * the PTE lock to avoid doing the potential expensive TLB invalidate
@@ -2570,6 +2578,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	int outside_reserve = 0;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
+	spinlock_t *ptl = huge_pte_lockptr(mm, ptep);
 
 	old_page = pte_page(pte);
 
@@ -2601,7 +2610,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	page_cache_get(old_page);
 
 	/* Drop page_table_lock as buddy allocator may be called */
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	new_page = alloc_huge_page(vma, address, outside_reserve);
 
 	if (IS_ERR(new_page)) {
@@ -2619,7 +2628,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 			BUG_ON(huge_pte_none(pte));
 			if (unmap_ref_private(mm, vma, old_page, address)) {
 				BUG_ON(huge_pte_none(pte));
-				spin_lock(&mm->page_table_lock);
+				spin_lock(ptl);
 				ptep = huge_pte_offset(mm, address & huge_page_mask(h));
 				if (likely(pte_same(huge_ptep_get(ptep), pte)))
 					goto retry_avoidcopy;
@@ -2633,7 +2642,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 
 		/* Caller expects lock to be held */
-		spin_lock(&mm->page_table_lock);
+		spin_lock(ptl);
 		if (err == -ENOMEM)
 			return VM_FAULT_OOM;
 		else
@@ -2648,7 +2657,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_cache_release(new_page);
 		page_cache_release(old_page);
 		/* Caller expects lock to be held */
-		spin_lock(&mm->page_table_lock);
+		spin_lock(ptl);
 		return VM_FAULT_OOM;
 	}
 
@@ -2663,7 +2672,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * Retake the page_table_lock to check for racing updates
 	 * before the page tables are altered
 	 */
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	ptep = huge_pte_offset(mm, address & huge_page_mask(h));
 	if (likely(pte_same(huge_ptep_get(ptep), pte))) {
 		/* Break COW */
@@ -2675,10 +2684,10 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* Make the old page be freed below */
 		new_page = old_page;
 	}
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	/* Caller expects lock to be held */
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
 	return 0;
@@ -2728,6 +2737,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page;
 	struct address_space *mapping;
 	pte_t new_pte;
+	spinlock_t *ptl;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -2813,7 +2823,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			goto backout_unlocked;
 		}
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, ptep);
+	spin_lock(ptl);
 	size = i_size_read(mapping->host) >> huge_page_shift(h);
 	if (idx >= size)
 		goto backout;
@@ -2835,13 +2846,13 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page);
 	}
 
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	unlock_page(page);
 out:
 	return ret;
 
 backout:
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 backout_unlocked:
 	unlock_page(page);
 	put_page(page);
@@ -2853,6 +2864,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	pte_t *ptep;
 	pte_t entry;
+	spinlock_t *ptl;
 	int ret;
 	struct page *page = NULL;
 	struct page *pagecache_page = NULL;
@@ -2921,7 +2933,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (page != pagecache_page)
 		lock_page(page);
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, ptep);
+	spin_lock(ptl);
 	/* Check for a racing update before calling hugetlb_cow */
 	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
 		goto out_page_table_lock;
@@ -2941,7 +2954,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		update_mmu_cache(vma, address, ptep);
 
 out_page_table_lock:
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 
 	if (pagecache_page) {
 		unlock_page(pagecache_page);
@@ -2976,9 +2989,9 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long remainder = *nr_pages;
 	struct hstate *h = hstate_vma(vma);
 
-	spin_lock(&mm->page_table_lock);
 	while (vaddr < vma->vm_end && remainder) {
 		pte_t *pte;
+		spinlock_t *ptl = NULL;
 		int absent;
 		struct page *page;
 
@@ -2986,8 +2999,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * Some archs (sparc64, sh*) have multiple pte_ts to
 		 * each hugepage.  We have to make sure we get the
 		 * first, for the page indexing below to work.
+		 *
+		 * Note that page table lock is not held when pte is null.
 		 */
-		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
+		pte = huge_pte_offset_lock(mm, vaddr & huge_page_mask(h), &ptl);
 		absent = !pte || huge_pte_none(huge_ptep_get(pte));
 
 		/*
@@ -2999,6 +3014,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		if (absent && (flags & FOLL_DUMP) &&
 		    !hugetlbfs_pagecache_present(h, vma, vaddr)) {
+			if (pte)
+				spin_unlock(ptl);
 			remainder = 0;
 			break;
 		}
@@ -3018,10 +3035,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		      !huge_pte_write(huge_ptep_get(pte)))) {
 			int ret;
 
-			spin_unlock(&mm->page_table_lock);
+			if (pte)
+				spin_unlock(ptl);
 			ret = hugetlb_fault(mm, vma, vaddr,
 				(flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
-			spin_lock(&mm->page_table_lock);
 			if (!(ret & VM_FAULT_ERROR))
 				continue;
 
@@ -3052,8 +3069,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 */
 			goto same_page;
 		}
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	*nr_pages = remainder;
 	*position = vaddr;
 
@@ -3074,13 +3091,14 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	flush_cache_range(vma, address, end);
 
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
-	spin_lock(&mm->page_table_lock);
 	for (; address < end; address += huge_page_size(h)) {
-		ptep = huge_pte_offset(mm, address);
+		spinlock_t *ptl;
+		ptep = huge_pte_offset_lock(mm, address, &ptl);
 		if (!ptep)
 			continue;
 		if (huge_pmd_unshare(mm, &address, ptep)) {
 			pages++;
+			spin_unlock(ptl);
 			continue;
 		}
 		if (!huge_pte_none(huge_ptep_get(ptep))) {
@@ -3090,8 +3108,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 			set_huge_pte_at(mm, address, ptep, pte);
 			pages++;
 		}
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	/*
 	 * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
 	 * may have cleared our pud entry and done put_page on the page table:
-- 
1.7.11.7


* [PATCH v3 2/2] migrate: add migrate_entry_wait_huge()
From: Naoya Horiguchi @ 2013-05-28 19:52 UTC
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, linux-kernel

When a page fault occurs on an address backed by a hugepage under
migration, the kernel can't wait correctly and busy-loops on the hugepage
fault until the migration finishes.
This is because pte_offset_map_lock() can't get the correct migration
entry or the correct page table lock for a hugepage.
This patch introduces migration_entry_wait_huge() to solve this.

Note that the caller, hugetlb_fault(), gets the pointer to the "leaf"
entry with huge_pte_offset(), inside which all the arch dependencies of
the page table structure are handled. So migration_entry_wait_huge() and
__migration_entry_wait() are free from arch dependencies.
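
As a rough sketch (for illustration only; it mirrors the hugetlb_fault()
hunk below), the fixed wait path looks like this:

	pte_t *ptep = huge_pte_offset(mm, address);	/* arch-specific walk */

	if (ptep && is_hugetlb_entry_migration(huge_ptep_get(ptep))) {
		/*
		 * Wait on the lock that actually guards this leaf entry
		 * (page->ptl under USE_SPLIT_PTLOCKS, otherwise
		 * mm->page_table_lock), rather than the pte-level lock
		 * pte_offset_map_lock() would pick.
		 */
		migration_entry_wait_huge(mm, ptep);
		return 0;
	}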

ChangeLog v3:
 - use huge_pte_lockptr

ChangeLog v2:
 - remove dup in migrate_entry_wait_huge()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: stable@vger.kernel.org # 2.6.35
---
 include/linux/swapops.h |  3 +++
 mm/hugetlb.c            |  2 +-
 mm/migrate.c            | 23 ++++++++++++++++++-----
 3 files changed, 22 insertions(+), 6 deletions(-)

diff --git v3.10-rc3.orig/include/linux/swapops.h v3.10-rc3/include/linux/swapops.h
index 47ead51..c5fd30d 100644
--- v3.10-rc3.orig/include/linux/swapops.h
+++ v3.10-rc3/include/linux/swapops.h
@@ -137,6 +137,7 @@ static inline void make_migration_entry_read(swp_entry_t *entry)
 
 extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 					unsigned long address);
+extern void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte);
 #else
 
 #define make_migration_entry(page, write) swp_entry(0, 0)
@@ -148,6 +149,8 @@ static inline int is_migration_entry(swp_entry_t swp)
 static inline void make_migration_entry_read(swp_entry_t *entryp) { }
 static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 					 unsigned long address) { }
+static inline void migration_entry_wait_huge(struct mm_struct *mm,
+					pte_t *pte) { }
 static inline int is_write_migration_entry(swp_entry_t entry)
 {
 	return 0;
diff --git v3.10-rc3.orig/mm/hugetlb.c v3.10-rc3/mm/hugetlb.c
index 8e1af32..d91a438 100644
--- v3.10-rc3.orig/mm/hugetlb.c
+++ v3.10-rc3/mm/hugetlb.c
@@ -2877,7 +2877,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (ptep) {
 		entry = huge_ptep_get(ptep);
 		if (unlikely(is_hugetlb_entry_migration(entry))) {
-			migration_entry_wait(mm, (pmd_t *)ptep, address);
+			migration_entry_wait_huge(mm, ptep);
 			return 0;
 		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			return VM_FAULT_HWPOISON_LARGE |
diff --git v3.10-rc3.orig/mm/migrate.c v3.10-rc3/mm/migrate.c
index 6f2df6e..64ff118 100644
--- v3.10-rc3.orig/mm/migrate.c
+++ v3.10-rc3/mm/migrate.c
@@ -204,15 +204,14 @@ static void remove_migration_ptes(struct page *old, struct page *new)
  * get to the page and wait until migration is finished.
  * When we return from this function the fault will be retried.
  */
-void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
-				unsigned long address)
+static void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
+				spinlock_t *ptl)
 {
-	pte_t *ptep, pte;
-	spinlock_t *ptl;
+	pte_t pte;
 	swp_entry_t entry;
 	struct page *page;
 
-	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+	spin_lock(ptl);
 	pte = *ptep;
 	if (!is_swap_pte(pte))
 		goto out;
@@ -240,6 +239,20 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 	pte_unmap_unlock(ptep, ptl);
 }
 
+void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long address)
+{
+	spinlock_t *ptl = pte_lockptr(mm, pmd);
+	pte_t *ptep = pte_offset_map(pmd, address);
+	__migration_entry_wait(mm, ptep, ptl);
+}
+
+void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte)
+{
+	spinlock_t *ptl = huge_pte_lockptr(mm, pte);
+	__migration_entry_wait(mm, pte, ptl);
+}
+
 #ifdef CONFIG_BLOCK
 /* Returns true if all buffers are successfully locked */
 static bool buffer_migrate_lock_buffers(struct buffer_head *head,
-- 
1.7.11.7


* Re: [PATCH 1/2] hugetlbfs: support split page table lock
From: Wanpeng Li @ 2013-05-29  1:09 UTC
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, linux-kernel

On Tue, May 28, 2013 at 03:52:50PM -0400, Naoya Horiguchi wrote:
>Currently, all page table handling in the hugetlbfs code is done under
>mm->page_table_lock. This is not optimal because there can be lock
>contention between unrelated components using this lock.
>
>This patch makes hugepages support the split page table lock, so that we
>use the page->ptl of the leaf node of the page table tree, which is the
>pte for normal pages but can be the pmd and/or pud for hugepages on some
>architectures.
>

Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>

>
> 	if (IS_ERR(new_page)) {
>@@ -2619,7 +2628,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
> 			BUG_ON(huge_pte_none(pte));
> 			if (unmap_ref_private(mm, vma, old_page, address)) {
> 				BUG_ON(huge_pte_none(pte));
>-				spin_lock(&mm->page_table_lock);
>+				spin_lock(ptl);
> 				ptep = huge_pte_offset(mm, address & huge_page_mask(h));
> 				if (likely(pte_same(huge_ptep_get(ptep), pte)))
> 					goto retry_avoidcopy;
>@@ -2633,7 +2642,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
> 		}
>
> 		/* Caller expects lock to be held */
>-		spin_lock(&mm->page_table_lock);
>+		spin_lock(ptl);
> 		if (err == -ENOMEM)
> 			return VM_FAULT_OOM;
> 		else
>@@ -2648,7 +2657,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
> 		page_cache_release(new_page);
> 		page_cache_release(old_page);
> 		/* Caller expects lock to be held */
>-		spin_lock(&mm->page_table_lock);
>+		spin_lock(ptl);
> 		return VM_FAULT_OOM;
> 	}
>
>@@ -2663,7 +2672,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
> 	 * Retake the page_table_lock to check for racing updates
> 	 * before the page tables are altered
> 	 */
>-	spin_lock(&mm->page_table_lock);
>+	spin_lock(ptl);
> 	ptep = huge_pte_offset(mm, address & huge_page_mask(h));
> 	if (likely(pte_same(huge_ptep_get(ptep), pte))) {
> 		/* Break COW */
>@@ -2675,10 +2684,10 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
> 		/* Make the old page be freed below */
> 		new_page = old_page;
> 	}
>-	spin_unlock(&mm->page_table_lock);
>+	spin_unlock(ptl);
> 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> 	/* Caller expects lock to be held */
>-	spin_lock(&mm->page_table_lock);
>+	spin_lock(ptl);
> 	page_cache_release(new_page);
> 	page_cache_release(old_page);
> 	return 0;
>@@ -2728,6 +2737,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
> 	struct page *page;
> 	struct address_space *mapping;
> 	pte_t new_pte;
>+	spinlock_t *ptl;
>
> 	/*
> 	 * Currently, we are forced to kill the process in the event the
>@@ -2813,7 +2823,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
> 			goto backout_unlocked;
> 		}
>
>-	spin_lock(&mm->page_table_lock);
>+	ptl = huge_pte_lockptr(mm, ptep);
>+	spin_lock(ptl);
> 	size = i_size_read(mapping->host) >> huge_page_shift(h);
> 	if (idx >= size)
> 		goto backout;
>@@ -2835,13 +2846,13 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
> 		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page);
> 	}
>
>-	spin_unlock(&mm->page_table_lock);
>+	spin_unlock(ptl);
> 	unlock_page(page);
> out:
> 	return ret;
>
> backout:
>-	spin_unlock(&mm->page_table_lock);
>+	spin_unlock(ptl);
> backout_unlocked:
> 	unlock_page(page);
> 	put_page(page);
>@@ -2853,6 +2864,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> {
> 	pte_t *ptep;
> 	pte_t entry;
>+	spinlock_t *ptl;
> 	int ret;
> 	struct page *page = NULL;
> 	struct page *pagecache_page = NULL;
>@@ -2921,7 +2933,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> 	if (page != pagecache_page)
> 		lock_page(page);
>
>-	spin_lock(&mm->page_table_lock);
>+	ptl = huge_pte_lockptr(mm, ptep);
>+	spin_lock(ptl);
> 	/* Check for a racing update before calling hugetlb_cow */
> 	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
> 		goto out_page_table_lock;
>@@ -2941,7 +2954,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> 		update_mmu_cache(vma, address, ptep);
>
> out_page_table_lock:
>-	spin_unlock(&mm->page_table_lock);
>+	spin_unlock(ptl);
>
> 	if (pagecache_page) {
> 		unlock_page(pagecache_page);
>@@ -2976,9 +2989,9 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> 	unsigned long remainder = *nr_pages;
> 	struct hstate *h = hstate_vma(vma);
>
>-	spin_lock(&mm->page_table_lock);
> 	while (vaddr < vma->vm_end && remainder) {
> 		pte_t *pte;
>+		spinlock_t *ptl = NULL;
> 		int absent;
> 		struct page *page;
>
>@@ -2986,8 +2999,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> 		 * Some archs (sparc64, sh*) have multiple pte_ts to
> 		 * each hugepage.  We have to make sure we get the
> 		 * first, for the page indexing below to work.
>+		 *
>+		 * Note that page table lock is not held when pte is null.
> 		 */
>-		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
>+		pte = huge_pte_offset_lock(mm, vaddr & huge_page_mask(h), &ptl);
> 		absent = !pte || huge_pte_none(huge_ptep_get(pte));
>
> 		/*
>@@ -2999,6 +3014,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> 		 */
> 		if (absent && (flags & FOLL_DUMP) &&
> 		    !hugetlbfs_pagecache_present(h, vma, vaddr)) {
>+			if (pte)
>+				spin_unlock(ptl);
> 			remainder = 0;
> 			break;
> 		}
>@@ -3018,10 +3035,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> 		      !huge_pte_write(huge_ptep_get(pte)))) {
> 			int ret;
>
>-			spin_unlock(&mm->page_table_lock);
>+			if (pte)
>+				spin_unlock(ptl);
> 			ret = hugetlb_fault(mm, vma, vaddr,
> 				(flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
>-			spin_lock(&mm->page_table_lock);
> 			if (!(ret & VM_FAULT_ERROR))
> 				continue;
>
>@@ -3052,8 +3069,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> 			 */
> 			goto same_page;
> 		}
>+		spin_unlock(ptl);
> 	}
>-	spin_unlock(&mm->page_table_lock);
> 	*nr_pages = remainder;
> 	*position = vaddr;
>
>@@ -3074,13 +3091,14 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> 	flush_cache_range(vma, address, end);
>
> 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
>-	spin_lock(&mm->page_table_lock);
> 	for (; address < end; address += huge_page_size(h)) {
>-		ptep = huge_pte_offset(mm, address);
>+		spinlock_t *ptl;
>+		ptep = huge_pte_offset_lock(mm, address, &ptl);
> 		if (!ptep)
> 			continue;
> 		if (huge_pmd_unshare(mm, &address, ptep)) {
> 			pages++;
>+			spin_unlock(ptl);
> 			continue;
> 		}
> 		if (!huge_pte_none(huge_ptep_get(ptep))) {
>@@ -3090,8 +3108,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> 			set_huge_pte_at(mm, address, ptep, pte);
> 			pages++;
> 		}
>+		spin_unlock(ptl);
> 	}
>-	spin_unlock(&mm->page_table_lock);
> 	/*
> 	 * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
> 	 * may have cleared our pud entry and done put_page on the page table:
>-- 
>1.7.11.7
>
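
For reference, the locking pattern the converted hugetlb code follows
with these helpers is roughly the following (a sketch only; mm, address
and the surrounding caller are assumed, only huge_pte_lockptr() and
huge_pte_offset_lock() come from the patch quoted above):

	pte_t *ptep;
	spinlock_t *ptl;

	/* look up the leaf entry and take the lock that guards it */
	ptep = huge_pte_offset_lock(mm, address, &ptl);
	if (ptep) {
		/* ... inspect or modify the hugetlb entry under ptl ... */
		spin_unlock(ptl);
	}

	/* or, when the pte pointer is already known: */
	ptl = huge_pte_lockptr(mm, ptep);
	spin_lock(ptl);
	/* ... */
	spin_unlock(ptl);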


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 2/2] migrate: add migrate_entry_wait_huge()
  2013-05-28 19:52   ` Naoya Horiguchi
  (?)
  (?)
@ 2013-05-29  1:11   ` Wanpeng Li
  -1 siblings, 0 replies; 42+ messages in thread
From: Wanpeng Li @ 2013-05-29  1:11 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, linux-kernel

On Tue, May 28, 2013 at 03:52:51PM -0400, Naoya Horiguchi wrote:
>When we have a page fault for the address which is backed by a hugepage
>under migration, the kernel can't wait correctly and do busy looping on
>hugepage fault until the migration finishes.
>This is because pte_offset_map_lock() can't get a correct migration entry
>or a correct page table lock for hugepage.
>This patch introduces migration_entry_wait_huge() to solve this.
>
>Note that the caller, hugetlb_fault(), gets the pointer to the "leaf"
>entry with huge_pte_offset() inside which all the arch-dependency of
>the page table structure are. So migration_entry_wait_huge() and
>__migration_entry_wait() are free from arch-dependency.
>
>ChangeLog v3:
> - use huge_pte_lockptr
>
>ChangeLog v2:
> - remove dup in migrate_entry_wait_huge()
>

Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>

>Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>Reviewed-by: Rik van Riel <riel@redhat.com>
>Cc: stable@vger.kernel.org # 2.6.35
>---
> include/linux/swapops.h |  3 +++
> mm/hugetlb.c            |  2 +-
> mm/migrate.c            | 23 ++++++++++++++++++-----
> 3 files changed, 22 insertions(+), 6 deletions(-)
>
>diff --git v3.10-rc3.orig/include/linux/swapops.h v3.10-rc3/include/linux/swapops.h
>index 47ead51..c5fd30d 100644
>--- v3.10-rc3.orig/include/linux/swapops.h
>+++ v3.10-rc3/include/linux/swapops.h
>@@ -137,6 +137,7 @@ static inline void make_migration_entry_read(swp_entry_t *entry)
>
> extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> 					unsigned long address);
>+extern void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte);
> #else
>
> #define make_migration_entry(page, write) swp_entry(0, 0)
>@@ -148,6 +149,8 @@ static inline int is_migration_entry(swp_entry_t swp)
> static inline void make_migration_entry_read(swp_entry_t *entryp) { }
> static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> 					 unsigned long address) { }
>+static inline void migration_entry_wait_huge(struct mm_struct *mm,
>+					pte_t *pte) { }
> static inline int is_write_migration_entry(swp_entry_t entry)
> {
> 	return 0;
>diff --git v3.10-rc3.orig/mm/hugetlb.c v3.10-rc3/mm/hugetlb.c
>index 8e1af32..d91a438 100644
>--- v3.10-rc3.orig/mm/hugetlb.c
>+++ v3.10-rc3/mm/hugetlb.c
>@@ -2877,7 +2877,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> 	if (ptep) {
> 		entry = huge_ptep_get(ptep);
> 		if (unlikely(is_hugetlb_entry_migration(entry))) {
>-			migration_entry_wait(mm, (pmd_t *)ptep, address);
>+			migration_entry_wait_huge(mm, ptep);
> 			return 0;
> 		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
> 			return VM_FAULT_HWPOISON_LARGE |
>diff --git v3.10-rc3.orig/mm/migrate.c v3.10-rc3/mm/migrate.c
>index 6f2df6e..64ff118 100644
>--- v3.10-rc3.orig/mm/migrate.c
>+++ v3.10-rc3/mm/migrate.c
>@@ -204,15 +204,14 @@ static void remove_migration_ptes(struct page *old, struct page *new)
>  * get to the page and wait until migration is finished.
>  * When we return from this function the fault will be retried.
>  */
>-void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
>-				unsigned long address)
>+static void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
>+				spinlock_t *ptl)
> {
>-	pte_t *ptep, pte;
>-	spinlock_t *ptl;
>+	pte_t pte;
> 	swp_entry_t entry;
> 	struct page *page;
>
>-	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
>+	spin_lock(ptl);
> 	pte = *ptep;
> 	if (!is_swap_pte(pte))
> 		goto out;
>@@ -240,6 +239,20 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> 	pte_unmap_unlock(ptep, ptl);
> }
>
>+void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
>+				unsigned long address)
>+{
>+	spinlock_t *ptl = pte_lockptr(mm, pmd);
>+	pte_t *ptep = pte_offset_map(pmd, address);
>+	__migration_entry_wait(mm, ptep, ptl);
>+}
>+
>+void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte)
>+{
>+	spinlock_t *ptl = huge_pte_lockptr(mm, pte);
>+	__migration_entry_wait(mm, pte, ptl);
>+}
>+
> #ifdef CONFIG_BLOCK
> /* Returns true if all buffers are successfully locked */
> static bool buffer_migrate_lock_buffers(struct buffer_head *head,
>-- 
>1.7.11.7
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 2/2] migrate: add migrate_entry_wait_huge()
  2013-05-28 19:52   ` Naoya Horiguchi
@ 2013-05-31 19:30     ` Andrew Morton
  -1 siblings, 0 replies; 42+ messages in thread
From: Andrew Morton @ 2013-05-31 19:30 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Mel Gorman, Andi Kleen, Michal Hocko, KOSAKI Motohiro,
	Rik van Riel, linux-kernel

On Tue, 28 May 2013 15:52:51 -0400 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> When we have a page fault for the address which is backed by a hugepage
> under migration, the kernel can't wait correctly and do busy looping on
> hugepage fault until the migration finishes.
> This is because pte_offset_map_lock() can't get a correct migration entry
> or a correct page table lock for hugepage.
> This patch introduces migration_entry_wait_huge() to solve this.
> 
> Note that the caller, hugetlb_fault(), gets the pointer to the "leaf"
> entry with huge_pte_offset() inside which all the arch-dependency of
> the page table structure are. So migration_entry_wait_huge() and
> __migration_entry_wait() are free from arch-dependency.
> 
> ChangeLog v3:
>  - use huge_pte_lockptr
> 
> ChangeLog v2:
>  - remove dup in migrate_entry_wait_huge()
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Cc: stable@vger.kernel.org # 2.6.35

That changelog is not suitable for a -stable patch.  People who
maintain and utilize the stable trees need to know in some detail what
is the end-user impact of the patch (or, equivalently, of the bug
which the patch fixes).


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 2/2] migrate: add migrate_entry_wait_huge()
  2013-05-31 19:30     ` Andrew Morton
@ 2013-05-31 19:46       ` Naoya Horiguchi
  -1 siblings, 0 replies; 42+ messages in thread
From: Naoya Horiguchi @ 2013-05-31 19:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Mel Gorman, Andi Kleen, Michal Hocko, KOSAKI Motohiro,
	Rik van Riel, linux-kernel

> That changelog is not suitable for a -stable patch.  People who
> maintain and utilize the stable trees need to know in some detail what
> is the end-user impact of the patch (or, equivalently, of the bug
> which the patch fixes).

OK. So, could you insert the following sentence?

On Fri, May 31, 2013 at 12:30:03PM -0700, Andrew Morton wrote:
> On Tue, 28 May 2013 15:52:51 -0400 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> 
> > When we have a page fault for the address which is backed by a hugepage
> > under migration, the kernel can't wait correctly and do busy looping on
> > hugepage fault until the migration finishes.

As a result, users who try to kick hugepage migration (via soft offlining,
for example) occasionally experience a long delay or a soft lockup.

Thanks,
Naoya

> > This is because pte_offset_map_lock() can't get a correct migration entry
> > or a correct page table lock for hugepage.
> > This patch introduces migration_entry_wait_huge() to solve this.
> > 
> > Note that the caller, hugetlb_fault(), gets the pointer to the "leaf"
> > entry with huge_pte_offset() inside which all the arch-dependency of
> > the page table structure are. So migration_entry_wait_huge() and
> > __migration_entry_wait() are free from arch-dependency.
> > 
> > ChangeLog v3:
> >  - use huge_pte_lockptr
> > 
> > ChangeLog v2:
> >  - remove dup in migrate_entry_wait_huge()
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Reviewed-by: Rik van Riel <riel@redhat.com>
> > Cc: stable@vger.kernel.org # 2.6.35
> 
> That changelog is not suitable for a -stable patch.  People who
> maintain and utilize the stable trees need to know in some detail what
> is the end-user impact of the patch (or, equivalently, of the bug
> which the patch fixes).

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/2] hugetlbfs: support split page table lock
  2013-05-28 19:52   ` Naoya Horiguchi
@ 2013-06-03 13:19     ` Michal Hocko
  -1 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2013-06-03 13:19 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, KOSAKI Motohiro,
	Rik van Riel, linux-kernel

On Tue 28-05-13 15:52:50, Naoya Horiguchi wrote:
> Currently all of page table handling by hugetlbfs code are done under
> mm->page_table_lock. This is not optimal because there can be lock
> contentions between unrelated components using this lock.

While I agree with such a change in general, I am a bit afraid of all
the subtle tweaks in the mm code that make hugetlb special. Maybe there
are none for page_table_lock, but I am not 100% sure. So this might be
really tricky, and it is not necessary for your further patches, is it?

How have you tested this?

> This patch makes hugepage support split page table lock so that
> we use page->ptl of the leaf node of page table tree which is pte for
> normal pages but can be pmd and/or pud for hugepages of some architectures.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
>  arch/x86/mm/hugetlbpage.c |  6 ++--
>  include/linux/hugetlb.h   | 18 ++++++++++
>  mm/hugetlb.c              | 84 ++++++++++++++++++++++++++++-------------------

This doesn't seem to be the complete story. At least not from the
trivial:
$ find arch/ -name "*hugetlb*" | xargs git grep "page_table_lock" -- 
arch/powerpc/mm/hugetlbpage.c:  spin_lock(&mm->page_table_lock);
arch/powerpc/mm/hugetlbpage.c:  spin_unlock(&mm->page_table_lock);
arch/tile/mm/hugetlbpage.c:             spin_lock(&mm->page_table_lock);
arch/tile/mm/hugetlbpage.c:             spin_unlock(&mm->page_table_lock);
arch/x86/mm/hugetlbpage.c: * called with vma->vm_mm->page_table_lock held.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 2/2] migrate: add migrate_entry_wait_huge()
  2013-05-28 19:52   ` Naoya Horiguchi
@ 2013-06-03 13:26     ` Michal Hocko
  -1 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2013-06-03 13:26 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, KOSAKI Motohiro,
	Rik van Riel, linux-kernel

On Tue 28-05-13 15:52:51, Naoya Horiguchi wrote:
> When we have a page fault for the address which is backed by a hugepage
> under migration, the kernel can't wait correctly and do busy looping on
> hugepage fault until the migration finishes.
> This is because pte_offset_map_lock() can't get a correct migration entry
> or a correct page table lock for hugepage.
> This patch introduces migration_entry_wait_huge() to solve this.
> 
> Note that the caller, hugetlb_fault(), gets the pointer to the "leaf"
> entry with huge_pte_offset() inside which all the arch-dependency of
> the page table structure are. So migration_entry_wait_huge() and
> __migration_entry_wait() are free from arch-dependency.
> 
> ChangeLog v3:
>  - use huge_pte_lockptr
> 
> ChangeLog v2:
>  - remove dup in migrate_entry_wait_huge()
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Cc: stable@vger.kernel.org # 2.6.35

OK, this looks good to me, and I guess you can safely replace
huge_pte_lockptr with &(mm)->page_table_lock so that you can implement
this even without the risky 1/2 of this series. The patch should be as
simple as possible, especially when it goes to stable.
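
A minimal sketch of what that 1/2-independent version could look like
(an assumption for illustration: only this 2/2 patch is applied, so the
hugetlb page table entries are still guarded by mm->page_table_lock):

	void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte)
	{
		/* without 1/2 there is no split lock to pick, use the mm-wide one */
		spinlock_t *ptl = &mm->page_table_lock;

		__migration_entry_wait(mm, pte, ptl);
	}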

Without 1/2 dependency
Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
>  include/linux/swapops.h |  3 +++
>  mm/hugetlb.c            |  2 +-
>  mm/migrate.c            | 23 ++++++++++++++++++-----
>  3 files changed, 22 insertions(+), 6 deletions(-)
> 
> diff --git v3.10-rc3.orig/include/linux/swapops.h v3.10-rc3/include/linux/swapops.h
> index 47ead51..c5fd30d 100644
> --- v3.10-rc3.orig/include/linux/swapops.h
> +++ v3.10-rc3/include/linux/swapops.h
> @@ -137,6 +137,7 @@ static inline void make_migration_entry_read(swp_entry_t *entry)
>  
>  extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
>  					unsigned long address);
> +extern void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte);
>  #else
>  
>  #define make_migration_entry(page, write) swp_entry(0, 0)
> @@ -148,6 +149,8 @@ static inline int is_migration_entry(swp_entry_t swp)
>  static inline void make_migration_entry_read(swp_entry_t *entryp) { }
>  static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
>  					 unsigned long address) { }
> +static inline void migration_entry_wait_huge(struct mm_struct *mm,
> +					pte_t *pte) { }
>  static inline int is_write_migration_entry(swp_entry_t entry)
>  {
>  	return 0;
> diff --git v3.10-rc3.orig/mm/hugetlb.c v3.10-rc3/mm/hugetlb.c
> index 8e1af32..d91a438 100644
> --- v3.10-rc3.orig/mm/hugetlb.c
> +++ v3.10-rc3/mm/hugetlb.c
> @@ -2877,7 +2877,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (ptep) {
>  		entry = huge_ptep_get(ptep);
>  		if (unlikely(is_hugetlb_entry_migration(entry))) {
> -			migration_entry_wait(mm, (pmd_t *)ptep, address);
> +			migration_entry_wait_huge(mm, ptep);
>  			return 0;
>  		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
>  			return VM_FAULT_HWPOISON_LARGE |
> diff --git v3.10-rc3.orig/mm/migrate.c v3.10-rc3/mm/migrate.c
> index 6f2df6e..64ff118 100644
> --- v3.10-rc3.orig/mm/migrate.c
> +++ v3.10-rc3/mm/migrate.c
> @@ -204,15 +204,14 @@ static void remove_migration_ptes(struct page *old, struct page *new)
>   * get to the page and wait until migration is finished.
>   * When we return from this function the fault will be retried.
>   */
> -void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> -				unsigned long address)
> +static void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> +				spinlock_t *ptl)
>  {
> -	pte_t *ptep, pte;
> -	spinlock_t *ptl;
> +	pte_t pte;
>  	swp_entry_t entry;
>  	struct page *page;
>  
> -	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
> +	spin_lock(ptl);
>  	pte = *ptep;
>  	if (!is_swap_pte(pte))
>  		goto out;
> @@ -240,6 +239,20 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
>  	pte_unmap_unlock(ptep, ptl);
>  }
>  
> +void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> +				unsigned long address)
> +{
> +	spinlock_t *ptl = pte_lockptr(mm, pmd);
> +	pte_t *ptep = pte_offset_map(pmd, address);
> +	__migration_entry_wait(mm, ptep, ptl);
> +}
> +
> +void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte)
> +{
> +	spinlock_t *ptl = huge_pte_lockptr(mm, pte);
> +	__migration_entry_wait(mm, pte, ptl);
> +}
> +
>  #ifdef CONFIG_BLOCK
>  /* Returns true if all buffers are successfully locked */
>  static bool buffer_migrate_lock_buffers(struct buffer_head *head,
> -- 
> 1.7.11.7
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/2] hugetlbfs: support split page table lock
  2013-06-03 13:19     ` Michal Hocko
@ 2013-06-03 14:34       ` Naoya Horiguchi
  -1 siblings, 0 replies; 42+ messages in thread
From: Naoya Horiguchi @ 2013-06-03 14:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, KOSAKI Motohiro,
	Rik van Riel, linux-kernel

On Mon, Jun 03, 2013 at 03:19:32PM +0200, Michal Hocko wrote:
> On Tue 28-05-13 15:52:50, Naoya Horiguchi wrote:
> > Currently all of page table handling by hugetlbfs code are done under
> > mm->page_table_lock. This is not optimal because there can be lock
> > contentions between unrelated components using this lock.
> 
> While I agree with such a change in general I am a bit afraid of all
> subtle tweaks in the mm code that make hugetlb special. Maybe there are
> none for page_table_lock but I am not 100% sure. So this might be
> really tricky and it is not necessary for your further patches, is it?

No, this page_table_lock patch is separable from the migration stuff.
As you said in another email, changes going to stable should be minimal,
so it's better to make the 2/2 patch not depend on this patch.

> How have you tested this?

Other than the libhugetlbfs test suite (which contains many workloads,
though I'm not sure it can detect a possible regression from this patch),
I did a simple test where I:
 - create a file on hugetlbfs,
 - create 10 processes and make each of them iterate the following:
   * mmap() the hugetlbfs file,
   * memset() the mapped range (to cause hugetlb_fault), and
   * munmap() the mapped range.
I think this creates the racy situations which the page table locks
should prevent (a minimal sketch of such a test is below).
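
A minimal sketch of that test, assuming a hugetlbfs mount at
/mnt/hugetlbfs with enough 2MB hugepages reserved (the path, mapping
size and iteration count below are made up for illustration):

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/wait.h>

	#define NR_PROCS	10
	#define NR_LOOPS	1000
	#define MAP_LEN		(8UL * 2 * 1024 * 1024)	/* 8 x 2MB hugepages */

	int main(void)
	{
		int i, fd;

		/* file on hugetlbfs, so mappings of it are backed by hugepages */
		fd = open("/mnt/hugetlbfs/testfile", O_CREAT | O_RDWR, 0644);
		if (fd < 0 || ftruncate(fd, MAP_LEN) < 0) {
			perror("hugetlbfs file");
			return 1;
		}

		for (i = 0; i < NR_PROCS; i++) {
			if (fork() == 0) {
				int j;

				for (j = 0; j < NR_LOOPS; j++) {
					char *p = mmap(NULL, MAP_LEN,
						       PROT_READ | PROT_WRITE,
						       MAP_SHARED, fd, 0);
					if (p == MAP_FAILED) {
						perror("mmap");
						exit(1);
					}
					/* touch every hugepage to go through hugetlb_fault() */
					memset(p, 0, MAP_LEN);
					munmap(p, MAP_LEN);
				}
				exit(0);
			}
		}
		while (wait(NULL) > 0)
			;
		return 0;
	}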

> > This patch makes hugepage support split page table lock so that
> > we use page->ptl of the leaf node of page table tree which is pte for
> > normal pages but can be pmd and/or pud for hugepages of some architectures.
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > ---
> >  arch/x86/mm/hugetlbpage.c |  6 ++--
> >  include/linux/hugetlb.h   | 18 ++++++++++
> >  mm/hugetlb.c              | 84 ++++++++++++++++++++++++++++-------------------
> 
> This doesn't seem to be the complete story. At least not from the
> trivial:
> $ find arch/ -name "*hugetlb*" | xargs git grep "page_table_lock" -- 
> arch/powerpc/mm/hugetlbpage.c:  spin_lock(&mm->page_table_lock);
> arch/powerpc/mm/hugetlbpage.c:  spin_unlock(&mm->page_table_lock);
> arch/tile/mm/hugetlbpage.c:             spin_lock(&mm->page_table_lock);
> > arch/tile/mm/hugetlbpage.c:             spin_unlock(&mm->page_table_lock);
> arch/x86/mm/hugetlbpage.c: * called with vma->vm_mm->page_table_lock held.

These trivial ones should be fixed. Sorry.

Naoya

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 2/2] migrate: add migrate_entry_wait_huge()
  2013-06-03 13:26     ` Michal Hocko
@ 2013-06-03 14:34       ` Naoya Horiguchi
  -1 siblings, 0 replies; 42+ messages in thread
From: Naoya Horiguchi @ 2013-06-03 14:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, KOSAKI Motohiro,
	Rik van Riel, linux-kernel

On Mon, Jun 03, 2013 at 03:26:41PM +0200, Michal Hocko wrote:
> On Tue 28-05-13 15:52:51, Naoya Horiguchi wrote:
> > When we have a page fault for the address which is backed by a hugepage
> > under migration, the kernel can't wait correctly and do busy looping on
> > hugepage fault until the migration finishes.
> > This is because pte_offset_map_lock() can't get a correct migration entry
> > or a correct page table lock for hugepage.
> > This patch introduces migration_entry_wait_huge() to solve this.
> > 
> > Note that the caller, hugetlb_fault(), gets the pointer to the "leaf"
> > entry with huge_pte_offset() inside which all the arch-dependency of
> > the page table structure are. So migration_entry_wait_huge() and
> > __migration_entry_wait() are free from arch-dependency.
> > 
> > ChangeLog v3:
> >  - use huge_pte_lockptr
> > 
> > ChangeLog v2:
> >  - remove dup in migrate_entry_wait_huge()
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Reviewed-by: Rik van Riel <riel@redhat.com>
> > Cc: stable@vger.kernel.org # 2.6.35
> 
> OK, this looks good to me and I guess you can safely replace
> huge_pte_lockptr by &(mm)->page_table_lock so you can implement this
> even without risky 1/2 of this series. The patch should be as simple as
> possible especially when it goes to the stable.

Yes, I agree.

> > Without 1/2 dependency
> Reviewed-by: Michal Hocko <mhocko@suse.cz>

Thanks,
Naoya

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/2] hugetlbfs: support split page table lock
  2013-06-03 14:34       ` Naoya Horiguchi
@ 2013-06-03 15:42         ` Michal Hocko
  -1 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2013-06-03 15:42 UTC (permalink / raw)
  To: Naoya Horiguchi, Andrew Morton
  Cc: linux-mm, Mel Gorman, Andi Kleen, KOSAKI Motohiro, Rik van Riel,
	linux-kernel

On Mon 03-06-13 10:34:35, Naoya Horiguchi wrote:
> On Mon, Jun 03, 2013 at 03:19:32PM +0200, Michal Hocko wrote:
> > On Tue 28-05-13 15:52:50, Naoya Horiguchi wrote:
> > > Currently all of page table handling by hugetlbfs code are done under
> > > mm->page_table_lock. This is not optimal because there can be lock
> > > contentions between unrelated components using this lock.
> > 
> > While I agree with such a change in general I am a bit afraid of all
> > subtle tweaks in the mm code that make hugetlb special. Maybe there are
> > none for page_table_lock but I am not 100% sure. So this might be
> > really tricky and it is not necessary for your further patches, is it?
> 
> No, this page_table_lock patch is separable from migration stuff.
> As you said in another email, changes going to stable should be minimal,
> so it's better to make 2/2 patch not depend on this patch.

OK, so how do we go around this? Both patches are in the mm tree now.
Should Andrew just drop the current version so that you can repost a new
one? Sorry I didn't jump in sooner, but I was quite busy last week.

> > How have you tested this?
> 
> Other than libhugetlbfs test (that contains many workloads, but I'm
> not sure it can detect the possible regression of this patch,)
> I did simple testing where:
>  - create a file on hugetlbfs,
>  - create 10 processes and make each of them iterate the following:
>    * mmap() the hugetlbfs file,
>    * memset() the mapped range (to cause hugetlb_fault), and
>    * munmap() the mapped range.
> I think that this can make racy situation which should be prevented
> by page table locks.

OK, but this still requires a deep inspection of all the subtle
dependencies on page_table_lock in the core mm. I might be wrong here,
and I should be more specific about the issues I only have suspicions
about, but as this is "just" a scalability improvement (is it actually
measurable?) I would suggest putting it at the end of your hugetlbfs
enhancements for migration, just from the reviewability point of view.

> > > This patch makes hugepage support split page table lock so that
> > > we use page->ptl of the leaf node of page table tree which is pte for
> > > normal pages but can be pmd and/or pud for hugepages of some architectures.
> > > 
> > > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > > ---
> > >  arch/x86/mm/hugetlbpage.c |  6 ++--
> > >  include/linux/hugetlb.h   | 18 ++++++++++
> > >  mm/hugetlb.c              | 84 ++++++++++++++++++++++++++++-------------------
> > 
> > This doesn't seem to be the complete story. At least not from the
> > trivial:
> > $ find arch/ -name "*hugetlb*" | xargs git grep "page_table_lock" -- 
> > arch/powerpc/mm/hugetlbpage.c:  spin_lock(&mm->page_table_lock);
> > arch/powerpc/mm/hugetlbpage.c:  spin_unlock(&mm->page_table_lock);
> > arch/tile/mm/hugetlbpage.c:             spin_lock(&mm->page_table_lock);
> > arch/tile/mm/hugetlbpage.c:             spin_unlock(&mm->page_table_lock);
> > arch/x86/mm/hugetlbpage.c: * called with vma->vm_mm->page_table_lock held.
> 
> This trivials should be fixed. Sorry.

Other archs are often forgotten, and cscope doesn't exactly help ;)

Thanks
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v4] migrate: add migrate_entry_wait_huge()
  2013-06-03 13:26     ` Michal Hocko
@ 2013-06-04 16:44       ` Naoya Horiguchi
  -1 siblings, 0 replies; 42+ messages in thread
From: Naoya Horiguchi @ 2013-06-04 16:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, KOSAKI Motohiro,
	Rik van Riel, linux-kernel

Here is the revised one. Andrew, could you replace the following
patches in your tree with this?

  migrate-add-migrate_entry_wait_huge.patch
  hugetlbfs-support-split-page-table-lock.patch

Thanks,
Naoya Horiguchi
----
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Date: Tue, 4 Jun 2013 12:26:32 -0400
Subject: [PATCH v4] migrate: add migrate_entry_wait_huge()

When we have a page fault for an address which is backed by a hugepage
under migration, the kernel can't wait correctly and busy-loops on the
hugepage fault until the migration finishes.
As a result, users who try to kick hugepage migration (via soft offlining,
for example) occasionally experience long delays or soft lockups.

This is because pte_offset_map_lock() can't get a correct migration entry
or a correct page table lock for a hugepage.
This patch introduces migration_entry_wait_huge() to solve this.

ChangeLog v4:
 - replace huge_pte_lockptr with &(mm)->page_table_lock
   (postponed split page table lock patch)
 - update description

ChangeLog v3:
 - use huge_pte_lockptr

ChangeLog v2:
 - remove dup in migrate_entry_wait_huge()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: stable@vger.kernel.org # 2.6.35
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/swapops.h |  3 +++
 mm/hugetlb.c            |  2 +-
 mm/migrate.c            | 23 ++++++++++++++++++-----
 3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 47ead51..c5fd30d 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -137,6 +137,7 @@ static inline void make_migration_entry_read(swp_entry_t *entry)
 
 extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 					unsigned long address);
+extern void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte);
 #else
 
 #define make_migration_entry(page, write) swp_entry(0, 0)
@@ -148,6 +149,8 @@ static inline int is_migration_entry(swp_entry_t swp)
 static inline void make_migration_entry_read(swp_entry_t *entryp) { }
 static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 					 unsigned long address) { }
+static inline void migration_entry_wait_huge(struct mm_struct *mm,
+					pte_t *pte) { }
 static inline int is_write_migration_entry(swp_entry_t entry)
 {
 	return 0;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 463fb5e..d793c5e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2865,7 +2865,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (ptep) {
 		entry = huge_ptep_get(ptep);
 		if (unlikely(is_hugetlb_entry_migration(entry))) {
-			migration_entry_wait(mm, (pmd_t *)ptep, address);
+			migration_entry_wait_huge(mm, ptep);
 			return 0;
 		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			return VM_FAULT_HWPOISON_LARGE |
diff --git a/mm/migrate.c b/mm/migrate.c
index 6f2df6e..b8d56a1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -204,15 +204,14 @@ static void remove_migration_ptes(struct page *old, struct page *new)
  * get to the page and wait until migration is finished.
  * When we return from this function the fault will be retried.
  */
-void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
-				unsigned long address)
+static void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
+				spinlock_t *ptl)
 {
-	pte_t *ptep, pte;
-	spinlock_t *ptl;
+	pte_t pte;
 	swp_entry_t entry;
 	struct page *page;
 
-	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+	spin_lock(ptl);
 	pte = *ptep;
 	if (!is_swap_pte(pte))
 		goto out;
@@ -240,6 +239,20 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 	pte_unmap_unlock(ptep, ptl);
 }
 
+void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long address)
+{
+	spinlock_t *ptl = pte_lockptr(mm, pmd);
+	pte_t *ptep = pte_offset_map(pmd, address);
+	__migration_entry_wait(mm, ptep, ptl);
+}
+
+void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte)
+{
+	spinlock_t *ptl = &(mm)->page_table_lock;
+	__migration_entry_wait(mm, pte, ptl);
+}
+
 #ifdef CONFIG_BLOCK
 /* Returns true if all buffers are successfully locked */
 static bool buffer_migrate_lock_buffers(struct buffer_head *head,
-- 
1.7.11.7

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/2] hugetlbfs: support split page table lock
  2013-09-08 16:53     ` Aneesh Kumar K.V
@ 2013-09-09 16:26       ` Naoya Horiguchi
  -1 siblings, 0 replies; 42+ messages in thread
From: Naoya Horiguchi @ 2013-09-09 16:26 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, Andrea Arcangeli, kirill.shutemov,
	Alex Thorlton, linux-kernel

On Sun, Sep 08, 2013 at 10:23:55PM +0530, Aneesh Kumar K.V wrote:
> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:
> 
> > Currently all page table handling by the hugetlbfs code is done under
> > mm->page_table_lock. So when a process has many threads that heavily
> > access the memory, lock contention happens and impacts the performance.
> >
> > This patch makes hugepages support split page table locks so that we use
> > the page->ptl of the leaf node of the page table tree, which is a pte for
> > normal pages but can be a pmd and/or pud for hugepages on some
> > architectures.
> >
> > ChangeLog v3:
> >  - disable split ptl for ppc with USE_SPLIT_PTLOCKS_HUGETLB.
> >  - remove replacement in some architecture dependent code. This is justified
> >    because an allocation of pgd/pud/pmd/pte entry can race with other
> >    allocation, not with read/write access, so we can use different locks.
> >    http://thread.gmane.org/gmane.linux.kernel.mm/106292/focus=106458
> >
> > ChangeLog v2:
> >  - add split ptl on other archs missed in v1
> >  - drop changes on arch/{powerpc,tile}/mm/hugetlbpage.c
> >
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > ---
> >  include/linux/hugetlb.h  | 20 +++++++++++
> >  include/linux/mm_types.h |  2 ++
> >  mm/hugetlb.c             | 92 +++++++++++++++++++++++++++++-------------------
> >  mm/mempolicy.c           |  5 +--
> >  mm/migrate.c             |  4 +--
> >  mm/rmap.c                |  2 +-
> >  6 files changed, 84 insertions(+), 41 deletions(-)
> >
> > diff --git v3.11-rc3.orig/include/linux/hugetlb.h v3.11-rc3/include/linux/hugetlb.h
> > index 0393270..5cb8a4e 100644
> > --- v3.11-rc3.orig/include/linux/hugetlb.h
> > +++ v3.11-rc3/include/linux/hugetlb.h
> > @@ -80,6 +80,24 @@ extern const unsigned long hugetlb_zero, hugetlb_infinity;
> >  extern int sysctl_hugetlb_shm_group;
> >  extern struct list_head huge_boot_pages;
> >
> > +#if USE_SPLIT_PTLOCKS_HUGETLB
> > +#define huge_pte_lockptr(mm, ptep) ({__pte_lockptr(virt_to_page(ptep)); })
> > +#else	/* !USE_SPLIT_PTLOCKS_HUGETLB */
> > +#define huge_pte_lockptr(mm, ptep) ({&(mm)->page_table_lock; })
> > +#endif	/* USE_SPLIT_PTLOCKS_HUGETLB */
> > +
> > +#define huge_pte_offset_lock(mm, address, ptlp)		\
> > +({							\
> > +	pte_t *__pte = huge_pte_offset(mm, address);	\
> > +	spinlock_t *__ptl = NULL;			\
> > +	if (__pte) {					\
> > +		__ptl = huge_pte_lockptr(mm, __pte);	\
> > +		*(ptlp) = __ptl;			\
> > +		spin_lock(__ptl);			\
> > +	}						\
> > +	__pte;						\
> > +})
> 
> 
> why not a static inline function ?

I'm not sure how the two differ at runtime (they are probably optimized
to similar code).
I just used the same pattern as pte_offset_map_lock(), which I think shows
that this helper is a variant of pte_offset_map_lock().
But if we have a good reason to make it a static inline, I'm glad to do so.
Do you have any specific ideas?
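
(Just to illustrate the alternative being discussed, an equivalent static
inline could look roughly like the sketch below. This is untested, only
mirrors the macro above, and is not part of the posted patch.)

static inline pte_t *huge_pte_offset_lock(struct mm_struct *mm,
					  unsigned long address,
					  spinlock_t **ptlp)
{
	pte_t *pte = huge_pte_offset(mm, address);

	/*
	 * Same contract as the macro: return the pte and, when one is
	 * found, return with its page table lock held via *ptlp.
	 */
	if (pte) {
		*ptlp = huge_pte_lockptr(mm, pte);
		spin_lock(*ptlp);
	}
	return pte;
}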

> 
> > +
> >  /* arch callbacks */
> >
> >  pte_t *huge_pte_alloc(struct mm_struct *mm,
> > @@ -164,6 +182,8 @@ static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
> >  	BUG();
> >  }
> >
> > +#define huge_pte_lockptr(mm, ptep) 0
> > +
> >  #endif /* !CONFIG_HUGETLB_PAGE */
> >
> >  #define HUGETLB_ANON_FILE "anon_hugepage"
> > diff --git v3.11-rc3.orig/include/linux/mm_types.h v3.11-rc3/include/linux/mm_types.h
> > index fb425aa..cfb8c6f 100644
> > --- v3.11-rc3.orig/include/linux/mm_types.h
> > +++ v3.11-rc3/include/linux/mm_types.h
> > @@ -24,6 +24,8 @@
> >  struct address_space;
> >
> >  #define USE_SPLIT_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
> > +#define USE_SPLIT_PTLOCKS_HUGETLB	\
> > +	(USE_SPLIT_PTLOCKS && !defined(CONFIG_PPC))
> >
> 
> Is that a common pattern ? Don't we generally use
> HAVE_ARCH_SPLIT_PTLOCKS in arch config file ?

Do you mean that HAVE_ARCH_SPLIT_PTLOCKS enables/disables the whole split
ptl mechanism (not only split ptl for hugepages) depending on the arch?
If so, we can do this by adding a line 'default "999999" if PPC'.
(Note that if we do this, we lose the performance benefit of the existing
split ptl for normal pages, so I'm not sure that's acceptable for users
of the arch.)

> Also are we sure this is
> only an issue with PPC ?

Not confirmed yet (so let me take some time to check this).
I think that split ptl for hugepages should work on any architecture
where every entry pointing to a hugepage is stored in a 4kB page
allocated for the page table tree. So I'll check it.
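
(The underlying assumption, following the huge_pte_lockptr() definition in
this patch, is that the split lock is looked up through the struct page of
the page table page that contains the leaf entry, roughly as in the sketch
below; the helper name here is made up for illustration.)

static inline spinlock_t *hugepage_leaf_lockptr(pte_t *ptep)
{
	/* the 4kB page table page that holds the pte/pmd/pud entry */
	struct page *pgtable_page = virt_to_page(ptep);

	return __pte_lockptr(pgtable_page);	/* i.e. &pgtable_page->ptl */
}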

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/2] hugetlbfs: support split page table lock
  2013-09-05 21:27   ` Naoya Horiguchi
@ 2013-09-08 16:53     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 42+ messages in thread
From: Aneesh Kumar K.V @ 2013-09-08 16:53 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, Andrea Arcangeli, kirill.shutemov,
	Alex Thorlton, linux-kernel

Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:

> Currently all page table handling by the hugetlbfs code is done under
> mm->page_table_lock. So when a process has many threads that heavily
> access the memory, lock contention happens and impacts the performance.
>
> This patch makes hugepages support split page table locks so that we use
> the page->ptl of the leaf node of the page table tree, which is a pte for
> normal pages but can be a pmd and/or pud for hugepages on some
> architectures.
>
> ChangeLog v3:
>  - disable split ptl for ppc with USE_SPLIT_PTLOCKS_HUGETLB.
>  - remove replacement in some architecture dependent code. This is justified
>    because an allocation of pgd/pud/pmd/pte entry can race with other
>    allocation, not with read/write access, so we can use different locks.
>    http://thread.gmane.org/gmane.linux.kernel.mm/106292/focus=106458
>
> ChangeLog v2:
>  - add split ptl on other archs missed in v1
>  - drop changes on arch/{powerpc,tile}/mm/hugetlbpage.c
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
>  include/linux/hugetlb.h  | 20 +++++++++++
>  include/linux/mm_types.h |  2 ++
>  mm/hugetlb.c             | 92 +++++++++++++++++++++++++++++-------------------
>  mm/mempolicy.c           |  5 +--
>  mm/migrate.c             |  4 +--
>  mm/rmap.c                |  2 +-
>  6 files changed, 84 insertions(+), 41 deletions(-)
>
> diff --git v3.11-rc3.orig/include/linux/hugetlb.h v3.11-rc3/include/linux/hugetlb.h
> index 0393270..5cb8a4e 100644
> --- v3.11-rc3.orig/include/linux/hugetlb.h
> +++ v3.11-rc3/include/linux/hugetlb.h
> @@ -80,6 +80,24 @@ extern const unsigned long hugetlb_zero, hugetlb_infinity;
>  extern int sysctl_hugetlb_shm_group;
>  extern struct list_head huge_boot_pages;
>
> +#if USE_SPLIT_PTLOCKS_HUGETLB
> +#define huge_pte_lockptr(mm, ptep) ({__pte_lockptr(virt_to_page(ptep)); })
> +#else	/* !USE_SPLIT_PTLOCKS_HUGETLB */
> +#define huge_pte_lockptr(mm, ptep) ({&(mm)->page_table_lock; })
> +#endif	/* USE_SPLIT_PTLOCKS_HUGETLB */
> +
> +#define huge_pte_offset_lock(mm, address, ptlp)		\
> +({							\
> +	pte_t *__pte = huge_pte_offset(mm, address);	\
> +	spinlock_t *__ptl = NULL;			\
> +	if (__pte) {					\
> +		__ptl = huge_pte_lockptr(mm, __pte);	\
> +		*(ptlp) = __ptl;			\
> +		spin_lock(__ptl);			\
> +	}						\
> +	__pte;						\
> +})


Why not a static inline function?


> +
>  /* arch callbacks */
>
>  pte_t *huge_pte_alloc(struct mm_struct *mm,
> @@ -164,6 +182,8 @@ static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
>  	BUG();
>  }
>
> +#define huge_pte_lockptr(mm, ptep) 0
> +
>  #endif /* !CONFIG_HUGETLB_PAGE */
>
>  #define HUGETLB_ANON_FILE "anon_hugepage"
> diff --git v3.11-rc3.orig/include/linux/mm_types.h v3.11-rc3/include/linux/mm_types.h
> index fb425aa..cfb8c6f 100644
> --- v3.11-rc3.orig/include/linux/mm_types.h
> +++ v3.11-rc3/include/linux/mm_types.h
> @@ -24,6 +24,8 @@
>  struct address_space;
>
>  #define USE_SPLIT_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
> +#define USE_SPLIT_PTLOCKS_HUGETLB	\
> +	(USE_SPLIT_PTLOCKS && !defined(CONFIG_PPC))
>

Is that a common pattern? Don't we generally use
HAVE_ARCH_SPLIT_PTLOCKS in the arch config file? Also, are we sure this is
only an issue with PPC?



-aneesh


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 1/2] hugetlbfs: support split page table lock
  2013-09-05 21:27 [PATCH 0/2 v3] split page table lock for hugepage Naoya Horiguchi
@ 2013-09-05 21:27   ` Naoya Horiguchi
  0 siblings, 0 replies; 42+ messages in thread
From: Naoya Horiguchi @ 2013-09-05 21:27 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, Andrea Arcangeli, kirill.shutemov,
	Aneesh Kumar K.V, Alex Thorlton, linux-kernel

Currently all page table handling by the hugetlbfs code is done under
mm->page_table_lock. So when a process has many threads that heavily
access the memory, lock contention happens and impacts the performance.

This patch makes hugepages support split page table locks so that we use
the page->ptl of the leaf node of the page table tree, which is a pte for
normal pages but can be a pmd and/or pud for hugepages on some
architectures.

ChangeLog v3:
 - disable split ptl for ppc with USE_SPLIT_PTLOCKS_HUGETLB.
 - remove replacement in some architecture-dependent code. This is justified
   because an allocation of a pgd/pud/pmd/pte entry can only race with another
   allocation, not with read/write access, so we can use different locks.
   http://thread.gmane.org/gmane.linux.kernel.mm/106292/focus=106458

ChangeLog v2:
 - add split ptl on other archs missed in v1
 - drop changes on arch/{powerpc,tile}/mm/hugetlbpage.c

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/hugetlb.h  | 20 +++++++++++
 include/linux/mm_types.h |  2 ++
 mm/hugetlb.c             | 92 +++++++++++++++++++++++++++++-------------------
 mm/mempolicy.c           |  5 +--
 mm/migrate.c             |  4 +--
 mm/rmap.c                |  2 +-
 6 files changed, 84 insertions(+), 41 deletions(-)

diff --git v3.11-rc3.orig/include/linux/hugetlb.h v3.11-rc3/include/linux/hugetlb.h
index 0393270..5cb8a4e 100644
--- v3.11-rc3.orig/include/linux/hugetlb.h
+++ v3.11-rc3/include/linux/hugetlb.h
@@ -80,6 +80,24 @@ extern const unsigned long hugetlb_zero, hugetlb_infinity;
 extern int sysctl_hugetlb_shm_group;
 extern struct list_head huge_boot_pages;
 
+#if USE_SPLIT_PTLOCKS_HUGETLB
+#define huge_pte_lockptr(mm, ptep) ({__pte_lockptr(virt_to_page(ptep)); })
+#else	/* !USE_SPLIT_PTLOCKS_HUGETLB */
+#define huge_pte_lockptr(mm, ptep) ({&(mm)->page_table_lock; })
+#endif	/* USE_SPLIT_PTLOCKS_HUGETLB */
+
+#define huge_pte_offset_lock(mm, address, ptlp)		\
+({							\
+	pte_t *__pte = huge_pte_offset(mm, address);	\
+	spinlock_t *__ptl = NULL;			\
+	if (__pte) {					\
+		__ptl = huge_pte_lockptr(mm, __pte);	\
+		*(ptlp) = __ptl;			\
+		spin_lock(__ptl);			\
+	}						\
+	__pte;						\
+})
+
 /* arch callbacks */
 
 pte_t *huge_pte_alloc(struct mm_struct *mm,
@@ -164,6 +182,8 @@ static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
 	BUG();
 }
 
+#define huge_pte_lockptr(mm, ptep) 0
+
 #endif /* !CONFIG_HUGETLB_PAGE */
 
 #define HUGETLB_ANON_FILE "anon_hugepage"
diff --git v3.11-rc3.orig/include/linux/mm_types.h v3.11-rc3/include/linux/mm_types.h
index fb425aa..cfb8c6f 100644
--- v3.11-rc3.orig/include/linux/mm_types.h
+++ v3.11-rc3/include/linux/mm_types.h
@@ -24,6 +24,8 @@
 struct address_space;
 
 #define USE_SPLIT_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
+#define USE_SPLIT_PTLOCKS_HUGETLB	\
+	(USE_SPLIT_PTLOCKS && !defined(CONFIG_PPC))
 
 /*
  * Each physical page in the system has a struct page associated with
diff --git v3.11-rc3.orig/mm/hugetlb.c v3.11-rc3/mm/hugetlb.c
index 0f8da6b..b2dbca4 100644
--- v3.11-rc3.orig/mm/hugetlb.c
+++ v3.11-rc3/mm/hugetlb.c
@@ -2372,6 +2372,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
+		spinlock_t *src_ptl, *dst_ptl;
 		src_pte = huge_pte_offset(src, addr);
 		if (!src_pte)
 			continue;
@@ -2383,8 +2384,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		if (dst_pte == src_pte)
 			continue;
 
-		spin_lock(&dst->page_table_lock);
-		spin_lock_nested(&src->page_table_lock, SINGLE_DEPTH_NESTING);
+		dst_ptl = huge_pte_lockptr(dst, dst_pte);
+		src_ptl = huge_pte_lockptr(src, src_pte);
+		spin_lock(dst_ptl);
+		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 		if (!huge_pte_none(huge_ptep_get(src_pte))) {
 			if (cow)
 				huge_ptep_set_wrprotect(src, addr, src_pte);
@@ -2394,8 +2397,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			page_dup_rmap(ptepage);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 		}
-		spin_unlock(&src->page_table_lock);
-		spin_unlock(&dst->page_table_lock);
+		spin_unlock(src_ptl);
+		spin_unlock(dst_ptl);
 	}
 	return 0;
 
@@ -2438,6 +2441,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	unsigned long address;
 	pte_t *ptep;
 	pte_t pte;
+	spinlock_t *ptl;
 	struct page *page;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
@@ -2451,25 +2455,24 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	tlb_start_vma(tlb, vma);
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 again:
-	spin_lock(&mm->page_table_lock);
 	for (address = start; address < end; address += sz) {
-		ptep = huge_pte_offset(mm, address);
+		ptep = huge_pte_offset_lock(mm, address, &ptl);
 		if (!ptep)
 			continue;
 
 		if (huge_pmd_unshare(mm, &address, ptep))
-			continue;
+			goto unlock;
 
 		pte = huge_ptep_get(ptep);
 		if (huge_pte_none(pte))
-			continue;
+			goto unlock;
 
 		/*
 		 * HWPoisoned hugepage is already unmapped and dropped reference
 		 */
 		if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
 			huge_pte_clear(mm, address, ptep);
-			continue;
+			goto unlock;
 		}
 
 		page = pte_page(pte);
@@ -2480,7 +2483,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 */
 		if (ref_page) {
 			if (page != ref_page)
-				continue;
+				goto unlock;
 
 			/*
 			 * Mark the VMA as having unmapped its page so that
@@ -2497,13 +2500,18 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 		page_remove_rmap(page);
 		force_flush = !__tlb_remove_page(tlb, page);
-		if (force_flush)
+		if (force_flush) {
+			spin_unlock(ptl);
 			break;
+		}
 		/* Bail out after unmapping reference page if supplied */
-		if (ref_page)
+		if (ref_page) {
+			spin_unlock(ptl);
 			break;
+		}
+unlock:
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	/*
 	 * mmu_gather ran out of room to batch pages, we break out of
 	 * the PTE lock to avoid doing the potential expensive TLB invalidate
@@ -2617,6 +2625,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	int outside_reserve = 0;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
+	spinlock_t *ptl = huge_pte_lockptr(mm, ptep);
 
 	old_page = pte_page(pte);
 
@@ -2648,7 +2657,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	page_cache_get(old_page);
 
 	/* Drop page_table_lock as buddy allocator may be called */
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	new_page = alloc_huge_page(vma, address, outside_reserve);
 
 	if (IS_ERR(new_page)) {
@@ -2666,7 +2675,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 			BUG_ON(huge_pte_none(pte));
 			if (unmap_ref_private(mm, vma, old_page, address)) {
 				BUG_ON(huge_pte_none(pte));
-				spin_lock(&mm->page_table_lock);
+				spin_lock(ptl);
 				ptep = huge_pte_offset(mm, address & huge_page_mask(h));
 				if (likely(pte_same(huge_ptep_get(ptep), pte)))
 					goto retry_avoidcopy;
@@ -2680,7 +2689,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 
 		/* Caller expects lock to be held */
-		spin_lock(&mm->page_table_lock);
+		spin_lock(ptl);
 		if (err == -ENOMEM)
 			return VM_FAULT_OOM;
 		else
@@ -2695,7 +2704,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_cache_release(new_page);
 		page_cache_release(old_page);
 		/* Caller expects lock to be held */
-		spin_lock(&mm->page_table_lock);
+		spin_lock(ptl);
 		return VM_FAULT_OOM;
 	}
 
@@ -2710,7 +2719,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * Retake the page_table_lock to check for racing updates
 	 * before the page tables are altered
 	 */
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	ptep = huge_pte_offset(mm, address & huge_page_mask(h));
 	if (likely(pte_same(huge_ptep_get(ptep), pte))) {
 		/* Break COW */
@@ -2722,10 +2731,10 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* Make the old page be freed below */
 		new_page = old_page;
 	}
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	/* Caller expects lock to be held */
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
 	return 0;
@@ -2775,6 +2784,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page;
 	struct address_space *mapping;
 	pte_t new_pte;
+	spinlock_t *ptl;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -2860,7 +2870,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			goto backout_unlocked;
 		}
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, ptep);
+	spin_lock(ptl);
 	size = i_size_read(mapping->host) >> huge_page_shift(h);
 	if (idx >= size)
 		goto backout;
@@ -2882,13 +2893,13 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page);
 	}
 
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	unlock_page(page);
 out:
 	return ret;
 
 backout:
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 backout_unlocked:
 	unlock_page(page);
 	put_page(page);
@@ -2900,6 +2911,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	pte_t *ptep;
 	pte_t entry;
+	spinlock_t *ptl;
 	int ret;
 	struct page *page = NULL;
 	struct page *pagecache_page = NULL;
@@ -2968,7 +2980,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (page != pagecache_page)
 		lock_page(page);
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, ptep);
+	spin_lock(ptl);
 	/* Check for a racing update before calling hugetlb_cow */
 	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
 		goto out_page_table_lock;
@@ -2988,7 +3001,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		update_mmu_cache(vma, address, ptep);
 
 out_page_table_lock:
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 
 	if (pagecache_page) {
 		unlock_page(pagecache_page);
@@ -3014,9 +3027,9 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long remainder = *nr_pages;
 	struct hstate *h = hstate_vma(vma);
 
-	spin_lock(&mm->page_table_lock);
 	while (vaddr < vma->vm_end && remainder) {
 		pte_t *pte;
+		spinlock_t *ptl = NULL;
 		int absent;
 		struct page *page;
 
@@ -3024,8 +3037,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * Some archs (sparc64, sh*) have multiple pte_ts to
 		 * each hugepage.  We have to make sure we get the
 		 * first, for the page indexing below to work.
+		 *
+		 * Note that page table lock is not held when pte is null.
 		 */
-		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
+		pte = huge_pte_offset_lock(mm, vaddr & huge_page_mask(h), &ptl);
 		absent = !pte || huge_pte_none(huge_ptep_get(pte));
 
 		/*
@@ -3037,6 +3052,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		if (absent && (flags & FOLL_DUMP) &&
 		    !hugetlbfs_pagecache_present(h, vma, vaddr)) {
+			if (pte)
+				spin_unlock(ptl);
 			remainder = 0;
 			break;
 		}
@@ -3056,10 +3073,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		      !huge_pte_write(huge_ptep_get(pte)))) {
 			int ret;
 
-			spin_unlock(&mm->page_table_lock);
+			if (pte)
+				spin_unlock(ptl);
 			ret = hugetlb_fault(mm, vma, vaddr,
 				(flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
-			spin_lock(&mm->page_table_lock);
 			if (!(ret & VM_FAULT_ERROR))
 				continue;
 
@@ -3090,8 +3107,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 */
 			goto same_page;
 		}
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	*nr_pages = remainder;
 	*position = vaddr;
 
@@ -3112,13 +3129,14 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	flush_cache_range(vma, address, end);
 
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
-	spin_lock(&mm->page_table_lock);
 	for (; address < end; address += huge_page_size(h)) {
-		ptep = huge_pte_offset(mm, address);
+		spinlock_t *ptl;
+		ptep = huge_pte_offset_lock(mm, address, &ptl);
 		if (!ptep)
 			continue;
 		if (huge_pmd_unshare(mm, &address, ptep)) {
 			pages++;
+			spin_unlock(ptl);
 			continue;
 		}
 		if (!huge_pte_none(huge_ptep_get(ptep))) {
@@ -3128,8 +3146,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 			set_huge_pte_at(mm, address, ptep, pte);
 			pages++;
 		}
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	/*
 	 * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
 	 * may have cleared our pud entry and done put_page on the page table:
@@ -3292,6 +3310,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	unsigned long saddr;
 	pte_t *spte = NULL;
 	pte_t *pte;
+	spinlock_t *ptl;
 
 	if (!vma_shareable(vma, addr))
 		return (pte_t *)pmd_alloc(mm, pud, addr);
@@ -3314,13 +3333,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	if (!spte)
 		goto out;
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, spte);
+	spin_lock(ptl);
 	if (pud_none(*pud))
 		pud_populate(mm, pud,
 				(pmd_t *)((unsigned long)spte & PAGE_MASK));
 	else
 		put_page(virt_to_page(spte));
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 out:
 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
 	mutex_unlock(&mapping->i_mmap_mutex);
@@ -3334,7 +3354,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
  * indicated by page_count > 1, unmap is achieved by clearing pud and
  * decrementing the ref count. If count == 1, the pte page is not shared.
  *
- * called with vma->vm_mm->page_table_lock held.
+ * called with page table lock held.
  *
  * returns: 1 successfully unmapped a shared pte page
  *	    0 the underlying pte page is not shared, or it is the last user
diff --git v3.11-rc3.orig/mm/mempolicy.c v3.11-rc3/mm/mempolicy.c
index 838d022..a9de55f 100644
--- v3.11-rc3.orig/mm/mempolicy.c
+++ v3.11-rc3/mm/mempolicy.c
@@ -522,8 +522,9 @@ static void queue_pages_hugetlb_pmd_range(struct vm_area_struct *vma,
 #ifdef CONFIG_HUGETLB_PAGE
 	int nid;
 	struct page *page;
+	spinlock_t *ptl = huge_pte_lockptr(vma->vm_mm, (pte_t *)pmd);
 
-	spin_lock(&vma->vm_mm->page_table_lock);
+	spin_lock(ptl);
 	page = pte_page(huge_ptep_get((pte_t *)pmd));
 	nid = page_to_nid(page);
 	if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
@@ -533,7 +534,7 @@ static void queue_pages_hugetlb_pmd_range(struct vm_area_struct *vma,
 	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1))
 		isolate_huge_page(page, private);
 unlock:
-	spin_unlock(&vma->vm_mm->page_table_lock);
+	spin_unlock(ptl);
 #else
 	BUG();
 #endif
diff --git v3.11-rc3.orig/mm/migrate.c v3.11-rc3/mm/migrate.c
index 61f14a1..c69a9c7 100644
--- v3.11-rc3.orig/mm/migrate.c
+++ v3.11-rc3/mm/migrate.c
@@ -130,7 +130,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 		ptep = huge_pte_offset(mm, addr);
 		if (!ptep)
 			goto out;
-		ptl = &mm->page_table_lock;
+		ptl = huge_pte_lockptr(mm, ptep);
 	} else {
 		pmd = mm_find_pmd(mm, addr);
 		if (!pmd)
@@ -249,7 +249,7 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 
 void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte)
 {
-	spinlock_t *ptl = &(mm)->page_table_lock;
+	spinlock_t *ptl = huge_pte_lockptr(mm, pte);
 	__migration_entry_wait(mm, pte, ptl);
 }
 
diff --git v3.11-rc3.orig/mm/rmap.c v3.11-rc3/mm/rmap.c
index c56ba98..eccec58 100644
--- v3.11-rc3.orig/mm/rmap.c
+++ v3.11-rc3/mm/rmap.c
@@ -601,7 +601,7 @@ pte_t *__page_check_address(struct page *page, struct mm_struct *mm,
 
 	if (unlikely(PageHuge(page))) {
 		pte = huge_pte_offset(mm, address);
-		ptl = &mm->page_table_lock;
+		ptl = huge_pte_lockptr(mm, pte);
 		goto check;
 	}
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 1/2] hugetlbfs: support split page table lock
@ 2013-09-05 21:27   ` Naoya Horiguchi
  0 siblings, 0 replies; 42+ messages in thread
From: Naoya Horiguchi @ 2013-09-05 21:27 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, Andrea Arcangeli, kirill.shutemov,
	Aneesh Kumar K.V, Alex Thorlton, linux-kernel

Currently all page table handling by the hugetlbfs code is done under
mm->page_table_lock. So when a process has many threads that heavily
access the memory, lock contention happens and impacts the performance.

This patch makes hugepages support split page table locks so that we use
the page->ptl of the leaf node of the page table tree, which is a pte for
normal pages but can be a pmd and/or pud for hugepages on some
architectures.

ChangeLog v3:
 - disable split ptl for ppc with USE_SPLIT_PTLOCKS_HUGETLB.
 - remove replacement in some architecture-dependent code. This is justified
   because an allocation of a pgd/pud/pmd/pte entry can only race with another
   allocation, not with read/write access, so we can use different locks.
   http://thread.gmane.org/gmane.linux.kernel.mm/106292/focus=106458

ChangeLog v2:
 - add split ptl on other archs missed in v1
 - drop changes on arch/{powerpc,tile}/mm/hugetlbpage.c

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/hugetlb.h  | 20 +++++++++++
 include/linux/mm_types.h |  2 ++
 mm/hugetlb.c             | 92 +++++++++++++++++++++++++++++-------------------
 mm/mempolicy.c           |  5 +--
 mm/migrate.c             |  4 +--
 mm/rmap.c                |  2 +-
 6 files changed, 84 insertions(+), 41 deletions(-)

diff --git v3.11-rc3.orig/include/linux/hugetlb.h v3.11-rc3/include/linux/hugetlb.h
index 0393270..5cb8a4e 100644
--- v3.11-rc3.orig/include/linux/hugetlb.h
+++ v3.11-rc3/include/linux/hugetlb.h
@@ -80,6 +80,24 @@ extern const unsigned long hugetlb_zero, hugetlb_infinity;
 extern int sysctl_hugetlb_shm_group;
 extern struct list_head huge_boot_pages;
 
+#if USE_SPLIT_PTLOCKS_HUGETLB
+#define huge_pte_lockptr(mm, ptep) ({__pte_lockptr(virt_to_page(ptep)); })
+#else	/* !USE_SPLIT_PTLOCKS_HUGETLB */
+#define huge_pte_lockptr(mm, ptep) ({&(mm)->page_table_lock; })
+#endif	/* USE_SPLIT_PTLOCKS_HUGETLB */
+
+#define huge_pte_offset_lock(mm, address, ptlp)		\
+({							\
+	pte_t *__pte = huge_pte_offset(mm, address);	\
+	spinlock_t *__ptl = NULL;			\
+	if (__pte) {					\
+		__ptl = huge_pte_lockptr(mm, __pte);	\
+		*(ptlp) = __ptl;			\
+		spin_lock(__ptl);			\
+	}						\
+	__pte;						\
+})
+
 /* arch callbacks */
 
 pte_t *huge_pte_alloc(struct mm_struct *mm,
@@ -164,6 +182,8 @@ static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
 	BUG();
 }
 
+#define huge_pte_lockptr(mm, ptep) 0
+
 #endif /* !CONFIG_HUGETLB_PAGE */
 
 #define HUGETLB_ANON_FILE "anon_hugepage"
diff --git v3.11-rc3.orig/include/linux/mm_types.h v3.11-rc3/include/linux/mm_types.h
index fb425aa..cfb8c6f 100644
--- v3.11-rc3.orig/include/linux/mm_types.h
+++ v3.11-rc3/include/linux/mm_types.h
@@ -24,6 +24,8 @@
 struct address_space;
 
 #define USE_SPLIT_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
+#define USE_SPLIT_PTLOCKS_HUGETLB	\
+	(USE_SPLIT_PTLOCKS && !defined(CONFIG_PPC))
 
 /*
  * Each physical page in the system has a struct page associated with
diff --git v3.11-rc3.orig/mm/hugetlb.c v3.11-rc3/mm/hugetlb.c
index 0f8da6b..b2dbca4 100644
--- v3.11-rc3.orig/mm/hugetlb.c
+++ v3.11-rc3/mm/hugetlb.c
@@ -2372,6 +2372,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
+		spinlock_t *src_ptl, *dst_ptl;
 		src_pte = huge_pte_offset(src, addr);
 		if (!src_pte)
 			continue;
@@ -2383,8 +2384,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		if (dst_pte == src_pte)
 			continue;
 
-		spin_lock(&dst->page_table_lock);
-		spin_lock_nested(&src->page_table_lock, SINGLE_DEPTH_NESTING);
+		dst_ptl = huge_pte_lockptr(dst, dst_pte);
+		src_ptl = huge_pte_lockptr(src, src_pte);
+		spin_lock(dst_ptl);
+		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 		if (!huge_pte_none(huge_ptep_get(src_pte))) {
 			if (cow)
 				huge_ptep_set_wrprotect(src, addr, src_pte);
@@ -2394,8 +2397,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			page_dup_rmap(ptepage);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 		}
-		spin_unlock(&src->page_table_lock);
-		spin_unlock(&dst->page_table_lock);
+		spin_unlock(src_ptl);
+		spin_unlock(dst_ptl);
 	}
 	return 0;
 
@@ -2438,6 +2441,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	unsigned long address;
 	pte_t *ptep;
 	pte_t pte;
+	spinlock_t *ptl;
 	struct page *page;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
@@ -2451,25 +2455,24 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	tlb_start_vma(tlb, vma);
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 again:
-	spin_lock(&mm->page_table_lock);
 	for (address = start; address < end; address += sz) {
-		ptep = huge_pte_offset(mm, address);
+		ptep = huge_pte_offset_lock(mm, address, &ptl);
 		if (!ptep)
 			continue;
 
 		if (huge_pmd_unshare(mm, &address, ptep))
-			continue;
+			goto unlock;
 
 		pte = huge_ptep_get(ptep);
 		if (huge_pte_none(pte))
-			continue;
+			goto unlock;
 
 		/*
 		 * HWPoisoned hugepage is already unmapped and dropped reference
 		 */
 		if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
 			huge_pte_clear(mm, address, ptep);
-			continue;
+			goto unlock;
 		}
 
 		page = pte_page(pte);
@@ -2480,7 +2483,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 */
 		if (ref_page) {
 			if (page != ref_page)
-				continue;
+				goto unlock;
 
 			/*
 			 * Mark the VMA as having unmapped its page so that
@@ -2497,13 +2500,18 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 		page_remove_rmap(page);
 		force_flush = !__tlb_remove_page(tlb, page);
-		if (force_flush)
+		if (force_flush) {
+			spin_unlock(ptl);
 			break;
+		}
 		/* Bail out after unmapping reference page if supplied */
-		if (ref_page)
+		if (ref_page) {
+			spin_unlock(ptl);
 			break;
+		}
+unlock:
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	/*
 	 * mmu_gather ran out of room to batch pages, we break out of
 	 * the PTE lock to avoid doing the potential expensive TLB invalidate
@@ -2617,6 +2625,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	int outside_reserve = 0;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
+	spinlock_t *ptl = huge_pte_lockptr(mm, ptep);
 
 	old_page = pte_page(pte);
 
@@ -2648,7 +2657,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	page_cache_get(old_page);
 
 	/* Drop page_table_lock as buddy allocator may be called */
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	new_page = alloc_huge_page(vma, address, outside_reserve);
 
 	if (IS_ERR(new_page)) {
@@ -2666,7 +2675,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 			BUG_ON(huge_pte_none(pte));
 			if (unmap_ref_private(mm, vma, old_page, address)) {
 				BUG_ON(huge_pte_none(pte));
-				spin_lock(&mm->page_table_lock);
+				spin_lock(ptl);
 				ptep = huge_pte_offset(mm, address & huge_page_mask(h));
 				if (likely(pte_same(huge_ptep_get(ptep), pte)))
 					goto retry_avoidcopy;
@@ -2680,7 +2689,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 
 		/* Caller expects lock to be held */
-		spin_lock(&mm->page_table_lock);
+		spin_lock(ptl);
 		if (err == -ENOMEM)
 			return VM_FAULT_OOM;
 		else
@@ -2695,7 +2704,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_cache_release(new_page);
 		page_cache_release(old_page);
 		/* Caller expects lock to be held */
-		spin_lock(&mm->page_table_lock);
+		spin_lock(ptl);
 		return VM_FAULT_OOM;
 	}
 
@@ -2710,7 +2719,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * Retake the page_table_lock to check for racing updates
 	 * before the page tables are altered
 	 */
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	ptep = huge_pte_offset(mm, address & huge_page_mask(h));
 	if (likely(pte_same(huge_ptep_get(ptep), pte))) {
 		/* Break COW */
@@ -2722,10 +2731,10 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* Make the old page be freed below */
 		new_page = old_page;
 	}
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	/* Caller expects lock to be held */
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
 	return 0;
@@ -2775,6 +2784,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page;
 	struct address_space *mapping;
 	pte_t new_pte;
+	spinlock_t *ptl;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -2860,7 +2870,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			goto backout_unlocked;
 		}
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, ptep);
+	spin_lock(ptl);
 	size = i_size_read(mapping->host) >> huge_page_shift(h);
 	if (idx >= size)
 		goto backout;
@@ -2882,13 +2893,13 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page);
 	}
 
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	unlock_page(page);
 out:
 	return ret;
 
 backout:
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 backout_unlocked:
 	unlock_page(page);
 	put_page(page);
@@ -2900,6 +2911,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	pte_t *ptep;
 	pte_t entry;
+	spinlock_t *ptl;
 	int ret;
 	struct page *page = NULL;
 	struct page *pagecache_page = NULL;
@@ -2968,7 +2980,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (page != pagecache_page)
 		lock_page(page);
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, ptep);
+	spin_lock(ptl);
 	/* Check for a racing update before calling hugetlb_cow */
 	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
 		goto out_page_table_lock;
@@ -2988,7 +3001,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		update_mmu_cache(vma, address, ptep);
 
 out_page_table_lock:
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 
 	if (pagecache_page) {
 		unlock_page(pagecache_page);
@@ -3014,9 +3027,9 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long remainder = *nr_pages;
 	struct hstate *h = hstate_vma(vma);
 
-	spin_lock(&mm->page_table_lock);
 	while (vaddr < vma->vm_end && remainder) {
 		pte_t *pte;
+		spinlock_t *ptl = NULL;
 		int absent;
 		struct page *page;
 
@@ -3024,8 +3037,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * Some archs (sparc64, sh*) have multiple pte_ts to
 		 * each hugepage.  We have to make sure we get the
 		 * first, for the page indexing below to work.
+		 *
+		 * Note that page table lock is not held when pte is null.
 		 */
-		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
+		pte = huge_pte_offset_lock(mm, vaddr & huge_page_mask(h), &ptl);
 		absent = !pte || huge_pte_none(huge_ptep_get(pte));
 
 		/*
@@ -3037,6 +3052,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		if (absent && (flags & FOLL_DUMP) &&
 		    !hugetlbfs_pagecache_present(h, vma, vaddr)) {
+			if (pte)
+				spin_unlock(ptl);
 			remainder = 0;
 			break;
 		}
@@ -3056,10 +3073,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		      !huge_pte_write(huge_ptep_get(pte)))) {
 			int ret;
 
-			spin_unlock(&mm->page_table_lock);
+			if (pte)
+				spin_unlock(ptl);
 			ret = hugetlb_fault(mm, vma, vaddr,
 				(flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
-			spin_lock(&mm->page_table_lock);
 			if (!(ret & VM_FAULT_ERROR))
 				continue;
 
@@ -3090,8 +3107,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 */
 			goto same_page;
 		}
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	*nr_pages = remainder;
 	*position = vaddr;
 
@@ -3112,13 +3129,14 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	flush_cache_range(vma, address, end);
 
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
-	spin_lock(&mm->page_table_lock);
 	for (; address < end; address += huge_page_size(h)) {
-		ptep = huge_pte_offset(mm, address);
+		spinlock_t *ptl;
+		ptep = huge_pte_offset_lock(mm, address, &ptl);
 		if (!ptep)
 			continue;
 		if (huge_pmd_unshare(mm, &address, ptep)) {
 			pages++;
+			spin_unlock(ptl);
 			continue;
 		}
 		if (!huge_pte_none(huge_ptep_get(ptep))) {
@@ -3128,8 +3146,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 			set_huge_pte_at(mm, address, ptep, pte);
 			pages++;
 		}
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	/*
 	 * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
 	 * may have cleared our pud entry and done put_page on the page table:
@@ -3292,6 +3310,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	unsigned long saddr;
 	pte_t *spte = NULL;
 	pte_t *pte;
+	spinlock_t *ptl;
 
 	if (!vma_shareable(vma, addr))
 		return (pte_t *)pmd_alloc(mm, pud, addr);
@@ -3314,13 +3333,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	if (!spte)
 		goto out;
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, spte);
+	spin_lock(ptl);
 	if (pud_none(*pud))
 		pud_populate(mm, pud,
 				(pmd_t *)((unsigned long)spte & PAGE_MASK));
 	else
 		put_page(virt_to_page(spte));
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 out:
 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
 	mutex_unlock(&mapping->i_mmap_mutex);
@@ -3334,7 +3354,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
  * indicated by page_count > 1, unmap is achieved by clearing pud and
  * decrementing the ref count. If count == 1, the pte page is not shared.
  *
- * called with vma->vm_mm->page_table_lock held.
+ * called with page table lock held.
  *
  * returns: 1 successfully unmapped a shared pte page
  *	    0 the underlying pte page is not shared, or it is the last user
diff --git v3.11-rc3.orig/mm/mempolicy.c v3.11-rc3/mm/mempolicy.c
index 838d022..a9de55f 100644
--- v3.11-rc3.orig/mm/mempolicy.c
+++ v3.11-rc3/mm/mempolicy.c
@@ -522,8 +522,9 @@ static void queue_pages_hugetlb_pmd_range(struct vm_area_struct *vma,
 #ifdef CONFIG_HUGETLB_PAGE
 	int nid;
 	struct page *page;
+	spinlock_t *ptl = huge_pte_lockptr(vma->vm_mm, (pte_t *)pmd);
 
-	spin_lock(&vma->vm_mm->page_table_lock);
+	spin_lock(ptl);
 	page = pte_page(huge_ptep_get((pte_t *)pmd));
 	nid = page_to_nid(page);
 	if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
@@ -533,7 +534,7 @@ static void queue_pages_hugetlb_pmd_range(struct vm_area_struct *vma,
 	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1))
 		isolate_huge_page(page, private);
 unlock:
-	spin_unlock(&vma->vm_mm->page_table_lock);
+	spin_unlock(ptl);
 #else
 	BUG();
 #endif
diff --git v3.11-rc3.orig/mm/migrate.c v3.11-rc3/mm/migrate.c
index 61f14a1..c69a9c7 100644
--- v3.11-rc3.orig/mm/migrate.c
+++ v3.11-rc3/mm/migrate.c
@@ -130,7 +130,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 		ptep = huge_pte_offset(mm, addr);
 		if (!ptep)
 			goto out;
-		ptl = &mm->page_table_lock;
+		ptl = huge_pte_lockptr(mm, ptep);
 	} else {
 		pmd = mm_find_pmd(mm, addr);
 		if (!pmd)
@@ -249,7 +249,7 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 
 void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte)
 {
-	spinlock_t *ptl = &(mm)->page_table_lock;
+	spinlock_t *ptl = huge_pte_lockptr(mm, pte);
 	__migration_entry_wait(mm, pte, ptl);
 }
 
diff --git v3.11-rc3.orig/mm/rmap.c v3.11-rc3/mm/rmap.c
index c56ba98..eccec58 100644
--- v3.11-rc3.orig/mm/rmap.c
+++ v3.11-rc3/mm/rmap.c
@@ -601,7 +601,7 @@ pte_t *__page_check_address(struct page *page, struct mm_struct *mm,
 
 	if (unlikely(PageHuge(page))) {
 		pte = huge_pte_offset(mm, address);
-		ptl = &mm->page_table_lock;
+		ptl = huge_pte_lockptr(mm, pte);
 		goto check;
 	}
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/2] hugetlbfs: support split page table lock
  2013-09-05  9:18         ` Aneesh Kumar K.V
@ 2013-09-05 15:23           ` Naoya Horiguchi
  -1 siblings, 0 replies; 42+ messages in thread
From: Naoya Horiguchi @ 2013-09-05 15:23 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, Andrea Arcangeli, kirill.shutemov,
	Alex Thorlton, linux-kernel

On Thu, Sep 05, 2013 at 02:48:18PM +0530, Aneesh Kumar K.V wrote:
> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:
> 
> > Hi Aneesh,
> >
> > On Wed, Sep 04, 2013 at 12:43:19PM +0530, Aneesh Kumar K.V wrote:
> >> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:
> >> 
> >> > Currently all of page table handling by hugetlbfs code are done under
> >> > mm->page_table_lock. So when a process have many threads and they heavily
> >> > access to the memory, lock contention happens and impacts the performance.
> >> >
> >> > This patch makes hugepage support split page table lock so that we use
> >> > page->ptl of the leaf node of page table tree which is pte for normal pages
> >> > but can be pmd and/or pud for hugepages of some architectures.
> >> >
> >> > ChangeLog v2:
> >> >  - add split ptl on other archs missed in v1
> >> >
> >> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> >> > ---
> >> >  arch/powerpc/mm/hugetlbpage.c |  6 ++-
> >> >  arch/tile/mm/hugetlbpage.c    |  6 ++-
> >> >  include/linux/hugetlb.h       | 20 ++++++++++
> >> >  mm/hugetlb.c                  | 92 ++++++++++++++++++++++++++-----------------
> >> >  mm/mempolicy.c                |  5 ++-
> >> >  mm/migrate.c                  |  4 +-
> >> >  mm/rmap.c                     |  2 +-
> >> >  7 files changed, 90 insertions(+), 45 deletions(-)
> >> >
> >> > diff --git v3.11-rc3.orig/arch/powerpc/mm/hugetlbpage.c v3.11-rc3/arch/powerpc/mm/hugetlbpage.c
> >> > index d67db4b..7e56cb7 100644
> >> > --- v3.11-rc3.orig/arch/powerpc/mm/hugetlbpage.c
> >> > +++ v3.11-rc3/arch/powerpc/mm/hugetlbpage.c
> >> > @@ -124,6 +124,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
> >> >  {
> >> >  	struct kmem_cache *cachep;
> >> >  	pte_t *new;
> >> > +	spinlock_t *ptl;
> >> >
> >> >  #ifdef CONFIG_PPC_FSL_BOOK3E
> >> >  	int i;
> >> > @@ -141,7 +142,8 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
> >> >  	if (! new)
> >> >  		return -ENOMEM;
> >> >
> >> > -	spin_lock(&mm->page_table_lock);
> >> > +	ptl = huge_pte_lockptr(mm, new);
> >> > +	spin_lock(ptl);
> >> 
> >> 
> >> Are you sure we can do that for ppc ?
> >> 	new = kmem_cache_zalloc(cachep, GFP_KERNEL|__GFP_REPEAT);
> >
> > Ah, thanks. new is not a pointer to one full page occupied by page
> > table entries, so trying to use struct page of it is totally wrong.
> >
> >> The page for new(pte_t) could be shared right ? which mean a deadlock ?
> >
> > Yes, that's disastrous.
> >
> >> May be you should do it at the pmd level itself for ppc
> 
> The pgd page also cannot be used because pgd also comes from kmem
> cache.
> 
> >
> > Yes, that's possible, but I simply drop the changes in __hugepte_alloc()
> > for now because this lock seems to protect us from the race between concurrent
> > calls of __hugepte_alloc(), not between allocation and read/write access.
> > Split ptl is used to avoid race between read/write accesses, so I think
> > that using different types of locks here is not dangerous.
> > # I guess that that's why we now use mm->page_table_lock for __pte_alloc()
> > # and its family even if USE_SPLIT_PTLOCKS is true.
> 
> A simpler approach could be to make huge_pte_lockptr arch
> specific and leave it as mm->page_table_lock for ppc 

OK, I'll do this.
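
Just to spell out the idea, something like the following (a sketch only;
the exact plumbing of the arch override still needs to be decided):

	/* generic (include/linux/hugetlb.h): keep what this patch adds, but
	 * let an architecture provide its own definition first
	 */
	#ifndef huge_pte_lockptr
	#if USE_SPLIT_PTLOCKS
	#define huge_pte_lockptr(mm, ptep) ({__pte_lockptr(virt_to_page(ptep)); })
	#else
	#define huge_pte_lockptr(mm, ptep) ({&(mm)->page_table_lock; })
	#endif
	#endif /* huge_pte_lockptr */

	/* powerpc (hypothetical placement, e.g. asm/hugetlb.h): hugepage ptes
	 * live in kmem cache objects rather than dedicated page table pages,
	 * so keep using the per-mm lock there
	 */
	#define huge_pte_lockptr(mm, ptep) ({&(mm)->page_table_lock; })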

Thanks,
Naoya

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/2] hugetlbfs: support split page table lock
  2013-09-04 16:32       ` Naoya Horiguchi
@ 2013-09-05  9:18         ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 42+ messages in thread
From: Aneesh Kumar K.V @ 2013-09-05  9:18 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, Andrea Arcangeli, kirill.shutemov,
	Alex Thorlton, linux-kernel

Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:

> Hi Aneesh,
>
> On Wed, Sep 04, 2013 at 12:43:19PM +0530, Aneesh Kumar K.V wrote:
>> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:
>> 
>> > Currently all of page table handling by hugetlbfs code are done under
>> > mm->page_table_lock. So when a process have many threads and they heavily
>> > access to the memory, lock contention happens and impacts the performance.
>> >
>> > This patch makes hugepage support split page table lock so that we use
>> > page->ptl of the leaf node of page table tree which is pte for normal pages
>> > but can be pmd and/or pud for hugepages of some architectures.
>> >
>> > ChangeLog v2:
>> >  - add split ptl on other archs missed in v1
>> >
>> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>> > ---
>> >  arch/powerpc/mm/hugetlbpage.c |  6 ++-
>> >  arch/tile/mm/hugetlbpage.c    |  6 ++-
>> >  include/linux/hugetlb.h       | 20 ++++++++++
>> >  mm/hugetlb.c                  | 92 ++++++++++++++++++++++++++-----------------
>> >  mm/mempolicy.c                |  5 ++-
>> >  mm/migrate.c                  |  4 +-
>> >  mm/rmap.c                     |  2 +-
>> >  7 files changed, 90 insertions(+), 45 deletions(-)
>> >
>> > diff --git v3.11-rc3.orig/arch/powerpc/mm/hugetlbpage.c v3.11-rc3/arch/powerpc/mm/hugetlbpage.c
>> > index d67db4b..7e56cb7 100644
>> > --- v3.11-rc3.orig/arch/powerpc/mm/hugetlbpage.c
>> > +++ v3.11-rc3/arch/powerpc/mm/hugetlbpage.c
>> > @@ -124,6 +124,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
>> >  {
>> >  	struct kmem_cache *cachep;
>> >  	pte_t *new;
>> > +	spinlock_t *ptl;
>> >
>> >  #ifdef CONFIG_PPC_FSL_BOOK3E
>> >  	int i;
>> > @@ -141,7 +142,8 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
>> >  	if (! new)
>> >  		return -ENOMEM;
>> >
>> > -	spin_lock(&mm->page_table_lock);
>> > +	ptl = huge_pte_lockptr(mm, new);
>> > +	spin_lock(ptl);
>> 
>> 
>> Are you sure we can do that for ppc ?
>> 	new = kmem_cache_zalloc(cachep, GFP_KERNEL|__GFP_REPEAT);
>
> Ah, thanks. new is not a pointer to one full page occupied by page
> table entries, so trying to use struct page of it is totally wrong.
>
>> The page for new(pte_t) could be shared right ? which mean a deadlock ?
>
> Yes, that's disastrous.
>
>> May be you should do it at the pmd level itself for ppc

The pgd page cannot be used either, because the pgd also comes from a
kmem cache.

>
> Yes, that's possible, but I simply drop the changes in __hugepte_alloc()
> for now because this lock seems to protect us from the race between concurrent
> calls of __hugepte_alloc(), not between allocation and read/write access.
> Split ptl is used to avoid race between read/write accesses, so I think
> that using different types of locks here is not dangerous.
> # I guess that that's why we now use mm->page_table_lock for __pte_alloc()
> # and its family even if USE_SPLIT_PTLOCKS is true.

A simpler approach could be to make huge_pte_lockptr() arch-specific and
leave it as mm->page_table_lock for ppc.


-aneesh


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/2] hugetlbfs: support split page table lock
  2013-09-04  7:13     ` Aneesh Kumar K.V
@ 2013-09-04 16:32       ` Naoya Horiguchi
  -1 siblings, 0 replies; 42+ messages in thread
From: Naoya Horiguchi @ 2013-09-04 16:32 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, Andrea Arcangeli, kirill.shutemov,
	Alex Thorlton, linux-kernel

Hi Aneesh,

On Wed, Sep 04, 2013 at 12:43:19PM +0530, Aneesh Kumar K.V wrote:
> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:
> 
> > Currently all of page table handling by hugetlbfs code are done under
> > mm->page_table_lock. So when a process have many threads and they heavily
> > access to the memory, lock contention happens and impacts the performance.
> >
> > This patch makes hugepage support split page table lock so that we use
> > page->ptl of the leaf node of page table tree which is pte for normal pages
> > but can be pmd and/or pud for hugepages of some architectures.
> >
> > ChangeLog v2:
> >  - add split ptl on other archs missed in v1
> >
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > ---
> >  arch/powerpc/mm/hugetlbpage.c |  6 ++-
> >  arch/tile/mm/hugetlbpage.c    |  6 ++-
> >  include/linux/hugetlb.h       | 20 ++++++++++
> >  mm/hugetlb.c                  | 92 ++++++++++++++++++++++++++-----------------
> >  mm/mempolicy.c                |  5 ++-
> >  mm/migrate.c                  |  4 +-
> >  mm/rmap.c                     |  2 +-
> >  7 files changed, 90 insertions(+), 45 deletions(-)
> >
> > diff --git v3.11-rc3.orig/arch/powerpc/mm/hugetlbpage.c v3.11-rc3/arch/powerpc/mm/hugetlbpage.c
> > index d67db4b..7e56cb7 100644
> > --- v3.11-rc3.orig/arch/powerpc/mm/hugetlbpage.c
> > +++ v3.11-rc3/arch/powerpc/mm/hugetlbpage.c
> > @@ -124,6 +124,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
> >  {
> >  	struct kmem_cache *cachep;
> >  	pte_t *new;
> > +	spinlock_t *ptl;
> >
> >  #ifdef CONFIG_PPC_FSL_BOOK3E
> >  	int i;
> > @@ -141,7 +142,8 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
> >  	if (! new)
> >  		return -ENOMEM;
> >
> > -	spin_lock(&mm->page_table_lock);
> > +	ptl = huge_pte_lockptr(mm, new);
> > +	spin_lock(ptl);
> 
> 
> Are you sure we can do that for ppc ?
> 	new = kmem_cache_zalloc(cachep, GFP_KERNEL|__GFP_REPEAT);

Ah, thanks. new does not point into a page used exclusively for page
table entries, so deriving the lock from its struct page is totally wrong.

> The page for new(pte_t) could be shared right ? which mean a deadlock ?

Yes, that's disastrous.
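
For example (the layout is hypothetical, but this is the failure mode):

	/*
	 * slab page P
	 * +------------+------------+------
	 * | fragment A | fragment B | ...    several hugepd fragments handed
	 * +------------+------------+------  out by kmem_cache_zalloc()
	 *
	 * huge_pte_lockptr() maps both A and B to __pte_lockptr(P), i.e. to
	 * the very same spinlock, even though they belong to unrelated parts
	 * of the page tables, and P's page->ptl was never set up as a lock in
	 * the first place because the page is owned by the slab allocator.
	 * So e.g. the dst_ptl/src_ptl nesting in copy_hugetlb_page_range()
	 * could end up taking one lock twice even when dst_pte != src_pte.
	 */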

> May be you should do it at the pmd level itself for ppc

Yes, that's possible, but I'll simply drop the changes to __hugepte_alloc()
for now, because this lock seems to protect us from races between concurrent
calls of __hugepte_alloc(), not between allocation and read/write access.
The split ptl is there to avoid races between read/write accesses, so I
think that using different kinds of locks for the two cases is not dangerous.
# I guess that's why we still use mm->page_table_lock for __pte_alloc()
# and its family even when USE_SPLIT_PTLOCKS is true.
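
What that lock really protects is the "has another populated it?" check;
for example the tile code touched by this patch currently does

	spin_lock(&mm->page_table_lock);
	if (likely(pmd_none(*pmd))) {  /* Has another populated it ? */
		mm->nr_ptes++;
		pmd_populate_kernel(mm, pmd, new);
		new = NULL;
	} else
		VM_BUG_ON(pmd_trans_splitting(*pmd));
	spin_unlock(&mm->page_table_lock);

and the ppc hugepd code does the equivalent dance. Readers and writers of
the huge ptes themselves take huge_pte_lockptr(), so the two locks guard
different races.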

A bit off-topic, but I found that we have a bogus comment on
hugetlb_free_pgd_range() in arch/powerpc/mm/hugetlbpage.c saying
"Must be called with pagetable lock held."
This does not seem to be true, because the caller free_pgtables() and its
callers (unmap_region() and exit_mmap()) never hold it.
I guess it was just copied from free_pgd_range(), where it is equally
false. I'll post a patch to remove it later.

Anyway, thank you for valuable comments!

Thanks,
Naoya Horiguchi

> >  #ifdef CONFIG_PPC_FSL_BOOK3E
> >  	/*
> >  	 * We have multiple higher-level entries that point to the same
> > @@ -174,7 +176,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
> >  #endif
> >  	}
> >  #endif
> > -	spin_unlock(&mm->page_table_lock);
> > +	spin_unlock(ptl);
> >  	return 0;
> >  }
> >
> 
> 
> -aneesh
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/2] hugetlbfs: support split page table lock
  2013-08-30 17:18   ` Naoya Horiguchi
@ 2013-09-04  7:13     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 42+ messages in thread
From: Aneesh Kumar K.V @ 2013-09-04  7:13 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, Andrea Arcangeli, kirill.shutemov,
	Alex Thorlton, linux-kernel

Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:

> Currently all of page table handling by hugetlbfs code are done under
> mm->page_table_lock. So when a process have many threads and they heavily
> access to the memory, lock contention happens and impacts the performance.
>
> This patch makes hugepage support split page table lock so that we use
> page->ptl of the leaf node of page table tree which is pte for normal pages
> but can be pmd and/or pud for hugepages of some architectures.
>
> ChangeLog v2:
>  - add split ptl on other archs missed in v1
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
>  arch/powerpc/mm/hugetlbpage.c |  6 ++-
>  arch/tile/mm/hugetlbpage.c    |  6 ++-
>  include/linux/hugetlb.h       | 20 ++++++++++
>  mm/hugetlb.c                  | 92 ++++++++++++++++++++++++++-----------------
>  mm/mempolicy.c                |  5 ++-
>  mm/migrate.c                  |  4 +-
>  mm/rmap.c                     |  2 +-
>  7 files changed, 90 insertions(+), 45 deletions(-)
>
> diff --git v3.11-rc3.orig/arch/powerpc/mm/hugetlbpage.c v3.11-rc3/arch/powerpc/mm/hugetlbpage.c
> index d67db4b..7e56cb7 100644
> --- v3.11-rc3.orig/arch/powerpc/mm/hugetlbpage.c
> +++ v3.11-rc3/arch/powerpc/mm/hugetlbpage.c
> @@ -124,6 +124,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
>  {
>  	struct kmem_cache *cachep;
>  	pte_t *new;
> +	spinlock_t *ptl;
>
>  #ifdef CONFIG_PPC_FSL_BOOK3E
>  	int i;
> @@ -141,7 +142,8 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
>  	if (! new)
>  		return -ENOMEM;
>
> -	spin_lock(&mm->page_table_lock);
> +	ptl = huge_pte_lockptr(mm, new);
> +	spin_lock(ptl);


Are you sure we can do that for ppc?
	new = kmem_cache_zalloc(cachep, GFP_KERNEL|__GFP_REPEAT);

The page backing new (a pte_t) could be shared, right? Which means a
deadlock?

Maybe you should do it at the pmd level itself for ppc.

>  #ifdef CONFIG_PPC_FSL_BOOK3E
>  	/*
>  	 * We have multiple higher-level entries that point to the same
> @@ -174,7 +176,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
>  #endif
>  	}
>  #endif
> -	spin_unlock(&mm->page_table_lock);
> +	spin_unlock(ptl);
>  	return 0;
>  }
>


-aneesh


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 1/2] hugetlbfs: support split page table lock
  2013-08-30 17:18 [PATCH 0/2 v2] split page table lock for hugepage Naoya Horiguchi
@ 2013-08-30 17:18   ` Naoya Horiguchi
  0 siblings, 0 replies; 42+ messages in thread
From: Naoya Horiguchi @ 2013-08-30 17:18 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, Andrea Arcangeli, kirill.shutemov,
	Aneesh Kumar K.V, Alex Thorlton, linux-kernel

Currently all page table handling in the hugetlbfs code is done under
mm->page_table_lock. So when a process has many threads that heavily
access memory, contention on this single lock hurts performance.

This patch makes the hugepage code support the split page table lock, so
that we use the page->ptl of the leaf node of the page table tree: the
pte page for normal pages, but possibly a pmd and/or pud page for
hugepages on some architectures.
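
To illustrate the calling convention, here is a made-up example (the real
conversions are in the mm/hugetlb.c hunks below; before this patch the
same thing would be an unconditional spin_lock(&mm->page_table_lock)
around huge_pte_offset()):

	/* illustration only, not part of the patch */
	static bool huge_addr_populated(struct mm_struct *mm, unsigned long address)
	{
		spinlock_t *ptl;
		pte_t *ptep;
		bool ret;

		ptep = huge_pte_offset_lock(mm, address, &ptl);
		if (!ptep)
			return false;	/* no huge pte here, no lock was taken */
		ret = !huge_pte_none(huge_ptep_get(ptep));
		spin_unlock(ptl);
		return ret;
	}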

ChangeLog v2:
 - add split ptl on other archs missed in v1

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 arch/powerpc/mm/hugetlbpage.c |  6 ++-
 arch/tile/mm/hugetlbpage.c    |  6 ++-
 include/linux/hugetlb.h       | 20 ++++++++++
 mm/hugetlb.c                  | 92 ++++++++++++++++++++++++++-----------------
 mm/mempolicy.c                |  5 ++-
 mm/migrate.c                  |  4 +-
 mm/rmap.c                     |  2 +-
 7 files changed, 90 insertions(+), 45 deletions(-)

diff --git v3.11-rc3.orig/arch/powerpc/mm/hugetlbpage.c v3.11-rc3/arch/powerpc/mm/hugetlbpage.c
index d67db4b..7e56cb7 100644
--- v3.11-rc3.orig/arch/powerpc/mm/hugetlbpage.c
+++ v3.11-rc3/arch/powerpc/mm/hugetlbpage.c
@@ -124,6 +124,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 {
 	struct kmem_cache *cachep;
 	pte_t *new;
+	spinlock_t *ptl;
 
 #ifdef CONFIG_PPC_FSL_BOOK3E
 	int i;
@@ -141,7 +142,8 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 	if (! new)
 		return -ENOMEM;
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, new);
+	spin_lock(ptl);
 #ifdef CONFIG_PPC_FSL_BOOK3E
 	/*
 	 * We have multiple higher-level entries that point to the same
@@ -174,7 +176,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 #endif
 	}
 #endif
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	return 0;
 }
 
diff --git v3.11-rc3.orig/arch/tile/mm/hugetlbpage.c v3.11-rc3/arch/tile/mm/hugetlbpage.c
index 0ac3599..c10cef9 100644
--- v3.11-rc3.orig/arch/tile/mm/hugetlbpage.c
+++ v3.11-rc3/arch/tile/mm/hugetlbpage.c
@@ -59,6 +59,7 @@ static pte_t *pte_alloc_hugetlb(struct mm_struct *mm, pmd_t *pmd,
 				unsigned long address)
 {
 	pte_t *new;
+	spinlock_t *ptl;
 
 	if (pmd_none(*pmd)) {
 		new = pte_alloc_one_kernel(mm, address);
@@ -67,14 +68,15 @@ static pte_t *pte_alloc_hugetlb(struct mm_struct *mm, pmd_t *pmd,
 
 		smp_wmb(); /* See comment in __pte_alloc */
 
-		spin_lock(&mm->page_table_lock);
+		ptl = huge_pte_lockptr(mm, new);
+		spin_lock(ptl);
 		if (likely(pmd_none(*pmd))) {  /* Has another populated it ? */
 			mm->nr_ptes++;
 			pmd_populate_kernel(mm, pmd, new);
 			new = NULL;
 		} else
 			VM_BUG_ON(pmd_trans_splitting(*pmd));
-		spin_unlock(&mm->page_table_lock);
+		spin_unlock(ptl);
 		if (new)
 			pte_free_kernel(mm, new);
 	}
diff --git v3.11-rc3.orig/include/linux/hugetlb.h v3.11-rc3/include/linux/hugetlb.h
index 0393270..8de0127 100644
--- v3.11-rc3.orig/include/linux/hugetlb.h
+++ v3.11-rc3/include/linux/hugetlb.h
@@ -80,6 +80,24 @@ extern const unsigned long hugetlb_zero, hugetlb_infinity;
 extern int sysctl_hugetlb_shm_group;
 extern struct list_head huge_boot_pages;
 
+#if USE_SPLIT_PTLOCKS
+#define huge_pte_lockptr(mm, ptep) ({__pte_lockptr(virt_to_page(ptep)); })
+#else	/* !USE_SPLIT_PTLOCKS */
+#define huge_pte_lockptr(mm, ptep) ({&(mm)->page_table_lock; })
+#endif	/* USE_SPLIT_PTLOCKS */
+
+#define huge_pte_offset_lock(mm, address, ptlp)		\
+({							\
+	pte_t *__pte = huge_pte_offset(mm, address);	\
+	spinlock_t *__ptl = NULL;			\
+	if (__pte) {					\
+		__ptl = huge_pte_lockptr(mm, __pte);	\
+		*(ptlp) = __ptl;			\
+		spin_lock(__ptl);			\
+	}						\
+	__pte;						\
+})
+
 /* arch callbacks */
 
 pte_t *huge_pte_alloc(struct mm_struct *mm,
@@ -164,6 +182,8 @@ static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
 	BUG();
 }
 
+#define huge_pte_lockptr(mm, ptep) 0
+
 #endif /* !CONFIG_HUGETLB_PAGE */
 
 #define HUGETLB_ANON_FILE "anon_hugepage"
diff --git v3.11-rc3.orig/mm/hugetlb.c v3.11-rc3/mm/hugetlb.c
index 0f8da6b..b2dbca4 100644
--- v3.11-rc3.orig/mm/hugetlb.c
+++ v3.11-rc3/mm/hugetlb.c
@@ -2372,6 +2372,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
+		spinlock_t *src_ptl, *dst_ptl;
 		src_pte = huge_pte_offset(src, addr);
 		if (!src_pte)
 			continue;
@@ -2383,8 +2384,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		if (dst_pte == src_pte)
 			continue;
 
-		spin_lock(&dst->page_table_lock);
-		spin_lock_nested(&src->page_table_lock, SINGLE_DEPTH_NESTING);
+		dst_ptl = huge_pte_lockptr(dst, dst_pte);
+		src_ptl = huge_pte_lockptr(src, src_pte);
+		spin_lock(dst_ptl);
+		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 		if (!huge_pte_none(huge_ptep_get(src_pte))) {
 			if (cow)
 				huge_ptep_set_wrprotect(src, addr, src_pte);
@@ -2394,8 +2397,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			page_dup_rmap(ptepage);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 		}
-		spin_unlock(&src->page_table_lock);
-		spin_unlock(&dst->page_table_lock);
+		spin_unlock(src_ptl);
+		spin_unlock(dst_ptl);
 	}
 	return 0;
 
@@ -2438,6 +2441,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	unsigned long address;
 	pte_t *ptep;
 	pte_t pte;
+	spinlock_t *ptl;
 	struct page *page;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
@@ -2451,25 +2455,24 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	tlb_start_vma(tlb, vma);
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 again:
-	spin_lock(&mm->page_table_lock);
 	for (address = start; address < end; address += sz) {
-		ptep = huge_pte_offset(mm, address);
+		ptep = huge_pte_offset_lock(mm, address, &ptl);
 		if (!ptep)
 			continue;
 
 		if (huge_pmd_unshare(mm, &address, ptep))
-			continue;
+			goto unlock;
 
 		pte = huge_ptep_get(ptep);
 		if (huge_pte_none(pte))
-			continue;
+			goto unlock;
 
 		/*
 		 * HWPoisoned hugepage is already unmapped and dropped reference
 		 */
 		if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
 			huge_pte_clear(mm, address, ptep);
-			continue;
+			goto unlock;
 		}
 
 		page = pte_page(pte);
@@ -2480,7 +2483,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 */
 		if (ref_page) {
 			if (page != ref_page)
-				continue;
+				goto unlock;
 
 			/*
 			 * Mark the VMA as having unmapped its page so that
@@ -2497,13 +2500,18 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 		page_remove_rmap(page);
 		force_flush = !__tlb_remove_page(tlb, page);
-		if (force_flush)
+		if (force_flush) {
+			spin_unlock(ptl);
 			break;
+		}
 		/* Bail out after unmapping reference page if supplied */
-		if (ref_page)
+		if (ref_page) {
+			spin_unlock(ptl);
 			break;
+		}
+unlock:
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	/*
 	 * mmu_gather ran out of room to batch pages, we break out of
 	 * the PTE lock to avoid doing the potential expensive TLB invalidate
@@ -2617,6 +2625,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	int outside_reserve = 0;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
+	spinlock_t *ptl = huge_pte_lockptr(mm, ptep);
 
 	old_page = pte_page(pte);
 
@@ -2648,7 +2657,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	page_cache_get(old_page);
 
 	/* Drop page_table_lock as buddy allocator may be called */
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	new_page = alloc_huge_page(vma, address, outside_reserve);
 
 	if (IS_ERR(new_page)) {
@@ -2666,7 +2675,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 			BUG_ON(huge_pte_none(pte));
 			if (unmap_ref_private(mm, vma, old_page, address)) {
 				BUG_ON(huge_pte_none(pte));
-				spin_lock(&mm->page_table_lock);
+				spin_lock(ptl);
 				ptep = huge_pte_offset(mm, address & huge_page_mask(h));
 				if (likely(pte_same(huge_ptep_get(ptep), pte)))
 					goto retry_avoidcopy;
@@ -2680,7 +2689,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 
 		/* Caller expects lock to be held */
-		spin_lock(&mm->page_table_lock);
+		spin_lock(ptl);
 		if (err == -ENOMEM)
 			return VM_FAULT_OOM;
 		else
@@ -2695,7 +2704,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_cache_release(new_page);
 		page_cache_release(old_page);
 		/* Caller expects lock to be held */
-		spin_lock(&mm->page_table_lock);
+		spin_lock(ptl);
 		return VM_FAULT_OOM;
 	}
 
@@ -2710,7 +2719,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * Retake the page_table_lock to check for racing updates
 	 * before the page tables are altered
 	 */
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	ptep = huge_pte_offset(mm, address & huge_page_mask(h));
 	if (likely(pte_same(huge_ptep_get(ptep), pte))) {
 		/* Break COW */
@@ -2722,10 +2731,10 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* Make the old page be freed below */
 		new_page = old_page;
 	}
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	/* Caller expects lock to be held */
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
 	return 0;
@@ -2775,6 +2784,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page;
 	struct address_space *mapping;
 	pte_t new_pte;
+	spinlock_t *ptl;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -2860,7 +2870,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			goto backout_unlocked;
 		}
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, ptep);
+	spin_lock(ptl);
 	size = i_size_read(mapping->host) >> huge_page_shift(h);
 	if (idx >= size)
 		goto backout;
@@ -2882,13 +2893,13 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page);
 	}
 
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	unlock_page(page);
 out:
 	return ret;
 
 backout:
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 backout_unlocked:
 	unlock_page(page);
 	put_page(page);
@@ -2900,6 +2911,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	pte_t *ptep;
 	pte_t entry;
+	spinlock_t *ptl;
 	int ret;
 	struct page *page = NULL;
 	struct page *pagecache_page = NULL;
@@ -2968,7 +2980,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (page != pagecache_page)
 		lock_page(page);
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, ptep);
+	spin_lock(ptl);
 	/* Check for a racing update before calling hugetlb_cow */
 	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
 		goto out_page_table_lock;
@@ -2988,7 +3001,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		update_mmu_cache(vma, address, ptep);
 
 out_page_table_lock:
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 
 	if (pagecache_page) {
 		unlock_page(pagecache_page);
@@ -3014,9 +3027,9 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long remainder = *nr_pages;
 	struct hstate *h = hstate_vma(vma);
 
-	spin_lock(&mm->page_table_lock);
 	while (vaddr < vma->vm_end && remainder) {
 		pte_t *pte;
+		spinlock_t *ptl = NULL;
 		int absent;
 		struct page *page;
 
@@ -3024,8 +3037,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * Some archs (sparc64, sh*) have multiple pte_ts to
 		 * each hugepage.  We have to make sure we get the
 		 * first, for the page indexing below to work.
+		 *
+		 * Note that page table lock is not held when pte is null.
 		 */
-		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
+		pte = huge_pte_offset_lock(mm, vaddr & huge_page_mask(h), &ptl);
 		absent = !pte || huge_pte_none(huge_ptep_get(pte));
 
 		/*
@@ -3037,6 +3052,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		if (absent && (flags & FOLL_DUMP) &&
 		    !hugetlbfs_pagecache_present(h, vma, vaddr)) {
+			if (pte)
+				spin_unlock(ptl);
 			remainder = 0;
 			break;
 		}
@@ -3056,10 +3073,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		      !huge_pte_write(huge_ptep_get(pte)))) {
 			int ret;
 
-			spin_unlock(&mm->page_table_lock);
+			if (pte)
+				spin_unlock(ptl);
 			ret = hugetlb_fault(mm, vma, vaddr,
 				(flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
-			spin_lock(&mm->page_table_lock);
 			if (!(ret & VM_FAULT_ERROR))
 				continue;
 
@@ -3090,8 +3107,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 */
 			goto same_page;
 		}
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	*nr_pages = remainder;
 	*position = vaddr;
 
@@ -3112,13 +3129,14 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	flush_cache_range(vma, address, end);
 
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
-	spin_lock(&mm->page_table_lock);
 	for (; address < end; address += huge_page_size(h)) {
-		ptep = huge_pte_offset(mm, address);
+		spinlock_t *ptl;
+		ptep = huge_pte_offset_lock(mm, address, &ptl);
 		if (!ptep)
 			continue;
 		if (huge_pmd_unshare(mm, &address, ptep)) {
 			pages++;
+			spin_unlock(ptl);
 			continue;
 		}
 		if (!huge_pte_none(huge_ptep_get(ptep))) {
@@ -3128,8 +3146,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 			set_huge_pte_at(mm, address, ptep, pte);
 			pages++;
 		}
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	/*
 	 * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
 	 * may have cleared our pud entry and done put_page on the page table:
@@ -3292,6 +3310,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	unsigned long saddr;
 	pte_t *spte = NULL;
 	pte_t *pte;
+	spinlock_t *ptl;
 
 	if (!vma_shareable(vma, addr))
 		return (pte_t *)pmd_alloc(mm, pud, addr);
@@ -3314,13 +3333,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	if (!spte)
 		goto out;
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, spte);
+	spin_lock(ptl);
 	if (pud_none(*pud))
 		pud_populate(mm, pud,
 				(pmd_t *)((unsigned long)spte & PAGE_MASK));
 	else
 		put_page(virt_to_page(spte));
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 out:
 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
 	mutex_unlock(&mapping->i_mmap_mutex);
@@ -3334,7 +3354,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
  * indicated by page_count > 1, unmap is achieved by clearing pud and
  * decrementing the ref count. If count == 1, the pte page is not shared.
  *
- * called with vma->vm_mm->page_table_lock held.
+ * called with page table lock held.
  *
  * returns: 1 successfully unmapped a shared pte page
  *	    0 the underlying pte page is not shared, or it is the last user
diff --git v3.11-rc3.orig/mm/mempolicy.c v3.11-rc3/mm/mempolicy.c
index 838d022..5afe17b 100644
--- v3.11-rc3.orig/mm/mempolicy.c
+++ v3.11-rc3/mm/mempolicy.c
@@ -522,8 +522,9 @@ static void queue_pages_hugetlb_pmd_range(struct vm_area_struct *vma,
 #ifdef CONFIG_HUGETLB_PAGE
 	int nid;
 	struct page *page;
+	spinlock_t *ptl = pte_lockptr(vma->vm_mm, pmd);
 
-	spin_lock(&vma->vm_mm->page_table_lock);
+	spin_lock(ptl);
 	page = pte_page(huge_ptep_get((pte_t *)pmd));
 	nid = page_to_nid(page);
 	if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
@@ -533,7 +534,7 @@ static void queue_pages_hugetlb_pmd_range(struct vm_area_struct *vma,
 	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1))
 		isolate_huge_page(page, private);
 unlock:
-	spin_unlock(&vma->vm_mm->page_table_lock);
+	spin_unlock(ptl);
 #else
 	BUG();
 #endif
diff --git v3.11-rc3.orig/mm/migrate.c v3.11-rc3/mm/migrate.c
index 61f14a1..c69a9c7 100644
--- v3.11-rc3.orig/mm/migrate.c
+++ v3.11-rc3/mm/migrate.c
@@ -130,7 +130,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 		ptep = huge_pte_offset(mm, addr);
 		if (!ptep)
 			goto out;
-		ptl = &mm->page_table_lock;
+		ptl = huge_pte_lockptr(mm, ptep);
 	} else {
 		pmd = mm_find_pmd(mm, addr);
 		if (!pmd)
@@ -249,7 +249,7 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 
 void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte)
 {
-	spinlock_t *ptl = &(mm)->page_table_lock;
+	spinlock_t *ptl = huge_pte_lockptr(mm, pte);
 	__migration_entry_wait(mm, pte, ptl);
 }
 
diff --git v3.11-rc3.orig/mm/rmap.c v3.11-rc3/mm/rmap.c
index c56ba98..eccec58 100644
--- v3.11-rc3.orig/mm/rmap.c
+++ v3.11-rc3/mm/rmap.c
@@ -601,7 +601,7 @@ pte_t *__page_check_address(struct page *page, struct mm_struct *mm,
 
 	if (unlikely(PageHuge(page))) {
 		pte = huge_pte_offset(mm, address);
-		ptl = &mm->page_table_lock;
+		ptl = huge_pte_lockptr(mm, pte);
 		goto check;
 	}
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 1/2] hugetlbfs: support split page table lock
@ 2013-08-30 17:18   ` Naoya Horiguchi
  0 siblings, 0 replies; 42+ messages in thread
From: Naoya Horiguchi @ 2013-08-30 17:18 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, Andrea Arcangeli, kirill.shutemov,
	Aneesh Kumar K.V, Alex Thorlton, linux-kernel

Currently all page table handling in the hugetlbfs code is done under
mm->page_table_lock. So when a process has many threads that heavily
access memory, contention on this single lock hurts performance.

This patch makes the hugepage code support the split page table lock, so
that we use the page->ptl of the leaf node of the page table tree: the
pte page for normal pages, but possibly a pmd and/or pud page for
hugepages on some architectures.

ChangeLog v2:
 - add split ptl on other archs missed in v1

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 arch/powerpc/mm/hugetlbpage.c |  6 ++-
 arch/tile/mm/hugetlbpage.c    |  6 ++-
 include/linux/hugetlb.h       | 20 ++++++++++
 mm/hugetlb.c                  | 92 ++++++++++++++++++++++++++-----------------
 mm/mempolicy.c                |  5 ++-
 mm/migrate.c                  |  4 +-
 mm/rmap.c                     |  2 +-
 7 files changed, 90 insertions(+), 45 deletions(-)

diff --git v3.11-rc3.orig/arch/powerpc/mm/hugetlbpage.c v3.11-rc3/arch/powerpc/mm/hugetlbpage.c
index d67db4b..7e56cb7 100644
--- v3.11-rc3.orig/arch/powerpc/mm/hugetlbpage.c
+++ v3.11-rc3/arch/powerpc/mm/hugetlbpage.c
@@ -124,6 +124,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 {
 	struct kmem_cache *cachep;
 	pte_t *new;
+	spinlock_t *ptl;
 
 #ifdef CONFIG_PPC_FSL_BOOK3E
 	int i;
@@ -141,7 +142,8 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 	if (! new)
 		return -ENOMEM;
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, new);
+	spin_lock(ptl);
 #ifdef CONFIG_PPC_FSL_BOOK3E
 	/*
 	 * We have multiple higher-level entries that point to the same
@@ -174,7 +176,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 #endif
 	}
 #endif
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	return 0;
 }
 
diff --git v3.11-rc3.orig/arch/tile/mm/hugetlbpage.c v3.11-rc3/arch/tile/mm/hugetlbpage.c
index 0ac3599..c10cef9 100644
--- v3.11-rc3.orig/arch/tile/mm/hugetlbpage.c
+++ v3.11-rc3/arch/tile/mm/hugetlbpage.c
@@ -59,6 +59,7 @@ static pte_t *pte_alloc_hugetlb(struct mm_struct *mm, pmd_t *pmd,
 				unsigned long address)
 {
 	pte_t *new;
+	spinlock_t *ptl;
 
 	if (pmd_none(*pmd)) {
 		new = pte_alloc_one_kernel(mm, address);
@@ -67,14 +68,15 @@ static pte_t *pte_alloc_hugetlb(struct mm_struct *mm, pmd_t *pmd,
 
 		smp_wmb(); /* See comment in __pte_alloc */
 
-		spin_lock(&mm->page_table_lock);
+		ptl = huge_pte_lockptr(mm, new);
+		spin_lock(ptl);
 		if (likely(pmd_none(*pmd))) {  /* Has another populated it ? */
 			mm->nr_ptes++;
 			pmd_populate_kernel(mm, pmd, new);
 			new = NULL;
 		} else
 			VM_BUG_ON(pmd_trans_splitting(*pmd));
-		spin_unlock(&mm->page_table_lock);
+		spin_unlock(ptl);
 		if (new)
 			pte_free_kernel(mm, new);
 	}
diff --git v3.11-rc3.orig/include/linux/hugetlb.h v3.11-rc3/include/linux/hugetlb.h
index 0393270..8de0127 100644
--- v3.11-rc3.orig/include/linux/hugetlb.h
+++ v3.11-rc3/include/linux/hugetlb.h
@@ -80,6 +80,24 @@ extern const unsigned long hugetlb_zero, hugetlb_infinity;
 extern int sysctl_hugetlb_shm_group;
 extern struct list_head huge_boot_pages;
 
+#if USE_SPLIT_PTLOCKS
+#define huge_pte_lockptr(mm, ptep) ({__pte_lockptr(virt_to_page(ptep)); })
+#else	/* !USE_SPLIT_PTLOCKS */
+#define huge_pte_lockptr(mm, ptep) ({&(mm)->page_table_lock; })
+#endif	/* USE_SPLIT_PTLOCKS */
+
+#define huge_pte_offset_lock(mm, address, ptlp)		\
+({							\
+	pte_t *__pte = huge_pte_offset(mm, address);	\
+	spinlock_t *__ptl = NULL;			\
+	if (__pte) {					\
+		__ptl = huge_pte_lockptr(mm, __pte);	\
+		*(ptlp) = __ptl;			\
+		spin_lock(__ptl);			\
+	}						\
+	__pte;						\
+})
+
 /* arch callbacks */
 
 pte_t *huge_pte_alloc(struct mm_struct *mm,
@@ -164,6 +182,8 @@ static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
 	BUG();
 }
 
+#define huge_pte_lockptr(mm, ptep) 0
+
 #endif /* !CONFIG_HUGETLB_PAGE */
 
 #define HUGETLB_ANON_FILE "anon_hugepage"
diff --git v3.11-rc3.orig/mm/hugetlb.c v3.11-rc3/mm/hugetlb.c
index 0f8da6b..b2dbca4 100644
--- v3.11-rc3.orig/mm/hugetlb.c
+++ v3.11-rc3/mm/hugetlb.c
@@ -2372,6 +2372,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
+		spinlock_t *src_ptl, *dst_ptl;
 		src_pte = huge_pte_offset(src, addr);
 		if (!src_pte)
 			continue;
@@ -2383,8 +2384,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		if (dst_pte == src_pte)
 			continue;
 
-		spin_lock(&dst->page_table_lock);
-		spin_lock_nested(&src->page_table_lock, SINGLE_DEPTH_NESTING);
+		dst_ptl = huge_pte_lockptr(dst, dst_pte);
+		src_ptl = huge_pte_lockptr(src, src_pte);
+		spin_lock(dst_ptl);
+		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 		if (!huge_pte_none(huge_ptep_get(src_pte))) {
 			if (cow)
 				huge_ptep_set_wrprotect(src, addr, src_pte);
@@ -2394,8 +2397,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			page_dup_rmap(ptepage);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 		}
-		spin_unlock(&src->page_table_lock);
-		spin_unlock(&dst->page_table_lock);
+		spin_unlock(src_ptl);
+		spin_unlock(dst_ptl);
 	}
 	return 0;
 
@@ -2438,6 +2441,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	unsigned long address;
 	pte_t *ptep;
 	pte_t pte;
+	spinlock_t *ptl;
 	struct page *page;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
@@ -2451,25 +2455,24 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	tlb_start_vma(tlb, vma);
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 again:
-	spin_lock(&mm->page_table_lock);
 	for (address = start; address < end; address += sz) {
-		ptep = huge_pte_offset(mm, address);
+		ptep = huge_pte_offset_lock(mm, address, &ptl);
 		if (!ptep)
 			continue;
 
 		if (huge_pmd_unshare(mm, &address, ptep))
-			continue;
+			goto unlock;
 
 		pte = huge_ptep_get(ptep);
 		if (huge_pte_none(pte))
-			continue;
+			goto unlock;
 
 		/*
 		 * HWPoisoned hugepage is already unmapped and dropped reference
 		 */
 		if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
 			huge_pte_clear(mm, address, ptep);
-			continue;
+			goto unlock;
 		}
 
 		page = pte_page(pte);
@@ -2480,7 +2483,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 */
 		if (ref_page) {
 			if (page != ref_page)
-				continue;
+				goto unlock;
 
 			/*
 			 * Mark the VMA as having unmapped its page so that
@@ -2497,13 +2500,18 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 		page_remove_rmap(page);
 		force_flush = !__tlb_remove_page(tlb, page);
-		if (force_flush)
+		if (force_flush) {
+			spin_unlock(ptl);
 			break;
+		}
 		/* Bail out after unmapping reference page if supplied */
-		if (ref_page)
+		if (ref_page) {
+			spin_unlock(ptl);
 			break;
+		}
+unlock:
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	/*
 	 * mmu_gather ran out of room to batch pages, we break out of
 	 * the PTE lock to avoid doing the potential expensive TLB invalidate
@@ -2617,6 +2625,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	int outside_reserve = 0;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
+	spinlock_t *ptl = huge_pte_lockptr(mm, ptep);
 
 	old_page = pte_page(pte);
 
@@ -2648,7 +2657,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	page_cache_get(old_page);
 
 	/* Drop page_table_lock as buddy allocator may be called */
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	new_page = alloc_huge_page(vma, address, outside_reserve);
 
 	if (IS_ERR(new_page)) {
@@ -2666,7 +2675,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 			BUG_ON(huge_pte_none(pte));
 			if (unmap_ref_private(mm, vma, old_page, address)) {
 				BUG_ON(huge_pte_none(pte));
-				spin_lock(&mm->page_table_lock);
+				spin_lock(ptl);
 				ptep = huge_pte_offset(mm, address & huge_page_mask(h));
 				if (likely(pte_same(huge_ptep_get(ptep), pte)))
 					goto retry_avoidcopy;
@@ -2680,7 +2689,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 
 		/* Caller expects lock to be held */
-		spin_lock(&mm->page_table_lock);
+		spin_lock(ptl);
 		if (err == -ENOMEM)
 			return VM_FAULT_OOM;
 		else
@@ -2695,7 +2704,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_cache_release(new_page);
 		page_cache_release(old_page);
 		/* Caller expects lock to be held */
-		spin_lock(&mm->page_table_lock);
+		spin_lock(ptl);
 		return VM_FAULT_OOM;
 	}
 
@@ -2710,7 +2719,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * Retake the page_table_lock to check for racing updates
 	 * before the page tables are altered
 	 */
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	ptep = huge_pte_offset(mm, address & huge_page_mask(h));
 	if (likely(pte_same(huge_ptep_get(ptep), pte))) {
 		/* Break COW */
@@ -2722,10 +2731,10 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* Make the old page be freed below */
 		new_page = old_page;
 	}
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	/* Caller expects lock to be held */
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
 	return 0;
@@ -2775,6 +2784,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page;
 	struct address_space *mapping;
 	pte_t new_pte;
+	spinlock_t *ptl;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -2860,7 +2870,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			goto backout_unlocked;
 		}
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, ptep);
+	spin_lock(ptl);
 	size = i_size_read(mapping->host) >> huge_page_shift(h);
 	if (idx >= size)
 		goto backout;
@@ -2882,13 +2893,13 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page);
 	}
 
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	unlock_page(page);
 out:
 	return ret;
 
 backout:
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 backout_unlocked:
 	unlock_page(page);
 	put_page(page);
@@ -2900,6 +2911,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	pte_t *ptep;
 	pte_t entry;
+	spinlock_t *ptl;
 	int ret;
 	struct page *page = NULL;
 	struct page *pagecache_page = NULL;
@@ -2968,7 +2980,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (page != pagecache_page)
 		lock_page(page);
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, ptep);
+	spin_lock(ptl);
 	/* Check for a racing update before calling hugetlb_cow */
 	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
 		goto out_page_table_lock;
@@ -2988,7 +3001,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		update_mmu_cache(vma, address, ptep);
 
 out_page_table_lock:
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 
 	if (pagecache_page) {
 		unlock_page(pagecache_page);
@@ -3014,9 +3027,9 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long remainder = *nr_pages;
 	struct hstate *h = hstate_vma(vma);
 
-	spin_lock(&mm->page_table_lock);
 	while (vaddr < vma->vm_end && remainder) {
 		pte_t *pte;
+		spinlock_t *ptl = NULL;
 		int absent;
 		struct page *page;
 
@@ -3024,8 +3037,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * Some archs (sparc64, sh*) have multiple pte_ts to
 		 * each hugepage.  We have to make sure we get the
 		 * first, for the page indexing below to work.
+		 *
+		 * Note that page table lock is not held when pte is null.
 		 */
-		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
+		pte = huge_pte_offset_lock(mm, vaddr & huge_page_mask(h), &ptl);
 		absent = !pte || huge_pte_none(huge_ptep_get(pte));
 
 		/*
@@ -3037,6 +3052,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		if (absent && (flags & FOLL_DUMP) &&
 		    !hugetlbfs_pagecache_present(h, vma, vaddr)) {
+			if (pte)
+				spin_unlock(ptl);
 			remainder = 0;
 			break;
 		}
@@ -3056,10 +3073,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		      !huge_pte_write(huge_ptep_get(pte)))) {
 			int ret;
 
-			spin_unlock(&mm->page_table_lock);
+			if (pte)
+				spin_unlock(ptl);
 			ret = hugetlb_fault(mm, vma, vaddr,
 				(flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
-			spin_lock(&mm->page_table_lock);
 			if (!(ret & VM_FAULT_ERROR))
 				continue;
 
@@ -3090,8 +3107,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 */
 			goto same_page;
 		}
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	*nr_pages = remainder;
 	*position = vaddr;
 
@@ -3112,13 +3129,14 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	flush_cache_range(vma, address, end);
 
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
-	spin_lock(&mm->page_table_lock);
 	for (; address < end; address += huge_page_size(h)) {
-		ptep = huge_pte_offset(mm, address);
+		spinlock_t *ptl;
+		ptep = huge_pte_offset_lock(mm, address, &ptl);
 		if (!ptep)
 			continue;
 		if (huge_pmd_unshare(mm, &address, ptep)) {
 			pages++;
+			spin_unlock(ptl);
 			continue;
 		}
 		if (!huge_pte_none(huge_ptep_get(ptep))) {
@@ -3128,8 +3146,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 			set_huge_pte_at(mm, address, ptep, pte);
 			pages++;
 		}
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	/*
 	 * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
 	 * may have cleared our pud entry and done put_page on the page table:
@@ -3292,6 +3310,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	unsigned long saddr;
 	pte_t *spte = NULL;
 	pte_t *pte;
+	spinlock_t *ptl;
 
 	if (!vma_shareable(vma, addr))
 		return (pte_t *)pmd_alloc(mm, pud, addr);
@@ -3314,13 +3333,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	if (!spte)
 		goto out;
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, spte);
+	spin_lock(ptl);
 	if (pud_none(*pud))
 		pud_populate(mm, pud,
 				(pmd_t *)((unsigned long)spte & PAGE_MASK));
 	else
 		put_page(virt_to_page(spte));
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 out:
 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
 	mutex_unlock(&mapping->i_mmap_mutex);
@@ -3334,7 +3354,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
  * indicated by page_count > 1, unmap is achieved by clearing pud and
  * decrementing the ref count. If count == 1, the pte page is not shared.
  *
- * called with vma->vm_mm->page_table_lock held.
+ * called with page table lock held.
  *
  * returns: 1 successfully unmapped a shared pte page
  *	    0 the underlying pte page is not shared, or it is the last user
diff --git v3.11-rc3.orig/mm/mempolicy.c v3.11-rc3/mm/mempolicy.c
index 838d022..5afe17b 100644
--- v3.11-rc3.orig/mm/mempolicy.c
+++ v3.11-rc3/mm/mempolicy.c
@@ -522,8 +522,9 @@ static void queue_pages_hugetlb_pmd_range(struct vm_area_struct *vma,
 #ifdef CONFIG_HUGETLB_PAGE
 	int nid;
 	struct page *page;
+	spinlock_t *ptl = pte_lockptr(vma->vm_mm, pmd);
 
-	spin_lock(&vma->vm_mm->page_table_lock);
+	spin_lock(ptl);
 	page = pte_page(huge_ptep_get((pte_t *)pmd));
 	nid = page_to_nid(page);
 	if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
@@ -533,7 +534,7 @@ static void queue_pages_hugetlb_pmd_range(struct vm_area_struct *vma,
 	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1))
 		isolate_huge_page(page, private);
 unlock:
-	spin_unlock(&vma->vm_mm->page_table_lock);
+	spin_unlock(ptl);
 #else
 	BUG();
 #endif
diff --git v3.11-rc3.orig/mm/migrate.c v3.11-rc3/mm/migrate.c
index 61f14a1..c69a9c7 100644
--- v3.11-rc3.orig/mm/migrate.c
+++ v3.11-rc3/mm/migrate.c
@@ -130,7 +130,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 		ptep = huge_pte_offset(mm, addr);
 		if (!ptep)
 			goto out;
-		ptl = &mm->page_table_lock;
+		ptl = huge_pte_lockptr(mm, ptep);
 	} else {
 		pmd = mm_find_pmd(mm, addr);
 		if (!pmd)
@@ -249,7 +249,7 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 
 void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte)
 {
-	spinlock_t *ptl = &(mm)->page_table_lock;
+	spinlock_t *ptl = huge_pte_lockptr(mm, pte);
 	__migration_entry_wait(mm, pte, ptl);
 }
 
diff --git v3.11-rc3.orig/mm/rmap.c v3.11-rc3/mm/rmap.c
index c56ba98..eccec58 100644
--- v3.11-rc3.orig/mm/rmap.c
+++ v3.11-rc3/mm/rmap.c
@@ -601,7 +601,7 @@ pte_t *__page_check_address(struct page *page, struct mm_struct *mm,
 
 	if (unlikely(PageHuge(page))) {
 		pte = huge_pte_offset(mm, address);
-		ptl = &mm->page_table_lock;
+		ptl = huge_pte_lockptr(mm, pte);
 		goto check;
 	}
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2013-09-09 16:26 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-28 19:52 [PATCH 0/2] hugetlbfs: support split page table lock Naoya Horiguchi
2013-05-28 19:52 ` Naoya Horiguchi
2013-05-28 19:52 ` [PATCH 1/2] " Naoya Horiguchi
2013-05-28 19:52   ` Naoya Horiguchi
2013-05-29  1:09   ` Wanpeng Li
2013-05-29  1:09   ` Wanpeng Li
2013-06-03 13:19   ` Michal Hocko
2013-06-03 13:19     ` Michal Hocko
2013-06-03 14:34     ` Naoya Horiguchi
2013-06-03 14:34       ` Naoya Horiguchi
2013-06-03 15:42       ` Michal Hocko
2013-06-03 15:42         ` Michal Hocko
2013-05-28 19:52 ` [PATCH v3 2/2] migrate: add migrate_entry_wait_huge() Naoya Horiguchi
2013-05-28 19:52   ` Naoya Horiguchi
2013-05-29  1:11   ` Wanpeng Li
2013-05-29  1:11   ` Wanpeng Li
2013-05-31 19:30   ` Andrew Morton
2013-05-31 19:30     ` Andrew Morton
2013-05-31 19:46     ` Naoya Horiguchi
2013-05-31 19:46       ` Naoya Horiguchi
2013-06-03 13:26   ` Michal Hocko
2013-06-03 13:26     ` Michal Hocko
2013-06-03 14:34     ` Naoya Horiguchi
2013-06-03 14:34       ` Naoya Horiguchi
2013-06-04 16:44     ` [PATCH v4] " Naoya Horiguchi
2013-06-04 16:44       ` Naoya Horiguchi
2013-08-30 17:18 [PATCH 0/2 v2] split page table lock for hugepage Naoya Horiguchi
2013-08-30 17:18 ` [PATCH 1/2] hugetlbfs: support split page table lock Naoya Horiguchi
2013-08-30 17:18   ` Naoya Horiguchi
2013-09-04  7:13   ` Aneesh Kumar K.V
2013-09-04  7:13     ` Aneesh Kumar K.V
2013-09-04 16:32     ` Naoya Horiguchi
2013-09-04 16:32       ` Naoya Horiguchi
2013-09-05  9:18       ` Aneesh Kumar K.V
2013-09-05  9:18         ` Aneesh Kumar K.V
2013-09-05 15:23         ` Naoya Horiguchi
2013-09-05 15:23           ` Naoya Horiguchi
2013-09-05 21:27 [PATCH 0/2 v3] split page table lock for hugepage Naoya Horiguchi
2013-09-05 21:27 ` [PATCH 1/2] hugetlbfs: support split page table lock Naoya Horiguchi
2013-09-05 21:27   ` Naoya Horiguchi
2013-09-08 16:53   ` Aneesh Kumar K.V
2013-09-08 16:53     ` Aneesh Kumar K.V
2013-09-09 16:26     ` Naoya Horiguchi
2013-09-09 16:26       ` Naoya Horiguchi
