linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/2 v2] split page table lock for hugepage
@ 2013-08-30 17:18 Naoya Horiguchi
  2013-08-30 17:18 ` [PATCH 1/2] hugetlbfs: support split page table lock Naoya Horiguchi
  2013-08-30 17:18 ` [PATCH 2/2] thp: " Naoya Horiguchi
  0 siblings, 2 replies; 17+ messages in thread
From: Naoya Horiguchi @ 2013-08-30 17:18 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, Andrea Arcangeli, kirill.shutemov,
	Aneesh Kumar K.V, Alex Thorlton, linux-kernel

Hi,

I revised the split page table lock patch (v1 is [1]), and got some numbers
to confirm the performance improvement.

This patchset replaces all locking/unlocking of mm->page_table_lock in
hugepage context with page->ptl when USE_SPLIT_PTLOCKS is true. This
breaks the single mm-wide lock into many per-page-table locks covering
small (pmd/pte-sized) address ranges, so we can expect better performance
when many threads access the virtual memory of a process simultaneously.
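
Concretely, the lock used for a huge pte is chosen like this (just a summary
of the huge_pte_lockptr() definition added in patch 1/2; patch 2/2 adds an
analogous huge_pmd_lockptr() for thp):

#if USE_SPLIT_PTLOCKS
/* use the spinlock embedded in the struct page of the page table page */
#define huge_pte_lockptr(mm, ptep) ({__pte_lockptr(virt_to_page(ptep)); })
#else	/* !USE_SPLIT_PTLOCKS */
/* fall back to the single per-mm lock */
#define huge_pte_lockptr(mm, ptep) ({&(mm)->page_table_lock; })
#endif	/* USE_SPLIT_PTLOCKS */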

Here is the result of my testing [2], where I measured the time (in seconds)
taken to execute a specific workload under various conditions, so smaller
numbers mean better performance. The workload is as follows (a minimal
sketch of it is shown after the list):
  1) allocate N hugepages/thps and touch them once,
  2) create T threads with pthread_create(), and
  3) each thread accesses the whole range of pages sequentially 10 times.
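
For reference, a minimal user-space sketch of this workload (the actual test
program, including timing and the hugetlb/thp setup, is in [2]; the 2MB
hugepage size, argument handling and mmap flags below are assumptions of
this sketch, not the real harness):

#define _GNU_SOURCE
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NR_LOOP 10

static char *buf;		/* hugepage-backed buffer, N hugepages long */
static size_t buf_size;

/* step 3: walk the whole buffer sequentially, 10 times */
static void *worker(void *arg)
{
	long step = sysconf(_SC_PAGESIZE);
	size_t off;
	int i;

	(void)arg;
	for (i = 0; i < NR_LOOP; i++)
		for (off = 0; off < buf_size; off += step)
			buf[off]++;
	return NULL;
}

int main(int argc, char *argv[])
{
	size_t hpage_size = 2UL << 20;	/* assumes 2MB hugepages */
	pthread_t *th;
	int nr_pages, nr_threads, i;

	if (argc < 3)
		return 1;
	nr_pages = atoi(argv[1]);	/* N */
	nr_threads = atoi(argv[2]);	/* T */
	th = calloc(nr_threads, sizeof(*th));
	buf_size = nr_pages * hpage_size;

	/*
	 * step 1: allocate N hugepages and touch them once (for the thp
	 * case, drop MAP_HUGETLB and use madvise(MADV_HUGEPAGE) instead)
	 */
	buf = mmap(NULL, buf_size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (buf == MAP_FAILED || !th)
		return 1;
	memset(buf, 0, buf_size);

	/* step 2: create T threads */
	for (i = 0; i < nr_threads; i++)
		pthread_create(&th[i], NULL, worker, NULL);
	for (i = 0; i < nr_threads; i++)
		pthread_join(th[i], NULL);
	return 0;
}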

           |             hugetlb             |               thp               |
 N      T  |   v3.11-rc3    |    patched     |   v3.11-rc3    |    patched     |
100    100 |  0.13 (+-0.04) |  0.07 (+-0.01) |  0.10 (+-0.01) |  0.08 (+-0.03) |
100   3000 | 11.67 (+-0.47) |  6.54 (+-0.38) | 11.21 (+-0.28) |  6.44 (+-0.26) |
6000   100 |  2.87 (+-0.07) |  2.79 (+-0.06) |  3.21 (+-0.06) |  3.10 (+-0.06) |
6000  3000 | 18.76 (+-0.50) | 13.68 (+-0.35) | 19.44 (+-0.78) | 14.03 (+-0.43) |

  * Numbers are the averages (and stddev) of 20 runs each.

This result shows that the patched kernel performs better for both hugetlb
and thp, so the patches work as intended. The performance gain is larger for
larger T. Interestingly, a more detailed analysis shows that the improvement
mostly comes from step 2). I saw a small improvement for step 3), but no
visible improvement for step 1).

[1] http://thread.gmane.org/gmane.linux.kernel.mm/100856/focus=100858
[2] https://github.com/Naoya-Horiguchi/test_split_page_table_lock_for_hugepage

Naoya Horiguchi (2):
      hugetlbfs: support split page table lock
      thp: support split page table lock

 arch/powerpc/mm/hugetlbpage.c |   6 +-
 arch/powerpc/mm/pgtable_64.c  |   8 +-
 arch/s390/mm/pgtable.c        |   4 +-
 arch/sparc/mm/tlb.c           |   4 +-
 arch/tile/mm/hugetlbpage.c    |   6 +-
 fs/proc/task_mmu.c            |  17 +++--
 include/linux/huge_mm.h       |  11 +--
 include/linux/hugetlb.h       |  20 +++++
 include/linux/mm.h            |   3 +
 mm/huge_memory.c              | 170 +++++++++++++++++++++++++-----------------
 mm/hugetlb.c                  |  92 ++++++++++++++---------
 mm/memcontrol.c               |  14 ++--
 mm/memory.c                   |  15 ++--
 mm/mempolicy.c                |   5 +-
 mm/migrate.c                  |  12 +--
 mm/mprotect.c                 |   5 +-
 mm/pgtable-generic.c          |  10 +--
 mm/rmap.c                     |  13 ++--
 18 files changed, 251 insertions(+), 164 deletions(-)

Thanks,
Naoya Horiguchi


* [PATCH 1/2] hugetlbfs: support split page table lock
  2013-08-30 17:18 [PATCH 0/2 v2] split page table lock for hugepage Naoya Horiguchi
@ 2013-08-30 17:18 ` Naoya Horiguchi
  2013-09-04  7:13   ` Aneesh Kumar K.V
  2013-08-30 17:18 ` [PATCH 2/2] thp: " Naoya Horiguchi
  1 sibling, 1 reply; 17+ messages in thread
From: Naoya Horiguchi @ 2013-08-30 17:18 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, Andrea Arcangeli, kirill.shutemov,
	Aneesh Kumar K.V, Alex Thorlton, linux-kernel

Currently all page table handling in the hugetlbfs code is done under
mm->page_table_lock. So when a process has many threads that heavily
access its memory, lock contention occurs and hurts performance.

This patch makes hugepages support the split page table lock, so that we
use the page->ptl of the leaf node of the page table tree, which is the
pte for normal pages but can be a pmd and/or pud for hugepages on some
architectures.
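
As a sketch of the conversion pattern used throughout the hugetlbfs paths
(example_read_huge_pte() is a made-up helper for illustration, not part of
this patch):

static pte_t example_read_huge_pte(struct mm_struct *mm, pte_t *ptep)
{
	/*
	 * With USE_SPLIT_PTLOCKS this resolves to the ptl of the page
	 * table page containing ptep, otherwise to &mm->page_table_lock.
	 */
	spinlock_t *ptl = huge_pte_lockptr(mm, ptep);
	pte_t entry;

	spin_lock(ptl);
	entry = huge_ptep_get(ptep);
	spin_unlock(ptl);
	return entry;
}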

ChangeLog v2:
 - add split ptl on other archs missed in v1

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 arch/powerpc/mm/hugetlbpage.c |  6 ++-
 arch/tile/mm/hugetlbpage.c    |  6 ++-
 include/linux/hugetlb.h       | 20 ++++++++++
 mm/hugetlb.c                  | 92 ++++++++++++++++++++++++++-----------------
 mm/mempolicy.c                |  5 ++-
 mm/migrate.c                  |  4 +-
 mm/rmap.c                     |  2 +-
 7 files changed, 90 insertions(+), 45 deletions(-)

diff --git v3.11-rc3.orig/arch/powerpc/mm/hugetlbpage.c v3.11-rc3/arch/powerpc/mm/hugetlbpage.c
index d67db4b..7e56cb7 100644
--- v3.11-rc3.orig/arch/powerpc/mm/hugetlbpage.c
+++ v3.11-rc3/arch/powerpc/mm/hugetlbpage.c
@@ -124,6 +124,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 {
 	struct kmem_cache *cachep;
 	pte_t *new;
+	spinlock_t *ptl;
 
 #ifdef CONFIG_PPC_FSL_BOOK3E
 	int i;
@@ -141,7 +142,8 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 	if (! new)
 		return -ENOMEM;
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, new);
+	spin_lock(ptl);
 #ifdef CONFIG_PPC_FSL_BOOK3E
 	/*
 	 * We have multiple higher-level entries that point to the same
@@ -174,7 +176,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 #endif
 	}
 #endif
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	return 0;
 }
 
diff --git v3.11-rc3.orig/arch/tile/mm/hugetlbpage.c v3.11-rc3/arch/tile/mm/hugetlbpage.c
index 0ac3599..c10cef9 100644
--- v3.11-rc3.orig/arch/tile/mm/hugetlbpage.c
+++ v3.11-rc3/arch/tile/mm/hugetlbpage.c
@@ -59,6 +59,7 @@ static pte_t *pte_alloc_hugetlb(struct mm_struct *mm, pmd_t *pmd,
 				unsigned long address)
 {
 	pte_t *new;
+	spinlock_t *ptl;
 
 	if (pmd_none(*pmd)) {
 		new = pte_alloc_one_kernel(mm, address);
@@ -67,14 +68,15 @@ static pte_t *pte_alloc_hugetlb(struct mm_struct *mm, pmd_t *pmd,
 
 		smp_wmb(); /* See comment in __pte_alloc */
 
-		spin_lock(&mm->page_table_lock);
+		ptl = huge_pte_lockptr(mm, new);
+		spin_lock(ptl);
 		if (likely(pmd_none(*pmd))) {  /* Has another populated it ? */
 			mm->nr_ptes++;
 			pmd_populate_kernel(mm, pmd, new);
 			new = NULL;
 		} else
 			VM_BUG_ON(pmd_trans_splitting(*pmd));
-		spin_unlock(&mm->page_table_lock);
+		spin_unlock(ptl);
 		if (new)
 			pte_free_kernel(mm, new);
 	}
diff --git v3.11-rc3.orig/include/linux/hugetlb.h v3.11-rc3/include/linux/hugetlb.h
index 0393270..8de0127 100644
--- v3.11-rc3.orig/include/linux/hugetlb.h
+++ v3.11-rc3/include/linux/hugetlb.h
@@ -80,6 +80,24 @@ extern const unsigned long hugetlb_zero, hugetlb_infinity;
 extern int sysctl_hugetlb_shm_group;
 extern struct list_head huge_boot_pages;
 
+#if USE_SPLIT_PTLOCKS
+#define huge_pte_lockptr(mm, ptep) ({__pte_lockptr(virt_to_page(ptep)); })
+#else	/* !USE_SPLIT_PTLOCKS */
+#define huge_pte_lockptr(mm, ptep) ({&(mm)->page_table_lock; })
+#endif	/* USE_SPLIT_PTLOCKS */
+
+#define huge_pte_offset_lock(mm, address, ptlp)		\
+({							\
+	pte_t *__pte = huge_pte_offset(mm, address);	\
+	spinlock_t *__ptl = NULL;			\
+	if (__pte) {					\
+		__ptl = huge_pte_lockptr(mm, __pte);	\
+		*(ptlp) = __ptl;			\
+		spin_lock(__ptl);			\
+	}						\
+	__pte;						\
+})
+
 /* arch callbacks */
 
 pte_t *huge_pte_alloc(struct mm_struct *mm,
@@ -164,6 +182,8 @@ static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
 	BUG();
 }
 
+#define huge_pte_lockptr(mm, ptep) 0
+
 #endif /* !CONFIG_HUGETLB_PAGE */
 
 #define HUGETLB_ANON_FILE "anon_hugepage"
diff --git v3.11-rc3.orig/mm/hugetlb.c v3.11-rc3/mm/hugetlb.c
index 0f8da6b..b2dbca4 100644
--- v3.11-rc3.orig/mm/hugetlb.c
+++ v3.11-rc3/mm/hugetlb.c
@@ -2372,6 +2372,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
+		spinlock_t *src_ptl, *dst_ptl;
 		src_pte = huge_pte_offset(src, addr);
 		if (!src_pte)
 			continue;
@@ -2383,8 +2384,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		if (dst_pte == src_pte)
 			continue;
 
-		spin_lock(&dst->page_table_lock);
-		spin_lock_nested(&src->page_table_lock, SINGLE_DEPTH_NESTING);
+		dst_ptl = huge_pte_lockptr(dst, dst_pte);
+		src_ptl = huge_pte_lockptr(src, src_pte);
+		spin_lock(dst_ptl);
+		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 		if (!huge_pte_none(huge_ptep_get(src_pte))) {
 			if (cow)
 				huge_ptep_set_wrprotect(src, addr, src_pte);
@@ -2394,8 +2397,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			page_dup_rmap(ptepage);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 		}
-		spin_unlock(&src->page_table_lock);
-		spin_unlock(&dst->page_table_lock);
+		spin_unlock(src_ptl);
+		spin_unlock(dst_ptl);
 	}
 	return 0;
 
@@ -2438,6 +2441,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	unsigned long address;
 	pte_t *ptep;
 	pte_t pte;
+	spinlock_t *ptl;
 	struct page *page;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
@@ -2451,25 +2455,24 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	tlb_start_vma(tlb, vma);
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 again:
-	spin_lock(&mm->page_table_lock);
 	for (address = start; address < end; address += sz) {
-		ptep = huge_pte_offset(mm, address);
+		ptep = huge_pte_offset_lock(mm, address, &ptl);
 		if (!ptep)
 			continue;
 
 		if (huge_pmd_unshare(mm, &address, ptep))
-			continue;
+			goto unlock;
 
 		pte = huge_ptep_get(ptep);
 		if (huge_pte_none(pte))
-			continue;
+			goto unlock;
 
 		/*
 		 * HWPoisoned hugepage is already unmapped and dropped reference
 		 */
 		if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
 			huge_pte_clear(mm, address, ptep);
-			continue;
+			goto unlock;
 		}
 
 		page = pte_page(pte);
@@ -2480,7 +2483,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 */
 		if (ref_page) {
 			if (page != ref_page)
-				continue;
+				goto unlock;
 
 			/*
 			 * Mark the VMA as having unmapped its page so that
@@ -2497,13 +2500,18 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 		page_remove_rmap(page);
 		force_flush = !__tlb_remove_page(tlb, page);
-		if (force_flush)
+		if (force_flush) {
+			spin_unlock(ptl);
 			break;
+		}
 		/* Bail out after unmapping reference page if supplied */
-		if (ref_page)
+		if (ref_page) {
+			spin_unlock(ptl);
 			break;
+		}
+unlock:
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	/*
 	 * mmu_gather ran out of room to batch pages, we break out of
 	 * the PTE lock to avoid doing the potential expensive TLB invalidate
@@ -2617,6 +2625,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	int outside_reserve = 0;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
+	spinlock_t *ptl = huge_pte_lockptr(mm, ptep);
 
 	old_page = pte_page(pte);
 
@@ -2648,7 +2657,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	page_cache_get(old_page);
 
 	/* Drop page_table_lock as buddy allocator may be called */
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	new_page = alloc_huge_page(vma, address, outside_reserve);
 
 	if (IS_ERR(new_page)) {
@@ -2666,7 +2675,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 			BUG_ON(huge_pte_none(pte));
 			if (unmap_ref_private(mm, vma, old_page, address)) {
 				BUG_ON(huge_pte_none(pte));
-				spin_lock(&mm->page_table_lock);
+				spin_lock(ptl);
 				ptep = huge_pte_offset(mm, address & huge_page_mask(h));
 				if (likely(pte_same(huge_ptep_get(ptep), pte)))
 					goto retry_avoidcopy;
@@ -2680,7 +2689,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 
 		/* Caller expects lock to be held */
-		spin_lock(&mm->page_table_lock);
+		spin_lock(ptl);
 		if (err == -ENOMEM)
 			return VM_FAULT_OOM;
 		else
@@ -2695,7 +2704,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_cache_release(new_page);
 		page_cache_release(old_page);
 		/* Caller expects lock to be held */
-		spin_lock(&mm->page_table_lock);
+		spin_lock(ptl);
 		return VM_FAULT_OOM;
 	}
 
@@ -2710,7 +2719,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * Retake the page_table_lock to check for racing updates
 	 * before the page tables are altered
 	 */
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	ptep = huge_pte_offset(mm, address & huge_page_mask(h));
 	if (likely(pte_same(huge_ptep_get(ptep), pte))) {
 		/* Break COW */
@@ -2722,10 +2731,10 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* Make the old page be freed below */
 		new_page = old_page;
 	}
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	/* Caller expects lock to be held */
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
 	return 0;
@@ -2775,6 +2784,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page;
 	struct address_space *mapping;
 	pte_t new_pte;
+	spinlock_t *ptl;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -2860,7 +2870,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			goto backout_unlocked;
 		}
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, ptep);
+	spin_lock(ptl);
 	size = i_size_read(mapping->host) >> huge_page_shift(h);
 	if (idx >= size)
 		goto backout;
@@ -2882,13 +2893,13 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page);
 	}
 
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	unlock_page(page);
 out:
 	return ret;
 
 backout:
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 backout_unlocked:
 	unlock_page(page);
 	put_page(page);
@@ -2900,6 +2911,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	pte_t *ptep;
 	pte_t entry;
+	spinlock_t *ptl;
 	int ret;
 	struct page *page = NULL;
 	struct page *pagecache_page = NULL;
@@ -2968,7 +2980,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (page != pagecache_page)
 		lock_page(page);
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, ptep);
+	spin_lock(ptl);
 	/* Check for a racing update before calling hugetlb_cow */
 	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
 		goto out_page_table_lock;
@@ -2988,7 +3001,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		update_mmu_cache(vma, address, ptep);
 
 out_page_table_lock:
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 
 	if (pagecache_page) {
 		unlock_page(pagecache_page);
@@ -3014,9 +3027,9 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long remainder = *nr_pages;
 	struct hstate *h = hstate_vma(vma);
 
-	spin_lock(&mm->page_table_lock);
 	while (vaddr < vma->vm_end && remainder) {
 		pte_t *pte;
+		spinlock_t *ptl = NULL;
 		int absent;
 		struct page *page;
 
@@ -3024,8 +3037,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * Some archs (sparc64, sh*) have multiple pte_ts to
 		 * each hugepage.  We have to make sure we get the
 		 * first, for the page indexing below to work.
+		 *
+		 * Note that page table lock is not held when pte is null.
 		 */
-		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
+		pte = huge_pte_offset_lock(mm, vaddr & huge_page_mask(h), &ptl);
 		absent = !pte || huge_pte_none(huge_ptep_get(pte));
 
 		/*
@@ -3037,6 +3052,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		if (absent && (flags & FOLL_DUMP) &&
 		    !hugetlbfs_pagecache_present(h, vma, vaddr)) {
+			if (pte)
+				spin_unlock(ptl);
 			remainder = 0;
 			break;
 		}
@@ -3056,10 +3073,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		      !huge_pte_write(huge_ptep_get(pte)))) {
 			int ret;
 
-			spin_unlock(&mm->page_table_lock);
+			if (pte)
+				spin_unlock(ptl);
 			ret = hugetlb_fault(mm, vma, vaddr,
 				(flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
-			spin_lock(&mm->page_table_lock);
 			if (!(ret & VM_FAULT_ERROR))
 				continue;
 
@@ -3090,8 +3107,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 */
 			goto same_page;
 		}
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	*nr_pages = remainder;
 	*position = vaddr;
 
@@ -3112,13 +3129,14 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	flush_cache_range(vma, address, end);
 
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
-	spin_lock(&mm->page_table_lock);
 	for (; address < end; address += huge_page_size(h)) {
-		ptep = huge_pte_offset(mm, address);
+		spinlock_t *ptl;
+		ptep = huge_pte_offset_lock(mm, address, &ptl);
 		if (!ptep)
 			continue;
 		if (huge_pmd_unshare(mm, &address, ptep)) {
 			pages++;
+			spin_unlock(ptl);
 			continue;
 		}
 		if (!huge_pte_none(huge_ptep_get(ptep))) {
@@ -3128,8 +3146,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 			set_huge_pte_at(mm, address, ptep, pte);
 			pages++;
 		}
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	/*
 	 * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
 	 * may have cleared our pud entry and done put_page on the page table:
@@ -3292,6 +3310,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	unsigned long saddr;
 	pte_t *spte = NULL;
 	pte_t *pte;
+	spinlock_t *ptl;
 
 	if (!vma_shareable(vma, addr))
 		return (pte_t *)pmd_alloc(mm, pud, addr);
@@ -3314,13 +3333,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	if (!spte)
 		goto out;
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, spte);
+	spin_lock(ptl);
 	if (pud_none(*pud))
 		pud_populate(mm, pud,
 				(pmd_t *)((unsigned long)spte & PAGE_MASK));
 	else
 		put_page(virt_to_page(spte));
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 out:
 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
 	mutex_unlock(&mapping->i_mmap_mutex);
@@ -3334,7 +3354,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
  * indicated by page_count > 1, unmap is achieved by clearing pud and
  * decrementing the ref count. If count == 1, the pte page is not shared.
  *
- * called with vma->vm_mm->page_table_lock held.
+ * called with page table lock held.
  *
  * returns: 1 successfully unmapped a shared pte page
  *	    0 the underlying pte page is not shared, or it is the last user
diff --git v3.11-rc3.orig/mm/mempolicy.c v3.11-rc3/mm/mempolicy.c
index 838d022..5afe17b 100644
--- v3.11-rc3.orig/mm/mempolicy.c
+++ v3.11-rc3/mm/mempolicy.c
@@ -522,8 +522,9 @@ static void queue_pages_hugetlb_pmd_range(struct vm_area_struct *vma,
 #ifdef CONFIG_HUGETLB_PAGE
 	int nid;
 	struct page *page;
+	spinlock_t *ptl = pte_lockptr(vma->vm_mm, pmd);
 
-	spin_lock(&vma->vm_mm->page_table_lock);
+	spin_lock(ptl);
 	page = pte_page(huge_ptep_get((pte_t *)pmd));
 	nid = page_to_nid(page);
 	if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
@@ -533,7 +534,7 @@ static void queue_pages_hugetlb_pmd_range(struct vm_area_struct *vma,
 	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1))
 		isolate_huge_page(page, private);
 unlock:
-	spin_unlock(&vma->vm_mm->page_table_lock);
+	spin_unlock(ptl);
 #else
 	BUG();
 #endif
diff --git v3.11-rc3.orig/mm/migrate.c v3.11-rc3/mm/migrate.c
index 61f14a1..c69a9c7 100644
--- v3.11-rc3.orig/mm/migrate.c
+++ v3.11-rc3/mm/migrate.c
@@ -130,7 +130,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 		ptep = huge_pte_offset(mm, addr);
 		if (!ptep)
 			goto out;
-		ptl = &mm->page_table_lock;
+		ptl = huge_pte_lockptr(mm, ptep);
 	} else {
 		pmd = mm_find_pmd(mm, addr);
 		if (!pmd)
@@ -249,7 +249,7 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 
 void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte)
 {
-	spinlock_t *ptl = &(mm)->page_table_lock;
+	spinlock_t *ptl = huge_pte_lockptr(mm, pte);
 	__migration_entry_wait(mm, pte, ptl);
 }
 
diff --git v3.11-rc3.orig/mm/rmap.c v3.11-rc3/mm/rmap.c
index c56ba98..eccec58 100644
--- v3.11-rc3.orig/mm/rmap.c
+++ v3.11-rc3/mm/rmap.c
@@ -601,7 +601,7 @@ pte_t *__page_check_address(struct page *page, struct mm_struct *mm,
 
 	if (unlikely(PageHuge(page))) {
 		pte = huge_pte_offset(mm, address);
-		ptl = &mm->page_table_lock;
+		ptl = huge_pte_lockptr(mm, pte);
 		goto check;
 	}
 
-- 
1.8.3.1



* [PATCH 2/2] thp: support split page table lock
  2013-08-30 17:18 [PATCH 0/2 v2] split page table lock for hugepage Naoya Horiguchi
  2013-08-30 17:18 ` [PATCH 1/2] hugetlbfs: support split page table lock Naoya Horiguchi
@ 2013-08-30 17:18 ` Naoya Horiguchi
  2013-09-02 10:53   ` Kirill A. Shutemov
  2013-09-03 20:52   ` Naoya Horiguchi
  1 sibling, 2 replies; 17+ messages in thread
From: Naoya Horiguchi @ 2013-08-30 17:18 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, Andrea Arcangeli, kirill.shutemov,
	Aneesh Kumar K.V, Alex Thorlton, linux-kernel

THP-related code also uses the per-process mm->page_table_lock now, so
making it fine-grained can provide better performance.

This patch makes thp support the split page table lock, which makes us
use the page->ptl of the pages storing "pmd_trans_huge" pmds.

Some functions like pmd_trans_huge_lock() and page_check_address_pmd()
are expected by their callers to pass back the ptl pointer, so this patch
adds an argument to those functions for that purpose. Other than that,
this patch is a straightforward replacement, as sketched below.
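
A caller of pmd_trans_huge_lock() now looks roughly like this
(example_walk_pmd() is a made-up function used only for illustration):

static int example_walk_pmd(pmd_t *pmd, struct vm_area_struct *vma)
{
	spinlock_t *ptl;

	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
		/* *pmd is a stable huge pmd here and ptl is held */
		spin_unlock(ptl);
		return 1;
	}
	/* not a huge pmd (or it was being split); no lock is held here */
	return 0;
}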

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 arch/powerpc/mm/pgtable_64.c |   8 +-
 arch/s390/mm/pgtable.c       |   4 +-
 arch/sparc/mm/tlb.c          |   4 +-
 fs/proc/task_mmu.c           |  17 +++--
 include/linux/huge_mm.h      |  11 +--
 include/linux/mm.h           |   3 +
 mm/huge_memory.c             | 170 ++++++++++++++++++++++++++-----------------
 mm/memcontrol.c              |  14 ++--
 mm/memory.c                  |  15 ++--
 mm/migrate.c                 |   8 +-
 mm/mprotect.c                |   5 +-
 mm/pgtable-generic.c         |  10 +--
 mm/rmap.c                    |  11 ++-
 13 files changed, 161 insertions(+), 119 deletions(-)

diff --git v3.11-rc3.orig/arch/powerpc/mm/pgtable_64.c v3.11-rc3/arch/powerpc/mm/pgtable_64.c
index 536eec72..f9177eb 100644
--- v3.11-rc3.orig/arch/powerpc/mm/pgtable_64.c
+++ v3.11-rc3/arch/powerpc/mm/pgtable_64.c
@@ -605,7 +605,7 @@ void pmdp_splitting_flush(struct vm_area_struct *vma,
 
 #ifdef CONFIG_DEBUG_VM
 	WARN_ON(!pmd_trans_huge(*pmdp));
-	assert_spin_locked(&vma->vm_mm->page_table_lock);
+	assert_spin_locked(huge_pmd_lockptr(vma->vm_mm, pmdp));
 #endif
 
 #ifdef PTE_ATOMIC_UPDATES
@@ -643,7 +643,7 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 				pgtable_t pgtable)
 {
 	pgtable_t *pgtable_slot;
-	assert_spin_locked(&mm->page_table_lock);
+	assert_spin_locked(huge_pmd_lockptr(mm, pmdp));
 	/*
 	 * we store the pgtable in the second half of PMD
 	 */
@@ -663,7 +663,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 	pgtable_t pgtable;
 	pgtable_t *pgtable_slot;
 
-	assert_spin_locked(&mm->page_table_lock);
+	assert_spin_locked(huge_pmd_lockptr(mm, pmdp));
 	pgtable_slot = (pgtable_t *)pmdp + PTRS_PER_PMD;
 	pgtable = *pgtable_slot;
 	/*
@@ -687,7 +687,7 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 {
 #ifdef CONFIG_DEBUG_VM
 	WARN_ON(!pmd_none(*pmdp));
-	assert_spin_locked(&mm->page_table_lock);
+	assert_spin_locked(huge_pmd_lockptr(mm, pmdp));
 	WARN_ON(!pmd_trans_huge(pmd));
 #endif
 	return set_pte_at(mm, addr, pmdp_ptep(pmdp), pmd_pte(pmd));
diff --git v3.11-rc3.orig/arch/s390/mm/pgtable.c v3.11-rc3/arch/s390/mm/pgtable.c
index a8154a1..d6c6b5c 100644
--- v3.11-rc3.orig/arch/s390/mm/pgtable.c
+++ v3.11-rc3/arch/s390/mm/pgtable.c
@@ -1170,7 +1170,7 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 {
 	struct list_head *lh = (struct list_head *) pgtable;
 
-	assert_spin_locked(&mm->page_table_lock);
+	assert_spin_locked(huge_pmd_lockptr(mm, pmdp));
 
 	/* FIFO */
 	if (!mm->pmd_huge_pte)
@@ -1186,7 +1186,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 	pgtable_t pgtable;
 	pte_t *ptep;
 
-	assert_spin_locked(&mm->page_table_lock);
+	assert_spin_locked(huge_pmd_lockptr(mm, pmdp));
 
 	/* FIFO */
 	pgtable = mm->pmd_huge_pte;
diff --git v3.11-rc3.orig/arch/sparc/mm/tlb.c v3.11-rc3/arch/sparc/mm/tlb.c
index 7a91f28..cca3bed 100644
--- v3.11-rc3.orig/arch/sparc/mm/tlb.c
+++ v3.11-rc3/arch/sparc/mm/tlb.c
@@ -193,7 +193,7 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 {
 	struct list_head *lh = (struct list_head *) pgtable;
 
-	assert_spin_locked(&mm->page_table_lock);
+	assert_spin_locked(huge_pmd_lockptr(mm, pmdp));
 
 	/* FIFO */
 	if (!mm->pmd_huge_pte)
@@ -208,7 +208,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 	struct list_head *lh;
 	pgtable_t pgtable;
 
-	assert_spin_locked(&mm->page_table_lock);
+	assert_spin_locked(huge_pmd_lockptr(mm, pmdp));
 
 	/* FIFO */
 	pgtable = mm->pmd_huge_pte;
diff --git v3.11-rc3.orig/fs/proc/task_mmu.c v3.11-rc3/fs/proc/task_mmu.c
index dbf61f6..e23c882 100644
--- v3.11-rc3.orig/fs/proc/task_mmu.c
+++ v3.11-rc3/fs/proc/task_mmu.c
@@ -503,11 +503,11 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	struct mem_size_stats *mss = walk->private;
 	struct vm_area_struct *vma = mss->vma;
 	pte_t *pte;
-	spinlock_t *ptl;
+	spinlock_t *uninitialized_var(ptl);
 
-	if (pmd_trans_huge_lock(pmd, vma) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
 		smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_PMD_SIZE, walk);
-		spin_unlock(&walk->mm->page_table_lock);
+		spin_unlock(ptl);
 		mss->anonymous_thp += HPAGE_PMD_SIZE;
 		return 0;
 	}
@@ -980,10 +980,11 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *pte;
 	int err = 0;
 	pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2));
+	spinlock_t *uninitialized_var(ptl);
 
 	/* find the first VMA at or above 'addr' */
 	vma = find_vma(walk->mm, addr);
-	if (vma && pmd_trans_huge_lock(pmd, vma) == 1) {
+	if (vma && pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
 		int pmd_flags2;
 
 		pmd_flags2 = (pmd_soft_dirty(*pmd) ? __PM_SOFT_DIRTY : 0);
@@ -997,7 +998,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			if (err)
 				break;
 		}
-		spin_unlock(&walk->mm->page_table_lock);
+		spin_unlock(ptl);
 		return err;
 	}
 
@@ -1276,13 +1277,13 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 		unsigned long end, struct mm_walk *walk)
 {
 	struct numa_maps *md;
-	spinlock_t *ptl;
+	spinlock_t *uninitialized_var(ptl);
 	pte_t *orig_pte;
 	pte_t *pte;
 
 	md = walk->private;
 
-	if (pmd_trans_huge_lock(pmd, md->vma) == 1) {
+	if (pmd_trans_huge_lock(pmd, md->vma, &ptl) == 1) {
 		pte_t huge_pte = *(pte_t *)pmd;
 		struct page *page;
 
@@ -1290,7 +1291,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 		if (page)
 			gather_stats(page, md, pte_dirty(huge_pte),
 				     HPAGE_PMD_SIZE/PAGE_SIZE);
-		spin_unlock(&walk->mm->page_table_lock);
+		spin_unlock(ptl);
 		return 0;
 	}
 
diff --git v3.11-rc3.orig/include/linux/huge_mm.h v3.11-rc3/include/linux/huge_mm.h
index b60de92..1faf757 100644
--- v3.11-rc3.orig/include/linux/huge_mm.h
+++ v3.11-rc3/include/linux/huge_mm.h
@@ -54,7 +54,8 @@ enum page_check_address_pmd_flag {
 extern pmd_t *page_check_address_pmd(struct page *page,
 				     struct mm_struct *mm,
 				     unsigned long address,
-				     enum page_check_address_pmd_flag flag);
+				     enum page_check_address_pmd_flag flag,
+				     spinlock_t **ptl);
 
 #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
 #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
@@ -133,14 +134,14 @@ extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 				    unsigned long end,
 				    long adjust_next);
 extern int __pmd_trans_huge_lock(pmd_t *pmd,
-				 struct vm_area_struct *vma);
+				 struct vm_area_struct *vma, spinlock_t **ptl);
 /* mmap_sem must be held on entry */
 static inline int pmd_trans_huge_lock(pmd_t *pmd,
-				      struct vm_area_struct *vma)
+				struct vm_area_struct *vma, spinlock_t **ptl)
 {
 	VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
 	if (pmd_trans_huge(*pmd))
-		return __pmd_trans_huge_lock(pmd, vma);
+		return __pmd_trans_huge_lock(pmd, vma, ptl);
 	else
 		return 0;
 }
@@ -219,7 +220,7 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 {
 }
 static inline int pmd_trans_huge_lock(pmd_t *pmd,
-				      struct vm_area_struct *vma)
+				struct vm_area_struct *vma, spinlock_t **ptl)
 {
 	return 0;
 }
diff --git v3.11-rc3.orig/include/linux/mm.h v3.11-rc3/include/linux/mm.h
index f022460..9219f43 100644
--- v3.11-rc3.orig/include/linux/mm.h
+++ v3.11-rc3/include/linux/mm.h
@@ -1251,6 +1251,8 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
 } while (0)
 #define pte_lock_deinit(page)	((page)->mapping = NULL)
 #define pte_lockptr(mm, pmd)	({(void)(mm); __pte_lockptr(pmd_page(*(pmd)));})
+#define huge_pmd_lockptr(mm, pmdp) \
+		({(void)(mm); __pte_lockptr(virt_to_page(pmdp)); })
 #else	/* !USE_SPLIT_PTLOCKS */
 /*
  * We use mm->page_table_lock to guard all pagetable pages of the mm.
@@ -1258,6 +1260,7 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
 #define pte_lock_init(page)	do {} while (0)
 #define pte_lock_deinit(page)	do {} while (0)
 #define pte_lockptr(mm, pmd)	({(void)(pmd); &(mm)->page_table_lock;})
+#define huge_pmd_lockptr(mm, pmdp) ({(void)(pmdp); &(mm)->page_table_lock; })
 #endif /* USE_SPLIT_PTLOCKS */
 
 static inline void pgtable_page_ctor(struct page *page)
diff --git v3.11-rc3.orig/mm/huge_memory.c v3.11-rc3/mm/huge_memory.c
index 243e710..3cb29e1 100644
--- v3.11-rc3.orig/mm/huge_memory.c
+++ v3.11-rc3/mm/huge_memory.c
@@ -705,6 +705,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 					struct page *page)
 {
 	pgtable_t pgtable;
+	spinlock_t *ptl;
 
 	VM_BUG_ON(!PageCompound(page));
 	pgtable = pte_alloc_one(mm, haddr);
@@ -719,9 +720,10 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 	 */
 	__SetPageUptodate(page);
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pmd_lockptr(mm, pmd);
+	spin_lock(ptl);
 	if (unlikely(!pmd_none(*pmd))) {
-		spin_unlock(&mm->page_table_lock);
+		spin_unlock(ptl);
 		mem_cgroup_uncharge_page(page);
 		put_page(page);
 		pte_free(mm, pgtable);
@@ -733,7 +735,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		set_pmd_at(mm, haddr, pmd, entry);
 		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		mm->nr_ptes++;
-		spin_unlock(&mm->page_table_lock);
+		spin_unlock(ptl);
 	}
 
 	return 0;
@@ -761,6 +763,7 @@ static inline struct page *alloc_hugepage(int defrag)
 }
 #endif
 
+/* Caller must hold page table lock. */
 static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
 		struct page *zero_page)
@@ -804,10 +807,11 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				count_vm_event(THP_FAULT_FALLBACK);
 				goto out;
 			}
-			spin_lock(&mm->page_table_lock);
+			ptl = huge_pmd_lockptr(mm, pmd);
+			spin_lock(ptl);
 			set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
 					zero_page);
-			spin_unlock(&mm->page_table_lock);
+			spin_unlock(ptl);
 			if (!set) {
 				pte_free(mm, pgtable);
 				put_huge_zero_page();
@@ -864,14 +868,17 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	pmd_t pmd;
 	pgtable_t pgtable;
 	int ret;
+	spinlock_t *uninitialized_var(dst_ptl), *uninitialized_var(src_ptl);
 
 	ret = -ENOMEM;
 	pgtable = pte_alloc_one(dst_mm, addr);
 	if (unlikely(!pgtable))
 		goto out;
 
-	spin_lock(&dst_mm->page_table_lock);
-	spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING);
+	dst_ptl = huge_pmd_lockptr(dst_mm, dst_pmd);
+	src_ptl = huge_pmd_lockptr(src_mm, src_pmd);
+	spin_lock(dst_ptl);
+	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 
 	ret = -EAGAIN;
 	pmd = *src_pmd;
@@ -880,7 +887,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		goto out_unlock;
 	}
 	/*
-	 * mm->page_table_lock is enough to be sure that huge zero pmd is not
+	 * When page table lock is held, the huge zero pmd should not be
 	 * under splitting since we don't split the page itself, only pmd to
 	 * a page table.
 	 */
@@ -901,8 +908,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	}
 	if (unlikely(pmd_trans_splitting(pmd))) {
 		/* split huge page running from under us */
-		spin_unlock(&src_mm->page_table_lock);
-		spin_unlock(&dst_mm->page_table_lock);
+		spin_unlock(src_ptl);
+		spin_unlock(dst_ptl);
 		pte_free(dst_mm, pgtable);
 
 		wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
@@ -922,8 +929,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 	ret = 0;
 out_unlock:
-	spin_unlock(&src_mm->page_table_lock);
-	spin_unlock(&dst_mm->page_table_lock);
+	spin_unlock(src_ptl);
+	spin_unlock(dst_ptl);
 out:
 	return ret;
 }
@@ -936,8 +943,9 @@ void huge_pmd_set_accessed(struct mm_struct *mm,
 {
 	pmd_t entry;
 	unsigned long haddr;
+	spinlock_t *ptl = huge_pmd_lockptr(mm, pmd);
 
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
 		goto unlock;
 
@@ -947,7 +955,7 @@ void huge_pmd_set_accessed(struct mm_struct *mm,
 		update_mmu_cache_pmd(vma, address, pmd);
 
 unlock:
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 }
 
 static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
@@ -960,6 +968,7 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
 	int i, ret = 0;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
+	spinlock_t *ptl;
 
 	page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
 	if (!page) {
@@ -980,7 +989,8 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pmd_lockptr(mm, pmd);
+	spin_lock(ptl);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
 		goto out_free_page;
 
@@ -1007,7 +1017,7 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
 	}
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(mm, pmd, pgtable);
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	put_huge_zero_page();
 	inc_mm_counter(mm, MM_ANONPAGES);
 
@@ -1017,7 +1027,7 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
 out:
 	return ret;
 out_free_page:
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	mem_cgroup_uncharge_page(page);
 	put_page(page);
@@ -1037,6 +1047,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	struct page **pages;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
+	spinlock_t *ptl;
 
 	pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR,
 			GFP_KERNEL);
@@ -1077,7 +1088,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pmd_lockptr(mm, pmd);
+	spin_lock(ptl);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
 		goto out_free_pages;
 	VM_BUG_ON(!PageHead(page));
@@ -1103,7 +1115,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(mm, pmd, pgtable);
 	page_remove_rmap(page);
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
@@ -1114,7 +1126,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	return ret;
 
 out_free_pages:
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	mem_cgroup_uncharge_start();
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
@@ -1134,12 +1146,13 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long haddr;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
+	spinlock_t *ptl = huge_pmd_lockptr(mm, pmd);
 
 	VM_BUG_ON(!vma->anon_vma);
 	haddr = address & HPAGE_PMD_MASK;
 	if (is_huge_zero_pmd(orig_pmd))
 		goto alloc;
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
 		goto out_unlock;
 
@@ -1155,7 +1168,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_unlock;
 	}
 	get_page(page);
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 alloc:
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow())
@@ -1200,11 +1213,11 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	if (page)
 		put_page(page);
 	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
-		spin_unlock(&mm->page_table_lock);
+		spin_unlock(ptl);
 		mem_cgroup_uncharge_page(new_page);
 		put_page(new_page);
 		goto out_mn;
@@ -1225,13 +1238,13 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 		ret |= VM_FAULT_WRITE;
 	}
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 out_mn:
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 out:
 	return ret;
 out_unlock:
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	return ret;
 }
 
@@ -1240,11 +1253,8 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 				   pmd_t *pmd,
 				   unsigned int flags)
 {
-	struct mm_struct *mm = vma->vm_mm;
 	struct page *page = NULL;
 
-	assert_spin_locked(&mm->page_table_lock);
-
 	if (flags & FOLL_WRITE && !pmd_write(*pmd))
 		goto out;
 
@@ -1295,8 +1305,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	int target_nid;
 	int current_nid = -1;
 	bool migrated;
+	spinlock_t *ptl = huge_pmd_lockptr(mm, pmdp);
 
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
 		goto out_unlock;
 
@@ -1314,17 +1325,17 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/* Acquire the page lock to serialise THP migrations */
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	lock_page(page);
 
 	/* Confirm the PTE did not while locked */
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	if (unlikely(!pmd_same(pmd, *pmdp))) {
 		unlock_page(page);
 		put_page(page);
 		goto out_unlock;
 	}
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 
 	/* Migrate the THP to the requested node */
 	migrated = migrate_misplaced_transhuge_page(mm, vma,
@@ -1336,7 +1347,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	return 0;
 
 check_same:
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
 		goto out_unlock;
 clear_pmdnuma:
@@ -1345,7 +1356,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	VM_BUG_ON(pmd_numa(*pmdp));
 	update_mmu_cache_pmd(vma, addr, pmdp);
 out_unlock:
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	if (current_nid != -1)
 		task_numa_fault(current_nid, HPAGE_PMD_NR, false);
 	return 0;
@@ -1355,8 +1366,9 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pmd_t *pmd, unsigned long addr)
 {
 	int ret = 0;
+	spinlock_t *uninitialized_var(ptl);
 
-	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
+	if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
 		struct page *page;
 		pgtable_t pgtable;
 		pmd_t orig_pmd;
@@ -1371,7 +1383,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
 		if (is_huge_zero_pmd(orig_pmd)) {
 			tlb->mm->nr_ptes--;
-			spin_unlock(&tlb->mm->page_table_lock);
+			spin_unlock(ptl);
 			put_huge_zero_page();
 		} else {
 			page = pmd_page(orig_pmd);
@@ -1380,7 +1392,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
 			VM_BUG_ON(!PageHead(page));
 			tlb->mm->nr_ptes--;
-			spin_unlock(&tlb->mm->page_table_lock);
+			spin_unlock(ptl);
 			tlb_remove_page(tlb, page);
 		}
 		pte_free(tlb->mm, pgtable);
@@ -1394,13 +1406,14 @@ int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned char *vec)
 {
 	int ret = 0;
+	spinlock_t *uninitialized_var(ptl);
 
-	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
+	if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
 		/*
 		 * All logical pages in the range are present
 		 * if backed by a huge page.
 		 */
-		spin_unlock(&vma->vm_mm->page_table_lock);
+		spin_unlock(ptl);
 		memset(vec, 1, (end - addr) >> PAGE_SHIFT);
 		ret = 1;
 	}
@@ -1415,6 +1428,7 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 {
 	int ret = 0;
 	pmd_t pmd;
+	spinlock_t *uninitialized_var(ptl);
 
 	struct mm_struct *mm = vma->vm_mm;
 
@@ -1433,12 +1447,12 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 		goto out;
 	}
 
-	ret = __pmd_trans_huge_lock(old_pmd, vma);
+	ret = __pmd_trans_huge_lock(old_pmd, vma, &ptl);
 	if (ret == 1) {
 		pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
 		VM_BUG_ON(!pmd_none(*new_pmd));
 		set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
-		spin_unlock(&mm->page_table_lock);
+		spin_unlock(ptl);
 	}
 out:
 	return ret;
@@ -1449,8 +1463,9 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	int ret = 0;
+	spinlock_t *uninitialized_var(ptl);
 
-	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
+	if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
 		pmd_t entry;
 		entry = pmdp_get_and_clear(mm, addr, pmd);
 		if (!prot_numa) {
@@ -1466,7 +1481,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			}
 		}
 		set_pmd_at(mm, addr, pmd, entry);
-		spin_unlock(&vma->vm_mm->page_table_lock);
+		spin_unlock(ptl);
 		ret = 1;
 	}
 
@@ -1480,12 +1495,14 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
  * Note that if it returns 1, this routine returns without unlocking page
  * table locks. So callers must unlock them.
  */
-int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
+int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
+			  spinlock_t **ptl)
 {
-	spin_lock(&vma->vm_mm->page_table_lock);
+	*ptl = huge_pmd_lockptr(vma->vm_mm, pmd);
+	spin_lock(*ptl);
 	if (likely(pmd_trans_huge(*pmd))) {
 		if (unlikely(pmd_trans_splitting(*pmd))) {
-			spin_unlock(&vma->vm_mm->page_table_lock);
+			spin_unlock(*ptl);
 			wait_split_huge_page(vma->anon_vma, pmd);
 			return -1;
 		} else {
@@ -1494,14 +1511,23 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
 			return 1;
 		}
 	}
-	spin_unlock(&vma->vm_mm->page_table_lock);
+	spin_unlock(*ptl);
 	return 0;
 }
 
+/*
+ * This function checks whether a given @page is mapped at @address in the
+ * virtual address space of @mm.
+ *
+ * If it is, the pmd is returned with the page table lock held, and the
+ * lock is passed back to the caller via @ptl.
+ * If it is not, NULL is returned and no page table lock is held.
+ */
 pmd_t *page_check_address_pmd(struct page *page,
 			      struct mm_struct *mm,
 			      unsigned long address,
-			      enum page_check_address_pmd_flag flag)
+			      enum page_check_address_pmd_flag flag,
+			      spinlock_t **ptl)
 {
 	pmd_t *pmd, *ret = NULL;
 
@@ -1511,10 +1537,12 @@ pmd_t *page_check_address_pmd(struct page *page,
 	pmd = mm_find_pmd(mm, address);
 	if (!pmd)
 		goto out;
+	*ptl = huge_pmd_lockptr(mm, pmd);
+	spin_lock(*ptl);
 	if (pmd_none(*pmd))
-		goto out;
+		goto unlock;
 	if (pmd_page(*pmd) != page)
-		goto out;
+		goto unlock;
 	/*
 	 * split_vma() may create temporary aliased mappings. There is
 	 * no risk as long as all huge pmd are found and have their
@@ -1524,12 +1552,15 @@ pmd_t *page_check_address_pmd(struct page *page,
 	 */
 	if (flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
 	    pmd_trans_splitting(*pmd))
-		goto out;
+		goto unlock;
 	if (pmd_trans_huge(*pmd)) {
 		VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
 			  !pmd_trans_splitting(*pmd));
 		ret = pmd;
+		goto out;
 	}
+unlock:
+	spin_unlock(*ptl);
 out:
 	return ret;
 }
@@ -1541,14 +1572,15 @@ static int __split_huge_page_splitting(struct page *page,
 	struct mm_struct *mm = vma->vm_mm;
 	pmd_t *pmd;
 	int ret = 0;
+	spinlock_t *uninitialized_var(ptl);
 	/* For mmu_notifiers */
 	const unsigned long mmun_start = address;
 	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
 
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-	spin_lock(&mm->page_table_lock);
+
 	pmd = page_check_address_pmd(page, mm, address,
-				     PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
+				PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
 	if (pmd) {
 		/*
 		 * We can't temporarily set the pmd to null in order
@@ -1559,8 +1591,8 @@ static int __split_huge_page_splitting(struct page *page,
 		 */
 		pmdp_splitting_flush(vma, address, pmd);
 		ret = 1;
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
 	return ret;
@@ -1694,10 +1726,10 @@ static int __split_huge_page_map(struct page *page,
 	int ret = 0, i;
 	pgtable_t pgtable;
 	unsigned long haddr;
+	spinlock_t *uninitialized_var(ptl);
 
-	spin_lock(&mm->page_table_lock);
 	pmd = page_check_address_pmd(page, mm, address,
-				     PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
+				PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG, &ptl);
 	if (pmd) {
 		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
 		pmd_populate(mm, &_pmd, pgtable);
@@ -1752,8 +1784,8 @@ static int __split_huge_page_map(struct page *page,
 		pmdp_invalidate(vma, address, pmd);
 		pmd_populate(mm, pmd, pgtable);
 		ret = 1;
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 
 	return ret;
 }
@@ -2314,7 +2346,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	mmun_start = address;
 	mmun_end   = address + HPAGE_PMD_SIZE;
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-	spin_lock(&mm->page_table_lock); /* probably unnecessary */
+	spin_lock(ptl); /* probably unnecessary */
 	/*
 	 * After this gup_fast can't run anymore. This also removes
 	 * any huge TLB entry from the CPU so we won't allow
@@ -2322,7 +2354,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 * to avoid the risk of CPU bugs in that area.
 	 */
 	_pmd = pmdp_clear_flush(vma, address, pmd);
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
 	spin_lock(ptl);
@@ -2331,7 +2363,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	if (unlikely(!isolated)) {
 		pte_unmap(pte);
-		spin_lock(&mm->page_table_lock);
+		spin_lock(ptl);
 		BUG_ON(!pmd_none(*pmd));
 		/*
 		 * We can only use set_pmd_at when establishing
@@ -2339,7 +2371,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 		 * points to regular pagetables. Use pmd_populate for that
 		 */
 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
-		spin_unlock(&mm->page_table_lock);
+		spin_unlock(ptl);
 		anon_vma_unlock_write(vma->anon_vma);
 		goto out;
 	}
@@ -2364,13 +2396,13 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	smp_wmb();
 
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	BUG_ON(!pmd_none(*pmd));
 	page_add_new_anon_rmap(new_page, vma, address);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, address, pmd, _pmd);
 	update_mmu_cache_pmd(vma, address, pmd);
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 
 	*hpage = NULL;
 
@@ -2698,6 +2730,7 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 	struct page *page;
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long haddr = address & HPAGE_PMD_MASK;
+	spinlock_t *ptl;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 
@@ -2706,22 +2739,23 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pmd_lockptr(mm, pmd);
+	spin_lock(ptl);
 	if (unlikely(!pmd_trans_huge(*pmd))) {
-		spin_unlock(&mm->page_table_lock);
+		spin_unlock(ptl);
 		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 		return;
 	}
 	if (is_huge_zero_pmd(*pmd)) {
 		__split_huge_zero_page_pmd(vma, haddr, pmd);
-		spin_unlock(&mm->page_table_lock);
+		spin_unlock(ptl);
 		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 		return;
 	}
 	page = pmd_page(*pmd);
 	VM_BUG_ON(!page_count(page));
 	get_page(page);
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
 	split_huge_page(page);
diff --git v3.11-rc3.orig/mm/memcontrol.c v3.11-rc3/mm/memcontrol.c
index 00a7a66..3949444 100644
--- v3.11-rc3.orig/mm/memcontrol.c
+++ v3.11-rc3/mm/memcontrol.c
@@ -6591,12 +6591,12 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 {
 	struct vm_area_struct *vma = walk->private;
 	pte_t *pte;
-	spinlock_t *ptl;
+	spinlock_t *uninitialized_var(ptl);
 
-	if (pmd_trans_huge_lock(pmd, vma) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
 		if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
 			mc.precharge += HPAGE_PMD_NR;
-		spin_unlock(&vma->vm_mm->page_table_lock);
+		spin_unlock(ptl);
 		return 0;
 	}
 
@@ -6769,7 +6769,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	int ret = 0;
 	struct vm_area_struct *vma = walk->private;
 	pte_t *pte;
-	spinlock_t *ptl;
+	spinlock_t *uninitialized_var(ptl);
 	enum mc_target_type target_type;
 	union mc_target target;
 	struct page *page;
@@ -6785,9 +6785,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	 *    to be unlocked in __split_huge_page_splitting(), where the main
 	 *    part of thp split is not executed yet.
 	 */
-	if (pmd_trans_huge_lock(pmd, vma) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
 		if (mc.precharge < HPAGE_PMD_NR) {
-			spin_unlock(&vma->vm_mm->page_table_lock);
+			spin_unlock(ptl);
 			return 0;
 		}
 		target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
@@ -6804,7 +6804,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 			}
 			put_page(page);
 		}
-		spin_unlock(&vma->vm_mm->page_table_lock);
+		spin_unlock(ptl);
 		return 0;
 	}
 
diff --git v3.11-rc3.orig/mm/memory.c v3.11-rc3/mm/memory.c
index 7ec1252..6827a35 100644
--- v3.11-rc3.orig/mm/memory.c
+++ v3.11-rc3/mm/memory.c
@@ -1531,20 +1531,21 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 			split_huge_page_pmd(vma, address, pmd);
 			goto split_fallthrough;
 		}
-		spin_lock(&mm->page_table_lock);
+		ptl = huge_pmd_lockptr(mm, pmd);
+		spin_lock(ptl);
 		if (likely(pmd_trans_huge(*pmd))) {
 			if (unlikely(pmd_trans_splitting(*pmd))) {
-				spin_unlock(&mm->page_table_lock);
+				spin_unlock(ptl);
 				wait_split_huge_page(vma->anon_vma, pmd);
 			} else {
 				page = follow_trans_huge_pmd(vma, address,
 							     pmd, flags);
-				spin_unlock(&mm->page_table_lock);
+				spin_unlock(ptl);
 				*page_mask = HPAGE_PMD_NR - 1;
 				goto out;
 			}
 		} else
-			spin_unlock(&mm->page_table_lock);
+			spin_unlock(ptl);
 		/* fall through */
 	}
 split_fallthrough:
@@ -3609,17 +3610,17 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_t *pte, *orig_pte;
 	unsigned long _addr = addr & PMD_MASK;
 	unsigned long offset;
-	spinlock_t *ptl;
+	spinlock_t *ptl = huge_pmd_lockptr(mm, pmdp);
 	bool numa = false;
 	int local_nid = numa_node_id();
 
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	pmd = *pmdp;
 	if (pmd_numa(pmd)) {
 		set_pmd_at(mm, _addr, pmdp, pmd_mknonnuma(pmd));
 		numa = true;
 	}
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 
 	if (!numa)
 		return 0;
diff --git v3.11-rc3.orig/mm/migrate.c v3.11-rc3/mm/migrate.c
index c69a9c7..1e1e9f2 100644
--- v3.11-rc3.orig/mm/migrate.c
+++ v3.11-rc3/mm/migrate.c
@@ -1659,6 +1659,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	struct page *new_page = NULL;
 	struct mem_cgroup *memcg = NULL;
 	int page_lru = page_is_file_cache(page);
+	spinlock_t *ptl;
 
 	/*
 	 * Don't migrate pages that are mapped in multiple processes.
@@ -1699,9 +1700,10 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pmd_lockptr(mm, pmd);
+	spin_lock(ptl);
 	if (unlikely(!pmd_same(*pmd, entry))) {
-		spin_unlock(&mm->page_table_lock);
+		spin_unlock(ptl);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -1746,7 +1748,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	 * before it's fully transferred to the new page.
 	 */
 	mem_cgroup_end_migration(memcg, page, new_page, true);
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 
 	unlock_page(new_page);
 	unlock_page(page);
diff --git v3.11-rc3.orig/mm/mprotect.c v3.11-rc3/mm/mprotect.c
index 94722a4..c65c390 100644
--- v3.11-rc3.orig/mm/mprotect.c
+++ v3.11-rc3/mm/mprotect.c
@@ -116,9 +116,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 static inline void change_pmd_protnuma(struct mm_struct *mm, unsigned long addr,
 				       pmd_t *pmd)
 {
-	spin_lock(&mm->page_table_lock);
+	spinlock_t *ptl = huge_pmd_lockptr(mm, pmd);
+	spin_lock(ptl);
 	set_pmd_at(mm, addr & PMD_MASK, pmd, pmd_mknuma(*pmd));
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 }
 #else
 static inline void change_pmd_protnuma(struct mm_struct *mm, unsigned long addr,
diff --git v3.11-rc3.orig/mm/pgtable-generic.c v3.11-rc3/mm/pgtable-generic.c
index e1a6e4f..8e49928 100644
--- v3.11-rc3.orig/mm/pgtable-generic.c
+++ v3.11-rc3/mm/pgtable-generic.c
@@ -124,11 +124,10 @@ void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
 
 #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/* The caller must hold page table lock */
 void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 				pgtable_t pgtable)
 {
-	assert_spin_locked(&mm->page_table_lock);
-
 	/* FIFO */
 	if (!mm->pmd_huge_pte)
 		INIT_LIST_HEAD(&pgtable->lru);
@@ -141,13 +140,14 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 
 #ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-/* no "address" argument so destroys page coloring of some arch */
+/*
+ * no "address" argument so destroys page coloring of some arch
+ * The caller must hold page table lock.
+ */
 pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 {
 	pgtable_t pgtable;
 
-	assert_spin_locked(&mm->page_table_lock);
-
 	/* FIFO */
 	pgtable = mm->pmd_huge_pte;
 	if (list_empty(&pgtable->lru))
diff --git v3.11-rc3.orig/mm/rmap.c v3.11-rc3/mm/rmap.c
index eccec58..798f6ae 100644
--- v3.11-rc3.orig/mm/rmap.c
+++ v3.11-rc3/mm/rmap.c
@@ -666,24 +666,24 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	int referenced = 0;
+	spinlock_t *uninitialized_var(ptl);
 
 	if (unlikely(PageTransHuge(page))) {
 		pmd_t *pmd;
 
-		spin_lock(&mm->page_table_lock);
 		/*
 		 * rmap might return false positives; we must filter
 		 * these out using page_check_address_pmd().
 		 */
 		pmd = page_check_address_pmd(page, mm, address,
-					     PAGE_CHECK_ADDRESS_PMD_FLAG);
+					PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
 		if (!pmd) {
-			spin_unlock(&mm->page_table_lock);
+			spin_unlock(ptl);
 			goto out;
 		}
 
 		if (vma->vm_flags & VM_LOCKED) {
-			spin_unlock(&mm->page_table_lock);
+			spin_unlock(ptl);
 			*mapcount = 0;	/* break early from loop */
 			*vm_flags |= VM_LOCKED;
 			goto out;
@@ -692,10 +692,9 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		/* go ahead even if the pmd is pmd_trans_splitting() */
 		if (pmdp_clear_flush_young_notify(vma, address, pmd))
 			referenced++;
-		spin_unlock(&mm->page_table_lock);
+		spin_unlock(ptl);
 	} else {
 		pte_t *pte;
-		spinlock_t *ptl;
 
 		/*
 		 * rmap might return false positives; we must filter
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* RE: [PATCH 2/2] thp: support split page table lock
  2013-08-30 17:18 ` [PATCH 2/2] thp: " Naoya Horiguchi
@ 2013-09-02 10:53   ` Kirill A. Shutemov
  2013-09-02 16:37     ` Naoya Horiguchi
  2013-09-03 20:52   ` Naoya Horiguchi
  1 sibling, 1 reply; 17+ messages in thread
From: Kirill A. Shutemov @ 2013-09-02 10:53 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, Andrea Arcangeli, kirill.shutemov,
	Aneesh Kumar K.V, Alex Thorlton, linux-kernel

Naoya Horiguchi wrote:
> Thp related code also uses per process mm->page_table_lock now. So making
> it fine-grained can provide better performance.
> 
> This patch makes thp support split page table lock which makes us use
> page->ptl of the pages storing "pmd_trans_huge" pmds.

Hm. So, you use page->ptl only when you deal with thp pages, otherwise
mm->page_table_lock, right?

It looks inconsistent to me. Does it mean we have to take both locks on
split and collapse paths? I'm not sure if it's safe to take only
page->ptl for alloc path. Probably not.

Why not use the new locking for pmd everywhere?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] thp: support split page table lock
  2013-09-02 10:53   ` Kirill A. Shutemov
@ 2013-09-02 16:37     ` Naoya Horiguchi
  0 siblings, 0 replies; 17+ messages in thread
From: Naoya Horiguchi @ 2013-09-02 16:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, Andrea Arcangeli,
	Aneesh Kumar K.V, Alex Thorlton, linux-kernel

Kirill, thank you for the comment.

On Mon, Sep 02, 2013 at 01:53:27PM +0300, Kirill A. Shutemov wrote:
> Naoya Horiguchi wrote:
> > Thp related code also uses per process mm->page_table_lock now. So making
> > it fine-grained can provide better performance.
> > 
> > This patch makes thp support split page table lock which makes us use
> > page->ptl of the pages storing "pmd_trans_huge" pmds.
> 
> Hm. So, you use page->ptl only when you deal with thp pages, otherwise
> mm->page_table_lock, right?

Maybe that is not the complete picture.
We use page->ptl for both thp and normal pages, depending on USE_SPLIT_PTLOCKS.
And regardless of USE_SPLIT_PTLOCKS, mm->page_table_lock is still used
by other contexts, such as memory initialization code or driver code, for
their own specific purposes.
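
To illustrate (only a rough sketch of the idea: it mirrors the hugetlb
helper from patch 1/2, and the thp helper is assumed to be analogous, so
the names below are not necessarily the exact patch code):

#if USE_SPLIT_PTLOCKS
/* lock embedded in the struct page of the page table page holding the entry */
#define huge_pmd_lockptr(mm, pmdp)	({__pte_lockptr(virt_to_page(pmdp)); })
#else
/* fall back to the single per-mm lock */
#define huge_pmd_lockptr(mm, pmdp)	({&(mm)->page_table_lock; })
#endif

So whether the split lock is actually used is decided only by
USE_SPLIT_PTLOCKS, not by whether the mapping is thp or not.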

> It looks inconsistent to me. Does it mean we have to take both locks on
> split and collapse paths?

This patch already converts the split/collapse paths to page->ptl as well.

> I'm not sure if it's safe to take only
> page->ptl for alloc path. Probably not.

Right, it's not safe.

> Why not use the new locking for pmd everywhere?

That is what the patch already does.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] thp: support split page table lock
  2013-08-30 17:18 ` [PATCH 2/2] thp: " Naoya Horiguchi
  2013-09-02 10:53   ` Kirill A. Shutemov
@ 2013-09-03 20:52   ` Naoya Horiguchi
  1 sibling, 0 replies; 17+ messages in thread
From: Naoya Horiguchi @ 2013-09-03 20:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, Andrea Arcangeli, kirill.shutemov,
	Aneesh Kumar K.V, Alex Thorlton, linux-kernel

> diff --git v3.11-rc3.orig/mm/huge_memory.c v3.11-rc3/mm/huge_memory.c
> index 243e710..3cb29e1 100644
> --- v3.11-rc3.orig/mm/huge_memory.c
> +++ v3.11-rc3/mm/huge_memory.c
...
> @@ -864,14 +868,17 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	pmd_t pmd;
>  	pgtable_t pgtable;
>  	int ret;
> +	spinlock_t *uninitialized_var(dst_ptl), *uninitialized_var(src_ptl);
>  
>  	ret = -ENOMEM;
>  	pgtable = pte_alloc_one(dst_mm, addr);
>  	if (unlikely(!pgtable))
>  		goto out;
>  
> -	spin_lock(&dst_mm->page_table_lock);
> -	spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING);
> +	dst_ptl = huge_pmd_lockptr(dst_mm, dst_ptl);
> +	src_ptl = huge_pmd_lockptr(src_mm, src_ptl);

I found one mistake. This should be:

+	dst_ptl = huge_pmd_lockptr(dst_mm, dst_pmd);
+	src_ptl = huge_pmd_lockptr(src_mm, src_pmd);

Thanks,
Naoya Horiguchi

> +	spin_lock(dst_ptl);
> +	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
>  
>  	ret = -EAGAIN;
>  	pmd = *src_pmd;

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] hugetlbfs: support split page table lock
  2013-08-30 17:18 ` [PATCH 1/2] hugetlbfs: support split page table lock Naoya Horiguchi
@ 2013-09-04  7:13   ` Aneesh Kumar K.V
  2013-09-04 16:32     ` Naoya Horiguchi
  0 siblings, 1 reply; 17+ messages in thread
From: Aneesh Kumar K.V @ 2013-09-04  7:13 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, Andrea Arcangeli, kirill.shutemov,
	Alex Thorlton, linux-kernel

Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:

> Currently all of page table handling by hugetlbfs code are done under
> mm->page_table_lock. So when a process have many threads and they heavily
> access to the memory, lock contention happens and impacts the performance.
>
> This patch makes hugepage support split page table lock so that we use
> page->ptl of the leaf node of page table tree which is pte for normal pages
> but can be pmd and/or pud for hugepages of some architectures.
>
> ChangeLog v2:
>  - add split ptl on other archs missed in v1
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
>  arch/powerpc/mm/hugetlbpage.c |  6 ++-
>  arch/tile/mm/hugetlbpage.c    |  6 ++-
>  include/linux/hugetlb.h       | 20 ++++++++++
>  mm/hugetlb.c                  | 92 ++++++++++++++++++++++++++-----------------
>  mm/mempolicy.c                |  5 ++-
>  mm/migrate.c                  |  4 +-
>  mm/rmap.c                     |  2 +-
>  7 files changed, 90 insertions(+), 45 deletions(-)
>
> diff --git v3.11-rc3.orig/arch/powerpc/mm/hugetlbpage.c v3.11-rc3/arch/powerpc/mm/hugetlbpage.c
> index d67db4b..7e56cb7 100644
> --- v3.11-rc3.orig/arch/powerpc/mm/hugetlbpage.c
> +++ v3.11-rc3/arch/powerpc/mm/hugetlbpage.c
> @@ -124,6 +124,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
>  {
>  	struct kmem_cache *cachep;
>  	pte_t *new;
> +	spinlock_t *ptl;
>
>  #ifdef CONFIG_PPC_FSL_BOOK3E
>  	int i;
> @@ -141,7 +142,8 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
>  	if (! new)
>  		return -ENOMEM;
>
> -	spin_lock(&mm->page_table_lock);
> +	ptl = huge_pte_lockptr(mm, new);
> +	spin_lock(ptl);


Are you sure we can do that for ppc ?
	new = kmem_cache_zalloc(cachep, GFP_KERNEL|__GFP_REPEAT);

The page for new(pte_t) could be shared, right ? which means a deadlock ?

Maybe you should do it at the pmd level itself for ppc.

>  #ifdef CONFIG_PPC_FSL_BOOK3E
>  	/*
>  	 * We have multiple higher-level entries that point to the same
> @@ -174,7 +176,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
>  #endif
>  	}
>  #endif
> -	spin_unlock(&mm->page_table_lock);
> +	spin_unlock(ptl);
>  	return 0;
>  }
>


-aneesh


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] hugetlbfs: support split page table lock
  2013-09-04  7:13   ` Aneesh Kumar K.V
@ 2013-09-04 16:32     ` Naoya Horiguchi
  2013-09-05  9:18       ` Aneesh Kumar K.V
  0 siblings, 1 reply; 17+ messages in thread
From: Naoya Horiguchi @ 2013-09-04 16:32 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, Andrea Arcangeli, kirill.shutemov,
	Alex Thorlton, linux-kernel

Hi Aneesh,

On Wed, Sep 04, 2013 at 12:43:19PM +0530, Aneesh Kumar K.V wrote:
> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:
> 
> > Currently all of page table handling by hugetlbfs code are done under
> > mm->page_table_lock. So when a process have many threads and they heavily
> > access to the memory, lock contention happens and impacts the performance.
> >
> > This patch makes hugepage support split page table lock so that we use
> > page->ptl of the leaf node of page table tree which is pte for normal pages
> > but can be pmd and/or pud for hugepages of some architectures.
> >
> > ChangeLog v2:
> >  - add split ptl on other archs missed in v1
> >
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > ---
> >  arch/powerpc/mm/hugetlbpage.c |  6 ++-
> >  arch/tile/mm/hugetlbpage.c    |  6 ++-
> >  include/linux/hugetlb.h       | 20 ++++++++++
> >  mm/hugetlb.c                  | 92 ++++++++++++++++++++++++++-----------------
> >  mm/mempolicy.c                |  5 ++-
> >  mm/migrate.c                  |  4 +-
> >  mm/rmap.c                     |  2 +-
> >  7 files changed, 90 insertions(+), 45 deletions(-)
> >
> > diff --git v3.11-rc3.orig/arch/powerpc/mm/hugetlbpage.c v3.11-rc3/arch/powerpc/mm/hugetlbpage.c
> > index d67db4b..7e56cb7 100644
> > --- v3.11-rc3.orig/arch/powerpc/mm/hugetlbpage.c
> > +++ v3.11-rc3/arch/powerpc/mm/hugetlbpage.c
> > @@ -124,6 +124,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
> >  {
> >  	struct kmem_cache *cachep;
> >  	pte_t *new;
> > +	spinlock_t *ptl;
> >
> >  #ifdef CONFIG_PPC_FSL_BOOK3E
> >  	int i;
> > @@ -141,7 +142,8 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
> >  	if (! new)
> >  		return -ENOMEM;
> >
> > -	spin_lock(&mm->page_table_lock);
> > +	ptl = huge_pte_lockptr(mm, new);
> > +	spin_lock(ptl);
> 
> 
> Are you sure we can do that for ppc ?
> 	new = kmem_cache_zalloc(cachep, GFP_KERNEL|__GFP_REPEAT);

Ah, thanks. 'new' is not a pointer to a full page occupied by page
table entries, so trying to use its struct page is totally wrong.

> The page for new(pte_t) could be shared right ? which mean a deadlock ?

Yes, that's disastrous.

> May be you should do it at the pmd level itself for ppc

Yes, that's possible, but I will simply drop the changes in __hugepte_alloc()
for now, because this lock seems to protect us from races between concurrent
calls of __hugepte_alloc(), not between allocation and read/write access.
The split ptl is there to avoid races between read/write accesses, so I think
using different types of locks here is not dangerous.
# I guess that's why we still use mm->page_table_lock for __pte_alloc()
# and its family even when USE_SPLIT_PTLOCKS is true.
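
As an illustration of that reasoning (a sketch only, not the actual ppc
code; the two helpers below are made up just to show the two locking
disciplines side by side):

/* allocation side, left as is: it only races with other allocators */
static void alloc_side_sketch(struct mm_struct *mm)
{
	spin_lock(&mm->page_table_lock);
	/* install the newly allocated table into the hugepd entry here */
	spin_unlock(&mm->page_table_lock);
}

/* access side, converted by this patchset: it races with other accessors */
static void access_side_sketch(struct mm_struct *mm, pte_t *ptep)
{
	spinlock_t *ptl = huge_pte_lockptr(mm, ptep);

	spin_lock(ptl);
	/* read or update the entry here */
	spin_unlock(ptl);
}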

A bit off-topic, but I found that we have a bogus comment on
hugetlb_free_pgd_range in arch/powerpc/mm/hugetlbpage.c saying
"Must be called with pagetable lock held."
This does not seem to be true, because the caller free_pgtables() and its
callers (unmap_region() and exit_mmap()) never hold it.
I guess it was just copied from free_pgd_range(), where it is false as
well. I'll post a patch to remove it later.

Anyway, thank you for valuable comments!

Thanks,
Naoya Horiguchi

> >  #ifdef CONFIG_PPC_FSL_BOOK3E
> >  	/*
> >  	 * We have multiple higher-level entries that point to the same
> > @@ -174,7 +176,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
> >  #endif
> >  	}
> >  #endif
> > -	spin_unlock(&mm->page_table_lock);
> > +	spin_unlock(ptl);
> >  	return 0;
> >  }
> >
> 
> 
> -aneesh
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] hugetlbfs: support split page table lock
  2013-09-04 16:32     ` Naoya Horiguchi
@ 2013-09-05  9:18       ` Aneesh Kumar K.V
  2013-09-05 15:23         ` Naoya Horiguchi
  0 siblings, 1 reply; 17+ messages in thread
From: Aneesh Kumar K.V @ 2013-09-05  9:18 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, Andrea Arcangeli, kirill.shutemov,
	Alex Thorlton, linux-kernel

Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:

> Hi Aneesh,
>
> On Wed, Sep 04, 2013 at 12:43:19PM +0530, Aneesh Kumar K.V wrote:
>> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:
>> 
>> > Currently all of page table handling by hugetlbfs code are done under
>> > mm->page_table_lock. So when a process have many threads and they heavily
>> > access to the memory, lock contention happens and impacts the performance.
>> >
>> > This patch makes hugepage support split page table lock so that we use
>> > page->ptl of the leaf node of page table tree which is pte for normal pages
>> > but can be pmd and/or pud for hugepages of some architectures.
>> >
>> > ChangeLog v2:
>> >  - add split ptl on other archs missed in v1
>> >
>> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>> > ---
>> >  arch/powerpc/mm/hugetlbpage.c |  6 ++-
>> >  arch/tile/mm/hugetlbpage.c    |  6 ++-
>> >  include/linux/hugetlb.h       | 20 ++++++++++
>> >  mm/hugetlb.c                  | 92 ++++++++++++++++++++++++++-----------------
>> >  mm/mempolicy.c                |  5 ++-
>> >  mm/migrate.c                  |  4 +-
>> >  mm/rmap.c                     |  2 +-
>> >  7 files changed, 90 insertions(+), 45 deletions(-)
>> >
>> > diff --git v3.11-rc3.orig/arch/powerpc/mm/hugetlbpage.c v3.11-rc3/arch/powerpc/mm/hugetlbpage.c
>> > index d67db4b..7e56cb7 100644
>> > --- v3.11-rc3.orig/arch/powerpc/mm/hugetlbpage.c
>> > +++ v3.11-rc3/arch/powerpc/mm/hugetlbpage.c
>> > @@ -124,6 +124,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
>> >  {
>> >  	struct kmem_cache *cachep;
>> >  	pte_t *new;
>> > +	spinlock_t *ptl;
>> >
>> >  #ifdef CONFIG_PPC_FSL_BOOK3E
>> >  	int i;
>> > @@ -141,7 +142,8 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
>> >  	if (! new)
>> >  		return -ENOMEM;
>> >
>> > -	spin_lock(&mm->page_table_lock);
>> > +	ptl = huge_pte_lockptr(mm, new);
>> > +	spin_lock(ptl);
>> 
>> 
>> Are you sure we can do that for ppc ?
>> 	new = kmem_cache_zalloc(cachep, GFP_KERNEL|__GFP_REPEAT);
>
> Ah, thanks. new is not a pointer to one full page occupied by page
> table entries, so trying to use struct page of it is totally wrong.
>
>> The page for new(pte_t) could be shared right ? which mean a deadlock ?
>
> Yes, that's disastrous.
>
>> May be you should do it at the pmd level itself for ppc

The pgd page also cannot be used, because the pgd also comes from a kmem
cache.

>
> Yes, that's possible, but I simply drop the changes in __hugepte_alloc()
> for now because this lock seems to protect us from the race between concurrent
> calls of __hugepte_alloc(), not between allocation and read/write access.
> Split ptl is used to avoid race between read/write accesses, so I think
> that using different types of locks here is not dangerous.
> # I guess that that's why we now use mm->page_table_lock for __pte_alloc()
> # and its family even if USE_SPLIT_PTLOCKS is true.

A simpler approach could be to make huge_pte_lockptr arch-specific and
leave it as mm->page_table_lock for ppc.
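
A rough sketch of what I mean (the header placement and the override
mechanism are only assumptions, not tested code):

/* e.g. in the powerpc hugetlb headers: keep the per-mm lock */
#define huge_pte_lockptr(mm, ptep)	({&(mm)->page_table_lock; })

/* generic definition, used only if the arch did not provide its own */
#ifndef huge_pte_lockptr
#if USE_SPLIT_PTLOCKS
#define huge_pte_lockptr(mm, ptep)	({__pte_lockptr(virt_to_page(ptep)); })
#else
#define huge_pte_lockptr(mm, ptep)	({&(mm)->page_table_lock; })
#endif
#endif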


-aneesh


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] hugetlbfs: support split page table lock
  2013-09-05  9:18       ` Aneesh Kumar K.V
@ 2013-09-05 15:23         ` Naoya Horiguchi
  0 siblings, 0 replies; 17+ messages in thread
From: Naoya Horiguchi @ 2013-09-05 15:23 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, Andrea Arcangeli, kirill.shutemov,
	Alex Thorlton, linux-kernel

On Thu, Sep 05, 2013 at 02:48:18PM +0530, Aneesh Kumar K.V wrote:
> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:
> 
> > Hi Aneesh,
> >
> > On Wed, Sep 04, 2013 at 12:43:19PM +0530, Aneesh Kumar K.V wrote:
> >> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:
> >> 
> >> > Currently all of page table handling by hugetlbfs code are done under
> >> > mm->page_table_lock. So when a process have many threads and they heavily
> >> > access to the memory, lock contention happens and impacts the performance.
> >> >
> >> > This patch makes hugepage support split page table lock so that we use
> >> > page->ptl of the leaf node of page table tree which is pte for normal pages
> >> > but can be pmd and/or pud for hugepages of some architectures.
> >> >
> >> > ChangeLog v2:
> >> >  - add split ptl on other archs missed in v1
> >> >
> >> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> >> > ---
> >> >  arch/powerpc/mm/hugetlbpage.c |  6 ++-
> >> >  arch/tile/mm/hugetlbpage.c    |  6 ++-
> >> >  include/linux/hugetlb.h       | 20 ++++++++++
> >> >  mm/hugetlb.c                  | 92 ++++++++++++++++++++++++++-----------------
> >> >  mm/mempolicy.c                |  5 ++-
> >> >  mm/migrate.c                  |  4 +-
> >> >  mm/rmap.c                     |  2 +-
> >> >  7 files changed, 90 insertions(+), 45 deletions(-)
> >> >
> >> > diff --git v3.11-rc3.orig/arch/powerpc/mm/hugetlbpage.c v3.11-rc3/arch/powerpc/mm/hugetlbpage.c
> >> > index d67db4b..7e56cb7 100644
> >> > --- v3.11-rc3.orig/arch/powerpc/mm/hugetlbpage.c
> >> > +++ v3.11-rc3/arch/powerpc/mm/hugetlbpage.c
> >> > @@ -124,6 +124,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
> >> >  {
> >> >  	struct kmem_cache *cachep;
> >> >  	pte_t *new;
> >> > +	spinlock_t *ptl;
> >> >
> >> >  #ifdef CONFIG_PPC_FSL_BOOK3E
> >> >  	int i;
> >> > @@ -141,7 +142,8 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
> >> >  	if (! new)
> >> >  		return -ENOMEM;
> >> >
> >> > -	spin_lock(&mm->page_table_lock);
> >> > +	ptl = huge_pte_lockptr(mm, new);
> >> > +	spin_lock(ptl);
> >> 
> >> 
> >> Are you sure we can do that for ppc ?
> >> 	new = kmem_cache_zalloc(cachep, GFP_KERNEL|__GFP_REPEAT);
> >
> > Ah, thanks. new is not a pointer to one full page occupied by page
> > table entries, so trying to use struct page of it is totally wrong.
> >
> >> The page for new(pte_t) could be shared right ? which mean a deadlock ?
> >
> > Yes, that's disastrous.
> >
> >> May be you should do it at the pmd level itself for ppc
> 
> The pgd page also cannot be used because pgd also comes from kmem
> cache.
> 
> >
> > Yes, that's possible, but I simply drop the changes in __hugepte_alloc()
> > for now because this lock seems to protect us from the race between concurrent
> > calls of __hugepte_alloc(), not between allocation and read/write access.
> > Split ptl is used to avoid race between read/write accesses, so I think
> > that using different types of locks here is not dangerous.
> > # I guess that that's why we now use mm->page_table_lock for __pte_alloc()
> > # and its family even if USE_SPLIT_PTLOCKS is true.
> 
> A simpler approach could be to make huge_pte_lockptr arch
> specific and leave it as mm->page_table_lock for ppc 

OK, I'll do this.

Thanks,
Naoya

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] hugetlbfs: support split page table lock
  2013-09-08 16:53   ` Aneesh Kumar K.V
@ 2013-09-09 16:26     ` Naoya Horiguchi
  0 siblings, 0 replies; 17+ messages in thread
From: Naoya Horiguchi @ 2013-09-09 16:26 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, Andrea Arcangeli, kirill.shutemov,
	Alex Thorlton, linux-kernel

On Sun, Sep 08, 2013 at 10:23:55PM +0530, Aneesh Kumar K.V wrote:
> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:
> 
> > Currently all of page table handling by hugetlbfs code are done under
> > mm->page_table_lock. So when a process have many threads and they heavily
> > access to the memory, lock contention happens and impacts the performance.
> >
> > This patch makes hugepage support split page table lock so that we use
> > page->ptl of the leaf node of page table tree which is pte for normal pages
> > but can be pmd and/or pud for hugepages of some architectures.
> >
> > ChangeLog v3:
> >  - disable split ptl for ppc with USE_SPLIT_PTLOCKS_HUGETLB.
> >  - remove replacement in some architecture dependent code. This is justified
> >    because an allocation of pgd/pud/pmd/pte entry can race with other
> >    allocation, not with read/write access, so we can use different locks.
> >    http://thread.gmane.org/gmane.linux.kernel.mm/106292/focus=106458
> >
> > ChangeLog v2:
> >  - add split ptl on other archs missed in v1
> >  - drop changes on arch/{powerpc,tile}/mm/hugetlbpage.c
> >
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > ---
> >  include/linux/hugetlb.h  | 20 +++++++++++
> >  include/linux/mm_types.h |  2 ++
> >  mm/hugetlb.c             | 92 +++++++++++++++++++++++++++++-------------------
> >  mm/mempolicy.c           |  5 +--
> >  mm/migrate.c             |  4 +--
> >  mm/rmap.c                |  2 +-
> >  6 files changed, 84 insertions(+), 41 deletions(-)
> >
> > diff --git v3.11-rc3.orig/include/linux/hugetlb.h v3.11-rc3/include/linux/hugetlb.h
> > index 0393270..5cb8a4e 100644
> > --- v3.11-rc3.orig/include/linux/hugetlb.h
> > +++ v3.11-rc3/include/linux/hugetlb.h
> > @@ -80,6 +80,24 @@ extern const unsigned long hugetlb_zero, hugetlb_infinity;
> >  extern int sysctl_hugetlb_shm_group;
> >  extern struct list_head huge_boot_pages;
> >
> > +#if USE_SPLIT_PTLOCKS_HUGETLB
> > +#define huge_pte_lockptr(mm, ptep) ({__pte_lockptr(virt_to_page(ptep)); })
> > +#else	/* !USE_SPLIT_PTLOCKS_HUGETLB */
> > +#define huge_pte_lockptr(mm, ptep) ({&(mm)->page_table_lock; })
> > +#endif	/* USE_SPLIT_PTLOCKS_HUGETLB */
> > +
> > +#define huge_pte_offset_lock(mm, address, ptlp)		\
> > +({							\
> > +	pte_t *__pte = huge_pte_offset(mm, address);	\
> > +	spinlock_t *__ptl = NULL;			\
> > +	if (__pte) {					\
> > +		__ptl = huge_pte_lockptr(mm, __pte);	\
> > +		*(ptlp) = __ptl;			\
> > +		spin_lock(__ptl);			\
> > +	}						\
> > +	__pte;						\
> > +})
> 
> 
> why not a static inline function ?

I'm not sure the two choices differ at runtime (maybe they are optimized
to similar code?).
I just used the same pattern as pte_offset_map_lock(), which I think makes
it clear that this function is a variant of pte_offset_map_lock().
But if we have a good reason to make it a static inline, I'm glad to do so.
Do you have any specific ideas?
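
For reference, the static inline variant would presumably look something
like this (an untested sketch, intended to keep the same semantics as the
macro above):

static inline pte_t *huge_pte_offset_lock(struct mm_struct *mm,
					  unsigned long address,
					  spinlock_t **ptlp)
{
	pte_t *pte = huge_pte_offset(mm, address);

	if (pte) {
		*ptlp = huge_pte_lockptr(mm, pte);
		spin_lock(*ptlp);
	}
	return pte;
}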

> 
> > +
> >  /* arch callbacks */
> >
> >  pte_t *huge_pte_alloc(struct mm_struct *mm,
> > @@ -164,6 +182,8 @@ static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
> >  	BUG();
> >  }
> >
> > +#define huge_pte_lockptr(mm, ptep) 0
> > +
> >  #endif /* !CONFIG_HUGETLB_PAGE */
> >
> >  #define HUGETLB_ANON_FILE "anon_hugepage"
> > diff --git v3.11-rc3.orig/include/linux/mm_types.h v3.11-rc3/include/linux/mm_types.h
> > index fb425aa..cfb8c6f 100644
> > --- v3.11-rc3.orig/include/linux/mm_types.h
> > +++ v3.11-rc3/include/linux/mm_types.h
> > @@ -24,6 +24,8 @@
> >  struct address_space;
> >
> >  #define USE_SPLIT_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
> > +#define USE_SPLIT_PTLOCKS_HUGETLB	\
> > +	(USE_SPLIT_PTLOCKS && !defined(CONFIG_PPC))
> >
> 
> Is that a common pattern ? Don't we generally use
> HAVE_ARCH_SPLIT_PTLOCKS in arch config file ?

Do you mean that HAVE_ARCH_SPLIT_PTLOCKS enables/disables the whole split
ptl mechanism (not only the split ptl for hugepages) per arch?
If so, we can do that by adding a line like 'default "999999" if PPC'.
(Note that if we do this, we would lose the performance benefit of the
existing split ptl for normal pages, so I'm not sure it's acceptable
for users of the arch.)

> Also are we sure this is
> only an issue with PPC ?

Not confirmed yet (so let me take some time to check this).
I think that the split ptl for hugepages should work on any architecture
where every entry pointing to a hugepage is stored in a 4kB page allocated
for the page table tree. So I'll check it.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] hugetlbfs: support split page table lock
  2013-09-05 21:27 ` [PATCH 1/2] hugetlbfs: support split page table lock Naoya Horiguchi
@ 2013-09-08 16:53   ` Aneesh Kumar K.V
  2013-09-09 16:26     ` Naoya Horiguchi
  0 siblings, 1 reply; 17+ messages in thread
From: Aneesh Kumar K.V @ 2013-09-08 16:53 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, Andrea Arcangeli, kirill.shutemov,
	Alex Thorlton, linux-kernel

Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:

> Currently all of page table handling by hugetlbfs code are done under
> mm->page_table_lock. So when a process have many threads and they heavily
> access to the memory, lock contention happens and impacts the performance.
>
> This patch makes hugepage support split page table lock so that we use
> page->ptl of the leaf node of page table tree which is pte for normal pages
> but can be pmd and/or pud for hugepages of some architectures.
>
> ChangeLog v3:
>  - disable split ptl for ppc with USE_SPLIT_PTLOCKS_HUGETLB.
>  - remove replacement in some architecture dependent code. This is justified
>    because an allocation of pgd/pud/pmd/pte entry can race with other
>    allocation, not with read/write access, so we can use different locks.
>    http://thread.gmane.org/gmane.linux.kernel.mm/106292/focus=106458
>
> ChangeLog v2:
>  - add split ptl on other archs missed in v1
>  - drop changes on arch/{powerpc,tile}/mm/hugetlbpage.c
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
>  include/linux/hugetlb.h  | 20 +++++++++++
>  include/linux/mm_types.h |  2 ++
>  mm/hugetlb.c             | 92 +++++++++++++++++++++++++++++-------------------
>  mm/mempolicy.c           |  5 +--
>  mm/migrate.c             |  4 +--
>  mm/rmap.c                |  2 +-
>  6 files changed, 84 insertions(+), 41 deletions(-)
>
> diff --git v3.11-rc3.orig/include/linux/hugetlb.h v3.11-rc3/include/linux/hugetlb.h
> index 0393270..5cb8a4e 100644
> --- v3.11-rc3.orig/include/linux/hugetlb.h
> +++ v3.11-rc3/include/linux/hugetlb.h
> @@ -80,6 +80,24 @@ extern const unsigned long hugetlb_zero, hugetlb_infinity;
>  extern int sysctl_hugetlb_shm_group;
>  extern struct list_head huge_boot_pages;
>
> +#if USE_SPLIT_PTLOCKS_HUGETLB
> +#define huge_pte_lockptr(mm, ptep) ({__pte_lockptr(virt_to_page(ptep)); })
> +#else	/* !USE_SPLIT_PTLOCKS_HUGETLB */
> +#define huge_pte_lockptr(mm, ptep) ({&(mm)->page_table_lock; })
> +#endif	/* USE_SPLIT_PTLOCKS_HUGETLB */
> +
> +#define huge_pte_offset_lock(mm, address, ptlp)		\
> +({							\
> +	pte_t *__pte = huge_pte_offset(mm, address);	\
> +	spinlock_t *__ptl = NULL;			\
> +	if (__pte) {					\
> +		__ptl = huge_pte_lockptr(mm, __pte);	\
> +		*(ptlp) = __ptl;			\
> +		spin_lock(__ptl);			\
> +	}						\
> +	__pte;						\
> +})


why not a static inline function ?


> +
>  /* arch callbacks */
>
>  pte_t *huge_pte_alloc(struct mm_struct *mm,
> @@ -164,6 +182,8 @@ static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
>  	BUG();
>  }
>
> +#define huge_pte_lockptr(mm, ptep) 0
> +
>  #endif /* !CONFIG_HUGETLB_PAGE */
>
>  #define HUGETLB_ANON_FILE "anon_hugepage"
> diff --git v3.11-rc3.orig/include/linux/mm_types.h v3.11-rc3/include/linux/mm_types.h
> index fb425aa..cfb8c6f 100644
> --- v3.11-rc3.orig/include/linux/mm_types.h
> +++ v3.11-rc3/include/linux/mm_types.h
> @@ -24,6 +24,8 @@
>  struct address_space;
>
>  #define USE_SPLIT_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
> +#define USE_SPLIT_PTLOCKS_HUGETLB	\
> +	(USE_SPLIT_PTLOCKS && !defined(CONFIG_PPC))
>

Is that a common pattern ? Don't we generally use
HAVE_ARCH_SPLIT_PTLOCKS in arch config file ? Also are we sure this is
only an issue with PPC ?



-aneesh


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/2] hugetlbfs: support split page table lock
  2013-09-05 21:27 [PATCH 0/2 v3] split page table lock for hugepage Naoya Horiguchi
@ 2013-09-05 21:27 ` Naoya Horiguchi
  2013-09-08 16:53   ` Aneesh Kumar K.V
  0 siblings, 1 reply; 17+ messages in thread
From: Naoya Horiguchi @ 2013-09-05 21:27 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, Andrea Arcangeli, kirill.shutemov,
	Aneesh Kumar K.V, Alex Thorlton, linux-kernel

Currently, all page table handling in the hugetlbfs code is done under
mm->page_table_lock. So when a process has many threads that heavily
access its memory, lock contention happens and hurts performance.

This patch makes hugepages support the split page table lock, so that we
use the page->ptl of the leaf node of the page table tree, which is the pte
for normal pages but can be the pmd and/or pud for hugepages on some
architectures.

ChangeLog v3:
 - disable split ptl for ppc with USE_SPLIT_PTLOCKS_HUGETLB.
 - remove replacement in some architecture dependent code. This is justified
   because an allocation of pgd/pud/pmd/pte entry can race with other
   allocation, not with read/write access, so we can use different locks.
   http://thread.gmane.org/gmane.linux.kernel.mm/106292/focus=106458

ChangeLog v2:
 - add split ptl on other archs missed in v1
 - drop changes on arch/{powerpc,tile}/mm/hugetlbpage.c

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/hugetlb.h  | 20 +++++++++++
 include/linux/mm_types.h |  2 ++
 mm/hugetlb.c             | 92 +++++++++++++++++++++++++++++-------------------
 mm/mempolicy.c           |  5 +--
 mm/migrate.c             |  4 +--
 mm/rmap.c                |  2 +-
 6 files changed, 84 insertions(+), 41 deletions(-)

diff --git v3.11-rc3.orig/include/linux/hugetlb.h v3.11-rc3/include/linux/hugetlb.h
index 0393270..5cb8a4e 100644
--- v3.11-rc3.orig/include/linux/hugetlb.h
+++ v3.11-rc3/include/linux/hugetlb.h
@@ -80,6 +80,24 @@ extern const unsigned long hugetlb_zero, hugetlb_infinity;
 extern int sysctl_hugetlb_shm_group;
 extern struct list_head huge_boot_pages;
 
+#if USE_SPLIT_PTLOCKS_HUGETLB
+#define huge_pte_lockptr(mm, ptep) ({__pte_lockptr(virt_to_page(ptep)); })
+#else	/* !USE_SPLIT_PTLOCKS_HUGETLB */
+#define huge_pte_lockptr(mm, ptep) ({&(mm)->page_table_lock; })
+#endif	/* USE_SPLIT_PTLOCKS_HUGETLB */
+
+#define huge_pte_offset_lock(mm, address, ptlp)		\
+({							\
+	pte_t *__pte = huge_pte_offset(mm, address);	\
+	spinlock_t *__ptl = NULL;			\
+	if (__pte) {					\
+		__ptl = huge_pte_lockptr(mm, __pte);	\
+		*(ptlp) = __ptl;			\
+		spin_lock(__ptl);			\
+	}						\
+	__pte;						\
+})
+
 /* arch callbacks */
 
 pte_t *huge_pte_alloc(struct mm_struct *mm,
@@ -164,6 +182,8 @@ static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
 	BUG();
 }
 
+#define huge_pte_lockptr(mm, ptep) 0
+
 #endif /* !CONFIG_HUGETLB_PAGE */
 
 #define HUGETLB_ANON_FILE "anon_hugepage"
diff --git v3.11-rc3.orig/include/linux/mm_types.h v3.11-rc3/include/linux/mm_types.h
index fb425aa..cfb8c6f 100644
--- v3.11-rc3.orig/include/linux/mm_types.h
+++ v3.11-rc3/include/linux/mm_types.h
@@ -24,6 +24,8 @@
 struct address_space;
 
 #define USE_SPLIT_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
+#define USE_SPLIT_PTLOCKS_HUGETLB	\
+	(USE_SPLIT_PTLOCKS && !defined(CONFIG_PPC))
 
 /*
  * Each physical page in the system has a struct page associated with
diff --git v3.11-rc3.orig/mm/hugetlb.c v3.11-rc3/mm/hugetlb.c
index 0f8da6b..b2dbca4 100644
--- v3.11-rc3.orig/mm/hugetlb.c
+++ v3.11-rc3/mm/hugetlb.c
@@ -2372,6 +2372,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
+		spinlock_t *src_ptl, *dst_ptl;
 		src_pte = huge_pte_offset(src, addr);
 		if (!src_pte)
 			continue;
@@ -2383,8 +2384,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		if (dst_pte == src_pte)
 			continue;
 
-		spin_lock(&dst->page_table_lock);
-		spin_lock_nested(&src->page_table_lock, SINGLE_DEPTH_NESTING);
+		dst_ptl = huge_pte_lockptr(dst, dst_pte);
+		src_ptl = huge_pte_lockptr(src, src_pte);
+		spin_lock(dst_ptl);
+		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 		if (!huge_pte_none(huge_ptep_get(src_pte))) {
 			if (cow)
 				huge_ptep_set_wrprotect(src, addr, src_pte);
@@ -2394,8 +2397,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			page_dup_rmap(ptepage);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 		}
-		spin_unlock(&src->page_table_lock);
-		spin_unlock(&dst->page_table_lock);
+		spin_unlock(src_ptl);
+		spin_unlock(dst_ptl);
 	}
 	return 0;
 
@@ -2438,6 +2441,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	unsigned long address;
 	pte_t *ptep;
 	pte_t pte;
+	spinlock_t *ptl;
 	struct page *page;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
@@ -2451,25 +2455,24 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	tlb_start_vma(tlb, vma);
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 again:
-	spin_lock(&mm->page_table_lock);
 	for (address = start; address < end; address += sz) {
-		ptep = huge_pte_offset(mm, address);
+		ptep = huge_pte_offset_lock(mm, address, &ptl);
 		if (!ptep)
 			continue;
 
 		if (huge_pmd_unshare(mm, &address, ptep))
-			continue;
+			goto unlock;
 
 		pte = huge_ptep_get(ptep);
 		if (huge_pte_none(pte))
-			continue;
+			goto unlock;
 
 		/*
 		 * HWPoisoned hugepage is already unmapped and dropped reference
 		 */
 		if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
 			huge_pte_clear(mm, address, ptep);
-			continue;
+			goto unlock;
 		}
 
 		page = pte_page(pte);
@@ -2480,7 +2483,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 */
 		if (ref_page) {
 			if (page != ref_page)
-				continue;
+				goto unlock;
 
 			/*
 			 * Mark the VMA as having unmapped its page so that
@@ -2497,13 +2500,18 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 		page_remove_rmap(page);
 		force_flush = !__tlb_remove_page(tlb, page);
-		if (force_flush)
+		if (force_flush) {
+			spin_unlock(ptl);
 			break;
+		}
 		/* Bail out after unmapping reference page if supplied */
-		if (ref_page)
+		if (ref_page) {
+			spin_unlock(ptl);
 			break;
+		}
+unlock:
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	/*
 	 * mmu_gather ran out of room to batch pages, we break out of
 	 * the PTE lock to avoid doing the potential expensive TLB invalidate
@@ -2617,6 +2625,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	int outside_reserve = 0;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
+	spinlock_t *ptl = huge_pte_lockptr(mm, ptep);
 
 	old_page = pte_page(pte);
 
@@ -2648,7 +2657,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	page_cache_get(old_page);
 
 	/* Drop page_table_lock as buddy allocator may be called */
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	new_page = alloc_huge_page(vma, address, outside_reserve);
 
 	if (IS_ERR(new_page)) {
@@ -2666,7 +2675,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 			BUG_ON(huge_pte_none(pte));
 			if (unmap_ref_private(mm, vma, old_page, address)) {
 				BUG_ON(huge_pte_none(pte));
-				spin_lock(&mm->page_table_lock);
+				spin_lock(ptl);
 				ptep = huge_pte_offset(mm, address & huge_page_mask(h));
 				if (likely(pte_same(huge_ptep_get(ptep), pte)))
 					goto retry_avoidcopy;
@@ -2680,7 +2689,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 
 		/* Caller expects lock to be held */
-		spin_lock(&mm->page_table_lock);
+		spin_lock(ptl);
 		if (err == -ENOMEM)
 			return VM_FAULT_OOM;
 		else
@@ -2695,7 +2704,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_cache_release(new_page);
 		page_cache_release(old_page);
 		/* Caller expects lock to be held */
-		spin_lock(&mm->page_table_lock);
+		spin_lock(ptl);
 		return VM_FAULT_OOM;
 	}
 
@@ -2710,7 +2719,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * Retake the page_table_lock to check for racing updates
 	 * before the page tables are altered
 	 */
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	ptep = huge_pte_offset(mm, address & huge_page_mask(h));
 	if (likely(pte_same(huge_ptep_get(ptep), pte))) {
 		/* Break COW */
@@ -2722,10 +2731,10 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* Make the old page be freed below */
 		new_page = old_page;
 	}
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	/* Caller expects lock to be held */
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
 	return 0;
@@ -2775,6 +2784,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page;
 	struct address_space *mapping;
 	pte_t new_pte;
+	spinlock_t *ptl;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -2860,7 +2870,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			goto backout_unlocked;
 		}
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, ptep);
+	spin_lock(ptl);
 	size = i_size_read(mapping->host) >> huge_page_shift(h);
 	if (idx >= size)
 		goto backout;
@@ -2882,13 +2893,13 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page);
 	}
 
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	unlock_page(page);
 out:
 	return ret;
 
 backout:
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 backout_unlocked:
 	unlock_page(page);
 	put_page(page);
@@ -2900,6 +2911,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	pte_t *ptep;
 	pte_t entry;
+	spinlock_t *ptl;
 	int ret;
 	struct page *page = NULL;
 	struct page *pagecache_page = NULL;
@@ -2968,7 +2980,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (page != pagecache_page)
 		lock_page(page);
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, ptep);
+	spin_lock(ptl);
 	/* Check for a racing update before calling hugetlb_cow */
 	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
 		goto out_page_table_lock;
@@ -2988,7 +3001,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		update_mmu_cache(vma, address, ptep);
 
 out_page_table_lock:
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 
 	if (pagecache_page) {
 		unlock_page(pagecache_page);
@@ -3014,9 +3027,9 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long remainder = *nr_pages;
 	struct hstate *h = hstate_vma(vma);
 
-	spin_lock(&mm->page_table_lock);
 	while (vaddr < vma->vm_end && remainder) {
 		pte_t *pte;
+		spinlock_t *ptl = NULL;
 		int absent;
 		struct page *page;
 
@@ -3024,8 +3037,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * Some archs (sparc64, sh*) have multiple pte_ts to
 		 * each hugepage.  We have to make sure we get the
 		 * first, for the page indexing below to work.
+		 *
+		 * Note that page table lock is not held when pte is null.
 		 */
-		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
+		pte = huge_pte_offset_lock(mm, vaddr & huge_page_mask(h), &ptl);
 		absent = !pte || huge_pte_none(huge_ptep_get(pte));
 
 		/*
@@ -3037,6 +3052,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		if (absent && (flags & FOLL_DUMP) &&
 		    !hugetlbfs_pagecache_present(h, vma, vaddr)) {
+			if (pte)
+				spin_unlock(ptl);
 			remainder = 0;
 			break;
 		}
@@ -3056,10 +3073,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		      !huge_pte_write(huge_ptep_get(pte)))) {
 			int ret;
 
-			spin_unlock(&mm->page_table_lock);
+			if (pte)
+				spin_unlock(ptl);
 			ret = hugetlb_fault(mm, vma, vaddr,
 				(flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
-			spin_lock(&mm->page_table_lock);
 			if (!(ret & VM_FAULT_ERROR))
 				continue;
 
@@ -3090,8 +3107,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 */
 			goto same_page;
 		}
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	*nr_pages = remainder;
 	*position = vaddr;
 
@@ -3112,13 +3129,14 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	flush_cache_range(vma, address, end);
 
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
-	spin_lock(&mm->page_table_lock);
 	for (; address < end; address += huge_page_size(h)) {
-		ptep = huge_pte_offset(mm, address);
+		spinlock_t *ptl;
+		ptep = huge_pte_offset_lock(mm, address, &ptl);
 		if (!ptep)
 			continue;
 		if (huge_pmd_unshare(mm, &address, ptep)) {
 			pages++;
+			spin_unlock(ptl);
 			continue;
 		}
 		if (!huge_pte_none(huge_ptep_get(ptep))) {
@@ -3128,8 +3146,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 			set_huge_pte_at(mm, address, ptep, pte);
 			pages++;
 		}
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	/*
 	 * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
 	 * may have cleared our pud entry and done put_page on the page table:
@@ -3292,6 +3310,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	unsigned long saddr;
 	pte_t *spte = NULL;
 	pte_t *pte;
+	spinlock_t *ptl;
 
 	if (!vma_shareable(vma, addr))
 		return (pte_t *)pmd_alloc(mm, pud, addr);
@@ -3314,13 +3333,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	if (!spte)
 		goto out;
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, spte);
+	spin_lock(ptl);
 	if (pud_none(*pud))
 		pud_populate(mm, pud,
 				(pmd_t *)((unsigned long)spte & PAGE_MASK));
 	else
 		put_page(virt_to_page(spte));
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 out:
 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
 	mutex_unlock(&mapping->i_mmap_mutex);
@@ -3334,7 +3354,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
  * indicated by page_count > 1, unmap is achieved by clearing pud and
  * decrementing the ref count. If count == 1, the pte page is not shared.
  *
- * called with vma->vm_mm->page_table_lock held.
+ * called with page table lock held.
  *
  * returns: 1 successfully unmapped a shared pte page
  *	    0 the underlying pte page is not shared, or it is the last user
diff --git v3.11-rc3.orig/mm/mempolicy.c v3.11-rc3/mm/mempolicy.c
index 838d022..a9de55f 100644
--- v3.11-rc3.orig/mm/mempolicy.c
+++ v3.11-rc3/mm/mempolicy.c
@@ -522,8 +522,9 @@ static void queue_pages_hugetlb_pmd_range(struct vm_area_struct *vma,
 #ifdef CONFIG_HUGETLB_PAGE
 	int nid;
 	struct page *page;
+	spinlock_t *ptl = huge_pte_lockptr(vma->vm_mm, (pte_t *)pmd);
 
-	spin_lock(&vma->vm_mm->page_table_lock);
+	spin_lock(ptl);
 	page = pte_page(huge_ptep_get((pte_t *)pmd));
 	nid = page_to_nid(page);
 	if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
@@ -533,7 +534,7 @@ static void queue_pages_hugetlb_pmd_range(struct vm_area_struct *vma,
 	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1))
 		isolate_huge_page(page, private);
 unlock:
-	spin_unlock(&vma->vm_mm->page_table_lock);
+	spin_unlock(ptl);
 #else
 	BUG();
 #endif
diff --git v3.11-rc3.orig/mm/migrate.c v3.11-rc3/mm/migrate.c
index 61f14a1..c69a9c7 100644
--- v3.11-rc3.orig/mm/migrate.c
+++ v3.11-rc3/mm/migrate.c
@@ -130,7 +130,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 		ptep = huge_pte_offset(mm, addr);
 		if (!ptep)
 			goto out;
-		ptl = &mm->page_table_lock;
+		ptl = huge_pte_lockptr(mm, ptep);
 	} else {
 		pmd = mm_find_pmd(mm, addr);
 		if (!pmd)
@@ -249,7 +249,7 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 
 void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte)
 {
-	spinlock_t *ptl = &(mm)->page_table_lock;
+	spinlock_t *ptl = huge_pte_lockptr(mm, pte);
 	__migration_entry_wait(mm, pte, ptl);
 }
 
diff --git v3.11-rc3.orig/mm/rmap.c v3.11-rc3/mm/rmap.c
index c56ba98..eccec58 100644
--- v3.11-rc3.orig/mm/rmap.c
+++ v3.11-rc3/mm/rmap.c
@@ -601,7 +601,7 @@ pte_t *__page_check_address(struct page *page, struct mm_struct *mm,
 
 	if (unlikely(PageHuge(page))) {
 		pte = huge_pte_offset(mm, address);
-		ptl = &mm->page_table_lock;
+		ptl = huge_pte_lockptr(mm, pte);
 		goto check;
 	}
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] hugetlbfs: support split page table lock
  2013-06-03 14:34     ` Naoya Horiguchi
@ 2013-06-03 15:42       ` Michal Hocko
  0 siblings, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2013-06-03 15:42 UTC (permalink / raw)
  To: Naoya Horiguchi, Andrew Morton
  Cc: linux-mm, Mel Gorman, Andi Kleen, KOSAKI Motohiro, Rik van Riel,
	linux-kernel

On Mon 03-06-13 10:34:35, Naoya Horiguchi wrote:
> On Mon, Jun 03, 2013 at 03:19:32PM +0200, Michal Hocko wrote:
> > On Tue 28-05-13 15:52:50, Naoya Horiguchi wrote:
> > > Currently all of page table handling by hugetlbfs code are done under
> > > mm->page_table_lock. This is not optimal because there can be lock
> > > contentions between unrelated components using this lock.
> > 
> > While I agree with such a change in general I am a bit afraid of all
> > subtle tweaks in the mm code that make hugetlb special. Maybe there are
> > none for page_table_lock but I am not 100% sure. So this might be
> > really tricky and it is not necessary for your further patches, is it?
> 
> No, this page_table_lock patch is separable from migration stuff.
> As you said in another email, changes going to stable should be minimal,
> so it's better to make 2/2 patch not depend on this patch.

OK, so how do we go around this? Both patches are in the mm tree now.
Should Andrew just drop the current version so that you can repost a new
one? Sorry I didn't jump in sooner, but I was quite busy last week.

> > How have you tested this?
> 
> Other than libhugetlbfs test (that contains many workloads, but I'm
> not sure it can detect the possible regression of this patch,)
> I did simple testing where:
>  - create a file on hugetlbfs,
>  - create 10 processes and make each of them iterate the following:
>    * mmap() the hugetlbfs file,
>    * memset() the mapped range (to cause hugetlb_fault), and
>    * munmap() the mapped range.
> I think that this can make racy situation which should be prevented
> by page table locks.

OK, but this still requires a deep inspection of all the subtle
dependencies on page_table_lock in the core mm. I might be wrong here,
and I should be more specific about the issues I only have a suspicion
about, but as this is "just" a scalability improvement (is it actually
measurable?) I would suggest putting it at the end of your hugetlbfs
enhancements for the migration, just from the reviewability point of
view.

> > > This patch makes hugepage support split page table lock so that
> > > we use page->ptl of the leaf node of page table tree which is pte for
> > > normal pages but can be pmd and/or pud for hugepages of some architectures.
> > > 
> > > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > > ---
> > >  arch/x86/mm/hugetlbpage.c |  6 ++--
> > >  include/linux/hugetlb.h   | 18 ++++++++++
> > >  mm/hugetlb.c              | 84 ++++++++++++++++++++++++++++-------------------
> > 
> > This doesn't seem to be the complete story. At least not from the
> > trivial:
> > $ find arch/ -name "*hugetlb*" | xargs git grep "page_table_lock" -- 
> > arch/powerpc/mm/hugetlbpage.c:  spin_lock(&mm->page_table_lock);
> > arch/powerpc/mm/hugetlbpage.c:  spin_unlock(&mm->page_table_lock);
> > arch/tile/mm/hugetlbpage.c:             spin_lock(&mm->page_table_lock);
> > arch/tile/mm/hugetlbpage.c:
> > spin_unlock(&mm->page_table_lock);
> > arch/x86/mm/hugetlbpage.c: * called with vma->vm_mm->page_table_lock held.
> 
> This trivials should be fixed. Sorry.

Other archs are often forgotten and cscope doesn't help exactly ;)

Thanks
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] hugetlbfs: support split page table lock
  2013-06-03 13:19   ` Michal Hocko
@ 2013-06-03 14:34     ` Naoya Horiguchi
  2013-06-03 15:42       ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Naoya Horiguchi @ 2013-06-03 14:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, KOSAKI Motohiro,
	Rik van Riel, linux-kernel

On Mon, Jun 03, 2013 at 03:19:32PM +0200, Michal Hocko wrote:
> On Tue 28-05-13 15:52:50, Naoya Horiguchi wrote:
> > Currently all of page table handling by hugetlbfs code are done under
> > mm->page_table_lock. This is not optimal because there can be lock
> > contentions between unrelated components using this lock.
> 
> While I agree with such a change in general I am a bit afraid of all
> subtle tweaks in the mm code that make hugetlb special. Maybe there are
> none for page_table_lock but I am not 100% sure. So this might be
> really tricky and it is not necessary for your further patches, is it?

No, this page_table_lock patch is separable from the migration stuff.
As you said in another email, changes going to stable should be minimal,
so it's better to make the 2/2 patch not depend on this one.

> How have you tested this?

Other than the libhugetlbfs test suite (which contains many workloads,
but I'm not sure it can detect possible regressions from this patch),
I did simple testing where I:
 - create a file on hugetlbfs,
 - create 10 processes and make each of them iterate the following:
   * mmap() the hugetlbfs file,
   * memset() the mapped range (to cause hugetlb_fault), and
   * munmap() the mapped range.
I think this can create the racy situations that the page table locks are
supposed to prevent; a minimal sketch of such a test is below.
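
For reference, here is a minimal sketch of that kind of stress test
(illustrative only: the mount point, mapping size and iteration counts
below are made up, and error handling is mostly trimmed):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAP_LEN		(16UL << 20)	/* e.g. 8 x 2MB hugepages */
#define NR_PROCS	10
#define NR_LOOPS	1000

int main(void)
{
	int i, fd = open("/mnt/hugetlbfs/testfile", O_CREAT | O_RDWR, 0600);

	if (fd < 0)
		return 1;
	for (i = 0; i < NR_PROCS; i++) {
		if (fork() == 0) {
			int j;

			for (j = 0; j < NR_LOOPS; j++) {
				char *p = mmap(NULL, MAP_LEN,
					       PROT_READ | PROT_WRITE,
					       MAP_SHARED, fd, 0);

				if (p == MAP_FAILED)
					_exit(1);
				/* fault in the hugepages (hugetlb_fault) */
				memset(p, 0, MAP_LEN);
				munmap(p, MAP_LEN);
			}
			_exit(0);
		}
	}
	while (wait(NULL) > 0)	/* reap the children */
		;
	return 0;
}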

> > This patch makes hugepage support split page table lock so that
> > we use page->ptl of the leaf node of page table tree which is pte for
> > normal pages but can be pmd and/or pud for hugepages of some architectures.
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > ---
> >  arch/x86/mm/hugetlbpage.c |  6 ++--
> >  include/linux/hugetlb.h   | 18 ++++++++++
> >  mm/hugetlb.c              | 84 ++++++++++++++++++++++++++++-------------------
> 
> This doesn't seem to be the complete story. At least not from the
> trivial:
> $ find arch/ -name "*hugetlb*" | xargs git grep "page_table_lock" -- 
> arch/powerpc/mm/hugetlbpage.c:  spin_lock(&mm->page_table_lock);
> arch/powerpc/mm/hugetlbpage.c:  spin_unlock(&mm->page_table_lock);
> arch/tile/mm/hugetlbpage.c:             spin_lock(&mm->page_table_lock);
> arch/tile/mm/hugetlbpage.c:
> spin_unlock(&mm->page_table_lock);
> arch/x86/mm/hugetlbpage.c: * called with vma->vm_mm->page_table_lock held.

These trivial ones should be fixed. Sorry.

Naoya

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] hugetlbfs: support split page table lock
  2013-05-28 19:52 ` [PATCH 1/2] " Naoya Horiguchi
@ 2013-06-03 13:19   ` Michal Hocko
  2013-06-03 14:34     ` Naoya Horiguchi
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2013-06-03 13:19 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, KOSAKI Motohiro,
	Rik van Riel, linux-kernel

On Tue 28-05-13 15:52:50, Naoya Horiguchi wrote:
> Currently all of page table handling by hugetlbfs code are done under
> mm->page_table_lock. This is not optimal because there can be lock
> contentions between unrelated components using this lock.

While I agree with such a change in general I am a bit afraid of all
subtle tweaks in the mm code that make hugetlb special. Maybe there are
none for page_table_lock but I am not 100% sure. So this might be
really tricky and it is not necessary for your further patches, is it?

How have you tested this?

> This patch makes hugepage support split page table lock so that
> we use page->ptl of the leaf node of page table tree which is pte for
> normal pages but can be pmd and/or pud for hugepages of some architectures.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
>  arch/x86/mm/hugetlbpage.c |  6 ++--
>  include/linux/hugetlb.h   | 18 ++++++++++
>  mm/hugetlb.c              | 84 ++++++++++++++++++++++++++++-------------------

This doesn't seem to be the complete story. At least not from the
trivial:
$ find arch/ -name "*hugetlb*" | xargs git grep "page_table_lock" -- 
arch/powerpc/mm/hugetlbpage.c:  spin_lock(&mm->page_table_lock);
arch/powerpc/mm/hugetlbpage.c:  spin_unlock(&mm->page_table_lock);
arch/tile/mm/hugetlbpage.c:             spin_lock(&mm->page_table_lock);
arch/tile/mm/hugetlbpage.c:             spin_unlock(&mm->page_table_lock);
arch/x86/mm/hugetlbpage.c: * called with vma->vm_mm->page_table_lock held.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/2] hugetlbfs: support split page table lock
  2013-05-28 19:52 [PATCH 0/2] " Naoya Horiguchi
@ 2013-05-28 19:52 ` Naoya Horiguchi
  2013-06-03 13:19   ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Naoya Horiguchi @ 2013-05-28 19:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
	KOSAKI Motohiro, Rik van Riel, linux-kernel

Currently, all page table handling in the hugetlbfs code is done under
mm->page_table_lock. This is not optimal because there can be lock
contention between unrelated components using this lock.

This patch makes hugepages support the split page table lock, so that we
use the page->ptl of the leaf node of the page table tree, which is the pte
for normal pages but can be the pmd and/or pud for hugepages on some
architectures.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 arch/x86/mm/hugetlbpage.c |  6 ++--
 include/linux/hugetlb.h   | 18 ++++++++++
 mm/hugetlb.c              | 84 ++++++++++++++++++++++++++++-------------------
 3 files changed, 73 insertions(+), 35 deletions(-)

diff --git v3.10-rc3.orig/arch/x86/mm/hugetlbpage.c v3.10-rc3/arch/x86/mm/hugetlbpage.c
index ae1aa71..0e4a396 100644
--- v3.10-rc3.orig/arch/x86/mm/hugetlbpage.c
+++ v3.10-rc3/arch/x86/mm/hugetlbpage.c
@@ -75,6 +75,7 @@ huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	unsigned long saddr;
 	pte_t *spte = NULL;
 	pte_t *pte;
+	spinlock_t *ptl;
 
 	if (!vma_shareable(vma, addr))
 		return (pte_t *)pmd_alloc(mm, pud, addr);
@@ -89,6 +90,7 @@ huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 			spte = huge_pte_offset(svma->vm_mm, saddr);
 			if (spte) {
 				get_page(virt_to_page(spte));
+				ptl = huge_pte_lockptr(mm, spte);
 				break;
 			}
 		}
@@ -97,12 +99,12 @@ huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	if (!spte)
 		goto out;
 
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	if (pud_none(*pud))
 		pud_populate(mm, pud, (pmd_t *)((unsigned long)spte & PAGE_MASK));
 	else
 		put_page(virt_to_page(spte));
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 out:
 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
 	mutex_unlock(&mapping->i_mmap_mutex);
diff --git v3.10-rc3.orig/include/linux/hugetlb.h v3.10-rc3/include/linux/hugetlb.h
index a639c87..40f3215 100644
--- v3.10-rc3.orig/include/linux/hugetlb.h
+++ v3.10-rc3/include/linux/hugetlb.h
@@ -32,6 +32,24 @@ void hugepage_put_subpool(struct hugepage_subpool *spool);
 
 int PageHuge(struct page *page);
 
+#if USE_SPLIT_PTLOCKS
+#define huge_pte_lockptr(mm, ptep) ({__pte_lockptr(virt_to_page(ptep)); })
+#else	/* !USE_SPLIT_PTLOCKS */
+#define huge_pte_lockptr(mm, ptep) ({&(mm)->page_table_lock; })
+#endif	/* USE_SPLIT_PTLOCKS */
+
+#define huge_pte_offset_lock(mm, address, ptlp)		\
+({							\
+	pte_t *__pte = huge_pte_offset(mm, address);	\
+	spinlock_t *__ptl = NULL;			\
+	if (__pte) {					\
+		__ptl = huge_pte_lockptr(mm, __pte);	\
+		*(ptlp) = __ptl;			\
+		spin_lock(__ptl);			\
+	}						\
+	__pte;						\
+})
+
 void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
 int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
 int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
diff --git v3.10-rc3.orig/mm/hugetlb.c v3.10-rc3/mm/hugetlb.c
index 463fb5e..8e1af32 100644
--- v3.10-rc3.orig/mm/hugetlb.c
+++ v3.10-rc3/mm/hugetlb.c
@@ -2325,6 +2325,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
+		spinlock_t *srcptl, *dstptl;
 		src_pte = huge_pte_offset(src, addr);
 		if (!src_pte)
 			continue;
@@ -2336,8 +2337,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		if (dst_pte == src_pte)
 			continue;
 
-		spin_lock(&dst->page_table_lock);
-		spin_lock_nested(&src->page_table_lock, SINGLE_DEPTH_NESTING);
+		dstptl = huge_pte_lockptr(dst, dst_pte);
+		srcptl = huge_pte_lockptr(src, src_pte);
+		spin_lock(dstptl);
+		spin_lock_nested(srcptl, SINGLE_DEPTH_NESTING);
 		if (!huge_pte_none(huge_ptep_get(src_pte))) {
 			if (cow)
 				huge_ptep_set_wrprotect(src, addr, src_pte);
@@ -2347,8 +2350,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			page_dup_rmap(ptepage);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 		}
-		spin_unlock(&src->page_table_lock);
-		spin_unlock(&dst->page_table_lock);
+		spin_unlock(srcptl);
+		spin_unlock(dstptl);
 	}
 	return 0;
 
@@ -2391,6 +2394,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	unsigned long address;
 	pte_t *ptep;
 	pte_t pte;
+	spinlock_t *ptl;
 	struct page *page;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
@@ -2404,25 +2408,24 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	tlb_start_vma(tlb, vma);
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 again:
-	spin_lock(&mm->page_table_lock);
 	for (address = start; address < end; address += sz) {
-		ptep = huge_pte_offset(mm, address);
+		ptep = huge_pte_offset_lock(mm, address, &ptl);
 		if (!ptep)
 			continue;
 
 		if (huge_pmd_unshare(mm, &address, ptep))
-			continue;
+			goto unlock;
 
 		pte = huge_ptep_get(ptep);
 		if (huge_pte_none(pte))
-			continue;
+			goto unlock;
 
 		/*
 		 * HWPoisoned hugepage is already unmapped and dropped reference
 		 */
 		if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
 			huge_pte_clear(mm, address, ptep);
-			continue;
+			goto unlock;
 		}
 
 		page = pte_page(pte);
@@ -2433,7 +2436,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 */
 		if (ref_page) {
 			if (page != ref_page)
-				continue;
+				goto unlock;
 
 			/*
 			 * Mark the VMA as having unmapped its page so that
@@ -2450,13 +2453,18 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 		page_remove_rmap(page);
 		force_flush = !__tlb_remove_page(tlb, page);
-		if (force_flush)
+		if (force_flush) {
+			spin_unlock(ptl);
 			break;
+		}
 		/* Bail out after unmapping reference page if supplied */
-		if (ref_page)
+		if (ref_page) {
+			spin_unlock(ptl);
 			break;
+		}
+unlock:
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	/*
 	 * mmu_gather ran out of room to batch pages, we break out of
 	 * the PTE lock to avoid doing the potential expensive TLB invalidate
@@ -2570,6 +2578,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	int outside_reserve = 0;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
+	spinlock_t *ptl = huge_pte_lockptr(mm, ptep);
 
 	old_page = pte_page(pte);
 
@@ -2601,7 +2610,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	page_cache_get(old_page);
 
 	/* Drop page_table_lock as buddy allocator may be called */
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	new_page = alloc_huge_page(vma, address, outside_reserve);
 
 	if (IS_ERR(new_page)) {
@@ -2619,7 +2628,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 			BUG_ON(huge_pte_none(pte));
 			if (unmap_ref_private(mm, vma, old_page, address)) {
 				BUG_ON(huge_pte_none(pte));
-				spin_lock(&mm->page_table_lock);
+				spin_lock(ptl);
 				ptep = huge_pte_offset(mm, address & huge_page_mask(h));
 				if (likely(pte_same(huge_ptep_get(ptep), pte)))
 					goto retry_avoidcopy;
@@ -2633,7 +2642,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 
 		/* Caller expects lock to be held */
-		spin_lock(&mm->page_table_lock);
+		spin_lock(ptl);
 		if (err == -ENOMEM)
 			return VM_FAULT_OOM;
 		else
@@ -2648,7 +2657,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_cache_release(new_page);
 		page_cache_release(old_page);
 		/* Caller expects lock to be held */
-		spin_lock(&mm->page_table_lock);
+		spin_lock(ptl);
 		return VM_FAULT_OOM;
 	}
 
@@ -2663,7 +2672,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * Retake the page_table_lock to check for racing updates
 	 * before the page tables are altered
 	 */
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	ptep = huge_pte_offset(mm, address & huge_page_mask(h));
 	if (likely(pte_same(huge_ptep_get(ptep), pte))) {
 		/* Break COW */
@@ -2675,10 +2684,10 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* Make the old page be freed below */
 		new_page = old_page;
 	}
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	/* Caller expects lock to be held */
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
 	return 0;
@@ -2728,6 +2737,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page;
 	struct address_space *mapping;
 	pte_t new_pte;
+	spinlock_t *ptl;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -2813,7 +2823,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			goto backout_unlocked;
 		}
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, ptep);
+	spin_lock(ptl);
 	size = i_size_read(mapping->host) >> huge_page_shift(h);
 	if (idx >= size)
 		goto backout;
@@ -2835,13 +2846,13 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page);
 	}
 
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	unlock_page(page);
 out:
 	return ret;
 
 backout:
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 backout_unlocked:
 	unlock_page(page);
 	put_page(page);
@@ -2853,6 +2864,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	pte_t *ptep;
 	pte_t entry;
+	spinlock_t *ptl;
 	int ret;
 	struct page *page = NULL;
 	struct page *pagecache_page = NULL;
@@ -2921,7 +2933,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (page != pagecache_page)
 		lock_page(page);
 
-	spin_lock(&mm->page_table_lock);
+	ptl = huge_pte_lockptr(mm, ptep);
+	spin_lock(ptl);
 	/* Check for a racing update before calling hugetlb_cow */
 	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
 		goto out_page_table_lock;
@@ -2941,7 +2954,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		update_mmu_cache(vma, address, ptep);
 
 out_page_table_lock:
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 
 	if (pagecache_page) {
 		unlock_page(pagecache_page);
@@ -2976,9 +2989,9 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long remainder = *nr_pages;
 	struct hstate *h = hstate_vma(vma);
 
-	spin_lock(&mm->page_table_lock);
 	while (vaddr < vma->vm_end && remainder) {
 		pte_t *pte;
+		spinlock_t *ptl = NULL;
 		int absent;
 		struct page *page;
 
@@ -2986,8 +2999,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * Some archs (sparc64, sh*) have multiple pte_ts to
 		 * each hugepage.  We have to make sure we get the
 		 * first, for the page indexing below to work.
+		 *
+		 * Note that page table lock is not held when pte is null.
 		 */
-		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
+		pte = huge_pte_offset_lock(mm, vaddr & huge_page_mask(h), &ptl);
 		absent = !pte || huge_pte_none(huge_ptep_get(pte));
 
 		/*
@@ -2999,6 +3014,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		if (absent && (flags & FOLL_DUMP) &&
 		    !hugetlbfs_pagecache_present(h, vma, vaddr)) {
+			if (pte)
+				spin_unlock(ptl);
 			remainder = 0;
 			break;
 		}
@@ -3018,10 +3035,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		      !huge_pte_write(huge_ptep_get(pte)))) {
 			int ret;
 
-			spin_unlock(&mm->page_table_lock);
+			if (pte)
+				spin_unlock(ptl);
 			ret = hugetlb_fault(mm, vma, vaddr,
 				(flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
-			spin_lock(&mm->page_table_lock);
 			if (!(ret & VM_FAULT_ERROR))
 				continue;
 
@@ -3052,8 +3069,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 */
 			goto same_page;
 		}
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	*nr_pages = remainder;
 	*position = vaddr;
 
@@ -3074,13 +3091,14 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	flush_cache_range(vma, address, end);
 
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
-	spin_lock(&mm->page_table_lock);
 	for (; address < end; address += huge_page_size(h)) {
-		ptep = huge_pte_offset(mm, address);
+		spinlock_t *ptl;
+		ptep = huge_pte_offset_lock(mm, address, &ptl);
 		if (!ptep)
 			continue;
 		if (huge_pmd_unshare(mm, &address, ptep)) {
 			pages++;
+			spin_unlock(ptl);
 			continue;
 		}
 		if (!huge_pte_none(huge_ptep_get(ptep))) {
@@ -3090,8 +3108,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 			set_huge_pte_at(mm, address, ptep, pte);
 			pages++;
 		}
+		spin_unlock(ptl);
 	}
-	spin_unlock(&mm->page_table_lock);
 	/*
 	 * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
 	 * may have cleared our pud entry and done put_page on the page table:
-- 
1.7.11.7
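
To summarize the walk-loop conversion made in several of the hunks
above (a condensed sketch, not an extra hunk): the lock is now taken
per leaf table via huge_pte_offset_lock(), so it has to be dropped on
every path that leaves the current iteration, which is why plain
"continue" statements become "goto unlock":

	for (address = start; address < end; address += sz) {
		spinlock_t *ptl;
		pte_t *ptep;

		ptep = huge_pte_offset_lock(mm, address, &ptl);
		if (!ptep)
			continue;	/* nothing was locked */
		if (huge_pte_none(huge_ptep_get(ptep)))
			goto unlock;	/* was a plain "continue" */

		/* ... operate on the huge pte under ptl ... */
unlock:
		spin_unlock(ptl);	/* replaces the single unlock of
					 * mm->page_table_lock after the loop */
	}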


^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2013-09-09 16:26 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-30 17:18 [PATCH 0/2 v2] split page table lock for hugepage Naoya Horiguchi
2013-08-30 17:18 ` [PATCH 1/2] hugetlbfs: support split page table lock Naoya Horiguchi
2013-09-04  7:13   ` Aneesh Kumar K.V
2013-09-04 16:32     ` Naoya Horiguchi
2013-09-05  9:18       ` Aneesh Kumar K.V
2013-09-05 15:23         ` Naoya Horiguchi
2013-08-30 17:18 ` [PATCH 2/2] thp: " Naoya Horiguchi
2013-09-02 10:53   ` Kirill A. Shutemov
2013-09-02 16:37     ` Naoya Horiguchi
2013-09-03 20:52   ` Naoya Horiguchi
  -- strict thread matches above, loose matches on Subject: below --
2013-09-05 21:27 [PATCH 0/2 v3] split page table lock for hugepage Naoya Horiguchi
2013-09-05 21:27 ` [PATCH 1/2] hugetlbfs: support split page table lock Naoya Horiguchi
2013-09-08 16:53   ` Aneesh Kumar K.V
2013-09-09 16:26     ` Naoya Horiguchi
2013-05-28 19:52 [PATCH 0/2] " Naoya Horiguchi
2013-05-28 19:52 ` [PATCH 1/2] " Naoya Horiguchi
2013-06-03 13:19   ` Michal Hocko
2013-06-03 14:34     ` Naoya Horiguchi
2013-06-03 15:42       ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).