* [PATCH 0/2] hugetlbfs: support split page table lock
@ 2013-05-28 19:52 ` Naoya Horiguchi
From: Naoya Horiguchi @ 2013-05-28 19:52 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
KOSAKI Motohiro, Rik van Riel, linux-kernel
Hi,
In the previous discussion [1] of the "extend hugepage migration" patches,
Michal and Kosaki-san commented that in the patch "migrate: add
migrate_entry_wait_huge()" we need to solve the issue in an
arch-independent manner and should also handle the USE_SPLIT_PTLOCKS=y
case. This patch set does that.

I made sure that the patched kernel shows no regression in the
functional tests of libhugetlbfs.
[1]: http://thread.gmane.org/gmane.linux.kernel.mm/96665/focus=96661
Thanks,
Naoya Horiguchi
* [PATCH 1/2] hugetlbfs: support split page table lock
@ 2013-05-28 19:52 ` Naoya Horiguchi
From: Naoya Horiguchi @ 2013-05-28 19:52 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
KOSAKI Motohiro, Rik van Riel, linux-kernel
Currently all page table handling in the hugetlbfs code is done under
mm->page_table_lock. This is not optimal because it can cause lock
contention between unrelated components sharing this lock.

This patch makes hugepages use the split page table lock: we take
page->ptl of the leaf node of the page table tree, which is the pte page
for normal pages but can be a pmd and/or pud page for hugepages on some
architectures.
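
For illustration, here is a minimal sketch of the calling convention this
introduces (example_walk() is a hypothetical caller, not part of the
patch): the lookup helper returns the huge pte together with its lock,
and the caller unlocks that lock instead of mm->page_table_lock:

	/*
	 * Sketch only: huge_pte_offset_lock() returns NULL without taking
	 * any lock when no entry exists; otherwise *ptl is the lock that
	 * was taken (page->ptl with USE_SPLIT_PTLOCKS, mm->page_table_lock
	 * otherwise).
	 */
	static void example_walk(struct mm_struct *mm, unsigned long address)
	{
		spinlock_t *ptl;
		pte_t *ptep;

		ptep = huge_pte_offset_lock(mm, address, &ptl);
		if (!ptep)
			return;		/* nothing mapped, no lock held */
		/* ... inspect or modify *ptep under ptl ... */
		spin_unlock(ptl);
	}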
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
arch/x86/mm/hugetlbpage.c | 6 ++--
include/linux/hugetlb.h | 18 ++++++++++
mm/hugetlb.c | 84 ++++++++++++++++++++++++++++-------------------
3 files changed, 73 insertions(+), 35 deletions(-)
diff --git v3.10-rc3.orig/arch/x86/mm/hugetlbpage.c v3.10-rc3/arch/x86/mm/hugetlbpage.c
index ae1aa71..0e4a396 100644
--- v3.10-rc3.orig/arch/x86/mm/hugetlbpage.c
+++ v3.10-rc3/arch/x86/mm/hugetlbpage.c
@@ -75,6 +75,7 @@ huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
unsigned long saddr;
pte_t *spte = NULL;
pte_t *pte;
+ spinlock_t *ptl;
if (!vma_shareable(vma, addr))
return (pte_t *)pmd_alloc(mm, pud, addr);
@@ -89,6 +90,7 @@ huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
spte = huge_pte_offset(svma->vm_mm, saddr);
if (spte) {
get_page(virt_to_page(spte));
+ ptl = huge_pte_lockptr(mm, spte);
break;
}
}
@@ -97,12 +99,12 @@ huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
if (!spte)
goto out;
- spin_lock(&mm->page_table_lock);
+ spin_lock(ptl);
if (pud_none(*pud))
pud_populate(mm, pud, (pmd_t *)((unsigned long)spte & PAGE_MASK));
else
put_page(virt_to_page(spte));
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
out:
pte = (pte_t *)pmd_alloc(mm, pud, addr);
mutex_unlock(&mapping->i_mmap_mutex);
diff --git v3.10-rc3.orig/include/linux/hugetlb.h v3.10-rc3/include/linux/hugetlb.h
index a639c87..40f3215 100644
--- v3.10-rc3.orig/include/linux/hugetlb.h
+++ v3.10-rc3/include/linux/hugetlb.h
@@ -32,6 +32,24 @@ void hugepage_put_subpool(struct hugepage_subpool *spool);
int PageHuge(struct page *page);
+#if USE_SPLIT_PTLOCKS
+#define huge_pte_lockptr(mm, ptep) ({__pte_lockptr(virt_to_page(ptep)); })
+#else /* !USE_SPLIT_PTLOCKS */
+#define huge_pte_lockptr(mm, ptep) ({&(mm)->page_table_lock; })
+#endif /* USE_SPLIT_PTLOCKS */
+
+#define huge_pte_offset_lock(mm, address, ptlp) \
+({ \
+ pte_t *__pte = huge_pte_offset(mm, address); \
+ spinlock_t *__ptl = NULL; \
+ if (__pte) { \
+ __ptl = huge_pte_lockptr(mm, __pte); \
+ *(ptlp) = __ptl; \
+ spin_lock(__ptl); \
+ } \
+ __pte; \
+})
+
void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
diff --git v3.10-rc3.orig/mm/hugetlb.c v3.10-rc3/mm/hugetlb.c
index 463fb5e..8e1af32 100644
--- v3.10-rc3.orig/mm/hugetlb.c
+++ v3.10-rc3/mm/hugetlb.c
@@ -2325,6 +2325,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
+ spinlock_t *srcptl, *dstptl;
src_pte = huge_pte_offset(src, addr);
if (!src_pte)
continue;
@@ -2336,8 +2337,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
if (dst_pte == src_pte)
continue;
- spin_lock(&dst->page_table_lock);
- spin_lock_nested(&src->page_table_lock, SINGLE_DEPTH_NESTING);
+ dstptl = huge_pte_lockptr(dst, dst_pte);
+ srcptl = huge_pte_lockptr(src, src_pte);
+ spin_lock(dstptl);
+ spin_lock_nested(srcptl, SINGLE_DEPTH_NESTING);
if (!huge_pte_none(huge_ptep_get(src_pte))) {
if (cow)
huge_ptep_set_wrprotect(src, addr, src_pte);
@@ -2347,8 +2350,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
page_dup_rmap(ptepage);
set_huge_pte_at(dst, addr, dst_pte, entry);
}
- spin_unlock(&src->page_table_lock);
- spin_unlock(&dst->page_table_lock);
+ spin_unlock(srcptl);
+ spin_unlock(dstptl);
}
return 0;
@@ -2391,6 +2394,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
unsigned long address;
pte_t *ptep;
pte_t pte;
+ spinlock_t *ptl;
struct page *page;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
@@ -2404,25 +2408,24 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
tlb_start_vma(tlb, vma);
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
again:
- spin_lock(&mm->page_table_lock);
for (address = start; address < end; address += sz) {
- ptep = huge_pte_offset(mm, address);
+ ptep = huge_pte_offset_lock(mm, address, &ptl);
if (!ptep)
continue;
if (huge_pmd_unshare(mm, &address, ptep))
- continue;
+ goto unlock;
pte = huge_ptep_get(ptep);
if (huge_pte_none(pte))
- continue;
+ goto unlock;
/*
* HWPoisoned hugepage is already unmapped and dropped reference
*/
if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
huge_pte_clear(mm, address, ptep);
- continue;
+ goto unlock;
}
page = pte_page(pte);
@@ -2433,7 +2436,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
*/
if (ref_page) {
if (page != ref_page)
- continue;
+ goto unlock;
/*
* Mark the VMA as having unmapped its page so that
@@ -2450,13 +2453,18 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
page_remove_rmap(page);
force_flush = !__tlb_remove_page(tlb, page);
- if (force_flush)
+ if (force_flush) {
+ spin_unlock(ptl);
break;
+ }
/* Bail out after unmapping reference page if supplied */
- if (ref_page)
+ if (ref_page) {
+ spin_unlock(ptl);
break;
+ }
+unlock:
+ spin_unlock(ptl);
}
- spin_unlock(&mm->page_table_lock);
/*
* mmu_gather ran out of room to batch pages, we break out of
* the PTE lock to avoid doing the potential expensive TLB invalidate
@@ -2570,6 +2578,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
int outside_reserve = 0;
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */
+ spinlock_t *ptl = huge_pte_lockptr(mm, ptep);
old_page = pte_page(pte);
@@ -2601,7 +2610,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
page_cache_get(old_page);
/* Drop page_table_lock as buddy allocator may be called */
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
new_page = alloc_huge_page(vma, address, outside_reserve);
if (IS_ERR(new_page)) {
@@ -2619,7 +2628,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
BUG_ON(huge_pte_none(pte));
if (unmap_ref_private(mm, vma, old_page, address)) {
BUG_ON(huge_pte_none(pte));
- spin_lock(&mm->page_table_lock);
+ spin_lock(ptl);
ptep = huge_pte_offset(mm, address & huge_page_mask(h));
if (likely(pte_same(huge_ptep_get(ptep), pte)))
goto retry_avoidcopy;
@@ -2633,7 +2642,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
}
/* Caller expects lock to be held */
- spin_lock(&mm->page_table_lock);
+ spin_lock(ptl);
if (err == -ENOMEM)
return VM_FAULT_OOM;
else
@@ -2648,7 +2657,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
page_cache_release(new_page);
page_cache_release(old_page);
/* Caller expects lock to be held */
- spin_lock(&mm->page_table_lock);
+ spin_lock(ptl);
return VM_FAULT_OOM;
}
@@ -2663,7 +2672,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
* Retake the page_table_lock to check for racing updates
* before the page tables are altered
*/
- spin_lock(&mm->page_table_lock);
+ spin_lock(ptl);
ptep = huge_pte_offset(mm, address & huge_page_mask(h));
if (likely(pte_same(huge_ptep_get(ptep), pte))) {
/* Break COW */
@@ -2675,10 +2684,10 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
/* Make the old page be freed below */
new_page = old_page;
}
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
/* Caller expects lock to be held */
- spin_lock(&mm->page_table_lock);
+ spin_lock(ptl);
page_cache_release(new_page);
page_cache_release(old_page);
return 0;
@@ -2728,6 +2737,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page;
struct address_space *mapping;
pte_t new_pte;
+ spinlock_t *ptl;
/*
* Currently, we are forced to kill the process in the event the
@@ -2813,7 +2823,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto backout_unlocked;
}
- spin_lock(&mm->page_table_lock);
+ ptl = huge_pte_lockptr(mm, ptep);
+ spin_lock(ptl);
size = i_size_read(mapping->host) >> huge_page_shift(h);
if (idx >= size)
goto backout;
@@ -2835,13 +2846,13 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page);
}
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
unlock_page(page);
out:
return ret;
backout:
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
backout_unlocked:
unlock_page(page);
put_page(page);
@@ -2853,6 +2864,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
{
pte_t *ptep;
pte_t entry;
+ spinlock_t *ptl;
int ret;
struct page *page = NULL;
struct page *pagecache_page = NULL;
@@ -2921,7 +2933,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (page != pagecache_page)
lock_page(page);
- spin_lock(&mm->page_table_lock);
+ ptl = huge_pte_lockptr(mm, ptep);
+ spin_lock(ptl);
/* Check for a racing update before calling hugetlb_cow */
if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
goto out_page_table_lock;
@@ -2941,7 +2954,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
update_mmu_cache(vma, address, ptep);
out_page_table_lock:
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
if (pagecache_page) {
unlock_page(pagecache_page);
@@ -2976,9 +2989,9 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long remainder = *nr_pages;
struct hstate *h = hstate_vma(vma);
- spin_lock(&mm->page_table_lock);
while (vaddr < vma->vm_end && remainder) {
pte_t *pte;
+ spinlock_t *ptl = NULL;
int absent;
struct page *page;
@@ -2986,8 +2999,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
* Some archs (sparc64, sh*) have multiple pte_ts to
* each hugepage. We have to make sure we get the
* first, for the page indexing below to work.
+ *
+ * Note that page table lock is not held when pte is null.
*/
- pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
+ pte = huge_pte_offset_lock(mm, vaddr & huge_page_mask(h), &ptl);
absent = !pte || huge_pte_none(huge_ptep_get(pte));
/*
@@ -2999,6 +3014,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
if (absent && (flags & FOLL_DUMP) &&
!hugetlbfs_pagecache_present(h, vma, vaddr)) {
+ if (pte)
+ spin_unlock(ptl);
remainder = 0;
break;
}
@@ -3018,10 +3035,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
!huge_pte_write(huge_ptep_get(pte)))) {
int ret;
- spin_unlock(&mm->page_table_lock);
+ if (pte)
+ spin_unlock(ptl);
ret = hugetlb_fault(mm, vma, vaddr,
(flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
- spin_lock(&mm->page_table_lock);
if (!(ret & VM_FAULT_ERROR))
continue;
@@ -3052,8 +3069,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
goto same_page;
}
+ spin_unlock(ptl);
}
- spin_unlock(&mm->page_table_lock);
*nr_pages = remainder;
*position = vaddr;
@@ -3074,13 +3091,14 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
flush_cache_range(vma, address, end);
mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
- spin_lock(&mm->page_table_lock);
for (; address < end; address += huge_page_size(h)) {
- ptep = huge_pte_offset(mm, address);
+ spinlock_t *ptl;
+ ptep = huge_pte_offset_lock(mm, address, &ptl);
if (!ptep)
continue;
if (huge_pmd_unshare(mm, &address, ptep)) {
pages++;
+ spin_unlock(ptl);
continue;
}
if (!huge_pte_none(huge_ptep_get(ptep))) {
@@ -3090,8 +3108,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
set_huge_pte_at(mm, address, ptep, pte);
pages++;
}
+ spin_unlock(ptl);
}
- spin_unlock(&mm->page_table_lock);
/*
* Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
* may have cleared our pud entry and done put_page on the page table:
--
1.7.11.7
* [PATCH v3 2/2] migrate: add migrate_entry_wait_huge()
@ 2013-05-28 19:52 ` Naoya Horiguchi
From: Naoya Horiguchi @ 2013-05-28 19:52 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
KOSAKI Motohiro, Rik van Riel, linux-kernel
When we get a page fault for an address backed by a hugepage under
migration, the kernel can't wait correctly and busy-loops on the
hugepage fault until the migration finishes.
This is because pte_offset_map_lock() can't get a correct migration
entry or a correct page table lock for a hugepage.

This patch introduces migration_entry_wait_huge() to solve this.
Note that the caller, hugetlb_fault(), gets the pointer to the "leaf"
entry with huge_pte_offset(), which encapsulates all the arch-dependency
of the page table structure. So migration_entry_wait_huge() and
__migration_entry_wait() are free from arch-dependency.
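
As a condensed sketch of the resulting flow (drawn from the hunks below,
not a standalone implementation): hugetlb_fault() hands the huge pte
itself to the new helper, which picks the matching lock with
huge_pte_lockptr() from patch 1/2 and then shares the common wait code:

	/*
	 * In hugetlb_fault(): wait under the correct lock, then let the
	 * fault be retried once migration completes.
	 */
	if (unlikely(is_hugetlb_entry_migration(entry))) {
		migration_entry_wait_huge(mm, ptep);
		return 0;
	}

	/* In mm/migrate.c: resolve the lock, then wait as usual. */
	void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte)
	{
		spinlock_t *ptl = huge_pte_lockptr(mm, pte);

		__migration_entry_wait(mm, pte, ptl);
	}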
ChangeLog v3:
- use huge_pte_lockptr
ChangeLog v2:
- remove dup in migrate_entry_wait_huge()
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: stable@vger.kernel.org # 2.6.35
---
include/linux/swapops.h | 3 +++
mm/hugetlb.c | 2 +-
mm/migrate.c | 23 ++++++++++++++++++-----
3 files changed, 22 insertions(+), 6 deletions(-)
diff --git v3.10-rc3.orig/include/linux/swapops.h v3.10-rc3/include/linux/swapops.h
index 47ead51..c5fd30d 100644
--- v3.10-rc3.orig/include/linux/swapops.h
+++ v3.10-rc3/include/linux/swapops.h
@@ -137,6 +137,7 @@ static inline void make_migration_entry_read(swp_entry_t *entry)
extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
unsigned long address);
+extern void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte);
#else
#define make_migration_entry(page, write) swp_entry(0, 0)
@@ -148,6 +149,8 @@ static inline int is_migration_entry(swp_entry_t swp)
static inline void make_migration_entry_read(swp_entry_t *entryp) { }
static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
unsigned long address) { }
+static inline void migration_entry_wait_huge(struct mm_struct *mm,
+ pte_t *pte) { }
static inline int is_write_migration_entry(swp_entry_t entry)
{
return 0;
diff --git v3.10-rc3.orig/mm/hugetlb.c v3.10-rc3/mm/hugetlb.c
index 8e1af32..d91a438 100644
--- v3.10-rc3.orig/mm/hugetlb.c
+++ v3.10-rc3/mm/hugetlb.c
@@ -2877,7 +2877,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (ptep) {
entry = huge_ptep_get(ptep);
if (unlikely(is_hugetlb_entry_migration(entry))) {
- migration_entry_wait(mm, (pmd_t *)ptep, address);
+ migration_entry_wait_huge(mm, ptep);
return 0;
} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
return VM_FAULT_HWPOISON_LARGE |
diff --git v3.10-rc3.orig/mm/migrate.c v3.10-rc3/mm/migrate.c
index 6f2df6e..64ff118 100644
--- v3.10-rc3.orig/mm/migrate.c
+++ v3.10-rc3/mm/migrate.c
@@ -204,15 +204,14 @@ static void remove_migration_ptes(struct page *old, struct page *new)
* get to the page and wait until migration is finished.
* When we return from this function the fault will be retried.
*/
-void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
- unsigned long address)
+static void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
+ spinlock_t *ptl)
{
- pte_t *ptep, pte;
- spinlock_t *ptl;
+ pte_t pte;
swp_entry_t entry;
struct page *page;
- ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+ spin_lock(ptl);
pte = *ptep;
if (!is_swap_pte(pte))
goto out;
@@ -240,6 +239,20 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
pte_unmap_unlock(ptep, ptl);
}
+void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long address)
+{
+ spinlock_t *ptl = pte_lockptr(mm, pmd);
+ pte_t *ptep = pte_offset_map(pmd, address);
+ __migration_entry_wait(mm, ptep, ptl);
+}
+
+void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte)
+{
+ spinlock_t *ptl = huge_pte_lockptr(mm, pte);
+ __migration_entry_wait(mm, pte, ptl);
+}
+
#ifdef CONFIG_BLOCK
/* Returns true if all buffers are successfully locked */
static bool buffer_migrate_lock_buffers(struct buffer_head *head,
--
1.7.11.7
* Re: [PATCH 1/2] hugetlbfs: support split page table lock
@ 2013-05-29 1:09 ` Wanpeng Li
From: Wanpeng Li @ 2013-05-29 1:09 UTC (permalink / raw)
To: Naoya Horiguchi
Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
KOSAKI Motohiro, Rik van Riel, linux-kernel
On Tue, May 28, 2013 at 03:52:50PM -0400, Naoya Horiguchi wrote:
>Currently all page table handling in the hugetlbfs code is done under
>mm->page_table_lock. This is not optimal because it can cause lock
>contention between unrelated components sharing this lock.
>
>This patch makes hugepages use the split page table lock: we take
>page->ptl of the leaf node of the page table tree, which is the pte page
>for normal pages but can be a pmd and/or pud page for hugepages on some
>architectures.
>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
>Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>+ spin_unlock(ptl);
> backout_unlocked:
> unlock_page(page);
> put_page(page);
>@@ -2853,6 +2864,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> {
> pte_t *ptep;
> pte_t entry;
>+ spinlock_t *ptl;
> int ret;
> struct page *page = NULL;
> struct page *pagecache_page = NULL;
>@@ -2921,7 +2933,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> if (page != pagecache_page)
> lock_page(page);
>
>- spin_lock(&mm->page_table_lock);
>+ ptl = huge_pte_lockptr(mm, ptep);
>+ spin_lock(ptl);
> /* Check for a racing update before calling hugetlb_cow */
> if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
> goto out_page_table_lock;
>@@ -2941,7 +2954,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> update_mmu_cache(vma, address, ptep);
>
> out_page_table_lock:
>- spin_unlock(&mm->page_table_lock);
>+ spin_unlock(ptl);
>
> if (pagecache_page) {
> unlock_page(pagecache_page);
>@@ -2976,9 +2989,9 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long remainder = *nr_pages;
> struct hstate *h = hstate_vma(vma);
>
>- spin_lock(&mm->page_table_lock);
> while (vaddr < vma->vm_end && remainder) {
> pte_t *pte;
>+ spinlock_t *ptl = NULL;
> int absent;
> struct page *page;
>
>@@ -2986,8 +2999,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> * Some archs (sparc64, sh*) have multiple pte_ts to
> * each hugepage. We have to make sure we get the
> * first, for the page indexing below to work.
>+ *
>+ * Note that page table lock is not held when pte is null.
> */
>- pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
>+ pte = huge_pte_offset_lock(mm, vaddr & huge_page_mask(h), &ptl);
> absent = !pte || huge_pte_none(huge_ptep_get(pte));
>
> /*
>@@ -2999,6 +3014,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> */
> if (absent && (flags & FOLL_DUMP) &&
> !hugetlbfs_pagecache_present(h, vma, vaddr)) {
>+ if (pte)
>+ spin_unlock(ptl);
> remainder = 0;
> break;
> }
>@@ -3018,10 +3035,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> !huge_pte_write(huge_ptep_get(pte)))) {
> int ret;
>
>- spin_unlock(&mm->page_table_lock);
>+ if (pte)
>+ spin_unlock(ptl);
> ret = hugetlb_fault(mm, vma, vaddr,
> (flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
>- spin_lock(&mm->page_table_lock);
> if (!(ret & VM_FAULT_ERROR))
> continue;
>
>@@ -3052,8 +3069,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> */
> goto same_page;
> }
>+ spin_unlock(ptl);
> }
>- spin_unlock(&mm->page_table_lock);
> *nr_pages = remainder;
> *position = vaddr;
>
>@@ -3074,13 +3091,14 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> flush_cache_range(vma, address, end);
>
> mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
>- spin_lock(&mm->page_table_lock);
> for (; address < end; address += huge_page_size(h)) {
>- ptep = huge_pte_offset(mm, address);
>+ spinlock_t *ptl;
>+ ptep = huge_pte_offset_lock(mm, address, &ptl);
> if (!ptep)
> continue;
> if (huge_pmd_unshare(mm, &address, ptep)) {
> pages++;
>+ spin_unlock(ptl);
> continue;
> }
> if (!huge_pte_none(huge_ptep_get(ptep))) {
>@@ -3090,8 +3108,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> set_huge_pte_at(mm, address, ptep, pte);
> pages++;
> }
>+ spin_unlock(ptl);
> }
>- spin_unlock(&mm->page_table_lock);
> /*
> * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
> * may have cleared our pud entry and done put_page on the page table:
>--
>1.7.11.7
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 2/2] migrate: add migrate_entry_wait_huge()
2013-05-28 19:52 ` Naoya Horiguchi
(?)
(?)
@ 2013-05-29 1:11 ` Wanpeng Li
-1 siblings, 0 replies; 26+ messages in thread
From: Wanpeng Li @ 2013-05-29 1:11 UTC (permalink / raw)
To: Naoya Horiguchi
Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, Michal Hocko,
KOSAKI Motohiro, Rik van Riel, linux-kernel
On Tue, May 28, 2013 at 03:52:51PM -0400, Naoya Horiguchi wrote:
>When we have a page fault for the address which is backed by a hugepage
>under migration, the kernel can't wait correctly and do busy looping on
>hugepage fault until the migration finishes.
>This is because pte_offset_map_lock() can't get a correct migration entry
>or a correct page table lock for hugepage.
>This patch introduces migration_entry_wait_huge() to solve this.
>
>Note that the caller, hugetlb_fault(), gets the pointer to the "leaf"
>entry with huge_pte_offset() inside which all the arch-dependency of
>the page table structure are. So migration_entry_wait_huge() and
>__migration_entry_wait() are free from arch-dependency.
>
>ChangeLog v3:
> - use huge_pte_lockptr
>
>ChangeLog v2:
> - remove dup in migrate_entry_wait_huge()
>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
>Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>Reviewed-by: Rik van Riel <riel@redhat.com>
>Cc: stable@vger.kernel.org # 2.6.35
>---
> include/linux/swapops.h | 3 +++
> mm/hugetlb.c | 2 +-
> mm/migrate.c | 23 ++++++++++++++++++-----
> 3 files changed, 22 insertions(+), 6 deletions(-)
>
>diff --git v3.10-rc3.orig/include/linux/swapops.h v3.10-rc3/include/linux/swapops.h
>index 47ead51..c5fd30d 100644
>--- v3.10-rc3.orig/include/linux/swapops.h
>+++ v3.10-rc3/include/linux/swapops.h
>@@ -137,6 +137,7 @@ static inline void make_migration_entry_read(swp_entry_t *entry)
>
> extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> unsigned long address);
>+extern void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte);
> #else
>
> #define make_migration_entry(page, write) swp_entry(0, 0)
>@@ -148,6 +149,8 @@ static inline int is_migration_entry(swp_entry_t swp)
> static inline void make_migration_entry_read(swp_entry_t *entryp) { }
> static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> unsigned long address) { }
>+static inline void migration_entry_wait_huge(struct mm_struct *mm,
>+ pte_t *pte) { }
> static inline int is_write_migration_entry(swp_entry_t entry)
> {
> return 0;
>diff --git v3.10-rc3.orig/mm/hugetlb.c v3.10-rc3/mm/hugetlb.c
>index 8e1af32..d91a438 100644
>--- v3.10-rc3.orig/mm/hugetlb.c
>+++ v3.10-rc3/mm/hugetlb.c
>@@ -2877,7 +2877,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> if (ptep) {
> entry = huge_ptep_get(ptep);
> if (unlikely(is_hugetlb_entry_migration(entry))) {
>- migration_entry_wait(mm, (pmd_t *)ptep, address);
>+ migration_entry_wait_huge(mm, ptep);
> return 0;
> } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
> return VM_FAULT_HWPOISON_LARGE |
>diff --git v3.10-rc3.orig/mm/migrate.c v3.10-rc3/mm/migrate.c
>index 6f2df6e..64ff118 100644
>--- v3.10-rc3.orig/mm/migrate.c
>+++ v3.10-rc3/mm/migrate.c
>@@ -204,15 +204,14 @@ static void remove_migration_ptes(struct page *old, struct page *new)
> * get to the page and wait until migration is finished.
> * When we return from this function the fault will be retried.
> */
>-void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
>- unsigned long address)
>+static void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
>+ spinlock_t *ptl)
> {
>- pte_t *ptep, pte;
>- spinlock_t *ptl;
>+ pte_t pte;
> swp_entry_t entry;
> struct page *page;
>
>- ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
>+ spin_lock(ptl);
> pte = *ptep;
> if (!is_swap_pte(pte))
> goto out;
>@@ -240,6 +239,20 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> pte_unmap_unlock(ptep, ptl);
> }
>
>+void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
>+ unsigned long address)
>+{
>+ spinlock_t *ptl = pte_lockptr(mm, pmd);
>+ pte_t *ptep = pte_offset_map(pmd, address);
>+ __migration_entry_wait(mm, ptep, ptl);
>+}
>+
>+void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte)
>+{
>+ spinlock_t *ptl = huge_pte_lockptr(mm, pte);
>+ __migration_entry_wait(mm, pte, ptl);
>+}
>+
> #ifdef CONFIG_BLOCK
> /* Returns true if all buffers are successfully locked */
> static bool buffer_migrate_lock_buffers(struct buffer_head *head,
>--
>1.7.11.7
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 2/2] migrate: add migrate_entry_wait_huge()
2013-05-28 19:52 ` Naoya Horiguchi
@ 2013-05-31 19:30 ` Andrew Morton
-1 siblings, 0 replies; 26+ messages in thread
From: Andrew Morton @ 2013-05-31 19:30 UTC (permalink / raw)
To: Naoya Horiguchi
Cc: linux-mm, Mel Gorman, Andi Kleen, Michal Hocko, KOSAKI Motohiro,
Rik van Riel, linux-kernel
On Tue, 28 May 2013 15:52:51 -0400 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> When we have a page fault for the address which is backed by a hugepage
> under migration, the kernel can't wait correctly and do busy looping on
> hugepage fault until the migration finishes.
> This is because pte_offset_map_lock() can't get a correct migration entry
> or a correct page table lock for hugepage.
> This patch introduces migration_entry_wait_huge() to solve this.
>
> Note that the caller, hugetlb_fault(), gets the pointer to the "leaf"
> entry with huge_pte_offset() inside which all the arch-dependency of
> the page table structure are. So migration_entry_wait_huge() and
> __migration_entry_wait() are free from arch-dependency.
>
> ChangeLog v3:
> - use huge_pte_lockptr
>
> ChangeLog v2:
> - remove dup in migrate_entry_wait_huge()
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Cc: stable@vger.kernel.org # 2.6.35
That changelog is not suitable for a -stable patch. People who
maintain and utilize the stable trees need to know in some detail what
is the end-user impact of the patch (or, equivalently, of the bug
which the patch fixes).
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 2/2] migrate: add migrate_entry_wait_huge()
2013-05-31 19:30 ` Andrew Morton
@ 2013-05-31 19:46 ` Naoya Horiguchi
-1 siblings, 0 replies; 26+ messages in thread
From: Naoya Horiguchi @ 2013-05-31 19:46 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, Mel Gorman, Andi Kleen, Michal Hocko, KOSAKI Motohiro,
Rik van Riel, linux-kernel
> That changelog is not suitable for a -stable patch. People who
> maintain and utilize the stable trees need to know in some detail what
> is the end-user impact of the patch (or, equivalently, of the bug
> which the patch fixes).
OK. So, could you insert the following sentence?
On Fri, May 31, 2013 at 12:30:03PM -0700, Andrew Morton wrote:
> On Tue, 28 May 2013 15:52:51 -0400 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
>
> > When we have a page fault for the address which is backed by a hugepage
> > under migration, the kernel can't wait correctly and do busy looping on
> > hugepage fault until the migration finishes.
As a result, users who try to kick hugepage migration (via soft offlining,
for example) occasionally experience a long delay or a soft lockup.
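(For reference, one way such users kick hugepage migration from userspace is
soft offlining with madvise(). A rough illustration only, not part of the
patch; it assumes CONFIG_MEMORY_FAILURE, CAP_SYS_ADMIN, 2MB hugepages and at
least one page reserved in the hugepage pool:

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#ifndef MADV_SOFT_OFFLINE
#define MADV_SOFT_OFFLINE 101	/* may be missing from older libc headers */
#endif

#define HPAGE_SZ (2UL * 1024 * 1024)	/* assumes 2MB hugepages */

int main(void)
{
	pid_t child;
	volatile char *p = mmap(NULL, HPAGE_SZ, PROT_READ | PROT_WRITE,
				MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset((void *)p, 1, HPAGE_SZ);	/* back the mapping with a hugepage */

	child = fork();
	if (child == 0) {
		for (;;)		/* keep writing to the hugepage ... */
			p[0]++;
	}
	/* ... while the parent migrates it away via soft offlining */
	if (madvise((void *)p, HPAGE_SZ, MADV_SOFT_OFFLINE))
		perror("madvise");
	kill(child, SIGKILL);
	waitpid(child, NULL, 0);
	return 0;
}

The child's write fault on the migration entry is where the busy loop shows
up without this patch.)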
Thanks,
Naoya
> > This is because pte_offset_map_lock() can't get a correct migration entry
> > or a correct page table lock for hugepage.
> > This patch introduces migration_entry_wait_huge() to solve this.
> >
> > Note that the caller, hugetlb_fault(), gets the pointer to the "leaf"
> > entry with huge_pte_offset() inside which all the arch-dependency of
> > the page table structure are. So migration_entry_wait_huge() and
> > __migration_entry_wait() are free from arch-dependency.
> >
> > ChangeLog v3:
> > - use huge_pte_lockptr
> >
> > ChangeLog v2:
> > - remove dup in migrate_entry_wait_huge()
> >
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Reviewed-by: Rik van Riel <riel@redhat.com>
> > Cc: stable@vger.kernel.org # 2.6.35
>
> That changelog is not suitable for a -stable patch. People who
> maintain and utilize the stable trees need to know in some detail what
> is the end-user impact of the patch (or, equivalently, of the bug
> which the patch fixes).
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] hugetlbfs: support split page table lock
2013-05-28 19:52 ` Naoya Horiguchi
@ 2013-06-03 13:19 ` Michal Hocko
-1 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2013-06-03 13:19 UTC (permalink / raw)
To: Naoya Horiguchi
Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, KOSAKI Motohiro,
Rik van Riel, linux-kernel
On Tue 28-05-13 15:52:50, Naoya Horiguchi wrote:
> Currently all of page table handling by hugetlbfs code are done under
> mm->page_table_lock. This is not optimal because there can be lock
> contentions between unrelated components using this lock.
While I agree with such a change in general, I am a bit afraid of all the
subtle tweaks in the mm code that make hugetlb special. Maybe there are
none for page_table_lock, but I am not 100% sure. So this might be
really tricky, and it is not necessary for your further patches, is it?
How have you tested this?
> This patch makes hugepage support split page table lock so that
> we use page->ptl of the leaf node of page table tree which is pte for
> normal pages but can be pmd and/or pud for hugepages of some architectures.
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
> arch/x86/mm/hugetlbpage.c | 6 ++--
> include/linux/hugetlb.h | 18 ++++++++++
> mm/hugetlb.c | 84 ++++++++++++++++++++++++++++-------------------
This doesn't seem to be the complete story. At least not judging from a
trivial search:
$ find arch/ -name "*hugetlb*" | xargs git grep "page_table_lock" --
arch/powerpc/mm/hugetlbpage.c: spin_lock(&mm->page_table_lock);
arch/powerpc/mm/hugetlbpage.c: spin_unlock(&mm->page_table_lock);
arch/tile/mm/hugetlbpage.c: spin_lock(&mm->page_table_lock);
arch/tile/mm/hugetlbpage.c: spin_unlock(&mm->page_table_lock);
arch/x86/mm/hugetlbpage.c: * called with vma->vm_mm->page_table_lock held.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 2/2] migrate: add migrate_entry_wait_huge()
2013-05-28 19:52 ` Naoya Horiguchi
@ 2013-06-03 13:26 ` Michal Hocko
-1 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2013-06-03 13:26 UTC (permalink / raw)
To: Naoya Horiguchi
Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, KOSAKI Motohiro,
Rik van Riel, linux-kernel
On Tue 28-05-13 15:52:51, Naoya Horiguchi wrote:
> When we have a page fault for the address which is backed by a hugepage
> under migration, the kernel can't wait correctly and do busy looping on
> hugepage fault until the migration finishes.
> This is because pte_offset_map_lock() can't get a correct migration entry
> or a correct page table lock for hugepage.
> This patch introduces migration_entry_wait_huge() to solve this.
>
> Note that the caller, hugetlb_fault(), gets the pointer to the "leaf"
> entry with huge_pte_offset() inside which all the arch-dependency of
> the page table structure are. So migration_entry_wait_huge() and
> __migration_entry_wait() are free from arch-dependency.
>
> ChangeLog v3:
> - use huge_pte_lockptr
>
> ChangeLog v2:
> - remove dup in migrate_entry_wait_huge()
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Cc: stable@vger.kernel.org # 2.6.35
OK, this looks good to me, and I guess you can safely replace
huge_pte_lockptr with &(mm)->page_table_lock so you can implement this
even without the risky 1/2 of this series. The patch should be as simple as
possible, especially when it goes to stable.
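Concretely, the stand-alone fallback could be as small as something like
this (just a sketch of the idea, keeping the __migration_entry_wait()
signature from the patch quoted below):

void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte)
{
	/*
	 * Without the split-ptl change, hugetlb "leaf" entries are still
	 * serialized by the per-mm lock, so simply hand that one down.
	 */
	spinlock_t *ptl = &mm->page_table_lock;

	__migration_entry_wait(mm, pte, ptl);
}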
Without 1/2 dependency
Reviewed-by: Michal Hocko <mhocko@suse.cz>
> ---
> include/linux/swapops.h | 3 +++
> mm/hugetlb.c | 2 +-
> mm/migrate.c | 23 ++++++++++++++++++-----
> 3 files changed, 22 insertions(+), 6 deletions(-)
>
> diff --git v3.10-rc3.orig/include/linux/swapops.h v3.10-rc3/include/linux/swapops.h
> index 47ead51..c5fd30d 100644
> --- v3.10-rc3.orig/include/linux/swapops.h
> +++ v3.10-rc3/include/linux/swapops.h
> @@ -137,6 +137,7 @@ static inline void make_migration_entry_read(swp_entry_t *entry)
>
> extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> unsigned long address);
> +extern void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte);
> #else
>
> #define make_migration_entry(page, write) swp_entry(0, 0)
> @@ -148,6 +149,8 @@ static inline int is_migration_entry(swp_entry_t swp)
> static inline void make_migration_entry_read(swp_entry_t *entryp) { }
> static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> unsigned long address) { }
> +static inline void migration_entry_wait_huge(struct mm_struct *mm,
> + pte_t *pte) { }
> static inline int is_write_migration_entry(swp_entry_t entry)
> {
> return 0;
> diff --git v3.10-rc3.orig/mm/hugetlb.c v3.10-rc3/mm/hugetlb.c
> index 8e1af32..d91a438 100644
> --- v3.10-rc3.orig/mm/hugetlb.c
> +++ v3.10-rc3/mm/hugetlb.c
> @@ -2877,7 +2877,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> if (ptep) {
> entry = huge_ptep_get(ptep);
> if (unlikely(is_hugetlb_entry_migration(entry))) {
> - migration_entry_wait(mm, (pmd_t *)ptep, address);
> + migration_entry_wait_huge(mm, ptep);
> return 0;
> } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
> return VM_FAULT_HWPOISON_LARGE |
> diff --git v3.10-rc3.orig/mm/migrate.c v3.10-rc3/mm/migrate.c
> index 6f2df6e..64ff118 100644
> --- v3.10-rc3.orig/mm/migrate.c
> +++ v3.10-rc3/mm/migrate.c
> @@ -204,15 +204,14 @@ static void remove_migration_ptes(struct page *old, struct page *new)
> * get to the page and wait until migration is finished.
> * When we return from this function the fault will be retried.
> */
> -void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> - unsigned long address)
> +static void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> + spinlock_t *ptl)
> {
> - pte_t *ptep, pte;
> - spinlock_t *ptl;
> + pte_t pte;
> swp_entry_t entry;
> struct page *page;
>
> - ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
> + spin_lock(ptl);
> pte = *ptep;
> if (!is_swap_pte(pte))
> goto out;
> @@ -240,6 +239,20 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> pte_unmap_unlock(ptep, ptl);
> }
>
> +void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> + unsigned long address)
> +{
> + spinlock_t *ptl = pte_lockptr(mm, pmd);
> + pte_t *ptep = pte_offset_map(pmd, address);
> + __migration_entry_wait(mm, ptep, ptl);
> +}
> +
> +void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte)
> +{
> + spinlock_t *ptl = huge_pte_lockptr(mm, pte);
> + __migration_entry_wait(mm, pte, ptl);
> +}
> +
> #ifdef CONFIG_BLOCK
> /* Returns true if all buffers are successfully locked */
> static bool buffer_migrate_lock_buffers(struct buffer_head *head,
> --
> 1.7.11.7
>
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] hugetlbfs: support split page table lock
2013-06-03 13:19 ` Michal Hocko
@ 2013-06-03 14:34 ` Naoya Horiguchi
-1 siblings, 0 replies; 26+ messages in thread
From: Naoya Horiguchi @ 2013-06-03 14:34 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, KOSAKI Motohiro,
Rik van Riel, linux-kernel
On Mon, Jun 03, 2013 at 03:19:32PM +0200, Michal Hocko wrote:
> On Tue 28-05-13 15:52:50, Naoya Horiguchi wrote:
> > Currently all of page table handling by hugetlbfs code are done under
> > mm->page_table_lock. This is not optimal because there can be lock
> > contentions between unrelated components using this lock.
>
> While I agree with such a change in general I am a bit afraid of all
> subtle tweaks in the mm code that make hugetlb special. Maybe there are
> none for page_table_lock but I am not 100% sure. So this might be
> really tricky and it is not necessary for your further patches, is it?
No, this page_table_lock patch is separable from migration stuff.
As you said in another email, changes going to stable should be minimal,
so it's better to make the 2/2 patch not depend on this patch.
> How have you tested this?
Other than the libhugetlbfs test suite (which contains many workloads,
though I'm not sure it can detect a possible regression from this patch),
I did simple testing where:
- create a file on hugetlbfs,
- create 10 processes and make each of them iterate the following:
* mmap() the hugetlbfs file,
* memset() the mapped range (to cause hugetlb_fault), and
* munmap() the mapped range.
I think this can create the kind of racy situation that should be prevented
by the page table locks.
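A minimal sketch of that test program (the mount point, file name and sizes
are made up; it assumes 2MB hugepages and enough pages reserved in the
hugepage pool):

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define NPROC		10
#define NHPAGES		8
#define HPAGE_SZ	(2UL * 1024 * 1024)	/* assumes 2MB hugepages */
#define LEN		(NHPAGES * HPAGE_SZ)

int main(void)
{
	int i, iter;
	/* made-up hugetlbfs mount point and file name */
	int fd = open("/mnt/hugetlbfs/stress", O_CREAT | O_RDWR, 0600);

	if (fd < 0 || ftruncate(fd, LEN) < 0) {
		perror("setup");
		return 1;
	}

	for (i = 0; i < NPROC; i++) {
		if (fork() == 0) {
			for (iter = 0; iter < 1000; iter++) {
				void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
					       MAP_SHARED, fd, 0);
				if (p == MAP_FAILED)
					_exit(1);
				memset(p, 0, LEN);	/* trigger hugetlb_fault() */
				munmap(p, LEN);
			}
			_exit(0);
		}
	}
	for (i = 0; i < NPROC; i++)
		wait(NULL);
	return 0;
}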
> > This patch makes hugepage support split page table lock so that
> > we use page->ptl of the leaf node of page table tree which is pte for
> > normal pages but can be pmd and/or pud for hugepages of some architectures.
> >
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > ---
> > arch/x86/mm/hugetlbpage.c | 6 ++--
> > include/linux/hugetlb.h | 18 ++++++++++
> > mm/hugetlb.c | 84 ++++++++++++++++++++++++++++-------------------
>
> This doesn't seem to be the complete story. At least not from the
> trivial:
> $ find arch/ -name "*hugetlb*" | xargs git grep "page_table_lock" --
> arch/powerpc/mm/hugetlbpage.c: spin_lock(&mm->page_table_lock);
> arch/powerpc/mm/hugetlbpage.c: spin_unlock(&mm->page_table_lock);
> arch/tile/mm/hugetlbpage.c: spin_lock(&mm->page_table_lock);
> arch/tile/mm/hugetlbpage.c: spin_unlock(&mm->page_table_lock);
> arch/x86/mm/hugetlbpage.c: * called with vma->vm_mm->page_table_lock held.
These trivial ones should be fixed. Sorry.
Naoya
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 2/2] migrate: add migrate_entry_wait_huge()
2013-06-03 13:26 ` Michal Hocko
@ 2013-06-03 14:34 ` Naoya Horiguchi
-1 siblings, 0 replies; 26+ messages in thread
From: Naoya Horiguchi @ 2013-06-03 14:34 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, KOSAKI Motohiro,
Rik van Riel, linux-kernel
On Mon, Jun 03, 2013 at 03:26:41PM +0200, Michal Hocko wrote:
> On Tue 28-05-13 15:52:51, Naoya Horiguchi wrote:
> > When we have a page fault for the address which is backed by a hugepage
> > under migration, the kernel can't wait correctly and do busy looping on
> > hugepage fault until the migration finishes.
> > This is because pte_offset_map_lock() can't get a correct migration entry
> > or a correct page table lock for hugepage.
> > This patch introduces migration_entry_wait_huge() to solve this.
> >
> > Note that the caller, hugetlb_fault(), gets the pointer to the "leaf"
> > entry with huge_pte_offset() inside which all the arch-dependency of
> > the page table structure are. So migration_entry_wait_huge() and
> > __migration_entry_wait() are free from arch-dependency.
> >
> > ChangeLog v3:
> > - use huge_pte_lockptr
> >
> > ChangeLog v2:
> > - remove dup in migrate_entry_wait_huge()
> >
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Reviewed-by: Rik van Riel <riel@redhat.com>
> > Cc: stable@vger.kernel.org # 2.6.35
>
> OK, this looks good to me and I guess you can safely replace
> huge_pte_lockptr by &(mm)->page_table_lock so you can implement this
> even without risky 1/2 of this series. The patch should be as simple as
> possible especially when it goes to the stable.
Yes, I agree.
> > Without 1/2 dependency
> Reviewed-by: Michal Hocko <mhocko@suse.cz>
Thanks,
Naoya
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] hugetlbfs: support split page table lock
2013-06-03 14:34 ` Naoya Horiguchi
@ 2013-06-03 15:42 ` Michal Hocko
-1 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2013-06-03 15:42 UTC (permalink / raw)
To: Naoya Horiguchi, Andrew Morton
Cc: linux-mm, Mel Gorman, Andi Kleen, KOSAKI Motohiro, Rik van Riel,
linux-kernel
On Mon 03-06-13 10:34:35, Naoya Horiguchi wrote:
> On Mon, Jun 03, 2013 at 03:19:32PM +0200, Michal Hocko wrote:
> > On Tue 28-05-13 15:52:50, Naoya Horiguchi wrote:
> > > Currently all of page table handling by hugetlbfs code are done under
> > > mm->page_table_lock. This is not optimal because there can be lock
> > > contentions between unrelated components using this lock.
> >
> > While I agree with such a change in general I am a bit afraid of all
> > subtle tweaks in the mm code that make hugetlb special. Maybe there are
> > none for page_table_lock but I am not 100% sure. So this might be
> > really tricky and it is not necessary for your further patches, is it?
>
> No, this page_table_lock patch is separable from migration stuff.
> As you said in another email, changes going to stable should be minimal,
> so it's better to make 2/2 patch not depend on this patch.
OK, so how do we go around this? Both patches are in the mm tree now.
Should Andrew just drop the current version so that you can repost a new
version? Sorry I didn't jump in sooner, but I was quite busy last week.
> > How have you tested this?
>
> Other than libhugetlbfs test (that contains many workloads, but I'm
> not sure it can detect the possible regression of this patch,)
> I did simple testing where:
> - create a file on hugetlbfs,
> - create 10 processes and make each of them iterate the following:
> * mmap() the hugetlbfs file,
> * memset() the mapped range (to cause hugetlb_fault), and
> * munmap() the mapped range.
> I think that this can make racy situation which should be prevented
> by page table locks.
OK, but this still requires a deep inspection of all the subtle
dependencies on page_table_lock in the core mm. I might be wrong here,
and I should be more specific about the issues I only have a suspicion
about, but as this is "just" a scalability improvement (is it actually
measurable?) I would suggest putting it at the end of your hugetlbfs
enhancements for the migration. Just from the reviewability point of
view.
> > > This patch makes hugepage support split page table lock so that
> > > we use page->ptl of the leaf node of page table tree which is pte for
> > > normal pages but can be pmd and/or pud for hugepages of some architectures.
> > >
> > > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > > ---
> > > arch/x86/mm/hugetlbpage.c | 6 ++--
> > > include/linux/hugetlb.h | 18 ++++++++++
> > > mm/hugetlb.c | 84 ++++++++++++++++++++++++++++-------------------
> >
> > This doesn't seem to be the complete story. At least not from the
> > trivial:
> > $ find arch/ -name "*hugetlb*" | xargs git grep "page_table_lock" --
> > arch/powerpc/mm/hugetlbpage.c: spin_lock(&mm->page_table_lock);
> > arch/powerpc/mm/hugetlbpage.c: spin_unlock(&mm->page_table_lock);
> > arch/tile/mm/hugetlbpage.c: spin_lock(&mm->page_table_lock);
> > arch/tile/mm/hugetlbpage.c: spin_unlock(&mm->page_table_lock);
> > arch/x86/mm/hugetlbpage.c: * called with vma->vm_mm->page_table_lock held.
>
> This trivials should be fixed. Sorry.
Other archs are often forgotten, and cscope doesn't exactly help ;)
Thanks
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH v4] migrate: add migrate_entry_wait_huge()
2013-06-03 13:26 ` Michal Hocko
@ 2013-06-04 16:44 ` Naoya Horiguchi
-1 siblings, 0 replies; 26+ messages in thread
From: Naoya Horiguchi @ 2013-06-04 16:44 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, Andrew Morton, Mel Gorman, Andi Kleen, KOSAKI Motohiro,
Rik van Riel, linux-kernel
Here is the revised one. Andrew, could you replace the following
patches in your tree with this?
migrate-add-migrate_entry_wait_huge.patch
hugetlbfs-support-split-page-table-lock.patch
Thanks,
Naoya Horiguchi
----
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Date: Tue, 4 Jun 2013 12:26:32 -0400
Subject: [PATCH v4] migrate: add migrate_entry_wait_huge()
When we get a page fault for an address that is backed by a hugepage
under migration, the kernel can't wait correctly and busy-loops on the
hugepage fault until the migration finishes.
As a result, users who try to kick off hugepage migration (via soft
offlining, for example) occasionally experience long delays or soft
lockups.
This is because pte_offset_map_lock() can't get a correct migration
entry or a correct page table lock for a hugepage.
This patch introduces migration_entry_wait_huge() to solve this.
ChangeLog v4:
- replace huge_pte_lockptr with &(mm)->page_table_lock
(postponed split page table lock patch)
- update description
ChangeLog v3:
- use huge_pte_lockptr
ChangeLog v2:
- remove dup in migrate_entry_wait_huge()
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: stable@vger.kernel.org # 2.6.35
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
---
include/linux/swapops.h | 3 +++
mm/hugetlb.c | 2 +-
mm/migrate.c | 23 ++++++++++++++++++-----
3 files changed, 22 insertions(+), 6 deletions(-)
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 47ead51..c5fd30d 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -137,6 +137,7 @@ static inline void make_migration_entry_read(swp_entry_t *entry)
extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
unsigned long address);
+extern void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte);
#else
#define make_migration_entry(page, write) swp_entry(0, 0)
@@ -148,6 +149,8 @@ static inline int is_migration_entry(swp_entry_t swp)
static inline void make_migration_entry_read(swp_entry_t *entryp) { }
static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
unsigned long address) { }
+static inline void migration_entry_wait_huge(struct mm_struct *mm,
+ pte_t *pte) { }
static inline int is_write_migration_entry(swp_entry_t entry)
{
return 0;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 463fb5e..d793c5e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2865,7 +2865,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (ptep) {
entry = huge_ptep_get(ptep);
if (unlikely(is_hugetlb_entry_migration(entry))) {
- migration_entry_wait(mm, (pmd_t *)ptep, address);
+ migration_entry_wait_huge(mm, ptep);
return 0;
} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
return VM_FAULT_HWPOISON_LARGE |
diff --git a/mm/migrate.c b/mm/migrate.c
index 6f2df6e..b8d56a1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -204,15 +204,14 @@ static void remove_migration_ptes(struct page *old, struct page *new)
* get to the page and wait until migration is finished.
* When we return from this function the fault will be retried.
*/
-void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
- unsigned long address)
+static void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
+ spinlock_t *ptl)
{
- pte_t *ptep, pte;
- spinlock_t *ptl;
+ pte_t pte;
swp_entry_t entry;
struct page *page;
- ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+ spin_lock(ptl);
pte = *ptep;
if (!is_swap_pte(pte))
goto out;
@@ -240,6 +239,20 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
pte_unmap_unlock(ptep, ptl);
}
+void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long address)
+{
+ spinlock_t *ptl = pte_lockptr(mm, pmd);
+ pte_t *ptep = pte_offset_map(pmd, address);
+ __migration_entry_wait(mm, ptep, ptl);
+}
+
+void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte)
+{
+ spinlock_t *ptl = &(mm)->page_table_lock;
+ __migration_entry_wait(mm, pte, ptl);
+}
+
#ifdef CONFIG_BLOCK
/* Returns true if all buffers are successfully locked */
static bool buffer_migrate_lock_buffers(struct buffer_head *head,
--
1.7.11.7
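
For reference, the failure mode described in the changelog above can be
exercised from user space roughly as follows: one task keeps
soft-offlining (and thereby migrating) a mapped hugepage while another
task keeps touching it, so the toucher faults on the migration entry.
The sketch below is only an illustration; it assumes CAP_SYS_ADMIN,
CONFIG_MEMORY_FAILURE, spare free hugepages to migrate into, and a 2MB
hugepage size, and it is not the reproducer used by the author.

/*
 * Illustrative sketch (not the author's test): the parent repeatedly
 * soft-offlines the hugepage, which migrates it, while the child keeps
 * faulting it back in and so hits the migration entry.
 */
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB		0x40000
#endif
#ifndef MADV_SOFT_OFFLINE
#define MADV_SOFT_OFFLINE	101
#endif

#define HPAGE_SIZE	(2UL * 1024 * 1024)

int main(void)
{
	long i;
	char *p;

	p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	memset(p, 1, HPAGE_SIZE);		/* fault the hugepage in */

	if (fork() == 0) {
		/* keep re-faulting the page while it is being migrated */
		for (i = 0; i < 10000000; i++)
			*(volatile char *)p;
		_exit(0);
	}

	/* each successful soft-offline migrates the mapped hugepage */
	for (i = 0; i < 100; i++)
		madvise(p, HPAGE_SIZE, MADV_SOFT_OFFLINE);

	wait(NULL);
	return 0;
}

On an unpatched kernel the faulting task can spin in hugetlb_fault()
instead of sleeping; with migration_entry_wait_huge() it waits until
the migration finishes, as the changelog describes.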
End of thread (newest message: 2013-06-04 16:45 UTC)
Thread overview: 26+ messages
2013-05-28 19:52 [PATCH 0/2] hugetlbfs: support split page table lock Naoya Horiguchi
2013-05-28 19:52 ` [PATCH 1/2] " Naoya Horiguchi
2013-05-29  1:09   ` Wanpeng Li
2013-06-03 13:19   ` Michal Hocko
2013-06-03 14:34     ` Naoya Horiguchi
2013-06-03 15:42       ` Michal Hocko
2013-05-28 19:52 ` [PATCH v3 2/2] migrate: add migrate_entry_wait_huge() Naoya Horiguchi
2013-05-29  1:11   ` Wanpeng Li
2013-05-31 19:30   ` Andrew Morton
2013-05-31 19:46     ` Naoya Horiguchi
2013-06-03 13:26   ` Michal Hocko
2013-06-03 14:34     ` Naoya Horiguchi
2013-06-04 16:44       ` [PATCH v4] " Naoya Horiguchi