* [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page
@ 2012-08-20 13:52 Kirill A. Shutemov
  2012-08-20 13:52 ` [PATCH v4 1/8] THP: Use real address for NUMA policy Kirill A. Shutemov
                   ` (9 more replies)
  0 siblings, 10 replies; 14+ messages in thread
From: Kirill A. Shutemov @ 2012-08-20 13:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Andi Kleen,
	Kirill A. Shutemov, Tim Chen, Alex Shi, Jan Beulich,
	Robert Richter, Andy Lutomirski, Andrew Morton, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, KAMEZAWA Hiroyuki, Mel Gorman,
	linux-kernel, linuxppc-dev, linux-mips, linux-sh, sparclinux

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Clearing a 2MB huge page will typically blow away several levels of CPU
caches.  To avoid this, use a cached clear only for the 4K area around the
fault address and cache-avoiding clears for the rest of the 2MB area.

This patchset implements a cache-avoiding version of clear_page only for
x86. If an architecture wants to provide a cache-avoiding version of
clear_page, it should define ARCH_HAS_USER_NOCACHE to 1 and implement
clear_page_nocache() and clear_user_highpage_nocache().
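
For illustration, a minimal sketch of the per-4K dispatch the series
introduces (simplified from patch 6; on architectures without
ARCH_HAS_USER_NOCACHE, clear_user_highpage_nocache() falls back to the
plain clear_user_highpage()):

	static void clear_huge_page_sketch(struct page *page,
			unsigned long haddr, unsigned long fault_address,
			unsigned int pages_per_huge_page)
	{
		int i, target = (fault_address - haddr) >> PAGE_SHIFT;
		unsigned long vaddr = haddr;

		for (i = 0; i < pages_per_huge_page; i++, vaddr += PAGE_SIZE) {
			cond_resched();
			if (i == target)	/* keep the faulting 4K cached */
				clear_user_highpage(page + i, vaddr);
			else			/* avoid polluting the caches */
				clear_user_highpage_nocache(page + i, vaddr);
		}
	}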

v4:
  - vm.clear_huge_page_nocache sysctl;
  - rework page iteration in clear_{huge,gigantic}_page according to
    Andrea Arcangeli's suggestion;
v3:
  - Rebased to current Linus' tree. kmap_atomic() build issue is fixed;
  - Pass fault address to clear_huge_page(). v2 had a problem with
    clearing for sizes other than HPAGE_SIZE;
  - x86: fix the 32bit variant. A fallback version of clear_page_nocache()
    has been added for non-SSE2 systems;
  - x86: clear_page_nocache() moved to clear_page_{32,64}.S;
  - x86: use pushq_cfi/popq_cfi instead of push/pop;
v2:
  - No code change. Only commit messages are updated;
  - RFC mark is dropped;

Andi Kleen (5):
  THP: Use real address for NUMA policy
  THP: Pass fault address to __do_huge_pmd_anonymous_page()
  x86: Add clear_page_nocache
  mm: make clear_huge_page cache clear only around the fault address
  x86: switch the 64bit uncached page clear to SSE/AVX v2

Kirill A. Shutemov (3):
  hugetlb: pass fault address to hugetlb_no_page()
  mm: pass fault address to clear_huge_page()
  mm: implement vm.clear_huge_page_nocache sysctl

 Documentation/sysctl/vm.txt      |   13 ++++++
 arch/x86/include/asm/page.h      |    2 +
 arch/x86/include/asm/string_32.h |    5 ++
 arch/x86/include/asm/string_64.h |    5 ++
 arch/x86/lib/Makefile            |    3 +-
 arch/x86/lib/clear_page_32.S     |   72 +++++++++++++++++++++++++++++++++++
 arch/x86/lib/clear_page_64.S     |   78 ++++++++++++++++++++++++++++++++++++++
 arch/x86/mm/fault.c              |    7 +++
 include/linux/mm.h               |    7 +++-
 kernel/sysctl.c                  |   12 ++++++
 mm/huge_memory.c                 |   17 ++++----
 mm/hugetlb.c                     |   39 ++++++++++---------
 mm/memory.c                      |   72 ++++++++++++++++++++++++++++++----
 13 files changed, 294 insertions(+), 38 deletions(-)
 create mode 100644 arch/x86/lib/clear_page_32.S

-- 
1.7.7.6


* [PATCH v4 1/8] THP: Use real address for NUMA policy
  2012-08-20 13:52 [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page Kirill A. Shutemov
@ 2012-08-20 13:52 ` Kirill A. Shutemov
  2012-08-20 13:52 ` [PATCH v4 2/8] THP: Pass fault address to __do_huge_pmd_anonymous_page() Kirill A. Shutemov
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Kirill A. Shutemov @ 2012-08-20 13:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Andi Kleen,
	Kirill A. Shutemov, Tim Chen, Alex Shi, Jan Beulich,
	Robert Richter, Andy Lutomirski, Andrew Morton, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, KAMEZAWA Hiroyuki, Mel Gorman,
	linux-kernel, linuxppc-dev, linux-mips, linux-sh, sparclinux

From: Andi Kleen <ak@linux.intel.com>

Use the fault address, not the rounded-down hpage address, for NUMA
policy purposes. In some circumstances this can give a more exact
NUMA policy.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 57c4b93..70737ec 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -681,11 +681,11 @@ static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
 
 static inline struct page *alloc_hugepage_vma(int defrag,
 					      struct vm_area_struct *vma,
-					      unsigned long haddr, int nd,
+					      unsigned long address, int nd,
 					      gfp_t extra_gfp)
 {
 	return alloc_pages_vma(alloc_hugepage_gfpmask(defrag, extra_gfp),
-			       HPAGE_PMD_ORDER, vma, haddr, nd);
+			       HPAGE_PMD_ORDER, vma, address, nd);
 }
 
 #ifndef CONFIG_NUMA
@@ -710,7 +710,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(khugepaged_enter(vma)))
 			return VM_FAULT_OOM;
 		page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
-					  vma, haddr, numa_node_id(), 0);
+					  vma, address, numa_node_id(), 0);
 		if (unlikely(!page)) {
 			count_vm_event(THP_FAULT_FALLBACK);
 			goto out;
@@ -944,7 +944,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow())
 		new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
-					      vma, haddr, numa_node_id(), 0);
+					      vma, address, numa_node_id(), 0);
 	else
 		new_page = NULL;
 
-- 
1.7.7.6


* [PATCH v4 2/8] THP: Pass fault address to __do_huge_pmd_anonymous_page()
  2012-08-20 13:52 [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page Kirill A. Shutemov
  2012-08-20 13:52 ` [PATCH v4 1/8] THP: Use real address for NUMA policy Kirill A. Shutemov
@ 2012-08-20 13:52 ` Kirill A. Shutemov
  2012-08-20 13:52 ` [PATCH v4 3/8] hugetlb: pass fault address to hugetlb_no_page() Kirill A. Shutemov
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Kirill A. Shutemov @ 2012-08-20 13:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Andi Kleen,
	Kirill A. Shutemov, Tim Chen, Alex Shi, Jan Beulich,
	Robert Richter, Andy Lutomirski, Andrew Morton, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, KAMEZAWA Hiroyuki, Mel Gorman,
	linux-kernel, linuxppc-dev, linux-mips, linux-sh, sparclinux

From: Andi Kleen <ak@linux.intel.com>

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 70737ec..6f0825b611 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -633,7 +633,8 @@ static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 
 static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 					struct vm_area_struct *vma,
-					unsigned long haddr, pmd_t *pmd,
+					unsigned long haddr,
+					unsigned long address, pmd_t *pmd,
 					struct page *page)
 {
 	pgtable_t pgtable;
@@ -720,8 +721,8 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			put_page(page);
 			goto out;
 		}
-		if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd,
-							  page))) {
+		if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr,
+						address, pmd, page))) {
 			mem_cgroup_uncharge_page(page);
 			put_page(page);
 			goto out;
-- 
1.7.7.6


* [PATCH v4 3/8] hugetlb: pass fault address to hugetlb_no_page()
  2012-08-20 13:52 [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page Kirill A. Shutemov
  2012-08-20 13:52 ` [PATCH v4 1/8] THP: Use real address for NUMA policy Kirill A. Shutemov
  2012-08-20 13:52 ` [PATCH v4 2/8] THP: Pass fault address to __do_huge_pmd_anonymous_page() Kirill A. Shutemov
@ 2012-08-20 13:52 ` Kirill A. Shutemov
  2012-08-20 13:52 ` [PATCH v4 4/8] mm: pass fault address to clear_huge_page() Kirill A. Shutemov
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Kirill A. Shutemov @ 2012-08-20 13:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Andi Kleen,
	Kirill A. Shutemov, Tim Chen, Alex Shi, Jan Beulich,
	Robert Richter, Andy Lutomirski, Andrew Morton, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, KAMEZAWA Hiroyuki, Mel Gorman,
	linux-kernel, linuxppc-dev, linux-mips, linux-sh, sparclinux

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/hugetlb.c |   38 +++++++++++++++++++-------------------
 1 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bc72712..3c86d3d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2672,7 +2672,8 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
 }
 
 static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, pte_t *ptep, unsigned int flags)
+			unsigned long haddr, unsigned long fault_address,
+			pte_t *ptep, unsigned int flags)
 {
 	struct hstate *h = hstate_vma(vma);
 	int ret = VM_FAULT_SIGBUS;
@@ -2696,7 +2697,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	mapping = vma->vm_file->f_mapping;
-	idx = vma_hugecache_offset(h, vma, address);
+	idx = vma_hugecache_offset(h, vma, haddr);
 
 	/*
 	 * Use page lock to guard against racing truncation
@@ -2708,7 +2709,7 @@ retry:
 		size = i_size_read(mapping->host) >> huge_page_shift(h);
 		if (idx >= size)
 			goto out;
-		page = alloc_huge_page(vma, address, 0);
+		page = alloc_huge_page(vma, haddr, 0);
 		if (IS_ERR(page)) {
 			ret = PTR_ERR(page);
 			if (ret == -ENOMEM)
@@ -2717,7 +2718,7 @@ retry:
 				ret = VM_FAULT_SIGBUS;
 			goto out;
 		}
-		clear_huge_page(page, address, pages_per_huge_page(h));
+		clear_huge_page(page, haddr, pages_per_huge_page(h));
 		__SetPageUptodate(page);
 
 		if (vma->vm_flags & VM_MAYSHARE) {
@@ -2763,7 +2764,7 @@ retry:
 	 * the spinlock.
 	 */
 	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
-		if (vma_needs_reservation(h, vma, address) < 0) {
+		if (vma_needs_reservation(h, vma, haddr) < 0) {
 			ret = VM_FAULT_OOM;
 			goto backout_unlocked;
 		}
@@ -2778,16 +2779,16 @@ retry:
 		goto backout;
 
 	if (anon_rmap)
-		hugepage_add_new_anon_rmap(page, vma, address);
+		hugepage_add_new_anon_rmap(page, vma, haddr);
 	else
 		page_dup_rmap(page);
 	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
 				&& (vma->vm_flags & VM_SHARED)));
-	set_huge_pte_at(mm, address, ptep, new_pte);
+	set_huge_pte_at(mm, haddr, ptep, new_pte);
 
 	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
 		/* Optimization, do the COW without a second fault */
-		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page);
+		ret = hugetlb_cow(mm, vma, haddr, ptep, new_pte, page);
 	}
 
 	spin_unlock(&mm->page_table_lock);
@@ -2813,21 +2814,20 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *pagecache_page = NULL;
 	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
 	struct hstate *h = hstate_vma(vma);
+	unsigned long haddr = address & huge_page_mask(h);
 
-	address &= huge_page_mask(h);
-
-	ptep = huge_pte_offset(mm, address);
+	ptep = huge_pte_offset(mm, haddr);
 	if (ptep) {
 		entry = huge_ptep_get(ptep);
 		if (unlikely(is_hugetlb_entry_migration(entry))) {
-			migration_entry_wait(mm, (pmd_t *)ptep, address);
+			migration_entry_wait(mm, (pmd_t *)ptep, haddr);
 			return 0;
 		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			return VM_FAULT_HWPOISON_LARGE |
 				VM_FAULT_SET_HINDEX(hstate_index(h));
 	}
 
-	ptep = huge_pte_alloc(mm, address, huge_page_size(h));
+	ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
 	if (!ptep)
 		return VM_FAULT_OOM;
 
@@ -2839,7 +2839,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	mutex_lock(&hugetlb_instantiation_mutex);
 	entry = huge_ptep_get(ptep);
 	if (huge_pte_none(entry)) {
-		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
+		ret = hugetlb_no_page(mm, vma, haddr, address, ptep, flags);
 		goto out_mutex;
 	}
 
@@ -2854,14 +2854,14 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * consumed.
 	 */
 	if ((flags & FAULT_FLAG_WRITE) && !pte_write(entry)) {
-		if (vma_needs_reservation(h, vma, address) < 0) {
+		if (vma_needs_reservation(h, vma, haddr) < 0) {
 			ret = VM_FAULT_OOM;
 			goto out_mutex;
 		}
 
 		if (!(vma->vm_flags & VM_MAYSHARE))
 			pagecache_page = hugetlbfs_pagecache_page(h,
-								vma, address);
+								vma, haddr);
 	}
 
 	/*
@@ -2884,16 +2884,16 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	if (flags & FAULT_FLAG_WRITE) {
 		if (!pte_write(entry)) {
-			ret = hugetlb_cow(mm, vma, address, ptep, entry,
+			ret = hugetlb_cow(mm, vma, haddr, ptep, entry,
 							pagecache_page);
 			goto out_page_table_lock;
 		}
 		entry = pte_mkdirty(entry);
 	}
 	entry = pte_mkyoung(entry);
-	if (huge_ptep_set_access_flags(vma, address, ptep, entry,
+	if (huge_ptep_set_access_flags(vma, haddr, ptep, entry,
 						flags & FAULT_FLAG_WRITE))
-		update_mmu_cache(vma, address, ptep);
+		update_mmu_cache(vma, haddr, ptep);
 
 out_page_table_lock:
 	spin_unlock(&mm->page_table_lock);
-- 
1.7.7.6


* [PATCH v4 4/8] mm: pass fault address to clear_huge_page()
  2012-08-20 13:52 [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2012-08-20 13:52 ` [PATCH v4 3/8] hugetlb: pass fault address to hugetlb_no_page() Kirill A. Shutemov
@ 2012-08-20 13:52 ` Kirill A. Shutemov
  2012-08-20 13:52 ` [PATCH v4 5/8] x86: Add clear_page_nocache Kirill A. Shutemov
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Kirill A. Shutemov @ 2012-08-20 13:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Andi Kleen,
	Kirill A. Shutemov, Tim Chen, Alex Shi, Jan Beulich,
	Robert Richter, Andy Lutomirski, Andrew Morton, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, KAMEZAWA Hiroyuki, Mel Gorman,
	linux-kernel, linuxppc-dev, linux-mips, linux-sh, sparclinux

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h |    2 +-
 mm/huge_memory.c   |    2 +-
 mm/hugetlb.c       |    3 ++-
 mm/memory.c        |    7 ++++---
 4 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 311be90..2858723 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1638,7 +1638,7 @@ extern void dump_page(struct page *page);
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 extern void clear_huge_page(struct page *page,
-			    unsigned long addr,
+			    unsigned long haddr, unsigned long fault_address,
 			    unsigned int pages_per_huge_page);
 extern void copy_user_huge_page(struct page *dst, struct page *src,
 				unsigned long addr, struct vm_area_struct *vma,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6f0825b611..070bf89 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -644,7 +644,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 	if (unlikely(!pgtable))
 		return VM_FAULT_OOM;
 
-	clear_huge_page(page, haddr, HPAGE_PMD_NR);
+	clear_huge_page(page, haddr, address, HPAGE_PMD_NR);
 	__SetPageUptodate(page);
 
 	spin_lock(&mm->page_table_lock);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3c86d3d..5182192 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2718,7 +2718,8 @@ retry:
 				ret = VM_FAULT_SIGBUS;
 			goto out;
 		}
-		clear_huge_page(page, haddr, pages_per_huge_page(h));
+		clear_huge_page(page, haddr, fault_address,
+				pages_per_huge_page(h));
 		__SetPageUptodate(page);
 
 		if (vma->vm_flags & VM_MAYSHARE) {
diff --git a/mm/memory.c b/mm/memory.c
index 5736170..dfc179b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3984,19 +3984,20 @@ static void clear_gigantic_page(struct page *page,
 	}
 }
 void clear_huge_page(struct page *page,
-		     unsigned long addr, unsigned int pages_per_huge_page)
+		     unsigned long haddr, unsigned long fault_address,
+		     unsigned int pages_per_huge_page)
 {
 	int i;
 
 	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
-		clear_gigantic_page(page, addr, pages_per_huge_page);
+		clear_gigantic_page(page, haddr, pages_per_huge_page);
 		return;
 	}
 
 	might_sleep();
 	for (i = 0; i < pages_per_huge_page; i++) {
 		cond_resched();
-		clear_user_highpage(page + i, addr + i * PAGE_SIZE);
+		clear_user_highpage(page + i, haddr + i * PAGE_SIZE);
 	}
 }
 
-- 
1.7.7.6


* [PATCH v4 5/8] x86: Add clear_page_nocache
  2012-08-20 13:52 [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page Kirill A. Shutemov
                   ` (3 preceding siblings ...)
  2012-08-20 13:52 ` [PATCH v4 4/8] mm: pass fault address to clear_huge_page() Kirill A. Shutemov
@ 2012-08-20 13:52 ` Kirill A. Shutemov
  2012-08-20 13:52 ` [PATCH v4 6/8] mm: make clear_huge_page cache clear only around the fault address Kirill A. Shutemov
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Kirill A. Shutemov @ 2012-08-20 13:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Andi Kleen,
	Kirill A. Shutemov, Tim Chen, Alex Shi, Jan Beulich,
	Robert Richter, Andy Lutomirski, Andrew Morton, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, KAMEZAWA Hiroyuki, Mel Gorman,
	linux-kernel, linuxppc-dev, linux-mips, linux-sh, sparclinux

From: Andi Kleen <ak@linux.intel.com>

Add a cache-avoiding version of clear_page: a straightforward integer
variant of the existing 64bit clear_page, for both 32bit and 64bit.

Also add the necessary glue for highmem, including a layer that
non-cache-coherent architectures (those that use the virtual address
for flushing) can hook into. This is not needed on x86, of course.

If an architecture wants to provide a cache-avoiding version of
clear_page, it should define ARCH_HAS_USER_NOCACHE to 1 and implement
clear_page_nocache() and clear_user_highpage_nocache().
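
As a sketch of the hook for a hypothetical non-cache-coherent
architecture that flushes by virtual address (flush_dcache_alias() is an
assumed helper here, not an existing kernel API):

	#define ARCH_HAS_USER_NOCACHE 1

	void clear_user_highpage_nocache(struct page *page, unsigned long vaddr)
	{
		void *p = kmap_atomic(page);

		clear_page_nocache(p);		/* arch-provided uncached clear */
		flush_dcache_alias(p, vaddr);	/* assumed: flush the user alias */
		kunmap_atomic(p);
	}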

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/page.h      |    2 +
 arch/x86/include/asm/string_32.h |    5 +++
 arch/x86/include/asm/string_64.h |    5 +++
 arch/x86/lib/Makefile            |    3 +-
 arch/x86/lib/clear_page_32.S     |   72 ++++++++++++++++++++++++++++++++++++++
 arch/x86/lib/clear_page_64.S     |   29 +++++++++++++++
 arch/x86/mm/fault.c              |    7 ++++
 7 files changed, 122 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/lib/clear_page_32.S

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 8ca8283..aa83a1b 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -29,6 +29,8 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
 	copy_page(to, from);
 }
 
+void clear_user_highpage_nocache(struct page *page, unsigned long vaddr);
+
 #define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
 	alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
index 3d3e835..3f2fbcf 100644
--- a/arch/x86/include/asm/string_32.h
+++ b/arch/x86/include/asm/string_32.h
@@ -3,6 +3,8 @@
 
 #ifdef __KERNEL__
 
+#include <linux/linkage.h>
+
 /* Let gcc decide whether to inline or use the out of line functions */
 
 #define __HAVE_ARCH_STRCPY
@@ -337,6 +339,9 @@ void *__constant_c_and_count_memset(void *s, unsigned long pattern,
 #define __HAVE_ARCH_MEMSCAN
 extern void *memscan(void *addr, int c, size_t size);
 
+#define ARCH_HAS_USER_NOCACHE 1
+asmlinkage void clear_page_nocache(void *page);
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_32_H */
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 19e2c46..ca23d1d 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -3,6 +3,8 @@
 
 #ifdef __KERNEL__
 
+#include <linux/linkage.h>
+
 /* Written 2002 by Andi Kleen */
 
 /* Only used for special circumstances. Stolen from i386/string.h */
@@ -63,6 +65,9 @@ char *strcpy(char *dest, const char *src);
 char *strcat(char *dest, const char *src);
 int strcmp(const char *cs, const char *ct);
 
+#define ARCH_HAS_USER_NOCACHE 1
+asmlinkage void clear_page_nocache(void *page);
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index b00f678..14e47a2 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -23,6 +23,7 @@ lib-y += memcpy_$(BITS).o
 lib-$(CONFIG_SMP) += rwlock.o
 lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o
+lib-y += clear_page_$(BITS).o
 
 obj-y += msr.o msr-reg.o msr-reg-export.o
 
@@ -40,7 +41,7 @@ endif
 else
         obj-y += iomap_copy_64.o
         lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o
-        lib-y += thunk_64.o clear_page_64.o copy_page_64.o
+        lib-y += thunk_64.o copy_page_64.o
         lib-y += memmove_64.o memset_64.o
         lib-y += copy_user_64.o copy_user_nocache_64.o
 	lib-y += cmpxchg16b_emu.o
diff --git a/arch/x86/lib/clear_page_32.S b/arch/x86/lib/clear_page_32.S
new file mode 100644
index 0000000..9592161
--- /dev/null
+++ b/arch/x86/lib/clear_page_32.S
@@ -0,0 +1,72 @@
+#include <linux/linkage.h>
+#include <asm/alternative-asm.h>
+#include <asm/cpufeature.h>
+#include <asm/dwarf2.h>
+
+/*
+ * Fallback version if SSE2 is not available.
+ */
+ENTRY(clear_page_nocache)
+	CFI_STARTPROC
+	mov    %eax,%edx
+	xorl   %eax,%eax
+	movl   $4096/32,%ecx
+	.p2align 4
+.Lloop:
+	decl	%ecx
+#define PUT(x) mov %eax,x*4(%edx)
+	PUT(0)
+	PUT(1)
+	PUT(2)
+	PUT(3)
+	PUT(4)
+	PUT(5)
+	PUT(6)
+	PUT(7)
+#undef PUT
+	lea	32(%edx),%edx
+	jnz	.Lloop
+	nop
+	ret
+	CFI_ENDPROC
+ENDPROC(clear_page_nocache)
+
+	.section .altinstr_replacement,"ax"
+1:      .byte 0xeb /* jmp <disp8> */
+	.byte (clear_page_nocache_sse2 - clear_page_nocache) - (2f - 1b)
+	/* offset */
+2:
+	.previous
+	.section .altinstructions,"a"
+	altinstruction_entry clear_page_nocache,1b,X86_FEATURE_XMM2,\
+				16, 2b-1b
+	.previous
+
+/*
+ * Zero a page avoiding the caches
+ * eax	page
+ */
+ENTRY(clear_page_nocache_sse2)
+	CFI_STARTPROC
+	mov    %eax,%edx
+	xorl   %eax,%eax
+	movl   $4096/32,%ecx
+	.p2align 4
+.Lloop_sse2:
+	decl	%ecx
+#define PUT(x) movnti %eax,x*4(%edx)
+	PUT(0)
+	PUT(1)
+	PUT(2)
+	PUT(3)
+	PUT(4)
+	PUT(5)
+	PUT(6)
+	PUT(7)
+#undef PUT
+	lea	32(%edx),%edx
+	jnz	.Lloop_sse2
+	nop
+	ret
+	CFI_ENDPROC
+ENDPROC(clear_page_nocache_sse2)
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index f2145cf..9d2f3c2 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -40,6 +40,7 @@ ENTRY(clear_page)
 	PUT(5)
 	PUT(6)
 	PUT(7)
+#undef PUT
 	leaq	64(%rdi),%rdi
 	jnz	.Lloop
 	nop
@@ -71,3 +72,31 @@ ENDPROC(clear_page)
 	altinstruction_entry clear_page,2b,X86_FEATURE_ERMS,   \
 			     .Lclear_page_end-clear_page,3b-2b
 	.previous
+
+/*
+ * Zero a page avoiding the caches
+ * rdi	page
+ */
+ENTRY(clear_page_nocache)
+	CFI_STARTPROC
+	xorl   %eax,%eax
+	movl   $4096/64,%ecx
+	.p2align 4
+.Lloop_nocache:
+	decl	%ecx
+#define PUT(x) movnti %rax,x*8(%rdi)
+	movnti %rax,(%rdi)
+	PUT(1)
+	PUT(2)
+	PUT(3)
+	PUT(4)
+	PUT(5)
+	PUT(6)
+	PUT(7)
+#undef PUT
+	leaq	64(%rdi),%rdi
+	jnz	.Lloop_nocache
+	nop
+	ret
+	CFI_ENDPROC
+ENDPROC(clear_page_nocache)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 76dcd9d..d8cf231 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1209,3 +1209,10 @@ good_area:
 
 	up_read(&mm->mmap_sem);
 }
+
+void clear_user_highpage_nocache(struct page *page, unsigned long vaddr)
+{
+	void *p = kmap_atomic(page);
+	clear_page_nocache(p);
+	kunmap_atomic(p);
+}
-- 
1.7.7.6


* [PATCH v4 6/8] mm: make clear_huge_page cache clear only around the fault address
  2012-08-20 13:52 [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page Kirill A. Shutemov
                   ` (4 preceding siblings ...)
  2012-08-20 13:52 ` [PATCH v4 5/8] x86: Add clear_page_nocache Kirill A. Shutemov
@ 2012-08-20 13:52 ` Kirill A. Shutemov
  2012-08-20 13:52 ` [PATCH v4 7/8] x86: switch the 64bit uncached page clear to SSE/AVX v2 Kirill A. Shutemov
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Kirill A. Shutemov @ 2012-08-20 13:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Andi Kleen,
	Kirill A. Shutemov, Tim Chen, Alex Shi, Jan Beulich,
	Robert Richter, Andy Lutomirski, Andrew Morton, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, KAMEZAWA Hiroyuki, Mel Gorman,
	linux-kernel, linuxppc-dev, linux-mips, linux-sh, sparclinux

From: Andi Kleen <ak@linux.intel.com>

Clearing a 2MB huge page will typically blow away several levels
of CPU caches. To avoid this, use a cached clear only for the 4K
area around the fault address and cache-avoiding clears for the
rest of the 2MB area.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/memory.c |   37 +++++++++++++++++++++++++++++--------
 1 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index dfc179b..625ca33 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3969,18 +3969,32 @@ EXPORT_SYMBOL(might_fault);
 #endif
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+
+#ifndef ARCH_HAS_USER_NOCACHE
+#define ARCH_HAS_USER_NOCACHE 0
+#endif
+
+#if ARCH_HAS_USER_NOCACHE == 0
+#define clear_user_highpage_nocache clear_user_highpage
+#endif
+
 static void clear_gigantic_page(struct page *page,
-				unsigned long addr,
-				unsigned int pages_per_huge_page)
+		unsigned long haddr, unsigned long fault_address,
+		unsigned int pages_per_huge_page)
 {
 	int i;
 	struct page *p = page;
+	unsigned long vaddr;
+	int target = (fault_address - haddr) >> PAGE_SHIFT;
 
 	might_sleep();
-	for (i = 0; i < pages_per_huge_page;
-	     i++, p = mem_map_next(p, page, i)) {
+	for (i = 0, vaddr = haddr; i < pages_per_huge_page;
+			i++, p = mem_map_next(p, page, i), vaddr += PAGE_SIZE) {
 		cond_resched();
-		clear_user_highpage(p, addr + i * PAGE_SIZE);
+		if (!ARCH_HAS_USER_NOCACHE  || i == target)
+			clear_user_highpage(p, vaddr);
+		else
+			clear_user_highpage_nocache(p, vaddr);
 	}
 }
 void clear_huge_page(struct page *page,
@@ -3988,16 +4002,23 @@ void clear_huge_page(struct page *page,
 		     unsigned int pages_per_huge_page)
 {
 	int i;
+	unsigned long vaddr;
+	int target = (fault_address - haddr) >> PAGE_SHIFT;
 
 	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
-		clear_gigantic_page(page, haddr, pages_per_huge_page);
+		clear_gigantic_page(page, haddr, fault_address,
+				pages_per_huge_page);
 		return;
 	}
 
 	might_sleep();
-	for (i = 0; i < pages_per_huge_page; i++) {
+	for (i = 0, vaddr = haddr; i < pages_per_huge_page;
+			i++, page++, vaddr += PAGE_SIZE) {
 		cond_resched();
-		clear_user_highpage(page + i, haddr + i * PAGE_SIZE);
+		if (!ARCH_HAS_USER_NOCACHE || i == target)
+			clear_user_highpage(page, vaddr);
+		else
+			clear_user_highpage_nocache(page, vaddr);
 	}
 }
 
-- 
1.7.7.6


* [PATCH v4 7/8] x86: switch the 64bit uncached page clear to SSE/AVX v2
  2012-08-20 13:52 [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  2012-08-20 13:52 ` [PATCH v4 6/8] mm: make clear_huge_page cache clear only around the fault address Kirill A. Shutemov
@ 2012-08-20 13:52 ` Kirill A. Shutemov
  2012-08-20 13:52 ` [PATCH v4 8/8] mm: implement vm.clear_huge_page_nocache sysctl Kirill A. Shutemov
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Kirill A. Shutemov @ 2012-08-20 13:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Andi Kleen,
	Kirill A. Shutemov, Tim Chen, Alex Shi, Jan Beulich,
	Robert Richter, Andy Lutomirski, Andrew Morton, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, KAMEZAWA Hiroyuki, Mel Gorman,
	linux-kernel, linuxppc-dev, linux-mips, linux-sh, sparclinux

From: Andi Kleen <ak@linux.intel.com>

With multiple threads, vector stores are more efficient, so use them.
This causes the page clear to run non-preemptible and adds some
overhead. However, on 32bit it was already non-preemptible (due to
kmap_atomic) and there is a preemption opportunity every 4K unit anyway.

On an NPB (NAS Parallel Benchmarks) 128GB run on a Westmere this reduces
the performance regression of enabling transparent huge pages by ~2%
(from 2.81% to 0.81%), which is near the runtime variability now.
On a system with AVX support more improvement is expected.
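
The assembly follows the usual pattern for SIMD use in kernel context;
roughly, as a C sketch (not the actual implementation):

	/*
	 * kernel_fpu_begin() saves the FPU state and disables preemption;
	 * kernel_fpu_end() restores both.
	 */
	static void clear_page_vector(void *page)
	{
		kernel_fpu_begin();
		/* ... 4096 bytes of non-temporal vector stores
		 *     (movntdq / vmovntdq) ... */
		kernel_fpu_end();
	}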

Signed-off-by: Andi Kleen <ak@linux.intel.com>
[kirill.shutemov@linux.intel.com: Properly save/restore arguments]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/lib/clear_page_64.S |   79 ++++++++++++++++++++++++++++++++++--------
 1 files changed, 64 insertions(+), 15 deletions(-)

diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index 9d2f3c2..b302cff 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -73,30 +73,79 @@ ENDPROC(clear_page)
 			     .Lclear_page_end-clear_page,3b-2b
 	.previous
 
+#define SSE_UNROLL 128
+
 /*
  * Zero a page avoiding the caches
  * rdi	page
  */
 ENTRY(clear_page_nocache)
 	CFI_STARTPROC
-	xorl   %eax,%eax
-	movl   $4096/64,%ecx
+	pushq_cfi %rdi
+	call   kernel_fpu_begin
+	popq_cfi  %rdi
+	sub    $16,%rsp
+	CFI_ADJUST_CFA_OFFSET 16
+	movdqu %xmm0,(%rsp)
+	xorpd  %xmm0,%xmm0
+	movl   $4096/SSE_UNROLL,%ecx
 	.p2align 4
 .Lloop_nocache:
 	decl	%ecx
-#define PUT(x) movnti %rax,x*8(%rdi)
-	movnti %rax,(%rdi)
-	PUT(1)
-	PUT(2)
-	PUT(3)
-	PUT(4)
-	PUT(5)
-	PUT(6)
-	PUT(7)
-#undef PUT
-	leaq	64(%rdi),%rdi
+	.set x,0
+	.rept SSE_UNROLL/16
+	movntdq %xmm0,x(%rdi)
+	.set x,x+16
+	.endr
+	leaq	SSE_UNROLL(%rdi),%rdi
 	jnz	.Lloop_nocache
-	nop
-	ret
+	movdqu (%rsp),%xmm0
+	addq   $16,%rsp
+	CFI_ADJUST_CFA_OFFSET -16
+	jmp   kernel_fpu_end
 	CFI_ENDPROC
 ENDPROC(clear_page_nocache)
+
+#ifdef CONFIG_AS_AVX
+
+	.section .altinstr_replacement,"ax"
+1:	.byte 0xeb					/* jmp <disp8> */
+	.byte (clear_page_nocache_avx - clear_page_nocache) - (2f - 1b)
+	/* offset */
+2:
+	.previous
+	.section .altinstructions,"a"
+	altinstruction_entry clear_page_nocache,1b,X86_FEATURE_AVX,\
+	                     16, 2b-1b
+	.previous
+
+#define AVX_UNROLL 256 /* TUNE ME */
+
+ENTRY(clear_page_nocache_avx)
+	CFI_STARTPROC
+	pushq_cfi %rdi
+	call   kernel_fpu_begin
+	popq_cfi  %rdi
+	sub    $32,%rsp
+	CFI_ADJUST_CFA_OFFSET 32
+	vmovdqu %ymm0,(%rsp)
+	vxorpd  %ymm0,%ymm0,%ymm0
+	movl   $4096/AVX_UNROLL,%ecx
+	.p2align 4
+.Lloop_avx:
+	decl	%ecx
+	.set x,0
+	.rept AVX_UNROLL/32
+	vmovntdq %ymm0,x(%rdi)
+	.set x,x+32
+	.endr
+	leaq	AVX_UNROLL(%rdi),%rdi
+	jnz	.Lloop_avx
+	vmovdqu (%rsp),%ymm0
+	addq   $32,%rsp
+	CFI_ADJUST_CFA_OFFSET -32
+	jmp   kernel_fpu_end
+	CFI_ENDPROC
+ENDPROC(clear_page_nocache_avx)
+
+#endif
-- 
1.7.7.6


* [PATCH v4 8/8] mm: implement vm.clear_huge_page_nocache sysctl
  2012-08-20 13:52 [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page Kirill A. Shutemov
                   ` (6 preceding siblings ...)
  2012-08-20 13:52 ` [PATCH v4 7/8] x86: switch the 64bit uncached page clear to SSE/AVX v2 Kirill A. Shutemov
@ 2012-08-20 13:52 ` Kirill A. Shutemov
  2012-09-12 10:09 ` [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page Kirill A. Shutemov
  2012-09-13 23:05 ` Andrew Morton
  9 siblings, 0 replies; 14+ messages in thread
From: Kirill A. Shutemov @ 2012-08-20 13:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Andi Kleen,
	Kirill A. Shutemov, Tim Chen, Alex Shi, Jan Beulich,
	Robert Richter, Andy Lutomirski, Andrew Morton, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, KAMEZAWA Hiroyuki, Mel Gorman,
	linux-kernel, linuxppc-dev, linux-mips, linux-sh, sparclinux

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

In some cases cache-avoiding clearing of huge pages may slow down a
workload. Let's provide a sysctl handle to disable it.

We use a static_key here to avoid extra work on the fast path.
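
Run-time usage (the sysctl file name follows from the patch below):

	# disable the cache-avoiding clear, i.e. clear through the cache
	echo 0 > /proc/sys/vm/clear_huge_page_nocache
	# restore the default cache-avoiding behaviour
	echo 1 > /proc/sys/vm/clear_huge_page_nocache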

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/sysctl/vm.txt |   13 ++++++++++++
 include/linux/mm.h          |    5 ++++
 kernel/sysctl.c             |   12 +++++++++++
 mm/memory.c                 |   44 +++++++++++++++++++++++++++++++++++++-----
 4 files changed, 68 insertions(+), 6 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 078701f..9559a97 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -19,6 +19,7 @@ files can be found in mm/swap.c.
 Currently, these files are in /proc/sys/vm:
 
 - block_dump
+- clear_huge_page_nocache
 - compact_memory
 - dirty_background_bytes
 - dirty_background_ratio
@@ -74,6 +75,18 @@ huge pages although processes will also directly compact memory as required.
 
 ==============================================================
 
+clear_huge_page_nocache
+
+Available only when the architecture provides ARCH_HAS_USER_NOCACHE and
+either CONFIG_TRANSPARENT_HUGEPAGE or CONFIG_HUGETLBFS is set.
+
+When set to 1 (the default) the kernel uses a cache-avoiding clear routine
+for clearing huge pages. This minimizes cache pollution.
+When set to 0 the kernel clears huge pages through the cache. This may
+speed up some workloads and is also useful for benchmarking purposes.
+
+==============================================================
+
 dirty_background_bytes
 
 Contains the amount of dirty memory at which the background kernel
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2858723..9b48f43 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1643,6 +1643,11 @@ extern void clear_huge_page(struct page *page,
 extern void copy_user_huge_page(struct page *dst, struct page *src,
 				unsigned long addr, struct vm_area_struct *vma,
 				unsigned int pages_per_huge_page);
+#ifdef ARCH_HAS_USER_NOCACHE
+extern int sysctl_clear_huge_page_nocache;
+extern int clear_huge_page_nocache_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *length, loff_t *ppos);
+#endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 87174ef..80ccc67 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1366,6 +1366,18 @@ static struct ctl_table vm_table[] = {
 		.extra2		= &one,
 	},
 #endif
+#if defined(ARCH_HAS_USER_NOCACHE) && \
+	(defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS))
+	{
+		.procname	= "clear_huge_page_nocache",
+		.data		= &sysctl_clear_huge_page_nocache,
+		.maxlen		= sizeof(sysctl_clear_huge_page_nocache),
+		.mode		= 0644,
+		.proc_handler	= clear_huge_page_nocache_handler,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif
 	{ }
 };
 
diff --git a/mm/memory.c b/mm/memory.c
index 625ca33..395d574 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
 #include <linux/gfp.h>
+#include <linux/static_key.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3970,12 +3971,43 @@ EXPORT_SYMBOL(might_fault);
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 
-#ifndef ARCH_HAS_USER_NOCACHE
-#define ARCH_HAS_USER_NOCACHE 0
-#endif
+#ifdef ARCH_HAS_USER_NOCACHE
+int sysctl_clear_huge_page_nocache = 1;
+static DEFINE_MUTEX(sysctl_clear_huge_page_nocache_lock);
+static struct static_key clear_huge_page_nocache __read_mostly =
+	STATIC_KEY_INIT_TRUE;
 
-#if ARCH_HAS_USER_NOCACHE == 0
+static inline int is_nocache_enabled(void)
+{
+	return static_key_true(&clear_huge_page_nocache);
+}
+
+int clear_huge_page_nocache_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *length, loff_t *ppos)
+{
+	int orig_value = sysctl_clear_huge_page_nocache;
+	int ret;
+
+	mutex_lock(&sysctl_clear_huge_page_nocache_lock);
+	orig_value = sysctl_clear_huge_page_nocache;
+	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+	if (!ret && write && sysctl_clear_huge_page_nocache != orig_value) {
+		if (sysctl_clear_huge_page_nocache)
+			static_key_slow_inc(&clear_huge_page_nocache);
+		else
+			static_key_slow_dec(&clear_huge_page_nocache);
+	}
+	mutex_unlock(&sysctl_clear_huge_page_nocache_lock);
+
+	return ret;
+}
+#else
 #define clear_user_highpage_nocache clear_user_highpage
+
+static inline int is_nocache_enabled(void)
+{
+	return 0;
+}
 #endif
 
 static void clear_gigantic_page(struct page *page,
@@ -3991,7 +4023,7 @@ static void clear_gigantic_page(struct page *page,
 	for (i = 0, vaddr = haddr; i < pages_per_huge_page;
 			i++, p = mem_map_next(p, page, i), vaddr += PAGE_SIZE) {
 		cond_resched();
-		if (!ARCH_HAS_USER_NOCACHE  || i == target)
+		if (!is_nocache_enabled() || i == target)
 			clear_user_highpage(p, vaddr);
 		else
 			clear_user_highpage_nocache(p, vaddr);
@@ -4015,7 +4047,7 @@ void clear_huge_page(struct page *page,
 	for (i = 0, vaddr = haddr; i < pages_per_huge_page;
 			i++, page++, vaddr += PAGE_SIZE) {
 		cond_resched();
-		if (!ARCH_HAS_USER_NOCACHE || i == target)
+		if (!is_nocache_enabled() || i == target)
 			clear_user_highpage(page, vaddr);
 		else
 			clear_user_highpage_nocache(page, vaddr);
-- 
1.7.7.6


* Re: [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page
  2012-08-20 13:52 [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page Kirill A. Shutemov
                   ` (7 preceding siblings ...)
  2012-08-20 13:52 ` [PATCH v4 8/8] mm: implement vm.clear_huge_page_nocache sysctl Kirill A. Shutemov
@ 2012-09-12 10:09 ` Kirill A. Shutemov
  2012-09-13 23:05 ` Andrew Morton
  9 siblings, 0 replies; 14+ messages in thread
From: Kirill A. Shutemov @ 2012-09-12 10:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Andi Kleen,
	Tim Chen, Alex Shi, Jan Beulich, Robert Richter, Andy Lutomirski,
	Andrew Morton, Andrea Arcangeli, Johannes Weiner, Hugh Dickins,
	KAMEZAWA Hiroyuki, Mel Gorman, linux-kernel, linuxppc-dev,
	linux-mips, linux-sh, sparclinux

Hi,

Any feedback?

-- 
 Kirill A. Shutemov


* Re: [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page
  2012-08-20 13:52 [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page Kirill A. Shutemov
                   ` (8 preceding siblings ...)
  2012-09-12 10:09 ` [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page Kirill A. Shutemov
@ 2012-09-13 23:05 ` Andrew Morton
  2012-09-14  5:52   ` Ingo Molnar
  9 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2012-09-13 23:05 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-mm, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Andi Kleen, Tim Chen, Alex Shi, Jan Beulich, Robert Richter,
	Andy Lutomirski, Andrea Arcangeli, Johannes Weiner, Hugh Dickins,
	KAMEZAWA Hiroyuki, Mel Gorman, linux-kernel, linuxppc-dev,
	linux-mips, linux-sh, sparclinux

On Mon, 20 Aug 2012 16:52:29 +0300
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> Clearing a 2MB huge page will typically blow away several levels of CPU
> caches.  To avoid this, use a cached clear only for the 4K area around the
> fault address and cache-avoiding clears for the rest of the 2MB area.
> 
> This patchset implements a cache-avoiding version of clear_page only for
> x86. If an architecture wants to provide a cache-avoiding version of
> clear_page, it should define ARCH_HAS_USER_NOCACHE to 1 and implement
> clear_page_nocache() and clear_user_highpage_nocache().

Patchset looks nice to me, but the changelogs are terribly short of
performance measurements.  For this sort of change I do think it is
important that pretty exhaustive testing be performed, and that the
results (or a readable summary of them) be shown.  And that testing
should be designed to probe for slowdowns, not just the speedups!



* Re: [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page
  2012-09-13 23:05 ` Andrew Morton
@ 2012-09-14  5:52   ` Ingo Molnar
  2012-09-25 14:27     ` Kirill A. Shutemov
  0 siblings, 1 reply; 14+ messages in thread
From: Ingo Molnar @ 2012-09-14  5:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, linux-mm, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, Andi Kleen, Tim Chen, Alex Shi, Jan Beulich,
	Robert Richter, Andy Lutomirski, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, KAMEZAWA Hiroyuki, Mel Gorman,
	linux-kernel, linuxppc-dev, linux-mips, linux-sh, sparclinux


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 20 Aug 2012 16:52:29 +0300
> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
> > Clearing a 2MB huge page will typically blow away several levels of CPU
> > caches.  To avoid this, use a cached clear only for the 4K area around
> > the fault address and cache-avoiding clears for the rest of the 2MB area.
> > 
> > This patchset implements a cache-avoiding version of clear_page only for
> > x86. If an architecture wants to provide a cache-avoiding version of
> > clear_page, it should define ARCH_HAS_USER_NOCACHE to 1 and implement
> > clear_page_nocache() and clear_user_highpage_nocache().
> 
> Patchset looks nice to me, but the changelogs are terribly 
> short of performance measurements.  For this sort of change I 
> do think it is important that pretty exhaustive testing be 
> performed, and that the results (or a readable summary of 
> them) be shown.  And that testing should be designed to probe 
> for slowdowns, not just the speedups!

That is my general impression as well.

Firstly, doing before/after "perf stat --repeat 3 ..." runs
showing a statistically significant effect on a workload that is
expected to win from this, and on a workload expected to be
hurt by this, would go a long way towards convincing me.
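
For example, something like this (a hypothetical invocation; the events
are generic perf aliases and the workload binary is a placeholder):

  perf stat --repeat 3 -e cycles,instructions,cache-references,cache-misses \
	./thp-workload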

Secondly, if you can find some user-space simulation of the 
intended positive (and negative) effects then a 'perf bench' 
testcase designed to show weakness of any such approach, running 
the very kernel assembly code in user-space would also be rather 
useful.

See:

comet:~/tip> git grep x86 tools/perf/bench/ | grep inclu
tools/perf/bench/mem-memcpy-arch.h:#include "mem-memcpy-x86-64-asm-def.h"
tools/perf/bench/mem-memcpy-x86-64-asm.S:#include "../../../arch/x86/lib/memcpy_64.S"
tools/perf/bench/mem-memcpy.c:#include "mem-memcpy-x86-64-asm-def.h"
tools/perf/bench/mem-memset-arch.h:#include "mem-memset-x86-64-asm-def.h"
tools/perf/bench/mem-memset-x86-64-asm.S:#include "../../../arch/x86/lib/memset_64.S"
tools/perf/bench/mem-memset.c:#include "mem-memset-x86-64-asm-def.h"

that code uses the kernel-side assembly code and runs it in 
user-space.

Although obviously clearing pages on page faults needs some care 
to properly simulate in user-space.

Without repeatable hard numbers such code just gets into the 
kernel and bitrots there as new CPU generations come in - a few 
years down the line the original decisions often degrade to pure 
noise. We've been there, we've done that, we don't want to 
repeat it.

Thanks,

	Ingo

* Re: [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page
  2012-09-14  5:52   ` Ingo Molnar
@ 2012-09-25 14:27     ` Kirill A. Shutemov
  2012-09-25 19:33       ` Andrea Arcangeli
  0 siblings, 1 reply; 14+ messages in thread
From: Kirill A. Shutemov @ 2012-09-25 14:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, linux-mm, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, Andi Kleen, Tim Chen, Alex Shi, Jan Beulich,
	Robert Richter, Andy Lutomirski, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, KAMEZAWA Hiroyuki, Mel Gorman,
	linux-kernel, linuxppc-dev, linux-mips, linux-sh, sparclinux

On Fri, Sep 14, 2012 at 07:52:10AM +0200, Ingo Molnar wrote:
> Without repeatable hard numbers such code just gets into the 
> kernel and bitrots there as new CPU generations come in - a few 
> years down the line the original decisions often degrade to pure 
> noise. We've been there, we've done that, we don't want to 
> repeat it.

<sorry for the late answer..>

Hard numbers are hard.
I've checked some workloads: Mosbench, NPB, specjvm2008. Most of the time
the patchset doesn't show any difference (within run-to-run deviation).
On NPB it recovers the THP regression, but that's probably not enough to
make a decision.

It would be nice if somebody tested the patchset on another system or
workload, especially a configuration that shows a regression with
THP enabled.

-- 
 Kirill A. Shutemov


* Re: [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page
  2012-09-25 14:27     ` Kirill A. Shutemov
@ 2012-09-25 19:33       ` Andrea Arcangeli
  0 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2012-09-25 19:33 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Ingo Molnar, Andrew Morton, linux-mm, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, Andi Kleen, Tim Chen, Alex Shi,
	Jan Beulich, Robert Richter, Andy Lutomirski, Johannes Weiner,
	Hugh Dickins, KAMEZAWA Hiroyuki, Mel Gorman, linux-kernel,
	linuxppc-dev, linux-mips, linux-sh, sparclinux

Hi Kirill,

On Tue, Sep 25, 2012 at 05:27:03PM +0300, Kirill A. Shutemov wrote:
> On Fri, Sep 14, 2012 at 07:52:10AM +0200, Ingo Molnar wrote:
> > Without repeatable hard numbers such code just gets into the 
> > kernel and bitrots there as new CPU generations come in - a few 
> > years down the line the original decisions often degrade to pure 
> > noise. We've been there, we've done that, we don't want to 
> > repeat it.
> 
> <sorry for the late answer..>
> 
> Hard numbers are hard.
> I've checked some workloads: Mosbench, NPB, specjvm2008. Most of the time
> the patchset doesn't show any difference (within run-to-run deviation).
> On NPB it recovers the THP regression, but that's probably not enough to
> make a decision.
> 
> It would be nice if somebody tested the patchset on another system or
> workload, especially a configuration that shows a regression with
> THP enabled.

If the only workload that gets a benefit is NPB then we have proof that
this is too hardware-dependent to be a conclusive result.

It may have been slower by accident: things like the cache
associativity being off by one bit, combined with the implicit coloring
provided to the lowest 512 colors, could hurt more if the cache
associativity is low.

I'm saying this because NPB on a thinkpad (Intel CPU I assume) is the
benchmark that shows the most benefit among all benchmarks run on that
hardware.

http://www.phoronix.com/scan.php?page=article&item=linux_transparent_hugepages&num=2

I once saw certain computations that ran much slower with perfect
cache coloring while most others ran much faster with the page
coloring. That doesn't mean page coloring is bad per se. So NPB on that
specific hardware may have been the exception and not the interesting
case, especially considering that the effect of cached copying is the
opposite on slightly different hardware.

I think the static_key should be off by default whenever the CPU
L2 cache size is >= the size of the copy (2*HPAGE_PMD_SIZE). Since the
cache does random replacement, maybe we could also allow cached
copies for twice the size of the copy (L2 size >=
4*HPAGE_PMD_SIZE). Current CPUs have caches much larger than 2*2MB...
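
As a sketch of that default (untested, x86-only assumption; I'm taking
boot_cpu_data.x86_cache_size as the relevant cache size, reported in KB):

	static bool nocache_clear_by_default(void)
	{
		unsigned long cache_bytes =
			boot_cpu_data.x86_cache_size * 1024UL;

		/* cached clears are fine if the cache can absorb them */
		return cache_bytes < 4 * HPAGE_PMD_SIZE;
	}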

It would make a whole lot more sense for hugetlbfs giga pages than for
THP (unlike with THP, cache trashing with giga pages is guaranteed),
but even with giga pages, it's not like they're allocated frequently
(maybe once per OS reboot), so that too is surely lost in the
noise, as it only saves a few accesses after the cache copy is
finished.

It's good to have tested it though.

Thanks,
Andrea
