* [PATCH v2 0/6] Avoid cache thrashing on clearing huge/gigantic page
@ 2012-08-09 15:02 Kirill A. Shutemov
From: Kirill A. Shutemov @ 2012-08-09 15:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Andi Kleen,
	Kirill A. Shutemov, Tim Chen, Alex Shi, Jan Beulich,
	Robert Richter, Andy Lutomirski, Andrew Morton, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, KAMEZAWA Hiroyuki, Mel Gorman,
	linux-kernel, linuxppc-dev, linux-mips, linux-sh, sparclinux

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Clearing a 2MB huge page will typically blow away several levels of CPU
caches.  To avoid this, clear only the 4K page around the fault address
through the cache and use cache-avoiding clears for the rest of the 2MB
area.

This patchset implements the cache-avoiding version of clear_page for
x86 only.  If an architecture wants to provide a cache-avoiding version
of clear_page, it should define ARCH_HAS_USER_NOCACHE to 1 and implement
clear_page_nocache() and clear_user_highpage_nocache().
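
For reference, the architecture-side contract is small; the declarations
below mirror the x86 ones added in patch 4 (see the string_32.h/string_64.h
and page.h hunks there):

  /* opt in to the nocache path */
  #define ARCH_HAS_USER_NOCACHE 1

  /* zero one 4K page with non-temporal stores */
  asmlinkage void clear_page_nocache(void *page);

  /* highmem wrapper; non-cache-coherent architectures can flush
     based on vaddr here */
  void clear_user_highpage_nocache(struct page *page, unsigned long vaddr);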

v2:
  - No code change. Only commit messages are updated.
  - RFC mark is dropped.

Andi Kleen (6):
  THP: Use real address for NUMA policy
  mm: make clear_huge_page tolerate non aligned address
  THP: Pass real, not rounded, address to clear_huge_page
  x86: Add clear_page_nocache
  mm: make clear_huge_page cache clear only around the fault address
  x86: switch the 64bit uncached page clear to SSE/AVX v2

 arch/x86/include/asm/page.h          |    2 +
 arch/x86/include/asm/string_32.h     |    5 ++
 arch/x86/include/asm/string_64.h     |    5 ++
 arch/x86/lib/Makefile                |    1 +
 arch/x86/lib/clear_page_nocache_32.S |   30 +++++++++++
 arch/x86/lib/clear_page_nocache_64.S |   92 ++++++++++++++++++++++++++++++++++
 arch/x86/mm/fault.c                  |    7 +++
 mm/huge_memory.c                     |   17 +++---
 mm/memory.c                          |   29 ++++++++++-
 9 files changed, 178 insertions(+), 10 deletions(-)
 create mode 100644 arch/x86/lib/clear_page_nocache_32.S
 create mode 100644 arch/x86/lib/clear_page_nocache_64.S

-- 
1.7.7.6



* [PATCH v2 1/6] THP: Use real address for NUMA policy
@ 2012-08-09 15:02 ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2012-08-09 15:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Andi Kleen,
	Kirill A. Shutemov, Tim Chen, Alex Shi, Jan Beulich,
	Robert Richter, Andy Lutomirski, Andrew Morton, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, KAMEZAWA Hiroyuki, Mel Gorman,
	linux-kernel, linuxppc-dev, linux-mips, linux-sh, sparclinux

From: Andi Kleen <ak@linux.intel.com>

Use the fault address, not the rounded-down hpage address, for NUMA
policy purposes.  In some circumstances this can give a more exact
NUMA policy.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 57c4b93..70737ec 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -681,11 +681,11 @@ static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
 
 static inline struct page *alloc_hugepage_vma(int defrag,
 					      struct vm_area_struct *vma,
-					      unsigned long haddr, int nd,
+					      unsigned long address, int nd,
 					      gfp_t extra_gfp)
 {
 	return alloc_pages_vma(alloc_hugepage_gfpmask(defrag, extra_gfp),
-			       HPAGE_PMD_ORDER, vma, haddr, nd);
+			       HPAGE_PMD_ORDER, vma, address, nd);
 }
 
 #ifndef CONFIG_NUMA
@@ -710,7 +710,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(khugepaged_enter(vma)))
 			return VM_FAULT_OOM;
 		page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
-					  vma, haddr, numa_node_id(), 0);
+					  vma, address, numa_node_id(), 0);
 		if (unlikely(!page)) {
 			count_vm_event(THP_FAULT_FALLBACK);
 			goto out;
@@ -944,7 +944,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow())
 		new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
-					      vma, haddr, numa_node_id(), 0);
+					      vma, address, numa_node_id(), 0);
 	else
 		new_page = NULL;
 
-- 
1.7.7.6



* [PATCH v2 2/6] mm: make clear_huge_page tolerate non aligned address
@ 2012-08-09 15:02 ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2012-08-09 15:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Andi Kleen,
	Kirill A. Shutemov, Tim Chen, Alex Shi, Jan Beulich,
	Robert Richter, Andy Lutomirski, Andrew Morton, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, KAMEZAWA Hiroyuki, Mel Gorman,
	linux-kernel, linuxppc-dev, linux-mips, linux-sh, sparclinux

From: Andi Kleen <ak@linux.intel.com>

hugetlb does not necessarily pass in an aligned address, so the
low-level address computation is wrong.

This fixes architectures that actually use the address when flushing
the cleared page (very few, like xtensa/sparc/...?)

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/memory.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 5736170..b47199a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3987,16 +3987,17 @@ void clear_huge_page(struct page *page,
 		     unsigned long addr, unsigned int pages_per_huge_page)
 {
 	int i;
+	unsigned long haddr = addr & HPAGE_PMD_MASK;
 
 	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
-		clear_gigantic_page(page, addr, pages_per_huge_page);
+		clear_gigantic_page(page, haddr, pages_per_huge_page);
 		return;
 	}
 
 	might_sleep();
 	for (i = 0; i < pages_per_huge_page; i++) {
 		cond_resched();
-		clear_user_highpage(page + i, addr + i * PAGE_SIZE);
+		clear_user_highpage(page + i, haddr + i * PAGE_SIZE);
 	}
 }
 
-- 
1.7.7.6



* [PATCH v2 3/6] THP: Pass real, not rounded, address to clear_huge_page
@ 2012-08-09 15:03 ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2012-08-09 15:03 UTC (permalink / raw)
  To: linux-mm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Andi Kleen,
	Kirill A. Shutemov, Tim Chen, Alex Shi, Jan Beulich,
	Robert Richter, Andy Lutomirski, Andrew Morton, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, KAMEZAWA Hiroyuki, Mel Gorman,
	linux-kernel, linuxppc-dev, linux-mips, linux-sh, sparclinux

From: Andi Kleen <ak@linux.intel.com>

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |    9 +++++----
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 70737ec..ecd93f8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -633,7 +633,8 @@ static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 
 static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 					struct vm_area_struct *vma,
-					unsigned long haddr, pmd_t *pmd,
+					unsigned long haddr,
+					unsigned long address, pmd_t *pmd,
 					struct page *page)
 {
 	pgtable_t pgtable;
@@ -643,7 +644,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 	if (unlikely(!pgtable))
 		return VM_FAULT_OOM;
 
-	clear_huge_page(page, haddr, HPAGE_PMD_NR);
+	clear_huge_page(page, address, HPAGE_PMD_NR);
 	__SetPageUptodate(page);
 
 	spin_lock(&mm->page_table_lock);
@@ -720,8 +721,8 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			put_page(page);
 			goto out;
 		}
-		if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd,
-							  page))) {
+		if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr,
+						address, pmd, page))) {
 			mem_cgroup_uncharge_page(page);
 			put_page(page);
 			goto out;
-- 
1.7.7.6



* [PATCH v2 4/6] x86: Add clear_page_nocache
@ 2012-08-09 15:03 ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2012-08-09 15:03 UTC (permalink / raw)
  To: linux-mm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Andi Kleen,
	Kirill A. Shutemov, Tim Chen, Alex Shi, Jan Beulich,
	Robert Richter, Andy Lutomirski, Andrew Morton, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, KAMEZAWA Hiroyuki, Mel Gorman,
	linux-kernel, linuxppc-dev, linux-mips, linux-sh, sparclinux

From: Andi Kleen <ak@linux.intel.com>

Add a cache-avoiding version of clear_page.  It is a straightforward
integer variant of the existing 64-bit clear_page, provided for both
32-bit and 64-bit.

Also add the necessary glue for highmem, including a layer that
non-cache-coherent architectures that use the virtual address for
flushing can hook into.  This is not needed on x86, of course.

If an architecture wants to provide a cache-avoiding version of
clear_page, it should define ARCH_HAS_USER_NOCACHE to 1 and implement
clear_page_nocache() and clear_user_highpage_nocache().
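
On a non-cache-coherent architecture the hook might look like the
following (a hypothetical sketch only: the flush primitive is
arch-specific, flush_dcache_page() stands in as a placeholder, and x86
itself needs no flush at all):

	void clear_user_highpage_nocache(struct page *page,
					 unsigned long vaddr)
	{
		void *p = kmap_atomic(page);

		clear_page_nocache(p);
		kunmap_atomic(p);
		/* make the zeroed page visible at vaddr */
		flush_dcache_page(page);
	}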

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/page.h          |    2 ++
 arch/x86/include/asm/string_32.h     |    5 +++++
 arch/x86/include/asm/string_64.h     |    5 +++++
 arch/x86/lib/Makefile                |    1 +
 arch/x86/lib/clear_page_nocache_32.S |   30 ++++++++++++++++++++++++++++++
 arch/x86/lib/clear_page_nocache_64.S |   29 +++++++++++++++++++++++++++++
 arch/x86/mm/fault.c                  |    7 +++++++
 7 files changed, 79 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/lib/clear_page_nocache_32.S
 create mode 100644 arch/x86/lib/clear_page_nocache_64.S

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 8ca8283..aa83a1b 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -29,6 +29,8 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
 	copy_page(to, from);
 }
 
+void clear_user_highpage_nocache(struct page *page, unsigned long vaddr);
+
 #define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
 	alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
index 3d3e835..3f2fbcf 100644
--- a/arch/x86/include/asm/string_32.h
+++ b/arch/x86/include/asm/string_32.h
@@ -3,6 +3,8 @@
 
 #ifdef __KERNEL__
 
+#include <linux/linkage.h>
+
 /* Let gcc decide whether to inline or use the out of line functions */
 
 #define __HAVE_ARCH_STRCPY
@@ -337,6 +339,9 @@ void *__constant_c_and_count_memset(void *s, unsigned long pattern,
 #define __HAVE_ARCH_MEMSCAN
 extern void *memscan(void *addr, int c, size_t size);
 
+#define ARCH_HAS_USER_NOCACHE 1
+asmlinkage void clear_page_nocache(void *page);
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_32_H */
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 19e2c46..ca23d1d 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -3,6 +3,8 @@
 
 #ifdef __KERNEL__
 
+#include <linux/linkage.h>
+
 /* Written 2002 by Andi Kleen */
 
 /* Only used for special circumstances. Stolen from i386/string.h */
@@ -63,6 +65,9 @@ char *strcpy(char *dest, const char *src);
 char *strcat(char *dest, const char *src);
 int strcmp(const char *cs, const char *ct);
 
+#define ARCH_HAS_USER_NOCACHE 1
+asmlinkage void clear_page_nocache(void *page);
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index b00f678..a8ad6dd 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -23,6 +23,7 @@ lib-y += memcpy_$(BITS).o
 lib-$(CONFIG_SMP) += rwlock.o
 lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o
+lib-y += clear_page_nocache_$(BITS).o
 
 obj-y += msr.o msr-reg.o msr-reg-export.o
 
diff --git a/arch/x86/lib/clear_page_nocache_32.S b/arch/x86/lib/clear_page_nocache_32.S
new file mode 100644
index 0000000..2394e0c
--- /dev/null
+++ b/arch/x86/lib/clear_page_nocache_32.S
@@ -0,0 +1,30 @@
+#include <linux/linkage.h>
+#include <asm/dwarf2.h>
+
+/*
+ * Zero a page avoiding the caches
+ * rdi	page
+ */
+ENTRY(clear_page_nocache)
+	CFI_STARTPROC
+	mov    %eax,%edi
+	xorl   %eax,%eax
+	movl   $4096/64,%ecx
+	.p2align 4
+.Lloop:
+	decl	%ecx
+#define PUT(x) movnti %eax,x*8(%edi) ; movnti %eax,x*8+4(%edi)
+	PUT(0)
+	PUT(1)
+	PUT(2)
+	PUT(3)
+	PUT(4)
+	PUT(5)
+	PUT(6)
+	PUT(7)
+	lea	64(%edi),%edi
+	jnz	.Lloop
+	nop
+	ret
+	CFI_ENDPROC
+ENDPROC(clear_page_nocache)
diff --git a/arch/x86/lib/clear_page_nocache_64.S b/arch/x86/lib/clear_page_nocache_64.S
new file mode 100644
index 0000000..ee16d15
--- /dev/null
+++ b/arch/x86/lib/clear_page_nocache_64.S
@@ -0,0 +1,29 @@
+#include <linux/linkage.h>
+#include <asm/dwarf2.h>
+
+/*
+ * Zero a page avoiding the caches
+ * rdi	page
+ */
+ENTRY(clear_page_nocache)
+	CFI_STARTPROC
+	xorl   %eax,%eax
+	movl   $4096/64,%ecx
+	.p2align 4
+.Lloop:
+	decl	%ecx
+#define PUT(x) movnti %rax,x*8(%rdi)
+	movnti %rax,(%rdi)
+	PUT(1)
+	PUT(2)
+	PUT(3)
+	PUT(4)
+	PUT(5)
+	PUT(6)
+	PUT(7)
+	leaq	64(%rdi),%rdi
+	jnz	.Lloop
+	nop
+	ret
+	CFI_ENDPROC
+ENDPROC(clear_page_nocache)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 76dcd9d..20888b4 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1209,3 +1209,10 @@ good_area:
 
 	up_read(&mm->mmap_sem);
 }
+
+void clear_user_highpage_nocache(struct page *page, unsigned long vaddr)
+{
+	void *p = kmap_atomic(page, KM_USER0);
+	clear_page_nocache(p);
+	kunmap_atomic(p);
+}
-- 
1.7.7.6



* [PATCH v2 5/6] mm: make clear_huge_page cache clear only around the fault address
@ 2012-08-09 15:03 ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2012-08-09 15:03 UTC (permalink / raw)
  To: linux-mm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Andi Kleen,
	Kirill A. Shutemov, Tim Chen, Alex Shi, Jan Beulich,
	Robert Richter, Andy Lutomirski, Andrew Morton, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, KAMEZAWA Hiroyuki, Mel Gorman,
	linux-kernel, linuxppc-dev, linux-mips, linux-sh, sparclinux

From: Andi Kleen <ak@linux.intel.com>

Clearing a 2MB huge page will typically blow away several levels
of CPU caches.  To avoid this, clear only the 4K page around
the fault address through the cache and use cache-avoiding clears
for the rest of the 2MB area.  For example, for a fault at
haddr + 0x1000 the target index is 1, so only the second of the
512 subpages is cleared through the cache.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/memory.c |   30 +++++++++++++++++++++++++++---
 1 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b47199a..e9a75c2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3969,18 +3969,35 @@ EXPORT_SYMBOL(might_fault);
 #endif
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+
+#ifndef ARCH_HAS_USER_NOCACHE
+#define ARCH_HAS_USER_NOCACHE 0
+#endif
+
+#if ARCH_HAS_USER_NOCACHE == 0
+#define clear_user_highpage_nocache clear_user_highpage
+#endif
+
 static void clear_gigantic_page(struct page *page,
 				unsigned long addr,
 				unsigned int pages_per_huge_page)
 {
 	int i;
 	struct page *p = page;
+	unsigned long vaddr;
+	unsigned long haddr = addr & HPAGE_PMD_MASK;
+	int target = (addr - haddr) >> PAGE_SHIFT;
 
 	might_sleep();
+	vaddr = haddr;
 	for (i = 0; i < pages_per_huge_page;
 	     i++, p = mem_map_next(p, page, i)) {
 		cond_resched();
-		clear_user_highpage(p, addr + i * PAGE_SIZE);
+		vaddr = haddr + i*PAGE_SIZE;
+		if (!ARCH_HAS_USER_NOCACHE  || i == target)
+			clear_user_highpage(p, vaddr);
+		else
+			clear_user_highpage_nocache(p, vaddr);
 	}
 }
 void clear_huge_page(struct page *page,
@@ -3988,16 +4005,23 @@ void clear_huge_page(struct page *page,
 {
 	int i;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
+	unsigned long vaddr;
+	int target = (addr - haddr) >> PAGE_SHIFT;
 
 	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
-		clear_gigantic_page(page, haddr, pages_per_huge_page);
+		clear_gigantic_page(page, addr, pages_per_huge_page);
 		return;
 	}
 
 	might_sleep();
+	vaddr = haddr;
 	for (i = 0; i < pages_per_huge_page; i++) {
 		cond_resched();
-		clear_user_highpage(page + i, haddr + i * PAGE_SIZE);
+		vaddr = haddr + i*PAGE_SIZE;
+		if (!ARCH_HAS_USER_NOCACHE || i == target)
+			clear_user_highpage(page + i, vaddr);
+		else
+			clear_user_highpage_nocache(page + i, vaddr);
 	}
 }
 
-- 
1.7.7.6



* [PATCH v2 6/6] x86: switch the 64bit uncached page clear to SSE/AVX v2
@ 2012-08-09 15:03 ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2012-08-09 15:03 UTC (permalink / raw)
  To: linux-mm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Andi Kleen,
	Kirill A. Shutemov, Tim Chen, Alex Shi, Jan Beulich,
	Robert Richter, Andy Lutomirski, Andrew Morton, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, KAMEZAWA Hiroyuki, Mel Gorman,
	linux-kernel, linuxppc-dev, linux-mips, linux-sh, sparclinux

From: Andi Kleen <ak@linux.intel.com>

With multiple threads vector stores are more efficient, so use them.
This causes the page clear to run non-preemptible and adds some
overhead.  However, on 32-bit it was already non-preemptible (due to
kmap_atomic) and there is a preemption opportunity every 4K unit.

On an NPB (NAS Parallel Benchmarks) 128GB run on a Westmere this
reduces the performance regression of enabling transparent huge pages
from 2.81% to 0.81%, which is near the runtime variability now.
On a system with AVX support more is expected.
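
For reference, the non-preemptible window is exactly the FPU section;
a sketch of the pattern used below (kernel_fpu_begin()/kernel_fpu_end()
bracket preempt_disable()/preempt_enable()):

	kernel_fpu_begin();  /* saves user FPU state, disables preemption */
	/* 4096-byte movntdq/vmovntdq store loop */
	kernel_fpu_end();    /* restores FPU state, re-enables preemption */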

Signed-off-by: Andi Kleen <ak@linux.intel.com>
[kirill.shutemov@linux.intel.com: Properly save/restore arguments]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/lib/clear_page_nocache_64.S |   91 ++++++++++++++++++++++++++++-----
 1 files changed, 77 insertions(+), 14 deletions(-)

diff --git a/arch/x86/lib/clear_page_nocache_64.S b/arch/x86/lib/clear_page_nocache_64.S
index ee16d15..c092919 100644
--- a/arch/x86/lib/clear_page_nocache_64.S
+++ b/arch/x86/lib/clear_page_nocache_64.S
@@ -1,29 +1,92 @@
+/*
+ * Clear pages with cache bypass.
+ * 
+ * Copyright (C) 2011, 2012 Intel Corporation
+ * Author: Andi Kleen
+ *
+ * This software may be redistributed and/or modified under the terms of
+ * the GNU General Public License ("GPL") version 2 only as published by the
+ * Free Software Foundation.
+ */
+
 #include <linux/linkage.h>
+#include <asm/alternative-asm.h>
+#include <asm/cpufeature.h>
 #include <asm/dwarf2.h>
 
+#define SSE_UNROLL 128
+
 /*
  * Zero a page avoiding the caches
  * rdi	page
  */
 ENTRY(clear_page_nocache)
 	CFI_STARTPROC
-	xorl   %eax,%eax
-	movl   $4096/64,%ecx
+	push   %rdi
+	call   kernel_fpu_begin
+	pop    %rdi
+	sub    $16,%rsp
+	CFI_ADJUST_CFA_OFFSET 16
+	movdqu %xmm0,(%rsp)
+	xorpd  %xmm0,%xmm0
+	movl   $4096/SSE_UNROLL,%ecx
 	.p2align 4
 .Lloop:
 	decl	%ecx
-#define PUT(x) movnti %rax,x*8(%rdi)
-	movnti %rax,(%rdi)
-	PUT(1)
-	PUT(2)
-	PUT(3)
-	PUT(4)
-	PUT(5)
-	PUT(6)
-	PUT(7)
-	leaq	64(%rdi),%rdi
+	.set x,0
+	.rept SSE_UNROLL/16
+	movntdq %xmm0,x(%rdi)
+	.set x,x+16
+	.endr
+	leaq	SSE_UNROLL(%rdi),%rdi
 	jnz	.Lloop
-	nop
-	ret
+	movdqu (%rsp),%xmm0
+	addq   $16,%rsp
+	CFI_ADJUST_CFA_OFFSET -16
+	jmp   kernel_fpu_end
 	CFI_ENDPROC
 ENDPROC(clear_page_nocache)
+
+#ifdef CONFIG_AS_AVX
+
+	.section .altinstr_replacement,"ax"
+1:	.byte 0xeb					/* jmp <disp8> */
+	.byte (clear_page_nocache_avx - clear_page_nocache) - (2f - 1b)
+	/* offset */
+2:
+	.previous
+	.section .altinstructions,"a"
+	altinstruction_entry clear_page_nocache,1b,X86_FEATURE_AVX,\
+	                     16, 2b-1b
+	.previous
+
+#define AVX_UNROLL 256 /* TUNE ME */
+
+ENTRY(clear_page_nocache_avx)
+	CFI_STARTPROC
+	push   %rdi
+	call   kernel_fpu_begin
+	pop    %rdi
+	sub    $32,%rsp
+	CFI_ADJUST_CFA_OFFSET 32
+	vmovdqu %ymm0,(%rsp)
+	vxorpd  %ymm0,%ymm0,%ymm0
+	movl   $4096/AVX_UNROLL,%ecx
+	.p2align 4
+.Lloop_avx:
+	decl	%ecx
+	.set x,0
+	.rept AVX_UNROLL/32
+	vmovntdq %ymm0,x(%rdi)
+	.set x,x+32
+	.endr
+	leaq	AVX_UNROLL(%rdi),%rdi
+	jnz	.Lloop_avx
+	vmovdqu (%rsp),%ymm0
+	addq   $32,%rsp
+	CFI_ADJUST_CFA_OFFSET -32
+	jmp   kernel_fpu_end
+	CFI_ENDPROC
+ENDPROC(clear_page_nocache_avx)
+
+#endif
-- 
1.7.7.6



* Re: [PATCH v2 4/6] x86: Add clear_page_nocache
@ 2012-08-09 15:22   ` Jan Beulich
From: Jan Beulich @ 2012-08-09 15:22 UTC (permalink / raw)
  To: Andi Kleen, Kirill A. Shutemov
  Cc: Andy Lutomirski, Robert Richter, Johannes Weiner, Hugh Dickins,
	Alex Shi, KAMEZAWA Hiroyuki, x86, linux-mm, Thomas Gleixner,
	Andrew Morton, linux-mips, Tim Chen, linuxppc-dev,
	Andrea Arcangeli, Ingo Molnar, Mel Gorman, linux-kernel,
	linux-sh, sparclinux, H. Peter Anvin

>>> On 09.08.12 at 17:03, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> Add a cache-avoiding version of clear_page.  It is a straightforward
> integer variant of the existing 64-bit clear_page, provided for both
> 32-bit and 64-bit.

While on 64-bit this is fine, I fail to see how you avoid using the
SSE2 instruction on non-SSE2 systems.

> Also add the necessary glue for highmem, including a layer that
> non-cache-coherent architectures that use the virtual address for
> flushing can hook into.  This is not needed on x86, of course.
>
> If an architecture wants to provide a cache-avoiding version of
> clear_page, it should define ARCH_HAS_USER_NOCACHE to 1 and implement
> clear_page_nocache() and clear_user_highpage_nocache().
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/include/asm/page.h          |    2 ++
>  arch/x86/include/asm/string_32.h     |    5 +++++
>  arch/x86/include/asm/string_64.h     |    5 +++++
>  arch/x86/lib/Makefile                |    1 +
>  arch/x86/lib/clear_page_nocache_32.S |   30 ++++++++++++++++++++++++++++++
>  arch/x86/lib/clear_page_nocache_64.S |   29 +++++++++++++++++++++++++++++

Couldn't this more reasonably go into clear_page_{32,64}.S?

>  arch/x86/mm/fault.c                  |    7 +++++++
>  7 files changed, 79 insertions(+), 0 deletions(-)
>  create mode 100644 arch/x86/lib/clear_page_nocache_32.S
>  create mode 100644 arch/x86/lib/clear_page_nocache_64.S
>...
>--- /dev/null
>+++ b/arch/x86/lib/clear_page_nocache_32.S
>@@ -0,0 +1,30 @@
>+#include <linux/linkage.h>
>+#include <asm/dwarf2.h>
>+
>+/*
>+ * Zero a page avoiding the caches
>+ * rdi	page

Wrong comment.

>+ */
>+ENTRY(clear_page_nocache)
>+	CFI_STARTPROC
>+	mov    %eax,%edi

You need to pick a different register here (e.g. %edx), since
%edi has to be preserved by all functions called from C.

>+	xorl   %eax,%eax
>+	movl   $4096/64,%ecx
>+	.p2align 4
>+.Lloop:
>+	decl	%ecx
>+#define PUT(x) movnti %eax,x*8(%edi) ; movnti %eax,x*8+4(%edi)

Is doing twice as much unrolling as on 64-bit really worth it?

Jan



* Re: [PATCH v2 4/6] x86: Add clear_page_nocache
@ 2012-08-09 15:23   ` H. Peter Anvin
From: H. Peter Anvin @ 2012-08-09 15:23 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-mm, Thomas Gleixner, Ingo Molnar, x86, Andi Kleen,
	Tim Chen, Alex Shi, Jan Beulich, Robert Richter, Andy Lutomirski,
	Andrew Morton, Andrea Arcangeli, Johannes Weiner, Hugh Dickins,
	KAMEZAWA Hiroyuki, Mel Gorman, linux-kernel, linuxppc-dev,
	linux-mips, linux-sh, sparclinux

On 08/09/2012 08:03 AM, Kirill A. Shutemov wrote:
> From: Andi Kleen <ak@linux.intel.com>
>
> Add a cache-avoiding version of clear_page.  It is a straightforward
> integer variant of the existing 64-bit clear_page, provided for both
> 32-bit and 64-bit.
>
> Also add the necessary glue for highmem, including a layer that
> non-cache-coherent architectures that use the virtual address for
> flushing can hook into.  This is not needed on x86, of course.
>
> If an architecture wants to provide a cache-avoiding version of
> clear_page, it should define ARCH_HAS_USER_NOCACHE to 1 and implement
> clear_page_nocache() and clear_user_highpage_nocache().
>

Compile failure:

/home/hpa/kernel/tip.x86-mm/arch/x86/mm/fault.c: In function 
‘clear_user_highpage_nocache’:
/home/hpa/kernel/tip.x86-mm/arch/x86/mm/fault.c:1215:30: error: 
‘KM_USER0’ undeclared (first use in this function)
/home/hpa/kernel/tip.x86-mm/arch/x86/mm/fault.c:1215:30: note: each 
undeclared identifier is reported only once for each function it appears in
/home/hpa/kernel/tip.x86-mm/arch/x86/mm/fault.c:1215:2: error: too many 
arguments to function ‘kmap_atomic’
In file included from 
/home/hpa/kernel/tip.x86-mm/include/linux/pagemap.h:10:0,
                  from 
/home/hpa/kernel/tip.x86-mm/include/linux/mempolicy.h:70,
                  from 
/home/hpa/kernel/tip.x86-mm/include/linux/hugetlb.h:15,
                  from /home/hpa/kernel/tip.x86-mm/arch/x86/mm/fault.c:14:
/home/hpa/kernel/tip.x86-mm/include/linux/highmem.h:66:21: note: 
declared here
make[4]: *** [arch/x86/mm/fault.o] Error 1
make[3]: *** [arch/x86/mm] Error 2
make[2]: *** [arch/x86] Error 2
make[1]: *** [sub-make] Error 2
make[1]: Leaving directory `/home/hpa/kernel/tip.x86-mm'

This happens on *all* my test configurations, including both x86-64 and 
i386 allyesconfig.  I suspect your patchset base is stale.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



* Re: [PATCH v2 4/6] x86: Add clear_page_nocache
@ 2012-08-09 15:26     ` Andi Kleen
From: Andi Kleen @ 2012-08-09 15:26 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Kirill A. Shutemov, x86, linux-mm, linux-kernel

> While on 64-bit this is fine, I fail to see how you avoid using the
> SSE2 instruction on non-SSE2 systems.

You're right, this needs a fallback path for 32-bit non-SSE2
(and fixing the ABI).
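
A minimal caller-side sketch of such a fallback (hypothetical: it
assumes the cpu_has_xmm2 feature test; patching the entry point via
alternatives would work just as well):

	if (cpu_has_xmm2)
		clear_page_nocache(page);	/* movnti path */
	else
		clear_page(page);		/* plain cached clear */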

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH v2 6/6] x86: switch the 64bit uncached page clear to SSE/AVX v2
@ 2012-08-09 15:28   ` Jan Beulich
From: Jan Beulich @ 2012-08-09 15:28 UTC (permalink / raw)
  To: Andi Kleen, Kirill A. Shutemov
  Cc: Andy Lutomirski, Robert Richter, Johannes Weiner, Hugh Dickins,
	Alex Shi, KAMEZAWA Hiroyuki, x86, linux-mm, Thomas Gleixner,
	Andrew Morton, linux-mips, Tim Chen, linuxppc-dev,
	Andrea Arcangeli, Ingo Molnar, Mel Gorman, linux-kernel,
	linux-sh, sparclinux, H. Peter Anvin

>>> On 09.08.12 at 17:03, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
>  ENTRY(clear_page_nocache)
>  	CFI_STARTPROC
> -	xorl   %eax,%eax
> -	movl   $4096/64,%ecx
> +	push   %rdi
> +	call   kernel_fpu_begin
> +	pop    %rdi

You use CFI annotations elsewhere, so why don't you use
pushq_cfi/popq_cfi here?

Jan



* Re: [PATCH v2 4/6] x86: Add clear_page_nocache
@ 2012-08-13 11:43     ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2012-08-13 11:43 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andi Kleen, Andy Lutomirski, Robert Richter, Johannes Weiner,
	Hugh Dickins, Alex Shi, KAMEZAWA Hiroyuki, x86, linux-mm,
	Thomas Gleixner, Andrew Morton, linux-mips, Tim Chen,
	linuxppc-dev, Andrea Arcangeli, Ingo Molnar, Mel Gorman,
	linux-kernel, linux-sh, sparclinux, H. Peter Anvin

On Thu, Aug 09, 2012 at 04:22:04PM +0100, Jan Beulich wrote:
> >>> On 09.08.12 at 17:03, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

...

> > ---
> >  arch/x86/include/asm/page.h          |    2 ++
> >  arch/x86/include/asm/string_32.h     |    5 +++++
> >  arch/x86/include/asm/string_64.h     |    5 +++++
> >  arch/x86/lib/Makefile                |    1 +
> >  arch/x86/lib/clear_page_nocache_32.S |   30 ++++++++++++++++++++++++++++++
> >  arch/x86/lib/clear_page_nocache_64.S |   29 +++++++++++++++++++++++++++++
> 
> Couldn't this more reasonably go into clear_page_{32,64}.S?

We don't have clear_page_32.S.

> >+	xorl   %eax,%eax
> >+	movl   $4096/64,%ecx
> >+	.p2align 4
> >+.Lloop:
> >+	decl	%ecx
> >+#define PUT(x) movnti %eax,x*8(%edi) ; movnti %eax,x*8+4(%edi)
> 
> Is doing twice as much unrolling as on 64-bit really worth it?

Moving 64 bytes per loop iteration is faster on Sandy Bridge, but
slower on Westmere. Any preference? ;)

Westmere:

 Performance counter stats for './test_unroll32' (20 runs):

      31498.420608 task-clock                #    0.998 CPUs utilized            ( +-  0.25% )
                40 context-switches          #    0.001 K/sec                    ( +-  1.40% )
                 0 CPU-migrations            #    0.000 K/sec                    ( +-100.00% )
                89 page-faults               #    0.003 K/sec                    ( +-  0.13% )
    74,728,231,935 cycles                    #    2.372 GHz                      ( +-  0.25% ) [83.34%]
    53,789,969,009 stalled-cycles-frontend   #   71.98% frontend cycles idle     ( +-  0.35% ) [83.33%]
    41,681,014,054 stalled-cycles-backend    #   55.78% backend  cycles idle     ( +-  0.43% ) [66.67%]
    37,992,733,278 instructions              #    0.51  insns per cycle
                                             #    1.42  stalled cycles per insn  ( +-  0.05% ) [83.33%]
     3,561,376,245 branches                  #  113.065 M/sec                    ( +-  0.05% ) [83.33%]
        27,182,795 branch-misses             #    0.76% of all branches          ( +-  0.06% ) [83.33%]

      31.558545812 seconds time elapsed                                          ( +-  0.25% )

 Performance counter stats for './test_unroll64' (20 runs):

      31564.753623 task-clock                #    0.998 CPUs utilized            ( +-  0.19% )
                39 context-switches          #    0.001 K/sec                    ( +-  0.40% )
                 0 CPU-migrations            #    0.000 K/sec
                90 page-faults               #    0.003 K/sec                    ( +-  0.12% )
    74,886,045,192 cycles                    #    2.372 GHz                      ( +-  0.19% ) [83.33%]
    57,477,323,995 stalled-cycles-frontend   #   76.75% frontend cycles idle     ( +-  0.26% ) [83.34%]
    44,548,142,150 stalled-cycles-backend    #   59.49% backend  cycles idle     ( +-  0.31% ) [66.67%]
    32,940,027,099 instructions              #    0.44  insns per cycle
                                             #    1.74  stalled cycles per insn  ( +-  0.05% ) [83.34%]
     1,884,944,093 branches                  #   59.717 M/sec                    ( +-  0.05% ) [83.32%]
         1,027,135 branch-misses             #    0.05% of all branches          ( +-  0.56% ) [83.34%]

      31.621001407 seconds time elapsed                                          ( +-  0.19% )

Sandy Bridge:

 Performance counter stats for './test_unroll32' (20 runs):

       8578.382891 task-clock                #    0.997 CPUs utilized            ( +-  0.08% )
                15 context-switches          #    0.000 M/sec                    ( +-  2.97% )
                 0 CPU-migrations            #    0.000 M/sec
                84 page-faults               #    0.000 M/sec                    ( +-  0.13% )
    29,154,476,597 cycles                    #    3.399 GHz                      ( +-  0.08% ) [83.33%]
    11,851,215,147 stalled-cycles-frontend   #   40.65% frontend cycles idle     ( +-  0.20% ) [83.33%]
     1,530,172,593 stalled-cycles-backend    #    5.25% backend  cycles idle     ( +-  1.44% ) [66.67%]
    37,915,778,094 instructions              #    1.30  insns per cycle
                                             #    0.31  stalled cycles per insn  ( +-  0.00% ) [83.34%]
     3,590,533,447 branches                  #  418.556 M/sec                    ( +-  0.01% ) [83.35%]
        26,500,765 branch-misses             #    0.74% of all branches          ( +-  0.01% ) [83.34%]

       8.604638449 seconds time elapsed                                          ( +-  0.08% )

 Performance counter stats for './test_unroll64' (20 runs):

       8463.789963 task-clock                #    0.997 CPUs utilized            ( +-  0.07% )
                14 context-switches          #    0.000 M/sec                    ( +-  1.70% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +-100.00% )
                85 page-faults               #    0.000 M/sec                    ( +-  0.12% )
    28,763,328,688 cycles                    #    3.398 GHz                      ( +-  0.07% ) [83.32%]
    13,517,462,952 stalled-cycles-frontend   #   47.00% frontend cycles idle     ( +-  0.14% ) [83.33%]
     1,356,208,859 stalled-cycles-backend    #    4.72% backend  cycles idle     ( +-  1.42% ) [66.68%]
    32,885,492,141 instructions              #    1.14  insns per cycle
                                             #    0.41  stalled cycles per insn  ( +-  0.00% ) [83.34%]
     1,912,094,072 branches                  #  225.915 M/sec                    ( +-  0.02% ) [83.34%]
           305,896 branch-misses             #    0.02% of all branches          ( +-  1.05% ) [83.33%]

       8.488304839 seconds time elapsed                                          ( +-  0.07% )

$ cat test.c
#include <stdio.h>
#include <sys/mman.h>

#define SIZE 1024*1024*1024

void clear_page_nocache_sse2(void *page) __attribute__((regparm(1)));

int main(int argc, char** argv)
{
        char *p;
        unsigned long i, j;

        p = mmap(NULL, SIZE, PROT_WRITE|PROT_READ,
                        MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE, -1, 0);
        for(j = 0; j < 100; j++) {
                for(i = 0; i < SIZE; i += 4096) {
                        clear_page_nocache_sse2(p + i);
                }
        }

        return 0;
}
$ cat clear_page_nocache_unroll32.S
.globl clear_page_nocache_sse2
.align 4,0x90
clear_page_nocache_sse2:
.cfi_startproc
        mov    %eax,%edx
        xorl   %eax,%eax
        movl   $4096/32,%ecx
        .p2align 4
.Lloop_sse2:
        decl    %ecx
#define PUT(x) movnti %eax,x*4(%edx)
        PUT(0)
        PUT(1)
        PUT(2)
        PUT(3)
        PUT(4)
        PUT(5)
        PUT(6)
        PUT(7)
#undef PUT
        lea     32(%edx),%edx
        jnz     .Lloop_sse2
        nop
        ret
.cfi_endproc
.type clear_page_nocache_sse2, @function
.size clear_page_nocache_sse2, .-clear_page_nocache_sse2
$ cat clear_page_nocache_unroll64.S
.globl clear_page_nocache_sse2
.align 4,0x90
clear_page_nocache_sse2:
.cfi_startproc
        mov    %eax,%edx
        xorl   %eax,%eax
        movl   $4096/64,%ecx
        .p2align 4
.Lloop_sse2:
        decl    %ecx
#define PUT(x) movnti %eax,x*8(%edx) ; movnti %eax,x*8+4(%edx)
        PUT(0)
        PUT(1)
        PUT(2)
        PUT(3)
        PUT(4)
        PUT(5)
        PUT(6)
        PUT(7)
#undef PUT
        lea     64(%edx),%edx
        jnz     .Lloop_sse2
        nop
        ret
.cfi_endproc
.type clear_page_nocache_sse2, @function
.size clear_page_nocache_sse2, .-clear_page_nocache_sse2

-- 
 Kirill A. Shutemov


* Re: [PATCH v2 4/6] x86: Add clear_page_nocache
@ 2012-08-13 12:02       ` Jan Beulich
From: Jan Beulich @ 2012-08-13 12:02 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Robert Richter, Johannes Weiner, Hugh Dickins,
	Alex Shi, KAMEZAWA Hiroyuki, x86, linux-mm, Thomas Gleixner,
	Andrew Morton, linux-mips, Andi Kleen, Tim Chen, linuxppc-dev,
	Andrea Arcangeli, Ingo Molnar, Mel Gorman, linux-kernel,
	linux-sh, sparclinux, H. Peter Anvin

>>> On 13.08.12 at 13:43, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> On Thu, Aug 09, 2012 at 04:22:04PM +0100, Jan Beulich wrote:
>> >>> On 09.08.12 at 17:03, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>  wrote:
> 
> ...
> 
>> > ---
>> >  arch/x86/include/asm/page.h          |    2 ++
>> >  arch/x86/include/asm/string_32.h     |    5 +++++
>> >  arch/x86/include/asm/string_64.h     |    5 +++++
>> >  arch/x86/lib/Makefile                |    1 +
>> >  arch/x86/lib/clear_page_nocache_32.S |   30 ++++++++++++++++++++++++++++++
>> >  arch/x86/lib/clear_page_nocache_64.S |   29 +++++++++++++++++++++++++++++
>> 
>> Couldn't this more reasonably go into clear_page_{32,64}.S?
> 
> We don't have clear_page_32.S.

Sure, but you're introducing a file anyway. Fold the new code into
the existing file for 64-bit, and create a new, similarly named one
for 32-bit.

>> >+	xorl   %eax,%eax
>> >+	movl   $4096/64,%ecx
>> >+	.p2align 4
>> >+.Lloop:
>> >+	decl	%ecx
>> >+#define PUT(x) movnti %eax,x*8(%edi) ; movnti %eax,x*8+4(%edi)
>> 
>> Is doing twice as much unrolling as on 64-bit really worth it?
> 
> Moving 64 bytes per loop iteration is faster on Sandy Bridge, but
> slower on Westmere. Any preference? ;)

If it's not a clear win, I'd favor the 8-stores-per-iteration variant,
matching x86-64.

Jan



* Re: [PATCH v2 4/6] x86: Add clear_page_nocache
@ 2012-08-13 16:27       ` Andi Kleen
From: Andi Kleen @ 2012-08-13 16:27 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Jan Beulich, Andy Lutomirski, Robert Richter, Johannes Weiner,
	Hugh Dickins, Alex Shi, KAMEZAWA Hiroyuki, x86, linux-mm,
	Thomas Gleixner, Andrew Morton, linux-mips, Tim Chen,
	linuxppc-dev, Andrea Arcangeli, Ingo Molnar, Mel Gorman,
	linux-kernel, linux-sh, sparclinux, H. Peter Anvin

> Moving 64 bytes per loop iteration is faster on Sandy Bridge, but
> slower on Westmere. Any preference? ;)

You have to be careful with these benchmarks.

- You need to make sure the data is cache-cold; cache-hot numbers are misleading.
- The numbers can change if you have multiple CPUs doing this in parallel
  (see the example below).
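
For the second point, the userspace test above can simply be run on
several cores at once, e.g. (a hypothetical wrapper around the posted
test binary):

	$ for c in 0 1 2 3; do taskset -c $c ./test_unroll64 & done; wait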

-Andi


* Re: [PATCH v2 4/6] x86: Add clear_page_nocache
@ 2012-08-13 17:04       ` Borislav Petkov
From: Borislav Petkov @ 2012-08-13 17:04 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Jan Beulich, Andi Kleen, Andy Lutomirski, Robert Richter,
	Johannes Weiner, Hugh Dickins, Alex Shi, KAMEZAWA Hiroyuki, x86,
	linux-mm, Thomas Gleixner, Andrew Morton, linux-mips, Tim Chen,
	linuxppc-dev, Andrea Arcangeli, Ingo Molnar, Mel Gorman,
	linux-kernel, linux-sh, sparclinux, H. Peter Anvin

On Mon, Aug 13, 2012 at 02:43:34PM +0300, Kirill A. Shutemov wrote:
> $ cat test.c
> #include <stdio.h>
> #include <sys/mman.h>
> 
> #define SIZE 1024*1024*1024
> 
> void clear_page_nocache_sse2(void *page) __attribute__((regparm(1)));
> 
> int main(int argc, char** argv)
> {
>         char *p;
>         unsigned long i, j;
> 
>         p = mmap(NULL, SIZE, PROT_WRITE|PROT_READ,
>                         MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE, -1, 0);
>         for(j = 0; j < 100; j++) {
>                 for(i = 0; i < SIZE; i += 4096) {
>                         clear_page_nocache_sse2(p + i);
>                 }
>         }
> 
>         return 0;
> }
> $ cat clear_page_nocache_unroll32.S
> .globl clear_page_nocache_sse2
> .align 4,0x90
> clear_page_nocache_sse2:
> .cfi_startproc
>         mov    %eax,%edx
>         xorl   %eax,%eax
>         movl   $4096/32,%ecx
>         .p2align 4
> .Lloop_sse2:
>         decl    %ecx
> #define PUT(x) movnti %eax,x*4(%edx)
>         PUT(0)
>         PUT(1)
>         PUT(2)
>         PUT(3)
>         PUT(4)
>         PUT(5)
>         PUT(6)
>         PUT(7)
> #undef PUT
>         lea     32(%edx),%edx
>         jnz     .Lloop_sse2
>         nop
>         ret
> .cfi_endproc
> .type clear_page_nocache_sse2, @function
> .size clear_page_nocache_sse2, .-clear_page_nocache_sse2
> $ cat clear_page_nocache_unroll64.S
> .globl clear_page_nocache_sse2
> .align 4,0x90
> clear_page_nocache_sse2:
> .cfi_startproc
>         mov    %eax,%edx

This must still be the 32-bit version because it segfaults here. Here's
why:

mmap above gives a pointer which, on 64-bit, does not fit in 32 bits,
i.e. it looks like 0x7fffxxxxx000, i.e. near the top of userspace.

Now, the mov above truncates that pointer and the thing segfaults.

Doing s/edx/rdx/g fixes it though.

Thanks.

-- 
Regards/Gruss,
Boris.


* Re: [PATCH v2 4/6] x86: Add clear_page_nocache
@ 2012-08-13 19:07         ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2012-08-13 19:07 UTC (permalink / raw)
  To: Borislav Petkov, Kirill A. Shutemov, Jan Beulich, Andi Kleen,
	Andy Lutomirski, Robert Richter, Johannes Weiner, Hugh Dickins,
	Alex Shi, KAMEZAWA Hiroyuki, x86, linux-mm, Thomas Gleixner,
	Andrew Morton, linux-mips, Tim Chen, linuxppc-dev,
	Andrea Arcangeli, Ingo Molnar, Mel Gorman, linux-kernel,
	linux-sh, sparclinux, H. Peter Anvin

On Mon, Aug 13, 2012 at 07:04:02PM +0200, Borislav Petkov wrote:
> On Mon, Aug 13, 2012 at 02:43:34PM +0300, Kirill A. Shutemov wrote:
> > $ cat test.c
> > #include <stdio.h>
> > #include <sys/mman.h>
> > 
> > #define SIZE 1024*1024*1024
> > 
> > void clear_page_nocache_sse2(void *page) __attribute__((regparm(1)));
> > 
> > int main(int argc, char** argv)
> > {
> >         char *p;
> >         unsigned long i, j;
> > 
> >         p = mmap(NULL, SIZE, PROT_WRITE|PROT_READ,
> >                         MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE, -1, 0);
> >         for(j = 0; j < 100; j++) {
> >                 for(i = 0; i < SIZE; i += 4096) {
> >                         clear_page_nocache_sse2(p + i);
> >                 }
> >         }
> > 
> >         return 0;
> > }
> > $ cat clear_page_nocache_unroll32.S
> > .globl clear_page_nocache_sse2
> > .align 4,0x90
> > clear_page_nocache_sse2:
> > .cfi_startproc
> >         mov    %eax,%edx
> >         xorl   %eax,%eax
> >         movl   $4096/32,%ecx
> >         .p2align 4
> > .Lloop_sse2:
> >         decl    %ecx
> > #define PUT(x) movnti %eax,x*4(%edx)
> >         PUT(0)
> >         PUT(1)
> >         PUT(2)
> >         PUT(3)
> >         PUT(4)
> >         PUT(5)
> >         PUT(6)
> >         PUT(7)
> > #undef PUT
> >         lea     32(%edx),%edx
> >         jnz     .Lloop_sse2
> >         nop
> >         ret
> > .cfi_endproc
> > .type clear_page_nocache_sse2, @function
> > .size clear_page_nocache_sse2, .-clear_page_nocache_sse2
> > $ cat clear_page_nocache_unroll64.S
> > .globl clear_page_nocache_sse2
> > .align 4,0x90
> > clear_page_nocache_sse2:
> > .cfi_startproc
> >         mov    %eax,%edx
> 
> This must still be the 32-bit version because it segfaults here.

Yes, it's a test for the 32-bit version.

-- 
 Kirill A. Shutemov

