* [PATCH 0/3] Virtual huge zero page
From: Kirill A. Shutemov @ 2012-09-28 23:37 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, linux-mm, H. Peter Anvin
  Cc: Andi Kleen, linux-kernel, Kirill A. Shutemov, Arnd Bergmann,
	Ingo Molnar, linux-arch, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Here's an alternative implementation of the huge zero page: a virtual
huge zero page.

A virtual huge zero page is a PMD page table with all entries pointing
to the regular zero page. H. Peter Anvin asked me to evaluate this
implementation option.
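
To illustrate the access pattern this optimizes, here's a userspace
sketch of mine (not part of the series): reads from untouched anonymous
memory all hit the shared zero page, so no real memory is allocated
until the first write.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2UL << 20;		/* one 2M (pmd-sized) region */
	unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long sum = 0;
	size_t i;

	if (p == MAP_FAILED)
		return 1;
	/* read-only faults: every pte can point at the zero page; with
	 * this series the whole pmd can be populated in one go and
	 * marked pmd_special() */
	for (i = 0; i < len; i++)
		sum += p[i];
	printf("sum=%lu\n", sum);	/* always prints 0 */
	return 0;
}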

Pros:
 - cache friendly (not yet benchmarked);
 - fewer changes required (if I haven't missed something ;);

Cons:
 - increases TLB pressure;
 - requires per-arch enabling;
 - one more check on the handle_mm_fault() path.

At the moment I have only done a sanity check; real testing is still required.

Any opinion?

Kirill A. Shutemov (3):
  asm-generic: introduce pmd_special() and pmd_mkspecial()
  mm, thp: implement virtual huge zero page
  x86: implement HAVE_PMD_SPECIAL

 arch/Kconfig                   |    6 ++++++
 arch/x86/Kconfig               |    1 +
 arch/x86/include/asm/pgtable.h |   14 +++++++++++++-
 include/asm-generic/pgtable.h  |   12 ++++++++++++
 include/linux/mm.h             |    8 ++++++++
 mm/huge_memory.c               |   38 ++++++++++++++++++++++++++++++++++++++
 mm/memory.c                    |   15 ++++++++-------
 7 files changed, 86 insertions(+), 8 deletions(-)

-- 
1.7.7.6



* [PATCH 1/3] asm-generic: introduce pmd_special() and pmd_mkspecial()
From: Kirill A. Shutemov @ 2012-09-28 23:37 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, linux-mm, H. Peter Anvin
  Cc: Andi Kleen, linux-kernel, Kirill A. Shutemov, Arnd Bergmann,
	Ingo Molnar, linux-arch, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

A special PMD is similar to a special PTE: it requires special handling.
Currently, it's needed to mark a PMD whose page table has all PTEs set
to the zero page.

If an arch wants to provide support for special PMDs, it needs to select
the HAVE_PMD_SPECIAL config option and implement pmd_special() and
pmd_mkspecial().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/Kconfig                  |    6 ++++++
 include/asm-generic/pgtable.h |   12 ++++++++++++
 2 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 72f2fa1..a74ba25 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -281,4 +281,10 @@ config SECCOMP_FILTER
 
 	  See Documentation/prctl/seccomp_filter.txt for details.
 
+config HAVE_PMD_SPECIAL
+	bool
+	help
+	  An arch should select this symbol if it provides pmd_special()
+	  and pmd_mkspecial().
+
 source "kernel/gcov/Kconfig"
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index ff4947b..393f3f0 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -59,6 +59,18 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
+#ifndef CONFIG_HAVE_PMD_SPECIAL
+static inline int pmd_special(pmd_t pmd)
+{
+	return 0;
+}
+
+static inline pmd_t pmd_mkspecial(pmd_t pmd)
+{
+	return pmd;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
 int ptep_clear_flush_young(struct vm_area_struct *vma,
 			   unsigned long address, pte_t *ptep);
-- 
1.7.7.6



* [PATCH 2/3] mm, thp: implement virtual huge zero page
From: Kirill A. Shutemov @ 2012-09-28 23:37 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, linux-mm, H. Peter Anvin
  Cc: Andi Kleen, linux-kernel, Kirill A. Shutemov, Arnd Bergmann,
	Ingo Molnar, linux-arch, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

A virtual huge zero page is a PMD page table with all entries set to the
zero page. When we get a write-protect page fault to the zero page in
such a PMD, we drop the whole page table and allow THP (if enabled) to
allocate real memory instead.

The implementation requires HAVE_PMD_SPECIAL from an arch that wants to
support the virtual huge zero page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h |    8 ++++++++
 mm/huge_memory.c   |   38 ++++++++++++++++++++++++++++++++++++++
 mm/memory.c        |   15 ++++++++-------
 3 files changed, 54 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 311be90..179a41c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -514,6 +514,14 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 }
 #endif
 
+#ifndef my_zero_pfn
+static inline unsigned long my_zero_pfn(unsigned long addr)
+{
+	extern unsigned long zero_pfn;
+	return zero_pfn;
+}
+#endif
+
 /*
  * Multiple processes may "see" the same page. E.g. for untouched
  * mappings of /dev/null, all processes see the same page full of
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 57c4b93..8189fb6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -696,6 +696,33 @@ static inline struct page *alloc_hugepage(int defrag)
 }
 #endif
 
+static void set_huge_zero_page(pgtable_t pgtable, struct vm_area_struct *vma,
+		unsigned long haddr, pmd_t *pmd)
+{
+	pmd_t _pmd;
+	int i;
+
+	pmdp_clear_flush_notify(vma, haddr, pmd);
+	/* leave pmd empty until pte is filled */
+
+	pmd_populate(vma->vm_mm, &_pmd, pgtable);
+
+	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+		pte_t *pte, entry;
+		entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+		entry = pte_mkspecial(entry);
+		pte = pte_offset_map(&_pmd, haddr);
+		VM_BUG_ON(!pte_none(*pte));
+		set_pte_at(vma->vm_mm, haddr, pte, entry);
+		pte_unmap(pte);
+	}
+	smp_wmb(); /* make pte visible before pmd */
+	pmd_populate(vma->vm_mm, pmd, pgtable);
+	_pmd = pmd_mkspecial(*pmd);
+	set_pmd_at(vma->vm_mm, haddr, pmd, _pmd);
+	vma->vm_mm->nr_ptes++;
+}
+
 int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			       unsigned long address, pmd_t *pmd,
 			       unsigned int flags)
@@ -709,6 +736,17 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			return VM_FAULT_OOM;
 		if (unlikely(khugepaged_enter(vma)))
 			return VM_FAULT_OOM;
+		if (IS_ENABLED(CONFIG_HAVE_PMD_SPECIAL) &&
+				!(flags & FAULT_FLAG_WRITE)) {
+			pgtable_t pgtable;
+			pgtable = pte_alloc_one(mm, haddr);
+			if (unlikely(!pgtable))
+				goto out;
+			spin_lock(&mm->page_table_lock);
+			set_huge_zero_page(pgtable, vma, haddr, pmd);
+			spin_unlock(&mm->page_table_lock);
+			return 0;
+		}
 		page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
 					  vma, haddr, numa_node_id(), 0);
 		if (unlikely(!page)) {
diff --git a/mm/memory.c b/mm/memory.c
index 5736170..38dfd5e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -724,13 +724,6 @@ static inline int is_zero_pfn(unsigned long pfn)
 }
 #endif
 
-#ifndef my_zero_pfn
-static inline unsigned long my_zero_pfn(unsigned long addr)
-{
-	return zero_pfn;
-}
-#endif
-
 /*
  * vm_normal_page -- This function gets the "struct page" associated with a pte.
  *
@@ -3514,6 +3507,14 @@ retry:
 	pmd = pmd_alloc(mm, pud, address);
 	if (!pmd)
 		return VM_FAULT_OOM;
+
+	if (pmd_special(*pmd) && (flags & FAULT_FLAG_WRITE)) {
+		pgtable_t pgtable = pmd_pgtable(*pmd);
+		pmd_clear(pmd);
+		pte_free(mm, pgtable);
+		mm->nr_ptes--;
+	}
+
 	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
 		if (!vma->vm_ops)
 			return do_huge_pmd_anonymous_page(mm, vma, address,
-- 
1.7.7.6



* [PATCH 3/3] x86: implement HAVE_PMD_SPECIAL
From: Kirill A. Shutemov @ 2012-09-28 23:37 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, linux-mm, H. Peter Anvin
  Cc: Andi Kleen, linux-kernel, Kirill A. Shutemov, Arnd Bergmann,
	Ingo Molnar, linux-arch, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

We can use the same bit as for a special PTE.

There's no conflict with _PAGE_SPLITTING since it's only defined for PSE
PMDs, while a special PMD is only valid for non-PSE PMDs.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig               |    1 +
 arch/x86/include/asm/pgtable.h |   14 +++++++++++++-
 2 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 50a1d1f..b2146c3 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -97,6 +97,7 @@ config X86
 	select KTIME_SCALAR if X86_32
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
+	select HAVE_PMD_SPECIAL
 
 config INSTRUCTION_DECODER
 	def_bool (KPROBES || PERF_EVENTS || UPROBES)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 49afb3f..ff61694 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -167,6 +167,12 @@ static inline int has_transparent_hugepage(void)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static inline int pmd_special(pmd_t pmd)
+{
+	return !pmd_trans_huge(pmd) &&
+		pmd_flags(pmd) & _PAGE_SPECIAL;
+}
+
 static inline pte_t pte_set_flags(pte_t pte, pteval_t set)
 {
 	pteval_t v = native_pte_val(pte);
@@ -290,6 +296,11 @@ static inline pmd_t pmd_mknotpresent(pmd_t pmd)
 	return pmd_clear_flags(pmd, _PAGE_PRESENT);
 }
 
+static inline pmd_t pmd_mkspecial(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_SPECIAL);
+}
+
 /*
  * Mask out unsupported bits in a present pgprot.  Non-present pgprots
  * can use those bits for other purposes, so leave them be.
@@ -474,7 +485,8 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 
 static inline int pmd_bad(pmd_t pmd)
 {
-	return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
+	pmdval_t flags = pmd_flags(pmd);
+	return (flags & ~(_PAGE_USER | _PAGE_SPECIAL)) != _KERNPG_TABLE;
 }
 
 static inline unsigned long pages_to_mb(unsigned long npg)
-- 
1.7.7.6



* Re: [PATCH 0/3] Virtual huge zero page
From: Andrea Arcangeli @ 2012-09-29 13:48 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, linux-mm, H. Peter Anvin, Andi Kleen,
	linux-kernel, Kirill A. Shutemov, Arnd Bergmann, Ingo Molnar,
	linux-arch

On Sat, Sep 29, 2012 at 02:37:18AM +0300, Kirill A. Shutemov wrote:
> Cons:
>  - increases TLB pressure;

I generally don't like using 4k tlb entries ever. This only has the
advantage of saving 2MB-4KB RAM (globally), and a cmpxchg at the first
system-wide zero page fault. I like apps to only use 2M TLB entries
whenever possible (that is going to pay off big as the number of 2M TLB
entries is going to increase over time).

I did some research on tricks that use 4k ptes until up to half the pmd
is filled before converting it to a THP (to save some memory and cache),
and it didn't look good, so my rule of thumb was "THP sometimes costs,
even the switch from a half-filled pte page table to a transhuge pmd
still costs, so to diminish the risk of slowdowns we should use 2M TLB
entries immediately, whenever possible".

Now the rule of thumb doesn't fully apply here: 1) there are no
compaction costs to offset, 2) chances are the zero page isn't very
performance critical anyway... only some weird apps use it (but
sometimes they have a legitimate reason for using it, which is why we
support it).

There would be a small cache benefit here... but even then some first
level caches are virtually indexed IIRC (always physically tagged to
avoid the software noticing), and virtually indexed ones won't get any
benefit.

It doesn't even provide a memory-saving tradeoff, because it drops the
zero pmd at the first write fault (not at the last). And it's better to
replace it at the first fault than at the last (that matches the
current design).

Another point is that the previous patch is easier to port to other
archs by not requiring arch features to track the zero pmd.

I guess it won't make a whole lot of difference, but my preference is
for the previous implementation that always guaranteed huge TLB
entries whenever possible. That said, I'm fine either way, so if
somebody has strong reasons for wanting this one, I'd like to hear
about it.

Thanks!
Andrea


* Re: [PATCH 0/3] Virtual huge zero page
From: Andi Kleen @ 2012-09-29 14:30 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Kirill A. Shutemov, Andrew Morton, linux-mm, H. Peter Anvin,
	linux-kernel, Kirill A. Shutemov, Arnd Bergmann, Ingo Molnar,
	linux-arch

On Sat, Sep 29, 2012 at 03:48:11PM +0200, Andrea Arcangeli wrote:
> On Sat, Sep 29, 2012 at 02:37:18AM +0300, Kirill A. Shutemov wrote:
> > Cons:
> >  - increases TLB pressure;
> 
> I generally don't like using 4k tlb entries ever. This only has the

From theory I would also prefer the 2MB huge page.

But some numbers comparing the two alternatives are definitely
interesting.  Numbers are often better than theory.

> There would be a small cache benefit here... but even then some first
> level caches are virtually indexed IIRC (always physically tagged to

Modern x86 doesn't have virtually indexed caches.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH 0/3] Virtual huge zero page
From: Andrea Arcangeli @ 2012-09-29 14:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Kirill A. Shutemov, Andrew Morton, linux-mm, H. Peter Anvin,
	linux-kernel, Kirill A. Shutemov, Arnd Bergmann, Ingo Molnar,
	linux-arch

On Sat, Sep 29, 2012 at 07:30:06AM -0700, Andi Kleen wrote:
> On Sat, Sep 29, 2012 at 03:48:11PM +0200, Andrea Arcangeli wrote:
> > On Sat, Sep 29, 2012 at 02:37:18AM +0300, Kirill A. Shutemov wrote:
> > > Cons:
> > >  - increases TLB pressure;
> > 
> > I generally don't like using 4k tlb entries ever. This only has the
> 
> From theory I would also prefer the 2MB huge page.
> 
> But some numbers comparing the two alternatives are definitely
> interesting.  Numbers are often better than theory.

Sure, good idea; it's just that standard benchmarks likely aren't using
zero pages, so I suggest a basic microbenchmark:

   some loop of() {
      memcmp(uninitialized_pointer, (char *)uninitialized_pointer + 4G, 4G)
      barrier();
   }

> 
> > There would be a small cache benefit here... but even then some first
> > level caches are virtually indexed IIRC (always physically tagged to
> 
> Modern x86 doesn't have virtually indexed caches.

With the above memcmp, I'm quite sure the previous patch will beat the
new one by a wide margin, especially on modern x86 with more 2M TLB
entries and >= 8MB L2 caches.

But I agree we need to verify it before taking a decision, and that
the numbers are better than theory, or to rephrase it "let's check the
theory is right" :)


* Re: [PATCH 0/3] Virtual huge zero page
From: Kirill A. Shutemov @ 2012-10-01 13:49 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andi Kleen, Andrew Morton, linux-mm, H. Peter Anvin,
	linux-kernel, Kirill A. Shutemov, Arnd Bergmann, Ingo Molnar,
	linux-arch

On Sat, Sep 29, 2012 at 04:37:37PM +0200, Andrea Arcangeli wrote:
> But I agree we need to verify it before taking a decision, and that
> the numbers are better than theory, or to rephrase it "let's check the
> theory is right" :)

Okay, microbenchmark:

% cat test_memcmp.c 
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MB (1024ul * 1024ul)
#define GB (1024ul * MB)

int main(int argc, char **argv)
{
        char *p;
        int i;

        posix_memalign((void **)&p, 2 * MB, 8 * GB);
        for (i = 0; i < 100; i++) {
                assert(memcmp(p, p + 4*GB, 4*GB) == 0);
                asm volatile ("": : :"memory");
        }
        return 0;
}
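
For reference, the numbers below are from perf stat over 5 runs; the
invocation was roughly (exact flags from memory):

% gcc -O2 -o test_memcmp test_memcmp.c
% perf stat -r 5 ./test_memcmp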

huge zero page (initial implementation):

 Performance counter stats for './test_memcmp' (5 runs):

      32356.272845 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
                40 context-switches          #    0.001 K/sec                    ( +-  0.94% )
                 0 CPU-migrations            #    0.000 K/sec                  
             4,218 page-faults               #    0.130 K/sec                    ( +-  0.00% )
    76,712,481,765 cycles                    #    2.371 GHz                      ( +-  0.13% ) [83.31%]
    36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle     ( +-  0.28% ) [83.35%]
     1,684,049,110 stalled-cycles-backend    #    2.20% backend  cycles idle     ( +-  2.96% ) [66.67%]
   134,355,715,816 instructions              #    1.75  insns per cycle        
                                             #    0.27  stalled cycles per insn  ( +-  0.10% ) [83.35%]
    13,526,169,702 branches                  #  418.039 M/sec                    ( +-  0.10% ) [83.31%]
         1,058,230 branch-misses             #    0.01% of all branches          ( +-  0.91% ) [83.36%]

      32.413866442 seconds time elapsed                                          ( +-  0.13% )

virtual huge zero page (the second implementation):

 Performance counter stats for './test_memcmp' (5 runs):

      30327.183829 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
                38 context-switches          #    0.001 K/sec                    ( +-  1.53% )
                 0 CPU-migrations            #    0.000 K/sec                  
             4,218 page-faults               #    0.139 K/sec                    ( +-  0.01% )
    71,964,773,660 cycles                    #    2.373 GHz                      ( +-  0.13% ) [83.35%]
    31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles idle     ( +-  0.40% ) [83.32%]
       773,484,474 stalled-cycles-backend    #    1.07% backend  cycles idle     ( +-  6.61% ) [66.67%]
   134,982,215,437 instructions              #    1.88  insns per cycle        
                                             #    0.23  stalled cycles per insn  ( +-  0.11% ) [83.32%]
    13,509,150,683 branches                  #  445.447 M/sec                    ( +-  0.11% ) [83.34%]
         1,017,667 branch-misses             #    0.01% of all branches          ( +-  1.07% ) [83.32%]

      30.381324695 seconds time elapsed                                          ( +-  0.13% )

On Westmere-EX virtual huge zero page is ~6.7% faster.

-- 
 Kirill A. Shutemov



* Re: [PATCH 0/3] Virtual huge zero page
From: H. Peter Anvin @ 2012-10-01 15:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Kirill A. Shutemov, Andrew Morton, linux-mm, Andi Kleen,
	linux-kernel, Kirill A. Shutemov, Arnd Bergmann, Ingo Molnar,
	linux-arch

On 09/29/2012 06:48 AM, Andrea Arcangeli wrote:
> 
> There would be a small cache benefit here... but even then some first
> level caches are virtually indexed IIRC (always physically tagged to
> avoid the software noticing), and virtually indexed ones won't get any
> benefit.
> 

Not quite.  The virtual indexing is limited to a few bits (e.g. three
bits on K8); the right way to deal with that is to color the zeropage,
both the regular one and the virtual one (the virtual one would circle
through all the colors repeatedly.)

The cache difference, therefore, is *huge*.
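
A sketch of the cycling I mean (hypothetical, assuming 3 bits of virtual
index as on K8, i.e. 8 page colors):

#include <stdio.h>

/* illustration only: a colored virtual huge zero page would back pte i
 * of the pmd with zero page (i % 8), spreading the zero data across all
 * 8 cache colors instead of aliasing into a single one */
int main(void)
{
	int i, colors = 1 << 3;		/* 3 bits of virtual index */

	for (i = 0; i < 16; i++)	/* first 16 of the 512 ptes */
		printf("pte %2d -> zero page color %d\n", i, i % colors);
	return 0;
}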

> I guess it won't make a whole lot of difference, but my preference is
> for the previous implementation that always guaranteed huge TLB
> entries whenever possible. That said, I'm fine either way, so if
> somebody has strong reasons for wanting this one, I'd like to hear
> about it.

It's a performance tradeoff, and it can, and should, be measured.

	-hpa



* Re: [PATCH 0/3] Virtual huge zero page
From: Andrea Arcangeli @ 2012-10-01 16:14 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andi Kleen, Andrew Morton, linux-mm, H. Peter Anvin,
	linux-kernel, Kirill A. Shutemov, Arnd Bergmann, Ingo Molnar,
	linux-arch

On Mon, Oct 01, 2012 at 04:49:48PM +0300, Kirill A. Shutemov wrote:
> On Sat, Sep 29, 2012 at 04:37:37PM +0200, Andrea Arcangeli wrote:
> > But I agree we need to verify it before taking a decision, and that
> > the numbers are better than theory, or to rephrase it "let's check the
> > theory is right" :)
> 
> Okay, microbenchmark:
> 
> % cat test_memcmp.c 
> #include <assert.h>
> #include <stdlib.h>
> #include <string.h>
> 
> #define MB (1024ul * 1024ul)
> #define GB (1024ul * MB)
> 
> int main(int argc, char **argv)
> {
>         char *p;
>         int i;
> 
>         posix_memalign((void **)&p, 2 * MB, 8 * GB);
>         for (i = 0; i < 100; i++) {
>                 assert(memcmp(p, p + 4*GB, 4*GB) == 0);
>                 asm volatile ("": : :"memory");
>         }
>         return 0;
> }
> 
> huge zero page (initial implementation):
> 
>  Performance counter stats for './test_memcmp' (5 runs):
> 
>       32356.272845 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
>                 40 context-switches          #    0.001 K/sec                    ( +-  0.94% )
>                  0 CPU-migrations            #    0.000 K/sec                  
>              4,218 page-faults               #    0.130 K/sec                    ( +-  0.00% )
>     76,712,481,765 cycles                    #    2.371 GHz                      ( +-  0.13% ) [83.31%]
>     36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle     ( +-  0.28% ) [83.35%]
>      1,684,049,110 stalled-cycles-backend    #    2.20% backend  cycles idle     ( +-  2.96% ) [66.67%]
>    134,355,715,816 instructions              #    1.75  insns per cycle        
>                                              #    0.27  stalled cycles per insn  ( +-  0.10% ) [83.35%]
>     13,526,169,702 branches                  #  418.039 M/sec                    ( +-  0.10% ) [83.31%]
>          1,058,230 branch-misses             #    0.01% of all branches          ( +-  0.91% ) [83.36%]
> 
>       32.413866442 seconds time elapsed                                          ( +-  0.13% )
> 
> virtual huge zero page (the second implementation):
> 
>  Performance counter stats for './test_memcmp' (5 runs):
> 
>       30327.183829 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
>                 38 context-switches          #    0.001 K/sec                    ( +-  1.53% )
>                  0 CPU-migrations            #    0.000 K/sec                  
>              4,218 page-faults               #    0.139 K/sec                    ( +-  0.01% )
>     71,964,773,660 cycles                    #    2.373 GHz                      ( +-  0.13% ) [83.35%]
>     31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles idle     ( +-  0.40% ) [83.32%]
>        773,484,474 stalled-cycles-backend    #    1.07% backend  cycles idle     ( +-  6.61% ) [66.67%]
>    134,982,215,437 instructions              #    1.88  insns per cycle        
>                                              #    0.23  stalled cycles per insn  ( +-  0.11% ) [83.32%]
>     13,509,150,683 branches                  #  445.447 M/sec                    ( +-  0.11% ) [83.34%]
>          1,017,667 branch-misses             #    0.01% of all branches          ( +-  1.07% ) [83.32%]
> 
>       30.381324695 seconds time elapsed                                          ( +-  0.13% )
> 
> On Westmere-EX virtual huge zero page is ~6.7% faster.

Great test, thanks!

So the cache benefit is quite significant, and the TLB gains don't
offset the cache loss of the physical zero page. My call was wrong...

I get the same results as you did.

Now let's tweak the benchmark to test a "seeking" workload more
favorable to the physical 2M page by stressing the TLB.


===
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MB (1024ul * 1024ul)
#define GB (1024ul * MB)

int main(int argc, char **argv)
{
	char *p;
	int i;

	posix_memalign((void **)&p, 2 * MB, 8 * GB);
	for (i = 0; i < 1000; i++) {
		char *_p = p;
		while (_p < p+4*GB) {
			assert(*_p == *(_p+4*GB));
			_p += 4096;
			asm volatile ("": : :"memory");
		}
	}
	return 0;
}
===

results:

virtual zeropage: char comparison seeking in 4G range 1000 times

 Performance counter stats for './zeropage-bench2' (3 runs):

      20624.051801 task-clock                #    0.999 CPUs utilized            ( +-  0.17% )
             1,762 context-switches          #    0.085 K/sec                    ( +-  1.05% )
                 1 CPU-migrations            #    0.000 K/sec                    ( +- 50.00% )
             4,221 page-faults               #    0.205 K/sec                  
    60,182,028,883 cycles                    #    2.918 GHz                      ( +-  0.17% ) [40.00%]
    56,958,431,315 stalled-cycles-frontend   #   94.64% frontend cycles idle     ( +-  0.16% ) [40.02%]
    54,966,753,363 stalled-cycles-backend    #   91.33% backend  cycles idle     ( +-  0.10% ) [40.03%]
     8,606,418,680 instructions              #    0.14  insns per cycle        
                                             #    6.62  stalled cycles per insn  ( +-  0.39% ) [50.03%]
     2,142,535,994 branches                  #  103.885 M/sec                    ( +-  0.20% ) [50.03%]
           115,916 branch-misses             #    0.01% of all branches          ( +-  3.86% ) [50.03%]
     3,209,731,169 L1-dcache-loads           #  155.630 M/sec                    ( +-  0.45% ) [50.01%]
       264,297,418 L1-dcache-load-misses     #    8.23% of all L1-dcache hits    ( +-  0.02% ) [50.00%]
         6,732,362 LLC-loads                 #    0.326 M/sec                    ( +-  0.23% ) [39.99%]
         4,981,319 LLC-load-misses           #   73.99% of all LL-cache hits     ( +-  0.74% ) [39.98%]

      20.649561185 seconds time elapsed                                          ( +-  0.19% )

physical zeropage: char comparison seeking in 4G range 1000 times

 Performance counter stats for './zeropage-bench2' (3 runs):

       2719.512443 task-clock                #    0.999 CPUs utilized            ( +-  0.34% )
               234 context-switches          #    0.086 K/sec                    ( +-  1.00% )
                 0 CPU-migrations            #    0.000 K/sec                  
             4,221 page-faults               #    0.002 M/sec                  
     7,927,948,993 cycles                    #    2.915 GHz                      ( +-  0.17% ) [39.95%]
     4,780,183,162 stalled-cycles-frontend   #   60.30% frontend cycles idle     ( +-  0.58% ) [40.14%]
     2,246,666,029 stalled-cycles-backend    #   28.34% backend  cycles idle     ( +-  3.59% ) [40.19%]
     8,380,516,407 instructions              #    1.06  insns per cycle        
                                             #    0.57  stalled cycles per insn  ( +-  0.13% ) [50.21%]
     2,095,233,526 branches                  #  770.445 M/sec                    ( +-  0.08% ) [50.24%]
            24,586 branch-misses             #    0.00% of all branches          ( +- 11.77% ) [50.19%]
     3,151,778,195 L1-dcache-loads           # 1158.950 M/sec                    ( +-  0.01% ) [50.05%]
     1,051,317,291 L1-dcache-load-misses     #   33.36% of all L1-dcache hits    ( +-  0.02% ) [49.96%]
     1,049,134,961 LLC-loads                 #  385.781 M/sec                    ( +-  0.13% ) [39.92%]
             6,222 LLC-load-misses           #    0.00% of all LL-cache hits     ( +- 35.68% ) [39.93%]

       2.722077632 seconds time elapsed                                          ( +-  0.34% )

NOTE: I used taskset -c 0 in all tests here to reduce the error (this
is also a NUMA system and AutoNUMA wasn't patched in for this test to
avoid the risk of rejects in "git am").

(it would have been prettier if I had added the data TLB performance
counters, but oh well, too late ;)

So in this case the compute time increases by 658% with the virtual 2M
page, and the physical 2M page wins by a wide margin.
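
(20624ms vs 2720ms of task-clock is a ~7.58x difference, which is where
the 658% figure comes from.)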

So my preference is still for the physical zero page, even if it wastes
2M-4k of RAM and increases the compute time by 6% in the worst case.

Thanks!
Andrea


* Re: [PATCH 0/3] Virtual huge zero page
From: Andrea Arcangeli @ 2012-10-01 16:31 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Kirill A. Shutemov, Andrew Morton, linux-mm, Andi Kleen,
	linux-kernel, Kirill A. Shutemov, Arnd Bergmann, Ingo Molnar,
	linux-arch

On Mon, Oct 01, 2012 at 08:34:28AM -0700, H. Peter Anvin wrote:
> On 09/29/2012 06:48 AM, Andrea Arcangeli wrote:
> > 
> > There would be a small cache benefit here... but even then some first
> > level caches are virtually indexed IIRC (always physically tagged to
> > avoid the software noticing), and virtually indexed ones won't get any
> > benefit.
> > 
> 
> Not quite.  The virtual indexing is limited to a few bits (e.g. three
> bits on K8); the right way to deal with that is to color the zeropage,
> both the regular one and the virtual one (the virtual one would circle
> through all the colors repeatedly.)
> 
> The cache difference, therefore, is *huge*.

Kirill measured the cache benefit and it provided a 6% gain, not very
huge but certainly significant.

> It's a performance tradeoff, and it can, and should, be measured.

I now measured the other side of the trade, by touching only one
character every 4k page in the range to simulate a very seeking load,
and doing so the physical huge zero page wins with a 600% margin, so
if the cache benefit is huge for the virtual zero page, the TLB
benefit is massive for the physical zero page.

Overall I think picking the solution that risks regressing the least
(also compared to current status of no zero page) is the safest.

Thanks!
Andrea


* Re: [PATCH 0/3] Virtual huge zero page
From: H. Peter Anvin @ 2012-10-01 17:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Kirill A. Shutemov, Andrew Morton, linux-mm, Andi Kleen,
	linux-kernel, Kirill A. Shutemov, Arnd Bergmann, Ingo Molnar,
	linux-arch

On 10/01/2012 09:31 AM, Andrea Arcangeli wrote:
> On Mon, Oct 01, 2012 at 08:34:28AM -0700, H. Peter Anvin wrote:
>> On 09/29/2012 06:48 AM, Andrea Arcangeli wrote:
>>>
>>> There would be a small cache benefit here... but even then some first
>>> level caches are virtually indexed IIRC (always physically tagged to
>>> avoid the software noticing), and virtually indexed ones won't get any
>>> benefit.
>>>
>>
>> Not quite.  The virtual indexing is limited to a few bits (e.g. three
>> bits on K8); the right way to deal with that is to color the zeropage,
>> both the regular one and the virtual one (the virtual one would circle
>> through all the colors repeatedly.)
>>
>> The cache difference, therefore, is *huge*.
> 
> Kirill measured the cache benefit and it provided a 6% gain, not very
> huge but certainly significant.
> 
>> It's a performance tradeoff, and it can, and should, be measured.
> 
> I now measured the other side of the trade, by touching only one
> character every 4k page in the range to simulate a very seeking load,
> and doing so the physical huge zero page wins with a 600% margin, so
> if the cache benefit is huge for the virtual zero page, the TLB
> benefit is massive for the physical zero page.
> 
> Overall I think picking the solution that risks regressing the least
> (also compared to current status of no zero page) is the safest.
> 

Something isn't quite right about that.  If you look at your numbers:

1,049,134,961 LLC-loads
        6,222 LLC-load-misses

This is another way of saying in your benchmark the huge zero page is
parked in your LLC - using up 2 MB of your LLC, typically a significant
portion of said cache.  In a real-life application that will squeeze out
real data, but in your benchmark the system is artificially quiescent.
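
(6,222 misses out of 1,049,134,961 loads is a 99.999% hit rate: the zero
page effectively never leaves the cache.)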

It is well known that microbenchmarks can be horribly misleading.  What
led to Kirill investigating huge zero page in the first place was the
fact that some applications/macrobenchmarks benefit, and I think those
are the right thing to look at.

	-hpa






* Re: [PATCH 0/3] Virtual huge zero page
From: Kirill A. Shutemov @ 2012-10-01 17:15 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andrea Arcangeli, Kirill A. Shutemov, Andrew Morton, linux-mm,
	Andi Kleen, linux-kernel, Arnd Bergmann, Ingo Molnar, linux-arch

On Mon, Oct 01, 2012 at 10:03:53AM -0700, H. Peter Anvin wrote:
> On 10/01/2012 09:31 AM, Andrea Arcangeli wrote:
> > On Mon, Oct 01, 2012 at 08:34:28AM -0700, H. Peter Anvin wrote:
> >> On 09/29/2012 06:48 AM, Andrea Arcangeli wrote:
> >>>
> >>> There would be a small cache benefit here... but even then some first
> >>> level caches are virtually indexed IIRC (always physically tagged to
> >>> avoid the software noticing), and virtually indexed ones won't get any
> >>> benefit.
> >>>
> >>
> >> Not quite.  The virtual indexing is limited to a few bits (e.g. three
> >> bits on K8); the right way to deal with that is to color the zeropage,
> >> both the regular one and the virtual one (the virtual one would circle
> >> through all the colors repeatedly.)
> >>
> >> The cache difference, therefore, is *huge*.
> > 
> > Kirill measured the cache benefit and it provided a 6% gain, not very
> > huge but certainly significant.
> > 
> >> It's a performance tradeoff, and it can, and should, be measured.
> > 
> > I now measured the other side of the trade, by touching only one
> > character every 4k page in the range to simulate a very seeking load,
> > and doing so the physical huge zero page wins with a 600% margin, so
> > if the cache benefit is huge for the virtual zero page, the TLB
> > benefit is massive for the physical zero page.
> > 
> > Overall I think picking the solution that risks regressing the least
> > (also compared to current status of no zero page) is the safest.
> > 
> 
> Something isn't quite right about that.  If you look at your numbers:
> 
> 1,049,134,961 LLC-loads
>         6,222 LLC-load-misses
> 
> This is another way of saying in your benchmark the huge zero page is
> parked in your LLC - using up 2 MB of your LLC, typically a significant
> portion of said cache.  In a real-life application that will squeeze out
> real data, but in your benchmark the system is artificially quiescent.
> 
> It is well known that microbenchmarks can be horribly misleading.  What
> led to Kirill investigating huge zero page in the first place was the
> fact that some applications/macrobenchmarks benefit, and I think those
> are the right thing to look at.

I think performance is not the first thing we should look at. We need to
choose which implementation is easier to support.

Applications which benefit from the zero page are quite rare. We need to
provide a huge zero page to avoid huge memory consumption with THP.
That's it. Performance optimization for that rare case is overkill.

-- 
 Kirill A. Shutemov


* Re: [PATCH 0/3] Virtual huge zero page
From: Kirill A. Shutemov @ 2012-10-01 17:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Kirill A. Shutemov, Andi Kleen, Andrew Morton, linux-mm,
	H. Peter Anvin, linux-kernel, Arnd Bergmann, Ingo Molnar,
	linux-arch

On Mon, Oct 01, 2012 at 06:14:37PM +0200, Andrea Arcangeli wrote:
> On Mon, Oct 01, 2012 at 04:49:48PM +0300, Kirill A. Shutemov wrote:
> > On Sat, Sep 29, 2012 at 04:37:37PM +0200, Andrea Arcangeli wrote:
> > > But I agree we need to verify it before taking a decision, and that
> > > the numbers are better than theory, or to rephrase it "let's check the
> > > theory is right" :)
> > 
> > Okay, microbenchmark:
> > 
> > % cat test_memcmp.c 
> > #include <assert.h>
> > #include <stdlib.h>
> > #include <string.h>
> > 
> > #define MB (1024ul * 1024ul)
> > #define GB (1024ul * MB)
> > 
> > int main(int argc, char **argv)
> > {
> >         char *p;
> >         int i;
> > 
> >         posix_memalign((void **)&p, 2 * MB, 8 * GB);
> >         for (i = 0; i < 100; i++) {
> >                 assert(memcmp(p, p + 4*GB, 4*GB) == 0);
> >                 asm volatile ("": : :"memory");
> >         }
> >         return 0;
> > }
> > 
> > huge zero page (initial implementation):
> > 
> >  Performance counter stats for './test_memcmp' (5 runs):
> > 
> >       32356.272845 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
> >                 40 context-switches          #    0.001 K/sec                    ( +-  0.94% )
> >                  0 CPU-migrations            #    0.000 K/sec                  
> >              4,218 page-faults               #    0.130 K/sec                    ( +-  0.00% )
> >     76,712,481,765 cycles                    #    2.371 GHz                      ( +-  0.13% ) [83.31%]
> >     36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle     ( +-  0.28% ) [83.35%]
> >      1,684,049,110 stalled-cycles-backend    #    2.20% backend  cycles idle     ( +-  2.96% ) [66.67%]
> >    134,355,715,816 instructions              #    1.75  insns per cycle        
> >                                              #    0.27  stalled cycles per insn  ( +-  0.10% ) [83.35%]
> >     13,526,169,702 branches                  #  418.039 M/sec                    ( +-  0.10% ) [83.31%]
> >          1,058,230 branch-misses             #    0.01% of all branches          ( +-  0.91% ) [83.36%]
> > 
> >       32.413866442 seconds time elapsed                                          ( +-  0.13% )
> > 
> > virtual huge zero page (the second implementation):
> > 
> >  Performance counter stats for './test_memcmp' (5 runs):
> > 
> >       30327.183829 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
> >                 38 context-switches          #    0.001 K/sec                    ( +-  1.53% )
> >                  0 CPU-migrations            #    0.000 K/sec                  
> >              4,218 page-faults               #    0.139 K/sec                    ( +-  0.01% )
> >     71,964,773,660 cycles                    #    2.373 GHz                      ( +-  0.13% ) [83.35%]
> >     31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles idle     ( +-  0.40% ) [83.32%]
> >        773,484,474 stalled-cycles-backend    #    1.07% backend  cycles idle     ( +-  6.61% ) [66.67%]
> >    134,982,215,437 instructions              #    1.88  insns per cycle        
> >                                              #    0.23  stalled cycles per insn  ( +-  0.11% ) [83.32%]
> >     13,509,150,683 branches                  #  445.447 M/sec                    ( +-  0.11% ) [83.34%]
> >          1,017,667 branch-misses             #    0.01% of all branches          ( +-  1.07% ) [83.32%]
> > 
> >       30.381324695 seconds time elapsed                                          ( +-  0.13% )
> > 
> > On Westmere-EX virtual huge zero page is ~6.7% faster.
> 
> Great test thanks!
> 
> So the cache benefit is quite significant, and the TLB gains don't
> offset the cache loss of the physical zero page. My call was wrong...
> 
> I get the same results as you did.
> 
> Now let's tweak the benchmark to test a "seeking" workload more
> favorable to the physical 2M page by stressing the TLB.
> 
> 
> ===
> #include <assert.h>
> #include <stdlib.h>
> #include <string.h>
> 
> #define MB (1024ul * 1024ul)
> #define GB (1024ul * MB)
> 
> int main(int argc, char **argv)
> {
> 	char *p;
> 	int i;
> 
> 	posix_memalign((void **)&p, 2 * MB, 8 * GB);
> 	for (i = 0; i < 1000; i++) {
> 		char *_p = p;
> 		while (_p < p+4*GB) {
> 			assert(*_p == *(_p+4*GB));
> 			_p += 4096;
> 			asm volatile ("": : :"memory");
> 		}
> 	}
> 	return 0;
> }

Results on my machine:

virtual zeropage:
 Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):

      27313.891128 task-clock                #    0.998 CPUs utilized            ( +-  0.24% )
                62 context-switches          #    0.002 K/sec                    ( +-  0.61% )
             4,384 page-faults               #    0.160 K/sec                    ( +-  0.01% )
    64,747,374,606 cycles                    #    2.370 GHz                      ( +-  0.24% ) [33.33%]
    61,341,580,278 stalled-cycles-frontend   #   94.74% frontend cycles idle     ( +-  0.26% ) [33.33%]
    56,702,237,511 stalled-cycles-backend    #   87.57% backend  cycles idle     ( +-  0.07% ) [33.33%]
    10,033,724,846 instructions              #    0.15  insns per cycle        
                                             #    6.11  stalled cycles per insn  ( +-  0.09% ) [41.65%]
     2,190,424,932 branches                  #   80.195 M/sec                    ( +-  0.12% ) [41.66%]
         1,028,630 branch-misses             #    0.05% of all branches          ( +-  1.50% ) [41.66%]
     3,302,006,540 L1-dcache-loads           #  120.891 M/sec                    ( +-  0.11% ) [41.68%]
       271,374,358 L1-dcache-misses          #    8.22% of all L1-dcache hits    ( +-  0.04% ) [41.66%]
        20,385,476 LLC-load                  #    0.746 M/sec                    ( +-  1.64% ) [33.34%]
            76,754 LLC-misses                #    0.38% of all LL-cache hits     ( +-  2.35% ) [33.34%]
     3,309,927,290 dTLB-loads                #  121.181 M/sec                    ( +-  0.03% ) [33.34%]
     2,098,967,427 dTLB-misses               #   63.41% of all dTLB cache hits   ( +-  0.03% ) [33.34%]

      27.364448741 seconds time elapsed                                          ( +-  0.24% )

physical zeropage:
 Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):

       3505.727639 task-clock                #    0.998 CPUs utilized            ( +-  0.26% )
                 9 context-switches          #    0.003 K/sec                    ( +-  4.97% )
             4,384 page-faults               #    0.001 M/sec                    ( +-  0.00% )
     8,318,482,466 cycles                    #    2.373 GHz                      ( +-  0.26% ) [33.31%]
     5,134,318,786 stalled-cycles-frontend   #   61.72% frontend cycles idle     ( +-  0.42% ) [33.32%]
     2,193,266,208 stalled-cycles-backend    #   26.37% backend  cycles idle     ( +-  5.51% ) [33.33%]
     9,494,670,537 instructions              #    1.14  insns per cycle        
                                             #    0.54  stalled cycles per insn  ( +-  0.13% ) [41.68%]
     2,108,522,738 branches                  #  601.451 M/sec                    ( +-  0.09% ) [41.68%]
           158,746 branch-misses             #    0.01% of all branches          ( +-  1.60% ) [41.71%]
     3,168,102,115 L1-dcache-loads           #  903.693 M/sec                    ( +-  0.11% ) [41.70%]
     1,048,710,998 L1-dcache-misses          #   33.10% of all L1-dcache hits    ( +-  0.11% ) [41.72%]
     1,047,699,685 LLC-load                  #  298.854 M/sec                    ( +-  0.03% ) [33.38%]
             2,287 LLC-misses                #    0.00% of all LL-cache hits     ( +-  8.27% ) [33.37%]
     3,166,187,367 dTLB-loads                #  903.147 M/sec                    ( +-  0.02% ) [33.35%]
         4,266,538 dTLB-misses               #    0.13% of all dTLB cache hits   ( +-  0.03% ) [33.33%]

       3.513339813 seconds time elapsed                                          ( +-  0.26% )

-- 
 Kirill A. Shutemov


* Re: [PATCH 0/3] Virtual huge zero page
From: Andrea Arcangeli @ 2012-10-01 17:26 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Kirill A. Shutemov, Andrew Morton, linux-mm, Andi Kleen,
	linux-kernel, Kirill A. Shutemov, Arnd Bergmann, Ingo Molnar,
	linux-arch

On Mon, Oct 01, 2012 at 10:03:53AM -0700, H. Peter Anvin wrote:
> Something isn't quite right about that.  If you look at your numbers:
> 
> 1,049,134,961 LLC-loads
>         6,222 LLC-load-misses
> 
> This is another way of saying in your benchmark the huge zero page is
> parked in your LLC - using up 2 MB of your LLC, typically a significant
> portion of said cache.  In a real-life application that will squeeze out
> real data, but in your benchmark the system is artificially quiescent.

Agreed. And that argument applies to the cache benefits of the virtual
zero page too: squeeze the cache just a bit more aggressively so that
those 4k pages fall out of the cache as well, and that 6% improvement
will disappear (while the TLB benefit of the physical zero page is
guaranteed and always present no matter the workload: even if the TLBs
miss at the same frequency, each miss gets filled with one less
cacheline access every time).

> It is well known that microbenchmarks can be horribly misleading.  What
> led to Kirill investigating huge zero page in the first place was the
> fact that some applications/macrobenchmarks benefit, and I think those
> are the right thing to look at.

The whole point of the two microbenchmarks was to measure the worst
cases for both scenarios, and I think that was useful. Real-life use of
zero pages is going to fall somewhere in that range.


* Re: [PATCH 0/3] Virtual huge zero page
From: H. Peter Anvin @ 2012-10-01 17:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Kirill A. Shutemov, Andrew Morton, linux-mm, Andi Kleen,
	linux-kernel, Kirill A. Shutemov, Arnd Bergmann, Ingo Molnar,
	linux-arch

On 10/01/2012 10:26 AM, Andrea Arcangeli wrote:
> 
>> It is well known that microbenchmarks can be horribly misleading.  What
>> led to Kirill investigating huge zero page in the first place was the
>> fact that some applications/macrobenchmarks benefit, and I think those
>> are the right thing to look at.
> 
> The whole point of the two microbenchmarks was to measure the worst
> cases for both scenarios, and I think that was useful. Real-life use of
> zero pages is going to fall somewhere in that range.
> 

... and I think it would be worthwhile to know which effect dominates
(or neither, in which case it doesn't matter).

Overall, I'm okay with either as long as we don't lock down 2 MB when
there isn't a huge zero page in use.

	-hpa



* Re: [PATCH 0/3] Virtual huge zero page
From: Kirill A. Shutemov @ 2012-10-01 17:36 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andrea Arcangeli, Kirill A. Shutemov, Andrew Morton, linux-mm,
	Andi Kleen, linux-kernel, Arnd Bergmann, Ingo Molnar, linux-arch

On Mon, Oct 01, 2012 at 10:33:12AM -0700, H. Peter Anvin wrote:
> Overall, I'm okay with either as long as we don't lock down 2 MB when
> there isn't a huge zero page in use.

Is a shrinker-reclaimable huge zero page okay for you?

-- 
 Kirill A. Shutemov


* Re: [PATCH 0/3] Virtual huge zero page
From: H. Peter Anvin @ 2012-10-01 17:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Kirill A. Shutemov, Andrew Morton, linux-mm,
	Andi Kleen, linux-kernel, Arnd Bergmann, Ingo Molnar, linux-arch

On 10/01/2012 10:36 AM, Kirill A. Shutemov wrote:
> On Mon, Oct 01, 2012 at 10:33:12AM -0700, H. Peter Anvin wrote:
>> Overall, I'm okay with either as long as we don't lock down 2 MB when
>> there isn't a huge zero page in use.
> 
> Is a shrinker-reclaimable huge zero page okay for you?
> 

Yes, I'm fine with that.  However, I'm curious about the relative
benefit versus virtual hzp from a performance perspective, on an
application where hzp actually matters.

One can otherwise argue that if hzp doesn't matter except in a small
number of cases, we shouldn't use it at all.

	-hpa




* Re: [PATCH 0/3] Virtual huge zero page
From: Kirill A. Shutemov @ 2012-10-01 17:44 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andrea Arcangeli, Kirill A. Shutemov, Andrew Morton, linux-mm,
	Andi Kleen, linux-kernel, Arnd Bergmann, Ingo Molnar, linux-arch

On Mon, Oct 01, 2012 at 10:37:23AM -0700, H. Peter Anvin wrote:
> One can otherwise argue that if hzp doesn't matter except in a small
> number of cases, we shouldn't use it at all.

This small number of cases can easily trigger OOM if THP is enabled. :)

-- 
 Kirill A. Shutemov


* Re: [PATCH 0/3] Virtual huge zero page
From: H. Peter Anvin @ 2012-10-01 17:52 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Kirill A. Shutemov, Andrew Morton, linux-mm,
	Andi Kleen, linux-kernel, Arnd Bergmann, Ingo Molnar, linux-arch

On 10/01/2012 10:44 AM, Kirill A. Shutemov wrote:
> On Mon, Oct 01, 2012 at 10:37:23AM -0700, H. Peter Anvin wrote:
>> One can otherwise argue that if hzp doesn't matter except in a small
>> number of cases, we shouldn't use it at all.
> 
> This small number of cases can easily trigger OOM if THP is enabled. :)
> 

And that doesn't happen in any conditions that *aren't* helped by hzp?

	-hpa



* Re: [PATCH 0/3] Virtual huge zero page
From: Andrea Arcangeli @ 2012-10-01 18:03 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: H. Peter Anvin, Kirill A. Shutemov, Andrew Morton, linux-mm,
	Andi Kleen, linux-kernel, Arnd Bergmann, Ingo Molnar, linux-arch

On Mon, Oct 01, 2012 at 08:15:19PM +0300, Kirill A. Shutemov wrote:
> I think performance is not the first thing we should look at. We need to
> choose which implementation is easier to support.

Having to introduce a special pmd bitflag that requires architectural
support actually makes it less self-contained. The zero page support is
made optional of course, but the physical zero page would have worked
without the arch even noticing.

> Applications which benefit from the zero page are quite rare. We need to
> provide a huge zero page to avoid huge memory consumption with THP.
> That's it. Performance optimization for that rare case is overkill.

I still don't like the idea of some rare app potentially running
significantly slower (and we may not be notified, because it's not a
breakage; if they're simulations it's hard to tell whether they're
slower because of different input or because of the zero page being
introduced). If we knew for sure that zero page accesses were always
rare I wouldn't care, of course. But rare app != rare access.

The physical zero page patchset is certainly bigger, but it is mostly
localized in huge_memory.c, so I don't see it as very intrusive despite
being bigger.

Anyway, if others see the virtual zero page as easier to maintain, I'm
fine either way.


* Re: [PATCH 0/3] Virtual huge zero page
From: Andrea Arcangeli @ 2012-10-01 18:05 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Kirill A. Shutemov, Andrew Morton, linux-mm, Andi Kleen,
	linux-kernel, Kirill A. Shutemov, Arnd Bergmann, Ingo Molnar,
	linux-arch

On Mon, Oct 01, 2012 at 10:33:12AM -0700, H. Peter Anvin wrote:
> ... and I think it would be worthwhile to know which effect dominates
> (or neither, in which case it doesn't matter).
> 
> Overall, I'm okay with either as long as we don't lock down 2 MB when
> there isn't a huge zero page in use.

Same here.

I agree the cmpxchg idea to free the 2M zero page was a very nice
addition to the physical zero page patchset.


* Re: [PATCH 0/3] Virtual huge zero page
From: Kirill A. Shutemov @ 2012-10-01 18:56 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andrea Arcangeli, Kirill A. Shutemov, Andrew Morton, linux-mm,
	Andi Kleen, linux-kernel, Arnd Bergmann, Ingo Molnar, linux-arch

On Mon, Oct 01, 2012 at 10:52:06AM -0700, H. Peter Anvin wrote:
> On 10/01/2012 10:44 AM, Kirill A. Shutemov wrote:
> > On Mon, Oct 01, 2012 at 10:37:23AM -0700, H. Peter Anvin wrote:
> >> One can otherwise argue that if hzp doesn't matter except in a small
> >> number of cases, we shouldn't use it at all.
> > 
> > This small number of cases can easily trigger OOM if THP is enabled. :)
> > 
> 
> And that doesn't happen in any conditions that *aren't* helped by hzp?

Sure, OOM can still happen.
But if we can eliminate a class of problem, why not do so?

-- 
 Kirill A. Shutemov
