* [PATCH 0/7] x86: rework tlb range flushing code
@ 2014-03-06  0:45 ` Dave Hansen
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Hansen @ 2014-03-06  0:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, ak, kirill.shutemov, mgorman, alex.shi, x86, linux-mm, Dave Hansen

Reposting with an instrumentation patch, and a few minor tweaks.
I'd love some more eyeballs on this, but I think it's ready for
-mm.

I'm having it run through the LKP harness to see if any performance
regressions (or gains) show up.

Without the last (instrumentation/debugging) patch:

 arch/x86/include/asm/mmu_context.h |    6 ++
 arch/x86/include/asm/processor.h   |    1
 arch/x86/kernel/cpu/amd.c          |    7 --
 arch/x86/kernel/cpu/common.c       |   13 -----
 arch/x86/kernel/cpu/intel.c        |   26 ----------
 arch/x86/mm/tlb.c                  |   91 +++++++++++++++----------------------
 include/linux/mm_types.h           |   10 ++++
 mm/Makefile                        |    2
 8 files changed, 58 insertions(+), 98 deletions(-)

--

I originally went to look at this because I realized that newer
CPUs were not present in the intel_tlb_flushall_shift_set() code.

I went to figure out where to stick the newer CPUs (do we
consider them more like SandyBridge or IvyBridge?), but was not
able to repeat the original experiments.

Instead, this set does:
 1. Rework the code a bit to ready it for tracepoints
 2. Add tracepoints
 3. Add a new tunable and set it to a sane value
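
For anyone skimming just the cover letter, the heart of the new
behavior (after patches 2, 3 and 5) boils down to the sketch
below.  This is a condensed illustration, not a patch; the names
are the ones introduced later in the series:

	/*
	 * Condensed sketch of the post-series heuristic -- see
	 * patches 2/7, 3/7 and 5/7 for the real code.
	 */
	if (((end - start) >> PAGE_SHIFT) > tlb_single_page_flush_ceiling) {
		local_flush_tlb();		/* one cr3 write, whole TLB */
	} else {
		for (addr = start; addr < end; addr += PAGE_SIZE)
			__flush_tlb_single(addr);	/* one invlpg per page */
	}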


* [PATCH 1/7] x86: mm: clean up tlb flushing code
  2014-03-06  0:45 ` Dave Hansen
@ 2014-03-06  0:45   ` Dave Hansen
  -1 siblings, 0 replies; 35+ messages in thread
From: Dave Hansen @ 2014-03-06  0:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, ak, kirill.shutemov, mgorman, alex.shi, x86, linux-mm,
	Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The

	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)

line of code is not exactly the easiest to audit, especially when
it ends up at two different indentation levels.  This eliminates
one of the copy-n-paste versions.  It also gives us a unified
exit point for each path through this function.  We need this in
a minute for our tracepoint.
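
For reference, all that line is asking is "does any CPU other
than me have this mm loaded?".  A purely illustrative helper (not
part of this patch) that spells the idiom out:

	/* Illustrative only -- not part of this patch. */
	static inline bool mm_has_other_users(struct mm_struct *mm)
	{
		/*
		 * cpumask_any_but() returns >= nr_cpu_ids when our CPU is
		 * the only bit set in mm_cpumask(mm); anything below
		 * nr_cpu_ids means another CPU may hold TLB entries for
		 * this mm and needs to be told to flush.
		 */
		return cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids;
	}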


Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/mm/tlb.c |   23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff -puN arch/x86/mm/tlb.c~simplify-tlb-code arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~simplify-tlb-code	2014-03-05 16:10:09.607047728 -0800
+++ b/arch/x86/mm/tlb.c	2014-03-05 16:10:09.610047866 -0800
@@ -161,23 +161,24 @@ void flush_tlb_current_task(void)
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned long vmflag)
 {
+	int need_flush_others_all = 1;
 	unsigned long addr;
 	unsigned act_entries, tlb_entries = 0;
 	unsigned long nr_base_pages;
 
 	preempt_disable();
 	if (current->active_mm != mm)
-		goto flush_all;
+		goto out;
 
 	if (!current->mm) {
 		leave_mm(smp_processor_id());
-		goto flush_all;
+		goto out;
 	}
 
 	if (end == TLB_FLUSH_ALL || tlb_flushall_shift == -1
 					|| vmflag & VM_HUGETLB) {
 		local_flush_tlb();
-		goto flush_all;
+		goto out;
 	}
 
 	/* In modern CPU, last level tlb used for both data/ins */
@@ -196,22 +197,20 @@ void flush_tlb_mm_range(struct mm_struct
 		count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
 		local_flush_tlb();
 	} else {
+		need_flush_others_all = 0;
 		/* flush range by one by one 'invlpg' */
 		for (addr = start; addr < end;	addr += PAGE_SIZE) {
 			count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
 			__flush_tlb_single(addr);
 		}
-
-		if (cpumask_any_but(mm_cpumask(mm),
-				smp_processor_id()) < nr_cpu_ids)
-			flush_tlb_others(mm_cpumask(mm), mm, start, end);
-		preempt_enable();
-		return;
 	}
-
-flush_all:
+out:
+	if (need_flush_others_all) {
+		start = 0UL;
+		end = TLB_FLUSH_ALL;
+	}
 	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
-		flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL);
+		flush_tlb_others(mm_cpumask(mm), mm, start, end);
 	preempt_enable();
 }
 
_


* [PATCH 2/7] x86: mm: rip out complicated, out-of-date, buggy TLB flushing
  2014-03-06  0:45 ` Dave Hansen
@ 2014-03-06  0:45   ` Dave Hansen
  -1 siblings, 0 replies; 35+ messages in thread
From: Dave Hansen @ 2014-03-06  0:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, ak, kirill.shutemov, mgorman, alex.shi, x86, linux-mm,
	Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

I think the flush_tlb_mm_range() code that tries to tune the
flush sizes based on the CPU needs to get ripped out for
several reasons:

1. It is obviously buggy.  It uses mm->total_vm to judge the
   task's footprint in the TLB.  It should certainly be using
   some measure of RSS, *NOT* ->total_vm since only resident
   memory can populate the TLB.
2. Haswell, and several other CPUs are missing from the
   intel_tlb_flushall_shift_set() function.
3. It is plain wrong in my vm:
	[    0.037444] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
	[    0.037444] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0
	[    0.037444] tlb_flushall_shift: 6
   Which leads it to never use invlpg.
4. The assumptions about TLB refill costs are wrong:
	http://lkml.kernel.org/r/1337782555-8088-3-git-send-email-alex.shi@intel.com
    (more on this in later patches)
5. I can not reproduce the original data: https://lkml.org/lkml/2012/5/17/59
   I believe the sample times were too short.  Running the
   benchmark in a loop yields times that vary quite a bit.

Note that this leaves us with a static ceiling of 1 page.  This
is a conservative, dumb setting, and will be revised in a later
patch.
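
To put reason 1 above in concrete terms: only resident pages can
have TLB entries, so any footprint estimate would have to start
from RSS.  A hypothetical illustration (not proposed code -- this
series deletes the estimate instead of fixing it):

	/* Hypothetical illustration of reason 1, not proposed code. */
	unsigned long resident = get_mm_rss(mm);	/* pages actually faulted in */
	unsigned long mapped   = mm->total_vm;		/* includes never-touched mappings */
	/*
	 * 'mapped' counts pages that have never been faulted in and so
	 * can not be in the TLB; bounding the estimate by 'resident'
	 * would at least not overestimate the footprint.
	 */
	unsigned long footprint = min(resident, mapped);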

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/include/asm/processor.h |    1 
 b/arch/x86/kernel/cpu/amd.c        |    7 ---
 b/arch/x86/kernel/cpu/common.c     |   13 -----
 b/arch/x86/kernel/cpu/intel.c      |   26 -----------
 b/arch/x86/mm/tlb.c                |   83 ++++---------------------------------
 5 files changed, 11 insertions(+), 119 deletions(-)

diff -puN arch/x86/include/asm/processor.h~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/include/asm/processor.h
--- a/arch/x86/include/asm/processor.h~x8x-mm-rip-out-complicated-tlb-flushing	2014-03-05 16:10:09.857059131 -0800
+++ b/arch/x86/include/asm/processor.h	2014-03-05 16:10:09.867059588 -0800
@@ -72,7 +72,6 @@ extern u16 __read_mostly tlb_lld_4k[NR_I
 extern u16 __read_mostly tlb_lld_2m[NR_INFO];
 extern u16 __read_mostly tlb_lld_4m[NR_INFO];
 extern u16 __read_mostly tlb_lld_1g[NR_INFO];
-extern s8  __read_mostly tlb_flushall_shift;
 
 /*
  *  CPU type and hardware bug flags. Kept separately for each CPU.
diff -puN arch/x86/kernel/cpu/amd.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/kernel/cpu/amd.c
--- a/arch/x86/kernel/cpu/amd.c~x8x-mm-rip-out-complicated-tlb-flushing	2014-03-05 16:10:09.859059223 -0800
+++ b/arch/x86/kernel/cpu/amd.c	2014-03-05 16:10:09.868059633 -0800
@@ -765,11 +765,6 @@ static unsigned int amd_size_cache(struc
 }
 #endif
 
-static void cpu_set_tlb_flushall_shift(struct cpuinfo_x86 *c)
-{
-	tlb_flushall_shift = 6;
-}
-
 static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
 {
 	u32 ebx, eax, ecx, edx;
@@ -817,8 +812,6 @@ static void cpu_detect_tlb_amd(struct cp
 		tlb_lli_2m[ENTRIES] = eax & mask;
 
 	tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1;
-
-	cpu_set_tlb_flushall_shift(c);
 }
 
 static const struct cpu_dev amd_cpu_dev = {
diff -puN arch/x86/kernel/cpu/common.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~x8x-mm-rip-out-complicated-tlb-flushing	2014-03-05 16:10:09.861059314 -0800
+++ b/arch/x86/kernel/cpu/common.c	2014-03-05 16:10:09.869059678 -0800
@@ -479,26 +479,17 @@ u16 __read_mostly tlb_lld_2m[NR_INFO];
 u16 __read_mostly tlb_lld_4m[NR_INFO];
 u16 __read_mostly tlb_lld_1g[NR_INFO];
 
-/*
- * tlb_flushall_shift shows the balance point in replacing cr3 write
- * with multiple 'invlpg'. It will do this replacement when
- *   flush_tlb_lines <= active_lines/2^tlb_flushall_shift.
- * If tlb_flushall_shift is -1, means the replacement will be disabled.
- */
-s8  __read_mostly tlb_flushall_shift = -1;
-
 void cpu_detect_tlb(struct cpuinfo_x86 *c)
 {
 	if (this_cpu->c_detect_tlb)
 		this_cpu->c_detect_tlb(c);
 
 	printk(KERN_INFO "Last level iTLB entries: 4KB %d, 2MB %d, 4MB %d\n"
-		"Last level dTLB entries: 4KB %d, 2MB %d, 4MB %d, 1GB %d\n"
-		"tlb_flushall_shift: %d\n",
+		"Last level dTLB entries: 4KB %d, 2MB %d, 4MB %d, 1GB %d\n",
 		tlb_lli_4k[ENTRIES], tlb_lli_2m[ENTRIES],
 		tlb_lli_4m[ENTRIES], tlb_lld_4k[ENTRIES],
 		tlb_lld_2m[ENTRIES], tlb_lld_4m[ENTRIES],
-		tlb_lld_1g[ENTRIES], tlb_flushall_shift);
+		tlb_lld_1g[ENTRIES]);
 }
 
 void detect_ht(struct cpuinfo_x86 *c)
diff -puN arch/x86/kernel/cpu/intel.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/kernel/cpu/intel.c
--- a/arch/x86/kernel/cpu/intel.c~x8x-mm-rip-out-complicated-tlb-flushing	2014-03-05 16:10:09.862059360 -0800
+++ b/arch/x86/kernel/cpu/intel.c	2014-03-05 16:10:09.869059678 -0800
@@ -631,31 +631,6 @@ static void intel_tlb_lookup(const unsig
 	}
 }
 
-static void intel_tlb_flushall_shift_set(struct cpuinfo_x86 *c)
-{
-	switch ((c->x86 << 8) + c->x86_model) {
-	case 0x60f: /* original 65 nm celeron/pentium/core2/xeon, "Merom"/"Conroe" */
-	case 0x616: /* single-core 65 nm celeron/core2solo "Merom-L"/"Conroe-L" */
-	case 0x617: /* current 45 nm celeron/core2/xeon "Penryn"/"Wolfdale" */
-	case 0x61d: /* six-core 45 nm xeon "Dunnington" */
-		tlb_flushall_shift = -1;
-		break;
-	case 0x63a: /* Ivybridge */
-		tlb_flushall_shift = 2;
-		break;
-	case 0x61a: /* 45 nm nehalem, "Bloomfield" */
-	case 0x61e: /* 45 nm nehalem, "Lynnfield" */
-	case 0x625: /* 32 nm nehalem, "Clarkdale" */
-	case 0x62c: /* 32 nm nehalem, "Gulftown" */
-	case 0x62e: /* 45 nm nehalem-ex, "Beckton" */
-	case 0x62f: /* 32 nm Xeon E7 */
-	case 0x62a: /* SandyBridge */
-	case 0x62d: /* SandyBridge, "Romely-EP" */
-	default:
-		tlb_flushall_shift = 6;
-	}
-}
-
 static void intel_detect_tlb(struct cpuinfo_x86 *c)
 {
 	int i, j, n;
@@ -680,7 +655,6 @@ static void intel_detect_tlb(struct cpui
 		for (j = 1 ; j < 16 ; j++)
 			intel_tlb_lookup(desc[j]);
 	}
-	intel_tlb_flushall_shift_set(c);
 }
 
 static const struct cpu_dev intel_cpu_dev = {
diff -puN arch/x86/mm/tlb.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~x8x-mm-rip-out-complicated-tlb-flushing	2014-03-05 16:10:09.864059451 -0800
+++ b/arch/x86/mm/tlb.c	2014-03-05 16:10:09.870059724 -0800
@@ -158,13 +158,14 @@ void flush_tlb_current_task(void)
 	preempt_enable();
 }
 
+/* in units of pages */
+unsigned long tlb_single_page_flush_ceiling = 1;
+
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned long vmflag)
 {
 	int need_flush_others_all = 1;
 	unsigned long addr;
-	unsigned act_entries, tlb_entries = 0;
-	unsigned long nr_base_pages;
 
 	preempt_disable();
 	if (current->active_mm != mm)
@@ -175,25 +176,12 @@ void flush_tlb_mm_range(struct mm_struct
 		goto out;
 	}
 
-	if (end == TLB_FLUSH_ALL || tlb_flushall_shift == -1
-					|| vmflag & VM_HUGETLB) {
+	if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) {
 		local_flush_tlb();
 		goto out;
 	}
 
-	/* In modern CPU, last level tlb used for both data/ins */
-	if (vmflag & VM_EXEC)
-		tlb_entries = tlb_lli_4k[ENTRIES];
-	else
-		tlb_entries = tlb_lld_4k[ENTRIES];
-
-	/* Assume all of TLB entries was occupied by this task */
-	act_entries = tlb_entries >> tlb_flushall_shift;
-	act_entries = mm->total_vm > act_entries ? act_entries : mm->total_vm;
-	nr_base_pages = (end - start) >> PAGE_SHIFT;
-
-	/* tlb_flushall_shift is on balance point, details in commit log */
-	if (nr_base_pages > act_entries) {
+	if ((end - start) > tlb_single_page_flush_ceiling * PAGE_SIZE) {
 		count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
 		local_flush_tlb();
 	} else {
@@ -259,68 +247,15 @@ static void do_kernel_range_flush(void *
 
 void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
-	unsigned act_entries;
-	struct flush_tlb_info info;
-
-	/* In modern CPU, last level tlb used for both data/ins */
-	act_entries = tlb_lld_4k[ENTRIES];
 
 	/* Balance as user space task's flush, a bit conservative */
-	if (end == TLB_FLUSH_ALL || tlb_flushall_shift == -1 ||
-		(end - start) >> PAGE_SHIFT > act_entries >> tlb_flushall_shift)
-
+	if (end == TLB_FLUSH_ALL ||
+	    (end - start) > tlb_single_page_flush_ceiling * PAGE_SIZE) {
 		on_each_cpu(do_flush_tlb_all, NULL, 1);
-	else {
+	} else {
+		struct flush_tlb_info info;
 		info.flush_start = start;
 		info.flush_end = end;
 		on_each_cpu(do_kernel_range_flush, &info, 1);
 	}
 }
-
-#ifdef CONFIG_DEBUG_TLBFLUSH
-static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf,
-			     size_t count, loff_t *ppos)
-{
-	char buf[32];
-	unsigned int len;
-
-	len = sprintf(buf, "%hd\n", tlb_flushall_shift);
-	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
-}
-
-static ssize_t tlbflush_write_file(struct file *file,
-		 const char __user *user_buf, size_t count, loff_t *ppos)
-{
-	char buf[32];
-	ssize_t len;
-	s8 shift;
-
-	len = min(count, sizeof(buf) - 1);
-	if (copy_from_user(buf, user_buf, len))
-		return -EFAULT;
-
-	buf[len] = '\0';
-	if (kstrtos8(buf, 0, &shift))
-		return -EINVAL;
-
-	if (shift < -1 || shift >= BITS_PER_LONG)
-		return -EINVAL;
-
-	tlb_flushall_shift = shift;
-	return count;
-}
-
-static const struct file_operations fops_tlbflush = {
-	.read = tlbflush_read_file,
-	.write = tlbflush_write_file,
-	.llseek = default_llseek,
-};
-
-static int __init create_tlb_flushall_shift(void)
-{
-	debugfs_create_file("tlb_flushall_shift", S_IRUSR | S_IWUSR,
-			    arch_debugfs_dir, NULL, &fops_tlbflush);
-	return 0;
-}
-late_initcall(create_tlb_flushall_shift);
-#endif
_


* [PATCH 3/7] x86: mm: fix missed global TLB flush stat
  2014-03-06  0:45 ` Dave Hansen
@ 2014-03-06  0:45   ` Dave Hansen
  -1 siblings, 0 replies; 35+ messages in thread
From: Dave Hansen @ 2014-03-06  0:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, ak, kirill.shutemov, mgorman, alex.shi, x86, linux-mm,
	Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

If we take the

	if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) {
		local_flush_tlb();
		goto out;
	}

path out of flush_tlb_mm_range(), we will have flushed the tlb,
but not incremented NR_TLB_LOCAL_FLUSH_ALL.  This unifies the
way out of the function so that we always take a single path when
doing a full tlb flush.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/mm/tlb.c |   15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff -puN arch/x86/mm/tlb.c~fix-missed-global-flush-stat arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~fix-missed-global-flush-stat	2014-03-05 16:10:10.171073453 -0800
+++ b/arch/x86/mm/tlb.c	2014-03-05 16:10:10.174073590 -0800
@@ -164,8 +164,9 @@ unsigned long tlb_single_page_flush_ceil
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned long vmflag)
 {
-	int need_flush_others_all = 1;
 	unsigned long addr;
+	/* do a global flush by default */
+	unsigned long base_pages_to_flush = TLB_FLUSH_ALL;
 
 	preempt_disable();
 	if (current->active_mm != mm)
@@ -176,16 +177,14 @@ void flush_tlb_mm_range(struct mm_struct
 		goto out;
 	}
 
-	if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) {
-		local_flush_tlb();
-		goto out;
-	}
+	if ((end != TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB))
+		base_pages_to_flush = (end - start) >> PAGE_SHIFT;
 
-	if ((end - start) > tlb_single_page_flush_ceiling * PAGE_SIZE) {
+	if (base_pages_to_flush > tlb_single_page_flush_ceiling) {
+		base_pages_to_flush = TLB_FLUSH_ALL;
 		count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
 		local_flush_tlb();
 	} else {
-		need_flush_others_all = 0;
 		/* flush range by one by one 'invlpg' */
 		for (addr = start; addr < end;	addr += PAGE_SIZE) {
 			count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
@@ -193,7 +192,7 @@ void flush_tlb_mm_range(struct mm_struct
 		}
 	}
 out:
-	if (need_flush_others_all) {
+	if (base_pages_to_flush == TLB_FLUSH_ALL) {
 		start = 0UL;
 		end = TLB_FLUSH_ALL;
 	}
_


* [PATCH 4/7] x86: mm: trace tlb flushes
  2014-03-06  0:45 ` Dave Hansen
@ 2014-03-06  0:45   ` Dave Hansen
  -1 siblings, 0 replies; 35+ messages in thread
From: Dave Hansen @ 2014-03-06  0:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, ak, kirill.shutemov, mgorman, alex.shi, x86, linux-mm,
	Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

We don't have any good way to figure out what kinds of flushes
are being attempted.  Right now, we can try to use the vm
counters, but those only tell us what we actually did with the
hardware (one-by-one vs full) and don't tell us what was actually
_requested_.

This allows us to select out "interesting" TLB flushes that we
might want to optimize (like the ranged ones) and ignore the ones
that we have very little control over (the ones at context
switch).

Also, since we have a pair of tracepoint calls in
flush_tlb_mm_range(), we can time the deltas between them to make
sure that we got the "invlpg vs. global flush" balance correct in
practice.
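
Most users will just enable the event from userspace (ftrace or
perf), but for completeness, a hypothetical in-kernel consumer
could also attach to the register_trace_tlb_flush() hook that
TRACE_EVENT() generates -- sketch only, not part of this series:

	/*
	 * Hypothetical in-kernel consumer, not part of this series.
	 * Assumes <trace/events/tlb.h> is included.
	 */
	static void probe_tlb_flush(void *data, int reason, unsigned long pages)
	{
		/* e.g. accumulate a per-reason histogram of 'pages' here */
	}

	static int __init tlb_flush_probe_init(void)
	{
		return register_trace_tlb_flush(probe_tlb_flush, NULL);
	}
	late_initcall(tlb_flush_probe_init);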

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/include/asm/mmu_context.h |    6 +++++
 b/arch/x86/mm/tlb.c                  |   12 +++++++++--
 b/include/linux/mm_types.h           |   10 +++++++++
 b/include/trace/events/tlb.h         |   37 +++++++++++++++++++++++++++++++++++
 b/mm/Makefile                        |    2 -
 b/mm/trace_tlb.c                     |   12 +++++++++++
 6 files changed, 76 insertions(+), 3 deletions(-)

diff -puN arch/x86/include/asm/mmu_context.h~tlb-trace-flushes arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~tlb-trace-flushes	2014-03-05 16:10:10.423084949 -0800
+++ b/arch/x86/include/asm/mmu_context.h	2014-03-05 16:10:10.431085313 -0800
@@ -3,6 +3,10 @@
 
 #include <asm/desc.h>
 #include <linux/atomic.h>
+#include <linux/mm_types.h>
+
+#include <trace/events/tlb.h>
+
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 #include <asm/paravirt.h>
@@ -44,6 +48,7 @@ static inline void switch_mm(struct mm_s
 
 		/* Re-load page tables */
 		load_cr3(next->pgd);
+		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
 
 		/* Stop flush ipis for the previous mm */
 		cpumask_clear_cpu(cpu, mm_cpumask(prev));
@@ -71,6 +76,7 @@ static inline void switch_mm(struct mm_s
 			 * to make sure to use no freed page tables.
 			 */
 			load_cr3(next->pgd);
+			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
 			load_LDT_nolock(&next->context);
 		}
 	}
diff -puN arch/x86/mm/tlb.c~tlb-trace-flushes arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~tlb-trace-flushes	2014-03-05 16:10:10.425085039 -0800
+++ b/arch/x86/mm/tlb.c	2014-03-05 16:10:10.432085359 -0800
@@ -14,6 +14,8 @@
 #include <asm/uv/uv.h>
 #include <linux/debugfs.h>
 
+#include <trace/events/tlb.h>
+
 DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate)
 			= { &init_mm, 0, };
 
@@ -49,6 +51,7 @@ void leave_mm(int cpu)
 	if (cpumask_test_cpu(cpu, mm_cpumask(active_mm))) {
 		cpumask_clear_cpu(cpu, mm_cpumask(active_mm));
 		load_cr3(swapper_pg_dir);
+		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
 	}
 }
 EXPORT_SYMBOL_GPL(leave_mm);
@@ -105,9 +108,10 @@ static void flush_tlb_func(void *info)
 
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
 	if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) {
-		if (f->flush_end == TLB_FLUSH_ALL)
+		if (f->flush_end == TLB_FLUSH_ALL) {
 			local_flush_tlb();
-		else if (!f->flush_end)
+			trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, TLB_FLUSH_ALL);
+		} else if (!f->flush_end)
 			__flush_tlb_single(f->flush_start);
 		else {
 			unsigned long addr;
@@ -152,7 +156,9 @@ void flush_tlb_current_task(void)
 	preempt_disable();
 
 	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
+	trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
 	local_flush_tlb();
+	trace_tlb_flush(TLB_LOCAL_SHOOTDOWN_DONE, TLB_FLUSH_ALL);
 	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
 		flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL);
 	preempt_enable();
@@ -180,6 +186,7 @@ void flush_tlb_mm_range(struct mm_struct
 	if ((end != TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB))
 		base_pages_to_flush = (end - start) >> PAGE_SHIFT;
 
+	trace_tlb_flush(TLB_LOCAL_MM_SHOOTDOWN, base_pages_to_flush);
 	if (base_pages_to_flush > tlb_single_page_flush_ceiling) {
 		base_pages_to_flush = TLB_FLUSH_ALL;
 		count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
@@ -191,6 +198,7 @@ void flush_tlb_mm_range(struct mm_struct
 			__flush_tlb_single(addr);
 		}
 	}
+	trace_tlb_flush(TLB_LOCAL_MM_SHOOTDOWN_DONE, base_pages_to_flush);
 out:
 	if (base_pages_to_flush == TLB_FLUSH_ALL) {
 		start = 0UL;
diff -puN include/linux/mm_types.h~tlb-trace-flushes include/linux/mm_types.h
--- a/include/linux/mm_types.h~tlb-trace-flushes	2014-03-05 16:10:10.426085085 -0800
+++ b/include/linux/mm_types.h	2014-03-05 16:10:10.432085359 -0800
@@ -509,4 +509,14 @@ static inline void clear_tlb_flush_pendi
 }
 #endif
 
+enum tlb_flush_reason {
+	TLB_FLUSH_ON_TASK_SWITCH,
+	TLB_REMOTE_SHOOTDOWN,
+	TLB_LOCAL_SHOOTDOWN,
+	TLB_LOCAL_SHOOTDOWN_DONE,
+	TLB_LOCAL_MM_SHOOTDOWN,
+	TLB_LOCAL_MM_SHOOTDOWN_DONE,
+	NR_TLB_FLUSH_REASONS,
+};
+
 #endif /* _LINUX_MM_TYPES_H */
diff -puN /dev/null include/trace/events/tlb.h
--- /dev/null	2014-01-15 16:08:30.019511980 -0800
+++ b/include/trace/events/tlb.h	2014-03-05 16:10:10.433085404 -0800
@@ -0,0 +1,37 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM tlb
+
+#if !defined(_TRACE_TLB_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_TLB_H
+
+#include <linux/mm_types.h>
+#include <linux/tracepoint.h>
+
+extern const char * const tlb_flush_reason_desc[];
+
+TRACE_EVENT(tlb_flush,
+
+	TP_PROTO(int reason, unsigned long pages),
+	TP_ARGS(reason, pages),
+
+	TP_STRUCT__entry(
+		__field(	  int, reason)
+		__field(unsigned long,  pages)
+	),
+
+	TP_fast_assign(
+		__entry->reason = reason;
+		__entry->pages  = pages;
+	),
+
+	TP_printk("pages: %ld reason: %d (%s)",
+		__entry->pages,
+		__entry->reason,
+		tlb_flush_reason_desc[__entry->reason])
+);
+
+#endif /* _TRACE_TLB_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
+
diff -puN mm/Makefile~tlb-trace-flushes mm/Makefile
--- a/mm/Makefile~tlb-trace-flushes	2014-03-05 16:10:10.428085177 -0800
+++ b/mm/Makefile	2014-03-05 16:10:10.433085404 -0800
@@ -5,7 +5,7 @@
 mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
 			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
-			   vmalloc.o pagewalk.o pgtable-generic.o
+			   vmalloc.o pagewalk.o pgtable-generic.o trace_tlb.o
 
 ifdef CONFIG_CROSS_MEMORY_ATTACH
 mmu-$(CONFIG_MMU)	+= process_vm_access.o
diff -puN /dev/null mm/trace_tlb.c
--- /dev/null	2014-01-15 16:08:30.019511980 -0800
+++ b/mm/trace_tlb.c	2014-03-05 16:10:10.433085404 -0800
@@ -0,0 +1,12 @@
+#define CREATE_TRACE_POINTS
+#include <trace/events/tlb.h>
+
+const char * const tlb_flush_reason_desc[] = {
+	__stringify(TLB_FLUSH_ON_TASK_SWITCH),
+	__stringify(TLB_REMOTE_SHOOTDOWN),
+	__stringify(TLB_LOCAL_SHOOTDOWN),
+	__stringify(TLB_LOCAL_SHOOTDOWN_DONE),
+	__stringify(TLB_LOCAL_MM_SHOOTDOWN),
+	__stringify(TLB_LOCAL_MM_SHOOTDOWN_DONE),
+};
+
_


* [PATCH 5/7] x86: mm: new tunable for single vs full TLB flush
  2014-03-06  0:45 ` Dave Hansen
@ 2014-03-06  0:45   ` Dave Hansen
  -1 siblings, 0 replies; 35+ messages in thread
From: Dave Hansen @ 2014-03-06  0:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, ak, kirill.shutemov, mgorman, alex.shi, x86, linux-mm,
	Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Most of the logic here is in the documentation file.  Please take
a look at it.

I know we've come full-circle here back to a tunable, but this
new one is *WAY* simpler.  I challenge anyone to describe in one
sentence how the old one worked.  Here's the way the new one
works:

	If we are flushing more pages than the ceiling, we use
	the full flush, otherwise we use invlpg.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/Documentation/x86/tlb.txt |   64 ++++++++++++++++++++++++++++++++++++++++++++
 b/arch/x86/mm/tlb.c         |   46 +++++++++++++++++++++++++++++++
 2 files changed, 110 insertions(+)

diff -puN arch/x86/mm/tlb.c~new-tunable-for-single-vs-full-tlb-flush arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~new-tunable-for-single-vs-full-tlb-flush	2014-03-05 16:10:10.743099544 -0800
+++ b/arch/x86/mm/tlb.c	2014-03-05 16:10:10.747099726 -0800
@@ -266,3 +266,49 @@ void flush_tlb_kernel_range(unsigned lon
 		on_each_cpu(do_kernel_range_flush, &info, 1);
 	}
 }
+
+static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf,
+			     size_t count, loff_t *ppos)
+{
+	char buf[32];
+	unsigned int len;
+
+	len = sprintf(buf, "%ld\n", tlb_single_page_flush_ceiling);
+	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+
+static ssize_t tlbflush_write_file(struct file *file,
+		 const char __user *user_buf, size_t count, loff_t *ppos)
+{
+	char buf[32];
+	ssize_t len;
+	int ceiling;
+
+	len = min(count, sizeof(buf) - 1);
+	if (copy_from_user(buf, user_buf, len))
+		return -EFAULT;
+
+	buf[len] = '\0';
+	if (kstrtoint(buf, 0, &ceiling))
+		return -EINVAL;
+
+	if (ceiling < 0)
+		return -EINVAL;
+
+	tlb_single_page_flush_ceiling = ceiling;
+	return count;
+}
+
+static const struct file_operations fops_tlbflush = {
+	.read = tlbflush_read_file,
+	.write = tlbflush_write_file,
+	.llseek = default_llseek,
+};
+
+static int __init create_tlb_single_page_flush_ceiling(void)
+{
+	debugfs_create_file("tlb_single_page_flush_ceiling", S_IRUSR | S_IWUSR,
+			    arch_debugfs_dir, NULL, &fops_tlbflush);
+	return 0;
+}
+late_initcall(create_tlb_single_page_flush_ceiling);
diff -puN /dev/null Documentation/x86/tlb.txt
--- /dev/null	2014-01-15 16:08:30.019511980 -0800
+++ b/Documentation/x86/tlb.txt	2014-03-05 16:10:10.747099726 -0800
@@ -0,0 +1,64 @@
+When the kernel unmaps or modifies the attributes of a range of
+memory, it has two choices:
+ 1. Flush the entire TLB with a two-instruction sequence.  This is
+    a quick operation, but it causes collateral damage: TLB entries
+    from areas other than the one we are trying to flush will be
+    destroyed and must be refilled later, at some cost.
+ 2. Use the invlpg instruction to invalidate a single page at a
+    time.  This could potentially cost many more instructions, but
+    it is a much more precise operation, causing no collateral
+    damage to other TLB entries.
+
+Which method to use depends on a few things:
+ 1. The size of the flush being performed.  A flush of the entire
+    address space is obviously better performed by flushing the
+    entire TLB than doing 2^48/PAGE_SIZE invlpg calls.
+ 2. The contents of the TLB.  If the TLB is empty, then there will
+    be no collateral damage caused by doing the global flush, and
+    all of the invlpg calls will have ended up being wasted work.
+    Whether or not the range being flushed was in the TLB matters
+    as well.
+ 3. The size of the TLB.  The larger the TLB, the more collateral
+    damage we do with a full flush.  So, the larger the TLB, the
+    more attractive invlpg looks.
+ 4. The microarchitecture.  The TLB has become a multi-level
+    cache on modern CPUs, and the global flushes have become more
+    expensive relative to single-page flushes.
+
+There is obviously no way the kernel can know all these things,
+especially the contents of the TLB during a given flush.  The
+sizes of the flushes will vary greatly depending on the workload as
+well.  There is essentially no "right" point to choose.
+
+If you believe that invlpg is being called too often, you can
+lower the tunable:
+
+	/sys/kernel/debug/x86/tlb_single_page_flush_ceiling
+
+This will cause us to do the global flush for more cases.
+Lowering it to 0 will disable the use of invlpg.
+
+You might see invlpg inside of flush_tlb_mm_range() show up in
+profiles, or you can use the trace_tlb_flush() tracepoints to
+determine how long the flush operations are taking.
+
+Essentially, you are balancing the cycles you spend doing invlpg
+with the cycles that you spend refilling the TLB later.
+
+You can measure how expensive TLB refills are by using
+performance counters and 'perf stat', like this:
+
+perf stat -e
+	cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/,
+	cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/,
+	cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/,
+	cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/,
+	cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/,
+	cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/
+
+That works on an IvyBridge-era CPU (i5-3320M).  Different CPUs
+may have differently-named counters, but they should at least
+be there in some form.  You can use pmu-tools 'ocperf list'
+(https://github.com/andikleen/pmu-tools) to find the right
+counters for a given CPU.
+
_

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 5/7] x86: mm: new tunable for single vs full TLB flush
@ 2014-03-06  0:45   ` Dave Hansen
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Hansen @ 2014-03-06  0:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, ak, kirill.shutemov, mgorman, alex.shi, x86, linux-mm,
	Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Most of the logic here is in the documentation file.  Please take
a look at it.

I know we've come full-circle here back to a tunable, but this
new one is *WAY* simpler.  I challenge anyone to describe in one
sentence how the old one worked.  Here's the way the new one
works:

	If we are flushing more pages than the ceiling, we use
	the full flush, otherwise we use invlpg.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/Documentation/x86/tlb.txt |   64 ++++++++++++++++++++++++++++++++++++++++++++
 b/arch/x86/mm/tlb.c         |   46 +++++++++++++++++++++++++++++++
 2 files changed, 110 insertions(+)

diff -puN arch/x86/mm/tlb.c~new-tunable-for-single-vs-full-tlb-flush arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~new-tunable-for-single-vs-full-tlb-flush	2014-03-05 16:10:10.743099544 -0800
+++ b/arch/x86/mm/tlb.c	2014-03-05 16:10:10.747099726 -0800
@@ -266,3 +266,49 @@ void flush_tlb_kernel_range(unsigned lon
 		on_each_cpu(do_kernel_range_flush, &info, 1);
 	}
 }
+
+static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf,
+			     size_t count, loff_t *ppos)
+{
+	char buf[32];
+	unsigned int len;
+
+	len = sprintf(buf, "%ld\n", tlb_single_page_flush_ceiling);
+	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+
+static ssize_t tlbflush_write_file(struct file *file,
+		 const char __user *user_buf, size_t count, loff_t *ppos)
+{
+	char buf[32];
+	ssize_t len;
+	int ceiling;
+
+	len = min(count, sizeof(buf) - 1);
+	if (copy_from_user(buf, user_buf, len))
+		return -EFAULT;
+
+	buf[len] = '\0';
+	if (kstrtoint(buf, 0, &ceiling))
+		return -EINVAL;
+
+	if (ceiling < 0)
+		return -EINVAL;
+
+	tlb_single_page_flush_ceiling = ceiling;
+	return count;
+}
+
+static const struct file_operations fops_tlbflush = {
+	.read = tlbflush_read_file,
+	.write = tlbflush_write_file,
+	.llseek = default_llseek,
+};
+
+static int __init create_tlb_single_page_flush_ceiling(void)
+{
+	debugfs_create_file("tlb_single_page_flush_ceiling", S_IRUSR | S_IWUSR,
+			    arch_debugfs_dir, NULL, &fops_tlbflush);
+	return 0;
+}
+late_initcall(create_tlb_single_page_flush_ceiling);
diff -puN /dev/null Documentation/x86/tlb.txt
--- /dev/null	2014-01-15 16:08:30.019511980 -0800
+++ b/Documentation/x86/tlb.txt	2014-03-05 16:10:10.747099726 -0800
@@ -0,0 +1,64 @@
+When the kernel unmaps or modifies the attributes of a range of
+memory, it has two choices:
+ 1. Flush the entire TLB with a two-instruction sequence.  This is
+    a quick operation, but it causes collateral damage: TLB entries
+    from areas other than the one we are trying to flush will be
+    destroyed and must be refilled later, at some cost.
+ 2. Use the invlpg instruction to invalidate a single page at a
+    time.  This could potentially cost many more instructions, but
+    it is a much more precise operation, causing no collateral
+    damage to other TLB entries.
+
+Which method to use depends on a few things:
+ 1. The size of the flush being performed.  A flush of the entire
+    address space is obviously better performed by flushing the
+    entire TLB than doing 2^48/PAGE_SIZE invlpg calls.
+ 2. The contents of the TLB.  If the TLB is empty, then there will
+    be no collateral damage caused by doing the global flush, and
+    all of the invlpg calls will have ended up being wasted work.
+    Whether or not the range being flushed was in the TLB matters
+    as well.
+ 3. The size of the TLB.  The larger the TLB, the more collateral
+    damage we do with a full flush.  So, the larger the TLB, the
+    more attractive invlpg looks.
+ 4. The microarchitecture.  The TLB has become a multi-level
+    cache on modern CPUs, and the global flushes have become more
+    expensive relative to single-page flushes.
+
+There is obviously no way the kernel can know all these things,
+especially the contents of the TLB during a given flush.  The
+sizes of the flush will vary greatly depending on the workload as
+well.  There is essentially no "right" point to choose.
+
+If you believe that invlpg is being called too often, you can
+lower the tunable:
+
+	/sys/kernel/debug/x86/tlb_single_page_flush_ceiling
+
+This will cause us to do the global flush for more cases.
+Lowering it to 0 will disable the use of invlpg.
+
+You might see invlpg inside of flush_tlb_mm_range() show up in
+profiles, or you can use the trace_tlb_flush() tracepoints to
+determine how long the flush operations are taking.
+
+Essentially, you are balancing the cycles you spend doing invlpg
+with the cycles that you spend refilling the TLB later.
+
+You can measure how expensive TLB refills are by using
+performance counters and 'perf stat', like this:
+
+perf stat -e
+	cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/,
+	cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/,
+	cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/,
+	cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/,
+	cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/,
+	cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/
+
+That works on an IvyBridge-era CPU (i5-3320M).  Different CPUs
+may have differently-named counters, but they should at least
+be there in some form.  You can use pmu-tools 'ocperf list'
+(https://github.com/andikleen/pmu-tools) to find the right
+counters for a given CPU.
+
_
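
For reference, poking at the tunable from a shell looks roughly like
this (a sketch; it assumes debugfs is mounted at /sys/kernel/debug,
which is where the file created above lands, and needs root):

	# read the current ceiling, in pages
	cat /sys/kernel/debug/x86/tlb_single_page_flush_ceiling
	# 0 means always do the full flush, i.e. never use invlpg
	echo 0 > /sys/kernel/debug/x86/tlb_single_page_flush_ceiling
	# raise it to keep using invlpg for larger ranges
	echo 64 > /sys/kernel/debug/x86/tlb_single_page_flush_ceiling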

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 6/7] x86: mm: set TLB flush tunable to sane value
  2014-03-06  0:45 ` Dave Hansen
@ 2014-03-06  0:45   ` Dave Hansen
  -1 siblings, 0 replies; 35+ messages in thread
From: Dave Hansen @ 2014-03-06  0:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, ak, kirill.shutemov, mgorman, alex.shi, x86, linux-mm,
	Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Now that we have some shiny new tracepoints, we can actually
figure out what the heck is going on.

During a kernel compile, 60% of the flush_tlb_mm_range() calls
are for a single page.  It breaks down like this:

 size   percent  percent<=
  V        V        V
GLOBAL:   2.20%   2.20% avg cycles:  2283
     1:  56.92%  59.12% avg cycles:  1276
     2:  13.78%  72.90% avg cycles:  1505
     3:   8.26%  81.16% avg cycles:  1880
     4:   7.41%  88.58% avg cycles:  2447
     5:   1.73%  90.31% avg cycles:  2358
     6:   1.32%  91.63% avg cycles:  2563
     7:   1.14%  92.77% avg cycles:  2862
     8:   0.62%  93.39% avg cycles:  3542
     9:   0.08%  93.47% avg cycles:  3289
    10:   0.43%  93.90% avg cycles:  3570
    11:   0.20%  94.10% avg cycles:  3767
    12:   0.08%  94.18% avg cycles:  3996
    13:   0.03%  94.20% avg cycles:  4077
    14:   0.02%  94.23% avg cycles:  4836
    15:   0.04%  94.26% avg cycles:  5699
    16:   0.06%  94.32% avg cycles:  5041
    17:   0.57%  94.89% avg cycles:  5473
    18:   0.02%  94.91% avg cycles:  5396
    19:   0.03%  94.95% avg cycles:  5296
    20:   0.02%  94.96% avg cycles:  6749
    21:   0.18%  95.14% avg cycles:  6225
    22:   0.01%  95.15% avg cycles:  6393
    23:   0.01%  95.16% avg cycles:  6861
    24:   0.12%  95.28% avg cycles:  6912
    25:   0.05%  95.32% avg cycles:  7190
    26:   0.01%  95.33% avg cycles:  7793
    27:   0.01%  95.34% avg cycles:  7833
    28:   0.01%  95.35% avg cycles:  8253
    29:   0.08%  95.42% avg cycles:  8024
    30:   0.03%  95.45% avg cycles:  9670
    31:   0.01%  95.46% avg cycles:  8949
    32:   0.01%  95.46% avg cycles:  9350
    33:   3.11%  98.57% avg cycles:  8534
    34:   0.02%  98.60% avg cycles: 10977
    35:   0.02%  98.62% avg cycles: 11400

We get into diminishing returns pretty quickly.  On pre-IvyBridge
CPUs, we used to set the limit at 8 pages, and it was set at 128
on IvyBridge.  That 128 number looks pretty silly considering that
less than 0.5% of the flushes are that large.

The previous code tried to size this number based on the size of
the TLB.  Good idea, but it's error-prone, needs maintenance
(which it didn't get up to now), and probably would not matter
much in practice.

Setting it to 33 means that we cover the mallopt
M_TRIM_THRESHOLD, which is the most universally common flush
size.

That's the short version.  Here's the long one for why I chose 33:

1. These numbers have a constant bias in the timestamps from the
   tracing.  It probably accounts for a couple hundred cycles in each of
   these tests, but it should be fairly _even_ across all of them.
   The smallest delta between the tracepoints I have ever seen is
   335 cycles.  This is one reason the cycles/page cost goes down in
   general as the flushes get larger.  The true cost is nearer to
   100 cycles.
2. A full flush is more expensive than a single invlpg, but not
   by much (single percentages).
3. A dtlb miss is 17.1ns (~45 cycles) and an itlb miss is 13.0ns
   (~34 cycles).  At those rates, refilling the 512-entry dTLB takes
   22,000 cycles.
4. 22,000 cycles is approximately the equivalent of doing 85
   invlpg operations.  But, the odds are that the TLB can
   actually be filled up faster than that because TLB misses that
   are close in time also tend to leverage the same caches (the
   arithmetic is sketched just after this list).
5. ~98% of flushes are <=33 pages.  There are a lot of flushes of
   33 pages, probably because libc's M_TRIM_THRESHOLD is set to
   128k (32 pages).
6. I've found no consistent data to support changing the IvyBridge
   vs. SandyBridge tunable by a factor of 16.
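
A quick back-of-the-envelope check of points 3-5 above (plain shell
arithmetic; the 45-cycle, 512-entry, and 128k figures are the ones
already quoted in this message, not new measurements):

	echo $(( 512 * 45 ))             # ~23000 cycles to refill the dTLB
	echo $(( 22000 / 85 ))           # ~258 cycles per invlpg, in line with
	                                 # the cycles/page column in the table below
	echo $(( 128 * 1024 / 4096 ))    # M_TRIM_THRESHOLD of 128k is 32 pages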

I used the performance counters on this hardware (IvyBridge i5-3320M)
to figure out the tlb miss costs:

ocperf.py stat -e dtlb_load_misses.walk_duration,dtlb_load_misses.walk_completed,dtlb_store_misses.walk_duration,dtlb_store_misses.walk_completed,itlb_misses.walk_duration,itlb_misses.walk_completed,itlb.itlb_flush

     7,720,030,970      dtlb_load_misses_walk_duration                                    [57.13%]
       169,856,353      dtlb_load_misses_walk_completed                                    [57.15%]
       708,832,859      dtlb_store_misses_walk_duration                                    [57.17%]
        19,346,823      dtlb_store_misses_walk_completed                                    [57.17%]
     2,779,687,402      itlb_misses_walk_duration                                    [57.15%]
        82,241,148      itlb_misses_walk_completed                                    [57.13%]
           770,717      itlb_itlb_flush                                              [57.11%]

These show that a dtlb miss is 17.1ns (~45 cycles) and an itlb miss
is 13.0ns (~34 cycles).  At those rates, refilling the 512-entry dTLB takes
22,000 cycles.  On a SandyBridge system with more cores and larger
caches, those are dtlb=13.4ns and itlb=9.5ns.

cat perf.stat.txt | perl -pe 's/,//g' \
	| awk '/itlb_misses_walk_duration/ { icyc+=$1 }
		/itlb_misses_walk_completed/ { imiss+=$1 }
		/dtlb_.*_walk_duration/ { dcyc+=$1 }
		/dtlb_.*completed/ { dmiss+=$1 }
		END {print "itlb cyc/miss: ", icyc/imiss, " dtlb cyc/miss: ", dcyc/dmiss, "   -----    ", icyc,imiss, dcyc,dmiss }'

On Westmere CPUs, the counters to use are: itlb_flush,itlb_misses.walk_cycles,itlb_misses.any,dtlb_misses.walk_cycles,dtlb_misses.any

The assumptions under which this code originally went in
(https://lkml.org/lkml/2012/6/12/119) say that a flush and a refill
are about 100ns.  Being generous, that is over by a factor of 6 on
the refill side, although it is fairly close on the cost of an
invlpg.  Each additional invlpg operation seems to lengthen the
flush range operation by about 200 cycles.  Here is one example of the data
collected for flushing 10 and 11 pages (full data are below):

    10:   0.43%  93.90% avg cycles:  3570 cycles/page:  357 samples: 4714
    11:   0.20%  94.10% avg cycles:  3767 cycles/page:  342 samples: 2145

How to generate this table:

	echo 10000 > /sys/kernel/debug/tracing/buffer_size_kb
	echo x86-tsc > /sys/kernel/debug/tracing/trace_clock
	echo 'reason != 0' > /sys/kernel/debug/tracing/events/tlb/tlb_flush/filter 
	echo 1 > /sys/kernel/debug/tracing/events/tlb/tlb_flush/enable

Pipe the trace output into this script:

	http://sr71.net/~dave/intel/201402-tlb/trace-time-diff-process.pl.txt
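
One way to wire that up, assuming the script reads the trace on stdin
(the local script name and output file below are placeholders):

	cat /sys/kernel/debug/tracing/trace_pipe | \
		perl trace-time-diff-process.pl > flush-size-table.txt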

Note that these data were gathered with the invlpg threshold set to
150 pages.  Only data points with >=50 samples were printed:

Flush    % of     %<=
in       flush    this
pages      es     size
------------------------------------------------------------------------------
    -1:   2.20%   2.20% avg cycles:  2283 cycles/page: xxxx samples: 23960
     1:  56.92%  59.12% avg cycles:  1276 cycles/page: 1276 samples: 620895
     2:  13.78%  72.90% avg cycles:  1505 cycles/page:  752 samples: 150335
     3:   8.26%  81.16% avg cycles:  1880 cycles/page:  626 samples: 90131
     4:   7.41%  88.58% avg cycles:  2447 cycles/page:  611 samples: 80877
     5:   1.73%  90.31% avg cycles:  2358 cycles/page:  471 samples: 18885
     6:   1.32%  91.63% avg cycles:  2563 cycles/page:  427 samples: 14397
     7:   1.14%  92.77% avg cycles:  2862 cycles/page:  408 samples: 12441
     8:   0.62%  93.39% avg cycles:  3542 cycles/page:  442 samples: 6721
     9:   0.08%  93.47% avg cycles:  3289 cycles/page:  365 samples: 917
    10:   0.43%  93.90% avg cycles:  3570 cycles/page:  357 samples: 4714
    11:   0.20%  94.10% avg cycles:  3767 cycles/page:  342 samples: 2145
    12:   0.08%  94.18% avg cycles:  3996 cycles/page:  333 samples: 864
    13:   0.03%  94.20% avg cycles:  4077 cycles/page:  313 samples: 289
    14:   0.02%  94.23% avg cycles:  4836 cycles/page:  345 samples: 236
    15:   0.04%  94.26% avg cycles:  5699 cycles/page:  379 samples: 390
    16:   0.06%  94.32% avg cycles:  5041 cycles/page:  315 samples: 643
    17:   0.57%  94.89% avg cycles:  5473 cycles/page:  321 samples: 6229
    18:   0.02%  94.91% avg cycles:  5396 cycles/page:  299 samples: 224
    19:   0.03%  94.95% avg cycles:  5296 cycles/page:  278 samples: 367
    20:   0.02%  94.96% avg cycles:  6749 cycles/page:  337 samples: 185
    21:   0.18%  95.14% avg cycles:  6225 cycles/page:  296 samples: 1964
    22:   0.01%  95.15% avg cycles:  6393 cycles/page:  290 samples: 83
    23:   0.01%  95.16% avg cycles:  6861 cycles/page:  298 samples: 61
    24:   0.12%  95.28% avg cycles:  6912 cycles/page:  288 samples: 1307
    25:   0.05%  95.32% avg cycles:  7190 cycles/page:  287 samples: 533
    26:   0.01%  95.33% avg cycles:  7793 cycles/page:  299 samples: 94
    27:   0.01%  95.34% avg cycles:  7833 cycles/page:  290 samples: 66
    28:   0.01%  95.35% avg cycles:  8253 cycles/page:  294 samples: 73
    29:   0.08%  95.42% avg cycles:  8024 cycles/page:  276 samples: 846
    30:   0.03%  95.45% avg cycles:  9670 cycles/page:  322 samples: 296
    31:   0.01%  95.46% avg cycles:  8949 cycles/page:  288 samples: 79
    32:   0.01%  95.46% avg cycles:  9350 cycles/page:  292 samples: 60
    33:   3.11%  98.57% avg cycles:  8534 cycles/page:  258 samples: 33936
    34:   0.02%  98.60% avg cycles: 10977 cycles/page:  322 samples: 268
    35:   0.02%  98.62% avg cycles: 11400 cycles/page:  325 samples: 177
    36:   0.01%  98.63% avg cycles: 11504 cycles/page:  319 samples: 161
    37:   0.02%  98.65% avg cycles: 11596 cycles/page:  313 samples: 182
    38:   0.02%  98.66% avg cycles: 11850 cycles/page:  311 samples: 195
    39:   0.01%  98.68% avg cycles: 12158 cycles/page:  311 samples: 128
    40:   0.01%  98.68% avg cycles: 11626 cycles/page:  290 samples: 78
    41:   0.04%  98.73% avg cycles: 11435 cycles/page:  278 samples: 477
    42:   0.01%  98.73% avg cycles: 12571 cycles/page:  299 samples: 74
    43:   0.01%  98.74% avg cycles: 12562 cycles/page:  292 samples: 78
    44:   0.01%  98.75% avg cycles: 12991 cycles/page:  295 samples: 108
    45:   0.01%  98.76% avg cycles: 13169 cycles/page:  292 samples: 78
    46:   0.02%  98.78% avg cycles: 12891 cycles/page:  280 samples: 261
    47:   0.01%  98.79% avg cycles: 13099 cycles/page:  278 samples: 67
    48:   0.01%  98.80% avg cycles: 13851 cycles/page:  288 samples: 77
    49:   0.01%  98.80% avg cycles: 13749 cycles/page:  280 samples: 66
    50:   0.01%  98.81% avg cycles: 13949 cycles/page:  278 samples: 73
    52:   0.00%  98.82% avg cycles: 14243 cycles/page:  273 samples: 52
    54:   0.01%  98.83% avg cycles: 15312 cycles/page:  283 samples: 87
    55:   0.01%  98.84% avg cycles: 15197 cycles/page:  276 samples: 109
    56:   0.02%  98.86% avg cycles: 15234 cycles/page:  272 samples: 208
    57:   0.00%  98.86% avg cycles: 14888 cycles/page:  261 samples: 53
    58:   0.01%  98.87% avg cycles: 15037 cycles/page:  259 samples: 59
    59:   0.01%  98.87% avg cycles: 15752 cycles/page:  266 samples: 63
    62:   0.00%  98.89% avg cycles: 16222 cycles/page:  261 samples: 54
    64:   0.02%  98.91% avg cycles: 17179 cycles/page:  268 samples: 248
    65:   0.12%  99.03% avg cycles: 18762 cycles/page:  288 samples: 1324
    85:   0.00%  99.10% avg cycles: 21649 cycles/page:  254 samples: 50
   127:   0.01%  99.18% avg cycles: 32397 cycles/page:  255 samples: 75
   128:   0.13%  99.31% avg cycles: 31711 cycles/page:  247 samples: 1466
   129:   0.18%  99.49% avg cycles: 33017 cycles/page:  255 samples: 1927
   181:   0.33%  99.84% avg cycles:  2489 cycles/page:   13 samples: 3547
   256:   0.05%  99.91% avg cycles:  2305 cycles/page:    9 samples: 550
   512:   0.03%  99.95% avg cycles:  2133 cycles/page:    4 samples: 304
  1512:   0.01%  99.99% avg cycles:  3038 cycles/page:    2 samples: 65

Here are the tlb counters during a 10-second slice of a kernel compile
for a SandyBridge system.  Its per-miss costs are lower than the
IvyBridge's, probably due to the larger caches, since this was one
of the 'X' extreme parts.

    10,873,007,282      dtlb_load_misses_walk_duration
       250,711,333      dtlb_load_misses_walk_completed
     1,212,395,865      dtlb_store_misses_walk_duration
        31,615,772      dtlb_store_misses_walk_completed
     5,091,010,274      itlb_misses_walk_duration
       163,193,511      itlb_misses_walk_completed
         1,321,980      itlb_itlb_flush

      10.008045158 seconds time elapsed

# cat perf.stat.1392743721.txt | perl -pe 's/,//g' | awk '/itlb_misses_walk_duration/ { icyc+=$1 } /itlb_misses_walk_completed/ { imiss+=$1 } /dtlb_.*_walk_duration/ { dcyc+=$1 } /dtlb_.*.*completed/ { dmiss+=$1 } END {print "itlb ns/miss: ", icyc/imiss/3.3, " dtlb ns/miss: ", dcyc/dmiss/3.3, "   -----    ", icyc,imiss, dcyc,dmiss }'
itlb ns/miss:  9.45338  dtlb ns/miss:  12.9716

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/mm/tlb.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN arch/x86/mm/tlb.c~set-tunable-to-sane-value arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~set-tunable-to-sane-value	2014-03-05 16:10:11.005111495 -0800
+++ b/arch/x86/mm/tlb.c	2014-03-05 16:10:11.009111678 -0800
@@ -165,7 +165,7 @@ void flush_tlb_current_task(void)
 }
 
 /* in units of pages */
-unsigned long tlb_single_page_flush_ceiling = 1;
+unsigned long tlb_single_page_flush_ceiling = 33;
 
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned long vmflag)
_

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 6/7] x86: mm: set TLB flush tunable to sane value
@ 2014-03-06  0:45   ` Dave Hansen
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Hansen @ 2014-03-06  0:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, ak, kirill.shutemov, mgorman, alex.shi, x86, linux-mm,
	Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Now that we have some shiny new tracepoints, we can actually
figure out what the heck is going on.

During a kernel compile, 60% of the flush_tlb_mm_range() calls
are for a single page.  It breaks down like this:

 size   percent  percent<=
  V        V        V
GLOBAL:   2.20%   2.20% avg cycles:  2283
     1:  56.92%  59.12% avg cycles:  1276
     2:  13.78%  72.90% avg cycles:  1505
     3:   8.26%  81.16% avg cycles:  1880
     4:   7.41%  88.58% avg cycles:  2447
     5:   1.73%  90.31% avg cycles:  2358
     6:   1.32%  91.63% avg cycles:  2563
     7:   1.14%  92.77% avg cycles:  2862
     8:   0.62%  93.39% avg cycles:  3542
     9:   0.08%  93.47% avg cycles:  3289
    10:   0.43%  93.90% avg cycles:  3570
    11:   0.20%  94.10% avg cycles:  3767
    12:   0.08%  94.18% avg cycles:  3996
    13:   0.03%  94.20% avg cycles:  4077
    14:   0.02%  94.23% avg cycles:  4836
    15:   0.04%  94.26% avg cycles:  5699
    16:   0.06%  94.32% avg cycles:  5041
    17:   0.57%  94.89% avg cycles:  5473
    18:   0.02%  94.91% avg cycles:  5396
    19:   0.03%  94.95% avg cycles:  5296
    20:   0.02%  94.96% avg cycles:  6749
    21:   0.18%  95.14% avg cycles:  6225
    22:   0.01%  95.15% avg cycles:  6393
    23:   0.01%  95.16% avg cycles:  6861
    24:   0.12%  95.28% avg cycles:  6912
    25:   0.05%  95.32% avg cycles:  7190
    26:   0.01%  95.33% avg cycles:  7793
    27:   0.01%  95.34% avg cycles:  7833
    28:   0.01%  95.35% avg cycles:  8253
    29:   0.08%  95.42% avg cycles:  8024
    30:   0.03%  95.45% avg cycles:  9670
    31:   0.01%  95.46% avg cycles:  8949
    32:   0.01%  95.46% avg cycles:  9350
    33:   3.11%  98.57% avg cycles:  8534
    34:   0.02%  98.60% avg cycles: 10977
    35:   0.02%  98.62% avg cycles: 11400

We get into diminishing returns pretty quickly.  On pre-IvyBridge
CPUs, we used to set the limit at 8 pages, and it was set at 128
on IvyBridge.  That 128 number looks pretty silly considering that
less than 0.5% of the flushes are that large.

The previous code tried to size this number based on the size of
the TLB.  Good idea, but it's error-prone, needs maintenance
(which it didn't get up to now), and probably would not matter
much in practice.

Setting it to 33 means that we cover the mallopt
M_TRIM_THRESHOLD, which is the most universally common flush
size.

That's the short version.  Here's the long one for why I chose 33:

1. These numbers have a constant bias in the timestamps from the
   tracing.  It probably accounts for a couple hundred cycles in each of
   these tests, but it should be fairly _even_ across all of them.
   The smallest delta between the tracepoints I have ever seen is
   335 cycles.  This is one reason the cycles/page cost goes down in
   general as the flushes get larger.  The true cost is nearer to
   100 cycles.
2. A full flush is more expensive than a single invlpg, but not
   by much (single percentages).
3. A dtlb miss is 17.1ns (~45 cycles) and an itlb miss is 13.0ns
   (~34 cycles).  At those rates, refilling the 512-entry dTLB takes
   22,000 cycles.
4. 22,000 cycles is approximately the equivalent of doing 85
   invlpg operations.  But, the odds are that the TLB can
   actually be filled up faster than that because TLB misses that
   are close in time also tend to leverage the same caches.
5. ~98% of flushes are <=33 pages.  There are a lot of flushes of
   33 pages, probably because libc's M_TRIM_THRESHOLD is set to
   128k (32 pages).
6. I've found no consistent data to support changing the IvyBridge
   vs. SandyBridge tunable by a factor of 16.

I used the performance counters on this hardware (IvyBridge i5-3320M)
to figure out the tlb miss costs:

ocperf.py stat -e dtlb_load_misses.walk_duration,dtlb_load_misses.walk_completed,dtlb_store_misses.walk_duration,dtlb_store_misses.walk_completed,itlb_misses.walk_duration,itlb_misses.walk_completed,itlb.itlb_flush

     7,720,030,970      dtlb_load_misses_walk_duration                                    [57.13%]
       169,856,353      dtlb_load_misses_walk_completed                                    [57.15%]
       708,832,859      dtlb_store_misses_walk_duration                                    [57.17%]
        19,346,823      dtlb_store_misses_walk_completed                                    [57.17%]
     2,779,687,402      itlb_misses_walk_duration                                    [57.15%]
        82,241,148      itlb_misses_walk_completed                                    [57.13%]
           770,717      itlb_itlb_flush                                              [57.11%]

These show that a dtlb miss is 17.1ns (~45 cycles) and an itlb miss
is 13.0ns (~34 cycles).  At those rates, refilling the 512-entry dTLB takes
22,000 cycles.  On a SandyBridge system with more cores and larger
caches, those are dtlb=13.4ns and itlb=9.5ns.

cat perf.stat.txt | perl -pe 's/,//g' \
	| awk '/itlb_misses_walk_duration/ { icyc+=$1 }
		/itlb_misses_walk_completed/ { imiss+=$1 }
		/dtlb_.*_walk_duration/ { dcyc+=$1 }
		/dtlb_.*completed/ { dmiss+=$1 }
		END {print "itlb cyc/miss: ", icyc/imiss, " dtlb cyc/miss: ", dcyc/dmiss, "   -----    ", icyc,imiss, dcyc,dmiss }'

On Westmere CPUs, the counters to use are: itlb_flush,itlb_misses.walk_cycles,itlb_misses.any,dtlb_misses.walk_cycles,dtlb_misses.any

The assumptions under which this code originally went in
(https://lkml.org/lkml/2012/6/12/119) say that a flush and a refill
are about 100ns.  Being generous, that is over by a factor of 6 on
the refill side, although it is fairly close on the cost of an
invlpg.  Each additional invlpg operation seems to lengthen the
flush range operation by about 200 cycles.  Here is one example of the data
collected for flushing 10 and 11 pages (full data are below):

    10:   0.43%  93.90% avg cycles:  3570 cycles/page:  357 samples: 4714
    11:   0.20%  94.10% avg cycles:  3767 cycles/page:  342 samples: 2145

How to generate this table:

	echo 10000 > /sys/kernel/debug/tracing/buffer_size_kb
	echo x86-tsc > /sys/kernel/debug/tracing/trace_clock
	echo 'reason != 0' > /sys/kernel/debug/tracing/events/tlb/tlb_flush/filter 
	echo 1 > /sys/kernel/debug/tracing/events/tlb/tlb_flush/enable

Pipe the trace output into this script:

	http://sr71.net/~dave/intel/201402-tlb/trace-time-diff-process.pl.txt

Note that these data were gathered with the invlpg threshold set to
150 pages.  Only data points with >=50 samples were printed:

Flush    % of     %<=
in       flush    this
pages      es     size
------------------------------------------------------------------------------
    -1:   2.20%   2.20% avg cycles:  2283 cycles/page: xxxx samples: 23960
     1:  56.92%  59.12% avg cycles:  1276 cycles/page: 1276 samples: 620895
     2:  13.78%  72.90% avg cycles:  1505 cycles/page:  752 samples: 150335
     3:   8.26%  81.16% avg cycles:  1880 cycles/page:  626 samples: 90131
     4:   7.41%  88.58% avg cycles:  2447 cycles/page:  611 samples: 80877
     5:   1.73%  90.31% avg cycles:  2358 cycles/page:  471 samples: 18885
     6:   1.32%  91.63% avg cycles:  2563 cycles/page:  427 samples: 14397
     7:   1.14%  92.77% avg cycles:  2862 cycles/page:  408 samples: 12441
     8:   0.62%  93.39% avg cycles:  3542 cycles/page:  442 samples: 6721
     9:   0.08%  93.47% avg cycles:  3289 cycles/page:  365 samples: 917
    10:   0.43%  93.90% avg cycles:  3570 cycles/page:  357 samples: 4714
    11:   0.20%  94.10% avg cycles:  3767 cycles/page:  342 samples: 2145
    12:   0.08%  94.18% avg cycles:  3996 cycles/page:  333 samples: 864
    13:   0.03%  94.20% avg cycles:  4077 cycles/page:  313 samples: 289
    14:   0.02%  94.23% avg cycles:  4836 cycles/page:  345 samples: 236
    15:   0.04%  94.26% avg cycles:  5699 cycles/page:  379 samples: 390
    16:   0.06%  94.32% avg cycles:  5041 cycles/page:  315 samples: 643
    17:   0.57%  94.89% avg cycles:  5473 cycles/page:  321 samples: 6229
    18:   0.02%  94.91% avg cycles:  5396 cycles/page:  299 samples: 224
    19:   0.03%  94.95% avg cycles:  5296 cycles/page:  278 samples: 367
    20:   0.02%  94.96% avg cycles:  6749 cycles/page:  337 samples: 185
    21:   0.18%  95.14% avg cycles:  6225 cycles/page:  296 samples: 1964
    22:   0.01%  95.15% avg cycles:  6393 cycles/page:  290 samples: 83
    23:   0.01%  95.16% avg cycles:  6861 cycles/page:  298 samples: 61
    24:   0.12%  95.28% avg cycles:  6912 cycles/page:  288 samples: 1307
    25:   0.05%  95.32% avg cycles:  7190 cycles/page:  287 samples: 533
    26:   0.01%  95.33% avg cycles:  7793 cycles/page:  299 samples: 94
    27:   0.01%  95.34% avg cycles:  7833 cycles/page:  290 samples: 66
    28:   0.01%  95.35% avg cycles:  8253 cycles/page:  294 samples: 73
    29:   0.08%  95.42% avg cycles:  8024 cycles/page:  276 samples: 846
    30:   0.03%  95.45% avg cycles:  9670 cycles/page:  322 samples: 296
    31:   0.01%  95.46% avg cycles:  8949 cycles/page:  288 samples: 79
    32:   0.01%  95.46% avg cycles:  9350 cycles/page:  292 samples: 60
    33:   3.11%  98.57% avg cycles:  8534 cycles/page:  258 samples: 33936
    34:   0.02%  98.60% avg cycles: 10977 cycles/page:  322 samples: 268
    35:   0.02%  98.62% avg cycles: 11400 cycles/page:  325 samples: 177
    36:   0.01%  98.63% avg cycles: 11504 cycles/page:  319 samples: 161
    37:   0.02%  98.65% avg cycles: 11596 cycles/page:  313 samples: 182
    38:   0.02%  98.66% avg cycles: 11850 cycles/page:  311 samples: 195
    39:   0.01%  98.68% avg cycles: 12158 cycles/page:  311 samples: 128
    40:   0.01%  98.68% avg cycles: 11626 cycles/page:  290 samples: 78
    41:   0.04%  98.73% avg cycles: 11435 cycles/page:  278 samples: 477
    42:   0.01%  98.73% avg cycles: 12571 cycles/page:  299 samples: 74
    43:   0.01%  98.74% avg cycles: 12562 cycles/page:  292 samples: 78
    44:   0.01%  98.75% avg cycles: 12991 cycles/page:  295 samples: 108
    45:   0.01%  98.76% avg cycles: 13169 cycles/page:  292 samples: 78
    46:   0.02%  98.78% avg cycles: 12891 cycles/page:  280 samples: 261
    47:   0.01%  98.79% avg cycles: 13099 cycles/page:  278 samples: 67
    48:   0.01%  98.80% avg cycles: 13851 cycles/page:  288 samples: 77
    49:   0.01%  98.80% avg cycles: 13749 cycles/page:  280 samples: 66
    50:   0.01%  98.81% avg cycles: 13949 cycles/page:  278 samples: 73
    52:   0.00%  98.82% avg cycles: 14243 cycles/page:  273 samples: 52
    54:   0.01%  98.83% avg cycles: 15312 cycles/page:  283 samples: 87
    55:   0.01%  98.84% avg cycles: 15197 cycles/page:  276 samples: 109
    56:   0.02%  98.86% avg cycles: 15234 cycles/page:  272 samples: 208
    57:   0.00%  98.86% avg cycles: 14888 cycles/page:  261 samples: 53
    58:   0.01%  98.87% avg cycles: 15037 cycles/page:  259 samples: 59
    59:   0.01%  98.87% avg cycles: 15752 cycles/page:  266 samples: 63
    62:   0.00%  98.89% avg cycles: 16222 cycles/page:  261 samples: 54
    64:   0.02%  98.91% avg cycles: 17179 cycles/page:  268 samples: 248
    65:   0.12%  99.03% avg cycles: 18762 cycles/page:  288 samples: 1324
    85:   0.00%  99.10% avg cycles: 21649 cycles/page:  254 samples: 50
   127:   0.01%  99.18% avg cycles: 32397 cycles/page:  255 samples: 75
   128:   0.13%  99.31% avg cycles: 31711 cycles/page:  247 samples: 1466
   129:   0.18%  99.49% avg cycles: 33017 cycles/page:  255 samples: 1927
   181:   0.33%  99.84% avg cycles:  2489 cycles/page:   13 samples: 3547
   256:   0.05%  99.91% avg cycles:  2305 cycles/page:    9 samples: 550
   512:   0.03%  99.95% avg cycles:  2133 cycles/page:    4 samples: 304
  1512:   0.01%  99.99% avg cycles:  3038 cycles/page:    2 samples: 65

Here are the tlb counters during a 10-second slice of a kernel compile
for a SandyBridge system.  Its per-miss costs are lower than the
IvyBridge's, probably due to the larger caches, since this was one
of the 'X' extreme parts.

    10,873,007,282      dtlb_load_misses_walk_duration
       250,711,333      dtlb_load_misses_walk_completed
     1,212,395,865      dtlb_store_misses_walk_duration
        31,615,772      dtlb_store_misses_walk_completed
     5,091,010,274      itlb_misses_walk_duration
       163,193,511      itlb_misses_walk_completed
         1,321,980      itlb_itlb_flush

      10.008045158 seconds time elapsed

# cat perf.stat.1392743721.txt | perl -pe 's/,//g' | awk '/itlb_misses_walk_duration/ { icyc+=$1 } /itlb_misses_walk_completed/ { imiss+=$1 } /dtlb_.*_walk_duration/ { dcyc+=$1 } /dtlb_.*.*completed/ { dmiss+=$1 } END {print "itlb ns/miss: ", icyc/imiss/3.3, " dtlb ns/miss: ", dcyc/dmiss/3.3, "   -----    ", icyc,imiss, dcyc,dmiss }'
itlb ns/miss:  9.45338  dtlb ns/miss:  12.9716

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/mm/tlb.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN arch/x86/mm/tlb.c~set-tunable-to-sane-value arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~set-tunable-to-sane-value	2014-03-05 16:10:11.005111495 -0800
+++ b/arch/x86/mm/tlb.c	2014-03-05 16:10:11.009111678 -0800
@@ -165,7 +165,7 @@ void flush_tlb_current_task(void)
 }
 
 /* in units of pages */
-unsigned long tlb_single_page_flush_ceiling = 1;
+unsigned long tlb_single_page_flush_ceiling = 33;
 
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned long vmflag)
_

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 7/7] big time hack: instrument flush times
  2014-03-06  0:45 ` Dave Hansen
@ 2014-03-06  0:45   ` Dave Hansen
  -1 siblings, 0 replies; 35+ messages in thread
From: Dave Hansen @ 2014-03-06  0:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, ak, kirill.shutemov, mgorman, alex.shi, x86, linux-mm, Dave Hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The tracepoint code is a _bit_ too much overhead, so use some
percpu counters to aggregate the flush timings instead.  Yes,
this is racy and ugly beyond reason, but it was quick to code up.

I'm posting this here because it's interesting to have around,
and if other folks like it, maybe I can get it into shape to
stick into mainline.

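With the patch applied, the new debugfs file can be exercised like
this (a sketch; the path assumes debugfs is mounted at
/sys/kernel/debug, and each line of output is
'[<flush size>] <nr flushes> <total ns>'):

	# dump the aggregated per-size flush counts and times
	cat /sys/kernel/debug/x86/tlb_flush_stats
	# writing anything to the file resets the counters
	echo 0 > /sys/kernel/debug/x86/tlb_flush_stats
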
---

 b/arch/x86/mm/tlb.c |  112 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 112 insertions(+)

diff -puN arch/x86/mm/tlb.c~instrument-flush-times arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~instrument-flush-times	2014-03-05 16:10:11.255122898 -0800
+++ b/arch/x86/mm/tlb.c	2014-03-05 16:10:11.258123035 -0800
@@ -97,6 +97,8 @@ EXPORT_SYMBOL_GPL(leave_mm);
  * 1) Flush the tlb entries if the cpu uses the mm that's being flushed.
  * 2) Leave the mm if we are in the lazy tlb mode.
  */
+void inc_stat(u64 flush_size, u64 time);
+
 static void flush_tlb_func(void *info)
 {
 	struct flush_tlb_info *f = info;
@@ -109,17 +111,23 @@ static void flush_tlb_func(void *info)
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
 	if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) {
 		if (f->flush_end == TLB_FLUSH_ALL) {
+			u64 start_ns = sched_clock();
 			local_flush_tlb();
+			inc_stat(TLB_FLUSH_ALL, sched_clock() - start_ns);
 			trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, TLB_FLUSH_ALL);
 		} else if (!f->flush_end)
 			__flush_tlb_single(f->flush_start);
 		else {
+			u64 start_ns;
 			unsigned long addr;
+			start_ns = sched_clock();
 			addr = f->flush_start;
 			while (addr < f->flush_end) {
 				__flush_tlb_single(addr);
 				addr += PAGE_SIZE;
 			}
+			inc_stat((f->flush_end - f->flush_start) / PAGE_SIZE,
+				 sched_clock() - start_ns);
 		}
 	} else
 		leave_mm(smp_processor_id());
@@ -164,12 +172,112 @@ void flush_tlb_current_task(void)
 	preempt_enable();
 }
 
+struct one_tlb_stat {
+	u64 flushes;
+	u64 time;
+};
+
+#define NR_TO_TRACK 1024
+
+struct tlb_stats {
+	struct one_tlb_stat stats[NR_TO_TRACK];
+};
+
+DEFINE_PER_CPU(struct tlb_stats, tlb_stats);
+
+void inc_stat(u64 flush_size, u64 time)
+{
+	struct tlb_stats *thiscpu =
+		&per_cpu(tlb_stats, smp_processor_id());
+	struct one_tlb_stat *stat;
+
+	if (flush_size == TLB_FLUSH_ALL)
+		flush_size = 0;
+	if (flush_size >= NR_TO_TRACK)
+		flush_size = NR_TO_TRACK-1;
+
+	stat = &thiscpu->stats[flush_size];
+	stat->time += time;
+	stat->flushes++;
+}
+
+char printbuf[80 * NR_TO_TRACK];
+static ssize_t tlb_stat_read_file(struct file *file, char __user *user_buf,
+			     size_t count, loff_t *ppos)
+{
+	int cpu;
+	int flush_size;
+	unsigned int len = 0;
+
+	for (flush_size = 0; flush_size < NR_TO_TRACK; flush_size++) {
+		struct one_tlb_stat tot;
+		tot.flushes = 0;
+		tot.time = 0;
+
+		for_each_online_cpu(cpu){
+			struct tlb_stats *thiscpu = &per_cpu(tlb_stats, cpu);
+			struct one_tlb_stat *stat;
+			stat = &thiscpu->stats[flush_size];
+			tot.flushes += stat->flushes;
+			tot.time += stat->time;
+		}
+		if (!tot.flushes)
+			continue;
+		if (flush_size == 0)
+			len += sprintf(&printbuf[len], "[FULL]");
+		else if (flush_size == NR_TO_TRACK-1)
+			len += sprintf(&printbuf[len], "[FBIG]");
+		else
+			len += sprintf(&printbuf[len], "[%d]", flush_size);
+
+		len += sprintf(&printbuf[len], " %lld %lld\n",
+			tot.flushes, tot.time);
+	}
+
+	return simple_read_from_buffer(user_buf, count, ppos, printbuf, len);
+}
+
+static ssize_t tlb_stat_write_file(struct file *file,
+		 const char __user *user_buf, size_t count, loff_t *ppos)
+{
+	int cpu;
+	int flush_size;
+
+	for_each_online_cpu(cpu){
+		struct tlb_stats *thiscpu = &per_cpu(tlb_stats, cpu);
+		for (flush_size = 0; flush_size < NR_TO_TRACK; flush_size++) {
+			struct one_tlb_stat *stat;
+			stat = &thiscpu->stats[flush_size];
+			stat->time = 0;
+			stat->flushes = 0;
+		}
+	}
+	return count;
+}
+
+static const struct file_operations fops_tlb_stat = {
+	.read = tlb_stat_read_file,
+	.write = tlb_stat_write_file,
+	.llseek = default_llseek,
+};
+
+static int __init create_tlb_stats(void)
+{
+	debugfs_create_file("tlb_flush_stats", S_IRUSR | S_IWUSR,
+			    arch_debugfs_dir, NULL, &fops_tlb_stat);
+	return 0;
+}
+late_initcall(create_tlb_stats);
+
+
 /* in units of pages */
 unsigned long tlb_single_page_flush_ceiling = 33;
 
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned long vmflag)
 {
+	u64 start_ns = 0;
+	u64 end_ns;
 	unsigned long addr;
 	/* do a global flush by default */
 	unsigned long base_pages_to_flush = TLB_FLUSH_ALL;
@@ -187,6 +295,7 @@ void flush_tlb_mm_range(struct mm_struct
 		base_pages_to_flush = (end - start) >> PAGE_SHIFT;
 
 	trace_tlb_flush(TLB_LOCAL_MM_SHOOTDOWN, base_pages_to_flush);
+	start_ns = sched_clock();
 	if (base_pages_to_flush > tlb_single_page_flush_ceiling) {
 		base_pages_to_flush = TLB_FLUSH_ALL;
 		count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
@@ -198,12 +307,15 @@ void flush_tlb_mm_range(struct mm_struct
 			__flush_tlb_single(addr);
 		}
 	}
+	end_ns = sched_clock();
 	trace_tlb_flush(TLB_LOCAL_MM_SHOOTDOWN_DONE, base_pages_to_flush);
 out:
 	if (base_pages_to_flush == TLB_FLUSH_ALL) {
 		start = 0UL;
 		end = TLB_FLUSH_ALL;
 	}
+	if (start_ns)
+		inc_stat(base_pages_to_flush, end_ns - start_ns);
 	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
 		flush_tlb_others(mm_cpumask(mm), mm, start, end);
 	preempt_enable();
_

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 7/7] big time hack: instrument flush times
@ 2014-03-06  0:45   ` Dave Hansen
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Hansen @ 2014-03-06  0:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, ak, kirill.shutemov, mgorman, alex.shi, x86, linux-mm, Dave Hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The tracepoint code is a _bit_ too much overhead, so use some
percpu counters to aggregate the flush timings instead.  Yes,
this is racy and ugly beyond reason, but it was quick to code up.

I'm posting this here because it's interesting to have around,
and if other folks like it, maybe I can get it into shape to
stick into mainline.

---

 b/arch/x86/mm/tlb.c |  112 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 112 insertions(+)

diff -puN arch/x86/mm/tlb.c~instrument-flush-times arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~instrument-flush-times	2014-03-05 16:10:11.255122898 -0800
+++ b/arch/x86/mm/tlb.c	2014-03-05 16:10:11.258123035 -0800
@@ -97,6 +97,8 @@ EXPORT_SYMBOL_GPL(leave_mm);
  * 1) Flush the tlb entries if the cpu uses the mm that's being flushed.
  * 2) Leave the mm if we are in the lazy tlb mode.
  */
+void inc_stat(u64 flush_size, u64 time);
+
 static void flush_tlb_func(void *info)
 {
 	struct flush_tlb_info *f = info;
@@ -109,17 +111,23 @@ static void flush_tlb_func(void *info)
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
 	if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) {
 		if (f->flush_end == TLB_FLUSH_ALL) {
+			u64 start_ns = sched_clock();
 			local_flush_tlb();
+			inc_stat(TLB_FLUSH_ALL, sched_clock() - start_ns);
 			trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, TLB_FLUSH_ALL);
 		} else if (!f->flush_end)
 			__flush_tlb_single(f->flush_start);
 		else {
+			u64 start_ns;
 			unsigned long addr;
+			start_ns = sched_clock();
 			addr = f->flush_start;
 			while (addr < f->flush_end) {
 				__flush_tlb_single(addr);
 				addr += PAGE_SIZE;
 			}
+			inc_stat((f->flush_end - f->flush_start) / PAGE_SIZE,
+				 sched_clock() - start_ns);
 		}
 	} else
 		leave_mm(smp_processor_id());
@@ -164,12 +172,112 @@ void flush_tlb_current_task(void)
 	preempt_enable();
 }
 
+struct one_tlb_stat {
+	u64 flushes;
+	u64 time;
+};
+
+#define NR_TO_TRACK 1024
+
+struct tlb_stats {
+	struct one_tlb_stat stats[NR_TO_TRACK];
+};
+
+DEFINE_PER_CPU(struct tlb_stats, tlb_stats);
+
+void inc_stat(u64 flush_size, u64 time)
+{
+	struct tlb_stats *thiscpu =
+		&per_cpu(tlb_stats, smp_processor_id());
+	struct one_tlb_stat *stat;
+
+	if (flush_size == TLB_FLUSH_ALL)
+		flush_size = 0;
+	if (flush_size >= NR_TO_TRACK)
+		flush_size = NR_TO_TRACK-1;
+
+	stat = &thiscpu->stats[flush_size];
+	stat->time += time;
+	stat->flushes++;
+}
+
+char printbuf[80 * NR_TO_TRACK];
+static ssize_t tlb_stat_read_file(struct file *file, char __user *user_buf,
+			     size_t count, loff_t *ppos)
+{
+	int cpu;
+	int flush_size;
+	unsigned int len = 0;
+
+	for (flush_size = 0; flush_size < NR_TO_TRACK; flush_size++) {
+		struct one_tlb_stat tot;
+		tot.flushes = 0;
+		tot.time = 0;
+
+		for_each_online_cpu(cpu){
+			struct tlb_stats *thiscpu = &per_cpu(tlb_stats, cpu);
+			struct one_tlb_stat *stat;
+			stat = &thiscpu->stats[flush_size];
+			tot.flushes += stat->flushes;
+			tot.time += stat->time;
+		}
+		if (!tot.flushes)
+			continue;
+		if (flush_size == 0)
+			len += sprintf(&printbuf[len], "[FULL]");
+		else if (flush_size == NR_TO_TRACK-1)
+			len += sprintf(&printbuf[len], "[FBIG]");
+		else
+			len += sprintf(&printbuf[len], "[%d]", flush_size);
+
+		len += sprintf(&printbuf[len], " %lld %lld\n",
+			tot.flushes, tot.time);
+	}
+
+	return simple_read_from_buffer(user_buf, count, ppos, printbuf, len);
+}
+
+static ssize_t tlb_stat_write_file(struct file *file,
+		 const char __user *user_buf, size_t count, loff_t *ppos)
+{
+	int cpu;
+	int flush_size;
+
+	for_each_online_cpu(cpu){
+		struct tlb_stats *thiscpu = &per_cpu(tlb_stats, cpu);
+		for (flush_size = 0; flush_size < NR_TO_TRACK; flush_size++) {
+			struct one_tlb_stat *stat;
+			stat = &thiscpu->stats[flush_size];
+			stat->time = 0;
+			stat->flushes = 0;
+		}
+	}
+	return count;
+}
+
+static const struct file_operations fops_tlb_stat = {
+	.read = tlb_stat_read_file,
+	.write = tlb_stat_write_file,
+	.llseek = default_llseek,
+};
+
+static int __init create_tlb_stats(void)
+{
+	debugfs_create_file("tlb_flush_stats", S_IRUSR | S_IWUSR,
+			    arch_debugfs_dir, NULL, &fops_tlb_stat);
+	return 0;
+}
+late_initcall(create_tlb_stats);
+
+
 /* in units of pages */
 unsigned long tlb_single_page_flush_ceiling = 33;
 
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned long vmflag)
 {
+	u64 start_ns = 0;
+	u64 end_ns;
 	unsigned long addr;
 	/* do a global flush by default */
 	unsigned long base_pages_to_flush = TLB_FLUSH_ALL;
@@ -187,6 +295,7 @@ void flush_tlb_mm_range(struct mm_struct
 		base_pages_to_flush = (end - start) >> PAGE_SHIFT;
 
 	trace_tlb_flush(TLB_LOCAL_MM_SHOOTDOWN, base_pages_to_flush);
+	start_ns = sched_clock();
 	if (base_pages_to_flush > tlb_single_page_flush_ceiling) {
 		base_pages_to_flush = TLB_FLUSH_ALL;
 		count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
@@ -198,12 +307,15 @@ void flush_tlb_mm_range(struct mm_struct
 			__flush_tlb_single(addr);
 		}
 	}
+	end_ns = sched_clock();
 	trace_tlb_flush(TLB_LOCAL_MM_SHOOTDOWN_DONE, base_pages_to_flush);
 out:
 	if (base_pages_to_flush == TLB_FLUSH_ALL) {
 		start = 0UL;
 		end = TLB_FLUSH_ALL;
 	}
+	if (start_ns)
+		inc_stat(base_pages_to_flush, end_ns - start_ns);
 	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
 		flush_tlb_others(mm_cpumask(mm), mm, start, end);
 	preempt_enable();
_

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/7] x86: rework tlb range flushing code
  2014-03-06  0:45 ` Dave Hansen
@ 2014-03-07  0:15   ` Davidlohr Bueso
  -1 siblings, 0 replies; 35+ messages in thread
From: Davidlohr Bueso @ 2014-03-07  0:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, akpm, ak, kirill.shutemov, mgorman, alex.shi, x86,
	linux-mm

On Wed, 2014-03-05 at 16:45 -0800, Dave Hansen wrote:
> Reposting with an instrumentation patch, and a few minor tweaks.
> I'd love some more eyeballs on this, but I think it's ready for
> -mm.
> 
> I'm having it run through the LKP harness to see if any perfmance
> regressions (or gains) show up.

fwiw I pounded these on an 80-core Westmere with my usual aim7 stuff
for most of the morning and didn't run into anything unusual or any
performance differences.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/7] x86: rework tlb range flushing code
@ 2014-03-07  0:15   ` Davidlohr Bueso
  0 siblings, 0 replies; 35+ messages in thread
From: Davidlohr Bueso @ 2014-03-07  0:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, akpm, ak, kirill.shutemov, mgorman, alex.shi, x86,
	linux-mm

On Wed, 2014-03-05 at 16:45 -0800, Dave Hansen wrote:
> Reposting with an instrumentation patch, and a few minor tweaks.
> I'd love some more eyeballs on this, but I think it's ready for
> -mm.
> 
> I'm having it run through the LKP harness to see if any perfmance
> regressions (or gains) show up.

fwiw I pounded these on an 80-core Westmere with my usual aim7 stuff
for most of the morning and didn't run into anything unusual or any
performance differences.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/7] x86: mm: clean up tlb flushing code
  2014-03-06  0:45   ` Dave Hansen
@ 2014-03-07  0:16     ` Davidlohr Bueso
  -1 siblings, 0 replies; 35+ messages in thread
From: Davidlohr Bueso @ 2014-03-07  0:16 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, akpm, ak, kirill.shutemov, mgorman, alex.shi, x86,
	linux-mm, dave.hansen

On Wed, 2014-03-05 at 16:45 -0800, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> The
> 
> 	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
> 
> line of code is not exactly the easiest to audit, especially when
> it ends up at two different indentation levels.  This eliminates
> one of the the copy-n-paste versions.  It also gives us a unified
> exit point for each path through this function.  We need this in
> a minute for our tracepoint.
> 
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
> 
>  b/arch/x86/mm/tlb.c |   23 +++++++++++------------
>  1 file changed, 11 insertions(+), 12 deletions(-)
> 
> diff -puN arch/x86/mm/tlb.c~simplify-tlb-code arch/x86/mm/tlb.c
> --- a/arch/x86/mm/tlb.c~simplify-tlb-code	2014-03-05 16:10:09.607047728 -0800
> +++ b/arch/x86/mm/tlb.c	2014-03-05 16:10:09.610047866 -0800
> @@ -161,23 +161,24 @@ void flush_tlb_current_task(void)
>  void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>  				unsigned long end, unsigned long vmflag)
>  {
> +	int need_flush_others_all = 1;

nit: this can be bool.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/7] x86: mm: clean up tlb flushing code
@ 2014-03-07  0:16     ` Davidlohr Bueso
  0 siblings, 0 replies; 35+ messages in thread
From: Davidlohr Bueso @ 2014-03-07  0:16 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, akpm, ak, kirill.shutemov, mgorman, alex.shi, x86,
	linux-mm, dave.hansen

On Wed, 2014-03-05 at 16:45 -0800, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> The
> 
> 	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
> 
> line of code is not exactly the easiest to audit, especially when
> it ends up at two different indentation levels.  This eliminates
> one of the the copy-n-paste versions.  It also gives us a unified
> exit point for each path through this function.  We need this in
> a minute for our tracepoint.
> 
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
> 
>  b/arch/x86/mm/tlb.c |   23 +++++++++++------------
>  1 file changed, 11 insertions(+), 12 deletions(-)
> 
> diff -puN arch/x86/mm/tlb.c~simplify-tlb-code arch/x86/mm/tlb.c
> --- a/arch/x86/mm/tlb.c~simplify-tlb-code	2014-03-05 16:10:09.607047728 -0800
> +++ b/arch/x86/mm/tlb.c	2014-03-05 16:10:09.610047866 -0800
> @@ -161,23 +161,24 @@ void flush_tlb_current_task(void)
>  void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>  				unsigned long end, unsigned long vmflag)
>  {
> +	int need_flush_others_all = 1;

nit: this can be bool.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/7] x86: mm: clean up tlb flushing code
  2014-03-07  0:16     ` Davidlohr Bueso
@ 2014-03-07  0:51       ` Davidlohr Bueso
  -1 siblings, 0 replies; 35+ messages in thread
From: Davidlohr Bueso @ 2014-03-07  0:51 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, akpm, ak, kirill.shutemov, mgorman, alex.shi, x86,
	linux-mm, dave.hansen

On Thu, 2014-03-06 at 16:16 -0800, Davidlohr Bueso wrote:
> On Wed, 2014-03-05 at 16:45 -0800, Dave Hansen wrote:
> > From: Dave Hansen <dave.hansen@linux.intel.com>
> > 
> > The
> > 
> > 	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
> > 
> > line of code is not exactly the easiest to audit, especially when
> > it ends up at two different indentation levels.  This eliminates
> > one of the the copy-n-paste versions.  It also gives us a unified
> > exit point for each path through this function.  We need this in
> > a minute for our tracepoint.
> > 
> > 
> > Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> > ---
> > 
> >  b/arch/x86/mm/tlb.c |   23 +++++++++++------------
> >  1 file changed, 11 insertions(+), 12 deletions(-)
> > 
> > diff -puN arch/x86/mm/tlb.c~simplify-tlb-code arch/x86/mm/tlb.c
> > --- a/arch/x86/mm/tlb.c~simplify-tlb-code	2014-03-05 16:10:09.607047728 -0800
> > +++ b/arch/x86/mm/tlb.c	2014-03-05 16:10:09.610047866 -0800
> > @@ -161,23 +161,24 @@ void flush_tlb_current_task(void)
> >  void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> >  				unsigned long end, unsigned long vmflag)
> >  {
> > +	int need_flush_others_all = 1;
> 
> nit: this can be bool.

never mind, you get rid of it later.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/7] x86: mm: clean up tlb flushing code
@ 2014-03-07  0:51       ` Davidlohr Bueso
  0 siblings, 0 replies; 35+ messages in thread
From: Davidlohr Bueso @ 2014-03-07  0:51 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, akpm, ak, kirill.shutemov, mgorman, alex.shi, x86,
	linux-mm, dave.hansen

On Thu, 2014-03-06 at 16:16 -0800, Davidlohr Bueso wrote:
> On Wed, 2014-03-05 at 16:45 -0800, Dave Hansen wrote:
> > From: Dave Hansen <dave.hansen@linux.intel.com>
> > 
> > The
> > 
> > 	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
> > 
> > line of code is not exactly the easiest to audit, especially when
> > it ends up at two different indentation levels.  This eliminates
> > one of the the copy-n-paste versions.  It also gives us a unified
> > exit point for each path through this function.  We need this in
> > a minute for our tracepoint.
> > 
> > 
> > Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> > ---
> > 
> >  b/arch/x86/mm/tlb.c |   23 +++++++++++------------
> >  1 file changed, 11 insertions(+), 12 deletions(-)
> > 
> > diff -puN arch/x86/mm/tlb.c~simplify-tlb-code arch/x86/mm/tlb.c
> > --- a/arch/x86/mm/tlb.c~simplify-tlb-code	2014-03-05 16:10:09.607047728 -0800
> > +++ b/arch/x86/mm/tlb.c	2014-03-05 16:10:09.610047866 -0800
> > @@ -161,23 +161,24 @@ void flush_tlb_current_task(void)
> >  void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> >  				unsigned long end, unsigned long vmflag)
> >  {
> > +	int need_flush_others_all = 1;
> 
> nit: this can be bool.

never mind, you get rid of it later.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/7] x86: mm: clean up tlb flushing code
  2014-03-07  0:51       ` Davidlohr Bueso
  (?)
@ 2014-03-07  0:57       ` Eric Boxer
  -1 siblings, 0 replies; 35+ messages in thread
From: Eric Boxer @ 2014-03-07  0:57 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: ak, akpm, alex.shi, linux-mm, mgorman, x86, Dave Hansen,
	linux-kernel, kirill.shutemov, dave.hansen

[-- Attachment #1: Type: text/plain, Size: 1506 bytes --]

Eric Boxer liked your message with Boxer.

[-- Attachment #2: Type: text/html, Size: 2232 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/7] x86: mm: new tunable for single vs full TLB flush
  2014-03-06  0:45   ` Dave Hansen
@ 2014-03-07  1:37     ` Davidlohr Bueso
  -1 siblings, 0 replies; 35+ messages in thread
From: Davidlohr Bueso @ 2014-03-07  1:37 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, akpm, ak, kirill.shutemov, mgorman, alex.shi, x86,
	linux-mm, dave.hansen

On Wed, 2014-03-05 at 16:45 -0800, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> +
> +If you believe that invlpg is being called too often, you can
> +lower the tunable:
> +
> +	/sys/kernel/debug/x86/tlb_single_page_flush_ceiling
> +

Whenever this tunable needs to be updated, most users will not know what
an invlpg is and won't think in terms of pages either. How about making
this in units of KB instead? But then again most of those users won't be
looking into tlb flushing issues anyways, so...

While obvious, it should also mention that this does not apply to
hugepages.

> +This will cause us to do the global flush for more cases.
> +Lowering it to 0 will disable the use of invlpg.
> +
> +You might see invlpg inside of flush_tlb_mm_range() show up in
> +profiles, or you can use the trace_tlb_flush() tracepoints to
> +determine how long the flush operations are taking.
> +
> +Essentially, you are balancing the cycles you spend doing invlpg
> +with the cycles that you spend refilling the TLB later.
> +
> +You can measure how expensive TLB refills are by using
> +performance counters and 'perf stat', like this:
> +
> +perf stat -e
> +	cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/,
> +	cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/,
> +	cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/,
> +	cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/,
> +	cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/,
> +	cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/
> +
> +That works on an IvyBridge-era CPU (i5-3320M).  Different CPUs
> +may have differently-named counters, but they should at least
> +be there in some form.  You can use pmu-tools 'ocperf list'
> +(https://github.com/andikleen/pmu-tools) to find the right
> +counters for a given CPU.
> +
> _
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/7] x86: mm: new tunable for single vs full TLB flush
@ 2014-03-07  1:37     ` Davidlohr Bueso
  0 siblings, 0 replies; 35+ messages in thread
From: Davidlohr Bueso @ 2014-03-07  1:37 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, akpm, ak, kirill.shutemov, mgorman, alex.shi, x86,
	linux-mm, dave.hansen

On Wed, 2014-03-05 at 16:45 -0800, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> +
> +If you believe that invlpg is being called too often, you can
> +lower the tunable:
> +
> +	/sys/debug/kernel/x86/tlb_single_page_flush_ceiling
> +

Whenever this tunable needs to be updated, most users will not know what
an invlpg is and won't think in terms of pages either. How about making
this in units of Kb instead? But then again most of those users won't be
looking into tlb flushing issues anyways, so...

While obvious, it should also mention that this does not apply to
hugepages.

> +This will cause us to do the global flush for more cases.
> +Lowering it to 0 will disable the use of invlpg.
> +
> +You might see invlpg inside of flush_tlb_mm_range() show up in
> +profiles, or you can use the trace_tlb_flush() tracepoints to
> +determine how long the flush operations are taking.
> +
> +Essentially, you are balancing the cycles you spend doing invlpg
> +with the cycles that you spend refilling the TLB later.
> +
> +You can measure how expensive TLB refills are by using
> +performance counters and 'perf stat', like this:
> +
> +perf stat -e
> +	cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/,
> +	cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/,
> +	cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/,
> +	cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/,
> +	cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/,
> +	cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/
> +
> +That works on an IvyBridge-era CPU (i5-3320M).  Different CPUs
> +may have differently-named counters, but they should at least
> +be there in some form.  You can use pmu-tools 'ocperf list'
> +(https://github.com/andikleen/pmu-tools) to find the right
> +counters for a given CPU.
> +
> _
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/7] x86: mm: set TLB flush tunable to sane value
  2014-03-06  0:45   ` Dave Hansen
@ 2014-03-07  1:55     ` Davidlohr Bueso
  -1 siblings, 0 replies; 35+ messages in thread
From: Davidlohr Bueso @ 2014-03-07  1:55 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, akpm, ak, kirill.shutemov, mgorman, alex.shi, x86,
	linux-mm, dave.hansen

On Wed, 2014-03-05 at 16:45 -0800, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> Now that we have some shiny new tracepoints, we can actually
> figure out what the heck is going on.
> 
> During a kernel compile, 60% of the flush_tlb_mm_range() calls
> are for a single page.  It breaks down like this:

It would be interesting to see similar data for opposite workloads with
more random access patterns. That's normally when things start getting
fun in the tlb world.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/7] x86: mm: set TLB flush tunable to sane value
@ 2014-03-07  1:55     ` Davidlohr Bueso
  0 siblings, 0 replies; 35+ messages in thread
From: Davidlohr Bueso @ 2014-03-07  1:55 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, akpm, ak, kirill.shutemov, mgorman, alex.shi, x86,
	linux-mm, dave.hansen

On Wed, 2014-03-05 at 16:45 -0800, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> Now that we have some shiny new tracepoints, we can actually
> figure out what the heck is going on.
> 
> During a kernel compile, 60% of the flush_tlb_mm_range() calls
> are for a single page.  It breaks down like this:

It would be interesting to see similar data for opposite workloads with
more random access patterns. That's normally when things start getting
fun in the tlb world.

Thanks,
Davidlohr

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/7] x86: mm: set TLB flush tunable to sane value
  2014-03-07  1:55     ` Davidlohr Bueso
@ 2014-03-07 17:15       ` Dave Hansen
  -1 siblings, 0 replies; 35+ messages in thread
From: Dave Hansen @ 2014-03-07 17:15 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, akpm, ak, kirill.shutemov, mgorman, alex.shi, x86,
	linux-mm, dave.hansen

On 03/06/2014 05:55 PM, Davidlohr Bueso wrote:
> On Wed, 2014-03-05 at 16:45 -0800, Dave Hansen wrote:
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> Now that we have some shiny new tracepoints, we can actually
>> figure out what the heck is going on.
>>
>> During a kernel compile, 60% of the flush_tlb_mm_range() calls
>> are for a single page.  It breaks down like this:
> 
> It would be interesting to see similar data for opposite workloads with
> more random access patterns. That's normally when things start getting
> fun in the tlb world.

First of all, thanks for testing.  It's much appreciated!

Any suggestions for opposite workloads?

I've seen this tunable have really heavy effects on ebizzy.  It fits
almost entirely within the itlb and if we are doing full flushes, it
eats the itlb and increases the misses about 10x.  Even putting this
tunable above 500 pages (which is pretty insane) didn't help it.

Things that thrash the TLB don't really care if someone invalidates
their TLB since they're thrashing it anyway.

I've had a really hard time finding workloads that _care_ or are
affected by small changes in this tunable.  That's one of the reasons I
tried to simplify it: it's just not worth the complexity.
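
For reference, a comparison along these lines is easy enough to script
(rough sketch only; the generic perf TLB event names and the bare
ebizzy invocation are assumptions and will vary by setup):

	# rough sketch: compare TLB misses at two ceiling settings
	for ceiling in 1 33; do
		echo $ceiling > /sys/debug/kernel/x86/tlb_single_page_flush_ceiling
		perf stat -e iTLB-load-misses,dTLB-load-misses ./ebizzy
	done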

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/7] x86: mm: set TLB flush tunable to sane value
@ 2014-03-07 17:15       ` Dave Hansen
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Hansen @ 2014-03-07 17:15 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, akpm, ak, kirill.shutemov, mgorman, alex.shi, x86,
	linux-mm, dave.hansen

On 03/06/2014 05:55 PM, Davidlohr Bueso wrote:
> On Wed, 2014-03-05 at 16:45 -0800, Dave Hansen wrote:
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> Now that we have some shiny new tracepoints, we can actually
>> figure out what the heck is going on.
>>
>> During a kernel compile, 60% of the flush_tlb_mm_range() calls
>> are for a single page.  It breaks down like this:
> 
> It would be interesting to see similar data for opposite workloads with
> more random access patterns. That's normally when things start getting
> fun in the tlb world.

First of all, thanks for testing.  It's much appreciated!

Any suggestions for opposite workloads?

I've seen this tunable have really heavy effects on ebizzy.  It fits
almost entirely within the itlb and if we are doing full flushes, it
eats the itlb and increases the misses about 10x.  Even putting this
tunable above 500 pages (which is pretty insane) didn't help it.

Things that thrash the TLB don't really care if someone invalidates
their TLB since they're thrashing it anyway.

I've had a really hard time finding workloads that _care_ or are
affected by small changes in this tunable.  That's one of the reasons I
tried to simplify it: it's just not worth the complexity.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/7] x86: mm: new tunable for single vs full TLB flush
  2014-03-07  1:37     ` Davidlohr Bueso
@ 2014-03-07 17:19       ` Dave Hansen
  -1 siblings, 0 replies; 35+ messages in thread
From: Dave Hansen @ 2014-03-07 17:19 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, akpm, ak, kirill.shutemov, mgorman, alex.shi, x86,
	linux-mm, dave.hansen

On 03/06/2014 05:37 PM, Davidlohr Bueso wrote:
> On Wed, 2014-03-05 at 16:45 -0800, Dave Hansen wrote:
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>> +
>> +If you believe that invlpg is being called too often, you can
>> +lower the tunable:
>> +
>> +	/sys/debug/kernel/x86/tlb_single_page_flush_ceiling
>> +
> 
> Whenever this tunable needs to be updated, most users will not know what
> a invlpg is and won't think in terms of pages either. How about making
> this in units of Kb instead? But then again most of those users won't be
> looking into tlb flushing issues anyways, so...

Yeah, talking about the instruction directly in the documentation is
probably going a bit far.  I'll see if I can uplevel it a bit.

It's obviously not a big deal to change it to be pages vs. kb, but for
something that's as *COMPLETELY* developer-focused, I think we can keep
it in pages.  We don't want users fooling with this.

> While obvious, tt should also mention that this does not apply to
> hugepages.

Good point.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/7] x86: mm: new tunable for single vs full TLB flush
@ 2014-03-07 17:19       ` Dave Hansen
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Hansen @ 2014-03-07 17:19 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, akpm, ak, kirill.shutemov, mgorman, alex.shi, x86,
	linux-mm, dave.hansen

On 03/06/2014 05:37 PM, Davidlohr Bueso wrote:
> On Wed, 2014-03-05 at 16:45 -0800, Dave Hansen wrote:
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>> +
>> +If you believe that invlpg is being called too often, you can
>> +lower the tunable:
>> +
>> +	/sys/debug/kernel/x86/tlb_single_page_flush_ceiling
>> +
> 
> Whenever this tunable needs to be updated, most users will not know what
> a invlpg is and won't think in terms of pages either. How about making
> this in units of Kb instead? But then again most of those users won't be
> looking into tlb flushing issues anyways, so...

Yeah, talking about the instruction directly in the documentation is
probably going a bit far.  I'll see if I can uplevel it a bit.

It's obviously not a big deal to change it to be pages vs. kb, but for
something that's as *COMPLETELY* developer-focused, I think we can keep
it in pages.  We don't want users fooling with this.

> While obvious, tt should also mention that this does not apply to
> hugepages.

Good point.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/7] x86: mm: set TLB flush tunable to sane value
  2014-03-07 17:15       ` Dave Hansen
@ 2014-03-08  0:28         ` Davidlohr Bueso
  -1 siblings, 0 replies; 35+ messages in thread
From: Davidlohr Bueso @ 2014-03-08  0:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, akpm, ak, kirill.shutemov, mgorman, alex.shi, x86,
	linux-mm, dave.hansen

On Fri, 2014-03-07 at 09:15 -0800, Dave Hansen wrote:
> On 03/06/2014 05:55 PM, Davidlohr Bueso wrote:
> > On Wed, 2014-03-05 at 16:45 -0800, Dave Hansen wrote:
> >> From: Dave Hansen <dave.hansen@linux.intel.com>
> >>
> >> Now that we have some shiny new tracepoints, we can actually
> >> figure out what the heck is going on.
> >>
> >> During a kernel compile, 60% of the flush_tlb_mm_range() calls
> >> are for a single page.  It breaks down like this:
> > 
> > It would be interesting to see similar data for opposite workloads with
> > more random access patterns. That's normally when things start getting
> > fun in the tlb world.
> 
> First of all, thanks for testing.  It's much appreciated!
> 
> Any suggestions for opposite workloads?

I was actually thinking of ebizzy as well.

> I've seen this tunable have really heavy effects on ebizzy.  It fits
> almost entirely within the itlb and if we are doing full flushes, it
> eats the itlb and increases the misses about 10x.  Even putting this
> tunable above 500 pages (which is pretty insane) didn't help it.

Interesting, I didn't expect the misses to be as severe. So I guess what
you say is that this issue is seen even with how we currently have
things.


> Things that thrash the TLB don't really care if someone invalidates
> their TLB since they're thrashing it anyway.

That's a really good point.

> I've had a really hard time finding workloads that _care_ or are
> affected by small changes in this tunable.  That's one of the reasons I
> tried to simplify it: it's just not worth the complexity.

I agree, since we aren't seeing much performance differences anyway I
guess it simply doesn't matter. I can see it perhaps as a factor for
virtualized workloads in the pre-tagged tlb era but not so much
nowadays. In any case I've also asked a colleague to see if he can
produce any interesting results with this patchset on his kvm workloads
but don't expect many surprises.

So all in all I definitely like this cleanup, and things are simplified
significantly without any apparent performance hits. The justification
for the ceiling being 33 seems pretty prudent, and heck, it can be
modified anyway by users. An additional suggestion would be to comment
this magic number in the code.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/7] x86: mm: set TLB flush tunable to sane value
@ 2014-03-08  0:28         ` Davidlohr Bueso
  0 siblings, 0 replies; 35+ messages in thread
From: Davidlohr Bueso @ 2014-03-08  0:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, akpm, ak, kirill.shutemov, mgorman, alex.shi, x86,
	linux-mm, dave.hansen

On Fri, 2014-03-07 at 09:15 -0800, Dave Hansen wrote:
> On 03/06/2014 05:55 PM, Davidlohr Bueso wrote:
> > On Wed, 2014-03-05 at 16:45 -0800, Dave Hansen wrote:
> >> From: Dave Hansen <dave.hansen@linux.intel.com>
> >>
> >> Now that we have some shiny new tracepoints, we can actually
> >> figure out what the heck is going on.
> >>
> >> During a kernel compile, 60% of the flush_tlb_mm_range() calls
> >> are for a single page.  It breaks down like this:
> > 
> > It would be interesting to see similar data for opposite workloads with
> > more random access patterns. That's normally when things start getting
> > fun in the tlb world.
> 
> First of all, thanks for testing.  It's much appreciated!
> 
> Any suggestions for opposite workloads?

I was actually thinking of ebizzy as well.

> I've seen this tunable have really heavy effects on ebizzy.  It fits
> almost entirely within the itlb and if we are doing full flushes, it
> eats the itlb and increases the misses about 10x.  Even putting this
> tunable above 500 pages (which is pretty insane) didn't help it.

Interesting, I didn't expect the misses to be as severe. So I guess what
you say is that this issue is seen even with how we currently have
things.


> Things that thrash the TLB don't really care if someone invalidates
> their TLB since they're thrashing it anyway.

That's a really good point.

> I've had a really hard time finding workloads that _care_ or are
> affected by small changes in this tunable.  That's one of the reasons I
> tried to simplify it: it's just not worth the complexity.

I agree, since we aren't seeing much performance differences anyway I
guess it simply doesn't matter. I can see it perhaps as a factor for
virtualized workloads in the pre-tagged tlb era but not so much
nowadays. In any case I've also asked a colleague to see if he can
produce any interesting results with this patchset on his kvm workloads
but don't expect many surprises.

So all in all I definitely like this cleanup, and things are simplified
significantly without any apparent performance hits. The justification
for the ceiling being 33 seems pretty prudent, and heck, it can be
modified anyway by users. An additional suggestion would be to comment
this magic number in the code.

Thanks,
Davidlohr

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 6/7] x86: mm: set TLB flush tunable to sane value
  2014-03-10 17:11 Dave Hansen
@ 2014-03-10 17:11   ` Dave Hansen
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Hansen @ 2014-03-10 17:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, ak, kirill.shutemov, mgorman, alex.shi, x86, linux-mm,
	davidlohr, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Now that we have some shiny new tracepoints, we can actually
figure out what the heck is going on.

During a kernel compile, 60% of the flush_tlb_mm_range() calls
are for a single page.  It breaks down like this:

 size   percent  percent<=
  V        V        V
GLOBAL:   2.20%   2.20% avg cycles:  2283
     1:  56.92%  59.12% avg cycles:  1276
     2:  13.78%  72.90% avg cycles:  1505
     3:   8.26%  81.16% avg cycles:  1880
     4:   7.41%  88.58% avg cycles:  2447
     5:   1.73%  90.31% avg cycles:  2358
     6:   1.32%  91.63% avg cycles:  2563
     7:   1.14%  92.77% avg cycles:  2862
     8:   0.62%  93.39% avg cycles:  3542
     9:   0.08%  93.47% avg cycles:  3289
    10:   0.43%  93.90% avg cycles:  3570
    11:   0.20%  94.10% avg cycles:  3767
    12:   0.08%  94.18% avg cycles:  3996
    13:   0.03%  94.20% avg cycles:  4077
    14:   0.02%  94.23% avg cycles:  4836
    15:   0.04%  94.26% avg cycles:  5699
    16:   0.06%  94.32% avg cycles:  5041
    17:   0.57%  94.89% avg cycles:  5473
    18:   0.02%  94.91% avg cycles:  5396
    19:   0.03%  94.95% avg cycles:  5296
    20:   0.02%  94.96% avg cycles:  6749
    21:   0.18%  95.14% avg cycles:  6225
    22:   0.01%  95.15% avg cycles:  6393
    23:   0.01%  95.16% avg cycles:  6861
    24:   0.12%  95.28% avg cycles:  6912
    25:   0.05%  95.32% avg cycles:  7190
    26:   0.01%  95.33% avg cycles:  7793
    27:   0.01%  95.34% avg cycles:  7833
    28:   0.01%  95.35% avg cycles:  8253
    29:   0.08%  95.42% avg cycles:  8024
    30:   0.03%  95.45% avg cycles:  9670
    31:   0.01%  95.46% avg cycles:  8949
    32:   0.01%  95.46% avg cycles:  9350
    33:   3.11%  98.57% avg cycles:  8534
    34:   0.02%  98.60% avg cycles: 10977
    35:   0.02%  98.62% avg cycles: 11400

We get into diminishing returns pretty quickly.  On pre-IvyBridge
CPUs, we used to set the limit at 8 pages, and it was set at 128
on IvyBridge.  That 128 number looks pretty silly considering that
less than 0.5% of the flushes are that large.

The previous code tried to size this number based on the size of
the TLB.  Good idea, but it's error-prone, needs maintenance
(which it didn't get up to now), and probably would not matter in
practice much.

Setting it to 33 means that we cover the mallopt
M_TRIM_THRESHOLD, which is the most universally common size to do
flushes.

That's the short version.  Here's the long one for why I chose 33:

1. These numbers have a constant bias in the timestamps from the
   tracing.  Probably counts for a couple hundred cycles in each of
   these tests, but it should be fairly _even_ across all of them.
   The smallest delta between the tracepoints I have ever seen is
   335 cycles.  This is one reason the cycles/page cost goes down in
   general as the flushes get larger.  The true cost is nearer to
   100 cycles.
2. A full flush is more expensive than a single invlpg, but not
   by much (single percentages).
3. A dtlb miss is 17.1ns (~45 cycles) and an itlb miss is 13.0ns
   (~34 cycles).  At those rates, refilling the 512-entry dTLB takes
   22,000 cycles.
4. 22,000 cycles is approximately the equivalent of doing 85
   invlpg operations.  But, the odds are that the TLB can
   actually be filled up faster than that because TLB misses that
   are close in time also tend to leverage the same caches.
6. ~98% of flushes are <=33 pages.  There are a lot of flushes of
   33 pages, probably because libc's M_TRIM_THRESHOLD is set to
   128k (32 pages)
7. I've found no consistent data to support changing the IvyBridge
   vs. SandyBridge tunable by a factor of 16

I used the performance counters on this hardware (IvyBridge i5-3320M)
to figure out the tlb miss costs:

ocperf.py stat -e dtlb_load_misses.walk_duration,dtlb_load_misses.walk_completed,dtlb_store_misses.walk_duration,dtlb_store_misses.walk_completed,itlb_misses.walk_duration,itlb_misses.walk_completed,itlb.itlb_flush

     7,720,030,970      dtlb_load_misses_walk_duration                                    [57.13%]
       169,856,353      dtlb_load_misses_walk_completed                                    [57.15%]
       708,832,859      dtlb_store_misses_walk_duration                                    [57.17%]
        19,346,823      dtlb_store_misses_walk_completed                                    [57.17%]
     2,779,687,402      itlb_misses_walk_duration                                    [57.15%]
        82,241,148      itlb_misses_walk_completed                                    [57.13%]
           770,717      itlb_itlb_flush                                              [57.11%]

These show that a dtlb miss is 17.1ns (~45 cycles) and an itlb miss is 13.0ns
(~34 cycles).  At those rates, refilling the 512-entry dTLB takes
22,000 cycles.  On a SandyBridge system with more cores and larger
caches, those are dtlb=13.4ns and itlb=9.5ns.

cat perf.stat.txt | perl -pe 's/,//g' \
	| awk '/itlb_misses_walk_duration/ { icyc+=$1 }
		/itlb_misses_walk_completed/ { imiss+=$1 }
		/dtlb_.*_walk_duration/ { dcyc+=$1 }
		/dtlb_.*.*completed/ { dmiss+=$1 }
		END {print "itlb cyc/miss: ", icyc/imiss, " dtlb cyc/miss: ", dcyc/dmiss, "   -----    ", icyc,imiss, dcyc,dmiss }'

On Westmere CPUs, the counters to use are: itlb_flush,itlb_misses.walk_cycles,itlb_misses.any,dtlb_misses.walk_cycles,dtlb_misses.any
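
For example, the equivalent measurement on such a system might look
something like this (sketch; the '-a sleep 10' system-wide sampling
window is an assumption, the event names are the ones listed above):

	ocperf.py stat -e itlb_flush,itlb_misses.walk_cycles,itlb_misses.any,dtlb_misses.walk_cycles,dtlb_misses.any -a sleep 10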

The assumptions that this code went in under:
https://lkml.org/lkml/2012/6/12/119 say that a flush and a refill are
about 100ns.  Being generous, that is over by a factor of 6 on the
refill side, although it is fairly close on the cost of an invlpg.
An increase of a single invlpg operation seems to lengthen the flush
range operation by about 200 cycles.  Here is one example of the data
collected for flushing 10 and 11 pages (full data are below):

    10:   0.43%  93.90% avg cycles:  3570 cycles/page:  357 samples: 4714
    11:   0.20%  94.10% avg cycles:  3767 cycles/page:  342 samples: 2145

How to generate this table:

	echo 10000 > /sys/kernel/debug/tracing/buffer_size_kb
	echo x86-tsc > /sys/kernel/debug/tracing/trace_clock
	echo 'reason != 0' > /sys/kernel/debug/tracing/events/tlb/tlb_flush/filter 
	echo 1 > /sys/kernel/debug/tracing/events/tlb/tlb_flush/enable

Pipe the trace output in to this script:

	http://sr71.net/~dave/intel/201402-tlb/trace-time-diff-process.pl.txt
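
In other words, roughly (illustrative only; this assumes the script
from the URL above reads the raw ftrace output on stdin):

	# capture the tlb_flush events while the workload runs
	cat /sys/kernel/debug/tracing/trace_pipe > tlb-trace.txt &
	# ... run the workload (e.g. a kernel compile), then kill the cat ...
	./trace-time-diff-process.pl < tlb-trace.txt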

Note that these data were gathered with the invlpg threshold set to
150 pages.  Only data points with >=50 samples were printed:

Flush    % of     %<=
in       flush    this
pages      es     size
------------------------------------------------------------------------------
    -1:   2.20%   2.20% avg cycles:  2283 cycles/page: xxxx samples: 23960
     1:  56.92%  59.12% avg cycles:  1276 cycles/page: 1276 samples: 620895
     2:  13.78%  72.90% avg cycles:  1505 cycles/page:  752 samples: 150335
     3:   8.26%  81.16% avg cycles:  1880 cycles/page:  626 samples: 90131
     4:   7.41%  88.58% avg cycles:  2447 cycles/page:  611 samples: 80877
     5:   1.73%  90.31% avg cycles:  2358 cycles/page:  471 samples: 18885
     6:   1.32%  91.63% avg cycles:  2563 cycles/page:  427 samples: 14397
     7:   1.14%  92.77% avg cycles:  2862 cycles/page:  408 samples: 12441
     8:   0.62%  93.39% avg cycles:  3542 cycles/page:  442 samples: 6721
     9:   0.08%  93.47% avg cycles:  3289 cycles/page:  365 samples: 917
    10:   0.43%  93.90% avg cycles:  3570 cycles/page:  357 samples: 4714
    11:   0.20%  94.10% avg cycles:  3767 cycles/page:  342 samples: 2145
    12:   0.08%  94.18% avg cycles:  3996 cycles/page:  333 samples: 864
    13:   0.03%  94.20% avg cycles:  4077 cycles/page:  313 samples: 289
    14:   0.02%  94.23% avg cycles:  4836 cycles/page:  345 samples: 236
    15:   0.04%  94.26% avg cycles:  5699 cycles/page:  379 samples: 390
    16:   0.06%  94.32% avg cycles:  5041 cycles/page:  315 samples: 643
    17:   0.57%  94.89% avg cycles:  5473 cycles/page:  321 samples: 6229
    18:   0.02%  94.91% avg cycles:  5396 cycles/page:  299 samples: 224
    19:   0.03%  94.95% avg cycles:  5296 cycles/page:  278 samples: 367
    20:   0.02%  94.96% avg cycles:  6749 cycles/page:  337 samples: 185
    21:   0.18%  95.14% avg cycles:  6225 cycles/page:  296 samples: 1964
    22:   0.01%  95.15% avg cycles:  6393 cycles/page:  290 samples: 83
    23:   0.01%  95.16% avg cycles:  6861 cycles/page:  298 samples: 61
    24:   0.12%  95.28% avg cycles:  6912 cycles/page:  288 samples: 1307
    25:   0.05%  95.32% avg cycles:  7190 cycles/page:  287 samples: 533
    26:   0.01%  95.33% avg cycles:  7793 cycles/page:  299 samples: 94
    27:   0.01%  95.34% avg cycles:  7833 cycles/page:  290 samples: 66
    28:   0.01%  95.35% avg cycles:  8253 cycles/page:  294 samples: 73
    29:   0.08%  95.42% avg cycles:  8024 cycles/page:  276 samples: 846
    30:   0.03%  95.45% avg cycles:  9670 cycles/page:  322 samples: 296
    31:   0.01%  95.46% avg cycles:  8949 cycles/page:  288 samples: 79
    32:   0.01%  95.46% avg cycles:  9350 cycles/page:  292 samples: 60
    33:   3.11%  98.57% avg cycles:  8534 cycles/page:  258 samples: 33936
    34:   0.02%  98.60% avg cycles: 10977 cycles/page:  322 samples: 268
    35:   0.02%  98.62% avg cycles: 11400 cycles/page:  325 samples: 177
    36:   0.01%  98.63% avg cycles: 11504 cycles/page:  319 samples: 161
    37:   0.02%  98.65% avg cycles: 11596 cycles/page:  313 samples: 182
    38:   0.02%  98.66% avg cycles: 11850 cycles/page:  311 samples: 195
    39:   0.01%  98.68% avg cycles: 12158 cycles/page:  311 samples: 128
    40:   0.01%  98.68% avg cycles: 11626 cycles/page:  290 samples: 78
    41:   0.04%  98.73% avg cycles: 11435 cycles/page:  278 samples: 477
    42:   0.01%  98.73% avg cycles: 12571 cycles/page:  299 samples: 74
    43:   0.01%  98.74% avg cycles: 12562 cycles/page:  292 samples: 78
    44:   0.01%  98.75% avg cycles: 12991 cycles/page:  295 samples: 108
    45:   0.01%  98.76% avg cycles: 13169 cycles/page:  292 samples: 78
    46:   0.02%  98.78% avg cycles: 12891 cycles/page:  280 samples: 261
    47:   0.01%  98.79% avg cycles: 13099 cycles/page:  278 samples: 67
    48:   0.01%  98.80% avg cycles: 13851 cycles/page:  288 samples: 77
    49:   0.01%  98.80% avg cycles: 13749 cycles/page:  280 samples: 66
    50:   0.01%  98.81% avg cycles: 13949 cycles/page:  278 samples: 73
    52:   0.00%  98.82% avg cycles: 14243 cycles/page:  273 samples: 52
    54:   0.01%  98.83% avg cycles: 15312 cycles/page:  283 samples: 87
    55:   0.01%  98.84% avg cycles: 15197 cycles/page:  276 samples: 109
    56:   0.02%  98.86% avg cycles: 15234 cycles/page:  272 samples: 208
    57:   0.00%  98.86% avg cycles: 14888 cycles/page:  261 samples: 53
    58:   0.01%  98.87% avg cycles: 15037 cycles/page:  259 samples: 59
    59:   0.01%  98.87% avg cycles: 15752 cycles/page:  266 samples: 63
    62:   0.00%  98.89% avg cycles: 16222 cycles/page:  261 samples: 54
    64:   0.02%  98.91% avg cycles: 17179 cycles/page:  268 samples: 248
    65:   0.12%  99.03% avg cycles: 18762 cycles/page:  288 samples: 1324
    85:   0.00%  99.10% avg cycles: 21649 cycles/page:  254 samples: 50
   127:   0.01%  99.18% avg cycles: 32397 cycles/page:  255 samples: 75
   128:   0.13%  99.31% avg cycles: 31711 cycles/page:  247 samples: 1466
   129:   0.18%  99.49% avg cycles: 33017 cycles/page:  255 samples: 1927
   181:   0.33%  99.84% avg cycles:  2489 cycles/page:   13 samples: 3547
   256:   0.05%  99.91% avg cycles:  2305 cycles/page:    9 samples: 550
   512:   0.03%  99.95% avg cycles:  2133 cycles/page:    4 samples: 304
  1512:   0.01%  99.99% avg cycles:  3038 cycles/page:    2 samples: 65

Here are the tlb counters during a 10-second slice of a kernel compile
for a SandyBridge system.  It's better than IvyBridge, but probably
due to the larger caches since this was one of the 'X' extreme parts.

    10,873,007,282      dtlb_load_misses_walk_duration
       250,711,333      dtlb_load_misses_walk_completed
     1,212,395,865      dtlb_store_misses_walk_duration
        31,615,772      dtlb_store_misses_walk_completed
     5,091,010,274      itlb_misses_walk_duration
       163,193,511      itlb_misses_walk_completed
         1,321,980      itlb_itlb_flush

      10.008045158 seconds time elapsed

# cat perf.stat.1392743721.txt | perl -pe 's/,//g' | awk '/itlb_misses_walk_duration/ { icyc+=$1 } /itlb_misses_walk_completed/ { imiss+=$1 } /dtlb_.*_walk_duration/ { dcyc+=$1 } /dtlb_.*.*completed/ { dmiss+=$1 } END {print "itlb ns/miss: ", icyc/imiss/3.3, " dtlb ns/miss: ", dcyc/dmiss/3.3, "   -----    ", icyc,imiss, dcyc,dmiss }'
itlb ns/miss:  9.45338  dtlb ns/miss:  12.9716

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/mm/tlb.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN arch/x86/mm/tlb.c~set-tunable-to-sane-value arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~set-tunable-to-sane-value	2014-03-05 16:10:11.005111495 -0800
+++ b/arch/x86/mm/tlb.c	2014-03-05 16:10:11.009111678 -0800
@@ -165,7 +165,7 @@ void flush_tlb_current_task(void)
 }
 
 /* in units of pages */
-unsigned long tlb_single_page_flush_ceiling = 1;
+unsigned long tlb_single_page_flush_ceiling = 33;
 
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned long vmflag)
_
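
Runtime sanity check after applying this (sketch; uses the debugfs
path from the documentation patch, which may be mounted elsewhere):

	# should now read 33 by default
	cat /sys/debug/kernel/x86/tlb_single_page_flush_ceiling
	# lowering it to 0 falls back to full flushes only, per patch 5/7
	echo 0 > /sys/debug/kernel/x86/tlb_single_page_flush_ceiling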

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 6/7] x86: mm: set TLB flush tunable to sane value
@ 2014-03-10 17:11   ` Dave Hansen
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Hansen @ 2014-03-10 17:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, ak, kirill.shutemov, mgorman, alex.shi, x86, linux-mm,
	davidlohr, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Now that we have some shiny new tracepoints, we can actually
figure out what the heck is going on.

During a kernel compile, 60% of the flush_tlb_mm_range() calls
are for a single page.  It breaks down like this:

 size   percent  percent<=
  V        V        V
GLOBAL:   2.20%   2.20% avg cycles:  2283
     1:  56.92%  59.12% avg cycles:  1276
     2:  13.78%  72.90% avg cycles:  1505
     3:   8.26%  81.16% avg cycles:  1880
     4:   7.41%  88.58% avg cycles:  2447
     5:   1.73%  90.31% avg cycles:  2358
     6:   1.32%  91.63% avg cycles:  2563
     7:   1.14%  92.77% avg cycles:  2862
     8:   0.62%  93.39% avg cycles:  3542
     9:   0.08%  93.47% avg cycles:  3289
    10:   0.43%  93.90% avg cycles:  3570
    11:   0.20%  94.10% avg cycles:  3767
    12:   0.08%  94.18% avg cycles:  3996
    13:   0.03%  94.20% avg cycles:  4077
    14:   0.02%  94.23% avg cycles:  4836
    15:   0.04%  94.26% avg cycles:  5699
    16:   0.06%  94.32% avg cycles:  5041
    17:   0.57%  94.89% avg cycles:  5473
    18:   0.02%  94.91% avg cycles:  5396
    19:   0.03%  94.95% avg cycles:  5296
    20:   0.02%  94.96% avg cycles:  6749
    21:   0.18%  95.14% avg cycles:  6225
    22:   0.01%  95.15% avg cycles:  6393
    23:   0.01%  95.16% avg cycles:  6861
    24:   0.12%  95.28% avg cycles:  6912
    25:   0.05%  95.32% avg cycles:  7190
    26:   0.01%  95.33% avg cycles:  7793
    27:   0.01%  95.34% avg cycles:  7833
    28:   0.01%  95.35% avg cycles:  8253
    29:   0.08%  95.42% avg cycles:  8024
    30:   0.03%  95.45% avg cycles:  9670
    31:   0.01%  95.46% avg cycles:  8949
    32:   0.01%  95.46% avg cycles:  9350
    33:   3.11%  98.57% avg cycles:  8534
    34:   0.02%  98.60% avg cycles: 10977
    35:   0.02%  98.62% avg cycles: 11400

We get into diminishing returns pretty quickly.  On pre-IvyBridge
CPUs, we used to set the limit at 8 pages, and it was set at 128
on IvyBridge.  That 128 number looks pretty silly considering that
less than 0.5% of the flushes are that large.

The previous code tried to size this number based on the size of
the TLB.  Good idea, but it's error-prone, needs maintenance
(which it didn't get up to now), and probably would not matter in
practice much.

Setting it to 33 means that we cover the mallopt
M_TRIM_THRESHOLD, which is the most universally common size to do
flushes.

That's the short version.  Here's the long one for why I chose 33:

1. These numbers have a constant bias in the timestamps from the
   tracing.  Probably counts for a couple hundred cycles in each of
   these tests, but it should be fairly _even_ across all of them.
   The smallest delta between the tracepoints I have ever seen is
   335 cycles.  This is one reason the cycles/page cost goes down in
   general as the flushes get larger.  The true cost is nearer to
   100 cycles.
2. A full flush is more expensive than a single invlpg, but not
   by much (single percentages).
3. A dtlb miss is 17.1ns (~45 cycles) and an itlb miss is 13.0ns
   (~34 cycles).  At those rates, refilling the 512-entry dTLB takes
   22,000 cycles.
4. 22,000 cycles is approximately the equivalent of doing 85
   invlpg operations.  But, the odds are that the TLB can
   actually be filled up faster than that because TLB misses that
   are close in time also tend to leverage the same caches.
6. ~98% of flushes are <=33 pages.  There are a lot of flushes of
   33 pages, probably because libc's M_TRIM_THRESHOLD is set to
   128k (32 pages)
7. I've found no consistent data to support changing the IvyBridge
   vs. SandyBridge tunable by a factor of 16

I used the performance counters on this hardware (IvyBridge i5-3320M)
to figure out the tlb miss costs:

ocperf.py stat -e dtlb_load_misses.walk_duration,dtlb_load_misses.walk_completed,dtlb_store_misses.walk_duration,dtlb_store_misses.walk_completed,itlb_misses.walk_duration,itlb_misses.walk_completed,itlb.itlb_flush

     7,720,030,970      dtlb_load_misses_walk_duration                                    [57.13%]
       169,856,353      dtlb_load_misses_walk_completed                                    [57.15%]
       708,832,859      dtlb_store_misses_walk_duration                                    [57.17%]
        19,346,823      dtlb_store_misses_walk_completed                                    [57.17%]
     2,779,687,402      itlb_misses_walk_duration                                    [57.15%]
        82,241,148      itlb_misses_walk_completed                                    [57.13%]
           770,717      itlb_itlb_flush                                              [57.11%]

These show that a dtlb miss is 17.1ns (~45 cycles) and an itlb miss is 13.0ns
(~34 cycles).  At those rates, refilling the 512-entry dTLB takes
22,000 cycles.  On a SandyBridge system with more cores and larger
caches, those are dtlb=13.4ns and itlb=9.5ns.

cat perf.stat.txt | perl -pe 's/,//g' \
	| awk '/itlb_misses_walk_duration/ { icyc+=$1 }
		/itlb_misses_walk_completed/ { imiss+=$1 }
		/dtlb_.*_walk_duration/ { dcyc+=$1 }
		/dtlb_.*.*completed/ { dmiss+=$1 }
		END {print "itlb cyc/miss: ", icyc/imiss, " dtlb cyc/miss: ", dcyc/dmiss, "   -----    ", icyc,imiss, dcyc,dmiss }'

On Westmere CPUs, the counters to use are: itlb_flush,itlb_misses.walk_cycles,itlb_misses.any,dtlb_misses.walk_cycles,dtlb_misses.any

The assumptions that this code went in under:
https://lkml.org/lkml/2012/6/12/119 say that a flush and a refill are
about 100ns.  Being generous, that is over by a factor of 6 on the
refill side, although it is fairly close on the cost of an invlpg.
An increase of a single invlpg operation seems to lengthen the flush
range operation by about 200 cycles.  Here is one example of the data
collected for flushing 10 and 11 pages (full data are below):

    10:   0.43%  93.90% avg cycles:  3570 cycles/page:  357 samples: 4714
    11:   0.20%  94.10% avg cycles:  3767 cycles/page:  342 samples: 2145

How to generate this table:

	echo 10000 > /sys/kernel/debug/tracing/buffer_size_kb
	echo x86-tsc > /sys/kernel/debug/tracing/trace_clock
	echo 'reason != 0' > /sys/kernel/debug/tracing/events/tlb/tlb_flush/filter 
	echo 1 > /sys/kernel/debug/tracing/events/tlb/tlb_flush/enable

Pipe the trace output in to this script:

	http://sr71.net/~dave/intel/201402-tlb/trace-time-diff-process.pl.txt

Note that these data were gathered with the invlpg threshold set to
150 pages.  Only data points with >=50 samples were printed:

Flush    % of     %<=
in       flush    this
pages      es     size
------------------------------------------------------------------------------
    -1:   2.20%   2.20% avg cycles:  2283 cycles/page: xxxx samples: 23960
     1:  56.92%  59.12% avg cycles:  1276 cycles/page: 1276 samples: 620895
     2:  13.78%  72.90% avg cycles:  1505 cycles/page:  752 samples: 150335
     3:   8.26%  81.16% avg cycles:  1880 cycles/page:  626 samples: 90131
     4:   7.41%  88.58% avg cycles:  2447 cycles/page:  611 samples: 80877
     5:   1.73%  90.31% avg cycles:  2358 cycles/page:  471 samples: 18885
     6:   1.32%  91.63% avg cycles:  2563 cycles/page:  427 samples: 14397
     7:   1.14%  92.77% avg cycles:  2862 cycles/page:  408 samples: 12441
     8:   0.62%  93.39% avg cycles:  3542 cycles/page:  442 samples: 6721
     9:   0.08%  93.47% avg cycles:  3289 cycles/page:  365 samples: 917
    10:   0.43%  93.90% avg cycles:  3570 cycles/page:  357 samples: 4714
    11:   0.20%  94.10% avg cycles:  3767 cycles/page:  342 samples: 2145
    12:   0.08%  94.18% avg cycles:  3996 cycles/page:  333 samples: 864
    13:   0.03%  94.20% avg cycles:  4077 cycles/page:  313 samples: 289
    14:   0.02%  94.23% avg cycles:  4836 cycles/page:  345 samples: 236
    15:   0.04%  94.26% avg cycles:  5699 cycles/page:  379 samples: 390
    16:   0.06%  94.32% avg cycles:  5041 cycles/page:  315 samples: 643
    17:   0.57%  94.89% avg cycles:  5473 cycles/page:  321 samples: 6229
    18:   0.02%  94.91% avg cycles:  5396 cycles/page:  299 samples: 224
    19:   0.03%  94.95% avg cycles:  5296 cycles/page:  278 samples: 367
    20:   0.02%  94.96% avg cycles:  6749 cycles/page:  337 samples: 185
    21:   0.18%  95.14% avg cycles:  6225 cycles/page:  296 samples: 1964
    22:   0.01%  95.15% avg cycles:  6393 cycles/page:  290 samples: 83
    23:   0.01%  95.16% avg cycles:  6861 cycles/page:  298 samples: 61
    24:   0.12%  95.28% avg cycles:  6912 cycles/page:  288 samples: 1307
    25:   0.05%  95.32% avg cycles:  7190 cycles/page:  287 samples: 533
    26:   0.01%  95.33% avg cycles:  7793 cycles/page:  299 samples: 94
    27:   0.01%  95.34% avg cycles:  7833 cycles/page:  290 samples: 66
    28:   0.01%  95.35% avg cycles:  8253 cycles/page:  294 samples: 73
    29:   0.08%  95.42% avg cycles:  8024 cycles/page:  276 samples: 846
    30:   0.03%  95.45% avg cycles:  9670 cycles/page:  322 samples: 296
    31:   0.01%  95.46% avg cycles:  8949 cycles/page:  288 samples: 79
    32:   0.01%  95.46% avg cycles:  9350 cycles/page:  292 samples: 60
    33:   3.11%  98.57% avg cycles:  8534 cycles/page:  258 samples: 33936
    34:   0.02%  98.60% avg cycles: 10977 cycles/page:  322 samples: 268
    35:   0.02%  98.62% avg cycles: 11400 cycles/page:  325 samples: 177
    36:   0.01%  98.63% avg cycles: 11504 cycles/page:  319 samples: 161
    37:   0.02%  98.65% avg cycles: 11596 cycles/page:  313 samples: 182
    38:   0.02%  98.66% avg cycles: 11850 cycles/page:  311 samples: 195
    39:   0.01%  98.68% avg cycles: 12158 cycles/page:  311 samples: 128
    40:   0.01%  98.68% avg cycles: 11626 cycles/page:  290 samples: 78
    41:   0.04%  98.73% avg cycles: 11435 cycles/page:  278 samples: 477
    42:   0.01%  98.73% avg cycles: 12571 cycles/page:  299 samples: 74
    43:   0.01%  98.74% avg cycles: 12562 cycles/page:  292 samples: 78
    44:   0.01%  98.75% avg cycles: 12991 cycles/page:  295 samples: 108
    45:   0.01%  98.76% avg cycles: 13169 cycles/page:  292 samples: 78
    46:   0.02%  98.78% avg cycles: 12891 cycles/page:  280 samples: 261
    47:   0.01%  98.79% avg cycles: 13099 cycles/page:  278 samples: 67
    48:   0.01%  98.80% avg cycles: 13851 cycles/page:  288 samples: 77
    49:   0.01%  98.80% avg cycles: 13749 cycles/page:  280 samples: 66
    50:   0.01%  98.81% avg cycles: 13949 cycles/page:  278 samples: 73
    52:   0.00%  98.82% avg cycles: 14243 cycles/page:  273 samples: 52
    54:   0.01%  98.83% avg cycles: 15312 cycles/page:  283 samples: 87
    55:   0.01%  98.84% avg cycles: 15197 cycles/page:  276 samples: 109
    56:   0.02%  98.86% avg cycles: 15234 cycles/page:  272 samples: 208
    57:   0.00%  98.86% avg cycles: 14888 cycles/page:  261 samples: 53
    58:   0.01%  98.87% avg cycles: 15037 cycles/page:  259 samples: 59
    59:   0.01%  98.87% avg cycles: 15752 cycles/page:  266 samples: 63
    62:   0.00%  98.89% avg cycles: 16222 cycles/page:  261 samples: 54
    64:   0.02%  98.91% avg cycles: 17179 cycles/page:  268 samples: 248
    65:   0.12%  99.03% avg cycles: 18762 cycles/page:  288 samples: 1324
    85:   0.00%  99.10% avg cycles: 21649 cycles/page:  254 samples: 50
   127:   0.01%  99.18% avg cycles: 32397 cycles/page:  255 samples: 75
   128:   0.13%  99.31% avg cycles: 31711 cycles/page:  247 samples: 1466
   129:   0.18%  99.49% avg cycles: 33017 cycles/page:  255 samples: 1927
   181:   0.33%  99.84% avg cycles:  2489 cycles/page:   13 samples: 3547
   256:   0.05%  99.91% avg cycles:  2305 cycles/page:    9 samples: 550
   512:   0.03%  99.95% avg cycles:  2133 cycles/page:    4 samples: 304
  1512:   0.01%  99.99% avg cycles:  3038 cycles/page:    2 samples: 65

Here are the tlb counters during a 10-second slice of a kernel compile
for a SandyBridge system.  It's better than IvyBridge, but probably
due to the larger caches since this was one of the 'X' extreme parts.

    10,873,007,282      dtlb_load_misses_walk_duration
       250,711,333      dtlb_load_misses_walk_completed
     1,212,395,865      dtlb_store_misses_walk_duration
        31,615,772      dtlb_store_misses_walk_completed
     5,091,010,274      itlb_misses_walk_duration
       163,193,511      itlb_misses_walk_completed
         1,321,980      itlb_itlb_flush

      10.008045158 seconds time elapsed

# cat perf.stat.1392743721.txt | perl -pe 's/,//g' | awk '/itlb_misses_walk_duration/ { icyc+=$1 } /itlb_misses_walk_completed/ { imiss+=$1 } /dtlb_.*_walk_duration/ { dcyc+=$1 } /dtlb_.*.*completed/ { dmiss+=$1 } END {print "itlb ns/miss: ", icyc/imiss/3.3, " dtlb ns/miss: ", dcyc/dmiss/3.3, "   -----    ", icyc,imiss, dcyc,dmiss }'
itlb ns/miss:  9.45338  dtlb ns/miss:  12.9716

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/mm/tlb.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN arch/x86/mm/tlb.c~set-tunable-to-sane-value arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~set-tunable-to-sane-value	2014-03-05 16:10:11.005111495 -0800
+++ b/arch/x86/mm/tlb.c	2014-03-05 16:10:11.009111678 -0800
@@ -165,7 +165,7 @@ void flush_tlb_current_task(void)
 }
 
 /* in units of pages */
-unsigned long tlb_single_page_flush_ceiling = 1;
+unsigned long tlb_single_page_flush_ceiling = 33;
 
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned long vmflag)
_

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2014-03-10 17:11 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-06  0:45 [PATCH 0/7] x86: rework tlb range flushing code Dave Hansen
2014-03-06  0:45 ` Dave Hansen
2014-03-06  0:45 ` [PATCH 1/7] x86: mm: clean up tlb " Dave Hansen
2014-03-06  0:45   ` Dave Hansen
2014-03-07  0:16   ` Davidlohr Bueso
2014-03-07  0:16     ` Davidlohr Bueso
2014-03-07  0:51     ` Davidlohr Bueso
2014-03-07  0:51       ` Davidlohr Bueso
2014-03-07  0:57       ` Eric Boxer
2014-03-06  0:45 ` [PATCH 2/7] x86: mm: rip out complicated, out-of-date, buggy TLB flushing Dave Hansen
2014-03-06  0:45   ` Dave Hansen
2014-03-06  0:45 ` [PATCH 3/7] x86: mm: fix missed global TLB flush stat Dave Hansen
2014-03-06  0:45   ` Dave Hansen
2014-03-06  0:45 ` [PATCH 4/7] x86: mm: trace tlb flushes Dave Hansen
2014-03-06  0:45   ` Dave Hansen
2014-03-06  0:45 ` [PATCH 5/7] x86: mm: new tunable for single vs full TLB flush Dave Hansen
2014-03-06  0:45   ` Dave Hansen
2014-03-07  1:37   ` Davidlohr Bueso
2014-03-07  1:37     ` Davidlohr Bueso
2014-03-07 17:19     ` Dave Hansen
2014-03-07 17:19       ` Dave Hansen
2014-03-06  0:45 ` [PATCH 6/7] x86: mm: set TLB flush tunable to sane value Dave Hansen
2014-03-06  0:45   ` Dave Hansen
2014-03-07  1:55   ` Davidlohr Bueso
2014-03-07  1:55     ` Davidlohr Bueso
2014-03-07 17:15     ` Dave Hansen
2014-03-07 17:15       ` Dave Hansen
2014-03-08  0:28       ` Davidlohr Bueso
2014-03-08  0:28         ` Davidlohr Bueso
2014-03-06  0:45 ` [PATCH 7/7] big time hack: instrument flush times Dave Hansen
2014-03-06  0:45   ` Dave Hansen
2014-03-07  0:15 ` [PATCH 0/7] x86: rework tlb range flushing code Davidlohr Bueso
2014-03-07  0:15   ` Davidlohr Bueso
2014-03-10 17:11 Dave Hansen
2014-03-10 17:11 ` [PATCH 6/7] x86: mm: set TLB flush tunable to sane value Dave Hansen
2014-03-10 17:11   ` Dave Hansen
