* [RFC PATCH 0/3] Fix ebizzy performance regression on IvyBridge due to X86 TLB range flush
@ 2013-12-12 11:55 ` Mel Gorman
  0 siblings, 0 replies; 50+ messages in thread
From: Mel Gorman @ 2013-12-12 11:55 UTC (permalink / raw)
  To: Alex Shi; +Cc: H Peter Anvin, Linux-X86, Linux-MM, LKML, Mel Gorman

I found that ebizzy regressed between 3.4 and 3.10 while testing on a new
machine. Bisection initially found at least two problems of which the
first was commit 611ae8e3 (x86/tlb: enable tlb flush range support for
x86). The second was related to ACPI cpufreq and so it was disabled for
the purposes of this series.

The intent of the TLB range flush patch appeared to be to preserve
existing TLB entries, which makes sense. The decision on whether to do a
full mm flush or a number of single page flushes depends on the size of the
relevant TLB and on the CPU, which presumably determines the cost of a TLB
refill. It is a gamble because the cost of the per-page flushes must be
offset by a reduced TLB miss count. There is no indication of what the cost
of calling invlpg is if there are no TLB entries, and it also does not take
into account how many CPUs these single TLB flushes may have to be executed
on.
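
For reference, the core of the current decision in flush_tlb_mm_range()
is roughly the following. This is only a simplified sketch; the preempt
handling, the ITLB/DTLB selection, the huge page fallback and the vm event
counters are omitted.

	act_entries = mm->total_vm > tlb_entries ? tlb_entries : mm->total_vm;
	if ((end - start) >> PAGE_SHIFT > act_entries >> tlb_flushall_shift) {
		/* flush everything on this CPU */
		local_flush_tlb();
	} else {
		/* flush the range one base page at a time with invlpg */
		for (addr = start; addr < end; addr += PAGE_SIZE)
			__flush_tlb_single(addr);
	}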

Ebizzy sees very little benefit as it discards newly allocated memory very
quickly which is why it appeared to regress so badly. While I'm wary of
optimising for such a benchmark, it's commonly tested and the defaults for
Ivybridge may need to be re-examined.

The following small series restores ebizzy to 3.4-era performance. Is there a
better way of doing this? Bear in mind that I'm testing on a single IvyBridge
machine and there is no guarantee the gain is universal or even relevant.

A kernel build was tested but it is uninteresting as TLB range flushing is
unimportant to it. A page fault benchmark was also tested but it does not
hit the same paths impacted by commit 611ae8e3.

ebizzy
                       3.13.0-rc3                3.4.69            3.13.0-rc3
                          vanilla               vanilla           newdefault-v1
Mean     1      7353.60 (  0.00%)     6782.00 ( -7.77%)     7836.20 (  6.56%)
Mean     2      8120.40 (  0.00%)     8278.80 (  1.95%)     9520.60 ( 17.24%)
Mean     3      8087.80 (  0.00%)     8083.60 ( -0.05%)     9003.80 ( 11.33%)
Mean     4      7919.20 (  0.00%)     7842.60 ( -0.97%)     8680.60 (  9.61%)
Mean     5      7310.60 (  0.00%)     7740.60 (  5.88%)     8273.20 ( 13.17%)
Mean     6      6798.00 (  0.00%)     7720.20 ( 13.57%)     8033.20 ( 18.17%)
Mean     7      6759.40 (  0.00%)     7644.00 ( 13.09%)     7643.80 ( 13.08%)
Mean     8      6501.80 (  0.00%)     7666.40 ( 17.91%)     6944.40 (  6.81%)
Mean     12     6606.00 (  0.00%)     7523.20 ( 13.88%)     7035.80 (  6.51%)
Mean     16     6655.40 (  0.00%)     7287.40 (  9.50%)     7099.20 (  6.67%)
Mean     20     6703.80 (  0.00%)     7152.20 (  6.69%)     7116.60 (  6.16%)
Mean     24     6705.80 (  0.00%)     7014.80 (  4.61%)     7113.60 (  6.08%)
Mean     28     6706.60 (  0.00%)     6940.40 (  3.49%)     7115.20 (  6.09%)
Mean     32     6727.20 (  0.00%)     6878.80 (  2.25%)     7110.80 (  5.70%)
Stddev   1        42.71 (  0.00%)       53.16 (-24.46%)       39.80 (  6.82%)
Stddev   2       250.26 (  0.00%)      150.57 ( 39.84%)       31.94 ( 87.24%)
Stddev   3        71.67 (  0.00%)       69.38 (  3.19%)       84.13 (-17.39%)
Stddev   4        30.25 (  0.00%)       87.06 (-187.82%)       31.80 ( -5.14%)
Stddev   5        71.18 (  0.00%)       25.68 ( 63.92%)      125.24 (-75.95%)
Stddev   6        34.22 (  0.00%)       23.35 ( 31.75%)      124.40 (-263.57%)
Stddev   7       100.59 (  0.00%)      112.83 (-12.17%)       65.07 ( 35.31%)
Stddev   8        20.26 (  0.00%)       43.43 (-114.32%)       48.26 (-138.16%)
Stddev   12       19.43 (  0.00%)       19.73 ( -1.55%)       23.25 (-19.65%)
Stddev   16       14.47 (  0.00%)       26.42 (-82.54%)       17.71 (-22.40%)
Stddev   20       21.37 (  0.00%)       15.97 ( 25.27%)       14.87 ( 30.42%)
Stddev   24       12.87 (  0.00%)       28.12 (-118.44%)       10.46 ( 18.75%)
Stddev   28       13.89 (  0.00%)       17.97 (-29.36%)       12.22 ( 12.04%)
Stddev   32       18.14 (  0.00%)       20.37 (-12.31%)       16.40 (  9.58%)

          3.13.0-rc3      3.4.69  3.13.0-rc3
             vanilla     vanilla newdefault-v1
User          900.27      995.20      947.33
System       1583.41     1680.76     1533.17
Elapsed      2100.78     2100.81     2100.76

This shows the ebizzy comparison between 3.13-rc3, 3.4.69-stable and this series.
The series is not a universal win against 3.4 but the figures are generally better
and system CPU usage is reduced.

 arch/x86/kernel/cpu/intel.c |  2 +-
 arch/x86/mm/tlb.c           | 15 +++++++++------
 2 files changed, 10 insertions(+), 7 deletions(-)

-- 
1.8.4


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 1/3] x86: mm: Clean up inconsistencies when flushing TLB ranges
  2013-12-12 11:55 ` Mel Gorman
@ 2013-12-12 11:55   ` Mel Gorman
  -1 siblings, 0 replies; 50+ messages in thread
From: Mel Gorman @ 2013-12-12 11:55 UTC (permalink / raw)
  To: Alex Shi; +Cc: H Peter Anvin, Linux-X86, Linux-MM, LKML, Mel Gorman

NR_TLB_LOCAL_FLUSH_ALL is not always accounted for correctly and the
comparison with total_vm is done before taking tlb_flushall_shift into
account. Clean it up.
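
As an illustration (taking a 512-entry DTLB purely as an example figure),
a tlb_flushall_shift of 2 puts the balance point at 512 >> 2 = 128 entries.
After this patch a range covering more than 128 base pages, or any range
containing a huge page, takes the full flush path and is accounted as
NR_TLB_LOCAL_FLUSH_ALL; smaller ranges are still flushed page by page.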

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/mm/tlb.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ae699b3..09b8cb8 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -189,6 +189,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 {
 	unsigned long addr;
 	unsigned act_entries, tlb_entries = 0;
+	unsigned long nr_base_pages;
 
 	preempt_disable();
 	if (current->active_mm != mm)
@@ -210,18 +211,17 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 		tlb_entries = tlb_lli_4k[ENTRIES];
 	else
 		tlb_entries = tlb_lld_4k[ENTRIES];
+
 	/* Assume all of TLB entries was occupied by this task */
-	act_entries = mm->total_vm > tlb_entries ? tlb_entries : mm->total_vm;
+	act_entries = tlb_entries >> tlb_flushall_shift;
+	act_entries = mm->total_vm > act_entries ? act_entries : mm->total_vm;
+	nr_base_pages = (end - start) >> PAGE_SHIFT;
 
 	/* tlb_flushall_shift is on balance point, details in commit log */
-	if ((end - start) >> PAGE_SHIFT > act_entries >> tlb_flushall_shift) {
+	if (nr_base_pages > act_entries || has_large_page(mm, start, end)) {
 		count_vm_event(NR_TLB_LOCAL_FLUSH_ALL);
 		local_flush_tlb();
 	} else {
-		if (has_large_page(mm, start, end)) {
-			local_flush_tlb();
-			goto flush_all;
-		}
 		/* flush range by one by one 'invlpg' */
 		for (addr = start; addr < end;	addr += PAGE_SIZE) {
 			count_vm_event(NR_TLB_LOCAL_FLUSH_ONE);
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 2/3] x86: mm: Change tlb_flushall_shift for IvyBridge
  2013-12-12 11:55 ` Mel Gorman
@ 2013-12-12 11:55   ` Mel Gorman
  -1 siblings, 0 replies; 50+ messages in thread
From: Mel Gorman @ 2013-12-12 11:55 UTC (permalink / raw)
  To: Alex Shi; +Cc: H Peter Anvin, Linux-X86, Linux-MM, LKML, Mel Gorman

There was a large performance regression that was bisected to commit 611ae8e3
(x86/tlb: enable tlb flush range support for x86). This patch simply changes
the default balance point between a local and global flush for IvyBridge.
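
With the balance point derived as tlb_entries >> tlb_flushall_shift, moving
the shift from 1 to 2 halves it. Taking a 512-entry DTLB purely as an
illustrative figure, ranges larger than 128 base pages now take a single
full flush where the previous cut-off was 256 pages, so the per-page invlpg
path is used for smaller ranges only.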

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/kernel/cpu/intel.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index dc1ec0d..2d93753 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -627,7 +627,7 @@ static void intel_tlb_flushall_shift_set(struct cpuinfo_x86 *c)
 		tlb_flushall_shift = 5;
 		break;
 	case 0x63a: /* Ivybridge */
-		tlb_flushall_shift = 1;
+		tlb_flushall_shift = 2;
 		break;
 	default:
 		tlb_flushall_shift = 6;
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 3/3] x86: mm: Account for the number of CPUs that must be flushed during a TLB range flush
  2013-12-12 11:55 ` Mel Gorman
@ 2013-12-12 11:55   ` Mel Gorman
  -1 siblings, 0 replies; 50+ messages in thread
From: Mel Gorman @ 2013-12-12 11:55 UTC (permalink / raw)
  To: Alex Shi; +Cc: H Peter Anvin, Linux-X86, Linux-MM, LKML, Mel Gorman

X86 TLB range flushing uses a balance point to decide if a single global TLB
flush or multiple single page flushes would perform best.  This patch takes into
account how many CPUs must be flushed.
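
For example, if the mm is currently active on four CPUs, a flush of 40 base
pages is weighted as 160 pages when compared against the balance point, so
a single full flush becomes more likely than issuing 40 invlpg operations
on each of the four CPUs.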

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/mm/tlb.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 09b8cb8..0cababa 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -217,6 +217,9 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	act_entries = mm->total_vm > act_entries ? act_entries : mm->total_vm;
 	nr_base_pages = (end - start) >> PAGE_SHIFT;
 
+	/* Take the number of CPUs to range flush into account */
+	nr_base_pages *= cpumask_weight(mm_cpumask(mm));
+
 	/* tlb_flushall_shift is on balance point, details in commit log */
 	if (nr_base_pages > act_entries || has_large_page(mm, start, end)) {
 		count_vm_event(NR_TLB_LOCAL_FLUSH_ALL);
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 0/3] Fix ebizzy performance regression on IvyBridge due to X86 TLB range flush
  2013-12-12 11:55 ` Mel Gorman
@ 2013-12-12 13:01   ` Ingo Molnar
  -1 siblings, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2013-12-12 13:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Alex Shi, H Peter Anvin, Linux-X86, Linux-MM, LKML,
	Linus Torvalds, Thomas Gleixner, Peter Zijlstra, Andrew Morton


* Mel Gorman <mgorman@suse.de> wrote:

> I found that ebizzy regressed between 3.4 and 3.10 while testing on a new
> machine. Bisection initially found at least two problems of which the
> first was commit 611ae8e3 (x86/tlb: enable tlb flush range support for
> x86). The second was related to ACPI cpufreq and so it was disabled for
> the purposes of this series.
> 
> The intent of the TLB range flush patch appeared to be to preserve
> existing TLB entries, which makes sense. The decision on whether to do a
> full mm flush or a number of single page flushes depends on the size of the
> relevant TLB and on the CPU, which presumably determines the cost of a TLB
> refill. It is a gamble because the cost of the per-page flushes must be
> offset by a reduced TLB miss count. There is no indication of what the cost
> of calling invlpg is if there are no TLB entries, and it also does not take
> into account how many CPUs these single TLB flushes may have to be executed
> on.
> 
> Ebizzy sees very little benefit as it discards newly allocated memory very
> quickly which is why it appeared to regress so badly. While I'm wary of
> optimising for such a benchmark, it's commonly tested and the defaults for
> Ivybridge may need to be re-examined.
> 
> The following small series restores ebizzy to 3.4-era performance. Is there a
> better way of doing this? Bear in mind that I'm testing on a single IvyBridge
> machine and there is no guarantee the gain is universal or even relevant.
> 
> A kernel build was tested but it is uninteresting as TLB range flushing is
> unimportant to it. A page fault benchmark was also tested but it does not
> hit the same paths impacted by commit 611ae8e3.
> 
> ebizzy
>                        3.13.0-rc3                3.4.69            3.13.0-rc3
>                           vanilla               vanilla           newdefault-v1
> Mean     1      7353.60 (  0.00%)     6782.00 ( -7.77%)     7836.20 (  6.56%)
> Mean     2      8120.40 (  0.00%)     8278.80 (  1.95%)     9520.60 ( 17.24%)
> Mean     3      8087.80 (  0.00%)     8083.60 ( -0.05%)     9003.80 ( 11.33%)
> Mean     4      7919.20 (  0.00%)     7842.60 ( -0.97%)     8680.60 (  9.61%)
> Mean     5      7310.60 (  0.00%)     7740.60 (  5.88%)     8273.20 ( 13.17%)
> Mean     6      6798.00 (  0.00%)     7720.20 ( 13.57%)     8033.20 ( 18.17%)
> Mean     7      6759.40 (  0.00%)     7644.00 ( 13.09%)     7643.80 ( 13.08%)
> Mean     8      6501.80 (  0.00%)     7666.40 ( 17.91%)     6944.40 (  6.81%)
> Mean     12     6606.00 (  0.00%)     7523.20 ( 13.88%)     7035.80 (  6.51%)
> Mean     16     6655.40 (  0.00%)     7287.40 (  9.50%)     7099.20 (  6.67%)
> Mean     20     6703.80 (  0.00%)     7152.20 (  6.69%)     7116.60 (  6.16%)
> Mean     24     6705.80 (  0.00%)     7014.80 (  4.61%)     7113.60 (  6.08%)
> Mean     28     6706.60 (  0.00%)     6940.40 (  3.49%)     7115.20 (  6.09%)
> Mean     32     6727.20 (  0.00%)     6878.80 (  2.25%)     7110.80 (  5.70%)
> Stddev   1        42.71 (  0.00%)       53.16 (-24.46%)       39.80 (  6.82%)
> Stddev   2       250.26 (  0.00%)      150.57 ( 39.84%)       31.94 ( 87.24%)
> Stddev   3        71.67 (  0.00%)       69.38 (  3.19%)       84.13 (-17.39%)
> Stddev   4        30.25 (  0.00%)       87.06 (-187.82%)       31.80 ( -5.14%)
> Stddev   5        71.18 (  0.00%)       25.68 ( 63.92%)      125.24 (-75.95%)
> Stddev   6        34.22 (  0.00%)       23.35 ( 31.75%)      124.40 (-263.57%)
> Stddev   7       100.59 (  0.00%)      112.83 (-12.17%)       65.07 ( 35.31%)
> Stddev   8        20.26 (  0.00%)       43.43 (-114.32%)       48.26 (-138.16%)
> Stddev   12       19.43 (  0.00%)       19.73 ( -1.55%)       23.25 (-19.65%)
> Stddev   16       14.47 (  0.00%)       26.42 (-82.54%)       17.71 (-22.40%)
> Stddev   20       21.37 (  0.00%)       15.97 ( 25.27%)       14.87 ( 30.42%)
> Stddev   24       12.87 (  0.00%)       28.12 (-118.44%)       10.46 ( 18.75%)
> Stddev   28       13.89 (  0.00%)       17.97 (-29.36%)       12.22 ( 12.04%)
> Stddev   32       18.14 (  0.00%)       20.37 (-12.31%)       16.40 (  9.58%)
> 
>           3.13.0-rc3      3.4.69  3.13.0-rc3
>              vanilla     vanilla newdefault-v1
> User          900.27      995.20      947.33
> System       1583.41     1680.76     1533.17
> Elapsed      2100.78     2100.81     2100.76
> 
> This shows the ebizzy comparison between 3.13-rc3, 3.4.69-stable and this series.
> The series is not a universal win against 3.4 but the figures are generally better
> and system CPU usage is reduced.

I think you found a real bug and I definitely agree that we want to 
fix it - the TLB range optimization was supposed to be a (nearly) 
unconditional win, so a regression like this is totally not acceptable 
IMHO.

Please help me out with interpreting the numbers. The Ebizzy table:

> ebizzy
>                        3.13.0-rc3                3.4.69            3.13.0-rc3
>                           vanilla               vanilla           newdefault-v1
> Mean     1      7353.60 (  0.00%)     6782.00 ( -7.77%)     7836.20 (  6.56%)
> Mean     2      8120.40 (  0.00%)     8278.80 (  1.95%)     9520.60 ( 17.24%)
> Mean     3      8087.80 (  0.00%)     8083.60 ( -0.05%)     9003.80 ( 11.33%)

Is the first numeric column the number of threads/clients? The other 
columns are showing pairs of throughput (higher is better), with a 
performance regression percentage in parentheses.

do the stddev numbers:

> Stddev   1        42.71 (  0.00%)       53.16 (-24.46%)       39.80 (  6.82%)
> Stddev   2       250.26 (  0.00%)      150.57 ( 39.84%)       31.94 ( 87.24%)

... correspond to the respective thread count and thus overlay the 
first table, right?

stddev appears to be rather large especially around a client count of 
7-8. It will be difficult to fine-tune the TLB range flush constants 
if noise is too large.

Regarding total CPU usage:

>           3.13.0-rc3      3.4.69  3.13.0-rc3
>              vanilla     vanilla newdefault-v1
> User          900.27      995.20      947.33
> System       1583.41     1680.76     1533.17
> Elapsed      2100.78     2100.81     2100.76

... elapsed time appears to be the same - does the benchmark run for a 
constant amount of time, regardless of performance?

This means that higher user time and lower system time generally 
represent higher achieved throughput, right?

Yet the sum does not appear to be constant across kernels - does this 
mean that even newdefault-v1 is idling around more than v3.4.69?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/3] x86: mm: Change tlb_flushall_shift for IvyBridge
  2013-12-12 11:55   ` Mel Gorman
@ 2013-12-12 13:13     ` Ingo Molnar
  -1 siblings, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2013-12-12 13:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Alex Shi, H Peter Anvin, Linux-X86, Linux-MM, LKML,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Peter Zijlstra


* Mel Gorman <mgorman@suse.de> wrote:

> There was a large performance regression that was bisected to commit 611ae8e3
> (x86/tlb: enable tlb flush range support for x86). This patch simply changes
> the default balance point between a local and global flush for IvyBridge.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  arch/x86/kernel/cpu/intel.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index dc1ec0d..2d93753 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -627,7 +627,7 @@ static void intel_tlb_flushall_shift_set(struct cpuinfo_x86 *c)
>  		tlb_flushall_shift = 5;
>  		break;
>  	case 0x63a: /* Ivybridge */
> -		tlb_flushall_shift = 1;
> +		tlb_flushall_shift = 2;
>  		break;

I'd not be surprised if other CPU models showed similar weaknesses 
under ebizzy as well.

I don't particularly like the tuning aspect of the whole feature: the 
tunings are model specific and they seem to come out of thin air, 
without explicit measurements visible.

In particular the first commit that added this optimization:

 commit c4211f42d3e66875298a5e26a75109878c80f15b
 Date:   Thu Jun 28 09:02:19 2012 +0800

    x86/tlb: add tlb_flushall_shift for specific CPU

already had these magic tunings, with no explanation about what kind 
of measurement was done to back up those tunings.

I don't think this is acceptable and until this is cleared up I think 
we might be better off turning off this feature altogether, or making 
a constant, very low tuning point.

The original code came via:

  611ae8e3f520 x86/tlb: enable tlb flush range support for x86

which references a couple of benchmarks, in particular a 
micro-benchmark:

  My micro benchmark 'mummap' http://lkml.org/lkml/2012/5/17/59
  show that the random memory access on other CPU has 0~50% speed up
  on a 2P * 4cores * HT NHM EP while do 'munmap'.

if the tunings were done with the micro-benchmark then I think they 
are bogus, because AFAICS it does not measure the adversarial case of 
the optimization.

So I'd say at minimum we need to remove the per model tunings, and 
need to use very conservative defaults, to make sure we don't slow 
down reasonable workloads.

( In theory madvise() could give us information about the usage 
  pattern of the vma - but in practice madvise() is rarely used and I 
  doubt ebizzy or other real-world apps are using it, so it's 
  meaningless. )

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/3] x86: mm: Change tlb_flushall_shift for IvyBridge
  2013-12-12 13:13     ` Ingo Molnar
@ 2013-12-12 13:38       ` Alex Shi
  -1 siblings, 0 replies; 50+ messages in thread
From: Alex Shi @ 2013-12-12 13:38 UTC (permalink / raw)
  To: Ingo Molnar, Mel Gorman
  Cc: H Peter Anvin, Linux-X86, Linux-MM, LKML, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Peter Zijlstra

On 12/12/2013 09:13 PM, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@suse.de> wrote:
> 
>> There was a large performance regression that was bisected to commit 611ae8e3
>> (x86/tlb: enable tlb flush range support for x86). This patch simply changes
>> the default balance point between a local and global flush for IvyBridge.
>>
>> Signed-off-by: Mel Gorman <mgorman@suse.de>
>> ---
>>  arch/x86/kernel/cpu/intel.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
>> index dc1ec0d..2d93753 100644
>> --- a/arch/x86/kernel/cpu/intel.c
>> +++ b/arch/x86/kernel/cpu/intel.c
>> @@ -627,7 +627,7 @@ static void intel_tlb_flushall_shift_set(struct cpuinfo_x86 *c)
>>  		tlb_flushall_shift = 5;
>>  		break;
>>  	case 0x63a: /* Ivybridge */
>> -		tlb_flushall_shift = 1;
>> +		tlb_flushall_shift = 2;
>>  		break;
> 
> I'd not be surprised if other CPU models showed similar weaknesses 
> under ebizzy as well.
> 
> I don't particularly like the tuning aspect of the whole feature: the 
> tunings are model specific and they seem to come out of thin air, 
> without explicit measurements visible.
> 
> In particular the first commit that added this optimization:
> 
>  commit c4211f42d3e66875298a5e26a75109878c80f15b
>  Date:   Thu Jun 28 09:02:19 2012 +0800
> 
>     x86/tlb: add tlb_flushall_shift for specific CPU
> 
> already had these magic tunings, with no explanation about what kind 
> of measurement was done to back up those tunings.
> 
> I don't think this is acceptable and until this is cleared up I think 
> we might be better off turning off this feature altogether, or making 
> a constant, very low tuning point.
> 
> The original code came via:
> 
>   611ae8e3f520 x86/tlb: enable tlb flush range support for x86
> 
> which references a couple of benchmarks, in particular a 
> micro-benchmark:
> 
>   My micro benchmark 'mummap' http://lkml.org/lkml/2012/5/17/59
>   show that the random memory access on other CPU has 0~50% speed up
>   on a 2P * 4cores * HT NHM EP while do 'munmap'.
> 
> if the tunings were done with the micro-benchmark then I think they 
> are bogus, because AFAICS it does not measure the adversarial case of 
> the optimization.
> 
> So I'd say at minimum we need to remove the per model tunings, and 
> need to use very conservative defaults, to make sure we don't slow 
> down reasonable workloads.

I also hate depending on mysterious hardware differentiation. But there
are some changes in the TLB/cache parts of different Intel CPUs (I guess
HPA knows more about this). And the different shift values came from
testing, not from thin air. :)

> 
> ( In theory madvise() could give us information about the usage 
>   pattern of the vma - but in practice madvise() is rarely used and I 
>   doubt ebizzy or other real-world apps are using it, so it's 
>   meaningless. )
> 
> Thanks,
> 
> 	Ingo
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/3] x86: mm: Account for the number of CPUs that must be flushed during a TLB range flush
  2013-12-12 11:55   ` Mel Gorman
@ 2013-12-12 13:41     ` Alex Shi
  -1 siblings, 0 replies; 50+ messages in thread
From: Alex Shi @ 2013-12-12 13:41 UTC (permalink / raw)
  To: Mel Gorman; +Cc: H Peter Anvin, Linux-X86, Linux-MM, LKML

On 12/12/2013 07:55 PM, Mel Gorman wrote:
> X86 TLB range flushing uses a balance point to decide if a single global TLB
> flush or multiple single page flushes would perform best.  This patch takes into
> account how many CPUs must be flushed.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  arch/x86/mm/tlb.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 09b8cb8..0cababa 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -217,6 +217,9 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>  	act_entries = mm->total_vm > act_entries ? act_entries : mm->total_vm;
>  	nr_base_pages = (end - start) >> PAGE_SHIFT;
>  
> +	/* Take the number of CPUs to range flush into account */
> +	nr_base_pages *= cpumask_weight(mm_cpumask(mm));
> +

The flush range calculation is based on a per-CPU cost, since no matter
how many CPUs the process runs on, the TLB flush and refill time on each
CPU won't change.
>  	/* tlb_flushall_shift is on balance point, details in commit log */
>  	if (nr_base_pages > act_entries || has_large_page(mm, start, end)) {
>  		count_vm_event(NR_TLB_LOCAL_FLUSH_ALL);
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/3] x86: mm: Change tlb_flushall_shift for IvyBridge
  2013-12-12 11:55   ` Mel Gorman
@ 2013-12-12 13:45     ` Alex Shi
  -1 siblings, 0 replies; 50+ messages in thread
From: Alex Shi @ 2013-12-12 13:45 UTC (permalink / raw)
  To: Mel Gorman; +Cc: H Peter Anvin, Linux-X86, Linux-MM, LKML

On 12/12/2013 07:55 PM, Mel Gorman wrote:
> There was a large performance regression that was bisected to commit 611ae8e3
> (x86/tlb: enable tlb flush range support for x86). This patch simply changes
> the default balance point between a local and global flush for IvyBridge.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

I agree with being more conservative.

Reviewed-by: Alex Shi <alex.shi@linaro.org>
> ---
>  arch/x86/kernel/cpu/intel.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index dc1ec0d..2d93753 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -627,7 +627,7 @@ static void intel_tlb_flushall_shift_set(struct cpuinfo_x86 *c)
>  		tlb_flushall_shift = 5;
>  		break;
>  	case 0x63a: /* Ivybridge */
> -		tlb_flushall_shift = 1;
> +		tlb_flushall_shift = 2;
>  		break;
>  	default:
>  		tlb_flushall_shift = 6;
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/3] x86: mm: Clean up inconsistencies when flushing TLB ranges
  2013-12-12 11:55   ` Mel Gorman
@ 2013-12-12 13:59     ` Alex Shi
  -1 siblings, 0 replies; 50+ messages in thread
From: Alex Shi @ 2013-12-12 13:59 UTC (permalink / raw)
  To: Mel Gorman; +Cc: H Peter Anvin, Linux-X86, Linux-MM, LKML

On 12/12/2013 07:55 PM, Mel Gorman wrote:
> NR_TLB_LOCAL_FLUSH_ALL is not always accounted for correctly and the
> comparison with total_vm is done before taking tlb_flushall_shift into
> account. Clean it up.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Alex Shi
> ---
>  arch/x86/mm/tlb.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index ae699b3..09b8cb8 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -189,6 +189,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>  {
>  	unsigned long addr;
>  	unsigned act_entries, tlb_entries = 0;
> +	unsigned long nr_base_pages;
>  
>  	preempt_disable();
>  	if (current->active_mm != mm)
> @@ -210,18 +211,17 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>  		tlb_entries = tlb_lli_4k[ENTRIES];
>  	else
>  		tlb_entries = tlb_lld_4k[ENTRIES];
> +
>  	/* Assume all of TLB entries was occupied by this task */

Does the benchmark break this assumption?
> -	act_entries = mm->total_vm > tlb_entries ? tlb_entries : mm->total_vm;
> +	act_entries = tlb_entries >> tlb_flushall_shift;
> +	act_entries = mm->total_vm > act_entries ? act_entries : mm->total_vm;
> +	nr_base_pages = (end - start) >> PAGE_SHIFT;
>  
>  	/* tlb_flushall_shift is on balance point, details in commit log */
> -	if ((end - start) >> PAGE_SHIFT > act_entries >> tlb_flushall_shift) {
> +	if (nr_base_pages > act_entries || has_large_page(mm, start, end)) {
>  		count_vm_event(NR_TLB_LOCAL_FLUSH_ALL);
>  		local_flush_tlb();
>  	} else {
> -		if (has_large_page(mm, start, end)) {
> -			local_flush_tlb();
> -			goto flush_all;
> -		}
>  		/* flush range by one by one 'invlpg' */
>  		for (addr = start; addr < end;	addr += PAGE_SIZE) {
>  			count_vm_event(NR_TLB_LOCAL_FLUSH_ONE);
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/3] x86: mm: Change tlb_flushall_shift for IvyBridge
  2013-12-12 13:38       ` Alex Shi
@ 2013-12-12 14:11         ` Ingo Molnar
  -1 siblings, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2013-12-12 14:11 UTC (permalink / raw)
  To: Alex Shi
  Cc: Mel Gorman, H Peter Anvin, Linux-X86, Linux-MM, LKML,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Peter Zijlstra


* Alex Shi <alex.shi@linaro.org> wrote:

> On 12/12/2013 09:13 PM, Ingo Molnar wrote:
> > 
> > * Mel Gorman <mgorman@suse.de> wrote:
> > 
> >> There was a large performance regression that was bisected to commit 611ae8e3
> >> (x86/tlb: enable tlb flush range support for x86). This patch simply changes
> >> the default balance point between a local and global flush for IvyBridge.
> >>
> >> Signed-off-by: Mel Gorman <mgorman@suse.de>
> >> ---
> >>  arch/x86/kernel/cpu/intel.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> >> index dc1ec0d..2d93753 100644
> >> --- a/arch/x86/kernel/cpu/intel.c
> >> +++ b/arch/x86/kernel/cpu/intel.c
> >> @@ -627,7 +627,7 @@ static void intel_tlb_flushall_shift_set(struct cpuinfo_x86 *c)
> >>  		tlb_flushall_shift = 5;
> >>  		break;
> >>  	case 0x63a: /* Ivybridge */
> >> -		tlb_flushall_shift = 1;
> >> +		tlb_flushall_shift = 2;
> >>  		break;
> > 
> > I'd not be surprised if other CPU models showed similar weaknesses 
> > under ebizzy as well.
> > 
> > I don't particularly like the tuning aspect of the whole feature: the 
> > tunings are model specific and they seem to come out of thin air, 
> > without explicit measurements visible.
> > 
> > In particular the first commit that added this optimization:
> > 
> >  commit c4211f42d3e66875298a5e26a75109878c80f15b
> >  Date:   Thu Jun 28 09:02:19 2012 +0800
> > 
> >     x86/tlb: add tlb_flushall_shift for specific CPU
> > 
> > already had these magic tunings, with no explanation about what kind 
> > of measurement was done to back up those tunings.
> > 
> > I don't think this is acceptable and until this is cleared up I think 
> > we might be better off turning off this feature altogether, or making 
> > a constant, very low tuning point.
> > 
> > The original code came via:
> > 
> >   611ae8e3f520 x86/tlb: enable tlb flush range support for x86
> > 
> > which references a couple of benchmarks, in particular a 
> > micro-benchmark:
> > 
> >   My micro benchmark 'mummap' http://lkml.org/lkml/2012/5/17/59
> >   show that the random memory access on other CPU has 0~50% speed up
> >   on a 2P * 4cores * HT NHM EP while do 'munmap'.
> > 
> > if the tunings were done with the micro-benchmark then I think they 
> > are bogus, because AFAICS it does not measure the adversarial case of 
> > the optimization.

You have not replied to this concern of mine: if my concern is valid 
then that invalidates much of the current tunings.

> > So I'd say at minimum we need to remove the per model tunings, and 
> > need to use very conservative defaults, to make sure we don't slow 
> > down reasonable workloads.
> 
> I also hate to depends on mysterious hardware differentiation. But 
> there do have some changes in tlb/cache part on different Intel 
> CPU.(Guess HPA know this more). And the different shift value get 
> from testing not from air. :)

As far as I could see from the changelogs and the code itself the 
various tunings came from nowhere.

So I don't see my concerns addressed. My inclination would be to start 
with something like Mel's known-good tuning value below, we know that 
ebizzy does not regress with that setting. Any more aggressive tuning 
needs to be backed up with ebizzy-alike adversarial workload 
performance numbers.

Thanks,

	Ingo

(Patch totally untested.)

=============>
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index dc1ec0d..c98385d 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -614,23 +614,8 @@ static void intel_tlb_flushall_shift_set(struct cpuinfo_x86 *c)
 	case 0x61d: /* six-core 45 nm xeon "Dunnington" */
 		tlb_flushall_shift = -1;
 		break;
-	case 0x61a: /* 45 nm nehalem, "Bloomfield" */
-	case 0x61e: /* 45 nm nehalem, "Lynnfield" */
-	case 0x625: /* 32 nm nehalem, "Clarkdale" */
-	case 0x62c: /* 32 nm nehalem, "Gulftown" */
-	case 0x62e: /* 45 nm nehalem-ex, "Beckton" */
-	case 0x62f: /* 32 nm Xeon E7 */
-		tlb_flushall_shift = 6;
-		break;
-	case 0x62a: /* SandyBridge */
-	case 0x62d: /* SandyBridge, "Romely-EP" */
-		tlb_flushall_shift = 5;
-		break;
-	case 0x63a: /* Ivybridge */
-		tlb_flushall_shift = 1;
-		break;
 	default:
-		tlb_flushall_shift = 6;
+		tlb_flushall_shift = 2;
 	}
 }
 

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 0/3] Fix ebizzy performance regression on IvyBridge due to X86 TLB range flush
  2013-12-12 13:01   ` Ingo Molnar
@ 2013-12-12 14:40     ` Mel Gorman
  -1 siblings, 0 replies; 50+ messages in thread
From: Mel Gorman @ 2013-12-12 14:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alex Shi, H Peter Anvin, Linux-X86, Linux-MM, LKML,
	Linus Torvalds, Thomas Gleixner, Peter Zijlstra, Andrew Morton

On Thu, Dec 12, 2013 at 02:01:07PM +0100, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@suse.de> wrote:
> 
> > I found that ebizzy regressed between 3.4 and 3.10 while testing on a new
> > machine. Bisection initially found at least two problems of which the
> > first was commit 611ae8e3 (x86/tlb: enable tlb flush range support for
> > x86). The second was related to ACPI cpufreq and so it was disabled for
> > the purposes of this series.
> > 
> > The intent of the TLB range flush patch appeared to be to preserve
> > existing TLB entries which makes sense. The decision on whether to do a
> > full mm flush or a number of single page flushes depends on the size of the
> > relevant TLB and the CPU which is presuably taking the cost of a TLB refill.
> > It is a gamble because the cost of the per-page flushes must be offset by a
> > reduced TLB miss count. There are no indications what the cost of calling
> > invlpg are if there are no TLB entries and it's also not taking into
> > account how many CPUs it may have to execute these single TLB flushes on.
> > 
> > Ebizzy sees very little benefit as it discards newly allocated memory very
> > quickly which is why it appeared to regress so badly. While I'm wary of
> > optimising for such a benchmark, it's commonly tested and the defaults for
> > Ivybridge may need to be re-examined.
> > 
> > The following small series restores ebizzy to 3.4-era performance. Is there a
> > better way of doing this? Bear in mind that I'm testing on a single IvyBridge
> > machine and there is no guarantee the gain is universal or even relevant.
> > 
> > kernel build was tested but it's uninteresting as TLB range is unimportant
> > to it. A page fault benchmark was also tested but it does not hit the same paths
> > impacted by commit 611ae8e3.
> > 
> > ebizzy
> >                        3.13.0-rc3                3.4.69            3.13.0-rc3
> >                           vanilla               vanilla           newdefault-v1
> > Mean     1      7353.60 (  0.00%)     6782.00 ( -7.77%)     7836.20 (  6.56%)
> > Mean     2      8120.40 (  0.00%)     8278.80 (  1.95%)     9520.60 ( 17.24%)
> > Mean     3      8087.80 (  0.00%)     8083.60 ( -0.05%)     9003.80 ( 11.33%)
> > Mean     4      7919.20 (  0.00%)     7842.60 ( -0.97%)     8680.60 (  9.61%)
> > Mean     5      7310.60 (  0.00%)     7740.60 (  5.88%)     8273.20 ( 13.17%)
> > Mean     6      6798.00 (  0.00%)     7720.20 ( 13.57%)     8033.20 ( 18.17%)
> > Mean     7      6759.40 (  0.00%)     7644.00 ( 13.09%)     7643.80 ( 13.08%)
> > Mean     8      6501.80 (  0.00%)     7666.40 ( 17.91%)     6944.40 (  6.81%)
> > Mean     12     6606.00 (  0.00%)     7523.20 ( 13.88%)     7035.80 (  6.51%)
> > Mean     16     6655.40 (  0.00%)     7287.40 (  9.50%)     7099.20 (  6.67%)
> > Mean     20     6703.80 (  0.00%)     7152.20 (  6.69%)     7116.60 (  6.16%)
> > Mean     24     6705.80 (  0.00%)     7014.80 (  4.61%)     7113.60 (  6.08%)
> > Mean     28     6706.60 (  0.00%)     6940.40 (  3.49%)     7115.20 (  6.09%)
> > Mean     32     6727.20 (  0.00%)     6878.80 (  2.25%)     7110.80 (  5.70%)
> > Stddev   1        42.71 (  0.00%)       53.16 (-24.46%)       39.80 (  6.82%)
> > Stddev   2       250.26 (  0.00%)      150.57 ( 39.84%)       31.94 ( 87.24%)
> > Stddev   3        71.67 (  0.00%)       69.38 (  3.19%)       84.13 (-17.39%)
> > Stddev   4        30.25 (  0.00%)       87.06 (-187.82%)       31.80 ( -5.14%)
> > Stddev   5        71.18 (  0.00%)       25.68 ( 63.92%)      125.24 (-75.95%)
> > Stddev   6        34.22 (  0.00%)       23.35 ( 31.75%)      124.40 (-263.57%)
> > Stddev   7       100.59 (  0.00%)      112.83 (-12.17%)       65.07 ( 35.31%)
> > Stddev   8        20.26 (  0.00%)       43.43 (-114.32%)       48.26 (-138.16%)
> > Stddev   12       19.43 (  0.00%)       19.73 ( -1.55%)       23.25 (-19.65%)
> > Stddev   16       14.47 (  0.00%)       26.42 (-82.54%)       17.71 (-22.40%)
> > Stddev   20       21.37 (  0.00%)       15.97 ( 25.27%)       14.87 ( 30.42%)
> > Stddev   24       12.87 (  0.00%)       28.12 (-118.44%)       10.46 ( 18.75%)
> > Stddev   28       13.89 (  0.00%)       17.97 (-29.36%)       12.22 ( 12.04%)
> > Stddev   32       18.14 (  0.00%)       20.37 (-12.31%)       16.40 (  9.58%)
> > 
> >           3.13.0-rc3      3.4.69  3.13.0-rc3
> >              vanilla     vanilla newdefault-v1
> > User          900.27      995.20      947.33
> > System       1583.41     1680.76     1533.17
> > Elapsed      2100.78     2100.81     2100.76
> > 
> > This shows the ebizzy comparison between 3.13-rc3, 3.4.69-stable and this series.
> > The series is not a universal win against 3.4 but the figure are generally better
> > and system CPU usage is reduced.
> 
> I think you found a real bug and I definitely agree that we want to 
> fix it - the TLB range optimization was supposed to be a (nearly) 
> unconditional win, so a regression like this is totally not acceptable 
> IMHO.
> 

Ok.

> Please help me out with interpreting the numbers. The Ebizzy table:
> 
> > ebizzy
> >                        3.13.0-rc3                3.4.69            3.13.0-rc3
> >                           vanilla               vanilla           newdefault-v1
> > Mean     1      7353.60 (  0.00%)     6782.00 ( -7.77%)     7836.20 (  6.56%)
> > Mean     2      8120.40 (  0.00%)     8278.80 (  1.95%)     9520.60 ( 17.24%)
> > Mean     3      8087.80 (  0.00%)     8083.60 ( -0.05%)     9003.80 ( 11.33%)
> 
> is the first numeric column number of threads/clients?

The number of threads.

> The other 
> colums are showing pairs of throughput (higher is better), with a 
> performance regression percentage in parentheses.
> 

Average operations/second measured over the duration of the entire test.
The output of the program is essentially records/duration.

> do the stddev numbers:
> 
> > Stddev   1        42.71 (  0.00%)       53.16 (-24.46%)       39.80 (  6.82%)
> > Stddev   2       250.26 (  0.00%)      150.57 ( 39.84%)       31.94 ( 87.24%)
> 
> ... correspond to the respective thread count and thus overlay the 
> first table, right?
> 

Each iteration of the test ran for a fixed duration of 30 seconds. There were
5 iterations per thread count, which I know is very low; this was an RFC. The
stddev is the standard deviation of the records/sec recorded for the 5 iterations.

> stddev appears to be rather large especially around a client count of 
> 7-8. It will be difficult to fine-tune the TLB range flush constants 
> if noise is too large.
> 

The number of iterations was too low to have high confidence in the figures.
The high standard deviation for 5 clients was due to a single large outlier.
It could potentially be stabilised to some extent by bumping up the number
of iterations a lot and using percentiles instead of means.
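As a purely illustrative aside (not part of the benchmark harness used
here), the percentile-instead-of-mean idea is cheap to implement, e.g. the
nearest-rank method over the per-iteration throughput samples:

#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a, y = *(const double *)b;

	return (x > y) - (x < y);
}

/* Nearest-rank p-th percentile (1 <= p <= 100) of n samples */
static double percentile(double *samples, size_t n, unsigned int p)
{
	size_t rank = (p * n + 99) / 100;	/* round the rank up */

	qsort(samples, n, sizeof(*samples), cmp_double);
	if (rank < 1)
		rank = 1;
	if (rank > n)
		rank = n;
	return samples[rank - 1];
}

int main(void)
{
	/* made-up sample values, not measurements from this report */
	double recs[] = { 7100.0, 7350.0, 7200.0, 9800.0, 7300.0 };

	printf("median: %.1f\n", percentile(recs, 5, 50));
	return 0;
}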

I'm a bit wary of optimising the TLB flush ranges based on the benchmark
even if we stabilised the figures. There are major weaknesses that limit
its usefulness for tuning the shift value. The range of pages being
flushed is fixed, so whether the CPU sees a benefit depends on the cost of
a single page flush, the cost of a global flush and the number of TLB
entries relative to the fixed size used by ebizzy. Similarly, the full
cost/benefit of using single page flushes vs a global flush partially
depends on whether the TLB contains hot entries for the task being flushed
that will be used in the near future. ebizzy's allocations are short-lived
and, while I have not analysed it, I suspect it benefits little from
preserving its TLB entries.

I'm not entirely convinced that the balance points are a good idea at
all. There are a lot of assumptions and some complete unknowns. For example,
if this task is completely cold, is an mm flush far cheaper than flushing
non-existent entries page by page? Unfortunately I do not know what the
original thinking was or why something like a calibration loop was not
used (other than the sheer difficulty of measuring it), i.e. estimate the
cost of an mm flush, estimate the cost of a single page flush and then

if (cost(mm_flush) < nr_to_flush * cost(page_flush))
	flush_mm
else
	flush_page_range

Hopefully someone does remember.
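A minimal sketch of what that might look like, assuming two hypothetical
cost estimates filled in by a boot-time calibration loop (none of these
helpers or variables exist today, they are purely illustrative):

/* Hypothetical per-model costs, e.g. in cycles, measured at boot */
static unsigned int cost_mm_flush;	/* one full local TLB flush */
static unsigned int cost_page_flush;	/* one invlpg */

static bool prefer_full_flush(unsigned long start, unsigned long end)
{
	unsigned long nr_to_flush = (end - start) >> PAGE_SHIFT;

	/*
	 * Compares direct flushing cost only; it deliberately ignores the
	 * (unknown) refill cost of the entries a full flush throws away,
	 * which is exactly the part that is hard to measure.
	 */
	return cost_mm_flush < nr_to_flush * cost_page_flush;
}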

> Regarding total CPU usage:
> 
> >           3.13.0-rc3      3.4.69  3.13.0-rc3
> >              vanilla     vanilla newdefault-v1
> > User          900.27      995.20      947.33
> > System       1583.41     1680.76     1533.17
> > Elapsed      2100.78     2100.81     2100.76
> 
> ... elapsed time appears to be the same - does the benchmark run for a 
> constant amount of time, regardless of performance?
> 

Constant.

> This means that higher user time and lower system time generally 
> represents higher achieved throughput, right?
> 

Good point, these values cannot be meaningfully interpreted because of
the fixed duration of the test. I should not even have looked.

> Yet the sum does not appear to be constant across kernels - does this 
> mean that even newdefault-v1 is idling around more than v3.4.69?

There are a number of factors that could affect this. Light monitoring was
active, so there is some IO from logging it. The workload is very
context-switch intensive, so changes to the scheduler will distort results.
The load is interrupt heavy, which will not be properly captured. The cost
of the benchmark is heavily dominated by zeroing newly allocated pages, so
changes in cache coloring would show up in different ways.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/3] x86: mm: Clean up inconsistencies when flushing TLB ranges
  2013-12-12 13:59     ` Alex Shi
@ 2013-12-12 23:53       ` Mel Gorman
  -1 siblings, 0 replies; 50+ messages in thread
From: Mel Gorman @ 2013-12-12 23:53 UTC (permalink / raw)
  To: Alex Shi; +Cc: H Peter Anvin, Linux-X86, Linux-MM, LKML

On Thu, Dec 12, 2013 at 09:59:33PM +0800, Alex Shi wrote:
> On 12/12/2013 07:55 PM, Mel Gorman wrote:
> > NR_TLB_LOCAL_FLUSH_ALL is not always accounted for correctly and the
> > comparison with total_vm is done before taking tlb_flushall_shift into
> > account. Clean it up.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> 
> Reviewed-by: Alex Shi

Thanks.

> > ---
> >  arch/x86/mm/tlb.c | 12 ++++++------
> >  1 file changed, 6 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> > index ae699b3..09b8cb8 100644
> > --- a/arch/x86/mm/tlb.c
> > +++ b/arch/x86/mm/tlb.c
> > @@ -189,6 +189,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> >  {
> >  	unsigned long addr;
> >  	unsigned act_entries, tlb_entries = 0;
> > +	unsigned long nr_base_pages;
> >  
> >  	preempt_disable();
> >  	if (current->active_mm != mm)
> > @@ -210,18 +211,17 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> >  		tlb_entries = tlb_lli_4k[ENTRIES];
> >  	else
> >  		tlb_entries = tlb_lld_4k[ENTRIES];
> > +
> >  	/* Assume all of TLB entries was occupied by this task */
> 
> Does the benchmark break this assumption?

No, but it's a small benchmark with very little else running at the
time. It's an assumption that would only hold true on machines dedicated
to a single application. It would not hold true for desktops, multi-tier
server applications etc.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/3] x86: mm: Change tlb_flushall_shift for IvyBridge
  2013-12-12 14:11         ` Ingo Molnar
@ 2013-12-13  1:02           ` Alex Shi
  -1 siblings, 0 replies; 50+ messages in thread
From: Alex Shi @ 2013-12-13  1:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mel Gorman, H Peter Anvin, Linux-X86, Linux-MM, LKML,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Fengguang Wu

On 12/12/2013 10:11 PM, Ingo Molnar wrote:
> 
> * Alex Shi <alex.shi@linaro.org> wrote:
> 
>> On 12/12/2013 09:13 PM, Ingo Molnar wrote:
>>>
>>> * Mel Gorman <mgorman@suse.de> wrote:
>>>
>>>> There was a large performance regression that was bisected to commit 611ae8e3
>>>> (x86/tlb: enable tlb flush range support for x86). This patch simply changes
>>>> the default balance point between a local and global flush for IvyBridge.
>>>>
>>>> Signed-off-by: Mel Gorman <mgorman@suse.de>
>>>> ---
>>>>  arch/x86/kernel/cpu/intel.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
>>>> index dc1ec0d..2d93753 100644
>>>> --- a/arch/x86/kernel/cpu/intel.c
>>>> +++ b/arch/x86/kernel/cpu/intel.c
>>>> @@ -627,7 +627,7 @@ static void intel_tlb_flushall_shift_set(struct cpuinfo_x86 *c)
>>>>  		tlb_flushall_shift = 5;
>>>>  		break;
>>>>  	case 0x63a: /* Ivybridge */
>>>> -		tlb_flushall_shift = 1;
>>>> +		tlb_flushall_shift = 2;
>>>>  		break;
>>>
>>> I'd not be surprised if other CPU models showed similar weaknesses 
>>> under ebizzy as well.
>>>
>>> I don't particularly like the tuning aspect of the whole feature: the 
>>> tunings are model specific and they seem to come out of thin air, 
>>> without explicit measurements visible.
>>>
>>> In particular the first commit that added this optimization:
>>>
>>>  commit c4211f42d3e66875298a5e26a75109878c80f15b
>>>  Date:   Thu Jun 28 09:02:19 2012 +0800
>>>
>>>     x86/tlb: add tlb_flushall_shift for specific CPU
>>>
>>> already had these magic tunings, with no explanation about what kind 
>>> of measurement was done to back up those tunings.
>>>
>>> I don't think this is acceptable and until this is cleared up I think 
>>> we might be better off turning off this feature altogether, or making 
>>> a constant, very low tuning point.
>>>
>>> The original code came via:
>>>
>>>   611ae8e3f520 x86/tlb: enable tlb flush range support for x86
>>>
>>> which references a couple of benchmarks, in particular a 
>>> micro-benchmark:
>>>
>>>   My micro benchmark 'mummap' http://lkml.org/lkml/2012/5/17/59
>>>   show that the random memory access on other CPU has 0~50% speed up
>>>   on a 2P * 4cores * HT NHM EP while do 'munmap'.
>>>
>>> if the tunings were done with the micro-benchmark then I think they 
>>> are bogus, because AFAICS it does not measure the adversarial case of 
>>> the optimization.
> 
> You have not replied to this concern of mine: if my concern is valid 
> then that invalidates much of the current tunings.

The benefit from the flush range optimization is not unconditional, since
invlpg also costs time. And different CPUs have different invlpg/flush_all
execution times. That is part of the reason for the different flushall_shift
values; another reason, if my memory is right, is that the execution time of
multiple invlpg instructions is not strictly linear. I can't confirm this, sorry.

In theory the benefit is there, but most benchmarks cannot show the
performance improvement, because most benchmarks do not issue range flushes
frequently. So a micro-benchmark is needed to discover whether the benefit
really exists. And the micro-benchmark can also find regressions if too
many invlpg instructions are used. The balance point is the flushall_shift
value. Maybe the flushall_shift values are a bit aggressive, or maybe the
testing scenarios didn't cover everything. So I don't mind taking a more
conservative value.

BTW, at that time I tested every benchmark I had in hand and found no regression.
> 
>>> So I'd say at minimum we need to remove the per model tunings, and 
>>> need to use very conservative defaults, to make sure we don't slow 
>>> down reasonable workloads.
>>
>> I also hate to depends on mysterious hardware differentiation. But 
>> there do have some changes in tlb/cache part on different Intel 
>> CPU.(Guess HPA know this more). And the different shift value get 
>> from testing not from air. :)
> 
> As far as I could see from the changelogs and the code itself the 
> various tunings came from nowhere.
> 
> So I don't see my concerns addressed. My inclination would be to start 
> with something like Mel's known-good tuning value below, we know that 
> ebizzy does not regress with that setting. Any more aggressive tuning 
> needs to be backed up with ebizzy-alike adversarial workload 
> performance numbers.

Testing can tell us more.

CCing Fengguang for the following patch.
> 
> Thanks,
> 
> 	Ingo
> 
> (Patch totally untested.)
> 
> =============>
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index dc1ec0d..c98385d 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -614,23 +614,8 @@ static void intel_tlb_flushall_shift_set(struct cpuinfo_x86 *c)
>  	case 0x61d: /* six-core 45 nm xeon "Dunnington" */
>  		tlb_flushall_shift = -1;
>  		break;
> -	case 0x61a: /* 45 nm nehalem, "Bloomfield" */
> -	case 0x61e: /* 45 nm nehalem, "Lynnfield" */
> -	case 0x625: /* 32 nm nehalem, "Clarkdale" */
> -	case 0x62c: /* 32 nm nehalem, "Gulftown" */
> -	case 0x62e: /* 45 nm nehalem-ex, "Beckton" */
> -	case 0x62f: /* 32 nm Xeon E7 */
> -		tlb_flushall_shift = 6;
> -		break;
> -	case 0x62a: /* SandyBridge */
> -	case 0x62d: /* SandyBridge, "Romely-EP" */
> -		tlb_flushall_shift = 5;
> -		break;
> -	case 0x63a: /* Ivybridge */
> -		tlb_flushall_shift = 1;
> -		break;
>  	default:
> -		tlb_flushall_shift = 6;
> +		tlb_flushall_shift = 2;
>  	}
>  }
>  
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/3] x86: mm: Change tlb_flushall_shift for IvyBridge
  2013-12-13  1:02           ` Alex Shi
@ 2013-12-13  2:11             ` Alex Shi
  -1 siblings, 0 replies; 50+ messages in thread
From: Alex Shi @ 2013-12-13  2:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mel Gorman, H Peter Anvin, Linux-X86, Linux-MM, LKML,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Fengguang Wu

On 12/13/2013 09:02 AM, Alex Shi wrote:
>> > You have not replied to this concern of mine: if my concern is valid 
>> > then that invalidates much of the current tunings.
> The benefit from the flush range optimization is not unconditional, since
> invlpg also costs time. And different CPUs have different invlpg/flush_all
> execution times. 

TLB refill time also differs between different kinds of CPU.

BTW,
a bewitching idea is still attracting me:
https://lkml.org/lkml/2012/5/23/148
even though it was sentenced to death by HPA:
https://lkml.org/lkml/2012/5/24/143

The idea is that flushing the TLB on just one thread is enough for SMT/HT,
since the TLB seems to still be shared within a core on Intel CPUs. This
benefit is unconditional, and if my memory is right, kbuild testing improved
by about 1~2% on average.
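A rough sketch of the kind of filtering this would need, purely
illustrative and assuming (as above, and as HPA questioned) that SMT
siblings really do share the TLB:

#include <linux/cpumask.h>
#include <linux/topology.h>

/* Reduce 'in' to one logical CPU per physical core */
static void one_cpu_per_core(const struct cpumask *in, struct cpumask *out)
{
	int cpu;

	cpumask_clear(out);
	for_each_cpu(cpu, in) {
		/* skip CPUs whose core already has a representative */
		if (cpumask_intersects(out, topology_thread_cpumask(cpu)))
			continue;
		cpumask_set_cpu(cpu, out);
	}
}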

So would you accept some ugly quirks to do this lazy TLB flush
on known-working CPUs?
Forgive me if it's stupid.

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 0/3] Fix ebizzy performance regression on IvyBridge due to X86 TLB range flush
  2013-12-12 14:40     ` Mel Gorman
@ 2013-12-13 13:35       ` Ingo Molnar
  -1 siblings, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2013-12-13 13:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Alex Shi, H Peter Anvin, Linux-X86, Linux-MM, LKML,
	Linus Torvalds, Thomas Gleixner, Peter Zijlstra, Andrew Morton


* Mel Gorman <mgorman@suse.de> wrote:

> > [...]
> >
> > stddev appears to be rather large especially around a client count 
> > of 7-8. It will be difficult to fine-tune the TLB range flush 
> > constants if noise is too large.
> 
> The number of iterations was too low to have high confidence in the 
> figures. The high standard deviation for 5 clients was due to a single 
> large outlier. It could potentially be stabilised to some extent by 
> bumping up the number of iterations a lot and using percentiles 
> instead of means.

Fair enough - and you were bisecting so length of runtime and 
confidence of detection were obviously the primary concerns.

> I'm a bit wary of optimising the TLB flush ranges based on the 
> benchmark even if we stabilised the figures. [...]

Absolutely - but they do appear to be pretty 'adversarial' to the TLB 
optimization, with a measurable slowdown in a pretty complex, 
real-life workload pattern.

So future tuning efforts will have to take such workloads into effect 
as well, to make sure we don't regress again.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/3] x86: mm: Change tlb_flushall_shift for IvyBridge
  2013-12-13  2:11             ` Alex Shi
@ 2013-12-13 13:43               ` Ingo Molnar
  -1 siblings, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2013-12-13 13:43 UTC (permalink / raw)
  To: Alex Shi
  Cc: Mel Gorman, H Peter Anvin, Linux-X86, Linux-MM, LKML,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Fengguang Wu


* Alex Shi <alex.shi@linaro.org> wrote:

> On 12/13/2013 09:02 AM, Alex Shi wrote:
> >> > You have not replied to this concern of mine: if my concern is valid 
> >> > then that invalidates much of the current tunings.
> > The benefit from the flush range optimization is not unconditional, since
> > invlpg also costs time. And different CPUs have different invlpg/flush_all
> > execution times. 
> 
> TLB refill time also differs between different kinds of CPU.
> 
> BTW,
> a bewitching idea is still attracting me:
> https://lkml.org/lkml/2012/5/23/148
> even though it was sentenced to death by HPA:
> https://lkml.org/lkml/2012/5/24/143

I don't think it was sentenced to death by HPA. What do the hardware 
guys say, is this safe on current CPUs?

If yes then as long as we only activate this optimization for known 
models (and turn it off for unknown models) we should be pretty safe, 
even if the hw guys (obviously) don't want to promise this 
indefinitely for all Intel HT implementations in the future, right?

> The idea is that flushing the TLB on just one thread is enough for 
> SMT/HT, since the TLB seems to still be shared within a core on Intel 
> CPUs. This benefit is unconditional, and if my memory is right, kbuild 
> testing improved by about 1~2% on average.

Oh, a 1-2% kbuild speedup is absolutely _massive_. Don't even think 
about dropping this idea ... it needs to be explored.

Alas, that for_each_cpu() loop is obviously disgusting, these values 
should be precalculated into percpu variables and such.
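Something along these lines, perhaps (an illustrative sketch only, with
made-up names): precompute once whether a CPU is its core's designated
flush representative, so the hot path never walks the sibling mask:

#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <linux/topology.h>

static DEFINE_PER_CPU(bool, tlb_flush_rep);

/* Run after the topology masks are set up, and again on CPU hotplug */
static void precalc_tlb_flush_reps(void)
{
	int cpu;

	for_each_online_cpu(cpu)
		per_cpu(tlb_flush_rep, cpu) =
			(cpu == cpumask_first(topology_thread_cpumask(cpu)));
}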

> So would you accept some ugly quirks to do this lazy TLB 
> flush on known-working CPUs?

it's not really 'lazy TLB flush' AFAICS but a genuine optimization: 
only flush the TLB on the logical CPUs that need it, right? I.e. do 
only one flush per pair of siblings.

> Forgive me if it's stupid.

I'd say measurable speedups that are safe are never ever stupid.

And even the range-flush TLB optimization we are talking about here 
could still be used IMO, just tone it down a bit and make it less 
model dependent.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/3] x86: mm: Change tlb_flushall_shift for IvyBridge
  2013-12-13 13:43               ` Ingo Molnar
@ 2013-12-14 11:01                 ` Alex Shi
  -1 siblings, 0 replies; 50+ messages in thread
From: Alex Shi @ 2013-12-14 11:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mel Gorman, H Peter Anvin, Linux-X86, Linux-MM, LKML,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Fengguang Wu

On 12/13/2013 09:43 PM, Ingo Molnar wrote:
> 
> * Alex Shi <alex.shi@linaro.org> wrote:
> 
>> On 12/13/2013 09:02 AM, Alex Shi wrote:
>>>>> You have not replied to this concern of mine: if my concern is valid 
>>>>> then that invalidates much of the current tunings.
>>> The benefit from pretend flush range is not unconditional, since invlpg
>>> also cost time. And different CPU has different invlpg/flush_all
>>> execution time. 
>>
>> TLB refill time is also different on different kind of cpu.
>>
>> BTW,
>> A bewitching idea is till attracting me.
>> https://lkml.org/lkml/2012/5/23/148
>> Even it was sentenced to death by HPA.
>> https://lkml.org/lkml/2012/5/24/143
> 
> I don't think it was sentenced to death by HPA. What do the hardware 
> guys say, is this safe on current CPUs?

This discussion is fully public; I know of no other information.
At that time I tried core2, nhm, wsm, snb, ivb -- all kinds of machines
I could get hold of. No issue was found.

And a rebased patch has been under test in Fengguang's testing system
since last Friday, with no bad news so far.
Fengguang, it is the x86-tlb branch on my github tree.
> 
> If yes then as long as we only activate this optimization for known 
> models (and turn it off for unknown models) we should be pretty safe, 
> even if the hw guys (obviously) don't want to promise this 
> indefinitely for all Intel HT implementations in the future, right?

Agree with you.
> 
>> That is that just flush one of thread TLB is enough for SMT/HT, 
>> seems TLB is still shared in core on Intel CPU. This benefit is 
>> unconditional, and if my memory right, Kbuild testing can improve 
>> about 1~2% in average level.
> 
> Oh, a 1-2% kbuild speedup is absolutely _massive_. Don't even think 
> about dropping this idea ... it needs to be explored.
> 
> Alas, that for_each_cpu() loop is obviously disgusting, these values 
> should be precalculated into percpu variables and such.

Yes, precalculated variables would save much time.
> 
>> So could you like to accept some ugly quirks to do this lazy TLB 
>> flush on known working CPU?
> 
> it's not really 'lazy TLB flush' AFAICS but a genuine optimization: 
> only flush the TLB on the logical CPUs that need it, right? I.e. do 
> only one flush per pair of siblings.
> 
>> Forgive me if it's stupid.
> 
> I'd say measurable speedups that are safe are never ever stupid.

Thanks a lot!
> 
> And even the range-flush TLB optimization we are talking about here 
> could still be used IMO, just tone it down a bit and make it less 
> model dependent.
> 
> Thanks,
> 
> 	Ingo
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/3] x86: mm: Change tlb_flushall_shift for IvyBridge
  2013-12-13  2:11             ` Alex Shi
@ 2013-12-14 14:19               ` Peter Zijlstra
  -1 siblings, 0 replies; 50+ messages in thread
From: Peter Zijlstra @ 2013-12-14 14:19 UTC (permalink / raw)
  To: Alex Shi
  Cc: Ingo Molnar, Mel Gorman, H Peter Anvin, Linux-X86, Linux-MM,
	LKML, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Fengguang Wu

On Fri, Dec 13, 2013 at 10:11:05AM +0800, Alex Shi wrote:
> BTW,
> A bewitching idea is till attracting me.
> https://lkml.org/lkml/2012/5/23/148
> Even it was sentenced to death by HPA.
> https://lkml.org/lkml/2012/5/24/143
> 
> That is that just flush one of thread TLB is enough for SMT/HT, seems
> TLB is still shared in core on Intel CPU. This benefit is unconditional,
> and if my memory right, Kbuild testing can improve about 1~2% in average
> level.
> 
> So could you like to accept some ugly quirks to do this lazy TLB flush
> on known working CPU?
> Forgive me if it's stupid.

I think there's a further problem with that patch -- aside from the
question of whether it is right from a hardware point of view.

We currently rely on the tlb flush IPI to synchronize with lockless page
table walkers like gup_fast().

By not sending an IPI to all CPUs you can get into trouble and crash the
kernel.

We absolutely must keep sending the IPI to all relevant CPUs, we can
choose not to actually do the flush on some CPUs, but we must keep
sending the IPI.
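
As an illustration of that constraint (the function and the percpu flag
here are hypothetical, not taken from the patch under discussion): the
IPI is still delivered to every CPU in the mm's cpumask, and only the
redundant hardware flush is skipped on an SMT sibling whose core was
already flushed.

static void example_flush_tlb_ipi(void *info)
{
	/* hypothetical percpu flag, set up at CPU bringup */
	if (this_cpu_read(tlb_flush_leader))
		local_flush_tlb();

	/*
	 * Even when the hardware flush is skipped, merely running this
	 * handler is what gup_fast() -- which walks page tables with
	 * interrupts disabled -- synchronizes against.  Skipping the
	 * flush is fine; skipping the IPI itself is not.
	 */
}

The IPI would still be sent with something like
smp_call_function_many(mm_cpumask(mm), example_flush_tlb_ipi, NULL, 1),
i.e. to the full mask, never to a trimmed one.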

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/3] x86: mm: Change tlb_flushall_shift for IvyBridge
  2013-12-14 14:19               ` Peter Zijlstra
@ 2013-12-14 14:27                 ` Peter Zijlstra
  -1 siblings, 0 replies; 50+ messages in thread
From: Peter Zijlstra @ 2013-12-14 14:27 UTC (permalink / raw)
  To: Alex Shi
  Cc: Ingo Molnar, Mel Gorman, H Peter Anvin, Linux-X86, Linux-MM,
	LKML, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Fengguang Wu

On Sat, Dec 14, 2013 at 03:19:02PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 13, 2013 at 10:11:05AM +0800, Alex Shi wrote:
> > BTW,
> > A bewitching idea is till attracting me.
> > https://lkml.org/lkml/2012/5/23/148
> > Even it was sentenced to death by HPA.
> > https://lkml.org/lkml/2012/5/24/143
> > 
> > That is that just flush one of thread TLB is enough for SMT/HT, seems
> > TLB is still shared in core on Intel CPU. This benefit is unconditional,
> > and if my memory right, Kbuild testing can improve about 1~2% in average
> > level.
> > 
> > So could you like to accept some ugly quirks to do this lazy TLB flush
> > on known working CPU?
> > Forgive me if it's stupid.
> 
> I think there's a further problem with that patch -- aside of it being
> right from a hardware point of view.
> 
> We currently rely on the tlb flush IPI to synchronize with lockless page
> table walkers like gup_fast().
> 
> By not sending an IPI to all CPUs you can get into trouble and crash the
> kernel.
> 
> We absolutely must keep sending the IPI to all relevant CPUs, we can
> choose not to actually do the flush on some CPUs, but we must keep
> sending the IPI.

The alternative is switching x86 over to use HAVE_RCU_TABLE_FREE.
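
For readers unfamiliar with the option: with CONFIG_HAVE_RCU_TABLE_FREE
the architecture hands page-table pages to tlb_remove_table(), and the
generic mmu_gather code defers the actual free until lockless walkers
such as gup_fast() (which run with interrupts disabled) can no longer
hold a reference, so correctness no longer hinges on every CPU receiving
the flush IPI. A rough, illustrative sketch -- not the actual x86
conversion, and the function name is made up:

static void example_free_pte_table(struct mmu_gather *tlb, struct page *pte)
{
	pgtable_page_dtor(pte);
	/* deferred-free path instead of an immediate tlb_remove_page() */
	tlb_remove_table(tlb, page_address(pte));
}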

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/3] x86: mm: Change tlb_flushall_shift for IvyBridge
  2013-12-14 14:19               ` Peter Zijlstra
@ 2013-12-16  8:26                 ` Alex Shi
  -1 siblings, 0 replies; 50+ messages in thread
From: Alex Shi @ 2013-12-16  8:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mel Gorman, H Peter Anvin, Linux-X86, Linux-MM,
	LKML, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Fengguang Wu

On 12/14/2013 10:19 PM, Peter Zijlstra wrote:
> On Fri, Dec 13, 2013 at 10:11:05AM +0800, Alex Shi wrote:
>> BTW,
>> A bewitching idea is till attracting me.
>> https://lkml.org/lkml/2012/5/23/148
>> Even it was sentenced to death by HPA.
>> https://lkml.org/lkml/2012/5/24/143
>>
>> That is that just flush one of thread TLB is enough for SMT/HT, seems
>> TLB is still shared in core on Intel CPU. This benefit is unconditional,
>> and if my memory right, Kbuild testing can improve about 1~2% in average
>> level.
>>
>> So could you like to accept some ugly quirks to do this lazy TLB flush
>> on known working CPU?
>> Forgive me if it's stupid.
> 
> I think there's a further problem with that patch -- aside of it being
> right from a hardware point of view.
> 
> We currently rely on the tlb flush IPI to synchronize with lockless page
> table walkers like gup_fast().

I am sorry if I am missing something. :)

But if my understanding is correct, in the gup_fast example
wait_split_huge_page will never hit the BUG_ON(), since the TLB flush
IPI is still sent out to clear _PAGE_SPLITTING on every CPU core. This
patch only stops repeating the TLB flush on the other SMT thread of the
same core. If only one SMT thread is affected, the flush is still
executed on it.

#define wait_split_huge_page(__anon_vma, __pmd)                         \
        do {                                                            \
                pmd_t *____pmd = (__pmd);                               \
                anon_vma_lock_write(__anon_vma);                        \
                anon_vma_unlock_write(__anon_vma);                      \
                BUG_ON(pmd_trans_splitting(*____pmd) ||                 \
                       pmd_trans_huge(*____pmd));                       \
        } while (0)

> 
> By not sending an IPI to all CPUs you can get into trouble and crash the
> kernel.
> 
> We absolutely must keep sending the IPI to all relevant CPUs, we can
> choose not to actually do the flush on some CPUs, but we must keep
> sending the IPI.
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/3] x86: mm: Change tlb_flushall_shift for IvyBridge
  2013-12-16  8:26                 ` Alex Shi
@ 2013-12-16 10:06                   ` Peter Zijlstra
  -1 siblings, 0 replies; 50+ messages in thread
From: Peter Zijlstra @ 2013-12-16 10:06 UTC (permalink / raw)
  To: Alex Shi
  Cc: Ingo Molnar, Mel Gorman, H Peter Anvin, Linux-X86, Linux-MM,
	LKML, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Fengguang Wu

On Mon, Dec 16, 2013 at 04:26:31PM +0800, Alex Shi wrote:
> On 12/14/2013 10:19 PM, Peter Zijlstra wrote:
> > On Fri, Dec 13, 2013 at 10:11:05AM +0800, Alex Shi wrote:
> >> BTW,
> >> A bewitching idea is till attracting me.
> >> https://lkml.org/lkml/2012/5/23/148
> >> Even it was sentenced to death by HPA.
> >> https://lkml.org/lkml/2012/5/24/143
> >>
> >> That is that just flush one of thread TLB is enough for SMT/HT, seems
> >> TLB is still shared in core on Intel CPU. This benefit is unconditional,
> >> and if my memory right, Kbuild testing can improve about 1~2% in average
> >> level.
> >>
> >> So could you like to accept some ugly quirks to do this lazy TLB flush
> >> on known working CPU?
> >> Forgive me if it's stupid.
> > 
> > I think there's a further problem with that patch -- aside of it being
> > right from a hardware point of view.
> > 
> > We currently rely on the tlb flush IPI to synchronize with lockless page
> > table walkers like gup_fast().
> 
> I am sorry if I miss sth. :)
> 
> But if my understand correct, in the example of gup_fast, wait_split_huge_page
> will never goes to BUG_ON(). Since the flush TLB IPI still be sent out to clear
> each of _PAGE_SPLITTING on each CPU core. This patch just stop repeat TLB flush
> in another SMT on same core. If there only noe SMT affected, the flush still be 
> executed on it.

This has nothing what so ff'ing ever to do with huge pages.



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/3] x86: mm: Change tlb_flushall_shift for IvyBridge
  2013-12-14 14:27                 ` Peter Zijlstra
@ 2013-12-16 13:59                   ` Ingo Molnar
  -1 siblings, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2013-12-16 13:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alex Shi, Mel Gorman, H Peter Anvin, Linux-X86, Linux-MM, LKML,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Fengguang Wu


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Sat, Dec 14, 2013 at 03:19:02PM +0100, Peter Zijlstra wrote:
> > On Fri, Dec 13, 2013 at 10:11:05AM +0800, Alex Shi wrote:
> > > BTW,
> > > A bewitching idea is till attracting me.
> > > https://lkml.org/lkml/2012/5/23/148
> > > Even it was sentenced to death by HPA.
> > > https://lkml.org/lkml/2012/5/24/143
> > > 
> > > That is that just flush one of thread TLB is enough for SMT/HT, seems
> > > TLB is still shared in core on Intel CPU. This benefit is unconditional,
> > > and if my memory right, Kbuild testing can improve about 1~2% in average
> > > level.
> > > 
> > > So could you like to accept some ugly quirks to do this lazy TLB flush
> > > on known working CPU?
> > > Forgive me if it's stupid.
> > 
> > I think there's a further problem with that patch -- aside of it being
> > right from a hardware point of view.
> > 
> > We currently rely on the tlb flush IPI to synchronize with lockless page
> > table walkers like gup_fast().
> > 
> > By not sending an IPI to all CPUs you can get into trouble and crash the
> > kernel.
> > 
> > We absolutely must keep sending the IPI to all relevant CPUs, we can
> > choose not to actually do the flush on some CPUs, but we must keep
> > sending the IPI.
> 
> The alternative is switching x86 over to use HAVE_RCU_TABLE_FREE.

So if the kbuild speedup of 1-2% is true and reproducible then that 
might be worth doing.

Building the kernel is obviously a prime workload - and given that the 
kernel is active only about 10% of the time for a typical kernel 
build, a 1-2% speedup means a 10-20% speedup in kernel performance 
(which sounds a bit too good at first glance).
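
To spell the arithmetic out with illustrative numbers: if a build burns
100 s of CPU time, roughly 10 s of it in the kernel, and the 1-2 s
saving on the total all comes out of kernel time, then:

	kernel time:  10 s -> 8..9 s   (a 10-20% reduction in kernel mode)
	user time:    ~90 s, unchanged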

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/3] x86: mm: Change tlb_flushall_shift for IvyBridge
  2013-12-16 13:59                   ` Ingo Molnar
@ 2013-12-17 11:59                     ` Alex Shi
  -1 siblings, 0 replies; 50+ messages in thread
From: Alex Shi @ 2013-12-17 11:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Mel Gorman, H Peter Anvin, Linux-X86, Linux-MM, LKML,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Fengguang Wu

On 12/16/2013 09:59 PM, Ingo Molnar wrote:
> So if the kbuild speedup of 1-2% is true and reproducable then that 
> might be worth doing.

I have an Intel desktop but need it for daily work. I wonder whether
the Intel guys would like to have a try? I assume the patch is already
in Fengguang's testing system.
> 
> Building the kernel is obviously a prime workload - and given that the 
> kernel is active only about 10% of the time for a typical kernel 
> build, a 1-2% speedup means a 10-20% speedup in kernel performance 
> (which sounds a bit too good at first glance).

Maybe the extra TLB flushes cause more TLB refills, which cost a lot of
user-space time.
-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/3] x86: mm: Change tlb_flushall_shift for IvyBridge
  2013-12-17 11:59                     ` Alex Shi
@ 2013-12-17 13:14                       ` Ingo Molnar
  -1 siblings, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2013-12-17 13:14 UTC (permalink / raw)
  To: Alex Shi
  Cc: Peter Zijlstra, Mel Gorman, H Peter Anvin, Linux-X86, Linux-MM,
	LKML, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Fengguang Wu


* Alex Shi <alex.shi@linaro.org> wrote:

> > Building the kernel is obviously a prime workload - and given that 
> > the kernel is active only about 10% of the time for a typical 
> > kernel build, a 1-2% speedup means a 10-20% speedup in kernel 
> > performance (which sounds a bit too good at first glance).
> 
> Maybe a extra time tlb flush causes more tlb refill that cost much 
> user space time.

All these things are measurable; that way, maybes can be converted into 
certainty.
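
One way to turn those maybes into numbers (assuming a perf-enabled
kernel and a CPU that exposes the generic TLB events) is to compare
hardware TLB-miss counts for the same build before and after the patch:

	perf stat -e dTLB-load-misses,dTLB-store-misses,iTLB-load-misses make -j16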

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2013-12-17 13:15 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-12 11:55 [RFC PATCH 0/3] Fix ebizzy performance regression on IvyBridge due to X86 TLB range flush Mel Gorman
2013-12-12 11:55 ` Mel Gorman
2013-12-12 11:55 ` [PATCH 1/3] x86: mm: Clean up inconsistencies when flushing TLB ranges Mel Gorman
2013-12-12 11:55   ` Mel Gorman
2013-12-12 13:59   ` Alex Shi
2013-12-12 13:59     ` Alex Shi
2013-12-12 23:53     ` Mel Gorman
2013-12-12 23:53       ` Mel Gorman
2013-12-12 11:55 ` [PATCH 2/3] x86: mm: Change tlb_flushall_shift for IvyBridge Mel Gorman
2013-12-12 11:55   ` Mel Gorman
2013-12-12 13:13   ` Ingo Molnar
2013-12-12 13:13     ` Ingo Molnar
2013-12-12 13:38     ` Alex Shi
2013-12-12 13:38       ` Alex Shi
2013-12-12 14:11       ` Ingo Molnar
2013-12-12 14:11         ` Ingo Molnar
2013-12-13  1:02         ` Alex Shi
2013-12-13  1:02           ` Alex Shi
2013-12-13  2:11           ` Alex Shi
2013-12-13  2:11             ` Alex Shi
2013-12-13 13:43             ` Ingo Molnar
2013-12-13 13:43               ` Ingo Molnar
2013-12-14 11:01               ` Alex Shi
2013-12-14 11:01                 ` Alex Shi
2013-12-14 14:19             ` Peter Zijlstra
2013-12-14 14:19               ` Peter Zijlstra
2013-12-14 14:27               ` Peter Zijlstra
2013-12-14 14:27                 ` Peter Zijlstra
2013-12-16 13:59                 ` Ingo Molnar
2013-12-16 13:59                   ` Ingo Molnar
2013-12-17 11:59                   ` Alex Shi
2013-12-17 11:59                     ` Alex Shi
2013-12-17 13:14                     ` Ingo Molnar
2013-12-17 13:14                       ` Ingo Molnar
2013-12-16  8:26               ` Alex Shi
2013-12-16  8:26                 ` Alex Shi
2013-12-16 10:06                 ` Peter Zijlstra
2013-12-16 10:06                   ` Peter Zijlstra
2013-12-12 13:45   ` Alex Shi
2013-12-12 13:45     ` Alex Shi
2013-12-12 11:55 ` [PATCH 3/3] x86: mm: Account for the of CPUs that must be flushed during a TLB range flush Mel Gorman
2013-12-12 11:55   ` Mel Gorman
2013-12-12 13:41   ` Alex Shi
2013-12-12 13:41     ` Alex Shi
2013-12-12 13:01 ` [RFC PATCH 0/3] Fix ebizzy performance regression on IvyBridge due to X86 " Ingo Molnar
2013-12-12 13:01   ` Ingo Molnar
2013-12-12 14:40   ` Mel Gorman
2013-12-12 14:40     ` Mel Gorman
2013-12-13 13:35     ` Ingo Molnar
2013-12-13 13:35       ` Ingo Molnar
