All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 3/4]x86: make tlb invalidate vector number configurable
@ 2011-01-17  2:52 Shaohua Li
  2011-02-14 13:53 ` [tip:x86/mm] x86: Scale up the number of TLB invalidate vectors with NR_CPUs, up to 32 tip-bot for Shaohua Li
  0 siblings, 1 reply; 2+ messages in thread
From: Shaohua Li @ 2011-01-17  2:52 UTC (permalink / raw)
  To: lkml; +Cc: Ingo Molnar, Andi Kleen, hpa, Andrew Morton, Eric Dumazet

Make the maxium TLB invalidate vectors depend on NR_CPUS, and the maxium
number is 32.
we currently only have 8 vectors for TLB invalidate. If we have a lot
of CPUs, the CPUs need share the 8 vectors and tlbstate_lock is used
to protect them. flush_tlb_page() is heavily used in page reclaim,
which will cause a lot of lock contention for tlbstate_lock. Andi Kleen
suggests increasing the vectors number to 32, which should be good for
current typical systems to reduce the tlbstate_lock contention.

My test system has 4 sockets and 64G memory, and 64 CPUs. My workload
creates 64 processes. Each process mmap reads a big empty sparse file.
The total size of the files are 2*total_mem, so this will cause a lot
of page reclaim. Below is the result I get from perf:
without the patch:
    24.25%           usemem  [kernel]                                   [k] _raw_spin_lock
                     |
                     --- _raw_spin_lock
                        |
                        |--42.15%-- native_flush_tlb_others
with the patch:
    14.96%           usemem  [kernel]                                   [k] _raw_spin_lock
                     |
                     --- _raw_spin_lock
                        |--13.89%-- native_flush_tlb_others
So this heavily reduces the tlbstate_lock contention.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 arch/x86/include/asm/irq_vectors.h |   13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

Index: linux/arch/x86/include/asm/irq_vectors.h
===================================================================
--- linux.orig/arch/x86/include/asm/irq_vectors.h	2010-11-02 14:41:01.000000000 +0800
+++ linux/arch/x86/include/asm/irq_vectors.h	2010-11-02 14:52:10.000000000 +0800
@@ -17,8 +17,8 @@
  *  Vectors   0 ...  31 : system traps and exceptions - hardcoded events
  *  Vectors  32 ... 127 : device interrupts
  *  Vector  128         : legacy int80 syscall interface
- *  Vectors 129 ... 229 : device interrupts
- *  Vectors 230 ... 255 : special interrupts
+ *  Vectors 129 ... INVALIDATE_TLB_VECTOR_START-1 : device interrupts
+ *  Vectors INVALIDATE_TLB_VECTOR_START ... 255 : special interrupts
  *
  * 64-bit x86 has per CPU IDT tables, 32-bit has one shared IDT table.
  *
@@ -124,8 +124,13 @@
  */
 #define LOCAL_TIMER_VECTOR		0xef
 
-/* f0-f7 used for spreading out TLB flushes: */
-#define NUM_INVALIDATE_TLB_VECTORS	   8
+/* up to 32 vectors used for spreading out TLB flushes: */
+#if NR_CPUS > 32
+#define NUM_INVALIDATE_TLB_VECTORS 32
+#else
+#define NUM_INVALIDATE_TLB_VECTORS NR_CPUS
+#endif
+
 #define INVALIDATE_TLB_VECTOR_END	0xee
 #define INVALIDATE_TLB_VECTOR_START	\
 	(INVALIDATE_TLB_VECTOR_END - NUM_INVALIDATE_TLB_VECTORS + 1)



^ permalink raw reply	[flat|nested] 2+ messages in thread

* [tip:x86/mm] x86: Scale up the number of TLB invalidate vectors with NR_CPUs, up to 32
  2011-01-17  2:52 [PATCH 3/4]x86: make tlb invalidate vector number configurable Shaohua Li
@ 2011-02-14 13:53 ` tip-bot for Shaohua Li
  0 siblings, 0 replies; 2+ messages in thread
From: tip-bot for Shaohua Li @ 2011-02-14 13:53 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, acme, hpa, mingo, eric.dumazet, andi, a.p.zijlstra,
	shaohua.li, fweisbec, tglx, mingo

Commit-ID:  70e4a369733a21e3d16b059a6ccdad22a344bf57
Gitweb:     http://git.kernel.org/tip/70e4a369733a21e3d16b059a6ccdad22a344bf57
Author:     Shaohua Li <shaohua.li@intel.com>
AuthorDate: Mon, 17 Jan 2011 10:52:07 +0800
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 14 Feb 2011 13:03:08 +0100

x86: Scale up the number of TLB invalidate vectors with NR_CPUs, up to 32

Make the maxium TLB invalidate vectors depend on NR_CPUS linearly,
with a maximum of 32 vectors.

We currently only have 8 vectors for TLB invalidation and that is clearly
inadequate. If we have a lot of CPUs, the CPUs need share the 8 vectors and
tlbstate_lock is used to protect them. flush_tlb_page() is
heavily used in page reclaim, which will cause a lot of lock
contention for tlbstate_lock.

Andi Kleen suggested increasing the vectors number to 32, which should be
good for current typical systems to reduce the tlbstate_lock contention.

My test system has 4 sockets and 64G memory, and 64 CPUs. My
workload creates 64 processes. Each process mmap reads a big
empty sparse file. The total size of the files are 2*total_mem,
so this will cause a lot of page reclaim.

Below is the result I get from perf call-graph profiling:

 without the patch:
 ------------------

    24.25%           usemem  [kernel]                                   [k] _raw_spin_lock
                     |
                     --- _raw_spin_lock
                        |
                        |--42.15%-- native_flush_tlb_others

 with the patch:
 ------------------

    14.96%           usemem  [kernel]                                   [k] _raw_spin_lock
                     |
                     --- _raw_spin_lock
                        |--13.89%-- native_flush_tlb_others

So this heavily reduces the tlbstate_lock contention.

Suggested-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1295232727.1949.709.camel@sli10-conroe>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/include/asm/irq_vectors.h |   13 +++++++++----
 1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 42f0d4a..4980f48 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -17,8 +17,8 @@
  *  Vectors   0 ...  31 : system traps and exceptions - hardcoded events
  *  Vectors  32 ... 127 : device interrupts
  *  Vector  128         : legacy int80 syscall interface
- *  Vectors 129 ... 229 : device interrupts
- *  Vectors 230 ... 255 : special interrupts
+ *  Vectors 129 ... INVALIDATE_TLB_VECTOR_START-1 : device interrupts
+ *  Vectors INVALIDATE_TLB_VECTOR_START ... 255 : special interrupts
  *
  * 64-bit x86 has per CPU IDT tables, 32-bit has one shared IDT table.
  *
@@ -124,8 +124,13 @@
  */
 #define LOCAL_TIMER_VECTOR		0xef
 
-/* f0-f7 used for spreading out TLB flushes: */
-#define NUM_INVALIDATE_TLB_VECTORS	   8
+/* up to 32 vectors used for spreading out TLB flushes: */
+#if NR_CPUS <= 32
+# define NUM_INVALIDATE_TLB_VECTORS NR_CPUS
+#else
+# define NUM_INVALIDATE_TLB_VECTORS 32
+#endif
+
 #define INVALIDATE_TLB_VECTOR_END	0xee
 #define INVALIDATE_TLB_VECTOR_START	\
 	(INVALIDATE_TLB_VECTOR_END - NUM_INVALIDATE_TLB_VECTORS + 1)

^ permalink raw reply related	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2011-02-14 18:07 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-17  2:52 [PATCH 3/4]x86: make tlb invalidate vector number configurable Shaohua Li
2011-02-14 13:53 ` [tip:x86/mm] x86: Scale up the number of TLB invalidate vectors with NR_CPUs, up to 32 tip-bot for Shaohua Li

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.