[PATCH 3/4]x86: make tlb invalidate vector number configurable

From: Shaohua Li <shaohua.li@intel.com>
To: lkml <linux-kernel@vger.kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>, Andi Kleen <andi@firstfloor.org>,
	"hpa@zytor.com" <hpa@zytor.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Eric Dumazet <eric.dumazet@gmail.com>
Subject: [PATCH 3/4]x86: make tlb invalidate vector number configurable
Date: Mon, 17 Jan 2011 10:52:07 +0800
Message-ID: <1295232727.1949.709.camel@sli10-conroe>

Make the maximum number of TLB invalidate vectors depend on NR_CPUS,
with an upper limit of 32.

We currently have only 8 vectors for TLB invalidation. On a system with
many CPUs, the CPUs have to share these 8 vectors, and tlbstate_lock is
used to protect them. flush_tlb_page() is heavily used in page reclaim,
which causes a lot of contention on tlbstate_lock. Andi Kleen suggested
increasing the number of vectors to 32, which should be enough to reduce
tlbstate_lock contention on current typical systems.

My test system has 4 sockets, 64 CPUs and 64G of memory. My workload
creates 64 processes, each of which mmaps and reads a big empty sparse
file. The total size of the files is 2*total_mem, so this causes a lot
of page reclaim. Below are the results I get from perf:
without the patch:
    24.25%           usemem  [kernel]                                   [k] _raw_spin_lock
                     |
                     --- _raw_spin_lock
                        |
                        |--42.15%-- native_flush_tlb_others
with the patch:
    14.96%           usemem  [kernel]                                   [k] _raw_spin_lock
                     |
                     --- _raw_spin_lock
                        |--13.89%-- native_flush_tlb_others
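So _raw_spin_lock drops from 24.25% to 14.96% of samples, and the share
of it called from native_flush_tlb_others drops from 42.15% to 13.89%,
i.e. the patch heavily reduces the tlbstate_lock contention.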

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 arch/x86/include/asm/irq_vectors.h |   13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

Index: linux/arch/x86/include/asm/irq_vectors.h
===================================================================

--- linux.orig/arch/x86/include/asm/irq_vectors.h	2010-11-02 14:41:01.000000000 +0800
+++ linux/arch/x86/include/asm/irq_vectors.h	2010-11-02 14:52:10.000000000 +0800
@@ -17,8 +17,8 @@
  *  Vectors   0 ...  31 : system traps and exceptions - hardcoded events
  *  Vectors  32 ... 127 : device interrupts
  *  Vector  128         : legacy int80 syscall interface
- *  Vectors 129 ... 229 : device interrupts
- *  Vectors 230 ... 255 : special interrupts
+ *  Vectors 129 ... INVALIDATE_TLB_VECTOR_START-1 : device interrupts
+ *  Vectors INVALIDATE_TLB_VECTOR_START ... 255 : special interrupts
  *
  * 64-bit x86 has per CPU IDT tables, 32-bit has one shared IDT table.
  *
@@ -124,8 +124,13 @@
  */
 #define LOCAL_TIMER_VECTOR		0xef
 
-/* f0-f7 used for spreading out TLB flushes: */
-#define NUM_INVALIDATE_TLB_VECTORS	   8
+/* up to 32 vectors used for spreading out TLB flushes: */
+#if NR_CPUS > 32
+#define NUM_INVALIDATE_TLB_VECTORS 32
+#else
+#define NUM_INVALIDATE_TLB_VECTORS NR_CPUS
+#endif
+
 #define INVALIDATE_TLB_VECTOR_END	0xee
 #define INVALIDATE_TLB_VECTOR_START	\
 	(INVALIDATE_TLB_VECTOR_END - NUM_INVALIDATE_TLB_VECTORS + 1)