From: Borislav Petkov <bp@amd64.org>
To: Alex Shi <alex.shi@intel.com>
Cc: andi.kleen@intel.com, tim.c.chen@linux.intel.com,
	jeremy@goop.org, chrisw@sous-sol.org, akataria@vmware.com,
	tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com,
	rostedt@goodmis.org, fweisbec@gmail.com, riel@redhat.com,
	luto@mit.edu, avi@redhat.com, len.brown@intel.com,
	paul.gortmaker@windriver.com, dhowells@redhat.com,
	fenghua.yu@intel.com, yinghai@kernel.org, cpw@sgi.com,
	steiner@sgi.com, linux-kernel@vger.kernel.org,
	yongjie.ren@intel.com
Subject: Re: [PATCH 1/3] x86/tlb_info: get last level TLB entry number of CPU
Date: Sun, 29 Apr 2012 15:55:29 +0200
Message-ID: <20120429135529.GA2713@aftab.osrc.amd.com>
In-Reply-To: <1335603099-2624-2-git-send-email-alex.shi@intel.com>

On Sat, Apr 28, 2012 at 04:51:37PM +0800, Alex Shi wrote:
> For 4KB pages, x86 CPUs have one or two levels of TLB: the first level is
> split into a data TLB and an instruction TLB, the second level is a TLB
> shared by data and instructions.
> 
> For huge pages there is usually just one TLB level, with separate entries
> for 2MB/4MB and for 1GB pages.
> 
> Although each level's TLB size matters for performance tuning, for a
> general, coarse optimization the last-level TLB entry count is enough; in
> fact, the last level TLB has the largest number of entries.
> 
> This patch reads that largest TLB entry count for use in future TLB
> optimizations.
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  arch/x86/include/asm/processor.h |   12 +++
>  arch/x86/kernel/cpu/common.c     |  163 ++++++++++++++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/cpu.h        |    1 +
>  3 files changed, 176 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 4fa7dcc..a91504b 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -61,6 +61,18 @@ static inline void *current_text_addr(void)
>  # define ARCH_MIN_MMSTRUCT_ALIGN	0
>  #endif
>  
> +enum tlb_infos {
> +	ENTRIES,
> +	/* ASS_WAYS, */

We don't need associativity?

> +	NR_INFO
> +};
> +
> +extern u16 __read_mostly tlb_lli_4k[NR_INFO];
> +extern u16 __read_mostly tlb_lli_2m[NR_INFO];
> +extern u16 __read_mostly tlb_lli_4m[NR_INFO];
> +extern u16 __read_mostly tlb_lld_4k[NR_INFO];
> +extern u16 __read_mostly tlb_lld_2m[NR_INFO];
> +extern u16 __read_mostly tlb_lld_4m[NR_INFO];

[..]

> +void __cpuinit cpu_detect_tlb_sizes()
> +{
> +	int i, j, n;
> +	unsigned int regs[4];
> +	unsigned char *desc = (unsigned char *)regs;
> +
> +	/* Number of times to iterate */
> +	n = cpuid_eax(2) & 0xFF;
> +
> +	for (i = 0 ; i < n ; i++) {
> +		cpuid(2, &regs[0], &regs[1], &regs[2], &regs[3]);

Ok, getting TLB sizes on AMD is easier :), see dirty patch below.

Also, there's cpuinfo_x86.x86_tlbsize which is L1 iTLB + L1 dTLB 4K
entries. The tlb sizes below could probably be integrated/cached there
too if this proves to bring some speedup.
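
FWIW, here is a quick userspace sketch (not part of the patch, just for
eyeballing the values on a given box) which decodes the same L2/L1 TLB
fields from CPUID leaves 0x80000006/0x80000005 the way the patch below
does, using GCC's <cpuid.h>:

	/* tlbs.c: userspace sketch, mirrors the AMD decoding in the patch below */
	#include <cpuid.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		if (!__get_cpuid(0x80000006, &eax, &ebx, &ecx, &edx))
			return 1;

		/* EAX: 2M/4M pages, EBX: 4K pages; bits [27:16] data, [11:0] instruction */
		printf("L2 dTLB 2M/4M: %u, iTLB 2M/4M: %u\n",
		       (eax >> 16) & 0xfff, eax & 0xfff);
		printf("L2 dTLB 4K:    %u, iTLB 4K:    %u\n",
		       (ebx >> 16) & 0xfff, ebx & 0xfff);

		/* L1 TLBs come from leaf 0x80000005, with 8-bit entry fields */
		if (__get_cpuid(0x80000005, &eax, &ebx, &ecx, &edx))
			printf("L1 dTLB 2M/4M: %u, iTLB 2M/4M: %u\n",
			       (eax >> 16) & 0xff, eax & 0xff);

		return 0;
	}

Build with "gcc -O2 -o tlbs tlbs.c" and compare against what the kernel
ends up printing.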

But initial testing looks good:

This is Linus' git from today:

my pid is 2798 n=32 l=1024 p=512 t=1
get 256K pages with one byte writing uses 689ms, 2629ns/time
mprotect use 71ms 2178ns/time, 14103 times/thread/ms, cost 70ns/time
my pid is 2800 n=32 l=1024 p=512 t=2
get 256K pages with one byte writing uses 686ms, 2620ns/time
mprotect use 82ms 2508ns/time, 14272 times/thread/ms, cost 70ns/time
my pid is 2803 n=32 l=1024 p=512 t=4
get 256K pages with one byte writing uses 686ms, 2620ns/time
mprotect use 102ms 3120ns/time, 15332 times/thread/ms, cost 65ns/time
my pid is 2808 n=32 l=1024 p=512 t=8
get 256K pages with one byte writing uses 686ms, 2617ns/time
mprotect use 142ms 4350ns/time, 16930 times/thread/ms, cost 59ns/time
my pid is 2817 n=32 l=1024 p=512 t=16
get 256K pages with one byte writing uses 671ms, 2562ns/time
mprotect use 226ms 6925ns/time, 20508 times/thread/ms, cost 48ns/time
my pid is 2834 n=32 l=1024 p=512 t=32
get 256K pages with one byte writing uses 679ms, 2593ns/time
mprotect use 497ms 15182ns/time, 31891 times/thread/ms, cost 31ns/time
my pid is 2867 n=32 l=1024 p=512 t=64
get 256K pages with one byte writing uses 675ms, 2575ns/time
mprotect use 394ms 12031ns/time, 12727 times/thread/ms, cost 78ns/time
my pid is 2932 n=32 l=1024 p=512 t=128
get 256K pages with one byte writing uses 680ms, 2597ns/time
mprotect use 1425ms 43506ns/time, 11718 times/thread/ms, cost 85ns/time

and this is with your patches ontop:

my pid is 2817 n=32 l=1024 p=512 t=1
get 256K pages with one byte writing uses 680ms, 2597ns/time
mprotect use 120ms 3691ns/time, 35043 times/thread/ms, cost 28ns/time
my pid is 2819 n=32 l=1024 p=512 t=2
get 256K pages with one byte writing uses 678ms, 2588ns/time
mprotect use 133ms 4079ns/time, 36233 times/thread/ms, cost 27ns/time
my pid is 2822 n=32 l=1024 p=512 t=4
get 256K pages with one byte writing uses 675ms, 2578ns/time
mprotect use 162ms 4953ns/time, 38283 times/thread/ms, cost 26ns/time
my pid is 2827 n=32 l=1024 p=512 t=8
get 256K pages with one byte writing uses 680ms, 2593ns/time
mprotect use 243ms 7425ns/time, 42101 times/thread/ms, cost 23ns/time
my pid is 2836 n=32 l=1024 p=512 t=16
get 256K pages with one byte writing uses 673ms, 2570ns/time
mprotect use 356ms 10869ns/time, 45748 times/thread/ms, cost 21ns/time
my pid is 2853 n=32 l=1024 p=512 t=32
get 256K pages with one byte writing uses 667ms, 2545ns/time
mprotect use 460ms 14063ns/time, 35435 times/thread/ms, cost 28ns/time
my pid is 2886 n=32 l=1024 p=512 t=64
get 256K pages with one byte writing uses 672ms, 2564ns/time
mprotect use 1298ms 39641ns/time, 23971 times/thread/ms, cost 41ns/time
my pid is 2951 n=32 l=1024 p=512 t=128
get 256K pages with one byte writing uses 673ms, 2567ns/time
mprotect use 2682ms 81873ns/time, 12956 times/thread/ms, cost 77ns/time

and I definitely like those numbers.

So, assuming others don't have a problem with this approach, I like
this. Haven't looked at the other two patches yet though.
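
As an aside: the test program itself wasn't posted, so purely for
illustration, here is a minimal, hypothetical sketch of the kind of
per-thread loop that would produce output like the above -- fault in an
anonymous mapping with one-byte writes, then time repeated mprotect()
flips over the whole range, which is what exercises the TLB flush path
this series touches:

	/* mprot.c: hypothetical sketch only -- not the actual benchmark */
	#define _GNU_SOURCE
	#include <pthread.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/mman.h>
	#include <time.h>
	#include <unistd.h>

	#define PAGES	(256 * 1024)	/* assumed per-thread working set */
	#define ITERS	1024		/* assumed mprotect repetitions */

	static long pagesz;

	static long long now_ns(void)
	{
		struct timespec ts;

		clock_gettime(CLOCK_MONOTONIC, &ts);
		return ts.tv_sec * 1000000000LL + ts.tv_nsec;
	}

	static void *worker(void *arg)
	{
		size_t len = (size_t)PAGES * pagesz;
		char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		long long t0, t1;
		long i;

		if (buf == MAP_FAILED)
			return NULL;

		/* "one byte writing": populate every page */
		t0 = now_ns();
		for (i = 0; i < PAGES; i++)
			buf[i * pagesz] = 1;
		t1 = now_ns();
		printf("fault-in: %lld ms\n", (t1 - t0) / 1000000);

		/* time mprotect() flips over the whole range */
		t0 = now_ns();
		for (i = 0; i < ITERS; i++) {
			mprotect(buf, len, PROT_READ);
			mprotect(buf, len, PROT_READ | PROT_WRITE);
		}
		t1 = now_ns();
		printf("mprotect: %lld ms, %lld ns/call\n",
		       (t1 - t0) / 1000000, (t1 - t0) / (2LL * ITERS));

		munmap(buf, len);
		return NULL;
	}

	int main(int argc, char **argv)
	{
		int i, t = argc > 1 ? atoi(argv[1]) : 1;	/* thread count */
		pthread_t *th = calloc(t, sizeof(*th));

		pagesz = sysconf(_SC_PAGESIZE);
		for (i = 0; i < t; i++)
			pthread_create(&th[i], NULL, worker, NULL);
		for (i = 0; i < t; i++)
			pthread_join(th[i], NULL);
		return 0;
	}

Built with "gcc -O2 -pthread mprot.c"; the parameters above are guesses,
so absolute numbers won't match -- it is only meant to show the shape of
the test.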

> +
> +		/* If bit 31 is set, this is an unknown format */
> +		for (j = 0 ; j < 3 ; j++)
> +			if (regs[j] & (1 << 31))
> +				regs[j] = 0;
> +
> +		/* Byte 0 is level count, not a descriptor */
> +		for (j = 1 ; j < 16 ; j++)
> +			tlb_lookup(desc[j]);
> +	}
> +	printk(KERN_INFO "Last level iTLB entries: 4KB %d, 2MB %d, 4MB %d\n" \
> +		"Last level dTLB entires: 4KB %d, 2MB %d, 4MB %d\n",

I'm sure you mean "entries" :-)

> +		tlb_lli_4k[ENTRIES], tlb_lli_2m[ENTRIES],
> +		tlb_lli_4m[ENTRIES], tlb_lld_4k[ENTRIES],
> +		tlb_lld_2m[ENTRIES], tlb_lld_4m[ENTRIES]);
> +}
> +
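
For readers not familiar with leaf 2: each non-zero descriptor byte
returned by CPUID(2) names one cache or TLB structure, and tlb_lookup()
-- not quoted in this hunk -- has to translate those into the tlb_ll*
arrays above. A rough illustration with just two descriptor values taken
from the Intel SDM table (the real function of course needs to cover the
whole table):

	/* illustrative sketch only, not the patch's actual tlb_lookup() */
	static void tlb_lookup(const unsigned char desc)
	{
		switch (desc) {
		case 0x03:	/* dTLB: 4KB pages, 4-way, 64 entries */
			if (tlb_lld_4k[ENTRIES] < 64)
				tlb_lld_4k[ENTRIES] = 64;
			break;
		case 0xb0:	/* iTLB: 4KB pages, 4-way, 128 entries */
			if (tlb_lli_4k[ENTRIES] < 128)
				tlb_lli_4k[ENTRIES] = 128;
			break;
		/* ... remaining descriptors from the SDM ... */
		default:
			break;
		}
	}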
>  void __cpuinit detect_ht(struct cpuinfo_x86 *c)
>  {
>  #ifdef CONFIG_X86_HT
> @@ -911,6 +1072,8 @@ void __init identify_boot_cpu(void)
>  #else
>  	vgetcpu_set_mode();
>  #endif
> +	if (boot_cpu_data.cpuid_level >= 2)
> +		cpu_detect_tlb_sizes();
>  }
>  
>  void __cpuinit identify_secondary_cpu(struct cpuinfo_x86 *c)
> diff --git a/arch/x86/kernel/cpu/cpu.h b/arch/x86/kernel/cpu/cpu.h
> index 8bacc78..a102ed1 100644
> --- a/arch/x86/kernel/cpu/cpu.h
> +++ b/arch/x86/kernel/cpu/cpu.h
> @@ -34,4 +34,5 @@ extern const struct cpu_dev *const __x86_cpu_dev_start[],
>  
>  extern void get_cpu_cap(struct cpuinfo_x86 *c);
>  extern void cpu_detect_cache_sizes(struct cpuinfo_x86 *c);
> +extern void cpu_detect_tlb_sizes(void);
>  #endif /* ARCH_X86_CPU_H */
> -- 

Thanks.

From: Borislav Petkov <borislav.petkov@amd.com>
Date: Sun, 29 Apr 2012 15:23:36 +0200
Subject: [PATCH 2/4] x86: Add AMD TLB size detection

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 arch/x86/kernel/cpu/common.c |   47 +++++++++++++++++++++++++++++-------------
 arch/x86/kernel/cpu/cpu.h    |    2 +-
 2 files changed, 34 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 5f14a700a665..9609fa74cfaf 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -585,29 +585,48 @@ void tlb_lookup(const unsigned char desc)
 		break;
 	}
 }
-void __cpuinit cpu_detect_tlb_sizes()
+void __cpuinit cpu_detect_tlb_sizes(struct cpuinfo_x86 *c)
 {
 	int i, j, n;
 	unsigned int regs[4];
 	unsigned char *desc = (unsigned char *)regs;
 
-	/* Number of times to iterate */
-	n = cpuid_eax(2) & 0xFF;
+	if (c->x86_vendor == X86_VENDOR_AMD) {
+		cpuid(0x80000006, &regs[0], &regs[1], &regs[2], &regs[3]);
 
-	for (i = 0 ; i < n ; i++) {
-		cpuid(2, &regs[0], &regs[1], &regs[2], &regs[3]);
+		tlb_lld_2m[ENTRIES] = tlb_lld_4m[ENTRIES] = (regs[0] >> 16) & 0xfff;
+		tlb_lli_2m[ENTRIES] = tlb_lli_4m[ENTRIES] = regs[0] & 0xfff;
+		tlb_lld_4k[ENTRIES] = (regs[1] >> 16) & 0xfff;
+		tlb_lli_4k[ENTRIES] =  regs[1] & 0xfff;
 
-		/* If bit 31 is set, this is an unknown format */
-		for (j = 0 ; j < 3 ; j++)
-			if (regs[j] & (1 << 31))
-				regs[j] = 0;
+		/* if any of the L2 TLBs are disabled, use L1 */
+		cpuid(0x80000005, &regs[0], &regs[1], &regs[2], &regs[3]);
 
-		/* Byte 0 is level count, not a descriptor */
-		for (j = 1 ; j < 16 ; j++)
-			tlb_lookup(desc[j]);
+		if (!tlb_lld_2m[ENTRIES])
+			tlb_lld_2m[ENTRIES] = tlb_lld_4m[ENTRIES] = (regs[0] >> 16) & 0xff;
+
+		if (!tlb_lli_2m[ENTRIES])
+			tlb_lli_2m[ENTRIES] = tlb_lli_4m[ENTRIES] = regs[0] & 0xff;
+	} else if (c->x86_vendor == X86_VENDOR_INTEL) {
+		/* Number of times to iterate */
+		n = cpuid_eax(2) & 0xFF;
+
+		for (i = 0 ; i < n ; i++) {
+			cpuid(2, &regs[0], &regs[1], &regs[2], &regs[3]);
+
+			/* If bit 31 is set, this is an unknown format */
+			for (j = 0 ; j < 3 ; j++)
+				if (regs[j] & (1 << 31))
+					regs[j] = 0;
+
+			/* Byte 0 is level count, not a descriptor */
+			for (j = 1 ; j < 16 ; j++)
+				tlb_lookup(desc[j]);
+		}
 	}
+
 	printk(KERN_INFO "Last level iTLB entries: 4KB %d, 2MB %d, 4MB %d\n" \
-		"Last level dTLB entires: 4KB %d, 2MB %d, 4MB %d\n",
+		"Last level dTLB entries: 4KB %d, 2MB %d, 4MB %d\n",
 		tlb_lli_4k[ENTRIES], tlb_lli_2m[ENTRIES],
 		tlb_lli_4m[ENTRIES], tlb_lld_4k[ENTRIES],
 		tlb_lld_2m[ENTRIES], tlb_lld_4m[ENTRIES]);
@@ -1073,7 +1092,7 @@ void __init identify_boot_cpu(void)
 	vgetcpu_set_mode();
 #endif
 	if (boot_cpu_data.cpuid_level >= 2)
-		cpu_detect_tlb_sizes();
+		cpu_detect_tlb_sizes(&boot_cpu_data);
 }
 
 void __cpuinit identify_secondary_cpu(struct cpuinfo_x86 *c)
diff --git a/arch/x86/kernel/cpu/cpu.h b/arch/x86/kernel/cpu/cpu.h
index a102ed1c8eca..01469b6dace1 100644
--- a/arch/x86/kernel/cpu/cpu.h
+++ b/arch/x86/kernel/cpu/cpu.h
@@ -34,5 +34,5 @@ extern const struct cpu_dev *const __x86_cpu_dev_start[],
 
 extern void get_cpu_cap(struct cpuinfo_x86 *c);
 extern void cpu_detect_cache_sizes(struct cpuinfo_x86 *c);
-extern void cpu_detect_tlb_sizes(void);
+extern void cpu_detect_tlb_sizes(struct cpuinfo_x86 *c);
 #endif /* ARCH_X86_CPU_H */
-- 
1.7.9.3.362.g71319

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

