linux-kernel.vger.kernel.org archive mirror
* Re: [Bug 350] New: i386 context switch very slow compared to 2.4  due to wrmsr (performance)
       [not found] ` <20030318184010$6448@gated-at.bofh.it>
@ 2003-03-18 20:19   ` Pascal Schmidt
  0 siblings, 0 replies; 35+ messages in thread
From: Pascal Schmidt @ 2003-03-18 20:19 UTC (permalink / raw)
  To: linux-kernel; +Cc: Brian Gerst

On Tue, 18 Mar 2003 19:40:10 +0100, you wrote in linux.kernel:

> vendor_id       : AuthenticAMD
> cpu family      : 6
> model           : 6
> model name      : AMD Athlon(tm) Processor
> stepping        : 2
> cpu MHz         : 1409.946
> empty overhead=11 cycles
> load overhead=5 cycles
> I$ load overhead=5 cycles
> I$ load overhead=5 cycles
> I$ store overhead=826 cycles
> 
> The Athlon XP shows really bad behavior when you store to the text area.

There seems to be a (surprising?) difference between Athlon XP model 6 and 
model 8:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 8
model name      : AMD Athlon(tm) XP 1700+
stepping        : 0

empty overhead=16 cycles
load overhead=1 cycles
I$ load overhead=2 cycles
I$ load overhead=1 cycles
I$ store overhead=81 cycles

-- 
Ciao,
Pascal

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
@ 2003-03-19  9:55 Ph. Marek
  0 siblings, 0 replies; 35+ messages in thread
From: Ph. Marek @ 2003-03-19  9:55 UTC (permalink / raw)
  To: linux-kernel

Hi Linus,

which compiler optimization should I use for this test?

-O3 shows different values:
* the empty overhead is 4 cycles shorter
* but the store overhead goes from 3 to 48 cycles!

Please see below.


Regards,

Phil


gcc -O3 linus_i_d_cache.c -o linus_i_d_cache
./linus_i_d_cache

empty overhead=73 cycles
load overhead=10 cycles
I$ load overhead=10 cycles
I$ load overhead=10 cycles
I$ store overhead=48 cycles

gcc -g -Wall linus_i_d_cache.c -o linus_i_d_cache
./linus_i_d_cache

empty overhead=77 cycles
load overhead=12 cycles
I$ load overhead=12 cycles
I$ load overhead=12 cycles
I$ store overhead=3 cycles


cat /proc/cpuinfo

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 3
model name      : Pentium II (Klamath)
stepping        : 3
cpu MHz         : 265.916
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov mmx
bogomips        : 530.84





* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-03-19  0:42             ` H. Peter Anvin
@ 2003-03-19  2:22               ` george anzinger
  0 siblings, 0 replies; 35+ messages in thread
From: george anzinger @ 2003-03-19  2:22 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel

H. Peter Anvin wrote:
> Followup to:  <Pine.LNX.4.44.0303181113590.13708-100000@home.transmeta.com>
> By author:    Linus Torvalds <torvalds@transmeta.com>
> In newsgroup: linux.dev.kernel
> 
>>Wow. There aren't many cases where AMD shows the P4-like "big
>>latency in rare cases" behaviour.
>>
>>But quite honestly, I think they made the right call, and I _expect_ that
>>of modern CPU's. The fact is, modern CPU's tend to need to pre-decode the
>>instruction stream in some way, and storing to it while running from it is
>>just a really really bad idea. And since it's so easy to avoid it, you
>>really just shouldn't do it.
>>
> 
> 
> AMD, I believe, has an "annotated" icache.

Here is an SMP:

vendor_id	: AuthenticAMD
cpu family	: 6
model		: 6
model name	: AMD Athlon(TM) MP 2000+
stepping	: 2
cpu MHz		: 1680.368
cache size	: 256 KB

empty overhead=11 cycles
load overhead=6 cycles
I$ load overhead=5 cycles
I$ load overhead=6 cycles
I$ store overhead=1051 cycles


-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml



* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-02-13 18:07           ` Andi Kleen
@ 2003-03-19  1:22             ` Rob Landley
  0 siblings, 0 replies; 35+ messages in thread
From: Rob Landley @ 2003-03-19  1:22 UTC (permalink / raw)
  To: Andi Kleen, Eric W. Biederman; +Cc: linux-kernel, discuss

On Thursday 13 February 2003 13:07, Andi Kleen wrote:
> [Hmm, this is becoming a FAQ]
>
> > Switching in and out of long mode is evil enough that I don't think it
> > is worth it.  And encouraging people to write good JIT compiling
>
> Forget it. It is completely undefined in the architecture what happens
> then. You'll lose interrupts and everything. Not something for an
> operating system intended to be stable.
>
> I have no plans at all to even think about it for Linux/x86-64.
>
> > emulators sounds much better, especially in the long run.  But it can
> > be written.
>
> For DOS even a slow emulator should be good enough. After all, most
> DOS programs are written for slow machines. Bochs running on a K8
> will hopefully be fast enough. If not, a JIT can be written; perhaps
> you can extend valgrind for it.

Fabrice Bellard, the author of TCC (Tiny C Compiler), seems to have taken it 
into his head that Bochs and Valgrind are too slow, and his current pet 
project is writing a new hand-optimized, portable JIT x86 emulator.  So 
there's one in the works already... :)

(See the tinycc-devel@nongnu.org archives for details, just this past weekend 
in fact...)

Rob

-- 
penguicon.sf.net - A combination Linux Expo and Science Fiction Convention, 
May 2-4 2003 in Warren, Michigan. Tutorials, installfest, filk, masquerade...




* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-03-18 19:21           ` Linus Torvalds
  2003-03-18 20:03             ` Thomas Schlichter
  2003-03-18 20:24             ` Steven Cole
@ 2003-03-19  0:42             ` H. Peter Anvin
  2003-03-19  2:22               ` george anzinger
  2 siblings, 1 reply; 35+ messages in thread
From: H. Peter Anvin @ 2003-03-19  0:42 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <Pine.LNX.4.44.0303181113590.13708-100000@home.transmeta.com>
By author:    Linus Torvalds <torvalds@transmeta.com>
In newsgroup: linux.dev.kernel
> 
> Wow. There aren't many cases where AMD shows the P4-like "big
> latency in rare cases" behaviour.
> 
> But quite honestly, I think they made the right call, and I _expect_ that
> of modern CPU's. The fact is, modern CPU's tend to need to pre-decode the
> instruction stream in some way, and storing to it while running from it is
> just a really really bad idea. And since it's so easy to avoid it, you
> really just shouldn't do it.
> 

AMD, I believe, has an "annotated" icache.

	-hpa
-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
Architectures needed: ia64 m68k mips64 ppc ppc64 s390 s390x sh v850 x86-64


* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-03-18 19:21           ` Linus Torvalds
  2003-03-18 20:03             ` Thomas Schlichter
@ 2003-03-18 20:24             ` Steven Cole
  2003-03-19  0:42             ` H. Peter Anvin
  2 siblings, 0 replies; 35+ messages in thread
From: Steven Cole @ 2003-03-18 20:24 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Brian Gerst, Kevin Pedretti, Linux Kernel

On Tue, 2003-03-18 at 12:21, Linus Torvalds wrote:
> 
> On Tue, 18 Mar 2003, Brian Gerst wrote:
> > 
> > Here's a few more data points:
> 
> Ok, this shows the behaviour I was trying to explain:
> 
> > vendor_id       : AuthenticAMD
> > cpu family      : 5
> > model           : 8
> > model name      : AMD-K6(tm) 3D processor
> > stepping        : 12
> > cpu MHz         : 451.037
> > empty overhead=105 cycles
> > load overhead=-2 cycles
> > I$ load overhead=30 cycles
> > I$ load overhead=90 cycles
> > I$ store overhead=95 cycles
> 
> ie loading from the same cacheline shows bad behaviour, most likely due to 
> cache line exclusion. Does anybody have an original Pentium to see if I 
> remember that one right?

Does this help?

[steven@trendb steven]$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 5
model           : 2
model name      : Pentium 75 - 200
stepping        : 12
cpu MHz         : 166.196198
fdiv_bug        : no
hlt_bug         : no
sep_bug         : no
f00f_bug        : yes
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr mce cx8
bogomips        : 331.78

[steven@trendb steven]$ uptime
 12:17pm  up 272 days,  6:35,  2 users,  load average: 0.02, 0.01, 0.00

[steven@trendb steven]$ ./linus1
empty overhead=76 cycles
load overhead=10 cycles
I$ load overhead=34 cycles
I$ load overhead=23 cycles
I$ store overhead=25 cycles




* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-03-18 19:21           ` Linus Torvalds
@ 2003-03-18 20:03             ` Thomas Schlichter
  2003-03-18 20:24             ` Steven Cole
  2003-03-19  0:42             ` H. Peter Anvin
  2 siblings, 0 replies; 35+ messages in thread
From: Thomas Schlichter @ 2003-03-18 20:03 UTC (permalink / raw)
  To: Linus Torvalds, Brian Gerst; +Cc: Kevin Pedretti, linux-kernel


On Tuesday, 18 March 2003 20:21, Linus Torvalds wrote:
> On Tue, 18 Mar 2003, Brian Gerst wrote:
> > Here's a few more data points:
>
> Ok, this shows the behaviour I was trying to explain:
> > vendor_id       : AuthenticAMD
> > cpu family      : 5
> > model           : 8
> > model name      : AMD-K6(tm) 3D processor
> > stepping        : 12
> > cpu MHz         : 451.037
> > empty overhead=105 cycles
> > load overhead=-2 cycles
> > I$ load overhead=30 cycles
> > I$ load overhead=90 cycles
> > I$ store overhead=95 cycles
>
> ie loading from the same cacheline shows bad behaviour, most likely due to
> cache line exclusion. Does anybody have an original Pentium to see if I
> remember that one right?

Yes, you are right!
For an old 133 MHz Pentium I (running FreeBSD, so I cannot provide 
cpuinfo data :-( ) I get the following:

empty overhead=73 cycles
load overhead=0 cycles
I$ load overhead=88 cycles
I$ load overhead=96 cycles
I$ store overhead=72 cycles

And just to provide data for the AMD K6-III :

vendor_id       : AuthenticAMD
cpu family      : 5
model           : 9
model name      : AMD-K6(tm) 3D+ Processor
stepping        : 1
cpu MHz         : 450.791
cache size      : 256 KB

empty overhead=142 cycles
load overhead=89 cycles
I$ load overhead=95 cycles
I$ load overhead=99 cycles
I$ store overhead=91 cycles

       Thomas



* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-03-18 18:30         ` Brian Gerst
  2003-03-18 19:14           ` Thomas Molina
@ 2003-03-18 19:21           ` Linus Torvalds
  2003-03-18 20:03             ` Thomas Schlichter
                               ` (2 more replies)
  1 sibling, 3 replies; 35+ messages in thread
From: Linus Torvalds @ 2003-03-18 19:21 UTC (permalink / raw)
  To: Brian Gerst; +Cc: Kevin Pedretti, linux-kernel


On Tue, 18 Mar 2003, Brian Gerst wrote:
> 
> Here's a few more data points:

Ok, this shows the behaviour I was trying to explain:

> vendor_id       : AuthenticAMD
> cpu family      : 5
> model           : 8
> model name      : AMD-K6(tm) 3D processor
> stepping        : 12
> cpu MHz         : 451.037
> empty overhead=105 cycles
> load overhead=-2 cycles
> I$ load overhead=30 cycles
> I$ load overhead=90 cycles
> I$ store overhead=95 cycles

ie loading from the same cacheline shows bad behaviour, most likely due to 
cache line exclusion. Does anybody have an original Pentium to see if I 
remember that one right?

> vendor_id       : AuthenticAMD
> cpu family      : 6
> model           : 6
> model name      : AMD Athlon(tm) Processor
> stepping        : 2
> cpu MHz         : 1409.946
> empty overhead=11 cycles
> load overhead=5 cycles
> I$ load overhead=5 cycles
> I$ load overhead=5 cycles
> I$ store overhead=826 cycles
> 
> The Athlon XP shows really bad behavior when you store to the text area.

Wow. There aren't many cases where AMD shows the P4-like "big
latency in rare cases" behaviour.

But quite honestly, I think they made the right call, and I _expect_ that
of modern CPU's. The fact is, modern CPU's tend to need to pre-decode the
instruction stream in some way, and storing to it while running from it is
just a really really bad idea. And since it's so easy to avoid it, you
really just shouldn't do it.

			Linus



* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-03-18 18:30         ` Brian Gerst
@ 2003-03-18 19:14           ` Thomas Molina
  2003-03-18 19:21           ` Linus Torvalds
  1 sibling, 0 replies; 35+ messages in thread
From: Thomas Molina @ 2003-03-18 19:14 UTC (permalink / raw)
  To: Linux Kernel Mailing List

> > You can run this (stupid) test-program to try. On my P4 I get
> > 
> > 	empty overhead=320 cycles
> > 	load overhead=0 cycles
> > 	I$ load overhead=0 cycles
> > 	I$ load overhead=0 cycles
> > 	I$ store overhead=264 cycles

On my Athlon 1.3GHz system I get:
[tmolina@dad tmolina]$ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 4
model name      : AMD Athlon(tm) Processor
stepping        : 2
cpu MHz         : 1343.030
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov 
pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips        : 2637.82

empty overhead=16 cycles
load overhead=1 cycles
I$ load overhead=1 cycles
I$ load overhead=1 cycles
I$ store overhead=763 cycles




* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-03-18 16:41       ` Linus Torvalds
@ 2003-03-18 18:30         ` Brian Gerst
  2003-03-18 19:14           ` Thomas Molina
  2003-03-18 19:21           ` Linus Torvalds
  0 siblings, 2 replies; 35+ messages in thread
From: Brian Gerst @ 2003-03-18 18:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kevin Pedretti, linux-kernel

Linus Torvalds wrote:
> On Tue, 18 Mar 2003, Kevin Pedretti wrote:
> 
>>    I wasn't aware of what you state below but it makes sense.  What I 
>>haven't been able to figure out, and nobody seems to know, is why the 
>>rodata section of an executable is placed in the text section and is not 
>>page aligned.  This seems to be a mixing of code and data on the same 
>>page.  Maybe it doesn't matter since it is read only?
> 
> 
> It's a bad idea to share even read-only data, but the impact of read-only 
> data is much less than read-write. In particular, you should avoid sharing 
> _any_ code and data in the same physical L1 cache-line, since that will be 
> a big problem for any CPU with exclusion between the I$ and D$.
> 
> HOWEVER, modern x86 CPU's tend to have the I$ be part of the cache 
> coherency protocol, so instead of having exclusion they allow sharing as 
> long as the D$ isn't actually dirty. In that case it's fine to share 
> read-only data and code, although the cache utilization goes down if you 
> do a lot of it.
> 
> Anyway, as long as they are in separate cache-lines, you should be ok even 
> on something with cache exclusion.
> 
> When it comes to actually _writing_ to the data, at least on the P4 you
> don't want to have read-write data anywhere _near_ the I$ (somebody
> reported half-page granularity). This is true on crusoe too, btw (at a
> 128-byte granularity).
> 
> Anyway, I think gcc should make sure that even the ro-data section is at
> least cacheline-aligned so that it stays away from cachelines used for I$.  
> That makes sense even on CPU's that don't have exclusion, since it
> actually gives slightly better L1 cache utilization.
> 
> You can run this (stupid) test-program to try. On my P4 I get
> 
> 	empty overhead=320 cycles
> 	load overhead=0 cycles
> 	I$ load overhead=0 cycles
> 	I$ load overhead=0 cycles
> 	I$ store overhead=264 cycles
> 
> and on my PIII I get
> 
> 	empty overhead=74 cycles
> 	load overhead=8 cycles
> 	I$ load overhead=8 cycles
> 	I$ load overhead=8 cycles
> 	I$ store overhead=103 cycles
> 
> and (just for fun) on an old crusoe I get
> 
> 	empty overhead=67 cycles
> 	load overhead=-9 cycles
> 	I$ load overhead=-14 cycles
> 	I$ load overhead=-14 cycles
> 	I$ store overhead=12 cycles
> 
> where that "negative overhead" just shows that we do some strange things to
> scheduling, and the loop actually ends up faster if it has a load in it
> than without the load..
> 
> But you can see that storing to code is a really bad idea. Especially on a 
> P4, where the overhead for a store was 264 cycles! (You can also see the 
> cost of doing just the empty synchronization and rdtsc - 320 cycles for a 
> rdtsc and two locked memory accesses on a P4).
> 
> I don't have access to an old Pentium - I think that was the one that had 
> the strict exclusion between the L1 I$ and D$, and then you should see the 
> I$ load overhead go up.
> 
> 			Linus

Here's a few more data points:

vendor_id       : AuthenticAMD
cpu family      : 5
model           : 8
model name      : AMD-K6(tm) 3D processor
stepping        : 12
cpu MHz         : 451.037
empty overhead=105 cycles
load overhead=-2 cycles
I$ load overhead=30 cycles
I$ load overhead=90 cycles
I$ store overhead=95 cycles


vendor_id       : GenuineIntel
cpu family      : 6
model           : 3
model name      : Pentium II (Klamath)
stepping        : 3
cpu MHz         : 265.913
empty overhead=73 cycles
load overhead=10 cycles
I$ load overhead=10 cycles
I$ load overhead=10 cycles
I$ store overhead=2 cycles


vendor_id       : AuthenticAMD
cpu family      : 6
model           : 6
model name      : AMD Athlon(tm) Processor
stepping        : 2
cpu MHz         : 1409.946
empty overhead=11 cycles
load overhead=5 cycles
I$ load overhead=5 cycles
I$ load overhead=5 cycles
I$ store overhead=826 cycles

The Athlon XP shows really bad behavior when you store to the text area.

--
				Brian Gerst



* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-03-18 15:24     ` Kevin Pedretti
@ 2003-03-18 16:41       ` Linus Torvalds
  2003-03-18 18:30         ` Brian Gerst
  0 siblings, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2003-03-18 16:41 UTC (permalink / raw)
  To: Kevin Pedretti; +Cc: linux-kernel


On Tue, 18 Mar 2003, Kevin Pedretti wrote:
>
>     I wasn't aware of what you state below but it makes sense.  What I 
> haven't been able to figure out, and nobody seems to know, is why the 
> rodata section of an executable is placed in the text section and is not 
> page aligned.  This seems to be a mixing of code and data on the same 
> page.  Maybe it doesn't matter since it is read only?

It's a bad idea to share even read-only data, but the impact of read-only 
data is much less than read-write. In particular, you should avoid sharing 
_any_ code and data in the same physical L1 cache-line, since that will be 
a big problem for any CPU with exclusion between the I$ and D$.

HOWEVER, modern x86 CPU's tend to have the I$ be part of the cache 
coherency protocol, so instead of having exclusion they allow sharing as 
long as the D$ isn't actually dirty. In that case it's fine to share 
read-only data and code, although the cache utilization goes down if you 
do a lot of it.

Anyway, as long as they are in separate cache-lines, you should be ok even 
on something with cache exclusion.

When it comes to actually _writing_ to the data, at least on the P4 you
don't want to have read-write data anywhere _near_ the I$ (somebody
reported half-page granularity). This is true on crusoe too, btw (at a
128-byte granularity).

Anyway, I think gcc should make sure that even the ro-data section is at
least cacheline-aligned so that it stays away from cachelines used for I$.  
That makes sense even on CPU's that don't have exclusion, since it
actually gives slightly better L1 cache utilization.

You can run this (stupid) test-program to try. On my P4 I get

	empty overhead=320 cycles
	load overhead=0 cycles
	I$ load overhead=0 cycles
	I$ load overhead=0 cycles
	I$ store overhead=264 cycles

and on my PIII I get

	empty overhead=74 cycles
	load overhead=8 cycles
	I$ load overhead=8 cycles
	I$ load overhead=8 cycles
	I$ store overhead=103 cycles

and (just for fun) on an old crusoe I get

	empty overhead=67 cycles
	load overhead=-9 cycles
	I$ load overhead=-14 cycles
	I$ load overhead=-14 cycles
	I$ store overhead=12 cycles

where that "negative overhead" just shows that we do some strange things to
scheduling, and the loop actually ends up faster if it has a load in it
than without the load..

But you can see that storing to code is a really bad idea. Especially on a 
P4, where the overhead for a store was 264 cycles! (You can also see the 
cost of doing just the empty synchronization and rdtsc - 320 cycles for a 
rdtsc and two locked memory accesses on a P4).

I don't have access to an old Pentium - I think that was the one that had 
the strict exclusion between the L1 I$ and D$, and then you should see the 
I$ load overhead go up.

			Linus

----
#include <sys/types.h>
#include <time.h>
#include <sys/time.h>
#include <sys/fcntl.h>
#include <asm/unistd.h>
#include <sys/stat.h>
#include <stdio.h>

#include <sys/mman.h>

#define PAGE_SIZE (4096UL)
#define PAGE_MASK (~(PAGE_SIZE-1))

#define serialize() asm volatile("lock ; addl $0,(%esp)")

#define rdtsc() ({ unsigned long a, d; asm volatile("rdtsc":"=a" (a), "=d" (d)); a; })

static int unused = 0;

#define NR (100000)

int main()
{
	int i;
	unsigned long overhead = ~0UL, empty = 0;
	void * address = (void *)(PAGE_MASK & (unsigned long)main);

	mprotect(address, PAGE_SIZE, PROT_READ | PROT_WRITE | PROT_EXEC);

	overhead = ~0UL;
	for (i = 0; i < NR; i++) {
		unsigned long cycles = rdtsc();
		serialize();
		serialize();
		cycles = rdtsc() - cycles;
		if (cycles < overhead)
			overhead = cycles;
	}
	printf("empty overhead=%ld cycles\n", overhead);
	empty = overhead;

	overhead = ~0UL;
	for (i = 0; i < NR; i++) {
		unsigned long dummy;
		unsigned long cycles = rdtsc();
		serialize();
		asm volatile("movl %1,%0":"=r" (dummy):"m" (unused));
		serialize();
		cycles = rdtsc() - cycles;
		if (cycles < overhead)
			overhead = cycles;
	}
	printf("load overhead=%ld cycles\n", overhead-empty);

	overhead = ~0UL;
	for (i = 0; i < NR; i++) {
		unsigned long dummy;
		unsigned long cycles = rdtsc();
		serialize();
		asm volatile("1:\tmovl 1b,%0":"=r" (dummy));
		serialize();
		cycles = rdtsc() - cycles;
		if (cycles < overhead)
			overhead = cycles;
	}
	printf("I$ load overhead=%ld cycles\n", overhead-empty);

	asm volatile("jmp 1f\n.align 128\n99:\t.long 0\n1:");
	overhead = ~0UL;
	for (i = 0; i < NR; i++) {
		unsigned long dummy;
		unsigned long cycles;
		cycles = rdtsc();
		serialize();
		asm volatile("movl 99b,%0":"=r" (dummy));
		serialize();
		cycles = rdtsc() - cycles;
		if (cycles < overhead)
			overhead = cycles;
	}
	printf("I$ load overhead=%ld cycles\n", overhead-empty);

	asm volatile("jmp 1f\n99:\t.long 0\n1:");
	overhead = ~0UL;
	for (i = 0; i < NR; i++) {
		unsigned long dummy;
		unsigned long cycles;
		cycles = rdtsc();
		serialize();
		asm volatile("1:\tmovl %0,99b":"=r" (dummy));
		serialize();
		cycles = rdtsc() - cycles;
		if (cycles < overhead)
			overhead = cycles;
	}
	printf("I$ store overhead=%ld cycles\n", overhead-empty);
	return 0;
}



* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-02-12  5:54   ` Linus Torvalds
  2003-02-12 10:18     ` Jamie Lokier
@ 2003-03-18 15:24     ` Kevin Pedretti
  2003-03-18 16:41       ` Linus Torvalds
  1 sibling, 1 reply; 35+ messages in thread
From: Kevin Pedretti @ 2003-03-18 15:24 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel

Linus,
    I wasn't aware of what you state below but it makes sense.  What I 
haven't been able to figure out, and nobody seems to know, is why the 
rodata section of an executable is placed in the text section and is not 
page aligned.  This seems to be a mixing of code and data on the same 
page.  Maybe it doesn't matter since it is read only?

Example:

 11 .text         000000e8  08048244  08048244  00000244  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
 12 .fini         0000001c  0804832c  0804832c  0000032c  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
 13 .rodata       0000000c  08048348  08048348  00000348  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 14 .data         0000000c  08049354  08049354  00000354  2**2
                  CONTENTS, ALLOC, LOAD, DATA

Thanks,
Kevin


torvalds@transmeta.com wrote:

>In article <20030212041848.GA9273@bjl1.jlokier.co.uk>,
>Jamie Lokier  <jamie@shareable.org> wrote:
>  
>
>>A cute and wonderful hack is to use the 6 words in the TSS prior to
>>&tss->es as the trampoline. Now that __switch_to is done in software,
>>those words are not used for anything else.
>>    
>>
>
>No!! 
>
>That's not cute and wonderful, that's _horrible_.
>
>Mixing data and code on the same page is very very slow on a P4 (well, I
>think it's "same half-page", but the point is that you should not EVER
>mix data and code - it ends up being slow on modern CPU's).
>
>  
>
>>Other fixed offsets from &tss->esp0 are possible - especially nice
>>would be to share a cache line with the GDT's hot cache line.  (To do
>>this, place GDT before TSS, make KERNEL_CS near the end of the GDT,
>>and then the accesses to GDT, trampoline and tss->esp0 will all touch
>>the same cache line if you're lucky).
>>    
>>
>
>Since almost all x86 CPU's have some kind of cacheline exclusion policy
>between the I$ and the D$ (to handle the strict x86 I$ coherency
>requirements), your "if you're lucky" is completely bogus.  In fact,
>you'd get the _pessimal_ cache behaviour for something like that, ie you
>get lines that ping-pong between the L2 and the two instruction caches. 
>
>Don't do it. Keep data and code on separate pages.
>
>			Linus





* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-03-10  3:07         ` Linus Torvalds
  2003-03-10 11:06           ` Andi Kleen
@ 2003-03-10 22:44           ` Linus Torvalds
  1 sibling, 0 replies; 35+ messages in thread
From: Linus Torvalds @ 2003-03-10 22:44 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel, ak


On Sun, 9 Mar 2003, Linus Torvalds wrote:
> 
> Your SYSENTER_ESP hack would probably get back the rest, but I haven't
> seen any patches for it, hint hint.

Oh, well, I just did it myself. And tested with both NMI's and debug 
traps, just to make sure that we do the right thing there too.

(If we get an NMI on the first three instructions in a debug trap that 
happens on the first instruction of the sysenter path, we're still 
screwed. I'm still trying to figure out a good way to unscrew us).

> In the meantime, we're almost back to where we were _and_ we support 
> sysenter (ie my system calls are down by almost a factor of four). So 
> we're doing pretty well.

We're now pretty much back to 2.4.x performance on the scheduler, as far 
as I can tell. Can people confirm and close the bug?

		Linus



* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-03-10 11:06           ` Andi Kleen
@ 2003-03-10 18:33             ` Linus Torvalds
  0 siblings, 0 replies; 35+ messages in thread
From: Linus Torvalds @ 2003-03-10 18:33 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Jamie Lokier, linux-kernel, Manfred Spraul


On Mon, 10 Mar 2003, Andi Kleen wrote:
> 
> Unfortunately the patch still has the problem pointed out by Manfred
> Spraul: if you're unlucky it could destroy the _TIF_SIGPENDING set by
> another CPU with the non atomic access. Really thread_info should have
> two flag words: one that is truly local and can be accessed without LOCK
> and one that can be changed at will by external users too.

Yup, you're right.

I fixed that by splitting the "flags" field into two: "flags" is the old
flags, and "status" is thread-synchronous stuff (ie things that don't need
to worry about atomicity). Right now the FP lazy bit is the only thing 
that is marked as thread-synchronous.

While going through the users I also noticed that fork() did the FPU 
unlazy() thing totally wrong - it did the parent unlazy() _after_ it had 
already copied the process flags to the child, so even though it copied 
the x87 state to the child, the process flags could still say that the 
child was using lazy state, and thus the FP state in the child was 
basically totally corrupt. I wrote a test program to verify this.

So I fixed that part too, by having a "prepare_to_copy()" function that
properly "calms down" the parent before we copy the task and thread
states. That fixes the bug, and also avoids an extra unnecessary x87 state
copy on x86.

(Not that the extra copy is noticeable - fork() is expensive enough 
anyway. It might just _barely_ be noticeable on thread creation when we 
don't have to worry about copying the VM state. But the bug was real, 
and the simplification is an added bonus).

			Linus



* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-03-10  3:07         ` Linus Torvalds
@ 2003-03-10 11:06           ` Andi Kleen
  2003-03-10 18:33             ` Linus Torvalds
  2003-03-10 22:44           ` Linus Torvalds
  1 sibling, 1 reply; 35+ messages in thread
From: Andi Kleen @ 2003-03-10 11:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jamie Lokier, linux-kernel, ak

On Mon, Mar 10, 2003 at 04:07:36AM +0100, Linus Torvalds wrote:
>  since you've been interested in the past, I thought I'd ask you to test
> the current context switch stuff. Andi cleaned up some FPU reload stuff
> (and I fixed a bug in it, tssk tssk Andi - you'd obviously not actually
> timed your cleanups), and I just committed and pushed out my "cache the

You mean the TIF->_TIF thing? Yes, that was wrong in the first patch,
but fixed in the later patches. Unfortunately the patch still 
has the problem pointed out by Manfred Spraul: if you're unlucky
it could destroy the _TIF_SIGPENDING set by another CPU with the
non-atomic access. Really, thread_info should have two flag words:
one that is truly local and can be accessed without LOCK and 
one that can be changed at will by external users too.

After some discussion with him I think the right fix for now is to 
move it back to a PF_USEDFPU bit in task_struct->flags.

Will submit a patch for that later, after I've been able to test it.

-Andi

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-02-12 10:12       ` Jamie Lokier
@ 2003-03-10  3:07         ` Linus Torvalds
  2003-03-10 11:06           ` Andi Kleen
  2003-03-10 22:44           ` Linus Torvalds
  0 siblings, 2 replies; 35+ messages in thread
From: Linus Torvalds @ 2003-03-10  3:07 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel, ak


Ok Jamie,
 since you've been interested in the past, I thought I'd ask you to test
the current context switch stuff. Andi cleaned up some FPU reload stuff
(and I fixed a bug in it, tssk tssk Andi - you'd obviously not actually
timed your cleanups), and I just committed and pushed out my "cache the
value of SYSENTER_CS in the TSS" patch.

It won't bring context switching back to where it _could_ be, but it
should be noticeably better. My pipe bandwidth is up from under 600MB/s
to about ~700MB/s according to lmbench.

Your SYSENTER_ESP hack would probably get back the rest, but I haven't
seen any patches for it, hint hint.

In the meantime, we're almost back to where we were _and_ we support 
sysenter (ie my system calls are down by almost a factor of four). So 
we're doing pretty well.

			Linus


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-02-13  5:17         ` Eric W. Biederman
@ 2003-02-13 18:07           ` Andi Kleen
  2003-03-19  1:22             ` Rob Landley
  0 siblings, 1 reply; 35+ messages in thread
From: Andi Kleen @ 2003-02-13 18:07 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-kernel, discuss

[Hmm, this is becoming a FAQ]

> Switching in and out of long mode is evil enough that I don't think it
> is worth it.  And encouraging people to write good JIT compiling

Forget it. It is completely undefined in the architecture what happens
then. You'll lose interrupts and everything. Nothing for an operating
system intended to be stable.

I have no plans at all to even think about it for Linux/x86-64.

> emulators sounds much better, especially in the long run.  But it can
> be written.

For DOS even a slow emulator should be good enough. After all, most
DOS programs are written for slow machines. Bochs running on a K8
will hopefully be fast enough. If not, a JIT can be written; perhaps
you could extend valgrind for it.

Or if you really rely on a DOS program executing fast you can
always boot a 32bit kernel which of course still supports vm86
in legacy mode.

-Andi

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-02-12 10:45       ` Andi Kleen
  2003-02-12 17:52         ` Ingo Oeser
@ 2003-02-13  5:17         ` Eric W. Biederman
  2003-02-13 18:07           ` Andi Kleen
  1 sibling, 1 reply; 35+ messages in thread
From: Eric W. Biederman @ 2003-02-13  5:17 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Jamie Lokier, Dave Jones, Martin J. Bligh, linux-kernel

Andi Kleen <ak@suse.de> writes:

> On Wed, Feb 12, 2003 at 10:27:41AM +0000, Jamie Lokier wrote:
> > Andi Kleen wrote:
> > > +	/* FIXME should disable preemption here but how can we reenable it? */
> > > +
> > > +	enable_sysenter();
> > > +
> > 
> > Try this:
> 
> [...] I have no real interest in vm86 mode, perhaps one of the people
> interested in dosemu etc. could take care of it. I'm very glad it doesn't
> exist on my main architecture - x86-64 - given how many hacks it needs to be 
> supported.

There is certainly some old cruft in there, but...

I have been thinking evil thoughts lately about what it would take
to implement on x86-64.

Switching in and out of long mode is evil enough that I don't think it
is worth it.  And encouraging people to write good JIT compiling
emulators sounds much better, especially in the long run.  But it can
be written.

Eric

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-02-12 18:18           ` Andi Kleen
@ 2003-02-13  2:42             ` Alan Cox
  0 siblings, 0 replies; 35+ messages in thread
From: Alan Cox @ 2003-02-13  2:42 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Oeser, Jamie Lokier, Dave Jones, Martin J. Bligh,
	Linux Kernel Mailing List

On Wed, 2003-02-12 at 18:18, Andi Kleen wrote:
> > So what about making it a CONFIG_XXX option? The few dosemu users
> > could then configure it in.
> 
> Doesn't help for precompiled distribution kernels, which is what the majority
> of linux users run these days.

XFree86 makes significant use of it, and its software x86 emulator isn't up to 
replacing it on many cards (e.g. my C&T only works with vm86, not the emulator).
Obviously on x86-64 you have little choice, but for x86-32 it's somewhat
relevant.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-02-12 17:52         ` Ingo Oeser
  2003-02-12 18:13           ` Dave Jones
@ 2003-02-12 18:18           ` Andi Kleen
  2003-02-13  2:42             ` Alan Cox
  1 sibling, 1 reply; 35+ messages in thread
From: Andi Kleen @ 2003-02-12 18:18 UTC (permalink / raw)
  To: Ingo Oeser
  Cc: Andi Kleen, Jamie Lokier, Dave Jones, Martin J. Bligh, linux-kernel

On Wed, Feb 12, 2003 at 06:52:00PM +0100, Ingo Oeser wrote:
> On Wed, Feb 12, 2003 at 11:45:08AM +0100, Andi Kleen wrote:
> > [...] I have no real interest in vm86 mode, perhaps one of the people
> > interested in dosemu etc. could take care of it. I'm very glad it doesn't
> > exist on my main architecture - x86-64 - given how many hacks it needs to be 
> > supported.
> 
> So what about making it a CONFIG_XXX option? The few dosemu users
> could then configure it in.

Doesn't help for precompiled distribution kernels, which is what the majority
of linux users run these days.

-Andi

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-02-12 17:52         ` Ingo Oeser
@ 2003-02-12 18:13           ` Dave Jones
  2003-02-12 18:18           ` Andi Kleen
  1 sibling, 0 replies; 35+ messages in thread
From: Dave Jones @ 2003-02-12 18:13 UTC (permalink / raw)
  To: Ingo Oeser; +Cc: Andi Kleen, Jamie Lokier, Martin J. Bligh, linux-kernel

On Wed, Feb 12, 2003 at 06:52:00PM +0100, Ingo Oeser wrote:
 > On Wed, Feb 12, 2003 at 11:45:08AM +0100, Andi Kleen wrote:
 > > [...] I have no real interest in vm86 mode, perhaps one of the people
 > > interested in dosemu etc. could take care of it. I'm very glad it doesn't
 > > exist on my main architecture - x86-64 - given how many hacks it needs to be 
 > > supported.
 > 
 > So what about making it a CONFIG_XXX option? The few dosemu users
 > could then configure it in.

Overkill. Andi's TF_VM86 fix looks to be the nicest way to do it.
If you don't use dosemu etc, the wrmsr should never be hit.

		Dave

-- 
| Dave Jones.        http://www.codemonkey.org.uk
| SuSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-02-12 10:45       ` Andi Kleen
@ 2003-02-12 17:52         ` Ingo Oeser
  2003-02-12 18:13           ` Dave Jones
  2003-02-12 18:18           ` Andi Kleen
  2003-02-13  5:17         ` Eric W. Biederman
  1 sibling, 2 replies; 35+ messages in thread
From: Ingo Oeser @ 2003-02-12 17:52 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Jamie Lokier, Dave Jones, Martin J. Bligh, linux-kernel

On Wed, Feb 12, 2003 at 11:45:08AM +0100, Andi Kleen wrote:
> [...] I have no real interest in vm86 mode, perhaps one of the people
> interested in dosemu etc. could take care of it. I'm very glad it doesn't
> exist on my main architecture - x86-64 - given how many hacks it needs to be 
> supported.

So what about making it a CONFIG_XXX option? The few dosemu users
could then configure it in.

Regards

Ingo Oeser
-- 
Science is what we can tell a computer. Art is everything else. --- D.E.Knuth

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-02-12 10:18     ` Jamie Lokier
@ 2003-02-12 17:24       ` Linus Torvalds
  0 siblings, 0 replies; 35+ messages in thread
From: Linus Torvalds @ 2003-02-12 17:24 UTC (permalink / raw)
  To: linux-kernel

In article <20030212101831.GB10422@bjl1.jlokier.co.uk>,
Jamie Lokier  <jamie@shareable.org> wrote:
>
>I meant: the trampoline _stack_ lives in the TSS.
>
>There is no trampoline _code_.

Ahh, ok. That sounds quite doable, and all my complaints go away.

It still leaves the debug exception and NMI issue.

The debug exception case is easy to trigger: use gdb to single-step
through the user-level fast system call code, and you _will_ get a debug
exception on the very first kernel instruction (which is also the one
that doesn't have a valid stack). 

So anybody want to actually try to implement this?

		Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-02-12  4:21   ` Jamie Lokier
  2003-02-12  5:49     ` Linus Torvalds
@ 2003-02-12 12:54     ` Dave Jones
  1 sibling, 0 replies; 35+ messages in thread
From: Dave Jones @ 2003-02-12 12:54 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Martin J. Bligh, ak, linux-kernel

On Wed, Feb 12, 2003 at 04:21:43AM +0000, Jamie Lokier wrote:

 > > I feel I'm missing something obvious here, but is this part the
 > > low-hanging fruit that it seems ?
 > You have eliminated one MSR write very cleanly, although there are
 > still a few unnecessary conditionals when compared with grabbing a
 > whole branch of the fruit tree, as it were.
 > 
 > That leaves the other MSR write, which is also unnecessary.

Removing that one didn't seem quite so easy, so I wussed out.

		Dave

-- 
| Dave Jones.        http://www.codemonkey.org.uk
| SuSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-02-12 10:27     ` Jamie Lokier
@ 2003-02-12 10:45       ` Andi Kleen
  2003-02-12 17:52         ` Ingo Oeser
  2003-02-13  5:17         ` Eric W. Biederman
  0 siblings, 2 replies; 35+ messages in thread
From: Andi Kleen @ 2003-02-12 10:45 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Andi Kleen, Dave Jones, Martin J. Bligh, linux-kernel

On Wed, Feb 12, 2003 at 10:27:41AM +0000, Jamie Lokier wrote:
> Andi Kleen wrote:
> > +	/* FIXME should disable preemption here but how can we reenable it? */
> > +
> > +	enable_sysenter();
> > +
> 
> Try this:

[...] I have no real interest in vm86 mode, perhaps one of the people
interested in dosemu etc. could take care of it. I'm very glad it doesn't
exist on my main architecture - x86-64 - given how many hacks it needs to be 
supported.

I would like to have fast context switch on IA32 though so it would be nice 
if someone deeply familiar with sys_vm86 could review my patch.

Avoiding the SYSCALL_CS MSR is independent from the issues Linus raised.

-Andi

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-02-12  7:50   ` Andi Kleen
@ 2003-02-12 10:27     ` Jamie Lokier
  2003-02-12 10:45       ` Andi Kleen
  0 siblings, 1 reply; 35+ messages in thread
From: Jamie Lokier @ 2003-02-12 10:27 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Dave Jones, Martin J. Bligh, linux-kernel

Andi Kleen wrote:
> +	/* FIXME should disable preemption here but how can we reenable it? */
> +
> +	enable_sysenter();
> +

Try this:

	1. Disable preemption in do_sys_vm86(), at the same place as
	   disable_sysenter() is called.

	2. Enable preemption in save_v86_state(), and put the call
	   to enable_sysenter() there.

	3. In restore_sigcontext() [signal.c], _iff_ the VM flag
	   is set in the restored context, call disable_sysenter()
	   and also disable preemption.

That should make vm86 simply disable preemption while it is activated.
It is not as nice as actually being preemptible, but safe first,
optimise later.

The return path to vm86 mode has the peculiar property of not doing
the need_resched test, unlike the return path to normal user space,
which is a boon here.

-- Jamie


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-02-12  5:54   ` Linus Torvalds
@ 2003-02-12 10:18     ` Jamie Lokier
  2003-02-12 17:24       ` Linus Torvalds
  2003-03-18 15:24     ` Kevin Pedretti
  1 sibling, 1 reply; 35+ messages in thread
From: Jamie Lokier @ 2003-02-12 10:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds wrote:
> In article <20030212041848.GA9273@bjl1.jlokier.co.uk>,
> Jamie Lokier  <jamie@shareable.org> wrote:
> >
> >A cute and wonderful hack is to use the 6 words in the TSS prior to
> >&tss->es as the trampoline. Now that __switch_to is done in software,
> >those words are not used for anything else.
> 
> No!! 
> 
> That's not cute and wonderful, that's _horrible_.

I meant: the trampoline _stack_ lives in the TSS.

There is no trampoline _code_.

My apologies for poor wording.

-- Jamie

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-02-12  5:49     ` Linus Torvalds
@ 2003-02-12 10:12       ` Jamie Lokier
  2003-03-10  3:07         ` Linus Torvalds
  0 siblings, 1 reply; 35+ messages in thread
From: Jamie Lokier @ 2003-02-12 10:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds wrote:
> >That leaves the other MSR write, which is also unnecessary.
> 
> No, the other one _is_ necessary.  I did timings, and having it in the
> context switch path made system calls clearly faster on a P4 (as
> compared to my original trampoline approach).
> 
> It may be only two instructions difference ("movl xx,%esp ; jmp common")
> in the system call path, but it was much more than two cycles.  I don't
> know why, but I assume the system call causes a total pipeline flush,
> and then the immediate jmp basically means that the P4 has a hard time
> getting the pipe restarted.

The jump is not necessary, and you don't need to duplicate the system
call code either.  You place an instruction like this at the start of
the system call code - this is _all_ that you need.

	movl -60(%esp),%esp

The current task's esp0 is always stored in the TSS.  We get that for
free.  And you can point SYSENTER_ESP in, before or after the TSS too.
The trampoline stack needs exactly 6 words to handle debug and NMI.

The constant may vary according to how you lay things out, and you
might put it after cld;sti[2] in the entry code, but you get the idea.

I suspect you are right that it is the jump which is expensive - a
stack load _should_ be less than a cycle.  Normal functions do it all
the time.  But then a jump _should_ be less than a cycle too.  Ah well!

(Of course even a single load of %esp, even if it turns out to be
cheap, can cost more on average than writing the MSR per context
switch.)

> This might be fixable by moving more (all?) of the kernel-side fast
> system call code into the per-cpu trampoline page, so that you wouldn't
> have the immediate jump. Somebody needs to try it and time it, otherwise
> the wrmsr stays in the context switch.

I have timed how long it takes to do sysenter, call a function in
kernel space and sysexit to return, complete with the above method of
stack setup (and the debug + NMI fixups).  This is a module for <=2.4 kernels.

It takes ages - 82 cycles.

Here are my notes from last year (sorry, I don't have a P4):

   Performance and emulation methods
   ---------------------------------

     * On everything since later Pentium Pros from Intel, and since
       the K7 from AMD, `sysenter' is available as a native instruction.

       On my Celeron 366, it takes 82 (84.5 on an 800MHz P3) cycles to
       enter the kernel, call an empty C function and return to
       userspace.  Compare this to 236 (242) cycles using `int $0x81' to
       do the same thing.

     * On old CPUs which don't support `sysenter', it is emulated
       using the "illegal opcode" trap (#UD).

       This is actually quite fast: the empty C function takes only 17
       (16) cycles longer than `int $0x81'.  Because classic system
       calls use `int $0x80', you can see that emulating `sysenter'
       would be a useful fallback method for userspace system calls.

     * Don't take the cycle timings too seriously.  They vary by about
       8% according to the exact layout of the userspace code and also
       from one module loading to the next (probably due to cache or TLB
       colour effects).  I haven't quoted the _best_ timings (which are
       about 8% better than the ones I quoted), because they only occur
       occasionally and cannot be repeated to order (you have to unload
       and reload the module until the best timing appears).

> I want fast system calls. Most people don't see it yet (because you need
> a glibc that takes advantage of it), but those fast system calls are
> more than enough to make up for some scheduling overhead.

By the way, your suggestion of comparing %ds and %es to __USER_DS and
avoiding loading them if they are the expected values saves 8 cycles
on the two CPUs I did it on.  Not loading them on exit, which you
already do, saves a further 10 cycles.

Because you are careful to disallow sysenter from vm86 mode,
transitions from vm86 _always_ go through the
interrupt/int$0x80/exception paths, which always reload %ds and %es.

So your concern about vm86 screwing with the cpu's internal segment
descriptors doesn't apply, AFAICT, to the sysenter path.  (It probably
does apply to the interrupt and exception paths).

So here are some system call optimisation hints:

  [0] Comparing %ds and %es to __USER_DS and not reloading them in the
      sysenter path is safe, and worth 8 cycles on my CPUs.

  [1] "movl %ds,%ebx" is about 9-10 cycles faster than "pushl %ds;
      popl %ebx" for what its worth.  I think it's the pushl which is
      slow but I haven't timed it by itself.

  [2] Putting cli immediately before sti saves exactly 5 cycles on my
      Celeron, and putting that just before the %esp load helps a little.
      Cost of loading the flags register?

I am a wimp, or perhaps impatient, when it comes to the
compile-reboot-test cycle so I'm not likely to try the above any time
soon.  But those are good hints if anyone (Ingo? :) wants to try them.

enjoy,
-- Jamie

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-02-12  2:59 ` Dave Jones
  2003-02-12  4:21   ` Jamie Lokier
@ 2003-02-12  7:50   ` Andi Kleen
  2003-02-12 10:27     ` Jamie Lokier
  1 sibling, 1 reply; 35+ messages in thread
From: Andi Kleen @ 2003-02-12  7:50 UTC (permalink / raw)
  To: Dave Jones, Martin J. Bligh, ak, linux-kernel

On Wed, Feb 12, 2003 at 02:59:02AM +0000, Dave Jones wrote:
> On Tue, Feb 11, 2003 at 05:35:43PM -0800, Martin J. Bligh wrote:
> 
>  > The reason it rewrites SYSENTER_CS is non obviously vm86 which
>  > doesn't guarantee the MSR stays constant (sigh). I think this would 
>  > be better handled by having a global flag or process flag when any process
>  > uses vm86 and not do it when this flag is not set (as in 99% of all 
>  > normal use cases)
> 
> I feel I'm missing something obvious here, but is this part the
> low-hanging fruit that it seems ?

Yes, I implemented a similar patch last night too. It also fixes a few other
fast-path bugs in __switch_to:

- Fix false sharing in the GDT and replace an imul with a shift.
Really pad the GDT to cache lines now.

- Don't use LOCK prefixes in bit operations when accessing the 
thread_info flags of the switched threads. LOCK is very slow on P4
and it isn't necessary here.

Really we should have __set_bit/__test_bit without memory barrier
and atomic stuff on each arch and use that for thread_info.h,
but for now do it this way.

[this is a port from x86-64]

- Inline FPU switch - it is only a few lines.

But I must say I don't know vm86() semantics enough to know if this is 
good enough, especially when we clear the TIF_VM86 flag. Could someone
more familiar with it review it?

BTW, at first look vm86.c doesn't look very preempt-safe to me.

comments?

-Andi


diff -burpN -X ../KDIFX linux-2.5.60/arch/i386/kernel/cpu/common.c linux-2.5.60-work/arch/i386/kernel/cpu/common.c
--- linux-2.5.60/arch/i386/kernel/cpu/common.c	2003-02-10 19:37:57.000000000 +0100
+++ linux-2.5.60-work/arch/i386/kernel/cpu/common.c	2003-02-12 01:42:01.000000000 +0100
@@ -484,7 +484,7 @@ void __init cpu_init (void)
 		BUG();
 	enter_lazy_tlb(&init_mm, current, cpu);
 
-	load_esp0(t, thread->esp0);
+	load_esp0(current, t, thread->esp0);
 	set_tss_desc(cpu,t);
 	cpu_gdt_table[cpu][GDT_ENTRY_TSS].b &= 0xfffffdff;
 	load_TR_desc();
diff -burpN -X ../KDIFX linux-2.5.60/arch/i386/kernel/i387.c linux-2.5.60-work/arch/i386/kernel/i387.c
--- linux-2.5.60/arch/i386/kernel/i387.c	2003-02-10 19:39:17.000000000 +0100
+++ linux-2.5.60-work/arch/i386/kernel/i387.c	2003-02-11 23:51:58.000000000 +0100
@@ -52,24 +52,6 @@ void init_fpu(struct task_struct *tsk)
  * FPU lazy state save handling.
  */
 
-static inline void __save_init_fpu( struct task_struct *tsk )
-{
-	if ( cpu_has_fxsr ) {
-		asm volatile( "fxsave %0 ; fnclex"
-			      : "=m" (tsk->thread.i387.fxsave) );
-	} else {
-		asm volatile( "fnsave %0 ; fwait"
-			      : "=m" (tsk->thread.i387.fsave) );
-	}
-	clear_tsk_thread_flag(tsk, TIF_USEDFPU);
-}
-
-void save_init_fpu( struct task_struct *tsk )
-{
-	__save_init_fpu(tsk);
-	stts();
-}
-
 void kernel_fpu_begin(void)
 {
 	preempt_disable();
diff -burpN -X ../KDIFX linux-2.5.60/arch/i386/kernel/process.c linux-2.5.60-work/arch/i386/kernel/process.c
--- linux-2.5.60/arch/i386/kernel/process.c	2003-02-10 19:37:54.000000000 +0100
+++ linux-2.5.60-work/arch/i386/kernel/process.c	2003-02-12 01:40:02.000000000 +0100
@@ -437,7 +437,7 @@ void __switch_to(struct task_struct *pre
 	/*
 	 * Reload esp0, LDT and the page table pointer:
 	 */
-	load_esp0(tss, next->esp0);
+	load_esp0(prev_p, tss, next->esp0);
 
 	/*
 	 * Load the per-thread Thread-Local Storage descriptor.
diff -burpN -X ../KDIFX linux-2.5.60/arch/i386/kernel/vm86.c linux-2.5.60-work/arch/i386/kernel/vm86.c
--- linux-2.5.60/arch/i386/kernel/vm86.c	2003-02-10 19:37:58.000000000 +0100
+++ linux-2.5.60-work/arch/i386/kernel/vm86.c	2003-02-12 01:46:51.000000000 +0100
@@ -114,7 +117,7 @@ struct pt_regs * save_v86_state(struct k
 	}
 	tss = init_tss + smp_processor_id();
 	current->thread.esp0 = current->thread.saved_esp0;
-	load_esp0(tss, current->thread.esp0);
+	load_esp0(current, tss, current->thread.esp0);
 	current->thread.saved_esp0 = 0;
 	loadsegment(fs, current->thread.saved_fs);
 	loadsegment(gs, current->thread.saved_gs);
@@ -309,6 +313,10 @@ static inline void return_to_32bit(struc
 {
 	struct pt_regs * regs32;
 
+	/* FIXME should disable preemption here but how can we reenable it? */
+
+	enable_sysenter();
+
 	regs32 = save_v86_state(regs16);
 	regs32->eax = retval;
 	__asm__ __volatile__("movl %0,%%esp\n\t"
diff -burpN -X ../KDIFX linux-2.5.60/arch/x86_64/kernel/process.c linux-2.5.60-work/arch/x86_64/kernel/process.c
--- linux-2.5.60/arch/x86_64/kernel/process.c	2003-02-10 19:37:56.000000000 +0100
+++ linux-2.5.60-work/arch/x86_64/kernel/process.c	2003-02-12 01:51:00.000000000 +0100
@@ -41,6 +41,7 @@
 #include <linux/init.h>
 #include <linux/ctype.h>
 #include <linux/slab.h>
+#include <linux/thread_info.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
diff -burpN -X ../KDIFX linux-2.5.60/include/asm-i386/i387.h linux-2.5.60-work/include/asm-i386/i387.h
--- linux-2.5.60/include/asm-i386/i387.h	2003-02-10 19:38:49.000000000 +0100
+++ linux-2.5.60-work/include/asm-i386/i387.h	2003-02-12 01:21:13.000000000 +0100
@@ -21,23 +21,41 @@ extern void init_fpu(struct task_struct 
 /*
  * FPU lazy state save handling...
  */
-extern void save_init_fpu( struct task_struct *tsk );
 extern void restore_fpu( struct task_struct *tsk );
 
 extern void kernel_fpu_begin(void);
 #define kernel_fpu_end() do { stts(); preempt_enable(); } while(0)
 
 
+static inline void __save_init_fpu( struct task_struct *tsk )
+{
+	if ( cpu_has_fxsr ) {
+		asm volatile( "fxsave %0 ; fnclex"
+			      : "=m" (tsk->thread.i387.fxsave) );
+	} else {
+		asm volatile( "fnsave %0 ; fwait"
+			      : "=m" (tsk->thread.i387.fsave) );
+	}
+	tsk->thread_info->flags &= ~TIF_USEDFPU;
+}
+
+static inline void save_init_fpu( struct task_struct *tsk )
+{
+	__save_init_fpu(tsk);
+	stts();
+}
+
+
 #define unlazy_fpu( tsk ) do { \
-	if (test_tsk_thread_flag(tsk, TIF_USEDFPU)) \
+	if ((tsk)->thread_info->flags & _TIF_USEDFPU) \
 		save_init_fpu( tsk ); \
 } while (0)
 
 #define clear_fpu( tsk )					\
 do {								\
-	if (test_tsk_thread_flag(tsk, TIF_USEDFPU)) {		\
+	if ((tsk)->thread_info->flags & _TIF_USEDFPU) {		\
 		asm volatile("fwait");				\
-		clear_tsk_thread_flag(tsk, TIF_USEDFPU);	\
+		(tsk)->thread_info->flags &= ~_TIF_USEDFPU;	\
 		stts();						\
 	}							\
 } while (0)
diff -burpN -X ../KDIFX linux-2.5.60/include/asm-i386/processor.h linux-2.5.60-work/include/asm-i386/processor.h
--- linux-2.5.60/include/asm-i386/processor.h	2003-02-10 19:37:57.000000000 +0100
+++ linux-2.5.60-work/include/asm-i386/processor.h	2003-02-12 01:52:28.000000000 +0100
@@ -408,20 +408,30 @@ struct thread_struct {
 	.io_bitmap	= { [ 0 ... IO_BITMAP_SIZE ] = ~0 },		\
 }
 
-static inline void load_esp0(struct tss_struct *tss, unsigned long esp0)
-{
-	tss->esp0 = esp0;
-	if (cpu_has_sep) {
-		wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
-		wrmsr(MSR_IA32_SYSENTER_ESP, esp0, 0);
-	}
-}
-
-static inline void disable_sysenter(void)
-{
-	if (cpu_has_sep)  
-		wrmsr(MSR_IA32_SYSENTER_CS, 0, 0);
-}
+#define load_esp0(prev, tss, _esp0) do { \
+	(tss)->esp0 = _esp0;						\
+	if (cpu_has_sep) {						\
+		if (unlikely((prev)->thread_info->flags & _TIF_VM86))	\
+			wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);	\
+		wrmsr(MSR_IA32_SYSENTER_ESP, (_esp0), 0);		\
+	}								\
+} while(0)
+
+/* The caller of the next two functions should have disabled preemption. */
+
+#define disable_sysenter() do { \
+	if (cpu_has_sep) {				\
+		set_thread_flag(TIF_VM86);		\
+		wrmsr(MSR_IA32_SYSENTER_CS, 0, 0);	\
+	}	\
+} while(0)
+
+#define enable_sysenter() do { \
+	if (cpu_has_sep) {					\
+		clear_thread_flag(TIF_VM86);			\
+		wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);	\
+	}							\
+} while(0)
 
 #define start_thread(regs, new_eip, new_esp) do {		\
 	__asm__("movl %0,%%fs ; movl %0,%%gs": :"r" (0));	\
diff -burpN -X ../KDIFX linux-2.5.60/include/asm-i386/segment.h linux-2.5.60-work/include/asm-i386/segment.h
--- linux-2.5.60/include/asm-i386/segment.h	2003-02-10 19:38:06.000000000 +0100
+++ linux-2.5.60-work/include/asm-i386/segment.h	2003-02-11 23:56:37.000000000 +0100
@@ -67,7 +67,7 @@
 /*
  * The GDT has 25 entries but we pad it to cacheline boundary:
  */
-#define GDT_ENTRIES 28
+#define GDT_ENTRIES 32
 
 #define GDT_SIZE (GDT_ENTRIES * 8)
 
diff -burpN -X ../KDIFX linux-2.5.60/include/asm-i386/thread_info.h linux-2.5.60-work/include/asm-i386/thread_info.h
--- linux-2.5.60/include/asm-i386/thread_info.h	2003-02-10 19:37:59.000000000 +0100
+++ linux-2.5.60-work/include/asm-i386/thread_info.h	2003-02-12 01:51:26.000000000 +0100
@@ -111,15 +111,18 @@ static inline struct thread_info *curren
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_SINGLESTEP		4	/* restore singlestep on return to user mode */
 #define TIF_IRET		5	/* return with iret */
+#define TIF_VM86		6	/* may use vm86 */ 
 #define TIF_USEDFPU		16	/* FPU was used by this task this quantum (SMP) */
 #define TIF_POLLING_NRFLAG	17	/* true if poll_idle() is polling TIF_NEED_RESCHED */
 
+
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
 #define _TIF_SIGPENDING		(1<<TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1<<TIF_NEED_RESCHED)
 #define _TIF_SINGLESTEP		(1<<TIF_SINGLESTEP)
 #define _TIF_IRET		(1<<TIF_IRET)
+#define _TIF_VM86		(1<<TIF_VM86)
 #define _TIF_USEDFPU		(1<<TIF_USEDFPU)
 #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)
 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-02-12  4:18 ` Jamie Lokier
@ 2003-02-12  5:54   ` Linus Torvalds
  2003-02-12 10:18     ` Jamie Lokier
  2003-03-18 15:24     ` Kevin Pedretti
  0 siblings, 2 replies; 35+ messages in thread
From: Linus Torvalds @ 2003-02-12  5:54 UTC (permalink / raw)
  To: linux-kernel

In article <20030212041848.GA9273@bjl1.jlokier.co.uk>,
Jamie Lokier  <jamie@shareable.org> wrote:
>
>A cute and wonderful hack is to use the 6 words in the TSS prior to
>&tss->es as the trampoline. Now that __switch_to is done in software,
>those words are not used for anything else.

No!! 

That's not cute and wonderful, that's _horrible_.

Mixing data and code on the same page is very very slow on a P4 (well, I
think it's "same half-page", but the point is that you should not EVER
mix data and code - it ends up being slow on modern CPU's).

>Other fixed offsets from &tss->esp0 are possible - especially nice
>would be to share a cache line with the GDT's hot cache line.  (To do
>this, place GDT before TSS, make KERNEL_CS near the end of the GDT,
>and then the accesses to GDT, trampoline and tss->esp0 will all touch
>the same cache line if you're lucky).

Since almost all x86 CPU's have some kind of cacheline exclusion policy
between the I$ and the D$ (to handle the strict x86 I$ coherency
requirements), your "if you're lucky" is completely bogus.  In fact,
you'd be the _pessimal_ cache behaviour for something like that, ie you
get lines that ping-pong between the L2 and the two instruction caches. 

Don't do it. Keep data and code on separate pages.

			Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-02-12  4:21   ` Jamie Lokier
@ 2003-02-12  5:49     ` Linus Torvalds
  2003-02-12 10:12       ` Jamie Lokier
  2003-02-12 12:54     ` Dave Jones
  1 sibling, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2003-02-12  5:49 UTC (permalink / raw)
  To: linux-kernel

In article <20030212042143.GB9273@bjl1.jlokier.co.uk>,
Jamie Lokier  <jamie@shareable.org> wrote:
>Dave Jones wrote:
>> I feel I'm missing something obvious here, but is this part the
>> low-hanging fruit that it seems ?
>
>You have eliminated one MSR write very cleanly, although there are
>still a few unnecessary conditionals when compared with grabbing a
>whole branch of the fruit tree, as it were.
>
>That leaves the other MSR write, which is also unnecessary.

No, the other one _is_ necessary.  I did timings, and having it in the
context switch path made system calls clearly faster on a P4 (as
compared to my original trampoline approach).

It may be only two instructions difference ("movl xx,%esp ; jmp common")
in the system call path, but it was much more than two cycles.  I don't
know why, but I assume the system call causes a total pipeline flush,
and then the immediate jmp basically means that the P4 has a hard time
getting the pipe restarted.

This might be fixable by moving more (all?) of the kernel-side fast
system call code into the per-cpu trampoline page, so that you wouldn't
have the immediate jump. Somebody needs to try it and time it, otherwise
the wrmsr stays in the context switch.

I want fast system calls. Most people don't see it yet (because you need
a glibc that takes advantage of it), but those fast system calls are
more than enough to make up for some scheduling overhead.

			Linus


* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-02-12  2:59 ` Dave Jones
@ 2003-02-12  4:21   ` Jamie Lokier
  2003-02-12  5:49     ` Linus Torvalds
  2003-02-12 12:54     ` Dave Jones
  2003-02-12  7:50   ` Andi Kleen
  1 sibling, 2 replies; 35+ messages in thread
From: Jamie Lokier @ 2003-02-12  4:21 UTC (permalink / raw)
  To: Dave Jones, Martin J. Bligh, ak, linux-kernel

Dave Jones wrote:
> I feel I'm missing something obvious here, but is this part the
> low-hanging fruit that it seems ?

You have eliminated one MSR write very cleanly, although there are
still a few unnecessary conditionals when compared with grabbing a
whole branch of the fruit tree, as it were.

That leaves the other MSR write, which is also unnecessary.

-- Jamie


* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-02-12  1:35 Martin J. Bligh
  2003-02-12  2:59 ` Dave Jones
@ 2003-02-12  4:18 ` Jamie Lokier
  2003-02-12  5:54   ` Linus Torvalds
  1 sibling, 1 reply; 35+ messages in thread
From: Jamie Lokier @ 2003-02-12  4:18 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: linux-kernel

Martin J. Bligh wrote:
> Since the SYSENTER/vsyscall support went in the 2.5 __switch_to/load_esp0
> function does two WRMSRs to rewrite MSR_IA32_SYSENTER_CS and
> MSR_IA32_SYSENTER_ESP. This is hidden in processor.h:load_esp0. WRMSR is
> very slow (60+ cycles) especially on a Pentium 4 and slows down the context
> switch considerably. This is a trade-off between faster system calls using
> SYSENTER and slower context switches, but the context switches got unduly
> hit here.

<Boggle!>  I'm amazed this slipped in.

> The reason it rewrites SYSENTER_CS is, non-obviously, vm86, which
> doesn't guarantee the MSR stays constant (sigh).

I am confused by your sentence.  Can vm86 code alter the sysenter
MSRs?  That should raise a GPF, surely...  Or do you mean that the
code in vm86.c alters sysenter, because it calls disable_sysenter()?

> I think this would be better handled by having a global flag or
> process flag when any process uses vm86 and not do it when this flag
> is not set (as in 99% of all normal use cases)

Is there a bug?

I think there's a bug with CONFIG_PREEMPT - can someone confirm?  The
kernel can be preempted after the call to disable_sysenter() in
vm86.c, and it will reschedule (see resume_kernel), and reload the
MSRs if I understand entry.S correctly.

So there needs to be a different way to set/clear the MSRs anyway.

Perhaps the debug register loads, ts_io_bitmap loads, and MSR loads
could all be efficiently conditional on a flag?

I.e., in __switch_to:

#define SLOW_SWITCH_VM86	(1 << 0)
#define SLOW_SWITCH_IO_BITMAP	(1 << 1)
#define SLOW_SWITCH_DEBUG	(1 << 2)

	if (unlikely(prev->slow_switch | next->slow_switch)) {
		if (unlikely(next->slow_switch & SLOW_SWITCH_DEBUG)) {
			// ...
		}
		if (unlikely((prev->slow_switch ^ next->slow_switch)
			     & SLOW_SWITCH_IO_BITMAP)) {
			// ...
		}
		if (unlikely((prev->slow_switch ^ next->slow_switch)
			     & SLOW_SWITCH_VM86)) {
			if (next->slow_switch & SLOW_SWITCH_VM86)
				disable_sysenter();
			else
				enable_sysenter();
		}
	}

And whenever ts_io_bitmap or debugreg[7] are written to, recalculate
the value of slow_switch (bits 1 and 2).  And set bit 0 in
do_sys_vm86, clear it in save_v86_state, and recalculate that bit in
restore_sigcontext.

That captures the rare cases, and ensures that the MSRs are always
clear in vm86 mode even if it is preempted, always set otherwise, and
not changed normally.

(The above assumes we revert to a trampoline stack, so the MSRs don't
have to be rewritten during normal context switches).

> It rewrites SYSENTER_ESP to the stack page of the current process.
> Previous implementations used a trampoline for that. The reason it was
> moved to the context switch was that an NMI could see the trampoline
> stack for one instruction, and if it then calls current (very unlikely)
> and dereferences the stack pointer it doesn't get a valid task_struct.
> The obvious solution would be to check for this case (e.g. by looking
> at esp) in the NMI slow path.

It's very easy to fix NMIs by either looking at EIP or ESP at the
start of the NMI handler.  EIP is a bit simpler, because the address
range is fixed at link time and does not vary between CPUs (each CPU
needs its own 6-word trampoline).

With a trampoline stack, it's also necessary to fixup the case where a
Debug trap occurs at the start of the sysenter handler (in the debug
path), and when an NMI interrupts that debug path before it has fixed
up the stack.

A cute and wonderful hack is to use the 6 words in the TSS prior to
&tss->es as the trampoline. Now that __switch_to is done in software,
those words are not used for anything else.  The nice thing is that a
single load from a fixed offset from ESP gets you the replacement
value of %esp0, i.e. the "real" kernel stack is loaded like this:

	movl -68(%esp),%esp

Other fixed offsets from &tss->esp0 are possible - especially nice
would be to share a cache line with the GDT's hot cache line.  (To do
this, place GDT before TSS, make KERNEL_CS near the end of the GDT,
and then the accesses to GDT, trampoline and tss->esp0 will all touch
the same cache line if you're lucky).

The fixup cases for NMI and debug are a bit tricky but not _that_
tricky.

Enjoy,
-- Jamie


* Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
  2003-02-12  1:35 Martin J. Bligh
@ 2003-02-12  2:59 ` Dave Jones
  2003-02-12  4:21   ` Jamie Lokier
  2003-02-12  7:50   ` Andi Kleen
  2003-02-12  4:18 ` Jamie Lokier
  1 sibling, 2 replies; 35+ messages in thread
From: Dave Jones @ 2003-02-12  2:59 UTC (permalink / raw)
  To: Martin J. Bligh, ak; +Cc: linux-kernel

On Tue, Feb 11, 2003 at 05:35:43PM -0800, Martin J. Bligh wrote:

 > The reason it rewrites SYSENTER_CS is, non-obviously, vm86, which
 > doesn't guarantee the MSR stays constant (sigh). This would be better
 > handled by setting a global or per-process flag whenever a process
 > uses vm86, and skipping the MSR write when the flag is clear (as in
 > 99% of all normal use cases).

I feel I'm missing something obvious here, but is this part the
low-hanging fruit that it seems ?

		Dave

--- bk-linus/arch/i386/kernel/sysenter.c	2003-02-12 00:10:15.000000000 -0100
+++ linux-2.5/arch/i386/kernel/sysenter.c	2003-02-12 01:53:58.000000000 -0100
@@ -20,6 +20,8 @@
 
 extern asmlinkage void sysenter_entry(void);
 
+int trashed_sysenter_cs;
+
 /*
  * Create a per-cpu fake "SEP thread" stack, so that we can
  * enter the kernel without having to worry about things like
--- bk-linus/include/asm-i386/processor.h	2003-02-12 00:15:23.000000000 -0100
+++ linux-2.5/include/asm-i386/processor.h	2003-02-12 01:53:43.000000000 -0100
@@ -408,19 +408,26 @@ struct thread_struct {
 	.io_bitmap	= { [ 0 ... IO_BITMAP_SIZE ] = ~0 },		\
 }
 
+extern int trashed_sysenter_cs;
+
 static inline void load_esp0(struct tss_struct *tss, unsigned long esp0)
 {
 	tss->esp0 = esp0;
 	if (cpu_has_sep) {
-		wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
+		if (trashed_sysenter_cs==1) {
+			wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
+			trashed_sysenter_cs = 0;
+		}
 		wrmsr(MSR_IA32_SYSENTER_ESP, esp0, 0);
 	}
 }
 
 static inline void disable_sysenter(void)
 {
-	if (cpu_has_sep)  
+	if (cpu_has_sep) {
 		wrmsr(MSR_IA32_SYSENTER_CS, 0, 0);
+		trashed_sysenter_cs = 1;
+	}
 }
 
 #define start_thread(regs, new_eip, new_esp) do {		\


-- 
| Dave Jones.        http://www.codemonkey.org.uk
| SuSE Labs


* [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
@ 2003-02-12  1:35 Martin J. Bligh
  2003-02-12  2:59 ` Dave Jones
  2003-02-12  4:18 ` Jamie Lokier
  0 siblings, 2 replies; 35+ messages in thread
From: Martin J. Bligh @ 2003-02-12  1:35 UTC (permalink / raw)
  To: linux-kernel

http://bugme.osdl.org/show_bug.cgi?id=350

           Summary: i386 context switch very slow compared to 2.4 due to
                    wrmsr (performance)
    Kernel Version: 2.5.5x (since SYSENTER support went in)
            Status: NEW
          Severity: normal
             Owner: rml@tech9.net
         Submitter: ak@suse.de


Distribution: any
Hardware Environment: especially Intel P4
Software Environment: 
Problem Description:

Since the SYSENTER/vsyscall support went in the 2.5 __switch_to/load_esp0
function does two WRMSRs to rewrite MSR_IA32_SYSENTER_CS and
MSR_IA32_SYSENTER_ESP. This is hidden in processor.h:load_esp0. WRMSR is
very slow (60+ cycles) especially on a Pentium 4 and slows down the context
switch considerably. This is a trade-off between faster system calls using
SYSENTER and slower context switches, but the context switches got unduly
hit here.

The reason it rewrites SYSENTER_CS is, non-obviously, vm86, which
doesn't guarantee the MSR stays constant (sigh). This would be better
handled by setting a global or per-process flag whenever a process uses
vm86, and skipping the MSR write when the flag is clear (as in 99% of
all normal use cases).

It rewrites SYSENTER_ESP to the stack page of the current process.
Previous implementations used a trampoline for that. The reason it was
moved to the context switch was that an NMI could see the trampoline
stack for one instruction, and if it then calls current (very unlikely)
and dereferences the stack pointer it doesn't get a valid task_struct.
The obvious solution would be to check for this case (e.g. by looking
at esp) in the NMI slow path.

Steps to reproduce:

Benchmark __switch_to or the context switch path. Note that lmbench is not
reliable here (numbers vary wildly); microbenchmarks of WRMSR show the
problem clearly.




end of thread, other threads:[~2003-03-20 15:54 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20030318165013$55f4@gated-at.bofh.it>
     [not found] ` <20030318184010$6448@gated-at.bofh.it>
2003-03-18 20:19   ` [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance) Pascal Schmidt
2003-03-19  9:55 Ph. Marek
  -- strict thread matches above, loose matches on Subject: below --
2003-02-12  1:35 Martin J. Bligh
2003-02-12  2:59 ` Dave Jones
2003-02-12  4:21   ` Jamie Lokier
2003-02-12  5:49     ` Linus Torvalds
2003-02-12 10:12       ` Jamie Lokier
2003-03-10  3:07         ` Linus Torvalds
2003-03-10 11:06           ` Andi Kleen
2003-03-10 18:33             ` Linus Torvalds
2003-03-10 22:44           ` Linus Torvalds
2003-02-12 12:54     ` Dave Jones
2003-02-12  7:50   ` Andi Kleen
2003-02-12 10:27     ` Jamie Lokier
2003-02-12 10:45       ` Andi Kleen
2003-02-12 17:52         ` Ingo Oeser
2003-02-12 18:13           ` Dave Jones
2003-02-12 18:18           ` Andi Kleen
2003-02-13  2:42             ` Alan Cox
2003-02-13  5:17         ` Eric W. Biederman
2003-02-13 18:07           ` Andi Kleen
2003-03-19  1:22             ` Rob Landley
2003-02-12  4:18 ` Jamie Lokier
2003-02-12  5:54   ` Linus Torvalds
2003-02-12 10:18     ` Jamie Lokier
2003-02-12 17:24       ` Linus Torvalds
2003-03-18 15:24     ` Kevin Pedretti
2003-03-18 16:41       ` Linus Torvalds
2003-03-18 18:30         ` Brian Gerst
2003-03-18 19:14           ` Thomas Molina
2003-03-18 19:21           ` Linus Torvalds
2003-03-18 20:03             ` Thomas Schlichter
2003-03-18 20:24             ` Steven Cole
2003-03-19  0:42             ` H. Peter Anvin
2003-03-19  2:22               ` george anzinger
