linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH] Runtime memory barrier patching
       [not found] <200304220111.h3M1BEp5004047@hera.kernel.org>
@ 2003-04-22  8:43 ` Arjan van de Ven
  2003-04-22 11:18   ` Andi Kleen
  0 siblings, 1 reply; 21+ messages in thread
From: Arjan van de Ven @ 2003-04-22  8:43 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: ak


On Tue, 2003-04-22 at 01:23, Linux Kernel Mailing List wrote:
> ChangeSet 1.1169, 2003/04/21 16:23:20-07:00, ak@muc.de
> 
> 	[PATCH] Runtime memory barrier patching
> 	
> 	This implements automatic code patching of memory barriers based
> 	on the CPU capabilities. Normally lock ; addl $0,(%esp) barriers
> 	are used, but these are a bit slow on the Pentium 4.
> 	

very nice. Question: would it be doable to use this for prefetch() as well?
E.g. default to a non-prefetch kernel and patch in the proper prefetch
instruction for the current CPU? (e.g. AMD prefetch vs the Intel one, etc.)




* Re: [PATCH] Runtime memory barrier patching
  2003-04-22  8:43 ` [PATCH] Runtime memory barrier patching Arjan van de Ven
@ 2003-04-22 11:18   ` Andi Kleen
  2003-04-22 16:11     ` Dave Jones
  0 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2003-04-22 11:18 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Linux Kernel Mailing List, ak


On Tue, Apr 22, 2003 at 10:43:58AM +0200, Arjan van de Ven wrote:
> On Tue, 2003-04-22 at 01:23, Linux Kernel Mailing List wrote:
> > ChangeSet 1.1169, 2003/04/21 16:23:20-07:00, ak@muc.de
> > 
> > 	[PATCH] Runtime memory barrier patching
> > 	
> > 	This implements automatic code patching of memory barriers based
> > 	on the CPU capabilities. Normally lock ; addl $0,(%esp) barriers
> > 	are used, but these are a bit slow on the Pentium 4.
> > 	
> 
> very nice. Question: would it be doable to use this for prefetch() as well?
> E.g. default to a non-prefetch kernel and patch in the proper prefetch
> instruction for the current CPU? (e.g. AMD prefetch vs the Intel one, etc.)

Yes, I already implemented it, but have yet to boot it.

You only need the Intel and AMD prefetches. For all Athlons the SSE prefetches
work (because we force the SSE MSR bit to on). prefetchw is 3dnow.
3dnow non-'w' prefetches would only make sense on the K6, but they're
not really worth it there because it doesn't have enough outstanding loads
in the memory unit and, worse, prefetch is microcoded there.
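
For illustration, a rough sketch of what a patched prefetch() could look
like (untested; alternative_input() is a hypothetical variant of the posted
alternative() macro that also takes an input operand, which the current
macro does not):

	/* Sketch only.  The default is a 3-byte nop sequence so that the
	   original is at least as long as the 3-byte replacement, as the
	   patching code requires. */
	static inline void prefetch(const void *x)
	{
		alternative_input("nop; nop; nop",      /* placeholder */
				  "prefetchnta (%1)",   /* SSE prefetch */
				  X86_FEATURE_XMM,
				  "r" (x));
	}

	static inline void prefetchw(const void *x)
	{
		alternative_input("nop; nop; nop",
				  "prefetchw (%1)",     /* 3dnow write prefetch */
				  X86_FEATURE_3DNOW,
				  "r" (x));
	}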

-Andi



* Re: [PATCH] Runtime memory barrier patching
  2003-04-22 11:18   ` Andi Kleen
@ 2003-04-22 16:11     ` Dave Jones
  0 siblings, 0 replies; 21+ messages in thread
From: Dave Jones @ 2003-04-22 16:11 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Arjan van de Ven, Linux Kernel Mailing List

On Tue, Apr 22, 2003 at 01:18:32PM +0200, Andi Kleen wrote:

 > You only need Intel and AMD prefetch. For all Athlons the SSE prefetches
 > work (because we force the SSE MSR bit to on).

Except those before the Athlon XP, which didn't have SSE.

		Dave



* Re: [PATCH] Runtime memory barrier patching
@ 2003-04-22 10:12 Chuck Ebbert
  0 siblings, 0 replies; 21+ messages in thread
From: Chuck Ebbert @ 2003-04-22 10:12 UTC (permalink / raw)
  To: linux-kernel


>> 
>> Is that true even on the 32-bit Athlons, especially the older ones?
>
> It is not the recommended form for Athlons (see my other mail) 
> But I doubt it's a big issue. The Athlon has a pretty good decoder.


 Doesn't the Athlon do best with just a string of 0x90s, because it treats
them as real NOPs and discards them very early in the CPU pipeline? That
broke the BogoMIPS estimator when they were introduced...



------
 Chuck


* Re: [PATCH] Runtime memory barrier patching
  2003-04-22  0:06       ` Linus Torvalds
@ 2003-04-22  0:13         ` Jamie Lokier
  0 siblings, 0 replies; 21+ messages in thread
From: Jamie Lokier @ 2003-04-22  0:13 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andi Kleen, linux-kernel

Linus Torvalds wrote:
> > Such as removing the lock prefix when running non-SMP?
> 
> I think you should use a separate mechanism for that. It's really a 
> separate issue, _and_ the replacement is actually quite different (and 
> much more common, so you'd want to use a more compact data structure that 
> is likely just the list of addresses of locked instructions).

Indeed.

-- Jamie


* Re: [PATCH] Runtime memory barrier patching
  2003-04-21 23:35     ` Jamie Lokier
  2003-04-21 23:46       ` Andi Kleen
@ 2003-04-22  0:06       ` Linus Torvalds
  2003-04-22  0:13         ` Jamie Lokier
  1 sibling, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2003-04-22  0:06 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Andi Kleen, linux-kernel


On Tue, 22 Apr 2003, Jamie Lokier wrote:
> 
> Such as removing the lock prefix when running non-SMP?

I think you should use a separate mechanism for that. It's really a 
separate issue, _and_ the replacement is actually quite different (and 
much more common, so you'd want to use a more compact data structure that 
is likely just the list of addresses of locked instructions).

		Linus



* Re: [PATCH] Runtime memory barrier patching
  2003-04-21 23:41 Chuck Ebbert
@ 2003-04-22  0:04 ` Jamie Lokier
  0 siblings, 0 replies; 21+ messages in thread
From: Jamie Lokier @ 2003-04-22  0:04 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: linux-kernel, Andy Kleen

Chuck Ebbert wrote:
> > Does anybody have the preferred Intel sequence somewhere?
> 
>   These are in Intel ASM form (i.e. dest first) from the 486 manual:
> 
>  2-bytes            mov reg,reg
>  3-bytes            lea reg, 0[reg]  ; 8-bit displacement

Look in the GAS source code for opcodes.

At least on the 486 and Pentium (I remember that part of GAS in those
days), it makes sense to select a register that is not set in the
preceding few instructions or used in the subsequent few.  Otherwise,
lea's use of a register adds an address generation delay (worse than
just a mov).

Hence the use of %esi and %edi in GAS - registers allocated last by the
compiler, so least likely to be used/set in surrounding instructions.
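
As a contrived illustration (made-up surrounding code, GAS syntax):

	movl	$1, %eax
	leal	0(%eax), %eax		# padding nop, but it reads %eax right
					# after the write - address generation stall
	movl	%eax, %ebx

	movl	$1, %eax
	leal	0(%esi), %esi		# padding nop on a register the
					# surrounding code does not touch
	movl	%eax, %ebx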

The register selection is probably too difficult to automate, though.

-- Jamie



* Re: [PATCH] Runtime memory barrier patching
  2003-04-21 23:46       ` Andi Kleen
@ 2003-04-21 23:57         ` Jamie Lokier
  0 siblings, 0 replies; 21+ messages in thread
From: Jamie Lokier @ 2003-04-21 23:57 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linus Torvalds, linux-kernel

Andi Kleen wrote:
> > Such as removing the lock prefix when running non-SMP?
...
> .altinstructions section after load to avoid too much kernel bloat (it
> currently costs 7 byte + the length of the replacement. And lock
> is quite common in the kernel these days.

1. It should cost at most 4 bytes, in a table of "remove me" addresses,
with the topmost bits of each word used to give the length of the
instruction to remove - i.e. to replace with an optimal nop sequence
(a rough sketch of the encoding follows below).

2. Not just locks - a lot of spinlock code could be removed completely.
I don't have an opinion on whether this is worth doing - one can
simply run a UP kernel, after all.
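
A minimal sketch of that encoding (hypothetical - nothing like this is in
the posted patch): one 32-bit word per locked instruction, the low bits
holding its offset and the top bits the number of bytes to nop out:

	struct smp_unlock_entry {
		u32 word;				/* offset | (len << 28) */
	};
	#define SMP_UNLOCK_OFFSET(w)	((w) & 0x0fffffff)	/* from _text */
	#define SMP_UNLOCK_LEN(w)	((w) >> 28)		/* bytes to nop out */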

-- Jamie


* Re: [PATCH] Runtime memory barrier patching
  2003-04-21 23:35     ` Jamie Lokier
@ 2003-04-21 23:46       ` Andi Kleen
  2003-04-21 23:57         ` Jamie Lokier
  2003-04-22  0:06       ` Linus Torvalds
  1 sibling, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2003-04-21 23:46 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Andi Kleen, Linus Torvalds, linux-kernel

On Tue, Apr 22, 2003 at 01:35:57AM +0200, Jamie Lokier wrote:
> Andi Kleen wrote:
> > The patching code is quite generic and could be used to patch other
> > instructions
> 
> Such as removing the lock prefix when running non-SMP?

Yes, that could work. But you need a new variant of alternative()
or eat worse code. The current alternative() can only handle
constant-sized original instructions, which requires that you
use a constant-sized constraint (e.g. "(%0)" with "r" (ptr), etc.);
"m" is unfortunately variable-sized.

For the special case of lock it would still work, because you
only need to patch the prefix away, not replace the whole
instruction, but that requires a new macro.

Also, when you do that, I would start to think about discarding the
.altinstructions section after load to avoid too much kernel bloat
(it currently costs 7 bytes plus the length of the replacement), and lock
is quite common in the kernel these days.
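
For illustration, a rough sketch of what such a lock-only variant could
look like (untested; the section name, label and fixup function here are
made up and not part of the posted patch).  It records only the address of
the lock prefix, which boot code can then overwrite with a one-byte nop on UP:

	#define LOCK_ALT_PREFIX					\
		"661:\n\tlock;"					\
		"\n.section .smp_locks,\"a\"\n"			\
		"  .align 4\n"					\
		"  .long 661b\n"	/* address of the prefix */	\
		".previous\n\t"

	/* boot-time fixup sketch: nop out every recorded lock prefix */
	void __init remove_lock_prefixes(u32 *start, u32 *end)
	{
		u32 *p;
		for (p = start; p < end; p++)
			*(unsigned char *)*p = 0x90;
	}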

-Andi


* Re: [PATCH] Runtime memory barrier patching
@ 2003-04-21 23:41 Chuck Ebbert
  2003-04-22  0:04 ` Jamie Lokier
  0 siblings, 1 reply; 21+ messages in thread
From: Chuck Ebbert @ 2003-04-21 23:41 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andy Kleen

Linus Torvalds wrote:


> Does anybody have the preferred Intel sequence somewhere?


  These are in Intel ASM form (i.e. dest first) from the 486 manual:

 2-bytes            mov reg,reg
 3-bytes            lea reg, 0[reg]  ; 8-bit displacement

No opcodes handy, though...
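
For reference (assuming EAX as the scratch register), those encode as:

	89 C0			; mov eax,eax      - 2 bytes
	8D 40 00		; lea eax,[eax+0]  - 3 bytes, 8-bit displacement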


------
 Chuck


* Re: [PATCH] Runtime memory barrier patching
  2003-04-21 20:53   ` Andi Kleen
  2003-04-21 21:04     ` Linus Torvalds
@ 2003-04-21 23:35     ` Jamie Lokier
  2003-04-21 23:46       ` Andi Kleen
  2003-04-22  0:06       ` Linus Torvalds
  1 sibling, 2 replies; 21+ messages in thread
From: Jamie Lokier @ 2003-04-21 23:35 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linus Torvalds, linux-kernel

Andi Kleen wrote:
> The patching code is quite generic and could be used to patch other
> instructions

Such as removing the lock prefix when running non-SMP?

-- Jamie


* Re: [PATCH] Runtime memory barrier patching
  2003-04-21 22:23         ` Linus Torvalds
@ 2003-04-21 22:59           ` Andi Kleen
  0 siblings, 0 replies; 21+ messages in thread
From: Andi Kleen @ 2003-04-21 22:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andi Kleen, linux-kernel

On Tue, Apr 22, 2003 at 12:23:10AM +0200, Linus Torvalds wrote:
> 
> On Tue, 22 Apr 2003, Andi Kleen wrote:
> > 
> > At least on Athlon/Opteron these sequences are the fastest because they are
> > special cased in the decoder and do not consume any execution resources.  
> 
> Is that true even on the 32-bit Athlons, especially the older ones?

It is not the recommended form for Athlons (see my other mail),
but I doubt it's a big issue. The Athlon has a pretty good decoder.

> 
> I can understand the special-casing on Opteron, since in 64-bit mode
> you'll see more of the prefixes, but for older K7s?

64-bit mode needs it special-cased anyway, because the common
xchg eax,eax encoding of nop would otherwise not be a no-op there
(it would zero-extend the register to 64 bits).

> I think the P3 (which is still Intel's "current" offering as it comes to 
> the mobile Pentium-M side) has problems. And there are still people who 
> use even older chips.

P3 should be fine now.

> > I'm using the GAS sequences for the Intel case Ulrich pointed out now,
> > but only upto 4 bytes (memory barrier only needs 3 bytes currently). 
> > This will hopefully satisfy all nop optimizers ;)
> 
> Looks good to me.
> 
> I do have _one_ more small niggling issue - I think this patch also makes
> the CONFIG_X86_SSE2 define be a thing of the past. Or is it used for
> something else still? It would be good to remove it, and try to make most
> of the architecture choices be pure optimization hints (apart from some of
> the more painful architecture updates like the broken write protect on the
> original 386). That will make it easier for distribution makers.

CONFIG_X86_SSE2 is a nop now, yes. But it does not matter because
the user cannot set it directly. I can remove it in a follow-up patch.

I'm thinking of using it for the prefetches in the future.
I wrote prefetch-using versions of these for 64-bit and it
helps, so it may make sense to port it over.

I also experimented with replacing local_irq_restore with
a P4-optimized version (bt $9,oldflags ; jnc 1f ; sti ; 1: instead
of pushl oldflags ; popfl), which goes from 60 cycles to 47 cycles, but I ran
into weird binutils problems and it bloated the code quite a lot
(2 bytes -> 7 bytes) for only a few cycles, so I dropped it again.
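
For illustration, the two forms side by side (a rough, untested sketch of
the experiment described above, not code that was posted):

	/* generic form: restores all of EFLAGS */
	#define local_irq_restore_popf(flags)				\
		asm volatile("pushl %0 ; popfl"				\
			     : : "rm" (flags) : "memory", "cc")

	/* P4-friendly variant: only executes sti when IF (bit 9) was set
	   in the saved flags, leaving the rest of EFLAGS untouched */
	#define local_irq_restore_sti(flags)				\
		asm volatile("btl $9, %0 ; jnc 1f ; sti ; 1:"		\
			     : : "rm" (flags) : "memory", "cc")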

But please put the first version of the patch in first so that
we get the infrastructure for future work.

-Andi




* Re: [PATCH] Runtime memory barrier patching
  2003-04-21 22:05         ` Linus Torvalds
@ 2003-04-21 22:45           ` Andi Kleen
  0 siblings, 0 replies; 21+ messages in thread
From: Andi Kleen @ 2003-04-21 22:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ulrich Drepper, Andi Kleen, linux-kernel

On Tue, Apr 22, 2003 at 12:05:51AM +0200, Linus Torvalds wrote:
> I would not be surprised if both AMD and Intel are playing some 
> "benchmarking games" by trying to select nop's that work badly for the 
> other side, and then showing how _their_ new CPU's are so much better by 
> having the compilers emit the "preferred" no-ops.
> 
> But maybe I'm just too cynical. And I do suspect the Hammer optimization
> guide was meant for the 64-bit mode only, because I'm pretty certain even
> AMD does badly on prefixes at least in older CPU generations.

Yes, you're right. The Athlon recommendation is pretty complicated
(they have different versions for different registers to avoid lengthening
dependency chains), but it uses different two-byte nops.

They explicitly say that it works well on "other CPUs" too, which
likely means Intel.

Basically it is: 

         NOP2_EAX TEXTEQU <DB 08Bh,0C0h>                  ;MOV EAX, EAX
         NOP3_EAX TEXTEQU <DB 08Dh,004h,020h>             ;LEA EAX, [EAX]
         NOP4_EAX TEXTEQU <DB 08Dh,044h,020h,000h>        ;LEA EAX, [EAX+00]
         NOP5_EAX TEXTEQU <DB 08Dh,044h,020h,000h,090h>   ;LEA EAX, [EAX+00] ;NOP
         NOP6_EAX TEXTEQU <DB 08Dh,080h,0,0,0,0>          ;LEA EAX, [EAX+00000000]
         NOP7_EAX TEXTEQU <DB 08Dh,004h,005h,0,0,0,0>     ;LEA EAX, [EAX*1+00000000]
         NOP8_EAX TEXTEQU <DB 08Dh,004h,005h,0,0,0,0,90h> ;LEA EAX, [EAX*1+00000000] ;NOP
         NOP9     TEXTEQU <DB 0EBh,007h,90h,90h,90h,90h,90h,90h,90h> ;JMP over the NOPs

(+ the same for all other GPRs). Someone must have had too much time
when documenting this ;)

But I'm not gonna resubmit with that unless you insist on it...

-Andi



* Re: [PATCH] Runtime memory barrier patching
  2003-04-21 22:11       ` Andi Kleen
@ 2003-04-21 22:23         ` Linus Torvalds
  2003-04-21 22:59           ` Andi Kleen
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2003-04-21 22:23 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel


On Tue, 22 Apr 2003, Andi Kleen wrote:
> 
> At least on Athlon/Opteron these sequences are the fastest because they are
> special cased in the decoder and do not consume any execution resources.  

Is that true even on the 32-bit Athlons, especially the older ones?

I can understand the special-casing on Opteron, since in 64-bit mode
you'll see more of the prefixes, but for older K7s?

> On the P4 it also shouldn't matter because it has the trace cache that 
> hides decoding issues.  Same on Transmeta, right?

Yes, decoding is not usually an issue on a trace-cache-driven thing, 
either P4 or Transmeta.

> I guess it'll do well on P3.

I think the P3 (which is still Intel's "current" offering when it comes to
the mobile Pentium-M side) has problems. And there are still people who
use even older chips.

Of course, for _this_ particular case (ie sfence/mfence) older chips do 
not matter, as they'll fall back on the longer sequence and never see the 
no-ops, but we may have other replacements where it goes the other way and 
it's the _new_ sequence that is shorter.

> I'm using the GAS sequences for the Intel case Ulrich pointed out now,
> but only upto 4 bytes (memory barrier only needs 3 bytes currently). 
> This will hopefully satisfy all nop optimizers ;)

Looks good to me.

I do have _one_ more small niggling issue - I think this patch also makes
the CONFIG_X86_SSE2 define a thing of the past. Or is it used for
something else still? It would be good to remove it, and try to make most
of the architecture choices be pure optimization hints (apart from some of
the more painful architecture updates like the broken write protect on the
original 386). That will make it easier for distribution makers.

		Linus "picky, picky" Torvalds



* Re: [PATCH] Runtime memory barrier patching
  2003-04-21 21:04     ` Linus Torvalds
  2003-04-21 21:43       ` Ulrich Drepper
@ 2003-04-21 22:11       ` Andi Kleen
  2003-04-21 22:23         ` Linus Torvalds
  1 sibling, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2003-04-21 22:11 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andi Kleen, linux-kernel

On Mon, Apr 21, 2003 at 11:04:48PM +0200, Linus Torvalds wrote:
> 
> On Mon, 21 Apr 2003, Andi Kleen wrote:
> >
> > Ok fixed. I used the recommendations from the Hammer optimization
> > manual, will hopefully work for Intel too.
> 
> They may _work_ for intel, but quite frankly they suck for most Intel (and 
> probably non-intel too) CPU's. Using prefixes tends to almost always mess 
> up the instruction decoders on most CPU's out there.

At least on Athlon/Opteron these sequences are the fastest, because they are
special-cased in the decoder and do not consume any execution resources.
On the P4 it also shouldn't matter because it has the trace cache that
hides decoding issues.  Same on Transmeta, right? I guess it'll do
well on the P3.

I'm using the GAS sequences for the Intel case Ulrich pointed out now,
but only up to 4 bytes (the memory barrier only needs 3 bytes currently).
This will hopefully satisfy all nop optimizers ;)

-Andi

This patch implements automatic code patching of memory barriers based
on the CPU capabilities. Normally lock ; addl $0,(%esp) barriers
are used, but these are a bit slow on the Pentium 4. 

Linus proposed this a few weeks ago after the support for SSE1/SSE2
barriers was introduced. I now got around to implementing it.

The main advantage is that it allows distributors to ship fewer binary
kernels but still get fast kernels. In particular it avoids the
need for a special Pentium 4 kernel.

The patching code is quite generic and could be used to patch
other instructions (like prefetches or other specific critical
instructions) too.
Thanks to Rusty's in-kernel loader it also works seamlessly for modules.

The patching is done before the other CPUs start, to avoid potential
errata with self-modifying code on SMP systems. It makes no
attempt to automatically handle asymmetric systems (a secondary
CPU having fewer capabilities than the boot CPU). In that
case just boot with "noreplacement".

diff -u linux-2.5.68-gencpu/arch/i386/kernel/setup.c-o linux-2.5.68-gencpu/arch/i386/kernel/setup.c
--- linux-2.5.68-gencpu/arch/i386/kernel/setup.c-o	2003-04-21 22:18:39.000000000 +0200
+++ linux-2.5.68-gencpu/arch/i386/kernel/setup.c	2003-04-21 23:58:20.000000000 +0200
@@ -797,6 +797,63 @@
 		pci_mem_start = low_mem_size;
 }
 
+/* Replace instructions with better alternatives for this CPU type.
+
+   This runs before SMP is initialized to avoid SMP problems with
+   self modifying code. This implies that assymetric systems where
+   APs have less capabilities than the boot processor are not handled. 
+    
+   In this case boot with "noreplacement". */ 
+void __init apply_alternatives(void *start, void *end) 
+{ 
+	struct alt_instr *a; 
+	int diff, i, k;
+
+	for (a = start; a < end; 
+	     a = (void *)ALIGN((unsigned long)(a + 1) + a->instrlen, 4)) { 
+		if (!boot_cpu_has(a->cpuid))
+			continue;
+		BUG_ON(a->replacementlen > a->instrlen); 
+		memcpy(a->instr, a->replacement, a->replacementlen); 
+		diff = a->instrlen - a->replacementlen; 
+		for (i = a->replacementlen; diff > 0; diff -= k, i += k) {
+			static const char *nops[] = {
+				0,
+				"\x90",
+#if CONFIG_MK7 || CONFIG_MK8
+				"\x66\x90",
+				"\x66\x66\x90",
+				"\x66\x66\x66\x90",
+#else
+				"\x89\xf6",
+				"\x8d\x76\x00",
+				"\x8d\x74\x26\x00",
+#endif
+			};
+			k = min_t(int, diff, ARRAY_SIZE(nops)); 
+			memcpy(a->instr + i, nops[k], k); 
+		} 
+	}
+} 
+
+static int no_replacement __initdata = 0; 
+ 
+void __init alternative_instructions(void)
+{
+	extern struct alt_instr __alt_instructions[], __alt_instructions_end[];
+	if (no_replacement) 
+		return;
+	apply_alternatives(__alt_instructions, __alt_instructions_end);
+}
+
+static int __init noreplacement_setup(char *s)
+{ 
+     no_replacement = 1; 
+     return 0; 
+} 
+
+__setup("noreplacement", noreplacement_setup); 
+
 void __init setup_arch(char **cmdline_p)
 {
 	unsigned long max_low_pfn;
diff -u linux-2.5.68-gencpu/arch/i386/kernel/module.c-o linux-2.5.68-gencpu/arch/i386/kernel/module.c
--- linux-2.5.68-gencpu/arch/i386/kernel/module.c-o	2003-04-20 21:24:16.000000000 +0200
+++ linux-2.5.68-gencpu/arch/i386/kernel/module.c	2003-04-21 21:07:27.000000000 +0200
@@ -104,9 +104,22 @@
 	return -ENOEXEC;
 }
 
+extern void apply_alternatives(void *start, void *end); 
+
 int module_finalize(const Elf_Ehdr *hdr,
 		    const Elf_Shdr *sechdrs,
 		    struct module *me)
 {
+	Elf_Shdr *s;
+	char *secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
+
+	/* look for .altinstructions to patch */ 
+	for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) { 
+		void *seg; 		
+		if (strcmp(".altinstructions", secstrings + s->sh_name))
+			continue;
+		seg = (void *)s->sh_addr; 
+		apply_alternatives(seg, seg + s->sh_size); 
+	} 	
 	return 0;
 }
diff -u linux-2.5.68-gencpu/arch/i386/vmlinux.lds.S-o linux-2.5.68-gencpu/arch/i386/vmlinux.lds.S
--- linux-2.5.68-gencpu/arch/i386/vmlinux.lds.S-o	2003-04-20 21:24:16.000000000 +0200
+++ linux-2.5.68-gencpu/arch/i386/vmlinux.lds.S	2003-04-20 21:24:22.000000000 +0200
@@ -81,6 +81,10 @@
   __con_initcall_start = .;
   .con_initcall.init : { *(.con_initcall.init) }
   __con_initcall_end = .;
+  . = ALIGN(4);
+  __alt_instructions = .;
+  .altinstructions : { *(.altinstructions) } 
+  __alt_instructions_end = .; 
   . = ALIGN(4096);
   __initramfs_start = .;
   .init.ramfs : { *(.init.ramfs) }
diff -u linux-2.5.68-gencpu/include/asm-i386/system.h-o linux-2.5.68-gencpu/include/asm-i386/system.h
--- linux-2.5.68-gencpu/include/asm-i386/system.h-o	2003-04-20 21:24:16.000000000 +0200
+++ linux-2.5.68-gencpu/include/asm-i386/system.h	2003-04-20 21:24:22.000000000 +0200
@@ -4,6 +4,7 @@
 #include <linux/config.h>
 #include <linux/kernel.h>
 #include <asm/segment.h>
+#include <asm/cpufeature.h>
 #include <linux/bitops.h> /* for LOCK_PREFIX */
 
 #ifdef __KERNEL__
@@ -276,6 +277,37 @@
 /* Compiling for a 386 proper.	Is it worth implementing via cli/sti?  */
 #endif
 
+struct alt_instr { 
+	u8 *instr; 		/* original instruction */
+	u8  cpuid;		/* cpuid bit set for replacement */
+	u8  instrlen;		/* length of original instruction */
+	u8  replacementlen; 	/* length of new instruction, <= instrlen */ 
+	u8  replacement[0];   	/* new instruction */
+}; 
+
+/* 
+ * Alternative instructions for different CPU types or capabilities.
+ * 
+ * This allows to use optimized instructions even on generic binary
+ * kernels.
+ * 
+ * length of oldinstr must be longer or equal the length of newinstr
+ * It can be padded with nops as needed.
+ * 
+ * For non barrier like inlines please define new variants
+ * without volatile and memory clobber.
+ */
+#define alternative(oldinstr, newinstr, feature) 	\
+	asm volatile ("661:\n\t" oldinstr "\n662:\n" 		     \
+		      ".section .altinstructions,\"a\"\n"     	     \
+		      "  .align 4\n"				       \
+		      "  .long 661b\n"            /* label */          \
+		      "  .byte %c0\n"             /* feature bit */    \
+		      "  .byte 662b-661b\n"       /* sourcelen */      \
+		      "  .byte 664f-663f\n"       /* replacementlen */ \
+		      "663:\n\t" newinstr "\n664:\n"   /* replacement */    \
+		      ".previous" :: "i" (feature) : "memory")  
+
 /*
  * Force strict CPU ordering.
  * And yes, this is required on UP too when we're talking
@@ -294,13 +326,15 @@
  * nop for these.
  */
  
-#ifdef CONFIG_X86_SSE2
-#define mb()	asm volatile("mfence" ::: "memory")
-#define rmb()	asm volatile("lfence" ::: "memory")
-#else
-#define mb() 	__asm__ __volatile__ ("lock; addl $0,0(%%esp)": : :"memory")
-#define rmb()	mb()
-#endif
+
+/* 
+ * Actually only lfence would be needed for mb() because all stores done 
+ * by the kernel should be already ordered. But keep a full barrier for now. 
+ */
+
+#define mb() alternative("lock; addl $0,0(%%esp)", "mfence", X86_FEATURE_XMM2)
+#define rmb() alternative("lock; addl $0,0(%%esp)", "lfence", X86_FEATURE_XMM2)
+
 /**
  * read_barrier_depends - Flush all pending reads that subsequents reads
  * depend on.
@@ -356,7 +390,9 @@
 #define read_barrier_depends()	do { } while(0)
 
 #ifdef CONFIG_X86_OOSTORE
-#define wmb() 	__asm__ __volatile__ ("lock; addl $0,0(%%esp)": : :"memory")
+/* Actually there are no OOO store capable CPUs for now that do SSE, 
+   but make it already an possibility. */
+#define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
 #else
 #define wmb()	__asm__ __volatile__ ("": : :"memory")
 #endif
diff -u linux-2.5.68-gencpu/include/asm-i386/bugs.h-o linux-2.5.68-gencpu/include/asm-i386/bugs.h
--- linux-2.5.68-gencpu/include/asm-i386/bugs.h-o	2003-04-20 21:24:16.000000000 +0200
+++ linux-2.5.68-gencpu/include/asm-i386/bugs.h	2003-04-20 21:24:22.000000000 +0200
@@ -200,6 +200,8 @@
 #endif
 }
 
+extern void alternative_instructions(void);
+
 static void __init check_bugs(void)
 {
 	identify_cpu(&boot_cpu_data);
@@ -212,4 +214,5 @@
 	check_hlt();
 	check_popad();
 	system_utsname.machine[1] = '0' + (boot_cpu_data.x86 > 6 ? 6 : boot_cpu_data.x86);
+	alternative_instructions(); 
 }





* Re: [PATCH] Runtime memory barrier patching
  2003-04-21 21:43       ` Ulrich Drepper
@ 2003-04-21 22:05         ` Linus Torvalds
  2003-04-21 22:45           ` Andi Kleen
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2003-04-21 22:05 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Andi Kleen, linux-kernel


On Mon, 21 Apr 2003, Ulrich Drepper wrote:

> Linus Torvalds wrote:
> 
> > They may _work_ for intel, but quite frankly they suck for most Intel (and 
> > probably non-intel too) CPU's. Using prefixes tends to almost always mess 
> > up the instruction decoders on most CPU's out there.
> 
> Indeed, using prefixes is terrible.

I would not be surprised if both AMD and Intel are playing some 
"benchmarking games" by trying to select nop's that work badly for the 
other side, and then showing how _their_ new CPU's are so much better by 
having the compilers emit the "preferred" no-ops.

But maybe I'm just too cynical. And I do suspect the Hammer optimization
guide was meant for the 64-bit mode only, because I'm pretty certain even
AMD does badly on prefixes at least in older CPU generations.

		Linus



* Re: [PATCH] Runtime memory barrier patching
  2003-04-21 21:04     ` Linus Torvalds
@ 2003-04-21 21:43       ` Ulrich Drepper
  2003-04-21 22:05         ` Linus Torvalds
  2003-04-21 22:11       ` Andi Kleen
  1 sibling, 1 reply; 21+ messages in thread
From: Ulrich Drepper @ 2003-04-21 21:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andi Kleen, linux-kernel


Linus Torvalds wrote:

> They may _work_ for intel, but quite frankly they suck for most Intel (and 
> probably non-intel too) CPU's. Using prefixes tends to almost always mess 
> up the instruction decoders on most CPU's out there.

Indeed, using prefixes is terrible.  This is what is used in gas:


  static const char f32_1[] =
    {0x90};                                     /* nop                  */
  static const char f32_2[] =
    {0x89,0xf6};                                /* movl %esi,%esi       */
  static const char f32_3[] =
    {0x8d,0x76,0x00};                           /* leal 0(%esi),%esi    */
  static const char f32_4[] =
    {0x8d,0x74,0x26,0x00};                      /* leal 0(%esi,1),%esi  */
  static const char f32_5[] =
    {0x90,                                      /* nop                  */
     0x8d,0x74,0x26,0x00};                      /* leal 0(%esi,1),%esi  */
  static const char f32_6[] =
    {0x8d,0xb6,0x00,0x00,0x00,0x00};            /* leal 0L(%esi),%esi   */
  static const char f32_7[] =
    {0x8d,0xb4,0x26,0x00,0x00,0x00,0x00};       /* leal 0L(%esi,1),%esi */
  static const char f32_8[] =
    {0x90,                                      /* nop                  */
     0x8d,0xb4,0x26,0x00,0x00,0x00,0x00};       /* leal 0L(%esi,1),%esi */
  static const char f32_9[] =
    {0x89,0xf6,                                 /* movl %esi,%esi       */
     0x8d,0xbc,0x27,0x00,0x00,0x00,0x00};       /* leal 0L(%edi,1),%edi */
  static const char f32_10[] =
    {0x8d,0x76,0x00,                            /* leal 0(%esi),%esi    */
     0x8d,0xbc,0x27,0x00,0x00,0x00,0x00};       /* leal 0L(%edi,1),%edi */
  static const char f32_11[] =
    {0x8d,0x74,0x26,0x00,                       /* leal 0(%esi,1),%esi  */
     0x8d,0xbc,0x27,0x00,0x00,0x00,0x00};       /* leal 0L(%edi,1),%edi */
  static const char f32_12[] =
    {0x8d,0xb6,0x00,0x00,0x00,0x00,             /* leal 0L(%esi),%esi   */
     0x8d,0xbf,0x00,0x00,0x00,0x00};            /* leal 0L(%edi),%edi   */
  static const char f32_13[] =
    {0x8d,0xb6,0x00,0x00,0x00,0x00,             /* leal 0L(%esi),%esi   */
     0x8d,0xbc,0x27,0x00,0x00,0x00,0x00};       /* leal 0L(%edi,1),%edi */
  static const char f32_14[] =
    {0x8d,0xb4,0x26,0x00,0x00,0x00,0x00,        /* leal 0L(%esi,1),%esi */
     0x8d,0xbc,0x27,0x00,0x00,0x00,0x00};       /* leal 0L(%edi,1),%edi */
  static const char f32_15[] =
    {0xeb,0x0d,0x90,0x90,0x90,0x90,0x90,        /* jmp .+15; lotsa nops */
     0x90,0x90,0x90,0x90,0x90,0x90,0x90,0x90};

-- 
--------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------



* Re: [PATCH] Runtime memory barrier patching
  2003-04-21 20:53   ` Andi Kleen
@ 2003-04-21 21:04     ` Linus Torvalds
  2003-04-21 21:43       ` Ulrich Drepper
  2003-04-21 22:11       ` Andi Kleen
  2003-04-21 23:35     ` Jamie Lokier
  1 sibling, 2 replies; 21+ messages in thread
From: Linus Torvalds @ 2003-04-21 21:04 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel


On Mon, 21 Apr 2003, Andi Kleen wrote:
>
> Ok fixed. I used the recommendations from the Hammer optimization
> manual, will hopefully work for Intel too.

They may _work_ for intel, but quite frankly they suck for most Intel (and 
probably non-intel too) CPU's. Using prefixes tends to almost always mess 
up the instruction decoders on most CPU's out there.

I _think_ most Intel chips have one generic decoder (which knows about
prefixes etc), and the rest only handle simple instructions.

Intel has some preferred sequence that I'd much rather use by default,
although we can obviously use a CONFIG_xxx option to switch between these
sequences.

I forget what the Intel sequence is, but I think the multi-byte opcodes
are something like "lea 0(%esi),%esi" rather than adding operand-size
prefixes to the nop.

Does anybody have the preferred Intel sequence somewhere?

			Linus



* Re: [PATCH] Runtime memory barrier patching
  2003-04-21 19:59 ` Linus Torvalds
@ 2003-04-21 20:53   ` Andi Kleen
  2003-04-21 21:04     ` Linus Torvalds
  2003-04-21 23:35     ` Jamie Lokier
  0 siblings, 2 replies; 21+ messages in thread
From: Andi Kleen @ 2003-04-21 20:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andi Kleen, linux-kernel

On Mon, Apr 21, 2003 at 09:59:00PM +0200, Linus Torvalds wrote:
> 
> On Mon, 21 Apr 2003, Andi Kleen wrote:
> > 
> > This patch implements automatic code patching of memory barriers based
> > on the CPU capabilities. Normally lock ; addl $0,(%esp) barriers
> > are used, but these are a bit slow on the Pentium 4. 
> 
> Could you fix this part:

Ok, fixed. I used the recommendations from the Hammer optimization
manual; hopefully it will work for Intel too.

Original mail for BitKeeper:

This patch implements automatic code patching of memory barriers based
on the CPU capabilities. Normally lock ; addl $0,(%esp) barriers
are used, but these are a bit slow on the Pentium 4. 

Linus proposed this a few weeks ago after the support for SSE1/SSE2
barriers was introduced. I now got around to implementing it.

The main advantage is that it allows distributors to ship fewer binary
kernels but still get fast kernels. In particular it avoids the
need for a special Pentium 4 kernel.

The patching code is quite generic and could be used to patch
other instructions (like prefetches or other specific critical
instructions) too.
Thanks to Rusty's in-kernel loader it also works seamlessly for modules.

The patching is done before the other CPUs start, to avoid potential
errata with self-modifying code on SMP systems. It makes no
attempt to automatically handle asymmetric systems (a secondary
CPU having fewer capabilities than the boot CPU). In that
case just boot with "noreplacement".


diff -u linux-2.5.68-gencpu/arch/i386/kernel/setup.c-o linux-2.5.68-gencpu/arch/i386/kernel/setup.c
--- linux-2.5.68-gencpu/arch/i386/kernel/setup.c-o	2003-04-21 22:18:39.000000000 +0200
+++ linux-2.5.68-gencpu/arch/i386/kernel/setup.c	2003-04-21 22:34:40.000000000 +0200
@@ -797,6 +797,57 @@
 		pci_mem_start = low_mem_size;
 }
 
+/* Replace instructions with better alternatives for this CPU type.
+
+   This runs before SMP is initialized to avoid SMP problems with
+   self modifying code. This implies that assymetric systems where
+   APs have less capabilities than the boot processor are not handled. 
+    
+   In this case boot with "noreplacement". */ 
+void __init apply_alternatives(void *start, void *end) 
+{ 
+	struct alt_instr *a; 
+	int diff, i, k;
+
+	for (a = start; a < end; 
+	     a = (void *)ALIGN((unsigned long)(a + 1) + a->instrlen, 4)) { 
+		if (!boot_cpu_has(a->cpuid))
+			continue;
+		BUG_ON(a->replacementlen > a->instrlen); 
+		memcpy(a->instr, a->replacement, a->replacementlen); 
+		diff = a->instrlen - a->replacementlen; 
+		for (i = a->replacementlen; diff > 0; diff -= k, i += k) {
+			static const char *nops[] = {
+				0,
+				"\x90",
+				"\x66\x90",
+				"\x66\x66\x90",
+				"\x66\x66\x66\x90",
+			};
+			k = min_t(int, diff, ARRAY_SIZE(nops)); 
+			memcpy(a->instr + i, nops[k], k); 
+		} 
+	}
+} 
+
+static int no_replacement __initdata = 0; 
+ 
+void __init alternative_instructions(void)
+{
+	extern struct alt_instr __alt_instructions[], __alt_instructions_end[];
+	if (no_replacement) 
+		return;
+	apply_alternatives(__alt_instructions, __alt_instructions_end);
+}
+
+static int __init noreplacement_setup(char *s)
+{ 
+     no_replacement = 1; 
+     return 0; 
+} 
+
+__setup("noreplacement", noreplacement_setup); 
+
 void __init setup_arch(char **cmdline_p)
 {
 	unsigned long max_low_pfn;
diff -u linux-2.5.68-gencpu/arch/i386/kernel/module.c-o linux-2.5.68-gencpu/arch/i386/kernel/module.c
--- linux-2.5.68-gencpu/arch/i386/kernel/module.c-o	2003-04-20 21:24:16.000000000 +0200
@@ -104,9 +104,22 @@
 	return -ENOEXEC;
 }
 
+extern void apply_alternatives(void *start, void *end); 
+
 int module_finalize(const Elf_Ehdr *hdr,
 		    const Elf_Shdr *sechdrs,
 		    struct module *me)
 {
+	Elf_Shdr *s;
+	char *secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
+
+	/* look for .altinstructions to patch */ 
+	for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) { 
+		void *seg; 		
+		if (strcmp(".altinstructions", secstrings + s->sh_name))
+			continue;
+		seg = (void *)s->sh_addr; 
+		apply_alternatives(seg, seg + s->sh_size); 
+	} 	
 	return 0;
 }
diff -u linux-2.5.68-gencpu/arch/i386/vmlinux.lds.S-o linux-2.5.68-gencpu/arch/i386/vmlinux.lds.S
--- linux-2.5.68-gencpu/arch/i386/vmlinux.lds.S-o	2003-04-20 21:24:16.000000000 +0200
+++ linux-2.5.68-gencpu/arch/i386/vmlinux.lds.S	2003-04-20 21:24:22.000000000 +0200
@@ -81,6 +81,10 @@
   __con_initcall_start = .;
   .con_initcall.init : { *(.con_initcall.init) }
   __con_initcall_end = .;
+  . = ALIGN(4);
+  __alt_instructions = .;
+  .altinstructions : { *(.altinstructions) } 
+  __alt_instructions_end = .; 
   . = ALIGN(4096);
   __initramfs_start = .;
   .init.ramfs : { *(.init.ramfs) }
diff -u linux-2.5.68-gencpu/include/asm-i386/system.h-o linux-2.5.68-gencpu/include/asm-i386/system.h
--- linux-2.5.68-gencpu/include/asm-i386/system.h-o	2003-04-20 21:24:16.000000000 +0200
+++ linux-2.5.68-gencpu/include/asm-i386/system.h	2003-04-20 21:24:22.000000000 +0200
@@ -4,6 +4,7 @@
 #include <linux/config.h>
 #include <linux/kernel.h>
 #include <asm/segment.h>
+#include <asm/cpufeature.h>
 #include <linux/bitops.h> /* for LOCK_PREFIX */
 
 #ifdef __KERNEL__
@@ -276,6 +277,37 @@
 /* Compiling for a 386 proper.	Is it worth implementing via cli/sti?  */
 #endif
 
+struct alt_instr { 
+	u8 *instr; 		/* original instruction */
+	u8  cpuid;		/* cpuid bit set for replacement */
+	u8  instrlen;		/* length of original instruction */
+	u8  replacementlen; 	/* length of new instruction, <= instrlen */ 
+	u8  replacement[0];   	/* new instruction */
+}; 
+
+/* 
+ * Alternative instructions for different CPU types or capabilities.
+ * 
+ * This allows to use optimized instructions even on generic binary
+ * kernels.
+ * 
+ * length of oldinstr must be longer or equal the length of newinstr
+ * It can be padded with nops as needed.
+ * 
+ * For non barrier like inlines please define new variants
+ * without volatile and memory clobber.
+ */
+#define alternative(oldinstr, newinstr, feature) 	\
+	asm volatile ("661:\n\t" oldinstr "\n662:\n" 		     \
+		      ".section .altinstructions,\"a\"\n"     	     \
+		      "  .align 4\n"				       \
+		      "  .long 661b\n"            /* label */          \
+		      "  .byte %c0\n"             /* feature bit */    \
+		      "  .byte 662b-661b\n"       /* sourcelen */      \
+		      "  .byte 664f-663f\n"       /* replacementlen */ \
+		      "663:\n\t" newinstr "\n664:\n"   /* replacement */    \
+		      ".previous" :: "i" (feature) : "memory")  
+
 /*
  * Force strict CPU ordering.
  * And yes, this is required on UP too when we're talking
@@ -294,13 +326,15 @@
  * nop for these.
  */
  
-#ifdef CONFIG_X86_SSE2
-#define mb()	asm volatile("mfence" ::: "memory")
-#define rmb()	asm volatile("lfence" ::: "memory")
-#else
-#define mb() 	__asm__ __volatile__ ("lock; addl $0,0(%%esp)": : :"memory")
-#define rmb()	mb()
-#endif
+
+/* 
+ * Actually only lfence would be needed for mb() because all stores done 
+ * by the kernel should be already ordered. But keep a full barrier for now. 
+ */
+
+#define mb() alternative("lock; addl $0,0(%%esp)", "mfence", X86_FEATURE_XMM2)
+#define rmb() alternative("lock; addl $0,0(%%esp)", "lfence", X86_FEATURE_XMM2)
+
 /**
  * read_barrier_depends - Flush all pending reads that subsequents reads
  * depend on.
@@ -356,7 +390,9 @@
 #define read_barrier_depends()	do { } while(0)
 
 #ifdef CONFIG_X86_OOSTORE
-#define wmb() 	__asm__ __volatile__ ("lock; addl $0,0(%%esp)": : :"memory")
+/* Actually there are no OOO store capable CPUs for now that do SSE, 
+   but make it already an possibility. */
+#define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
 #else
 #define wmb()	__asm__ __volatile__ ("": : :"memory")
 #endif
diff -u linux-2.5.68-gencpu/include/asm-i386/bugs.h-o linux-2.5.68-gencpu/include/asm-i386/bugs.h
--- linux-2.5.68-gencpu/include/asm-i386/bugs.h-o	2003-04-20 21:24:16.000000000 +0200
+++ linux-2.5.68-gencpu/include/asm-i386/bugs.h	2003-04-20 21:24:22.000000000 +0200
@@ -200,6 +200,8 @@
 #endif
 }
 
+extern void alternative_instructions(void);
+
 static void __init check_bugs(void)
 {
 	identify_cpu(&boot_cpu_data);
@@ -212,4 +214,5 @@
 	check_hlt();
 	check_popad();
 	system_utsname.machine[1] = '0' + (boot_cpu_data.x86 > 6 ? 6 : boot_cpu_data.x86);
+	alternative_instructions(); 
 }



* Re: [PATCH] Runtime memory barrier patching
  2003-04-21 19:27 Andi Kleen
@ 2003-04-21 19:59 ` Linus Torvalds
  2003-04-21 20:53   ` Andi Kleen
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2003-04-21 19:59 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel


On Mon, 21 Apr 2003, Andi Kleen wrote:
> 
> This patch implements automatic code patching of memory barriers based
> on the CPU capabilities. Normally lock ; addl $0,(%esp) barriers
> are used, but these are a bit slow on the Pentium 4. 

Could you fix this part:

+                       /* fill the overlap with single byte nops */ 
+                       memset(a->instr + a->replacementlen, 0x90, 
+                       a->instrlen - a->replacementlen); 

to use an array of replacements, something like

	#define MAXSIZE 6

	char *nop_sizes[MAXSIZE + 1] = {
		NULL,		// "zero sized nop"? I don't think so
		{ 0x90 },	// simple one-byte no-op.
		{ .. whatever the two-byte NOP is .. }
		...
	};

and then have something like

	replace = a->instrlen - a->replacementlen;
	nop = a->instr + a->replacementlen;
	while (replace) {
		int size = replace;
		if (size > MAXSIZE)
			size = MAXSIZE;
		memcpy(nop, nop_sizes[size], size);
		nop += size;
		replace -= size;
	}

instead? I think it's silly to have multiple single-byte nops, when there
are well-defined multi-byte nops available.

Other than that this looks pretty good.

		Linus



* [PATCH] Runtime memory barrier patching
@ 2003-04-21 19:27 Andi Kleen
  2003-04-21 19:59 ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2003-04-21 19:27 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel


This patch implements automatic code patching of memory barriers based
on the CPU capabilities. Normally lock ; addl $0,(%esp) barriers
are used, but these are a bit slow on the Pentium 4. 

Linus proposed this a few weeks ago after the support for SSE1/SSE2
barriers was introduced. I now got around to implementing it.

The main advantage is that it allows distributors to ship fewer binary
kernels but still get fast kernels. In particular it avoids the
need for a special Pentium 4 kernel.

The patching code is quite generic and could be used to patch
other instructions (like prefetches or other specific critical
instructions) too.
Thanks to Rusty's in-kernel loader it also works seamlessly for modules.

The patching is done before the other CPUs start, to avoid potential
errata with self-modifying code on SMP systems. It makes no
attempt to automatically handle asymmetric systems (a secondary
CPU having fewer capabilities than the boot CPU). In that
case just boot with "noreplacement".

Patch for 2.5.68. Please consider applying.

-Andi

diff -u linux-2.5.68-gencpu/arch/i386/kernel/setup.c-o linux-2.5.68-gencpu/arch/i386/kernel/setup.c
--- linux-2.5.68-gencpu/arch/i386/kernel/setup.c-o	2003-04-20 21:24:16.000000000 +0200
+++ linux-2.5.68-gencpu/arch/i386/kernel/setup.c	2003-04-21 21:08:00.000000000 +0200
@@ -797,6 +797,49 @@
 		pci_mem_start = low_mem_size;
 }
 
+/* Replace instructions with better alternatives for this CPU type.
+
+   This runs before SMP is initialized to avoid SMP problems with
+   self modifying code. This implies that assymetric systems where
+   APs have less capabilities than the boot processor are not handled. 
+    
+   In this case boot with "noreplacement". */ 
+void __init apply_alternatives(void *start, void *end) 
+{ 
+	struct alt_instr *a; 
+
+	for (a = start; a < end; 
+		a = (void *)ALIGN((unsigned long)(a + 1) + a->instrlen, 4)) { 
+		if (!boot_cpu_has(a->cpuid))
+			continue;
+		BUG_ON(a->replacementlen > a->instrlen); 
+		memcpy(a->instr, a->replacement, a->replacementlen); 
+		if (a->replacementlen < a->instrlen) { 
+			/* fill the overlap with single byte nops */ 
+			memset(a->instr + a->replacementlen, 0x90, 
+			a->instrlen - a->replacementlen); 
+		} 
+	}
+} 
+ 
+static int no_replacement __initdata = 0; 
+ 
+void __init alternative_instructions(void)
+{
+	extern struct alt_instr __alt_instructions[], __alt_instructions_end[];
+	if (no_replacement) 
+		return;
+	apply_alternatives(__alt_instructions, __alt_instructions_end);
+}
+
+static int __init noreplacement_setup(char *s)
+{ 
+     no_replacement = 1; 
+     return 0; 
+} 
+
+__setup("noreplacement", noreplacement_setup); 
+
 void __init setup_arch(char **cmdline_p)
 {
 	unsigned long max_low_pfn;
diff -u linux-2.5.68-gencpu/arch/i386/kernel/module.c-o linux-2.5.68-gencpu/arch/i386/kernel/module.c
--- linux-2.5.68-gencpu/arch/i386/kernel/module.c-o	2003-04-20 21:24:16.000000000 +0200
+++ linux-2.5.68-gencpu/arch/i386/kernel/module.c	2003-04-21 21:07:27.000000000 +0200
@@ -104,9 +104,22 @@
 	return -ENOEXEC;
 }
 
+extern void apply_alternatives(void *start, void *end); 
+
 int module_finalize(const Elf_Ehdr *hdr,
 		    const Elf_Shdr *sechdrs,
 		    struct module *me)
 {
+	Elf_Shdr *s;
+	char *secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
+
+	/* look for .altinstructions to patch */ 
+	for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) { 
+		void *seg; 		
+		if (strcmp(".altinstructions", secstrings + s->sh_name))
+			continue;
+		seg = (void *)s->sh_addr; 
+		apply_alternatives(seg, seg + s->sh_size); 
+	} 	
 	return 0;
 }
diff -u linux-2.5.68-gencpu/arch/i386/vmlinux.lds.S-o linux-2.5.68-gencpu/arch/i386/vmlinux.lds.S
--- linux-2.5.68-gencpu/arch/i386/vmlinux.lds.S-o	2003-04-20 21:24:16.000000000 +0200
+++ linux-2.5.68-gencpu/arch/i386/vmlinux.lds.S	2003-04-20 21:24:22.000000000 +0200
@@ -81,6 +81,10 @@
   __con_initcall_start = .;
   .con_initcall.init : { *(.con_initcall.init) }
   __con_initcall_end = .;
+  . = ALIGN(4);
+  __alt_instructions = .;
+  .altinstructions : { *(.altinstructions) } 
+  __alt_instructions_end = .; 
   . = ALIGN(4096);
   __initramfs_start = .;
   .init.ramfs : { *(.init.ramfs) }
diff -u linux-2.5.68-gencpu/include/asm-i386/system.h-o linux-2.5.68-gencpu/include/asm-i386/system.h
--- linux-2.5.68-gencpu/include/asm-i386/system.h-o	2003-04-20 21:24:16.000000000 +0200
+++ linux-2.5.68-gencpu/include/asm-i386/system.h	2003-04-20 21:24:22.000000000 +0200
@@ -4,6 +4,7 @@
 #include <linux/config.h>
 #include <linux/kernel.h>
 #include <asm/segment.h>
+#include <asm/cpufeature.h>
 #include <linux/bitops.h> /* for LOCK_PREFIX */
 
 #ifdef __KERNEL__
@@ -276,6 +277,37 @@
 /* Compiling for a 386 proper.	Is it worth implementing via cli/sti?  */
 #endif
 
+struct alt_instr { 
+	u8 *instr; 		/* original instruction */
+	u8  cpuid;		/* cpuid bit set for replacement */
+	u8  instrlen;		/* length of original instruction */
+	u8  replacementlen; 	/* length of new instruction, <= instrlen */ 
+	u8  replacement[0];   	/* new instruction */
+}; 
+
+/* 
+ * Alternative instructions for different CPU types or capabilities.
+ * 
+ * This allows to use optimized instructions even on generic binary
+ * kernels.
+ * 
+ * length of oldinstr must be longer or equal the length of newinstr
+ * It can be padded with nops as needed.
+ * 
+ * For non barrier like inlines please define new variants
+ * without volatile and memory clobber.
+ */
+#define alternative(oldinstr, newinstr, feature) 	\
+	asm volatile ("661:\n\t" oldinstr "\n662:\n" 		     \
+		      ".section .altinstructions,\"a\"\n"     	     \
+		      "  .align 4\n"				       \
+		      "  .long 661b\n"            /* label */          \
+		      "  .byte %c0\n"             /* feature bit */    \
+		      "  .byte 662b-661b\n"       /* sourcelen */      \
+		      "  .byte 664f-663f\n"       /* replacementlen */ \
+		      "663:\n\t" newinstr "\n664:\n"   /* replacement */    \
+		      ".previous" :: "i" (feature) : "memory")  
+
 /*
  * Force strict CPU ordering.
  * And yes, this is required on UP too when we're talking
@@ -294,13 +326,15 @@
  * nop for these.
  */
  
-#ifdef CONFIG_X86_SSE2
-#define mb()	asm volatile("mfence" ::: "memory")
-#define rmb()	asm volatile("lfence" ::: "memory")
-#else
-#define mb() 	__asm__ __volatile__ ("lock; addl $0,0(%%esp)": : :"memory")
-#define rmb()	mb()
-#endif
+
+/* 
+ * Actually only lfence would be needed for mb() because all stores done 
+ * by the kernel should be already ordered. But keep a full barrier for now. 
+ */
+
+#define mb() alternative("lock; addl $0,0(%%esp)", "mfence", X86_FEATURE_XMM2)
+#define rmb() alternative("lock; addl $0,0(%%esp)", "lfence", X86_FEATURE_XMM2)
+
 /**
  * read_barrier_depends - Flush all pending reads that subsequents reads
  * depend on.
@@ -356,7 +390,9 @@
 #define read_barrier_depends()	do { } while(0)
 
 #ifdef CONFIG_X86_OOSTORE
-#define wmb() 	__asm__ __volatile__ ("lock; addl $0,0(%%esp)": : :"memory")
+/* Actually there are no OOO store capable CPUs for now that do SSE, 
+   but make it already an possibility. */
+#define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
 #else
 #define wmb()	__asm__ __volatile__ ("": : :"memory")
 #endif
diff -u linux-2.5.68-gencpu/include/asm-i386/bugs.h-o linux-2.5.68-gencpu/include/asm-i386/bugs.h
--- linux-2.5.68-gencpu/include/asm-i386/bugs.h-o	2003-04-20 21:24:16.000000000 +0200
+++ linux-2.5.68-gencpu/include/asm-i386/bugs.h	2003-04-20 21:24:22.000000000 +0200
@@ -200,6 +200,8 @@
 #endif
 }
 
+extern void alternative_instructions(void);
+
 static void __init check_bugs(void)
 {
 	identify_cpu(&boot_cpu_data);
@@ -212,4 +214,5 @@
 	check_hlt();
 	check_popad();
 	system_utsname.machine[1] = '0' + (boot_cpu_data.x86 > 6 ? 6 : boot_cpu_data.x86);
+	alternative_instructions(); 
 }



Thread overview: 21+ messages
     [not found] <200304220111.h3M1BEp5004047@hera.kernel.org>
2003-04-22  8:43 ` [PATCH] Runtime memory barrier patching Arjan van de Ven
2003-04-22 11:18   ` Andi Kleen
2003-04-22 16:11     ` Dave Jones
2003-04-22 10:12 Chuck Ebbert
  -- strict thread matches above, loose matches on Subject: below --
2003-04-21 23:41 Chuck Ebbert
2003-04-22  0:04 ` Jamie Lokier
2003-04-21 19:27 Andi Kleen
2003-04-21 19:59 ` Linus Torvalds
2003-04-21 20:53   ` Andi Kleen
2003-04-21 21:04     ` Linus Torvalds
2003-04-21 21:43       ` Ulrich Drepper
2003-04-21 22:05         ` Linus Torvalds
2003-04-21 22:45           ` Andi Kleen
2003-04-21 22:11       ` Andi Kleen
2003-04-21 22:23         ` Linus Torvalds
2003-04-21 22:59           ` Andi Kleen
2003-04-21 23:35     ` Jamie Lokier
2003-04-21 23:46       ` Andi Kleen
2003-04-21 23:57         ` Jamie Lokier
2003-04-22  0:06       ` Linus Torvalds
2003-04-22  0:13         ` Jamie Lokier
