linux-kernel.vger.kernel.org archive mirror
* [PATCH] SMP alternatives
@ 2006-01-24 15:33 Gerd Hoffmann
  2006-01-24 16:22 ` Ben Collins
  2006-01-26 10:22 ` Pavel Machek
  0 siblings, 2 replies; 117+ messages in thread
From: Gerd Hoffmann @ 2006-01-24 15:33 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux kernel mailing list

[-- Attachment #1: Type: text/plain, Size: 2171 bytes --]


  Hi Andrew,

Can you include the patch in -mm to give it some testing, and then maybe
merge it for 2.6.17?  I last posted it in December, and nobody has
complained about the most recent version since.  The patch is almost
unmodified; I only had to add a small chunk due to the mutex merge.
The description is below; the patch (against the -rc1-git4 snapshot) is
attached.

cheers,

  Gerd

=====================================================================

This patch implements SMP alternatives, i.e. switching at runtime
between different code versions for UP and SMP.  The code can patch both
SMP->UP and UP->SMP.  The UP->SMP case is useful for CPU hotplug.

With CONFIG_HOTPLUG_CPU enabled the code switches to UP at boot
time and when the number of CPUs goes down to 1, and switches to
SMP when the number of CPUs goes up to 2.

Without CONFIG_HOTPLUG_CPU or on non-SMP-capable systems the code
is patched once at boot time (if needed) and the tables are
released afterwards.
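
To illustrate the switching policy described above, here is a minimal
userspace sketch.  It is not code from the patch: the names alt_state,
alt_boot() and alt_cpu_count_changed() are made up for illustration, the
"smp-alt-boot" command line option is ignored, and unlike
alternatives_smp_switch() the sketch skips a switch when the mode would
not change.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the UP/SMP switching policy (illustration only). */
enum alt_mode { ALT_UP, ALT_SMP };

struct alt_state {
	bool once;          /* patch-once mode: tables are freed after boot */
	enum alt_mode mode; /* which code variant is currently patched in */
	int switches;       /* number of runtime switches performed */
};

/* Boot-time decision: 'hotplug' stands in for CONFIG_HOTPLUG_CPU. */
static void alt_boot(struct alt_state *s, bool hotplug, int possible_cpus)
{
	/* Without hotplug, or with only one possible CPU, patch once. */
	s->once = !hotplug || possible_cpus < 2;
	/* In patch-once mode a multi-CPU box keeps the compiled-in SMP
	 * code; every other case starts out with the UP variant. */
	s->mode = (s->once && possible_cpus > 1) ? ALT_SMP : ALT_UP;
	s->switches = 0;
}

/* Called after a CPU came online or went offline. */
static void alt_cpu_count_changed(struct alt_state *s, int online_cpus)
{
	enum alt_mode want = online_cpus > 1 ? ALT_SMP : ALT_UP;

	assert(online_cpus >= 1);
	if (s->once || want == s->mode)
		return; /* tables are gone, or nothing to do */
	s->mode = want;
	s->switches++;
}
```

With hotplug enabled and four possible CPUs, the model boots in UP mode,
goes to SMP when the second CPU comes online, and returns to UP when the
count drops back to one.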

The changes in detail:

  * The current alternatives bits are moved to a separate file, and
    the SMP alternatives code is added there.

  * The patch adds some new ELF sections to the kernel:
    .smp_altinstructions
        like .altinstructions, also contains a list
        of alt_instr structs.
    .smp_altinstr_replacement
        like .altinstr_replacement, but also has some space to
        save the original instruction before replacing it.
    .smp_locks
        list of pointers to lock prefixes which can be nop'ed
        out on UP.
    The first two are used to replace more complex instruction
    sequences such as spinlocks and semaphores.  It would be possible
    to handle the lock prefixes that way as well, but handling them
    as a special case makes the tables much smaller.

  * The sections are page-aligned and padded up to page size, so they
    can be freed if they are not needed.

  * The code that releases init pages is split out into a separate
    function, which is also used to release the ELF sections if they
    are unused.
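
The two table mechanisms described in the list above can be sketched in
plain userspace C.  This is an illustration under simplifying
assumptions, not the kernel code: patch_lock_sites(), struct alt_site and
the site_*() helpers are hypothetical names, byte arrays stand in for
kernel text, and the padding uses one-byte nops only, whereas the patch
picks multi-byte nops from a per-CPU-type table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOCK_PREFIX_BYTE 0xf0 /* x86 lock prefix */
#define NOP_BYTE         0x90 /* x86 one-byte nop */

/* .smp_locks idea: a flat list of addresses.  Write 'byte' to every
 * listed address inside [text, text_end) and return how many were
 * patched.  Addresses outside the range (e.g. init code) are skipped. */
static size_t patch_lock_sites(uint8_t **start, uint8_t **end,
			       uint8_t *text, uint8_t *text_end,
			       uint8_t byte)
{
	size_t n = 0;

	for (uint8_t **p = start; p < end; p++) {
		if (*p < text || *p >= text_end)
			continue;
		**p = byte;
		n++;
	}
	return n;
}

/* .smp_altinstructions idea: the replacement buffer holds the UP code
 * followed by instrlen spare bytes for saving the original SMP code. */
struct alt_site {
	uint8_t *instr;       /* live code to patch */
	uint8_t *replacement; /* UP code + save space */
	uint8_t instrlen;
	uint8_t replacementlen;
};

/* Save the original (SMP) bytes behind the UP replacement. */
static void site_save(struct alt_site *a)
{
	memcpy(a->replacement + a->replacementlen, a->instr, a->instrlen);
}

/* Patch in the UP code and pad the remainder with nops. */
static void site_to_up(struct alt_site *a)
{
	assert(a->replacementlen <= a->instrlen);
	memcpy(a->instr, a->replacement, a->replacementlen);
	memset(a->instr + a->replacementlen, NOP_BYTE,
	       a->instrlen - a->replacementlen);
}

/* Restore the saved SMP bytes. */
static void site_to_smp(struct alt_site *a)
{
	memcpy(a->instr, a->replacement + a->replacementlen, a->instrlen);
}
```

A lock-prefix entry needs only one pointer, while a full alt_site entry
carries two pointers, two lengths and the save space, which is why
keeping the lock prefixes in their own list makes the tables smaller.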

-- 
Gerd 'just married' Hoffmann <kraxel@suse.de>
I'm the hacker formerly known as Gerd Knorr.
http://www.suse.de/~kraxel/just-married.jpeg

[-- Attachment #2: smp-alternatives --]
[-- Type: text/plain, Size: 39966 bytes --]

Signed-off-by: Gerd Knorr <kraxel@suse.de>
Index: vanilla-2.6.15/arch/i386/kernel/Makefile
===================================================================
--- vanilla-2.6.15.orig/arch/i386/kernel/Makefile
+++ vanilla-2.6.15/arch/i386/kernel/Makefile
@@ -7,7 +7,7 @@ extra-y := head.o init_task.o vmlinux.ld
 obj-y	:= process.o semaphore.o signal.o entry.o traps.o irq.o \
 		ptrace.o time.o ioport.o ldt.o setup.o i8259.o sys_i386.o \
 		pci-dma.o i386_ksyms.o i387.o dmi_scan.o bootflag.o \
-		quirks.o i8237.o
+		quirks.o i8237.o alternative.o
 
 obj-y				+= cpu/
 obj-y				+= timers/
Index: vanilla-2.6.15/arch/i386/kernel/alternative.c
===================================================================
--- /dev/null
+++ vanilla-2.6.15/arch/i386/kernel/alternative.c
@@ -0,0 +1,320 @@
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/list.h>
+#include <asm/alternative.h>
+
+#define DEBUG 0
+#if DEBUG
+# define DPRINTK(fmt, args...) printk(fmt, args)
+#else
+# define DPRINTK(fmt, args...)
+#endif
+
+/* Use inline assembly to define this because the nops are defined 
+   as inline assembly strings in the include files and we cannot 
+   get them easily into strings. */
+asm("\t.data\nintelnops: " 
+    GENERIC_NOP1 GENERIC_NOP2 GENERIC_NOP3 GENERIC_NOP4 GENERIC_NOP5 GENERIC_NOP6
+    GENERIC_NOP7 GENERIC_NOP8); 
+asm("\t.data\nk8nops: " 
+    K8_NOP1 K8_NOP2 K8_NOP3 K8_NOP4 K8_NOP5 K8_NOP6
+    K8_NOP7 K8_NOP8); 
+asm("\t.data\nk7nops: " 
+    K7_NOP1 K7_NOP2 K7_NOP3 K7_NOP4 K7_NOP5 K7_NOP6
+    K7_NOP7 K7_NOP8); 
+    
+extern unsigned char intelnops[], k8nops[], k7nops[];
+static unsigned char *intel_nops[ASM_NOP_MAX+1] = { 
+     NULL,
+     intelnops,
+     intelnops + 1,
+     intelnops + 1 + 2,
+     intelnops + 1 + 2 + 3,
+     intelnops + 1 + 2 + 3 + 4,
+     intelnops + 1 + 2 + 3 + 4 + 5,
+     intelnops + 1 + 2 + 3 + 4 + 5 + 6,
+     intelnops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
+}; 
+static unsigned char *k8_nops[ASM_NOP_MAX+1] = { 
+     NULL,
+     k8nops,
+     k8nops + 1,
+     k8nops + 1 + 2,
+     k8nops + 1 + 2 + 3,
+     k8nops + 1 + 2 + 3 + 4,
+     k8nops + 1 + 2 + 3 + 4 + 5,
+     k8nops + 1 + 2 + 3 + 4 + 5 + 6,
+     k8nops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
+}; 
+static unsigned char *k7_nops[ASM_NOP_MAX+1] = { 
+     NULL,
+     k7nops,
+     k7nops + 1,
+     k7nops + 1 + 2,
+     k7nops + 1 + 2 + 3,
+     k7nops + 1 + 2 + 3 + 4,
+     k7nops + 1 + 2 + 3 + 4 + 5,
+     k7nops + 1 + 2 + 3 + 4 + 5 + 6,
+     k7nops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
+}; 
+static struct nop { 
+     int cpuid; 
+     unsigned char **noptable; 
+} noptypes[] = { 
+     { X86_FEATURE_K8, k8_nops }, 
+     { X86_FEATURE_K7, k7_nops }, 
+     { -1, NULL }
+}; 
+
+
+extern struct alt_instr __alt_instructions[], __alt_instructions_end[];
+extern struct alt_instr __smp_alt_instructions[], __smp_alt_instructions_end[];
+extern u8 *__smp_locks[], *__smp_locks_end[];
+
+extern u8 _text[], _etext[];
+extern u8 __smp_alt_begin[], __smp_alt_end[];
+
+
+static unsigned char **find_nop_table(void)
+{
+	unsigned char **noptable = intel_nops;
+	int i;
+
+	for (i = 0; noptypes[i].cpuid >= 0; i++) { 
+		if (boot_cpu_has(noptypes[i].cpuid)) { 
+			noptable = noptypes[i].noptable;
+			break;
+		}
+	}
+	return noptable;
+}
+
+/* Replace instructions with better alternatives for this CPU type.
+   This runs before SMP is initialized to avoid SMP problems with
+   self-modifying code. This implies that asymmetric systems where
+   APs have fewer capabilities than the boot processor are not handled.
+   Tough. Make sure you disable such features by hand. */
+
+void apply_alternatives(struct alt_instr *start, struct alt_instr *end)
+{ 
+	unsigned char **noptable = find_nop_table();
+	struct alt_instr *a; 
+	int diff, i, k;
+
+	DPRINTK("%s: alt table %p -> %p\n", __FUNCTION__, start, end);
+	for (a = start; a < end; a++) { 
+		BUG_ON(a->replacementlen > a->instrlen); 
+		if (!boot_cpu_has(a->cpuid))
+			continue;
+		memcpy(a->instr, a->replacement, a->replacementlen); 
+		diff = a->instrlen - a->replacementlen; 
+		/* Pad the rest with nops */
+		for (i = a->replacementlen; diff > 0; diff -= k, i += k) {
+			k = diff;
+			if (k > ASM_NOP_MAX)
+				k = ASM_NOP_MAX;
+			memcpy(a->instr + i, noptable[k], k); 
+		} 
+	}
+} 
+
+static void alternatives_smp_save(struct alt_instr *start, struct alt_instr *end)
+{
+	struct alt_instr *a;
+
+	DPRINTK("%s: alt table %p-%p\n", __FUNCTION__, start, end);
+	for (a = start; a < end; a++) {
+		memcpy(a->replacement + a->replacementlen,
+		       a->instr,
+		       a->instrlen);
+	}
+}
+
+static void alternatives_smp_apply(struct alt_instr *start, struct alt_instr *end)
+{
+	struct alt_instr *a;
+
+	for (a = start; a < end; a++) {
+		memcpy(a->instr,
+		       a->replacement + a->replacementlen,
+		       a->instrlen);
+	}
+}
+
+static void alternatives_smp_lock(u8 **start, u8 **end, u8 *text, u8 *text_end)
+{
+	u8 **ptr;
+
+	for (ptr = start; ptr < end; ptr++) {
+		if (*ptr < text)
+			continue;
+		if (*ptr >= text_end)
+			continue;
+		**ptr = 0xf0; /* lock prefix */
+	}
+}
+
+static void alternatives_smp_unlock(u8 **start, u8 **end, u8 *text, u8 *text_end)
+{
+	unsigned char **noptable = find_nop_table();
+	u8 **ptr;
+
+	for (ptr = start; ptr < end; ptr++) {
+		if (*ptr < text)
+			continue;
+		if (*ptr >= text_end)
+			continue;
+		**ptr = noptable[1][0]; /* one-byte nop */
+	}
+}
+
+struct smp_alt_module {
+	/* owning module (NULL for the core kernel) and its name */
+	struct module    *mod;
+	char             *name;
+
+	/* ptrs to lock prefixes */
+	u8               **locks;
+	u8               **locks_end;
+
+	/* .text segment, needed to avoid patching init code ;) */
+	u8               *text;
+	u8               *text_end;
+
+	struct list_head next;
+};
+static LIST_HEAD(smp_alt_modules);
+static DEFINE_SPINLOCK(smp_alt);
+
+static int smp_alt_once = 0;
+static int __init bootonly(char *str)
+{
+	smp_alt_once = 1;
+	return 1;
+}
+__setup("smp-alt-boot", bootonly);
+
+void alternatives_smp_module_add(struct module *mod, char *name,
+				 void *locks, void *locks_end,
+				 void *text,  void *text_end)
+{
+	struct smp_alt_module *smp;
+	unsigned long flags;
+
+	if (smp_alt_once) {
+		if (boot_cpu_has(X86_FEATURE_UP))
+			alternatives_smp_unlock(locks, locks_end,
+						text, text_end);
+		return;
+	}
+
+	smp = kzalloc(sizeof(*smp), GFP_KERNEL);
+	if (NULL == smp)
+		return; /* we'll run the (safe but slow) SMP code then ... */
+
+	smp->mod       = mod;
+	smp->name      = name;
+	smp->locks     = locks;
+	smp->locks_end = locks_end;
+	smp->text      = text;
+	smp->text_end  = text_end;
+	DPRINTK("%s: locks %p -> %p, text %p -> %p, name %s\n",
+		__FUNCTION__, smp->locks, smp->locks_end,
+		smp->text, smp->text_end, smp->name);
+
+	spin_lock_irqsave(&smp_alt, flags);
+	list_add_tail(&smp->next, &smp_alt_modules);
+	if (boot_cpu_has(X86_FEATURE_UP))
+		alternatives_smp_unlock(smp->locks, smp->locks_end,
+					smp->text, smp->text_end);
+	spin_unlock_irqrestore(&smp_alt, flags);
+}
+
+void alternatives_smp_module_del(struct module *mod)
+{
+	struct smp_alt_module *item;
+	unsigned long flags;
+
+	if (smp_alt_once)
+		return;
+
+	spin_lock_irqsave(&smp_alt, flags);
+	list_for_each_entry(item, &smp_alt_modules, next) {
+		if (mod != item->mod)
+			continue;
+		list_del(&item->next);
+		spin_unlock_irqrestore(&smp_alt, flags);
+		DPRINTK("%s: %s\n", __FUNCTION__, item->name);
+		kfree(item);
+		return;
+	}
+	spin_unlock_irqrestore(&smp_alt, flags);
+}
+
+void alternatives_smp_switch(int smp) 
+{
+	struct smp_alt_module *mod;
+	unsigned long flags;
+
+	if (smp_alt_once)
+		return;
+	BUG_ON(!smp && (num_online_cpus() > 1));
+
+	spin_lock_irqsave(&smp_alt, flags);
+	if (smp) {
+		printk(KERN_INFO "SMP alternatives: switching to SMP code\n");
+		clear_bit(X86_FEATURE_UP, boot_cpu_data.x86_capability);
+		alternatives_smp_apply(__smp_alt_instructions,
+				       __smp_alt_instructions_end);
+		list_for_each_entry(mod, &smp_alt_modules, next)
+			alternatives_smp_lock(mod->locks, mod->locks_end,
+					      mod->text, mod->text_end);
+	} else {
+		printk(KERN_INFO "SMP alternatives: switching to UP code\n");
+		set_bit(X86_FEATURE_UP, boot_cpu_data.x86_capability);
+		apply_alternatives(__smp_alt_instructions,
+				   __smp_alt_instructions_end);
+		list_for_each_entry(mod, &smp_alt_modules, next)
+			alternatives_smp_unlock(mod->locks, mod->locks_end,
+						mod->text, mod->text_end);
+	}
+	spin_unlock_irqrestore(&smp_alt, flags);
+} 
+
+extern void free_init_pages(char *what, unsigned long begin, unsigned long end);
+
+void __init alternative_instructions(void)
+{
+	apply_alternatives(__alt_instructions, __alt_instructions_end);
+
+	/* switch to patch-once-at-boottime-only mode and free the
+	 * tables in case we know the number of CPUs will never ever
+	 * change */
+#ifdef CONFIG_HOTPLUG_CPU
+	if (num_possible_cpus() < 2)
+		smp_alt_once = 1;
+#else
+	smp_alt_once = 1;
+#endif
+	
+	if (smp_alt_once) {
+		if (1 == num_possible_cpus()) {
+			printk(KERN_INFO "SMP alternatives: switching to UP code\n");
+			set_bit(X86_FEATURE_UP, boot_cpu_data.x86_capability);
+			apply_alternatives(__smp_alt_instructions,
+					   __smp_alt_instructions_end);
+			alternatives_smp_unlock(__smp_locks, __smp_locks_end,
+						_text, _etext);
+		}
+		free_init_pages("SMP alternatives",
+				(unsigned long)__smp_alt_begin,
+				(unsigned long)__smp_alt_end);
+	} else {
+		alternatives_smp_save(__smp_alt_instructions,
+				      __smp_alt_instructions_end);
+		alternatives_smp_module_add(NULL, "core kernel",
+					    __smp_locks, __smp_locks_end,
+					    _text, _etext);
+		alternatives_smp_switch(0);
+	}
+}
Index: vanilla-2.6.15/arch/i386/kernel/module.c
===================================================================
--- vanilla-2.6.15.orig/arch/i386/kernel/module.c
+++ vanilla-2.6.15/arch/i386/kernel/module.c
@@ -104,26 +104,38 @@ int apply_relocate_add(Elf32_Shdr *sechd
 	return -ENOEXEC;
 }
 
-extern void apply_alternatives(void *start, void *end); 
-
 int module_finalize(const Elf_Ehdr *hdr,
 		    const Elf_Shdr *sechdrs,
 		    struct module *me)
 {
-	const Elf_Shdr *s;
+	const Elf_Shdr *s, *text = NULL, *alt = NULL, *locks = NULL;
 	char *secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
 
-	/* look for .altinstructions to patch */ 
 	for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) { 
-		void *seg; 		
-		if (strcmp(".altinstructions", secstrings + s->sh_name))
-			continue;
-		seg = (void *)s->sh_addr; 
-		apply_alternatives(seg, seg + s->sh_size); 
-	} 	
+		if (!strcmp(".text", secstrings + s->sh_name))
+			text = s;
+		if (!strcmp(".altinstructions", secstrings + s->sh_name))
+			alt = s;
+		if (!strcmp(".smp_locks", secstrings + s->sh_name))
+			locks = s;
+	}
+
+	if (alt) {
+		/* patch .altinstructions */ 
+		void *aseg = (void *)alt->sh_addr;
+		apply_alternatives(aseg, aseg + alt->sh_size);
+	}
+	if (locks && text) {
+		void *lseg = (void *)locks->sh_addr;
+		void *tseg = (void *)text->sh_addr;
+		alternatives_smp_module_add(me, me->name,
+					    lseg, lseg + locks->sh_size,
+					    tseg, tseg + text->sh_size);
+	}
 	return 0;
 }
 
 void module_arch_cleanup(struct module *mod)
 {
+	alternatives_smp_module_del(mod);
 }
Index: vanilla-2.6.15/arch/i386/kernel/semaphore.c
===================================================================
--- vanilla-2.6.15.orig/arch/i386/kernel/semaphore.c
+++ vanilla-2.6.15/arch/i386/kernel/semaphore.c
@@ -110,11 +110,11 @@ asm(
 ".align	4\n"
 ".globl	__write_lock_failed\n"
 "__write_lock_failed:\n\t"
-	LOCK "addl	$" RW_LOCK_BIAS_STR ",(%eax)\n"
+	LOCK_PREFIX "addl	$" RW_LOCK_BIAS_STR ",(%eax)\n"
 "1:	rep; nop\n\t"
 	"cmpl	$" RW_LOCK_BIAS_STR ",(%eax)\n\t"
 	"jne	1b\n\t"
-	LOCK "subl	$" RW_LOCK_BIAS_STR ",(%eax)\n\t"
+	LOCK_PREFIX "subl	$" RW_LOCK_BIAS_STR ",(%eax)\n\t"
 	"jnz	__write_lock_failed\n\t"
 	"ret"
 );
@@ -124,11 +124,11 @@ asm(
 ".align	4\n"
 ".globl	__read_lock_failed\n"
 "__read_lock_failed:\n\t"
-	LOCK "incl	(%eax)\n"
+	LOCK_PREFIX "incl	(%eax)\n"
 "1:	rep; nop\n\t"
 	"cmpl	$1,(%eax)\n\t"
 	"js	1b\n\t"
-	LOCK "decl	(%eax)\n\t"
+	LOCK_PREFIX "decl	(%eax)\n\t"
 	"js	__read_lock_failed\n\t"
 	"ret"
 );
Index: vanilla-2.6.15/arch/i386/kernel/setup.c
===================================================================
--- vanilla-2.6.15.orig/arch/i386/kernel/setup.c
+++ vanilla-2.6.15/arch/i386/kernel/setup.c
@@ -1377,101 +1377,6 @@ static void __init register_memory(void)
 		pci_mem_start, gapstart, gapsize);
 }
 
-/* Use inline assembly to define this because the nops are defined 
-   as inline assembly strings in the include files and we cannot 
-   get them easily into strings. */
-asm("\t.data\nintelnops: " 
-    GENERIC_NOP1 GENERIC_NOP2 GENERIC_NOP3 GENERIC_NOP4 GENERIC_NOP5 GENERIC_NOP6
-    GENERIC_NOP7 GENERIC_NOP8); 
-asm("\t.data\nk8nops: " 
-    K8_NOP1 K8_NOP2 K8_NOP3 K8_NOP4 K8_NOP5 K8_NOP6
-    K8_NOP7 K8_NOP8); 
-asm("\t.data\nk7nops: " 
-    K7_NOP1 K7_NOP2 K7_NOP3 K7_NOP4 K7_NOP5 K7_NOP6
-    K7_NOP7 K7_NOP8); 
-    
-extern unsigned char intelnops[], k8nops[], k7nops[];
-static unsigned char *intel_nops[ASM_NOP_MAX+1] = { 
-     NULL,
-     intelnops,
-     intelnops + 1,
-     intelnops + 1 + 2,
-     intelnops + 1 + 2 + 3,
-     intelnops + 1 + 2 + 3 + 4,
-     intelnops + 1 + 2 + 3 + 4 + 5,
-     intelnops + 1 + 2 + 3 + 4 + 5 + 6,
-     intelnops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
-}; 
-static unsigned char *k8_nops[ASM_NOP_MAX+1] = { 
-     NULL,
-     k8nops,
-     k8nops + 1,
-     k8nops + 1 + 2,
-     k8nops + 1 + 2 + 3,
-     k8nops + 1 + 2 + 3 + 4,
-     k8nops + 1 + 2 + 3 + 4 + 5,
-     k8nops + 1 + 2 + 3 + 4 + 5 + 6,
-     k8nops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
-}; 
-static unsigned char *k7_nops[ASM_NOP_MAX+1] = { 
-     NULL,
-     k7nops,
-     k7nops + 1,
-     k7nops + 1 + 2,
-     k7nops + 1 + 2 + 3,
-     k7nops + 1 + 2 + 3 + 4,
-     k7nops + 1 + 2 + 3 + 4 + 5,
-     k7nops + 1 + 2 + 3 + 4 + 5 + 6,
-     k7nops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
-}; 
-static struct nop { 
-     int cpuid; 
-     unsigned char **noptable; 
-} noptypes[] = { 
-     { X86_FEATURE_K8, k8_nops }, 
-     { X86_FEATURE_K7, k7_nops }, 
-     { -1, NULL }
-}; 
-
-/* Replace instructions with better alternatives for this CPU type.
-
-   This runs before SMP is initialized to avoid SMP problems with
-   self modifying code. This implies that assymetric systems where
-   APs have less capabilities than the boot processor are not handled. 
-   Tough. Make sure you disable such features by hand. */ 
-void apply_alternatives(void *start, void *end) 
-{ 
-	struct alt_instr *a; 
-	int diff, i, k;
-        unsigned char **noptable = intel_nops; 
-	for (i = 0; noptypes[i].cpuid >= 0; i++) { 
-		if (boot_cpu_has(noptypes[i].cpuid)) { 
-			noptable = noptypes[i].noptable;
-			break;
-		}
-	} 
-	for (a = start; (void *)a < end; a++) { 
-		if (!boot_cpu_has(a->cpuid))
-			continue;
-		BUG_ON(a->replacementlen > a->instrlen); 
-		memcpy(a->instr, a->replacement, a->replacementlen); 
-		diff = a->instrlen - a->replacementlen; 
-		/* Pad the rest with nops */
-		for (i = a->replacementlen; diff > 0; diff -= k, i += k) {
-			k = diff;
-			if (k > ASM_NOP_MAX)
-				k = ASM_NOP_MAX;
-			memcpy(a->instr + i, noptable[k], k); 
-		} 
-	}
-} 
-
-void __init alternative_instructions(void)
-{
-	extern struct alt_instr __alt_instructions[], __alt_instructions_end[];
-	apply_alternatives(__alt_instructions, __alt_instructions_end);
-}
-
 static char * __init machine_specific_memory_setup(void);
 
 #ifdef CONFIG_MCA
Index: vanilla-2.6.15/arch/i386/kernel/smpboot.c
===================================================================
--- vanilla-2.6.15.orig/arch/i386/kernel/smpboot.c
+++ vanilla-2.6.15/arch/i386/kernel/smpboot.c
@@ -909,6 +909,7 @@ static int __devinit do_boot_cpu(int api
 	}
 
 	++cpucount;
+	alternatives_smp_switch(1);
 
 	/*
 	 * We can't use kernel_thread since we must avoid to
@@ -1368,6 +1369,8 @@ void __cpu_die(unsigned int cpu)
 		/* They ack this in play_dead by setting CPU_DEAD */
 		if (per_cpu(cpu_state, cpu) == CPU_DEAD) {
 			printk ("CPU %d is now offline\n", cpu);
+			if (1 == num_online_cpus())
+				alternatives_smp_switch(0);
 			return;
 		}
 		msleep(100);
Index: vanilla-2.6.15/arch/i386/kernel/vmlinux.lds.S
===================================================================
--- vanilla-2.6.15.orig/arch/i386/kernel/vmlinux.lds.S
+++ vanilla-2.6.15/arch/i386/kernel/vmlinux.lds.S
@@ -68,6 +68,26 @@ SECTIONS
 	*(.data.init_task)
   }
 
+  /* might get freed after init */
+  . = ALIGN(4096);
+  __smp_alt_begin = .;
+  __smp_alt_instructions = .;
+  .smp_altinstructions : AT(ADDR(.smp_altinstructions) - LOAD_OFFSET) {
+	*(.smp_altinstructions)
+  }
+  __smp_alt_instructions_end = .; 
+  . = ALIGN(4);
+  __smp_locks = .;
+  .smp_locks : AT(ADDR(.smp_locks) - LOAD_OFFSET) {
+	*(.smp_locks)
+  }
+  __smp_locks_end = .; 
+  .smp_altinstr_replacement : AT(ADDR(.smp_altinstr_replacement) - LOAD_OFFSET) {
+	*(.smp_altinstr_replacement)
+  }
+  . = ALIGN(4096);
+  __smp_alt_end = .;
+
   /* will be freed after init */
   . = ALIGN(4096);		/* Init code and data */
   __init_begin = .;
Index: vanilla-2.6.15/arch/i386/mm/init.c
===================================================================
--- vanilla-2.6.15.orig/arch/i386/mm/init.c
+++ vanilla-2.6.15/arch/i386/mm/init.c
@@ -720,21 +720,6 @@ static int noinline do_test_wp_bit(void)
 	return flag;
 }
 
-void free_initmem(void)
-{
-	unsigned long addr;
-
-	addr = (unsigned long)(&__init_begin);
-	for (; addr < (unsigned long)(&__init_end); addr += PAGE_SIZE) {
-		ClearPageReserved(virt_to_page(addr));
-		set_page_count(virt_to_page(addr), 1);
-		memset((void *)addr, 0xcc, PAGE_SIZE);
-		free_page(addr);
-		totalram_pages++;
-	}
-	printk (KERN_INFO "Freeing unused kernel memory: %dk freed\n", (__init_end - __init_begin) >> 10);
-}
-
 #ifdef CONFIG_DEBUG_RODATA
 
 extern char __start_rodata, __end_rodata;
@@ -758,17 +743,31 @@ void mark_rodata_ro(void)
 }
 #endif
 
+void free_init_pages(char *what, unsigned long begin, unsigned long end)
+{
+	unsigned long addr;
+
+	for (addr = begin; addr < end; addr += PAGE_SIZE) {
+		ClearPageReserved(virt_to_page(addr));
+		set_page_count(virt_to_page(addr), 1);
+		memset((void *)addr, 0xcc, PAGE_SIZE);
+		free_page(addr);
+		totalram_pages++;
+	}
+	printk(KERN_INFO "Freeing %s: %ldk freed\n", what, (end - begin) >> 10);
+}
+
+void free_initmem(void)
+{
+	free_init_pages("unused kernel memory",
+			(unsigned long)(&__init_begin),
+			(unsigned long)(&__init_end));
+}
 
 #ifdef CONFIG_BLK_DEV_INITRD
 void free_initrd_mem(unsigned long start, unsigned long end)
 {
-	if (start < end)
-		printk (KERN_INFO "Freeing initrd memory: %ldk freed\n", (end - start) >> 10);
-	for (; start < end; start += PAGE_SIZE) {
-		ClearPageReserved(virt_to_page(start));
-		set_page_count(virt_to_page(start), 1);
-		free_page(start);
-		totalram_pages++;
-	}
+	free_init_pages("initrd memory", start, end);
 }
 #endif
+
Index: vanilla-2.6.15/include/asm-i386/alternative.h
===================================================================
--- /dev/null
+++ vanilla-2.6.15/include/asm-i386/alternative.h
@@ -0,0 +1,129 @@
+#ifndef _I386_ALTERNATIVE_H
+#define _I386_ALTERNATIVE_H
+
+#ifdef __KERNEL__
+
+struct alt_instr { 
+	u8 *instr; 		/* original instruction */
+	u8 *replacement;
+	u8  cpuid;		/* cpuid bit set for replacement */
+	u8  instrlen;		/* length of original instruction */
+	u8  replacementlen; 	/* length of new instruction, <= instrlen */ 
+	u8  pad;
+}; 
+
+extern void apply_alternatives(struct alt_instr *start, struct alt_instr *end);
+
+struct module;
+extern void alternatives_smp_module_add(struct module *mod, char *name,
+					void *locks, void *locks_end,
+					void *text, void *text_end);
+extern void alternatives_smp_module_del(struct module *mod);
+extern void alternatives_smp_switch(int smp);
+
+#endif
+
+/* 
+ * Alternative instructions for different CPU types or capabilities.
+ * 
+ * This allows the use of optimized instructions even on generic binary
+ * kernels.
+ * 
+ * The length of oldinstr must be greater than or equal to the length
+ * of newinstr.  It can be padded with nops as needed.
+ * 
+ * For non-barrier-like inlines please define new variants
+ * without volatile and memory clobber.
+ */
+#define alternative(oldinstr, newinstr, feature) 	\
+	asm volatile ("661:\n\t" oldinstr "\n662:\n" 		     \
+		      ".section .altinstructions,\"a\"\n"     	     \
+		      "  .align 4\n"				       \
+		      "  .long 661b\n"            /* label */          \
+		      "  .long 663f\n"		  /* new instruction */ 	\
+		      "  .byte %c0\n"             /* feature bit */    \
+		      "  .byte 662b-661b\n"       /* sourcelen */      \
+		      "  .byte 664f-663f\n"       /* replacementlen */ \
+		      ".previous\n"						\
+		      ".section .altinstr_replacement,\"ax\"\n"			\
+		      "663:\n\t" newinstr "\n664:\n"   /* replacement */    \
+		      ".previous" :: "i" (feature) : "memory")  
+
+/*
+ * Alternative inline assembly with input.
+ * 
+ * Peculiarities:
+ * No memory clobber here. 
+ * Argument numbers start with 1.
+ * Best is to use constraints that are fixed size (like (%1) ... "r")
+ * If you use variable sized constraints like "m" or "g" in the 
+ * replacement make sure to pad to the worst case length.
+ */
+#define alternative_input(oldinstr, newinstr, feature, input...)		\
+	asm volatile ("661:\n\t" oldinstr "\n662:\n"				\
+		      ".section .altinstructions,\"a\"\n"			\
+		      "  .align 4\n"						\
+		      "  .long 661b\n"            /* label */			\
+		      "  .long 663f\n"		  /* new instruction */ 	\
+		      "  .byte %c0\n"             /* feature bit */		\
+		      "  .byte 662b-661b\n"       /* sourcelen */		\
+		      "  .byte 664f-663f\n"       /* replacementlen */ 		\
+		      ".previous\n"						\
+		      ".section .altinstr_replacement,\"ax\"\n"			\
+		      "663:\n\t" newinstr "\n664:\n"   /* replacement */ 	\
+		      ".previous" :: "i" (feature), ##input)
+
+/*
+ * Alternative inline assembly for SMP.
+ *
+ * alternative_smp() takes two versions (SMP first, UP second) and is
+ * for more complex stuff such as spinlocks.
+ *
+ * The LOCK_PREFIX macro defined here replaces the LOCK and
+ * LOCK_PREFIX macros used everywhere in the source tree.
+ *
+ * SMP alternatives use the same data structures as the other
+ * alternatives and the X86_FEATURE_UP flag to indicate the case of a
+ * UP system running an SMP kernel.  The existing apply_alternatives()
+ * works fine for patching an SMP kernel for UP.
+ * 
+ * The SMP alternative tables can be kept after boot and contain both
+ * UP and SMP versions of the instructions to allow switching back to
+ * SMP at runtime, when hotplugging in a new CPU, which is especially
+ * useful in virtualized environments.
+ *
+ * The very common lock prefix is handled as special case in a
+ * separate table which is a pure address list without replacement ptr
+ * and size information.  That keeps the table sizes small.
+ */ 
+
+#ifdef CONFIG_SMP
+#define alternative_smp(smpinstr, upinstr, args...) 	\
+	asm volatile ("661:\n\t" smpinstr "\n662:\n" 		     \
+		      ".section .smp_altinstructions,\"a\"\n"          \
+		      "  .align 4\n"				       \
+		      "  .long 661b\n"            /* label */          \
+		      "  .long 663f\n"		  /* new instruction */ 	\
+		      "  .byte 0x68\n"            /* X86_FEATURE_UP */    \
+		      "  .byte 662b-661b\n"       /* sourcelen */      \
+		      "  .byte 664f-663f\n"       /* replacementlen */ \
+		      ".previous\n"						\
+		      ".section .smp_altinstr_replacement,\"awx\"\n"   		\
+		      "663:\n\t" upinstr "\n"     /* replacement */    \
+		      "664:\n\t.fill 662b-661b,1,0x42\n" /* space for original */ \
+		      ".previous" : args)
+
+#define LOCK_PREFIX \
+		".section .smp_locks,\"a\"\n"	\
+		"  .align 4\n"			\
+		"  .long 661f\n" /* address */	\
+		".previous\n"			\
+	       	"661:\n\tlock; "
+
+#else /* ! CONFIG_SMP */
+#define alternative_smp(smpinstr, upinstr, args...) \
+	asm volatile (upinstr : args)
+#define LOCK_PREFIX ""
+#endif
+
+#endif /* _I386_ALTERNATIVE_H */
Index: vanilla-2.6.15/include/asm-i386/atomic.h
===================================================================
--- vanilla-2.6.15.orig/include/asm-i386/atomic.h
+++ vanilla-2.6.15/include/asm-i386/atomic.h
@@ -10,12 +10,6 @@
  * resource counting etc..
  */
 
-#ifdef CONFIG_SMP
-#define LOCK "lock ; "
-#else
-#define LOCK ""
-#endif
-
 /*
  * Make sure gcc doesn't try to be clever and move things around
  * on us. We need to use _exactly_ the address the user gave us,
@@ -52,7 +46,7 @@ typedef struct { volatile int counter; }
 static __inline__ void atomic_add(int i, atomic_t *v)
 {
 	__asm__ __volatile__(
-		LOCK "addl %1,%0"
+		LOCK_PREFIX "addl %1,%0"
 		:"=m" (v->counter)
 		:"ir" (i), "m" (v->counter));
 }
@@ -67,7 +61,7 @@ static __inline__ void atomic_add(int i,
 static __inline__ void atomic_sub(int i, atomic_t *v)
 {
 	__asm__ __volatile__(
-		LOCK "subl %1,%0"
+		LOCK_PREFIX "subl %1,%0"
 		:"=m" (v->counter)
 		:"ir" (i), "m" (v->counter));
 }
@@ -86,7 +80,7 @@ static __inline__ int atomic_sub_and_tes
 	unsigned char c;
 
 	__asm__ __volatile__(
-		LOCK "subl %2,%0; sete %1"
+		LOCK_PREFIX "subl %2,%0; sete %1"
 		:"=m" (v->counter), "=qm" (c)
 		:"ir" (i), "m" (v->counter) : "memory");
 	return c;
@@ -101,7 +95,7 @@ static __inline__ int atomic_sub_and_tes
 static __inline__ void atomic_inc(atomic_t *v)
 {
 	__asm__ __volatile__(
-		LOCK "incl %0"
+		LOCK_PREFIX "incl %0"
 		:"=m" (v->counter)
 		:"m" (v->counter));
 }
@@ -115,7 +109,7 @@ static __inline__ void atomic_inc(atomic
 static __inline__ void atomic_dec(atomic_t *v)
 {
 	__asm__ __volatile__(
-		LOCK "decl %0"
+		LOCK_PREFIX "decl %0"
 		:"=m" (v->counter)
 		:"m" (v->counter));
 }
@@ -133,7 +127,7 @@ static __inline__ int atomic_dec_and_tes
 	unsigned char c;
 
 	__asm__ __volatile__(
-		LOCK "decl %0; sete %1"
+		LOCK_PREFIX "decl %0; sete %1"
 		:"=m" (v->counter), "=qm" (c)
 		:"m" (v->counter) : "memory");
 	return c != 0;
@@ -152,7 +146,7 @@ static __inline__ int atomic_inc_and_tes
 	unsigned char c;
 
 	__asm__ __volatile__(
-		LOCK "incl %0; sete %1"
+		LOCK_PREFIX "incl %0; sete %1"
 		:"=m" (v->counter), "=qm" (c)
 		:"m" (v->counter) : "memory");
 	return c != 0;
@@ -172,7 +166,7 @@ static __inline__ int atomic_add_negativ
 	unsigned char c;
 
 	__asm__ __volatile__(
-		LOCK "addl %2,%0; sets %1"
+		LOCK_PREFIX "addl %2,%0; sets %1"
 		:"=m" (v->counter), "=qm" (c)
 		:"ir" (i), "m" (v->counter) : "memory");
 	return c;
@@ -195,7 +189,7 @@ static __inline__ int atomic_add_return(
 	/* Modern 486+ processor */
 	__i = i;
 	__asm__ __volatile__(
-		LOCK "xaddl %0, %1;"
+		LOCK_PREFIX "xaddl %0, %1;"
 		:"=r"(i)
 		:"m"(v->counter), "0"(i));
 	return i + __i;
@@ -242,11 +236,11 @@ static __inline__ int atomic_sub_return(
 
 /* These are x86-specific, used by some header files */
 #define atomic_clear_mask(mask, addr) \
-__asm__ __volatile__(LOCK "andl %0,%1" \
+__asm__ __volatile__(LOCK_PREFIX "andl %0,%1" \
 : : "r" (~(mask)),"m" (*addr) : "memory")
 
 #define atomic_set_mask(mask, addr) \
-__asm__ __volatile__(LOCK "orl %0,%1" \
+__asm__ __volatile__(LOCK_PREFIX "orl %0,%1" \
 : : "r" (mask),"m" (*(addr)) : "memory")
 
 /* Atomic operations are already serializing on x86 */
Index: vanilla-2.6.15/include/asm-i386/bitops.h
===================================================================
--- vanilla-2.6.15.orig/include/asm-i386/bitops.h
+++ vanilla-2.6.15/include/asm-i386/bitops.h
@@ -7,6 +7,7 @@
 
 #include <linux/config.h>
 #include <linux/compiler.h>
+#include <asm/alternative.h>
 
 /*
  * These have to be done with inline assembly: that way the bit-setting
@@ -16,12 +17,6 @@
  * bit 0 is the LSB of addr; bit 32 is the LSB of (addr+1).
  */
 
-#ifdef CONFIG_SMP
-#define LOCK_PREFIX "lock ; "
-#else
-#define LOCK_PREFIX ""
-#endif
-
 #define ADDR (*(volatile long *) addr)
 
 /**
Index: vanilla-2.6.15/include/asm-i386/cpufeature.h
===================================================================
--- vanilla-2.6.15.orig/include/asm-i386/cpufeature.h
+++ vanilla-2.6.15/include/asm-i386/cpufeature.h
@@ -71,6 +71,8 @@
 #define X86_FEATURE_P4		(3*32+ 7) /* P4 */
 #define X86_FEATURE_CONSTANT_TSC (3*32+ 8) /* TSC ticks at a constant rate */
 
+#define X86_FEATURE_UP		(3*32+ 9) /* smp kernel running on up */
+
 /* Intel-defined CPU features, CPUID level 0x00000001 (ecx), word 4 */
 #define X86_FEATURE_XMM3	(4*32+ 0) /* Streaming SIMD Extensions-3 */
 #define X86_FEATURE_MWAIT	(4*32+ 3) /* Monitor/Mwait support */
Index: vanilla-2.6.15/include/asm-i386/rwlock.h
===================================================================
--- vanilla-2.6.15.orig/include/asm-i386/rwlock.h
+++ vanilla-2.6.15/include/asm-i386/rwlock.h
@@ -21,21 +21,23 @@
 #define RW_LOCK_BIAS_STR	"0x01000000"
 
 #define __build_read_lock_ptr(rw, helper)   \
-	asm volatile(LOCK "subl $1,(%0)\n\t" \
-		     "jns 1f\n" \
-		     "call " helper "\n\t" \
-		     "1:\n" \
-		     ::"a" (rw) : "memory")
+	alternative_smp("lock; subl $1,(%0)\n\t" \
+			"jns 1f\n" \
+			"call " helper "\n\t" \
+			"1:\n", \
+			"subl $1,(%0)\n\t", \
+			:"a" (rw) : "memory")
 
 #define __build_read_lock_const(rw, helper)   \
-	asm volatile(LOCK "subl $1,%0\n\t" \
-		     "jns 1f\n" \
-		     "pushl %%eax\n\t" \
-		     "leal %0,%%eax\n\t" \
-		     "call " helper "\n\t" \
-		     "popl %%eax\n\t" \
-		     "1:\n" \
-		     :"=m" (*(volatile int *)rw) : : "memory")
+	alternative_smp("lock; subl $1,%0\n\t" \
+			"jns 1f\n" \
+			"pushl %%eax\n\t" \
+			"leal %0,%%eax\n\t" \
+			"call " helper "\n\t" \
+			"popl %%eax\n\t" \
+			"1:\n", \
+			"subl $1,%0\n\t", \
+			"=m" (*(volatile int *)rw) : : "memory")
 
 #define __build_read_lock(rw, helper)	do { \
 						if (__builtin_constant_p(rw)) \
@@ -45,21 +47,23 @@
 					} while (0)
 
 #define __build_write_lock_ptr(rw, helper) \
-	asm volatile(LOCK "subl $" RW_LOCK_BIAS_STR ",(%0)\n\t" \
-		     "jz 1f\n" \
-		     "call " helper "\n\t" \
-		     "1:\n" \
-		     ::"a" (rw) : "memory")
+	alternative_smp("lock; subl $" RW_LOCK_BIAS_STR ",(%0)\n\t" \
+			"jz 1f\n" \
+			"call " helper "\n\t" \
+			"1:\n", \
+			"subl $" RW_LOCK_BIAS_STR ",(%0)\n\t", \
+			:"a" (rw) : "memory")
 
 #define __build_write_lock_const(rw, helper) \
-	asm volatile(LOCK "subl $" RW_LOCK_BIAS_STR ",%0\n\t" \
-		     "jz 1f\n" \
-		     "pushl %%eax\n\t" \
-		     "leal %0,%%eax\n\t" \
-		     "call " helper "\n\t" \
-		     "popl %%eax\n\t" \
-		     "1:\n" \
-		     :"=m" (*(volatile int *)rw) : : "memory")
+	alternative_smp("lock; subl $" RW_LOCK_BIAS_STR ",%0\n\t" \
+			"jz 1f\n" \
+			"pushl %%eax\n\t" \
+			"leal %0,%%eax\n\t" \
+			"call " helper "\n\t" \
+			"popl %%eax\n\t" \
+			"1:\n", \
+			"subl $" RW_LOCK_BIAS_STR ",%0\n\t", \
+			"=m" (*(volatile int *)rw) : : "memory")
 
 #define __build_write_lock(rw, helper)	do { \
 						if (__builtin_constant_p(rw)) \
Index: vanilla-2.6.15/include/asm-i386/semaphore.h
===================================================================
--- vanilla-2.6.15.orig/include/asm-i386/semaphore.h
+++ vanilla-2.6.15/include/asm-i386/semaphore.h
@@ -99,7 +99,7 @@ static inline void down(struct semaphore
 	might_sleep();
 	__asm__ __volatile__(
 		"# atomic down operation\n\t"
-		LOCK "decl %0\n\t"     /* --sem->count */
+		LOCK_PREFIX "decl %0\n\t"     /* --sem->count */
 		"js 2f\n"
 		"1:\n"
 		LOCK_SECTION_START("")
@@ -123,7 +123,7 @@ static inline int down_interruptible(str
 	might_sleep();
 	__asm__ __volatile__(
 		"# atomic interruptible down operation\n\t"
-		LOCK "decl %1\n\t"     /* --sem->count */
+		LOCK_PREFIX "decl %1\n\t"     /* --sem->count */
 		"js 2f\n\t"
 		"xorl %0,%0\n"
 		"1:\n"
@@ -148,7 +148,7 @@ static inline int down_trylock(struct se
 
 	__asm__ __volatile__(
 		"# atomic interruptible down operation\n\t"
-		LOCK "decl %1\n\t"     /* --sem->count */
+		LOCK_PREFIX "decl %1\n\t"     /* --sem->count */
 		"js 2f\n\t"
 		"xorl %0,%0\n"
 		"1:\n"
@@ -173,7 +173,7 @@ static inline void up(struct semaphore *
 {
 	__asm__ __volatile__(
 		"# atomic up operation\n\t"
-		LOCK "incl %0\n\t"     /* ++sem->count */
+		LOCK_PREFIX "incl %0\n\t"     /* ++sem->count */
 		"jle 2f\n"
 		"1:\n"
 		LOCK_SECTION_START("")
Index: vanilla-2.6.15/include/asm-i386/spinlock.h
===================================================================
--- vanilla-2.6.15.orig/include/asm-i386/spinlock.h
+++ vanilla-2.6.15/include/asm-i386/spinlock.h
@@ -48,18 +48,23 @@
 	"jmp 1b\n" \
 	"4:\n\t"
 
+#define __raw_spin_lock_string_up \
+	"\n\tdecb %0"
+
 static inline void __raw_spin_lock(raw_spinlock_t *lock)
 {
-	__asm__ __volatile__(
-		__raw_spin_lock_string
-		:"=m" (lock->slock) : : "memory");
+	alternative_smp(
+		__raw_spin_lock_string,
+		__raw_spin_lock_string_up,
+		"=m" (lock->slock) : : "memory");
 }
 
 static inline void __raw_spin_lock_flags(raw_spinlock_t *lock, unsigned long flags)
 {
-	__asm__ __volatile__(
-		__raw_spin_lock_string_flags
-		:"=m" (lock->slock) : "r" (flags) : "memory");
+	alternative_smp(
+		__raw_spin_lock_string_flags,
+		__raw_spin_lock_string_up,
+		"=m" (lock->slock) : "r" (flags) : "memory");
 }
 
 static inline int __raw_spin_trylock(raw_spinlock_t *lock)
@@ -178,12 +183,12 @@ static inline int __raw_write_trylock(ra
 
 static inline void __raw_read_unlock(raw_rwlock_t *rw)
 {
-	asm volatile("lock ; incl %0" :"=m" (rw->lock) : : "memory");
+	asm volatile(LOCK_PREFIX "incl %0" :"=m" (rw->lock) : : "memory");
 }
 
 static inline void __raw_write_unlock(raw_rwlock_t *rw)
 {
-	asm volatile("lock ; addl $" RW_LOCK_BIAS_STR ", %0"
+	asm volatile(LOCK_PREFIX "addl $" RW_LOCK_BIAS_STR ", %0"
 				 : "=m" (rw->lock) : : "memory");
 }
 
Index: vanilla-2.6.15/include/asm-i386/system.h
===================================================================
--- vanilla-2.6.15.orig/include/asm-i386/system.h
+++ vanilla-2.6.15/include/asm-i386/system.h
@@ -352,67 +352,6 @@ static inline unsigned long long __cmpxc
 
 #endif
     
-#ifdef __KERNEL__
-struct alt_instr { 
-	__u8 *instr; 		/* original instruction */
-	__u8 *replacement;
-	__u8  cpuid;		/* cpuid bit set for replacement */
-	__u8  instrlen;		/* length of original instruction */
-	__u8  replacementlen; 	/* length of new instruction, <= instrlen */ 
-	__u8  pad;
-}; 
-#endif
-
-/* 
- * Alternative instructions for different CPU types or capabilities.
- * 
- * This allows to use optimized instructions even on generic binary
- * kernels.
- * 
- * length of oldinstr must be longer or equal the length of newinstr
- * It can be padded with nops as needed.
- * 
- * For non barrier like inlines please define new variants
- * without volatile and memory clobber.
- */
-#define alternative(oldinstr, newinstr, feature) 	\
-	asm volatile ("661:\n\t" oldinstr "\n662:\n" 		     \
-		      ".section .altinstructions,\"a\"\n"     	     \
-		      "  .align 4\n"				       \
-		      "  .long 661b\n"            /* label */          \
-		      "  .long 663f\n"		  /* new instruction */ 	\
-		      "  .byte %c0\n"             /* feature bit */    \
-		      "  .byte 662b-661b\n"       /* sourcelen */      \
-		      "  .byte 664f-663f\n"       /* replacementlen */ \
-		      ".previous\n"						\
-		      ".section .altinstr_replacement,\"ax\"\n"			\
-		      "663:\n\t" newinstr "\n664:\n"   /* replacement */    \
-		      ".previous" :: "i" (feature) : "memory")  
-
-/*
- * Alternative inline assembly with input.
- * 
- * Pecularities:
- * No memory clobber here. 
- * Argument numbers start with 1.
- * Best is to use constraints that are fixed size (like (%1) ... "r")
- * If you use variable sized constraints like "m" or "g" in the 
- * replacement maake sure to pad to the worst case length.
- */
-#define alternative_input(oldinstr, newinstr, feature, input...)		\
-	asm volatile ("661:\n\t" oldinstr "\n662:\n"				\
-		      ".section .altinstructions,\"a\"\n"			\
-		      "  .align 4\n"						\
-		      "  .long 661b\n"            /* label */			\
-		      "  .long 663f\n"		  /* new instruction */ 	\
-		      "  .byte %c0\n"             /* feature bit */		\
-		      "  .byte 662b-661b\n"       /* sourcelen */		\
-		      "  .byte 664f-663f\n"       /* replacementlen */ 		\
-		      ".previous\n"						\
-		      ".section .altinstr_replacement,\"ax\"\n"			\
-		      "663:\n\t" newinstr "\n664:\n"   /* replacement */ 	\
-		      ".previous" :: "i" (feature), ##input)
-
 /*
  * Force strict CPU ordering.
  * And yes, this is required on UP too when we're talking
Index: vanilla-2.6.15/arch/um/kernel/um_arch.c
===================================================================
--- vanilla-2.6.15.orig/arch/um/kernel/um_arch.c
+++ vanilla-2.6.15/arch/um/kernel/um_arch.c
@@ -477,6 +477,16 @@ void __init check_bugs(void)
 	check_devanon();
 }
 
-void apply_alternatives(void *start, void *end)
+void apply_alternatives(struct alt_instr *start, struct alt_instr *end)
+{
+}
+
+void alternatives_smp_module_add(struct module *mod, char *name,
+				 void *locks, void *locks_end,
+				 void *text,  void *text_end)
+{
+}
+
+void alternatives_smp_module_del(struct module *mod)
 {
 }
Index: vanilla-2.6.15/include/asm-um/alternative.h
===================================================================
--- /dev/null
+++ vanilla-2.6.15/include/asm-um/alternative.h
@@ -0,0 +1,6 @@
+#ifndef __UM_ALTERNATIVE_H
+#define __UM_ALTERNATIVE_H
+
+#include "asm/arch/alternative.h"
+
+#endif
Index: vanilla-2.6.15/include/asm-i386/mutex.h
===================================================================
--- vanilla-2.6.15.orig/include/asm-i386/mutex.h
+++ vanilla-2.6.15/include/asm-i386/mutex.h
@@ -9,6 +9,8 @@
 #ifndef _ASM_MUTEX_H
 #define _ASM_MUTEX_H
 
+#include "asm/alternative.h"
+
 /**
  *  __mutex_fastpath_lock - try to take the lock by moving the count
  *                          from 1 to a 0 value
@@ -27,7 +29,7 @@ do {									\
 	typecheck_fn(fastcall void (*)(atomic_t *), fail_fn);		\
 									\
 	__asm__ __volatile__(						\
-		LOCK	"   decl (%%eax)	\n"			\
+		LOCK_PREFIX "   decl (%%eax)	\n"			\
 			"   js 2f		\n"			\
 			"1:			\n"			\
 									\
@@ -83,7 +85,7 @@ do {									\
 	typecheck_fn(fastcall void (*)(atomic_t *), fail_fn);		\
 									\
 	__asm__ __volatile__(						\
-		LOCK	"   incl (%%eax)	\n"			\
+		LOCK_PREFIX "   incl (%%eax)	\n"			\
 			"   jle 2f		\n"			\
 			"1:			\n"			\
 									\
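
As a rough illustration of the .smp_locks idea from the description (a
list of pointers to lock prefixes which can be nop'ed out on UP), here is
a minimal userspace sketch.  It is not the code from the patch: fake_text,
fake_smp_locks, locks_to_up() and locks_to_smp() are hypothetical names,
and a plain byte array stands in for kernel text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOCK_PREFIX_BYTE 0xf0	/* x86 "lock" prefix */
#define NOP_BYTE	 0x90	/* x86 one-byte nop */

/*
 * Hypothetical stand-in for kernel text:
 *   f0 ff 08   lock decl (%eax)
 *   f0 ff 00   lock incl (%eax)
 */
uint8_t fake_text[] = { 0xf0, 0xff, 0x08, 0xf0, 0xff, 0x00 };

/* Hypothetical stand-in for .smp_locks: one pointer per lock prefix. */
uint8_t *fake_smp_locks[] = { &fake_text[0], &fake_text[3] };

/* SMP->UP: overwrite every recorded lock prefix with a nop. */
size_t locks_to_up(uint8_t **locks, size_t n)
{
	size_t i, patched = 0;

	for (i = 0; i < n; i++) {
		/* Each entry must point at a lock prefix or an earlier nop. */
		assert(*locks[i] == LOCK_PREFIX_BYTE || *locks[i] == NOP_BYTE);
		if (*locks[i] == LOCK_PREFIX_BYTE) {
			*locks[i] = NOP_BYTE;
			patched++;
		}
	}
	return patched;
}

/* UP->SMP: restore the lock prefix at every recorded location. */
size_t locks_to_smp(uint8_t **locks, size_t n)
{
	size_t i, patched = 0;

	for (i = 0; i < n; i++) {
		assert(*locks[i] == LOCK_PREFIX_BYTE || *locks[i] == NOP_BYTE);
		if (*locks[i] == NOP_BYTE) {
			*locks[i] = LOCK_PREFIX_BYTE;
			patched++;
		}
	}
	return patched;
}
```

Calling locks_to_up(fake_smp_locks, 2) turns both f0 bytes into 90 and
leaves the opcode bytes alone; locks_to_smp() reverses that.  Both return
the number of bytes they changed, so a second call in the same direction
returns 0.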

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH] SMP alternatives
  2006-01-24 15:33 [PATCH] SMP alternatives Gerd Hoffmann
@ 2006-01-24 16:22 ` Ben Collins
  2006-01-25  9:20   ` Gerd Hoffmann
  2006-01-26 10:22 ` Pavel Machek
  1 sibling, 1 reply; 117+ messages in thread
From: Ben Collins @ 2006-01-24 16:22 UTC (permalink / raw)
  To: Gerd Hoffmann; +Cc: Andrew Morton, linux kernel mailing list

On Tue, 2006-01-24 at 16:33 +0100, Gerd Hoffmann wrote:
>   Hi Andrew,
> 
> Can you include the patch in -mm to give it some testing?  Then merge
> maybe for 2.6.17?  Posted last time in december, with nobody complaining
> any more about the most recent version.  The patch is almost unmodified
> since, I've only had to add a small chunk due to the mutex merge.
> Description is below, the patch (against -rc1-git4 snapshot) is attached.

FYI, I have this being used in Ubuntu's kernel right now. It's pretty
stable. I have it implemented for x86_64 as well. I can send you that
patch when I get a chance to pull it from the repo cleanly. I did enable
a kconfig option and command line option so it can be enabled/disabled
by default and also at boot.

-- 
Ben Collins
Kernel Developer - Ubuntu Linux


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH] SMP alternatives
  2006-01-24 16:22 ` Ben Collins
@ 2006-01-25  9:20   ` Gerd Hoffmann
  0 siblings, 0 replies; 117+ messages in thread
From: Gerd Hoffmann @ 2006-01-25  9:20 UTC (permalink / raw)
  To: Ben Collins; +Cc: Andrew Morton, linux kernel mailing list

Ben Collins wrote:
> FYI, I have this being used in Ubuntu's kernel right now. It's pretty
> stable. I have it implemented for x86_64 as well. I can send you that
> patch when I get a chance to pull it from the repo cleanly. I did enable
> a kconfig option and command line option so it can be enabled/disabled
> by default and also at boot.

The x86_64 bits would be very nice.  Linus didn't like the idea of making
that a config option, commenting "either it works or it doesn't", so I
left it out.  The code is small anyway, and the data tables can be
freed, so making it an option doesn't make much sense.  IMHO there are
way too many config options anyway ...

cheers,

  Gerd

-- 
Gerd 'just married' Hoffmann <kraxel@suse.de>
I'm the hacker formerly known as Gerd Knorr.
http://www.suse.de/~kraxel/just-married.jpeg

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH] SMP alternatives
  2006-01-24 15:33 [PATCH] SMP alternatives Gerd Hoffmann
  2006-01-24 16:22 ` Ben Collins
@ 2006-01-26 10:22 ` Pavel Machek
  2006-01-26 11:17   ` Gerd Hoffmann
  1 sibling, 1 reply; 117+ messages in thread
From: Pavel Machek @ 2006-01-26 10:22 UTC (permalink / raw)
  To: Gerd Hoffmann; +Cc: Andrew Morton, linux kernel mailing list

Hi!

> Can you include the patch in -mm to give it some testing?  Then merge
> maybe for 2.6.17?  Posted last time in december, with nobody complaining
> any more about the most recent version.  The patch is almost unmodified
> since, I've only had to add a small chunk due to the mutex merge.
> Description is below, the patch (against -rc1-git4 snapshot) is
> attached.

Well, I'm not 100% convinced this is really a good idea. It increases
complexity quite a lot.

Oh and please inline patches.

alternatives.c is missing a header (copyright, GPL).
								Pavel
-- 
Thanks, Sharp!

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH] SMP alternatives
  2006-01-26 10:22 ` Pavel Machek
@ 2006-01-26 11:17   ` Gerd Hoffmann
  2006-01-26 11:48     ` Pavel Machek
  0 siblings, 1 reply; 117+ messages in thread
From: Gerd Hoffmann @ 2006-01-26 11:17 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Andrew Morton, linux kernel mailing list

Pavel Machek wrote:
> Hi!
> 
>> Can you include the patch in -mm to give it some testing?  Then merge
>> maybe for 2.6.17?  Posted last time in december, with nobody complaining
>> any more about the most recent version.  The patch is almost unmodified
>> since, I've only had to add a small chunk due to the mutex merge.
>> Description is below, the patch (against -rc1-git4 snapshot) is
>> attached.
> 
> Well, I'm not 100% convinced this is really a good idea. It increases
> complexity quite a lot.

Well, we have alternatives for quite some time already, this is just an
extension of the existing bits ...
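
For context, a minimal userspace sketch of what the existing mechanism
does with one table entry.  The struct carries the same fields as the
struct alt_instr that the patch moves out of system.h, but
alt_instr_sim, apply_alternatives_sim() and the features bitmap are
hypothetical names, and padding with one-byte nops is a simplifying
assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same fields as struct alt_instr, under a hypothetical name. */
struct alt_instr_sim {
	uint8_t *instr;		/* original instruction */
	uint8_t *replacement;
	uint8_t  cpuid;		/* feature bit set for replacement */
	uint8_t  instrlen;	/* length of original instruction */
	uint8_t  replacementlen;	/* length of new instruction, <= instrlen */
	uint8_t  pad;
};

/*
 * Walk the table; for every entry whose feature bit is set in the
 * bitmap, copy the replacement over the original and fill the rest of
 * the original's space with one-byte nops (0x90).  Returns the number
 * of entries applied.
 */
int apply_alternatives_sim(struct alt_instr_sim *start,
			   struct alt_instr_sim *end,
			   const uint8_t *features)
{
	struct alt_instr_sim *a;
	int applied = 0;

	for (a = start; a < end; a++) {
		/* The replacement must fit into the original's space. */
		assert(a->replacementlen <= a->instrlen);
		if (!(features[a->cpuid / 8] & (1u << (a->cpuid % 8))))
			continue;
		memcpy(a->instr, a->replacement, a->replacementlen);
		memset(a->instr + a->replacementlen, 0x90,
		       a->instrlen - a->replacementlen);
		applied++;
	}
	return applied;
}
```

The SMP variant follows the same table-driven idea, except that the
decision is "more than one CPU online" rather than a CPU feature bit,
and the original bytes have to be kept around so the patching can be
undone.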

> Oh and please inline patches.

What's wrong with "Content-Disposition: inline" attachments?  The risk
that they get whitespace-mangled is much lower then.  Also mailers display
them inline and also quote them on reply so you can easily comment on them.
At least mutt and thunderbird do that.  If your mailer doesn't, file a
bug ;)

cheers,

  Gerd

-- 
Gerd 'just married' Hoffmann <kraxel@suse.de>
I'm the hacker formerly known as Gerd Knorr.
http://www.suse.de/~kraxel/just-married.jpeg

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH] SMP alternatives
  2006-01-26 11:17   ` Gerd Hoffmann
@ 2006-01-26 11:48     ` Pavel Machek
  0 siblings, 0 replies; 117+ messages in thread
From: Pavel Machek @ 2006-01-26 11:48 UTC (permalink / raw)
  To: Gerd Hoffmann; +Cc: kernel list

On Čt 26-01-06 12:17:26, Gerd Hoffmann wrote:
> Pavel Machek wrote:
> > Hi!
> > 
> >> Can you include the patch in -mm to give it some testing?  Then merge
> >> maybe for 2.6.17?  Posted last time in december, with nobody complaining
> >> any more about the most recent version.  The patch is almost unmodified
> >> since, I've only had to add a small chunk due to the mutex merge.
> >> Description is below, the patch (against -rc1-git4 snapshot) is
> >> attached.
> > 
> > Well, I'm not 100% convinced this is really a good idea. It increases
> > complexity quite a lot.
> 
> Well, we have alternatives for quite some time already, this is just an
> extension of the existing bits ...

Like... during suspend we hot-unplug all but one cpu. Patching code at
that point is quite unnecessary...

> > Oh and please inline patches.
> 
> What's wrong with "Content-Disposition: inline" attachments?  The risk
> that they get whitespace-mangled is much lower then.  Also mailers display
> them inline and also quote them on reply so you can easily comment on them.
> At least mutt and thunderbird do that.  If your mailer doesn't, file a
> bug ;)

Consensus on lkml is to inline patches. Content-disposition: inline is
commonly accepted as not-too-evil, and my mailer (mutt) usually
honours that, but something in your mail tripped it.

								Pavel
-- 
Thanks, Sharp!

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-28 23:30                                                               ` Zachary Amsden
@ 2005-11-28 23:32                                                                 ` H. Peter Anvin
  0 siblings, 0 replies; 117+ messages in thread
From: H. Peter Anvin @ 2005-11-28 23:32 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Jeff V. Merkey, Bill Davidsen, Linus Torvalds, Alan Cox,
	Andi Kleen, Gerd Knorr, Dave Jones, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

Zachary Amsden wrote:
> 
> I spoke silicon too heavy handedly.  The complexity of the issue 
> disappears if you take an exception, but rewinding state prior to the 
> exception and reissuing is going to be less efficient than getting it 
> right the first time, which is something software can always guarantee.  
> You need to add more hardware for prediction to get it right all the 
> time, and it is not clear the cost of that hardware is justified when 
> software can always do the right thing.
> 

Taking exceptions is fine as long as you don't do it too often.  I'm 
starting to suspect that the only way to do this right all the time is 
to have this be part of the page attributes, since it's region-specific.

	-hpa


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-28 23:07                                                             ` H. Peter Anvin
@ 2005-11-28 23:30                                                               ` Zachary Amsden
  2005-11-28 23:32                                                                 ` H. Peter Anvin
  0 siblings, 1 reply; 117+ messages in thread
From: Zachary Amsden @ 2005-11-28 23:30 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Jeff V. Merkey, Bill Davidsen, Linus Torvalds, Alan Cox,
	Andi Kleen, Gerd Knorr, Dave Jones, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

H. Peter Anvin wrote:

> Zachary Amsden wrote:
>
>>
>> You need a way to type the lock semantics by memory region, and a 
>> working hardware solution can not perform as well as a careful 
>> software solution.  As was pointed out earlier, you can't use memory 
>> type attributes to infer lock semantics, you must assume them in the 
>> decoder or implement complex deadlock detection and recovery in silicon.
>>
>
> Sure you can.  You just have to be prepared to take a microop 
> exception if you speculate incorrectly.


I spoke silicon too heavy handedly.  The complexity of the issue 
disappears if you take an exception, but rewinding state prior to the 
exception and reissuing is going to be less efficient than getting it 
right the first time, which is something software can always guarantee.  
You need to add more hardware for prediction to get it right all the 
time, and it is not clear the cost of that hardware is justified when 
software can always do the right thing.

Zach

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-28 23:00                                                           ` Zachary Amsden
  2005-11-28 23:07                                                             ` H. Peter Anvin
@ 2005-11-28 23:12                                                             ` Andi Kleen
  1 sibling, 0 replies; 117+ messages in thread
From: Andi Kleen @ 2005-11-28 23:12 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Jeff V. Merkey, Bill Davidsen, Linus Torvalds, Alan Cox,
	H. Peter Anvin, Andi Kleen, Gerd Knorr, Dave Jones, Pavel Machek,
	Andrew Morton, Linux Kernel Mailing List, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Eric W. Biederman,
	Ingo Molnar

> I'm not sure a hardware solution is even the right thing - consider a 
> shared memory database process with a private heap.  You really want 
> locks on the shared memory, and you really don't on the heap.
> 
> You need a way to type the lock semantics by memory region, and a 
> working hardware solution can not perform as well as a careful software 
> solution.  As was pointed out earlier, you can't use memory type 

The problem is that nobody will change all the software.
Your careful software solution will only benefit a small
minority of performance conscious and well tuned programs.

The hardware solution might not be perfect, but has a good chance
to apply to 90% of the "don't care" programs and help them all a bit.
And every bit counts in the quest for more single thread performance.

And if someone wants to fine tune their programs they can
still change the software as much as they want.

-Andi

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-28 23:00                                                           ` Zachary Amsden
@ 2005-11-28 23:07                                                             ` H. Peter Anvin
  2005-11-28 23:30                                                               ` Zachary Amsden
  2005-11-28 23:12                                                             ` Andi Kleen
  1 sibling, 1 reply; 117+ messages in thread
From: H. Peter Anvin @ 2005-11-28 23:07 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Jeff V. Merkey, Bill Davidsen, Linus Torvalds, Alan Cox,
	Andi Kleen, Gerd Knorr, Dave Jones, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

Zachary Amsden wrote:
> 
> You need a way to type the lock semantics by memory region, and a 
> working hardware solution can not perform as well as a careful software 
> solution.  As was pointed out earlier, you can't use memory type 
> attributes to infer lock semantics, you must assume them in the decoder 
> or implement complex deadlock detection and recovery in silicon.
> 

Sure you can.  You just have to be prepared to take a microop exception 
if you speculate incorrectly.

	-hpa

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-28 22:19                                                         ` Jeff V. Merkey
@ 2005-11-28 23:00                                                           ` Zachary Amsden
  2005-11-28 23:07                                                             ` H. Peter Anvin
  2005-11-28 23:12                                                             ` Andi Kleen
  0 siblings, 2 replies; 117+ messages in thread
From: Zachary Amsden @ 2005-11-28 23:00 UTC (permalink / raw)
  To: Jeff V. Merkey
  Cc: Bill Davidsen, Linus Torvalds, Alan Cox, H. Peter Anvin,
	Andi Kleen, Gerd Knorr, Dave Jones, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

Jeff V. Merkey wrote:

>
> In 2-3 years we might actually see the hardware solution, maybe ...
> I am skeptical Intel will move quickly on it. A software solution will 
> get out faster.



I'm not sure a hardware solution is even the right thing - consider a 
shared memory database process with a private heap.  You really want 
locks on the shared memory, and you really don't on the heap.

You need a way to type the lock semantics by memory region, and a 
working hardware solution can not perform as well as a careful software 
solution.  As was pointed out earlier, you can't use memory type 
attributes to infer lock semantics, you must assume them in the decoder 
or implement complex deadlock detection and recovery in silicon.

I would be willing to bet that library users know best.  Most cleanly 
written libraries already have wrapper functions that can be used to 
plug in needed libc functions like malloc, even file I/O.  Even if they 
don't, you can rewrap all of the imported functions.  Using this, you 
can isolate threaded libraries from single threaded applications, and 
make sure the performance critical libraries use non-threaded 
operations.  You can even afford to use a medium heavy hammer and switch 
from non-threaded to threaded dependent libraries every time you call a 
thread-using library function, because by assumption, the majority of 
performance critical code is going to be running single threaded.

Zach

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-28 19:52                                                       ` Bill Davidsen
  2005-11-28 20:05                                                         ` Zachary Amsden
@ 2005-11-28 22:19                                                         ` Jeff V. Merkey
  2005-11-28 23:00                                                           ` Zachary Amsden
  1 sibling, 1 reply; 117+ messages in thread
From: Jeff V. Merkey @ 2005-11-28 22:19 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Linus Torvalds, Alan Cox, H. Peter Anvin, Andi Kleen, Gerd Knorr,
	Dave Jones, Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

Bill Davidsen wrote:

> Linus Torvalds wrote:
>
>>
>> On Wed, 23 Nov 2005, Daniel Jacobowitz wrote:
>>
>>> Why should we use a silicon based solution for this, when I posit that
>>> there are simpler and equally effective userspace solutions?
>>
>>
>>
>> Name them.
>>
>> In user space, doing things like clever run-time linking things is 
>> actually horribly bad. It causes COW faults at startup, and/or makes 
>> the compiler have to do indirections unnecessarily. Both of which 
>> actually make caches less effective, because now processes that 
>> really effectively do have exactly the same contents have them in 
>> different pages.
>>
>> The other alternative (which apparently glibc actually does use) is 
>> to dynamically branch over the lock prefixes, which actually works 
>> better: it's more work dynamically, but it's much cheaper from a 
>> startup standpoint and there's no memory duplication, so while it is 
>> the "stupid" approach, it's actually better than the clever one.
>>
>> The third alternative is to know at link-time that the process never 
>> does anything threaded, but that needs more developer attention and 
>> non-standard setups, and you _will_ get it wrong (some library will 
>> create some thread without the developer even realizing). It also has 
>> the duplicated library overhead (but at least now the duplication is 
>> just twice, not "each process duplicates its own private pointer")
>>
>> In short, there simply aren't any good alternatives. The end result is 
>> that thread-safe libraries are always in practice thread-safe even on 
>> UP, even though that serializes the CPU altogether unnecessarily.
>>
>> I'm sure you can make up alternatives every time you hit one 
>> _particular_ library, but that just doesn't scale in the real world.
>>
>> In contrast, the simple silicon support scales wonderfully well. 
>> Suddenly libraries can be thread-safe _and_ efficient on UP too. You 
>> get to eat your cake and have it too.
>
>
> I believe that a hardware solution would also accommodate the case 
> where a program runs unthreaded for most of the processing, and only 
> starts threads to do the final stage "report generation" tasks, where 
> that makes sense. I don't believe that it helps in the case where init 
> uses threads and then reverts to a single thread for the balance of 
> the task. I can't think of anything which does that, so it's probably 
> a non-critical corner case, or something the thread library could 
> correct.
>
>
In 2-3 years we might actually see the hardware solution, maybe ... I 
am skeptical Intel will move quickly on it. A software solution will get 
out faster.

Jeff
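
The "dynamically branch over the lock prefixes" approach quoted above
can be sketched in portable C roughly as follows.  This is only an
illustration of the idea, not glibc's actual code: multiple_threads and
counter_add() are hypothetical names, and C11 atomics stand in for the
hand-written assembly a real library would use.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Hypothetical flag.  A thread library would set it before creating
 * the first additional thread and never clear it afterwards.
 */
int multiple_threads;

/* Add delta to *counter and return the new value. */
int counter_add(atomic_int *counter, int delta)
{
	int old;

	assert(counter);

	if (!multiple_threads) {
		/*
		 * Single-threaded: a plain load and store, so no locked
		 * read-modify-write instruction is needed.
		 */
		old = atomic_load_explicit(counter, memory_order_relaxed);
		atomic_store_explicit(counter, old + delta,
				      memory_order_relaxed);
		return old + delta;
	}

	/* Threaded: full atomic read-modify-write (lock xadd on x86). */
	return atomic_fetch_add(counter, delta) + delta;
}
```

The cost is one test and branch on every operation, which is the "more
work dynamically" trade-off mentioned in the quote; in exchange there
is no code patching and no per-process copy of the library text.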

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-25 17:33                                                 ` Linus Torvalds
@ 2005-11-28 20:25                                                   ` Bill Davidsen
  0 siblings, 0 replies; 117+ messages in thread
From: Bill Davidsen @ 2005-11-28 20:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, H. Peter Anvin, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

Linus Torvalds wrote:
> 
> On Thu, 24 Nov 2005, Chris Wedgwood wrote:
> 
>>CPUs in the embedded space could greatly outnumber desktops & servers
>>(cell phones, access points, routers, media players, etc).  Most of
>>these will be UP for some time.
> 
> 
> That's not entirely clear either.
> 
> There are definite advantages to SMP even in the embedded space - or, to 
> put it more strongly: _especially_ in the embedded space.
> 
I would argue that there is no "the embedded space," but rather a set of
embedded spaces with various needs. Having worked in industrial
control for three years and lunched with IC folks for another decade, I'm
fairly sure that consumer goods are very different from real industrial
control, and realtime items (multimedia) are different from phones and
PDAs. Until the phone gets "swear at it" slow, features like voice
recognition are more important than doing voice to number lookup in 20ms
instead of 400ms. Cost and battery life matter a lot too, while the
media and IC markets are already attached to expensive stuff, so the
computer is a smaller fraction of the cost.

> None of the cellphone manufacturers seem to be in the least interested in 
> doing a "phone only" solution. They can already do that cheaply, they 
> can't make much money off it, and they are all interested in features. And 
> it really _is_ more power-efficient to have, say, a dual-core 200MHz chip 
> than it is to have a single-core 300MHz one.
> 
> Now, sometimes those SMP systems will actually be used as "tightly coupled 
> UP", where one of the CPU's is just basically a DSP. And from a power 
> efficiency standpoint, having specialized hardware (and thus _A_MP rather 
> than SMP) is obviously better, but in complex tasks - and communication 
> tends to be that - general-purpose is often desirable enough that people 
> will take the inefficiencies of a GP CPU over a fixed-function specialized 
> DSP-kind of environment.
> 
> But SMP is absolutely _not_ unusual in embedded. It's been there for years 
> already, and it's clearly moving downwards there too.

Absolutely true, but that dual core 200 MHz chip probably draws more
power than a 200 MHz uni, etc. So there will probably be uni
applications for the foreseeable future, and any benefit in uni
performance will be useful.

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me


* Re: [patch] SMP alternatives
  2005-11-28 19:52                                                       ` Bill Davidsen
@ 2005-11-28 20:05                                                         ` Zachary Amsden
  2005-11-28 22:19                                                         ` Jeff V. Merkey
  1 sibling, 0 replies; 117+ messages in thread
From: Zachary Amsden @ 2005-11-28 20:05 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Linus Torvalds, Alan Cox, H. Peter Anvin, Andi Kleen, Gerd Knorr,
	Dave Jones, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

Bill Davidsen wrote:

> Linus Torvalds wrote:
>
>>
>> In contrast, the simple silicon support scales wonderfully well. 
>> Suddenly libraries can be thread-safe _and_ efficient on UP too. You 
>> get to eat your cake and have it too.
>
>
> I believe that a hardware solution would also accommodate the case 
> where a program runs unthreaded for most of the processing, and only 
> starts threads to do the final stage "report generation" tasks, where 
> that makes sense. I don't believe that it helps in the case where init 
> uses threads and then reverts to a single thread for the balance of 
> the task. I can't think of anything which does that, so it's probably 
> a non-critical corner case, or something the thread library could 
> correct.


Startup routine of a scientific app calls a multithreaded "fetch work" 
routine, then crunches the data using a single thread.  This could even 
happen somewhere inside a library, so the application itself is unaware 
that threads were ever invoked.  This is not a far-fetched case.

You really need per-address object notions of "threadedness" when 
talking about shared memory, since you may need shared memory to be 
atomic, but operate on the heap in single threaded fashion.

Zach


* Re: [patch] SMP alternatives
  2005-11-23 23:42                                                       ` Daniel Jacobowitz
  2005-11-23 23:59                                                         ` Linus Torvalds
  2005-11-24 22:32                                                         ` Ulrich Drepper
@ 2005-11-28 19:58                                                         ` Bill Davidsen
  2 siblings, 0 replies; 117+ messages in thread
From: Bill Davidsen @ 2005-11-28 19:58 UTC (permalink / raw)
  To: Daniel Jacobowitz
  Cc: Alan Cox, H. Peter Anvin, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

Daniel Jacobowitz wrote:
> On Wed, Nov 23, 2005 at 03:08:59PM -0800, Linus Torvalds wrote:

>>In contrast, the simple silicon support scales wonderfully well. Suddenly 
>>libraries can be thread-safe _and_ efficient on UP too. You get to eat 
>>your cake and have it too.
> 
> 
> By buying new hardware and only caring about people using the magic
> architecture.  No thanks.

That is the problem, waiting for Intel to do hardware magic, or even to 
decide IF they do it. Like assuming that everyone has SMP because a few 
percent of the users have dual core chips. The majority of the market 
will have SMP someday, but ignoring the current status isn't realistic.
> 
> Maybe I'll implement this some weekend.

Love to see it, I'm only semi-convinced it can be done in a way which 
actually produces significant benefits.
> 

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me



* Re: [patch] SMP alternatives
  2005-11-23 23:08                                                     ` Linus Torvalds
                                                                         ` (3 preceding siblings ...)
  2005-11-24 13:01                                                       ` Pádraig Brady
@ 2005-11-28 19:52                                                       ` Bill Davidsen
  2005-11-28 20:05                                                         ` Zachary Amsden
  2005-11-28 22:19                                                         ` Jeff V. Merkey
  4 siblings, 2 replies; 117+ messages in thread
From: Bill Davidsen @ 2005-11-28 19:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, H. Peter Anvin, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

Linus Torvalds wrote:
> 
> On Wed, 23 Nov 2005, Daniel Jacobowitz wrote:
> 
>>Why should we use a silicon based solution for this, when I posit that
>>there are simpler and equally effective userspace solutions?
> 
> 
> Name them.
> 
> In user space, doing things like clever run-time linking things is 
> actually horribly bad. It causes COW faults at startup, and/or makes the 
> compiler have to do indirections unnecessarily.  Both of which actually 
> make caches less effective, because now processes that really effectively 
> do have exactly the same contents have them in different pages.
> 
> The other alternative (which apparently glibc actually does use) is to 
> dynamically branch over the lock prefixes, which actually works better: 
> it's more work dynamically, but it's much cheaper from a startup 
> standpoint and there's no memory duplication, so while it is the "stupid" 
> approach, it's actually better than the clever one.
> 
> The third alternative is to know at link-time that the process never does 
> anything threaded, but that needs more developer attention and 
> non-standard setups, and you _will_ get it wrong (some library will create 
> some thread without the developer even realizing). It also has the 
> duplicated library overhead (but at least now the duplication is just 
> twice, not "each process duplicates its own private pointer")
> 
> In short, there simply aren't any good alternatives. The end result is that 
> thread-safe libraries are always in practice thread-safe even on UP, even 
> though that serializes the CPU altogether unnecessarily.
> 
> I'm sure you can make up alternatives every time you hit one _particular_ 
> library, but that just doesn't scale in the real world.
> 
> In contrast, the simple silicon support scales wonderfully well. Suddenly 
> libraries can be thread-safe _and_ efficient on UP too. You get to eat 
> your cake and have it too.

I believe that a hardware solution would also accommodate the case where 
a program runs unthreaded for most of the processing, and only starts 
threads to do the final stage "report generation" tasks, where that 
makes sense. I don't believe that it helps in the case where init uses 
threads and then reverts to a single thread for the balance of the task. 
I can't think of anything which does that, so it's probably a 
non-critical corner case, or something the thread library could correct.


-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me


* Re: [patch] SMP alternatives
  2005-11-24 23:35                                                                   ` Andi Kleen
  2005-11-25  0:13                                                                     ` Alan Cox
  2005-11-25  1:33                                                                     ` H. Peter Anvin
@ 2005-11-28 19:15                                                                     ` Bill Davidsen
  2 siblings, 0 replies; 117+ messages in thread
From: Bill Davidsen @ 2005-11-28 19:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alan Cox, Eric W. Biederman, Gerd Knorr, Linus Torvalds,
	Dave Jones, Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Ingo Molnar

Andi Kleen wrote:
> On Thu, Nov 24, 2005 at 02:48:25PM -0800, thockin@hockin.org wrote:
> 
>>On Thu, Nov 24, 2005 at 11:12:14PM +0000, Alan Cox wrote:
>>
>>>On Iau, 2005-11-24 at 20:14 +0100, Andi Kleen wrote:
>>>
>>>>I proposed something like that - best with an ASCII string
>>>>("First DIMM on the top left corner") But getting such stuff into BIOS 
>>>>is difficult and long winded.
>>>
>>>Propose it to the desktop management people and get it into the DMI
>>>standard. They already have entries for each memory slot, they already
>>>have entries for descriptive strings for connectors. In fact you may
>>>well be able to 'bend' the spec enough to do it as is.
>>
>>There are enough fields that maybe one of them is loose enough to mean
>>this.  It doesn't help us convince mobo vendors to support it, though.
> 
> 
> With arbitrary desktop/laptop/etc. vendors it's pretty hopeless I agree.
> But I suspect there is a chance at least on the server side. There
> is only a limited number of companies working on server BIOSes 
> for their boards and they tend to be more receptive to Linux's need
> because it's now a significant part of their market.

It would seem that the OEMs buying the board would like this feature, 
since it could be incorporated into POST, diagnostic CDs, etc. And since 
server owners are more likely to have a service contract, anything to 
make service calls faster is a benefit to the system vendor.
> 
> And it's clearly an obviously useful "RAS feature" which is
> fully buzzword compatible and everything.
> 
> IMHO it's time that Linux gets more proactive regarding talking
> to BIOS vendors. Perhaps a generic "BIOS writers guide for Linux"
> would be a good thing.  I have at least one other extension I would like
> BIOS vendors to support. Just would need to come up with a writeup
> for a clearly defined specification.

If someone handed them some good specs except for the table, I suspect 
they would see the benefit. Independent BIOS writers compete for board 
contracts, in-house writers want features with one time cost and every 
time benefit, I think you're right that this would be a benefit to everyone.

Given that it seems so simple, is there a reason why this hasn't been 
around for ages?

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me



* Re: [patch] SMP alternatives
  2005-11-25  7:38                                               ` Chris Wedgwood
  2005-11-25 17:33                                                 ` Linus Torvalds
@ 2005-11-25 20:13                                                 ` H. Peter Anvin
  1 sibling, 0 replies; 117+ messages in thread
From: H. Peter Anvin @ 2005-11-25 20:13 UTC (permalink / raw)
  To: Chris Wedgwood
  Cc: Linus Torvalds, Alan Cox, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

Chris Wedgwood wrote:
> On Wed, Nov 23, 2005 at 01:36:08PM -0800, Linus Torvalds wrote:
> 
>>Actual UP machines are going to go away - even ARM is going SMP, and
>>in the PC space, we'll have multi-core laptops probably being the
>>rule rather than the exception in a couple of years.
> 
> CPUs in the embedded space could outnumber desktops & servers greatly
> (cell phones, access points, routers, media players, etc).  Most of
> these will be UP for some time.

It's unlikely, though, that you'd have a need to run an SMP-compiled 
kernel on these devices.

	-hpa


* Re: [patch] SMP alternatives
  2005-11-25  7:38                                               ` Chris Wedgwood
@ 2005-11-25 17:33                                                 ` Linus Torvalds
  2005-11-28 20:25                                                   ` Bill Davidsen
  2005-11-25 20:13                                                 ` H. Peter Anvin
  1 sibling, 1 reply; 117+ messages in thread
From: Linus Torvalds @ 2005-11-25 17:33 UTC (permalink / raw)
  To: Chris Wedgwood
  Cc: Alan Cox, H. Peter Anvin, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar



On Thu, 24 Nov 2005, Chris Wedgwood wrote:
> 
> CPUs in the embedded space could outnumber desktops & servers greatly
> (cell phones, access points, routers, media players, etc).  Most of
> these will be UP for some time.

That's not entirely clear either.

There are definite advantages to SMP even in the embedded space - or, to 
put it more strongly: _especially_ in the embedded space.

Not only does power usage go up cubically with frequency (which means that 
two cores are a lot more efficient than one at double the frequency), but 
embedded space also often has some clear separation between tasks that 
-can- be threaded (and often part of it has real-time characteristics, so 
getting a core of its own can be a good thing). Often more so than in the 
desktop space.
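
To put rough numbers on the cubic-scaling argument, here is a minimal 
sketch in C. It assumes an idealized model in which the power of one core 
is proportional to the cube of its clock frequency and cores add up 
linearly; static leakage and shared logic are ignored. The function names 
relative_power() and power_ratio() are made up for this illustration.

```c
#include <assert.h>

/*
 * Idealized model (an assumption, not a measurement): the dynamic power
 * of one core scales with the cube of its clock frequency, and cores add
 * linearly.  The result is in arbitrary units; only ratios are meaningful.
 */
static double relative_power(int cores, double mhz)
{
    assert(cores > 0 && mhz > 0.0);
    return cores * mhz * mhz * mhz;
}

/* Power of configuration A divided by power of configuration B. */
static double power_ratio(int cores_a, double mhz_a,
                          int cores_b, double mhz_b)
{
    return relative_power(cores_a, mhz_a) / relative_power(cores_b, mhz_b);
}
```

Under that model, power_ratio(2, 200.0, 1, 400.0) comes out at 0.25: two 
cores at 200MHz draw a quarter of what one core at double the frequency 
would. Against a single 300MHz core the ratio is 16/27, roughly 0.59.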

Now, obviously, in the "4- or 8-bit microcontroller" kind of embedded 
space, SMP isn't going to be a big issue. But anything that already uses 
an ARM, MIPS or a PowerPC-like chip, going SMP is not at all ridiculous. 
That includes things like cellphones, where one core might be for 
communication functions, and one for smartphone features. 

None of the cellphone manufacturers seem to be in the least interested in 
doing a "phone only" solution. They can already do that cheaply, they 
can't make much money off it, and they are all interested in features. And 
it really _is_ more power-efficient to have, say, a dual-core 200MHz chip 
than it is to have a single-core 300MHz one.

Now, sometimes those SMP systems will actually be used as "tightly coupled 
UP", where one of the CPU's is just basically a DSP. And from a power 
efficiency standpoint, having specialized hardware (and thus _A_MP rather 
than SMP) is obviously better, but in complex tasks - and communication 
tends to be that - general-purpose is often desirable enough that people 
will take the inefficiencies of a GP CPU over a fixed-function specialized 
DSP-kind of environment.

But SMP is absolutely _not_ unusual in embedded. It's been there for years 
already, and it's clearly moving downwards there too.

			Linus


* Re: [patch] SMP alternatives
  2005-11-23 21:36                                             ` Linus Torvalds
                                                                 ` (2 preceding siblings ...)
  2005-11-23 22:50                                               ` Alan Cox
@ 2005-11-25  7:38                                               ` Chris Wedgwood
  2005-11-25 17:33                                                 ` Linus Torvalds
  2005-11-25 20:13                                                 ` H. Peter Anvin
  3 siblings, 2 replies; 117+ messages in thread
From: Chris Wedgwood @ 2005-11-25  7:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, H. Peter Anvin, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

On Wed, Nov 23, 2005 at 01:36:08PM -0800, Linus Torvalds wrote:

> Actual UP machines are going to go away - even ARM is going SMP, and
> in the PC space, we'll have multi-core laptops probably being the
> rule rather than the exception in a couple of years.

CPUs in the embedded space could outnumber desktops & servers greatly
(cell phones, access points, routers, media players, etc).  Most of
these will be UP for some time.


* Re: [patch] SMP alternatives
  2005-11-24 23:35                                                                   ` Andi Kleen
  2005-11-25  0:13                                                                     ` Alan Cox
@ 2005-11-25  1:33                                                                     ` H. Peter Anvin
  2005-11-28 19:15                                                                     ` Bill Davidsen
  2 siblings, 0 replies; 117+ messages in thread
From: H. Peter Anvin @ 2005-11-25  1:33 UTC (permalink / raw)
  To: Andi Kleen
  Cc: thockin, Alan Cox, Eric W. Biederman, Gerd Knorr, Linus Torvalds,
	Dave Jones, Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Ingo Molnar

Andi Kleen wrote:
> 
> IMHO it's time that Linux gets more proactive regarding talking
> to BIOS vendors. Perhaps a generic "BIOS writers guide for Linux"
> would be a good thing.  I have at least one other extension I would like
> BIOS vendors to support. Just would need to come up with a writeup
> for a clearly defined specification.
> 

BIOS, and hardware.  I think Alan wrote something up way long ago, but 
it hasn't really been updated.

	-hpa


* Re: [patch] SMP alternatives
  2005-11-24 23:35                                                                   ` Andi Kleen
@ 2005-11-25  0:13                                                                     ` Alan Cox
  2005-11-25  1:33                                                                     ` H. Peter Anvin
  2005-11-28 19:15                                                                     ` Bill Davidsen
  2 siblings, 0 replies; 117+ messages in thread
From: Alan Cox @ 2005-11-25  0:13 UTC (permalink / raw)
  To: Andi Kleen
  Cc: thockin, Eric W. Biederman, Gerd Knorr, Linus Torvalds,
	Dave Jones, Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Ingo Molnar

On Gwe, 2005-11-25 at 00:35 +0100, Andi Kleen wrote:
> to BIOS vendors. Perhaps a generic "BIOS writers guide for Linux"
> would be a good thing.  I have at least one other extension I would like
> BIOS vendors to support. Just would need to come up with a writeup
> for a clearly defined specification.

I wrote one for 2.2 era but it never got updated. I've probably still
got it around somewhere.



* Re: [patch] SMP alternatives
  2005-11-24 22:48                                                                 ` thockin
@ 2005-11-24 23:35                                                                   ` Andi Kleen
  2005-11-25  0:13                                                                     ` Alan Cox
                                                                                       ` (2 more replies)
  0 siblings, 3 replies; 117+ messages in thread
From: Andi Kleen @ 2005-11-24 23:35 UTC (permalink / raw)
  To: thockin
  Cc: Alan Cox, Andi Kleen, Eric W. Biederman, Gerd Knorr,
	Linus Torvalds, Dave Jones, Zachary Amsden, Pavel Machek,
	Andrew Morton, Linux Kernel Mailing List, H. Peter Anvin,
	Zwane Mwaikambo, Pratap Subrahmanyam, Christopher Li,
	Ingo Molnar

On Thu, Nov 24, 2005 at 02:48:25PM -0800, thockin@hockin.org wrote:
> On Thu, Nov 24, 2005 at 11:12:14PM +0000, Alan Cox wrote:
> > On Iau, 2005-11-24 at 20:14 +0100, Andi Kleen wrote:
> > > I proposed something like that - best with an ASCII string
> > > ("First DIMM on the top left corner") But getting such stuff into BIOS 
> > > is difficult and long winded.
> > 
> > Propose it to the desktop management people and get it into the DMI
> > standard. They already have entries for each memory slot, they already
> > have entries for descriptive strings for connectors. In fact you may
> > well be able to 'bend' the spec enough to do it as is.
> 
> There are enough fields that maybe one of them is loose enough to mean
> this.  It doesn't help us convince mobo vendors to support it, though.

With arbitrary desktop/laptop/etc. vendors it's pretty hopeless I agree.
But I suspect there is a chance at least on the server side. There
is only a limited number of companies working on server BIOSes 
for their boards and they tend to be more receptive to Linux's need
because it's now a significant part of their market.

And it's clearly an obviously useful "RAS feature" which is
fully buzzword compatible and everything.

IMHO it's time that Linux gets more proactive regarding talking
to BIOS vendors. Perhaps a generic "BIOS writers guide for Linux"
would be a good thing.  I have at least one other extension I would like
BIOS vendors to support. Just would need to come up with a writeup
for a clearly defined specification.

-Andi


* Re: [patch] SMP alternatives
  2005-11-24 21:20                                                                     ` Andi Kleen
  2005-11-24 21:40                                                                       ` thockin
@ 2005-11-24 23:33                                                                       ` Eric W. Biederman
  1 sibling, 0 replies; 117+ messages in thread
From: Eric W. Biederman @ 2005-11-24 23:33 UTC (permalink / raw)
  To: Andi Kleen
  Cc: thockin, Alan Cox, Gerd Knorr, Linus Torvalds, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Ingo Molnar

Andi Kleen <ak@suse.de> writes:

> Thanks. But without a per board DIMM mapping it's pretty useless, isn't it?

Nope.

Getting a per board chip select to DIMM mapping is fairly easy, you
just need a lookup table of:
(memory_controller, chip_select, channel, dimm_label)

Which if the motherboard vendor does not give it to you is pretty
straightforward to discover just by plugging a minimal memory configuration
into various slots.  We can already query in software what the
motherboard is so keeping a table like this in user space is
not a problem.
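
A minimal sketch of such a user space table in C might look like the
following. The board contents, the labels, and the names dimm_map_entry,
example_board_map and lookup_dimm() are all hypothetical; a real table
would be filled in per motherboard, either from vendor data or by probing
slots one at a time as described above.

```c
#include <assert.h>
#include <stddef.h>

/* One row per (memory_controller, chip_select, channel) on a given board. */
struct dimm_map_entry {
    int memory_controller;
    int chip_select;
    int channel;
    const char *dimm_label;   /* label printed next to the slot */
};

/* Hypothetical table for an imaginary two-node board. */
static const struct dimm_map_entry example_board_map[] = {
    { 0, 0, 0, "CPU0 DIMM A1" },
    { 0, 1, 0, "CPU0 DIMM A1" },   /* second rank of the same DIMM */
    { 0, 2, 0, "CPU0 DIMM A2" },
    { 1, 0, 0, "CPU1 DIMM B1" },
};

#define EXAMPLE_BOARD_MAP_LEN \
    (sizeof(example_board_map) / sizeof(example_board_map[0]))

/* Return the slot label for the given triple, or NULL if it is unknown. */
static const char *lookup_dimm(const struct dimm_map_entry *map, size_t n,
                               int memory_controller, int chip_select,
                               int channel)
{
    assert(map != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (map[i].memory_controller == memory_controller &&
            map[i].chip_select == chip_select &&
            map[i].channel == channel)
            return map[i].dimm_label;
    }
    return NULL;
}
```

Two chip selects mapping to the same label is how a dual-rank DIMM would
show up in a table like this; a linear scan is plenty for the handful of
rows a single board needs.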

> One could detect the IO hole by reading the IORR MSRs or alternatively
> parsing the e820 map in /var/log/boot.msg

The problem is not detection but compensating for how it changes
the address.

I do agree that it would be nice if there was a standard for
BIOS's reporting this information.  In LinuxBIOS it is one of
those TODO list items we never quite get to.  But so far a user
space table has proved quite useful in practice.

Eric


* Re: [patch] SMP alternatives
  2005-11-24 19:14                                                             ` Andi Kleen
  2005-11-24 19:24                                                               ` thockin
@ 2005-11-24 23:12                                                               ` Alan Cox
  2005-11-24 22:48                                                                 ` thockin
  1 sibling, 1 reply; 117+ messages in thread
From: Alan Cox @ 2005-11-24 23:12 UTC (permalink / raw)
  To: Andi Kleen
  Cc: thockin, Eric W. Biederman, Gerd Knorr, Linus Torvalds,
	Dave Jones, Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Ingo Molnar

On Iau, 2005-11-24 at 20:14 +0100, Andi Kleen wrote:
> I proposed something like that - best with an ASCII string
> ("First DIMM on the top left corner") But getting such stuff into BIOS 
> is difficult and long winded.

Propose it to the desktop management people and get it into the DMI
standard. They already have entries for each memory slot, they already
have entries for descriptive strings for connectors. In fact you may
well be able to 'bend' the spec enough to do it as is.



* Re: [patch] SMP alternatives
  2005-11-24 23:12                                                               ` Alan Cox
@ 2005-11-24 22:48                                                                 ` thockin
  2005-11-24 23:35                                                                   ` Andi Kleen
  0 siblings, 1 reply; 117+ messages in thread
From: thockin @ 2005-11-24 22:48 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Eric W. Biederman, Gerd Knorr, Linus Torvalds,
	Dave Jones, Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Ingo Molnar

On Thu, Nov 24, 2005 at 11:12:14PM +0000, Alan Cox wrote:
> On Iau, 2005-11-24 at 20:14 +0100, Andi Kleen wrote:
> > I proposed something like that - best with an ASCII string
> > ("First DIMM on the top left corner") But getting such stuff into BIOS 
> > is difficult and long winded.
> 
> Propose it to the desktop management people and get it into the DMI
> standard. They already have entries for each memory slot, they already
> have entries for descriptive strings for connectors. In fact you may
> well be able to 'bend' the spec enough to do it as is.

There are enough fields that maybe one of them is loose enough to mean
this.  It doesn't help us convince mobo vendors to support it, though.


* Re: [patch] SMP alternatives
  2005-11-23 23:42                                                       ` Daniel Jacobowitz
  2005-11-23 23:59                                                         ` Linus Torvalds
@ 2005-11-24 22:32                                                         ` Ulrich Drepper
  2005-11-28 19:58                                                         ` Bill Davidsen
  2 siblings, 0 replies; 117+ messages in thread
From: Ulrich Drepper @ 2005-11-24 22:32 UTC (permalink / raw)
  To: Linux Kernel Mailing List

On 11/23/05, Daniel Jacobowitz <dan@debian.org> wrote:
> Those are the wrong ways of doing this in userspace.  There are right
> ways.  For instance, tag the binary at link time "single-threaded".

This works and the system is designed this way.  But it's unlikely
that any distribution will ship code like this since the maintenance
is too problematic.


> Glibc does not do this to the best of my knowledge.  It does select
> different code paths in various places based on the presence of
> multiple threads, but that's for cancellation, not for locking.

Wrong.  Linus is right, we jump over the lock prefix.  After a lot of
benchmarking I found this to be the fastest way and the Intel people
seemed to agree.


> This is also a trivially solvable problem in userspace; you make the
> dynamic linker enforce consistency of the tags.

This would require that potentially every single DSO is duplicated as
threaded and non-threaded.  If you like this you might as well enter
the horror world of BSD with their libc_r.  This will never fly, the
support costs are too high.


> The number of userspace libraries that use atomic operations is, in
> practice, quite small.

It's really not, and the number using them is growing.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


* Re: [patch] SMP alternatives
  2005-11-23 18:42                                         ` Linus Torvalds
                                                             ` (4 preceding siblings ...)
  2005-11-24  3:31                                           ` Mikulas Patocka
@ 2005-11-24 22:30                                           ` Ulrich Drepper
  5 siblings, 0 replies; 117+ messages in thread
From: Ulrich Drepper @ 2005-11-24 22:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel Mailing List

On 11/23/05, Linus Torvalds <torvalds@osdl.org> wrote:
> It _should_ be fairly easy to do something like that - just a simple
> global flag that gets set and makes CPL3 ignore lock prefixes.

This is a good first step.  But the effectiveness will decrease
significantly in the near future.  It can only work for
single-threaded processes without writable shared memory.

With the growing number of cores/threads the need to use parallelism
rises.  With techniques like OpenMP the threshold to do this is
lowered significantly.  The process-model, so much preferred on this
list over the thread model, requires shared memory, therefore also
eliminating the effectiveness of this functionality.

A real solution needs to be more fine grained.  It is often known in
the userland code whether the specific word which is accessed using
atomic ops really can be shared.  The POSIX interfaces, for instance,
require that all mutexes etc which are placed in shared memory are
attributed as such.  Combine this with the knowledge about the number
of threads in use and the result is that even with writable shared
memory segments the lock prefix can be avoided.  There are a whole
bunch of cases where we already do conditional locking.  It's just
plain ugly and not as efficient as we would like it.

So, implementing control with per-userlevel context will see rapidly
diminishing success and I'm wondering whether it is better to go for
something with a bit finer level of control.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


* Re: [patch] SMP alternatives
  2005-11-24 21:20                                                                     ` Andi Kleen
@ 2005-11-24 21:40                                                                       ` thockin
  2005-11-24 23:33                                                                       ` Eric W. Biederman
  1 sibling, 0 replies; 117+ messages in thread
From: thockin @ 2005-11-24 21:40 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Eric W. Biederman, Alan Cox, Gerd Knorr, Linus Torvalds,
	Dave Jones, Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Ingo Molnar

On Thu, Nov 24, 2005 at 10:20:00PM +0100, Andi Kleen wrote:
> > The below code works for us.  Note that I did not implement the
> > node-interleaving parts of the AMD algorithm.  If that matters, it should
> > be simple enough to do.  The BKDG has good pseudo-code.  The only thing it
> > gets absolutely wrong is the IO hole.
> 
> Thanks. But without a per board DIMM mapping it's pretty useless, isn't it?

Exactly.

> One could detect the IO hole by reading the IORR MSRs or alternatively
> parsing the e820 map in /var/log/boot.msg

Why bother?  In this process - turning a physical address into a DIMM,
you're poking at all the data anyway, just get the IO hole straight from
the chipset.


* Re: [patch] SMP alternatives
  2005-11-24 19:44                                                                   ` thockin
@ 2005-11-24 21:20                                                                     ` Andi Kleen
  2005-11-24 21:40                                                                       ` thockin
  2005-11-24 23:33                                                                       ` Eric W. Biederman
  0 siblings, 2 replies; 117+ messages in thread
From: Andi Kleen @ 2005-11-24 21:20 UTC (permalink / raw)
  To: thockin
  Cc: Andi Kleen, Eric W. Biederman, Alan Cox, Gerd Knorr,
	Linus Torvalds, Dave Jones, Zachary Amsden, Pavel Machek,
	Andrew Morton, Linux Kernel Mailing List, H. Peter Anvin,
	Zwane Mwaikambo, Pratap Subrahmanyam, Christopher Li,
	Ingo Molnar

On Thu, Nov 24, 2005 at 11:44:59AM -0800, thockin@hockin.org wrote:
> On Thu, Nov 24, 2005 at 08:29:53PM +0100, Andi Kleen wrote:
> > > We implemented AMD's reference algorithm, and made it work in the presence
> > > of a hardware IO hole.  It seems to work beautifully, but the last step is
> > > turning a (node:chip-select) into a (node:dimm).  Simple boards will use
> > > simple mappings, but we can't know that without board specific info.
> > > Especially with quad-rank DIMMs. :)
> > 
> > If you get something working it would be good if you could share the code
> > (even if it still needs to be tweaked) 
> 
> The below code works for us.  Note that I did not implement the
> node-interleaving parts of the AMD algorithm.  If that matters, it should
> be simple enough to do.  The BKDG has good pseudo-code.  The only thing it
> gets absolutely wrong is the IO hole.

Thanks. But without a per board DIMM mapping it's pretty useless, isn't it?

One could detect the IO hole by reading the IORR MSRs or alternatively
parsing the e820 map in /var/log/boot.msg
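
A rough sketch of the e820 variant, assuming the boot log carries lines of the form " BIOS-e820: <start> - <end> (<type>)"; that line format and the helper names parse_e820_usable() and e820_hole_base() are assumptions for illustration. It takes the highest end of usable RAM below 4GB as the hole base, which is only an approximation, since ACPI or reserved ranges may sit just under the hole:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define FOUR_GIG_BYTES	0x100000000ULL

/*
 * Parse one "BIOS-e820: <start> - <end> (<type>)" line (assumed format).
 * Returns 1 and fills in the range if the line describes usable RAM,
 * 0 otherwise.
 */
static int
parse_e820_usable(const char *line, uint64_t *start, uint64_t *end)
{
	const char *p = strstr(line, "BIOS-e820:");
	char type[32];

	if (p == NULL)
		return 0;
	if (sscanf(p, "BIOS-e820: %" SCNx64 " - %" SCNx64 " (%31[^)])",
		   start, end, type) != 3)
		return 0;
	return strcmp(type, "usable") == 0;
}

/*
 * Guess the IO hole base: the highest end of usable RAM below 4GB.
 * Returns 0 if no usable range below 4GB was seen.
 */
static uint64_t
e820_hole_base(const char *const *lines, int nlines)
{
	uint64_t start, end, base = 0;
	int i;

	for (i = 0; i < nlines; i++) {
		if (!parse_e820_usable(lines[i], &start, &end))
			continue;
		if (end <= FOUR_GIG_BYTES && end > base)
			base = end;
	}
	return base;
}
```

Feeding it the saved log lines one array entry per line is enough; anything that does not look like an e820 line is skipped.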

-Andi



* Re: [patch] SMP alternatives
  2005-11-24 19:29                                                                 ` Andi Kleen
@ 2005-11-24 19:44                                                                   ` thockin
  2005-11-24 21:20                                                                     ` Andi Kleen
  0 siblings, 1 reply; 117+ messages in thread
From: thockin @ 2005-11-24 19:44 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Eric W. Biederman, Alan Cox, Gerd Knorr, Linus Torvalds,
	Dave Jones, Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Ingo Molnar

On Thu, Nov 24, 2005 at 08:29:53PM +0100, Andi Kleen wrote:
> > We implemented AMD's reference algorithm, and made it work in the presence
> > of a hardware IO hole.  It seems to work beautifully, but the last step is
> > turning a (node:chip-select) into a (node:dimm).  Simple boards will use
> > simple mappings, but we can't know that without board specific info.
> > Especially with quad-rank DIMMs. :)
> 
> If you get something working it would be good if you could share the code
> (even if it still needs to be tweaked) 

The below code works for us.  Note that I did not implement the
node-interleaving parts of the AMD algorithm.  If that matters, it should
be simple enough to do.  The BKDG has good pseudo-code.  The only thing it
gets absolutely wrong is the IO hole.

Let me know if you see any problems here.  This is a userspace tool, but
should be trivial to adapt.


#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include "pci.h"

#define MAX_NODES	8
#define MAX_CS		8
#define NODE_DEV(n)	(24+(n))
#define FOUR_GIG	0x01000000	// shifted to hold address[39..8]

static char *progname;

static void
usage(void)
{
	fprintf(stderr, "usage: %s <address>\n", progname);
}

/*
 * Map a CS to a pair of DIMM slots (for 128 bit operation).  This mapping
 * is board-specific, and has the potential to be very ugly.
 */
static int cs_to_pair[MAX_CS][2] = {
	{ 0, 1 }, { 0, 1 },
	{ 2, 3 }, { 2, 3 },
	{ 4, 5 }, { 4, 5 },
	{ 6, 7 }, { 6, 7 },
};

int
main(int argc, char *argv[])
{
	unsigned long long raw_addr;
	uint32_t address;
	char *endp;
	int node;

	progname = argv[0];

	if (argc != 2) {
		usage();
		exit(1);
	}
	raw_addr = strtoull(argv[1], &endp, 0);
	if (endp == argv[1]) {
		usage();
		exit(2);
	}

	/*
	 * The address space is 40 bits (for now).  We want to use 32 bit
	 * values, so we convert the input address to hold address[39..8].
	 * We'll use this format everywhere.  This loses the
	 * low-order bits, so we keep a raw copy around.
	 */
	address = (raw_addr & 0xffffffffff) >> 8;

	/* find the node that holds this address */
	for (node = 0; node < MAX_NODES; node++) {
		int pci_dev;
		uint32_t tmp;
		int dram_en;
		uint32_t dram_base;
		uint32_t dram_limit;
		int hole_en;
		uint32_t hole_base;
		uint32_t hole_size;
		int cs;

		/*
		 * The DRAM map and IO hole info are in function 1 of each
		 * node.
		 */
		pci_dev = pci_open(0, NODE_DEV(node), 1);
		if (pci_dev < 0) {
			/* node does not exist */
			break;
		}

		/*
		 * DRAM_BASE and DRAM_LIMIT already hold address[39..8].
		 */
		tmp = pci_read32(pci_dev, 0x40+(node*8));
		dram_en = tmp & 0x3;
		dram_base = tmp & 0xffff0000;

		tmp = pci_read32(pci_dev, 0x44+(node*8));
		dram_limit = tmp | 0x0000ffff;

		/*
		 * HOLE_BASE holds address[31..24] in bits 31..24 of the
		 * register, so we shift it right by 8 to match the
		 * address[39..8] format.
		 */
		tmp = pci_read32(pci_dev, 0xf0);
		hole_en = tmp & 0x1;
		hole_base = (tmp & 0xff000000) >> 8;
		hole_size = FOUR_GIG - hole_base;

		pci_close(pci_dev);

		if (!dram_en) {
			/* no DRAM here */
			continue;
		}

		if (address > dram_limit) {
			/* keep looking */
			continue;
		}

		/*
		 * The address must be on this node.
		 */

		/* is the address in the IO hole? */
		if (hole_en && address >= hole_base && address < FOUR_GIG) {
			/* no DRAM in the IO hole */
			break;
		}

		/* is the address >= 4GB on a node with a hole? */
		if (hole_en && address >= FOUR_GIG) {
			/* adjust address for the IO hole */
			address -= hole_size;
		}

		/* adjust address to be node-relative */
		address -= dram_base;

		/* store addr[35..4] */
		address <<= 4;

		/* The chip-select map is in function 2 of each node. */
		pci_dev = pci_open(0, NODE_DEV(node), 2);
		if (pci_dev < 0) {
			/* cannot read the chip-select map */
			break;
		}

		/* find the chip-select that has this address */
		for (cs = 0; cs < MAX_CS; cs++) {
			uint32_t cs_base;
			uint32_t cs_mask;

			/*
			 * CS_BASE and CS_MASK hold address[35..4], the
			 * same format the address was just converted to.
			 */
			tmp = pci_read32(pci_dev, 0x40+(cs*4));
			if ((tmp & 0x1) == 0) {
				/* this CS is not enabled */
				continue;
			}
			cs_base = tmp & 0xffe0fe00;

			tmp = pci_read32(pci_dev, 0x60+(cs*4));
			cs_mask = (tmp | 0x001f01ff) & 0x3fffffff;

			/* this should handle interleaving, too */
			if ((address & ~cs_mask) == (cs_base & ~cs_mask)) {
				int *pair = cs_to_pair[cs];
				int dimm;

				/* which DIMM is this address on? */
				if ((raw_addr & (1<<3)) == 0) {
					dimm = pair[0];
				} else {
					dimm = pair[1];
				}

				printf("node %d, CS %d, DIMM %d\n",
				       node, cs, dimm);
				break;
			}
		}
		pci_close(pci_dev);
		break;
	}
	return 0;
}
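
To make the address bookkeeping above easier to check by hand, here is a small standalone sketch of just the address[39..8] conversion and the IO hole adjustment. The helper names to_addr_39_8() and hole_adjust() are made up for illustration; the arithmetic mirrors the tool above, and the 3GB hole base in the usage note is only an example value.

```c
#include <stdint.h>

#define FOUR_GIG	0x01000000u	/* 4GB >> 8, as in the tool above */

/* Convert a raw physical address to the address[39..8] format. */
static uint32_t
to_addr_39_8(unsigned long long raw_addr)
{
	return (uint32_t)((raw_addr & 0xffffffffffULL) >> 8);
}

/*
 * Apply the IO hole adjustment to an address in address[39..8] format.
 * hole_base uses the same format.  Returns 0 and stores the adjusted
 * address in *out, or -1 if the address falls inside the hole.
 */
static int
hole_adjust(uint32_t address, uint32_t hole_base, uint32_t *out)
{
	uint32_t hole_size = FOUR_GIG - hole_base;

	if (address >= hole_base && address < FOUR_GIG)
		return -1;		/* no DRAM in the IO hole */
	if (address >= FOUR_GIG)
		address -= hole_size;	/* DRAM hoisted above 4GB */
	*out = address;
	return 0;
}
```

With a hole starting at 3GB (hole_base 0x00c00000), physical address 4GB+1MB becomes 0x01001000 in this format and is adjusted to 0x00c01000, i.e. 3GB+1MB of actual DRAM.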


* Re: [patch] SMP alternatives
  2005-11-24 19:24                                                               ` thockin
@ 2005-11-24 19:29                                                                 ` Andi Kleen
  2005-11-24 19:44                                                                   ` thockin
  0 siblings, 1 reply; 117+ messages in thread
From: Andi Kleen @ 2005-11-24 19:29 UTC (permalink / raw)
  To: thockin
  Cc: Andi Kleen, Eric W. Biederman, Alan Cox, Gerd Knorr,
	Linus Torvalds, Dave Jones, Zachary Amsden, Pavel Machek,
	Andrew Morton, Linux Kernel Mailing List, H. Peter Anvin,
	Zwane Mwaikambo, Pratap Subrahmanyam, Christopher Li,
	Ingo Molnar

> We implemented AMD's reference algorithm, and made it work in the presence
> of a hardware IO hole.  It seems to work beautifully, but the last step is
> turning a (node:chip-select) into a (node:dimm).  Simple boards will use
> simple mappings, but we can't know that without board specific info.
> Especially with quad-rank DIMMs. :)

If you get something working it would be good if you could share the code
(even if it still needs to be tweaked) 

> 
> > > table to map chip-selects onto DIMMs? :)
> > 
> > I proposed something like that - best with an ASCII string
> > ("First DIMM on the top left corner") But getting such stuff into BIOS 
> > is difficult and long winded.
> 
> It would be easy enough to get into LinuxBIOS. :)
> 
> Seriously, this is work that is *long* overdue.  I have been wanting to
> look at this for over a year, but I have not had time.
> 
> Doing proper architecture and chipset-specific ECC/error handling which
> ties into a bigger abstracted error system is going to be really nice.

IMNSHO the x86-64 mce.c with its error log is a reasonable start. All
the smarts can be in user space and in mcelog.c. DIMM decoding
is a special case though because the information is really useful
to be printed onto the screen for fatal MCEs. So that one is better
in kernel space.

-Andi


* Re: [patch] SMP alternatives
  2005-11-24 19:16                                                 ` thockin
@ 2005-11-24 19:26                                                   ` Andi Kleen
  0 siblings, 0 replies; 117+ messages in thread
From: Andi Kleen @ 2005-11-24 19:26 UTC (permalink / raw)
  To: thockin
  Cc: Eric W. Biederman, Andi Kleen, Alan Cox, Gerd Knorr,
	Linus Torvalds, Dave Jones, Zachary Amsden, Pavel Machek,
	Andrew Morton, Linux Kernel Mailing List, H. Peter Anvin,
	Zwane Mwaikambo, Pratap Subrahmanyam, Christopher Li,
	Ingo Molnar

On Thu, Nov 24, 2005 at 11:16:36AM -0800, thockin@hockin.org wrote:
> On Thu, Nov 24, 2005 at 06:58:52AM -0700, Eric W. Biederman wrote:
> > > That's supposed to be done by hardware, no? 
> > > At least the K8 has a hardware scrubber (although it's not always enabled)
> > 
> > Recent good implementations like the Opteron will do it for you.
> > Older or cheaper memory controllers will not.
> 
> Beware of errata - there's at least one erratum on Opteron which forces you
> to choose between x4 (chipkill) ECC and scrubber.  One or the other, but
> not both.  There are plenty of errata on the scrubber alone.  Worse, if my
> (brain)memory is correct, without the scrubber, correctable errors are
> corrected on the fly, but never written back to DRAM.

All the scrub errata were fixed with E stepping AFAIK.

You have a point that using a sw scrubber might make sense on earlier
steppings though in case someone really wants chipkill.

-Andi


* Re: [patch] SMP alternatives
  2005-11-24 19:14                                                             ` Andi Kleen
@ 2005-11-24 19:24                                                               ` thockin
  2005-11-24 19:29                                                                 ` Andi Kleen
  2005-11-24 23:12                                                               ` Alan Cox
  1 sibling, 1 reply; 117+ messages in thread
From: thockin @ 2005-11-24 19:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Eric W. Biederman, Alan Cox, Gerd Knorr, Linus Torvalds,
	Dave Jones, Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Ingo Molnar

On Thu, Nov 24, 2005 at 08:14:46PM +0100, Andi Kleen wrote:
> > I'm curious about that too.  Even with k8 you can get down to a
> > chip-select, but that doesn't necessarily map to a DIMM in any useful way,
> > unless you have some mobo knowledge.  Are we going to need a new BIOS
> 
> Yeah that's my problem.
> 
> It's not theoretical. We had cases where someone had to go 
> through 10+ DIMMs on a big machine by trial and error to find
> out which one is wrong. Very bad situation.

I have the exact same problem right now.  As part of our early bootup we run
a simplish memory test.  Basically it's a "can the memory hold state"
test.  If anything fails, we have to identify as exactly as possible WHICH
DIMM needs to be replaced, so the hardware ops people can do it at
assembly/test time.
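
For illustration only, a "can the memory hold state" pass might look roughly like the sketch below. This is not our actual boot-time test: it runs over an ordinary buffer, the name mem_hold_test() and the patterns are made up, and a real test would have to walk physical memory with caching taken into account.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Write a pattern across the buffer, read it back, then repeat with the
 * inverted pattern.  Returns the index of the first word that failed to
 * hold its value, or -1 if every word held.
 */
static long
mem_hold_test(volatile uint32_t *buf, size_t nwords)
{
	static const uint32_t patterns[] = { 0x55555555u, 0xaaaaaaaau };
	size_t p, i;

	for (p = 0; p < sizeof(patterns) / sizeof(patterns[0]); p++) {
		/* mix in the index so a stuck address line also shows up */
		for (i = 0; i < nwords; i++)
			buf[i] = patterns[p] ^ (uint32_t)i;
		for (i = 0; i < nwords; i++) {
			if (buf[i] != (patterns[p] ^ (uint32_t)i))
				return (long)i;
		}
	}
	return -1;
}
```

The failing index is what then has to be turned into a physical address, and from there into a node, chip-select and DIMM.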

We implemented AMD's reference algorithm, and made it work in the presence
of a hardware IO hole.  It seems to work beautifully, but the last step is
turning a (node:chip-select) into a (node:dimm).  Simple boards will use
simple mappings, but we can't know that without board specific info.
Especially with quad-rank DIMMs. :)

> > table to map chip-selects onto DIMMs? :)
> 
> I proposed something like that - best with an ASCII string
> ("First DIMM on the top left corner") But getting such stuff into BIOS 
> is difficult and long winded.

It would be easy enough to get into LinuxBIOS. :)

Seriously, this is work that is *long* overdue.  I have been wanting to
look at this for over a year, but I have not had time.

Doing proper architecture and chipset-specific ECC/error handling which
ties into a bigger abstracted error system is going to be really nice.


* Re: [patch] SMP alternatives
  2005-11-24 13:58                                               ` Eric W. Biederman
@ 2005-11-24 19:16                                                 ` thockin
  2005-11-24 19:26                                                   ` Andi Kleen
  0 siblings, 1 reply; 117+ messages in thread
From: thockin @ 2005-11-24 19:16 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andi Kleen, Alan Cox, Gerd Knorr, Linus Torvalds, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Ingo Molnar

On Thu, Nov 24, 2005 at 06:58:52AM -0700, Eric W. Biederman wrote:
> > That's supposed to be done by hardware, no? 
> > At least the K8 has a hardware scrubber (although it's not always enabled)
> 
> Recent good implementations like the Opteron will do it for you.
> Older or cheaper memory controllers will not.

Beware of errata - there's at least one erratum on Opteron which forces you
to choose between x4 (chipkill) ECC and scrubber.  One or the other, but
not both.  There are plenty of errata on the scrubber alone.  Worse, if my
(brain)memory is correct, without the scrubber, correctable errors are
corrected on the fly, but never written back to DRAM.

> Having an architecturally sane software scrubber as backup for
> the hardware implementations is nice, and except in the cases
> where someone disables the lock prefix it takes very little
> code on x86.
> 
> Even on the Opteron you could theoretically have the case of a brain-dead
> external memory controller, although that is not likely.
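
A minimal sketch of what such a software scrubber could look like, assuming GCC-style inline assembly on x86. The idea is a locked read-modify-write that adds zero to each word, so the corrected value read through ECC is written back without changing it. The names scrub_word() and scrub_range() are made up, and a kernel version would have to walk physical memory and throttle itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Rewrite one word in place without changing its value. */
static void
scrub_word(volatile uint32_t *p)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
	/* locked add of zero: atomic read-modify-write, value unchanged */
	__asm__ __volatile__("lock; addl $0, %0" : "+m" (*p) : : "cc");
#else
	/* fallback: NOT atomic, only safe if nothing else writes here */
	*p = *p;
#endif
}

/* Scrub nwords 32-bit words starting at buf. */
static void
scrub_range(volatile uint32_t *buf, size_t nwords)
{
	size_t i;

	for (i = 0; i < nwords; i++)
		scrub_word(&buf[i]);
}
```

This is also why the lock prefix matters here: without it, a write from another CPU could land between the read and the write-back and be lost.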


* Re: [patch] SMP alternatives
  2005-11-24 19:12                                                           ` thockin
@ 2005-11-24 19:14                                                             ` Andi Kleen
  2005-11-24 19:24                                                               ` thockin
  2005-11-24 23:12                                                               ` Alan Cox
  0 siblings, 2 replies; 117+ messages in thread
From: Andi Kleen @ 2005-11-24 19:14 UTC (permalink / raw)
  To: thockin
  Cc: Andi Kleen, Eric W. Biederman, Alan Cox, Gerd Knorr,
	Linus Torvalds, Dave Jones, Zachary Amsden, Pavel Machek,
	Andrew Morton, Linux Kernel Mailing List, H. Peter Anvin,
	Zwane Mwaikambo, Pratap Subrahmanyam, Christopher Li,
	Ingo Molnar

> I'm curious about that too.  Even with k8 you can get down to a
> chip-select, but that doesn't necessarily map to a DIMM in any useful way,
> unless you have some mobo knowledge.  Are we going to need a new BIOS

Yeah that's my problem.

It's not theoretical. We had cases where someone had to go 
through 10+ DIMMs on a big machine by trial and error to find
out which one is wrong. Very bad situation.

[Double plus bad if it wasn't actually any of the DIMMs that were
bad, but one of the VRMs on a big Opteron - it causes all the same
symptoms as a bad DIMM :/]

> table to map chip-selects onto DIMMs? :)

I proposed something like that - best with an ASCII string
("First DIMM on the top left corner") But getting such stuff into BIOS 
is difficult and long winded.
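
Purely as a sketch of what such a table could look like if it lived in software instead of the BIOS: every entry, label and name below (struct dimm_label, board_dimms, dimm_label_lookup()) is hypothetical and would have to be filled in per board.

```c
#include <stddef.h>

/* One board-specific entry: which DIMM a (node, chip-select) lands on. */
struct dimm_label {
	int node;
	int cs;
	const char *label;	/* human-readable location on the board */
};

/* Hypothetical table for one board; both ranks of a DIMM share a label. */
static const struct dimm_label board_dimms[] = {
	{ 0, 0, "CPU0 DIMM 1 (top left corner)" },
	{ 0, 1, "CPU0 DIMM 1 (top left corner)" },
	{ 0, 2, "CPU0 DIMM 2" },
	{ 0, 3, "CPU0 DIMM 2" },
	{ 1, 0, "CPU1 DIMM 1" },
	{ 1, 1, "CPU1 DIMM 1" },
	{ 1, 2, "CPU1 DIMM 2" },
	{ 1, 3, "CPU1 DIMM 2" },
};

/* Return the label for (node, cs), or NULL if the board table lacks it. */
static const char *
dimm_label_lookup(int node, int cs)
{
	size_t i;

	for (i = 0; i < sizeof(board_dimms) / sizeof(board_dimms[0]); i++) {
		if (board_dimms[i].node == node && board_dimms[i].cs == cs)
			return board_dimms[i].label;
	}
	return NULL;
}
```

A fatal MCE handler could then print the string directly, and fall back to the raw node and chip-select when the lookup returns NULL.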

-Andi


* Re: [patch] SMP alternatives
  2005-11-24 15:36                                                         ` Andi Kleen
  2005-11-24 16:49                                                           ` Eric W. Biederman
@ 2005-11-24 19:12                                                           ` thockin
  2005-11-24 19:14                                                             ` Andi Kleen
  1 sibling, 1 reply; 117+ messages in thread
From: thockin @ 2005-11-24 19:12 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Eric W. Biederman, Alan Cox, Gerd Knorr, Linus Torvalds,
	Dave Jones, Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Ingo Molnar

On Thu, Nov 24, 2005 at 04:36:35PM +0100, Andi Kleen wrote:
> > The current k8
> > code has been delayed for this reason.
> > 
> > Where the EDAC code goes beyond the current k8 facilities is the
> > decode to the dimm level so that the bad memory stick can be
> > easily identified.
> 
> That would be nice to have, agreed. But I don't really know
> how to do this without mainboard specific knowledge.
> If you have something usable it's best to port it to mce.c
> or perhaps mcelog

I'm curious about that too.  Even with k8 you can get down to a
chip-select, but that doesn't necessarily map to a DIMM in any useful way,
unless you have some mobo knowledge.  Are we going to need a new BIOS
table to map chip-selects onto DIMMs? :)



* Re: [patch] SMP alternatives
  2005-11-24 15:15                                                   ` Alan Cox
  2005-11-24 14:55                                                     ` Andi Kleen
@ 2005-11-24 19:09                                                     ` thockin
  1 sibling, 0 replies; 117+ messages in thread
From: thockin @ 2005-11-24 19:09 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Eric W. Biederman, Gerd Knorr, Linus Torvalds,
	Dave Jones, Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Ingo Molnar

On Thu, Nov 24, 2005 at 03:15:24PM +0000, Alan Cox wrote:
> The Intel I have looked at generates an MCE on L2/L1/bus parity errors
> but not on a RAM ECC error as that is memory controller not CPU level.
> That usually asserts NMI. Same for most older chips PIII/AMD Athlon etc

Some BIOSes route that into SMI.  The BIOS can then log the error and tell
the OS via the nmi_now bit.  Uggh.


* Re: [patch] SMP alternatives
  2005-11-24 17:48 linux
@ 2005-11-24 18:48 ` Linus Torvalds
  0 siblings, 0 replies; 117+ messages in thread
From: Linus Torvalds @ 2005-11-24 18:48 UTC (permalink / raw)
  To: linux; +Cc: linux-kernel



On Thu, 24 Nov 2005, linux@horizon.com wrote:
>
> > I suspect that with MAP_SHARED + PROT_WRITE being pretty uncommon anyway, 
> > we can probably find trivial patterns in the kernel. Like only one process 
> > holding that file open - which is what you get with things that use mmap() 
> > to write a new file (I think "ld" used to have a config option to write 
> > files that way, for example).
> 
> Just a bit of practical experience: I use mmap() to write data a LOT,
> because msync(MS_ASYNC) is the most portable way to do an async write.

Sure. But I suspect that nobody else has that file open when you do so?

In other words, even your usage is something where the OS could tell that 
you don't actually need atomic operations. It certainly gets slightly more 
complicated (we'd need to trigger some special stuff if another process 
does an mmap on it), but it's not conceptually very difficult to just 
notice automatically and do the right thing(tm).

Now, if two programs are using mmap() to write to the same file at the 
same time, then the kernel can't tell any more. But in that case, you 
probably _do_ want atomic ops to be guaranteed, so not disabling them is 
the right thing to do there.

			Linus


* Re: [patch] SMP alternatives
@ 2005-11-24 18:24 colin
  0 siblings, 0 replies; 117+ messages in thread
From: colin @ 2005-11-24 18:24 UTC (permalink / raw)
  To: ak, linux-kernel; +Cc: linux

> For user space the primary trigger event would be "has any shared
> writable mappings or multiple threads". Even on a real MP system it's 
> perfectly ok to run a program with no writable shared mappings with LOCK off.
                       ^ single-threaded

> Depending on the workload this transition could happen quite often.
> Especially there is a worst case of an application allocating a few
> GB of memory and then starting a new thread.

One more thing, that may be too cute to be practical, but is worth mentioning:
shared address space or shared mappings only require LOCK if the memory
is ACTIVELY shared, i.e. used by DMA or by another task that is running
right now.

If you have a process with a helper thread that's asleep 99% of the time,
the savings of running with LOCK off might be worth the occasional
IPI to enable it on the main thread on the rare occasions that the
helper wakes up on a different processor.

For example, imagine a threaded async DNS resolver that tracks
TTL and times out cache entries.

If you have heavier-weight mutual exclusion, you don't need LOCK.


* Re: [patch] SMP alternatives
@ 2005-11-24 17:48 linux
  2005-11-24 18:48 ` Linus Torvalds
  0 siblings, 1 reply; 117+ messages in thread
From: linux @ 2005-11-24 17:48 UTC (permalink / raw)
  To: linux-kernel, torvalds; +Cc: linux

> I suspect that with MAP_SHARED + PROT_WRITE being pretty uncommon anyway, 
> we can probably find trivial patterns in the kernel. Like only one process 
> holding that file open - which is what you get with things that use mmap() 
> to write a new file (I think "ld" used to have a config option to write 
> files that way, for example).

Just a bit of practical experience: I use mmap() to write data a LOT,
because msync(MS_ASYNC) is the most portable way to do an async write.

There are two applications.  First, helping the OS not fill up with
dirty pages.  It's basically a way of saying "this page is not going to
be dirtied again for a long time".

Secondly, to reduce the latency of synchronous writes.  If I need to
log operations durably, it helps to

1) fill the log pages, using MS_ASYNC as soon as the page is full
2) when committing a batch, use MS_SYNC to force data to disk
3) report batch successfully committed to stable storage
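
In code, those three steps come out roughly as below. This is a simplified sketch rather than my real logger: log_batch(), its arguments and the fill byte are made up for illustration, and each page is filled with a single byte value instead of real log records.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write npages pages of log data through a shared mapping.
 * Returns 0 once the batch is on stable storage, -1 on any failure.
 */
static int
log_batch(const char *path, char fill, size_t npages)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	size_t len, i;
	char *log;
	int fd, rc = -1;

	if (pagesz <= 0 || npages == 0)
		return -1;
	len = (size_t)pagesz * npages;

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) < 0)
		goto out_close;
	log = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (log == MAP_FAILED)
		goto out_close;

	for (i = 0; i < npages; i++) {
		char *page = log + i * (size_t)pagesz;

		/* 1) fill the page, then start writeback as soon as it is full */
		memset(page, fill, (size_t)pagesz);
		if (msync(page, (size_t)pagesz, MS_ASYNC) < 0)
			goto out_unmap;
	}

	/* 2) commit: force the whole batch to disk */
	if (msync(log, len, MS_SYNC) < 0)
		goto out_unmap;

	/* 3) only now is it safe to report the batch as committed */
	rc = 0;
out_unmap:
	munmap(log, len);
out_close:
	close(fd);
	return rc;
}
```

By the time the MS_SYNC call runs, most pages have usually been queued already, which is where the latency win comes from.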

The aio_ routines are less widely supported, and some implementations have
very high overhead.  They would allow me to keep working while a commit
is in progress, but the above is simple and reduces the burstiness of
I/O considerably.


* Re: [patch] SMP alternatives
  2005-11-24 15:36                                                         ` Andi Kleen
@ 2005-11-24 16:49                                                           ` Eric W. Biederman
  2005-11-24 19:12                                                           ` thockin
  1 sibling, 0 replies; 117+ messages in thread
From: Eric W. Biederman @ 2005-11-24 16:49 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alan Cox, Gerd Knorr, Linus Torvalds, Dave Jones, Zachary Amsden,
	Pavel Machek, Andrew Morton, Linux Kernel Mailing List,
	H. Peter Anvin, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Ingo Molnar

Andi Kleen <ak@suse.de> writes:

> At least the Lindenhurst (E7520) datasheet says the chipset can trigger
> MCEs in the CPU (using a MCEERR# pin). I don't know if it's always
> enabled, but the hardware seems to have the capability.
> That's the oldest Intel server chipset supported with EM64T CPUs.
>
> Only the threshold counters are not supported directly.

I don't think that triggers on correctable errors, and I'm
not certain how useful the information reported is.  But it
should be at least as good as an NMI :)

Truthfully it really isn't just server chipsets that are interesting
either.  Anything that supports ECC or parity on memory is
interesting.

>> The current k8
>> code has been delayed for this reason.
>> 
>> Where the EDAC code goes beyond the current k8 facilities is the
>> decode to the dimm level so that the bad memory stick can be
>> easily identified.
>
> That would be nice to have, agreed. But I don't really know
> how to do this without mainboard specific knowledge.
> If you have something usable it's best to port it to mce.c
> or perhaps mcelog

We do this for every memory controller EDAC supports, so yes
we know how to implement this.  Merging the non-controversial
bits is coming.  But it is certainly a goal to take the
best of the mce code and the EDAC code to generate a good
k8 driver.

Motherboard specific knowledge is really not required.  All that 
is really required is memory controller specific knowledge.  With that
you can decode to the chip select level on most memory controllers.

Then you need just a little extra code (probably in user space) to map
the chip select to which dimm socket they go to on the motherboard.

The memory controller knowledge pretty much needs to be a
kernel driver because reading the memory controller registers 
can be a non-trivial exercise at times.  At least one piece of
Intel seems to like recommending that BIOS developers turn off the PCI
device that has the registers.

> There is a clear case for being architecture specific here. Some 
> architectures - like PPC64 or IA64 - have good firmware support for it, so it's
> best to use these facilities. On others like i386 and
> x86-64 the x86-64 log architecture is good. I might be a bit
> biased but I think it's very good and should be used on i386
> at some point too. I don't see any need for more.

The implementation clearly should be architecture specific, and
it will take more feeling out before any work can be done
to even think about unifying the interfaces.

Right now the goal is to simply get what has proven useful in the
real world merged.

Currently I do not see implementation problems but I do see
facilities not being as useful as they could be.  Getting little
things like decoding to the dimm sorted out removes hours of
work figuring out which dimm has problems.

I also see a large number of errors that the hardware can detect
that are going unreported because the code just isn't there.  There
are all kinds of bus errors that chipsets can report that are
just being ignored.

So now you know some of the hopes and dreams behind some of the EDAC
code and some more of the model.  Hopefully this will lead to productive
conversations as the code is merged.

Eric


* Re: [patch] SMP alternatives
  2005-11-24 15:09                                                       ` Eric W. Biederman
  2005-11-24 15:36                                                         ` Andi Kleen
@ 2005-11-24 16:02                                                         ` Alan Cox
  1 sibling, 0 replies; 117+ messages in thread
From: Alan Cox @ 2005-11-24 16:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andi Kleen, Gerd Knorr, Linus Torvalds, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Ingo Molnar

On Iau, 2005-11-24 at 08:09 -0700, Eric W. Biederman wrote:
> Where the EDAC code goes beyond the current k8 facilities is the
> decode to the dimm level so that the bad memory stick can be
> easily identified.

And also in finding/recording PCI parity errors (which will link nicely
into the IBM work for code to handle reported PCI errors).



* Re: [patch] SMP alternatives
  2005-11-24 15:09                                                       ` Eric W. Biederman
@ 2005-11-24 15:36                                                         ` Andi Kleen
  2005-11-24 16:49                                                           ` Eric W. Biederman
  2005-11-24 19:12                                                           ` thockin
  2005-11-24 16:02                                                         ` Alan Cox
  1 sibling, 2 replies; 117+ messages in thread
From: Andi Kleen @ 2005-11-24 15:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andi Kleen, Alan Cox, Gerd Knorr, Linus Torvalds, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Ingo Molnar

On Thu, Nov 24, 2005 at 08:09:07AM -0700, Eric W. Biederman wrote:
> Andi Kleen <ak@suse.de> writes:
> 
> > On Thu, Nov 24, 2005 at 03:15:24PM +0000, Alan Cox wrote:
> >> On Iau, 2005-11-24 at 15:22 +0100, Andi Kleen wrote:
> >> > What do you need a special driver for if the northbridge just
> >> > can do the scrubbing by itself?
> >> 
> >> You need a driver to collect and report all the ECC single bit errors to
> >> the user so that they can decide if they have problem hardware.
> >
> > Assuming the errors are logged to the standard machine check
> > architecture that's already done by mce.c. K8 does that definitely.
> >
> > Take a look at mcelog at some point.
> > Your distro probably already sets it up by default to log to
> > /var/log/mcelog
> >
> >> 
> >> EDAC is more than one thing
> >> 	- Control response to a fatal error
> >> 	- Report non-fatal events for analysis/user decision making
> >
> > x86-64 mce.c does all that. There was even a port to i386 around at some point.
> 
> Right, on the k8 memory controller there is a lot of overlap
> with what has already been implemented.  For all other x86 memory
> controllers the code is filling a large void.  

At least the Lindenhurst (E7520) datasheet says the chipset can trigger
MCEs in the CPU (using a MCEERR# pin). I don't know if it's always
enabled, but the hardware seems to have the capability.
That's the oldest Intel server chipset supported with EM64T CPUs.

Only the threshold counters are not supported directly.


> The current k8
> code has been delayed for this reason.
> 
> Where the EDAC code goes beyond the current k8 facilities is the
> decode to the dimm level so that the bad memory stick can be
> easily identified.

That would be nice to have, agreed. But I don't really know
how to do this without mainboard specific knowledge.
If you have something usable it's best to port it to mce.c
or perhaps mcelog

> 
> One of the goals of the EDAC code is to work towards a unified
> kernel architecture for this kind of error reporting.  Currently every
> architecture (if the errors are reported at all) handles this
> differently, which makes it very hard to do something sane in
> user space.

There is a clear case for being architecture specific here. Some 
architectures - like PPC64 or IA64 - have good firmware support for it, so it's
best to use these facilities. On others like i386 and
x86-64 the x86-64 log architecture is good. I might be a bit
biased but I think it's very good and should be used on i386
at some point too. I don't see any need for more.

-Andi



* Re: [patch] SMP alternatives
  2005-11-24 14:22                                                 ` Andi Kleen
@ 2005-11-24 15:15                                                   ` Alan Cox
  2005-11-24 14:55                                                     ` Andi Kleen
  2005-11-24 19:09                                                     ` thockin
  0 siblings, 2 replies; 117+ messages in thread
From: Alan Cox @ 2005-11-24 15:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Eric W. Biederman, Gerd Knorr, Linus Torvalds, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Ingo Molnar

On Iau, 2005-11-24 at 15:22 +0100, Andi Kleen wrote:
> What do you need a special driver for if the northbridge just
> can do the scrubbing by itself?

You need a driver to collect and report all the ECC single bit errors to
the user so that they can decide if they have problem hardware.

EDAC is more than one thing
	- Control response to a fatal error
	- Report non-fatal events for analysis/user decision making

Hardware scrubbing is good, but knowing the rate of non-fatal errors and
the trend in rate of errors is essential to planning and management of
systems.

> On the modern systems I'm familiar with it's a machine check (although
> not necessarily a recoverable one and there might be other bad
> side effects) 

The Intel systems I have looked at generate an MCE on L2/L1/bus parity
errors but not on a RAM ECC error, as that is at the memory controller
level, not the CPU level. That usually asserts NMI. Same for most older
chips (PIII, AMD Athlon, etc).

> > The -mm EDAC code works on the basic assumption that unrecovered ECC is
> > a system halter although that is configurable.
> 
> I don't know what you could do over the default code for K8 at least.
> And on modern Intel server chipsets I would expect it also to not
> be needed.

Varies a lot again. Hopefully that'll simplify as/when/if Intel put the
memory controller on CPU.

Alan


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-24 14:55                                                     ` Andi Kleen
@ 2005-11-24 15:09                                                       ` Eric W. Biederman
  2005-11-24 15:36                                                         ` Andi Kleen
  2005-11-24 16:02                                                         ` Alan Cox
  0 siblings, 2 replies; 117+ messages in thread
From: Eric W. Biederman @ 2005-11-24 15:09 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alan Cox, Gerd Knorr, Linus Torvalds, Dave Jones, Zachary Amsden,
	Pavel Machek, Andrew Morton, Linux Kernel Mailing List,
	H. Peter Anvin, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Ingo Molnar

Andi Kleen <ak@suse.de> writes:

> On Thu, Nov 24, 2005 at 03:15:24PM +0000, Alan Cox wrote:
>> On Iau, 2005-11-24 at 15:22 +0100, Andi Kleen wrote:
>> > What do you need a special driver for if the northbridge just
>> > can do the scrubbing by itself?
>> 
>> You need a driver to collect and report all the ECC single bit errors to
>> the user so that they can decide if they have problem hardware.
>
> Assuming the errors are logged to the standard machine check
> architecture that's already done by mce.c. K8 does that definitely.
>
> Take a look at mcelog at some point.
> Your distro probably already sets it up by default to log to
> /var/log/mcelog
>
>> 
>> EDAC is more than one thing
>> 	- Control response to a fatal error
>> 	- Report non-fatal events for analysis/user decision making
>
> x86-64 mce.c does all that. There was even a port to i386 around at some point.

Right, on the k8 memory controller there is a lot of overlap
with what has already been implemented.  For all other x86 memory
controllers the code is filling a large void.  The current k8
code has been delayed for this reason.

Where the EDAC code goes beyond the current k8 facilities is the
decode to the dimm level so that the bad memory stick can be
easily identified.

One of the goals of the EDAC code is to work towards a unified
kernel architecture for this kind of error reporting.  Currently every
architecture (if the errors are reported at all) handles this
differently, which makes it very hard to do something sane in
user space.

Eric


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-24 15:15                                                   ` Alan Cox
@ 2005-11-24 14:55                                                     ` Andi Kleen
  2005-11-24 15:09                                                       ` Eric W. Biederman
  2005-11-24 19:09                                                     ` thockin
  1 sibling, 1 reply; 117+ messages in thread
From: Andi Kleen @ 2005-11-24 14:55 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Eric W. Biederman, Gerd Knorr, Linus Torvalds,
	Dave Jones, Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Ingo Molnar

On Thu, Nov 24, 2005 at 03:15:24PM +0000, Alan Cox wrote:
> On Iau, 2005-11-24 at 15:22 +0100, Andi Kleen wrote:
> > What do you need a special driver for if the northbridge just
> > can do the scrubbing by itself?
> 
> You need a driver to collect and report all the ECC single bit errors to
> the user so that they can decide if they have problem hardware.

Assuming the errors are logged to the standard machine check
architecture that's already done by mce.c. K8 does that definitely.

Take a look at mcelog at some point.
Your distro probably already sets it up by default to log to
/var/log/mcelog

> 
> EDAC is more than one thing
> 	- Control response to a fatal error
> 	- Report non-fatal events for analysis/user decision making

x86-64 mce.c does all that. There was even a port to i386 around at some point.

-Andi

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-24 13:39                                             ` Andi Kleen
  2005-11-24 13:58                                               ` Eric W. Biederman
@ 2005-11-24 14:34                                               ` Alan Cox
  2005-11-24 14:22                                                 ` Andi Kleen
  1 sibling, 1 reply; 117+ messages in thread
From: Alan Cox @ 2005-11-24 14:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Eric W. Biederman, Gerd Knorr, Linus Torvalds, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Ingo Molnar

On Iau, 2005-11-24 at 14:39 +0100, Andi Kleen wrote:
> That's supposed to be done by hardware, no? 

Varies immensely by system. Where there is a hardware scrubber and it is
enabled it will be used. One nice thing about K8 is the mem controller
is in the CPU so they all use the same driver (not yet merged)

> If you try to do it this way then the code will become such
> a mess, if not impossible to write, that your chances to merge it
> and get it right are very slim. The only sane way to do all the locking etc.
> is to hand over the handling to a thread. While that makes the window
> of misusing the data wider it's the only sane alternative vs not
> doing it at all.

It's utterly hideous because the usual 'ECC error' reporting technique
for an uncorrectable error is an NMI. Locks could be in any state at
this point and even the registers needing to be accessed are across PCI
and we could be half way through a PCI configuration cycle.

The -mm EDAC code works on the basic assumption that unrecovered ECC is
a system halter although that is configurable.


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-24 13:13                                         ` Andi Kleen
  2005-11-24 13:30                                           ` Eric W. Biederman
@ 2005-11-24 14:30                                           ` Alan Cox
  1 sibling, 0 replies; 117+ messages in thread
From: Alan Cox @ 2005-11-24 14:30 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Gerd Knorr, Linus Torvalds, Dave Jones, Zachary Amsden,
	Pavel Machek, Andrew Morton, Linux Kernel Mailing List,
	H. Peter Anvin, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

On Iau, 2005-11-24 at 14:13 +0100, Andi Kleen wrote:
> > 1. The lock behaviour *is* defined for main memory access by all bus
> > masters.
> 
> For uncached memory, right?

For all memory. Same as processor to processor. 

> > 2. Uncached mappings are unworkable for this because we must never have
> > a page mapped with conflicting cache types - that's ugly, and plain
> > horrific on SMP.
> 
> For kernel mapping change_page_attr() takes care of it,
> and for user space memory following all mappings is the only
> reliable way to find out which process needs to be killed
> anyways - and when you do that you can as well unmap
> or just kill.

You are working from a kernel view address of a page that may be user
space. You don't need or want to kill anything because you are scrubbing
a corrected error.

> 
> > 3. Uncached has undefined semantics when racing a PCI master. Lock has
> > defined semantics. An uncached add #0 is permitted to read the memory
> > and then write it back as two different cycles and I suspect does.
> 
> Consider what happens with such a race: either the PCI master
> gets a bus abort because it still sees the corrupted data.
> Or it already accesses the repaired data. Both are ok.

This is a correctable error so there would be no abort. And there is a
race if you think for a microsecond or two

		Scrubber reads   (add #0 load of input)
		Bus master writes #FFFFFFFF
		Scrubber writes back #0

The lock prefix ensures that doesn't occur.
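
A minimal sketch of that interleaving in C, replayed deterministically
rather than with real DMA. All function names here are hypothetical and
only illustrate the ordering above. C11 atomic_fetch_add() stands in for
a lock-prefixed "add #0"; a compiler is allowed to turn an add of zero
into a fence plus a load, so a real scrubber would more likely use
inline assembly to force the write-back.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helpers: the unlocked scrub split into its two bus
 * cycles, so that another bus master's write can land between them. */
static uint32_t scrub_load(const volatile uint32_t *word)
{
    return *word + 0;            /* read the (corrected) value, add #0 */
}

static void scrub_store(volatile uint32_t *word, uint32_t value)
{
    *word = value;               /* write the possibly stale copy back */
}

/* Scrub as one indivisible read-modify-write (stand-in for lock add). */
static void scrub_locked(_Atomic uint32_t *word)
{
    atomic_fetch_add(word, 0);
}

/* Replays the three steps from the mail; returns the final contents. */
static uint32_t replay_unlocked_race(void)
{
    volatile uint32_t word = 0;
    uint32_t seen = scrub_load(&word);   /* scrubber reads 0 */
    word = 0xFFFFFFFFu;                  /* bus master writes #FFFFFFFF */
    scrub_store(&word, seen);            /* scrubber writes back #0 */
    return word;                         /* the bus master's write is lost */
}

/* With an indivisible scrub the write can only fall before or after it. */
static uint32_t replay_locked(void)
{
    _Atomic uint32_t word = 0;
    scrub_locked(&word);                 /* whole scrub before the write */
    atomic_store(&word, 0xFFFFFFFFu);    /* bus master writes #FFFFFFFF */
    scrub_locked(&word);                 /* whole scrub after the write */
    return atomic_load(&word);           /* the write survives */
}
```

Calling replay_unlocked_race() shows the lost update; replay_locked()
keeps the bus master's value in either ordering.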

> > 4. The AMD BIOS guide requires both that LOCK is enabled by default and
> > that the "lock affects the external bus" bit is clear to enable locking
> > on the external bus.
> 
> The "Linux guidelines" might be different.

Then the EDAC code will reconfigure the registers back as expected by
the PC architecture. I've no problem with that and if EDAC is the only
person requiring a semantic that is more expensive it can flip the bits
back and forth. No need for anyone else to pay that cost.

Alan

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-24 14:34                                               ` Alan Cox
@ 2005-11-24 14:22                                                 ` Andi Kleen
  2005-11-24 15:15                                                   ` Alan Cox
  0 siblings, 1 reply; 117+ messages in thread
From: Andi Kleen @ 2005-11-24 14:22 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Eric W. Biederman, Gerd Knorr, Linus Torvalds,
	Dave Jones, Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Ingo Molnar

On Thu, Nov 24, 2005 at 02:34:07PM +0000, Alan Cox wrote:
> On Iau, 2005-11-24 at 14:39 +0100, Andi Kleen wrote:
> > That's supposed to be done by hardware, no? 
> 
> Varies immensely by system. Where there is a hardware scrubber and it is
> enabled it will be used. One nice thing about K8 is the mem controller
> is in the CPU so they all use the same driver (not yet merged)

What do you need a special driver for if the northbridge just
can do the scrubbing by itself?

> > If you try to do it this way then the code will become such
> > a mess, if not impossible to write, that your chances to merge it
> > and get it right are very slim. The only sane way to do all the locking etc.
> > is to hand over the handling to a thread. While that makes the window
> > of misusing the data wider it's the only sane alternative vs not
> > doing it at all.
> 
> It's utterly hideous because the usual 'ECC error' reporting technique
> for an uncorrectable error is an NMI. Locks could be in any state at

On the modern systems I'm familiar with it's a machine check (although
not necessarily a recoverable one and there might be other bad
side effects) 

> this point and even the registers needing to be accessed are across PCI
> and we could be half way through a PCI configuration cycle.
> 
> The -mm EDAC code works on the basic assumption that unrecovered ECC is
> a system halter although that is configurable.

I don't know what you could do over the default code for K8 at least.
And on modern Intel server chipsets I would expect it also to not
be needed.

-Andi


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-24 13:39                                             ` Andi Kleen
@ 2005-11-24 13:58                                               ` Eric W. Biederman
  2005-11-24 19:16                                                 ` thockin
  2005-11-24 14:34                                               ` Alan Cox
  1 sibling, 1 reply; 117+ messages in thread
From: Eric W. Biederman @ 2005-11-24 13:58 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alan Cox, Gerd Knorr, Linus Torvalds, Dave Jones, Zachary Amsden,
	Pavel Machek, Andrew Morton, Linux Kernel Mailing List,
	H. Peter Anvin, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Ingo Molnar

Andi Kleen <ak@suse.de> writes:

>> I think I see the source of the confusion.  Scrubbing is the
>> process of taking data that is correctable and writing it back to
>> memory so that if a second correctable error occurs the net is still
>> corrected.
>
> That's supposed to be done by hardware, no? 
> At least the K8 has a hardware scrubber (although it's not always enabled)

Recent good implementations like the Opteron will do it for you.
Older or cheaper memory controllers will not.

Having an architecturally sane software scrubber as backup for
the hardware implementations is nice, and except in the cases
where someone disables the lock prefix it takes very little
code on x86.

Even on the Opteron you could theoretically have the case of a brain-dead
external memory controller, although that is not likely.

>> Directed killing of processes is something that must be done
>> inside a synchronous exception (like a machine check) because otherwise
>> it is so racy you don't know who has seen the bad data.  
>
> If you try to do it this way then the code will become such
> a mess, if not impossible to write, that your chances to merge it
> and get it right are very slim. The only sane way to do all the locking etc.
> is to hand over the handling to a thread. While that makes the window
> of misusing the data wider it's the only sane alternative vs not
> doing it at all.
>
> Also due to the way hardware works, with machine checks usually being
> async and not precise, you have that window anyways, so it's
> not even worse. Also consider multiple CPUs.

First I don't have any code to do this, but I have thought about it.
The races are the primary reason I have never pushed for something
like this.  With memory errors coming in as machine checks it is now
possible to do a correct version.

Essentially we are talking about something with the complexity of a page
fault.  All that must happen synchronously is the task must
be stopped, and flagged.

As for races every cpu that accesses that data should take
a synchronous exception.  DMA should do something similar but I'm 
not as familiar with that side of the problem.  And because everything
takes an exception multiple cpu races are not a problem.

Of course there are still the memory errors that are so bad that
they don't even cause a machine check.  Those are a real pain
to debug, and fix.

In this latter case assuming the memory error is transient and
not hard using a write-combine memory attribute when you write
to re-initialize the ECC state is the way to go.  But remember
you must do it on the cpu that is part of the memory controller.
Otherwise something in the whole read/modify process will fail
to get the ECC state initialized properly on an Opteron.

Eric

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-24 13:30                                           ` Eric W. Biederman
@ 2005-11-24 13:39                                             ` Andi Kleen
  2005-11-24 13:58                                               ` Eric W. Biederman
  2005-11-24 14:34                                               ` Alan Cox
  0 siblings, 2 replies; 117+ messages in thread
From: Andi Kleen @ 2005-11-24 13:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andi Kleen, Alan Cox, Gerd Knorr, Linus Torvalds, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Ingo Molnar

> I think I see the source of the confusion.  Scrubbing is the
> process of taking data that is correctable and writing it back to
> memory so that if a second correctable error occurs the net is still
> corrected.

That's supposed to be done by hardware, no? 
At least the K8 has a hardware scrubber (although it's not always enabled)

> Directed killing of processes is something that must be done
> inside a synchronous exception (like a machine check) because otherwise
> it is so racy you don't know who has seen the bad data.  

If you try to do it this way then the code will become such
a mess, if not impossible to write, that your chances to merge it
and get it right are very slim. The only sane way to do all the locking etc.
is to hand over the handling to a thread. While that makes the window
of misusing the data wider it's the only sane alternative vs not
doing it at all.

Also due to the way hardware works, with machine checks usually being
async and not precise, you have that window anyways, so it's
not even worse. Also consider multiple CPUs.

-Andi

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-24 13:13                                         ` Andi Kleen
@ 2005-11-24 13:30                                           ` Eric W. Biederman
  2005-11-24 13:39                                             ` Andi Kleen
  2005-11-24 14:30                                           ` Alan Cox
  1 sibling, 1 reply; 117+ messages in thread
From: Eric W. Biederman @ 2005-11-24 13:30 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alan Cox, Gerd Knorr, Linus Torvalds, Dave Jones, Zachary Amsden,
	Pavel Machek, Andrew Morton, Linux Kernel Mailing List,
	H. Peter Anvin, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Ingo Molnar

Andi Kleen <ak@suse.de> writes:

>> 2. Uncached mappings are unworkable for this because we must never have
>> a page mapped with conflicting cache types - that's ugly, and plain
>> horrific on SMP.
>
> For kernel mapping change_page_attr() takes care of it,
> and for user space memory following all mappings is the only
> reliable way to find out which process needs to be killed
> anyways - and when you do that you can as well unmap
> or just kill.

I think I see the source of the confusion.  Scrubbing is the
process of taking data that is correctable and writing it back to
memory so that if a second correctable error occurs the net is still
corrected.

Directed killing of processes is something that must be done
inside a synchronous exception (like a machine check) because otherwise
it is so racy you don't know who has seen the bad data.  

Eric

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 22:00                                       ` Alan Cox
@ 2005-11-24 13:13                                         ` Andi Kleen
  2005-11-24 13:30                                           ` Eric W. Biederman
  2005-11-24 14:30                                           ` Alan Cox
  0 siblings, 2 replies; 117+ messages in thread
From: Andi Kleen @ 2005-11-24 13:13 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Gerd Knorr, Linus Torvalds, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Eric W. Biederman,
	Ingo Molnar

> 1. The lock behaviour *is* defined for main memory access by all bus
> masters.

For uncached memory, right?

> 2. Uncached mappings are unworkable for this because we must never have
> a page mapped with conflicting cache types - that's ugly, and plain
> horrific on SMP.

For kernel mapping change_page_attr() takes care of it,
and for user space memory following all mappings is the only
reliable way to find out which process needs to be killed
anyways - and when you do that you can as well unmap
or just kill.

> 3. Uncached has undefined semantics when racing a PCI master. Lock has
> defined semantics. An uncached add #0 is permitted to read the memory
> and then write it back as two different cycles and I suspect does.

Consider what happens with such a race: either the PCI master
gets a bus abort because it still sees the corrupted data.
Or it already accesses the repaired data. Both are ok.

> 4. The AMD BIOS guide requires both that LOCK is enabled by default and
> that the "lock affects the external bus" bit is clear to enable locking
> on the external bus.

The "Linux guidelines" might be different.

-Andi

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-24 13:01                                                       ` Pádraig Brady
@ 2005-11-24 13:12                                                         ` Arjan van de Ven
  0 siblings, 0 replies; 117+ messages in thread
From: Arjan van de Ven @ 2005-11-24 13:12 UTC (permalink / raw)
  To: Pádraig Brady
  Cc: Linus Torvalds, Daniel Jacobowitz, Alan Cox, H. Peter Anvin,
	Andi Kleen, Gerd Knorr, Dave Jones, Zachary Amsden, Pavel Machek,
	Andrew Morton, Linux Kernel Mailing List, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Eric W. Biederman,
	Ingo Molnar


> Just a note to say glibc is getting better wrt locking.
> Compare the results and trivial test program here:
> http://lkml.org/lkml/2001/12/7/75
> That showed that for glibc 2.2.4, getc & putc
> were 669% slower than the unlocked versions.
> 
> 4 years later and with 2.3.5-1ubuntu1, getc & putc
> are only 230% slower than the unlocked versions:
> 
> $ dd bs=1MB count=100 if=/dev/zero | ./locked >/dev/null
> 100000000 bytes transferred in 3.709362 seconds (26958813 bytes/sec)
> $ dd bs=1MB count=100 if=/dev/zero | ./unlocked >/dev/null
> 100000000 bytes transferred in 1.602427 seconds (62405339 bytes/sec)

this could also mean that the unlocked version has gotten slower ;)




^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 23:08                                                     ` Linus Torvalds
                                                                         ` (2 preceding siblings ...)
  2005-11-24  1:02                                                       ` Jeff Garzik
@ 2005-11-24 13:01                                                       ` Pádraig Brady
  2005-11-24 13:12                                                         ` Arjan van de Ven
  2005-11-28 19:52                                                       ` Bill Davidsen
  4 siblings, 1 reply; 117+ messages in thread
From: Pádraig Brady @ 2005-11-24 13:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Daniel Jacobowitz, Alan Cox, H. Peter Anvin, Andi Kleen,
	Gerd Knorr, Dave Jones, Zachary Amsden, Pavel Machek,
	Andrew Morton, Linux Kernel Mailing List, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Eric W. Biederman,
	Ingo Molnar

Linus Torvalds wrote:

>On Wed, 23 Nov 2005, Daniel Jacobowitz wrote:
>  
>
>>Why should we use a silicon based solution for this, when I posit that
>>there are simpler and equally effective userspace solutions?
>>    
>>
>
>Name them.
>
>In user space, doing things like clever run-time linking things is 
>actually horribly bad. It causes COW faults at startup, and/or makes the 
>compiler have to do indirections unnecessarily.  Both of which actually 
>make caches less effective, because now processes that really effectively 
>do have exactly the same contents have them in different pages.
>
>The other alternative (which apparently glibc actually does use) is to 
>dynamically branch over the lock prefixes, which actually works better: 
>it's more work dynamically, but it's much cheaper from a startup 
>standpoint and there's no memory duplication, so while it is the "stupid" 
>approach, it's actually better than the clever one.
>  
>
Just a note to say glibc is getting better wrt locking.
Compare the results and trivial test program here:
http://lkml.org/lkml/2001/12/7/75
That showed that for glibc 2.2.4, getc & putc
were 669% slower than the unlocked versions.

4 years later and with 2.3.5-1ubuntu1, getc & putc
are only 230% slower than the unlocked versions:

$ dd bs=1MB count=100 if=/dev/zero | ./locked >/dev/null
100000000 bytes transferred in 3.709362 seconds (26958813 bytes/sec)
$ dd bs=1MB count=100 if=/dev/zero | ./unlocked >/dev/null
100000000 bytes transferred in 1.602427 seconds (62405339 bytes/sec)
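
For readers without the linked test program, the two variants can be
sketched roughly as below. This is an assumption about their shape, not
the original source: copy_locked() and copy_unlocked() are hypothetical
names, and the only difference between them is whether the stdio lock
is taken once per character (getc/putc) or once around the whole loop
(POSIX flockfile() with getc_unlocked/putc_unlocked).

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

/* Copy a stream with the locking stdio calls: every getc() and putc()
 * acquires and releases the FILE lock. Returns the bytes copied. */
static long copy_locked(FILE *in, FILE *out)
{
    long n = 0;
    int c;

    while ((c = getc(in)) != EOF) {
        putc(c, out);
        n++;
    }
    return n;
}

/* Same copy, but take each FILE lock once and use the unlocked
 * variants inside the loop. Returns the bytes copied. */
static long copy_unlocked(FILE *in, FILE *out)
{
    long n = 0;
    int c;

    flockfile(in);
    flockfile(out);
    while ((c = getc_unlocked(in)) != EOF) {
        putc_unlocked(c, out);
        n++;
    }
    funlockfile(out);
    funlockfile(in);
    return n;
}
```

A ./locked binary would then call copy_locked(stdin, stdout) from
main(), and ./unlocked would call copy_unlocked(stdin, stdout).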

Pádraig.

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-24  3:31                                           ` Mikulas Patocka
@ 2005-11-24  3:55                                             ` H. Peter Anvin
  0 siblings, 0 replies; 117+ messages in thread
From: H. Peter Anvin @ 2005-11-24  3:55 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Linus Torvalds, Alan Cox, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

Mikulas Patocka wrote:
> 
> Why should they waste their (already complex) decoding logic with that?
> Why can't an application instead set a bit somewhere if it's running on
> SMP and if it's threaded and branch to variants with and without lock
> prefix?
> 

Why?  Because it would make the same code run faster on their CPU than 
on their competitor's.  That's the kind of things that make x86 vendors 
smile.

	-hpa

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 18:42                                         ` Linus Torvalds
                                                             ` (3 preceding siblings ...)
  2005-11-23 21:44                                           ` Alan Cox
@ 2005-11-24  3:31                                           ` Mikulas Patocka
  2005-11-24  3:55                                             ` H. Peter Anvin
  2005-11-24 22:30                                           ` Ulrich Drepper
  5 siblings, 1 reply; 117+ messages in thread
From: Mikulas Patocka @ 2005-11-24  3:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Alan Cox, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar



On Wed, 23 Nov 2005, Linus Torvalds wrote:

>
>
> On Wed, 23 Nov 2005, H. Peter Anvin wrote:
>>
>> Linus Torvalds wrote:
>>> What I suggested to Intel at the Developer Days is to have a MSR (or, better
>>> yet, a bit in the page table pointer %cr0) that disables "lock" in _user_
>>> space. Ie a lock would be a no-op when in CPL3, and only with certain
>>> processes.
>>
>> You mean %cr3, right?
>
> Yes.
>
> It _should_ be fairly easy to do something like that - just a simple
> global flag that gets set and makes CPL3 ignore lock prefixes. Even timing
> doesn't matter - if it takes a hundred cycles for the setting to take
> effect, we don't care, since you can't write %cr3 from user space anyway,
> and it will certainly take a hundred cycles (and a few serializing
> instructions) until we get to CPL3.
>
> I'd personally prefer it to be in %cr3, since we'd have to reload it on
> task switching, and that's one of the registers we load anyway. And it
> would make sense. But it could be in an MSR too.
>
> Of course, if it's in one of the low 12 bits of %cr3, there would have to
> be a "enable this bit" in %cr4 or something. Historically, you could write
> any crap in the low bits, I think.

Why should they waste their (already complex) decoding logic with that?
Why can't an application instead set a bit somewhere if it's running on
SMP and if it's threaded and branch to variants with and without lock
prefix?

(correctly predicted branch is even faster than some microcode to
determine the value of your bit)
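
A rough sketch of that idea in C11, under the assumption that the
application (or its threading library) sets a flag before the first
thread is created. The names multi_threaded, mark_multi_threaded() and
counter_inc() are hypothetical and not taken from glibc or any other
library.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical process-wide flag. It must be set before the first
 * thread is created, so thread creation orders it for every thread. */
static bool multi_threaded = false;

static void mark_multi_threaded(void)
{
    multi_threaded = true;
}

/* Increment a counter and return the new value, paying for an atomic
 * read-modify-write only when other threads can exist. */
static int counter_inc(atomic_int *counter)
{
    if (!multi_threaded) {
        /* Single-threaded variant: plain load and store, no locked
         * instruction is needed. */
        int v = atomic_load_explicit(counter, memory_order_relaxed) + 1;
        atomic_store_explicit(counter, v, memory_order_relaxed);
        return v;
    }
    /* Threaded variant: full atomic increment. */
    return atomic_fetch_add(counter, 1) + 1;
}
```

The branch on multi_threaded is the cost being traded against the
locked instruction; it is well predicted because the flag changes at
most once in the life of the process.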

Mikulas

> 		Linus

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 21:44                                           ` Alan Cox
  2005-11-23 21:13                                             ` Andi Kleen
  2005-11-23 21:36                                             ` Linus Torvalds
@ 2005-11-24  3:23                                             ` Mikulas Patocka
  2 siblings, 0 replies; 117+ messages in thread
From: Mikulas Patocka @ 2005-11-24  3:23 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, H. Peter Anvin, Andi Kleen, Gerd Knorr,
	Dave Jones, Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar



On Wed, 23 Nov 2005, Alan Cox wrote:

> On Mer, 2005-11-23 at 10:42 -0800, Linus Torvalds wrote:
>> Of course, if it's in one of the low 12 bits of %cr3, there would have to
>> be a "enable this bit" in %cr4 or something. Historically, you could write
>> any crap in the low bits, I think.
>
> There is a much much better way to do it than just user space and
> without hitting cr3/cr4 - put "lock works" in the PAT and while we'll
> have to add PAT support which we need to do anyway we would get a world
> where on uniprocessor lock prefix only works on address targets we want
> it to - ie pci_alloc_consistent() pages.

Given the CPU architecture it is unimplementable. Instructions are split
into microinstructions and they are executed out of order. PAT is looked up
when LOAD microinstruction is executed. Imagine this

MOV EDX, [address1]
LOCK ADD [address2], EDX

that is translated to

LOAD EDX, [address1]
LOAD TMP1, [address2]
ADD TMP1, EDX
STORE [address2], TMP1

... now LOAD finds LOCK attribute in PAT --- so it locks the bus, however
EDX is still not loaded. Now LOAD EDX can't execute because the bus is
locked and ADD and STORE can't execute because they're waiting for LOAD
EDX. Deadlock.

Locks are so slow not because they are locks (if the target is in L1
cache, they operate only on cache and don't go to bus at all), but because
they need to completely flush the microinstruction pool to avoid problems like
this. Of course Intel won't waste silicon in the execution engine for
instructions that execute so rarely, so they microcode them instead. So
lock detection is done at the decoder.

Mikulas

> Alan

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 23:59                                                         ` Linus Torvalds
@ 2005-11-24  2:06                                                           ` Daniel Jacobowitz
  0 siblings, 0 replies; 117+ messages in thread
From: Daniel Jacobowitz @ 2005-11-24  2:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, H. Peter Anvin, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

On Wed, Nov 23, 2005 at 03:59:52PM -0800, Linus Torvalds wrote:
> On Wed, 23 Nov 2005, Daniel Jacobowitz wrote:
> > 
> > Those are the wrong ways of doing this in userspace.  There are right
> > ways.  For instance, tag the binary at link time "single-threaded". 
> 
> And I mentioned exactly this. It's my third alternative.
> 
> And it doesn't work well, exactly because developers don't know if the 
> libraries they use are always single-threaded etc. More importantly, it 
> just doesn't happen that much. People do "make ; make install". Hopefully 
> from pretty standard sources. Having to tweak things so that a project 
> compiles with a magic flag on a particular distribution is simply not done 
> very much.

But distributors (Debian included) do this all the time :-)

I'd even volunteer to get it done and pushed out and in use, if I was
as convinced of the benefits.  For most applications, though, I'm still
sceptical.

-- 
Daniel Jacobowitz
CodeSourcery, LLC

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 23:08                                                     ` Linus Torvalds
  2005-11-23 23:02                                                       ` Jeff V. Merkey
  2005-11-23 23:42                                                       ` Daniel Jacobowitz
@ 2005-11-24  1:02                                                       ` Jeff Garzik
  2005-11-24 13:01                                                       ` Pádraig Brady
  2005-11-28 19:52                                                       ` Bill Davidsen
  4 siblings, 0 replies; 117+ messages in thread
From: Jeff Garzik @ 2005-11-24  1:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Daniel Jacobowitz, Alan Cox, H. Peter Anvin, Andi Kleen,
	Gerd Knorr, Dave Jones, Zachary Amsden, Pavel Machek,
	Andrew Morton, Linux Kernel Mailing List, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Eric W. Biederman,
	Ingo Molnar

Linus Torvalds wrote:
> The third alternative is to know at link-time that the process never does 
> anything threaded, but that needs more developer attention and 
> non-standard setups, and you _will_ get it wrong (some library will create 
> some thread without the developer even realizing). It also has the 
> duplicated library overhead (but at least now the duplication is just 
> twice, not "each process duplicates its own private pointer")

Small data point:  In a lot of gcc-related build processes, the 
configure/makefile junk passes '-pthread' to the compiler/linker.

So a lot of programs in Linux distros are already built this way.  The 
bigger problem is with libraries, which cannot know ahead of time 
whether the app is threaded or not, and therefore must assume threaded.

A few libs do things like glibc, others (like GLib) have an explicit 
mylib_thread_init() called at program startup.

	Jeff



^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 21:43                                               ` Andi Kleen
  2005-11-23 22:15                                                 ` Linus Torvalds
@ 2005-11-24  0:55                                                 ` Jeff Garzik
  1 sibling, 0 replies; 117+ messages in thread
From: Jeff Garzik @ 2005-11-24  0:55 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Alan Cox, H. Peter Anvin, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

Andi Kleen wrote:
>>THAT is what I'd like to have CPU support for. Not for UP (it's going 
>>away), and not for the kernel (it's never single-threaded).
> 
> 
> There is one reasonably interesting special case that will probably stay
> around: single CPU guest in a virtualized environment.

There will continue to be tons of embedded uniprocessor Linux use...

It will take a while for Linux phones and watches to become multi-core.

	Jeff




^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 23:42                                                       ` Daniel Jacobowitz
@ 2005-11-23 23:59                                                         ` Linus Torvalds
  2005-11-24  2:06                                                           ` Daniel Jacobowitz
  2005-11-24 22:32                                                         ` Ulrich Drepper
  2005-11-28 19:58                                                         ` Bill Davidsen
  2 siblings, 1 reply; 117+ messages in thread
From: Linus Torvalds @ 2005-11-23 23:59 UTC (permalink / raw)
  To: Daniel Jacobowitz
  Cc: Alan Cox, H. Peter Anvin, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar



On Wed, 23 Nov 2005, Daniel Jacobowitz wrote:
> 
> Those are the wrong ways of doing this in userspace.  There are right
> ways.  For instance, tag the binary at link time "single-threaded". 

And I mentioned exactly this. It's my third alternative.

And it doesn't work well, exactly because developers don't know if the 
libraries they use are always single-threaded etc. More importantly, it 
just doesn't happen that much. People do "make ; make install". Hopefully 
from pretty standard sources. Having to tweak things so that a project 
compiles with a magic flag on a particular distribution is simply not done 
very much.

			Linus

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 23:08                                                     ` Linus Torvalds
  2005-11-23 23:02                                                       ` Jeff V. Merkey
@ 2005-11-23 23:42                                                       ` Daniel Jacobowitz
  2005-11-23 23:59                                                         ` Linus Torvalds
                                                                           ` (2 more replies)
  2005-11-24  1:02                                                       ` Jeff Garzik
                                                                         ` (2 subsequent siblings)
  4 siblings, 3 replies; 117+ messages in thread
From: Daniel Jacobowitz @ 2005-11-23 23:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, H. Peter Anvin, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

On Wed, Nov 23, 2005 at 03:08:59PM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 23 Nov 2005, Daniel Jacobowitz wrote:
> > 
> > Why should we use a silicon based solution for this, when I posit that
> > there are simpler and equally effective userspace solutions?
> 
> Name them.
> 
> In user space, doing things like clever run-time linking things is 
> actually horribly bad. It causes COW faults at startup, and/or makes the 
> compiler have to do indirections unnecessarily.  Both of which actually 
> make caches less effective, because now processes that really effectively 
> do have exactly the same contents have them in different pages.

Those are the wrong ways of doing this in userspace.  There are right
ways.  For instance, tag the binary at link time "single-threaded". 
Use dynamic linking and the existing hwcap mechanism to select
single-threaded libraries instead of the default ones.  Your
single-threaded applications will no longer mmap the same copy of glibc
as your multi-threaded applications; this does make caching mildly less
effective but only if you have a single-threaded app and a
multi-threaded one fighting for CPU time.

> The other alternative (which apparently glibc actually does use) is to 
> dynamically branch over the lock prefixes, which actually works better: 
> it's more work dynamically, but it's much cheaper from a startup 
> standpoint and there's no memory duplication, so while it is the "stupid" 
> approach, it's actually better than the clever one.

Glibc does not do this to the best of my knowledge.  It does select
different code paths in various places based on the presence of
multiple threads, but that's for cancellation, not for locking.

> The third alternative is to know at link-time that the process never does 
> anything threaded, but that needs more developer attention and 
> non-standard setups, and you _will_ get it wrong (some library will create 
> some thread without the developer even realizing).

This is also a trivially solvable problem in userspace; you make the
dynamic linker enforce consistency of the tags.
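
A minimal sketch of what "enforce consistency of the tags" could look like,
assuming a purely hypothetical per-object tag; the names thread_tag,
loaded_object and find_tag_conflict are invented for illustration and are
not part of any real dynamic linker:

```c
#include <stddef.h>

/* Hypothetical link-time tag carried by the executable and each library. */
enum thread_tag { TAG_SINGLE_THREADED, TAG_MULTI_THREADED };

/* Hypothetical record for one object the dynamic linker has mapped. */
struct loaded_object {
    const char *name;
    enum thread_tag tag;
};

/*
 * Return the first loaded object whose tag differs from the executable's
 * tag, or NULL if every object is consistent.  A loader following this
 * scheme would refuse to start (or fall back to the default libraries)
 * when a conflict is found.
 */
static const struct loaded_object *
find_tag_conflict(enum thread_tag exe_tag,
                  const struct loaded_object *objs, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (objs[i].tag != exe_tag)
            return &objs[i];
    }
    return NULL;
}
```

The check is deliberately strict: a mismatch in either direction is treated
as a conflict, since a library built without atomics is unsafe in a threaded
process and a library that may create threads is unsafe in a process tagged
single-threaded.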

> I'm sure you can make up alternatives every time you hit one _particular_ 
> library, but that just doesn't scale in the real world.

The number of userspace libraries that use atomic operations is, in
practice, quite small.

> In contrast, the simple silicon support scales wonderfully well. Suddenly 
> libraries can be thread-safe _and_ efficient on UP too. You get to eat 
> your cake and have it too.

By buying new hardware and only caring about people using the magic
architecture.  No thanks.

Maybe I'll implement this some weekend.

-- 
Daniel Jacobowitz
CodeSourcery, LLC

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 23:29                                                         ` Eric W. Biederman
@ 2005-11-23 23:40                                                           ` Linus Torvalds
  0 siblings, 0 replies; 117+ messages in thread
From: Linus Torvalds @ 2005-11-23 23:40 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: H. Peter Anvin, Daniel Jacobowitz, Alan Cox, Andi Kleen,
	Gerd Knorr, Dave Jones, Zachary Amsden, Pavel Machek,
	Andrew Morton, Linux Kernel Mailing List, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Ingo Molnar



On Wed, 23 Nov 2005, Eric W. Biederman wrote:
> 
> In fact, to be explicit, we already have PROT_SEM on some architectures
> to report in the mmap whether we are going to use atomic operations.  For
> x86 we would probably need to introduce a PROT_NOSEM, but it sounds
> fairly straightforward to implement.

PROT_SEM was a mistake, I feel. It's way too easy to get it wrong. You 
have most architectures and environments that don't need it, and as a 
result, applications simply won't have it in their sources.

I suspect that with MAP_SHARED + PROT_WRITE being pretty uncommon anyway, 
we can probably find trivial patterns in the kernel. Like only one process 
holding that file open - which is what you get with things that use mmap() 
to write a new file (I think "ld" used to have a config option to write 
files that way, for example).

And if we end up sometimes giving "lock" meaning even when it's not 
needed, tough. The point of the simple hack is very much a "get 90% of the 
advantage for very little effort". 

Regardless, even if we get a flag like that (and the Intel people didn't 
seem to dismiss the idea), it's likely more than a few years down the 
line. So it's not like this is a pressing concern ;)

		Linus

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 22:21                                                       ` Linus Torvalds
@ 2005-11-23 23:29                                                         ` Eric W. Biederman
  2005-11-23 23:40                                                           ` Linus Torvalds
  0 siblings, 1 reply; 117+ messages in thread
From: Eric W. Biederman @ 2005-11-23 23:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Daniel Jacobowitz, Alan Cox, Andi Kleen,
	Gerd Knorr, Dave Jones, Zachary Amsden, Pavel Machek,
	Andrew Morton, Linux Kernel Mailing List, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Ingo Molnar

Linus Torvalds <torvalds@osdl.org> writes:

> On Wed, 23 Nov 2005, H. Peter Anvin wrote:
>> 
>> Yes.  Any shared mmaps may require working lock.
>
> Not "any". Only writable shared mmap. Which is actually the rare case.
>
> Even then, we might want to have such processes have a way to say "I don't 
> do futexes in this mmap" or similar. Quite often, writable shared mmaps 
> aren't interested in locked cycles - they are there to just write things 
> to disk, and all the serialization is done in the kernel when the user 
> does a "munmap()" or a "msync()".

In fact, to be explicit, we already have PROT_SEM on some architectures
to report in the mmap whether we are going to use atomic operations.  For
x86 we would probably need to introduce a PROT_NOSEM, but it sounds
fairly straightforward to implement.

Eric

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 22:22                                                   ` Andi Kleen
  2005-11-23 22:25                                                     ` H. Peter Anvin
@ 2005-11-23 23:10                                                     ` Linus Torvalds
  1 sibling, 0 replies; 117+ messages in thread
From: Linus Torvalds @ 2005-11-23 23:10 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alan Cox, H. Peter Anvin, Gerd Knorr, Dave Jones, Zachary Amsden,
	Pavel Machek, Andrew Morton, Linux Kernel Mailing List,
	Zwane Mwaikambo, Pratap Subrahmanyam, Christopher Li,
	Eric W. Biederman, Ingo Molnar



On Wed, 23 Nov 2005, Andi Kleen wrote:
> 
> I don't think it'll force them to that, it will just be a common
> use case. e.g. you start a separate VM to run your firewall in.
> Do you really need it multithreaded? 

The question is: do you need a virtualized environment for it.

And the answer is: no.

I realize that you can make up an infinite number of things you _can_ use 
virtualization for. That doesn't mean that people _will_.

		Linus

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 22:20                                                   ` Daniel Jacobowitz
@ 2005-11-23 23:08                                                     ` Linus Torvalds
  2005-11-23 23:02                                                       ` Jeff V. Merkey
                                                                         ` (4 more replies)
  0 siblings, 5 replies; 117+ messages in thread
From: Linus Torvalds @ 2005-11-23 23:08 UTC (permalink / raw)
  To: Daniel Jacobowitz
  Cc: Alan Cox, H. Peter Anvin, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar



On Wed, 23 Nov 2005, Daniel Jacobowitz wrote:
> 
> Why should we use a silicon based solution for this, when I posit that
> there are simpler and equally effective userspace solutions?

Name them.

In user space, doing things like clever run-time linking things is 
actually horribly bad. It causes COW faults at startup, and/or makes the 
compiler have to do indirections unnecessarily.  Both of which actually 
make caches less effective, because now processes that really effectively 
do have exactly the same contents have them in different pages.

The other alternative (which apparently glibc actually does use) is to 
dynamically branch over the lock prefixes, which actually works better: 
it's more work dynamically, but it's much cheaper from a startup 
standpoint and there's no memory duplication, so while it is the "stupid" 
approach, it's actually better than the clever one.
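
A minimal sketch of the "dynamically branch over the lock prefix" idea,
assuming a hypothetical single_threaded flag that a threading library would
clear when the first extra thread is created. The flag and the function name
are illustrative only; this is not a claim about what glibc does:

```c
#include <stdbool.h>

/* Hypothetical flag: true until the process creates a second thread. */
static bool single_threaded = true;

/* Increment *counter and return the new value. */
static inline int counter_inc(int *counter)
{
    if (single_threaded)
        return ++*counter;  /* plain read-modify-write, no locked cycle */

    /* GCC/Clang builtin; on x86 this compiles to a lock-prefixed add. */
    return __atomic_add_fetch(counter, 1, __ATOMIC_SEQ_CST);
}
```

The cost is one extra load and a well-predicted branch on every call; in
exchange every process maps the same library text, so there are no COW
faults and no duplicated pages.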

The third alternative is to know at link-time that the process never does 
anything threaded, but that needs more developer attention and 
non-standard setups, and you _will_ get it wrong (some library will create 
some thread without the developer even realizing). It also has the 
duplicated library overhead (but at least now the duplication is just 
twice, not "each process duplicates its own private pointer")

In short, there simply aren't any good alternatives. The end result is that 
thread-safe libraries are always in practice thread-safe even on UP, even 
though that serializes the CPU altogether unnecessarily.

I'm sure you can make up alternatives every time you hit one _particular_ 
library, but that just doesn't scale in the real world.

In contrast, the simple silicon support scales wonderfully well. Suddenly 
libraries can be thread-safe _and_ efficient on UP too. You get to eat 
your cake and have it too.

		Linus

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 23:08                                                     ` Linus Torvalds
@ 2005-11-23 23:02                                                       ` Jeff V. Merkey
  2005-11-23 23:42                                                       ` Daniel Jacobowitz
                                                                         ` (3 subsequent siblings)
  4 siblings, 0 replies; 117+ messages in thread
From: Jeff V. Merkey @ 2005-11-23 23:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Daniel Jacobowitz, Alan Cox, H. Peter Anvin, Andi Kleen,
	Gerd Knorr, Dave Jones, Zachary Amsden, Pavel Machek,
	Andrew Morton, Linux Kernel Mailing List, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Eric W. Biederman,
	Ingo Molnar

Linus Torvalds wrote:

>On Wed, 23 Nov 2005, Daniel Jacobowitz wrote:
>  
>
>>Why should we use a silicon based solution for this, when I posit that
>>there are simpler and equally effective userspace solutions?
>>    
>>
>
>Name them.
>
>In user space, doing things like clever run-time linking things is 
>actually horribly bad. It causes COW faults at startup, and/or makes the 
>compiler have to do indirections unnecessarily.  Both of which actually 
>make caches less effective, because now processes that really effectively 
>do have exactly the same contents have them in different pages.
>
>The other alternative (which apparently glibc actually does use) is to 
>dynamically branch over the lock prefixes, which actually works better: 
>it's more work dynamically, but it's much cheaper from a startup 
>standpoint and there's no memory duplication, so while it is the "stupid" 
>approach, it's actually better than the clever one.
>  
>

Using self-modifying code stubs will work, and Intel's architecture will
support it. This would be faster than waiting 2-3 years for Intel to spin
a processor rev. NetWare did something similar with global branch tables
for memory protection.


J


>The third alternative is to know at link-time that the process never does 
>anything threaded, but that needs more developer attention and 
>non-standard setups, and you _will_ get it wrong (some library will create 
>some thread without the developer even realizing). It also has the 
>duplicated library overhead (but at least now the duplication is just 
>twice, not "each process duplicates its own private pointer")
>
>In short, there simply isn't any good alternatives. The end result is that 
>thread-safe libraries are always in practice thread-safe even on UP, even 
>though that serializes the CPU altogether unnecessarily.
>
>I'm sure you can make up alternatives every time you hit one _particular_ 
>library, but that just doesn't scale in the real world.
>
>In contrast, the simple silicon support scales wonderfully well. Suddenly 
>libraries can be thread-safe _and_ efficient on UP too. You get to eat 
>your cake and have it too.
>
>		Linus


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 22:40                                                           ` Andi Kleen
@ 2005-11-23 22:52                                                             ` H. Peter Anvin
  0 siblings, 0 replies; 117+ messages in thread
From: H. Peter Anvin @ 2005-11-23 22:52 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Alan Cox, Gerd Knorr, Dave Jones, Zachary Amsden,
	Pavel Machek, Andrew Morton, Linux Kernel Mailing List,
	Zwane Mwaikambo, Pratap Subrahmanyam, Christopher Li,
	Eric W. Biederman, Ingo Molnar

Andi Kleen wrote:
>>Uhm... maybe we think of it differently, but typically I consider the 
>>host rings (which is what I talked about above) as orthogonal to the 
>>guest ring.  To the host, the guest is just a process in ring 3.
> 
> I don't think your thoughts match the terminology as used by Intel/AMD/Xen
> at least.

Perhaps not.  The Intel terminology seems really confused, especially.

	-hpa


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 21:36                                             ` Linus Torvalds
  2005-11-23 21:43                                               ` Andi Kleen
  2005-11-23 21:48                                               ` Daniel Jacobowitz
@ 2005-11-23 22:50                                               ` Alan Cox
  2005-11-23 22:22                                                 ` H. Peter Anvin
  2005-11-25  7:38                                               ` Chris Wedgwood
  3 siblings, 1 reply; 117+ messages in thread
From: Alan Cox @ 2005-11-23 22:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

On Mer, 2005-11-23 at 13:36 -0800, Linus Torvalds wrote:
> > have to add PAT support which we need to do anyway we would get a world
> > where on uniprocessor lock prefix only works on address targets we want
> > it to - ie pci_alloc_consistent() pages.
> 
> No. That would be wrong.
> 
> The thing is, "lock" is useless EVEN ON SMP in user space 99% of the time.

Now I see what you are aiming at, yes that makes vast amounts of sense
and since AMD have the "no lock effect" bit for general case maybe they
can



^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 22:36                                                         ` H. Peter Anvin
@ 2005-11-23 22:40                                                           ` Andi Kleen
  2005-11-23 22:52                                                             ` H. Peter Anvin
  0 siblings, 1 reply; 117+ messages in thread
From: Andi Kleen @ 2005-11-23 22:40 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, Linus Torvalds, Alan Cox, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

> Uhm... maybe we think of it differently, but typically I consider the 
> host rings (which is what I talked about above) as orthogonal to the 
> guest ring.  To the host, the guest is just a process in ring 3.

I don't think your thoughts match the terminology as used by Intel/AMD/Xen
at least.

-Andi

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 22:32                                                       ` Andi Kleen
@ 2005-11-23 22:36                                                         ` H. Peter Anvin
  2005-11-23 22:40                                                           ` Andi Kleen
  0 siblings, 1 reply; 117+ messages in thread
From: H. Peter Anvin @ 2005-11-23 22:36 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Alan Cox, Gerd Knorr, Dave Jones, Zachary Amsden,
	Pavel Machek, Andrew Morton, Linux Kernel Mailing List,
	Zwane Mwaikambo, Pratap Subrahmanyam, Christopher Li,
	Eric W. Biederman, Ingo Molnar

Andi Kleen wrote:
>>Well, with VTX or Pacifica virtualization is in ring 3.  The fact that 
> 
> No it's not. The whole point is that there is no "ring compression".
> The guest has all its normal rings, just the hypervisor has additional
> "negative" rings.
> 

Uhm... maybe we think of it differently, but typically I consider the 
host rings (which is what I talked about above) as orthogonal to the 
guest ring.  To the host, the guest is just a process in ring 3.

	-hpa

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 22:25                                                     ` H. Peter Anvin
@ 2005-11-23 22:32                                                       ` Andi Kleen
  2005-11-23 22:36                                                         ` H. Peter Anvin
  0 siblings, 1 reply; 117+ messages in thread
From: Andi Kleen @ 2005-11-23 22:32 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, Linus Torvalds, Alan Cox, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

> Well, with VTX or Pacifica virtualization is in ring 3.  The fact that 

No it's not. The whole point is that there is no "ring compression".
The guest has all its normal rings, just the hypervisor has additional
"negative" rings.

In the current Xen x86-64 para virtualization setup the guest kernel
is in ring 3, but I hope VT/P. will do away with that because it
causes lots of issues.

> What you really want is one bit for kernel mode (cpl 0-2) and one for 
> user mode (cpl 3).

Yes.

-Andi

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 21:46                                               ` Jeff Garzik
  2005-11-23 22:23                                                 ` Andi Kleen
@ 2005-11-23 22:30                                                 ` Pavel Machek
  1 sibling, 0 replies; 117+ messages in thread
From: Pavel Machek @ 2005-11-23 22:30 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andi Kleen, Alan Cox, Linus Torvalds, H. Peter Anvin, Gerd Knorr,
	Dave Jones, Zachary Amsden, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

Hi!


> >The idea was to turn LOCK on only if the process has any
> >shared writable mapping and num_online_cpus() > 0.
> 
> Yep.  Though I presume you mean "> 1".
> 
> One hopes that num_online_cpus() never reaches zero during runtime ;-)

Actually num_online_cpus() is very useful -- suspend to RAM ;-))))).

								Pavel
-- 
Thanks, Sharp!

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 22:22                                                   ` Andi Kleen
@ 2005-11-23 22:25                                                     ` H. Peter Anvin
  2005-11-23 22:32                                                       ` Andi Kleen
  2005-11-23 23:10                                                     ` Linus Torvalds
  1 sibling, 1 reply; 117+ messages in thread
From: H. Peter Anvin @ 2005-11-23 22:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Alan Cox, Gerd Knorr, Dave Jones, Zachary Amsden,
	Pavel Machek, Andrew Morton, Linux Kernel Mailing List,
	Zwane Mwaikambo, Pratap Subrahmanyam, Christopher Li,
	Eric W. Biederman, Ingo Molnar

Andi Kleen wrote:
> On Wed, Nov 23, 2005 at 02:15:24PM -0800, Linus Torvalds wrote:
> 
>>
>>On Wed, 23 Nov 2005, Andi Kleen wrote:
>>
>>>>THAT is what I'd like to have CPU support for. Not for UP (it's going 
>>>>away), and not for the kernel (it's never single-threaded).
>>>
>>>There is one reasonably interesting special case that will probably stay
>>>around: single CPU guest in a virtualized environment.
>>
>>.. and then the _virtualizer_ should just set the bit. 
> 
> That wouldn't work if it's limited to ring 3.
> 
> Also currently at least the Xen driver interfaces seem to 
> rely on lock, but perhaps that can be changed.
> 

Well, with VTX or Pacifica virtualization is in ring 3.  The fact that 
Xen isn't is a workaround for current hardware, so when we're talking 
about new hardware it's pointless.

What you really want is one bit for kernel mode (cpl 0-2) and one for 
user mode (cpl 3).

	-hpa

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 21:46                                               ` Jeff Garzik
@ 2005-11-23 22:23                                                 ` Andi Kleen
  2005-11-23 22:30                                                 ` Pavel Machek
  1 sibling, 0 replies; 117+ messages in thread
From: Andi Kleen @ 2005-11-23 22:23 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andi Kleen, Alan Cox, Linus Torvalds, H. Peter Anvin, Gerd Knorr,
	Dave Jones, Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

On Wed, Nov 23, 2005 at 04:46:27PM -0500, Jeff Garzik wrote:
> Andi Kleen wrote:
> >The idea was to turn LOCK on only if the process has any
> >shared writable mapping and num_online_cpus() > 0.
> 
> Yep.  Though I presume you mean "> 1".

Yeah, > 1 of course.
-Andi

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 22:50                                               ` Alan Cox
@ 2005-11-23 22:22                                                 ` H. Peter Anvin
  0 siblings, 0 replies; 117+ messages in thread
From: H. Peter Anvin @ 2005-11-23 22:22 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

Alan Cox wrote:
> On Mer, 2005-11-23 at 13:36 -0800, Linus Torvalds wrote:
> 
>>>have to add PAT support which we need to do anyway we would get a world
>>>where on uniprocessor lock prefix only works on address targets we want
>>>it to - ie pci_alloc_consistent() pages.
>>
>>No. That would be wrong.
>>
>>The thing is, "lock" is useless EVEN ON SMP in user space 99% of the time.
> 
> 
> Now I see what you are aiming at, yes that makes vast amounts of sense
> and since AMD have the "no lock effect" bit for general case maybe they
> can
> 

What it really comes down to (virtualization or not!) is whether or not 
the OS can guarantee that nothing else is messing with memory at the 
same time.

This is potentially different from process to process (because of page 
table differences) and from kernel to user space (because of the User 
bit in the page tables.)

	-hpa



^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 22:15                                                 ` Linus Torvalds
@ 2005-11-23 22:22                                                   ` Andi Kleen
  2005-11-23 22:25                                                     ` H. Peter Anvin
  2005-11-23 23:10                                                     ` Linus Torvalds
  0 siblings, 2 replies; 117+ messages in thread
From: Andi Kleen @ 2005-11-23 22:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Alan Cox, H. Peter Anvin, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

On Wed, Nov 23, 2005 at 02:15:24PM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 23 Nov 2005, Andi Kleen wrote:
> >
> > > THAT is what I'd like to have CPU support for. Not for UP (it's going 
> > > away), and not for the kernel (it's never single-threaded).
> > 
> > There is one reasonably interesting special case that will probably stay
> > around: single CPU guest in a virtualized environment.
> 
> .. and then the _virtualizer_ should just set the bit. 

That wouldn't work if it's limited to ring 3.

Also currently at least the Xen driver interfaces seem to 
rely on lock, but perhaps that can be changed.


> However, quite frankly, virtualization is overhyped, in my opinion. And if 
> it forces people to run UP because of performance issues, it's simply not 
> acceptable for a lot of loads.

I don't think it'll force them to that, it will just be a common
use case. e.g. you start a separate VM to run your firewall in.
Do you really need it multithreaded? 

-Andi


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 22:09                                                     ` H. Peter Anvin
@ 2005-11-23 22:21                                                       ` Linus Torvalds
  2005-11-23 23:29                                                         ` Eric W. Biederman
  0 siblings, 1 reply; 117+ messages in thread
From: Linus Torvalds @ 2005-11-23 22:21 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Daniel Jacobowitz, Alan Cox, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar



On Wed, 23 Nov 2005, H. Peter Anvin wrote:
> 
> Yes.  Any shared mmaps may require working lock.

Not "any". Only writable shared mmap. Which is actually the rare case.

Even then, we might want to have such processes have a way to say "I don't 
do futexes in this mmap" or similar. Quite often, writable shared mmaps 
aren't interested in locked cycles - they are there to just write things 
to disk, and all the serialization is done in the kernel when the user 
does a "munmap()" or a "msync()".
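The "write things to disk" case above can be sketched in a few lines. This is only an illustration, assuming a POSIX system with a writable /tmp; the helper name write_via_shared_mmap() is made up for the example. It stores into a writable shared mapping with ordinary memcpy(), and the only synchronization point is the msync()/munmap() pair handled by the kernel.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: write msg (including its NUL) through a writable
 * shared mapping, flush it with msync(), then read it back with pread().
 * Returns 0 if the data reached the file, -1 on any failure.
 */
static int write_via_shared_mmap(const char *msg)
{
    assert(msg != NULL);

    char path[] = "/tmp/shared-mmap-XXXXXX";
    size_t len = strlen(msg) + 1;
    int rc = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                       /* file lives until fd is closed */

    if (ftruncate(fd, (off_t)len) == 0) {
        char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (map != MAP_FAILED) {
            memcpy(map, msg, len);      /* plain stores, no atomic operations */
            int synced = msync(map, len, MS_SYNC);
            munmap(map, len);

            char *back = malloc(len);
            if (back != NULL) {
                if (synced == 0 &&
                    pread(fd, back, len, 0) == (ssize_t)len &&
                    memcmp(back, msg, len) == 0)
                    rc = 0;
                free(back);
            }
        }
    }
    close(fd);
    return rc;
}
```

Nothing in this path depends on locked bus cycles, which is the kind of mapping that could in principle opt out.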

		Linus

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 22:19                                                 ` Linus Torvalds
@ 2005-11-23 22:20                                                   ` Daniel Jacobowitz
  2005-11-23 23:08                                                     ` Linus Torvalds
  0 siblings, 1 reply; 117+ messages in thread
From: Daniel Jacobowitz @ 2005-11-23 22:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, H. Peter Anvin, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

On Wed, Nov 23, 2005 at 02:19:08PM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 23 Nov 2005, Daniel Jacobowitz wrote:
> > 
> > I don't think I see the point.  This would let you optimize for the
> > "multi-threaded, but hasn't created any threads yet" or even
> > "multi-threaded, but not right now" cases.  But those really aren't the
> > interesting case to optimize for - that's the equivalent of supporting
> > CPU hotplug.
> 
> NO.
> 
> There is not a _single_ compiler that is multi-threaded, and I'd argue 
> that there probably never will be. It's pointless.
> 
> There's a _lot_ of really performance-sensitive stuff that will NEVER EVER 
> be threaded. You may run a hundred copies of them at the same time, but 
> every single copy will be single-threaded.
> 
> And this will optimize that case in a BIG way.
> 
> This is _not_ about "CPU hotplug". This is _not_ about "threaded apps 
> before they are threaded". This is all about the fact that serious 
> computation is done single-threaded, and anybody who thinks that 
> single-threading is going away is so totally out to lunch that it's not 
> even fun.

I get the feeling you didn't read my message at all.  Let me try it
again.

Why should we use a silicon based solution for this, when I posit that
there are simpler and equally effective userspace solutions?

-- 
Daniel Jacobowitz
CodeSourcery, LLC

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 21:48                                               ` Daniel Jacobowitz
  2005-11-23 21:53                                                 ` H. Peter Anvin
@ 2005-11-23 22:19                                                 ` Linus Torvalds
  2005-11-23 22:20                                                   ` Daniel Jacobowitz
  1 sibling, 1 reply; 117+ messages in thread
From: Linus Torvalds @ 2005-11-23 22:19 UTC (permalink / raw)
  To: Daniel Jacobowitz
  Cc: Alan Cox, H. Peter Anvin, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar



On Wed, 23 Nov 2005, Daniel Jacobowitz wrote:
> 
> I don't think I see the point.  This would let you optimize for the
> "multi-threaded, but hasn't created any threads yet" or even
> "multi-threaded, but not right now" cases.  But those really aren't the
> interesting case to optimize for - that's the equivalent of supporting
> CPU hotplug.

NO.

There is not a _single_ compiler that is multi-threaded, and I'd argue 
that there probably never will be. It's pointless.

There's a _lot_ of really performance-sensitive stuff that will NEVER EVER 
be threaded. You may run a hundred copies of them at the same time, but 
every single copy will be single-threaded.

And this will optimize that case in a BIG way.

This is _not_ about "CPU hotplug". This is _not_ about "threaded apps 
before they are threaded". This is all about the fact that serious 
computation is done single-threaded, and anybody who thinks that 
single-threading is going away is so totally out to lunch that it's not 
even fun.

And yes, Sun will die. Single-thread performance matters a hell of a lot, 
and any company that bets that it doesn't, is a failure.

		Linus

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 21:43                                               ` Andi Kleen
@ 2005-11-23 22:15                                                 ` Linus Torvalds
  2005-11-23 22:22                                                   ` Andi Kleen
  2005-11-24  0:55                                                 ` Jeff Garzik
  1 sibling, 1 reply; 117+ messages in thread
From: Linus Torvalds @ 2005-11-23 22:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alan Cox, H. Peter Anvin, Gerd Knorr, Dave Jones, Zachary Amsden,
	Pavel Machek, Andrew Morton, Linux Kernel Mailing List,
	Zwane Mwaikambo, Pratap Subrahmanyam, Christopher Li,
	Eric W. Biederman, Ingo Molnar



On Wed, 23 Nov 2005, Andi Kleen wrote:
>
> > THAT is what I'd like to have CPU support for. Not for UP (it's going 
> > away), and not for the kernel (it's never single-threaded).
> 
> There is one reasonably interesting special case that will probably stay
> around: single CPU guest in a virtualized environment.

.. and then the _virtualizer_ should just set the bit. 

However, quite frankly, virtualization is overhyped, in my opinion. And if 
it forces people to run UP because of performance issues, it's simply not 
acceptable for a lot of loads.

It's cool technology and all, but realistically..

		Linus

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 22:05                                               ` Alan Cox
  2005-11-23 21:36                                                 ` Arjan van de Ven
  2005-11-23 21:36                                                 ` Andi Kleen
@ 2005-11-23 22:13                                                 ` Linus Torvalds
  2 siblings, 0 replies; 117+ messages in thread
From: Linus Torvalds @ 2005-11-23 22:13 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, H. Peter Anvin, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar



On Wed, 23 Nov 2005, Alan Cox wrote:
>
> On Mer, 2005-11-23 at 22:13 +0100, Andi Kleen wrote:
> > The idea was to turn LOCK on only if the process has any
> > shared writable mapping and num_online_cpus() > 0.
> 
> That makes a lot of sense, and if we hit hardware that does funky stuff
> then the driver can set a 'vma needs lock' bit for the same effect.
> 
> > Might be a bit costly to rewrite all the page tables for that case
> > just to change the PAT index.  A bit is nicer for that.
> 
> CPU insert/remove is performed how many times a second ? Or for that
> matter why not just reload the PAT register and keep the index the
> same ?

It's not about CPU insert/remove.

It's about a single-threaded process becoming multi-threaded, ie a simple 
"clone()" operation (or doing a shared mmap).

So it needs to be _fast_. 

I would strongly argue that it's not a TLB/PAT operation at all. It has 
nothing to do with the address of the operation. It's a global bit, and 
it's in the cr3 just because that's what gets reloaded on task switching. 
But it could be in the CS register too, for all I care..

		Linus

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 22:03                                                   ` Daniel Jacobowitz
@ 2005-11-23 22:09                                                     ` H. Peter Anvin
  2005-11-23 22:21                                                       ` Linus Torvalds
  0 siblings, 1 reply; 117+ messages in thread
From: H. Peter Anvin @ 2005-11-23 22:09 UTC (permalink / raw)
  To: Daniel Jacobowitz
  Cc: Linus Torvalds, Alan Cox, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

Daniel Jacobowitz wrote:
> 
> Please explain what problem you see.  If you use mmap to manually load
> libpthread.so, and patch up its relocations without going to ld.so,
> obviously you get to keep both pieces.  Or are you talking about
> synchronizing access to shared mmaped buffers?
> 

Yes.  Any shared mmaps may require working lock.

	-hpa

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 21:13                                             ` Andi Kleen
  2005-11-23 21:46                                               ` Jeff Garzik
@ 2005-11-23 22:05                                               ` Alan Cox
  2005-11-23 21:36                                                 ` Arjan van de Ven
                                                                   ` (2 more replies)
  1 sibling, 3 replies; 117+ messages in thread
From: Alan Cox @ 2005-11-23 22:05 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, H. Peter Anvin, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

On Mer, 2005-11-23 at 22:13 +0100, Andi Kleen wrote:
> The idea was to turn LOCK on only if the process has any
> shared writable mapping and num_online_cpus() > 0.

That makes a lot of sense, and if we hit hardware that does funky stuff
then the driver can set a 'vma needs lock' bit for the same effect.

> Might be a bit costly to rewrite all the page tables for that case
> just to change the PAT index.  A bit is nicer for that.

CPU insert/remove is performed how many times a second ? Or for that
matter why not just reload the PAT register and keep the index the
same ?


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 21:53                                                 ` H. Peter Anvin
@ 2005-11-23 22:03                                                   ` Daniel Jacobowitz
  2005-11-23 22:09                                                     ` H. Peter Anvin
  0 siblings, 1 reply; 117+ messages in thread
From: Daniel Jacobowitz @ 2005-11-23 22:03 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Alan Cox, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

On Wed, Nov 23, 2005 at 01:53:59PM -0800, H. Peter Anvin wrote:
> Daniel Jacobowitz wrote:
> >
> >I don't think I see the point.  This would let you optimize for the
> >"multi-threaded, but hasn't created any threads yet" or even
> >"multi-threaded, but not right now" cases.  But those really aren't the
> >interesting case to optimize for - that's the equivalent of supporting
> >CPU hotplug.
> >
> >The interesting case is when you know at static link time that the
> >library is single-threaded, or even at dynamic link time.  And it's
> >easy enough at both of those times to handle this.  In many cases glibc
> >doesn't, because it's valid to dlopen libpthread.so, but that could be
> >accommodated - a simple matter of software.
> >
> 
> No, you can never know that unless you can't call mmap().

Please explain what problem you see.  If you use mmap to manually load
libpthread.so, and patch up its relocations without going to ld.so,
obviously you get to keep both pieces.  Or are you talking about
synchronizing access to shared mmaped buffers?

This is different from what Linus was talking about precisely because
we can do it imperatively ("I know this program is single-threaded and
I'm telling you so" instead of "Hmm, this program hasn't called clone
yet").

It's not as technologically slick but I'd need a lot of convincing to
believe it wasn't just as useful; and it has the benefit of not
requiring new silicon.

-- 
Daniel Jacobowitz
CodeSourcery, LLC

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 16:59                                     ` Andi Kleen
@ 2005-11-23 22:00                                       ` Alan Cox
  2005-11-24 13:13                                         ` Andi Kleen
  0 siblings, 1 reply; 117+ messages in thread
From: Alan Cox @ 2005-11-23 22:00 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Gerd Knorr, Linus Torvalds, Dave Jones, Zachary Amsden,
	Pavel Machek, Andrew Morton, Linux Kernel Mailing List,
	H. Peter Anvin, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

On Mer, 2005-11-23 at 17:59 +0100, Andi Kleen wrote:
> You mean it might break an insane hack in someone's ECC scrubbing
> implementation. But last time I talked to people about this
> they suggested just using an uncacheable mapping instead of 
> this horrible thing. Uncached is actually what you want there,
> not relying on some undocumented lock bus cycle behaviour.

I reviewed your suggestions before:

1. The lock behaviour *is* defined for main memory access by all bus
masters.
2. Uncached mappings are unworkable for this because we must never have
a page mapped with conflicting cache types - that's ugly, and plain
horrific on SMP.
3. Uncached has undefined semantics when racing a PCI master. Lock has
defined semantics. An uncached add #0 is permitted to read the memory
and then write it back as two different cycles and I suspect does.
4. The AMD BIOS guide requires both that LOCK is enabled by default and
that the "lock affects the external bus" bit is clear to enable locking
on the external bus.

The problems we would have would be if some smartarse chip designer
optimised lock addl #0 to a no-op, and the fact that we possibly ought to
wbinvd the page and read it again to check the ECC recovery worked.

> Which drivers? I don't think there is anything in tree. I went
> over all the drivers early in the x86-64 port.

eepro100 used to, from memory, but it now carefully does a 16bit I/O.

> I'm sure I would have noticed because they very likely needed inline
> assembly for this and this generally broke when moving to x86-64.

Or use xchg() which is implied lock on x86 so isn't magically made
non-atomic by the LOCK macros. I did a sweep for xchg and didn't see any
problem.

> DRM did some tricks, but generally not with the hardware I believe,
> only between user/kernel space.

It does it extensively for user/kernel. Whether it does it with the GPU
in user space I can't say. The cards I am familiar with do not.

Alan


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 21:48                                               ` Daniel Jacobowitz
@ 2005-11-23 21:53                                                 ` H. Peter Anvin
  2005-11-23 22:03                                                   ` Daniel Jacobowitz
  2005-11-23 22:19                                                 ` Linus Torvalds
  1 sibling, 1 reply; 117+ messages in thread
From: H. Peter Anvin @ 2005-11-23 21:53 UTC (permalink / raw)
  To: Daniel Jacobowitz
  Cc: Linus Torvalds, Alan Cox, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

Daniel Jacobowitz wrote:
> 
> I don't think I see the point.  This would let you optimize for the
> "multi-threaded, but hasn't created any threads yet" or even
> "multi-threaded, but not right now" cases.  But those really aren't the
> interesting case to optimize for - that's the equivalent of supporting
> CPU hotplug.
> 
> The interesting case is when you know at static link time that the
> library is single-threaded, or even at dynamic link time.  And it's
> easy enough at both of those times to handle this.  In many cases glibc
> doesn't, because it's valid to dlopen libpthread.so, but that could be
> accommodated - a simple matter of software.
> 

No, you can never know that unless you can't call mmap().

	-hpa

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 21:36                                             ` Linus Torvalds
  2005-11-23 21:43                                               ` Andi Kleen
@ 2005-11-23 21:48                                               ` Daniel Jacobowitz
  2005-11-23 21:53                                                 ` H. Peter Anvin
  2005-11-23 22:19                                                 ` Linus Torvalds
  2005-11-23 22:50                                               ` Alan Cox
  2005-11-25  7:38                                               ` Chris Wedgwood
  3 siblings, 2 replies; 117+ messages in thread
From: Daniel Jacobowitz @ 2005-11-23 21:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, H. Peter Anvin, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

On Wed, Nov 23, 2005 at 01:36:08PM -0800, Linus Torvalds wrote:
> But optimizing for a single _thread_ is not a lost game. I don't believe 
> that threaded applications are necessarily going to take over all that 
> much in a lot of areas. Sure, we'll have more threaded apps too, but we'll 
> continue to have tons more of performance-critical non-threaded things 
> like compilers etc.
> 
> And _that_ is worth optimizing for. General libraries that have to be able 
> to handle the threaded case dynamically, but that are often run with no 
> shared memory anywhere.
> 
> THAT is what I'd like to have CPU support for. Not for UP (it's going 
> away), and not for the kernel (it's never single-threaded).

I don't think I see the point.  This would let you optimize for the
"multi-threaded, but hasn't created any threads yet" or even
"multi-threaded, but not right now" cases.  But those really aren't the
interesting case to optimize for - that's the equivalent of supporting
CPU hotplug.

The interesting case is when you know at static link time that the
library is single-threaded, or even at dynamic link time.  And it's
easy enough at both of those times to handle this.  In many cases glibc
doesn't, because it's valid to dlopen libpthread.so, but that could be
accommodated - a simple matter of software.

-- 
Daniel Jacobowitz
CodeSourcery, LLC

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 21:13                                             ` Andi Kleen
@ 2005-11-23 21:46                                               ` Jeff Garzik
  2005-11-23 22:23                                                 ` Andi Kleen
  2005-11-23 22:30                                                 ` Pavel Machek
  2005-11-23 22:05                                               ` Alan Cox
  1 sibling, 2 replies; 117+ messages in thread
From: Jeff Garzik @ 2005-11-23 21:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alan Cox, Linus Torvalds, H. Peter Anvin, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

Andi Kleen wrote:
> The idea was to turn LOCK on only if the process has any
> shared writable mapping and num_online_cpus() > 0.

Yep.  Though I presume you mean "> 1".

One hopes that num_online_cpus() never reaches zero during runtime ;-)

	Jeff



^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 18:42                                         ` Linus Torvalds
                                                             ` (2 preceding siblings ...)
  2005-11-23 19:12                                           ` H. Peter Anvin
@ 2005-11-23 21:44                                           ` Alan Cox
  2005-11-23 21:13                                             ` Andi Kleen
                                                               ` (2 more replies)
  2005-11-24  3:31                                           ` Mikulas Patocka
  2005-11-24 22:30                                           ` Ulrich Drepper
  5 siblings, 3 replies; 117+ messages in thread
From: Alan Cox @ 2005-11-23 21:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

On Mer, 2005-11-23 at 10:42 -0800, Linus Torvalds wrote:
> Of course, if it's in one of the low 12 bits of %cr3, there would have to 
> be a "enable this bit" in %cr4 or something. Historically, you could write 
> any crap in the low bits, I think.

There is a much much better way to do it than just user space and
without hitting cr3/cr4 - put "lock works" in the PAT and while we'll
have to add PAT support which we need to do anyway we would get a world
where on uniprocessor lock prefix only works on address targets we want
it to - ie pci_alloc_consistent() pages.

Alan


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 21:36                                             ` Linus Torvalds
@ 2005-11-23 21:43                                               ` Andi Kleen
  2005-11-23 22:15                                                 ` Linus Torvalds
  2005-11-24  0:55                                                 ` Jeff Garzik
  2005-11-23 21:48                                               ` Daniel Jacobowitz
                                                                 ` (2 subsequent siblings)
  3 siblings, 2 replies; 117+ messages in thread
From: Andi Kleen @ 2005-11-23 21:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, H. Peter Anvin, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

> THAT is what I'd like to have CPU support for. Not for UP (it's going 
> away), and not for the kernel (it's never single-threaded).

There is one reasonably interesting special case that will probably stay
around: single CPU guest in a virtualized environment.

-Andi

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 22:05                                               ` Alan Cox
  2005-11-23 21:36                                                 ` Arjan van de Ven
@ 2005-11-23 21:36                                                 ` Andi Kleen
  2005-11-23 22:13                                                 ` Linus Torvalds
  2 siblings, 0 replies; 117+ messages in thread
From: Andi Kleen @ 2005-11-23 21:36 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Linus Torvalds, H. Peter Anvin, Gerd Knorr,
	Dave Jones, Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

On Wed, Nov 23, 2005 at 10:05:40PM +0000, Alan Cox wrote:
> > Might be a bit costly to rewrite all the page tables for that case
> > just to change the PAT index.  A bit is nicer for that.
> 
> CPU insert/remove is performed how many times a second ? Or for that
> matter why not just reload the PAT register and keep the index the
> same ?

For user space the primary trigger event would be "has any shared
writable mappings or multiple threads". Even on a real MP system it's 
perfectly ok to run a program with no writable shared mappings with LOCK off.
Depending on the workload this transition could happen quite often.
Especially there is a worst case of an application allocating a few
GB of memory and then starting a new thread.

-Andi

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 22:05                                               ` Alan Cox
@ 2005-11-23 21:36                                                 ` Arjan van de Ven
  2005-11-23 21:36                                                 ` Andi Kleen
  2005-11-23 22:13                                                 ` Linus Torvalds
  2 siblings, 0 replies; 117+ messages in thread
From: Arjan van de Ven @ 2005-11-23 21:36 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Linus Torvalds, H. Peter Anvin, Gerd Knorr,
	Dave Jones, Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar


> CPU insert/remove is performed how many times a second ? Or for that
> matter why not just reload the PAT register and keep the index the
> same ?

you also want this for single-threaded apps, so that the glibc locking
code can skip the lock prefix for single-threaded apps and non-shared memory



^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 21:44                                           ` Alan Cox
  2005-11-23 21:13                                             ` Andi Kleen
@ 2005-11-23 21:36                                             ` Linus Torvalds
  2005-11-23 21:43                                               ` Andi Kleen
                                                                 ` (3 more replies)
  2005-11-24  3:23                                             ` Mikulas Patocka
  2 siblings, 4 replies; 117+ messages in thread
From: Linus Torvalds @ 2005-11-23 21:36 UTC (permalink / raw)
  To: Alan Cox
  Cc: H. Peter Anvin, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar



On Wed, 23 Nov 2005, Alan Cox wrote:
> 
> There is a much much better way to do it than just user space and
> without hitting cr3/cr4 - put "lock works" in the PAT and while we'll
> have to add PAT support which we need to do anyway we would get a world
> where on uniprocessor lock prefix only works on address targets we want
> it to - ie pci_alloc_consistent() pages.

No. That would be wrong.

The thing is, "lock" is useless EVEN ON SMP in user space 99% of the time.

Think of all the thread locking in libc - where 99% of all processes are 
single-threaded, and it does nothing but slow things down.

Actual UP machines are going to go away - even ARM is going SMP, and in 
the PC space, we'll have multi-core laptops probably being the rule rather 
than the exception in a couple of years. So the kernel will need "lock" in 
the foreseeable future, and optimizing for UP is a lost game.

But optimizing for a single _thread_ is not a lost game. I don't believe 
that threaded applications are necessarily going to take over all that 
much in a lot of areas. Sure, we'll have more threaded apps too, but we'll 
continue to have tons more of performance-critical non-threaded things 
like compilers etc.

And _that_ is worth optimizing for. General libraries that have to be able 
to handle the threaded case dynamically, but that are often run with no 
shared memory anywhere.

THAT is what I'd like to have CPU support for. Not for UP (it's going 
away), and not for the kernel (it's never single-threaded).

			Linus

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 21:44                                           ` Alan Cox
@ 2005-11-23 21:13                                             ` Andi Kleen
  2005-11-23 21:46                                               ` Jeff Garzik
  2005-11-23 22:05                                               ` Alan Cox
  2005-11-23 21:36                                             ` Linus Torvalds
  2005-11-24  3:23                                             ` Mikulas Patocka
  2 siblings, 2 replies; 117+ messages in thread
From: Andi Kleen @ 2005-11-23 21:13 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, H. Peter Anvin, Andi Kleen, Gerd Knorr,
	Dave Jones, Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

On Wed, Nov 23, 2005 at 09:44:05PM +0000, Alan Cox wrote:
> On Mer, 2005-11-23 at 10:42 -0800, Linus Torvalds wrote:
> > Of course, if it's in one of the low 12 bits of %cr3, there would have to 
> > be a "enable this bit" in %cr4 or something. Historically, you could write 
> > any crap in the low bits, I think.
> 
> There is a much much better way to do it than just user space and
> without hitting cr3/cr4 - put "lock works" in the PAT and while we'll
> have to add PAT support which we need to do anyway we would get a world
> where on uniprocessor lock prefix only works on address targets we want
> it to - ie pci_alloc_consistent() pages.

The idea was to turn LOCK on only if the process has any
shared writable mapping and num_online_cpus() > 0.

Might be a bit costly to rewrite all the page tables for that case
just to change the PAT index.  A bit is nicer for that.

-Andi

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 19:03                                             ` Linus Torvalds
@ 2005-11-23 19:31                                               ` jmerkey
  0 siblings, 0 replies; 117+ messages in thread
From: jmerkey @ 2005-11-23 19:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff V. Merkey, H. Peter Anvin, Alan Cox, Andi Kleen, Gerd Knorr,
	Dave Jones, Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

On Wed, Nov 23, 2005 at 11:03:15AM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 23 Nov 2005, Jeff V. Merkey wrote:
> >
> > The lock prefix '0F' is used for a lot of opcodes other than "lock". Go check
> > the instruction set reference.
> 
> No it's not.
> 
> 0F is indeed the two-byte prefix. But lock is F0, and it's unique.
> 
> Sometimes Intel re-uses the prefixes for other things eg "rep nop", but I 
> don't think that has ever happened for the lock prefix. 
> 
> Besides, the instructions look very different internally in the CPU after 
> decoding, and anyway you'd not want to ignore the lock prefix _early_ at 
> decode time anyway (many instructions turn into illegal instructions with 
> a lock prefix, as do reg-reg modrm bytes). So you'd dismiss the lock 
> prefix not at a byte level, but at a minimum just after the decode stage.
> 
> 		Linus

I always get numbers and words transposed.  Thanks for the correction. 

J
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 19:12                                           ` H. Peter Anvin
@ 2005-11-23 19:30                                             ` jmerkey
  0 siblings, 0 replies; 117+ messages in thread
From: jmerkey @ 2005-11-23 19:30 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Alan Cox, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

On Wed, Nov 23, 2005 at 11:12:01AM -0800, H. Peter Anvin wrote:
> Linus Torvalds wrote:
> >
> >Of course, if it's in one of the low 12 bits of %cr3, there would have to 
> >be a "enable this bit" in %cr4 or something. Historically, you could write 
> >any crap in the low bits, I think.
> >
> 
> No, most of them are RAZ, but there are at least a couple of bits which 
> have effect (e.g. caching of the page tables.)
> 
> However, with PAE there aren't really a whole lot of unused bits in CR3.
> 
> 	-hpa
> -

Changing CR3 will break compatibility with Windows and interfere with
Intel's Bread and Butter gravy Train with M$.  CR4 was created to deal
with some of the legacy issues with backward compatibility with OS's
that read CR3.  Messing with CR3 will break Windows.

They won't do anything that will upset the apple cart with M$.  I dealt
with Intel folks for years when Linux was unknown.  They look and act
like boy scouts -- don't be fooled -- they're totally an M$ shop, always
have been, always will be.  Linux was and is an interesting brain fart
on their radar.  Their interests in it are solely based on their
internal "Rabbits and Dogs" Andy Grove mentality.  They say there are
rabbits and dogs.  Rabbits run out front, dogs chase the rabbits.  Intel
does business with rabbits.  Linux is a dog -- it chases after
innovators and replicates their work.  The fact it's free is the basis
of their interest.

Those interests do not extend to anything that interferes with their M$
relationship.  Push for CR4, they might agree, but be assured your
request will pass Ballmer's desk before it gets approved.

J

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-22 17:48                           ` [patch] " Gerd Knorr
  2005-11-22 18:01                             ` Pavel Machek
  2005-11-23 15:12                             ` Vincent Hanquez
@ 2005-11-23 19:17                             ` Andi Kleen
  2005-11-23 15:29                               ` Gerd Knorr
  2005-11-23 16:42                               ` Alan Cox
  2 siblings, 2 replies; 117+ messages in thread
From: Andi Kleen @ 2005-11-23 19:17 UTC (permalink / raw)
  To: Gerd Knorr
  Cc: Linus Torvalds, Dave Jones, Zachary Amsden, Pavel Machek,
	Andrew Morton, Linux Kernel Mailing List, H. Peter Anvin,
	Zwane Mwaikambo, Pratap Subrahmanyam, Christopher Li,
	Eric W. Biederman, Ingo Molnar

Gerd Knorr <kraxel@suse.de> writes:

> Now, some days hacking & debugging and kernel crashing later I have
> something more than just proof-of-concept ;)
> 
> Modules are supported now, fully modularized distro kernel works fine
> with it.  If you have a kernel with HOTPLUG_CPU compiled you can
> shutdown the second CPU of your dual-processor system via sysfs (echo
> 0 > /sys/devices/system/cpu/cpu1/online) and watch the kernel switch
> over to UP code without lock-prefixed instructions and simplified
> spinlocks, then power up the second CPU again (echo 1 > /sys/...) and
> watch it patching back in the SMP locking.

This looks like total overkill to me. Who needs to optimize
CPU hotplug this way? If you really need this just do it 
at boot time with the existing mechanisms. This would keep
it much simpler and simplicity is very important with 
such code because otherwise the testing of all the corner 
cases will kill you.

BTW the existing mechanism already works fine for modules too.

> +	/* Paranoia */
> +	asm volatile ("jmp 1f\n1:");
> +	mb();

That would be totally obsolete 386 era paranoia. If anything then use 
a CLFLUSH (but not available on all x86s) 

-Andi


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 18:42                                         ` Linus Torvalds
  2005-11-23 17:26                                           ` Jeff V. Merkey
  2005-11-23 18:46                                           ` Andi Kleen
@ 2005-11-23 19:12                                           ` H. Peter Anvin
  2005-11-23 19:30                                             ` jmerkey
  2005-11-23 21:44                                           ` Alan Cox
                                                             ` (2 subsequent siblings)
  5 siblings, 1 reply; 117+ messages in thread
From: H. Peter Anvin @ 2005-11-23 19:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Andi Kleen, Gerd Knorr, Dave Jones, Zachary Amsden,
	Pavel Machek, Andrew Morton, Linux Kernel Mailing List,
	Zwane Mwaikambo, Pratap Subrahmanyam, Christopher Li,
	Eric W. Biederman, Ingo Molnar

Linus Torvalds wrote:
> 
> Of course, if it's in one of the low 12 bits of %cr3, there would have to 
> be a "enable this bit" in %cr4 or something. Historically, you could write 
> any crap in the low bits, I think.
> 

No, most of them are RAZ, but there are at least a couple of bits which 
have effect (e.g. caching of the page tables.)

However, with PAE there aren't really a whole lot of unused bits in CR3.

	-hpa
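
The point about the low bits of %cr3 can be made concrete with a small
sketch.  It assumes the 32-bit layouts: without PAE the page directory
base occupies bits 31:12, with PAE the PDPT base occupies bits 31:5, and
PWT/PCD (the page-table caching controls) sit in bits 3 and 4 in both
cases.  CR3_HYP_NOLOCK is purely hypothetical, standing in for the
"ignore lock in CPL3" bit being discussed; no such bit exists.

```c
#include <assert.h>
#include <stdint.h>

#define CR3_PWT            (1u << 3)    /* page-level write-through */
#define CR3_PCD            (1u << 4)    /* page-level cache disable */
#define CR3_BASE_MASK      0xFFFFF000u  /* non-PAE: 4 KiB aligned base */
#define CR3_PAE_BASE_MASK  0xFFFFFFE0u  /* PAE: 32-byte aligned PDPT base */

/* HYPOTHETICAL bit, for illustration only; not a real CR3 flag. */
#define CR3_HYP_NOLOCK     (1u << 0)

/* Combine a non-PAE page directory address with low-bit flags. */
static uint32_t cr3_make(uint32_t pgd_phys, uint32_t flags)
{
    assert((pgd_phys & ~CR3_BASE_MASK) == 0);  /* base must be aligned */
    assert((flags & CR3_BASE_MASK) == 0);      /* flags live below it  */
    return pgd_phys | flags;
}

/* Count low bits that carry neither the base address nor PWT/PCD. */
static unsigned cr3_spare_low_bits(uint32_t base_mask)
{
    uint32_t spare = ~base_mask & ~(CR3_PWT | CR3_PCD);
    unsigned n = 0;

    while (spare) {
        n += spare & 1u;
        spare >>= 1;
    }
    return n;
}
```

Under these assumptions the non-PAE layout leaves ten low bits without
an assigned meaning, while PAE leaves only three, which is the shortage
described above.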

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 17:26                                           ` Jeff V. Merkey
@ 2005-11-23 19:03                                             ` Linus Torvalds
  2005-11-23 19:31                                               ` jmerkey
  0 siblings, 1 reply; 117+ messages in thread
From: Linus Torvalds @ 2005-11-23 19:03 UTC (permalink / raw)
  To: Jeff V. Merkey
  Cc: H. Peter Anvin, Alan Cox, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar



On Wed, 23 Nov 2005, Jeff V. Merkey wrote:
>
> The lock prefix '0F' is used for a lot of opcodes other than "lock". Go check
> the instruction set reference.

No it's not.

0F is indeed the two-byte prefix. But lock is F0, and it's unique.

Sometimes Intel re-uses the prefixes for other things eg "rep nop", but I 
don't think that has ever happened for the lock prefix. 

Besides, the instructions look very different internally in the CPU after 
decoding, and anyway you'd not want to ignore the lock prefix _early_ at 
decode time anyway (many instructions turn into illegal instructions with 
a lock prefix, as do reg-reg modrm bytes). So you'd dismiss the lock 
prefix not at a byte level, but at a minimum just after the decode stage.

		Linus
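
Since lock is the single byte F0, software patching it out only needs to
know where the prefixes are.  The following is a minimal userspace
sketch of that idea, operating on a plain byte buffer rather than live
kernel text.  The function name, the offset table and the choice of a
one-byte NOP (0x90) as the UP replacement are assumptions made for
illustration; this is not the code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOCK_PREFIX_BYTE 0xF0  /* x86 LOCK prefix */
#define NOP_BYTE         0x90  /* x86 single-byte NOP */

/*
 * Overwrite each recorded LOCK prefix with a NOP (smp == 0) or restore
 * it (smp != 0).  Only bytes listed in `offsets` are touched, so an
 * 0xF0 that happens to be part of an immediate or displacement is never
 * patched by accident.
 */
static void patch_lock_prefixes(uint8_t *text, const size_t *offsets,
                                size_t count, int smp)
{
    for (size_t i = 0; i < count; i++) {
        uint8_t *p = text + offsets[i];

        /* A recorded slot must hold a prefix in one of its two states. */
        assert(*p == LOCK_PREFIX_BYTE || *p == NOP_BYTE);
        *p = smp ? LOCK_PREFIX_BYTE : NOP_BYTE;
    }
}
```

Working from a table of recorded locations, rather than scanning for F0
bytes, sidesteps the decoding concerns raised above: the patcher never
has to decide whether a given byte is really a prefix.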

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 18:42                                         ` Linus Torvalds
  2005-11-23 17:26                                           ` Jeff V. Merkey
@ 2005-11-23 18:46                                           ` Andi Kleen
  2005-11-23 19:12                                           ` H. Peter Anvin
                                                             ` (3 subsequent siblings)
  5 siblings, 0 replies; 117+ messages in thread
From: Andi Kleen @ 2005-11-23 18:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Alan Cox, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

On Wed, Nov 23, 2005 at 10:42:40AM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 23 Nov 2005, H. Peter Anvin wrote:
> >
> > Linus Torvalds wrote:
> > > What I suggested to Intel at the Developer Days is to have a MSR (or, better
> > > yet, a bit in the page table pointer %cr0) that disables "lock" in _user_
> > > space. Ie a lock would be a no-op when in CPL3, and only with certain
> > > processes.
> > 
> > You mean %cr3, right?
> 
> Yes. 
> 
> It _should_ be fairly easy to do something like that - just a simple 
> global flag that gets set and makes CPL3 ignore lock prefixes. Even timing 
> doesn't matter - if it takes a hundred cycles for the setting to take 
> effect, we don't care, since you can't write %cr3 from user space anyway, 
> and it will certainly take a hundred cycles (and a few serializing 
> instructions) until we get to CPL3.

Another bit for ring 0 would be actually useful too. Then the patching
patch here wouldn't be needed.

-Andi

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 18:02                                       ` H. Peter Anvin
@ 2005-11-23 18:42                                         ` Linus Torvalds
  2005-11-23 17:26                                           ` Jeff V. Merkey
                                                             ` (5 more replies)
  0 siblings, 6 replies; 117+ messages in thread
From: Linus Torvalds @ 2005-11-23 18:42 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Alan Cox, Andi Kleen, Gerd Knorr, Dave Jones, Zachary Amsden,
	Pavel Machek, Andrew Morton, Linux Kernel Mailing List,
	Zwane Mwaikambo, Pratap Subrahmanyam, Christopher Li,
	Eric W. Biederman, Ingo Molnar



On Wed, 23 Nov 2005, H. Peter Anvin wrote:
>
> Linus Torvalds wrote:
> > What I suggested to Intel at the Developer Days is to have a MSR (or, better
> > yet, a bit in the page table pointer %cr0) that disables "lock" in _user_
> > space. Ie a lock would be a no-op when in CPL3, and only with certain
> > processes.
> 
> You mean %cr3, right?

Yes. 

It _should_ be fairly easy to do something like that - just a simple 
global flag that gets set and makes CPL3 ignore lock prefixes. Even timing 
doesn't matter - if it takes a hundred cycles for the setting to take 
effect, we don't care, since you can't write %cr3 from user space anyway, 
and it will certainly take a hundred cycles (and a few serializing 
instructions) until we get to CPL3.

I'd personally prefer it to be in %cr3, since we'd have to reload it on 
task switching, and that's one of the registers we load anyway. And it 
would make sense. But it could be in an MSR too.

Of course, if it's in one of the low 12 bits of %cr3, there would have to 
be a "enable this bit" in %cr4 or something. Historically, you could write 
any crap in the low bits, I think.

		Linus

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 17:02                                     ` Linus Torvalds
@ 2005-11-23 18:02                                       ` H. Peter Anvin
  2005-11-23 18:42                                         ` Linus Torvalds
  0 siblings, 1 reply; 117+ messages in thread
From: H. Peter Anvin @ 2005-11-23 18:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Andi Kleen, Gerd Knorr, Dave Jones, Zachary Amsden,
	Pavel Machek, Andrew Morton, Linux Kernel Mailing List,
	Zwane Mwaikambo, Pratap Subrahmanyam, Christopher Li,
	Eric W. Biederman, Ingo Molnar

Linus Torvalds wrote:
> What I suggested to Intel at the Developer Days is to have a MSR (or, 
> better yet, a bit in the page table pointer %cr0) that disables "lock" in 
> _user_ space. Ie a lock would be a no-op when in CPL3, and only with 
> certain processes.

You mean %cr3, right?

	-hpa

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 18:42                                         ` Linus Torvalds
@ 2005-11-23 17:26                                           ` Jeff V. Merkey
  2005-11-23 19:03                                             ` Linus Torvalds
  2005-11-23 18:46                                           ` Andi Kleen
                                                             ` (4 subsequent siblings)
  5 siblings, 1 reply; 117+ messages in thread
From: Jeff V. Merkey @ 2005-11-23 17:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Alan Cox, Andi Kleen, Gerd Knorr, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

Linus Torvalds wrote:

>On Wed, 23 Nov 2005, H. Peter Anvin wrote:
>  
>
>>Linus Torvalds wrote:
>>    
>>
>>>What I suggested to Intel at the Developer Days is to have a MSR (or, better
>>>yet, a bit in the page table pointer %cr0) that disables "lock" in _user_
>>>space. Ie a lock would be a no-op when in CPL3, and only with certain
>>>processes.
>>>      
>>>
>>You mean %cr3, right?
>>    
>>
>
>Yes. 
>
>It _should_ be fairly easy to do something like that - just a simple 
>global flag that gets set and makes CPL3 ignore lock prefixes. Even timing 
>doesn't matter - if it takes a hundred cycles for the setting to take 
>effect, we don't care, since you can't write %cr3 from user space anyway, 
>and it will certainly take a hundred cycles (and a few serializing 
>instructions) until we get to CPL3.
>
>I'd personally prefer it to be in %cr3, since we'd have to reload it on 
>task switching, and that's one of the registers we load anyway. And it 
>would make sense. But it could be in an MSR too.
>
>Of course, if it's in one of the low 12 bits of %cr3, there would have to 
>be a "enable this bit" in %cr4 or something. Historically, you could write 
>any crap in the low bits, I think.
>
>		Linus
>
The lock prefix '0F' is used for a lot of opcodes other than "lock". Go 
check the instruction set reference. It's not
trivial what you are proposing. Intel has a pretty hacked up opcode map 
with a lot of history. The bit should be in
CR4 and not CR3.

J

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 16:39                                 ` Andi Kleen
@ 2005-11-23 17:21                                   ` Alan Cox
  2005-11-23 16:59                                     ` Andi Kleen
  2005-11-23 17:02                                     ` Linus Torvalds
  0 siblings, 2 replies; 117+ messages in thread
From: Alan Cox @ 2005-11-23 17:21 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Gerd Knorr, Linus Torvalds, Dave Jones, Zachary Amsden,
	Pavel Machek, Andrew Morton, Linux Kernel Mailing List,
	H. Peter Anvin, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

On Mer, 2005-11-23 at 17:39 +0100, Andi Kleen wrote:
> I much prefer the MSR bit too. Unfortunately it doesn't exist
> (or rather I bet it exists somewhere, just undocumented) on Intel 
> systems.

The MSR bits will break things like ECC scrubbing however. That can be
addressed although the test patch I have just refuses to load EDAC if
the BIOS writers didn't follow the BIOS guidelines.

Certainly it would be cleaner and easier to save the MSR, scrub and put
it back than do the fixup magic. Some drivers would need auditing as
they seem to use locked ops or xchg (implicit lock) to lock with a PCI
DMA master.


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 17:21                                   ` Alan Cox
  2005-11-23 16:59                                     ` Andi Kleen
@ 2005-11-23 17:02                                     ` Linus Torvalds
  2005-11-23 18:02                                       ` H. Peter Anvin
  1 sibling, 1 reply; 117+ messages in thread
From: Linus Torvalds @ 2005-11-23 17:02 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Gerd Knorr, Dave Jones, Zachary Amsden, Pavel Machek,
	Andrew Morton, Linux Kernel Mailing List, H. Peter Anvin,
	Zwane Mwaikambo, Pratap Subrahmanyam, Christopher Li,
	Eric W. Biederman, Ingo Molnar



On Wed, 23 Nov 2005, Alan Cox wrote:
> 
> The MSR bits will break things like ECC scrubbing however. That can be
> addressed although the test patch I have just refuses to load EDAC if
> the BIOS writers didn't follow the BIOS guidelines.
> 
> Certainly it would be cleaner and easier to save the MSR, scrub and put
> it back than do the fixup magic. Some drivers would need auditing as
> they seem to use locked ops or xchg (implicit lock) to lock with a PCI
> DMA master.

What I suggested to Intel at the Developer Days is to have a MSR (or, 
better yet, a bit in the page table pointer %cr0) that disables "lock" in 
_user_ space. Ie a lock would be a no-op when in CPL3, and only with 
certain processes.

The kernel really isn't that critical. We always need the locks in SMP 
(unlike user space, which never needs them if the process isn't threaded), 
and in the kernel space we occasionally need it even with UP to protect 
against devices. And we _can_ do these instruction rewrites, and they are 
even pretty trivial for the non-hotplug case.

User space is actually a lot more important. People spend more time in 
user space, and there the lock prefix is much more often totally useless 
and cannot just be edited away once per boot.

		Linus

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 17:21                                   ` Alan Cox
@ 2005-11-23 16:59                                     ` Andi Kleen
  2005-11-23 22:00                                       ` Alan Cox
  2005-11-23 17:02                                     ` Linus Torvalds
  1 sibling, 1 reply; 117+ messages in thread
From: Andi Kleen @ 2005-11-23 16:59 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Gerd Knorr, Linus Torvalds, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Eric W. Biederman,
	Ingo Molnar

On Wed, Nov 23, 2005 at 05:21:29PM +0000, Alan Cox wrote:
> On Mer, 2005-11-23 at 17:39 +0100, Andi Kleen wrote:
> > I much prefer the MSR bit too. Unfortunately it doesn't exist
> > (or rather I bet it exists somewhere, just undocumented) on Intel 
> > systems.
> 
> The MSR bits will break things like ECC scrubbing however. That can be

You mean it might break an insane hack in someone's ECC scrubbing
implementation. But last time I talked to people about this
they suggested just using an uncacheable mapping instead of 
this horrible thing. Uncached is actually what you want there,
not relying on some undocumented lock bus cycle behaviour.

IMHO that would be much better and actually
have a chance of working over multiple generations.
 
> Certainly it would be cleaner and easier to save the MSR, scrub and put
> it back than do the fixup magic. Some drivers would need auditing as
> they seem to use locked ops or xchg (implicit lock) to lock with a PCI
> DMA master.

Which drivers? I don't think there is anything in tree. I went
over all the drivers early in the x86-64 port.

I'm sure I would have noticed because they very likely needed inline
assembly for this and this generally broke when moving to x86-64.

DRM did some tricks, but generally not with the hardware I believe,
only between user/kernel space.

-Andi


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 16:43                                 ` Gerd Knorr
@ 2005-11-23 16:51                                   ` H. Peter Anvin
  0 siblings, 0 replies; 117+ messages in thread
From: H. Peter Anvin @ 2005-11-23 16:51 UTC (permalink / raw)
  To: Gerd Knorr
  Cc: Alan Cox, Andi Kleen, Linus Torvalds, Dave Jones, Zachary Amsden,
	Pavel Machek, Andrew Morton, Linux Kernel Mailing List,
	Zwane Mwaikambo, Pratap Subrahmanyam, Christopher Li,
	Eric W. Biederman, Ingo Molnar

Gerd Knorr wrote:
> 
> Patching in/out SMP-locking with more than one active CPU would be a 
> pretty silly idea in the first place ;)
> 

No, doing this crap with CPU hotplug is a silly idea.  Patching on a 
real UP system, and then throwing out the tables, makes sense.  Keeping 
two sets of tables for a minimal performance improvement in a very rare 
configuration (CPU hotplug is the exception, not the rule) is just plain 
stupid.  You probably lose as much performance from the memory hogged up 
in the tables as you gain from it, and on every system where you have 
the tables at all you take the hit.

	-hpa

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 16:42                               ` Alan Cox
  2005-11-23 16:39                                 ` Andi Kleen
@ 2005-11-23 16:43                                 ` Gerd Knorr
  2005-11-23 16:51                                   ` H. Peter Anvin
  1 sibling, 1 reply; 117+ messages in thread
From: Gerd Knorr @ 2005-11-23 16:43 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Linus Torvalds, Dave Jones, Zachary Amsden,
	Pavel Machek, Andrew Morton, Linux Kernel Mailing List,
	H. Peter Anvin, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

Alan Cox wrote:
> On Mer, 2005-11-23 at 12:17 -0700, Andi Kleen wrote:
>>> +	/* Paranoia */
>>> +	asm volatile ("jmp 1f\n1:");
>>> +	mb();
>> That would be totally obsolete 386 era paranoia. If anything then use 
>> a CLFLUSH (but not available on all x86s) 
> 
> If you are patching code another x86 CPU is running you must halt the
> other processors and ensure it executes a serializing instruction before
> it enters any patched code. 

Patching in/out SMP-locking with more than one active CPU would be a 
pretty silly idea in the first place ;)

> How many kilobytes of tables do you add to the kernel to do this
> pointless stunt btw ?

  16 .smp_altinstructions 0000ae0b  c03b4000  003b4000  002b4000  2**2
                   CONTENTS, ALLOC, LOAD, READONLY, DATA
  17 .smp_altinstr_replacement 00000f6e  c03bee0b  003bee0b  002bee0b  2**0
                   CONTENTS, ALLOC, LOAD, CODE

cheers,

   Gerd
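
To answer the "how many kilobytes" question directly, the size column of
the objdump listing above can be added up.  The sketch below only does
that arithmetic; the struct and function names are made up for
illustration, and the figures apply to this one build.

```c
#include <assert.h>
#include <stdint.h>

struct section_size {
    const char *name;
    uint32_t    bytes;  /* size column of the section listing */
};

/* Sizes copied from the listing above. */
static const struct section_size smp_sections[] = {
    { ".smp_altinstructions",      0x0000ae0b },  /* 44555 bytes */
    { ".smp_altinstr_replacement", 0x00000f6e },  /*  3950 bytes */
};

/* Sum the sizes of `n` sections. */
static uint32_t total_bytes(const struct section_size *s, unsigned n)
{
    uint32_t sum = 0;

    assert(s);
    for (unsigned i = 0; i < n; i++)
        sum += s[i].bytes;
    return sum;
}
```

The two sections come to 48505 bytes, roughly 47.4 KiB, nearly all of it
in the .smp_altinstructions table.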


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 19:17                             ` Andi Kleen
  2005-11-23 15:29                               ` Gerd Knorr
@ 2005-11-23 16:42                               ` Alan Cox
  2005-11-23 16:39                                 ` Andi Kleen
  2005-11-23 16:43                                 ` Gerd Knorr
  1 sibling, 2 replies; 117+ messages in thread
From: Alan Cox @ 2005-11-23 16:42 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Gerd Knorr, Linus Torvalds, Dave Jones, Zachary Amsden,
	Pavel Machek, Andrew Morton, Linux Kernel Mailing List,
	H. Peter Anvin, Zwane Mwaikambo, Pratap Subrahmanyam,
	Christopher Li, Eric W. Biederman, Ingo Molnar

On Mer, 2005-11-23 at 12:17 -0700, Andi Kleen wrote:
> > +	/* Paranoia */
> > +	asm volatile ("jmp 1f\n1:");
> > +	mb();
> 
> That would be totally obsolete 386 era paranoia. If anything then use 
> a CLFLUSH (but not available on all x86s) 

If you are patching code another x86 CPU is running you must halt the
other processors and ensure it executes a serializing instruction before
it enters any patched code. 

How many kilobytes of tables do you add to the kernel to do this
pointless stunt btw ?

Alan "CPU errata are fun" Cox


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 16:42                               ` Alan Cox
@ 2005-11-23 16:39                                 ` Andi Kleen
  2005-11-23 17:21                                   ` Alan Cox
  2005-11-23 16:43                                 ` Gerd Knorr
  1 sibling, 1 reply; 117+ messages in thread
From: Andi Kleen @ 2005-11-23 16:39 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Gerd Knorr, Linus Torvalds, Dave Jones,
	Zachary Amsden, Pavel Machek, Andrew Morton,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Eric W. Biederman,
	Ingo Molnar

On Wed, Nov 23, 2005 at 04:42:13PM +0000, Alan Cox wrote:
> On Mer, 2005-11-23 at 12:17 -0700, Andi Kleen wrote:
> > > +	/* Paranoia */
> > > +	asm volatile ("jmp 1f\n1:");
> > > +	mb();
> > 
> > That would be totally obsolete 386 era paranoia. If anything then use 
> > a CLFLUSH (but not available on all x86s) 
> 
> If you are patching code another x86 CPU is running you must halt the
> > other processors and ensure it executes a serializing instruction before
> it enters any patched code. 

Yes that is why the original alternative() mechanism always only
runs before the code is ever executed.

> How many kilobytes of tables do you add to the kernel to do this
> pointless stunt btw ?

I much prefer the MSR bit too. Unfortunately it doesn't exist
(or rather I bet it exists somewhere, just undocumented) on Intel 
systems.

-Andi

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-23 19:17                             ` Andi Kleen
@ 2005-11-23 15:29                               ` Gerd Knorr
  2005-11-23 16:42                               ` Alan Cox
  1 sibling, 0 replies; 117+ messages in thread
From: Gerd Knorr @ 2005-11-23 15:29 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Dave Jones, Zachary Amsden, Pavel Machek,
	Andrew Morton, Linux Kernel Mailing List, H. Peter Anvin,
	Zwane Mwaikambo, Pratap Subrahmanyam, Christopher Li,
	Eric W. Biederman, Ingo Molnar

Andi Kleen wrote:
> Gerd Knorr <kraxel@suse.de> writes:
> 
>> Modules are supported now, fully modularized distro kernel works fine
>> with it.  If you have a kernel with HOTPLUG_CPU compiled you can
>> shutdown the second CPU of your dual-processor system via sysfs (echo
>> 0 > /sys/devices/system/cpu/cpu1/online) and watch the kernel switch
>> over to UP code without lock-prefixed instructions and simplified
>> spinlocks, then power up the second CPU again (echo 1 > /sys/...) and
>> watch it patching back in the SMP locking.
> 
> This looks like total overkill to me. Who needs to optimize
> CPU hotplug this way? If you really need this just do it 
> at boot time with the existing mechanisms.

Sure, for real hardware doing that at boot time is perfectly fine. 
In a virtual environment it's very useful to be able to plug in one more 
virtual CPU on demand without rebooting though.  The patch isn't very 
useful alone, it's more one step on the road of getting the xen bits 
merged mainline.

>> +	/* Paranoia */
>> +	asm volatile ("jmp 1f\n1:");
>> +	mb();
> 
> That would be totally obsolete 386 era paranoia. If anything then use 
> a CLFLUSH (but not available on all x86s) 

Ok, dropped.  I've just copied that from the original, pretty ugly xen 
patch.

cheers,

   Gerd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-22 17:48                           ` [patch] " Gerd Knorr
  2005-11-22 18:01                             ` Pavel Machek
@ 2005-11-23 15:12                             ` Vincent Hanquez
  2005-11-23 19:17                             ` Andi Kleen
  2 siblings, 0 replies; 117+ messages in thread
From: Vincent Hanquez @ 2005-11-23 15:12 UTC (permalink / raw)
  To: Gerd Knorr
  Cc: Linus Torvalds, Dave Jones, Zachary Amsden, Pavel Machek,
	Andrew Morton, Linux Kernel Mailing List, H. Peter Anvin,
	Zwane Mwaikambo, Pratap Subrahmanyam, Christopher Li,
	Eric W. Biederman, Ingo Molnar

On Tue, Nov 22, 2005 at 06:48:07PM +0100, Gerd Knorr wrote:
> +	smp = kmalloc(sizeof(*smp), GFP_KERNEL);
> +	if (NULL == smp)
> +		return; /* we'll run the (safe but slow) SMP code then ... */
> +
> +	memset(smp,0,sizeof(*smp));

what about using kzalloc ?

> +	if (ALT_UP == smp_alt_state)
> +		goto out;

any chance to write it smp_alt_state == ALT_UP ?

IMHO, this way of writing equal condition is backward (like giving
answer before asking the question). I do know of the (pseudo-)benefit
to write it this way, but that's not worth it.

Plus, nowadays, gcc warns you about simple equal in if.

Cheers,
-- 
Vincent Hanquez

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [patch] SMP alternatives
  2005-11-22 17:48                           ` [patch] " Gerd Knorr
@ 2005-11-22 18:01                             ` Pavel Machek
  2005-11-23 15:12                             ` Vincent Hanquez
  2005-11-23 19:17                             ` Andi Kleen
  2 siblings, 0 replies; 117+ messages in thread
From: Pavel Machek @ 2005-11-22 18:01 UTC (permalink / raw)
  To: Gerd Knorr
  Cc: Linus Torvalds, Dave Jones, Zachary Amsden, Andrew Morton,
	Linux Kernel Mailing List, H. Peter Anvin, Zwane Mwaikambo,
	Pratap Subrahmanyam, Christopher Li, Eric W. Biederman,
	Ingo Molnar

Hi!

> For testing & benchmarking purposes I've put also in two (temporary) 
> sysrq's to switch between UP and SMP bits without booting/shutting down 
> the second CPU.  That one breaks non-i386 builds which are trivially 
> fixable by just dropping the drivers/char/sysrq.c changes ;)

> +/* Replace instructions with better alternatives for this CPU type.
> +
> +   This runs before SMP is initialized to avoid SMP problems with
> +   self modifying code. This implies that assymetric systems where
> +   APs have less capabilities than the boot processor are not handled. 
> +   Tough. Make sure you disable such features by hand. */ 
> +void apply_alternatives(struct alt_instr *start, struct alt_instr *end,
> +			__u8 *tstart, __u8 *tend)
> +{ 
> +        unsigned char **noptable = intel_nops;
> +	struct alt_instr *a; 

Some alignment problems here. (Maybe it is okay as a source).

> +struct smp_alt_module {
> +	/* what is this ??? */

:-))))))).

> +	struct module    *mod;
> +	char             *name;
> +
> +	/* our SMP alternatives table */
> +	struct alt_instr *astart;
> +	struct alt_instr *aend;
> +
> +	/* .text segment, needed to avoid patching init code ;) */
> +	__u8             *tstart;
> +	__u8             *tend;

You should be able to use u8 here.

> +		if (0 == strcmp(".text", secstrings + s->sh_name))
> +			text = s;
> +		if (0 == strcmp(".altinstructions", secstrings + s->sh_name))
> +			alt = s;
> +		if (0 == strcmp(".smp_altinstructions", secstrings + s->sh_name))
> +			smpalt = s;

Can we get if (!strcmp()) here?

								Pavel
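
For reference, the requested !strcmp() form looks like this in a
standalone lookup.  This is a sketch with a hypothetical helper and a
plain array of names, not the module loader code from the patch, which
walks ELF section headers instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the index of the section called `want`, or -1 if there is
 * none.  strcmp() returns 0 on a match, hence the negation.
 */
static int find_section(const char *const *names, size_t n,
                        const char *want)
{
    assert(want != NULL);
    for (size_t i = 0; i < n; i++)
        if (!strcmp(names[i], want))
            return (int)i;
    return -1;
}
```

Both spellings compile to the same comparison; the difference is purely
one of style.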

-- 
Thanks, Sharp!

^ permalink raw reply	[flat|nested] 117+ messages in thread

* [patch] SMP alternatives
  2005-11-16 16:12                         ` [RFC] SMP alternatives Gerd Knorr
@ 2005-11-22 17:48                           ` Gerd Knorr
  2005-11-22 18:01                             ` Pavel Machek
                                               ` (2 more replies)
  0 siblings, 3 replies; 117+ messages in thread
From: Gerd Knorr @ 2005-11-22 17:48 UTC (permalink / raw)
  To: Gerd Knorr
  Cc: Linus Torvalds, Dave Jones, Zachary Amsden, Pavel Machek,
	Andrew Morton, Linux Kernel Mailing List, H. Peter Anvin,
	Zwane Mwaikambo, Pratap Subrahmanyam, Christopher Li,
	Eric W. Biederman, Ingo Molnar

[-- Attachment #1: Type: text/plain, Size: 1164 bytes --]

Gerd Knorr wrote:
> Gerd Knorr wrote:
> 
>> i.e. something like this (as basic idea, patch is far away from doing 
>> anything useful ...)?
> 
> Adapting $subject to the actual topic, so other lkml readers can catch 
> up ;)
> 
> Ok, here is a new version of the SMP alternatives patch.  It features:

Now, after some days of hacking, debugging and kernel crashing, I have 
something more than just a proof of concept ;)

Modules are supported now, and a fully modularized distro kernel works 
fine with it.  If your kernel is built with HOTPLUG_CPU, you can shut 
down the second CPU of your dual-processor system via sysfs (echo 0 
 > /sys/devices/system/cpu/cpu1/online) and watch the kernel switch over 
to UP code, with lock prefixes dropped and spinlocks simplified.  Then 
power the second CPU up again (echo 1 > /sys/...) and watch the kernel 
patch the SMP locking back in.
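
The sysfs toggle described above can also be driven from a small C
helper instead of echo.  This is only a sketch: set_cpu_online() is a
hypothetical name, the path is passed in by the caller, and writing to
the real sysfs file needs root and a HOTPLUG_CPU kernel.

```c
#include <stdio.h>

/* Hypothetical helper: write "0" or "1" to a CPU's sysfs "online"
 * file, e.g. /sys/devices/system/cpu/cpu1/online.
 * Returns 0 on success, -1 on failure. */
static int set_cpu_online(const char *path, int online)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fputs(online ? "1\n" : "0\n", f) == EOF) {
		fclose(f);
		return -1;
	}
	/* fclose() flushes; a failed flush means the write did not land */
	return fclose(f) == 0 ? 0 : -1;
}
```

Calling set_cpu_online("/sys/devices/system/cpu/cpu1/online", 0) and
then the same with 1 should, with the patch applied, produce the
"alternatives: switching to UP code" and "switching to SMP code"
messages in the kernel log.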

For testing and benchmarking purposes I've also put in two (temporary) 
sysrq keys to switch between the UP and SMP bits without booting or 
shutting down the second CPU.  That change breaks non-i386 builds, which 
is trivially fixable by just dropping the drivers/char/sysrq.c changes ;)
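
What the UP/SMP switch does to a lock-prefixed instruction can be
pictured with a userspace model.  This is a deliberately simplified
sketch, not the patch's mechanism: the patch records alt_instr entries
and keeps a saved copy of the original bytes, whereas the model below
just keeps a hypothetical table of offsets (struct lock_site) into a
byte buffer and flips between the lock prefix (0xf0) and a one-byte
nop (0x90).

```c
#include <stddef.h>
#include <stdint.h>

#define LOCK_PREFIX_BYTE 0xf0	/* x86 "lock" prefix */
#define NOP_BYTE         0x90	/* x86 one-byte nop */

/* Hypothetical table entry: offset of one lock prefix in `text`. */
struct lock_site {
	size_t offset;
};

/* UP: overwrite every recorded lock prefix with a nop. */
static void patch_to_up(uint8_t *text, const struct lock_site *sites,
			size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (text[sites[i].offset] == LOCK_PREFIX_BYTE)
			text[sites[i].offset] = NOP_BYTE;
}

/* SMP: put the lock prefix back where a nop was patched in. */
static void patch_to_smp(uint8_t *text, const struct lock_site *sites,
			 size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (text[sites[i].offset] == NOP_BYTE)
			text[sites[i].offset] = LOCK_PREFIX_BYTE;
}
```

For example, the bytes f0 ff 00 ("lock incl (%eax)") become 90 ff 00
after patch_to_up() and return to f0 ff 00 after patch_to_smp().  The
real code additionally has to make sure no other CPU is executing the
text while it is being modified, which this model ignores.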

enjoy,

   Gerd

[-- Attachment #2: smp-alternatives-22.diff --]
[-- Type: text/x-patch, Size: 42943 bytes --]

diff -urN -x 'build-*' -x '*~' -x Make -x scripts linux-2.6.14/arch/i386/kernel/Makefile work-2.6.14/arch/i386/kernel/Makefile
--- linux-2.6.14/arch/i386/kernel/Makefile	2005-10-28 02:02:08.000000000 +0200
+++ work-2.6.14/arch/i386/kernel/Makefile	2005-11-21 09:19:52.000000000 +0100
@@ -7,7 +7,7 @@
 obj-y	:= process.o semaphore.o signal.o entry.o traps.o irq.o vm86.o \
 		ptrace.o time.o ioport.o ldt.o setup.o i8259.o sys_i386.o \
 		pci-dma.o i386_ksyms.o i387.o dmi_scan.o bootflag.o \
-		doublefault.o quirks.o i8237.o
+		doublefault.o quirks.o i8237.o alternative.o
 
 obj-y				+= cpu/
 obj-y				+= timers/
diff -urN -x 'build-*' -x '*~' -x Make -x scripts linux-2.6.14/arch/i386/kernel/alternative.c work-2.6.14/arch/i386/kernel/alternative.c
--- linux-2.6.14/arch/i386/kernel/alternative.c	1970-01-01 01:00:00.000000000 +0100
+++ work-2.6.14/arch/i386/kernel/alternative.c	2005-11-22 16:58:59.000000000 +0100
@@ -0,0 +1,285 @@
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/list.h>
+#include <asm/alternative.h>
+
+#define DEBUG 0
+#if DEBUG
+# define DPRINTK(fmt, args...) printk(fmt, args)
+#else
+# define DPRINTK(fmt, args...)
+#endif
+
+/* Use inline assembly to define this because the nops are defined 
+   as inline assembly strings in the include files and we cannot 
+   get them easily into strings. */
+asm("\t.data\nintelnops: " 
+    GENERIC_NOP1 GENERIC_NOP2 GENERIC_NOP3 GENERIC_NOP4 GENERIC_NOP5 GENERIC_NOP6
+    GENERIC_NOP7 GENERIC_NOP8); 
+asm("\t.data\nk8nops: " 
+    K8_NOP1 K8_NOP2 K8_NOP3 K8_NOP4 K8_NOP5 K8_NOP6
+    K8_NOP7 K8_NOP8); 
+asm("\t.data\nk7nops: " 
+    K7_NOP1 K7_NOP2 K7_NOP3 K7_NOP4 K7_NOP5 K7_NOP6
+    K7_NOP7 K7_NOP8); 
+    
+extern unsigned char intelnops[], k8nops[], k7nops[];
+static unsigned char *intel_nops[ASM_NOP_MAX+1] = { 
+     NULL,
+     intelnops,
+     intelnops + 1,
+     intelnops + 1 + 2,
+     intelnops + 1 + 2 + 3,
+     intelnops + 1 + 2 + 3 + 4,
+     intelnops + 1 + 2 + 3 + 4 + 5,
+     intelnops + 1 + 2 + 3 + 4 + 5 + 6,
+     intelnops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
+}; 
+static unsigned char *k8_nops[ASM_NOP_MAX+1] = { 
+     NULL,
+     k8nops,
+     k8nops + 1,
+     k8nops + 1 + 2,
+     k8nops + 1 + 2 + 3,
+     k8nops + 1 + 2 + 3 + 4,
+     k8nops + 1 + 2 + 3 + 4 + 5,
+     k8nops + 1 + 2 + 3 + 4 + 5 + 6,
+     k8nops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
+}; 
+static unsigned char *k7_nops[ASM_NOP_MAX+1] = { 
+     NULL,
+     k7nops,
+     k7nops + 1,
+     k7nops + 1 + 2,
+     k7nops + 1 + 2 + 3,
+     k7nops + 1 + 2 + 3 + 4,
+     k7nops + 1 + 2 + 3 + 4 + 5,
+     k7nops + 1 + 2 + 3 + 4 + 5 + 6,
+     k7nops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
+}; 
+static struct nop { 
+     int cpuid; 
+     unsigned char **noptable; 
+} noptypes[] = { 
+     { X86_FEATURE_K8, k8_nops }, 
+     { X86_FEATURE_K7, k7_nops }, 
+     { -1, NULL }
+}; 
+
+/* Replace instructions with better alternatives for this CPU type.
+
+   This runs before SMP is initialized to avoid SMP problems with
+   self modifying code. This implies that asymmetric systems where
+   APs have less capabilities than the boot processor are not handled. 
+   Tough. Make sure you disable such features by hand. */ 
+void apply_alternatives(struct alt_instr *start, struct alt_instr *end,
+			__u8 *tstart, __u8 *tend)
+{ 
+        unsigned char **noptable = intel_nops;
+	struct alt_instr *a; 
+	int diff, i, k;
+
+	DPRINTK("%s: alts %p-%p, text %p-%p\n", __FUNCTION__,
+		start, end, tstart, tend);
+	for (i = 0; noptypes[i].cpuid >= 0; i++) { 
+		if (boot_cpu_has(noptypes[i].cpuid)) { 
+			noptable = noptypes[i].noptable;
+			break;
+		}
+	} 
+	for (a = start; a < end; a++) { 
+		BUG_ON(a->replacementlen > a->instrlen); 
+		if (!boot_cpu_has(a->cpuid))
+			continue;
+		if (tstart && a->instr < tstart)
+			continue;
+		if (tend && a->instr > tend)
+			continue;
+		memcpy(a->instr, a->replacement, a->replacementlen); 
+		diff = a->instrlen - a->replacementlen; 
+		/* Pad the rest with nops */
+		for (i = a->replacementlen; diff > 0; diff -= k, i += k) {
+			k = diff;
+			if (k > ASM_NOP_MAX)
+				k = ASM_NOP_MAX;
+			memcpy(a->instr + i, noptable[k], k); 
+		} 
+	}
+
+	/* Paranoia */
+	asm volatile ("jmp 1f\n1:");
+	mb();
+} 
+
+struct smp_alt_module {
+	/* what is this ??? */
+	struct module    *mod;
+	char             *name;
+
+	/* our SMP alternatives table */
+	struct alt_instr *astart;
+	struct alt_instr *aend;
+
+	/* .text segment, needed to avoid patching init code ;) */
+	__u8             *tstart;
+	__u8             *tend;
+
+	struct list_head next;
+};
+static LIST_HEAD(smp_alt_modules);
+static DEFINE_SPINLOCK(smp_alt);
+static enum {
+	ALT_UP, ALT_SMP
+} smp_alt_state = ALT_SMP;
+
+static void save_alternatives_smp(struct smp_alt_module *mod)
+{
+	struct alt_instr *a;
+
+	DPRINTK("%s: alts %p-%p, text %p-%p, name %s\n", __FUNCTION__,
+		mod->astart, mod->aend, mod->tstart, mod->tend, mod->name);
+	for (a = mod->astart; a < mod->aend; a++) {
+		if (a->instr < mod->tstart)
+			continue;
+		if (a->instr > mod->tend)
+			continue;
+		memcpy(a->replacement + a->replacementlen,
+		       a->instr,
+		       a->instrlen);
+	}
+}
+
+static void apply_alternatives_smp(struct smp_alt_module *mod)
+{
+	struct alt_instr *a;
+
+	DPRINTK("%s: alts %p-%p, text %p-%p, name %s\n", __FUNCTION__,
+		mod->astart, mod->aend, mod->tstart, mod->tend, mod->name);
+	for (a = mod->astart; a < mod->aend; a++) {
+		if (a->instr < mod->tstart)
+			continue;
+		if (a->instr > mod->tend)
+			continue;
+		memcpy(a->instr,
+		       a->replacement + a->replacementlen,
+		       a->instrlen);
+	}
+
+	/* Paranoia */
+	asm volatile ("jmp 1f\n1:");
+	mb();
+}
+
+extern struct alt_instr __alt_instructions[], __alt_instructions_end[];
+extern struct alt_instr __smp_alt_instructions[], __smp_alt_instructions_end[];
+extern __u8 _text[], _etext[];
+
+void alternatives_smp_module_add(struct module *mod, char *name,
+				 void *astart, void *aend,
+				 void *tstart, void *tend)
+{
+	struct smp_alt_module *smp;
+	unsigned long flags;
+
+	smp = kmalloc(sizeof(*smp), GFP_KERNEL);
+	if (NULL == smp)
+		return; /* we'll run the (safe but slow) SMP code then ... */
+
+	memset(smp,0,sizeof(*smp));
+	smp->mod    = mod;
+	smp->name   = name;
+	smp->astart = astart;
+	smp->aend   = aend;
+	smp->tstart = tstart;
+	smp->tend   = tend;
+	DPRINTK("%s: alts %p-%p, text %p-%p, name %s\n", __FUNCTION__,
+		smp->astart, smp->aend, smp->tstart, smp->tend, smp->name);
+
+	spin_lock_irqsave(&smp_alt, flags);
+	list_add_tail(&smp->next, &smp_alt_modules);
+	save_alternatives_smp(smp);
+	if (ALT_UP == smp_alt_state)
+		apply_alternatives(smp->astart, smp->aend,
+				   smp->tstart, smp->tend);
+	spin_unlock_irqrestore(&smp_alt, flags);
+
+}
+
+void alternatives_smp_module_del(struct module *mod)
+{
+	struct smp_alt_module *item;
+	unsigned long flags;
+
+	spin_lock_irqsave(&smp_alt, flags);
+	list_for_each_entry(item, &smp_alt_modules, next) {
+		if (mod != item->mod)
+			continue;
+		list_del(&item->next);
+		spin_unlock_irqrestore(&smp_alt, flags);
+		DPRINTK("%s: %s\n", __FUNCTION__, item->name);
+		kfree(item);
+		return;
+	}
+	spin_unlock_irqrestore(&smp_alt, flags);
+}
+
+void switch_alternatives_up(void) 
+{
+	struct smp_alt_module *mod;
+	unsigned long flags;
+
+	if (num_online_cpus() > 1) {
+		/* shouldn't happen in theory ... */
+		printk("%s: Uh, oh, %d cpus active, NOT patching ...\n",
+		       __FUNCTION__, num_online_cpus());
+		dump_stack();
+		return;
+	}
+
+	spin_lock_irqsave(&smp_alt, flags);
+
+	if (ALT_UP == smp_alt_state)
+		goto out;
+	smp_alt_state = ALT_UP;
+	printk(KERN_INFO "alternatives: switching to UP code\n");
+
+	set_bit(X86_FEATURE_UP, boot_cpu_data.x86_capability);
+	list_for_each_entry(mod, &smp_alt_modules, next)
+		apply_alternatives(mod->astart, mod->aend,
+				   mod->tstart, mod->tend);
+
+ out:
+	spin_unlock_irqrestore(&smp_alt, flags);
+} 
+
+void switch_alternatives_smp(void) 
+{ 
+	struct smp_alt_module *mod;
+	unsigned long flags;
+
+	spin_lock_irqsave(&smp_alt, flags);
+
+	if (ALT_SMP == smp_alt_state)
+		goto out;
+	smp_alt_state = ALT_SMP;
+	printk(KERN_INFO "alternatives: switching to SMP code\n");
+
+	clear_bit(X86_FEATURE_UP, boot_cpu_data.x86_capability);
+	list_for_each_entry(mod, &smp_alt_modules, next)
+		apply_alternatives_smp(mod);
+
+ out:
+	spin_unlock_irqrestore(&smp_alt, flags);
+} 
+
+void __init alternative_instructions(void)
+{
+	apply_alternatives(__alt_instructions, __alt_instructions_end,
+			   NULL, NULL);
+	alternatives_smp_module_add(NULL, "core kernel",
+				    __smp_alt_instructions,
+				    __smp_alt_instructions_end,
+				    _text, _etext);
+	switch_alternatives_up();
+}
+
diff -urN -x 'build-*' -x '*~' -x Make -x scripts linux-2.6.14/arch/i386/kernel/module.c work-2.6.14/arch/i386/kernel/module.c
--- linux-2.6.14/arch/i386/kernel/module.c	2005-10-28 02:02:08.000000000 +0200
+++ work-2.6.14/arch/i386/kernel/module.c	2005-11-22 15:59:19.000000000 +0100
@@ -104,26 +104,39 @@
 	return -ENOEXEC;
 }
 
-extern void apply_alternatives(void *start, void *end); 
-
 int module_finalize(const Elf_Ehdr *hdr,
 		    const Elf_Shdr *sechdrs,
 		    struct module *me)
 {
-	const Elf_Shdr *s;
+	const Elf_Shdr *s, *text = NULL, *alt = NULL, *smpalt = NULL;
 	char *secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
 
-	/* look for .altinstructions to patch */ 
 	for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) { 
-		void *seg; 		
-		if (strcmp(".altinstructions", secstrings + s->sh_name))
-			continue;
-		seg = (void *)s->sh_addr; 
-		apply_alternatives(seg, seg + s->sh_size); 
-	} 	
+		if (0 == strcmp(".text", secstrings + s->sh_name))
+			text = s;
+		if (0 == strcmp(".altinstructions", secstrings + s->sh_name))
+			alt = s;
+		if (0 == strcmp(".smp_altinstructions", secstrings + s->sh_name))
+			smpalt = s;
+	}
+
+	if (alt) {
+		/* patch .altinstructions */ 
+		void *aseg = (void *)alt->sh_addr;
+		apply_alternatives(aseg, aseg + alt->sh_size, NULL, NULL);
+	}
+	if (smpalt && text) {
+		void *aseg = (void *)smpalt->sh_addr;
+		void *tseg = (void *)text->sh_addr;
+		alternatives_smp_module_add(me, me->name,
+					    aseg, aseg + smpalt->sh_size,
+					    tseg, tseg + text->sh_size);
+	}
+
 	return 0;
 }
 
 void module_arch_cleanup(struct module *mod)
 {
+	alternatives_smp_module_del(mod);
 }
diff -urN -x 'build-*' -x '*~' -x Make -x scripts linux-2.6.14/arch/i386/kernel/semaphore.c work-2.6.14/arch/i386/kernel/semaphore.c
--- linux-2.6.14/arch/i386/kernel/semaphore.c	2005-10-28 02:02:08.000000000 +0200
+++ work-2.6.14/arch/i386/kernel/semaphore.c	2005-11-17 11:17:58.000000000 +0100
@@ -110,11 +110,11 @@
 ".align	4\n"
 ".globl	__write_lock_failed\n"
 "__write_lock_failed:\n\t"
-	LOCK "addl	$" RW_LOCK_BIAS_STR ",(%eax)\n"
+	LOCK_PRE "addl	$" RW_LOCK_BIAS_STR ",(%eax)" LOCK_POST "\n"
 "1:	rep; nop\n\t"
 	"cmpl	$" RW_LOCK_BIAS_STR ",(%eax)\n\t"
 	"jne	1b\n\t"
-	LOCK "subl	$" RW_LOCK_BIAS_STR ",(%eax)\n\t"
+	LOCK_PRE "subl	$" RW_LOCK_BIAS_STR ",(%eax)" LOCK_POST "\n\t"
 	"jnz	__write_lock_failed\n\t"
 	"ret"
 );
@@ -124,11 +124,11 @@
 ".align	4\n"
 ".globl	__read_lock_failed\n"
 "__read_lock_failed:\n\t"
-	LOCK "incl	(%eax)\n"
+	LOCK_PRE "incl	(%eax)" LOCK_POST "\n"
 "1:	rep; nop\n\t"
 	"cmpl	$1,(%eax)\n\t"
 	"js	1b\n\t"
-	LOCK "decl	(%eax)\n\t"
+	LOCK_PRE "decl	(%eax)" LOCK_POST "\n\t"
 	"js	__read_lock_failed\n\t"
 	"ret"
 );
diff -urN -x 'build-*' -x '*~' -x Make -x scripts linux-2.6.14/arch/i386/kernel/setup.c work-2.6.14/arch/i386/kernel/setup.c
--- linux-2.6.14/arch/i386/kernel/setup.c	2005-10-28 02:02:08.000000000 +0200
+++ work-2.6.14/arch/i386/kernel/setup.c	2005-11-21 09:10:40.000000000 +0100
@@ -1361,101 +1361,6 @@
 		pci_mem_start, gapstart, gapsize);
 }
 
-/* Use inline assembly to define this because the nops are defined 
-   as inline assembly strings in the include files and we cannot 
-   get them easily into strings. */
-asm("\t.data\nintelnops: " 
-    GENERIC_NOP1 GENERIC_NOP2 GENERIC_NOP3 GENERIC_NOP4 GENERIC_NOP5 GENERIC_NOP6
-    GENERIC_NOP7 GENERIC_NOP8); 
-asm("\t.data\nk8nops: " 
-    K8_NOP1 K8_NOP2 K8_NOP3 K8_NOP4 K8_NOP5 K8_NOP6
-    K8_NOP7 K8_NOP8); 
-asm("\t.data\nk7nops: " 
-    K7_NOP1 K7_NOP2 K7_NOP3 K7_NOP4 K7_NOP5 K7_NOP6
-    K7_NOP7 K7_NOP8); 
-    
-extern unsigned char intelnops[], k8nops[], k7nops[];
-static unsigned char *intel_nops[ASM_NOP_MAX+1] = { 
-     NULL,
-     intelnops,
-     intelnops + 1,
-     intelnops + 1 + 2,
-     intelnops + 1 + 2 + 3,
-     intelnops + 1 + 2 + 3 + 4,
-     intelnops + 1 + 2 + 3 + 4 + 5,
-     intelnops + 1 + 2 + 3 + 4 + 5 + 6,
-     intelnops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
-}; 
-static unsigned char *k8_nops[ASM_NOP_MAX+1] = { 
-     NULL,
-     k8nops,
-     k8nops + 1,
-     k8nops + 1 + 2,
-     k8nops + 1 + 2 + 3,
-     k8nops + 1 + 2 + 3 + 4,
-     k8nops + 1 + 2 + 3 + 4 + 5,
-     k8nops + 1 + 2 + 3 + 4 + 5 + 6,
-     k8nops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
-}; 
-static unsigned char *k7_nops[ASM_NOP_MAX+1] = { 
-     NULL,
-     k7nops,
-     k7nops + 1,
-     k7nops + 1 + 2,
-     k7nops + 1 + 2 + 3,
-     k7nops + 1 + 2 + 3 + 4,
-     k7nops + 1 + 2 + 3 + 4 + 5,
-     k7nops + 1 + 2 + 3 + 4 + 5 + 6,
-     k7nops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
-}; 
-static struct nop { 
-     int cpuid; 
-     unsigned char **noptable; 
-} noptypes[] = { 
-     { X86_FEATURE_K8, k8_nops }, 
-     { X86_FEATURE_K7, k7_nops }, 
-     { -1, NULL }
-}; 
-
-/* Replace instructions with better alternatives for this CPU type.
-
-   This runs before SMP is initialized to avoid SMP problems with
-   self modifying code. This implies that assymetric systems where
-   APs have less capabilities than the boot processor are not handled. 
-   Tough. Make sure you disable such features by hand. */ 
-void apply_alternatives(void *start, void *end) 
-{ 
-	struct alt_instr *a; 
-	int diff, i, k;
-        unsigned char **noptable = intel_nops; 
-	for (i = 0; noptypes[i].cpuid >= 0; i++) { 
-		if (boot_cpu_has(noptypes[i].cpuid)) { 
-			noptable = noptypes[i].noptable;
-			break;
-		}
-	} 
-	for (a = start; (void *)a < end; a++) { 
-		if (!boot_cpu_has(a->cpuid))
-			continue;
-		BUG_ON(a->replacementlen > a->instrlen); 
-		memcpy(a->instr, a->replacement, a->replacementlen); 
-		diff = a->instrlen - a->replacementlen; 
-		/* Pad the rest with nops */
-		for (i = a->replacementlen; diff > 0; diff -= k, i += k) {
-			k = diff;
-			if (k > ASM_NOP_MAX)
-				k = ASM_NOP_MAX;
-			memcpy(a->instr + i, noptable[k], k); 
-		} 
-	}
-} 
-
-void __init alternative_instructions(void)
-{
-	extern struct alt_instr __alt_instructions[], __alt_instructions_end[];
-	apply_alternatives(__alt_instructions, __alt_instructions_end);
-}
-
 static char * __init machine_specific_memory_setup(void);
 
 #ifdef CONFIG_MCA
diff -urN -x 'build-*' -x '*~' -x Make -x scripts linux-2.6.14/arch/i386/kernel/smpboot.c work-2.6.14/arch/i386/kernel/smpboot.c
--- linux-2.6.14/arch/i386/kernel/smpboot.c	2005-10-28 02:02:08.000000000 +0200
+++ work-2.6.14/arch/i386/kernel/smpboot.c	2005-11-18 14:56:37.000000000 +0100
@@ -874,6 +874,7 @@
 	unsigned short nmi_high = 0, nmi_low = 0;
 
 	++cpucount;
+	switch_alternatives_smp();
 
 	/*
 	 * We can't use kernel_thread since we must avoid to
@@ -1315,6 +1316,9 @@
 	fixup_irqs(map);
 	/* It's now safe to remove this processor from the online map */
 	cpu_clear(cpu, cpu_online_map);
+
+	if (1 == num_online_cpus())
+		switch_alternatives_up();
 	return 0;
 }
 
diff -urN -x 'build-*' -x '*~' -x Make -x scripts linux-2.6.14/arch/i386/kernel/vmlinux.lds.S work-2.6.14/arch/i386/kernel/vmlinux.lds.S
--- linux-2.6.14/arch/i386/kernel/vmlinux.lds.S	2005-10-28 02:02:08.000000000 +0200
+++ work-2.6.14/arch/i386/kernel/vmlinux.lds.S	2005-11-16 09:50:35.000000000 +0100
@@ -68,6 +68,16 @@
 	*(.data.init_task)
   }
 
+  . = ALIGN(4);
+  __smp_alt_instructions = .;
+  .smp_altinstructions : AT(ADDR(.smp_altinstructions) - LOAD_OFFSET) {
+	*(.smp_altinstructions)
+  }
+  __smp_alt_instructions_end = .; 
+  .smp_altinstr_replacement : AT(ADDR(.smp_altinstr_replacement) - LOAD_OFFSET) {
+	*(.smp_altinstr_replacement)
+  }
+
   /* will be freed after init */
   . = ALIGN(4096);		/* Init code and data */
   __init_begin = .;
diff -urN -x 'build-*' -x '*~' -x Make -x scripts linux-2.6.14/drivers/char/sysrq.c work-2.6.14/drivers/char/sysrq.c
--- linux-2.6.14/drivers/char/sysrq.c	2005-10-28 02:02:08.000000000 +0200
+++ work-2.6.14/drivers/char/sysrq.c	2005-11-21 09:26:30.000000000 +0100
@@ -271,6 +271,34 @@
 	.enable_mask	= SYSRQ_ENABLE_RTNICE,
 };
 
+#ifdef CONFIG_SMP
+/* handy for testing & benchmarking, probably temporary though.
+ *                                       -- kraxel */
+static void sysrq_handle_up(int key, struct pt_regs *pt_regs,
+			    struct tty_struct *tty)
+{
+	switch_alternatives_up();
+}
+static struct sysrq_key_op sysrq_up_op = {
+	.handler	= sysrq_handle_up,
+	.help_msg	= "UP(x)",
+	.action_msg	= "switch smp alternatives to UP",
+	.enable_mask	= SYSRQ_ENABLE_LOG,
+};
+
+static void sysrq_handle_smp(int key, struct pt_regs *pt_regs,
+			    struct tty_struct *tty)
+{
+	switch_alternatives_smp();
+}
+static struct sysrq_key_op sysrq_smp_op = {
+	.handler	= sysrq_handle_smp,
+	.help_msg	= "SMP(y)",
+	.action_msg	= "switch smp alternatives to SMP",
+	.enable_mask	= SYSRQ_ENABLE_LOG,
+};
+#endif
+
 /* Key Operations table and lock */
 static DEFINE_SPINLOCK(sysrq_key_table_lock);
 #define SYSRQ_KEY_TABLE_LENGTH 36
@@ -323,8 +351,13 @@
 /* u */	&sysrq_mountro_op,
 /* v */	NULL, /* May be assigned at init time by SMP VOYAGER */
 /* w */	NULL,
+#ifdef CONFIG_SMP
+/* x */	&sysrq_up_op,
+/* y */	&sysrq_smp_op,
+#else
 /* x */	NULL,
 /* y */	NULL,
+#endif
 /* z */	NULL
 };
 
diff -urN -x 'build-*' -x '*~' -x Make -x scripts linux-2.6.14/include/asm-i386/alternative.h work-2.6.14/include/asm-i386/alternative.h
--- linux-2.6.14/include/asm-i386/alternative.h	1970-01-01 01:00:00.000000000 +0100
+++ work-2.6.14/include/asm-i386/alternative.h	2005-11-22 15:30:36.000000000 +0100
@@ -0,0 +1,150 @@
+#ifndef _I386_ALTERNATIVE_H
+#define _I386_ALTERNATIVE_H
+
+#ifdef __KERNEL__
+
+struct alt_instr { 
+	__u8 *instr; 		/* original instruction */
+	__u8 *replacement;
+	__u8  cpuid;		/* cpuid bit set for replacement */
+	__u8  instrlen;		/* length of original instruction */
+	__u8  replacementlen; 	/* length of new instruction, <= instrlen */ 
+	__u8  pad;
+}; 
+
+extern void apply_alternatives(struct alt_instr *start, struct alt_instr *end,
+			       __u8 *tstart, __u8 *tend);
+
+struct module;
+extern void alternatives_smp_module_add(struct module *mod, char *name,
+					void *astart, void *aend,
+					void *tstart, void *tend);
+extern void alternatives_smp_module_del(struct module *mod);
+
+extern void switch_alternatives_up(void);
+extern void switch_alternatives_smp(void);
+
+#endif
+
+/* 
+ * Alternative instructions for different CPU types or capabilities.
+ * 
+ * This allows optimized instructions to be used even on generic binary
+ * kernels.
+ * 
+ * The length of oldinstr must be greater than or equal to the length of
+ * newinstr; the replacement is padded with nops as needed.
+ * 
+ * For non barrier like inlines please define new variants
+ * without volatile and memory clobber.
+ */
+#define alternative(oldinstr, newinstr, feature) 	\
+	asm volatile ("661:\n\t" oldinstr "\n662:\n" 		     \
+		      ".section .altinstructions,\"a\"\n"     	     \
+		      "  .align 4\n"				       \
+		      "  .long 661b\n"            /* label */          \
+		      "  .long 663f\n"		  /* new instruction */ 	\
+		      "  .byte %c0\n"             /* feature bit */    \
+		      "  .byte 662b-661b\n"       /* sourcelen */      \
+		      "  .byte 664f-663f\n"       /* replacementlen */ \
+		      ".previous\n"						\
+		      ".section .altinstr_replacement,\"ax\"\n"			\
+		      "663:\n\t" newinstr "\n664:\n"   /* replacement */    \
+		      ".previous" :: "i" (feature) : "memory")  
+
+/*
+ * Alternative inline assembly with input.
+ * 
+ * Peculiarities:
+ * No memory clobber here. 
+ * Argument numbers start with 1.
+ * Best is to use constraints that are fixed size (like (%1) ... "r")
+ * If you use variable sized constraints like "m" or "g" in the 
+ * replacement make sure to pad to the worst case length.
+ */
+#define alternative_input(oldinstr, newinstr, feature, input...)		\
+	asm volatile ("661:\n\t" oldinstr "\n662:\n"				\
+		      ".section .altinstructions,\"a\"\n"			\
+		      "  .align 4\n"						\
+		      "  .long 661b\n"            /* label */			\
+		      "  .long 663f\n"		  /* new instruction */ 	\
+		      "  .byte %c0\n"             /* feature bit */		\
+		      "  .byte 662b-661b\n"       /* sourcelen */		\
+		      "  .byte 664f-663f\n"       /* replacementlen */ 		\
+		      ".previous\n"						\
+		      ".section .altinstr_replacement,\"ax\"\n"			\
+		      "663:\n\t" newinstr "\n664:\n"   /* replacement */ 	\
+		      ".previous" :: "i" (feature), ##input)
+
+/*
+ * Alternative inline assembly for SMP.
+ *
+ * alternative_smp() takes two versions (SMP first, UP second) and is
+ * for more complex stuff such as spinlocks.
+ *
+ * alternative_smp_lock() just puts a lock in front of the
+ * instruction which will be nop'ed out for UP.
+ *
+ * The LOCK_PRE and LOCK_POST macros can be placed around the
+ * instruction to be locked in places where the simple
+ * alternative_smp_lock() doesn't work (inline asm also using section
+ * tricks, lock instruction in the middle of a longer sequence,
+ * whatever else ... )
+ *
+ * SMP alternatives use the same data structures as the other
+ * alternatives and the X86_FEATURE_UP flag to indicate the case of a
+ * UP system running a SMP kernel.  The existing apply_alternatives()
+ * works fine for patching a SMP kernel for UP.
+ * 
+ * The SMP alternative tables are kept after boot and contain both UP
+ * and SMP versions of the instructions to allow switching back to SMP
+ * at runtime, when hotplugging in a new CPU, which is especially
+ * useful in virtualized environments.
+ */ 
+
+#ifdef CONFIG_SMP
+#define alternative_smp(smpinstr, upinstr, args...) 	\
+	asm volatile ("661:\n\t" smpinstr "\n662:\n" 		     \
+		      ".section .smp_altinstructions,\"a\"\n"          \
+		      "  .align 4\n"				       \
+		      "  .long 661b\n"            /* label */          \
+		      "  .long 663f\n"		  /* new instruction */ 	\
+		      "  .byte 0x68\n"            /* X86_FEATURE_UP */    \
+		      "  .byte 662b-661b\n"       /* sourcelen */      \
+		      "  .byte 664f-663f\n"       /* replacementlen */ \
+		      ".previous\n"						\
+		      ".section .smp_altinstr_replacement,\"awx\"\n"   		\
+		      "663:\n\t" upinstr "\n"     /* replacement */    \
+		      "664:\n\t.fill 662b-661b,1,0x42\n" /* space for original */ \
+		      ".previous" : args)
+
+#define LOCK_PRE \
+	       	"661:\n\tlock; "
+#define LOCK_POST \
+		"\n" 		     \
+		".section .smp_altinstructions,\"a\"\n"          \
+		"  .align 4\n"				       \
+		"  .long 661b\n"            /* label */          \
+		"  .long 663f\n"	    /* new instruction */ 	\
+		"  .byte 0x68\n"            /* X86_FEATURE_UP */    \
+		"  .byte 1\n"               /* sourcelen */      \
+		"  .byte 0\n"               /* replacementlen */ \
+		".previous\n"						\
+		".section .smp_altinstr_replacement,\"awx\"\n"    		\
+		"663:\n"                    /* replacement */    \
+		"664:\n\tlock\n"            /* space for original */ \
+		".previous\n"
+
+#define alternative_smp_lock(lockinstr, args...) 	\
+	asm volatile (LOCK_PRE lockinstr LOCK_POST : args)
+
+#else /* ! CONFIG_SMP */
+#define alternative_smp(smpinstr, upinstr, args...) \
+	asm volatile (upinstr : args)
+#define alternative_smp_lock(lockinstr, args...) \
+	asm volatile (lockinstr : args)
+#define LOCK_PRE    ""
+#define LOCK_POST   ""
+#endif
+
+#endif /* _I386_ALTERNATIVE_H */
diff -urN -x 'build-*' -x '*~' -x Make -x scripts linux-2.6.14/include/asm-i386/atomic.h work-2.6.14/include/asm-i386/atomic.h
--- linux-2.6.14/include/asm-i386/atomic.h	2005-10-28 02:02:08.000000000 +0200
+++ work-2.6.14/include/asm-i386/atomic.h	2005-11-16 18:18:42.000000000 +0100
@@ -10,12 +10,6 @@
  * resource counting etc..
  */
 
-#ifdef CONFIG_SMP
-#define LOCK "lock ; "
-#else
-#define LOCK ""
-#endif
-
 /*
  * Make sure gcc doesn't try to be clever and move things around
  * on us. We need to use _exactly_ the address the user gave us,
@@ -51,9 +45,9 @@
  */
 static __inline__ void atomic_add(int i, atomic_t *v)
 {
-	__asm__ __volatile__(
-		LOCK "addl %1,%0"
-		:"=m" (v->counter)
+	alternative_smp_lock(
+		"addl %1,%0",
+		"=m" (v->counter)
 		:"ir" (i), "m" (v->counter));
 }
 
@@ -66,9 +60,9 @@
  */
 static __inline__ void atomic_sub(int i, atomic_t *v)
 {
-	__asm__ __volatile__(
-		LOCK "subl %1,%0"
-		:"=m" (v->counter)
+	alternative_smp_lock(
+		"subl %1,%0",
+		"=m" (v->counter)
 		:"ir" (i), "m" (v->counter));
 }
 
@@ -85,9 +79,9 @@
 {
 	unsigned char c;
 
-	__asm__ __volatile__(
-		LOCK "subl %2,%0; sete %1"
-		:"=m" (v->counter), "=qm" (c)
+	alternative_smp_lock(
+		"subl %2,%0; sete %1",
+		"=m" (v->counter), "=qm" (c)
 		:"ir" (i), "m" (v->counter) : "memory");
 	return c;
 }
@@ -100,9 +94,9 @@
  */ 
 static __inline__ void atomic_inc(atomic_t *v)
 {
-	__asm__ __volatile__(
-		LOCK "incl %0"
-		:"=m" (v->counter)
+	alternative_smp_lock(
+		"incl %0",
+		"=m" (v->counter)
 		:"m" (v->counter));
 }
 
@@ -114,9 +108,9 @@
  */ 
 static __inline__ void atomic_dec(atomic_t *v)
 {
-	__asm__ __volatile__(
-		LOCK "decl %0"
-		:"=m" (v->counter)
+	alternative_smp_lock(
+		"decl %0",
+		"=m" (v->counter)
 		:"m" (v->counter));
 }
 
@@ -132,9 +126,9 @@
 {
 	unsigned char c;
 
-	__asm__ __volatile__(
-		LOCK "decl %0; sete %1"
-		:"=m" (v->counter), "=qm" (c)
+	alternative_smp_lock(
+		"decl %0; sete %1",
+		"=m" (v->counter), "=qm" (c)
 		:"m" (v->counter) : "memory");
 	return c != 0;
 }
@@ -151,9 +145,9 @@
 {
 	unsigned char c;
 
-	__asm__ __volatile__(
-		LOCK "incl %0; sete %1"
-		:"=m" (v->counter), "=qm" (c)
+	alternative_smp_lock(
+		"incl %0; sete %1",
+		"=m" (v->counter), "=qm" (c)
 		:"m" (v->counter) : "memory");
 	return c != 0;
 }
@@ -171,9 +165,9 @@
 {
 	unsigned char c;
 
-	__asm__ __volatile__(
-		LOCK "addl %2,%0; sets %1"
-		:"=m" (v->counter), "=qm" (c)
+	alternative_smp_lock(
+		"addl %2,%0; sets %1",
+		"=m" (v->counter), "=qm" (c)
 		:"ir" (i), "m" (v->counter) : "memory");
 	return c;
 }
@@ -194,9 +188,9 @@
 #endif
 	/* Modern 486+ processor */
 	__i = i;
-	__asm__ __volatile__(
-		LOCK "xaddl %0, %1;"
-		:"=r"(i)
+	alternative_smp_lock(
+		"xaddl %0, %1;",
+		"=r"(i)
 		:"m"(v->counter), "0"(i));
 	return i + __i;
 
@@ -220,12 +214,12 @@
 
 /* These are x86-specific, used by some header files */
 #define atomic_clear_mask(mask, addr) \
-__asm__ __volatile__(LOCK "andl %0,%1" \
-: : "r" (~(mask)),"m" (*addr) : "memory")
+alternative_smp_lock("andl %0,%1", \
+: "r" (~(mask)),"m" (*addr) : "memory")
 
 #define atomic_set_mask(mask, addr) \
-__asm__ __volatile__(LOCK "orl %0,%1" \
-: : "r" (mask),"m" (*(addr)) : "memory")
+alternative_smp_lock("orl %0,%1", \
+: "r" (mask),"m" (*(addr)) : "memory")
 
 /* Atomic operations are already serializing on x86 */
 #define smp_mb__before_atomic_dec()	barrier()
diff -urN -x 'build-*' -x '*~' -x Make -x scripts linux-2.6.14/include/asm-i386/bitops.h work-2.6.14/include/asm-i386/bitops.h
--- linux-2.6.14/include/asm-i386/bitops.h	2005-10-28 02:02:08.000000000 +0200
+++ work-2.6.14/include/asm-i386/bitops.h	2005-11-17 09:55:40.000000000 +0100
@@ -7,6 +7,7 @@
 
 #include <linux/config.h>
 #include <linux/compiler.h>
+#include <asm/alternative.h>
 
 /*
  * These have to be done with inline assembly: that way the bit-setting
@@ -16,12 +17,6 @@
  * bit 0 is the LSB of addr; bit 32 is the LSB of (addr+1).
  */
 
-#ifdef CONFIG_SMP
-#define LOCK_PREFIX "lock ; "
-#else
-#define LOCK_PREFIX ""
-#endif
-
 #define ADDR (*(volatile long *) addr)
 
 /**
@@ -41,9 +36,9 @@
  */
 static inline void set_bit(int nr, volatile unsigned long * addr)
 {
-	__asm__ __volatile__( LOCK_PREFIX
-		"btsl %1,%0"
-		:"=m" (ADDR)
+	alternative_smp_lock(
+		"btsl %1,%0",
+		"=m" (ADDR)
 		:"Ir" (nr));
 }
 
@@ -76,9 +71,9 @@
  */
 static inline void clear_bit(int nr, volatile unsigned long * addr)
 {
-	__asm__ __volatile__( LOCK_PREFIX
-		"btrl %1,%0"
-		:"=m" (ADDR)
+	alternative_smp_lock(
+		"btrl %1,%0",
+		"=m" (ADDR)
 		:"Ir" (nr));
 }
 
@@ -121,9 +116,9 @@
  */
 static inline void change_bit(int nr, volatile unsigned long * addr)
 {
-	__asm__ __volatile__( LOCK_PREFIX
-		"btcl %1,%0"
-		:"=m" (ADDR)
+	alternative_smp_lock(
+		"btcl %1,%0",
+		"=m" (ADDR)
 		:"Ir" (nr));
 }
 
@@ -140,9 +135,9 @@
 {
 	int oldbit;
 
-	__asm__ __volatile__( LOCK_PREFIX
-		"btsl %2,%1\n\tsbbl %0,%0"
-		:"=r" (oldbit),"=m" (ADDR)
+	alternative_smp_lock(
+		"btsl %2,%1\n\tsbbl %0,%0",
+		"=r" (oldbit),"=m" (ADDR)
 		:"Ir" (nr) : "memory");
 	return oldbit;
 }
@@ -180,9 +175,9 @@
 {
 	int oldbit;
 
-	__asm__ __volatile__( LOCK_PREFIX
-		"btrl %2,%1\n\tsbbl %0,%0"
-		:"=r" (oldbit),"=m" (ADDR)
+	alternative_smp_lock(
+		"btrl %2,%1\n\tsbbl %0,%0",
+		"=r" (oldbit),"=m" (ADDR)
 		:"Ir" (nr) : "memory");
 	return oldbit;
 }
@@ -231,9 +226,9 @@
 {
 	int oldbit;
 
-	__asm__ __volatile__( LOCK_PREFIX
-		"btcl %2,%1\n\tsbbl %0,%0"
-		:"=r" (oldbit),"=m" (ADDR)
+	alternative_smp_lock(
+		"btcl %2,%1\n\tsbbl %0,%0",
+		"=r" (oldbit),"=m" (ADDR)
 		:"Ir" (nr) : "memory");
 	return oldbit;
 }
diff -urN -x 'build-*' -x '*~' -x Make -x scripts linux-2.6.14/include/asm-i386/cpufeature.h work-2.6.14/include/asm-i386/cpufeature.h
--- linux-2.6.14/include/asm-i386/cpufeature.h	2005-10-28 02:02:08.000000000 +0200
+++ work-2.6.14/include/asm-i386/cpufeature.h	2005-11-16 09:43:47.000000000 +0100
@@ -70,6 +70,8 @@
 #define X86_FEATURE_P3		(3*32+ 6) /* P3 */
 #define X86_FEATURE_P4		(3*32+ 7) /* P4 */
 
+#define X86_FEATURE_UP		(3*32+ 8) /* smp kernel running on up */
+
 /* Intel-defined CPU features, CPUID level 0x00000001 (ecx), word 4 */
 #define X86_FEATURE_XMM3	(4*32+ 0) /* Streaming SIMD Extensions-3 */
 #define X86_FEATURE_MWAIT	(4*32+ 3) /* Monitor/Mwait support */
diff -urN -x 'build-*' -x '*~' -x Make -x scripts linux-2.6.14/include/asm-i386/futex.h work-2.6.14/include/asm-i386/futex.h
--- linux-2.6.14/include/asm-i386/futex.h	2005-10-28 02:02:08.000000000 +0200
+++ work-2.6.14/include/asm-i386/futex.h	2005-11-17 11:19:13.000000000 +0100
@@ -28,7 +28,7 @@
 "1:	movl	%2, %0\n\
 	movl	%0, %3\n"					\
 	insn "\n"						\
-"2:	" LOCK_PREFIX "cmpxchgl %3, %2\n\
+"2:	" LOCK_PRE "cmpxchgl %3, %2" LOCK_POST "\n\
 	jnz	1b\n\
 3:	.section .fixup,\"ax\"\n\
 4:	mov	%5, %1\n\
@@ -68,7 +68,7 @@
 #endif
 		switch (op) {
 		case FUTEX_OP_ADD:
-			__futex_atomic_op1(LOCK_PREFIX "xaddl %0, %2", ret,
+			__futex_atomic_op1(LOCK_PRE "xaddl %0, %2" LOCK_POST, ret,
 					   oldval, uaddr, oparg);
 			break;
 		case FUTEX_OP_OR:
diff -urN -x 'build-*' -x '*~' -x Make -x scripts linux-2.6.14/include/asm-i386/rwlock.h work-2.6.14/include/asm-i386/rwlock.h
--- linux-2.6.14/include/asm-i386/rwlock.h	2005-10-28 02:02:08.000000000 +0200
+++ work-2.6.14/include/asm-i386/rwlock.h	2005-11-17 09:44:53.000000000 +0100
@@ -21,21 +21,23 @@
 #define RW_LOCK_BIAS_STR	"0x01000000"
 
 #define __build_read_lock_ptr(rw, helper)   \
-	asm volatile(LOCK "subl $1,(%0)\n\t" \
-		     "jns 1f\n" \
-		     "call " helper "\n\t" \
-		     "1:\n" \
-		     ::"a" (rw) : "memory")
+	alternative_smp("lock; subl $1,(%0)\n\t" \
+			"jns 1f\n" \
+			"call " helper "\n\t" \
+			"1:\n", \
+			"subl $1,(%0)\n\t", \
+			:"a" (rw) : "memory")
 
 #define __build_read_lock_const(rw, helper)   \
-	asm volatile(LOCK "subl $1,%0\n\t" \
-		     "jns 1f\n" \
-		     "pushl %%eax\n\t" \
-		     "leal %0,%%eax\n\t" \
-		     "call " helper "\n\t" \
-		     "popl %%eax\n\t" \
-		     "1:\n" \
-		     :"=m" (*(volatile int *)rw) : : "memory")
+	alternative_smp("lock; subl $1,%0\n\t" \
+			"jns 1f\n" \
+			"pushl %%eax\n\t" \
+			"leal %0,%%eax\n\t" \
+			"call " helper "\n\t" \
+			"popl %%eax\n\t" \
+			"1:\n", \
+			"subl $1,%0\n\t", \
+			"=m" (*(volatile int *)rw) : : "memory")
 
 #define __build_read_lock(rw, helper)	do { \
 						if (__builtin_constant_p(rw)) \
@@ -45,21 +47,23 @@
 					} while (0)
 
 #define __build_write_lock_ptr(rw, helper) \
-	asm volatile(LOCK "subl $" RW_LOCK_BIAS_STR ",(%0)\n\t" \
-		     "jz 1f\n" \
-		     "call " helper "\n\t" \
-		     "1:\n" \
-		     ::"a" (rw) : "memory")
+	alternative_smp("lock; subl $" RW_LOCK_BIAS_STR ",(%0)\n\t" \
+			"jz 1f\n" \
+			"call " helper "\n\t" \
+			"1:\n", \
+			"subl $" RW_LOCK_BIAS_STR ",(%0)\n\t", \
+			:"a" (rw) : "memory")
 
 #define __build_write_lock_const(rw, helper) \
-	asm volatile(LOCK "subl $" RW_LOCK_BIAS_STR ",%0\n\t" \
-		     "jz 1f\n" \
-		     "pushl %%eax\n\t" \
-		     "leal %0,%%eax\n\t" \
-		     "call " helper "\n\t" \
-		     "popl %%eax\n\t" \
-		     "1:\n" \
-		     :"=m" (*(volatile int *)rw) : : "memory")
+	alternative_smp("lock; subl $" RW_LOCK_BIAS_STR ",%0\n\t" \
+			"jz 1f\n" \
+			"pushl %%eax\n\t" \
+			"leal %0,%%eax\n\t" \
+			"call " helper "\n\t" \
+			"popl %%eax\n\t" \
+			"1:\n", \
+			"subl $" RW_LOCK_BIAS_STR ",%0\n\t", \
+			"=m" (*(volatile int *)rw) : : "memory")
 
 #define __build_write_lock(rw, helper)	do { \
 						if (__builtin_constant_p(rw)) \
diff -urN -x 'build-*' -x '*~' -x Make -x scripts linux-2.6.14/include/asm-i386/rwsem.h work-2.6.14/include/asm-i386/rwsem.h
--- linux-2.6.14/include/asm-i386/rwsem.h	2005-10-28 02:02:08.000000000 +0200
+++ work-2.6.14/include/asm-i386/rwsem.h	2005-11-17 11:14:45.000000000 +0100
@@ -99,7 +99,7 @@
 {
 	__asm__ __volatile__(
 		"# beginning down_read\n\t"
-LOCK_PREFIX	"  incl      (%%eax)\n\t" /* adds 0x00000001, returns the old value */
+LOCK_PRE       	"  incl      (%%eax)" LOCK_POST "\n\t" /* adds 0x00000001, returns the old value */
 		"  js        2f\n\t" /* jump if we weren't granted the lock */
 		"1:\n\t"
 		LOCK_SECTION_START("")
@@ -130,7 +130,7 @@
 		"  movl	     %1,%2\n\t"
 		"  addl      %3,%2\n\t"
 		"  jle	     2f\n\t"
-LOCK_PREFIX	"  cmpxchgl  %2,%0\n\t"
+LOCK_PRE       	"  cmpxchgl  %2,%0" LOCK_POST "\n\t"
 		"  jnz	     1b\n\t"
 		"2:\n\t"
 		"# ending __down_read_trylock\n\t"
@@ -150,7 +150,7 @@
 	tmp = RWSEM_ACTIVE_WRITE_BIAS;
 	__asm__ __volatile__(
 		"# beginning down_write\n\t"
-LOCK_PREFIX	"  xadd      %%edx,(%%eax)\n\t" /* subtract 0x0000ffff, returns the old value */
+LOCK_PRE       	"  xadd      %%edx,(%%eax)" LOCK_POST "\n\t" /* subtract 0x0000ffff, returns the old value */
 		"  testl     %%edx,%%edx\n\t" /* was the count 0 before? */
 		"  jnz       2f\n\t" /* jump if we weren't granted the lock */
 		"1:\n\t"
@@ -188,7 +188,7 @@
 	__s32 tmp = -RWSEM_ACTIVE_READ_BIAS;
 	__asm__ __volatile__(
 		"# beginning __up_read\n\t"
-LOCK_PREFIX	"  xadd      %%edx,(%%eax)\n\t" /* subtracts 1, returns the old value */
+LOCK_PRE	"  xadd      %%edx,(%%eax)" LOCK_POST "\n\t" /* subtracts 1, returns the old value */
 		"  js        2f\n\t" /* jump if the lock is being waited upon */
 		"1:\n\t"
 		LOCK_SECTION_START("")
@@ -214,7 +214,7 @@
 	__asm__ __volatile__(
 		"# beginning __up_write\n\t"
 		"  movl      %2,%%edx\n\t"
-LOCK_PREFIX	"  xaddl     %%edx,(%%eax)\n\t" /* tries to transition 0xffff0001 -> 0x00000000 */
+LOCK_PRE       	"  xaddl     %%edx,(%%eax)" LOCK_POST "\n\t" /* tries to transition 0xffff0001 -> 0x00000000 */
 		"  jnz       2f\n\t" /* jump if the lock is being waited upon */
 		"1:\n\t"
 		LOCK_SECTION_START("")
@@ -239,7 +239,7 @@
 {
 	__asm__ __volatile__(
 		"# beginning __downgrade_write\n\t"
-LOCK_PREFIX	"  addl      %2,(%%eax)\n\t" /* transitions 0xZZZZ0001 -> 0xYYYY0001 */
+LOCK_PRE	"  addl      %2,(%%eax)" LOCK_POST "\n\t" /* transitions 0xZZZZ0001 -> 0xYYYY0001 */
 		"  js        2f\n\t" /* jump if the lock is being waited upon */
 		"1:\n\t"
 		LOCK_SECTION_START("")
@@ -262,9 +262,9 @@
  */
 static inline void rwsem_atomic_add(int delta, struct rw_semaphore *sem)
 {
-	__asm__ __volatile__(
-LOCK_PREFIX	"addl %1,%0"
-		: "=m"(sem->count)
+	alternative_smp_lock(
+		"addl %1,%0",
+		"=m"(sem->count)
 		: "ir"(delta), "m"(sem->count));
 }
 
@@ -275,9 +275,9 @@
 {
 	int tmp = delta;
 
-	__asm__ __volatile__(
-LOCK_PREFIX	"xadd %0,(%2)"
-		: "+r"(tmp), "=m"(sem->count)
+	alternative_smp_lock(
+		"xadd %0,(%2)",
+		"+r"(tmp), "=m"(sem->count)
 		: "r"(sem), "m"(sem->count)
 		: "memory");
 
diff -urN -x 'build-*' -x '*~' -x Make -x scripts linux-2.6.14/include/asm-i386/semaphore.h work-2.6.14/include/asm-i386/semaphore.h
--- linux-2.6.14/include/asm-i386/semaphore.h	2005-10-28 02:02:08.000000000 +0200
+++ work-2.6.14/include/asm-i386/semaphore.h	2005-11-17 11:15:49.000000000 +0100
@@ -102,7 +102,7 @@
 	might_sleep();
 	__asm__ __volatile__(
 		"# atomic down operation\n\t"
-		LOCK "decl %0\n\t"     /* --sem->count */
+		LOCK_PRE "decl %0" LOCK_POST "\n\t"     /* --sem->count */
 		"js 2f\n"
 		"1:\n"
 		LOCK_SECTION_START("")
@@ -126,7 +126,7 @@
 	might_sleep();
 	__asm__ __volatile__(
 		"# atomic interruptible down operation\n\t"
-		LOCK "decl %1\n\t"     /* --sem->count */
+		LOCK_PRE "decl %1" LOCK_POST "\n\t"     /* --sem->count */
 		"js 2f\n\t"
 		"xorl %0,%0\n"
 		"1:\n"
@@ -151,7 +151,7 @@
 
 	__asm__ __volatile__(
 		"# atomic interruptible down operation\n\t"
-		LOCK "decl %1\n\t"     /* --sem->count */
+		LOCK_PRE "decl %1" LOCK_POST "\n\t"     /* --sem->count */
 		"js 2f\n\t"
 		"xorl %0,%0\n"
 		"1:\n"
@@ -176,7 +176,7 @@
 {
 	__asm__ __volatile__(
 		"# atomic up operation\n\t"
-		LOCK "incl %0\n\t"     /* ++sem->count */
+		LOCK_PRE "incl %0" LOCK_POST "\n\t"     /* ++sem->count */
 		"jle 2f\n"
 		"1:\n"
 		LOCK_SECTION_START("")
diff -urN -x 'build-*' -x '*~' -x Make -x scripts linux-2.6.14/include/asm-i386/spinlock.h work-2.6.14/include/asm-i386/spinlock.h
--- linux-2.6.14/include/asm-i386/spinlock.h	2005-10-28 02:02:08.000000000 +0200
+++ work-2.6.14/include/asm-i386/spinlock.h	2005-11-16 16:22:12.000000000 +0100
@@ -48,18 +48,23 @@
 	"jmp 1b\n" \
 	"4:\n\t"
 
+#define __raw_spin_lock_string_up \
+	"\n\tdecb %0"
+
 static inline void __raw_spin_lock(raw_spinlock_t *lock)
 {
-	__asm__ __volatile__(
-		__raw_spin_lock_string
-		:"=m" (lock->slock) : : "memory");
+	alternative_smp(
+		__raw_spin_lock_string,
+		__raw_spin_lock_string_up,
+		"=m" (lock->slock) : : "memory");
 }
 
 static inline void __raw_spin_lock_flags(raw_spinlock_t *lock, unsigned long flags)
 {
-	__asm__ __volatile__(
-		__raw_spin_lock_string_flags
-		:"=m" (lock->slock) : "r" (flags) : "memory");
+	alternative_smp(
+		__raw_spin_lock_string_flags,
+		__raw_spin_lock_string_up,
+		"=m" (lock->slock) : "r" (flags) : "memory");
 }
 
 static inline int __raw_spin_trylock(raw_spinlock_t *lock)
@@ -178,13 +183,16 @@
 
 static inline void __raw_read_unlock(raw_rwlock_t *rw)
 {
-	asm volatile("lock ; incl %0" :"=m" (rw->lock) : : "memory");
+	alternative_smp_lock(
+		"incl %0",
+		"=m" (rw->lock) : : "memory");
 }
 
 static inline void __raw_write_unlock(raw_rwlock_t *rw)
 {
-	asm volatile("lock ; addl $" RW_LOCK_BIAS_STR ", %0"
-				 : "=m" (rw->lock) : : "memory");
+	alternative_smp_lock(
+		"addl $" RW_LOCK_BIAS_STR ", %0",
+		"=m" (rw->lock) : : "memory");
 }
 
 #endif /* __ASM_SPINLOCK_H */
diff -urN -x 'build-*' -x '*~' -x Make -x scripts linux-2.6.14/include/asm-i386/system.h work-2.6.14/include/asm-i386/system.h
--- linux-2.6.14/include/asm-i386/system.h	2005-10-28 02:02:08.000000000 +0200
+++ work-2.6.14/include/asm-i386/system.h	2005-11-17 09:28:29.000000000 +0100
@@ -267,20 +267,20 @@
 	unsigned long prev;
 	switch (size) {
 	case 1:
-		__asm__ __volatile__(LOCK_PREFIX "cmpxchgb %b1,%2"
-				     : "=a"(prev)
+		alternative_smp_lock("cmpxchgb %b1,%2",
+				     "=a"(prev)
 				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
 				     : "memory");
 		return prev;
 	case 2:
-		__asm__ __volatile__(LOCK_PREFIX "cmpxchgw %w1,%2"
-				     : "=a"(prev)
+		alternative_smp_lock("cmpxchgw %w1,%2",
+				     "=a"(prev)
 				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
 				     : "memory");
 		return prev;
 	case 4:
-		__asm__ __volatile__(LOCK_PREFIX "cmpxchgl %1,%2"
-				     : "=a"(prev)
+		alternative_smp_lock("cmpxchgl %1,%2",
+				     "=a"(prev)
 				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
 				     : "memory");
 		return prev;
@@ -292,67 +292,6 @@
 	((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
 					(unsigned long)(n),sizeof(*(ptr))))
     
-#ifdef __KERNEL__
-struct alt_instr { 
-	__u8 *instr; 		/* original instruction */
-	__u8 *replacement;
-	__u8  cpuid;		/* cpuid bit set for replacement */
-	__u8  instrlen;		/* length of original instruction */
-	__u8  replacementlen; 	/* length of new instruction, <= instrlen */ 
-	__u8  pad;
-}; 
-#endif
-
-/* 
- * Alternative instructions for different CPU types or capabilities.
- * 
- * This allows to use optimized instructions even on generic binary
- * kernels.
- * 
- * length of oldinstr must be longer or equal the length of newinstr
- * It can be padded with nops as needed.
- * 
- * For non barrier like inlines please define new variants
- * without volatile and memory clobber.
- */
-#define alternative(oldinstr, newinstr, feature) 	\
-	asm volatile ("661:\n\t" oldinstr "\n662:\n" 		     \
-		      ".section .altinstructions,\"a\"\n"     	     \
-		      "  .align 4\n"				       \
-		      "  .long 661b\n"            /* label */          \
-		      "  .long 663f\n"		  /* new instruction */ 	\
-		      "  .byte %c0\n"             /* feature bit */    \
-		      "  .byte 662b-661b\n"       /* sourcelen */      \
-		      "  .byte 664f-663f\n"       /* replacementlen */ \
-		      ".previous\n"						\
-		      ".section .altinstr_replacement,\"ax\"\n"			\
-		      "663:\n\t" newinstr "\n664:\n"   /* replacement */    \
-		      ".previous" :: "i" (feature) : "memory")  
-
-/*
- * Alternative inline assembly with input.
- * 
- * Pecularities:
- * No memory clobber here. 
- * Argument numbers start with 1.
- * Best is to use constraints that are fixed size (like (%1) ... "r")
- * If you use variable sized constraints like "m" or "g" in the 
- * replacement maake sure to pad to the worst case length.
- */
-#define alternative_input(oldinstr, newinstr, feature, input...)		\
-	asm volatile ("661:\n\t" oldinstr "\n662:\n"				\
-		      ".section .altinstructions,\"a\"\n"			\
-		      "  .align 4\n"						\
-		      "  .long 661b\n"            /* label */			\
-		      "  .long 663f\n"		  /* new instruction */ 	\
-		      "  .byte %c0\n"             /* feature bit */		\
-		      "  .byte 662b-661b\n"       /* sourcelen */		\
-		      "  .byte 664f-663f\n"       /* replacementlen */ 		\
-		      ".previous\n"						\
-		      ".section .altinstr_replacement,\"ax\"\n"			\
-		      "663:\n\t" newinstr "\n664:\n"   /* replacement */ 	\
-		      ".previous" :: "i" (feature), ##input)
-
 /*
  * Force strict CPU ordering.
  * And yes, this is required on UP too when we're talking


end of thread, other threads:[~2006-01-26 11:49 UTC]

Thread overview: 117+ messages
2006-01-24 15:33 [PATCH] SMP alternatives Gerd Hoffmann
2006-01-24 16:22 ` Ben Collins
2006-01-25  9:20   ` Gerd Hoffmann
2006-01-26 10:22 ` Pavel Machek
2006-01-26 11:17   ` Gerd Hoffmann
2006-01-26 11:48     ` Pavel Machek
  -- strict thread matches above, loose matches on Subject: below --
2005-11-24 18:24 [patch] " colin
2005-11-24 17:48 linux
2005-11-24 18:48 ` Linus Torvalds
2005-11-10  0:32 [PATCH 1/10] Cr4 is valid on some 486s Zachary Amsden
2005-11-11 10:36 ` Pavel Machek
2005-11-11 19:36   ` Zachary Amsden
2005-11-11 19:58     ` Linus Torvalds
2005-11-11 20:14       ` Zachary Amsden
2005-11-11 20:22         ` Linus Torvalds
2005-11-13  7:42           ` Dave Jones
2005-11-13 19:24             ` Linus Torvalds
2005-11-13 20:29               ` Linus Torvalds
2005-11-14 15:06                 ` Gerd Knorr
2005-11-14 19:25                   ` Linus Torvalds
2005-11-15 14:12                     ` Gerd Knorr
2005-11-15 16:01                       ` Gerd Knorr
2005-11-16 16:12                         ` [RFC] SMP alternatives Gerd Knorr
2005-11-22 17:48                           ` [patch] " Gerd Knorr
2005-11-22 18:01                             ` Pavel Machek
2005-11-23 15:12                             ` Vincent Hanquez
2005-11-23 19:17                             ` Andi Kleen
2005-11-23 15:29                               ` Gerd Knorr
2005-11-23 16:42                               ` Alan Cox
2005-11-23 16:39                                 ` Andi Kleen
2005-11-23 17:21                                   ` Alan Cox
2005-11-23 16:59                                     ` Andi Kleen
2005-11-23 22:00                                       ` Alan Cox
2005-11-24 13:13                                         ` Andi Kleen
2005-11-24 13:30                                           ` Eric W. Biederman
2005-11-24 13:39                                             ` Andi Kleen
2005-11-24 13:58                                               ` Eric W. Biederman
2005-11-24 19:16                                                 ` thockin
2005-11-24 19:26                                                   ` Andi Kleen
2005-11-24 14:34                                               ` Alan Cox
2005-11-24 14:22                                                 ` Andi Kleen
2005-11-24 15:15                                                   ` Alan Cox
2005-11-24 14:55                                                     ` Andi Kleen
2005-11-24 15:09                                                       ` Eric W. Biederman
2005-11-24 15:36                                                         ` Andi Kleen
2005-11-24 16:49                                                           ` Eric W. Biederman
2005-11-24 19:12                                                           ` thockin
2005-11-24 19:14                                                             ` Andi Kleen
2005-11-24 19:24                                                               ` thockin
2005-11-24 19:29                                                                 ` Andi Kleen
2005-11-24 19:44                                                                   ` thockin
2005-11-24 21:20                                                                     ` Andi Kleen
2005-11-24 21:40                                                                       ` thockin
2005-11-24 23:33                                                                       ` Eric W. Biederman
2005-11-24 23:12                                                               ` Alan Cox
2005-11-24 22:48                                                                 ` thockin
2005-11-24 23:35                                                                   ` Andi Kleen
2005-11-25  0:13                                                                     ` Alan Cox
2005-11-25  1:33                                                                     ` H. Peter Anvin
2005-11-28 19:15                                                                     ` Bill Davidsen
2005-11-24 16:02                                                         ` Alan Cox
2005-11-24 19:09                                                     ` thockin
2005-11-24 14:30                                           ` Alan Cox
2005-11-23 17:02                                     ` Linus Torvalds
2005-11-23 18:02                                       ` H. Peter Anvin
2005-11-23 18:42                                         ` Linus Torvalds
2005-11-23 17:26                                           ` Jeff V. Merkey
2005-11-23 19:03                                             ` Linus Torvalds
2005-11-23 19:31                                               ` jmerkey
2005-11-23 18:46                                           ` Andi Kleen
2005-11-23 19:12                                           ` H. Peter Anvin
2005-11-23 19:30                                             ` jmerkey
2005-11-23 21:44                                           ` Alan Cox
2005-11-23 21:13                                             ` Andi Kleen
2005-11-23 21:46                                               ` Jeff Garzik
2005-11-23 22:23                                                 ` Andi Kleen
2005-11-23 22:30                                                 ` Pavel Machek
2005-11-23 22:05                                               ` Alan Cox
2005-11-23 21:36                                                 ` Arjan van de Ven
2005-11-23 21:36                                                 ` Andi Kleen
2005-11-23 22:13                                                 ` Linus Torvalds
2005-11-23 21:36                                             ` Linus Torvalds
2005-11-23 21:43                                               ` Andi Kleen
2005-11-23 22:15                                                 ` Linus Torvalds
2005-11-23 22:22                                                   ` Andi Kleen
2005-11-23 22:25                                                     ` H. Peter Anvin
2005-11-23 22:32                                                       ` Andi Kleen
2005-11-23 22:36                                                         ` H. Peter Anvin
2005-11-23 22:40                                                           ` Andi Kleen
2005-11-23 22:52                                                             ` H. Peter Anvin
2005-11-23 23:10                                                     ` Linus Torvalds
2005-11-24  0:55                                                 ` Jeff Garzik
2005-11-23 21:48                                               ` Daniel Jacobowitz
2005-11-23 21:53                                                 ` H. Peter Anvin
2005-11-23 22:03                                                   ` Daniel Jacobowitz
2005-11-23 22:09                                                     ` H. Peter Anvin
2005-11-23 22:21                                                       ` Linus Torvalds
2005-11-23 23:29                                                         ` Eric W. Biederman
2005-11-23 23:40                                                           ` Linus Torvalds
2005-11-23 22:19                                                 ` Linus Torvalds
2005-11-23 22:20                                                   ` Daniel Jacobowitz
2005-11-23 23:08                                                     ` Linus Torvalds
2005-11-23 23:02                                                       ` Jeff V. Merkey
2005-11-23 23:42                                                       ` Daniel Jacobowitz
2005-11-23 23:59                                                         ` Linus Torvalds
2005-11-24  2:06                                                           ` Daniel Jacobowitz
2005-11-24 22:32                                                         ` Ulrich Drepper
2005-11-28 19:58                                                         ` Bill Davidsen
2005-11-24  1:02                                                       ` Jeff Garzik
2005-11-24 13:01                                                       ` Pádraig Brady
2005-11-24 13:12                                                         ` Arjan van de Ven
2005-11-28 19:52                                                       ` Bill Davidsen
2005-11-28 20:05                                                         ` Zachary Amsden
2005-11-28 22:19                                                         ` Jeff V. Merkey
2005-11-28 23:00                                                           ` Zachary Amsden
2005-11-28 23:07                                                             ` H. Peter Anvin
2005-11-28 23:30                                                               ` Zachary Amsden
2005-11-28 23:32                                                                 ` H. Peter Anvin
2005-11-28 23:12                                                             ` Andi Kleen
2005-11-23 22:50                                               ` Alan Cox
2005-11-23 22:22                                                 ` H. Peter Anvin
2005-11-25  7:38                                               ` Chris Wedgwood
2005-11-25 17:33                                                 ` Linus Torvalds
2005-11-28 20:25                                                   ` Bill Davidsen
2005-11-25 20:13                                                 ` H. Peter Anvin
2005-11-24  3:23                                             ` Mikulas Patocka
2005-11-24  3:31                                           ` Mikulas Patocka
2005-11-24  3:55                                             ` H. Peter Anvin
2005-11-24 22:30                                           ` Ulrich Drepper
2005-11-23 16:43                                 ` Gerd Knorr
2005-11-23 16:51                                   ` H. Peter Anvin
