* [patch 00/12] Immediate Values
@ 2009-09-24 13:26 Mathieu Desnoyers
  2009-09-24 13:26 ` [patch 01/12] x86: text_poke_early non static Mathieu Desnoyers
                   ` (11 more replies)
  0 siblings, 12 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2009-09-24 13:26 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel

Hi Ingo,

Here is an updated version of the immediate values, applying to current tip.

[impact: data-cache optimization]

The main benefit of this infrastructure is to encode read-often variables into
the instruction stream. Tracepoints can benefit from it by replacing the memory
load with an immediate value instruction. There is still room for improvement:
an effort to provide static jump patching is ongoing, involving the kernel and
gcc communities.

Even then, the immediate values have their niche: when a value (rather than a
branch selection) is read often on fast-paths, the immediate value
infrastructure can encode these in the instruction stream without any d-cache
cost.

Two sample users are provided: prof_on from the scheduler code and tracepoints.
Feel free to merge/drop any of these.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch 01/12] x86: text_poke_early non static
  2009-09-24 13:26 [patch 00/12] Immediate Values Mathieu Desnoyers
@ 2009-09-24 13:26 ` Mathieu Desnoyers
  2009-09-24 13:26 ` [patch 02/12] Immediate Values - Architecture Independent Code Mathieu Desnoyers
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2009-09-24 13:26 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel; +Cc: Mathieu Desnoyers, Ingo Molnar

[-- Attachment #1: x86-text-poke-early-non-static.patch --]
[-- Type: text/plain, Size: 2151 bytes --]

Needed by immediate.c.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Ingo Molnar <mingo@redhat.com>
---
 arch/x86/include/asm/alternative.h |    5 +++++
 arch/x86/kernel/alternative.c      |    4 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)

Index: linux.trees.git/arch/x86/kernel/alternative.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/alternative.c	2009-09-24 09:13:59.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/alternative.c	2009-09-24 09:15:01.000000000 -0400
@@ -193,7 +193,7 @@ static void __init_or_module add_nops(vo
 
 extern struct alt_instr __alt_instructions[], __alt_instructions_end[];
 extern u8 *__smp_locks[], *__smp_locks_end[];
-static void *text_poke_early(void *addr, const void *opcode, size_t len);
+void *text_poke_early(void *addr, const void *opcode, size_t len);
 
 /* Replace instructions with better alternatives for this CPU type.
    This runs before SMP is initialized to avoid SMP problems with
@@ -492,7 +492,7 @@ void __init alternative_instructions(voi
  * instructions. And on the local CPU you need to be protected again NMI or MCE
  * handlers seeing an inconsistent instruction while you patch.
  */
-static void *__init_or_module text_poke_early(void *addr, const void *opcode,
+void *__init_or_module text_poke_early(void *addr, const void *opcode,
 					      size_t len)
 {
 	unsigned long flags;
Index: linux.trees.git/arch/x86/include/asm/alternative.h
===================================================================
--- linux.trees.git.orig/arch/x86/include/asm/alternative.h	2009-09-24 09:15:22.000000000 -0400
+++ linux.trees.git/arch/x86/include/asm/alternative.h	2009-09-24 09:15:56.000000000 -0400
@@ -160,4 +160,9 @@ static inline void apply_paravirt(struct
  */
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 
+/*
+ * text_poke for early boot
+ */
+extern void *text_poke_early(void *addr, const void *opcode, size_t len);
+
 #endif /* _ASM_X86_ALTERNATIVE_H */

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch 02/12] Immediate Values - Architecture Independent Code
  2009-09-24 13:26 [patch 00/12] Immediate Values Mathieu Desnoyers
  2009-09-24 13:26 ` [patch 01/12] x86: text_poke_early non static Mathieu Desnoyers
@ 2009-09-24 13:26 ` Mathieu Desnoyers
  2009-09-25  4:20   ` Andrew Morton
  2009-09-24 13:26 ` [patch 03/12] Immediate Values - Kconfig menu in EMBEDDED Mathieu Desnoyers
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 34+ messages in thread
From: Mathieu Desnoyers @ 2009-09-24 13:26 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Jason Baron, Rusty Russell, Adrian Bunk,
	Andi Kleen, Christoph Hellwig, akpm

[-- Attachment #1: immediate-values-architecture-independent-code.patch --]
[-- Type: text/plain, Size: 17929 bytes --]

Immediate values are read-mostly variables that are rarely updated. They
use code patching to modify the values inscribed in the instruction stream. This
provides a way to save precious cache lines that would otherwise have to be used
by these variables.

There is a generic _imv_read() version, which uses standard global
variables, and optimized per-architecture imv_read() implementations,
which use a load immediate to remove a data cache hit. When the immediate values
functionality is disabled in the kernel, it falls back to global variables.
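
As a rough illustration of the fallback path only, the generic macros from
linux/immediate.h in this patch can be exercised in a user-space build. The
flag name trace_enabled, the helpers fast_path() and demo(), and the local
unlikely() definition are assumptions made for this sketch; they are not part
of the series. It relies on the GCC/Clang __typeof__ and __builtin_expect
extensions.

```c
#include <assert.h>

/* Generic fallback from the patch: the "immediate" is an ordinary global. */
#define DEFINE_IMV(type, name)	__typeof__(type) name##__imv
#define _imv_read(name)		(name##__imv)
#define imv_read(name)		_imv_read(name)
#define imv_set(name, i)	(name##__imv = (i))

/* Local stand-in for the kernel's unlikely() hint (assumption). */
#define unlikely(x)		__builtin_expect(!!(x), 0)

/* Hypothetical flag, not taken from the patch series. */
DEFINE_IMV(char, trace_enabled);

static int events;

/* Fast path: the flag is read often and almost never changes. */
void fast_path(void)
{
	if (unlikely(imv_read(trace_enabled)))
		events++;
}

/* Walk through a disabled read, an update, then an enabled read. */
int demo(void)
{
	fast_path();
	assert(events == 0);		/* flag starts at 0: nothing recorded */
	imv_set(trace_enabled, 1);
	fast_path();
	return events;
}
```

With CONFIG_IMMEDIATE enabled, the same call sites would keep their shape;
only the expansion of imv_read() and imv_set() changes.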

This patch adds a new rodata section "__imv" to hold the pointers to the enable
value. The immediate value activation functions sit in kernel/immediate.c.

Immediate values refer to the memory address of a previously declared integer.
This integer holds the information about the state of the immediate values
associated, and must be accessed through the API found in linux/immediate.h.

At module load time, each immediate value is checked to see if it must be
enabled. This is the case if the variable it refers to is exported from
another module and already enabled.

In the early stages of start_kernel(), the immediate values are updated to
reflect the state of the variable they refer to.
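
The update pass described above can be sketched in user space as a walk over
a table of entries. This is a simplified model, not the kernel code: struct
imv_entry, apply_update() and update_range() are names invented here, plain
memcpy() stands in for text_poke()/text_poke_early(), and unlike the patch
the helper reports whether bytes were actually rewritten.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* User-space stand-in for struct __imv (hypothetical name). */
struct imv_entry {
	const void *var;	/* backing variable */
	void *imv;		/* operand bytes inside the "instruction" */
	unsigned char size;	/* operand width in bytes */
};

/*
 * Returns 1 if the operand was rewritten, 0 if it was already in sync,
 * -EINVAL for a width the scheme does not handle.
 */
static int apply_update(const struct imv_entry *e)
{
	switch (e->size) {
	case 1: case 2: case 4: case 8:
		break;
	default:
		return -EINVAL;
	}
	if (memcmp(e->imv, e->var, e->size) == 0)
		return 0;			/* nothing to patch */
	memcpy(e->imv, e->var, e->size);	/* kernel: text_poke*() */
	return 1;
}

/* Walk [begin, end) and count how many entries were rewritten. */
int update_range(const struct imv_entry *begin, const struct imv_entry *end)
{
	int patched = 0;

	assert(begin <= end);
	for (const struct imv_entry *it = begin; it < end; it++)
		if (apply_update(it) > 0)
			patched++;
	return patched;
}
```

Running the pass a second time over the same table rewrites nothing, which
mirrors the early "same value, nothing to do" exit in apply_imv_update().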

* Why should this be merged *

It improves performance on heavy memory I/O workloads.

An interesting result shows the potential of this infrastructure: the
slowdown a simple system call such as getppid() suffers when it is
used under heavy user-space cache thrashing:

Random walk L1 and L2 thrashing surrounding a getppid() call:
(note: in this test, do_syscall_trace was taken at each system call, see
Documentation/immediate.txt in these patches for details)
- No memory pressure :   getppid() takes  1573 cycles
- With memory pressure : getppid() takes 15589 cycles

We therefore have a slowdown of about 10 times just to get the kernel variables
from memory. Another test on the same architecture (Intel P4) measured the
memory latency to be 559 cycles. Therefore, each cache line removed from the hot
path would improve the syscall time by roughly 3.5% in these conditions.
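
The arithmetic behind these two figures can be reproduced from the numbers
quoted above. The function names below are made up for this sketch, and the
model simply assumes that each removed cache line saves one full memory
latency.

```c
#include <assert.h>

/* Cycle counts quoted above (Intel P4). */
#define CYCLES_NO_PRESSURE	1573.0
#define CYCLES_WITH_PRESSURE	15589.0
#define MEMORY_LATENCY		559.0

/* How much slower getppid() gets under cache pressure. */
double slowdown_factor(void)
{
	return CYCLES_WITH_PRESSURE / CYCLES_NO_PRESSURE;
}

/* Share of the pressured syscall time spent on `misses` cache misses. */
double miss_share_percent(int misses)
{
	assert(misses >= 0);
	return 100.0 * misses * MEMORY_LATENCY / CYCLES_WITH_PRESSURE;
}
```

slowdown_factor() comes out just under 10, and miss_share_percent(1) lands
between 3.5 and 3.6, in line with the rounded figures in the text.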

Changelog:

- section __imv is already SHF_ALLOC
- Because of the wonders of ELF, section 0 has sh_addr and sh_size 0.  So
  the if (immediateindex) is unnecessary here.
- Remove module_mutex usage: depend on functions implemented in module.c for
  that.
- Does not update tainted module's immediate values.
- remove imv_*_t types, add DECLARE_IMV() and DEFINE_IMV().
  - imv_read(&var) becomes imv_read(var) because of this.
- Adding a new EXPORT_IMV_SYMBOL(_GPL).
- remove imv_if(). Should use if (unlikely(imv_read(var))) instead.
  - Wait until we have gcc support before we add the imv_if macro, since
    its form may have to change.
- Don't declare the __imv section in vmlinux.lds.h, just put the content
  in the rodata section.
- Simplify interface : remove imv_set_early, keep track of kernel boot
  status internally.
- Remove the ALIGN(8) before the __imv section. It is packed now.
- Uses an IPI busy-loop on each CPU with interrupts disabled as a simple,
  architecture agnostic, update mechanism.
- Use imv_* instead of immediate_*.
- Updating immediate values cannot rely on smp_call_function() because
  synchronizing cpus using IPIs leads to deadlocks. Process A held a read lock
  on tasklist_lock, then process B called apply_imv_update(). Process A received
  the IPI and began executing ipi_busy_loop(). Then process C took a write
  lock irq on the task list lock, before receiving the IPI. Thus, process A
  holds up process C, and C can't get an IPI because interrupts are disabled.
  Solve this problem by using a new 'ALL_CPUS' parameter to stop_machine_run(),
  which runs a function on all cpus after they are busy looping and have
  disabled irqs. Since this is done in a new process context, we don't have to
  worry about interrupted spin_locks. Also, fewer lines of code. Has survived
  24 hours+ of testing...
- Folded "fix use immediate" patch.
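
The "one CPU writes, the others wait" scheme that the stop_machine based
update relies on can be modelled with C11 atomics. This is only a sketch
under stated assumptions: update_begin() and cpu_update() are hypothetical
names, memcpy() stands in for text_poke(), and the function is meant to be
called once per simulated CPU after update_begin(). In the kernel every CPU
runs the callback concurrently with interrupts disabled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

static atomic_int first_cpu;	/* stands in for stop_machine_first */
static atomic_int wrote_text;	/* set once the new bytes are in place */

/* Arm the election before the CPUs are "stopped". */
void update_begin(void)
{
	atomic_store(&first_cpu, 1);
	atomic_store(&wrote_text, 0);
}

/* Body each CPU would run. Returns 1 for the CPU that did the write. */
int cpu_update(void *dst, const void *src, size_t len)
{
	assert(len > 0);
	/* fetch_sub returns the old value, so exactly one caller sees 1. */
	if (atomic_fetch_sub(&first_cpu, 1) == 1) {
		memcpy(dst, src, len);		/* kernel: text_poke() */
		atomic_store_explicit(&wrote_text, 1, memory_order_release);
		return 1;
	}
	/* Everyone else spins until the writer has published its store. */
	while (!atomic_load_explicit(&wrote_text, memory_order_acquire))
		;
	return 0;
}
```

The release/acquire pair plays the role of the smp_wmb()/smp_rmb() pairing
in stop_machine_imv_update(); the sync_core() and icache flush steps have no
user-space equivalent here and are left out.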

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Jason Baron <jbaron@redhat.com>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Adrian Bunk <bunk@stusta.de>
CC: Andi Kleen <andi@firstfloor.org>
CC: Christoph Hellwig <hch@infradead.org>
CC: mingo@elte.hu
CC: akpm@osdl.org
---
 include/asm-generic/vmlinux.lds.h |    3 
 include/linux/immediate.h         |   94 +++++++++++++++++++
 include/linux/module.h            |   16 +++
 init/main.c                       |    7 +
 kernel/Makefile                   |    1 
 kernel/immediate.c                |  183 ++++++++++++++++++++++++++++++++++++++
 kernel/module.c                   |   41 ++++++++
 7 files changed, 345 insertions(+)

Index: linux.trees.git/include/linux/module.h
===================================================================
--- linux.trees.git.orig/include/linux/module.h	2009-09-24 08:52:55.000000000 -0400
+++ linux.trees.git/include/linux/module.h	2009-09-24 08:59:42.000000000 -0400
@@ -16,6 +16,7 @@
 #include <linux/kobject.h>
 #include <linux/moduleparam.h>
 #include <linux/tracepoint.h>
+#include <linux/immediate.h>
 
 #include <asm/local.h>
 #include <asm/module.h>
@@ -330,6 +331,10 @@ struct module
 	struct tracepoint *tracepoints;
 	unsigned int num_tracepoints;
 #endif
+#ifdef CONFIG_IMMEDIATE
+	const struct __imv *immediate;
+	unsigned int num_immediate;
+#endif
 
 #ifdef CONFIG_TRACING
 	const char **trace_bprintk_fmt_start;
@@ -533,6 +538,9 @@ extern void print_modules(void);
 extern void module_update_tracepoints(void);
 extern int module_get_iter_tracepoints(struct tracepoint_iter *iter);
 
+extern void _module_imv_update(void);
+extern void module_imv_update(void);
+
 #else /* !CONFIG_MODULES... */
 #define EXPORT_SYMBOL(sym)
 #define EXPORT_SYMBOL_GPL(sym)
@@ -653,6 +661,14 @@ static inline int module_get_iter_tracep
 	return 0;
 }
 
+static inline void _module_imv_update(void)
+{
+}
+
+static inline void module_imv_update(void)
+{
+}
+
 #endif /* CONFIG_MODULES */
 
 struct device_driver;
Index: linux.trees.git/kernel/module.c
===================================================================
--- linux.trees.git.orig/kernel/module.c	2009-09-24 08:52:55.000000000 -0400
+++ linux.trees.git/kernel/module.c	2009-09-24 08:58:28.000000000 -0400
@@ -36,6 +36,7 @@
 #include <linux/cpu.h>
 #include <linux/moduleparam.h>
 #include <linux/errno.h>
+#include <linux/immediate.h>
 #include <linux/err.h>
 #include <linux/vermagic.h>
 #include <linux/notifier.h>
@@ -2236,6 +2237,11 @@ static noinline struct module *load_modu
 	mod->ctors = section_objs(hdr, sechdrs, secstrings, ".ctors",
 				  sizeof(*mod->ctors), &mod->num_ctors);
 #endif
+#ifdef CONFIG_IMMEDIATE
+	mod->immediate = section_objs(hdr, sechdrs, secstrings, "__imv",
+					sizeof(*mod->immediate),
+					&mod->num_immediate);
+#endif
 
 #ifdef CONFIG_TRACEPOINTS
 	mod->tracepoints = section_objs(hdr, sechdrs, secstrings,
@@ -3000,3 +3006,38 @@ int module_get_iter_tracepoints(struct t
 	return found;
 }
 #endif
+
+#ifdef CONFIG_IMMEDIATE
+/**
+ * _module_imv_update - update all immediate values in the kernel
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ * Module_mutex must be held by the caller.
+ */
+void _module_imv_update(void)
+{
+	struct module *mod;
+
+	list_for_each_entry(mod, &modules, list) {
+		if (mod->taints)
+			continue;
+		imv_update_range(mod->immediate,
+			mod->immediate + mod->num_immediate);
+	}
+}
+EXPORT_SYMBOL_GPL(_module_imv_update);
+
+/**
+ * module_imv_update - update all immediate values in the kernel
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ * Takes module_mutex.
+ */
+void module_imv_update(void)
+{
+	mutex_lock(&module_mutex);
+	_module_imv_update();
+	mutex_unlock(&module_mutex);
+}
+EXPORT_SYMBOL_GPL(module_imv_update);
+#endif
Index: linux.trees.git/kernel/immediate.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux.trees.git/kernel/immediate.c	2009-09-24 08:58:28.000000000 -0400
@@ -0,0 +1,183 @@
+/*
+ * Copyright (C) 2007 Mathieu Desnoyers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ */
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/immediate.h>
+#include <linux/memory.h>
+#include <linux/cpu.h>
+#include <linux/stop_machine.h>
+
+#include <asm/cacheflush.h>
+#include <asm/atomic.h>
+
+/*
+ * Kernel ready to execute the SMP update that may depend on trap and ipi.
+ */
+static int imv_early_boot_complete;
+static atomic_t stop_machine_first;
+static int wrote_text;
+
+extern const struct __imv __start___imv[];
+extern const struct __imv __stop___imv[];
+
+static int stop_machine_imv_update(void *imv_ptr)
+{
+	struct __imv *imv = imv_ptr;
+
+	if (atomic_dec_and_test(&stop_machine_first)) {
+		text_poke((void *)imv->imv, (void *)imv->var, imv->size);
+		smp_wmb(); /* make sure other cpus see that this has run */
+		wrote_text = 1;
+	} else {
+		while (!wrote_text)
+			smp_rmb();
+		sync_core();
+	}
+
+	flush_icache_range(imv->imv, imv->imv + imv->size);
+
+	return 0;
+}
+
+/*
+ * imv_mutex nests inside module_mutex. imv_mutex protects builtin
+ * immediates and module immediates.
+ */
+static DEFINE_MUTEX(imv_mutex);
+
+/**
+ * apply_imv_update - update one immediate value
+ * @imv: pointer of type const struct __imv to update
+ *
+ * Update one immediate value. Must be called with imv_mutex held.
+ * It makes sure all CPUs are not executing the modified code by having them
+ * busy looping with interrupts disabled.
+ * It does _not_ protect against NMI and MCE (could be a problem with Intel's
+ * errata if we use immediate values in their code path).
+ */
+static int apply_imv_update(const struct __imv *imv)
+{
+	/*
+	 * If the variable and the instruction have the same value, there is
+	 * nothing to do.
+	 */
+	switch (imv->size) {
+	case 1:
+		if (*(uint8_t *)imv->imv == *(uint8_t *)imv->var)
+			return 0;
+		break;
+	case 2:
+		if (*(uint16_t *)imv->imv == *(uint16_t *)imv->var)
+			return 0;
+		break;
+	case 4:
+		if (*(uint32_t *)imv->imv == *(uint32_t *)imv->var)
+			return 0;
+		break;
+	case 8:
+		if (*(uint64_t *)imv->imv == *(uint64_t *)imv->var)
+			return 0;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	if (imv_early_boot_complete) {
+		mutex_lock(&text_mutex);
+		atomic_set(&stop_machine_first, 1);
+		wrote_text = 0;
+		stop_machine(stop_machine_imv_update, (void *)imv, NULL);
+		mutex_unlock(&text_mutex);
+	} else
+		text_poke_early((void *)imv->imv, (void *)imv->var,
+				imv->size);
+	return 0;
+}
+
+/**
+ * imv_update_range - Update immediate values in a range
+ * @begin: pointer to the beginning of the range
+ * @end: pointer to the end of the range
+ *
+ * Updates a range of immediates.
+ */
+void imv_update_range(const struct __imv *begin,
+		const struct __imv *end)
+{
+	const struct __imv *iter;
+	int ret;
+
+	mutex_lock(&imv_mutex);
+	for (iter = begin; iter < end; iter++) {
+		ret = apply_imv_update(iter);
+		if (imv_early_boot_complete && ret)
+			printk(KERN_WARNING
+				"Invalid immediate value. "
+				"Variable at %p, "
+				"instruction at %p, size %hu\n",
+				(void *)iter->var,
+				(void *)iter->imv, iter->size);
+	}
+	mutex_unlock(&imv_mutex);
+}
+EXPORT_SYMBOL_GPL(imv_update_range);
+
+/**
+ * core_imv_update - update all immediate values in the core kernel
+ *
+ * Iterate on the core kernel immediate values to update them.
+ */
+void core_imv_update(void)
+{
+	/* Core kernel imvs */
+	imv_update_range(__start___imv, __stop___imv);
+}
+EXPORT_SYMBOL_GPL(core_imv_update);
+
+void __init imv_init_complete(void)
+{
+	imv_early_boot_complete = 1;
+}
+
+int imv_module_notify(struct notifier_block *self,
+		      unsigned long val, void *data)
+{
+	struct module *mod = data;
+
+	switch (val) {
+	case MODULE_STATE_COMING:
+		imv_update_range(mod->immediate,
+				 mod->immediate + mod->num_immediate);
+		break;
+	case MODULE_STATE_GOING:
+		/* All references will be gone, no update required. */
+		break;
+	}
+	return 0;
+}
+
+struct notifier_block imv_module_nb = {
+	.notifier_call = imv_module_notify,
+	.priority = 0,
+};
+
+static int init_imv(void)
+{
+	return register_module_notifier(&imv_module_nb);
+}
+__initcall(init_imv);
Index: linux.trees.git/init/main.c
===================================================================
--- linux.trees.git.orig/init/main.c	2009-09-24 08:52:55.000000000 -0400
+++ linux.trees.git/init/main.c	2009-09-24 08:58:28.000000000 -0400
@@ -97,6 +97,11 @@ static inline void mark_rodata_ro(void) 
 #ifdef CONFIG_TC
 extern void tc_init(void);
 #endif
+#ifdef USE_IMMEDIATE
+extern void imv_init_complete(void);
+#else
+static inline void imv_init_complete(void) { }
+#endif
 
 enum system_states system_state __read_mostly;
 EXPORT_SYMBOL(system_state);
@@ -544,6 +549,7 @@ asmlinkage void __init start_kernel(void
 	boot_init_stack_canary();
 
 	cgroup_init_early();
+	core_imv_update();
 
 	local_irq_disable();
 	early_boot_irqs_off();
@@ -685,6 +691,7 @@ asmlinkage void __init start_kernel(void
 	cpuset_init();
 	taskstats_init_early();
 	delayacct_init();
+	imv_init_complete();
 
 	check_bugs();
 
Index: linux.trees.git/include/asm-generic/vmlinux.lds.h
===================================================================
--- linux.trees.git.orig/include/asm-generic/vmlinux.lds.h	2009-09-24 08:52:54.000000000 -0400
+++ linux.trees.git/include/asm-generic/vmlinux.lds.h	2009-09-24 08:58:28.000000000 -0400
@@ -202,6 +202,9 @@
 		*(__vermagic)		/* Kernel version magic */	\
 		*(__markers_strings)	/* Markers: strings */		\
 		*(__tracepoints_strings)/* Tracepoints: strings */	\
+		VMLINUX_SYMBOL(__start___imv) = .;			\
+		*(__imv)		/* Immediate values: pointers */ \
+		VMLINUX_SYMBOL(__stop___imv) = .;			\
 	}								\
 									\
 	.rodata1          : AT(ADDR(.rodata1) - LOAD_OFFSET) {		\
Index: linux.trees.git/include/linux/immediate.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux.trees.git/include/linux/immediate.h	2009-09-24 08:58:28.000000000 -0400
@@ -0,0 +1,94 @@
+#ifndef _LINUX_IMMEDIATE_H
+#define _LINUX_IMMEDIATE_H
+
+/*
+ * Immediate values, can be updated at runtime and save cache lines.
+ *
+ * (C) Copyright 2007 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#ifdef CONFIG_IMMEDIATE
+
+struct __imv {
+	unsigned long var;	/* Pointer to the identifier variable of the
+				 * immediate value
+				 */
+	unsigned long imv;	/*
+				 * Pointer to the memory location of the
+				 * immediate value within the instruction.
+				 */
+	unsigned char size;	/* Type size. */
+} __attribute__ ((packed));
+
+#include <asm/immediate.h>
+
+/**
+ * imv_set - set immediate variable (with locking)
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name, taking the module_mutex if required by
+ * the architecture.
+ */
+#define imv_set(name, i)						\
+	do {								\
+		name##__imv = (i);					\
+		core_imv_update();					\
+		module_imv_update();					\
+	} while (0)
+
+/*
+ * Internal update functions.
+ */
+extern void core_imv_update(void);
+extern void imv_update_range(const struct __imv *begin,
+	const struct __imv *end);
+
+#else
+
+/*
+ * Generic immediate values: a simple, standard, memory load.
+ */
+
+/**
+ * imv_read - read immediate variable
+ * @name: immediate value name
+ *
+ * Reads the value of @name.
+ */
+#define imv_read(name)			_imv_read(name)
+
+/**
+ * imv_set - set immediate variable (with locking)
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name, taking the module_mutex if required by
+ * the architecture.
+ */
+#define imv_set(name, i)		(name##__imv = (i))
+
+static inline void core_imv_update(void) { }
+static inline void module_imv_update(void) { }
+
+#endif
+
+#define DECLARE_IMV(type, name) extern __typeof__(type) name##__imv
+#define DEFINE_IMV(type, name)  __typeof__(type) name##__imv
+
+#define EXPORT_IMV_SYMBOL(name) EXPORT_SYMBOL(name##__imv)
+#define EXPORT_IMV_SYMBOL_GPL(name) EXPORT_SYMBOL_GPL(name##__imv)
+
+/**
+ * _imv_read - Read immediate value with standard memory load.
+ * @name: immediate value name
+ *
+ * Force a data read of the immediate value instead of the immediate value
+ * based mechanism. Useful for __init and __exit section data read.
+ */
+#define _imv_read(name)		(name##__imv)
+
+#endif
Index: linux.trees.git/kernel/Makefile
===================================================================
--- linux.trees.git.orig/kernel/Makefile	2009-09-24 08:52:55.000000000 -0400
+++ linux.trees.git/kernel/Makefile	2009-09-24 09:00:15.000000000 -0400
@@ -88,6 +88,7 @@ obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
 obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
+obj-$(CONFIG_IMMEDIATE) += immediate.o
 obj-$(CONFIG_LATENCYTOP) += latencytop.o
 obj-$(CONFIG_FUNCTION_TRACER) += trace/
 obj-$(CONFIG_TRACING) += trace/

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch 03/12] Immediate Values - Kconfig menu in EMBEDDED
  2009-09-24 13:26 [patch 00/12] Immediate Values Mathieu Desnoyers
  2009-09-24 13:26 ` [patch 01/12] x86: text_poke_early non static Mathieu Desnoyers
  2009-09-24 13:26 ` [patch 02/12] Immediate Values - Architecture Independent Code Mathieu Desnoyers
@ 2009-09-24 13:26 ` Mathieu Desnoyers
  2009-09-24 13:26 ` [patch 04/12] Immediate Values - x86 Optimization Mathieu Desnoyers
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2009-09-24 13:26 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Rusty Russell, Adrian Bunk, Andi Kleen,
	Christoph Hellwig, akpm

[-- Attachment #1: immediate-values-kconfig-menu-in-embedded.patch --]
[-- Type: text/plain, Size: 2319 bytes --]

Immediate values provide a way to use dynamic code patching to update variables
sitting within the instruction stream. This saves cache lines normally used by
static read-mostly variables. Enable it by default, but let users disable it
through the EMBEDDED menu with the "Immediate value optimization" entry.

Note: Since embedded systems developers using read-only memory should have the
option to disable the immediate values, I chose to leave this menu option in
the EMBEDDED menu. Also, "CONFIG_IMMEDIATE" makes sense because we want to
compile out all the immediate code when we decide not to use optimized
immediate values at all (it removes otherwise unused code).

Changelog:
- Change ARCH_SUPPORTS_IMMEDIATE for HAS_IMMEDIATE
- Turn DISABLE_IMMEDIATE into positive logic

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Adrian Bunk <bunk@stusta.de>
CC: Andi Kleen <andi@firstfloor.org>
CC: Christoph Hellwig <hch@infradead.org>
CC: mingo@elte.hu
CC: akpm@osdl.org
---
 init/Kconfig |   18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

Index: linux.trees.git/init/Kconfig
===================================================================
--- linux.trees.git.orig/init/Kconfig	2009-09-24 08:52:55.000000000 -0400
+++ linux.trees.git/init/Kconfig	2009-09-24 09:00:20.000000000 -0400
@@ -1088,6 +1088,24 @@ config SLOW_WORK
 
 	  See Documentation/slow-work.txt.
 
+config HAVE_IMMEDIATE
+	def_bool n
+
+config IMMEDIATE
+	default y
+	depends on HAVE_IMMEDIATE
+	bool "Immediate value optimization" if EMBEDDED
+	help
+	  Immediate values are used as read-mostly variables that are rarely
+	  updated. They use code patching to modify the values inscribed in the
+	  instruction stream. It provides a way to save precious cache lines
+	  that would otherwise have to be used by these variables. They can be
+	  disabled through the EMBEDDED menu.
+
+	  It consumes slightly more memory and modifies the instruction stream
+	  each time any specially-marked variable is updated. Should really be
+	  disabled for embedded systems with read-only text.
+
 endmenu		# General setup
 
 config HAVE_GENERIC_DMA_COHERENT

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch 04/12] Immediate Values - x86 Optimization
  2009-09-24 13:26 [patch 00/12] Immediate Values Mathieu Desnoyers
                   ` (2 preceding siblings ...)
  2009-09-24 13:26 ` [patch 03/12] Immediate Values - Kconfig menu in EMBEDDED Mathieu Desnoyers
@ 2009-09-24 13:26 ` Mathieu Desnoyers
  2009-09-24 13:26 ` [patch 05/12] Add text_poke and sync_core to powerpc Mathieu Desnoyers
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2009-09-24 13:26 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Andi Kleen, H. Peter Anvin, Chuck Ebbert,
	Christoph Hellwig, Jeremy Fitzhardinge, Thomas Gleixner,
	Ingo Molnar, Rusty Russell, Adrian Bunk, akpm

[-- Attachment #1: immediate-values-x86-optimization.patch --]
[-- Type: text/plain, Size: 4829 bytes --]

x86 optimization of the immediate values which uses a movl with code patching
to set/unset the value used to populate the register used as variable source.

Note: a movb needs to get its value from a =q constraint.

Quoting "H. Peter Anvin" <hpa@zytor.com>

Using =r for single-byte values is incorrect for 32-bit code -- that would 
permit %spl, %bpl, %sil, %dil which are illegal in 32-bit mode.
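
To make the patching target concrete: in the imv_read() macro below, the
__imv entry records "(3f) - size", the label right after the mov minus the
operand width, because the immediate occupies the trailing bytes of the
instruction. The sketch that follows models this on a plain byte buffer; it
never executes the bytes. imv_operand() and patch_operand() are hypothetical
helpers, and the example encoding assumed in the usage note is the 5-byte
"mov $imm32, %eax" (opcode 0xB8 followed by a 4-byte immediate).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Address of the immediate operand: it is the last `size` bytes of the
 * instruction, i.e. "end of instruction minus size".
 */
uint8_t *imv_operand(uint8_t *insn_end, size_t size)
{
	return insn_end - size;
}

/* Overwrite only the operand bytes; the opcode bytes stay untouched. */
void patch_operand(uint8_t *insn, size_t insn_len,
		   const void *val, size_t size)
{
	assert(size <= insn_len);
	memcpy(imv_operand(insn + insn_len, size), val, size);
}
```

For a buffer holding { 0xb8, 0, 0, 0, 0 }, calling patch_operand() with a
4-byte value rewrites bytes 1 to 4 and leaves the 0xb8 opcode in place.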

Changelog:
- Use text_poke_early with cr0 WP save/restore to patch the bypass. We are doing
  non atomic writes to a code region only touched by us (nobody can execute it
  since we are protected by the imv_mutex).
- Put imv_set and _imv_set in the architecture independent header.
- Use $0 instead of %2 with (0) operand.
- Add x86_64 support, ready for i386+x86_64 -> x86 merge.
- Use asm-x86/asm.h.
- Bugfix : 8 bytes 64 bits immediate value was declared as "4 bytes" in the
  immediate structure.
- Change the immediate.c update code to support variable length opcodes.
- Vastly simplified, using a busy looping IPI with interrupts disabled.
  Does not protect against NMI nor MCE.
- Pack the __imv section. Use smallest types required for size (char).
- Use imv_* instead of immediate_*.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Kleen <ak@muc.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Chuck Ebbert <cebbert@redhat.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Adrian Bunk <bunk@stusta.de>
CC: akpm@osdl.org
---
 arch/x86/Kconfig                 |    1 
 arch/x86/include/asm/immediate.h |   77 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 78 insertions(+)

Index: linux.trees.git/arch/x86/include/asm/immediate.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux.trees.git/arch/x86/include/asm/immediate.h	2009-09-24 09:00:27.000000000 -0400
@@ -0,0 +1,77 @@
+#ifndef _ASM_X86_IMMEDIATE_H
+#define _ASM_X86_IMMEDIATE_H
+
+/*
+ * Immediate values. x86 architecture optimizations.
+ *
+ * (C) Copyright 2006 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#include <asm/asm.h>
+
+/**
+ * imv_read - read immediate variable
+ * @name: immediate value name
+ *
+ * Reads the value of @name.
+ * Optimized version of the immediate.
+ * Do not use in __init and __exit functions. Use _imv_read() instead.
+ * If size is bigger than the architecture long size, fall back on a memory
+ * read.
+ *
+ * Make sure to populate the initial static 64 bits opcode with a value
+ * that will generate an instruction with 8 bytes immediate value (not the REX.W
+ * prefixed one that loads a sign extended 32 bits immediate value in a r64
+ * register).
+ */
+#define imv_read(name)							\
+	({								\
+		__typeof__(name##__imv) value;				\
+		BUILD_BUG_ON(sizeof(value) > 8);			\
+		switch (sizeof(value)) {				\
+		case 1:							\
+			asm(".section __imv,\"a\",@progbits\n\t"	\
+				_ASM_PTR "%c1, (3f)-%c2\n\t"		\
+				".byte %c2\n\t"				\
+				".previous\n\t"				\
+				"mov $0,%0\n\t"				\
+				"3:\n\t"				\
+				: "=q" (value)				\
+				: "i" (&name##__imv),			\
+				  "i" (sizeof(value)));			\
+			break;						\
+		case 2:							\
+		case 4:							\
+			asm(".section __imv,\"a\",@progbits\n\t"	\
+				_ASM_PTR "%c1, (3f)-%c2\n\t"		\
+				".byte %c2\n\t"				\
+				".previous\n\t"				\
+				"mov $0,%0\n\t"				\
+				"3:\n\t"				\
+				: "=r" (value)				\
+				: "i" (&name##__imv),			\
+				  "i" (sizeof(value)));			\
+			break;						\
+		case 8:							\
+			if (sizeof(long) < 8) {				\
+				value = name##__imv;			\
+				break;					\
+			}						\
+			asm(".section __imv,\"a\",@progbits\n\t"	\
+				_ASM_PTR "%c1, (3f)-%c2\n\t"		\
+				".byte %c2\n\t"				\
+				".previous\n\t"				\
+				"mov $0xFEFEFEFE01010101,%0\n\t" 	\
+				"3:\n\t"				\
+				: "=r" (value)				\
+				: "i" (&name##__imv),			\
+				  "i" (sizeof(value)));			\
+			break;						\
+		};							\
+		value;							\
+	})
+
+#endif /* _ASM_X86_IMMEDIATE_H */
Index: linux.trees.git/arch/x86/Kconfig
===================================================================
--- linux.trees.git.orig/arch/x86/Kconfig	2009-09-24 08:52:41.000000000 -0400
+++ linux.trees.git/arch/x86/Kconfig	2009-09-24 09:00:27.000000000 -0400
@@ -45,6 +45,7 @@ config X86
 	select HAVE_GENERIC_DMA_COHERENT if X86_32
 	select HAVE_EFFICIENT_UNALIGNED_ACCESS
 	select USER_STACKTRACE_SUPPORT
+	select HAVE_IMMEDIATE
 	select HAVE_DMA_API_DEBUG
 	select HAVE_KERNEL_GZIP
 	select HAVE_KERNEL_BZIP2

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch 05/12] Add text_poke and sync_core to powerpc
  2009-09-24 13:26 [patch 00/12] Immediate Values Mathieu Desnoyers
                   ` (3 preceding siblings ...)
  2009-09-24 13:26 ` [patch 04/12] Immediate Values - x86 Optimization Mathieu Desnoyers
@ 2009-09-24 13:26 ` Mathieu Desnoyers
  2009-09-24 13:26 ` [patch 06/12] Immediate Values - Powerpc Optimization Mathieu Desnoyers
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2009-09-24 13:26 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Rusty Russell, Christoph Hellwig,
	Paul Mackerras, Adrian Bunk, Andi Kleen, akpm

[-- Attachment #1: add-text-poke-and-sync-core-to-powerpc.patch --]
[-- Type: text/plain, Size: 1413 bytes --]

- Needed on architectures where we must surround live instruction modification
  with "WP flag disable".
- Turns into a memcpy on powerpc since there is no WP flag activated for
  instruction pages (yet..).
- Add empty sync_core to powerpc so it can be used in architecture independent
  code.
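
As a rough userspace illustration of the aliasing described above (not
the kernel header itself), the sketch below shows how architecture
independent code can call text_poke() and sync_core() unconditionally
once they are defined this way. demo_patch() is a hypothetical caller
used only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Same aliasing as in this patch: no write-protect handling is needed,
 * so a plain memcpy does the job and sync_core() expands to nothing.
 */
#define text_poke	memcpy
#define text_poke_early	text_poke
#define sync_core()

/* Hypothetical caller standing in for architecture independent code. */
static void demo_patch(unsigned char *insn, const unsigned char *val,
		       size_t len)
{
	assert(len > 0);
	text_poke(insn, val, len);	/* becomes memcpy(insn, val, len) */
	sync_core();			/* empty statement with these defines */
}
```

On an architecture that write-protects text pages, only the definitions
would change; the caller would stay the same.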

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Christoph Hellwig <hch@infradead.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Adrian Bunk <bunk@stusta.de>
CC: Andi Kleen <andi@firstfloor.org>
CC: mingo@elte.hu
CC: akpm@osdl.org
---
 arch/powerpc/include/asm/cacheflush.h |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/arch/powerpc/include/asm/cacheflush.h
===================================================================
--- linux-2.6-lttng.orig/arch/powerpc/include/asm/cacheflush.h	2009-01-09 18:16:16.000000000 -0500
+++ linux-2.6-lttng/arch/powerpc/include/asm/cacheflush.h	2009-01-09 18:17:34.000000000 -0500
@@ -63,7 +63,9 @@ extern void flush_dcache_phys_range(unsi
 #define copy_from_user_page(vma, page, vaddr, dst, src, len) \
 	memcpy(dst, src, len)
 
-
+#define text_poke	memcpy
+#define text_poke_early	text_poke
+#define sync_core()
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
 /* internal debugging function */

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* [patch 06/12] Immediate Values - Powerpc Optimization
  2009-09-24 13:26 [patch 00/12] Immediate Values Mathieu Desnoyers
                   ` (4 preceding siblings ...)
  2009-09-24 13:26 ` [patch 05/12] Add text_poke and sync_core to powerpc Mathieu Desnoyers
@ 2009-09-24 13:26 ` Mathieu Desnoyers
  2009-09-24 13:26 ` [patch 07/12] Sparc create asm.h Mathieu Desnoyers
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2009-09-24 13:26 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Rusty Russell, Christoph Hellwig,
	Paul Mackerras, Adrian Bunk, Andi Kleen, akpm

[-- Attachment #1: immediate-values-powerpc-optimization.patch --]
[-- Type: text/plain, Size: 3114 bytes --]

PowerPC optimization of the immediate values which uses a li instruction,
patched with an immediate value.
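
The imv_read() macro records the address "(1f)-size", i.e. the last one
or two bytes of the li instruction, which on big-endian PowerPC hold the
low part of its 16-bit immediate field. The userspace sketch below
assumes that the generic update code simply copies "size" bytes to that
address; demo_imv_patch() is a hypothetical name, and the instruction
bytes used with it are only sample data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: insn points at a 4-byte big-endian instruction,
 * var at the variable's bytes (also big-endian), size is 1 or 2.
 * The patch site is the end of the instruction minus size, which is
 * what "(1f)-size" evaluates to in the inline asm.
 */
static void demo_imv_patch(unsigned char insn[4], const unsigned char *var,
			   size_t size)
{
	assert(size == 1 || size == 2);
	memcpy(insn + 4 - size, var, size);
}
```

Because only the tail of the instruction is overwritten, the opcode and
register fields are left untouched.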

Changelog:
- Put imv_set and _imv_set in the architecture independent header.
- Pack the __imv section. Use smallest types required for size (char).
- Remove architecture specific update code : now handled by architecture
  agnostic code.
- Use imv_* instead of immediate_*.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Christoph Hellwig <hch@infradead.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Adrian Bunk <bunk@stusta.de>
CC: Andi Kleen <andi@firstfloor.org>
CC: mingo@elte.hu
CC: akpm@osdl.org
---
 arch/powerpc/Kconfig                 |    1 
 arch/powerpc/include/asm/immediate.h |   56 +++++++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+)

Index: linux.trees.git/arch/powerpc/include/asm/immediate.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux.trees.git/arch/powerpc/include/asm/immediate.h	2009-09-24 09:00:33.000000000 -0400
@@ -0,0 +1,56 @@
+#ifndef _ASM_POWERPC_IMMEDIATE_H
+#define _ASM_POWERPC_IMMEDIATE_H
+
+/*
+ * Immediate values. PowerPC architecture optimizations.
+ *
+ * (C) Copyright 2006 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#include <asm/asm-compat.h>
+
+/**
+ * imv_read - read immediate variable
+ * @name: immediate value name
+ *
+ * Reads the value of @name.
+ * Optimized version of the immediate.
+ * Do not use in __init and __exit functions. Use _imv_read() instead.
+ */
+#define imv_read(name)							\
+	({								\
+		__typeof__(name##__imv) value;				\
+		BUILD_BUG_ON(sizeof(value) > 8);			\
+		switch (sizeof(value)) {				\
+		case 1:							\
+			asm(".section __imv,\"a\",@progbits\n\t"	\
+					PPC_LONG "%c1, ((1f)-1)\n\t"	\
+					".byte 1\n\t"			\
+					".previous\n\t"			\
+					"li %0,0\n\t"			\
+					"1:\n\t"			\
+				: "=r" (value)				\
+				: "i" (&name##__imv));			\
+			break;						\
+		case 2:							\
+			asm(".section __imv,\"a\",@progbits\n\t"	\
+					PPC_LONG "%c1, ((1f)-2)\n\t"	\
+					".byte 2\n\t"			\
+					".previous\n\t"			\
+					"li %0,0\n\t"			\
+					"1:\n\t"			\
+				: "=r" (value)				\
+				: "i" (&name##__imv));			\
+			break;						\
+		case 4:							\
+		case 8:							\
+			value = name##__imv;				\
+			break;						\
+		};							\
+		value;							\
+	})
+
+#endif /* _ASM_POWERPC_IMMEDIATE_H */
Index: linux.trees.git/arch/powerpc/Kconfig
===================================================================
--- linux.trees.git.orig/arch/powerpc/Kconfig	2009-09-24 08:52:40.000000000 -0400
+++ linux.trees.git/arch/powerpc/Kconfig	2009-09-24 09:00:51.000000000 -0400
@@ -130,6 +130,7 @@ config PPC
 	select HAVE_SYSCALL_WRAPPERS if PPC64
 	select GENERIC_ATOMIC64 if PPC32
 	select HAVE_PERF_EVENTS
+	select HAVE_IMMEDIATE
 
 config EARLY_PRINTK
 	bool

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* [patch 07/12] Sparc create asm.h
  2009-09-24 13:26 [patch 00/12] Immediate Values Mathieu Desnoyers
                   ` (5 preceding siblings ...)
  2009-09-24 13:26 ` [patch 06/12] Immediate Values - Powerpc Optimization Mathieu Desnoyers
@ 2009-09-24 13:26 ` Mathieu Desnoyers
  2009-09-24 21:10   ` David Miller
  2009-09-24 13:26 ` [patch 08/12] sparc64: Optimized immediate value implementation Mathieu Desnoyers
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 34+ messages in thread
From: Mathieu Desnoyers @ 2009-09-24 13:26 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel; +Cc: Mathieu Desnoyers, David S. Miller

[-- Attachment #1: sparc64-create-asm-h.patch --]
[-- Type: text/plain, Size: 1245 bytes --]

Create an assembly compatibility header for sparc32/64.
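
For readers unfamiliar with the stringification trick used by these
macros, here is a small userspace sketch. The DEMO_ names are
hypothetical stand-ins for __ASM_FORM, __ASM_SEL and _ASM_PTR, and the
pointer width test is an assumption that replaces the sparc32/64 #ifdef
in the real header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* In C code the directive is turned into a padded string literal. */
#define DEMO_ASM_FORM(x)	" " #x " "

/* Pick the first form on 32-bit targets, the second on 64-bit ones. */
#if UINTPTR_MAX > 0xffffffffu
#define DEMO_ASM_SEL(a, b)	DEMO_ASM_FORM(b)
#else
#define DEMO_ASM_SEL(a, b)	DEMO_ASM_FORM(a)
#endif

#define DEMO_ASM_PTR	DEMO_ASM_SEL(.word, .xword)

/*
 * Returns the pointer-sized data directive as it would be pasted into
 * an inline asm string.
 */
static const char *demo_ptr_directive(void)
{
	const char *d = DEMO_ASM_PTR;

	assert(d[0] == ' ');	/* padding keeps adjacent tokens apart */
	return d;
}
```

The resulting literal can be concatenated with the rest of an asm
template, which is how _ASM_UAPTR is used by the following patch.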

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: David S. Miller <davem@davemloft.net>
---
 arch/sparc/include/asm/asm.h |   11 +++++++++++
 1 file changed, 11 insertions(+)

Index: linux-2.6-lttng/arch/sparc/include/asm/asm.h
===================================================================
--- linux-2.6-lttng.orig/arch/sparc/include/asm/asm.h	2009-09-24 08:42:17.000000000 -0400
+++ linux-2.6-lttng/arch/sparc/include/asm/asm.h	2009-09-24 08:49:13.000000000 -0400
@@ -18,6 +18,7 @@
 	brnz,PREDICT	REG, DEST
 #define BRANCH_REG_NOT_ZERO_ANNUL(PREDICT, REG, DEST) \
 	brnz,a,PREDICT	REG, DEST
+#define __ASM_SEL(a, b)	__ASM_FORM(b)
 #else
 #define BRANCH32(TYPE, PREDICT, DEST) \
 	TYPE		DEST
@@ -35,6 +36,16 @@
 #define BRANCH_REG_NOT_ZERO_ANNUL(PREDICT, REG, DEST) \
 	cmp		REG, 0; \
 	bne,a		DEST
+#define __ASM_SEL(a, b)	__ASM_FORM(a)
 #endif
 
+#ifdef __ASSEMBLY__
+#define __ASM_FORM(x)	x
+#else
+#define __ASM_FORM(x)	" " #x " "
+#endif
+
+#define _ASM_PTR	__ASM_SEL(.word, .xword)
+#define _ASM_UAPTR	__ASM_SEL(.uaword, .uaxword)
+
 #endif /* _SPARC_ASM_H */

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* [patch 08/12] sparc64: Optimized immediate value implementation.
  2009-09-24 13:26 [patch 00/12] Immediate Values Mathieu Desnoyers
                   ` (6 preceding siblings ...)
  2009-09-24 13:26 ` [patch 07/12] Sparc create asm.h Mathieu Desnoyers
@ 2009-09-24 13:26 ` Mathieu Desnoyers
  2009-09-24 13:26 ` [patch 09/12] Immediate Values - Documentation Mathieu Desnoyers
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2009-09-24 13:26 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel; +Cc: David S. Miller, Mathieu Desnoyers

[-- Attachment #1: sparc64-immedate-values.patch --]
[-- Type: text/plain, Size: 5413 bytes --]

commit f2b14974b823a9cd9b6f5c0d423945caa15de8a2
Author: David S. Miller <davem@davemloft.net>
Date:   Tue May 13 04:29:30 2008 -0700

We can only do byte sized values currently.

In order to support even 16-bit immediates we would need a 2
instruction sequence.

I believe that can be made to work with a suitable breakpoint or some
other kind of special patching sequence, but that isn't attempted
here.
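
To illustrate the restriction: arch_imv_update() in this patch rewrites
only the low 13 bits of the instruction word (mask 0x1fff), which is the
immediate field of a single mov. A 16-bit value does not fit there. The
userspace sketch below mirrors that mask-and-or step;
demo_patch_simm13() is a hypothetical name, and any instruction word
passed to it in an example is sample data rather than something taken
from this patch.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SIMM13_MASK	0x00001fffu

/*
 * Replace the 13-bit immediate field of an instruction word with an
 * 8-bit value, leaving opcode and register fields untouched.
 */
static uint32_t demo_patch_simm13(uint32_t insn, uint8_t val)
{
	/* bit 13 set means the instruction uses its immediate form */
	assert((insn & (1u << 13)) != 0);
	insn &= ~DEMO_SIMM13_MASK;
	insn |= (uint32_t)val;
	return insn;
}
```

Since the whole update is a single aligned 32-bit store, no intermediate
state is visible to other CPUs executing the instruction.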

[edit by Mathieu Desnoyers]
Use _ASM_PTR and _ASM_UAPTR 32/64 bits compatibility macros.

Use "unsigned long" type to encode pointers, with uaxword on sparc64 and uaword
on sparc32.

Disable immediate values on gcc < 4.0.0, because it seems to have problems
passing pointers as an "i" inline asm constraint.

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 arch/sparc/Kconfig                 |    1 
 arch/sparc/Makefile                |    4 +++
 arch/sparc/include/asm/immediate.h |   40 ++++++++++++++++++++++++++++++
 arch/sparc/kernel/Makefile         |    1 
 arch/sparc/kernel/immediate.c      |   48 +++++++++++++++++++++++++++++++++++++
 5 files changed, 94 insertions(+)

Index: linux.trees.git/arch/sparc/Kconfig
===================================================================
--- linux.trees.git.orig/arch/sparc/Kconfig	2009-09-24 08:52:41.000000000 -0400
+++ linux.trees.git/arch/sparc/Kconfig	2009-09-24 09:01:13.000000000 -0400
@@ -48,6 +48,7 @@ config SPARC64
 	select RTC_DRV_SUN4V
 	select RTC_DRV_STARFIRE
 	select HAVE_PERF_EVENTS
+	select HAVE_IMMEDIATE
 
 config ARCH_DEFCONFIG
 	string
Index: linux.trees.git/arch/sparc/kernel/Makefile
===================================================================
--- linux.trees.git.orig/arch/sparc/kernel/Makefile	2009-09-24 08:52:41.000000000 -0400
+++ linux.trees.git/arch/sparc/kernel/Makefile	2009-09-24 09:01:55.000000000 -0400
@@ -91,6 +91,7 @@ obj-$(CONFIG_SPARC64_PCI)    += pci_sun4
 obj-$(CONFIG_PCI_MSI)        += pci_msi.o
 
 obj-$(CONFIG_COMPAT)         += sys32.o sys_sparc32.o signal32.o
+obj-$(USE_IMMEDIATE)	     += immediate.o
 
 # sparc64 cpufreq
 obj-$(CONFIG_US3_FREQ)  += us3_cpufreq.o
Index: linux.trees.git/arch/sparc/kernel/immediate.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux.trees.git/arch/sparc/kernel/immediate.c	2009-09-24 09:00:57.000000000 -0400
@@ -0,0 +1,48 @@
+#include <linux/module.h>
+#include <linux/immediate.h>
+#include <linux/string.h>
+#include <linux/kprobes.h>
+
+#include <asm/system.h>
+
+int arch_imv_update(const struct __imv *imv, int early)
+{
+	unsigned long imv_vaddr = imv->imv;
+	unsigned long var_vaddr = imv->var;
+	u32 insn, *ip = (u32 *) imv_vaddr;
+
+	insn = *ip;
+
+#ifdef CONFIG_KPROBES
+	switch (imv->size) {
+	case 1:
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	if (unlikely(!early &&
+		     (insn == BREAKPOINT_INSTRUCTION ||
+		      insn == BREAKPOINT_INSTRUCTION_2))) {
+		printk(KERN_WARNING "Immediate value in conflict with kprobe. "
+				    "Variable at %p, "
+				    "instruction at %p, size %u\n",
+				    (void *)var_vaddr, ip, imv->size);
+		return -EBUSY;
+	}
+#endif
+
+	switch (imv->size) {
+	case 1:
+		if ((insn & 0x1fff) == *(uint8_t *)var_vaddr)
+			return 0;
+		insn &= ~0x00001fff;
+		insn |= (u32) (*(uint8_t *)var_vaddr);
+		break;
+	default:
+		return -EINVAL;
+	}
+	*ip = insn;
+	flushi(ip);
+	return 0;
+}
Index: linux.trees.git/arch/sparc/include/asm/immediate.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux.trees.git/arch/sparc/include/asm/immediate.h	2009-09-24 09:00:57.000000000 -0400
@@ -0,0 +1,40 @@
+#ifndef _ASM_SPARC_IMMEDIATE_H
+#define _ASM_SPARC_IMMEDIATE_H
+
+#include <asm/asm.h>
+
+struct __imv {
+	unsigned long var;
+	unsigned long imv;
+	unsigned char size;
+} __attribute__ ((packed));
+
+#define imv_read(name)							\
+	({								\
+		__typeof__(name##__imv) value;				\
+		BUILD_BUG_ON(sizeof(value) > 8);			\
+		switch (sizeof(value)) {				\
+		case 1:							\
+			asm(".section __imv,\"aw\",@progbits\n\t"	\
+					_ASM_UAPTR " %c1, 1f\n\t"	\
+					".byte 1\n\t"			\
+					".previous\n\t"			\
+					"1: mov 0, %0\n\t"		\
+				: "=r" (value)				\
+				: "i" (&name##__imv));			\
+			break;						\
+		case 2:							\
+		case 4:							\
+		case 8:							\
+			value = name##__imv;				\
+			break;						\
+		};							\
+		value;							\
+	})
+
+#define imv_cond(name)	imv_read(name)
+#define imv_cond_end()
+
+extern int arch_imv_update(const struct __imv *imv, int early);
+
+#endif /* _ASM_SPARC_IMMEDIATE_H */
Index: linux.trees.git/arch/sparc/Makefile
===================================================================
--- linux.trees.git.orig/arch/sparc/Makefile	2009-09-24 08:52:41.000000000 -0400
+++ linux.trees.git/arch/sparc/Makefile	2009-09-24 09:00:57.000000000 -0400
@@ -57,6 +57,10 @@ KBUILD_CFLAGS += -m64 -pipe -mno-fpu -mc
 KBUILD_CFLAGS += $(call cc-option,-mtune=ultrasparc3)
 KBUILD_AFLAGS += -m64 -mcpu=ultrasparc -Wa,--undeclared-regs
 
+# gcc 3.x has problems with passing symbol+offset in
+# asm "i" constraint.
+export USE_IMMEDIATE := $(call cc-ifversion, -ge, 0400, $(CONFIG_IMMEDIATE))
+
 ifeq ($(CONFIG_MCOUNT),y)
   KBUILD_CFLAGS += -pg
 endif

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* [patch 09/12] Immediate Values - Documentation
  2009-09-24 13:26 [patch 00/12] Immediate Values Mathieu Desnoyers
                   ` (7 preceding siblings ...)
  2009-09-24 13:26 ` [patch 08/12] sparc64: Optimized immediate value implementation Mathieu Desnoyers
@ 2009-09-24 13:26 ` Mathieu Desnoyers
  2009-09-24 13:26 ` [patch 10/12] Immediate Values Support init Mathieu Desnoyers
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2009-09-24 13:26 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Rusty Russell, Adrian Bunk, Andi Kleen,
	Christoph Hellwig, akpm, KOSAKI Motohiro

[-- Attachment #1: immediate-values-documentation.patch --]
[-- Type: text/plain, Size: 9146 bytes --]

Changelog:
- Remove imv_set_early (removed from API).
- Use imv_* instead of immediate_*.
- Remove non-ascii characters.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Adrian Bunk <bunk@stusta.de>
CC: Andi Kleen <andi@firstfloor.org>
CC: Christoph Hellwig <hch@infradead.org>
CC: mingo@elte.hu
CC: akpm@osdl.org
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 Documentation/immediate.txt |  221 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 221 insertions(+)

Index: linux-2.6-lttng/Documentation/immediate.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/Documentation/immediate.txt	2009-09-24 08:31:49.000000000 -0400
@@ -0,0 +1,221 @@
+		        Using the Immediate Values
+
+			    Mathieu Desnoyers
+
+
+This document introduces Immediate Values and their use.
+
+
+* Purpose of immediate values
+
+An immediate value is a kernel variable whose value is encoded directly in the
+instruction stream. Such variables are read often but rarely updated.
+Using immediate values for these variables saves data cache lines.
+
+This infrastructure is specialized in supporting dynamic patching of the values
+in the instruction stream when multiple CPUs are running without disturbing the
+normal system behavior.
+
+Compiling code meant to be rarely enabled at runtime can be done using
+if (unlikely(imv_read(var))) as condition surrounding the code. The
+smallest data type required for the test (an 8-bit char) is preferred, since
+some architectures, such as powerpc, only allow immediate values up to 16 bits.
+
+
+* Usage
+
+In order to use the "immediate" macros, you should include linux/immediate.h.
+
+#include <linux/immediate.h>
+
+DEFINE_IMV(char, this_immediate);
+EXPORT_IMV_SYMBOL(this_immediate);
+
+
+And use, in the body of a function:
+
+Use imv_set(this_immediate) to set the immediate value.
+
+Use imv_read(this_immediate) to read the immediate value.
+
+The immediate mechanism supports inserting multiple instances of the same
+immediate. Immediate values can be put in inline functions, inlined static
+functions, and unrolled loops.
+
+If you have to read the immediate values from a function declared as __init or
+__exit, you should explicitly use _imv_read(), which will fall back on a
+global variable read. Failing to do so will leave a reference to the __init
+section after it is freed (it would generate a modpost warning).
+
+You can choose to set an initial static value to the immediate by using, for
+instance:
+
+DEFINE_IMV(long, myptr) = 10;
+
+
+* Optimization for a given architecture
+
+One can implement optimized immediate values for a given architecture by
+replacing asm-$ARCH/immediate.h.
+
+
+* Performance improvement
+
+
+  * Memory hit for a data-based branch
+
+Here are the results on a 3GHz Pentium 4:
+
+number of tests: 100
+number of branches per test: 100000
+memory hit cycles per iteration (mean): 636.611
+L1 cache hit cycles per iteration (mean): 89.6413
+instruction stream based test, cycles per iteration (mean): 85.3438
+Just getting the pointer from a modulo on a pseudo-random value, doing
+  nothing with it, cycles per iteration (mean): 77.5044
+
+So:
+Base case:                      77.50 cycles
+instruction stream based test:  +7.8394 cycles
+L1 cache hit based test:        +12.1369 cycles
+Memory load based test:         +559.1066 cycles
+
+So let's say we have a ping flood coming at
+(14014 packets transmitted, 14014 received, 0% packet loss, time 1826ms)
+7674 packets per second. If we put 2 tracepoints for irq entry/exit, it
+brings us to 15348 tracepoint sites executed per second.
+
+(15348 exec/s) * (559 cycles/exec) / (3G cycles/s) = 0.0029
+We therefore have a 0.29% slowdown just on this case.
+
+Compared to this, the instruction stream based test will cause a
+slowdown of:
+
+(15348 exec/s) * (7.84 cycles/exec) / (3G cycles/s) = 0.00004
+For a 0.004% slowdown.
+
+If we plan to use this for memory allocation, spinlock, and all sorts of
+very high event rate tracing, we can assume it will execute 10 to 100
+times more sites per second, which brings us to 0.4% slowdown with the
+instruction stream based test compared to 29% slowdown with the memory
+load based test on a system with high memory pressure.
+
+
+
+  * Tracepoint impact under heavy memory load
+
+Running a kernel with my LTTng instrumentation set, in a test that
+generates memory pressure (from userspace) by trashing L1 and L2 caches
+between calls to getppid() (note: syscall_trace is active and calls
+a tracepoint upon syscall entry and syscall exit; tracepoints are disarmed).
+This test is done in user-space, so there are some delays due to IRQs
+coming and to the scheduler. (UP 2.6.22-rc6-mm1 kernel, task with -20
+nice level)
+
+My first set of results, linear cache trashing, turned out not to be
+very interesting, because it seems like the linearity of the memset on a
+full array is somehow detected and it does not "really" trash the
+caches.
+
+Now the most interesting result: Random walk L1 and L2 trashing
+surrounding a getppid() call.
+
+- Tracepoints compiled out (but syscall_trace execution forced)
+number of tests: 10000
+No memory pressure
+Reading timestamps takes 108.033 cycles
+getppid: 1681.4 cycles
+With memory pressure
+Reading timestamps takes 102.938 cycles
+getppid: 15691.6 cycles
+
+
+- With the immediate values based tracepoints:
+number of tests: 10000
+No memory pressure
+Reading timestamps takes 108.006 cycles
+getppid: 1681.84 cycles
+With memory pressure
+Reading timestamps takes 100.291 cycles
+getppid: 11793 cycles
+
+
+- With global variables based tracepoints:
+number of tests: 10000
+No memory pressure
+Reading timestamps takes 107.999 cycles
+getppid: 1669.06 cycles
+With memory pressure
+Reading timestamps takes 102.839 cycles
+getppid: 12535 cycles
+
+The result is quite interesting in that the kernel is slower without
+tracepoints than with tracepoints. I explain it by the fact that the data
+accessed is not laid out in the same manner in the cache lines when the
+tracepoints are compiled in or out. It seems that it aligns the function's
+data better to compile-in the tracepoints in this case.
+
+But since the interesting comparison is between the immediate values and
+global variables based tracepoints, and because they share the same memory
+layout, except for the movl being replaced by a movz, we see that the
+global variable based tracepoints (2 tracepoints) add 742 cycles to each system
+call (syscall entry and exit are traced and memory locations for both
+global variables lie on the same cache line).
+
+
+- Test redone with fewer iterations, but with error estimates
+
+10 runs of 100 iterations each: Tests done on a 3GHz P4. Here I run getppid with
+syscall trace inactive, comparing the case with memory pressure and without
+memory pressure. (sorry, my system is not set up to execute syscall_trace this
+time, but it will make the point anyway).
+
+No memory pressure
+Reading timestamps:     150.92 cycles,     std dev.    1.01 cycles
+getppid:               1462.09 cycles,     std dev.   18.87 cycles
+
+With memory pressure
+Reading timestamps:     578.22 cycles,     std dev.  269.51 cycles
+getppid:              17113.33 cycles,     std dev. 1655.92 cycles
+
+
+Now for memory read timing: (10 runs, branches per test: 100000)
+Memory read based branch:
+                       644.09 cycles,      std dev.   11.39 cycles
+L1 cache hit based branch:
+                        88.16 cycles,      std dev.    1.35 cycles
+
+
+So, now that we have the raw results, let's calculate:
+
+Memory read:
+644.09 +/- 11.39 - 88.16 +/- 1.35 = 555.93 +/- 11.46 cycles
+
+Getppid without memory pressure:
+1462.09 +/- 18.87 - 150.92 +/- 1.01 = 1311.17 +/- 18.90 cycles
+
+Getppid with memory pressure:
+17113.33 +/- 1655.92 - 578.22 +/- 269.51 = 16535.11 +/- 1677.71 cycles
+
+Therefore, if we add 2 tracepoints not based on immediate values to the getppid
+code, which would add 2 memory reads, we would add
+2 * (555.93 +/- 11.46) = 1111.86 +/- 22.92 cycles
+
+Therefore,
+
+(1111.86 +/- 22.92) / (16535.11 +/- 1677.71) = 0.0672
+ relative error: sqrt(((22.92/1111.86)^2)+((1677.71/16535.11)^2))
+                     = 0.1035
+ absolute error: 0.1035 * 0.0672 = 0.0070
+
+Therefore: 0.0672 +/- 0.0070 * 100% = 6.72 +/- 0.70 %
+
+We can therefore affirm that adding 2 tracepoints to getppid, on a system with
+high memory pressure, would have a performance hit of at least 6.0% on the
+system call time, all within the uncertainty limits of these tests. The same
+applies to other kernel code paths. The smaller those code paths are, the
+higher the impact ratio will be.
+
+Therefore, not only is it interesting to use the immediate values to dynamically
+activate dormant code such as the tracepoints, but I think it should also be
+considered as a replacement for many of the "read-mostly" static variables.

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* [patch 10/12] Immediate Values Support init
  2009-09-24 13:26 [patch 00/12] Immediate Values Mathieu Desnoyers
                   ` (8 preceding siblings ...)
  2009-09-24 13:26 ` [patch 09/12] Immediate Values - Documentation Mathieu Desnoyers
@ 2009-09-24 13:26 ` Mathieu Desnoyers
  2009-09-24 15:33   ` [patch 10.1/12] Immediate values fixes for modules Mathieu Desnoyers
  2009-09-24 15:35   ` [patch 10.2/12] Fix Immediate Values x86_64 support old gcc Mathieu Desnoyers
  2009-09-24 13:26 ` [patch 11/12] Scheduler Profiling - Use Immediate Values Mathieu Desnoyers
  2009-09-24 13:26 ` [patch 12/12] Tracepoints - " Mathieu Desnoyers
  11 siblings, 2 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2009-09-24 13:26 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Rusty Russell, Frank Ch. Eigler, KOSAKI Motohiro

[-- Attachment #1: immediate-values-support-init.patch --]
[-- Type: text/plain, Size: 9923 bytes --]

Supports placing immediate values in init code

We need to put the immediate values in RW data section so we can edit them
before init section unload.

This code replaces the pointers that reference init code with NULL before the
init sections are freed, both in the core kernel and in modules.
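
A minimal userspace sketch of this idea follows. It models the table
walk done by imv_unref() and the skip test added to imv_update_range();
the demo_ names and the plain integer addresses are hypothetical, and
the real update step (patching the instruction) is reduced to counting
the entries that would be patched.

```c
#include <assert.h>
#include <stddef.h>

struct demo_imv {
	unsigned long var;	/* address of the backing variable */
	unsigned long imv;	/* address of the patched operand, 0 if gone */
	unsigned char size;
};

/* Clear every patch site lying in [start, start + size). */
static void demo_imv_unref(struct demo_imv *begin, struct demo_imv *end,
			   unsigned long start, unsigned long size)
{
	struct demo_imv *iter;

	assert(begin <= end);
	for (iter = begin; iter < end; iter++)
		if (iter->imv >= start && iter->imv < start + size)
			iter->imv = 0UL;
}

/* Count the entries an update pass would still patch. */
static size_t demo_imv_live(const struct demo_imv *begin,
			    const struct demo_imv *end)
{
	const struct demo_imv *iter;
	size_t live = 0;

	for (iter = begin; iter < end; iter++) {
		if (!iter->imv)	/* skip removed __init immediate values */
			continue;
		live++;
	}
	return live;
}
```

The table has to be writable for this to work, which is why the __imv
section flags change from "a" to "aw" in this patch.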

TODO : support __exit section.

Changelog:
- Fix !CONFIG_IMMEDIATE
- Folded immediate-values-support-init-kerneldoc-fix

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 Documentation/immediate.txt          |    8 ++++----
 arch/powerpc/include/asm/immediate.h |    4 ++--
 arch/x86/include/asm/immediate.h     |    6 +++---
 include/asm-generic/vmlinux.lds.h    |    6 +++---
 include/linux/immediate.h            |    4 ++++
 include/linux/module.h               |    2 +-
 init/main.c                          |    1 +
 kernel/immediate.c                   |   34 ++++++++++++++++++++++++++++++++--
 kernel/module.c                      |    4 ++++
 9 files changed, 54 insertions(+), 15 deletions(-)

Index: linux.trees.git/kernel/immediate.c
===================================================================
--- linux.trees.git.orig/kernel/immediate.c	2009-09-24 08:58:28.000000000 -0400
+++ linux.trees.git/kernel/immediate.c	2009-09-24 09:02:06.000000000 -0400
@@ -22,6 +22,7 @@
 #include <linux/cpu.h>
 #include <linux/stop_machine.h>
 
+#include <asm/sections.h>
 #include <asm/cacheflush.h>
 #include <asm/atomic.h>
 
@@ -32,8 +33,8 @@ static int imv_early_boot_complete;
 static atomic_t stop_machine_first;
 static int wrote_text;
 
-extern const struct __imv __start___imv[];
-extern const struct __imv __stop___imv[];
+extern struct __imv __start___imv[];
+extern struct __imv __stop___imv[];
 
 static int stop_machine_imv_update(void *imv_ptr)
 {
@@ -124,6 +125,8 @@ void imv_update_range(const struct __imv
 
 	mutex_lock(&imv_mutex);
 	for (iter = begin; iter < end; iter++) {
+		if (!iter->imv) /* Skip removed __init immediate values */
+			continue;
 		ret = apply_imv_update(iter);
 		if (imv_early_boot_complete && ret)
 			printk(KERN_WARNING
@@ -149,6 +152,33 @@ void core_imv_update(void)
 }
 EXPORT_SYMBOL_GPL(core_imv_update);
 
+/**
+ * imv_unref
+ * @begin: pointer to the beginning of the range
+ * @end: pointer to the end of the range
+ * @start: beginning of the region to consider
+ * @size: size of the region to consider
+ *
+ * Deactivate any immediate value reference pointing into the code region in the
+ * range start to start + size.
+ */
+void imv_unref(struct __imv *begin, struct __imv *end, void *start,
+		unsigned long size)
+{
+	struct __imv *iter;
+
+	for (iter = begin; iter < end; iter++)
+		if (iter->imv >= (unsigned long)start
+			&& iter->imv < (unsigned long)start + size)
+			iter->imv = 0UL;
+}
+
+void imv_unref_core_init(void)
+{
+	imv_unref(__start___imv, __stop___imv, __init_begin,
+		(unsigned long)__init_end - (unsigned long)__init_begin);
+}
+
 void __init imv_init_complete(void)
 {
 	imv_early_boot_complete = 1;
Index: linux.trees.git/kernel/module.c
===================================================================
--- linux.trees.git.orig/kernel/module.c	2009-09-24 08:58:28.000000000 -0400
+++ linux.trees.git/kernel/module.c	2009-09-24 09:02:50.000000000 -0400
@@ -2491,6 +2491,10 @@ SYSCALL_DEFINE3(init_module, void __user
 	mutex_lock(&module_mutex);
 	/* Drop initial reference. */
 	module_put(mod);
+#ifdef CONFIG_IMMEDIATE
+	imv_unref(mod->immediate, mod->immediate + mod->num_immediate,
+		mod->module_init, mod->init_size);
+#endif
 	trim_init_extable(mod);
 	module_free(mod, mod->module_init);
 	mod->module_init = NULL;
Index: linux.trees.git/include/linux/module.h
===================================================================
--- linux.trees.git.orig/include/linux/module.h	2009-09-24 08:59:42.000000000 -0400
+++ linux.trees.git/include/linux/module.h	2009-09-24 09:02:06.000000000 -0400
@@ -332,7 +332,7 @@ struct module
 	unsigned int num_tracepoints;
 #endif
 #ifdef CONFIG_IMMEDIATE
-	const struct __imv *immediate;
+	struct __imv *immediate;
 	unsigned int num_immediate;
 #endif
 
Index: linux.trees.git/include/linux/immediate.h
===================================================================
--- linux.trees.git.orig/include/linux/immediate.h	2009-09-24 08:58:28.000000000 -0400
+++ linux.trees.git/include/linux/immediate.h	2009-09-24 09:02:06.000000000 -0400
@@ -46,6 +46,9 @@ struct __imv {
 extern void core_imv_update(void);
 extern void imv_update_range(const struct __imv *begin,
 	const struct __imv *end);
+extern void imv_unref_core_init(void);
+extern void imv_unref(struct __imv *begin, struct __imv *end, void *start,
+		unsigned long size);
 
 #else
 
@@ -73,6 +76,7 @@ extern void imv_update_range(const struc
 
 static inline void core_imv_update(void) { }
 static inline void module_imv_update(void) { }
+static inline void imv_unref_core_init(void) { }
 
 #endif
 
Index: linux.trees.git/arch/x86/include/asm/immediate.h
===================================================================
--- linux.trees.git.orig/arch/x86/include/asm/immediate.h	2009-09-24 09:00:27.000000000 -0400
+++ linux.trees.git/arch/x86/include/asm/immediate.h	2009-09-24 09:02:06.000000000 -0400
@@ -33,7 +33,7 @@
 		BUILD_BUG_ON(sizeof(value) > 8);			\
 		switch (sizeof(value)) {				\
 		case 1:							\
-			asm(".section __imv,\"a\",@progbits\n\t"	\
+			asm(".section __imv,\"aw\",@progbits\n\t"	\
 				_ASM_PTR "%c1, (3f)-%c2\n\t"		\
 				".byte %c2\n\t"				\
 				".previous\n\t"				\
@@ -45,7 +45,7 @@
 			break;						\
 		case 2:							\
 		case 4:							\
-			asm(".section __imv,\"a\",@progbits\n\t"	\
+			asm(".section __imv,\"aw\",@progbits\n\t"	\
 				_ASM_PTR "%c1, (3f)-%c2\n\t"		\
 				".byte %c2\n\t"				\
 				".previous\n\t"				\
@@ -60,7 +60,7 @@
 				value = name##__imv;			\
 				break;					\
 			}						\
-			asm(".section __imv,\"a\",@progbits\n\t"	\
+			asm(".section __imv,\"aw\",@progbits\n\t"	\
 				_ASM_PTR "%c1, (3f)-%c2\n\t"		\
 				".byte %c2\n\t"				\
 				".previous\n\t"				\
Index: linux.trees.git/arch/powerpc/include/asm/immediate.h
===================================================================
--- linux.trees.git.orig/arch/powerpc/include/asm/immediate.h	2009-09-24 09:00:33.000000000 -0400
+++ linux.trees.git/arch/powerpc/include/asm/immediate.h	2009-09-24 09:02:06.000000000 -0400
@@ -26,7 +26,7 @@
 		BUILD_BUG_ON(sizeof(value) > 8);			\
 		switch (sizeof(value)) {				\
 		case 1:							\
-			asm(".section __imv,\"a\",@progbits\n\t"	\
+			asm(".section __imv,\"aw\",@progbits\n\t"	\
 					PPC_LONG "%c1, ((1f)-1)\n\t"	\
 					".byte 1\n\t"			\
 					".previous\n\t"			\
@@ -36,7 +36,7 @@
 				: "i" (&name##__imv));			\
 			break;						\
 		case 2:							\
-			asm(".section __imv,\"a\",@progbits\n\t"	\
+			asm(".section __imv,\"aw\",@progbits\n\t"	\
 					PPC_LONG "%c1, ((1f)-2)\n\t"	\
 					".byte 2\n\t"			\
 					".previous\n\t"			\
Index: linux.trees.git/Documentation/immediate.txt
===================================================================
--- linux.trees.git.orig/Documentation/immediate.txt	2009-09-24 09:01:58.000000000 -0400
+++ linux.trees.git/Documentation/immediate.txt	2009-09-24 09:02:06.000000000 -0400
@@ -42,10 +42,10 @@ The immediate mechanism supports inserti
 immediate. Immediate values can be put in inline functions, inlined static
 functions, and unrolled loops.
 
-If you have to read the immediate values from a function declared as __init or
-__exit, you should explicitly use _imv_read(), which will fall back on a
-global variable read. Failing to do so will leave a reference to the __init
-section after it is freed (it would generate a modpost warning).
+If you have to read an immediate value from a function declared as __exit, you
+should explicitly use _imv_read(), which falls back on a global variable read.
+Failing to do so leaves a reference to the __exit section in kernels built
+without module unload support. imv_read() in the __init section is supported.
 
 You can choose to set an initial static value to the immediate by using, for
 instance:
Index: linux.trees.git/include/asm-generic/vmlinux.lds.h
===================================================================
--- linux.trees.git.orig/include/asm-generic/vmlinux.lds.h	2009-09-24 08:58:28.000000000 -0400
+++ linux.trees.git/include/asm-generic/vmlinux.lds.h	2009-09-24 09:02:06.000000000 -0400
@@ -154,6 +154,9 @@
 	VMLINUX_SYMBOL(__start___tracepoints) = .;			\
 	*(__tracepoints)						\
 	VMLINUX_SYMBOL(__stop___tracepoints) = .;			\
+	VMLINUX_SYMBOL(__start___imv) = .;				\
+	*(__imv)		/* Immediate values: pointers */	\
+	VMLINUX_SYMBOL(__stop___imv) = .;				\
 	/* implement dynamic printk debug */				\
 	. = ALIGN(8);							\
 	VMLINUX_SYMBOL(__start___verbose) = .;                          \
@@ -202,9 +205,6 @@
 		*(__vermagic)		/* Kernel version magic */	\
 		*(__markers_strings)	/* Markers: strings */		\
 		*(__tracepoints_strings)/* Tracepoints: strings */	\
-		VMLINUX_SYMBOL(__start___imv) = .;			\
-		*(__imv)		/* Immediate values: pointers */ \
-		VMLINUX_SYMBOL(__stop___imv) = .;			\
 	}								\
 									\
 	.rodata1          : AT(ADDR(.rodata1) - LOAD_OFFSET) {		\
Index: linux.trees.git/init/main.c
===================================================================
--- linux.trees.git.orig/init/main.c	2009-09-24 08:58:28.000000000 -0400
+++ linux.trees.git/init/main.c	2009-09-24 09:02:06.000000000 -0400
@@ -822,6 +822,7 @@ static noinline int init_post(void)
 {
 	/* need to finish all async __init code before freeing the memory */
 	async_synchronize_full();
+	imv_unref_core_init();
 	free_initmem();
 	unlock_kernel();
 	mark_rodata_ro();

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch 11/12] Scheduler Profiling - Use Immediate Values
  2009-09-24 13:26 [patch 00/12] Immediate Values Mathieu Desnoyers
                   ` (9 preceding siblings ...)
  2009-09-24 13:26 ` [patch 10/12] Immediate Values Support init Mathieu Desnoyers
@ 2009-09-24 13:26 ` Mathieu Desnoyers
  2009-09-24 13:26 ` [patch 12/12] Tracepoints - " Mathieu Desnoyers
  11 siblings, 0 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2009-09-24 13:26 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Rusty Russell, Adrian Bunk, Andi Kleen,
	Christoph Hellwig, akpm

[-- Attachment #1: scheduler-profiling-use-immediate-values.patch --]
[-- Type: text/plain, Size: 6832 bytes --]

Use immediate values as the condition guarding the scheduler profiling call.
In the optimized version, this check no longer costs a d-cache access.
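
As a rough userspace sketch of the pattern (not the kernel code itself): the
value lives in a name##__imv backing variable, and the fast path tests it
before calling the slow path. The DEFINE_IMV(), imv_read() and imv_set()
definitions below are assumed stand-ins modelled on the non-patching fallback;
the profiling constants, the hits counter and the simplified profile_hits()
are hypothetical.

```c
/* Assumed stand-ins for the generic (non-patching) fallback macros. */
#define DEFINE_IMV(type, name)	type name##__imv
#define imv_read(name)		(name##__imv)
#define imv_set(name, i)	(name##__imv = (i))

/* Illustrative values only; the real constants live in linux/profile.h. */
#define CPU_PROFILING	1
#define SCHED_PROFILING	2

static DEFINE_IMV(char, prof_on);
static unsigned long hits;	/* hypothetical hit counter */

/* Slow path: simplified, it only accumulates the number of hits. */
static void profile_hits(int type, unsigned long nr_hits)
{
	(void)type;
	hits += nr_hits;
}

/* Fast path: a single test of the immediate value guards the call. */
static inline void profile_hit(int type)
{
	if (imv_read(prof_on) == type)
		profile_hits(type, 1);
}
```

With the patching backend, the comparison operand would be encoded in the
instruction stream instead of being loaded from prof_on__imv; the calling
code stays the same.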

Changelog :
- Use imv_* instead of immediate_*.
- Follow the white rabbit : kvm_main.c which becomes x86.c.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Adrian Bunk <bunk@stusta.de>
CC: Andi Kleen <andi@firstfloor.org>
CC: Christoph Hellwig <hch@infradead.org>
CC: mingo@elte.hu
CC: akpm@osdl.org
---
 arch/x86/kvm/x86.c      |    2 +-
 include/linux/profile.h |    5 +++--
 kernel/ksysfs.c         |    4 ++--
 kernel/profile.c        |   22 +++++++++++-----------
 kernel/sched_fair.c     |    7 ++-----
 5 files changed, 19 insertions(+), 21 deletions(-)

Index: linux.trees.git/kernel/profile.c
===================================================================
--- linux.trees.git.orig/kernel/profile.c	2009-09-24 08:52:55.000000000 -0400
+++ linux.trees.git/kernel/profile.c	2009-09-24 09:03:02.000000000 -0400
@@ -42,8 +42,8 @@ static int (*timer_hook)(struct pt_regs 
 static atomic_t *prof_buffer;
 static unsigned long prof_len, prof_shift;
 
-int prof_on __read_mostly;
-EXPORT_SYMBOL_GPL(prof_on);
+DEFINE_IMV(char, prof_on) __read_mostly;
+EXPORT_IMV_SYMBOL_GPL(prof_on);
 
 static cpumask_var_t prof_cpu_mask;
 #ifdef CONFIG_SMP
@@ -61,7 +61,7 @@ int profile_setup(char *str)
 
 	if (!strncmp(str, sleepstr, strlen(sleepstr))) {
 #ifdef CONFIG_SCHEDSTATS
-		prof_on = SLEEP_PROFILING;
+		imv_set(prof_on, SLEEP_PROFILING);
 		if (str[strlen(sleepstr)] == ',')
 			str += strlen(sleepstr) + 1;
 		if (get_option(&str, &par))
@@ -74,7 +74,7 @@ int profile_setup(char *str)
 			"kernel sleep profiling requires CONFIG_SCHEDSTATS\n");
 #endif /* CONFIG_SCHEDSTATS */
 	} else if (!strncmp(str, schedstr, strlen(schedstr))) {
-		prof_on = SCHED_PROFILING;
+		imv_set(prof_on, SCHED_PROFILING);
 		if (str[strlen(schedstr)] == ',')
 			str += strlen(schedstr) + 1;
 		if (get_option(&str, &par))
@@ -83,7 +83,7 @@ int profile_setup(char *str)
 			"kernel schedule profiling enabled (shift: %ld)\n",
 			prof_shift);
 	} else if (!strncmp(str, kvmstr, strlen(kvmstr))) {
-		prof_on = KVM_PROFILING;
+		imv_set(prof_on, KVM_PROFILING);
 		if (str[strlen(kvmstr)] == ',')
 			str += strlen(kvmstr) + 1;
 		if (get_option(&str, &par))
@@ -93,7 +93,7 @@ int profile_setup(char *str)
 			prof_shift);
 	} else if (get_option(&str, &par)) {
 		prof_shift = par;
-		prof_on = CPU_PROFILING;
+		imv_set(prof_on, CPU_PROFILING);
 		printk(KERN_INFO "kernel profiling enabled (shift: %ld)\n",
 			prof_shift);
 	}
@@ -105,7 +105,7 @@ __setup("profile=", profile_setup);
 int __ref profile_init(void)
 {
 	int buffer_bytes;
-	if (!prof_on)
+	if (!_imv_read(prof_on))
 		return 0;
 
 	/* only text is profiled */
@@ -309,7 +309,7 @@ void profile_hits(int type, void *__pc, 
 	int i, j, cpu;
 	struct profile_hit *hits;
 
-	if (prof_on != type || !prof_buffer)
+	if (!prof_buffer)
 		return;
 	pc = min((pc - (unsigned long)_stext) >> prof_shift, prof_len - 1);
 	i = primary = (pc & (NR_PROFILE_GRP - 1)) << PROFILE_GRPSHIFT;
@@ -421,7 +421,7 @@ void profile_hits(int type, void *__pc, 
 {
 	unsigned long pc;
 
-	if (prof_on != type || !prof_buffer)
+	if (!prof_buffer)
 		return;
 	pc = ((unsigned long)__pc - (unsigned long)_stext) >> prof_shift;
 	atomic_add(nr_hits, &prof_buffer[min(pc, prof_len - 1)]);
@@ -585,7 +585,7 @@ static int create_hash_tables(void)
 	}
 	return 0;
 out_cleanup:
-	prof_on = 0;
+	imv_set(prof_on, 0);
 	smp_mb();
 	on_each_cpu(profile_nop, NULL, 1);
 	for_each_online_cpu(cpu) {
@@ -612,7 +612,7 @@ int __ref create_proc_profile(void) /* f
 {
 	struct proc_dir_entry *entry;
 
-	if (!prof_on)
+	if (!_imv_read(prof_on))
 		return 0;
 	if (create_hash_tables())
 		return -ENOMEM;
Index: linux.trees.git/include/linux/profile.h
===================================================================
--- linux.trees.git.orig/include/linux/profile.h	2008-10-30 20:22:52.000000000 -0400
+++ linux.trees.git/include/linux/profile.h	2009-09-24 09:03:02.000000000 -0400
@@ -5,6 +5,7 @@
 #include <linux/init.h>
 #include <linux/cpumask.h>
 #include <linux/cache.h>
+#include <linux/immediate.h>
 
 #include <asm/errno.h>
 
@@ -38,7 +39,7 @@ enum profile_type {
 
 #ifdef CONFIG_PROFILING
 
-extern int prof_on __read_mostly;
+DECLARE_IMV(char, prof_on) __read_mostly;
 
 /* init basic kernel profiler */
 int profile_init(void);
@@ -58,7 +59,7 @@ static inline void profile_hit(int type,
 	/*
 	 * Speedup for the common (no profiling enabled) case:
 	 */
-	if (unlikely(prof_on == type))
+	if (unlikely(imv_read(prof_on) == type))
 		profile_hits(type, ip, 1);
 }
 
Index: linux.trees.git/kernel/sched_fair.c
===================================================================
--- linux.trees.git.orig/kernel/sched_fair.c	2009-09-24 08:52:55.000000000 -0400
+++ linux.trees.git/kernel/sched_fair.c	2009-09-24 09:03:56.000000000 -0400
@@ -672,11 +672,8 @@ static void enqueue_sleeper(struct cfs_r
 			 * 20 to get a milliseconds-range estimation of the
 			 * amount of time that the task spent sleeping:
 			 */
-			if (unlikely(prof_on == SLEEP_PROFILING)) {
-				profile_hits(SLEEP_PROFILING,
-						(void *)get_wchan(tsk),
-						delta >> 20);
-			}
+			profile_hits(SLEEP_PROFILING, (void *)get_wchan(tsk),
+				     delta >> 20);
 			account_scheduler_latency(tsk, delta >> 10, 0);
 		}
 	}
Index: linux.trees.git/arch/x86/kvm/x86.c
===================================================================
--- linux.trees.git.orig/arch/x86/kvm/x86.c	2009-09-24 08:52:42.000000000 -0400
+++ linux.trees.git/arch/x86/kvm/x86.c	2009-09-24 09:03:02.000000000 -0400
@@ -3672,7 +3672,7 @@ static int vcpu_enter_guest(struct kvm_v
 	/*
 	 * Profile KVM exit RIPs:
 	 */
-	if (unlikely(prof_on == KVM_PROFILING)) {
+	if (unlikely(imv_read(prof_on) == KVM_PROFILING)) {
 		unsigned long rip = kvm_rip_read(vcpu);
 		profile_hit(KVM_PROFILING, (void *)rip);
 	}
Index: linux.trees.git/kernel/ksysfs.c
===================================================================
--- linux.trees.git.orig/kernel/ksysfs.c	2009-03-06 13:29:01.000000000 -0500
+++ linux.trees.git/kernel/ksysfs.c	2009-09-24 09:03:02.000000000 -0400
@@ -58,7 +58,7 @@ KERNEL_ATTR_RW(uevent_helper);
 static ssize_t profiling_show(struct kobject *kobj,
 				  struct kobj_attribute *attr, char *buf)
 {
-	return sprintf(buf, "%d\n", prof_on);
+	return sprintf(buf, "%d\n", _imv_read(prof_on));
 }
 static ssize_t profiling_store(struct kobject *kobj,
 				   struct kobj_attribute *attr,
@@ -66,7 +66,7 @@ static ssize_t profiling_store(struct ko
 {
 	int ret;
 
-	if (prof_on)
+	if (_imv_read(prof_on))
 		return -EEXIST;
 	/*
 	 * This eventually calls into get_option() which

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch 12/12] Tracepoints - Immediate Values
  2009-09-24 13:26 [patch 00/12] Immediate Values Mathieu Desnoyers
                   ` (10 preceding siblings ...)
  2009-09-24 13:26 ` [patch 11/12] Scheduler Profiling - Use Immediate Values Mathieu Desnoyers
@ 2009-09-24 13:26 ` Mathieu Desnoyers
  2009-09-24 14:51   ` Peter Zijlstra
  11 siblings, 1 reply; 34+ messages in thread
From: Mathieu Desnoyers @ 2009-09-24 13:26 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: Mathieu Desnoyers, Peter Zijlstra, Frank Ch. Eigler, Hideo AOKI,
	Takashi Nishiie, Steven Rostedt

[-- Attachment #1: tracepoints-immediate-values.patch --]
[-- Type: text/plain, Size: 5047 bytes --]

Use immediate values in tracepoints.
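
A simplified userspace model of the state handling may help when reading the
diff below. It assumes that DEFINE_IMV(char, state) declares a state__imv
member; the struct is cut down to the fields used here, and the counters and
callbacks are hypothetical.

```c
/* Cut-down stand-in for struct tracepoint; only the fields used here. */
struct tracepoint {
	const char *name;
	char state__imv;	/* assumed expansion of DEFINE_IMV(char, state) */
	void (*regfunc)(void);
	void (*unregfunc)(void);
};

/* Hypothetical counters so the callbacks have an observable effect. */
static int reg_calls, unreg_calls;
static void count_reg(void)   { reg_calls++; }
static void count_unreg(void) { unreg_calls++; }

/* Mirrors the state transition logic of set_tracepoint(). */
static void set_tracepoint_state(struct tracepoint *elem, int active)
{
	if (elem->regfunc && !elem->state__imv && active)
		elem->regfunc();	/* inactive -> active */
	else if (elem->unregfunc && elem->state__imv && !active)
		elem->unregfunc();	/* active -> inactive */

	/* Only the backing variable changes here. */
	elem->state__imv = (char)active;
}
```

In the patch, writing state__imv updates only the backing variable;
tracepoint_update_probes() then calls core_imv_update() and
module_imv_update() so that the patched sites pick up the new value.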

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
---
 include/linux/tracepoint.h |   33 +++++++++++++++++++++++++++------
 kernel/tracepoint.c        |   14 +++++++++-----
 2 files changed, 36 insertions(+), 11 deletions(-)

Index: linux.trees.git/include/linux/tracepoint.h
===================================================================
--- linux.trees.git.orig/include/linux/tracepoint.h	2009-09-24 08:52:55.000000000 -0400
+++ linux.trees.git/include/linux/tracepoint.h	2009-09-24 09:05:29.000000000 -0400
@@ -14,6 +14,7 @@
  * See the file COPYING for more details.
  */
 
+#include <linux/immediate.h>
 #include <linux/types.h>
 #include <linux/rcupdate.h>
 
@@ -22,7 +23,7 @@ struct tracepoint;
 
 struct tracepoint {
 	const char *name;		/* Tracepoint name */
-	int state;			/* State. */
+	DEFINE_IMV(char, state);	/* State. */
 	void (*regfunc)(void);
 	void (*unregfunc)(void);
 	void **funcs;
@@ -58,18 +59,38 @@ struct tracepoint {
 		rcu_read_unlock_sched_notrace();			\
 	} while (0)
 
+#define __CHECK_TRACE(name, generic, proto, args)			\
+	do {								\
+		if (!generic) {						\
+			if (unlikely(imv_read(__tracepoint_##name.state))) \
+				__DO_TRACE(&__tracepoint_##name,	\
+					TP_PROTO(proto), TP_ARGS(args));\
+		} else {						\
+			if (unlikely(_imv_read(__tracepoint_##name.state))) \
+				__DO_TRACE(&__tracepoint_##name,	\
+					TP_PROTO(proto), TP_ARGS(args));\
+		}							\
+	} while (0)
+
 /*
  * Make sure the alignment of the structure in the __tracepoints section will
  * not add unwanted padding between the beginning of the section and the
  * structure. Force alignment to the same alignment as the section start.
+ *
+ * The "generic" argument that trace_##name() and _trace_##name() pass to
+ * __CHECK_TRACE() controls which tracepoint enabling mechanism is used.
+ * If generic is true, a variable read is used.
+ * If generic is false, immediate values are used.
  */
 #define DECLARE_TRACE(name, proto, args)				\
 	extern struct tracepoint __tracepoint_##name;			\
 	static inline void trace_##name(proto)				\
 	{								\
-		if (unlikely(__tracepoint_##name.state))		\
-			__DO_TRACE(&__tracepoint_##name,		\
-				TP_PROTO(proto), TP_ARGS(args));	\
+		__CHECK_TRACE(name, 0, TP_PROTO(proto), TP_ARGS(args));	\
+	}								\
+	static inline void _trace_##name(proto)				\
+	{								\
+		__CHECK_TRACE(name, 1, TP_PROTO(proto), TP_ARGS(args));	\
 	}								\
 	static inline int register_trace_##name(void (*probe)(proto))	\
 	{								\
@@ -101,10 +122,10 @@ extern void tracepoint_update_probe_rang
 
 #else /* !CONFIG_TRACEPOINTS */
 #define DECLARE_TRACE(name, proto, args)				\
-	static inline void _do_trace_##name(struct tracepoint *tp, proto) \
-	{ }								\
 	static inline void trace_##name(proto)				\
 	{ }								\
+	static inline void _trace_##name(proto)				\
+	{ }								\
 	static inline int register_trace_##name(void (*probe)(proto))	\
 	{								\
 		return -ENOSYS;						\
Index: linux.trees.git/kernel/tracepoint.c
===================================================================
--- linux.trees.git.orig/kernel/tracepoint.c	2009-09-24 08:52:55.000000000 -0400
+++ linux.trees.git/kernel/tracepoint.c	2009-09-24 09:11:13.000000000 -0400
@@ -25,6 +25,7 @@
 #include <linux/err.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/immediate.h>
 
 extern struct tracepoint __start___tracepoints[];
 extern struct tracepoint __stop___tracepoints[];
@@ -243,9 +244,9 @@ static void set_tracepoint(struct tracep
 {
 	WARN_ON(strcmp((*entry)->name, elem->name) != 0);
 
-	if (elem->regfunc && !elem->state && active)
+	if (elem->regfunc && !_imv_read(elem->state) && active)
 		elem->regfunc();
-	else if (elem->unregfunc && elem->state && !active)
+	else if (elem->unregfunc && _imv_read(elem->state) && !active)
 		elem->unregfunc();
 
 	/*
@@ -256,7 +257,7 @@ static void set_tracepoint(struct tracep
 	 * is used.
 	 */
 	rcu_assign_pointer(elem->funcs, (*entry)->funcs);
-	elem->state = active;
+	elem->state__imv = active;
 }
 
 /*
@@ -267,10 +268,10 @@ static void set_tracepoint(struct tracep
  */
 static void disable_tracepoint(struct tracepoint *elem)
 {
-	if (elem->unregfunc && elem->state)
+	if (elem->unregfunc && _imv_read(elem->state))
 		elem->unregfunc();
 
-	elem->state = 0;
+	elem->state__imv = 0;
 	rcu_assign_pointer(elem->funcs, NULL);
 }
 
@@ -313,6 +314,9 @@ static void tracepoint_update_probes(voi
 		__stop___tracepoints);
 	/* tracepoints in modules. */
 	module_update_tracepoints();
+	/* Update immediate values */
+	core_imv_update();
+	module_imv_update();
 }
 
 static void *tracepoint_add_probe(const char *name, void *probe)

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [patch 12/12] Tracepoints - Immediate Values
  2009-09-24 13:26 ` [patch 12/12] Tracepoints - " Mathieu Desnoyers
@ 2009-09-24 14:51   ` Peter Zijlstra
  2009-09-24 15:03     ` Mathieu Desnoyers
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2009-09-24 14:51 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, linux-kernel, Frank Ch. Eigler, Hideo AOKI,
	Takashi Nishiie, Steven Rostedt

On Thu, 2009-09-24 at 09:26 -0400, Mathieu Desnoyers wrote:
> plain text document attachment (tracepoints-immediate-values.patch)
> Use immediate values in tracepoints.

I might have missed it, but did both the Intel and AMD cpu folks clear
the SMP code rewrite bits?



* Re: [patch 12/12] Tracepoints - Immediate Values
  2009-09-24 14:51   ` Peter Zijlstra
@ 2009-09-24 15:03     ` Mathieu Desnoyers
  2009-09-24 15:06       ` Peter Zijlstra
  0 siblings, 1 reply; 34+ messages in thread
From: Mathieu Desnoyers @ 2009-09-24 15:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frank Ch. Eigler, Hideo AOKI,
	Takashi Nishiie, Steven Rostedt, Masami Hiramatsu

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Thu, 2009-09-24 at 09:26 -0400, Mathieu Desnoyers wrote:
> > plain text document attachment (tracepoints-immediate-values.patch)
> > Use immediate values in tracepoints.
> 
> I might have missed it, but did both the Intel and AMD cpu folks clear
> the SMP code rewrite bits?
> 

SMP handling is performed with stop_machine() in this patchset. Nothing
fancy here.

I've got other patches, not included in this patchset, which implement
NMI-safe code modification, based on a scheme using breakpoints and
IPIs, inspired by djprobes. That scheme might be worth clearing with
Intel/AMD devs before merging.

However, doing code patching within stop_machine() is pretty safe, given
all other CPUs are busy-looping with interrupts off while this happens.
Ftrace already does this.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [patch 12/12] Tracepoints - Immediate Values
  2009-09-24 15:03     ` Mathieu Desnoyers
@ 2009-09-24 15:06       ` Peter Zijlstra
  2009-09-24 16:01         ` [RFC patch] Immediate Values - x86 Optimization NMI and MCE support Mathieu Desnoyers
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2009-09-24 15:06 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, linux-kernel, Frank Ch. Eigler, Hideo AOKI,
	Takashi Nishiie, Steven Rostedt, Masami Hiramatsu

On Thu, 2009-09-24 at 11:03 -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Thu, 2009-09-24 at 09:26 -0400, Mathieu Desnoyers wrote:
> > > plain text document attachment (tracepoints-immediate-values.patch)
> > > Use immediate values in tracepoints.
> > 
> > I might have missed it, but did both the Intel and AMD cpu folks clear
> > the SMP code rewrite bits?
> > 
> 
> SMP handling is performed with stop_machine() in this patchset. Nothing
> fancy here.
> 
> I've got other patches, not included in this patchset, which implement
> NMI-safe code modification, based on a scheme using breakpoints and
> IPIs, inspired by djprobes. That scheme might be worth clearing with
> Intel/AMD devs before merging.
> 
> However, doing code patching within stop_machine() is pretty safe, given
> all other CPUs are busy-looping with interrupts off while this happens.
> Ftrace already does this.

Agreed, I missed that this relies on stop_machine(). No problem then.

Reducing reliance on stop_machine() would be good, so it would be worth
getting some CPU folks to look at your alternative implementation.

Thanks!



* Re: [patch 10.1/12] Immediate values fixes for modules
  2009-09-24 13:26 ` [patch 10/12] Immediate Values Support init Mathieu Desnoyers
@ 2009-09-24 15:33   ` Mathieu Desnoyers
  2009-09-24 15:35   ` [patch 10.2/12] Fix Immediate Values x86_64 support old gcc Mathieu Desnoyers
  1 sibling, 0 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2009-09-24 15:33 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel

Compilation fixes for immediate values when modules are disabled.

Mathieu : merged two fixes.

From: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 include/linux/immediate.h |    1 -
 include/linux/module.h    |   12 +++++++-----
 2 files changed, 7 insertions(+), 6 deletions(-)

Index: linux.trees.git/include/linux/immediate.h
===================================================================
--- linux.trees.git.orig/include/linux/immediate.h	2009-09-24 09:20:37.000000000 -0400
+++ linux.trees.git/include/linux/immediate.h	2009-09-24 10:53:51.000000000 -0400
@@ -75,7 +75,6 @@ extern void imv_unref(struct __imv *begi
 #define imv_set(name, i)		(name##__imv = (i))
 
 static inline void core_imv_update(void) { }
-static inline void module_imv_update(void) { }
 static inline void imv_unref_core_init(void) { }
 
 #endif
Index: linux.trees.git/include/linux/module.h
===================================================================
--- linux.trees.git.orig/include/linux/module.h	2009-09-24 09:20:37.000000000 -0400
+++ linux.trees.git/include/linux/module.h	2009-09-24 10:53:51.000000000 -0400
@@ -538,9 +538,6 @@ extern void print_modules(void);
 extern void module_update_tracepoints(void);
 extern int module_get_iter_tracepoints(struct tracepoint_iter *iter);
 
-extern void _module_imv_update(void);
-extern void module_imv_update(void);
-
 #else /* !CONFIG_MODULES... */
 #define EXPORT_SYMBOL(sym)
 #define EXPORT_SYMBOL_GPL(sym)
@@ -661,6 +658,12 @@ static inline int module_get_iter_tracep
 	return 0;
 }
 
+#endif /* CONFIG_MODULES */
+
+#if defined(CONFIG_MODULES) && defined(CONFIG_IMMEDIATE)
+extern void _module_imv_update(void);
+extern void module_imv_update(void);
+#else
 static inline void _module_imv_update(void)
 {
 }
@@ -668,8 +671,7 @@ static inline void _module_imv_update(vo
 static inline void module_imv_update(void)
 {
 }
-
-#endif /* CONFIG_MODULES */
+#endif
 
 struct device_driver;
 #ifdef CONFIG_SYSFS


* Re: [patch 10.2/12] Fix Immediate Values x86_64 support old gcc
  2009-09-24 13:26 ` [patch 10/12] Immediate Values Support init Mathieu Desnoyers
  2009-09-24 15:33   ` [patch 10.1/12] Immediate values fixes for modules Mathieu Desnoyers
@ 2009-09-24 15:35   ` Mathieu Desnoyers
  1 sibling, 0 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2009-09-24 15:35 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: Sam Ravnborg, H. Peter Anvin, Jeremy Fitzhardinge, David Miller,
	Paul Mackerras

GCC < 4 on x86_64 does not accept symbol+offset operands for "i" constraints
in asm statements. Fall back on a memory read in lieu of an immediate value
if such a compiler is detected.

Changelog :
- USE_IMMEDIATE must now be used in lieu of CONFIG_IMMEDIATE in Makefiles and
  in C code.
- Every architecture implementing immediate values must declare USE_IMMEDIATE
  in their Makefile.
- Tab -> spaces in Makefiles.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Ingo Molnar <mingo@elte.hu>
CC: David Miller <davem@davemloft.net>
CC: Paul Mackerras <paulus@samba.org>
---
 Makefile                  |    5 +++++
 arch/powerpc/Makefile     |    2 ++
 arch/x86/Makefile         |    5 +++++
 include/linux/immediate.h |    2 +-
 include/linux/module.h    |    2 +-
 kernel/Makefile           |    2 +-
 kernel/module.c           |    6 +++---
 7 files changed, 18 insertions(+), 6 deletions(-)

Index: linux.trees.git/arch/x86/Makefile
===================================================================
--- linux.trees.git.orig/arch/x86/Makefile	2009-09-24 08:52:41.000000000 -0400
+++ linux.trees.git/arch/x86/Makefile	2009-09-24 10:53:59.000000000 -0400
@@ -41,6 +41,7 @@ ifeq ($(CONFIG_X86_32),y)
 
         # temporary until string.h is fixed
         KBUILD_CFLAGS += -ffreestanding
+        export USE_IMMEDIATE := $(CONFIG_IMMEDIATE)
 else
         BITS := 64
         UTS_MACHINE := x86_64
@@ -70,6 +71,10 @@ else
         # this works around some issues with generating unwind tables in older gccs
         # newer gccs do it by default
         KBUILD_CFLAGS += -maccumulate-outgoing-args
+
+        # x86_64 gcc 3.x has problems with passing symbol+offset in
+        # asm "i" constraint.
+        export USE_IMMEDIATE := $(call cc-ifversion, -ge, 0400, $(CONFIG_IMMEDIATE))
 endif
 
 ifdef CONFIG_CC_STACKPROTECTOR
Index: linux.trees.git/include/linux/immediate.h
===================================================================
--- linux.trees.git.orig/include/linux/immediate.h	2009-09-24 10:53:51.000000000 -0400
+++ linux.trees.git/include/linux/immediate.h	2009-09-24 10:53:59.000000000 -0400
@@ -10,7 +10,7 @@
  * See the file COPYING for more details.
  */
 
-#ifdef CONFIG_IMMEDIATE
+#ifdef USE_IMMEDIATE
 
 struct __imv {
 	unsigned long var;	/* Pointer to the identifier variable of the
Index: linux.trees.git/kernel/Makefile
===================================================================
--- linux.trees.git.orig/kernel/Makefile	2009-09-24 09:20:19.000000000 -0400
+++ linux.trees.git/kernel/Makefile	2009-09-24 10:55:09.000000000 -0400
@@ -88,7 +88,7 @@ obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
 obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
-obj-$(CONFIG_IMMEDIATE) += immediate.o
+obj-$(USE_IMMEDIATE) += immediate.o
 obj-$(CONFIG_LATENCYTOP) += latencytop.o
 obj-$(CONFIG_FUNCTION_TRACER) += trace/
 obj-$(CONFIG_TRACING) += trace/
Index: linux.trees.git/arch/powerpc/Makefile
===================================================================
--- linux.trees.git.orig/arch/powerpc/Makefile	2009-09-24 08:52:40.000000000 -0400
+++ linux.trees.git/arch/powerpc/Makefile	2009-09-24 10:53:59.000000000 -0400
@@ -96,6 +96,8 @@ else
 LDFLAGS_MODULE	+= arch/powerpc/lib/crtsavres.o
 endif
 
+export USE_IMMEDIATE := $(CONFIG_IMMEDIATE)
+
 ifeq ($(CONFIG_TUNE_CELL),y)
 	KBUILD_CFLAGS += $(call cc-option,-mtune=cell)
 endif
Index: linux.trees.git/Makefile
===================================================================
--- linux.trees.git.orig/Makefile	2009-09-24 08:52:38.000000000 -0400
+++ linux.trees.git/Makefile	2009-09-24 10:53:59.000000000 -0400
@@ -550,6 +550,11 @@ ifdef CONFIG_FUNCTION_TRACER
 KBUILD_CFLAGS	+= -pg
 endif
 
+# arch Makefile detects if the compiler permits use of immediate values
+ifdef USE_IMMEDIATE
+KBUILD_CFLAGS	+= -DUSE_IMMEDIATE
+endif
+
 # We trigger additional mismatches with less inlining
 ifdef CONFIG_DEBUG_SECTION_MISMATCH
 KBUILD_CFLAGS += $(call cc-option, -fno-inline-functions-called-once)
Index: linux.trees.git/kernel/module.c
===================================================================
--- linux.trees.git.orig/kernel/module.c	2009-09-24 09:20:37.000000000 -0400
+++ linux.trees.git/kernel/module.c	2009-09-24 10:53:59.000000000 -0400
@@ -2237,7 +2237,7 @@ static noinline struct module *load_modu
 	mod->ctors = section_objs(hdr, sechdrs, secstrings, ".ctors",
 				  sizeof(*mod->ctors), &mod->num_ctors);
 #endif
-#ifdef CONFIG_IMMEDIATE
+#ifdef USE_IMMEDIATE
 	mod->immediate = section_objs(hdr, sechdrs, secstrings, "__imv",
 					sizeof(*mod->immediate),
 					&mod->num_immediate);
@@ -2491,7 +2491,7 @@ SYSCALL_DEFINE3(init_module, void __user
 	mutex_lock(&module_mutex);
 	/* Drop initial reference. */
 	module_put(mod);
-#ifdef CONFIG_IMMEDIATE
+#ifdef USE_IMMEDIATE
 	imv_unref(mod->immediate, mod->immediate + mod->num_immediate,
 		mod->module_init, mod->init_size);
 #endif
@@ -3011,7 +3011,7 @@ int module_get_iter_tracepoints(struct t
 }
 #endif
 
-#ifdef CONFIG_IMMEDIATE
+#ifdef USE_IMMEDIATE
 /**
  * _module_imv_update - update all immediate values in the kernel
  *
Index: linux.trees.git/include/linux/module.h
===================================================================
--- linux.trees.git.orig/include/linux/module.h	2009-09-24 10:53:51.000000000 -0400
+++ linux.trees.git/include/linux/module.h	2009-09-24 10:53:59.000000000 -0400
@@ -660,7 +660,7 @@ static inline int module_get_iter_tracep
 
 #endif /* CONFIG_MODULES */
 
-#if defined(CONFIG_MODULES) && defined(CONFIG_IMMEDIATE)
+#if defined(CONFIG_MODULES) && defined(USE_IMMEDIATE)
 extern void _module_imv_update(void);
 extern void module_imv_update(void);
 #else



* [RFC patch] Immediate Values - x86 Optimization NMI and MCE support
  2009-09-24 15:06       ` Peter Zijlstra
@ 2009-09-24 16:01         ` Mathieu Desnoyers
  2009-09-24 21:59           ` Masami Hiramatsu
  0 siblings, 1 reply; 34+ messages in thread
From: Mathieu Desnoyers @ 2009-09-24 16:01 UTC (permalink / raw)
  To: Peter Zijlstra, anil.s.keshavamurthy, Jason Yeh, Robert Richter
  Cc: Andi Kleen, H. Peter Anvin, Chuck Ebbert, Christoph Hellwig,
	Jeremy Fitzhardinge, Thomas Gleixner, Ingo Molnar, Ingo Molnar,
	linux-kernel, Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Steven Rostedt, Masami Hiramatsu

[Ingo: this patch is for RFC only. Do not merge.]

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Thu, 2009-09-24 at 11:03 -0400, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > On Thu, 2009-09-24 at 09:26 -0400, Mathieu Desnoyers wrote:
> > > > plain text document attachment (tracepoints-immediate-values.patch)
> > > > Use immediate values in tracepoints.
> > > 
> > > I might have missed it, but did both the Intel and AMD cpu folks clear
> > > the SMP code rewrite bits?
> > > 
> > 
> > SMP handling is performed with stop_machine() in this patchset. Nothing
> > fancy here.
> > 
> > I've got other patches, not included in this patchset, which implement
> > NMI-safe code modification, based on a scheme using breakpoints and
> > IPIs, inspired by djprobes. That scheme might be worth clearing with
> > Intel/AMD devs before merging.
> > 
> > However, doing code patching within stop_machine() is pretty safe, given
> > all other CPUs are busy-looping with interrupts off while this happens.
> > Ftrace already does this.
> 
> Agreed, I missed that this relies on stop_machine(). No problem then.
> 
> Reducing reliance on stop_machine() would be good, so it would be worth
> getting some CPU folks to look at your alternative implementation.
> 
> Thanks!
> 

Sure, here is the patch, which applies on top of the immediate values
patchset. It implements the breakpoint-based instruction patching
scheme. I am providing only this one for review. A follow-up patch
makes the immediate values infrastructure use this arch-specific
file; I'll leave it out for now.

Immediate Values - x86 Optimization NMI and MCE support

x86 optimization of the immediate values: a movl loads the value into the
register used as the variable source, and code patching rewrites that movl to
set/unset the value. A breakpoint bypasses the instruction while it is being
changed, which lessens the interrupt latency of the operation and protects
against NMIs and MCEs.

- More reentrant immediate value : uses a breakpoint. Needs to know the
  instruction's first byte. This is why we keep the "instruction size"
  variable, so we can support the REX prefixed instructions too.
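
To make the role of the instruction size concrete, here is a minimal
user-space sketch (not the kernel code) of the update order, run on a
plain byte buffer. The names imv_record, sync_all_cpus() and
imv_update_sketch() are made up for illustration. sync_all_cpus() only
counts calls where the real code sends an IPI, and there is no int3
handler or bypass here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BREAKPOINT_INSTRUCTION 0xcc

/* Hypothetical user-space mirror of the record fields the update uses. */
struct imv_record {
	uint8_t *imv;		/* address of the immediate operand */
	const void *var;	/* address of the new value */
	uint8_t size;		/* operand size in bytes */
	uint8_t insn_size;	/* whole instruction size in bytes */
};

/* Assumption: stand-in for the IPI round; it only counts invocations. */
unsigned int sync_count;

void sync_all_cpus(void)
{
	sync_count++;
}

void imv_update_sketch(const struct imv_record *rec)
{
	uint8_t opcode_size;
	uint8_t *insn;
	uint8_t first;

	assert(rec->insn_size > rec->size);
	opcode_size = rec->insn_size - rec->size;
	insn = rec->imv - opcode_size;	/* first byte of the instruction */
	/* The patch restores this byte from the bypass copy instead. */
	first = insn[0];

	insn[0] = BREAKPOINT_INSTRUCTION;	/* 1: breakpoint on first byte */
	sync_all_cpus();
	memcpy(rec->imv, rec->var, rec->size);	/* 2: rewrite the operand */
	sync_all_cpus();
	insn[0] = first;			/* 3: restore the first byte */
}
```

For a REX-prefixed 64-bit mov (2 opcode bytes, 8 operand bytes), imv
points at byte 2 of the instruction, so the breakpoint has to land two
bytes earlier. That offset is only known if the record carries the
instruction size in addition to the operand size.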

Changelog:
- Change the immediate.c update code to support variable length opcodes.
- Use text_poke_early with cr0 WP save/restore to patch the bypass. We are doing
  non atomic writes to a code region only touched by us (nobody can execute it
  since we are protected by the imv_mutex).
- Add x86_64 support, ready for i386+x86_64 -> x86 merge.
- Use asm-x86/asm.h.
- Use imv_* instead of immediate_*.
- Use kernel_wp_disable/enable instead of save/restore.
- Fix 1 byte immediate value so it declares its instruction size.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Kleen <andi@firstfloor.org>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Chuck Ebbert <cebbert@redhat.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: anil.s.keshavamurthy@intel.com
CC: Jason Yeh <jason.yeh@amd.com>
CC: Robert Richter <robert.richter@amd.com>
---
 arch/x86/include/asm/immediate.h |   48 +++++
 arch/x86/kernel/Makefile         |    1 
 arch/x86/kernel/immediate.c      |  319 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 362 insertions(+), 6 deletions(-)

Index: linux.trees.git/arch/x86/include/asm/immediate.h
===================================================================
--- linux.trees.git.orig/arch/x86/include/asm/immediate.h	2009-09-24 09:20:37.000000000 -0400
+++ linux.trees.git/arch/x86/include/asm/immediate.h	2009-09-24 11:07:10.000000000 -0400
@@ -12,6 +12,18 @@
 
 #include <asm/asm.h>
 
+struct __imv {
+	unsigned long var;	/* Pointer to the identifier variable of the
+				 * immediate value
+				 */
+	unsigned long imv;	/*
+				 * Pointer to the memory location of the
+				 * immediate value within the instruction.
+				 */
+	unsigned char size;	/* Type size. */
+	unsigned char insn_size;/* Instruction size. */
+} __attribute__ ((packed));
+
 /**
  * imv_read - read immediate variable
  * @name: immediate value name
@@ -26,6 +38,11 @@
  * what will generate an instruction with 8 bytes immediate value (not the REX.W
  * prefixed one that loads a sign extended 32 bits immediate value in a r64
  * register).
+ *
+ * Create the instruction in a discarded section to calculate its size. This is
+ * how we can align the beginning of the instruction on an address that will
+ * permit atomic modification of the immediate value without knowing the size of
+ * the opcode used by the compiler. The operand size is known in advance.
  */
 #define imv_read(name)							\
 	({								\
@@ -33,9 +50,14 @@
 		BUILD_BUG_ON(sizeof(value) > 8);			\
 		switch (sizeof(value)) {				\
 		case 1:							\
-			asm(".section __imv,\"aw\",@progbits\n\t"	\
+			asm(".section __discard,\"\",@progbits\n\t"	\
+				"1:\n\t"				\
+				"mov $0,%0\n\t"				\
+				"2:\n\t"				\
+				".previous\n\t"				\
+				".section __imv,\"aw\",@progbits\n\t"	\
 				_ASM_PTR "%c1, (3f)-%c2\n\t"		\
-				".byte %c2\n\t"				\
+				".byte %c2, (2b-1b)\n\t"		\
 				".previous\n\t"				\
 				"mov $0,%0\n\t"				\
 				"3:\n\t"				\
@@ -45,10 +67,16 @@
 			break;						\
 		case 2:							\
 		case 4:							\
-			asm(".section __imv,\"aw\",@progbits\n\t"	\
+			asm(".section __discard,\"\",@progbits\n\t"	\
+				"1:\n\t"				\
+				"mov $0,%0\n\t"				\
+				"2:\n\t"				\
+				".previous\n\t"				\
+				".section __imv,\"aw\",@progbits\n\t"	\
 				_ASM_PTR "%c1, (3f)-%c2\n\t"		\
-				".byte %c2\n\t"				\
+				".byte %c2, (2b-1b)\n\t"		\
 				".previous\n\t"				\
+				".org . + ((-.-(2b-1b)) & (%c2-1)), 0x90\n\t" \
 				"mov $0,%0\n\t"				\
 				"3:\n\t"				\
 				: "=r" (value)				\
@@ -60,10 +88,16 @@
 				value = name##__imv;			\
 				break;					\
 			}						\
-			asm(".section __imv,\"aw\",@progbits\n\t"	\
+			asm(".section __discard,\"\",@progbits\n\t"	\
+				"1:\n\t"				\
+				"mov $0xFEFEFEFE01010101,%0\n\t"	\
+				"2:\n\t"				\
+				".previous\n\t"				\
+				".section __imv,\"aw\",@progbits\n\t"	\
 				_ASM_PTR "%c1, (3f)-%c2\n\t"		\
-				".byte %c2\n\t"				\
+				".byte %c2, (2b-1b)\n\t"		\
 				".previous\n\t"				\
+				".org . + ((-.-(2b-1b)) & (%c2-1)), 0x90\n\t" \
 				"mov $0xFEFEFEFE01010101,%0\n\t" 	\
 				"3:\n\t"				\
 				: "=r" (value)				\
@@ -74,4 +108,6 @@
 		value;							\
 	})
 
+extern int arch_imv_update(const struct __imv *imv, int early);
+
 #endif /* _ASM_X86_IMMEDIATE_H */
Index: linux.trees.git/arch/x86/kernel/Makefile
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/Makefile	2009-09-24 10:55:49.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/Makefile	2009-09-24 11:07:32.000000000 -0400
@@ -79,6 +79,7 @@ obj-$(CONFIG_KEXEC)		+= relocate_kernel_
 obj-$(CONFIG_CRASH_DUMP)	+= crash_dump_$(BITS).o
 obj-$(CONFIG_KPROBES)		+= kprobes.o
 obj-$(CONFIG_MODULES)		+= module.o
+obj-$(USE_IMMEDIATE)		+= immediate.o
 obj-$(CONFIG_EFI) 		+= efi.o efi_$(BITS).o efi_stub_$(BITS).o
 obj-$(CONFIG_DOUBLEFAULT) 	+= doublefault_32.o
 obj-$(CONFIG_KGDB)		+= kgdb.o
Index: linux.trees.git/arch/x86/kernel/immediate.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux.trees.git/arch/x86/kernel/immediate.c	2009-09-24 11:13:22.000000000 -0400
@@ -0,0 +1,319 @@
+/*
+ * Immediate Value - x86 architecture specific code.
+ *
+ * Rationale
+ *
+ * Required because of :
+ * - Erratum 49 fix for Intel PIII.
+ * - Still present on newer processors : Intel Core 2 Duo Processor for Intel
+ *   Centrino Duo Processor Technology Specification Update, AH33.
+ *   Unsynchronized Cross-Modifying Code Operations Can Cause Unexpected
+ *   Instruction Execution Results.
+ *
+ * Quoting "Intel Core 2 Duo Processor for Intel Centrino Duo Processor
+ * Technology Specification Update" (AH33)
+ *
+ * "The act of one processor, or system bus master, writing data into a
+ * currently executing code segment of a second processor with the intent of
+ * having the second processor execute that data as code is called
+ * cross-modifying code (XMC). XMC that does not force the second processor to
+ * execute a synchronizing instruction, prior to execution of the new code, is
+ * called unsynchronized XMC. Software using unsynchronized XMC to modify the
+ * instruction byte stream of a processor can see unexpected or unpredictable
+ * execution behavior from the processor that is executing the modified code."
+ *
+ * This code turns what would otherwise be an unsynchronized XMC into a
+ * synchronized XMC by making sure a synchronizing instruction (either iret
+ * returning from the breakpoint, cpuid from sync_core(), or mfence) is executed
+ * by any CPU before it executes the modified instruction.
+ *
+ * Reentrant for NMI and trap handler instrumentation. Permits XMC to a
+ * location that has preemption enabled because it involves no temporary or
+ * reused data structure.
+ *
+ * Quoting Richard J Moore, source of the information motivating this
+ * implementation which differs from the one proposed by Intel which is not
+ * suitable for kernel context (does not support NMI and would require disabling
+ * interrupts on every CPU for a long period) :
+ *
+ * "There is another issue to consider when looking into using probes other
+ * than int3:
+ *
+ * Intel erratum 54 - Unsynchronized Cross-modifying code - refers to the
+ * practice of modifying code on one processor where another has prefetched
+ * the unmodified version of the code. Intel states that unpredictable general
+ * protection faults may result if a synchronizing instruction (iret, int,
+ * int3, cpuid, etc ) is not executed on the second processor before it
+ * executes the pre-fetched out-of-date copy of the instruction.
+ *
+ * When we became aware of this I had a long discussion with Intel's
+ * microarchitecture guys. It turns out that the reason for this erratum
+ * (which incidentally Intel does not intend to fix) is because the trace
+ * cache - the stream of micro-ops resulting from instruction interpretation -
+ * cannot be guaranteed to be valid. Reading between the lines I assume this
+ * issue arises because of optimization done in the trace cache, where it is
+ * no longer possible to identify the original instruction boundaries. If the
+ * CPU discovers that the trace cache has been invalidated because of
+ * unsynchronized cross-modification then instruction execution will be
+ * aborted with a GPF. Further discussion with Intel revealed that replacing
+ * the first opcode byte with an int3 would not be subject to this erratum.
+ *
+ * So, is cmpxchg reliable? One has to guarantee more than mere atomicity."
+ *
+ * Overall design
+ *
+ * The algorithm proposed by Intel does not apply well in kernel context: it
+ * would imply disabling interrupts and looping on every CPU while modifying
+ * the code and would not support instrumentation of code called from interrupt
+ * sources that cannot be disabled.
+ *
+ * Therefore, we use a different algorithm to respect Intel's erratum (see the
+ * quoted discussion above). We make sure that no CPU sees an out-of-date copy
+ * of a pre-fetched instruction by: 1 - placing a breakpoint on the first byte,
+ * which makes execution skip the instruction being modified; 2 - issuing an
+ * IPI to every CPU to execute a mfence, so that no CPU can still hold the
+ * out-of-date copy once the breakpoint is removed; 3 - modifying the now
+ * unused operand bytes of the instruction; 4 - issuing a second IPI and then
+ * putting back the original 1st byte of the instruction.
+ *
+ * It has exactly the same intent as the algorithm proposed by Intel, but
+ * it has fewer side effects, scales better and supports NMI, SMI and MCE.
+ *
+ * The algorithm used to update instructions with breakpoint and IPI is inspired
+ * by the djprobe project. Credits to Masami Hiramatsu <mhiramat@redhat.com>.
+ *
+ * Copyright 2009 - Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ * Distributed under GPLv2.
+ */
+
+#include <linux/preempt.h>
+#include <linux/smp.h>
+#include <linux/notifier.h>
+#include <linux/module.h>
+#include <linux/immediate.h>
+#include <linux/kdebug.h>
+#include <linux/rcupdate.h>
+#include <linux/kprobes.h>
+#include <linux/io.h>
+
+#include <asm/cacheflush.h>
+
+#define BREAKPOINT_INSTRUCTION  0xcc
+#define BREAKPOINT_INS_LEN	1
+#define NR_NOPS			10
+
+static unsigned long target_after_int3;	/* EIP of the target after the int3 */
+static unsigned long bypass_eip;	/* EIP of the bypass. */
+static unsigned long bypass_after_int3;	/* EIP after the end-of-bypass int3 */
+static unsigned long after_imv;	/*
+					 * EIP where to resume after the
+					 * single-stepping.
+					 */
+
+/*
+ * Internal bypass used during value update. The bypass is skipped by the
+ * function in which it is inserted.
+ * No need to be aligned because we exclude readers from the site during
+ * update.
+ * Layout is:
+ * (10x nop) int3
+ * (maximum size is 2 bytes opcode + 8 bytes immediate value for long on x86_64)
+ * The nops are the target replaced by the instruction to single-step.
+ * Align on 16 bytes to make sure the nops fit within a single page so remapping
+ * it can be done easily.
+ */
+static inline void _imv_bypass(unsigned long *bypassaddr,
+	unsigned long *breaknextaddr)
+{
+		asm volatile("jmp 2f;\n\t"
+				".align 16;\n\t"
+				"0:\n\t"
+				".space 10, 0x90;\n\t"
+				"1:\n\t"
+				"int3;\n\t"
+				"2:\n\t"
+				"mov $(0b),%0;\n\t"
+				"mov $((1b)+1),%1;\n\t"
+				: "=r" (*bypassaddr),
+				  "=r" (*breaknextaddr));
+}
+
+static void imv_synchronize_core(void *info)
+{
+	/*
+	 * Read new instructions before continuing and stop speculative
+	 * execution.
+	 */
+	smp_mb();
+}
+
+/*
+ * The eip value points right after the breakpoint instruction, in the second
+ * byte of the movl.
+ * Disable preemption in the bypass to make sure no thread will be preempted in
+ * it. We can then use synchronize_sched() to make sure all bypass users have
+ * finished.
+ */
+static int imv_notifier(struct notifier_block *nb,
+	unsigned long val, void *data)
+{
+	enum die_val die_val = (enum die_val) val;
+	struct die_args *args = data;
+
+	if (!args->regs || user_mode_vm(args->regs))
+		return NOTIFY_DONE;
+
+	if (die_val == DIE_INT3) {
+		if (args->regs->ip == target_after_int3) {
+			preempt_disable();
+			args->regs->ip = bypass_eip;
+			return NOTIFY_STOP;
+		} else if (args->regs->ip == bypass_after_int3) {
+			args->regs->ip = after_imv;
+			preempt_enable();
+			return NOTIFY_STOP;
+		}
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block imv_notify = {
+	.notifier_call = imv_notifier,
+	.priority = 0x7fffffff,	/* we need to be notified first */
+};
+
+/**
+ * arch_imv_update - update one immediate value
+ * @imv: pointer of type const struct __imv to update
+ * @early: early boot (1) or normal (0)
+ *
+ * Update one immediate value. Must be called with imv_mutex held.
+ */
+__kprobes int arch_imv_update(const struct __imv *imv, int early)
+{
+	int ret;
+	unsigned char opcode_size = imv->insn_size - imv->size;
+	unsigned long insn = imv->imv - opcode_size;
+	unsigned long len;
+	char *vaddr;
+	struct page *pages[1];
+
+#ifdef CONFIG_KPROBES
+	/*
+	 * Fail if a kprobe has been set on this instruction.
+	 * (TODO: we could eventually do better and modify all the (possibly
+	 * nested) kprobes for this site if kprobes had an API for this.)
+	 */
+	if (unlikely(!early
+			&& *(unsigned char *)insn == BREAKPOINT_INSTRUCTION)) {
+		printk(KERN_WARNING "Immediate value in conflict with kprobe. "
+				    "Variable at %p, "
+				    "instruction at %p, size %hu\n",
+				    (void *)imv->var,
+				    (void *)imv->imv, imv->size);
+		return -EBUSY;
+	}
+#endif
+
+	/*
+	 * If the variable and the instruction have the same value, there is
+	 * nothing to do.
+	 */
+	switch (imv->size) {
+	case 1:	if (*(uint8_t *)imv->imv
+				== *(uint8_t *)imv->var)
+			return 0;
+		break;
+	case 2:	if (*(uint16_t *)imv->imv
+				== *(uint16_t *)imv->var)
+			return 0;
+		break;
+	case 4:	if (*(uint32_t *)imv->imv
+				== *(uint32_t *)imv->var)
+			return 0;
+		break;
+#ifdef CONFIG_X86_64
+	case 8:	if (*(uint64_t *)imv->imv
+				== *(uint64_t *)imv->var)
+			return 0;
+		break;
+#endif
+	default:return -EINVAL;
+	}
+
+	if (!early) {
+		/* bypass is 10 bytes long for x86_64 long */
+		WARN_ON(imv->insn_size > 10);
+		_imv_bypass(&bypass_eip, &bypass_after_int3);
+
+		after_imv = imv->imv + imv->size;
+
+		/*
+		 * Using the _early variants because nobody is executing the
+		 * bypass code while we patch it. It is protected by the
+		 * imv_mutex. Since we modify the instructions non atomically
+		 * (for nops), we have to use the _early variant.
+		 * We must however deal with RO pages.
+		 * Use a single page : 10 bytes are aligned on 16 bytes
+		 * boundaries.
+		 */
+		pages[0] = virt_to_page((void *)bypass_eip);
+		vaddr = vmap(pages, 1, VM_MAP, PAGE_KERNEL);
+		BUG_ON(!vaddr);
+		text_poke_early(&vaddr[bypass_eip & ~PAGE_MASK],
+			(void *)insn, imv->insn_size);
+		/*
+		 * Fill the rest with nops.
+		 */
+		len = NR_NOPS - imv->insn_size;
+		add_nops((void *)
+			&vaddr[(bypass_eip & ~PAGE_MASK) + imv->insn_size],
+			len);
+		vunmap(vaddr);
+
+		target_after_int3 = insn + BREAKPOINT_INS_LEN;
+		/* register_die_notifier has memory barriers */
+		register_die_notifier(&imv_notify);
+		/* The breakpoint will single-step the bypass */
+		text_poke((void *)insn,
+			((unsigned char[]){BREAKPOINT_INSTRUCTION}), 1);
+		/*
+		 * Make sure the breakpoint is set before we continue (visible
+		 * to other CPUs and interrupts).
+		 */
+		smp_wmb();
+		/*
+		 * Execute smp_mb() on each CPU.
+		 */
+		ret = on_each_cpu(imv_synchronize_core, NULL, 1);
+		BUG_ON(ret != 0);
+
+		text_poke((void *)(insn + opcode_size), (void *)imv->var,
+				imv->size);
+		/*
+		 * Make sure the value can be seen from other CPUs and
+		 * interrupts.
+		 */
+		smp_wmb();
+		/*
+		 * Execute smp_mb() on each CPU.
+		 */
+		ret = on_each_cpu(imv_synchronize_core, NULL, 1);
+		BUG_ON(ret != 0);
+		text_poke((void *)insn, (unsigned char *)bypass_eip, 1);
+		/*
+		 * Wait for all int3 handlers to end (interrupts are disabled in
+		 * int3). This CPU is clearly not in an int3 handler, because
+		 * int3 handler is not preemptible and there cannot be any more
+		 * int3 handler called for this site, because we placed the
+		 * original instruction back.  synchronize_sched has memory
+		 * barriers.
+		 */
+		synchronize_sched();
+		unregister_die_notifier(&imv_notify);
+		/* unregister_die_notifier has memory barriers */
+	} else
+		text_poke_early((void *)imv->imv, (void *)imv->var,
+			imv->size);
+	return 0;
+}
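
For reviewers puzzling over the ".org" padding in imv_read(): the
expression ((-.-(2b-1b)) & (%c2-1)) pads the current location so that
the instruction ends on a multiple of the operand size, which leaves
the immediate operand (the last bytes of the instruction) naturally
aligned. Below is a small user-space sketch of that arithmetic.
imv_pad_bytes() and imv_operand_addr() are names made up for this
example, and it assumes the operand size is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of padding bytes to emit at address pos so that an instruction of
 * insn_size bytes ends on a multiple of operand_size (a power of two).
 * Mirrors the assembler expression ((-.-(2b-1b)) & (%c2-1)).
 */
uint64_t imv_pad_bytes(uint64_t pos, uint64_t insn_size, uint64_t operand_size)
{
	assert(operand_size && (operand_size & (operand_size - 1)) == 0);
	return (0 - pos - insn_size) & (operand_size - 1);
}

/* Address of the immediate operand once the padding has been applied. */
uint64_t imv_operand_addr(uint64_t pos, uint64_t insn_size,
			  uint64_t operand_size)
{
	uint64_t start = pos + imv_pad_bytes(pos, insn_size, operand_size);

	return start + insn_size - operand_size;
}
```

For instance, a 5-byte mov with a 4-byte operand that would start at
0x1001 gets 2 bytes of padding, so it starts at 0x1003 and its operand
sits at 0x1004. With a 1-byte operand the mask is zero and no padding is
ever added, which matches the 1-byte case having no ".org" line.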


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [patch 07/12] Sparc create asm.h
  2009-09-24 13:26 ` [patch 07/12] Sparc create asm.h Mathieu Desnoyers
@ 2009-09-24 21:10   ` David Miller
  0 siblings, 0 replies; 34+ messages in thread
From: David Miller @ 2009-09-24 21:10 UTC (permalink / raw)
  To: mathieu.desnoyers; +Cc: mingo, linux-kernel

From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Date: Thu, 24 Sep 2009 09:26:33 -0400

> Create an assembly compatibility header for sparc32/64.
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>

Acked-by: David S. Miller <davem@davemloft.net>


* Re: [RFC patch] Immediate Values - x86 Optimization NMI and MCE support
  2009-09-24 16:01         ` [RFC patch] Immediate Values - x86 Optimization NMI and MCE support Mathieu Desnoyers
@ 2009-09-24 21:59           ` Masami Hiramatsu
  0 siblings, 0 replies; 34+ messages in thread
From: Masami Hiramatsu @ 2009-09-24 21:59 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, anil.s.keshavamurthy, Jason Yeh, Robert Richter,
	Andi Kleen, H. Peter Anvin, Chuck Ebbert, Christoph Hellwig,
	Jeremy Fitzhardinge, Thomas Gleixner, Ingo Molnar, Ingo Molnar,
	linux-kernel, Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Steven Rostedt, Ananth N Mavinakayanahalli

Mathieu Desnoyers wrote:
> [Ingo: this patch is for RFC only. Do not merge.]
> 
> * Peter Zijlstra (peterz@infradead.org) wrote:
>> On Thu, 2009-09-24 at 11:03 -0400, Mathieu Desnoyers wrote:
>>> * Peter Zijlstra (peterz@infradead.org) wrote:
>>>> On Thu, 2009-09-24 at 09:26 -0400, Mathieu Desnoyers wrote:
>>>>> plain text document attachment (tracepoints-immediate-values.patch)
>>>>> Use immediate values in tracepoints.
>>>>
>>>> I might have missed it, but did both the Intel and AMD cpu folks clear
>>>> the SMP code rewrite bits?
>>>>
>>>
>>> SMP handling is performed with stop_machine() in this patchset. Nothing
>>> fancy here.
>>>
>>> I've got other patches, not included in this patchset, which implement
>>> nmi-safe code modification, based on a scheme using breakpoints and
>>> IPIs, inspired by djprobes. That one might be worth clearing with
>>> intel/amd devs before merging.
>>>
>>> However, doing code patching within stop_machine() is pretty safe, given
>>> all other CPUs are busy-looping with interrupts off while this happens.
>>> Ftrace already does this.
>>
>> Agreed, I missed this relied on stopmachine. No problem then.
>>
>> It would be good to reduce reliance on stopmachine, so it would be good
>> to get some CPU folks looking at your alternative implementation.
>>
>> Thanks!
>>
> 
> Sure, here is the patch applying on top of the immediate values
> patchset. It implements the breakpoint-based instruction patching
> scheme. I just provide this one for review. There is a following patch
> which makes the immediate values infrastructure use this arch-specific
> file, which I'll leave out for now.

Mathieu, could you check my previous patch?
http://lkml.org/lkml/2009/9/14/551

I think we can share some code and ideas about generic XMC:-).
But since it seems that the imv requires a dedicated method,
I don't think we can share the code entirely. :-)

> +#include <asm/cacheflush.h>
> +
> +#define BREAKPOINT_INSTRUCTION  0xcc
> +#define BREAKPOINT_INS_LEN	1
> +#define NR_NOPS			10

Why don't you reuse the macros in arch/x86/include/asm/kprobes.h? :)

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com



* Re: [patch 02/12] Immediate Values - Architecture Independent Code
  2009-09-24 13:26 ` [patch 02/12] Immediate Values - Architecture Independent Code Mathieu Desnoyers
@ 2009-09-25  4:20   ` Andrew Morton
  2009-09-27 23:23     ` Mathieu Desnoyers
  2009-09-28  1:23     ` Andi Kleen
  0 siblings, 2 replies; 34+ messages in thread
From: Andrew Morton @ 2009-09-25  4:20 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, linux-kernel, Jason Baron, Rusty Russell,
	Adrian Bunk, Andi Kleen, Christoph Hellwig

On Thu, 24 Sep 2009 09:26:28 -0400 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> Immediate values are used as read mostly variables that are rarely updated. They
> use code patching to modify the values inscribed in the instruction stream. It
> provides a way to save precious cache lines that would otherwise have to be used
> by these variables.

What a hare-brained concept.

> * Why should this be merged *
> 
> It improves performance on heavy memory I/O workloads.
> 
> An interesting result shows the potential this infrastructure has by
> showing the slowdown a simple system call such as getppid() suffers when it is
> used under heavy user-space cache thrashing:
> 
> Random walk L1 and L2 thrashing surrounding a getppid() call:
> (note: in this test, do_syscall_trace was taken at each system call, see
> Documentation/immediate.txt in these patches for details)
> - No memory pressure :   getppid() takes  1573 cycles
> - With memory pressure : getppid() takes 15589 cycles

Our ideas of what constitutes an "interesting result" differ.

Do you have any data which indicates that this thing is of any real
benefit to anyone for anything?


* Re: [patch 02/12] Immediate Values - Architecture Independent Code
  2009-09-25  4:20   ` Andrew Morton
@ 2009-09-27 23:23     ` Mathieu Desnoyers
  2009-09-28  1:23     ` Andi Kleen
  1 sibling, 0 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2009-09-27 23:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, linux-kernel, Jason Baron, Rusty Russell,
	Adrian Bunk, Andi Kleen, Christoph Hellwig

* Andrew Morton (akpm@linux-foundation.org) wrote:
> On Thu, 24 Sep 2009 09:26:28 -0400 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > Immediate values are used as read mostly variables that are rarely updated. They
> > use code patching to modify the values inscribed in the instruction stream. It
> > provides a way to save precious cache lines that would otherwise have to be used
> > by these variables.
> 
> What a hare-brained concept.
> 

Hi Andrew,

Improving performance by specializing the implementation has been
studied thoroughly by many in the past, especially for JIT compilers.
What I am proposing here is merely a very specific use of the concept,
applied to read-often variables.

> > * Why should this be merged *
> > 
> > It improves performance on heavy memory I/O workloads.
> > 
> > An interesting result shows the potential this infrastructure has by
> > showing the slowdown a simple system call such as getppid() suffers when it is
> > used under heavy user-space cache thrashing:
> > 
> > Random walk L1 and L2 thrashing surrounding a getppid() call:
> > (note: in this test, do_syscall_trace was taken at each system call, see
> > Documentation/immediate.txt in these patches for details)
> > - No memory pressure :   getppid() takes  1573 cycles
> > - With memory pressure : getppid() takes 15589 cycles
> 
> Our ideas of what constitutes an "interesting result" differ.
> 
> Do you have any data which indicates that this thing is of any real
> benefit to anyone for anything?

Yep. See the benchmarks I just ran below.

Immediate Values Benchmarks

Kernel 2.6.31-tip
8-core Xeon, 2.0GHz, E5405
gcc version 4.3.2 (Debian 4.3.2-1.1) 

Test workload: build the Linux kernel tree, cache-hot, make -j10

Executive result summary:

In these tests, each system call has an added workload, which is to read a fixed
number of integers from randomly chosen cache lines within an array and perform
a branch. The implementation is added to ptrace.c. The baseline is an unmodified
kernel.

* Baseline:				sys	0m57.63s

* 4096 integer reads, random locations	sys	2m21.781s
* 4096 integer reads, immediate values	sys	1m44.695s

* 128 integer reads, random locations	sys	0m59.348s
* 128 integer reads, immediate values	sys	0m58.640s

* 32 integer reads, random locations	sys	0m58.68s
* 32 integer reads, immediate values	sys	0m57.60s

These numbers show that turning read-often data accesses into immediate
values reduces system time, most visibly when many such reads are done
per system call.
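
As a rough way to read the summary above, the sketch below computes
which fraction of the added system time goes away when the reads become
immediate values. overhead_reduction() is a name made up for this
illustration; the inputs are the sys times listed above, in seconds.

```c
#include <assert.h>

/*
 * Fraction of the added system-time overhead that disappears when the reads
 * are immediate values. All arguments are "sys" times in seconds.
 */
double overhead_reduction(double baseline, double with_loads, double with_imv)
{
	double load_overhead = with_loads - baseline;

	assert(load_overhead > 0.0);
	return (with_loads - with_imv) / load_overhead;
}
```

With the 4096-read numbers (57.63, 141.781, 104.695) this gives roughly
0.44, and with the 128-read numbers (57.63, 59.348, 58.640) roughly
0.41. For the 32-read case the differences are about one second, which
is close to the run-to-run spread in the detailed results, so that
ratio should not be over-interpreted.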

Binary size results:

* 4096 integer reads, random locations
  text     data     bss     dec     hex filename
  66079     648  262156  328883   504b3 arch/x86/kernel/ptrace.o

* 4096 integer reads, immediate values
   text	   data	    bss	    dec	    hex	filename
  66079	  74412	 262156	 402647	  624d7	arch/x86/kernel/ptrace.o

The text and bss sizes are unchanged, but the data size increases with
immediate values. The section headers confirm that this extra data is put
in the __imv section, which is only accessed when immediate value
updates are performed.
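
The size of that extra data can be cross-checked against the record
layout. The sketch below assumes x86_64, where unsigned long is 8
bytes, so a packed struct __imv record takes 18 bytes. imv_entry and
imv_entry_count() are names made up for this check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same layout as struct __imv, with unsigned long fixed at 8 bytes. */
struct imv_entry {
	uint64_t var;		/* pointer to the identifier variable */
	uint64_t imv;		/* pointer to the immediate in the insn */
	uint8_t size;		/* type size */
	uint8_t insn_size;	/* instruction size */
} __attribute__((packed));

/* Number of whole records in an __imv section of section_bytes bytes. */
size_t imv_entry_count(size_t section_bytes)
{
	assert(section_bytes % sizeof(struct imv_entry) == 0);
	return section_bytes / sizeof(struct imv_entry);
}
```

The data growth is 74412 - 648 = 73764 bytes, which is exactly the
0x12024 size of the __imv section in the detailed section headers, and
73764 / 18 gives 4098 records.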

So the tradeoff is: immediate values use more cache-cold space to increase
speed.

Therefore, if we can turn a significant number of fast-path read-often
variables into immediate values, this should lead to a performance gain. Also,
given we can expect the fastpath cache-line footprint to grow with the
next kernel releases (a trend I've seen a lot of people complaining
about), immediate values should help minimize this by removing the
d-cache hit from such read-often variables, leaving an i-cache hit
within a mostly sequential instruction stream.

A quick look at the vmlinux section headers:

vmlinux:     file format elf64-x86-64

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
 13 .data.read_mostly 00002df0  ffffffff80859440  0000000000859440  00859440  2**6
                  CONTENTS, ALLOC, LOAD, DATA

This shows that we have about 11.48kB of read-mostly data in the kernel
image which could be turned into immediate values, not counting the
modules. If even a portion of this data is not only read mostly but
also read often, then we will see a clear performance improvement.

Thanks,

Mathieu


Detailed test results follow.
----------------------------------------

* Baseline:

# size of kernel original ptrace.o

   text	   data	    bss	    dec	    hex	filename
  12863	    648	      8	  13519	   34cf	arch/x86/kernel/ptrace.o

# time make -j10

real	1m25.358s
user	9m7.506s
sys	0m57.856s

real	1m21.580s
user	9m7.362s
sys	0m57.212s

real	1m21.361s
user	9m6.358s
sys	0m57.824s


* 4096 cache lines read per system call (random cache lines)
  (CONFIG_IMMEDIATE=n)

# size of modified ptrace.o

  text	   data	    bss	    dec	    hex	filename
  66079	    648	 262156	 328883	  504b3	arch/x86/kernel/ptrace.o

# section headers

arch/x86/kernel/ptrace.o:     file format elf64-x86-64

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         0000f4e8  0000000000000000  0000000000000000  00000040  2**4
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  1 .data         000000a8  0000000000000000  0000000000000000  0000f540  2**5
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
  2 .bss          0004000c  0000000000000000  0000000000000000  0000f600  2**5
                  ALLOC
  3 .rodata       00000988  0000000000000000  0000000000000000  0000f600  2**5
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  4 .fixup        0000005b  0000000000000000  0000000000000000  0000ff88  2**0
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  5 __ex_table    00000090  0000000000000000  0000000000000000  0000ffe8  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  6 .smp_locks    00000028  0000000000000000  0000000000000000  00010078  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  7 .rodata.str1.8 000001f2  0000000000000000  0000000000000000  000100a0  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  8 .rodata.str1.1 00000097  0000000000000000  0000000000000000  00010292  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  9 __tracepoints 00000080  0000000000000000  0000000000000000  00010340  2**5
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
 10 _ftrace_events 00000160  0000000000000000  0000000000000000  000103c0  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
 11 __tracepoints_strings 00000013  0000000000000000  0000000000000000  00010520  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 12 .comment      0000001f  0000000000000000  0000000000000000  00010533  2**0
                  CONTENTS, READONLY
 13 .note.GNU-stack 00000000  0000000000000000  0000000000000000  00010552  2**0
                  CONTENTS, READONLY

# pattern

     820:       83 3d 00 00 00 00 01    cmpl   $0x1,0x0(%rip)        # 827 <test_pollute_cache+0x7>
     827:       0f 84 cb cf 00 00       je     d7f8 <test_pollute_cache+0xcfd8>

# time make -j10

real	1m36.075s
user	9m15.163s
sys	2m21.781s


* 4096 imv reads per system call
  (CONFIG_IMMEDIATE=y)

# size of modified ptrace.o

   text	   data	    bss	    dec	    hex	filename
  66079	  74412	 262156	 402647	  624d7	arch/x86/kernel/ptrace.o

    (note: data is larger due to __imv table, which is used only for updates)

# section headers

arch/x86/kernel/ptrace.o:     file format elf64-x86-64

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         0000f4e8  0000000000000000  0000000000000000  00000040  2**4
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  1 .data         000000a8  0000000000000000  0000000000000000  0000f540  2**5
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
  2 .bss          0004000c  0000000000000000  0000000000000000  0000f600  2**5
                  ALLOC
  3 .rodata       00000988  0000000000000000  0000000000000000  0000f600  2**5
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  4 .fixup        0000005b  0000000000000000  0000000000000000  0000ff88  2**0
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  5 __ex_table    00000090  0000000000000000  0000000000000000  0000ffe8  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  6 .smp_locks    00000028  0000000000000000  0000000000000000  00010078  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  7 __discard     00005004  0000000000000000  0000000000000000  000100a0  2**0
                  CONTENTS, READONLY
  8 __imv         00012024  0000000000000000  0000000000000000  000150a4  2**0
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
  9 .rodata.str1.8 000001f2  0000000000000000  0000000000000000  000270c8  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 10 .rodata.str1.1 00000097  0000000000000000  0000000000000000  000272ba  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 11 __tracepoints 00000080  0000000000000000  0000000000000000  00027360  2**5
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
 12 _ftrace_events 00000160  0000000000000000  0000000000000000  000273e0  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
 13 __tracepoints_strings 00000013  0000000000000000  0000000000000000  00027540  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 14 .comment      0000001f  0000000000000000  0000000000000000  00027553  2**0
                  CONTENTS, READONLY
 15 .note.GNU-stack 00000000  0000000000000000  0000000000000000  00027572  2**0
                  CONTENTS, READONLY

# pattern

     820:       b8 00 00 00 00          mov    $0x0,%eax
     825:       ff c8                   dec    %eax
     827:       0f 84 d3 cf 00 00       je     d800 <test_pollute_cache+0xcfe0>

# time make -j10

real	1m30.688s
user	9m7.770s
sys	1m44.695s


* 128 cache lines read per system call (random cache lines)
  (CONFIG_IMMEDIATE=n)

# time make -j10

real	1m27.801s
user	9m12.447s
sys	0m59.348s


* 128 imv reads per system call
  (CONFIG_IMMEDIATE=y)

# time make -j10

real	1m22.454s
user	9m5.822s
sys	0m58.640s


* 32 cache lines read per system call (random cache lines)
  (CONFIG_IMMEDIATE=n)

# time make -j10

real	1m21.539s
user	9m6.946s
sys	0m57.888s

real	1m26.789s
user	9m11.606s
sys	0m59.392s

real	1m29.461s
user	9m12.195s
sys	0m58.768s

avg sys:	58.68s


* 32 imv reads per system call
  (CONFIG_IMMEDIATE=y)

# time make -j10

real	1m21.844s
user	9m7.278s
sys	0m57.648s

real	1m22.123s
user	9m6.850s
sys	0m56.848s

real	1m24.589s
user	9m5.674s
sys	0m58.328s

avg sys:	57.60s



-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [patch 02/12] Immediate Values - Architecture Independent Code
  2009-09-25  4:20   ` Andrew Morton
  2009-09-27 23:23     ` Mathieu Desnoyers
@ 2009-09-28  1:23     ` Andi Kleen
  2009-09-28 17:46       ` Andrew Morton
  1 sibling, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2009-09-28  1:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mathieu Desnoyers, Ingo Molnar, linux-kernel, Jason Baron,
	Rusty Russell, Adrian Bunk, Andi Kleen, Christoph Hellwig

On Thu, Sep 24, 2009 at 09:20:13PM -0700, Andrew Morton wrote:
> On Thu, 24 Sep 2009 09:26:28 -0400 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > Immediate values are used as read mostly variables that are rarely updated. They
> > use code patching to modify the values inscribed in the instruction stream. It
> > provides a way to save precious cache lines that would otherwise have to be used
> > by these variables.
> 
> What a hare-brained concept.

The concept makes a lot of sense. Cache misses are extremely costly
on modern CPUs and when the workload has blown the caches away in user space
it can literally be hundreds or even thousands of cycles to fetch
a data cache line.

Similar optimizations are also quite common in compilers (with 
profile feedback) and JITs.

> Do you have any data which indicates that this thing is of any real
> benefit to anyone for anything?

There's a lot of data around that the kernel has very little IPC
due to a lot of cache misses in some workloads. In general this applies
to anything that touches a lot of data in user space and blows
the caches away.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [patch 02/12] Immediate Values - Architecture Independent Code
  2009-09-28  1:23     ` Andi Kleen
@ 2009-09-28 17:46       ` Andrew Morton
  2009-09-28 18:03         ` Arjan van de Ven
  2009-09-28 20:11         ` Andi Kleen
  0 siblings, 2 replies; 34+ messages in thread
From: Andrew Morton @ 2009-09-28 17:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mathieu Desnoyers, Ingo Molnar, linux-kernel, Jason Baron,
	Rusty Russell, Adrian Bunk, Christoph Hellwig

On Mon, 28 Sep 2009 03:23:37 +0200 Andi Kleen <andi@firstfloor.org> wrote:

> On Thu, Sep 24, 2009 at 09:20:13PM -0700, Andrew Morton wrote:
> > On Thu, 24 Sep 2009 09:26:28 -0400 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> > 
> > > Immediate values are used as read mostly variables that are rarely updated. They
> > > use code patching to modify the values inscribed in the instruction stream. It
> > > provides a way to save precious cache lines that would otherwise have to be used
> > > by these variables.
> > 
> > What a hare-brained concept.
> 
> The concept makes a lot of sense.

But does it?

> Cache misses are extremely costly
> on modern CPUs and when the workload has blown the caches away in user space
> it can literally be hundreds or even thousands of cycles to fetch
> a data cache line.

Well yes.  But for a kernel dcache entry to have been replaced by a
userspace one, userspace will, on average, have itself incurred a *lot*
of dcache misses.  So we just spent a lot of CPU cycles in userspace,
so the cost of the in-kernel dcache miss is relatively small.

That's how caches work!  If a kernel variable is read frequently, it's
still in dcache.  If it's read infrequently, it falls out of dcache but
that doesn't matter much because it's read infrequently!

And lo, it appears that we're unable to observe any measurable benefit
from the changes, so we're cooking up weird fake testcases to be able to
drag this thing out of the noise floor.

Obviously the change will have _some_ performance benefit.  But is it
enough to justify the addition of yet more tricksy code to maintain? 
That's a very different question.  

> There's a lot of data around that the kernel has very little IPC
> due to a lot of cache misses in some workloads.

Kernel gets a lot of cache misses, but that's usually against
userspace, pagecache, net headers/data, etc.  I doubt if it gets many
misses against a small number of small, read-mostly data items which is
what this patch addresses.

And it is a *small* number of things to which this change is
applicable.  This is because the write operation for these read-mostly
variables becomes very expensive indeed.  This means that we cannot use
"immediate values" for any variable which can conceivably be modified
at high frequency by any workload.

For example, how do we know it's safe to use immediate-values for
anything which can be modified from userspace, such as a sysfs-accessed
tunable?  How do we know this won't take someone's odd-but-legitimate
workload and shoot it in the head?


Summary:

- at this stage no real-world benefit has been demonstrated afaict

- the feature is narrowly applicable anyway

- it adds complexity and maintenance cost




* Re: [patch 02/12] Immediate Values - Architecture Independent Code
  2009-09-28 17:46       ` Andrew Morton
@ 2009-09-28 18:03         ` Arjan van de Ven
  2009-09-28 18:40           ` Mathieu Desnoyers
  2009-09-28 19:54           ` Andi Kleen
  2009-09-28 20:11         ` Andi Kleen
  1 sibling, 2 replies; 34+ messages in thread
From: Arjan van de Ven @ 2009-09-28 18:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Mathieu Desnoyers, Ingo Molnar, linux-kernel,
	Jason Baron, Rusty Russell, Adrian Bunk, Christoph Hellwig

On Mon, 28 Sep 2009 10:46:17 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> Kernel gets a lot of cache misses, but that's usually against
> userspace, pagecache, net headers/data, etc.  I doubt if it gets many
> misses against a small number of small, read-mostly data items which
> is what this patch addresses.
> 
> And it is a *small* number of things to which this change is
> applicable.  This is because the write operation for these read-mostly
> variables becomes very expensive indeed.  This means that we cannot
> use "immediate values" for any variable which can conceivably be
> modified at high frequency by any workload.

btw just to add to this:
caches are unified code/data after L1 in general... it then does not
matter much if you encode the "almost constant" in the codestream or
slightly farther away, in both cases it takes up cache space.
(you can argue "but in the data case it might pull in a whole cacheline
just for this".. but that's a case for us to pack such read mostly
things properly)

And for L1.. well.. the L2 latency is not THAT much bigger. And L1 is 
tiny. more icache pressure hurts just as much as having more dcache
pressure there.


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* Re: [patch 02/12] Immediate Values - Architecture Independent Code
  2009-09-28 18:03         ` Arjan van de Ven
@ 2009-09-28 18:40           ` Mathieu Desnoyers
  2009-09-28 19:54           ` Andi Kleen
  1 sibling, 0 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2009-09-28 18:40 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, Andi Kleen, Ingo Molnar, linux-kernel,
	Jason Baron, Rusty Russell, Adrian Bunk, Christoph Hellwig

* Arjan van de Ven (arjan@infradead.org) wrote:
> On Mon, 28 Sep 2009 10:46:17 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> > 
> > Kernel gets a lot of cache misses, but that's usually against
> > userspace, pagecache, net headers/data, etc.  I doubt if it gets many
> > misses against a small number of small, read-mostly data items which
> > is what this patch addresses.
> > 
> > And it is a *small* number of things to which this change is
> > applicable.  This is because the write operation for these read-mostly
> > variables becomes very expensive indeed.  This means that we cannot
> > use "immediate values" for any variable which can conceivably be
> > modified at high frequency by any workload.
> 
> btw just to add to this:
> caches are unified code/data after L1 in general... it then does not
> matter much if you encode the "almost constant" in the codestream or
> slightly farther away, in both cases it takes up cache space.

A standard read from memory typically needs the address of the data
encoded as an instruction operand in the i-cache, plus the data itself
in the d-cache.

Compared to this, immediate values remove the need to have a pointer in
the i-cache, so the overall footprint, even for L2 cache, is lower.

> (you can argue "but in the data case it might pull in a whole cacheline
> just for this".. but that's a case for us to pack such read mostly
> things properly)
> 
> And for L1.. well.. the L2 latency is not THAT much bigger. And L1 is 
> tiny. more icache pressure hurts just as much as having more dcache
> pressure there.

Immediate values do not add i-cache pressure. They just remove d-cache
pressure. So it saves L1 d-cache, and the L1 i-cache pressure stays
mostly unchanged.

Thanks,

Mathieu


> 
> 
> -- 
> Arjan van de Ven 	Intel Open Source Technology Centre
> For development, discussion and tips for power savings, 
> visit http://www.lesswatts.org

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [patch 02/12] Immediate Values - Architecture Independent Code
  2009-09-28 18:03         ` Arjan van de Ven
  2009-09-28 18:40           ` Mathieu Desnoyers
@ 2009-09-28 19:54           ` Andi Kleen
  2009-09-28 20:37             ` Arjan van de Ven
  1 sibling, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2009-09-28 19:54 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, Andi Kleen, Mathieu Desnoyers, Ingo Molnar,
	linux-kernel, Jason Baron, Rusty Russell, Adrian Bunk,
	Christoph Hellwig

> btw just to add to this:
> caches are unified code/data after L1 in general... it then does not
> matter much if you encode the "almost constant" in the codestream or
> slightly farther away, in both cases it takes up cache space.

It does take up cache space, but when it's embedded in the instruction
stream the CPU has it usually already prefetched (CPUs are very good
at prefetching instructions). That's often not the case
with arbitrary data accesses like this.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [patch 02/12] Immediate Values - Architecture Independent Code
  2009-09-28 17:46       ` Andrew Morton
  2009-09-28 18:03         ` Arjan van de Ven
@ 2009-09-28 20:11         ` Andi Kleen
  2009-09-28 21:16           ` Andrew Morton
  1 sibling, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2009-09-28 20:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Mathieu Desnoyers, Ingo Molnar, linux-kernel,
	Jason Baron, Rusty Russell, Adrian Bunk, Christoph Hellwig

> That's how caches work!  If a kernel variable is read frequently, it's
> still in dcache.  If it's read infrequently, it falls out of dcache but
> that doesn't matter much because it's read infrequently!

You're assuming that the CPU's cache LRU is perfect. e.g. that it
never gets swamped by a lot of short term accesses. And that it has perfect
insight if something is really frequently used or not. But that's 
not true. A short term cache pig, even when it uses the data only
once, can swamp it and throw out all the kernel state, even when it's
accessed frequently enough.  It's much like the problems we have
with the page cache LRU.
> 
> And lo, it appears that we're unable to observe any measurable benefit
> from the changes, so we're cooking up weird fake testcases to be able to
> drag this thing out of the noise floor.

Yes, a really measurable improvement would be great. One problem
right now is that not enough users are there, so measuring
something would first need more users to really reduce the cache misses.
> 
> Obviously the change will have _some_ performance benefit.  But is it
> enough to justify the addition of yet more tricksy code to maintain? 

I don't think the code is particularly tricky. Especially the user API
is very simple and neat.
> 
> And it is a *small* number of things to which this change is
> applicable.  This is because the write operation for these read-mostly
> variables becomes very expensive indeed.  This means that we cannot use
> "immediate values" for any variable which can conceivably be modified
> at high frequency by any workload.

A natural target is any sysctl for example.

> 
> For example, how do we know it's safe to use immediate-values for
> anything which can be modified from userspace, such as a sysfs-accessed
> tunable?  How do we know this won't take someone's odd-but-legitimate
> workload and shoot it in the head?

You're arguing we should tune for sysctl performance? That doesn't make
sense to me.

> 
> 
> Summary:
> 
> - at this stage no real-world benefit has been demonstrated afaict

Yes that's an issue that needs to be addressed.

A good way would be probably to do some measurements on cache misses
for given workloads and then convert all applicable global references
and see how much difference it makes.

> - the feature is narrowly applicable anyway

I don't think so, there are quite a lot of global flag variables.

% find /proc/sys -type f | wc -l
565

not counting sysfs, boot options and other things.
> 
> - it adds complexity and maintenance cost

Very little as far as I can see.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [patch 02/12] Immediate Values - Architecture Independent Code
  2009-09-28 19:54           ` Andi Kleen
@ 2009-09-28 20:37             ` Arjan van de Ven
  2009-09-28 21:32               ` H. Peter Anvin
  0 siblings, 1 reply; 34+ messages in thread
From: Arjan van de Ven @ 2009-09-28 20:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Andi Kleen, Mathieu Desnoyers, Ingo Molnar,
	linux-kernel, Jason Baron, Rusty Russell, Adrian Bunk,
	Christoph Hellwig

On Mon, 28 Sep 2009 21:54:45 +0200
Andi Kleen <andi@firstfloor.org> wrote:

> > btw just to add to this:
> > caches are unified code/data after L1 in general... it then does not
> > matter much if you encode the "almost constant" in the codestream or
> > slightly farther away, in both cases it takes up cache space.
> 
> It does take up cache space, but when it's embedded in the instruction
> stream the CPU has it usually already prefetched (CPUs are very good
> at prefetching instructions). That's often not the case
> > with arbitrary data accesses like this.

this makes me wonder what happens when a variable is used in multiple
places... that makes the icache overhead multiply right?


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* Re: [patch 02/12] Immediate Values - Architecture Independent Code
  2009-09-28 20:11         ` Andi Kleen
@ 2009-09-28 21:16           ` Andrew Morton
  2009-09-28 22:01             ` Mathieu Desnoyers
  0 siblings, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2009-09-28 21:16 UTC (permalink / raw)
  To: Andi Kleen
  Cc: andi, mathieu.desnoyers, mingo, linux-kernel, jbaron, rusty, bunk, hch

On Mon, 28 Sep 2009 22:11:08 +0200
Andi Kleen <andi@firstfloor.org> wrote:

> > For example, how do we know it's safe to use immediate-values for
> > anything which can be modified from userspace, such as a sysfs-accessed
> > tunable?  How do we know this won't take someone's odd-but-legitimate
> > workload and shoot it in the head?
> 
> You're arguing we should tune for sysctl performance? That doesn't make
> sense to me.

We're talking about a tiny tiny performance gain (one which thus far
appears to be unobservable) on the read-side traded off against a
tremendous slowdown on the write-side.

That's OK for people whose workloads use the expected read-vs-write
ratio.  But there's always someone out there who does something
peculiar.  There will be people who simply cannot accept large
slowdowns in writes to particular tunables.  Who these people are and
which tunables they care about we do not know.

No, I'm not saying we should "tune for sysctl performance".  I'm saying
we should tune for not making Linux utterly uselessly slow for people
for whom it previously worked OK.

It means we'd have to look very carefully at each tunable and decide
whether there's any conceivable situation in which someone would want
to alter it frequently.  If so, we need to leave it alone.

How many tunables will that leave behind, and how much use was it to
speed that remainder up by a teensy amount?  Who knows.



* Re: [patch 02/12] Immediate Values - Architecture Independent Code
  2009-09-28 20:37             ` Arjan van de Ven
@ 2009-09-28 21:32               ` H. Peter Anvin
  2009-09-28 22:05                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 34+ messages in thread
From: H. Peter Anvin @ 2009-09-28 21:32 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andi Kleen, Andrew Morton, Mathieu Desnoyers, Ingo Molnar,
	linux-kernel, Jason Baron, Rusty Russell, Adrian Bunk,
	Christoph Hellwig

On 09/28/2009 01:37 PM, Arjan van de Ven wrote:
> 
> this makes me wonder what happens when a variable is used in multiple
> places... that makes the icache overhead multiply right?
> 

On x86, the icache overhead can often be zero or close to zero -- or
even negative in a fairly common subcase[1] -- simply because you are
dropping a displacement used to fetch a global variable with an
immediate in the code itself.

For 8- or 16-bit data items this is even more of a win in terms of
icache space; for 64-bit data it is always a lose.

It is also worth noting that the way this is implemented as a graft-on
rather than with compiler support means that the full instruction set
cannot be exploited -- x86 can often use a memory operand or immediate as
part of an operation.  This adds icache pressure.

	-hpa

[1] Common subcase:

	movl global, %reg	; 6 bytes (unless reg is eax on 32 bits)
	movl $immed, %reg	; 5 bytes



* Re: [patch 02/12] Immediate Values - Architecture Independent Code
  2009-09-28 21:16           ` Andrew Morton
@ 2009-09-28 22:01             ` Mathieu Desnoyers
  0 siblings, 0 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2009-09-28 22:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, mingo, linux-kernel, jbaron, rusty, bunk, hch,
	H. Peter Anvin

* Andrew Morton (akpm@linux-foundation.org) wrote:
> On Mon, 28 Sep 2009 22:11:08 +0200
> Andi Kleen <andi@firstfloor.org> wrote:
> 
> > > For example, how do we know it's safe to use immediate-values for
> > > anything which can be modified from userspace, such as a sysfs-accessed
> > > tunable?  How do we know this won't take someone's odd-but-legitimate
> > > workload and shoot it in the head?
> > 
> > You're arguing we should tune for sysctl performance? That doesn't make
> > sense to me.
> 
> We're talking about a tiny tiny performance gain (one which thus far
> appears to be unobservable) on the read-side traded off against a
> tremendous slowdown on the write-side.
> 
> That's OK for people whose workloads use the expected read-vs-write
> ratio.  But there's always someone out there who does something
> peculiar.  There will be people who simply cannot accept large
> slowdowns in writes to particular tunables.  Who these people are and
> which tunables they care about we do not know.
> 
> No, I'm not saying we should "tune for sysctl performance".  I'm saying
> we should tune for not making Linux utterly uselessly slow for people
> for whom it previously worked OK.
> 
> It means we'd have to look very carefully at each tunable and decide
> whether there's any conceivable situation in which someone would want
> to alter it frequently.  If so, we need to leave it alone.
> 
> How many tunables will that leave behind, and how much use was it to
> speed that remainder up by a teensy amount?  Who knows.
> 

BTW, when/if we get the OK from Intel to use a breakpoint/IPI-based
scheme to perform the updates rather than using the heavyweight
stop_machine(), this update performance question will be much less of a
concern.

hpa is currently looking into this.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [patch 02/12] Immediate Values - Architecture Independent Code
  2009-09-28 21:32               ` H. Peter Anvin
@ 2009-09-28 22:05                 ` Mathieu Desnoyers
  0 siblings, 0 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2009-09-28 22:05 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Arjan van de Ven, Andi Kleen, Andrew Morton, Ingo Molnar,
	linux-kernel, Jason Baron, Rusty Russell, Adrian Bunk,
	Christoph Hellwig

* H. Peter Anvin (hpa@zytor.com) wrote:
> On 09/28/2009 01:37 PM, Arjan van de Ven wrote:
> > 
> > this makes me wonder what happens when a variable is used in multiple
> > places... that makes the icache overhead multiply right?
> > 
> 
> On x86, the icache overhead can often be zero or close to zero -- or
> even negative in a fairly common subcase[1] -- simply because you are
> dropping a displacement used to fetch a global variable with an
> immediate in the code itself.
> 
> For 8- or 16-bit data items this is even more of a win in terms of
> icache space; for 64-bit data it is always a lose.
> 
> It is also worth noting that the way this is implemented as a graft-on
> rather than with compiler support means that the full instruction set
> cannot be exploited -- x86 can often use a memory operand or immediate as
> part of an operation.  This adds icache pressure.

Indeed, these cases could make good use of compiler support to let
immediate values be used in a wider range of operations. Currently,
being restricted to "mov" is somewhat limiting on x86. We could
definitely do better.

Mathieu

> 
> 	-hpa
> 
> [1] Common subcase:
> 
> 	movl global, %reg	; 6 bytes (unless reg is eax on 32 bits)
> 	movl $immed, %reg	; 5 bytes
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


Thread overview: 34+ messages
2009-09-24 13:26 [patch 00/12] Immediate Values Mathieu Desnoyers
2009-09-24 13:26 ` [patch 01/12] x86: text_poke_early non static Mathieu Desnoyers
2009-09-24 13:26 ` [patch 02/12] Immediate Values - Architecture Independent Code Mathieu Desnoyers
2009-09-25  4:20   ` Andrew Morton
2009-09-27 23:23     ` Mathieu Desnoyers
2009-09-28  1:23     ` Andi Kleen
2009-09-28 17:46       ` Andrew Morton
2009-09-28 18:03         ` Arjan van de Ven
2009-09-28 18:40           ` Mathieu Desnoyers
2009-09-28 19:54           ` Andi Kleen
2009-09-28 20:37             ` Arjan van de Ven
2009-09-28 21:32               ` H. Peter Anvin
2009-09-28 22:05                 ` Mathieu Desnoyers
2009-09-28 20:11         ` Andi Kleen
2009-09-28 21:16           ` Andrew Morton
2009-09-28 22:01             ` Mathieu Desnoyers
2009-09-24 13:26 ` [patch 03/12] Immediate Values - Kconfig menu in EMBEDDED Mathieu Desnoyers
2009-09-24 13:26 ` [patch 04/12] Immediate Values - x86 Optimization Mathieu Desnoyers
2009-09-24 13:26 ` [patch 05/12] Add text_poke and sync_core to powerpc Mathieu Desnoyers
2009-09-24 13:26 ` [patch 06/12] Immediate Values - Powerpc Optimization Mathieu Desnoyers
2009-09-24 13:26 ` [patch 07/12] Sparc create asm.h Mathieu Desnoyers
2009-09-24 21:10   ` David Miller
2009-09-24 13:26 ` [patch 08/12] sparc64: Optimized immediate value implementation Mathieu Desnoyers
2009-09-24 13:26 ` [patch 09/12] Immediate Values - Documentation Mathieu Desnoyers
2009-09-24 13:26 ` [patch 10/12] Immediate Values Support init Mathieu Desnoyers
2009-09-24 15:33   ` [patch 10.1/12] Immediate values fixes for modules Mathieu Desnoyers
2009-09-24 15:35   ` [patch 10.2/12] Fix Immediate Values x86_64 support old gcc Mathieu Desnoyers
2009-09-24 13:26 ` [patch 11/12] Scheduler Profiling - Use Immediate Values Mathieu Desnoyers
2009-09-24 13:26 ` [patch 12/12] Tracepoints - " Mathieu Desnoyers
2009-09-24 14:51   ` Peter Zijlstra
2009-09-24 15:03     ` Mathieu Desnoyers
2009-09-24 15:06       ` Peter Zijlstra
2009-09-24 16:01         ` [RFC patch] Immediate Values - x86 Optimization NMI and MCE support Mathieu Desnoyers
2009-09-24 21:59           ` Masami Hiramatsu
