* [patch 00/61] ANNOUNCE: lock validator -V1
From: Ingo Molnar @ 2006-05-29 21:21 UTC
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

We are pleased to announce the first release of the "lock dependency 
correctness validator" kernel debugging feature, which can be downloaded 
from:

  http://redhat.com/~mingo/lockdep-patches/

The easiest way to try lockdep on a testbox is to apply the combo patch 
to 2.6.17-rc4-mm3. The patch order is:

  http://kernel.org/pub/linux/kernel/v2.6/testing/linux-2.6.17-rc4.tar.bz2
  http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.17-rc4/2.6.17-rc4-mm3/2.6.17-rc4-mm3.bz2
  http://redhat.com/~mingo/lockdep-patches/lockdep-combo.patch

Do 'make oldconfig' and accept all the defaults for the new config 
options, then reboot into the kernel. If everything goes well it should 
boot up fine and you should have /proc/lockdep and /proc/lockdep_stats 
files.

If the lock validator finds a problem, it typically prints out 
voluminous debug output that begins with "BUG: ..."; that syslog output 
can then be used by kernel developers to figure out the precise locking 
scenario.

What does the lock validator do? It "observes" and maps all locking 
rules as they occur dynamically (as triggered by the kernel's natural 
use of spinlocks, rwlocks, mutexes and rwsems). Whenever the lock 
validator subsystem detects a new locking scenario, it validates this 
new rule against the existing set of rules. If this new rule is 
consistent with the existing set of rules then the new rule is added 
transparently and the kernel continues as normal. If the new rule could 
create a deadlock scenario then this condition is printed out.
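To make the "new rule vs. existing rules" check concrete, here is a 
minimal hypothetical sketch (lock_a, lock_b, path_one() and path_two() 
are purely illustrative, not from any real subsystem): two code paths 
take the same two spinlocks in opposite order. Each path is fine on its 
own, but once the validator has recorded the lock_a -> lock_b 
dependency from the first path, observing lock_b -> lock_a in the 
second path closes a cycle in the dependency graph and the potential 
deadlock is reported - the two paths never have to actually run 
concurrently:

#include <linux/spinlock.h>

/* Hypothetical example - not from the kernel tree: */
static DEFINE_SPINLOCK(lock_a);
static DEFINE_SPINLOCK(lock_b);

static void path_one(void)
{
	spin_lock(&lock_a);
	spin_lock(&lock_b);	/* records the rule: lock_a -> lock_b */
	/* ... */
	spin_unlock(&lock_b);
	spin_unlock(&lock_a);
}

static void path_two(void)
{
	spin_lock(&lock_b);
	spin_lock(&lock_a);	/* lock_b -> lock_a closes the cycle */
	/* ... */
	spin_unlock(&lock_a);
	spin_unlock(&lock_b);
}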

When determining the validity of locking, all possible "deadlock 
scenarios" are considered: assuming an arbitrary number of CPUs and 
arbitrary irq-context and task-context constellations, running 
arbitrary combinations of all the existing locking scenarios. In a 
typical system this means millions of separate scenarios. This is why 
we call it a "locking correctness" validator - for all rules that are 
observed, the lock validator proves with mathematical certainty that a 
deadlock could not occur (assuming that the lock validator 
implementation itself is correct and its internal data structures are 
not corrupted by some other kernel subsystem). [see more details and 
the conditions of this statement in include/linux/lockdep.h and 
Documentation/lockdep-design.txt]

Furthermore, this "all possible scenarios" property of the validator 
also enables the finding of complex, highly unlikely multi-CPU 
multi-context races via individual single-context rules, drastically 
increasing the likelihood of finding bugs. In practical terms: the lock 
validator has already found a bug in the upstream kernel that could 
only occur on systems with 3 or more CPUs, and which needed 3 very 
unlikely code sequences to occur at once on the 3 CPUs. That bug was 
found and reported on a single-CPU system (!). So in essence a race is 
found "piecemeal", by triggering all the necessary components of the 
race, without having to reproduce the race scenario itself! In its 
short existence the lock validator has found and reported many bugs 
before they actually caused a real deadlock.
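A related class of bug that gets caught the same piecemeal way is 
irq-unsafe locking (the forcedeth fix in patch 02 is one example): once 
a lock has ever been taken from hardirq context, any other acquisition 
of that same lock with interrupts enabled is already a potential 
deadlock, even if the interrupt never happens to fire in that window. 
A hypothetical sketch (not real kernel code; it assumes the 2.6.17-era 
irq-handler and timer prototypes):

#include <linux/spinlock.h>
#include <linux/interrupt.h>

/* Hypothetical example: one lock used from both irq and timer context. */
static DEFINE_SPINLOCK(dev_lock);

static irqreturn_t dev_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
	spin_lock(&dev_lock);		/* dev_lock is now hardirq-safe */
	/* ... */
	spin_unlock(&dev_lock);
	return IRQ_HANDLED;
}

static void dev_timer_fn(unsigned long data)
{
	/*
	 * Taken with interrupts enabled: if dev_interrupt() hits this
	 * CPU inside the critical section, it spins on dev_lock forever.
	 * The validator flags the inconsistent usage as soon as both
	 * acquisition contexts have been observed - the interrupt never
	 * has to actually arrive in the window. spin_lock_irqsave()
	 * here would close the race.
	 */
	spin_lock(&dev_lock);
	/* ... */
	spin_unlock(&dev_lock);
}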

To further increase the efficiency of the validator, the mapping is not 
per "lock instance", but per "lock-type". For example, all struct inode 
objects in the kernel have inode->inotify_mutex. If there are 10,000 
inodes cached, then there are 10,000 lock objects. But ->inotify_mutex 
is a single "lock type", and all locking activities that occur against 
->inotify_mutex are "unified" into this single lock-type. The advantage 
of the lock-type approach is that all historical ->inotify_mutex uses 
are mapped into a single (and as narrow as possible) set of locking 
rules - regardless of how many different tasks or inode structures it 
took to build this set of rules. The set of rules persists during the 
lifetime of the kernel.
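As a hypothetical illustration of the instance-vs-type distinction 
(struct my_object and my_object_init() are made up for this example), 
all of the ->lock mutexes below map to a single lock-type, just like 
->inotify_mutex above, no matter how many objects are ever allocated; 
whatever rules are learned against any one instance automatically 
cover all of them:

#include <linux/mutex.h>

/* Hypothetical example: many lock instances, one lock-type. */
struct my_object {
	struct mutex		lock;
	/* ... other fields ... */
};

static void my_object_init(struct my_object *obj)
{
	/*
	 * Every my_object instance is initialized here, so all of the
	 * ->lock mutexes share one lock-type: a locking rule observed
	 * on one instance (say, "taken while holding some other
	 * lock-type") is recorded once, for the whole type.
	 */
	mutex_init(&obj->lock);
}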

To see the rough magnitude of checking that the lock validator does, 
here's a portion of /proc/lockdep_stats, fresh after bootup:

 lock-types:                            694 [max: 2048]
 direct dependencies:                  1598 [max: 8192]
 indirect dependencies:               17896
 all direct dependencies:             16206
 dependency chains:                    1910 [max: 8192]
 in-hardirq chains:                      17
 in-softirq chains:                     105
 in-process chains:                    1065
 stack-trace entries:                 38761 [max: 131072]
 combined max dependencies:         2033928
 hardirq-safe locks:                     24
 hardirq-unsafe locks:                  176
 softirq-safe locks:                     53
 softirq-unsafe locks:                  137
 irq-safe locks:                         59
 irq-unsafe locks:                      176

The lock validator has observed 1598 actual single-thread locking 
patterns, and has validated all possible 2033928 distinct locking 
scenarios.

More details about the design of the lock validator can be found in 
Documentation/lockdep-design.txt, which can also be found at:

   http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt

The patchqueue consists of 61 patches, and the changes are quite 
extensive:

 215 files changed, 7693 insertions(+), 1247 deletions(-)

So be careful when testing.

We only plan to post the queue to lkml this time; we'll try not to 
flood lkml with future releases. The finegrained patch-queue can also 
be seen at:

  http://redhat.com/~mingo/lockdep-patches/patches/

(The series file, with explanations of the splitup categories of the 
patches, can be found attached below.)

The lock validator has been build-tested with allyesconfig, and booted 
on x86 and x86_64. (Other architectures probably don't build/work yet.)

Comments, test-results, bug fixes, and improvements are welcome!

	Ingo


# locking fixes (for bugs found by lockdep), not yet in mainline or -mm:

floppy-release-fix.patch
forcedeth-deadlock-fix.patch

# fixes for upstream that only triggers on lockdep:

sound_oss_emu10k1_midi-fix.patch
mutex-section-bug.patch

# locking subsystem debugging improvements:

warn-once.patch
add-module-address.patch

generic-lock-debugging.patch
locking-selftests.patch

spinlock-init-cleanups.patch
lock-init-improvement.patch
xfs-improve-mrinit-macro.patch

# stacktrace:

x86_64-beautify-stack-backtrace.patch
x86_64-document-stack-backtrace.patch
stacktrace.patch

x86_64-use-stacktrace-for-backtrace.patch

# irq-flags state tracing:

lockdep-fown-fixes.patch
lockdep-sk-callback-lock-fixes.patch
trace-irqflags.patch
trace-irqflags-cleanups-x86.patch
trace-irqflags-cleanups-x86_64.patch
local-irq-enable-in-hardirq.patch

# percpu subsystem feature needed for lockdep:

add-per-cpu-offset.patch

# lockdep subsystem core bits:

lockdep-core.patch
lockdep-proc.patch
lockdep-docs.patch

# make use of lockdep in locking subsystems:

lockdep-prove-rwsems.patch
lockdep-prove-spin_rwlocks.patch
lockdep-prove-mutexes.patch

# lockdep utility patches:

lockdep-print-types-in-sysrq.patch
lockdep-x86_64-early-init.patch
lockdep-i386-alternatives-off.patch
lockdep-printk-recursion.patch
lockdep-disable-nmi-watchdog.patch

# map all the locking details and quirks to lockdep:

lockdep-blockdev.patch
lockdep-direct-io.patch
lockdep-serial.patch
lockdep-dcache.patch
lockdep-namei.patch
lockdep-super.patch
lockdep-futex.patch
lockdep-genirq.patch
lockdep-kgdb.patch
lockdep-completions.patch
lockdep-waitqueue.patch
lockdep-mm.patch
lockdep-slab.patch

lockdep-skb_queue_head_init.patch
lockdep-timer.patch
lockdep-sched.patch
lockdep-hrtimer.patch
lockdep-sock.patch
lockdep-af_unix.patch
lockdep-lock_sock.patch
lockdep-mmap_sem.patch

lockdep-prune_dcache-workaround.patch
lockdep-jbd.patch
lockdep-posix-timers.patch
lockdep-sch_generic.patch
lockdep-xfrm.patch
lockdep-sound-seq-ports.patch

lockdep-enable-Kconfig.patch


* [patch 01/61] lock validator: floppy.c irq-release fix
From: Ingo Molnar @ 2006-05-29 21:22 UTC
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

floppy.c does a lot of irq-unsafe work within 
floppy_release_irq_and_dma(): free_irq(), release_region() ... so when 
executing in irq context, push the whole function into keventd.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 drivers/block/floppy.c |   27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

Index: linux/drivers/block/floppy.c
===================================================================
--- linux.orig/drivers/block/floppy.c
+++ linux/drivers/block/floppy.c
@@ -573,6 +573,21 @@ static int floppy_grab_irq_and_dma(void)
 static void floppy_release_irq_and_dma(void);
 
 /*
+ * Interrupt, DMA and region freeing must not be done from IRQ
+ * context - e.g. irq-unregistration means /proc VFS work, region
+ * release takes an irq-unsafe lock, etc. So we push this work
+ * into keventd:
+ */
+static void fd_release_fn(void *data)
+{
+	mutex_lock(&open_lock);
+	floppy_release_irq_and_dma();
+	mutex_unlock(&open_lock);
+}
+
+static DECLARE_WORK(floppy_release_irq_and_dma_work, fd_release_fn, NULL);
+
+/*
  * The "reset" variable should be tested whenever an interrupt is scheduled,
  * after the commands have been sent. This is to ensure that the driver doesn't
  * get wedged when the interrupt doesn't come because of a failed command.
@@ -836,7 +851,7 @@ static int set_dor(int fdc, char mask, c
 	if (newdor & FLOPPY_MOTOR_MASK)
 		floppy_grab_irq_and_dma();
 	if (olddor & FLOPPY_MOTOR_MASK)
-		floppy_release_irq_and_dma();
+		schedule_work(&floppy_release_irq_and_dma_work);
 	return olddor;
 }
 
@@ -917,6 +932,8 @@ static int _lock_fdc(int drive, int inte
 
 		set_current_state(TASK_RUNNING);
 		remove_wait_queue(&fdc_wait, &wait);
+
+		flush_scheduled_work();
 	}
 	command_status = FD_COMMAND_NONE;
 
@@ -950,7 +967,7 @@ static inline void unlock_fdc(void)
 	if (elv_next_request(floppy_queue))
 		do_fd_request(floppy_queue);
 	spin_unlock_irqrestore(&floppy_lock, flags);
-	floppy_release_irq_and_dma();
+	schedule_work(&floppy_release_irq_and_dma_work);
 	wake_up(&fdc_wait);
 }
 
@@ -4647,6 +4664,12 @@ void cleanup_module(void)
 	del_timer_sync(&fd_timer);
 	blk_cleanup_queue(floppy_queue);
 
+	/*
+	 * Wait for any asynchronous floppy_release_irq_and_dma()
+	 * calls to finish first:
+	 */
+	flush_scheduled_work();
+
 	if (usage_count)
 		floppy_release_irq_and_dma();
 


* [patch 02/61] lock validator: forcedeth.c fix
From: Ingo Molnar @ 2006-05-29 21:23 UTC
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

nv_do_nic_poll() is called from the timer softirq, which runs with 
interrupts enabled, but np->lock might also be taken by some other 
interrupt context.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 drivers/net/forcedeth.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux/drivers/net/forcedeth.c
===================================================================
--- linux.orig/drivers/net/forcedeth.c
+++ linux/drivers/net/forcedeth.c
@@ -2869,6 +2869,7 @@ static void nv_do_nic_poll(unsigned long
 	struct net_device *dev = (struct net_device *) data;
 	struct fe_priv *np = netdev_priv(dev);
 	u8 __iomem *base = get_hwbase(dev);
+	unsigned long flags;
 	u32 mask = 0;
 
 	/*
@@ -2897,10 +2898,9 @@ static void nv_do_nic_poll(unsigned long
 			mask |= NVREG_IRQ_OTHER;
 		}
 	}
+	local_irq_save(flags);
 	np->nic_poll_irq = 0;
 
-	/* FIXME: Do we need synchronize_irq(dev->irq) here? */
-
 	writel(mask, base + NvRegIrqMask);
 	pci_push(base);
 
@@ -2924,6 +2924,7 @@ static void nv_do_nic_poll(unsigned long
 			enable_irq(np->msi_x_entry[NV_MSI_X_VECTOR_OTHER].vector);
 		}
 	}
+	local_irq_restore(flags);
 }
 
 #ifdef CONFIG_NET_POLL_CONTROLLER


* [patch 03/61] lock validator: sound/oss/emu10k1/midi.c cleanup
From: Ingo Molnar @ 2006-05-29 21:23 UTC
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

Move the __attribute((unused)) annotation outside of the 
DEFINE_SPINLOCK() invocation.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 sound/oss/emu10k1/midi.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/sound/oss/emu10k1/midi.c
===================================================================
--- linux.orig/sound/oss/emu10k1/midi.c
+++ linux/sound/oss/emu10k1/midi.c
@@ -45,7 +45,7 @@
 #include "../sound_config.h"
 #endif
 
-static DEFINE_SPINLOCK(midi_spinlock __attribute((unused)));
+static __attribute((unused)) DEFINE_SPINLOCK(midi_spinlock);
 
 static void init_midi_hdr(struct midi_hdr *midihdr)
 {


* [patch 04/61] lock validator: mutex section binutils workaround
From: Ingo Molnar @ 2006-05-29 21:23 UTC
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

Work around a weird section-nesting build bug that causes 
smp-alternatives failures under certain circumstances.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 kernel/mutex.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/kernel/mutex.c
===================================================================
--- linux.orig/kernel/mutex.c
+++ linux/kernel/mutex.c
@@ -309,7 +309,7 @@ static inline int __mutex_trylock_slowpa
  * This function must not be used in interrupt context. The
  * mutex must be released by the same task that acquired it.
  */
-int fastcall mutex_trylock(struct mutex *lock)
+int fastcall __sched mutex_trylock(struct mutex *lock)
 {
 	return __mutex_fastpath_trylock(&lock->count,
 					__mutex_trylock_slowpath);


* [patch 05/61] lock validator: introduce WARN_ON_ONCE(cond)
From: Ingo Molnar @ 2006-05-29 21:23 UTC
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

add WARN_ON_ONCE(cond) to print once-per-bootup messages.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 include/asm-generic/bug.h |   13 +++++++++++++
 1 file changed, 13 insertions(+)

Index: linux/include/asm-generic/bug.h
===================================================================
--- linux.orig/include/asm-generic/bug.h
+++ linux/include/asm-generic/bug.h
@@ -44,4 +44,17 @@
 # define WARN_ON_SMP(x)			do { } while (0)
 #endif
 
+#define WARN_ON_ONCE(condition)				\
+({							\
+	static int __warn_once = 1;			\
+	int __ret = 0;					\
+							\
+	if (unlikely(__warn_once && (condition))) {	\
+		__warn_once = 0;			\
+		WARN_ON(1);				\
+		__ret = 1;				\
+	}						\
+	__ret;						\
+})
+
 #endif


* [patch 06/61] lock validator: add __module_address() method
From: Ingo Molnar @ 2006-05-29 21:23 UTC
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

add __module_address() method - to be used by lockdep.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 include/linux/module.h |    6 ++++++
 kernel/module.c        |   14 ++++++++++++++
 2 files changed, 20 insertions(+)

Index: linux/include/linux/module.h
===================================================================
--- linux.orig/include/linux/module.h
+++ linux/include/linux/module.h
@@ -371,6 +371,7 @@ static inline int module_is_live(struct 
 /* Is this address in a module? (second is with no locks, for oops) */
 struct module *module_text_address(unsigned long addr);
 struct module *__module_text_address(unsigned long addr);
+int __module_address(unsigned long addr);
 
 /* Returns module and fills in value, defined and namebuf, or NULL if
    symnum out of range. */
@@ -509,6 +510,11 @@ static inline struct module *__module_te
 	return NULL;
 }
 
+static inline int __module_address(unsigned long addr)
+{
+	return 0;
+}
+
 /* Get/put a kernel symbol (calls should be symmetric) */
 #define symbol_get(x) ({ extern typeof(x) x __attribute__((weak)); &(x); })
 #define symbol_put(x) do { } while(0)
Index: linux/kernel/module.c
===================================================================
--- linux.orig/kernel/module.c
+++ linux/kernel/module.c
@@ -2222,6 +2222,20 @@ const struct exception_table_entry *sear
 	return e;
 }
 
+/*
+ * Is this a valid module address? We don't grab the lock.
+ */
+int __module_address(unsigned long addr)
+{
+	struct module *mod;
+
+	list_for_each_entry(mod, &modules, list)
+		if (within(addr, mod->module_core, mod->core_size))
+			return 1;
+	return 0;
+}
+
+
 /* Is this a valid kernel address?  We don't grab the lock: we are oopsing. */
 struct module *__module_text_address(unsigned long addr)
 {


* [patch 07/61] lock validator: better lock debugging
From: Ingo Molnar @ 2006-05-29 21:23 UTC
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

generic lock debugging:

 - generalized lock debugging framework. For example, a bug in one lock
   subsystem turns off debugging in all lock subsystems.

 - got rid of the caller address passing from the mutex/rtmutex debugging
   code: it caused way too much prototype hackery, and lockdep will give
   the same information anyway.

 - ability to do silent tests

 - check lock freeing in vfree too.

 - more finegrained debugging options, to allow distributions to
   turn off more expensive debugging features.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 drivers/char/sysrq.c             |    2 
 include/asm-generic/mutex-null.h |   11 -
 include/linux/debug_locks.h      |   62 ++++++++
 include/linux/init_task.h        |    1 
 include/linux/mm.h               |    8 -
 include/linux/mutex-debug.h      |   12 -
 include/linux/mutex.h            |    6 
 include/linux/rtmutex.h          |   10 -
 include/linux/sched.h            |    4 
 init/main.c                      |    9 +
 kernel/exit.c                    |    5 
 kernel/fork.c                    |    4 
 kernel/mutex-debug.c             |  289 +++----------------------------------
 kernel/mutex-debug.h             |   87 +----------
 kernel/mutex.c                   |   83 +++++++---
 kernel/mutex.h                   |   18 --
 kernel/rtmutex-debug.c           |  302 +--------------------------------------
 kernel/rtmutex-debug.h           |    8 -
 kernel/rtmutex.c                 |   45 ++---
 kernel/rtmutex.h                 |    3 
 kernel/sched.c                   |   16 +-
 lib/Kconfig.debug                |   26 ++-
 lib/Makefile                     |    2 
 lib/debug_locks.c                |   45 +++++
 lib/spinlock_debug.c             |   60 +++----
 mm/vmalloc.c                     |    2 
 26 files changed, 329 insertions(+), 791 deletions(-)

Index: linux/drivers/char/sysrq.c
===================================================================
--- linux.orig/drivers/char/sysrq.c
+++ linux/drivers/char/sysrq.c
@@ -152,7 +152,7 @@ static struct sysrq_key_op sysrq_mountro
 static void sysrq_handle_showlocks(int key, struct pt_regs *pt_regs,
 				struct tty_struct *tty)
 {
-	mutex_debug_show_all_locks();
+	debug_show_all_locks();
 }
 static struct sysrq_key_op sysrq_showlocks_op = {
 	.handler	= sysrq_handle_showlocks,
Index: linux/include/asm-generic/mutex-null.h
===================================================================
--- linux.orig/include/asm-generic/mutex-null.h
+++ linux/include/asm-generic/mutex-null.h
@@ -10,14 +10,9 @@
 #ifndef _ASM_GENERIC_MUTEX_NULL_H
 #define _ASM_GENERIC_MUTEX_NULL_H
 
-/* extra parameter only needed for mutex debugging: */
-#ifndef __IP__
-# define __IP__
-#endif
-
-#define __mutex_fastpath_lock(count, fail_fn)	      fail_fn(count __RET_IP__)
-#define __mutex_fastpath_lock_retval(count, fail_fn)  fail_fn(count __RET_IP__)
-#define __mutex_fastpath_unlock(count, fail_fn)       fail_fn(count __RET_IP__)
+#define __mutex_fastpath_lock(count, fail_fn)	      fail_fn(count)
+#define __mutex_fastpath_lock_retval(count, fail_fn)  fail_fn(count)
+#define __mutex_fastpath_unlock(count, fail_fn)       fail_fn(count)
 #define __mutex_fastpath_trylock(count, fail_fn)      fail_fn(count)
 #define __mutex_slowpath_needs_to_unlock()	      1
 
Index: linux/include/linux/debug_locks.h
===================================================================
--- /dev/null
+++ linux/include/linux/debug_locks.h
@@ -0,0 +1,62 @@
+#ifndef __LINUX_DEBUG_LOCKING_H
+#define __LINUX_DEBUG_LOCKING_H
+
+extern int debug_locks;
+extern int debug_locks_silent;
+
+/*
+ * Generic 'turn off all lock debugging' function:
+ */
+extern int debug_locks_off(void);
+
+/*
+ * In the debug case we carry the caller's instruction pointer into
+ * other functions, but we dont want the function argument overhead
+ * in the nondebug case - hence these macros:
+ */
+#define _RET_IP_		(unsigned long)__builtin_return_address(0)
+#define _THIS_IP_  ({ __label__ __here; __here: (unsigned long)&&__here; })
+
+#define DEBUG_WARN_ON(c)						\
+({									\
+	int __ret = 0;							\
+									\
+	if (unlikely(c)) {						\
+		if (debug_locks_off())					\
+			WARN_ON(1);					\
+		__ret = 1;						\
+	}								\
+	__ret;								\
+})
+
+#ifdef CONFIG_SMP
+# define SMP_DEBUG_WARN_ON(c)			DEBUG_WARN_ON(c)
+#else
+# define SMP_DEBUG_WARN_ON(c)			do { } while (0)
+#endif
+
+#ifdef CONFIG_DEBUG_LOCKING_API_SELFTESTS
+  extern void locking_selftest(void);
+#else
+# define locking_selftest()	do { } while (0)
+#endif
+
+static inline void
+debug_check_no_locks_freed(const void *from, unsigned long len)
+{
+}
+
+static inline void
+debug_check_no_locks_held(struct task_struct *task)
+{
+}
+
+static inline void debug_show_all_locks(void)
+{
+}
+
+static inline void debug_show_held_locks(struct task_struct *task)
+{
+}
+
+#endif
Index: linux/include/linux/init_task.h
===================================================================
--- linux.orig/include/linux/init_task.h
+++ linux/include/linux/init_task.h
@@ -133,7 +133,6 @@ extern struct group_info init_groups;
 	.journal_info	= NULL,						\
 	.cpu_timers	= INIT_CPU_TIMERS(tsk.cpu_timers),		\
 	.fs_excl	= ATOMIC_INIT(0),				\
-	INIT_RT_MUTEXES(tsk)						\
 }
 
 
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -14,6 +14,7 @@
 #include <linux/prio_tree.h>
 #include <linux/fs.h>
 #include <linux/mutex.h>
+#include <linux/debug_locks.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -1080,13 +1081,6 @@ static inline void vm_stat_account(struc
 }
 #endif /* CONFIG_PROC_FS */
 
-static inline void
-debug_check_no_locks_freed(const void *from, unsigned long len)
-{
-	mutex_debug_check_no_locks_freed(from, len);
-	rt_mutex_debug_check_no_locks_freed(from, len);
-}
-
 #ifndef CONFIG_DEBUG_PAGEALLOC
 static inline void
 kernel_map_pages(struct page *page, int numpages, int enable)
Index: linux/include/linux/mutex-debug.h
===================================================================
--- linux.orig/include/linux/mutex-debug.h
+++ linux/include/linux/mutex-debug.h
@@ -7,17 +7,11 @@
  * Mutexes - debugging helpers:
  */
 
-#define __DEBUG_MUTEX_INITIALIZER(lockname) \
-	, .held_list = LIST_HEAD_INIT(lockname.held_list), \
-	  .name = #lockname , .magic = &lockname
+#define __DEBUG_MUTEX_INITIALIZER(lockname)				\
+	, .magic = &lockname
 
-#define mutex_init(sem)		__mutex_init(sem, __FUNCTION__)
+#define mutex_init(sem)		__mutex_init(sem, __FILE__":"#sem)
 
 extern void FASTCALL(mutex_destroy(struct mutex *lock));
 
-extern void mutex_debug_show_all_locks(void);
-extern void mutex_debug_show_held_locks(struct task_struct *filter);
-extern void mutex_debug_check_no_locks_held(struct task_struct *task);
-extern void mutex_debug_check_no_locks_freed(const void *from, unsigned long len);
-
 #endif
Index: linux/include/linux/mutex.h
===================================================================
--- linux.orig/include/linux/mutex.h
+++ linux/include/linux/mutex.h
@@ -50,8 +50,6 @@ struct mutex {
 	struct list_head	wait_list;
 #ifdef CONFIG_DEBUG_MUTEXES
 	struct thread_info	*owner;
-	struct list_head	held_list;
-	unsigned long		acquire_ip;
 	const char 		*name;
 	void			*magic;
 #endif
@@ -76,10 +74,6 @@ struct mutex_waiter {
 # define __DEBUG_MUTEX_INITIALIZER(lockname)
 # define mutex_init(mutex)			__mutex_init(mutex, NULL)
 # define mutex_destroy(mutex)				do { } while (0)
-# define mutex_debug_show_all_locks()			do { } while (0)
-# define mutex_debug_show_held_locks(p)			do { } while (0)
-# define mutex_debug_check_no_locks_held(task)		do { } while (0)
-# define mutex_debug_check_no_locks_freed(from, len)	do { } while (0)
 #endif
 
 #define __MUTEX_INITIALIZER(lockname) \
Index: linux/include/linux/rtmutex.h
===================================================================
--- linux.orig/include/linux/rtmutex.h
+++ linux/include/linux/rtmutex.h
@@ -29,8 +29,6 @@ struct rt_mutex {
 	struct task_struct	*owner;
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 	int			save_state;
-	struct list_head	held_list_entry;
-	unsigned long		acquire_ip;
 	const char 		*name, *file;
 	int			line;
 	void			*magic;
@@ -98,14 +96,6 @@ extern int rt_mutex_trylock(struct rt_mu
 
 extern void rt_mutex_unlock(struct rt_mutex *lock);
 
-#ifdef CONFIG_DEBUG_RT_MUTEXES
-# define INIT_RT_MUTEX_DEBUG(tsk)					\
-	.held_list_head	= LIST_HEAD_INIT(tsk.held_list_head),		\
-	.held_list_lock	= SPIN_LOCK_UNLOCKED
-#else
-# define INIT_RT_MUTEX_DEBUG(tsk)
-#endif
-
 #ifdef CONFIG_RT_MUTEXES
 # define INIT_RT_MUTEXES(tsk)						\
 	.pi_waiters	= PLIST_HEAD_INIT(tsk.pi_waiters, tsk.pi_lock),	\
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -910,10 +910,6 @@ struct task_struct {
 	struct plist_head pi_waiters;
 	/* Deadlock detection and priority inheritance handling */
 	struct rt_mutex_waiter *pi_blocked_on;
-# ifdef CONFIG_DEBUG_RT_MUTEXES
-	spinlock_t held_list_lock;
-	struct list_head held_list_head;
-# endif
 #endif
 
 #ifdef CONFIG_DEBUG_MUTEXES
Index: linux/init/main.c
===================================================================
--- linux.orig/init/main.c
+++ linux/init/main.c
@@ -53,6 +53,7 @@
 #include <linux/key.h>
 #include <linux/root_dev.h>
 #include <linux/buffer_head.h>
+#include <linux/debug_locks.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -512,6 +513,14 @@ asmlinkage void __init start_kernel(void
 		panic(panic_later, panic_param);
 	profile_init();
 	local_irq_enable();
+
+	/*
+	 * Need to run this when irqs are enabled, because it wants
+	 * to self-test [hard/soft]-irqs on/off lock inversion bugs
+	 * too:
+	 */
+	locking_selftest();
+
 #ifdef CONFIG_BLK_DEV_INITRD
 	if (initrd_start && !initrd_below_start_ok &&
 			initrd_start < min_low_pfn << PAGE_SHIFT) {
Index: linux/kernel/exit.c
===================================================================
--- linux.orig/kernel/exit.c
+++ linux/kernel/exit.c
@@ -952,10 +952,9 @@ fastcall NORET_TYPE void do_exit(long co
 	if (unlikely(current->pi_state_cache))
 		kfree(current->pi_state_cache);
 	/*
-	 * If DEBUG_MUTEXES is on, make sure we are holding no locks:
+	 * Make sure we are holding no locks:
 	 */
-	mutex_debug_check_no_locks_held(tsk);
-	rt_mutex_debug_check_no_locks_held(tsk);
+	debug_check_no_locks_held(tsk);
 
 	if (tsk->io_context)
 		exit_io_context();
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c
+++ linux/kernel/fork.c
@@ -921,10 +921,6 @@ static inline void rt_mutex_init_task(st
 	spin_lock_init(&p->pi_lock);
 	plist_head_init(&p->pi_waiters, &p->pi_lock);
 	p->pi_blocked_on = NULL;
-# ifdef CONFIG_DEBUG_RT_MUTEXES
-	spin_lock_init(&p->held_list_lock);
-	INIT_LIST_HEAD(&p->held_list_head);
-# endif
 #endif
 }
 
Index: linux/kernel/mutex-debug.c
===================================================================
--- linux.orig/kernel/mutex-debug.c
+++ linux/kernel/mutex-debug.c
@@ -19,37 +19,10 @@
 #include <linux/spinlock.h>
 #include <linux/kallsyms.h>
 #include <linux/interrupt.h>
+#include <linux/debug_locks.h>
 
 #include "mutex-debug.h"
 
-/*
- * We need a global lock when we walk through the multi-process
- * lock tree. Only used in the deadlock-debugging case.
- */
-DEFINE_SPINLOCK(debug_mutex_lock);
-
-/*
- * All locks held by all tasks, in a single global list:
- */
-LIST_HEAD(debug_mutex_held_locks);
-
-/*
- * In the debug case we carry the caller's instruction pointer into
- * other functions, but we dont want the function argument overhead
- * in the nondebug case - hence these macros:
- */
-#define __IP_DECL__		, unsigned long ip
-#define __IP__			, ip
-#define __RET_IP__		, (unsigned long)__builtin_return_address(0)
-
-/*
- * "mutex debugging enabled" flag. We turn it off when we detect
- * the first problem because we dont want to recurse back
- * into the tracing code when doing error printk or
- * executing a BUG():
- */
-int debug_mutex_on = 1;
-
 static void printk_task(struct task_struct *p)
 {
 	if (p)
@@ -66,157 +39,28 @@ static void printk_ti(struct thread_info
 		printk("<none>");
 }
 
-static void printk_task_short(struct task_struct *p)
-{
-	if (p)
-		printk("%s/%d [%p, %3d]", p->comm, p->pid, p, p->prio);
-	else
-		printk("<none>");
-}
-
 static void printk_lock(struct mutex *lock, int print_owner)
 {
-	printk(" [%p] {%s}\n", lock, lock->name);
+#ifdef CONFIG_PROVE_MUTEX_LOCKING
+	printk(" [%p] {%s}\n", lock, lock->dep_map.name);
+#else
+	printk(" [%p]\n", lock);
+#endif
 
 	if (print_owner && lock->owner) {
 		printk(".. held by:  ");
 		printk_ti(lock->owner);
 		printk("\n");
 	}
-	if (lock->owner) {
-		printk("... acquired at:               ");
-		print_symbol("%s\n", lock->acquire_ip);
-	}
-}
-
-/*
- * printk locks held by a task:
- */
-static void show_task_locks(struct task_struct *p)
-{
-	switch (p->state) {
-	case TASK_RUNNING:		printk("R"); break;
-	case TASK_INTERRUPTIBLE:	printk("S"); break;
-	case TASK_UNINTERRUPTIBLE:	printk("D"); break;
-	case TASK_STOPPED:		printk("T"); break;
-	case EXIT_ZOMBIE:		printk("Z"); break;
-	case EXIT_DEAD:			printk("X"); break;
-	default:			printk("?"); break;
-	}
-	printk_task(p);
-	if (p->blocked_on) {
-		struct mutex *lock = p->blocked_on->lock;
-
-		printk(" blocked on mutex:");
-		printk_lock(lock, 1);
-	} else
-		printk(" (not blocked on mutex)\n");
-}
-
-/*
- * printk all locks held in the system (if filter == NULL),
- * or all locks belonging to a single task (if filter != NULL):
- */
-void show_held_locks(struct task_struct *filter)
-{
-	struct list_head *curr, *cursor = NULL;
-	struct mutex *lock;
-	struct thread_info *t;
-	unsigned long flags;
-	int count = 0;
-
-	if (filter) {
-		printk("------------------------------\n");
-		printk("| showing all locks held by: |  (");
-		printk_task_short(filter);
-		printk("):\n");
-		printk("------------------------------\n");
-	} else {
-		printk("---------------------------\n");
-		printk("| showing all locks held: |\n");
-		printk("---------------------------\n");
-	}
-
-	/*
-	 * Play safe and acquire the global trace lock. We
-	 * cannot printk with that lock held so we iterate
-	 * very carefully:
-	 */
-next:
-	debug_spin_lock_save(&debug_mutex_lock, flags);
-	list_for_each(curr, &debug_mutex_held_locks) {
-		if (cursor && curr != cursor)
-			continue;
-		lock = list_entry(curr, struct mutex, held_list);
-		t = lock->owner;
-		if (filter && (t != filter->thread_info))
-			continue;
-		count++;
-		cursor = curr->next;
-		debug_spin_unlock_restore(&debug_mutex_lock, flags);
-
-		printk("\n#%03d:            ", count);
-		printk_lock(lock, filter ? 0 : 1);
-		goto next;
-	}
-	debug_spin_unlock_restore(&debug_mutex_lock, flags);
-	printk("\n");
-}
-
-void mutex_debug_show_all_locks(void)
-{
-	struct task_struct *g, *p;
-	int count = 10;
-	int unlock = 1;
-
-	printk("\nShowing all blocking locks in the system:\n");
-
-	/*
-	 * Here we try to get the tasklist_lock as hard as possible,
-	 * if not successful after 2 seconds we ignore it (but keep
-	 * trying). This is to enable a debug printout even if a
-	 * tasklist_lock-holding task deadlocks or crashes.
-	 */
-retry:
-	if (!read_trylock(&tasklist_lock)) {
-		if (count == 10)
-			printk("hm, tasklist_lock locked, retrying... ");
-		if (count) {
-			count--;
-			printk(" #%d", 10-count);
-			mdelay(200);
-			goto retry;
-		}
-		printk(" ignoring it.\n");
-		unlock = 0;
-	}
-	if (count != 10)
-		printk(" locked it.\n");
-
-	do_each_thread(g, p) {
-		show_task_locks(p);
-		if (!unlock)
-			if (read_trylock(&tasklist_lock))
-				unlock = 1;
-	} while_each_thread(g, p);
-
-	printk("\n");
-	show_held_locks(NULL);
-	printk("=============================================\n\n");
-
-	if (unlock)
-		read_unlock(&tasklist_lock);
 }
 
 static void report_deadlock(struct task_struct *task, struct mutex *lock,
-			    struct mutex *lockblk, unsigned long ip)
+			    struct mutex *lockblk)
 {
 	printk("\n%s/%d is trying to acquire this lock:\n",
 		current->comm, current->pid);
 	printk_lock(lock, 1);
-	printk("... trying at:                 ");
-	print_symbol("%s\n", ip);
-	show_held_locks(current);
+	debug_show_held_locks(current);
 
 	if (lockblk) {
 		printk("but %s/%d is deadlocking current task %s/%d!\n\n",
@@ -225,7 +69,7 @@ static void report_deadlock(struct task_
 			task->comm, task->pid);
 		printk_lock(lockblk, 1);
 
-		show_held_locks(task);
+		debug_show_held_locks(task);
 
 		printk("\n%s/%d's [blocked] stackdump:\n\n",
 			task->comm, task->pid);
@@ -235,7 +79,7 @@ static void report_deadlock(struct task_
 	printk("\n%s/%d's [current] stackdump:\n\n",
 		current->comm, current->pid);
 	dump_stack();
-	mutex_debug_show_all_locks();
+	debug_show_all_locks();
 	printk("[ turning off deadlock detection. Please report this. ]\n\n");
 	local_irq_disable();
 }
@@ -243,13 +87,12 @@ static void report_deadlock(struct task_
 /*
  * Recursively check for mutex deadlocks:
  */
-static int check_deadlock(struct mutex *lock, int depth,
-			  struct thread_info *ti, unsigned long ip)
+static int check_deadlock(struct mutex *lock, int depth, struct thread_info *ti)
 {
 	struct mutex *lockblk;
 	struct task_struct *task;
 
-	if (!debug_mutex_on)
+	if (!debug_locks)
 		return 0;
 
 	ti = lock->owner;
@@ -263,123 +106,46 @@ static int check_deadlock(struct mutex *
 
 	/* Self-deadlock: */
 	if (current == task) {
-		DEBUG_OFF();
+		debug_locks_off();
 		if (depth)
 			return 1;
 		printk("\n==========================================\n");
 		printk(  "[ BUG: lock recursion deadlock detected! |\n");
 		printk(  "------------------------------------------\n");
-		report_deadlock(task, lock, NULL, ip);
+		report_deadlock(task, lock, NULL);
 		return 0;
 	}
 
 	/* Ugh, something corrupted the lock data structure? */
 	if (depth > 20) {
-		DEBUG_OFF();
+		debug_locks_off();
 		printk("\n===========================================\n");
 		printk(  "[ BUG: infinite lock dependency detected!? |\n");
 		printk(  "-------------------------------------------\n");
-		report_deadlock(task, lock, lockblk, ip);
+		report_deadlock(task, lock, lockblk);
 		return 0;
 	}
 
 	/* Recursively check for dependencies: */
-	if (lockblk && check_deadlock(lockblk, depth+1, ti, ip)) {
+	if (lockblk && check_deadlock(lockblk, depth+1, ti)) {
 		printk("\n============================================\n");
 		printk(  "[ BUG: circular locking deadlock detected! ]\n");
 		printk(  "--------------------------------------------\n");
-		report_deadlock(task, lock, lockblk, ip);
+		report_deadlock(task, lock, lockblk);
 		return 0;
 	}
 	return 0;
 }
 
 /*
- * Called when a task exits, this function checks whether the
- * task is holding any locks, and reports the first one if so:
- */
-void mutex_debug_check_no_locks_held(struct task_struct *task)
-{
-	struct list_head *curr, *next;
-	struct thread_info *t;
-	unsigned long flags;
-	struct mutex *lock;
-
-	if (!debug_mutex_on)
-		return;
-
-	debug_spin_lock_save(&debug_mutex_lock, flags);
-	list_for_each_safe(curr, next, &debug_mutex_held_locks) {
-		lock = list_entry(curr, struct mutex, held_list);
-		t = lock->owner;
-		if (t != task->thread_info)
-			continue;
-		list_del_init(curr);
-		DEBUG_OFF();
-		debug_spin_unlock_restore(&debug_mutex_lock, flags);
-
-		printk("BUG: %s/%d, lock held at task exit time!\n",
-			task->comm, task->pid);
-		printk_lock(lock, 1);
-		if (lock->owner != task->thread_info)
-			printk("exiting task is not even the owner??\n");
-		return;
-	}
-	debug_spin_unlock_restore(&debug_mutex_lock, flags);
-}
-
-/*
- * Called when kernel memory is freed (or unmapped), or if a mutex
- * is destroyed or reinitialized - this code checks whether there is
- * any held lock in the memory range of <from> to <to>:
- */
-void mutex_debug_check_no_locks_freed(const void *from, unsigned long len)
-{
-	struct list_head *curr, *next;
-	const void *to = from + len;
-	unsigned long flags;
-	struct mutex *lock;
-	void *lock_addr;
-
-	if (!debug_mutex_on)
-		return;
-
-	debug_spin_lock_save(&debug_mutex_lock, flags);
-	list_for_each_safe(curr, next, &debug_mutex_held_locks) {
-		lock = list_entry(curr, struct mutex, held_list);
-		lock_addr = lock;
-		if (lock_addr < from || lock_addr >= to)
-			continue;
-		list_del_init(curr);
-		DEBUG_OFF();
-		debug_spin_unlock_restore(&debug_mutex_lock, flags);
-
-		printk("BUG: %s/%d, active lock [%p(%p-%p)] freed!\n",
-			current->comm, current->pid, lock, from, to);
-		dump_stack();
-		printk_lock(lock, 1);
-		if (lock->owner != current_thread_info())
-			printk("freeing task is not even the owner??\n");
-		return;
-	}
-	debug_spin_unlock_restore(&debug_mutex_lock, flags);
-}
-
-/*
  * Must be called with lock->wait_lock held.
  */
-void debug_mutex_set_owner(struct mutex *lock,
-			   struct thread_info *new_owner __IP_DECL__)
+void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner)
 {
 	lock->owner = new_owner;
-	DEBUG_WARN_ON(!list_empty(&lock->held_list));
-	if (debug_mutex_on) {
-		list_add_tail(&lock->held_list, &debug_mutex_held_locks);
-		lock->acquire_ip = ip;
-	}
 }
 
-void debug_mutex_init_waiter(struct mutex_waiter *waiter)
+void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 {
 	memset(waiter, 0x11, sizeof(*waiter));
 	waiter->magic = waiter;
@@ -401,10 +167,12 @@ void debug_mutex_free_waiter(struct mute
 }
 
 void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
-			    struct thread_info *ti __IP_DECL__)
+			    struct thread_info *ti)
 {
 	SMP_DEBUG_WARN_ON(!spin_is_locked(&lock->wait_lock));
-	check_deadlock(lock, 0, ti, ip);
+#ifdef CONFIG_DEBUG_MUTEX_DEADLOCKS
+	check_deadlock(lock, 0, ti);
+#endif
 	/* Mark the current thread as blocked on the lock: */
 	ti->task->blocked_on = waiter;
 	waiter->lock = lock;
@@ -424,13 +192,10 @@ void mutex_remove_waiter(struct mutex *l
 
 void debug_mutex_unlock(struct mutex *lock)
 {
+	DEBUG_WARN_ON(lock->owner != current_thread_info());
 	DEBUG_WARN_ON(lock->magic != lock);
 	DEBUG_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
 	DEBUG_WARN_ON(lock->owner != current_thread_info());
-	if (debug_mutex_on) {
-		DEBUG_WARN_ON(list_empty(&lock->held_list));
-		list_del_init(&lock->held_list);
-	}
 }
 
 void debug_mutex_init(struct mutex *lock, const char *name)
@@ -438,10 +203,8 @@ void debug_mutex_init(struct mutex *lock
 	/*
 	 * Make sure we are not reinitializing a held lock:
 	 */
-	mutex_debug_check_no_locks_freed((void *)lock, sizeof(*lock));
+	debug_check_no_locks_freed((void *)lock, sizeof(*lock));
 	lock->owner = NULL;
-	INIT_LIST_HEAD(&lock->held_list);
-	lock->name = name;
 	lock->magic = lock;
 }
 
Index: linux/kernel/mutex-debug.h
===================================================================
--- linux.orig/kernel/mutex-debug.h
+++ linux/kernel/mutex-debug.h
@@ -10,110 +10,43 @@
  * More details are in kernel/mutex-debug.c.
  */
 
-extern spinlock_t debug_mutex_lock;
-extern struct list_head debug_mutex_held_locks;
-extern int debug_mutex_on;
-
-/*
- * In the debug case we carry the caller's instruction pointer into
- * other functions, but we dont want the function argument overhead
- * in the nondebug case - hence these macros:
- */
-#define __IP_DECL__		, unsigned long ip
-#define __IP__			, ip
-#define __RET_IP__		, (unsigned long)__builtin_return_address(0)
-
 /*
  * This must be called with lock->wait_lock held.
  */
-extern void debug_mutex_set_owner(struct mutex *lock,
-				  struct thread_info *new_owner __IP_DECL__);
+extern void
+debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner);
 
 static inline void debug_mutex_clear_owner(struct mutex *lock)
 {
 	lock->owner = NULL;
 }
 
-extern void debug_mutex_init_waiter(struct mutex_waiter *waiter);
+extern void debug_mutex_lock_common(struct mutex *lock,
+				    struct mutex_waiter *waiter);
 extern void debug_mutex_wake_waiter(struct mutex *lock,
 				    struct mutex_waiter *waiter);
 extern void debug_mutex_free_waiter(struct mutex_waiter *waiter);
 extern void debug_mutex_add_waiter(struct mutex *lock,
 				   struct mutex_waiter *waiter,
-				   struct thread_info *ti __IP_DECL__);
+				   struct thread_info *ti);
 extern void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 				struct thread_info *ti);
 extern void debug_mutex_unlock(struct mutex *lock);
 extern void debug_mutex_init(struct mutex *lock, const char *name);
 
-#define debug_spin_lock_save(lock, flags)		\
-	do {						\
-		local_irq_save(flags);			\
-		if (debug_mutex_on)			\
-			spin_lock(lock);		\
-	} while (0)
-
-#define debug_spin_unlock_restore(lock, flags)		\
-	do {						\
-		if (debug_mutex_on)			\
-			spin_unlock(lock);		\
-		local_irq_restore(flags);		\
-		preempt_check_resched();		\
-	} while (0)
-
 #define spin_lock_mutex(lock, flags)			\
 	do {						\
 		struct mutex *l = container_of(lock, struct mutex, wait_lock); \
 							\
 		DEBUG_WARN_ON(in_interrupt());		\
-		debug_spin_lock_save(&debug_mutex_lock, flags); \
-		spin_lock(lock);			\
+		local_irq_save(flags);			\
+		__raw_spin_lock(&(lock)->raw_lock);	\
 		DEBUG_WARN_ON(l->magic != l);		\
 	} while (0)
 
 #define spin_unlock_mutex(lock, flags)			\
 	do {						\
-		spin_unlock(lock);			\
-		debug_spin_unlock_restore(&debug_mutex_lock, flags);	\
+		__raw_spin_unlock(&(lock)->raw_lock);	\
+		local_irq_restore(flags);		\
+		preempt_check_resched();		\
 	} while (0)
-
-#define DEBUG_OFF()					\
-do {							\
-	if (debug_mutex_on) {				\
-		debug_mutex_on = 0;			\
-		console_verbose();			\
-		if (spin_is_locked(&debug_mutex_lock))	\
-			spin_unlock(&debug_mutex_lock);	\
-	}						\
-} while (0)
-
-#define DEBUG_BUG()					\
-do {							\
-	if (debug_mutex_on) {				\
-		DEBUG_OFF();				\
-		BUG();					\
-	}						\
-} while (0)
-
-#define DEBUG_WARN_ON(c)				\
-do {							\
-	if (unlikely(c && debug_mutex_on)) {		\
-		DEBUG_OFF();				\
-		WARN_ON(1);				\
-	}						\
-} while (0)
-
-# define DEBUG_BUG_ON(c)				\
-do {							\
-	if (unlikely(c))				\
-		DEBUG_BUG();				\
-} while (0)
-
-#ifdef CONFIG_SMP
-# define SMP_DEBUG_WARN_ON(c)			DEBUG_WARN_ON(c)
-# define SMP_DEBUG_BUG_ON(c)			DEBUG_BUG_ON(c)
-#else
-# define SMP_DEBUG_WARN_ON(c)			do { } while (0)
-# define SMP_DEBUG_BUG_ON(c)			do { } while (0)
-#endif
-
Index: linux/kernel/mutex.c
===================================================================
--- linux.orig/kernel/mutex.c
+++ linux/kernel/mutex.c
@@ -17,6 +17,7 @@
 #include <linux/module.h>
 #include <linux/spinlock.h>
 #include <linux/interrupt.h>
+#include <linux/debug_locks.h>
 
 /*
  * In the DEBUG case we are using the "NULL fastpath" for mutexes,
@@ -38,7 +39,7 @@
  *
  * It is not allowed to initialize an already locked mutex.
  */
-void fastcall __mutex_init(struct mutex *lock, const char *name)
+__always_inline void fastcall __mutex_init(struct mutex *lock, const char *name)
 {
 	atomic_set(&lock->count, 1);
 	spin_lock_init(&lock->wait_lock);
@@ -56,7 +57,7 @@ EXPORT_SYMBOL(__mutex_init);
  * branch is predicted by the CPU as default-untaken.
  */
 static void fastcall noinline __sched
-__mutex_lock_slowpath(atomic_t *lock_count __IP_DECL__);
+__mutex_lock_slowpath(atomic_t *lock_count);
 
 /***
  * mutex_lock - acquire the mutex
@@ -79,7 +80,7 @@ __mutex_lock_slowpath(atomic_t *lock_cou
  *
  * This function is similar to (but not equivalent to) down().
  */
-void fastcall __sched mutex_lock(struct mutex *lock)
+void inline fastcall __sched mutex_lock(struct mutex *lock)
 {
 	might_sleep();
 	/*
@@ -92,7 +93,7 @@ void fastcall __sched mutex_lock(struct 
 EXPORT_SYMBOL(mutex_lock);
 
 static void fastcall noinline __sched
-__mutex_unlock_slowpath(atomic_t *lock_count __IP_DECL__);
+__mutex_unlock_slowpath(atomic_t *lock_count);
 
 /***
  * mutex_unlock - release the mutex
@@ -116,22 +117,36 @@ void fastcall __sched mutex_unlock(struc
 
 EXPORT_SYMBOL(mutex_unlock);
 
+static void fastcall noinline __sched
+__mutex_unlock_non_nested_slowpath(atomic_t *lock_count);
+
+void fastcall __sched mutex_unlock_non_nested(struct mutex *lock)
+{
+	/*
+	 * The unlocking fastpath is the 0->1 transition from 'locked'
+	 * into 'unlocked' state:
+	 */
+	__mutex_fastpath_unlock(&lock->count, __mutex_unlock_non_nested_slowpath);
+}
+
+EXPORT_SYMBOL(mutex_unlock_non_nested);
+
+
 /*
  * Lock a mutex (possibly interruptible), slowpath:
  */
 static inline int __sched
-__mutex_lock_common(struct mutex *lock, long state __IP_DECL__)
+__mutex_lock_common(struct mutex *lock, long state, unsigned int subtype)
 {
 	struct task_struct *task = current;
 	struct mutex_waiter waiter;
 	unsigned int old_val;
 	unsigned long flags;
 
-	debug_mutex_init_waiter(&waiter);
-
 	spin_lock_mutex(&lock->wait_lock, flags);
 
-	debug_mutex_add_waiter(lock, &waiter, task->thread_info, ip);
+	debug_mutex_lock_common(lock, &waiter);
+	debug_mutex_add_waiter(lock, &waiter, task->thread_info);
 
 	/* add waiting tasks to the end of the waitqueue (FIFO): */
 	list_add_tail(&waiter.list, &lock->wait_list);
@@ -173,7 +188,7 @@ __mutex_lock_common(struct mutex *lock, 
 
 	/* got the lock - rejoice! */
 	mutex_remove_waiter(lock, &waiter, task->thread_info);
-	debug_mutex_set_owner(lock, task->thread_info __IP__);
+	debug_mutex_set_owner(lock, task->thread_info);
 
 	/* set it to 0 if there are no waiters left: */
 	if (likely(list_empty(&lock->wait_list)))
@@ -183,32 +198,41 @@ __mutex_lock_common(struct mutex *lock, 
 
 	debug_mutex_free_waiter(&waiter);
 
-	DEBUG_WARN_ON(list_empty(&lock->held_list));
 	DEBUG_WARN_ON(lock->owner != task->thread_info);
 
 	return 0;
 }
 
 static void fastcall noinline __sched
-__mutex_lock_slowpath(atomic_t *lock_count __IP_DECL__)
+__mutex_lock_slowpath(atomic_t *lock_count)
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 
-	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE __IP__);
+	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0);
 }
 
+#ifdef CONFIG_DEBUG_MUTEXES
+void __sched
+mutex_lock_nested(struct mutex *lock, unsigned int subtype)
+{
+	might_sleep();
+	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, subtype);
+}
+
+EXPORT_SYMBOL_GPL(mutex_lock_nested);
+#endif
+
 /*
  * Release the lock, slowpath:
  */
-static fastcall noinline void
-__mutex_unlock_slowpath(atomic_t *lock_count __IP_DECL__)
+static fastcall inline void
+__mutex_unlock_common_slowpath(atomic_t *lock_count, int nested)
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 	unsigned long flags;
 
-	DEBUG_WARN_ON(lock->owner != current_thread_info());
-
 	spin_lock_mutex(&lock->wait_lock, flags);
+	debug_mutex_unlock(lock);
 
 	/*
 	 * some architectures leave the lock unlocked in the fastpath failure
@@ -218,8 +242,6 @@ __mutex_unlock_slowpath(atomic_t *lock_c
 	if (__mutex_slowpath_needs_to_unlock())
 		atomic_set(&lock->count, 1);
 
-	debug_mutex_unlock(lock);
-
 	if (!list_empty(&lock->wait_list)) {
 		/* get the first entry from the wait-list: */
 		struct mutex_waiter *waiter =
@@ -237,11 +259,27 @@ __mutex_unlock_slowpath(atomic_t *lock_c
 }
 
 /*
+ * Release the lock, slowpath:
+ */
+static fastcall noinline void
+__mutex_unlock_slowpath(atomic_t *lock_count)
+{
+	__mutex_unlock_common_slowpath(lock_count, 1);
+}
+
+static fastcall noinline void
+__mutex_unlock_non_nested_slowpath(atomic_t *lock_count)
+{
+	__mutex_unlock_common_slowpath(lock_count, 0);
+}
+
+
+/*
  * Here come the less common (and hence less performance-critical) APIs:
  * mutex_lock_interruptible() and mutex_trylock().
  */
 static int fastcall noinline __sched
-__mutex_lock_interruptible_slowpath(atomic_t *lock_count __IP_DECL__);
+__mutex_lock_interruptible_slowpath(atomic_t *lock_count);
 
 /***
  * mutex_lock_interruptible - acquire the mutex, interruptable
@@ -264,11 +302,11 @@ int fastcall __sched mutex_lock_interrup
 EXPORT_SYMBOL(mutex_lock_interruptible);
 
 static int fastcall noinline __sched
-__mutex_lock_interruptible_slowpath(atomic_t *lock_count __IP_DECL__)
+__mutex_lock_interruptible_slowpath(atomic_t *lock_count)
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 
-	return __mutex_lock_common(lock, TASK_INTERRUPTIBLE __IP__);
+	return __mutex_lock_common(lock, TASK_INTERRUPTIBLE, 0);
 }
 
 /*
@@ -285,7 +323,8 @@ static inline int __mutex_trylock_slowpa
 
 	prev = atomic_xchg(&lock->count, -1);
 	if (likely(prev == 1))
-		debug_mutex_set_owner(lock, current_thread_info() __RET_IP__);
+		debug_mutex_set_owner(lock, current_thread_info());
+
 	/* Set it back to 0 if there are no waiters: */
 	if (likely(list_empty(&lock->wait_list)))
 		atomic_set(&lock->count, 0);
Index: linux/kernel/mutex.h
===================================================================
--- linux.orig/kernel/mutex.h
+++ linux/kernel/mutex.h
@@ -19,19 +19,15 @@
 #define DEBUG_WARN_ON(c)				do { } while (0)
 #define debug_mutex_set_owner(lock, new_owner)		do { } while (0)
 #define debug_mutex_clear_owner(lock)			do { } while (0)
-#define debug_mutex_init_waiter(waiter)			do { } while (0)
 #define debug_mutex_wake_waiter(lock, waiter)		do { } while (0)
 #define debug_mutex_free_waiter(waiter)			do { } while (0)
-#define debug_mutex_add_waiter(lock, waiter, ti, ip)	do { } while (0)
+#define debug_mutex_add_waiter(lock, waiter, ti)	do { } while (0)
+#define mutex_acquire(lock, subtype, trylock)	do { } while (0)
+#define mutex_release(lock, nested)		do { } while (0)
 #define debug_mutex_unlock(lock)			do { } while (0)
 #define debug_mutex_init(lock, name)			do { } while (0)
 
-/*
- * Return-address parameters/declarations. They are very useful for
- * debugging, but add overhead in the !DEBUG case - so we go the
- * trouble of using this not too elegant but zero-cost solution:
- */
-#define __IP_DECL__
-#define __IP__
-#define __RET_IP__
-
+static inline void
+debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
+{
+}
Index: linux/kernel/rtmutex-debug.c
===================================================================
--- linux.orig/kernel/rtmutex-debug.c
+++ linux/kernel/rtmutex-debug.c
@@ -26,6 +26,7 @@
 #include <linux/interrupt.h>
 #include <linux/plist.h>
 #include <linux/fs.h>
+#include <linux/debug_locks.h>
 
 #include "rtmutex_common.h"
 
@@ -45,8 +46,6 @@ do {								\
 		console_verbose();				\
 		if (spin_is_locked(&current->pi_lock))		\
 			spin_unlock(&current->pi_lock);		\
-		if (spin_is_locked(&current->held_list_lock))	\
-			spin_unlock(&current->held_list_lock);	\
 	}							\
 } while (0)
 
@@ -105,14 +104,6 @@ static void printk_task(task_t *p)
 		printk("<none>");
 }
 
-static void printk_task_short(task_t *p)
-{
-	if (p)
-		printk("%s/%d [%p, %3d]", p->comm, p->pid, p, p->prio);
-	else
-		printk("<none>");
-}
-
 static void printk_lock(struct rt_mutex *lock, int print_owner)
 {
 	if (lock->name)
@@ -128,222 +119,6 @@ static void printk_lock(struct rt_mutex 
 		printk_task(rt_mutex_owner(lock));
 		printk("\n");
 	}
-	if (rt_mutex_owner(lock)) {
-		printk("... acquired at:               ");
-		print_symbol("%s\n", lock->acquire_ip);
-	}
-}
-
-static void printk_waiter(struct rt_mutex_waiter *w)
-{
-	printk("-------------------------\n");
-	printk("| waiter struct %p:\n", w);
-	printk("| w->list_entry: [DP:%p/%p|SP:%p/%p|PRI:%d]\n",
-	       w->list_entry.plist.prio_list.prev, w->list_entry.plist.prio_list.next,
-	       w->list_entry.plist.node_list.prev, w->list_entry.plist.node_list.next,
-	       w->list_entry.prio);
-	printk("| w->pi_list_entry: [DP:%p/%p|SP:%p/%p|PRI:%d]\n",
-	       w->pi_list_entry.plist.prio_list.prev, w->pi_list_entry.plist.prio_list.next,
-	       w->pi_list_entry.plist.node_list.prev, w->pi_list_entry.plist.node_list.next,
-	       w->pi_list_entry.prio);
-	printk("\n| lock:\n");
-	printk_lock(w->lock, 1);
-	printk("| w->ti->task:\n");
-	printk_task(w->task);
-	printk("| blocked at:  ");
-	print_symbol("%s\n", w->ip);
-	printk("-------------------------\n");
-}
-
-static void show_task_locks(task_t *p)
-{
-	switch (p->state) {
-	case TASK_RUNNING:		printk("R"); break;
-	case TASK_INTERRUPTIBLE:	printk("S"); break;
-	case TASK_UNINTERRUPTIBLE:	printk("D"); break;
-	case TASK_STOPPED:		printk("T"); break;
-	case EXIT_ZOMBIE:		printk("Z"); break;
-	case EXIT_DEAD:			printk("X"); break;
-	default:			printk("?"); break;
-	}
-	printk_task(p);
-	if (p->pi_blocked_on) {
-		struct rt_mutex *lock = p->pi_blocked_on->lock;
-
-		printk(" blocked on:");
-		printk_lock(lock, 1);
-	} else
-		printk(" (not blocked)\n");
-}
-
-void rt_mutex_show_held_locks(task_t *task, int verbose)
-{
-	struct list_head *curr, *cursor = NULL;
-	struct rt_mutex *lock;
-	task_t *t;
-	unsigned long flags;
-	int count = 0;
-
-	if (!rt_trace_on)
-		return;
-
-	if (verbose) {
-		printk("------------------------------\n");
-		printk("| showing all locks held by: |  (");
-		printk_task_short(task);
-		printk("):\n");
-		printk("------------------------------\n");
-	}
-
-next:
-	spin_lock_irqsave(&task->held_list_lock, flags);
-	list_for_each(curr, &task->held_list_head) {
-		if (cursor && curr != cursor)
-			continue;
-		lock = list_entry(curr, struct rt_mutex, held_list_entry);
-		t = rt_mutex_owner(lock);
-		WARN_ON(t != task);
-		count++;
-		cursor = curr->next;
-		spin_unlock_irqrestore(&task->held_list_lock, flags);
-
-		printk("\n#%03d:            ", count);
-		printk_lock(lock, 0);
-		goto next;
-	}
-	spin_unlock_irqrestore(&task->held_list_lock, flags);
-
-	printk("\n");
-}
-
-void rt_mutex_show_all_locks(void)
-{
-	task_t *g, *p;
-	int count = 10;
-	int unlock = 1;
-
-	printk("\n");
-	printk("----------------------\n");
-	printk("| showing all tasks: |\n");
-	printk("----------------------\n");
-
-	/*
-	 * Here we try to get the tasklist_lock as hard as possible,
-	 * if not successful after 2 seconds we ignore it (but keep
-	 * trying). This is to enable a debug printout even if a
-	 * tasklist_lock-holding task deadlocks or crashes.
-	 */
-retry:
-	if (!read_trylock(&tasklist_lock)) {
-		if (count == 10)
-			printk("hm, tasklist_lock locked, retrying... ");
-		if (count) {
-			count--;
-			printk(" #%d", 10-count);
-			mdelay(200);
-			goto retry;
-		}
-		printk(" ignoring it.\n");
-		unlock = 0;
-	}
-	if (count != 10)
-		printk(" locked it.\n");
-
-	do_each_thread(g, p) {
-		show_task_locks(p);
-		if (!unlock)
-			if (read_trylock(&tasklist_lock))
-				unlock = 1;
-	} while_each_thread(g, p);
-
-	printk("\n");
-
-	printk("-----------------------------------------\n");
-	printk("| showing all locks held in the system: |\n");
-	printk("-----------------------------------------\n");
-
-	do_each_thread(g, p) {
-		rt_mutex_show_held_locks(p, 0);
-		if (!unlock)
-			if (read_trylock(&tasklist_lock))
-				unlock = 1;
-	} while_each_thread(g, p);
-
-
-	printk("=============================================\n\n");
-
-	if (unlock)
-		read_unlock(&tasklist_lock);
-}
-
-void rt_mutex_debug_check_no_locks_held(task_t *task)
-{
-	struct rt_mutex_waiter *w;
-	struct list_head *curr;
-	struct rt_mutex *lock;
-
-	if (!rt_trace_on)
-		return;
-	if (!rt_prio(task->normal_prio) && rt_prio(task->prio)) {
-		printk("BUG: PI priority boost leaked!\n");
-		printk_task(task);
-		printk("\n");
-	}
-	if (list_empty(&task->held_list_head))
-		return;
-
-	spin_lock(&task->pi_lock);
-	plist_for_each_entry(w, &task->pi_waiters, pi_list_entry) {
-		TRACE_OFF();
-
-		printk("hm, PI interest held at exit time? Task:\n");
-		printk_task(task);
-		printk_waiter(w);
-		return;
-	}
-	spin_unlock(&task->pi_lock);
-
-	list_for_each(curr, &task->held_list_head) {
-		lock = list_entry(curr, struct rt_mutex, held_list_entry);
-
-		printk("BUG: %s/%d, lock held at task exit time!\n",
-		       task->comm, task->pid);
-		printk_lock(lock, 1);
-		if (rt_mutex_owner(lock) != task)
-			printk("exiting task is not even the owner??\n");
-	}
-}
-
-int rt_mutex_debug_check_no_locks_freed(const void *from, unsigned long len)
-{
-	const void *to = from + len;
-	struct list_head *curr;
-	struct rt_mutex *lock;
-	unsigned long flags;
-	void *lock_addr;
-
-	if (!rt_trace_on)
-		return 0;
-
-	spin_lock_irqsave(&current->held_list_lock, flags);
-	list_for_each(curr, &current->held_list_head) {
-		lock = list_entry(curr, struct rt_mutex, held_list_entry);
-		lock_addr = lock;
-		if (lock_addr < from || lock_addr >= to)
-			continue;
-		TRACE_OFF();
-
-		printk("BUG: %s/%d, active lock [%p(%p-%p)] freed!\n",
-			current->comm, current->pid, lock, from, to);
-		dump_stack();
-		printk_lock(lock, 1);
-		if (rt_mutex_owner(lock) != current)
-			printk("freeing task is not even the owner??\n");
-		return 1;
-	}
-	spin_unlock_irqrestore(&current->held_list_lock, flags);
-
-	return 0;
 }
 
 void rt_mutex_debug_task_free(struct task_struct *task)
@@ -395,85 +170,41 @@ void debug_rt_mutex_print_deadlock(struc
 	       current->comm, current->pid);
 	printk_lock(waiter->lock, 1);
 
-	printk("... trying at:                 ");
-	print_symbol("%s\n", waiter->ip);
-
 	printk("\n2) %s/%d is blocked on this lock:\n", task->comm, task->pid);
 	printk_lock(waiter->deadlock_lock, 1);
 
-	rt_mutex_show_held_locks(current, 1);
-	rt_mutex_show_held_locks(task, 1);
+	debug_show_held_locks(current);
+	debug_show_held_locks(task);
 
 	printk("\n%s/%d's [blocked] stackdump:\n\n", task->comm, task->pid);
 	show_stack(task, NULL);
 	printk("\n%s/%d's [current] stackdump:\n\n",
 	       current->comm, current->pid);
 	dump_stack();
-	rt_mutex_show_all_locks();
+	debug_show_all_locks();
+
 	printk("[ turning off deadlock detection."
 	       "Please report this trace. ]\n\n");
 	local_irq_disable();
 }
 
-void debug_rt_mutex_lock(struct rt_mutex *lock __IP_DECL__)
+void debug_rt_mutex_lock(struct rt_mutex *lock)
 {
-	unsigned long flags;
-
-	if (rt_trace_on) {
-		TRACE_WARN_ON_LOCKED(!list_empty(&lock->held_list_entry));
-
-		spin_lock_irqsave(&current->held_list_lock, flags);
-		list_add_tail(&lock->held_list_entry, &current->held_list_head);
-		spin_unlock_irqrestore(&current->held_list_lock, flags);
-
-		lock->acquire_ip = ip;
-	}
 }
 
 void debug_rt_mutex_unlock(struct rt_mutex *lock)
 {
-	unsigned long flags;
-
-	if (rt_trace_on) {
-		TRACE_WARN_ON_LOCKED(rt_mutex_owner(lock) != current);
-		TRACE_WARN_ON_LOCKED(list_empty(&lock->held_list_entry));
-
-		spin_lock_irqsave(&current->held_list_lock, flags);
-		list_del_init(&lock->held_list_entry);
-		spin_unlock_irqrestore(&current->held_list_lock, flags);
-	}
+	TRACE_WARN_ON_LOCKED(rt_mutex_owner(lock) != current);
 }
 
-void debug_rt_mutex_proxy_lock(struct rt_mutex *lock,
-			       struct task_struct *powner __IP_DECL__)
+void
+debug_rt_mutex_proxy_lock(struct rt_mutex *lock, struct task_struct *powner)
 {
-	unsigned long flags;
-
-	if (rt_trace_on) {
-		TRACE_WARN_ON_LOCKED(!list_empty(&lock->held_list_entry));
-
-		spin_lock_irqsave(&powner->held_list_lock, flags);
-		list_add_tail(&lock->held_list_entry, &powner->held_list_head);
-		spin_unlock_irqrestore(&powner->held_list_lock, flags);
-
-		lock->acquire_ip = ip;
-	}
 }
 
 void debug_rt_mutex_proxy_unlock(struct rt_mutex *lock)
 {
-	unsigned long flags;
-
-	if (rt_trace_on) {
-		struct task_struct *owner = rt_mutex_owner(lock);
-
-		TRACE_WARN_ON_LOCKED(!owner);
-		TRACE_WARN_ON_LOCKED(list_empty(&lock->held_list_entry));
-
-		spin_lock_irqsave(&owner->held_list_lock, flags);
-		list_del_init(&lock->held_list_entry);
-		spin_unlock_irqrestore(&owner->held_list_lock, flags);
-	}
+	TRACE_WARN_ON_LOCKED(!rt_mutex_owner(lock));
 }
 
 void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
@@ -493,14 +224,11 @@ void debug_rt_mutex_free_waiter(struct r
 
 void debug_rt_mutex_init(struct rt_mutex *lock, const char *name)
 {
-	void *addr = lock;
-
-	if (rt_trace_on) {
-		rt_mutex_debug_check_no_locks_freed(addr,
-						    sizeof(struct rt_mutex));
-		INIT_LIST_HEAD(&lock->held_list_entry);
-		lock->name = name;
-	}
+	/*
+	 * Make sure we are not reinitializing a held lock:
+	 */
+	debug_check_no_locks_freed((void *)lock, sizeof(*lock));
+	lock->name = name;
 }
 
 void rt_mutex_deadlock_account_lock(struct rt_mutex *lock, task_t *task)
Index: linux/kernel/rtmutex-debug.h
===================================================================
--- linux.orig/kernel/rtmutex-debug.h
+++ linux/kernel/rtmutex-debug.h
@@ -9,20 +9,16 @@
  * This file contains macros used solely by rtmutex.c. Debug version.
  */
 
-#define __IP_DECL__		, unsigned long ip
-#define __IP__			, ip
-#define __RET_IP__		, (unsigned long)__builtin_return_address(0)
-
 extern void
 rt_mutex_deadlock_account_lock(struct rt_mutex *lock, struct task_struct *task);
 extern void rt_mutex_deadlock_account_unlock(struct task_struct *task);
 extern void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter);
 extern void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter);
 extern void debug_rt_mutex_init(struct rt_mutex *lock, const char *name);
-extern void debug_rt_mutex_lock(struct rt_mutex *lock __IP_DECL__);
+extern void debug_rt_mutex_lock(struct rt_mutex *lock);
 extern void debug_rt_mutex_unlock(struct rt_mutex *lock);
 extern void debug_rt_mutex_proxy_lock(struct rt_mutex *lock,
-				      struct task_struct *powner __IP_DECL__);
+				      struct task_struct *powner);
 extern void debug_rt_mutex_proxy_unlock(struct rt_mutex *lock);
 extern void debug_rt_mutex_deadlock(int detect, struct rt_mutex_waiter *waiter,
 				    struct rt_mutex *lock);
Index: linux/kernel/rtmutex.c
===================================================================
--- linux.orig/kernel/rtmutex.c
+++ linux/kernel/rtmutex.c
@@ -160,8 +160,7 @@ int max_lock_depth = 1024;
 static int rt_mutex_adjust_prio_chain(task_t *task,
 				      int deadlock_detect,
 				      struct rt_mutex *orig_lock,
-				      struct rt_mutex_waiter *orig_waiter
-				      __IP_DECL__)
+				      struct rt_mutex_waiter *orig_waiter)
 {
 	struct rt_mutex *lock;
 	struct rt_mutex_waiter *waiter, *top_waiter = orig_waiter;
@@ -356,7 +355,7 @@ static inline int try_to_steal_lock(stru
  *
  * Must be called with lock->wait_lock held.
  */
-static int try_to_take_rt_mutex(struct rt_mutex *lock __IP_DECL__)
+static int try_to_take_rt_mutex(struct rt_mutex *lock)
 {
 	/*
 	 * We have to be careful here if the atomic speedups are
@@ -383,7 +382,7 @@ static int try_to_take_rt_mutex(struct r
 		return 0;
 
 	/* We got the lock. */
-	debug_rt_mutex_lock(lock __IP__);
+	debug_rt_mutex_lock(lock);
 
 	rt_mutex_set_owner(lock, current, 0);
 
@@ -401,8 +400,7 @@ static int try_to_take_rt_mutex(struct r
  */
 static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 				   struct rt_mutex_waiter *waiter,
-				   int detect_deadlock
-				   __IP_DECL__)
+				   int detect_deadlock)
 {
 	struct rt_mutex_waiter *top_waiter = waiter;
 	task_t *owner = rt_mutex_owner(lock);
@@ -450,8 +448,7 @@ static int task_blocks_on_rt_mutex(struc
 
 	spin_unlock(&lock->wait_lock);
 
-	res = rt_mutex_adjust_prio_chain(owner, detect_deadlock, lock,
-					 waiter __IP__);
+	res = rt_mutex_adjust_prio_chain(owner, detect_deadlock, lock, waiter);
 
 	spin_lock(&lock->wait_lock);
 
@@ -523,7 +520,7 @@ static void wakeup_next_waiter(struct rt
  * Must be called with lock->wait_lock held
  */
 static void remove_waiter(struct rt_mutex *lock,
-			  struct rt_mutex_waiter *waiter  __IP_DECL__)
+			  struct rt_mutex_waiter *waiter)
 {
 	int first = (waiter == rt_mutex_top_waiter(lock));
 	int boost = 0;
@@ -564,7 +561,7 @@ static void remove_waiter(struct rt_mute
 
 	spin_unlock(&lock->wait_lock);
 
-	rt_mutex_adjust_prio_chain(owner, 0, lock, NULL __IP__);
+	rt_mutex_adjust_prio_chain(owner, 0, lock, NULL);
 
 	spin_lock(&lock->wait_lock);
 }
@@ -575,7 +572,7 @@ static void remove_waiter(struct rt_mute
 static int __sched
 rt_mutex_slowlock(struct rt_mutex *lock, int state,
 		  struct hrtimer_sleeper *timeout,
-		  int detect_deadlock __IP_DECL__)
+		  int detect_deadlock)
 {
 	struct rt_mutex_waiter waiter;
 	int ret = 0;
@@ -586,7 +583,7 @@ rt_mutex_slowlock(struct rt_mutex *lock,
 	spin_lock(&lock->wait_lock);
 
 	/* Try to acquire the lock again: */
-	if (try_to_take_rt_mutex(lock __IP__)) {
+	if (try_to_take_rt_mutex(lock)) {
 		spin_unlock(&lock->wait_lock);
 		return 0;
 	}
@@ -600,7 +597,7 @@ rt_mutex_slowlock(struct rt_mutex *lock,
 
 	for (;;) {
 		/* Try to acquire the lock: */
-		if (try_to_take_rt_mutex(lock __IP__))
+		if (try_to_take_rt_mutex(lock))
 			break;
 
 		/*
@@ -624,7 +621,7 @@ rt_mutex_slowlock(struct rt_mutex *lock,
 		 */
 		if (!waiter.task) {
 			ret = task_blocks_on_rt_mutex(lock, &waiter,
-						      detect_deadlock __IP__);
+						      detect_deadlock);
 			/*
 			 * If we got woken up by the owner then start loop
 			 * all over without going into schedule to try
@@ -650,7 +647,7 @@ rt_mutex_slowlock(struct rt_mutex *lock,
 	set_current_state(TASK_RUNNING);
 
 	if (unlikely(waiter.task))
-		remove_waiter(lock, &waiter __IP__);
+		remove_waiter(lock, &waiter);
 
 	/*
 	 * try_to_take_rt_mutex() sets the waiter bit
@@ -681,7 +678,7 @@ rt_mutex_slowlock(struct rt_mutex *lock,
  * Slow path try-lock function:
  */
 static inline int
-rt_mutex_slowtrylock(struct rt_mutex *lock __IP_DECL__)
+rt_mutex_slowtrylock(struct rt_mutex *lock)
 {
 	int ret = 0;
 
@@ -689,7 +686,7 @@ rt_mutex_slowtrylock(struct rt_mutex *lo
 
 	if (likely(rt_mutex_owner(lock) != current)) {
 
-		ret = try_to_take_rt_mutex(lock __IP__);
+		ret = try_to_take_rt_mutex(lock);
 		/*
 		 * try_to_take_rt_mutex() sets the lock waiters
 		 * bit unconditionally. Clean this up.
@@ -739,13 +736,13 @@ rt_mutex_fastlock(struct rt_mutex *lock,
 		  int detect_deadlock,
 		  int (*slowfn)(struct rt_mutex *lock, int state,
 				struct hrtimer_sleeper *timeout,
-				int detect_deadlock __IP_DECL__))
+				int detect_deadlock))
 {
 	if (!detect_deadlock && likely(rt_mutex_cmpxchg(lock, NULL, current))) {
 		rt_mutex_deadlock_account_lock(lock, current);
 		return 0;
 	} else
-		return slowfn(lock, state, NULL, detect_deadlock __RET_IP__);
+		return slowfn(lock, state, NULL, detect_deadlock);
 }
 
 static inline int
@@ -753,24 +750,24 @@ rt_mutex_timed_fastlock(struct rt_mutex 
 			struct hrtimer_sleeper *timeout, int detect_deadlock,
 			int (*slowfn)(struct rt_mutex *lock, int state,
 				      struct hrtimer_sleeper *timeout,
-				      int detect_deadlock __IP_DECL__))
+				      int detect_deadlock))
 {
 	if (!detect_deadlock && likely(rt_mutex_cmpxchg(lock, NULL, current))) {
 		rt_mutex_deadlock_account_lock(lock, current);
 		return 0;
 	} else
-		return slowfn(lock, state, timeout, detect_deadlock __RET_IP__);
+		return slowfn(lock, state, timeout, detect_deadlock);
 }
 
 static inline int
 rt_mutex_fasttrylock(struct rt_mutex *lock,
-		     int (*slowfn)(struct rt_mutex *lock __IP_DECL__))
+		     int (*slowfn)(struct rt_mutex *lock))
 {
 	if (likely(rt_mutex_cmpxchg(lock, NULL, current))) {
 		rt_mutex_deadlock_account_lock(lock, current);
 		return 1;
 	}
-	return slowfn(lock __RET_IP__);
+	return slowfn(lock);
 }
 
 static inline void
@@ -918,7 +915,7 @@ void rt_mutex_init_proxy_locked(struct r
 				struct task_struct *proxy_owner)
 {
 	__rt_mutex_init(lock, NULL);
-	debug_rt_mutex_proxy_lock(lock, proxy_owner __RET_IP__);
+	debug_rt_mutex_proxy_lock(lock, proxy_owner);
 	rt_mutex_set_owner(lock, proxy_owner, 0);
 	rt_mutex_deadlock_account_lock(lock, proxy_owner);
 }
Index: linux/kernel/rtmutex.h
===================================================================
--- linux.orig/kernel/rtmutex.h
+++ linux/kernel/rtmutex.h
@@ -10,9 +10,6 @@
  * Non-debug version.
  */
 
-#define __IP_DECL__
-#define __IP__
-#define __RET_IP__
 #define rt_mutex_deadlock_check(l)			(0)
 #define rt_mutex_deadlock_account_lock(m, t)		do { } while (0)
 #define rt_mutex_deadlock_account_unlock(l)		do { } while (0)
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -30,6 +30,7 @@
 #include <linux/capability.h>
 #include <linux/completion.h>
 #include <linux/kernel_stat.h>
+#include <linux/debug_locks.h>
 #include <linux/security.h>
 #include <linux/notifier.h>
 #include <linux/profile.h>
@@ -3158,12 +3159,13 @@ void fastcall add_preempt_count(int val)
 	/*
 	 * Underflow?
 	 */
-	BUG_ON((preempt_count() < 0));
+	if (DEBUG_WARN_ON((preempt_count() < 0)))
+		return;
 	preempt_count() += val;
 	/*
 	 * Spinlock count overflowing soon?
 	 */
-	BUG_ON((preempt_count() & PREEMPT_MASK) >= PREEMPT_MASK-10);
+	DEBUG_WARN_ON((preempt_count() & PREEMPT_MASK) >= PREEMPT_MASK-10);
 }
 EXPORT_SYMBOL(add_preempt_count);
 
@@ -3172,11 +3174,15 @@ void fastcall sub_preempt_count(int val)
 	/*
 	 * Underflow?
 	 */
-	BUG_ON(val > preempt_count());
+	if (DEBUG_WARN_ON(val > preempt_count()))
+		return;
 	/*
 	 * Is the spinlock portion underflowing?
 	 */
-	BUG_ON((val < PREEMPT_MASK) && !(preempt_count() & PREEMPT_MASK));
+	if (DEBUG_WARN_ON((val < PREEMPT_MASK) &&
+			!(preempt_count() & PREEMPT_MASK)))
+		return;
+
 	preempt_count() -= val;
 }
 EXPORT_SYMBOL(sub_preempt_count);
@@ -4715,7 +4721,7 @@ void show_state(void)
 	} while_each_thread(g, p);
 
 	read_unlock(&tasklist_lock);
-	mutex_debug_show_all_locks();
+	debug_show_all_locks();
 }
 
 /**
Index: linux/lib/Kconfig.debug
===================================================================
--- linux.orig/lib/Kconfig.debug
+++ linux/lib/Kconfig.debug
@@ -130,12 +130,30 @@ config DEBUG_PREEMPT
 	  will detect preemption count underflows.
 
 config DEBUG_MUTEXES
-	bool "Mutex debugging, deadlock detection"
-	default n
+	bool "Mutex debugging, basic checks"
+	default y
 	depends on DEBUG_KERNEL
 	help
-	 This allows mutex semantics violations and mutex related deadlocks
-	 (lockups) to be detected and reported automatically.
+	 This feature allows mutex semantics violations to be detected and
+	 reported.
+
+config DEBUG_MUTEX_ALLOC
+	bool "Detect incorrect freeing of live mutexes"
+	default y
+	depends on DEBUG_MUTEXES
+	help
+	 This feature will check whether any held mutex is incorrectly
+	 freed by the kernel, via any of the memory-freeing routines
+	 (kfree(), kmem_cache_free(), free_pages(), vfree(), etc.),
+	 or whether there is any lock held during task exit.
+
+config DEBUG_MUTEX_DEADLOCKS
+	bool "Detect mutex related deadlocks"
+	default y
+	depends on DEBUG_MUTEXES
+	help
+	 This feature will automatically detect and report mutex related
+	 deadlocks, as they happen.
 
 config DEBUG_RT_MUTEXES
 	bool "RT Mutex debugging, deadlock detection"
Index: linux/lib/Makefile
===================================================================
--- linux.orig/lib/Makefile
+++ linux/lib/Makefile
@@ -11,7 +11,7 @@ lib-$(CONFIG_SMP) += cpumask.o
 
 lib-y	+= kobject.o kref.o kobject_uevent.o klist.o
 
-obj-y += sort.o parser.o halfmd4.o iomap_copy.o
+obj-y += sort.o parser.o halfmd4.o iomap_copy.o debug_locks.o
 
 ifeq ($(CONFIG_DEBUG_KOBJECT),y)
 CFLAGS_kobject.o += -DDEBUG
Index: linux/lib/debug_locks.c
===================================================================
--- /dev/null
+++ linux/lib/debug_locks.c
@@ -0,0 +1,45 @@
+/*
+ * lib/debug_locks.c
+ *
+ * Generic place for common debugging facilities for various locks:
+ * spinlocks, rwlocks, mutexes and rwsems.
+ *
+ * Started by Ingo Molnar:
+ *
+ *  Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
+ */
+#include <linux/rwsem.h>
+#include <linux/mutex.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/debug_locks.h>
+
+/*
+ * We want to turn all lock-debugging facilities on/off at once,
+ * via a global flag. The reason is that once a single bug has been
+ * detected and reported, there might be a cascade of follow-up bugs
+ * that would just muddy the log. So we report the first one and
+ * shut up after that.
+ */
+int debug_locks = 1;
+
+/*
+ * The locking-testsuite uses <debug_locks_silent> to get a
+ * 'silent failure': nothing is printed to the console when
+ * a locking bug is detected.
+ */
+int debug_locks_silent;
+
+/*
+ * Generic 'turn off all lock debugging' function:
+ */
+int debug_locks_off(void)
+{
+	if (xchg(&debug_locks, 0)) {
+		if (!debug_locks_silent) {
+			console_verbose();
+			return 1;
+		}
+	}
+	return 0;
+}
Index: linux/lib/spinlock_debug.c
===================================================================
--- linux.orig/lib/spinlock_debug.c
+++ linux/lib/spinlock_debug.c
@@ -9,38 +9,35 @@
 #include <linux/config.h>
 #include <linux/spinlock.h>
 #include <linux/interrupt.h>
+#include <linux/debug_locks.h>
 #include <linux/delay.h>
+#include <linux/module.h>
 
 static void spin_bug(spinlock_t *lock, const char *msg)
 {
-	static long print_once = 1;
 	struct task_struct *owner = NULL;
 
-	if (xchg(&print_once, 0)) {
-		if (lock->owner && lock->owner != SPINLOCK_OWNER_INIT)
-			owner = lock->owner;
-		printk(KERN_EMERG "BUG: spinlock %s on CPU#%d, %s/%d\n",
-			msg, raw_smp_processor_id(),
-			current->comm, current->pid);
-		printk(KERN_EMERG " lock: %p, .magic: %08x, .owner: %s/%d, "
-				".owner_cpu: %d\n",
-			lock, lock->magic,
-			owner ? owner->comm : "<none>",
-			owner ? owner->pid : -1,
-			lock->owner_cpu);
-		dump_stack();
-#ifdef CONFIG_SMP
-		/*
-		 * We cannot continue on SMP:
-		 */
-//		panic("bad locking");
-#endif
-	}
+	if (!debug_locks_off())
+		return;
+
+	if (lock->owner && lock->owner != SPINLOCK_OWNER_INIT)
+		owner = lock->owner;
+	printk(KERN_EMERG "BUG: spinlock %s on CPU#%d, %s/%d\n",
+		msg, raw_smp_processor_id(),
+		current->comm, current->pid);
+	printk(KERN_EMERG " lock: %p, .magic: %08x, .owner: %s/%d, "
+			".owner_cpu: %d\n",
+		lock, lock->magic,
+		owner ? owner->comm : "<none>",
+		owner ? owner->pid : -1,
+		lock->owner_cpu);
+	dump_stack();
 }
 
 #define SPIN_BUG_ON(cond, lock, msg) if (unlikely(cond)) spin_bug(lock, msg)
 
-static inline void debug_spin_lock_before(spinlock_t *lock)
+static inline void
+debug_spin_lock_before(spinlock_t *lock)
 {
 	SPIN_BUG_ON(lock->magic != SPINLOCK_MAGIC, lock, "bad magic");
 	SPIN_BUG_ON(lock->owner == current, lock, "recursion");
@@ -119,20 +116,13 @@ void _raw_spin_unlock(spinlock_t *lock)
 
 static void rwlock_bug(rwlock_t *lock, const char *msg)
 {
-	static long print_once = 1;
+	if (!debug_locks_off())
+		return;
 
-	if (xchg(&print_once, 0)) {
-		printk(KERN_EMERG "BUG: rwlock %s on CPU#%d, %s/%d, %p\n",
-			msg, raw_smp_processor_id(), current->comm,
-			current->pid, lock);
-		dump_stack();
-#ifdef CONFIG_SMP
-		/*
-		 * We cannot continue on SMP:
-		 */
-		panic("bad locking");
-#endif
-	}
+	printk(KERN_EMERG "BUG: rwlock %s on CPU#%d, %s/%d, %p\n",
+		msg, raw_smp_processor_id(), current->comm,
+		current->pid, lock);
+	dump_stack();
 }
 
 #define RWLOCK_BUG_ON(cond, lock, msg) if (unlikely(cond)) rwlock_bug(lock, msg)
Index: linux/mm/vmalloc.c
===================================================================
--- linux.orig/mm/vmalloc.c
+++ linux/mm/vmalloc.c
@@ -330,6 +330,8 @@ void __vunmap(void *addr, int deallocate
 		return;
 	}
 
+	debug_check_no_locks_freed(addr, area->size);
+
 	if (deallocate_pages) {
 		int i;
 

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 08/61] lock validator: locking API self-tests
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (6 preceding siblings ...)
  2006-05-29 21:23 ` [patch 07/61] lock validator: better lock debugging Ingo Molnar
@ 2006-05-29 21:23 ` Ingo Molnar
  2006-05-29 21:23 ` [patch 09/61] lock validator: spin/rwlock init cleanups Ingo Molnar
                   ` (65 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:23 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

introduce DEBUG_LOCKING_API_SELFTESTS, which uses the generic lock
debugging code's silent-failure feature to run a matrix of testcases.
There are 210 testcases currently:

------------------------
| Locking API testsuite:
----------------------------------------------------------------------------
                                 | spin |wlock |rlock |mutex | wsem | rsem |
  --------------------------------------------------------------------------
                     A-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
                 A-B-B-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
             A-B-B-C-C-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
             A-B-C-A-B-C deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
         A-B-B-C-C-D-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
         A-B-C-D-B-D-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
         A-B-C-D-B-C-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
                    double unlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
                 bad unlock order:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
  --------------------------------------------------------------------------
              recursive read-lock:             |  ok  |             |  ok  |
  --------------------------------------------------------------------------
                non-nested unlock:  ok  |  ok  |  ok  |  ok  |
  ------------------------------------------------------------
     hard-irqs-on + irq-safe-A/12:  ok  |  ok  |  ok  |
     soft-irqs-on + irq-safe-A/12:  ok  |  ok  |  ok  |
     hard-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
     soft-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
       sirq-safe-A => hirqs-on/12:  ok  |  ok  |  ok  |
       sirq-safe-A => hirqs-on/21:  ok  |  ok  |  ok  |
         hard-safe-A + irqs-on/12:  ok  |  ok  |  ok  |
         soft-safe-A + irqs-on/12:  ok  |  ok  |  ok  |
         hard-safe-A + irqs-on/21:  ok  |  ok  |  ok  |
         soft-safe-A + irqs-on/21:  ok  |  ok  |  ok  |
    hard-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
    soft-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
    hard-safe-A + unsafe-B #1/132:  ok  |  ok  |  ok  |
    soft-safe-A + unsafe-B #1/132:  ok  |  ok  |  ok  |
    hard-safe-A + unsafe-B #1/213:  ok  |  ok  |  ok  |
    soft-safe-A + unsafe-B #1/213:  ok  |  ok  |  ok  |
    hard-safe-A + unsafe-B #1/231:  ok  |  ok  |  ok  |
    soft-safe-A + unsafe-B #1/231:  ok  |  ok  |  ok  |
    hard-safe-A + unsafe-B #1/312:  ok  |  ok  |  ok  |
    soft-safe-A + unsafe-B #1/312:  ok  |  ok  |  ok  |
    hard-safe-A + unsafe-B #1/321:  ok  |  ok  |  ok  |
    soft-safe-A + unsafe-B #1/321:  ok  |  ok  |  ok  |
    hard-safe-A + unsafe-B #2/123:  ok  |  ok  |  ok  |
    soft-safe-A + unsafe-B #2/123:  ok  |  ok  |  ok  |
    hard-safe-A + unsafe-B #2/132:  ok  |  ok  |  ok  |
    soft-safe-A + unsafe-B #2/132:  ok  |  ok  |  ok  |
    hard-safe-A + unsafe-B #2/213:  ok  |  ok  |  ok  |
    soft-safe-A + unsafe-B #2/213:  ok  |  ok  |  ok  |
    hard-safe-A + unsafe-B #2/231:  ok  |  ok  |  ok  |
    soft-safe-A + unsafe-B #2/231:  ok  |  ok  |  ok  |
    hard-safe-A + unsafe-B #2/312:  ok  |  ok  |  ok  |
    soft-safe-A + unsafe-B #2/312:  ok  |  ok  |  ok  |
    hard-safe-A + unsafe-B #2/321:  ok  |  ok  |  ok  |
    soft-safe-A + unsafe-B #2/321:  ok  |  ok  |  ok  |
      hard-irq lock-inversion/123:  ok  |  ok  |  ok  |
      soft-irq lock-inversion/123:  ok  |  ok  |  ok  |
      hard-irq lock-inversion/132:  ok  |  ok  |  ok  |
      soft-irq lock-inversion/132:  ok  |  ok  |  ok  |
      hard-irq lock-inversion/213:  ok  |  ok  |  ok  |
      soft-irq lock-inversion/213:  ok  |  ok  |  ok  |
      hard-irq lock-inversion/231:  ok  |  ok  |  ok  |
      soft-irq lock-inversion/231:  ok  |  ok  |  ok  |
      hard-irq lock-inversion/312:  ok  |  ok  |  ok  |
      soft-irq lock-inversion/312:  ok  |  ok  |  ok  |
      hard-irq lock-inversion/321:  ok  |  ok  |  ok  |
      soft-irq lock-inversion/321:  ok  |  ok  |  ok  |
      hard-irq read-recursion/123:  ok  |
      soft-irq read-recursion/123:  ok  |
      hard-irq read-recursion/132:  ok  |
      soft-irq read-recursion/132:  ok  |
      hard-irq read-recursion/213:  ok  |
      soft-irq read-recursion/213:  ok  |
      hard-irq read-recursion/231:  ok  |
      soft-irq read-recursion/231:  ok  |
      hard-irq read-recursion/312:  ok  |
      soft-irq read-recursion/312:  ok  |
      hard-irq read-recursion/321:  ok  |
      soft-irq read-recursion/321:  ok  |
-------------------------------------------------------
Good, all 210 testcases passed! |
---------------------------------
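
Each "ok" above is produced by running a known-bad (or known-good) locking
sequence with the generic lock debugging code in silent mode, and then
checking whether the global debug_locks flag got turned off. A minimal
sketch of that mechanism (this is not the dotest() code in the patch below,
and run_one_case() is just an illustrative name; verbosity of the real
self-tests is controlled via the debug_locks_verbose= boot parameter):

/*
 * Sketch only - assumes the debug_locks/debug_locks_silent globals
 * introduced in lib/debug_locks.c earlier in this series:
 */
static void run_one_case(void (*testcase)(void), int expected)
{
	debug_locks_silent = 1;		/* detected bugs print nothing     */
	debug_locks = 1;		/* (re)arm the global debug switch */

	testcase();			/* e.g. an A-B-B-A lock sequence   */

	/*
	 * A detected locking bug clears debug_locks via debug_locks_off(),
	 * so compare the flag against what this testcase should have done:
	 */
	printk(debug_locks == expected ? "  ok  |" : "FAILED|");

	debug_locks_silent = 0;
}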

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 Documentation/kernel-parameters.txt  |    9 
 lib/Kconfig.debug                    |   12 
 lib/Makefile                         |    1 
 lib/locking-selftest-hardirq.h       |    9 
 lib/locking-selftest-mutex.h         |    5 
 lib/locking-selftest-rlock-hardirq.h |    2 
 lib/locking-selftest-rlock-softirq.h |    2 
 lib/locking-selftest-rlock.h         |    5 
 lib/locking-selftest-rsem.h          |    5 
 lib/locking-selftest-softirq.h       |    9 
 lib/locking-selftest-spin-hardirq.h  |    2 
 lib/locking-selftest-spin-softirq.h  |    2 
 lib/locking-selftest-spin.h          |    5 
 lib/locking-selftest-wlock-hardirq.h |    2 
 lib/locking-selftest-wlock-softirq.h |    2 
 lib/locking-selftest-wlock.h         |    5 
 lib/locking-selftest-wsem.h          |    5 
 lib/locking-selftest.c               | 1168 +++++++++++++++++++++++++++++++++++
 18 files changed, 1250 insertions(+)
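
To help reading the generated testcases below: each locking-selftest-<type>.h
header only redefines the LOCK()/UNLOCK() shortcuts, and GENERATE_TESTCASE()
then stamps out one function per lock type from the same event macro E().
Hand-expanding the spinlock variant of the A-A case, for instance, gives
roughly this (a sketch of the macro expansion, not extra code in the patch):

/*
 * GENERATE_TESTCASE(AA_spin) after locking-selftest-spin.h maps
 * LOCK()/UNLOCK() to spin_lock()/spin_unlock(); lock_X1 and lock_X2
 * share the same lock type, so the second acquisition should be flagged:
 */
static void AA_spin(void)
{
	spin_lock(&lock_X1);
	spin_lock(&lock_X2);	/* this one should fail */
	spin_unlock(&lock_X2);
	spin_unlock(&lock_X1);
}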

Index: linux/Documentation/kernel-parameters.txt
===================================================================
--- linux.orig/Documentation/kernel-parameters.txt
+++ linux/Documentation/kernel-parameters.txt
@@ -436,6 +436,15 @@ running once the system is up.
 
 	debug		[KNL] Enable kernel debugging (events log level).
 
+	debug_locks_verbose=
+			[KNL] verbose self-tests
+			Format=<0|1>
+			Print debugging info while doing the locking API
+			self-tests.
+			We default to 0 (no extra messages), setting it to
+			1 will print _a lot_ more information - normally
+			only useful to kernel developers.
+
 	decnet=		[HW,NET]
 			Format: <area>[,<node>]
 			See also Documentation/networking/decnet.txt.
Index: linux/lib/Kconfig.debug
===================================================================
--- linux.orig/lib/Kconfig.debug
+++ linux/lib/Kconfig.debug
@@ -191,6 +191,18 @@ config DEBUG_SPINLOCK_SLEEP
 	  If you say Y here, various routines which may sleep will become very
 	  noisy if they are called with a spinlock held.
 
+config DEBUG_LOCKING_API_SELFTESTS
+	bool "Locking API boot-time self-tests"
+	depends on DEBUG_KERNEL
+	default y
+	help
+	  Say Y here if you want the kernel to run a short self-test during
+	  bootup. The self-test checks whether common types of locking bugs
+	  are detected by debugging mechanisms or not. (If you disable
+	  lock debugging then those bugs won't be detected, of course.)
+	  The following locking APIs are covered: spinlocks, rwlocks,
+	  mutexes and rwsems.
+
 config DEBUG_KOBJECT
 	bool "kobject debugging"
 	depends on DEBUG_KERNEL
Index: linux/lib/Makefile
===================================================================
--- linux.orig/lib/Makefile
+++ linux/lib/Makefile
@@ -18,6 +18,7 @@ CFLAGS_kobject.o += -DDEBUG
 CFLAGS_kobject_uevent.o += -DDEBUG
 endif
 
+obj-$(CONFIG_DEBUG_LOCKING_API_SELFTESTS) += locking-selftest.o
 obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock_debug.o
 lib-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
 lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
Index: linux/lib/locking-selftest-hardirq.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-hardirq.h
@@ -0,0 +1,9 @@
+#undef IRQ_DISABLE
+#undef IRQ_ENABLE
+#undef IRQ_ENTER
+#undef IRQ_EXIT
+
+#define IRQ_ENABLE		HARDIRQ_ENABLE
+#define IRQ_DISABLE		HARDIRQ_DISABLE
+#define IRQ_ENTER		HARDIRQ_ENTER
+#define IRQ_EXIT		HARDIRQ_EXIT
Index: linux/lib/locking-selftest-mutex.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-mutex.h
@@ -0,0 +1,5 @@
+#undef LOCK
+#define LOCK		ML
+
+#undef UNLOCK
+#define UNLOCK		MU
Index: linux/lib/locking-selftest-rlock-hardirq.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-rlock-hardirq.h
@@ -0,0 +1,2 @@
+#include "locking-selftest-rlock.h"
+#include "locking-selftest-hardirq.h"
Index: linux/lib/locking-selftest-rlock-softirq.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-rlock-softirq.h
@@ -0,0 +1,2 @@
+#include "locking-selftest-rlock.h"
+#include "locking-selftest-softirq.h"
Index: linux/lib/locking-selftest-rlock.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-rlock.h
@@ -0,0 +1,5 @@
+#undef LOCK
+#define LOCK		RL
+
+#undef UNLOCK
+#define UNLOCK		RU
Index: linux/lib/locking-selftest-rsem.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-rsem.h
@@ -0,0 +1,5 @@
+#undef LOCK
+#define LOCK		RSL
+
+#undef UNLOCK
+#define UNLOCK		RSU
Index: linux/lib/locking-selftest-softirq.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-softirq.h
@@ -0,0 +1,9 @@
+#undef IRQ_DISABLE
+#undef IRQ_ENABLE
+#undef IRQ_ENTER
+#undef IRQ_EXIT
+
+#define IRQ_DISABLE		SOFTIRQ_DISABLE
+#define IRQ_ENABLE		SOFTIRQ_ENABLE
+#define IRQ_ENTER		SOFTIRQ_ENTER
+#define IRQ_EXIT		SOFTIRQ_EXIT
Index: linux/lib/locking-selftest-spin-hardirq.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-spin-hardirq.h
@@ -0,0 +1,2 @@
+#include "locking-selftest-spin.h"
+#include "locking-selftest-hardirq.h"
Index: linux/lib/locking-selftest-spin-softirq.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-spin-softirq.h
@@ -0,0 +1,2 @@
+#include "locking-selftest-spin.h"
+#include "locking-selftest-softirq.h"
Index: linux/lib/locking-selftest-spin.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-spin.h
@@ -0,0 +1,5 @@
+#undef LOCK
+#define LOCK		L
+
+#undef UNLOCK
+#define UNLOCK		U
Index: linux/lib/locking-selftest-wlock-hardirq.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-wlock-hardirq.h
@@ -0,0 +1,2 @@
+#include "locking-selftest-wlock.h"
+#include "locking-selftest-hardirq.h"
Index: linux/lib/locking-selftest-wlock-softirq.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-wlock-softirq.h
@@ -0,0 +1,2 @@
+#include "locking-selftest-wlock.h"
+#include "locking-selftest-softirq.h"
Index: linux/lib/locking-selftest-wlock.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-wlock.h
@@ -0,0 +1,5 @@
+#undef LOCK
+#define LOCK		WL
+
+#undef UNLOCK
+#define UNLOCK		WU
Index: linux/lib/locking-selftest-wsem.h
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest-wsem.h
@@ -0,0 +1,5 @@
+#undef LOCK
+#define LOCK		WSL
+
+#undef UNLOCK
+#define UNLOCK		WSU
Index: linux/lib/locking-selftest.c
===================================================================
--- /dev/null
+++ linux/lib/locking-selftest.c
@@ -0,0 +1,1168 @@
+/*
+ * lib/locking-selftest.c
+ *
+ * Testsuite for various locking APIs: spinlocks, rwlocks,
+ * mutexes and rw-semaphores.
+ *
+ * It checks for both false positives and false negatives.
+ *
+ * Started by Ingo Molnar:
+ *
+ *  Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
+ */
+#include <linux/rwsem.h>
+#include <linux/mutex.h>
+#include <linux/sched.h>
+#include <linux/delay.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/kallsyms.h>
+#include <linux/interrupt.h>
+#include <linux/debug_locks.h>
+
+/*
+ * Change this to 1 if you want to see the failure printouts:
+ */
+static unsigned int debug_locks_verbose;
+
+static int __init setup_debug_locks_verbose(char *str)
+{
+	get_option(&str, &debug_locks_verbose);
+
+	return 1;
+}
+
+__setup("debug_locks_verbose=", setup_debug_locks_verbose);
+
+#define FAILURE		0
+#define SUCCESS		1
+
+enum {
+	LOCKTYPE_SPIN,
+	LOCKTYPE_RWLOCK,
+	LOCKTYPE_MUTEX,
+	LOCKTYPE_RWSEM,
+};
+
+/*
+ * Normal standalone locks, for the circular and irq-context
+ * dependency tests:
+ */
+static DEFINE_SPINLOCK(lock_A);
+static DEFINE_SPINLOCK(lock_B);
+static DEFINE_SPINLOCK(lock_C);
+static DEFINE_SPINLOCK(lock_D);
+
+static DEFINE_RWLOCK(rwlock_A);
+static DEFINE_RWLOCK(rwlock_B);
+static DEFINE_RWLOCK(rwlock_C);
+static DEFINE_RWLOCK(rwlock_D);
+
+static DEFINE_MUTEX(mutex_A);
+static DEFINE_MUTEX(mutex_B);
+static DEFINE_MUTEX(mutex_C);
+static DEFINE_MUTEX(mutex_D);
+
+static DECLARE_RWSEM(rwsem_A);
+static DECLARE_RWSEM(rwsem_B);
+static DECLARE_RWSEM(rwsem_C);
+static DECLARE_RWSEM(rwsem_D);
+
+/*
+ * Locks that we initialize dynamically as well so that
+ * e.g. X1 and X2 become two instances of the same type,
+ * but X* and Y* are different types. We do this so that
+ * we do not trigger a real lockup:
+ */
+static DEFINE_SPINLOCK(lock_X1);
+static DEFINE_SPINLOCK(lock_X2);
+static DEFINE_SPINLOCK(lock_Y1);
+static DEFINE_SPINLOCK(lock_Y2);
+static DEFINE_SPINLOCK(lock_Z1);
+static DEFINE_SPINLOCK(lock_Z2);
+
+static DEFINE_RWLOCK(rwlock_X1);
+static DEFINE_RWLOCK(rwlock_X2);
+static DEFINE_RWLOCK(rwlock_Y1);
+static DEFINE_RWLOCK(rwlock_Y2);
+static DEFINE_RWLOCK(rwlock_Z1);
+static DEFINE_RWLOCK(rwlock_Z2);
+
+static DEFINE_MUTEX(mutex_X1);
+static DEFINE_MUTEX(mutex_X2);
+static DEFINE_MUTEX(mutex_Y1);
+static DEFINE_MUTEX(mutex_Y2);
+static DEFINE_MUTEX(mutex_Z1);
+static DEFINE_MUTEX(mutex_Z2);
+
+static DECLARE_RWSEM(rwsem_X1);
+static DECLARE_RWSEM(rwsem_X2);
+static DECLARE_RWSEM(rwsem_Y1);
+static DECLARE_RWSEM(rwsem_Y2);
+static DECLARE_RWSEM(rwsem_Z1);
+static DECLARE_RWSEM(rwsem_Z2);
+
+/*
+ * non-inlined runtime initializers, to let separate locks share
+ * the same lock-type:
+ */
+#define INIT_TYPE_FUNC(type) 				\
+static noinline void					\
+init_type_##type(spinlock_t *lock, rwlock_t *rwlock, struct mutex *mutex, \
+		 struct rw_semaphore *rwsem)		\
+{							\
+	spin_lock_init(lock);				\
+	rwlock_init(rwlock);				\
+	mutex_init(mutex);				\
+	init_rwsem(rwsem);				\
+}
+
+INIT_TYPE_FUNC(X)
+INIT_TYPE_FUNC(Y)
+INIT_TYPE_FUNC(Z)
+
+static void init_shared_types(void)
+{
+	init_type_X(&lock_X1, &rwlock_X1, &mutex_X1, &rwsem_X1);
+	init_type_X(&lock_X2, &rwlock_X2, &mutex_X2, &rwsem_X2);
+
+	init_type_Y(&lock_Y1, &rwlock_Y1, &mutex_Y1, &rwsem_Y1);
+	init_type_Y(&lock_Y2, &rwlock_Y2, &mutex_Y2, &rwsem_Y2);
+
+	init_type_Z(&lock_Z1, &rwlock_Z1, &mutex_Z1, &rwsem_Z1);
+	init_type_Z(&lock_Z2, &rwlock_Z2, &mutex_Z2, &rwsem_Z2);
+}
+
+/*
+ * For spinlocks and rwlocks we also do hardirq-safe / softirq-safe tests.
+ * The following functions use a lock from a simulated hardirq/softirq
+ * context, causing the locks to be marked as hardirq-safe/softirq-safe:
+ */
+
+#define HARDIRQ_DISABLE		local_irq_disable
+#define HARDIRQ_ENABLE		local_irq_enable
+
+#define HARDIRQ_ENTER()				\
+	local_irq_disable();			\
+	nmi_enter();				\
+	WARN_ON(!in_irq());
+
+#define HARDIRQ_EXIT()				\
+	nmi_exit();				\
+	local_irq_enable();
+
+#define SOFTIRQ_DISABLE		local_bh_disable
+#define SOFTIRQ_ENABLE		local_bh_enable
+
+#define SOFTIRQ_ENTER()				\
+		local_bh_disable();		\
+		local_irq_disable();		\
+		WARN_ON(!in_softirq());
+
+#define SOFTIRQ_EXIT()				\
+		local_irq_enable();		\
+		local_bh_enable();
+
+/*
+ * Shortcuts for lock/unlock API variants, to keep
+ * the testcases compact:
+ */
+#define L(x)			spin_lock(&lock_##x)
+#define U(x)			spin_unlock(&lock_##x)
+#define UNN(x)			spin_unlock_non_nested(&lock_##x)
+#define LU(x)			L(x); U(x)
+
+#define WL(x)			write_lock(&rwlock_##x)
+#define WU(x)			write_unlock(&rwlock_##x)
+#define WLU(x)			WL(x); WU(x)
+
+#define RL(x)			read_lock(&rwlock_##x)
+#define RU(x)			read_unlock(&rwlock_##x)
+#define RUNN(x)			read_unlock_non_nested(&rwlock_##x)
+#define RLU(x)			RL(x); RU(x)
+
+#define ML(x)			mutex_lock(&mutex_##x)
+#define MU(x)			mutex_unlock(&mutex_##x)
+#define MUNN(x)			mutex_unlock_non_nested(&mutex_##x)
+
+#define WSL(x)			down_write(&rwsem_##x)
+#define WSU(x)			up_write(&rwsem_##x)
+
+#define RSL(x)			down_read(&rwsem_##x)
+#define RSU(x)			up_read(&rwsem_##x)
+#define RSUNN(x)		up_read_non_nested(&rwsem_##x)
+
+#define LOCK_UNLOCK_2(x,y)	LOCK(x); LOCK(y); UNLOCK(y); UNLOCK(x)
+
+/*
+ * Generate different permutations of the same testcase, using
+ * the same basic lock-dependency/state events:
+ */
+
+#define GENERATE_TESTCASE(name)			\
+						\
+static void name(void) { E(); }
+
+#define GENERATE_PERMUTATIONS_2_EVENTS(name)	\
+						\
+static void name##_12(void) { E1(); E2(); }	\
+static void name##_21(void) { E2(); E1(); }
+
+#define GENERATE_PERMUTATIONS_3_EVENTS(name)		\
+							\
+static void name##_123(void) { E1(); E2(); E3(); }	\
+static void name##_132(void) { E1(); E3(); E2(); }	\
+static void name##_213(void) { E2(); E1(); E3(); }	\
+static void name##_231(void) { E2(); E3(); E1(); }	\
+static void name##_312(void) { E3(); E1(); E2(); }	\
+static void name##_321(void) { E3(); E2(); E1(); }
+
+/*
+ * AA deadlock:
+ */
+
+#define E()					\
+						\
+	LOCK(X1);				\
+	LOCK(X2); /* this one should fail */	\
+	UNLOCK(X2);				\
+	UNLOCK(X1);
+
+/*
+ * 6 testcases:
+ */
+#include "locking-selftest-spin.h"
+GENERATE_TESTCASE(AA_spin)
+#include "locking-selftest-wlock.h"
+GENERATE_TESTCASE(AA_wlock)
+#include "locking-selftest-rlock.h"
+GENERATE_TESTCASE(AA_rlock)
+#include "locking-selftest-mutex.h"
+GENERATE_TESTCASE(AA_mutex)
+#include "locking-selftest-wsem.h"
+GENERATE_TESTCASE(AA_wsem)
+#include "locking-selftest-rsem.h"
+GENERATE_TESTCASE(AA_rsem)
+
+#undef E
+
+/*
+ * Special case for read-locking: read-locks are
+ * allowed to recurse on the same lock instance:
+ */
+static void rlock_AA1(void)
+{
+	RL(X1);
+	RL(X1); // this one should NOT fail
+	RU(X1);
+	RU(X1);
+}
+
+static void rsem_AA1(void)
+{
+	RSL(X1);
+	RSL(X1); // this one should fail
+	RSU(X1);
+	RSU(X1);
+}
+
+/*
+ * ABBA deadlock:
+ */
+
+#define E()					\
+						\
+	LOCK_UNLOCK_2(A, B);			\
+	LOCK_UNLOCK_2(B, A); /* fail */
+
+/*
+ * 6 testcases:
+ */
+#include "locking-selftest-spin.h"
+GENERATE_TESTCASE(ABBA_spin)
+#include "locking-selftest-wlock.h"
+GENERATE_TESTCASE(ABBA_wlock)
+#include "locking-selftest-rlock.h"
+GENERATE_TESTCASE(ABBA_rlock)
+#include "locking-selftest-mutex.h"
+GENERATE_TESTCASE(ABBA_mutex)
+#include "locking-selftest-wsem.h"
+GENERATE_TESTCASE(ABBA_wsem)
+#include "locking-selftest-rsem.h"
+GENERATE_TESTCASE(ABBA_rsem)
+
+#undef E
+
+/*
+ * AB BC CA deadlock:
+ */
+
+#define E()					\
+						\
+	LOCK_UNLOCK_2(A, B);			\
+	LOCK_UNLOCK_2(B, C);			\
+	LOCK_UNLOCK_2(C, A); /* fail */
+
+/*
+ * 6 testcases:
+ */
+#include "locking-selftest-spin.h"
+GENERATE_TESTCASE(ABBCCA_spin)
+#include "locking-selftest-wlock.h"
+GENERATE_TESTCASE(ABBCCA_wlock)
+#include "locking-selftest-rlock.h"
+GENERATE_TESTCASE(ABBCCA_rlock)
+#include "locking-selftest-mutex.h"
+GENERATE_TESTCASE(ABBCCA_mutex)
+#include "locking-selftest-wsem.h"
+GENERATE_TESTCASE(ABBCCA_wsem)
+#include "locking-selftest-rsem.h"
+GENERATE_TESTCASE(ABBCCA_rsem)
+
+#undef E
+
+/*
+ * AB CA BC deadlock:
+ */
+
+#define E()					\
+						\
+	LOCK_UNLOCK_2(A, B);			\
+	LOCK_UNLOCK_2(C, A);			\
+	LOCK_UNLOCK_2(B, C); /* fail */
+
+/*
+ * 6 testcases:
+ */
+#include "locking-selftest-spin.h"
+GENERATE_TESTCASE(ABCABC_spin)
+#include "locking-selftest-wlock.h"
+GENERATE_TESTCASE(ABCABC_wlock)
+#include "locking-selftest-rlock.h"
+GENERATE_TESTCASE(ABCABC_rlock)
+#include "locking-selftest-mutex.h"
+GENERATE_TESTCASE(ABCABC_mutex)
+#include "locking-selftest-wsem.h"
+GENERATE_TESTCASE(ABCABC_wsem)
+#include "locking-selftest-rsem.h"
+GENERATE_TESTCASE(ABCABC_rsem)
+
+#undef E
+
+/*
+ * AB BC CD DA deadlock:
+ */
+
+#define E()					\
+						\
+	LOCK_UNLOCK_2(A, B);			\
+	LOCK_UNLOCK_2(B, C);			\
+	LOCK_UNLOCK_2(C, D);			\
+	LOCK_UNLOCK_2(D, A); /* fail */
+
+/*
+ * 6 testcases:
+ */
+#include "locking-selftest-spin.h"
+GENERATE_TESTCASE(ABBCCDDA_spin)
+#include "locking-selftest-wlock.h"
+GENERATE_TESTCASE(ABBCCDDA_wlock)
+#include "locking-selftest-rlock.h"
+GENERATE_TESTCASE(ABBCCDDA_rlock)
+#include "locking-selftest-mutex.h"
+GENERATE_TESTCASE(ABBCCDDA_mutex)
+#include "locking-selftest-wsem.h"
+GENERATE_TESTCASE(ABBCCDDA_wsem)
+#include "locking-selftest-rsem.h"
+GENERATE_TESTCASE(ABBCCDDA_rsem)
+
+#undef E
+
+/*
+ * AB CD BD DA deadlock:
+ */
+#define E()					\
+						\
+	LOCK_UNLOCK_2(A, B);			\
+	LOCK_UNLOCK_2(C, D);			\
+	LOCK_UNLOCK_2(B, D);			\
+	LOCK_UNLOCK_2(D, A); /* fail */
+
+/*
+ * 6 testcases:
+ */
+#include "locking-selftest-spin.h"
+GENERATE_TESTCASE(ABCDBDDA_spin)
+#include "locking-selftest-wlock.h"
+GENERATE_TESTCASE(ABCDBDDA_wlock)
+#include "locking-selftest-rlock.h"
+GENERATE_TESTCASE(ABCDBDDA_rlock)
+#include "locking-selftest-mutex.h"
+GENERATE_TESTCASE(ABCDBDDA_mutex)
+#include "locking-selftest-wsem.h"
+GENERATE_TESTCASE(ABCDBDDA_wsem)
+#include "locking-selftest-rsem.h"
+GENERATE_TESTCASE(ABCDBDDA_rsem)
+
+#undef E
+
+/*
+ * AB CD BC DA deadlock:
+ */
+#define E()					\
+						\
+	LOCK_UNLOCK_2(A, B);			\
+	LOCK_UNLOCK_2(C, D);			\
+	LOCK_UNLOCK_2(B, C);			\
+	LOCK_UNLOCK_2(D, A); /* fail */
+
+/*
+ * 6 testcases:
+ */
+#include "locking-selftest-spin.h"
+GENERATE_TESTCASE(ABCDBCDA_spin)
+#include "locking-selftest-wlock.h"
+GENERATE_TESTCASE(ABCDBCDA_wlock)
+#include "locking-selftest-rlock.h"
+GENERATE_TESTCASE(ABCDBCDA_rlock)
+#include "locking-selftest-mutex.h"
+GENERATE_TESTCASE(ABCDBCDA_mutex)
+#include "locking-selftest-wsem.h"
+GENERATE_TESTCASE(ABCDBCDA_wsem)
+#include "locking-selftest-rsem.h"
+GENERATE_TESTCASE(ABCDBCDA_rsem)
+
+#undef E
+
+/*
+ * Double unlock:
+ */
+#define E()					\
+						\
+	LOCK(A);				\
+	UNLOCK(A);				\
+	UNLOCK(A); /* fail */
+
+/*
+ * 6 testcases:
+ */
+#include "locking-selftest-spin.h"
+GENERATE_TESTCASE(double_unlock_spin)
+#include "locking-selftest-wlock.h"
+GENERATE_TESTCASE(double_unlock_wlock)
+#include "locking-selftest-rlock.h"
+GENERATE_TESTCASE(double_unlock_rlock)
+#include "locking-selftest-mutex.h"
+GENERATE_TESTCASE(double_unlock_mutex)
+#include "locking-selftest-wsem.h"
+GENERATE_TESTCASE(double_unlock_wsem)
+#include "locking-selftest-rsem.h"
+GENERATE_TESTCASE(double_unlock_rsem)
+
+#undef E
+
+/*
+ * Bad unlock ordering:
+ */
+#define E()					\
+						\
+	LOCK(A);				\
+	LOCK(B);				\
+	UNLOCK(A); /* fail */			\
+	UNLOCK(B);
+
+/*
+ * 6 testcases:
+ */
+#include "locking-selftest-spin.h"
+GENERATE_TESTCASE(bad_unlock_order_spin)
+#include "locking-selftest-wlock.h"
+GENERATE_TESTCASE(bad_unlock_order_wlock)
+#include "locking-selftest-rlock.h"
+GENERATE_TESTCASE(bad_unlock_order_rlock)
+#include "locking-selftest-mutex.h"
+GENERATE_TESTCASE(bad_unlock_order_mutex)
+#include "locking-selftest-wsem.h"
+GENERATE_TESTCASE(bad_unlock_order_wsem)
+#include "locking-selftest-rsem.h"
+GENERATE_TESTCASE(bad_unlock_order_rsem)
+
+#undef E
+
+#ifdef CONFIG_LOCKDEP
+/*
+ * bad unlock ordering - but using the _non_nested API,
+ * which must suppress the warning:
+ */
+static void spin_order_nn(void)
+{
+	L(A);
+	L(B);
+	UNN(A); // this one should succeed
+	UNN(B);
+}
+
+static void rlock_order_nn(void)
+{
+	RL(A);
+	RL(B);
+	RUNN(A); // this one should succeed
+	RUNN(B);
+}
+
+static void mutex_order_nn(void)
+{
+	ML(A);
+	ML(B);
+	MUNN(A); // this one should succeed
+	MUNN(B);
+}
+
+static void rsem_order_nn(void)
+{
+	RSL(A);
+	RSL(B);
+	RSUNN(A); // this one should succeed
+	RSUNN(B);
+}
+
+#endif
+
+/*
+ * locking an irq-safe lock with irqs enabled:
+ */
+#define E1()				\
+					\
+	IRQ_ENTER();			\
+	LOCK(A);			\
+	UNLOCK(A);			\
+	IRQ_EXIT();
+
+#define E2()				\
+					\
+	LOCK(A);			\
+	UNLOCK(A);
+
+/*
+ * Generate 24 testcases:
+ */
+#include "locking-selftest-spin-hardirq.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe1_hard_spin)
+
+#include "locking-selftest-rlock-hardirq.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe1_hard_rlock)
+
+#include "locking-selftest-wlock-hardirq.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe1_hard_wlock)
+
+#include "locking-selftest-spin-softirq.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe1_soft_spin)
+
+#include "locking-selftest-rlock-softirq.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe1_soft_rlock)
+
+#include "locking-selftest-wlock-softirq.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe1_soft_wlock)
+
+#undef E1
+#undef E2
+
+/*
+ * Enabling hardirqs with a softirq-safe lock held:
+ */
+#define E1()				\
+					\
+	SOFTIRQ_ENTER();		\
+	LOCK(A);			\
+	UNLOCK(A);			\
+	SOFTIRQ_EXIT();
+
+#define E2()				\
+					\
+	HARDIRQ_DISABLE();		\
+	LOCK(A);			\
+	HARDIRQ_ENABLE();		\
+	UNLOCK(A);
+
+/*
+ * Generate 12 testcases:
+ */
+#include "locking-selftest-spin.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2A_spin)
+
+#include "locking-selftest-wlock.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2A_wlock)
+
+#include "locking-selftest-rlock.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2A_rlock)
+
+#undef E1
+#undef E2
+
+/*
+ * Enabling irqs with an irq-safe lock held:
+ */
+#define E1()				\
+					\
+	IRQ_ENTER();			\
+	LOCK(A);			\
+	UNLOCK(A);			\
+	IRQ_EXIT();
+
+#define E2()				\
+					\
+	IRQ_DISABLE();			\
+	LOCK(A);			\
+	IRQ_ENABLE();			\
+	UNLOCK(A);
+
+/*
+ * Generate 24 testcases:
+ */
+#include "locking-selftest-spin-hardirq.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2B_hard_spin)
+
+#include "locking-selftest-rlock-hardirq.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2B_hard_rlock)
+
+#include "locking-selftest-wlock-hardirq.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2B_hard_wlock)
+
+#include "locking-selftest-spin-softirq.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2B_soft_spin)
+
+#include "locking-selftest-rlock-softirq.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2B_soft_rlock)
+
+#include "locking-selftest-wlock-softirq.h"
+GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2B_soft_wlock)
+
+#undef E1
+#undef E2
+
+/*
+ * Acquiring an irq-unsafe lock while holding an irq-safe lock:
+ */
+#define E1()				\
+					\
+	LOCK(A);			\
+	LOCK(B);			\
+	UNLOCK(B);			\
+	UNLOCK(A);			\
+
+#define E2()				\
+					\
+	LOCK(B);			\
+	UNLOCK(B);
+
+#define E3()				\
+					\
+	IRQ_ENTER();			\
+	LOCK(A);			\
+	UNLOCK(A);			\
+	IRQ_EXIT();
+
+/*
+ * Generate 36 testcases:
+ */
+#include "locking-selftest-spin-hardirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irqsafe3_hard_spin)
+
+#include "locking-selftest-rlock-hardirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irqsafe3_hard_rlock)
+
+#include "locking-selftest-wlock-hardirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irqsafe3_hard_wlock)
+
+#include "locking-selftest-spin-softirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irqsafe3_soft_spin)
+
+#include "locking-selftest-rlock-softirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irqsafe3_soft_rlock)
+
+#include "locking-selftest-wlock-softirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irqsafe3_soft_wlock)
+
+#undef E1
+#undef E2
+#undef E3
+
+/*
+ * If a lock turns into softirq-safe, but earlier it took
+ * a softirq-unsafe lock:
+ */
+
+#define E1()				\
+	IRQ_DISABLE();			\
+	LOCK(A);			\
+	LOCK(B);			\
+	UNLOCK(B);			\
+	UNLOCK(A);			\
+	IRQ_ENABLE();
+
+#define E2()				\
+	LOCK(B);			\
+	UNLOCK(B);
+
+#define E3()				\
+	IRQ_ENTER();			\
+	LOCK(A);			\
+	UNLOCK(A);			\
+	IRQ_EXIT();
+
+/*
+ * Generate 36 testcases:
+ */
+#include "locking-selftest-spin-hardirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irqsafe4_hard_spin)
+
+#include "locking-selftest-rlock-hardirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irqsafe4_hard_rlock)
+
+#include "locking-selftest-wlock-hardirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irqsafe4_hard_wlock)
+
+#include "locking-selftest-spin-softirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irqsafe4_soft_spin)
+
+#include "locking-selftest-rlock-softirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irqsafe4_soft_rlock)
+
+#include "locking-selftest-wlock-softirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irqsafe4_soft_wlock)
+
+#undef E1
+#undef E2
+#undef E3
+
+/*
+ * read-lock / write-lock irq inversion.
+ *
+ * Deadlock scenario:
+ *
+ * CPU#1 is at #1, i.e. it has write-locked A, but has not
+ * taken B yet.
+ *
+ * CPU#2 is at #2, i.e. it has locked B.
+ *
+ * Hardirq hits CPU#2 at point #2 and is trying to read-lock A.
+ *
+ * The deadlock occurs because CPU#1 will spin on B, and CPU#2
+ * will spin on A.
+ */
+
+#define E1()				\
+					\
+	IRQ_DISABLE();			\
+	WL(A);				\
+	LOCK(B);			\
+	UNLOCK(B);			\
+	WU(A);				\
+	IRQ_ENABLE();
+
+#define E2()				\
+					\
+	LOCK(B);			\
+	UNLOCK(B);
+
+#define E3()				\
+					\
+	IRQ_ENTER();			\
+	RL(A);				\
+	RU(A);				\
+	IRQ_EXIT();
+
+/*
+ * Generate 36 testcases:
+ */
+#include "locking-selftest-spin-hardirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irq_inversion_hard_spin)
+
+#include "locking-selftest-rlock-hardirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irq_inversion_hard_rlock)
+
+#include "locking-selftest-wlock-hardirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irq_inversion_hard_wlock)
+
+#include "locking-selftest-spin-softirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irq_inversion_soft_spin)
+
+#include "locking-selftest-rlock-softirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irq_inversion_soft_rlock)
+
+#include "locking-selftest-wlock-softirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irq_inversion_soft_wlock)
+
+#undef E1
+#undef E2
+#undef E3
+
+/*
+ * read-lock / write-lock recursion that is actually safe.
+ */
+
+#define E1()				\
+					\
+	IRQ_DISABLE();			\
+	WL(A);				\
+	WU(A);				\
+	IRQ_ENABLE();
+
+#define E2()				\
+					\
+	RL(A);				\
+	RU(A);				\
+
+#define E3()				\
+					\
+	IRQ_ENTER();			\
+	RL(A);				\
+	L(B);				\
+	U(B);				\
+	RU(A);				\
+	IRQ_EXIT();
+
+/*
+ * Generate 12 testcases:
+ */
+#include "locking-selftest-hardirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion_hard)
+
+#include "locking-selftest-softirq.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion_soft)
+
+#undef E1
+#undef E2
+#undef E3
+
+/*
+ * read-lock / write-lock recursion that is unsafe.
+ */
+
+#define E1()				\
+					\
+	IRQ_DISABLE();			\
+	L(B);				\
+	WL(A);				\
+	WU(A);				\
+	U(B);				\
+	IRQ_ENABLE();
+
+#define E2()				\
+					\
+	RL(A);				\
+	RU(A);				\
+
+#define E3()				\
+					\
+	IRQ_ENTER();			\
+	L(B);				\
+	U(B);				\
+	IRQ_EXIT();
+
+/*
+ * Generate 12 testcases:
+ */
+#include "locking-selftest-hardirq.h"
+// GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion2_hard)
+
+#include "locking-selftest-softirq.h"
+// GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion2_soft)
+
+#define lockdep_reset()
+#define lockdep_reset_lock(x)
+
+#ifdef CONFIG_PROVE_SPIN_LOCKING
+# define I_SPINLOCK(x)	lockdep_reset_lock(&lock_##x.dep_map)
+#else
+# define I_SPINLOCK(x)
+#endif
+
+#ifdef CONFIG_PROVE_RW_LOCKING
+# define I_RWLOCK(x)	lockdep_reset_lock(&rwlock_##x.dep_map)
+#else
+# define I_RWLOCK(x)
+#endif
+
+#ifdef CONFIG_PROVE_MUTEX_LOCKING
+# define I_MUTEX(x)	lockdep_reset_lock(&mutex_##x.dep_map)
+#else
+# define I_MUTEX(x)
+#endif
+
+#ifdef CONFIG_PROVE_RWSEM_LOCKING
+# define I_RWSEM(x)	lockdep_reset_lock(&rwsem_##x.dep_map)
+#else
+# define I_RWSEM(x)
+#endif
+
+#define I1(x)					\
+	do {					\
+		I_SPINLOCK(x);			\
+		I_RWLOCK(x);			\
+		I_MUTEX(x);			\
+		I_RWSEM(x);			\
+	} while (0)
+
+#define I2(x)					\
+	do {					\
+		spin_lock_init(&lock_##x);	\
+		rwlock_init(&rwlock_##x);	\
+		mutex_init(&mutex_##x);		\
+		init_rwsem(&rwsem_##x);		\
+	} while (0)
+
+static void reset_locks(void)
+{
+	local_irq_disable();
+	I1(A); I1(B); I1(C); I1(D);
+	I1(X1); I1(X2); I1(Y1); I1(Y2); I1(Z1); I1(Z2);
+	lockdep_reset();
+	I2(A); I2(B); I2(C); I2(D);
+	init_shared_types();
+	local_irq_enable();
+}
+
+#undef I
+
+static int testcase_total;
+static int testcase_successes;
+static int expected_testcase_failures;
+static int unexpected_testcase_failures;
+
+static void dotest(void (*testcase_fn)(void), int expected, int locktype)
+{
+	unsigned long saved_preempt_count = preempt_count();
+	int unexpected_failure = 0;
+
+	WARN_ON(irqs_disabled());
+
+	testcase_fn();
+#ifdef CONFIG_PROVE_SPIN_LOCKING
+	if (locktype == LOCKTYPE_SPIN && debug_locks != expected)
+		unexpected_failure = 1;
+#endif
+#ifdef CONFIG_PROVE_RW_LOCKING
+	if (locktype == LOCKTYPE_RWLOCK && debug_locks != expected)
+		unexpected_failure = 1;
+#endif
+#ifdef CONFIG_PROVE_MUTEX_LOCKING
+	if (locktype == LOCKTYPE_MUTEX && debug_locks != expected)
+		unexpected_failure = 1;
+#endif
+#ifdef CONFIG_PROVE_RWSEM_LOCKING
+	if (locktype == LOCKTYPE_RWSEM && debug_locks != expected)
+		unexpected_failure = 1;
+#endif
+	if (debug_locks != expected) {
+		if (unexpected_failure) {
+			unexpected_testcase_failures++;
+			printk("FAILED|");
+		} else {
+			expected_testcase_failures++;
+			printk("failed|");
+		}
+	} else {
+		testcase_successes++;
+		printk("  ok  |");
+	}
+	testcase_total++;
+
+	/*
+	 * Some tests (e.g. double-unlock) might corrupt the preemption
+	 * count, so restore it:
+	 */
+	preempt_count() = saved_preempt_count;
+#ifdef CONFIG_TRACE_IRQFLAGS
+	if (softirq_count())
+		current->softirqs_enabled = 0;
+	else
+		current->softirqs_enabled = 1;
+#endif
+
+	reset_locks();
+}
+
+static inline void print_testname(const char *testname)
+{
+	printk("%33s:", testname);
+}
+
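+/*
+ * Testcase helpers: DO_TESTCASE_<n> runs <n> dotest() variants of a
+ * scenario (different lock types and/or hardirq/softirq flavors),
+ * 'B' marks helpers where the validator is expected to trigger, and
+ * the NxM forms repeat an M-variant helper for N event permutations:
+ */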
+#define DO_TESTCASE_1(desc, name, nr)				\
+	print_testname(desc"/"#nr);				\
+	dotest(name##_##nr, SUCCESS, LOCKTYPE_RWLOCK);		\
+	printk("\n");
+
+#define DO_TESTCASE_1B(desc, name, nr)				\
+	print_testname(desc"/"#nr);				\
+	dotest(name##_##nr, FAILURE, LOCKTYPE_RWLOCK);		\
+	printk("\n");
+
+#define DO_TESTCASE_3(desc, name, nr)				\
+	print_testname(desc"/"#nr);				\
+	dotest(name##_spin_##nr, FAILURE, LOCKTYPE_SPIN);	\
+	dotest(name##_wlock_##nr, FAILURE, LOCKTYPE_RWLOCK);	\
+	dotest(name##_rlock_##nr, SUCCESS, LOCKTYPE_RWLOCK);	\
+	printk("\n");
+
+#define DO_TESTCASE_6(desc, name)				\
+	print_testname(desc);					\
+	dotest(name##_spin, FAILURE, LOCKTYPE_SPIN);		\
+	dotest(name##_wlock, FAILURE, LOCKTYPE_RWLOCK);		\
+	dotest(name##_rlock, FAILURE, LOCKTYPE_RWLOCK);		\
+	dotest(name##_mutex, FAILURE, LOCKTYPE_MUTEX);		\
+	dotest(name##_wsem, FAILURE, LOCKTYPE_RWSEM);		\
+	dotest(name##_rsem, FAILURE, LOCKTYPE_RWSEM);		\
+	printk("\n");
+
+/*
+ * 'read' variant: rlocks must not trigger.
+ */
+#define DO_TESTCASE_6R(desc, name)				\
+	print_testname(desc);					\
+	dotest(name##_spin, FAILURE, LOCKTYPE_SPIN);		\
+	dotest(name##_wlock, FAILURE, LOCKTYPE_RWLOCK);		\
+	dotest(name##_rlock, SUCCESS, LOCKTYPE_RWLOCK);		\
+	dotest(name##_mutex, FAILURE, LOCKTYPE_MUTEX);		\
+	dotest(name##_wsem, FAILURE, LOCKTYPE_RWSEM);		\
+	dotest(name##_rsem, FAILURE, LOCKTYPE_RWSEM);		\
+	printk("\n");
+
+#define DO_TESTCASE_2I(desc, name, nr)				\
+	DO_TESTCASE_1("hard-"desc, name##_hard, nr);		\
+	DO_TESTCASE_1("soft-"desc, name##_soft, nr);
+
+#define DO_TESTCASE_2IB(desc, name, nr)				\
+	DO_TESTCASE_1B("hard-"desc, name##_hard, nr);		\
+	DO_TESTCASE_1B("soft-"desc, name##_soft, nr);
+
+#define DO_TESTCASE_6I(desc, name, nr)				\
+	DO_TESTCASE_3("hard-"desc, name##_hard, nr);		\
+	DO_TESTCASE_3("soft-"desc, name##_soft, nr);
+
+#define DO_TESTCASE_2x3(desc, name)				\
+	DO_TESTCASE_3(desc, name, 12);				\
+	DO_TESTCASE_3(desc, name, 21);
+
+#define DO_TESTCASE_2x6(desc, name)				\
+	DO_TESTCASE_6I(desc, name, 12);				\
+	DO_TESTCASE_6I(desc, name, 21);
+
+#define DO_TESTCASE_6x2(desc, name)				\
+	DO_TESTCASE_2I(desc, name, 123);			\
+	DO_TESTCASE_2I(desc, name, 132);			\
+	DO_TESTCASE_2I(desc, name, 213);			\
+	DO_TESTCASE_2I(desc, name, 231);			\
+	DO_TESTCASE_2I(desc, name, 312);			\
+	DO_TESTCASE_2I(desc, name, 321);
+
+#define DO_TESTCASE_6x2B(desc, name)				\
+	DO_TESTCASE_2IB(desc, name, 123);			\
+	DO_TESTCASE_2IB(desc, name, 132);			\
+	DO_TESTCASE_2IB(desc, name, 213);			\
+	DO_TESTCASE_2IB(desc, name, 231);			\
+	DO_TESTCASE_2IB(desc, name, 312);			\
+	DO_TESTCASE_2IB(desc, name, 321);
+
+
+#define DO_TESTCASE_6x6(desc, name)				\
+	DO_TESTCASE_6I(desc, name, 123);			\
+	DO_TESTCASE_6I(desc, name, 132);			\
+	DO_TESTCASE_6I(desc, name, 213);			\
+	DO_TESTCASE_6I(desc, name, 231);			\
+	DO_TESTCASE_6I(desc, name, 312);			\
+	DO_TESTCASE_6I(desc, name, 321);
+
+void locking_selftest(void)
+{
+	/*
+	 * Got a locking failure before the selftest ran?
+	 */
+	if (!debug_locks) {
+		printk("----------------------------------\n");
+		printk("| Locking API testsuite disabled |\n");
+		printk("----------------------------------\n");
+		return;
+	}
+
+	/*
+	 * Run the testsuite:
+	 */
+	printk("------------------------\n");
+	printk("| Locking API testsuite:\n");
+	printk("----------------------------------------------------------------------------\n");
+	printk("                                 | spin |wlock |rlock |mutex | wsem | rsem |\n");
+	printk("  --------------------------------------------------------------------------\n");
+
+	init_shared_types();
+	debug_locks_silent = !debug_locks_verbose;
+
+	DO_TESTCASE_6("A-A deadlock", AA);
+	DO_TESTCASE_6R("A-B-B-A deadlock", ABBA);
+	DO_TESTCASE_6R("A-B-B-C-C-A deadlock", ABBCCA);
+	DO_TESTCASE_6R("A-B-C-A-B-C deadlock", ABCABC);
+	DO_TESTCASE_6R("A-B-B-C-C-D-D-A deadlock", ABBCCDDA);
+	DO_TESTCASE_6R("A-B-C-D-B-D-D-A deadlock", ABCDBDDA);
+	DO_TESTCASE_6R("A-B-C-D-B-C-D-A deadlock", ABCDBCDA);
+	DO_TESTCASE_6("double unlock", double_unlock);
+	DO_TESTCASE_6("bad unlock order", bad_unlock_order);
+
+	printk("  --------------------------------------------------------------------------\n");
+	print_testname("recursive read-lock");
+	printk("             |");
+	dotest(rlock_AA1, SUCCESS, LOCKTYPE_RWLOCK);
+	printk("             |");
+	dotest(rsem_AA1, FAILURE, LOCKTYPE_RWLOCK);
+	printk("\n");
+
+	printk("  --------------------------------------------------------------------------\n");
+
+#ifdef CONFIG_LOCKDEP
+	print_testname("non-nested unlock");
+	dotest(spin_order_nn, SUCCESS, LOCKTYPE_SPIN);
+	dotest(rlock_order_nn, SUCCESS, LOCKTYPE_RWLOCK);
+	dotest(mutex_order_nn, SUCCESS, LOCKTYPE_MUTEX);
+	dotest(rsem_order_nn, SUCCESS, LOCKTYPE_RWSEM);
+	printk("\n");
+	printk("  ------------------------------------------------------------\n");
+#endif
+	/*
+	 * irq-context testcases:
+	 */
+	DO_TESTCASE_2x6("irqs-on + irq-safe-A", irqsafe1);
+	DO_TESTCASE_2x3("sirq-safe-A => hirqs-on", irqsafe2A);
+	DO_TESTCASE_2x6("safe-A + irqs-on", irqsafe2B);
+	DO_TESTCASE_6x6("safe-A + unsafe-B #1", irqsafe3);
+	DO_TESTCASE_6x6("safe-A + unsafe-B #2", irqsafe4);
+	DO_TESTCASE_6x6("irq lock-inversion", irq_inversion);
+
+	DO_TESTCASE_6x2("irq read-recursion", irq_read_recursion);
+//	DO_TESTCASE_6x2B("irq read-recursion #2", irq_read_recursion2);
+
+	if (unexpected_testcase_failures) {
+		printk("-----------------------------------------------------------------\n");
+		debug_locks = 0;
+		printk("BUG: %3d unexpected failures (out of %3d) - debugging disabled! |\n",
+			unexpected_testcase_failures, testcase_total);
+		printk("-----------------------------------------------------------------\n");
+	} else if (expected_testcase_failures && testcase_successes) {
+		printk("--------------------------------------------------------\n");
+		printk("%3d out of %3d testcases failed, as expected. |\n",
+			expected_testcase_failures, testcase_total);
+		printk("----------------------------------------------------\n");
+		debug_locks = 1;
+	} else if (expected_testcase_failures && !testcase_successes) {
+		printk("--------------------------------------------------------\n");
+		printk("All %3d testcases failed, as expected. |\n",
+			expected_testcase_failures);
+		printk("----------------------------------------\n");
+		debug_locks = 1;
+	} else {
+		printk("-------------------------------------------------------\n");
+		printk("Good, all %3d testcases passed! |\n",
+			testcase_successes);
+		printk("---------------------------------\n");
+		debug_locks = 1;
+	}
+	debug_locks_silent = 0;
+}


* [patch 09/61] lock validator: spin/rwlock init cleanups
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (7 preceding siblings ...)
  2006-05-29 21:23 ` [patch 08/61] lock validator: locking API self-tests Ingo Molnar
@ 2006-05-29 21:23 ` Ingo Molnar
  2006-05-29 21:23 ` [patch 10/61] lock validator: locking init debugging improvement Ingo Molnar
                   ` (64 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:23 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

locking init cleanups:

 - convert " = SPIN_LOCK_UNLOCKED" to spin_lock_init() or DEFINE_SPINLOCK()
 - convert rwlocks in a similar manner

this patch was generated automatically.
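
For illustration, the two conversion patterns look like this (a sketch;
"demo_lock" and "foo" are made-up names, not taken from the patch):

	/* static locks: old-style initializer ... */
	static spinlock_t demo_lock = SPIN_LOCK_UNLOCKED;
	/* ... becomes a static definition: */
	static DEFINE_SPINLOCK(demo_lock);

	/* locks embedded in runtime-allocated objects ... */
	foo->lock = SPIN_LOCK_UNLOCKED;
	/* ... become a runtime init call: */
	spin_lock_init(&foo->lock);

rwlocks are converted the same way, to DEFINE_RWLOCK() and rwlock_init().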

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 arch/ia64/sn/kernel/irq.c                    |    2 +-
 arch/mips/kernel/smtc.c                      |    4 ++--
 arch/powerpc/platforms/cell/spufs/switch.c   |    2 +-
 arch/powerpc/platforms/powermac/pfunc_core.c |    2 +-
 arch/powerpc/platforms/pseries/eeh_event.c   |    2 +-
 arch/powerpc/sysdev/mmio_nvram.c             |    2 +-
 arch/xtensa/kernel/time.c                    |    2 +-
 arch/xtensa/kernel/traps.c                   |    2 +-
 drivers/char/drm/drm_memory_debug.h          |    2 +-
 drivers/char/drm/via_dmablit.c               |    2 +-
 drivers/char/epca.c                          |    2 +-
 drivers/char/moxa.c                          |    2 +-
 drivers/char/specialix.c                     |    2 +-
 drivers/char/sx.c                            |    2 +-
 drivers/isdn/gigaset/common.c                |    2 +-
 drivers/leds/led-core.c                      |    2 +-
 drivers/leds/led-triggers.c                  |    2 +-
 drivers/message/i2o/exec-osm.c               |    2 +-
 drivers/misc/ibmasm/module.c                 |    2 +-
 drivers/pcmcia/m8xx_pcmcia.c                 |    4 ++--
 drivers/rapidio/rio-access.c                 |    4 ++--
 drivers/rtc/rtc-sa1100.c                     |    2 +-
 drivers/rtc/rtc-vr41xx.c                     |    2 +-
 drivers/s390/block/dasd_eer.c                |    2 +-
 drivers/scsi/libata-core.c                   |    2 +-
 drivers/sn/ioc3.c                            |    2 +-
 drivers/usb/ip/stub_dev.c                    |    4 ++--
 drivers/usb/ip/vhci_hcd.c                    |    4 ++--
 drivers/video/backlight/hp680_bl.c           |    2 +-
 fs/gfs2/ops_fstype.c                         |    2 +-
 fs/nfsd/nfs4state.c                          |    2 +-
 fs/ocfs2/cluster/heartbeat.c                 |    2 +-
 fs/ocfs2/cluster/tcp.c                       |    2 +-
 fs/ocfs2/dlm/dlmdomain.c                     |    2 +-
 fs/ocfs2/dlm/dlmlock.c                       |    2 +-
 fs/ocfs2/dlm/dlmrecovery.c                   |    4 ++--
 fs/ocfs2/dlmglue.c                           |    2 +-
 fs/ocfs2/journal.c                           |    2 +-
 fs/reiser4/block_alloc.c                     |    2 +-
 fs/reiser4/debug.c                           |    2 +-
 fs/reiser4/fsdata.c                          |    2 +-
 fs/reiser4/txnmgr.c                          |    2 +-
 include/asm-alpha/core_t2.h                  |    2 +-
 kernel/audit.c                               |    2 +-
 mm/sparse.c                                  |    2 +-
 net/ipv6/route.c                             |    2 +-
 net/sunrpc/auth_gss/gss_krb5_seal.c          |    2 +-
 net/tipc/bcast.c                             |    4 ++--
 net/tipc/bearer.c                            |    2 +-
 net/tipc/config.c                            |    2 +-
 net/tipc/dbg.c                               |    2 +-
 net/tipc/handler.c                           |    2 +-
 net/tipc/name_table.c                        |    4 ++--
 net/tipc/net.c                               |    2 +-
 net/tipc/node.c                              |    2 +-
 net/tipc/port.c                              |    4 ++--
 net/tipc/ref.c                               |    4 ++--
 net/tipc/subscr.c                            |    2 +-
 net/tipc/user_reg.c                          |    2 +-
 59 files changed, 69 insertions(+), 69 deletions(-)

Index: linux/arch/ia64/sn/kernel/irq.c
===================================================================
--- linux.orig/arch/ia64/sn/kernel/irq.c
+++ linux/arch/ia64/sn/kernel/irq.c
@@ -27,7 +27,7 @@ static void unregister_intr_pda(struct s
 int sn_force_interrupt_flag = 1;
 extern int sn_ioif_inited;
 struct list_head **sn_irq_lh;
-static spinlock_t sn_irq_info_lock = SPIN_LOCK_UNLOCKED; /* non-IRQ lock */
+static DEFINE_SPINLOCK(sn_irq_info_lock); /* non-IRQ lock */
 
 u64 sn_intr_alloc(nasid_t local_nasid, int local_widget,
 				     struct sn_irq_info *sn_irq_info,
Index: linux/arch/mips/kernel/smtc.c
===================================================================
--- linux.orig/arch/mips/kernel/smtc.c
+++ linux/arch/mips/kernel/smtc.c
@@ -367,7 +367,7 @@ void mipsmt_prepare_cpus(void)
 	dvpe();
 	dmt();
 
-	freeIPIq.lock = SPIN_LOCK_UNLOCKED;
+	spin_lock_init(&freeIPIq.lock);
 
 	/*
 	 * We probably don't have as many VPEs as we do SMP "CPUs",
@@ -375,7 +375,7 @@ void mipsmt_prepare_cpus(void)
 	 */
 	for (i=0; i<NR_CPUS; i++) {
 		IPIQ[i].head = IPIQ[i].tail = NULL;
-		IPIQ[i].lock = SPIN_LOCK_UNLOCKED;
+		spin_lock_init(&IPIQ[i].lock);
 		IPIQ[i].depth = 0;
 		ipi_timer_latch[i] = 0;
 	}
Index: linux/arch/powerpc/platforms/cell/spufs/switch.c
===================================================================
--- linux.orig/arch/powerpc/platforms/cell/spufs/switch.c
+++ linux/arch/powerpc/platforms/cell/spufs/switch.c
@@ -2183,7 +2183,7 @@ void spu_init_csa(struct spu_state *csa)
 
 	memset(lscsa, 0, sizeof(struct spu_lscsa));
 	csa->lscsa = lscsa;
-	csa->register_lock = SPIN_LOCK_UNLOCKED;
+	spin_lock_init(&csa->register_lock);
 
 	/* Set LS pages reserved to allow for user-space mapping. */
 	for (p = lscsa->ls; p < lscsa->ls + LS_SIZE; p += PAGE_SIZE)
Index: linux/arch/powerpc/platforms/powermac/pfunc_core.c
===================================================================
--- linux.orig/arch/powerpc/platforms/powermac/pfunc_core.c
+++ linux/arch/powerpc/platforms/powermac/pfunc_core.c
@@ -545,7 +545,7 @@ struct pmf_device {
 };
 
 static LIST_HEAD(pmf_devices);
-static spinlock_t pmf_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(pmf_lock);
 
 static void pmf_release_device(struct kref *kref)
 {
Index: linux/arch/powerpc/platforms/pseries/eeh_event.c
===================================================================
--- linux.orig/arch/powerpc/platforms/pseries/eeh_event.c
+++ linux/arch/powerpc/platforms/pseries/eeh_event.c
@@ -35,7 +35,7 @@
  */
 
 /* EEH event workqueue setup. */
-static spinlock_t eeh_eventlist_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(eeh_eventlist_lock);
 LIST_HEAD(eeh_eventlist);
 static void eeh_thread_launcher(void *);
 DECLARE_WORK(eeh_event_wq, eeh_thread_launcher, NULL);
Index: linux/arch/powerpc/sysdev/mmio_nvram.c
===================================================================
--- linux.orig/arch/powerpc/sysdev/mmio_nvram.c
+++ linux/arch/powerpc/sysdev/mmio_nvram.c
@@ -32,7 +32,7 @@
 
 static void __iomem *mmio_nvram_start;
 static long mmio_nvram_len;
-static spinlock_t mmio_nvram_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(mmio_nvram_lock);
 
 static ssize_t mmio_nvram_read(char *buf, size_t count, loff_t *index)
 {
Index: linux/arch/xtensa/kernel/time.c
===================================================================
--- linux.orig/arch/xtensa/kernel/time.c
+++ linux/arch/xtensa/kernel/time.c
@@ -29,7 +29,7 @@
 
 extern volatile unsigned long wall_jiffies;
 
-spinlock_t rtc_lock = SPIN_LOCK_UNLOCKED;
+DEFINE_SPINLOCK(rtc_lock);
 EXPORT_SYMBOL(rtc_lock);
 
 
Index: linux/arch/xtensa/kernel/traps.c
===================================================================
--- linux.orig/arch/xtensa/kernel/traps.c
+++ linux/arch/xtensa/kernel/traps.c
@@ -461,7 +461,7 @@ void show_code(unsigned int *pc)
 	}
 }
 
-spinlock_t die_lock = SPIN_LOCK_UNLOCKED;
+DEFINE_SPINLOCK(die_lock);
 
 void die(const char * str, struct pt_regs * regs, long err)
 {
Index: linux/drivers/char/drm/drm_memory_debug.h
===================================================================
--- linux.orig/drivers/char/drm/drm_memory_debug.h
+++ linux/drivers/char/drm/drm_memory_debug.h
@@ -43,7 +43,7 @@ typedef struct drm_mem_stats {
 	unsigned long bytes_freed;
 } drm_mem_stats_t;
 
-static spinlock_t drm_mem_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(drm_mem_lock);
 static unsigned long drm_ram_available = 0;	/* In pages */
 static unsigned long drm_ram_used = 0;
 static drm_mem_stats_t drm_mem_stats[] =
Index: linux/drivers/char/drm/via_dmablit.c
===================================================================
--- linux.orig/drivers/char/drm/via_dmablit.c
+++ linux/drivers/char/drm/via_dmablit.c
@@ -557,7 +557,7 @@ via_init_dmablit(drm_device_t *dev)
 		blitq->num_outstanding = 0;
 		blitq->is_active = 0;
 		blitq->aborting = 0;
-		blitq->blit_lock = SPIN_LOCK_UNLOCKED;
+		spin_lock_init(&blitq->blit_lock);
 		for (j=0; j<VIA_NUM_BLIT_SLOTS; ++j) {
 			DRM_INIT_WAITQUEUE(blitq->blit_queue + j);
 		}
Index: linux/drivers/char/epca.c
===================================================================
--- linux.orig/drivers/char/epca.c
+++ linux/drivers/char/epca.c
@@ -80,7 +80,7 @@ static int invalid_lilo_config;
 /* The ISA boards do window flipping into the same spaces so its only sane
    with a single lock. It's still pretty efficient */
 
-static spinlock_t epca_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(epca_lock);
 
 /* -----------------------------------------------------------------------
 	MAXBOARDS is typically 12, but ISA and EISA cards are restricted to 
Index: linux/drivers/char/moxa.c
===================================================================
--- linux.orig/drivers/char/moxa.c
+++ linux/drivers/char/moxa.c
@@ -301,7 +301,7 @@ static struct tty_operations moxa_ops = 
 	.tiocmset = moxa_tiocmset,
 };
 
-static spinlock_t moxa_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(moxa_lock);
 
 #ifdef CONFIG_PCI
 static int moxa_get_PCI_conf(struct pci_dev *p, int board_type, moxa_board_conf * board)
Index: linux/drivers/char/specialix.c
===================================================================
--- linux.orig/drivers/char/specialix.c
+++ linux/drivers/char/specialix.c
@@ -2477,7 +2477,7 @@ static int __init specialix_init(void)
 #endif
 
 	for (i = 0; i < SX_NBOARD; i++)
-		sx_board[i].lock = SPIN_LOCK_UNLOCKED;
+		spin_lock_init(&sx_board[i].lock);
 
 	if (sx_init_drivers()) {
 		func_exit();
Index: linux/drivers/char/sx.c
===================================================================
--- linux.orig/drivers/char/sx.c
+++ linux/drivers/char/sx.c
@@ -2320,7 +2320,7 @@ static int sx_init_portstructs (int nboa
 #ifdef NEW_WRITE_LOCKING
 			port->gs.port_write_mutex = MUTEX;
 #endif
-			port->gs.driver_lock = SPIN_LOCK_UNLOCKED;
+			spin_lock_init(&port->gs.driver_lock);
 			/*
 			 * Initializing wait queue
 			 */
Index: linux/drivers/isdn/gigaset/common.c
===================================================================
--- linux.orig/drivers/isdn/gigaset/common.c
+++ linux/drivers/isdn/gigaset/common.c
@@ -981,7 +981,7 @@ exit:
 EXPORT_SYMBOL_GPL(gigaset_stop);
 
 static LIST_HEAD(drivers);
-static spinlock_t driver_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(driver_lock);
 
 struct cardstate *gigaset_get_cs_by_id(int id)
 {
Index: linux/drivers/leds/led-core.c
===================================================================
--- linux.orig/drivers/leds/led-core.c
+++ linux/drivers/leds/led-core.c
@@ -18,7 +18,7 @@
 #include <linux/leds.h>
 #include "leds.h"
 
-rwlock_t leds_list_lock = RW_LOCK_UNLOCKED;
+DEFINE_RWLOCK(leds_list_lock);
 LIST_HEAD(leds_list);
 
 EXPORT_SYMBOL_GPL(leds_list);
Index: linux/drivers/leds/led-triggers.c
===================================================================
--- linux.orig/drivers/leds/led-triggers.c
+++ linux/drivers/leds/led-triggers.c
@@ -26,7 +26,7 @@
 /*
  * Nests outside led_cdev->trigger_lock
  */
-static rwlock_t triggers_list_lock = RW_LOCK_UNLOCKED;
+static DEFINE_RWLOCK(triggers_list_lock);
 static LIST_HEAD(trigger_list);
 
 ssize_t led_trigger_store(struct class_device *dev, const char *buf,
Index: linux/drivers/message/i2o/exec-osm.c
===================================================================
--- linux.orig/drivers/message/i2o/exec-osm.c
+++ linux/drivers/message/i2o/exec-osm.c
@@ -213,7 +213,7 @@ static int i2o_msg_post_wait_complete(st
 {
 	struct i2o_exec_wait *wait, *tmp;
 	unsigned long flags;
-	static spinlock_t lock = SPIN_LOCK_UNLOCKED;
+	static DEFINE_SPINLOCK(lock);
 	int rc = 1;
 
 	/*
Index: linux/drivers/misc/ibmasm/module.c
===================================================================
--- linux.orig/drivers/misc/ibmasm/module.c
+++ linux/drivers/misc/ibmasm/module.c
@@ -85,7 +85,7 @@ static int __devinit ibmasm_init_one(str
 	}
 	memset(sp, 0, sizeof(struct service_processor));
 
-	sp->lock = SPIN_LOCK_UNLOCKED;
+	spin_lock_init(&sp->lock);
 	INIT_LIST_HEAD(&sp->command_queue);
 
 	pci_set_drvdata(pdev, (void *)sp);
Index: linux/drivers/pcmcia/m8xx_pcmcia.c
===================================================================
--- linux.orig/drivers/pcmcia/m8xx_pcmcia.c
+++ linux/drivers/pcmcia/m8xx_pcmcia.c
@@ -157,7 +157,7 @@ MODULE_LICENSE("Dual MPL/GPL");
 
 static int pcmcia_schlvl = PCMCIA_SCHLVL;
 
-static spinlock_t events_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(events_lock);
 
 
 #define PCMCIA_SOCKET_KEY_5V 1
@@ -644,7 +644,7 @@ static struct platform_device m8xx_devic
 };
 
 static u32 pending_events[PCMCIA_SOCKETS_NO];
-static spinlock_t pending_event_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(pending_event_lock);
 
 static irqreturn_t m8xx_interrupt(int irq, void *dev, struct pt_regs *regs)
 {
Index: linux/drivers/rapidio/rio-access.c
===================================================================
--- linux.orig/drivers/rapidio/rio-access.c
+++ linux/drivers/rapidio/rio-access.c
@@ -17,8 +17,8 @@
  * These interrupt-safe spinlocks protect all accesses to RIO
  * configuration space and doorbell access.
  */
-static spinlock_t rio_config_lock = SPIN_LOCK_UNLOCKED;
-static spinlock_t rio_doorbell_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(rio_config_lock);
+static DEFINE_SPINLOCK(rio_doorbell_lock);
 
 /*
  *  Wrappers for all RIO configuration access functions.  They just check
Index: linux/drivers/rtc/rtc-sa1100.c
===================================================================
--- linux.orig/drivers/rtc/rtc-sa1100.c
+++ linux/drivers/rtc/rtc-sa1100.c
@@ -45,7 +45,7 @@
 
 static unsigned long rtc_freq = 1024;
 static struct rtc_time rtc_alarm;
-static spinlock_t sa1100_rtc_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(sa1100_rtc_lock);
 
 static int rtc_update_alarm(struct rtc_time *alrm)
 {
Index: linux/drivers/rtc/rtc-vr41xx.c
===================================================================
--- linux.orig/drivers/rtc/rtc-vr41xx.c
+++ linux/drivers/rtc/rtc-vr41xx.c
@@ -93,7 +93,7 @@ static void __iomem *rtc2_base;
 
 static unsigned long epoch = 1970;	/* Jan 1 1970 00:00:00 */
 
-static spinlock_t rtc_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(rtc_lock);
 static char rtc_name[] = "RTC";
 static unsigned long periodic_frequency;
 static unsigned long periodic_count;
Index: linux/drivers/s390/block/dasd_eer.c
===================================================================
--- linux.orig/drivers/s390/block/dasd_eer.c
+++ linux/drivers/s390/block/dasd_eer.c
@@ -89,7 +89,7 @@ struct eerbuffer {
 };
 
 static LIST_HEAD(bufferlist);
-static spinlock_t bufferlock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(bufferlock);
 static DECLARE_WAIT_QUEUE_HEAD(dasd_eer_read_wait_queue);
 
 /*
Index: linux/drivers/scsi/libata-core.c
===================================================================
--- linux.orig/drivers/scsi/libata-core.c
+++ linux/drivers/scsi/libata-core.c
@@ -5605,7 +5605,7 @@ module_init(ata_init);
 module_exit(ata_exit);
 
 static unsigned long ratelimit_time;
-static spinlock_t ata_ratelimit_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(ata_ratelimit_lock);
 
 int ata_ratelimit(void)
 {
Index: linux/drivers/sn/ioc3.c
===================================================================
--- linux.orig/drivers/sn/ioc3.c
+++ linux/drivers/sn/ioc3.c
@@ -26,7 +26,7 @@ static DECLARE_RWSEM(ioc3_devices_rwsem)
 
 static struct ioc3_submodule *ioc3_submodules[IOC3_MAX_SUBMODULES];
 static struct ioc3_submodule *ioc3_ethernet;
-static rwlock_t ioc3_submodules_lock = RW_LOCK_UNLOCKED;
+static DEFINE_RWLOCK(ioc3_submodules_lock);
 
 /* NIC probing code */
 
Index: linux/drivers/usb/ip/stub_dev.c
===================================================================
--- linux.orig/drivers/usb/ip/stub_dev.c
+++ linux/drivers/usb/ip/stub_dev.c
@@ -285,13 +285,13 @@ static struct stub_device * stub_device_
 
 	sdev->ud.side = USBIP_STUB;
 	sdev->ud.status = SDEV_ST_AVAILABLE;
-	sdev->ud.lock = SPIN_LOCK_UNLOCKED;
+	spin_lock_init(&sdev->ud.lock);
 	sdev->ud.tcp_socket = NULL;
 
 	INIT_LIST_HEAD(&sdev->priv_init);
 	INIT_LIST_HEAD(&sdev->priv_tx);
 	INIT_LIST_HEAD(&sdev->priv_free);
-	sdev->priv_lock = SPIN_LOCK_UNLOCKED;
+	spin_lock_init(&sdev->priv_lock);
 
 	sdev->ud.eh_ops.shutdown = stub_shutdown_connection;
 	sdev->ud.eh_ops.reset    = stub_device_reset;
Index: linux/drivers/usb/ip/vhci_hcd.c
===================================================================
--- linux.orig/drivers/usb/ip/vhci_hcd.c
+++ linux/drivers/usb/ip/vhci_hcd.c
@@ -768,11 +768,11 @@ static void vhci_device_init(struct vhci
 
 	vdev->ud.side   = USBIP_VHCI;
 	vdev->ud.status = VDEV_ST_NULL;
-	vdev->ud.lock   = SPIN_LOCK_UNLOCKED;
+	spin_lock_init(&vdev->ud.lock);
 
 	INIT_LIST_HEAD(&vdev->priv_rx);
 	INIT_LIST_HEAD(&vdev->priv_tx);
-	vdev->priv_lock = SPIN_LOCK_UNLOCKED;
+	spin_lock_init(&vdev->priv_lock);
 
 	init_waitqueue_head(&vdev->waitq);
 
Index: linux/drivers/video/backlight/hp680_bl.c
===================================================================
--- linux.orig/drivers/video/backlight/hp680_bl.c
+++ linux/drivers/video/backlight/hp680_bl.c
@@ -27,7 +27,7 @@
 
 static int hp680bl_suspended;
 static int current_intensity = 0;
-static spinlock_t bl_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(bl_lock);
 static struct backlight_device *hp680_backlight_device;
 
 static void hp680bl_send_intensity(struct backlight_device *bd)
Index: linux/fs/gfs2/ops_fstype.c
===================================================================
--- linux.orig/fs/gfs2/ops_fstype.c
+++ linux/fs/gfs2/ops_fstype.c
@@ -58,7 +58,7 @@ static struct gfs2_sbd *init_sbd(struct 
 	gfs2_tune_init(&sdp->sd_tune);
 
 	for (x = 0; x < GFS2_GL_HASH_SIZE; x++) {
-		sdp->sd_gl_hash[x].hb_lock = RW_LOCK_UNLOCKED;
+		rwlock_init(&sdp->sd_gl_hash[x].hb_lock);
 		INIT_LIST_HEAD(&sdp->sd_gl_hash[x].hb_list);
 	}
 	INIT_LIST_HEAD(&sdp->sd_reclaim_list);
Index: linux/fs/nfsd/nfs4state.c
===================================================================
--- linux.orig/fs/nfsd/nfs4state.c
+++ linux/fs/nfsd/nfs4state.c
@@ -123,7 +123,7 @@ static void release_stateid(struct nfs4_
  */
 
 /* recall_lock protects the del_recall_lru */
-static spinlock_t recall_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(recall_lock);
 static struct list_head del_recall_lru;
 
 static void
Index: linux/fs/ocfs2/cluster/heartbeat.c
===================================================================
--- linux.orig/fs/ocfs2/cluster/heartbeat.c
+++ linux/fs/ocfs2/cluster/heartbeat.c
@@ -54,7 +54,7 @@ static DECLARE_RWSEM(o2hb_callback_sem);
  * multiple hb threads are watching multiple regions.  A node is live
  * whenever any of the threads sees activity from the node in its region.
  */
-static spinlock_t o2hb_live_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(o2hb_live_lock);
 static struct list_head o2hb_live_slots[O2NM_MAX_NODES];
 static unsigned long o2hb_live_node_bitmap[BITS_TO_LONGS(O2NM_MAX_NODES)];
 static LIST_HEAD(o2hb_node_events);
Index: linux/fs/ocfs2/cluster/tcp.c
===================================================================
--- linux.orig/fs/ocfs2/cluster/tcp.c
+++ linux/fs/ocfs2/cluster/tcp.c
@@ -107,7 +107,7 @@
 	    ##args);							\
 } while (0)
 
-static rwlock_t o2net_handler_lock = RW_LOCK_UNLOCKED;
+static DEFINE_RWLOCK(o2net_handler_lock);
 static struct rb_root o2net_handler_tree = RB_ROOT;
 
 static struct o2net_node o2net_nodes[O2NM_MAX_NODES];
Index: linux/fs/ocfs2/dlm/dlmdomain.c
===================================================================
--- linux.orig/fs/ocfs2/dlm/dlmdomain.c
+++ linux/fs/ocfs2/dlm/dlmdomain.c
@@ -88,7 +88,7 @@ out_free:
  *
  */
 
-spinlock_t dlm_domain_lock = SPIN_LOCK_UNLOCKED;
+DEFINE_SPINLOCK(dlm_domain_lock);
 LIST_HEAD(dlm_domains);
 static DECLARE_WAIT_QUEUE_HEAD(dlm_domain_events);
 
Index: linux/fs/ocfs2/dlm/dlmlock.c
===================================================================
--- linux.orig/fs/ocfs2/dlm/dlmlock.c
+++ linux/fs/ocfs2/dlm/dlmlock.c
@@ -53,7 +53,7 @@
 #define MLOG_MASK_PREFIX ML_DLM
 #include "cluster/masklog.h"
 
-static spinlock_t dlm_cookie_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(dlm_cookie_lock);
 static u64 dlm_next_cookie = 1;
 
 static enum dlm_status dlm_send_remote_lock_request(struct dlm_ctxt *dlm,
Index: linux/fs/ocfs2/dlm/dlmrecovery.c
===================================================================
--- linux.orig/fs/ocfs2/dlm/dlmrecovery.c
+++ linux/fs/ocfs2/dlm/dlmrecovery.c
@@ -101,8 +101,8 @@ static int dlm_lockres_master_requery(st
 
 static u64 dlm_get_next_mig_cookie(void);
 
-static spinlock_t dlm_reco_state_lock = SPIN_LOCK_UNLOCKED;
-static spinlock_t dlm_mig_cookie_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(dlm_reco_state_lock);
+static DEFINE_SPINLOCK(dlm_mig_cookie_lock);
 static u64 dlm_mig_cookie = 1;
 
 static u64 dlm_get_next_mig_cookie(void)
Index: linux/fs/ocfs2/dlmglue.c
===================================================================
--- linux.orig/fs/ocfs2/dlmglue.c
+++ linux/fs/ocfs2/dlmglue.c
@@ -242,7 +242,7 @@ static void ocfs2_build_lock_name(enum o
 	mlog_exit_void();
 }
 
-static spinlock_t ocfs2_dlm_tracking_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(ocfs2_dlm_tracking_lock);
 
 static void ocfs2_add_lockres_tracking(struct ocfs2_lock_res *res,
 				       struct ocfs2_dlm_debug *dlm_debug)
Index: linux/fs/ocfs2/journal.c
===================================================================
--- linux.orig/fs/ocfs2/journal.c
+++ linux/fs/ocfs2/journal.c
@@ -49,7 +49,7 @@
 
 #include "buffer_head_io.h"
 
-spinlock_t trans_inc_lock = SPIN_LOCK_UNLOCKED;
+DEFINE_SPINLOCK(trans_inc_lock);
 
 static int ocfs2_force_read_journal(struct inode *inode);
 static int ocfs2_recover_node(struct ocfs2_super *osb,
Index: linux/fs/reiser4/block_alloc.c
===================================================================
--- linux.orig/fs/reiser4/block_alloc.c
+++ linux/fs/reiser4/block_alloc.c
@@ -499,7 +499,7 @@ void cluster_reserved2free(int count)
 	spin_unlock_reiser4_super(sbinfo);
 }
 
-static spinlock_t fake_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(fake_lock);
 static reiser4_block_nr fake_gen = 0;
 
 /* obtain a block number for new formatted node which will be used to refer
Index: linux/fs/reiser4/debug.c
===================================================================
--- linux.orig/fs/reiser4/debug.c
+++ linux/fs/reiser4/debug.c
@@ -52,7 +52,7 @@ static char panic_buf[REISER4_PANIC_MSG_
 /*
  * lock protecting consistency of panic_buf under concurrent panics
  */
-static spinlock_t panic_guard = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(panic_guard);
 
 /* Your best friend. Call it on each occasion.  This is called by
     fs/reiser4/debug.h:reiser4_panic(). */
Index: linux/fs/reiser4/fsdata.c
===================================================================
--- linux.orig/fs/reiser4/fsdata.c
+++ linux/fs/reiser4/fsdata.c
@@ -17,7 +17,7 @@ static LIST_HEAD(cursor_cache);
 static unsigned long d_cursor_unused = 0;
 
 /* spinlock protecting manipulations with dir_cursor's hash table and lists */
-spinlock_t d_lock = SPIN_LOCK_UNLOCKED;
+DEFINE_SPINLOCK(d_lock);
 
 static reiser4_file_fsdata *create_fsdata(struct file *file);
 static int file_is_stateless(struct file *file);
Index: linux/fs/reiser4/txnmgr.c
===================================================================
--- linux.orig/fs/reiser4/txnmgr.c
+++ linux/fs/reiser4/txnmgr.c
@@ -905,7 +905,7 @@ jnode *find_first_dirty_jnode(txn_atom *
 
 /* this spin lock is used to prevent races during steal on capture.
    FIXME: should be per filesystem or even per atom */
-spinlock_t scan_lock = SPIN_LOCK_UNLOCKED;
+DEFINE_SPINLOCK(scan_lock);
 
 /* Scan atom->writeback_nodes list and dispatch jnodes according to their state:
  * move dirty and !writeback jnodes to @fq, clean jnodes to atom's clean
Index: linux/include/asm-alpha/core_t2.h
===================================================================
--- linux.orig/include/asm-alpha/core_t2.h
+++ linux/include/asm-alpha/core_t2.h
@@ -435,7 +435,7 @@ static inline void t2_outl(u32 b, unsign
 	set_hae(msb); \
 }
 
-static spinlock_t t2_hae_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(t2_hae_lock);
 
 __EXTERN_INLINE u8 t2_readb(const volatile void __iomem *xaddr)
 {
Index: linux/kernel/audit.c
===================================================================
--- linux.orig/kernel/audit.c
+++ linux/kernel/audit.c
@@ -787,7 +787,7 @@ err:
  */
 unsigned int audit_serial(void)
 {
-	static spinlock_t serial_lock = SPIN_LOCK_UNLOCKED;
+	static DEFINE_SPINLOCK(serial_lock);
 	static unsigned int serial = 0;
 
 	unsigned long flags;
Index: linux/mm/sparse.c
===================================================================
--- linux.orig/mm/sparse.c
+++ linux/mm/sparse.c
@@ -45,7 +45,7 @@ static struct mem_section *sparse_index_
 
 static int sparse_index_init(unsigned long section_nr, int nid)
 {
-	static spinlock_t index_init_lock = SPIN_LOCK_UNLOCKED;
+	static DEFINE_SPINLOCK(index_init_lock);
 	unsigned long root = SECTION_NR_TO_ROOT(section_nr);
 	struct mem_section *section;
 	int ret = 0;
Index: linux/net/ipv6/route.c
===================================================================
--- linux.orig/net/ipv6/route.c
+++ linux/net/ipv6/route.c
@@ -343,7 +343,7 @@ static struct rt6_info *rt6_select(struc
 	    (strict & RT6_SELECT_F_REACHABLE) &&
 	    last && last != rt0) {
 		/* no entries matched; do round-robin */
-		static spinlock_t lock = SPIN_LOCK_UNLOCKED;
+		static DEFINE_SPINLOCK(lock);
 		spin_lock(&lock);
 		*head = rt0->u.next;
 		rt0->u.next = last->u.next;
Index: linux/net/sunrpc/auth_gss/gss_krb5_seal.c
===================================================================
--- linux.orig/net/sunrpc/auth_gss/gss_krb5_seal.c
+++ linux/net/sunrpc/auth_gss/gss_krb5_seal.c
@@ -70,7 +70,7 @@
 # define RPCDBG_FACILITY        RPCDBG_AUTH
 #endif
 
-spinlock_t krb5_seq_lock = SPIN_LOCK_UNLOCKED;
+DEFINE_SPINLOCK(krb5_seq_lock);
 
 u32
 gss_get_mic_kerberos(struct gss_ctx *gss_ctx, struct xdr_buf *text,
Index: linux/net/tipc/bcast.c
===================================================================
--- linux.orig/net/tipc/bcast.c
+++ linux/net/tipc/bcast.c
@@ -102,7 +102,7 @@ struct bclink {
 static struct bcbearer *bcbearer = NULL;
 static struct bclink *bclink = NULL;
 static struct link *bcl = NULL;
-static spinlock_t bc_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(bc_lock);
 
 char tipc_bclink_name[] = "multicast-link";
 
@@ -783,7 +783,7 @@ int tipc_bclink_init(void)
 	memset(bclink, 0, sizeof(struct bclink));
 	INIT_LIST_HEAD(&bcl->waiting_ports);
 	bcl->next_out_no = 1;
-	bclink->node.lock =  SPIN_LOCK_UNLOCKED;        
+	spin_lock_init(&bclink->node.lock);
 	bcl->owner = &bclink->node;
         bcl->max_pkt = MAX_PKT_DEFAULT_MCAST;
 	tipc_link_set_queue_limits(bcl, BCLINK_WIN_DEFAULT);
Index: linux/net/tipc/bearer.c
===================================================================
--- linux.orig/net/tipc/bearer.c
+++ linux/net/tipc/bearer.c
@@ -552,7 +552,7 @@ restart:
 		b_ptr->link_req = tipc_disc_init_link_req(b_ptr, &m_ptr->bcast_addr,
 							  bcast_scope, 2);
 	}
-	b_ptr->publ.lock = SPIN_LOCK_UNLOCKED;
+	spin_lock_init(&b_ptr->publ.lock);
 	write_unlock_bh(&tipc_net_lock);
 	info("Enabled bearer <%s>, discovery domain %s, priority %u\n",
 	     name, addr_string_fill(addr_string, bcast_scope), priority);
Index: linux/net/tipc/config.c
===================================================================
--- linux.orig/net/tipc/config.c
+++ linux/net/tipc/config.c
@@ -63,7 +63,7 @@ struct manager {
 
 static struct manager mng = { 0};
 
-static spinlock_t config_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(config_lock);
 
 static const void *req_tlv_area;	/* request message TLV area */
 static int req_tlv_space;		/* request message TLV area size */
Index: linux/net/tipc/dbg.c
===================================================================
--- linux.orig/net/tipc/dbg.c
+++ linux/net/tipc/dbg.c
@@ -41,7 +41,7 @@
 #define MAX_STRING 512
 
 static char print_string[MAX_STRING];
-static spinlock_t print_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(print_lock);
 
 static struct print_buf cons_buf = { NULL, 0, NULL, NULL };
 struct print_buf *TIPC_CONS = &cons_buf;
Index: linux/net/tipc/handler.c
===================================================================
--- linux.orig/net/tipc/handler.c
+++ linux/net/tipc/handler.c
@@ -44,7 +44,7 @@ struct queue_item {
 
 static kmem_cache_t *tipc_queue_item_cache;
 static struct list_head signal_queue_head;
-static spinlock_t qitem_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(qitem_lock);
 static int handler_enabled = 0;
 
 static void process_signal_queue(unsigned long dummy);
Index: linux/net/tipc/name_table.c
===================================================================
--- linux.orig/net/tipc/name_table.c
+++ linux/net/tipc/name_table.c
@@ -101,7 +101,7 @@ struct name_table {
 
 static struct name_table table = { NULL } ;
 static atomic_t rsv_publ_ok = ATOMIC_INIT(0);
-rwlock_t tipc_nametbl_lock = RW_LOCK_UNLOCKED;
+DEFINE_RWLOCK(tipc_nametbl_lock);
 
 
 static int hash(int x)
@@ -172,7 +172,7 @@ static struct name_seq *tipc_nameseq_cre
 	}
 
 	memset(nseq, 0, sizeof(*nseq));
-	nseq->lock = SPIN_LOCK_UNLOCKED;
+	spin_lock_init(&nseq->lock);
 	nseq->type = type;
 	nseq->sseqs = sseq;
 	dbg("tipc_nameseq_create() nseq = %x type %u, ssseqs %x, ff: %u\n",
Index: linux/net/tipc/net.c
===================================================================
--- linux.orig/net/tipc/net.c
+++ linux/net/tipc/net.c
@@ -115,7 +115,7 @@
  *     - A local spin_lock protecting the queue of subscriber events.
 */
 
-rwlock_t tipc_net_lock = RW_LOCK_UNLOCKED;
+DEFINE_RWLOCK(tipc_net_lock);
 struct network tipc_net = { NULL };
 
 struct node *tipc_net_select_remote_node(u32 addr, u32 ref) 
Index: linux/net/tipc/node.c
===================================================================
--- linux.orig/net/tipc/node.c
+++ linux/net/tipc/node.c
@@ -64,7 +64,7 @@ struct node *tipc_node_create(u32 addr)
         if (n_ptr != NULL) {
                 memset(n_ptr, 0, sizeof(*n_ptr));
                 n_ptr->addr = addr;
-                n_ptr->lock =  SPIN_LOCK_UNLOCKED;	
+                spin_lock_init(&n_ptr->lock);
                 INIT_LIST_HEAD(&n_ptr->nsub);
 	
 		c_ptr = tipc_cltr_find(addr);
Index: linux/net/tipc/port.c
===================================================================
--- linux.orig/net/tipc/port.c
+++ linux/net/tipc/port.c
@@ -57,8 +57,8 @@
 static struct sk_buff *msg_queue_head = NULL;
 static struct sk_buff *msg_queue_tail = NULL;
 
-spinlock_t tipc_port_list_lock = SPIN_LOCK_UNLOCKED;
-static spinlock_t queue_lock = SPIN_LOCK_UNLOCKED;
+DEFINE_SPINLOCK(tipc_port_list_lock);
+static DEFINE_SPINLOCK(queue_lock);
 
 static LIST_HEAD(ports);
 static void port_handle_node_down(unsigned long ref);
Index: linux/net/tipc/ref.c
===================================================================
--- linux.orig/net/tipc/ref.c
+++ linux/net/tipc/ref.c
@@ -63,7 +63,7 @@
 
 struct ref_table tipc_ref_table = { NULL };
 
-static rwlock_t ref_table_lock = RW_LOCK_UNLOCKED;
+static DEFINE_RWLOCK(ref_table_lock);
 
 /**
  * tipc_ref_table_init - create reference table for objects
@@ -87,7 +87,7 @@ int tipc_ref_table_init(u32 requested_si
 	index_mask = sz - 1;
 	for (i = sz - 1; i >= 0; i--) {
 		table[i].object = NULL;
-		table[i].lock = SPIN_LOCK_UNLOCKED;
+		spin_lock_init(&table[i].lock);
 		table[i].data.next_plus_upper = (start & ~index_mask) + i - 1;
 	}
 	tipc_ref_table.entries = table;
Index: linux/net/tipc/subscr.c
===================================================================
--- linux.orig/net/tipc/subscr.c
+++ linux/net/tipc/subscr.c
@@ -457,7 +457,7 @@ int tipc_subscr_start(void)
 	int res = -1;
 
 	memset(&topsrv, 0, sizeof (topsrv));
-	topsrv.lock = SPIN_LOCK_UNLOCKED;
+	spin_lock_init(&topsrv.lock);
 	INIT_LIST_HEAD(&topsrv.subscriber_list);
 
 	spin_lock_bh(&topsrv.lock);
Index: linux/net/tipc/user_reg.c
===================================================================
--- linux.orig/net/tipc/user_reg.c
+++ linux/net/tipc/user_reg.c
@@ -67,7 +67,7 @@ struct tipc_user {
 
 static struct tipc_user *users = NULL;
 static u32 next_free_user = MAX_USERID + 1;
-static spinlock_t reg_lock = SPIN_LOCK_UNLOCKED;
+static DEFINE_SPINLOCK(reg_lock);
 
 /**
  * reg_init - create TIPC user registry (but don't activate it)


* [patch 10/61] lock validator: locking init debugging improvement
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (8 preceding siblings ...)
  2006-05-29 21:23 ` [patch 09/61] lock validator: spin/rwlock init cleanups Ingo Molnar
@ 2006-05-29 21:23 ` Ingo Molnar
  2006-05-29 21:23 ` [patch 11/61] lock validator: lockdep: small xfs init_rwsem() cleanup Ingo Molnar
                   ` (63 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:23 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

locking init improvement:

 - introduce and use __SPIN_LOCK_UNLOCKED for array initializations,
   so that the name string of each lock can be passed in to the
   debugging code (a usage sketch follows below)
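
A minimal usage sketch (the "eng" struct and "engines" array are
hypothetical, not taken from this patch):

	static struct eng {
		spinlock_t	lock;
	} engines[2] = {
		{ .lock = __SPIN_LOCK_UNLOCKED(engines[0].lock) },
		{ .lock = __SPIN_LOCK_UNLOCKED(engines[1].lock) },
	};

the name argument exists so that the lock's name can be handed to the
debugging code; the plain SPIN_LOCK_UNLOCKED / RW_LOCK_UNLOCKED forms
remain available as compatibility wrappers around the new macros.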

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 arch/x86_64/kernel/smpboot.c   |    3 +++
 arch/x86_64/kernel/vsyscall.c  |    2 +-
 block/ll_rw_blk.c              |    1 +
 drivers/char/random.c          |    6 +++---
 drivers/ide/ide-io.c           |    2 ++
 drivers/scsi/libata-core.c     |    2 ++
 drivers/spi/spi.c              |    1 +
 fs/dcache.c                    |    2 +-
 include/linux/idr.h            |    2 +-
 include/linux/init_task.h      |   10 +++++-----
 include/linux/notifier.h       |    2 +-
 include/linux/seqlock.h        |   12 ++++++++++--
 include/linux/spinlock_types.h |   15 +++++++++------
 include/linux/wait.h           |    2 +-
 kernel/kmod.c                  |    2 ++
 kernel/rcupdate.c              |    4 ++--
 kernel/timer.c                 |    2 +-
 mm/swap_state.c                |    2 +-
 net/ipv4/tcp_ipv4.c            |    2 +-
 net/ipv4/tcp_minisocks.c       |    2 +-
 net/ipv4/xfrm4_policy.c        |    4 ++--
 21 files changed, 51 insertions(+), 29 deletions(-)

Index: linux/arch/x86_64/kernel/smpboot.c
===================================================================
--- linux.orig/arch/x86_64/kernel/smpboot.c
+++ linux/arch/x86_64/kernel/smpboot.c
@@ -771,8 +771,11 @@ static int __cpuinit do_boot_cpu(int cpu
 		.cpu = cpu,
 		.done = COMPLETION_INITIALIZER(c_idle.done),
 	};
+
 	DECLARE_WORK(work, do_fork_idle, &c_idle);
 
+	init_completion(&c_idle.done);
+
 	/* allocate memory for gdts of secondary cpus. Hotplug is considered */
 	if (!cpu_gdt_descr[cpu].address &&
 		!(cpu_gdt_descr[cpu].address = get_zeroed_page(GFP_KERNEL))) {
Index: linux/arch/x86_64/kernel/vsyscall.c
===================================================================
--- linux.orig/arch/x86_64/kernel/vsyscall.c
+++ linux/arch/x86_64/kernel/vsyscall.c
@@ -37,7 +37,7 @@
 #define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr)))
 
 int __sysctl_vsyscall __section_sysctl_vsyscall = 1;
-seqlock_t __xtime_lock __section_xtime_lock = SEQLOCK_UNLOCKED;
+__section_xtime_lock DEFINE_SEQLOCK(__xtime_lock);
 
 #include <asm/unistd.h>
 
Index: linux/block/ll_rw_blk.c
===================================================================
--- linux.orig/block/ll_rw_blk.c
+++ linux/block/ll_rw_blk.c
@@ -2529,6 +2529,7 @@ int blk_execute_rq(request_queue_t *q, s
 	char sense[SCSI_SENSE_BUFFERSIZE];
 	int err = 0;
 
+	init_completion(&wait);
 	/*
 	 * we need an extra reference to the request, so we can look at
 	 * it after io completion
Index: linux/drivers/char/random.c
===================================================================
--- linux.orig/drivers/char/random.c
+++ linux/drivers/char/random.c
@@ -417,7 +417,7 @@ static struct entropy_store input_pool =
 	.poolinfo = &poolinfo_table[0],
 	.name = "input",
 	.limit = 1,
-	.lock = SPIN_LOCK_UNLOCKED,
+	.lock = __SPIN_LOCK_UNLOCKED(&input_pool.lock),
 	.pool = input_pool_data
 };
 
@@ -426,7 +426,7 @@ static struct entropy_store blocking_poo
 	.name = "blocking",
 	.limit = 1,
 	.pull = &input_pool,
-	.lock = SPIN_LOCK_UNLOCKED,
+	.lock = __SPIN_LOCK_UNLOCKED(&blocking_pool.lock),
 	.pool = blocking_pool_data
 };
 
@@ -434,7 +434,7 @@ static struct entropy_store nonblocking_
 	.poolinfo = &poolinfo_table[1],
 	.name = "nonblocking",
 	.pull = &input_pool,
-	.lock = SPIN_LOCK_UNLOCKED,
+	.lock = __SPIN_LOCK_UNLOCKED(&nonblocking_pool.lock),
 	.pool = nonblocking_pool_data
 };
 
Index: linux/drivers/ide/ide-io.c
===================================================================
--- linux.orig/drivers/ide/ide-io.c
+++ linux/drivers/ide/ide-io.c
@@ -1700,6 +1700,8 @@ int ide_do_drive_cmd (ide_drive_t *drive
 	int where = ELEVATOR_INSERT_BACK, err;
 	int must_wait = (action == ide_wait || action == ide_head_wait);
 
+	init_completion(&wait);
+
 	rq->errors = 0;
 	rq->rq_status = RQ_ACTIVE;
 
Index: linux/drivers/scsi/libata-core.c
===================================================================
--- linux.orig/drivers/scsi/libata-core.c
+++ linux/drivers/scsi/libata-core.c
@@ -994,6 +994,8 @@ unsigned ata_exec_internal(struct ata_de
 	unsigned int err_mask;
 	int rc;
 
+	init_completion(&wait);
+
 	spin_lock_irqsave(&ap->host_set->lock, flags);
 
 	/* no internal command while frozen */
Index: linux/drivers/spi/spi.c
===================================================================
--- linux.orig/drivers/spi/spi.c
+++ linux/drivers/spi/spi.c
@@ -512,6 +512,7 @@ int spi_sync(struct spi_device *spi, str
 	DECLARE_COMPLETION(done);
 	int status;
 
+	init_completion(&done);
 	message->complete = spi_complete;
 	message->context = &done;
 	status = spi_async(spi, message);
Index: linux/fs/dcache.c
===================================================================
--- linux.orig/fs/dcache.c
+++ linux/fs/dcache.c
@@ -39,7 +39,7 @@ int sysctl_vfs_cache_pressure __read_mos
 EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
 
  __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lock);
-static seqlock_t rename_lock __cacheline_aligned_in_smp = SEQLOCK_UNLOCKED;
+static __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
 
 EXPORT_SYMBOL(dcache_lock);
 
Index: linux/include/linux/idr.h
===================================================================
--- linux.orig/include/linux/idr.h
+++ linux/include/linux/idr.h
@@ -66,7 +66,7 @@ struct idr {
 	.id_free	= NULL,					\
 	.layers 	= 0,					\
 	.id_free_cnt	= 0,					\
-	.lock		= SPIN_LOCK_UNLOCKED,			\
+	.lock		= __SPIN_LOCK_UNLOCKED(name.lock),	\
 }
 #define DEFINE_IDR(name)	struct idr name = IDR_INIT(name)
 
Index: linux/include/linux/init_task.h
===================================================================
--- linux.orig/include/linux/init_task.h
+++ linux/include/linux/init_task.h
@@ -22,7 +22,7 @@
 	.count		= ATOMIC_INIT(1), 		\
 	.fdt		= &init_files.fdtab, 		\
 	.fdtab		= INIT_FDTABLE,			\
-	.file_lock	= SPIN_LOCK_UNLOCKED, 		\
+	.file_lock	= __SPIN_LOCK_UNLOCKED(init_task.file_lock), \
 	.next_fd	= 0, 				\
 	.close_on_exec_init = { { 0, } }, 		\
 	.open_fds_init	= { { 0, } }, 			\
@@ -37,7 +37,7 @@
 	.user_id	= 0,				\
 	.next		= NULL,				\
 	.wait		= __WAIT_QUEUE_HEAD_INITIALIZER(name.wait), \
-	.ctx_lock	= SPIN_LOCK_UNLOCKED,		\
+	.ctx_lock	= __SPIN_LOCK_UNLOCKED(name.ctx_lock), \
 	.reqs_active	= 0U,				\
 	.max_reqs	= ~0U,				\
 }
@@ -49,7 +49,7 @@
 	.mm_users	= ATOMIC_INIT(2), 			\
 	.mm_count	= ATOMIC_INIT(1), 			\
 	.mmap_sem	= __RWSEM_INITIALIZER(name.mmap_sem),	\
-	.page_table_lock =  SPIN_LOCK_UNLOCKED, 		\
+	.page_table_lock =  __SPIN_LOCK_UNLOCKED(name.page_table_lock),	\
 	.mmlist		= LIST_HEAD_INIT(name.mmlist),		\
 	.cpu_vm_mask	= CPU_MASK_ALL,				\
 }
@@ -78,7 +78,7 @@ extern struct nsproxy init_nsproxy;
 #define INIT_SIGHAND(sighand) {						\
 	.count		= ATOMIC_INIT(1), 				\
 	.action		= { { { .sa_handler = NULL, } }, },		\
-	.siglock	= SPIN_LOCK_UNLOCKED, 				\
+	.siglock	= __SPIN_LOCK_UNLOCKED(sighand.siglock),	\
 }
 
 extern struct group_info init_groups;
@@ -129,7 +129,7 @@ extern struct group_info init_groups;
 		.list = LIST_HEAD_INIT(tsk.pending.list),		\
 		.signal = {{0}}},					\
 	.blocked	= {{0}},					\
-	.alloc_lock	= SPIN_LOCK_UNLOCKED,				\
+	.alloc_lock	= __SPIN_LOCK_UNLOCKED(tsk.alloc_lock),		\
 	.journal_info	= NULL,						\
 	.cpu_timers	= INIT_CPU_TIMERS(tsk.cpu_timers),		\
 	.fs_excl	= ATOMIC_INIT(0),				\
Index: linux/include/linux/notifier.h
===================================================================
--- linux.orig/include/linux/notifier.h
+++ linux/include/linux/notifier.h
@@ -65,7 +65,7 @@ struct raw_notifier_head {
 	} while (0)
 
 #define ATOMIC_NOTIFIER_INIT(name) {				\
-		.lock = SPIN_LOCK_UNLOCKED,			\
+		.lock = __SPIN_LOCK_UNLOCKED(name.lock),	\
 		.head = NULL }
 #define BLOCKING_NOTIFIER_INIT(name) {				\
 		.rwsem = __RWSEM_INITIALIZER((name).rwsem),	\
Index: linux/include/linux/seqlock.h
===================================================================
--- linux.orig/include/linux/seqlock.h
+++ linux/include/linux/seqlock.h
@@ -38,9 +38,17 @@ typedef struct {
  * These macros triggered gcc-3.x compile-time problems.  We think these are
  * OK now.  Be cautious.
  */
-#define SEQLOCK_UNLOCKED { 0, SPIN_LOCK_UNLOCKED }
-#define seqlock_init(x)	do { *(x) = (seqlock_t) SEQLOCK_UNLOCKED; } while (0)
+#define __SEQLOCK_UNLOCKED(lockname) \
+		 { 0, __SPIN_LOCK_UNLOCKED(lockname) }
 
+#define SEQLOCK_UNLOCKED \
+		 __SEQLOCK_UNLOCKED(old_style_seqlock_init)
+
+#define seqlock_init(x) \
+		do { *(x) = (seqlock_t) __SEQLOCK_UNLOCKED(x); } while (0)
+
+#define DEFINE_SEQLOCK(x) \
+		seqlock_t x = __SEQLOCK_UNLOCKED(x)
 
 /* Lock out other writers and update the count.
  * Acts like a normal spin_lock/unlock.
Index: linux/include/linux/spinlock_types.h
===================================================================
--- linux.orig/include/linux/spinlock_types.h
+++ linux/include/linux/spinlock_types.h
@@ -44,24 +44,27 @@ typedef struct {
 #define SPINLOCK_OWNER_INIT	((void *)-1L)
 
 #ifdef CONFIG_DEBUG_SPINLOCK
-# define SPIN_LOCK_UNLOCKED						\
+# define __SPIN_LOCK_UNLOCKED(lockname)					\
 	(spinlock_t)	{	.raw_lock = __RAW_SPIN_LOCK_UNLOCKED,	\
 				.magic = SPINLOCK_MAGIC,		\
 				.owner = SPINLOCK_OWNER_INIT,		\
 				.owner_cpu = -1 }
-#define RW_LOCK_UNLOCKED						\
+#define __RW_LOCK_UNLOCKED(lockname)					\
 	(rwlock_t)	{	.raw_lock = __RAW_RW_LOCK_UNLOCKED,	\
 				.magic = RWLOCK_MAGIC,			\
 				.owner = SPINLOCK_OWNER_INIT,		\
 				.owner_cpu = -1 }
 #else
-# define SPIN_LOCK_UNLOCKED \
+# define __SPIN_LOCK_UNLOCKED(lockname) \
 	(spinlock_t)	{	.raw_lock = __RAW_SPIN_LOCK_UNLOCKED }
-#define RW_LOCK_UNLOCKED \
+#define __RW_LOCK_UNLOCKED(lockname) \
 	(rwlock_t)	{	.raw_lock = __RAW_RW_LOCK_UNLOCKED }
 #endif
 
-#define DEFINE_SPINLOCK(x)	spinlock_t x = SPIN_LOCK_UNLOCKED
-#define DEFINE_RWLOCK(x)	rwlock_t x = RW_LOCK_UNLOCKED
+#define SPIN_LOCK_UNLOCKED	__SPIN_LOCK_UNLOCKED(old_style_spin_init)
+#define RW_LOCK_UNLOCKED	__RW_LOCK_UNLOCKED(old_style_rw_init)
+
+#define DEFINE_SPINLOCK(x)	spinlock_t x = __SPIN_LOCK_UNLOCKED(x)
+#define DEFINE_RWLOCK(x)	rwlock_t x = __RW_LOCK_UNLOCKED(x)
 
 #endif /* __LINUX_SPINLOCK_TYPES_H */
Index: linux/include/linux/wait.h
===================================================================
--- linux.orig/include/linux/wait.h
+++ linux/include/linux/wait.h
@@ -68,7 +68,7 @@ struct task_struct;
 	wait_queue_t name = __WAITQUEUE_INITIALIZER(name, tsk)
 
 #define __WAIT_QUEUE_HEAD_INITIALIZER(name) {				\
-	.lock		= SPIN_LOCK_UNLOCKED,				\
+	.lock		= __SPIN_LOCK_UNLOCKED(name.lock),		\
 	.task_list	= { &(name).task_list, &(name).task_list } }
 
 #define DECLARE_WAIT_QUEUE_HEAD(name) \
Index: linux/kernel/kmod.c
===================================================================
--- linux.orig/kernel/kmod.c
+++ linux/kernel/kmod.c
@@ -246,6 +246,8 @@ int call_usermodehelper_keys(char *path,
 	};
 	DECLARE_WORK(work, __call_usermodehelper, &sub_info);
 
+	init_completion(&done);
+
 	if (!khelper_wq)
 		return -EBUSY;
 
Index: linux/kernel/rcupdate.c
===================================================================
--- linux.orig/kernel/rcupdate.c
+++ linux/kernel/rcupdate.c
@@ -53,13 +53,13 @@
 static struct rcu_ctrlblk rcu_ctrlblk = {
 	.cur = -300,
 	.completed = -300,
-	.lock = SPIN_LOCK_UNLOCKED,
+	.lock = __SPIN_LOCK_UNLOCKED(&rcu_ctrlblk.lock),
 	.cpumask = CPU_MASK_NONE,
 };
 static struct rcu_ctrlblk rcu_bh_ctrlblk = {
 	.cur = -300,
 	.completed = -300,
-	.lock = SPIN_LOCK_UNLOCKED,
+	.lock = __SPIN_LOCK_UNLOCKED(&rcu_bh_ctrlblk.lock),
 	.cpumask = CPU_MASK_NONE,
 };
 
Index: linux/kernel/timer.c
===================================================================
--- linux.orig/kernel/timer.c
+++ linux/kernel/timer.c
@@ -1142,7 +1142,7 @@ unsigned long wall_jiffies = INITIAL_JIF
  * playing with xtime and avenrun.
  */
 #ifndef ARCH_HAVE_XTIME_LOCK
-seqlock_t xtime_lock __cacheline_aligned_in_smp = SEQLOCK_UNLOCKED;
+__cacheline_aligned_in_smp DEFINE_SEQLOCK(xtime_lock);
 
 EXPORT_SYMBOL(xtime_lock);
 #endif
Index: linux/mm/swap_state.c
===================================================================
--- linux.orig/mm/swap_state.c
+++ linux/mm/swap_state.c
@@ -39,7 +39,7 @@ static struct backing_dev_info swap_back
 
 struct address_space swapper_space = {
 	.page_tree	= RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
-	.tree_lock	= RW_LOCK_UNLOCKED,
+	.tree_lock	= __RW_LOCK_UNLOCKED(swapper_space.tree_lock),
 	.a_ops		= &swap_aops,
 	.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
 	.backing_dev_info = &swap_backing_dev_info,
Index: linux/net/ipv4/tcp_ipv4.c
===================================================================
--- linux.orig/net/ipv4/tcp_ipv4.c
+++ linux/net/ipv4/tcp_ipv4.c
@@ -90,7 +90,7 @@ static struct socket *tcp_socket;
 void tcp_v4_send_check(struct sock *sk, int len, struct sk_buff *skb);
 
 struct inet_hashinfo __cacheline_aligned tcp_hashinfo = {
-	.lhash_lock	= RW_LOCK_UNLOCKED,
+	.lhash_lock	= __RW_LOCK_UNLOCKED(tcp_hashinfo.lhash_lock),
 	.lhash_users	= ATOMIC_INIT(0),
 	.lhash_wait	= __WAIT_QUEUE_HEAD_INITIALIZER(tcp_hashinfo.lhash_wait),
 };
Index: linux/net/ipv4/tcp_minisocks.c
===================================================================
--- linux.orig/net/ipv4/tcp_minisocks.c
+++ linux/net/ipv4/tcp_minisocks.c
@@ -41,7 +41,7 @@ int sysctl_tcp_abort_on_overflow;
 struct inet_timewait_death_row tcp_death_row = {
 	.sysctl_max_tw_buckets = NR_FILE * 2,
 	.period		= TCP_TIMEWAIT_LEN / INET_TWDR_TWKILL_SLOTS,
-	.death_lock	= SPIN_LOCK_UNLOCKED,
+	.death_lock	= __SPIN_LOCK_UNLOCKED(tcp_death_row.death_lock),
 	.hashinfo	= &tcp_hashinfo,
 	.tw_timer	= TIMER_INITIALIZER(inet_twdr_hangman, 0,
 					    (unsigned long)&tcp_death_row),
Index: linux/net/ipv4/xfrm4_policy.c
===================================================================
--- linux.orig/net/ipv4/xfrm4_policy.c
+++ linux/net/ipv4/xfrm4_policy.c
@@ -17,7 +17,7 @@
 static struct dst_ops xfrm4_dst_ops;
 static struct xfrm_policy_afinfo xfrm4_policy_afinfo;
 
-static struct xfrm_type_map xfrm4_type_map = { .lock = RW_LOCK_UNLOCKED };
+static struct xfrm_type_map xfrm4_type_map = { .lock = __RW_LOCK_UNLOCKED(xfrm4_type_map.lock) };
 
 static int xfrm4_dst_lookup(struct xfrm_dst **dst, struct flowi *fl)
 {
@@ -299,7 +299,7 @@ static struct dst_ops xfrm4_dst_ops = {
 
 static struct xfrm_policy_afinfo xfrm4_policy_afinfo = {
 	.family = 		AF_INET,
-	.lock = 		RW_LOCK_UNLOCKED,
+	.lock = 		__RW_LOCK_UNLOCKED(xfrm4_policy_afinfo.lock),
 	.type_map = 		&xfrm4_type_map,
 	.dst_ops =		&xfrm4_dst_ops,
 	.dst_lookup =		xfrm4_dst_lookup,


* [patch 11/61] lock validator: lockdep: small xfs init_rwsem() cleanup
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (9 preceding siblings ...)
  2006-05-29 21:23 ` [patch 10/61] lock validator: locking init debugging improvement Ingo Molnar
@ 2006-05-29 21:23 ` Ingo Molnar
  2006-05-30  1:33   ` Andrew Morton
  2006-05-29 21:24 ` [patch 12/61] lock validator: beautify x86_64 stacktraces Ingo Molnar
                   ` (62 subsequent siblings)
  73 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:23 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

init_rwsem() has no return value. That is not a problem while init_rwsem()
is a function, but it becomes one when init_rwsem() turns into a
do { ... } while (0) macro (which lockdep introduces), because a statement
macro cannot be used inside mrinit()'s comma expression.
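
For illustration only (not part of the patch), a reduced sketch of why the
old comma-expression form breaks once init_rwsem() becomes a statement macro
(the single-argument __init_rwsem() below is a hypothetical stand-in):

	/* hypothetical macro form of the kind lockdep introduces: */
	#define init_rwsem(sem)	do { __init_rwsem(sem); } while (0)

	/* old mrinit(): a comma expression -- each operand must be an
	 * expression, so the do { } while (0) block no longer compiles: */
	#define mrinit_old(mrp)	( (mrp)->mr_writer = 0, init_rwsem(&(mrp)->mr_lock) )

	/* new mrinit(): a statement macro, fine with either variant: */
	#define mrinit_new(mrp)	\
		do { (mrp)->mr_writer = 0; init_rwsem(&(mrp)->mr_lock); } while (0)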

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 fs/xfs/linux-2.6/mrlock.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/fs/xfs/linux-2.6/mrlock.h
===================================================================
--- linux.orig/fs/xfs/linux-2.6/mrlock.h
+++ linux/fs/xfs/linux-2.6/mrlock.h
@@ -28,7 +28,7 @@ typedef struct {
 } mrlock_t;
 
 #define mrinit(mrp, name)	\
-	( (mrp)->mr_writer = 0, init_rwsem(&(mrp)->mr_lock) )
+	do { (mrp)->mr_writer = 0; init_rwsem(&(mrp)->mr_lock); } while (0)
 #define mrlock_init(mrp, t,n,s)	mrinit(mrp, n)
 #define mrfree(mrp)		do { } while (0)
 #define mraccess(mrp)		mraccessf(mrp, 0)


* [patch 12/61] lock validator: beautify x86_64 stacktraces
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (10 preceding siblings ...)
  2006-05-29 21:23 ` [patch 11/61] lock validator: lockdep: small xfs init_rwsem() cleanup Ingo Molnar
@ 2006-05-29 21:24 ` Ingo Molnar
  2006-05-30  1:33   ` Andrew Morton
  2006-05-29 21:24 ` [patch 13/61] lock validator: x86_64: document stack frame internals Ingo Molnar
                   ` (61 subsequent siblings)
  73 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

beautify x86_64 stacktraces to make them more readable: print one entry per
line, with module name, symbol, offset and symbol size.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 arch/x86_64/kernel/traps.c  |   55 ++++++++++++++++++++------------------------
 include/asm-x86_64/kdebug.h |    2 -
 2 files changed, 27 insertions(+), 30 deletions(-)

Index: linux/arch/x86_64/kernel/traps.c
===================================================================
--- linux.orig/arch/x86_64/kernel/traps.c
+++ linux/arch/x86_64/kernel/traps.c
@@ -108,28 +108,30 @@ static inline void preempt_conditional_c
 static int kstack_depth_to_print = 10;
 
 #ifdef CONFIG_KALLSYMS
-#include <linux/kallsyms.h> 
-int printk_address(unsigned long address)
-{ 
+# include <linux/kallsyms.h>
+void printk_address(unsigned long address)
+{
 	unsigned long offset = 0, symsize;
 	const char *symname;
 	char *modname;
-	char *delim = ":"; 
+	char *delim = ":";
 	char namebuf[128];
 
-	symname = kallsyms_lookup(address, &symsize, &offset, &modname, namebuf); 
-	if (!symname) 
-		return printk("[<%016lx>]", address);
-	if (!modname) 
+	symname = kallsyms_lookup(address, &symsize, &offset, &modname, namebuf);
+	if (!symname) {
+		printk(" [<%016lx>]", address);
+		return;
+	}
+	if (!modname)
 		modname = delim = ""; 		
-        return printk("<%016lx>{%s%s%s%s%+ld}",
-		      address, delim, modname, delim, symname, offset); 
-} 
+	printk(" [<%016lx>] %s%s%s%s+0x%lx/0x%lx",
+		address, delim, modname, delim, symname, offset, symsize);
+}
 #else
-int printk_address(unsigned long address)
-{ 
-	return printk("[<%016lx>]", address);
-} 
+void printk_address(unsigned long address)
+{
+	printk(" [<%016lx>]", address);
+}
 #endif
 
 static unsigned long *in_exception_stack(unsigned cpu, unsigned long stack,
@@ -200,21 +202,14 @@ void show_trace(unsigned long *stack)
 {
 	const unsigned cpu = safe_smp_processor_id();
 	unsigned long *irqstack_end = (unsigned long *)cpu_pda(cpu)->irqstackptr;
-	int i;
 	unsigned used = 0;
 
-	printk("\nCall Trace:");
+	printk("\nCall Trace:\n");
 
 #define HANDLE_STACK(cond) \
 	do while (cond) { \
 		unsigned long addr = *stack++; \
 		if (kernel_text_address(addr)) { \
-			if (i > 50) { \
-				printk("\n       "); \
-				i = 0; \
-			} \
-			else \
-				i += printk(" "); \
 			/* \
 			 * If the address is either in the text segment of the \
 			 * kernel, or in the region which contains vmalloc'ed \
@@ -223,20 +218,21 @@ void show_trace(unsigned long *stack)
 			 * down the cause of the crash will be able to figure \
 			 * out the call path that was taken. \
 			 */ \
-			i += printk_address(addr); \
+			printk_address(addr); \
+			printk("\n"); \
 		} \
 	} while (0)
 
-	for(i = 11; ; ) {
+	for ( ; ; ) {
 		const char *id;
 		unsigned long *estack_end;
 		estack_end = in_exception_stack(cpu, (unsigned long)stack,
 						&used, &id);
 
 		if (estack_end) {
-			i += printk(" <%s>", id);
+			printk(" <%s>", id);
 			HANDLE_STACK (stack < estack_end);
-			i += printk(" <EOE>");
+			printk(" <EOE>");
 			stack = (unsigned long *) estack_end[-2];
 			continue;
 		}
@@ -246,11 +242,11 @@ void show_trace(unsigned long *stack)
 				(IRQSTACKSIZE - 64) / sizeof(*irqstack);
 
 			if (stack >= irqstack && stack < irqstack_end) {
-				i += printk(" <IRQ>");
+				printk(" <IRQ>");
 				HANDLE_STACK (stack < irqstack_end);
 				stack = (unsigned long *) (irqstack_end[-1]);
 				irqstack_end = NULL;
-				i += printk(" <EOI>");
+				printk(" <EOI>");
 				continue;
 			}
 		}
@@ -259,6 +255,7 @@ void show_trace(unsigned long *stack)
 
 	HANDLE_STACK (((long) stack & (THREAD_SIZE-1)) != 0);
 #undef HANDLE_STACK
+
 	printk("\n");
 }
 
Index: linux/include/asm-x86_64/kdebug.h
===================================================================
--- linux.orig/include/asm-x86_64/kdebug.h
+++ linux/include/asm-x86_64/kdebug.h
@@ -49,7 +49,7 @@ static inline int notify_die(enum die_va
 	return atomic_notifier_call_chain(&die_chain, val, &args);
 } 
 
-extern int printk_address(unsigned long address);
+extern void printk_address(unsigned long address);
 extern void die(const char *,struct pt_regs *,long);
 extern void __die(const char *,struct pt_regs *,long);
 extern void show_registers(struct pt_regs *regs);


* [patch 13/61] lock validator: x86_64: document stack frame internals
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (11 preceding siblings ...)
  2006-05-29 21:24 ` [patch 12/61] lock validator: beautify x86_64 stacktraces Ingo Molnar
@ 2006-05-29 21:24 ` Ingo Molnar
  2006-05-29 21:24 ` [patch 14/61] lock validator: stacktrace Ingo Molnar
                   ` (60 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

document stack frame nesting internals some more.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 arch/x86_64/kernel/traps.c |   64 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 62 insertions(+), 2 deletions(-)

Index: linux/arch/x86_64/kernel/traps.c
===================================================================
--- linux.orig/arch/x86_64/kernel/traps.c
+++ linux/arch/x86_64/kernel/traps.c
@@ -134,8 +134,9 @@ void printk_address(unsigned long addres
 }
 #endif
 
-static unsigned long *in_exception_stack(unsigned cpu, unsigned long stack,
-					unsigned *usedp, const char **idp)
+unsigned long *
+in_exception_stack(unsigned cpu, unsigned long stack, unsigned *usedp,
+		   const char **idp)
 {
 	static char ids[][8] = {
 		[DEBUG_STACK - 1] = "#DB",
@@ -149,10 +150,22 @@ static unsigned long *in_exception_stack
 	};
 	unsigned k;
 
+	/*
+	 * Iterate over all exception stacks, and figure out whether
+	 * 'stack' is in one of them:
+	 */
 	for (k = 0; k < N_EXCEPTION_STACKS; k++) {
 		unsigned long end;
 
+		/*
+		 * set 'end' to the end of the exception stack.
+		 */
 		switch (k + 1) {
+		/*
+		 * TODO: this block is not needed I think, because
+		 * setup64.c:cpu_init() sets up t->ist[DEBUG_STACK]
+		 * properly too.
+		 */
 #if DEBUG_STKSZ > EXCEPTION_STKSZ
 		case DEBUG_STACK:
 			end = cpu_pda(cpu)->debugstack + DEBUG_STKSZ;
@@ -162,19 +175,43 @@ static unsigned long *in_exception_stack
 			end = per_cpu(init_tss, cpu).ist[k];
 			break;
 		}
+		/*
+		 * Is 'stack' above this exception frame's end?
+		 * If yes then skip to the next frame.
+		 */
 		if (stack >= end)
 			continue;
+		/*
+		 * Is 'stack' above this exception frame's start address?
+		 * If yes then we found the right frame.
+		 */
 		if (stack >= end - EXCEPTION_STKSZ) {
+			/*
+			 * Make sure we only iterate through an exception
+			 * stack once. If it comes up for the second time
+			 * then there's something wrong going on - just
+			 * break out and return NULL:
+			 */
 			if (*usedp & (1U << k))
 				break;
 			*usedp |= 1U << k;
 			*idp = ids[k];
 			return (unsigned long *)end;
 		}
+		/*
+		 * If this is a debug stack, and if it has a larger size than
+		 * the usual exception stacks, then 'stack' might still
+		 * be within the lower portion of the debug stack:
+		 */
 #if DEBUG_STKSZ > EXCEPTION_STKSZ
 		if (k == DEBUG_STACK - 1 && stack >= end - DEBUG_STKSZ) {
 			unsigned j = N_EXCEPTION_STACKS - 1;
 
+			/*
+			 * Black magic. A large debug stack is composed of
+			 * multiple exception stack entries, which we
+			 * iterate through now. Don't look:
+			 */
 			do {
 				++j;
 				end -= EXCEPTION_STKSZ;
@@ -206,6 +243,11 @@ void show_trace(unsigned long *stack)
 
 	printk("\nCall Trace:\n");
 
+	/*
+	 * Print function call entries within a stack. 'cond' is the
+	 * "end of stackframe" condition, that the 'stack++'
+	 * iteration will eventually trigger.
+	 */
 #define HANDLE_STACK(cond) \
 	do while (cond) { \
 		unsigned long addr = *stack++; \
@@ -223,6 +265,11 @@ void show_trace(unsigned long *stack)
 		} \
 	} while (0)
 
+	/*
+	 * Print function call entries in all stacks, starting at the
+	 * current stack address. If the stacks consist of nested
+	 * exceptions (and an IRQ stack), follow them from one to the next:
+	 */
 	for ( ; ; ) {
 		const char *id;
 		unsigned long *estack_end;
@@ -233,6 +280,11 @@ void show_trace(unsigned long *stack)
 			printk(" <%s>", id);
 			HANDLE_STACK (stack < estack_end);
 			printk(" <EOE>");
+			/*
+			 * We link to the next stack via the
+			 * second-to-last pointer (index -2 to end) in the
+			 * exception stack:
+			 */
 			stack = (unsigned long *) estack_end[-2];
 			continue;
 		}
@@ -244,6 +296,11 @@ void show_trace(unsigned long *stack)
 			if (stack >= irqstack && stack < irqstack_end) {
 				printk(" <IRQ>");
 				HANDLE_STACK (stack < irqstack_end);
+				/*
+				 * We link to the next stack (which would be
+				 * the process stack normally) the last
+				 * pointer (index -1 to end) in the IRQ stack:
+				 */
 				stack = (unsigned long *) (irqstack_end[-1]);
 				irqstack_end = NULL;
 				printk(" <EOI>");
@@ -253,6 +310,9 @@ void show_trace(unsigned long *stack)
 		break;
 	}
 
+	/*
+	 * This prints the process stack:
+	 */
 	HANDLE_STACK (((long) stack & (THREAD_SIZE-1)) != 0);
 #undef HANDLE_STACK
 


* [patch 14/61] lock validator: stacktrace
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (12 preceding siblings ...)
  2006-05-29 21:24 ` [patch 13/61] lock validator: x86_64: document stack frame internals Ingo Molnar
@ 2006-05-29 21:24 ` Ingo Molnar
  2006-05-29 21:24 ` [patch 15/61] lock validator: x86_64: use stacktrace to generate backtraces Ingo Molnar
                   ` (59 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

framework to generate and save stacktraces quickly, without printing
anything to the console.
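
As a usage sketch (illustration only; the function name and buffer size below
are made up, the interface is the one added by this patch):

	#include <linux/sched.h>
	#include <linux/stacktrace.h>

	#define DEMO_NR_ENTRIES 32

	static void demo_dump_current_stack(void)
	{
		unsigned long entries[DEMO_NR_ENTRIES];
		struct stack_trace trace = {
			.nr_entries	= 0,
			.max_entries	= DEMO_NR_ENTRIES,
			.entries	= entries,
		};

		/* current task, current context only, skip no entries: */
		save_stack_trace(&trace, current, 0, 0);

		/* print the saved entries, slightly indented: */
		print_stack_trace(&trace, 1);
	}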

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 arch/i386/kernel/Makefile       |    2 
 arch/i386/kernel/stacktrace.c   |   98 +++++++++++++++++
 arch/x86_64/kernel/Makefile     |    2 
 arch/x86_64/kernel/stacktrace.c |  219 ++++++++++++++++++++++++++++++++++++++++
 include/linux/stacktrace.h      |   15 ++
 kernel/Makefile                 |    2 
 kernel/stacktrace.c             |   26 ++++
 7 files changed, 361 insertions(+), 3 deletions(-)

Index: linux/arch/i386/kernel/Makefile
===================================================================
--- linux.orig/arch/i386/kernel/Makefile
+++ linux/arch/i386/kernel/Makefile
@@ -4,7 +4,7 @@
 
 extra-y := head.o init_task.o vmlinux.lds
 
-obj-y	:= process.o semaphore.o signal.o entry.o traps.o irq.o \
+obj-y	:= process.o semaphore.o signal.o entry.o traps.o irq.o stacktrace.o \
 		ptrace.o time.o ioport.o ldt.o setup.o i8259.o sys_i386.o \
 		pci-dma.o i386_ksyms.o i387.o bootflag.o \
 		quirks.o i8237.o topology.o alternative.o i8253.o tsc.o
Index: linux/arch/i386/kernel/stacktrace.c
===================================================================
--- /dev/null
+++ linux/arch/i386/kernel/stacktrace.c
@@ -0,0 +1,98 @@
+/*
+ * arch/i386/kernel/stacktrace.c
+ *
+ * Stack trace management functions
+ *
+ *  Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
+ */
+#include <linux/sched.h>
+#include <linux/stacktrace.h>
+
+static inline int valid_stack_ptr(struct thread_info *tinfo, void *p)
+{
+	return	p > (void *)tinfo &&
+		p < (void *)tinfo + THREAD_SIZE - 3;
+}
+
+/*
+ * Save stack-backtrace addresses into a stack_trace buffer:
+ */
+static inline unsigned long
+save_context_stack(struct stack_trace *trace, unsigned int skip,
+		   struct thread_info *tinfo, unsigned long *stack,
+		   unsigned long ebp)
+{
+	unsigned long addr;
+
+#ifdef CONFIG_FRAME_POINTER
+	while (valid_stack_ptr(tinfo, (void *)ebp)) {
+		addr = *(unsigned long *)(ebp + 4);
+		if (!skip)
+			trace->entries[trace->nr_entries++] = addr;
+		else
+			skip--;
+		if (trace->nr_entries >= trace->max_entries)
+			break;
+		/*
+		 * break out of recursive entries (such as
+		 * end_of_stack_stop_unwind_function):
+	 	 */
+		if (ebp == *(unsigned long *)ebp)
+			break;
+
+		ebp = *(unsigned long *)ebp;
+	}
+#else
+	while (valid_stack_ptr(tinfo, stack)) {
+		addr = *stack++;
+		if (__kernel_text_address(addr)) {
+			if (!skip)
+				trace->entries[trace->nr_entries++] = addr;
+			else
+				skip--;
+			if (trace->nr_entries >= trace->max_entries)
+				break;
+		}
+	}
+#endif
+
+	return ebp;
+}
+
+/*
+ * Save stack-backtrace addresses into a stack_trace buffer.
+ * If all_contexts is set, all contexts (hardirq, softirq and process)
+ * are saved. If not set then only the current context is saved.
+ */
+void save_stack_trace(struct stack_trace *trace,
+		      struct task_struct *task, int all_contexts,
+		      unsigned int skip)
+{
+	unsigned long ebp;
+	unsigned long *stack = &ebp;
+
+	WARN_ON(trace->nr_entries || !trace->max_entries);
+
+	if (!task || task == current) {
+		/* Grab ebp right from our regs: */
+		asm ("movl %%ebp, %0" : "=r" (ebp));
+	} else {
+		/* ebp is the last reg pushed by switch_to(): */
+		ebp = *(unsigned long *) task->thread.esp;
+	}
+
+	while (1) {
+		struct thread_info *context = (struct thread_info *)
+				((unsigned long)stack & (~(THREAD_SIZE - 1)));
+
+		ebp = save_context_stack(trace, skip, context, stack, ebp);
+		stack = (unsigned long *)context->previous_esp;
+		if (!all_contexts || !stack ||
+				trace->nr_entries >= trace->max_entries)
+			break;
+		trace->entries[trace->nr_entries++] = ULONG_MAX;
+		if (trace->nr_entries >= trace->max_entries)
+			break;
+	}
+}
+
Index: linux/arch/x86_64/kernel/Makefile
===================================================================
--- linux.orig/arch/x86_64/kernel/Makefile
+++ linux/arch/x86_64/kernel/Makefile
@@ -4,7 +4,7 @@
 
 extra-y 	:= head.o head64.o init_task.o vmlinux.lds
 EXTRA_AFLAGS	:= -traditional
-obj-y	:= process.o signal.o entry.o traps.o irq.o \
+obj-y	:= process.o signal.o entry.o traps.o irq.o stacktrace.o \
 		ptrace.o time.o ioport.o ldt.o setup.o i8259.o sys_x86_64.o \
 		x8664_ksyms.o i387.o syscall.o vsyscall.o \
 		setup64.o bootflag.o e820.o reboot.o quirks.o i8237.o \
Index: linux/arch/x86_64/kernel/stacktrace.c
===================================================================
--- /dev/null
+++ linux/arch/x86_64/kernel/stacktrace.c
@@ -0,0 +1,219 @@
+/*
+ * arch/x86_64/kernel/stacktrace.c
+ *
+ * Stack trace management functions
+ *
+ *  Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
+ */
+#include <linux/sched.h>
+#include <linux/stacktrace.h>
+
+#include <asm/smp.h>
+
+static inline int
+in_range(unsigned long start, unsigned long addr, unsigned long end)
+{
+	return addr >= start && addr <= end;
+}
+
+static unsigned long
+get_stack_end(struct task_struct *task, unsigned long stack)
+{
+	unsigned long stack_start, stack_end, flags;
+	int i, cpu;
+
+	/*
+	 * The most common case is that we are in the task stack:
+	 */
+	stack_start = (unsigned long)task->thread_info;
+	stack_end = stack_start + THREAD_SIZE;
+
+	if (in_range(stack_start, stack, stack_end))
+		return stack_end;
+
+	/*
+	 * We are in an interrupt if irqstackptr is set:
+	 */
+	raw_local_irq_save(flags);
+	cpu = safe_smp_processor_id();
+	stack_end = (unsigned long)cpu_pda(cpu)->irqstackptr;
+
+	if (stack_end) {
+		stack_start = stack_end & ~(IRQSTACKSIZE-1);
+		if (in_range(stack_start, stack, stack_end))
+			goto out_restore;
+		/*
+		 * We get here if we are in an IRQ context but we
+		 * are also in an exception stack.
+		 */
+	}
+
+	/*
+	 * Iterate over all exception stacks, and figure out whether
+	 * 'stack' is in one of them:
+	 */
+	for (i = 0; i < N_EXCEPTION_STACKS; i++) {
+		/*
+		 * set 'end' to the end of the exception stack.
+		 */
+		stack_end = per_cpu(init_tss, cpu).ist[i];
+		stack_start = stack_end - EXCEPTION_STKSZ;
+
+		/*
+		 * Is 'stack' above this exception frame's end?
+		 * If yes then skip to the next frame.
+		 */
+		if (stack >= stack_end)
+			continue;
+		/*
+		 * Is 'stack' above this exception frame's start address?
+		 * If yes then we found the right frame.
+		 */
+		if (stack >= stack_start)
+			goto out_restore;
+
+		/*
+		 * If this is a debug stack, and if it has a larger size than
+		 * the usual exception stacks, then 'stack' might still
+		 * be within the lower portion of the debug stack:
+		 */
+#if DEBUG_STKSZ > EXCEPTION_STKSZ
+		if (i == DEBUG_STACK - 1 && stack >= stack_end - DEBUG_STKSZ) {
+			/*
+			 * Black magic. A large debug stack is composed of
+			 * multiple exception stack entries, which we
+			 * iterate through now. Don't look:
+			 */
+			do {
+				stack_end -= EXCEPTION_STKSZ;
+				stack_start -= EXCEPTION_STKSZ;
+			} while (stack < stack_start);
+
+			goto out_restore;
+		}
+#endif
+	}
+	/*
+	 * Ok, 'stack' is not pointing to any of the system stacks.
+	 */
+	stack_end = 0;
+
+out_restore:
+	raw_local_irq_restore(flags);
+
+	return stack_end;
+}
+
+
+/*
+ * Save stack-backtrace addresses into a stack_trace buffer:
+ */
+static inline unsigned long
+save_context_stack(struct stack_trace *trace, unsigned int skip,
+		   unsigned long stack, unsigned long stack_end)
+{
+	unsigned long addr, prev_stack = 0;
+
+#ifdef CONFIG_FRAME_POINTER
+	while (in_range(prev_stack, (unsigned long)stack, stack_end)) {
+		pr_debug("stack:          %p\n", (void *)stack);
+		addr = (unsigned long)(((unsigned long *)stack)[1]);
+		pr_debug("addr:           %p\n", (void *)addr);
+		if (!skip)
+			trace->entries[trace->nr_entries++] = addr-1;
+		else
+			skip--;
+		if (trace->nr_entries >= trace->max_entries)
+			break;
+		if (!addr)
+			return 0;
+		/*
+		 * Stack frames must go forwards (otherwise a loop could
+		 * happen if the stackframe is corrupted), so we move
+		 * prev_stack forwards:
+		 */
+		prev_stack = stack;
+		stack = (unsigned long)(((unsigned long *)stack)[0]);
+	}
+	pr_debug("invalid:        %p\n", (void *)stack);
+#else
+	while (stack < stack_end) {
+		addr = *(unsigned long *)stack;
+		stack += sizeof(long);
+		if (__kernel_text_address(addr)) {
+			if (!skip)
+				trace->entries[trace->nr_entries++] = addr-1;
+			else
+				skip--;
+			if (trace->nr_entries >= trace->max_entries)
+				break;
+		}
+	}
+#endif
+	return stack;
+}
+
+#define MAX_STACKS 10
+
+/*
+ * Save stack-backtrace addresses into a stack_trace buffer.
+ * If all_contexts is set, all contexts (hardirq, softirq and process)
+ * are saved. If not set then only the current context is saved.
+ */
+void save_stack_trace(struct stack_trace *trace,
+		      struct task_struct *task, int all_contexts,
+		      unsigned int skip)
+{
+	unsigned long stack = (unsigned long)&stack;
+	int i, nr_stacks = 0, stacks_done[MAX_STACKS];
+
+	WARN_ON(trace->nr_entries || !trace->max_entries);
+
+	if (!task)
+		task = current;
+
+	pr_debug("task: %p, ti: %p\n", task, task->thread_info);
+
+	if (!task || task == current) {
+		/* Grab rbp right from our regs: */
+		asm ("mov %%rbp, %0" : "=r" (stack));
+		pr_debug("rbp:            %p\n", (void *)stack);
+	} else {
+		/* rbp is the last reg pushed by switch_to(): */
+		stack = task->thread.rsp;
+		pr_debug("other task rsp: %p\n", (void *)stack);
+		stack = (unsigned long)(((unsigned long *)stack)[0]);
+		pr_debug("other task rbp: %p\n", (void *)stack);
+	}
+
+	while (1) {
+		unsigned long stack_end = get_stack_end(task, stack);
+
+		pr_debug("stack:          %p\n", (void *)stack);
+		pr_debug("stack end:      %p\n", (void *)stack_end);
+
+		/*
+		 * Invalid stack address?
+		 */
+		if (!stack_end)
+			return;
+		/*
+		 * Were we in this stack already? (recursion)
+		 */
+		for (i = 0; i < nr_stacks; i++)
+			if (stacks_done[i] == stack_end)
+				return;
+		stacks_done[nr_stacks] = stack_end;
+
+		stack = save_context_stack(trace, skip, stack, stack_end);
+		if (!all_contexts || !stack ||
+				trace->nr_entries >= trace->max_entries)
+			return;
+		trace->entries[trace->nr_entries++] = ULONG_MAX;
+		if (trace->nr_entries >= trace->max_entries)
+			return;
+		if (++nr_stacks >= MAX_STACKS)
+			return;
+	}
+}
+
Index: linux/include/linux/stacktrace.h
===================================================================
--- /dev/null
+++ linux/include/linux/stacktrace.h
@@ -0,0 +1,15 @@
+#ifndef __LINUX_STACKTRACE_H
+#define __LINUX_STACKTRACE_H
+
+struct stack_trace {
+	unsigned int nr_entries, max_entries;
+	unsigned long *entries;
+};
+
+extern void save_stack_trace(struct stack_trace *trace,
+			     struct task_struct *task, int all_contexts,
+			     unsigned int skip);
+
+extern void print_stack_trace(struct stack_trace *trace, int spaces);
+
+#endif
Index: linux/kernel/Makefile
===================================================================
--- linux.orig/kernel/Makefile
+++ linux/kernel/Makefile
@@ -8,7 +8,7 @@ obj-y     = sched.o fork.o exec_domain.o
 	    signal.o sys.o kmod.o workqueue.o pid.o \
 	    rcupdate.o extable.o params.o posix-timers.o \
 	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
-	    hrtimer.o nsproxy.o
+	    hrtimer.o nsproxy.o stacktrace.o
 
 obj-y += time/
 obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o
Index: linux/kernel/stacktrace.c
===================================================================
--- /dev/null
+++ linux/kernel/stacktrace.c
@@ -0,0 +1,26 @@
+/*
+ * kernel/stacktrace.c
+ *
+ * Stack trace management functions
+ *
+ *  Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
+ */
+#include <linux/sched.h>
+#include <linux/kallsyms.h>
+#include <linux/stacktrace.h>
+
+void print_stack_trace(struct stack_trace *trace, int spaces)
+{
+	int i, j;
+
+	for (i = 0; i < trace->nr_entries; i++) {
+		unsigned long ip = trace->entries[i];
+
+		for (j = 0; j < spaces + 1; j++)
+			printk(" ");
+
+		printk("[<%08lx>]", ip);
+		print_symbol(" %s\n", ip);
+	}
+}
+


* [patch 15/61] lock validator: x86_64: use stacktrace to generate backtraces
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (13 preceding siblings ...)
  2006-05-29 21:24 ` [patch 14/61] lock validator: stacktrace Ingo Molnar
@ 2006-05-29 21:24 ` Ingo Molnar
  2006-05-30  1:33   ` Andrew Morton
  2006-05-29 21:24 ` [patch 16/61] lock validator: fown locking workaround Ingo Molnar
                   ` (58 subsequent siblings)
  73 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

this switches x86_64 to use the stacktrace infrastructure when generating
backtrace printouts, if CONFIG_FRAME_POINTER=y. (This patch will go away
once the dwarf2 stackframe parser in -mm goes upstream.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 arch/x86_64/kernel/traps.c |   35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

Index: linux/arch/x86_64/kernel/traps.c
===================================================================
--- linux.orig/arch/x86_64/kernel/traps.c
+++ linux/arch/x86_64/kernel/traps.c
@@ -235,7 +235,31 @@ in_exception_stack(unsigned cpu, unsigne
  * severe exception (double fault, nmi, stack fault, debug, mce) hardware stack
  */
 
-void show_trace(unsigned long *stack)
+#ifdef CONFIG_FRAME_POINTER
+
+#include <linux/stacktrace.h>
+
+#define MAX_TRACE_ENTRIES 64
+
+static void __show_trace(struct task_struct *task, unsigned long *stack)
+{
+	unsigned long entries[MAX_TRACE_ENTRIES];
+	struct stack_trace trace;
+
+	trace.nr_entries = 0;
+	trace.max_entries = MAX_TRACE_ENTRIES;
+	trace.entries = entries;
+
+	save_stack_trace(&trace, task, 1, 0);
+
+	pr_debug("got %d/%d entries.\n", trace.nr_entries, trace.max_entries);
+
+	print_stack_trace(&trace, 4);
+}
+
+#else
+
+void __show_trace(struct task_struct *task, unsigned long *stack)
 {
 	const unsigned cpu = safe_smp_processor_id();
 	unsigned long *irqstack_end = (unsigned long *)cpu_pda(cpu)->irqstackptr;
@@ -319,6 +343,13 @@ void show_trace(unsigned long *stack)
 	printk("\n");
 }
 
+#endif
+
+void show_trace(unsigned long *stack)
+{
+	__show_trace(current, stack);
+}
+
 void show_stack(struct task_struct *tsk, unsigned long * rsp)
 {
 	unsigned long *stack;
@@ -353,7 +384,7 @@ void show_stack(struct task_struct *tsk,
 		printk("%016lx ", *stack++);
 		touch_nmi_watchdog();
 	}
-	show_trace((unsigned long *)rsp);
+	__show_trace(tsk, (unsigned long *)rsp);
 }
 
 /*


* [patch 16/61] lock validator: fown locking workaround
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (14 preceding siblings ...)
  2006-05-29 21:24 ` [patch 15/61] lock validator: x86_64: use stacktrace to generate backtraces Ingo Molnar
@ 2006-05-29 21:24 ` Ingo Molnar
  2006-05-30  1:34   ` Andrew Morton
  2006-05-29 21:24 ` [patch 17/61] lock validator: sk_callback_lock workaround Ingo Molnar
                   ` (57 subsequent siblings)
  73 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

temporary workaround for the lock validator: make all uses of
f_owner.lock irq-safe. (The real solution will be to express to
the lock validator that f_owner.lock rules are to be generated
per-filesystem.)
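
For background (illustration only, not from the patch): the scenario the
validator objects to is a lock that can be taken from hardirq context in one
path (send_sigio() is reachable from interrupt context via kill_fasync())
while other paths take it with interrupts enabled on the same CPU. After this
patch both sides use irq-safe variants, roughly along these lines (the lock
and function names below are stand-ins):

	#include <linux/spinlock.h>

	static DEFINE_RWLOCK(demo_owner_lock);	/* stand-in for f_owner.lock */

	/* process-context writer: keep irqs off while holding the lock ... */
	static void demo_set_owner(void)
	{
		write_lock_irq(&demo_owner_lock);
		/* ... update owner state ... */
		write_unlock_irq(&demo_owner_lock);
	}

	/* ... because readers may run in hardirq context: */
	static void demo_signal_owner(void)
	{
		unsigned long flags;

		read_lock_irqsave(&demo_owner_lock, flags);
		/* ... read owner state, deliver signal ... */
		read_unlock_irqrestore(&demo_owner_lock, flags);
	}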

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 fs/cifs/file.c |   18 +++++++++---------
 fs/fcntl.c     |   11 +++++++----
 2 files changed, 16 insertions(+), 13 deletions(-)

Index: linux/fs/cifs/file.c
===================================================================
--- linux.orig/fs/cifs/file.c
+++ linux/fs/cifs/file.c
@@ -108,7 +108,7 @@ static inline int cifs_open_inode_helper
 			 &pCifsInode->openFileList);
 	}
 	write_unlock(&GlobalSMBSeslock);
-	write_unlock(&file->f_owner.lock);
+	write_unlock_irq(&file->f_owner.lock);
 	if (pCifsInode->clientCanCacheRead) {
 		/* we have the inode open somewhere else
 		   no need to discard cache data */
@@ -280,7 +280,7 @@ int cifs_open(struct inode *inode, struc
 		goto out;
 	}
 	pCifsFile = cifs_init_private(file->private_data, inode, file, netfid);
-	write_lock(&file->f_owner.lock);
+	write_lock_irq(&file->f_owner.lock);
 	write_lock(&GlobalSMBSeslock);
 	list_add(&pCifsFile->tlist, &pTcon->openFileList);
 
@@ -291,7 +291,7 @@ int cifs_open(struct inode *inode, struc
 					    &oplock, buf, full_path, xid);
 	} else {
 		write_unlock(&GlobalSMBSeslock);
-		write_unlock(&file->f_owner.lock);
+		write_unlock_irq(&file->f_owner.lock);
 	}
 
 	if (oplock & CIFS_CREATE_ACTION) {           
@@ -470,7 +470,7 @@ int cifs_close(struct inode *inode, stru
 	pTcon = cifs_sb->tcon;
 	if (pSMBFile) {
 		pSMBFile->closePend = TRUE;
-		write_lock(&file->f_owner.lock);
+		write_lock_irq(&file->f_owner.lock);
 		if (pTcon) {
 			/* no sense reconnecting to close a file that is
 			   already closed */
@@ -485,23 +485,23 @@ int cifs_close(struct inode *inode, stru
 					the struct would be in each open file,
 					but this should give enough time to 
 					clear the socket */
-					write_unlock(&file->f_owner.lock);
+					write_unlock_irq(&file->f_owner.lock);
 					cERROR(1,("close with pending writes"));
 					msleep(timeout);
-					write_lock(&file->f_owner.lock);
+					write_lock_irq(&file->f_owner.lock);
 					timeout *= 4;
 				} 
-				write_unlock(&file->f_owner.lock);
+				write_unlock_irq(&file->f_owner.lock);
 				rc = CIFSSMBClose(xid, pTcon,
 						  pSMBFile->netfid);
-				write_lock(&file->f_owner.lock);
+				write_lock_irq(&file->f_owner.lock);
 			}
 		}
 		write_lock(&GlobalSMBSeslock);
 		list_del(&pSMBFile->flist);
 		list_del(&pSMBFile->tlist);
 		write_unlock(&GlobalSMBSeslock);
-		write_unlock(&file->f_owner.lock);
+		write_unlock_irq(&file->f_owner.lock);
 		kfree(pSMBFile->search_resume_name);
 		kfree(file->private_data);
 		file->private_data = NULL;
Index: linux/fs/fcntl.c
===================================================================
--- linux.orig/fs/fcntl.c
+++ linux/fs/fcntl.c
@@ -470,9 +470,10 @@ static void send_sigio_to_task(struct ta
 void send_sigio(struct fown_struct *fown, int fd, int band)
 {
 	struct task_struct *p;
+	unsigned long flags;
 	int pid;
 	
-	read_lock(&fown->lock);
+	read_lock_irqsave(&fown->lock, flags);
 	pid = fown->pid;
 	if (!pid)
 		goto out_unlock_fown;
@@ -490,7 +491,7 @@ void send_sigio(struct fown_struct *fown
 	}
 	read_unlock(&tasklist_lock);
  out_unlock_fown:
-	read_unlock(&fown->lock);
+	read_unlock_irqrestore(&fown->lock, flags);
 }
 
 static void send_sigurg_to_task(struct task_struct *p,
@@ -503,9 +504,10 @@ static void send_sigurg_to_task(struct t
 int send_sigurg(struct fown_struct *fown)
 {
 	struct task_struct *p;
+	unsigned long flags;
 	int pid, ret = 0;
 	
-	read_lock(&fown->lock);
+	read_lock_irqsave(&fown->lock, flags);
 	pid = fown->pid;
 	if (!pid)
 		goto out_unlock_fown;
@@ -525,7 +527,8 @@ int send_sigurg(struct fown_struct *fown
 	}
 	read_unlock(&tasklist_lock);
  out_unlock_fown:
-	read_unlock(&fown->lock);
+	read_unlock_irqrestore(&fown->lock, flags);
+
 	return ret;
 }
 


* [patch 17/61] lock validator: sk_callback_lock workaround
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (15 preceding siblings ...)
  2006-05-29 21:24 ` [patch 16/61] lock validator: fown locking workaround Ingo Molnar
@ 2006-05-29 21:24 ` Ingo Molnar
  2006-05-30  1:34   ` Andrew Morton
  2006-05-29 21:24 ` [patch 18/61] lock validator: irqtrace: core Ingo Molnar
                   ` (56 subsequent siblings)
  73 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

temporary workaround for the lock validator: make all uses of
sk_callback_lock softirq-safe. (The real solution will be to
express to the lock validator that sk_callback_lock rules are
to be generated per-address-family.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 net/core/sock.c |   24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

Index: linux/net/core/sock.c
===================================================================
--- linux.orig/net/core/sock.c
+++ linux/net/core/sock.c
@@ -934,9 +934,9 @@ int sock_i_uid(struct sock *sk)
 {
 	int uid;
 
-	read_lock(&sk->sk_callback_lock);
+	read_lock_bh(&sk->sk_callback_lock);
 	uid = sk->sk_socket ? SOCK_INODE(sk->sk_socket)->i_uid : 0;
-	read_unlock(&sk->sk_callback_lock);
+	read_unlock_bh(&sk->sk_callback_lock);
 	return uid;
 }
 
@@ -944,9 +944,9 @@ unsigned long sock_i_ino(struct sock *sk
 {
 	unsigned long ino;
 
-	read_lock(&sk->sk_callback_lock);
+	read_lock_bh(&sk->sk_callback_lock);
 	ino = sk->sk_socket ? SOCK_INODE(sk->sk_socket)->i_ino : 0;
-	read_unlock(&sk->sk_callback_lock);
+	read_unlock_bh(&sk->sk_callback_lock);
 	return ino;
 }
 
@@ -1306,33 +1306,33 @@ ssize_t sock_no_sendpage(struct socket *
 
 static void sock_def_wakeup(struct sock *sk)
 {
-	read_lock(&sk->sk_callback_lock);
+	read_lock_bh(&sk->sk_callback_lock);
 	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
 		wake_up_interruptible_all(sk->sk_sleep);
-	read_unlock(&sk->sk_callback_lock);
+	read_unlock_bh(&sk->sk_callback_lock);
 }
 
 static void sock_def_error_report(struct sock *sk)
 {
-	read_lock(&sk->sk_callback_lock);
+	read_lock_bh(&sk->sk_callback_lock);
 	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
 		wake_up_interruptible(sk->sk_sleep);
 	sk_wake_async(sk,0,POLL_ERR); 
-	read_unlock(&sk->sk_callback_lock);
+	read_unlock_bh(&sk->sk_callback_lock);
 }
 
 static void sock_def_readable(struct sock *sk, int len)
 {
-	read_lock(&sk->sk_callback_lock);
+	read_lock_bh(&sk->sk_callback_lock);
 	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
 		wake_up_interruptible(sk->sk_sleep);
 	sk_wake_async(sk,1,POLL_IN);
-	read_unlock(&sk->sk_callback_lock);
+	read_unlock_bh(&sk->sk_callback_lock);
 }
 
 static void sock_def_write_space(struct sock *sk)
 {
-	read_lock(&sk->sk_callback_lock);
+	read_lock_bh(&sk->sk_callback_lock);
 
 	/* Do not wake up a writer until he can make "significant"
 	 * progress.  --DaveM
@@ -1346,7 +1346,7 @@ static void sock_def_write_space(struct 
 			sk_wake_async(sk, 2, POLL_OUT);
 	}
 
-	read_unlock(&sk->sk_callback_lock);
+	read_unlock_bh(&sk->sk_callback_lock);
 }
 
 static void sock_def_destruct(struct sock *sk)


* [patch 18/61] lock validator: irqtrace: core
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (16 preceding siblings ...)
  2006-05-29 21:24 ` [patch 17/61] lock validator: sk_callback_lock workaround Ingo Molnar
@ 2006-05-29 21:24 ` Ingo Molnar
  2006-05-30  1:34   ` Andrew Morton
  2006-05-29 21:24 ` [patch 19/61] lock validator: irqtrace: cleanup: include/asm-i386/irqflags.h Ingo Molnar
                   ` (55 subsequent siblings)
  73 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

accurate hard-IRQ-flags state tracing. This allows us to attach
extra functionality to IRQ flags on/off events (such as trace-on/off).
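
Conceptually (a sketch only -- the real definitions live in the new
<linux/trace_irqflags.h>, which is not reproduced in this excerpt), the
C-level wrappers pair each real irq-flags change with a call into the tracing
hooks, and the TRACE_IRQS_ON/TRACE_IRQS_OFF assembly macros below do the same
for the transitions performed directly in the entry code:

	#ifdef CONFIG_TRACE_IRQFLAGS
	# define local_irq_enable()			\
		do {					\
			trace_hardirqs_on();		\
			raw_local_irq_enable();		\
		} while (0)

	# define local_irq_disable()			\
		do {					\
			raw_local_irq_disable();	\
			trace_hardirqs_off();		\
		} while (0)
	#else
	# define local_irq_enable()	raw_local_irq_enable()
	# define local_irq_disable()	raw_local_irq_disable()
	#endif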

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 arch/i386/kernel/entry.S       |   25 ++++++-
 arch/i386/kernel/irq.c         |    6 +
 arch/x86_64/ia32/ia32entry.S   |   19 +++++
 arch/x86_64/kernel/entry.S     |   54 +++++++++++++++-
 arch/x86_64/kernel/irq.c       |    4 -
 include/asm-i386/irqflags.h    |   56 ++++++++++++++++
 include/asm-i386/spinlock.h    |    5 +
 include/asm-i386/system.h      |   20 -----
 include/asm-powerpc/irqflags.h |   31 +++++++++
 include/asm-x86_64/irqflags.h  |   54 ++++++++++++++++
 include/asm-x86_64/system.h    |   38 -----------
 include/linux/hardirq.h        |   13 +++
 include/linux/init_task.h      |    1 
 include/linux/interrupt.h      |   11 +--
 include/linux/sched.h          |   15 ++++
 include/linux/trace_irqflags.h |   87 ++++++++++++++++++++++++++
 kernel/fork.c                  |   20 +++++
 kernel/sched.c                 |    4 -
 kernel/softirq.c               |  137 +++++++++++++++++++++++++++++++++++------
 lib/locking-selftest.c         |    3 
 20 files changed, 513 insertions(+), 90 deletions(-)

Index: linux/arch/i386/kernel/entry.S
===================================================================
--- linux.orig/arch/i386/kernel/entry.S
+++ linux/arch/i386/kernel/entry.S
@@ -43,6 +43,7 @@
 #include <linux/config.h>
 #include <linux/linkage.h>
 #include <asm/thread_info.h>
+#include <asm/irqflags.h>
 #include <asm/errno.h>
 #include <asm/segment.h>
 #include <asm/smp.h>
@@ -76,7 +77,7 @@ NT_MASK		= 0x00004000
 VM_MASK		= 0x00020000
 
 #ifdef CONFIG_PREEMPT
-#define preempt_stop		cli
+#define preempt_stop		cli; TRACE_IRQS_OFF
 #else
 #define preempt_stop
 #define resume_kernel		restore_nocheck
@@ -186,6 +187,10 @@ need_resched:
 ENTRY(sysenter_entry)
 	movl TSS_sysenter_esp0(%esp),%esp
 sysenter_past_esp:
+	/*
+	 * No need to follow this irqs on/off section: the syscall
+	 * disabled irqs and here we enable them straight after entry:
+	 */
 	sti
 	pushl $(__USER_DS)
 	pushl %ebp
@@ -217,6 +222,7 @@ sysenter_past_esp:
 	call *sys_call_table(,%eax,4)
 	movl %eax,EAX(%esp)
 	cli
+	TRACE_IRQS_OFF
 	movl TI_flags(%ebp), %ecx
 	testw $_TIF_ALLWORK_MASK, %cx
 	jne syscall_exit_work
@@ -224,6 +230,7 @@ sysenter_past_esp:
 	movl EIP(%esp), %edx
 	movl OLDESP(%esp), %ecx
 	xorl %ebp,%ebp
+	TRACE_IRQS_ON
 	sti
 	sysexit
 
@@ -250,6 +257,7 @@ syscall_exit:
 	cli				# make sure we don't miss an interrupt
 					# setting need_resched or sigpending
 					# between sampling and the iret
+	TRACE_IRQS_OFF
 	movl TI_flags(%ebp), %ecx
 	testw $_TIF_ALLWORK_MASK, %cx	# current->work
 	jne syscall_exit_work
@@ -265,11 +273,14 @@ restore_all:
 	cmpl $((4 << 8) | 3), %eax
 	je ldt_ss			# returning to user-space with LDT SS
 restore_nocheck:
+	TRACE_IRQS_ON
+restore_nocheck_notrace:
 	RESTORE_REGS
 	addl $4, %esp
 1:	iret
 .section .fixup,"ax"
 iret_exc:
+	TRACE_IRQS_ON
 	sti
 	pushl $0			# no error code
 	pushl $do_iret_error
@@ -293,10 +304,12 @@ ldt_ss:
 	 * dosemu and wine happy. */
 	subl $8, %esp		# reserve space for switch16 pointer
 	cli
+	TRACE_IRQS_OFF
 	movl %esp, %eax
 	/* Set up the 16bit stack frame with switch32 pointer on top,
 	 * and a switch16 pointer on top of the current frame. */
 	call setup_x86_bogus_stack
+	TRACE_IRQS_ON
 	RESTORE_REGS
 	lss 20+4(%esp), %esp	# switch to 16bit stack
 1:	iret
@@ -315,6 +328,7 @@ work_resched:
 	cli				# make sure we don't miss an interrupt
 					# setting need_resched or sigpending
 					# between sampling and the iret
+	TRACE_IRQS_OFF
 	movl TI_flags(%ebp), %ecx
 	andl $_TIF_WORK_MASK, %ecx	# is there any work to be done other
 					# than syscall tracing?
@@ -364,6 +378,7 @@ syscall_trace_entry:
 syscall_exit_work:
 	testb $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SINGLESTEP), %cl
 	jz work_pending
+	TRACE_IRQS_ON
 	sti				# could let do_syscall_trace() call
 					# schedule() instead
 	movl %esp, %eax
@@ -425,9 +440,14 @@ ENTRY(irq_entries_start)
 vector=vector+1
 .endr
 
+/*
+ * the CPU automatically disables interrupts when executing an IRQ vector,
+ * so IRQ-flags tracing has to follow that:
+ */
 	ALIGN
 common_interrupt:
 	SAVE_ALL
+	TRACE_IRQS_OFF
 	movl %esp,%eax
 	call do_IRQ
 	jmp ret_from_intr
@@ -436,6 +456,7 @@ common_interrupt:
 ENTRY(name)				\
 	pushl $~(nr);			\
 	SAVE_ALL			\
+	TRACE_IRQS_OFF			\
 	movl %esp,%eax;			\
 	call smp_/**/name;		\
 	jmp ret_from_intr;
@@ -565,7 +586,7 @@ nmi_stack_correct:
 	xorl %edx,%edx		# zero error code
 	movl %esp,%eax		# pt_regs pointer
 	call do_nmi
-	jmp restore_all
+	jmp restore_nocheck_notrace
 
 nmi_stack_fixup:
 	FIX_STACK(12,nmi_stack_correct, 1)
Index: linux/arch/i386/kernel/irq.c
===================================================================
--- linux.orig/arch/i386/kernel/irq.c
+++ linux/arch/i386/kernel/irq.c
@@ -147,7 +147,7 @@ void irq_ctx_init(int cpu)
 	irqctx->tinfo.task              = NULL;
 	irqctx->tinfo.exec_domain       = NULL;
 	irqctx->tinfo.cpu               = cpu;
-	irqctx->tinfo.preempt_count     = SOFTIRQ_OFFSET;
+	irqctx->tinfo.preempt_count     = 0;
 	irqctx->tinfo.addr_limit        = MAKE_MM_SEG(0);
 
 	softirq_ctx[cpu] = irqctx;
@@ -192,6 +192,10 @@ asmlinkage void do_softirq(void)
 			: "0"(isp)
 			: "memory", "cc", "edx", "ecx", "eax"
 		);
+		/*
+		 * Shouldn't happen, we returned above if in_interrupt():
+	 	 */
+		WARN_ON_ONCE(softirq_count());
 	}
 
 	local_irq_restore(flags);
Index: linux/arch/x86_64/ia32/ia32entry.S
===================================================================
--- linux.orig/arch/x86_64/ia32/ia32entry.S
+++ linux/arch/x86_64/ia32/ia32entry.S
@@ -13,6 +13,7 @@
 #include <asm/thread_info.h>	
 #include <asm/segment.h>
 #include <asm/vsyscall32.h>
+#include <asm/irqflags.h>
 #include <linux/linkage.h>
 
 #define IA32_NR_syscalls ((ia32_syscall_end - ia32_sys_call_table)/8)
@@ -75,6 +76,10 @@ ENTRY(ia32_sysenter_target)
 	swapgs
 	movq	%gs:pda_kernelstack, %rsp
 	addq	$(PDA_STACKOFFSET),%rsp	
+	/*
+	 * No need to follow this irqs on/off section: the syscall
+	 * disabled irqs, and here we enable them straight after entry:
+	 */
 	sti	
  	movl	%ebp,%ebp		/* zero extension */
 	pushq	$__USER32_DS
@@ -118,6 +123,7 @@ sysenter_do_call:	
 	movq	%rax,RAX-ARGOFFSET(%rsp)
 	GET_THREAD_INFO(%r10)
 	cli
+	TRACE_IRQS_OFF
 	testl	$_TIF_ALLWORK_MASK,threadinfo_flags(%r10)
 	jnz	int_ret_from_sys_call
 	andl    $~TS_COMPAT,threadinfo_status(%r10)
@@ -132,6 +138,7 @@ sysenter_do_call:	
 	CFI_REGISTER rsp,rcx
 	movl	$VSYSCALL32_SYSEXIT,%edx	/* User %eip */
 	CFI_REGISTER rip,rdx
+	TRACE_IRQS_ON
 	swapgs
 	sti		/* sti only takes effect after the next instruction */
 	/* sysexit */
@@ -186,6 +193,10 @@ ENTRY(ia32_cstar_target)
 	movl	%esp,%r8d
 	CFI_REGISTER	rsp,r8
 	movq	%gs:pda_kernelstack,%rsp
+	/*
+	 * No need to follow this irqs on/off section: the syscall
+	 * disabled irqs and here we enable them straight after entry:
+	 */
 	sti
 	SAVE_ARGS 8,1,1
 	movl 	%eax,%eax	/* zero extension */
@@ -220,6 +231,7 @@ cstar_do_call:	
 	movq %rax,RAX-ARGOFFSET(%rsp)
 	GET_THREAD_INFO(%r10)
 	cli
+	TRACE_IRQS_OFF
 	testl $_TIF_ALLWORK_MASK,threadinfo_flags(%r10)
 	jnz  int_ret_from_sys_call
 	andl $~TS_COMPAT,threadinfo_status(%r10)
@@ -228,6 +240,7 @@ cstar_do_call:	
 	CFI_REGISTER rip,rcx
 	movl EFLAGS-ARGOFFSET(%rsp),%r11d	
 	/*CFI_REGISTER rflags,r11*/
+	TRACE_IRQS_ON
 	movl RSP-ARGOFFSET(%rsp),%esp
 	CFI_RESTORE rsp
 	swapgs
@@ -286,7 +299,11 @@ ENTRY(ia32_syscall)
 	/*CFI_REL_OFFSET	rflags,EFLAGS-RIP*/
 	/*CFI_REL_OFFSET	cs,CS-RIP*/
 	CFI_REL_OFFSET	rip,RIP-RIP
-	swapgs	
+	swapgs
+	/*
+	 * No need to follow this irqs on/off section: the syscall
+	 * disabled irqs and here we enable them straight after entry:
+	 */
 	sti
 	movl %eax,%eax
 	pushq %rax
Index: linux/arch/x86_64/kernel/entry.S
===================================================================
--- linux.orig/arch/x86_64/kernel/entry.S
+++ linux/arch/x86_64/kernel/entry.S
@@ -42,13 +42,14 @@
 #include <asm/thread_info.h>
 #include <asm/hw_irq.h>
 #include <asm/page.h>
+#include <asm/irqflags.h>
 
 	.code64
 
 #ifndef CONFIG_PREEMPT
 #define retint_kernel retint_restore_args
 #endif	
-	
+
 /*
  * C code is not supposed to know about undefined top of stack. Every time 
  * a C function with an pt_regs argument is called from the SYSCALL based 
@@ -195,6 +196,10 @@ ENTRY(system_call)
 	swapgs
 	movq	%rsp,%gs:pda_oldrsp 
 	movq	%gs:pda_kernelstack,%rsp
+	/*
+	 * No need to follow this irqs off/on section - it's straight
+	 * and short:
+	 */
 	sti					
 	SAVE_ARGS 8,1
 	movq  %rax,ORIG_RAX-ARGOFFSET(%rsp) 
@@ -220,10 +225,15 @@ ret_from_sys_call:
 sysret_check:		
 	GET_THREAD_INFO(%rcx)
 	cli
+	TRACE_IRQS_OFF
 	movl threadinfo_flags(%rcx),%edx
 	andl %edi,%edx
 	CFI_REMEMBER_STATE
 	jnz  sysret_careful 
+	/*
+	 * sysretq will re-enable interrupts:
+	 */
+	TRACE_IRQS_ON
 	movq RIP-ARGOFFSET(%rsp),%rcx
 	CFI_REGISTER	rip,rcx
 	RESTORE_ARGS 0,-ARG_SKIP,1
@@ -238,6 +248,7 @@ sysret_careful:
 	CFI_RESTORE_STATE
 	bt $TIF_NEED_RESCHED,%edx
 	jnc sysret_signal
+	TRACE_IRQS_ON
 	sti
 	pushq %rdi
 	CFI_ADJUST_CFA_OFFSET 8
@@ -248,6 +259,7 @@ sysret_careful:
 
 	/* Handle a signal */ 
 sysret_signal:
+	TRACE_IRQS_ON
 	sti
 	testl $(_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SINGLESTEP),%edx
 	jz    1f
@@ -262,6 +274,7 @@ sysret_signal:
 	/* Use IRET because user could have changed frame. This
 	   works because ptregscall_common has called FIXUP_TOP_OF_STACK. */
 	cli
+	TRACE_IRQS_OFF
 	jmp int_with_check
 	
 badsys:
@@ -315,6 +328,7 @@ ENTRY(int_ret_from_sys_call)
 	CFI_REL_OFFSET	r10,R10-ARGOFFSET
 	CFI_REL_OFFSET	r11,R11-ARGOFFSET
 	cli
+	TRACE_IRQS_OFF
 	testl $3,CS-ARGOFFSET(%rsp)
 	je retint_restore_args
 	movl $_TIF_ALLWORK_MASK,%edi
@@ -333,6 +347,7 @@ int_with_check:
 int_careful:
 	bt $TIF_NEED_RESCHED,%edx
 	jnc  int_very_careful
+	TRACE_IRQS_ON
 	sti
 	pushq %rdi
 	CFI_ADJUST_CFA_OFFSET 8
@@ -340,10 +355,12 @@ int_careful:
 	popq %rdi
 	CFI_ADJUST_CFA_OFFSET -8
 	cli
+	TRACE_IRQS_OFF
 	jmp int_with_check
 
 	/* handle signals and tracing -- both require a full stack frame */
 int_very_careful:
+	TRACE_IRQS_ON
 	sti
 	SAVE_REST
 	/* Check for syscall exit trace */	
@@ -357,6 +374,7 @@ int_very_careful:
 	CFI_ADJUST_CFA_OFFSET -8
 	andl $~(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SINGLESTEP),%edi
 	cli
+	TRACE_IRQS_OFF
 	jmp int_restore_rest
 	
 int_signal:
@@ -369,6 +387,7 @@ int_signal:
 int_restore_rest:
 	RESTORE_REST
 	cli
+	TRACE_IRQS_OFF
 	jmp int_with_check
 	CFI_ENDPROC
 END(int_ret_from_sys_call)
@@ -501,6 +520,11 @@ END(stub_rt_sigreturn)
 #ifndef CONFIG_DEBUG_INFO
 	CFI_ADJUST_CFA_OFFSET	8
 #endif
+	/*
+	 * We entered an interrupt context - irqs are off:
+	 */
+	TRACE_IRQS_OFF
+
 	call \func
 	.endm
 
@@ -514,6 +538,7 @@ ret_from_intr:
 	CFI_ADJUST_CFA_OFFSET	-8
 #endif
 	cli	
+	TRACE_IRQS_OFF
 	decl %gs:pda_irqcount
 #ifdef CONFIG_DEBUG_INFO
 	movq RBP(%rdi),%rbp
@@ -538,9 +563,21 @@ retint_check:
 	CFI_REMEMBER_STATE
 	jnz  retint_careful
 retint_swapgs:	 	
+	/*
+	 * The iretq will re-enable interrupts:
+	 */
+	cli
+	TRACE_IRQS_ON
 	swapgs 
+	jmp restore_args
+
 retint_restore_args:				
 	cli
+	/*
+	 * The iretq will re-enable interrupts:
+	 */
+	TRACE_IRQS_ON
+restore_args:
 	RESTORE_ARGS 0,8,0						
 iret_label:	
 	iretq
@@ -553,6 +590,7 @@ iret_label:	
 	/* running with kernel gs */
 bad_iret:
 	movq $11,%rdi	/* SIGSEGV */
+	TRACE_IRQS_ON
 	sti
 	jmp do_exit			
 	.previous	
@@ -562,6 +600,7 @@ retint_careful:
 	CFI_RESTORE_STATE
 	bt    $TIF_NEED_RESCHED,%edx
 	jnc   retint_signal
+	TRACE_IRQS_ON
 	sti
 	pushq %rdi
 	CFI_ADJUST_CFA_OFFSET	8
@@ -570,11 +609,13 @@ retint_careful:
 	CFI_ADJUST_CFA_OFFSET	-8
 	GET_THREAD_INFO(%rcx)
 	cli
+	TRACE_IRQS_OFF
 	jmp retint_check
 	
 retint_signal:
 	testl $(_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SINGLESTEP),%edx
 	jz    retint_swapgs
+	TRACE_IRQS_ON
 	sti
 	SAVE_REST
 	movq $-1,ORIG_RAX(%rsp) 			
@@ -583,6 +624,7 @@ retint_signal:
 	call do_notify_resume
 	RESTORE_REST
 	cli
+	TRACE_IRQS_OFF
 	movl $_TIF_NEED_RESCHED,%edi
 	GET_THREAD_INFO(%rcx)
 	jmp retint_check
@@ -714,6 +756,7 @@ END(spurious_interrupt)
 	addq	$EXCEPTION_STKSZ, per_cpu__init_tss + TSS_ist + (\ist - 1) * 8(%rbp)
 	.endif
 	cli
+	TRACE_IRQS_OFF
 	.endm
 	
 /*
@@ -771,6 +814,7 @@ error_exit:		
 	movl %ebx,%eax		
 	RESTORE_REST
 	cli
+	TRACE_IRQS_OFF
 	GET_THREAD_INFO(%rcx)	
 	testl %eax,%eax
 	jne  retint_kernel
@@ -778,6 +822,10 @@ error_exit:		
 	movl  $_TIF_WORK_MASK,%edi
 	andl  %edi,%edx
 	jnz  retint_careful
+	/*
+	 * The iret will restore flags:
+	 */
+	TRACE_IRQS_ON
 	swapgs 
 	RESTORE_ARGS 0,8,0						
 	jmp iret_label
@@ -980,16 +1028,20 @@ paranoid_userspace:	
 	testl $_TIF_NEED_RESCHED,%ebx
 	jnz paranoid_schedule
 	movl %ebx,%edx			/* arg3: thread flags */
+	TRACE_IRQS_ON
 	sti
 	xorl %esi,%esi 			/* arg2: oldset */
 	movq %rsp,%rdi 			/* arg1: &pt_regs */
 	call do_notify_resume
 	cli
+	TRACE_IRQS_OFF
 	jmp paranoid_userspace
 paranoid_schedule:
+	TRACE_IRQS_ON
 	sti
 	call schedule
 	cli
+	TRACE_IRQS_OFF
 	jmp paranoid_userspace
 	CFI_ENDPROC
 END(nmi)
Index: linux/arch/x86_64/kernel/irq.c
===================================================================
--- linux.orig/arch/x86_64/kernel/irq.c
+++ linux/arch/x86_64/kernel/irq.c
@@ -145,8 +145,10 @@ asmlinkage void do_softirq(void)
  	local_irq_save(flags);
  	pending = local_softirq_pending();
  	/* Switch to interrupt stack */
- 	if (pending)
+ 	if (pending) {
 		call_softirq();
+		WARN_ON_ONCE(softirq_count());
+	}
  	local_irq_restore(flags);
 }
 EXPORT_SYMBOL(do_softirq);
Index: linux/include/asm-i386/irqflags.h
===================================================================
--- /dev/null
+++ linux/include/asm-i386/irqflags.h
@@ -0,0 +1,56 @@
+/*
+ * include/asm-i386/irqflags.h
+ *
+ * IRQ flags handling
+ *
+ * This file gets included from lowlevel asm headers too, to provide
+ * wrapped versions of the local_irq_*() APIs, based on the
+ * raw_local_irq_*() macros from the lowlevel headers.
+ */
+#ifndef _ASM_IRQFLAGS_H
+#define _ASM_IRQFLAGS_H
+
+#define raw_local_save_flags(x)	do { typecheck(unsigned long,x); __asm__ __volatile__("pushfl ; popl %0":"=g" (x): /* no input */); } while (0)
+#define raw_local_irq_restore(x) do { typecheck(unsigned long,x); __asm__ __volatile__("pushl %0 ; popfl": /* no output */ :"g" (x):"memory", "cc"); } while (0)
+#define raw_local_irq_disable()	__asm__ __volatile__("cli": : :"memory")
+#define raw_local_irq_enable()	__asm__ __volatile__("sti": : :"memory")
+/* used in the idle loop; sti takes one instruction cycle to complete */
+#define raw_safe_halt()		__asm__ __volatile__("sti; hlt": : :"memory")
+/* used when interrupts are already enabled or to shutdown the processor */
+#define halt()			__asm__ __volatile__("hlt": : :"memory")
+
+#define raw_irqs_disabled_flags(flags)	(!((flags) & (1<<9)))
+
+/* For spinlocks etc */
+#define raw_local_irq_save(x)	__asm__ __volatile__("pushfl ; popl %0 ; cli":"=g" (x): /* no input */ :"memory")
+
+/*
+ * Do the CPU's IRQ-state tracing from assembly code. We call a
+ * C function, so save all the C-clobbered registers:
+ */
+#ifdef CONFIG_TRACE_IRQFLAGS
+
+# define TRACE_IRQS_ON				\
+	pushl %eax;				\
+	pushl %ecx;				\
+	pushl %edx;				\
+	call trace_hardirqs_on;			\
+	popl %edx;				\
+	popl %ecx;				\
+	popl %eax;
+
+# define TRACE_IRQS_OFF				\
+	pushl %eax;				\
+	pushl %ecx;				\
+	pushl %edx;				\
+	call trace_hardirqs_off;		\
+	popl %edx;				\
+	popl %ecx;				\
+	popl %eax;
+
+#else
+# define TRACE_IRQS_ON
+# define TRACE_IRQS_OFF
+#endif
+
+#endif
Index: linux/include/asm-i386/spinlock.h
===================================================================
--- linux.orig/include/asm-i386/spinlock.h
+++ linux/include/asm-i386/spinlock.h
@@ -31,6 +31,11 @@
 	"jmp 1b\n" \
 	"3:\n\t"
 
+/*
+ * NOTE: there's an irqs-on section here, which normally would have to be
+ * irq-traced, but on CONFIG_TRACE_IRQFLAGS we never use
+ * __raw_spin_lock_string_flags().
+ */
 #define __raw_spin_lock_string_flags \
 	"\n1:\t" \
 	"lock ; decb %0\n\t" \
Index: linux/include/asm-i386/system.h
===================================================================
--- linux.orig/include/asm-i386/system.h
+++ linux/include/asm-i386/system.h
@@ -456,25 +456,7 @@ static inline unsigned long long __cmpxc
 
 #define set_wmb(var, value) do { var = value; wmb(); } while (0)
 
-/* interrupt control.. */
-#define local_save_flags(x)	do { typecheck(unsigned long,x); __asm__ __volatile__("pushfl ; popl %0":"=g" (x): /* no input */); } while (0)
-#define local_irq_restore(x) 	do { typecheck(unsigned long,x); __asm__ __volatile__("pushl %0 ; popfl": /* no output */ :"g" (x):"memory", "cc"); } while (0)
-#define local_irq_disable() 	__asm__ __volatile__("cli": : :"memory")
-#define local_irq_enable()	__asm__ __volatile__("sti": : :"memory")
-/* used in the idle loop; sti takes one instruction cycle to complete */
-#define safe_halt()		__asm__ __volatile__("sti; hlt": : :"memory")
-/* used when interrupts are already enabled or to shutdown the processor */
-#define halt()			__asm__ __volatile__("hlt": : :"memory")
-
-#define irqs_disabled()			\
-({					\
-	unsigned long flags;		\
-	local_save_flags(flags);	\
-	!(flags & (1<<9));		\
-})
-
-/* For spinlocks etc */
-#define local_irq_save(x)	__asm__ __volatile__("pushfl ; popl %0 ; cli":"=g" (x): /* no input */ :"memory")
+#include <linux/trace_irqflags.h>
 
 /*
  * disable hlt during certain critical i/o operations
Index: linux/include/asm-powerpc/irqflags.h
===================================================================
--- /dev/null
+++ linux/include/asm-powerpc/irqflags.h
@@ -0,0 +1,31 @@
+/*
+ * include/asm-powerpc/irqflags.h
+ *
+ * IRQ flags handling
+ *
+ * This file gets included from lowlevel asm headers too, to provide
+ * wrapped versions of the local_irq_*() APIs, based on the
+ * raw_local_irq_*() macros from the lowlevel headers.
+ */
+#ifndef _ASM_IRQFLAGS_H
+#define _ASM_IRQFLAGS_H
+
+/*
+ * Get definitions for raw_local_save_flags(x), etc.
+ */
+#include <asm-powerpc/hw_irq.h>
+
+/*
+ * Do the CPU's IRQ-state tracing from assembly code. We call a
+ * C function, so save all the C-clobbered registers:
+ */
+#ifdef CONFIG_TRACE_IRQFLAGS
+
+#error No support on PowerPC yet for CONFIG_TRACE_IRQFLAGS
+
+#else
+# define TRACE_IRQS_ON
+# define TRACE_IRQS_OFF
+#endif
+
+#endif
Index: linux/include/asm-x86_64/irqflags.h
===================================================================
--- /dev/null
+++ linux/include/asm-x86_64/irqflags.h
@@ -0,0 +1,54 @@
+/*
+ * include/asm-x86_64/irqflags.h
+ *
+ * IRQ flags handling
+ *
+ * This file gets included from lowlevel asm headers too, to provide
+ * wrapped versions of the local_irq_*() APIs, based on the
+ * raw_local_irq_*() macros from the lowlevel headers.
+ */
+#ifndef _ASM_IRQFLAGS_H
+#define _ASM_IRQFLAGS_H
+
+/* interrupt control.. */
+#define raw_local_save_flags(x)	do { warn_if_not_ulong(x); __asm__ __volatile__("# save_flags \n\t pushfq ; popq %q0":"=g" (x): /* no input */ :"memory"); } while (0)
+#define raw_local_irq_restore(x) 	__asm__ __volatile__("# restore_flags \n\t pushq %0 ; popfq": /* no output */ :"g" (x):"memory", "cc")
+
+#ifdef CONFIG_X86_VSMP
+/* Interrupt control for VSMP  architecture */
+#define raw_local_irq_disable()	do { unsigned long flags; raw_local_save_flags(flags); raw_local_irq_restore((flags & ~(1 << 9)) | (1 << 18)); } while (0)
+#define raw_local_irq_enable()	do { unsigned long flags; raw_local_save_flags(flags); raw_local_irq_restore((flags | (1 << 9)) & ~(1 << 18)); } while (0)
+
+#define raw_irqs_disabled_flags(flags)	\
+({						\
+	(flags & (1<<18)) || !(flags & (1<<9));	\
+})
+
+/* For spinlocks etc */
+#define raw_local_irq_save(x)	do { raw_local_save_flags(x); raw_local_irq_restore((x & ~(1 << 9)) | (1 << 18)); } while (0)
+#else  /* CONFIG_X86_VSMP */
+#define raw_local_irq_disable() 	__asm__ __volatile__("cli": : :"memory")
+#define raw_local_irq_enable()	__asm__ __volatile__("sti": : :"memory")
+
+#define raw_irqs_disabled_flags(flags)	\
+({						\
+	!(flags & (1<<9));			\
+})
+
+/* For spinlocks etc */
+#define raw_local_irq_save(x) 	do { warn_if_not_ulong(x); __asm__ __volatile__("# raw_local_irq_save \n\t pushfq ; popq %0 ; cli":"=g" (x): /* no input */ :"memory"); } while (0)
+#endif
+
+#define raw_irqs_disabled()			\
+({						\
+	unsigned long flags;			\
+	raw_local_save_flags(flags);		\
+	raw_irqs_disabled_flags(flags);		\
+})
+
+/* used in the idle loop; sti takes one instruction cycle to complete */
+#define raw_safe_halt()	__asm__ __volatile__("sti; hlt": : :"memory")
+/* used when interrupts are already enabled or to shutdown the processor */
+#define halt()			__asm__ __volatile__("hlt": : :"memory")
+
+#endif
Index: linux/include/asm-x86_64/system.h
===================================================================
--- linux.orig/include/asm-x86_64/system.h
+++ linux/include/asm-x86_64/system.h
@@ -244,43 +244,7 @@ static inline unsigned long __cmpxchg(vo
 
 #define warn_if_not_ulong(x) do { unsigned long foo; (void) (&(x) == &foo); } while (0)
 
-/* interrupt control.. */
-#define local_save_flags(x)	do { warn_if_not_ulong(x); __asm__ __volatile__("# save_flags \n\t pushfq ; popq %q0":"=g" (x): /* no input */ :"memory"); } while (0)
-#define local_irq_restore(x) 	__asm__ __volatile__("# restore_flags \n\t pushq %0 ; popfq": /* no output */ :"g" (x):"memory", "cc")
-
-#ifdef CONFIG_X86_VSMP
-/* Interrupt control for VSMP  architecture */
-#define local_irq_disable()	do { unsigned long flags; local_save_flags(flags); local_irq_restore((flags & ~(1 << 9)) | (1 << 18)); } while (0)
-#define local_irq_enable()	do { unsigned long flags; local_save_flags(flags); local_irq_restore((flags | (1 << 9)) & ~(1 << 18)); } while (0)
-
-#define irqs_disabled()					\
-({							\
-	unsigned long flags;				\
-	local_save_flags(flags);			\
-	(flags & (1<<18)) || !(flags & (1<<9));		\
-})
-
-/* For spinlocks etc */
-#define local_irq_save(x)	do { local_save_flags(x); local_irq_restore((x & ~(1 << 9)) | (1 << 18)); } while (0)
-#else  /* CONFIG_X86_VSMP */
-#define local_irq_disable() 	__asm__ __volatile__("cli": : :"memory")
-#define local_irq_enable()	__asm__ __volatile__("sti": : :"memory")
-
-#define irqs_disabled()			\
-({					\
-	unsigned long flags;		\
-	local_save_flags(flags);	\
-	!(flags & (1<<9));		\
-})
-
-/* For spinlocks etc */
-#define local_irq_save(x) 	do { warn_if_not_ulong(x); __asm__ __volatile__("# local_irq_save \n\t pushfq ; popq %0 ; cli":"=g" (x): /* no input */ :"memory"); } while (0)
-#endif
-
-/* used in the idle loop; sti takes one instruction cycle to complete */
-#define safe_halt()		__asm__ __volatile__("sti; hlt": : :"memory")
-/* used when interrupts are already enabled or to shutdown the processor */
-#define halt()			__asm__ __volatile__("hlt": : :"memory")
+#include <linux/trace_irqflags.h>
 
 void cpu_idle_wait(void);
 
Index: linux/include/linux/hardirq.h
===================================================================
--- linux.orig/include/linux/hardirq.h
+++ linux/include/linux/hardirq.h
@@ -87,7 +87,11 @@ extern void synchronize_irq(unsigned int
 #endif
 
 #define nmi_enter()		irq_enter()
-#define nmi_exit()		sub_preempt_count(HARDIRQ_OFFSET)
+#define nmi_exit()					\
+	do {						\
+		sub_preempt_count(HARDIRQ_OFFSET);	\
+		trace_hardirq_exit();			\
+	} while (0)
 
 struct task_struct;
 
@@ -97,10 +101,17 @@ static inline void account_system_vtime(
 }
 #endif
 
+/*
+ * It is safe to do non-atomic ops on ->hardirq_context,
+ * because NMI handlers may not preempt and the ops are
+ * always balanced, so the interrupted value of ->hardirq_context
+ * will always be restored.
+ */
 #define irq_enter()					\
 	do {						\
 		account_system_vtime(current);		\
 		add_preempt_count(HARDIRQ_OFFSET);	\
+		trace_hardirq_enter();			\
 	} while (0)
 
 extern void irq_exit(void);
Index: linux/include/linux/init_task.h
===================================================================
--- linux.orig/include/linux/init_task.h
+++ linux/include/linux/init_task.h
@@ -133,6 +133,7 @@ extern struct group_info init_groups;
 	.journal_info	= NULL,						\
 	.cpu_timers	= INIT_CPU_TIMERS(tsk.cpu_timers),		\
 	.fs_excl	= ATOMIC_INIT(0),				\
+ 	INIT_TRACE_IRQFLAGS						\
 }
 
 
Index: linux/include/linux/interrupt.h
===================================================================
--- linux.orig/include/linux/interrupt.h
+++ linux/include/linux/interrupt.h
@@ -10,6 +10,7 @@
 #include <linux/irqreturn.h>
 #include <linux/hardirq.h>
 #include <linux/sched.h>
+#include <linux/trace_irqflags.h>
 #include <asm/atomic.h>
 #include <asm/ptrace.h>
 #include <asm/system.h>
@@ -72,13 +73,11 @@ static inline void __deprecated save_and
 #define save_and_cli(x)	save_and_cli(&x)
 #endif /* CONFIG_SMP */
 
-/* SoftIRQ primitives.  */
-#define local_bh_disable() \
-		do { add_preempt_count(SOFTIRQ_OFFSET); barrier(); } while (0)
-#define __local_bh_enable() \
-		do { barrier(); sub_preempt_count(SOFTIRQ_OFFSET); } while (0)
-
+extern void local_bh_disable(void);
+extern void __local_bh_enable(void);
+extern void _local_bh_enable(void);
 extern void local_bh_enable(void);
+extern void local_bh_enable_ip(unsigned long ip);
 
 /* PLEASE, avoid to allocate new softirqs, if you need not _really_ high
    frequency threaded job scheduling. For almost all the purposes
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -916,6 +916,21 @@ struct task_struct {
 	/* mutex deadlock detection */
 	struct mutex_waiter *blocked_on;
 #endif
+#ifdef CONFIG_TRACE_IRQFLAGS
+	unsigned int irq_events;
+	int hardirqs_enabled;
+	unsigned long hardirq_enable_ip;
+	unsigned int hardirq_enable_event;
+	unsigned long hardirq_disable_ip;
+	unsigned int hardirq_disable_event;
+	int softirqs_enabled;
+	unsigned long softirq_disable_ip;
+	unsigned int softirq_disable_event;
+	unsigned long softirq_enable_ip;
+	unsigned int softirq_enable_event;
+	int hardirq_context;
+	int softirq_context;
+#endif
 
 /* journalling filesystem info */
 	void *journal_info;
Index: linux/include/linux/trace_irqflags.h
===================================================================
--- /dev/null
+++ linux/include/linux/trace_irqflags.h
@@ -0,0 +1,87 @@
+/*
+ * include/linux/trace_irqflags.h
+ *
+ * IRQ flags tracing: follow the state of the hardirq and softirq flags and
+ * provide callbacks for transitions between ON and OFF states.
+ *
+ * This file gets included from lowlevel asm headers too, to provide
+ * wrapped versions of the local_irq_*() APIs, based on the
+ * raw_local_irq_*() macros from the lowlevel headers.
+ */
+#ifndef _LINUX_TRACE_IRQFLAGS_H
+#define _LINUX_TRACE_IRQFLAGS_H
+
+#include <asm/irqflags.h>
+
+/*
+ * The local_irq_*() APIs are equal to the raw_local_irq*()
+ * if !TRACE_IRQFLAGS.
+ */
+#ifdef CONFIG_TRACE_IRQFLAGS
+  extern void trace_hardirqs_on(void);
+  extern void trace_hardirqs_off(void);
+  extern void trace_softirqs_on(unsigned long ip);
+  extern void trace_softirqs_off(unsigned long ip);
+# define trace_hardirq_context(p)	((p)->hardirq_context)
+# define trace_softirq_context(p)	((p)->softirq_context)
+# define trace_hardirqs_enabled(p)	((p)->hardirqs_enabled)
+# define trace_softirqs_enabled(p)	((p)->softirqs_enabled)
+# define trace_hardirq_enter()	do { current->hardirq_context++; } while (0)
+# define trace_hardirq_exit()	do { current->hardirq_context--; } while (0)
+# define trace_softirq_enter()	do { current->softirq_context++; } while (0)
+# define trace_softirq_exit()	do { current->softirq_context--; } while (0)
+# define INIT_TRACE_IRQFLAGS	.softirqs_enabled = 1,
+
+#else
+# define trace_hardirqs_on()		do { } while (0)
+# define trace_hardirqs_off()		do { } while (0)
+# define trace_softirqs_on(ip)		do { } while (0)
+# define trace_softirqs_off(ip)		do { } while (0)
+# define trace_hardirq_context(p)	0
+# define trace_softirq_context(p)	0
+# define trace_hardirqs_enabled(p)	0
+# define trace_softirqs_enabled(p)	0
+# define trace_hardirq_enter()		do { } while (0)
+# define trace_hardirq_exit()		do { } while (0)
+# define trace_softirq_enter()		do { } while (0)
+# define trace_softirq_exit()		do { } while (0)
+# define INIT_TRACE_IRQFLAGS
+#endif
+
+#define local_irq_enable() \
+	do { trace_hardirqs_on(); raw_local_irq_enable(); } while (0)
+#define local_irq_disable() \
+	do { raw_local_irq_disable(); trace_hardirqs_off(); } while (0)
+#define local_irq_save(flags) \
+	do { raw_local_irq_save(flags); trace_hardirqs_off(); } while (0)
+
+#define local_irq_restore(flags)				\
+	do {							\
+		if (raw_irqs_disabled_flags(flags)) {		\
+			raw_local_irq_restore(flags);		\
+			trace_hardirqs_off();			\
+		} else {					\
+			trace_hardirqs_on();			\
+			raw_local_irq_restore(flags);		\
+		}						\
+	} while (0)
+
+#define safe_halt()						\
+	do {							\
+		trace_hardirqs_on();				\
+		raw_safe_halt();				\
+	} while (0)
+
+#define local_save_flags(flags)		raw_local_save_flags(flags)
+
+#define irqs_disabled()						\
+({								\
+	unsigned long flags;					\
+								\
+	raw_local_save_flags(flags);				\
+	raw_irqs_disabled_flags(flags);				\
+})
+
+#define irqs_disabled_flags(flags)	raw_irqs_disabled_flags(flags)
+
+#endif
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c
+++ linux/kernel/fork.c
@@ -970,6 +970,10 @@ static task_t *copy_process(unsigned lon
 	if (!p)
 		goto fork_out;
 
+#ifdef CONFIG_TRACE_IRQFLAGS
+	DEBUG_WARN_ON(!p->hardirqs_enabled);
+	DEBUG_WARN_ON(!p->softirqs_enabled);
+#endif
 	retval = -EAGAIN;
 	if (atomic_read(&p->user->processes) >=
 			p->signal->rlim[RLIMIT_NPROC].rlim_cur) {
@@ -1051,7 +1055,21 @@ static task_t *copy_process(unsigned lon
 #ifdef CONFIG_DEBUG_MUTEXES
 	p->blocked_on = NULL; /* not blocked yet */
 #endif
-
+#ifdef CONFIG_TRACE_IRQFLAGS
+	p->irq_events = 0;
+	p->hardirqs_enabled = 0;
+	p->hardirq_enable_ip = 0;
+	p->hardirq_enable_event = 0;
+	p->hardirq_disable_ip = _THIS_IP_;
+	p->hardirq_disable_event = 0;
+	p->softirqs_enabled = 1;
+	p->softirq_enable_ip = _THIS_IP_;
+	p->softirq_enable_event = 0;
+	p->softirq_disable_ip = 0;
+	p->softirq_disable_event = 0;
+	p->hardirq_context = 0;
+	p->softirq_context = 0;
+#endif
 	p->tgid = p->pid;
 	if (clone_flags & CLONE_THREAD)
 		p->tgid = current->tgid;
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -4481,7 +4481,9 @@ int __sched cond_resched_softirq(void)
 	BUG_ON(!in_softirq());
 
 	if (need_resched()) {
-		__local_bh_enable();
+		raw_local_irq_disable();
+		_local_bh_enable();
+		raw_local_irq_enable();
 		__cond_resched();
 		local_bh_disable();
 		return 1;
Index: linux/kernel/softirq.c
===================================================================
--- linux.orig/kernel/softirq.c
+++ linux/kernel/softirq.c
@@ -62,6 +62,119 @@ static inline void wakeup_softirqd(void)
 }
 
 /*
+ * This one is for softirq.c-internal use,
+ * where hardirqs are disabled legitimately:
+ */
+static void __local_bh_disable(unsigned long ip)
+{
+	unsigned long flags;
+
+	WARN_ON_ONCE(in_irq());
+
+	raw_local_irq_save(flags);
+	add_preempt_count(SOFTIRQ_OFFSET);
+	/*
+	 * Were softirqs turned off above:
+	 */
+	if (softirq_count() == SOFTIRQ_OFFSET)
+		trace_softirqs_off(ip);
+	raw_local_irq_restore(flags);
+}
+
+void local_bh_disable(void)
+{
+	WARN_ON_ONCE(irqs_disabled());
+	__local_bh_disable((unsigned long)__builtin_return_address(0));
+}
+
+EXPORT_SYMBOL(local_bh_disable);
+
+void __local_bh_enable(void)
+{
+	WARN_ON_ONCE(in_irq());
+
+	/*
+	 * softirqs should never be enabled by __local_bh_enable(),
+	 * it always nests inside local_bh_enable() sections:
+	 */
+	WARN_ON_ONCE(softirq_count() == SOFTIRQ_OFFSET);
+
+	sub_preempt_count(SOFTIRQ_OFFSET);
+}
+
+EXPORT_SYMBOL(__local_bh_enable);
+
+/*
+ * Special-case - softirqs can safely be enabled in
+ * cond_resched_softirq(), or by __do_softirq(),
+ * without processing still-pending softirqs:
+ */
+void _local_bh_enable(void)
+{
+	WARN_ON_ONCE(in_irq());
+	WARN_ON_ONCE(!irqs_disabled());
+
+	if (softirq_count() == SOFTIRQ_OFFSET)
+		trace_softirqs_on((unsigned long)__builtin_return_address(0));
+	sub_preempt_count(SOFTIRQ_OFFSET);
+}
+
+void local_bh_enable(void)
+{
+	unsigned long flags;
+
+	WARN_ON_ONCE(in_irq());
+	WARN_ON_ONCE(irqs_disabled());
+
+	local_irq_save(flags);
+	/*
+	 * Are softirqs going to be turned on now:
+	 */
+	if (softirq_count() == SOFTIRQ_OFFSET)
+		trace_softirqs_on((unsigned long)__builtin_return_address(0));
+	/*
+	 * Keep preemption disabled until we are done with
+	 * softirq processing:
+ 	 */
+ 	sub_preempt_count(SOFTIRQ_OFFSET - 1);
+
+	if (unlikely(!in_interrupt() && local_softirq_pending()))
+		do_softirq();
+
+	dec_preempt_count();
+	local_irq_restore(flags);
+	preempt_check_resched();
+}
+EXPORT_SYMBOL(local_bh_enable);
+
+void local_bh_enable_ip(unsigned long ip)
+{
+	unsigned long flags;
+
+	WARN_ON_ONCE(in_irq());
+
+	local_irq_save(flags);
+	/*
+	 * Are softirqs going to be turned on now:
+	 */
+	if (softirq_count() == SOFTIRQ_OFFSET)
+		trace_softirqs_on(ip);
+	/*
+	 * Keep preemption disabled until we are done with
+	 * softirq processing:
+ 	 */
+ 	sub_preempt_count(SOFTIRQ_OFFSET - 1);
+
+	if (unlikely(!in_interrupt() && local_softirq_pending()))
+		do_softirq();
+
+	dec_preempt_count();
+	local_irq_restore(flags);
+	preempt_check_resched();
+}
+EXPORT_SYMBOL(local_bh_enable_ip);
+
+/*
  * We restart softirq processing MAX_SOFTIRQ_RESTART times,
  * and we fall back to softirqd after that.
  *
@@ -80,8 +193,9 @@ asmlinkage void __do_softirq(void)
 	int cpu;
 
 	pending = local_softirq_pending();
+	__local_bh_disable((unsigned long)__builtin_return_address(0));
+	trace_softirq_enter();
 
-	local_bh_disable();
 	cpu = smp_processor_id();
 restart:
 	/* Reset the pending bitmask before enabling irqs */
@@ -109,7 +223,8 @@ restart:
 	if (pending)
 		wakeup_softirqd();
 
-	__local_bh_enable();
+	trace_softirq_exit();
+	_local_bh_enable();
 }
 
 #ifndef __ARCH_HAS_DO_SOFTIRQ
@@ -136,23 +251,6 @@ EXPORT_SYMBOL(do_softirq);
 
 #endif
 
-void local_bh_enable(void)
-{
-	WARN_ON(irqs_disabled());
-	/*
-	 * Keep preemption disabled until we are done with
-	 * softirq processing:
- 	 */
- 	sub_preempt_count(SOFTIRQ_OFFSET - 1);
-
-	if (unlikely(!in_interrupt() && local_softirq_pending()))
-		do_softirq();
-
-	dec_preempt_count();
-	preempt_check_resched();
-}
-EXPORT_SYMBOL(local_bh_enable);
-
 #ifdef __ARCH_IRQ_EXIT_IRQS_DISABLED
 # define invoke_softirq()	__do_softirq()
 #else
@@ -165,6 +263,7 @@ EXPORT_SYMBOL(local_bh_enable);
 void irq_exit(void)
 {
 	account_system_vtime(current);
+	trace_hardirq_exit();
 	sub_preempt_count(IRQ_EXIT_OFFSET);
 	if (!in_interrupt() && local_softirq_pending())
 		invoke_softirq();
Index: linux/lib/locking-selftest.c
===================================================================
--- linux.orig/lib/locking-selftest.c
+++ linux/lib/locking-selftest.c
@@ -19,6 +19,7 @@
 #include <linux/kallsyms.h>
 #include <linux/interrupt.h>
 #include <linux/debug_locks.h>
+#include <linux/trace_irqflags.h>
 
 /*
  * Change this to 1 if you want to see the failure printouts:
@@ -157,9 +158,11 @@ static void init_shared_types(void)
 #define SOFTIRQ_ENTER()				\
 		local_bh_disable();		\
 		local_irq_disable();		\
+		trace_softirq_enter();		\
 		WARN_ON(!in_softirq());
 
 #define SOFTIRQ_EXIT()				\
+		trace_softirq_exit();		\
 		local_irq_enable();		\
 		local_bh_enable();
 

* [patch 19/61] lock validator: irqtrace: cleanup: include/asm-i386/irqflags.h
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (17 preceding siblings ...)
  2006-05-29 21:24 ` [patch 18/61] lock validator: irqtrace: core Ingo Molnar
@ 2006-05-29 21:24 ` Ingo Molnar
  2006-05-29 21:24 ` [patch 20/61] lock validator: irqtrace: cleanup: include/asm-x86_64/irqflags.h Ingo Molnar
                   ` (54 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

clean up the x86 irqflags.h file:

 - macro => inline function transformation
 - simplifications
 - style fixes

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 include/asm-i386/irqflags.h |   95 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 83 insertions(+), 12 deletions(-)

Index: linux/include/asm-i386/irqflags.h
===================================================================
--- linux.orig/include/asm-i386/irqflags.h
+++ linux/include/asm-i386/irqflags.h
@@ -5,24 +5,95 @@
  *
  * This file gets included from lowlevel asm headers too, to provide
  * wrapped versions of the local_irq_*() APIs, based on the
- * raw_local_irq_*() macros from the lowlevel headers.
+ * raw_local_irq_*() functions from the lowlevel headers.
  */
 #ifndef _ASM_IRQFLAGS_H
 #define _ASM_IRQFLAGS_H
 
-#define raw_local_save_flags(x)	do { typecheck(unsigned long,x); __asm__ __volatile__("pushfl ; popl %0":"=g" (x): /* no input */); } while (0)
-#define raw_local_irq_restore(x) do { typecheck(unsigned long,x); __asm__ __volatile__("pushl %0 ; popfl": /* no output */ :"g" (x):"memory", "cc"); } while (0)
-#define raw_local_irq_disable()	__asm__ __volatile__("cli": : :"memory")
-#define raw_local_irq_enable()	__asm__ __volatile__("sti": : :"memory")
-/* used in the idle loop; sti takes one instruction cycle to complete */
-#define raw_safe_halt()		__asm__ __volatile__("sti; hlt": : :"memory")
-/* used when interrupts are already enabled or to shutdown the processor */
-#define halt()			__asm__ __volatile__("hlt": : :"memory")
+#ifndef __ASSEMBLY__
 
-#define raw_irqs_disabled_flags(flags)	(!((flags) & (1<<9)))
+static inline unsigned long __raw_local_save_flags(void)
+{
+	unsigned long flags;
+
+	__asm__ __volatile__(
+		"pushfl ; popl %0"
+		: "=g" (flags)
+		: /* no input */
+	);
+
+	return flags;
+}
+
+#define raw_local_save_flags(flags) \
+		do { (flags) = __raw_local_save_flags(); } while (0)
+
+static inline void raw_local_irq_restore(unsigned long flags)
+{
+	__asm__ __volatile__(
+		"pushl %0 ; popfl"
+		: /* no output */
+		:"g" (flags)
+		:"memory", "cc"
+	);
+}
+
+static inline void raw_local_irq_disable(void)
+{
+	__asm__ __volatile__("cli" : : : "memory");
+}
+
+static inline void raw_local_irq_enable(void)
+{
+	__asm__ __volatile__("sti" : : : "memory");
+}
 
-/* For spinlocks etc */
-#define raw_local_irq_save(x)	__asm__ __volatile__("pushfl ; popl %0 ; cli":"=g" (x): /* no input */ :"memory")
+/*
+ * Used in the idle loop; sti takes one instruction cycle
+ * to complete:
+ */
+static inline void raw_safe_halt(void)
+{
+	__asm__ __volatile__("sti; hlt" : : : "memory");
+}
+
+/*
+ * Used when interrupts are already enabled or to
+ * shutdown the processor:
+ */
+static inline void halt(void)
+{
+	__asm__ __volatile__("hlt": : :"memory");
+}
+
+static inline int raw_irqs_disabled_flags(unsigned long flags)
+{
+	return !(flags & (1 << 9));
+}
+
+static inline int raw_irqs_disabled(void)
+{
+	unsigned long flags = __raw_local_save_flags();
+
+	return raw_irqs_disabled_flags(flags);
+}
+
+/*
+ * For spinlocks, etc:
+ */
+static inline unsigned long __raw_local_irq_save(void)
+{
+	unsigned long flags = __raw_local_save_flags();
+
+	raw_local_irq_disable();
+
+	return flags;
+}
+
+#define raw_local_irq_save(flags) \
+		do { (flags) = __raw_local_irq_save(); } while (0)
+
+#endif /* __ASSEMBLY__ */
 
 /*
  * Do the CPU's IRQ-state tracing from assembly code. We call a

* [patch 20/61] lock validator: irqtrace: cleanup: include/asm-x86_64/irqflags.h
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (18 preceding siblings ...)
  2006-05-29 21:24 ` [patch 19/61] lock validator: irqtrace: cleanup: include/asm-i386/irqflags.h Ingo Molnar
@ 2006-05-29 21:24 ` Ingo Molnar
  2006-05-29 21:24 ` [patch 21/61] lock validator: lockdep: add local_irq_enable_in_hardirq() API Ingo Molnar
                   ` (53 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

clean up the x86-64 irqflags.h file:

 - macro => inline function transformation
 - simplifications
 - style fixes

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 arch/x86_64/lib/thunk.S       |    5 +
 include/asm-x86_64/irqflags.h |  159 ++++++++++++++++++++++++++++++++----------
 2 files changed, 128 insertions(+), 36 deletions(-)

Index: linux/arch/x86_64/lib/thunk.S
===================================================================
--- linux.orig/arch/x86_64/lib/thunk.S
+++ linux/arch/x86_64/lib/thunk.S
@@ -47,6 +47,11 @@
 	thunk_retrax __down_failed_interruptible,__down_interruptible
 	thunk_retrax __down_failed_trylock,__down_trylock
 	thunk __up_wakeup,__up
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+	thunk trace_hardirqs_on_thunk,trace_hardirqs_on
+	thunk trace_hardirqs_off_thunk,trace_hardirqs_off
+#endif
 	
 	/* SAVE_ARGS below is used only for the .cfi directives it contains. */
 	CFI_STARTPROC
Index: linux/include/asm-x86_64/irqflags.h
===================================================================
--- linux.orig/include/asm-x86_64/irqflags.h
+++ linux/include/asm-x86_64/irqflags.h
@@ -5,50 +5,137 @@
  *
  * This file gets included from lowlevel asm headers too, to provide
  * wrapped versions of the local_irq_*() APIs, based on the
- * raw_local_irq_*() macros from the lowlevel headers.
+ * raw_local_irq_*() functions from the lowlevel headers.
  */
 #ifndef _ASM_IRQFLAGS_H
 #define _ASM_IRQFLAGS_H
 
-/* interrupt control.. */
-#define raw_local_save_flags(x)	do { warn_if_not_ulong(x); __asm__ __volatile__("# save_flags \n\t pushfq ; popq %q0":"=g" (x): /* no input */ :"memory"); } while (0)
-#define raw_local_irq_restore(x) 	__asm__ __volatile__("# restore_flags \n\t pushq %0 ; popfq": /* no output */ :"g" (x):"memory", "cc")
+#ifndef __ASSEMBLY__
+/*
+ * Interrupt control:
+ */
+
+static inline unsigned long __raw_local_save_flags(void)
+{
+	unsigned long flags;
+
+	__asm__ __volatile__(
+		"# __raw_save_flags\n\t"
+		"pushfq ; popq %q0"
+		: "=g" (flags)
+		: /* no input */
+		: "memory"
+	);
+
+	return flags;
+}
+
+#define raw_local_save_flags(flags) \
+		do { (flags) = __raw_local_save_flags(); } while (0)
+
+static inline void raw_local_irq_restore(unsigned long flags)
+{
+	__asm__ __volatile__(
+		"pushq %0 ; popfq"
+		: /* no output */
+		:"g" (flags)
+		:"memory", "cc"
+	);
+}
 
 #ifdef CONFIG_X86_VSMP
-/* Interrupt control for VSMP  architecture */
-#define raw_local_irq_disable()	do { unsigned long flags; raw_local_save_flags(flags); raw_local_irq_restore((flags & ~(1 << 9)) | (1 << 18)); } while (0)
-#define raw_local_irq_enable()	do { unsigned long flags; raw_local_save_flags(flags); raw_local_irq_restore((flags | (1 << 9)) & ~(1 << 18)); } while (0)
-
-#define raw_irqs_disabled_flags(flags)	\
-({						\
-	(flags & (1<<18)) || !(flags & (1<<9));	\
-})
-
-/* For spinlocks etc */
-#define raw_local_irq_save(x)	do { raw_local_save_flags(x); raw_local_irq_restore((x & ~(1 << 9)) | (1 << 18)); } while (0)
-#else  /* CONFIG_X86_VSMP */
-#define raw_local_irq_disable() 	__asm__ __volatile__("cli": : :"memory")
-#define raw_local_irq_enable()	__asm__ __volatile__("sti": : :"memory")
-
-#define raw_irqs_disabled_flags(flags)	\
-({						\
-	!(flags & (1<<9));			\
-})
 
-/* For spinlocks etc */
-#define raw_local_irq_save(x) 	do { warn_if_not_ulong(x); __asm__ __volatile__("# raw_local_irq_save \n\t pushfq ; popq %0 ; cli":"=g" (x): /* no input */ :"memory"); } while (0)
+/*
+ * Interrupt control for the VSMP architecture:
+ */
+
+static inline void raw_local_irq_disable(void)
+{
+	unsigned long flags = __raw_local_save_flags();
+
+	raw_local_irq_restore((flags & ~(1 << 9)) | (1 << 18));
+}
+
+static inline void raw_local_irq_enable(void)
+{
+	unsigned long flags = __raw_local_save_flags();
+
+	raw_local_irq_restore((flags | (1 << 9)) & ~(1 << 18));
+}
+
+static inline int raw_irqs_disabled_flags(unsigned long flags)
+{
+	return !(flags & (1<<9)) || (flags & (1 << 18));
+}
+
+#else /* CONFIG_X86_VSMP */
+
+static inline void raw_local_irq_disable(void)
+{
+	__asm__ __volatile__("cli" : : : "memory");
+}
+
+static inline void raw_local_irq_enable(void)
+{
+	__asm__ __volatile__("sti" : : : "memory");
+}
+
+static inline int raw_irqs_disabled_flags(unsigned long flags)
+{
+	return !(flags & (1 << 9));
+}
+
 #endif
 
-#define raw_irqs_disabled()			\
-({						\
-	unsigned long flags;			\
-	raw_local_save_flags(flags);		\
-	raw_irqs_disabled_flags(flags);		\
-})
-
-/* used in the idle loop; sti takes one instruction cycle to complete */
-#define raw_safe_halt()	__asm__ __volatile__("sti; hlt": : :"memory")
-/* used when interrupts are already enabled or to shutdown the processor */
-#define halt()			__asm__ __volatile__("hlt": : :"memory")
+/*
+ * For spinlocks, etc.:
+ */
+
+static inline unsigned long __raw_local_irq_save(void)
+{
+	unsigned long flags = __raw_local_save_flags();
+
+	raw_local_irq_disable();
+
+	return flags;
+}
+
+#define raw_local_irq_save(flags) \
+		do { (flags) = __raw_local_irq_save(); } while (0)
+
+static inline int raw_irqs_disabled(void)
+{
+	unsigned long flags = __raw_local_save_flags();
+
+	return raw_irqs_disabled_flags(flags);
+}
+
+/*
+ * Used in the idle loop; sti takes one instruction cycle
+ * to complete:
+ */
+static inline void raw_safe_halt(void)
+{
+	__asm__ __volatile__("sti; hlt" : : : "memory");
+}
+
+/*
+ * Used when interrupts are already enabled or to
+ * shutdown the processor:
+ */
+static inline void halt(void)
+{
+	__asm__ __volatile__("hlt": : :"memory");
+}
+
+#else /* __ASSEMBLY__: */
+# ifdef CONFIG_TRACE_IRQFLAGS
+#  define TRACE_IRQS_ON		call trace_hardirqs_on_thunk
+#  define TRACE_IRQS_OFF	call trace_hardirqs_off_thunk
+# else
+#  define TRACE_IRQS_ON
+#  define TRACE_IRQS_OFF
+# endif
+#endif
 
 #endif

* [patch 21/61] lock validator: lockdep: add local_irq_enable_in_hardirq() API.
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (19 preceding siblings ...)
  2006-05-29 21:24 ` [patch 20/61] lock validator: irqtrace: cleanup: include/asm-x86_64/irqflags.h Ingo Molnar
@ 2006-05-29 21:24 ` Ingo Molnar
  2006-05-30  1:34   ` Andrew Morton
  2006-05-29 21:24 ` [patch 22/61] lock validator: add per_cpu_offset() Ingo Molnar
                   ` (52 subsequent siblings)
  73 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

introduce the local_irq_enable_in_hardirq() API. It is currently
an alias for local_irq_enable(), so it has no functional effect.

This API will be used by lockdep, but even without lockdep
this will better document places in the kernel where a hardirq
context enables hardirqs.
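
[ Not part of the patch - a minimal sketch of the intended use, with a
  made-up handler name: a long-running hardirq handler that deliberately
  re-enables interrupts would now do so via the new API, making that
  fact explicit at the call site: ]

	#include <linux/interrupt.h>	/* provides local_irq_enable_in_hardirq() after this series */

	/* hypothetical slow-device interrupt handler (illustrative only): */
	static irqreturn_t example_intr(int irq, void *dev_id, struct pt_regs *regs)
	{
		/*
		 * We enter with hardirqs disabled. If the handler does lengthy
		 * PIO/polling work and does not need IRQs kept off, re-enable
		 * them explicitly via the new, self-documenting API:
		 */
		local_irq_enable_in_hardirq();

		/* ... long-running work may now run with interrupts enabled ... */

		return IRQ_HANDLED;
	}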

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 arch/i386/kernel/nmi.c         |    3 ++-
 arch/x86_64/kernel/nmi.c       |    3 ++-
 drivers/ide/ide-io.c           |    6 +++---
 drivers/ide/ide-taskfile.c     |    2 +-
 include/linux/ide.h            |    2 +-
 include/linux/trace_irqflags.h |    2 ++
 kernel/irq/handle.c            |    2 +-
 7 files changed, 12 insertions(+), 8 deletions(-)

Index: linux/arch/i386/kernel/nmi.c
===================================================================
--- linux.orig/arch/i386/kernel/nmi.c
+++ linux/arch/i386/kernel/nmi.c
@@ -188,7 +188,8 @@ static __cpuinit inline int nmi_known_cp
 static __init void nmi_cpu_busy(void *data)
 {
 	volatile int *endflag = data;
-	local_irq_enable();
+
+	local_irq_enable_in_hardirq();
 	/* Intentionally don't use cpu_relax here. This is
 	   to make sure that the performance counter really ticks,
 	   even if there is a simulator or similar that catches the
Index: linux/arch/x86_64/kernel/nmi.c
===================================================================
--- linux.orig/arch/x86_64/kernel/nmi.c
+++ linux/arch/x86_64/kernel/nmi.c
@@ -186,7 +186,8 @@ void nmi_watchdog_default(void)
 static __init void nmi_cpu_busy(void *data)
 {
 	volatile int *endflag = data;
-	local_irq_enable();
+
+	local_irq_enable_in_hardirq();
 	/* Intentionally don't use cpu_relax here. This is
 	   to make sure that the performance counter really ticks,
 	   even if there is a simulator or similar that catches the
Index: linux/drivers/ide/ide-io.c
===================================================================
--- linux.orig/drivers/ide/ide-io.c
+++ linux/drivers/ide/ide-io.c
@@ -689,7 +689,7 @@ static ide_startstop_t drive_cmd_intr (i
 	u8 stat = hwif->INB(IDE_STATUS_REG);
 	int retries = 10;
 
-	local_irq_enable();
+	local_irq_enable_in_hardirq();
 	if ((stat & DRQ_STAT) && args && args[3]) {
 		u8 io_32bit = drive->io_32bit;
 		drive->io_32bit = 0;
@@ -1273,7 +1273,7 @@ static void ide_do_request (ide_hwgroup_
 		if (masked_irq != IDE_NO_IRQ && hwif->irq != masked_irq)
 			disable_irq_nosync(hwif->irq);
 		spin_unlock(&ide_lock);
-		local_irq_enable();
+		local_irq_enable_in_hardirq();
 			/* allow other IRQs while we start this request */
 		startstop = start_request(drive, rq);
 		spin_lock_irq(&ide_lock);
@@ -1622,7 +1622,7 @@ irqreturn_t ide_intr (int irq, void *dev
 	spin_unlock(&ide_lock);
 
 	if (drive->unmask)
-		local_irq_enable();
+		local_irq_enable_in_hardirq();
 	/* service this interrupt, may set handler for next interrupt */
 	startstop = handler(drive);
 	spin_lock_irq(&ide_lock);
Index: linux/drivers/ide/ide-taskfile.c
===================================================================
--- linux.orig/drivers/ide/ide-taskfile.c
+++ linux/drivers/ide/ide-taskfile.c
@@ -223,7 +223,7 @@ ide_startstop_t task_no_data_intr (ide_d
 	ide_hwif_t *hwif	= HWIF(drive);
 	u8 stat;
 
-	local_irq_enable();
+	local_irq_enable_in_hardirq();
 	if (!OK_STAT(stat = hwif->INB(IDE_STATUS_REG),READY_STAT,BAD_STAT)) {
 		return ide_error(drive, "task_no_data_intr", stat);
 		/* calls ide_end_drive_cmd */
Index: linux/include/linux/ide.h
===================================================================
--- linux.orig/include/linux/ide.h
+++ linux/include/linux/ide.h
@@ -1361,7 +1361,7 @@ extern struct semaphore ide_cfg_sem;
  * ide_drive_t->hwif: constant, no locking
  */
 
-#define local_irq_set(flags)	do { local_save_flags((flags)); local_irq_enable(); } while (0)
+#define local_irq_set(flags)	do { local_save_flags((flags)); local_irq_enable_in_hardirq(); } while (0)
 
 extern struct bus_type ide_bus_type;
 
Index: linux/include/linux/trace_irqflags.h
===================================================================
--- linux.orig/include/linux/trace_irqflags.h
+++ linux/include/linux/trace_irqflags.h
@@ -66,6 +66,8 @@
 		}						\
 	} while (0)
 
+#define local_irq_enable_in_hardirq()	local_irq_enable()
+
 #define safe_halt()						\
 	do {							\
 		trace_hardirqs_on();				\
Index: linux/kernel/irq/handle.c
===================================================================
--- linux.orig/kernel/irq/handle.c
+++ linux/kernel/irq/handle.c
@@ -83,7 +83,7 @@ fastcall irqreturn_t handle_IRQ_event(un
 	unsigned int status = 0;
 
 	if (!(action->flags & SA_INTERRUPT))
-		local_irq_enable();
+		local_irq_enable_in_hardirq();
 
 	do {
 		ret = action->handler(irq, action->dev_id, regs);

* [patch 22/61] lock validator:  add per_cpu_offset()
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (20 preceding siblings ...)
  2006-05-29 21:24 ` [patch 21/61] lock validator: lockdep: add local_irq_enable_in_hardirq() API Ingo Molnar
@ 2006-05-29 21:24 ` Ingo Molnar
  2006-05-30  1:34   ` Andrew Morton
  2006-05-29 21:25 ` [patch 23/61] lock validator: core Ingo Molnar
                   ` (51 subsequent siblings)
  73 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

add a generic per_cpu_offset() macro (used by the lock validator).
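
[ Not part of the patch - an illustrative sketch of the kind of use the
  validator has for this: testing whether an address falls inside some
  CPU's per-CPU data area. The helper name is made up; the validator's
  real code arrives in the later "core" patch: ]

	#include <linux/percpu.h>	/* per_cpu_offset() */
	#include <linux/cpumask.h>	/* for_each_possible_cpu() */
	#include <asm/sections.h>	/* __per_cpu_start, __per_cpu_end */

	/* illustrative only - does this address live in per-CPU data? */
	static int example_addr_is_percpu(const void *obj)
	{
		unsigned long addr = (unsigned long)obj;
		unsigned long start, end;
		int cpu;

		for_each_possible_cpu(cpu) {
			start = (unsigned long)__per_cpu_start + per_cpu_offset(cpu);
			end   = (unsigned long)__per_cpu_end   + per_cpu_offset(cpu);
			if (addr >= start && addr < end)
				return 1;
		}
		return 0;
	}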

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 include/asm-generic/percpu.h |    2 ++
 include/asm-x86_64/percpu.h  |    2 ++
 2 files changed, 4 insertions(+)

Index: linux/include/asm-generic/percpu.h
===================================================================
--- linux.orig/include/asm-generic/percpu.h
+++ linux/include/asm-generic/percpu.h
@@ -7,6 +7,8 @@
 
 extern unsigned long __per_cpu_offset[NR_CPUS];
 
+#define per_cpu_offset(x) (__per_cpu_offset[x])
+
 /* Separate out the type, so (int[3], foo) works. */
 #define DEFINE_PER_CPU(type, name) \
     __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name
Index: linux/include/asm-x86_64/percpu.h
===================================================================
--- linux.orig/include/asm-x86_64/percpu.h
+++ linux/include/asm-x86_64/percpu.h
@@ -14,6 +14,8 @@
 #define __per_cpu_offset(cpu) (cpu_pda(cpu)->data_offset)
 #define __my_cpu_offset() read_pda(data_offset)
 
+#define per_cpu_offset(x) (__per_cpu_offset(x))
+
 /* Separate out the type, so (int[3], foo) works. */
 #define DEFINE_PER_CPU(type, name) \
     __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name

* [patch 23/61] lock validator: core
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (21 preceding siblings ...)
  2006-05-29 21:24 ` [patch 22/61] lock validator: add per_cpu_offset() Ingo Molnar
@ 2006-05-29 21:25 ` Ingo Molnar
  2006-05-29 21:25 ` [patch 24/61] lock validator: procfs Ingo Molnar
                   ` (50 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

lock validator core changes. Not enabled yet.
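
[ Not part of the patch - to give a feel for how the acquire/release
  hooks defined in <linux/lockdep.h> below are meant to be wired up, a
  sketch of a spinlock wrapper feeding events to the validator. The
  dep_map field and the wrapper names are assumptions here; the actual
  per-lock mapping is added by the later locking-API patches in this
  series: ]

	#include <linux/spinlock.h>
	#include <linux/lockdep.h>

	/* illustrative only - assumes spinlock_t embeds a struct lockdep_map: */
	static inline void example_spin_lock(spinlock_t *lock)
	{
		preempt_disable();
		/* subtype 0, not a trylock; record the caller's IP: */
		spin_acquire(&lock->dep_map, 0, 0,
			     (unsigned long)__builtin_return_address(0));
		_raw_spin_lock(lock);
	}

	static inline void example_spin_unlock(spinlock_t *lock)
	{
		/* nested == 1: a normally nested release */
		spin_release(&lock->dep_map, 1,
			     (unsigned long)__builtin_return_address(0));
		_raw_spin_unlock(lock);
		preempt_enable();
	}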

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 include/linux/init_task.h      |    1 
 include/linux/lockdep.h        |  280 ++++
 include/linux/sched.h          |   12 
 include/linux/trace_irqflags.h |   13 
 init/main.c                    |   16 
 kernel/Makefile                |    1 
 kernel/fork.c                  |    5 
 kernel/irq/manage.c            |    6 
 kernel/lockdep.c               | 2633 +++++++++++++++++++++++++++++++++++++++++
 kernel/lockdep_internals.h     |   93 +
 kernel/module.c                |    3 
 lib/Kconfig.debug              |    2 
 lib/locking-selftest.c         |    4 
 13 files changed, 3064 insertions(+), 5 deletions(-)

Index: linux/include/linux/init_task.h
===================================================================
--- linux.orig/include/linux/init_task.h
+++ linux/include/linux/init_task.h
@@ -134,6 +134,7 @@ extern struct group_info init_groups;
 	.cpu_timers	= INIT_CPU_TIMERS(tsk.cpu_timers),		\
 	.fs_excl	= ATOMIC_INIT(0),				\
  	INIT_TRACE_IRQFLAGS						\
+ 	INIT_LOCKDEP							\
 }
 
 
Index: linux/include/linux/lockdep.h
===================================================================
--- /dev/null
+++ linux/include/linux/lockdep.h
@@ -0,0 +1,280 @@
+/*
+ * Runtime locking correctness validator
+ *
+ *  Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
+ *
+ * see Documentation/lockdep-design.txt for more details.
+ */
+#ifndef __LINUX_LOCKDEP_H
+#define __LINUX_LOCKDEP_H
+
+#include <linux/linkage.h>
+#include <linux/list.h>
+#include <linux/debug_locks.h>
+#include <linux/stacktrace.h>
+
+#ifdef CONFIG_LOCKDEP
+
+/*
+ * Lock-type usage-state bits:
+ */
+enum lock_usage_bit
+{
+	LOCK_USED = 0,
+	LOCK_USED_IN_HARDIRQ,
+	LOCK_USED_IN_SOFTIRQ,
+	LOCK_ENABLED_SOFTIRQS,
+	LOCK_ENABLED_HARDIRQS,
+	LOCK_USED_IN_HARDIRQ_READ,
+	LOCK_USED_IN_SOFTIRQ_READ,
+	LOCK_ENABLED_SOFTIRQS_READ,
+	LOCK_ENABLED_HARDIRQS_READ,
+	LOCK_USAGE_STATES
+};
+
+/*
+ * Usage-state bitmasks:
+ */
+#define LOCKF_USED			(1 << LOCK_USED)
+#define LOCKF_USED_IN_HARDIRQ		(1 << LOCK_USED_IN_HARDIRQ)
+#define LOCKF_USED_IN_SOFTIRQ		(1 << LOCK_USED_IN_SOFTIRQ)
+#define LOCKF_ENABLED_HARDIRQS		(1 << LOCK_ENABLED_HARDIRQS)
+#define LOCKF_ENABLED_SOFTIRQS		(1 << LOCK_ENABLED_SOFTIRQS)
+
+#define LOCKF_ENABLED_IRQS (LOCKF_ENABLED_HARDIRQS | LOCKF_ENABLED_SOFTIRQS)
+#define LOCKF_USED_IN_IRQ (LOCKF_USED_IN_HARDIRQ | LOCKF_USED_IN_SOFTIRQ)
+
+#define LOCKF_USED_IN_HARDIRQ_READ	(1 << LOCK_USED_IN_HARDIRQ_READ)
+#define LOCKF_USED_IN_SOFTIRQ_READ	(1 << LOCK_USED_IN_SOFTIRQ_READ)
+#define LOCKF_ENABLED_HARDIRQS_READ	(1 << LOCK_ENABLED_HARDIRQS_READ)
+#define LOCKF_ENABLED_SOFTIRQS_READ	(1 << LOCK_ENABLED_SOFTIRQS_READ)
+
+#define LOCKF_ENABLED_IRQS_READ \
+		(LOCKF_ENABLED_HARDIRQS_READ | LOCKF_ENABLED_SOFTIRQS_READ)
+#define LOCKF_USED_IN_IRQ_READ \
+		(LOCKF_USED_IN_HARDIRQ_READ | LOCKF_USED_IN_SOFTIRQ_READ)
+
+#define MAX_LOCKDEP_SUBTYPES		8UL
+
+/*
+ * Lock-types are keyed via unique addresses, by embedding the
+ * locktype-key into the kernel (or module) .data section. (For
+ * static locks we use the lock address itself as the key.)
+ */
+struct lockdep_subtype_key {
+	char __one_byte;
+} __attribute__ ((__packed__));
+
+struct lockdep_type_key {
+	struct lockdep_subtype_key	subkeys[MAX_LOCKDEP_SUBTYPES];
+};
+
+/*
+ * The lock-type itself:
+ */
+struct lock_type {
+	/*
+	 * type-hash:
+	 */
+	struct list_head		hash_entry;
+
+	/*
+	 * global list of all lock-types:
+	 */
+	struct list_head		lock_entry;
+
+	struct lockdep_subtype_key	*key;
+	unsigned int			subtype;
+
+	/*
+	 * IRQ/softirq usage tracking bits:
+	 */
+	unsigned long			usage_mask;
+	struct stack_trace		usage_traces[LOCK_USAGE_STATES];
+
+	/*
+	 * These fields represent a directed graph of lock dependencies,
+	 * to every node we attach a list of "forward" and a list of
+	 * "backward" graph nodes.
+	 */
+	struct list_head		locks_after, locks_before;
+
+	/*
+	 * Generation counter, when doing certain types of graph walking,
+	 * to ensure that we check one node only once:
+	 */
+	unsigned int			version;
+
+	/*
+	 * Statistics counter:
+	 */
+	unsigned long			ops;
+
+	const char			*name;
+	int				name_version;
+};
+
+/*
+ * Map the lock object (the lock instance) to the lock-type object.
+ * This is embedded into specific lock instances:
+ */
+struct lockdep_map {
+	struct lockdep_type_key		*key;
+	struct lock_type		*type[MAX_LOCKDEP_SUBTYPES];
+	const char			*name;
+};
+
+/*
+ * Every lock has a list of other locks that were taken after it.
+ * We only grow the list, never remove from it:
+ */
+struct lock_list {
+	struct list_head		entry;
+	struct lock_type		*type;
+	struct stack_trace		trace;
+};
+
+/*
+ * We record lock dependency chains, so that we can cache them:
+ */
+struct lock_chain {
+	struct list_head		entry;
+	u64				chain_key;
+};
+
+struct held_lock {
+	/*
+	 * One-way hash of the dependency chain up to this point. We
+	 * hash the hashes step by step as the dependency chain grows.
+	 *
+	 * We use it for dependency-caching and we skip detection
+	 * passes and dependency-updates if there is a cache-hit, so
+	 * it is absolutely critical for 100% coverage of the validator
+	 * to have a unique key value for every unique dependency path
+	 * that can occur in the system, to make a unique hash value
+	 * as likely as possible - hence the 64-bit width.
+	 *
+	 * The task struct holds the current hash value (initialized
+	 * with zero), here we store the previous hash value:
+	 */
+	u64				prev_chain_key;
+	struct lock_type		*type;
+	unsigned long			acquire_ip;
+	struct lockdep_map		*instance;
+
+	/*
+	 * The lock-stack is unified in that the lock chains of interrupt
+	 * contexts nest ontop of process context chains, but we 'separate'
+	 * the hashes by starting with 0 if we cross into an interrupt
+	 * context, and we also keep do not add cross-context lock
+	 * dependencies - the lock usage graph walking covers that area
+	 * anyway, and we'd just unnecessarily increase the number of
+	 * dependencies otherwise. [Note: hardirq and softirq contexts
+	 * are separated from each other too.]
+	 *
+	 * The following field is used to detect when we cross into an
+	 * interrupt context:
+	 */
+	int				irq_context;
+	int				trylock;
+	int				read;
+	int				hardirqs_off;
+};
+
+/*
+ * Initialization, self-test and debugging-output methods:
+ */
+extern void lockdep_init(void);
+extern void lockdep_info(void);
+extern void lockdep_reset(void);
+extern void lockdep_reset_lock(struct lockdep_map *lock);
+extern void lockdep_free_key_range(void *start, unsigned long size);
+
+extern void print_lock_types(void);
+extern void lockdep_print_held_locks(struct task_struct *task);
+
+/*
+ * These methods are used by specific locking variants (spinlocks,
+ * rwlocks, mutexes and rwsems) to pass init/acquire/release events
+ * to lockdep:
+ */
+
+extern void lockdep_init_map(struct lockdep_map *lock, const char *name,
+			     struct lockdep_type_key *key);
+
+extern void lockdep_acquire(struct lockdep_map *lock, unsigned int subtype,
+			    int trylock, int read, unsigned long ip);
+
+extern void lockdep_release(struct lockdep_map *lock, int nested,
+			    unsigned long ip);
+
+# define INIT_LOCKDEP				.lockdep_recursion = 0,
+
+extern void early_boot_irqs_off(void);
+extern void early_boot_irqs_on(void);
+
+#else /* LOCKDEP */
+# define lockdep_init()				do { } while (0)
+# define lockdep_info()				do { } while (0)
+# define print_lock_types()			do { } while (0)
+# define lockdep_print_held_locks(task)		do { (void)(task); } while (0)
+# define lockdep_init_map(lock, name, key)	do { } while (0)
+# define INIT_LOCKDEP
+# define lockdep_reset()		do { debug_locks = 1; } while (0)
+# define lockdep_free_key_range(start, size)	do { } while (0)
+# define early_boot_irqs_off()			do { } while (0)
+# define early_boot_irqs_on()			do { } while (0)
+/*
+ * The type key takes no space if lockdep is disabled:
+ */
+struct lockdep_type_key { };
+#endif /* !LOCKDEP */
+
+/*
+ * For trivial one-depth nesting of a lock-type, the following
+ * global define can be used. (Subsystems with multiple levels
+ * of nesting should define their own lock-nesting subtypes.)
+ */
+#define SINGLE_DEPTH_NESTING			1
+
+/*
+ * Map the dependency ops to NOP or to real lockdep ops, depending
+ * on the per lock-type debug mode:
+ */
+#ifdef CONFIG_PROVE_SPIN_LOCKING
+# define spin_acquire(l, s, t, i)		lockdep_acquire(l, s, t, 0, i)
+# define spin_release(l, n, i)			lockdep_release(l, n, i)
+#else
+# define spin_acquire(l, s, t, i)		do { } while (0)
+# define spin_release(l, n, i)			do { } while (0)
+#endif
+
+#ifdef CONFIG_PROVE_RW_LOCKING
+# define rwlock_acquire(l, s, t, i)		lockdep_acquire(l, s, t, 0, i)
+# define rwlock_acquire_read(l, s, t, i)	lockdep_acquire(l, s, t, 1, i)
+# define rwlock_release(l, n, i)		lockdep_release(l, n, i)
+#else
+# define rwlock_acquire(l, s, t, i)		do { } while (0)
+# define rwlock_acquire_read(l, s, t, i)	do { } while (0)
+# define rwlock_release(l, n, i)		do { } while (0)
+#endif
+
+#ifdef CONFIG_PROVE_MUTEX_LOCKING
+# define mutex_acquire(l, s, t, i)		lockdep_acquire(l, s, t, 0, i)
+# define mutex_release(l, n, i)			lockdep_release(l, n, i)
+#else
+# define mutex_acquire(l, s, t, i)		do { } while (0)
+# define mutex_release(l, n, i)			do { } while (0)
+#endif
+
+#ifdef CONFIG_PROVE_RWSEM_LOCKING
+# define rwsem_acquire(l, s, t, i)		lockdep_acquire(l, s, t, 0, i)
+# define rwsem_acquire_read(l, s, t, i)		lockdep_acquire(l, s, t, -1, i)
+# define rwsem_release(l, n, i)			lockdep_release(l, n, i)
+#else
+# define rwsem_acquire(l, s, t, i)		do { } while (0)
+# define rwsem_acquire_read(l, s, t, i)		do { } while (0)
+# define rwsem_release(l, n, i)			do { } while (0)
+#endif
+
+#endif /* __LINUX_LOCKDEP_H */
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -931,6 +931,13 @@ struct task_struct {
 	int hardirq_context;
 	int softirq_context;
 #endif
+#ifdef CONFIG_LOCKDEP
+# define MAX_LOCK_DEPTH 30UL
+	u64 curr_chain_key;
+	int lockdep_depth;
+	struct held_lock held_locks[MAX_LOCK_DEPTH];
+#endif
+	unsigned int lockdep_recursion;
 
 /* journalling filesystem info */
 	void *journal_info;
@@ -1350,6 +1357,11 @@ static inline void task_lock(struct task
 	spin_lock(&p->alloc_lock);
 }
 
+static inline void task_lock_free(struct task_struct *p)
+{
+	spin_lock_nested(&p->alloc_lock, SINGLE_DEPTH_NESTING);
+}
+
 static inline void task_unlock(struct task_struct *p)
 {
 	spin_unlock(&p->alloc_lock);
Index: linux/include/linux/trace_irqflags.h
===================================================================
--- linux.orig/include/linux/trace_irqflags.h
+++ linux/include/linux/trace_irqflags.h
@@ -66,7 +66,18 @@
 		}						\
 	} while (0)
 
-#define local_irq_enable_in_hardirq()	local_irq_enable()
+/*
+ * On lockdep we dont want to enable hardirqs in hardirq
+ * context. NOTE: in theory this might break fragile code
+ * that relies on hardirq delivery - in practice we dont
+ * seem to have such places left. So the only effect should
+ * be slightly increased irqs-off latencies.
+ */
+#ifdef CONFIG_LOCKDEP
+# define local_irq_enable_in_hardirq()	do { } while (0)
+#else
+# define local_irq_enable_in_hardirq()	local_irq_enable()
+#endif
 
 #define safe_halt()						\
 	do {							\
Index: linux/init/main.c
===================================================================
--- linux.orig/init/main.c
+++ linux/init/main.c
@@ -54,6 +54,7 @@
 #include <linux/root_dev.h>
 #include <linux/buffer_head.h>
 #include <linux/debug_locks.h>
+#include <linux/lockdep.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -80,6 +81,7 @@
 
 static int init(void *);
 
+extern void early_init_irq_lock_type(void);
 extern void init_IRQ(void);
 extern void fork_init(unsigned long);
 extern void mca_init(void);
@@ -461,6 +463,17 @@ asmlinkage void __init start_kernel(void
 {
 	char * command_line;
 	extern struct kernel_param __start___param[], __stop___param[];
+
+	/*
+	 * Need to run as early as possible, to initialize the
+	 * lockdep hash:
+	 */
+	lockdep_init();
+
+	local_irq_disable();
+	early_boot_irqs_off();
+	early_init_irq_lock_type();
+
 /*
  * Interrupts are still disabled. Do necessary setups, then
  * enable them
@@ -512,8 +525,11 @@ asmlinkage void __init start_kernel(void
 	if (panic_later)
 		panic(panic_later, panic_param);
 	profile_init();
+	early_boot_irqs_on();
 	local_irq_enable();
 
+	lockdep_info();
+
 	/*
 	 * Need to run this when irqs are enabled, because it wants
 	 * to self-test [hard/soft]-irqs on/off lock inversion bugs
Index: linux/kernel/Makefile
===================================================================
--- linux.orig/kernel/Makefile
+++ linux/kernel/Makefile
@@ -12,6 +12,7 @@ obj-y     = sched.o fork.o exec_domain.o
 
 obj-y += time/
 obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o
+obj-$(CONFIG_LOCKDEP) += lockdep.o
 obj-$(CONFIG_FUTEX) += futex.o
 ifeq ($(CONFIG_COMPAT),y)
 obj-$(CONFIG_FUTEX) += futex_compat.o
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c
+++ linux/kernel/fork.c
@@ -1049,6 +1049,11 @@ static task_t *copy_process(unsigned lon
  	}
 	mpol_fix_fork_child_flag(p);
 #endif
+#ifdef CONFIG_LOCKDEP
+	p->lockdep_depth = 0; /* no locks held yet */
+	p->curr_chain_key = 0;
+	p->lockdep_recursion = 0;
+#endif
 
 	rt_mutex_init_task(p);
 
Index: linux/kernel/irq/manage.c
===================================================================
--- linux.orig/kernel/irq/manage.c
+++ linux/kernel/irq/manage.c
@@ -406,6 +406,12 @@ int request_irq(unsigned int irq,
 		   immediately, so let's make sure....
 		   We do this before actually registering it, to make sure that a 'real'
 		   IRQ doesn't run in parallel with our fake. */
+#ifdef CONFIG_LOCKDEP
+		/*
+		 * Lockdep wants atomic interrupt handlers:
+		 */
+		irqflags |= SA_INTERRUPT;
+#endif
 		if (irqflags & SA_INTERRUPT) {
 			unsigned long flags;
 
Index: linux/kernel/lockdep.c
===================================================================
--- /dev/null
+++ linux/kernel/lockdep.c
@@ -0,0 +1,2633 @@
+/*
+ * kernel/lockdep.c
+ *
+ * Runtime locking correctness validator
+ *
+ * Started by Ingo Molnar:
+ *
+ *  Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
+ *
+ * this code maps all the lock dependencies as they occur in a live kernel
+ * and will warn about the following types of locking bugs:
+ *
+ * - lock inversion scenarios
+ * - circular lock dependencies
+ * - hardirq/softirq safe/unsafe locking bugs
+ *
+ * Bugs are reported even if the current locking scenario does not cause
+ * any deadlock at this point.
+ *
+ * I.e. if anytime in the past two locks were taken in a different order,
+ * even if it happened for another task, even if those were different
+ * locks (but of the same type as this lock), this code will detect it.
+ *
+ * Thanks to Arjan van de Ven for coming up with the initial idea of
+ * mapping lock dependencies runtime.
+ */
+#include <linux/mutex.h>
+#include <linux/sched.h>
+#include <linux/delay.h>
+#include <linux/module.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/spinlock.h>
+#include <linux/kallsyms.h>
+#include <linux/interrupt.h>
+#include <linux/stacktrace.h>
+#include <linux/debug_locks.h>
+#include <linux/trace_irqflags.h>
+
+#include <asm/sections.h>
+
+#include "lockdep_internals.h"
+
+/*
+ * hash_lock: protects the lockdep hashes and type/list/hash allocators.
+ *
+ * This is one of the rare exceptions where it's justified
+ * to use a raw spinlock - we really dont want the spinlock
+ * code to recurse back into the lockdep code.
+ */
+static raw_spinlock_t hash_lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
+
+static int lockdep_initialized;
+
+unsigned long nr_list_entries;
+static struct lock_list list_entries[MAX_LOCKDEP_ENTRIES];
+
+/*
+ * Allocate a lockdep entry. (assumes hash_lock held, returns
+ * with NULL on failure)
+ */
+static struct lock_list *alloc_list_entry(void)
+{
+	if (nr_list_entries >= MAX_LOCKDEP_ENTRIES) {
+		__raw_spin_unlock(&hash_lock);
+		debug_locks_off();
+		printk("BUG: MAX_LOCKDEP_ENTRIES too low!\n");
+		printk("turning off the locking correctness validator.\n");
+		return NULL;
+	}
+	return list_entries + nr_list_entries++;
+}
+
+/*
+ * All data structures here are protected by the global debug_lock.
+ *
+ * Mutex key structs only get allocated, once during bootup, and never
+ * get freed - this significantly simplifies the debugging code.
+ */
+unsigned long nr_lock_types;
+static struct lock_type lock_types[MAX_LOCKDEP_KEYS];
+
+/*
+ * We keep a global list of all lock types. The list only grows,
+ * never shrinks. The list is only accessed with the lockdep
+ * spinlock lock held.
+ */
+LIST_HEAD(all_lock_types);
+
+/*
+ * The lockdep types are in a hash-table as well, for fast lookup:
+ */
+#define TYPEHASH_BITS		(MAX_LOCKDEP_KEYS_BITS - 1)
+#define TYPEHASH_SIZE		(1UL << TYPEHASH_BITS)
+#define TYPEHASH_MASK		(TYPEHASH_SIZE - 1)
+#define __typehashfn(key)	((((unsigned long)key >> TYPEHASH_BITS) + (unsigned long)key) & TYPEHASH_MASK)
+#define typehashentry(key)	(typehash_table + __typehashfn((key)))
+
+static struct list_head typehash_table[TYPEHASH_SIZE];
+
+unsigned long nr_lock_chains;
+static struct lock_chain lock_chains[MAX_LOCKDEP_CHAINS];
+
+/*
+ * We put the lock dependency chains into a hash-table as well, to cache
+ * their existence:
+ */
+#define CHAINHASH_BITS		(MAX_LOCKDEP_CHAINS_BITS-1)
+#define CHAINHASH_SIZE		(1UL << CHAINHASH_BITS)
+#define CHAINHASH_MASK		(CHAINHASH_SIZE - 1)
+#define __chainhashfn(chain) \
+		(((chain >> CHAINHASH_BITS) + chain) & CHAINHASH_MASK)
+#define chainhashentry(chain)	(chainhash_table + __chainhashfn((chain)))
+
+static struct list_head chainhash_table[CHAINHASH_SIZE];
+
+/*
+ * The hash key of the lock dependency chains is a hash itself too:
+ * it's a hash of all locks taken up to that lock, including that lock.
+ * It's a 64-bit hash, because it's important for the keys to be
+ * unique.
+ */
+#define iterate_chain_key(key1, key2) \
+	(((key1) << MAX_LOCKDEP_KEYS_BITS/2) ^ \
+	((key1) >> (64-MAX_LOCKDEP_KEYS_BITS/2)) ^ \
+	(key2))
+
+/*
+ * Debugging switches:
+ */
+#define LOCKDEP_OFF		0
+
+#define VERBOSE			0
+
+#if VERBOSE
+# define HARDIRQ_VERBOSE	1
+# define SOFTIRQ_VERBOSE	1
+#else
+# define HARDIRQ_VERBOSE	0
+# define SOFTIRQ_VERBOSE	0
+#endif
+
+#if VERBOSE || HARDIRQ_VERBOSE || SOFTIRQ_VERBOSE
+/*
+ * Quick filtering for interesting events:
+ */
+static int type_filter(struct lock_type *type)
+{
+	if (type->name_version == 2 &&
+			!strcmp(type->name, "xfrm_state_afinfo_lock"))
+		return 1;
+	if ((type->name_version == 2 || type->name_version == 4) &&
+			!strcmp(type->name, "&mc->mca_lock"))
+		return 1;
+	return 0;
+}
+#endif
+
+static int verbose(struct lock_type *type)
+{
+#if VERBOSE
+	return type_filter(type);
+#endif
+	return 0;
+}
+
+static int hardirq_verbose(struct lock_type *type)
+{
+#if HARDIRQ_VERBOSE
+	return type_filter(type);
+#endif
+	return 0;
+}
+
+static int softirq_verbose(struct lock_type *type)
+{
+#if SOFTIRQ_VERBOSE
+	return type_filter(type);
+#endif
+	return 0;
+}
+
+/*
+ * Stack-trace: tightly packed array of stack backtrace
+ * addresses. Protected by the hash_lock.
+ */
+unsigned long nr_stack_trace_entries;
+static unsigned long stack_trace[MAX_STACK_TRACE_ENTRIES];
+
+static int save_trace(struct stack_trace *trace)
+{
+	trace->nr_entries = 0;
+	trace->max_entries = MAX_STACK_TRACE_ENTRIES - nr_stack_trace_entries;
+	trace->entries = stack_trace + nr_stack_trace_entries;
+
+	save_stack_trace(trace, NULL, 0, 3);
+
+	trace->max_entries = trace->nr_entries;
+
+	nr_stack_trace_entries += trace->nr_entries;
+	if (DEBUG_WARN_ON(nr_stack_trace_entries > MAX_STACK_TRACE_ENTRIES))
+		return 0;
+
+	if (nr_stack_trace_entries == MAX_STACK_TRACE_ENTRIES) {
+		__raw_spin_unlock(&hash_lock);
+		if (debug_locks_off()) {
+			printk("BUG: MAX_STACK_TRACE_ENTRIES too low!\n");
+			printk("turning off the locking correctness validator.\n");
+			dump_stack();
+		}
+		return 0;
+	}
+
+	return 1;
+}
+
+unsigned int nr_hardirq_chains;
+unsigned int nr_softirq_chains;
+unsigned int nr_process_chains;
+unsigned int max_lockdep_depth;
+unsigned int max_recursion_depth;
+
+#ifdef CONFIG_DEBUG_LOCKDEP
+/*
+ * We cannot printk in early bootup code. Not even early_printk()
+ * might work. So we mark any initialization errors and printk
+ * about it later on, in lockdep_info().
+ */
+int lockdep_init_error;
+
+/*
+ * Various lockdep statistics:
+ */
+atomic_t chain_lookup_hits;
+atomic_t chain_lookup_misses;
+atomic_t hardirqs_on_events;
+atomic_t hardirqs_off_events;
+atomic_t redundant_hardirqs_on;
+atomic_t redundant_hardirqs_off;
+atomic_t softirqs_on_events;
+atomic_t softirqs_off_events;
+atomic_t redundant_softirqs_on;
+atomic_t redundant_softirqs_off;
+atomic_t nr_unused_locks;
+atomic_t nr_hardirq_safe_locks;
+atomic_t nr_softirq_safe_locks;
+atomic_t nr_hardirq_unsafe_locks;
+atomic_t nr_softirq_unsafe_locks;
+atomic_t nr_hardirq_read_safe_locks;
+atomic_t nr_softirq_read_safe_locks;
+atomic_t nr_hardirq_read_unsafe_locks;
+atomic_t nr_softirq_read_unsafe_locks;
+atomic_t nr_cyclic_checks;
+atomic_t nr_cyclic_check_recursions;
+atomic_t nr_find_usage_forwards_checks;
+atomic_t nr_find_usage_forwards_recursions;
+atomic_t nr_find_usage_backwards_checks;
+atomic_t nr_find_usage_backwards_recursions;
+# define debug_atomic_inc(ptr)		atomic_inc(ptr)
+# define debug_atomic_dec(ptr)		atomic_dec(ptr)
+# define debug_atomic_read(ptr)		atomic_read(ptr)
+#else
+# define debug_atomic_inc(ptr)		do { } while (0)
+# define debug_atomic_dec(ptr)		do { } while (0)
+# define debug_atomic_read(ptr)		0
+#endif
+
+/*
+ * Locking printouts:
+ */
+
+static const char *usage_str[] =
+{
+	[LOCK_USED] =			"initial-use ",
+	[LOCK_USED_IN_HARDIRQ] =	"in-hardirq-W",
+	[LOCK_USED_IN_SOFTIRQ] =	"in-softirq-W",
+	[LOCK_ENABLED_SOFTIRQS] =	"softirq-on-W",
+	[LOCK_ENABLED_HARDIRQS] =	"hardirq-on-W",
+	[LOCK_USED_IN_HARDIRQ_READ] =	"in-hardirq-R",
+	[LOCK_USED_IN_SOFTIRQ_READ] =	"in-softirq-R",
+	[LOCK_ENABLED_SOFTIRQS_READ] =	"softirq-on-R",
+	[LOCK_ENABLED_HARDIRQS_READ] =	"hardirq-on-R",
+};
+
+static void printk_sym(unsigned long ip)
+{
+	printk(" [<%08lx>]", ip);
+	print_symbol(" %s\n", ip);
+}
+
+const char * __get_key_name(struct lockdep_subtype_key *key, char *str)
+{
+	unsigned long offs, size;
+	char *modname;
+
+	return kallsyms_lookup((unsigned long)key, &size, &offs, &modname, str);
+}
+
+void
+get_usage_chars(struct lock_type *type, char *c1, char *c2, char *c3, char *c4)
+{
+	*c1 = '.', *c2 = '.', *c3 = '.', *c4 = '.';
+
+	if (type->usage_mask & LOCKF_USED_IN_HARDIRQ)
+		*c1 = '+';
+	else
+		if (type->usage_mask & LOCKF_ENABLED_HARDIRQS)
+			*c1 = '-';
+
+	if (type->usage_mask & LOCKF_USED_IN_SOFTIRQ)
+		*c2 = '+';
+	else
+		if (type->usage_mask & LOCKF_ENABLED_SOFTIRQS)
+			*c2 = '-';
+
+	if (type->usage_mask & LOCKF_ENABLED_HARDIRQS_READ)
+		*c3 = '-';
+	if (type->usage_mask & LOCKF_USED_IN_HARDIRQ_READ) {
+		*c3 = '+';
+		if (type->usage_mask & LOCKF_ENABLED_HARDIRQS_READ)
+			*c3 = '?';
+	}
+
+	if (type->usage_mask & LOCKF_ENABLED_SOFTIRQS_READ)
+		*c4 = '-';
+	if (type->usage_mask & LOCKF_USED_IN_SOFTIRQ_READ) {
+		*c4 = '+';
+		if (type->usage_mask & LOCKF_ENABLED_SOFTIRQS_READ)
+			*c4 = '?';
+	}
+}
+
+static void print_lock_name(struct lock_type *type)
+{
+	char str[128], c1, c2, c3, c4;
+	const char *name;
+
+	get_usage_chars(type, &c1, &c2, &c3, &c4);
+
+	name = type->name;
+	if (!name) {
+		name = __get_key_name(type->key, str);
+		printk(" (%s", name);
+	} else {
+		printk(" (%s", name);
+		if (type->name_version > 1)
+			printk("#%d", type->name_version);
+		if (type->subtype)
+			printk("/%d", type->subtype);
+	}
+	printk("){%c%c%c%c}", c1, c2, c3, c4);
+}
+
+static void print_lock_name_field(struct lock_type *type)
+{
+	const char *name;
+	char str[128];
+
+	name = type->name;
+	if (!name) {
+		name = __get_key_name(type->key, str);
+		printk("%30s", name);
+	} else {
+		printk("%30s", name);
+		if (type->name_version > 1)
+			printk("#%d", type->name_version);
+		if (type->subtype)
+			printk("/%d", type->subtype);
+	}
+}
+
+static void print_lockdep_cache(struct lockdep_map *lock)
+{
+	const char *name;
+	char str[128];
+
+	name = lock->name;
+	if (!name)
+		name = __get_key_name(lock->key->subkeys, str);
+
+	printk("%s", name);
+}
+
+static void print_lock(struct held_lock *hlock)
+{
+	print_lock_name(hlock->type);
+	printk(", at:");
+	printk_sym(hlock->acquire_ip);
+}
+
+void lockdep_print_held_locks(struct task_struct *curr)
+{
+	int i;
+
+	if (!curr->lockdep_depth) {
+		printk("no locks held by %s/%d.\n", curr->comm, curr->pid);
+		return;
+	}
+	printk("%d locks held by %s/%d:\n",
+		curr->lockdep_depth, curr->comm, curr->pid);
+
+	for (i = 0; i < curr->lockdep_depth; i++) {
+		printk(" #%d: ", i);
+		print_lock(curr->held_locks + i);
+	}
+}
+
+/*
+ * Helper to print a nice hierarchy of lock dependencies:
+ */
+static void print_spaces(int nr)
+{
+	int i;
+
+	for (i = 0; i < nr; i++)
+		printk("  ");
+}
+
+void print_lock_type_header(struct lock_type *type, int depth)
+{
+	int bit;
+
+	print_spaces(depth);
+	printk("->");
+	print_lock_name(type);
+	printk(" ops: %lu", type->ops);
+	printk(" {\n");
+
+	for (bit = 0; bit < LOCK_USAGE_STATES; bit++) {
+		if (type->usage_mask & (1 << bit)) {
+			int len = depth;
+
+			print_spaces(depth);
+			len += printk("   %s", usage_str[bit]);
+			len += printk(" at:\n");
+			print_stack_trace(type->usage_traces + bit, len);
+		}
+	}
+	print_spaces(depth);
+	printk(" }\n");
+
+	print_spaces(depth);
+	printk(" ... key      at:");
+	printk_sym((unsigned long)type->key);
+}
+
+/*
+ * printk all lock dependencies starting at <entry>:
+ */
+static void print_lock_dependencies(struct lock_type *type, int depth)
+{
+	struct lock_list *entry;
+
+	if (DEBUG_WARN_ON(depth >= 20))
+		return;
+
+	print_lock_type_header(type, depth);
+
+	list_for_each_entry(entry, &type->locks_after, entry) {
+		DEBUG_WARN_ON(!entry->type);
+		print_lock_dependencies(entry->type, depth + 1);
+
+		print_spaces(depth);
+		printk(" ... acquired at:\n");
+		print_stack_trace(&entry->trace, 2);
+		printk("\n");
+	}
+}
+
+/*
+ * printk all locks that are taken after this lock:
+ */
+static void print_flat_dependencies(struct lock_type *type)
+{
+	struct lock_list *entry;
+	int nr = 0;
+
+	printk(" {\n");
+	list_for_each_entry(entry, &type->locks_after, entry) {
+		nr++;
+		DEBUG_WARN_ON(!entry->type);
+		printk("    -> ");
+		print_lock_name_field(entry->type);
+		if (entry->type->subtype)
+			printk("/%d", entry->type->subtype);
+		print_stack_trace(&entry->trace, 2);
+	}
+	printk(" } [%d]", nr);
+}
+
+void print_lock_type(struct lock_type *type)
+{
+	print_lock_type_header(type, 0);
+	if (!list_empty(&type->locks_after))
+		print_flat_dependencies(type);
+	printk("\n");
+}
+
+void print_lock_types(void)
+{
+	struct list_head *head;
+	struct lock_type *type;
+	int i, nr;
+
+	printk("lock types:\n");
+
+	for (i = 0; i < TYPEHASH_SIZE; i++) {
+		head = typehash_table + i;
+		if (list_empty(head))
+			continue;
+		printk("\nhash-list at %d:\n", i);
+		nr = 0;
+		list_for_each_entry(type, head, hash_entry) {
+			printk("\n");
+			print_lock_type(type);
+			nr++;
+		}
+	}
+}
+
+/*
+ * Add a new dependency to the head of the list:
+ */
+static int add_lock_to_list(struct lock_type *type, struct lock_type *this,
+			    struct list_head *head, unsigned long ip)
+{
+	struct lock_list *entry;
+	/*
+	 * Lock not present yet - get a new dependency struct and
+	 * add it to the list:
+	 */
+	entry = alloc_list_entry();
+	if (!entry)
+		return 0;
+
+	entry->type = this;
+	save_trace(&entry->trace);
+
+	/*
+	 * Since we never remove from the dependency list, the list can
+	 * be walked locklessly by other CPUs; it's only the allocation
+	 * of new entries that must be protected by the spinlock. But
+	 * this also means a new entry must only be made visible once
+	 * the writes initializing it are visible - hence the RCU op:
+	 */
+	list_add_tail_rcu(&entry->entry, head);
+
+	return 1;
+}
+
+/*
+ * Recursive, forwards-direction lock-dependency checking, used for
+ * both noncyclic checking and for hardirq-unsafe/softirq-unsafe
+ * checking.
+ *
+ * (to keep the stackframe of the recursive functions small we
+ *  use these global variables, and we also mark various helper
+ *  functions as noinline.)
+ */
+static struct held_lock *check_source, *check_target;
+
+/*
+ * Print a dependency chain entry (this is only done when a deadlock
+ * has been detected):
+ */
+static noinline int
+print_circular_bug_entry(struct lock_list *target, unsigned int depth)
+{
+	if (debug_locks_silent)
+		return 0;
+	printk("\n-> #%u", depth);
+	print_lock_name(target->type);
+	printk(":\n");
+	print_stack_trace(&target->trace, 6);
+
+	return 0;
+}
+
+/*
+ * When a circular dependency is detected, print the
+ * header first:
+ */
+static noinline int
+print_circular_bug_header(struct lock_list *entry, unsigned int depth)
+{
+	struct task_struct *curr = current;
+
+	__raw_spin_unlock(&hash_lock);
+	debug_locks_off();
+	if (debug_locks_silent)
+		return 0;
+
+	printk("\n=====================================================\n");
+	printk(  "[ BUG: possible circular locking deadlock detected! ]\n");
+	printk(  "-----------------------------------------------------\n");
+	printk("%s/%d is trying to acquire lock:\n",
+		curr->comm, curr->pid);
+	print_lock(check_source);
+	printk("\nbut task is already holding lock:\n");
+	print_lock(check_target);
+	printk("\nwhich lock already depends on the new lock,\n");
+	printk("which could lead to circular deadlocks!\n");
+	printk("\nthe existing dependency chain (in reverse order) is:\n");
+
+	print_circular_bug_entry(entry, depth);
+
+	return 0;
+}
+
+static noinline int print_circular_bug_tail(void)
+{
+	struct task_struct *curr = current;
+	struct lock_list this;
+
+	if (debug_locks_silent)
+		return 0;
+
+	this.type = check_source->type;
+	save_trace(&this.trace);
+	print_circular_bug_entry(&this, 0);
+
+	printk("\nother info that might help us debug this:\n\n");
+	lockdep_print_held_locks(curr);
+
+	printk("\nstack backtrace:\n");
+	dump_stack();
+
+	return 0;
+}
+
+static noinline int print_infinite_recursion_bug(void)
+{
+	__raw_spin_unlock(&hash_lock);
+	DEBUG_WARN_ON(1);
+
+	return 0;
+}
+
+/*
+ * Prove that the dependency graph starting at <source> cannot
+ * lead to <target>. Print an error and return 0 if it does.
+ */
+static noinline int
+check_noncircular(struct lock_type *source, unsigned int depth)
+{
+	struct lock_list *entry;
+
+	debug_atomic_inc(&nr_cyclic_check_recursions);
+	if (depth > max_recursion_depth)
+		max_recursion_depth = depth;
+	if (depth >= 20)
+		return print_infinite_recursion_bug();
+	/*
+	 * Check this lock's dependency list:
+	 */
+	list_for_each_entry(entry, &source->locks_after, entry) {
+		if (entry->type == check_target->type)
+			return print_circular_bug_header(entry, depth+1);
+		debug_atomic_inc(&nr_cyclic_checks);
+		if (!check_noncircular(entry->type, depth+1))
+			return print_circular_bug_entry(entry, depth+1);
+	}
+	return 1;
+}
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+
+/*
+ * Forwards and backwards subgraph searching, for the purposes of
+ * proving that two subgraphs can be connected by a new dependency
+ * without creating any illegal irq-safe -> irq-unsafe lock dependency.
+ */
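+/*
+ * The classic scenario these checks prevent: lock A is hardirq-safe
+ * (taken from hardirq context), lock B is hardirq-unsafe (taken with
+ * hardirqs enabled). If a dependency A -> B were allowed:
+ *
+ *   CPU#1 takes A, then tries to take B
+ *   CPU#0 takes B with hardirqs enabled, a hardirq hits and the
+ *         handler tries to take A
+ *
+ * now CPU#0 spins on A (held by CPU#1) and CPU#1 spins on B (held
+ * by CPU#0) - neither can ever make progress.
+ */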
+static enum lock_usage_bit find_usage_bit;
+static struct lock_type *forwards_match, *backwards_match;
+
+/*
+ * Find a node in the forwards-direction dependency sub-graph starting
+ * at <source> that matches <find_usage_bit>.
+ *
+ * Return 2 if such a node exists in the subgraph, and put that node
+ * into <forwards_match>.
+ *
+ * Return 1 otherwise and keep <forwards_match> unchanged.
+ * Return 0 on error.
+ */
+static noinline int
+find_usage_forwards(struct lock_type *source, unsigned int depth)
+{
+	struct lock_list *entry;
+	int ret;
+
+	if (depth > max_recursion_depth)
+		max_recursion_depth = depth;
+	if (depth >= 20)
+		return print_infinite_recursion_bug();
+
+	debug_atomic_inc(&nr_find_usage_forwards_checks);
+	if (source->usage_mask & (1 << find_usage_bit)) {
+		forwards_match = source;
+		return 2;
+	}
+
+	/*
+	 * Check this lock's dependency list:
+	 */
+	list_for_each_entry(entry, &source->locks_after, entry) {
+		debug_atomic_inc(&nr_find_usage_forwards_recursions);
+		ret = find_usage_forwards(entry->type, depth+1);
+		if (ret == 2 || ret == 0)
+			return ret;
+	}
+	return 1;
+}
+
+/*
+ * Find a node in the backwards-direction dependency sub-graph starting
+ * at <source> that matches <find_usage_bit>.
+ *
+ * Return 2 if such a node exists in the subgraph, and put that node
+ * into <backwards_match>.
+ *
+ * Return 1 otherwise and keep <backwards_match> unchanged.
+ * Return 0 on error.
+ */
+static noinline int
+find_usage_backwards(struct lock_type *source, unsigned int depth)
+{
+	struct lock_list *entry;
+	int ret;
+
+	if (depth > max_recursion_depth)
+		max_recursion_depth = depth;
+	if (depth >= 20)
+		return print_infinite_recursion_bug();
+
+	debug_atomic_inc(&nr_find_usage_backwards_checks);
+	if (source->usage_mask & (1 << find_usage_bit)) {
+		backwards_match = source;
+		return 2;
+	}
+
+	/*
+	 * Check this lock's dependency list:
+	 */
+	list_for_each_entry(entry, &source->locks_before, entry) {
+		debug_atomic_inc(&nr_find_usage_backwards_recursions);
+		ret = find_usage_backwards(entry->type, depth+1);
+		if (ret == 2 || ret == 0)
+			return ret;
+	}
+	return 1;
+}
+
+static int
+print_bad_irq_dependency(struct task_struct *curr,
+			 struct held_lock *prev,
+			 struct held_lock *next,
+			 enum lock_usage_bit bit1,
+			 enum lock_usage_bit bit2,
+			 const char *irqtype)
+{
+	__raw_spin_unlock(&hash_lock);
+	debug_locks_off();
+	if (debug_locks_silent)
+		return 0;
+
+	printk("\n======================================================\n");
+	printk(  "[ BUG: %s-safe -> %s-unsafe lock order detected! ]\n",
+		irqtype, irqtype);
+	printk(  "------------------------------------------------------\n");
+	printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] is trying to acquire:\n",
+		curr->comm, curr->pid,
+		curr->hardirq_context, hardirq_count() >> HARDIRQ_SHIFT,
+		curr->softirq_context, softirq_count() >> SOFTIRQ_SHIFT,
+		curr->hardirqs_enabled,
+		curr->softirqs_enabled);
+	print_lock(next);
+
+	printk("\nand this task is already holding:\n");
+	print_lock(prev);
+	printk("which would create a new lock dependency:\n");
+	print_lock_name(prev->type);
+	printk(" ->");
+	print_lock_name(next->type);
+	printk("\n");
+
+	printk("\nbut this new dependency connects a %s-irq-safe lock:\n",
+		irqtype);
+	print_lock_name(backwards_match);
+	printk("\n... which became %s-irq-safe at:\n", irqtype);
+
+	print_stack_trace(backwards_match->usage_traces + bit1, 1);
+
+	printk("\nto a %s-irq-unsafe lock:\n", irqtype);
+	print_lock_name(forwards_match);
+	printk("\n... which became %s-irq-unsafe at:\n", irqtype);
+	printk("...");
+
+	print_stack_trace(forwards_match->usage_traces + bit2, 1);
+
+	printk("\nwhich could potentially lead to deadlocks!\n");
+
+	printk("\nother info that might help us debug this:\n\n");
+	lockdep_print_held_locks(curr);
+
+	printk("\nthe %s-irq-safe lock's dependencies:\n", irqtype);
+	print_lock_dependencies(backwards_match, 0);
+
+	printk("\nthe %s-irq-unsafe lock's dependencies:\n", irqtype);
+	print_lock_dependencies(forwards_match, 0);
+
+	printk("\nstack backtrace:\n");
+	dump_stack();
+
+	return 0;
+}
+
+static int
+check_usage(struct task_struct *curr, struct held_lock *prev,
+	    struct held_lock *next, enum lock_usage_bit bit_backwards,
+	    enum lock_usage_bit bit_forwards, const char *irqtype)
+{
+	int ret;
+
+	find_usage_bit = bit_backwards;
+	/* fills in <backwards_match> */
+	ret = find_usage_backwards(prev->type, 0);
+	if (!ret || ret == 1)
+		return ret;
+
+	find_usage_bit = bit_forwards;
+	ret = find_usage_forwards(next->type, 0);
+	if (!ret || ret == 1)
+		return ret;
+	/* ret == 2 */
+	return print_bad_irq_dependency(curr, prev, next,
+			bit_backwards, bit_forwards, irqtype);
+}
+
+#endif
+
+static int
+print_deadlock_bug(struct task_struct *curr, struct held_lock *prev,
+		   struct held_lock *next)
+{
+	debug_locks_off();
+	__raw_spin_unlock(&hash_lock);
+	if (debug_locks_silent)
+		return 0;
+
+	printk("\n====================================\n");
+	printk(  "[ BUG: possible deadlock detected! ]\n");
+	printk(  "------------------------------------\n");
+	printk("%s/%d is trying to acquire lock:\n",
+		curr->comm, curr->pid);
+	print_lock(next);
+	printk("\nbut task is already holding lock:\n");
+	print_lock(prev);
+	printk("\nwhich could potentially lead to deadlocks!\n");
+
+	printk("\nother info that might help us debug this:\n");
+	lockdep_print_held_locks(curr);
+
+	printk("\nstack backtrace:\n");
+	dump_stack();
+
+	return 0;
+}
+
+/*
+ * Check whether we are holding such a type already.
+ *
+ * (Note that this has to be done separately, because the graph cannot
+ * detect such types of deadlocks.)
+ *
+ * Returns: 0 on deadlock detected, 1 on OK, 2 on recursive read
+ */
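+/*
+ * Example: nesting two locks that were initialized by the same
+ * lock-init call site (and thus share a single type key) triggers
+ * this report, because from the validator's point of view the task
+ * is taking "the same lock" twice. Legitimate nesting of same-type
+ * locks is expected to use a separate subtype; only read-after-read
+ * recursion on the very same lock instance is allowed through below.
+ */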
+static int
+check_deadlock(struct task_struct *curr, struct held_lock *next,
+	       struct lockdep_map *next_instance, int read)
+{
+	struct held_lock *prev;
+	int i;
+
+	for (i = 0; i < curr->lockdep_depth; i++) {
+		prev = curr->held_locks + i;
+		if (prev->type != next->type)
+			continue;
+		/*
+		 * Allow read-after-read recursion of the same
+		 * lock instance (i.e. read_lock(lock)+read_lock(lock)):
+		 */
+		if ((read > 0) && prev->read &&
+				(prev->instance == next_instance))
+			return 2;
+		return print_deadlock_bug(curr, prev, next);
+	}
+	return 1;
+}
+
+/*
+ * There was a chain-cache miss, and we are about to add a new dependency
+ * to a previous lock. We recursively validate the following rules:
+ *
+ *  - would the adding of the <prev> -> <next> dependency create a
+ *    circular dependency in the graph? [== circular deadlock]
+ *
+ *  - does the new prev->next dependency connect any hardirq-safe lock
+ *    (in the full backwards-subgraph starting at <prev>) with any
+ *    hardirq-unsafe lock (in the full forwards-subgraph starting at
+ *    <next>)? [== illegal lock inversion with hardirq contexts]
+ *
+ *  - does the new prev->next dependency connect any softirq-safe lock
+ *    (in the full backwards-subgraph starting at <prev>) with any
+ *    softirq-unsafe lock (in the full forwards-subgraph starting at
+ *    <next>)? [== illegal lock inversion with softirq contexts]
+ *
+ * any of these scenarios could lead to a deadlock.
+ *
+ * Then if all the validations pass, we add the forwards and backwards
+ * dependency.
+ */
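+/*
+ * Example of the first rule: if the task holds L1 and the graph
+ * already contains L2 -> ... -> L1 (some other code path takes L1
+ * while holding L2), then the new L1 -> L2 dependency would close a
+ * cycle and is reported as a possible circular deadlock instead of
+ * being recorded.
+ */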
+static int
+check_prev_add(struct task_struct *curr, struct held_lock *prev,
+	       struct held_lock *next)
+{
+	struct lock_list *entry;
+	int ret;
+
+	/*
+	 * Prove that the new <prev> -> <next> dependency would not
+	 * create a circular dependency in the graph. (We do this by
+	 * forward-recursing into the graph starting at <next>, and
+	 * checking whether we can reach <prev>.)
+	 *
+	 * We are using global variables to control the recursion, to
+	 * keep the stackframe size of the recursive functions low:
+	 */
+	check_source = next;
+	check_target = prev;
+	if (!(check_noncircular(next->type, 0)))
+		return print_circular_bug_tail();
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+	/*
+	 * Prove that the new dependency does not connect a hardirq-safe
+	 * lock with a hardirq-unsafe lock - to achieve this we search
+	 * the backwards-subgraph starting at <prev>, and the
+	 * forwards-subgraph starting at <next>:
+	 */
+	if (!check_usage(curr, prev, next, LOCK_USED_IN_HARDIRQ,
+					LOCK_ENABLED_HARDIRQS, "hard"))
+		return 0;
+
+	/*
+	 * Prove that the new dependency does not connect a hardirq-safe-read
+	 * lock with a hardirq-unsafe lock - to achieve this we search
+	 * the backwards-subgraph starting at <prev>, and the
+	 * forwards-subgraph starting at <next>:
+	 */
+	if (!check_usage(curr, prev, next, LOCK_USED_IN_HARDIRQ_READ,
+					LOCK_ENABLED_HARDIRQS, "hard-read"))
+		return 0;
+
+	/*
+	 * Prove that the new dependency does not connect a softirq-safe
+	 * lock with a softirq-unsafe lock - to achieve this we search
+	 * the backwards-subgraph starting at <prev>, and the
+	 * forwards-subgraph starting at <next>:
+	 */
+	if (!check_usage(curr, prev, next, LOCK_USED_IN_SOFTIRQ,
+					LOCK_ENABLED_SOFTIRQS, "soft"))
+		return 0;
+	/*
+	 * Prove that the new dependency does not connect a softirq-safe-read
+	 * lock with a softirq-unsafe lock - to achieve this we search
+	 * the backwards-subgraph starting at <prev>, and the
+	 * forwards-subgraph starting at <next>:
+	 */
+	if (!check_usage(curr, prev, next, LOCK_USED_IN_SOFTIRQ_READ,
+					LOCK_ENABLED_SOFTIRQS, "soft-read"))
+		return 0;
+#endif
+	/*
+	 * For recursive read-locks we do all the dependency checks,
+	 * but we don't store read-triggered dependencies (only
+	 * write-triggered dependencies). This ensures that only the
+	 * write-side dependencies matter, and that if for example a
+	 * write-lock never takes any other locks, then the reads are
+	 * equivalent to a NOP.
+	 */
+	if (next->read == 1 || prev->read == 1)
+		return 1;
+	/*
+	 * Is the <prev> -> <next> dependency already present?
+	 *
+	 * (this may occur even though this is a new chain: consider
+	 *  e.g. the L1 -> L2 -> L3 -> L4 and the L5 -> L1 -> L2 -> L3
+	 *  chains - the second one will be new, but L1 already has
+	 *  L2 added to its dependency list, due to the first chain.)
+	 */
+	list_for_each_entry(entry, &prev->type->locks_after, entry) {
+		if (entry->type == next->type)
+			return 2;
+	}
+
+	/*
+	 * Ok, all validations passed, add the new lock
+	 * to the previous lock's dependency list:
+	 */
+	ret = add_lock_to_list(prev->type, next->type,
+			       &prev->type->locks_after, next->acquire_ip);
+	if (!ret)
+		return 0;
+	/*
+	 * Return value of 2 signals 'dependency already added',
+	 * in that case we don't have to add the backlink either.
+	 */
+	if (ret == 2)
+		return 2;
+	ret = add_lock_to_list(next->type, prev->type,
+			       &next->type->locks_before, next->acquire_ip);
+
+	/*
+	 * Debugging printouts:
+	 */
+	if (verbose(prev->type) || verbose(next->type)) {
+		__raw_spin_unlock(&hash_lock);
+		print_lock_name_field(prev->type);
+		printk(" => ");
+		print_lock_name_field(next->type);
+		printk("\n");
+		dump_stack();
+		__raw_spin_lock(&hash_lock);
+	}
+	return 1;
+}
+
+/*
+ * Add the dependency to all directly-previous locks that are 'relevant'.
+ * The ones that are relevant are (in increasing distance from curr):
+ * all consecutive trylock entries and the final non-trylock entry - or
+ * the end of this context's lock-chain - whichever comes first.
+ */
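+/*
+ * Example: with a held-lock stack of L1, L2, T3, T4 (T3 and T4 being
+ * trylocks, none of them recursive reads) and a new lock L5, the
+ * dependencies T4 -> L5, T3 -> L5 and L2 -> L5 are added; L1 is only
+ * connected to L5 indirectly, via the L1 -> L2 dependency that was
+ * added when L2 was acquired.
+ */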
+static int
+check_prevs_add(struct task_struct *curr, struct held_lock *next)
+{
+	int depth = curr->lockdep_depth;
+	struct held_lock *hlock;
+
+	/*
+	 * Debugging checks.
+	 *
+	 * Depth must not be zero for a non-head lock:
+	 */
+	if (!depth)
+		goto out_bug;
+	/*
+	 * At least two relevant locks must exist for this
+	 * to be a head:
+	 */
+	if (curr->held_locks[depth].irq_context !=
+			curr->held_locks[depth-1].irq_context)
+		goto out_bug;
+
+	for (;;) {
+		hlock = curr->held_locks + depth-1;
+		/*
+		 * Only non-recursive-read entries get new dependencies
+		 * added:
+		 */
+		if (hlock->read != 2) {
+			check_prev_add(curr, hlock, next);
+			/*
+			 * Stop after the first non-trylock entry,
+			 * as non-trylock entries have added their
+			 * own direct dependencies already, so this
+			 * lock is connected to them indirectly:
+			 */
+			if (!hlock->trylock)
+				break;
+		}
+		depth--;
+		/*
+		 * End of lock-stack?
+		 */
+		if (!depth)
+			break;
+		/*
+		 * Stop the search if we cross into another context:
+		 */
+		if (curr->held_locks[depth].irq_context !=
+				curr->held_locks[depth-1].irq_context)
+			break;
+	}
+	return 1;
+out_bug:
+	__raw_spin_unlock(&hash_lock);
+	DEBUG_WARN_ON(1);
+
+	return 0;
+}
+
+
+/*
+ * Is this the address of a static object:
+ */
+static int static_obj(void *obj)
+{
+	unsigned long start = (unsigned long) &_stext,
+		      end   = (unsigned long) &_end,
+		      addr  = (unsigned long) obj;
+	int i;
+
+	/*
+	 * static variable?
+	 */
+	if ((addr >= start) && (addr < end))
+		return 1;
+
+#ifdef CONFIG_SMP
+	/*
+	 * percpu var?
+	 */
+	for_each_possible_cpu(i) {
+		start = (unsigned long) &__per_cpu_start + per_cpu_offset(i);
+		end   = (unsigned long) &__per_cpu_end   + per_cpu_offset(i);
+
+		if ((addr >= start) && (addr < end))
+			return 1;
+	}
+#endif
+
+	/*
+	 * module var?
+	 */
+	return __module_address(addr);
+}
+
+/*
+ * To make lock name printouts unique, we calculate a unique
+ * type->name_version generation counter:
+ */
+int count_matching_names(struct lock_type *new_type)
+{
+	struct lock_type *type;
+	int count = 0;
+
+	if (!new_type->name)
+		return 0;
+
+	list_for_each_entry(type, &all_lock_types, lock_entry) {
+		if (new_type->key - new_type->subtype == type->key)
+			return type->name_version;
+		if (!strcmp(type->name, new_type->name))
+			count = max(count, type->name_version);
+	}
+
+	return count + 1;
+}
+
+extern void __error_too_big_MAX_LOCKDEP_SUBTYPES(void);
+
+/*
+ * Register a lock's type in the hash-table, if the type is not present
+ * yet. Otherwise we look it up. We cache the result in the lock object
+ * itself, so the actual hash lookup only needs to happen once per lock object.
+ */
+static inline struct lock_type *
+register_lock_type(struct lockdep_map *lock, unsigned int subtype)
+{
+	struct lockdep_subtype_key *key;
+	struct list_head *hash_head;
+	struct lock_type *type;
+
+#ifdef CONFIG_DEBUG_LOCKDEP
+	/*
+	 * If the architecture calls into lockdep before initializing
+	 * the hashes then we'll warn about it later. (we cannot printk
+	 * right now)
+	 */
+	if (unlikely(!lockdep_initialized)) {
+		lockdep_init();
+		lockdep_init_error = 1;
+	}
+#endif
+
+	/*
+	 * Static locks do not have their type-keys yet - for them the key
+	 * is the lock object itself:
+	 */
+	if (unlikely(!lock->key))
+		lock->key = (void *)lock;
+
+	/*
+	 * Debug-check: all keys must be persistent!
+ 	 */
+	if (DEBUG_WARN_ON(!static_obj(lock->key))) {
+		debug_locks_off();
+		printk("BUG: trying to register non-static key!\n");
+		printk("turning off the locking correctness validator.\n");
+		dump_stack();
+		return NULL;
+	}
+
+	/*
+	 * NOTE: the type-key must be unique. For dynamic locks, a static
+	 * lockdep_type_key variable is passed in through the mutex_init()
+	 * (or spin_lock_init()) call - which acts as the key. For static
+	 * locks we use the lock object itself as the key.
+	 */
+	if (sizeof(struct lockdep_type_key) > sizeof(struct lock_type))
+		__error_too_big_MAX_LOCKDEP_SUBTYPES();
+
+	key = lock->key->subkeys + subtype;
+
+	hash_head = typehashentry(key);
+
+	/*
+	 * We can walk the hash lockfree, because the hash only
+	 * grows, and we are careful when adding entries to the end:
+	 */
+	list_for_each_entry(type, hash_head, hash_entry)
+		if (type->key == key)
+			goto out_set;
+
+	__raw_spin_lock(&hash_lock);
+	/*
+	 * We have to do the hash-walk again, to avoid races
+	 * with another CPU:
+	 */
+	list_for_each_entry(type, hash_head, hash_entry)
+		if (type->key == key)
+			goto out_unlock_set;
+	/*
+	 * Allocate a new key from the static array, and add it to
+	 * the hash:
+	 */
+	if (nr_lock_types >= MAX_LOCKDEP_KEYS) {
+		__raw_spin_unlock(&hash_lock);
+		debug_locks_off();
+		printk("BUG: MAX_LOCKDEP_KEYS too low!\n");
+		printk("turning off the locking correctness validator.\n");
+		return NULL;
+	}
+	type = lock_types + nr_lock_types++;
+	debug_atomic_inc(&nr_unused_locks);
+	type->key = key;
+	type->name = lock->name;
+	type->subtype = subtype;
+	INIT_LIST_HEAD(&type->lock_entry);
+	INIT_LIST_HEAD(&type->locks_before);
+	INIT_LIST_HEAD(&type->locks_after);
+	type->name_version = count_matching_names(type);
+	/*
+	 * We use RCU's safe list-add method to make
+	 * parallel walking of the hash-list safe:
+	 */
+	list_add_tail_rcu(&type->hash_entry, hash_head);
+
+	if (verbose(type)) {
+		__raw_spin_unlock(&hash_lock);
+		printk("new type %p: %s", type->key, type->name);
+		if (type->name_version > 1)
+			printk("#%d", type->name_version);
+		printk("\n");
+		dump_stack();
+		__raw_spin_lock(&hash_lock);
+	}
+out_unlock_set:
+	__raw_spin_unlock(&hash_lock);
+
+out_set:
+	lock->type[subtype] = type;
+
+	DEBUG_WARN_ON(type->subtype != subtype);
+
+	return type;
+}
+
+/*
+ * Look up a dependency chain. If the key is not present yet then
+ * add it and return 1 - in this case the new dependency chain will
+ * be validated. If the key is already hashed, return 0.
+ */
+static inline int lookup_chain_cache(u64 chain_key)
+{
+	struct list_head *hash_head = chainhashentry(chain_key);
+	struct lock_chain *chain;
+
+	DEBUG_WARN_ON(!irqs_disabled());
+	/*
+	 * We can walk it lock-free, because entries only get added
+	 * to the hash:
+	 */
+	list_for_each_entry(chain, hash_head, entry) {
+		if (chain->chain_key == chain_key) {
+cache_hit:
+			debug_atomic_inc(&chain_lookup_hits);
+			/*
+			 * In the debugging case, force redundant checking
+			 * by returning 1:
+			 */
+#ifdef CONFIG_DEBUG_LOCKDEP
+			__raw_spin_lock(&hash_lock);
+			return 1;
+#endif
+			return 0;
+		}
+	}
+	/*
+	 * Allocate a new chain entry from the static array, and add
+	 * it to the hash:
+	 */
+	__raw_spin_lock(&hash_lock);
+	/*
+	 * We have to walk the chain again locked - to avoid duplicates:
+	 */
+	list_for_each_entry(chain, hash_head, entry) {
+		if (chain->chain_key == chain_key) {
+			__raw_spin_unlock(&hash_lock);
+			goto cache_hit;
+		}
+	}
+	if (unlikely(nr_lock_chains >= MAX_LOCKDEP_CHAINS)) {
+		__raw_spin_unlock(&hash_lock);
+		debug_locks_off();
+		printk("BUG: MAX_LOCKDEP_CHAINS too low!\n");
+		printk("turning off the locking correctness validator.\n");
+		return 0;
+	}
+	chain = lock_chains + nr_lock_chains++;
+	chain->chain_key = chain_key;
+	list_add_tail_rcu(&chain->entry, hash_head);
+	debug_atomic_inc(&chain_lookup_misses);
+#ifdef CONFIG_TRACE_IRQFLAGS
+	if (current->hardirq_context)
+		nr_hardirq_chains++;
+	else {
+		if (current->softirq_context)
+			nr_softirq_chains++;
+		else
+			nr_process_chains++;
+	}
+#else
+	nr_process_chains++;
+#endif
+
+	return 1;
+}
+
+/*
+ * We are building curr_chain_key incrementally, so double-check
+ * it from scratch, to make sure that it's done correctly:
+ */
+static void check_chain_key(struct task_struct *curr)
+{
+#ifdef CONFIG_DEBUG_LOCKDEP
+	struct held_lock *hlock, *prev_hlock = NULL;
+	unsigned int i, id;
+	u64 chain_key = 0;
+
+	for (i = 0; i < curr->lockdep_depth; i++) {
+		hlock = curr->held_locks + i;
+		if (chain_key != hlock->prev_chain_key) {
+			debug_locks_off();
+			printk("hm#1, depth: %u [%u], %016Lx != %016Lx\n",
+				curr->lockdep_depth, i, chain_key,
+				hlock->prev_chain_key);
+			WARN_ON(1);
+			return;
+		}
+		id = hlock->type - lock_types;
+		DEBUG_WARN_ON(id >= MAX_LOCKDEP_KEYS);
+		if (prev_hlock && (prev_hlock->irq_context !=
+							hlock->irq_context))
+			chain_key = 0;
+		chain_key = iterate_chain_key(chain_key, id);
+		prev_hlock = hlock;
+	}
+	if (chain_key != curr->curr_chain_key) {
+		debug_locks_off();
+		printk("hm#2, depth: %u [%u], %016Lx != %016Lx\n",
+			curr->lockdep_depth, i, chain_key,
+			curr->curr_chain_key);
+		WARN_ON(1);
+	}
+#endif
+}
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+
+/*
+ * print irq inversion bug:
+ */
+static int
+print_irq_inversion_bug(struct task_struct *curr, struct lock_type *other,
+			struct held_lock *this, int forwards,
+			const char *irqtype)
+{
+	__raw_spin_unlock(&hash_lock);
+	debug_locks_off();
+	if (debug_locks_silent)
+		return 0;
+
+	printk("\n==================================================\n");
+	printk(  "[ BUG: possible irq lock inversion bug detected! ]\n");
+	printk(  "--------------------------------------------------\n");
+	printk("%s/%d just changed the state of lock:\n",
+		curr->comm, curr->pid);
+	print_lock(this);
+	if (forwards)
+		printk("but this lock took another, %s-irq-unsafe lock in the past:\n", irqtype);
+	else
+		printk("but this lock was taken by another, %s-irq-safe lock in the past:\n", irqtype);
+	print_lock_name(other);
+	printk("\n\nand interrupts could create inverse lock ordering between them,\n");
+
+	printk("which could potentially lead to deadlocks!\n");
+
+	printk("\nother info that might help us debug this:\n");
+	lockdep_print_held_locks(curr);
+
+	printk("\nthe first lock's dependencies:\n");
+	print_lock_dependencies(this->type, 0);
+
+	printk("\nthe second lock's dependencies:\n");
+	print_lock_dependencies(other, 0);
+
+	printk("\nstack backtrace:\n");
+	dump_stack();
+
+	return 0;
+}
+
+/*
+ * Prove that in the forwards-direction subgraph starting at <this>
+ * there is no lock matching <mask>:
+ */
+static int
+check_usage_forwards(struct task_struct *curr, struct held_lock *this,
+		     enum lock_usage_bit bit, const char *irqtype)
+{
+	int ret;
+
+	find_usage_bit = bit;
+	/* fills in <forwards_match> */
+	ret = find_usage_forwards(this->type, 0);
+	if (!ret || ret == 1)
+		return ret;
+
+	return print_irq_inversion_bug(curr, forwards_match, this, 1, irqtype);
+}
+
+/*
+ * Prove that in the backwards-direction subgraph starting at <this>
+ * there is no lock matching <mask>:
+ */
+static int
+check_usage_backwards(struct task_struct *curr, struct held_lock *this,
+		      enum lock_usage_bit bit, const char *irqtype)
+{
+	int ret;
+
+	find_usage_bit = bit;
+	/* fills in <backwards_match> */
+	ret = find_usage_backwards(this->type, 0);
+	if (!ret || ret == 1)
+		return ret;
+
+	return print_irq_inversion_bug(curr, backwards_match, this, 0, irqtype);
+}
+
+static inline void print_irqtrace_events(struct task_struct *curr)
+{
+	printk("irq event stamp: %u\n", curr->irq_events);
+	printk("hardirqs last  enabled at (%u): [<%08lx>]",
+		curr->hardirq_enable_event, curr->hardirq_enable_ip);
+	print_symbol(" %s\n", curr->hardirq_enable_ip);
+	printk("hardirqs last disabled at (%u): [<%08lx>]",
+		curr->hardirq_disable_event, curr->hardirq_disable_ip);
+	print_symbol(" %s\n", curr->hardirq_disable_ip);
+	printk("softirqs last  enabled at (%u): [<%08lx>]",
+		curr->softirq_enable_event, curr->softirq_enable_ip);
+	print_symbol(" %s\n", curr->softirq_enable_ip);
+	printk("softirqs last disabled at (%u): [<%08lx>]",
+		curr->softirq_disable_event, curr->softirq_disable_ip);
+	print_symbol(" %s\n", curr->softirq_disable_ip);
+}
+
+#else
+static inline void print_irqtrace_events(struct task_struct *curr)
+{
+}
+#endif
+
+static int
+print_usage_bug(struct task_struct *curr, struct held_lock *this,
+		enum lock_usage_bit prev_bit, enum lock_usage_bit new_bit)
+{
+	__raw_spin_unlock(&hash_lock);
+	debug_locks_off();
+	if (debug_locks_silent)
+		return 0;
+
+	printk("\n============================\n");
+	printk(  "[ BUG: illegal lock usage! ]\n");
+	printk(  "----------------------------\n");
+
+	printk("illegal {%s} -> {%s} usage.\n",
+		usage_str[prev_bit], usage_str[new_bit]);
+
+	printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] takes:\n",
+		curr->comm, curr->pid,
+		trace_hardirq_context(curr), hardirq_count() >> HARDIRQ_SHIFT,
+		trace_softirq_context(curr), softirq_count() >> SOFTIRQ_SHIFT,
+		trace_hardirqs_enabled(curr),
+		trace_softirqs_enabled(curr));
+	print_lock(this);
+
+	printk("{%s} state was registered at:\n", usage_str[prev_bit]);
+	print_stack_trace(this->type->usage_traces + prev_bit, 1);
+
+	print_irqtrace_events(curr);
+	printk("\nother info that might help us debug this:\n");
+	lockdep_print_held_locks(curr);
+
+	printk("\nstack backtrace:\n");
+	dump_stack();
+
+	return 0;
+}
+
+/*
+ * Print out an error if an invalid bit is set:
+ */
+static inline int
+valid_state(struct task_struct *curr, struct held_lock *this,
+	    enum lock_usage_bit new_bit, enum lock_usage_bit bad_bit)
+{
+	if (unlikely(this->type->usage_mask & (1 << bad_bit)))
+		return print_usage_bug(curr, this, bad_bit, new_bit);
+	return 1;
+}
+
+#define STRICT_READ_CHECKS	1
+
+/*
+ * Mark a lock with a usage bit, and validate the state transition:
+ */
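+/*
+ * Example of an invalid transition: a lock that was first acquired
+ * with hardirqs enabled has LOCK_ENABLED_HARDIRQS set. If it is later
+ * acquired from hardirq context, the new LOCK_USED_IN_HARDIRQ bit
+ * fails the valid_state() check against the already-set
+ * LOCK_ENABLED_HARDIRQS bit and an illegal lock usage bug is printed.
+ */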
+static int mark_lock(struct task_struct *curr, struct held_lock *this,
+		     enum lock_usage_bit new_bit, unsigned long ip)
+{
+	unsigned int new_mask = 1 << new_bit, ret = 1;
+
+	/*
+	 * If already set then do not dirty the cacheline,
+	 * nor do any checks:
+	 */
+	if (likely(this->type->usage_mask & new_mask))
+		return 1;
+
+	__raw_spin_lock(&hash_lock);
+	/*
+	 * Make sure we didn't race:
+	 */
+	if (unlikely(this->type->usage_mask & new_mask)) {
+		__raw_spin_unlock(&hash_lock);
+		return 1;
+	}
+
+	this->type->usage_mask |= new_mask;
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+	if (new_bit == LOCK_ENABLED_HARDIRQS ||
+			new_bit == LOCK_ENABLED_HARDIRQS_READ)
+		ip = curr->hardirq_enable_ip;
+	else if (new_bit == LOCK_ENABLED_SOFTIRQS ||
+			new_bit == LOCK_ENABLED_SOFTIRQS_READ)
+		ip = curr->softirq_enable_ip;
+#endif
+	if (!save_trace(this->type->usage_traces + new_bit))
+		return 0;
+
+	switch (new_bit) {
+#ifdef CONFIG_TRACE_IRQFLAGS
+	case LOCK_USED_IN_HARDIRQ:
+		if (!valid_state(curr, this, new_bit, LOCK_ENABLED_HARDIRQS))
+			return 0;
+		if (!valid_state(curr, this, new_bit,
+				 LOCK_ENABLED_HARDIRQS_READ))
+			return 0;
+		/*
+		 * just marked it hardirq-safe, check that this lock
+		 * took no hardirq-unsafe lock in the past:
+		 */
+		if (!check_usage_forwards(curr, this,
+					  LOCK_ENABLED_HARDIRQS, "hard"))
+			return 0;
+#if STRICT_READ_CHECKS
+		/*
+		 * just marked it hardirq-safe, check that this lock
+		 * took no hardirq-unsafe-read lock in the past:
+		 */
+		if (!check_usage_forwards(curr, this,
+				LOCK_ENABLED_HARDIRQS_READ, "hard-read"))
+			return 0;
+#endif
+		debug_atomic_inc(&nr_hardirq_safe_locks);
+		if (hardirq_verbose(this->type))
+			ret = 2;
+		break;
+	case LOCK_USED_IN_SOFTIRQ:
+		if (!valid_state(curr, this, new_bit, LOCK_ENABLED_SOFTIRQS))
+			return 0;
+		if (!valid_state(curr, this, new_bit,
+				 LOCK_ENABLED_SOFTIRQS_READ))
+			return 0;
+		/*
+		 * just marked it softirq-safe, check that this lock
+		 * took no softirq-unsafe lock in the past:
+		 */
+		if (!check_usage_forwards(curr, this,
+					  LOCK_ENABLED_SOFTIRQS, "soft"))
+			return 0;
+#if STRICT_READ_CHECKS
+		/*
+		 * just marked it softirq-safe, check that this lock
+		 * took no softirq-unsafe-read lock in the past:
+		 */
+		if (!check_usage_forwards(curr, this,
+				LOCK_ENABLED_SOFTIRQS_READ, "soft-read"))
+			return 0;
+#endif
+		debug_atomic_inc(&nr_softirq_safe_locks);
+		if (softirq_verbose(this->type))
+			ret = 2;
+		break;
+	case LOCK_USED_IN_HARDIRQ_READ:
+		if (!valid_state(curr, this, new_bit, LOCK_ENABLED_HARDIRQS))
+			return 0;
+		/*
+		 * just marked it hardirq-read-safe, check that this lock
+		 * took no hardirq-unsafe lock in the past:
+		 */
+		if (!check_usage_forwards(curr, this,
+					  LOCK_ENABLED_HARDIRQS, "hard"))
+			return 0;
+		debug_atomic_inc(&nr_hardirq_read_safe_locks);
+		if (hardirq_verbose(this->type))
+			ret = 2;
+		break;
+	case LOCK_USED_IN_SOFTIRQ_READ:
+		if (!valid_state(curr, this, new_bit, LOCK_ENABLED_SOFTIRQS))
+			return 0;
+		/*
+		 * just marked it softirq-read-safe, check that this lock
+		 * took no softirq-unsafe lock in the past:
+		 */
+		if (!check_usage_forwards(curr, this,
+					  LOCK_ENABLED_SOFTIRQS, "soft"))
+			return 0;
+		debug_atomic_inc(&nr_softirq_read_safe_locks);
+		if (softirq_verbose(this->type))
+			ret = 2;
+		break;
+	case LOCK_ENABLED_HARDIRQS:
+		if (!valid_state(curr, this, new_bit, LOCK_USED_IN_HARDIRQ))
+			return 0;
+		if (!valid_state(curr, this, new_bit,
+				 LOCK_USED_IN_HARDIRQ_READ))
+			return 0;
+		/*
+		 * just marked it hardirq-unsafe, check that no hardirq-safe
+		 * lock in the system ever took it in the past:
+		 */
+		if (!check_usage_backwards(curr, this,
+					   LOCK_USED_IN_HARDIRQ, "hard"))
+			return 0;
+#if STRICT_READ_CHECKS
+		/*
+		 * just marked it hardirq-unsafe, check that no
+		 * hardirq-safe-read lock in the system ever took
+		 * it in the past:
+		 */
+		if (!check_usage_backwards(curr, this,
+				   LOCK_USED_IN_HARDIRQ_READ, "hard-read"))
+			return 0;
+#endif
+		debug_atomic_inc(&nr_hardirq_unsafe_locks);
+		if (hardirq_verbose(this->type))
+			ret = 2;
+		break;
+	case LOCK_ENABLED_SOFTIRQS:
+		if (!valid_state(curr, this, new_bit, LOCK_USED_IN_SOFTIRQ))
+			return 0;
+		if (!valid_state(curr, this, new_bit,
+				 LOCK_USED_IN_SOFTIRQ_READ))
+			return 0;
+		/*
+		 * just marked it softirq-unsafe, check that no softirq-safe
+		 * lock in the system ever took it in the past:
+		 */
+		if (!check_usage_backwards(curr, this,
+					   LOCK_USED_IN_SOFTIRQ, "soft"))
+			return 0;
+#if STRICT_READ_CHECKS
+		/*
+		 * just marked it softirq-unsafe, check that no
+		 * softirq-safe-read lock in the system ever took
+		 * it in the past:
+		 */
+		if (!check_usage_backwards(curr, this,
+				   LOCK_USED_IN_SOFTIRQ_READ, "soft-read"))
+			return 0;
+#endif
+		debug_atomic_inc(&nr_softirq_unsafe_locks);
+		if (softirq_verbose(this->type))
+			ret = 2;
+		break;
+	case LOCK_ENABLED_HARDIRQS_READ:
+		if (!valid_state(curr, this, new_bit, LOCK_USED_IN_HARDIRQ))
+			return 0;
+#if STRICT_READ_CHECKS
+		/*
+		 * just marked it hardirq-read-unsafe, check that no
+		 * hardirq-safe lock in the system ever took it in the past:
+		 */
+		if (!check_usage_backwards(curr, this,
+					   LOCK_USED_IN_HARDIRQ, "hard"))
+			return 0;
+#endif
+		debug_atomic_inc(&nr_hardirq_read_unsafe_locks);
+		if (hardirq_verbose(this->type))
+			ret = 2;
+		break;
+	case LOCK_ENABLED_SOFTIRQS_READ:
+		if (!valid_state(curr, this, new_bit, LOCK_USED_IN_SOFTIRQ))
+			return 0;
+#if STRICT_READ_CHECKS
+		/*
+		 * just marked it softirq-read-unsafe, check that no
+		 * softirq-safe lock in the system ever took it in the past:
+		 */
+		if (!check_usage_backwards(curr, this,
+					   LOCK_USED_IN_SOFTIRQ, "soft"))
+			return 0;
+#endif
+		debug_atomic_inc(&nr_softirq_read_unsafe_locks);
+		if (softirq_verbose(this->type))
+			ret = 2;
+		break;
+#endif
+	case LOCK_USED:
+		/*
+		 * Add it to the global list of types:
+		 */
+		list_add_tail_rcu(&this->type->lock_entry, &all_lock_types);
+		debug_atomic_dec(&nr_unused_locks);
+		break;
+	default:
+		debug_locks_off();
+		WARN_ON(1);
+		return 0;
+	}
+
+	__raw_spin_unlock(&hash_lock);
+
+	/*
+	 * We must printk outside of the hash_lock:
+	 */
+	if (ret == 2) {
+		printk("\nmarked lock as {%s}:\n", usage_str[new_bit]);
+		print_lock(this);
+		print_irqtrace_events(curr);
+		dump_stack();
+	}
+
+	return ret;
+}
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+/*
+ * Mark all held locks with a usage bit:
+ */
+static int
+mark_held_locks(struct task_struct *curr, int hardirq, unsigned long ip)
+{
+	enum lock_usage_bit usage_bit;
+	struct held_lock *hlock;
+	int i;
+
+	for (i = 0; i < curr->lockdep_depth; i++) {
+		hlock = curr->held_locks + i;
+
+		if (hardirq) {
+			if (hlock->read)
+				usage_bit = LOCK_ENABLED_HARDIRQS_READ;
+			else
+				usage_bit = LOCK_ENABLED_HARDIRQS;
+		} else {
+			if (hlock->read)
+				usage_bit = LOCK_ENABLED_SOFTIRQS_READ;
+			else
+				usage_bit = LOCK_ENABLED_SOFTIRQS;
+		}
+		if (!mark_lock(curr, hlock, usage_bit, ip))
+			return 0;
+	}
+
+	return 1;
+}
+
+/*
+ * Debugging helper: via this flag we know that we are in
+ * 'early bootup code', and will warn about any invalid irqs-on event:
+ */
+static int early_boot_irqs_enabled;
+
+void early_boot_irqs_off(void)
+{
+	early_boot_irqs_enabled = 0;
+}
+
+void early_boot_irqs_on(void)
+{
+	early_boot_irqs_enabled = 1;
+}
+
+/*
+ * Hardirqs will be enabled:
+ */
+void trace_hardirqs_on(void)
+{
+	struct task_struct *curr = current;
+	unsigned long ip;
+
+	if (unlikely(!debug_locks))
+		return;
+
+	if (DEBUG_WARN_ON(unlikely(!early_boot_irqs_enabled)))
+		return;
+
+	if (unlikely(curr->hardirqs_enabled)) {
+		debug_atomic_inc(&redundant_hardirqs_on);
+		return;
+	}
+	/* we'll do an OFF -> ON transition: */
+	curr->hardirqs_enabled = 1;
+	ip = (unsigned long) __builtin_return_address(0);
+
+	if (DEBUG_WARN_ON(!irqs_disabled()))
+		return;
+	if (DEBUG_WARN_ON(current->hardirq_context))
+		return;
+	/*
+	 * We are going to turn hardirqs on, so set the
+	 * usage bit for all held locks:
+	 */
+	if (!mark_held_locks(curr, 1, ip))
+		return;
+	/*
+	 * If we have softirqs enabled, then set the usage
+	 * bit for all held locks. (disabled hardirqs prevented
+	 * this bit from being set before)
+	 */
+	if (curr->softirqs_enabled)
+		if (!mark_held_locks(curr, 0, ip))
+			return;
+
+	curr->hardirq_enable_ip = ip;
+	curr->hardirq_enable_event = ++curr->irq_events;
+	debug_atomic_inc(&hardirqs_on_events);
+}
+
+EXPORT_SYMBOL(trace_hardirqs_on);
+
+/*
+ * Hardirqs were disabled:
+ */
+void trace_hardirqs_off(void)
+{
+	struct task_struct *curr = current;
+
+	if (unlikely(!debug_locks))
+		return;
+
+	if (DEBUG_WARN_ON(!irqs_disabled()))
+		return;
+
+	if (curr->hardirqs_enabled) {
+		/*
+		 * We have done an ON -> OFF transition:
+		 */
+		curr->hardirqs_enabled = 0;
+		curr->hardirq_disable_ip = _RET_IP_;
+		curr->hardirq_disable_event = ++curr->irq_events;
+		debug_atomic_inc(&hardirqs_off_events);
+	} else
+		debug_atomic_inc(&redundant_hardirqs_off);
+}
+
+EXPORT_SYMBOL(trace_hardirqs_off);
+
+/*
+ * Softirqs will be enabled:
+ */
+void trace_softirqs_on(unsigned long ip)
+{
+	struct task_struct *curr = current;
+
+	if (unlikely(!debug_locks))
+		return;
+
+	if (DEBUG_WARN_ON(!irqs_disabled()))
+		return;
+
+	if (curr->softirqs_enabled) {
+		debug_atomic_inc(&redundant_softirqs_on);
+		return;
+	}
+
+	/*
+	 * We'll do an OFF -> ON transition:
+	 */
+	curr->softirqs_enabled = 1;
+	curr->softirq_enable_ip = ip;
+	curr->softirq_enable_event = ++curr->irq_events;
+	debug_atomic_inc(&softirqs_on_events);
+	/*
+	 * We are going to turn softirqs on, so set the
+	 * usage bit for all held locks, if hardirqs are
+	 * enabled too:
+	 */
+	if (curr->hardirqs_enabled)
+		mark_held_locks(curr, 0, ip);
+}
+
+/*
+ * Softirqs were disabled:
+ */
+void trace_softirqs_off(unsigned long ip)
+{
+	struct task_struct *curr = current;
+
+	if (unlikely(!debug_locks))
+		return;
+
+	if (DEBUG_WARN_ON(!irqs_disabled()))
+		return;
+
+	if (curr->softirqs_enabled) {
+		/*
+		 * We have done an ON -> OFF transition:
+		 */
+		curr->softirqs_enabled = 0;
+		curr->softirq_disable_ip = ip;
+		curr->softirq_disable_event = ++curr->irq_events;
+		debug_atomic_inc(&softirqs_off_events);
+		DEBUG_WARN_ON(!softirq_count());
+	} else
+		debug_atomic_inc(&redundant_softirqs_off);
+}
+
+#endif
+
+/*
+ * Initialize a lock instance's lock-type mapping info:
+ */
+void lockdep_init_map(struct lockdep_map *lock, const char *name,
+		      struct lockdep_type_key *key)
+{
+	if (unlikely(!debug_locks))
+		return;
+
+	if (DEBUG_WARN_ON(!key))
+		return;
+
+	/*
+	 * Sanity check, the lock-type key must be persistent:
+	 */
+	if (!static_obj(key)) {
+		printk("BUG: key %p not in .data!\n", key);
+		DEBUG_WARN_ON(1);
+		return;
+	}
+	lock->name = name;
+	lock->key = key;
+	memset(lock->type, 0, sizeof(lock->type[0])*MAX_LOCKDEP_SUBTYPES);
+}
+
+EXPORT_SYMBOL_GPL(lockdep_init_map);
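+
+/*
+ * A sketch of how lock-init wrappers are expected to feed this API
+ * (the wrapper and field names below are illustrative, not part of
+ * this patch): each init site provides one static key, which then
+ * identifies the lock's type:
+ *
+ *	#define my_lock_init(lock)					\
+ *	do {								\
+ *		static struct lockdep_type_key __key;			\
+ *									\
+ *		lockdep_init_map(&(lock)->dep_map, #lock, &__key);	\
+ *	} while (0)
+ */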
+
+/*
+ * This gets called for every mutex_lock*()/spin_lock*() operation.
+ * We maintain the dependency maps and validate the locking attempt:
+ */
+static int __lockdep_acquire(struct lockdep_map *lock, unsigned int subtype,
+			     int trylock, int read, int hardirqs_off,
+			     unsigned long ip)
+{
+	struct task_struct *curr = current;
+	struct held_lock *hlock;
+	struct lock_type *type;
+	unsigned int depth, id;
+	int chain_head = 0;
+	u64 chain_key;
+
+	if (unlikely(!debug_locks))
+		return 0;
+
+	if (DEBUG_WARN_ON(!irqs_disabled()))
+		return 0;
+
+	if (unlikely(subtype >= MAX_LOCKDEP_SUBTYPES)) {
+		debug_locks_off();
+		printk("BUG: MAX_LOCKDEP_SUBTYPES too low!\n");
+		printk("turning off the locking correctness validator.\n");
+		return 0;
+	}
+
+	type = lock->type[subtype];
+	/* not cached yet? */
+	if (unlikely(!type)) {
+		type = register_lock_type(lock, subtype);
+		if (!type)
+			return 0;
+	}
+	debug_atomic_inc((atomic_t *)&type->ops);
+
+	/*
+	 * Add the lock to the list of currently held locks.
+	 * (we don't increase the depth just yet, up until the
+	 * dependency checks are done)
+	 */
+	depth = curr->lockdep_depth;
+	if (DEBUG_WARN_ON(depth >= MAX_LOCK_DEPTH))
+		return 0;
+
+	hlock = curr->held_locks + depth;
+
+	hlock->type = type;
+	hlock->acquire_ip = ip;
+	hlock->instance = lock;
+	hlock->trylock = trylock;
+	hlock->read = read;
+	hlock->hardirqs_off = hardirqs_off;
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+	/*
+	 * If non-trylock use in a hardirq or softirq context, then
+	 * mark the lock as used in these contexts:
+	 */
+	if (!trylock) {
+		if (read) {
+			if (curr->hardirq_context)
+				if (!mark_lock(curr, hlock,
+						LOCK_USED_IN_HARDIRQ_READ, ip))
+					return 0;
+			if (curr->softirq_context)
+				if (!mark_lock(curr, hlock,
+						LOCK_USED_IN_SOFTIRQ_READ, ip))
+					return 0;
+		} else {
+			if (curr->hardirq_context)
+				if (!mark_lock(curr, hlock, LOCK_USED_IN_HARDIRQ, ip))
+					return 0;
+			if (curr->softirq_context)
+				if (!mark_lock(curr, hlock, LOCK_USED_IN_SOFTIRQ, ip))
+					return 0;
+		}
+	}
+	if (!hardirqs_off) {
+		if (read) {
+			if (!mark_lock(curr, hlock,
+					LOCK_ENABLED_HARDIRQS_READ, ip))
+				return 0;
+			if (curr->softirqs_enabled)
+				if (!mark_lock(curr, hlock,
+						LOCK_ENABLED_SOFTIRQS_READ, ip))
+					return 0;
+		} else {
+			if (!mark_lock(curr, hlock,
+					LOCK_ENABLED_HARDIRQS, ip))
+				return 0;
+			if (curr->softirqs_enabled)
+				if (!mark_lock(curr, hlock,
+						LOCK_ENABLED_SOFTIRQS, ip))
+					return 0;
+		}
+	}
+#endif
+	/* mark it as used: */
+	if (!mark_lock(curr, hlock, LOCK_USED, ip))
+		return 0;
+	/*
+	 * Calculate the chain hash: it's the combined hash of all the
+	 * lock keys along the dependency chain. We save the hash value
+	 * at every step so that we can get the current hash easily
+	 * after unlock. The chain hash is then used to cache dependency
+	 * results.
+	 *
+	 * We use the 'key ID' (the type's index) to drive the hash,
+	 * as it is more compact than type->key itself.
+	 */
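+	/*
+	 * E.g. for a held-lock stack with type IDs id1, id2, id3 the
+	 * chain key is:
+	 *
+	 *   iterate_chain_key(iterate_chain_key(
+	 *			iterate_chain_key(0, id1), id2), id3)
+	 *
+	 * and hlock->prev_chain_key below saves the value as it was
+	 * before this lock's ID got mixed in.
+	 */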
+	id = type - lock_types;
+	if (DEBUG_WARN_ON(id >= MAX_LOCKDEP_KEYS))
+		return 0;
+
+	chain_key = curr->curr_chain_key;
+	if (!depth) {
+		if (DEBUG_WARN_ON(chain_key != 0))
+			return 0;
+		chain_head = 1;
+	}
+
+	hlock->prev_chain_key = chain_key;
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+	/*
+	 * Keep track of points where we cross into an interrupt context:
+	 */
+	hlock->irq_context = 2*(curr->hardirq_context ? 1 : 0) +
+				curr->softirq_context;
+	if (depth) {
+		struct held_lock *prev_hlock;
+
+		prev_hlock = curr->held_locks + depth-1;
+		/*
+		 * If we cross into another context, reset the
+		 * hash key (this also prevents the checking and the
+		 * adding of the dependency to 'prev'):
+		 */
+		if (prev_hlock->irq_context != hlock->irq_context) {
+			chain_key = 0;
+			chain_head = 1;
+		}
+	}
+#endif
+	chain_key = iterate_chain_key(chain_key, id);
+	curr->curr_chain_key = chain_key;
+
+	/*
+	 * Trylock needs to maintain the stack of held locks, but it
+	 * does not add new dependencies, because trylock can be done
+	 * in any order.
+	 *
+	 * We look up the chain_key and do the O(N^2) check and update of
+	 * the dependencies only if this is a new dependency chain.
+	 * (If lookup_chain_cache() returns with 1 it acquires
+	 * hash_lock for us)
+	 */
+	if (!trylock && lookup_chain_cache(chain_key)) {
+		/*
+		 * Check whether the last held lock:
+		 *
+		 * - is hardirq-safe, if this lock is hardirq-unsafe
+		 * - is softirq-safe, if this lock is softirq-unsafe
+		 *
+		 * And check whether the new lock's dependency graph
+		 * could lead back to the previous lock.
+		 *
+		 * Any of these scenarios could lead to a deadlock. If
+		 * all validations pass, the new dependencies get added.
+		 */
+		int ret = check_deadlock(curr, hlock, lock, read);
+
+		if (!ret)
+			return 0;
+		/*
+		 * Mark recursive read, as we jump over it when
+		 * building dependencies (just like we jump over
+		 * trylock entries):
+		 */
+		if (ret == 2)
+			hlock->read = 2;
+		/*
+		 * Add dependency only if this lock is not the head
+		 * of the chain, and if it's not a secondary read-lock:
+		 */
+		if (!chain_head && ret != 2)
+			if (!check_prevs_add(curr, hlock))
+				return 0;
+		__raw_spin_unlock(&hash_lock);
+	}
+	curr->lockdep_depth++;
+	check_chain_key(curr);
+	if (unlikely(curr->lockdep_depth >= MAX_LOCK_DEPTH)) {
+		debug_locks_off();
+		printk("BUG: MAX_LOCK_DEPTH too low!\n");
+		printk("turning off the locking correctness validator.\n");
+		return 0;
+	}
+	if (unlikely(curr->lockdep_depth > max_lockdep_depth))
+		max_lockdep_depth = curr->lockdep_depth;
+
+	return 1;
+}
+
+static int
+print_unlock_order_bug(struct task_struct *curr, struct lockdep_map *lock,
+		       struct held_lock *hlock, unsigned long ip)
+{
+	debug_locks_off();
+	if (debug_locks_silent)
+		return 0;
+
+	printk("\n======================================\n");
+	printk(  "[ BUG: bad unlock ordering detected! ]\n");
+	printk(  "--------------------------------------\n");
+	printk("%s/%d is trying to release lock (",
+		curr->comm, curr->pid);
+	print_lockdep_cache(lock);
+	printk(") at:\n");
+	printk_sym(ip);
+	printk("but the next lock to release is:\n");
+	print_lock(hlock);
+	printk("\nother info that might help us debug this:\n");
+	lockdep_print_held_locks(curr);
+
+	printk("\nstack backtrace:\n");
+	dump_stack();
+
+	return 0;
+}
+
+static int
+print_unlock_inbalance_bug(struct task_struct *curr, struct lockdep_map *lock,
+			   unsigned long ip)
+{
+	debug_locks_off();
+	if (debug_locks_silent)
+		return 0;
+
+	printk("\n=====================================\n");
+	printk(  "[ BUG: bad unlock balance detected! ]\n");
+	printk(  "-------------------------------------\n");
+	printk("%s/%d is trying to release lock (",
+		curr->comm, curr->pid);
+	print_lockdep_cache(lock);
+	printk(") at:\n");
+	printk_sym(ip);
+	printk("but there are no more locks to release!\n");
+	printk("\nother info that might help us debug this:\n");
+	lockdep_print_held_locks(curr);
+
+	printk("\nstack backtrace:\n");
+	dump_stack();
+
+	return 0;
+}
+
+/*
+ * Common debugging checks for both nested and non-nested unlock:
+ */
+static int check_unlock(struct task_struct *curr, struct lockdep_map *lock,
+			unsigned long ip)
+{
+	if (unlikely(!debug_locks))
+		return 0;
+	if (DEBUG_WARN_ON(!irqs_disabled()))
+		return 0;
+
+	if (curr->lockdep_depth <= 0)
+		return print_unlock_inbalance_bug(curr, lock, ip);
+
+	return 1;
+}
+
+/*
+ * Remove the lock from the list of currently held locks - this gets
+ * called on mutex_unlock()/spin_unlock*() (or on a failed
+ * mutex_lock_interruptible()). This is done for unlocks that nest
+ * perfectly. (i.e. the current top of the lock-stack is unlocked)
+ */
+static int lockdep_release_nested(struct task_struct *curr,
+				  struct lockdep_map *lock, unsigned long ip)
+{
+	struct held_lock *hlock;
+	unsigned int depth;
+
+	/*
+	 * Pop off the top of the lock stack:
+	 */
+	depth = --curr->lockdep_depth;
+	hlock = curr->held_locks + depth;
+
+	if (hlock->instance != lock)
+		return print_unlock_order_bug(curr, lock, hlock, ip);
+
+	if (DEBUG_WARN_ON(!depth && (hlock->prev_chain_key != 0)))
+		return 0;
+
+	curr->curr_chain_key = hlock->prev_chain_key;
+
+#ifdef CONFIG_DEBUG_LOCKDEP
+	hlock->prev_chain_key = 0;
+	hlock->type = NULL;
+	hlock->acquire_ip = 0;
+	hlock->irq_context = 0;
+#endif
+	return 1;
+}
+
+/*
+ * Remove the lock from the list of currently held locks in a
+ * potentially non-nested (out of order) manner. This is a
+ * relatively rare operation, as all the unlock APIs default
+ * to nested mode (which uses lockdep_release()):
+ */
+static int
+lockdep_release_non_nested(struct task_struct *curr,
+			   struct lockdep_map *lock, unsigned long ip)
+{
+	struct held_lock *hlock, *prev_hlock;
+	unsigned int depth;
+	int i;
+
+	/*
+	 * Check whether the lock exists in the current stack
+	 * of held locks:
+	 */
+	depth = curr->lockdep_depth;
+	if (DEBUG_WARN_ON(!depth))
+		return 0;
+
+	prev_hlock = NULL;
+	for (i = depth-1; i >= 0; i--) {
+		hlock = curr->held_locks + i;
+		/*
+		 * We must not cross into another context:
+		 */
+		if (prev_hlock && prev_hlock->irq_context != hlock->irq_context)
+			break;
+		if (hlock->instance == lock)
+			goto found_it;
+		prev_hlock = hlock;
+	}
+	return print_unlock_inbalance_bug(curr, lock, ip);
+
+found_it:
+	/*
+	 * We have the right lock to unlock, 'hlock' points to it.
+	 * Now we remove it from the stack, and add back the other
+	 * entries (if any), recalculating the hash along the way:
+	 */
+	curr->lockdep_depth = i;
+	curr->curr_chain_key = hlock->prev_chain_key;
+
+	for (i++; i < depth; i++) {
+		hlock = curr->held_locks + i;
+		if (!__lockdep_acquire(hlock->instance,
+			hlock->type->subtype, hlock->trylock,
+				hlock->read, hlock->hardirqs_off,
+				hlock->acquire_ip))
+			return 0;
+	}
+
+	if (DEBUG_WARN_ON(curr->lockdep_depth != depth - 1))
+		return 0;
+	return 1;
+}
+
+/*
+ * Remove the lock from the list of currently held locks - this gets
+ * called on mutex_unlock()/spin_unlock*() (or on a failed
+ * mutex_lock_interruptible()). Depending on the 'nested' argument
+ * this handles both perfectly nested and out-of-order unlocks.
+ */
+static void __lockdep_release(struct lockdep_map *lock, int nested,
+			      unsigned long ip)
+{
+	struct task_struct *curr = current;
+
+	if (!check_unlock(curr, lock, ip))
+		return;
+
+	if (nested) {
+		if (!lockdep_release_nested(curr, lock, ip))
+			return;
+	} else {
+		if (!lockdep_release_non_nested(curr, lock, ip))
+			return;
+	}
+
+	check_chain_key(curr);
+}
+
+/*
+ * Check whether we follow the irq-flags state precisely:
+ */
+static void check_flags(unsigned long flags)
+{
+#if defined(CONFIG_DEBUG_LOCKDEP) && defined(CONFIG_TRACE_IRQFLAGS)
+	if (!debug_locks)
+		return;
+
+	if (irqs_disabled_flags(flags))
+		DEBUG_WARN_ON(current->hardirqs_enabled);
+	else
+		DEBUG_WARN_ON(!current->hardirqs_enabled);
+
+	/*
+	 * We don't accurately track softirq state in e.g.
+	 * hardirq contexts (such as on 4KSTACKS), so only
+	 * check if not in hardirq contexts:
+	 */
+	if (!hardirq_count()) {
+		if (softirq_count())
+			DEBUG_WARN_ON(current->softirqs_enabled);
+		else
+			DEBUG_WARN_ON(!current->softirqs_enabled);
+	}
+
+	if (!debug_locks)
+		print_irqtrace_events(current);
+#endif
+}
+
+/*
+ * We are not always called with irqs disabled - do that here,
+ * and also avoid lockdep recursion:
+ */
+void lockdep_acquire(struct lockdep_map *lock, unsigned int subtype,
+		     int trylock, int read, unsigned long ip)
+{
+	unsigned long flags;
+
+	if (LOCKDEP_OFF)
+		return;
+
+	raw_local_irq_save(flags);
+	check_flags(flags);
+
+	if (unlikely(current->lockdep_recursion))
+		goto out;
+	current->lockdep_recursion = 1;
+	__lockdep_acquire(lock, subtype, trylock, read, irqs_disabled_flags(flags), ip);
+	current->lockdep_recursion = 0;
+out:
+	raw_local_irq_restore(flags);
+}
+
+EXPORT_SYMBOL_GPL(lockdep_acquire);
+
+void lockdep_release(struct lockdep_map *lock, int nested, unsigned long ip)
+{
+	unsigned long flags;
+
+	if (LOCKDEP_OFF)
+		return;
+
+	raw_local_irq_save(flags);
+	check_flags(flags);
+	if (unlikely(current->lockdep_recursion))
+		goto out;
+	current->lockdep_recursion = 1;
+	__lockdep_release(lock, nested, ip);
+	current->lockdep_recursion = 0;
+out:
+	raw_local_irq_restore(flags);
+}
+
+EXPORT_SYMBOL_GPL(lockdep_release);
+
+/*
+ * Used by the testsuite, sanitize the validator state
+ * after a simulated failure:
+ */
+
+void lockdep_reset(void)
+{
+	unsigned long flags;
+
+	raw_local_irq_save(flags);
+	current->curr_chain_key = 0;
+	current->lockdep_depth = 0;
+	current->lockdep_recursion = 0;
+	memset(current->held_locks, 0, MAX_LOCK_DEPTH*sizeof(struct held_lock));
+	nr_hardirq_chains = 0;
+	nr_softirq_chains = 0;
+	nr_process_chains = 0;
+	debug_locks = 1;
+	raw_local_irq_restore(flags);
+}
+
+static void zap_type(struct lock_type *type)
+{
+	int i;
+
+	/*
+	 * Remove all dependencies this lock is
+	 * involved in:
+	 */
+	for (i = 0; i < nr_list_entries; i++) {
+		if (list_entries[i].type == type)
+			list_del_rcu(&list_entries[i].entry);
+	}
+	/*
+	 * Unhash the type and remove it from the all_lock_types list:
+	 */
+	list_del_rcu(&type->hash_entry);
+	list_del_rcu(&type->lock_entry);
+}
+
+static inline int within(void *addr, void *start, unsigned long size)
+{
+	return addr >= start && addr < start + size;
+}
+
+void lockdep_free_key_range(void *start, unsigned long size)
+{
+	struct lock_type *type, *next;
+	struct list_head *head;
+	unsigned long flags;
+	int i;
+
+	raw_local_irq_save(flags);
+	__raw_spin_lock(&hash_lock);
+
+	/*
+	 * Unhash all types that were created by this module:
+	 */
+	for (i = 0; i < TYPEHASH_SIZE; i++) {
+		head = typehash_table + i;
+		if (list_empty(head))
+			continue;
+		list_for_each_entry_safe(type, next, head, hash_entry)
+			if (within(type->key, start, size))
+				zap_type(type);
+	}
+
+	__raw_spin_unlock(&hash_lock);
+	raw_local_irq_restore(flags);
+}
+
+void lockdep_reset_lock(struct lockdep_map *lock)
+{
+	struct lock_type *type, *next, *entry;
+	struct list_head *head;
+	unsigned long flags;
+	int i, j;
+
+	raw_local_irq_save(flags);
+	__raw_spin_lock(&hash_lock);
+
+	/*
+	 * Remove all types this lock has:
+	 */
+	for (i = 0; i < TYPEHASH_SIZE; i++) {
+		head = typehash_table + i;
+		if (list_empty(head))
+			continue;
+		list_for_each_entry_safe(type, next, head, hash_entry) {
+			for (j = 0; j < MAX_LOCKDEP_SUBTYPES; j++) {
+				entry = lock->type[j];
+				if (type == entry) {
+					zap_type(type);
+					lock->type[j] = NULL;
+					break;
+				}
+			}
+		}
+	}
+
+	/*
+	 * Debug check: in the end all mapped types should
+	 * be gone.
+	 */
+	for (j = 0; j < MAX_LOCKDEP_SUBTYPES; j++) {
+		entry = lock->type[j];
+		if (!entry)
+			continue;
+		__raw_spin_unlock(&hash_lock);
+		DEBUG_WARN_ON(1);
+		raw_local_irq_restore(flags);
+		return;
+	}
+
+	__raw_spin_unlock(&hash_lock);
+	raw_local_irq_restore(flags);
+}
+
+void __init lockdep_init(void)
+{
+	int i;
+
+	/*
+	 * Some architectures have their own start_kernel()
+	 * code which calls lockdep_init(), while we also
+	 * call lockdep_init() from the start_kernel() itself,
+	 * and we want to initialize the hashes only once:
+	 */
+	if (lockdep_initialized)
+		return;
+
+	for (i = 0; i < TYPEHASH_SIZE; i++)
+		INIT_LIST_HEAD(typehash_table + i);
+
+	for (i = 0; i < CHAINHASH_SIZE; i++)
+		INIT_LIST_HEAD(chainhash_table + i);
+
+	lockdep_initialized = 1;
+}
+
+void __init lockdep_info(void)
+{
+	printk("Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar\n");
+
+	printk("... MAX_LOCKDEP_SUBTYPES:    %lu\n", MAX_LOCKDEP_SUBTYPES);
+	printk("... MAX_LOCK_DEPTH:          %lu\n", MAX_LOCK_DEPTH);
+	printk("... MAX_LOCKDEP_KEYS:        %lu\n", MAX_LOCKDEP_KEYS);
+	printk("... TYPEHASH_SIZE:           %lu\n", TYPEHASH_SIZE);
+	printk("... MAX_LOCKDEP_ENTRIES:     %lu\n", MAX_LOCKDEP_ENTRIES);
+	printk("... MAX_LOCKDEP_CHAINS:      %lu\n", MAX_LOCKDEP_CHAINS);
+	printk("... CHAINHASH_SIZE:          %lu\n", CHAINHASH_SIZE);
+
+	printk(" memory used by lock dependency info: %lu kB\n",
+		(sizeof(struct lock_type) * MAX_LOCKDEP_KEYS +
+		sizeof(struct list_head) * TYPEHASH_SIZE +
+		sizeof(struct lock_list) * MAX_LOCKDEP_ENTRIES +
+		sizeof(struct lock_chain) * MAX_LOCKDEP_CHAINS +
+		sizeof(struct list_head) * CHAINHASH_SIZE) / 1024);
+
+	printk(" per task-struct memory footprint: %lu bytes\n",
+		sizeof(struct held_lock) * MAX_LOCK_DEPTH);
+
+#ifdef CONFIG_DEBUG_LOCKDEP
+	if (lockdep_init_error)
+		printk("WARNING: lockdep init error! Arch code didnt call lockdep_init() early enough?\n");
+#endif
+}
+
Index: linux/kernel/lockdep_internals.h
===================================================================
--- /dev/null
+++ linux/kernel/lockdep_internals.h
@@ -0,0 +1,93 @@
+/*
+ * kernel/lockdep_internals.h
+ *
+ * Runtime locking correctness validator
+ *
+ * lockdep subsystem internal functions and variables.
+ */
+
+/*
+ * MAX_LOCKDEP_ENTRIES is the maximum number of lock dependencies
+ * we track.
+ *
+ * We use the per-lock dependency maps in two ways: we grow them by adding
+ * every to-be-taken lock to every currently held lock's own dependency
+ * table (if it's not there yet), and we check them for lock order
+ * conflicts and deadlocks.
+ */
+#define MAX_LOCKDEP_ENTRIES	8192UL
+
+#define MAX_LOCKDEP_KEYS_BITS	11
+#define MAX_LOCKDEP_KEYS	(1UL << MAX_LOCKDEP_KEYS_BITS)
+
+#define MAX_LOCKDEP_CHAINS_BITS	13
+#define MAX_LOCKDEP_CHAINS	(1UL << MAX_LOCKDEP_CHAINS_BITS)
+
+/*
+ * Stack-trace: tightly packed array of stack backtrace
+ * addresses. Protected by the hash_lock.
+ */
+#define MAX_STACK_TRACE_ENTRIES	131072UL
+
+extern struct list_head all_lock_types;
+
+extern void
+get_usage_chars(struct lock_type *type, char *c1, char *c2, char *c3, char *c4);
+
+extern const char * __get_key_name(struct lockdep_subtype_key *key, char *str);
+
+extern unsigned long nr_lock_types;
+extern unsigned long nr_list_entries;
+extern unsigned long nr_lock_chains;
+extern unsigned long nr_stack_trace_entries;
+
+extern unsigned int nr_hardirq_chains;
+extern unsigned int nr_softirq_chains;
+extern unsigned int nr_process_chains;
+extern unsigned int max_lockdep_depth;
+extern unsigned int max_recursion_depth;
+
+#ifdef CONFIG_DEBUG_LOCKDEP
+/*
+ * We cannot printk in early bootup code. Even early_printk()
+ * might not work. So we mark any initialization errors and printk
+ * about them later on, in lockdep_info().
+ */
+extern int lockdep_init_error;
+
+/*
+ * Various lockdep statistics:
+ */
+extern atomic_t chain_lookup_hits;
+extern atomic_t chain_lookup_misses;
+extern atomic_t hardirqs_on_events;
+extern atomic_t hardirqs_off_events;
+extern atomic_t redundant_hardirqs_on;
+extern atomic_t redundant_hardirqs_off;
+extern atomic_t softirqs_on_events;
+extern atomic_t softirqs_off_events;
+extern atomic_t redundant_softirqs_on;
+extern atomic_t redundant_softirqs_off;
+extern atomic_t nr_unused_locks;
+extern atomic_t nr_hardirq_safe_locks;
+extern atomic_t nr_softirq_safe_locks;
+extern atomic_t nr_hardirq_unsafe_locks;
+extern atomic_t nr_softirq_unsafe_locks;
+extern atomic_t nr_hardirq_read_safe_locks;
+extern atomic_t nr_softirq_read_safe_locks;
+extern atomic_t nr_hardirq_read_unsafe_locks;
+extern atomic_t nr_softirq_read_unsafe_locks;
+extern atomic_t nr_cyclic_checks;
+extern atomic_t nr_cyclic_check_recursions;
+extern atomic_t nr_find_usage_forwards_checks;
+extern atomic_t nr_find_usage_forwards_recursions;
+extern atomic_t nr_find_usage_backwards_checks;
+extern atomic_t nr_find_usage_backwards_recursions;
+# define debug_atomic_inc(ptr)		atomic_inc(ptr)
+# define debug_atomic_dec(ptr)		atomic_dec(ptr)
+# define debug_atomic_read(ptr)		atomic_read(ptr)
+#else
+# define debug_atomic_inc(ptr)		do { } while (0)
+# define debug_atomic_dec(ptr)		do { } while (0)
+# define debug_atomic_read(ptr)		0
+#endif
Index: linux/kernel/module.c
===================================================================
--- linux.orig/kernel/module.c
+++ linux/kernel/module.c
@@ -1151,6 +1151,9 @@ static void free_module(struct module *m
 	if (mod->percpu)
 		percpu_modfree(mod->percpu);
 
+	/* Free lock-types: */
+	lockdep_free_key_range(mod->module_core, mod->core_size);
+
 	/* Finally, free the core (containing the module structure) */
 	module_free(mod, mod->module_core);
 }
Index: linux/lib/Kconfig.debug
===================================================================
--- linux.orig/lib/Kconfig.debug
+++ linux/lib/Kconfig.debug
@@ -57,7 +57,7 @@ config DEBUG_KERNEL
 config LOG_BUF_SHIFT
 	int "Kernel log buffer size (16 => 64KB, 17 => 128KB)" if DEBUG_KERNEL
 	range 12 21
-	default 17 if S390
+	default 17 if S390 || LOCKDEP
 	default 16 if X86_NUMAQ || IA64
 	default 15 if SMP
 	default 14
Index: linux/lib/locking-selftest.c
===================================================================
--- linux.orig/lib/locking-selftest.c
+++ linux/lib/locking-selftest.c
@@ -15,6 +15,7 @@
 #include <linux/sched.h>
 #include <linux/delay.h>
 #include <linux/module.h>
+#include <linux/lockdep.h>
 #include <linux/spinlock.h>
 #include <linux/kallsyms.h>
 #include <linux/interrupt.h>
@@ -872,9 +873,6 @@ GENERATE_PERMUTATIONS_3_EVENTS(irq_read_
 #include "locking-selftest-softirq.h"
 // GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion2_soft)
 
-#define lockdep_reset()
-#define lockdep_reset_lock(x)
-
 #ifdef CONFIG_PROVE_SPIN_LOCKING
 # define I_SPINLOCK(x)	lockdep_reset_lock(&lock_##x.dep_map)
 #else

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 24/61] lock validator: procfs
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (22 preceding siblings ...)
  2006-05-29 21:25 ` [patch 23/61] lock validator: core Ingo Molnar
@ 2006-05-29 21:25 ` Ingo Molnar
  2006-05-29 21:25 ` [patch 25/61] lock validator: design docs Ingo Molnar
                   ` (49 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

lock validator /proc/lockdep and /proc/lockdep_stats support.
(FIXME: should go into debugfs)
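
Each /proc/lockdep line is assembled by l_show() below from the type's
key, an optional OPS counter, the forward/backward dependency counts,
the four usage characters and the name; schematically (field names, not
real output):

  <type key> [OPS:<nr>] FD:<forward deps> BD:<backward deps> <usage> : <name>[#<version>][/<subtype>]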

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 kernel/Makefile       |    3 
 kernel/lockdep_proc.c |  345 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 348 insertions(+)

Index: linux/kernel/Makefile
===================================================================
--- linux.orig/kernel/Makefile
+++ linux/kernel/Makefile
@@ -13,6 +13,9 @@ obj-y     = sched.o fork.o exec_domain.o
 obj-y += time/
 obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o
 obj-$(CONFIG_LOCKDEP) += lockdep.o
+ifeq ($(CONFIG_PROC_FS),y)
+obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
+endif
 obj-$(CONFIG_FUTEX) += futex.o
 ifeq ($(CONFIG_COMPAT),y)
 obj-$(CONFIG_FUTEX) += futex_compat.o
Index: linux/kernel/lockdep_proc.c
===================================================================
--- /dev/null
+++ linux/kernel/lockdep_proc.c
@@ -0,0 +1,345 @@
+/*
+ * kernel/lockdep_proc.c
+ *
+ * Runtime locking correctness validator
+ *
+ * Started by Ingo Molnar:
+ *
+ *  Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
+ *
+ * Code for /proc/lockdep and /proc/lockdep_stats:
+ *
+ */
+#include <linux/sched.h>
+#include <linux/module.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/kallsyms.h>
+#include <linux/debug_locks.h>
+
+#include "lockdep_internals.h"
+
+static void *l_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct lock_type *type = v;
+
+	(*pos)++;
+
+	if (type->lock_entry.next != &all_lock_types)
+		type = list_entry(type->lock_entry.next, struct lock_type,
+				  lock_entry);
+	else
+		type = NULL;
+	m->private = type;
+
+	return type;
+}
+
+static void *l_start(struct seq_file *m, loff_t *pos)
+{
+	struct lock_type *type = m->private;
+
+	if (&type->lock_entry == all_lock_types.next)
+		seq_printf(m, "all lock types:\n");
+
+	return type;
+}
+
+static void l_stop(struct seq_file *m, void *v)
+{
+}
+
+static unsigned long count_forward_deps(struct lock_type *type)
+{
+	struct lock_list *entry;
+	unsigned long ret = 1;
+
+	/*
+	 * Recurse this type's dependency list:
+	 */
+	list_for_each_entry(entry, &type->locks_after, entry)
+		ret += count_forward_deps(entry->type);
+
+	return ret;
+}
+
+static unsigned long count_backward_deps(struct lock_type *type)
+{
+	struct lock_list *entry;
+	unsigned long ret = 1;
+
+	/*
+	 * Recurse this type's dependency list:
+	 */
+	list_for_each_entry(entry, &type->locks_before, entry)
+		ret += count_backward_deps(entry->type);
+
+	return ret;
+}
+
+static int l_show(struct seq_file *m, void *v)
+{
+	unsigned long nr_forward_deps, nr_backward_deps;
+	struct lock_type *type = m->private;
+	char str[128], c1, c2, c3, c4;
+	const char *name;
+
+	seq_printf(m, "%p", type->key);
+#ifdef CONFIG_DEBUG_LOCKDEP
+	seq_printf(m, " OPS:%8ld", type->ops);
+#endif
+	nr_forward_deps = count_forward_deps(type);
+	seq_printf(m, " FD:%5ld", nr_forward_deps);
+
+	nr_backward_deps = count_backward_deps(type);
+	seq_printf(m, " BD:%5ld", nr_backward_deps);
+
+	get_usage_chars(type, &c1, &c2, &c3, &c4);
+	seq_printf(m, " %c%c%c%c", c1, c2, c3, c4);
+
+	name = type->name;
+	if (!name) {
+		name = __get_key_name(type->key, str);
+		seq_printf(m, ": %s", name);
+	} else{
+		seq_printf(m, ": %s", name);
+		if (type->name_version > 1)
+			seq_printf(m, "#%d", type->name_version);
+		if (type->subtype)
+			seq_printf(m, "/%d", type->subtype);
+	}
+	seq_puts(m, "\n");
+
+	return 0;
+}
+
+static struct seq_operations lockdep_ops = {
+	.start	= l_start,
+	.next	= l_next,
+	.stop	= l_stop,
+	.show	= l_show,
+};
+
+static int lockdep_open(struct inode *inode, struct file *file)
+{
+	int res = seq_open(file, &lockdep_ops);
+	if (!res) {
+		struct seq_file *m = file->private_data;
+
+		if (!list_empty(&all_lock_types))
+			m->private = list_entry(all_lock_types.next,
+					struct lock_type, lock_entry);
+		else
+			m->private = NULL;
+	}
+	return res;
+}
+
+static struct file_operations proc_lockdep_operations = {
+	.open		= lockdep_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+static void lockdep_stats_debug_show(struct seq_file *m)
+{
+#ifdef CONFIG_DEBUG_LOCKDEP
+	unsigned int hi1 = debug_atomic_read(&hardirqs_on_events),
+		     hi2 = debug_atomic_read(&hardirqs_off_events),
+		     hr1 = debug_atomic_read(&redundant_hardirqs_on),
+		     hr2 = debug_atomic_read(&redundant_hardirqs_off),
+		     si1 = debug_atomic_read(&softirqs_on_events),
+		     si2 = debug_atomic_read(&softirqs_off_events),
+		     sr1 = debug_atomic_read(&redundant_softirqs_on),
+		     sr2 = debug_atomic_read(&redundant_softirqs_off);
+
+	seq_printf(m, " chain lookup misses:           %11u\n",
+		debug_atomic_read(&chain_lookup_misses));
+	seq_printf(m, " chain lookup hits:             %11u\n",
+		debug_atomic_read(&chain_lookup_hits));
+	seq_printf(m, " cyclic checks:                 %11u\n",
+		debug_atomic_read(&nr_cyclic_checks));
+	seq_printf(m, " cyclic-check recursions:       %11u\n",
+		debug_atomic_read(&nr_cyclic_check_recursions));
+	seq_printf(m, " find-mask forwards checks:     %11u\n",
+		debug_atomic_read(&nr_find_usage_forwards_checks));
+	seq_printf(m, " find-mask forwards recursions: %11u\n",
+		debug_atomic_read(&nr_find_usage_forwards_recursions));
+	seq_printf(m, " find-mask backwards checks:    %11u\n",
+		debug_atomic_read(&nr_find_usage_backwards_checks));
+	seq_printf(m, " find-mask backwards recursions:%11u\n",
+		debug_atomic_read(&nr_find_usage_backwards_recursions));
+
+	seq_printf(m, " hardirq on events:             %11u\n", hi1);
+	seq_printf(m, " hardirq off events:            %11u\n", hi2);
+	seq_printf(m, " redundant hardirq ons:         %11u\n", hr1);
+	seq_printf(m, " redundant hardirq offs:        %11u\n", hr2);
+	seq_printf(m, " softirq on events:             %11u\n", si1);
+	seq_printf(m, " softirq off events:            %11u\n", si2);
+	seq_printf(m, " redundant softirq ons:         %11u\n", sr1);
+	seq_printf(m, " redundant softirq offs:        %11u\n", sr2);
+#endif
+}
+
+static int lockdep_stats_show(struct seq_file *m, void *v)
+{
+	struct lock_type *type;
+	unsigned long nr_unused = 0, nr_uncategorized = 0,
+		      nr_irq_safe = 0, nr_irq_unsafe = 0,
+		      nr_softirq_safe = 0, nr_softirq_unsafe = 0,
+		      nr_hardirq_safe = 0, nr_hardirq_unsafe = 0,
+		      nr_irq_read_safe = 0, nr_irq_read_unsafe = 0,
+		      nr_softirq_read_safe = 0, nr_softirq_read_unsafe = 0,
+		      nr_hardirq_read_safe = 0, nr_hardirq_read_unsafe = 0,
+		      sum_forward_deps = 0, factor = 0;
+
+	list_for_each_entry(type, &all_lock_types, lock_entry) {
+
+		if (type->usage_mask == 0)
+			nr_unused++;
+		if (type->usage_mask == LOCKF_USED)
+			nr_uncategorized++;
+		if (type->usage_mask & LOCKF_USED_IN_IRQ)
+			nr_irq_safe++;
+		if (type->usage_mask & LOCKF_ENABLED_IRQS)
+			nr_irq_unsafe++;
+		if (type->usage_mask & LOCKF_USED_IN_SOFTIRQ)
+			nr_softirq_safe++;
+		if (type->usage_mask & LOCKF_ENABLED_SOFTIRQS)
+			nr_softirq_unsafe++;
+		if (type->usage_mask & LOCKF_USED_IN_HARDIRQ)
+			nr_hardirq_safe++;
+		if (type->usage_mask & LOCKF_ENABLED_HARDIRQS)
+			nr_hardirq_unsafe++;
+		if (type->usage_mask & LOCKF_USED_IN_IRQ_READ)
+			nr_irq_read_safe++;
+		if (type->usage_mask & LOCKF_ENABLED_IRQS_READ)
+			nr_irq_read_unsafe++;
+		if (type->usage_mask & LOCKF_USED_IN_SOFTIRQ_READ)
+			nr_softirq_read_safe++;
+		if (type->usage_mask & LOCKF_ENABLED_SOFTIRQS_READ)
+			nr_softirq_read_unsafe++;
+		if (type->usage_mask & LOCKF_USED_IN_HARDIRQ_READ)
+			nr_hardirq_read_safe++;
+		if (type->usage_mask & LOCKF_ENABLED_HARDIRQS_READ)
+			nr_hardirq_read_unsafe++;
+
+		sum_forward_deps += count_forward_deps(type);
+	}
+#ifdef CONFIG_DEBUG_LOCKDEP
+	DEBUG_WARN_ON(debug_atomic_read(&nr_unused_locks) != nr_unused);
+#endif
+	seq_printf(m, " lock-types:                    %11lu [max: %lu]\n",
+			nr_lock_types, MAX_LOCKDEP_KEYS);
+	seq_printf(m, " direct dependencies:           %11lu [max: %lu]\n",
+			nr_list_entries, MAX_LOCKDEP_ENTRIES);
+	seq_printf(m, " indirect dependencies:         %11lu\n",
+			sum_forward_deps);
+
+	/*
+	 * Total number of dependencies:
+	 *
+	 * All irq-safe locks may nest inside irq-unsafe locks,
+	 * plus all the other known dependencies:
+	 */
+	seq_printf(m, " all direct dependencies:       %11lu\n",
+			nr_irq_unsafe * nr_irq_safe +
+			nr_hardirq_unsafe * nr_hardirq_safe +
+			nr_list_entries);
+
+	/*
+	 * Estimated factor between direct and indirect
+	 * dependencies:
+	 */
+	if (nr_list_entries)
+		factor = sum_forward_deps / nr_list_entries;
+
+	seq_printf(m, " dependency chains:             %11lu [max: %lu]\n",
+			nr_lock_chains, MAX_LOCKDEP_CHAINS);
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+	seq_printf(m, " in-hardirq chains:             %11u\n",
+			nr_hardirq_chains);
+	seq_printf(m, " in-softirq chains:             %11u\n",
+			nr_softirq_chains);
+#endif
+	seq_printf(m, " in-process chains:             %11u\n",
+			nr_process_chains);
+	seq_printf(m, " stack-trace entries:           %11lu [max: %lu]\n",
+			nr_stack_trace_entries, MAX_STACK_TRACE_ENTRIES);
+	seq_printf(m, " combined max dependencies:     %11u\n",
+			(nr_hardirq_chains + 1) *
+			(nr_softirq_chains + 1) *
+			(nr_process_chains + 1)
+	);
+	seq_printf(m, " hardirq-safe locks:            %11lu\n",
+			nr_hardirq_safe);
+	seq_printf(m, " hardirq-unsafe locks:          %11lu\n",
+			nr_hardirq_unsafe);
+	seq_printf(m, " softirq-safe locks:            %11lu\n",
+			nr_softirq_safe);
+	seq_printf(m, " softirq-unsafe locks:          %11lu\n",
+			nr_softirq_unsafe);
+	seq_printf(m, " irq-safe locks:                %11lu\n",
+			nr_irq_safe);
+	seq_printf(m, " irq-unsafe locks:              %11lu\n",
+			nr_irq_unsafe);
+
+	seq_printf(m, " hardirq-read-safe locks:       %11lu\n",
+			nr_hardirq_read_safe);
+	seq_printf(m, " hardirq-read-unsafe locks:     %11lu\n",
+			nr_hardirq_read_unsafe);
+	seq_printf(m, " softirq-read-safe locks:       %11lu\n",
+			nr_softirq_read_safe);
+	seq_printf(m, " softirq-read-unsafe locks:     %11lu\n",
+			nr_softirq_read_unsafe);
+	seq_printf(m, " irq-read-safe locks:           %11lu\n",
+			nr_irq_read_safe);
+	seq_printf(m, " irq-read-unsafe locks:         %11lu\n",
+			nr_irq_read_unsafe);
+
+	seq_printf(m, " uncategorized locks:           %11lu\n",
+			nr_uncategorized);
+	seq_printf(m, " unused locks:                  %11lu\n",
+			nr_unused);
+	seq_printf(m, " max locking depth:             %11u\n",
+			max_lockdep_depth);
+	seq_printf(m, " max recursion depth:           %11u\n",
+			max_recursion_depth);
+	lockdep_stats_debug_show(m);
+	seq_printf(m, " debug_locks:                   %11u\n",
+			debug_locks);
+
+	return 0;
+}
+
+static int lockdep_stats_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, lockdep_stats_show, NULL);
+}
+
+static struct file_operations proc_lockdep_stats_operations = {
+	.open		= lockdep_stats_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+static int __init lockdep_proc_init(void)
+{
+	struct proc_dir_entry *entry;
+
+	entry = create_proc_entry("lockdep", S_IRUSR, NULL);
+	if (entry)
+		entry->proc_fops = &proc_lockdep_operations;
+
+	entry = create_proc_entry("lockdep_stats", S_IRUSR, NULL);
+	if (entry)
+		entry->proc_fops = &proc_lockdep_stats_operations;
+
+	return 0;
+}
+
+__initcall(lockdep_proc_init);
+

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 25/61] lock validator: design docs
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (23 preceding siblings ...)
  2006-05-29 21:25 ` [patch 24/61] lock validator: procfs Ingo Molnar
@ 2006-05-29 21:25 ` Ingo Molnar
  2006-05-30  9:07   ` Nikita Danilov
  2006-05-29 21:25 ` [patch 26/61] lock validator: prove rwsem locking correctness Ingo Molnar
                   ` (48 subsequent siblings)
  73 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

lock validator design documentation.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 Documentation/lockdep-design.txt |  224 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 224 insertions(+)

Index: linux/Documentation/lockdep-design.txt
===================================================================
--- /dev/null
+++ linux/Documentation/lockdep-design.txt
@@ -0,0 +1,224 @@
+Runtime locking correctness validator
+=====================================
+
+started by Ingo Molnar <mingo@redhat.com>
+additions by Arjan van de Ven <arjan@linux.intel.com>
+
+Lock-type
+---------
+
+The basic object the validator operates upon is the 'type' or 'class' of
+locks.
+
+A class of locks is a group of locks that are logically the same with
+respect to locking rules, even if the locks may have multiple (possibly
+tens of thousands of) instantiations. For example a lock in the inode
+struct is one class, while each inode has its own instantiation of that
+lock class.
+
+The validator tracks the 'state' of lock-types, and it tracks
+dependencies between different lock-types. The validator maintains a
+rolling proof that the state and the dependencies are correct.
+
+Unlike a lock instantiation, the lock-type itself never goes away: when
+a lock-type is used for the first time after bootup it gets registered,
+and all subsequent uses of that lock-type will be attached to this
+lock-type.
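+
+For illustration, all instances of the hypothetical structure below share
+one and the same lock-type, because they are all initialized from the
+same spin_lock_init() call site:
+
+struct my_inode {
+	spinlock_t lock;	/* thousands of instances, one lock-type */
+	/* ... */
+};
+
+static void my_inode_init(struct my_inode *inode)
+{
+	spin_lock_init(&inode->lock);	/* same init site => same lock-type */
+}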
+
+State
+-----
+
+The validator tracks lock-type usage history into 5 separate state bits:
+
+- 'ever held in hardirq context'                    [ == hardirq-safe   ]
+- 'ever held in softirq context'                    [ == softirq-safe   ]
+- 'ever held with hardirqs enabled'                 [ == hardirq-unsafe ]
+- 'ever held with softirqs and hardirqs enabled'    [ == softirq-unsafe ]
+
+- 'ever used'                                       [ == !unused        ]
+
+Single-lock state rules:
+------------------------
+
+A softirq-unsafe lock-type is automatically hardirq-unsafe as well. The
+following states are exclusive, and only one of them is allowed to be
+set for any lock-type:
+
+ <hardirq-safe> and <hardirq-unsafe>
+ <softirq-safe> and <softirq-unsafe>
+
+The validator detects and reports lock usage that violates these
+single-lock state rules.
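+
+For example, a hypothetical lock that is taken both from a hardirq handler
+and, elsewhere, with hardirqs enabled would end up marked both hardirq-safe
+and hardirq-unsafe, and the validator reports it:
+
+static DEFINE_SPINLOCK(mixed_lock);	/* hypothetical */
+
+static irqreturn_t my_irq_handler(int irq, void *dev_id, struct pt_regs *regs)
+{
+	spin_lock(&mixed_lock);		/* marks the type hardirq-safe */
+	/* ... */
+	spin_unlock(&mixed_lock);
+	return IRQ_HANDLED;
+}
+
+static void my_process_context_path(void)
+{
+	spin_lock(&mixed_lock);		/* hardirqs enabled: hardirq-unsafe */
+	/* ... */
+	spin_unlock(&mixed_lock);
+}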
+
+Multi-lock dependency rules:
+----------------------------
+
+The same lock-type must not be acquired twice, because this could lead
+to lock recursion deadlocks.
+
+Furthermore, two locks may not be taken in different order:
+
+ <L1> -> <L2>
+ <L2> -> <L1>
+
+because this could lead to lock inversion deadlocks. (The validator
+finds such dependencies in arbitrary complexity, i.e. there can be any
+other locking sequence between the acquire-lock operations; the
+validator will still track all dependencies between locks.)
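+
+For illustration, with two hypothetical locks the two code paths below are
+each fine in isolation, but together they form the circular dependency
+above and get reported:
+
+static DEFINE_SPINLOCK(lock_a);
+static DEFINE_SPINLOCK(lock_b);
+
+static void path_one(void)
+{
+	spin_lock(&lock_a);
+	spin_lock(&lock_b);	/* records the dependency A -> B */
+	spin_unlock(&lock_b);
+	spin_unlock(&lock_a);
+}
+
+static void path_two(void)
+{
+	spin_lock(&lock_b);
+	spin_lock(&lock_a);	/* records B -> A: inversion, reported */
+	spin_unlock(&lock_a);
+	spin_unlock(&lock_b);
+}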
+
+Furthermore, the following usage based lock dependencies are not allowed
+between any two lock-types:
+
+   <hardirq-safe>   ->  <hardirq-unsafe>
+   <softirq-safe>   ->  <softirq-unsafe>
+
+The first rule comes from the fact that a hardirq-safe lock could be
+taken by a hardirq context, interrupting a hardirq-unsafe lock - and
+thus could result in a lock inversion deadlock. Likewise, a softirq-safe
+lock could be taken by a softirq context, interrupting a softirq-unsafe
+lock.
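+
+A minimal sketch of the first rule, with two hypothetical locks:
+
+static DEFINE_SPINLOCK(irq_lock);	/* also taken from a hardirq handler */
+static DEFINE_SPINLOCK(plain_lock);	/* only ever taken with irqs enabled */
+
+static void bad_dependency(void)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&irq_lock, flags);
+	spin_lock(&plain_lock);	/* hardirq-safe -> hardirq-unsafe: reported */
+	spin_unlock(&plain_lock);
+	spin_unlock_irqrestore(&irq_lock, flags);
+}
+
+If another CPU holds plain_lock with hardirqs enabled and the interrupt
+that takes irq_lock arrives on that CPU, the two CPUs deadlock.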
+
+The above rules are enforced for any locking sequence that occurs in the
+kernel: when acquiring a new lock, the validator checks whether there is
+any rule violation between the new lock and any of the held locks.
+
+When a lock-type changes its state, the following aspects of the above
+dependency rules are enforced:
+
+- if a new hardirq-safe lock is discovered, we check whether it
+  took any hardirq-unsafe lock in the past.
+
+- if a new softirq-safe lock is discovered, we check whether it took
+  any softirq-unsafe lock in the past.
+
+- if a new hardirq-unsafe lock is discovered, we check whether any
+  hardirq-safe lock took it in the past.
+
+- if a new softirq-unsafe lock is discovered, we check whether any
+  softirq-safe lock took it in the past.
+
+(Again, we do these checks too on the basis that an interrupt context
+could interrupt _any_ of the softirq-unsafe or hardirq-unsafe locks, which
+could lead to a lock inversion deadlock - even if that lock scenario has
+not triggered in practice yet.)
+
+Exception 1: Nested data types leading to nested locking
+--------------------------------------------------------
+
+There are a few cases where the Linux kernel acquires more than one
+instance of the same lock-type. Such cases typically happen when there
+is some sort of hierarchy within objects of the same type. In these
+cases there is an inherent "natural" ordering between the two objects
+(defined by the properties of the hierarchy), and the kernel grabs the
+locks in this fixed order on each of the objects.
+
+An example of such an object hierarchy that results in "nested locking"
+is that of a "whole disk" block-dev object and a "partition" block-dev
+object; the partition is "part of" the whole device and as long as one
+always takes the whole disk lock as a higher lock than the partition
+lock, the lock ordering is fully correct. The validator does not
+automatically detect this natural ordering, as the locking rule behind
+the ordering is not static.
+
+In order to teach the validator about this correct usage model, new
+versions of the various locking primitives were added that allow you to
+specify a "nesting level". An example call, for the block device mutex,
+looks like this:
+
+enum bdev_bd_mutex_lock_type
+{
+       BD_MUTEX_NORMAL,
+       BD_MUTEX_WHOLE,
+       BD_MUTEX_PARTITION
+};
+
+ mutex_lock_nested(&bdev->bd_contains->bd_mutex, BD_MUTEX_PARTITION);
+
+In this case the locking is done on a bdev object that is known to be a
+partition.
+
+The validator treats a lock that is taken in such a nested fashion as a
+separate (sub)class for the purposes of validation.
+
+Note: When changing code to use the _nested() primitives, be careful and
+check really thoroughly that the hierarchy is correctly mapped; otherwise
+you can get false positives or false negatives.
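+
+The same scheme works with the other primitives converted by this patch
+series; for a hypothetical parent/child hierarchy of objects that use
+spinlocks it could look like this:
+
+struct my_object {
+	spinlock_t lock;
+	/* ... */
+};
+
+enum my_object_subtype
+{
+	MY_OBJECT_PARENT,
+	MY_OBJECT_CHILD
+};
+
+static void lock_both(struct my_object *parent, struct my_object *child)
+{
+	spin_lock(&parent->lock);
+	spin_lock_nested(&child->lock, MY_OBJECT_CHILD);
+	/* ... */
+	spin_unlock(&child->lock);
+	spin_unlock(&parent->lock);
+}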
+
+Exception 2: Out of order unlocking
+-----------------------------------
+
+In the Linux kernel, locks are released in the opposite order in which
+they were taken, with a few exceptions. The validator is optimized for
+the common case, and in fact treats an "out of order" unlock as a
+locking bug. (the rationale is that the code is doing something rare,
+which can be a sign of a bug)
+
+There are some cases where releasing the locks out of order is
+unavoidable and dictated by the algorithm that is being implemented.
+Therefore, the validator can be told about this, using a special
+unlocking variant of the primitives. An example call looks like this:
+
+ spin_unlock_non_nested(&target->d_lock);
+
+Here the d_lock is released by the VFS in a different order than it was
+taken, as required by the d_move() algorithm.
+
+Note: the _non_nested() primitives are more expensive than the "normal"
+primitives, and in almost all cases it's trivial to use the natural
+unlock order. There are gains in doing this that are outside the realm
+of the validator regardless, so it's strongly suggested to make sure that
+unlocking always happens in the natural order whenever reasonable,
+rather than blindly changing code to use the _non_nested() variants.
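+
+A short sketch with hypothetical locks - the middle lock is dropped while
+a later-taken lock is still held, so only the _non_nested() variant is
+appropriate for it:
+
+ spin_lock(&a->lock);
+ spin_lock(&b->lock);
+ spin_lock(&c->lock);
+ ...
+ spin_unlock_non_nested(&b->lock);	/* b is not the top of the lock stack */
+ spin_unlock(&c->lock);
+ spin_unlock(&a->lock);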
+
+Proof of 100% correctness:
+--------------------------
+
+The validator achieves perfect, mathematical 'closure' (proof of locking
+correctness) in the sense that for every simple, standalone single-task
+locking sequence that occurred at least once during the lifetime of the
+kernel, the validator proves with 100% certainty that no
+combination and timing of these locking sequences can cause any type of
+lock-related deadlock. [*]
+
+I.e. complex multi-CPU and multi-task locking scenarios do not have to
+occur in practice to prove a deadlock: only the simple 'component'
+locking chains have to occur at least once (anytime, in any
+task/context) for the validator to be able to prove correctness. (For
+example, complex deadlocks that would normally need more than 3 CPUs and
+a very unlikely constellation of tasks, irq-contexts and timings to
+occur, can be detected on a plain, lightly loaded single-CPU system as
+well!)
+
+This radically decreases the complexity of locking related QA of the
+kernel: what has to be done during QA is to trigger as many "simple"
+single-task locking dependencies in the kernel as possible, at least
+once, to prove locking correctness - instead of having to trigger every
+possible combination of locking interaction between CPUs, combined with
+every possible hardirq and softirq nesting scenario (which is impossible
+to do in practice).
+
+[*] assuming that the validator itself is 100% correct, and no other
+    part of the system corrupts the state of the validator in any way.
+    We also assume that all NMI/SMM paths [which could interrupt
+    even hardirq-disabled codepaths] are correct and do not interfere
+    with the validator. We also assume that the 64-bit 'chain hash'
+    value is unique for every lock-chain in the system. Also, lock
+    recursion must not be higher than 20.
+
+Performance:
+------------
+
+The above rules require _massive_ amounts of runtime checking. If we did
+that for every lock taken and for every irqs-enable event, it would
+render the system practically unusably slow. The complexity of checking
+is O(N^2), so even with just a few hundred lock-types we'd have to do
+tens of thousands of checks for every event.
+
+This problem is solved by checking any given 'locking scenario' (unique
+sequence of locks taken after each other) only once. A simple stack of
+held locks is maintained, and a lightweight 64-bit hash value is
+calculated, which is unique for every lock chain. When the chain is
+validated for the first time, the hash value is put into a hash
+table, which can be checked in a lock-free manner. If the
+locking chain occurs again later on, the hash table tells us that we
+don't have to validate the chain again.
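+
+In rough pseudo-code - the helper names below are illustrative, not the
+exact functions used by the validator - the common case thus becomes:
+
+ chain_key = combine(curr->curr_chain_key, new_lock_type);
+ if (lookup_chain_hash(chain_key))
+	return;				/* chain seen and validated before: O(1) */
+ validate_chain(curr, new_lock);	/* slow path: full checks, done once */
+ add_chain_hash(chain_key);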

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 26/61] lock validator: prove rwsem locking correctness
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (24 preceding siblings ...)
  2006-05-29 21:25 ` [patch 25/61] lock validator: design docs Ingo Molnar
@ 2006-05-29 21:25 ` Ingo Molnar
  2006-05-29 21:25 ` [patch 27/61] lock validator: prove spinlock/rwlock " Ingo Molnar
                   ` (47 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

add CONFIG_PROVE_RWSEM_LOCKING, which uses the lock validator framework
to prove rwsem locking correctness.
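
With this, init_rwsem() becomes a macro that attaches one static
lockdep_type_key per initialization site, so existing callers stay
unchanged; a hypothetical user:

	struct my_dev {
		struct rw_semaphore sem;
	};

	static void my_dev_init(struct my_dev *dev)
	{
		init_rwsem(&dev->sem);	/* one lock-type per init site */
		down_write(&dev->sem);
		/* ... */
		up_write(&dev->sem);
	}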

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 include/asm-i386/rwsem.h       |   38 +++++++++++++++++++--------
 include/linux/rwsem-spinlock.h |   23 +++++++++++++++-
 include/linux/rwsem.h          |   56 +++++++++++++++++++++++++++++++++++++++++
 lib/rwsem-spinlock.c           |   15 ++++++++--
 lib/rwsem.c                    |   19 +++++++++++++
 5 files changed, 135 insertions(+), 16 deletions(-)

Index: linux/include/asm-i386/rwsem.h
===================================================================
--- linux.orig/include/asm-i386/rwsem.h
+++ linux/include/asm-i386/rwsem.h
@@ -40,6 +40,7 @@
 
 #include <linux/list.h>
 #include <linux/spinlock.h>
+#include <linux/lockdep.h>
 
 struct rwsem_waiter;
 
@@ -64,6 +65,9 @@ struct rw_semaphore {
 #if RWSEM_DEBUG
 	int			debug;
 #endif
+#ifdef CONFIG_PROVE_RWSEM_LOCKING
+	struct lockdep_map dep_map;
+#endif
 };
 
 /*
@@ -75,22 +79,29 @@ struct rw_semaphore {
 #define __RWSEM_DEBUG_INIT	/* */
 #endif
 
+#ifdef CONFIG_PROVE_RWSEM_LOCKING
+# define __RWSEM_DEP_MAP_INIT(lockname) , .dep_map = { .name = #lockname }
+#else
+# define __RWSEM_DEP_MAP_INIT(lockname)
+#endif
+
+
 #define __RWSEM_INITIALIZER(name) \
 { RWSEM_UNLOCKED_VALUE, SPIN_LOCK_UNLOCKED, LIST_HEAD_INIT((name).wait_list) \
-	__RWSEM_DEBUG_INIT }
+	__RWSEM_DEBUG_INIT __RWSEM_DEP_MAP_INIT(name) }
 
 #define DECLARE_RWSEM(name) \
 	struct rw_semaphore name = __RWSEM_INITIALIZER(name)
 
-static inline void init_rwsem(struct rw_semaphore *sem)
-{
-	sem->count = RWSEM_UNLOCKED_VALUE;
-	spin_lock_init(&sem->wait_lock);
-	INIT_LIST_HEAD(&sem->wait_list);
-#if RWSEM_DEBUG
-	sem->debug = 0;
-#endif
-}
+extern void __init_rwsem(struct rw_semaphore *sem, const char *name,
+			 struct lockdep_type_key *key);
+
+#define init_rwsem(sem)						\
+do {								\
+	static struct lockdep_type_key __key;			\
+								\
+	__init_rwsem((sem), #sem, &__key);			\
+} while (0)
 
 /*
  * lock for reading
@@ -143,7 +154,7 @@ LOCK_PREFIX	"  cmpxchgl  %2,%0\n\t"
 /*
  * lock for writing
  */
-static inline void __down_write(struct rw_semaphore *sem)
+static inline void __down_write_nested(struct rw_semaphore *sem, int subtype)
 {
 	int tmp;
 
@@ -167,6 +178,11 @@ LOCK_PREFIX	"  xadd      %%edx,(%%eax)\n
 		: "memory", "cc");
 }
 
+static inline void __down_write(struct rw_semaphore *sem)
+{
+	__down_write_nested(sem, 0);
+}
+
 /*
  * trylock for writing -- returns 1 if successful, 0 if contention
  */
Index: linux/include/linux/rwsem-spinlock.h
===================================================================
--- linux.orig/include/linux/rwsem-spinlock.h
+++ linux/include/linux/rwsem-spinlock.h
@@ -35,6 +35,9 @@ struct rw_semaphore {
 #if RWSEM_DEBUG
 	int			debug;
 #endif
+#ifdef CONFIG_PROVE_RWSEM_LOCKING
+	struct lockdep_map dep_map;
+#endif
 };
 
 /*
@@ -46,16 +49,32 @@ struct rw_semaphore {
 #define __RWSEM_DEBUG_INIT	/* */
 #endif
 
+#ifdef CONFIG_PROVE_RWSEM_LOCKING
+# define __RWSEM_DEP_MAP_INIT(lockname) , .dep_map = { .name = #lockname }
+#else
+# define __RWSEM_DEP_MAP_INIT(lockname)
+#endif
+
 #define __RWSEM_INITIALIZER(name) \
-{ 0, SPIN_LOCK_UNLOCKED, LIST_HEAD_INIT((name).wait_list) __RWSEM_DEBUG_INIT }
+{ 0, SPIN_LOCK_UNLOCKED, LIST_HEAD_INIT((name).wait_list) __RWSEM_DEBUG_INIT __RWSEM_DEP_MAP_INIT(name) }
 
 #define DECLARE_RWSEM(name) \
 	struct rw_semaphore name = __RWSEM_INITIALIZER(name)
 
-extern void FASTCALL(init_rwsem(struct rw_semaphore *sem));
+extern void __init_rwsem(struct rw_semaphore *sem, const char *name,
+			 struct lockdep_type_key *key);
+
+#define init_rwsem(sem)						\
+do {								\
+	static struct lockdep_type_key __key;			\
+								\
+	__init_rwsem((sem), #sem, &__key);			\
+} while (0)
+
 extern void FASTCALL(__down_read(struct rw_semaphore *sem));
 extern int FASTCALL(__down_read_trylock(struct rw_semaphore *sem));
 extern void FASTCALL(__down_write(struct rw_semaphore *sem));
+extern void FASTCALL(__down_write_nested(struct rw_semaphore *sem, int subtype));
 extern int FASTCALL(__down_write_trylock(struct rw_semaphore *sem));
 extern void FASTCALL(__up_read(struct rw_semaphore *sem));
 extern void FASTCALL(__up_write(struct rw_semaphore *sem));
Index: linux/include/linux/rwsem.h
===================================================================
--- linux.orig/include/linux/rwsem.h
+++ linux/include/linux/rwsem.h
@@ -40,6 +40,20 @@ extern void FASTCALL(rwsemtrace(struct r
 static inline void down_read(struct rw_semaphore *sem)
 {
 	might_sleep();
+	rwsem_acquire_read(&sem->dep_map, 0, 0, _THIS_IP_);
+
+	rwsemtrace(sem,"Entering down_read");
+	__down_read(sem);
+	rwsemtrace(sem,"Leaving down_read");
+}
+
+/*
+ * Take a lock when a task other than the acquirer will release it:
+ */
+static inline void down_read_non_owner(struct rw_semaphore *sem)
+{
+	might_sleep();
+
 	rwsemtrace(sem,"Entering down_read");
 	__down_read(sem);
 	rwsemtrace(sem,"Leaving down_read");
@@ -53,6 +67,8 @@ static inline int down_read_trylock(stru
 	int ret;
 	rwsemtrace(sem,"Entering down_read_trylock");
 	ret = __down_read_trylock(sem);
+	if (ret == 1)
+		rwsem_acquire_read(&sem->dep_map, 0, 1, _THIS_IP_);
 	rwsemtrace(sem,"Leaving down_read_trylock");
 	return ret;
 }
@@ -63,12 +79,28 @@ static inline int down_read_trylock(stru
 static inline void down_write(struct rw_semaphore *sem)
 {
 	might_sleep();
+	rwsem_acquire(&sem->dep_map, 0, 0, _THIS_IP_);
+
 	rwsemtrace(sem,"Entering down_write");
 	__down_write(sem);
 	rwsemtrace(sem,"Leaving down_write");
 }
 
 /*
+ * lock for writing
+ */
+static inline void down_write_nested(struct rw_semaphore *sem, int subtype)
+{
+	might_sleep();
+	rwsem_acquire(&sem->dep_map, subtype, 0, _THIS_IP_);
+
+	rwsemtrace(sem,"Entering down_write_nested");
+	__down_write_nested(sem, subtype);
+	rwsemtrace(sem,"Leaving down_write_nested");
+}
+
+
+/*
  * trylock for writing -- returns 1 if successful, 0 if contention
  */
 static inline int down_write_trylock(struct rw_semaphore *sem)
@@ -76,6 +108,8 @@ static inline int down_write_trylock(str
 	int ret;
 	rwsemtrace(sem,"Entering down_write_trylock");
 	ret = __down_write_trylock(sem);
+	if (ret == 1)
+		rwsem_acquire(&sem->dep_map, 0, 0, _THIS_IP_);
 	rwsemtrace(sem,"Leaving down_write_trylock");
 	return ret;
 }
@@ -85,16 +119,34 @@ static inline int down_write_trylock(str
  */
 static inline void up_read(struct rw_semaphore *sem)
 {
+	rwsem_release(&sem->dep_map, 1, _THIS_IP_);
+
 	rwsemtrace(sem,"Entering up_read");
 	__up_read(sem);
 	rwsemtrace(sem,"Leaving up_read");
 }
 
+static inline void up_read_non_nested(struct rw_semaphore *sem)
+{
+	rwsem_release(&sem->dep_map, 0, _THIS_IP_);
+	__up_read(sem);
+}
+
+/*
+ * Release a read lock that was taken by another task (non-owner release):
+ */
+static inline void up_read_non_owner(struct rw_semaphore *sem)
+{
+	__up_read(sem);
+}
+
 /*
  * release a write lock
  */
 static inline void up_write(struct rw_semaphore *sem)
 {
+	rwsem_release(&sem->dep_map, 1, _THIS_IP_);
+
 	rwsemtrace(sem,"Entering up_write");
 	__up_write(sem);
 	rwsemtrace(sem,"Leaving up_write");
@@ -105,6 +157,10 @@ static inline void up_write(struct rw_se
  */
 static inline void downgrade_write(struct rw_semaphore *sem)
 {
+	/*
+	 * lockdep: a downgraded write will live on as a write
+	 * dependency.
+	 */
 	rwsemtrace(sem,"Entering downgrade_write");
 	__downgrade_write(sem);
 	rwsemtrace(sem,"Leaving downgrade_write");
Index: linux/lib/rwsem-spinlock.c
===================================================================
--- linux.orig/lib/rwsem-spinlock.c
+++ linux/lib/rwsem-spinlock.c
@@ -30,7 +30,8 @@ void rwsemtrace(struct rw_semaphore *sem
 /*
  * initialise the semaphore
  */
-void fastcall init_rwsem(struct rw_semaphore *sem)
+void __init_rwsem(struct rw_semaphore *sem, const char *name,
+		  struct lockdep_type_key *key)
 {
 	sem->activity = 0;
 	spin_lock_init(&sem->wait_lock);
@@ -38,6 +39,9 @@ void fastcall init_rwsem(struct rw_semap
 #if RWSEM_DEBUG
 	sem->debug = 0;
 #endif
+#ifdef CONFIG_PROVE_RWSEM_LOCKING
+	lockdep_init_map(&sem->dep_map, name, key);
+#endif
 }
 
 /*
@@ -204,7 +208,7 @@ int fastcall __down_read_trylock(struct 
  * get a write lock on the semaphore
  * - we increment the waiting count anyway to indicate an exclusive lock
  */
-void fastcall __sched __down_write(struct rw_semaphore *sem)
+void fastcall __sched __down_write_nested(struct rw_semaphore *sem, int subtype)
 {
 	struct rwsem_waiter waiter;
 	struct task_struct *tsk;
@@ -247,6 +251,11 @@ void fastcall __sched __down_write(struc
 	rwsemtrace(sem, "Leaving __down_write");
 }
 
+void fastcall __sched __down_write(struct rw_semaphore *sem)
+{
+	__down_write_nested(sem, 0);
+}
+
 /*
  * trylock for writing -- returns 1 if successful, 0 if contention
  */
@@ -331,7 +340,7 @@ void fastcall __downgrade_write(struct r
 	rwsemtrace(sem, "Leaving __downgrade_write");
 }
 
-EXPORT_SYMBOL(init_rwsem);
+EXPORT_SYMBOL(__init_rwsem);
 EXPORT_SYMBOL(__down_read);
 EXPORT_SYMBOL(__down_read_trylock);
 EXPORT_SYMBOL(__down_write);
Index: linux/lib/rwsem.c
===================================================================
--- linux.orig/lib/rwsem.c
+++ linux/lib/rwsem.c
@@ -8,6 +8,25 @@
 #include <linux/init.h>
 #include <linux/module.h>
 
+/*
+ * Initialize an rwsem:
+ */
+void __init_rwsem(struct rw_semaphore *sem, const char *name,
+		  struct lockdep_type_key *key)
+{
+	sem->count = RWSEM_UNLOCKED_VALUE;
+	spin_lock_init(&sem->wait_lock);
+	INIT_LIST_HEAD(&sem->wait_list);
+#if RWSEM_DEBUG
+	sem->debug = 0;
+#endif
+#ifdef CONFIG_PROVE_RWSEM_LOCKING
+	lockdep_init_map(&sem->dep_map, name, key);
+#endif
+}
+
+EXPORT_SYMBOL(__init_rwsem);
+
 struct rwsem_waiter {
 	struct list_head list;
 	struct task_struct *task;

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 27/61] lock validator: prove spinlock/rwlock locking correctness
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (25 preceding siblings ...)
  2006-05-29 21:25 ` [patch 26/61] lock validator: prove rwsem locking correctness Ingo Molnar
@ 2006-05-29 21:25 ` Ingo Molnar
  2006-05-30  1:35   ` Andrew Morton
  2006-05-29 21:25 ` [patch 28/61] lock validator: prove mutex " Ingo Molnar
                   ` (46 subsequent siblings)
  73 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

add CONFIG_PROVE_SPIN_LOCKING and CONFIG_PROVE_RW_LOCKING, which uses
the lock validator framework to prove spinlock and rwlock locking
correctness.
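
Locks set up via the unchanged spin_lock_init() get one lock-type per
call site; the new spin_lock_init_key() variant lets dynamically
initialized locks be split into explicit types. A hypothetical user
(struct my_queue holds a spinlock_t named 'lock') that separates two
groups of locks initialized by a common helper:

	static struct lockdep_type_key rx_lock_key;
	static struct lockdep_type_key tx_lock_key;

	static void init_queue(struct my_queue *q, struct lockdep_type_key *key)
	{
		spin_lock_init_key(&q->lock, key);
	}

	/* init_queue(rxq, &rx_lock_key); init_queue(txq, &tx_lock_key); */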

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 include/asm-i386/spinlock.h       |    2 
 include/linux/spinlock.h          |   96 ++++++++++++++++++++++-----
 include/linux/spinlock_api_smp.h  |    4 +
 include/linux/spinlock_api_up.h   |    3 
 include/linux/spinlock_types.h    |   32 ++++++++-
 include/linux/spinlock_types_up.h |   10 ++
 include/linux/spinlock_up.h       |    4 -
 kernel/Makefile                   |    2 
 kernel/sched.c                    |   10 ++
 kernel/spinlock.c                 |  131 +++++++++++++++++++++++++++++++++++---
 lib/kernel_lock.c                 |    7 +-
 net/ipv4/route.c                  |    4 -
 12 files changed, 269 insertions(+), 36 deletions(-)

Index: linux/include/asm-i386/spinlock.h
===================================================================
--- linux.orig/include/asm-i386/spinlock.h
+++ linux/include/asm-i386/spinlock.h
@@ -68,6 +68,7 @@ static inline void __raw_spin_lock(raw_s
 		"=m" (lock->slock) : : "memory");
 }
 
+#ifndef CONFIG_PROVE_SPIN_LOCKING
 static inline void __raw_spin_lock_flags(raw_spinlock_t *lock, unsigned long flags)
 {
 	alternative_smp(
@@ -75,6 +76,7 @@ static inline void __raw_spin_lock_flags
 		__raw_spin_lock_string_up,
 		"=m" (lock->slock) : "r" (flags) : "memory");
 }
+#endif
 
 static inline int __raw_spin_trylock(raw_spinlock_t *lock)
 {
Index: linux/include/linux/spinlock.h
===================================================================
--- linux.orig/include/linux/spinlock.h
+++ linux/include/linux/spinlock.h
@@ -82,14 +82,64 @@ extern int __lockfunc generic__raw_read_
 /*
  * Pull the __raw*() functions/declarations (UP-nondebug doesnt need them):
  */
-#if defined(CONFIG_SMP)
+#ifdef CONFIG_SMP
 # include <asm/spinlock.h>
 #else
 # include <linux/spinlock_up.h>
 #endif
 
-#define spin_lock_init(lock)	do { *(lock) = SPIN_LOCK_UNLOCKED; } while (0)
-#define rwlock_init(lock)	do { *(lock) = RW_LOCK_UNLOCKED; } while (0)
+#if defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PROVE_SPIN_LOCKING)
+  extern void __spin_lock_init(spinlock_t *lock, const char *name,
+			       struct lockdep_type_key *key);
+# define spin_lock_init(lock)					\
+do {								\
+	static struct lockdep_type_key __key;			\
+								\
+	__spin_lock_init((lock), #lock, &__key);		\
+} while (0)
+
+/*
+ * If for example an array of static locks are initialized
+ * via spin_lock_init(), this API variant can be used to
+ * split the lock-types of them:
+ */
+# define spin_lock_init_static(lock)				\
+	__spin_lock_init((lock), #lock,				\
+			 (struct lockdep_type_key *)(lock))	\
+
+/*
+ * Type splitting can also be done for dynamic locks, if for
+ * example there are per-CPU dynamically allocated locks:
+ */
+# define spin_lock_init_key(lock, key)				\
+	__spin_lock_init((lock), #lock, key)
+
+#else
+# define spin_lock_init(lock)					\
+	do { *(lock) = SPIN_LOCK_UNLOCKED; } while (0)
+# define spin_lock_init_static(lock) 				\
+	spin_lock_init(lock)
+# define spin_lock_init_key(lock, key)				\
+	do { spin_lock_init(lock); (void)(key); } while (0)
+#endif
+
+#if defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PROVE_RW_LOCKING)
+  extern void __rwlock_init(rwlock_t *lock, const char *name,
+			    struct lockdep_type_key *key);
+# define rwlock_init(lock)					\
+do {								\
+	static struct lockdep_type_key __key;			\
+								\
+	__rwlock_init((lock), #lock, &__key);			\
+} while (0)
+# define rwlock_init_key(lock, key)				\
+	__rwlock_init((lock), #lock, key)
+#else
+# define rwlock_init(lock)					\
+	do { *(lock) = RW_LOCK_UNLOCKED; } while (0)
+# define rwlock_init_key(lock, key)				\
+	do { rwlock_init(lock); (void)(key); } while (0)
+#endif
 
 #define spin_is_locked(lock)	__raw_spin_is_locked(&(lock)->raw_lock)
 
@@ -102,7 +152,9 @@ extern int __lockfunc generic__raw_read_
 /*
  * Pull the _spin_*()/_read_*()/_write_*() functions/declarations:
  */
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
+	defined(CONFIG_PROVE_SPIN_LOCKING) || \
+	defined(CONFIG_PROVE_RW_LOCKING)
 # include <linux/spinlock_api_smp.h>
 #else
 # include <linux/spinlock_api_up.h>
@@ -113,7 +165,6 @@ extern int __lockfunc generic__raw_read_
 #define _raw_spin_lock_flags(lock, flags) _raw_spin_lock(lock)
  extern int _raw_spin_trylock(spinlock_t *lock);
  extern void _raw_spin_unlock(spinlock_t *lock);
-
  extern void _raw_read_lock(rwlock_t *lock);
  extern int _raw_read_trylock(rwlock_t *lock);
  extern void _raw_read_unlock(rwlock_t *lock);
@@ -121,17 +172,17 @@ extern int __lockfunc generic__raw_read_
  extern int _raw_write_trylock(rwlock_t *lock);
  extern void _raw_write_unlock(rwlock_t *lock);
 #else
-# define _raw_spin_unlock(lock)		__raw_spin_unlock(&(lock)->raw_lock)
-# define _raw_spin_trylock(lock)	__raw_spin_trylock(&(lock)->raw_lock)
 # define _raw_spin_lock(lock)		__raw_spin_lock(&(lock)->raw_lock)
 # define _raw_spin_lock_flags(lock, flags) \
 		__raw_spin_lock_flags(&(lock)->raw_lock, *(flags))
+# define _raw_spin_trylock(lock)	__raw_spin_trylock(&(lock)->raw_lock)
+# define _raw_spin_unlock(lock)		__raw_spin_unlock(&(lock)->raw_lock)
 # define _raw_read_lock(rwlock)		__raw_read_lock(&(rwlock)->raw_lock)
-# define _raw_write_lock(rwlock)	__raw_write_lock(&(rwlock)->raw_lock)
-# define _raw_read_unlock(rwlock)	__raw_read_unlock(&(rwlock)->raw_lock)
-# define _raw_write_unlock(rwlock)	__raw_write_unlock(&(rwlock)->raw_lock)
 # define _raw_read_trylock(rwlock)	__raw_read_trylock(&(rwlock)->raw_lock)
+# define _raw_read_unlock(rwlock)	__raw_read_unlock(&(rwlock)->raw_lock)
+# define _raw_write_lock(rwlock)	__raw_write_lock(&(rwlock)->raw_lock)
 # define _raw_write_trylock(rwlock)	__raw_write_trylock(&(rwlock)->raw_lock)
+# define _raw_write_unlock(rwlock)	__raw_write_unlock(&(rwlock)->raw_lock)
 #endif
 
 #define read_can_lock(rwlock)		__raw_read_can_lock(&(rwlock)->raw_lock)
@@ -147,10 +198,14 @@ extern int __lockfunc generic__raw_read_
 #define write_trylock(lock)		__cond_lock(_write_trylock(lock))
 
 #define spin_lock(lock)			_spin_lock(lock)
+#define spin_lock_nested(lock, subtype) \
+					_spin_lock_nested(lock, subtype)
 #define write_lock(lock)		_write_lock(lock)
 #define read_lock(lock)			_read_lock(lock)
 
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
+	defined(CONFIG_PROVE_SPIN_LOCKING) || \
+	defined(CONFIG_PROVE_RW_LOCKING)
 #define spin_lock_irqsave(lock, flags)	flags = _spin_lock_irqsave(lock)
 #define read_lock_irqsave(lock, flags)	flags = _read_lock_irqsave(lock)
 #define write_lock_irqsave(lock, flags)	flags = _write_lock_irqsave(lock)
@@ -172,21 +227,24 @@ extern int __lockfunc generic__raw_read_
 /*
  * We inline the unlock functions in the nondebug case:
  */
-#if defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PREEMPT) || !defined(CONFIG_SMP)
+#if defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PREEMPT) || \
+	!defined(CONFIG_SMP) || \
+	defined(CONFIG_PROVE_SPIN_LOCKING) || \
+	defined(CONFIG_PROVE_RW_LOCKING)
 # define spin_unlock(lock)		_spin_unlock(lock)
+# define spin_unlock_non_nested(lock)	_spin_unlock_non_nested(lock)
 # define read_unlock(lock)		_read_unlock(lock)
+# define read_unlock_non_nested(lock)	_read_unlock_non_nested(lock)
 # define write_unlock(lock)		_write_unlock(lock)
-#else
-# define spin_unlock(lock)		__raw_spin_unlock(&(lock)->raw_lock)
-# define read_unlock(lock)		__raw_read_unlock(&(lock)->raw_lock)
-# define write_unlock(lock)		__raw_write_unlock(&(lock)->raw_lock)
-#endif
-
-#if defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PREEMPT) || !defined(CONFIG_SMP)
 # define spin_unlock_irq(lock)		_spin_unlock_irq(lock)
 # define read_unlock_irq(lock)		_read_unlock_irq(lock)
 # define write_unlock_irq(lock)		_write_unlock_irq(lock)
 #else
+# define spin_unlock(lock)		__raw_spin_unlock(&(lock)->raw_lock)
+# define spin_unlock_non_nested(lock)	__raw_spin_unlock(&(lock)->raw_lock)
+# define read_unlock(lock)		__raw_read_unlock(&(lock)->raw_lock)
+# define read_unlock_non_nested(lock)	__raw_read_unlock(&(lock)->raw_lock)
+# define write_unlock(lock)		__raw_write_unlock(&(lock)->raw_lock)
 # define spin_unlock_irq(lock) \
     do { __raw_spin_unlock(&(lock)->raw_lock); local_irq_enable(); } while (0)
 # define read_unlock_irq(lock) \
Index: linux/include/linux/spinlock_api_smp.h
===================================================================
--- linux.orig/include/linux/spinlock_api_smp.h
+++ linux/include/linux/spinlock_api_smp.h
@@ -20,6 +20,8 @@ int in_lock_functions(unsigned long addr
 #define assert_spin_locked(x)	BUG_ON(!spin_is_locked(x))
 
 void __lockfunc _spin_lock(spinlock_t *lock)		__acquires(spinlock_t);
+void __lockfunc _spin_lock_nested(spinlock_t *lock, int subtype)
+							__acquires(spinlock_t);
 void __lockfunc _read_lock(rwlock_t *lock)		__acquires(rwlock_t);
 void __lockfunc _write_lock(rwlock_t *lock)		__acquires(rwlock_t);
 void __lockfunc _spin_lock_bh(spinlock_t *lock)		__acquires(spinlock_t);
@@ -39,7 +41,9 @@ int __lockfunc _read_trylock(rwlock_t *l
 int __lockfunc _write_trylock(rwlock_t *lock);
 int __lockfunc _spin_trylock_bh(spinlock_t *lock);
 void __lockfunc _spin_unlock(spinlock_t *lock)		__releases(spinlock_t);
+void __lockfunc _spin_unlock_non_nested(spinlock_t *lock) __releases(spinlock_t);
 void __lockfunc _read_unlock(rwlock_t *lock)		__releases(rwlock_t);
+void __lockfunc _read_unlock_non_nested(rwlock_t *lock)	__releases(rwlock_t);
 void __lockfunc _write_unlock(rwlock_t *lock)		__releases(rwlock_t);
 void __lockfunc _spin_unlock_bh(spinlock_t *lock)	__releases(spinlock_t);
 void __lockfunc _read_unlock_bh(rwlock_t *lock)		__releases(rwlock_t);
Index: linux/include/linux/spinlock_api_up.h
===================================================================
--- linux.orig/include/linux/spinlock_api_up.h
+++ linux/include/linux/spinlock_api_up.h
@@ -49,6 +49,7 @@
   do { local_irq_restore(flags); __UNLOCK(lock); } while (0)
 
 #define _spin_lock(lock)			__LOCK(lock)
+#define _spin_lock_nested(lock, subtype)	__LOCK(lock)
 #define _read_lock(lock)			__LOCK(lock)
 #define _write_lock(lock)			__LOCK(lock)
 #define _spin_lock_bh(lock)			__LOCK_BH(lock)
@@ -65,7 +66,9 @@
 #define _write_trylock(lock)			({ __LOCK(lock); 1; })
 #define _spin_trylock_bh(lock)			({ __LOCK_BH(lock); 1; })
 #define _spin_unlock(lock)			__UNLOCK(lock)
+#define _spin_unlock_non_nested(lock)		__UNLOCK(lock)
 #define _read_unlock(lock)			__UNLOCK(lock)
+#define _read_unlock_non_nested(lock)		__UNLOCK(lock)
 #define _write_unlock(lock)			__UNLOCK(lock)
 #define _spin_unlock_bh(lock)			__UNLOCK_BH(lock)
 #define _write_unlock_bh(lock)			__UNLOCK_BH(lock)
Index: linux/include/linux/spinlock_types.h
===================================================================
--- linux.orig/include/linux/spinlock_types.h
+++ linux/include/linux/spinlock_types.h
@@ -9,6 +9,8 @@
  * Released under the General Public License (GPL).
  */
 
+#include <linux/lockdep.h>
+
 #if defined(CONFIG_SMP)
 # include <asm/spinlock_types.h>
 #else
@@ -24,6 +26,9 @@ typedef struct {
 	unsigned int magic, owner_cpu;
 	void *owner;
 #endif
+#ifdef CONFIG_PROVE_SPIN_LOCKING
+	struct lockdep_map dep_map;
+#endif
 } spinlock_t;
 
 #define SPINLOCK_MAGIC		0xdead4ead
@@ -37,28 +42,47 @@ typedef struct {
 	unsigned int magic, owner_cpu;
 	void *owner;
 #endif
+#ifdef CONFIG_PROVE_RW_LOCKING
+	struct lockdep_map dep_map;
+#endif
 } rwlock_t;
 
 #define RWLOCK_MAGIC		0xdeaf1eed
 
 #define SPINLOCK_OWNER_INIT	((void *)-1L)
 
+#ifdef CONFIG_PROVE_SPIN_LOCKING
+# define SPIN_DEP_MAP_INIT(lockname)	.dep_map = { .name = #lockname }
+#else
+# define SPIN_DEP_MAP_INIT(lockname)
+#endif
+
+#ifdef CONFIG_PROVE_RW_LOCKING
+# define RW_DEP_MAP_INIT(lockname)	.dep_map = { .name = #lockname }
+#else
+# define RW_DEP_MAP_INIT(lockname)
+#endif
+
 #ifdef CONFIG_DEBUG_SPINLOCK
 # define __SPIN_LOCK_UNLOCKED(lockname)					\
 	(spinlock_t)	{	.raw_lock = __RAW_SPIN_LOCK_UNLOCKED,	\
 				.magic = SPINLOCK_MAGIC,		\
 				.owner = SPINLOCK_OWNER_INIT,		\
-				.owner_cpu = -1 }
+				.owner_cpu = -1,			\
+				SPIN_DEP_MAP_INIT(lockname) }
 #define __RW_LOCK_UNLOCKED(lockname)					\
 	(rwlock_t)	{	.raw_lock = __RAW_RW_LOCK_UNLOCKED,	\
 				.magic = RWLOCK_MAGIC,			\
 				.owner = SPINLOCK_OWNER_INIT,		\
-				.owner_cpu = -1 }
+				.owner_cpu = -1,			\
+				RW_DEP_MAP_INIT(lockname) }
 #else
 # define __SPIN_LOCK_UNLOCKED(lockname) \
-	(spinlock_t)	{	.raw_lock = __RAW_SPIN_LOCK_UNLOCKED }
+	(spinlock_t)	{	.raw_lock = __RAW_SPIN_LOCK_UNLOCKED,	\
+				SPIN_DEP_MAP_INIT(lockname) }
 #define __RW_LOCK_UNLOCKED(lockname) \
-	(rwlock_t)	{	.raw_lock = __RAW_RW_LOCK_UNLOCKED }
+	(rwlock_t)	{	.raw_lock = __RAW_RW_LOCK_UNLOCKED,	\
+				RW_DEP_MAP_INIT(lockname) }
 #endif
 
 #define SPIN_LOCK_UNLOCKED	__SPIN_LOCK_UNLOCKED(old_style_spin_init)
Index: linux/include/linux/spinlock_types_up.h
===================================================================
--- linux.orig/include/linux/spinlock_types_up.h
+++ linux/include/linux/spinlock_types_up.h
@@ -12,10 +12,15 @@
  * Released under the General Public License (GPL).
  */
 
-#ifdef CONFIG_DEBUG_SPINLOCK
+#if defined(CONFIG_DEBUG_SPINLOCK) || \
+	defined(CONFIG_PROVE_SPIN_LOCKING) || \
+	defined(CONFIG_PROVE_RW_LOCKING)
 
 typedef struct {
 	volatile unsigned int slock;
+#ifdef CONFIG_PROVE_SPIN_LOCKING
+	struct lockdep_map dep_map;
+#endif
 } raw_spinlock_t;
 
 #define __RAW_SPIN_LOCK_UNLOCKED { 1 }
@@ -30,6 +35,9 @@ typedef struct { } raw_spinlock_t;
 
 typedef struct {
 	/* no debug version on UP */
+#ifdef CONFIG_PROVE_RW_LOCKING
+	struct lockdep_map dep_map;
+#endif
 } raw_rwlock_t;
 
 #define __RAW_RW_LOCK_UNLOCKED { }
Index: linux/include/linux/spinlock_up.h
===================================================================
--- linux.orig/include/linux/spinlock_up.h
+++ linux/include/linux/spinlock_up.h
@@ -17,7 +17,9 @@
  * No atomicity anywhere, we are on UP.
  */
 
-#ifdef CONFIG_DEBUG_SPINLOCK
+#if defined(CONFIG_DEBUG_SPINLOCK) || \
+	defined(CONFIG_PROVE_SPIN_LOCKING) || \
+	defined(CONFIG_PROVE_RW_LOCKING)
 
 #define __raw_spin_is_locked(x)		((x)->slock == 0)
 
Index: linux/kernel/Makefile
===================================================================
--- linux.orig/kernel/Makefile
+++ linux/kernel/Makefile
@@ -26,6 +26,8 @@ obj-$(CONFIG_RT_MUTEX_TESTER) += rtmutex
 obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
 obj-$(CONFIG_SMP) += cpu.o spinlock.o
 obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock.o
+obj-$(CONFIG_PROVE_SPIN_LOCKING) += spinlock.o
+obj-$(CONFIG_PROVE_RW_LOCKING) += spinlock.o
 obj-$(CONFIG_UID16) += uid16.o
 obj-$(CONFIG_MODULES) += module.o
 obj-$(CONFIG_KALLSYMS) += kallsyms.o
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -312,6 +312,13 @@ static inline void finish_lock_switch(ru
 	/* this is a valid case when another task releases the spinlock */
 	rq->lock.owner = current;
 #endif
+	/*
+	 * If we are tracking spinlock dependencies then we have to
+	 * fix up the runqueue lock - which gets 'carried over' from
+	 * prev into current:
+	 */
+	spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
+
 	spin_unlock_irq(&rq->lock);
 }
 
@@ -1839,6 +1846,7 @@ task_t * context_switch(runqueue_t *rq, 
 		WARN_ON(rq->prev_mm);
 		rq->prev_mm = oldmm;
 	}
+	spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
 
 	/* Here we just switch the register state and the stack. */
 	switch_to(prev, next, prev);
@@ -4406,6 +4414,7 @@ asmlinkage long sys_sched_yield(void)
 	 * no need to preempt or enable interrupts:
 	 */
 	__release(rq->lock);
+	spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
 	_raw_spin_unlock(&rq->lock);
 	preempt_enable_no_resched();
 
@@ -4465,6 +4474,7 @@ int cond_resched_lock(spinlock_t *lock)
 		spin_lock(lock);
 	}
 	if (need_resched()) {
+		spin_release(&lock->dep_map, 1, _THIS_IP_);
 		_raw_spin_unlock(lock);
 		preempt_enable_no_resched();
 		__cond_resched();
Index: linux/kernel/spinlock.c
===================================================================
--- linux.orig/kernel/spinlock.c
+++ linux/kernel/spinlock.c
@@ -14,8 +14,47 @@
 #include <linux/preempt.h>
 #include <linux/spinlock.h>
 #include <linux/interrupt.h>
+#include <linux/debug_locks.h>
 #include <linux/module.h>
 
+#if defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PROVE_SPIN_LOCKING)
+void __spin_lock_init(spinlock_t *lock, const char *name,
+		      struct lockdep_type_key *key)
+{
+	lock->raw_lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
+#ifdef CONFIG_DEBUG_SPINLOCK
+	lock->magic = SPINLOCK_MAGIC;
+	lock->owner = SPINLOCK_OWNER_INIT;
+	lock->owner_cpu = -1;
+#endif
+#ifdef CONFIG_PROVE_SPIN_LOCKING
+	lockdep_init_map(&lock->dep_map, name, key);
+#endif
+}
+
+EXPORT_SYMBOL(__spin_lock_init);
+
+#endif
+
+#if defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PROVE_RW_LOCKING)
+
+void __rwlock_init(rwlock_t *lock, const char *name,
+		   struct lockdep_type_key *key)
+{
+	lock->raw_lock = (raw_rwlock_t) __RAW_RW_LOCK_UNLOCKED;
+#ifdef CONFIG_DEBUG_SPINLOCK
+	lock->magic = RWLOCK_MAGIC;
+	lock->owner = SPINLOCK_OWNER_INIT;
+	lock->owner_cpu = -1;
+#endif
+#ifdef CONFIG_PROVE_RW_LOCKING
+	lockdep_init_map(&lock->dep_map, name, key);
+#endif
+}
+
+EXPORT_SYMBOL(__rwlock_init);
+
+#endif
 /*
  * Generic declaration of the raw read_trylock() function,
  * architectures are supposed to optimize this:
@@ -30,8 +69,10 @@ EXPORT_SYMBOL(generic__raw_read_trylock)
 int __lockfunc _spin_trylock(spinlock_t *lock)
 {
 	preempt_disable();
-	if (_raw_spin_trylock(lock))
+	if (_raw_spin_trylock(lock)) {
+		spin_acquire(&lock->dep_map, 0, 1, _RET_IP_);
 		return 1;
+	}
 	
 	preempt_enable();
 	return 0;
@@ -41,8 +82,10 @@ EXPORT_SYMBOL(_spin_trylock);
 int __lockfunc _read_trylock(rwlock_t *lock)
 {
 	preempt_disable();
-	if (_raw_read_trylock(lock))
+	if (_raw_read_trylock(lock)) {
+		rwlock_acquire_read(&lock->dep_map, 0, 1, _RET_IP_);
 		return 1;
+	}
 
 	preempt_enable();
 	return 0;
@@ -52,19 +95,29 @@ EXPORT_SYMBOL(_read_trylock);
 int __lockfunc _write_trylock(rwlock_t *lock)
 {
 	preempt_disable();
-	if (_raw_write_trylock(lock))
+	if (_raw_write_trylock(lock)) {
+		rwlock_acquire(&lock->dep_map, 0, 1, _RET_IP_);
 		return 1;
+	}
 
 	preempt_enable();
 	return 0;
 }
 EXPORT_SYMBOL(_write_trylock);
 
-#if !defined(CONFIG_PREEMPT) || !defined(CONFIG_SMP)
+/*
+ * If lockdep is enabled then we use the non-preemption spin-ops
+ * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
+ * not re-enabled during lock-acquire (which the preempt-spin-ops do):
+ */
+#if !defined(CONFIG_PREEMPT) || !defined(CONFIG_SMP) || \
+	defined(CONFIG_PROVE_SPIN_LOCKING) || \
+	defined(CONFIG_PROVE_RW_LOCKING)
 
 void __lockfunc _read_lock(rwlock_t *lock)
 {
 	preempt_disable();
+	rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
 	_raw_read_lock(lock);
 }
 EXPORT_SYMBOL(_read_lock);
@@ -75,7 +128,17 @@ unsigned long __lockfunc _spin_lock_irqs
 
 	local_irq_save(flags);
 	preempt_disable();
+	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
+	/*
+	 * On lockdep we don't want the hand-coded irq-enable of
+	 * _raw_spin_lock_flags() code, because lockdep assumes
+	 * that interrupts are not re-enabled during lock-acquire:
+	 */
+#ifdef CONFIG_PROVE_SPIN_LOCKING
+	_raw_spin_lock(lock);
+#else
 	_raw_spin_lock_flags(lock, &flags);
+#endif
 	return flags;
 }
 EXPORT_SYMBOL(_spin_lock_irqsave);
@@ -84,6 +147,7 @@ void __lockfunc _spin_lock_irq(spinlock_
 {
 	local_irq_disable();
 	preempt_disable();
+	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
 	_raw_spin_lock(lock);
 }
 EXPORT_SYMBOL(_spin_lock_irq);
@@ -92,6 +156,7 @@ void __lockfunc _spin_lock_bh(spinlock_t
 {
 	local_bh_disable();
 	preempt_disable();
+	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
 	_raw_spin_lock(lock);
 }
 EXPORT_SYMBOL(_spin_lock_bh);
@@ -102,6 +167,7 @@ unsigned long __lockfunc _read_lock_irqs
 
 	local_irq_save(flags);
 	preempt_disable();
+	rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
 	_raw_read_lock(lock);
 	return flags;
 }
@@ -111,6 +177,7 @@ void __lockfunc _read_lock_irq(rwlock_t 
 {
 	local_irq_disable();
 	preempt_disable();
+	rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
 	_raw_read_lock(lock);
 }
 EXPORT_SYMBOL(_read_lock_irq);
@@ -119,6 +186,7 @@ void __lockfunc _read_lock_bh(rwlock_t *
 {
 	local_bh_disable();
 	preempt_disable();
+	rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
 	_raw_read_lock(lock);
 }
 EXPORT_SYMBOL(_read_lock_bh);
@@ -129,6 +197,7 @@ unsigned long __lockfunc _write_lock_irq
 
 	local_irq_save(flags);
 	preempt_disable();
+	rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
 	_raw_write_lock(lock);
 	return flags;
 }
@@ -138,6 +207,7 @@ void __lockfunc _write_lock_irq(rwlock_t
 {
 	local_irq_disable();
 	preempt_disable();
+	rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
 	_raw_write_lock(lock);
 }
 EXPORT_SYMBOL(_write_lock_irq);
@@ -146,6 +216,7 @@ void __lockfunc _write_lock_bh(rwlock_t 
 {
 	local_bh_disable();
 	preempt_disable();
+	rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
 	_raw_write_lock(lock);
 }
 EXPORT_SYMBOL(_write_lock_bh);
@@ -153,6 +224,7 @@ EXPORT_SYMBOL(_write_lock_bh);
 void __lockfunc _spin_lock(spinlock_t *lock)
 {
 	preempt_disable();
+	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
 	_raw_spin_lock(lock);
 }
 
@@ -161,6 +233,7 @@ EXPORT_SYMBOL(_spin_lock);
 void __lockfunc _write_lock(rwlock_t *lock)
 {
 	preempt_disable();
+	rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
 	_raw_write_lock(lock);
 }
 
@@ -256,15 +329,35 @@ BUILD_LOCK_OPS(write, rwlock);
 
 #endif /* CONFIG_PREEMPT */
 
+void __lockfunc _spin_lock_nested(spinlock_t *lock, int subtype)
+{
+	preempt_disable();
+	spin_acquire(&lock->dep_map, subtype, 0, _RET_IP_);
+	_raw_spin_lock(lock);
+}
+
+EXPORT_SYMBOL(_spin_lock_nested);
+
 void __lockfunc _spin_unlock(spinlock_t *lock)
 {
+	spin_release(&lock->dep_map, 1, _RET_IP_);
 	_raw_spin_unlock(lock);
 	preempt_enable();
 }
 EXPORT_SYMBOL(_spin_unlock);
 
+void __lockfunc _spin_unlock_non_nested(spinlock_t *lock)
+{
+	spin_release(&lock->dep_map, 0, _RET_IP_);
+	_raw_spin_unlock(lock);
+	preempt_enable();
+}
+EXPORT_SYMBOL(_spin_unlock_non_nested);
+
+
 void __lockfunc _write_unlock(rwlock_t *lock)
 {
+	rwlock_release(&lock->dep_map, 1, _RET_IP_);
 	_raw_write_unlock(lock);
 	preempt_enable();
 }
@@ -272,13 +365,23 @@ EXPORT_SYMBOL(_write_unlock);
 
 void __lockfunc _read_unlock(rwlock_t *lock)
 {
+	rwlock_release(&lock->dep_map, 1, _RET_IP_);
 	_raw_read_unlock(lock);
 	preempt_enable();
 }
 EXPORT_SYMBOL(_read_unlock);
 
+void __lockfunc _read_unlock_non_nested(rwlock_t *lock)
+{
+	rwlock_release(&lock->dep_map, 0, _RET_IP_);
+	_raw_read_unlock(lock);
+	preempt_enable();
+}
+EXPORT_SYMBOL(_read_unlock_non_nested);
+
 void __lockfunc _spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags)
 {
+	spin_release(&lock->dep_map, 1, _RET_IP_);
 	_raw_spin_unlock(lock);
 	local_irq_restore(flags);
 	preempt_enable();
@@ -287,6 +390,7 @@ EXPORT_SYMBOL(_spin_unlock_irqrestore);
 
 void __lockfunc _spin_unlock_irq(spinlock_t *lock)
 {
+	spin_release(&lock->dep_map, 1, _RET_IP_);
 	_raw_spin_unlock(lock);
 	local_irq_enable();
 	preempt_enable();
@@ -295,14 +399,16 @@ EXPORT_SYMBOL(_spin_unlock_irq);
 
 void __lockfunc _spin_unlock_bh(spinlock_t *lock)
 {
+	spin_release(&lock->dep_map, 1, _RET_IP_);
 	_raw_spin_unlock(lock);
 	preempt_enable_no_resched();
-	local_bh_enable();
+	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
 }
 EXPORT_SYMBOL(_spin_unlock_bh);
 
 void __lockfunc _read_unlock_irqrestore(rwlock_t *lock, unsigned long flags)
 {
+	rwlock_release(&lock->dep_map, 1, _RET_IP_);
 	_raw_read_unlock(lock);
 	local_irq_restore(flags);
 	preempt_enable();
@@ -311,6 +417,7 @@ EXPORT_SYMBOL(_read_unlock_irqrestore);
 
 void __lockfunc _read_unlock_irq(rwlock_t *lock)
 {
+	rwlock_release(&lock->dep_map, 1, _RET_IP_);
 	_raw_read_unlock(lock);
 	local_irq_enable();
 	preempt_enable();
@@ -319,14 +426,16 @@ EXPORT_SYMBOL(_read_unlock_irq);
 
 void __lockfunc _read_unlock_bh(rwlock_t *lock)
 {
+	rwlock_release(&lock->dep_map, 1, _RET_IP_);
 	_raw_read_unlock(lock);
 	preempt_enable_no_resched();
-	local_bh_enable();
+	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
 }
 EXPORT_SYMBOL(_read_unlock_bh);
 
 void __lockfunc _write_unlock_irqrestore(rwlock_t *lock, unsigned long flags)
 {
+	rwlock_release(&lock->dep_map, 1, _RET_IP_);
 	_raw_write_unlock(lock);
 	local_irq_restore(flags);
 	preempt_enable();
@@ -335,6 +444,7 @@ EXPORT_SYMBOL(_write_unlock_irqrestore);
 
 void __lockfunc _write_unlock_irq(rwlock_t *lock)
 {
+	rwlock_release(&lock->dep_map, 1, _RET_IP_);
 	_raw_write_unlock(lock);
 	local_irq_enable();
 	preempt_enable();
@@ -343,9 +453,10 @@ EXPORT_SYMBOL(_write_unlock_irq);
 
 void __lockfunc _write_unlock_bh(rwlock_t *lock)
 {
+	rwlock_release(&lock->dep_map, 1, _RET_IP_);
 	_raw_write_unlock(lock);
 	preempt_enable_no_resched();
-	local_bh_enable();
+	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
 }
 EXPORT_SYMBOL(_write_unlock_bh);
 
@@ -353,11 +464,13 @@ int __lockfunc _spin_trylock_bh(spinlock
 {
 	local_bh_disable();
 	preempt_disable();
-	if (_raw_spin_trylock(lock))
+	if (_raw_spin_trylock(lock)) {
+		spin_acquire(&lock->dep_map, 0, 1, _RET_IP_);
 		return 1;
+	}
 
 	preempt_enable_no_resched();
-	local_bh_enable();
+	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
 	return 0;
 }
 EXPORT_SYMBOL(_spin_trylock_bh);
Index: linux/lib/kernel_lock.c
===================================================================
--- linux.orig/lib/kernel_lock.c
+++ linux/lib/kernel_lock.c
@@ -177,7 +177,12 @@ static inline void __lock_kernel(void)
 
 static inline void __unlock_kernel(void)
 {
-	spin_unlock(&kernel_flag);
+	/*
+	 * the BKL is not covered by lockdep, so we open-code the
+	 * unlocking sequence (and thus avoid the dep-chain ops):
+	 */
+	_raw_spin_unlock(&kernel_flag);
+	preempt_enable();
 }
 
 /*
Index: linux/net/ipv4/route.c
===================================================================
--- linux.orig/net/ipv4/route.c
+++ linux/net/ipv4/route.c
@@ -206,7 +206,9 @@ __u8 ip_tos2prio[16] = {
 struct rt_hash_bucket {
 	struct rtable	*chain;
 };
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
+	defined(CONFIG_PROVE_SPIN_LOCKING) || \
+	defined(CONFIG_PROVE_RW_LOCKING)
 /*
  * Instead of using one spinlock for each rt_hash_bucket, we use a table of spinlocks
  * The size of this table is a power of two and depends on the number of CPUS.


* [patch 28/61] lock validator: prove mutex locking correctness
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (26 preceding siblings ...)
  2006-05-29 21:25 ` [patch 27/61] lock validator: prove spinlock/rwlock " Ingo Molnar
@ 2006-05-29 21:25 ` Ingo Molnar
  2006-05-29 21:25 ` [patch 29/61] lock validator: print all lock-types on SysRq-D Ingo Molnar
                   ` (45 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

add CONFIG_PROVE_MUTEX_LOCKING, which uses the lock validator framework
to prove mutex locking correctness.
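
As an illustration only (not part of this patch): once two mutexes of the
same lock-type are taken nested inside each other, the inner acquire can be
annotated with a subtype instead of producing a false positive. A minimal
sketch - the struct and helper below are invented for the example, only
mutex_lock_nested() and SINGLE_DEPTH_NESTING come from this patch series:

	#include <linux/mutex.h>

	struct demo_node {			/* hypothetical example type */
		struct mutex lock;
		struct demo_node *parent;
	};

	static void demo_lock_parent_then_child(struct demo_node *child)
	{
		mutex_lock(&child->parent->lock);
		/* same lock-type again - annotate it as one level deeper: */
		mutex_lock_nested(&child->lock, SINGLE_DEPTH_NESTING);
	}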

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 include/linux/mutex-debug.h |    8 +++++++-
 include/linux/mutex.h       |   34 +++++++++++++++++++++++++++++++---
 kernel/mutex-debug.c        |    8 ++++++++
 kernel/mutex-lockdep.h      |   40 ++++++++++++++++++++++++++++++++++++++++
 kernel/mutex.c              |   28 ++++++++++++++++++++++------
 kernel/mutex.h              |    3 +--
 6 files changed, 109 insertions(+), 12 deletions(-)

Index: linux/include/linux/mutex-debug.h
===================================================================
--- linux.orig/include/linux/mutex-debug.h
+++ linux/include/linux/mutex-debug.h
@@ -2,6 +2,7 @@
 #define __LINUX_MUTEX_DEBUG_H
 
 #include <linux/linkage.h>
+#include <linux/lockdep.h>
 
 /*
  * Mutexes - debugging helpers:
@@ -10,7 +11,12 @@
 #define __DEBUG_MUTEX_INITIALIZER(lockname)				\
 	, .magic = &lockname
 
-#define mutex_init(sem)		__mutex_init(sem, __FILE__":"#sem)
+#define mutex_init(mutex)						\
+do {									\
+	static struct lockdep_type_key __key;				\
+									\
+	__mutex_init((mutex), #mutex, &__key);				\
+} while (0)
 
 extern void FASTCALL(mutex_destroy(struct mutex *lock));
 
Index: linux/include/linux/mutex.h
===================================================================
--- linux.orig/include/linux/mutex.h
+++ linux/include/linux/mutex.h
@@ -13,6 +13,7 @@
 #include <linux/list.h>
 #include <linux/spinlock_types.h>
 #include <linux/linkage.h>
+#include <linux/lockdep.h>
 
 #include <asm/atomic.h>
 
@@ -53,6 +54,9 @@ struct mutex {
 	const char 		*name;
 	void			*magic;
 #endif
+#ifdef CONFIG_PROVE_MUTEX_LOCKING
+	struct lockdep_map	dep_map;
+#endif
 };
 
 /*
@@ -72,20 +76,36 @@ struct mutex_waiter {
 # include <linux/mutex-debug.h>
 #else
 # define __DEBUG_MUTEX_INITIALIZER(lockname)
-# define mutex_init(mutex)			__mutex_init(mutex, NULL)
+# define mutex_init(mutex) \
+do {							\
+	static struct lockdep_type_key __key;		\
+							\
+	__mutex_init((mutex), NULL, &__key);		\
+} while (0)
 # define mutex_destroy(mutex)				do { } while (0)
 #endif
 
+#ifdef CONFIG_PROVE_MUTEX_LOCKING
+# define __DEP_MAP_MUTEX_INITIALIZER(lockname) \
+		, .dep_map = { .name = #lockname }
+#else
+# define __DEP_MAP_MUTEX_INITIALIZER(lockname)
+#endif
+
 #define __MUTEX_INITIALIZER(lockname) \
 		{ .count = ATOMIC_INIT(1) \
 		, .wait_lock = SPIN_LOCK_UNLOCKED \
 		, .wait_list = LIST_HEAD_INIT(lockname.wait_list) \
-		__DEBUG_MUTEX_INITIALIZER(lockname) }
+		__DEBUG_MUTEX_INITIALIZER(lockname) \
+		__DEP_MAP_MUTEX_INITIALIZER(lockname) }
 
 #define DEFINE_MUTEX(mutexname) \
 	struct mutex mutexname = __MUTEX_INITIALIZER(mutexname)
 
-extern void fastcall __mutex_init(struct mutex *lock, const char *name);
+extern void __mutex_init(struct mutex *lock, const char *name,
+			 struct lockdep_type_key *key);
+
+#define mutex_init_key(mutex, name, key) __mutex_init((mutex), name, key)
 
 /***
  * mutex_is_locked - is the mutex locked
@@ -104,11 +124,19 @@ static inline int fastcall mutex_is_lock
  */
 extern void fastcall mutex_lock(struct mutex *lock);
 extern int fastcall mutex_lock_interruptible(struct mutex *lock);
+
+#ifdef CONFIG_PROVE_MUTEX_LOCKING
+extern void mutex_lock_nested(struct mutex *lock, unsigned int subtype);
+#else
+# define mutex_lock_nested(lock, subtype) mutex_lock(lock)
+#endif
+
 /*
  * NOTE: mutex_trylock() follows the spin_trylock() convention,
  *       not the down_trylock() convention!
  */
 extern int fastcall mutex_trylock(struct mutex *lock);
 extern void fastcall mutex_unlock(struct mutex *lock);
+extern void fastcall mutex_unlock_non_nested(struct mutex *lock);
 
 #endif
Index: linux/kernel/mutex-debug.c
===================================================================
--- linux.orig/kernel/mutex-debug.c
+++ linux/kernel/mutex-debug.c
@@ -100,6 +100,14 @@ static int check_deadlock(struct mutex *
 		return 0;
 
 	task = ti->task;
+	/*
+	 * In the PROVE_MUTEX_LOCKING case we are already tracking
+	 * all held locks, which allows us to optimize this:
+	 */
+#ifdef CONFIG_PROVE_MUTEX_LOCKING
+	if (!task->lockdep_depth)
+		return 0;
+#endif
 	lockblk = NULL;
 	if (task->blocked_on)
 		lockblk = task->blocked_on->lock;
Index: linux/kernel/mutex-lockdep.h
===================================================================
--- /dev/null
+++ linux/kernel/mutex-lockdep.h
@@ -0,0 +1,40 @@
+/*
+ * Mutexes: blocking mutual exclusion locks
+ *
+ * started by Ingo Molnar:
+ *
+ *  Copyright (C) 2004-2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
+ *
+ * This file contains mutex debugging related internal prototypes, for the
+ * !CONFIG_DEBUG_MUTEXES && CONFIG_PROVE_MUTEX_LOCKING case. Most of
+ * them are NOPs:
+ */
+
+#define spin_lock_mutex(lock, flags)			\
+	do {						\
+		local_irq_save(flags);			\
+		__raw_spin_lock(&(lock)->raw_lock);	\
+	} while (0)
+
+#define spin_unlock_mutex(lock, flags)			\
+	do {						\
+		__raw_spin_unlock(&(lock)->raw_lock);	\
+		local_irq_restore(flags);		\
+	} while (0)
+
+#define mutex_remove_waiter(lock, waiter, ti) \
+		__list_del((waiter)->list.prev, (waiter)->list.next)
+
+#define debug_mutex_set_owner(lock, new_owner)		do { } while (0)
+#define debug_mutex_clear_owner(lock)			do { } while (0)
+#define debug_mutex_wake_waiter(lock, waiter)		do { } while (0)
+#define debug_mutex_free_waiter(waiter)			do { } while (0)
+#define debug_mutex_add_waiter(lock, waiter, ti)	do { } while (0)
+#define debug_mutex_unlock(lock)			do { } while (0)
+#define debug_mutex_init(lock, name)			do { } while (0)
+
+static inline void
+debug_mutex_lock_common(struct mutex *lock,
+			struct mutex_waiter *waiter)
+{
+}
Index: linux/kernel/mutex.c
===================================================================
--- linux.orig/kernel/mutex.c
+++ linux/kernel/mutex.c
@@ -27,8 +27,13 @@
 # include "mutex-debug.h"
 # include <asm-generic/mutex-null.h>
 #else
-# include "mutex.h"
-# include <asm/mutex.h>
+# ifdef CONFIG_PROVE_MUTEX_LOCKING
+#  include "mutex-lockdep.h"
+#  include <asm-generic/mutex-null.h>
+# else
+#  include "mutex.h"
+#  include <asm/mutex.h>
+# endif
 #endif
 
 /***
@@ -39,13 +44,18 @@
  *
  * It is not allowed to initialize an already locked mutex.
  */
-__always_inline void fastcall __mutex_init(struct mutex *lock, const char *name)
+void
+__mutex_init(struct mutex *lock, const char *name, struct lockdep_type_key *key)
 {
 	atomic_set(&lock->count, 1);
 	spin_lock_init(&lock->wait_lock);
 	INIT_LIST_HEAD(&lock->wait_list);
 
 	debug_mutex_init(lock, name);
+
+#ifdef CONFIG_PROVE_MUTEX_LOCKING
+	lockdep_init_map(&lock->dep_map, name, key);
+#endif
 }
 
 EXPORT_SYMBOL(__mutex_init);
@@ -146,6 +156,7 @@ __mutex_lock_common(struct mutex *lock, 
 	spin_lock_mutex(&lock->wait_lock, flags);
 
 	debug_mutex_lock_common(lock, &waiter);
+	mutex_acquire(&lock->dep_map, subtype, 0, _RET_IP_);
 	debug_mutex_add_waiter(lock, &waiter, task->thread_info);
 
 	/* add waiting tasks to the end of the waitqueue (FIFO): */
@@ -173,6 +184,7 @@ __mutex_lock_common(struct mutex *lock, 
 		if (unlikely(state == TASK_INTERRUPTIBLE &&
 						signal_pending(task))) {
 			mutex_remove_waiter(lock, &waiter, task->thread_info);
+			mutex_release(&lock->dep_map, 1, _RET_IP_);
 			spin_unlock_mutex(&lock->wait_lock, flags);
 
 			debug_mutex_free_waiter(&waiter);
@@ -198,7 +210,9 @@ __mutex_lock_common(struct mutex *lock, 
 
 	debug_mutex_free_waiter(&waiter);
 
+#ifdef CONFIG_DEBUG_MUTEXES
 	DEBUG_WARN_ON(lock->owner != task->thread_info);
+#endif
 
 	return 0;
 }
@@ -211,7 +225,7 @@ __mutex_lock_slowpath(atomic_t *lock_cou
 	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0);
 }
 
-#ifdef CONFIG_DEBUG_MUTEXES
+#ifdef CONFIG_PROVE_MUTEX_LOCKING
 void __sched
 mutex_lock_nested(struct mutex *lock, unsigned int subtype)
 {
@@ -232,6 +246,7 @@ __mutex_unlock_common_slowpath(atomic_t 
 	unsigned long flags;
 
 	spin_lock_mutex(&lock->wait_lock, flags);
+	mutex_release(&lock->dep_map, nested, _RET_IP_);
 	debug_mutex_unlock(lock);
 
 	/*
@@ -322,9 +337,10 @@ static inline int __mutex_trylock_slowpa
 	spin_lock_mutex(&lock->wait_lock, flags);
 
 	prev = atomic_xchg(&lock->count, -1);
-	if (likely(prev == 1))
+	if (likely(prev == 1)) {
 		debug_mutex_set_owner(lock, current_thread_info());
-
+		mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
+	}
 	/* Set it back to 0 if there are no waiters: */
 	if (likely(list_empty(&lock->wait_list)))
 		atomic_set(&lock->count, 0);
Index: linux/kernel/mutex.h
===================================================================
--- linux.orig/kernel/mutex.h
+++ linux/kernel/mutex.h
@@ -16,14 +16,13 @@
 #define mutex_remove_waiter(lock, waiter, ti) \
 		__list_del((waiter)->list.prev, (waiter)->list.next)
 
+#undef DEBUG_WARN_ON
 #define DEBUG_WARN_ON(c)				do { } while (0)
 #define debug_mutex_set_owner(lock, new_owner)		do { } while (0)
 #define debug_mutex_clear_owner(lock)			do { } while (0)
 #define debug_mutex_wake_waiter(lock, waiter)		do { } while (0)
 #define debug_mutex_free_waiter(waiter)			do { } while (0)
 #define debug_mutex_add_waiter(lock, waiter, ti)	do { } while (0)
-#define mutex_acquire(lock, subtype, trylock)	do { } while (0)
-#define mutex_release(lock, nested)		do { } while (0)
 #define debug_mutex_unlock(lock)			do { } while (0)
 #define debug_mutex_init(lock, name)			do { } while (0)
 


* [patch 29/61] lock validator: print all lock-types on SysRq-D
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (27 preceding siblings ...)
  2006-05-29 21:25 ` [patch 28/61] lock validator: prove mutex " Ingo Molnar
@ 2006-05-29 21:25 ` Ingo Molnar
  2006-05-29 21:25 ` [patch 30/61] lock validator: x86_64 early init Ingo Molnar
                   ` (44 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

print all lock-types on SysRq-D.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 drivers/char/sysrq.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux/drivers/char/sysrq.c
===================================================================
--- linux.orig/drivers/char/sysrq.c
+++ linux/drivers/char/sysrq.c
@@ -148,12 +148,14 @@ static struct sysrq_key_op sysrq_mountro
 	.enable_mask	= SYSRQ_ENABLE_REMOUNT,
 };
 
-#ifdef CONFIG_DEBUG_MUTEXES
+#ifdef CONFIG_LOCKDEP
 static void sysrq_handle_showlocks(int key, struct pt_regs *pt_regs,
 				struct tty_struct *tty)
 {
 	debug_show_all_locks();
+	print_lock_types();
 }
+
 static struct sysrq_key_op sysrq_showlocks_op = {
 	.handler	= sysrq_handle_showlocks,
 	.help_msg	= "show-all-locks(D)",


* [patch 30/61] lock validator: x86_64 early init
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (28 preceding siblings ...)
  2006-05-29 21:25 ` [patch 29/61] lock validator: print all lock-types on SysRq-D Ingo Molnar
@ 2006-05-29 21:25 ` Ingo Molnar
  2006-05-29 21:25 ` [patch 31/61] lock validator: SMP alternatives workaround Ingo Molnar
                   ` (43 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

x86_64 uses spinlocks very early - earlier than start_kernel().
So call lockdep_init() from the arch setup code.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 arch/x86_64/kernel/head64.c |    5 +++++
 1 file changed, 5 insertions(+)

Index: linux/arch/x86_64/kernel/head64.c
===================================================================
--- linux.orig/arch/x86_64/kernel/head64.c
+++ linux/arch/x86_64/kernel/head64.c
@@ -85,6 +85,11 @@ void __init x86_64_start_kernel(char * r
 	clear_bss();
 
 	/*
+	 * This must be called really, really early:
+	 */
+	lockdep_init();
+
+	/*
 	 * switch to init_level4_pgt from boot_level4_pgt
 	 */
 	memcpy(init_level4_pgt, boot_level4_pgt, PTRS_PER_PGD*sizeof(pgd_t));


* [patch 31/61] lock validator: SMP alternatives workaround
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (29 preceding siblings ...)
  2006-05-29 21:25 ` [patch 30/61] lock validator: x86_64 early init Ingo Molnar
@ 2006-05-29 21:25 ` Ingo Molnar
  2006-05-29 21:25 ` [patch 32/61] lock validator: do not recurse in printk() Ingo Molnar
                   ` (42 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

disable SMP alternatives fixups (the patching in of NOPs on 1-CPU
systems) if the lock validator is enabled: there is a binutils
section handling bug that causes corrupted instructions when
UP instructions are patched in.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 arch/i386/kernel/alternative.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

Index: linux/arch/i386/kernel/alternative.c
===================================================================
--- linux.orig/arch/i386/kernel/alternative.c
+++ linux/arch/i386/kernel/alternative.c
@@ -301,6 +301,16 @@ void alternatives_smp_switch(int smp)
 	struct smp_alt_module *mod;
 	unsigned long flags;
 
+#ifdef CONFIG_LOCKDEP
+	/*
+	 * A not yet fixed binutils section handling bug prevents
+	 * alternatives-replacement from working reliably, so turn
+	 * it off:
+	 */
+	printk("lockdep: not fixing up alternatives.\n");
+	return;
+#endif
+
 	if (no_replacement || smp_alt_once)
 		return;
 	BUG_ON(!smp && (num_online_cpus() > 1));


* [patch 32/61] lock validator: do not recurse in printk()
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (30 preceding siblings ...)
  2006-05-29 21:25 ` [patch 31/61] lock validator: SMP alternatives workaround Ingo Molnar
@ 2006-05-29 21:25 ` Ingo Molnar
  2006-05-29 21:25 ` [patch 33/61] lock validator: disable NMI watchdog if CONFIG_LOCKDEP Ingo Molnar
                   ` (41 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

make printk()-ing from within the lock validation code safer by
using the lockdep-recursion counter.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 kernel/printk.c |   20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

Index: linux/kernel/printk.c
===================================================================
--- linux.orig/kernel/printk.c
+++ linux/kernel/printk.c
@@ -516,7 +516,9 @@ asmlinkage int vprintk(const char *fmt, 
 		zap_locks();
 
 	/* This stops the holder of console_sem just where we want him */
-	spin_lock_irqsave(&logbuf_lock, flags);
+	local_irq_save(flags);
+	current->lockdep_recursion++;
+	spin_lock(&logbuf_lock);
 	printk_cpu = smp_processor_id();
 
 	/* Emit the output into the temporary buffer */
@@ -586,7 +588,7 @@ asmlinkage int vprintk(const char *fmt, 
 		 */
 		console_locked = 1;
 		printk_cpu = UINT_MAX;
-		spin_unlock_irqrestore(&logbuf_lock, flags);
+		spin_unlock(&logbuf_lock);
 
 		/*
 		 * Console drivers may assume that per-cpu resources have
@@ -602,6 +604,8 @@ asmlinkage int vprintk(const char *fmt, 
 			console_locked = 0;
 			up(&console_sem);
 		}
+		current->lockdep_recursion--;
+		local_irq_restore(flags);
 	} else {
 		/*
 		 * Someone else owns the drivers.  We drop the spinlock, which
@@ -609,7 +613,9 @@ asmlinkage int vprintk(const char *fmt, 
 		 * console drivers with the output which we just produced.
 		 */
 		printk_cpu = UINT_MAX;
-		spin_unlock_irqrestore(&logbuf_lock, flags);
+		spin_unlock(&logbuf_lock);
+		current->lockdep_recursion--;
+		local_irq_restore(flags);
 	}
 
 	preempt_enable();
@@ -783,7 +789,13 @@ void release_console_sem(void)
 	up(&console_sem);
 	spin_unlock_irqrestore(&logbuf_lock, flags);
 	if (wake_klogd && !oops_in_progress && waitqueue_active(&log_wait))
-		wake_up_interruptible(&log_wait);
+		/*
+		 * If we printk from within the lock dependency code
+		 * or from within the scheduler code, then do not lock
+		 * up due to self-recursion:
+		 */
+		if (current->lockdep_recursion <= 1)
+			wake_up_interruptible(&log_wait);
 }
 EXPORT_SYMBOL(release_console_sem);
 


* [patch 33/61] lock validator: disable NMI watchdog if CONFIG_LOCKDEP
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (31 preceding siblings ...)
  2006-05-29 21:25 ` [patch 32/61] lock validator: do not recurse in printk() Ingo Molnar
@ 2006-05-29 21:25 ` Ingo Molnar
  2006-05-29 22:49   ` Keith Owens
  2006-05-29 21:25 ` [patch 34/61] lock validator: special locking: bdev Ingo Molnar
                   ` (40 subsequent siblings)
  73 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

The NMI watchdog uses spinlocks (notifier chains, etc.),
so it's not lockdep-safe at the moment.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 arch/x86_64/kernel/nmi.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

Index: linux/arch/x86_64/kernel/nmi.c
===================================================================
--- linux.orig/arch/x86_64/kernel/nmi.c
+++ linux/arch/x86_64/kernel/nmi.c
@@ -205,6 +205,18 @@ int __init check_nmi_watchdog (void)
 	int *counts;
 	int cpu;
 
+#ifdef CONFIG_LOCKDEP
+	/*
+	 * The NMI watchdog uses spinlocks (notifier chains, etc.),
+	 * so it's not lockdep-safe:
+	 */
+	nmi_watchdog = 0;
+	for_each_online_cpu(cpu)
+		per_cpu(nmi_watchdog_ctlblk.enabled, cpu) = 0;
+
+	printk("lockdep: disabled NMI watchdog.\n");
+	return 0;
+#endif
 	if ((nmi_watchdog == NMI_NONE) || (nmi_watchdog == NMI_DEFAULT))
 		return 0;
 


* [patch 34/61] lock validator: special locking: bdev
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (32 preceding siblings ...)
  2006-05-29 21:25 ` [patch 33/61] lock validator: disable NMI watchdog if CONFIG_LOCKDEP Ingo Molnar
@ 2006-05-29 21:25 ` Ingo Molnar
  2006-05-30  1:35   ` Andrew Morton
  2006-05-29 21:25 ` [patch 35/61] lock validator: special locking: direct-IO Ingo Molnar
                   ` (39 subsequent siblings)
  73 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.
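
For illustration only (not part of the patch): the pattern is to define an
enum of nesting subtypes for the single bd_mutex lock-type and to pass the
subtype at each acquire site, so that lockdep can tell the whole-disk mutex
apart from the partition mutex. A minimal sketch using the BD_MUTEX_*
subtypes added by this patch; the helper name is invented:

	#include <linux/fs.h>

	static void demo_lock_part_then_whole(struct block_device *part,
					      struct block_device *whole)
	{
		/* both locks are instances of the same bd_mutex lock-type: */
		mutex_lock_nested(&part->bd_mutex, BD_MUTEX_NORMAL);
		mutex_lock_nested(&whole->bd_mutex, BD_MUTEX_WHOLE);

		/* ... operate on the partition and its whole disk ... */

		mutex_unlock(&whole->bd_mutex);
		mutex_unlock(&part->bd_mutex);
	}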

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 drivers/md/md.c    |    6 +--
 fs/block_dev.c     |  105 ++++++++++++++++++++++++++++++++++++++++++++++-------
 include/linux/fs.h |   17 ++++++++
 3 files changed, 112 insertions(+), 16 deletions(-)

Index: linux/drivers/md/md.c
===================================================================
--- linux.orig/drivers/md/md.c
+++ linux/drivers/md/md.c
@@ -1394,7 +1394,7 @@ static int lock_rdev(mdk_rdev_t *rdev, d
 	struct block_device *bdev;
 	char b[BDEVNAME_SIZE];
 
-	bdev = open_by_devnum(dev, FMODE_READ|FMODE_WRITE);
+	bdev = open_partition_by_devnum(dev, FMODE_READ|FMODE_WRITE);
 	if (IS_ERR(bdev)) {
 		printk(KERN_ERR "md: could not open %s.\n",
 			__bdevname(dev, b));
@@ -1404,7 +1404,7 @@ static int lock_rdev(mdk_rdev_t *rdev, d
 	if (err) {
 		printk(KERN_ERR "md: could not bd_claim %s.\n",
 			bdevname(bdev, b));
-		blkdev_put(bdev);
+		blkdev_put_partition(bdev);
 		return err;
 	}
 	rdev->bdev = bdev;
@@ -1418,7 +1418,7 @@ static void unlock_rdev(mdk_rdev_t *rdev
 	if (!bdev)
 		MD_BUG();
 	bd_release(bdev);
-	blkdev_put(bdev);
+	blkdev_put_partition(bdev);
 }
 
 void md_autodetect_dev(dev_t dev);
Index: linux/fs/block_dev.c
===================================================================
--- linux.orig/fs/block_dev.c
+++ linux/fs/block_dev.c
@@ -746,7 +746,7 @@ static int bd_claim_by_kobject(struct bl
 	if (!bo)
 		return -ENOMEM;
 
-	mutex_lock(&bdev->bd_mutex);
+	mutex_lock_nested(&bdev->bd_mutex, BD_MUTEX_PARTITION);
 	res = bd_claim(bdev, holder);
 	if (res || !add_bd_holder(bdev, bo))
 		free_bd_holder(bo);
@@ -771,7 +771,7 @@ static void bd_release_from_kobject(stru
 	if (!kobj)
 		return;
 
-	mutex_lock(&bdev->bd_mutex);
+	mutex_lock_nested(&bdev->bd_mutex, BD_MUTEX_PARTITION);
 	bd_release(bdev);
 	if ((bo = del_bd_holder(bdev, kobj)))
 		free_bd_holder(bo);
@@ -829,6 +829,22 @@ struct block_device *open_by_devnum(dev_
 
 EXPORT_SYMBOL(open_by_devnum);
 
+static int
+blkdev_get_partition(struct block_device *bdev, mode_t mode, unsigned flags);
+
+struct block_device *open_partition_by_devnum(dev_t dev, unsigned mode)
+{
+	struct block_device *bdev = bdget(dev);
+	int err = -ENOMEM;
+	int flags = mode & FMODE_WRITE ? O_RDWR : O_RDONLY;
+	if (bdev)
+		err = blkdev_get_partition(bdev, mode, flags);
+	return err ? ERR_PTR(err) : bdev;
+}
+
+EXPORT_SYMBOL(open_partition_by_devnum);
+
+
 /*
  * This routine checks whether a removable media has been changed,
  * and invalidates all buffer-cache-entries in that case. This
@@ -875,7 +891,11 @@ void bd_set_size(struct block_device *bd
 }
 EXPORT_SYMBOL(bd_set_size);
 
-static int do_open(struct block_device *bdev, struct file *file)
+static int
+blkdev_get_whole(struct block_device *bdev, mode_t mode, unsigned flags);
+
+static int
+do_open(struct block_device *bdev, struct file *file, unsigned int subtype)
 {
 	struct module *owner = NULL;
 	struct gendisk *disk;
@@ -892,7 +912,8 @@ static int do_open(struct block_device *
 	}
 	owner = disk->fops->owner;
 
-	mutex_lock(&bdev->bd_mutex);
+	mutex_lock_nested(&bdev->bd_mutex, subtype);
+
 	if (!bdev->bd_openers) {
 		bdev->bd_disk = disk;
 		bdev->bd_contains = bdev;
@@ -917,13 +938,17 @@ static int do_open(struct block_device *
 			struct block_device *whole;
 			whole = bdget_disk(disk, 0);
 			ret = -ENOMEM;
+			/*
+			 * We must not recurse deeper than 1:
+			 */
+			WARN_ON(subtype != 0);
 			if (!whole)
 				goto out_first;
-			ret = blkdev_get(whole, file->f_mode, file->f_flags);
+			ret = blkdev_get_whole(whole, file->f_mode, file->f_flags);
 			if (ret)
 				goto out_first;
 			bdev->bd_contains = whole;
-			mutex_lock(&whole->bd_mutex);
+			mutex_lock_nested(&whole->bd_mutex, BD_MUTEX_WHOLE);
 			whole->bd_part_count++;
 			p = disk->part[part - 1];
 			bdev->bd_inode->i_data.backing_dev_info =
@@ -951,7 +976,8 @@ static int do_open(struct block_device *
 			if (bdev->bd_invalidated)
 				rescan_partitions(bdev->bd_disk, bdev);
 		} else {
-			mutex_lock(&bdev->bd_contains->bd_mutex);
+			mutex_lock_nested(&bdev->bd_contains->bd_mutex,
+					  BD_MUTEX_PARTITION);
 			bdev->bd_contains->bd_part_count++;
 			mutex_unlock(&bdev->bd_contains->bd_mutex);
 		}
@@ -992,11 +1018,49 @@ int blkdev_get(struct block_device *bdev
 	fake_file.f_dentry = &fake_dentry;
 	fake_dentry.d_inode = bdev->bd_inode;
 
-	return do_open(bdev, &fake_file);
+	return do_open(bdev, &fake_file, BD_MUTEX_NORMAL);
 }
 
 EXPORT_SYMBOL(blkdev_get);
 
+static int
+blkdev_get_whole(struct block_device *bdev, mode_t mode, unsigned flags)
+{
+	/*
+	 * This crockload is due to bad choice of ->open() type.
+	 * It will go away.
+	 * For now, block device ->open() routine must _not_
+	 * examine anything in 'inode' argument except ->i_rdev.
+	 */
+	struct file fake_file = {};
+	struct dentry fake_dentry = {};
+	fake_file.f_mode = mode;
+	fake_file.f_flags = flags;
+	fake_file.f_dentry = &fake_dentry;
+	fake_dentry.d_inode = bdev->bd_inode;
+
+	return do_open(bdev, &fake_file, BD_MUTEX_WHOLE);
+}
+
+static int
+blkdev_get_partition(struct block_device *bdev, mode_t mode, unsigned flags)
+{
+	/*
+	 * This crockload is due to bad choice of ->open() type.
+	 * It will go away.
+	 * For now, block device ->open() routine must _not_
+	 * examine anything in 'inode' argument except ->i_rdev.
+	 */
+	struct file fake_file = {};
+	struct dentry fake_dentry = {};
+	fake_file.f_mode = mode;
+	fake_file.f_flags = flags;
+	fake_file.f_dentry = &fake_dentry;
+	fake_dentry.d_inode = bdev->bd_inode;
+
+	return do_open(bdev, &fake_file, BD_MUTEX_PARTITION);
+}
+
 static int blkdev_open(struct inode * inode, struct file * filp)
 {
 	struct block_device *bdev;
@@ -1012,7 +1076,7 @@ static int blkdev_open(struct inode * in
 
 	bdev = bd_acquire(inode);
 
-	res = do_open(bdev, filp);
+	res = do_open(bdev, filp, BD_MUTEX_NORMAL);
 	if (res)
 		return res;
 
@@ -1026,13 +1090,13 @@ static int blkdev_open(struct inode * in
 	return res;
 }
 
-int blkdev_put(struct block_device *bdev)
+static int __blkdev_put(struct block_device *bdev, unsigned int subtype)
 {
 	int ret = 0;
 	struct inode *bd_inode = bdev->bd_inode;
 	struct gendisk *disk = bdev->bd_disk;
 
-	mutex_lock(&bdev->bd_mutex);
+	mutex_lock_nested(&bdev->bd_mutex, subtype);
 	lock_kernel();
 	if (!--bdev->bd_openers) {
 		sync_blockdev(bdev);
@@ -1042,7 +1106,9 @@ int blkdev_put(struct block_device *bdev
 		if (disk->fops->release)
 			ret = disk->fops->release(bd_inode, NULL);
 	} else {
-		mutex_lock(&bdev->bd_contains->bd_mutex);
+		WARN_ON(subtype != 0);
+		mutex_lock_nested(&bdev->bd_contains->bd_mutex,
+				  BD_MUTEX_PARTITION);
 		bdev->bd_contains->bd_part_count--;
 		mutex_unlock(&bdev->bd_contains->bd_mutex);
 	}
@@ -1059,7 +1125,8 @@ int blkdev_put(struct block_device *bdev
 		bdev->bd_disk = NULL;
 		bdev->bd_inode->i_data.backing_dev_info = &default_backing_dev_info;
 		if (bdev != bdev->bd_contains) {
-			blkdev_put(bdev->bd_contains);
+			WARN_ON(subtype != 0);
+			__blkdev_put(bdev->bd_contains, 1);
 		}
 		bdev->bd_contains = NULL;
 	}
@@ -1069,8 +1136,20 @@ int blkdev_put(struct block_device *bdev
 	return ret;
 }
 
+int blkdev_put(struct block_device *bdev)
+{
+	return __blkdev_put(bdev, BD_MUTEX_NORMAL);
+}
+
 EXPORT_SYMBOL(blkdev_put);
 
+int blkdev_put_partition(struct block_device *bdev)
+{
+	return __blkdev_put(bdev, BD_MUTEX_PARTITION);
+}
+
+EXPORT_SYMBOL(blkdev_put_partition);
+
 static int blkdev_close(struct inode * inode, struct file * filp)
 {
 	struct block_device *bdev = I_BDEV(filp->f_mapping->host);
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h
+++ linux/include/linux/fs.h
@@ -436,6 +436,21 @@ struct block_device {
 };
 
 /*
+ * bdev->bd_mutex nesting types for the LOCKDEP validator:
+ *
+ * 0: normal
+ * 1: 'whole'
+ * 2: 'partition'
+ */
+enum bdev_bd_mutex_lock_type
+{
+	BD_MUTEX_NORMAL,
+	BD_MUTEX_WHOLE,
+	BD_MUTEX_PARTITION
+};
+
+
+/*
  * Radix-tree tags, for tagging dirty and writeback pages within the pagecache
  * radix trees
  */
@@ -1404,6 +1419,7 @@ extern void bd_set_size(struct block_dev
 extern void bd_forget(struct inode *inode);
 extern void bdput(struct block_device *);
 extern struct block_device *open_by_devnum(dev_t, unsigned);
+extern struct block_device *open_partition_by_devnum(dev_t, unsigned);
 extern const struct file_operations def_blk_fops;
 extern const struct address_space_operations def_blk_aops;
 extern const struct file_operations def_chr_fops;
@@ -1414,6 +1430,7 @@ extern int blkdev_ioctl(struct inode *, 
 extern long compat_blkdev_ioctl(struct file *, unsigned, unsigned long);
 extern int blkdev_get(struct block_device *, mode_t, unsigned);
 extern int blkdev_put(struct block_device *);
+extern int blkdev_put_partition(struct block_device *);
 extern int bd_claim(struct block_device *, void *);
 extern void bd_release(struct block_device *);
 #ifdef CONFIG_SYSFS


* [patch 35/61] lock validator: special locking: direct-IO
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (33 preceding siblings ...)
  2006-05-29 21:25 ` [patch 34/61] lock validator: special locking: bdev Ingo Molnar
@ 2006-05-29 21:25 ` Ingo Molnar
  2006-05-29 21:26 ` [patch 36/61] lock validator: special locking: serial Ingo Molnar
                   ` (38 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (rwsem-in-irq) locking code to the lock validator. Has no
effect on non-lockdep kernels.
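
For illustration only (not part of the patch): lockdep normally assumes
that the task which acquires a lock also releases it. For i_alloc_sem the
read-side is released from the I/O-completion path, i.e. by a different
context, so the non-owner rwsem variants are used. A minimal sketch with
invented function names; only down_read_non_owner()/up_read_non_owner()
come from this patch series:

	#include <linux/rwsem.h>

	static void demo_submit_io(struct rw_semaphore *sem)
	{
		/* acquired here, released later by another context: */
		down_read_non_owner(sem);
		/* ... queue the I/O ... */
	}

	static void demo_complete_io(struct rw_semaphore *sem)
	{
		/* completion path releases it on behalf of the submitter: */
		up_read_non_owner(sem);
	}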

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 fs/direct-io.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Index: linux/fs/direct-io.c
===================================================================
--- linux.orig/fs/direct-io.c
+++ linux/fs/direct-io.c
@@ -220,7 +220,8 @@ static void dio_complete(struct dio *dio
 	if (dio->end_io && dio->result)
 		dio->end_io(dio->iocb, offset, bytes, dio->map_bh.b_private);
 	if (dio->lock_type == DIO_LOCKING)
-		up_read(&dio->inode->i_alloc_sem);
+		/* lockdep: non-owner release */
+		up_read_non_owner(&dio->inode->i_alloc_sem);
 }
 
 /*
@@ -1261,7 +1262,8 @@ __blockdev_direct_IO(int rw, struct kioc
 		}
 
 		if (dio_lock_type == DIO_LOCKING)
-			down_read(&inode->i_alloc_sem);
+			/* lockdep: not the owner will release it */
+			down_read_non_owner(&inode->i_alloc_sem);
 	}
 
 	/*


* [patch 36/61] lock validator: special locking: serial
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (34 preceding siblings ...)
  2006-05-29 21:25 ` [patch 35/61] lock validator: special locking: direct-IO Ingo Molnar
@ 2006-05-29 21:26 ` Ingo Molnar
  2006-05-30  1:35   ` Andrew Morton
  2006-05-29 21:26 ` [patch 37/61] lock validator: special locking: dcache Ingo Molnar
                   ` (37 subsequent siblings)
  73 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:26 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (dual-initialized) locking code to the lock validator.
Has no effect on non-lockdep kernels.
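
For illustration only (not part of the patch): every spin_lock_init()
call-site normally becomes a separate lock-type, so a lock that can be
initialized from two places would otherwise show up as two types. Passing
one static key to both init sites keeps it a single type. A minimal sketch
with invented names; spin_lock_init_key() and struct lockdep_type_key are
the interfaces used by the patch below:

	#include <linux/spinlock.h>

	static struct lockdep_type_key demo_lock_key;

	static void demo_early_init(spinlock_t *lock)
	{
		/* e.g. console-setup path: */
		spin_lock_init_key(lock, &demo_lock_key);
	}

	static void demo_late_init(spinlock_t *lock)
	{
		/* e.g. port-registration path - same lock-type as above: */
		spin_lock_init_key(lock, &demo_lock_key);
	}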

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 drivers/serial/serial_core.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

Index: linux/drivers/serial/serial_core.c
===================================================================
--- linux.orig/drivers/serial/serial_core.c
+++ linux/drivers/serial/serial_core.c
@@ -1849,6 +1849,12 @@ static const struct baud_rates baud_rate
 	{      0, B38400  }
 };
 
+/*
+ * lockdep: port->lock is initialized in two places, but we
+ *          want only one lock-type:
+ */
+static struct lockdep_type_key port_lock_key;
+
 /**
  *	uart_set_options - setup the serial console parameters
  *	@port: pointer to the serial ports uart_port structure
@@ -1869,7 +1875,7 @@ uart_set_options(struct uart_port *port,
 	 * Ensure that the serial console lock is initialised
 	 * early.
 	 */
-	spin_lock_init(&port->lock);
+	spin_lock_init_key(&port->lock, &port_lock_key);
 
 	memset(&termios, 0, sizeof(struct termios));
 
@@ -2255,7 +2261,7 @@ int uart_add_one_port(struct uart_driver
 	 * initialised.
 	 */
 	if (!(uart_console(port) && (port->cons->flags & CON_ENABLED)))
-		spin_lock_init(&port->lock);
+		spin_lock_init_key(&port->lock, &port_lock_key);
 
 	uart_configure_port(drv, state, port);
 


* [patch 37/61] lock validator: special locking: dcache
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (35 preceding siblings ...)
  2006-05-29 21:26 ` [patch 36/61] lock validator: special locking: serial Ingo Molnar
@ 2006-05-29 21:26 ` Ingo Molnar
  2006-05-30  1:35   ` Andrew Morton
  2006-05-29 21:26 ` [patch 38/61] lock validator: special locking: i_mutex Ingo Molnar
                   ` (36 subsequent siblings)
  73 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:26 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.
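
For illustration only (not part of the patch): two d_lock spinlocks of the
same lock-type are taken in address order to avoid ABBA deadlocks, and the
second acquire is annotated as nested so lockdep does not flag it. A
minimal sketch of the pattern d_move() uses below; the helper name is
invented:

	#include <linux/dcache.h>

	static void demo_lock_dentry_pair(struct dentry *a, struct dentry *b)
	{
		if (a < b) {
			spin_lock(&a->d_lock);
			spin_lock_nested(&b->d_lock, DENTRY_D_LOCK_NESTED);
		} else {
			spin_lock(&b->d_lock);
			spin_lock_nested(&a->d_lock, DENTRY_D_LOCK_NESTED);
		}
	}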

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 fs/dcache.c            |    6 +++---
 include/linux/dcache.h |   12 ++++++++++++
 2 files changed, 15 insertions(+), 3 deletions(-)

Index: linux/fs/dcache.c
===================================================================
--- linux.orig/fs/dcache.c
+++ linux/fs/dcache.c
@@ -1380,10 +1380,10 @@ void d_move(struct dentry * dentry, stru
 	 */
 	if (target < dentry) {
 		spin_lock(&target->d_lock);
-		spin_lock(&dentry->d_lock);
+		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
 	} else {
 		spin_lock(&dentry->d_lock);
-		spin_lock(&target->d_lock);
+		spin_lock_nested(&target->d_lock, DENTRY_D_LOCK_NESTED);
 	}
 
 	/* Move the dentry to the target hash queue, if on different bucket */
@@ -1420,7 +1420,7 @@ already_unhashed:
 	}
 
 	list_add(&dentry->d_u.d_child, &dentry->d_parent->d_subdirs);
-	spin_unlock(&target->d_lock);
+	spin_unlock_non_nested(&target->d_lock);
 	fsnotify_d_move(dentry);
 	spin_unlock(&dentry->d_lock);
 	write_sequnlock(&rename_lock);
Index: linux/include/linux/dcache.h
===================================================================
--- linux.orig/include/linux/dcache.h
+++ linux/include/linux/dcache.h
@@ -114,6 +114,18 @@ struct dentry {
 	unsigned char d_iname[DNAME_INLINE_LEN_MIN];	/* small names */
 };
 
+/*
+ * dentry->d_lock spinlock nesting types:
+ *
+ * 0: normal
+ * 1: nested
+ */
+enum dentry_d_lock_type
+{
+	DENTRY_D_LOCK_NORMAL,
+	DENTRY_D_LOCK_NESTED
+};
+
 struct dentry_operations {
 	int (*d_revalidate)(struct dentry *, struct nameidata *);
 	int (*d_hash) (struct dentry *, struct qstr *);


* [patch 38/61] lock validator: special locking: i_mutex
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (36 preceding siblings ...)
  2006-05-29 21:26 ` [patch 37/61] lock validator: special locking: dcache Ingo Molnar
@ 2006-05-29 21:26 ` Ingo Molnar
  2006-05-30 20:53   ` Steven Rostedt
  2006-05-29 21:26 ` [patch 39/61] lock validator: special locking: s_lock Ingo Molnar
                   ` (35 subsequent siblings)
  73 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:26 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.
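
For illustration only (not part of the patch): VFS operations take the
parent directory's i_mutex and then the child's, both of the same
lock-type, so the two acquires are annotated with distinct subtypes. A
minimal sketch of the pattern lock_rename() uses below; the helper name is
invented:

	#include <linux/fs.h>

	static void demo_lock_parent_then_child(struct inode *dir,
						struct inode *inode)
	{
		mutex_lock_nested(&dir->i_mutex, I_MUTEX_PARENT);
		mutex_lock_nested(&inode->i_mutex, I_MUTEX_CHILD);
	}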

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 drivers/usb/core/inode.c |    2 +-
 fs/namei.c               |   24 ++++++++++++------------
 include/linux/fs.h       |   14 ++++++++++++++
 3 files changed, 27 insertions(+), 13 deletions(-)

Index: linux/drivers/usb/core/inode.c
===================================================================
--- linux.orig/drivers/usb/core/inode.c
+++ linux/drivers/usb/core/inode.c
@@ -201,7 +201,7 @@ static void update_sb(struct super_block
 	if (!root)
 		return;
 
-	mutex_lock(&root->d_inode->i_mutex);
+	mutex_lock_nested(&root->d_inode->i_mutex, I_MUTEX_PARENT);
 
 	list_for_each_entry(bus, &root->d_subdirs, d_u.d_child) {
 		if (bus->d_inode) {
Index: linux/fs/namei.c
===================================================================
--- linux.orig/fs/namei.c
+++ linux/fs/namei.c
@@ -1422,7 +1422,7 @@ struct dentry *lock_rename(struct dentry
 	struct dentry *p;
 
 	if (p1 == p2) {
-		mutex_lock(&p1->d_inode->i_mutex);
+		mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
 		return NULL;
 	}
 
@@ -1430,30 +1430,30 @@ struct dentry *lock_rename(struct dentry
 
 	for (p = p1; p->d_parent != p; p = p->d_parent) {
 		if (p->d_parent == p2) {
-			mutex_lock(&p2->d_inode->i_mutex);
-			mutex_lock(&p1->d_inode->i_mutex);
+			mutex_lock_nested(&p2->d_inode->i_mutex, I_MUTEX_PARENT);
+			mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_CHILD);
 			return p;
 		}
 	}
 
 	for (p = p2; p->d_parent != p; p = p->d_parent) {
 		if (p->d_parent == p1) {
-			mutex_lock(&p1->d_inode->i_mutex);
-			mutex_lock(&p2->d_inode->i_mutex);
+			mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
+			mutex_lock_nested(&p2->d_inode->i_mutex, I_MUTEX_CHILD);
 			return p;
 		}
 	}
 
-	mutex_lock(&p1->d_inode->i_mutex);
-	mutex_lock(&p2->d_inode->i_mutex);
+	mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
+	mutex_lock_nested(&p2->d_inode->i_mutex, I_MUTEX_CHILD);
 	return NULL;
 }
 
 void unlock_rename(struct dentry *p1, struct dentry *p2)
 {
-	mutex_unlock(&p1->d_inode->i_mutex);
+	mutex_unlock_non_nested(&p1->d_inode->i_mutex);
 	if (p1 != p2) {
-		mutex_unlock(&p2->d_inode->i_mutex);
+		mutex_unlock_non_nested(&p2->d_inode->i_mutex);
 		mutex_unlock(&p1->d_inode->i_sb->s_vfs_rename_mutex);
 	}
 }
@@ -1750,7 +1750,7 @@ struct dentry *lookup_create(struct name
 {
 	struct dentry *dentry = ERR_PTR(-EEXIST);
 
-	mutex_lock(&nd->dentry->d_inode->i_mutex);
+	mutex_lock_nested(&nd->dentry->d_inode->i_mutex, I_MUTEX_PARENT);
 	/*
 	 * Yucky last component or no last component at all?
 	 * (foo/., foo/.., /////)
@@ -2007,7 +2007,7 @@ static long do_rmdir(int dfd, const char
 			error = -EBUSY;
 			goto exit1;
 	}
-	mutex_lock(&nd.dentry->d_inode->i_mutex);
+	mutex_lock_nested(&nd.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
 	dentry = lookup_hash(&nd);
 	error = PTR_ERR(dentry);
 	if (!IS_ERR(dentry)) {
@@ -2081,7 +2081,7 @@ static long do_unlinkat(int dfd, const c
 	error = -EISDIR;
 	if (nd.last_type != LAST_NORM)
 		goto exit1;
-	mutex_lock(&nd.dentry->d_inode->i_mutex);
+	mutex_lock_nested(&nd.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
 	dentry = lookup_hash(&nd);
 	error = PTR_ERR(dentry);
 	if (!IS_ERR(dentry)) {
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h
+++ linux/include/linux/fs.h
@@ -558,6 +558,20 @@ struct inode {
 };
 
 /*
+ * inode->i_mutex nesting types for the LOCKDEP validator:
+ *
+ * 0: the object of the current VFS operation
+ * 1: parent
+ * 2: child/target
+ */
+enum inode_i_mutex_lock_type
+{
+	I_MUTEX_NORMAL,
+	I_MUTEX_PARENT,
+	I_MUTEX_CHILD
+};
+
+/*
  * NOTE: in a 32bit arch with a preemptable kernel and
  * an UP compile the i_size_read/write must be atomic
  * with respect to the local cpu (unlike with preempt disabled),


* [patch 39/61] lock validator: special locking: s_lock
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (37 preceding siblings ...)
  2006-05-29 21:26 ` [patch 38/61] lock validator: special locking: i_mutex Ingo Molnar
@ 2006-05-29 21:26 ` Ingo Molnar
  2006-05-29 21:26 ` [patch 40/61] lock validator: special locking: futex Ingo Molnar
                   ` (34 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:26 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (per-filesystem) locking code to the lock validator. Has no
effect on non-lockdep kernels.
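
For illustration only (not part of the patch): instead of all superblocks
sharing one s_lock lock-type, each filesystem type carries its own key, so
e.g. ext3's s_lock and usbfs's s_lock become separate types with
independently observed ordering rules. A minimal sketch; mutex_init_key()
and the s_lock_key member come from this series, the rest is invented for
the example:

	#include <linux/mutex.h>

	struct demo_fs_type {			/* stand-in for file_system_type */
		const char		*name;
		struct lockdep_type_key	s_lock_key;
	};

	static void demo_init_sb_lock(struct mutex *s_lock,
				      struct demo_fs_type *type)
	{
		/* one lock-type per filesystem type, named after it: */
		mutex_init_key(s_lock, type->name, &type->s_lock_key);
	}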

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 fs/super.c         |   13 +++++++++----
 include/linux/fs.h |    1 +
 2 files changed, 10 insertions(+), 4 deletions(-)

Index: linux/fs/super.c
===================================================================
--- linux.orig/fs/super.c
+++ linux/fs/super.c
@@ -54,7 +54,7 @@ DEFINE_SPINLOCK(sb_lock);
  *	Allocates and initializes a new &struct super_block.  alloc_super()
  *	returns a pointer new superblock or %NULL if allocation had failed.
  */
-static struct super_block *alloc_super(void)
+static struct super_block *alloc_super(struct file_system_type *type)
 {
 	struct super_block *s = kzalloc(sizeof(struct super_block),  GFP_USER);
 	static struct super_operations default_op;
@@ -72,7 +72,12 @@ static struct super_block *alloc_super(v
 		INIT_HLIST_HEAD(&s->s_anon);
 		INIT_LIST_HEAD(&s->s_inodes);
 		init_rwsem(&s->s_umount);
-		mutex_init(&s->s_lock);
+		/*
+		 * The locking rules for s_lock are up to the
+		 * filesystem. For example ext3fs has different
+		 * lock ordering than usbfs:
+		 */
+		mutex_init_key(&s->s_lock, type->name, &type->s_lock_key);
 		down_write(&s->s_umount);
 		s->s_count = S_BIAS;
 		atomic_set(&s->s_active, 1);
@@ -297,7 +302,7 @@ retry:
 	}
 	if (!s) {
 		spin_unlock(&sb_lock);
-		s = alloc_super();
+		s = alloc_super(type);
 		if (!s)
 			return ERR_PTR(-ENOMEM);
 		goto retry;
@@ -696,7 +701,7 @@ struct super_block *get_sb_bdev(struct f
 	 */
 	mutex_lock(&bdev->bd_mount_mutex);
 	s = sget(fs_type, test_bdev_super, set_bdev_super, bdev);
-	mutex_unlock(&bdev->bd_mount_mutex);
+	mutex_unlock_non_nested(&bdev->bd_mount_mutex);
 	if (IS_ERR(s))
 		goto out;
 
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h
+++ linux/include/linux/fs.h
@@ -1307,6 +1307,7 @@ struct file_system_type {
 	struct module *owner;
 	struct file_system_type * next;
 	struct list_head fs_supers;
+	struct lockdep_type_key s_lock_key;
 };
 
 struct super_block *get_sb_bdev(struct file_system_type *fs_type,

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 40/61] lock validator: special locking: futex
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (38 preceding siblings ...)
  2006-05-29 21:26 ` [patch 39/61] lock validator: special locking: s_lock Ingo Molnar
@ 2006-05-29 21:26 ` Ingo Molnar
  2006-05-29 21:26 ` [patch 41/61] lock validator: special locking: genirq Ingo Molnar
                   ` (33 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:26 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 kernel/futex.c |   44 ++++++++++++++++++++++++++------------------
 1 file changed, 26 insertions(+), 18 deletions(-)

Index: linux/kernel/futex.c
===================================================================
--- linux.orig/kernel/futex.c
+++ linux/kernel/futex.c
@@ -604,6 +604,22 @@ static int unlock_futex_pi(u32 __user *u
 }
 
 /*
+ * Express the locking dependencies for lockdep:
+ */
+static inline void
+double_lock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
+{
+	if (hb1 <= hb2) {
+		spin_lock(&hb1->lock);
+		if (hb1 < hb2)
+			spin_lock_nested(&hb2->lock, SINGLE_DEPTH_NESTING);
+	} else { /* hb1 > hb2 */
+		spin_lock(&hb2->lock);
+		spin_lock_nested(&hb1->lock, SINGLE_DEPTH_NESTING);
+	}
+}
+
+/*
  * Wake up all waiters hashed on the physical page that is mapped
  * to this virtual address:
  */
@@ -669,19 +685,15 @@ retryfull:
 	hb2 = hash_futex(&key2);
 
 retry:
-	if (hb1 < hb2)
-		spin_lock(&hb1->lock);
-	spin_lock(&hb2->lock);
-	if (hb1 > hb2)
-		spin_lock(&hb1->lock);
+	double_lock_hb(hb1, hb2);
 
 	op_ret = futex_atomic_op_inuser(op, uaddr2);
 	if (unlikely(op_ret < 0)) {
 		u32 dummy;
 
-		spin_unlock(&hb1->lock);
+		spin_unlock_non_nested(&hb1->lock);
 		if (hb1 != hb2)
-			spin_unlock(&hb2->lock);
+			spin_unlock_non_nested(&hb2->lock);
 
 #ifndef CONFIG_MMU
 		/*
@@ -748,9 +760,9 @@ retry:
 		ret += op_ret;
 	}
 
-	spin_unlock(&hb1->lock);
+	spin_unlock_non_nested(&hb1->lock);
 	if (hb1 != hb2)
-		spin_unlock(&hb2->lock);
+		spin_unlock_non_nested(&hb2->lock);
 out:
 	up_read(&current->mm->mmap_sem);
 	return ret;
@@ -782,11 +794,7 @@ static int futex_requeue(u32 __user *uad
 	hb1 = hash_futex(&key1);
 	hb2 = hash_futex(&key2);
 
-	if (hb1 < hb2)
-		spin_lock(&hb1->lock);
-	spin_lock(&hb2->lock);
-	if (hb1 > hb2)
-		spin_lock(&hb1->lock);
+	double_lock_hb(hb1, hb2);
 
 	if (likely(cmpval != NULL)) {
 		u32 curval;
@@ -794,9 +802,9 @@ static int futex_requeue(u32 __user *uad
 		ret = get_futex_value_locked(&curval, uaddr1);
 
 		if (unlikely(ret)) {
-			spin_unlock(&hb1->lock);
+			spin_unlock_non_nested(&hb1->lock);
 			if (hb1 != hb2)
-				spin_unlock(&hb2->lock);
+				spin_unlock_non_nested(&hb2->lock);
 
 			/*
 			 * If we would have faulted, release mmap_sem, fault
@@ -842,9 +850,9 @@ static int futex_requeue(u32 __user *uad
 	}
 
 out_unlock:
-	spin_unlock(&hb1->lock);
+	spin_unlock_non_nested(&hb1->lock);
 	if (hb1 != hb2)
-		spin_unlock(&hb2->lock);
+		spin_unlock_non_nested(&hb2->lock);
 
 	/* drop_key_refs() must be called outside the spinlocks. */
 	while (--drop_count >= 0)
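
The unlock sequences above stay open-coded; for symmetry, a hypothetical
helper pairing with double_lock_hb() (not part of this patch) could look
like this - presumably the non-nested unlock variant is used because the
release order does not necessarily mirror the acquire order that
double_lock_hb() established:

	static inline void
	double_unlock_hb(struct futex_hash_bucket *hb1,
			 struct futex_hash_bucket *hb2)
	{
		spin_unlock_non_nested(&hb1->lock);
		if (hb1 != hb2)
			spin_unlock_non_nested(&hb2->lock);
	}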

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 41/61] lock validator: special locking: genirq
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (39 preceding siblings ...)
  2006-05-29 21:26 ` [patch 40/61] lock validator: special locking: futex Ingo Molnar
@ 2006-05-29 21:26 ` Ingo Molnar
  2006-05-29 21:26 ` [patch 42/61] lock validator: special locking: kgdb Ingo Molnar
                   ` (32 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:26 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 kernel/irq/handle.c |   13 +++++++++++++
 1 file changed, 13 insertions(+)

Index: linux/kernel/irq/handle.c
===================================================================
--- linux.orig/kernel/irq/handle.c
+++ linux/kernel/irq/handle.c
@@ -11,6 +11,7 @@
 #include <linux/random.h>
 #include <linux/interrupt.h>
 #include <linux/kernel_stat.h>
+#include <linux/kallsyms.h>
 
 #include "internals.h"
 
@@ -193,3 +194,15 @@ out:
 	return 1;
 }
 
+/*
+ * lockdep: we want to handle all irq_desc locks as a single lock-type:
+ */
+static struct lockdep_type_key irq_desc_lock_type;
+
+void early_init_irq_lock_type(void)
+{
+	int i;
+
+	for (i = 0; i < NR_IRQS; i++)
+		spin_lock_init_key(&irq_desc[i].lock, &irq_desc_lock_type);
+}

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 42/61] lock validator: special locking: kgdb
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (40 preceding siblings ...)
  2006-05-29 21:26 ` [patch 41/61] lock validator: special locking: genirq Ingo Molnar
@ 2006-05-29 21:26 ` Ingo Molnar
  2006-05-29 21:26 ` [patch 43/61] lock validator: special locking: completions Ingo Molnar
                   ` (31 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:26 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (recursive, non-ordered) locking code to the lock validator.
Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 kernel/kgdb.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/kernel/kgdb.c
===================================================================
--- linux.orig/kernel/kgdb.c
+++ linux/kernel/kgdb.c
@@ -1539,7 +1539,7 @@ int kgdb_handle_exception(int ex_vector,
 
 	if (!debugger_step || !kgdb_contthread) {
 		for (i = 0; i < NR_CPUS; i++)
-			spin_unlock(&slavecpulocks[i]);
+			spin_unlock_non_nested(&slavecpulocks[i]);
 		/* Wait till all the processors have quit
 		 * from the debugger. */
 		for (i = 0; i < NR_CPUS; i++) {
@@ -1622,7 +1622,7 @@ static void __init kgdb_internal_init(vo
 
 	/* Initialize our spinlocks. */
 	for (i = 0; i < NR_CPUS; i++)
-		spin_lock_init(&slavecpulocks[i]);
+		spin_lock_init_static(&slavecpulocks[i]);
 
 	for (i = 0; i < MAX_BREAKPOINTS; i++)
 		kgdb_break[i].state = bp_none;

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 43/61] lock validator: special locking: completions
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (41 preceding siblings ...)
  2006-05-29 21:26 ` [patch 42/61] lock validator: special locking: kgdb Ingo Molnar
@ 2006-05-29 21:26 ` Ingo Molnar
  2006-05-29 21:26 ` [patch 44/61] lock validator: special locking: waitqueues Ingo Molnar
                   ` (30 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:26 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (multi-initialized) locking code to the lock validator.
Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 include/linux/completion.h |    6 +-----
 kernel/sched.c             |    8 ++++++++
 2 files changed, 9 insertions(+), 5 deletions(-)

Index: linux/include/linux/completion.h
===================================================================
--- linux.orig/include/linux/completion.h
+++ linux/include/linux/completion.h
@@ -21,11 +21,7 @@ struct completion {
 #define DECLARE_COMPLETION(work) \
 	struct completion work = COMPLETION_INITIALIZER(work)
 
-static inline void init_completion(struct completion *x)
-{
-	x->done = 0;
-	init_waitqueue_head(&x->wait);
-}
+extern void init_completion(struct completion *x);
 
 extern void FASTCALL(wait_for_completion(struct completion *));
 extern int FASTCALL(wait_for_completion_interruptible(struct completion *x));
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -3569,6 +3569,14 @@ __wake_up_sync(wait_queue_head_t *q, uns
 }
 EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
 
+void init_completion(struct completion *x)
+{
+	x->done = 0;
+	__init_waitqueue_head(&x->wait);
+}
+
+EXPORT_SYMBOL(init_completion);
+
 void fastcall complete(struct completion *x)
 {
 	unsigned long flags;

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 44/61] lock validator: special locking: waitqueues
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (42 preceding siblings ...)
  2006-05-29 21:26 ` [patch 43/61] lock validator: special locking: completions Ingo Molnar
@ 2006-05-29 21:26 ` Ingo Molnar
  2006-05-29 21:26 ` [patch 45/61] lock validator: special locking: mm Ingo Molnar
                   ` (29 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:26 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

map special (multi-initialized) locking code to the lock validator.
Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 include/linux/wait.h |   11 +++++++++--
 kernel/wait.c        |    9 +++++++++
 2 files changed, 18 insertions(+), 2 deletions(-)

Index: linux/include/linux/wait.h
===================================================================
--- linux.orig/include/linux/wait.h
+++ linux/include/linux/wait.h
@@ -77,12 +77,19 @@ struct task_struct;
 #define __WAIT_BIT_KEY_INITIALIZER(word, bit)				\
 	{ .flags = word, .bit_nr = bit, }
 
-static inline void init_waitqueue_head(wait_queue_head_t *q)
+/*
+ * lockdep: we want one lock-type for all waitqueue locks.
+ */
+extern struct lockdep_type_key waitqueue_lock_key;
+
+static inline void __init_waitqueue_head(wait_queue_head_t *q)
 {
-	spin_lock_init(&q->lock);
+	spin_lock_init_key(&q->lock, &waitqueue_lock_key);
 	INIT_LIST_HEAD(&q->task_list);
 }
 
+extern void init_waitqueue_head(wait_queue_head_t *q);
+
 static inline void init_waitqueue_entry(wait_queue_t *q, struct task_struct *p)
 {
 	q->flags = 0;
Index: linux/kernel/wait.c
===================================================================
--- linux.orig/kernel/wait.c
+++ linux/kernel/wait.c
@@ -11,6 +11,15 @@
 #include <linux/wait.h>
 #include <linux/hash.h>
 
+struct lockdep_type_key waitqueue_lock_key;
+
+void init_waitqueue_head(wait_queue_head_t *q)
+{
+	__init_waitqueue_head(q);
+}
+
+EXPORT_SYMBOL(init_waitqueue_head);
+
 void fastcall add_wait_queue(wait_queue_head_t *q, wait_queue_t *wait)
 {
 	unsigned long flags;
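
With init_waitqueue_head() now out of line, every waitqueue spinlock is
initialized with the single waitqueue_lock_key and thus belongs to one
lock class, rather than the per-call-site classes that the old inline,
multi-site initialization would otherwise produce under the validator.
A trivial illustration (foo_wq and bar_wq are hypothetical):

	#include <linux/wait.h>

	static wait_queue_head_t foo_wq;
	static wait_queue_head_t bar_wq;

	static void example_init(void)
	{
		init_waitqueue_head(&foo_wq);	/* class: waitqueue_lock_key */
		init_waitqueue_head(&bar_wq);	/* same class as foo_wq */
	}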

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 45/61] lock validator: special locking: mm
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (43 preceding siblings ...)
  2006-05-29 21:26 ` [patch 44/61] lock validator: special locking: waitqueues Ingo Molnar
@ 2006-05-29 21:26 ` Ingo Molnar
  2006-05-29 21:26 ` [patch 46/61] lock validator: special locking: slab Ingo Molnar
                   ` (28 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:26 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 mm/memory.c |    2 +-
 mm/mremap.c |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c
+++ linux/mm/memory.c
@@ -509,7 +509,7 @@ again:
 		return -ENOMEM;
 	src_pte = pte_offset_map_nested(src_pmd, addr);
 	src_ptl = pte_lockptr(src_mm, src_pmd);
-	spin_lock(src_ptl);
+	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 
 	do {
 		/*
Index: linux/mm/mremap.c
===================================================================
--- linux.orig/mm/mremap.c
+++ linux/mm/mremap.c
@@ -97,7 +97,7 @@ static void move_ptes(struct vm_area_str
  	new_pte = pte_offset_map_nested(new_pmd, new_addr);
 	new_ptl = pte_lockptr(mm, new_pmd);
 	if (new_ptl != old_ptl)
-		spin_lock(new_ptl);
+		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
 
 	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
 				   new_pte++, new_addr += PAGE_SIZE) {

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 46/61] lock validator: special locking: slab
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (44 preceding siblings ...)
  2006-05-29 21:26 ` [patch 45/61] lock validator: special locking: mm Ingo Molnar
@ 2006-05-29 21:26 ` Ingo Molnar
  2006-05-30  1:35   ` Andrew Morton
  2006-05-29 21:26 ` [patch 47/61] lock validator: special locking: skb_queue_head_init() Ingo Molnar
                   ` (27 subsequent siblings)
  73 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:26 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.

fix initialize-locks-via-memcpy assumptions.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 mm/slab.c |   59 ++++++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 48 insertions(+), 11 deletions(-)

Index: linux/mm/slab.c
===================================================================
--- linux.orig/mm/slab.c
+++ linux/mm/slab.c
@@ -1026,7 +1026,8 @@ static void drain_alien_cache(struct kme
 	}
 }
 
-static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
+static inline int cache_free_alien(struct kmem_cache *cachep, void *objp,
+				   int nesting)
 {
 	struct slab *slabp = virt_to_slab(objp);
 	int nodeid = slabp->nodeid;
@@ -1044,7 +1045,7 @@ static inline int cache_free_alien(struc
 	STATS_INC_NODEFREES(cachep);
 	if (l3->alien && l3->alien[nodeid]) {
 		alien = l3->alien[nodeid];
-		spin_lock(&alien->lock);
+		spin_lock_nested(&alien->lock, nesting);
 		if (unlikely(alien->avail == alien->limit)) {
 			STATS_INC_ACOVERFLOW(cachep);
 			__drain_alien_cache(cachep, alien, nodeid);
@@ -1073,7 +1074,8 @@ static inline void free_alien_cache(stru
 {
 }
 
-static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
+static inline int cache_free_alien(struct kmem_cache *cachep, void *objp,
+				   int nesting)
 {
 	return 0;
 }
@@ -1278,6 +1280,11 @@ static void init_list(struct kmem_cache 
 
 	local_irq_disable();
 	memcpy(ptr, list, sizeof(struct kmem_list3));
+	/*
+	 * Do not assume that spinlocks can be initialized via memcpy:
+	 */
+	spin_lock_init(&ptr->list_lock);
+
 	MAKE_ALL_LISTS(cachep, ptr, nodeid);
 	cachep->nodelists[nodeid] = ptr;
 	local_irq_enable();
@@ -1408,7 +1415,7 @@ void __init kmem_cache_init(void)
 	}
 	/* 4) Replace the bootstrap head arrays */
 	{
-		void *ptr;
+		struct array_cache *ptr;
 
 		ptr = kmalloc(sizeof(struct arraycache_init), GFP_KERNEL);
 
@@ -1416,6 +1423,11 @@ void __init kmem_cache_init(void)
 		BUG_ON(cpu_cache_get(&cache_cache) != &initarray_cache.cache);
 		memcpy(ptr, cpu_cache_get(&cache_cache),
 		       sizeof(struct arraycache_init));
+		/*
+		 * Do not assume that spinlocks can be initialized via memcpy:
+		 */
+		spin_lock_init(&ptr->lock);
+
 		cache_cache.array[smp_processor_id()] = ptr;
 		local_irq_enable();
 
@@ -1426,6 +1438,11 @@ void __init kmem_cache_init(void)
 		       != &initarray_generic.cache);
 		memcpy(ptr, cpu_cache_get(malloc_sizes[INDEX_AC].cs_cachep),
 		       sizeof(struct arraycache_init));
+		/*
+		 * Do not assume that spinlocks can be initialized via memcpy:
+		 */
+		spin_lock_init(&ptr->lock);
+
 		malloc_sizes[INDEX_AC].cs_cachep->array[smp_processor_id()] =
 		    ptr;
 		local_irq_enable();
@@ -1753,6 +1770,8 @@ static void slab_destroy_objs(struct kme
 }
 #endif
 
+static void __cache_free(struct kmem_cache *cachep, void *objp, int nesting);
+
 /**
  * slab_destroy - destroy and release all objects in a slab
  * @cachep: cache pointer being destroyed
@@ -1776,8 +1795,17 @@ static void slab_destroy(struct kmem_cac
 		call_rcu(&slab_rcu->head, kmem_rcu_free);
 	} else {
 		kmem_freepages(cachep, addr);
-		if (OFF_SLAB(cachep))
-			kmem_cache_free(cachep->slabp_cache, slabp);
+		if (OFF_SLAB(cachep)) {
+			unsigned long flags;
+
+			/*
+		 	 * lockdep: we may nest inside an already held
+			 * ac->lock, so pass in a nesting flag:
+			 */
+			local_irq_save(flags);
+			__cache_free(cachep->slabp_cache, slabp, 1);
+			local_irq_restore(flags);
+		}
 	}
 }
 
@@ -3062,7 +3090,16 @@ static void free_block(struct kmem_cache
 		if (slabp->inuse == 0) {
 			if (l3->free_objects > l3->free_limit) {
 				l3->free_objects -= cachep->num;
+				/*
+				 * It is safe to drop the lock. The slab is
+				 * no longer linked to the cache. cachep
+				 * cannot disappear - we are using it and
+				 * all destruction of caches must be
+				 * serialized properly by the user.
+				 */
+				spin_unlock(&l3->list_lock);
 				slab_destroy(cachep, slabp);
+				spin_lock(&l3->list_lock);
 			} else {
 				list_add(&slabp->list, &l3->slabs_free);
 			}
@@ -3088,7 +3125,7 @@ static void cache_flusharray(struct kmem
 #endif
 	check_irq_off();
 	l3 = cachep->nodelists[node];
-	spin_lock(&l3->list_lock);
+	spin_lock_nested(&l3->list_lock, SINGLE_DEPTH_NESTING);
 	if (l3->shared) {
 		struct array_cache *shared_array = l3->shared;
 		int max = shared_array->limit - shared_array->avail;
@@ -3131,14 +3168,14 @@ free_done:
  * Release an obj back to its cache. If the obj has a constructed state, it must
  * be in this state _before_ it is released.  Called with disabled ints.
  */
-static inline void __cache_free(struct kmem_cache *cachep, void *objp)
+static void __cache_free(struct kmem_cache *cachep, void *objp, int nesting)
 {
 	struct array_cache *ac = cpu_cache_get(cachep);
 
 	check_irq_off();
 	objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0));
 
-	if (cache_free_alien(cachep, objp))
+	if (cache_free_alien(cachep, objp, nesting))
 		return;
 
 	if (likely(ac->avail < ac->limit)) {
@@ -3393,7 +3430,7 @@ void kmem_cache_free(struct kmem_cache *
 	BUG_ON(virt_to_cache(objp) != cachep);
 
 	local_irq_save(flags);
-	__cache_free(cachep, objp);
+	__cache_free(cachep, objp, 0);
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL(kmem_cache_free);
@@ -3418,7 +3455,7 @@ void kfree(const void *objp)
 	kfree_debugcheck(objp);
 	c = virt_to_cache(objp);
 	debug_check_no_locks_freed(objp, obj_size(c));
-	__cache_free(c, (void *)objp);
+	__cache_free(c, (void *)objp, 0);
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL(kfree);
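
The spin_lock_init() calls added after the memcpy()s above follow a
general rule under the validator: a lock embedded in a structure that is
copied wholesale must be re-initialized in the copy, since the lock (and
the per-lock state the validator keeps in it) cannot simply be
byte-copied.  A minimal sketch of the rule (struct foo is hypothetical):

	#include <linux/spinlock.h>
	#include <linux/string.h>

	struct foo {
		spinlock_t	lock;
		int		data;
	};

	static void foo_clone(struct foo *dst, const struct foo *src)
	{
		memcpy(dst, src, sizeof(*dst));
		/*
		 * Do not assume that spinlocks can be initialized via
		 * memcpy - re-initialize the copy explicitly:
		 */
		spin_lock_init(&dst->lock);
	}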

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 47/61] lock validator: special locking: skb_queue_head_init()
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (45 preceding siblings ...)
  2006-05-29 21:26 ` [patch 46/61] lock validator: special locking: slab Ingo Molnar
@ 2006-05-29 21:26 ` Ingo Molnar
  2006-05-29 21:26 ` [patch 48/61] lock validator: special locking: timer.c Ingo Molnar
                   ` (26 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:26 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (multi-initialized) locking code to the lock validator.
Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 include/linux/skbuff.h |    7 +------
 net/core/skbuff.c      |    9 +++++++++
 2 files changed, 10 insertions(+), 6 deletions(-)

Index: linux/include/linux/skbuff.h
===================================================================
--- linux.orig/include/linux/skbuff.h
+++ linux/include/linux/skbuff.h
@@ -584,12 +584,7 @@ static inline __u32 skb_queue_len(const 
 	return list_->qlen;
 }
 
-static inline void skb_queue_head_init(struct sk_buff_head *list)
-{
-	spin_lock_init(&list->lock);
-	list->prev = list->next = (struct sk_buff *)list;
-	list->qlen = 0;
-}
+extern void skb_queue_head_init(struct sk_buff_head *list);
 
 /*
  *	Insert an sk_buff at the start of a list.
Index: linux/net/core/skbuff.c
===================================================================
--- linux.orig/net/core/skbuff.c
+++ linux/net/core/skbuff.c
@@ -71,6 +71,15 @@
 static kmem_cache_t *skbuff_head_cache __read_mostly;
 static kmem_cache_t *skbuff_fclone_cache __read_mostly;
 
+void skb_queue_head_init(struct sk_buff_head *list)
+{
+	spin_lock_init(&list->lock);
+	list->prev = list->next = (struct sk_buff *)list;
+	list->qlen = 0;
+}
+
+EXPORT_SYMBOL(skb_queue_head_init);
+
 /*
  *	Keep out-of-line to prevent kernel bloat.
  *	__builtin_return_address is not used because it is not always

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 48/61] lock validator: special locking: timer.c
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (46 preceding siblings ...)
  2006-05-29 21:26 ` [patch 47/61] lock validator: special locking: skb_queue_head_init() Ingo Molnar
@ 2006-05-29 21:26 ` Ingo Molnar
  2006-05-29 21:27 ` [patch 49/61] lock validator: special locking: sched.c Ingo Molnar
                   ` (25 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:26 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 kernel/timer.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Index: linux/kernel/timer.c
===================================================================
--- linux.orig/kernel/timer.c
+++ linux/kernel/timer.c
@@ -1496,6 +1496,13 @@ asmlinkage long sys_sysinfo(struct sysin
 	return 0;
 }
 
+/*
+ * lockdep: we want to track each per-CPU base as a separate lock-type,
+ * but timer-bases are kmalloc()-ed, so we need to attach separate
+ * keys to them:
+ */
+static struct lockdep_type_key base_lock_keys[NR_CPUS];
+
 static int __devinit init_timers_cpu(int cpu)
 {
 	int j;
@@ -1530,7 +1537,7 @@ static int __devinit init_timers_cpu(int
 		base = per_cpu(tvec_bases, cpu);
 	}
 
-	spin_lock_init(&base->lock);
+	spin_lock_init_key(&base->lock, base_lock_keys + cpu);
 	for (j = 0; j < TVN_SIZE; j++) {
 		INIT_LIST_HEAD(base->tv5.vec + j);
 		INIT_LIST_HEAD(base->tv4.vec + j);
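
The static base_lock_keys[] array above is the general recipe for
dynamically allocated locks: the per-CPU timer bases are kmalloc()-ed,
so the keys that give their locks stable, per-CPU classes have to live
in static storage and be attached explicitly.  A generic, purely
illustrative sketch (the my_ctx names are made up):

	#include <linux/spinlock.h>
	#include <linux/slab.h>

	#define MY_MAX_CTX	16

	/* keys must have static storage duration, one per context: */
	static struct lockdep_type_key my_ctx_keys[MY_MAX_CTX];

	struct my_ctx {
		spinlock_t	lock;
		/* ... */
	};

	static struct my_ctx *my_ctx_alloc(int idx)
	{
		struct my_ctx *ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);

		if (!ctx)
			return NULL;
		/* contexts with different idx get different lock classes: */
		spin_lock_init_key(&ctx->lock, &my_ctx_keys[idx]);
		return ctx;
	}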

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 49/61] lock validator: special locking: sched.c
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (47 preceding siblings ...)
  2006-05-29 21:26 ` [patch 48/61] lock validator: special locking: timer.c Ingo Molnar
@ 2006-05-29 21:27 ` Ingo Molnar
  2006-05-29 21:27 ` [patch 50/61] lock validator: special locking: hrtimer.c Ingo Molnar
                   ` (24 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 kernel/sched.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1963,7 +1963,7 @@ static void double_rq_unlock(runqueue_t 
 	__releases(rq1->lock)
 	__releases(rq2->lock)
 {
-	spin_unlock(&rq1->lock);
+	spin_unlock_non_nested(&rq1->lock);
 	if (rq1 != rq2)
 		spin_unlock(&rq2->lock);
 	else
@@ -1980,7 +1980,7 @@ static void double_lock_balance(runqueue
 {
 	if (unlikely(!spin_trylock(&busiest->lock))) {
 		if (busiest->cpu < this_rq->cpu) {
-			spin_unlock(&this_rq->lock);
+			spin_unlock_non_nested(&this_rq->lock);
 			spin_lock(&busiest->lock);
 			spin_lock(&this_rq->lock);
 		} else
@@ -2602,7 +2602,7 @@ static int load_balance_newidle(int this
 		nr_moved = move_tasks(this_rq, this_cpu, busiest,
 					minus_1_or_zero(busiest->nr_running),
 					imbalance, sd, NEWLY_IDLE, NULL);
-		spin_unlock(&busiest->lock);
+		spin_unlock_non_nested(&busiest->lock);
 	}
 
 	if (!nr_moved) {
@@ -2687,7 +2687,7 @@ static void active_load_balance(runqueue
 	else
 		schedstat_inc(sd, alb_failed);
 out:
-	spin_unlock(&target_rq->lock);
+	spin_unlock_non_nested(&target_rq->lock);
 }
 
 /*
@@ -3032,7 +3032,7 @@ static void wake_sleeping_dependent(int 
 	}
 
 	for_each_cpu_mask(i, sibling_map)
-		spin_unlock(&cpu_rq(i)->lock);
+		spin_unlock_non_nested(&cpu_rq(i)->lock);
 	/*
 	 * We exit with this_cpu's rq still held and IRQs
 	 * still disabled:
@@ -3068,7 +3068,7 @@ static int dependent_sleeper(int this_cp
 	 * The same locking rules and details apply as for
 	 * wake_sleeping_dependent():
 	 */
-	spin_unlock(&this_rq->lock);
+	spin_unlock_non_nested(&this_rq->lock);
 	sibling_map = sd->span;
 	for_each_cpu_mask(i, sibling_map)
 		spin_lock(&cpu_rq(i)->lock);
@@ -3146,7 +3146,7 @@ check_smt_task:
 	}
 out_unlock:
 	for_each_cpu_mask(i, sibling_map)
-		spin_unlock(&cpu_rq(i)->lock);
+		spin_unlock_non_nested(&cpu_rq(i)->lock);
 	return ret;
 }
 #else
@@ -6680,7 +6680,7 @@ void __init sched_init(void)
 		prio_array_t *array;
 
 		rq = cpu_rq(i);
-		spin_lock_init(&rq->lock);
+		spin_lock_init_static(&rq->lock);
 		rq->nr_running = 0;
 		rq->active = rq->arrays;
 		rq->expired = rq->arrays + 1;

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 50/61] lock validator: special locking: hrtimer.c
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (48 preceding siblings ...)
  2006-05-29 21:27 ` [patch 49/61] lock validator: special locking: sched.c Ingo Molnar
@ 2006-05-29 21:27 ` Ingo Molnar
  2006-05-30  1:35   ` Andrew Morton
  2006-05-29 21:27 ` [patch 51/61] lock validator: special locking: sock_lock_init() Ingo Molnar
                   ` (23 subsequent siblings)
  73 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 kernel/hrtimer.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/kernel/hrtimer.c
===================================================================
--- linux.orig/kernel/hrtimer.c
+++ linux/kernel/hrtimer.c
@@ -786,7 +786,7 @@ static void __devinit init_hrtimers_cpu(
 	int i;
 
 	for (i = 0; i < MAX_HRTIMER_BASES; i++, base++)
-		spin_lock_init(&base->lock);
+		spin_lock_init_static(&base->lock);
 }
 
 #ifdef CONFIG_HOTPLUG_CPU

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 51/61] lock validator: special locking: sock_lock_init()
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (49 preceding siblings ...)
  2006-05-29 21:27 ` [patch 50/61] lock validator: special locking: hrtimer.c Ingo Molnar
@ 2006-05-29 21:27 ` Ingo Molnar
  2006-05-30  1:36   ` Andrew Morton
  2006-05-29 21:27 ` [patch 52/61] lock validator: special locking: af_unix Ingo Molnar
                   ` (22 subsequent siblings)
  73 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (multi-initialized, per-address-family) locking code to the
lock validator. Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 include/net/sock.h |    6 ------
 net/core/sock.c    |   27 +++++++++++++++++++++++----
 2 files changed, 23 insertions(+), 10 deletions(-)

Index: linux/include/net/sock.h
===================================================================
--- linux.orig/include/net/sock.h
+++ linux/include/net/sock.h
@@ -81,12 +81,6 @@ typedef struct {
 	wait_queue_head_t	wq;
 } socket_lock_t;
 
-#define sock_lock_init(__sk) \
-do {	spin_lock_init(&((__sk)->sk_lock.slock)); \
-	(__sk)->sk_lock.owner = NULL; \
-	init_waitqueue_head(&((__sk)->sk_lock.wq)); \
-} while(0)
-
 struct sock;
 struct proto;
 
Index: linux/net/core/sock.c
===================================================================
--- linux.orig/net/core/sock.c
+++ linux/net/core/sock.c
@@ -739,6 +739,27 @@ lenout:
   	return 0;
 }
 
+/*
+ * Each address family might have different locking rules, so we have
+ * one slock key per address family:
+ */
+static struct lockdep_type_key af_family_keys[AF_MAX];
+
+static void noinline sock_lock_init(struct sock *sk)
+{
+	spin_lock_init_key(&sk->sk_lock.slock, af_family_keys + sk->sk_family);
+	sk->sk_lock.owner = NULL;
+	init_waitqueue_head(&sk->sk_lock.wq);
+}
+
+static struct lockdep_type_key af_callback_keys[AF_MAX];
+
+static void noinline sock_rwlock_init(struct sock *sk)
+{
+	rwlock_init(&sk->sk_dst_lock);
+	rwlock_init_key(&sk->sk_callback_lock, af_callback_keys + sk->sk_family);
+}
+
 /**
  *	sk_alloc - All socket objects are allocated here
  *	@family: protocol family
@@ -833,8 +854,7 @@ struct sock *sk_clone(const struct sock 
 		skb_queue_head_init(&newsk->sk_receive_queue);
 		skb_queue_head_init(&newsk->sk_write_queue);
 
-		rwlock_init(&newsk->sk_dst_lock);
-		rwlock_init(&newsk->sk_callback_lock);
+		sock_rwlock_init(newsk);
 
 		newsk->sk_dst_cache	= NULL;
 		newsk->sk_wmem_queued	= 0;
@@ -1404,8 +1424,7 @@ void sock_init_data(struct socket *sock,
 	} else
 		sk->sk_sleep	=	NULL;
 
-	rwlock_init(&sk->sk_dst_lock);
-	rwlock_init(&sk->sk_callback_lock);
+	sock_rwlock_init(sk);
 
 	sk->sk_state_change	=	sock_def_wakeup;
 	sk->sk_data_ready	=	sock_def_readable;

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 52/61] lock validator: special locking: af_unix
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (50 preceding siblings ...)
  2006-05-29 21:27 ` [patch 51/61] lock validator: special locking: sock_lock_init() Ingo Molnar
@ 2006-05-29 21:27 ` Ingo Molnar
  2006-05-30  1:36   ` Andrew Morton
  2006-05-29 21:27 ` [patch 53/61] lock validator: special locking: bh_lock_sock() Ingo Molnar
                   ` (21 subsequent siblings)
  73 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.

(includes a workaround for sk_receive_queue.lock, which is currently
treated as a single global lock class by the lock validator, but which
should be switched to per-address-family locking rules.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 include/net/af_unix.h |    3 +++
 net/unix/af_unix.c    |   10 +++++-----
 net/unix/garbage.c    |    8 ++++----
 3 files changed, 12 insertions(+), 9 deletions(-)

Index: linux/include/net/af_unix.h
===================================================================
--- linux.orig/include/net/af_unix.h
+++ linux/include/net/af_unix.h
@@ -61,6 +61,9 @@ struct unix_skb_parms {
 #define unix_state_rlock(s)	spin_lock(&unix_sk(s)->lock)
 #define unix_state_runlock(s)	spin_unlock(&unix_sk(s)->lock)
 #define unix_state_wlock(s)	spin_lock(&unix_sk(s)->lock)
+#define unix_state_wlock_nested(s) \
+				spin_lock_nested(&unix_sk(s)->lock, \
+				SINGLE_DEPTH_NESTING)
 #define unix_state_wunlock(s)	spin_unlock(&unix_sk(s)->lock)
 
 #ifdef __KERNEL__
Index: linux/net/unix/af_unix.c
===================================================================
--- linux.orig/net/unix/af_unix.c
+++ linux/net/unix/af_unix.c
@@ -1022,7 +1022,7 @@ restart:
 		goto out_unlock;
 	}
 
-	unix_state_wlock(sk);
+	unix_state_wlock_nested(sk);
 
 	if (sk->sk_state != st) {
 		unix_state_wunlock(sk);
@@ -1073,12 +1073,12 @@ restart:
 	unix_state_wunlock(sk);
 
 	/* take ten and and send info to listening sock */
-	spin_lock(&other->sk_receive_queue.lock);
+	spin_lock_bh(&other->sk_receive_queue.lock);
 	__skb_queue_tail(&other->sk_receive_queue, skb);
 	/* Undo artificially decreased inflight after embrion
 	 * is installed to listening socket. */
 	atomic_inc(&newu->inflight);
-	spin_unlock(&other->sk_receive_queue.lock);
+	spin_unlock_bh(&other->sk_receive_queue.lock);
 	unix_state_runlock(other);
 	other->sk_data_ready(other, 0);
 	sock_put(other);
@@ -1843,7 +1843,7 @@ static int unix_ioctl(struct socket *soc
 				break;
 			}
 
-			spin_lock(&sk->sk_receive_queue.lock);
+			spin_lock_bh(&sk->sk_receive_queue.lock);
 			if (sk->sk_type == SOCK_STREAM ||
 			    sk->sk_type == SOCK_SEQPACKET) {
 				skb_queue_walk(&sk->sk_receive_queue, skb)
@@ -1853,7 +1853,7 @@ static int unix_ioctl(struct socket *soc
 				if (skb)
 					amount=skb->len;
 			}
-			spin_unlock(&sk->sk_receive_queue.lock);
+			spin_unlock_bh(&sk->sk_receive_queue.lock);
 			err = put_user(amount, (int __user *)arg);
 			break;
 		}
Index: linux/net/unix/garbage.c
===================================================================
--- linux.orig/net/unix/garbage.c
+++ linux/net/unix/garbage.c
@@ -235,7 +235,7 @@ void unix_gc(void)
 		struct sock *x = pop_stack();
 		struct sock *sk;
 
-		spin_lock(&x->sk_receive_queue.lock);
+		spin_lock_bh(&x->sk_receive_queue.lock);
 		skb = skb_peek(&x->sk_receive_queue);
 		
 		/*
@@ -270,7 +270,7 @@ void unix_gc(void)
 				maybe_unmark_and_push(skb->sk);
 			skb=skb->next;
 		}
-		spin_unlock(&x->sk_receive_queue.lock);
+		spin_unlock_bh(&x->sk_receive_queue.lock);
 		sock_put(x);
 	}
 
@@ -283,7 +283,7 @@ void unix_gc(void)
 		if (u->gc_tree == GC_ORPHAN) {
 			struct sk_buff *nextsk;
 
-			spin_lock(&s->sk_receive_queue.lock);
+			spin_lock_bh(&s->sk_receive_queue.lock);
 			skb = skb_peek(&s->sk_receive_queue);
 			while (skb &&
 			       skb != (struct sk_buff *)&s->sk_receive_queue) {
@@ -298,7 +298,7 @@ void unix_gc(void)
 				}
 				skb = nextsk;
 			}
-			spin_unlock(&s->sk_receive_queue.lock);
+			spin_unlock_bh(&s->sk_receive_queue.lock);
 		}
 		u->gc_tree = GC_ORPHAN;
 	}

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 53/61] lock validator: special locking: bh_lock_sock()
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (51 preceding siblings ...)
  2006-05-29 21:27 ` [patch 52/61] lock validator: special locking: af_unix Ingo Molnar
@ 2006-05-29 21:27 ` Ingo Molnar
  2006-05-29 21:27 ` [patch 54/61] lock validator: special locking: mmap_sem Ingo Molnar
                   ` (20 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 include/net/sock.h  |    3 +++
 net/ipv4/tcp_ipv4.c |    2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

Index: linux/include/net/sock.h
===================================================================
--- linux.orig/include/net/sock.h
+++ linux/include/net/sock.h
@@ -743,6 +743,9 @@ extern void FASTCALL(release_sock(struct
 
 /* BH context may only use the following locking interface. */
 #define bh_lock_sock(__sk)	spin_lock(&((__sk)->sk_lock.slock))
+#define bh_lock_sock_nested(__sk) \
+				spin_lock_nested(&((__sk)->sk_lock.slock), \
+				SINGLE_DEPTH_NESTING)
 #define bh_unlock_sock(__sk)	spin_unlock(&((__sk)->sk_lock.slock))
 
 extern struct sock		*sk_alloc(int family,
Index: linux/net/ipv4/tcp_ipv4.c
===================================================================
--- linux.orig/net/ipv4/tcp_ipv4.c
+++ linux/net/ipv4/tcp_ipv4.c
@@ -1088,7 +1088,7 @@ process:
 
 	skb->dev = NULL;
 
-	bh_lock_sock(sk);
+	bh_lock_sock_nested(sk);
 	ret = 0;
 	if (!sock_owned_by_user(sk)) {
 		if (!tcp_prequeue(sk, skb))

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 54/61] lock validator: special locking: mmap_sem
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (52 preceding siblings ...)
  2006-05-29 21:27 ` [patch 53/61] lock validator: special locking: bh_lock_sock() Ingo Molnar
@ 2006-05-29 21:27 ` Ingo Molnar
  2006-05-29 21:27 ` [patch 55/61] lock validator: special locking: sb->s_umount Ingo Molnar
                   ` (19 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 kernel/exit.c |    2 +-
 kernel/fork.c |    5 ++++-
 2 files changed, 5 insertions(+), 2 deletions(-)

Index: linux/kernel/exit.c
===================================================================
--- linux.orig/kernel/exit.c
+++ linux/kernel/exit.c
@@ -582,7 +582,7 @@ static void exit_mm(struct task_struct *
 	/* more a memory barrier than a real lock */
 	task_lock(tsk);
 	tsk->mm = NULL;
-	up_read(&mm->mmap_sem);
+	up_read_non_nested(&mm->mmap_sem);
 	enter_lazy_tlb(mm, current);
 	task_unlock(tsk);
 	mmput(mm);
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c
+++ linux/kernel/fork.c
@@ -196,7 +196,10 @@ static inline int dup_mmap(struct mm_str
 
 	down_write(&oldmm->mmap_sem);
 	flush_cache_mm(oldmm);
-	down_write(&mm->mmap_sem);
+	/*
+	 * Not linked in yet - no deadlock potential:
+	 */
+	down_write_nested(&mm->mmap_sem, 1);
 
 	mm->locked_vm = 0;
 	mm->mmap = NULL;
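
This is the rwsem counterpart of the SINGLE_DEPTH_NESTING annotation:
two mmap_sems of the same class are legitimately held at once, and the
ordering is known to be safe because the child mm is not yet reachable
by anyone else, so the inner acquisition is marked with a non-zero
subclass.  A condensed sketch of the pattern (hypothetical helper name):

	#include <linux/sched.h>

	static void copy_mm_locked(struct mm_struct *oldmm,
				   struct mm_struct *mm)
	{
		down_write(&oldmm->mmap_sem);
		/* second mmap_sem of the same class - annotate it: */
		down_write_nested(&mm->mmap_sem, 1);

		/* ... duplicate the mappings ... */

		up_write(&mm->mmap_sem);
		up_write(&oldmm->mmap_sem);
	}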

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 55/61] lock validator: special locking: sb->s_umount
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (53 preceding siblings ...)
  2006-05-29 21:27 ` [patch 54/61] lock validator: special locking: mmap_sem Ingo Molnar
@ 2006-05-29 21:27 ` Ingo Molnar
  2006-05-30  1:36   ` Andrew Morton
  2006-05-29 21:27 ` [patch 56/61] lock validator: special locking: jbd Ingo Molnar
                   ` (18 subsequent siblings)
  73 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

Workaround for the special sb->s_umount locking rule.

s_umount is held across a series of lock drops and re-acquisitions
inside prune_one_dentry(), so I changed the call order here, at the risk
of introducing an umount race. FIXME.

A better fix would probably be to do the unlocks inside
prune_one_dentry() as _non_nested, and to do the up_read() here as an
up_read_non_nested() as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 fs/dcache.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux/fs/dcache.c
===================================================================
--- linux.orig/fs/dcache.c
+++ linux/fs/dcache.c
@@ -470,8 +470,9 @@ static void prune_dcache(int count, stru
 		s_umount = &dentry->d_sb->s_umount;
 		if (down_read_trylock(s_umount)) {
 			if (dentry->d_sb->s_root != NULL) {
-				prune_one_dentry(dentry);
+// lockdep hack: do this better!
 				up_read(s_umount);
+				prune_one_dentry(dentry);
 				continue;
 			}
 			up_read(s_umount);
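
The alternative fix described in the changelog above would keep the
original call order and annotate the out-of-order release instead;
roughly (untested sketch, using the up_read_non_nested() primitive that
appears earlier in this series):

		if (down_read_trylock(s_umount)) {
			if (dentry->d_sb->s_root != NULL) {
				prune_one_dentry(dentry);
				/*
				 * the held-lock stack is not in LIFO
				 * order at this point (see the
				 * changelog), so annotate the release:
				 */
				up_read_non_nested(s_umount);
				continue;
			}
			up_read(s_umount);
		}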

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 56/61] lock validator: special locking: jbd
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (54 preceding siblings ...)
  2006-05-29 21:27 ` [patch 55/61] lock validator: special locking: sb->s_umount Ingo Molnar
@ 2006-05-29 21:27 ` Ingo Molnar
  2006-05-29 21:27 ` [patch 57/61] lock validator: special locking: posix-timers Ingo Molnar
                   ` (17 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (non-nested) unlocking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 fs/jbd/checkpoint.c |    2 +-
 fs/jbd/commit.c     |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Index: linux/fs/jbd/checkpoint.c
===================================================================
--- linux.orig/fs/jbd/checkpoint.c
+++ linux/fs/jbd/checkpoint.c
@@ -135,7 +135,7 @@ void __log_wait_for_space(journal_t *jou
 			log_do_checkpoint(journal);
 			spin_lock(&journal->j_state_lock);
 		}
-		mutex_unlock(&journal->j_checkpoint_mutex);
+		mutex_unlock_non_nested(&journal->j_checkpoint_mutex);
 	}
 }
 
Index: linux/fs/jbd/commit.c
===================================================================
--- linux.orig/fs/jbd/commit.c
+++ linux/fs/jbd/commit.c
@@ -838,7 +838,7 @@ restart_loop:
 	J_ASSERT(commit_transaction == journal->j_committing_transaction);
 	journal->j_commit_sequence = commit_transaction->t_tid;
 	journal->j_committing_transaction = NULL;
-	spin_unlock(&journal->j_state_lock);
+	spin_unlock_non_nested(&journal->j_state_lock);
 
 	if (commit_transaction->t_checkpoint_list == NULL) {
 		__journal_drop_transaction(journal, commit_transaction);
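
This patch and the following _non_nested conversions share one idiom:
when a lock is released while a lock that was acquired later is still
held (i.e. the release is not in LIFO order), the release is annotated
with the _non_nested variant so the validator's held-lock bookkeeping
knows the out-of-order release is intentional.  A condensed,
hypothetical illustration:

	#include <linux/spinlock.h>

	struct obj {
		spinlock_t	lock;
	};

	static void example(struct obj *a, struct obj *b)
	{
		spin_lock(&a->lock);			/* held: a    */
		spin_lock_nested(&b->lock,
				 SINGLE_DEPTH_NESTING);	/* held: a, b */

		/* a is released while b is still held - non-LIFO: */
		spin_unlock_non_nested(&a->lock);

		/* ... continue with only b held ... */
		spin_unlock(&b->lock);
	}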

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 57/61] lock validator: special locking: posix-timers
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (55 preceding siblings ...)
  2006-05-29 21:27 ` [patch 56/61] lock validator: special locking: jbd Ingo Molnar
@ 2006-05-29 21:27 ` Ingo Molnar
  2006-05-29 21:27 ` [patch 58/61] lock validator: special locking: sch_generic.c Ingo Molnar
                   ` (16 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (non-nested) unlocking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 kernel/posix-timers.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/kernel/posix-timers.c
===================================================================
--- linux.orig/kernel/posix-timers.c
+++ linux/kernel/posix-timers.c
@@ -576,7 +576,7 @@ static struct k_itimer * lock_timer(time
 	timr = (struct k_itimer *) idr_find(&posix_timers_id, (int) timer_id);
 	if (timr) {
 		spin_lock(&timr->it_lock);
-		spin_unlock(&idr_lock);
+		spin_unlock_non_nested(&idr_lock);
 
 		if ((timr->it_id != timer_id) || !(timr->it_process) ||
 				timr->it_process->tgid != current->tgid) {

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 58/61] lock validator: special locking: sch_generic.c
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (56 preceding siblings ...)
  2006-05-29 21:27 ` [patch 57/61] lock validator: special locking: posix-timers Ingo Molnar
@ 2006-05-29 21:27 ` Ingo Molnar
  2006-05-29 21:27 ` [patch 59/61] lock validator: special locking: xfrm Ingo Molnar
                   ` (15 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (non-nested) unlocking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 net/sched/sch_generic.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/net/sched/sch_generic.c
===================================================================
--- linux.orig/net/sched/sch_generic.c
+++ linux/net/sched/sch_generic.c
@@ -132,7 +132,7 @@ int qdisc_restart(struct net_device *dev
 		
 		{
 			/* And release queue */
-			spin_unlock(&dev->queue_lock);
+			spin_unlock_non_nested(&dev->queue_lock);
 
 			if (!netif_queue_stopped(dev)) {
 				int ret;

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 59/61] lock validator: special locking: xfrm
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (57 preceding siblings ...)
  2006-05-29 21:27 ` [patch 58/61] lock validator: special locking: sch_generic.c Ingo Molnar
@ 2006-05-29 21:27 ` Ingo Molnar
  2006-05-30  1:36   ` Andrew Morton
  2006-05-29 21:27 ` [patch 60/61] lock validator: special locking: sound/core/seq/seq_ports.c Ingo Molnar
                   ` (14 subsequent siblings)
  73 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (non-nested) unlocking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 net/xfrm/xfrm_policy.c |    2 +-
 net/xfrm/xfrm_state.c  |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Index: linux/net/xfrm/xfrm_policy.c
===================================================================
--- linux.orig/net/xfrm/xfrm_policy.c
+++ linux/net/xfrm/xfrm_policy.c
@@ -1308,7 +1308,7 @@ static struct xfrm_policy_afinfo *xfrm_p
 	afinfo = xfrm_policy_afinfo[family];
 	if (likely(afinfo != NULL))
 		read_lock(&afinfo->lock);
-	read_unlock(&xfrm_policy_afinfo_lock);
+	read_unlock_non_nested(&xfrm_policy_afinfo_lock);
 	return afinfo;
 }
 
Index: linux/net/xfrm/xfrm_state.c
===================================================================
--- linux.orig/net/xfrm/xfrm_state.c
+++ linux/net/xfrm/xfrm_state.c
@@ -1105,7 +1105,7 @@ static struct xfrm_state_afinfo *xfrm_st
 	afinfo = xfrm_state_afinfo[family];
 	if (likely(afinfo != NULL))
 		read_lock(&afinfo->lock);
-	read_unlock(&xfrm_state_afinfo_lock);
+	read_unlock_non_nested(&xfrm_state_afinfo_lock);
 	return afinfo;
 }
 

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 60/61] lock validator: special locking: sound/core/seq/seq_ports.c
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (58 preceding siblings ...)
  2006-05-29 21:27 ` [patch 59/61] lock validator: special locking: xfrm Ingo Molnar
@ 2006-05-29 21:27 ` Ingo Molnar
  2006-05-29 21:28 ` [patch 61/61] lock validator: enable lock validator in Kconfig Ingo Molnar
                   ` (13 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

teach special (recursive) locking code to the lock validator. Has no
effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 sound/core/seq/seq_ports.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/sound/core/seq/seq_ports.c
===================================================================
--- linux.orig/sound/core/seq/seq_ports.c
+++ linux/sound/core/seq/seq_ports.c
@@ -518,7 +518,7 @@ int snd_seq_port_connect(struct snd_seq_
 	atomic_set(&subs->ref_count, 2);
 
 	down_write(&src->list_mutex);
-	down_write(&dest->list_mutex);
+	down_write_nested(&dest->list_mutex, SINGLE_DEPTH_NESTING);
 
 	exclusive = info->flags & SNDRV_SEQ_PORT_SUBS_EXCLUSIVE ? 1 : 0;
 	err = -EBUSY;
@@ -591,7 +591,7 @@ int snd_seq_port_disconnect(struct snd_s
 	unsigned long flags;
 
 	down_write(&src->list_mutex);
-	down_write(&dest->list_mutex);
+	down_write_nested(&dest->list_mutex, SINGLE_DEPTH_NESTING);
 
 	/* look for the connection */
 	list_for_each(p, &src->list_head) {

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 61/61] lock validator: enable lock validator in Kconfig
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (59 preceding siblings ...)
  2006-05-29 21:27 ` [patch 60/61] lock validator: special locking: sound/core/seq/seq_ports.c Ingo Molnar
@ 2006-05-29 21:28 ` Ingo Molnar
  2006-05-30  1:36   ` Andrew Morton
  2006-05-30 13:33   ` Roman Zippel
  2006-05-29 22:28 ` [patch 00/61] ANNOUNCE: lock validator -V1 Michal Piotrowski
                   ` (12 subsequent siblings)
  73 siblings, 2 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 21:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andrew Morton

From: Ingo Molnar <mingo@elte.hu>

offer the following lock validation options:

 CONFIG_PROVE_SPIN_LOCKING
 CONFIG_PROVE_RW_LOCKING
 CONFIG_PROVE_MUTEX_LOCKING
 CONFIG_PROVE_RWSEM_LOCKING

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 lib/Kconfig.debug |  167 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 167 insertions(+)

Index: linux/lib/Kconfig.debug
===================================================================
--- linux.orig/lib/Kconfig.debug
+++ linux/lib/Kconfig.debug
@@ -184,6 +184,173 @@ config DEBUG_SPINLOCK
 	  best used in conjunction with the NMI watchdog so that spinlock
 	  deadlocks are also debuggable.
 
+config PROVE_SPIN_LOCKING
+	bool "Prove spin-locking correctness"
+	default y
+	help
+	 This feature enables the kernel to prove that all spinlock
+	 locking that occurs in the kernel runtime is mathematically
+	 correct: that under no circumstance could an arbitrary (and
+	 not yet triggered) combination of observed spinlock locking
+	 sequences (on an arbitrary number of CPUs, running an
+	 arbitrary number of tasks and interrupt contexts) cause a
+	 deadlock.
+
+	 In short, this feature enables the kernel to report spinlock
+	 deadlocks before they actually occur.
+
+	 The proof does not depend on how hard and complex a
+	 deadlock scenario would be to trigger: how many
+	 participant CPUs, tasks and irq-contexts would be needed
+	 for it to trigger. The proof also does not depend on
+	 timing: if a race and a resulting deadlock is possible
+	 theoretically (no matter how unlikely the race scenario
+	 is), it will be proven so and will immediately be
+	 reported by the kernel (once the event is observed that
+	 makes the deadlock theoretically possible).
+
+	 If a deadlock is impossible (i.e. the locking rules, as
+	 observed by the kernel, are mathematically correct), the
+	 kernel reports nothing.
+
+	 NOTE: this feature can also be enabled for rwlocks, mutexes
+	 and rwsems - in which case all dependencies between these
+	 different locking variants are observed and mapped too, and
+	 the proof of observed correctness is also maintained for an
+	 arbitrary combination of these separate locking variants.
+
+	 For more details, see Documentation/locking-correctness.txt.
+
+config PROVE_RW_LOCKING
+	bool "Prove rw-locking correctness"
+	default y
+	help
+	 This feature enables the kernel to prove that all rwlock
+	 locking that occurs in the kernel runtime is mathematically
+	 correct: that under no circumstance could an arbitrary (and
+	 not yet triggered) combination of observed rwlock locking
+	 sequences (on an arbitrary number of CPUs, running an
+	 arbitrary number of tasks and interrupt contexts) cause a
+	 deadlock.
+
+	 In short, this feature enables the kernel to report rwlock
+	 deadlocks before they actually occur.
+
+	 The proof does not depend on how hard and complex a
+	 deadlock scenario would be to trigger: how many
+	 participant CPUs, tasks and irq-contexts would be needed
+	 for it to trigger. The proof also does not depend on
+	 timing: if a race and a resulting deadlock is possible
+	 theoretically (no matter how unlikely the race scenario
+	 is), it will be proven so and will immediately be
+	 reported by the kernel (once the event is observed that
+	 makes the deadlock theoretically possible).
+
+	 If a deadlock is impossible (i.e. the locking rules, as
+	 observed by the kernel, are mathematically correct), the
+	 kernel reports nothing.
+
+	 NOTE: this feature can also be enabled for spinlocks, mutexes
+	 and rwsems - in which case all dependencies between these
+	 different locking variants are observed and mapped too, and
+	 the proof of observed correctness is also maintained for an
+	 arbitrary combination of these separate locking variants.
+
+	 For more details, see Documentation/locking-correctness.txt.
+
+config PROVE_MUTEX_LOCKING
+	bool "Prove mutex-locking correctness"
+	default y
+	help
+	 This feature enables the kernel to prove that all mutexlock
+	 locking that occurs in the kernel runtime is mathematically
+	 correct: that under no circumstance could an arbitrary (and
+	 not yet triggered) combination of observed mutexlock locking
+	 sequences (on an arbitrary number of CPUs, running an
+	 arbitrary number of tasks and interrupt contexts) cause a
+	 deadlock.
+
+	 In short, this feature enables the kernel to report mutexlock
+	 deadlocks before they actually occur.
+
+	 The proof does not depend on how hard and complex a
+	 deadlock scenario would be to trigger: how many
+	 participant CPUs, tasks and irq-contexts would be needed
+	 for it to trigger. The proof also does not depend on
+	 timing: if a race and a resulting deadlock is possible
+	 theoretically (no matter how unlikely the race scenario
+	 is), it will be proven so and will immediately be
+	 reported by the kernel (once the event is observed that
+	 makes the deadlock theoretically possible).
+
+	 If a deadlock is impossible (i.e. the locking rules, as
+	 observed by the kernel, are mathematically correct), the
+	 kernel reports nothing.
+
+	 NOTE: this feature can also be enabled for spinlock, rwlocks
+	 and rwsems - in which case all dependencies between these
+	 different locking variants are observed and mapped too, and
+	 the proof of observed correctness is also maintained for an
+	 arbitrary combination of these separate locking variants.
+
+	 For more details, see Documentation/locking-correctness.txt.
+
+config PROVE_RWSEM_LOCKING
+	bool "Prove rwsem-locking correctness"
+	default y
+	help
+	 This feature enables the kernel to prove that all rwsemlock
+	 locking that occurs in the kernel runtime is mathematically
+	 correct: that under no circumstance could an arbitrary (and
+	 not yet triggered) combination of observed rwsemlock locking
+	 sequences (on an arbitrary number of CPUs, running an
+	 arbitrary number of tasks and interrupt contexts) cause a
+	 deadlock.
+
+	 In short, this feature enables the kernel to report rwsem
+	 deadlocks before they actually occur.
+
+	 The proof does not depend on how hard and complex a
+	 deadlock scenario would be to trigger: how many
+	 participant CPUs, tasks and irq-contexts would be needed
+	 for it to trigger. The proof also does not depend on
+	 timing: if a race and a resulting deadlock is possible
+	 theoretically (no matter how unlikely the race scenario
+	 is), it will be proven so and will immediately be
+	 reported by the kernel (once the event is observed that
+	 makes the deadlock theoretically possible).
+
+	 If a deadlock is impossible (i.e. the locking rules, as
+	 observed by the kernel, are mathematically correct), the
+	 kernel reports nothing.
+
+	 NOTE: this feature can also be enabled for spinlocks, rwlocks
+	 and mutexes - in which case all dependencies between these
+	 different locking variants are observed and mapped too, and
+	 the proof of observed correctness is also maintained for an
+	 arbitrary combination of these separate locking variants.
+
+	 For more details, see Documentation/locking-correctness.txt.
+
+config LOCKDEP
+	bool
+	default y
+	depends on PROVE_SPIN_LOCKING || PROVE_RW_LOCKING || PROVE_MUTEX_LOCKING || PROVE_RWSEM_LOCKING
+
+config DEBUG_LOCKDEP
+	bool "Lock dependency engine debugging"
+	depends on LOCKDEP
+	default y
+	help
+	  If you say Y here, the lock dependency engine will do
+	  additional runtime checks to debug itself, at the price
+	  of more runtime overhead.
+
+config TRACE_IRQFLAGS
+	bool
+	default y
+	depends on PROVE_SPIN_LOCKING || PROVE_RW_LOCKING
+
 config DEBUG_SPINLOCK_SLEEP
 	bool "Sleep-inside-spinlock checking"
 	depends on DEBUG_KERNEL

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (60 preceding siblings ...)
  2006-05-29 21:28 ` [patch 61/61] lock validator: enable lock validator in Kconfig Ingo Molnar
@ 2006-05-29 22:28 ` Michal Piotrowski
  2006-05-29 22:41   ` Ingo Molnar
  2006-05-30  5:20   ` Arjan van de Ven
  2006-05-30  1:35 ` Andrew Morton
                   ` (11 subsequent siblings)
  73 siblings, 2 replies; 320+ messages in thread
From: Michal Piotrowski @ 2006-05-29 22:28 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Arjan van de Ven, Andrew Morton, Dave Jones

On 29/05/06, Ingo Molnar <mingo@elte.hu> wrote:
> We are pleased to announce the first release of the "lock dependency
> correctness validator" kernel debugging feature, which can be downloaded
> from:
>
>   http://redhat.com/~mingo/lockdep-patches/
>
[snip]

I get this while loading cpufreq modules

=====================================================
[ BUG: possible circular locking deadlock detected! ]
-----------------------------------------------------
modprobe/1942 is trying to acquire lock:
 (&anon_vma->lock){--..}, at: [<c10609cf>] anon_vma_link+0x1d/0xc9

but task is already holding lock:
 (&mm->mmap_sem/1){--..}, at: [<c101e5a0>] copy_process+0xbc6/0x1519

which lock already depends on the new lock,
which could lead to circular deadlocks!

the existing dependency chain (in reverse order) is:

-> #1 (cpucontrol){--..}:
       [<c10394be>] lockdep_acquire+0x69/0x82
       [<c11ed759>] __mutex_lock_slowpath+0xd0/0x347
       [<c11ed9ec>] mutex_lock+0x1c/0x1f
       [<c103dda5>] __lock_cpu_hotplug+0x36/0x56
       [<c103ddde>] lock_cpu_hotplug+0xa/0xc
       [<c1199e06>] __cpufreq_driver_target+0x15/0x50
       [<c119a1c2>] cpufreq_governor_performance+0x1a/0x20
       [<c1198b0a>] __cpufreq_governor+0xa0/0x1a9
       [<c1198ce2>] __cpufreq_set_policy+0xcf/0x100
       [<c11991c6>] cpufreq_set_policy+0x2d/0x6f
       [<c1199cae>] cpufreq_add_dev+0x34f/0x492
       [<c114b8c8>] sysdev_driver_register+0x58/0x9b
       [<c119a036>] cpufreq_register_driver+0x80/0xf4
       [<fd97b02a>] ct_get_next+0x17/0x3f [ip_conntrack]
       [<c10410e1>] sys_init_module+0xa6/0x230
       [<c11ef9ab>] sysenter_past_esp+0x54/0x8d

-> #0 (&anon_vma->lock){--..}:
       [<c10394be>] lockdep_acquire+0x69/0x82
       [<c11ed759>] __mutex_lock_slowpath+0xd0/0x347
       [<c11ed9ec>] mutex_lock+0x1c/0x1f
       [<c11990eb>] cpufreq_update_policy+0x34/0xd8
       [<fd9ad50b>] cpufreq_stat_cpu_callback+0x1b/0x7c [cpufreq_stats]
       [<fd9b007d>] cpufreq_stats_init+0x7d/0x9b [cpufreq_stats]
       [<c10410e1>] sys_init_module+0xa6/0x230
       [<c11ef9ab>] sysenter_past_esp+0x54/0x8d

other info that might help us debug this:

1 locks held by modprobe/1942:
  #0:  (cpucontrol){--..}, at: [<c11ed9ec>] mutex_lock+0x1c/0x1f

stack backtrace:
 <c1003f36> show_trace+0xd/0xf  <c1004449> dump_stack+0x17/0x19
 <c103863e> print_circular_bug_tail+0x59/0x64  <c1038e91>
__lockdep_acquire+0x848/0xa39
 <c10394be> lockdep_acquire+0x69/0x82  <c11ed759>
__mutex_lock_slowpath+0xd0/0x347
 <c11ed9ec> mutex_lock+0x1c/0x1f  <c11990eb> cpufreq_update_policy+0x34/0xd8
 <fd9ad50b> cpufreq_stat_cpu_callback+0x1b/0x7c [cpufreq_stats]
<fd9b007d> cpufreq_stats_init+0x7d/0x9b [cpufreq_stats]
 <c10410e1> sys_init_module+0xa6/0x230  <c11ef9ab> sysenter_past_esp+0x54/0x8d

Here is dmesg http://www.stardust.webpages.pl/files/lockdep/2.6.17-rc4-mm3-lockdep1/lockdep-dmesg3

Here is config
http://www.stardust.webpages.pl/files/lockdep/2.6.17-rc4-mm3-lockdep1/lockdep-config2

BTW I still must revert lockdep-serial.patch - it doesn't compile on
my gcc 4.1.1

Regards,
Michal

-- 
Michal K. K. Piotrowski
LTG - Linux Testers Group
(http://www.stardust.webpages.pl/ltg/wiki/)

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-29 22:28 ` [patch 00/61] ANNOUNCE: lock validator -V1 Michal Piotrowski
@ 2006-05-29 22:41   ` Ingo Molnar
  2006-05-29 23:09     ` Dave Jones
  2006-05-30  5:20   ` Arjan van de Ven
  1 sibling, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-05-29 22:41 UTC (permalink / raw)
  To: Michal Piotrowski
  Cc: linux-kernel, Arjan van de Ven, Andrew Morton, Dave Jones


* Michal Piotrowski <michal.k.k.piotrowski@gmail.com> wrote:

> On 29/05/06, Ingo Molnar <mingo@elte.hu> wrote:
> >We are pleased to announce the first release of the "lock dependency
> >correctness validator" kernel debugging feature, which can be downloaded
> >from:
> >
> >  http://redhat.com/~mingo/lockdep-patches/
> >
> [snip]
> 
> I get this while loading cpufreq modules
> 
> =====================================================
> [ BUG: possible circular locking deadlock detected! ]
> -----------------------------------------------------
> modprobe/1942 is trying to acquire lock:
> (&anon_vma->lock){--..}, at: [<c10609cf>] anon_vma_link+0x1d/0xc9
> 
> but task is already holding lock:
> (&mm->mmap_sem/1){--..}, at: [<c101e5a0>] copy_process+0xbc6/0x1519
> 
> which lock already depends on the new lock,
> which could lead to circular deadlocks!

hm, this one could perhaps be a real bug. Dave: lockdep complains about 
having observed:

	anon_vma->lock  =>   mm->mmap_sem
	mm->mmap_sem    =>   anon_vma->lock

locking sequences, in the cpufreq code. Is there some special runtime 
behavior that still makes this safe, or is it a real bug?
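
Reduced to its bare shape, the inversion being reported looks like the
sketch below. The two locks and both functions are illustrative
stand-ins only (the real pair is a spinlock and an rwsem), but the
dependency cycle is the same:

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(lock_a);		/* stands in for anon_vma->lock */
static DEFINE_SPINLOCK(lock_b);		/* stands in for mm->mmap_sem   */

static void path_one(void)
{
	spin_lock(&lock_a);
	spin_lock(&lock_b);		/* records the rule  A -> B */
	/* ... */
	spin_unlock(&lock_b);
	spin_unlock(&lock_a);
}

static void path_two(void)
{
	spin_lock(&lock_b);
	spin_lock(&lock_a);		/* records B -> A: cycle, report printed */
	/* ... */
	spin_unlock(&lock_a);
	spin_unlock(&lock_b);
}

Once both orderings have been observed, two CPUs running path_one() and
path_two() concurrently can each end up holding one lock while waiting
forever for the other - that is the deadlock the report above warns
about, even though it never actually triggered here.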

> stack backtrace:
> <c1003f36> show_trace+0xd/0xf  <c1004449> dump_stack+0x17/0x19
> <c103863e> print_circular_bug_tail+0x59/0x64  <c1038e91>
> __lockdep_acquire+0x848/0xa39
> <c10394be> lockdep_acquire+0x69/0x82  <c11ed759>
> __mutex_lock_slowpath+0xd0/0x347

there's one small detail to improve future lockdep printouts: please set 
CONFIG_STACK_BACKTRACE_COLS=1, so that the backtrace is more readable. 
(i'll change the code to force that when CONFIG_LOCKDEP is enabled)

> BTW I still must revert lockdep-serial.patch - it doesn't compile on 
> my gcc 4.1.1

ok, will check this.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 33/61] lock validator: disable NMI watchdog if CONFIG_LOCKDEP
  2006-05-29 21:25 ` [patch 33/61] lock validator: disable NMI watchdog if CONFIG_LOCKDEP Ingo Molnar
@ 2006-05-29 22:49   ` Keith Owens
  0 siblings, 0 replies; 320+ messages in thread
From: Keith Owens @ 2006-05-29 22:49 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Arjan van de Ven, Andrew Morton

Ingo Molnar (on Mon, 29 May 2006 23:25:50 +0200) wrote:
>From: Ingo Molnar <mingo@elte.hu>
>
>The NMI watchdog uses spinlocks (notifier chains, etc.),
>so it's not lockdep-safe at the moment.

Fixed in 2.6.17-rc1.  notify_die() uses atomic_notifier_call_chain()
which uses RCU, not spinlocks.


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-29 22:41   ` Ingo Molnar
@ 2006-05-29 23:09     ` Dave Jones
  2006-05-30  5:45       ` Arjan van de Ven
  2006-05-30  5:52       ` [patch 00/61] ANNOUNCE: lock validator -V1 Michal Piotrowski
  0 siblings, 2 replies; 320+ messages in thread
From: Dave Jones @ 2006-05-29 23:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Michal Piotrowski, linux-kernel, Arjan van de Ven, Andrew Morton

On Tue, May 30, 2006 at 12:41:08AM +0200, Ingo Molnar wrote:

 > > =====================================================
 > > [ BUG: possible circular locking deadlock detected! ]
 > > -----------------------------------------------------
 > > modprobe/1942 is trying to acquire lock:
 > > (&anon_vma->lock){--..}, at: [<c10609cf>] anon_vma_link+0x1d/0xc9
 > > 
 > > but task is already holding lock:
 > > (&mm->mmap_sem/1){--..}, at: [<c101e5a0>] copy_process+0xbc6/0x1519
 > > 
 > > which lock already depends on the new lock,
 > > which could lead to circular deadlocks!
 > 
 > hm, this one could perhaps be a real bug. Dave: lockdep complains about 
 > having observed:
 > 
 > 	anon_vma->lock  =>   mm->mmap_sem
 > 	mm->mmap_sem    =>   anon_vma->lock
 > 
 > locking sequences, in the cpufreq code. Is there some special runtime 
 > behavior that still makes this safe, or is it a real bug?

I'm feeling a bit overwhelmed by the voluminous output of this checker.
Especially as (directly at least) cpufreq doesn't touch vma's, or mmap's.

The first stack trace it shows has us down in the bowels of cpu hotplug,
where we're taking the cpucontrol sem.  The second stack trace shows
us in cpufreq_update_policy taking a per-cpu data->lock semaphore.

Now, I notice this is modprobe triggering this, and this *looks* like
we're loading two modules simultaneously (the first trace is from a
scaling driver like powernow-k8 or the like, whilst the second trace
is from cpufreq-stats).  

How on earth did we get into this situation? module loading is supposed
to be serialised on the module_mutex no ?

It's been a while since a debug patch has sent me in search of paracetamol ;)

		Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 11/61] lock validator: lockdep: small xfs init_rwsem() cleanup
  2006-05-30  1:33   ` Andrew Morton
@ 2006-05-30  1:32     ` Nathan Scott
  0 siblings, 0 replies; 320+ messages in thread
From: Nathan Scott @ 2006-05-30  1:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ingo Molnar, linux-kernel, arjan

On Mon, May 29, 2006 at 06:33:41PM -0700, Andrew Morton wrote:
> I'll queue this for mainline, via the XFS tree.

Thanks Andrew, its merged in our tree now.

-- 
Nathan

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 01/61] lock validator: floppy.c irq-release fix
  2006-05-29 21:22 ` [patch 01/61] lock validator: floppy.c irq-release fix Ingo Molnar
@ 2006-05-30  1:32   ` Andrew Morton
  0 siblings, 0 replies; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:32 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan

On Mon, 29 May 2006 23:22:56 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> floppy.c does a lot of irq-unsafe work within floppy_release_irq_and_dma():
> free_irq(), release_region() ... so when executing in irq context, push
> the whole function into keventd.

I seem to remember having issues with this - of the "not yet adequate"
type.  But I forget what they were.  Perhaps we have enough
flush_scheduled_work()s in there now.

We're glad to see you reassuming floppy.c maintenance.

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 02/61] lock validator: forcedeth.c fix
  2006-05-29 21:23 ` [patch 02/61] lock validator: forcedeth.c fix Ingo Molnar
@ 2006-05-30  1:33   ` Andrew Morton
  2006-05-31  5:40     ` Manfred Spraul
  0 siblings, 1 reply; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:33 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan, Ayaz Abdulla, Manfred Spraul

On Mon, 29 May 2006 23:23:13 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> nv_do_nic_poll() is called from timer softirqs, which has interrupts
> enabled, but np->lock might also be taken by some other interrupt
> context.

But the driver does disable_irq(), so I'd say this was a false-positive.

And afaict this is not a timer handler - it's a poll_controller handler
(although maybe that gets called from a timer handler somewhere?)

That being said, doing disable_irq() from a poll_controller handler is
downright scary.

Anyway, I'll tentatively mark this as a lockdep workaround, not a bugfix.

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 03/61] lock validator: sound/oss/emu10k1/midi.c cleanup
  2006-05-29 21:23 ` [patch 03/61] lock validator: sound/oss/emu10k1/midi.c cleanup Ingo Molnar
@ 2006-05-30  1:33   ` Andrew Morton
  2006-05-30 10:51     ` Takashi Iwai
  0 siblings, 1 reply; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:33 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan, Jaroslav Kysela, Takashi Iwai

On Mon, 29 May 2006 23:23:19 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> move the __attribute outside of the DEFINE_SPINLOCK() section.
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
> ---
>  sound/oss/emu10k1/midi.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux/sound/oss/emu10k1/midi.c
> ===================================================================
> --- linux.orig/sound/oss/emu10k1/midi.c
> +++ linux/sound/oss/emu10k1/midi.c
> @@ -45,7 +45,7 @@
>  #include "../sound_config.h"
>  #endif
>  
> -static DEFINE_SPINLOCK(midi_spinlock __attribute((unused)));
> +static __attribute((unused)) DEFINE_SPINLOCK(midi_spinlock);
>  
>  static void init_midi_hdr(struct midi_hdr *midihdr)
>  {

I'll tag this as for-mainline-via-alsa.

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/61] lock validator: introduce WARN_ON_ONCE(cond)
  2006-05-29 21:23 ` [patch 05/61] lock validator: introduce WARN_ON_ONCE(cond) Ingo Molnar
@ 2006-05-30  1:33   ` Andrew Morton
  2006-05-30 17:38     ` Steven Rostedt
  0 siblings, 1 reply; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:33 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan

On Mon, 29 May 2006 23:23:28 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> add WARN_ON_ONCE(cond) to print once-per-bootup messages.
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
> ---
>  include/asm-generic/bug.h |   13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> Index: linux/include/asm-generic/bug.h
> ===================================================================
> --- linux.orig/include/asm-generic/bug.h
> +++ linux/include/asm-generic/bug.h
> @@ -44,4 +44,17 @@
>  # define WARN_ON_SMP(x)			do { } while (0)
>  #endif
>  
> +#define WARN_ON_ONCE(condition)				\
> +({							\
> +	static int __warn_once = 1;			\
> +	int __ret = 0;					\
> +							\
> +	if (unlikely(__warn_once && (condition))) {	\
> +		__warn_once = 0;			\
> +		WARN_ON(1);				\
> +		__ret = 1;				\
> +	}						\
> +	__ret;						\
> +})
> +
>  #endif

I'll queue this for mainline inclusion.
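
A minimal usage sketch (the caller is made up): note that as implemented
above, the macro returns nonzero only the _first_ time the condition
fires, so it is best treated as an annotation rather than relied upon
for control flow.

/* hypothetical caller: complain, at most once per boot, about a bad length */
static void example_queue_buffer(void *buf, size_t len)
{
	WARN_ON_ONCE(len == 0);
	/* ... queue buf as usual ... */
}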

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 06/61] lock validator: add __module_address() method
  2006-05-29 21:23 ` [patch 06/61] lock validator: add __module_address() method Ingo Molnar
@ 2006-05-30  1:33   ` Andrew Morton
  2006-05-30 17:45     ` Steven Rostedt
  2006-06-23  8:38     ` Ingo Molnar
  0 siblings, 2 replies; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:33 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan

On Mon, 29 May 2006 23:23:33 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> +/*
> + * Is this a valid module address? We don't grab the lock.
> + */
> +int __module_address(unsigned long addr)
> +{
> +	struct module *mod;
> +
> +	list_for_each_entry(mod, &modules, list)
> +		if (within(addr, mod->module_core, mod->core_size))
> +			return 1;
> +	return 0;
> +}

Returns a boolean.

>  /* Is this a valid kernel address?  We don't grab the lock: we are oopsing. */
>  struct module *__module_text_address(unsigned long addr)

But this returns a module*.

I'd suggest that __module_address() should do the same thing, from an API neatness
POV.  Although perhaps that's not very useful if we didn't take a ref on the returned
object (but module_text_address() doesn't either).

Also, the name's a bit misleading - it sounds like it returns the address
of a module or something.  __module_any_address() would be better, perhaps?

Also, how come this doesn't need modlist_lock()?
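
A sketch of that suggestion (not part of the posted patch): reuse
within() and the modules list from the quoted code and hand back the
module itself, mirroring __module_text_address(). The same "no
reference taken, no lock held" caveat applies.

struct module *__module_any_address(unsigned long addr)
{
	struct module *mod;

	list_for_each_entry(mod, &modules, list)
		if (within(addr, mod->module_core, mod->core_size))
			return mod;
	return NULL;
}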


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 07/61] lock validator: better lock debugging
  2006-05-29 21:23 ` [patch 07/61] lock validator: better lock debugging Ingo Molnar
@ 2006-05-30  1:33   ` Andrew Morton
  2006-06-23 10:25     ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:33 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan

On Mon, 29 May 2006 23:23:37 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> --- /dev/null
> +++ linux/include/linux/debug_locks.h
> @@ -0,0 +1,62 @@
> +#ifndef __LINUX_DEBUG_LOCKING_H
> +#define __LINUX_DEBUG_LOCKING_H
> +
> +extern int debug_locks;
> +extern int debug_locks_silent;
> +
> +/*
> + * Generic 'turn off all lock debugging' function:
> + */
> +extern int debug_locks_off(void);
> +
> +/*
> + * In the debug case we carry the caller's instruction pointer into
> + * other functions, but we dont want the function argument overhead
> + * in the nondebug case - hence these macros:
> + */
> +#define _RET_IP_		(unsigned long)__builtin_return_address(0)
> +#define _THIS_IP_  ({ __label__ __here; __here: (unsigned long)&&__here; })
> +
> +#define DEBUG_WARN_ON(c)						\
> +({									\
> +	int __ret = 0;							\
> +									\
> +	if (unlikely(c)) {						\
> +		if (debug_locks_off())					\
> +			WARN_ON(1);					\
> +		__ret = 1;						\
> +	}								\
> +	__ret;								\
> +})

Either the name of this thing is too generic, or we _make_ it generic, in
which case it's in the wrong header file.

> +#ifdef CONFIG_SMP
> +# define SMP_DEBUG_WARN_ON(c)			DEBUG_WARN_ON(c)
> +#else
> +# define SMP_DEBUG_WARN_ON(c)			do { } while (0)
> +#endif

Probably ditto.
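
Whatever it ends up being named, the intended usage pattern is roughly
the following (caller invented for illustration): the first failed
assertion goes through debug_locks_off(), so lock debugging silences
itself after one report instead of flooding the log with follow-ups.

static int example_validate(int depth)
{
	if (DEBUG_WARN_ON(depth < 0))
		return 0;	/* inconsistent state; lock debugging is now off */
	/* ... continue checking ... */
	return 1;
}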



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 11/61] lock validator: lockdep: small xfs init_rwsem() cleanup
  2006-05-29 21:23 ` [patch 11/61] lock validator: lockdep: small xfs init_rwsem() cleanup Ingo Molnar
@ 2006-05-30  1:33   ` Andrew Morton
  2006-05-30  1:32     ` Nathan Scott
  0 siblings, 1 reply; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:33 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan, Nathan Scott

On Mon, 29 May 2006 23:23:59 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> init_rwsem() has no return value. This is not a problem if init_rwsem()
> is a function, but it's a problem if it's a do { ... } while (0) macro.
> (which lockdep introduces)
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
> ---
>  fs/xfs/linux-2.6/mrlock.h |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux/fs/xfs/linux-2.6/mrlock.h
> ===================================================================
> --- linux.orig/fs/xfs/linux-2.6/mrlock.h
> +++ linux/fs/xfs/linux-2.6/mrlock.h
> @@ -28,7 +28,7 @@ typedef struct {
>  } mrlock_t;
>  
>  #define mrinit(mrp, name)	\
> -	( (mrp)->mr_writer = 0, init_rwsem(&(mrp)->mr_lock) )
> +	do { (mrp)->mr_writer = 0; init_rwsem(&(mrp)->mr_lock); } while (0)
>  #define mrlock_init(mrp, t,n,s)	mrinit(mrp, n)
>  #define mrfree(mrp)		do { } while (0)
>  #define mraccess(mrp)		mraccessf(mrp, 0)

I'll queue this for mainline, via the XFS tree.

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 12/61] lock validator: beautify x86_64 stacktraces
  2006-05-29 21:24 ` [patch 12/61] lock validator: beautify x86_64 stacktraces Ingo Molnar
@ 2006-05-30  1:33   ` Andrew Morton
  0 siblings, 0 replies; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:33 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan

On Mon, 29 May 2006 23:24:05 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> beautify x86_64 stacktraces to be more readable.

One reject fixed due to the backtrace changes in Andi's tree.

I'll get all this compiling, but we'll need to review and test the end
result please, make sure that it all landed OK.


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 15/61] lock validator: x86_64: use stacktrace to generate backtraces
  2006-05-29 21:24 ` [patch 15/61] lock validator: x86_64: use stacktrace to generate backtraces Ingo Molnar
@ 2006-05-30  1:33   ` Andrew Morton
  0 siblings, 0 replies; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:33 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan

On Mon, 29 May 2006 23:24:19 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> this switches x86_64 to use the stacktrace infrastructure when generating
> backtrace printouts, if CONFIG_FRAME_POINTER=y. (This patch will go away
> once the dwarf2 stackframe parser in -mm goes upstream.)

yup, I dropped it.

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 16/61] lock validator: fown locking workaround
  2006-05-29 21:24 ` [patch 16/61] lock validator: fown locking workaround Ingo Molnar
@ 2006-05-30  1:34   ` Andrew Morton
  2006-06-23  9:10     ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:34 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan

On Mon, 29 May 2006 23:24:23 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> temporary workaround for the lock validator: make all uses of
> f_owner.lock irq-safe. (The real solution will be to express to
> the lock validator that f_owner.lock rules are to be generated
> per-filesystem.)

This description forgot to tell us what problem is being worked around.

This patch is a bit of a show-stopper.  How hard-n-bad is the real fix?

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 17/61] lock validator: sk_callback_lock workaround
  2006-05-29 21:24 ` [patch 17/61] lock validator: sk_callback_lock workaround Ingo Molnar
@ 2006-05-30  1:34   ` Andrew Morton
  2006-06-23  9:19     ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:34 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan

On Mon, 29 May 2006 23:24:27 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> temporary workaround for the lock validator: make all uses of
> sk_callback_lock softirq-safe. (The real solution will be to
> express to the lock validator that sk_callback_lock rules are
> to be generated per-address-family.)

Ditto.  What's the actual problem being worked around here, and how's the
real fix shaping up?



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 18/61] lock validator: irqtrace: core
  2006-05-29 21:24 ` [patch 18/61] lock validator: irqtrace: core Ingo Molnar
@ 2006-05-30  1:34   ` Andrew Morton
  2006-06-23 10:42     ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:34 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan

On Mon, 29 May 2006 23:24:32 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> accurate hard-IRQ-flags state tracing. This allows us to attach
> extra functionality to IRQ flags on/off events (such as trace-on/off).

That's a fairly skimpy description of some fairly substantial new
infrastructure.


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 21/61] lock validator: lockdep: add local_irq_enable_in_hardirq() API.
  2006-05-29 21:24 ` [patch 21/61] lock validator: lockdep: add local_irq_enable_in_hardirq() API Ingo Molnar
@ 2006-05-30  1:34   ` Andrew Morton
  2006-06-23  9:28     ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:34 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan

On Mon, 29 May 2006 23:24:52 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> introduce local_irq_enable_in_hardirq() API. It is currently
> aliased to local_irq_enable(), hence has no functional effects.
> 
> This API will be used by lockdep, but even without lockdep
> this will better document places in the kernel where a hardirq
> context enables hardirqs.

If we expect people to use this then we'd best whack a comment over it.

Also, trace_irqflags.h doesn't seem an appropriate place for it to live.

I trust all the affected files are including trace_irqflags.h by some
means.  Hopefully a _reliable_ means.  No doubt I'm about to find out ;)
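
Something along these lines is presumably what is wanted - the alias the
changelog describes, plus a comment over it (sketch only; where the
definition should live is a separate question):

/*
 * local_irq_enable_in_hardirq() - re-enable interrupts from within a
 * hardirq handler, deliberately.  Today this is exactly
 * local_irq_enable(); it is kept separate so that such sites are
 * documented and so that lockdep can hook them later.
 */
#define local_irq_enable_in_hardirq()	local_irq_enable()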

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 22/61] lock validator:  add per_cpu_offset()
  2006-05-29 21:24 ` [patch 22/61] lock validator: add per_cpu_offset() Ingo Molnar
@ 2006-05-30  1:34   ` Andrew Morton
  2006-06-23  9:30     ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, arjan, Luck, Tony, Benjamin Herrenschmidt,
	Paul Mackerras, Martin Schwidefsky, David S. Miller

On Mon, 29 May 2006 23:24:57 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> 
> add the per_cpu_offset() generic method. (used by the lock validator)
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
> ---
>  include/asm-generic/percpu.h |    2 ++
>  include/asm-x86_64/percpu.h  |    2 ++
>  2 files changed, 4 insertions(+)
> 
> Index: linux/include/asm-generic/percpu.h
> ===================================================================
> --- linux.orig/include/asm-generic/percpu.h
> +++ linux/include/asm-generic/percpu.h
> @@ -7,6 +7,8 @@
>  
>  extern unsigned long __per_cpu_offset[NR_CPUS];
>  
> +#define per_cpu_offset(x) (__per_cpu_offset[x])
> +
>  /* Separate out the type, so (int[3], foo) works. */
>  #define DEFINE_PER_CPU(type, name) \
>      __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name
> Index: linux/include/asm-x86_64/percpu.h
> ===================================================================
> --- linux.orig/include/asm-x86_64/percpu.h
> +++ linux/include/asm-x86_64/percpu.h
> @@ -14,6 +14,8 @@
>  #define __per_cpu_offset(cpu) (cpu_pda(cpu)->data_offset)
>  #define __my_cpu_offset() read_pda(data_offset)
>  
> +#define per_cpu_offset(x) (__per_cpu_offset(x))
> +
>  /* Separate out the type, so (int[3], foo) works. */
>  #define DEFINE_PER_CPU(type, name) \
>      __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name

I can tell just looking at it that it'll break various builds.  I assume that
things still happen to compile because you're presently using it in code
which those architectures don't presently compile.

But introducing a "generic" function invites others to start using it.  And
they will, and they'll ship code which "works" but is broken, because they
only tested it on x86 and x86_64.

I'll queue the needed fixups - please check it.
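
For what it is worth, the accessor simply exposes the offset the
per_cpu() machinery already uses, so generic code can locate a per-CPU
object for a given CPU. Illustrative only - the helper name is invented,
and architectures with a different per-CPU layout (or !SMP) need their
own definition, which is exactly the build concern above:

/* address of the CPU 'cpu' instance of a per-CPU object */
static inline void *example_pcpu_ptr(void *pcpu_addr, int cpu)
{
	return (char *)pcpu_addr + per_cpu_offset(cpu);
}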

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (61 preceding siblings ...)
  2006-05-29 22:28 ` [patch 00/61] ANNOUNCE: lock validator -V1 Michal Piotrowski
@ 2006-05-30  1:35 ` Andrew Morton
  2006-06-23  9:41   ` Ingo Molnar
  2006-05-30  4:52 ` Mike Galbraith
                   ` (10 subsequent siblings)
  73 siblings, 1 reply; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:35 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan

On Mon, 29 May 2006 23:21:09 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> We are pleased to announce the first release of the "lock dependency 
> correctness validator" kernel debugging feature

What are the runtime speed and space costs of enabling this?

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 27/61] lock validator: prove spinlock/rwlock locking correctness
  2006-05-29 21:25 ` [patch 27/61] lock validator: prove spinlock/rwlock " Ingo Molnar
@ 2006-05-30  1:35   ` Andrew Morton
  2006-06-23 10:44     ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:35 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan

On Mon, 29 May 2006 23:25:23 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> +# define spin_lock_init_key(lock, key)				\
> +	__spin_lock_init((lock), #lock, key)

erk.  This adds a whole new layer of obfuscation on top of the existing
spinlock header files.  You already need to run the preprocessor and
disassembler to even work out which flavour you're presently using.

Ho hum.

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 34/61] lock validator: special locking: bdev
  2006-05-29 21:25 ` [patch 34/61] lock validator: special locking: bdev Ingo Molnar
@ 2006-05-30  1:35   ` Andrew Morton
  2006-05-30  5:13     ` Arjan van de Ven
                       ` (2 more replies)
  0 siblings, 3 replies; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:35 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan

On Mon, 29 May 2006 23:25:54 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> 
> teach special (recursive) locking code to the lock validator. Has no
> effect on non-lockdep kernels.
> 

There's no description here of the problem which is being worked around. 
This leaves everyone in the dark.

> +static int
> +blkdev_get_whole(struct block_device *bdev, mode_t mode, unsigned flags)
> +{
> +	/*
> +	 * This crockload is due to bad choice of ->open() type.
> +	 * It will go away.
> +	 * For now, block device ->open() routine must _not_
> +	 * examine anything in 'inode' argument except ->i_rdev.
> +	 */
> +	struct file fake_file = {};
> +	struct dentry fake_dentry = {};
> +	fake_file.f_mode = mode;
> +	fake_file.f_flags = flags;
> +	fake_file.f_dentry = &fake_dentry;
> +	fake_dentry.d_inode = bdev->bd_inode;
> +
> +	return do_open(bdev, &fake_file, BD_MUTEX_WHOLE);
> +}

"crock" is a decent description ;)

How long will this live, and what will the fix look like?

(This is all a bit of a pain - carrying these patches in -mm will require
some effort, and they're not ready to go yet, which will lengthen the pain
arbitrarily).


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 36/61] lock validator: special locking: serial
  2006-05-29 21:26 ` [patch 36/61] lock validator: special locking: serial Ingo Molnar
@ 2006-05-30  1:35   ` Andrew Morton
  2006-06-23  9:49     ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:35 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan, Russell King

On Mon, 29 May 2006 23:26:04 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> 
> teach special (dual-initialized) locking code to the lock validator.
> Has no effect on non-lockdep kernels.
> 

This isn't an adequate description of the problem which this patch is
solving, IMO.

I _assume_ the validator is using the instruction pointer of the
spin_lock_init() site (or the file-n-line) as the lock's identifier.  Or
something?

> 
> Index: linux/drivers/serial/serial_core.c
> ===================================================================
> --- linux.orig/drivers/serial/serial_core.c
> +++ linux/drivers/serial/serial_core.c
> @@ -1849,6 +1849,12 @@ static const struct baud_rates baud_rate
>  	{      0, B38400  }
>  };
>  
> +/*
> + * lockdep: port->lock is initialized in two places, but we
> + *          want only one lock-type:
> + */
> +static struct lockdep_type_key port_lock_key;
> +
>  /**
>   *	uart_set_options - setup the serial console parameters
>   *	@port: pointer to the serial ports uart_port structure
> @@ -1869,7 +1875,7 @@ uart_set_options(struct uart_port *port,
>  	 * Ensure that the serial console lock is initialised
>  	 * early.
>  	 */
> -	spin_lock_init(&port->lock);
> +	spin_lock_init_key(&port->lock, &port_lock_key);
>  
>  	memset(&termios, 0, sizeof(struct termios));
>  
> @@ -2255,7 +2261,7 @@ int uart_add_one_port(struct uart_driver
>  	 * initialised.
>  	 */
>  	if (!(uart_console(port) && (port->cons->flags & CON_ENABLED)))
> -		spin_lock_init(&port->lock);
> +		spin_lock_init_key(&port->lock, &port_lock_key);
>  
>  	uart_configure_port(drv, state, port);
>  

Is there a cleaner way of doing this?

Perhaps write a new helper function which initialises the spinlock, call
that?  Rather than open-coding lockdep stuff?
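
One possible shape for such a helper, keeping the key private to a
single init routine so both call sites stay free of lockdep details
(sketch only; the helper name is invented):

static void uart_port_spin_lock_init(struct uart_port *port)
{
	static struct lockdep_type_key port_lock_key;

	spin_lock_init_key(&port->lock, &port_lock_key);
}

uart_set_options() and uart_add_one_port() would then both call
uart_port_spin_lock_init(port) instead of open-coding the key.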


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 37/61] lock validator: special locking: dcache
  2006-05-29 21:26 ` [patch 37/61] lock validator: special locking: dcache Ingo Molnar
@ 2006-05-30  1:35   ` Andrew Morton
  2006-05-30 20:51     ` Steven Rostedt
  0 siblings, 1 reply; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:35 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan

On Mon, 29 May 2006 23:26:08 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> 
> teach special (recursive) locking code to the lock validator. Has no
> effect on non-lockdep kernels.
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
> ---
>  fs/dcache.c            |    6 +++---
>  include/linux/dcache.h |   12 ++++++++++++
>  2 files changed, 15 insertions(+), 3 deletions(-)
> 
> Index: linux/fs/dcache.c
> ===================================================================
> --- linux.orig/fs/dcache.c
> +++ linux/fs/dcache.c
> @@ -1380,10 +1380,10 @@ void d_move(struct dentry * dentry, stru
>  	 */
>  	if (target < dentry) {
>  		spin_lock(&target->d_lock);
> -		spin_lock(&dentry->d_lock);
> +		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
>  	} else {
>  		spin_lock(&dentry->d_lock);
> -		spin_lock(&target->d_lock);
> +		spin_lock_nested(&target->d_lock, DENTRY_D_LOCK_NESTED);
>  	}
>  
>  	/* Move the dentry to the target hash queue, if on different bucket */
> @@ -1420,7 +1420,7 @@ already_unhashed:
>  	}
>  
>  	list_add(&dentry->d_u.d_child, &dentry->d_parent->d_subdirs);
> -	spin_unlock(&target->d_lock);
> +	spin_unlock_non_nested(&target->d_lock);
>  	fsnotify_d_move(dentry);
>  	spin_unlock(&dentry->d_lock);
>  	write_sequnlock(&rename_lock);
> Index: linux/include/linux/dcache.h
> ===================================================================
> --- linux.orig/include/linux/dcache.h
> +++ linux/include/linux/dcache.h
> @@ -114,6 +114,18 @@ struct dentry {
>  	unsigned char d_iname[DNAME_INLINE_LEN_MIN];	/* small names */
>  };
>  
> +/*
> + * dentry->d_lock spinlock nesting types:
> + *
> + * 0: normal
> + * 1: nested
> + */
> +enum dentry_d_lock_type
> +{
> +	DENTRY_D_LOCK_NORMAL,
> +	DENTRY_D_LOCK_NESTED
> +};
> +
>  struct dentry_operations {
>  	int (*d_revalidate)(struct dentry *, struct nameidata *);
>  	int (*d_hash) (struct dentry *, struct qstr *);

DENTRY_D_LOCK_NORMAL isn't used anywhere.


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 46/61] lock validator: special locking: slab
  2006-05-29 21:26 ` [patch 46/61] lock validator: special locking: slab Ingo Molnar
@ 2006-05-30  1:35   ` Andrew Morton
  2006-06-23  9:54     ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:35 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan

On Mon, 29 May 2006 23:26:49 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> +		/*
> +		 * Do not assume that spinlocks can be initialized via memcpy:
> +		 */

I'd view that as something which should be fixed in mainline.

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 50/61] lock validator: special locking: hrtimer.c
  2006-05-29 21:27 ` [patch 50/61] lock validator: special locking: hrtimer.c Ingo Molnar
@ 2006-05-30  1:35   ` Andrew Morton
  2006-06-23 10:04     ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:35 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan

On Mon, 29 May 2006 23:27:09 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> 
> teach special (recursive) locking code to the lock validator. Has no
> effect on non-lockdep kernels.
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
> ---
>  kernel/hrtimer.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux/kernel/hrtimer.c
> ===================================================================
> --- linux.orig/kernel/hrtimer.c
> +++ linux/kernel/hrtimer.c
> @@ -786,7 +786,7 @@ static void __devinit init_hrtimers_cpu(
>  	int i;
>  
>  	for (i = 0; i < MAX_HRTIMER_BASES; i++, base++)
> -		spin_lock_init(&base->lock);
> +		spin_lock_init_static(&base->lock);
>  }
>  

Perhaps the validator core's implementation of spin_lock_init() could look
at the address and work out if it's within the static storage sections.


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 51/61] lock validator: special locking: sock_lock_init()
  2006-05-29 21:27 ` [patch 51/61] lock validator: special locking: sock_lock_init() Ingo Molnar
@ 2006-05-30  1:36   ` Andrew Morton
  2006-06-23 10:06     ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:36 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan, David S. Miller

On Mon, 29 May 2006 23:27:14 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> 
> teach special (multi-initialized, per-address-family) locking code to the
> lock validator. Has no effect on non-lockdep kernels.
> 
> Index: linux/include/net/sock.h
> ===================================================================
> --- linux.orig/include/net/sock.h
> +++ linux/include/net/sock.h
> @@ -81,12 +81,6 @@ typedef struct {
>  	wait_queue_head_t	wq;
>  } socket_lock_t;
>  
> -#define sock_lock_init(__sk) \
> -do {	spin_lock_init(&((__sk)->sk_lock.slock)); \
> -	(__sk)->sk_lock.owner = NULL; \
> -	init_waitqueue_head(&((__sk)->sk_lock.wq)); \
> -} while(0)
> -
>  struct sock;
>  struct proto;
>  
> Index: linux/net/core/sock.c
> ===================================================================
> --- linux.orig/net/core/sock.c
> +++ linux/net/core/sock.c
> @@ -739,6 +739,27 @@ lenout:
>    	return 0;
>  }
>  
> +/*
> + * Each address family might have different locking rules, so we have
> + * one slock key per address family:
> + */
> +static struct lockdep_type_key af_family_keys[AF_MAX];
> +
> +static void noinline sock_lock_init(struct sock *sk)
> +{
> +	spin_lock_init_key(&sk->sk_lock.slock, af_family_keys + sk->sk_family);
> +	sk->sk_lock.owner = NULL;
> +	init_waitqueue_head(&sk->sk_lock.wq);
> +}

OK, no code outside net/core/sock.c uses sock_lock_init().

Hopefully the same is true of out-of-tree code...

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 52/61] lock validator: special locking: af_unix
  2006-05-29 21:27 ` [patch 52/61] lock validator: special locking: af_unix Ingo Molnar
@ 2006-05-30  1:36   ` Andrew Morton
  2006-06-23 10:07     ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:36 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan, David S. Miller

On Mon, 29 May 2006 23:27:19 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> 
> teach special (recursive) locking code to the lock validator. Has no
> effect on non-lockdep kernels.
> 
> (includes workaround for sk_receive_queue.lock, which is currently
> treated globally by the lock validator, but which be switched to
> per-address-family locking rules.)
> 
> ...
>
>  
> -			spin_lock(&sk->sk_receive_queue.lock);
> +			spin_lock_bh(&sk->sk_receive_queue.lock);

Again, a bit of a show-stopper.  Will the real fix be far off?

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 55/61] lock validator: special locking: sb->s_umount
  2006-05-29 21:27 ` [patch 55/61] lock validator: special locking: sb->s_umount Ingo Molnar
@ 2006-05-30  1:36   ` Andrew Morton
  2006-06-23 10:55     ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:36 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan

On Mon, 29 May 2006 23:27:32 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> 
> workaround for special sb->s_umount locking rule.
> 
> s_umount gets held across a series of lock dropping and releasing
> in prune_one_dentry(), so i changed the order, at the risk of
> introducing a umount race. FIXME.
> 
> i think a better fix would be to do the unlocks as _non_nested in
> prune_one_dentry(), and to do the up_read() here as
> an up_read_non_nested() as well?
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
> ---
>  fs/dcache.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> Index: linux/fs/dcache.c
> ===================================================================
> --- linux.orig/fs/dcache.c
> +++ linux/fs/dcache.c
> @@ -470,8 +470,9 @@ static void prune_dcache(int count, stru
>  		s_umount = &dentry->d_sb->s_umount;
>  		if (down_read_trylock(s_umount)) {
>  			if (dentry->d_sb->s_root != NULL) {
> -				prune_one_dentry(dentry);
> +// lockdep hack: do this better!
>  				up_read(s_umount);
> +				prune_one_dentry(dentry);
>  				continue;

argh, you broke my kernel!

I'll whack some ifdefs in here so it's only known-broken if CONFIG_LOCKDEP.

Again, we'd need the real fix here.

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 59/61] lock validator: special locking: xfrm
  2006-05-29 21:27 ` [patch 59/61] lock validator: special locking: xfrm Ingo Molnar
@ 2006-05-30  1:36   ` Andrew Morton
  0 siblings, 0 replies; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:36 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan, David S. Miller, Patrick McHardy

On Mon, 29 May 2006 23:27:51 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> 
> teach special (non-nested) unlocking code to the lock validator. Has no
> effect on non-lockdep kernels.
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
> ---
>  net/xfrm/xfrm_policy.c |    2 +-
>  net/xfrm/xfrm_state.c  |    2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> Index: linux/net/xfrm/xfrm_policy.c
> ===================================================================
> --- linux.orig/net/xfrm/xfrm_policy.c
> +++ linux/net/xfrm/xfrm_policy.c
> @@ -1308,7 +1308,7 @@ static struct xfrm_policy_afinfo *xfrm_p
>  	afinfo = xfrm_policy_afinfo[family];
>  	if (likely(afinfo != NULL))
>  		read_lock(&afinfo->lock);
> -	read_unlock(&xfrm_policy_afinfo_lock);
> +	read_unlock_non_nested(&xfrm_policy_afinfo_lock);
>  	return afinfo;
>  }
>  
> Index: linux/net/xfrm/xfrm_state.c
> ===================================================================
> --- linux.orig/net/xfrm/xfrm_state.c
> +++ linux/net/xfrm/xfrm_state.c
> @@ -1105,7 +1105,7 @@ static struct xfrm_state_afinfo *xfrm_st
>  	afinfo = xfrm_state_afinfo[family];
>  	if (likely(afinfo != NULL))
>  		read_lock(&afinfo->lock);
> -	read_unlock(&xfrm_state_afinfo_lock);
> +	read_unlock_non_nested(&xfrm_state_afinfo_lock);
>  	return afinfo;
>  }
>  

I got a bunch of rejects here due to changes in git-net.patch.  Please
verify the result.  It could well be wrong (the changes in there are odd).


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 61/61] lock validator: enable lock validator in Kconfig
  2006-05-29 21:28 ` [patch 61/61] lock validator: enable lock validator in Kconfig Ingo Molnar
@ 2006-05-30  1:36   ` Andrew Morton
  2006-05-30 13:33   ` Roman Zippel
  1 sibling, 0 replies; 320+ messages in thread
From: Andrew Morton @ 2006-05-30  1:36 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan

On Mon, 29 May 2006 23:28:12 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> offer the following lock validation options:
> 
>  CONFIG_PROVE_SPIN_LOCKING
>  CONFIG_PROVE_RW_LOCKING
>  CONFIG_PROVE_MUTEX_LOCKING
>  CONFIG_PROVE_RWSEM_LOCKING
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
> ---
>  lib/Kconfig.debug |  167 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 167 insertions(+)
> 
> Index: linux/lib/Kconfig.debug
> ===================================================================
> --- linux.orig/lib/Kconfig.debug
> +++ linux/lib/Kconfig.debug
> @@ -184,6 +184,173 @@ config DEBUG_SPINLOCK
>  	  best used in conjunction with the NMI watchdog so that spinlock
>  	  deadlocks are also debuggable.
>  
> +config PROVE_SPIN_LOCKING
> +	bool "Prove spin-locking correctness"
> +	default y

err, I think I'll be sticking a `depends on X86' in there, thanks very
much.  I'd prefer that you be the first to test it ;)


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (62 preceding siblings ...)
  2006-05-30  1:35 ` Andrew Morton
@ 2006-05-30  4:52 ` Mike Galbraith
  2006-05-30  6:20   ` Arjan van de Ven
                     ` (2 more replies)
  2006-05-30  9:14 ` Benoit Boissinot
                   ` (9 subsequent siblings)
  73 siblings, 3 replies; 320+ messages in thread
From: Mike Galbraith @ 2006-05-30  4:52 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Arjan van de Ven, Andrew Morton

On Mon, 2006-05-29 at 23:21 +0200, Ingo Molnar wrote:
> The easiest way to try lockdep on a testbox is to apply the combo patch 
> to 2.6.17-rc4-mm3. The patch order is:
> 
>   http://kernel.org/pub/linux/kernel/v2.6/testing/linux-2.6.17-rc4.tar.bz2
>   http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.17-rc4/2.6.17-rc4-mm3/2.6.17-rc4-mm3.bz2
>   http://redhat.com/~mingo/lockdep-patches/lockdep-combo.patch
> 
> do 'make oldconfig' and accept all the defaults for new config options - 
> reboot into the kernel and if everything goes well it should boot up 
> fine and you should have /proc/lockdep and /proc/lockdep_stats files.

Darn.  It said all tests passed, then oopsed.

(have .config all gzipped up if you want it)

	-Mike

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
b103a872
*pde = 00000000
Oops: 0000 [#1]
PREEMPT SMP
last sysfs file:
Modules linked in:
CPU:    0
EIP:    0060:[<b103a872>]    Not tainted VLI
EFLAGS: 00010083   (2.6.17-rc4-mm3-smp #157)
EIP is at count_matching_names+0x5b/0xa2
eax: b15074a8   ebx: 00000000   ecx: b165c430   edx: b165b320
esi: 00000000   edi: b1410423   ebp: dfe20e74   esp: dfe20e68
ds: 007b   es: 007b   ss: 0068
Process idle (pid: 1, threadinfo=dfe20000 task=effc1470)
Stack: 000139b0 b165c430 00000000 dfe20ec8 b103d442 b1797a6c b1797a64 effc1470
       b1797a64 00000004 b1797a50 00000000 b15074a8 effc1470 dfe20ef8 b106da88
       b169d0a8 b1797a64 dfe20f52 0000000a b106dec7 00000282 dfe20000 00000000
Call Trace:
 <b1003d73> show_stack_log_lvl+0x9e/0xc3  <b1003f80> show_registers+0x1ac/0x237
 <b100413d> die+0x132/0x2fb  <b101a083> do_page_fault+0x5cf/0x656
 <b10038a7> error_code+0x4f/0x54  <b103d442> __lockdep_acquire+0xa6f/0xc32
 <b103d9f8> lockdep_acquire+0x61/0x77  <b13d27f3> _spin_lock+0x2e/0x42
 <b102b03a> register_sysctl_table+0x4e/0xaa  <b15a463a> sched_init_smp+0x411/0x41e
 <b100035d> init+0xbd/0x2c6  <b1001005> kernel_thread_helper+0x5/0xb
Code: 92 50 b1 74 5d 8b 41 10 2b 41 14 31 db 39 42 10 75 0d eb 53 8b 41 10 2b 41 14 3b 42 10 74 48 8b b2 a0 00 00 00 8b b9 a0 00 00 00 <ac> ae 75 08 84 c0 75 f8 31 c0 eb 04 19 c0 0c 01 85 c0 75 0b 8b

1151            list_for_each_entry(type, &all_lock_types, lock_entry) {
1152                    if (new_type->key - new_type->subtype == type->key)
1153                            return type->name_version;
1154                    if (!strcmp(type->name, new_type->name))  <--kaboom
1155                            count = max(count, type->name_version);
1156            }

EIP: [<b103a872>] count_matching_names+0x5b/0xa2 SS:ESP 0068:dfe20e68
 Kernel panic - not syncing: Attempted to kill init!
 BUG: warning at arch/i386/kernel/smp.c:537/smp_call_function()
 <b1003dd2> show_trace+0xd/0xf  <b10044c0> dump_stack+0x17/0x19
 <b10129ff> smp_call_function+0x11d/0x122  <b1012a22> smp_send_stop+0x1e/0x31
 <b1022f4b> panic+0x60/0x1d5  <b10267fa> do_exit+0x613/0x94f
 <b1004306> do_trap+0x0/0x9e  <b101a083> do_page_fault+0x5cf/0x656
 <b10038a7> error_code+0x4f/0x54  <b103d442> __lockdep_acquire+0xa6f/0xc32
 <b103d9f8> lockdep_acquire+0x61/0x77  <b13d27f3> _spin_lock+0x2e/0x42
 <b102b03a> register_sysctl_table+0x4e/0xaa  <b15a463a> sched_init_smp+0x411/0x41e
 <b100035d> init+0xbd/0x2c6  <b1001005> kernel_thread_helper+0x5/0xb
BUG: NMI Watchdog detected LOCKUP on CPU1, eip b103cc64, registers:
Modules linked in:
CPU:    1
EIP:    0060:[<b103cc64>]    Not tainted VLI
EFLAGS: 00000086   (2.6.17-rc4-mm3-smp #157)
EIP is at __lockdep_acquire+0x291/0xc32
eax: 00000000   ebx: 000001d7   ecx: b16bf938   edx: 00000000
esi: 00000000   edi: b16bf938   ebp: effc4ea4   esp: effc4e58
ds: 007b   es: 007b   ss: 0068
Process idle (pid: 0, threadinfo=effc4000 task=effc0a50)
Stack: b101d4ce 00000000 effc0fb8 000001d7 effc0a50 b16bf938 00000000 b29b38c8
       effc0a50 effc0fb8 00000001 00000000 00000005 00000000 00000000 00000000
       00000096 effc4000 00000000 effc4ecc b103d9f8 00000000 00000001 b101d4ce
Call Trace:
 <b1003d73> show_stack_log_lvl+0x9e/0xc3  <b1003f80> show_registers+0x1ac/0x237
 <b10050d9> die_nmi+0x93/0xeb  <b1015af1> nmi_watchdog_tick+0xff/0x20e
 <b1004542> do_nmi+0x80/0x249  <b1003912> nmi_stack_correct+0x1d/0x22
 <b103d9f8> lockdep_acquire+0x61/0x77  <b13d27f3> _spin_lock+0x2e/0x42
 <b101d4ce> scheduler_tick+0xd0/0x381  <b102d47e> update_process_times+0x42/0x61
 <b1014f9f> smp_apic_timer_interrupt+0x67/0x78  <b10037ba> apic_timer_interrupt+0x2a/0x30
 <b1001e5b> cpu_idle+0x71/0xb8  <b1013c6e> start_secondary+0x3e5/0x46b
 <00000000> _stext+0x4efffd68/0x8  <effc4fb4> 0xeffc4fb4
Code: 18 01 90 39 c7 0f 84 2e 02 00 00 8b 50 0c 31 f2 8b 40 08 31 d8 09 c2 75 e2 f0 ff 05 08 8a 61 b1 f0 fe 0d e4 92 50 b1 79 0d f3 90 <80> 3d e4 92 50 b1 00 7e f5 eb ea 8b 55 d4 8b b2 64 05 00 00 85
console shuts up ...



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 34/61] lock validator: special locking: bdev
  2006-05-30  1:35   ` Andrew Morton
@ 2006-05-30  5:13     ` Arjan van de Ven
  2006-05-30  9:58     ` Al Viro
  2006-05-30 10:45     ` Arjan van de Ven
  2 siblings, 0 replies; 320+ messages in thread
From: Arjan van de Ven @ 2006-05-30  5:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ingo Molnar, linux-kernel

On Mon, 2006-05-29 at 18:35 -0700, Andrew Morton wrote:
> On Mon, 29 May 2006 23:25:54 +0200
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > From: Ingo Molnar <mingo@elte.hu>
> > 
> > teach special (recursive) locking code to the lock validator. Has no
> > effect on non-lockdep kernels.
> > 
> 
> There's no description here of the problem which is being worked around. 
> This leaves everyone in the dark.

it's not really a workaround, it's a "separate the uses" thing. The real
problem is an inherent hierarchy between "disk" and "partition", where
lots of code assumes you can first take the disk mutex, and then the
partition mutex, and never deadlock. This patch basically separates the
"get me the disk" versus "get me the partition" uses.

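In code terms the allowed nesting is the one below; without the
whole/partition separation both acquisitions look like the same lock
type to the validator and get flagged as recursion. The function name is
invented, and bd_mutex is taken to be the per-device mutex that the
BD_MUTEX_WHOLE annotation in the quoted patch refers to:

static void example_open_partition(struct block_device *whole,
				   struct block_device *part)
{
	mutex_lock(&whole->bd_mutex);	/* the whole disk, always first */
	mutex_lock(&part->bd_mutex);	/* then the partition */
	/* ... */
	mutex_unlock(&part->bd_mutex);
	mutex_unlock(&whole->bd_mutex);
}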

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-29 22:28 ` [patch 00/61] ANNOUNCE: lock validator -V1 Michal Piotrowski
  2006-05-29 22:41   ` Ingo Molnar
@ 2006-05-30  5:20   ` Arjan van de Ven
  1 sibling, 0 replies; 320+ messages in thread
From: Arjan van de Ven @ 2006-05-30  5:20 UTC (permalink / raw)
  To: Michal Piotrowski; +Cc: Ingo Molnar, linux-kernel, Andrew Morton, Dave Jones

On Tue, 2006-05-30 at 00:28 +0200, Michal Piotrowski wrote:
> On 29/05/06, Ingo Molnar <mingo@elte.hu> wrote:
> > We are pleased to announce the first release of the "lock dependency
> > correctness validator" kernel debugging feature, which can be downloaded
> > from:
> >
> >   http://redhat.com/~mingo/lockdep-patches/
> >
> [snip]
> 
> I get this while loading cpufreq modules

can you enable CONFIG_KALLSYMS_ALL ? that will give a more accurate
debug output...


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-29 23:09     ` Dave Jones
@ 2006-05-30  5:45       ` Arjan van de Ven
  2006-05-30  6:07         ` Michal Piotrowski
                           ` (2 more replies)
  2006-05-30  5:52       ` [patch 00/61] ANNOUNCE: lock validator -V1 Michal Piotrowski
  1 sibling, 3 replies; 320+ messages in thread
From: Arjan van de Ven @ 2006-05-30  5:45 UTC (permalink / raw)
  To: Dave Jones; +Cc: Andrew Morton, linux-kernel, Michal Piotrowski, Ingo Molnar


> I'm feeling a bit overwhelmed by the voluminous output of this checker.
> Especially as (directly at least) cpufreq doesn't touch vma's, or mmap's.

the reporter doesn't have CONFIG_KALLSYMS_ALL enabled, which sometimes
gives misleading backtraces (should lockdep just enable KALLSYMS_ALL
to get more useful bug reports?)

the problem is this, there are 2 scenarios in this bug:

One
---
store_scaling_governor takes policy->lock and then calls __cpufreq_set_policy
__cpufreq_set_policy calls __cpufreq_governor
__cpufreq_governor  calls __cpufreq_driver_target via cpufreq_governor_performance
__cpufreq_driver_target calls lock_cpu_hotplug() (which takes the hotplug lock)


Two
---
cpufreq_stats_init lock_cpu_hotplug() and then calls cpufreq_stat_cpu_callback
cpufreq_stat_cpu_callback calls cpufreq_update_policy
cpufreq_update_policy takes the policy->lock


so this looks like a real honest AB-BA deadlock to me...



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-29 23:09     ` Dave Jones
  2006-05-30  5:45       ` Arjan van de Ven
@ 2006-05-30  5:52       ` Michal Piotrowski
  1 sibling, 0 replies; 320+ messages in thread
From: Michal Piotrowski @ 2006-05-30  5:52 UTC (permalink / raw)
  To: Dave Jones, Ingo Molnar, Michal Piotrowski, linux-kernel,
	Arjan van de Ven, Andrew Morton

Hi,

On 30/05/06, Dave Jones <davej@redhat.com> wrote:
> On Tue, May 30, 2006 at 12:41:08AM +0200, Ingo Molnar wrote:
>
>  > > =====================================================
>  > > [ BUG: possible circular locking deadlock detected! ]
>  > > -----------------------------------------------------
>  > > modprobe/1942 is trying to acquire lock:
>  > > (&anon_vma->lock){--..}, at: [<c10609cf>] anon_vma_link+0x1d/0xc9
>  > >
>  > > but task is already holding lock:
>  > > (&mm->mmap_sem/1){--..}, at: [<c101e5a0>] copy_process+0xbc6/0x1519
>  > >
>  > > which lock already depends on the new lock,
>  > > which could lead to circular deadlocks!
>  >
>  > hm, this one could perhaps be a real bug. Dave: lockdep complains about
>  > having observed:
>  >
>  >      anon_vma->lock  =>   mm->mmap_sem
>  >      mm->mmap_sem    =>   anon_vma->lock
>  >
>  > locking sequences, in the cpufreq code. Is there some special runtime
>  > behavior that still makes this safe, or is it a real bug?
>
> I'm feeling a bit overwhelmed by the voluminous output of this checker.
> Especially as (directly at least) cpufreq doesn't touch vma's, or mmap's.
>
> The first stack trace it shows has us down in the bowels of cpu hotplug,
> where we're taking the cpucontrol sem.  The second stack trace shows
> us in cpufreq_update_policy taking a per-cpu data->lock semaphore.
>
> Now, I notice this is modprobe triggering this, and this *looks* like
> we're loading two modules simultaneously (the first trace is from a
> scaling driver like powernow-k8 or the like, whilst the second trace
> is from cpufreq-stats).

/etc/init.d/cpuspeed starts very early
$ ls /etc/rc5.d/ | grep cpu
S06cpuspeed

I have this in my /etc/rc.local
modprobe -i cpufreq_conservative
modprobe -i cpufreq_ondemand
modprobe -i cpufreq_powersave
modprobe -i cpufreq_stats
modprobe -i cpufreq_userspace
modprobe -i freq_table

>
> How on earth did we get into this situation?

Just before gdm starts, while /etc/rc.local is processed.

> module loading is supposed
> to be serialised on the module_mutex no ?
>
> It's been a while since a debug patch has sent me in search of paracetamol ;)
>
>                 Dave

Regards,
Michal

-- 
Michal K. K. Piotrowski
LTG - Linux Testers Group
(http://www.stardust.webpages.pl/ltg/wiki/)

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30  5:45       ` Arjan van de Ven
@ 2006-05-30  6:07         ` Michal Piotrowski
  2006-05-30 14:10         ` Dave Jones
  2006-05-30 20:54         ` [patch, -rc5-mm1] lock validator: select KALLSYMS_ALL Ingo Molnar
  2 siblings, 0 replies; 320+ messages in thread
From: Michal Piotrowski @ 2006-05-30  6:07 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Dave Jones, Andrew Morton, linux-kernel, Ingo Molnar

Hi,

On 30/05/06, Arjan van de Ven <arjan@infradead.org> wrote:
>
> > I'm feeling a bit overwhelmed by the voluminous output of this checker.
> > Especially as (directly at least) cpufreq doesn't touch vma's, or mmap's.
>
> the reporter doesn't have CONFIG_KALLSYMS_ALL enabled, which sometimes
> gives misleading backtraces (should lockdep just enable KALLSYMS_ALL
> to get more useful bug reports?)

Here is the bug with CONFIG_KALLSYMS_ALL enabled.

=====================================================
[ BUG: possible circular locking deadlock detected! ]
-----------------------------------------------------
modprobe/1950 is trying to acquire lock:
 (&sighand->siglock){.+..}, at: [<c102b632>] do_notify_parent+0x12b/0x1b9

but task is already holding lock:
 (tasklist_lock){..-±}, at: [<c1023473>] do_exit+0x608/0xa43

which lock already depends on the new lock,
which could lead to circular deadlocks!

the existing dependency chain (in reverse order) is:

-> #1 (cpucontrol){--..}:
       [<c10394be>] lockdep_acquire+0x69/0x82
       [<c11ed729>] __mutex_lock_slowpath+0xd0/0x347
       [<c11ed9bc>] mutex_lock+0x1c/0x1f
       [<c103dda5>] __lock_cpu_hotplug+0x36/0x56
       [<c103ddde>] lock_cpu_hotplug+0xa/0xc
       [<c1199dd6>] __cpufreq_driver_target+0x15/0x50
       [<c119a192>] cpufreq_governor_performance+0x1a/0x20
       [<c1198ada>] __cpufreq_governor+0xa0/0x1a9
       [<c1198cb2>] __cpufreq_set_policy+0xcf/0x100
       [<c1199196>] cpufreq_set_policy+0x2d/0x6f
       [<c1199c7e>] cpufreq_add_dev+0x34f/0x492
       [<c114b898>] sysdev_driver_register+0x58/0x9b
       [<c119a006>] cpufreq_register_driver+0x80/0xf4
       [<fd91402a>] ipt_local_out_hook+0x2a/0x65 [iptable_filter]
       [<c10410e1>] sys_init_module+0xa6/0x230
       [<c11ef97b>] sysenter_past_esp+0x54/0x8d

-> #0 (&sighand->siglock){.+..}:
       [<c10394be>] lockdep_acquire+0x69/0x82
       [<c11ed729>] __mutex_lock_slowpath+0xd0/0x347
       [<c11ed9bc>] mutex_lock+0x1c/0x1f
       [<c11990bb>] cpufreq_update_policy+0x34/0xd8
       [<fd9a350b>] cpufreq_stat_cpu_callback+0x1b/0x7c [cpufreq_stats]
       [<fd9a607d>] cpufreq_stats_init+0x7d/0x9b [cpufreq_stats]
       [<c10410e1>] sys_init_module+0xa6/0x230
       [<c11ef97b>] sysenter_past_esp+0x54/0x8d

other info that might help us debug this:

1 locks held by modprobe/1950:
 #0:  (cpucontrol){--..}, at: [<c11ed9bc>] mutex_lock+0x1c/0x1f

stack backtrace:
 [<c1003ed6>] show_trace+0xd/0xf
 [<c10043e9>] dump_stack+0x17/0x19
 [<c103863e>] print_circular_bug_tail+0x59/0x64
 [<c1038e91>] __lockdep_acquire+0x848/0xa39
 [<c10394be>] lockdep_acquire+0x69/0x82
 [<c11ed729>] __mutex_lock_slowpath+0xd0/0x347
 [<c11ed9bc>] mutex_lock+0x1c/0x1f
 [<c11990bb>] cpufreq_update_policy+0x34/0xd8
 [<fd9a350b>] cpufreq_stat_cpu_callback+0x1b/0x7c [cpufreq_stats]
 [<fd9a607d>] cpufreq_stats_init+0x7d/0x9b [cpufreq_stats]
 [<c10410e1>] sys_init_module+0xa6/0x230
 [<c11ef97b>] sysenter_past_esp+0x54/0x8d


>
> the problem is this, there are 2 scenarios in this bug:
>
> One
> ---
> store_scaling_governor takes policy->lock and then calls __cpufreq_set_policy
> __cpufreq_set_policy calls __cpufreq_governor
> __cpufreq_governor  calls __cpufreq_driver_target via cpufreq_governor_performance
> __cpufreq_driver_target calls lock_cpu_hotplug() (which takes the hotplug lock)
>
>
> Two
> ---
> cpufreq_stats_init calls lock_cpu_hotplug() and then calls cpufreq_stat_cpu_callback
> cpufreq_stat_cpu_callback calls cpufreq_update_policy
> cpufreq_update_policy takes the policy->lock
>
>
> so this looks like a real honest AB-BA deadlock to me...

Regards,
Michal

-- 
Michal K. K. Piotrowski
LTG - Linux Testers Group
(http://www.stardust.webpages.pl/ltg/wiki/)

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30  4:52 ` Mike Galbraith
@ 2006-05-30  6:20   ` Arjan van de Ven
  2006-05-30  6:35   ` Arjan van de Ven
  2006-05-30  6:37   ` Ingo Molnar
  2 siblings, 0 replies; 320+ messages in thread
From: Arjan van de Ven @ 2006-05-30  6:20 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Ingo Molnar, linux-kernel, Andrew Morton

On Tue, 2006-05-30 at 06:52 +0200, Mike Galbraith wrote:
> On Mon, 2006-05-29 at 23:21 +0200, Ingo Molnar wrote:
> > The easiest way to try lockdep on a testbox is to apply the combo patch 
> > to 2.6.17-rc4-mm3. The patch order is:
> > 
> >   http://kernel.org/pub/linux/kernel/v2.6/testing/linux-2.6.17-rc4.tar.bz2
> >   http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.17-rc4/2.6.17-rc4-mm3/2.6.17-rc4-mm3.bz2
> >   http://redhat.com/~mingo/lockdep-patches/lockdep-combo.patch
> > 
> > do 'make oldconfig' and accept all the defaults for new config options - 
> > reboot into the kernel and if everything goes well it should boot up 
> > fine and you should have /proc/lockdep and /proc/lockdep_stats files.
> 
> Darn.  It said all tests passed, then oopsed.
> 
> (have .config all gzipped up if you want it)


yes please get me/Ingo the .config; something odd is going on


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30  4:52 ` Mike Galbraith
  2006-05-30  6:20   ` Arjan van de Ven
@ 2006-05-30  6:35   ` Arjan van de Ven
  2006-05-30  7:47     ` Ingo Molnar
  2006-05-30  6:37   ` Ingo Molnar
  2 siblings, 1 reply; 320+ messages in thread
From: Arjan van de Ven @ 2006-05-30  6:35 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Ingo Molnar, linux-kernel, Andrew Morton

On Tue, 2006-05-30 at 06:52 +0200, Mike Galbraith wrote:
> On Mon, 2006-05-29 at 23:21 +0200, Ingo Molnar wrote:
> > The easiest way to try lockdep on a testbox is to apply the combo patch 
> > to 2.6.17-rc4-mm3. The patch order is:
> > 
> >   http://kernel.org/pub/linux/kernel/v2.6/testing/linux-2.6.17-rc4.tar.bz2
> >   http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.17-rc4/2.6.17-rc4-mm3/2.6.17-rc4-mm3.bz2
> >   http://redhat.com/~mingo/lockdep-patches/lockdep-combo.patch
> > 
> > do 'make oldconfig' and accept all the defaults for new config options - 
> > reboot into the kernel and if everything goes well it should boot up 
> > fine and you should have /proc/lockdep and /proc/lockdep_stats files.
> 
> Darn.  It said all tests passed, then oopsed.


does this fix it?


type->name can be NULL legitimately; all places but one check for this
already. Fix this off-by-one.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>

--- linux-2.6.17-rc4-mm3-lockdep/kernel/lockdep.c.org	2006-05-30 08:32:52.000000000 +0200
+++ linux-2.6.17-rc4-mm3-lockdep/kernel/lockdep.c	2006-05-30 08:33:09.000000000 +0200
@@ -1151,7 +1151,7 @@ int count_matching_names(struct lock_typ
 	list_for_each_entry(type, &all_lock_types, lock_entry) {
 		if (new_type->key - new_type->subtype == type->key)
 			return type->name_version;
-		if (!strcmp(type->name, new_type->name))
+		if (type->name && !strcmp(type->name, new_type->name))
 			count = max(count, type->name_version);
 	}
 



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30  4:52 ` Mike Galbraith
  2006-05-30  6:20   ` Arjan van de Ven
  2006-05-30  6:35   ` Arjan van de Ven
@ 2006-05-30  6:37   ` Ingo Molnar
  2006-05-30  9:25     ` Mike Galbraith
  2 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-05-30  6:37 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, Arjan van de Ven, Andrew Morton


* Mike Galbraith <efault@gmx.de> wrote:

> Darn.  It said all tests passed, then oopsed.
> 
> (have .config all gzipped up if you want it)

yeah, please.

> EIP:    0060:[<b103a872>]    Not tainted VLI
> EFLAGS: 00010083   (2.6.17-rc4-mm3-smp #157)
> EIP is at count_matching_names+0x5b/0xa2

> 1151            list_for_each_entry(type, &all_lock_types, lock_entry) {
> 1152                    if (new_type->key - new_type->subtype == type->key)
> 1153                            return type->name_version;
> 1154                    if (!strcmp(type->name, new_type->name))  <--kaboom
> 1155                            count = max(count, type->name_version);

hm, while most code (except the one above) is prepared for type->name 
being NULL, it should not be NULL. Maybe an uninitialized lock slipped 
through? Please try the patch below - it both protects against 
type->name being NULL in this place, and will warn if it finds a NULL 
lockname.

	Ingo

Index: linux/kernel/lockdep.c
===================================================================
--- linux.orig/kernel/lockdep.c
+++ linux/kernel/lockdep.c
@@ -1151,7 +1151,7 @@ int count_matching_names(struct lock_typ
 	list_for_each_entry(type, &all_lock_types, lock_entry) {
 		if (new_type->key - new_type->subtype == type->key)
 			return type->name_version;
-		if (!strcmp(type->name, new_type->name))
+		if (type->name && !strcmp(type->name, new_type->name))
 			count = max(count, type->name_version);
 	}
 
@@ -1974,7 +1974,8 @@ void lockdep_init_map(struct lockdep_map
 
 	if (DEBUG_WARN_ON(!key))
 		return;
-
+	if (DEBUG_WARN_ON(!name))
+		return;
 	/*
 	 * Sanity check, the lock-type key must be persistent:
 	 */

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30  6:35   ` Arjan van de Ven
@ 2006-05-30  7:47     ` Ingo Molnar
  0 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-30  7:47 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Mike Galbraith, linux-kernel, Andrew Morton


* Arjan van de Ven <arjan@infradead.org> wrote:

> > Darn.  It said all tests passed, then oopsed.
>
> does this fix it?
> 
> type->name can be NULL legitimately; all places but one check for this 
> already. Fix this off-by-one.

that used to be the case, but shouldn't happen anymore - with current
lockdep code we always pass some string to the lock init code. (that's 
what lock-init-improvement.patch achieves in essence.) Worst-case the 
string should be "old_style_spin_init" or "old_style_rw_init".

So Mike please try the other patch i sent - it also adds a debugging 
check so that we can see where that NULL name comes from. It could be 
something benign like me forgetting to pass in a string somewhere in the 
initialization macros, but it could also be something more nasty like an 
initialize-by-memset assumption.
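
(to illustrate, the kind of initialize-by-memset pattern i mean would look
roughly like the hypothetical code below - an illustration, not a known
offender:)

	struct foo {
		spinlock_t lock;
	};

	static void foo_init_ok(struct foo *f)
	{
		spin_lock_init(&f->lock);	/* lockdep gets a key and a name */
	}

	static void foo_init_broken(struct foo *f)
	{
		memset(f, 0, sizeof(*f));	/* lock is only zeroed - lockdep
						 * never sees an init call, so
						 * the lock's name can end up
						 * NULL				*/
	}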

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 25/61] lock validator: design docs
  2006-05-29 21:25 ` [patch 25/61] lock validator: design docs Ingo Molnar
@ 2006-05-30  9:07   ` Nikita Danilov
  0 siblings, 0 replies; 320+ messages in thread
From: Nikita Danilov @ 2006-05-30  9:07 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Arjan van de Ven, Andrew Morton, Linux Kernel Mailing List

Ingo Molnar writes:
 > From: Ingo Molnar <mingo@elte.hu>

[...]

 > +
 > +enum bdev_bd_mutex_lock_type
 > +{
 > +       BD_MUTEX_NORMAL,
 > +       BD_MUTEX_WHOLE,
 > +       BD_MUTEX_PARTITION
 > +};

In some situations a well-defined and finite set of "nesting levels" does
not exist. For example, one may have a tree with per-node locking, where
algorithms acquire multiple node locks left-to-right in tree order.
Reiser4 does this.
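
Schematically (a made-up illustration, not the actual Reiser4 code):

	struct node {
		spinlock_t	lock;
		struct node	*right;		/* next node in tree order */
	};

	static void lock_range(struct node *first, struct node *last)
	{
		struct node *n;

		for (n = first; n != last->right; n = n->right)
			spin_lock(&n->lock);	/* depth depends on the tree,
						 * not on a fixed, finite set
						 * of nesting levels	      */
	}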

Can nested locking restrictions be weakened for certain lock types?

Nikita.

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (63 preceding siblings ...)
  2006-05-30  4:52 ` Mike Galbraith
@ 2006-05-30  9:14 ` Benoit Boissinot
  2006-05-30 10:26   ` Arjan van de Ven
  2006-06-01 14:42   ` [patch mm1-rc2] lock validator: netlink.c netlink_table_grab fix Frederik Deweerdt
  2007-02-13 14:20 ` [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support Ingo Molnar
                   ` (8 subsequent siblings)
  73 siblings, 2 replies; 320+ messages in thread
From: Benoit Boissinot @ 2006-05-30  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Arjan van de Ven, Andrew Morton, yi.zhu, jketreno

On 5/29/06, Ingo Molnar <mingo@elte.hu> wrote:
> We are pleased to announce the first release of the "lock dependency
> correctness validator" kernel debugging feature, which can be downloaded
> from:
>
>   http://redhat.com/~mingo/lockdep-patches/
> [snip]

I get this right after ipw2200 is loaded (it is quite verbose, I
probably shouldn't post everything...)

ipw2200: Detected Intel PRO/Wireless 2200BG Network Connection
ipw2200: Detected geography ZZD (13 802.11bg channels, 0 802.11a channels)

======================================================
[ BUG: hard-safe -> hard-unsafe lock order detected! ]
------------------------------------------------------
default.hotplug/3212 [HC0[0]:SC1[1]:HE0:SE0] is trying to acquire:
 (nl_table_lock){-.-±}, at: [<c0301efa>] netlink_broadcast+0x7a/0x360

and this task is already holding:
 (&priv->lock){++..}, at: [<e1cfe588>] ipw_irq_tasklet+0x18/0x500 [ipw2200]
which would create a new lock dependency:
 (&priv->lock){++..} -> (nl_table_lock){-.-±}

but this new dependency connects a hard-irq-safe lock:
 (&priv->lock){++..}
... which became hard-irq-safe at:
  [<c01395da>] lockdep_acquire+0x7a/0xa0
  [<c0352583>] _spin_lock+0x23/0x30
  [<e1cfdbc1>] ipw_isr+0x21/0xd0 [ipw2200]
  [<c01466e3>] handle_IRQ_event+0x33/0x80
  [<c01467e4>] __do_IRQ+0xb4/0x120
  [<c01057c0>] do_IRQ+0x70/0xc0

to a hard-irq-unsafe lock:
 (nl_table_lock){-.-±}
... which became hard-irq-unsafe at:
...  [<c01395da>] lockdep_acquire+0x7a/0xa0
  [<c03520da>] _write_lock_bh+0x2a/0x30
  [<c03017d2>] netlink_table_grab+0x12/0xe0
  [<c0301bcb>] netlink_insert+0x2b/0x180
  [<c030307c>] netlink_kernel_create+0xac/0x140
  [<c048f29a>] rtnetlink_init+0x6a/0xc0
  [<c048f6b9>] netlink_proto_init+0x169/0x180
  [<c010029f>] _stext+0x7f/0x250
  [<c0101005>] kernel_thread_helper+0x5/0xb

which could potentially lead to deadlocks!

other info that might help us debug this:

1 locks held by default.hotplug/3212:
 #0:  (&priv->lock){++..}, at: [<e1cfe588>] ipw_irq_tasklet+0x18/0x500 [ipw2200]

the hard-irq-safe lock's dependencies:
-> (&priv->lock){++..} ops: 102 {
   initial-use  at:
                                       [<c01395da>] lockdep_acquire+0x7a/0xa0
                                       [<c03524c0>] _spin_lock_irqsave+0x30/0x50
                                       [<e1cf6a0c>] ipw_load+0x1fc/0xc90 [ipw2200]
                                       [<e1cf74e8>] ipw_up+0x48/0x520 [ipw2200]
                                       [<e1cfda87>] ipw_net_init+0x27/0x50 [ipw2200]
                                       [<c02eeef1>] register_netdevice+0xd1/0x410
                                       [<c02f0609>] register_netdev+0x59/0x70
                                       [<e1cfe4d6>] ipw_pci_probe+0x806/0x8a0 [ipw2200]
                                       [<c023481e>] pci_device_probe+0x5e/0x80
                                       [<c02a86e4>] driver_probe_device+0x44/0xc0
                                       [<c02a888b>] __driver_attach+0x9b/0xa0
                                       [<c02a8039>] bus_for_each_dev+0x49/0x70
                                       [<c02a8629>] driver_attach+0x19/0x20
                                       [<c02a7c64>] bus_add_driver+0x74/0x140
                                       [<c02a8b06>] driver_register+0x56/0x90
                                       [<c0234a10>] __pci_register_driver+0x50/0x70
                                       [<e18b302e>] 0xe18b302e
                                       [<c014034d>] sys_init_module+0xcd/0x1630
                                       [<c035273b>] sysenter_past_esp+0x54/0x8d
   in-hardirq-W at:
                                       [<c01395da>] lockdep_acquire+0x7a/0xa0
                                       [<c0352583>] _spin_lock+0x23/0x30
                                       [<e1cfdbc1>] ipw_isr+0x21/0xd0 [ipw2200]
                                       [<c01466e3>] handle_IRQ_event+0x33/0x80
                                       [<c01467e4>] __do_IRQ+0xb4/0x120
                                       [<c01057c0>] do_IRQ+0x70/0xc0
   in-softirq-W at:
                                       [<c01395da>] lockdep_acquire+0x7a/0xa0
                                       [<c03524c0>] _spin_lock_irqsave+0x30/0x50
                                       [<e1cfe588>] ipw_irq_tasklet+0x18/0x500 [ipw2200]
                                       [<c0121ea0>] tasklet_action+0x40/0x90
                                       [<c01223b4>] __do_softirq+0x54/0xc0
                                       [<c01056bb>] do_softirq+0x5b/0xf0
 }
 ... key      at: [<e1d0b438>] __key.27363+0x0/0xffff38f6 [ipw2200]
  -> (&q->lock){++..} ops: 33353 {
     initial-use  at:
                      [<c01395da>] lockdep_acquire+0x7a/0xa0
                      [<c0352509>] _spin_lock_irq+0x29/0x40
                      [<c034f084>] wait_for_completion+0x24/0x150
                      [<c013160e>] keventd_create_kthread+0x2e/0x70
                      [<c01315d6>] kthread_create+0xe6/0xf0
                      [<c0121b75>] cpu_callback+0x95/0x110
                      [<c0481194>] spawn_ksoftirqd+0x14/0x30
                      [<c010023c>] _stext+0x1c/0x250
                      [<c0101005>] kernel_thread_helper+0x5/0xb
     in-hardirq-W at:
                      [<c01395da>] lockdep_acquire+0x7a/0xa0
                      [<c03524c0>] _spin_lock_irqsave+0x30/0x50
                      [<c011794b>] __wake_up+0x1b/0x50
                      [<c012dcdd>] __queue_work+0x4d/0x70
                      [<c012ddaf>] queue_work+0x6f/0x80
                      [<c0269588>] acpi_os_execute+0xcd/0xe9
                      [<c026eea1>] acpi_ev_gpe_dispatch+0xbc/0x122
                      [<c026f106>] acpi_ev_gpe_detect+0x99/0xe0
                      [<c026d90b>] acpi_ev_sci_xrupt_handler+0x15/0x1d
                      [<c0268c55>] acpi_irq+0xe/0x18
                      [<c01466e3>] handle_IRQ_event+0x33/0x80
                      [<c01467e4>] __do_IRQ+0xb4/0x120
                      [<c01057c0>] do_IRQ+0x70/0xc0
     in-softirq-W at:
                      [<c01395da>] lockdep_acquire+0x7a/0xa0
                      [<c03524c0>] _spin_lock_irqsave+0x30/0x50
                      [<c011786b>] complete+0x1b/0x60
                      [<c012ef0b>] wakeme_after_rcu+0xb/0x10
                      [<c012f0c9>] __rcu_process_callbacks+0x69/0x1c0
                      [<c012f232>] rcu_process_callbacks+0x12/0x30
                      [<c0121ea0>] tasklet_action+0x40/0x90
                      [<c01223b4>] __do_softirq+0x54/0xc0
                      [<c01056bb>] do_softirq+0x5b/0xf0
   }
   ... key      at: [<c04d47c8>] 0xc04d47c8
    -> (&rq->lock){++..} ops: 68824 {
       initial-use  at:
                       [<c01395da>] lockdep_acquire+0x7a/0xa0
                       [<c03524c0>] _spin_lock_irqsave+0x30/0x50
                       [<c0117bcc>] init_idle+0x4c/0x80
                       [<c0480ad8>] sched_init+0xa8/0xb0
                       [<c0473558>] start_kernel+0x58/0x330
                       [<c0100199>] 0xc0100199
       in-hardirq-W at:
                       [<c01395da>] lockdep_acquire+0x7a/0xa0
                       [<c0352583>] _spin_lock+0x23/0x30
                       [<c0117cc7>] scheduler_tick+0xc7/0x310
                       [<c01270ee>] update_process_times+0x3e/0x70
                       [<c0106c21>] timer_interrupt+0x41/0xa0
                       [<c01466e3>] handle_IRQ_event+0x33/0x80
                       [<c01467e4>] __do_IRQ+0xb4/0x120
                       [<c01057c0>] do_IRQ+0x70/0xc0
       in-softirq-W at:
                       [<c01395da>] lockdep_acquire+0x7a/0xa0
                       [<c0352583>] _spin_lock+0x23/0x30
                       [<c01183e0>] try_to_wake_up+0x30/0x170
                       [<c011854f>] wake_up_process+0xf/0x20
                       [<c0122413>] __do_softirq+0xb3/0xc0
                       [<c01056bb>] do_softirq+0x5b/0xf0
     }
     ... key      at: [<c04c1400>] 0xc04c1400
   ... acquired at:
   [<c01395da>] lockdep_acquire+0x7a/0xa0
   [<c0352583>] _spin_lock+0x23/0x30
   [<c01183e0>] try_to_wake_up+0x30/0x170
   [<c011852b>] default_wake_function+0xb/0x10
   [<c01172d9>] __wake_up_common+0x39/0x70
   [<c011788d>] complete+0x3d/0x60
   [<c01316d4>] kthread+0x84/0xbc
   [<c0101005>] kernel_thread_helper+0x5/0xb

 ... acquired at:
   [<c01395da>] lockdep_acquire+0x7a/0xa0
   [<c03524c0>] _spin_lock_irqsave+0x30/0x50
   [<c011794b>] __wake_up+0x1b/0x50
   [<e1cf6a2e>] ipw_load+0x21e/0xc90 [ipw2200]
   [<e1cf74e8>] ipw_up+0x48/0x520 [ipw2200]
   [<e1cfda87>] ipw_net_init+0x27/0x50 [ipw2200]
   [<c02eeef1>] register_netdevice+0xd1/0x410
   [<c02f0609>] register_netdev+0x59/0x70
   [<e1cfe4d6>] ipw_pci_probe+0x806/0x8a0 [ipw2200]
   [<c023481e>] pci_device_probe+0x5e/0x80
   [<c02a86e4>] driver_probe_device+0x44/0xc0
   [<c02a888b>] __driver_attach+0x9b/0xa0
   [<c02a8039>] bus_for_each_dev+0x49/0x70
   [<c02a8629>] driver_attach+0x19/0x20
   [<c02a7c64>] bus_add_driver+0x74/0x140
   [<c02a8b06>] driver_register+0x56/0x90
   [<c0234a10>] __pci_register_driver+0x50/0x70
   [<e18b302e>] 0xe18b302e
   [<c014034d>] sys_init_module+0xcd/0x1630
   [<c035273b>] sysenter_past_esp+0x54/0x8d

  -> (&rxq->lock){.+..} ops: 40 {
     initial-use  at:
                      [<c01395da>] lockdep_acquire+0x7a/0xa0
                      [<c03524c0>] _spin_lock_irqsave+0x30/0x50
                      [<e1cf66d0>] ipw_rx_queue_replenish+0x20/0x120 [ipw2200]
                      [<e1cf72e0>] ipw_load+0xad0/0xc90 [ipw2200]
                      [<e1cf74e8>] ipw_up+0x48/0x520 [ipw2200]
                      [<e1cfda87>] ipw_net_init+0x27/0x50 [ipw2200]
                      [<c02eeef1>] register_netdevice+0xd1/0x410
                      [<c02f0609>] register_netdev+0x59/0x70
                      [<e1cfe4d6>] ipw_pci_probe+0x806/0x8a0 [ipw2200]
                      [<c023481e>] pci_device_probe+0x5e/0x80
                      [<c02a86e4>] driver_probe_device+0x44/0xc0
                      [<c02a888b>] __driver_attach+0x9b/0xa0
                      [<c02a8039>] bus_for_each_dev+0x49/0x70
                      [<c02a8629>] driver_attach+0x19/0x20
                      [<c02a7c64>] bus_add_driver+0x74/0x140
                      [<c02a8b06>] driver_register+0x56/0x90
                      [<c0234a10>] __pci_register_driver+0x50/0x70
                      [<e18b302e>] 0xe18b302e
                      [<c014034d>] sys_init_module+0xcd/0x1630
                      [<c035273b>] sysenter_past_esp+0x54/0x8d
     in-softirq-W at:
                      [<c01395da>] lockdep_acquire+0x7a/0xa0
                      [<c03524c0>] _spin_lock_irqsave+0x30/0x50
                      [<e1cf25bf>] ipw_rx_queue_restock+0x1f/0x120 [ipw2200]
                      [<e1cf80d1>] ipw_rx+0x631/0x1bb0 [ipw2200]
                      [<e1cfe6ac>] ipw_irq_tasklet+0x13c/0x500 [ipw2200]
                      [<c0121ea0>] tasklet_action+0x40/0x90
                      [<c01223b4>] __do_softirq+0x54/0xc0
                      [<c01056bb>] do_softirq+0x5b/0xf0
   }
   ... key      at: [<e1d0b440>] __key.23915+0x0/0xffff38ee [ipw2200]
    -> (&parent->list_lock){.+..} ops: 17457 {
       initial-use  at:
                       [<c01395da>] lockdep_acquire+0x7a/0xa0
                       [<c0352583>] _spin_lock+0x23/0x30
                       [<c0166437>] cache_alloc_refill+0x87/0x650
                       [<c0166bae>] kmem_cache_zalloc+0xbe/0xd0
                       [<c01672d4>] kmem_cache_create+0x154/0x540
                       [<c0483ad9>] kmem_cache_init+0x179/0x3d0
                       [<c0473638>] start_kernel+0x138/0x330
                       [<c0100199>] 0xc0100199
       in-softirq-W at:
                       [<c01395da>] lockdep_acquire+0x7a/0xa0
                       [<c0352583>] _spin_lock+0x23/0x30
                       [<c0166073>] free_block+0x183/0x190
                       [<c0165bdf>] __cache_free+0x9f/0x120
                       [<c0165da8>] kmem_cache_free+0x88/0xb0
                       [<c0119e21>] free_task+0x21/0x30
                       [<c011b955>] __put_task_struct+0x95/0x156
                       [<c011db12>] delayed_put_task_struct+0x32/0x60
                       [<c012f0c9>] __rcu_process_callbacks+0x69/0x1c0
                       [<c012f232>] rcu_process_callbacks+0x12/0x30
                       [<c0121ea0>] tasklet_action+0x40/0x90
                       [<c01223b4>] __do_softirq+0x54/0xc0
                       [<c01056bb>] do_softirq+0x5b/0xf0
     }
     ... key      at: [<c060d00c>] 0xc060d00c
   ... acquired at:
   [<c01395da>] lockdep_acquire+0x7a/0xa0
   [<c0352583>] _spin_lock+0x23/0x30
   [<c0166437>] cache_alloc_refill+0x87/0x650
   [<c0166ab8>] __kmalloc+0xb8/0xf0
   [<c02eb3cb>] __alloc_skb+0x4b/0x100
   [<e1cf6769>] ipw_rx_queue_replenish+0xb9/0x120 [ipw2200]
   [<e1cf72e0>] ipw_load+0xad0/0xc90 [ipw2200]
   [<e1cf74e8>] ipw_up+0x48/0x520 [ipw2200]
   [<e1cfda87>] ipw_net_init+0x27/0x50 [ipw2200]
   [<c02eeef1>] register_netdevice+0xd1/0x410
   [<c02f0609>] register_netdev+0x59/0x70
   [<e1cfe4d6>] ipw_pci_probe+0x806/0x8a0 [ipw2200]
   [<c023481e>] pci_device_probe+0x5e/0x80
   [<c02a86e4>] driver_probe_device+0x44/0xc0
   [<c02a888b>] __driver_attach+0x9b/0xa0
   [<c02a8039>] bus_for_each_dev+0x49/0x70
   [<c02a8629>] driver_attach+0x19/0x20
   [<c02a7c64>] bus_add_driver+0x74/0x140
   [<c02a8b06>] driver_register+0x56/0x90
   [<c0234a10>] __pci_register_driver+0x50/0x70
   [<e18b302e>] 0xe18b302e
   [<c014034d>] sys_init_module+0xcd/0x1630
   [<c035273b>] sysenter_past_esp+0x54/0x8d

 ... acquired at:
   [<c01395da>] lockdep_acquire+0x7a/0xa0
   [<c03524c0>] _spin_lock_irqsave+0x30/0x50
   [<e1cf25bf>] ipw_rx_queue_restock+0x1f/0x120 [ipw2200]
   [<e1cf80d1>] ipw_rx+0x631/0x1bb0 [ipw2200]
   [<e1cfe6ac>] ipw_irq_tasklet+0x13c/0x500 [ipw2200]
   [<c0121ea0>] tasklet_action+0x40/0x90
   [<c01223b4>] __do_softirq+0x54/0xc0
   [<c01056bb>] do_softirq+0x5b/0xf0

  -> (&ieee->lock){.+..} ops: 15 {
     initial-use  at:
                      [<c01395da>] lockdep_acquire+0x7a/0xa0
                      [<c03524c0>] _spin_lock_irqsave+0x30/0x50
                      [<e1c9d0cf>] ieee80211_process_probe_response+0x1ff/0x790 [ieee80211]
                      [<e1c9d70f>] ieee80211_rx_mgt+0xaf/0x340 [ieee80211]
                      [<e1cf8219>] ipw_rx+0x779/0x1bb0 [ipw2200]
                      [<e1cfe6ac>] ipw_irq_tasklet+0x13c/0x500 [ipw2200]
                      [<c0121ea0>] tasklet_action+0x40/0x90
                      [<c01223b4>] __do_softirq+0x54/0xc0
                      [<c01056bb>] do_softirq+0x5b/0xf0
     in-softirq-W at:
                      [<c01395da>] lockdep_acquire+0x7a/0xa0
                      [<c03524c0>] _spin_lock_irqsave+0x30/0x50
                      [<e1c9d0cf>] ieee80211_process_probe_response+0x1ff/0x790 [ieee80211]
                      [<e1c9d70f>] ieee80211_rx_mgt+0xaf/0x340 [ieee80211]
                      [<e1cf8219>] ipw_rx+0x779/0x1bb0 [ipw2200]
                      [<e1cfe6ac>] ipw_irq_tasklet+0x13c/0x500 [ipw2200]
                      [<c0121ea0>] tasklet_action+0x40/0x90
                      [<c01223b4>] __do_softirq+0x54/0xc0
                      [<c01056bb>] do_softirq+0x5b/0xf0
   }
   ... key      at: [<e1ca2781>] __key.22782+0x0/0xffffdc00 [ieee80211]
 ... acquired at:
   [<c01395da>] lockdep_acquire+0x7a/0xa0
   [<c03524c0>] _spin_lock_irqsave+0x30/0x50
   [<e1c9d0cf>] ieee80211_process_probe_response+0x1ff/0x790 [ieee80211]
   [<e1c9d70f>] ieee80211_rx_mgt+0xaf/0x340 [ieee80211]
   [<e1cf8219>] ipw_rx+0x779/0x1bb0 [ipw2200]
   [<e1cfe6ac>] ipw_irq_tasklet+0x13c/0x500 [ipw2200]
   [<c0121ea0>] tasklet_action+0x40/0x90
   [<c01223b4>] __do_softirq+0x54/0xc0
   [<c01056bb>] do_softirq+0x5b/0xf0

  -> (&cwq->lock){++..} ops: 3739 {
     initial-use  at:
                      [<c01395da>] lockdep_acquire+0x7a/0xa0
                      [<c03524c0>] _spin_lock_irqsave+0x30/0x50
                      [<c012dca8>] __queue_work+0x18/0x70
                      [<c012ddaf>] queue_work+0x6f/0x80
                      [<c012d949>] call_usermodehelper_keys+0x139/0x160
                      [<c0219a2a>] kobject_uevent+0x7a/0x4a0
                      [<c0219753>] kobject_register+0x43/0x50
                      [<c02a7687>] sysdev_register+0x67/0x100
                      [<c02aa950>] register_cpu+0x30/0x70
                      [<c0108f7a>] arch_register_cpu+0x2a/0x30
                      [<c047850a>] topology_init+0xa/0x10
                      [<c010029f>] _stext+0x7f/0x250
                      [<c0101005>] kernel_thread_helper+0x5/0xb
     in-hardirq-W at:
                      [<c01395da>] lockdep_acquire+0x7a/0xa0
                      [<c03524c0>] _spin_lock_irqsave+0x30/0x50
                      [<c012dca8>] __queue_work+0x18/0x70
                      [<c012ddaf>] queue_work+0x6f/0x80
                      [<c0269588>] acpi_os_execute+0xcd/0xe9
                      [<c026eea1>] acpi_ev_gpe_dispatch+0xbc/0x122
                      [<c026f106>] acpi_ev_gpe_detect+0x99/0xe0
                      [<c026d90b>] acpi_ev_sci_xrupt_handler+0x15/0x1d
                      [<c0268c55>] acpi_irq+0xe/0x18
                      [<c01466e3>] handle_IRQ_event+0x33/0x80
                      [<c01467e4>] __do_IRQ+0xb4/0x120
                      [<c01057c0>] do_IRQ+0x70/0xc0
     in-softirq-W at:
                      [<c01395da>] lockdep_acquire+0x7a/0xa0
                      [<c03524c0>] _spin_lock_irqsave+0x30/0x50
                      [<c012dca8>] __queue_work+0x18/0x70
                      [<c012dd30>] delayed_work_timer_fn+0x30/0x40
                      [<c012633e>] run_timer_softirq+0x12e/0x180
                      [<c01223b4>] __do_softirq+0x54/0xc0
                      [<c01056bb>] do_softirq+0x5b/0xf0
   }
   ... key      at: [<c04d4334>] 0xc04d4334
    -> (&q->lock){++..} ops: 33353 {
       initial-use  at:
                       [<c01395da>] lockdep_acquire+0x7a/0xa0
                       [<c0352509>] _spin_lock_irq+0x29/0x40
                       [<c034f084>] wait_for_completion+0x24/0x150
                       [<c013160e>] keventd_create_kthread+0x2e/0x70
                       [<c01315d6>] kthread_create+0xe6/0xf0
                       [<c0121b75>] cpu_callback+0x95/0x110
                       [<c0481194>] spawn_ksoftirqd+0x14/0x30
                       [<c010023c>] _stext+0x1c/0x250
                       [<c0101005>] kernel_thread_helper+0x5/0xb
       in-hardirq-W at:
                       [<c01395da>] lockdep_acquire+0x7a/0xa0
                       [<c03524c0>] _spin_lock_irqsave+0x30/0x50
                       [<c011794b>] __wake_up+0x1b/0x50
                       [<c012dcdd>] __queue_work+0x4d/0x70
                       [<c012ddaf>] queue_work+0x6f/0x80
                       [<c0269588>] acpi_os_execute+0xcd/0xe9
                       [<c026eea1>] acpi_ev_gpe_dispatch+0xbc/0x122
                       [<c026f106>] acpi_ev_gpe_detect+0x99/0xe0
                       [<c026d90b>] acpi_ev_sci_xrupt_handler+0x15/0x1d
                       [<c0268c55>] acpi_irq+0xe/0x18
                       [<c01466e3>] handle_IRQ_event+0x33/0x80
                       [<c01467e4>] __do_IRQ+0xb4/0x120
                       [<c01057c0>] do_IRQ+0x70/0xc0
       in-softirq-W at:
                       [<c01395da>] lockdep_acquire+0x7a/0xa0
                       [<c03524c0>] _spin_lock_irqsave+0x30/0x50
                       [<c011786b>] complete+0x1b/0x60
                       [<c012ef0b>] wakeme_after_rcu+0xb/0x10
                       [<c012f0c9>] __rcu_process_callbacks+0x69/0x1c0
                       [<c012f232>] rcu_process_callbacks+0x12/0x30
                       [<c0121ea0>] tasklet_action+0x40/0x90
                       [<c01223b4>] __do_softirq+0x54/0xc0
                       [<c01056bb>] do_softirq+0x5b/0xf0
     }
     ... key      at: [<c04d47c8>] 0xc04d47c8
      -> (&rq->lock){++..} ops: 68824 {
         initial-use  at:
                        [<c01395da>] lockdep_acquire+0x7a/0xa0
                        [<c03524c0>] _spin_lock_irqsave+0x30/0x50
                        [<c0117bcc>] init_idle+0x4c/0x80
                        [<c0480ad8>] sched_init+0xa8/0xb0
                        [<c0473558>] start_kernel+0x58/0x330
                        [<c0100199>] 0xc0100199
         in-hardirq-W at:
                        [<c01395da>] lockdep_acquire+0x7a/0xa0
                        [<c0352583>] _spin_lock+0x23/0x30
                        [<c0117cc7>] scheduler_tick+0xc7/0x310
                        [<c01270ee>] update_process_times+0x3e/0x70
                        [<c0106c21>] timer_interrupt+0x41/0xa0
                        [<c01466e3>] handle_IRQ_event+0x33/0x80
                        [<c01467e4>] __do_IRQ+0xb4/0x120
                        [<c01057c0>] do_IRQ+0x70/0xc0
         in-softirq-W at:
                        [<c01395da>] lockdep_acquire+0x7a/0xa0
                        [<c0352583>] _spin_lock+0x23/0x30
                        [<c01183e0>] try_to_wake_up+0x30/0x170
                        [<c011854f>] wake_up_process+0xf/0x20
                        [<c0122413>] __do_softirq+0xb3/0xc0
                        [<c01056bb>] do_softirq+0x5b/0xf0
       }
       ... key      at: [<c04c1400>] 0xc04c1400
     ... acquired at:
   [<c01395da>] lockdep_acquire+0x7a/0xa0
   [<c0352583>] _spin_lock+0x23/0x30
   [<c01183e0>] try_to_wake_up+0x30/0x170
   [<c011852b>] default_wake_function+0xb/0x10
   [<c01172d9>] __wake_up_common+0x39/0x70
   [<c011788d>] complete+0x3d/0x60
   [<c01316d4>] kthread+0x84/0xbc
   [<c0101005>] kernel_thread_helper+0x5/0xb

   ... acquired at:
   [<c01395da>] lockdep_acquire+0x7a/0xa0
   [<c03524c0>] _spin_lock_irqsave+0x30/0x50
   [<c011794b>] __wake_up+0x1b/0x50
   [<c012dcdd>] __queue_work+0x4d/0x70
   [<c012ddaf>] queue_work+0x6f/0x80
   [<c012d949>] call_usermodehelper_keys+0x139/0x160
   [<c0219a2a>] kobject_uevent+0x7a/0x4a0
   [<c0219753>] kobject_register+0x43/0x50
   [<c02a7687>] sysdev_register+0x67/0x100
   [<c02aa950>] register_cpu+0x30/0x70
   [<c0108f7a>] arch_register_cpu+0x2a/0x30
   [<c047850a>] topology_init+0xa/0x10
   [<c010029f>] _stext+0x7f/0x250
   [<c0101005>] kernel_thread_helper+0x5/0xb

 ... acquired at:
   [<c01395da>] lockdep_acquire+0x7a/0xa0
   [<c03524c0>] _spin_lock_irqsave+0x30/0x50
   [<c012dca8>] __queue_work+0x18/0x70
   [<c012ddaf>] queue_work+0x6f/0x80
   [<e1cf267e>] ipw_rx_queue_restock+0xde/0x120 [ipw2200]
   [<e1cf80d1>] ipw_rx+0x631/0x1bb0 [ipw2200]
   [<e1cfe6ac>] ipw_irq_tasklet+0x13c/0x500 [ipw2200]
   [<c0121ea0>] tasklet_action+0x40/0x90
   [<c01223b4>] __do_softirq+0x54/0xc0
   [<c01056bb>] do_softirq+0x5b/0xf0

  -> (&base->lock){++..} ops: 8140 {
     initial-use  at:
                      [<c01395da>] lockdep_acquire+0x7a/0xa0
                      [<c03524c0>] _spin_lock_irqsave+0x30/0x50
                      [<c0126e4a>] lock_timer_base+0x3a/0x60
                      [<c0126f17>] __mod_timer+0x37/0xc0
                      [<c0127036>] mod_timer+0x36/0x50
                      [<c048a2e5>] con_init+0x1b5/0x200
                      [<c0489802>] console_init+0x32/0x40
                      [<c04735ea>] start_kernel+0xea/0x330
                      [<c0100199>] 0xc0100199
     in-hardirq-W at:
                      [<c01395da>] lockdep_acquire+0x7a/0xa0
                      [<c03524c0>] _spin_lock_irqsave+0x30/0x50
                      [<c0126e4a>] lock_timer_base+0x3a/0x60
                      [<c0126e9c>] del_timer+0x2c/0x70
                      [<c02bc619>] ide_intr+0x69/0x1f0
                      [<c01466e3>] handle_IRQ_event+0x33/0x80
                      [<c01467e4>] __do_IRQ+0xb4/0x120
                      [<c01057c0>] do_IRQ+0x70/0xc0
     in-softirq-W at:
                      [<c01395da>] lockdep_acquire+0x7a/0xa0
                      [<c0352509>] _spin_lock_irq+0x29/0x40
                      [<c0126239>] run_timer_softirq+0x29/0x180
                      [<c01223b4>] __do_softirq+0x54/0xc0
                      [<c01056bb>] do_softirq+0x5b/0xf0
   }
   ... key      at: [<c04d3af8>] 0xc04d3af8
 ... acquired at:
   [<c01395da>] lockdep_acquire+0x7a/0xa0
   [<c03524c0>] _spin_lock_irqsave+0x30/0x50
   [<c0126e4a>] lock_timer_base+0x3a/0x60
   [<c0126e9c>] del_timer+0x2c/0x70
   [<e1cf83d9>] ipw_rx+0x939/0x1bb0 [ipw2200]
   [<e1cfe6ac>] ipw_irq_tasklet+0x13c/0x500 [ipw2200]
   [<c0121ea0>] tasklet_action+0x40/0x90
   [<c01223b4>] __do_softirq+0x54/0xc0
   [<c01056bb>] do_softirq+0x5b/0xf0


the hard-irq-unsafe lock's dependencies:
-> (nl_table_lock){-.-±} ops: 1585 {
   initial-use  at:
                                       [<c01395da>] lockdep_acquire+0x7a/0xa0
                                       [<c03520da>] _write_lock_bh+0x2a/0x30
                                       [<c03017d2>] netlink_table_grab+0x12/0xe0
                                       [<c0301bcb>] netlink_insert+0x2b/0x180
                                       [<c030307c>] netlink_kernel_create+0xac/0x140
                                       [<c048f29a>] rtnetlink_init+0x6a/0xc0
                                       [<c048f6b9>] netlink_proto_init+0x169/0x180
                                       [<c010029f>] _stext+0x7f/0x250
                                       [<c0101005>] kernel_thread_helper+0x5/0xb
   hardirq-on-W at:
                                       [<c01395da>] lockdep_acquire+0x7a/0xa0
                                       [<c03520da>] _write_lock_bh+0x2a/0x30
                                       [<c03017d2>] netlink_table_grab+0x12/0xe0
                                       [<c0301bcb>] netlink_insert+0x2b/0x180
                                       [<c030307c>] netlink_kernel_create+0xac/0x140
                                       [<c048f29a>] rtnetlink_init+0x6a/0xc0
                                       [<c048f6b9>] netlink_proto_init+0x169/0x180
                                       [<c010029f>] _stext+0x7f/0x250
                                       [<c0101005>] kernel_thread_helper+0x5/0xb
   in-softirq-R at:
                                       [<c01395da>] lockdep_acquire+0x7a/0xa0
                                       [<c0352130>] _read_lock+0x20/0x30
                                       [<c0301efa>] netlink_broadcast+0x7a/0x360
                                       [<c02fb6a4>] wireless_send_event+0x304/0x340
                                       [<e1cf8e11>] ipw_rx+0x1371/0x1bb0 [ipw2200]
                                       [<e1cfe6ac>] ipw_irq_tasklet+0x13c/0x500 [ipw2200]
                                       [<c0121ea0>] tasklet_action+0x40/0x90
                                       [<c01223b4>] __do_softirq+0x54/0xc0
                                       [<c01056bb>] do_softirq+0x5b/0xf0
   softirq-on-R at:
                                       [<c01395da>] lockdep_acquire+0x7a/0xa0
                                       [<c0352130>] _read_lock+0x20/0x30
                                       [<c0301efa>] netlink_broadcast+0x7a/0x360
                                       [<c02199f0>] kobject_uevent+0x40/0x4a0
                                       [<c0219753>] kobject_register+0x43/0x50
                                       [<c02a7687>] sysdev_register+0x67/0x100
                                       [<c02aa950>] register_cpu+0x30/0x70
                                       [<c0108f7a>] arch_register_cpu+0x2a/0x30
                                       [<c047850a>] topology_init+0xa/0x10
                                       [<c010029f>] _stext+0x7f/0x250
                                       [<c0101005>] kernel_thread_helper+0x5/0xb
   hardirq-on-R at:
                                       [<c01395da>] lockdep_acquire+0x7a/0xa0
                                       [<c0352130>] _read_lock+0x20/0x30
                                       [<c0301efa>] netlink_broadcast+0x7a/0x360
                                       [<c02199f0>] kobject_uevent+0x40/0x4a0
                                       [<c0219753>] kobject_register+0x43/0x50
                                       [<c02a7687>] sysdev_register+0x67/0x100
                                       [<c02aa950>] register_cpu+0x30/0x70
                                       [<c0108f7a>] arch_register_cpu+0x2a/0x30
                                       [<c047850a>] topology_init+0xa/0x10
                                       [<c010029f>] _stext+0x7f/0x250
                                       [<c0101005>] kernel_thread_helper+0x5/0xb
 }
 ... key      at: [<c0438908>] 0xc0438908

stack backtrace:
 <c010402d> show_trace+0xd/0x10  <c0104687> dump_stack+0x17/0x20
 <c0137fe3> check_usage+0x263/0x270  <c0138f06> __lockdep_acquire+0xb96/0xd40
 <c01395da> lockdep_acquire+0x7a/0xa0  <c0352130> _read_lock+0x20/0x30
 <c0301efa> netlink_broadcast+0x7a/0x360  <c02fb6a4> wireless_send_event+0x304/0x340
 <e1cf8e11> ipw_rx+0x1371/0x1bb0 [ipw2200]  <e1cfe6ac> ipw_irq_tasklet+0x13c/0x500 [ipw2200]
 <c0121ea0> tasklet_action+0x40/0x90  <c01223b4> __do_softirq+0x54/0xc0
 <c01056bb> do_softirq+0x5b/0xf0 
 =======================
 <c0122455> irq_exit+0x35/0x40  <c01057c7> do_IRQ+0x77/0xc0
 <c0103949> common_interrupt+0x25/0x2c 

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30  6:37   ` Ingo Molnar
@ 2006-05-30  9:25     ` Mike Galbraith
  2006-05-30 10:57       ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Mike Galbraith @ 2006-05-30  9:25 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Arjan van de Ven, Andrew Morton

On Tue, 2006-05-30 at 08:37 +0200, Ingo Molnar wrote:
> * Mike Galbraith <efault@gmx.de> wrote:
> 
> > Darn.  It said all tests passed, then oopsed.
> > 
> > (have .config all gzipped up if you want it)
> 
> yeah, please.

(sent off list)

> > EIP:    0060:[<b103a872>]    Not tainted VLI
> > EFLAGS: 00010083   (2.6.17-rc4-mm3-smp #157)
> > EIP is at count_matching_names+0x5b/0xa2
> 
> > 1151            list_for_each_entry(type, &all_lock_types, lock_entry) {
> > 1152                    if (new_type->key - new_type->subtype == type->key)
> > 1153                            return type->name_version;
> > 1154                    if (!strcmp(type->name, new_type->name))  <--kaboom
> > 1155                            count = max(count, type->name_version);
> 
> hm, while most code (except the one above) is prepared for type->name 
> being NULL, it should not be NULL. Maybe an uninitialized lock slipped 
> through? Please try the patch below - it both protects against 
> type->name being NULL in this place, and will warn if it finds a NULL 
> lockname.

Got the warning.  It failed testing, but booted.

Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
... MAX_LOCKDEP_SUBTYPES:    8
... MAX_LOCK_DEPTH:          30
... MAX_LOCKDEP_KEYS:        2048
... TYPEHASH_SIZE:           1024
... MAX_LOCKDEP_ENTRIES:     8192
... MAX_LOCKDEP_CHAINS:      8192
... CHAINHASH_SIZE:          4096
 memory used by lock dependency info: 696 kB
 per task-struct memory footprint: 1080 bytes
------------------------
| Locking API testsuite:
----------------------------------------------------------------------------
                                 | spin |wlock |rlock |mutex | wsem | rsem |
  --------------------------------------------------------------------------
BUG: warning at kernel/lockdep.c:1977/lockdep_init_map()
 <b1003dd2> show_trace+0xd/0xf  <b10044c0> dump_stack+0x17/0x19
 <b103badf> lockdep_init_map+0x10a/0x10f  <b10398d7> __mutex_init+0x3b/0x44
 <b11d4601> init_type_X+0x37/0x4d  <b11d4638> init_shared_types+0x21/0xaa
 <b11dcca3> locking_selftest+0x76/0x1889  <b1597657> start_kernel+0x1e7/0x400
 <b1000210> 0xb1000210 
                     A-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
                 A-B-B-A deadlock:  ok  |  ok  |FAILED|  ok  |  ok  |  ok  |
             A-B-B-C-C-A deadlock:  ok  |  ok  |FAILED|  ok  |  ok  |  ok  |
             A-B-C-A-B-C deadlock:  ok  |  ok  |FAILED|  ok  |  ok  |  ok  |
         A-B-B-C-C-D-D-A deadlock:  ok  |  ok  |FAILED|  ok  |  ok  |  ok  |
         A-B-C-D-B-D-D-A deadlock:  ok  |  ok  |FAILED|  ok  |  ok  |  ok  |
         A-B-C-D-B-C-D-A deadlock:  ok  |  ok  |FAILED|  ok  |  ok  |  ok  |
                    double unlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
                 bad unlock order:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
  --------------------------------------------------------------------------
              recursive read-lock:             |FAILED|             |  ok  |
  --------------------------------------------------------------------------
                non-nested unlock:FAILED|FAILED|FAILED|FAILED|
  ------------------------------------------------------------
     hard-irqs-on + irq-safe-A/12:  ok  |  ok  |FAILED|
     soft-irqs-on + irq-safe-A/12:  ok  |  ok  |FAILED|
     hard-irqs-on + irq-safe-A/21:  ok  |  ok  |FAILED|
     soft-irqs-on + irq-safe-A/21:  ok  |  ok  |FAILED|
       sirq-safe-A => hirqs-on/12:  ok  |  ok  |FAILED|
       sirq-safe-A => hirqs-on/21:  ok  |  ok  |FAILED|
         hard-safe-A + irqs-on/12:  ok  |  ok  |FAILED|
         soft-safe-A + irqs-on/12:  ok  |  ok  |FAILED|
         hard-safe-A + irqs-on/21:  ok  |  ok  |FAILED|
         soft-safe-A + irqs-on/21:  ok  |  ok  |FAILED|
    hard-safe-A + unsafe-B #1/123:  ok  |  ok  |FAILED|
    soft-safe-A + unsafe-B #1/123:  ok  |  ok  |FAILED|
    hard-safe-A + unsafe-B #1/132:  ok  |  ok  |FAILED|
    soft-safe-A + unsafe-B #1/132:  ok  |  ok  |FAILED|
    hard-safe-A + unsafe-B #1/213:  ok  |  ok  |FAILED|
    soft-safe-A + unsafe-B #1/213:  ok  |  ok  |FAILED|
    hard-safe-A + unsafe-B #1/231:  ok  |  ok  |FAILED|
    soft-safe-A + unsafe-B #1/231:  ok  |  ok  |FAILED|
    hard-safe-A + unsafe-B #1/312:  ok  |  ok  |FAILED|
    soft-safe-A + unsafe-B #1/312:  ok  |  ok  |FAILED|
    hard-safe-A + unsafe-B #1/321:  ok  |  ok  |FAILED|
    soft-safe-A + unsafe-B #1/321:  ok  |  ok  |FAILED|
    hard-safe-A + unsafe-B #2/123:  ok  |  ok  |FAILED|
    soft-safe-A + unsafe-B #2/123:  ok  |  ok  |FAILED|
    hard-safe-A + unsafe-B #2/132:  ok  |  ok  |FAILED|
    soft-safe-A + unsafe-B #2/132:  ok  |  ok  |FAILED|
    hard-safe-A + unsafe-B #2/213:  ok  |  ok  |FAILED|
    soft-safe-A + unsafe-B #2/213:  ok  |  ok  |FAILED|
    hard-safe-A + unsafe-B #2/231:  ok  |  ok  |FAILED|
    soft-safe-A + unsafe-B #2/231:  ok  |  ok  |FAILED|
    hard-safe-A + unsafe-B #2/312:  ok  |  ok  |FAILED|
    soft-safe-A + unsafe-B #2/312:  ok  |  ok  |FAILED|
    hard-safe-A + unsafe-B #2/321:  ok  |  ok  |FAILED|
    soft-safe-A + unsafe-B #2/321:  ok  |  ok  |FAILED|
      hard-irq lock-inversion/123:  ok  |  ok  |FAILED|
      soft-irq lock-inversion/123:  ok  |  ok  |FAILED|
      hard-irq lock-inversion/132:  ok  |  ok  |FAILED|
      soft-irq lock-inversion/132:  ok  |  ok  |FAILED|
      hard-irq lock-inversion/213:  ok  |  ok  |FAILED|
      soft-irq lock-inversion/213:  ok  |  ok  |FAILED|
      hard-irq lock-inversion/231:  ok  |  ok  |FAILED|
      soft-irq lock-inversion/231:  ok  |  ok  |FAILED|
      hard-irq lock-inversion/312:  ok  |  ok  |FAILED|
      soft-irq lock-inversion/312:  ok  |  ok  |FAILED|
      hard-irq lock-inversion/321:  ok  |  ok  |FAILED|
      soft-irq lock-inversion/321:  ok  |  ok  |FAILED|
      hard-irq read-recursion/123:FAILED|
      soft-irq read-recursion/123:FAILED|
      hard-irq read-recursion/132:FAILED|
      soft-irq read-recursion/132:FAILED|
      hard-irq read-recursion/213:FAILED|
      soft-irq read-recursion/213:FAILED|
      hard-irq read-recursion/231:FAILED|
      soft-irq read-recursion/231:FAILED|
      hard-irq read-recursion/312:FAILED|
      soft-irq read-recursion/312:FAILED|
      hard-irq read-recursion/321:FAILED|
      soft-irq read-recursion/321:FAILED|
-----------------------------------------------------------------
BUG:  69 unexpected failures (out of 210) - debugging disabled! |
-----------------------------------------------------------------



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 34/61] lock validator: special locking: bdev
  2006-05-30  1:35   ` Andrew Morton
  2006-05-30  5:13     ` Arjan van de Ven
@ 2006-05-30  9:58     ` Al Viro
  2006-05-30 10:45     ` Arjan van de Ven
  2 siblings, 0 replies; 320+ messages in thread
From: Al Viro @ 2006-05-30  9:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ingo Molnar, linux-kernel, arjan

On Mon, May 29, 2006 at 06:35:23PM -0700, Andrew Morton wrote:
> > +	 * For now, block device ->open() routine must _not_
> > +	 * examine anything in 'inode' argument except ->i_rdev.
> > +	 */
> > +	struct file fake_file = {};
> > +	struct dentry fake_dentry = {};
> > +	fake_file.f_mode = mode;
> > +	fake_file.f_flags = flags;
> > +	fake_file.f_dentry = &fake_dentry;
> > +	fake_dentry.d_inode = bdev->bd_inode;
> > +
> > +	return do_open(bdev, &fake_file, BD_MUTEX_WHOLE);
> > +}
> 
> "crock" is a decent description ;)
> 
> How long will this live, and what will the fix look like?

The comment there is a bit deceptive.  

The real problem is with the stuff ->open() uses.  Short version of the
story:
	* everything uses inode->i_bdev.  Since we always pass an inode
allocated in block_dev.c along with bdev and its ->i_bdev points to that
bdev (i.e. at the constant offset from inode), it doesn't matter whether
we pass struct inode or struct block_device.
	* many things use file->f_mode.  Nobody modifies it.
	* some things use file->f_flags.  Used flags: O_EXCL and O_NDELAY.
Nobody modifies it.
	* one (and only one) weird driver uses something else.  That FPOS
is floppy.c and it needs more detailed description.

floppy.c is _weird_.  In addition to normally used stuff, it checks for
opener having write permissions on file->f_dentry->d_inode.  Then it
modifies file->private_data to store that information and uses it as
permission check in ->ioctl().

The rationale for that crock is a big load of bullshit.  It goes like that:
	We have privileged ioctls and can't allow them unless you have
write permissions.
	We can't ask to just open() the damn thing for write and let these
be done as usual (and check file->f_mode & FMODE_WRITE) because we might want
them on a drive that has no disk in it or a write-protected one.  Opening it
for write would try to check for disk being writable and screw itself.
	Passing O_NDELAY would avoid that problem by skipping the checks
for disk being writable, present, etc., but we can't use that.  Reasons
why we can't?  We don't need no stinkin' reasons!

IOW, *all* of that could be avoided if floppy.c
	* checked FMODE_WRITE for the ability to do privileged ioctls
	* had those who want to issue such ioctls on a drive that might have
no disk in it pass O_NDELAY|O_WRONLY (or O_NDELAY|O_RDWR) when they open
the fscker.  Note that userland code could always have done that -
passing O_NDELAY|O_RDWR will do the right thing with any kernel (see the
sketch below).
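
Sketch of that userland side (illustration only - FDFMTBEG picked as an
example of an ioctl that wants write permission):

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/fd.h>

	int main(void)
	{
		/* O_NDELAY skips the media/write-protect checks, O_RDWR
		 * states write intent, so FMODE_WRITE is what the driver
		 * could (and should) check */
		int fd = open("/dev/fd0", O_RDWR | O_NDELAY);

		if (fd < 0)
			return 1;
		return ioctl(fd, FDFMTBEG);
	}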

That FPOS is the main reason why we pass struct file * there at all *and*
care to have ->f_dentry->d_inode in it (normally that wouldn't even be
looked at).  Again, my preferred solution would be to pass 4-bit flags and
either inode or block_device.  Flags being FMODE_READ, FMODE_WRITE,
O_EXCL and O_NDELAY.

The problem is moronic semantics for ioctl access control in floppy.c,
even though the sane API is _already_ present and always has been.  In
the very same floppy_open()...

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30  9:14 ` Benoit Boissinot
@ 2006-05-30 10:26   ` Arjan van de Ven
  2006-05-30 11:42     ` Benoit Boissinot
  2006-06-01 14:42   ` [patch mm1-rc2] lock validator: netlink.c netlink_table_grab fix Frederik Deweerdt
  1 sibling, 1 reply; 320+ messages in thread
From: Arjan van de Ven @ 2006-05-30 10:26 UTC (permalink / raw)
  To: Benoit Boissinot
  Cc: jketreno, yi.zhu, Andrew Morton, Ingo Molnar, linux-kernel

On Tue, 2006-05-30 at 11:14 +0200, Benoit Boissinot wrote:
> On 5/29/06, Ingo Molnar <mingo@elte.hu> wrote:
> > We are pleased to announce the first release of the "lock dependency
> > correctness validator" kernel debugging feature, which can be downloaded
> > from:
> >
> >   http://redhat.com/~mingo/lockdep-patches/
> > [snip]
> 
> I get this right after ipw2200 is loaded (it is quite verbose, I
> probably shouldn't post everything...)
> 
> ipw2200: Detected Intel PRO/Wireless 2200BG Network Connection
> ipw2200: Detected geography ZZD (13 802.11bg channels, 0 802.11a channels)


>  <c0301efa> netlink_broadcast+0x7a/0x360  

this isn't allowed to be called from IRQ context, because it takes
nl_table_lock for read, but that lock is taken as
        write_lock_bh(&nl_table_lock);
in
	static void netlink_table_grab(void)
so without disabling interrupts; which would thus deadlock if this
read_lock-from-irq were to hit.

>  <c02fb6a4> wireless_send_event+0x304/0x340
>  <e1cf8e11> ipw_rx+0x1371/0x1bb0 [ipw2200] 
>  <e1cfe6ac> ipw_irq_tasklet+0x13c/0x500 [ipw2200]
>  <c0121ea0> tasklet_action+0x40/0x90  

but it's more complex than that, since we ARE in BH context.
The complexity comes from us holding &priv->lock, which is 
used in hard irq context.

so the deadlock is like this:


cpu 0: user context					cpu1: softirq context
   netlink_table_grab takes nl_table_lock as		take priv->lock	in ipw_irq_tasklet
   write_lock_bh, but leaves irqs enabled


   hardirq comes in and the isr tries to take           in ipw_rx, call wireless_send_event which
   priv->lock but has to wait on cpu 1                  tries to take nl_table_lock for read
                                                        but has to wait for cpu0

and... kaboom kabang deadlock :)
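
same thing again, as a (simplified) sketch of the two code paths:

	/* CPU 0, process context, netlink_table_grab(): */
	write_lock_bh(&nl_table_lock);	/* BHs off locally, hard irqs still on */
	/* ... ipw2200 hard irq hits here; ipw_isr() spins on priv->lock,
	 * which CPU 1 is holding ... */

	/* CPU 1, softirq context, ipw_irq_tasklet() -> ipw_rx() ->
	 * wireless_send_event() -> netlink_broadcast(): */
	spin_lock_irqsave(&priv->lock, flags);
	read_lock(&nl_table_lock);	/* spins waiting for CPU 0 to drop the
					 * write lock, which it never will   */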



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 34/61] lock validator: special locking: bdev
  2006-05-30  1:35   ` Andrew Morton
  2006-05-30  5:13     ` Arjan van de Ven
  2006-05-30  9:58     ` Al Viro
@ 2006-05-30 10:45     ` Arjan van de Ven
  2 siblings, 0 replies; 320+ messages in thread
From: Arjan van de Ven @ 2006-05-30 10:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ingo Molnar, linux-kernel

On Mon, 2006-05-29 at 18:35 -0700, Andrew Morton wrote:
> On Mon, 29 May 2006 23:25:54 +0200
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > From: Ingo Molnar <mingo@elte.hu>
> > 
> > teach special (recursive) locking code to the lock validator. Has no
> > effect on non-lockdep kernels.
> > 
> 
> There's no description here of the problem which is being worked around. 
> This leaves everyone in the dark.
> 
> > +static int
> > +blkdev_get_whole(struct block_device *bdev, mode_t mode, unsigned flags)
> > +{
> > +	/*
> > +	 * This crockload is due to bad choice of ->open() type.
> > +	 * It will go away.
> > +	 * For now, block device ->open() routine must _not_
> > +	 * examine anything in 'inode' argument except ->i_rdev.
> > +	 */
> > +	struct file fake_file = {};
> > +	struct dentry fake_dentry = {};
> > +	fake_file.f_mode = mode;
> > +	fake_file.f_flags = flags;
> > +	fake_file.f_dentry = &fake_dentry;
> > +	fake_dentry.d_inode = bdev->bd_inode;
> > +
> > +	return do_open(bdev, &fake_file, BD_MUTEX_WHOLE);
> > +}
> 
> "crock" is a decent description ;)
> 
> How long will this live, and what will the fix look like?

this btw is not new crock; the only new thing is the BD_MUTEX_WHOLE :)


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 03/61] lock validator: sound/oss/emu10k1/midi.c cleanup
  2006-05-30  1:33   ` Andrew Morton
@ 2006-05-30 10:51     ` Takashi Iwai
  2006-05-30 11:03       ` Alexey Dobriyan
  0 siblings, 1 reply; 320+ messages in thread
From: Takashi Iwai @ 2006-05-30 10:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ingo Molnar, linux-kernel, arjan, Jaroslav Kysela

At Mon, 29 May 2006 18:33:17 -0700,
Andrew Morton wrote:
> 
> On Mon, 29 May 2006 23:23:19 +0200
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > move the __attribute outside of the DEFINE_SPINLOCK() section.
> > 
> > Signed-off-by: Ingo Molnar <mingo@elte.hu>
> > Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
> > ---
> >  sound/oss/emu10k1/midi.c |    2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > Index: linux/sound/oss/emu10k1/midi.c
> > ===================================================================
> > --- linux.orig/sound/oss/emu10k1/midi.c
> > +++ linux/sound/oss/emu10k1/midi.c
> > @@ -45,7 +45,7 @@
> >  #include "../sound_config.h"
> >  #endif
> >  
> > -static DEFINE_SPINLOCK(midi_spinlock __attribute((unused)));
> > +static __attribute((unused)) DEFINE_SPINLOCK(midi_spinlock);
> >  
> >  static void init_midi_hdr(struct midi_hdr *midihdr)
> >  {
> 
> I'll tag this as for-mainline-via-alsa.

Acked-by: Takashi Iwai <tiwai@suse.de>


It's OSS stuff, so feel free to push it from your side ;)


thanks,

Takashi

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30  9:25     ` Mike Galbraith
@ 2006-05-30 10:57       ` Ingo Molnar
  0 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-30 10:57 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, Arjan van de Ven, Andrew Morton


* Mike Galbraith <efault@gmx.de> wrote:

> On Tue, 2006-05-30 at 08:37 +0200, Ingo Molnar wrote:
> > * Mike Galbraith <efault@gmx.de> wrote:
> > 
> > > Darn.  It said all tests passed, then oopsed.
> > > 
> > > (have .config all gzipped up if you want it)
> > 
> > yeah, please.
> 
> (sent off list)

thanks, i managed to reproduce the warning with your .config - i'm 
debugging the problem now.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 03/61] lock validator: sound/oss/emu10k1/midi.c cleanup
  2006-05-30 10:51     ` Takashi Iwai
@ 2006-05-30 11:03       ` Alexey Dobriyan
  0 siblings, 0 replies; 320+ messages in thread
From: Alexey Dobriyan @ 2006-05-30 11:03 UTC (permalink / raw)
  To: Takashi Iwai
  Cc: Andrew Morton, Ingo Molnar, linux-kernel, arjan, Jaroslav Kysela

On Tue, May 30, 2006 at 12:51:53PM +0200, Takashi Iwai wrote:
> At Mon, 29 May 2006 18:33:17 -0700,
> Andrew Morton wrote:
> > 
> > On Mon, 29 May 2006 23:23:19 +0200
> > Ingo Molnar <mingo@elte.hu> wrote:
> > 
> > > move the __attribute outside of the DEFINE_SPINLOCK() section.
> > > 
> > > Signed-off-by: Ingo Molnar <mingo@elte.hu>
> > > Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
> > > ---
> > >  sound/oss/emu10k1/midi.c |    2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > Index: linux/sound/oss/emu10k1/midi.c
> > > ===================================================================
> > > --- linux.orig/sound/oss/emu10k1/midi.c
> > > +++ linux/sound/oss/emu10k1/midi.c
> > > @@ -45,7 +45,7 @@
> > >  #include "../sound_config.h"
> > >  #endif
> > >  
> > > -static DEFINE_SPINLOCK(midi_spinlock __attribute((unused)));
> > > +static __attribute((unused)) DEFINE_SPINLOCK(midi_spinlock);
> > >  
> > >  static void init_midi_hdr(struct midi_hdr *midihdr)
> > >  {
> > 
> > I'll tag this as for-mainline-via-alsa.
> 
> Acked-by: Takashi Iwai <tiwai@suse.de>
> 
> 
> It's OSS stuff, so feel free to push it from your side ;)

Why is it marked unused when in fact it's used?

[PATCH] Mark midi_spinlock as used

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---

--- a/sound/oss/emu10k1/midi.c
+++ b/sound/oss/emu10k1/midi.c
@@ -45,7 +45,7 @@
 #include "../sound_config.h"
 #endif
 
-static DEFINE_SPINLOCK(midi_spinlock __attribute((unused)));
+static DEFINE_SPINLOCK(midi_spinlock);
 
 static void init_midi_hdr(struct midi_hdr *midihdr)
 {


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30 10:26   ` Arjan van de Ven
@ 2006-05-30 11:42     ` Benoit Boissinot
  2006-05-30 12:13       ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Benoit Boissinot @ 2006-05-30 11:42 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: jketreno, yi.zhu, Andrew Morton, Ingo Molnar, linux-kernel

On Tue, May 30, 2006 at 12:26:27PM +0200, Arjan van de Ven wrote:
> On Tue, 2006-05-30 at 11:14 +0200, Benoit Boissinot wrote:
> > On 5/29/06, Ingo Molnar <mingo@elte.hu> wrote:
> > > We are pleased to announce the first release of the "lock dependency
> > > correctness validator" kernel debugging feature, which can be downloaded
> > > from:
> > >
> > >   http://redhat.com/~mingo/lockdep-patches/
> > > [snip]
> > 
> > I get this right after ipw2200 is loaded (it is quite verbose, I
> > probably shouldn't post everything...)
> > 
> > ipw2200: Detected Intel PRO/Wireless 2200BG Network Connection
> > ipw2200: Detected geography ZZD (13 802.11bg channels, 0 802.11a channels)
> 
> 
> >  <c0301efa> netlink_broadcast+0x7a/0x360  
> 
> this isn't allowed to be called from IRQ context, because it takes
> nl_table_lock for read, but that lock is taken as
>         write_lock_bh(&nl_table_lock);
> in
> 	static void netlink_table_grab(void)
> i.e. without disabling interrupts; so it would deadlock if this
> read_lock-from-irq hit while the write lock was held.
> 
> >  <c02fb6a4> wireless_send_event+0x304/0x340
> >  <e1cf8e11> ipw_rx+0x1371/0x1bb0 [ipw2200] 
> >  <e1cfe6ac> ipw_irq_tasklet+0x13c/0x500 [ipw2200]
> >  <c0121ea0> tasklet_action+0x40/0x90  
> 
> but it's more complex than that, since we ARE in BH context.
> The complexity comes from us holding &priv->lock, which is 
> used in hard irq context.

It is probably related, but I got this in my log too:

BUG: warning at kernel/softirq.c:86/local_bh_disable()
 <c010402d> show_trace+0xd/0x10  <c0104687> dump_stack+0x17/0x20
 <c0121fdc> local_bh_disable+0x5c/0x70  <c03520f1> _read_lock_bh+0x11/0x30
 <c02e8dce> sock_def_readable+0x1e/0x80  <c0302130> netlink_broadcast+0x2b0/0x360
 <c02fb6a4> wireless_send_event+0x304/0x340  <e1cf8e11> ipw_rx+0x1371/0x1bb0 [ipw2200]
 <e1cfe6ac> ipw_irq_tasklet+0x13c/0x500 [ipw2200] <c0121ea0> tasklet_action+0x40/0x90
 <c01223b4> __do_softirq+0x54/0xc0  <c01056bb> do_softirq+0x5b/0xf0
 =======================
 <c0122455> irq_exit+0x35/0x40  <c01057c7> do_IRQ+0x77/0xc0
 <c0103949> common_interrupt+0x25/0x2c 

> 
> so the deadlock is like this:
> 
> 
> cpu 0: user context					cpu1: softirq context
>    netlink_table_grab takes nl_table_lock as		take priv->lock	in ipw_irq_tasklet
>    write_lock_bh, but leaves irqs enabled
> 
> 
>    hardirq comes in and the isr tries to take           in ipw_rx, call wireless_send_event which
>    priv->lock but has to wait on cpu 1                  tries to take nl_table_lock for read
>                                                         but has to wait for cpu0
> 
> and... kaboom kabang deadlock :)
> 
> 

-- 
powered by bash/screen/(urxvt/fvwm|linux-console)/gentoo/gnu/linux OS

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30 11:42     ` Benoit Boissinot
@ 2006-05-30 12:13       ` Ingo Molnar
  0 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-30 12:13 UTC (permalink / raw)
  To: Benoit Boissinot
  Cc: Arjan van de Ven, jketreno, yi.zhu, Andrew Morton, linux-kernel


* Benoit Boissinot <benoit.boissinot@ens-lyon.org> wrote:

> It is probably related, but I got this in my log too:
> 
> BUG: warning at kernel/softirq.c:86/local_bh_disable()

this one is harmless, you can ignore it. (already sent a patch to remove 
the WARN_ON)

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 61/61] lock validator: enable lock validator in Kconfig
  2006-05-29 21:28 ` [patch 61/61] lock validator: enable lock validator in Kconfig Ingo Molnar
  2006-05-30  1:36   ` Andrew Morton
@ 2006-05-30 13:33   ` Roman Zippel
  2006-06-23 11:01     ` Ingo Molnar
  1 sibling, 1 reply; 320+ messages in thread
From: Roman Zippel @ 2006-05-30 13:33 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Arjan van de Ven, Andrew Morton

Hi,

On Mon, 29 May 2006, Ingo Molnar wrote:

> Index: linux/lib/Kconfig.debug
> ===================================================================
> --- linux.orig/lib/Kconfig.debug
> +++ linux/lib/Kconfig.debug
> @@ -184,6 +184,173 @@ config DEBUG_SPINLOCK
>  	  best used in conjunction with the NMI watchdog so that spinlock
>  	  deadlocks are also debuggable.
>  
> +config PROVE_SPIN_LOCKING
> +	bool "Prove spin-locking correctness"
> +	default y

Could you please keep all the defaults in a separate -mm-only patch, so 
it doesn't get merged?
There are also a number of dependencies on DEBUG_KERNEL missing, which
completely breaks the debugging menu.

> +config LOCKDEP
> +	bool
> +	default y
> +	depends on PROVE_SPIN_LOCKING || PROVE_RW_LOCKING || PROVE_MUTEX_LOCKING || PROVE_RWSEM_LOCKING

This can be written more concisely as:

config LOCKDEP
	def_bool PROVE_SPIN_LOCKING || PROVE_RW_LOCKING || PROVE_MUTEX_LOCKING || PROVE_RWSEM_LOCKING

bye, Roman

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30  5:45       ` Arjan van de Ven
  2006-05-30  6:07         ` Michal Piotrowski
@ 2006-05-30 14:10         ` Dave Jones
  2006-05-30 14:19           ` Arjan van de Ven
  2006-05-30 20:54         ` [patch, -rc5-mm1] lock validator: select KALLSYMS_ALL Ingo Molnar
  2 siblings, 1 reply; 320+ messages in thread
From: Dave Jones @ 2006-05-30 14:10 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, linux-kernel, Michal Piotrowski, Ingo Molnar

On Tue, May 30, 2006 at 07:45:47AM +0200, Arjan van de Ven wrote:

 > One
 > ---
 > store_scaling_governor takes policy->lock and then calls __cpufreq_set_policy
 > __cpufreq_set_policy calls __cpufreq_governor
 > __cpufreq_governor  calls __cpufreq_driver_target via cpufreq_governor_performance
 > __cpufreq_driver_target calls lock_cpu_hotplug() (which takes the hotplug lock)
 > 
 > 
 > Two
 > ---
 > cpufreq_stats_init lock_cpu_hotplug() and then calls cpufreq_stat_cpu_callback
 > cpufreq_stat_cpu_callback calls cpufreq_update_policy
 > cpufreq_update_policy takes the policy->lock
 > 
 > 
 > so this looks like a real honest AB-BA deadlock to me...

This looks a little clearer this morning.  I missed the fact that sys_init_module
isn't completely serialised, only the loading part. ->init routines can and will be
called in parallel.

I don't see where cpufreq_update_policy takes policy->lock though.
In my tree it just takes the per-cpu data->lock.

Time for more wake-up juice? or am I missing something obvious again?

		Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30 14:10         ` Dave Jones
@ 2006-05-30 14:19           ` Arjan van de Ven
  2006-05-30 14:58             ` Dave Jones
  0 siblings, 1 reply; 320+ messages in thread
From: Arjan van de Ven @ 2006-05-30 14:19 UTC (permalink / raw)
  To: Dave Jones; +Cc: Andrew Morton, linux-kernel, Michal Piotrowski, Ingo Molnar

On Tue, 2006-05-30 at 10:10 -0400, Dave Jones wrote:
> On Tue, May 30, 2006 at 07:45:47AM +0200, Arjan van de Ven wrote:
> 
>  > One
>  > ---
>  > store_scaling_governor takes policy->lock and then calls __cpufreq_set_policy
>  > __cpufreq_set_policy calls __cpufreq_governor
>  > __cpufreq_governor  calls __cpufreq_driver_target via cpufreq_governor_performance
>  > __cpufreq_driver_target calls lock_cpu_hotplug() (which takes the hotplug lock)
>  > 
>  > 
>  > Two
>  > ---
>  > cpufreq_stats_init lock_cpu_hotplug() and then calls cpufreq_stat_cpu_callback
>  > cpufreq_stat_cpu_callback calls cpufreq_update_policy
>  > cpufreq_update_policy takes the policy->lock
>  > 
>  > 
>  > so this looks like a real honest AB-BA deadlock to me...
> 
> This looks a little clearer this morning.  I missed the fact that sys_init_module
> isn't completely serialised, only the loading part. ->init routines can and will be
> called in parallel.
> 
> I don't see where cpufreq_update_policy takes policy->lock though.
> In my tree it just takes the per-cpu data->lock.

isn't that basically the same lock?



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30 14:19           ` Arjan van de Ven
@ 2006-05-30 14:58             ` Dave Jones
  2006-05-30 17:11               ` Dominik Brodowski
  0 siblings, 1 reply; 320+ messages in thread
From: Dave Jones @ 2006-05-30 14:58 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, linux-kernel, Michal Piotrowski, Ingo Molnar, linux

On Tue, May 30, 2006 at 04:19:22PM +0200, Arjan van de Ven wrote:

 > >  > One
 > >  > ---
 > >  > store_scaling_governor takes policy->lock and then calls __cpufreq_set_policy
 > >  > __cpufreq_set_policy calls __cpufreq_governor
 > >  > __cpufreq_governor  calls __cpufreq_driver_target via cpufreq_governor_performance
 > >  > __cpufreq_driver_target calls lock_cpu_hotplug() (which takes the hotplug lock)
 > >  > 
 > >  > 
 > >  > Two
 > >  > ---
 > >  > cpufreq_stats_init lock_cpu_hotplug() and then calls cpufreq_stat_cpu_callback
 > >  > cpufreq_stat_cpu_callback calls cpufreq_update_policy
 > >  > cpufreq_update_policy takes the policy->lock
 > >  > 
 > >  > 
 > >  > so this looks like a real honest AB-BA deadlock to me...
 > > 
 > > This looks a little clearer this morning.  I missed the fact that sys_init_module
 > > isn't completely serialised, only the loading part. ->init routines can and will be
 > > called in parallel.
 > > 
 > > I don't see where cpufreq_update_policy takes policy->lock though.
 > > In my tree it just takes the per-cpu data->lock.
 > 
 > isn't that basically the same lock?

Ugh, I've completely forgotten how this stuff fits together.

Dominik, any clues ?

		Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30 14:58             ` Dave Jones
@ 2006-05-30 17:11               ` Dominik Brodowski
  2006-05-30 19:02                 ` Dave Jones
  2006-05-30 19:39                 ` Dave Jones
  0 siblings, 2 replies; 320+ messages in thread
From: Dominik Brodowski @ 2006-05-30 17:11 UTC (permalink / raw)
  To: Dave Jones, Arjan van de Ven, Andrew Morton, linux-kernel,
	Michal Piotrowski, Ingo Molnar, nanhai.zou

Hi,

On Tue, May 30, 2006 at 10:58:52AM -0400, Dave Jones wrote:
> On Tue, May 30, 2006 at 04:19:22PM +0200, Arjan van de Ven wrote:
> 
>  > >  > One
>  > >  > ---
>  > >  > store_scaling_governor takes policy->lock and then calls __cpufreq_set_policy
>  > >  > __cpufreq_set_policy calls __cpufreq_governor
>  > >  > __cpufreq_governor  calls __cpufreq_driver_target via cpufreq_governor_performance
>  > >  > __cpufreq_driver_target calls lock_cpu_hotplug() (which takes the hotplug lock)
>  > >  > 
>  > >  > 
>  > >  > Two
>  > >  > ---
>  > >  > cpufreq_stats_init lock_cpu_hotplug() and then calls cpufreq_stat_cpu_callback
>  > >  > cpufreq_stat_cpu_callback calls cpufreq_update_policy
>  > >  > cpufreq_update_policy takes the policy->lock
>  > >  > 
>  > >  > 
>  > >  > so this looks like a real honest AB-BA deadlock to me...
>  > > 
>  > > This looks a little clearer this morning.  I missed the fact that sys_init_module
>  > > isn't completely serialised, only the loading part. ->init routines can and will be
>  > > called in parallel.
>  > > 
>  > > I don't see where cpufreq_update_policy takes policy->lock though.
>  > > In my tree it just takes the per-cpu data->lock.
>  > 
>  > isn't that basically the same lock?
> 
> Ugh, I've completely forgotten how this stuff fits together.
> 
> Dominik, any clues ?

That's indeed a possible deadlock situation -- what's the
cpufreq_update_policy() call needed for in cpufreq_stat_cpu_callback anyway?

	Dominik

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/61] lock validator: introduce WARN_ON_ONCE(cond)
  2006-05-30  1:33   ` Andrew Morton
@ 2006-05-30 17:38     ` Steven Rostedt
  2006-06-03 18:09       ` Steven Rostedt
  0 siblings, 1 reply; 320+ messages in thread
From: Steven Rostedt @ 2006-05-30 17:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ingo Molnar, linux-kernel, arjan

On Mon, 2006-05-29 at 18:33 -0700, Andrew Morton wrote:
> On Mon, 29 May 2006 23:23:28 +0200
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > add WARN_ON_ONCE(cond) to print once-per-bootup messages.
> > 
> > Signed-off-by: Ingo Molnar <mingo@elte.hu>
> > Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
> > ---
> >  include/asm-generic/bug.h |   13 +++++++++++++
> >  1 file changed, 13 insertions(+)
> > 
> > Index: linux/include/asm-generic/bug.h
> > ===================================================================
> > --- linux.orig/include/asm-generic/bug.h
> > +++ linux/include/asm-generic/bug.h
> > @@ -44,4 +44,17 @@
> >  # define WARN_ON_SMP(x)			do { } while (0)
> >  #endif
> >  
> > +#define WARN_ON_ONCE(condition)				\
> > +({							\
> > +	static int __warn_once = 1;			\
> > +	int __ret = 0;					\
> > +							\
> > +	if (unlikely(__warn_once && (condition))) {	\

Since __warn_once is likely to be true, and the condition is likely to
be false, wouldn't it be better to switch this around to:

  if (unlikely((condition) && __warn_once)) {

So the && will short-circuit before having to check a global variable.

Only after the unlikely condition has triggered would __warn_once ever be false.

-- Steve

> > +		__warn_once = 0;			\
> > +		WARN_ON(1);				\
> > +		__ret = 1;				\
> > +	}						\
> > +	__ret;						\
> > +})
> > +
> >  #endif
> 
> I'll queue this for mainline inclusion.



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 06/61] lock validator: add __module_address() method
  2006-05-30  1:33   ` Andrew Morton
@ 2006-05-30 17:45     ` Steven Rostedt
  2006-06-23  8:38     ` Ingo Molnar
  1 sibling, 0 replies; 320+ messages in thread
From: Steven Rostedt @ 2006-05-30 17:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ingo Molnar, linux-kernel, arjan

On Mon, 2006-05-29 at 18:33 -0700, Andrew Morton wrote:

> 
> I'd suggest that __module_address() should do the same thing, from an API neatness
> POV.  Although perhaps that's mot very useful if we didn't take a ref on the returned
> object (but module_text_address() doesn't either).
> 
> Also, the name's a bit misleading - it sounds like it returns the address
> of a module or something.  __module_any_address() would be better, perhaps?

How about __valid_module_address()  so that it describes exactly what it
is doing. Or __module_address_valid().

-- Steve

> 
> Also, how come this doesn't need modlist_lock()?



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30 17:11               ` Dominik Brodowski
@ 2006-05-30 19:02                 ` Dave Jones
  2006-05-30 19:25                   ` Roland Dreier
  2006-05-30 19:39                 ` Dave Jones
  1 sibling, 1 reply; 320+ messages in thread
From: Dave Jones @ 2006-05-30 19:02 UTC (permalink / raw)
  To: Dominik Brodowski, Arjan van de Ven, Andrew Morton, linux-kernel,
	Michal Piotrowski, Ingo Molnar, nanhai.zou

On Tue, May 30, 2006 at 07:11:18PM +0200, Dominik Brodowski wrote:

 > That's indeed a possible deadlock situation -- what's the
 > cpufreq_update_policy() call needed for in cpufreq_stat_cpu_callback anyway?

I was hoping you could enlighten me :)
I started picking through history with gitk, but my tk install uses
fonts that make my eyes bleed.  My kingdom for a 'git annotate'..

		Dave
-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30 19:02                 ` Dave Jones
@ 2006-05-30 19:25                   ` Roland Dreier
  2006-05-30 19:34                     ` Dave Jones
  2006-05-30 20:41                     ` Ingo Molnar
  0 siblings, 2 replies; 320+ messages in thread
From: Roland Dreier @ 2006-05-30 19:25 UTC (permalink / raw)
  To: Dave Jones
  Cc: Dominik Brodowski, Arjan van de Ven, Andrew Morton, linux-kernel,
	Michal Piotrowski, Ingo Molnar, nanhai.zou

    Dave> I was hoping you could enlighten me :) I started picking
    Dave> through history with gitk, but my tk install uses fonts that
    Dave> make my eyes bleed.  My kingdom for a 'git annotate'..

Heh -- try "git annotate" or "git blame".  I think you need git 1.3.x
for that... details of where to send your kingdom forthcoming...

 - R.

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30 19:25                   ` Roland Dreier
@ 2006-05-30 19:34                     ` Dave Jones
  2006-05-30 20:41                     ` Ingo Molnar
  1 sibling, 0 replies; 320+ messages in thread
From: Dave Jones @ 2006-05-30 19:34 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Dominik Brodowski, Arjan van de Ven, Andrew Morton, linux-kernel,
	Michal Piotrowski, Ingo Molnar, nanhai.zou

On Tue, May 30, 2006 at 12:25:29PM -0700, Roland Dreier wrote:
 >     Dave> I was hoping you could enlighten me :) I started picking
 >     Dave> through history with gitk, but my tk install uses fonts that
 >     Dave> make my eyes bleed.  My kingdom for a 'git annotate'..
 > 
 > Heh -- try "git annotate" or "git blame".  I think you need git 1.3.x
 > for that... details of where to send your kingdom forthcoming...

How on earth did I miss that?  Thanks for the pointer.

		Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30 17:11               ` Dominik Brodowski
  2006-05-30 19:02                 ` Dave Jones
@ 2006-05-30 19:39                 ` Dave Jones
  2006-05-30 19:53                   ` Ashok Raj
  1 sibling, 1 reply; 320+ messages in thread
From: Dave Jones @ 2006-05-30 19:39 UTC (permalink / raw)
  To: Dominik Brodowski, Arjan van de Ven, Andrew Morton, linux-kernel,
	Michal Piotrowski, Ingo Molnar, nanhai.zou, ashok.raj

On Tue, May 30, 2006 at 07:11:18PM +0200, Dominik Brodowski wrote:
 
 > On Tue, May 30, 2006 at 10:58:52AM -0400, Dave Jones wrote:
 > > On Tue, May 30, 2006 at 04:19:22PM +0200, Arjan van de Ven wrote:
 > > 
 > >  > >  > One
 > >  > >  > ---
 > >  > >  > store_scaling_governor takes policy->lock and then calls __cpufreq_set_policy
 > >  > >  > __cpufreq_set_policy calls __cpufreq_governor
 > >  > >  > __cpufreq_governor  calls __cpufreq_driver_target via cpufreq_governor_performance
 > >  > >  > __cpufreq_driver_target calls lock_cpu_hotplug() (which takes the hotplug lock)
 > >  > >  > 
 > >  > >  > 
 > >  > >  > Two
 > >  > >  > ---
 > >  > >  > cpufreq_stats_init lock_cpu_hotplug() and then calls cpufreq_stat_cpu_callback
 > >  > >  > cpufreq_stat_cpu_callback calls cpufreq_update_policy
 > >  > >  > cpufreq_update_policy takes the policy->lock
 > >  > >  > 
 > >  > >  > 
 > >  > >  > so this looks like a real honest AB-BA deadlock to me...
 > >  > > 
 > >  > > This looks a little clearer this morning.  I missed the fact that sys_init_module
 > >  > > isn't completely serialised, only the loading part. ->init routines can and will be
 > >  > > called in parallel.
 > >  > > 
 > >  > > I don't see where cpufreq_update_policy takes policy->lock though.
 > >  > > In my tree it just takes the per-cpu data->lock.
 > >  > 
 > >  > isn't that basically the same lock?
 > > 
 > > Ugh, I've completely forgotten how this stuff fits together.
 > > 
 > > Dominik, any clues ?
 > 
 > That's indeed a possible deadlock situation -- what's the
 > cpufreq_update_policy() call needed for in cpufreq_stat_cpu_callback anyway?

Oh wow. Reading the commit message of this change rings alarm bells.

change c32b6b8e524d2c337767d312814484d9289550cf has this to say..

    [PATCH] create and destroy cpufreq sysfs entries based on cpu notifiers
    
    cpufreq entries in sysfs should only be populated when CPU is online state.
     When we either boot with maxcpus=x and then boot the other cpus by echoing
    to sysfs online file, these entries should be created and destroyed when
    CPU_DEAD is notified.  Same treatement as cache entries under sysfs.
    
    We place the processor in the lowest frequency, so hw managed P-State
    transitions can still work on the other threads to save power.
    
    Primary goal was to just make these directories appear/disapper dynamically.
    
    There is one in this patch i had to do, which i really dont like myself but
    probably best if someone handling the cpufreq infrastructure could give
    this code right treatment if this is not acceptable.  I guess its probably
    good for the first cut.
    
    - Converting lock_cpu_hotplug()/unlock_cpu_hotplug() to disable/enable preempt.
      The locking was smack in the middle of the notification path, when the
      hotplug is already holding the lock. I tried another solution to avoid this
      so avoid taking locks if we know we are from notification path. The solution
      was getting very ugly and i decided this was probably good for this iteration
      until someone who understands cpufreq could do a better job than me.

So, that last part pretty much highlights that we knew about this problem, and meant to
come back and fix it later. Surprise surprise, no one came back and fixed it.

		Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30 19:39                 ` Dave Jones
@ 2006-05-30 19:53                   ` Ashok Raj
  2006-06-01  5:50                     ` Nathan Lynch
  0 siblings, 1 reply; 320+ messages in thread
From: Ashok Raj @ 2006-05-30 19:53 UTC (permalink / raw)
  To: Dave Jones
  Cc: Dominik Brodowski, Arjan van de Ven, Andrew Morton, linux-kernel,
	Michal Piotrowski, Ingo Molnar, nanhai.zou, ashok.raj

On Tue, May 30, 2006 at 03:39:47PM -0400, Dave Jones wrote:

> So, that last part pretty highlights that we knew about this problem, and meant to
> come back and fix it later. Surprise surprise, no one came back and fixed it.
> 

There was another iteration after this one, and currently we keep track of
the owner in lock_cpu_hotplug()->__lock_cpu_hotplug(). So if we are in the
same thread context we don't acquire the lock again.

    if (lock_cpu_hotplug_owner != current) {
        if (interruptible)
            ret = down_interruptible(&cpucontrol);
        else
            down(&cpucontrol);
    }


the lock and unlock paths keep track of the depth as well, so we know when to release it.

We didn't hear any better suggestions (from the cpufreq folks), so we left
it in that state (at least the same thread doesn't try to take the lock
twice, which is what resulted in deadlocks earlier).
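
In other words, the scheme is roughly the following (a simplified sketch
with the interruptible variant and error handling left out; the real code
lives in kernel/cpu.c and differs in detail):

#include <linux/sched.h>
#include <asm/semaphore.h>

static DECLARE_MUTEX(cpucontrol);
static struct task_struct *lock_cpu_hotplug_owner;
static int lock_cpu_hotplug_depth;

static void sketch_lock_cpu_hotplug(void)
{
	/* only take the semaphore if we are not already the owner */
	if (lock_cpu_hotplug_owner != current) {
		down(&cpucontrol);
		lock_cpu_hotplug_owner = current;
	}
	lock_cpu_hotplug_depth++;
}

static void sketch_unlock_cpu_hotplug(void)
{
	/* only really release on the outermost unlock */
	if (--lock_cpu_hotplug_depth == 0) {
		lock_cpu_hotplug_owner = NULL;
		up(&cpucontrol);
	}
}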

-- 
Cheers,
Ashok Raj
- Open Source Technology Center

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30 19:25                   ` Roland Dreier
  2006-05-30 19:34                     ` Dave Jones
@ 2006-05-30 20:41                     ` Ingo Molnar
  2006-05-30 20:44                       ` Ingo Molnar
  2006-05-30 21:58                       ` Paolo Ciarrocchi
  1 sibling, 2 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-30 20:41 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Dave Jones, Dominik Brodowski, Arjan van de Ven, Andrew Morton,
	linux-kernel, Michal Piotrowski, nanhai.zou


* Roland Dreier <rdreier@cisco.com> wrote:

>     Dave> I was hoping you could enlighten me :) I started picking
>     Dave> through history with gitk, but my tk install uses fonts that
>     Dave> make my eyes bleed.  My kingdom for a 'git annotate'..
> 
> Heh -- try "git annotate" or "git blame".  I think you need git 1.3.x 
> for that... details of where to send your kingdom forthcoming...

i use qgit, which is GTK based and thus uses the native desktop fonts.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30 20:41                     ` Ingo Molnar
@ 2006-05-30 20:44                       ` Ingo Molnar
  2006-05-30 21:58                       ` Paolo Ciarrocchi
  1 sibling, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-30 20:44 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Dave Jones, Dominik Brodowski, Arjan van de Ven, Andrew Morton,
	linux-kernel, Michal Piotrowski, nanhai.zou


* Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Roland Dreier <rdreier@cisco.com> wrote:
> 
> >     Dave> I was hoping you could enlighten me :) I started picking
> >     Dave> through history with gitk, but my tk install uses fonts that
> >     Dave> make my eyes bleed.  My kingdom for a 'git annotate'..
> > 
> > Heh -- try "git annotate" or "git blame".  I think you need git 1.3.x 
> > for that... details of where to send your kingdom forthcoming...
> 
> i use qgit, which is GTK based and thus uses the native desktop fonts.

and qgit annotates source files in the background while you are viewing
them, and then you can click on lines to jump to the last commit that
touched them. It doesn't need the latest GIT; qgit has always done this (by itself).

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 37/61] lock validator: special locking: dcache
  2006-05-30  1:35   ` Andrew Morton
@ 2006-05-30 20:51     ` Steven Rostedt
  2006-05-30 21:01       ` Ingo Molnar
  2006-06-23  9:51       ` Ingo Molnar
  0 siblings, 2 replies; 320+ messages in thread
From: Steven Rostedt @ 2006-05-30 20:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ingo Molnar, linux-kernel, arjan

On Mon, 2006-05-29 at 18:35 -0700, Andrew Morton wrote:

> > Index: linux/fs/dcache.c
> > ===================================================================
> > --- linux.orig/fs/dcache.c
> > +++ linux/fs/dcache.c
> > @@ -1380,10 +1380,10 @@ void d_move(struct dentry * dentry, stru
> >  	 */
> >  	if (target < dentry) {
> >  		spin_lock(&target->d_lock);
> > -		spin_lock(&dentry->d_lock);
> > +		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
> >  	} else {
> >  		spin_lock(&dentry->d_lock);
> > -		spin_lock(&target->d_lock);
> > +		spin_lock_nested(&target->d_lock, DENTRY_D_LOCK_NESTED);
> >  	}
> > 
>  

[...]

> > +/*
> > + * dentry->d_lock spinlock nesting types:
> > + *
> > + * 0: normal
> > + * 1: nested
> > + */
> > +enum dentry_d_lock_type
> > +{
> > +	DENTRY_D_LOCK_NORMAL,
> > +	DENTRY_D_LOCK_NESTED
> > +};
> > +
> >  struct dentry_operations {
> >  	int (*d_revalidate)(struct dentry *, struct nameidata *);
> >  	int (*d_hash) (struct dentry *, struct qstr *);
> 
> DENTRY_D_LOCK_NORMAL isn't used anywhere.
> 

I guess it is implied by the normal spin_lock, since
  spin_lock(&target->d_lock) and
  spin_lock_nested(&target->d_lock, DENTRY_D_LOCK_NORMAL)
are equivalent (DENTRY_D_LOCK_NORMAL == 0).

This probably deserves a comment.

-- Steve



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 38/61] lock validator: special locking: i_mutex
  2006-05-29 21:26 ` [patch 38/61] lock validator: special locking: i_mutex Ingo Molnar
@ 2006-05-30 20:53   ` Steven Rostedt
  2006-05-30 21:06     ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Steven Rostedt @ 2006-05-30 20:53 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Arjan van de Ven, Andrew Morton

On Mon, 2006-05-29 at 23:26 +0200, Ingo Molnar wrote:
> + * inode->i_mutex nesting types for the LOCKDEP validator:
> + *
> + * 0: the object of the current VFS operation
> + * 1: parent
> + * 2: child/target
> + */
> +enum inode_i_mutex_lock_type
> +{
> +       I_MUTEX_NORMAL,
> +       I_MUTEX_PARENT,
> +       I_MUTEX_CHILD
> +};
> +
> +/* 

I guess we can say the same about I_MUTEX_NORMAL.

-- Steve



^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch, -rc5-mm1] lock validator: select KALLSYMS_ALL
  2006-05-30  5:45       ` Arjan van de Ven
  2006-05-30  6:07         ` Michal Piotrowski
  2006-05-30 14:10         ` Dave Jones
@ 2006-05-30 20:54         ` Ingo Molnar
  2 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-30 20:54 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Dave Jones, Andrew Morton, linux-kernel, Michal Piotrowski


* Arjan van de Ven <arjan@infradead.org> wrote:

> the reporter doesn't have CONFIG_KALLSYMS_ALL enabled, which sometimes
> gives misleading backtraces (should lockdep just enable
> KALLSYMS_ALL to get more useful bug reports?)

agreed - the patch below does that.

-----------------------
Subject: lock validator: select KALLSYMS_ALL
From: Ingo Molnar <mingo@elte.hu>

all the kernel symbol printouts make a lot more sense if KALLSYMS_ALL
is enabled too - force it on if lockdep is enabled.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 lib/Kconfig.debug |    1 +
 1 file changed, 1 insertion(+)

Index: linux/lib/Kconfig.debug
===================================================================
--- linux.orig/lib/Kconfig.debug
+++ linux/lib/Kconfig.debug
@@ -342,6 +342,7 @@ config LOCKDEP
 	default y
 	select FRAME_POINTER
 	select KALLSYMS
+	select KALLSYMS_ALL
 	depends on PROVE_SPIN_LOCKING || PROVE_RW_LOCKING || PROVE_MUTEX_LOCKING || PROVE_RWSEM_LOCKING
 
 config DEBUG_LOCKDEP

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 37/61] lock validator: special locking: dcache
  2006-05-30 20:51     ` Steven Rostedt
@ 2006-05-30 21:01       ` Ingo Molnar
  2006-06-23  9:51       ` Ingo Molnar
  1 sibling, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-30 21:01 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Andrew Morton, linux-kernel, arjan


* Steven Rostedt <rostedt@goodmis.org> wrote:

> > > +enum dentry_d_lock_type
> > > +{
> > > +	DENTRY_D_LOCK_NORMAL,
> > > +	DENTRY_D_LOCK_NESTED
> > > +};
> > > +
> > >  struct dentry_operations {
> > >  	int (*d_revalidate)(struct dentry *, struct nameidata *);
> > >  	int (*d_hash) (struct dentry *, struct qstr *);
> > 
> > DENTRY_D_LOCK_NORMAL isn't used anywhere.
> 
> I guess it is implied with the normal spin_lock.  Since 
>   spin_lock(&target->d_lock) and
>   spin_lock_nested(&target->d_lock, DENTRY_D_LOCK_NORMAL)
> are equivalent. (DENTRY_D_LOCK_NORMAL == 0)

correct. This is the case for all the subtype enum definitions: 0 means 
normal spinlock [rwlock, rwsem, mutex] API use.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 38/61] lock validator: special locking: i_mutex
  2006-05-30 20:53   ` Steven Rostedt
@ 2006-05-30 21:06     ` Ingo Molnar
  0 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-30 21:06 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: linux-kernel, Arjan van de Ven, Andrew Morton


* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Mon, 2006-05-29 at 23:26 +0200, Ingo Molnar wrote:
> > + * inode->i_mutex nesting types for the LOCKDEP validator:
> > + *
> > + * 0: the object of the current VFS operation
> > + * 1: parent
> > + * 2: child/target
> > + */
> > +enum inode_i_mutex_lock_type
> > +{
> > +       I_MUTEX_NORMAL,
> > +       I_MUTEX_PARENT,
> > +       I_MUTEX_CHILD
> > +};
> > +
> > +/* 
> 
> I guess we can say the same about I_MUTEX_NORMAL.

yeah. Subtypes start from 1, as 0 is the basic type.

Lock types are keyed via static kernel addresses. This means that we can 
use the lock address (for DEFINE_SPINLOCK) or the static key embedded in 
spin_lock_init() as a key in 99% of the cases. The key [struct 
lockdep_type_key, see include/linux/lockdep.h] occupies enough bytes (of 
kernel static virtual memory) so that the keys remain automatically 
unique. Right now MAX_LOCKDEP_SUBTYPES is 8, so the keys take at most 8
bytes. (To save some memory there's another detail: for static locks 
(DEFINE_SPINLOCK ones) we use the lock address itself as the key.)
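
So a nested acquisition of two locks of the same type only needs an
annotation at the inner acquisition site, roughly like this (a simplified
sketch modelled on the d_move() hunk quoted earlier, with made-up struct
and enum names, not code from the patch set):

#include <linux/spinlock.h>

struct foo {
	spinlock_t lock;
};

enum foo_lock_subtype {
	FOO_LOCK_NORMAL,	/* 0: what a plain spin_lock() implies */
	FOO_LOCK_NESTED		/* 1: second lock of the same type */
};

static void foo_transfer(struct foo *a, struct foo *b)
{
	/* order by address so concurrent callers cannot AB-BA each other */
	if (a < b) {
		spin_lock(&a->lock);				/* subtype 0 */
		spin_lock_nested(&b->lock, FOO_LOCK_NESTED);	/* subtype 1 */
	} else {
		spin_lock(&b->lock);
		spin_lock_nested(&a->lock, FOO_LOCK_NESTED);
	}

	/* ... move state from a to b ... */

	spin_unlock(&a->lock);
	spin_unlock(&b->lock);
}

The plain spin_lock() on the first lock is subtype 0, which is why a
constant like DENTRY_D_LOCK_NORMAL never needs to be spelled out.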

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30 20:41                     ` Ingo Molnar
  2006-05-30 20:44                       ` Ingo Molnar
@ 2006-05-30 21:58                       ` Paolo Ciarrocchi
  2006-05-31  8:40                         ` Ingo Molnar
  1 sibling, 1 reply; 320+ messages in thread
From: Paolo Ciarrocchi @ 2006-05-30 21:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Roland Dreier, Dave Jones, Dominik Brodowski, Arjan van de Ven,
	Andrew Morton, linux-kernel, Michal Piotrowski, nanhai.zou

On 5/30/06, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Roland Dreier <rdreier@cisco.com> wrote:
>
> >     Dave> I was hoping you could enlighten me :) I started picking
> >     Dave> through history with gitk, but my tk install uses fonts that
> >     Dave> make my eyes bleed.  My kingdom for a 'git annotate'..
> >
> > Heh -- try "git annotate" or "git blame".  I think you need git 1.3.x
> > for that... details of where to send your kingdom forthcoming...
>
> i use qgit, which is GTK based and thus uses the native desktop fonts.

GTK? A typo, I suppose.
QGit is a git GUI viewer built on Qt/C++ (that I hope will be added to
the git.git tree soon).

Ciao,

-- 
Paolo
http://paolociarrocchi.googlepages.com

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 02/61] lock validator: forcedeth.c fix
  2006-05-30  1:33   ` Andrew Morton
@ 2006-05-31  5:40     ` Manfred Spraul
  0 siblings, 0 replies; 320+ messages in thread
From: Manfred Spraul @ 2006-05-31  5:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ingo Molnar, linux-kernel, arjan, Ayaz Abdulla

Andrew Morton wrote:

>On Mon, 29 May 2006 23:23:13 +0200
>Ingo Molnar <mingo@elte.hu> wrote:
>
>  
>
>>nv_do_nic_poll() is called from timer softirqs, which has interrupts
>>enabled, but np->lock might also be taken by some other interrupt
>>context.
>>    
>>
>
>But the driver does disable_irq(), so I'd say this was a false-positive.
>
>And afaict this is not a timer handler - it's a poll_controller handler
>(although maybe that get called from timer handler somewhere?)
>
>  
>
It's both a timer handler and a poll_controller handler:
- if the interrupt handler causes a system overload (gigabit ethernet
without irq mitigation...), then the driver disables the irq on the device,
waits one tick and handles the interrupts from a timer. This is
nv_do_nic_poll().

- nv_do_nic_poll is also called from the poll_controller handler.

I'll try to remove the disable_irq() calls from the poll_controller
handler, but probably not before the weekend.

--
    Manfred

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30 21:58                       ` Paolo Ciarrocchi
@ 2006-05-31  8:40                         ` Ingo Molnar
  0 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-05-31  8:40 UTC (permalink / raw)
  To: Paolo Ciarrocchi
  Cc: Roland Dreier, Dave Jones, Dominik Brodowski, Arjan van de Ven,
	Andrew Morton, linux-kernel, Michal Piotrowski, nanhai.zou


* Paolo Ciarrocchi <paolo.ciarrocchi@gmail.com> wrote:

> GTK? A typo, I suppose.

brainfart, sorry :)

> QGit is a git GUI viewer built on Qt/C++ (that I hope will be added to 
> the git.git tree soon).

yeah.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30 19:53                   ` Ashok Raj
@ 2006-06-01  5:50                     ` Nathan Lynch
  0 siblings, 0 replies; 320+ messages in thread
From: Nathan Lynch @ 2006-06-01  5:50 UTC (permalink / raw)
  To: Ashok Raj
  Cc: Dave Jones, Dominik Brodowski, Arjan van de Ven, Andrew Morton,
	linux-kernel, Michal Piotrowski, Ingo Molnar, nanhai.zou,
	Zwane Mwaikambo

Ashok Raj wrote:
> On Tue, May 30, 2006 at 03:39:47PM -0400, Dave Jones wrote:
> 
> > So, that last part pretty highlights that we knew about this problem, and meant to
> > come back and fix it later. Surprise surprise, no one came back and fixed it.
> > 
> 
> There was another iteration after this one, and currently we keep track of
> the owner in lock_cpu_hotplug()->__lock_cpu_hotplug(). So if we are in the
> same thread context we don't acquire the lock again.
> 
>     if (lock_cpu_hotplug_owner != current) {
>         if (interruptible)
>             ret = down_interruptible(&cpucontrol);
>         else
>             down(&cpucontrol);
>     }
> 
> 
> the lock and unlock paths keep track of the depth as well, so we know when to release it.

Can we please kill this recursive locking hack in the cpu hotplug code
in 2.6.18/soon?  It's papering over the real problem, and I worry that
if it's allowed to sit there, other users will start to take
"advantage" of it.  Perhaps, at the very least, cpufreq could be made
to handle this itself instead of polluting the core code...


> We didnt hear any better suggestions (from cpufreq folks), so we left it in 
> that state (atlease the same thread doenst try to take the lock twice) 
> that resulted in deadlocks earlier.

Fix (and document!) the ordering of lock acquisitions in cpufreq?

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch mm1-rc2] lock validator: netlink.c netlink_table_grab fix
  2006-05-30  9:14 ` Benoit Boissinot
  2006-05-30 10:26   ` Arjan van de Ven
@ 2006-06-01 14:42   ` Frederik Deweerdt
  2006-06-02  3:10     ` Zhu Yi
  1 sibling, 1 reply; 320+ messages in thread
From: Frederik Deweerdt @ 2006-06-01 14:42 UTC (permalink / raw)
  To: Benoit Boissinot
  Cc: linux-kernel, Ingo Molnar, Arjan van de Ven, Andrew Morton,
	yi.zhu, jketreno

On Tue, May 30, 2006 at 11:14:15AM +0200, Benoit Boissinot wrote:
> On 5/29/06, Ingo Molnar <mingo@elte.hu> wrote:
> >We are pleased to announce the first release of the "lock dependency
> >correctness validator" kernel debugging feature, which can be downloaded
> >from:
> >
> >  http://redhat.com/~mingo/lockdep-patches/
> >[snip]
> 
> I get this right after ipw2200 is loaded (it is quite verbose, I
> probably shouldn't post everything...)
> 
This got rid of the oops for me, is it the right fix?

Signed-off-by: Frederik Deweerdt <frederik.deweerdt@gmail.com>
--- /usr/src/linux/net/netlink/af_netlink.c	2006-05-24 14:58:38.000000000 +0200
+++ net/netlink/af_netlink.c	2006-06-01 16:36:51.000000000 +0200
@@ -157,7 +157,7 @@ static void netlink_sock_destruct(struct
 
 static void netlink_table_grab(void)
 {
-	write_lock_bh(&nl_table_lock);
+	write_lock_irq(&nl_table_lock);
 
 	if (atomic_read(&nl_table_users)) {
 		DECLARE_WAITQUEUE(wait, current);
@@ -167,9 +167,9 @@ static void netlink_table_grab(void)
 			set_current_state(TASK_UNINTERRUPTIBLE);
 			if (atomic_read(&nl_table_users) == 0)
 				break;
-			write_unlock_bh(&nl_table_lock);
+			write_unlock_irq(&nl_table_lock);
 			schedule();
-			write_lock_bh(&nl_table_lock);
+			write_lock_irq(&nl_table_lock);
 		}
 
 		__set_current_state(TASK_RUNNING);
@@ -179,7 +179,7 @@ static void netlink_table_grab(void)
 
 static __inline__ void netlink_table_ungrab(void)
 {
-	write_unlock_bh(&nl_table_lock);
+	write_unlock_irq(&nl_table_lock);
 	wake_up(&nl_table_wait);
 }
 


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch mm1-rc2] lock validator: netlink.c netlink_table_grab fix
  2006-06-01 14:42   ` [patch mm1-rc2] lock validator: netlink.c netlink_table_grab fix Frederik Deweerdt
@ 2006-06-02  3:10     ` Zhu Yi
  2006-06-02  9:53       ` Frederik Deweerdt
  0 siblings, 1 reply; 320+ messages in thread
From: Zhu Yi @ 2006-06-02  3:10 UTC (permalink / raw)
  To: Frederik Deweerdt
  Cc: Benoit Boissinot, linux-kernel, Ingo Molnar, Arjan van de Ven,
	Andrew Morton, jketreno

[-- Attachment #1: Type: text/plain, Size: 227 bytes --]

On Thu, 2006-06-01 at 16:42 +0200, Frederik Deweerdt wrote:
> This got rid of the oops for me, is it the right fix?

I don't think netlink will contend with hardirqs. Can you test with this
fix for ipw2200 driver?

Thanks,
-yi

[-- Attachment #2: ipw2200-lockdep-fix.patch --]
[-- Type: text/x-patch, Size: 1141 bytes --]

diff -urp a/drivers/net/wireless/ipw2200.c b/drivers/net/wireless/ipw2200.c
--- a/drivers/net/wireless/ipw2200.c	2006-04-01 09:47:24.000000000 +0800
+++ b/drivers/net/wireless/ipw2200.c	2006-06-01 14:32:00.000000000 +0800
@@ -11058,11 +11058,9 @@ static irqreturn_t ipw_isr(int irq, void
 	if (!priv)
 		return IRQ_NONE;
 
-	spin_lock(&priv->lock);
-
 	if (!(priv->status & STATUS_INT_ENABLED)) {
 		/* Shared IRQ */
-		goto none;
+		return IRQ_NONE;
 	}
 
 	inta = ipw_read32(priv, IPW_INTA_RW);
@@ -11071,12 +11069,12 @@ static irqreturn_t ipw_isr(int irq, void
 	if (inta == 0xFFFFFFFF) {
 		/* Hardware disappeared */
 		IPW_WARNING("IRQ INTA == 0xFFFFFFFF\n");
-		goto none;
+		return IRQ_NONE;
 	}
 
 	if (!(inta & (IPW_INTA_MASK_ALL & inta_mask))) {
 		/* Shared interrupt */
-		goto none;
+		return IRQ_NONE;
 	}
 
 	/* tell the device to stop sending interrupts */
@@ -11091,12 +11089,7 @@ static irqreturn_t ipw_isr(int irq, void
 
 	tasklet_schedule(&priv->irq_tasklet);
 
-	spin_unlock(&priv->lock);
-
 	return IRQ_HANDLED;
-      none:
-	spin_unlock(&priv->lock);
-	return IRQ_NONE;
 }
 
 static void ipw_rf_kill(void *adapter)

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch mm1-rc2] lock validator: netlink.c netlink_table_grab fix
  2006-06-02  3:10     ` Zhu Yi
@ 2006-06-02  9:53       ` Frederik Deweerdt
  2006-06-05  3:40         ` Zhu Yi
  0 siblings, 1 reply; 320+ messages in thread
From: Frederik Deweerdt @ 2006-06-02  9:53 UTC (permalink / raw)
  To: Zhu Yi
  Cc: Benoit Boissinot, linux-kernel, Ingo Molnar, Arjan van de Ven,
	Andrew Morton, jketreno

On Fri, Jun 02, 2006 at 11:10:10AM +0800, Zhu Yi wrote:
> On Thu, 2006-06-01 at 16:42 +0200, Frederik Deweerdt wrote:
> > This got rid of the oops for me, is it the right fix?
> 
> I don't think netlink will contend with hardirqs. Can you test with this
> fix for ipw2200 driver?
> 
It does work, thanks. But doesn't this add a possibility of missing 
some interrupts?
	cpu0				cpu1
        ====				====
in isr				in tasklet

				ipw_enable_interrupts
				|->priv->status |= STATUS_INT_ENABLED;

ipw_disable_interrupts
|->priv->status &= ~STATUS_INT_ENABLED;
|->ipw_write32(priv, IPW_INTA_MASK_R, ~IPW_INTA_MASK_ALL);

				|->ipw_write32(priv, IPW_INTA_MASK_R, IPW_INTA_MASK_ALL);
				/* This is possible due to priv->lock no longer being taken
				   in isr */

=>interrupt from ipw2200
in new isr
if (!(priv->status & STATUS_INT_ENABLED))
	return IRQ_NONE; /* we wrongfully return here because priv->status
                            does not reflect the register's value */


Not sure this is really important at all, just curious.

Thanks,
Frederik

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/61] lock validator: introduce WARN_ON_ONCE(cond)
  2006-05-30 17:38     ` Steven Rostedt
@ 2006-06-03 18:09       ` Steven Rostedt
  2006-06-04  9:18         ` Arjan van de Ven
  0 siblings, 1 reply; 320+ messages in thread
From: Steven Rostedt @ 2006-06-03 18:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ingo Molnar, linux-kernel, arjan

On Tue, 2006-05-30 at 13:38 -0400, Steven Rostedt wrote:
> On Mon, 2006-05-29 at 18:33 -0700, Andrew Morton wrote:
> > On Mon, 29 May 2006 23:23:28 +0200
> > Ingo Molnar <mingo@elte.hu> wrote:
> > 
> > > add WARN_ON_ONCE(cond) to print once-per-bootup messages.
> > > 
> > > Signed-off-by: Ingo Molnar <mingo@elte.hu>
> > > Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
> > > ---
> > >  include/asm-generic/bug.h |   13 +++++++++++++
> > >  1 file changed, 13 insertions(+)
> > > 
> > > Index: linux/include/asm-generic/bug.h
> > > ===================================================================
> > > --- linux.orig/include/asm-generic/bug.h
> > > +++ linux/include/asm-generic/bug.h
> > > @@ -44,4 +44,17 @@
> > >  # define WARN_ON_SMP(x)			do { } while (0)
> > >  #endif
> > >  
> > > +#define WARN_ON_ONCE(condition)				\
> > > +({							\
> > > +	static int __warn_once = 1;			\
> > > +	int __ret = 0;					\
> > > +							\
> > > +	if (unlikely(__warn_once && (condition))) {	\
> 
> Since __warn_once is likely to be true, and the condition is likely to
> be false, wouldn't it be better to switch this around to:
> 
>   if (unlikely((condition) && __warn_once)) {
> 
> So the && will short-circuit before having to check a global variable.
> 
> Only after the unlikely condition has triggered would __warn_once ever be false.

Hi Ingo,

Not sure if you missed this request or didn't think it mattered, but I
just tried out the difference between the two to see what gcc would do
to a simple function compiled with -O2.

Here's my code:

----- with the current WARN_ON_ONCE ----

#define unlikely(x) __builtin_expect(!!(x), 0)

#define WARN_ON_ONCE(condition)                         \
({                                                      \
        static int __warn_once = 1;                     \
        int __ret = 0;                                  \
                                                        \
        if (__warn_once && unlikely((condition))) {     \
                __warn_once = 0;                        \
                WARN_ON(1);                             \
                __ret = 1;                              \
        }                                               \
        __ret;                                          \
})

int warn (int x)
{
        WARN_ON_ONCE(x==1);
        return x+1;
}


----- with the version I suggest. ----

#define unlikely(x) __builtin_expect(!!(x), 0)

#define WARN_ON_ONCE(condition)                         \
({                                                      \
        static int __warn_once = 1;                     \
        int __ret = 0;                                  \
                                                        \
        if (unlikely((condition)) && __warn_once) {     \
                __warn_once = 0;                        \
                WARN_ON(1);                             \
                __ret = 1;                              \
        }                                               \
        __ret;                                          \
})

int warn(int x)
{
        WARN_ON_ONCE(x==1);
        return x+1;
}

-------


Compiling these two I get this:


current warn.o:

00000000 <warn>:
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   53                      push   %ebx
   4:   83 ec 04                sub    $0x4,%esp
   7:   a1 00 00 00 00          mov    0x0,%eax
   c:   8b 5d 08                mov    0x8(%ebp),%ebx

# here we test the __warn_once first and if it is not zero
# it jumps to warn+0x20 to do the condition test
   f:   85 c0                   test   %eax,%eax
  11:   75 0d                   jne    20 <warn+0x20>
  13:   5a                      pop    %edx
  14:   8d 43 01                lea    0x1(%ebx),%eax
  17:   5b                      pop    %ebx
  18:   5d                      pop    %ebp
  19:   c3                      ret
  1a:   8d b6 00 00 00 00       lea    0x0(%esi),%esi
  20:   83 fb 01                cmp    $0x1,%ebx
  23:   75 ee                   jne    13 <warn+0x13>
  25:   31 c9                   xor    %ecx,%ecx
  27:   89 0d 00 00 00 00       mov    %ecx,0x0
  2d:   c7 04 24 01 00 00 00    movl   $0x1,(%esp)
  34:   e8 fc ff ff ff          call   35 <warn+0x35>
  39:   eb d8                   jmp    13 <warn+0x13>
Disassembly of section .data:


My suggested change of doing the condition first:

00000000 <warn>:
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   53                      push   %ebx
   4:   83 ec 04                sub    $0x4,%esp
   7:   8b 5d 08                mov    0x8(%ebp),%ebx

# here we test the condition first, and if it the
# unlikely condition is true, then we jump to test
# the __warn_once.
   a:   83 fb 01                cmp    $0x1,%ebx
   d:   74 07                   je     16 <warn+0x16>
   f:   5a                      pop    %edx
  10:   8d 43 01                lea    0x1(%ebx),%eax
  13:   5b                      pop    %ebx
  14:   5d                      pop    %ebp
  15:   c3                      ret
  16:   a1 00 00 00 00          mov    0x0,%eax
  1b:   85 c0                   test   %eax,%eax
  1d:   74 f0                   je     f <warn+0xf>
  1f:   31 c9                   xor    %ecx,%ecx
  21:   89 0d 00 00 00 00       mov    %ecx,0x0
  27:   c7 04 24 01 00 00 00    movl   $0x1,(%esp)
  2e:   e8 fc ff ff ff          call   2f <warn+0x2f>
  33:   eb da                   jmp    f <warn+0xf>
Disassembly of section .data:


As you can see, because the whole thing is unlikely, the first condition
is expected to fail.  With the current WARN_ON logic, that means that
the __warn_once is expected to fail, but that's not the case.  So on a
normal system where the WARN_ON_ONCE condition would never happen, you
are always branching.   So simply reversing the order to test the
condition before testing the __warn_once variable should improve cache
performance.

Below is my recommended patch.

-- Steve

Index: linux-2.6.17-rc5-mm2/include/asm-generic/bug.h
===================================================================
--- linux-2.6.17-rc5-mm2.orig/include/asm-generic/bug.h	2006-06-03 14:01:22.000000000 -0400
+++ linux-2.6.17-rc5-mm2/include/asm-generic/bug.h	2006-06-03 14:01:50.000000000 -0400
@@ -43,7 +43,7 @@
 	static int __warn_once = 1;			\
 	int __ret = 0;					\
 							\
-	if (unlikely(__warn_once && (condition))) {	\
+	if (unlikely((condition) && __warn_once)) {	\
 		__warn_once = 0;			\
 		WARN_ON(1);				\
 		__ret = 1;				\



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/61] lock validator: introduce WARN_ON_ONCE(cond)
  2006-06-03 18:09       ` Steven Rostedt
@ 2006-06-04  9:18         ` Arjan van de Ven
  2006-06-04 13:43           ` Steven Rostedt
  0 siblings, 1 reply; 320+ messages in thread
From: Arjan van de Ven @ 2006-06-04  9:18 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Andrew Morton, Ingo Molnar, linux-kernel

On Sat, 2006-06-03 at 14:09 -0400, Steven Rostedt wrote:

> 
> As you can see, because the whole thing is unlikely, the first condition
> is expected to fail.  With the current WARN_ON logic, that means that
> the __warn_once is expected to fail, but that's not the case.  So on a
> normal system where the WARN_ON_ONCE condition would never happen, you
> are always branching. 

which is no cost since it's consistent for the branch predictor

>   So simply reversing the order to test the
> condition before testing the __warn_once variable should improve cache
> performance.
> -	if (unlikely(__warn_once && (condition))) {	\
> +	if (unlikely((condition) && __warn_once)) {	\
>  		__warn_once = 0;			\

I disagree with this; "condition" can be a relatively complex thing,
such as a function call. doing the cheaper (and consistent!) test first
will be better. __warn_once will be branch predicted correctly ALWAYS,
except the exact ONE time you hit the backtrace. So it's really
really cheap to test, and if the WARN_ON_ONCE is triggering a lot after
the first time, you now would have a flapping first condition (which
means lots of branch mispredicts) while the original code has a perfect
one-check-predicted-exit scenario.




^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/61] lock validator: introduce WARN_ON_ONCE(cond)
  2006-06-04  9:18         ` Arjan van de Ven
@ 2006-06-04 13:43           ` Steven Rostedt
  0 siblings, 0 replies; 320+ messages in thread
From: Steven Rostedt @ 2006-06-04 13:43 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Andrew Morton, Ingo Molnar, linux-kernel

On Sun, 2006-06-04 at 11:18 +0200, Arjan van de Ven wrote:
> On Sat, 2006-06-03 at 14:09 -0400, Steven Rostedt wrote:
> 
> > 
> > As you can see, because the whole thing is unlikely, the first condition
> > is expected to fail.  With the current WARN_ON logic, that means that
> > the __warn_once is expected to fail, but that's not the case.  So on a
> > normal system where the WARN_ON_ONCE condition would never happen, you
> > are always branching. 
> 
> which is no cost since it's consistent for the branch predictor
> 
> >   So simply reversing the order to test the
> > condition before testing the __warn_once variable should improve cache
> > performance.
> > -	if (unlikely(__warn_once && (condition))) {	\
> > +	if (unlikely((condition) && __warn_once)) {	\
> >  		__warn_once = 0;			\
> 
> I disagree with this; "condition" can be a relatively complex thing,
> such as a function call. doing the cheaper (and consistent!) test first
> will be better. 

Wrong!  It's not better, because it is pretty much ALWAYS TRUE!  So even
if you have branch prediction you will call the condition regardless!

> __warn_once will be branch predicted correctly ALWAYS,
> except the exact ONE time you turn hit the backtrace. So it's really
> really cheap to test, and if the WARN_ON_ONCE is triggering a lot after
> the first time, you now would have a flapping first condition (which
> means lots of branch mispredicts) while the original code has a perfect
> one-check-predicted-exit scenario.

Who cares?  If the WARN_ON_ONCE _is_ triggered a bunch of times, that
means the kernel is broken.  The WARN_ON is about checking for validity,
and the condition should never trigger on a proper setup.  The ONCE part
is to keep the user's logs from getting full and killing performance with
printk.  And even so, if you have 100 instances of WARN_ON_ONCE in the
kernel, only one at a time would probably trigger, so you save on the
other 99.  Your idea is to optimize the broken kernel while punishing
the working one.

The analysis wasn't only about the code, but also about the use of
WARN_ON_ONCE.  The condition should _not_ be too complex and slow since
the WARN_ON_ONCE is just a check, and not something that should slow the
system down too much.

One other thing that wasn't mentioned.  The __warn_once variable is
global and not set up as __read_mostly (which maybe it should be).  Because
of that it can end up in the same cache line as some global variable that
is modified a lot, so every time you test __warn_once you may need a
cache-coherency transaction with the other CPUs, bringing down performance
further.
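
Putting the two suggestions together (test the condition first, and mark
the once-flag __read_mostly so it does not share a cache line with
write-hot data), a sketch of the macro could look like this - illustrative
only, not the actual -mm code:

	#define WARN_ON_ONCE(condition) ({			\
		static int __warn_once __read_mostly = 1;	\
		int __ret = 0;					\
								\
		if (unlikely((condition) && __warn_once)) {	\
			__warn_once = 0;			\
			WARN_ON(1);				\
			__ret = 1;				\
		}						\
		__ret;						\
	})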

-- Steve


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch mm1-rc2] lock validator: netlink.c netlink_table_grab fix
  2006-06-02  9:53       ` Frederik Deweerdt
@ 2006-06-05  3:40         ` Zhu Yi
  0 siblings, 0 replies; 320+ messages in thread
From: Zhu Yi @ 2006-06-05  3:40 UTC (permalink / raw)
  To: Frederik Deweerdt
  Cc: Benoit Boissinot, linux-kernel, Ingo Molnar, Arjan van de Ven,
	Andrew Morton, jketreno

[-- Attachment #1: Type: text/plain, Size: 1298 bytes --]

On Fri, 2006-06-02 at 09:53 +0000, Frederik Deweerdt wrote:
> On Fri, Jun 02, 2006 at 11:10:10AM +0800, Zhu Yi wrote:
> > On Thu, 2006-06-01 at 16:42 +0200, Frederik Deweerdt wrote:
> > > This got rid of the oops for me, is it the right fix?
> > 
> > I don't think netlink will contend with hardirqs. Can you test with this
> > fix for ipw2200 driver?
> > 
> It does work, thanks. But doesn't this add a possibility of missing 
> some interrupts?
> 	cpu0				cpu1
>         ====				====
> in isr				in tasklet
> 
> 				ipw_enable_interrupts
> 				|->priv->status |= STATUS_INT_ENABLED;

This is unlikely. cpu0 should not receive an ipw2200 interrupt, since the
interrupt is disabled until HERE (see below).

> ipw_disable_interrupts
> |->priv->status &= ~STATUS_INT_ENABLED;
> |->ipw_write32(priv, IPW_INTA_MASK_R, ~IPW_INTA_MASK_ALL);
> 
> 				|->ipw_write32(priv, IPW_INTA_MASK_R, IPW_INTA_MASK_ALL);
> 				/* This is possible due to priv->lock no longer being taken
> 				   in isr */

				HERE


Well, this is not 100% safe if the card fires two consecutive
interrupts. Though unlikely, it's better to protect early than to see
some "weird" bugs one day. I propose the attached patch. If you can help
test it, that would be appreciated (I cannot see the lockdep warning on my
box somehow).

Thanks,
-yi

[-- Attachment #2: lock_irq.patch --]
[-- Type: text/x-patch, Size: 3972 bytes --]

diff -urp a/drivers/net/wireless/ipw2200.c b/drivers/net/wireless/ipw2200.c
--- a/drivers/net/wireless/ipw2200.c	2006-04-01 09:47:24.000000000 +0800
+++ b/drivers/net/wireless/ipw2200.c	2006-06-05 11:32:18.000000000 +0800
@@ -542,7 +542,7 @@ static inline void ipw_clear_bit(struct 
 	ipw_write32(priv, reg, ipw_read32(priv, reg) & ~mask);
 }
 
-static inline void ipw_enable_interrupts(struct ipw_priv *priv)
+static inline void __ipw_enable_interrupts(struct ipw_priv *priv)
 {
 	if (priv->status & STATUS_INT_ENABLED)
 		return;
@@ -550,7 +550,7 @@ static inline void ipw_enable_interrupts
 	ipw_write32(priv, IPW_INTA_MASK_R, IPW_INTA_MASK_ALL);
 }
 
-static inline void ipw_disable_interrupts(struct ipw_priv *priv)
+static inline void __ipw_disable_interrupts(struct ipw_priv *priv)
 {
 	if (!(priv->status & STATUS_INT_ENABLED))
 		return;
@@ -558,6 +558,20 @@ static inline void ipw_disable_interrupt
 	ipw_write32(priv, IPW_INTA_MASK_R, ~IPW_INTA_MASK_ALL);
 }
 
+static inline void ipw_enable_interrupts(struct ipw_priv *priv)
+{
+	spin_lock_irqsave(&priv->irq_lock, priv->lock_flags);
+	__ipw_enable_interrupts(priv);
+	spin_unlock_irqrestore(&priv->irq_lock, priv->lock_flags);
+}
+
+static inline void ipw_disable_interrupts(struct ipw_priv *priv)
+{
+	spin_lock_irqsave(&priv->irq_lock, priv->lock_flags);
+	__ipw_disable_interrupts(priv);
+	spin_unlock_irqrestore(&priv->irq_lock, priv->lock_flags);
+}
+
 #ifdef CONFIG_IPW2200_DEBUG
 static char *ipw_error_desc(u32 val)
 {
@@ -1959,7 +1973,7 @@ static void ipw_irq_tasklet(struct ipw_p
 	unsigned long flags;
 	int rc = 0;
 
-	spin_lock_irqsave(&priv->lock, flags);
+	spin_lock_irqsave(&priv->irq_lock, flags);
 
 	inta = ipw_read32(priv, IPW_INTA_RW);
 	inta_mask = ipw_read32(priv, IPW_INTA_MASK_R);
@@ -1968,6 +1982,10 @@ static void ipw_irq_tasklet(struct ipw_p
 	/* Add any cached INTA values that need to be handled */
 	inta |= priv->isr_inta;
 
+	spin_unlock_irqrestore(&priv->irq_lock, flags);
+
+	spin_lock_irqsave(&priv->lock, flags);
+
 	/* handle all the justifications for the interrupt */
 	if (inta & IPW_INTA_BIT_RX_TRANSFER) {
 		ipw_rx(priv);
@@ -2096,10 +2114,10 @@ static void ipw_irq_tasklet(struct ipw_p
 		IPW_ERROR("Unhandled INTA bits 0x%08x\n", inta & ~handled);
 	}
 
+	spin_unlock_irqrestore(&priv->lock, flags);
+
 	/* enable all interrupts */
 	ipw_enable_interrupts(priv);
-
-	spin_unlock_irqrestore(&priv->lock, flags);
 }
 
 #define IPW_CMD(x) case IPW_CMD_ ## x : return #x
@@ -11058,7 +11076,7 @@ static irqreturn_t ipw_isr(int irq, void
 	if (!priv)
 		return IRQ_NONE;
 
-	spin_lock(&priv->lock);
+	spin_lock(&priv->irq_lock);
 
 	if (!(priv->status & STATUS_INT_ENABLED)) {
 		/* Shared IRQ */
@@ -11080,7 +11098,7 @@ static irqreturn_t ipw_isr(int irq, void
 	}
 
 	/* tell the device to stop sending interrupts */
-	ipw_disable_interrupts(priv);
+	__ipw_disable_interrupts(priv);
 
 	/* ack current interrupts */
 	inta &= (IPW_INTA_MASK_ALL & inta_mask);
@@ -11091,11 +11109,11 @@ static irqreturn_t ipw_isr(int irq, void
 
 	tasklet_schedule(&priv->irq_tasklet);
 
-	spin_unlock(&priv->lock);
+	spin_unlock(&priv->irq_lock);
 
 	return IRQ_HANDLED;
       none:
-	spin_unlock(&priv->lock);
+	spin_unlock(&priv->irq_lock);
 	return IRQ_NONE;
 }
 
@@ -12185,6 +12203,7 @@ static int ipw_pci_probe(struct pci_dev 
 #ifdef CONFIG_IPW2200_DEBUG
 	ipw_debug_level = debug;
 #endif
+	spin_lock_init(&priv->irq_lock);
 	spin_lock_init(&priv->lock);
 	for (i = 0; i < IPW_IBSS_MAC_HASH_SIZE; i++)
 		INIT_LIST_HEAD(&priv->ibss_mac_hash[i]);
diff -urp a/drivers/net/wireless/ipw2200.h b/drivers/net/wireless/ipw2200.h
--- a/drivers/net/wireless/ipw2200.h	2006-04-01 09:47:24.000000000 +0800
+++ b/drivers/net/wireless/ipw2200.h	2006-06-05 11:32:18.000000000 +0800
@@ -1181,6 +1181,8 @@ struct ipw_priv {
 	struct ieee80211_device *ieee;
 
 	spinlock_t lock;
+	spinlock_t irq_lock;
+	unsigned long lock_flags;
 	struct mutex mutex;
 
 	/* basic pci-network driver stuff */

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 06/61] lock validator: add __module_address() method
  2006-05-30  1:33   ` Andrew Morton
  2006-05-30 17:45     ` Steven Rostedt
@ 2006-06-23  8:38     ` Ingo Molnar
  1 sibling, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-06-23  8:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, arjan


* Andrew Morton <akpm@osdl.org> wrote:

> On Mon, 29 May 2006 23:23:33 +0200
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > +/*
> > + * Is this a valid module address? We don't grab the lock.
> > + */
> > +int __module_address(unsigned long addr)
> > +{
> > +	struct module *mod;
> > +
> > +	list_for_each_entry(mod, &modules, list)
> > +		if (within(addr, mod->module_core, mod->core_size))
> > +			return 1;
> > +	return 0;
> > +}
> 
> Returns a boolean.
> 
> >  /* Is this a valid kernel address?  We don't grab the lock: we are oopsing. */
> >  struct module *__module_text_address(unsigned long addr)
> 
> But this returns a module*.
> 
> I'd suggest that __module_address() should do the same thing, from an 
> API neatness POV.  Although perhaps that's not very useful if we 
> didn't take a ref on the returned object (but module_text_address() 
> doesn't either).
> 
> Also, the name's a bit misleading - it sounds like it returns the 
> address of a module or something.  __module_any_address() would be 
> better, perhaps?

yeah. I changed this to __is_module_address().

> Also, how come this doesn't need modlist_lock()?

indeed. I originally avoided taking that lock due to recursion worries - 
but in fact we use this only in sections that initialize a lock - hence 
no recursion problems.

i fixed this and renamed the function to is_module_address() :)
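
for reference, the locked variant would look roughly like the existing
module_text_address() wrapper - a sketch only, assuming the usual
modlist_lock:

	int is_module_address(unsigned long addr)
	{
		unsigned long flags;
		struct module *mod;

		spin_lock_irqsave(&modlist_lock, flags);
		list_for_each_entry(mod, &modules, list) {
			if (within(addr, mod->module_core, mod->core_size)) {
				spin_unlock_irqrestore(&modlist_lock, flags);
				return 1;
			}
		}
		spin_unlock_irqrestore(&modlist_lock, flags);

		return 0;
	}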

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 16/61] lock validator: fown locking workaround
  2006-05-30  1:34   ` Andrew Morton
@ 2006-06-23  9:10     ` Ingo Molnar
  0 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-06-23  9:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, arjan


* Andrew Morton <akpm@osdl.org> wrote:

> On Mon, 29 May 2006 23:24:23 +0200
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > temporary workaround for the lock validator: make all uses of 
> > f_owner.lock irq-safe. (The real solution will be to express to the 
> > lock validator that f_owner.lock rules are to be generated 
> > per-filesystem.)
> 
> This description forgot to tell us what problem is being worked 
> around.

f_owner locking rules are per-filesystem: some of them have this lock 
irq-safe [because they use it in irq-context generated SIGIOs], some of 
them have it irq-unsafe [because they dont generate SIGIOs in irq 
context]. The lock validator meshes them together and produces a false 
positive. The workaround changed all uses of f_owner.lock to be 
irq-safe.

> This patch is a bit of a show-stopper.  How hard-n-bad is the real 
> fix?

the real fix would be to correctly map the 'key' of the f_owner.lock to 
the filesystem. I.e. to embed a "lockdep_type_key s_fown_key" in 
'struct file_system_type', and to use that key when initializing 
f_owner.lock.

the practical problem is that the initialization site of f_owner.lock 
does not know which filesystem this file will belong to.
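
in rough pseudo-C the idea would look something like this (the helper and 
rwlock_init_key() are hypothetical, modelled on the spin_lock_init_key() 
style used elsewhere in this series):

	/* hypothetical sketch: one f_owner.lock class per filesystem type */
	static void file_set_fown_lock_class(struct file *filp)
	{
		struct file_system_type *type = filp->f_dentry->d_sb->s_type;

		/* assumes a 'struct lockdep_type_key s_fown_key' member
		   added to struct file_system_type: */
		rwlock_init_key(&filp->f_owner.lock, &type->s_fown_key);
	}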

there might be another way though: the only non-core user of f_owner.lock 
is CIFS, and that use of f_owner.lock seems unnecessary - it does not 
change any fowner state, and its justification for taking that lock 
seems rather vague as well:

 *  GlobalSMBSesLock protects:
 *      list operations on tcp and SMB session lists and tCon lists
 *  f_owner.lock protects certain per file struct operations

maybe CIFS or VFS people could comment?

that way you could remove the following patch from -mm:

   lock-validator-fown-locking-workaround.patch

and add the patch below. (the fcntl.c portion of the above patch is 
meanwhile moot)

	Ingo

--------------------------------------
Subject: CIFS: remove f_owner.lock use
From: Ingo Molnar <mingo@elte.hu>

CIFS takes/releases f_owner.lock - why? It does not change anything
in the fowner state. Remove this locking.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 fs/cifs/file.c |    9 ---------
 1 file changed, 9 deletions(-)

Index: linux/fs/cifs/file.c
===================================================================
--- linux.orig/fs/cifs/file.c
+++ linux/fs/cifs/file.c
@@ -110,7 +110,6 @@ static inline int cifs_open_inode_helper
 			 &pCifsInode->openFileList);
 	}
 	write_unlock(&GlobalSMBSeslock);
-	write_unlock(&file->f_owner.lock);
 	if (pCifsInode->clientCanCacheRead) {
 		/* we have the inode open somewhere else
 		   no need to discard cache data */
@@ -287,7 +286,6 @@ int cifs_open(struct inode *inode, struc
 		goto out;
 	}
 	pCifsFile = cifs_init_private(file->private_data, inode, file, netfid);
-	write_lock(&file->f_owner.lock);
 	write_lock(&GlobalSMBSeslock);
 	list_add(&pCifsFile->tlist, &pTcon->openFileList);
 
@@ -298,7 +296,6 @@ int cifs_open(struct inode *inode, struc
 					    &oplock, buf, full_path, xid);
 	} else {
 		write_unlock(&GlobalSMBSeslock);
-		write_unlock(&file->f_owner.lock);
 	}
 
 	if (oplock & CIFS_CREATE_ACTION) {           
@@ -477,7 +474,6 @@ int cifs_close(struct inode *inode, stru
 	pTcon = cifs_sb->tcon;
 	if (pSMBFile) {
 		pSMBFile->closePend = TRUE;
-		write_lock(&file->f_owner.lock);
 		if (pTcon) {
 			/* no sense reconnecting to close a file that is
 			   already closed */
@@ -492,23 +488,18 @@ int cifs_close(struct inode *inode, stru
 					the struct would be in each open file,
 					but this should give enough time to 
 					clear the socket */
-					write_unlock(&file->f_owner.lock);
 					cERROR(1,("close with pending writes"));
 					msleep(timeout);
-					write_lock(&file->f_owner.lock);
 					timeout *= 4;
 				} 
-				write_unlock(&file->f_owner.lock);
 				rc = CIFSSMBClose(xid, pTcon,
 						  pSMBFile->netfid);
-				write_lock(&file->f_owner.lock);
 			}
 		}
 		write_lock(&GlobalSMBSeslock);
 		list_del(&pSMBFile->flist);
 		list_del(&pSMBFile->tlist);
 		write_unlock(&GlobalSMBSeslock);
-		write_unlock(&file->f_owner.lock);
 		kfree(pSMBFile->search_resume_name);
 		kfree(file->private_data);
 		file->private_data = NULL;

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 17/61] lock validator: sk_callback_lock workaround
  2006-05-30  1:34   ` Andrew Morton
@ 2006-06-23  9:19     ` Ingo Molnar
  0 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-06-23  9:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, arjan


* Andrew Morton <akpm@osdl.org> wrote:

> On Mon, 29 May 2006 23:24:27 +0200
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > temporary workaround for the lock validator: make all uses of 
> > sk_callback_lock softirq-safe. (The real solution will be to express 
> > to the lock validator that sk_callback_lock rules are to be 
> > generated per-address-family.)
> 
> Ditto.  What's the actual problem being worked around here, and how's 
> the real fix shaping up?

this patch should be moot meanwhile. Earlier versions of the lock 
validator produced false positives for certain read-locking constructs.

i have undone the patch:

  lock-validator-sk_callback_lock-workaround.patch

and there don't seem to be any false positives popping up. Please don't 
remove it from -mm yet, i'll test this some more and will do the removal 
in the lock validator queue refactoring, ok?

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 21/61] lock validator: lockdep: add local_irq_enable_in_hardirq() API.
  2006-05-30  1:34   ` Andrew Morton
@ 2006-06-23  9:28     ` Ingo Molnar
  2006-06-23  9:52       ` Andrew Morton
  0 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-06-23  9:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, arjan


* Andrew Morton <akpm@osdl.org> wrote:

> On Mon, 29 May 2006 23:24:52 +0200
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > introduce local_irq_enable_in_hardirq() API. It is currently
> > aliased to local_irq_enable(), hence has no functional effects.
> > 
> > This API will be used by lockdep, but even without lockdep
> > this will better document places in the kernel where a hardirq
> > context enables hardirqs.
> 
> If we expect people to use this then we'd best whack a comment over 
> it.

ok, i've improved the comment in trace_irqflags.h.

> Also, trace_irqflags.h doesn't seem an appropriate place for it to 
> live.

seems like the most practical place for it. Previously we had no central 
include file for irq-flags APIs (they used to be included from 
asm/system.h and other random per-arch places) - trace_irqflags.h has 
become the central file now. Should i rename it to irqflags.h perhaps, 
to not tie it to tracing? We have some deprecated irq-flags ops in 
interrupt.h, maybe this all belongs there. (although i think it's 
cleaner to have linux/include/irqflags.h and include it from 
interrupt.h)

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 22/61] lock validator:  add per_cpu_offset()
  2006-05-30  1:34   ` Andrew Morton
@ 2006-06-23  9:30     ` Ingo Molnar
  0 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-06-23  9:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, arjan, Luck, Tony, Benjamin Herrenschmidt,
	Paul Mackerras, Martin Schwidefsky, David S. Miller


* Andrew Morton <akpm@osdl.org> wrote:

> > +#define per_cpu_offset(x) (__per_cpu_offset(x))
> > +
> >  /* Separate out the type, so (int[3], foo) works. */
> >  #define DEFINE_PER_CPU(type, name) \
> >      __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name
> 
> I can tell just looking at it that it'll break various builds.  I assume 
> that things still happen to compile because you're presently using it 
> in code which those architectures don't presently compile.
> 
> But introducing a "generic" function invites others to start using it.  
> And they will, and they'll ship code which "works" but is broken, 
> because they only tested it on x86 and x86_64.
> 
> I'll queue the needed fixups - please check it.

[belated reply] They look good.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/61] ANNOUNCE: lock validator -V1
  2006-05-30  1:35 ` Andrew Morton
@ 2006-06-23  9:41   ` Ingo Molnar
  0 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-06-23  9:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, arjan


* Andrew Morton <akpm@osdl.org> wrote:

> > We are pleased to announce the first release of the "lock dependency 
> > correctness validator" kernel debugging feature
> 
> What are the runtime speed and space costs of enabling this?

The RAM space costs are estimated in the bootup info printout:

 ... MAX_LOCKDEP_SUBTYPES:    8
 ... MAX_LOCK_DEPTH:          30
 ... MAX_LOCKDEP_KEYS:        2048
 ... TYPEHASH_SIZE:           1024
 ... MAX_LOCKDEP_ENTRIES:     8192
 ... MAX_LOCKDEP_CHAINS:      8192
 ... CHAINHASH_SIZE:          4096
  memory used by lock dependency info: 696 kB
  per task-struct memory footprint: 1200 bytes

Plus every lock now embeds the lock_map structure, which is 10 pointers. 
That is the biggest direct dynamic RAM cost.

There are also a few embedded keys in .data but they are small.

The .text overhead mostly comes from the subsystem itself - which is 
around 20K of .text. The callbacks are not inlined most of the time - 
there are about 200 of them right now, which should be another +1-2K of 
.text cost.

The runtime cycle cost is significant if CONFIG_DEBUG_LOCKDEP [lock 
validator self-consistency checks] is enabled - then we take a global 
lock from every lock operation which kills scalability.

If DEBUG_LOCKDEP is disabled then it's OK - smaller than DEBUG_SLAB. In 
this case we have the lock-stack maintenance overhead, the irq-trace 
callbacks and a lockless hash-lookup per lock operation. All of that 
overhead is O(1) and lockless, so it shouldn't change fundamental 
characteristics anywhere.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 36/61] lock validator: special locking: serial
  2006-05-30  1:35   ` Andrew Morton
@ 2006-06-23  9:49     ` Ingo Molnar
  2006-06-23 10:04       ` Andrew Morton
  0 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-06-23  9:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, arjan, Russell King


* Andrew Morton <akpm@osdl.org> wrote:

> > +/*
> > + * lockdep: port->lock is initialized in two places, but we
> > + *          want only one lock-type:
> > + */
> > +static struct lockdep_type_key port_lock_key;
> > +
> >  /**
> >   *	uart_set_options - setup the serial console parameters
> >   *	@port: pointer to the serial ports uart_port structure
> > @@ -1869,7 +1875,7 @@ uart_set_options(struct uart_port *port,
> >  	 * Ensure that the serial console lock is initialised
> >  	 * early.
> >  	 */
> > -	spin_lock_init(&port->lock);
> > +	spin_lock_init_key(&port->lock, &port_lock_key);
> >  
> >  	memset(&termios, 0, sizeof(struct termios));
> >  
> > @@ -2255,7 +2261,7 @@ int uart_add_one_port(struct uart_driver
> >  	 * initialised.
> >  	 */
> >  	if (!(uart_console(port) && (port->cons->flags & CON_ENABLED)))
> > -		spin_lock_init(&port->lock);
> > +		spin_lock_init_key(&port->lock, &port_lock_key);
> >  
> >  	uart_configure_port(drv, state, port);
> >  
> 
> Is there a cleaner way of doing this?
> 
> Perhaps write a new helper function which initialises the spinlock, 
> call that?  Rather than open-coding lockdep stuff?

yes, we can do that too - but that would have an effect on non-lockdep 
kernels too.

Also, the initialization of the 'port' seems a bit twisted here: both 
already-initialized and not-yet-initialized ports can be passed in to 
uart_add_one_port(). So i did not want to touch the structure of the 
code - hence the open-coded solution.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 37/61] lock validator: special locking: dcache
  2006-05-30 20:51     ` Steven Rostedt
  2006-05-30 21:01       ` Ingo Molnar
@ 2006-06-23  9:51       ` Ingo Molnar
  1 sibling, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-06-23  9:51 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Andrew Morton, linux-kernel, arjan


* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Mon, 2006-05-29 at 18:35 -0700, Andrew Morton wrote:

> > DENTRY_D_LOCK_NORMAL isn't used anywhere.
> 
> I guess it is implied with the normal spin_lock.  Since 
>   spin_lock(&target->d_lock) and
>   spin_lock_nested(&target->d_lock, DENTRY_D_LOCK_NORMAL)
> are equivalent. (DENTRY_D_LOCK_NORMAL == 0)
> 
> Probably this deserves a comment.

i have added a comment to dcache.h explaining this better.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 21/61] lock validator: lockdep: add local_irq_enable_in_hardirq() API.
  2006-06-23  9:28     ` Ingo Molnar
@ 2006-06-23  9:52       ` Andrew Morton
  2006-06-23 10:20         ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Andrew Morton @ 2006-06-23  9:52 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan

On Fri, 23 Jun 2006 11:28:52 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Andrew Morton <akpm@osdl.org> wrote:
> 
> > On Mon, 29 May 2006 23:24:52 +0200
> > Ingo Molnar <mingo@elte.hu> wrote:
> > 
> > > introduce local_irq_enable_in_hardirq() API. It is currently
> > > aliased to local_irq_enable(), hence has no functional effects.
> > > 
> > > This API will be used by lockdep, but even without lockdep
> > > this will better document places in the kernel where a hardirq
> > > context enables hardirqs.
> > 
> > If we expect people to use this then we'd best whack a comment over 
> > it.
> 
> ok, i've improved the comment in trace_irqflags.h.
> 
> > Also, trace_irqflags.h doesn't seem an appropriate place for it to 
> > live.
> 
> seems like the most practical place for it. Previously we had no central 
> include file for irq-flags APIs (they used to be included from 
> asm/system.h and other random per-arch places) - trace_irqflags.h has 
> become the central file now. Should i rename it to irqflags.h perhaps, 
> to not tie it to tracing? We have some deprecated irq-flags ops in 
> interrupt.h, maybe this all belongs there. (although i think it's 
> cleaner to have linux/include/irqflags.h and include it from 
> interrupt.h)
> 

Yes, irqflags.h is nice.

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 46/61] lock validator: special locking: slab
  2006-05-30  1:35   ` Andrew Morton
@ 2006-06-23  9:54     ` Ingo Molnar
  0 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-06-23  9:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, arjan


* Andrew Morton <akpm@osdl.org> wrote:

> On Mon, 29 May 2006 23:26:49 +0200
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > +		/*
> > +		 * Do not assume that spinlocks can be initialized via memcpy:
> > +		 */
> 
> I'd view that as something which should be fixed in mainline.

yeah. I got bitten by this (read: pulled hair for hours) when converting 
the slab spinlocks to rtmutexes in the -rt tree.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 50/61] lock validator: special locking: hrtimer.c
  2006-05-30  1:35   ` Andrew Morton
@ 2006-06-23 10:04     ` Ingo Molnar
  2006-06-23 10:38       ` Andrew Morton
  0 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-06-23 10:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, arjan


* Andrew Morton <akpm@osdl.org> wrote:

> >  	for (i = 0; i < MAX_HRTIMER_BASES; i++, base++)
> > -		spin_lock_init(&base->lock);
> > +		spin_lock_init_static(&base->lock);
> >  }
> >  
> 
> Perhaps the validator core's implementation of spin_lock_init() could 
> look at the address and work out if it's within the static storage 
> sections.

yeah, but there are two cases: places where we want to 'unify' array 
locks into a single type, and places where we want to treat them 
separately. The case where we 'unify' is the more common one: locks 
embedded into hash-tables for example. So i went for annotating the ones 
that are rarer. There are 2 right now: scheduler, hrtimers, with the 
hrtimers one going away in the high-res-timers implementation. (we 
unified the hrtimers locks into a per-CPU lock) (there's also a kgdb 
annotation for -mm)

perhaps the naming should be clearer? I had it named 
spin_lock_init_standalone() originally, then cleaned it up to be 
spin_lock_init_static(). Maybe the original name is better?

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 36/61] lock validator: special locking: serial
  2006-06-23  9:49     ` Ingo Molnar
@ 2006-06-23 10:04       ` Andrew Morton
  2006-06-23 10:18         ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Andrew Morton @ 2006-06-23 10:04 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan, rmk

On Fri, 23 Jun 2006 11:49:41 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Andrew Morton <akpm@osdl.org> wrote:
> 
> > > +/*
> > > + * lockdep: port->lock is initialized in two places, but we
> > > + *          want only one lock-type:
> > > + */
> > > +static struct lockdep_type_key port_lock_key;
> > > +
> > >  /**
> > >   *	uart_set_options - setup the serial console parameters
> > >   *	@port: pointer to the serial ports uart_port structure
> > > @@ -1869,7 +1875,7 @@ uart_set_options(struct uart_port *port,
> > >  	 * Ensure that the serial console lock is initialised
> > >  	 * early.
> > >  	 */
> > > -	spin_lock_init(&port->lock);
> > > +	spin_lock_init_key(&port->lock, &port_lock_key);
> > >  
> > >  	memset(&termios, 0, sizeof(struct termios));
> > >  
> > > @@ -2255,7 +2261,7 @@ int uart_add_one_port(struct uart_driver
> > >  	 * initialised.
> > >  	 */
> > >  	if (!(uart_console(port) && (port->cons->flags & CON_ENABLED)))
> > > -		spin_lock_init(&port->lock);
> > > +		spin_lock_init_key(&port->lock, &port_lock_key);
> > >  
> > >  	uart_configure_port(drv, state, port);
> > >  
> > 
> > Is there a cleaner way of doing this?
> > 
> > Perhaps write a new helper function which initialises the spinlock, 
> > call that?  Rather than open-coding lockdep stuff?
> 
> yes, we can do that too - but that would have an effect to non-lockdep 
> kernels too.
> 
> Also, the initialization of the 'port' seems a bit twisted here, already 
> initialized and not-yet-initialized ports can be passed in to 
> uard_add_one_port(). So i did not want to touch the structure of the 
> code - hence the open-coded solution.
> 

btw, I was looking at this change:

diff -puN drivers/scsi/libata-core.c~lock-validator-locking-init-debugging-improvement drivers/scsi/libata-core.c
--- a/drivers/scsi/libata-core.c~lock-validator-locking-init-debugging-improvement
+++ a/drivers/scsi/libata-core.c
@@ -1003,6 +1003,7 @@ unsigned ata_exec_internal(struct ata_de
 	unsigned int err_mask;
 	int rc;
 
+	init_completion(&wait);
 	spin_lock_irqsave(ap->lock, flags);
 
 	/* no internal command while frozen */

That local was already initialised with DECLARE_COMPLETION().  Am surprised that
an init_completion() also was needed?


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 51/61] lock validator: special locking: sock_lock_init()
  2006-05-30  1:36   ` Andrew Morton
@ 2006-06-23 10:06     ` Ingo Molnar
  0 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-06-23 10:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, arjan, David S. Miller


* Andrew Morton <akpm@osdl.org> wrote:

> > +/*
> > + * Each address family might have different locking rules, so we have
> > + * one slock key per address family:
> > + */
> > +static struct lockdep_type_key af_family_keys[AF_MAX];
> > +
> > +static void noinline sock_lock_init(struct sock *sk)
> > +{
> > +	spin_lock_init_key(&sk->sk_lock.slock, af_family_keys + sk->sk_family);
> > +	sk->sk_lock.owner = NULL;
> > +	init_waitqueue_head(&sk->sk_lock.wq);
> > +}
> 
> OK, no code outside net/core/sock.c uses sock_lock_init().

yeah.

> Hopefully the same is true of out-of-tree code...

it wont go unnoticed even if it does: we'll get a nonfatal lockdep 
message and fix it up. I dont expect out-of-tree code to mess with 
sk_lock.slock though ...

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 52/61] lock validator: special locking: af_unix
  2006-05-30  1:36   ` Andrew Morton
@ 2006-06-23 10:07     ` Ingo Molnar
  0 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-06-23 10:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, arjan, David S. Miller


* Andrew Morton <akpm@osdl.org> wrote:

> > -			spin_lock(&sk->sk_receive_queue.lock);
> > +			spin_lock_bh(&sk->sk_receive_queue.lock);
> 
> Again, a bit of a show-stopper.  Will the real fix be far off?

ok, this should be solved in recent -mm, via:

 lock-validator-special-locking-af_unix-undo-af_unix-_bh-locking-changes-and-split-lock-type.patch

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 36/61] lock validator: special locking: serial
  2006-06-23 10:04       ` Andrew Morton
@ 2006-06-23 10:18         ` Ingo Molnar
  0 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-06-23 10:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, arjan, rmk


* Andrew Morton <akpm@osdl.org> wrote:

> btw, I was looking at this change:

> @@ -1003,6 +1003,7 @@ unsigned ata_exec_internal(struct ata_de
>  	unsigned int err_mask;
>  	int rc;
>  
> +	init_completion(&wait);
>  	spin_lock_irqsave(ap->lock, flags);
>  
>  	/* no internal command while frozen */
> 
> That local was already initialised with DECLARE_COMPLETION().  Am 
> surprised that an init_completion() also was needed?

That's a fundamental problem of DECLARE_COMPLETION() done on the kernel 
stack - it does build-time initialization with no opportunity to inject 
any runtime logic. (which lockdep would need. Maybe i missed some clever 
way to add a runtime callback into the initialization? [*])

Btw., there is no danger from missing the initialization of a wait 
structure: lockdep will detect "uninitialized" on-stack locks and will 
complain about it and turn itself off. [this happened a few times during 
development - that's how those init_completion() calls got added]

But at a minimum these initializations need to become lockdep-specific 
key-reinits - otherwise there will be impact to non-lockdep kernels too.

	Ingo

[*] the only solution i can see is to introduce 
DECLARE_COMPLETION_ONSTACK(), which could call a function with &wait 
passed in, where that function would return the initialized structure. The macro 
magic would resolve to something like:

  struct completion wait = lockdep_init_completion(&wait);

and thus the structure would be initialized. But this method cannot be 
used for static scope uses of DECLARE_COMPLETION, because it's not a 
constant initializer. So we'd definitely have to make a distinction in 
terms of _ONSTACK(). Is there really no compiler feature that could help 
us out here?
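
one possibility is a gcc statement expression, which can be used in the 
initializer of an automatic variable - a sketch (the _ONSTACK names are 
just illustrative):

	#define COMPLETION_INITIALIZER_ONSTACK(work) \
		(*({ init_completion(&work); &work; }))

	#define DECLARE_COMPLETION_ONSTACK(work) \
		struct completion work = COMPLETION_INITIALIZER_ONSTACK(work)

this keeps the on-stack case a plain declaration-with-initializer while 
still running init_completion() (and thus any lockdep key setup) at 
runtime; the static-scope DECLARE_COMPLETION() stays unchanged.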

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 21/61] lock validator: lockdep: add local_irq_enable_in_hardirq() API.
  2006-06-23  9:52       ` Andrew Morton
@ 2006-06-23 10:20         ` Ingo Molnar
  0 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-06-23 10:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, arjan


* Andrew Morton <akpm@osdl.org> wrote:

> > > Also, trace_irqflags.h doesn't seem an appropriate place for it to 
> > > live.
> > 
> > seems like the most practical place for it. Previously we had no 
> > central include file for irq-flags APIs (they used to be included 
> > from asm/system.h and other random per-arch places) - 
> > trace_irqflags.h has become the central file now. Should i rename it 
> > to irqflags.h perhaps, to not tie it to tracing? We have some 
> > deprecated irq-flags ops in interrupt.h, maybe this all belongs 
> > there. (although i think it's cleaner to have 
> > linux/include/irqflags.h and include it from interrupt.h)
> > 
> 
> Yes, irqflags.h is nice.

ok, done.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 07/61] lock validator: better lock debugging
  2006-05-30  1:33   ` Andrew Morton
@ 2006-06-23 10:25     ` Ingo Molnar
  2006-06-23 11:06       ` Andrew Morton
  0 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-06-23 10:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, arjan


* Andrew Morton <akpm@osdl.org> wrote:

> > +#define DEBUG_WARN_ON(c)						\
> > +({									\
> > +	int __ret = 0;							\
> > +									\
> > +	if (unlikely(c)) {						\
> > +		if (debug_locks_off())					\
> > +			WARN_ON(1);					\
> > +		__ret = 1;						\
> > +	}								\
> > +	__ret;								\
> > +})
> 
> Either the name of this thing is too generic, or we _make_ it generic, 
> in which case it's in the wrong header file.

this op is intended to be used only by the lock debugging 
infrastructure. So it should be renamed - but i fail to find a good name 
for it. (it's used quite frequently within the lock debugging code, at 
60+ places) Maybe INTERNAL_WARN_ON()? [that makes it sound special 
enough.] DEBUG_LOCKS_WARN_ON() might work too.

> > +#ifdef CONFIG_SMP
> > +# define SMP_DEBUG_WARN_ON(c)			DEBUG_WARN_ON(c)
> > +#else
> > +# define SMP_DEBUG_WARN_ON(c)			do { } while (0)
> > +#endif
> 
> Probably ditto.

agreed.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 50/61] lock validator: special locking: hrtimer.c
  2006-06-23 10:04     ` Ingo Molnar
@ 2006-06-23 10:38       ` Andrew Morton
  2006-06-23 10:52         ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Andrew Morton @ 2006-06-23 10:38 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan

On Fri, 23 Jun 2006 12:04:39 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Andrew Morton <akpm@osdl.org> wrote:
> 
> > >  	for (i = 0; i < MAX_HRTIMER_BASES; i++, base++)
> > > -		spin_lock_init(&base->lock);
> > > +		spin_lock_init_static(&base->lock);
> > >  }
> > >  
> > 
> > Perhaps the validator core's implementation of spin_lock_init() could 
> > look at the address and work out if it's within the static storage 
> > sections.
> 
> yeah, but there are two cases: places where we want to 'unify' array 
> locks into a single type, and places where we want to treat them 
> separately. The case where we 'unify' is the more common one: locks 
> embedded into hash-tables for example. So i went for annotating the ones 
> that are rarer. There are 2 right now: scheduler, hrtimers, with the 
> hrtimers one going away in the high-res-timers implementation. (we 
> unified the hrtimers locks into a per-CPU lock) (there's also a kgdb 
> annotation for -mm)
> 
> perhaps the naming should be clearer? I had it named 
> spin_lock_init_standalone() originally, then cleaned it up to be 
> spin_lock_init_static(). Maybe the original name is better?
> 

hm.  This is where a "term of art" is needed.  What is lockdep's internal
term for locks-of-a-different-type?  It should have such a term.

"class" would be a good term, although terribly overused.  Using that as an
example, spin_lock_init_standalone_class()?  ug.

<gives up>

You want spin_lock_init_singleton().

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 18/61] lock validator: irqtrace: core
  2006-05-30  1:34   ` Andrew Morton
@ 2006-06-23 10:42     ` Ingo Molnar
  0 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-06-23 10:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, arjan


* Andrew Morton <akpm@osdl.org> wrote:

> On Mon, 29 May 2006 23:24:32 +0200
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > accurate hard-IRQ-flags state tracing. This allows us to attach 
> > extra functionality to IRQ flags on/off events (such as 
> > trace-on/off).
> 
> That's a fairly skimpy description of some fairly substantial new 
> infrastructure.

ok, here's some more info (i'll add this to the irq-flags-tracing core 
patch):

the "irq state tracing" feature "traces" hardirq and softirq state, in 
that it gives interested subsystems an opportunity to be notified of 
every hardirqs-off/hardirqs-on, softirqs-off/softirqs-on event that 
happens in the kernel.

CONFIG_TRACE_IRQFLAGS_SUPPORT is needed for CONFIG_PROVE_SPIN_LOCKING 
and CONFIG_PROVE_RW_LOCKING to be offered by the generic lock debugging 
code. Otherwise only CONFIG_PROVE_MUTEX_LOCKING and 
CONFIG_PROVE_RWSEM_LOCKING will be offered on an architecture - these 
are locking APIs that are not used in IRQ context. (the one exception 
for rwsems is worked around)

Right now the only interested subsystem is the lock validator, but 
things like RTLinux, ADEOS, the -rt tree and the latency tracer would 
certainly be interested in managing irq-flags state too. (I did not add 
any expansive (and probably expensive) notifier mechanism yet, before 
someone actually tries to mix multiple users of this infrastructure and 
comes up with the right abstraction.)

architecture support for this is certainly not in the "trivial" 
category, because lots of lowlevel assembly code deals with irq-flags 
state changes. But an architecture can be irq-flags-tracing enabled in a 
rather straightforward and risk-free manner.

Architectures that want to support this need to do a couple of 
code-organizational changes first:

- move their irq-flags manipulation code from their asm/system.h header 
  to asm/irqflags.h

- rename local_irq_disable()/etc. to raw_local_irq_disable()/etc., so that 
  the linux/irqflags.h code can inject callbacks and construct the 
  real local_irq_disable()/etc. APIs (a rough sketch of this wrapping 
  follows below).

- add and enable TRACE_IRQFLAGS_SUPPORT in their arch level Kconfig file

and then a couple of functional changes are needed as well to implement 
irq-flags-tracing support:

- in lowlevel entry code add (build-conditional) calls to the
  trace_hardirqs_off()/trace_hardirqs_on() functions. The lock validator 
  closely guards whether the 'real' irq-flags matches the 'virtual' 
  irq-flags state, and complains loudly (and turns itself off) if the 
  two do not match. Usually most of the time for arch support for 
  irq-flags-tracing is spent in this state: look at the lockdep 
  complaint, try to figure out the assembly code we did not cover yet, 
  fix and repeat. Once the system has booted up and works without a 
  lockdep complaint in the irq-flags-tracing functions arch support is 
  complete.

- if the architecture has non-maskable interrupts then those need to be 
  excluded from the irq-tracing [and lock validation] mechanism via
  lockdep_off()/lockdep_on().

in general there is no risk from having an incomplete irq-flags-tracing 
implementation in an architecture: lockdep will detect that and will 
turn itself off. I.e. the lock validator will still be reliable. There 
should be no crashes due to irq-tracing bugs. (except if the assembly 
changes break other code by modifying conditions or registers that 
shouldnt be)
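
to make the renaming step above concrete, the wrapping done in 
include/linux/irqflags.h is roughly the following (simplified sketch - the 
real macros also cover the irqsave/irqrestore and softirq variants):

	#ifdef CONFIG_TRACE_IRQFLAGS
	# define local_irq_disable() \
		do { raw_local_irq_disable(); trace_hardirqs_off(); } while (0)
	# define local_irq_enable() \
		do { trace_hardirqs_on(); raw_local_irq_enable(); } while (0)
	#else
	# define local_irq_disable()	raw_local_irq_disable()
	# define local_irq_enable()	raw_local_irq_enable()
	#endif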

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 27/61] lock validator: prove spinlock/rwlock locking correctness
  2006-05-30  1:35   ` Andrew Morton
@ 2006-06-23 10:44     ` Ingo Molnar
  0 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-06-23 10:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, arjan


* Andrew Morton <akpm@osdl.org> wrote:

> On Mon, 29 May 2006 23:25:23 +0200
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > +# define spin_lock_init_key(lock, key)				\
> > +	__spin_lock_init((lock), #lock, key)
> 
> erk.  This adds a whole new layer of obfuscation on top of the 
> existing spinlock header files.  You already need to run the 
> preprocessor and disassembler to even work out which flavour you're 
> presently using.
> 
> Ho hum.

agreed. I think the API we started using in latest -mm 
(lockdep_init_key()) is the cleaner approach - that also trivially 
ensures that lockdep doesn't impact non-lockdep code. I'll fix the 
current lockdep_init_key() shortcomings and i'll get rid of the 
*_init_key() APIs.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 50/61] lock validator: special locking: hrtimer.c
  2006-06-23 10:38       ` Andrew Morton
@ 2006-06-23 10:52         ` Ingo Molnar
  2006-06-23 11:52           ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-06-23 10:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, arjan


* Andrew Morton <akpm@osdl.org> wrote:

> > perhaps the naming should be clearer? I had it named 
> > spin_lock_init_standalone() originally, then cleaned it up to be 
> > spin_lock_init_static(). Maybe the original name is better?
> > 
> 
> hm.  This is where a "term of art" is needed.  What is lockdep's 
> internal term for locks-of-a-different-type?  It should have such a 
> term.

'lock type' is what i tried to use consistently.

> "class" would be a good term, although terribly overused.  Using that 
> as an example, spin_lock_init_standalone_class()?  ug.
> 
> <gives up>
> 
> You want spin_lock_init_singleton().

hehe ;)

singleton wouldnt be enough here as we dont want just one instance of 
this lock type: we want separate types for each array entry. I.e. we 
dont want to unify the lock types (as the common spin_lock_init() call 
suggests), we want to split them along their static addresses.

singleton initialization is what spin_lock_init() itself accomplishes: 
the first call to a given spin_lock_init() will register a 'lock type' 
structure, and all subsequent calls to spin_lock_init() will find this 
type registered already. (keyed by the lockdep-type-key embedded in the 
spin_lock_init() macro)

so - spin_lock_init_split_type() might be better i think and expresses 
the purpose (to split away this type from the other lock types 
initialized here).

Or we could simply get rid of this static-variables special-case and 
embed a lock_type_key in the runqueue and use 
spin_lock_init_key(&rq->rq_lock_key)? That would unify the 'splitting' 
of types for static and dynamic locks. (at a minimal cost of .data) Hm?
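
for the runqueue that alternative would look roughly like this (struct and 
member names purely illustrative):

	struct rq {
		spinlock_t		lock;
		struct lockdep_type_key	rq_lock_key;
		/* ... the rest of the runqueue ... */
	};

	static void init_rq_lock(struct rq *rq)
	{
		spin_lock_init_key(&rq->lock, &rq->rq_lock_key);
	}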

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 55/61] lock validator: special locking: sb->s_umount
  2006-05-30  1:36   ` Andrew Morton
@ 2006-06-23 10:55     ` Ingo Molnar
  0 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-06-23 10:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, arjan


* Andrew Morton <akpm@osdl.org> wrote:

> > +++ linux/fs/dcache.c
> > @@ -470,8 +470,9 @@ static void prune_dcache(int count, stru
> >  		s_umount = &dentry->d_sb->s_umount;
> >  		if (down_read_trylock(s_umount)) {
> >  			if (dentry->d_sb->s_root != NULL) {
> > -				prune_one_dentry(dentry);
> > +// lockdep hack: do this better!
> >  				up_read(s_umount);
> > +				prune_one_dentry(dentry);
> >  				continue;
> 
> argh, you broke my kernel!
> 
> I'll whack some ifdefs in here so it's only known-broken if 
> CONFIG_LOCKDEP.
> 
> Again, we'd need the real fix here.

yeah. We should undo this patch for now. This will only be complained 
about if CONFIG_DEBUG_NON_NESTED_UNLOCKS is enabled. [i'll do this in my 
refactored queue]

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 61/61] lock validator: enable lock validator in Kconfig
  2006-05-30 13:33   ` Roman Zippel
@ 2006-06-23 11:01     ` Ingo Molnar
  2006-06-26 11:37       ` Roman Zippel
  0 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-06-23 11:01 UTC (permalink / raw)
  To: Roman Zippel; +Cc: linux-kernel, Arjan van de Ven, Andrew Morton


* Roman Zippel <zippel@linux-m68k.org> wrote:

> > +config PROVE_SPIN_LOCKING
> > +	bool "Prove spin-locking correctness"
> > +	default y
> 
> Could you please keep all the defaults in a separate -mm-only patch, 
> so it doesn't get merged?

yep - the default got removed.

> There are also a number of dependencies on DEBUG_KERNEL missing, it 
> completely breaks the debugging menu.

i have solved this problem in current -mm by making more advanced 
versions of lock debugging (allocation/exit checks, validator) depend on 
more basic lock debugging options. All the basic lock debugging options 
have a DEBUG_KERNEL dependency, which thus gets inherited by the other 
options as well.

> > +config LOCKDEP
> > +	bool
> > +	default y
> > +	depends on PROVE_SPIN_LOCKING || PROVE_RW_LOCKING || PROVE_MUTEX_LOCKING || PROVE_RWSEM_LOCKING
> 
> This can be written shorter as:
> 
> config LOCKDEP
> 	def_bool PROVE_SPIN_LOCKING || PROVE_RW_LOCKING || PROVE_MUTEX_LOCKING || PROVE_RWSEM_LOCKING

ok, done. (Btw., there's tons of other Kconfig code that uses the 
bool + depends syntax, and def_bool usage is quite rare.)

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 07/61] lock validator: better lock debugging
  2006-06-23 11:06       ` Andrew Morton
@ 2006-06-23 11:04         ` Ingo Molnar
  0 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2006-06-23 11:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, arjan


* Andrew Morton <akpm@osdl.org> wrote:

> > > Either the name of this thing is too generic, or we _make_ it 
> > > generic, in which case it's in the wrong header file.
> > 
> > this op is only intended to be used only by the lock debugging 
> > infrastructure. So it should be renamed - but i fail to find a good 
> > name for it. (it's used quite frequently within the lock debugging 
> > code, at 60+ places) Maybe INTERNAL_WARN_ON()? [that makes it sound 
> > special enough.] DEBUG_LOCKS_WARN_ON() might work too.
> 
> Well it has a debug_locks_off() in there, so DEBUG_LOCKS_WARN_ON() 
> seems right.

done.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 07/61] lock validator: better lock debugging
  2006-06-23 10:25     ` Ingo Molnar
@ 2006-06-23 11:06       ` Andrew Morton
  2006-06-23 11:04         ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Andrew Morton @ 2006-06-23 11:06 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan

On Fri, 23 Jun 2006 12:25:23 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Andrew Morton <akpm@osdl.org> wrote:
> 
> > > +#define DEBUG_WARN_ON(c)						\
> > > +({									\
> > > +	int __ret = 0;							\
> > > +									\
> > > +	if (unlikely(c)) {						\
> > > +		if (debug_locks_off())					\
> > > +			WARN_ON(1);					\
> > > +		__ret = 1;						\
> > > +	}								\
> > > +	__ret;								\
> > > +})
> > 
> > Either the name of this thing is too generic, or we _make_ it generic, 
> > in which case it's in the wrong header file.
> 
> this op is only intended to be used only by the lock debugging 
> infrastructure. So it should be renamed - but i fail to find a good name 
> for it. (it's used quite frequently within the lock debugging code, at 
> 60+ places) Maybe INTERNAL_WARN_ON()? [that makes it sound special 
> enough.] DEBUG_LOCKS_WARN_ON() might work too.

Well it has a debug_locks_off() in there, so DEBUG_LOCKS_WARN_ON() seems right.


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 50/61] lock validator: special locking: hrtimer.c
  2006-06-23 10:52         ` Ingo Molnar
@ 2006-06-23 11:52           ` Ingo Molnar
  2006-06-23 12:06             ` Andrew Morton
  0 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2006-06-23 11:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, arjan


* Ingo Molnar <mingo@elte.hu> wrote:

> > > perhaps the naming should be clearer? I had it named 
> > > spin_lock_init_standalone() originally, then cleaned it up to be 
> > > spin_lock_init_static(). Maybe the original name is better?
> > > 
> > 
> > hm.  This is where a "term of art" is needed.  What is lockdep's 
> > internal term for locks-of-a-different-type?  It should have such a 
> > term.
> 
> 'lock type' is what i tried to use consistenty.
> 
> > "class" would be a good term, although terribly overused.  Using that 
> > as an example, spin_lock_init_standalone_class()?  ug.

actually ... 'class' might be an even better term than 'type', mainly 
because type is even more overloaded in this context than class. "Q: 
What type does this lock have?" The natural answer: "it's a spinlock".

so i'm strongly considering the renaming of 'lock type' to 'lock class' 
and push that through all the APIs (and documentation). (i.e. we'd have 
'subclasses' of locks, not 'subtypes'.)

then we could do the annotations (where the call-site heuristics get the 
class wrong and either do false splits or dont do a split) via:

	spin_lock_set_class(&lock, &class_key)
	rwlock_set_class(&rwlock, &class_key)
	mutex_set_class(&mutex, &class_key)
	rwsem_set_class(&rwsem, &class_key)

[And for class-internal nesting, we'd have subclass nesting levels.]

hm?
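
as a purely hypothetical usage example, re-keying the serial port lock 
from patch 36/61 with the proposed API would then read:

	/* key/type names as in the current patches; the helper is made up */
	static struct lockdep_type_key port_lock_key;

	static void uart_init_port_lock(struct uart_port *port)
	{
		spin_lock_init(&port->lock);
		spin_lock_set_class(&port->lock, &port_lock_key);
	}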

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 50/61] lock validator: special locking: hrtimer.c
  2006-06-23 11:52           ` Ingo Molnar
@ 2006-06-23 12:06             ` Andrew Morton
  0 siblings, 0 replies; 320+ messages in thread
From: Andrew Morton @ 2006-06-23 12:06 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, arjan

On Fri, 23 Jun 2006 13:52:54 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > > > perhaps the naming should be clearer? I had it named 
> > > > spin_lock_init_standalone() originally, then cleaned it up to be 
> > > > spin_lock_init_static(). Maybe the original name is better?
> > > > 
> > > 
> > > hm.  This is where a "term of art" is needed.  What is lockdep's 
> > > internal term for locks-of-a-different-type?  It should have such a 
> > > term.
> > 
> > 'lock type' is what i tried to use consistenty.
> > 
> > > "class" would be a good term, although terribly overused.  Using that 
> > > as an example, spin_lock_init_standalone_class()?  ug.
> 
> actually ... 'class' might be an even better term than 'type', mainly 
> because type is even more overloaded in this context than class. "Q: 
> What type does this lock have?" The natural answer: "it's a spinlock".
> 
> so i'm strongly considering the renaming of 'lock type' to 'lock class' 
> and push that through all the APIs (and documentation). (i.e. we'd have 
> 'subclasses' of locks, not 'subtypes'.)
> 
> then we could do the annotations (where the call-site heuristics get the 
> class wrong and either do false splits or dont do a split) via:
> 
> 	spin_lock_set_class(&lock, &class_key)
> 	rwlock_set_class(&rwlock, &class_key)
> 	mutex_set_class(&mutex, &class_key)
> 	rwsem_set_class(&rwsem, &class_key)
> 
> [And for class-internal nesting, we'd have subclass nesting levels.]
> 

Works for me.

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 61/61] lock validator: enable lock validator in Kconfig
  2006-06-23 11:01     ` Ingo Molnar
@ 2006-06-26 11:37       ` Roman Zippel
  0 siblings, 0 replies; 320+ messages in thread
From: Roman Zippel @ 2006-06-26 11:37 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Arjan van de Ven, Andrew Morton

Hi,

On Fri, 23 Jun 2006, Ingo Molnar wrote:

> > Could you please keep all the defaults in a separate -mm-only patch, 
> > so it doesn't get merged?
> 
> yep - the default got removed.

Thanks.

> > > +config LOCKDEP
> > > +	bool
> > > +	default y
> > > +	depends on PROVE_SPIN_LOCKING || PROVE_RW_LOCKING || PROVE_MUTEX_LOCKING || PROVE_RWSEM_LOCKING
> > 
> > This can be written shorter as:
> > 
> > config LOCKDEP
> > 	def_bool PROVE_SPIN_LOCKING || PROVE_RW_LOCKING || PROVE_MUTEX_LOCKING || PROVE_RWSEM_LOCKING
> 
> ok, done. (Btw., there's tons of other Kconfig code though that uses the 
> bool + depends syntax though, and def_bool usage is quite rare.)

The new syntax was added later, so everything that was converted uses the 
basic syntax and is still copied around a lot (where it probably also 
doesn't help that it's not properly documented yet). I'm still planning to 
go through this and convert most of them...

bye, Roman

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (64 preceding siblings ...)
  2006-05-30  9:14 ` Benoit Boissinot
@ 2007-02-13 14:20 ` Ingo Molnar
  2007-02-13 15:00   ` Alan
                     ` (6 more replies)
  2007-02-13 14:20 ` [patch 01/11] syslets: add async.h include file, kernel-side API definitions Ingo Molnar
                   ` (7 subsequent siblings)
  73 siblings, 7 replies; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 14:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

I'm pleased to announce the first release of the "Syslet" kernel feature 
and kernel subsystem, which provides generic asynchronous system call 
support:

   http://redhat.com/~mingo/syslet-patches/

Syslets are small, simple, lightweight programs (consisting of 
system-calls, 'atoms') that the kernel can execute autonomously (and, 
not the least, asynchronously), without having to exit back into 
user-space. Syslets can be freely constructed and submitted by any 
unprivileged user-space context - and they have access to all the 
resources (and only those resources) that the original context has 
access to.

Because the proof of the pudding is in the eating, here are the 
performance results from async-test.c, which does open()+read()+close() 
of 1000 small random files (smaller is better):

                  synchronous IO      |   Syslets:
                  --------------------------------------------
  uncached:       45.8 seconds        |  34.2 seconds   ( +33.9% )
  cached:         31.6 msecs          |  26.5 msecs     ( +19.2% )

("uncached" results were done via "echo 3 > /proc/sys/vm/drop_caches". 
The default IO scheduler was the deadline scheduler, the test was run on 
ext3, using a single PATA IDE disk.)

So syslets, in this particular workload, are a nice speedup /both/ in 
the uncached and in the cached case. (Note that I used only a single 
disk, so the level of parallelism in the hardware is quite limited.)

the testcode can be found at:

     http://redhat.com/~mingo/syslet-patches/async-test-0.1.tar.gz

The boring details:

Syslets consist of 'syslet atoms', where each atom represents a single 
system-call. These atoms can be chained to each other: serially, in 
branches or in loops. The return value of an executed atom is checked 
against its condition flags, so an atom can express constructs such as 
'exit on nonzero' or 'loop until non-negative'.

Syslet atoms fundamentally execute only system calls, so to be able to 
manipulate user-space variables from syslets I've added a simple special 
system call: sys_umem_add(ptr, val). This can be used to increase or 
decrease a user-space variable (and to get the result), or to simply 
read out the variable (if 'val' is 0).

So a single syslet (submitted and executed via a single system call) can 
be arbitrarily complex. For example it can be like this:

       --------------------
       |     accept()     |-----> [ stop if returns negative ]
       --------------------
                |
                V
  -------------------------------
  |   setsockopt(TCP_NODELAY)   |-----> [ stop if returns negative ]
  -------------------------------
                |
                v
       --------------------
       |      read()      |<---------
       --------------------         | [ loop while positive ]
           |    |                   |
           |    ---------------------
           |
        -----------------------------------------
        | decrease and read user space variable |
        -----------------------------------------                    A
                    |                                                |
                    -------[ loop back to accept() if positive ]------

(you can find a VFS example and a hello.c example in the user-space 
testcode.)
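
( To make the control flow above concrete, here is a rough C sketch of 
  just the read() loop from the diagram, reusing the init_atom() helper 
  from the hello.c testcode. The __NR_umem_add number is a placeholder 
  here - the real numbering comes from the arch patches - so treat this 
  as illustrative pseudo-code, not a drop-in program: )

	static long fd;			/* set up earlier, e.g. by accept() */
	static char rbuf[4096];
	static char *buf = rbuf;
	static unsigned long size = sizeof(rbuf);
	static long nread, counter_ret;
	static unsigned long requests_left = 16;
	static unsigned long *counter_ptr = &requests_left;
	static unsigned long dec = (unsigned long)-1; /* umem_add(-1): decrement */

	/* the two atoms must be adjacent: SKIP_TO_NEXT jumps to prog[1] */
	static struct syslet_uatom prog[2];

	/*
	 * prog[0]: read() - loops on itself while the return value is
	 * positive; once the stop condition triggers (<= 0), skip to
	 * the linearly next atom instead of terminating the syslet:
	 */
	init_atom(&prog[0], __NR_read, &fd, &buf, &size, NULL, NULL, NULL,
		  &nread,
		  SYSLET_STOP_ON_NON_POSITIVE | SYSLET_SKIP_TO_NEXT_ON_STOP,
		  &prog[0]);
	/*
	 * prog[1]: decrement the user-space counter via sys_umem_add()
	 * and loop back to the read() atom (standing in for the accept()
	 * atom of the full example) while the result is still positive:
	 */
	init_atom(&prog[1], __NR_umem_add, &counter_ptr, &dec, NULL,
		  NULL, NULL, NULL, &counter_ret,
		  SYSLET_STOP_ON_NON_POSITIVE, &prog[0]);

	sys_async_exec(&prog[0]);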

A syslet is executed opportunistically: i.e. the syslet subsystem 
assumes that the syslet will not block, and it will switch to a 
cachemiss kernel thread from the scheduler. This means that even a 
single-atom syslet (i.e. a pure system call) is very close in 
performance to a pure system call. The syslet NULL-overhead in the 
cached case is roughly 10% of the SYSENTER NULL-syscall overhead. This 
means that two atoms are a win already, even in the cached case.

When a 'cachemiss' occurs, i.e. when we hit schedule() and are about to 
consider other threads, the syslet subsystem picks up a ready 'cachemiss 
thread', switches the current task's user-space context over to that 
thread and wakes it up. The original thread (which now becomes a 'busy' 
cachemiss thread) continues to block. This means that user-space keeps 
executing without interruption - even if user-space is single-threaded.

If the submitting user-space context /knows/ that a system call will 
block, it can request an immediate 'cachemiss' via the SYSLET_ASYNC 
flag. This would be used, for example, when an O_DIRECT file is read() 
or written.

Likewise, if user-space knows (or expects) that a system call takes a 
lot of CPU time even in the cached case, and it wants to offload it to 
another asynchronous context, it can request that via the SYSLET_ASYNC 
flag too.

Completions of asynchronous syslets are reported via a user-space 
ringbuffer that the kernel fills and user-space clears. Waiting is done 
via the sys_async_wait() system call. Completion can be suppressed on a 
per-atom basis via the SYSLET_NO_COMPLETE flag, for atoms that include 
some implicit notification mechanism (such as sys_kill(), etc.).
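
( For illustration, a minimal sketch of the user-space side of 
  registration, waiting and completion processing, using the 
  async_head_user layout from the patches. The sys_async_register() / 
  sys_async_wait() wrappers and handle_completion() are assumed 
  application-side helpers - again pseudo-code rather than a finished 
  library: )

	static struct syslet_uatom *completion_ring[1024];

	static struct async_head_user ahu = {
		.completion_ring	= completion_ring,
		.ring_size_bytes	= sizeof(completion_ring),
		.max_nr_threads		= -1UL,	/* kernel manages the pool */
	};

	static unsigned long curr_ring_idx;

	void async_setup(void)
	{
		sys_async_register(&ahu, sizeof(ahu));
	}

	void wait_and_drain(void)
	{
		struct syslet_uatom *done;

		/* wait until at least one syslet completed: */
		sys_async_wait(1);

		while ((done = completion_ring[curr_ring_idx])) {
			/* hand the slot back to the kernel: */
			completion_ring[curr_ring_idx] = NULL;
			handle_completion(done);	/* application-specific */

			if (++curr_ring_idx ==
			    sizeof(completion_ring) / sizeof(completion_ring[0]))
				curr_ring_idx = 0;
		}
	}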

As it might be obvious to some of you, the syslet subsystem takes many 
ideas and experience from my Tux in-kernel webserver :) The syslet code 
originates from a heavy rewrite of the Tux-atom and the Tux-cachemiss 
infrastructure.

Open issues:

 - the 'TID' of the 'head' thread currently varies depending on which 
   thread is running the user-space context.

 - signal support is not fully thought through - probably the head 
   should be getting all of them - the cachemiss threads are not really 
   interested in executing signal handlers.

 - sys_fork() and sys_async_exec() should be filtered out from the 
   syscalls that are allowed - first one only makes sense with ptregs, 
   second one is a nice kernel recursion thing :) I didn't want to 
   duplicate the sys_call_table though - maybe others have a better 
   idea. (A rough, non-duplicating sketch follows below.)
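
   ( Purely illustrative sketch of such a filter: a small deny-list 
     checked in __exec_atom() before indexing sys_call_table[]. The 
     helper name and __NR_async_exec are made up here - the real 
     numbering lives in the per-arch patches: )

	/* sketch: reject syscalls that make no sense from within an atom */
	static int syslet_nr_allowed(unsigned long nr)
	{
		switch (nr) {
		case __NR_fork:
		case __NR_vfork:
		case __NR_clone:
		case __NR_execve:
		case __NR_async_exec:
			return 0;
		default:
			return 1;
		}
	}

	/* ... and in __exec_atom(), before the sys_call_table[] dispatch: */
	if (unlikely(!syslet_nr_allowed(atom->nr)))
		return -ENOSYS;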

See more details in Documentation/syslet-design.txt. The patchset is 
against v2.6.20, but should apply to the -git head as well.

Thanks to Zach Brown for the idea to drive cachemisses via the 
scheduler. Thanks to Arjan van de Ven for early review feedback.

Comments, suggestions, reports are welcome!

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 01/11] syslets: add async.h include file, kernel-side API definitions
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (65 preceding siblings ...)
  2007-02-13 14:20 ` [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support Ingo Molnar
@ 2007-02-13 14:20 ` Ingo Molnar
  2007-02-13 14:20 ` [patch 02/11] syslets: add syslet.h include file, user API/ABI definitions Ingo Molnar
                   ` (6 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 14:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

From: Ingo Molnar <mingo@elte.hu>

add include/linux/async.h which contains the kernel-side API
declarations.

it also provides NOP stubs for the !CONFIG_ASYNC_SUPPORT case.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 include/linux/async.h |   25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

Index: linux/include/linux/async.h
===================================================================
--- /dev/null
+++ linux/include/linux/async.h
@@ -0,0 +1,25 @@
+#ifndef _LINUX_ASYNC_H
+#define _LINUX_ASYNC_H
+/*
+ * The syslet subsystem - asynchronous syscall execution support.
+ *
+ * Generic kernel API definitions:
+ */
+
+#ifdef CONFIG_ASYNC_SUPPORT
+extern void async_init(struct task_struct *t);
+extern void async_exit(struct task_struct *t);
+extern void __async_schedule(struct task_struct *t);
+#else /* !CONFIG_ASYNC_SUPPORT */
+static inline void async_init(struct task_struct *t)
+{
+}
+static inline void async_exit(struct task_struct *t)
+{
+}
+static inline void __async_schedule(struct task_struct *t)
+{
+}
+#endif /* !CONFIG_ASYNC_SUPPORT */
+
+#endif

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 02/11] syslets: add syslet.h include file, user API/ABI definitions
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (66 preceding siblings ...)
  2007-02-13 14:20 ` [patch 01/11] syslets: add async.h include file, kernel-side API definitions Ingo Molnar
@ 2007-02-13 14:20 ` Ingo Molnar
  2007-02-13 20:17   ` Indan Zupancic
  2007-02-19  0:22   ` Paul Mackerras
  2007-02-13 14:20 ` [patch 03/11] syslets: generic kernel bits Ingo Molnar
                   ` (5 subsequent siblings)
  73 siblings, 2 replies; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 14:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

From: Ingo Molnar <mingo@elte.hu>

add include/linux/syslet.h which contains the user-space API/ABI
declarations. Add the new header to include/linux/Kbuild as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 include/linux/Kbuild   |    1 
 include/linux/syslet.h |  136 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 137 insertions(+)

Index: linux/include/linux/Kbuild
===================================================================
--- linux.orig/include/linux/Kbuild
+++ linux/include/linux/Kbuild
@@ -140,6 +140,7 @@ header-y += sockios.h
 header-y += som.h
 header-y += sound.h
 header-y += synclink.h
+header-y += syslet.h
 header-y += telephony.h
 header-y += termios.h
 header-y += ticable.h
Index: linux/include/linux/syslet.h
===================================================================
--- /dev/null
+++ linux/include/linux/syslet.h
@@ -0,0 +1,136 @@
+#ifndef _LINUX_SYSLET_H
+#define _LINUX_SYSLET_H
+/*
+ * The syslet subsystem - asynchronous syscall execution support.
+ *
+ * Started by Ingo Molnar:
+ *
+ *  Copyright (C) 2007 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
+ *
+ * User-space API/ABI definitions:
+ */
+
+/*
+ * This is the 'Syslet Atom' - the basic unit of execution
+ * within the syslet framework. A syslet atom always represents
+ * a single system-call plus its arguments, and has conditions
+ * attached to it that allow the construction of larger
+ * programs from these atoms. User-space variables can be used
+ * (for example a loop index) via the special sys_umem*() syscalls.
+ *
+ * Arguments are implemented via pointers to arguments. This not
+ * only increases the flexibility of syslet atoms (multiple syslets
+ * can share the same variable for example), but is also an
+ * optimization: copy_uatom() will only fetch syscall parameters
+ * up until the point it meets the first NULL pointer. 50% of all
+ * syscalls have 2 or less parameters (and 90% of all syscalls have
+ * 4 or less parameters).
+ *
+ * [ Note: since the argument array is at the end of the atom, and the
+ *   kernel will not touch any argument beyond the final NULL one, atoms
+ *   might be packed more tightly. (the only special case exception to
+ *   this rule would be SKIP_TO_NEXT_ON_STOP atoms, where the kernel will
+ *   jump a full syslet_uatom number of bytes.) ]
+ */
+struct syslet_uatom {
+	unsigned long				flags;
+	unsigned long				nr;
+	long __user				*ret_ptr;
+	struct syslet_uatom	__user		*next;
+	unsigned long		__user		*arg_ptr[6];
+	/*
+	 * User-space can put anything in here, kernel will not
+	 * touch it:
+	 */
+	void __user				*private;
+};
+
+/*
+ * Flags to modify/control syslet atom behavior:
+ */
+
+/*
+ * Immediately queue this syslet asynchronously - do not even
+ * attempt to execute it synchronously in the user context:
+ */
+#define SYSLET_ASYNC				0x00000001
+
+/*
+ * Never queue this syslet asynchronously - even if synchronous
+ * execution causes a context-switch:
+ */
+#define SYSLET_SYNC				0x00000002
+
+/*
+ * Do not queue the syslet in the completion ring when done.
+ *
+ * ( the default is that the final atom of a syslet is queued
+ *   in the completion ring. )
+ *
+ * Some syscalls generate implicit completion events of their
+ * own.
+ */
+#define SYSLET_NO_COMPLETE			0x00000004
+
+/*
+ * Execution control: conditions upon the return code
+ * of the previous syslet atom. 'Stop' means syslet
+ * execution is stopped and the atom is put into the
+ * completion ring:
+ */
+#define SYSLET_STOP_ON_NONZERO			0x00000008
+#define SYSLET_STOP_ON_ZERO			0x00000010
+#define SYSLET_STOP_ON_NEGATIVE			0x00000020
+#define SYSLET_STOP_ON_NON_POSITIVE		0x00000040
+
+#define SYSLET_STOP_MASK				\
+	(	SYSLET_STOP_ON_NONZERO		|	\
+		SYSLET_STOP_ON_ZERO		|	\
+		SYSLET_STOP_ON_NEGATIVE		|	\
+		SYSLET_STOP_ON_NON_POSITIVE		)
+
+/*
+ * Special modifier to 'stop' handling: instead of stopping the
+ * execution of the syslet, the linearly next atom is executed.
+ * (Normal execution flows along atom->next, and execution stops
+ *  if atom->next is NULL or a stop condition becomes true.)
+ *
+ * This is what allows true branches of execution within syslets.
+ */
+#define SYSLET_SKIP_TO_NEXT_ON_STOP		0x00000080
+
+/*
+ * This is the (per-user-context) descriptor of the async completion
+ * ring. This gets registered via sys_async_register().
+ */
+struct async_head_user {
+	/*
+	 * Pointers to completed async syslets (i.e. syslets that
+	 * generated a cachemiss and went async, returning NULL
+	 * to the user context from sys_async_exec()) are queued here.
+	 * Syslets that were executed synchronously are not queued here.
+	 *
+	 * Note: the final atom that generated the exit condition is
+	 * queued here. Normally this would be the last atom of a syslet.
+	 */
+	struct syslet_uatom __user		**completion_ring;
+	/*
+	 * Ring size in bytes:
+	 */
+	unsigned long				ring_size_bytes;
+
+	/*
+	 * Maximum number of asynchronous contexts the kernel creates.
+	 *
+	 * -1UL has a special meaning: the kernel manages the optimal
+	 * size of the async pool.
+	 *
+	 * Note: this field should be valid for the lifetime of async
+	 * processing, because future kernels detect changes to this
+	 * field. (enabling user-space to control the size of the async
+	 * pool in a low-overhead fashion)
+	 */
+	unsigned long				max_nr_threads;
+};
+
+#endif

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 03/11] syslets: generic kernel bits
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (67 preceding siblings ...)
  2007-02-13 14:20 ` [patch 02/11] syslets: add syslet.h include file, user API/ABI definitions Ingo Molnar
@ 2007-02-13 14:20 ` Ingo Molnar
  2007-02-13 14:20 ` [patch 04/11] syslets: core, data structures Ingo Molnar
                   ` (4 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 14:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

From: Ingo Molnar <mingo@elte.hu>

add the kernel generic bits - these are present even if !CONFIG_ASYNC_SUPPORT.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 include/linux/sched.h |    7 ++++++-
 kernel/exit.c         |    3 +++
 kernel/fork.c         |    2 ++
 kernel/sched.c        |    9 +++++++++
 4 files changed, 20 insertions(+), 1 deletion(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -88,7 +88,8 @@ struct sched_param {
 
 struct exec_domain;
 struct futex_pi_state;
-
+struct async_thread;
+struct async_head;
 /*
  * List of flags we want to share for kernel threads,
  * if only because they are not used by them anyway.
@@ -997,6 +998,10 @@ struct task_struct {
 /* journalling filesystem info */
 	void *journal_info;
 
+/* async syscall support: */
+	struct async_thread *at, *async_ready;
+	struct async_head *ah;
+
 /* VM state */
 	struct reclaim_state *reclaim_state;
 
Index: linux/kernel/exit.c
===================================================================
--- linux.orig/kernel/exit.c
+++ linux/kernel/exit.c
@@ -26,6 +26,7 @@
 #include <linux/ptrace.h>
 #include <linux/profile.h>
 #include <linux/mount.h>
+#include <linux/async.h>
 #include <linux/proc_fs.h>
 #include <linux/mempolicy.h>
 #include <linux/taskstats_kern.h>
@@ -889,6 +890,8 @@ fastcall NORET_TYPE void do_exit(long co
 		schedule();
 	}
 
+	async_exit(tsk);
+
 	tsk->flags |= PF_EXITING;
 
 	if (unlikely(in_atomic()))
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c
+++ linux/kernel/fork.c
@@ -22,6 +22,7 @@
 #include <linux/personality.h>
 #include <linux/mempolicy.h>
 #include <linux/sem.h>
+#include <linux/async.h>
 #include <linux/file.h>
 #include <linux/key.h>
 #include <linux/binfmts.h>
@@ -1054,6 +1055,7 @@ static struct task_struct *copy_process(
 
 	p->lock_depth = -1;		/* -1 = no lock */
 	do_posix_clock_monotonic_gettime(&p->start_time);
+	async_init(p);
 	p->security = NULL;
 	p->io_context = NULL;
 	p->io_wait = NULL;
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -38,6 +38,7 @@
 #include <linux/vmalloc.h>
 #include <linux/blkdev.h>
 #include <linux/delay.h>
+#include <linux/async.h>
 #include <linux/smp.h>
 #include <linux/threads.h>
 #include <linux/timer.h>
@@ -3436,6 +3437,14 @@ asmlinkage void __sched schedule(void)
 	}
 	profile_hit(SCHED_PROFILING, __builtin_return_address(0));
 
+	prev = current;
+	if (unlikely(prev->async_ready)) {
+		if (prev->state && !(preempt_count() & PREEMPT_ACTIVE) &&
+			(!(prev->state & TASK_INTERRUPTIBLE) ||
+				!signal_pending(prev)))
+			__async_schedule(prev);
+	}
+
 need_resched:
 	preempt_disable();
 	prev = current;

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 04/11] syslets: core, data structures
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (68 preceding siblings ...)
  2007-02-13 14:20 ` [patch 03/11] syslets: generic kernel bits Ingo Molnar
@ 2007-02-13 14:20 ` Ingo Molnar
  2007-02-13 14:20 ` [patch 05/11] syslets: core code Ingo Molnar
                   ` (3 subsequent siblings)
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 14:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

From: Ingo Molnar <mingo@elte.hu>

this adds the data structures used by the syslet / async system calls
infrastructure.

This is used only if CONFIG_ASYNC_SUPPORT is enabled.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 kernel/async.h |   58 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

Index: linux/kernel/async.h
===================================================================
--- /dev/null
+++ linux/kernel/async.h
@@ -0,0 +1,58 @@
+/*
+ * The syslet subsystem - asynchronous syscall execution support.
+ *
+ * Syslet-subsystem internal definitions:
+ */
+
+/*
+ * The kernel-side copy of a syslet atom - with arguments expanded:
+ */
+struct syslet_atom {
+	unsigned long				flags;
+	unsigned long				nr;
+	long __user				*ret_ptr;
+	struct syslet_uatom	__user		*next;
+	unsigned long				args[6];
+};
+
+/*
+ * The 'async head' is the thread which has user-space context (ptregs)
+ * 'below it' - this is the one that can return to user-space:
+ */
+struct async_head {
+	spinlock_t				lock;
+	struct task_struct			*user_task;
+
+	struct list_head			ready_async_threads;
+	struct list_head			busy_async_threads;
+
+	unsigned long				events_left;
+	wait_queue_head_t			wait;
+
+	struct async_head_user	__user		*uah;
+	struct syslet_uatom	__user		**completion_ring;
+	unsigned long				curr_ring_idx;
+	unsigned long				max_ring_idx;
+	unsigned long				ring_size_bytes;
+
+	unsigned int				nr_threads;
+	unsigned int				max_nr_threads;
+
+	struct completion			start_done;
+	struct completion			exit_done;
+};
+
+/*
+ * The 'async thread' is either a newly created async thread or it is
+ * an 'ex-head' - it cannot return to user-space and only has kernel
+ * context.
+ */
+struct async_thread {
+	struct task_struct			*task;
+	struct syslet_uatom	__user		*work;
+	struct async_head			*ah;
+
+	struct list_head			entry;
+
+	unsigned int				exit;
+};

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 05/11] syslets: core code
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (69 preceding siblings ...)
  2007-02-13 14:20 ` [patch 04/11] syslets: core, data structures Ingo Molnar
@ 2007-02-13 14:20 ` Ingo Molnar
  2007-02-13 23:15   ` Andi Kleen
                     ` (3 more replies)
  2007-02-13 14:20 ` [patch 06/11] syslets: core, documentation Ingo Molnar
                   ` (2 subsequent siblings)
  73 siblings, 4 replies; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 14:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

From: Ingo Molnar <mingo@elte.hu>

the core syslet / async system calls infrastructure code.

It is built only if CONFIG_ASYNC_SUPPORT is enabled.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 kernel/Makefile |    1 
 kernel/async.c  |  811 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 812 insertions(+)

Index: linux/kernel/Makefile
===================================================================
--- linux.orig/kernel/Makefile
+++ linux/kernel/Makefile
@@ -10,6 +10,7 @@ obj-y     = sched.o fork.o exec_domain.o
 	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
 	    hrtimer.o rwsem.o latency.o nsproxy.o srcu.o
 
+obj-$(CONFIG_ASYNC_SUPPORT) += async.o
 obj-$(CONFIG_STACKTRACE) += stacktrace.o
 obj-y += time/
 obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o
Index: linux/kernel/async.c
===================================================================
--- /dev/null
+++ linux/kernel/async.c
@@ -0,0 +1,811 @@
+/*
+ * kernel/async.c
+ *
+ * The syslet subsystem - asynchronous syscall execution support.
+ *
+ * Started by Ingo Molnar:
+ *
+ *  Copyright (C) 2007 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
+ *
+ * This file is released under the GPLv2.
+ *
+ * This code implements asynchronous syscalls via 'syslets'.
+ *
+ * Syslets consist of a set of 'syslet atoms' which reside
+ * purely in user-space memory and have no kernel-space resource
+ * attached to them. These atoms can be linked to each other via
+ * pointers. Besides the fundamental ability to execute system
+ * calls, syslet atoms can also implement branches, loops and
+ * arithmetic.
+ *
+ * Thus syslets can be used to build small autonomous programs that
+ * the kernel can execute purely from kernel-space, without having
+ * to return to any user-space context. Syslets can be run by any
+ * unprivileged user-space application - they are executed safely
+ * by the kernel.
+ */
+#include <linux/syscalls.h>
+#include <linux/syslet.h>
+#include <linux/delay.h>
+#include <linux/async.h>
+#include <linux/sched.h>
+#include <linux/init.h>
+#include <linux/err.h>
+
+#include <asm/uaccess.h>
+#include <asm/unistd.h>
+
+#include "async.h"
+
+typedef asmlinkage long (*syscall_fn_t)(long, long, long, long, long, long);
+
+extern syscall_fn_t sys_call_table[NR_syscalls];
+
+static void
+__mark_async_thread_ready(struct async_thread *at, struct async_head *ah)
+{
+	list_del(&at->entry);
+	list_add_tail(&at->entry, &ah->ready_async_threads);
+	if (list_empty(&ah->busy_async_threads))
+		wake_up(&ah->wait);
+}
+
+static void
+mark_async_thread_ready(struct async_thread *at, struct async_head *ah)
+{
+	spin_lock(&ah->lock);
+	__mark_async_thread_ready(at, ah);
+	spin_unlock(&ah->lock);
+}
+
+static void
+__mark_async_thread_busy(struct async_thread *at, struct async_head *ah)
+{
+	list_del(&at->entry);
+	list_add_tail(&at->entry, &ah->busy_async_threads);
+}
+
+static void
+mark_async_thread_busy(struct async_thread *at, struct async_head *ah)
+{
+	spin_lock(&ah->lock);
+	__mark_async_thread_busy(at, ah);
+	spin_unlock(&ah->lock);
+}
+
+static void
+__async_thread_init(struct task_struct *t, struct async_thread *at,
+		    struct async_head *ah)
+{
+	INIT_LIST_HEAD(&at->entry);
+	at->exit = 0;
+	at->task = t;
+	at->ah = ah;
+	at->work = NULL;
+
+	t->at = at;
+	ah->nr_threads++;
+}
+
+static void
+async_thread_init(struct task_struct *t, struct async_thread *at,
+		  struct async_head *ah)
+{
+	spin_lock(&ah->lock);
+	__async_thread_init(t, at, ah);
+	__mark_async_thread_ready(at, ah);
+	spin_unlock(&ah->lock);
+}
+
+
+static void
+async_thread_exit(struct async_thread *at, struct task_struct *t)
+{
+	struct async_head *ah;
+
+	ah = at->ah;
+
+	spin_lock(&ah->lock);
+	list_del_init(&at->entry);
+	if (at->exit)
+		complete(&ah->exit_done);
+	t->at = NULL;
+	at->task = NULL;
+	WARN_ON(!ah->nr_threads);
+	ah->nr_threads--;
+	spin_unlock(&ah->lock);
+}
+
+static struct async_thread *
+pick_ready_cachemiss_thread(struct async_head *ah)
+{
+	struct list_head *head = &ah->ready_async_threads;
+	struct async_thread *at;
+
+	if (list_empty(head))
+		return NULL;
+
+	at = list_entry(head->next, struct async_thread, entry);
+
+	return at;
+}
+
+static void pick_new_async_head(struct async_head *ah,
+				struct task_struct *t, struct pt_regs *old_regs)
+{
+	struct async_thread *new_async_thread;
+	struct async_thread *async_ready;
+	struct task_struct *new_task;
+	struct pt_regs *new_regs;
+
+	spin_lock(&ah->lock);
+
+	new_async_thread = pick_ready_cachemiss_thread(ah);
+	if (!new_async_thread)
+		goto out_unlock;
+
+	async_ready = t->async_ready;
+	WARN_ON(!async_ready);
+	t->async_ready = NULL;
+
+	new_task = new_async_thread->task;
+	new_regs = task_pt_regs(new_task);
+	*new_regs = *old_regs;
+
+	new_task->at = NULL;
+	t->ah = NULL;
+	new_task->ah = ah;
+
+	wake_up_process(new_task);
+
+	__async_thread_init(t, async_ready, ah);
+	__mark_async_thread_busy(t->at, ah);
+
+ out_unlock:
+	spin_unlock(&ah->lock);
+}
+
+void __async_schedule(struct task_struct *t)
+{
+	struct async_head *ah = t->ah;
+	struct pt_regs *old_regs = task_pt_regs(t);
+
+	pick_new_async_head(ah, t, old_regs);
+}
+
+static void async_schedule(struct task_struct *t)
+{
+	if (t->async_ready)
+		__async_schedule(t);
+}
+
+static long __exec_atom(struct task_struct *t, struct syslet_atom *atom)
+{
+	struct async_thread *async_ready_save;
+	long ret;
+
+	/*
+	 * If user-space expects the syscall to schedule then
+	 * (try to) switch user-space to another thread straight
+	 * away and execute the syscall asynchronously:
+	 */
+	if (unlikely(atom->flags & SYSLET_ASYNC))
+		async_schedule(t);
+	/*
+	 * Does user-space want synchronous execution for this atom?:
+	 */
+	async_ready_save = t->async_ready;
+	if (unlikely(atom->flags & SYSLET_SYNC))
+		t->async_ready = NULL;
+
+	if (unlikely(atom->nr >= NR_syscalls))
+		return -ENOSYS;
+
+	ret = sys_call_table[atom->nr](atom->args[0], atom->args[1],
+				       atom->args[2], atom->args[3],
+				       atom->args[4], atom->args[5]);
+	if (atom->ret_ptr && put_user(ret, atom->ret_ptr))
+		return -EFAULT;
+
+	if (t->ah)
+		t->async_ready = async_ready_save;
+
+	return ret;
+}
+
+/*
+ * Arithmetic syscall: add a value to a user-space memory location.
+ *
+ * Generic C version - in case the architecture has not implemented it
+ * in assembly.
+ */
+asmlinkage __attribute__((weak)) long
+sys_umem_add(unsigned long __user *uptr, unsigned long inc)
+{
+	unsigned long val, new_val;
+
+	if (get_user(val, uptr))
+		return -EFAULT;
+	/*
+	 * inc == 0 means 'read memory value':
+	 */
+	if (!inc)
+		return val;
+
+	new_val = val + inc;
+	__put_user(new_val, uptr);
+
+	return new_val;
+}
+
+/*
+ * Open-coded because this is a very hot codepath during syslet
+ * execution and every cycle counts ...
+ *
+ * [ NOTE: it's an explicit fastcall because optimized assembly code
+ *   might depend on this. There are some kernels that disable regparm,
+ *   so let's not break those if possible. ]
+ */
+fastcall __attribute__((weak)) long
+copy_uatom(struct syslet_atom *atom, struct syslet_uatom __user *uatom)
+{
+	unsigned long __user *arg_ptr;
+	long ret = 0;
+
+	if (!access_ok(VERIFY_WRITE, uatom, sizeof(*uatom)))
+		return -EFAULT;
+
+	ret = __get_user(atom->nr, &uatom->nr);
+	ret |= __get_user(atom->ret_ptr, &uatom->ret_ptr);
+	ret |= __get_user(atom->flags, &uatom->flags);
+	ret |= __get_user(atom->next, &uatom->next);
+
+	memset(atom->args, 0, sizeof(atom->args));
+
+	ret |= __get_user(arg_ptr, &uatom->arg_ptr[0]);
+	if (!arg_ptr)
+		return ret;
+	if (!access_ok(VERIFY_WRITE, arg_ptr, sizeof(*arg_ptr)))
+		return -EFAULT;
+	ret |= __get_user(atom->args[0], arg_ptr);
+
+	ret |= __get_user(arg_ptr, &uatom->arg_ptr[1]);
+	if (!arg_ptr)
+		return ret;
+	if (!access_ok(VERIFY_WRITE, arg_ptr, sizeof(*arg_ptr)))
+		return -EFAULT;
+	ret |= __get_user(atom->args[1], arg_ptr);
+
+	ret |= __get_user(arg_ptr, &uatom->arg_ptr[2]);
+	if (!arg_ptr)
+		return ret;
+	if (!access_ok(VERIFY_WRITE, arg_ptr, sizeof(*arg_ptr)))
+		return -EFAULT;
+	ret |= __get_user(atom->args[2], arg_ptr);
+
+	ret |= __get_user(arg_ptr, &uatom->arg_ptr[3]);
+	if (!arg_ptr)
+		return ret;
+	if (!access_ok(VERIFY_WRITE, arg_ptr, sizeof(*arg_ptr)))
+		return -EFAULT;
+	ret |= __get_user(atom->args[3], arg_ptr);
+
+	ret |= __get_user(arg_ptr, &uatom->arg_ptr[4]);
+	if (!arg_ptr)
+		return ret;
+	if (!access_ok(VERIFY_WRITE, arg_ptr, sizeof(*arg_ptr)))
+		return -EFAULT;
+	ret |= __get_user(atom->args[4], arg_ptr);
+
+	ret |= __get_user(arg_ptr, &uatom->arg_ptr[5]);
+	if (!arg_ptr)
+		return ret;
+	if (!access_ok(VERIFY_WRITE, arg_ptr, sizeof(*arg_ptr)))
+		return -EFAULT;
+	ret |= __get_user(atom->args[5], arg_ptr);
+
+	return ret;
+}
+
+/*
+ * Should the next atom run, depending on the return value of
+ * the current atom - or should we stop execution?
+ */
+static int run_next_atom(struct syslet_atom *atom, long ret)
+{
+	switch (atom->flags & SYSLET_STOP_MASK) {
+		case SYSLET_STOP_ON_NONZERO:
+			if (!ret)
+				return 1;
+			return 0;
+		case SYSLET_STOP_ON_ZERO:
+			if (ret)
+				return 1;
+			return 0;
+		case SYSLET_STOP_ON_NEGATIVE:
+			if (ret >= 0)
+				return 1;
+			return 0;
+		case SYSLET_STOP_ON_NON_POSITIVE:
+			if (ret > 0)
+				return 1;
+			return 0;
+	}
+	return 1;
+}
+
+static struct syslet_uatom __user *
+next_uatom(struct syslet_atom *atom, struct syslet_uatom *uatom, long ret)
+{
+	/*
+	 * If the stop condition is false then continue
+	 * to atom->next:
+	 */
+	if (run_next_atom(atom, ret))
+		return atom->next;
+	/*
+	 * Special-case: if the stop condition is true and the atom
+	 * has SKIP_TO_NEXT_ON_STOP set, then instead of
+	 * stopping we skip to the atom directly after this atom
+	 * (in linear address-space).
+	 *
+	 * This, combined with the atom->next pointer and the
+	 * stop condition flags is what allows true branches and
+	 * loops in syslets:
+	 */
+	if (atom->flags & SYSLET_SKIP_TO_NEXT_ON_STOP)
+		return uatom + 1;
+
+	return NULL;
+}
+
+/*
+ * If user-space requested a completion event then put the last
+ * executed uatom into the completion ring:
+ */
+static long
+complete_uatom(struct async_head *ah, struct task_struct *t,
+	       struct syslet_atom *atom, struct syslet_uatom __user *uatom)
+{
+	struct syslet_uatom __user **ring_slot, *slot_val = NULL;
+	long ret;
+
+	WARN_ON(!t->at);
+	WARN_ON(t->ah);
+
+	if (unlikely(atom->flags & SYSLET_NO_COMPLETE))
+		return 0;
+
+	/*
+	 * Asynchronous threads can complete in parallel, so use the
+	 * head-lock to serialize:
+	 */
+	spin_lock(&ah->lock);
+	ring_slot = ah->completion_ring + ah->curr_ring_idx;
+	ret = __copy_from_user_inatomic(&slot_val, ring_slot, sizeof(slot_val));
+	/*
+	 * User-space submitted more work than what fits into the
+	 * completion ring - do not stomp over it silently and signal
+	 * the error condition:
+	 */
+	if (unlikely(slot_val)) {
+		spin_unlock(&ah->lock);
+		return -EFAULT;
+	}
+	slot_val = uatom;
+	ret |= __copy_to_user_inatomic(ring_slot, &slot_val, sizeof(slot_val));
+
+	ah->curr_ring_idx++;
+	if (unlikely(ah->curr_ring_idx == ah->max_ring_idx))
+		ah->curr_ring_idx = 0;
+
+	/*
+	 * See whether the async-head is waiting and needs a wakeup:
+	 */
+	if (ah->events_left) {
+		ah->events_left--;
+		if (!ah->events_left)
+			wake_up(&ah->wait);
+	}
+
+	spin_unlock(&ah->lock);
+
+	return ret;
+}
+
+/*
+ * This is the main syslet atom execution loop. This fetches atoms
+ * and executes them until it runs out of atoms or until the
+ * exit condition becomes false:
+ */
+static struct syslet_uatom __user *
+exec_atom(struct async_head *ah, struct task_struct *t,
+	  struct syslet_uatom __user *uatom)
+{
+	struct syslet_uatom __user *last_uatom;
+	struct syslet_atom atom;
+	long ret;
+
+ run_next:
+	if (unlikely(copy_uatom(&atom, uatom)))
+		return ERR_PTR(-EFAULT);
+
+	last_uatom = uatom;
+	ret = __exec_atom(t, &atom);
+	if (unlikely(signal_pending(t) || need_resched()))
+		goto stop;
+
+	uatom = next_uatom(&atom, uatom, ret);
+	if (uatom)
+		goto run_next;
+ stop:
+	/*
+	 * We do completion only in async context:
+	 */
+	if (t->at && complete_uatom(ah, t, &atom, last_uatom))
+		return ERR_PTR(-EFAULT);
+
+	return last_uatom;
+}
+
+static void cachemiss_execute(struct async_thread *at, struct async_head *ah,
+			      struct task_struct *t)
+{
+	struct syslet_uatom __user *uatom;
+
+	uatom = at->work;
+	WARN_ON(!uatom);
+	at->work = NULL;
+
+	exec_atom(ah, t, uatom);
+}
+
+static void
+cachemiss_loop(struct async_thread *at, struct async_head *ah,
+	       struct task_struct *t)
+{
+	for (;;) {
+		schedule();
+		mark_async_thread_busy(at, ah);
+		set_task_state(t, TASK_INTERRUPTIBLE);
+		if (at->work)
+			cachemiss_execute(at, ah, t);
+		if (unlikely(t->ah || at->exit || signal_pending(t)))
+			break;
+		mark_async_thread_ready(at, ah);
+	}
+	t->state = TASK_RUNNING;
+
+	async_thread_exit(at, t);
+}
+
+static int cachemiss_thread(void *data)
+{
+	struct task_struct *t = current;
+	struct async_head *ah = data;
+	struct async_thread at;
+
+	async_thread_init(t, &at, ah);
+	complete(&ah->start_done);
+
+	cachemiss_loop(&at, ah, t);
+	if (at.exit)
+		do_exit(0);
+
+	if (!t->ah && signal_pending(t)) {
+		WARN_ON(1);
+		do_exit(0);
+	}
+
+	/*
+	 * Return to user-space with NULL:
+	 */
+	return 0;
+}
+
+static void __notify_async_thread_exit(struct async_thread *at,
+				       struct async_head *ah)
+{
+	list_del_init(&at->entry);
+	at->exit = 1;
+	init_completion(&ah->exit_done);
+	wake_up_process(at->task);
+}
+
+static void stop_cachemiss_threads(struct async_head *ah)
+{
+	struct async_thread *at;
+
+repeat:
+	spin_lock(&ah->lock);
+	list_for_each_entry(at, &ah->ready_async_threads, entry) {
+
+		__notify_async_thread_exit(at, ah);
+		spin_unlock(&ah->lock);
+
+		wait_for_completion(&ah->exit_done);
+
+		goto repeat;
+	}
+
+	list_for_each_entry(at, &ah->busy_async_threads, entry) {
+
+		__notify_async_thread_exit(at, ah);
+		spin_unlock(&ah->lock);
+
+		wait_for_completion(&ah->exit_done);
+
+		goto repeat;
+	}
+	spin_unlock(&ah->lock);
+}
+
+static void async_head_exit(struct async_head *ah, struct task_struct *t)
+{
+	stop_cachemiss_threads(ah);
+	WARN_ON(!list_empty(&ah->ready_async_threads));
+	WARN_ON(!list_empty(&ah->busy_async_threads));
+	WARN_ON(ah->nr_threads);
+	WARN_ON(spin_is_locked(&ah->lock));
+	kfree(ah);
+	t->ah = NULL;
+}
+
+/*
+ * Pretty arbitrary for now. The kernel resource-controls the number
+ * of threads anyway.
+ */
+#define DEFAULT_THREAD_LIMIT 1024
+
+/*
+ * Initialize the in-kernel async head, based on the user-space async
+ * head:
+ */
+static long
+async_head_init(struct task_struct *t, struct async_head_user __user *uah)
+{
+	unsigned long max_nr_threads, ring_size_bytes, max_ring_idx;
+	struct syslet_uatom __user **completion_ring;
+	struct async_head *ah;
+	long ret;
+
+	if (get_user(max_nr_threads, &uah->max_nr_threads))
+		return -EFAULT;
+	if (get_user(completion_ring, &uah->completion_ring))
+		return -EFAULT;
+	if (get_user(ring_size_bytes, &uah->ring_size_bytes))
+		return -EFAULT;
+	if (!ring_size_bytes)
+		return -EINVAL;
+	/*
+	 * We pre-check the ring pointer, so that in the fastpath
+	 * we can use __put_user():
+	 */
+	if (!access_ok(VERIFY_WRITE, completion_ring, ring_size_bytes))
+		return -EFAULT;
+
+	max_ring_idx = ring_size_bytes / sizeof(void *);
+	if (ring_size_bytes != max_ring_idx * sizeof(void *))
+		return -EINVAL;
+
+	/*
+	 * Lock down the ring. Note: user-space should not munlock() this,
+	 * because if the ring pages get swapped out then the async
+	 * completion code might return a -EFAULT instead of the expected
+	 * completion. (the kernel safely handles that case too, so this
+	 * isn't a security problem.)
+	 *
+	 * mlock() is better here because it gets resource-accounted
+	 * properly, and even unprivileged userspace has a few pages
+	 * of mlock-able memory available. (which is more than enough
+	 * for the completion-pointers ringbuffer)
+	 */
+	ret = sys_mlock((unsigned long)completion_ring, ring_size_bytes);
+	if (ret)
+		return ret;
+
+	/*
+	 * -1 means: the kernel manages the optimal size of the async pool.
+	 * Simple static limit for now.
+	 */
+	if (max_nr_threads == -1UL)
+		max_nr_threads = DEFAULT_THREAD_LIMIT;
+	/*
+	 * If the ring is smaller than the number of threads requested
+	 * then lower the thread count - otherwise we might lose
+	 * syslet completion events:
+	 */
+	max_nr_threads = min(max_ring_idx, max_nr_threads);
+
+	ah = kmalloc(sizeof(*ah), GFP_KERNEL);
+	if (!ah)
+		return -ENOMEM;
+
+	spin_lock_init(&ah->lock);
+	ah->nr_threads = 0;
+	ah->max_nr_threads = max_nr_threads;
+	INIT_LIST_HEAD(&ah->ready_async_threads);
+	INIT_LIST_HEAD(&ah->busy_async_threads);
+	init_waitqueue_head(&ah->wait);
+	ah->events_left = 0;
+	ah->uah = uah;
+	ah->curr_ring_idx = 0;
+	ah->max_ring_idx = max_ring_idx;
+	ah->completion_ring = completion_ring;
+	ah->ring_size_bytes = ring_size_bytes;
+
+	ah->user_task = t;
+	t->ah = ah;
+
+	return 0;
+}
+
+/**
+ * sys_async_register - enable async syscall support
+ */
+asmlinkage long
+sys_async_register(struct async_head_user __user *uah, unsigned int len)
+{
+	struct task_struct *t = current;
+
+	/*
+	 * This 'len' check enables future extension of
+	 * the async_head ABI:
+	 */
+	if (len != sizeof(struct async_head_user))
+		return -EINVAL;
+	/*
+	 * Already registered?
+	 */
+	if (t->ah)
+		return -EEXIST;
+
+	return async_head_init(t, uah);
+}
+
+/**
+ * sys_async_unregister - disable async syscall support
+ */
+asmlinkage long
+sys_async_unregister(struct async_head_user __user *uah, unsigned int len)
+{
+	struct syslet_uatom __user **completion_ring;
+	struct task_struct *t = current;
+	struct async_head *ah = t->ah;
+	unsigned long ring_size_bytes;
+
+	if (len != sizeof(struct async_head_user))
+		return -EINVAL;
+	/*
+	 * Already unregistered?
+	 */
+	if (!ah)
+		return -EINVAL;
+
+	completion_ring = ah->completion_ring;
+	ring_size_bytes = ah->ring_size_bytes;
+
+	async_head_exit(ah, t);
+
+	/*
+	 * Unpin the ring:
+	 */
+	return sys_munlock((unsigned long)completion_ring, ring_size_bytes);
+}
+
+/*
+ * Simple limit and pool management mechanism for now:
+ */
+static void refill_cachemiss_pool(struct async_head *ah)
+{
+	int pid;
+
+	if (ah->nr_threads >= ah->max_nr_threads)
+		return;
+
+	init_completion(&ah->start_done);
+
+	pid = create_async_thread(cachemiss_thread, (void *)ah,
+			   CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
+			   CLONE_PTRACE | CLONE_THREAD | CLONE_SYSVSEM);
+	if (pid < 0)
+		return;
+
+	wait_for_completion(&ah->start_done);
+}
+
+/**
+ * sys_async_wait - wait for async completion events
+ *
+ * This syscall waits for @min_wait_events syslet completion events
+ * to finish or for all async processing to finish (whichever
+ * comes first).
+ */
+asmlinkage long sys_async_wait(unsigned long min_wait_events)
+{
+	struct async_head *ah = current->ah;
+
+	if (!ah)
+		return -EINVAL;
+
+	if (min_wait_events) {
+		spin_lock(&ah->lock);
+		ah->events_left = min_wait_events;
+		spin_unlock(&ah->lock);
+	}
+
+	return wait_event_interruptible(ah->wait,
+		list_empty(&ah->busy_async_threads) || !ah->events_left);
+}
+
+/**
+ * sys_async_exec - execute a syslet.
+ *
+ * returns the uatom that was last executed, if the kernel was able to
+ * execute the syslet synchronously, or NULL if the syslet became
+ * asynchronous. (in the latter case syslet completion will be notified
+ * via the completion ring)
+ *
+ * (Various errors might also be returned via the usual negative numbers.)
+ */
+asmlinkage struct syslet_uatom __user *
+sys_async_exec(struct syslet_uatom __user *uatom)
+{
+	struct syslet_uatom __user *ret;
+	struct task_struct *t = current;
+	struct async_head *ah = t->ah;
+	struct async_thread at;
+
+	if (unlikely(!ah))
+		return ERR_PTR(-EINVAL);
+
+	if (list_empty(&ah->ready_async_threads))
+		refill_cachemiss_pool(ah);
+
+	t->async_ready = &at;
+	ret = exec_atom(ah, t, uatom);
+
+	if (t->ah) {
+		WARN_ON(!t->async_ready);
+		t->async_ready = NULL;
+		return ret;
+	}
+	ret = ERR_PTR(-EINTR);
+	if (!at.exit && !signal_pending(t)) {
+		set_task_state(t, TASK_INTERRUPTIBLE);
+		mark_async_thread_ready(&at, ah);
+		cachemiss_loop(&at, ah, t);
+	}
+	if (t->ah)
+		return NULL;
+	else
+		do_exit(0);
+}
+
+/*
+ * fork()-time initialization:
+ */
+void async_init(struct task_struct *t)
+{
+	t->at = NULL;
+	t->async_ready = NULL;
+	t->ah = NULL;
+}
+
+/*
+ * do_exit()-time cleanup:
+ */
+void async_exit(struct task_struct *t)
+{
+	struct async_thread *at = t->at;
+	struct async_head *ah = t->ah;
+
+	WARN_ON(at && ah);
+	WARN_ON(t->async_ready);
+
+	if (unlikely(at))
+		async_thread_exit(at, t);
+
+	if (unlikely(ah))
+		async_head_exit(ah, t);
+}

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 06/11] syslets: core, documentation
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (70 preceding siblings ...)
  2007-02-13 14:20 ` [patch 05/11] syslets: core code Ingo Molnar
@ 2007-02-13 14:20 ` Ingo Molnar
  2007-02-13 20:18   ` Davide Libenzi
  2007-02-14 10:36   ` Russell King
  2007-02-13 14:20 ` [patch 07/11] syslets: x86, add create_async_thread() method Ingo Molnar
       [not found] ` <20061213130211.GT21847@elte.hu>
  73 siblings, 2 replies; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 14:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

From: Ingo Molnar <mingo@elte.hu>

Add Documentation/syslet-design.txt with a high-level description
of the syslet concepts.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 Documentation/syslet-design.txt |  137 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 137 insertions(+)

Index: linux/Documentation/syslet-design.txt
===================================================================
--- /dev/null
+++ linux/Documentation/syslet-design.txt
@@ -0,0 +1,137 @@
+Syslets / asynchronous system calls
+===================================
+
+started by Ingo Molnar <mingo@redhat.com>
+
+Goal:
+-----
+
+The goal of the syslet subsystem is to allow user-space to execute
+arbitrary system calls asynchronously. It does so by allowing user-space
+to execute "syslets" which are small scriptlets that the kernel can execute
+both securely and asynchronously without having to exit to user-space.
+
+The core syslet concepts are:
+
+The Syslet Atom:
+----------------
+
+The syslet atom is a small, fixed-size (44 bytes on 32-bit) piece of
+user-space memory, which is the basic unit of execution within the syslet
+framework. An atom represents a single system-call and its arguments.
+In addition it also has condition flags attached to it that allow the
+construction of larger programs (syslets) from these atoms.
+
+Arguments to the system call are implemented via pointers to arguments.
+This not only increases the flexibility of syslet atoms (multiple syslets
+can share the same variable for example), but is also an optimization:
+copy_uatom() will only fetch syscall parameters up until the point it
+meets the first NULL pointer. 50% of all syscalls have 2 or less
+parameters (and 90% of all syscalls have 4 or less parameters).
+
+ [ Note: since the argument array is at the end of the atom, and the
+   kernel will not touch any argument beyond the final NULL one, atoms
+   might be packed more tightly. (the only special case exception to
+   this rule would be SKIP_TO_NEXT_ON_STOP atoms, where the kernel will
+   jump a full syslet_uatom number of bytes.) ]
+
+The Syslet:
+-----------
+
+A syslet is a program, represented by a graph of syslet atoms. The
+syslet atoms are chained to each other either via the atom->next pointer,
+or via the SYSLET_SKIP_TO_NEXT_ON_STOP flag.
+
+Running Syslets:
+----------------
+
+Syslets can be run via the sys_async_exec() system call, which takes
+the first atom of the syslet as an argument. The kernel does not need
+to be told about the other atoms - it will fetch them on the fly as
+execution goes forward.
+
+A syslet might either be executed 'cached', or it might generate a
+'cachemiss'.
+
+'Cached' syslet execution means that the whole syslet was executed
+without blocking. The system-call returns the address of the last
+executed atom in this case.
+
+If a syslet blocks while the kernel executes a system-call embedded in
+one of its atoms, the kernel will keep working on that syscall in
+parallel, but it immediately returns to user-space with a NULL pointer,
+so the submitting task can submit other syslets.
+
+Completion of asynchronous syslets:
+-----------------------------------
+
+Completion of asynchronous syslets is done via the 'completion ring',
+which is a ringbuffer of syslet atom pointers in user-space memory,
+provided by user-space in the sys_async_register() syscall. The
+kernel fills in the ringbuffer starting at index 0, and user-space
+must clear out these pointers. Once the kernel reaches the end of
+the ring it wraps back to index 0. The kernel will not overwrite
+non-NULL pointers (but will return an error); user-space has to
+make sure it completes all events it asked for.
+
+Waiting for completions:
+------------------------
+
+Syslet completions can be waited for via the sys_async_wait()
+system call - which takes the number of events it should wait for as
+a parameter. This system call will also return if the number of
+pending events goes down to zero.
+
+Sample Hello World syslet code:
+
+--------------------------->
+/*
+ * Set up a syslet atom:
+ */
+static void
+init_atom(struct syslet_uatom *atom, int nr,
+	  void *arg_ptr0, void *arg_ptr1, void *arg_ptr2,
+	  void *arg_ptr3, void *arg_ptr4, void *arg_ptr5,
+	  void *ret_ptr, unsigned long flags, struct syslet_uatom *next)
+{
+	atom->nr = nr;
+	atom->arg_ptr[0] = arg_ptr0;
+	atom->arg_ptr[1] = arg_ptr1;
+	atom->arg_ptr[2] = arg_ptr2;
+	atom->arg_ptr[3] = arg_ptr3;
+	atom->arg_ptr[4] = arg_ptr4;
+	atom->arg_ptr[5] = arg_ptr5;
+	atom->ret_ptr = ret_ptr;
+	atom->flags = flags;
+	atom->next = next;
+}
+
+int main(int argc, char *argv[])
+{
+	unsigned long int fd_out = 1; /* standard output */
+	char *buf = "Hello Syslet World!\n";
+	unsigned long size = strlen(buf);
+	struct syslet_uatom atom, *done;
+
+	async_head_init();
+
+	/*
+	 * Simple syslet consisting of a single atom:
+	 */
+	init_atom(&atom, __NR_sys_write, &fd_out, &buf, &size,
+		  NULL, NULL, NULL, NULL, SYSLET_ASYNC, NULL);
+	done = sys_async_exec(&atom);
+	if (!done) {
+		sys_async_wait(1);
+		if (completion_ring[curr_ring_idx] == &atom) {
+			completion_ring[curr_ring_idx] = NULL;
+			printf("completed an async syslet atom!\n");
+		}
+	} else {
+		printf("completed a cached syslet atom!\n");
+	}
+
+	async_head_exit();
+
+	return 0;
+}

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch 07/11] syslets: x86, add create_async_thread() method
  2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
                   ` (71 preceding siblings ...)
  2007-02-13 14:20 ` [patch 06/11] syslets: core, documentation Ingo Molnar
@ 2007-02-13 14:20 ` Ingo Molnar
       [not found] ` <20061213130211.GT21847@elte.hu>
  73 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 14:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

From: Ingo Molnar <mingo@elte.hu>

add the create_async_thread() way of creating kernel threads:
these threads first execute a kernel function and when they
return from it they execute user-space.

An architecture must implement this interface before it can turn
CONFIG_ASYNC_SUPPORT on.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 arch/i386/kernel/entry.S     |   25 +++++++++++++++++++++++++
 arch/i386/kernel/process.c   |   31 +++++++++++++++++++++++++++++++
 include/asm-i386/processor.h |    5 +++++
 3 files changed, 61 insertions(+)

Index: linux/arch/i386/kernel/entry.S
===================================================================
--- linux.orig/arch/i386/kernel/entry.S
+++ linux/arch/i386/kernel/entry.S
@@ -996,6 +996,31 @@ ENTRY(kernel_thread_helper)
 	CFI_ENDPROC
 ENDPROC(kernel_thread_helper)
 
+ENTRY(async_thread_helper)
+	CFI_STARTPROC
+	/*
+	 * Allocate space on the stack for pt-regs.
+	 * sizeof(struct pt_regs) == 64, and we've got 8 bytes on the
+	 * kernel stack already:
+	 */
+	subl $64-8, %esp
+	CFI_ADJUST_CFA_OFFSET 64
+	movl %edx,%eax
+	push %edx
+	CFI_ADJUST_CFA_OFFSET 4
+	call *%ebx
+	addl $4, %esp
+	CFI_ADJUST_CFA_OFFSET -4
+
+	movl %eax, PT_EAX(%esp)
+
+	GET_THREAD_INFO(%ebp)
+
+	jmp syscall_exit
+	CFI_ENDPROC
+ENDPROC(async_thread_helper)
+
+
 .section .rodata,"a"
 #include "syscall_table.S"
 
Index: linux/arch/i386/kernel/process.c
===================================================================
--- linux.orig/arch/i386/kernel/process.c
+++ linux/arch/i386/kernel/process.c
@@ -352,6 +352,37 @@ int kernel_thread(int (*fn)(void *), voi
 EXPORT_SYMBOL(kernel_thread);
 
 /*
+ * This gets run with %ebx containing the
+ * function to call, and %edx containing
+ * the "args".
+ */
+extern void async_thread_helper(void);
+
+/*
+ * Create an async thread
+ */
+int create_async_thread(int (*fn)(void *), void * arg, unsigned long flags)
+{
+	struct pt_regs regs;
+
+	memset(&regs, 0, sizeof(regs));
+
+	regs.ebx = (unsigned long) fn;
+	regs.edx = (unsigned long) arg;
+
+	regs.xds = __USER_DS;
+	regs.xes = __USER_DS;
+	regs.xgs = __KERNEL_PDA;
+	regs.orig_eax = -1;
+	regs.eip = (unsigned long) async_thread_helper;
+	regs.xcs = __KERNEL_CS | get_kernel_rpl();
+	regs.eflags = X86_EFLAGS_IF | X86_EFLAGS_SF | X86_EFLAGS_PF | 0x2;
+
+	/* Ok, create the new task.. */
+	return do_fork(flags | CLONE_VM, 0, &regs, 0, NULL, NULL);
+}
+
+/*
  * Free current thread data structures etc..
  */
 void exit_thread(void)
Index: linux/include/asm-i386/processor.h
===================================================================
--- linux.orig/include/asm-i386/processor.h
+++ linux/include/asm-i386/processor.h
@@ -468,6 +468,11 @@ extern void prepare_to_copy(struct task_
  */
 extern int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags);
 
+/*
+ * create an async thread:
+ */
+extern int create_async_thread(int (*fn)(void *), void * arg, unsigned long flags);
+
 extern unsigned long thread_saved_pc(struct task_struct *tsk);
 void show_trace(struct task_struct *task, struct pt_regs *regs, unsigned long *stack);
 

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 15:00   ` Alan
@ 2007-02-13 14:58     ` Benjamin LaHaise
  2007-02-13 15:09       ` Arjan van de Ven
                         ` (3 more replies)
  2007-02-13 15:46     ` Dmitry Torokhov
                       ` (2 subsequent siblings)
  3 siblings, 4 replies; 320+ messages in thread
From: Benjamin LaHaise @ 2007-02-13 14:58 UTC (permalink / raw)
  To: Alan
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Thomas Gleixner

On Tue, Feb 13, 2007 at 03:00:19PM +0000, Alan wrote:
> > Open issues:
> 
> Let me add some more

Also: FPU state (especially important with the FPU and SSE memory copy 
variants), segment register bases on x86-64, interaction with set_fs()...  
There is no easy way of getting around the full thread context switch and 
its associated overhead (mucking around in CR0 is one of the more expensive 
bits of the context switch code path, and at the very least, setting the FPU 
not present is mandatory).  I have looked into exactly this approach, and 
it's only cheaper if the code is incomplete.  Linux's native threads are 
pretty damned good.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 14:20 ` [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support Ingo Molnar
@ 2007-02-13 15:00   ` Alan
  2007-02-13 14:58     ` Benjamin LaHaise
                       ` (3 more replies)
  2007-02-13 20:22   ` Davide Libenzi
                     ` (5 subsequent siblings)
  6 siblings, 4 replies; 320+ messages in thread
From: Alan @ 2007-02-13 15:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

> A syslet is executed opportunistically: i.e. the syslet subsystem 
> assumes that the syslet will not block, and it will switch to a 
> cachemiss kernel thread from the scheduler. This means that even a 

How is scheduler fairness maintained ? and what is done for resource
accounting here ?

> that the kernel fills and user-space clears. Waiting is done via the 
> sys_async_wait() system call. Completion can be supressed on a per-atom 

They should be selectable as well iff possible.

> Open issues:

Let me add some more

	sys_setuid/gid/etc need to be synchronous only and not occur
while other async syscalls are running in parallel to meet current kernel
assumptions.

	sys_exec and other security boundaries must be synchronous only
and not allow async "spill over" (consider setuid async binary patching)

>  - sys_fork() and sys_async_exec() should be filtered out from the 
>    syscalls that are allowed - first one only makes sense with ptregs, 

clone and vfork. async_vfork is a real mindbender actually.

>    second one is a nice kernel recursion thing :) I didnt want to 
>    duplicate the sys_call_table though - maybe others have a better 
>    idea.

What are the semantics of async sys_async_wait and async sys_async ?


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 14:58     ` Benjamin LaHaise
@ 2007-02-13 15:09       ` Arjan van de Ven
  2007-02-13 16:24       ` bert hubert
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 320+ messages in thread
From: Arjan van de Ven @ 2007-02-13 15:09 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Alan, Ingo Molnar, linux-kernel, Linus Torvalds,
	Christoph Hellwig, Andrew Morton, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Thomas Gleixner

On Tue, 2007-02-13 at 09:58 -0500, Benjamin LaHaise wrote:
> On Tue, Feb 13, 2007 at 03:00:19PM +0000, Alan wrote:
> > > Open issues:
> > 
> > Let me add some more
> 
> Also: FPU state (especially important with the FPU and SSE memory copy 
> variants)

are these preserved over explicit system calls? 
-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 15:00   ` Alan
  2007-02-13 14:58     ` Benjamin LaHaise
@ 2007-02-13 15:46     ` Dmitry Torokhov
  2007-02-13 20:39       ` Ingo Molnar
  2007-02-13 16:39     ` Andi Kleen
  2007-02-13 16:42     ` Ingo Molnar
  3 siblings, 1 reply; 320+ messages in thread
From: Dmitry Torokhov @ 2007-02-13 15:46 UTC (permalink / raw)
  To: Alan
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

On 2/13/07, Alan <alan@lxorguk.ukuu.org.uk> wrote:
> > A syslet is executed opportunistically: i.e. the syslet subsystem
> > assumes that the syslet will not block, and it will switch to a
> > cachemiss kernel thread from the scheduler. This means that even a
>
> How is scheduler fairness maintained ? and what is done for resource
> accounting here ?
>
> > that the kernel fills and user-space clears. Waiting is done via the
> > sys_async_wait() system call. Completion can be supressed on a per-atom
>
> They should be selectable as well iff possible.
>
> > Open issues:
>
> Let me add some more
>
>        sys_setuid/gid/etc need to be synchronous only and not occur
> while other async syscalls are running in parallel to meet current kernel
> assumptions.
>
>        sys_exec and other security boundaries must be synchronous only
> and not allow async "spill over" (consider setuid async binary patching)
>
> >  - sys_fork() and sys_async_exec() should be filtered out from the
> >    syscalls that are allowed - first one only makes sense with ptregs,
>
> clone and vfork. async_vfork is a real mindbender actually.
>
> >    second one is a nice kernel recursion thing :) I didnt want to
> >    duplicate the sys_call_table though - maybe others have a better
> >    idea.
>
> What are the semantics of async sys_async_wait and async sys_async ?
>

Ooooohh. OpenVMS lives forever ;) Me likeee ;)

-- 
Dmitry

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 14:58     ` Benjamin LaHaise
  2007-02-13 15:09       ` Arjan van de Ven
@ 2007-02-13 16:24       ` bert hubert
  2007-02-13 16:56       ` Ingo Molnar
  2007-02-13 20:34       ` Ingo Molnar
  3 siblings, 0 replies; 320+ messages in thread
From: bert hubert @ 2007-02-13 16:24 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Alan, Ingo Molnar, linux-kernel, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Ulrich Drepper, Zach Brown, Evgeniy Polyakov, David S. Miller,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

On Tue, Feb 13, 2007 at 09:58:48AM -0500, Benjamin LaHaise wrote:

> not present is mandatory).  I have looked into exactly this approach, and 
> it's only cheaper if the code is incomplete.  Linux's native threads are 
> pretty damned good.

Cheaper in time or in memory? Iow, would you be able to queue up as many
threads as syslets?

	Bert

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://netherlabs.nl              Open and Closed source services

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 16:39     ` Andi Kleen
@ 2007-02-13 16:26       ` Linus Torvalds
  2007-02-13 17:03         ` Ingo Molnar
  2007-02-13 20:26         ` Davide Libenzi
  2007-02-13 16:49       ` Ingo Molnar
  1 sibling, 2 replies; 320+ messages in thread
From: Linus Torvalds @ 2007-02-13 16:26 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alan, Ingo Molnar, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner



On Tue, 13 Feb 2007, Andi Kleen wrote:

> > 	sys_exec and other security boundaries must be synchronous only
> > and not allow async "spill over" (consider setuid async binary patching)
> 
> He probably would need some generalization of Andrea's seccomp work.
> Perhaps using bitmaps? For paranoia I would suggest to white list, not black list
> calls.

It's actually most likely a lot more efficient to let the system call 
itself do the sanity checking. That allows the common system calls (that 
*don't* need to even check) to just not do anything at all, instead of 
having some complex logic in the common system call execution trying to 
figure out for each system call whether it is ok or not.

Ie, we could just add to "do_fork()" (which is where all of the 
vfork/clone/fork cases end up) a simple case like

	err = wait_async_context();
	if (err)
		return err;

or

	if (in_async_context())
		return -EINVAL;

or similar. We need that "async_context()" function anyway for the other 
cases where we can't do other things concurrently, like changing the UID.

I would suggest that "wait_async_context()" would do:

 - if we are *in* an async context, return an error. We cannot wait for 
   ourselves!
 - if we are the "real thread", wait for all async contexts to go away 
   (and since we are the real thread, no new ones will be created, so this 
   is not going to be an infinite wait)

The new thing would be that wait_async_context() would possibly return 
-ERESTARTSYS (signal while an async context was executing), so any system 
call that does this would possibly return EINTR. Which "fork()" hasn't 
historically done. But if you have async events active, some operations 
likely cannot be done (setuid() and execve() come to mind), so you really 
do need something like this.

And obviously it would only affect any program that actually would _use_ 
any of the suggested new interfaces, so it's not like a new error return 
would break anything old.
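
A minimal sketch of how those two helpers could look - ah->wait and 
ah->busy_async_threads follow the posted syslet patches, while the 
ah->user_task field marking the "real" head thread is purely hypothetical 
here:

	static inline int in_async_context(void)
	{
		struct async_head *ah = current->ah;

		/* async contexts share ->ah but are not the head thread: */
		return ah && current != ah->user_task;
	}

	static int wait_async_context(void)
	{
		struct async_head *ah = current->ah;

		if (in_async_context())
			return -EINVAL;	/* cannot wait for ourselves */
		if (!ah)
			return 0;	/* no async activity at all */

		/* head thread: wait for all async contexts to go away */
		return wait_event_interruptible(ah->wait,
				list_empty(&ah->busy_async_threads));
	}

The -ERESTARTSYS case mentioned above falls out of 
wait_event_interruptible() naturally.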

		Linus

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 15:00   ` Alan
  2007-02-13 14:58     ` Benjamin LaHaise
  2007-02-13 15:46     ` Dmitry Torokhov
@ 2007-02-13 16:39     ` Andi Kleen
  2007-02-13 16:26       ` Linus Torvalds
  2007-02-13 16:49       ` Ingo Molnar
  2007-02-13 16:42     ` Ingo Molnar
  3 siblings, 2 replies; 320+ messages in thread
From: Andi Kleen @ 2007-02-13 16:39 UTC (permalink / raw)
  To: Alan
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

Alan <alan@lxorguk.ukuu.org.uk> writes:

Funny, it sounds like batch() on steroids @) Ok, with an async context it becomes
somewhat more interesting.
 
> 	sys_setuid/gid/etc need to be synchronous only and not occur
> while other async syscalls are running in parallel to meet current kernel
> assumptions.
> 
> 	sys_exec and other security boundaries must be synchronous only
> and not allow async "spill over" (consider setuid async binary patching)

He probably would need some generalization of Andrea's seccomp work.
Perhaps using bitmaps? For paranoia I would suggest to white list, not black list
calls.

-Andi

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 15:00   ` Alan
                       ` (2 preceding siblings ...)
  2007-02-13 16:39     ` Andi Kleen
@ 2007-02-13 16:42     ` Ingo Molnar
  3 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 16:42 UTC (permalink / raw)
  To: Alan
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner


* Alan <alan@lxorguk.ukuu.org.uk> wrote:

> > A syslet is executed opportunistically: i.e. the syslet subsystem 
> > assumes that the syslet will not block, and it will switch to a 
> > cachemiss kernel thread from the scheduler. This means that even a
> 
> How is scheduler fairness maintained ? and what is done for resource 
> accounting here ?

the async threads are as if the user had created user-space threads - and 
they are accounted (and scheduled) accordingly.

> > that the kernel fills and user-space clears. Waiting is done via the 
> > sys_async_wait() system call. Completion can be supressed on a 
> > per-atom
> 
> They should be selectable as well iff possible.

basically arbitrary notification interfaces are supported. For example 
if you add a sys_kill() call as the last syslet atom then this will 
notify any waiter in sigwait().

or if you want to select(), just do it in the fds that you are 
interested in, and the write that the syslet does triggers select() 
completion.

but the fastest one will be by using syslets: to just check the 
notification ring pointer in user-space, and then call into 
sys_async_wait() if the ring is empty.
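
A rough user-space sketch of that fast path (using the two-parameter 
sys_async_wait() from the fix below; ring[], RING_SIZE, curr_idx and 
handle_completion() are made-up names for illustration, and the raw 
syscall would go through syscall() in a real program):

	/* drain completions without entering the kernel: */
	while (ring[curr_idx]) {
		struct syslet_uatom *done = ring[curr_idx];

		handle_completion(done);
		ring[curr_idx] = NULL;			/* give the slot back */
		curr_idx = (curr_idx + 1) % RING_SIZE;
	}
	/* ring empty - block until at least one more completion arrives: */
	sys_async_wait(1, curr_idx);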

I just noticed a small bug here: sys_async_wait() should also take the 
ring index userspace checked as a second parameter, and fix up the 
number of events it waits for with the delta between the ring index the 
kernel maintains and the ring index user-space has. The patch below 
fixes this bug.

> > Open issues:
> 
> Let me add some more
> 
> 	sys_setuid/gid/etc need to be synchronous only and not occur 
> while other async syscalls are running in parallel to meet current 
> kernel assumptions.

these should probably be taken out of the 'async syscall table', along 
with fork and the async syscalls themselves.

> 	sys_exec and other security boundaries must be synchronous 
> only and not allow async "spill over" (consider setuid async binary 
> patching)

i've tested sys_exec() and it seems to work, but i might have missed 
some corner-cases. (And what you raise is not academic, it might even 
make sense to do it, in the vfork() way.)

> >  - sys_fork() and sys_async_exec() should be filtered out from the 
> >    syscalls that are allowed - first one only makes sense with ptregs, 
> 
> clone and vfork. async_vfork is a real mindbender actually.

yeah. Also, create_module() perhaps. I'm starting to lean towards an 
async_syscall_table[]. At which point we could reduce the max syslet 
parameter count to 4, and do those few 5 and 6 parameter syscalls (of 
which only splice() and futex() truly matter i suspect) via wrappers. 
This would fit a syslet atom into 32 bytes on x86. Hm?

> >    second one is a nice kernel recursion thing :) I didnt want to 
> >    duplicate the sys_call_table though - maybe others have a better 
> >    idea.
> 
> What are the semantics of async sys_async_wait and async sys_async ?

agreed, that should be forbidden too.

	Ingo

---------------------->
---
 kernel/async.c |   12 +++++++++---
 kernel/async.h |    2 +-
 2 files changed, 10 insertions(+), 4 deletions(-)

Index: linux/kernel/async.c
===================================================================
--- linux.orig/kernel/async.c
+++ linux/kernel/async.c
@@ -721,7 +721,8 @@ static void refill_cachemiss_pool(struct
  * to finish or for all async processing to finish (whichever
  * comes first).
  */
-asmlinkage long sys_async_wait(unsigned long min_wait_events)
+asmlinkage long
+sys_async_wait(unsigned long min_wait_events, unsigned long user_curr_ring_idx)
 {
 	struct async_head *ah = current->ah;
 
@@ -730,12 +731,17 @@ asmlinkage long sys_async_wait(unsigned 
 
 	if (min_wait_events) {
 		spin_lock(&ah->lock);
-		ah->events_left = min_wait_events;
+		/*
+		 * Account any completions that happened since user-space
+		 * checked the ring:
+	 	 */
+		ah->events_left = min_wait_events -
+				(ah->curr_ring_idx - user_curr_ring_idx);
 		spin_unlock(&ah->lock);
 	}
 
 	return wait_event_interruptible(ah->wait,
-		list_empty(&ah->busy_async_threads) || !ah->events_left);
+		list_empty(&ah->busy_async_threads) || ah->events_left > 0);
 }
 
 /**
Index: linux/kernel/async.h
===================================================================
--- linux.orig/kernel/async.h
+++ linux/kernel/async.h
@@ -26,7 +26,7 @@ struct async_head {
 	struct list_head			ready_async_threads;
 	struct list_head			busy_async_threads;
 
-	unsigned long				events_left;
+	long					events_left;
 	wait_queue_head_t			wait;
 
 	struct async_head_user	__user		*uah;

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 16:39     ` Andi Kleen
  2007-02-13 16:26       ` Linus Torvalds
@ 2007-02-13 16:49       ` Ingo Molnar
  1 sibling, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 16:49 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alan, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner


* Andi Kleen <andi@firstfloor.org> wrote:

> > 	sys_exec and other security boundaries must be synchronous 
> > only and not allow async "spill over" (consider setuid async binary 
> > patching)
> 
> He probably would need some generalization of Andrea's seccomp work. 
> Perhaps using bitmaps? For paranoia I would suggest to white list, not 
> black list calls.

what i've implemented in my tree is sys_async_call_table[] which is a 
copy of sys_call_table[] with certain entries modified (by architecture 
level code, not by kernel/async.c) to sys_ni_syscall(). It's up to the 
architecture to decide which syscalls are allowed.

but i could use a bitmap too - whatever linear construct. [ I'm not sure 
there's much connection to seccomp - seccomp uses a NULL terminated 
whitelist - while syslets would use most of the entries (and would not 
want to have the overhead of checking a blacklist). ]
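
(illustrative sketch only, not the actual patch code - __NR_async_exec is 
a hypothetical syscall number - but the arch-level setup would boil down 
to something like:)

	/* start from a verbatim copy of the regular table ... */
	for (i = 0; i < NR_syscalls; i++)
		sys_async_call_table[i] = sys_call_table[i];

	/* ... then knock out the entries that must stay synchronous: */
	sys_async_call_table[__NR_fork]       = sys_ni_syscall;
	sys_async_call_table[__NR_async_exec] = sys_ni_syscall;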

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 14:58     ` Benjamin LaHaise
  2007-02-13 15:09       ` Arjan van de Ven
  2007-02-13 16:24       ` bert hubert
@ 2007-02-13 16:56       ` Ingo Molnar
  2007-02-13 18:56         ` Evgeniy Polyakov
  2007-02-13 20:34       ` Ingo Molnar
  3 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 16:56 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Alan, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Thomas Gleixner


* Benjamin LaHaise <bcrl@kvack.org> wrote:

> > > Open issues:
> > 
> > Let me add some more
> 
> Also: FPU state (especially important with the FPU and SSE memory copy 
> variants), segment register bases on x86-64, interaction with 
> set_fs()...

agreed - i'll fix this. But i can see no big conceptual issue here - 
these resources are all attached to the user context, and that doesnt 
change upon an 'async context-switch'. So it's "only" a matter of 
properly separating the user execution context from the kernel execution 
context. The hardest bit was getting the ptregs details right - the 
FPU/SSE state is pretty much async already (in the hardware too) and 
isnt even touched by any of these codepaths.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 16:26       ` Linus Torvalds
@ 2007-02-13 17:03         ` Ingo Molnar
  2007-02-13 20:26         ` Davide Libenzi
  1 sibling, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 17:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Alan, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Ie, we could just add to "do_fork()" (which is where all of the 
> vfork/clone/fork cases end up) a simple case like
> 
> 	err = wait_async_context();
> 	if (err)
> 		return err;
> 
> or
> 
> 	if (in_async_context())
> 		return -EINVAL;

ok, this is a much nicer solution. I've scrapped the 
sys_async_sys_call_table[] thing.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 16:56       ` Ingo Molnar
@ 2007-02-13 18:56         ` Evgeniy Polyakov
  2007-02-13 19:12           ` Evgeniy Polyakov
  2007-02-13 22:18           ` Ingo Molnar
  0 siblings, 2 replies; 320+ messages in thread
From: Evgeniy Polyakov @ 2007-02-13 18:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Benjamin LaHaise, Alan, linux-kernel, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Ulrich Drepper, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

On Tue, Feb 13, 2007 at 05:56:42PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Benjamin LaHaise <bcrl@kvack.org> wrote:
> 
> > > > Open issues:
> > > 
> > > Let me add some more
> > 
> > Also: FPU state (especially important with the FPU and SSE memory copy 
> > variants), segment register bases on x86-64, interaction with 
> > set_fs()...
> 
> agreed - i'll fix this. But i can see no big conceptual issue here - 
> these resources are all attached to the user context, and that doesnt 
> change upon an 'async context-switch'. So it's "only" a matter of 
> properly separating the user execution context from the kernel execution 
> context. The hardest bit was getting the ptregs details right - the 
> FPU/SSE state is pretty much async already (in the hardware too) and 
> isnt even touched by any of these codepaths.

Good work, Ingo.

I have not received first mail with announcement yet, so I will place 
my thoughts here if you do not mind.

The first one is per-thread data like the TID. What about TLS-related kernel
data (is the non-exec stack property stored in the TLS block or in the kernel)?
Should it be copied with the regs too (or better, introduce a new clone flag
which would force that info to be copied)?

Btw, is SSE/MMX/call-it-what-you-will state really saved on a context switch?
As far as I can see, no syscalls (and the kernel in general) use those registers.

Another one is a more global AIO question - while this approach IMHO
outperforms the micro-thread design (Zach and Linus created really good
starting points, but they too have a fundamental limiting factor), it
still has a problem - when a syscall blocks, the same thread is not
allowed to continue execution and fill the pipe - so what if the
system issues thousands of requests and there are only tens of working
threads at most? What Tux did, as far as I recall (and some other similar
state machines do :), was to break off the blocking syscall and return
to the next execution entity (the next syslet or atom). Is it possible to
extend exactly this state machine and interface to allow that (so that
other state machine implementations would not have to continue their life :)?

> 	Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 18:56         ` Evgeniy Polyakov
@ 2007-02-13 19:12           ` Evgeniy Polyakov
  2007-02-13 22:19             ` Ingo Molnar
  2007-02-13 22:18           ` Ingo Molnar
  1 sibling, 1 reply; 320+ messages in thread
From: Evgeniy Polyakov @ 2007-02-13 19:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Benjamin LaHaise, Alan, linux-kernel, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Ulrich Drepper, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

> I have not received first mail with announcement yet, so I will place 
> my thoughts here if you do not mind.

An issue with sys_async_wait():
is it possible that events_left will be set up too late, so that all
events are already ready and thus sys_async_wait() can wait forever
(or until the next sys_async_wait()'s events are ready)?

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 02/11] syslets: add syslet.h include file, user API/ABI definitions
  2007-02-13 14:20 ` [patch 02/11] syslets: add syslet.h include file, user API/ABI definitions Ingo Molnar
@ 2007-02-13 20:17   ` Indan Zupancic
  2007-02-13 21:43     ` Ingo Molnar
  2007-02-19  0:22   ` Paul Mackerras
  1 sibling, 1 reply; 320+ messages in thread
From: Indan Zupancic @ 2007-02-13 20:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper

On Tue, February 13, 2007 15:20, Ingo Molnar wrote:
> +/*
> + * Execution control: conditions upon the return code
> + * of the previous syslet atom. 'Stop' means syslet
> + * execution is stopped and the atom is put into the
> + * completion ring:
> + */
> +#define SYSLET_STOP_ON_NONZERO			0x00000008
> +#define SYSLET_STOP_ON_ZERO			0x00000010
> +#define SYSLET_STOP_ON_NEGATIVE			0x00000020
> +#define SYSLET_STOP_ON_NON_POSITIVE		0x00000040

This is confusing. Why the return code of the previous syslet atom?
Wouldn't it be more clear if the flag was for the current tasklet?
Worse, what is the previous atom? Imagine some case with a loop:

  A
  |
  B<--.
  |   |
  C---'

What will be the previous atom of B here? It can be either A or C,
but their return values can be different and incompatible, so what
flag should B set?

> +/*
> + * Special modifier to 'stop' handling: instead of stopping the
> + * execution of the syslet, the linearly next syslet is executed.
> + * (Normal execution flows along atom->next, and execution stops
> + *  if atom->next is NULL or a stop condition becomes true.)
> + *
> + * This is what allows true branches of execution within syslets.
> + */
> +#define SYSLET_SKIP_TO_NEXT_ON_STOP		0x00000080
> +

Might rename this to SYSLET_SKIP_NEXT_ON_STOP too then.

Greetings,

Indan




^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 06/11] syslets: core, documentation
  2007-02-13 14:20 ` [patch 06/11] syslets: core, documentation Ingo Molnar
@ 2007-02-13 20:18   ` Davide Libenzi
  2007-02-13 21:34     ` Ingo Molnar
  2007-02-14 10:36   ` Russell King
  1 sibling, 1 reply; 320+ messages in thread
From: Davide Libenzi @ 2007-02-13 20:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Thomas Gleixner


Wow! You really helped Zach out ;)



On Tue, 13 Feb 2007, Ingo Molnar wrote:

> +The Syslet Atom:
> +----------------
> +
> +The syslet atom is a small, fixed-size (44 bytes on 32-bit) piece of
> +user-space memory, which is the basic unit of execution within the syslet
> +framework. A syslet represents a single system-call and its arguments.
> +In addition it also has condition flags attached to it that allows the
> +construction of larger programs (syslets) from these atoms.
> +
> +Arguments to the system call are implemented via pointers to arguments.
> +This not only increases the flexibility of syslet atoms (multiple syslets
> +can share the same variable for example), but is also an optimization:
> +copy_uatom() will only fetch syscall parameters up until the point it
> +meets the first NULL pointer. 50% of all syscalls have 2 or less
> +parameters (and 90% of all syscalls have 4 or less parameters).

Why do you need to have an extra memory indirection per parameter in 
copy_uatom()? It also forces you to have parameters pointed-to, to be 
"long" (or pointers), instead of their natural POSIX type (like fd being 
"int" for example). Also, you need to have array pointers (think about a 
"char buf[];" passed to an async read(2)) to be saved into a pointer 
variable, and pass the pointer of the latter to the async system. Same for 
all structures (ie. stat(2) "struct stat"). Let them be real argouments 
and add a nparams argoument to the structure:

struct syslet_atom {
       unsigned long                       flags;
       unsigned int                        nr;
       unsigned int                        nparams;
       long __user                         *ret_ptr;
       struct syslet_uatom     __user      *next;
       unsigned long                       args[6];
};

I can understand that chaining syscalls requires variable sharing, but the 
majority of the parameters passed to syscalls are just direct ones.
Maybe a smart method that allows you to know if a parameter is a direct 
one or a pointer to one? An "unsigned int pmap" where bit N is 1 if param 
N is an indirection? Hmm?
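
Something along these lines in the atom-copying code, just to sketch the 
idea (pmap being the hypothetical bitmap proposed above):

	for (i = 0; i < atom->nparams; i++) {
		unsigned long v = atom->args[i];

		if (atom->pmap & (1UL << i)) {
			/* bit set: args[i] holds a pointer to the value */
			if (get_user(v, (unsigned long __user *) v))
				return -EFAULT;
		}
		args[i] = v;
	}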





> +Running Syslets:
> +----------------
> +
> +Syslets can be run via the sys_async_exec() system call, which takes
> +the first atom of the syslet as an argument. The kernel does not need
> +to be told about the other atoms - it will fetch them on the fly as
> +execution goes forward.
> +
> +A syslet might either be executed 'cached', or it might generate a
> +'cachemiss'.
> +
> +'Cached' syslet execution means that the whole syslet was executed
> +without blocking. The system-call returns the submitted atom's address
> +in this case.
> +
> +If a syslet blocks while the kernel executes a system-call embedded in
> +one of its atoms, the kernel will keep working on that syscall in
> +parallel, but it immediately returns to user-space with a NULL pointer,
> +so the submitting task can submit other syslets.
> +
> +Completion of asynchronous syslets:
> +-----------------------------------
> +
> +Completion of asynchronous syslets is done via the 'completion ring',
> +which is a ringbuffer of syslet atom pointers in user-space memory,
> +provided by user-space in the sys_async_register() syscall. The
> +kernel fills in the ringbuffer starting at index 0, and user-space
> +must clear out these pointers. Once the kernel reaches the end of
> +the ring it wraps back to index 0. The kernel will not overwrite
> +non-NULL pointers (but will return an error), user-space has to
> +make sure it completes all events it asked for.

Sigh, I really dislike shared userspace/kernel stuff when we're 
transferring pointers to userspace. Did you actually bench it against a:

int async_wait(struct syslet_uatom **r, int n);

I can fully understand sharing userspace buffers with the kernel if we're 
talking about KBs transferred during a block or net I/O DMA operation, but 
for transferring a pointer? Behind each pointer transfer (4/8 bytes) there 
is a whole syscall execution, which makes the 4/8 byte transfers have a 
relative cost of 0.01% *maybe*. A different case is an O_DIRECT read of 16KB 
of data, where the memory transfer has a relative cost, compared to the 
syscall, that can be pretty high. The syscall-saving argument is moot too, 
because syscalls are cheap, and if there's a lot of async traffic, you'll be 
fetching lots of completions to keep your dispatch loop pretty busy for a 
while.
And the API is *certainly* cleaner.



- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 14:20 ` [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support Ingo Molnar
  2007-02-13 15:00   ` Alan
@ 2007-02-13 20:22   ` Davide Libenzi
  2007-02-13 21:24     ` Davide Libenzi
  2007-02-13 21:57     ` Ingo Molnar
  2007-02-14  3:28   ` Davide Libenzi
                     ` (4 subsequent siblings)
  6 siblings, 2 replies; 320+ messages in thread
From: Davide Libenzi @ 2007-02-13 20:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Thomas Gleixner

On Tue, 13 Feb 2007, Ingo Molnar wrote:

> As it might be obvious to some of you, the syslet subsystem takes many 
> ideas and experience from my Tux in-kernel webserver :) The syslet code 
> originates from a heavy rewrite of the Tux-atom and the Tux-cachemiss 
> infrastructure.
> 
> Open issues:
> 
>  - the 'TID' of the 'head' thread currently varies depending on which 
>    thread is running the user-space context.
> 
>  - signal support is not fully thought through - probably the head 
>    should be getting all of them - the cachemiss threads are not really 
>    interested in executing signal handlers.
> 
>  - sys_fork() and sys_async_exec() should be filtered out from the 
>    syscalls that are allowed - first one only makes sense with ptregs, 
>    second one is a nice kernel recursion thing :) I didnt want to 
>    duplicate the sys_call_table though - maybe others have a better 
>    idea.

If this is going to be a generic AIO subsystem:

- Cancellation of pending request



- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 16:26       ` Linus Torvalds
  2007-02-13 17:03         ` Ingo Molnar
@ 2007-02-13 20:26         ` Davide Libenzi
  1 sibling, 0 replies; 320+ messages in thread
From: Davide Libenzi @ 2007-02-13 20:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Alan, Ingo Molnar, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Ulrich Drepper, Zach Brown, Evgeniy Polyakov, David S. Miller,
	Benjamin LaHaise, Suparna Bhattacharya, Thomas Gleixner

On Tue, 13 Feb 2007, Linus Torvalds wrote:

> 	if (in_async_context())
> 		return -EINVAL;
> 
> or similar. We need that "async_context()" function anyway for the other 
> cases where we can't do other things concurrently, like changing the UID.

Yes, that's definitely better. Let's have the policy about whether a 
syscall is or is not async-enabled inside the syscall itself. Simplifies 
things a lot.



- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 14:58     ` Benjamin LaHaise
                         ` (2 preceding siblings ...)
  2007-02-13 16:56       ` Ingo Molnar
@ 2007-02-13 20:34       ` Ingo Molnar
  3 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 20:34 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Alan, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Thomas Gleixner


* Benjamin LaHaise <bcrl@kvack.org> wrote:

> [...] interaction with set_fs()...

hm, this one should already work in the current version, because 
addr_limit is in thread_info and hence stays with the async context. Or 
can you see any hole in it?

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 15:46     ` Dmitry Torokhov
@ 2007-02-13 20:39       ` Ingo Molnar
  2007-02-13 22:36         ` Dmitry Torokhov
  2007-02-14 11:07         ` Alan
  0 siblings, 2 replies; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 20:39 UTC (permalink / raw)
  To: Dmitry Torokhov
  Cc: Alan, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner


* Dmitry Torokhov <dmitry.torokhov@gmail.com> wrote:

> > What are the semantics of async sys_async_wait and async sys_async ?
> 
> Ooooohh. OpenVMS lives forever ;) Me likeee ;)

hm, i dont know OpenVMS - but googled around a bit for 'VMS 
asynchronous' and it gave me this:

  http://en.wikipedia.org/wiki/Asynchronous_system_trap

is AST what you mean? From a quick read AST seems to be a signal 
mechanism a bit like Unix signals, extended to kernel-space as well - 
while syslets are a different 'safe execution engine' kind of thing 
centered around the execution of system calls.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 20:22   ` Davide Libenzi
@ 2007-02-13 21:24     ` Davide Libenzi
  2007-02-13 22:10       ` Ingo Molnar
  2007-02-13 21:57     ` Ingo Molnar
  1 sibling, 1 reply; 320+ messages in thread
From: Davide Libenzi @ 2007-02-13 21:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Thomas Gleixner

On Tue, 13 Feb 2007, Davide Libenzi wrote:

> If this is going to be a generic AIO subsystem:
> 
> - Cancellation of pending request

What about the busy_async_threads list becoming a hash/rb_tree indexed by 
syslet_atom ptr. A cancel would lookup the thread and send a signal (of 
course, signal handling of the async threads should be set properly)?



- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 06/11] syslets: core, documentation
  2007-02-13 20:18   ` Davide Libenzi
@ 2007-02-13 21:34     ` Ingo Molnar
  2007-02-13 23:21       ` Davide Libenzi
  0 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 21:34 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Linux Kernel Mailing List, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Thomas Gleixner


* Davide Libenzi <davidel@xmailserver.org> wrote:

> > +The Syslet Atom:
> > +----------------
> > +
> > +The syslet atom is a small, fixed-size (44 bytes on 32-bit) piece of
> > +user-space memory, which is the basic unit of execution within the syslet
> > +framework. A syslet represents a single system-call and its arguments.
> > +In addition it also has condition flags attached to it that allows the
> > +construction of larger programs (syslets) from these atoms.
> > +
> > +Arguments to the system call are implemented via pointers to arguments.
> > +This not only increases the flexibility of syslet atoms (multiple syslets
> > +can share the same variable for example), but is also an optimization:
> > +copy_uatom() will only fetch syscall parameters up until the point it
> > +meets the first NULL pointer. 50% of all syscalls have 2 or less
> > +parameters (and 90% of all syscalls have 4 or less parameters).
> 
> Why do you need to have an extra memory indirection per parameter in 
> copy_uatom()? [...]

yes. Try to use them in real programs, and you'll see that most of the 
time the variable an atom wants to access should also be accessed by 
other atoms. For example a socket file descriptor - one atom opens it, 
another one reads from it, a third one closes it. By having the 
parameters in the atoms we'd have to copy the fd to two other places.

but i see your point: i actually had it like that in my earlier 
versions, only changed it to an indirect method later on, when writing 
more complex syslets. And, surprisingly, performance of atom handling 
/improved/ on both Intel and AMD CPUs when i added indirection, because 
the indirection enables the 'tail NULL' optimization. (which wasnt the 
goal of indirection, it was just a side-effect)

> [...] It also forces you to have parameters pointed-to, to be "long" 
> (or pointers), instead of their natural POSIX type (like fd being 
> "int" for example). [...]

this wasnt a big problem while coding syslets. I'd also not expect 
application writers having to do these things on the syscall level - 
this is a system interface after all. But you do have a point.

> I can understand that chaining syscalls requires variable sharing, but 
> the majority of the parameters passed to syscalls are just direct 
> ones. Maybe a smart method that allows you to know if a parameter is a 
> direct one or a pointer to one? An "unsigned int pmap" where bit N is 
> 1 if param N is an indirection? Hmm?

adding such things tends to slow down atom parsing.

there's another reason as well: i wanted syslets to be like 
'instructions' - i.e. not self-modifying. If the fd parameter is 
embedded in the syslet then every syslet has to be replicated.

note that chaining does not necessarily require variable sharing: a 
sys_umem_add() atom could be used to modify the next syslet's ->fd 
parameter. So for example

	sys_open() -> returns 'fd'
        sys_umem_add(&atom1->fd) <= atom1->fd is 0 initially
        sys_umem_add(&atom2->fd) <= the first umem_add returns the value
        atom1 [uses fd]
        atom2 [uses fd]

but i didnt like this approach: this means 1 more atom per indirect 
parameter, and quite some trickery to put the right information into the 
right place. Furthermore, this makes syslets very much tied to the 
'register contents' - instead of them being 'pure instructions/code'.

> > +Completion of asynchronous syslets:
> > +-----------------------------------
> > +
> > +Completion of asynchronous syslets is done via the 'completion ring',
> > +which is a ringbuffer of syslet atom pointers in user-space memory,
> > +provided by user-space in the sys_async_register() syscall. The
> > +kernel fills in the ringbuffer starting at index 0, and user-space
> > +must clear out these pointers. Once the kernel reaches the end of
> > +the ring it wraps back to index 0. The kernel will not overwrite
> > +non-NULL pointers (but will return an error), user-space has to
> > +make sure it completes all events it asked for.
> 
> Sigh, I really dislike shared userspace/kernel stuff when we're 
> transferring pointers to userspace. Did you actually bench it against 
> a:
> 
> int async_wait(struct syslet_uatom **r, int n);
> 
> I can fully understand sharing userspace buffers with the kernel if 
> we're talking about KBs transferred during a block or net I/O DMA 
> operation, but for transferring a pointer? Behind each pointer 
> transfer (4/8 bytes) there is a whole syscall execution, [...]

there are three main reasons for this choice:

- firstly, by putting completion events into the user-space ringbuffer
  the asynchronous contexts are not held up at all, and the threads are
  available for further syslet use.

- secondly, it was the most obvious and simplest solution to me - it 
  just fits well into the syslet model - which is an execution concept 
  centered around pure user-space memory and system calls, not some 
  kernel resource. Kernel fills in the ringbuffer, user-space clears it. 
  If we had to worry about a handshake between user-space and 
  kernel-space for the completion information to be passed along, that 
  would either mean extra buffering or extra overhead. Extra buffering 
  (in the kernel) would be for no good reason: why not buffer it in the 
  place where the information is destined for in the first place. The 
  ringbuffer of /pointers/ is what makes this really powerful. I never 
  really liked the AIO/etc. method of /event buffer/ rings. With syslets 
  the 'cookie' is the pointer to the syslet atom itself. It doesnt get 
  any more straightforward than that i believe.

- making 'is there more stuff for me to work on' a simple instruction in
  user-space makes it a no-brainer for user-space to promptly and
  without thinking complete events. It's also the right thing to do on 
  SMP: if one core is solely dedicated to the asynchronous workload,
  only running in kernel mode, and the other core is only running
  user-space, why ever switch between protection domains? [except if any
  of them is idle] The fastest completion signalling method is the
  /memory bus/, not an interrupt. User-space could in theory even use
  MWAIT (in user-space!) to wait for the other core to complete stuff. 
  That makes for a hell of a fast wakeup.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 02/11] syslets: add syslet.h include file, user API/ABI definitions
  2007-02-13 20:17   ` Indan Zupancic
@ 2007-02-13 21:43     ` Ingo Molnar
  2007-02-13 22:24       ` Indan Zupancic
  0 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 21:43 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper


* Indan Zupancic <indan@nul.nu> wrote:

> > + * Execution control: conditions upon the return code
> > + * of the previous syslet atom. 'Stop' means syslet
> > + * execution is stopped and the atom is put into the
> > + * completion ring:
> > + */
> > +#define SYSLET_STOP_ON_NONZERO			0x00000008
> > +#define SYSLET_STOP_ON_ZERO			0x00000010
> > +#define SYSLET_STOP_ON_NEGATIVE			0x00000020
> > +#define SYSLET_STOP_ON_NON_POSITIVE		0x00000040
> 
> This is confusing. Why the return code of the previous syslet atom? 
> Wouldn't it be more clear if the flag was for the current tasklet? 
> Worse, what is the previous atom? [...]

the previously executed atom. (I have fixed up the comment in my tree to 
say that.)

> [...] Imagine some case with a loop:
> 
>   A
>   |
>   B<--.
>   |   |
>   C---'
> 
> What will be the previous atom of B here? It can be either A or C, but 
> their return values can be different and incompatible, so what flag 
> should B set?

previous here is the previously executed atom, which is always a 
specific atom. Think of atoms as 'instructions', and these condition 
flags as the 'CPU flags' like 'zero' 'carry' 'sign', etc. Syslets can be 
thought of as streams of simplified instructions.

> > +/*
> > + * Special modifier to 'stop' handling: instead of stopping the
> > + * execution of the syslet, the linearly next syslet is executed.
> > + * (Normal execution flows along atom->next, and execution stops
> > + *  if atom->next is NULL or a stop condition becomes true.)
> > + *
> > + * This is what allows true branches of execution within syslets.
> > + */
> > +#define SYSLET_SKIP_TO_NEXT_ON_STOP		0x00000080
> > +
> 
> Might rename this to SYSLET_SKIP_NEXT_ON_STOP too then.

but that's not what it does. It really 'skips to the next one on a stop 
event'. I.e. if you have three consecutive atoms (consecutive in linear 
memory):

	atom1 returns 0
	atom2 has SYSLET_STOP_ON_ZERO|SYSLET_SKIP_NEXT_ON_STOP set
	atom3

then after atom1 returns 0, the SYSLET_STOP_ON_ZERO condition is 
recognized as a 'stop' event - but due to the SYSLET_SKIP_NEXT_ON_STOP 
flag execution does not stop (i.e. we do not return to user-space or 
complete the syslet), but we continue execution at atom3.

this flag basically avoids having to add an atom->else pointer and keeps 
the data structure more compressed. Two-way branches are sufficiently 
rare, so i wanted to avoid the atom->else pointer.
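
in kernel-side pseudo-code the decision boils down to something like this 
(sketch only, not the code in the patch - 'last_ret' is the previously 
executed atom's return value, and only two of the stop conditions are 
shown):

	stop = ((atom->flags & SYSLET_STOP_ON_ZERO)    && last_ret == 0) ||
	       ((atom->flags & SYSLET_STOP_ON_NONZERO) && last_ret != 0);

	if (!stop)
		next_uatom = atom->next;	/* normal flow */
	else if (atom->flags & SYSLET_SKIP_TO_NEXT_ON_STOP)
		next_uatom = uatom + 1;		/* the linearly next atom */
	else
		next_uatom = NULL;		/* stop: complete the syslet */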

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 20:22   ` Davide Libenzi
  2007-02-13 21:24     ` Davide Libenzi
@ 2007-02-13 21:57     ` Ingo Molnar
  2007-02-13 22:50       ` Olivier Galibert
                         ` (3 more replies)
  1 sibling, 4 replies; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 21:57 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Linux Kernel Mailing List, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Thomas Gleixner


* Davide Libenzi <davidel@xmailserver.org> wrote:

> > Open issues:

> If this is going to be a generic AIO subsystem:
> 
> - Cancellation of pending request

How about implementing aio_cancel() as a NOP. Can anyone prove that the 
kernel didnt actually attempt to cancel that IO? [but unfortunately 
failed at doing so, because the platters were being written already.]

really, what's the point behind aio_cancel()?

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 21:24     ` Davide Libenzi
@ 2007-02-13 22:10       ` Ingo Molnar
  2007-02-13 23:28         ` Davide Libenzi
  0 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 22:10 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Linux Kernel Mailing List, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Thomas Gleixner


* Davide Libenzi <davidel@xmailserver.org> wrote:

> > If this is going to be a generic AIO subsystem:
> > 
> > - Cancellation of pending request
> 
> What about the busy_async_threads list becoming a hash/rb_tree indexed 
> by syslet_atom ptr. A cancel would lookup the thread and send a signal 
> (of course, signal handling of the async threads should be set 
> properly)?

well, each async syslet has a separate TID at the moment, so if we want 
a submitted syslet to be cancellable then we could return the TID of the 
syslet handler (instead of the NULL) in sys_async_exec(). Then 
user-space could send a signal the old-fashioned way, via sys_tkill(), 
if it so wishes.

the TID could also be used in a sys_async_wait_on() API. I.e. it would 
be a natural, readily accessible 'cookie' for the pending work. TIDs can 
be looked up lockless via RCU, so it's reasonably fast as well.

( Note that there's already a way to 'signal' pending syslets: do_exit() 
  in the user context will signal all async contexts (which results in 
  -EINTR of currently executing syscalls, wherever possible) and will 
  tear them down. But that's too crude for aio_cancel() i guess. )
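
from user-space a cancel would then look roughly like this (a sketch that 
assumes the changed return convention described above; the choice of 
signal is arbitrary and the async threads would need their signal 
handling set up for it):

	/* submission: returns the handler TID if the syslet went async: */
	long tid = sys_async_exec(atom);

	/* later, if the request should be aborted: */
	if (tid > 0)
		syscall(SYS_tkill, tid, SIGUSR1);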

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 18:56         ` Evgeniy Polyakov
  2007-02-13 19:12           ` Evgeniy Polyakov
@ 2007-02-13 22:18           ` Ingo Molnar
  2007-02-14  8:59             ` Evgeniy Polyakov
  1 sibling, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 22:18 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Benjamin LaHaise, Alan, linux-kernel, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Ulrich Drepper, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> [...] it still has a problem - when a syscall blocks, the same thread 
> is not allowed to continue execution and fill the pipe - so what if the 
> system issues thousands of requests and there are only tens of working 
> threads at most. [...]

the same thread is allowed to continue execution even if the system call 
blocks: take a look at async_schedule(). The blocked system-call is 'put 
aside' (in a sleeping thread), the kernel switches the user-space 
context (registers) to a free kernel thread and switches to it - and 
returns to user-space as if nothing happened - allowing the user-space 
context to 'fill the pipe' as much as it can. Or did i misunderstand 
your point?

basically there's SYSLET_ASYNC for 'always async' and SYSLET_SYNC for 
'always sync' - but the default syslet behavior is: 'try sync and switch 
transparently to async on demand'. The testcode i sent very much uses 
this. (and this mechanism is in essence Zach's fibril-switching thing, 
but done via kernel threads.)

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 19:12           ` Evgeniy Polyakov
@ 2007-02-13 22:19             ` Ingo Molnar
  0 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 22:19 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Benjamin LaHaise, Alan, linux-kernel, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Ulrich Drepper, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> > I have not received first mail with announcement yet, so I will place 
> > my thoughts here if you do not mind.
> 
> An issue with sys_async_wait(): is it possible that events_left will 
> be set up too late, so that all events are already ready and thus 
> sys_async_wait() can wait forever (or until the next sys_async_wait()'s 
> events are ready)?

yeah. I have fixed this up and have uploaded a newer queue to:

 http://redhat.com/~mingo/syslet-patches/

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-13 23:15   ` Andi Kleen
@ 2007-02-13 22:24     ` Ingo Molnar
  2007-02-13 22:30       ` Andi Kleen
  2007-02-13 22:57       ` Andrew Morton
  0 siblings, 2 replies; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 22:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner


* Andi Kleen <andi@firstfloor.org> wrote:

> Ingo Molnar <mingo@elte.hu> writes:
> 
> > +
> > +static struct async_thread *
> > +pick_ready_cachemiss_thread(struct async_head *ah)
> 
> The cachemiss names are confusing. I assume that's just a left over 
> from Tux?

yeah. Although 'stuff goes async' is quite similar to a cachemiss. We 
didnt have some resource available right now so the syscall has to block 
== i.e. some cache was not available.

> > +
> > +	memset(atom->args, 0, sizeof(atom->args));
> > +
> > +	ret |= __get_user(arg_ptr, &uatom->arg_ptr[0]);
> > +	if (!arg_ptr)
> > +		return ret;
> > +	if (!access_ok(VERIFY_WRITE, arg_ptr, sizeof(*arg_ptr)))
> > +		return -EFAULT;
> 
> It's a little unclear why you do that many individual access_ok()s. 
> And why is the target constant sized anyways?

each indirect pointer has to be checked separately, before dereferencing 
it. (Andrew pointed out that they should be VERIFY_READ, i fixed that in 
my tree)

it looks a bit scary in C but the assembly code is very fast and quite 
straightforward.

> +	/*
> +	 * Lock down the ring. Note: user-space should not munlock() this,
> +	 * because if the ring pages get swapped out then the async
> +	 * completion code might return a -EFAULT instead of the expected
> +	 * completion. (the kernel safely handles that case too, so this
> +	 * isnt a security problem.)
> +	 *
> +	 * mlock() is better here because it gets resource-accounted
> +	 * properly, and even unprivileged userspace has a few pages
> +	 * of mlock-able memory available. (which is more than enough
> +	 * for the completion-pointers ringbuffer)
> +	 */
> 
> If it's only a few pages you don't need any resource accounting. If 
> it's more then it's nasty to steal the users quota. I think plain 
> gup() would be better.

get_user_pages() would have to be limited in some way - and i didnt want 
to add yet another wacky limit thing - so i just used the already 
existing mlock() infrastructure for this. If Oracle wants to set up a 10 
MB ringbuffer, they can set the PAM resource limits to 11 MB and still 
have enough stuff left. And i dont really expect GPG to start using 
syslets - just yet ;-)

a single page is enough for 1024 completion pointers - that's more than 
enough for most purposes - and the default mlock limit is 40K.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 02/11] syslets: add syslet.h include file, user API/ABI definitions
  2007-02-13 21:43     ` Ingo Molnar
@ 2007-02-13 22:24       ` Indan Zupancic
  2007-02-13 22:32         ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Indan Zupancic @ 2007-02-13 22:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper

On Tue, February 13, 2007 22:43, Ingo Molnar wrote:
> * Indan Zupancic <indan@nul.nu> wrote:
>>   A
>>   |
>>   B<--.
>>   |   |
>>   C---'
>>
>> What will be the previous atom of B here? It can be either A or C, but
>> their return values can be different and incompatible, so what flag
>> should B set?
>
> previous here is the previously executed atom, which is always a
> specific atom. Think of atoms as 'instructions', and these condition
> flags as the 'CPU flags' like 'zero' 'carry' 'sign', etc. Syslets can be
> thought of as streams of simplified instructions.

In the diagram above the previously executed atom, when handling atom B,
can be either atom A or atom C. So B doesn't know what kind of return value
to expect, because it depends on the previous atom's kind of syscall, and
not on B's return type. So I think you would want to move those return value
flags one atom earlier, in this case to A and C. So each atom will have a
flag telling what to do depending on its own return value.

>> > +/*
>> > + * Special modifier to 'stop' handling: instead of stopping the
>> > + * execution of the syslet, the linearly next syslet is executed.
>> > + * (Normal execution flows along atom->next, and execution stops
>> > + *  if atom->next is NULL or a stop condition becomes true.)
>> > + *
>> > + * This is what allows true branches of execution within syslets.
>> > + */
>> > +#define SYSLET_SKIP_TO_NEXT_ON_STOP		0x00000080
>> > +
>>
>> Might rename this to SYSLET_SKIP_NEXT_ON_STOP too then.
>
> but that's not what it does. It really 'skips to the next one on a stop
> event'. I.e. if you have three consecutive atoms (consecutive in linear
> memory):
>
> 	atom1 returns 0
> 	atom2 has SYSLET_STOP_ON_ZERO|SYSLET_SKIP_NEXT_ON_STOP set
> 	atom3
>
> then after atom1 returns 0, the SYSLET_STOP_ON_ZERO condition is
> recognized as a 'stop' event - but due to the SYSLET_SKIP_NEXT_ON_STOP
> flag execution does not stop (i.e. we do not return to user-space or
> complete the syslet), but we continue execution at atom3.
>
> this flag basically avoids having to add an atom->else pointer and keeps
> the data structure more compressed. Two-way branches are sufficiently
> rare, so i wanted to avoid the atom->else pointer.

The flags are smart, they're just in the wrong place, I think.

In your example, if atom3 has a 'next' pointing to atom2, atom2 wouldn't
know which return value it's checking: The one of atom1, or the one of
atom3? You're spreading syscall specific knowledge over multiple atoms
while that isn't necessary.

What I propose:

	atom1 returns 0, has SYSLET_STOP_ON_ZERO|SYSLET_SKIP_NEXT_ON_STOP set
	atom2
	atom3

(You've already used my SYSLET_SKIP_NEXT_ON_STOP instead of
SYSLET_SKIP_TO_NEXT_ON_STOP. ;-)

Perhaps it's even clearer when splitting that SYSLET_STOP_* into a
SYSLET_STOP flag and specific SYSLET_IF_* flags. Either that, or go
all the way and introduce separate SYSLET_SKIP_NEXT_ON_* flags.

	atom1 returns 0, has SYSLET_SKIP_NEXT|SYSLET_IF_ZERO set
	atom2
	atom3

Greetings,

Indan



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 23:25       ` Andi Kleen
@ 2007-02-13 22:26         ` Ingo Molnar
  2007-02-13 22:32           ` Andi Kleen
  0 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 22:26 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, Evgeniy Polyakov, David S. Miller,
	Benjamin LaHaise, Suparna Bhattacharya, Thomas Gleixner


* Andi Kleen <andi@firstfloor.org> wrote:

> > really, what's the point behind aio_cancel()?
> 
> The main use case is when you open a file requester on a network file 
> system where the server is down and you get tired of waiting and press 
> "Cancel" it should abort the hanging IO immediately.

ok, that should work fine already - exit in the user context gets 
propagated to all async syslet contexts immediately. So if the syscalls 
that the syslet uses are reasonably interruptible, it will work out 
fine.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-13 22:24     ` Ingo Molnar
@ 2007-02-13 22:30       ` Andi Kleen
  2007-02-13 22:41         ` Ingo Molnar
  2007-02-13 22:57       ` Andrew Morton
  1 sibling, 1 reply; 320+ messages in thread
From: Andi Kleen @ 2007-02-13 22:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

On Tue, Feb 13, 2007 at 11:24:43PM +0100, Ingo Molnar wrote:
> > > +	memset(atom->args, 0, sizeof(atom->args));
> > > +
> > > +	ret |= __get_user(arg_ptr, &uatom->arg_ptr[0]);
> > > +	if (!arg_ptr)
> > > +		return ret;
> > > +	if (!access_ok(VERIFY_WRITE, arg_ptr, sizeof(*arg_ptr)))
> > > +		return -EFAULT;
> > 
> > It's a little unclear why you do that many individual access_ok()s. 
> > And why is the target constant sized anyways?
> 
> each indirect pointer has to be checked separately, before dereferencing 
> it. (Andrew pointed out that they should be VERIFY_READ, i fixed that in 
> my tree)

But why only constant sized? It could be a variable length object, couldn't it?

If it's an array it could be all checked together

(i must be missing something here) 

> > If it's only a few pages you don't need any resource accounting. If 
> > it's more then it's nasty to steal the users quota. I think plain 
> > gup() would be better.
> 
> get_user_pages() would have to be limited in some way - and i didnt want 

If you only use it for a small ring buffer it is naturally limited.

Also beancounter will fix that eventually.

> a single page is enough for 1024 completion pointers - that's more than 
> enough for most purposes - and the default mlock limit is 40K.

Then limit it to a single page and use gup

-Andi

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 22:26         ` Ingo Molnar
@ 2007-02-13 22:32           ` Andi Kleen
  2007-02-13 22:43             ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Andi Kleen @ 2007-02-13 22:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Davide Libenzi, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Thomas Gleixner

On Tue, Feb 13, 2007 at 11:26:26PM +0100, Ingo Molnar wrote:
> 
> * Andi Kleen <andi@firstfloor.org> wrote:
> 
> > > really, what's the point behind aio_cancel()?
> > 
> > The main use case is when you open a file requester on a network file 
> > system where the server is down and you get tired of waiting and press 
> > "Cancel" it should abort the hanging IO immediately.
> 
> ok, that should work fine already - exit in the user context gets 

That would be a little heavy handed. I wouldn't expect my GUI
program to quit itself on cancel. And requiring it to create a new
thread just to exit on cancel would be also nasty.

And of course you cannot interrupt blocked IOs this way right now
(currently it only works with signals in some cases on NFS)

-Andi

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 02/11] syslets: add syslet.h include file, user API/ABI definitions
  2007-02-13 22:24       ` Indan Zupancic
@ 2007-02-13 22:32         ` Ingo Molnar
  0 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 22:32 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper


* Indan Zupancic <indan@nul.nu> wrote:

> What I propose:
> 
> 	atom1 returns 0, has SYSLET_STOP_ON_ZERO|SYSLET_SKIP_NEXT_ON_STOP set
> 	atom2
> 	atom3
> 
> (You've already used my SYSLET_SKIP_NEXT_ON_STOP instead of 
> SYSLET_SKIP_TO_NEXT_ON_STOP. ;-)

doh. Yes. I noticed and implemented this yesterday and it's in the 
submitted syslet code - but i guess i was too tired to remember my own 
code - so i added the wrong comments :-/ If you look at the sample 
user-space code:

        init_atom(req, &req->open_file, __NR_sys_open,
                  &req->filename_p, &O_RDONLY_var, NULL, NULL, NULL, NULL,
                  &req->fd, SYSLET_STOP_ON_NEGATIVE, &req->read_file);

the 'STOP_ON_NEGATIVE' acts on that particular atom.

this indeed cleaned up things quite a bit and made the user-space syslet 
code a lot more straightforward. A return value can still be recovered 
and examined (with a different condition and a different jump target) 
an arbitrary number of times via ret_ptr and via sys_umem_add().

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 20:39       ` Ingo Molnar
@ 2007-02-13 22:36         ` Dmitry Torokhov
  2007-02-14 11:07         ` Alan
  1 sibling, 0 replies; 320+ messages in thread
From: Dmitry Torokhov @ 2007-02-13 22:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

Hi Ingo,

On Tuesday 13 February 2007 15:39, Ingo Molnar wrote:
> 
> * Dmitry Torokhov <dmitry.torokhov@gmail.com> wrote:
> 
> > > What are the semantics of async sys_async_wait and async sys_async ?
> > 
> > Ooooohh. OpenVMS lives forever ;) Me likeee ;)
> 
> hm, i dont know OpenVMS - but googled around a bit for 'VMS 
> asynchronous' and it gave me this:
> 
>   http://en.wikipedia.org/wiki/Asynchronous_system_trap
> 
> is AST what you mean? From a quick read AST seems to be a signal 
> mechanism a bit like Unix signals, extended to kernel-space as well - 
> while syslets are a different 'safe execution engine' kind of thing 
> centered around the execution of system calls.
> 

That is only one of the ways of notifying userspace of system call
completion on OpenVMS. Pretty much every syscall there exists in 2
flavors - async and sync, for example $QIO and $QIOW or $ENQ/$ENQW
(actually the -W flavor is the async call + $SYNCH to wait for
completion). Once the system service call is completed, the OS raises a
so-called event flag and may also deliver an AST to the process. The
application may either wait for an event flag/set of event flags (EFN)
or rely on the AST to get notification.

-- 
Dmitry

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-13 22:30       ` Andi Kleen
@ 2007-02-13 22:41         ` Ingo Molnar
  2007-02-14  9:13           ` Evgeniy Polyakov
  0 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 22:41 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner


* Andi Kleen <andi@firstfloor.org> wrote:

> > > > +	if (!access_ok(VERIFY_WRITE, arg_ptr, sizeof(*arg_ptr)))
> > > > +		return -EFAULT;
> > > 
> > > It's a little unclear why you do that many individual access_ok()s. 
> > > And why is the target constant sized anyways?
> > 
> > each indirect pointer has to be checked separately, before dereferencing 
> > it. (Andrew pointed out that they should be VERIFY_READ, i fixed that in 
> > my tree)
> 
> But why only constant sized? It could be a variable length object, 
> couldn't it?

i think what you might be missing is that it's only the 6 syscall 
arguments that are fetched via indirect pointers - security checks are 
then done by the system calls themselves. It's a bit awkward to think 
about, but it is surprisingly clean in the assembly, and it simplified 
syslet programming too.

> > get_user_pages() would have to be limited in some way - and i didnt 
> > want
> 
> If you only use it for a small ring buffer it is naturally limited.

yeah, but 'small' is a dangerous word when it comes to adding IO 
interfaces ;-)

> > a single page is enough for 1024 completion pointers - that's more 
> > than enough for most purposes - and the default mlock limit is 40K.
> 
> Then limit it to a single page and use gup

1024 (512 on 64-bit) is alot but not ALOT. It is also certainly not 
ALOOOOT :-) Really, people will want to have more than 512 
disks/spindles in the same box. I have used such a beast myself. For Tux 
workloads and benchmarks we had parallelism levels of millions of 
pending requests (!) on a single system - networking, socket limits, 
disk IO combined with thousands of clients do create such scenarios. I 
really think that such 'pinned pages' are a pretty natural fit for 
sys_mlock() and RLIMIT_MEMLOCK, and since the kernel side is careful to 
use the _inatomic() uaccess methods, it's safe (and fast) as well.
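
As a quick sanity check of those numbers: the per-page capacity follows
directly from the pointer size. A minimal user-space check (nothing here
is from the patches, it only verifies the arithmetic quoted above):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);

	/* one completion pointer per slot: 1024 on 32-bit, 512 on 64-bit,
	 * assuming 4K pages */
	printf("%ld completion pointers per page\n",
	       page_size / (long)sizeof(void *));
	return 0;
}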

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 22:32           ` Andi Kleen
@ 2007-02-13 22:43             ` Ingo Molnar
  2007-02-13 22:47               ` Andi Kleen
  0 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2007-02-13 22:43 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, Evgeniy Polyakov, David S. Miller,
	Benjamin LaHaise, Suparna Bhattacharya, Thomas Gleixner


* Andi Kleen <andi@firstfloor.org> wrote:

> > ok, that should work fine already - exit in the user context gets
> 
> That would be a little heavy handed. I wouldn't expect my GUI program 
> to quit itself on cancel. And requiring it to create a new thread just 
> to exit on cancel would be also nasty.
> 
> And of course you cannot interrupt blocked IOs this way right now 
> (currently it only works with signals in some cases on NFS)

ok. The TID+signal approach i mentioned in the other reply should work. 
If it's frequent enough we could make this an explicit 
sys_async_cancel(TID) API.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 22:43             ` Ingo Molnar
@ 2007-02-13 22:47               ` Andi Kleen
  0 siblings, 0 replies; 320+ messages in thread
From: Andi Kleen @ 2007-02-13 22:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Davide Libenzi, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Thomas Gleixner

> ok. The TID+signal approach i mentioned in the other reply should work. 

Not sure if a signal is good for this. It might conflict with existing
strange historical semantics.

> If it's frequent enough we could make this an explicit 
> sys_async_cancel(TID) API.

Ideally there should be a new function like signal_pending() that checks
for this. Then the network filesystems could check it in their blocking
loops and error out.

Then it would even work on non-intr NFS mounts.
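
For concreteness, a minimal sketch of what such a check might look like.
The TIF_ASYNC_CANCEL flag and the request_done() helper are invented here
purely for illustration - nothing like them exists in the posted patches:

/* modelled on signal_pending(); TIF_ASYNC_CANCEL is hypothetical */
static inline int async_cancel_pending(struct task_struct *t)
{
	return test_tsk_thread_flag(t, TIF_ASYNC_CANCEL);
}

/* ...which a network filesystem's blocking loop could then poll: */
	while (!request_done(req)) {
		if (signal_pending(current) || async_cancel_pending(current))
			return -EINTR;
		schedule_timeout_interruptible(HZ / 10);
	}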

-Andi

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 21:57     ` Ingo Molnar
@ 2007-02-13 22:50       ` Olivier Galibert
  2007-02-13 22:59       ` Ulrich Drepper
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 320+ messages in thread
From: Olivier Galibert @ 2007-02-13 22:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, Evgeniy Polyakov, David S. Miller,
	Benjamin LaHaise, Suparna Bhattacharya, Thomas Gleixner

On Tue, Feb 13, 2007 at 10:57:24PM +0100, Ingo Molnar wrote:
> 
> * Davide Libenzi <davidel@xmailserver.org> wrote:
> 
> > > Open issues:
> 
> > If this is going to be a generic AIO subsystem:
> > 
> > - Cancellation of pending request
> 
> How about implementing aio_cancel() as a NOP. Can anyone prove that the 
> kernel didnt actually attempt to cancel that IO? [but unfortunately 
> failed at doing so, because the platters were being written already.]
> 
> really, what's the point behind aio_cancel()?

Lemme give you a real-world scenario: Question Answering in a Dialog
System.  Your locked-in-memory index ranks documents in a corpus of
several million files depending on the chances they have of containing
what you're looking for.  You have a tenth of a second to read as many of
them as you can, and each seek is 5ms.  So you aio-read them,
requesting them in order of ranking up to 200 or so, and see what you
have at the 0.1s deadline.  If you're lucky, a combination of cache
(especially if you stat() the whole dir tree on a regular basis to
keep the metadata fresh in cache) and of good io reorganisation by the
scheduler will allow you to get a good number of them and do the
information extraction, scoring and clustering of answers, which is
pure CPU at that point.  You *have* to cancel the remaining i/o
because you do not want the disk saturated when the next request
comes, especially if it's 10ms later because the dialog manager found
out it needed a complementary request.

Incidentally, that's something I'm currently implementing for work,
making these aio discussions more interesting than usual :-)
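
For reference, the pattern described here maps quite directly onto the
existing POSIX AIO interface - a minimal sketch, with the file descriptors
and buffers assumed to be set up elsewhere (link with -lrt); whether the
cancel actually succeeds is of course exactly what is being debated:

#include <aio.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

#define NREQ 200

/* queue reads in ranking order, wait for the 0.1s deadline, then
 * cancel whatever is still pending so the disk is free again */
static void read_ranked(int fds[NREQ], char *bufs[NREQ], size_t len)
{
	struct aiocb cbs[NREQ];
	int i;

	memset(cbs, 0, sizeof(cbs));
	for (i = 0; i < NREQ; i++) {
		cbs[i].aio_fildes = fds[i];
		cbs[i].aio_buf    = bufs[i];
		cbs[i].aio_nbytes = len;
		aio_read(&cbs[i]);
	}

	usleep(100 * 1000);		/* the answer deadline */

	for (i = 0; i < NREQ; i++) {
		if (aio_error(&cbs[i]) == EINPROGRESS)
			aio_cancel(cbs[i].aio_fildes, &cbs[i]);
		/* completed requests would be consumed via aio_return() */
	}
}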

  OG.

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-13 22:24     ` Ingo Molnar
  2007-02-13 22:30       ` Andi Kleen
@ 2007-02-13 22:57       ` Andrew Morton
  1 sibling, 0 replies; 320+ messages in thread
From: Andrew Morton @ 2007-02-13 22:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: andi, linux-kernel, torvalds, arjan, hch, akpm, alan, drepper,
	zach.brown, johnpol, davem, bcrl, suparna, davidel, tglx

> On Tue, 13 Feb 2007 23:24:43 +0100 Ingo Molnar <mingo@elte.hu> wrote:
> > If it's only a few pages you don't need any resource accounting. If 
> > it's more then it's nasty to steal the users quota. I think plain 
> > gup() would be better.
> 
> get_user_pages() would have to be limited in some way - and i didnt want 
> to add yet another wacky limit thing - so i just used the already 
> existing mlock() infrastructure for this. If Oracle wants to set up a 10 
> MB ringbuffer, they can set the PAM resource limits to 11 MB and still 
> have enough stuff left. And i dont really expect GPG to start using 
> syslets - just yet ;-)
> 
> a single page is enough for 1024 completion pointers - that's more than 
> enough for most purposes - and the default mlock limit is 40K.

So if I have an application which instantiates a single mlocked page
for this purpose, I can only run ten of them at once, and any other
mlock-using process which I'm using starts to mysteriously fail.

It seems like a problem to me..

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 21:57     ` Ingo Molnar
  2007-02-13 22:50       ` Olivier Galibert
@ 2007-02-13 22:59       ` Ulrich Drepper
  2007-02-13 23:24       ` Davide Libenzi
  2007-02-13 23:25       ` Andi Kleen
  3 siblings, 0 replies; 320+ messages in thread
From: Ulrich Drepper @ 2007-02-13 22:59 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 478 bytes --]

Ingo Molnar wrote:
> really, what's the point behind aio_cancel()?

- sequence

     aio_write()
     aio_cancel()
     aio_write()

  with both writes going to the same place must behave predictably

- think beyond files.  Writes to sockets, ttys: they can block, and
cancel must abort them.  Even for files the same applies in some
situations, e.g., for network filesystems.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-13 14:20 ` [patch 05/11] syslets: core code Ingo Molnar
@ 2007-02-13 23:15   ` Andi Kleen
  2007-02-13 22:24     ` Ingo Molnar
  2007-02-14 12:43   ` Guillaume Chazarain
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 320+ messages in thread
From: Andi Kleen @ 2007-02-13 23:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

Ingo Molnar <mingo@elte.hu> writes:

> +
> +static struct async_thread *
> +pick_ready_cachemiss_thread(struct async_head *ah)

The cachemiss names are confusing. I assume that's just a leftover
from Tux?
> +
> +	memset(atom->args, 0, sizeof(atom->args));
> +
> +	ret |= __get_user(arg_ptr, &uatom->arg_ptr[0]);
> +	if (!arg_ptr)
> +		return ret;
> +	if (!access_ok(VERIFY_WRITE, arg_ptr, sizeof(*arg_ptr)))
> +		return -EFAULT;

It's a little unclear why you do that many individual access_ok()s.
And why is the target constant sized anyways?


+	/*
+	 * Lock down the ring. Note: user-space should not munlock() this,
+	 * because if the ring pages get swapped out then the async
+	 * completion code might return a -EFAULT instead of the expected
+	 * completion. (the kernel safely handles that case too, so this
+	 * isnt a security problem.)
+	 *
+	 * mlock() is better here because it gets resource-accounted
+	 * properly, and even unprivileged userspace has a few pages
+	 * of mlock-able memory available. (which is more than enough
+	 * for the completion-pointers ringbuffer)
+	 */

If it's only a few pages you don't need any resource accounting.
If it's more then it's nasty to steal the users quota.
I think plain gup() would be better.


-Andi

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 06/11] syslets: core, documentation
  2007-02-13 21:34     ` Ingo Molnar
@ 2007-02-13 23:21       ` Davide Libenzi
  2007-02-14  0:18         ` Davide Libenzi
  0 siblings, 1 reply; 320+ messages in thread
From: Davide Libenzi @ 2007-02-13 23:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Thomas Gleixner

On Tue, 13 Feb 2007, Ingo Molnar wrote:

> 
> * Davide Libenzi <davidel@xmailserver.org> wrote:
> 
> > > +The Syslet Atom:
> > > +----------------
> > > +
> > > +The syslet atom is a small, fixed-size (44 bytes on 32-bit) piece of
> > > +user-space memory, which is the basic unit of execution within the syslet
> > > +framework. A syslet represents a single system-call and its arguments.
> > > +In addition it also has condition flags attached to it that allows the
> > > +construction of larger programs (syslets) from these atoms.
> > > +
> > > +Arguments to the system call are implemented via pointers to arguments.
> > > +This not only increases the flexibility of syslet atoms (multiple syslets
> > > +can share the same variable for example), but is also an optimization:
> > > +copy_uatom() will only fetch syscall parameters up until the point it
> > > +meets the first NULL pointer. 50% of all syscalls have 2 or less
> > > +parameters (and 90% of all syscalls have 4 or less parameters).
> > 
> > Why do you need to have an extra memory indirection per parameter in 
> > copy_uatom()? [...]
> 
> yes. Try to use them in real programs, and you'll see that most of the 
> time the variable an atom wants to access should also be accessed by 
> other atoms. For example a socket file descriptor - one atom opens it, 
> another one reads from it, a third one closes it. By having the 
> parameters in the atoms we'd have to copy the fd to two other places.

Yes, of course we have to support the indirection, otherwise chaining 
won't work. But ...



> > I can understand that chaining syscalls requires variable sharing, but 
> > the majority of the parameters passed to syscalls are just direct 
> > ones. Maybe a smart method that allows you to know if a parameter is a 
> > direct one or a pointer to one? An "unsigned int pmap" where bit N is 
> > 1 if param N is an indirection? Hmm?
> 
> adding such things tends to slow down atom parsing.

I really think it simplifies it. You simply *copy* the parameter (I'd say 
that 70+% of cases fall into this category), and if the current "pmap" bit is 
set, then you do all the indirection copy-from-userspace stuff.
It also simplifies userspace a lot, since you can now pass arrays and 
structure pointers directly, without saving them in a temporary variable.




> > Sigh, I really dislike shared userspace/kernel stuff, when we're 
> > transfering pointers to userspace. Did you actually bench it against 
> > a:
> > 
> > int async_wait(struct syslet_uatom **r, int n);
> > 
> > I can fully understand sharing userspace buffers with the kernel, if 
> > we're talking about KB transferd during a block or net I/O DMA 
> > operation, but for transfering a pointer? Behind each pointer 
> > transfer(4/8 bytes) there is a whole syscall execution, [...]
> 
> there are three main reasons for this choice:
> 
> - firstly, by putting completion events into the user-space ringbuffer
>   the asynchronous contexts are not held up at all, and the threads are
>   available for further syslet use.
> 
> - secondly, it was the most obvious and simplest solution to me - it 
>   just fits well into the syslet model - which is an execution concept 
>   centered around pure user-space memory and system calls, not some 
>   kernel resource. Kernel fills in the ringbuffer, user-space clears it. 
>   If we had to worry about a handshake between user-space and 
>   kernel-space for the completion information to be passed along, that 
>   would either mean extra buffering or extra overhead. Extra buffering 
>   (in the kernel) would be for no good reason: why not buffer it in the 
>   place where the information is destined for in the first place. The 
>   ringbuffer of /pointers/ is what makes this really powerful. I never 
>   really liked the AIO/etc. method /event buffer/ rings. With syslets 
>   the 'cookie' is the pointer to the syslet atom itself. It doesnt get 
>   any more straightforward than that i believe.
> 
> - making 'is there more stuff for me to work on' a simple instruction in
>   user-space makes it a no-brainer for user-space to promptly and
>   without thinking complete events. It's also the right thing to do on 
>   SMP: if one core is solely dedicated to the asynchronous workload,
>   only running on kernel mode, and the other code is only running
>   user-space, why ever switch between protection domains? [except if any
>   of them is idle] The fastest completion signalling method is the
>   /memory bus/, not an interrupt. User-space could in theory even use
>   MWAIT (in user-space!) to wait for the other core to complete stuff. 
>   That makes for a hell of a fast wakeup.

That also makes for a hell of an ugly retrieval API IMO ;)
If it were backed up by considerable performance gains, then it might be OK.
But I believe that won't be the case, and that leaves us with an ugly API.
OTOH, if no one else objects to this, it means that I'm the only weirdo :) and 
the API is just fine.
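
For what it's worth, the user-space side of the 'ringbuffer of pointers'
scheme quoted above would look roughly like the sketch below. The ring
layout, the names and the consumption protocol are assumptions rather than
the patch's actual ABI, and a real implementation would need memory
barriers on SMP:

#include <stddef.h>

struct syslet_uatom;				/* opaque here */

#define RING_SIZE 1024

static struct syslet_uatom *ring[RING_SIZE];	/* mlock()ed, shared with the kernel */
static unsigned long tail;			/* consumer index, user-space only */

static struct syslet_uatom *next_completion(void)
{
	struct syslet_uatom *done = ring[tail % RING_SIZE];

	if (!done)
		return NULL;		/* nothing new - checked without a syscall */

	ring[tail % RING_SIZE] = NULL;	/* hand the slot back to the kernel */
	tail++;
	return done;			/* the cookie is the atom pointer itself */
}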




- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 21:57     ` Ingo Molnar
  2007-02-13 22:50       ` Olivier Galibert
  2007-02-13 22:59       ` Ulrich Drepper
@ 2007-02-13 23:24       ` Davide Libenzi
  2007-02-13 23:25       ` Andi Kleen
  3 siblings, 0 replies; 320+ messages in thread
From: Davide Libenzi @ 2007-02-13 23:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Thomas Gleixner

On Tue, 13 Feb 2007, Ingo Molnar wrote:

> 
> * Davide Libenzi <davidel@xmailserver.org> wrote:
> 
> > > Open issues:
> 
> > If this is going to be a generic AIO subsystem:
> > 
> > - Cancellation of pending request
> 
> How about implementing aio_cancel() as a NOP. Can anyone prove that the 
> kernel didnt actually attempt to cancel that IO? [but unfortunately 
> failed at doing so, because the platters were being written already.]
> 
> really, what's the point behind aio_cancel()?

You need cancel. If you scheduled an async syscall, and the "session" 
linked with that chain is going away, you'd better have that canceled before 
cleaning up the buffers that the chain is going to read/write.
If you keep a hash or a tree indexed by atom-ptr, then it becomes a matter 
of a lookup and sending a signal.



- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 21:57     ` Ingo Molnar
                         ` (2 preceding siblings ...)
  2007-02-13 23:24       ` Davide Libenzi
@ 2007-02-13 23:25       ` Andi Kleen
  2007-02-13 22:26         ` Ingo Molnar
  3 siblings, 1 reply; 320+ messages in thread
From: Andi Kleen @ 2007-02-13 23:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, Evgeniy Polyakov, David S. Miller,
	Benjamin LaHaise, Suparna Bhattacharya, Thomas Gleixner

Ingo Molnar <mingo@elte.hu> writes:
> 
> really, what's the point behind aio_cancel()?

The main use case is when you open a file requester on a network
file system where the server is down and you get tired of waiting
and press "Cancel" it should abort the hanging IO immediately.

At least I would appreciate such a feature sometimes.

e.g. the readdir loop could be a syslet (are they powerful
enough to allocate memory for an arbitrarily sized directory? Probably not) 
and then the cancel button could async_cancel() it.

-Andi


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 22:10       ` Ingo Molnar
@ 2007-02-13 23:28         ` Davide Libenzi
  0 siblings, 0 replies; 320+ messages in thread
From: Davide Libenzi @ 2007-02-13 23:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Thomas Gleixner

On Tue, 13 Feb 2007, Ingo Molnar wrote:

> * Davide Libenzi <davidel@xmailserver.org> wrote:
> 
> > > If this is going to be a generic AIO subsystem:
> > > 
> > > - Cancellation of pending request
> > 
> > What about the busy_async_threads list becoming a hash/rb_tree indexed 
> > by syslet_atom ptr. A cancel would lookup the thread and send a signal 
> > (of course, signal handling of the async threads should be set 
> > properly)?
> 
> well, each async syslet has a separate TID at the moment, so if we want 
> a submitted syslet to be cancellable then we could return the TID of the 
> syslet handler (instead of the NULL) in sys_async_exec(). Then 
> user-space could send a signal the old-fashioned way, via sys_tkill(), 
> if it so wishes.

That works too. I was thinking about identifying syslets with the 
userspace ptr, but the TID is fine too.
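
A hypothetical user-space cancel helper along these lines, assuming
sys_async_exec() were changed to return the handler's TID as suggested
(tkill() itself is a real syscall; everything else is illustration):

#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* interrupt the (possibly blocked) syscall the syslet is executing */
static int async_cancel_tid(pid_t tid)
{
	return syscall(SYS_tkill, tid, SIGINT);
}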



> the TID could also be used in a sys_async_wait_on() API. I.e. it would 
> be a natural, readily accessible 'cookie' for the pending work. TIDs can 
> be looked up lockless via RCU, so it's reasonably fast as well.
> 
> ( Note that there's already a way to 'signal' pending syslets: do_exit() 
>   in the user context will signal all async contexts (which results in 
>   -EINTR of currently executing syscalls, wherever possible) and will 
>   tear them down. But that's too crude for aio_cancel() i guess. )

Yup.



- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 06/11] syslets: core, documentation
  2007-02-13 23:21       ` Davide Libenzi
@ 2007-02-14  0:18         ` Davide Libenzi
  0 siblings, 0 replies; 320+ messages in thread
From: Davide Libenzi @ 2007-02-14  0:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Thomas Gleixner

On Tue, 13 Feb 2007, Davide Libenzi wrote:

> > > I can understand that chaining syscalls requires variable sharing, but 
> > > the majority of the parameters passed to syscalls are just direct 
> > > ones. Maybe a smart method that allows you to know if a parameter is a 
> > > direct one or a pointer to one? An "unsigned int pmap" where bit N is 
> > > 1 if param N is an indirection? Hmm?
> > 
> > adding such things tends to slow down atom parsing.
> 
> I really think it simplifies it. You simply *copy* the parameter (I'd say 
> that 70+% of cases fall into this category), and if the current "pmap" bit is 
> set, then you do all the indirection copy-from-userspace stuff.
> It also simplifies userspace a lot, since you can now pass arrays and 
> structure pointers directly, without saving them in a temporary variable.

Very rough sketch below ...


---
struct syslet_uatom {
	unsigned long                           flags;
	unsigned int                            nr;
	unsigned short                          nparams;
	unsigned short                          pmap;
	long __user                             *ret_ptr;
	struct syslet_uatom     __user          *next;
	unsigned long           __user          args[6];
	void __user                             *private;
};

long copy_uatom(struct syslet_atom *atom, struct syslet_uatom __user *uatom)
{
	unsigned short i, pmap;
	unsigned long __user *arg_ptr;
	long ret = 0;

	if (!access_ok(VERIFY_WRITE, uatom, sizeof(*uatom)))
		return -EFAULT;

	ret = __get_user(atom->nr, &uatom->nr);
	ret |= __get_user(atom->nparams, &uatom->nparams);
	ret |= __get_user(pmap, &uatom->pmap);
	ret |= __get_user(atom->ret_ptr, &uatom->ret_ptr);
	ret |= __get_user(atom->flags, &uatom->flags);
	ret |= __get_user(atom->next, &uatom->next);
	if (unlikely(atom->nparams > 6))
		return -EINVAL;
	for (i = 0; i < atom->nparams; i++, pmap >>= 1) {
		atom->args[i] = uatom->args[i];
		if (unlikely(pmap & 1)) {
			arg_ptr = (unsigned long __user *) atom->args[i];
			if (!access_ok(VERIFY_WRITE, arg_ptr, sizeof(*arg_ptr)))
				return -EFAULT;
			ret |= __get_user(atom->args[i], arg_ptr);
		}
	}

	return ret;
}

void init_utaom(struct syslet_uatom *ua, unsigned long flags, unsigned int nr,
		long *ret_ptr, struct syslet_uatom *next, void *private,
		int nparams, ...)
{
	int i, mode;
	va_list args;

	ua->flags = flags;
	ua->nr = nr;
	ua->ret_ptr = ret_ptr;
	ua->next = next;
	ua->private = private;
	ua->nparams = nparams;
	ua->pmap = 0;
	va_start(args, nparams);
	for (i = 0; i < nparams; i++) {
		mode = va_arg(args, int);
		ua->args[i] = va_arg(args, unsigned long);
		if (mode == UASYNC_INDIR)
			ua->pmap |= 1 << i;
	}
	va_end(args);
}


#define UASYNC_IMM 0
#define UASYNC_INDIR 1
#define UAPD(a) UASYNC_IMM, (unsigned long) (a)
#define UAPI(a) UASYNC_INDIR, (unsigned long) (a)


void foo(void)
{
	int fd;
	long res;
	struct stat stb;
	struct syslet_uatom ua;

	init_utaom(&ua, 0, __NR_fstat, &res, NULL, NULL, 2,
		   UAPI(&fd), UAPD(&stb));
	...
}
---



- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 14:20 ` [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support Ingo Molnar
  2007-02-13 15:00   ` Alan
  2007-02-13 20:22   ` Davide Libenzi
@ 2007-02-14  3:28   ` Davide Libenzi
  2007-02-14  4:49     ` Davide Libenzi
  2007-02-14  4:42   ` Willy Tarreau
                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 320+ messages in thread
From: Davide Libenzi @ 2007-02-14  3:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Thomas Gleixner

On Tue, 13 Feb 2007, Ingo Molnar wrote:

> I'm pleased to announce the first release of the "Syslet" kernel feature 
> and kernel subsystem, which provides generic asynchronous system call 
> support:
> [...]

Ok, I had little time to review the code, but it has been a long 
working day, so bear with me if I missed something.
I don't see how sys_async_exec would not block, based on your patches. 
Let's try to follow:

- We enter sys_async_exec

- We may fill the pool, but that's nothing interesting ATM. A bunch of 
  threads will be created, and they'll end up sleeping inside the 
  cachemiss_loop

- We set the async_ready pointer and we fall inside exec_atom

- There we copy the atom (nothing interesting from a scheduling POV) and 
  we fall inside __exec_atom

- In __exec_atom we do the actual syscall call. Note that we're still the 
  task/thread that called sys_async_exec

- So we enter the syscall, and now we end up in schedule because we're 
  just unlucky

- We notice that the async_ready pointer is not NULL, and we call 
  __async_schedule

- Finally we're in pick_new_async_thread and we pick one of the ready 
  threads sleeping in cachemiss_loop

- We copy the pt_regs to the newly picked-up thread, we set its async head 
  pointer, we set the current task async_ready pointer to NULL, we 
  re-initialize the async_thread structure (the old async_ready), and we 
  put ourselves in the busy_list

- Then we roll back to the schedule that started everything, and being 
  still "prev" for the scheduler, we go to sleep

So the sys_async_exec task is going to block. Now, am I being really 
tired, or the cachemiss fast return is simply not there?
There's another problem AFAICS:

- We woke up one of the cachemiss_loop threads in pick_new_async_thread

- The thread wakes up, marks itself as busy, and looks at the ->work 
  pointer hoping to find something to work on

But we never set that pointer to a userspace atom AFAICS. Me blind? :)




- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 14:20 ` [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support Ingo Molnar
                     ` (2 preceding siblings ...)
  2007-02-14  3:28   ` Davide Libenzi
@ 2007-02-14  4:42   ` Willy Tarreau
  2007-02-14 12:37   ` Pavel Machek
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 320+ messages in thread
From: Willy Tarreau @ 2007-02-14  4:42 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

Hi Ingo !

On Tue, Feb 13, 2007 at 03:20:10PM +0100, Ingo Molnar wrote:
> I'm pleased to announce the first release of the "Syslet" kernel feature 
> and kernel subsystem, which provides generic asynchronous system call 
> support:
> 
>    http://redhat.com/~mingo/syslet-patches/
> 
> Syslets are small, simple, lightweight programs (consisting of 
> system-calls, 'atoms') that the kernel can execute autonomously (and, 
> not the least, asynchronously), without having to exit back into 
> user-space. Syslets can be freely constructed and submitted by any 
> unprivileged user-space context - and they have access to all the 
> resources (and only those resources) that the original context has 
> access to.

I like this a lot. I've always felt frustrated by the wasted time in
setsockopt() calls after accept() or before connect(), or in multiple
calls to epoll_ctl(). It might also be useful as an efficient readv()
emulation using recv(), etc...

Nice work !
Willy


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-14  3:28   ` Davide Libenzi
@ 2007-02-14  4:49     ` Davide Libenzi
  2007-02-14  8:26       ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Davide Libenzi @ 2007-02-14  4:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Thomas Gleixner

On Tue, 13 Feb 2007, Davide Libenzi wrote:

[...]

> So the sys_async_exec task is going to block. Now, am I being really 
> tired, or the cachemiss fast return is simply not there?

The former 8)

pick_new_async_head()
	new_task->ah = ah;

cachemiss_loop()
	for (;;) {
		if (unlikely(t->ah || ...))
			break;


> There's another problem AFAICS:
> 
> - We woke up one of the cachemiss_loop threads in pick_new_async_thread
> 
> - The thread wakes up, marks itself as busy, and looks at the ->work 
>   pointer hoping to find something to work on
> 
> But we never set that pointer to a userspace atom AFAICS. Me blind? :)

I still don't see at->work ever set to a valid userspace atom though...



- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-14  4:49     ` Davide Libenzi
@ 2007-02-14  8:26       ` Ingo Molnar
  0 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2007-02-14  8:26 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Linux Kernel Mailing List, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Thomas Gleixner


* Davide Libenzi <davidel@xmailserver.org> wrote:

> > There's another problem AFAICS:
> > 
> > - We woke up one of the cachemiss_loop threads in pick_new_async_thread
> > 
> > - The thread wakes up, marks itself as busy, and looks at the ->work 
> >   pointer hoping to find something to work on
> > 
> > But we never set that pointer to a userspace atom AFAICS. Me blind? :)
> 
> I still don't see at->work ever set to a valid userspace atom 
> though...

yeah - i havent added 'submit syslet from within a syslet' support 
yet :-)

note that current normal syslet operation (async and sync alike) 
does not need at->work at all. When we cachemiss, the new head task 
just wants to return a NULL pointer to user-space, to signal that work 
is continuing in the background. A ready 'cachemiss' thread is really 
not there to do cachemisses, it is a 'new head task in waiting'. The 
name comes from Tux and i guess it's time for a rename :)

but i do plan a SYSLET_ASYNC_CONTINUE flag, roughly along the lines of 
the patch i've attached below: this would skip to the linearly next 
syslet and would let the original syslet execute in the background. I 
have not fully thought this through though (let alone tested it ;) - can 
you see any hole in this approach? This would in essence allow the 
following construct:

   syslet1 &
   syslet2 &
   syslet3 &
   syslet4 &

submitted in parallel, straight to cachemiss threads, from a syslet 
itself.

there's yet another work submission variant that makes sense to do, a 
true syslet vector submission: to do a loop over syslet atoms in 
sys_async_exec(). That would have the added advantage of enabling 
caching. If one vector component generates a cachemiss then the head 
would continue with the next vector component. (this too needs 
at->work-like communication between ex-head and new-head)

maybe the latter would be the cleaner approach - SYSLET_ASYNC_CONTINUE 
has no effect in cachemiss context, so it only makes sense if the 
submitted syslet is a pure vector of parallel atoms. Alternatively, 
SYSLET_ASYNC_CONTINUE would have to be made to work from cachemiss contexts 
too. (because that makes sense too, to start new async execution from 
another async context.)

another not yet clear area is when there's no cachemiss thread 
available. Right now SYSLET_ASYNC_CONTINUE will just fail - which makes 
it nondeterministic.

	Ingo

---
 include/linux/async.h  |   13 +++++++++++--
 include/linux/sched.h  |    3 +--
 include/linux/syslet.h |   20 +++++++++++++-------
 kernel/async.c         |   43 +++++++++++++++++++++++++++++--------------
 kernel/sched.c         |    2 +-
 5 files changed, 56 insertions(+), 27 deletions(-)

 # *DOCUMENTATION*
Index: linux/include/linux/async.h
===================================================================
--- linux.orig/include/linux/async.h
+++ linux/include/linux/async.h
@@ -1,15 +1,23 @@
 #ifndef _LINUX_ASYNC_H
 #define _LINUX_ASYNC_H
+
+#include <linux/compiler.h>
+
 /*
  * The syslet subsystem - asynchronous syscall execution support.
  *
  * Generic kernel API definitions:
  */
 
+struct syslet_uatom;
+struct async_thread;
+struct async_head;
+
 #ifdef CONFIG_ASYNC_SUPPORT
 extern void async_init(struct task_struct *t);
 extern void async_exit(struct task_struct *t);
-extern void __async_schedule(struct task_struct *t);
+extern void
+__async_schedule(struct task_struct *t, struct syslet_uatom __user *next_uatom);
 #else /* !CONFIG_ASYNC_SUPPORT */
 static inline void async_init(struct task_struct *t)
 {
@@ -17,7 +25,8 @@ static inline void async_init(struct tas
 static inline void async_exit(struct task_struct *t)
 {
 }
-static inline void __async_schedule(struct task_struct *t)
+static inline void
+__async_schedule(struct task_struct *t, struct syslet_uatom __user *next_uatom)
 {
 }
 #endif /* !CONFIG_ASYNC_SUPPORT */
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -83,13 +83,12 @@ struct sched_param {
 #include <linux/timer.h>
 #include <linux/hrtimer.h>
 #include <linux/task_io_accounting.h>
+#include <linux/async.h>
 
 #include <asm/processor.h>
 
 struct exec_domain;
 struct futex_pi_state;
-struct async_thread;
-struct async_head;
 /*
  * List of flags we want to share for kernel threads,
  * if only because they are not used by them anyway.
Index: linux/include/linux/syslet.h
===================================================================
--- linux.orig/include/linux/syslet.h
+++ linux/include/linux/syslet.h
@@ -56,10 +56,16 @@ struct syslet_uatom {
 #define SYSLET_ASYNC				0x00000001
 
 /*
+ * Queue this syslet asynchronously and continue executing the
+ * next linear atom:
+ */
+#define SYSLET_ASYNC_CONTINUE			0x00000002
+
+/*
  * Never queue this syslet asynchronously - even if synchronous
  * execution causes a context-switching:
  */
-#define SYSLET_SYNC				0x00000002
+#define SYSLET_SYNC				0x00000004
 
 /*
  * Do not queue the syslet in the completion ring when done.
@@ -70,7 +76,7 @@ struct syslet_uatom {
  * Some syscalls generate implicit completion events of their
  * own.
  */
-#define SYSLET_NO_COMPLETE			0x00000004
+#define SYSLET_NO_COMPLETE			0x00000008
 
 /*
  * Execution control: conditions upon the return code
@@ -78,10 +84,10 @@ struct syslet_uatom {
  * execution is stopped and the atom is put into the
  * completion ring:
  */
-#define SYSLET_STOP_ON_NONZERO			0x00000008
-#define SYSLET_STOP_ON_ZERO			0x00000010
-#define SYSLET_STOP_ON_NEGATIVE			0x00000020
-#define SYSLET_STOP_ON_NON_POSITIVE		0x00000040
+#define SYSLET_STOP_ON_NONZERO			0x00000010
+#define SYSLET_STOP_ON_ZERO			0x00000020
+#define SYSLET_STOP_ON_NEGATIVE			0x00000040
+#define SYSLET_STOP_ON_NON_POSITIVE		0x00000080
 
 #define SYSLET_STOP_MASK				\
 	(	SYSLET_STOP_ON_NONZERO		|	\
@@ -97,7 +103,7 @@ struct syslet_uatom {
  *
  * This is what allows true branches of execution within syslets.
  */
-#define SYSLET_SKIP_TO_NEXT_ON_STOP		0x00000080
+#define SYSLET_SKIP_TO_NEXT_ON_STOP		0x00000100
 
 /*
  * This is the (per-user-context) descriptor of the async completion
Index: linux/kernel/async.c
===================================================================
--- linux.orig/kernel/async.c
+++ linux/kernel/async.c
@@ -75,13 +75,14 @@ mark_async_thread_busy(struct async_thre
 
 static void
 __async_thread_init(struct task_struct *t, struct async_thread *at,
-		    struct async_head *ah)
+		    struct async_head *ah,
+		    struct syslet_uatom __user *work)
 {
 	INIT_LIST_HEAD(&at->entry);
 	at->exit = 0;
 	at->task = t;
 	at->ah = ah;
-	at->work = NULL;
+	at->work = work;
 
 	t->at = at;
 	ah->nr_threads++;
@@ -92,7 +93,7 @@ async_thread_init(struct task_struct *t,
 		  struct async_head *ah)
 {
 	spin_lock(&ah->lock);
-	__async_thread_init(t, at, ah);
+	__async_thread_init(t, at, ah, NULL);
 	__mark_async_thread_ready(at, ah);
 	spin_unlock(&ah->lock);
 }
@@ -130,8 +131,10 @@ pick_ready_cachemiss_thread(struct async
 	return at;
 }
 
-static void pick_new_async_head(struct async_head *ah,
-				struct task_struct *t, struct pt_regs *old_regs)
+static void
+pick_new_async_head(struct async_head *ah, struct task_struct *t,
+		    struct pt_regs *old_regs,
+		    struct syslet_uatom __user *next_uatom)
 {
 	struct async_thread *new_async_thread;
 	struct async_thread *async_ready;
@@ -158,28 +161,31 @@ static void pick_new_async_head(struct a
 
 	wake_up_process(new_task);
 
-	__async_thread_init(t, async_ready, ah);
+	__async_thread_init(t, async_ready, ah, next_uatom);
 	__mark_async_thread_busy(t->at, ah);
 
  out_unlock:
 	spin_unlock(&ah->lock);
 }
 
-void __async_schedule(struct task_struct *t)
+void
+__async_schedule(struct task_struct *t, struct syslet_uatom __user *next_uatom)
 {
 	struct async_head *ah = t->ah;
 	struct pt_regs *old_regs = task_pt_regs(t);
 
-	pick_new_async_head(ah, t, old_regs);
+	pick_new_async_head(ah, t, old_regs, next_uatom);
 }
 
-static void async_schedule(struct task_struct *t)
+static void
+async_schedule(struct task_struct *t, struct syslet_uatom __user *next_uatom)
 {
 	if (t->async_ready)
-		__async_schedule(t);
+		__async_schedule(t, next_uatom);
 }
 
-static long __exec_atom(struct task_struct *t, struct syslet_atom *atom)
+static long __exec_atom(struct task_struct *t, struct syslet_atom *atom,
+			struct syslet_uatom __user *uatom)
 {
 	struct async_thread *async_ready_save;
 	long ret;
@@ -189,8 +195,17 @@ static long __exec_atom(struct task_stru
 	 * (try to) switch user-space to another thread straight
 	 * away and execute the syscall asynchronously:
 	 */
-	if (unlikely(atom->flags & SYSLET_ASYNC))
-		async_schedule(t);
+	if (unlikely(atom->flags & (SYSLET_ASYNC | SYSLET_ASYNC_CONTINUE))) {
+		/*
+		 * If this is a parallel (vectored) submission straight to
+		 * a cachemiss context then the linearly next (uatom + 1)
+		 * uatom will be executed by this context.
+		 */
+		if (atom->flags & SYSLET_ASYNC_CONTINUE)
+			async_schedule(t, uatom + 1);
+		else
+			async_schedule(t, NULL);
+	}
 	/*
 	 * Does user-space want synchronous execution for this atom?:
 	 */
@@ -432,7 +447,7 @@ exec_atom(struct async_head *ah, struct 
 		return ERR_PTR(-EFAULT);
 
 	last_uatom = uatom;
-	ret = __exec_atom(t, &atom);
+	ret = __exec_atom(t, &atom, uatom);
 	if (unlikely(signal_pending(t) || need_resched()))
 		goto stop;
 
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -3442,7 +3442,7 @@ asmlinkage void __sched schedule(void)
 		if (prev->state && !(preempt_count() & PREEMPT_ACTIVE) &&
 			(!(prev->state & TASK_INTERRUPTIBLE) ||
 				!signal_pending(prev)))
-			__async_schedule(prev);
+			__async_schedule(prev, NULL);
 	}
 
 need_resched:


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 22:18           ` Ingo Molnar
@ 2007-02-14  8:59             ` Evgeniy Polyakov
  2007-02-14 10:37               ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Evgeniy Polyakov @ 2007-02-14  8:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Benjamin LaHaise, Alan, linux-kernel, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Ulrich Drepper, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

On Tue, Feb 13, 2007 at 11:18:10PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > [...] it still has a problem - syscall blocks and the same thread thus 
> > is not allowed to continue execution and fill the pipe - so what if 
> > system issues thousands of requests and there are only tens of working 
> > thread at most. [...]
> 
> the same thread is allowed to continue execution even if the system call 
> blocks: take a look at async_schedule(). The blocked system-call is 'put 
> aside' (in a sleeping thread), the kernel switches the user-space 
> context (registers) to a free kernel thread and switches to it - and 
> returns to user-space as if nothing happened - allowing the user-space 
> context to 'fill the pipe' as much as it can. Or did i misunderstand 
> your point?

Let me clarify what I meant.
There is only a limited number of threads which are supposed to execute
blocking context, so when they are all used, the main one will block too - I
asked about the possibility of reusing the same thread to execute a queue of
requests attached to it: each request can block, but if the blocking issue
is removed, it would be possible to return.

What I'm asking about is how the kevent IO state machine functions
actually work - each IO request is made not through the usual mpage and
bio allocations, but with special kevent ones, which do not wait until
completion; instead, in the destructor the request is either rescheduled
(if a big file is transferred, it is split into parts for transmission)
or committed as ready (thus it becomes possible to read the completion
through the kevent queue or ring). So there are only several threads,
each one does a small job on each request, but the same request can be
rescheduled to it again and again (from the bio destructor or ->end_io
callback, for example).

So I asked if it is possible to extend this state machine to work not
only with blocking syscalls but also with non-blocking functions, with
the possibility of rescheduling the same item again.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-13 22:41         ` Ingo Molnar
@ 2007-02-14  9:13           ` Evgeniy Polyakov
  2007-02-14  9:46             ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Evgeniy Polyakov @ 2007-02-14  9:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

On Tue, Feb 13, 2007 at 11:41:31PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> > Then limit it to a single page and use gup
> 
> 1024 (512 on 64-bit) is alot but not ALOT. It is also certainly not 
> ALOOOOT :-) Really, people will want to have more than 512 
> disks/spindles in the same box. I have used such a beast myself. For Tux 
> workloads and benchmarks we had parallelism levels of millions of 
> pending requests (!) on a single system - networking, socket limits, 
> disk IO combined with thousands of clients do create such scenarios. I 
> really think that such 'pinned pages' are a pretty natural fit for 
> sys_mlock() and RLIMIT_MEMLOCK, and since the kernel side is careful to 
> use the _inatomic() uaccess methods, it's safe (and fast) as well.

This will end up badly - I used the same approach in the early kevent
days and was eventually convinced to use swappable memory for the ring.
I think it would be much better to have a userspace-allocated ring and
use copy_to_user() there.

Btw, as a bit of advertisement, the whole completion part can be done
through kevent which already has ring buffer, queue operations and
non-racy updates... :)

> 	Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-14  9:13           ` Evgeniy Polyakov
@ 2007-02-14  9:46             ` Ingo Molnar
  2007-02-14 10:09               ` Evgeniy Polyakov
  0 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2007-02-14  9:46 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Andi Kleen, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> This will end up badly - I used the same approach in the early kevent 
> days and was eventually convinced to use swappable memory for the ring. 
> I think it would be much better to have a userspace-allocated ring and 
> use copy_to_user() there.

it is a userspace allocated ring - but pinned down by the kernel.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-14  9:46             ` Ingo Molnar
@ 2007-02-14 10:09               ` Evgeniy Polyakov
  2007-02-14 10:30                 ` Arjan van de Ven
  0 siblings, 1 reply; 320+ messages in thread
From: Evgeniy Polyakov @ 2007-02-14 10:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

On Wed, Feb 14, 2007 at 10:46:29AM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > This will end up badly - I used the same approach in the early kevent 
> > days and was eventually convinced to use swappable memory for the ring. 
> > I think it would be much better to have a userspace-allocated ring and 
> > use copy_to_user() there.
> 
> it is a userspace allocated ring - but pinned down by the kernel.

That's a problem - 1000/512 pages per 'usual' thread ends up with the
whole memory locked by a malicious/stupid application (at least on Debian
and Mandrake there is no locked memory limit by default). And if such 
a limit exists, this will hurt big-iron applications, which want to use
high-order rings legitimately.
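
Whether a given box falls into the 'no limit' or the '40K default' camp
is easy to check from user-space - a minimal sketch using only the
standard getrlimit() interface:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl))
		return 1;
	if (rl.rlim_cur == RLIM_INFINITY)
		printf("RLIMIT_MEMLOCK: unlimited\n");
	else
		printf("RLIMIT_MEMLOCK: %llu bytes\n",
		       (unsigned long long)rl.rlim_cur);
	return 0;
}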

> 	Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-14 10:09               ` Evgeniy Polyakov
@ 2007-02-14 10:30                 ` Arjan van de Ven
  2007-02-14 10:41                   ` Evgeniy Polyakov
  0 siblings, 1 reply; 320+ messages in thread
From: Arjan van de Ven @ 2007-02-14 10:30 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Andi Kleen, linux-kernel, Linus Torvalds,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

> (at least on Debian
> and Mandrake there is no locked memory limit by default).

that sounds like 2 very large bugtraq-worthy bugs in these distros.. so
bad a bug that I almost find it hard to believe...

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 06/11] syslets: core, documentation
  2007-02-13 14:20 ` [patch 06/11] syslets: core, documentation Ingo Molnar
  2007-02-13 20:18   ` Davide Libenzi
@ 2007-02-14 10:36   ` Russell King
  2007-02-14 10:50     ` Ingo Molnar
  1 sibling, 1 reply; 320+ messages in thread
From: Russell King @ 2007-02-14 10:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

On Tue, Feb 13, 2007 at 03:20:42PM +0100, Ingo Molnar wrote:
> +Arguments to the system call are implemented via pointers to arguments.
> +This not only increases the flexibility of syslet atoms (multiple syslets
> +can share the same variable for example), but is also an optimization:
> +copy_uatom() will only fetch syscall parameters up until the point it
> +meets the first NULL pointer. 50% of all syscalls have 2 or less
> +parameters (and 90% of all syscalls have 4 or less parameters).
> +
> + [ Note: since the argument array is at the end of the atom, and the
> +   kernel will not touch any argument beyond the final NULL one, atoms
> +   might be packed more tightly. (the only special case exception to
> +   this rule would be SKIP_TO_NEXT_ON_STOP atoms, where the kernel will
> +   jump a full syslet_uatom number of bytes.) ]

What if you need to increase the number of arguments passed to a system
call later?  That would be an API change since the size of syslet_uatom
would change?

Also, what if you have an ABI such that:

sys_foo(int fd, long long a)

where:
 arg[0] <= fd
 arg[1] <= unused
 arg[2] <= low 32-bits a
 arg[3] <= high 32-bits a

it seems you need to point arg[1] to some valid but dummy variable.
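
[ To make that concrete, here is a rough sketch of the setup such an ABI
  would force on the caller - the structure below is purely illustrative
  and is not the real syslet_uatom layout, it only mimics the
  NULL-terminated arg_ptr[] behaviour described in the documentation
  quoted above: ]

/* illustrative only: an atom with a NULL-terminated argument list */
struct fake_atom {
	unsigned long nr;		/* syscall number */
	unsigned long *arg_ptr[6];	/* copied up to the first NULL */
};

static unsigned long fd_arg, pad_arg, a_lo, a_hi;

static void setup_sys_foo(struct fake_atom *atom, int fd, unsigned long long a)
{
	fd_arg  = fd;
	pad_arg = 0;				/* valid but unused slot */
	a_lo    = (unsigned long)a;		/* which half goes where is */
	a_hi    = (unsigned long)(a >> 32);	/* per-arch and per-endian! */

	atom->arg_ptr[0] = &fd_arg;	/* arg[0] <= fd */
	atom->arg_ptr[1] = &pad_arg;	/* arg[1] <= register-pair padding */
	atom->arg_ptr[2] = &a_lo;	/* arg[2] <= low 32 bits of a */
	atom->arg_ptr[3] = &a_hi;	/* arg[3] <= high 32 bits of a */
	atom->arg_ptr[4] = NULL;	/* terminates the argument list */
}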

How do you propose syslet users know about these kinds of ABI issues
(including the endian-ness of 64-bit arguments) ?

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-14  8:59             ` Evgeniy Polyakov
@ 2007-02-14 10:37               ` Ingo Molnar
  2007-02-14 11:10                 ` Evgeniy Polyakov
  2007-02-14 17:17                 ` Davide Libenzi
  0 siblings, 2 replies; 320+ messages in thread
From: Ingo Molnar @ 2007-02-14 10:37 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Benjamin LaHaise, Alan, linux-kernel, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Ulrich Drepper, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> Let me clarify what I meant. There is only limited number of threads, 
> which are supposed to execute blocking context, so when all they are 
> used, main one will block too - I asked about possibility to reuse the 
> same thread to execute queue of requests attached to it, each request 
> can block, but if blocking issue is removed, it would be possible to 
> return.

ah, ok, i understand your point. This is not quite possible: the 
cachemisses are driven from schedule(), which can be arbitrarily deep 
inside arbitrary system calls. It can be in a mutex_lock() deep inside a 
driver. It can be due to an alloc_pages() call done by a kmalloc() call 
done from within ext3, which was called from the loopback block driver, 
which was called from XFS, which was called from a VFS syscall.

Even if it were possible to backtrack i'm quite sure we dont want to do 
this, for three main reasons:

Firstly, backtracking and retrying always has a cost. We construct state 
on the way in - and we destruct on the way out. The kernel stack we have 
built up has a (nontrivial) construction cost and thus a construction 
value - we should preserve that if possible.

Secondly, and this is equally important: i wanted the number of async 
kernel threads to be the natural throttling mechanism. If user-space 
wants to use less threads and overcommit the request queue then it can 
be done in user-space: by over-queueing requests into a separate list, 
and taking from that list upon completion and submitting it. User-space 
has precise knowledge of overqueueing scenarios: if the event ring is 
full then all async kernel threads are busy.
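
[ In user-space, that over-queueing could look as simple as the sketch
  below - completion_ring_full(), submit_atom() and handle_result() are
  made-up placeholder helpers, not part of the posted API: ]

/* user-space only: requests that could not get an async thread yet */
struct request {
	struct request *next;
	struct syslet_uatom *atom;
};

static struct request *overflow_list;

static void submit_or_park(struct request *req)
{
	if (completion_ring_full()) {
		/* ring full => all async kernel threads are busy: park it */
		req->next = overflow_list;
		overflow_list = req;
	} else {
		submit_atom(req->atom);
	}
}

static void on_completion(struct syslet_uatom *done)
{
	handle_result(done);
	if (overflow_list) {
		/* a kernel thread just became free: feed it a parked request */
		struct request *req = overflow_list;

		overflow_list = req->next;
		submit_or_park(req);
	}
}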

but note that there's a deeper reason as well for not wanting 
over-queueing: the main cost of a 'pending request' is the kernel stack 
of the blocked thread itself! So do we want to allow 'requests' to stay 
'pending' even if there are "no more threads available"? Nope: because 
letting them 'pend' would essentially (and implicitly) mean an increase 
of the thread pool.

In other words: with the syslet subsystem, a kernel thread /is/ the 
asynchronous request itself. So 'have more requests pending' means 'have 
more kernel threads'. And 'no kernel thread available' must thus mean 
'no queueing of this request'.

Thirdly, there is a performance advantage of this queueing property as 
well: by letting a cachemiss thread only do a single syslet all work is 
concentrated back to the 'head' task, and all queueing decisions are 
immediately known by user-space and can be acted upon.

So the work-queueing setup is not symmetric at all, there's a 
fundamental bias and tendency back towards the head task - this helps 
caching too. That's what Tux did too - it always tried to queue back to 
the 'head task' as soon as it could. Spreading out work dynamically and 
transparently is necessary and nice, but it's useless if the system has 
no automatic tendency to move back into single-threaded (fully cached) 
state if the workload becomes less parallel. Without this fundamental 
(and transparent) 'shrink parallelism' property syslets would only 
degrade into yet another threading construct.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-14 10:30                 ` Arjan van de Ven
@ 2007-02-14 10:41                   ` Evgeniy Polyakov
  0 siblings, 0 replies; 320+ messages in thread
From: Evgeniy Polyakov @ 2007-02-14 10:41 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Andi Kleen, linux-kernel, Linus Torvalds,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

On Wed, Feb 14, 2007 at 11:30:55AM +0100, Arjan van de Ven (arjan@infradead.org) wrote:
> > (at least on Debian
> > and Mandrake there is no locked memory limit by default).
> 
> that sounds like 2 very large bugtraq-worthy bugs in these distros.. so
> bad a bug that I almost find it hard to believe...

Well:

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
max nice                        (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) unlimited
max rt priority                 (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
$ cat /etc/debian_version
4.0

$ ulimit -a
core file size        (blocks, -c) 0
data seg size         (kbytes, -d) unlimited
file size             (blocks, -f) unlimited
max locked memory     (kbytes, -l) unlimited
max memory size       (kbytes, -m) unlimited
open files                    (-n) 1024
pipe size          (512 bytes, -p) 8
stack size            (kbytes, -s) 8192
cpu time             (seconds, -t) unlimited
max user processes            (-u) 7168
virtual memory        (kbytes, -v) unlimited
$ cat /etc/mandrake-release
Mandrake Linux release 10.0 (Community) for i586

Anyway, even if there is a limit, like the 32kb one in FC5, I doubt
any unprivileged userspace application will ever run there.
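
[ For reference, the limit that 'ulimit -l' reports can also be queried
  programmatically, so an application relying on pinned rings would
  presumably want to check something like: ]

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl))
		return 1;
	printf("locked memory: soft %lu, hard %lu (bytes)\n",
	       (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);
	return 0;
}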

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 06/11] syslets: core, documentation
  2007-02-14 10:36   ` Russell King
@ 2007-02-14 10:50     ` Ingo Molnar
  2007-02-14 11:04       ` Russell King
  0 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2007-02-14 10:50 UTC (permalink / raw)
  To: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner
  Cc: Russell King


* Russell King <rmk+lkml@arm.linux.org.uk> wrote:

> On Tue, Feb 13, 2007 at 03:20:42PM +0100, Ingo Molnar wrote:
> > +Arguments to the system call are implemented via pointers to arguments.
> > +This not only increases the flexibility of syslet atoms (multiple syslets
> > +can share the same variable for example), but is also an optimization:
> > +copy_uatom() will only fetch syscall parameters up until the point it
> > +meets the first NULL pointer. 50% of all syscalls have 2 or less
> > +parameters (and 90% of all syscalls have 4 or less parameters).
> > +
> > + [ Note: since the argument array is at the end of the atom, and the
> > +   kernel will not touch any argument beyond the final NULL one, atoms
> > +   might be packed more tightly. (the only special case exception to
> > +   this rule would be SKIP_TO_NEXT_ON_STOP atoms, where the kernel will
> > +   jump a full syslet_uatom number of bytes.) ]
> 
> What if you need to increase the number of arguments passed to a 
> system call later?  That would be an API change since the size of 
> syslet_uatom would change?

the syslet_uatom has a constant size right now, and space for a maximum 
of 6 arguments. /If/ the user knows that a specific atom (which for 
example does a sys_close()) takes only 1 argument, it could shrink the 
size of the atom down by 4 arguments.

[ i'd not actually recommend doing this, because it's generally a 
  volatile thing to play such tricks - i guess i shouldnt have written 
  that side-note in the header file :-) ]

there should be no new ABI issues: the existing syscall ABI never 
changes, it's only extended. New syslets can rely on new properties of 
new system calls. This is quite parallel to how glibc handles system 
calls.

> How do you propose syslet users know about these kinds of ABI issues 
> (including the endian-ness of 64-bit arguments) ?

syslet users would preferably be libraries like glibc - not applications 
- i'm not sure the raw syslet interface should be exposed to 
applications. Thus my current thinking is that syslets ought to be 
per-arch structures - no need to pad them out to 64 bits on 32-bit 
architectures - it's per-arch userspace that makes use of them anyway. 
system call encodings are fundamentally per-arch anyway - every arch 
does various fixups and has its own order of system calls.

but ... i'd not be against having a 'generic syscall layer' though, and 
syslets might be a good starting point for that. But that would 
necessitate a per-arch table for translating syscall numbers into this 
'generic' numbering, at minimum - or a separate sys_async_call_table[].
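
[ Roughly, such a translation table could look like the sketch below -
  the ASYNC_NR_* numbering and the do_syscall() helper are made up for
  illustration, nothing like them exists in the posted patches: ]

/* per-arch mapping from a 'generic' async call number to the native
 * syscall number of this architecture */
static const unsigned int async_call_table[] = {
	[ASYNC_NR_read]		= __NR_read,
	[ASYNC_NR_write]	= __NR_write,
	[ASYNC_NR_fsync]	= __NR_fsync,
};

static long do_async_syscall(unsigned int generic_nr, unsigned long args[6])
{
	if (generic_nr >= ARRAY_SIZE(async_call_table))
		return -ENOSYS;
	return do_syscall(async_call_table[generic_nr], args);
}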

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 06/11] syslets: core, documentation
  2007-02-14 10:50     ` Ingo Molnar
@ 2007-02-14 11:04       ` Russell King
  2007-02-14 17:52         ` Davide Libenzi
  0 siblings, 1 reply; 320+ messages in thread
From: Russell King @ 2007-02-14 11:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

On Wed, Feb 14, 2007 at 11:50:39AM +0100, Ingo Molnar wrote:
> * Russell King <rmk+lkml@arm.linux.org.uk> wrote:
> > On Tue, Feb 13, 2007 at 03:20:42PM +0100, Ingo Molnar wrote:
> > > +Arguments to the system call are implemented via pointers to arguments.
> > > +This not only increases the flexibility of syslet atoms (multiple syslets
> > > +can share the same variable for example), but is also an optimization:
> > > +copy_uatom() will only fetch syscall parameters up until the point it
> > > +meets the first NULL pointer. 50% of all syscalls have 2 or less
> > > +parameters (and 90% of all syscalls have 4 or less parameters).
> > > +
> > > + [ Note: since the argument array is at the end of the atom, and the
> > > +   kernel will not touch any argument beyond the final NULL one, atoms
> > > +   might be packed more tightly. (the only special case exception to
> > > +   this rule would be SKIP_TO_NEXT_ON_STOP atoms, where the kernel will
> > > +   jump a full syslet_uatom number of bytes.) ]
> > 
> > What if you need to increase the number of arguments passed to a 
> > system call later?  That would be an API change since the size of 
> > syslet_uatom would change?
> 
> the syslet_uatom has a constant size right now, and space for a maximum 
> of 6 arguments. /If/ the user knows that a specific atom (which for 
> example does a sys_close()) takes only 1 argument, it could shrink the 
> size of the atom down by 4 arguments.
> 
> [ i'd not actually recommend doing this, because it's generally a 
>   volatile thing to play such tricks - i guess i shouldnt have written 
>   that side-note in the header file :-) ]
> 
> there should be no new ABI issues: the existing syscall ABI never 
> changes, it's only extended. New syslets can rely on new properties of 
> new system calls. This is quite parallel to how glibc handles system 
> calls.

Let me spell it out, since you appear to have completely missed my point.

At the moment, SKIP_TO_NEXT_ON_STOP is specified to "jump a full
syslet_uatom number of bytes".

If we end up with a system call being added which requires more than
the currently allowed number of arguments (and it _has_ happened before)
then either those syscalls are not accessible to syslets, or you need
to increase the arg_ptr array.

That makes syslet_uatom larger.

If syslet_uatom is larger, SKIP_TO_NEXT_ON_STOP increments the syslet_uatom
pointer by a greater number of bytes.

If we're running a set of userspace syslets built for an older kernel on
such a newer kernel, that is an incompatible change which will break.

> > How do you propose syslet users know about these kinds of ABI issues 
> > (including the endian-ness of 64-bit arguments) ?
> 
> syslet users would preferably be libraries like glibc - not applications 
> - i'm not sure the raw syslet interface should be exposed to 
> applications. Thus my current thinking is that syslets ought to be 
> per-arch structures - no need to pad them out to 64 bits on 32-bit 
> architectures - it's per-arch userspace that makes use of them anyway. 
> system call encodings are fundamentally per-arch anyway - every arch 
> does various fixups and has its own order of system calls.
> 
> but ... i'd not be against having a 'generic syscall layer' though, and 
> syslets might be a good starting point for that. But that would 
> necessiate a per-arch table of translating syscall numbers into this 
> 'generic' numbering, at minimum - or a separate sys_async_call_table[].

Okay - I guess the userspace library approach is fine, but it needs
to be documented that applications which build syslets directly are
going to be non-portable.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 20:39       ` Ingo Molnar
  2007-02-13 22:36         ` Dmitry Torokhov
@ 2007-02-14 11:07         ` Alan
  1 sibling, 0 replies; 320+ messages in thread
From: Alan @ 2007-02-14 11:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dmitry Torokhov, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

> > Ooooohh. OpenVMS lives forever ;) Me likeee ;)
> 
> hm, i dont know OpenVMS - but googled around a bit for 'VMS 
> asynchronous' and it gave me this:

VMS had SYS$QIO, which is asynchronous I/O queueing with completions of
sorts. You had to specifically remember if you wanted to do synchronous
I/O.

Nothing afaik quite like a series of commands batched async, although VMS
has a call for everything else so it's possible there is one buried in the
back of volume 347 of the grey wall ;)

Looking at the completion side I'm not 100% sure we need async_wait given
the async batches can include futex operations...

Alan

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-14 10:37               ` Ingo Molnar
@ 2007-02-14 11:10                 ` Evgeniy Polyakov
  2007-02-14 17:17                 ` Davide Libenzi
  1 sibling, 0 replies; 320+ messages in thread
From: Evgeniy Polyakov @ 2007-02-14 11:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Benjamin LaHaise, Alan, linux-kernel, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Ulrich Drepper, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

On Wed, Feb 14, 2007 at 11:37:31AM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> > Let me clarify what I meant. There is only limited number of threads, 
> > which are supposed to execute blocking context, so when all they are 
> > used, main one will block too - I asked about possibility to reuse the 
> > same thread to execute queue of requests attached to it, each request 
> > can block, but if blocking issue is removed, it would be possible to 
> > return.
> 
> ah, ok, i understand your point. This is not quite possible: the 
> cachemisses are driven from schedule(), which can be arbitraily deep 
> inside arbitrary system calls. It can be in a mutex_lock() deep inside a 
> driver. It can be due to a alloc_pages() call done by a kmalloc() call 
> done from within ext3, which was called from the loopback block driver, 
> which was called from XFS, which was called from a VFS syscall.

That's only because schedule() is the main point where
'rescheduling'/requeuing (a task switch, in other words) happens - but if
it were possible to bypass schedule()'s decision and not reschedule
there, doing it 'on demand' instead, would it be possible to reuse the
same syslet?

Let me show an example:
consider aio_sendfile() on a big file, so it is not possible to fully
get it into the VFS cache, but spinning on a per-page basis (like right
now) is not an optimal solution either. For kevent AIO I created a new
address space operation, aio_getpages(), which is essentially
mpage_readpages() - it populates several pages into the VFS in one BIO
(if possible, otherwise in the smallest possible number of chunks) and
then in the BIO destruction callback (actually in the bio_endio callback,
but for this case it can be considered the same) I reschedule the same
request to some other thread (not exactly the same one that started it).
The processed data is then sent, and the next chunk of the file is
populated into the VFS using aio_getpages(), whose BIO callback will
reschedule the same request again.

So it is possible with essentially one thread (or limited number of
them) to fill the whole IO pipe.

With the syslet approach this seems to be impossible, due to the fact
that the request is a whole sendfile. Even if one uses proper readahead
(fadvise) advice, there is no possibility to split the sendfile and form
it as a set of essentially the same requests with different
start/offset/whatever parameters (well, for sendfile() specifically it is
possible - just set up several calls in one syslet from different offsets
and with different lengths and form a proper state machine out of them -
but for example TCP recv() will not match that scenario).

So my main question was about the possibility of reusing the syslet
state machine in kevent AIO instead of my own (although my own one
currently lacks only one good feature of the syslet threads - its set of
threads is global, not per-task, which does not scale well with the
number of different processes doing IO), so as not to duplicate the code
if kevent ever gets merged.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 14:20 ` [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support Ingo Molnar
                     ` (3 preceding siblings ...)
  2007-02-14  4:42   ` Willy Tarreau
@ 2007-02-14 12:37   ` Pavel Machek
  2007-02-14 17:14     ` Linus Torvalds
  2007-02-14 20:52   ` Jeremy Fitzhardinge
  2007-02-15  2:44   ` Zach Brown
  6 siblings, 1 reply; 320+ messages in thread
From: Pavel Machek @ 2007-02-14 12:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

Hi!
  
> The boring details:
> 
> Syslets consist of 'syslet atoms', where each atom represents a single 
> system-call. These atoms can be chained to each other: serially, in 
> branches or in loops. The return value of an executed atom is checked 
> against the condition flags. So an atom can specify 'exit on nonzero' or 
> 'loop until non-negative' kind of constructs.

Ouch, yet another interpreter in kernel :-(. Can we reuse acpi or
something?

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-13 14:20 ` [patch 05/11] syslets: core code Ingo Molnar
  2007-02-13 23:15   ` Andi Kleen
@ 2007-02-14 12:43   ` Guillaume Chazarain
  2007-02-14 13:17   ` Stephen Rothwell
  2007-02-14 20:38   ` Linus Torvalds
  3 siblings, 0 replies; 320+ messages in thread
From: Guillaume Chazarain @ 2007-02-14 12:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 705 bytes --]

Ingo Molnar wrote:
> +	if (unlikely(signal_pending(t) || need_resched()))
> +		goto stop;
>   

So, this is how you'll prevent me from running an infinite loop ;-)
The attached patch adds a cond_resched() instead, to allow infinite
loops without DoS. I dropped the unlikely() as it's already in the
definition of signal_pending().

> +asmlinkage long sys_async_wait(unsigned long min_wait_events)
>   

Here I would expect:

    sys_async_wait_for_all(struct syslet_atom *atoms, long nr_atoms)

and

    sys_async_wait_for_any(struct syslet_atom *atoms, long nr_atoms).

This way syslets can be used by different parts of a program without
having them waiting for each other.

Thanks.

-- 
Guillaume


[-- Attachment #2: cond_resched.diff --]
[-- Type: text/x-patch, Size: 321 bytes --]

--- linux-2.6/kernel/async.c
+++ linux-2.6/kernel/async.c
@@ -433,9 +433,10 @@
 
 	last_uatom = uatom;
 	ret = __exec_atom(t, &atom);
-	if (unlikely(signal_pending(t) || need_resched()))
+	if (signal_pending(t))
 		goto stop;
 
+	cond_resched();
 	uatom = next_uatom(&atom, uatom, ret);
 	if (uatom)
 		goto run_next;




^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-13 14:20 ` [patch 05/11] syslets: core code Ingo Molnar
  2007-02-13 23:15   ` Andi Kleen
  2007-02-14 12:43   ` Guillaume Chazarain
@ 2007-02-14 13:17   ` Stephen Rothwell
  2007-02-14 20:38   ` Linus Torvalds
  3 siblings, 0 replies; 320+ messages in thread
From: Stephen Rothwell @ 2007-02-14 13:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 389 bytes --]

Hi Ingo,

On Tue, 13 Feb 2007 15:20:35 +0100 Ingo Molnar <mingo@elte.hu> wrote:
>
> From: Ingo Molnar <mingo@elte.hu>
>
> the core syslet / async system calls infrastructure code.

It occurred to me that the 32-bit compat code for 64-bit architectures for
all this could be very hairy ...

--
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-14 12:37   ` Pavel Machek
@ 2007-02-14 17:14     ` Linus Torvalds
  0 siblings, 0 replies; 320+ messages in thread
From: Linus Torvalds @ 2007-02-14 17:14 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ingo Molnar, linux-kernel, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner



On Wed, 14 Feb 2007, Pavel Machek wrote:
> 
> Ouch, yet another interpretter in kernel :-(. Can we reuse acpi or
> something?

Hah. You make the joke! I get it!

Mwahahahaa! 

		Linus

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-14 10:37               ` Ingo Molnar
  2007-02-14 11:10                 ` Evgeniy Polyakov
@ 2007-02-14 17:17                 ` Davide Libenzi
  1 sibling, 0 replies; 320+ messages in thread
From: Davide Libenzi @ 2007-02-14 17:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Evgeniy Polyakov, Benjamin LaHaise, Alan,
	Linux Kernel Mailing List, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Ulrich Drepper, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Thomas Gleixner

On Wed, 14 Feb 2007, Ingo Molnar wrote:

> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > Let me clarify what I meant. There is only limited number of threads, 
> > which are supposed to execute blocking context, so when all they are 
> > used, main one will block too - I asked about possibility to reuse the 
> > same thread to execute queue of requests attached to it, each request 
> > can block, but if blocking issue is removed, it would be possible to 
> > return.
> 
> ah, ok, i understand your point. This is not quite possible: the 
> cachemisses are driven from schedule(), which can be arbitraily deep 
> inside arbitrary system calls. It can be in a mutex_lock() deep inside a 
> driver. It can be due to a alloc_pages() call done by a kmalloc() call 
> done from within ext3, which was called from the loopback block driver, 
> which was called from XFS, which was called from a VFS syscall.
> 
> Even if it were possible to backtrack i'm quite sure we dont want to do 
> this, for three main reasons:

IMO it'd be quite simple. We detect the service-thread-full condition
*before* entering exec_atom, and we queue the atom in an async_head request
list. Yes, there is the chance that from the test time in sys_async_exec,
to the time we'll end up entering exec_atom and down to schedule, one
of the threads would become free, but IMO that's better than blocking
sys_async_exec.
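
[ In pseudo-code, roughly (the list and helper names below are made up,
  only exec_atom()/sys_async_exec() come from the actual code being
  discussed): ]

/* in sys_async_exec(), before running the atom: */
if (no_cachemiss_thread_free(ah)) {
	/* park the atom instead of letting the head task block */
	list_add_tail(&req->list, &ah->pending_atoms);
	return NULL;	/* tell user-space it went async */
}
return exec_atom(ah, t, uatom);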



- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 06/11] syslets: core, documentation
  2007-02-14 11:04       ` Russell King
@ 2007-02-14 17:52         ` Davide Libenzi
  2007-02-14 18:03           ` Benjamin LaHaise
  0 siblings, 1 reply; 320+ messages in thread
From: Davide Libenzi @ 2007-02-14 17:52 UTC (permalink / raw)
  To: Russell King
  Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, Evgeniy Polyakov, David S. Miller,
	Benjamin LaHaise, Suparna Bhattacharya, Thomas Gleixner

On Wed, 14 Feb 2007, Russell King wrote:

> Let me spell it out, since you appear to have completely missed my point.
> 
> At the moment, SKIP_TO_NEXT_ON_STOP is specified to jump a "jump a full
> syslet_uatom number of bytes".
> 
> If we end up with a system call being added which requires more than
> the currently allowed number of arguments (and it _has_ happened before)
> then either those syscalls are not accessible to syslets, or you need
> to increase the arg_ptr array.

I was thinking about this yesterday, since I honestly thought that this 
whole chaining, and conditions, and parameter lists, and arguments passed
by pointers, etc... was in the end a little clumsy IMO.
Wouldn't a syslet look better like:

long syslet(void *ctx) {
	struct sctx *c = ctx;

	if (open(c->file, ...) == -1)
		return -1;
	read();
	send();
	blah();
	...
	return 0;
}

That'd be, instead of passing a chain of atoms, with the kernel 
interpreting conditions, and parameter lists, etc..., we let gcc 
do this stuff for us, and we pass the "clet" :) pointer to sys_async_exec, 
that exec the above under the same schedule-trapped environment, but in 
userspace. We setup a special userspace ad-hoc frame (ala signal), and we 
trap underneath task schedule attempt in the same way we do now.
We setup the frame and when we return from sys_async_exec, we basically 
enter the "clet", that will return to a ret_from_async, that will return 
to userspace. Or, maybe we can support both. A simple single-syscall exec 
in the way we do now, and a clet way for the ones that requires chains and 
conditions. Hmmm?



- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 06/11] syslets: core, documentation
  2007-02-14 17:52         ` Davide Libenzi
@ 2007-02-14 18:03           ` Benjamin LaHaise
  2007-02-14 19:45             ` Davide Libenzi
  0 siblings, 1 reply; 320+ messages in thread
From: Benjamin LaHaise @ 2007-02-14 18:03 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Russell King, Ingo Molnar, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Suparna Bhattacharya,
	Thomas Gleixner

On Wed, Feb 14, 2007 at 09:52:20AM -0800, Davide Libenzi wrote:
> That'd be, instead of passing a chain of atoms, with the kernel 
> interpreting conditions, and parameter lists, etc..., we let gcc 
> do this stuff for us, and we pass the "clet" :) pointer to sys_async_exec, 
> that exec the above under the same schedule-trapped environment, but in 
> userspace. We setup a special userspace ad-hoc frame (ala signal), and we 
> trap underneath task schedule attempt in the same way we do now.
> We setup the frame and when we return from sys_async_exec, we basically 
> enter the "clet", that will return to a ret_from_async, that will return 
> to userspace. Or, maybe we can support both. A simple single-syscall exec 
> in the way we do now, and a clet way for the ones that requires chains and 
> conditions. Hmmm?

Which is just the same as using threads.  My argument is that once you 
look at all the details involved, what you end up arriving at is the 
creation of threads.  Threads are relatively cheap, it's just that the 
hardware currently has several performance bugs with them on x86 (and more 
on x86-64 with the MSR fiddling that hits the hot path).  Architectures 
like powerpc are not going to benefit anywhere near as much from this 
exercise, as the state involved is processed much more sanely.  IA64 as 
usual is simply doomed by way of having too many registers to switch.

If people really want to go down this path, please make an effort to compare 
threads on a properly tuned platform.  This means that things like the kernel 
and userland stacks must take into account the cache alignment (we do some 
of this already, but there are some very definite L1 cache colour collisions 
between commonly hit data structures amongst threads).  The existing AIO 
ringbuffer suffers from this, as important data is always on the beginning 
of the first page.  Yes, these might be microoptimizations, but accumulated 
changes of this nature have been known to buy 100%+ improvements in 
performance.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 06/11] syslets: core, documentation
  2007-02-14 18:03           ` Benjamin LaHaise
@ 2007-02-14 19:45             ` Davide Libenzi
  2007-02-14 20:03               ` Benjamin LaHaise
  0 siblings, 1 reply; 320+ messages in thread
From: Davide Libenzi @ 2007-02-14 19:45 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Russell King, Ingo Molnar, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Suparna Bhattacharya,
	Thomas Gleixner

On Wed, 14 Feb 2007, Benjamin LaHaise wrote:

> On Wed, Feb 14, 2007 at 09:52:20AM -0800, Davide Libenzi wrote:
> > That'd be, instead of passing a chain of atoms, with the kernel 
> > interpreting conditions, and parameter lists, etc..., we let gcc 
> > do this stuff for us, and we pass the "clet" :) pointer to sys_async_exec, 
> > that exec the above under the same schedule-trapped environment, but in 
> > userspace. We setup a special userspace ad-hoc frame (ala signal), and we 
> > trap underneath task schedule attempt in the same way we do now.
> > We setup the frame and when we return from sys_async_exec, we basically 
> > enter the "clet", that will return to a ret_from_async, that will return 
> > to userspace. Or, maybe we can support both. A simple single-syscall exec 
> > in the way we do now, and a clet way for the ones that requires chains and 
> > conditions. Hmmm?
> 
> Which is just the same as using threads.  My argument is that once you 
> look at all the details involved, what you end up arriving at is the 
> creation of threads.  Threads are relatively cheap, it's just that the 
> hardware currently has several performance bugs with them on x86 (and more 
> on x86-64 with the MSR fiddling that hits the hot path).  Architectures 
> like powerpc are not going to benefit anywhere near as much from this 
> exercise, as the state involved is processed much more sanely.  IA64 as 
> usual is simply doomed by way of having too many registers to switch.

Sort of, except that the whole thing can complete synchronously w/out 
context switches. The real point of the whole fibrils/syslets solution is 
that kind of optimization. The solution is as good as it is now, for 
single syscalls (modulo sys_async_cancel implementation), but for multiple 
chained submission it kinda stinks IMHO. Once you have to build chains, 
and conditions, and new syscalls to implement userspace variable 
increments, and so on..., at that point it's better to have the chain
be coded in C ala a thread proc. Yes, it requires a frame setup and another 
entry to the kernel, but IMO that will be amortized in the cost of the 
multiple syscalls inside the "clet".



- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 06/11] syslets: core, documentation
  2007-02-14 19:45             ` Davide Libenzi
@ 2007-02-14 20:03               ` Benjamin LaHaise
  2007-02-14 20:14                 ` Davide Libenzi
  0 siblings, 1 reply; 320+ messages in thread
From: Benjamin LaHaise @ 2007-02-14 20:03 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Russell King, Ingo Molnar, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Suparna Bhattacharya,
	Thomas Gleixner

On Wed, Feb 14, 2007 at 11:45:23AM -0800, Davide Libenzi wrote:
> Sort of, except that the whole thing can complete syncronously w/out 
> context switches. The real point of the whole fibrils/syslets solution is 
> that kind of optimization. The solution is as good as it is now, for 

Except that You Can't Do That (tm).  Try to predict beforehand if the code 
path being followed will touch the FPU or SSE state, and you can't.  There is 
no way to avoid the context switch overhead, as you have to preserve things 
so that whatever state is being returned to the user is as it was.  Unless 
you plan on resetting the state beforehand, but then you have to call into 
arch specific code that ends up with a comparable overhead to the context 
switch.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 06/11] syslets: core, documentation
  2007-02-14 20:03               ` Benjamin LaHaise
@ 2007-02-14 20:14                 ` Davide Libenzi
  2007-02-14 20:34                   ` Benjamin LaHaise
  0 siblings, 1 reply; 320+ messages in thread
From: Davide Libenzi @ 2007-02-14 20:14 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Russell King, Ingo Molnar, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Suparna Bhattacharya,
	Thomas Gleixner

On Wed, 14 Feb 2007, Benjamin LaHaise wrote:

> On Wed, Feb 14, 2007 at 11:45:23AM -0800, Davide Libenzi wrote:
> > Sort of, except that the whole thing can complete syncronously w/out 
> > context switches. The real point of the whole fibrils/syslets solution is 
> > that kind of optimization. The solution is as good as it is now, for 
> 
> Except that You Can't Do That (tm).  Try to predict beforehand if the code 
> path being followed will touch the FPU or SSE state, and you can't.  There is 
> no way to avoid the context switch overhead, as you have to preserve things 
> so that whatever state is being returned to the user is as it was.  Unless 
> you plan on resetting the state beforehand, but then you have to call into 
> arch specific code that ends up with a comparable overhead to the context 
> switch.

I think you may have mis-interpreted my words. *When* a schedule would 
block a synco execution try, then you do have a context switch. Noone 
argue that, and the code is clear. The sys_async_exec thread will block, 
and a newly woke up thread will re-emerge to sys_async_exec with a NULL 
returned to userspace. But in a "cachehit" case (no schedule happens 
during the syscall/*let execution), there is no context switch at all. 
That is the whole point of the optimization.



- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 06/11] syslets: core, documentation
  2007-02-14 20:14                 ` Davide Libenzi
@ 2007-02-14 20:34                   ` Benjamin LaHaise
  2007-02-14 21:06                     ` Davide Libenzi
  2007-02-14 21:49                     ` [patch] x86: split FPU state from task state Ingo Molnar
  0 siblings, 2 replies; 320+ messages in thread
From: Benjamin LaHaise @ 2007-02-14 20:34 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Russell King, Ingo Molnar, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Suparna Bhattacharya,
	Thomas Gleixner

On Wed, Feb 14, 2007 at 12:14:29PM -0800, Davide Libenzi wrote:
> I think you may have mis-interpreted my words. *When* a schedule would 
> block a synco execution try, then you do have a context switch. Noone 
> argue that, and the code is clear. The sys_async_exec thread will block, 
> and a newly woke up thread will re-emerge to sys_async_exec with a NULL 
> returned to userspace. But in a "cachehit" case (no schedule happens 
> during the syscall/*let execution), there is no context switch at all. 
> That is the whole point of the optimization.

And I will repeat myself: that cannot be done.  Tell me how the following 
what if scenario works: you're in an MMX optimized memory copy and you take 
a page fault.  How does returning to the submittor of the async operation 
get the correct MMX state restored?  It doesn't.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-13 14:20 ` [patch 05/11] syslets: core code Ingo Molnar
                     ` (2 preceding siblings ...)
  2007-02-14 13:17   ` Stephen Rothwell
@ 2007-02-14 20:38   ` Linus Torvalds
  2007-02-14 21:02     ` Ingo Molnar
                       ` (3 more replies)
  3 siblings, 4 replies; 320+ messages in thread
From: Linus Torvalds @ 2007-02-14 20:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner



On Tue, 13 Feb 2007, Ingo Molnar wrote:
> 
> the core syslet / async system calls infrastructure code.

Ok, having now looked at it more, I can say:

 - I hate it.

I dislike it intensely, because it's so _close_ to being usable. But the 
programming interface looks absolutely horrid for any "casual" use, and 
while the loops etc look like fun, I think they are likely to be less than 
useful in practice. Yeah, you can do the "setup and teardown" just once, 
but it ends up being "once per user", and it ends up being a lot of stuff 
to do for somebody who wants to just do some simple async stuff.

And the whole "lock things down in memory" approach is bad. It's doing 
expensive things like mlock(), making the overhead for _single_ system 
calls much more expensive. Since I don't actually believe that the 
non-single case is even all that interesting, I really don't like it.

I think it's clever and potentially useful to allow user mode to see the 
data structures (and even allow user mode to *modify* them) while the 
async thing is running, but it really seems to be a case of excessive 
cleverness.

For example, how would you use this to emulate the *current* aio_read() 
etc interfaces that don't have any user-level component except for the 
actual call? And if you can't do that, the whole exercise is pointless.

Or how would you do the trivial example loop that I explained was a good 
idea:

        struct one_entry *prev = NULL;
        struct dirent *de;

        while ((de = readdir(dir)) != NULL) {
                struct one_entry *entry = malloc(..);

                /* Add it to the list, fill in the name */
                entry->next = prev;
                prev = entry;
                strcpy(entry->name, de->d_name);

                /* Do the stat lookup async */
                async_stat(de->d_name, &entry->stat_buf);
        }
        wait_for_async();
        .. Ta-daa! All done ..


Notice? This also "chains system calls together", but it does it using a 
*much* more powerful entity called "user space". That's what user space 
is. And yeah, it's a pretty complex sequencer, but happily we have 
hardware support for accelerating it to the point that the kernel never 
even needs to care.

The above is a *realistic* scenario, where you actually have things like 
memory allocation etc going on. In contrast, just chaining system calls 
together isn't a realistic scenario at all.

So I think we have one _known_ usage scenario:

 - replacing the _existing_ aio_read() etc system calls (with not just 
   existing semantics, but actually binary-compatible)

 - simple code use where people are willing to perhaps do something 
   Linux-specific, but because it's so _simple_, they'll do it.

In neither case does the "chaining atoms together" seem to really solve 
the problem. It's clever, but it's not what people would actually do.

And yes, you can hide things like that behind an abstraction library, but 
once you start doing that, I've got three questions for you:

 - what's the point?
 - we're adding overhead, so how are we getting it back
 - how do we handle independent libraries each doing their own thing and 
   version skew between them?

In other words, the "let user space sort out the complexity" is not a good 
answer. It just means that the interface is badly designed.

			Linus

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 14:20 ` [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support Ingo Molnar
                     ` (4 preceding siblings ...)
  2007-02-14 12:37   ` Pavel Machek
@ 2007-02-14 20:52   ` Jeremy Fitzhardinge
  2007-02-14 21:36     ` Davide Libenzi
  2007-02-15  2:44   ` Zach Brown
  6 siblings, 1 reply; 320+ messages in thread
From: Jeremy Fitzhardinge @ 2007-02-14 20:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

Ingo Molnar wrote:
> Syslets consist of 'syslet atoms', where each atom represents a single 
> system-call. These atoms can be chained to each other: serially, in 
> branches or in loops. The return value of an executed atom is checked 
> against the condition flags. So an atom can specify 'exit on nonzero' or 
> 'loop until non-negative' kind of constructs.
>
> Syslet atoms fundamentally execute only system calls, thus to be able to 
> manipulate user-space variables from syslets i've added a simple special 
> system call: sys_umem_add(ptr, val). This can be used to increase or 
> decrease the user-space variable (and to get the result), or to simply 
> read out the variable (if 'val' is 0).
>   

This looks very interesting.  A couple of questions:

Are there any special semantics that result from running the syslet
atoms in kernel mode?  If I wanted to, could I write a syslet emulation
in userspace that's functionally identical to a kernel-based
implementation?  (Obviously the performance characteristics will be
different.)

I'm asking from the perspective of trying to work out the Valgrind
binding for this if it goes into the kernel.  Valgrind needs to see all
the input and output values of each system call the client makes,
including those done within the syslet mechanism.  It seems to me that
the easiest way to do this would be to intercept the syslet system
calls, and just implement them in usermode, performing the same series
of syscalls directly, and applying the Valgrind machinery to each one in
turn.

Would this work?
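
[ i.e. something along the lines of the sketch below - the atom layout
  here is illustrative only (it just mimics the NULL-terminated argument
  pointers from the documentation), the point being that every syscall is
  issued individually via syscall(2), where Valgrind can observe it: ]

#include <unistd.h>

/* illustrative stand-in for the real syslet atom */
struct emu_atom {
	long nr;			/* syscall number */
	long *arg_ptr[6];		/* NULL-terminated argument list */
	long *ret_ptr;			/* where the result is stored */
	struct emu_atom *next;		/* next atom, NULL to stop */
};

static void emulate_syslet(struct emu_atom *a)
{
	for (; a; a = a->next) {
		long arg[6] = { 0, 0, 0, 0, 0, 0 };
		int i;

		for (i = 0; i < 6 && a->arg_ptr[i]; i++)
			arg[i] = *a->arg_ptr[i];

		*a->ret_ptr = syscall(a->nr, arg[0], arg[1], arg[2],
				      arg[3], arg[4], arg[5]);
	}
}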

Also, an unrelated question: is there enough control structure in place
to allow multiple syslet streams to synchronize with each other with
futexes?

Thanks,
    J

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-14 20:38   ` Linus Torvalds
@ 2007-02-14 21:02     ` Ingo Molnar
  2007-02-14 21:12       ` Ingo Molnar
  2007-02-14 21:26       ` Linus Torvalds
  2007-02-14 21:09     ` Davide Libenzi
                       ` (2 subsequent siblings)
  3 siblings, 2 replies; 320+ messages in thread
From: Ingo Molnar @ 2007-02-14 21:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> And the whole "lock things down in memory" approach is bad. It's doing 
> expensive things like mlock(), making the overhead for _single_ system 
> calls much more expensive. [...]

hm, there must be some misunderstanding here. That mlock is /only/ once 
per the lifetime of the whole 'head' - i.e. per sys_async_register(). 
(And you can even forget i ever did it - it's 5 lines of code to turn 
the completion ring into a swappable entity.)

never does any MMU trick ever enter the picture during the whole 
operation of this thing, and that's very much intentional.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 06/11] syslets: core, documentation
  2007-02-14 20:34                   ` Benjamin LaHaise
@ 2007-02-14 21:06                     ` Davide Libenzi
  2007-02-14 21:44                       ` Benjamin LaHaise
  2007-02-14 21:49                     ` [patch] x86: split FPU state from task state Ingo Molnar
  1 sibling, 1 reply; 320+ messages in thread
From: Davide Libenzi @ 2007-02-14 21:06 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Russell King, Ingo Molnar, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Suparna Bhattacharya,
	Thomas Gleixner

On Wed, 14 Feb 2007, Benjamin LaHaise wrote:

> On Wed, Feb 14, 2007 at 12:14:29PM -0800, Davide Libenzi wrote:
> > I think you may have mis-interpreted my words. *When* a schedule would 
> > block a synco execution try, then you do have a context switch. Noone 
> > argue that, and the code is clear. The sys_async_exec thread will block, 
> > and a newly woke up thread will re-emerge to sys_async_exec with a NULL 
> > returned to userspace. But in a "cachehit" case (no schedule happens 
> > during the syscall/*let execution), there is no context switch at all. 
> > That is the whole point of the optimization.
> 
> And I will repeat myself: that cannot be done.  Tell me how the following 
> what if scenario works: you're in an MMX optimized memory copy and you take 
> a page fault.  How does returning to the submittor of the async operation 
> get the correct MMX state restored?  It doesn't.

Bear with me Ben, and let's follow this up :) If you are in the middle of 
an MMX copy operation, inside the syscall, you are:

- Userspace, on task A, calls sys_async_exec

- Userspace in _not_ doing any MMX stuff before the call

- We execute the syscall

- Task A, executing the syscall and inside an MMX copy operation, gets a 
  page fault

- We get a schedule

- Task A MMX state will *follow* task A, that will be put to sleep

- We wake task B that will return to userspace

So if the MMX work happens inside the syscall execution, we're fine 
because its context will follow the same task being put to sleep.
The problem would be preserving the *caller* (userspace) context. But that 
can be done in a lazy way (detecting whether task A used the FPU), like 
we're currently doing it, once we detect a schedule-out condition. That 
wouldn't be the most common case for many userspace programs in any case.




- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-14 20:38   ` Linus Torvalds
  2007-02-14 21:02     ` Ingo Molnar
@ 2007-02-14 21:09     ` Davide Libenzi
  2007-02-14 22:09     ` Ingo Molnar
  2007-02-15 13:35     ` Evgeniy Polyakov
  3 siblings, 0 replies; 320+ messages in thread
From: Davide Libenzi @ 2007-02-14 21:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Linux Kernel Mailing List, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Thomas Gleixner

On Wed, 14 Feb 2007, Linus Torvalds wrote:

> 
> 
> On Tue, 13 Feb 2007, Ingo Molnar wrote:
> > 
> > the core syslet / async system calls infrastructure code.
> 
> Ok, having now looked at it more, I can say:
> 
>  - I hate it.
> 
> I dislike it intensely, because it's so _close_ to being usable. But the 
> programming interface looks absolutely horrid for any "casual" use, and 
> while the loops etc look like fun, I think they are likely to be less than 
> useful in practice. Yeah, you can do the "setup and teardown" just once, 
> but it ends up being "once per user", and it ends up being a lot of stuff 
> to do for somebody who wants to just do some simple async stuff.
> 
> And the whole "lock things down in memory" approach is bad. It's doing 
> expensive things like mlock(), making the overhead for _single_ system 
> calls much more expensive. Since I don't actually believe that the 
> non-single case is even all that interesting, I really don't like it.
> 
> I think it's clever and potentially useful to allow user mode to see the 
> data structures (and even allow user mode to *modify* them) while the 
> async thing is running, but it really seems to be a case of excessive 
> cleverness.

Ok, that makes the weirdo-count up to two :) I agree with you that the 
chained API can be improved at least.



- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-14 21:02     ` Ingo Molnar
@ 2007-02-14 21:12       ` Ingo Molnar
  2007-02-14 21:26       ` Linus Torvalds
  1 sibling, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2007-02-14 21:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner


* Ingo Molnar <mingo@elte.hu> wrote:

> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > And the whole "lock things down in memory" approach is bad. It's 
> > doing expensive things like mlock(), making the overhead for 
> > _single_ system calls much more expensive. [...]
> 
> hm, there must be some misunderstanding here. That mlock is /only/ 
> once per the lifetime of the whole 'head' - i.e. per 
> sys_async_register(). (And you can even forget i ever did it - it's 5 
> lines of code to turn the completion ring into a swappable entity.)
> 
> never does any MMU trick ever enter the picture during the whole 
> operation of this thing, and that's very much intentional.

to stress it: never does any mlocking or other lockdown happen of any 
syslet atom - it is /only/ the completion ring of syslet pointers that i 
made mlocked - but even that can be made generic memory no problem.

It's all about asynchronous system calls, and if you want you can have a 
terabyte of syslets in user memory, half of it swapped out. They have 
absolutely zero kernel context attached to them in the 'cached case' (be 
that locked memory or some other kernel resource).

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-14 21:02     ` Ingo Molnar
  2007-02-14 21:12       ` Ingo Molnar
@ 2007-02-14 21:26       ` Linus Torvalds
  2007-02-14 21:35         ` Ingo Molnar
                           ` (2 more replies)
  1 sibling, 3 replies; 320+ messages in thread
From: Linus Torvalds @ 2007-02-14 21:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner



On Wed, 14 Feb 2007, Ingo Molnar wrote:
> 
> hm, there must be some misunderstanding here. That mlock is /only/ once 
> per the lifetime of the whole 'head' - i.e. per sys_async_register(). 
> (And you can even forget i ever did it - it's 5 lines of code to turn 
> the completion ring into a swappable entity.)

But the whole point is that the notion of a "register" is wrong in the 
first place. It's wrong because:

 - it assumes we are going to make these complex state machines (which I 
   don't believe for a second that a real program will do)

 - it assumes that we're going to make many async system calls that go 
   together (which breaks the whole notion of having different libraries 
   using this for their own internal reasons - they may not even *know* 
   about other libraries that _also_ do async IO for *their* reasons)

 - it fundamentally is based on a broken notion that everything would use 
   this "AIO atom" in the first place, WHICH WE KNOW IS INCORRECT, since 
   current users use "aio_read()" that simply doesn't have that and 
   doesn't build up any such data structures.

So please answer my questions. The problem wasn't the mlock(), even though 
that was just STUPID. The problem was much deeper. This is not a "prepare 
to do a lot of very boutique linked list operations" problem. This is a 
"people already use 'aio_read()' and want to extend on it" problem.

You didn't at all react to that fundamental issue: you have an overly 
complex and clever thing that doesn't actually *match* what people do.

			Linus

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-14 21:26       ` Linus Torvalds
@ 2007-02-14 21:35         ` Ingo Molnar
  2007-02-15  2:52           ` Zach Brown
  2007-02-14 21:44         ` Ingo Molnar
  2007-02-14 21:56         ` Alan
  2 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2007-02-14 21:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> But the whole point is that the notion of a "register" is wrong in the 
> first place. [...]

forget about it then. The thing we "register" is dead-simple:

 struct async_head_user {
         struct syslet_uatom __user              **completion_ring;
         unsigned long                           ring_size_bytes;
         unsigned long                           max_nr_threads;
 };

this can be passed in to sys_async_exec() as a second pointer, and the 
kernel can put the expected-completion pointer (and the user ring idx 
pointer) into its struct atom. It's just a few instructions, and only in 
the cachemiss case.

that would make completions arbitrarily split-up-able. No registration 
whatsoever. A waiter could specify which ring's events it is interested 
in. A 'ring' could be a single-entry thing as well, for a single 
instance of pending IO.
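
in usage terms that would look roughly like this - a sketch only, the 
two-argument sys_async_exec() form and the return convention are part of 
the proposal above, not code in the posted patches:

 /* a single-entry 'ring', private to one pending IO: */
 static struct syslet_uatom *my_ring[1];

 static struct async_head_user my_head = {
         .completion_ring        = my_ring,
         .ring_size_bytes        = sizeof(my_ring),
         .max_nr_threads         = 1,
 };

 static struct syslet_uatom *submit_one(struct syslet_uatom *atom)
 {
         /* no registration step - the head travels with the submission: */
         return sys_async_exec(atom, &my_head);
 }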

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-14 20:52   ` Jeremy Fitzhardinge
@ 2007-02-14 21:36     ` Davide Libenzi
  2007-02-15  0:08       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 320+ messages in thread
From: Davide Libenzi @ 2007-02-14 21:36 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, Evgeniy Polyakov, David S. Miller,
	Benjamin LaHaise, Suparna Bhattacharya, Thomas Gleixner

On Wed, 14 Feb 2007, Jeremy Fitzhardinge wrote:

> Are there any special semantics that result from running the syslet
> atoms in kernel mode?  If I wanted to, could I write a syslet emulation
> in userspace that's functionally identical to a kernel-based
> implementation?  (Obviously the performance characteristics will be
> different.)
> 
> I'm asking from the perspective of trying to work out the Valgrind
> binding for this if it goes into the kernel.  Valgrind needs to see all
> the input and output values of each system call the client makes,
> including those done within the syslet mechanism.  It seems to me that
> the easiest way to do this would be to intercept the syslet system
> calls, and just implement them in usermode, performing the same series
> of syscalls directly, and applying the Valgrind machinery to each one in
> turn.
> 
> Would this work?

Hopefully the API will simplify enough that emulation becomes 
easier.



> Also, an unrelated question: is there enough control structure in place
> to allow multiple syslet streams to synchronize with each other with
> futexes?

I think the whole point of async execution of a syscall or a syslet is 
that the syscall/syslet itself performs operations that are not interlocked 
with other syscalls/syslets, so that the main scheduler thread can run in a 
lockless/single-task fashion. There are no technical obstacles that 
prevent you from doing it, but if you start adding locks (and hence having 
long-living syslet threads), at that point you'll end up with a fully 
multithreaded solution.



- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 06/11] syslets: core, documentation
  2007-02-14 21:06                     ` Davide Libenzi
@ 2007-02-14 21:44                       ` Benjamin LaHaise
  2007-02-14 23:17                         ` Davide Libenzi
  2007-02-15  1:32                         ` Michael K. Edwards
  0 siblings, 2 replies; 320+ messages in thread
From: Benjamin LaHaise @ 2007-02-14 21:44 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Russell King, Ingo Molnar, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Suparna Bhattacharya,
	Thomas Gleixner

On Wed, Feb 14, 2007 at 01:06:59PM -0800, Davide Libenzi wrote:
> Bear with me Ben, and let's follow this up :) If you are in the middle of 
> an MMX copy operation, inside the syscall, you are:
> 
> - Userspace, on task A, calls sys_async_exec
> 
> - Userspace is _not_ doing any MMX stuff before the call

That's an incorrect assumption.  Every task/thread in the system has FPU 
state associated with it, in part due to the fact that glibc has to change 
some of the rounding mode bits, making them different than the default from 
a freshly initialized state.

> - We wake task B that will return to userspace

At which point task B has to touch the FPU in userspace as part of the 
cleanup, which adds back in an expensive operation to the whole process.

The whole context switch mechanism is a zero sum game -- everything that 
occurs does so because it *must* be done.  If you remove something at one 
point, then it has to occur somewhere else.

My opinion of this whole thread is that it implies that our thread creation 
and/or context switch is too slow.  If that is the case, improve those 
elements first.  At least some of those optimizations have to be done in 
hardware on x86, while on other platforms they are probably unnecessary.

Fwiw, there are patches floating around that did AIO via kernel threads 
for file descriptors that didn't implement AIO (and remember: kernel thread 
context switches are cheaper than userland thread context switches).  At 
least take a stab at measuring what the performance differences are and 
what optimizations are possible before prematurely introducing a new "fast" 
way of doing things that adds a bunch more to maintain.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-14 21:26       ` Linus Torvalds
  2007-02-14 21:35         ` Ingo Molnar
@ 2007-02-14 21:44         ` Ingo Molnar
  2007-02-14 21:56         ` Alan
  2 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2007-02-14 21:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

>  - it fundamentally is based on a broken notion that everything would
>    use this "AIO atom" in the first place, WHICH WE KNOW IS INCORRECT, 
>    since current users use "aio_read()" that simply doesn't have that 
>    and doesn't build up any such data structures.

i'm not sure what you mean here either - aio_read()/write()/etc. could 
very much be implemented using syslets - and in fact one goal of syslets 
is to enable such use. struct aiocb is mostly shaped by glibc internals, 
and it currently has 32 bytes of free space. Enough to put a single atom 
there. (or a pointer to an atom)
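
just to illustrate the direction - a sketch, not the real glibc internals; 
cb_to_atom(), init_one_shot_atom() and submit_atom() are hypothetical 
placeholders, the only real point being that one atom describing one 
pread-style syscall fits into that free space:

 static int syslet_aio_read(struct aiocb *cb)
 {
         /* the atom (or a pointer to it) lives in the aiocb padding: */
         struct syslet_uatom *atom = cb_to_atom(cb);

         /* one-shot atom: a single async pread(), no chaining: */
         init_one_shot_atom(atom, __NR_pread64,
                            &cb->aio_fildes, &cb->aio_buf,
                            &cb->aio_nbytes, &cb->aio_offset);

         return submit_atom(atom); /* e.g. via sys_async_exec() */
 }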

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* [patch] x86: split FPU state from task state
  2007-02-14 20:34                   ` Benjamin LaHaise
  2007-02-14 21:06                     ` Davide Libenzi
@ 2007-02-14 21:49                     ` Ingo Molnar
  2007-02-14 22:04                       ` Benjamin LaHaise
  1 sibling, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2007-02-14 21:49 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Davide Libenzi, Russell King, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Suparna Bhattacharya,
	Thomas Gleixner


* Benjamin LaHaise <bcrl@kvack.org> wrote:

> On Wed, Feb 14, 2007 at 12:14:29PM -0800, Davide Libenzi wrote:
> > I think you may have mis-interpreted my words. *When* a schedule 
> > would block a synco execution try, then you do have a context 
> > switch. Noone argue that, and the code is clear. The sys_async_exec 
> > thread will block, and a newly woke up thread will re-emerge to 
> > sys_async_exec with a NULL returned to userspace. But in a 
> > "cachehit" case (no schedule happens during the syscall/*let 
> > execution), there is no context switch at all. That is the whole 
> > point of the optimization.
> 
> And I will repeat myself: that cannot be done.  Tell me how the 
> following what if scenario works: you're in an MMX optimized memory 
> copy and you take a page fault.  How does returning to the submittor 
> of the async operation get the correct MMX state restored?  It 
> doesn't.

this can very much be done, with a straightforward extension of how we 
handle FPU state. That makes sense independently of syslets/async as 
well, so find below the standalone patch from Arjan. It's in my current 
syslet queue and works great.

	Ingo

------------------------>
Subject: [patch] x86: split FPU state from task state
From: Arjan van de Ven <arjan@linux.intel.com>

Split the FPU save area from the task struct. This allows easy migration 
of FPU context, and it's generally cleaner. It also allows the following 
two (future) optimizations:

1) allocate the right size for the actual cpu rather than 512 bytes always
2) only allocate when the application actually uses FPU, so in the first 
   lazy FPU trap. This could save memory for non-fpu using apps.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/i386/kernel/i387.c        |   96 ++++++++++++++++++++---------------------
 arch/i386/kernel/process.c     |   56 +++++++++++++++++++++++
 arch/i386/kernel/traps.c       |   10 ----
 include/asm-i386/i387.h        |    6 +-
 include/asm-i386/processor.h   |    6 ++
 include/asm-i386/thread_info.h |    6 ++
 kernel/fork.c                  |    7 ++
 7 files changed, 123 insertions(+), 64 deletions(-)

Index: linux/arch/i386/kernel/i387.c
===================================================================
--- linux.orig/arch/i386/kernel/i387.c
+++ linux/arch/i386/kernel/i387.c
@@ -31,9 +31,9 @@ void mxcsr_feature_mask_init(void)
 	unsigned long mask = 0;
 	clts();
 	if (cpu_has_fxsr) {
-		memset(&current->thread.i387.fxsave, 0, sizeof(struct i387_fxsave_struct));
-		asm volatile("fxsave %0" : : "m" (current->thread.i387.fxsave)); 
-		mask = current->thread.i387.fxsave.mxcsr_mask;
+		memset(&current->thread.i387->fxsave, 0, sizeof(struct i387_fxsave_struct));
+		asm volatile("fxsave %0" : : "m" (current->thread.i387->fxsave));
+		mask = current->thread.i387->fxsave.mxcsr_mask;
 		if (mask == 0) mask = 0x0000ffbf;
 	} 
 	mxcsr_feature_mask &= mask;
@@ -49,16 +49,16 @@ void mxcsr_feature_mask_init(void)
 void init_fpu(struct task_struct *tsk)
 {
 	if (cpu_has_fxsr) {
-		memset(&tsk->thread.i387.fxsave, 0, sizeof(struct i387_fxsave_struct));
-		tsk->thread.i387.fxsave.cwd = 0x37f;
+		memset(&tsk->thread.i387->fxsave, 0, sizeof(struct i387_fxsave_struct));
+		tsk->thread.i387->fxsave.cwd = 0x37f;
 		if (cpu_has_xmm)
-			tsk->thread.i387.fxsave.mxcsr = 0x1f80;
+			tsk->thread.i387->fxsave.mxcsr = 0x1f80;
 	} else {
-		memset(&tsk->thread.i387.fsave, 0, sizeof(struct i387_fsave_struct));
-		tsk->thread.i387.fsave.cwd = 0xffff037fu;
-		tsk->thread.i387.fsave.swd = 0xffff0000u;
-		tsk->thread.i387.fsave.twd = 0xffffffffu;
-		tsk->thread.i387.fsave.fos = 0xffff0000u;
+		memset(&tsk->thread.i387->fsave, 0, sizeof(struct i387_fsave_struct));
+		tsk->thread.i387->fsave.cwd = 0xffff037fu;
+		tsk->thread.i387->fsave.swd = 0xffff0000u;
+		tsk->thread.i387->fsave.twd = 0xffffffffu;
+		tsk->thread.i387->fsave.fos = 0xffff0000u;
 	}
 	/* only the device not available exception or ptrace can call init_fpu */
 	set_stopped_child_used_math(tsk);
@@ -152,18 +152,18 @@ static inline unsigned long twd_fxsr_to_
 unsigned short get_fpu_cwd( struct task_struct *tsk )
 {
 	if ( cpu_has_fxsr ) {
-		return tsk->thread.i387.fxsave.cwd;
+		return tsk->thread.i387->fxsave.cwd;
 	} else {
-		return (unsigned short)tsk->thread.i387.fsave.cwd;
+		return (unsigned short)tsk->thread.i387->fsave.cwd;
 	}
 }
 
 unsigned short get_fpu_swd( struct task_struct *tsk )
 {
 	if ( cpu_has_fxsr ) {
-		return tsk->thread.i387.fxsave.swd;
+		return tsk->thread.i387->fxsave.swd;
 	} else {
-		return (unsigned short)tsk->thread.i387.fsave.swd;
+		return (unsigned short)tsk->thread.i387->fsave.swd;
 	}
 }
 
@@ -171,9 +171,9 @@ unsigned short get_fpu_swd( struct task_
 unsigned short get_fpu_twd( struct task_struct *tsk )
 {
 	if ( cpu_has_fxsr ) {
-		return tsk->thread.i387.fxsave.twd;
+		return tsk->thread.i387->fxsave.twd;
 	} else {
-		return (unsigned short)tsk->thread.i387.fsave.twd;
+		return (unsigned short)tsk->thread.i387->fsave.twd;
 	}
 }
 #endif  /*  0  */
@@ -181,7 +181,7 @@ unsigned short get_fpu_twd( struct task_
 unsigned short get_fpu_mxcsr( struct task_struct *tsk )
 {
 	if ( cpu_has_xmm ) {
-		return tsk->thread.i387.fxsave.mxcsr;
+		return tsk->thread.i387->fxsave.mxcsr;
 	} else {
 		return 0x1f80;
 	}
@@ -192,27 +192,27 @@ unsigned short get_fpu_mxcsr( struct tas
 void set_fpu_cwd( struct task_struct *tsk, unsigned short cwd )
 {
 	if ( cpu_has_fxsr ) {
-		tsk->thread.i387.fxsave.cwd = cwd;
+		tsk->thread.i387->fxsave.cwd = cwd;
 	} else {
-		tsk->thread.i387.fsave.cwd = ((long)cwd | 0xffff0000u);
+		tsk->thread.i387->fsave.cwd = ((long)cwd | 0xffff0000u);
 	}
 }
 
 void set_fpu_swd( struct task_struct *tsk, unsigned short swd )
 {
 	if ( cpu_has_fxsr ) {
-		tsk->thread.i387.fxsave.swd = swd;
+		tsk->thread.i387->fxsave.swd = swd;
 	} else {
-		tsk->thread.i387.fsave.swd = ((long)swd | 0xffff0000u);
+		tsk->thread.i387->fsave.swd = ((long)swd | 0xffff0000u);
 	}
 }
 
 void set_fpu_twd( struct task_struct *tsk, unsigned short twd )
 {
 	if ( cpu_has_fxsr ) {
-		tsk->thread.i387.fxsave.twd = twd_i387_to_fxsr(twd);
+		tsk->thread.i387->fxsave.twd = twd_i387_to_fxsr(twd);
 	} else {
-		tsk->thread.i387.fsave.twd = ((long)twd | 0xffff0000u);
+		tsk->thread.i387->fsave.twd = ((long)twd | 0xffff0000u);
 	}
 }
 
@@ -298,8 +298,8 @@ static inline int save_i387_fsave( struc
 	struct task_struct *tsk = current;
 
 	unlazy_fpu( tsk );
-	tsk->thread.i387.fsave.status = tsk->thread.i387.fsave.swd;
-	if ( __copy_to_user( buf, &tsk->thread.i387.fsave,
+	tsk->thread.i387->fsave.status = tsk->thread.i387->fsave.swd;
+	if ( __copy_to_user( buf, &tsk->thread.i387->fsave,
 			     sizeof(struct i387_fsave_struct) ) )
 		return -1;
 	return 1;
@@ -312,15 +312,15 @@ static int save_i387_fxsave( struct _fps
 
 	unlazy_fpu( tsk );
 
-	if ( convert_fxsr_to_user( buf, &tsk->thread.i387.fxsave ) )
+	if ( convert_fxsr_to_user( buf, &tsk->thread.i387->fxsave ) )
 		return -1;
 
-	err |= __put_user( tsk->thread.i387.fxsave.swd, &buf->status );
+	err |= __put_user( tsk->thread.i387->fxsave.swd, &buf->status );
 	err |= __put_user( X86_FXSR_MAGIC, &buf->magic );
 	if ( err )
 		return -1;
 
-	if ( __copy_to_user( &buf->_fxsr_env[0], &tsk->thread.i387.fxsave,
+	if ( __copy_to_user( &buf->_fxsr_env[0], &tsk->thread.i387->fxsave,
 			     sizeof(struct i387_fxsave_struct) ) )
 		return -1;
 	return 1;
@@ -343,7 +343,7 @@ int save_i387( struct _fpstate __user *b
 			return save_i387_fsave( buf );
 		}
 	} else {
-		return save_i387_soft( &current->thread.i387.soft, buf );
+		return save_i387_soft( &current->thread.i387->soft, buf );
 	}
 }
 
@@ -351,7 +351,7 @@ static inline int restore_i387_fsave( st
 {
 	struct task_struct *tsk = current;
 	clear_fpu( tsk );
-	return __copy_from_user( &tsk->thread.i387.fsave, buf,
+	return __copy_from_user( &tsk->thread.i387->fsave, buf,
 				 sizeof(struct i387_fsave_struct) );
 }
 
@@ -360,11 +360,11 @@ static int restore_i387_fxsave( struct _
 	int err;
 	struct task_struct *tsk = current;
 	clear_fpu( tsk );
-	err = __copy_from_user( &tsk->thread.i387.fxsave, &buf->_fxsr_env[0],
+	err = __copy_from_user( &tsk->thread.i387->fxsave, &buf->_fxsr_env[0],
 				sizeof(struct i387_fxsave_struct) );
 	/* mxcsr reserved bits must be masked to zero for security reasons */
-	tsk->thread.i387.fxsave.mxcsr &= mxcsr_feature_mask;
-	return err ? 1 : convert_fxsr_from_user( &tsk->thread.i387.fxsave, buf );
+	tsk->thread.i387->fxsave.mxcsr &= mxcsr_feature_mask;
+	return err ? 1 : convert_fxsr_from_user( &tsk->thread.i387->fxsave, buf );
 }
 
 int restore_i387( struct _fpstate __user *buf )
@@ -378,7 +378,7 @@ int restore_i387( struct _fpstate __user
 			err = restore_i387_fsave( buf );
 		}
 	} else {
-		err = restore_i387_soft( &current->thread.i387.soft, buf );
+		err = restore_i387_soft( &current->thread.i387->soft, buf );
 	}
 	set_used_math();
 	return err;
@@ -391,7 +391,7 @@ int restore_i387( struct _fpstate __user
 static inline int get_fpregs_fsave( struct user_i387_struct __user *buf,
 				    struct task_struct *tsk )
 {
-	return __copy_to_user( buf, &tsk->thread.i387.fsave,
+	return __copy_to_user( buf, &tsk->thread.i387->fsave,
 			       sizeof(struct user_i387_struct) );
 }
 
@@ -399,7 +399,7 @@ static inline int get_fpregs_fxsave( str
 				     struct task_struct *tsk )
 {
 	return convert_fxsr_to_user( (struct _fpstate __user *)buf,
-				     &tsk->thread.i387.fxsave );
+				     &tsk->thread.i387->fxsave );
 }
 
 int get_fpregs( struct user_i387_struct __user *buf, struct task_struct *tsk )
@@ -411,7 +411,7 @@ int get_fpregs( struct user_i387_struct 
 			return get_fpregs_fsave( buf, tsk );
 		}
 	} else {
-		return save_i387_soft( &tsk->thread.i387.soft,
+		return save_i387_soft( &tsk->thread.i387->soft,
 				       (struct _fpstate __user *)buf );
 	}
 }
@@ -419,14 +419,14 @@ int get_fpregs( struct user_i387_struct 
 static inline int set_fpregs_fsave( struct task_struct *tsk,
 				    struct user_i387_struct __user *buf )
 {
-	return __copy_from_user( &tsk->thread.i387.fsave, buf,
+	return __copy_from_user( &tsk->thread.i387->fsave, buf,
 				 sizeof(struct user_i387_struct) );
 }
 
 static inline int set_fpregs_fxsave( struct task_struct *tsk,
 				     struct user_i387_struct __user *buf )
 {
-	return convert_fxsr_from_user( &tsk->thread.i387.fxsave,
+	return convert_fxsr_from_user( &tsk->thread.i387->fxsave,
 				       (struct _fpstate __user *)buf );
 }
 
@@ -439,7 +439,7 @@ int set_fpregs( struct task_struct *tsk,
 			return set_fpregs_fsave( tsk, buf );
 		}
 	} else {
-		return restore_i387_soft( &tsk->thread.i387.soft,
+		return restore_i387_soft( &tsk->thread.i387->soft,
 					  (struct _fpstate __user *)buf );
 	}
 }
@@ -447,7 +447,7 @@ int set_fpregs( struct task_struct *tsk,
 int get_fpxregs( struct user_fxsr_struct __user *buf, struct task_struct *tsk )
 {
 	if ( cpu_has_fxsr ) {
-		if (__copy_to_user( buf, &tsk->thread.i387.fxsave,
+		if (__copy_to_user( buf, &tsk->thread.i387->fxsave,
 				    sizeof(struct user_fxsr_struct) ))
 			return -EFAULT;
 		return 0;
@@ -461,11 +461,11 @@ int set_fpxregs( struct task_struct *tsk
 	int ret = 0;
 
 	if ( cpu_has_fxsr ) {
-		if (__copy_from_user( &tsk->thread.i387.fxsave, buf,
+		if (__copy_from_user( &tsk->thread.i387->fxsave, buf,
 				  sizeof(struct user_fxsr_struct) ))
 			ret = -EFAULT;
 		/* mxcsr reserved bits must be masked to zero for security reasons */
-		tsk->thread.i387.fxsave.mxcsr &= mxcsr_feature_mask;
+		tsk->thread.i387->fxsave.mxcsr &= mxcsr_feature_mask;
 	} else {
 		ret = -EIO;
 	}
@@ -479,7 +479,7 @@ int set_fpxregs( struct task_struct *tsk
 static inline void copy_fpu_fsave( struct task_struct *tsk,
 				   struct user_i387_struct *fpu )
 {
-	memcpy( fpu, &tsk->thread.i387.fsave,
+	memcpy( fpu, &tsk->thread.i387->fsave,
 		sizeof(struct user_i387_struct) );
 }
 
@@ -490,10 +490,10 @@ static inline void copy_fpu_fxsave( stru
 	unsigned short *from;
 	int i;
 
-	memcpy( fpu, &tsk->thread.i387.fxsave, 7 * sizeof(long) );
+	memcpy( fpu, &tsk->thread.i387->fxsave, 7 * sizeof(long) );
 
 	to = (unsigned short *)&fpu->st_space[0];
-	from = (unsigned short *)&tsk->thread.i387.fxsave.st_space[0];
+	from = (unsigned short *)&tsk->thread.i387->fxsave.st_space[0];
 	for ( i = 0 ; i < 8 ; i++, to += 5, from += 8 ) {
 		memcpy( to, from, 5 * sizeof(unsigned short) );
 	}
@@ -540,7 +540,7 @@ int dump_task_extended_fpu(struct task_s
 	if (fpvalid) {
 		if (tsk == current)
 		       unlazy_fpu(tsk);
-		memcpy(fpu, &tsk->thread.i387.fxsave, sizeof(*fpu));
+		memcpy(fpu, &tsk->thread.i387->fxsave, sizeof(*fpu));
 	}
 	return fpvalid;
 }
Index: linux/arch/i386/kernel/process.c
===================================================================
--- linux.orig/arch/i386/kernel/process.c
+++ linux/arch/i386/kernel/process.c
@@ -645,7 +645,7 @@ struct task_struct fastcall * __switch_t
 
 	/* we're going to use this soon, after a few expensive things */
 	if (next_p->fpu_counter > 5)
-		prefetch(&next->i387.fxsave);
+		prefetch(&next->i387->fxsave);
 
 	/*
 	 * Reload esp0.
@@ -908,3 +908,57 @@ unsigned long arch_align_stack(unsigned 
 		sp -= get_random_int() % 8192;
 	return sp & ~0xf;
 }
+
+
+
+struct kmem_cache *task_struct_cachep;
+struct kmem_cache *task_i387_cachep;
+
+struct task_struct * alloc_task_struct(void)
+{
+	struct task_struct *tsk;
+	tsk = kmem_cache_alloc(task_struct_cachep, GFP_KERNEL);
+	if (!tsk)
+		return NULL;
+	tsk->thread.i387 = kmem_cache_alloc(task_i387_cachep, GFP_KERNEL);
+	if (!tsk->thread.i387)
+		goto error;
+	WARN_ON((unsigned long)tsk->thread.i387 & 15);
+	return tsk;
+
+error:
+	kfree(tsk);
+	return NULL;
+}
+
+void memcpy_task_struct(struct task_struct *dst, struct task_struct *src)
+{
+	union i387_union *ptr;
+	ptr = dst->thread.i387;
+	*dst = *src;
+	dst->thread.i387 = ptr;
+	memcpy(dst->thread.i387, src->thread.i387, sizeof(union i387_union));
+}
+
+void free_task_struct(struct task_struct *tsk)
+{
+	kmem_cache_free(task_i387_cachep, tsk->thread.i387);
+	tsk->thread.i387=NULL;
+	kmem_cache_free(task_struct_cachep, tsk);
+}
+
+
+void task_struct_slab_init(void)
+{
+ 	/* create a slab on which task_structs can be allocated */
+        task_struct_cachep =
+        	kmem_cache_create("task_struct", sizeof(struct task_struct),
+        		ARCH_MIN_TASKALIGN, SLAB_PANIC, NULL, NULL);
+        task_i387_cachep =
+        	kmem_cache_create("task_i387", sizeof(union i387_union), 32,
+        	    SLAB_PANIC | SLAB_MUST_HWCACHE_ALIGN, NULL, NULL);
+}
+
+
+/* the very init task needs a static allocated i387 area */
+union i387_union init_i387_context;
Index: linux/arch/i386/kernel/traps.c
===================================================================
--- linux.orig/arch/i386/kernel/traps.c
+++ linux/arch/i386/kernel/traps.c
@@ -1154,16 +1154,6 @@ void __init trap_init(void)
 	set_trap_gate(19,&simd_coprocessor_error);
 
 	if (cpu_has_fxsr) {
-		/*
-		 * Verify that the FXSAVE/FXRSTOR data will be 16-byte aligned.
-		 * Generates a compile-time "error: zero width for bit-field" if
-		 * the alignment is wrong.
-		 */
-		struct fxsrAlignAssert {
-			int _:!(offsetof(struct task_struct,
-					thread.i387.fxsave) & 15);
-		};
-
 		printk(KERN_INFO "Enabling fast FPU save and restore... ");
 		set_in_cr4(X86_CR4_OSFXSR);
 		printk("done.\n");
Index: linux/include/asm-i386/i387.h
===================================================================
--- linux.orig/include/asm-i386/i387.h
+++ linux/include/asm-i386/i387.h
@@ -34,7 +34,7 @@ extern void init_fpu(struct task_struct 
 		"nop ; frstor %1",		\
 		"fxrstor %1",			\
 		X86_FEATURE_FXSR,		\
-		"m" ((tsk)->thread.i387.fxsave))
+		"m" ((tsk)->thread.i387->fxsave))
 
 extern void kernel_fpu_begin(void);
 #define kernel_fpu_end() do { stts(); preempt_enable(); } while(0)
@@ -60,8 +60,8 @@ static inline void __save_init_fpu( stru
 		"fxsave %[fx]\n"
 		"bt $7,%[fsw] ; jnc 1f ; fnclex\n1:",
 		X86_FEATURE_FXSR,
-		[fx] "m" (tsk->thread.i387.fxsave),
-		[fsw] "m" (tsk->thread.i387.fxsave.swd) : "memory");
+		[fx] "m" (tsk->thread.i387->fxsave),
+		[fsw] "m" (tsk->thread.i387->fxsave.swd) : "memory");
 	/* AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception
 	   is pending.  Clear the x87 state here by setting it to fixed
    	   values. safe_address is a random variable that should be in L1 */
Index: linux/include/asm-i386/processor.h
===================================================================
--- linux.orig/include/asm-i386/processor.h
+++ linux/include/asm-i386/processor.h
@@ -407,7 +407,7 @@ struct thread_struct {
 /* fault info */
 	unsigned long	cr2, trap_no, error_code;
 /* floating point info */
-	union i387_union	i387;
+	union i387_union	*i387;
 /* virtual 86 mode info */
 	struct vm86_struct __user * vm86_info;
 	unsigned long		screen_bitmap;
@@ -420,11 +420,15 @@ struct thread_struct {
 	unsigned long	io_bitmap_max;
 };
 
+
+extern union i387_union init_i387_context;
+
 #define INIT_THREAD  {							\
 	.vm86_info = NULL,						\
 	.sysenter_cs = __KERNEL_CS,					\
 	.io_bitmap_ptr = NULL,						\
 	.gs = __KERNEL_PDA,						\
+	.i387 = &init_i387_context,					\
 }
 
 /*
Index: linux/include/asm-i386/thread_info.h
===================================================================
--- linux.orig/include/asm-i386/thread_info.h
+++ linux/include/asm-i386/thread_info.h
@@ -102,6 +102,12 @@ static inline struct thread_info *curren
 
 #define free_thread_info(info)	kfree(info)
 
+#define __HAVE_ARCH_TASK_STRUCT_ALLOCATOR
+extern struct task_struct * alloc_task_struct(void);
+extern void free_task_struct(struct task_struct *tsk);
+extern void memcpy_task_struct(struct task_struct *dst, struct task_struct *src);
+extern void task_struct_slab_init(void);
+
 #else /* !__ASSEMBLY__ */
 
 /* how to get the thread information struct from ASM */
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c
+++ linux/kernel/fork.c
@@ -83,6 +83,8 @@ int nr_processes(void)
 #ifndef __HAVE_ARCH_TASK_STRUCT_ALLOCATOR
 # define alloc_task_struct()	kmem_cache_alloc(task_struct_cachep, GFP_KERNEL)
 # define free_task_struct(tsk)	kmem_cache_free(task_struct_cachep, (tsk))
+# define memcpy_task_struct(dst, src) *dst = *src;
+
 static struct kmem_cache *task_struct_cachep;
 #endif
 
@@ -137,6 +139,8 @@ void __init fork_init(unsigned long memp
 	task_struct_cachep =
 		kmem_cache_create("task_struct", sizeof(struct task_struct),
 			ARCH_MIN_TASKALIGN, SLAB_PANIC, NULL, NULL);
+#else
+	task_struct_slab_init();
 #endif
 
 	/*
@@ -175,7 +179,8 @@ static struct task_struct *dup_task_stru
 		return NULL;
 	}
 
-	*tsk = *orig;
+	memcpy_task_struct(tsk, orig);
+
 	tsk->thread_info = ti;
 	setup_thread_stack(tsk, orig);
 

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-14 21:26       ` Linus Torvalds
  2007-02-14 21:35         ` Ingo Molnar
  2007-02-14 21:44         ` Ingo Molnar
@ 2007-02-14 21:56         ` Alan
  2007-02-14 22:32           ` Ingo Molnar
  2 siblings, 1 reply; 320+ messages in thread
From: Alan @ 2007-02-14 21:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Linux Kernel Mailing List, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

>  - it assumes we are going to make these complex state machines (which I 
>    don't believe for a second that a real program will do)

They've not had the chance before and there are certain chains of them
which make huge amounts of sense because you don't want to keep taking
completion hits. Not so much looping ones but stuff like

	cork write sendfile uncork close

are very natural sequences.
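
Expressed with the init_atom() helper from Ingo's example code, such a 
chain might look roughly like this (a sketch only - the struct req 
fields, syscall numbers and flag names below are illustrative 
placeholders, with constant and pointer-valued arguments held in static 
variables the way O_RDONLY_var and filename_p are in that example):

 static void setup_sendfile_chain(struct req *req)
 {
         static long SOL_TCP_var = SOL_TCP, TCP_CORK_var = TCP_CORK;
         static long one = 1, zero = 0, optlen_var = sizeof(int);
         static long *one_p = &one, *zero_p = &zero;

         /* setsockopt(sock, SOL_TCP, TCP_CORK, &one, sizeof(int)) */
         init_atom(req, &req->cork, __NR_setsockopt,
                   &req->sock, &SOL_TCP_var, &TCP_CORK_var, &one_p,
                   &optlen_var, NULL,
                   NULL, SYSLET_STOP_ON_NEGATIVE, &req->write_hdr);

         /* write(sock, hdr, hdr_len) */
         init_atom(req, &req->write_hdr, __NR_write,
                   &req->sock, &req->hdr_p, &req->hdr_len, NULL, NULL, NULL,
                   NULL, SYSLET_STOP_ON_NEGATIVE, &req->sendfile_body);

         /* sendfile(sock, fd, &off, count) */
         init_atom(req, &req->sendfile_body, __NR_sendfile,
                   &req->sock, &req->fd, &req->off_p, &req->count, NULL, NULL,
                   NULL, SYSLET_STOP_ON_NEGATIVE, &req->uncork);

         /* setsockopt(sock, SOL_TCP, TCP_CORK, &zero, sizeof(int)) */
         init_atom(req, &req->uncork, __NR_setsockopt,
                   &req->sock, &SOL_TCP_var, &TCP_CORK_var, &zero_p,
                   &optlen_var, NULL,
                   NULL, 0, &req->close_file);

         /* close(fd) */
         init_atom(req, &req->close_file, __NR_close,
                   &req->fd, NULL, NULL, NULL, NULL, NULL,
                   NULL, 0, NULL);
 }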

There seem to be a lot of typical sequences it doesn't represent, however
(consider the trivial copy case, where you use the result of one syscall in
the next).

>  - it assumes that we're going to make many async system calls that go 
>    together (which breaks the whole notion of having different libraries 
>    using this for their own internal reasons - they may not even *know* 
>    about other libraries that _also_ do async IO for *their* reasons)

They can each register their own async objects. They need to do this
anyway so that the libraries can use asynchronous I/O and hide it from
applications.

>    this "AIO atom" in the first place, WHICH WE KNOW IS INCORRECT, since 
>    current users use "aio_read()" that simply doesn't have that and 
>    doesn't build up any such data structures.

Do current users do this because that is all they have, because it is 
hard, or because the current option is all that makes sense?

The ability to avoid asynchronous completion waits and
complete/wake/despatch cycles is a good thing in itself. I don't know if
it justifies the rest, but it has potential for excellent performance.

Alan

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch] x86: split FPU state from task state
  2007-02-14 21:49                     ` [patch] x86: split FPU state from task state Ingo Molnar
@ 2007-02-14 22:04                       ` Benjamin LaHaise
  2007-02-14 22:10                         ` Arjan van de Ven
  0 siblings, 1 reply; 320+ messages in thread
From: Benjamin LaHaise @ 2007-02-14 22:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Davide Libenzi, Russell King, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Suparna Bhattacharya,
	Thomas Gleixner

On Wed, Feb 14, 2007 at 10:49:44PM +0100, Ingo Molnar wrote:
> this can very much be done, with a straightforward extension of how we 
> handle FPU state. That makes sense independently of syslets/async as 
> well, so find below the standalone patch from Arjan. It's in my current 
> syslet queue and works great.

That patch adds a bunch of memory dereferences and another allocation 
to the thread creation code path -- a tax that all users must pay.  Granted, 
it's small, but at the very least it should be configurable out for the 
99.9% of users who will never use this functionality.

I'm willing to be convinced, it's just that I would like to see some 
numbers that scream out that this is a good thing.

		-ben

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-14 20:38   ` Linus Torvalds
  2007-02-14 21:02     ` Ingo Molnar
  2007-02-14 21:09     ` Davide Libenzi
@ 2007-02-14 22:09     ` Ingo Molnar
  2007-02-14 23:13       ` Linus Torvalds
  2007-02-15 13:35     ` Evgeniy Polyakov
  3 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2007-02-14 22:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Or how would you do the trivial example loop that I explained was a 
> good idea:
> 
>         struct one_entry *prev = NULL;
>         struct dirent *de;
> 
>         while ((de = readdir(dir)) != NULL) {
>                 struct one_entry *entry = malloc(..);
> 
>                 /* Add it to the list, fill in the name */
>                 entry->next = prev;
>                 prev = entry;
>                 strcpy(entry->name, de->d_name);
> 
>                 /* Do the stat lookup async */
>                 async_stat(de->d_name, &entry->stat_buf);
>         }
>         wait_for_async();
>         .. Ta-daa! All done ..

i think you are banging on open doors. That async_stat() call is very 
much what i'd like to see glibc provide, not really the raw syslet 
interface. Nor do i want to see raw syscalls exposed to applications. 
Plus the single-atom thing is what i think will be used mostly 
initially, so all my optimizations went into that case.

while i agree with you that state machines are hard, it's all a function 
of where the concentration of processing is. If most of the application 
complexity happens in user-space, then the logic should live there. But 
for infrastructure things (like the async_stat() calls, or aio_read(), 
or other, future interfaces) i wouldn't mind at all if they were 
implemented using syslets. Likewise, if someone wants to implement the 
hottest accept loop in Apache or Samba via syslets, keeping them from 
wasting time on writing in-kernel webservers (oops, did i really say 
that?), it can be done. If a JVM wants to use syslets, sure - it's an 
abstraction machine anyway so application programmers are not exposed to 
it.

syslets are just a realization that /if/ the thing we want to do is 
mostly on the kernel side, then we might as well put the logic on the 
kernel side. It's more of a 'compound interface builder' than the place 
for real program logic. It makes our interfaces usable more flexibly, 
and it allows the kernel to provide 'atomic' APIs, instead of having to 
provide the most common compounded uses as well.

and note that if you actually try to do an async_stat() sanely, you do 
get quite close to the point of having syslets. You get basically up to 
a one-shot atom concept and 90% of what i have in kernel/async.c. The 
remaining 10% of further execution control is easy and still it opens up 
these new things that were not possible before: compounding, vectoring, 
simple program logic, etc.

The 'cost' of syslets is mostly the atom->next pointer in essence. The 
whole async infrastructure only takes up 20 nsecs more in the cached 
case. (but with some crazier hacks i got the one-shot atom overhead 
[compared to a simple synchronous null syscall] to below 10 nsecs, so 
there's room in there for further optimizations. Our current null 
syscall latency is around ~150 nsecs.)

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch] x86: split FPU state from task state
  2007-02-14 22:04                       ` Benjamin LaHaise
@ 2007-02-14 22:10                         ` Arjan van de Ven
  0 siblings, 0 replies; 320+ messages in thread
From: Arjan van de Ven @ 2007-02-14 22:10 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Ingo Molnar, Davide Libenzi, Russell King,
	Linux Kernel Mailing List, Linus Torvalds, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Suparna Bhattacharya,
	Thomas Gleixner

On Wed, 2007-02-14 at 17:04 -0500, Benjamin LaHaise wrote:
> On Wed, Feb 14, 2007 at 10:49:44PM +0100, Ingo Molnar wrote:
> > this can very much be done, with a straightforward extension of how we 
> > handle FPU state. That makes sense independently of syslets/async as 
> > well, so find below the standalone patch from Arjan. It's in my current 
> > syslet queue and works great.
> 
> That patch adds a bunch of memory dereferences 

not really; you missed that most of the ->'s are actually just going to
members of the union and aren't actually extra dereferences.

> and another allocation 
> to the thread creation code path -- a tax that all users must pay. 

so the next step, as mentioned in the changelog, is to allocate only on the
first FPU fault, so that it becomes a GAIN for everyone, since only
threads that use the FPU will use the memory.

The second gain (although only on old cpus) is that you only need to
allocate enough memory for your cpu, rather than 512 bytes always.



-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-14 21:56         ` Alan
@ 2007-02-14 22:32           ` Ingo Molnar
  2007-02-15  1:01             ` Davide Libenzi
  0 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2007-02-14 22:32 UTC (permalink / raw)
  To: Alan
  Cc: Linus Torvalds, Linux Kernel Mailing List, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner


* Alan <alan@lxorguk.ukuu.org.uk> wrote:

> >    this "AIO atom" in the first place, WHICH WE KNOW IS INCORRECT, 
> >    since current users use "aio_read()" that simply doesn't have 
> >    that and doesn't build up any such data structures.
> 
> Do current users do this because that is all they have, because it is 
> hard, or because the current option is all that makes sense ?
> 
> The ability to avoid asynchronous completion waits and 
> complete/wake/despatch cycles is a good thing of itself. [...]

yeah, that's another key thing. I do plan to provide a sys_upcall() 
syscall as well which calls a 5-parameter user-space function with a 
special stack. (it's like a lightweight signal/event handler, without 
any of the signal handler legacies and overhead - it's like a reverse 
system call - a "user call". Obviously pure userspace would never use 
sys_upcall(), unless as an act of sheer masochism.)

[ that way say a full HTTP request could be done by an asynchronous
  context, because the HTTP parser could be done as a sys_upcall(). ]

so if it's simpler/easier for a syslet to do a step in user-space - as 
long as it's an 'atom' of processing - it can be done.
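
to make that concrete, such an upcall function might look roughly like 
this (pure illustration - sys_upcall() does not exist yet, so the 
5-parameter calling convention below is an assumption of the proposal):

 /* runs in user-space on a special stack, called "backwards" from the
  * async engine between two syscall atoms: */
 static long http_parse_step(long conn_fd, long buf_addr,
                             long len, long unused4, long unused5)
 {
         const char *buf = (const char *)buf_addr;

         if (len >= 4 &&
             buf[len - 4] == '\r' && buf[len - 3] == '\n' &&
             buf[len - 2] == '\r' && buf[len - 1] == '\n')
                 return 1;       /* request complete, continue the chain */

         return 0;               /* need more data, loop back to read() */
 }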

or if processing is so heavily in user-space that most of the logic 
lives there then just use plain pthreads. There's just no point in 
moving complex user-space code to the syslet side if it's easier/faster 
to do it in user-space. Syslets are there for asynchronous /kernel/ 
execution, and are centered around how the kernel does stuff: system 
calls.

besides sys_upcall() i also plan two other extensions:

 - a CLONE_ASYNC_WORKER for user-space to be able to use its pthread as an
   optional worker thread in the async engine. A thread executing
   user-space code qualifies as a 'busy' thread - it has to call into
   sys_async_cachemiss_thread() to 'offer' itself as a ready thread that
   the 'head' could switch to anytime.

 - support for multiple heads sharing the async context pool. All the
   locking logic is there already (because cachemiss threads can already
   access the queue), it only needs a refcount in struct async_head
   (only accessed during fork/exit), and an update to the teardown logic
   (that too is a slowpath).

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-14 22:09     ` Ingo Molnar
@ 2007-02-14 23:13       ` Linus Torvalds
  2007-02-14 23:44         ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Linus Torvalds @ 2007-02-14 23:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner



On Wed, 14 Feb 2007, Ingo Molnar wrote:
> 
> i think you are banging on open doors. That async_stat() call is very 
> much what i'd like to see glibc to provide, not really the raw syslet 
> interface.

Right. Which is why I wrote (and you removed) the rest of my email.

If the "raw" interfaces aren't actually what you use, and you just expect 
glibc to translate things into them, WHY DO WE HAVE THEM AT ALL?

> The 'cost' of syslets is mostly the atom->next pointer in essence.

No. The cost is:

 - indirect interfaces are harder to follow and debug. It's a LOT easier 
   to debug things that go wrong when it just does what you ask it for, 
   instead of writing to memory and doing something totally obscure.

   I don't know about you, but I use "strace" a lot. That's the kind of 
   cost we have.

 - the cost is the extra and totally unnecessary setup for the 
   indirection, that nobody really is likely to use.

> The whole async infrastructure only takes up 20 nsecs more in the cached 
> case. (but with some crazier hacks i got the one-shot atom overhead 
> [compared to a simple synchronous null syscall] to below 10 nsecs, so 
> there's room in there for further optimizations. Our current null 
> syscall latency is around ~150 nsecs.)

You are not counting the whole setup cost there, then, because your setup 
cost is going to be at a minimum more expensive than the null system call.

And yes, for benchmarks, it's going to be done just once, and then the 
benchmark will loop a million times. But for other things like libraries, 
that don't know whether they get called once, or a million times, this is 
a big deal.

This is why I'd like a "async_stat()" to basically be the *same* cost as a 
"stat()". To within nanoseconds. WITH ALL THE SETUP! Because otherwise, a 
library may not be able to use it without thinking about it a lot, because 
it simply doesn't know whether the caller is going to call it once or many 
times.

THIS was why I wanted the "synchronous mode". Exactly because it removes 
all the questions about "is it worth it". If the cost overhead is 
basically zero, you know it's always worth it.

Now, if you make the "async_submit()" _include_ the setup itself (as you 
alluded to in one of your emails), and the cost of that is basically 
negligible, and it still allows people to do things simply and just submit 
a single system call without any real overhead, then hey, it may be a 
complex interface, but at least you can _use_ it as a simple one.

At that point most of my arguments against it go away. It might still be 
over-engineered, but if the costs aren't visible, and it's obvious enough 
that the over-engineering doesn't result in subtle bugs, THEN (and only
then) is a more complex and generic interface worth it even if nobody 
actually ends up using it.

		Linus

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 06/11] syslets: core, documentation
  2007-02-14 21:44                       ` Benjamin LaHaise
@ 2007-02-14 23:17                         ` Davide Libenzi
  2007-02-14 23:40                           ` Benjamin LaHaise
  2007-02-15  1:32                         ` Michael K. Edwards
  1 sibling, 1 reply; 320+ messages in thread
From: Davide Libenzi @ 2007-02-14 23:17 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Russell King, Ingo Molnar, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Suparna Bhattacharya,
	Thomas Gleixner

On Wed, 14 Feb 2007, Benjamin LaHaise wrote:

> On Wed, Feb 14, 2007 at 01:06:59PM -0800, Davide Libenzi wrote:
> > Bear with me Ben, and let's follow this up :) If you are in the middle of 
> > an MMX copy operation, inside the syscall, you are:
> > 
> > - Userspace, on task A, calls sys_async_exec
> > 
> > - Userspace is _not_ doing any MMX stuff before the call
> 
> That's an incorrect assumption.  Every task/thread in the system has FPU 
> state associated with it, in part due to the fact that glibc has to change 
> some of the rounding mode bits, making them different than the default from 
> a freshly initialized state.

IMO I still believe this is not a huge problem. FPU state propagation/copy 
can be done in a clever way, once we detect the in-async condition.



- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 06/11] syslets: core, documentation
  2007-02-14 23:17                         ` Davide Libenzi
@ 2007-02-14 23:40                           ` Benjamin LaHaise
  2007-02-15  0:35                             ` Davide Libenzi
  0 siblings, 1 reply; 320+ messages in thread
From: Benjamin LaHaise @ 2007-02-14 23:40 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Russell King, Ingo Molnar, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Suparna Bhattacharya,
	Thomas Gleixner

On Wed, Feb 14, 2007 at 03:17:59PM -0800, Davide Libenzi wrote:
> > That's an incorrect assumption.  Every task/thread in the system has FPU 
> > state associated with it, in part due to the fact that glibc has to change 
> > some of the rounding mode bits, making them different than the default from 
> > a freshly initialized state.
> 
> IMO I still belive this is not a huge problem. FPU state propagation/copy 
> can be done in a clever way, once we detect the in-async condition.

Show me.  clts() and stts() are expensive hardware operations which there 
is no way to avoid, since control register writes impact the CPU in a 
non-trivial manner.  I've spent far too much time staring at profiles of 
what goes on in the context switch code while looking for optimizations 
on this very issue to be ignored on this point.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-14 23:13       ` Linus Torvalds
@ 2007-02-14 23:44         ` Ingo Molnar
  2007-02-15  0:04           ` Ingo Molnar
  0 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2007-02-14 23:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > case. (but with some crazier hacks i got the one-shot atom overhead 
> > [compared to a simple synchronous null syscall] to below 10 nsecs, 
> > so there's room in there for further optimizations. Our current null 
> > syscall latency is around ~150 nsecs.)
> 
> You are not counting the whole setup cost there, then, because your 
> setup cost is going to be at a minimum more expensive than the null 
> system call.

hm, this one-time cost was never on my radar. [ It's really dwarfed by 
other startup costs (a single fork() takes 100 usecs, an exec() takes 
800 usecs.) ]

In any case, we can delay this cost until the first cachemiss, or we can 
eliminate it by making it a globally pooled thing. It does not seem like 
a big issue.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-14 23:44         ` Ingo Molnar
@ 2007-02-15  0:04           ` Ingo Molnar
  0 siblings, 0 replies; 320+ messages in thread
From: Ingo Molnar @ 2007-02-15  0:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner


* Ingo Molnar <mingo@elte.hu> wrote:

> > You are not counting the whole setup cost there, then, because your 
> > setup cost is going to be at a minimum more expensive than the null 
> > system call.
> 
> hm, this one-time cost was never on my radar. [ It's really dwarfed by 
> other startup costs (a single fork() takes 100 usecs, an exec() takes 
> 800 usecs.) ]

i really count this into the category of 'application startup', and thus 
it's another type of 'cachemiss': the cost of having to bootstrap a 
new context. (Even though obviously we want this to go as fast as 
possible too.) Library startups, linking (even with prelink), etc., are 
quite expensive already - they go into the tens of milliseconds.

or if it's a new thread startup - where this cost would indeed be 
visible if the thread exits straight after being started up, and where 
this thread would want to do just a single AIO - then shareable async 
heads (see my mail to Alan) ought to solve this. (But short-lifetime 
threads are not really a good idea in themselves.)

but the moment it's some fork()ed context, or even an exec()ed context, 
this cost is very small in comparison. And who in their right mind 
starts up a whole new process just to do a single IO and then exits 
without doing any other processing? (so that the async setup cost would 
show up)

but, short-lived contexts, where this cost would be visible, are 
generally a really bad idea.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-14 21:36     ` Davide Libenzi
@ 2007-02-15  0:08       ` Jeremy Fitzhardinge
  2007-02-15  2:07         ` Davide Libenzi
  0 siblings, 1 reply; 320+ messages in thread
From: Jeremy Fitzhardinge @ 2007-02-15  0:08 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, Evgeniy Polyakov, David S. Miller,
	Benjamin LaHaise, Suparna Bhattacharya, Thomas Gleixner

Davide Libenzi wrote:
>> Would this work?
>>     
>
> Hopefully the API will simplify enough that emulation becomes 
> easier.
>   

The big question in my mind is how all this stuff interacts with
signals.  Can a blocked syscall atom be interrupted by a signal?  If so,
what thread does it get delivered to?  How does sigprocmask affect this
(is it atomic with respect to blocking atoms)?

>> Also, an unrelated question: is there enough control structure in place
>> to allow multiple syslet streams to synchronize with each other with
>> futexes?
>>     
>
> I think the whole point of async execution of a syscall or a syslet is 
> that the syscall/syslet itself performs operations that are not interlocked 
> with other syscalls/syslets, so that the main scheduler thread can run in a 
> lockless/single-task fashion. There are no technical obstacles that 
> prevent you from doing it, but if you start adding locks (and hence having 
> long-living syslet threads), at that point you'll end up with a fully 
> multithreaded solution.
>   

I was thinking you'd use the futexes more like barriers than locks. 
That way you could have several streams going asynchronously, but use
futexes to gang them together at appropriate times in the stream.  A
handwavy example would be to have separate async streams for audio and
video, but use futexes to stop them from drifting too far from each other.
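
In code, that ganging might look roughly like this - plain futex(2) usage, 
nothing syslet-specific assumed, and the names are placeholders:

 #include <linux/futex.h>
 #include <sys/syscall.h>
 #include <unistd.h>
 #include <limits.h>

 static int progress;   /* frames completed by the audio stream */

 static void audio_stream_step(void)
 {
         /* ... submit/complete one async audio chunk ... */
         __sync_fetch_and_add(&progress, 1);
         syscall(SYS_futex, &progress, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
 }

 static void video_stream_wait(int my_frame, int max_drift)
 {
         int seen;

         /* block until the audio stream is within max_drift of us: */
         while ((seen = __sync_fetch_and_add(&progress, 0)) <
                my_frame - max_drift)
                 syscall(SYS_futex, &progress, FUTEX_WAIT, seen,
                         NULL, NULL, 0);
 }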

    J

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 06/11] syslets: core, documentation
  2007-02-14 23:40                           ` Benjamin LaHaise
@ 2007-02-15  0:35                             ` Davide Libenzi
  0 siblings, 0 replies; 320+ messages in thread
From: Davide Libenzi @ 2007-02-15  0:35 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Russell King, Ingo Molnar, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	Evgeniy Polyakov, David S. Miller, Suparna Bhattacharya,
	Thomas Gleixner

On Wed, 14 Feb 2007, Benjamin LaHaise wrote:

> On Wed, Feb 14, 2007 at 03:17:59PM -0800, Davide Libenzi wrote:
> > > That's an incorrect assumption.  Every task/thread in the system has FPU 
> > > state associated with it, in part due to the fact that glibc has to change 
> > > some of the rounding mode bits, making them different than the default from 
> > > a freshly initialized state.
> > 
> > IMO I still believe this is not a huge problem. FPU state propagation/copy 
> > can be done in a clever way, once we detect the in-async condition.
> 
> Show me.  clts() and stts() are expensive hardware operations which there 
> is no means of avoiding as control register writes impact the CPU in a not 
> trivial manner.  I've spent far too much time staring at profiles of what 
> goes on in the context switch code in the process of looking for optimizations 
> on this very issue to be ignored on this point.

The trivial case is the cachehit case. Everything flows as usual, since 
we don't swap threads.
If we're going to sleep, __async_schedule has to save/copy (depending on 
whether TS_USEDFPU is set) the current FPU state to the newly selected 
service thread (the return-to-userspace thread).
When a fault eventually happens in the new userspace thread, the context 
is restored.
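
A rough sketch of the idea in pseudo-code, assuming the split-FPU-state 
patch posted earlier in this thread (thread.i387 as a pointer); the helper 
name and call site are illustrative, not code from the posted series:

 static void async_fpu_handover(struct task_struct *prev,
                                struct task_struct *next)
 {
         /* flush any live FPU registers into prev's save area
          * (unlazy_fpu() checks TS_USEDFPU internally): */
         unlazy_fpu(prev);

         /* give the return-to-userspace thread the same FPU context;
          * it is restored lazily on that thread's first FPU fault: */
         memcpy(next->thread.i387, prev->thread.i387,
                sizeof(union i387_union));
 }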



- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-14 22:32           ` Ingo Molnar
@ 2007-02-15  1:01             ` Davide Libenzi
  2007-02-15  1:28               ` Davide Libenzi
  0 siblings, 1 reply; 320+ messages in thread
From: Davide Libenzi @ 2007-02-15  1:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan, Linus Torvalds, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Ulrich Drepper, Zach Brown, Evgeniy Polyakov, David S. Miller,
	Benjamin LaHaise, Suparna Bhattacharya, Thomas Gleixner

On Wed, 14 Feb 2007, Ingo Molnar wrote:

> yeah, that's another key thing. I do plan to provide a sys_upcall() 
> syscall as well which calls a 5-parameter user-space function with a 
> special stack. (it's like a lightweight signal/event handler, without 
> any of the signal handler legacies and overhead - it's like a reverse 
> system call - a "user call". Obviously pure userspace would never use 
> sys_upcall(), unless as an act of sheer masochism.)

That is exactly what I described as clets. Instead of having complex jump 
and condition interpreters in the kernel (on top of new syscalls to 
modify/increment userspace variables), you just code it in C and you pass 
the clet pointer to the kernel.
The upcall will set up a frame and execute the clet (where jumps/conditions 
and userspace variable changes happen in machine code - gcc is pretty good 
at taking care of that for us); on its return, it comes back through a 
sys_async_return and goes back to userspace.




- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-15  1:01             ` Davide Libenzi
@ 2007-02-15  1:28               ` Davide Libenzi
  2007-02-18 20:01                 ` Pavel Machek
  0 siblings, 1 reply; 320+ messages in thread
From: Davide Libenzi @ 2007-02-15  1:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan, Linus Torvalds, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Ulrich Drepper, Zach Brown, Evgeniy Polyakov, David S. Miller,
	Benjamin LaHaise, Suparna Bhattacharya, Thomas Gleixner

On Wed, 14 Feb 2007, Davide Libenzi wrote:

> On Wed, 14 Feb 2007, Ingo Molnar wrote:
> 
> > yeah, that's another key thing. I do plan to provide a sys_upcall() 
> > syscall as well which calls a 5-parameter user-space function with a 
> > special stack. (it's like a lightweight signal/event handler, without 
> > any of the signal handler legacies and overhead - it's like a reverse 
> > system call - a "user call". Obviously pure userspace would never use 
> > sys_upcall(), unless as an act of sheer masochism.)
> 
> That is exactly what I described as clets. Instead of having complex jump 
> and condition interpreters in the kernel (on top of new syscalls to 
> modify/increment userspace variables), you just code it in C and you pass 
> the clet pointer to the kernel.
> The upcall will set up a frame and execute the clet (where jumps/conditions 
> and userspace variable changes happen in machine code - gcc is pretty good 
> at taking care of that for us); on its return, it comes back through a 
> sys_async_return and goes back to userspace.

So, for example, this is the setup code for the current API (and that's a 
really simple one - imagine going wacko with loops and userspace variable 
changes):


static struct req *alloc_req(void)
{
        /*
         * Constants can be picked up by syslets via static variables:
         */
        static long O_RDONLY_var = O_RDONLY;
        static long FILE_BUF_SIZE_var = FILE_BUF_SIZE;
                
        struct req *req;
                 
        if (freelist) {
                req = freelist;
                freelist = freelist->next_free;
                req->next_free = NULL;
                return req;
        }
                        
        req = calloc(1, sizeof(struct req));
 
        /*
         * This is the first atom in the syslet, it opens the file:
         *
         *  req->fd = open(req->filename, O_RDONLY);
         *
         * It is linked to the next read() atom.
         */
        req->filename_p = req->filename;
        init_atom(req, &req->open_file, __NR_sys_open,
                  &req->filename_p, &O_RDONLY_var, NULL, NULL, NULL, NULL,
                  &req->fd, SYSLET_STOP_ON_NEGATIVE, &req->read_file);
        
        /*
         * This second read() atom is linked back to itself, it skips to
         * the next one on stop:
         */
        req->file_buf_ptr = req->file_buf;
        init_atom(req, &req->read_file, __NR_sys_read,
                  &req->fd, &req->file_buf_ptr, &FILE_BUF_SIZE_var,
                  NULL, NULL, NULL, NULL,
                  SYSLET_STOP_ON_NON_POSITIVE | SYSLET_SKIP_TO_NEXT_ON_STOP,
                  &req->read_file);
                
        /*
         * This close() atom has NULL as next, this finishes the syslet:
         */
        init_atom(req, &req->close_file, __NR_sys_close,
                  &req->fd, NULL, NULL, NULL, NULL, NULL, NULL, 0, NULL);
                
        return req;
}


Here's what your clet would look like:

static long main_sync_loop(ctx *c)
{
        int fd;
        char file_buf[FILE_BUF_SIZE+1];
        
        if ((fd = open(c->filename, O_RDONLY)) == -1)
                return -1;
        while (read(fd, file_buf, FILE_BUF_SIZE) > 0)
                ;
        close(fd);
        return 0;
}


Kinda easier to code, isn't it? And the cost of the upcall to schedule the 
clet is amortized across the multiple syscalls you're going to do inside 
your clet.




- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 06/11] syslets: core, documentation
  2007-02-14 21:44                       ` Benjamin LaHaise
  2007-02-14 23:17                         ` Davide Libenzi
@ 2007-02-15  1:32                         ` Michael K. Edwards
  1 sibling, 0 replies; 320+ messages in thread
From: Michael K. Edwards @ 2007-02-15  1:32 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Davide Libenzi, Russell King, Ingo Molnar,
	Linux Kernel Mailing List, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller,
	Suparna Bhattacharya, Thomas Gleixner

On 2/14/07, Benjamin LaHaise <bcrl@kvack.org> wrote:
> My opinion of this whole thread is that it implies that our thread creation
> and/or context switch is too slow.  If that is the case, improve those
> elements first.  At least some of those optimizations have to be done in
> hardware on x86, while on other platforms they are probably unnecessary.

Not necessarily too slow, but too opaque in terms of system-wide
impact and global flow control.  Here are the four practical use cases
that I have seen come up in this discussion:

1) Databases that want to parallelize I/O storms, with an emphasis on
getting results that are already cache-hot immediately (not least so
they don't get evicted by other I/O results); there is also room to
push some of the I/O clustering and sequencing logic down into the
kernel.

2) Static-content-intensive network servers, with an emphasis on
servicing those requests that can be serviced promptly (to avoid a
ballooning connection backlog) and avoiding duplication of I/O effort
when many clients suddenly want the same cold content; the real win
may be in "smart prefetch" initiated from outside the network server
proper.

3) Network information gathering GUIs, which want to harvest as much
information as possible for immediate display and then switch to an
event-based delivery mechanism for tardy responses; these need
throttling of concurrent requests (ideally, in-kernel traffic shaping
by request group and destination class) and efficient cancellation of
no-longer-interesting requests.

4) Document search facilities, which need all of the above (big
surprise there) as well as a rich diagnostic toolset, including a
practical snooping and profiling facility to guide tuning for
application responsiveness.

Even if threads were so cheap that you could just fire off one per I/O
request, they're a poor solution to the host of flow control issues
raised in these use cases.  A sequential thread of execution per I/O
request may be the friendliest mental model for the individual delayed
I/Os, but the global traffic shaping and scheduling is a data
structure problem.

The right question to be asking is, what are the operations that need
to be supported on the system-wide pool of pending AIOs, and on what
data structure can they be implemented efficiently?  For instance, can
we provide an RCU priority queue implementation (perhaps based on
splay trees) so that userspace can scan a coherent read-only snapshot
of the structure and select candidates for cancellation, etc., without
interfering with kernel completions?  Or is it more important to have
a three-sided query operation (characteristic of priority search
trees), or perhaps a lower amortized cost bulk delete?
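
To make the question concrete, here is a rough sketch of the kind of
operation set such a pool might expose; every name below is a stand-in for
the query classes being discussed (insert/complete, cancel, reprioritize,
three-sided range scan), not an existing or proposed kernel interface:

struct aio_pending;            /* one pending request                       */
struct aio_pool;               /* the system-wide pool, e.g. an RCU'd tree  */

/* submission / completion side */
int  aio_pool_insert(struct aio_pool *p, struct aio_pending *req);
void aio_pool_complete(struct aio_pool *p, struct aio_pending *req, long res);

/* management side, driven from userspace via some syscall */
int  aio_pool_cancel(struct aio_pool *p, struct aio_pending *req);
int  aio_pool_reprio(struct aio_pool *p, struct aio_pending *req, int prio);

/*
 * "Three-sided" query: visit every request whose key (file offset,
 * destination class, ...) lies in [key_lo, key_hi] and whose priority is
 * at most prio_max - the characteristic priority-search-tree operation.
 */
typedef int (*aio_visit_t)(struct aio_pending *req, void *arg);
int  aio_pool_query(struct aio_pool *p,
                    unsigned long key_lo, unsigned long key_hi,
                    int prio_max, aio_visit_t visit, void *arg);

Whether aio_pool_query() and a cheap bulk cancel can coexist on one data
structure is precisely the trade-off between priority search trees and a
lower amortized cost bulk delete.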

Once you've thought through the data structure manipulation, you'll
know what AIO submission / cancellation / reprioritization interfaces
are practical.  Then you can work on a programming model for
application-level "I/O completions" that is library-friendly and
allows a "fast path" optimization for the fraction of requests that
can be served synchronously.  Then and only then does it make sense to
code-bum the asynchronous path.  Not that it isn't interesting to
think in advance about what stack space completions will run in and
which bits of the task struct needn't be in a coherent condition; but
that's probably not going to guide you to the design that meets the
application needs.

I know I'm teaching my grandmother to suck eggs here.  But there are
application programmers high up the code stack whose code makes
implicit use of asynchronous I/O continuations.  In addition to the
GUI example I blathered about a few days ago, I have in mind Narrative
Javascript's "blocking operator" and Twisted Python's Deferred.  Those
folks would be well served by an async I/O interface to the kernel
which mates well to their language's closure/continuation facilities.
If it's usable from C, that's nice too.  :-)

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-15  0:08       ` Jeremy Fitzhardinge
@ 2007-02-15  2:07         ` Davide Libenzi
  0 siblings, 0 replies; 320+ messages in thread
From: Davide Libenzi @ 2007-02-15  2:07 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, Evgeniy Polyakov, David S. Miller,
	Benjamin LaHaise, Suparna Bhattacharya, Thomas Gleixner

On Wed, 14 Feb 2007, Jeremy Fitzhardinge wrote:

> Davide Libenzi wrote:
> >> Would this work?
> >>     
> >
> > Hopefully the API will simplify enough so that emulation will become 
> > easier.
> >   
> 
> The big question in my mind is how all this stuff interacts with
> signals.  Can a blocked syscall atom be interrupted by a signal?  If so,
> what thread does it get delivered to?  How does sigprocmask affect this
> (is it atomic with respect to blocking atoms)?

Signal context is another thing that we need to transfer to the 
return-to-userspace task, in case we switch. Async threads inherit that 
from the "main" task once they're created, but from there to the 
sys_async_exec syscall, userspace might have changed the signal context, 
and re-emerging with a different one is not an option ;)
We should set up the service threads' signal context so that we can cancel them, 
but the implementation should be hidden from userspace (which will use 
sys_async_cancel - or whatever the name - to do that).



- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
  2007-02-13 14:20 ` [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support Ingo Molnar
                     ` (5 preceding siblings ...)
  2007-02-14 20:52   ` Jeremy Fitzhardinge
@ 2007-02-15  2:44   ` Zach Brown
  6 siblings, 0 replies; 320+ messages in thread
From: Zach Brown @ 2007-02-15  2:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner


I'm finally back from my travel and conference hiatus.. you guys have  
been busy! :)

On Feb 13, 2007, at 6:20 AM, Ingo Molnar wrote:

> I'm pleased to announce the first release of the "Syslet" kernel  
> feature
> and kernel subsystem, which provides generic asynchronous system call
> support:
>
>    http://redhat.com/~mingo/syslet-patches/

In general, I really like the look of this.

I think I'm convinced that your strong preference to do this with  
full kernel threads (1:1 task_struct -> thread_info/stack  
relationship) is the right thing to do.  The fibrils fell on the side  
of risking bugs by sharing task_structs amongst stacks executing  
kernel paths.  This, correct me if I'm wrong, falls on the side of  
risking behavioural quirks stemming from task_struct references that  
we happen to have not enabled sharing of yet.

I have strong hopes that we won't actually *care* about the  
behavioural differences we get from having individual task structs  
(which share the important things!) between syscall handlers.  The  
*only* seemingly significant case I've managed to find, the IO  
scheduler priority and context fields, is easy enough to fix up.   
Jens and I have been talking about that.  It's been bugging him for  
other reasons.

So, thanks, nice work.  I'm going to focus on finding out if it's  
feasible for The Database to use this instead of the current iocb  
mechanics.  I'm optimistic.

> Syslets are small, simple, lightweight programs (consisting of
> system-calls, 'atoms')

I will admit, though, that I'm not at all convinced that we need  
this.  Adding a system call for addition (just addition?  how far do  
we go?!) sure feels like a warning sign to me that we're heading down  
a slippery slope.  I would rather we started with an obviously  
minimal syscall which just takes an array of calls and args and  
executes them unconditionally.

But its existence doesn't stop the use case I care about.  So it's  
hard to get *too* worked up about it.

> Comments, suggestions, reports are welcome!

For what it's worth, it looks like 'x86-optimized-copy_uatom.patch'  
got some hunks that should have been in 'x86-optimized-sys_umem_add.patch'.

- z

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-14 21:35         ` Ingo Molnar
@ 2007-02-15  2:52           ` Zach Brown
  0 siblings, 0 replies; 320+ messages in thread
From: Zach Brown @ 2007-02-15  2:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Linux Kernel Mailing List, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

>> But the whole point is that the notion of a "register" is wrong in  
>> the
>> first place. [...]
>
> forget about it then. The thing we "register" is dead-simple:
>
>  struct async_head_user {
>          struct syslet_uatom __user              **completion_ring;
>          unsigned long                           ring_size_bytes;
>          unsigned long                           max_nr_threads;
>  };
>
> this can be passed in to sys_async_exec() as a second pointer, and the
> kernel can put the expected-completion pointer (and the user ring idx
> pointer) into its struct atom. It's just a few instructions, and  
> only in
> the cachemiss case.
>
> that would make completions arbitrarily split-up-able. No registration
> whatsoever. A waiter could specify which ring's events it is  
> interested
> in. A 'ring' could be a single-entry thing as well, for a single
> instance of pending IO.

I like this, too.  (Not surprisingly, having outlined something like  
it in a mail in one of the previous threads :)).

I'll bring up the POSIX AIO "list" IO case.  It wants to issue a  
group of IOs and sleep until they all return.  Being able to cheaply  
instantiate a ring implicitly with the submission of the IO calls in  
the list will make implementing this almost too easy.  It'd obviously  
just wait for that list's ring to drain.

I hope.  There might be complications around the edges (waiting for  
multiple list IOs to drain?), but it seems like this would be on the  
right track.
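
As a rough sketch of what that could look like on the glibc side - assuming
the async_head_user structure quoted above were visible to userspace, and
with submit_listio_atom() and wait_ring_drain() as placeholders for whatever
the submission and wait calls end up being (nothing below is from the posted
patches):

#include <aio.h>
#include <stdlib.h>

struct syslet_uatom;                            /* opaque on this side */

struct async_head_user {                        /* as quoted above, sans __user */
        struct syslet_uatom     **completion_ring;
        unsigned long           ring_size_bytes;
        unsigned long           max_nr_threads;
};

int submit_listio_atom(struct async_head_user *ahu, struct aiocb *cb);
int wait_ring_drain(struct async_head_user *ahu, unsigned long nr);

/* LIO_WAIT-style list IO: one private completion ring per list */
int lio_listio_over_ring(struct aiocb *const list[], int nent)
{
        struct syslet_uatom **ring = calloc(nent, sizeof(*ring));
        struct async_head_user ahu = {
                .completion_ring = ring,
                .ring_size_bytes = nent * sizeof(*ring),
                .max_nr_threads  = nent,
        };
        int i, err = 0;

        if (!ring)
                return -1;

        /* every aiocb in the list is submitted against this ring ...      */
        for (i = 0; i < nent && !err; i++)
                err = submit_listio_atom(&ahu, list[i]);

        /* ... and LIO_WAIT just means sleeping until the ring has drained */
        if (!err)
                err = wait_ring_drain(&ahu, (unsigned long)i);

        free(ring);     /* error unwinding of a partial submission elided */
        return err;
}

The LIO_NOWAIT case would need the ring to outlive this call, which is where
the complications around the edges probably start.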

I might be alone in caring about having a less ridiculous POSIX AIO  
interface in glibc, though, I'll admit.  It seems like it'd be a  
pretty sad missed opportunity if we rolled a fantastic general AIO  
interface and left glibc to still screw around with it's own manual  
threading :/.

- z

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 19/31] clockevents: i386 drivers
       [not found] ` <20061213130211.GT21847@elte.hu>
@ 2007-02-15 10:13   ` Andrew Morton
  0 siblings, 0 replies; 320+ messages in thread
From: Andrew Morton @ 2007-02-15 10:13 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Thomas Gleixner, linux-kernel

On Wed, 13 Dec 2006 14:02:11 +0100 Ingo Molnar <mingo@elte.hu> wrote:

> Add clockevent drivers for i386: lapic (local) and PIT (global).  Update 
> the timer IRQ to call into the PIT driver's event handler and the 
> lapic-timer IRQ to call into the lapic clockevent driver.  The 
> assignment of timer functionality is delegated to the core framework 
> code and replaces the compile-time and runtime evaluation in 
> do_timer_interrupt_hook().
> 
> Use the clockevents broadcast support and implement the lapic_broadcast
> function for ACPI.
> 
> No changes to existing functionality.

This patch breaks the NMI on my crufty old dual-PIII supermicro p6dbe
machine:


Testing NMI watchdog ... CPU#0: NMI appears to be stuck (26->26)!
CPU#1: NMI appears to be stuck (0->0)!


vmm:/home/akpm> cat /proc/interrupts 
           CPU0       CPU1       
  0:         59          0   IO-APIC-edge      timer
  1:          2          0   IO-APIC-edge      i8042
  2:          0          0    XT-PIC-XT        cascade
  6:          3          0   IO-APIC-edge      floppy
  8:          1          0   IO-APIC-edge      rtc
 10:        192         61   IO-APIC-fasteoi   aic7xxx
 11:       1339         31   IO-APIC-fasteoi   eth0
 12:          3          1   IO-APIC-edge      i8042
 15:       3067          7   IO-APIC-edge      ide1
NMI:         26          0 
LOC:      58665      58663 
ERR:          0
MIS:          0

and it isn't changing.

See http://userweb.kernel.org/~akpm/nmi-prob/

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-14 20:38   ` Linus Torvalds
                       ` (2 preceding siblings ...)
  2007-02-14 22:09     ` Ingo Molnar
@ 2007-02-15 13:35     ` Evgeniy Polyakov
  2007-02-15 16:09       ` Linus Torvalds
  3 siblings, 1 reply; 320+ messages in thread
From: Evgeniy Polyakov @ 2007-02-15 13:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Linux Kernel Mailing List, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

On Wed, Feb 14, 2007 at 12:38:16PM -0800, Linus Torvalds (torvalds@linux-foundation.org) wrote:
> Or how would you do the trivial example loop that I explained was a good 
> idea:
> 
>         struct one_entry *prev = NULL;
>         struct dirent *de;
> 
>         while ((de = readdir(dir)) != NULL) {
>                 struct one_entry *entry = malloc(..);
> 
>                 /* Add it to the list, fill in the name */
>                 entry->next = prev;
>                 prev = entry;
>                 strcpy(entry->name, de->d_name);
> 
>                 /* Do the stat lookup async */
>                 async_stat(de->d_name, &entry->stat_buf);
>         }
>         wait_for_async();
>         .. Ta-daa! All done ..
> 
> 
> Notice? This also "chains system calls together", but it does it using a 
> *much* more powerful entity called "user space". That's what user space 
> is. And yeah, it's a pretty complex sequencer, but happily we have 
> hardware support for accelerating it to the point that the kernel never 
> even needs to care.
> 
> The above is a *realistic* scenario, where you actually have things like 
> memory allocation etc going on. In contrast, just chaining system calls 
> together isn't a realistic scenario at all.

One can still perfectly well and easily use sys_async_exec(...stat()...)
in the above scenario. Although I do think that having a web server in the
kernel is overkill, having a proper state machine for good async
processing is a must.
Not that I agree that it should be done on top of syscalls as basic
elements, but it is an initial state.

> So I think we have one _known_ usage schenario:
> 
>  - replacing the _existing_ aio_read() etc system calls (with not just 
>    existing semantics, but actually binary-compatible)
> 
>  - simple code use where people are willing to perhaps do something 
>    Linux-specific, but because it's so _simple_, they'll do it.
> 
> In neither case does the "chaining atoms together" seem to really solve 
> the problem. It's clever, but it's not what people would actually do.

It is an example of what can be done. If one does not like it - do not use
it. A state machine is implemented in the sendfile() syscall - and although
it is not a good idea to have async sendfile as-is in a micro-thread design
(due to network blocking and small per-page reads), it is still a state
machine, which could be used with the syslet state machine (if it could be
extended).

> And yes, you can hide things like that behind an abstraction library, but 
> once you start doing that, I've got three questions for you:
> 
>  - what's the point?
>  - we're adding overhead, so how are we getting it back
>  - how do we handle independent libraries each doing their own thing and 
>    version skew between them?
> 
> In other words, the "let user space sort out the complexity" is not a good 
> answer. It just means that the interface is badly designed.

Well, if we can set up an iocb structure, why can we not set up a syslet one?

Yes, with syscalls as state machine elements 99% of users will not use
it (I can only think of proper fadvise()+read()/sendfile() states),
but there is no problem with setting up a structure in userspace at all. And
if there is a possibility to use it for other things, it is definitely a win.

Actually the complex-structure-setup argument is stupid - everyone is forced
to have a timeval structure instead of a number of microseconds.

So there is no point in 'complex setup and overhead', but there is
a. the limit of AIO (although my point is not to have a huge number of
	working threads - they were created by people who can not
	program state machines (c) Alan Cox)
b. the possibility to implement a state machine (in its current form it
	will likely not be used except maybe for some optional hints for IO
	tasks like fadvise)
c. in all other ways it has all the pros and cons of the micro-thread
	design (it looks neat and simple, although it is utterly broken in
	some usage cases).

> 			Linus

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-15 13:35     ` Evgeniy Polyakov
@ 2007-02-15 16:09       ` Linus Torvalds
  2007-02-15 16:37         ` Evgeniy Polyakov
                           ` (2 more replies)
  0 siblings, 3 replies; 320+ messages in thread
From: Linus Torvalds @ 2007-02-15 16:09 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Linux Kernel Mailing List, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner



On Thu, 15 Feb 2007, Evgeniy Polyakov wrote:
> > 
> > In other words, the "let user space sort out the complexity" is not a good 
> > answer. It just means that the interface is badly designed.
> 
> Well, if we can set up an iocb structure, why can we not set up a syslet one?

(I'm cutting wildly, to try to get to the part I wanted to answer)

I actually think aio_read/write() and friends are *horrible* interfaces.

Here's a quick question: how many people have actually ever seen them used 
in "normal code"? 

Yeah. Nobody uses them. They're not all that portable (even within unixes 
they aren't always there, much less in other places), they are fairly 
obscure, and they are just not really easy to use.

Guess what? The same is going to be true *in spades* for any Linux- 
specific async system call thing.

This is why I think simplicity of use along with transparency, is so 
important. I think "aio_read()" is already a nasty enough interface, and 
it's sure more portable than any Linux-specific extension will be (only 
until the revolution comes, of course - at that point, everybody who 
doesn't use Linux will be up against the wall, so we can solve the problem 
that way).

So a Linux-specific extension needs to be *easier* to use or at least 
understand, and have *more* obvious upsides than "aio_read()" has. 
Otherwise, it's pointless - nobody is really going to use it.

This is why I think the main goals should be:

 - the *internal* kernel goal of trying to replace some very specific 
   aio_read() etc code with something more maintainable.

   This is really a maintainability argument, nothing more. Even if we 
   never do anything *but* aio_read() and friends, if we can avoid having 
   the VFS code have multiple code-paths and explicit checks for AIO, and 
   instead handle it more "automatically", I think it is already worth it.

 - add extensions that people *actually*can*use* in practice.

   And this is where "simple interfaces" comes in.

> So there is no point in 'complex setup and overhead', but there is
> a. limit of the AIO (although my point is not to have huge amount of
> 	working threads - they were created by people who can not
> 	program state machines (c) Alan Cox)
> b. possibility to implement a state machine (in current form likely will
> 	not be used except maybe some optional hints for IO tasks like
> 	fadvice)
> c. in all other ways it has all pros and cons of micro-thread design (it
> 	looks neat and simple, although is utterly broken in some usage
> 	cases).

I don't think the "atom" approach is bad per se. I think it could be fine 
to have some state information in user space. It's just that I think 
complex interfaces that people largely won't even use is a big mistake. We 
should concentrate on usability first, and some excessive cleverness 
really isn't a big advantage.

Being able to do a "open + stat" looks like a fine thing. But I doubt 
you'll see a lot of other combinations.
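
For concreteness, an "open + stat" chain in the style of the
alloc_req()/init_atom() example quoted earlier in the thread might look
roughly like the sketch below; the struct layout, the __NR_sys_* names and
the flags simply mirror that example and are illustrative rather than taken
from the actual patches:

#include <fcntl.h>
#include <sys/stat.h>

struct open_stat_req {
        const char              *filename;
        const char              **filename_p;
        long                    fd;
        struct stat             statbuf;
        struct stat             *statbuf_p;
        struct syslet_uatom     open_file;
        struct syslet_uatom     stat_file;
};

static void setup_open_plus_stat(struct open_stat_req *req, const char *name)
{
        static long O_RDONLY_var = O_RDONLY;

        req->filename   = name;
        req->filename_p = &req->filename;
        req->statbuf_p  = &req->statbuf;

        /*
         * First atom: req->fd = open(name, O_RDONLY);
         * a negative result stops the chain.
         */
        init_atom(req, &req->open_file, __NR_sys_open,
                  &req->filename_p, &O_RDONLY_var, NULL, NULL, NULL, NULL,
                  &req->fd, SYSLET_STOP_ON_NEGATIVE, &req->stat_file);

        /*
         * Second atom: fstat(req->fd, &req->statbuf);
         * NULL as next finishes the syslet.
         */
        init_atom(req, &req->stat_file, __NR_sys_fstat,
                  &req->fd, &req->statbuf_p, NULL, NULL, NULL, NULL,
                  NULL, 0, NULL);
}

Whether anything much longer than this two-atom chain shows up in practice
is exactly the doubt being raised here.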

		Linus

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-15 16:09       ` Linus Torvalds
@ 2007-02-15 16:37         ` Evgeniy Polyakov
  2007-02-15 17:42           ` Linus Torvalds
  2007-02-15 17:05         ` Davide Libenzi
  2007-02-15 17:17         ` Ulrich Drepper
  2 siblings, 1 reply; 320+ messages in thread
From: Evgeniy Polyakov @ 2007-02-15 16:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Linux Kernel Mailing List, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

On Thu, Feb 15, 2007 at 08:09:54AM -0800, Linus Torvalds (torvalds@linux-foundation.org) wrote:
> > > In other words, the "let user space sort out the complexity" is not a good 
> > > answer. It just means that the interface is badly designed.
> > 
> > Well, if we can set up an iocb structure, why can we not set up a syslet one?
> 
> (I'm cutting wildly, to try to get to the part I wanted to answer)
> 
> I actually think aio_read/write() and friends are *horrible* interfaces.
> 
> Here's a quick question: how many people have actually ever seen them used 
> in "normal code"? 

Agreed, the existing AIO interface is far from ideal IMO, but it is used.
Whether or not it is normal, AIO itself is not a normal interface -
there are no books about POSIX AIO, so no one knows about AIO at all.

> Yeah. Nobody uses them. They're not all that portable (even within unixes 
> they aren't always there, much less in other places), they are fairly 
> obscure, and they are just not really easy to use.
> 
> Guess what? The same is going to be true *in spades* for any Linux- 
> specific async system call thing.
> 
> This is why I think simplicity of use along with transparency, is so 
> important. I think "aio_read()" is already a nasty enough interface, and 
> it's sure more portable than any Linux-specific extension will be (only 
> until the revolution comes, of course - at that point, everybody who 
> doesn't use Linux will be up against the wall, so we can solve the problem 
> that way).
> 
> So a Linux-specific extension needs to be *easier* to use or at least 
> understand, and have *more* obvious upsides than "aio_read()" has. 
> Otherwise, it's pointless - nobody is really going to use it.

Userspace_API_is_the_ever_possible_last_thing_to_ever_think_about. Period
. // <- wrapped one

If a system is designed such that it breaks with API changes - that system
sucks wildly and should be thrown away. Syslets do not suffer from that.

We can have tons of interfaces any alien would be happy with (imho it is
not even the kernel's task at all) - a new table of syscalls, used the way
the usual ones are, for example.
And we will have async_stat() usable exactly the way people would expect.

It is not even a thing to discuss. There are other technical issues with
syslets yet to resolve. If people are happy with the design of the system,
then it is time to think about how it will look from the user's point of view.

syslet(__NR_stat) -> async_stat() - say it, and Ingo and other developers
will think about how to implement that, or start to discuss that it is a
bad interface and that something else should be invented instead.

If the interface sucks, then the _interface_ must be changed/extended/replaced.
If the overall design sucks, then it must be changed.

Solve problems one by one, instead of throwing something away just because
it uses a wild interface which can be changed in a minute.

> This is why I think the main goals should be:
> 
>  - the *internal* kernel goal of trying to replace some very specific 
>    aio_read() etc code with something more maintainable.
> 
>    This is really a maintainability argument, nothing more. Even if we 
>    never do anything *but* aio_read() and friends, if we can avoid having 
>    the VFS code have multiple code-paths and explicit checks for AIO, and 
>    instead handle it more "automatically", I think it is already worth it.
> 
>  - add extensions that people *actually*can*use* in practice.
> 
>    And this is where "simple interfaces" comes in.

There is absolutely _NO_ problem in having whatever interface people will use.
Which one do you want?
async_stat() instead of syslet(complex_struct_blah_sync)?
No problem - it is _really_ trivial to implement.
Ingo mentioned that it should be done, and it is a really simple task for
glibc, just like it is done for the usual syscalls - and it has nothing to do
with the overall system design at all.

> > So there is no point in 'complex setup and overhead', but there is
> > a. limit of the AIO (although my point is not to have huge amount of
> > 	working threads - they were created by people who can not
> > 	program state machines (c) Alan Cox)
> > b. possibility to implement a state machine (in current form likely will
> > 	not be used except maybe some optional hints for IO tasks like
> > 	fadvice)
> > c. in all other ways it has all pros and cons of micro-thread design (it
> > 	looks neat and simple, although is utterly broken in some usage
> > 	cases).
> 
> I don't think the "atom" approach is bad per se. I think it could be fine 
> to have some state information in user space. It's just that I think 
> complex interfaces that people largely won't even use is a big mistake. We 
> should concentrate on usability first, and some excessive cleverness 
> really isn't a big advantage.
> 
> Being able to do a "open + stat" looks like a fine thing. But I doubt 
> you'll see a lot of other combinations.

Then no problem.

The interface does suck, especially since it does not allow forbidding some
syscalls from async execution, but it is just a brick out of which a really
good system can be built.

I personally vote for a table of async syscalls translated into
human-readable aliases like async_stat() and the like.

> 		Linus

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-15 16:09       ` Linus Torvalds
  2007-02-15 16:37         ` Evgeniy Polyakov
@ 2007-02-15 17:05         ` Davide Libenzi
  2007-02-15 17:17           ` Evgeniy Polyakov
  2007-02-15 17:17         ` Ulrich Drepper
  2 siblings, 1 reply; 320+ messages in thread
From: Davide Libenzi @ 2007-02-15 17:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Evgeniy Polyakov, Ingo Molnar, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Thomas Gleixner

On Thu, 15 Feb 2007, Linus Torvalds wrote:

> I don't think the "atom" approach is bad per se. I think it could be fine 
> to have some state information in user space. It's just that I think 
> complex interfaces that people largely won't even use is a big mistake. We 
> should concentrate on usability first, and some excessive cleverness 
> really isn't a big advantage.
> 
> Being able to do a "open + stat" looks like a fine thing. But I doubt 
> you'll see a lot of other combinations.

I actually think that building chains of syscalls brings you back to a 
multithreaded solution. Why? Because suddenly the service thread goes 
from servicing a syscall (with possible cachehit optimization) to 
servicing a whole session. So the number of service threads needed (locked 
down by a chain) becomes big, because requests go from being short-lived 
syscalls to long-lived chains of them. Think about the trivial web server, 
and think about a chain that does open->fstat->sendhdrs->sendfile after an 
accept. What's the difference from a multithreaded solution that does 
accept->clone and executes the above code in the new thread? Nada, NIL. 
Actually, there is a difference. The standard multithreaded function is 
easier to code in C than with the complex atom chains. The number of 
service threads suddenly becomes proportional to the number of active 
sessions.
The more I look at this, the more I think that async_submit should submit 
simple syscalls, or an array of them (unrelated/parallel).
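
As a sketch of that submission model - struct syslet, syslet_prep() and
async_submit_v() below are placeholder names only, not interfaces from the
posted patches:

#include <sys/stat.h>

struct syslet {                 /* placeholder atom layout */
        int  nr;
        long args[6];
};

extern long async_submit_v(struct syslet *v, int n);    /* placeholder */

static void syslet_prep(struct syslet *s, int nr, long a0, long a1)
{
        s->nr = nr;
        s->args[0] = a0;
        s->args[1] = a1;
        s->args[2] = s->args[3] = s->args[4] = s->args[5] = 0;
}

/* two independent stat()s submitted as one parallel batch */
static long stat_two_files(int nr_stat,
                           const char *p1, struct stat *st1,
                           const char *p2, struct stat *st2)
{
        struct syslet batch[2];

        syslet_prep(&batch[0], nr_stat, (long)p1, (long)st1);
        syslet_prep(&batch[1], nr_stat, (long)p2, (long)st2);

        return async_submit_v(batch, 2);
}

Each element is an independent syscall: no chaining, so a service thread is
pinned only for the duration of one blocking syscall rather than for a whole
session.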



- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-15 16:09       ` Linus Torvalds
  2007-02-15 16:37         ` Evgeniy Polyakov
  2007-02-15 17:05         ` Davide Libenzi
@ 2007-02-15 17:17         ` Ulrich Drepper
  2 siblings, 0 replies; 320+ messages in thread
From: Ulrich Drepper @ 2007-02-15 17:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel Mailing List


Linus Torvalds wrote:
> Here's a quick question: how many people have actually ever seen them used 
> in "normal code"? 
> 
> Yeah. Nobody uses them. They're not all that portable (even within unixes 
> they aren't always there, much less in other places), they are fairly 
> obscure, and they are just not really easy to use.

That's nonsense.  They are widely used (just hear people scream if
something changes or breaks) and they are available on all Unix
implementations which are not geared towards embedded use.  POSIX makes
AIO in the next revision mandatory.

Just because you don't like it, don't discount it.  Yes, the interface
is not the best.  But this is what you get if you cannot dictate
interfaces to everybody.  You have to make concessions.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-15 17:05         ` Davide Libenzi
@ 2007-02-15 17:17           ` Evgeniy Polyakov
  2007-02-15 17:39             ` Davide Libenzi
  0 siblings, 1 reply; 320+ messages in thread
From: Evgeniy Polyakov @ 2007-02-15 17:17 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Linus Torvalds, Ingo Molnar, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Thomas Gleixner

On Thu, Feb 15, 2007 at 09:05:13AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> On Thu, 15 Feb 2007, Linus Torvalds wrote:
> 
> > I don't think the "atom" approach is bad per se. I think it could be fine 
> > to have some state information in user space. It's just that I think 
> > complex interfaces that people largely won't even use is a big mistake. We 
> > should concentrate on usability first, and some excessive cleverness 
> > really isn't a big advantage.
> > 
> > Being able to do a "open + stat" looks like a fine thing. But I doubt 
> > you'll see a lot of other combinations.
> 
> I actually think that building chains of syscalls bring you back to a 
> multithreaded solution. Why? Because suddendly the service thread become 
> from servicing a syscall (with possible cachehit optimization), to 
> servicing a whole session. So the number of service threads needed (locked 
> down by a chain) becomes big because requests goes from being short-lived 
> syscalls to long-lived chains of them. Think about the trivial web server, 
> and think about a chain that does open->fstat->sendhdrs->sendfile after an 
> accept. What's the difference with a multithreaded solution that does 
> accept->clone and execute the above code in the new thread? Nada, NIL.

That is more of an ideological question about the micro-thread design in
general. If a syslet is able to perform only one syscall, one will have 4
threads for the above case, not one, so it is even more broken.

So, if Linux moves that way of doing AIO (IMO incorrectly - I think the
correct state machine is made not of syscalls, but of specially crafted
entries, like populate pages into the VFS, send a chunk, receive a chunk
without blocking and continue on completion, and the like), syslets with
attached state machines are the best (smallest-evil) choice.

> Actually, there is a difference. The standard multithreaded function is 
> easier to code in C than with the complex atoms chains. The number of 
> service thread becomes suddendly proportional to the number of active 
> sessions.
> The more I look at this, the more I think that async_submit should submit 
> simple syscalls, or an array of them (unrelated/parallel).
 
That is the case - atom items (I do hope this subsystem will be able to
perform not only syscalls, but any kernel interface with a suitable
prototype; v2 seems to move in that direction) are called asynchronously
from the main userspace thread to achieve maximum performance.
 
> - Davide
> 

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-15 17:17           ` Evgeniy Polyakov
@ 2007-02-15 17:39             ` Davide Libenzi
  2007-02-15 18:01               ` Evgeniy Polyakov
  0 siblings, 1 reply; 320+ messages in thread
From: Davide Libenzi @ 2007-02-15 17:39 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Linus Torvalds, Ingo Molnar, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Thomas Gleixner

On Thu, 15 Feb 2007, Evgeniy Polyakov wrote:

> On Thu, Feb 15, 2007 at 09:05:13AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> > 
> > I actually think that building chains of syscalls bring you back to a 
> > multithreaded solution. Why? Because suddendly the service thread become 
> > from servicing a syscall (with possible cachehit optimization), to 
> > servicing a whole session. So the number of service threads needed (locked 
> > down by a chain) becomes big because requests goes from being short-lived 
> > syscalls to long-lived chains of them. Think about the trivial web server, 
> > and think about a chain that does open->fstat->sendhdrs->sendfile after an 
> > accept. What's the difference with a multithreaded solution that does 
> > accept->clone and execute the above code in the new thread? Nada, NIL.
> 
> That is more ideological question about micro-thread design at all.
> If syslet will be able to perform only one syscall, one will have 4
> threads for above case, not one, so it is even more broken.

Nope, just one thread. Well, two, if you consider the "main" dispatch 
thread, and the syscall service thread.



> So, if Linux moves that way of doing AIO (IMO incorrect, I think that
> the correct state machine made not of syscalls, but specially crafted
> entries - like populate pages into VFS, send chunk, recv chunk without
> blocking and continue on completion and the like), syslets with attached 
> state machines are the (smallest evil) best choice.

But at that point you don't need to have complex atom interfaces, with 
chains, whips and leather pants :) Just code it in C and submit that to 
the async engine. The longer the chain though, the closer you get to a 
fully multithreaded solution, in terms of service thread consumption. And 
what do you save WRT a multithreaded solution? Not thread 
creation/destruction, because that cost is fully amortized inside the chain 
execution cost (plus a pool would even save that).
IMO the plus of a generic async engine is mostly from a kernel code 
maintenance POV. You no longer need to have AIO-aware code paths, which 
automatically translates to smaller and more maintainable code.



- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-15 16:37         ` Evgeniy Polyakov
@ 2007-02-15 17:42           ` Linus Torvalds
  2007-02-15 18:11             ` Evgeniy Polyakov
  2007-02-15 18:46             ` bert hubert
  0 siblings, 2 replies; 320+ messages in thread
From: Linus Torvalds @ 2007-02-15 17:42 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Linux Kernel Mailing List, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner



On Thu, 15 Feb 2007, Evgeniy Polyakov wrote:
> 
> Userspace_API_is_the_ever_possible_last_thing_to_ever_think_about. Period
> . // <- wrapped one

No, I really think you're wrong.

In many ways, the interfaces and especially data structures are *more* 
important than the code.

The code we can fix. The interfaces, on the other hand, we'll have to live 
with forever.

So complex interfaces that expose lots of implementation detail are not a 
good thing, and it's _not_ the last thing you want to think about. Complex 
interfaces with a lot of semantic knowledge seriously limit how you can 
fix things up later.

In contrast, simple interfaces that have clear and unambiguous semantics 
and that can be explained at a conceptual level are things that you can 
often implement in many different ways. So the interface isn't the 
bottleneck: you may have to have a "backwards compatibility layer" for it.

> If system is designed that with API changes it breaks - that system sucks 
> wildly and should be thrown away. Syslets do not suffer from that.

The syslet code itself looks fine. It's the user-visible part I'm not 
convinced about.

I'm just saying: how would you use this for existing programs?

For something this machine-specific, you're not going to have any big 
project written around the "async atom" code. So realistically, the kind 
of usage we'd see is likely some compile-time configuration option, where 
people replace some specific sequence of code with another one. THAT is 
what we should aim to make easy and flexible, I think. And that is where 
interfaces really are as important as code.

We know one interface: the current aio_read() one. Nobody really _likes_ 
it (even database people would apparently like to extend it), but it has 
the huge advantage of "being there", and having real programs that really 
care that use it today. 

Others? We don't know yet. And exposing complex interfaces that may not be 
the right ones is much *worse* than exposing simple interfaces (that 
_also_ may not be the right ones, of course - but simple and 
straightforward interfaces with obvious and not-very-complex semantics are 
a lot easier to write compatibility layers for if the internal code 
changes radically).

		Linus

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-15 17:39             ` Davide Libenzi
@ 2007-02-15 18:01               ` Evgeniy Polyakov
  0 siblings, 0 replies; 320+ messages in thread
From: Evgeniy Polyakov @ 2007-02-15 18:01 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Linus Torvalds, Ingo Molnar, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Thomas Gleixner

On Thu, Feb 15, 2007 at 09:39:33AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> On Thu, 15 Feb 2007, Evgeniy Polyakov wrote:
> 
> > On Thu, Feb 15, 2007 at 09:05:13AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> > > 
> > > I actually think that building chains of syscalls bring you back to a 
> > > multithreaded solution. Why? Because suddendly the service thread become 
> > > from servicing a syscall (with possible cachehit optimization), to 
> > > servicing a whole session. So the number of service threads needed (locked 
> > > down by a chain) becomes big because requests goes from being short-lived 
> > > syscalls to long-lived chains of them. Think about the trivial web server, 
> > > and think about a chain that does open->fstat->sendhdrs->sendfile after an 
> > > accept. What's the difference with a multithreaded solution that does 
> > > accept->clone and execute the above code in the new thread? Nada, NIL.
> > 
> > That is more ideological question about micro-thread design at all.
> > If syslet will be able to perform only one syscall, one will have 4
> > threads for above case, not one, so it is even more broken.
> 
> Nope, just one thread. Well, two, if you consider the "main" dispatch 
> thread, and the syscall service thread.
 
Argh - if they are supposed to run synchronously (although, for example, the
stat can be done in parallel with the sendfile in the above example), then
generally yes, one execution thread.
 
> > So, if Linux moves that way of doing AIO (IMO incorrect, I think that
> > the correct state machine made not of syscalls, but specially crafted
> > entries - like populate pages into VFS, send chunk, recv chunk without
> > blocking and continue on completion and the like), syslets with attached 
> > state machines are the (smallest evil) best choice.
> 
> But at that point you don't need to have complex atom interfaces, with 
> chains, whips and leather pants :) Just code it in C and submit that to 
> the async engine. The longer is the chain though, the closer you get to a 
> fully multithreaded solution, in terms of service thread consuption. And 
> what do you save WRT a multithreaded solution? Not thread 
> creation/destroy, because that cost is fully amortized inside the chain 
> execution cost (plus a pool would even save that).
> IMO the plus of a generic async engine is mostly from a kernel code 
> maintainance POV. You don't need anymore to have AIO-aware code paths, 
> that automatically transalte to smaller and more maintainable code.
 
It is completely possible to not wire up several syscalls and just use
only one per async call, but _if_ such a requirement arises, the whole
infrastructure is there. 
 
> - Davide
> 

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-15 17:42           ` Linus Torvalds
@ 2007-02-15 18:11             ` Evgeniy Polyakov
  2007-02-15 18:25               ` Linus Torvalds
  2007-02-15 18:46             ` bert hubert
  1 sibling, 1 reply; 320+ messages in thread
From: Evgeniy Polyakov @ 2007-02-15 18:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Linux Kernel Mailing List, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

On Thu, Feb 15, 2007 at 09:42:32AM -0800, Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Thu, 15 Feb 2007, Evgeniy Polyakov wrote:
> > 
> > Userspace_API_is_the_ever_possible_last_thing_to_ever_think_about. Period
> > . // <- wrapped one
> 
> No, I really think you're wrong.
> 
> In many ways, the interfaces and especially data structures are *more* 
> important than the code.
> 
> The code we can fix. The interfaces, on the other hand, we'll have to live 
> with forever.
> 
> So complex interfaces that expose lots of implementation detail are not a 
> good thing, and it's _not_ the last thing you want to think about. Complex 
> interfaces with a lot of semantic knowledge seriously limit how you can 
> fix things up later.
> 
> In contrast, simple interfaces that have clear and unambiguous semantics 
> and that can be explained at a conceptual level are things that you can 
> often implement in many different ways. So the interface isn't the bottle 
> neck: you may have to have a "backwards compatibility layer" for it 

That's exactly the way we should discuss it - you do not like that
interface, but Ingo proposed a way to change that via a table of async
syscalls - people ask, people answer - so eventually the interface and (if
any) other problems get resolved.

> > If system is designed that with API changes it breaks - that system sucks 
> > wildly and should be thrown away. Syslets do not suffer from that.
> 
> The syslet code itself looks fine. It's the user-visible part I'm not 
> convinced about.
> 
> I'm just saying: how would you use this for existing programs?
> 
> For something this machine-specific, you're not going to have any big 
> project written around the "async atom" code. So realistically, the kinds 
> of usage we'd see is likely some compile-time configuration option, where 
> people replace some specific sequence of code with another one. THAT is 
> what we should aim to make easy and flexible, I think. And that is where 
> interfaces really are as important as code.
> 
> We know one interface: the current aio_read() one. Nobody really _likes_ 
> it (even database people would apparently like to extend it), but it has 
> the huge advantage of "being there", and having real programs that really 
> care that use it today. 
> 
> Others? We don't know yet. And exposing complex interfaces that may not be 
> the right ones is much *worse* than exposing simple interfaces (that 
> _also_ may not be the right ones, of course - but simple and 
> straightforward interfaces with obvious and not-very-complex semantics are 
> a lot easier to write compatibility layers for if the internal code 
> changes radically)

So we just need to describe the way we want to see the new interface -
that's it.

Here is a stub for async_stat() - probably a broken example, but that does
not matter - this interface is really easy to change.

static void syslet_setup(struct syslet *s, int nr, void *arg1...)
{
	s->flags = ...
	s->arg[1] = arg1;
	....
}

long glibc_async_stat(const char *path, struct stat *buf)
{
	/* What about making syslet and/or set of atoms per thread and preallocate 
	 * them when working threads are allocated? */
	struct syslet s;
	syslet_setup(&s, __NR_stat, path, buf, NULL, NULL, NULL, NULL);
	return async_submit(&s);
}

> 		Linus

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-15 18:11             ` Evgeniy Polyakov
@ 2007-02-15 18:25               ` Linus Torvalds
  2007-02-15 19:04                 ` Evgeniy Polyakov
  0 siblings, 1 reply; 320+ messages in thread
From: Linus Torvalds @ 2007-02-15 18:25 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Linux Kernel Mailing List, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner



On Thu, 15 Feb 2007, Evgeniy Polyakov wrote:
> 
> So we just need to describe the way we want to see new interface -
> that's it.

Agreed. Absolutely.

But please keep the kernel interface as part of that. Not just a strange 
and complex kernel interface and then _usable_ library interfaces that use 
the strange and complex one internally. Because if the complex one has no 
validity on its own, it's just (a) a bitch to debug and (b) if we ever 
change any details inside the kernel we'll end up with a lot of subtle 
code where user land creates complex data, and the kernel just reads it, 
and both just (unnecessarily) work around the fact that the other doesn't 
do the straightforward thing.

> Here is a stub for async_stat() - probably broken example, but that does
> not matter - this interface is really easy to change.
> 
> static void syslet_setup(struct syslet *s, int nr, void *arg1...)
> {
> 	s->flags = ...
> 	s->arg[1] = arg1;
> 	....
> }
> 
> long glibc_async_stat(const char *path, struct stat *buf)
> {
> 	/* What about making syslet and/or set of atoms per thread and preallocate 
> 	 * them when working threads are allocated? */
> 	struct syslet s;
> 	syslet_setup(&s, __NR_stat, path, buf, NULL, NULL, NULL, NULL);
> 	return async_submit(&s);
> }

And this is a classic example of potentially totally buggy code.

Why? You're releasing the automatic variable on the stack before it's 
necessarily all used!

So now you need to do a _longterm_ allocation, and that in turn means that 
you need to do a long-term de-allocation!

Ok, so do we make the rule be that all atoms *have* to be read fully 
before we start the async submission (so that the caller doesn't need to 
do a long-term allocation)?

Or do we make the rule be that just the *first* atom is copied by the 
kernel before the async_submit() returns, and thus it's ok to do the above 
*IFF* you only have a single system call?

See? The example you tried to use to show how "simple" the interface is was 
actually EXACTLY THE REVERSE. It shows how subtle bugs can creep in!

		Linus

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-15 17:42           ` Linus Torvalds
  2007-02-15 18:11             ` Evgeniy Polyakov
@ 2007-02-15 18:46             ` bert hubert
  2007-02-15 19:10               ` Evgeniy Polyakov
                                 ` (2 more replies)
  1 sibling, 3 replies; 320+ messages in thread
From: bert hubert @ 2007-02-15 18:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Evgeniy Polyakov, Ingo Molnar, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

On Thu, Feb 15, 2007 at 09:42:32AM -0800, Linus Torvalds wrote:

> We know one interface: the current aio_read() one. Nobody really _likes_ 
[...]

> Others? We don't know yet. And exposing complex interfaces that may not be 
> the right ones is much *worse* than exposing simple interfaces (that 
> _also_ may not be the right ones, of course - but simple and 

From humble userland, here are two things I'd hope to be able to do, although
I admit my needs are rather specialist.

1) batch, and wait for, with proper error reporting:
	socket();
	[ setsockopt(); ]
	bind();
	connect();
	gettimeofday();  // doesn't *always* happen
	send();
	recv();
	gettimeofday(); // doesn't *always* happen

	I go through this sequence for each outgoing PowerDNS UDP query
	because I need a new random source port for each query, and I
	connect because I care about errors. Linux does not give me random
	source ports for UDP sockets. (A sketch of this sequence as a single
	C function appears after this list.)

	When async, I can probably just drop the setsockopt (for
	nonblocking). I already batch the gettimeofday to 'once per epoll
	return', but quite often this is once per packet.

2) 	On the client facing side (port 53), I'd very much hope for a way to
	do 'recvv' on datagram sockets, so I can retrieve a whole bunch of
	UDP datagrams with only one kernel transition.

	This would mean that I batch up either 10 calls to recv(), or one
	'atom' of 10 recv's.

Both 1 and 2 are currently limiting factors when I enter the 100kqps domain
of name serving. This doesn't mean the rest of my code is as tight as it
could be, but I spend a significant portion of time in the kernel even at
moderate (10kqps effective) loads, even though I already use epoll. A busy
PowerDNS recursor typically spends 25% to 50% of its time on 'sy' load.

This might be due to my use of get/set/swap/makecontext though.

	Bert

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://netherlabs.nl              Open and Closed source services

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-15 18:25               ` Linus Torvalds
@ 2007-02-15 19:04                 ` Evgeniy Polyakov
  2007-02-15 19:28                   ` Linus Torvalds
  0 siblings, 1 reply; 320+ messages in thread
From: Evgeniy Polyakov @ 2007-02-15 19:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Linux Kernel Mailing List, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

On Thu, Feb 15, 2007 at 10:25:37AM -0800, Linus Torvalds (torvalds@linux-foundation.org) wrote:
> > static void syslet_setup(struct syslet *s, int nr, void *arg1...)
> > {
> > 	s->flags = ...
> > 	s->arg[1] = arg1;
> > 	....
> > }
> > 
> > long glibc_async_stat(const char *path, struct stat *buf)
> > {
> > 	/* What about making syslet and/or set of atoms per thread and preallocate 
> > 	 * them when working threads are allocated? */
> > 	struct syslet s;
> > 	syslet_setup(&s, __NR_stat, path, buf, NULL, NULL, NULL, NULL);
> > 	return async_submit(&s);
> > }
> 
> And this is a classic example of potentially totally buggy code.
> 
> Why? You're releasing the automatic variable on the stack before it's 
> necessarily all used!
> 
> So now you need to do a _longterm_ allocation, and that in turn means that 
> you need to do a long-term de-allocation!
> 
> Ok, so do we make the rule be that all atoms *have* to be read fully 
> before we start the async submission (so that the caller doesn't need to 
> do a long-term allocation)?
> 
> Or do we make the rule be that just the *first* atom is copied by the 
> kernel before the async_sumbit() returns, and thus it's ok to do the above 
> *IFF* you only have a single system call?
> 
> See? The example you tried to use to show how "simple" the interface iswas 
> actually EXACTLY THE REVERSE. It shows how subtle bugs can creep in!

So describe what are the requirements (constraints)?

Above example has exactly one syscall in the chain, so it is ok, but
generally it is not correct.

So instead there will be 
s = atom_create_and_add(__NR_stat, path, stat, NULL, NULL, NULL, NULL);
atom then can be freed in the glibc_async_wait() wrapper just before
returning data to userspace.

There are millions of possible ways to do that, but which one exactly
should be used from your point of view? Describe _your_ vision of that path.

Currently the generic example is the following:
allocate mem
setup complex structure
submit syscall
wait syscall
free mem

The first two can be hidden in glibc setup/startup code, and the last one
in the waiting or cleanup entry.

Or it can be this one (just an idea):

glibc_async_stat(path, &stat);

int glibc_async_stat(char *path, struct stat *stat)
{
	struct pthread *p;

	asm ("movl %%gs:0, %0", "=r"(unsigned long)(p));

	atom = allocate_new_atom_and_setup_initial_values();
	setup_atom(atom, __NR_stat, path, stat, ...);
	add_atom_into_private_tree(p, atom);
	return async_submit(atom);
}

glibc_async_wait()
{
	struct pthread *p;

	asm ("movl %%gs:0, %0", "=r"(unsigned long)(p));
	
	cookie = sys_async_wait();
	atom = search_for_cookie_and_remove(p);
	free_atom(atom);
}

Although that cruft might need to be extended...

So, describe how exactly _you_ think it should be implemented, with its
pros and cons, so that the system could be adopted without trying to
mind-read what is simple and good or complex and really bad.

> 		Linus

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-15 18:46             ` bert hubert
@ 2007-02-15 19:10               ` Evgeniy Polyakov
  2007-02-15 19:16               ` Zach Brown
  2007-02-15 19:26               ` Eric Dumazet
  2 siblings, 0 replies; 320+ messages in thread
From: Evgeniy Polyakov @ 2007-02-15 19:10 UTC (permalink / raw)
  To: bert hubert
  Cc: Linus Torvalds, Ingo Molnar, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

On Thu, Feb 15, 2007 at 07:46:56PM +0100, bert hubert (bert.hubert@netherlabs.nl) wrote:
> 1) batch, and wait for, with proper error reporting:
> 	socket();
> 	[ setsockopt(); ]
> 	bind();
> 	connect();
> 	gettimeofday();  // doesn't *always* happen
> 	send();
> 	recv();
> 	gettimeofday(); // doesn't *always* happen
> 
> 	I go through this sequence for each outgoing powerdns UDP query
> 	because I need a new random source port for each query, and I
> 	connect because I care about errors. Linux does not give me random
> 	source ports for UDP sockets.

What about a setsockopt or just random port selection patch? :)

> 	When async, I can probably just drop the setsockopt (for
> 	nonblocking). I already batch the gettimeofday to 'once per epoll
> 	return', but quite often this is once per packet.
> 
> 2) 	On the client facing side (port 53), I'd very much hope for a way to
> 	do 'recvv' on datagram sockets, so I can retrieve a whole bunch of
> 	UDP datagrams with only one kernel transition.
> 
> 	This would mean that I batch up either 10 calls to recv(), or one
> 	'atom' of 10 recv's.
> 
> Both 1 and 2 are currently limiting factors when I enter the 100kqps domain
> of name serving. This doesn't mean the rest of my code is as tight as it
> could be, but I spend a significant portion of time in the kernel even at
> moderate (10kqps effective) loads, even though I already use epoll. A busy
> PowerDNS recursor typically spends 25% to 50% of its time on 'sy' load.
> 
> This might be due to my use of get/set/swap/makecontext though.

It is only about one syscall in get and set/swap context, btw, so it
should not be the main factor, should it?

As an advertisement note: if you have a lot of network events per epoll 
read, try kevent - its socket notifications do not require an
additional traversal of the list of ready events as in poll usage.

> 	Bert
> 
> -- 
> http://www.PowerDNS.com      Open source, database driven DNS Software 
> http://netherlabs.nl              Open and Closed source services

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-15 18:46             ` bert hubert
  2007-02-15 19:10               ` Evgeniy Polyakov
@ 2007-02-15 19:16               ` Zach Brown
  2007-02-15 19:26               ` Eric Dumazet
  2 siblings, 0 replies; 320+ messages in thread
From: Zach Brown @ 2007-02-15 19:16 UTC (permalink / raw)
  To: bert hubert
  Cc: Linus Torvalds, Evgeniy Polyakov, Ingo Molnar,
	Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, David S. Miller,
	Benjamin LaHaise, Suparna Bhattacharya, Davide Libenzi,
	Thomas Gleixner

> 2) 	On the client facing side (port 53), I'd very much hope for a way to
> 	do 'recvv' on datagram sockets, so I can retrieve a whole bunch of
> 	UDP datagrams with only one kernel transition.

I want to highlight this point that Bert is making.

Whenever we talk about AIO and kernel threads some folks are rightly  
concerned that we're talking about a thread *per IO* and fear that  
memory consumption will be fatal.

Take the case of userspace which implements what we'd think of as  
page cache writeback.  (*coughs, points at email address*).  It wants  
to issue thousands of IOs to disjoint regions of a file.  "Thousands  
of kernel threads, oh crap!"

But it only issues each IO with a separate syscall (or io_submit()  
op) because it doesn't have an interface that lets it specify IOs  
that vector user memory addresses *and file position*.

If we had a seemingly obvious interface that let it kick off batched  
IOs to different parts of the file, the looming disaster of a thread  
per IO vanishes in that case.

struct off_vec {
	off_t pos;
	size_t len;
};

long sys_sgwrite(int fd, struct iovec *memvec, size_t mv_count,
	struct off_vec *ovec, size_t ov_count);
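
As a usage sketch only - sys_sgwrite() does not exist, and the  
sgwrite_emulated() fallback below (one pwrite() per region, i.e. exactly  
the per-IO syscall cost this proposal wants to avoid) is there just so  
the example is complete.  It reuses struct off_vec as defined above:

#include <sys/uio.h>
#include <unistd.h>

/* stand-in for the proposed syscall: write each memory chunk to its
 * own file offset */
static long sgwrite_emulated(int fd, struct iovec *memvec, size_t mv_count,
			     struct off_vec *ovec, size_t ov_count)
{
	long total = 0;
	size_t i;

	for (i = 0; i < mv_count && i < ov_count; i++) {
		ssize_t n = pwrite(fd, memvec[i].iov_base,
				   memvec[i].iov_len, ovec[i].pos);
		if (n < 0)
			return -1;
		total += n;
	}
	return total;
}

/* caller: batch up disjoint regions and hand them over in one call */
static long write_disjoint(int fd, char *buf, size_t chunk)
{
	struct iovec   memvec[3];
	struct off_vec ovec[3];
	int i;

	for (i = 0; i < 3; i++) {
		memvec[i].iov_base = buf + i * chunk;	/* user memory */
		memvec[i].iov_len  = chunk;
		ovec[i].pos = (off_t)i * 1024 * 1024;	/* disjoint offsets */
		ovec[i].len = chunk;
	}
	return sgwrite_emulated(fd, memvec, 3, ovec, 3);
}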

It doesn't take long to imagine other uses for this that are less  
exotic.

Take e2fsck and its iterating through indirect blocks or directory  
data blocks.  It has a list of disjoint file regions (blocks) it  
wants to read, but it does them serially to keep the code from  
getting even more confusing.  blktrace a clean e2fsck -f some time..  
it's leaving *HALF* of the disk read bandwidth on the table by  
performing serial block-sized reads.  If it could specify batches of  
them the code would still be simple but it could tell the kernel and  
IO scheduler *exactly* what it wants, without having to mess around  
with sys_readahead() or AIO or any of that junk :).

Anyway, that's just something that's been on my mind.  If there are  
obvious clean opportunities to get more done with single syscalls, it  
might not be such a bad thing.

- z

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-15 18:46             ` bert hubert
  2007-02-15 19:10               ` Evgeniy Polyakov
  2007-02-15 19:16               ` Zach Brown
@ 2007-02-15 19:26               ` Eric Dumazet
  2 siblings, 0 replies; 320+ messages in thread
From: Eric Dumazet @ 2007-02-15 19:26 UTC (permalink / raw)
  To: bert hubert
  Cc: Linus Torvalds, Evgeniy Polyakov, Ingo Molnar,
	Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Ulrich Drepper, Zach Brown,
	David S. Miller, Benjamin LaHaise, Suparna Bhattacharya,
	Davide Libenzi, Thomas Gleixner

On Thursday 15 February 2007 19:46, bert hubert wrote:

> Both 1 and 2 are currently limiting factors when I enter the 100kqps domain
> of name serving. This doesn't mean the rest of my code is as tight as it
> could be, but I spend a significant portion of time in the kernel even at
> moderate (10kqps effective) loads, even though I already use epoll. A busy
> PowerDNS recursor typically spends 25% to 50% of its time on 'sy' load.

Well, I guess in your workload most of the system overhead is because of socket 
creation/destruction, UDP/IP stack work, the NIC driver, interrupts... I really 
doubt async_io could help you... Do you have some oprofile results to share 
with us?

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-15 19:04                 ` Evgeniy Polyakov
@ 2007-02-15 19:28                   ` Linus Torvalds
  2007-02-15 20:07                     ` Linus Torvalds
  2007-02-16  8:57                     ` Evgeniy Polyakov
  0 siblings, 2 replies; 320+ messages in thread
From: Linus Torvalds @ 2007-02-15 19:28 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Linux Kernel Mailing List, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner



On Thu, 15 Feb 2007, Evgeniy Polyakov wrote:
> > 
> > See? The example you tried to use to show how "simple" the interface is was 
> > actually EXACTLY THE REVERSE. It shows how subtle bugs can creep in!
> 
> So what exactly are the requirements (constraints)?
> 
> The above example has exactly one syscall in the chain, so it is ok, but
> in general it is not correct.

Well, it *could* be correct. It depends on the semantics of the atom 
fetching. If we make the semantics be that the first atom is fetched 
entirely synchronously, then we could make the rule be that single-syscall 
async things can do their job with a temporary allocation.
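
To make the lifetime issue concrete (same hypothetical names as in the 
example quoted above - this is illustration, not a real API):

long glibc_async_stat(const char *path, struct stat *buf)
{
	struct syslet s;	/* automatic storage, gone on return */

	syslet_setup(&s, __NR_stat, path, buf, NULL, NULL, NULL, NULL);

	/*
	 * If the kernel copies the (single) atom before this returns,
	 * releasing 's' is fine.  If atoms are fetched lazily while the
	 * syslet runs, the kernel would read a stale stack frame here.
	 */
	return async_submit(&s);
}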

So that wasn't my point. My point was that a complicated interface that 
uses indirection actually has subtle issues. You *thought* you were doing 
something simple, and you didn't even realize the subtle assumptions you 
made.

THAT was the point. Interfaces are really really subtle and important. 
It's absolutely not a case of "we can just write wrappers to fix up any 
library issues".

> So instead there will be 
> s = atom_create_and_add(__NR_stat, path, stat, NULL, NULL, NULL, NULL);
> atom then can be freed in the glibc_async_wait() wrapper just before
> returning data to userspace.

So now you add some kind of allocation/deallocation thing. In user space or 
in the kernel?

> There are millions of possible ways to do that, but what exactly one
> should be used from your point of view? Describe _your_ vision of that path.

My vision is that we should be able to do the simple things *easily* and 
without any extra overhead.

And doing wrappers in user space is almost entirely unacceptable, because 
a lot of the wrapping needs to be done at release time (for example: 
de-allocating memory), and that means that you no longer can do simple 
system calls that don't even need release notification AT ALL.

> Currently generic example is following:
> allocate mem
> setup complex structure
> submit syscall
> wait syscall
> free mem

And that "allocate mem" and "free mem" is a problem. It's not just a 
performance problem, it is a _complexity_ problem. It means that people 
have to track things that they are NOT AT ALL INTERESTED IN!

> So, describe how exactly _you_ think it should be implemented, with its
> pros and cons, so that the system could be adopted without trying to
> mind-read what is simple and good or complex and really bad.

So I think that a good implementation just does everything up-front, and 
doesn't _need_ a user buffer that is live over longer periods, except for 
the actual results. Exactly because the whole alloc/teardown is nasty.

And I think a good implementation doesn't need wrapping in user space to 
be useful - at *least* not wrapping at completion time, which is the 
really difficult one (since, by definition, in an async world completion 
is separated from the initial submit() event, and with kernel-only threads 
you actually want to *avoid* having to do user code after the operation 
completed).

I suspect Ingo's thing can do that. But I also suspect (nay, _know_, from 
this discussion), that you didn't even think of the problems.

		Linus

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-15 19:28                   ` Linus Torvalds
@ 2007-02-15 20:07                     ` Linus Torvalds
  2007-02-15 21:17                       ` Davide Libenzi
                                         ` (2 more replies)
  2007-02-16  8:57                     ` Evgeniy Polyakov
  1 sibling, 3 replies; 320+ messages in thread
From: Linus Torvalds @ 2007-02-15 20:07 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Linux Kernel Mailing List, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner



On Thu, 15 Feb 2007, Linus Torvalds wrote:
> 
> So I think that a good implementation just does everything up-front, and 
> doesn't _need_ a user buffer that is live over longer periods, except for 
> the actual results. Exactly because the whole alloc/teardown is nasty.

Btw, this doesn't necessarily mean "not supporting multiple atoms at all".

I think the batching of async things is potentially a great idea. I think 
it's quite workable for "open+fstat" kind of things, and I agree that it 
can solve other things too (the "socket+bind+connect+sendmsg+rcv" kind of 
complex setup things).

But I suspect that if we just said:
 - we limit these atom sequences to just linear sequences of max "n" ops
 - we read them all in in a single go at startup

we actually avoid several nasty issues. Not just the memory allocation 
issue in user space (now it's perfectly ok to build up a sequence of ops 
in temporary memory and throw it away once it's been submitted), but also 
issues like the 32-bit vs 64-bit compatibility stuff (the compat handlers 
would just convert it when they do the initial copying, and then the 
actual run-time wouldn't care about user-level pointers having different 
sizes etc).

Would it make the interface less cool? Yeah. Would it limit it to just a 
few linked system calls (to avoid memory allocation issues in the kernel)? 
Yes again. But it would simplify a lot of the interface issues.
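
Purely as illustration (none of these names are real, it's just the 
"copy everything in up-front" idea in code):

struct flat_atom {
	long	nr;		/* syscall number */
	long	args[6];	/* plain long arguments */
};

/* hypothetical: copies all 'n' atoms into kernel memory before returning,
 * then runs them as one linear async sequence (n limited to a few ops) */
long sys_async_submit_seq(const struct flat_atom *atoms, unsigned int n);

void example(const char *path, struct stat *st)
{
	struct flat_atom seq[1] = {
		{ .nr = __NR_stat, .args = { (long)path, (long)st } },
	};

	/* the sequence is consumed before this returns, so the on-stack
	 * 'seq' needs no long-term allocation and no later free */
	sys_async_submit_seq(seq, 1);
}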

It would _also_ allow the "sys_aio_read()" function to build up its 
*own* set of atoms in kernel space to actually do the read, and there 
would be no impact of the actual run-time wanting to read stuff from user 
space. Again - it's actually the same issue as with the compat system 
call: by making the interfaces do things up-front rather than dynamically, 
it becomes more static, but also easier to do interface translations. You 
can translate into any arbitrary internal format _once_, and be done with 
it.

I dunno. 

		Linus

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-15 20:07                     ` Linus Torvalds
@ 2007-02-15 21:17                       ` Davide Libenzi
  2007-02-15 22:34                       ` Michael K. Edwards
  2007-02-16 12:28                       ` Ingo Molnar
  2 siblings, 0 replies; 320+ messages in thread
From: Davide Libenzi @ 2007-02-15 21:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Evgeniy Polyakov, Ingo Molnar, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Thomas Gleixner

On Thu, 15 Feb 2007, Linus Torvalds wrote:

> 
> 
> On Thu, 15 Feb 2007, Linus Torvalds wrote:
> > 
> > So I think that a good implementation just does everything up-front, and 
> > doesn't _need_ a user buffer that is live over longer periods, except for 
> > the actual results. Exactly because the whole alloc/teardown is nasty.
> 
> Btw, this doesn't necessarily mean "not supporting multiple atoms at all".
> 
> I think the batching of async things is potentially a great idea. I think 
> it's quite workable for "open+fstat" kind of things, and I agree that it 
> can solve other things too (the "socket+bind+connect+sendmsg+rcv" kind of 
> complex setup things).

If you *really* want to allow chains (note that the above could 
prolly be hosted on a real thread, once chains become that long), then 
try to build that chain with the current API, and compare it with:

long my_clet(ctx *c) {
	int fd, error = -1;

	if ((fd = socket(...)) == -1 ||
	    bind(fd, &c->laddr, sizeof(c->laddr)) ||
	    connect(fd, &c->saddr, sizeof(c->saddr)) ||
	    sendmsg(fd, ...) == -1 ||
	    recv(fd, ...) <= 0)
		goto exit;
	error = 0;
exit:
	close(fd);
	return error;
}

Points:

- Keep the submission API to submit one or an array of parallel async 
  syscalls/clets

- Keep arguments of the syscall as longs (no need for extra pointer 
  indirection compat code, and special copy_atoms functions)

- No need for the "next" atom pointer chaining (nice for compat too)

- No need to create special condition/jump interpreters inside the kernel 
  (nice for compat and emulators). C->machine-code does that for us

- Easier to code. Try to build a chain like that with the current API and 
  you will see what I am saying

- Did I say faster? Machine code is faster than pseudo-VM interpretation of 
  jumps/conditions done inside the kernel




- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-15 20:07                     ` Linus Torvalds
  2007-02-15 21:17                       ` Davide Libenzi
@ 2007-02-15 22:34                       ` Michael K. Edwards
  2007-02-16 12:28                       ` Ingo Molnar
  2 siblings, 0 replies; 320+ messages in thread
From: Michael K. Edwards @ 2007-02-15 22:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Evgeniy Polyakov, Ingo Molnar, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

On 2/15/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Would it make the interface less cool? Yeah. Would it limit it to just a
> few linked system calls (to avoid memory allocation issues in the kernel)?
> Yes again. But it would simplify a lot of the interface issues.

Only in toy applications.  Real userspace code that lives between
networks+disks and impatient humans is 80% exception handling,
logging, and diagnostics.  If you can't do any of that between stages
of an async syscall chain, you're fscked when it comes to performance
analysis (the "which 10% of the traffic do we not abort under
pressure" kind, not the "cut overhead by 50%" kind).  Not that real
userspace code could get any simpler by using this facility anyway,
since you can't jump the queue, cancel in bulk, or add cleanup hooks.

Efficiently interleaved execution of high-latency I/O chains would be
nice.  Low overhead for cache hits would be nicer.  But least for the
workloads that interest me, neither is anywhere near as important as
the ability to say, "This 10% (or 90%) of my requests are going to
take forever?  Nevermind -- but don't cancel the 1% I can't do
without."

This is not a scheduling problem, it is a caching problem.  Caches are
data structures, not thread pools.  Assume that you need to design for
dynamic reprioritization, speculative fetch, and opportunistic flush,
even if you don't implement them at first.  Above all, stay out of the
way when a synchronous request misses cache -- and when application
code decides that a bunch of its outstanding requests are no longer
interesting, take the hint!

Oh, and while you're at it: I'd like to program AIO facilities using a
C compiler with an explicitly parallel construct -- something along
the lines of:

try (my_aio_batch, initial_priority, ...) {
} catch {
} finally {
}

Naturally the compiler will know how to convert synchronous syscalls
to their asynchronous equivalent, will use an analogue of IEEE NaNs to
minimize the hits to the exception path, and won't let you call
functions that aren't annotated as safe in IO completion context.  I
would also like five acres in town and a pony.

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-15 19:28                   ` Linus Torvalds
  2007-02-15 20:07                     ` Linus Torvalds
@ 2007-02-16  8:57                     ` Evgeniy Polyakov
  2007-02-16 15:54                       ` Linus Torvalds
  1 sibling, 1 reply; 320+ messages in thread
From: Evgeniy Polyakov @ 2007-02-16  8:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Linux Kernel Mailing List, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

On Thu, Feb 15, 2007 at 11:28:57AM -0800, Linus Torvalds (torvalds@linux-foundation.org) wrote:
> THAT was the point. Interfaces are really really subtle and important. 
> It's absolutely not a case of "we can just write wrappers to fix up any 
> library issues".

Interfaces can be created and destroyed - they do not affect overall
system design in any way (well, if they do, something is broken).
So let's solve problems in the order of their appearance - if interfaces
are more important to you than the overall design, that is a problem I think.

> > So instead there will be 
> > s = atom_create_and_add(__NR_stat, path, stat, NULL, NULL, NULL, NULL);
> > atom then can be freed in the glibc_async_wait() wrapper just before
> > returning data to userspace.
> 
> So now you add some kind of allocation/deallocation thing. In user space or 
> in the kernel?

In userspace.
It was not added by me - it is just a wrapper.

> > There are millions of possible ways to do that, but what exactly one
> > should be used from your point of view? Describe _your_ vision of that path.
> 
> My vision is that we should be able to do the simple things *easily* and 
> without any extra overhead.
> 
> And doing wrappers in user space is almost entirely unacceptable, because 
> a lot of the wrapping needs to be done at release time (for example: 
> de-allocating memory), and that means that you no longer can do simple 
> system calls that don't even need release notification AT ALL.

Syslets do work that way - they require some user memory, likely
long-standing (100% sure for a multi-atom setup, though maybe it can be
optimized) - and if you do not want to allocate it explicitly, it is
possible to have a wrapper.

> > Currently generic example is following:
> > allocate mem
> > setup complex structure
> > submit syscall
> > wait syscall
> > free mem
> 
> And that "allocate mem" and "free mem" is a problem. It's not just a 
> performance problem, it is a _complexity_ problem. It means that people 
> have to track things that they are NOT AT ALL INTERESTED IN!

I proposed a way to hide the allocation - it is simple, but you've cut it.
I can create another one without the special per-thread thing:
handle = async_init();
async_stat(handle, path, stat);
async_cleanup(); // not needed, since it will be freed on exit automatically

Another one is to preallocate a set of atoms in an __attribute__((constructor))
function.
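
A sketch of the first variant (all names hypothetical; setup_atom() and
async_submit() as in my earlier sketch) - the point is only to show where
the allocation hides, the caller never sees malloc/free:

#include <stdlib.h>

#define POOL_SIZE 128

struct async_handle {
	struct atom	*pool[POOL_SIZE];	/* preallocated atoms */
	int		 nr_free;
};

struct async_handle *async_init(void)
{
	struct async_handle *h = calloc(1, sizeof(*h));

	for (h->nr_free = 0; h->nr_free < POOL_SIZE; h->nr_free++)
		h->pool[h->nr_free] = calloc(1, sizeof(struct atom));
	return h;
}

long async_stat(struct async_handle *h, const char *path, struct stat *st)
{
	struct atom *a = h->pool[--h->nr_free];	/* take one from the pool */

	setup_atom(a, __NR_stat, path, st);
	return async_submit(a);
}

/* on completion, the wrapper around sys_async_wait() puts the atom back:
 * h->pool[h->nr_free++] = completed_atom; */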

There are really a lot of possible ways - _I_ can use the first one with
explicit operations, others likely can not - so I _ask_ how it should
look.

> > So, describe how exactly _you_ think it should be implemented, with its
> > pros and cons, so that the system could be adopted without trying to
> > mind-read what is simple and good or complex and really bad.
> 
> So I think that a good implementation just does everything up-front, and 
> doesn't _need_ a user buffer that is live over longer periods, except for 
> the actual results. Exactly because the whole alloc/teardown is nasty.
> 
> And I think a good implementation doesn't need wrapping in user space to 
> be useful - at *least* not wrapping at completion time, which is the 
> really difficult one (since, by definition, in an async world completion 
> is separated from the initial submit() event, and with kernel-only threads 
> you actually want to *avoid* having to do user code after the operation 
> completed).

So where is the problem?
I already proposed three ways to do this - the user will not even know
that anything happened. You did not comment on any of them; instead you
are handwaving about how, in theory, something should look.
What exactly do _you_ expect from the interface?

> I suspect Ingo's thing can do that. But I also suspect (nay, _know_, from 
> this discussion), that you didn't even think of the problems.

That is another problem - you think you know something, but you fail to
prove that. 

I can work with explicit structure allocation/deallocation/setup - 
you do not want that - so I ask for your opinion, and instead of 
getting an answer I receive a theoretical word-fall about how the perfect 
interface should look.

You only need to have one function call without ever thinking about
freeing? I proposed _two_ ways to do that.
You can live with explicit init/cleanup (optional) code? There is another
one.

So please describe your vision of the interface in detail, so that it
can be thought about and/or implemented.

> 		Linus

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-15 20:07                     ` Linus Torvalds
  2007-02-15 21:17                       ` Davide Libenzi
  2007-02-15 22:34                       ` Michael K. Edwards
@ 2007-02-16 12:28                       ` Ingo Molnar
  2007-02-16 13:28                         ` Evgeniy Polyakov
  2 siblings, 1 reply; 320+ messages in thread
From: Ingo Molnar @ 2007-02-16 12:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Evgeniy Polyakov, Linux Kernel Mailing List, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, 15 Feb 2007, Linus Torvalds wrote:
> > 
> > So I think that a good implementation just does everything up-front, 
> > and doesn't _need_ a user buffer that is live over longer periods, 
> > except for the actual results. Exactly because the whole 
> > alloc/teardown is nasty.
> 
> Btw, this doesn't necessarily mean "not supporting multiple atoms at 
> all".
> 
> I think the batching of async things is potentially a great idea. I 
> think it's quite workable for "open+fstat" kind of things, and I agree 
> that it can solve other things too (the 
> "socket+bind+connect+sendmsg+rcv" kind of complex setup things).
> 
> But I suspect that if we just said:
>  - we limit these atom sequences to just linear sequences of max "n" ops
>  - we read them all in in a single go at startup
>
> we actually avoid several nasty issues. Not just the memory allocation 
> issue in user space (now it's perfectly ok to build up a sequence of 
> ops in temporary memory and throw it away once it's been submitted), 
> but also issues like the 32-bit vs 64-bit compatibility stuff (the 
> compat handlers would just convert it when they do the initial 
> copying, and then the actual run-time wouldn't care about user-level 
> pointers having different sizes etc).
> 
> Would it make the interface less cool? Yeah. Would it limit it to just 
> a few linked system calls (to avoid memory allocation issues in the 
> kernel)? Yes again. But it would simplify a lot of the interface 
> issues.
> 
> It would _also_ allow the "sys_aio_read()" function to build up its 
> *own* set of atoms in kernel space to actually do the read, and there 
> would be no impact of the actual run-time wanting to read stuff from 
> user space. Again - it's actually the same issue as with the compat 
> system call: by making the interfaces do things up-front rather than 
> dynamically, it becomes more static, but also easier to do interface 
> translations. You can translate into any arbitrary internal format 
> _once_, and be done with it.
> 
> I dunno.

[ hm. I again wrote a pretty long email for you to read. Darn! ]

regarding the API - i share most of your concerns, and it's all a 
function of how widely we want to push this into user-space.

My initial thought was for syslets to be used by glibc as small, secure 
kernel-side 'syscall plugins' mainly - so that it can do things like 
'POSIX AIO signal notifications' (which are madness in terms of 
performance, but which applications rely on) /without/ having to burden 
the kernel-side AIO with such requirements: glibc just adds an enclosing 
sys_kill() to the syslet and it will do the proper signal notification, 
asynchronously. (and of course syslets can be used for the Tux type of 
performance sillinesses as well ;-)
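
(As a rough sketch only, reusing the init_atom() convention from the 
example code posted in this thread - the req->pid/req->signo/req->notify_sig 
fields and the __NR_sys_kill spelling are made up for illustration - the 
last atom of such a syslet would simply be:

	/*
	 * Final atom, linked from the preceding atom's 'next' pointer:
	 * deliver the POSIX AIO completion signal from kernel space,
	 * i.e. kill(req->pid, req->signo), with no user-space
	 * completion hook needed at all.
	 */
	init_atom(req, &req->notify_sig, __NR_sys_kill,
		  &req->pid, &req->signo, NULL, NULL, NULL, NULL,
		  NULL, 0, NULL);

and glibc would attach it only for the APIs that need signal delivery.)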

So a sane user API (all used at the glibc level, not at application 
level) would use simple syslets, while more broken ones would have to 
use longer ones - but nobody would have the burden of having to 
synchronize back to the issuer context. Natural selection will gravitate 
application use towards the APIs with the shorter syslets. (at least so 
i hope)

In this model syslets arent really user-programmable entities but rather 
small plugins available to glibc to build up more complex, more 
innovative (or just more broken) APIs than what the kernel wants to 
provide - without putting any true new ABI dependency on the kernel, 
other than the already existing syscall ABIs.

But if we'd like glibc to provide this to applications in some sort of 
standardized /programmable/ manner, with a wide range of atom selections 
(not directly coded syscall numbers, but rather as function pointers to 
actual glibc functions, which glibc could translate to syscall numbers, 
argument encodings, etc.), then i agree that doing the compat things and 
making it 32/64-bit agnostic (and much more) is pretty much a must. If 
90% of this current job is finished then sorting those out will at least 
be another 90% of the work ;-)

and actually this latter model scares me, and i think that model scared 
the hell out of you as well.

But i really have no strong opinion about which one we want yet, without 
having walked the path. Somewhere inside me i'd of course like syslets 
to become a widely available interface - but my fear is that it might 
just not be 'human' enough to make sense - and we'd just not want to tie 
us down with an ABI that's not used. I dont want this to become another 
sys_sendfile - much talked about and _almost_ useful but in practice 
seldom used due to its programmability and utility limitations.

OTOH, the syslet concept right now already looks very ubiquitous, and 
the main problem with AIO use in applications wasnt just even its broken 
API or its broken performance, but the fundamental lack of all Linux IO 
disciplines supporting AIO, and the lack of significantly parallel 
hardware. We have kaio that is centered around block drivers - then we 
have epoll that works best with networking, and inotify that deals with 
some (but not all) VFS events - but none of them supports every IO and event 
discipline well, at once. My feeling is that /this/ is the main 
fundamental problem with AIO in general, not just its programmability 
limitations.

Right now i'm concentrating on trying to build up something on the 
scheduling side that shows the issues in practice, shows the limitations 
and shows the possibilities. For example the easy ability to turn a 
cachemiss thread back into a user thread (and then back into a cachemiss 
thread) was a true surprise to me which increased utility quite a bit. I 
couldnt have designed it into the concept because it just didnt occur to 
me in the early stages. The notification ring related limitations you 
noticed is another important thing to fix - and these issues go to the 
core scheduling model of the concept and affect everything.

Thirdly, while Tux does not matter much to us, at least to me it is 
pretty clear what it takes to get performance up to the levels of Tux - 
and i dont see any big fundamental compromise possible on that front. 
Syslets are partly Tux repackaged into something generic - they are 
probably a bit slower than straight kernel code Tux, but not by much and 
it's also not behaving fundamentally differently. And if we dont offer 
at least something close to those possibilities then people will 
re-start trying to add those special-purpose state machine APIs again, 
and the whole "we need /true/ async IO" game starts again.

So if we accept "make parallelism easier to program" and "get somewhat 
close to Tux's performance and scalability" as a premise (which you 
might not agree with in that form), then i dont think there's much 
choice we have: either we use kernel threads, synchronous system calls 
and the scheduler intelligently (and the scheduling/threading bits of 
syslets are pretty much the most intelligent kernel thread based 
approach i can imagine at the moment =B-) or we use a special-purpose 
KAIO state machine subsystem, avoiding most of the existing synchronous 
infrastructure, painfully coding it into every IO discipline - and this 
will certainly haunt us until the end of times.

So that's why i'm not /that/ much worried about the final form of the 
API at the moment - even though i agree that it is /the/ most important 
decision factor in the end: i see various unavoidable externalities 
forcing us very much, and in the end we either like the result and make 
it available to programmers, or we dont, and limit it to system-glue 
glibc use - or we throw it away altogether. I'm curious about the end 
result even if it gets limited or gets thrown away (joining 4:4 on the 
way to the bit bucket ;) and while i'm cautiously optimistic that 
something useful can come out of this, i cannot know it for sure at the 
moment.

	Ingo

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-16 12:28                       ` Ingo Molnar
@ 2007-02-16 13:28                         ` Evgeniy Polyakov
  0 siblings, 0 replies; 320+ messages in thread
From: Evgeniy Polyakov @ 2007-02-16 13:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Linux Kernel Mailing List, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

On Fri, Feb 16, 2007 at 01:28:06PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> OTOH, the syslet concept right now already looks very ubiquitous, and 
> the main problem with AIO use in applications wasnt just even its broken 
> API or its broken performance, but the fundamental lack of all Linux IO 
> disciplines supporting AIO, and the lack of significantly parallel 
> hardware. We have kaio that is centered around block drivers - then we 
> have epoll that works best with networking, and inotify that deals with 
> some (but not all) VFS events - but none of them supports every IO and event 
> discipline well, at once. My feeling is that /this/ is the main 
> fundamental problem with AIO in general, not just its programmability 
> limitations.

That is quite disappointing to hear when the weekly released kevent could
solve that problem already more than a year ago - it was designed specifically
to support every possible notification type, and it does support file
descriptor ones, VFS (dropped in current releases to reduce size) and
tons of others including POSIX timers, signals, its own high-performance AIO
(which was created as a somewhat complex state machine over the internals of
the page population code) and essentially everything one can ever imagine, 
with quite a bit of code needed for a new type.

I was asked to add waiting on a futex through the kevent queue - that is
quite a simple task, but given the complete lack of feedback and the ignoring
of the project even by people who asked about its features, it looks like
there is no need for that at all.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-16  8:57                     ` Evgeniy Polyakov
@ 2007-02-16 15:54                       ` Linus Torvalds
  2007-02-16 16:05                         ` Evgeniy Polyakov
  0 siblings, 1 reply; 320+ messages in thread
From: Linus Torvalds @ 2007-02-16 15:54 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Linux Kernel Mailing List, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner



On Fri, 16 Feb 2007, Evgeniy Polyakov wrote:
> 
> > Interfaces can be created and destroyed - they do not affect overall
> > system design in any way (well, if they do, something is broken).

I'm sorry, but you've obviously never maintained any piece of software 
that actually has users.

As long as you think that interfaces can change, this discussion is 
pointless. 

So go away, ponder things.

		Linus

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-16 15:54                       ` Linus Torvalds
@ 2007-02-16 16:05                         ` Evgeniy Polyakov
  2007-02-16 16:53                           ` Ray Lee
  0 siblings, 1 reply; 320+ messages in thread
From: Evgeniy Polyakov @ 2007-02-16 16:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Linux Kernel Mailing List, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

On Fri, Feb 16, 2007 at 07:54:22AM -0800, Linus Torvalds (torvalds@linux-foundation.org) wrote:
> > Interfaces can be created and destroyed - they do not affect overall
> > system design in anyway (well, if they do, something is broken).
> 
> I'm sorry, but you've obviously never maintained any piece of software 
> that actually has users.

Strong. But speaking for others usually tends to show one's own problems.

> As long as you think that interfaces can change, this discussion is 
> pointless. 

That is too cool a phrase to be heard - if you do me the favour of
rereading what was written you will (hopefully) see that there were no
words about interfaces being changed after being put into the wild - the talk
was only about the time when a system is designed and implemented, and there
is time for discussion about its rough edges - if its design is good, then
interface can be changed in a moment without any problem - that is what
we see with syslets right now - they are designed and implemented (the
former was done several years ago), and it is time to shape the edges -
like changing the userspace API - it is easy, but you do not (want/like to) 
see that.

> So go away, ponder things.

But my above words are too lame for a self-hearing dweller of Olympus.
Definitely.

> 		Linus

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-16 16:05                         ` Evgeniy Polyakov
@ 2007-02-16 16:53                           ` Ray Lee
  2007-02-16 16:58                             ` Evgeniy Polyakov
  0 siblings, 1 reply; 320+ messages in thread
From: Ray Lee @ 2007-02-16 16:53 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Linus Torvalds, Ingo Molnar, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

On 2/16/07, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> if its design is good, then
> interface can be changed in a moment without any problem

This isn't always the case. Sometimes the interface puts requirements
(contract-like) upon the implementation. Case in point in the kernel,
dnotify versus inotify. dnotify is a steaming pile of worthlessness,
because its userspace interface is so bad (meaning inefficient) as to
be nearly unusable.

inotify has a different interface, one that supplies details about
events rather than mere notice that an event occurred, and therefore
has different requirements in implementation. dnotify probably was a
good design, but for a worthless interface.

The interface isn't always important, but it's certainly something
that has to be understood before putting the finishing touches on the
behind-the-scenes implementation.

Ray

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-16 16:53                           ` Ray Lee
@ 2007-02-16 16:58                             ` Evgeniy Polyakov
  2007-02-16 20:20                               ` Cyrill V. Gorcunov
  2007-02-17  4:54                               ` Ray Lee
  0 siblings, 2 replies; 320+ messages in thread
From: Evgeniy Polyakov @ 2007-02-16 16:58 UTC (permalink / raw)
  To: ray-gmail
  Cc: Linus Torvalds, Ingo Molnar, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

On Fri, Feb 16, 2007 at 08:53:30AM -0800, Ray Lee (madrabbit@gmail.com) wrote:
> On 2/16/07, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> >if its design is good, then
> >interface can be changed in a moment without any problem
> 
> This isn't always the case. Sometimes the interface puts requirements
> (contract-like) upon the implementation. Case in point in the kernel,
> dnotify versus inotify. dnotify is a steaming pile of worthlessness,
> because it's userspace interface is so bad (meaning inefficient) as to
> be nearly unusable.
> 
> inotify has a different interface, one that supplies details about
> events rather that mere notice that an event occurred, and therefore
> has different requirements in implementation. dnotify probably was a
> good design, but for a worthless interface.
> 
> The interface isn't always important, but it's certainly something
> that has to be understood before putting the finishing touches on the
> behind-the-scenes implementation.

Absolutely.
And if the overall system design is good, there is no problem changing
(well, for those who fail to read to the end and understand my English,
replace 'to change' with 'to create and commit') the interface to a state
where it will satisfy all (or the majority of) users.

Situations where the system is designed from the interface down end up
with one thread per IO and huge limitations on how the system can be
used at all.

> Ray

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-16 16:58                             ` Evgeniy Polyakov
@ 2007-02-16 20:20                               ` Cyrill V. Gorcunov
  2007-02-17 10:02                                 ` Evgeniy Polyakov
  2007-02-17  4:54                               ` Ray Lee
  1 sibling, 1 reply; 320+ messages in thread
From: Cyrill V. Gorcunov @ 2007-02-16 20:20 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: linux-kernel-list

On Fri, Feb 16, 2007 at 07:58:54PM +0300, Evgeniy Polyakov wrote:
| Absolutely.
| And if overall system design is good, there is no problem to change
| (well, for those who fail to read to the end and understand my english
| replace 'to change' with 'to create and commit') interface to the state
| where it will satisfy all (majority of) users.
| 
| Situations when system is designed from interface down to system ends up
| with one thread per IO and huge limitations on how system is going to be
| used at all.
| 
| -- 
| 	Evgeniy Polyakov

I'm sorry for meddling in the conversation, but I think Linus misunderstood
you. If I'm right, you propose to "create and commit" _new_ interfaces
only? I mean, _changing_ interfaces exported to user space is
very painful... for further support. Don't swear at me if I wrote
something stupid ;)

-- 

		Cyrill


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-16 16:58                             ` Evgeniy Polyakov
  2007-02-16 20:20                               ` Cyrill V. Gorcunov
@ 2007-02-17  4:54                               ` Ray Lee
  2007-02-17 10:15                                 ` Evgeniy Polyakov
  1 sibling, 1 reply; 320+ messages in thread
From: Ray Lee @ 2007-02-17  4:54 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Linus Torvalds, Ingo Molnar, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

Evgeniy Polyakov wrote:
> On Fri, Feb 16, 2007 at 08:53:30AM -0800, Ray Lee (madrabbit@gmail.com) wrote:
>> On 2/16/07, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>>> if its design is good, then
>>> interface can be changed in a moment without any problem
>> This isn't always the case. Sometimes the interface puts requirements
>> (contract-like) upon the implementation. Case in point in the kernel,
>> dnotify versus inotify. dnotify is a steaming pile of worthlessness,
>> because it's userspace interface is so bad (meaning inefficient) as to
>> be nearly unusable.
>>
>> inotify has a different interface, one that supplies details about
>> events rather that mere notice that an event occurred, and therefore
>> has different requirements in implementation. dnotify probably was a
>> good design, but for a worthless interface.
>>
>> The interface isn't always important, but it's certainly something
>> that has to be understood before putting the finishing touches on the
>> behind-the-scenes implementation.
> 
> Absolutely.
> And if overall system design is good,

dnotify was a good system design for a stupid (or misunderstood) problem.

> there is no problem to change
> (well, for those who fail to read to the end and understand my english
> replace 'to change' with 'to create and commit') interface to the state
> where it will satisfy all (majority of) users.

You might be right, but the point I (and others) are trying to make is
that there are some cases where you *really* need to understand the
users of the interface first. You might have everything else right
(userspace wants to know when filesystem changes occur, great), but if
you don't know what form those notifications have to take, you'll
end up doing a lot of wasted work on a worthless piece of code that no
one will ever use.

Sometimes the interface really is the most important thing. Just like a
contract between people.

(This is probably why, by the way, most people are staying silent on
your excellent kevent work. The kernel side is, in some ways, the easy
part. It's getting an interface that will handle all users [ users ==
producers and consumers of kevents ], that is the hard bit.)

Or, let me put it yet another way: How do you prove to the rest of us
that you, or Ingo, or whomever, are not building another dnotify? (Maybe
you're smart enough in this problem space that you know you're not --
that's actually the most likely possibility. But you still have to prove
it to the rest of us. Sucks, I know.)

> Situations when system is designed from interface down to system ends up
> with one thread per IO and huge limitations on how system is going to be
> used at all.

The other side is you start from the goal in mind and get Ingo's state
machines with loops and conditionals and marmalade in syslets which
appear a bit baroque and overkill for the majority of us userspace folk.

(No offense intended to Ingo, he's obviously quite a bit more conversant
on the needs of high speed interfaces than I am. However, I suspect I
have a bit more clarity on what us normal folk would actually use, and
kernel driven FSMs ain't it. Userspace often makes a lot of contextual
decisions that I would absolutely *hate* to write and debug as a state
machine that gets handed off to the kernel. I'll happily take a 10% hit
in efficiency that Moore's law will get me back in a few months, instead
of spending a bunch of time debugging difficult heisenbugs due to the
syslet FSM reading a userspace variable at a slightly different time
once in a blue moon. OTOH, I'm also not Oracle, so what do I know?)

The truth of this lies somewhere in the middle. It isn't kernel driven,
or userspace interface driven, but a tradeoff between the two.

So:

> Userspace_API_is_the_ever_possible_last_thing_to_ever_think_about.
> Period

Please listen to those of us who are saying that this might not be the
case. Maybe we're idiots, but then again maybe we're not, okay?
Sometimes the API really *DOES* change the underlying implementation.

Ray

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-16 20:20                               ` Cyrill V. Gorcunov
@ 2007-02-17 10:02                                 ` Evgeniy Polyakov
  2007-02-17 17:59                                   ` Cyrill V. Gorcunov
  0 siblings, 1 reply; 320+ messages in thread
From: Evgeniy Polyakov @ 2007-02-17 10:02 UTC (permalink / raw)
  To: Cyrill V. Gorcunov; +Cc: linux-kernel-list

On Fri, Feb 16, 2007 at 11:20:36PM +0300, Cyrill V. Gorcunov (gorcunov@gmail.com) wrote:
> On Fri, Feb 16, 2007 at 07:58:54PM +0300, Evgeniy Polyakov wrote:
> | Absolutely.
> | And if overall system design is good, there is no problem to change
> | (well, for those who fail to read to the end and understand my english
> | replace 'to change' with 'to create and commit') interface to the state
> | where it will satisfy all (majority of) users.
> | 
> | Situations when system is designed from interface down to system ends up
> | with one thread per IO and huge limitations on how system is going to be
> | used at all.
> | 
> | -- 
> | 	Evgeniy Polyakov
> 
> I'm sorry for meddling in conversation but I think Linus misunderstood
> you. If I'm right you propose to "create and commit" _new_ interfaces
> only? I mean _changing_ of interfaces exported to user space is
> very painfull... for further support. Don't swear at me if I wrote
> something stupid ;)

Yes, I only proposed to change what Ingo has right now - it is
usable, but it does suck; and since the overall syslet design is indeed good,
it does not suffer from possible interface changes - so I said that it
can be trivially changed, in the sense that until it is committed
anything can be done to extend it.

> -- 
> 
> 		Cyrill

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-17  4:54                               ` Ray Lee
@ 2007-02-17 10:15                                 ` Evgeniy Polyakov
  0 siblings, 0 replies; 320+ messages in thread
From: Evgeniy Polyakov @ 2007-02-17 10:15 UTC (permalink / raw)
  To: Ray Lee
  Cc: Linus Torvalds, Ingo Molnar, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

On Fri, Feb 16, 2007 at 08:54:11PM -0800, Ray Lee (ray-lk@madrabbit.org) wrote:
> (This is probably why, by the way, most people are staying silent on
> your excellent kevent work. The kernel side is, in some ways, the easy
> part. It's getting an interface that will handle all users [ users ==
> producers and consumers of kevents ], that is the hard bit.)

The kevent interface was completely changed 4 (!) times over the last year 
at kernel developers' request, without any damage to its kernel part.

> Or, let me put it yet another way: How do you prove to the rest of us
> that you, or Ingo, or whomever, are not building another dnotify? (Maybe
> you're smart enough in this problem space that you know you're not --
> that's actually the most likely possibility. But you still have to prove
> it to the rest of us. Sucks, I know.)

I only want to say that when a system is designed correctly there is no
problem changing the interface (yes, I again said 'to change' just because
I hope everyone understands that I'm talking about the time when the system is
not yet committed to the tree).

Btw, dnotify had problems in its design highlighted at inotify's start -
mainly that watchers were not attached to the inode.

Right now is the time to ask users what interface they expect from
AIO - so I asked Linus and proposed three different ones, two of them
designed in a way that the user would not even know that any
allocation/freeing was done - and as a result I got a 'you suck' response,
exactly the same as was returned on the first syslet release - just,
_only_ fscking _just_, because it had an ugly interface.

> > Situations when system is designed from interface down to system ends up
> > with one thread per IO and huge limitations on how system is going to be
> > used at all.
> 
> The other side is you start from the goal in mind and get Ingo's state
> machines with loops and conditionals and marmalade in syslets which
> appear a bit baroque and overkill for the majority of us userspace folk.

Well, I designed kevent AIO in a similar way, but it has an even more
complex one, which is built on top of the internal page population
functions.

It is a bit complex, but it works fast. And it works with any type of AIO
(if I were not too lazy to implement the bindings).

The syslet interface is not perfect, but it can be changed (did I say it
again? I think we all understand what I mean by that already) trivially
right now (before it is included) - it is no reason to throw the thing away
just because it has a bad interface which can be extended in a moment.

> (No offense intended to Ingo, he's obviously quite a bit more conversant
> on the needs of high speed interfaces than I am. However, I suspect I
> have a bit more clarity on what us normal folk would actually use, and
> kernel driven FSMs ain't it. Userspace often makes a lot of contextual
> decisions that I would absolutely *hate* to write and debug as a state
> machine that gets handed off to the kernel. I'll happily take a 10% hit
> in efficiency that Moore's law will get me back in a few months, instead
> of spending a bunch of time debugging difficult heisenbugs due to the
> syslet FSM reading a userspace variable at a slightly different time
> once in a blue moon. OTOH, I'm also not Oracle, so what do I know?)
> 
> The truth of this lies somewhere in the middle. It isn't kernel driven,
> or userspace interface driven, but a tradeoff between the two.
> 
> So:
> 
> > Userspace_API_is_the_ever_possible_last_thing_to_ever_think_about.
> > Period
> 
> Please listen to those of us who are saying that this might not be the
> case. Maybe we're idiots, but then again maybe we're not, okay?
> Sometimes the API really *DOES* change the underlying implementation.

Now is exactly the time to say what a good interface should be.
The system is almost ready - it is time to make it look cool for users.

> Ray

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-17 10:02                                 ` Evgeniy Polyakov
@ 2007-02-17 17:59                                   ` Cyrill V. Gorcunov
  0 siblings, 0 replies; 320+ messages in thread
From: Cyrill V. Gorcunov @ 2007-02-17 17:59 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: linux-kernel-list

On Sat, Feb 17, 2007 at 01:02:00PM +0300, Evgeniy Polyakov wrote:
[... snipped ...]

| Yes, I only proposed to change what Ingo has right now - although it is
| usable, but it does suck, but since overall syslet design is indeed good
| it does not suffer from possible interface changes - so I said that it
| can be trivially changed in that regard that until it is committed
| anything can be done to extend it.
| 
| -- 
| 	Evgeniy Polyakov
| 

I think, Evgeniy, you are right! In times of research, _changing_ a lot
of things is almost a law. Syslets are in the test area, so why should we
limit ourselves in the search for the best? If something in syslets sucks,
let's change it as early as possible. Of course I mean no more interface
changes after some _commit_ point (and that should be Linus's decision).

-- 

		Cyrill


^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-15  1:28               ` Davide Libenzi
@ 2007-02-18 20:01                 ` Pavel Machek
  2007-02-18 20:37                   ` Davide Libenzi
  0 siblings, 1 reply; 320+ messages in thread
From: Pavel Machek @ 2007-02-18 20:01 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Alan, Linus Torvalds, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Ulrich Drepper, Zach Brown, Evgeniy Polyakov, David S. Miller,
	Benjamin LaHaise, Suparna Bhattacharya, Thomas Gleixner

Hi!

> > The upcall will set up a frame, execute the clet (where jumps/conditions and 
> > userspace variable changes happen in machine code - gcc is pretty good at 
> > taking care of that for us) on its return, come back through a 
> > sys_async_return, and go back to userspace.
> 
> So, for example, this is the setup code for the current API (and that's a 
> really simple one - imagine going wacko with loops and userspace variable 
> changes):
> 
> 
> static struct req *alloc_req(void)
> {
>         /*
>          * Constants can be picked up by syslets via static variables:
>          */
>         static long O_RDONLY_var = O_RDONLY;
>         static long FILE_BUF_SIZE_var = FILE_BUF_SIZE;
>                 
>         struct req *req;
>                  
>         if (freelist) {
>                 req = freelist;
>                 freelist = freelist->next_free;
>                 req->next_free = NULL;
>                 return req;
>         }
>                         
>         req = calloc(1, sizeof(struct req));
>  
>         /*
>          * This is the first atom in the syslet, it opens the file:
>          *
>          *  req->fd = open(req->filename, O_RDONLY);
>          *
>          * It is linked to the next read() atom.
>          */
>         req->filename_p = req->filename;
>         init_atom(req, &req->open_file, __NR_sys_open,
>                   &req->filename_p, &O_RDONLY_var, NULL, NULL, NULL, NULL,
>                   &req->fd, SYSLET_STOP_ON_NEGATIVE, &req->read_file);
>         
>         /*
>          * This second read() atom is linked back to itself, it skips to
>          * the next one on stop:
>          */
>         req->file_buf_ptr = req->file_buf;
>         init_atom(req, &req->read_file, __NR_sys_read,
>                   &req->fd, &req->file_buf_ptr, &FILE_BUF_SIZE_var,
>                   NULL, NULL, NULL, NULL,
>                   SYSLET_STOP_ON_NON_POSITIVE | SYSLET_SKIP_TO_NEXT_ON_STOP,
>                   &req->read_file);
>                 
>         /*
>          * This close() atom has NULL as next, this finishes the syslet:
>          */
>         init_atom(req, &req->close_file, __NR_sys_close,
>                   &req->fd, NULL, NULL, NULL, NULL, NULL, NULL, 0, NULL);
>                 
>         return req;
> }
> 
> 
> Here's how your clet would look like:
> 
> static long main_sync_loop(ctx *c)
> {
>         int fd;
>         char file_buf[FILE_BUF_SIZE+1];
>         
>         if ((fd = open(c->filename, O_RDONLY)) == -1)
>                 return -1;
>         while (read(fd, file_buf, FILE_BUF_SIZE) > 0)
>                 ;
>         close(fd);
>         return 0;
> }
> 
> 
> Kinda easier to code, isn't it? And the cost of the upcall to schedule the 
> clet is largely amortized by the multiple syscalls you're going to do inside 
> your clet.

I do not get it. What if a clet includes

int *a = 0; *a = 1; /* enjoy your oops, stupid kernel? */

I.e. how do you make sure the kernel is protected from malicious clets?

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-18 20:01                 ` Pavel Machek
@ 2007-02-18 20:37                   ` Davide Libenzi
  2007-02-18 21:04                     ` Michael K. Edwards
  0 siblings, 1 reply; 320+ messages in thread
From: Davide Libenzi @ 2007-02-18 20:37 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ingo Molnar, Alan, Linus Torvalds, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Ulrich Drepper, Zach Brown, Evgeniy Polyakov, David S. Miller,
	Benjamin LaHaise, Suparna Bhattacharya, Thomas Gleixner

On Sun, 18 Feb 2007, Pavel Machek wrote:

> > > The upcall will set up a frame, execute the clet (where jumps/conditions and 
> > > userspace variable changes happen in machine code - gcc is pretty good at 
> > > taking care of that for us) on its return, come back through a 
> > > sys_async_return, and go back to userspace.
> > 
> > So, for example, this is the setup code for the current API (and that's a 
> > really simple one - imagine going wacko with loops and userspace variable 
> > changes):
> > 
> > 
> > static struct req *alloc_req(void)
> > {
> >         /*
> >          * Constants can be picked up by syslets via static variables:
> >          */
> >         static long O_RDONLY_var = O_RDONLY;
> >         static long FILE_BUF_SIZE_var = FILE_BUF_SIZE;
> >                 
> >         struct req *req;
> >                  
> >         if (freelist) {
> >                 req = freelist;
> >                 freelist = freelist->next_free;
> >                 req->next_free = NULL;
> >                 return req;
> >         }
> >                         
> >         req = calloc(1, sizeof(struct req));
> >  
> >         /*
> >          * This is the first atom in the syslet, it opens the file:
> >          *
> >          *  req->fd = open(req->filename, O_RDONLY);
> >          *
> >          * It is linked to the next read() atom.
> >          */
> >         req->filename_p = req->filename;
> >         init_atom(req, &req->open_file, __NR_sys_open,
> >                   &req->filename_p, &O_RDONLY_var, NULL, NULL, NULL, NULL,
> >                   &req->fd, SYSLET_STOP_ON_NEGATIVE, &req->read_file);
> >         
> >         /*
> >          * This second read() atom is linked back to itself, it skips to
> >          * the next one on stop:
> >          */
> >         req->file_buf_ptr = req->file_buf;
> >         init_atom(req, &req->read_file, __NR_sys_read,
> >                   &req->fd, &req->file_buf_ptr, &FILE_BUF_SIZE_var,
> >                   NULL, NULL, NULL, NULL,
> >                   SYSLET_STOP_ON_NON_POSITIVE | SYSLET_SKIP_TO_NEXT_ON_STOP,
> >                   &req->read_file);
> >                 
> >         /*
> >          * This close() atom has NULL as next, this finishes the syslet:
> >          */
> >         init_atom(req, &req->close_file, __NR_sys_close,
> >                   &req->fd, NULL, NULL, NULL, NULL, NULL, NULL, 0, NULL);
> >                 
> >         return req;
> > }
> > 
> > 
> > Here's how your clet would look like:
> > 
> > static long main_sync_loop(ctx *c)
> > {
> >         int fd;
> >         char file_buf[FILE_BUF_SIZE+1];
> >         
> >         if ((fd = open(c->filename, O_RDONLY)) == -1)
> >                 return -1;
> >         while (read(fd, file_buf, FILE_BUF_SIZE) > 0)
> >                 ;
> >         close(fd);
> >         return 0;
> > }
> > 
> > 
> > Kinda easier to code, isn't it? And the cost of the upcall to schedule the 
> > clet is largely amortized by the multiple syscalls you're going to do inside 
> > your clet.
> 
> I do not get it. What if a clet includes
> 
> int *a = 0; *a = 1; /* enjoy your oops, stupid kernel? */
> 
> I.e. how do you make sure the kernel is protected from malicious clets?

Clets would execute in userspace, like signal handlers, but under the 
special schedule() handler. That way, chaining happens by means of 
natural C code, and access to userspace variables happens through 
natural C code too (not with special syscalls to manipulate userspace 
memory). I'm not a big fan of chains of syscalls for the reasons I 
already explained, but at least clets (or whatever name) have a far 
lower cost for the programmer (they are easier to code than atom 
chains) and for the kernel (no need for all that atom-handling stuff, 
no limited cond/jump interpreters in the kernel, and no nightmare 
compat code).
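
To make that concrete, here is a rough sketch of what the submission
side might look like - async_submit_clet() and struct clet_ctx are
invented names for illustration only, not anything from the posted
patches:

#include <fcntl.h>
#include <unistd.h>

#define FILE_BUF_SIZE 4096

struct clet_ctx {
        const char *filename;   /* input for the clet */
        long result;            /* filled in on completion */
};

/*
 * The clet itself: plain C, compiled by gcc, executed in userspace
 * under the special async scheduling context described above.
 * Loops and conditions are just machine code, no atoms needed.
 */
static long read_whole_file_clet(struct clet_ctx *c)
{
        char buf[FILE_BUF_SIZE];
        int fd = open(c->filename, O_RDONLY);

        if (fd == -1)
                return -1;
        while (read(fd, buf, sizeof(buf)) > 0)
                ;
        close(fd);
        return 0;
}

/*
 * async_submit_clet() stands in for whatever syscall/upcall pair would
 * actually schedule the clet; it is purely hypothetical here.
 */
extern long async_submit_clet(long (*fn)(struct clet_ctx *),
                              struct clet_ctx *ctx);

static long submit_request(struct clet_ctx *ctx)
{
        return async_submit_clet(read_whole_file_clet, ctx);
}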



- Davide



^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
  2007-02-18 20:37                   ` Davide Libenzi
@ 2007-02-18 21:04                     ` Michael K. Edwards
  0 siblings, 0 replies; 320+ messages in thread
From: Michael K. Edwards @ 2007-02-18 21:04 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Pavel Machek, Ingo Molnar, Alan, Linus Torvalds,
	Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Ulrich Drepper, Zach Brown, Evgeniy Polyakov,
	David S. Miller, Benjamin LaHaise, Suparna Bhattacharya,
	Thomas Gleixner

On 2/18/07, Davide Libenzi <davidel@xmailserver.org> wrote:
> Clets would execute in userspace, like signal handlers,

or like "event handlers" in cooperative multitasking environments
without the Unix baggage

> but under the special schedule() handler.

or, better yet, as the next tasklet in the chain after the softirq
dispatcher, since I/Os almost always unblock as a result of something
that happens in an ISR or softirq

> In that way chains happens by the mean of
> natural C code, and access to userspace variables happen by the mean of
> natural C code too (not with special syscalls to manipulate userspace
> memory).

yep.  That way you can exploit this nice hardware block called an MMU.

> I'm not a big fan of chains of syscalls for the reasons I
> already explained,

to a kernel programmer, all userspace programs are chains of syscalls.  :-)

> but at least clets (or whatever name) has a way lower
> cost for the programmer (easier to code than atom chains),

except you still have the 80% of the code that is half-assed exception
handling using overloaded semantics on function return values and a
thread-local errno, which is totally unsafe with fibrils, syslets,
clets, and giblets, since none of them promise to run continuations in
the same thread context as the submission.  Of course you aren't going
to use errno as such, but that means that async-ifying code isn't
s/syscall/aio_syscall/, it's a complete rewrite.  If you're going to
design a new AIO interface, please model it after the only standard
that has ever made deeply pipelined, massively parallel execution
programmer-friendly -- IEEE 754.
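
A toy illustration of the errno point - on_read_complete() and the
async submission behind it are made up, but the hazard is generic:

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Classic synchronous idiom: the return value is overloaded as
 * "negative means consult errno", and errno is thread-local.
 */
static long sync_read(int fd, void *buf, size_t len)
{
        long n = read(fd, buf, len);
        if (n < 0)
                perror("read");         /* this thread's errno */
        return n;
}

/*
 * If the continuation runs in a different thread context than the
 * submission - which fibrils, syslets and clets all allow - the errno
 * it sees belongs to whatever that thread did last, not to the read
 * that failed.
 */
static void on_read_complete(long result)
{
        if (result < 0)
                perror("async read");   /* wrong thread's errno */
}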

> and for the kernel (no need of all that atom handling stuff,

you still need this, but it has to be centered on a data structure
that makes request throttling, dynamic reprioritization, and bulk
cancellation practical

> no need of limited cond/jump interpreters in the kernel,

you still need this, for efficient handling of speculative execution,
pipeline stalls, and exception propagation, but it's invisible to the
interface and you don't have to invent it up front

> and no need of nightmare compat code).

Compat code, yes; nightmare, no.  Just like kernel FP emulation on any
processor other than an x86: unimplemented instruction traps.  x86 is
so utterly the wrong architecture on which to prototype this that it
isn't even funny.

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 02/11] syslets: add syslet.h include file, user API/ABI definitions
  2007-02-13 14:20 ` [patch 02/11] syslets: add syslet.h include file, user API/ABI definitions Ingo Molnar
  2007-02-13 20:17   ` Indan Zupancic
@ 2007-02-19  0:22   ` Paul Mackerras
  1 sibling, 0 replies; 320+ messages in thread
From: Paul Mackerras @ 2007-02-19  0:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Benjamin LaHaise,
	Suparna Bhattacharya, Davide Libenzi, Thomas Gleixner

Ingo Molnar writes:

> add include/linux/syslet.h which contains the user-space API/ABI
> declarations. Add the new header to include/linux/Kbuild as well.

> +struct syslet_uatom {
> +	unsigned long				flags;
> +	unsigned long				nr;
> +	long __user				*ret_ptr;
> +	struct syslet_uatom	__user		*next;
> +	unsigned long		__user		*arg_ptr[6];
> +	/*
> +	 * User-space can put anything in here, kernel will not
> +	 * touch it:
> +	 */
> +	void __user				*private;
> +};

This structure, with its unsigned longs and pointers, is going to
create enormous headaches for 32-bit processes on 64-bit machines as
far as I can see---and on ppc64 machines, almost all processes are
32-bit, since there is no inherent speed penalty for running in 32-bit
mode, and some space savings.

Have you thought about how you will handle compatibility for 32-bit
processes?  The issue will arise for x86_64 and ia64 (among others)
too, I would think.
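
For illustration only (this is a sketch, not a proposal), one way to
sidestep the problem would be to make every field an explicit 64-bit
quantity, so 32-bit and 64-bit processes share a single layout and no
compat translation is needed:

#include <linux/types.h>

struct syslet_uatom_sketch {
	__u64	flags;
	__u64	nr;
	__u64	ret_ptr;	/* user pointer cast to u64 */
	__u64	next;		/* user pointer to next atom */
	__u64	arg_ptr[6];	/* user pointers to arguments */
	__u64	private;	/* opaque to the kernel */
};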

Paul.

^ permalink raw reply	[flat|nested] 320+ messages in thread

* Re: [patch 05/11] syslets: core code
@ 2007-02-17 14:57 Al Boldi
  0 siblings, 0 replies; 320+ messages in thread
From: Al Boldi @ 2007-02-17 14:57 UTC (permalink / raw)
  To: linux-kernel

Evgeniy Polyakov wrote:
> Ray Lee (ray-lk@madrabbit.org) wrote:
> > The truth of this lies somewhere in the middle. It isn't kernel driven,
> > or userspace interface driven, but a tradeoff between the two.
> >
> > So:
> > > Userspace_API_is_the_ever_possible_last_thing_to_ever_think_about.
> > > Period
> >
> > Please listen to those of us who are saying that this might not be the
> > case. Maybe we're idiots, but then again maybe we're not, okay?
> > Sometimes the API really *DOES* change the underlying implementation.
>
> Now is exactly the time to say what the interface should look like.
> The system is almost ready - it is time to make it look good for users.

IMHO, what is needed is an event registration switchboard that handles 
notifications from both the kernel side and the user side.


Thanks!

--
Al


^ permalink raw reply	[flat|nested] 320+ messages in thread

end of thread, other threads:[~2007-02-19  0:22 UTC | newest]

Thread overview: 320+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-05-29 21:21 [patch 00/61] ANNOUNCE: lock validator -V1 Ingo Molnar
2006-05-29 21:22 ` [patch 01/61] lock validator: floppy.c irq-release fix Ingo Molnar
2006-05-30  1:32   ` Andrew Morton
2006-05-29 21:23 ` [patch 02/61] lock validator: forcedeth.c fix Ingo Molnar
2006-05-30  1:33   ` Andrew Morton
2006-05-31  5:40     ` Manfred Spraul
2006-05-29 21:23 ` [patch 03/61] lock validator: sound/oss/emu10k1/midi.c cleanup Ingo Molnar
2006-05-30  1:33   ` Andrew Morton
2006-05-30 10:51     ` Takashi Iwai
2006-05-30 11:03       ` Alexey Dobriyan
2006-05-29 21:23 ` [patch 04/61] lock validator: mutex section binutils workaround Ingo Molnar
2006-05-29 21:23 ` [patch 05/61] lock validator: introduce WARN_ON_ONCE(cond) Ingo Molnar
2006-05-30  1:33   ` Andrew Morton
2006-05-30 17:38     ` Steven Rostedt
2006-06-03 18:09       ` Steven Rostedt
2006-06-04  9:18         ` Arjan van de Ven
2006-06-04 13:43           ` Steven Rostedt
2006-05-29 21:23 ` [patch 06/61] lock validator: add __module_address() method Ingo Molnar
2006-05-30  1:33   ` Andrew Morton
2006-05-30 17:45     ` Steven Rostedt
2006-06-23  8:38     ` Ingo Molnar
2006-05-29 21:23 ` [patch 07/61] lock validator: better lock debugging Ingo Molnar
2006-05-30  1:33   ` Andrew Morton
2006-06-23 10:25     ` Ingo Molnar
2006-06-23 11:06       ` Andrew Morton
2006-06-23 11:04         ` Ingo Molnar
2006-05-29 21:23 ` [patch 08/61] lock validator: locking API self-tests Ingo Molnar
2006-05-29 21:23 ` [patch 09/61] lock validator: spin/rwlock init cleanups Ingo Molnar
2006-05-29 21:23 ` [patch 10/61] lock validator: locking init debugging improvement Ingo Molnar
2006-05-29 21:23 ` [patch 11/61] lock validator: lockdep: small xfs init_rwsem() cleanup Ingo Molnar
2006-05-30  1:33   ` Andrew Morton
2006-05-30  1:32     ` Nathan Scott
2006-05-29 21:24 ` [patch 12/61] lock validator: beautify x86_64 stacktraces Ingo Molnar
2006-05-30  1:33   ` Andrew Morton
2006-05-29 21:24 ` [patch 13/61] lock validator: x86_64: document stack frame internals Ingo Molnar
2006-05-29 21:24 ` [patch 14/61] lock validator: stacktrace Ingo Molnar
2006-05-29 21:24 ` [patch 15/61] lock validator: x86_64: use stacktrace to generate backtraces Ingo Molnar
2006-05-30  1:33   ` Andrew Morton
2006-05-29 21:24 ` [patch 16/61] lock validator: fown locking workaround Ingo Molnar
2006-05-30  1:34   ` Andrew Morton
2006-06-23  9:10     ` Ingo Molnar
2006-05-29 21:24 ` [patch 17/61] lock validator: sk_callback_lock workaround Ingo Molnar
2006-05-30  1:34   ` Andrew Morton
2006-06-23  9:19     ` Ingo Molnar
2006-05-29 21:24 ` [patch 18/61] lock validator: irqtrace: core Ingo Molnar
2006-05-30  1:34   ` Andrew Morton
2006-06-23 10:42     ` Ingo Molnar
2006-05-29 21:24 ` [patch 19/61] lock validator: irqtrace: cleanup: include/asm-i386/irqflags.h Ingo Molnar
2006-05-29 21:24 ` [patch 20/61] lock validator: irqtrace: cleanup: include/asm-x86_64/irqflags.h Ingo Molnar
2006-05-29 21:24 ` [patch 21/61] lock validator: lockdep: add local_irq_enable_in_hardirq() API Ingo Molnar
2006-05-30  1:34   ` Andrew Morton
2006-06-23  9:28     ` Ingo Molnar
2006-06-23  9:52       ` Andrew Morton
2006-06-23 10:20         ` Ingo Molnar
2006-05-29 21:24 ` [patch 22/61] lock validator: add per_cpu_offset() Ingo Molnar
2006-05-30  1:34   ` Andrew Morton
2006-06-23  9:30     ` Ingo Molnar
2006-05-29 21:25 ` [patch 23/61] lock validator: core Ingo Molnar
2006-05-29 21:25 ` [patch 24/61] lock validator: procfs Ingo Molnar
2006-05-29 21:25 ` [patch 25/61] lock validator: design docs Ingo Molnar
2006-05-30  9:07   ` Nikita Danilov
2006-05-29 21:25 ` [patch 26/61] lock validator: prove rwsem locking correctness Ingo Molnar
2006-05-29 21:25 ` [patch 27/61] lock validator: prove spinlock/rwlock " Ingo Molnar
2006-05-30  1:35   ` Andrew Morton
2006-06-23 10:44     ` Ingo Molnar
2006-05-29 21:25 ` [patch 28/61] lock validator: prove mutex " Ingo Molnar
2006-05-29 21:25 ` [patch 29/61] lock validator: print all lock-types on SysRq-D Ingo Molnar
2006-05-29 21:25 ` [patch 30/61] lock validator: x86_64 early init Ingo Molnar
2006-05-29 21:25 ` [patch 31/61] lock validator: SMP alternatives workaround Ingo Molnar
2006-05-29 21:25 ` [patch 32/61] lock validator: do not recurse in printk() Ingo Molnar
2006-05-29 21:25 ` [patch 33/61] lock validator: disable NMI watchdog if CONFIG_LOCKDEP Ingo Molnar
2006-05-29 22:49   ` Keith Owens
2006-05-29 21:25 ` [patch 34/61] lock validator: special locking: bdev Ingo Molnar
2006-05-30  1:35   ` Andrew Morton
2006-05-30  5:13     ` Arjan van de Ven
2006-05-30  9:58     ` Al Viro
2006-05-30 10:45     ` Arjan van de Ven
2006-05-29 21:25 ` [patch 35/61] lock validator: special locking: direct-IO Ingo Molnar
2006-05-29 21:26 ` [patch 36/61] lock validator: special locking: serial Ingo Molnar
2006-05-30  1:35   ` Andrew Morton
2006-06-23  9:49     ` Ingo Molnar
2006-06-23 10:04       ` Andrew Morton
2006-06-23 10:18         ` Ingo Molnar
2006-05-29 21:26 ` [patch 37/61] lock validator: special locking: dcache Ingo Molnar
2006-05-30  1:35   ` Andrew Morton
2006-05-30 20:51     ` Steven Rostedt
2006-05-30 21:01       ` Ingo Molnar
2006-06-23  9:51       ` Ingo Molnar
2006-05-29 21:26 ` [patch 38/61] lock validator: special locking: i_mutex Ingo Molnar
2006-05-30 20:53   ` Steven Rostedt
2006-05-30 21:06     ` Ingo Molnar
2006-05-29 21:26 ` [patch 39/61] lock validator: special locking: s_lock Ingo Molnar
2006-05-29 21:26 ` [patch 40/61] lock validator: special locking: futex Ingo Molnar
2006-05-29 21:26 ` [patch 41/61] lock validator: special locking: genirq Ingo Molnar
2006-05-29 21:26 ` [patch 42/61] lock validator: special locking: kgdb Ingo Molnar
2006-05-29 21:26 ` [patch 43/61] lock validator: special locking: completions Ingo Molnar
2006-05-29 21:26 ` [patch 44/61] lock validator: special locking: waitqueues Ingo Molnar
2006-05-29 21:26 ` [patch 45/61] lock validator: special locking: mm Ingo Molnar
2006-05-29 21:26 ` [patch 46/61] lock validator: special locking: slab Ingo Molnar
2006-05-30  1:35   ` Andrew Morton
2006-06-23  9:54     ` Ingo Molnar
2006-05-29 21:26 ` [patch 47/61] lock validator: special locking: skb_queue_head_init() Ingo Molnar
2006-05-29 21:26 ` [patch 48/61] lock validator: special locking: timer.c Ingo Molnar
2006-05-29 21:27 ` [patch 49/61] lock validator: special locking: sched.c Ingo Molnar
2006-05-29 21:27 ` [patch 50/61] lock validator: special locking: hrtimer.c Ingo Molnar
2006-05-30  1:35   ` Andrew Morton
2006-06-23 10:04     ` Ingo Molnar
2006-06-23 10:38       ` Andrew Morton
2006-06-23 10:52         ` Ingo Molnar
2006-06-23 11:52           ` Ingo Molnar
2006-06-23 12:06             ` Andrew Morton
2006-05-29 21:27 ` [patch 51/61] lock validator: special locking: sock_lock_init() Ingo Molnar
2006-05-30  1:36   ` Andrew Morton
2006-06-23 10:06     ` Ingo Molnar
2006-05-29 21:27 ` [patch 52/61] lock validator: special locking: af_unix Ingo Molnar
2006-05-30  1:36   ` Andrew Morton
2006-06-23 10:07     ` Ingo Molnar
2006-05-29 21:27 ` [patch 53/61] lock validator: special locking: bh_lock_sock() Ingo Molnar
2006-05-29 21:27 ` [patch 54/61] lock validator: special locking: mmap_sem Ingo Molnar
2006-05-29 21:27 ` [patch 55/61] lock validator: special locking: sb->s_umount Ingo Molnar
2006-05-30  1:36   ` Andrew Morton
2006-06-23 10:55     ` Ingo Molnar
2006-05-29 21:27 ` [patch 56/61] lock validator: special locking: jbd Ingo Molnar
2006-05-29 21:27 ` [patch 57/61] lock validator: special locking: posix-timers Ingo Molnar
2006-05-29 21:27 ` [patch 58/61] lock validator: special locking: sch_generic.c Ingo Molnar
2006-05-29 21:27 ` [patch 59/61] lock validator: special locking: xfrm Ingo Molnar
2006-05-30  1:36   ` Andrew Morton
2006-05-29 21:27 ` [patch 60/61] lock validator: special locking: sound/core/seq/seq_ports.c Ingo Molnar
2006-05-29 21:28 ` [patch 61/61] lock validator: enable lock validator in Kconfig Ingo Molnar
2006-05-30  1:36   ` Andrew Morton
2006-05-30 13:33   ` Roman Zippel
2006-06-23 11:01     ` Ingo Molnar
2006-06-26 11:37       ` Roman Zippel
2006-05-29 22:28 ` [patch 00/61] ANNOUNCE: lock validator -V1 Michal Piotrowski
2006-05-29 22:41   ` Ingo Molnar
2006-05-29 23:09     ` Dave Jones
2006-05-30  5:45       ` Arjan van de Ven
2006-05-30  6:07         ` Michal Piotrowski
2006-05-30 14:10         ` Dave Jones
2006-05-30 14:19           ` Arjan van de Ven
2006-05-30 14:58             ` Dave Jones
2006-05-30 17:11               ` Dominik Brodowski
2006-05-30 19:02                 ` Dave Jones
2006-05-30 19:25                   ` Roland Dreier
2006-05-30 19:34                     ` Dave Jones
2006-05-30 20:41                     ` Ingo Molnar
2006-05-30 20:44                       ` Ingo Molnar
2006-05-30 21:58                       ` Paolo Ciarrocchi
2006-05-31  8:40                         ` Ingo Molnar
2006-05-30 19:39                 ` Dave Jones
2006-05-30 19:53                   ` Ashok Raj
2006-06-01  5:50                     ` Nathan Lynch
2006-05-30 20:54         ` [patch, -rc5-mm1] lock validator: select KALLSYMS_ALL Ingo Molnar
2006-05-30  5:52       ` [patch 00/61] ANNOUNCE: lock validator -V1 Michal Piotrowski
2006-05-30  5:20   ` Arjan van de Ven
2006-05-30  1:35 ` Andrew Morton
2006-06-23  9:41   ` Ingo Molnar
2006-05-30  4:52 ` Mike Galbraith
2006-05-30  6:20   ` Arjan van de Ven
2006-05-30  6:35   ` Arjan van de Ven
2006-05-30  7:47     ` Ingo Molnar
2006-05-30  6:37   ` Ingo Molnar
2006-05-30  9:25     ` Mike Galbraith
2006-05-30 10:57       ` Ingo Molnar
2006-05-30  9:14 ` Benoit Boissinot
2006-05-30 10:26   ` Arjan van de Ven
2006-05-30 11:42     ` Benoit Boissinot
2006-05-30 12:13       ` Ingo Molnar
2006-06-01 14:42   ` [patch mm1-rc2] lock validator: netlink.c netlink_table_grab fix Frederik Deweerdt
2006-06-02  3:10     ` Zhu Yi
2006-06-02  9:53       ` Frederik Deweerdt
2006-06-05  3:40         ` Zhu Yi
2007-02-13 14:20 ` [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support Ingo Molnar
2007-02-13 15:00   ` Alan
2007-02-13 14:58     ` Benjamin LaHaise
2007-02-13 15:09       ` Arjan van de Ven
2007-02-13 16:24       ` bert hubert
2007-02-13 16:56       ` Ingo Molnar
2007-02-13 18:56         ` Evgeniy Polyakov
2007-02-13 19:12           ` Evgeniy Polyakov
2007-02-13 22:19             ` Ingo Molnar
2007-02-13 22:18           ` Ingo Molnar
2007-02-14  8:59             ` Evgeniy Polyakov
2007-02-14 10:37               ` Ingo Molnar
2007-02-14 11:10                 ` Evgeniy Polyakov
2007-02-14 17:17                 ` Davide Libenzi
2007-02-13 20:34       ` Ingo Molnar
2007-02-13 15:46     ` Dmitry Torokhov
2007-02-13 20:39       ` Ingo Molnar
2007-02-13 22:36         ` Dmitry Torokhov
2007-02-14 11:07         ` Alan
2007-02-13 16:39     ` Andi Kleen
2007-02-13 16:26       ` Linus Torvalds
2007-02-13 17:03         ` Ingo Molnar
2007-02-13 20:26         ` Davide Libenzi
2007-02-13 16:49       ` Ingo Molnar
2007-02-13 16:42     ` Ingo Molnar
2007-02-13 20:22   ` Davide Libenzi
2007-02-13 21:24     ` Davide Libenzi
2007-02-13 22:10       ` Ingo Molnar
2007-02-13 23:28         ` Davide Libenzi
2007-02-13 21:57     ` Ingo Molnar
2007-02-13 22:50       ` Olivier Galibert
2007-02-13 22:59       ` Ulrich Drepper
2007-02-13 23:24       ` Davide Libenzi
2007-02-13 23:25       ` Andi Kleen
2007-02-13 22:26         ` Ingo Molnar
2007-02-13 22:32           ` Andi Kleen
2007-02-13 22:43             ` Ingo Molnar
2007-02-13 22:47               ` Andi Kleen
2007-02-14  3:28   ` Davide Libenzi
2007-02-14  4:49     ` Davide Libenzi
2007-02-14  8:26       ` Ingo Molnar
2007-02-14  4:42   ` Willy Tarreau
2007-02-14 12:37   ` Pavel Machek
2007-02-14 17:14     ` Linus Torvalds
2007-02-14 20:52   ` Jeremy Fitzhardinge
2007-02-14 21:36     ` Davide Libenzi
2007-02-15  0:08       ` Jeremy Fitzhardinge
2007-02-15  2:07         ` Davide Libenzi
2007-02-15  2:44   ` Zach Brown
2007-02-13 14:20 ` [patch 01/11] syslets: add async.h include file, kernel-side API definitions Ingo Molnar
2007-02-13 14:20 ` [patch 02/11] syslets: add syslet.h include file, user API/ABI definitions Ingo Molnar
2007-02-13 20:17   ` Indan Zupancic
2007-02-13 21:43     ` Ingo Molnar
2007-02-13 22:24       ` Indan Zupancic
2007-02-13 22:32         ` Ingo Molnar
2007-02-19  0:22   ` Paul Mackerras
2007-02-13 14:20 ` [patch 03/11] syslets: generic kernel bits Ingo Molnar
2007-02-13 14:20 ` [patch 04/11] syslets: core, data structures Ingo Molnar
2007-02-13 14:20 ` [patch 05/11] syslets: core code Ingo Molnar
2007-02-13 23:15   ` Andi Kleen
2007-02-13 22:24     ` Ingo Molnar
2007-02-13 22:30       ` Andi Kleen
2007-02-13 22:41         ` Ingo Molnar
2007-02-14  9:13           ` Evgeniy Polyakov
2007-02-14  9:46             ` Ingo Molnar
2007-02-14 10:09               ` Evgeniy Polyakov
2007-02-14 10:30                 ` Arjan van de Ven
2007-02-14 10:41                   ` Evgeniy Polyakov
2007-02-13 22:57       ` Andrew Morton
2007-02-14 12:43   ` Guillaume Chazarain
2007-02-14 13:17   ` Stephen Rothwell
2007-02-14 20:38   ` Linus Torvalds
2007-02-14 21:02     ` Ingo Molnar
2007-02-14 21:12       ` Ingo Molnar
2007-02-14 21:26       ` Linus Torvalds
2007-02-14 21:35         ` Ingo Molnar
2007-02-15  2:52           ` Zach Brown
2007-02-14 21:44         ` Ingo Molnar
2007-02-14 21:56         ` Alan
2007-02-14 22:32           ` Ingo Molnar
2007-02-15  1:01             ` Davide Libenzi
2007-02-15  1:28               ` Davide Libenzi
2007-02-18 20:01                 ` Pavel Machek
2007-02-18 20:37                   ` Davide Libenzi
2007-02-18 21:04                     ` Michael K. Edwards
2007-02-14 21:09     ` Davide Libenzi
2007-02-14 22:09     ` Ingo Molnar
2007-02-14 23:13       ` Linus Torvalds
2007-02-14 23:44         ` Ingo Molnar
2007-02-15  0:04           ` Ingo Molnar
2007-02-15 13:35     ` Evgeniy Polyakov
2007-02-15 16:09       ` Linus Torvalds
2007-02-15 16:37         ` Evgeniy Polyakov
2007-02-15 17:42           ` Linus Torvalds
2007-02-15 18:11             ` Evgeniy Polyakov
2007-02-15 18:25               ` Linus Torvalds
2007-02-15 19:04                 ` Evgeniy Polyakov
2007-02-15 19:28                   ` Linus Torvalds
2007-02-15 20:07                     ` Linus Torvalds
2007-02-15 21:17                       ` Davide Libenzi
2007-02-15 22:34                       ` Michael K. Edwards
2007-02-16 12:28                       ` Ingo Molnar
2007-02-16 13:28                         ` Evgeniy Polyakov
2007-02-16  8:57                     ` Evgeniy Polyakov
2007-02-16 15:54                       ` Linus Torvalds
2007-02-16 16:05                         ` Evgeniy Polyakov
2007-02-16 16:53                           ` Ray Lee
2007-02-16 16:58                             ` Evgeniy Polyakov
2007-02-16 20:20                               ` Cyrill V. Gorcunov
2007-02-17 10:02                                 ` Evgeniy Polyakov
2007-02-17 17:59                                   ` Cyrill V. Gorcunov
2007-02-17  4:54                               ` Ray Lee
2007-02-17 10:15                                 ` Evgeniy Polyakov
2007-02-15 18:46             ` bert hubert
2007-02-15 19:10               ` Evgeniy Polyakov
2007-02-15 19:16               ` Zach Brown
2007-02-15 19:26               ` Eric Dumazet
2007-02-15 17:05         ` Davide Libenzi
2007-02-15 17:17           ` Evgeniy Polyakov
2007-02-15 17:39             ` Davide Libenzi
2007-02-15 18:01               ` Evgeniy Polyakov
2007-02-15 17:17         ` Ulrich Drepper
2007-02-13 14:20 ` [patch 06/11] syslets: core, documentation Ingo Molnar
2007-02-13 20:18   ` Davide Libenzi
2007-02-13 21:34     ` Ingo Molnar
2007-02-13 23:21       ` Davide Libenzi
2007-02-14  0:18         ` Davide Libenzi
2007-02-14 10:36   ` Russell King
2007-02-14 10:50     ` Ingo Molnar
2007-02-14 11:04       ` Russell King
2007-02-14 17:52         ` Davide Libenzi
2007-02-14 18:03           ` Benjamin LaHaise
2007-02-14 19:45             ` Davide Libenzi
2007-02-14 20:03               ` Benjamin LaHaise
2007-02-14 20:14                 ` Davide Libenzi
2007-02-14 20:34                   ` Benjamin LaHaise
2007-02-14 21:06                     ` Davide Libenzi
2007-02-14 21:44                       ` Benjamin LaHaise
2007-02-14 23:17                         ` Davide Libenzi
2007-02-14 23:40                           ` Benjamin LaHaise
2007-02-15  0:35                             ` Davide Libenzi
2007-02-15  1:32                         ` Michael K. Edwards
2007-02-14 21:49                     ` [patch] x86: split FPU state from task state Ingo Molnar
2007-02-14 22:04                       ` Benjamin LaHaise
2007-02-14 22:10                         ` Arjan van de Ven
2007-02-13 14:20 ` [patch 07/11] syslets: x86, add create_async_thread() method Ingo Molnar
     [not found] ` <20061213130211.GT21847@elte.hu>
2007-02-15 10:13   ` [patch 19/31] clockevents: i386 drivers Andrew Morton
2007-02-17 14:57 [patch 05/11] syslets: core code Al Boldi
