* [git pull] scheduler fixes
@ 2008-07-31 21:43 Ingo Molnar
  2008-07-31 22:04 ` David Miller
  0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2008-07-31 21:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, Andrew Morton, Peter Zijlstra

Linus,

Please pull the latest sched-fixes-for-linus git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git sched-fixes-for-linus

Thanks,

	Ingo

------------------>
Hugh Dickins (1):
      sched: move sched_clock before first use

OGAWA Hirofumi (1):
      sched: fix SCHED_HRTICK dependency

Peter Zijlstra (2):
      sched: fix warning in hrtick_start_fair()
      lockdep: change scheduler annotation

roel kluin (1):
      sched: test runtime rather than period in global_rt_runtime()


 kernel/Kconfig.hz    |    2 +-
 kernel/sched.c       |   12 +++++-------
 kernel/sched_clock.c |   19 +++++++++----------
 kernel/sched_fair.c  |    2 +-
 4 files changed, 16 insertions(+), 19 deletions(-)

diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz
index 382dd5a..94fabd5 100644
--- a/kernel/Kconfig.hz
+++ b/kernel/Kconfig.hz
@@ -55,4 +55,4 @@ config HZ
 	default 1000 if HZ_1000
 
 config SCHED_HRTICK
-	def_bool HIGH_RES_TIMERS && USE_GENERIC_SMP_HELPERS
+	def_bool HIGH_RES_TIMERS && (!SMP || USE_GENERIC_SMP_HELPERS)
diff --git a/kernel/sched.c b/kernel/sched.c
index 0236958..f63af45 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -600,7 +600,6 @@ struct rq {
 	/* BKL stats */
 	unsigned int bkl_count;
 #endif
-	struct lock_class_key rq_lock_key;
 };
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
@@ -834,7 +833,7 @@ static inline u64 global_rt_period(void)
 
 static inline u64 global_rt_runtime(void)
 {
-	if (sysctl_sched_rt_period < 0)
+	if (sysctl_sched_rt_runtime < 0)
 		return RUNTIME_INF;
 
 	return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
@@ -2759,10 +2758,10 @@ static void double_rq_lock(struct rq *rq1, struct rq *rq2)
 	} else {
 		if (rq1 < rq2) {
 			spin_lock(&rq1->lock);
-			spin_lock(&rq2->lock);
+			spin_lock_nested(&rq2->lock, SINGLE_DEPTH_NESTING);
 		} else {
 			spin_lock(&rq2->lock);
-			spin_lock(&rq1->lock);
+			spin_lock_nested(&rq1->lock, SINGLE_DEPTH_NESTING);
 		}
 	}
 	update_rq_clock(rq1);
@@ -2805,10 +2804,10 @@ static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
 		if (busiest < this_rq) {
 			spin_unlock(&this_rq->lock);
 			spin_lock(&busiest->lock);
-			spin_lock(&this_rq->lock);
+			spin_lock_nested(&this_rq->lock, SINGLE_DEPTH_NESTING);
 			ret = 1;
 		} else
-			spin_lock(&busiest->lock);
+			spin_lock_nested(&busiest->lock, SINGLE_DEPTH_NESTING);
 	}
 	return ret;
 }
@@ -7998,7 +7997,6 @@ void __init sched_init(void)
 
 		rq = cpu_rq(i);
 		spin_lock_init(&rq->lock);
-		lockdep_set_class(&rq->lock, &rq->rq_lock_key);
 		rq->nr_running = 0;
 		init_cfs_rq(&rq->cfs, rq);
 		init_rt_rq(&rq->rt, rq);
diff --git a/kernel/sched_clock.c b/kernel/sched_clock.c
index 22ed55d..5a2dc7d 100644
--- a/kernel/sched_clock.c
+++ b/kernel/sched_clock.c
@@ -32,6 +32,15 @@
 #include <linux/ktime.h>
 #include <linux/module.h>
 
+/*
+ * Scheduler clock - returns current time in nanosec units.
+ * This is default implementation.
+ * Architectures and sub-architectures can override this.
+ */
+unsigned long long __attribute__((weak)) sched_clock(void)
+{
+	return (unsigned long long)jiffies * (NSEC_PER_SEC / HZ);
+}
 
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
 
@@ -321,16 +330,6 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
 
 #endif
 
-/*
- * Scheduler clock - returns current time in nanosec units.
- * This is default implementation.
- * Architectures and sub-architectures can override this.
- */
-unsigned long long __attribute__((weak)) sched_clock(void)
-{
-	return (unsigned long long)jiffies * (NSEC_PER_SEC / HZ);
-}
-
 unsigned long long cpu_clock(int cpu)
 {
 	unsigned long long clock;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index cf2cd6c..0fe94ea 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -899,7 +899,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 		 * doesn't make sense. Rely on vruntime for fairness.
 		 */
 		if (rq->curr != p)
-			delta = max(10000LL, delta);
+			delta = max_t(s64, 10000LL, delta);
 
 		hrtick_start(rq, delta);
 	}

* Re: [git pull] scheduler fixes
  2008-07-31 21:43 [git pull] scheduler fixes Ingo Molnar
@ 2008-07-31 22:04 ` David Miller
  2008-07-31 22:26   ` Ingo Molnar
  0 siblings, 1 reply; 22+ messages in thread
From: David Miller @ 2008-07-31 22:04 UTC (permalink / raw)
  To: mingo; +Cc: torvalds, linux-kernel, akpm, a.p.zijlstra

From: Ingo Molnar <mingo@elte.hu>
Date: Thu, 31 Jul 2008 23:43:52 +0200

> Peter Zijlstra (2):
>       sched: fix warning in hrtick_start_fair()
>       lockdep: change scheduler annotation

Ingo, Peter forgot to tell you but this lockdep change
blows up on my Niagara boxes in a way I haven't figured
out yet.

So could you please not merge that change in for the
time being?

Thanks.

* Re: [git pull] scheduler fixes
  2008-07-31 22:04 ` David Miller
@ 2008-07-31 22:26   ` Ingo Molnar
  2008-07-31 22:55     ` David Miller
  0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2008-07-31 22:26 UTC (permalink / raw)
  To: David Miller; +Cc: torvalds, linux-kernel, akpm, a.p.zijlstra


* David Miller <davem@davemloft.net> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> Date: Thu, 31 Jul 2008 23:43:52 +0200
> 
> > Peter Zijlstra (2):
> >       sched: fix warning in hrtick_start_fair()
> >       lockdep: change scheduler annotation
> 
> Ingo, Peter forgot to tell you but this lockdep change
> blows up on my Niagara boxes in a way I haven't figured
> out yet.

are you sure it's this (trivial) annotation in sched.c that blows up, 
not your other (more invasive) changes to lockdep? This patch looked 
fine in my testing.

> So could you please not merge that change in for the
> time being?

sure thing, we can defer it until your bug is understood more fully. 
I've created a second, sched-fixes-for-linus-2 tree for Linus below - 
Linus, if you have not pulled the previous tree yet could you please 
pull this one?

	Ingo

------------------->
Linus,

Please pull the latest sched-fixes-for-linus-2 git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git sched-fixes-for-linus-2

Thanks,

	Ingo

------------------>
Hugh Dickins (1):
      sched: move sched_clock before first use

OGAWA Hirofumi (1):
      sched: fix SCHED_HRTICK dependency

Peter Zijlstra (1):
      sched: fix warning in hrtick_start_fair()

roel kluin (1):
      sched: test runtime rather than period in global_rt_runtime()


 kernel/Kconfig.hz    |    2 +-
 kernel/sched.c       |    2 +-
 kernel/sched_clock.c |   19 +++++++++----------
 kernel/sched_fair.c  |    2 +-
 4 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz
index 382dd5a..94fabd5 100644
--- a/kernel/Kconfig.hz
+++ b/kernel/Kconfig.hz
@@ -55,4 +55,4 @@ config HZ
 	default 1000 if HZ_1000
 
 config SCHED_HRTICK
-	def_bool HIGH_RES_TIMERS && USE_GENERIC_SMP_HELPERS
+	def_bool HIGH_RES_TIMERS && (!SMP || USE_GENERIC_SMP_HELPERS)
diff --git a/kernel/sched.c b/kernel/sched.c
index 0236958..0d1717b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -834,7 +834,7 @@ static inline u64 global_rt_period(void)
 
 static inline u64 global_rt_runtime(void)
 {
-	if (sysctl_sched_rt_period < 0)
+	if (sysctl_sched_rt_runtime < 0)
 		return RUNTIME_INF;
 
 	return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
diff --git a/kernel/sched_clock.c b/kernel/sched_clock.c
index 22ed55d..5a2dc7d 100644
--- a/kernel/sched_clock.c
+++ b/kernel/sched_clock.c
@@ -32,6 +32,15 @@
 #include <linux/ktime.h>
 #include <linux/module.h>
 
+/*
+ * Scheduler clock - returns current time in nanosec units.
+ * This is default implementation.
+ * Architectures and sub-architectures can override this.
+ */
+unsigned long long __attribute__((weak)) sched_clock(void)
+{
+	return (unsigned long long)jiffies * (NSEC_PER_SEC / HZ);
+}
 
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
 
@@ -321,16 +330,6 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
 
 #endif
 
-/*
- * Scheduler clock - returns current time in nanosec units.
- * This is default implementation.
- * Architectures and sub-architectures can override this.
- */
-unsigned long long __attribute__((weak)) sched_clock(void)
-{
-	return (unsigned long long)jiffies * (NSEC_PER_SEC / HZ);
-}
-
 unsigned long long cpu_clock(int cpu)
 {
 	unsigned long long clock;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index cf2cd6c..0fe94ea 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -899,7 +899,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 		 * doesn't make sense. Rely on vruntime for fairness.
 		 */
 		if (rq->curr != p)
-			delta = max(10000LL, delta);
+			delta = max_t(s64, 10000LL, delta);
 
 		hrtick_start(rq, delta);
 	}

* Re: [git pull] scheduler fixes
  2008-07-31 22:26   ` Ingo Molnar
@ 2008-07-31 22:55     ` David Miller
  2008-08-01  8:11       ` David Miller
  0 siblings, 1 reply; 22+ messages in thread
From: David Miller @ 2008-07-31 22:55 UTC (permalink / raw)
  To: mingo; +Cc: torvalds, linux-kernel, akpm, a.p.zijlstra

From: Ingo Molnar <mingo@elte.hu>
Date: Fri, 1 Aug 2008 00:26:24 +0200

> * David Miller <davem@davemloft.net> wrote:
> 
> > From: Ingo Molnar <mingo@elte.hu>
> > Date: Thu, 31 Jul 2008 23:43:52 +0200
> > 
> > > Peter Zijlstra (2):
> > >       sched: fix warning in hrtick_start_fair()
> > >       lockdep: change scheduler annotation
> > 
> > Ingo, Peter forgot to tell you but this lockdep change
> > blows up on my Niagara boxes in a way I haven't figured
> > out yet.
> 
> are you sure it's this (trivial) annotation in sched.c that blows up, 
> not your other (more invasive) changes to lockdep? This patch looked 
> fine in my testing.

I am absolutely sure, I spent the whole night yesterday trying to
debug this.

* Re: [git pull] scheduler fixes
  2008-07-31 22:55     ` David Miller
@ 2008-08-01  8:11       ` David Miller
  2008-08-01  9:01         ` Ingo Molnar
  0 siblings, 1 reply; 22+ messages in thread
From: David Miller @ 2008-08-01  8:11 UTC (permalink / raw)
  To: mingo; +Cc: torvalds, linux-kernel, akpm, a.p.zijlstra

From: David Miller <davem@davemloft.net>
Date: Thu, 31 Jul 2008 15:55:04 -0700 (PDT)

> I am absolutely sure, I spent the whole night yesterday trying to
> debug this.

Followup.  I lost two days of my life debugging this because seemingly
nobody can friggin' agree on what to do about the "printk() wakeup
issue".  Thanks!

Can we fix this now, please?

The problem was that Peter's patch triggers a print_deadlock_bug()
in lockdep.c on the runqueue locks.

But those printk()'s quickly want to do a wakeup, which wants to
take the runqueue lock this thread already holds.  So I would only
get the first line of the lockdep debugging followed by a complete
hang.
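
The cycle, roughly (an illustrative call chain, not an exact trace; the
precise wakeup path varies by kernel version):

    spin_lock(&rq->lock);         /* scheduler path already holds rq->lock */
      __lock_acquire()
        print_deadlock_bug()
          printk()
            /* printk wants to wake klogd / console waiters */
            try_to_wake_up()
              task_rq_lock()      /* spins on the rq->lock we hold: hang */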

Doing these wakeups in such a BUG message is unwise.  Please can
we apply something like the following and save other developers
countless wasted hours of their time?

Thanks :-)

--------------------

debug_locks: Set oops_in_progress if we will log messages.

Otherwise lock debugging messages on runqueue locks can deadlock the
system due to the wakeups performed by printk().

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/lib/debug_locks.c b/lib/debug_locks.c
index 0ef01d1..0218b46 100644
--- a/lib/debug_locks.c
+++ b/lib/debug_locks.c
@@ -8,6 +8,7 @@
  *
  *  Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
  */
+#include <linux/kernel.h>
 #include <linux/rwsem.h>
 #include <linux/mutex.h>
 #include <linux/module.h>
@@ -37,6 +38,7 @@ int debug_locks_off(void)
 {
 	if (xchg(&debug_locks, 0)) {
 		if (!debug_locks_silent) {
+			oops_in_progress = 1;
 			console_verbose();
 			return 1;
 		}

* Re: [git pull] scheduler fixes
  2008-08-01  8:11       ` David Miller
@ 2008-08-01  9:01         ` Ingo Molnar
  2008-08-01  9:13           ` David Miller
  0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2008-08-01  9:01 UTC (permalink / raw)
  To: David Miller; +Cc: torvalds, linux-kernel, akpm, a.p.zijlstra


* David Miller <davem@davemloft.net> wrote:

> From: David Miller <davem@davemloft.net>
> Date: Thu, 31 Jul 2008 15:55:04 -0700 (PDT)
> 
> > I am absolutely sure, I spent the whole night yesterday trying to 
> > debug this.
> 
> Followup.  I lost two days of my life debugging this because seemingly 
> nobody can friggin' agree on what to do about the "printk() wakeup 
> issue".  Thanks!
> 
> Can we fix this now, please?
> 
> The problem was that Peter's patch triggers a print_deadlock_bug() in 
> lockdep.c on the runqueue locks.
> 
> But those printk()'s quickly want to do a wakeup, which wants to take 
> the runqueue lock this thread already holds.  So I would only get the 
> first line of the lockdep debugging followed by a complete hang.

ugh. In the context of lockdep you are the first one to trigger this bug 
- thanks for the fix!

We had a few other incidents of printks generated by bugs within the 
scheduler code causing lockups (due to the wakeup) - and Steve Rostedt 
sent a more generic solution for that: to first trylock the runqueue 
lock in that case instead of doing an unconditional wakeup.

The patch made it to linux-next, but Andrew NAK-ed it because it caused 
other problems: it made printk wakeups conceptually less reliable (a 
lock held by another CPU could prevent a printk wakeup from propagating 
and block klogd indefinitely).
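
The trylock idea, as a rough sketch (this is not Steve's actual patch;
printk_wake_klogd() and try_to_wake_up_locked() are made-up names):

    /* sketch: the printk wakeup path trylocks the runqueue instead of
     * spinning on it, so it can never deadlock against a holder */
    static int printk_wake_klogd(struct task_struct *klogd)
    {
    	struct rq *rq = task_rq(klogd);

    	if (!spin_trylock(&rq->lock))
    		return 0;			/* wakeup silently lost */
    	try_to_wake_up_locked(rq, klogd);
    	spin_unlock(&rq->lock);
    	return 1;
    }

The lost-wakeup case in the sketch is precisely the reliability problem
that drew the NAK.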

> Doing these wakeups in such a BUG message is unwise.  Please can we 
> apply something like the following and save other developers countless 
> wasted hours of their time?
> 
> Thanks :-)

applied to tip/core/locking.

I'm wondering, does this mean that Peter's:

    lockdep: change scheduler annotation

is still not good enough yet?

	Ingo

* Re: [git pull] scheduler fixes
  2008-08-01  9:01         ` Ingo Molnar
@ 2008-08-01  9:13           ` David Miller
  2008-08-01 11:08             ` [PATCH] lockdep: lock_set_subclass - reset a held lock's subclass Peter Zijlstra
  2008-08-01 11:08             ` [PATCH] lockdep: re-annotate scheduler runqueues Peter Zijlstra
  0 siblings, 2 replies; 22+ messages in thread
From: David Miller @ 2008-08-01  9:13 UTC (permalink / raw)
  To: mingo; +Cc: torvalds, linux-kernel, akpm, a.p.zijlstra

From: Ingo Molnar <mingo@elte.hu>
Date: Fri, 1 Aug 2008 11:01:00 +0200

> I'm wondering, does this mean that Peter's:
> 
>     lockdep: change scheduler annotation
> 
> is still not good enough yet?

It doesn't work; the problem is this:

[ 7819.764377] =============================================
[ 7819.775495] [ INFO: possible recursive locking detected ]
[ 7819.780969] 2.6.27-rc1-lockdep #12
[ 7819.786256] ---------------------------------------------
[ 7819.791548] cc1/29466 is trying to acquire lock:
[ 7819.796685]  (&rq->lock/1){.+..}, at: [<0000000000455548>] double_lock_balance+0x78/0x90
[ 7819.806849] 
[ 7819.806859] but task is already holding lock:
[ 7819.815871]  (&rq->lock/1){.+..}, at: [<0000000000455538>] double_lock_balance+0x68/0x90
[ 7819.825448] 
[ 7819.825456] other info that might help us debug this:
[ 7819.834373] 1 lock held by cc1/29466:
[ 7819.838709]  #0:  (&rq->lock/1){.+..}, at: [<0000000000455538>] double_lock_balance+0x68/0x90
[ 7819.847818] 
[ 7819.847827] stack backtrace:
[ 7819.855832] Call Trace:
[ 7819.859696]  [000000000047ecac] __lock_acquire+0xbdc/0xf98
[ 7819.863669]  [000000000047f7c4] lock_acquire+0x64/0x7c
[ 7819.867613]  [00000000006d67d4] _spin_lock_nested+0x1c/0x58
[ 7819.871612]  [0000000000455548] double_lock_balance+0x78/0x90
[ 7819.875600]  [00000000006d3afc] schedule+0x324/0x8d0
[ 7819.879536]  [00000000004c089c] pipe_wait+0x58/0x8c
[ 7819.883462]  [00000000004c0c94] pipe_write+0x3c4/0x480
[ 7819.887597]  [00000000004ba150] do_sync_write+0x80/0xd0
[ 7819.891460]  [00000000004ba92c] vfs_write+0x70/0x10c
[ 7819.895264]  [00000000004bad18] sys_write+0x2c/0x60
[ 7819.899073]  [0000000000406294] linux_sparc_syscall32+0x34/0x40

* [PATCH] lockdep: lock_set_subclass - reset a held lock's subclass
  2008-08-01  9:13           ` David Miller
@ 2008-08-01 11:08             ` Peter Zijlstra
  2008-08-01 18:06               ` Jeremy Fitzhardinge
  2008-08-01 11:08             ` [PATCH] lockdep: re-annotate scheduler runqueues Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2008-08-01 11:08 UTC (permalink / raw)
  To: David Miller; +Cc: mingo, torvalds, linux-kernel, akpm

From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: lockdep: lock_set_subclass - reset a held lock's subclass

This can be used to reset a held lock's subclass, for arbitrary-depth
iterated data structures such as trees or lists which have per-node
locks.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/lockdep.h |    4 ++
 kernel/lockdep.c        |   69 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 73 insertions(+)

Index: linux-2.6/include/linux/lockdep.h
===================================================================
--- linux-2.6.orig/include/linux/lockdep.h
+++ linux-2.6/include/linux/lockdep.h
@@ -300,6 +300,9 @@ extern void lock_acquire(struct lockdep_
 extern void lock_release(struct lockdep_map *lock, int nested,
 			 unsigned long ip);
 
+extern void lock_set_subclass(struct lockdep_map *lock, unsigned int subclass,
+			      unsigned long ip);
+
 # define INIT_LOCKDEP				.lockdep_recursion = 0,
 
 #define lockdep_depth(tsk)	(debug_locks ? (tsk)->lockdep_depth : 0)
@@ -316,6 +319,7 @@ static inline void lockdep_on(void)
 
 # define lock_acquire(l, s, t, r, c, i)		do { } while (0)
 # define lock_release(l, n, i)			do { } while (0)
+# define lock_set_subclass(l, s, i)		do { } while (0)
 # define lockdep_init()				do { } while (0)
 # define lockdep_info()				do { } while (0)
 # define lockdep_init_map(lock, name, key, sub)	do { (void)(key); } while (0)
Index: linux-2.6/kernel/lockdep.c
===================================================================
--- linux-2.6.orig/kernel/lockdep.c
+++ linux-2.6/kernel/lockdep.c
@@ -2658,6 +2658,55 @@ static int check_unlock(struct task_stru
 	return 1;
 }
 
+static int
+__lock_set_subclass(struct lockdep_map *lock,
+		    unsigned int subclass, unsigned long ip)
+{
+	struct task_struct *curr = current;
+	struct held_lock *hlock, *prev_hlock;
+	struct lock_class *class;
+	unsigned int depth;
+	int i;
+
+	depth = curr->lockdep_depth;
+	if (DEBUG_LOCKS_WARN_ON(!depth))
+		return 0;
+
+	prev_hlock = NULL;
+	for (i = depth-1; i >= 0; i--) {
+		hlock = curr->held_locks + i;
+		/*
+		 * We must not cross into another context:
+		 */
+		if (prev_hlock && prev_hlock->irq_context != hlock->irq_context)
+			break;
+		if (hlock->instance == lock)
+			goto found_it;
+		prev_hlock = hlock;
+	}
+	return print_unlock_inbalance_bug(curr, lock, ip);
+
+found_it:
+	class = register_lock_class(lock, subclass, 0);
+	hlock->class = class;
+
+	curr->lockdep_depth = i;
+	curr->curr_chain_key = hlock->prev_chain_key;
+
+	for (; i < depth; i++) {
+		hlock = curr->held_locks + i;
+		if (!__lock_acquire(hlock->instance,
+			hlock->class->subclass, hlock->trylock,
+				hlock->read, hlock->check, hlock->hardirqs_off,
+				hlock->acquire_ip))
+			return 0;
+	}
+
+	if (DEBUG_LOCKS_WARN_ON(curr->lockdep_depth != depth))
+		return 0;
+	return 1;
+}
+
 /*
  * Remove the lock to the list of currently held locks in a
  * potentially non-nested (out of order) manner. This is a
@@ -2822,6 +2871,26 @@ static void check_flags(unsigned long fl
 #endif
 }
 
+void
+lock_set_subclass(struct lockdep_map *lock,
+		  unsigned int subclass, unsigned long ip)
+{
+	unsigned long flags;
+
+	if (unlikely(current->lockdep_recursion))
+		return;
+
+	raw_local_irq_save(flags);
+	current->lockdep_recursion = 1;
+	check_flags(flags);
+	if (__lock_set_subclass(lock, subclass, ip))
+		check_chain_key(current);
+	current->lockdep_recursion = 0;
+	raw_local_irq_restore(flags);
+}
+
+EXPORT_SYMBOL_GPL(lock_set_subclass);
+
 /*
  * We are not always called with irqs disabled - do that here,
  * and also avoid lockdep recursion:



* [PATCH] lockdep: re-annotate scheduler runqueues
  2008-08-01  9:13           ` David Miller
  2008-08-01 11:08             ` [PATCH] lockdep: lock_set_subclass - reset a held lock's subclass Peter Zijlstra
@ 2008-08-01 11:08             ` Peter Zijlstra
  2008-08-01 17:04               ` Linus Torvalds
  2008-08-02  8:34               ` David Miller
  1 sibling, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2008-08-01 11:08 UTC (permalink / raw)
  To: David Miller; +Cc: mingo, torvalds, linux-kernel, akpm

Ingo,

I cannot seem to reproduce this on my quad (I'll let it run a few hours
just to make sure). Could you meanwhile try it on a somewhat larger
machine?

The thing that triggered it for David is a regular kernel build with
large concurrency.

(don't forget to 'fix' printk or apply David's oops_in_progress patch)

---
Subject: lockdep: re-annotate scheduler runqueues

Instead of using a per-rq lock class, use the regular nesting operations.

However, take extra care with double_lock_balance() as it can release the
already held rq->lock (and therefore change its nesting class).

So what can happen is:

 spin_lock(rq->lock);	// this rq subclass 0

 double_lock_balance(rq, other_rq);
   // release rq
   // acquire other_rq->lock subclass 0
   // acquire rq->lock subclass 1

 spin_unlock(other_rq->lock);

leaving you with rq->lock in subclass 1

So a subsequent double_lock_balance() call can try to nest a subclass 1
lock while already holding a subclass 1 lock.

Fix this by introducing double_unlock_balance() which releases the other
rq's lock, but also re-sets the subclass for this rq's lock to 0.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c    |   21 +++++++++++++--------
 kernel/sched_rt.c |    8 +++++---
 2 files changed, 18 insertions(+), 11 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -600,7 +600,6 @@ struct rq {
 	/* BKL stats */
 	unsigned int bkl_count;
 #endif
-	struct lock_class_key rq_lock_key;
 };
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
@@ -2759,10 +2758,10 @@ static void double_rq_lock(struct rq *rq
 	} else {
 		if (rq1 < rq2) {
 			spin_lock(&rq1->lock);
-			spin_lock(&rq2->lock);
+			spin_lock_nested(&rq2->lock, SINGLE_DEPTH_NESTING);
 		} else {
 			spin_lock(&rq2->lock);
-			spin_lock(&rq1->lock);
+			spin_lock_nested(&rq1->lock, SINGLE_DEPTH_NESTING);
 		}
 	}
 	update_rq_clock(rq1);
@@ -2805,14 +2804,21 @@ static int double_lock_balance(struct rq
 		if (busiest < this_rq) {
 			spin_unlock(&this_rq->lock);
 			spin_lock(&busiest->lock);
-			spin_lock(&this_rq->lock);
+			spin_lock_nested(&this_rq->lock, SINGLE_DEPTH_NESTING);
 			ret = 1;
 		} else
-			spin_lock(&busiest->lock);
+			spin_lock_nested(&busiest->lock, SINGLE_DEPTH_NESTING);
 	}
 	return ret;
 }
 
+static void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
+	__releases(busiest->lock)
+{
+	spin_unlock(&busiest->lock);
+	lock_set_subclass(&this_rq->lock.dep_map, 0, _RET_IP_);
+}
+
 /*
  * If dest_cpu is allowed for this process, migrate the task to it.
  * This is accomplished by forcing the cpu_allowed mask to only
@@ -3637,7 +3643,7 @@ redo:
 		ld_moved = move_tasks(this_rq, this_cpu, busiest,
 					imbalance, sd, CPU_NEWLY_IDLE,
 					&all_pinned);
-		spin_unlock(&busiest->lock);
+		double_unlock_balance(this_rq, busiest);
 
 		if (unlikely(all_pinned)) {
 			cpu_clear(cpu_of(busiest), *cpus);
@@ -3752,7 +3758,7 @@ static void active_load_balance(struct r
 		else
 			schedstat_inc(sd, alb_failed);
 	}
-	spin_unlock(&target_rq->lock);
+	double_unlock_balance(busiest_rq, target_rq);
 }
 
 #ifdef CONFIG_NO_HZ
@@ -8000,7 +8006,6 @@ void __init sched_init(void)
 
 		rq = cpu_rq(i);
 		spin_lock_init(&rq->lock);
-		lockdep_set_class(&rq->lock, &rq->rq_lock_key);
 		rq->nr_running = 0;
 		init_cfs_rq(&rq->cfs, rq);
 		init_rt_rq(&rq->rt, rq);
Index: linux-2.6/kernel/sched_rt.c
===================================================================
--- linux-2.6.orig/kernel/sched_rt.c
+++ linux-2.6/kernel/sched_rt.c
@@ -861,6 +861,8 @@ static void put_prev_task_rt(struct rq *
 #define RT_MAX_TRIES 3
 
 static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
+static void double_unlock_balance(struct rq *this_rq, struct rq *busiest);
+
 static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep);
 
 static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
@@ -1022,7 +1024,7 @@ static struct rq *find_lock_lowest_rq(st
 			break;
 
 		/* try again */
-		spin_unlock(&lowest_rq->lock);
+		double_unlock_balance(rq, lowest_rq);
 		lowest_rq = NULL;
 	}
 
@@ -1091,7 +1093,7 @@ static int push_rt_task(struct rq *rq)
 
 	resched_task(lowest_rq->curr);
 
-	spin_unlock(&lowest_rq->lock);
+	double_unlock_balance(rq, lowest_rq);
 
 	ret = 1;
 out:
@@ -1197,7 +1199,7 @@ static int pull_rt_task(struct rq *this_
 
 		}
  skip:
-		spin_unlock(&src_rq->lock);
+		double_unlock_balance(this_rq, src_rq);
 	}
 
 	return ret;



* Re: [PATCH] lockdep: re-annotate scheduler runqueues
  2008-08-01 11:08             ` [PATCH] lockdep: re-annotate scheduler runqueues Peter Zijlstra
@ 2008-08-01 17:04               ` Linus Torvalds
  2008-08-02  8:34               ` David Miller
  1 sibling, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2008-08-01 17:04 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: David Miller, mingo, linux-kernel, akpm



Ingo, I hope you haven't gone off for a vacation yet, because right now 
I'm not sure any of the scheduler trees are safe to pull. So pls advise.

		Linus

* Re: [PATCH] lockdep: lock_set_subclass - reset a held lock's subclass
  2008-08-01 11:08             ` [PATCH] lockdep: lock_set_subclass - reset a held lock's subclass Peter Zijlstra
@ 2008-08-01 18:06               ` Jeremy Fitzhardinge
  2008-08-01 18:14                 ` Linus Torvalds
  2008-08-01 20:49                 ` Peter Zijlstra
  0 siblings, 2 replies; 22+ messages in thread
From: Jeremy Fitzhardinge @ 2008-08-01 18:06 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: David Miller, mingo, torvalds, linux-kernel, akpm

Peter Zijlstra wrote:
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Subject: lockdep: lock_set_subclass - reset a held lock's subclass
>
> this can be used to reset a held lock's subclass, for arbitrary-depth
> iterated data structures such as trees or lists which have per-node
> locks.
>   

Hey, can I use this to suppress the spurious lockdep warnings I get when 
I try to hold more than one pte lock at once?

    J

* Re: [PATCH] lockdep: lock_set_subclass - reset a held lock's subclass
  2008-08-01 18:06               ` Jeremy Fitzhardinge
@ 2008-08-01 18:14                 ` Linus Torvalds
  2008-08-01 19:10                   ` Jeremy Fitzhardinge
  2008-08-01 20:49                 ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2008-08-01 18:14 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Peter Zijlstra, David Miller, mingo, linux-kernel, akpm



On Fri, 1 Aug 2008, Jeremy Fitzhardinge wrote:
> 
> Hey, can I use this to suppress the spurious lockdep warnings I get when I try
> to hold more than one pte lock at once?

So how sure are you that they are spurious?

			Linus

* Re: [PATCH] lockdep: lock_set_subclass - reset a held lock's subclass
  2008-08-01 18:14                 ` Linus Torvalds
@ 2008-08-01 19:10                   ` Jeremy Fitzhardinge
  2008-08-01 19:24                     ` Linus Torvalds
  2008-08-01 19:59                     ` Hugh Dickins
  0 siblings, 2 replies; 22+ messages in thread
From: Jeremy Fitzhardinge @ 2008-08-01 19:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Peter Zijlstra, David Miller, mingo, linux-kernel, akpm

Linus Torvalds wrote:
> On Fri, 1 Aug 2008, Jeremy Fitzhardinge wrote:
>   
>> Hey, can I use this to suppress the spurious lockdep warnings I get when I try
>> to hold more than one pte lock at once?
>>     
>
> So how sure are you that they are spurious?
>   

I have a function traversing a pagetable in vaddr order (low to high), 
taking pte locks as it builds up batches of pte page updates.  When the 
batch is issued, it releases all the locks, and won't end up holding 
more than ~16 at a time.

So, I think this is OK.  There are no internal lock ordering issues, and 
I don't think there'll be any bad interactions from someone trying to 
take pte locks for two separate pagetables.  I don't think there's 
anyone else trying to take more than one pte lock at once, but if there 
were "lock low vaddr then high" seems like a reasonable locking rule (or 
more precisely "lowest" to deal with the case of a pte page being 
aliased at multiple vaddrs).
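
In outline it does something like this (a simplified sketch; the helper
names are made up and error handling is dropped):

    /* walk the pagetable low->high by vaddr, one pte lock per pte page */
    for (addr = start; addr < end; addr = next_pte_page(addr)) {
    	spinlock_t *ptl = pte_lockptr(mm, pmd_for(mm, addr));

    	spin_lock(ptl);			/* taken in vaddr order */
    	queue_pte_updates(mm, addr);	/* batch up the pte writes */

    	if (batch_full()) {
    		issue_batch();		/* flush the updates ... */
    		drop_held_pte_locks();	/* ... then release every lock */
    	}
    }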

Lockdep complains because the split pte locks are all in the same lock 
class, so it reports it as taking a spinlock recursively.  I'd reported 
this to PeterZ before, and he responded with "uh, I'll think about 
it...", which I took to mean "...when someone else has a problem".  So 
I'm wondering if this is that time...

(This is all with split pte locks, obviously.)

    J

* Re: [PATCH] lockdep: lock_set_subclass - reset a held lock's subclass
  2008-08-01 19:10                   ` Jeremy Fitzhardinge
@ 2008-08-01 19:24                     ` Linus Torvalds
  2008-08-01 20:08                       ` Jeremy Fitzhardinge
  2008-08-01 19:59                     ` Hugh Dickins
  1 sibling, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2008-08-01 19:24 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Peter Zijlstra, David Miller, mingo, linux-kernel, akpm



On Fri, 1 Aug 2008, Jeremy Fitzhardinge wrote:
> 
> I have a function traversing a pagetable in vaddr order (low to high), taking
> pte locks as it builds up batches of pte page updates.  When the batch is
> issued, it releases all the locks, and won't end up holding more than ~16 at a
> time.

Hmm. 

With hashed locks, that is _not_ safe in general. Now, it may well be safe 
in your case, so I'm not going to say you have a bug, but even if you do 
them in vaddr order, the thing is, we don't guarantee that the _locks_ are 
ordered in virtual address order.

Right now, I think the pte locks are either a single one per mm (in which 
case I assume you don't take any lock at all), or it's a lock that is 
embedded in the pmd page iirc.

What if you have pmd sharing through some shared area being mapped at two 
different processes at different addresses? Yeah, I don't think we share 
pmd's at all (except if you use hugetables and for the kernel), but it's 
one of those things where subtle changes in how the pte lock allocation 
could cause problems.

Eg, I could easily see somebody doing the pte lock as a hash over not just 
the address, but the "struct mm" pointer too. At which point different 
parts of the address space might even share the PTE lock, and you'd get 
deadlocks even without any ABBA behavior, just because the pte lock might 
be A B C A or something inside the same process.

			Linus

* Re: [PATCH] lockdep: lock_set_subclass - reset a held lock's subclass
  2008-08-01 19:10                   ` Jeremy Fitzhardinge
  2008-08-01 19:24                     ` Linus Torvalds
@ 2008-08-01 19:59                     ` Hugh Dickins
  2008-08-01 20:22                       ` Jeremy Fitzhardinge
  2008-08-01 23:20                       ` David Miller
  1 sibling, 2 replies; 22+ messages in thread
From: Hugh Dickins @ 2008-08-01 19:59 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Linus Torvalds, Peter Zijlstra, David Miller, mingo, linux-kernel, akpm

On Fri, 1 Aug 2008, Jeremy Fitzhardinge wrote:
> 
> I have a function traversing a pagetable in vaddr order (low to high), taking
> pte locks as it builds up batches of pte page updates.  When the batch is
> issued, it releases all the locks, and won't end up holding more than ~16 at a
> time.
> 
> So, I think this is OK.  There are no internal lock ordering issues, and I
> don't think there'll be any bad interactions from someone trying to take pte
> locks for two separate pagetables.  I don't think there's anyone else trying
> to take more than one pte lock at once, but if there were "lock low vaddr then
> high" seems like a reasonable locking rule (or more precisely "lowest" to deal
> with the case of a pte page being aliased at multiple vaddrs).

Please check the spin_lock_nested() in move_ptes() in mm/mremap.c.

If you have down_write(&mm->mmap_sem) then you should be safe,
but may need to do something to placate lockdep.  If you don't
have down_write(&mm->mmap_sem), then I think you're in trouble?
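
For reference, the nesting in move_ptes() is roughly this (abbreviated
from mm/mremap.c of that era; check the actual source):

    old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
    new_pte = pte_offset_map_nested(new_pmd, new_addr);
    new_ptl = pte_lockptr(mm, new_pmd);
    if (new_ptl != old_ptl)
    	spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);

Note that it nests new under old regardless of vaddr order: it relies on
down_write(&mm->mmap_sem) for exclusion, not on an ordering rule.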

Not a big deal, the move_ptes() locking can be adjusted to suit
your rule, it was just easier to do it the way it is at the time.

Hugh

* Re: [PATCH] lockdep: lock_set_subclass - reset a held lock's subclass
  2008-08-01 19:24                     ` Linus Torvalds
@ 2008-08-01 20:08                       ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 22+ messages in thread
From: Jeremy Fitzhardinge @ 2008-08-01 20:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Peter Zijlstra, David Miller, mingo, linux-kernel, akpm

Linus Torvalds wrote:
> Hmm. 
>
> With hashed locks, that is _not_ safe in general. Now, it may well be safe 
> in your case, so I'm not going to say you have a bug, but even if you do 
> them in vaddr order, the thing is, we don't guarantee that the _locks_ are 
> ordered in virtual address order.
>   

Yes, its current simplicity definitely relies on a simple relationship 
between pte pages and locks.

> Right now, I think the pte locks are either a single one per mm (in which 
> case I assume you don't take any lock at all), or it's a lock that is 
> embedded in the pmd page iirc.
>   

Actually the locks are in the pte page, so it's already fairly 
fine-grained.  If it were made coarser - at the pmd level - then I'd 
simply move my lock-taking out a level in the pagetable traversal.

But making it finer - by hashing individual pte entries to their own 
locks - would work badly for me.  I'm taking the lock specifically to 
protect against updates to any pte within the pte page, so I'd have to 
manually take all the locks that the ptes could possibly hash to.  
Presumably in that eventuality we could define correct order for taking 
the hashed pte locks, independent of how the ptes actually map to them 
(for example, always take locks in low->high order of *lock* address).
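
Something like this, say (a sketch; cmp_lockptr() is a made-up
comparator):

    /* lock a batch of (possibly duplicated) hashed pte locks safely:
     * sort by lock address, then take each distinct lock exactly once */
    sort(ptls, n, sizeof(*ptls), cmp_lockptr, NULL);
    for (i = 0; i < n; i++)
    	if (i == 0 || ptls[i] != ptls[i - 1])
    		spin_lock(ptls[i]);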

> What if you have pmd sharing through some shared area being mapped at two 
> different processes at different addresses? Yeah, I don't think we share 
> pmd's at all (except if you use hugetables and for the kernel), but it's 
> one of those things where subtle changes in how the pte lock allocation 
> could cause problems.
>   

In this particular case it isn't an issue.  The traversal is performing 
first/last use init/deinit, so any shared parts of the pagetable can be 
skipped with no action because they're already done.  Presumably there'd 
be separate lifetime management for whatever parts of the pagetable can 
be shared, so I'd need to hook in there, rather than the wholesale 
pagetable create/destroy as I do now.

> Eg, I could easily see somebody doing the pte lock as a hash over not just 
> the address, but the "struct mm" pointer too. At which point different 
> parts of the address space might even share the PTE lock, and you'd get 
> deadlocks even without any ABBA behavior, just because the pte lock might 
> be A B C A or something inside the same process.
>   

Yep.  That would be awkward.  A function which says "here's a pte page, 
return the set of all pte locks these ptes map to in correct locking 
order" would be useful in that case.  Ugh, but still unpleasant.

    J

* Re: [PATCH] lockdep: lock_set_subclass - reset a held lock's subclass
  2008-08-01 19:59                     ` Hugh Dickins
@ 2008-08-01 20:22                       ` Jeremy Fitzhardinge
  2008-08-01 20:33                         ` Hugh Dickins
  2008-08-01 23:20                       ` David Miller
  1 sibling, 1 reply; 22+ messages in thread
From: Jeremy Fitzhardinge @ 2008-08-01 20:22 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Peter Zijlstra, David Miller, mingo, linux-kernel, akpm

Hugh Dickins wrote:
> Please check the spin_lock_nested() in move_ptes() in mm/mremap.c.
>
> If you have down_write(&mm->mmap_sem) then you should be safe,
> but may need to do something to placate lockdep.  If you don't
> have down_write(&mm->mmap_sem), then I think you're in trouble?
>
> Not a big deal, the move_ptes() locking can be adjusted to suit
> your rule, it was just easier to do it the way it is at the time.

Ah, yes, I did look at that.  I think it isn't an issue, because my code 
is called from dup_mmap(), activate_mm() or exit_mmap().

dup_mmap() already holds mmap_sem.

activate_mm() in exec doesn't hold the sem, but I don't think it's 
possible for anyone to be racing against it.
activate_mm() in unshare doesn't seem to get used.

exit_mmap() gets called when there are no other users, so we'd better 
not be racing...

    J

* Re: [PATCH] lockdep: lock_set_subclass - reset a held lock's subclass
  2008-08-01 20:22                       ` Jeremy Fitzhardinge
@ 2008-08-01 20:33                         ` Hugh Dickins
  0 siblings, 0 replies; 22+ messages in thread
From: Hugh Dickins @ 2008-08-01 20:33 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Linus Torvalds, Peter Zijlstra, David Miller, mingo, linux-kernel, akpm

On Fri, 1 Aug 2008, Jeremy Fitzhardinge wrote:
> Hugh Dickins wrote:
> > Please check the spin_lock_nested() in move_ptes() in mm/mremap.c.
> >
> > If you have down_write(&mm->mmap_sem) then you should be safe,
> > but may need to do something to placate lockdep.  If you don't
> > have down_write(&mm->mmap_sem), then I think you're in trouble?
> >
> > Not a big deal, the move_ptes() locking can be adjusted to suit
> > your rule, it was just easier to do it the way it is at the time.
> 
> Ah, yes, I did look at that.  I think it isn't an issue, because my code is
> called from dup_mmap(), activate_mm() or exit_mmap().
> 
> dup_mmap() already holds mmap_sem.
> 
> activate_mm() in exec doesn't hold the sem, but I don't think it's possible
> for anyone to be racing against it.
> activate_mm() in unshare doesn't seem to get used.
> 
> exit_mmap() gets called when there are no other users, so we'd better not be
> racing...

All safe against mremap, agreed.

Hugh

* Re: [PATCH] lockdep: lock_set_subclass - reset a held lock's subclass
  2008-08-01 18:06               ` Jeremy Fitzhardinge
  2008-08-01 18:14                 ` Linus Torvalds
@ 2008-08-01 20:49                 ` Peter Zijlstra
  1 sibling, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2008-08-01 20:49 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: David Miller, mingo, torvalds, linux-kernel, akpm

On Fri, 2008-08-01 at 11:06 -0700, Jeremy Fitzhardinge wrote:
> Peter Zijlstra wrote:
> > From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Subject: lockdep: lock_set_subclass - reset a held lock's subclass
> >
> > this can be used to reset a held lock's subclass, for arbitrary-depth
> > iterated data structures such as trees or lists which have per-node
> > locks.
> >   
> 
> Hey, can I use this to suppress the spurious lockdep warnings I get when 
> I try to hold more than one pte lock at once?

No, you still cannot hold more than 48 locks at any one time, and this
somewhat horrible annotation only works if you don't generate double
classes in the same chain.

What it can do is:

  L1
   L2
    U1
     L3
      U2
       etc..

What you do is:

  spin_lock(L1);
   spin_lock_nested(L2, 1);
    spin_unlock(L1);
    lock_set_subclass(L2, 0); // because L1, which was subclass 0, is gone
     spin_lock_nested(L3, 1);
      spin_unlock(L2);
      lock_set_subclass(L3, 0); // and so on..
       etc...




* Re: [PATCH] lockdep: lock_set_subclass - reset a held lock's subclass
  2008-08-01 19:59                     ` Hugh Dickins
  2008-08-01 20:22                       ` Jeremy Fitzhardinge
@ 2008-08-01 23:20                       ` David Miller
  2008-08-01 23:26                         ` Linus Torvalds
  1 sibling, 1 reply; 22+ messages in thread
From: David Miller @ 2008-08-01 23:20 UTC (permalink / raw)
  To: hugh; +Cc: jeremy, torvalds, a.p.zijlstra, mingo, linux-kernel, akpm

From: Hugh Dickins <hugh@veritas.com>
Date: Fri, 1 Aug 2008 20:59:08 +0100 (BST)

> On Fri, 1 Aug 2008, Jeremy Fitzhardinge wrote:
> > 
> > I have a function traversing a pagetable in vaddr order (low to high), taking
> > pte locks as it builds up batches of pte page updates.  When the batch is
> > issued, it releases all the locks, and won't end up holding more than ~16 at a
> > time.
> > 
> > So, I think this is OK.  There are no internal lock ordering issues, and I
> > don't think there'll be any bad interactions from someone trying to take pte
> > locks for two separate pagetables.  I don't think there's anyone else trying
> > to take more than one pte lock at once, but if there were "lock low vaddr then
> > high" seems like a reasonable locking rule (or more precisely "lowest" to deal
> > with the case of a pte page being aliased at multiple vaddrs).
> 
> Please check the spin_lock_nested() in move_ptes() in mm/mremap.c.

It won't work because spin_lock_nested() is limited to a depth
of 8 and he apparently needs 16.

Taking more than a few locks of the same class at once is bad
news and it's better to find an alternative method.


* Re: [PATCH] lockdep: lock_set_subclass - reset a held lock's subclass
  2008-08-01 23:20                       ` David Miller
@ 2008-08-01 23:26                         ` Linus Torvalds
  0 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2008-08-01 23:26 UTC (permalink / raw)
  To: David Miller; +Cc: hugh, jeremy, a.p.zijlstra, mingo, linux-kernel, akpm



On Fri, 1 Aug 2008, David Miller wrote:
> 
> Taking more than a few locks of the same class at once is bad
> news and it's better to find an alternative method.

It's not always wrong.

If you can guarantee that anybody that takes more than one lock of a 
particular class will always take a single top-level lock _first_, then 
that's all good. You can obviously screw up and take the same lock _twice_ 
(which will deadlock), but at least you cannot get into ABBA situations.

So maybe the right thing to do is to just teach lockdep about "lock 
protection locks". That would have solved the multi-queue issues for 
networking too - all the actual network drivers would still have taken 
just their single queue lock, but the one case that needs to take all of 
them would have taken a separate top-level lock first.
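
Concretely, the pattern would look like this (a sketch; NR_QUEUES is a
placeholder, and today lockdep would still warn on the loop, which is
exactly what it would have to be taught about):

    spinlock_t all_queues_lock;		/* the "lock protection lock" */
    spinlock_t queue_lock[NR_QUEUES];	/* all in the same lock class */

    /* fast path: each driver takes exactly one queue lock */
    spin_lock(&queue_lock[i]);

    /* slow path: whoever needs more than one takes the top-level lock
     * first, so no ABBA between queue locks is possible */
    spin_lock(&all_queues_lock);
    for (i = 0; i < NR_QUEUES; i++)
    	spin_lock(&queue_lock[i]);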

Never mind that the multi-queue locks were always taken in the same order: 
it's never wrong to just have some top-level serialization, and anybody 
who needs to take <n> locks might as well do <n+1>, because they sure as 
hell aren't going to be on _any_ fastpaths.

So the simplest solution really sounds like just teaching lockdep about 
that one special case. It's not "nesting" exactly, although it's obviously 
related to it.

			Linus

* Re: [PATCH] lockdep: re-annotate scheduler runqueues
  2008-08-01 11:08             ` [PATCH] lockdep: re-annotate scheduler runqueues Peter Zijlstra
  2008-08-01 17:04               ` Linus Torvalds
@ 2008-08-02  8:34               ` David Miller
  1 sibling, 0 replies; 22+ messages in thread
From: David Miller @ 2008-08-02  8:34 UTC (permalink / raw)
  To: a.p.zijlstra; +Cc: mingo, torvalds, linux-kernel, akpm

From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Fri, 01 Aug 2008 13:08:43 +0200

> The thing that triggered it for David is a regular kernel build with
> large concurrency.

Peter, I've been beating up these two patches here for a while
on my 64-cpu box and it hasn't budged yet.

I'll let you know if anything happens, but it looks good so far.
