All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: irq lock inversion
       [not found]     ` <4AF28D7A.6020209@kernel.org>
@ 2009-11-05 14:31       ` Jiri Kosina
  2009-11-06  5:53         ` Tejun Heo
  0 siblings, 1 reply; 18+ messages in thread
From: Jiri Kosina @ 2009-11-05 14:31 UTC (permalink / raw)
  To: Tejun Heo, Ingo Molnar, Peter Zijlstra
  Cc: Yinghai Lu, Thomas Gleixner, cl, linux-kernel


[ added LKML to CC so that the lockdep message is at least indexed by 
  search engines in archives ]

On Thu, 5 Nov 2009, Tejun Heo wrote:

> > lockdep only considers a lock irq-safe if it was used from an irq 
> > context before.
> > 
> > _irqsave() API usage alone does not trigger this.
> 
> Thanks for the explanation.  It's about the same tho.  sched_init() is
> calling it with irq disabled but that's an exception to might_sleep()
> rule.  Maybe making lockdep recognize the same exception as
> might_sleep() so that lockdep doesn't consider a lock irq-safe if it's
> called with irq off but before system_state is set to SYSTEM_RUNNING
> works around the problem?

Hmm, I wonder why I don't see this lockdep warning myself with 
head on 1836d9592, even though I have 

	CONFIG_PROVE_LOCKING=y
	CONFIG_TRACE_IRQFLAGS=y

... ?

Anyway, how about something like this? (I can't verify myself that it even 
fixes the warning, as I don't see it for some odd reason)




From: Jiri Kosina <jkosina@suse.cz>
Subject: lockdep: avoid false positives about irq-safety

Commit 403a91b1 ("percpu: allow pcpu_alloc() to be called
with IRQs off") introduced this warning:

	=========================================================
	[ INFO: possible irq lock inversion dependency detected ]
	2.6.32-rc5-tip-04815-g12f0f93-dirty #745
	---------------------------------------------------------
	hub 1-3:1.0: state 7 ports 2 chg 0000 evt 0004
	ksoftirqd/65/199 just changed the state of lock:
	 (pcpu_lock){..-...}, at: [<ffffffff81130e04>] free_percpu+0x38/0x104
	but this lock took another, SOFTIRQ-unsafe lock in the past:
	 (vmap_area_lock){+.+...}

	and interrupts could create inverse lock ordering between them.

This warning is bogus -- sched_init() is being called very early with IRQs
disabled, and the irqsave/restore code paths in pcpu_alloc() are only for early
init. The path can never be called from irq context once the early init
finishes. Rationale for this is explained in changelog of the commit mentioned
above.

This problem can be encountered generally in any other early code running
with IRQs off and using irqsave/irqrestore.

Reported-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>

--- 
 kernel/lockdep.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index 9af5672..996b395 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -2487,6 +2487,14 @@ void lockdep_trace_alloc(gfp_t gfp_mask)
 
 static int mark_irqflags(struct task_struct *curr, struct held_lock *hlock)
 {
+	/* 
+	 * This is exception similar to the might_sleep() one. 
+	 * We don't care about irq-safety of the locks this early, as
+	 * it will produce false positives (sched_init() is called with
+	 * irqs off, but needs to use irqsave/irqrestore API)
+	 */
+	if (system_state != SYSTEM_RUNNING)
+		return 1;
 	/*
 	 * If non-trylock use in a hardirq or softirq context, then
 	 * mark the lock as used in these contexts:

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: irq lock inversion
  2009-11-05 14:31       ` irq lock inversion Jiri Kosina
@ 2009-11-06  5:53         ` Tejun Heo
  2009-11-06  7:17           ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2009-11-06  5:53 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Ingo Molnar, Peter Zijlstra, Yinghai Lu, Thomas Gleixner, cl,
	linux-kernel

Hello, Jiri.

Jiri Kosina wrote:
> Hmm, I wonder why I don't see this lockdep warning myself with 
> head on 1836d9592, even though I have 
> 
> 	CONFIG_PROVE_LOCKING=y
> 	CONFIG_TRACE_IRQFLAGS=y
> 
> ... ?

You need pcpu_mem_free() hit vfree() to trigger the warning by
allocating a lot of small percpu areas so that allocation map inside a
chunk becomes larger than 4k and then get extended once more.

> Anyway, how about something like this? (I can't verify myself that it even 
> fixes the warning, as I don't see it for some odd reason)
> 
> From: Jiri Kosina <jkosina@suse.cz>
> Subject: lockdep: avoid false positives about irq-safety
> 
> Commit 403a91b1 ("percpu: allow pcpu_alloc() to be called
> with IRQs off") introduced this warning:
> 
> 	=========================================================
> 	[ INFO: possible irq lock inversion dependency detected ]
> 	2.6.32-rc5-tip-04815-g12f0f93-dirty #745
> 	---------------------------------------------------------
> 	hub 1-3:1.0: state 7 ports 2 chg 0000 evt 0004
> 	ksoftirqd/65/199 just changed the state of lock:
> 	 (pcpu_lock){..-...}, at: [<ffffffff81130e04>] free_percpu+0x38/0x104
> 	but this lock took another, SOFTIRQ-unsafe lock in the past:
> 	 (vmap_area_lock){+.+...}
> 
> 	and interrupts could create inverse lock ordering between them.
> 
> This warning is bogus -- sched_init() is being called very early with IRQs
> disabled, and the irqsave/restore code paths in pcpu_alloc() are only for early
> init. The path can never be called from irq context once the early init
> finishes. Rationale for this is explained in changelog of the commit mentioned
> above.
> 
> This problem can be encountered generally in any other early code running
> with IRQs off and using irqsave/irqrestore.
> 
> Reported-by: Yinghai Lu <yhlu.kernel@gmail.com>
> Signed-off-by: Jiri Kosina <jkosina@suse.cz>

Looks good to me.  Ingo, what do you think?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: irq lock inversion
  2009-11-06  5:53         ` Tejun Heo
@ 2009-11-06  7:17           ` Ingo Molnar
  2009-11-06  7:45             ` Tejun Heo
  2009-11-06  9:59             ` Jens Axboe
  0 siblings, 2 replies; 18+ messages in thread
From: Ingo Molnar @ 2009-11-06  7:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jiri Kosina, Peter Zijlstra, Yinghai Lu, Thomas Gleixner, cl,
	linux-kernel


* Tejun Heo <tj@kernel.org> wrote:

> > From: Jiri Kosina <jkosina@suse.cz>
> > Subject: lockdep: avoid false positives about irq-safety
> > 
> > Commit 403a91b1 ("percpu: allow pcpu_alloc() to be called
> > with IRQs off") introduced this warning:
> > 
> > 	=========================================================
> > 	[ INFO: possible irq lock inversion dependency detected ]
> > 	2.6.32-rc5-tip-04815-g12f0f93-dirty #745
> > 	---------------------------------------------------------
> > 	hub 1-3:1.0: state 7 ports 2 chg 0000 evt 0004
> > 	ksoftirqd/65/199 just changed the state of lock:
> > 	 (pcpu_lock){..-...}, at: [<ffffffff81130e04>] free_percpu+0x38/0x104
> > 	but this lock took another, SOFTIRQ-unsafe lock in the past:
> > 	 (vmap_area_lock){+.+...}
> > 
> > 	and interrupts could create inverse lock ordering between them.
> > 
> > This warning is bogus -- sched_init() is being called very early with IRQs
> > disabled, and the irqsave/restore code paths in pcpu_alloc() are only for early
> > init. The path can never be called from irq context once the early init
> > finishes. Rationale for this is explained in changelog of the commit mentioned
> > above.
> > 
> > This problem can be encountered generally in any other early code running
> > with IRQs off and using irqsave/irqrestore.
> > 
> > Reported-by: Yinghai Lu <yhlu.kernel@gmail.com>
> > Signed-off-by: Jiri Kosina <jkosina@suse.cz>
> 
> Looks good to me.  Ingo, what do you think?

Ugh, this explanation is _BOGUS_. As i said, taking a lock with irqs 
disabled does _NOT_ mark a lock as 'irq safe' - if it did, we'd have 
false positives left and right.

Read the lockdep message please, consider all the backtraces it prints, 
it says something different.

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: irq lock inversion
  2009-11-06  7:17           ` Ingo Molnar
@ 2009-11-06  7:45             ` Tejun Heo
  2009-11-06  7:58               ` Ingo Molnar
  2009-11-06  9:59             ` Jens Axboe
  1 sibling, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2009-11-06  7:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jiri Kosina, Peter Zijlstra, Yinghai Lu, Thomas Gleixner, cl,
	linux-kernel

Ingo Molnar wrote:
>>> This warning is bogus -- sched_init() is being called very early with IRQs
>>> disabled, and the irqsave/restore code paths in pcpu_alloc() are only for early
>>> init. The path can never be called from irq context once the early init
>>> finishes. Rationale for this is explained in changelog of the commit mentioned
>>> above.
>>>
>>> This problem can be encountered generally in any other early code running
>>> with IRQs off and using irqsave/irqrestore.
>>>
>>> Reported-by: Yinghai Lu <yhlu.kernel@gmail.com>
>>> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
>> Looks good to me.  Ingo, what do you think?
> 
> Ugh, this explanation is _BOGUS_. As i said, taking a lock with irqs 
> disabled does _NOT_ mark a lock as 'irq safe' - if it did, we'd have 
> false positives left and right.
> 
> Read the lockdep message please, consider all the backtraces it prints, 
> it says something different.

Ah... okay, the pcpu_free() path is correctly marking the lock
irqsafe.  I assumed this was caused by recent pcpu_alloc() change.
Sorry about that.  The lock inversion problem has always been there,
it just never showed up because none has use allocation map that large
I suppose.

So, the correct fix would be either 1. push down irqsafeness down to
vmalloc locks or 2. the rather ugly unlock-lock dancing in
pcpu_extend_area_map() I posted earlier.  For 2.6.32, I guess we'll
have to go with #2.  For longer term, we'll probably have to do #1 as
it's required to implement atomic percpu allocations too.

I'll try to reproduce the problem here and verify the previous locking
dance patch.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: irq lock inversion
  2009-11-06  7:45             ` Tejun Heo
@ 2009-11-06  7:58               ` Ingo Molnar
  2009-11-06  8:24                 ` Tejun Heo
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2009-11-06  7:58 UTC (permalink / raw)
  To: Tejun Heo, Nick Piggin
  Cc: Jiri Kosina, Peter Zijlstra, Yinghai Lu, Thomas Gleixner, cl,
	linux-kernel


* Tejun Heo <tj@kernel.org> wrote:

> Ingo Molnar wrote:
> >>> This warning is bogus -- sched_init() is being called very early with IRQs
> >>> disabled, and the irqsave/restore code paths in pcpu_alloc() are only for early
> >>> init. The path can never be called from irq context once the early init
> >>> finishes. Rationale for this is explained in changelog of the commit mentioned
> >>> above.
> >>>
> >>> This problem can be encountered generally in any other early code running
> >>> with IRQs off and using irqsave/irqrestore.
> >>>
> >>> Reported-by: Yinghai Lu <yhlu.kernel@gmail.com>
> >>> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
> >> Looks good to me.  Ingo, what do you think?
> > 
> > Ugh, this explanation is _BOGUS_. As i said, taking a lock with irqs 
> > disabled does _NOT_ mark a lock as 'irq safe' - if it did, we'd have 
> > false positives left and right.
> > 
> > Read the lockdep message please, consider all the backtraces it prints, 
> > it says something different.
> 
> Ah... okay, the pcpu_free() path is correctly marking the lock 
> irqsafe.  I assumed this was caused by recent pcpu_alloc() change. 
> Sorry about that.  The lock inversion problem has always been there, 
> it just never showed up because none has use allocation map that large 
> I suppose.
> 
> So, the correct fix would be either 1. push down irqsafeness down to 
> vmalloc locks or 2. the rather ugly unlock-lock dancing in 
> pcpu_extend_area_map() I posted earlier.  For 2.6.32, I guess we'll 
> have to go with #2.  For longer term, we'll probably have to do #1 as 
> it's required to implement atomic percpu allocations too.
> 
> I'll try to reproduce the problem here and verify the previous locking 
> dance patch.

I havent looked deeply but at first sight i'm not 100% sure that even 
the lock dance hack is safe - doesnt vfree() do TLB flushes, which must 
be done with irqs enabled in general? If yes, then the whole notion of 
using the allocator from irqs-off sections is wrong and the flags 
save/restore is misguided (or at least incomplete).

So the real problem right now i think is the use of the pcpu allocator 
from within a BH section (and from irqs-off sections) - that usage 
should be eliminated from .32, or the allocator should be fixed. (which 
looks non-trivial vmalloc/vfree was never really intended to be used in 
irq-atomic contexts)

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: irq lock inversion
  2009-11-06  7:58               ` Ingo Molnar
@ 2009-11-06  8:24                 ` Tejun Heo
  2009-11-06  8:40                   ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2009-11-06  8:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, Jiri Kosina, Peter Zijlstra, Yinghai Lu,
	Thomas Gleixner, cl, linux-kernel

Hello, Ingo.

Ingo Molnar wrote:
> I havent looked deeply but at first sight i'm not 100% sure that even 
> the lock dance hack is safe - doesnt vfree() do TLB flushes, which must 
> be done with irqs enabled in general? If yes, then the whole notion of 
> using the allocator from irqs-off sections is wrong and the flags 
> save/restore is misguided (or at least incomplete).

The only place where any v*() call is nested under pcpu_lock is in the
alloc path, specifically pcpu_extend_area_map() ends up calling
vfree().  pcpu_free() path which can be called from irq context never
calls any vmalloc function directly.  The reclaiming is deferred to a
work.  Breaking the single nesting completely decouples the two locks
and nobody would be calling vfree() with irq disabled, so I don't
think there will be any problem.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: irq lock inversion
  2009-11-06  8:24                 ` Tejun Heo
@ 2009-11-06  8:40                   ` Ingo Molnar
  2009-11-06  8:52                     ` Tejun Heo
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2009-11-06  8:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Nick Piggin, Jiri Kosina, Peter Zijlstra, Yinghai Lu,
	Thomas Gleixner, cl, linux-kernel


* Tejun Heo <tj@kernel.org> wrote:

> Hello, Ingo.
> 
> Ingo Molnar wrote:
> > I havent looked deeply but at first sight i'm not 100% sure that even 
> > the lock dance hack is safe - doesnt vfree() do TLB flushes, which must 
> > be done with irqs enabled in general? If yes, then the whole notion of 
> > using the allocator from irqs-off sections is wrong and the flags 
> > save/restore is misguided (or at least incomplete).
> 
> The only place where any v*() call is nested under pcpu_lock is in the 
> alloc path, specifically pcpu_extend_area_map() ends up calling 
> vfree().  pcpu_free() path which can be called from irq context never 
> calls any vmalloc function directly.  The reclaiming is deferred to a 
> work.  Breaking the single nesting completely decouples the two locks 
> and nobody would be calling vfree() with irq disabled, so I don't 
> think there will be any problem.

My question is, why do we do flags save/restore in pcpu-alloc? Do we 
ever call it with irqs disabled? If yes, then the vfree might be unsafe 
due to vfree() potentially flushing TLBs (on all CPUs) and that act of 
sending IPIs requiring irqs to be enabled.

( Now, Nick has optimized vfree recently to lazy-free areas, but that 
  was a statistical optimization: TLB flushes are still possible, just 
  done more rarely. So we might end up calling flush_tlb_kernel_range()
  from vfree(). I've Cc:-ed Nick. )

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: irq lock inversion
  2009-11-06  8:40                   ` Ingo Molnar
@ 2009-11-06  8:52                     ` Tejun Heo
  2009-11-06 16:08                       ` Christoph Lameter
  0 siblings, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2009-11-06  8:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, Jiri Kosina, Peter Zijlstra, Yinghai Lu,
	Thomas Gleixner, cl, linux-kernel

Ingo Molnar wrote:
> My question is, why do we do flags save/restore in pcpu-alloc?

That's strictly for calls from sched_init().

> Do we ever call it with irqs disabled? If yes, then the vfree might
> be unsafe due to vfree() potentially flushing TLBs (on all CPUs) and
> that act of sending IPIs requiring irqs to be enabled.

And when called from sched_init(), it won't call vfree().

> ( Now, Nick has optimized vfree recently to lazy-free areas, but that 
>   was a statistical optimization: TLB flushes are still possible, just 
>   done more rarely. So we might end up calling flush_tlb_kernel_range()
>   from vfree(). I've Cc:-ed Nick. )

Nevertheless, it would be nice to allow at least the free part to be
called from irqsafe context.  vmalloc is doing a lot of things lazily
so deferring TLB flushes to a work wouldn't make much difference, I
suppose?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: irq lock inversion
  2009-11-06  7:17           ` Ingo Molnar
  2009-11-06  7:45             ` Tejun Heo
@ 2009-11-06  9:59             ` Jens Axboe
  2009-11-08  9:38               ` Ingo Molnar
  1 sibling, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2009-11-06  9:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tejun Heo, Jiri Kosina, Peter Zijlstra, Yinghai Lu,
	Thomas Gleixner, cl, linux-kernel

On Fri, Nov 06 2009, Ingo Molnar wrote:
> Read the lockdep message please, consider all the backtraces it prints, 
> it says something different.

In all honesty, reading and parsing lockdep messages requires a special
state of mind. IOW, readability is not its high point.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: irq lock inversion
  2009-11-06  8:52                     ` Tejun Heo
@ 2009-11-06 16:08                       ` Christoph Lameter
  2009-11-06 16:38                         ` Tejun Heo
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Lameter @ 2009-11-06 16:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Nick Piggin, Jiri Kosina, Peter Zijlstra,
	Yinghai Lu, Thomas Gleixner, linux-kernel

On Fri, 6 Nov 2009, Tejun Heo wrote:

> Ingo Molnar wrote:
> > My question is, why do we do flags save/restore in pcpu-alloc?
>
> That's strictly for calls from sched_init().

Right its a hack for 2.6.32. Fix it the right way by making the per cpu
allocator take gfp flags like any other allocator in the kernel.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: irq lock inversion
  2009-11-06 16:08                       ` Christoph Lameter
@ 2009-11-06 16:38                         ` Tejun Heo
  2009-11-06 17:03                           ` Christoph Lameter
  0 siblings, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2009-11-06 16:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ingo Molnar, Nick Piggin, Jiri Kosina, Peter Zijlstra,
	Yinghai Lu, Thomas Gleixner, linux-kernel

Christoph Lameter wrote:
> On Fri, 6 Nov 2009, Tejun Heo wrote:
> 
>> Ingo Molnar wrote:
>>> My question is, why do we do flags save/restore in pcpu-alloc?
>> That's strictly for calls from sched_init().
> 
> Right its a hack for 2.6.32. Fix it the right way by making the per cpu
> allocator take gfp flags like any other allocator in the kernel.

vmalloc/vfree is an allocator in the kernel and can't be called from
irq context and doesn't take gfp flags.  percpu allocator being
dependent on vmalloc area, it's gonna be a bit tricky.  It's
definitely doable but I'm still not quite sure whether the benfit
would worth the added complexity.  The only known use case is for lazy
allocation from memory allocator, right?  How much does it hurt not to
have that lazy allocation?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: irq lock inversion
  2009-11-06 16:38                         ` Tejun Heo
@ 2009-11-06 17:03                           ` Christoph Lameter
  2009-11-07 16:13                             ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Lameter @ 2009-11-06 17:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Nick Piggin, Jiri Kosina, Peter Zijlstra,
	Yinghai Lu, Thomas Gleixner, linux-kernel

On Sat, 7 Nov 2009, Tejun Heo wrote:

> vmalloc/vfree is an allocator in the kernel and can't be called from
> irq context and doesn't take gfp flags.  percpu allocator being
> dependent on vmalloc area, it's gonna be a bit tricky.  It's
> definitely doable but I'm still not quite sure whether the benfit
> would worth the added complexity.  The only known use case is for lazy
> allocation from memory allocator, right?  How much does it hurt not to
> have that lazy allocation?

Nick did some work recently on the vmalloc subsystem and one of the
intents was (well I think so at least) to make it usable for fsblock which
requires virtual mappings. Maybe even from an atomic context.

Potential use cases for atomic percpu allocs exist all over the kernel. If
you need to allocate a structure in an atomic context then attaching per
cpu information to that structure will require also per cpu allocation in
an atomic context.

We discourage atomic allocations and they are rare. Atomic allocs are
allowed to fail. But they are occasionally necessary.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: irq lock inversion
  2009-11-06 17:03                           ` Christoph Lameter
@ 2009-11-07 16:13                             ` Peter Zijlstra
  2009-11-09  5:46                               ` [PATCH percpu#for-linus] percpu: fix possible deadlock via " Tejun Heo
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2009-11-07 16:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, Ingo Molnar, Nick Piggin, Jiri Kosina, Yinghai Lu,
	Thomas Gleixner, linux-kernel

On Fri, 2009-11-06 at 12:03 -0500, Christoph Lameter wrote:
> Atomic allocs are allowed to fail.

All allocs (except __GFP_NOFAIL, but those are a deadlock waiting to
happen) are allowed to fail, !__GFP_WAIT allocs simply fail more easily.

The fact that __GFP_WAIT && order <= PAGE_ALLOC_COSTLY_ORDER allocations
currently don't fail but keep hammering the allocator could be
considered a bug (the recent OOM threads in fact suggest this is a
serious issue for some people).


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: irq lock inversion
  2009-11-06  9:59             ` Jens Axboe
@ 2009-11-08  9:38               ` Ingo Molnar
  2009-11-09 15:34                 ` Jens Axboe
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2009-11-08  9:38 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, Jiri Kosina, Peter Zijlstra, Yinghai Lu,
	Thomas Gleixner, cl, linux-kernel


* Jens Axboe <jens.axboe@oracle.com> wrote:

> On Fri, Nov 06 2009, Ingo Molnar wrote:
> > Read the lockdep message please, consider all the backtraces it prints, 
> > it says something different.
> 
> In all honesty, reading and parsing lockdep messages requires a 
> special state of mind. IOW, readability is not its high point.

We frequently do patches to improve the messages but there's a hard 
limit: generally the messages mirror the complexity of the underlying 
locking scenario.

Unfortunately lockdep cannot pretend something is simple when it is not. 
There are two ways out of that: either to simplify the underlying 
locking rules, or to understand them.

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH percpu#for-linus] percpu: fix possible deadlock via irq lock inversion
  2009-11-07 16:13                             ` Peter Zijlstra
@ 2009-11-09  5:46                               ` Tejun Heo
  0 siblings, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2009-11-09  5:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, Ingo Molnar, Nick Piggin, Jiri Kosina,
	Yinghai Lu, Thomas Gleixner, linux-kernel

Lockdep caught the following irq lock inversion which can lead to
deadlock.

 =========================================================
 [ INFO: possible irq lock inversion dependency detected ]
 2.6.32-rc5-tip-04815-g12f0f93-dirty #745
 ---------------------------------------------------------
 hub 1-3:1.0: state 7 ports 2 chg 0000 evt 0004
 ksoftirqd/65/199 just changed the state of lock:
  (pcpu_lock){..-...}, at: [<ffffffff81130e04>] free_percpu+0x38/0x104
 but this lock took another, SOFTIRQ-unsafe lock in the past:
  (vmap_area_lock){+.+...}

 and interrupts could create inverse lock ordering between them.

This happens because pcpu_lock is allowed to be acquired from irq
context for free_percpu() path and alloc_percpu() path may call
vfree() with pcpu_lock held.  As vmap_area_lock isn't irq safe, if an
IRQ occurs while vmap_area_lock is held and the irq handler calls
free_percpu(), locking order inversion occurs.

As the nesting only occurs in the alloc path which isn't allowed to be
called from irq context, A->B->A deadlock won't occur but A->B B->A
deadlock is still possible.

The only place where vmap_area_lock is nested under pcpu_lock is while
extending area_map to free old map.  Break the locking order by
temporarily releasing pcpu_lock when freeing old map.  This is safe to
do as allocation path is protected with outer pcpu_alloc_mutex.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
---
If nobody objects, I'll push it to Linus tomorrow.  Thanks.

 mm/percpu.c |   17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index d907971..30cd343 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -372,7 +372,7 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
 static int pcpu_extend_area_map(struct pcpu_chunk *chunk, unsigned long *flags)
 {
 	int new_alloc;
-	int *new;
+	int *new, *old = NULL;
 	size_t size;

 	/* has enough? */
@@ -407,10 +407,23 @@ static int pcpu_extend_area_map(struct pcpu_chunk *chunk, unsigned long *flags)
 	 * one of the first chunks and still using static map.
 	 */
 	if (chunk->map_alloc >= PCPU_DFL_MAP_ALLOC)
-		pcpu_mem_free(chunk->map, size);
+		old = chunk->map;

 	chunk->map_alloc = new_alloc;
 	chunk->map = new;
+
+	/*
+	 * pcpu_mem_free() might end up calling vfree() which uses
+	 * IRQ-unsafe lock and thus can't be called with pcpu_lock
+	 * held.  Release and reacquire pcpu_lock if old map needs to
+	 * be freed.
+	 */
+	if (old) {
+		spin_unlock_irqrestore(&pcpu_lock, *flags);
+		pcpu_mem_free(old, size);
+		spin_lock_irqsave(&pcpu_lock, *flags);
+	}
+
 	return 0;
 }


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: irq lock inversion
  2009-11-08  9:38               ` Ingo Molnar
@ 2009-11-09 15:34                 ` Jens Axboe
  2009-11-09 15:45                   ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2009-11-09 15:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tejun Heo, Jiri Kosina, Peter Zijlstra, Yinghai Lu,
	Thomas Gleixner, cl, linux-kernel

On Sun, Nov 08 2009, Ingo Molnar wrote:
> 
> * Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > On Fri, Nov 06 2009, Ingo Molnar wrote:
> > > Read the lockdep message please, consider all the backtraces it prints, 
> > > it says something different.
> > 
> > In all honesty, reading and parsing lockdep messages requires a 
> > special state of mind. IOW, readability is not its high point.
> 
> We frequently do patches to improve the messages but there's a hard 
> limit: generally the messages mirror the complexity of the underlying 
> locking scenario.
> 
> Unfortunately lockdep cannot pretend something is simple when it is not. 
> There are two ways out of that: either to simplify the underlying 
> locking rules, or to understand them.

I think the primary problem is that it tries to condense too much
information, instead of just spelling it out. That may be obvious to a
person intimately familiar with lockdep, but not to others. Things like
the STATE line, for instance. It would read a lot easier if these things
were just spelled out.

I know this message isn't really productive, just tossing it out there.
I'll try to to back it up with a patch the next time it annoys me :-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: irq lock inversion
  2009-11-09 15:34                 ` Jens Axboe
@ 2009-11-09 15:45                   ` Ingo Molnar
  2009-11-09 15:49                     ` Jens Axboe
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2009-11-09 15:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, Jiri Kosina, Peter Zijlstra, Yinghai Lu,
	Thomas Gleixner, cl, linux-kernel


* Jens Axboe <jens.axboe@oracle.com> wrote:

> On Sun, Nov 08 2009, Ingo Molnar wrote:
> > 
> > * Jens Axboe <jens.axboe@oracle.com> wrote:
> > 
> > > On Fri, Nov 06 2009, Ingo Molnar wrote:
> > > > Read the lockdep message please, consider all the backtraces it prints, 
> > > > it says something different.
> > > 
> > > In all honesty, reading and parsing lockdep messages requires a 
> > > special state of mind. IOW, readability is not its high point.
> > 
> > We frequently do patches to improve the messages but there's a hard 
> > limit: generally the messages mirror the complexity of the underlying 
> > locking scenario.
> > 
> > Unfortunately lockdep cannot pretend something is simple when it is not. 
> > There are two ways out of that: either to simplify the underlying 
> > locking rules, or to understand them.
> 
> I think the primary problem is that it tries to condense too much 
> information, instead of just spelling it out. That may be obvious to a 
> person intimately familiar with lockdep, but not to others. Things 
> like the STATE line, for instance. It would read a lot easier if these 
> things were just spelled out.
> 
> I know this message isn't really productive, just tossing it out 
> there. I'll try to to back it up with a patch the next time it annoys 
> me :-)

Well, previously lockdep spewed out a lot of info, which we condensed 
down because people complained ;-)

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: irq lock inversion
  2009-11-09 15:45                   ` Ingo Molnar
@ 2009-11-09 15:49                     ` Jens Axboe
  0 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2009-11-09 15:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tejun Heo, Jiri Kosina, Peter Zijlstra, Yinghai Lu,
	Thomas Gleixner, cl, linux-kernel

On Mon, Nov 09 2009, Ingo Molnar wrote:
> 
> * Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > On Sun, Nov 08 2009, Ingo Molnar wrote:
> > > 
> > > * Jens Axboe <jens.axboe@oracle.com> wrote:
> > > 
> > > > On Fri, Nov 06 2009, Ingo Molnar wrote:
> > > > > Read the lockdep message please, consider all the backtraces it prints, 
> > > > > it says something different.
> > > > 
> > > > In all honesty, reading and parsing lockdep messages requires a 
> > > > special state of mind. IOW, readability is not its high point.
> > > 
> > > We frequently do patches to improve the messages but there's a hard 
> > > limit: generally the messages mirror the complexity of the underlying 
> > > locking scenario.
> > > 
> > > Unfortunately lockdep cannot pretend something is simple when it is not. 
> > > There are two ways out of that: either to simplify the underlying 
> > > locking rules, or to understand them.
> > 
> > I think the primary problem is that it tries to condense too much 
> > information, instead of just spelling it out. That may be obvious to a 
> > person intimately familiar with lockdep, but not to others. Things 
> > like the STATE line, for instance. It would read a lot easier if these 
> > things were just spelled out.
> > 
> > I know this message isn't really productive, just tossing it out 
> > there. I'll try to to back it up with a patch the next time it annoys 
> > me :-)
> 
> Well, previously lockdep spewed out a lot of info, which we condensed 
> down because people complained ;-)

Heh, can't win 'em all! Stack traces are so large anyway that I don't
think saving 1-2 lines per lock in the trace would make much of a
difference. It's debug output, after all.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2009-11-09 15:49 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <86802c440911041008q4969b9bdk15b4598c40bb84bd@mail.gmail.com>
     [not found] ` <4AF25FC7.4000502@kernel.org>
     [not found]   ` <20091105082102.GA2870@elte.hu>
     [not found]     ` <4AF28D7A.6020209@kernel.org>
2009-11-05 14:31       ` irq lock inversion Jiri Kosina
2009-11-06  5:53         ` Tejun Heo
2009-11-06  7:17           ` Ingo Molnar
2009-11-06  7:45             ` Tejun Heo
2009-11-06  7:58               ` Ingo Molnar
2009-11-06  8:24                 ` Tejun Heo
2009-11-06  8:40                   ` Ingo Molnar
2009-11-06  8:52                     ` Tejun Heo
2009-11-06 16:08                       ` Christoph Lameter
2009-11-06 16:38                         ` Tejun Heo
2009-11-06 17:03                           ` Christoph Lameter
2009-11-07 16:13                             ` Peter Zijlstra
2009-11-09  5:46                               ` [PATCH percpu#for-linus] percpu: fix possible deadlock via " Tejun Heo
2009-11-06  9:59             ` Jens Axboe
2009-11-08  9:38               ` Ingo Molnar
2009-11-09 15:34                 ` Jens Axboe
2009-11-09 15:45                   ` Ingo Molnar
2009-11-09 15:49                     ` Jens Axboe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.