* [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing"
@ 2018-10-22 15:18 Bart Van Assche
  2018-10-22 20:14 ` Johannes Berg
  0 siblings, 1 reply; 15+ messages in thread
From: Bart Van Assche @ 2018-10-22 15:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Bart Van Assche, Johannes Berg, Christoph Hellwig,
	Sagi Grimberg, linux-nvme@lists.infradead.org

Lockdep is a tool that developers rely on to find actual bugs. False
positive lockdep warnings are very annoying because it takes time to
analyze them and to verify that the warning really is a false positive.
Since commit 87915adc3f0a ("workqueue: re-add lockdep
dependencies for flushing") causes lockdep to produce multiple false
positive reports, revert that commit. An example of a false positive
report produced by that commit:

======================================================
WARNING: possible circular locking dependency detected
4.19.0-rc1-dbg+ #1 Not tainted
------------------------------------------------------
fio/29348 is trying to acquire lock:
00000000f1894b1d ((wq_completion)"dio/%s"sb->s_id){+.+.}, at: flush_workqueue+0xd0/0x970

but task is already holding lock:
000000007d8b84da (&sb->s_type->i_mutex_key#16){++++}, at: ext4_file_write_iter+0x154/0x710

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&sb->s_type->i_mutex_key#16){++++}:
       down_write+0x3d/0x80
       __generic_file_fsync+0x77/0xf0
       ext4_sync_file+0x3c9/0x780
       vfs_fsync_range+0x66/0x100
       dio_complete+0x2f5/0x360
       dio_aio_complete_work+0x1c/0x20
       process_one_work+0x4ae/0xa20
       worker_thread+0x63/0x5a0
       kthread+0x1cf/0x1f0
       ret_from_fork+0x24/0x30

-> #1 ((work_completion)(&dio->complete_work)){+.+.}:
       process_one_work+0x474/0xa20
       worker_thread+0x63/0x5a0
       kthread+0x1cf/0x1f0
       ret_from_fork+0x24/0x30

-> #0 ((wq_completion)"dio/%s"sb->s_id){+.+.}:
       lock_acquire+0xc5/0x200
       flush_workqueue+0xf3/0x970
       drain_workqueue+0xec/0x220
       destroy_workqueue+0x23/0x350
       sb_init_dio_done_wq+0x6a/0x80
       do_blockdev_direct_IO+0x3b8/0x3d00
       __blockdev_direct_IO+0x79/0x86
       ext4_direct_IO+0x5df/0xbc0
       generic_file_direct_write+0x119/0x220
       __generic_file_write_iter+0x131/0x2d0
       ext4_file_write_iter+0x3fa/0x710
       aio_write+0x235/0x330
       io_submit_one+0x510/0xeb0
       __x64_sys_io_submit+0x122/0x340
       do_syscall_64+0x71/0x220
       entry_SYSCALL_64_after_hwframe+0x49/0xbe

other info that might help us debug this:

Chain exists of:
  (wq_completion)"dio/%s"sb->s_id --> (work_completion)(&dio->complete_work) --> &sb->s_type->i_mutex_key#16

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&sb->s_type->i_mutex_key#16);
                               lock((work_completion)(&dio->complete_work));
                               lock(&sb->s_type->i_mutex_key#16);
  lock((wq_completion)"dio/%s"sb->s_id);

 *** DEADLOCK ***

1 lock held by fio/29348:
 #0: 000000007d8b84da (&sb->s_type->i_mutex_key#16){++++}, at: ext4_file_write_iter+0x154/0x710

stack backtrace:
CPU: 4 PID: 29348 Comm: fio Not tainted 4.19.0-rc1-dbg+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
 dump_stack+0x86/0xc5
 print_circular_bug.isra.32+0x20a/0x218
 __lock_acquire+0x1a54/0x1b20
 lock_acquire+0xc5/0x200
 flush_workqueue+0xf3/0x970
 drain_workqueue+0xec/0x220
 destroy_workqueue+0x23/0x350
 sb_init_dio_done_wq+0x6a/0x80
 do_blockdev_direct_IO+0x3b8/0x3d00
 __blockdev_direct_IO+0x79/0x86
 ext4_direct_IO+0x5df/0xbc0
 generic_file_direct_write+0x119/0x220
 __generic_file_write_iter+0x131/0x2d0
 ext4_file_write_iter+0x3fa/0x710
 aio_write+0x235/0x330
 io_submit_one+0x510/0xeb0
 __x64_sys_io_submit+0x122/0x340
 do_syscall_64+0x71/0x220
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fixes: 87915adc3f0a ("workqueue: re-add lockdep dependencies for flushing")
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 kernel/workqueue.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 60e80198c3df..a6c2b823f348 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2652,9 +2652,6 @@ void flush_workqueue(struct workqueue_struct *wq)
 	if (WARN_ON(!wq_online))
 		return;
 
-	lock_map_acquire(&wq->lockdep_map);
-	lock_map_release(&wq->lockdep_map);
-
 	mutex_lock(&wq->mutex);
 
 	/*
@@ -2908,11 +2905,6 @@ static bool __flush_work(struct work_struct *work, bool from_cancel)
 	if (WARN_ON(!wq_online))
 		return false;
 
-	if (!from_cancel) {
-		lock_map_acquire(&work->lockdep_map);
-		lock_map_release(&work->lockdep_map);
-	}
-
 	if (start_flush_work(work, &barr, from_cancel)) {
 		wait_for_completion(&barr.done);
 		destroy_work_on_stack(&barr.work);
-- 
2.19.1.568.g152ad8e336-goog



* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing"
  2018-10-22 15:18 [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing" Bart Van Assche
@ 2018-10-22 20:14 ` Johannes Berg
  2018-10-22 20:28   ` Johannes Berg
  2018-10-22 21:47   ` Bart Van Assche
  0 siblings, 2 replies; 15+ messages in thread
From: Johannes Berg @ 2018-10-22 20:14 UTC (permalink / raw)
  To: Bart Van Assche, Tejun Heo
  Cc: linux-kernel, Christoph Hellwig, Sagi Grimberg,
	linux-nvme@lists.infradead.org

On Mon, 2018-10-22 at 15:18 +0000, Bart Van Assche wrote:
> Lockdep is a tool that developers rely on to find actual bugs. False
> positive lockdep warnings are very annoying because it takes time to
> analyze these and to verify that the warning is really a false positive
> report. Since commit 87915adc3f0a ("workqueue: re-add lockdep
> dependencies for flushing") causes lockdep to produce multiple false
> positive reports, revert that commit. 


What, no, I obviously _strongly_ object to this. You've just killed
lockdep reporting an entire *class* of problems over a single report
that you don't like.

"False positives" - which in this case really should more accurately be
called "bad lockdep annotations" - of the kind you report here are
inherent in how lockdep works, and thus following your argument to the
ultimate conclusion you should remove lockdep.


Say, for example (and similar things exist, like socket locks in
networking), you have a list of objects, each initialized through a
single function that does mutex_init(), causing a single lockdep class
to be used.

Now you have to acquire the locks for multiple such objects, and to make
that safe you always do that in object ID order, which is strictly
monotonically increasing.

Lockdep will *certainly* complain about this situation, since as far as
it is concerned, you've just done a recursive lock.

The solution to this isn't to remove lockdep annotations for mutexes (or
work objects, as you propose), the solution is to tell lockdep that you
actually have multiple classes of objects, and thus the recursion you do
is not a problem.
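
As a rough sketch of that fix (the object type and names below are
invented for illustration, not from any real subsystem), each type of
object gets its own lock_class_key instead of the single class created
at the common init site, similar to what the socket locks ended up doing
per address family:

#include <linux/lockdep.h>
#include <linux/mutex.h>

enum obj_type { OBJ_TYPE_A, OBJ_TYPE_B, OBJ_NUM_TYPES };

/* one lockdep class per object type, instead of one class for everything */
static struct lock_class_key obj_lock_keys[OBJ_NUM_TYPES];

struct obj {
	enum obj_type type;
	struct mutex lock;
};

static void obj_init(struct obj *o, enum obj_type type)
{
	o->type = type;
	mutex_init(&o->lock);
	/* pull this lock out of the single class created by mutex_init() */
	lockdep_set_class(&o->lock, &obj_lock_keys[type]);
}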

So clearly, NACK.

I must also say that I'm disappointed you'd try to do things this way.
I'd be (have been?) willing to actually help you understand the problem
and add the annotations, but rather than answer my question ("where do I
find the right git tree"!) you just send a revert patch.

To do that, you have to understand what recursion is valid (I'm guessing
there's some sort of layering involved), and I'm far from understanding
anything about the code that triggered this report.

I'm almost certain we can teach lockdep about why the recursion is fine,
probably by using a lock subclass (last argument to lockdep_init_map())
in either a new variant of alloc_workqueue() or a new variant of
INIT_WORK().
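
Purely as an illustration of that idea: no such variant exists today,
and everything below except the existing subclass argument of
lockdep_init_map() is hypothetical, but the work-item side could look
roughly like this:

#include <linux/lockdep.h>
#include <linux/workqueue.h>

/* hypothetical: put one particular work item into its own class/subclass */
static struct lock_class_key demo_work_key;

static void init_work_with_subclass(struct work_struct *work,
				    work_func_t func, int subclass)
{
	INIT_WORK(work, func);
#ifdef CONFIG_LOCKDEP
	/* re-register the work's lockdep map with a distinct key and subclass */
	lockdep_init_map(&work->lockdep_map, "demo_work",
			 &demo_work_key, subclass);
#endif
}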

johannes




* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing"
  2018-10-22 20:14 ` Johannes Berg
@ 2018-10-22 20:28   ` Johannes Berg
  2018-10-22 20:54     ` Bart Van Assche
  2018-10-22 21:47   ` Bart Van Assche
  1 sibling, 1 reply; 15+ messages in thread
From: Johannes Berg @ 2018-10-22 20:28 UTC (permalink / raw)
  To: Bart Van Assche, Tejun Heo
  Cc: linux-kernel, Christoph Hellwig, Sagi Grimberg,
	linux-nvme@lists.infradead.org

On Mon, 2018-10-22 at 22:14 +0200, Johannes Berg wrote:

> "False positives" - which in this case really should more accurately be
> called "bad lockdep annotations" - of the kind you report here are
> inherent in how lockdep works, and thus following your argument to the
> ultimate conclusion you should remove lockdep.

> I must also say that I'm disappointed you'd try to do things this way.
> I'd be (have been?) willing to actually help you understand the problem
> and add the annotations, but rather than answer my question ("where do I
> find the right git tree"!) you just send a revert patch.

Actually, wait, you have a different problem in the patch than the one
you reported earlier? Which I haven't even seen at all...

And this one is actually either the same missing subclass annotation, or
completely valid. You have here

 * dio/... workqueue
 * dio->complete_work, running on that workqueue
 * i_mutex_key#16

The lockdep report even more or less tells you what's going on. Perhaps
we need to find a way to make lockdep not print "lock()" but "start()"
or "flush()" for work items ... but if you read it this way, you see:

CPU0			CPU1

lock(i_mutex_key)
			start(dio->complete_work)
			lock(i_mutex_key)
flush(wq dio/...)

which is *clearly* a problem. You can't flush a workqueue, while holding
a lock, when that workqueue is (or might be) currently running a work
item that takes the same lock. You've already acquired the lock, so the
complete_work is waiting for the lock, but you're now in the flush
waiting for that work to finish ... pretty much a classic ABBA deadlock,
just with workqueues.
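
For reference, a minimal sketch of that pattern (made-up driver-style
names, not the actual dio code):

#include <linux/mutex.h>
#include <linux/workqueue.h>

static DEFINE_MUTEX(a);
static struct workqueue_struct *wq;	/* the queue that runs complete_work */

static void complete_fn(struct work_struct *work)
{
	mutex_lock(&a);			/* CPU1: blocks while CPU0 holds 'a' */
	/* ... */
	mutex_unlock(&a);
}
static DECLARE_WORK(complete_work, complete_fn);

static void teardown(void)
{
	mutex_lock(&a);			/* CPU0 */
	flush_workqueue(wq);		/* waits for complete_fn(), which waits for 'a' */
	mutex_unlock(&a);
}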

It's possible, of course, that multiple instances of the same class are
involved and they're somehow ordered/nested in a way that is in fact
safe, say in this case CPU0 and CPU1 have acquired (or are trying to
acquire) fundamentally different instances of "i_mutex_key#16", or the
dio->complete_work here doesn't actually run on the dio workqueue in
question, but somehow a different one.

With workqueues, it's also possible that something out-of-band, other
than the flush_workqueue(), prevents dio->complete_work from being
running at this point in time - but in that case you might actually be
able to get rid of the flush_workqueue() entirely.

It ought to be possible to annotate this too though, but again, I'm not
an expert on what the correct and safe nesting here would be, and thus
wouldn't be able to say how to do it.

johannes



* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing"
  2018-10-22 20:28   ` Johannes Berg
@ 2018-10-22 20:54     ` Bart Van Assche
  2018-10-22 21:04       ` Johannes Berg
  0 siblings, 1 reply; 15+ messages in thread
From: Bart Van Assche @ 2018-10-22 20:54 UTC (permalink / raw)
  To: Johannes Berg, Tejun Heo
  Cc: linux-kernel, Christoph Hellwig, Sagi Grimberg,
	linux-nvme@lists.infradead.org

On Mon, 2018-10-22 at 22:28 +0200, Johannes Berg wrote:
> The lockdep report even more or less tells you what's going on. Perhaps
> we need to find a way to make lockdep not print "lock()" but "start()"
> or "flush()" for work items ... but if you read it this way, you see:
> 
> CPU0			CPU1
> 
> lock(i_mutex_key)
> 			start(dio->complete_work)
> 			lock(i_mutex_key)
> flush(wq dio/...)
> 
> which is *clearly* a problem.

Your patch made lockdep report that the direct I/O code deadlocks although it
clearly doesn't deadlock. So where do you think the bug is? In the direct I/O
code or in your patch?

The code in the column with label "CPU0" is code called by do_blockdev_direct_IO().
From the body of that function:

			/* will be released by direct_io_worker */
			inode_lock(inode);

I think that is sufficient evidence that the direct I/O code is fine and that
your patch caused lockdep to produce an incorrect deadlock report.

Bart.


* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing"
  2018-10-22 20:54     ` Bart Van Assche
@ 2018-10-22 21:04       ` Johannes Berg
  2018-10-22 21:26         ` Bart Van Assche
  2018-10-23  1:17         ` Bart Van Assche
  0 siblings, 2 replies; 15+ messages in thread
From: Johannes Berg @ 2018-10-22 21:04 UTC (permalink / raw)
  To: Bart Van Assche, Tejun Heo
  Cc: linux-kernel, Christoph Hellwig, Sagi Grimberg,
	linux-nvme@lists.infradead.org

On Mon, 2018-10-22 at 13:54 -0700, Bart Van Assche wrote:
> On Mon, 2018-10-22 at 22:28 +0200, Johannes Berg wrote:
> > The lockdep report even more or less tells you what's going on. Perhaps
> > we need to find a way to make lockdep not print "lock()" but "start()"
> > or "flush()" for work items ... but if you read it this way, you see:
> > 
> > CPU0			CPU1
> > 
> > lock(i_mutex_key)
> > 			start(dio->complete_work)
> > 			lock(i_mutex_key)
> > flush(wq dio/...)
> > 
> > which is *clearly* a problem.
> 
> Your patch made lockdep report that the direct I/O code deadlocks although it
> clearly doesn't deadlock. So where do you think the bug is? In the direct I/O
> code or in your patch?

I think you see things too black and white.

When lockdep was added, the networking socket locks also were (mostly)
fine, recursion there only happened on different types of sockets. Yet,
lockdep did in fact report issues with it, because originally they
weren't split into different classes.

Did we remove lockdep again because it was reporting a potential problem
in the socket lock code?

> The code in the column with label "CPU0" is code called by do_blockdev_direct_IO().
> From the body of that function:
> 
> 			/* will be released by direct_io_worker */
> 			inode_lock(inode);

I don't think this is related. If this comment is true (and I have no
reason to believe it's not), then the inode lock is - by nature of
allowing lock/unlock to happen in different processes - not something
lockdep can track to start with.

> I think that is sufficient evidence that the direct I/O code is fine 

I have no reason to doubt that.

> and that
> your patch caused lockdep to produce an incorrect deadlock report.

Obviously, I disagree.

As I thought I made pretty clear, the report is in fact valid. I have no
reason to believe your judgement that the code is also valid.

However, you're basically assuming that these two statements are
mutually exclusive, when they aren't. Just like the socket lock I've
been pointing to, there are things that lockdep doesn't know about
unless you teach it.

Yes, adding workqueues to lockdep tracking has the effect that now it'll
give you a report about "non-issues" because it doesn't know about
subclasses (of work items/workqueues) that happen to be fine for
recursion / flushing with locks held.

This isn't really a new thing. Any kind of lock that lockdep tracks has
exactly this problem. We've solved this in the past without giving up
and saying "I don't like this report, so let's remove lockdep".

I really don't understand how you can argue that lockdep tracking
something is a *bad* thing. You do realize that this workqueue tracking
stuff has been around for a few years (and got removed again in
refactoring, etc.) and has found countless bugs? When it was first added
to the very old workqueue code, I analysed countless bugs and told
people that no, flushing a work struct under a lock that was also taken
by the work struct is in fact _not_ a good thing to do.

johannes



* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing"
  2018-10-22 21:04       ` Johannes Berg
@ 2018-10-22 21:26         ` Bart Van Assche
  2018-10-23 19:44           ` Johannes Berg
  2018-10-23  1:17         ` Bart Van Assche
  1 sibling, 1 reply; 15+ messages in thread
From: Bart Van Assche @ 2018-10-22 21:26 UTC (permalink / raw)
  To: Johannes Berg, Tejun Heo
  Cc: linux-kernel, Christoph Hellwig, Sagi Grimberg,
	linux-nvme@lists.infradead.org

On Mon, 2018-10-22 at 23:04 +0200, Johannes Berg wrote:
> When lockdep was added, the networking socket locks also were (mostly)
> fine, recursion there only happened on different types of sockets. Yet,
> lockdep did in fact report issues with it, because originally they
> weren't split into different classes.
> 
> Did we remove lockdep again because it was reporting a potential problem
> in the socket lock code?

I do not agree. It is accepted in the kernel community that nested locks
should always be taken in the same order. There is no agreement, however,
that the kind of checking implemented by the "crosslock"
code made sense. My understanding is that you are trying to reintroduce
through a backdoor some of the crosslock code. There is scientific evidence
that it is not possible to come up with an algorithm that flags all
potential deadlocks in programs that use completions without reporting false
positives. See e.g. Agarwal, Rahul, and Scott D. Stoller. "Run-time detection
of potential deadlocks for programs with locks, semaphores, and condition
variables." In Proceedings of the 2006 workshop on Parallel and distributed
systems: testing and debugging, pp. 51-60. ACM, 2006.
(http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.436.7823&rep=rep1&type=pdf).

> As I thought I made pretty clear, the report is in fact valid.

I'm surprised that you keep insisting that the report is valid although
your patch caused a deadlock to be reported where no deadlock occurred.
I don't understand this.

Bart.


* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing"
  2018-10-22 20:14 ` Johannes Berg
  2018-10-22 20:28   ` Johannes Berg
@ 2018-10-22 21:47   ` Bart Van Assche
  2018-10-23  0:43     ` Sagi Grimberg
  1 sibling, 1 reply; 15+ messages in thread
From: Bart Van Assche @ 2018-10-22 21:47 UTC (permalink / raw)
  To: Johannes Berg, Tejun Heo
  Cc: linux-kernel, Christoph Hellwig, Sagi Grimberg,
	linux-nvme@lists.infradead.org

On Mon, 2018-10-22 at 22:14 +0200, Johannes Berg wrote:
> I must also say that I'm disappointed you'd try to do things this way.
> I'd be (have been?) willing to actually help you understand the problem
> and add the annotations, but rather than answer my question ("where do I
> find the right git tree"!) you just send a revert patch.

Sorry that I had not yet provided that information. It should have
reached you through another e-mail thread. See also
http://lists.infradead.org/pipermail/linux-nvme/2018-October/020493.html.

> To do that, you have to understand what recursion is valid (I'm guessing
> there's some sort of layering involved), and I'm far from understanding
> anything about the code that triggered this report.

I don't think there is any kind of recursion involved in the NVMe code
that triggered the lockdep complaint. Sagi, please correct me if I got this
wrong.

Bart.



* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing"
  2018-10-22 21:47   ` Bart Van Assche
@ 2018-10-23  0:43     ` Sagi Grimberg
  2018-10-23 19:26       ` Johannes Berg
  0 siblings, 1 reply; 15+ messages in thread
From: Sagi Grimberg @ 2018-10-23  0:43 UTC (permalink / raw)
  To: Bart Van Assche, Johannes Berg, Tejun Heo
  Cc: linux-kernel, Christoph Hellwig, linux-nvme@lists.infradead.org


>> I must also say that I'm disappointed you'd try to do things this way.
>> I'd be (have been?) willing to actually help you understand the problem
>> and add the annotations, but rather than answer my question ("where do I
>> find the right git tree"!) you just send a revert patch.
> 
> Sorry that I had not yet provided that information. You should have
> received this information through another e-mail thread. See also
> http://lists.infradead.org/pipermail/linux-nvme/2018-October/020493.html.
> 
>> To do that, you have to understand what recursion is valid (I'm guessing
>> there's some sort of layering involved), and I'm far from understanding
>> anything about the code that triggered this report.
> 
> I don't think there is any kind of recursion involved in the NVMe code
> that triggered the lockdep complaint. Sagi, please correct me if I got this
> wrong.

I commented on the original thread. I'm not sure it qualifies as
recursion, but in that use case, when priv->handler_mutex is taken,
it is possible that other priv->handler_mutex instances are taken, but
those are guaranteed not to belong to that priv...


* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing"
  2018-10-22 21:04       ` Johannes Berg
  2018-10-22 21:26         ` Bart Van Assche
@ 2018-10-23  1:17         ` Bart Van Assche
  2018-10-23 19:50           ` Johannes Berg
  1 sibling, 1 reply; 15+ messages in thread
From: Bart Van Assche @ 2018-10-23  1:17 UTC (permalink / raw)
  To: Johannes Berg, Tejun Heo
  Cc: linux-kernel, Christoph Hellwig, Sagi Grimberg,
	linux-nvme@lists.infradead.org

On 10/22/18 2:04 PM, Johannes Berg wrote:
> On Mon, 2018-10-22 at 13:54 -0700, Bart Van Assche wrote:
>> The code in the column with label "CPU0" is code called by do_blockdev_direct_IO().
>>  From the body of that function:
>>
>> 			/* will be released by direct_io_worker */
>> 			inode_lock(inode);
> 
> I don't think this is related. If this comment is true (and I have no
> reason to believe it's not), then the inode lock is - by nature of
> allowing lock/unlock to happen in different processes - not something
> lockdep can track to start with.
> 
> [ ... ]
> You do realize that this workqueue tracking stuff has been around for
> a few years (and got removed again in refactoring, etc.) and has found
> countless bugs?

This is something I had not realized when I posted the patch at the 
start of this e-mail thread. Thanks for mentioning this.

But I doubt that the inode lock has been annotated incorrectly. From the 
kernel source code:

static inline void inode_lock(struct inode *inode)
{
	down_write(&inode->i_rwsem);
}

[ ... ]

void __sched down_write(struct rw_semaphore *sem)
{
	might_sleep();
	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);

	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
	rwsem_set_owner(sem);
}

It seems to me that the inode lock has been annotated correctly as an 
rwsem. It's not clear to me however why lockdep complains about a 
deadlock for the direct I/O code. I hope someone has the time to get to 
the bottom of this.

Bart.


* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing"
  2018-10-23  0:43     ` Sagi Grimberg
@ 2018-10-23 19:26       ` Johannes Berg
  2018-10-23 21:30         ` Sagi Grimberg
  0 siblings, 1 reply; 15+ messages in thread
From: Johannes Berg @ 2018-10-23 19:26 UTC (permalink / raw)
  To: Sagi Grimberg, Bart Van Assche, Tejun Heo
  Cc: linux-kernel, Christoph Hellwig, linux-nvme@lists.infradead.org

On Mon, 2018-10-22 at 17:43 -0700, Sagi Grimberg wrote:
> > > I must also say that I'm disappointed you'd try to do things this way.
> > > I'd be (have been?) willing to actually help you understand the problem
> > > and add the annotations, but rather than answer my question ("where do I
> > > find the right git tree"!) you just send a revert patch.
> > 
> > Sorry that I had not yet provided that information. You should have
> > received this information through another e-mail thread. See also
> > http://lists.infradead.org/pipermail/linux-nvme/2018-October/020493.html.
> > 
> > > To do that, you have to understand what recursion is valid (I'm guessing
> > > there's some sort of layering involved), and I'm far from understanding
> > > anything about the code that triggered this report.
> > 
> > I don't think there is any kind of recursion involved in the NVMe code
> > that triggered the lockdep complaint. Sagi, please correct me if I got this
> > wrong.
> 
> I commented on the original thread. I'm not sure it qualifies as a
> recursion, but in that use-case, when priv->handler_mutex is taken
> it is possible that other priv->handler_mutex instances are taken but
> are guaranteed not to belong to that priv...

I also commented over there regarding this specific problem.

I think my wording was perhaps inaccurate. It's not so much direct
recursion, but a dependency chain connecting the handler_mutex class
back to itself. This is how lockdep finds problems: it doesn't consider
actual *instances*, only *classes*.

As Sagi mentioned on the other thread, in this specific instance you
happen to know out-of-band that this is safe, (somehow) the code ensures
that you can't actually recursively connect back to yourself. Perhaps
there's some kind of layering involved, I don't know. (*)

As I said over there, this type of thing can be solved with
mutex_lock_nested(), though it's possible that in this case it should
really be flush_workqueue_nested() or flush_work_nested() or so.
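
For comparison, the existing mutex-side annotation looks like this (the
parent/child structure here is invented for illustration):

#include <linux/mutex.h>

struct nested_obj {
	struct mutex handler_mutex;	/* one lockdep class for all instances */
	struct nested_obj *parent;
};

static void handle(struct nested_obj *child)
{
	mutex_lock(&child->parent->handler_mutex);
	/*
	 * A different instance of the same class, guaranteed distinct by
	 * the parent/child relationship; the subclass tells lockdep so.
	 */
	mutex_lock_nested(&child->handler_mutex, SINGLE_DEPTH_NESTING);
	/* ... */
	mutex_unlock(&child->handler_mutex);
	mutex_unlock(&child->parent->handler_mutex);
}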

johannes

(*) note that just checking "obj != self" might not be enough; depending
on how you pick candidates for "obj", you might recurse down another
level and end up back at yourself, just two levels down, and still
have a dependency chain leading back to yourself.




* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing"
  2018-10-22 21:26         ` Bart Van Assche
@ 2018-10-23 19:44           ` Johannes Berg
  2018-10-23 19:58             ` Johannes Berg
  0 siblings, 1 reply; 15+ messages in thread
From: Johannes Berg @ 2018-10-23 19:44 UTC (permalink / raw)
  To: Bart Van Assche, Tejun Heo
  Cc: linux-kernel, Christoph Hellwig, Sagi Grimberg,
	linux-nvme@lists.infradead.org

On Mon, 2018-10-22 at 14:26 -0700, Bart Van Assche wrote:
> On Mon, 2018-10-22 at 23:04 +0200, Johannes Berg wrote:
> > When lockdep was added, the networking socket locks also were (mostly)
> > fine, recursion there only happened on different types of sockets. Yet,
> > lockdep did in fact report issues with it, because originally they
> > weren't split into different classes.
> > 
> > Did we remove lockdep again because it was reporting a potential problem
> > in the socket lock code?
> 
> I do not agree. It is accepted in the kernel community that if locks are
> nested that nested locks should always be taken in the same order.

You brought up another example here, different from the one we've been
discussing in the nvmet-rdma thread. If somebody can explain how you end
up with this specific example I'd be happy to explain what lockdep
thinks is going on, but from just the report, without knowing what the
setup was, I don't really have any idea what's happening.

If I were to venture a guess I'd say you mounted a file system image
loopback, and this happens during some kind of direct IO flush or
perhaps unmount, because you try to flush the inner filesystem and that
ends up trickling down to the outer filesystem, which is basically all the
same (dio) code, so it connects the lock class back to itself as far as
lockdep is concerned.

> There is
> no agreement however that the kind of checking implemented by the "crosslock"
> code made sense. My understanding is that you are trying to reintroduce
> through a backdoor some of the crosslock code. 

Not at all.

> There is scientific evidence
> that it is not possible to come up with an algorithm that flags all
> potential deadlocks in programs that use completions without reporting false
> positives. See e.g. Agarwal, Rahul, and Scott D. Stoller. "Run-time detection
> of potential deadlocks for programs with locks, semaphores, and condition
> variables." In Proceedings of the 2006 workshop on Parallel and distributed
> systems: testing and debugging, pp. 51-60. ACM, 2006.
> (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.436.7823&rep=rep1&type=pdf).

Yeah, but again, you could take that as an argument for removing
lockdep.

> > As I thought I made pretty clear, the report is in fact valid.
> 
> I'm surprised that although your patch caused a deadlock to be reported while
> no deadlock occurred that you keep insisting that your report is valid. I don't
> understand this.

I guess it boils down to semantics of "valid" here. I do argue that the
report is valid because the report only says:

   Look here, I found that you created a dependency from lock class X
   (i_mutex_key) back to itself. This means you have a potential bug in
   your program, because maybe if these are actually the same instance,
   then you may deadlock.

Note that this kind of reporting is all lockdep ever does.

Think of it like this: You have a bunch of chains, looking like this:

A1-B1
   B2-A2

where the letters are the lock classes, and the numbers are the
instances.

Lockdep will basically project that down to be

A-B-A

and tell you that you have a potential deadlock. If somehow you've
ensured that A1 and A2 are _guaranteed_ to be different *instances*,
then you can use *_nested() primitives to tell this to lockdep, in which
case it would project down to

A-B-A'

and not complain.

Now let's put a work struct in here, and you get something like

A-W    # flush work struct W (*) while holding lock A
  W-A  # lock A while running from work struct W

(*) or a single-threaded/ordered workqueue that may run work struct W


This is what lockdep is reporting here now. Again, if the A's are
actually two guaranteed-to-be-different instances, then *_nested()
versions of the locking, which may have to be added for flushing
workqueues/work items, will let you tell lockdep that this is safe.


You are arguing that we should remove the lockdep annotations instead,
basically saying that you'd prefer that lockdep only sees

A      # flush work struct W (*) while holding lock A
    A  # lock A while running from work struct W

instead of the picture with A-W/W-A above.

While this obviously gets rid of this report and the other one in nvmet-
rdma that you found, all it means is that you've just - to some extent -
crippled lockdep's ability to find chains.

You haven't removed something that's creating false positives, you've
just removed links from the chains.

This is why I say this code is valid, and why I say what you're arguing
is pretty much equivalent to saying that lockdep is just fundamentally
useless. You wouldn't, after all, say that mutexes shouldn't be
annotated in lockdep just because lockdep found an A-B-A mutex class
chain that you had to annotate with mutex_lock_nested() to make it into
A-B-A' as per my other example above.


I hope that clarifies what's going on here. I do understand that it can
be frustrating if you see a lockdep report and can't really understand
it immediately, and I also understand that it's not always trivial to
teach lockdep about the right nesting, but I'd argue that we have ways
to address both dynamic (mutex_lock_nested) and static (__mutex_init
passing a different class) ways of addressing this issue.

johannes



* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing"
  2018-10-23  1:17         ` Bart Van Assche
@ 2018-10-23 19:50           ` Johannes Berg
  2018-10-23 20:24             ` Johannes Berg
  0 siblings, 1 reply; 15+ messages in thread
From: Johannes Berg @ 2018-10-23 19:50 UTC (permalink / raw)
  To: Bart Van Assche, Tejun Heo
  Cc: linux-kernel, Christoph Hellwig, Sagi Grimberg,
	linux-nvme@lists.infradead.org

On Mon, 2018-10-22 at 18:17 -0700, Bart Van Assche wrote:

> It seems to me that the inode lock has been annotated correctly as an 
> rwsem. It's not clear to me however why lockdep complains about a 
> deadlock for the direct I/O code. I hope someone has the time to go to 
> the bottom of this.

I think the explanation I just sent should help clarify this.

The reason for the report is that with the workqueue annotations, we've
added new links to the chains that lockdep sees. I _think_ those
annotations are correct and only create links in the chain when they are
actually present, but since those links are between *classes* not
*instances* these new links may cause false positives.

I don't think the issue is the annotations of the inode lock per se, but
that the new links in the class chain cause these locks to cycle back to
themselves, which indicates a potential deadlock to lockdep.

Of course now we need to go in and tell lockdep that it shouldn't
consider these classes the same, but somebody who actually understands
why it's _fine_ that this ends up linking back to itself should do that.
I'm willing to help, but I'd have to have somebody on the phone who
knows about all these locks and workqueues and how they interact to be
able to do something useful.

Like I said in the other email though, I don't think arbitrarily
removing links from the lockdep chain is the right thing to do. Why
arbitrarily? Well, you might just as well remove the inode sem
annotations, and that would also break the chain, right? Yes, those have
been present longer, but for this particular instance it'd be
equivalent, you'd just get different reports in other places than if you
remove the workqueue annotations again.

johannes



* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing"
  2018-10-23 19:44           ` Johannes Berg
@ 2018-10-23 19:58             ` Johannes Berg
  0 siblings, 0 replies; 15+ messages in thread
From: Johannes Berg @ 2018-10-23 19:58 UTC (permalink / raw)
  To: Bart Van Assche, Tejun Heo
  Cc: linux-kernel, Christoph Hellwig, Sagi Grimberg,
	linux-nvme@lists.infradead.org

On Tue, 2018-10-23 at 21:44 +0200, Johannes Berg wrote:

> > There is
> > no agreement however that the kind of checking implemented by the "crosslock"
> > code made sense. My understanding is that you are trying to reintroduce
> > through a backdoor some of the crosslock code. 
> 
> Not at all.

Perhaps I should elaborate on this, although I'm not really entirely
sure of it.

As I understand it, crosslock was trying to solve an entirely different
problem, namely that of tracking locks that can be acquired in one
thread, and released in another. This obviously still causes deadlocks,
but doesn't lend itself to actual tracking in the lockdep chain, since
you don't really know directly how long the lock was held.

With a classic mutex (spinlock, ...), you always have

lock(A)
do_something()
unlock(A)

in the same thread, so if do_something() contains another lock(B), you
know that you've got a dependency A->B.

If A instead is something that might be released in another thread, e.g.
a completion or semaphore, it's really hard to tell whether you have

down(A)
do_something()
up(A)

in a single thread, or if the up(A) happens elsewhere entirely.
Therefore, things like this aren't tracked by lockdep at all.

Crosslock tried to address this.

The workqueue annotations, on the other hand, *are* within the same
thread. You're either executing from the work struct, and the same
thread will obviously end up going out of the work struct again, or
you're flushing (and waiting for) the workqueue, so all of that also
happens in the same thread (apart from the actual work that you wait
for, but that's unrelated).

johannes



* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing"
  2018-10-23 19:50           ` Johannes Berg
@ 2018-10-23 20:24             ` Johannes Berg
  0 siblings, 0 replies; 15+ messages in thread
From: Johannes Berg @ 2018-10-23 20:24 UTC (permalink / raw)
  To: Bart Van Assche, Tejun Heo
  Cc: linux-kernel, Christoph Hellwig, Sagi Grimberg,
	linux-nvme@lists.infradead.org

On Tue, 2018-10-23 at 21:50 +0200, Johannes Berg wrote:
> 
> The reason for the report is that with the workqueue annotations, we've
> added new links to the chains that lockdep sees. I _think_ those
> annotations are correct and only create links in the chain when they are
> actually present, but since those links are between *classes* not
> *instances* these new links may cause false positives.

So over in the nvmet-rdma thread you asked:

> Are the lockdep annotations in kernel/workqueue.c correct? My understanding
> is that lock_map_acquire() should only be used to annotate mutually exclusive
> locking (e.g. mutex, spinlock). However, multiple work items associated with
> the same workqueue can be executed concurrently. From lockdep.h:
> 
> #define lock_map_acquire(l) lock_acquire_exclusive(l, 0, 0, NULL, _THIS_IP_)

Yes, I will admit that I'm not entirely sure that they are indeed
correct, because there are complexities with single-threaded/ordered and
multi-threaded workqueues.

The annotations are like this though:

1) when the work runs, we have

        lock_map_acquire(&pwq->wq->lockdep_map);
        lock_map_acquire(&work->lockdep_map);

	call_work_function()

        lock_map_release(&work->lockdep_map);
        lock_map_release(&pwq->wq->lockdep_map);

(in the actual code this is a local copy, &lockdep_map, of
&work->lockdep_map; the copy is used because the work might free itself
while running, which is not relevant for lockdep's purposes)

This should be OK, since if you're running the work struct on a given
work queue you've undoubtedly caused a dependency chain of
	workqueue -> work item -> anything in the worker function.

Note that this is actually limited, see the comments around this code.


2) flush_workqueue does:

        lock_map_acquire(&wq->lockdep_map);
        lock_map_release(&wq->lockdep_map);

which causes a chain
	whatever the caller holds -> workqueue.

This should similarly be OK. Note that this already lets you find bugs,
because if you flush a workqueue that might run a work that takes a
lock, while holding the same lock, you deadlock:

CPU0			CPU1
lock(A)
			run work X on W:
			  lock(A)	// waiting to acquire
flush_workqueue(W)

-> deadlock, but flagged by lockdep even if typically work X finishes
well before CPU0 gets to lock(A).


3) start_flush_work/__flush_work

These are a bit more tricky, and actually fairly limited.

The &work->lockdep_map ones here are actually straight-forward - if you
try to wait for a work item to finish while holding a lock that this
work item might also take, you need this dependency to flag it in
lockdep.

The &pwq->wq->lockdep_map ones are a bit trickier. Basically, they let you
catch things like

workqueue  W
work items S1, S2
lock       A

W is single-threaded/ordered and can execute both S1 and S2.

executing S1 creates: W -> S1 -> A // let's say S1 locks A
executing S2 creates: W -> S2

Now if you do

	lock(A);
	flush_work(S2);

that seems fine, on the face of it, since S2 doesn't lock A. However,
there's a problem now. If S1 was scheduled to run before S2 but hasn't
completed yet, the worker might be waiting for lock(A) inside S1, and S2 is
waiting for S1 because the workqueue is single-threaded.

The way this is expressed in lockdep is that the flush side will
additionally create a dependency chain of

A -> W

with lock_map_acquire in start_flush_work(), given the right
circumstances. Together with the chains created above, you'd get a
report of

A -> W -> S1 -> A

which can cause a deadlock, even though you were trying to flush S2 and
that doesn't even show up here! This is one of the trickier reports to
actually understand, but we've had quite a few of these in the past.
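
Spelled out as code (with made-up names), the scenario above is roughly:

#include <linux/mutex.h>
#include <linux/workqueue.h>

static DEFINE_MUTEX(a);
static struct workqueue_struct *w;	/* from alloc_ordered_workqueue() */

static void s1_fn(struct work_struct *work)
{
	mutex_lock(&a);			/* creates W -> S1 -> A */
	/* ... */
	mutex_unlock(&a);
}

static void s2_fn(struct work_struct *work)
{
	/* takes no locks at all: only W -> S2 */
}

static DECLARE_WORK(s1, s1_fn);
static DECLARE_WORK(s2, s2_fn);

static void example(void)
{
	queue_work(w, &s1);
	queue_work(w, &s2);

	mutex_lock(&a);
	/*
	 * A -> W: S2 can't run until S1 finishes, and S1 is stuck waiting
	 * for 'a', which we hold while waiting here.
	 */
	flush_work(&s2);
	mutex_unlock(&a);
}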

I think all the conditions here are fine, and Tejun did look at them I
believe.

Now, I said this was limited - the problem here is that
start_flush_work() will only ever find a pwq in a limited amount of
scenarios, which basically means that we're creating fewer chain links
than we should. We just don't know which workqueue this item might
execute on unless it's currently pending, so if it typically completes
by the time you get to flush_work(S2), this code can no longer create
the "A -> W" dependency and will miss issues in a significant number of
cases, since that's really the typical case of flushing a work (that
it's not currently pending or running). With the "A -> W" dependency
missing we're left with just the chains

W -> S1 -> A
W -> S2
A -> S2

and don't see any problem even though it clearly exists. The original
code before Tejun's workqueue refactoring was actually better in this 
regard, I believe.

Come to think of it, perhaps we could address it by turning around the
dependencies when we run the work? Then we'd end up with

     S1 -> W -> A
     S2 -> W
A -> S2

and actually _have_ a link from A back to itself in this case. But I'd
have to analyse the other cases in more detail to see whether that makes
sense. However, I don't think we depend on the S1 <-> W links much right
now, so it's tempting to do this, let's say next year after the
currently flagged issues are more under control :-)


As for your question regarding exclusive, I must admit that I don't
recall exactly why that was chosen. I want to believe I discussed that
with PeterZ back when I originally added the workqueue lockdep
annotations (before they were mostly lost), but I couldn't point you to
any discussions (which probably were on IRC anyway), and I can't say I
remember the exact details of which type of lockdep acquire to choose at
what time. I do seem to recall that it's not so much about exclusiveness
but about the type of tracking that happens internally regarding things
like lock recursion.

johannes



* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing"
  2018-10-23 19:26       ` Johannes Berg
@ 2018-10-23 21:30         ` Sagi Grimberg
  0 siblings, 0 replies; 15+ messages in thread
From: Sagi Grimberg @ 2018-10-23 21:30 UTC (permalink / raw)
  To: Johannes Berg, Bart Van Assche, Tejun Heo
  Cc: linux-kernel, Christoph Hellwig, linux-nvme@lists.infradead.org


> 
> I also commented over there regarding this specific problem.
> 
> I think my wording was inaccurate perhaps. It's not so much direct
> recursion, but a dependency chain connecting the handler_mutex class
> back to itself. This is how lockdep finds problems, it doesn't consider
> actual *instances* but only *classes*.
> 
> As Sagi mentioned on the other thread, in this specific instance you
> happen to know out-of-band that this is safe, (somehow) the code ensures
> that you can't actually recursively connect back to yourself. Perhaps
> there's some kind of layering involved, I don't know. (*)
> 
> As I said over there, this type of thing can be solved with
> mutex_lock_nested(), though it's possible that in this case it should
> really be flush_workqueue_nested() or flush_work_nested() or so.

I guess that could make this specific complaint go away...

> johannes
> 
> (*) note that just checking "obj != self" might not be enough, depending
> on how you pick candidates for "obj" you might recurse down another
> level and end up back at yourself, just two levels down, and actually
> have a dependency chain leading back to yourself.

It's not a simple check: any removal invocation is serialized on a
dedicated workqueue, and in order to be queued there the connection
first needs to be connected, which comes later (after the flush takes
place).

