* [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing" @ 2018-10-22 15:18 Bart Van Assche 2018-10-22 20:14 ` Johannes Berg 0 siblings, 1 reply; 15+ messages in thread From: Bart Van Assche @ 2018-10-22 15:18 UTC (permalink / raw) To: Tejun Heo Cc: linux-kernel, Bart Van Assche, Johannes Berg, Christoph Hellwig, Sagi Grimberg, linux-nvme @ lists . infradead . org Lockdep is a tool that developers rely on to find actual bugs. False positive lockdep warnings are very annoying because it takes time to analyze these and to verify that the warning is really a false positive report. Since commit 87915adc3f0a ("workqueue: re-add lockdep dependencies for flushing") causes lockdep to produce multiple false positive reports, revert that commit. An example of a false positive report produced by that commit: ====================================================== WARNING: possible circular locking dependency detected 4.19.0-rc1-dbg+ #1 Not tainted ------------------------------------------------------ fio/29348 is trying to acquire lock: 00000000f1894b1d ((wq_completion)"dio/%s"sb->s_id){+.+.}, at: flush_workqueue+0xd0/0x970 but task is already holding lock: 000000007d8b84da (&sb->s_type->i_mutex_key#16){++++}, at: ext4_file_write_iter+0x154/0x710 which lock already depends on the new lock. 
the existing dependency chain (in reverse order) is: -> #2 (&sb->s_type->i_mutex_key#16){++++}: down_write+0x3d/0x80 __generic_file_fsync+0x77/0xf0 ext4_sync_file+0x3c9/0x780 vfs_fsync_range+0x66/0x100 dio_complete+0x2f5/0x360 dio_aio_complete_work+0x1c/0x20 process_one_work+0x4ae/0xa20 worker_thread+0x63/0x5a0 kthread+0x1cf/0x1f0 ret_from_fork+0x24/0x30 -> #1 ((work_completion)(&dio->complete_work)){+.+.}: process_one_work+0x474/0xa20 worker_thread+0x63/0x5a0 kthread+0x1cf/0x1f0 ret_from_fork+0x24/0x30 -> #0 ((wq_completion)"dio/%s"sb->s_id){+.+.}: lock_acquire+0xc5/0x200 flush_workqueue+0xf3/0x970 drain_workqueue+0xec/0x220 destroy_workqueue+0x23/0x350 sb_init_dio_done_wq+0x6a/0x80 do_blockdev_direct_IO+0x3b8/0x3d00 __blockdev_direct_IO+0x79/0x86 ext4_direct_IO+0x5df/0xbc0 generic_file_direct_write+0x119/0x220 __generic_file_write_iter+0x131/0x2d0 ext4_file_write_iter+0x3fa/0x710 aio_write+0x235/0x330 io_submit_one+0x510/0xeb0 __x64_sys_io_submit+0x122/0x340 do_syscall_64+0x71/0x220 entry_SYSCALL_64_after_hwframe+0x49/0xbe other info that might help us debug this: Chain exists of: (wq_completion)"dio/%s"sb->s_id --> (work_completion)(&dio->complete_work) --> &sb->s_type->i_mutex_key#16 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&sb->s_type->i_mutex_key#16); lock((work_completion)(&dio->complete_work)); lock(&sb->s_type->i_mutex_key#16); lock((wq_completion)"dio/%s"sb->s_id); *** DEADLOCK *** 1 lock held by fio/29348: #0: 000000007d8b84da (&sb->s_type->i_mutex_key#16){++++}, at: ext4_file_write_iter+0x154/0x710 stack backtrace: CPU: 4 PID: 29348 Comm: fio Not tainted 4.19.0-rc1-dbg+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Call Trace: dump_stack+0x86/0xc5 print_circular_bug.isra.32+0x20a/0x218 __lock_acquire+0x1a54/0x1b20 lock_acquire+0xc5/0x200 flush_workqueue+0xf3/0x970 drain_workqueue+0xec/0x220 destroy_workqueue+0x23/0x350 sb_init_dio_done_wq+0x6a/0x80 do_blockdev_direct_IO+0x3b8/0x3d00 
__blockdev_direct_IO+0x79/0x86 ext4_direct_IO+0x5df/0xbc0 generic_file_direct_write+0x119/0x220 __generic_file_write_iter+0x131/0x2d0 ext4_file_write_iter+0x3fa/0x710 aio_write+0x235/0x330 io_submit_one+0x510/0xeb0 __x64_sys_io_submit+0x122/0x340 do_syscall_64+0x71/0x220 entry_SYSCALL_64_after_hwframe+0x49/0xbe Fixes: 87915adc3f0a ("workqueue: re-add lockdep dependencies for flushing") Cc: Johannes Berg <johannes.berg@intel.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-nvme@lists.infradead.org <linux-nvme@lists.infradead.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> --- kernel/workqueue.c | 8 -------- 1 file changed, 8 deletions(-) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 60e80198c3df..a6c2b823f348 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -2652,9 +2652,6 @@ void flush_workqueue(struct workqueue_struct *wq) if (WARN_ON(!wq_online)) return; - lock_map_acquire(&wq->lockdep_map); - lock_map_release(&wq->lockdep_map); - mutex_lock(&wq->mutex); /* @@ -2908,11 +2905,6 @@ static bool __flush_work(struct work_struct *work, bool from_cancel) if (WARN_ON(!wq_online)) return false; - if (!from_cancel) { - lock_map_acquire(&work->lockdep_map); - lock_map_release(&work->lockdep_map); - } - if (start_flush_work(work, &barr, from_cancel)) { wait_for_completion(&barr.done); destroy_work_on_stack(&barr.work); -- 2.19.1.568.g152ad8e336-goog ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing" 2018-10-22 15:18 [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing" Bart Van Assche @ 2018-10-22 20:14 ` Johannes Berg 2018-10-22 20:28 ` Johannes Berg 2018-10-22 21:47 ` Bart Van Assche 0 siblings, 2 replies; 15+ messages in thread From: Johannes Berg @ 2018-10-22 20:14 UTC (permalink / raw) To: Bart Van Assche, Tejun Heo Cc: linux-kernel, Christoph Hellwig, Sagi Grimberg, linux-nvme @ lists . infradead . org On Mon, 2018-10-22 at 15:18 +0000, Bart Van Assche wrote: > Lockdep is a tool that developers rely on to find actual bugs. False > positive lockdep warnings are very annoying because it takes time to > analyze these and to verify that the warning is really a false positive > report. Since commit 87915adc3f0a ("workqueue: re-add lockdep > dependencies for flushing") causes lockdep to produce multiple false > positive reports, revert that commit. What, no, I obviously _strongly_ object to this. You've just killed lockdep reporting an entire *class* of problems over a single report that you don't like. "False positives" - which in this case really should more accurately be called "bad lockdep annotations" - of the kind you report here are inherent in how lockdep works, and thus following your argument to the ultimate conclusion you should remove lockdep. Say, for example (and similar things exist, like socket locks in networking), you have a list of objects, each initialized through a single function that does mutex_init(), causing a single lockdep class to be used. Now you have to acquire the locks for multiple such objects, and to make that safe you always do that in object ID order, which is strictly monotonically increasing. Lockdep will *certainly* complain about this situation, since as far as it is concerned, you've just done a recursive lock. 
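The object-ID-ordering example can be sketched in kernel style (hypothetical struct and function names; an illustration, not code from the thread):

```c
/* Hypothetical sketch: all instances share one lockdep class because
 * they are initialized by the same function. Taking two of them with
 * plain mutex_lock() looks like recursion to lockdep, even though the
 * strict ID ordering below makes it deadlock-free. mutex_lock_nested()
 * with a subclass tells lockdep that the second same-class acquisition
 * is a distinct, intentional nesting level. */
struct obj {
	u64 id;
	struct mutex lock;	/* one class for every instance */
};

static void obj_lock_pair(struct obj *a, struct obj *b)
{
	if (a->id > b->id)
		swap(a, b);	/* enforce the monotonic ID order */

	mutex_lock(&a->lock);
	mutex_lock_nested(&b->lock, SINGLE_DEPTH_NESTING);
}
```

Without the mutex_lock_nested() annotation, lockdep would flag the second acquisition as a same-class recursion.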
The solution to this isn't to remove lockdep annotations for mutexes (or work objects, as you propose); the solution is to tell lockdep that you actually have multiple classes of objects, and thus the recursion you do is not a problem. So clearly, NACK. I must also say that I'm disappointed you'd try to do things this way. I'd be (have been?) willing to actually help you understand the problem and add the annotations, but rather than answer my question ("where do I find the right git tree"!) you just send a revert patch. To do that, you have to understand what recursion is valid (I'm guessing there's some sort of layering involved), and I'm far from understanding anything about the code that triggered this report. I'm almost certain we can teach lockdep about why the recursion is fine, probably by using a lock subclass (last argument to lockdep_init_map()) in either a new variant of alloc_workqueue() or a new variant of INIT_WORK(). johannes ^ permalink raw reply [flat|nested] 15+ messages in thread
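A sketch of the kind of INIT_WORK() variant being suggested (the macro name and layering are made up; lockdep_init_map()'s last argument is the subclass):

```c
/* Hypothetical sketch: re-initialize a work item's lockdep map with a
 * nonzero subclass, so an "inner" work item gets a lockdep class
 * distinct from the "outer" one and recursion between the layers is
 * not flagged. INIT_WORK_SUBCLASS() does not exist in the kernel. */
#ifdef CONFIG_LOCKDEP
#define INIT_WORK_SUBCLASS(_work, _fn, _key, _subclass)			\
	do {								\
		INIT_WORK(_work, _fn);					\
		lockdep_init_map(&(_work)->lockdep_map,			\
				 "(work_completion)" #_work,		\
				 (_key), (_subclass));			\
	} while (0)
#else
#define INIT_WORK_SUBCLASS(_work, _fn, _key, _subclass)			\
	INIT_WORK(_work, _fn)
#endif
```

An analogous alloc_workqueue() variant could pass a subclass for the workqueue's own lockdep map.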
* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing" 2018-10-22 20:14 ` Johannes Berg @ 2018-10-22 20:28 ` Johannes Berg 2018-10-22 20:54 ` Bart Van Assche 2018-10-22 21:47 ` Bart Van Assche 1 sibling, 1 reply; 15+ messages in thread From: Johannes Berg @ 2018-10-22 20:28 UTC (permalink / raw) To: Bart Van Assche, Tejun Heo Cc: linux-kernel, Christoph Hellwig, Sagi Grimberg, linux-nvme @ lists . infradead . org On Mon, 2018-10-22 at 22:14 +0200, Johannes Berg wrote: > "False positives" - which in this case really should more accurately be > called "bad lockdep annotations" - of the kind you report here are > inherent in how lockdep works, and thus following your argument to the > ultimate conclusion you should remove lockdep. > I must also say that I'm disappointed you'd try to do things this way. > I'd be (have been?) willing to actually help you understand the problem > and add the annotations, but rather than answer my question ("where do I > find the right git tree"!) you just send a revert patch. Actually, wait, you have a different problem in the patch than the one you reported earlier? Which I haven't even seen at all... And this one is actually either the same missing subclass annotation, or completely valid. You have here * dio/... workqueue * dio->complete_work, running on that workqueue * i_mutex_key#16 The lockdep report even more or less tells you what's going on. Perhaps we need to find a way to make lockdep not print "lock()" but "start()" or "flush()" for work items ... but if you read it this way, you see: CPU0 CPU1 lock(i_mutex_key) start(dio->complete_work) lock(i_mutex_key) flush(wq dio/...) which is *clearly* a problem. You can't flush a workqueue under lock that is (or might be) currently running a work item that takes the same lock. You've already acquired the lock, and so the complete_work is waiting for the lock, but you're now in the flush waiting for it to finish ... 
pretty much a classic ABBA deadlock, just with workqueues. It's possible, of course, that multiple instances of the same class are involved and they're somehow ordered/nested in a way that is in fact safe, say in this case CPU0 and CPU1 have acquired (or are trying to acquire) fundamentally different instances of "i_mutex_key#16", or the dio->complete_work here doesn't actually run on the dio workqueue in question, but somehow a different one. With workqueues, it's also possible that something out-of-band, other than the flush_workqueue(), prevents dio->complete_work from running at this point in time - but in that case you might actually be able to get rid of the flush_workqueue() entirely. It ought to be possible to annotate this too though, but again, I'm not an expert on what the correct and safe nesting here would be, and thus wouldn't be able to say how to do it. johannes ^ permalink raw reply [flat|nested] 15+ messages in thread
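The flush-under-lock pattern discussed above can be sketched as follows (all names hypothetical, standing in for the dio workqueue and the inode mutex):

```c
static DEFINE_MUTEX(a);			/* stands in for i_mutex_key */
static struct workqueue_struct *wq;	/* stands in for "dio/..." */

/* Work item that needs lock A to do its job. */
static void complete_fn(struct work_struct *work)
{
	mutex_lock(&a);
	/* ... finish the I/O ... */
	mutex_unlock(&a);
}

static void teardown(void)
{
	mutex_lock(&a);		/* hold A ... */
	flush_workqueue(wq);	/* ... while waiting for work that wants A */
	mutex_unlock(&a);
}
```

If complete_fn is queued when teardown() takes the mutex, the flush waits for the work item and the work item waits for the mutex: the ABBA deadlock in the report.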
* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing" 2018-10-22 20:28 ` Johannes Berg @ 2018-10-22 20:54 ` Bart Van Assche 2018-10-22 21:04 ` Johannes Berg 0 siblings, 1 reply; 15+ messages in thread From: Bart Van Assche @ 2018-10-22 20:54 UTC (permalink / raw) To: Johannes Berg, Tejun Heo Cc: linux-kernel, Christoph Hellwig, Sagi Grimberg, linux-nvme @ lists . infradead . org On Mon, 2018-10-22 at 22:28 +0200, Johannes Berg wrote: > The lockdep report even more or less tells you what's going on. Perhaps > we need to find a way to make lockdep not print "lock()" but "start()" > or "flush()" for work items ... but if you read it this way, you see: > > CPU0 CPU1 > > lock(i_mutex_key) > start(dio->complete_work) > lock(i_mutex_key) > flush(wq dio/...) > > which is *clearly* a problem. Your patch made lockdep report that the direct I/O code deadlocks although it clearly doesn't deadlock. So where do you think the bug is? In the direct I/O code or in your patch? The code in the column with label "CPU0" is code called by do_blockdev_direct_IO(). From the body of that function: /* will be released by direct_io_worker */ inode_lock(inode); I think that is sufficient evidence that the direct I/O code is fine and that your patch caused lockdep to produce an incorrect deadlock report. Bart. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing" 2018-10-22 20:54 ` Bart Van Assche @ 2018-10-22 21:04 ` Johannes Berg 2018-10-22 21:26 ` Bart Van Assche 2018-10-23 1:17 ` Bart Van Assche 0 siblings, 2 replies; 15+ messages in thread From: Johannes Berg @ 2018-10-22 21:04 UTC (permalink / raw) To: Bart Van Assche, Tejun Heo Cc: linux-kernel, Christoph Hellwig, Sagi Grimberg, linux-nvme @ lists . infradead . org On Mon, 2018-10-22 at 13:54 -0700, Bart Van Assche wrote: > On Mon, 2018-10-22 at 22:28 +0200, Johannes Berg wrote: > > The lockdep report even more or less tells you what's going on. Perhaps > > we need to find a way to make lockdep not print "lock()" but "start()" > > or "flush()" for work items ... but if you read it this way, you see: > > > > CPU0 CPU1 > > > > lock(i_mutex_key) > > start(dio->complete_work) > > lock(i_mutex_key) > > flush(wq dio/...) > > > > which is *clearly* a problem. > > Your patch made lockdep report that the direct I/O code deadlocks although it > clearly doesn't deadlock. So where do you think the bug is? In the direct I/O > code or in your patch? I think you see things too black and white. When lockdep was added, the networking socket locks also were (mostly) fine, recursion there only happened on different types of sockets. Yet, lockdep did in fact report issues with it, because originally they weren't split into different classes. Did we remove lockdep again because it was reporting a potential problem in the socket lock code? > The code in the column with label "CPU0" is code called by do_blockdev_direct_IO(). > From the body of that function: > > /* will be released by direct_io_worker */ > inode_lock(inode); I don't think this is related. If this comment is true (and I have no reason to believe it's not), then the inode lock is - by nature of allowing lock/unlock to happen in different processes - not something lockdep can track to start with. 
> I think that is sufficient evidence that the direct I/O code is fine I have no reason to doubt that. > and that > your patch caused lockdep to produce an incorrect deadlock report. Obviously, I disagree. As I thought I made pretty clear, the report is in fact valid. I have no reason to believe your judgement that the code is also valid. However, you're basically assuming that these two statements are mutually exclusive, when they aren't. Just like the socket lock I've been pointing to, there are things that lockdep doesn't know about unless you teach it. Yes, adding workqueues to lockdep tracking has the effect that now it'll give you a report about "non-issues" because it doesn't know about subclasses (of work items/workqueues) that happen to be fine for recursion / flushing with locks held. This isn't really a new thing. Any kind of lock that lockdep tracks has exactly this problem. We've solved this in the past without giving up and saying "I don't like this report, so let's remove lockdep". I really don't understand how you can argue that lockdep tracking something is a *bad* thing. You do realize that this workqueue tracking stuff has been around for a few years (and got removed again in refactoring, etc.) and has found countless bugs? When it was first added to the very old workqueue code, I analysed countless bugs and told people that no, flushing a work struct under a lock that was also taken by the work struct is in fact _not_ a good thing to do. johannes ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing" 2018-10-22 21:04 ` Johannes Berg @ 2018-10-22 21:26 ` Bart Van Assche 2018-10-23 19:44 ` Johannes Berg 2018-10-23 1:17 ` Bart Van Assche 1 sibling, 1 reply; 15+ messages in thread From: Bart Van Assche @ 2018-10-22 21:26 UTC (permalink / raw) To: Johannes Berg, Tejun Heo Cc: linux-kernel, Christoph Hellwig, Sagi Grimberg, linux-nvme @ lists . infradead . org On Mon, 2018-10-22 at 23:04 +0200, Johannes Berg wrote: > When lockdep was added, the networking socket locks also were (mostly) > fine, recursion there only happened on different types of sockets. Yet, > lockdep did in fact report issues with it, because originally they > weren't split into different classes. > > Did we remove lockdep again because it was reporting a potential problem > in the socket lock code? I do not agree. It is accepted in the kernel community that if locks are nested that nested locks should always be taken in the same order. There is no agreement however that the kind of checking implemented by the "crosslock" code made sense. My understanding is that you are trying to reintroduce through a backdoor some of the crosslock code. There is scientific evidence that it is not possible to come up with an algorithm that flags all potential deadlocks in programs that use completions without reporting false positives. See e.g. Agarwal, Rahul, and Scott D. Stoller. "Run-time detection of potential deadlocks for programs with locks, semaphores, and condition variables." In Proceedings of the 2006 workshop on Parallel and distributed systems: testing and debugging, pp. 51-60. ACM, 2006. (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.436.7823&rep=rep1&type=pdf). > As I thought I made pretty clear, the report is in fact valid. I'm surprised that although your patch caused a deadlock to be reported while no deadlock occurred that you keep insisting that your report is valid. I don't understand this. Bart. 
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing" 2018-10-22 21:26 ` Bart Van Assche @ 2018-10-23 19:44 ` Johannes Berg 2018-10-23 19:58 ` Johannes Berg 0 siblings, 1 reply; 15+ messages in thread From: Johannes Berg @ 2018-10-23 19:44 UTC (permalink / raw) To: Bart Van Assche, Tejun Heo Cc: linux-kernel, Christoph Hellwig, Sagi Grimberg, linux-nvme @ lists . infradead . org On Mon, 2018-10-22 at 14:26 -0700, Bart Van Assche wrote: > On Mon, 2018-10-22 at 23:04 +0200, Johannes Berg wrote: > > When lockdep was added, the networking socket locks also were (mostly) > > fine, recursion there only happened on different types of sockets. Yet, > > lockdep did in fact report issues with it, because originally they > > weren't split into different classes. > > > > Did we remove lockdep again because it was reporting a potential problem > > in the socket lock code? > > I do not agree. It is accepted in the kernel community that if locks are > nested that nested locks should always be taken in the same order. You brought here another example, different from the one that we've been discussing in the nvmet-rdma thread. If somebody can explain how you end up with this specific example I'd be happy to explain what lockdep thinks it's going on, but from just the report, without knowing what the setup was, I don't really have any idea what's going on. If I were to venture a guess I'd say you mounted a file system image loopback, and this happens during some kind of direct IO flush or perhaps unmount, because you try to flush the inner filesystem and that ends up tricking out to the outer filesystem, which is basically all the same (dio) code, so it connects the lock class back to itself as far as lockdep is concerned. > There is > no agreement however that the kind of checking implemented by the "crosslock" > code made sense. My understanding is that you are trying to reintroduce > through a backdoor some of the crosslock code. Not at all. 
> There is scientific evidence > that it is not possible to come up with an algorithm that flags all > potential deadlocks in programs that use completions without reporting false > positives. See e.g. Agarwal, Rahul, and Scott D. Stoller. "Run-time detection > of potential deadlocks for programs with locks, semaphores, and condition > variables." In Proceedings of the 2006 workshop on Parallel and distributed > systems: testing and debugging, pp. 51-60. ACM, 2006. > (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.436.7823&rep=rep1&type=pdf). Yeah, but again, you could take that as an argument for removing lockdep. > > As I thought I made pretty clear, the report is in fact valid. > > I'm surprised that although your patch caused a deadlock to be reported while > no deadlock occurred that you keep insisting that your report is valid. I don't > understand this. I guess it boils down to semantics of "valid" here. I do argue that the report is valid because the report only says: Look here, I found that you created a dependency from lock class X (i_mutex_key) back to itself. This means you have a potential bug in your program, because maybe if these are actually the same instance, then you may deadlock. Note that this kind of reporting is all lockdep ever does. Think of it like this: You have a bunch of chains, looking like this: A1-B1 B2-A2 where the letters are the lock classes, and the numbers are the instances. Lockdep will basically project that down to be A-B-A and tell you that you have a potential deadlock. If somehow you've ensured that A1 and A2 are _guaranteed_ to be different *instances*, then you can use *_nested() primitives to tell this to lockdep, in which case it would project down to A-B-A' and not complain.
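The class projection described here can be modeled with a toy checker (plain C; a deliberately simplified model of what lockdep does, not its actual algorithm or data structures):

```c
#include <string.h>

/* Toy model: record held-class -> acquired-class edges, as lockdep
 * records dependencies between lock *classes* rather than instances,
 * then ask whether a new acquisition would close a cycle. */
#define NCLASS 8
static int edge[NCLASS][NCLASS];

static void record(int held, int acquired)
{
	edge[held][acquired] = 1;
}

/* Depth-first search over the class dependency graph. */
static int reaches(int from, int to, int *seen)
{
	if (from == to)
		return 1;
	seen[from] = 1;
	for (int next = 0; next < NCLASS; next++)
		if (edge[from][next] && !seen[next] && reaches(next, to, seen))
			return 1;
	return 0;
}

/* Would taking class 'to' while holding class 'held' create a cycle? */
static int would_deadlock(int held, int to)
{
	int seen[NCLASS];

	memset(seen, 0, sizeof(seen));
	return reaches(to, held, seen);
}

/* Project the two instance chains A1-B1 and B2-A2 down to classes
 * A = 0 and B = 1, exactly as described in the mail above. */
static int project_example(void)
{
	memset(edge, 0, sizeof(edge));
	record(0, 1);	/* chain A1-B1 projects to A -> B */
	record(1, 0);	/* chain B2-A2 projects to B -> A */
	return would_deadlock(0, 1);	/* holding A, taking B */
}
```

project_example() reports a potential deadlock: once the instance chains are projected onto classes, A reaches itself through B, which is precisely the A-B-A report lockdep produces whether or not the concrete instances could ever collide.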
Now let's put a work struct in here, and you get something like A-W # flush work struct W (*) while holding lock A W-A # lock A while running from work struct W (*) or a single-threaded/ordered workqueue that may run work struct W This is what lockdep is reporting here now. Again, if the A's are actually two guaranteed-to-be-different instances, then *_nested() versions of the locking, which may have to be added for flushing workqueues/work items, will let you tell lockdep that this is safe. You are arguing that we should remove the lockdep annotations instead, basically saying that you'd prefer that lockdep only sees A # flush work struct W (*) while holding lock A A # lock A while running from work struct W instead of the picture with A-W/W-A above. While this obviously gets rid of this report and the other one in nvmet- rdma that you found, all it means is that you've just - to some extent - crippled lockdep's ability to find chains. You haven't removed something that's creating false positives, you've just removed links from the chains. This is why I say this code is valid, and why I say what you're arguing is pretty much equivalent to saying that lockdep is just fundamentally useless. You wouldn't, after all, say that mutexes shouldn't be annotated in lockdep just because lockdep found an A-B-A mutex class chain that you had to annotate with mutex_lock_nested() to make it into A-B-A' as per my other example above. I hope that clarifies what's going on here. I do understand that it can be frustrating if you see a lockdep report and can't really understand it immediately, and I also understand that it's not always trivial to teach lockdep about the right nesting, but I'd argue that we have ways to address both dynamic (mutex_lock_nested) and static (__mutex_init passing a different class) ways of addressing this issue. johannes ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing" 2018-10-23 19:44 ` Johannes Berg @ 2018-10-23 19:58 ` Johannes Berg 0 siblings, 0 replies; 15+ messages in thread From: Johannes Berg @ 2018-10-23 19:58 UTC (permalink / raw) To: Bart Van Assche, Tejun Heo Cc: linux-kernel, Christoph Hellwig, Sagi Grimberg, linux-nvme @ lists . infradead . org On Tue, 2018-10-23 at 21:44 +0200, Johannes Berg wrote: > > There is > > no agreement however that the kind of checking implemented by the "crosslock" > > code made sense. My understanding is that you are trying to reintroduce > > through a backdoor some of the crosslock code. > > Not at all. Perhaps I should elaborate on this, although I'm not really entirely sure of it. As I understand it, crosslock was trying to solve an entirely different problem, namely that of tracking locks that can be acquired in one thread, and released in another. This obviously still causes deadlocks, but doesn't lend itself to actual tracking in the lockdep chain, since you don't really know, directly, how long the lock was held. With a classic mutex (spinlock, ...), you always have lock(A) do_something() unlock(A) in the same thread, so if do_something() contains another lock(B), you know that you've got a dependency A->B. If A instead is something that might be released in another thread, e.g. a completion or semaphore, it's really hard to tell whether you have down(A) do_something() up(A) in a single thread, or if the up(A) happens elsewhere entirely. Therefore, things like this aren't tracked by lockdep at all. Crosslock tried to address this. The workqueue annotations, on the other hand, *are* within the same thread. You're either executing from the work struct, and the same thread will obviously end up going out of the work struct again, or you're flushing (and waiting) for the workqueue, so all of that also happens in the same thread (apart from the actual work that you wait for, but that's unrelated).
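The acquire-in-one-thread, release-in-another shape can be sketched with a completion (hypothetical names; a kernel-style illustration of why lockdep cannot track this like a mutex):

```c
/* Sketch: the "lock" is taken here but released by another context
 * entirely, so lockdep has no held-lock interval it can reason about. */
static DECLARE_COMPLETION(done);

static void other_context_fn(struct work_struct *work)
{
	/* ... finish the operation ... */
	complete(&done);		/* the "up(A)" happens here */
}
static DECLARE_WORK(some_work, other_context_fn);	/* made-up work item */

static void caller(void)
{
	queue_work(system_wq, &some_work);
	wait_for_completion(&done);	/* the "down(A)" happens here */
}
```

Compare this with a mutex, where lock and unlock bracket a region in one thread and every lock taken inside that region yields a clean A->B dependency.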
johannes ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing" 2018-10-22 21:04 ` Johannes Berg 2018-10-22 21:26 ` Bart Van Assche @ 2018-10-23 1:17 ` Bart Van Assche 2018-10-23 19:50 ` Johannes Berg 1 sibling, 1 reply; 15+ messages in thread From: Bart Van Assche @ 2018-10-23 1:17 UTC (permalink / raw) To: Johannes Berg, Tejun Heo Cc: linux-kernel, Christoph Hellwig, Sagi Grimberg, linux-nvme @ lists . infradead . org On 10/22/18 2:04 PM, Johannes Berg wrote: > On Mon, 2018-10-22 at 13:54 -0700, Bart Van Assche wrote: >> The code in the column with label "CPU0" is code called by do_blockdev_direct_IO(). >> From the body of that function: >> >> /* will be released by direct_io_worker */ >> inode_lock(inode); > > I don't think this is related. If this comment is true (and I have no > reason to believe it's not), then the inode lock is - by nature of > allowing lock/unlock to happen in different processes - not something > lockdep can track to start with. > > [ ... ] >> You do realize that this workqueue tracking stuff has been around for > a few years (and got removed again in refactoring, etc.) and has found > countless bugs? This is something I had not realized when I posted the patch at the start of this e-mail thread. Thanks for having mentioned this. But I doubt that the inode lock has been annotated incorrectly. From the kernel source code: static inline void inode_lock(struct inode *inode) { down_write(&inode->i_rwsem); } [ ... ] void __sched down_write(struct rw_semaphore *sem) { might_sleep(); rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_); LOCK_CONTENDED(sem, __down_write_trylock, __down_write); rwsem_set_owner(sem); } It seems to me that the inode lock has been annotated correctly as an rwsem. It's not clear to me however why lockdep complains about a deadlock for the direct I/O code. I hope someone has the time to go to the bottom of this. Bart. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing" 2018-10-23 1:17 ` Bart Van Assche @ 2018-10-23 19:50 ` Johannes Berg 2018-10-23 20:24 ` Johannes Berg 0 siblings, 1 reply; 15+ messages in thread From: Johannes Berg @ 2018-10-23 19:50 UTC (permalink / raw) To: Bart Van Assche, Tejun Heo Cc: linux-kernel, Christoph Hellwig, Sagi Grimberg, linux-nvme @ lists . infradead . org On Mon, 2018-10-22 at 18:17 -0700, Bart Van Assche wrote: > It seems to me that the inode lock has been annotated correctly as an > rwsem. It's not clear to me however why lockdep complains about a > deadlock for the direct I/O code. I hope someone has the time to go to > the bottom of this. I think the explanation I just sent should help clarify this. The reason for the report is that with the workqueue annotations, we've added new links to the chains that lockdep sees. I _think_ those annotations are correct and only create links in the chain when they are actually present, but since those links are between *classes* not *instances* these new links may cause false positives. I don't think the issue is the annotations of the inode lock per se, but that the new links in the class chain cause these locks to cycle back to themselves, which indicates a potential deadlock to lockdep. Of course now we need to go in and tell lockdep that it shouldn't consider these classes the same, but somebody who actually understands why it's _fine_ that this ends up linking back to itself should do that. I'm willing to help, but I'd have to have somebody on the phone who knows about all these locks and workqueues and how they interact to be able to do something useful. Like I said in the other email though, I don't think arbitrarily removing links from the lockdep chain is the right thing to do. Why arbitrarily? Well, you might just as well remove the inode sem annotations, and that would also break the chain, right? 
Yes, those have been present longer, but for this particular instance it'd be equivalent, you'd just get different reports in other places than if you remove the workqueue annotations again. johannes ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing" 2018-10-23 19:50 ` Johannes Berg @ 2018-10-23 20:24 ` Johannes Berg 0 siblings, 0 replies; 15+ messages in thread From: Johannes Berg @ 2018-10-23 20:24 UTC (permalink / raw) To: Bart Van Assche, Tejun Heo Cc: linux-kernel, Christoph Hellwig, Sagi Grimberg, linux-nvme @ lists . infradead . org On Tue, 2018-10-23 at 21:50 +0200, Johannes Berg wrote: > > The reason for the report is that with the workqueue annotations, we've > added new links to the chains that lockdep sees. I _think_ those > annotations are correct and only create links in the chain when they are > actually present, but since those links are between *classes* not > *instances* these new links may cause false positives. So over in the nvmet-rdma thread you asked: > Are the lockdep annotations in kernel/workqueue.c correct? My understanding > is that lock_map_acquire() should only be used to annotate mutually exclusive > locking (e.g. mutex, spinlock). However, multiple work items associated with > the same workqueue can be executed concurrently. From lockdep.h: > > #define lock_map_acquire(l) lock_acquire_exclusive(l, 0, 0, NULL, _THIS_IP_) Yes, I will admit that I'm not entirely sure that they are indeed correct, because there are complexities with single-threaded/ordered and multi-threaded workqueues. The annotations are like this though: 1) when the work runs, we have lock_map_acquire(&pwq->wq->lockdep_map); lock_map_acquire(&work->lockdep_map); call_work_function() lock_map_release(&work->lockdep_map); lock_map_release(&pwq->wq->lockdep_map); (where &work->lockdep_map is &lockdep_map in the actual code, but that's a copy of it and we just use it because the work might free itself, it's not relevant for lockdep's purposes) This should be OK, since if you're running the work struct on a given work queue you've undoubtedly caused a dependency chain of workqueue -> work item -> anything in the worker function. 
Note that this is actually limited, see the comments around this code. 2) flush_workqueue does: lock_map_acquire(&wq->lockdep_map); lock_map_release(&wq->lockdep_map); which causes a chain whatever the caller holds -> workqueue. This should similarly be OK. Note that this already lets you find bugs, because if you flush a workqueue that might run a work that takes a lock, while holding the same lock, you deadlock: CPU0 CPU1 lock(A) run work X on W: lock(A) // waiting to acquire flush_workqueue(W) -> deadlock, but flagged by lockdep even if typically work X finishes well before CPU0 gets to lock(A). 3) start_flush_work/__flush_work These are a bit more tricky, and actually fairly limited. The &work->lockdep_map ones here are actually straight-forward - if you try to wait for a work item to finish while holding a lock that this work item might also take, you need this dependency to flag it in lockdep. The &pwq->wq->lockdep_map are a bit trickier. Basically, they let you catch things like workqueue W work items S1, S2 lock A W is single-threaded/ordered and can execute both S1 and S2. executing S1 creates: W -> S1 -> A // let's say S1 locks A executing S2 creates: W -> S2 Now if you do lock(A); flush_work(S2); that seems fine, on the face of it, since S2 doesn't lock A. However, there's a problem now. If S1 was scheduled to run before S2 but hasn't completed yet, it might be waiting for lock(A) inside S1, and S2 is waiting for S1 because the workqueue is single-threaded. The way this is expressed in lockdep is that the flush side will additionally create a dependency chain of A -> W with lock_map_acquire in start_flush_work(), given the right circumstances. Together with the chains created above, you'd get a report of A -> W -> S1 -> A which can cause a deadlock, even though you were trying to flush S2 and that doesn't even show up here! This is one of the trickier reports to actually understand, but we've had quite a few of these in the past. 
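Case 3) above can be sketched as follows (made-up names; W would be created with alloc_ordered_workqueue() so that it executes one item at a time):

```c
static DEFINE_MUTEX(a);
static struct workqueue_struct *w;	/* single-threaded/ordered */

static void s1_fn(struct work_struct *work)
{
	mutex_lock(&a);			/* creates W -> S1 -> A */
	mutex_unlock(&a);
}
static DECLARE_WORK(s1, s1_fn);

static void s2_fn(struct work_struct *work)
{
	/* never touches A; creates only W -> S2 */
}
static DECLARE_WORK(s2, s2_fn);

static void caller(void)
{
	queue_work(w, &s1);
	queue_work(w, &s2);
	mutex_lock(&a);		/* S1 now blocks on A */
	flush_work(&s2);	/* waits for S2, which waits for S1: deadlock */
	mutex_unlock(&a);
}
```

Because W is ordered, S2 cannot start until S1 completes, so flushing S2 under A indirectly waits on S1's mutex_lock(&a): the A -> W -> S1 -> A cycle described above.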
I think all the conditions here are fine, and Tejun did look at them I believe.

Now, I said this was limited - the problem here is that start_flush_work() will only ever find a pwq in a limited number of scenarios, which basically means that we're creating fewer chain links than we should. We just don't know which workqueue this item might execute on unless it's currently pending, so if it has typically completed by the time you get to flush_work(S2), this code can no longer create the "A -> W" dependency and will miss issues in a significant number of cases, since that's really the typical case of flushing a work (that it's not currently pending or running).

With the "A -> W" dependency missing we're left with just the chains

    W -> S1 -> A
    W -> S2
    A -> S2

and don't see any problem even though it clearly exists. The original code before Tejun's workqueue refactoring was actually better in this regard, I believe.

Come to think of it, perhaps we could address it by turning around the dependencies when we run the work? Then we'd end up with

    S1 -> W -> A
    S2 -> W
    A -> S2

and actually _have_ a link from A back to itself in this case. But I'd have to analyse the other cases in more detail to see whether that makes sense. However, I don't think we depend on the S1 <-> W links much right now, so it's tempting to do this, let's say next year after the currently flagged issues are more under control :-)

As for your question regarding exclusive, I must admit that I don't recall exactly why that was chosen. I want to believe I discussed that with PeterZ back when I originally added the workqueue lockdep annotations (before they were mostly lost), but I couldn't point you to any discussions (which probably were on IRC anyway), and I can't say I remember the exact details of which type of lockdep acquire to choose at what time. I do seem to recall that it's not so much about exclusiveness but about the type of tracking that happens internally regarding things like lock recursion.
johannes ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing" 2018-10-22 20:14 ` Johannes Berg 2018-10-22 20:28 ` Johannes Berg @ 2018-10-22 21:47 ` Bart Van Assche 2018-10-23 0:43 ` Sagi Grimberg 1 sibling, 1 reply; 15+ messages in thread From: Bart Van Assche @ 2018-10-22 21:47 UTC (permalink / raw) To: Johannes Berg, Tejun Heo Cc: linux-kernel, Christoph Hellwig, Sagi Grimberg, linux-nvme @ lists . infradead . org

On Mon, 2018-10-22 at 22:14 +0200, Johannes Berg wrote:
> I must also say that I'm disappointed you'd try to do things this way.
> I'd be (have been?) willing to actually help you understand the problem
> and add the annotations, but rather than answer my question ("where do I
> find the right git tree"!) you just send a revert patch.

Sorry that I had not yet provided that information. You should have received this information through another e-mail thread. See also http://lists.infradead.org/pipermail/linux-nvme/2018-October/020493.html.

> To do that, you have to understand what recursion is valid (I'm guessing
> there's some sort of layering involved), and I'm far from understanding
> anything about the code that triggered this report.

I don't think there is any kind of recursion involved in the NVMe code that triggered the lockdep complaint. Sagi, please correct me if I got this wrong.

Bart.

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing" 2018-10-22 21:47 ` Bart Van Assche @ 2018-10-23 0:43 ` Sagi Grimberg 2018-10-23 19:26 ` Johannes Berg 0 siblings, 1 reply; 15+ messages in thread From: Sagi Grimberg @ 2018-10-23 0:43 UTC (permalink / raw) To: Bart Van Assche, Johannes Berg, Tejun Heo Cc: linux-kernel, Christoph Hellwig, linux-nvme @ lists . infradead . org

>> I must also say that I'm disappointed you'd try to do things this way.
>> I'd be (have been?) willing to actually help you understand the problem
>> and add the annotations, but rather than answer my question ("where do I
>> find the right git tree"!) you just send a revert patch.
>
> Sorry that I had not yet provided that information. You should have
> received this information through another e-mail thread. See also
> http://lists.infradead.org/pipermail/linux-nvme/2018-October/020493.html.
>
>> To do that, you have to understand what recursion is valid (I'm guessing
>> there's some sort of layering involved), and I'm far from understanding
>> anything about the code that triggered this report.
>
> I don't think there is any kind of recursion involved in the NVMe code
> that triggered the lockdep complaint. Sagi, please correct me if I got this
> wrong.

I commented on the original thread. I'm not sure it qualifies as a recursion, but in that use-case, when priv->handler_mutex is taken it is possible that other priv->handler_mutex instances are taken but are guaranteed not to belong to that priv...

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing" 2018-10-23 0:43 ` Sagi Grimberg @ 2018-10-23 19:26 ` Johannes Berg 2018-10-23 21:30 ` Sagi Grimberg 0 siblings, 1 reply; 15+ messages in thread From: Johannes Berg @ 2018-10-23 19:26 UTC (permalink / raw) To: Sagi Grimberg, Bart Van Assche, Tejun Heo Cc: linux-kernel, Christoph Hellwig, linux-nvme @ lists . infradead . org

On Mon, 2018-10-22 at 17:43 -0700, Sagi Grimberg wrote:
> > > I must also say that I'm disappointed you'd try to do things this way.
> > > I'd be (have been?) willing to actually help you understand the problem
> > > and add the annotations, but rather than answer my question ("where do I
> > > find the right git tree"!) you just send a revert patch.
> >
> > Sorry that I had not yet provided that information. You should have
> > received this information through another e-mail thread. See also
> > http://lists.infradead.org/pipermail/linux-nvme/2018-October/020493.html.
> >
> > > To do that, you have to understand what recursion is valid (I'm guessing
> > > there's some sort of layering involved), and I'm far from understanding
> > > anything about the code that triggered this report.
> >
> > I don't think there is any kind of recursion involved in the NVMe code
> > that triggered the lockdep complaint. Sagi, please correct me if I got this
> > wrong.
>
> I commented on the original thread. I'm not sure it qualifies as a
> recursion, but in that use-case, when priv->handler_mutex is taken
> it is possible that other priv->handler_mutex instances are taken but
> are guaranteed not to belong to that priv...

I also commented over there regarding this specific problem.

I think my wording was inaccurate perhaps. It's not so much direct recursion, but a dependency chain connecting the handler_mutex class back to itself. This is how lockdep finds problems: it doesn't consider actual *instances* but only *classes*.
As Sagi mentioned on the other thread, in this specific instance you happen to know out-of-band that this is safe: (somehow) the code ensures that you can't actually recursively connect back to yourself. Perhaps there's some kind of layering involved, I don't know. (*)

As I said over there, this type of thing can be solved with mutex_lock_nested(), though it's possible that in this case it should really be flush_workqueue_nested() or flush_work_nested() or so.

johannes

(*) note that just checking "obj != self" might not be enough: depending on how you pick candidates for "obj" you might recurse down another level and end up back at yourself, just two levels down, and actually have a dependency chain leading back to yourself.

^ permalink raw reply [flat|nested] 15+ messages in thread
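The class-versus-instance point above, and why a _nested() annotation would help, can be sketched with the same kind of toy dependency graph. This is a hedged illustration only; the subclass-splitting mirrors what mutex_lock_nested()'s subclass argument achieves, though lockdep's real mechanism differs in detail:

```python
# Illustrative model: lockdep records dependencies between *classes*.

def reaches(src, dst, edges):
    """DFS: is there a non-empty path from src to dst?"""
    stack = [b for (a, b) in edges if a == src]
    seen = set()
    while stack:
        n = stack.pop()
        if n == dst:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(b for (a, b) in edges if a == n)
    return False

# Two different priv instances share one handler_mutex class "M", so a
# perfectly safe parent-then-child instance ordering still records the
# class-level self-edge M -> M: a false-positive deadlock candidate.
one_class = {("M", "M")}
assert reaches("M", "M", one_class)       # lockdep would complain

# A nested annotation (e.g. mutex_lock_nested() with a subclass) puts
# the inner acquisition into a distinct subclass, here "M#1", so the
# recorded edge no longer closes a cycle:
subclassed = {("M", "M#1")}
assert not reaches("M", "M", subclassed)  # no cycle, no report
```

This is why the suggestion above is a hypothetical flush_work_nested(): the fix is to tell lockdep that the inner acquisition belongs to a different (sub)class, not to weaken the checking.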
* Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing" 2018-10-23 19:26 ` Johannes Berg @ 2018-10-23 21:30 ` Sagi Grimberg 0 siblings, 0 replies; 15+ messages in thread From: Sagi Grimberg @ 2018-10-23 21:30 UTC (permalink / raw) To: Johannes Berg, Bart Van Assche, Tejun Heo Cc: linux-kernel, Christoph Hellwig, linux-nvme @ lists . infradead . org

> I also commented over there regarding this specific problem.
>
> I think my wording was inaccurate perhaps. It's not so much direct
> recursion, but a dependency chain connecting the handler_mutex class
> back to itself. This is how lockdep finds problems, it doesn't consider
> actual *instances* but only *classes*.
>
> As Sagi mentioned on the other thread, in this specific instance you
> happen to know out-of-band that this is safe, (somehow) the code ensures
> that you can't actually recursively connect back to yourself. Perhaps
> there's some kind of layering involved, I don't know. (*)
>
> As I said over there, this type of thing can be solved with
> mutex_lock_nested(), though it's possible that in this case it should
> really be flush_workqueue_nested() or flush_work_nested() or so.

I guess that could make this specific complaint go away...

> johannes
>
> (*) note that just checking "obj != self" might not be enough, depending
> on how you pick candidates for "obj" you might recurse down another
> level and end up back at yourself, just two levels down, and actually
> have a dependency chain leading back to yourself.

It's not a simple check: any removal invocation is serialized under a dedicated workqueue, and in order to be queued there, the connection needs to first be connected, which comes later (after the flush takes place).

^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2018-10-23 21:30 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-22 15:18 [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing" Bart Van Assche
2018-10-22 20:14 ` Johannes Berg
2018-10-22 20:28 ` Johannes Berg
2018-10-22 20:54 ` Bart Van Assche
2018-10-22 21:04 ` Johannes Berg
2018-10-22 21:26 ` Bart Van Assche
2018-10-23 19:44 ` Johannes Berg
2018-10-23 19:58 ` Johannes Berg
2018-10-23 1:17 ` Bart Van Assche
2018-10-23 19:50 ` Johannes Berg
2018-10-23 20:24 ` Johannes Berg
2018-10-22 21:47 ` Bart Van Assche
2018-10-23 0:43 ` Sagi Grimberg
2018-10-23 19:26 ` Johannes Berg
2018-10-23 21:30 ` Sagi Grimberg