* Re: possible deadlock in __generic_file_fsync
From: syzbot <syzbot+5cd33f0e6abe2bb3e397@syzkaller.appspotmail.com>
Date: 2019-03-22 21:28 UTC
To: adilger.kernel, linux-ext4, linux-fsdevel, linux-kernel, syzkaller-bugs, torvalds, tytso, viro

syzbot has bisected this bug to:

commit 9022ada8ab6f1f1a932a3c93815061042e6548a5
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Fri Aug 24 20:16:36 2018 +0000

    Merge branch 'for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=15330437200000
start commit:   270b77a0 Merge tag 'drm-fixes-2018-10-20-1' of git://anong..
git tree:       upstream
final crash:    https://syzkaller.appspot.com/x/report.txt?x=17330437200000
console output: https://syzkaller.appspot.com/x/log.txt?x=13330437200000
kernel config:  https://syzkaller.appspot.com/x/.config?x=b3f55cb3dfcc6c33
dashboard link: https://syzkaller.appspot.com/bug?extid=5cd33f0e6abe2bb3e397
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1436fc45400000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11058e2d400000

Reported-by: syzbot+5cd33f0e6abe2bb3e397@syzkaller.appspotmail.com
Fixes: 9022ada8ab6f ("Merge branch 'for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection
* Re: possible deadlock in __generic_file_fsync
From: Dmitry Vyukov
Date: 2019-03-23 7:16 UTC
To: syzbot
Cc: Andreas Dilger, linux-ext4, linux-fsdevel, LKML, syzkaller-bugs, Linus Torvalds, Theodore Ts'o, Al Viro

On Fri, Mar 22, 2019 at 10:28 PM syzbot
<syzbot+5cd33f0e6abe2bb3e397@syzkaller.appspotmail.com> wrote:
>
> syzbot has bisected this bug to:
>
> commit 9022ada8ab6f1f1a932a3c93815061042e6548a5
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date:   Fri Aug 24 20:16:36 2018 +0000
>
>     Merge branch 'for-4.19' of
>     git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
>
> [...]
>
> Fixes: 9022ada8ab6f ("Merge branch 'for-4.19' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq")
>
> For information about bisection process see: https://goo.gl/tpsmEJ#bisection

This is a lockdep-detected bug, but it is reproduced with very low
probability:

testing commit 270b77a0f30e7bc61a9081b86d74dbb62fa6a69d with gcc (GCC) 8.1.0
run #0: crashed: possible deadlock in __generic_file_fsync
run #1: OK
run #2: OK
run #3: OK
run #4: OK
run #5: OK
run #6: OK
run #7: OK
run #8: OK
run #9: OK

I would expect that for lockdep it's enough to trigger each path
involved in the deadlock just once. Why is it so hard to reproduce
then? Is it something to improve in lockdep?
* Re: possible deadlock in __generic_file_fsync
From: Theodore Ts'o
Date: 2019-03-23 13:56 UTC
To: Dmitry Vyukov
Cc: syzbot, Andreas Dilger, linux-ext4, linux-fsdevel, LKML, syzkaller-bugs, Linus Torvalds, Al Viro

On Sat, Mar 23, 2019 at 08:16:36AM +0100, Dmitry Vyukov wrote:
>
> This is a lockdep-detected bug, but it is reproduced with very low
> probability...
>
> I would expect that for lockdep it's enough to trigger each path
> involved in the deadlock just once. Why is it so hard to reproduce
> then? Is it something to improve in lockdep?

It's a false positive report.  The problem is that without doing some
fairly deep code analysis --- the kind that a human needs to do; this
is not the kind of thing that ML and adjusting weights in a neural net
can deal with --- a computer can't determine what the completion
handler will need to do.

The root cause here is that we have N CPUs that are trying to do
direct I/O, and on the very first DIO write for an fd, we need to
create a workqueue.  (Why do we do it here?  Because most fds don't do
DIO, so we don't want to waste resources unnecessarily.  Why don't we
fix it by adding a mutex?  Because it would slow down *all* direct I/O
operations just to suppress a rare, false-positive lockdep report.)

The reason why it's so hard for lockdep to reproduce is that it's
extremely rare for this situation to get hit.  When it does get hit,
several CPUs will try to create the workqueue, and all but one will
lose the cmpxchg, and so all but one will need to destroy the
workqueue which they had just created.  Since the wq in question has
never been used, it's safe to call destroy_workqueue(wq) while holding
the inode mutex --- but lockdep doesn't know this.

As I pointed out in [1], one way to fix this is to create a new API
and use it instead:

    I_solemnly_swear_this_workqueue_has_never_been_used_please_destroy()

[1] https://lore.kernel.org/patchwork/patch/1003553/#1187773

Unfortunately, this trades off fixing a very low probability lockdep
false-positive report that in practice only gets hit with syzkaller
against making the code more fragile if a developer uses the API
incorrectly.

As you can see from the date on the discussion, it's been over six
months, and there still hasn't been a decision about the best way to
fix this.  I think the real problem is that it's pretty low priority,
since it's only something that syzkaller notices.

The reality is that in a world without syzkaller, maybe once a decade
it would get hit on a real-life workload, and so we'd have to close a
bug report with "Won't Fix; Not Reproducible", and add a note saying
that it's a false-positive lockdep report.  Syzkaller is adding stress
to the system by demanding perfection from lockdep, and it isn't that,
for better or for worse.  ¯\_(ツ)_/¯  The question is what is the best
way to annotate this as a false positive, so we can suppress the
report, either in lockdep or in syzkaller.

Cheers,

                                        - Ted
* Re: possible deadlock in __generic_file_fsync
From: Dmitry Vyukov
Date: 2019-03-26 10:32 UTC
To: Theodore Ts'o
Cc: syzbot, Andreas Dilger, linux-ext4, linux-fsdevel, LKML, syzkaller-bugs, Linus Torvalds, Al Viro

On Sat, Mar 23, 2019 at 2:56 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> It's a false positive report. [...]
>
> The reason why it's so hard for lockdep to reproduce is that it's
> extremely rare for this situation to get hit.  When it does get hit,
> several CPUs will try to create the workqueue, and all but one will
> lose the cmpxchg, and so all but one will need to destroy the
> workqueue which they had just created.  Since the wq in question has
> never been used, it's safe to call destroy_workqueue(wq) while holding
> the inode mutex --- but lockdep doesn't know this.
>
> [...] The question is what is the best way to annotate this as a
> false positive, so we can suppress the report, either in lockdep or
> in syzkaller.

Hi Ted,

Thanks for the analysis. So we can classify the reason for the wrong
bisection result as "bug too hard to trigger".

Re lockdep perfection: do you see any better alternative than what is
happening now? One alternative is obviously to stop testing the
kernel, which would remove all related stress from all involved
parties and remove any perfection/quality requirement from everything
:) But that does not look like a better path forward.

Re I_solemnly_swear_this_workqueue_has_never_been_used_please_destroy:
I wonder if it's possible to automatically note the fact that the
workqueue was used. That should not make the code more fragile and
should not lead to incorrect uses of the API. It can slightly shift
the situation from "reporting false positives" towards "not reporting
true positives", but all bugs should still be detected eventually (we
just need any test where a single item was submitted to the queue).
And in my experience not reporting false positives is much more
important than reporting 100% of true positives.

Something along the lines of:

on submission of an item:

#ifdef LOCKDEP
	WRITE_ONCE(wq->was_used, 1);
#endif

in flush:

#ifdef LOCKDEP
	if (READ_ONCE(wq->was_used)) {
		lock_map_acquire(&wq->lockdep_map);
		lock_map_release(&wq->lockdep_map);
	}
#endif
* [PATCH] fs/direct-io.c: avoid workqueue allocation race
From: Eric Biggers
Date: 2020-03-08 5:52 UTC
To: linux-fsdevel
Cc: linux-xfs, linux-ext4, linux-kernel, syzkaller-bugs

From: Eric Biggers <ebiggers@google.com>

When a thread loses the workqueue allocation race in
sb_init_dio_done_wq(), lockdep reports that the call to
destroy_workqueue() can deadlock waiting for work to complete.  This is
a false positive since the workqueue is empty.  But we shouldn't simply
skip the lockdep check for empty workqueues for everyone.

Just avoid this issue by using a mutex to serialize the workqueue
allocation.  We still keep the preliminary check for ->s_dio_done_wq, so
this doesn't affect direct I/O performance.

Also fix the preliminary check for ->s_dio_done_wq to use READ_ONCE(),
since it's a data race.  (That part wasn't actually found by syzbot yet,
but it could be detected by KCSAN in the future.)

Note: the lockdep false positive could alternatively be fixed by
introducing a new function like "destroy_unused_workqueue()" to the
workqueue API as previously suggested.  But I think it makes sense to
avoid the double allocation anyway.

Reported-by: syzbot+a50c7541a4a55cd49b02@syzkaller.appspotmail.com
Reported-by: syzbot+5cd33f0e6abe2bb3e397@syzkaller.appspotmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 fs/direct-io.c       | 39 ++++++++++++++++++++-------------------
 fs/internal.h        |  9 ++++++++-
 fs/iomap/direct-io.c |  3 +--
 3 files changed, 29 insertions(+), 22 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 00b4d15bb811..8b73a2501c03 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -590,22 +590,25 @@ static inline int dio_bio_reap(struct dio *dio, struct dio_submit *sdio)
  * filesystems that don't need it and also allows us to create the workqueue
  * late enough so that we can include s_id in the name of the workqueue.
  */
-int sb_init_dio_done_wq(struct super_block *sb)
+int __sb_init_dio_done_wq(struct super_block *sb)
 {
-	struct workqueue_struct *old;
-	struct workqueue_struct *wq = alloc_workqueue("dio/%s",
-						      WQ_MEM_RECLAIM, 0,
-						      sb->s_id);
-	if (!wq)
-		return -ENOMEM;
-	/*
-	 * This has to be atomic as more DIOs can race to create the workqueue
-	 */
-	old = cmpxchg(&sb->s_dio_done_wq, NULL, wq);
-	/* Someone created workqueue before us? Free ours... */
-	if (old)
-		destroy_workqueue(wq);
-	return 0;
+	static DEFINE_MUTEX(sb_init_dio_done_wq_mutex);
+	struct workqueue_struct *wq;
+	int err = 0;
+
+	mutex_lock(&sb_init_dio_done_wq_mutex);
+	if (sb->s_dio_done_wq)
+		goto out;
+	wq = alloc_workqueue("dio/%s", WQ_MEM_RECLAIM, 0, sb->s_id);
+	if (!wq) {
+		err = -ENOMEM;
+		goto out;
+	}
+	/* pairs with READ_ONCE() in sb_init_dio_done_wq() */
+	smp_store_release(&sb->s_dio_done_wq, wq);
+out:
+	mutex_unlock(&sb_init_dio_done_wq_mutex);
+	return err;
 }
 
 static int dio_set_defer_completion(struct dio *dio)
@@ -615,9 +618,7 @@ static int dio_set_defer_completion(struct dio *dio)
 	if (dio->defer_completion)
 		return 0;
 	dio->defer_completion = true;
-	if (!sb->s_dio_done_wq)
-		return sb_init_dio_done_wq(sb);
-	return 0;
+	return sb_init_dio_done_wq(sb);
 }
 
 /*
@@ -1250,7 +1251,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 		retval = 0;
 		if (iocb->ki_flags & IOCB_DSYNC)
 			retval = dio_set_defer_completion(dio);
-		else if (!dio->inode->i_sb->s_dio_done_wq) {
+		else {
 			/*
 			 * In case of AIO write racing with buffered read we
 			 * need to defer completion. We can't decide this now,
diff --git a/fs/internal.h b/fs/internal.h
index f3f280b952a3..7813dae1dbcd 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -183,7 +183,14 @@ extern void mnt_pin_kill(struct mount *m);
 extern const struct dentry_operations ns_dentry_operations;
 
 /* direct-io.c: */
-int sb_init_dio_done_wq(struct super_block *sb);
+int __sb_init_dio_done_wq(struct super_block *sb);
+static inline int sb_init_dio_done_wq(struct super_block *sb)
+{
+	/* pairs with smp_store_release() in __sb_init_dio_done_wq() */
+	if (likely(READ_ONCE(sb->s_dio_done_wq)))
+		return 0;
+	return __sb_init_dio_done_wq(sb);
+}
 
 /*
  * fs/stat.c:
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 23837926c0c5..5d81faada8a0 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -484,8 +484,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		dio_warn_stale_pagecache(iocb->ki_filp);
 	ret = 0;
 
-	if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
-	    !inode->i_sb->s_dio_done_wq) {
+	if (iov_iter_rw(iter) == WRITE && !wait_for_completion) {
 		ret = sb_init_dio_done_wq(inode->i_sb);
 		if (ret < 0)
 			goto out_free_dio;
-- 
2.25.1
* Re: [PATCH] fs/direct-io.c: avoid workqueue allocation race
From: Dave Chinner
Date: 2020-03-08 23:12 UTC
To: Eric Biggers
Cc: linux-fsdevel, linux-xfs, linux-ext4, linux-kernel, syzkaller-bugs

On Sat, Mar 07, 2020 at 09:52:21PM -0800, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
>
> When a thread loses the workqueue allocation race in
> sb_init_dio_done_wq(), lockdep reports that the call to
> destroy_workqueue() can deadlock waiting for work to complete.  This is
> a false positive since the workqueue is empty.  But we shouldn't simply
> skip the lockdep check for empty workqueues for everyone.

Why not? If the wq is empty, it can't deadlock, so this is a problem
with the workqueue lockdep annotations, not a problem with code that
is destroying an empty workqueue.

> Just avoid this issue by using a mutex to serialize the workqueue
> allocation.  We still keep the preliminary check for ->s_dio_done_wq, so
> this doesn't affect direct I/O performance.
>
> Also fix the preliminary check for ->s_dio_done_wq to use READ_ONCE(),
> since it's a data race.  (That part wasn't actually found by syzbot yet,
> but it could be detected by KCSAN in the future.)
>
> Note: the lockdep false positive could alternatively be fixed by
> introducing a new function like "destroy_unused_workqueue()" to the
> workqueue API as previously suggested.  But I think it makes sense to
> avoid the double allocation anyway.

Fix the infrastructure; don't work around it by placing constraints
on how the callers can use the infrastructure to work around
problems internal to the infrastructure.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [PATCH] fs/direct-io.c: avoid workqueue allocation race
From: Eric Biggers
Date: 2020-03-09 1:24 UTC
To: Dave Chinner
Cc: linux-fsdevel, linux-xfs, linux-ext4, linux-kernel, syzkaller-bugs

On Mon, Mar 09, 2020 at 10:12:53AM +1100, Dave Chinner wrote:
> On Sat, Mar 07, 2020 at 09:52:21PM -0800, Eric Biggers wrote:
> > When a thread loses the workqueue allocation race in
> > sb_init_dio_done_wq(), lockdep reports that the call to
> > destroy_workqueue() can deadlock waiting for work to complete.  This is
> > a false positive since the workqueue is empty.  But we shouldn't simply
> > skip the lockdep check for empty workqueues for everyone.
>
> Why not? If the wq is empty, it can't deadlock, so this is a problem
> with the workqueue lockdep annotations, not a problem with code that
> is destroying an empty workqueue.

Skipping the lockdep check when flushing an empty workqueue would reduce
the ability of lockdep to detect deadlocks when flushing that workqueue.
I.e., it could cause lots of false negatives, since there are many cases
where workqueues are *usually* empty when flushed/destroyed but it's
still possible for them to be nonempty.

> > Note: the lockdep false positive could alternatively be fixed by
> > introducing a new function like "destroy_unused_workqueue()" to the
> > workqueue API as previously suggested.  But I think it makes sense to
> > avoid the double allocation anyway.
>
> Fix the infrastructure; don't work around it by placing constraints
> on how the callers can use the infrastructure to work around
> problems internal to the infrastructure.

Well, it's also preferable not to make our debugging tools less
effective to support people doing weird things that they shouldn't
really be doing anyway.

(BTW, we need READ_ONCE() on ->s_dio_done_wq anyway to properly annotate
the data race.  That could be split into a separate patch, though.)

Another idea that came up is to make each workqueue_struct track whether
work has ever been queued on it, and make flush_workqueue() skip the
lockdep check if the workqueue has always been empty.  (That could still
cause lockdep false negatives, but not as many as if we checked whether
the workqueue is *currently* empty.)  Would you prefer that solution?
Adding more overhead to workqueues would be undesirable, though, so I
think it would have to be conditional on CONFIG_LOCKDEP, like
(untested):

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 301db4406bc37..72222c09bcaeb 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -263,6 +263,7 @@ struct workqueue_struct {
 	char			*lock_name;
 	struct lock_class_key	key;
 	struct lockdep_map	lockdep_map;
+	bool			used;
 #endif
 	char			name[WQ_NAME_LEN]; /* I: workqueue name */
 
@@ -1404,6 +1405,9 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 	lockdep_assert_irqs_disabled();
 
 	debug_work_activate(work);
+#ifdef CONFIG_LOCKDEP
+	WRITE_ONCE(wq->used, true);
+#endif
 
 	/* if draining, only works from the same workqueue are allowed */
 	if (unlikely(wq->flags & __WQ_DRAINING) &&
@@ -2772,8 +2776,12 @@ void flush_workqueue(struct workqueue_struct *wq)
 	if (WARN_ON(!wq_online))
 		return;
 
-	lock_map_acquire(&wq->lockdep_map);
-	lock_map_release(&wq->lockdep_map);
+#ifdef CONFIG_LOCKDEP
+	if (READ_ONCE(wq->used)) {
+		lock_map_acquire(&wq->lockdep_map);
+		lock_map_release(&wq->lockdep_map);
+	}
+#endif
 
 	mutex_lock(&wq->mutex);
* Re: [PATCH] fs/direct-io.c: avoid workqueue allocation race
From: Darrick J. Wong
Date: 2020-03-10 16:27 UTC
To: Eric Biggers
Cc: Dave Chinner, linux-fsdevel, linux-xfs, linux-ext4, linux-kernel, syzkaller-bugs

On Sun, Mar 08, 2020 at 06:24:24PM -0700, Eric Biggers wrote:
> On Mon, Mar 09, 2020 at 10:12:53AM +1100, Dave Chinner wrote:
> > Why not? If the wq is empty, it can't deadlock, so this is a problem
> > with the workqueue lockdep annotations, not a problem with code that
> > is destroying an empty workqueue.
>
> Skipping the lockdep check when flushing an empty workqueue would reduce
> the ability of lockdep to detect deadlocks when flushing that workqueue.
> I.e., it could cause lots of false negatives, since there are many cases
> where workqueues are *usually* empty when flushed/destroyed but it's
> still possible for them to be nonempty.
>
> [...]
>
> Another idea that came up is to make each workqueue_struct track whether
> work has ever been queued on it, and make flush_workqueue() skip the
> lockdep check if the workqueue has always been empty.  (That could still
> cause lockdep false negatives, but not as many as if we checked whether
> the workqueue is *currently* empty.)  Would you prefer that solution?
> Adding more overhead to workqueues would be undesirable, though, so I
> think it would have to be conditional on CONFIG_LOCKDEP, like
> (untested):

I can't speak for Dave, but if the problem here really is that lockdep's
modelling of flush_workqueue()'s behavior could be improved to eliminate
false reports, then this seems reasonable to me...

--D

> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 301db4406bc37..72222c09bcaeb 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> [...]
* Re: [PATCH] fs/direct-io.c: avoid workqueue allocation race
From: Dave Chinner
Date: 2020-03-10 22:22 UTC
To: Darrick J. Wong
Cc: Eric Biggers, linux-fsdevel, linux-xfs, linux-ext4, linux-kernel, syzkaller-bugs

[ Sorry, my responses are limited at the moment because I took a chunk
out of a fingertip a couple of days ago, and I can only do about half
an hour before my hand and arm start to cramp from the weird positions
and motions three-finger typing results in.... ]

On Tue, Mar 10, 2020 at 09:27:58AM -0700, Darrick J. Wong wrote:
> On Sun, Mar 08, 2020 at 06:24:24PM -0700, Eric Biggers wrote:
> > On Mon, Mar 09, 2020 at 10:12:53AM +1100, Dave Chinner wrote:
> > > [...]
> >
> > Another idea that came up is to make each workqueue_struct track
> > whether work has ever been queued on it, and make flush_workqueue()
> > skip the lockdep check if the workqueue has always been empty.
> > [...]  Would you prefer that solution?  Adding more overhead to
> > workqueues would be undesirable, though, so I think it would have to
> > be conditional on CONFIG_LOCKDEP, like (untested):
>
> I can't speak for Dave, but if the problem here really is that lockdep's
> modelling of flush_workqueue()'s behavior could be improved to eliminate
> false reports, then this seems reasonable to me...

Yeah, that's what I've been trying to say.  It seems much more
reasonable to fix it for everyone once with a few lines of code than to
have to rewrite every caller that might trip over this.  e.g. think of
all the failure teardown paths that destroy workqueues without having
used them...

So, yeah, this seems like a much better approach....

> > @@ -1404,6 +1405,9 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
> > 	lockdep_assert_irqs_disabled();
> >
> > 	debug_work_activate(work);
> > +#ifdef CONFIG_LOCKDEP
> > +	WRITE_ONCE(wq->used, true);
> > +#endif

.... with an appropriate comment to explain why this code is needed.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com