* Re: possible deadlock in __generic_file_fsync
From: syzbot <syzbot+5cd33f0e6abe2bb3e397@syzkaller.appspotmail.com>
Date: 2019-03-22 21:28 UTC
To: adilger.kernel, linux-ext4, linux-fsdevel, linux-kernel, syzkaller-bugs, torvalds, tytso, viro

syzbot has bisected this bug to:

commit 9022ada8ab6f1f1a932a3c93815061042e6548a5
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Fri Aug 24 20:16:36 2018 +0000

    Merge branch 'for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=15330437200000
start commit:   270b77a0 Merge tag 'drm-fixes-2018-10-20-1' of git://anong..
git tree:       upstream
final crash:    https://syzkaller.appspot.com/x/report.txt?x=17330437200000
console output: https://syzkaller.appspot.com/x/log.txt?x=13330437200000
kernel config:  https://syzkaller.appspot.com/x/.config?x=b3f55cb3dfcc6c33
dashboard link: https://syzkaller.appspot.com/bug?extid=5cd33f0e6abe2bb3e397
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1436fc45400000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11058e2d400000

Reported-by: syzbot+5cd33f0e6abe2bb3e397@syzkaller.appspotmail.com
Fixes: 9022ada8ab6f ("Merge branch 'for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection
* Re: possible deadlock in __generic_file_fsync
From: Dmitry Vyukov
Date: 2019-03-23 7:16 UTC
To: syzbot
Cc: Andreas Dilger, linux-ext4, linux-fsdevel, LKML, syzkaller-bugs, Linus Torvalds, Theodore Ts'o, Al Viro

On Fri, Mar 22, 2019 at 10:28 PM syzbot
<syzbot+5cd33f0e6abe2bb3e397@syzkaller.appspotmail.com> wrote:
>
> syzbot has bisected this bug to:
>
> commit 9022ada8ab6f1f1a932a3c93815061042e6548a5
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date:   Fri Aug 24 20:16:36 2018 +0000
>
>     Merge branch 'for-4.19' of
>     git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
>
> [...]
>
> Fixes: 9022ada8ab6f ("Merge branch 'for-4.19' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq")
>
> For information about bisection process see: https://goo.gl/tpsmEJ#bisection

This is a lockdep-detected bug, but it is reproduced with very low
probability:

testing commit 270b77a0f30e7bc61a9081b86d74dbb62fa6a69d with gcc (GCC) 8.1.0
run #0: crashed: possible deadlock in __generic_file_fsync
run #1: OK
run #2: OK
run #3: OK
run #4: OK
run #5: OK
run #6: OK
run #7: OK
run #8: OK
run #9: OK

I would expect that for lockdep it's enough to trigger each path
involved in the deadlock just once. Why is it so hard to reproduce
then? Is it something to improve in lockdep?
* Re: possible deadlock in __generic_file_fsync
From: Theodore Ts'o
Date: 2019-03-23 13:56 UTC
To: Dmitry Vyukov
Cc: syzbot, Andreas Dilger, linux-ext4, linux-fsdevel, LKML, syzkaller-bugs, Linus Torvalds, Al Viro

On Sat, Mar 23, 2019 at 08:16:36AM +0100, Dmitry Vyukov wrote:
>
> This is a lockdep-detected bug, but it is reproduced with very low
> probability...
>
> I would expect that for lockdep it's enough to trigger each path
> involved in the deadlock just once. Why is it so hard to reproduce
> then? Is it something to improve in lockdep?

It's a false positive report.  The problem is that without doing some
fairly deep code analysis --- the kind that a human needs to do; this
is not the kind of thing that ML and adjusting weights in a neural net
can deal with --- a computer can't determine what the completion
handler will need to do.

The root cause here is that we have N CPUs that are trying to do
direct I/O, and on the very first DIO write for an fd, we need to
create a workqueue.  (Why do we do it here?  Because most fds don't do
DIO, so we don't want to waste resources unnecessarily.  Why don't we
fix it by adding a mutex?  Because it would slow down *all* direct I/O
operations just to suppress a rare, false-positive lockdep report.)

The reason why it's so hard for lockdep to reproduce is that it's
extremely rare for this situation to get hit.  When it does get hit,
several CPUs will try to create the workqueue, and all but one will
lose the cmpxchg, and so all but one will need to destroy the
workqueue which they had just created.  Since the wq in question has
never been used, it's safe to call destroy_workqueue(wq) while holding
the inode mutex --- but lockdep doesn't know this.

As I pointed out in [1], one way to fix this is to create a new API
and use it instead:

    I_solemnly_swear_this_workqueue_has_never_been_used_please_destroy()

[1] https://lore.kernel.org/patchwork/patch/1003553/#1187773

Unfortunately, this trades off fixing a very low probability lockdep
false-positive report that in practice only gets hit with syzkaller
against making the code more fragile if a developer uses the API
incorrectly.

As you can see from the date on the discussion, it's been over six
months, and there still hasn't been a decision about the best way to
fix this.  I think the real problem is that it's pretty low priority,
since it's only something that syzkaller notices.

The reality is that in a world without syzkaller, maybe once a decade
it would get hit on a real-life workload, and so we'd have to close a
bug report with "Won't Fix; Not Reproducible", and add a note saying
that it's a false-positive lockdep report.  Syzkaller is adding stress
to the system by demanding perfection from lockdep, and it isn't that,
for better or for worse.  ¯\_(ツ)_/¯  The question is what is the best
way to annotate this as a false positive, so we can suppress the
report, either in lockdep or in syzkaller.

Cheers,

                                        - Ted
* Re: possible deadlock in __generic_file_fsync
From: Dmitry Vyukov
Date: 2019-03-26 10:32 UTC
To: Theodore Ts'o
Cc: syzbot, Andreas Dilger, linux-ext4, linux-fsdevel, LKML, syzkaller-bugs, Linus Torvalds, Al Viro

On Sat, Mar 23, 2019 at 2:56 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> It's a false positive report. [...]
>
> The reason why it's so hard for lockdep to reproduce is that it's
> extremely rare for this situation to get hit.  When it does get hit,
> several CPUs will try to create the workqueue, and all but one will
> lose the cmpxchg, and so all but one will need to destroy the
> workqueue which they had just created.  Since the wq in question has
> never been used, it's safe to call destroy_workqueue(wq) while holding
> the inode mutex --- but lockdep doesn't know this.
>
> [...] The question is what is the best way to annotate this as a
> false positive, so we can suppress the report, either in lockdep or
> in syzkaller.

Hi Ted,

Thanks for the analysis. So we can classify the reason for the wrong
bisection result as "bug too hard to trigger".

Re lockdep perfection: do you see any better alternative than what is
happening now? One alternative is obviously to stop testing the
kernel, which would remove all related stress from all involved
parties and remove any perfection/quality requirement from everything
:) But that does not look like a better path forward.

Re I_solemnly_swear_this_workqueue_has_never_been_used_please_destroy:
I wonder if it's possible to automatically note the fact that the
workqueue was used. That should not make the code more fragile and
should not lead to incorrect uses of the API. It can slightly shift
the situation from "reporting false positives" towards "not reporting
true positives", but all bugs should still be detected eventually (we
just need any test where a single item was submitted to the queue).
And in my experience not reporting false positives is much more
important than reporting 100% of true positives.

Something along the lines of:

on submission of an item:

#ifdef LOCKDEP
	WRITE_ONCE(wq->was_used, 1);
#endif

in flush:

#ifdef LOCKDEP
	if (READ_ONCE(wq->was_used)) {
		lock_map_acquire(&wq->lockdep_map);
		lock_map_release(&wq->lockdep_map);
	}
#endif
* [PATCH] fs/direct-io.c: avoid workqueue allocation race
From: Eric Biggers
Date: 2020-03-08 5:52 UTC
To: linux-fsdevel
Cc: linux-xfs, linux-ext4, linux-kernel, syzkaller-bugs

From: Eric Biggers <ebiggers@google.com>

When a thread loses the workqueue allocation race in
sb_init_dio_done_wq(), lockdep reports that the call to
destroy_workqueue() can deadlock waiting for work to complete.  This is
a false positive since the workqueue is empty.  But we shouldn't simply
skip the lockdep check for empty workqueues for everyone.

Just avoid this issue by using a mutex to serialize the workqueue
allocation.  We still keep the preliminary check for ->s_dio_done_wq, so
this doesn't affect direct I/O performance.

Also fix the preliminary check for ->s_dio_done_wq to use READ_ONCE(),
since it's a data race.  (That part wasn't actually found by syzbot yet,
but it could be detected by KCSAN in the future.)

Note: the lockdep false positive could alternatively be fixed by
introducing a new function like "destroy_unused_workqueue()" to the
workqueue API as previously suggested.  But I think it makes sense to
avoid the double allocation anyway.

Reported-by: syzbot+a50c7541a4a55cd49b02@syzkaller.appspotmail.com
Reported-by: syzbot+5cd33f0e6abe2bb3e397@syzkaller.appspotmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 fs/direct-io.c       | 39 ++++++++++++++++++++-------------------
 fs/internal.h        |  9 ++++++++-
 fs/iomap/direct-io.c |  3 +--
 3 files changed, 29 insertions(+), 22 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 00b4d15bb811..8b73a2501c03 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -590,22 +590,25 @@ static inline int dio_bio_reap(struct dio *dio, struct dio_submit *sdio)
  * filesystems that don't need it and also allows us to create the workqueue
  * late enough so that we can include s_id in the name of the workqueue.
  */
-int sb_init_dio_done_wq(struct super_block *sb)
+int __sb_init_dio_done_wq(struct super_block *sb)
 {
-	struct workqueue_struct *old;
-	struct workqueue_struct *wq = alloc_workqueue("dio/%s",
-						      WQ_MEM_RECLAIM, 0,
-						      sb->s_id);
-	if (!wq)
-		return -ENOMEM;
-	/*
-	 * This has to be atomic as more DIOs can race to create the workqueue
-	 */
-	old = cmpxchg(&sb->s_dio_done_wq, NULL, wq);
-	/* Someone created workqueue before us? Free ours... */
-	if (old)
-		destroy_workqueue(wq);
-	return 0;
+	static DEFINE_MUTEX(sb_init_dio_done_wq_mutex);
+	struct workqueue_struct *wq;
+	int err = 0;
+
+	mutex_lock(&sb_init_dio_done_wq_mutex);
+	if (sb->s_dio_done_wq)
+		goto out;
+	wq = alloc_workqueue("dio/%s", WQ_MEM_RECLAIM, 0, sb->s_id);
+	if (!wq) {
+		err = -ENOMEM;
+		goto out;
+	}
+	/* pairs with READ_ONCE() in sb_init_dio_done_wq() */
+	smp_store_release(&sb->s_dio_done_wq, wq);
+out:
+	mutex_unlock(&sb_init_dio_done_wq_mutex);
+	return err;
 }
 
 static int dio_set_defer_completion(struct dio *dio)
@@ -615,9 +618,7 @@ static int dio_set_defer_completion(struct dio *dio)
 	if (dio->defer_completion)
 		return 0;
 	dio->defer_completion = true;
-	if (!sb->s_dio_done_wq)
-		return sb_init_dio_done_wq(sb);
-	return 0;
+	return sb_init_dio_done_wq(sb);
 }
 
 /*
@@ -1250,7 +1251,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 		retval = 0;
 		if (iocb->ki_flags & IOCB_DSYNC)
 			retval = dio_set_defer_completion(dio);
-		else if (!dio->inode->i_sb->s_dio_done_wq) {
+		else {
 			/*
 			 * In case of AIO write racing with buffered read we
 			 * need to defer completion. We can't decide this now,
diff --git a/fs/internal.h b/fs/internal.h
index f3f280b952a3..7813dae1dbcd 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -183,7 +183,14 @@ extern void mnt_pin_kill(struct mount *m);
 extern const struct dentry_operations ns_dentry_operations;
 
 /* direct-io.c: */
-int sb_init_dio_done_wq(struct super_block *sb);
+int __sb_init_dio_done_wq(struct super_block *sb);
+static inline int sb_init_dio_done_wq(struct super_block *sb)
+{
+	/* pairs with smp_store_release() in __sb_init_dio_done_wq() */
+	if (likely(READ_ONCE(sb->s_dio_done_wq)))
+		return 0;
+	return __sb_init_dio_done_wq(sb);
+}
 
 /*
  * fs/stat.c:
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 23837926c0c5..5d81faada8a0 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -484,8 +484,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		dio_warn_stale_pagecache(iocb->ki_filp);
 	ret = 0;
 
-	if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
-	    !inode->i_sb->s_dio_done_wq) {
+	if (iov_iter_rw(iter) == WRITE && !wait_for_completion) {
 		ret = sb_init_dio_done_wq(inode->i_sb);
 		if (ret < 0)
 			goto out_free_dio;
-- 
2.25.1
* Re: [PATCH] fs/direct-io.c: avoid workqueue allocation race
From: Dave Chinner
Date: 2020-03-08 23:12 UTC
To: Eric Biggers
Cc: linux-fsdevel, linux-xfs, linux-ext4, linux-kernel, syzkaller-bugs

On Sat, Mar 07, 2020 at 09:52:21PM -0800, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
>
> When a thread loses the workqueue allocation race in
> sb_init_dio_done_wq(), lockdep reports that the call to
> destroy_workqueue() can deadlock waiting for work to complete.  This is
> a false positive since the workqueue is empty.  But we shouldn't simply
> skip the lockdep check for empty workqueues for everyone.

Why not? If the wq is empty, it can't deadlock, so this is a problem
with the workqueue lockdep annotations, not a problem with code that
is destroying an empty workqueue.

> Just avoid this issue by using a mutex to serialize the workqueue
> allocation.  We still keep the preliminary check for ->s_dio_done_wq, so
> this doesn't affect direct I/O performance.
>
> Also fix the preliminary check for ->s_dio_done_wq to use READ_ONCE(),
> since it's a data race.  (That part wasn't actually found by syzbot yet,
> but it could be detected by KCSAN in the future.)
>
> Note: the lockdep false positive could alternatively be fixed by
> introducing a new function like "destroy_unused_workqueue()" to the
> workqueue API as previously suggested.  But I think it makes sense to
> avoid the double allocation anyway.

Fix the infrastructure; don't work around it by placing constraints
on how the callers can use the infrastructure to work around
problems internal to the infrastructure.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [PATCH] fs/direct-io.c: avoid workqueue allocation race
From: Eric Biggers
Date: 2020-03-09 1:24 UTC
To: Dave Chinner
Cc: linux-fsdevel, linux-xfs, linux-ext4, linux-kernel, syzkaller-bugs

On Mon, Mar 09, 2020 at 10:12:53AM +1100, Dave Chinner wrote:
> On Sat, Mar 07, 2020 at 09:52:21PM -0800, Eric Biggers wrote:
> > When a thread loses the workqueue allocation race in
> > sb_init_dio_done_wq(), lockdep reports that the call to
> > destroy_workqueue() can deadlock waiting for work to complete.  This is
> > a false positive since the workqueue is empty.  But we shouldn't simply
> > skip the lockdep check for empty workqueues for everyone.
>
> Why not? If the wq is empty, it can't deadlock, so this is a problem
> with the workqueue lockdep annotations, not a problem with code that
> is destroying an empty workqueue.

Skipping the lockdep check when flushing an empty workqueue would reduce
the ability of lockdep to detect deadlocks when flushing that workqueue.
I.e., it could cause lots of false negatives, since there are many cases
where workqueues are *usually* empty when flushed/destroyed but it's
still possible for them to be nonempty.

> > Note: the lockdep false positive could alternatively be fixed by
> > introducing a new function like "destroy_unused_workqueue()" to the
> > workqueue API as previously suggested.  But I think it makes sense to
> > avoid the double allocation anyway.
>
> Fix the infrastructure; don't work around it by placing constraints
> on how the callers can use the infrastructure to work around
> problems internal to the infrastructure.

Well, it's also preferable not to make our debugging tools less
effective to support people doing weird things that they shouldn't
really be doing anyway.

(BTW, we need READ_ONCE() on ->s_dio_done_wq anyway to properly annotate
the data race.  That could be split into a separate patch, though.)

Another idea that came up is to make each workqueue_struct track whether
work has ever been queued on it, and make flush_workqueue() skip the
lockdep check if the workqueue has always been empty.  (That could still
cause lockdep false negatives, but not as many as if we checked whether
the workqueue is *currently* empty.)  Would you prefer that solution?
Adding more overhead to workqueues would be undesirable, though, so I
think it would have to be conditional on CONFIG_LOCKDEP, like
(untested):

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 301db4406bc37..72222c09bcaeb 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -263,6 +263,7 @@ struct workqueue_struct {
 	char			*lock_name;
 	struct lock_class_key	key;
 	struct lockdep_map	lockdep_map;
+	bool			used;
 #endif
 	char			name[WQ_NAME_LEN]; /* I: workqueue name */
 
@@ -1404,6 +1405,9 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 	lockdep_assert_irqs_disabled();
 
 	debug_work_activate(work);
+#ifdef CONFIG_LOCKDEP
+	WRITE_ONCE(wq->used, true);
+#endif
 
 	/* if draining, only works from the same workqueue are allowed */
 	if (unlikely(wq->flags & __WQ_DRAINING) &&
@@ -2772,8 +2776,12 @@ void flush_workqueue(struct workqueue_struct *wq)
 	if (WARN_ON(!wq_online))
 		return;
 
-	lock_map_acquire(&wq->lockdep_map);
-	lock_map_release(&wq->lockdep_map);
+#ifdef CONFIG_LOCKDEP
+	if (READ_ONCE(wq->used)) {
+		lock_map_acquire(&wq->lockdep_map);
+		lock_map_release(&wq->lockdep_map);
+	}
+#endif
 
 	mutex_lock(&wq->mutex);
* Re: [PATCH] fs/direct-io.c: avoid workqueue allocation race
From: Darrick J. Wong
Date: 2020-03-10 16:27 UTC
To: Eric Biggers
Cc: Dave Chinner, linux-fsdevel, linux-xfs, linux-ext4, linux-kernel, syzkaller-bugs

On Sun, Mar 08, 2020 at 06:24:24PM -0700, Eric Biggers wrote:
> On Mon, Mar 09, 2020 at 10:12:53AM +1100, Dave Chinner wrote:
> > Why not? If the wq is empty, it can't deadlock, so this is a problem
> > with the workqueue lockdep annotations, not a problem with code that
> > is destroying an empty workqueue.
>
> Skipping the lockdep check when flushing an empty workqueue would reduce
> the ability of lockdep to detect deadlocks when flushing that workqueue.
> I.e., it could cause lots of false negatives, since there are many cases
> where workqueues are *usually* empty when flushed/destroyed but it's
> still possible for them to be nonempty.
>
> [...]
>
> Another idea that came up is to make each workqueue_struct track whether
> work has ever been queued on it, and make flush_workqueue() skip the
> lockdep check if the workqueue has always been empty.  (That could still
> cause lockdep false negatives, but not as many as if we checked whether
> the workqueue is *currently* empty.)  Would you prefer that solution?
> Adding more overhead to workqueues would be undesirable, though, so I
> think it would have to be conditional on CONFIG_LOCKDEP, like
> (untested):

I can't speak for Dave, but if the problem here really is that lockdep's
modelling of flush_workqueue()'s behavior could be improved to eliminate
false reports, then this seems reasonable to me...

--D

> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 301db4406bc37..72222c09bcaeb 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> [...]
* Re: [PATCH] fs/direct-io.c: avoid workqueue allocation race
From: Dave Chinner
Date: 2020-03-10 22:22 UTC
To: Darrick J. Wong
Cc: Eric Biggers, linux-fsdevel, linux-xfs, linux-ext4, linux-kernel, syzkaller-bugs

[ Sorry, my responses are limited at the moment because I took a chunk
out of a fingertip a couple of days ago, and I can only do about half
an hour before my hand and arm start to cramp from the weird positions
and motions three-finger typing results in.... ]

On Tue, Mar 10, 2020 at 09:27:58AM -0700, Darrick J. Wong wrote:
> On Sun, Mar 08, 2020 at 06:24:24PM -0700, Eric Biggers wrote:
> > On Mon, Mar 09, 2020 at 10:12:53AM +1100, Dave Chinner wrote:
> > > [...]
> >
> > Another idea that came up is to make each workqueue_struct track
> > whether work has ever been queued on it, and make flush_workqueue()
> > skip the lockdep check if the workqueue has always been empty.
> > [...]  Would you prefer that solution?  Adding more overhead to
> > workqueues would be undesirable, though, so I think it would have to
> > be conditional on CONFIG_LOCKDEP, like (untested):
>
> I can't speak for Dave, but if the problem here really is that lockdep's
> modelling of flush_workqueue()'s behavior could be improved to eliminate
> false reports, then this seems reasonable to me...

Yeah, that's what I've been trying to say.  It seems much more
reasonable to fix it for everyone once with a few lines of code than to
have to rewrite every caller that might trip over this.  e.g. think of
all the failure teardown paths that destroy workqueues without having
used them...

So, yeah, this seems like a much better approach....

> > @@ -1404,6 +1405,9 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
> > 	lockdep_assert_irqs_disabled();
> >
> > 	debug_work_activate(work);
> > +#ifdef CONFIG_LOCKDEP
> > +	WRITE_ONCE(wq->used, true);
> > +#endif

.... with an appropriate comment to explain why this code is needed.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com