* Re: [PATCH] Btrfs: fix workqueue deadlock on dependent filesystems
From: Nikolay Borisov @ 2019-08-07 7:17 UTC
To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Tejun Heo
On 6.08.19 at 20:34, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
>
> We hit the following very strange deadlock on a system with Btrfs on a
> loop device backed by another Btrfs filesystem:
>
> 1. The top (loop device) filesystem queues an async_cow work item from
> cow_file_range_async(). We'll call this work X.
> 2. Worker thread A starts work X (normal_work_helper()).
> 3. Worker thread A executes the ordered work for the top filesystem
> (run_ordered_work()).
> 4. Worker thread A finishes the ordered work for work X and frees X
> (work->ordered_free()).
> 5. Worker thread A executes another ordered work and gets blocked on I/O
> to the bottom filesystem (still in run_ordered_work()).
> 6. Meanwhile, the bottom filesystem allocates and queues an async_cow
> work item which happens to be the recently-freed X.
> 7. The workqueue code sees that X is already being executed by worker
> thread A, so it schedules X to be executed _after_ worker thread A
> finishes (see the find_worker_executing_work() call in
> process_one_work()).
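For reference, the collision check mentioned in step 7 looks roughly
like this (a simplified sketch of find_worker_executing_work() from
kernel/workqueue.c; details may differ between kernel versions):

/*
 * Sketch only: the workqueue core identifies an executing work item
 * by its (address, work function) pair.  A freed item that is
 * recycled at the same address with the same work function is
 * indistinguishable from the original, so the new execution gets
 * deferred until the current one finishes.
 */
static struct worker *find_worker_executing_work(struct worker_pool *pool,
						 struct work_struct *work)
{
	struct worker *worker;

	hash_for_each_possible(pool->busy_hash, worker, hentry,
			       (unsigned long)work)
		if (worker->current_work == work &&
		    worker->current_func == work->func)
			return worker;	/* treated as the same item */

	return NULL;
}

The deadlock therefore needs both the same address (the recycled
allocation) and the same work function (both filesystems' async_cow
items use the same btrfs helper) to trigger.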
Isn't the bigger problem that a single run_ordered_work call could
potentially run the ordered work for more than one normal work item?
E.g. what if btrfs' code were reworked such that run_ordered_work
executes ordered_func for just one work item (the one which called the
function in the first place)? Wouldn't that also resolve the issue?
Correct me if I'm wrong, but it seems silly to have a work item outlive
its own ordered_free call, which is what currently happens, right?
>
> Now, the top filesystem is waiting for I/O on the bottom filesystem, but
> the bottom filesystem is waiting for the top filesystem to finish, so we
> deadlock.
>
> This happens because we are breaking the workqueue assumption that a
> work item cannot be recycled while it still depends on other work. Fix
> it by waiting to free the work item until we are done with all of the
> related ordered work.
>
> P.S.:
>
> One might ask why the workqueue code doesn't try to detect a recycled
> work item. It actually does try by checking whether the work item has
> the same work function (find_worker_executing_work()), but in our case
> the function is the same. This is the only key that the workqueue code
> has available to compare, short of adding an additional, layer-violating
> "custom key". Considering that we're the only ones that have ever hit
> this, we should just play by the rules.
>
> Unfortunately, we haven't been able to create a minimal reproducer other
> than our full container setup using a compress-force=zstd filesystem on
> top of another compress-force=zstd filesystem.
>
> Suggested-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
> fs/btrfs/async-thread.c | 56 ++++++++++++++++++++++++++++++++---------
> 1 file changed, 44 insertions(+), 12 deletions(-)
>
> diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
> index 122cb97c7909..b2bfde560331 100644
> --- a/fs/btrfs/async-thread.c
> +++ b/fs/btrfs/async-thread.c
> @@ -250,16 +250,17 @@ static inline void thresh_exec_hook(struct __btrfs_workqueue *wq)
> }
> }
>
> -static void run_ordered_work(struct __btrfs_workqueue *wq)
> +static void run_ordered_work(struct btrfs_work *self)
> {
> + struct __btrfs_workqueue *wq = self->wq;
> struct list_head *list = &wq->ordered_list;
> struct btrfs_work *work;
> spinlock_t *lock = &wq->list_lock;
> unsigned long flags;
> + void *wtag;
> + bool free_self = false;
>
> while (1) {
> - void *wtag;
> -
> spin_lock_irqsave(lock, flags);
> if (list_empty(list))
> break;
> @@ -285,16 +286,47 @@ static void run_ordered_work(struct __btrfs_workqueue *wq)
> list_del(&work->ordered_list);
> spin_unlock_irqrestore(lock, flags);
>
> - /*
> - * We don't want to call the ordered free functions with the
> - * lock held though. Save the work as tag for the trace event,
> - * because the callback could free the structure.
> - */
> - wtag = work;
> - work->ordered_free(work);
> - trace_btrfs_all_work_done(wq->fs_info, wtag);
> + if (work == self) {
> + /*
> + * This is the work item that the worker is currently
> + * executing.
> + *
> + * The kernel workqueue code guarantees non-reentrancy
> + * of work items. I.e., if a work item with the same
> + * address and work function is queued twice, the second
> + * execution is blocked until the first one finishes. A
> + * work item may be freed and recycled with the same
> + * work function; the workqueue code assumes that the
> + * original work item cannot depend on the recycled work
> + * item in that case (see find_worker_executing_work()).
> + *
> + * Note that the work of one Btrfs filesystem may depend
> + * on the work of another Btrfs filesystem via, e.g., a
> + * loop device. Therefore, we must not allow the current
> + * work item to be recycled until we are really done,
> + * otherwise we break the above assumption and can
> + * deadlock.
> + */
> + free_self = true;
> + } else {
> + /*
> + * We don't want to call the ordered free functions with
> + * the lock held though. Save the work as tag for the
> + * trace event, because the callback could free the
> + * structure.
> + */
> + wtag = work;
> + work->ordered_free(work);
> + trace_btrfs_all_work_done(wq->fs_info, wtag);
> + }
> }
> spin_unlock_irqrestore(lock, flags);
> +
> + if (free_self) {
> + wtag = self;
> + self->ordered_free(self);
> + trace_btrfs_all_work_done(wq->fs_info, wtag);
> + }
> }
>
> static void normal_work_helper(struct btrfs_work *work)
> @@ -322,7 +354,7 @@ static void normal_work_helper(struct btrfs_work *work)
> work->func(work);
> if (need_order) {
> set_bit(WORK_DONE_BIT, &work->flags);
> - run_ordered_work(wq);
> + run_ordered_work(work);
> }
> if (!need_order)
> trace_btrfs_all_work_done(wq->fs_info, wtag);
>
* Re: [PATCH] Btrfs: fix workqueue deadlock on dependent filesystems
From: Omar Sandoval @ 2019-08-07 17:08 UTC
To: Nikolay Borisov; +Cc: linux-btrfs, kernel-team, Tejun Heo
On Wed, Aug 07, 2019 at 10:17:26AM +0300, Nikolay Borisov wrote:
>
>
> On 6.08.19 at 20:34, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> >
> > We hit the following very strange deadlock on a system with Btrfs on a
> > loop device backed by another Btrfs filesystem:
> >
> > 1. The top (loop device) filesystem queues an async_cow work item from
> > cow_file_range_async(). We'll call this work X.
> > 2. Worker thread A starts work X (normal_work_helper()).
> > 3. Worker thread A executes the ordered work for the top filesystem
> > (run_ordered_work()).
> > 4. Worker thread A finishes the ordered work for work X and frees X
> > (work->ordered_free()).
> > 5. Worker thread A executes another ordered work and gets blocked on I/O
> > to the bottom filesystem (still in run_ordered_work()).
> > 6. Meanwhile, the bottom filesystem allocates and queues an async_cow
> > work item which happens to be the recently-freed X.
> > 7. The workqueue code sees that X is already being executed by worker
> > thread A, so it schedules X to be executed _after_ worker thread A
> > finishes (see the find_worker_executing_work() call in
> > process_one_work()).
>
> Isn't the bigger problem that a single run_ordered_work call could
> potentially run the ordered work for more than one normal work item?
> E.g. what if btrfs' code were reworked such that run_ordered_work
> executes ordered_func for just one work item (the one which called the
> function in the first place)? Wouldn't that also resolve the issue?
> Correct me if I'm wrong, but it seems silly to have a work item outlive
> its own ordered_free call, which is what currently happens, right?
We can't always run the ordered work for the normal work because then it
wouldn't be ordered :) If work item N completes before item N-1, then we
can't run the ordered work for N yet. Then, when N-1 completes, we need
to do the ordered work for N-1 _and_ N, which is how we get into this
situation.
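Schematically, the drain looks like this (a minimal sketch with
hypothetical names; locking and the free_self handling from the patch
are omitted):

/*
 * Hypothetical sketch, not the real btrfs code.  It only shows why a
 * single call can run the ordered step for several items: ordered
 * steps must run in queue order, so if N finished its normal work
 * before N-1, N's ordered step waited, and whichever worker completes
 * N-1 then drains both N-1 and N.
 */
static void drain_ordered(struct example_wq *wq)
{
	struct example_work *work;

	while ((work = list_first_entry_or_null(&wq->ordered_list,
						struct example_work,
						ordered_list))) {
		/* Stop at the first item whose normal work isn't done. */
		if (!test_bit(WORK_DONE_BIT, &work->flags))
			break;
		list_del(&work->ordered_list);
		work->ordered_func(work);
		/*
		 * Freeing here is what allows the item to be recycled
		 * while this worker is still running -- the bug the
		 * patch fixes by deferring the free for "self".
		 */
		work->ordered_free(work);
	}
}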
* Re: [PATCH] Btrfs: fix workqueue deadlock on dependent filesystems
From: Filipe Manana @ 2019-08-12 11:38 UTC
To: Omar Sandoval; +Cc: linux-btrfs, kernel-team, Tejun Heo
On Tue, Aug 6, 2019 at 6:48 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> From: Omar Sandoval <osandov@fb.com>
>
> We hit the following very strange deadlock on a system with Btrfs on a
> loop device backed by another Btrfs filesystem:
>
> 1. The top (loop device) filesystem queues an async_cow work item from
> cow_file_range_async(). We'll call this work X.
> 2. Worker thread A starts work X (normal_work_helper()).
> 3. Worker thread A executes the ordered work for the top filesystem
> (run_ordered_work()).
> 4. Worker thread A finishes the ordered work for work X and frees X
> (work->ordered_free()).
> 5. Worker thread A executes another ordered work and gets blocked on I/O
> to the bottom filesystem (still in run_ordered_work()).
> 6. Meanwhile, the bottom filesystem allocates and queues an async_cow
> work item which happens to be the recently-freed X.
> 7. The workqueue code sees that X is already being executed by worker
> thread A, so it schedules X to be executed _after_ worker thread A
> finishes (see the find_worker_executing_work() call in
> process_one_work()).
>
> Now, the top filesystem is waiting for I/O on the bottom filesystem, but
> the bottom filesystem is waiting for the top filesystem to finish, so we
> deadlock.
>
> This happens because we are breaking the workqueue assumption that a
> work item cannot be recycled while it still depends on other work. Fix
> it by waiting to free the work item until we are done with all of the
> related ordered work.
>
> P.S.:
>
> One might ask why the workqueue code doesn't try to detect a recycled
> work item. It actually does try by checking whether the work item has
> the same work function (find_worker_executing_work()), but in our case
> the function is the same. This is the only key that the workqueue code
> has available to compare, short of adding an additional, layer-violating
> "custom key". Considering that we're the only ones that have ever hit
> this, we should just play by the rules.
>
> Unfortunately, we haven't been able to create a minimal reproducer other
> than our full container setup using a compress-force=zstd filesystem on
> top of another compress-force=zstd filesystem.
>
> Suggested-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Looks good to me, thanks.
Another variant of the problem Liu fixed back in 2014 (commit
9e0af23764344f7f1b68e4eefbe7dc865018b63d).
> ---
> fs/btrfs/async-thread.c | 56 ++++++++++++++++++++++++++++++++---------
> 1 file changed, 44 insertions(+), 12 deletions(-)
>
> diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
> index 122cb97c7909..b2bfde560331 100644
> --- a/fs/btrfs/async-thread.c
> +++ b/fs/btrfs/async-thread.c
> @@ -250,16 +250,17 @@ static inline void thresh_exec_hook(struct __btrfs_workqueue *wq)
> }
> }
>
> -static void run_ordered_work(struct __btrfs_workqueue *wq)
> +static void run_ordered_work(struct btrfs_work *self)
> {
> + struct __btrfs_workqueue *wq = self->wq;
> struct list_head *list = &wq->ordered_list;
> struct btrfs_work *work;
> spinlock_t *lock = &wq->list_lock;
> unsigned long flags;
> + void *wtag;
> + bool free_self = false;
>
> while (1) {
> - void *wtag;
> -
> spin_lock_irqsave(lock, flags);
> if (list_empty(list))
> break;
> @@ -285,16 +286,47 @@ static void run_ordered_work(struct __btrfs_workqueue *wq)
> list_del(&work->ordered_list);
> spin_unlock_irqrestore(lock, flags);
>
> - /*
> - * We don't want to call the ordered free functions with the
> - * lock held though. Save the work as tag for the trace event,
> - * because the callback could free the structure.
> - */
> - wtag = work;
> - work->ordered_free(work);
> - trace_btrfs_all_work_done(wq->fs_info, wtag);
> + if (work == self) {
> + /*
> + * This is the work item that the worker is currently
> + * executing.
> + *
> + * The kernel workqueue code guarantees non-reentrancy
> + * of work items. I.e., if a work item with the same
> + * address and work function is queued twice, the second
> + * execution is blocked until the first one finishes. A
> + * work item may be freed and recycled with the same
> + * work function; the workqueue code assumes that the
> + * original work item cannot depend on the recycled work
> + * item in that case (see find_worker_executing_work()).
> + *
> + * Note that the work of one Btrfs filesystem may depend
> + * on the work of another Btrfs filesystem via, e.g., a
> + * loop device. Therefore, we must not allow the current
> + * work item to be recycled until we are really done,
> + * otherwise we break the above assumption and can
> + * deadlock.
> + */
> + free_self = true;
> + } else {
> + /*
> + * We don't want to call the ordered free functions with
> + * the lock held though. Save the work as tag for the
> + * trace event, because the callback could free the
> + * structure.
> + */
> + wtag = work;
> + work->ordered_free(work);
> + trace_btrfs_all_work_done(wq->fs_info, wtag);
> + }
> }
> spin_unlock_irqrestore(lock, flags);
> +
> + if (free_self) {
> + wtag = self;
> + self->ordered_free(self);
> + trace_btrfs_all_work_done(wq->fs_info, wtag);
> + }
> }
>
> static void normal_work_helper(struct btrfs_work *work)
> @@ -322,7 +354,7 @@ static void normal_work_helper(struct btrfs_work *work)
> work->func(work);
> if (need_order) {
> set_bit(WORK_DONE_BIT, &work->flags);
> - run_ordered_work(wq);
> + run_ordered_work(work);
> }
> if (!need_order)
> trace_btrfs_all_work_done(wq->fs_info, wtag);
> --
> 2.22.0
>
--
Filipe David Manana,
“Whether you think you can, or you think you can't — you're right.”
* Re: [PATCH] Btrfs: fix workqueue deadlock on dependent filesystems
From: Omar Sandoval @ 2019-08-12 18:48 UTC
To: Filipe Manana; +Cc: linux-btrfs, kernel-team, Tejun Heo
On Mon, Aug 12, 2019 at 12:38:55PM +0100, Filipe Manana wrote:
> On Tue, Aug 6, 2019 at 6:48 PM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > From: Omar Sandoval <osandov@fb.com>
> >
> > We hit the following very strange deadlock on a system with Btrfs on a
> > loop device backed by another Btrfs filesystem:
> >
> > 1. The top (loop device) filesystem queues an async_cow work item from
> > cow_file_range_async(). We'll call this work X.
> > 2. Worker thread A starts work X (normal_work_helper()).
> > 3. Worker thread A executes the ordered work for the top filesystem
> > (run_ordered_work()).
> > 4. Worker thread A finishes the ordered work for work X and frees X
> > (work->ordered_free()).
> > 5. Worker thread A executes another ordered work and gets blocked on I/O
> > to the bottom filesystem (still in run_ordered_work()).
> > 6. Meanwhile, the bottom filesystem allocates and queues an async_cow
> > work item which happens to be the recently-freed X.
> > 7. The workqueue code sees that X is already being executed by worker
> > thread A, so it schedules X to be executed _after_ worker thread A
> > finishes (see the find_worker_executing_work() call in
> > process_one_work()).
> >
> > Now, the top filesystem is waiting for I/O on the bottom filesystem, but
> > the bottom filesystem is waiting for the top filesystem to finish, so we
> > deadlock.
> >
> > This happens because we are breaking the workqueue assumption that a
> > work item cannot be recycled while it still depends on other work. Fix
> > it by waiting to free the work item until we are done with all of the
> > related ordered work.
> >
> > P.S.:
> >
> > One might ask why the workqueue code doesn't try to detect a recycled
> > work item. It actually does try by checking whether the work item has
> > the same work function (find_worker_executing_work()), but in our case
> > the function is the same. This is the only key that the workqueue code
> > has available to compare, short of adding an additional, layer-violating
> > "custom key". Considering that we're the only ones that have ever hit
> > this, we should just play by the rules.
> >
> > Unfortunately, we haven't been able to create a minimal reproducer other
> > than our full container setup using a compress-force=zstd filesystem on
> > top of another compress-force=zstd filesystem.
> >
> > Suggested-by: Tejun Heo <tj@kernel.org>
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
>
> Looks good to me, thanks.
> Another variant of the problem Liu fixed back in 2014 (commit
> 9e0af23764344f7f1b68e4eefbe7dc865018b63d).
Good point. I think we can actually get rid of those unique helpers with
this fix. I'll send some followup cleanups.
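For context, those helpers are the per-queue wrappers added by the 2014
commit, roughly of this shape (abbreviated from memory, so treat the
exact names as approximate):

#define BTRFS_WORK_HELPER(name)						\
void btrfs_##name(struct work_struct *arg)				\
{									\
	struct btrfs_work *work = container_of(arg, struct btrfs_work,	\
					       normal_work);		\
	normal_work_helper(work);					\
}

/* One wrapper per workqueue, for example: */
BTRFS_WORK_HELPER(worker_helper);
BTRFS_WORK_HELPER(endio_helper);

Each queue gets its own work function solely so that the workqueue
core's (address, function) key differs between queues; with the
recycling problem fixed at the source, a single helper should do.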
* Re: [PATCH] Btrfs: fix workqueue deadlock on dependent filesystems
From: Filipe Manana @ 2019-08-12 18:53 UTC
To: Omar Sandoval; +Cc: linux-btrfs, kernel-team, Tejun Heo
On Mon, Aug 12, 2019 at 7:48 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> On Mon, Aug 12, 2019 at 12:38:55PM +0100, Filipe Manana wrote:
> > On Tue, Aug 6, 2019 at 6:48 PM Omar Sandoval <osandov@osandov.com> wrote:
> > >
> > > From: Omar Sandoval <osandov@fb.com>
> > >
> > > We hit the following very strange deadlock on a system with Btrfs on a
> > > loop device backed by another Btrfs filesystem:
> > >
> > > 1. The top (loop device) filesystem queues an async_cow work item from
> > > cow_file_range_async(). We'll call this work X.
> > > 2. Worker thread A starts work X (normal_work_helper()).
> > > 3. Worker thread A executes the ordered work for the top filesystem
> > > (run_ordered_work()).
> > > 4. Worker thread A finishes the ordered work for work X and frees X
> > > (work->ordered_free()).
> > > 5. Worker thread A executes another ordered work and gets blocked on I/O
> > > to the bottom filesystem (still in run_ordered_work()).
> > > 6. Meanwhile, the bottom filesystem allocates and queues an async_cow
> > > work item which happens to be the recently-freed X.
> > > 7. The workqueue code sees that X is already being executed by worker
> > > thread A, so it schedules X to be executed _after_ worker thread A
> > > finishes (see the find_worker_executing_work() call in
> > > process_one_work()).
> > >
> > > Now, the top filesystem is waiting for I/O on the bottom filesystem, but
> > > the bottom filesystem is waiting for the top filesystem to finish, so we
> > > deadlock.
> > >
> > > This happens because we are breaking the workqueue assumption that a
> > > work item cannot be recycled while it still depends on other work. Fix
> > > it by waiting to free the work item until we are done with all of the
> > > related ordered work.
> > >
> > > P.S.:
> > >
> > > One might ask why the workqueue code doesn't try to detect a recycled
> > > work item. It actually does try by checking whether the work item has
> > > the same work function (find_worker_executing_work()), but in our case
> > > the function is the same. This is the only key that the workqueue code
> > > has available to compare, short of adding an additional, layer-violating
> > > "custom key". Considering that we're the only ones that have ever hit
> > > this, we should just play by the rules.
> > >
> > > Unfortunately, we haven't been able to create a minimal reproducer other
> > > than our full container setup using a compress-force=zstd filesystem on
> > > top of another compress-force=zstd filesystem.
> > >
> > > Suggested-by: Tejun Heo <tj@kernel.org>
> > > Signed-off-by: Omar Sandoval <osandov@fb.com>
> >
> > Reviewed-by: Filipe Manana <fdmanana@suse.com>
> >
> > Looks good to me, thanks.
> > Another variant of the problem Liu fixed back in 2014 (commit
> > 9e0af23764344f7f1b68e4eefbe7dc865018b63d).
>
> Good point. I think we can actually get rid of those unique helpers with
> this fix. I'll send some followup cleanups.
Great! Thanks.
--
Filipe David Manana,
“Whether you think you can, or you think you can't — you're right.”
* Re: [PATCH] Btrfs: fix workqueue deadlock on dependent filesystems
From: David Sterba @ 2019-08-19 16:37 UTC
To: Omar Sandoval; +Cc: linux-btrfs, kernel-team, Tejun Heo
On Tue, Aug 06, 2019 at 10:34:52AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
>
> We hit the following very strange deadlock on a system with Btrfs on a
> loop device backed by another Btrfs filesystem:
>
> 1. The top (loop device) filesystem queues an async_cow work item from
> cow_file_range_async(). We'll call this work X.
> 2. Worker thread A starts work X (normal_work_helper()).
> 3. Worker thread A executes the ordered work for the top filesystem
> (run_ordered_work()).
> 4. Worker thread A finishes the ordered work for work X and frees X
> (work->ordered_free()).
> 5. Worker thread A executes another ordered work and gets blocked on I/O
> to the bottom filesystem (still in run_ordered_work()).
> 6. Meanwhile, the bottom filesystem allocates and queues an async_cow
> work item which happens to be the recently-freed X.
> 7. The workqueue code sees that X is already being executed by worker
> thread A, so it schedules X to be executed _after_ worker thread A
> finishes (see the find_worker_executing_work() call in
> process_one_work()).
>
> Now, the top filesystem is waiting for I/O on the bottom filesystem, but
> the bottom filesystem is waiting for the top filesystem to finish, so we
> deadlock.
>
> This happens because we are breaking the workqueue assumption that a
> work item cannot be recycled while it still depends on other work. Fix
> it by waiting to free the work item until we are done with all of the
> related ordered work.
>
> P.S.:
>
> One might ask why the workqueue code doesn't try to detect a recycled
> work item. It actually does try by checking whether the work item has
> the same work function (find_worker_executing_work()), but in our case
> the function is the same. This is the only key that the workqueue code
> has available to compare, short of adding an additional, layer-violating
> "custom key". Considering that we're the only ones that have ever hit
> this, we should just play by the rules.
>
> Unfortunately, we haven't been able to create a minimal reproducer other
> than our full container setup using a compress-force=zstd filesystem on
> top of another compress-force=zstd filesystem.
>
> Suggested-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
Added to misc-next, thanks.