Re: [PATCH] Btrfs: fix workqueue deadlock on dependent filesystems

From: Filipe Manana <fdmanana@gmail.com>
To: Omar Sandoval <osandov@osandov.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>,
	kernel-team@fb.com, Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH] Btrfs: fix workqueue deadlock on dependent filesystems
Date: Mon, 12 Aug 2019 12:38:55 +0100
Message-ID: <CAL3q7H4cSMNSKfQKtFk9Q5Shw3VxMFZQ0E7uusL0efHzyN3MXw@mail.gmail.com>
In-Reply-To: <0bea516a54b26e4e1c42e6fe47548cb48cc4172b.1565112813.git.osandov@fb.com>

On Tue, Aug 6, 2019 at 6:48 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> From: Omar Sandoval <osandov@fb.com>
>
> We hit the following very strange deadlock on a system with Btrfs on a
> loop device backed by another Btrfs filesystem:
>
> 1. The top (loop device) filesystem queues an async_cow work item from
>    cow_file_range_async(). We'll call this work X.
> 2. Worker thread A starts work X (normal_work_helper()).
> 3. Worker thread A executes the ordered work for the top filesystem
>    (run_ordered_work()).
> 4. Worker thread A finishes the ordered work for work X and frees X
>    (work->ordered_free()).
> 5. Worker thread A executes another ordered work and gets blocked on I/O
>    to the bottom filesystem (still in run_ordered_work()).
> 6. Meanwhile, the bottom filesystem allocates and queues an async_cow
>    work item which happens to be the recently-freed X.
> 7. The workqueue code sees that X is already being executed by worker
>    thread A, so it schedules X to be executed _after_ worker thread A
>    finishes (see the find_worker_executing_work() call in
>    process_one_work()).
>
> Now, the top filesystem is waiting for I/O on the bottom filesystem, but
> the bottom filesystem is waiting for the top filesystem to finish, so we
> deadlock.
>
> This happens because we are breaking the workqueue assumption that a
> work item cannot be recycled while it still depends on other work. Fix
> it by waiting to free the work item until we are done with all of the
> related ordered work.
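
For readers following along, here is a minimal user-space sketch of the
"free ourselves last" idea described above. It is not the patch itself:
every toy_* name is hypothetical, the ordered list is modeled as a plain
array, and "freeing" only records the item's id in a log so the order can
be inspected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_WORKS 8

/* Hypothetical stand-in for struct btrfs_work. */
struct toy_work {
	int id;
	bool done;	/* plays the role of WORK_DONE_BIT */
};

/* Records the order in which items were "freed". */
struct toy_log {
	int freed[TOY_MAX_WORKS];
	size_t count;
};

/* Stand-in for work->ordered_free(): only logs the id. */
static void toy_ordered_free(struct toy_log *log, const struct toy_work *work)
{
	assert(log->count < TOY_MAX_WORKS);
	log->freed[log->count++] = work->id;
}

/*
 * Walk the ordered list front to back and stop at the first item that
 * is not done yet. Every finished item is freed right away, except the
 * item the caller is currently executing ("self"), which is only
 * flagged and freed after the loop.
 */
static void toy_run_ordered_work(struct toy_work **list, size_t n,
				 const struct toy_work *self,
				 struct toy_log *log)
{
	bool free_self = false;

	for (size_t i = 0; i < n; i++) {
		struct toy_work *work = list[i];

		if (!work->done)
			break;
		if (work == self)
			free_self = true;	/* defer: we are still running */
		else
			toy_ordered_free(log, work);
	}

	if (free_self)
		toy_ordered_free(log, self);
}

/*
 * Three finished items, with the caller's own item first in the list.
 * Returns the id that was freed last, or -1 if not all three were freed.
 */
int toy_last_freed_id(void)
{
	struct toy_work a = { .id = 1, .done = true };
	struct toy_work b = { .id = 2, .done = true };
	struct toy_work c = { .id = 3, .done = true };
	struct toy_work *list[] = { &a, &b, &c };
	struct toy_log log = { .count = 0 };

	toy_run_ordered_work(list, 3, &a, &log);
	return log.count == 3 ? log.freed[2] : -1;
}
```

In this model the caller's item sits first in the list but is freed after
the other two, so its memory cannot be handed out again while the loop is
still working through later items.
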
>
> P.S.:
>
> One might ask why the workqueue code doesn't try to detect a recycled
> work item. It actually does try by checking whether the work item has
> the same work function (find_worker_executing_work()), but in our case
> the function is the same. This is the only key that the workqueue code
> has available to compare, short of adding an additional, layer-violating
> "custom key". Considering that we're the only ones that have ever hit
> this, we should just play by the rules.
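
To make the matching rule concrete, here is a small user-space sketch,
assuming a heavily simplified model of a worker pool. The toy_* names are
hypothetical; only the comparison itself (same address and same work
function) is meant to mirror what find_worker_executing_work() checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_work;
typedef void (*toy_work_fn)(struct toy_work *);

/* Hypothetical stand-in for struct work_struct. */
struct toy_work {
	toy_work_fn func;
	int runs;
};

/* Hypothetical stand-in for a workqueue worker. */
struct toy_worker {
	struct toy_work *current_work;	/* address of the running item */
	toy_work_fn current_func;	/* function of the running item */
};

static void toy_async_cow_fn(struct toy_work *w) { w->runs += 1; }
static void toy_other_fn(struct toy_work *w) { w->runs += 2; }

/*
 * Return the busy worker whose running item has the same address and
 * the same work function as "work", or NULL if there is none.
 */
static struct toy_worker *
toy_find_worker_executing_work(struct toy_worker *workers, size_t n,
			       const struct toy_work *work)
{
	assert(work != NULL);

	for (size_t i = 0; i < n; i++) {
		if (workers[i].current_work == work &&
		    workers[i].current_func == work->func)
			return &workers[i];
	}
	return NULL;
}

/*
 * Model a work item that is recycled at the same address while a worker
 * is still recorded as executing the old one. Returns true if the
 * recycled item would be treated as "already running" and therefore
 * queued behind that worker.
 */
bool toy_recycled_item_collides(bool same_func)
{
	struct toy_work slot = { .func = toy_async_cow_fn, .runs = 0 };
	struct toy_worker workers[1] = {
		{ .current_work = &slot, .current_func = slot.func },
	};

	/* The slot is "freed" and reused by a different owner. */
	slot.func = same_func ? toy_async_cow_fn : toy_other_fn;
	slot.runs = 0;

	return toy_find_worker_executing_work(workers, 1, &slot) != NULL;
}
```

With a different work function the recycled item is not mistaken for the
running one. With the same function, as in the async_cow case, it is, and
that is the situation in step 7 of the scenario.
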
>
> Unfortunately, we haven't been able to create a minimal reproducer other
> than our full container setup using a compress-force=zstd filesystem on
> top of another compress-force=zstd filesystem.
>
> Suggested-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good to me, thanks.
This is another variant of the problem Liu fixed back in 2014 (commit
9e0af23764344f7f1b68e4eefbe7dc865018b63d).

> ---
>  fs/btrfs/async-thread.c | 56 ++++++++++++++++++++++++++++++++---------
>  1 file changed, 44 insertions(+), 12 deletions(-)
>
> diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
> index 122cb97c7909..b2bfde560331 100644
> --- a/fs/btrfs/async-thread.c
> +++ b/fs/btrfs/async-thread.c
> @@ -250,16 +250,17 @@ static inline void thresh_exec_hook(struct __btrfs_workqueue *wq)
>         }
>  }
>
> -static void run_ordered_work(struct __btrfs_workqueue *wq)
> +static void run_ordered_work(struct btrfs_work *self)
>  {
> +       struct __btrfs_workqueue *wq = self->wq;
>         struct list_head *list = &wq->ordered_list;
>         struct btrfs_work *work;
>         spinlock_t *lock = &wq->list_lock;
>         unsigned long flags;
> +       void *wtag;
> +       bool free_self = false;
>
>         while (1) {
> -               void *wtag;
> -
>                 spin_lock_irqsave(lock, flags);
>                 if (list_empty(list))
>                         break;
> @@ -285,16 +286,47 @@ static void run_ordered_work(struct __btrfs_workqueue *wq)
>                 list_del(&work->ordered_list);
>                 spin_unlock_irqrestore(lock, flags);
>
> -               /*
> -                * We don't want to call the ordered free functions with the
> -                * lock held though. Save the work as tag for the trace event,
> -                * because the callback could free the structure.
> -                */
> -               wtag = work;
> -               work->ordered_free(work);
> -               trace_btrfs_all_work_done(wq->fs_info, wtag);
> +               if (work == self) {
> +                       /*
> +                        * This is the work item that the worker is currently
> +                        * executing.
> +                        *
> +                        * The kernel workqueue code guarantees non-reentrancy
> +                        * of work items. I.e., if a work item with the same
> +                        * address and work function is queued twice, the second
> +                        * execution is blocked until the first one finishes. A
> +                        * work item may be freed and recycled with the same
> +                        * work function; the workqueue code assumes that the
> +                        * original work item cannot depend on the recycled work
> +                        * item in that case (see find_worker_executing_work()).
> +                        *
> +                        * Note that the work of one Btrfs filesystem may depend
> +                        * on the work of another Btrfs filesystem via, e.g., a
> +                        * loop device. Therefore, we must not allow the current
> +                        * work item to be recycled until we are really done,
> +                        * otherwise we break the above assumption and can
> +                        * deadlock.
> +                        */
> +                       free_self = true;
> +               } else {
> +                       /*
> +                        * We don't want to call the ordered free functions with
> +                        * the lock held though. Save the work as tag for the
> +                        * trace event, because the callback could free the
> +                        * structure.
> +                        */
> +                       wtag = work;
> +                       work->ordered_free(work);
> +                       trace_btrfs_all_work_done(wq->fs_info, wtag);
> +               }
>         }
>         spin_unlock_irqrestore(lock, flags);
> +
> +       if (free_self) {
> +               wtag = self;
> +               self->ordered_free(self);
> +               trace_btrfs_all_work_done(wq->fs_info, wtag);
> +       }
>  }
>
>  static void normal_work_helper(struct btrfs_work *work)
> @@ -322,7 +354,7 @@ static void normal_work_helper(struct btrfs_work *work)
>         work->func(work);
>         if (need_order) {
>                 set_bit(WORK_DONE_BIT, &work->flags);
> -               run_ordered_work(wq);
> +               run_ordered_work(work);
>         }
>         if (!need_order)
>                 trace_btrfs_all_work_done(wq->fs_info, wtag);
> --
> 2.22.0
>

-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”