linux-btrfs.vger.kernel.org archive mirror
* [PATCH] Btrfs: fix workqueue deadlock on dependent filesystems
@ 2019-08-06 17:34 Omar Sandoval
  2019-08-07  7:17 ` Nikolay Borisov
From: Omar Sandoval @ 2019-08-06 17:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, Tejun Heo

From: Omar Sandoval <osandov@fb.com>

We hit the following very strange deadlock on a system with Btrfs on a
loop device backed by another Btrfs filesystem:

1. The top (loop device) filesystem queues an async_cow work item from
   cow_file_range_async(). We'll call this work X.
2. Worker thread A starts work X (normal_work_helper()).
3. Worker thread A executes the ordered work for the top filesystem
   (run_ordered_work()).
4. Worker thread A finishes the ordered work for work X and frees X
   (work->ordered_free()).
5. Worker thread A executes another ordered work and gets blocked on I/O
   to the bottom filesystem (still in run_ordered_work()).
6. Meanwhile, the bottom filesystem allocates and queues an async_cow
   work item which happens to be the recently-freed X.
7. The workqueue code sees that X is already being executed by worker
   thread A, so it schedules X to be executed _after_ worker thread A
   finishes (see the find_worker_executing_work() call in
   process_one_work()).

Now, the top filesystem is waiting for I/O on the bottom filesystem, but
the bottom filesystem is waiting for the top filesystem to finish, so we
deadlock.
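
To make steps 6-7 concrete, here is a minimal, hypothetical sketch (not
taken from the Btrfs code; the struct and function names are made up)
of how a recycled allocation can satisfy the workqueue's non-reentrancy
check:

#include <linux/slab.h>
#include <linux/workqueue.h>

struct async_cow_like {		/* hypothetical, for illustration only */
	struct work_struct work;
	/* ... per-range CoW state ... */
};

static void cow_helper(struct work_struct *work)
{
	/* Long-running; may block on I/O to another filesystem. */
}

static void queue_one(void)
{
	struct async_cow_like *x = kmalloc(sizeof(*x), GFP_NOFS);

	if (!x)
		return;
	INIT_WORK(&x->work, cow_helper);
	queue_work(system_wq, &x->work);
}

If cow_helper() frees x while the worker is still running (as
ordered_free() does in step 4) and the bottom filesystem's kmalloc()
then returns the same address, the second queue_work() finds a busy
worker whose current work has the same address and the same work
function, so the "new" item is deferred behind the blocked worker.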

This happens because we are breaking the workqueue assumption that a
work item cannot be recycled while it still depends on other work. Fix
it by waiting to free the work item until we are done with all of the
related ordered work.

P.S.:

One might ask why the workqueue code doesn't try to detect a recycled
work item. It actually does try by checking whether the work item has
the same work function (find_worker_executing_work()), but in our case
the function is the same. This is the only key that the workqueue code
has available to compare, short of adding an additional, layer-violating
"custom key". Considering that we're the only ones that have ever hit
this, we should just play by the rules.
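
For reference, that lookup keys on exactly the pair (work item address,
work function); paraphrased from kernel/workqueue.c of this era
(lightly abridged, so consult the actual source for the authoritative
version):

static struct worker *find_worker_executing_work(struct worker_pool *pool,
						 struct work_struct *work)
{
	struct worker *worker;

	/* Hash on the work item's address, then require that the busy
	 * worker is running this exact address with this function. */
	hash_for_each_possible(pool->busy_hash, worker, hentry,
			       (unsigned long)work)
		if (worker->current_work == work &&
		    worker->current_func == work->func)
			return worker;

	return NULL;
}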

Unfortunately, we haven't been able to create a minimal reproducer other
than our full container setup using a compress-force=zstd filesystem on
top of another compress-force=zstd filesystem.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/async-thread.c | 56 ++++++++++++++++++++++++++++++++---------
 1 file changed, 44 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index 122cb97c7909..b2bfde560331 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -250,16 +250,17 @@ static inline void thresh_exec_hook(struct __btrfs_workqueue *wq)
 	}
 }
 
-static void run_ordered_work(struct __btrfs_workqueue *wq)
+static void run_ordered_work(struct btrfs_work *self)
 {
+	struct __btrfs_workqueue *wq = self->wq;
 	struct list_head *list = &wq->ordered_list;
 	struct btrfs_work *work;
 	spinlock_t *lock = &wq->list_lock;
 	unsigned long flags;
+	void *wtag;
+	bool free_self = false;
 
 	while (1) {
-		void *wtag;
-
 		spin_lock_irqsave(lock, flags);
 		if (list_empty(list))
 			break;
@@ -285,16 +286,47 @@ static void run_ordered_work(struct __btrfs_workqueue *wq)
 		list_del(&work->ordered_list);
 		spin_unlock_irqrestore(lock, flags);
 
-		/*
-		 * We don't want to call the ordered free functions with the
-		 * lock held though. Save the work as tag for the trace event,
-		 * because the callback could free the structure.
-		 */
-		wtag = work;
-		work->ordered_free(work);
-		trace_btrfs_all_work_done(wq->fs_info, wtag);
+		if (work == self) {
+			/*
+			 * This is the work item that the worker is currently
+			 * executing.
+			 *
+			 * The kernel workqueue code guarantees non-reentrancy
+			 * of work items. I.e., if a work item with the same
+			 * address and work function is queued twice, the second
+			 * execution is blocked until the first one finishes. A
+			 * work item may be freed and recycled with the same
+			 * work function; the workqueue code assumes that the
+			 * original work item cannot depend on the recycled work
+			 * item in that case (see find_worker_executing_work()).
+			 *
+			 * Note that the work of one Btrfs filesystem may depend
+			 * on the work of another Btrfs filesystem via, e.g., a
+			 * loop device. Therefore, we must not allow the current
+			 * work item to be recycled until we are really done,
+			 * otherwise we break the above assumption and can
+			 * deadlock.
+			 */
+			free_self = true;
+		} else {
+			/*
+			 * We don't want to call the ordered free functions with
+			 * the lock held though. Save the work as tag for the
+			 * trace event, because the callback could free the
+			 * structure.
+			 */
+			wtag = work;
+			work->ordered_free(work);
+			trace_btrfs_all_work_done(wq->fs_info, wtag);
+		}
 	}
 	spin_unlock_irqrestore(lock, flags);
+
+	if (free_self) {
+		wtag = self;
+		self->ordered_free(self);
+		trace_btrfs_all_work_done(wq->fs_info, wtag);
+	}
 }
 
 static void normal_work_helper(struct btrfs_work *work)
@@ -322,7 +354,7 @@ static void normal_work_helper(struct btrfs_work *work)
 	work->func(work);
 	if (need_order) {
 		set_bit(WORK_DONE_BIT, &work->flags);
-		run_ordered_work(wq);
+		run_ordered_work(work);
 	}
 	if (!need_order)
 		trace_btrfs_all_work_done(wq->fs_info, wtag);
-- 
2.22.0



* Re: [PATCH] Btrfs: fix workqueue deadlock on dependent filesystems
  2019-08-06 17:34 [PATCH] Btrfs: fix workqueue deadlock on dependent filesystems Omar Sandoval
@ 2019-08-07  7:17 ` Nikolay Borisov
  2019-08-07 17:08   ` Omar Sandoval
  2019-08-12 11:38 ` Filipe Manana
  2019-08-19 16:37 ` David Sterba
From: Nikolay Borisov @ 2019-08-07  7:17 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Tejun Heo



On 6.08.19 at 20:34, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> We hit the following very strange deadlock on a system with Btrfs on a
> loop device backed by another Btrfs filesystem:
> 
> 1. The top (loop device) filesystem queues an async_cow work item from
>    cow_file_range_async(). We'll call this work X.
> 2. Worker thread A starts work X (normal_work_helper()).
> 3. Worker thread A executes the ordered work for the top filesystem
>    (run_ordered_work()).
> 4. Worker thread A finishes the ordered work for work X and frees X
>    (work->ordered_free()).
> 5. Worker thread A executes another ordered work and gets blocked on I/O
>    to the bottom filesystem (still in run_ordered_work()).
> 6. Meanwhile, the bottom filesystem allocates and queues an async_cow
>    work item which happens to be the recently-freed X.
> 7. The workqueue code sees that X is already being executed by worker
>    thread A, so it schedules X to be executed _after_ worker thread A
>    finishes (see the find_worker_executing_work() call in
>    process_one_work()).

Isn't the bigger problem that a single run_ordered_work() call could
potentially run the ordered work for more than one normal work item?
E.g. what if btrfs' code were reworked so that run_ordered_work()
executed ordered_func for just one work item (the one which called the
function in the first place)? Wouldn't that also resolve the issue?
Correct me if I'm wrong, but it seems silly to have a work item outlive
its ordered_free call, which is what currently happens, right?



> 
> Now, the top filesystem is waiting for I/O on the bottom filesystem, but
> the bottom filesystem is waiting for the top filesystem to finish, so we
> deadlock.
> 
> This happens because we are breaking the workqueue assumption that a
> work item cannot be recycled while it still depends on other work. Fix
> it by waiting to free the work item until we are done with all of the
> related ordered work.
> 
> P.S.:
> 
> One might ask why the workqueue code doesn't try to detect a recycled
> work item. It actually does try by checking whether the work item has
> the same work function (find_worker_executing_work()), but in our case
> the function is the same. This is the only key that the workqueue code
> has available to compare, short of adding an additional, layer-violating
> "custom key". Considering that we're the only ones that have ever hit
> this, we should just play by the rules.
> 
> Unfortunately, we haven't been able to create a minimal reproducer other
> than our full container setup using a compress-force=zstd filesystem on
> top of another compress-force=zstd filesystem.
> 
> Suggested-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  fs/btrfs/async-thread.c | 56 ++++++++++++++++++++++++++++++++---------
>  1 file changed, 44 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
> index 122cb97c7909..b2bfde560331 100644
> --- a/fs/btrfs/async-thread.c
> +++ b/fs/btrfs/async-thread.c
> @@ -250,16 +250,17 @@ static inline void thresh_exec_hook(struct __btrfs_workqueue *wq)
>  	}
>  }
>  
> -static void run_ordered_work(struct __btrfs_workqueue *wq)
> +static void run_ordered_work(struct btrfs_work *self)
>  {
> +	struct __btrfs_workqueue *wq = self->wq;
>  	struct list_head *list = &wq->ordered_list;
>  	struct btrfs_work *work;
>  	spinlock_t *lock = &wq->list_lock;
>  	unsigned long flags;
> +	void *wtag;
> +	bool free_self = false;
>  
>  	while (1) {
> -		void *wtag;
> -
>  		spin_lock_irqsave(lock, flags);
>  		if (list_empty(list))
>  			break;
> @@ -285,16 +286,47 @@ static void run_ordered_work(struct __btrfs_workqueue *wq)
>  		list_del(&work->ordered_list);
>  		spin_unlock_irqrestore(lock, flags);
>  
> -		/*
> -		 * We don't want to call the ordered free functions with the
> -		 * lock held though. Save the work as tag for the trace event,
> -		 * because the callback could free the structure.
> -		 */
> -		wtag = work;
> -		work->ordered_free(work);
> -		trace_btrfs_all_work_done(wq->fs_info, wtag);
> +		if (work == self) {
> +			/*
> +			 * This is the work item that the worker is currently
> +			 * executing.
> +			 *
> +			 * The kernel workqueue code guarantees non-reentrancy
> +			 * of work items. I.e., if a work item with the same
> +			 * address and work function is queued twice, the second
> +			 * execution is blocked until the first one finishes. A
> +			 * work item may be freed and recycled with the same
> +			 * work function; the workqueue code assumes that the
> +			 * original work item cannot depend on the recycled work
> +			 * item in that case (see find_worker_executing_work()).
> +			 *
> +			 * Note that the work of one Btrfs filesystem may depend
> +			 * on the work of another Btrfs filesystem via, e.g., a
> +			 * loop device. Therefore, we must not allow the current
> +			 * work item to be recycled until we are really done,
> +			 * otherwise we break the above assumption and can
> +			 * deadlock.
> +			 */
> +			free_self = true;
> +		} else {
> +			/*
> +			 * We don't want to call the ordered free functions with
> +			 * the lock held though. Save the work as tag for the
> +			 * trace event, because the callback could free the
> +			 * structure.
> +			 */
> +			wtag = work;
> +			work->ordered_free(work);
> +			trace_btrfs_all_work_done(wq->fs_info, wtag);
> +		}
>  	}
>  	spin_unlock_irqrestore(lock, flags);
> +
> +	if (free_self) {
> +		wtag = self;
> +		self->ordered_free(self);
> +		trace_btrfs_all_work_done(wq->fs_info, wtag);
> +	}
>  }
>  
>  static void normal_work_helper(struct btrfs_work *work)
> @@ -322,7 +354,7 @@ static void normal_work_helper(struct btrfs_work *work)
>  	work->func(work);
>  	if (need_order) {
>  		set_bit(WORK_DONE_BIT, &work->flags);
> -		run_ordered_work(wq);
> +		run_ordered_work(work);
>  	}
>  	if (!need_order)
>  		trace_btrfs_all_work_done(wq->fs_info, wtag);
> 


* Re: [PATCH] Btrfs: fix workqueue deadlock on dependent filesystems
  2019-08-07  7:17 ` Nikolay Borisov
@ 2019-08-07 17:08   ` Omar Sandoval
From: Omar Sandoval @ 2019-08-07 17:08 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs, kernel-team, Tejun Heo

On Wed, Aug 07, 2019 at 10:17:26AM +0300, Nikolay Borisov wrote:
> 
> 
> On 6.08.19 at 20:34, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > We hit the following very strange deadlock on a system with Btrfs on a
> > loop device backed by another Btrfs filesystem:
> > 
> > 1. The top (loop device) filesystem queues an async_cow work item from
> >    cow_file_range_async(). We'll call this work X.
> > 2. Worker thread A starts work X (normal_work_helper()).
> > 3. Worker thread A executes the ordered work for the top filesystem
> >    (run_ordered_work()).
> > 4. Worker thread A finishes the ordered work for work X and frees X
> >    (work->ordered_free()).
> > 5. Worker thread A executes another ordered work and gets blocked on I/O
> >    to the bottom filesystem (still in run_ordered_work()).
> > 6. Meanwhile, the bottom filesystem allocates and queues an async_cow
> >    work item which happens to be the recently-freed X.
> > 7. The workqueue code sees that X is already being executed by worker
> >    thread A, so it schedules X to be executed _after_ worker thread A
> >    finishes (see the find_worker_executing_work() call in
> >    process_one_work()).
> 
> Isn't the bigger problem that a single run_ordered_work() call could
> potentially run the ordered work for more than one normal work item?
> E.g. what if btrfs' code were reworked so that run_ordered_work()
> executed ordered_func for just one work item (the one which called the
> function in the first place)? Wouldn't that also resolve the issue?
> Correct me if I'm wrong, but it seems silly to have a work item outlive
> its ordered_free call, which is what currently happens, right?

We can't always run the ordered work for just the normal work item that
finished, because then it wouldn't be ordered :) If work item N
completes before item N-1, then we can't run the ordered work for N
yet. Then, when N-1 completes, we need to do the ordered work for N-1
_and_ N, which is how we get into this situation.
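
Schematically, the ordered pass looks like this (a simplified sketch of
run_ordered_work(), with the locking and the order-done bookkeeping
stripped out), so a single call can drain the ordered work of many
items:

struct btrfs_work *work;

while (!list_empty(&wq->ordered_list)) {
	work = list_first_entry(&wq->ordered_list,
				struct btrfs_work, ordered_list);
	/* Ordered work runs strictly in queue order: if the head
	 * item's normal work hasn't finished yet, everything behind
	 * it (e.g. an already-completed item N) has to wait here. */
	if (!test_bit(WORK_DONE_BIT, &work->flags))
		break;
	/* The head is done: run its ordered step and move on to the
	 * next item, which may have completed long ago (the N-1 and
	 * N case above). */
	work->ordered_func(work);
	list_del(&work->ordered_list);
	work->ordered_free(work);
}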


* Re: [PATCH] Btrfs: fix workqueue deadlock on dependent filesystems
  2019-08-06 17:34 [PATCH] Btrfs: fix workqueue deadlock on dependent filesystems Omar Sandoval
  2019-08-07  7:17 ` Nikolay Borisov
@ 2019-08-12 11:38 ` Filipe Manana
  2019-08-12 18:48   ` Omar Sandoval
  2019-08-19 16:37 ` David Sterba
From: Filipe Manana @ 2019-08-12 11:38 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team, Tejun Heo

On Tue, Aug 6, 2019 at 6:48 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> From: Omar Sandoval <osandov@fb.com>
>
> We hit the following very strange deadlock on a system with Btrfs on a
> loop device backed by another Btrfs filesystem:
>
> 1. The top (loop device) filesystem queues an async_cow work item from
>    cow_file_range_async(). We'll call this work X.
> 2. Worker thread A starts work X (normal_work_helper()).
> 3. Worker thread A executes the ordered work for the top filesystem
>    (run_ordered_work()).
> 4. Worker thread A finishes the ordered work for work X and frees X
>    (work->ordered_free()).
> 5. Worker thread A executes another ordered work and gets blocked on I/O
>    to the bottom filesystem (still in run_ordered_work()).
> 6. Meanwhile, the bottom filesystem allocates and queues an async_cow
>    work item which happens to be the recently-freed X.
> 7. The workqueue code sees that X is already being executed by worker
>    thread A, so it schedules X to be executed _after_ worker thread A
>    finishes (see the find_worker_executing_work() call in
>    process_one_work()).
>
> Now, the top filesystem is waiting for I/O on the bottom filesystem, but
> the bottom filesystem is waiting for the top filesystem to finish, so we
> deadlock.
>
> This happens because we are breaking the workqueue assumption that a
> work item cannot be recycled while it still depends on other work. Fix
> it by waiting to free the work item until we are done with all of the
> related ordered work.
>
> P.S.:
>
> One might ask why the workqueue code doesn't try to detect a recycled
> work item. It actually does try by checking whether the work item has
> the same work function (find_worker_executing_work()), but in our case
> the function is the same. This is the only key that the workqueue code
> has available to compare, short of adding an additional, layer-violating
> "custom key". Considering that we're the only ones that have ever hit
> this, we should just play by the rules.
>
> Unfortunately, we haven't been able to create a minimal reproducer other
> than our full container setup using a compress-force=zstd filesystem on
> top of another compress-force=zstd filesystem.
>
> Suggested-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good to me, thanks.
Another variant of the problem Liu fixed back in 2014 (commit
9e0af23764344f7f1b68e4eefbe7dc865018b63d).

> ---
>  fs/btrfs/async-thread.c | 56 ++++++++++++++++++++++++++++++++---------
>  1 file changed, 44 insertions(+), 12 deletions(-)
>
> diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
> index 122cb97c7909..b2bfde560331 100644
> --- a/fs/btrfs/async-thread.c
> +++ b/fs/btrfs/async-thread.c
> @@ -250,16 +250,17 @@ static inline void thresh_exec_hook(struct __btrfs_workqueue *wq)
>         }
>  }
>
> -static void run_ordered_work(struct __btrfs_workqueue *wq)
> +static void run_ordered_work(struct btrfs_work *self)
>  {
> +       struct __btrfs_workqueue *wq = self->wq;
>         struct list_head *list = &wq->ordered_list;
>         struct btrfs_work *work;
>         spinlock_t *lock = &wq->list_lock;
>         unsigned long flags;
> +       void *wtag;
> +       bool free_self = false;
>
>         while (1) {
> -               void *wtag;
> -
>                 spin_lock_irqsave(lock, flags);
>                 if (list_empty(list))
>                         break;
> @@ -285,16 +286,47 @@ static void run_ordered_work(struct __btrfs_workqueue *wq)
>                 list_del(&work->ordered_list);
>                 spin_unlock_irqrestore(lock, flags);
>
> -               /*
> -                * We don't want to call the ordered free functions with the
> -                * lock held though. Save the work as tag for the trace event,
> -                * because the callback could free the structure.
> -                */
> -               wtag = work;
> -               work->ordered_free(work);
> -               trace_btrfs_all_work_done(wq->fs_info, wtag);
> +               if (work == self) {
> +                       /*
> +                        * This is the work item that the worker is currently
> +                        * executing.
> +                        *
> +                        * The kernel workqueue code guarantees non-reentrancy
> +                        * of work items. I.e., if a work item with the same
> +                        * address and work function is queued twice, the second
> +                        * execution is blocked until the first one finishes. A
> +                        * work item may be freed and recycled with the same
> +                        * work function; the workqueue code assumes that the
> +                        * original work item cannot depend on the recycled work
> +                        * item in that case (see find_worker_executing_work()).
> +                        *
> +                        * Note that the work of one Btrfs filesystem may depend
> +                        * on the work of another Btrfs filesystem via, e.g., a
> +                        * loop device. Therefore, we must not allow the current
> +                        * work item to be recycled until we are really done,
> +                        * otherwise we break the above assumption and can
> +                        * deadlock.
> +                        */
> +                       free_self = true;
> +               } else {
> +                       /*
> +                        * We don't want to call the ordered free functions with
> +                        * the lock held though. Save the work as tag for the
> +                        * trace event, because the callback could free the
> +                        * structure.
> +                        */
> +                       wtag = work;
> +                       work->ordered_free(work);
> +                       trace_btrfs_all_work_done(wq->fs_info, wtag);
> +               }
>         }
>         spin_unlock_irqrestore(lock, flags);
> +
> +       if (free_self) {
> +               wtag = self;
> +               self->ordered_free(self);
> +               trace_btrfs_all_work_done(wq->fs_info, wtag);
> +       }
>  }
>
>  static void normal_work_helper(struct btrfs_work *work)
> @@ -322,7 +354,7 @@ static void normal_work_helper(struct btrfs_work *work)
>         work->func(work);
>         if (need_order) {
>                 set_bit(WORK_DONE_BIT, &work->flags);
> -               run_ordered_work(wq);
> +               run_ordered_work(work);
>         }
>         if (!need_order)
>                 trace_btrfs_all_work_done(wq->fs_info, wtag);
> --
> 2.22.0
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


* Re: [PATCH] Btrfs: fix workqueue deadlock on dependent filesystems
  2019-08-12 11:38 ` Filipe Manana
@ 2019-08-12 18:48   ` Omar Sandoval
  2019-08-12 18:53     ` Filipe Manana
From: Omar Sandoval @ 2019-08-12 18:48 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, kernel-team, Tejun Heo

On Mon, Aug 12, 2019 at 12:38:55PM +0100, Filipe Manana wrote:
> On Tue, Aug 6, 2019 at 6:48 PM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > From: Omar Sandoval <osandov@fb.com>
> >
> > We hit the following very strange deadlock on a system with Btrfs on a
> > loop device backed by another Btrfs filesystem:
> >
> > 1. The top (loop device) filesystem queues an async_cow work item from
> >    cow_file_range_async(). We'll call this work X.
> > 2. Worker thread A starts work X (normal_work_helper()).
> > 3. Worker thread A executes the ordered work for the top filesystem
> >    (run_ordered_work()).
> > 4. Worker thread A finishes the ordered work for work X and frees X
> >    (work->ordered_free()).
> > 5. Worker thread A executes another ordered work and gets blocked on I/O
> >    to the bottom filesystem (still in run_ordered_work()).
> > 6. Meanwhile, the bottom filesystem allocates and queues an async_cow
> >    work item which happens to be the recently-freed X.
> > 7. The workqueue code sees that X is already being executed by worker
> >    thread A, so it schedules X to be executed _after_ worker thread A
> >    finishes (see the find_worker_executing_work() call in
> >    process_one_work()).
> >
> > Now, the top filesystem is waiting for I/O on the bottom filesystem, but
> > the bottom filesystem is waiting for the top filesystem to finish, so we
> > deadlock.
> >
> > This happens because we are breaking the workqueue assumption that a
> > work item cannot be recycled while it still depends on other work. Fix
> > it by waiting to free the work item until we are done with all of the
> > related ordered work.
> >
> > P.S.:
> >
> > One might ask why the workqueue code doesn't try to detect a recycled
> > work item. It actually does try by checking whether the work item has
> > the same work function (find_worker_executing_work()), but in our case
> > the function is the same. This is the only key that the workqueue code
> > has available to compare, short of adding an additional, layer-violating
> > "custom key". Considering that we're the only ones that have ever hit
> > this, we should just play by the rules.
> >
> > Unfortunately, we haven't been able to create a minimal reproducer other
> > than our full container setup using a compress-force=zstd filesystem on
> > top of another compress-force=zstd filesystem.
> >
> > Suggested-by: Tejun Heo <tj@kernel.org>
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> 
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> 
> Looks good to me, thanks.
> Another variant of the problem Liu fixed back in 2014 (commit
> 9e0af23764344f7f1b68e4eefbe7dc865018b63d).

Good point. I think we can actually get rid of those unique helpers with
this fix. I'll send some followup cleanups.


* Re: [PATCH] Btrfs: fix workqueue deadlock on dependent filesystems
  2019-08-12 18:48   ` Omar Sandoval
@ 2019-08-12 18:53     ` Filipe Manana
From: Filipe Manana @ 2019-08-12 18:53 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team, Tejun Heo

On Mon, Aug 12, 2019 at 7:48 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> On Mon, Aug 12, 2019 at 12:38:55PM +0100, Filipe Manana wrote:
> > On Tue, Aug 6, 2019 at 6:48 PM Omar Sandoval <osandov@osandov.com> wrote:
> > >
> > > From: Omar Sandoval <osandov@fb.com>
> > >
> > > We hit the following very strange deadlock on a system with Btrfs on a
> > > loop device backed by another Btrfs filesystem:
> > >
> > > 1. The top (loop device) filesystem queues an async_cow work item from
> > >    cow_file_range_async(). We'll call this work X.
> > > 2. Worker thread A starts work X (normal_work_helper()).
> > > 3. Worker thread A executes the ordered work for the top filesystem
> > >    (run_ordered_work()).
> > > 4. Worker thread A finishes the ordered work for work X and frees X
> > >    (work->ordered_free()).
> > > 5. Worker thread A executes another ordered work and gets blocked on I/O
> > >    to the bottom filesystem (still in run_ordered_work()).
> > > 6. Meanwhile, the bottom filesystem allocates and queues an async_cow
> > >    work item which happens to be the recently-freed X.
> > > 7. The workqueue code sees that X is already being executed by worker
> > >    thread A, so it schedules X to be executed _after_ worker thread A
> > >    finishes (see the find_worker_executing_work() call in
> > >    process_one_work()).
> > >
> > > Now, the top filesystem is waiting for I/O on the bottom filesystem, but
> > > the bottom filesystem is waiting for the top filesystem to finish, so we
> > > deadlock.
> > >
> > > This happens because we are breaking the workqueue assumption that a
> > > work item cannot be recycled while it still depends on other work. Fix
> > > it by waiting to free the work item until we are done with all of the
> > > related ordered work.
> > >
> > > P.S.:
> > >
> > > One might ask why the workqueue code doesn't try to detect a recycled
> > > work item. It actually does try by checking whether the work item has
> > > the same work function (find_worker_executing_work()), but in our case
> > > the function is the same. This is the only key that the workqueue code
> > > has available to compare, short of adding an additional, layer-violating
> > > "custom key". Considering that we're the only ones that have ever hit
> > > this, we should just play by the rules.
> > >
> > > Unfortunately, we haven't been able to create a minimal reproducer other
> > > than our full container setup using a compress-force=zstd filesystem on
> > > top of another compress-force=zstd filesystem.
> > >
> > > Suggested-by: Tejun Heo <tj@kernel.org>
> > > Signed-off-by: Omar Sandoval <osandov@fb.com>
> >
> > Reviewed-by: Filipe Manana <fdmanana@suse.com>
> >
> > Looks good to me, thanks.
> > Another variant of the problem Liu fixed back in 2014 (commit
> > 9e0af23764344f7f1b68e4eefbe7dc865018b63d).
>
> Good point. I think we can actually get rid of those unique helpers with
> this fix. I'll send some followup cleanups.

Great! Thanks.



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


* Re: [PATCH] Btrfs: fix workqueue deadlock on dependent filesystems
  2019-08-06 17:34 [PATCH] Btrfs: fix workqueue deadlock on dependent filesystems Omar Sandoval
  2019-08-07  7:17 ` Nikolay Borisov
  2019-08-12 11:38 ` Filipe Manana
@ 2019-08-19 16:37 ` David Sterba
From: David Sterba @ 2019-08-19 16:37 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team, Tejun Heo

On Tue, Aug 06, 2019 at 10:34:52AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> We hit the following very strange deadlock on a system with Btrfs on a
> loop device backed by another Btrfs filesystem:
> 
> 1. The top (loop device) filesystem queues an async_cow work item from
>    cow_file_range_async(). We'll call this work X.
> 2. Worker thread A starts work X (normal_work_helper()).
> 3. Worker thread A executes the ordered work for the top filesystem
>    (run_ordered_work()).
> 4. Worker thread A finishes the ordered work for work X and frees X
>    (work->ordered_free()).
> 5. Worker thread A executes another ordered work and gets blocked on I/O
>    to the bottom filesystem (still in run_ordered_work()).
> 6. Meanwhile, the bottom filesystem allocates and queues an async_cow
>    work item which happens to be the recently-freed X.
> 7. The workqueue code sees that X is already being executed by worker
>    thread A, so it schedules X to be executed _after_ worker thread A
>    finishes (see the find_worker_executing_work() call in
>    process_one_work()).
> 
> Now, the top filesystem is waiting for I/O on the bottom filesystem, but
> the bottom filesystem is waiting for the top filesystem to finish, so we
> deadlock.
> 
> This happens because we are breaking the workqueue assumption that a
> work item cannot be recycled while it still depends on other work. Fix
> it by waiting to free the work item until we are done with all of the
> related ordered work.
> 
> P.S.:
> 
> One might ask why the workqueue code doesn't try to detect a recycled
> work item. It actually does try by checking whether the work item has
> the same work function (find_worker_executing_work()), but in our case
> the function is the same. This is the only key that the workqueue code
> has available to compare, short of adding an additional, layer-violating
> "custom key". Considering that we're the only ones that have ever hit
> this, we should just play by the rules.
> 
> Unfortunately, we haven't been able to create a minimal reproducer other
> than our full container setup using a compress-force=zstd filesystem on
> top of another compress-force=zstd filesystem.
> 
> Suggested-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Added to misc-next, thanks.

