linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH AUTOSEL 5.4 299/350] btrfs: don't prematurely free work in end_workqueue_fn()
       [not found] <20191210210735.9077-1-sashal@kernel.org>
@ 2019-12-10 21:06 ` Sasha Levin
  2019-12-12 12:11   ` David Sterba
  2019-12-10 21:06 ` [PATCH AUTOSEL 5.4 300/350] btrfs: don't prematurely free work in run_ordered_work() Sasha Levin
  1 sibling, 1 reply; 4+ messages in thread
From: Sasha Levin @ 2019-12-10 21:06 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Omar Sandoval, Johannes Thumshirn, David Sterba, Sasha Levin,
	linux-btrfs

From: Omar Sandoval <osandov@fb.com>

[ Upstream commit 9be490f1e15c34193b1aae17da58e14dd9f55a95 ]

Currently, end_workqueue_fn() frees the end_io_wq entry (which embeds
the work item) and then calls bio_endio(). This is another potential
instance of the bug in "btrfs: don't prematurely free work in
run_ordered_work()".

In particular, the endio call may depend on other work items. For
example, btrfs_end_dio_bio() can call btrfs_subio_endio_read() ->
__btrfs_correct_data_nocsum() -> dio_read_error() ->
submit_dio_repair_bio(), which submits a bio that is also completed
through a end_workqueue_fn() work item. However,
__btrfs_correct_data_nocsum() waits for the newly submitted bio to
complete, thus it depends on another work item.

This example currently usually works because we use different workqueue
helper functions for BTRFS_WQ_ENDIO_DATA and BTRFS_WQ_ENDIO_DIO_REPAIR.
However, it may deadlock with stacked filesystems and is fragile
overall. The proper fix is to free the work item at the very end of the
work function, so let's do that.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/disk-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 402b61bf345cd..3895c21853cc4 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1657,8 +1657,8 @@ static void end_workqueue_fn(struct btrfs_work *work)
 	bio->bi_status = end_io_wq->status;
 	bio->bi_private = end_io_wq->private;
 	bio->bi_end_io = end_io_wq->end_io;
-	kmem_cache_free(btrfs_end_io_wq_cache, end_io_wq);
 	bio_endio(bio);
+	kmem_cache_free(btrfs_end_io_wq_cache, end_io_wq);
 }
 
 static int cleaner_kthread(void *arg)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH AUTOSEL 5.4 300/350] btrfs: don't prematurely free work in run_ordered_work()
       [not found] <20191210210735.9077-1-sashal@kernel.org>
  2019-12-10 21:06 ` [PATCH AUTOSEL 5.4 299/350] btrfs: don't prematurely free work in end_workqueue_fn() Sasha Levin
@ 2019-12-10 21:06 ` Sasha Levin
  1 sibling, 0 replies; 4+ messages in thread
From: Sasha Levin @ 2019-12-10 21:06 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Omar Sandoval, Tejun Heo, Johannes Thumshirn, David Sterba,
	Sasha Levin, linux-btrfs

From: Omar Sandoval <osandov@fb.com>

[ Upstream commit c495dcd6fbe1dce51811a76bb85b4675f6494938 ]

We hit the following very strange deadlock on a system with Btrfs on a
loop device backed by another Btrfs filesystem:

1. The top (loop device) filesystem queues an async_cow work item from
   cow_file_range_async(). We'll call this work X.
2. Worker thread A starts work X (normal_work_helper()).
3. Worker thread A executes the ordered work for the top filesystem
   (run_ordered_work()).
4. Worker thread A finishes the ordered work for work X and frees X
   (work->ordered_free()).
5. Worker thread A executes another ordered work and gets blocked on I/O
   to the bottom filesystem (still in run_ordered_work()).
6. Meanwhile, the bottom filesystem allocates and queues an async_cow
   work item which happens to be the recently-freed X.
7. The workqueue code sees that X is already being executed by worker
   thread A, so it schedules X to be executed _after_ worker thread A
   finishes (see the find_worker_executing_work() call in
   process_one_work()).

Now, the top filesystem is waiting for I/O on the bottom filesystem, but
the bottom filesystem is waiting for the top filesystem to finish, so we
deadlock.

This happens because we are breaking the workqueue assumption that a
work item cannot be recycled while it still depends on other work. Fix
it by waiting to free the work item until we are done with all of the
related ordered work.

P.S.:

One might ask why the workqueue code doesn't try to detect a recycled
work item. It actually does try by checking whether the work item has
the same work function (find_worker_executing_work()), but in our case
the function is the same. This is the only key that the workqueue code
has available to compare, short of adding an additional, layer-violating
"custom key". Considering that we're the only ones that have ever hit
this, we should just play by the rules.

Unfortunately, we haven't been able to create a minimal reproducer other
than our full container setup using a compress-force=zstd filesystem on
top of another compress-force=zstd filesystem.

Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/async-thread.c | 56 ++++++++++++++++++++++++++++++++---------
 1 file changed, 44 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index 2e9e13ffbd082..10a04b99798ae 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -252,16 +252,17 @@ static inline void thresh_exec_hook(struct __btrfs_workqueue *wq)
 	}
 }
 
-static void run_ordered_work(struct __btrfs_workqueue *wq)
+static void run_ordered_work(struct __btrfs_workqueue *wq,
+			     struct btrfs_work *self)
 {
 	struct list_head *list = &wq->ordered_list;
 	struct btrfs_work *work;
 	spinlock_t *lock = &wq->list_lock;
 	unsigned long flags;
+	void *wtag;
+	bool free_self = false;
 
 	while (1) {
-		void *wtag;
-
 		spin_lock_irqsave(lock, flags);
 		if (list_empty(list))
 			break;
@@ -287,16 +288,47 @@ static void run_ordered_work(struct __btrfs_workqueue *wq)
 		list_del(&work->ordered_list);
 		spin_unlock_irqrestore(lock, flags);
 
-		/*
-		 * We don't want to call the ordered free functions with the
-		 * lock held though. Save the work as tag for the trace event,
-		 * because the callback could free the structure.
-		 */
-		wtag = work;
-		work->ordered_free(work);
-		trace_btrfs_all_work_done(wq->fs_info, wtag);
+		if (work == self) {
+			/*
+			 * This is the work item that the worker is currently
+			 * executing.
+			 *
+			 * The kernel workqueue code guarantees non-reentrancy
+			 * of work items. I.e., if a work item with the same
+			 * address and work function is queued twice, the second
+			 * execution is blocked until the first one finishes. A
+			 * work item may be freed and recycled with the same
+			 * work function; the workqueue code assumes that the
+			 * original work item cannot depend on the recycled work
+			 * item in that case (see find_worker_executing_work()).
+			 *
+			 * Note that the work of one Btrfs filesystem may depend
+			 * on the work of another Btrfs filesystem via, e.g., a
+			 * loop device. Therefore, we must not allow the current
+			 * work item to be recycled until we are really done,
+			 * otherwise we break the above assumption and can
+			 * deadlock.
+			 */
+			free_self = true;
+		} else {
+			/*
+			 * We don't want to call the ordered free functions with
+			 * the lock held though. Save the work as tag for the
+			 * trace event, because the callback could free the
+			 * structure.
+			 */
+			wtag = work;
+			work->ordered_free(work);
+			trace_btrfs_all_work_done(wq->fs_info, wtag);
+		}
 	}
 	spin_unlock_irqrestore(lock, flags);
+
+	if (free_self) {
+		wtag = self;
+		self->ordered_free(self);
+		trace_btrfs_all_work_done(wq->fs_info, wtag);
+	}
 }
 
 static void normal_work_helper(struct btrfs_work *work)
@@ -324,7 +356,7 @@ static void normal_work_helper(struct btrfs_work *work)
 	work->func(work);
 	if (need_order) {
 		set_bit(WORK_DONE_BIT, &work->flags);
-		run_ordered_work(wq);
+		run_ordered_work(wq, work);
 	}
 	if (!need_order)
 		trace_btrfs_all_work_done(wq->fs_info, wtag);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH AUTOSEL 5.4 299/350] btrfs: don't prematurely free work in end_workqueue_fn()
  2019-12-10 21:06 ` [PATCH AUTOSEL 5.4 299/350] btrfs: don't prematurely free work in end_workqueue_fn() Sasha Levin
@ 2019-12-12 12:11   ` David Sterba
  2019-12-19 23:06     ` Sasha Levin
  0 siblings, 1 reply; 4+ messages in thread
From: David Sterba @ 2019-12-12 12:11 UTC (permalink / raw)
  To: Sasha Levin
  Cc: linux-kernel, stable, Omar Sandoval, Johannes Thumshirn,
	David Sterba, linux-btrfs

On Tue, Dec 10, 2019 at 04:06:44PM -0500, Sasha Levin wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> [ Upstream commit 9be490f1e15c34193b1aae17da58e14dd9f55a95 ]
> 
> Currently, end_workqueue_fn() frees the end_io_wq entry (which embeds
> the work item) and then calls bio_endio(). This is another potential
> instance of the bug in "btrfs: don't prematurely free work in
> run_ordered_work()".
> 
> In particular, the endio call may depend on other work items. For
> example, btrfs_end_dio_bio() can call btrfs_subio_endio_read() ->
> __btrfs_correct_data_nocsum() -> dio_read_error() ->
> submit_dio_repair_bio(), which submits a bio that is also completed
> through a end_workqueue_fn() work item. However,
> __btrfs_correct_data_nocsum() waits for the newly submitted bio to
> complete, thus it depends on another work item.
> 
> This example currently usually works because we use different workqueue
> helper functions for BTRFS_WQ_ENDIO_DATA and BTRFS_WQ_ENDIO_DIO_REPAIR.
> However, it may deadlock with stacked filesystems and is fragile
> overall. The proper fix is to free the work item at the very end of the
> work function, so let's do that.
> 
> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> Reviewed-by: David Sterba <dsterba@suse.com>
> Signed-off-by: David Sterba <dsterba@suse.com>
> Signed-off-by: Sasha Levin <sashal@kernel.org>

The were more patches in the series, all contain "don't prematurely free
work in" and were part of a rework of async work processing. They're
fixing a very uncommon usecase, so if there's desire to backport them
the whole series needs to go in.

In the autosel list, there are only 2 and without the important fix.

c495dcd6fbe1 btrfs: don't prematurely free work in run_ordered_work()
9be490f1e15c btrfs: don't prematurely free work in end_workqueue_fn()
e732fe95e4ca btrfs: don't prematurely free work in reada_start_machine_worker()
57d4f0b86327 btrfs: don't prematurely free work in scrub_missing_raid56_worker()

a0cac0ec961f btrfs: get rid of unique workqueue helper functions
- this is only a cleanup that removes code obsoleted by the fixes above,
  probably out of scope of stable

I have intentionally not tagged the patches for stable, the usecase is
is specific to one user (FB), the known reproducer is only their
workload and the fixes are in their kernel already.

So if there's desire to add the patches to stable trees, then it has to
be the whole series, but I don't see a strong reason for it.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH AUTOSEL 5.4 299/350] btrfs: don't prematurely free work in end_workqueue_fn()
  2019-12-12 12:11   ` David Sterba
@ 2019-12-19 23:06     ` Sasha Levin
  0 siblings, 0 replies; 4+ messages in thread
From: Sasha Levin @ 2019-12-19 23:06 UTC (permalink / raw)
  To: dsterba, linux-kernel, stable, Omar Sandoval, Johannes Thumshirn,
	David Sterba, linux-btrfs

On Thu, Dec 12, 2019 at 01:11:03PM +0100, David Sterba wrote:
>On Tue, Dec 10, 2019 at 04:06:44PM -0500, Sasha Levin wrote:
>> From: Omar Sandoval <osandov@fb.com>
>>
>> [ Upstream commit 9be490f1e15c34193b1aae17da58e14dd9f55a95 ]
>>
>> Currently, end_workqueue_fn() frees the end_io_wq entry (which embeds
>> the work item) and then calls bio_endio(). This is another potential
>> instance of the bug in "btrfs: don't prematurely free work in
>> run_ordered_work()".
>>
>> In particular, the endio call may depend on other work items. For
>> example, btrfs_end_dio_bio() can call btrfs_subio_endio_read() ->
>> __btrfs_correct_data_nocsum() -> dio_read_error() ->
>> submit_dio_repair_bio(), which submits a bio that is also completed
>> through a end_workqueue_fn() work item. However,
>> __btrfs_correct_data_nocsum() waits for the newly submitted bio to
>> complete, thus it depends on another work item.
>>
>> This example currently usually works because we use different workqueue
>> helper functions for BTRFS_WQ_ENDIO_DATA and BTRFS_WQ_ENDIO_DIO_REPAIR.
>> However, it may deadlock with stacked filesystems and is fragile
>> overall. The proper fix is to free the work item at the very end of the
>> work function, so let's do that.
>>
>> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
>> Signed-off-by: Omar Sandoval <osandov@fb.com>
>> Reviewed-by: David Sterba <dsterba@suse.com>
>> Signed-off-by: David Sterba <dsterba@suse.com>
>> Signed-off-by: Sasha Levin <sashal@kernel.org>
>
>The were more patches in the series, all contain "don't prematurely free
>work in" and were part of a rework of async work processing. They're
>fixing a very uncommon usecase, so if there's desire to backport them
>the whole series needs to go in.
>
>In the autosel list, there are only 2 and without the important fix.
>
>c495dcd6fbe1 btrfs: don't prematurely free work in run_ordered_work()
>9be490f1e15c btrfs: don't prematurely free work in end_workqueue_fn()
>e732fe95e4ca btrfs: don't prematurely free work in reada_start_machine_worker()
>57d4f0b86327 btrfs: don't prematurely free work in scrub_missing_raid56_worker()

I've queued all 4, thanks!

>a0cac0ec961f btrfs: get rid of unique workqueue helper functions
>- this is only a cleanup that removes code obsoleted by the fixes above,
>  probably out of scope of stable
>
>I have intentionally not tagged the patches for stable, the usecase is
>is specific to one user (FB), the known reproducer is only their
>workload and the fixes are in their kernel already.
>
>So if there's desire to add the patches to stable trees, then it has to
>be the whole series, but I don't see a strong reason for it.

If it's upstream and broken then it's relevant, it doesn't matter if its
one user or a million users.

-- 
Thanks,
Sasha

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2019-12-19 23:06 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20191210210735.9077-1-sashal@kernel.org>
2019-12-10 21:06 ` [PATCH AUTOSEL 5.4 299/350] btrfs: don't prematurely free work in end_workqueue_fn() Sasha Levin
2019-12-12 12:11   ` David Sterba
2019-12-19 23:06     ` Sasha Levin
2019-12-10 21:06 ` [PATCH AUTOSEL 5.4 300/350] btrfs: don't prematurely free work in run_ordered_work() Sasha Levin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).