All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] fixup work fixes
@ 2020-01-21 16:51 Josef Bacik
  2020-01-21 16:51 ` [PATCH 1/3] Btrfs: keep pages dirty when using btrfs_writepage_fixup_worker Josef Bacik
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Josef Bacik @ 2020-01-21 16:51 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

This series is to address a few issues with the fixup worker we hit in
production.

The first of this is a resend of

  Btrfs: keep pages dirty when using

I've cleaned this up based on the feedback and added a bunch more comments to
make it clear what is happening and why we're doing it.

The next patch is a cleanup that is made possible by the previous patch, again
to clear up the fixup workers job.

  btrfs: drop the -EBUSY case in __extent_writepage_io

And finally the deadlock fix that I submitted earlier.  I noticed while trying
to backport this onto our kernel that we had changed the error case with the
above patch from Chris, and actually we really, really need Chris's fix as well.
There is also a change in the error handling from v1 where we now set the page
error properly but only once we've locked the page and verified we're still
responsible for COW'ing the page.  Thanks,

Josef


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/3] Btrfs: keep pages dirty when using btrfs_writepage_fixup_worker
  2020-01-21 16:51 [PATCH 0/3] fixup work fixes Josef Bacik
@ 2020-01-21 16:51 ` Josef Bacik
  2020-01-21 16:51 ` [PATCH 2/3] btrfs: drop the -EBUSY case in __extent_writepage_io Josef Bacik
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Josef Bacik @ 2020-01-21 16:51 UTC (permalink / raw)
  To: linux-btrfs, kernel-team; +Cc: Chris Mason

From: Chris Mason <clm@fb.com>

For COW, btrfs expects pages dirty pages to have been through a few setup
steps.  This includes reserving space for the new block allocations and marking
the range in the state tree for delayed allocation.

A few places outside btrfs will dirty pages directly, especially when unmapping
mmap'd pages.  In order for these to properly go through COW, we run them
through a fixup worker to wait for stable pages, and do the delalloc prep.

87826df0ec36 added a window where the dirty pages were cleaned, but pending
more action from the fixup worker.  We clear_page_dirty_for_io() before
we call into writepage, so the page is no longer dirty.  The commit
changed it so now we leave the page clean between unlocking it here and
the fixup worker starting at some point in the future.

During this window, page migration can jump in and relocate the page.  Once our
fixup work actually starts, it finds page->mapping is NULL and we end up
freeing the page without ever writing it.

This leads to crc errors and other exciting problems, since it screws up the
whole statemachine for waiting for ordered extents.  The fix here is to keep
the page dirty while we're waiting for the fixup worker to get to work.
This is accomplished by returning -EAGAIN from btrfs_writepage_cow_fixup
if we queued the page up for fixup, which will cause the writepage
function to redirty the page.

Because we now expect the page to be dirty once it gets to the fixup
worker we must adjust the error cases to call clear_page_dirty_for_io()
on the page.  That is the bulk of the patch, but it is not the fix, the
fix is the -EAGAIN from btrfs_writepage_cow_fixup.  We cannot separate
these two changes out because the error conditions change with the new
expectations.

Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/inode.c | 61 ++++++++++++++++++++++++++++++++++--------------
 1 file changed, 44 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index da31571b150b..69f8e65b378b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2211,17 +2211,27 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 	struct inode *inode;
 	u64 page_start;
 	u64 page_end;
-	int ret;
+	int ret = 0;
 
 	fixup = container_of(work, struct btrfs_writepage_fixup, work);
 	page = fixup->page;
 again:
 	lock_page(page);
-	if (!page->mapping || !PageDirty(page) || !PageChecked(page)) {
-		ClearPageChecked(page);
+
+	/*
+	 * before we queued this fixup, we took a reference on the page.
+	 * page->mapping may go NULL, but it shouldn't be moved to a
+	 * different address space.
+	 */
+	if (!page->mapping || !PageDirty(page) || !PageChecked(page))
 		goto out_page;
-	}
 
+	/*
+	 * we keep the PageChecked() bit set until we're done with the
+	 * btrfs_start_ordered_extent() dance that we do below.  That
+	 * drops and retakes the page lock, so we don't want new
+	 * fixup workers queued for this page during the churn.
+	 */
 	inode = page->mapping->host;
 	page_start = page_offset(page);
 	page_end = page_offset(page) + PAGE_SIZE - 1;
@@ -2246,24 +2256,22 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 
 	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, page_start,
 					   PAGE_SIZE);
-	if (ret) {
-		mapping_set_error(page->mapping, ret);
-		end_extent_writepage(page, ret, page_start, page_end);
-		ClearPageChecked(page);
+	if (ret)
 		goto out;
-	 }
 
 	ret = btrfs_set_extent_delalloc(inode, page_start, page_end, 0,
 					&cached_state);
-	if (ret) {
-		mapping_set_error(page->mapping, ret);
-		end_extent_writepage(page, ret, page_start, page_end);
-		ClearPageChecked(page);
+	if (ret)
 		goto out_reserved;
-	}
 
-	ClearPageChecked(page);
-	set_page_dirty(page);
+	/*
+	 * everything went as planned, we're now the proud owners of a
+	 * Dirty page with delayed allocation bits set and space reserved
+	 * for our COW destination.
+	 *
+	 * The page was dirty when we started, nothing should have cleaned it.
+	 */
+	BUG_ON(!PageDirty(page));
 out_reserved:
 	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
 	if (ret)
@@ -2273,6 +2281,17 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 	unlock_extent_cached(&BTRFS_I(inode)->io_tree, page_start, page_end,
 			     &cached_state);
 out_page:
+	if (ret) {
+		/*
+		 * We hit ENOSPC or other errors.  Update the mapping and page
+		 * to reflect the errors and clean the page.
+		 */
+		mapping_set_error(page->mapping, ret);
+		end_extent_writepage(page, ret, page_start, page_end);
+		clear_page_dirty_for_io(page);
+		SetPageError(page);
+	}
+	ClearPageChecked(page);
 	unlock_page(page);
 	put_page(page);
 	kfree(fixup);
@@ -2300,6 +2319,13 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end)
 	if (TestClearPagePrivate2(page))
 		return 0;
 
+	/*
+	 * PageChecked is set below when we create a fixup worker for this page,
+	 * don't try to create another one if we're already PageChecked()
+	 *
+	 * The extent_io writepage code will redirty the page if we send
+	 * back EAGAIN.
+	 */
 	if (PageChecked(page))
 		return -EAGAIN;
 
@@ -2312,7 +2338,8 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end)
 	btrfs_init_work(&fixup->work, btrfs_writepage_fixup_worker, NULL, NULL);
 	fixup->page = page;
 	btrfs_queue_work(fs_info->fixup_workers, &fixup->work);
-	return -EBUSY;
+
+	return -EAGAIN;
 }
 
 static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 2/3] btrfs: drop the -EBUSY case in __extent_writepage_io
  2020-01-21 16:51 [PATCH 0/3] fixup work fixes Josef Bacik
  2020-01-21 16:51 ` [PATCH 1/3] Btrfs: keep pages dirty when using btrfs_writepage_fixup_worker Josef Bacik
@ 2020-01-21 16:51 ` Josef Bacik
  2020-01-21 16:51 ` [PATCH 3/3][v2] btrfs: do not do delalloc reservation under page lock Josef Bacik
  2020-01-29 15:11 ` [PATCH 0/3] fixup work fixes David Sterba
  3 siblings, 0 replies; 7+ messages in thread
From: Josef Bacik @ 2020-01-21 16:51 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Now that we only return 0 or -EAGAIN from btrfs_writepage_cow_fixup, we
do not need this -EBUSY case.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent_io.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f9b223d827b3..0b5513f98a67 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3457,11 +3457,7 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 	ret = btrfs_writepage_cow_fixup(page, start, page_end);
 	if (ret) {
 		/* Fixup worker will requeue */
-		if (ret == -EBUSY)
-			wbc->pages_skipped++;
-		else
-			redirty_page_for_writepage(wbc, page);
-
+		redirty_page_for_writepage(wbc, page);
 		update_nr_written(wbc, nr_written);
 		unlock_page(page);
 		return 1;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 3/3][v2] btrfs: do not do delalloc reservation under page lock
  2020-01-21 16:51 [PATCH 0/3] fixup work fixes Josef Bacik
  2020-01-21 16:51 ` [PATCH 1/3] Btrfs: keep pages dirty when using btrfs_writepage_fixup_worker Josef Bacik
  2020-01-21 16:51 ` [PATCH 2/3] btrfs: drop the -EBUSY case in __extent_writepage_io Josef Bacik
@ 2020-01-21 16:51 ` Josef Bacik
  2020-01-21 19:34   ` [PATCH][v3] " Josef Bacik
  2020-01-29 15:11 ` [PATCH 0/3] fixup work fixes David Sterba
  3 siblings, 1 reply; 7+ messages in thread
From: Josef Bacik @ 2020-01-21 16:51 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

We ran into a deadlock in production with the fixup worker.  The stack
traces were as follows

Thread responsible for the writeout, waiting on the page lock

[<0>] io_schedule+0x12/0x40
[<0>] __lock_page+0x109/0x1e0
[<0>] extent_write_cache_pages+0x206/0x360
[<0>] extent_writepages+0x40/0x60
[<0>] do_writepages+0x31/0xb0
[<0>] __writeback_single_inode+0x3d/0x350
[<0>] writeback_sb_inodes+0x19d/0x3c0
[<0>] __writeback_inodes_wb+0x5d/0xb0
[<0>] wb_writeback+0x231/0x2c0
[<0>] wb_workfn+0x308/0x3c0
[<0>] process_one_work+0x1e0/0x390
[<0>] worker_thread+0x2b/0x3c0
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff

Thread of the fixup worker who is holding the page lock

[<0>] start_delalloc_inodes+0x241/0x2d0
[<0>] btrfs_start_delalloc_roots+0x179/0x230
[<0>] btrfs_alloc_data_chunk_ondemand+0x11b/0x2e0
[<0>] btrfs_check_data_free_space+0x53/0xa0
[<0>] btrfs_delalloc_reserve_space+0x20/0x70
[<0>] btrfs_writepage_fixup_worker+0x1fc/0x2a0
[<0>] normal_work_helper+0x11c/0x360
[<0>] process_one_work+0x1e0/0x390
[<0>] worker_thread+0x2b/0x3c0
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff

Thankfully the stars have to align just right to hit this.  First you
have to end up in the fixup worker, which is tricky by itself (my
reproducer does DIO reads into a MMAP'ed region, so not a common
operation).  Then you have to have less than a page size of free data
space and 0 unallocated space so you go down the "commit the transaction
to free up pinned space" path.  This was accomplished by a random
balance that was running on the host.  Then you get this deadlock.

I'm still in the process of trying to force the deadlock to happen on
demand, but I've hit other issues.  I can still trigger the fixup worker
path itself so this patch has been tested in that regard, so the normal
case is fine.

Fixes: 87826df0ec36 ("btrfs: delalloc for page dirtied out-of-band in fixup worker")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/inode.c | 71 +++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 55 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 69f8e65b378b..10f4952895d8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2198,6 +2198,7 @@ int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
 /* see btrfs_writepage_start_hook for details on why this is required */
 struct btrfs_writepage_fixup {
 	struct page *page;
+	struct inode *inode;
 	struct btrfs_work work;
 };
 
@@ -2212,9 +2213,20 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 	u64 page_start;
 	u64 page_end;
 	int ret = 0;
+	bool free_delalloc_space = true;
 
 	fixup = container_of(work, struct btrfs_writepage_fixup, work);
 	page = fixup->page;
+	inode = fixup->inode;
+	page_start = page_offset(page);
+	page_end = page_offset(page) + PAGE_SIZE - 1;
+
+	/*
+	 * This is similar to page_mkwrite, we need to reserve the space before
+	 * we take the page lock.
+	 */
+	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, page_start,
+					   PAGE_SIZE);
 again:
 	lock_page(page);
 
@@ -2223,25 +2235,48 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 	 * page->mapping may go NULL, but it shouldn't be moved to a
 	 * different address space.
 	 */
-	if (!page->mapping || !PageDirty(page) || !PageChecked(page))
+	if (!page->mapping || !PageDirty(page) || !PageChecked(page)) {
+		/*
+		 * Unfortunately this is a little tricky, either
+		 *
+		 * 1) We got here and our page had already been dealt with and
+		 *    we reserved our space, thus ret == 0, so we need to just
+		 *    drop our space reservation and bail.  This can happen the
+		 *    first time we come into the fixup worker, or could happen
+		 *    while waiting for the ordered extent.
+		 * 2) Our page was already dealt with, but we happened to get an
+		 *    ENOSPC above from the btrfs_delalloc_reserve_space.  In
+		 *    this case we obviously don't have anything to release, but
+		 *    because the page was already dealt with we don't want to
+		 *    mark the page with an error, so make sure we're resetting
+		 *    ret to 0.  This is why we have this check _before_ the ret
+		 *    check, because we do not want to have a surprise ENOSPC
+		 *    when the page was already properly dealt with.
+		 */
+		if (!ret) {
+			btrfs_delalloc_release_extents(BTRFS_I(inode),
+						       PAGE_SIZE);
+			btrfs_delalloc_release_space(inode, data_reserved,
+						     page_start, PAGE_SIZE,
+						     true);
+		}
+		ret = 0;
 		goto out_page;
+	}
 
 	/*
-	 * we keep the PageChecked() bit set until we're done with the
-	 * btrfs_start_ordered_extent() dance that we do below.  That
-	 * drops and retakes the page lock, so we don't want new
-	 * fixup workers queued for this page during the churn.
+	 * We can't mess with the page state unless it is locked, so now that we
+	 * are locked bail if we failed to make our space reservation.
 	 */
-	inode = page->mapping->host;
-	page_start = page_offset(page);
-	page_end = page_offset(page) + PAGE_SIZE - 1;
+	if (ret)
+		goto out_page;
 
 	lock_extent_bits(&BTRFS_I(inode)->io_tree, page_start, page_end,
 			 &cached_state);
 
 	/* already ordered? We're done */
 	if (PagePrivate2(page))
-		goto out;
+		goto out_reserved;
 
 	ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), page_start,
 					PAGE_SIZE);
@@ -2254,11 +2289,6 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 		goto again;
 	}
 
-	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, page_start,
-					   PAGE_SIZE);
-	if (ret)
-		goto out;
-
 	ret = btrfs_set_extent_delalloc(inode, page_start, page_end, 0,
 					&cached_state);
 	if (ret)
@@ -2272,12 +2302,12 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 	 * The page was dirty when we started, nothing should have cleaned it.
 	 */
 	BUG_ON(!PageDirty(page));
+	free_delalloc_space = false;
 out_reserved:
 	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
-	if (ret)
+	if (free_delalloc_space)
 		btrfs_delalloc_release_space(inode, data_reserved, page_start,
 					     PAGE_SIZE, true);
-out:
 	unlock_extent_cached(&BTRFS_I(inode)->io_tree, page_start, page_end,
 			     &cached_state);
 out_page:
@@ -2296,6 +2326,7 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 	put_page(page);
 	kfree(fixup);
 	extent_changeset_free(data_reserved);
+	iput(inode);
 }
 
 /*
@@ -2333,10 +2364,18 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end)
 	if (!fixup)
 		return -EAGAIN;
 
+	/*
+	 * We are already holding a reference to this inode from
+	 * write_cache_pages.  We need to hold it because the space reservation
+	 * takes place outside of the page lock, and we can't trust
+	 * page->mapping outside of the page lock.
+	 */
+	ihold(inode);
 	SetPageChecked(page);
 	get_page(page);
 	btrfs_init_work(&fixup->work, btrfs_writepage_fixup_worker, NULL, NULL);
 	fixup->page = page;
+	fixup->inode = inode;
 	btrfs_queue_work(fs_info->fixup_workers, &fixup->work);
 
 	return -EAGAIN;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH][v3] btrfs: do not do delalloc reservation under page lock
  2020-01-21 16:51 ` [PATCH 3/3][v2] btrfs: do not do delalloc reservation under page lock Josef Bacik
@ 2020-01-21 19:34   ` Josef Bacik
  2020-01-29 15:12     ` David Sterba
  0 siblings, 1 reply; 7+ messages in thread
From: Josef Bacik @ 2020-01-21 19:34 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

We ran into a deadlock in production with the fixup worker.  The stack
traces were as follows

Thread responsible for the writeout, waiting on the page lock

[<0>] io_schedule+0x12/0x40
[<0>] __lock_page+0x109/0x1e0
[<0>] extent_write_cache_pages+0x206/0x360
[<0>] extent_writepages+0x40/0x60
[<0>] do_writepages+0x31/0xb0
[<0>] __writeback_single_inode+0x3d/0x350
[<0>] writeback_sb_inodes+0x19d/0x3c0
[<0>] __writeback_inodes_wb+0x5d/0xb0
[<0>] wb_writeback+0x231/0x2c0
[<0>] wb_workfn+0x308/0x3c0
[<0>] process_one_work+0x1e0/0x390
[<0>] worker_thread+0x2b/0x3c0
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff

Thread of the fixup worker who is holding the page lock

[<0>] start_delalloc_inodes+0x241/0x2d0
[<0>] btrfs_start_delalloc_roots+0x179/0x230
[<0>] btrfs_alloc_data_chunk_ondemand+0x11b/0x2e0
[<0>] btrfs_check_data_free_space+0x53/0xa0
[<0>] btrfs_delalloc_reserve_space+0x20/0x70
[<0>] btrfs_writepage_fixup_worker+0x1fc/0x2a0
[<0>] normal_work_helper+0x11c/0x360
[<0>] process_one_work+0x1e0/0x390
[<0>] worker_thread+0x2b/0x3c0
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff

Thankfully the stars have to align just right to hit this.  First you
have to end up in the fixup worker, which is tricky by itself (my
reproducer does DIO reads into a MMAP'ed region, so not a common
operation).  Then you have to have less than a page size of free data
space and 0 unallocated space so you go down the "commit the transaction
to free up pinned space" path.  This was accomplished by a random
balance that was running on the host.  Then you get this deadlock.

I'm still in the process of trying to force the deadlock to happen on
demand, but I've hit other issues.  I can still trigger the fixup worker
path itself so this patch has been tested in that regard, so the normal
case is fine.

Fixes: 87826df0ec36 ("btrfs: delalloc for page dirtied out-of-band in fixup worker")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
v2->v3:
- Use a delayed iput in the fixup worker.  I *think* we can deadlock if we do
  the final iput and need to flush space, which may trigger the fixup worker
  which is busy doing our iput.  Err on the side of caution and use a delayed
  iput.

 fs/btrfs/inode.c | 71 +++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 55 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 69f8e65b378b..9320f13778ce 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2198,6 +2198,7 @@ int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
 /* see btrfs_writepage_start_hook for details on why this is required */
 struct btrfs_writepage_fixup {
 	struct page *page;
+	struct inode *inode;
 	struct btrfs_work work;
 };
 
@@ -2212,9 +2213,20 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 	u64 page_start;
 	u64 page_end;
 	int ret = 0;
+	bool free_delalloc_space = true;
 
 	fixup = container_of(work, struct btrfs_writepage_fixup, work);
 	page = fixup->page;
+	inode = fixup->inode;
+	page_start = page_offset(page);
+	page_end = page_offset(page) + PAGE_SIZE - 1;
+
+	/*
+	 * This is similar to page_mkwrite, we need to reserve the space before
+	 * we take the page lock.
+	 */
+	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, page_start,
+					   PAGE_SIZE);
 again:
 	lock_page(page);
 
@@ -2223,25 +2235,48 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 	 * page->mapping may go NULL, but it shouldn't be moved to a
 	 * different address space.
 	 */
-	if (!page->mapping || !PageDirty(page) || !PageChecked(page))
+	if (!page->mapping || !PageDirty(page) || !PageChecked(page)) {
+		/*
+		 * Unfortunately this is a little tricky, either
+		 *
+		 * 1) We got here and our page had already been dealt with and
+		 *    we reserved our space, thus ret == 0, so we need to just
+		 *    drop our space reservation and bail.  This can happen the
+		 *    first time we come into the fixup worker, or could happen
+		 *    while waiting for the ordered extent.
+		 * 2) Our page was already dealt with, but we happened to get an
+		 *    ENOSPC above from the btrfs_delalloc_reserve_space.  In
+		 *    this case we obviously don't have anything to release, but
+		 *    because the page was already dealt with we don't want to
+		 *    mark the page with an error, so make sure we're resetting
+		 *    ret to 0.  This is why we have this check _before_ the ret
+		 *    check, because we do not want to have a surprise ENOSPC
+		 *    when the page was already properly dealt with.
+		 */
+		if (!ret) {
+			btrfs_delalloc_release_extents(BTRFS_I(inode),
+						       PAGE_SIZE);
+			btrfs_delalloc_release_space(inode, data_reserved,
+						     page_start, PAGE_SIZE,
+						     true);
+		}
+		ret = 0;
 		goto out_page;
+	}
 
 	/*
-	 * we keep the PageChecked() bit set until we're done with the
-	 * btrfs_start_ordered_extent() dance that we do below.  That
-	 * drops and retakes the page lock, so we don't want new
-	 * fixup workers queued for this page during the churn.
+	 * We can't mess with the page state unless it is locked, so now that we
+	 * are locked bail if we failed to make our space reservation.
 	 */
-	inode = page->mapping->host;
-	page_start = page_offset(page);
-	page_end = page_offset(page) + PAGE_SIZE - 1;
+	if (ret)
+		goto out_page;
 
 	lock_extent_bits(&BTRFS_I(inode)->io_tree, page_start, page_end,
 			 &cached_state);
 
 	/* already ordered? We're done */
 	if (PagePrivate2(page))
-		goto out;
+		goto out_reserved;
 
 	ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), page_start,
 					PAGE_SIZE);
@@ -2254,11 +2289,6 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 		goto again;
 	}
 
-	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, page_start,
-					   PAGE_SIZE);
-	if (ret)
-		goto out;
-
 	ret = btrfs_set_extent_delalloc(inode, page_start, page_end, 0,
 					&cached_state);
 	if (ret)
@@ -2272,12 +2302,12 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 	 * The page was dirty when we started, nothing should have cleaned it.
 	 */
 	BUG_ON(!PageDirty(page));
+	free_delalloc_space = false;
 out_reserved:
 	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
-	if (ret)
+	if (free_delalloc_space)
 		btrfs_delalloc_release_space(inode, data_reserved, page_start,
 					     PAGE_SIZE, true);
-out:
 	unlock_extent_cached(&BTRFS_I(inode)->io_tree, page_start, page_end,
 			     &cached_state);
 out_page:
@@ -2296,6 +2326,7 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 	put_page(page);
 	kfree(fixup);
 	extent_changeset_free(data_reserved);
+	btrfs_add_delayed_iput(inode);
 }
 
 /*
@@ -2333,10 +2364,18 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end)
 	if (!fixup)
 		return -EAGAIN;
 
+	/*
+	 * We are already holding a reference to this inode from
+	 * write_cache_pages.  We need to hold it because the space reservation
+	 * takes place outside of the page lock, and we can't trust
+	 * page->mapping outside of the page lock.
+	 */
+	ihold(inode);
 	SetPageChecked(page);
 	get_page(page);
 	btrfs_init_work(&fixup->work, btrfs_writepage_fixup_worker, NULL, NULL);
 	fixup->page = page;
+	fixup->inode = inode;
 	btrfs_queue_work(fs_info->fixup_workers, &fixup->work);
 
 	return -EAGAIN;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 0/3] fixup work fixes
  2020-01-21 16:51 [PATCH 0/3] fixup work fixes Josef Bacik
                   ` (2 preceding siblings ...)
  2020-01-21 16:51 ` [PATCH 3/3][v2] btrfs: do not do delalloc reservation under page lock Josef Bacik
@ 2020-01-29 15:11 ` David Sterba
  3 siblings, 0 replies; 7+ messages in thread
From: David Sterba @ 2020-01-29 15:11 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Tue, Jan 21, 2020 at 11:51:41AM -0500, Josef Bacik wrote:
> This series is to address a few issues with the fixup worker we hit in
> production.
> 
> The first of this is a resend of
> 
>   Btrfs: keep pages dirty when using
> 
> I've cleaned this up based on the feedback and added a bunch more comments to
> make it clear what is happening and why we're doing it.
> 
> The next patch is a cleanup that is made possible by the previous patch, again
> to clear up the fixup workers job.
> 
>   btrfs: drop the -EBUSY case in __extent_writepage_io
> 
> And finally the deadlock fix that I submitted earlier.  I noticed while trying
> to backport this onto our kernel that we had changed the error case with the
> above patch from Chris, and actually we really, really need Chris's fix as well.
> There is also a change in the error handling from v1 where we now set the page
> error properly but only once we've locked the page and verified we're still
> responsible for COW'ing the page.  Thanks,

Reviewed-by: David Sterba <dsterba@suse.com>

Very tricky stuff that fixup worker, it's like the worst present a
filesystem can get from memory management.

Estimated Merge target is 5.6, post rc1 so we have enough time for
testing. It'll appear either in misc-next or for-next.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH][v3] btrfs: do not do delalloc reservation under page lock
  2020-01-21 19:34   ` [PATCH][v3] " Josef Bacik
@ 2020-01-29 15:12     ` David Sterba
  0 siblings, 0 replies; 7+ messages in thread
From: David Sterba @ 2020-01-29 15:12 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Tue, Jan 21, 2020 at 02:34:52PM -0500, Josef Bacik wrote:
> Thankfully the stars have to align just right to hit this.  First you
> have to end up in the fixup worker, which is tricky by itself (my
> reproducer does DIO reads into a MMAP'ed region, so not a common
> operation).  Then you have to have less than a page size of free data
> space and 0 unallocated space so you go down the "commit the transaction
> to free up pinned space" path.  This was accomplished by a random
> balance that was running on the host.  Then you get this deadlock.
> 
> I'm still in the process of trying to force the deadlock to happen on
> demand, but I've hit other issues.  I can still trigger the fixup worker
> path itself so this patch has been tested in that regard, so the normal
> case is fine.
> 
> Fixes: 87826df0ec36 ("btrfs: delalloc for page dirtied out-of-band in fixup worker")
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> v2->v3:
> - Use a delayed iput in the fixup worker.  I *think* we can deadlock if we do
>   the final iput and need to flush space, which may trigger the fixup worker
>   which is busy doing our iput.  Err on the side of caution and use a delayed
>   iput.

Sounds serious so I've turned this into a comment.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-01-29 15:13 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-21 16:51 [PATCH 0/3] fixup work fixes Josef Bacik
2020-01-21 16:51 ` [PATCH 1/3] Btrfs: keep pages dirty when using btrfs_writepage_fixup_worker Josef Bacik
2020-01-21 16:51 ` [PATCH 2/3] btrfs: drop the -EBUSY case in __extent_writepage_io Josef Bacik
2020-01-21 16:51 ` [PATCH 3/3][v2] btrfs: do not do delalloc reservation under page lock Josef Bacik
2020-01-21 19:34   ` [PATCH][v3] " Josef Bacik
2020-01-29 15:12     ` David Sterba
2020-01-29 15:11 ` [PATCH 0/3] fixup work fixes David Sterba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.