All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] btrfs: shrink delalloc fixes
@ 2021-06-11 13:53 Josef Bacik
  2021-06-11 13:53 ` [PATCH 1/4] btrfs: wait on async extents when flushing delalloc Josef Bacik
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Josef Bacik @ 2021-06-11 13:53 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Hello,

I backported the patch to switch us to using sync_inode() to our kernel inside
Facebook to fix a deadlock when using the async delalloc shrinker threads.  This
uncovered a bunch of problems with how we shrink delalloc, as we use -o
compress-force, and thus everything goes through the async compression threads.

I ripped out the async pages stuff because originally I had switched us to just
writing the whole inode.  This caused a performance regression, and so I
switched us to calling sync_inode() twice to handle the async extent case.  The
problem is that sync_inode() can skip writing inodes sometimes, and thus we
weren't properly waiting on the async helpers.  We really do need to wait for
the async delalloc pages to go down before continuing to shrink delalloc.  There
was also a race in how we woke up the async delalloc pages waiter which could be
problematic.

And then finally there is our use of sync_inode().  It tries to be too clever
for us, when in reality we want to make sure all pages are under writeback
before we come back to the shrinking loop.  I've added a small helper to give us
this flexibilty and have switched us to that helper.

With these patches, and others that will be sent separately, the early ENOSPC
problems we were experiencing have been eliminated.  Thanks,

Josef Bacik (4):
  btrfs: wait on async extents when flushing delalloc
  btrfs: wake up async_delalloc_pages waiters after submit
  fs: add a filemap_fdatawrite_wbc helper
  btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking

 fs/btrfs/inode.c      | 16 ++++++----------
 fs/btrfs/space-info.c | 33 +++++++++++++++++++++++++++++++++
 include/linux/fs.h    |  2 ++
 mm/filemap.c          | 29 ++++++++++++++++++++++++-----
 4 files changed, 65 insertions(+), 15 deletions(-)

-- 
2.26.3


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/4] btrfs: wait on async extents when flushing delalloc
  2021-06-11 13:53 [PATCH 0/4] btrfs: shrink delalloc fixes Josef Bacik
@ 2021-06-11 13:53 ` Josef Bacik
  2021-06-14 10:05   ` Nikolay Borisov
  2021-06-11 13:53 ` [PATCH 2/4] btrfs: wake up async_delalloc_pages waiters after submit Josef Bacik
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Josef Bacik @ 2021-06-11 13:53 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

I've been debugging an early ENOSPC problem in production and finally
root caused it to this problem.  When we switched to the per-inode
flushing code I pulled out the async extent handling, because we were
doing the correct thing by calling filemap_flush() if we had async
extents set.  This would properly wait on any async extents by locking
the page in the second flush, thus making sure our ordered extents were
properly set up.

However when I switched us back to page based flushing, I used
sync_inode(), which allows us to pass in our own wbc.  The problem here
is that sync_inode() is smarter than the filemap_* helpers, it tries to
avoid calling writepages at all.  This means that our second call could
skip calling do_writepages altogether, and thus not wait on the pagelock
for the async helpers.  This means we could come back before any ordered
extents were created and then simply continue on in our flushing
mechanisms and ENOSPC out when we have plenty of space to use.

Fix this by putting back the async pages logic in shrink_delalloc.  This
allows us to bulk write out everything that we need to, and then we can
wait in one place for the async helpers to catch up, and then wait on
any ordered extents that are created.

Fixes: e076ab2a2ca7 ("btrfs: shrink delalloc pages instead of full inodes")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/inode.c      |  4 ----
 fs/btrfs/space-info.c | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8ba4b194cd3e..6cb73ff59c7c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9681,10 +9681,6 @@ static int start_delalloc_inodes(struct btrfs_root *root,
 					 &work->work);
 		} else {
 			ret = sync_inode(inode, wbc);
-			if (!ret &&
-			    test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
-				     &BTRFS_I(inode)->runtime_flags))
-				ret = sync_inode(inode, wbc);
 			btrfs_add_delayed_iput(inode);
 			if (ret || wbc->nr_to_write <= 0)
 				goto out;
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index af467e888545..3c89af4fd729 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -547,9 +547,42 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info,
 	while ((delalloc_bytes || ordered_bytes) && loops < 3) {
 		u64 temp = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
 		long nr_pages = min_t(u64, temp, LONG_MAX);
+		int async_pages;
 
 		btrfs_start_delalloc_roots(fs_info, nr_pages, true);
 
+		/*
+		 * We need to make sure any outstanding async pages are now
+		 * processed before we continue.  This is because things like
+		 * sync_inode() try to be smart and skip writing if the inode is
+		 * marked clean.  We don't use filemap_fwrite for flushing
+		 * because we want to control how many pages we write out at a
+		 * time, thus this is the only safe way to make sure we've
+		 * waited for outstanding compressed workers to have started
+		 * their jobs and thus have ordered extents set up properly.
+		 * Josef, when you think you are smart enough to remove this in
+		 * the future, read this comment 4 times and explain in the
+		 * commit message why it makes sense to remove it, and then
+		 * delete the commit and don't remove it.
+		 */
+		async_pages = atomic_read(&fs_info->async_delalloc_pages);
+		if (!async_pages)
+			goto skip_async;
+
+		/*
+		 * We don't want to wait forever, if we wrote less pages in this
+		 * loop than we have outstanding, only wait for that number of
+		 * pages, otherwise we can wait for all async pages to finish
+		 * before continuing.
+		 */
+		if (async_pages > nr_pages)
+			async_pages -= nr_pages;
+		else
+			async_pages = 0;
+		wait_event(fs_info->async_submit_wait,
+			   atomic_read(&fs_info->async_delalloc_pages) <=
+			   async_pages);
+skip_async:
 		loops++;
 		if (wait_ordered && !trans) {
 			btrfs_wait_ordered_roots(fs_info, items, 0, (u64)-1);
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/4] btrfs: wake up async_delalloc_pages waiters after submit
  2021-06-11 13:53 [PATCH 0/4] btrfs: shrink delalloc fixes Josef Bacik
  2021-06-11 13:53 ` [PATCH 1/4] btrfs: wait on async extents when flushing delalloc Josef Bacik
@ 2021-06-11 13:53 ` Josef Bacik
  2021-06-14 10:20   ` Nikolay Borisov
  2021-06-11 13:53 ` [PATCH 3/4] fs: add a filemap_fdatawrite_wbc helper Josef Bacik
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Josef Bacik @ 2021-06-11 13:53 UTC (permalink / raw)
  To: linux-btrfs, kernel-team; +Cc: stable

We use the async_delalloc_pages mechanism to make sure that we've
completed our async work before trying to continue our delalloc
flushing.  The reason for this is we need to see any ordered extents
that were created by our delalloc flushing.  However we're waking up
before we do the submit work, which is before we create the ordered
extents.  This is a pretty wide race window where we could potentially
think there are no ordered extents and thus exit shrink_delalloc
prematurely.  Fix this by waking us up after we've done the work to
create ordered extents.

cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/inode.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6cb73ff59c7c..c37271df2c6d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1271,11 +1271,6 @@ static noinline void async_cow_submit(struct btrfs_work *work)
 	nr_pages = (async_chunk->end - async_chunk->start + PAGE_SIZE) >>
 		PAGE_SHIFT;
 
-	/* atomic_sub_return implies a barrier */
-	if (atomic_sub_return(nr_pages, &fs_info->async_delalloc_pages) <
-	    5 * SZ_1M)
-		cond_wake_up_nomb(&fs_info->async_submit_wait);
-
 	/*
 	 * ->inode could be NULL if async_chunk_start has failed to compress,
 	 * in which case we don't have anything to submit, yet we need to
@@ -1284,6 +1279,11 @@ static noinline void async_cow_submit(struct btrfs_work *work)
 	 */
 	if (async_chunk->inode)
 		submit_compressed_extents(async_chunk);
+
+	/* atomic_sub_return implies a barrier */
+	if (atomic_sub_return(nr_pages, &fs_info->async_delalloc_pages) <
+	    5 * SZ_1M)
+		cond_wake_up_nomb(&fs_info->async_submit_wait);
 }
 
 static noinline void async_cow_free(struct btrfs_work *work)
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/4] fs: add a filemap_fdatawrite_wbc helper
  2021-06-11 13:53 [PATCH 0/4] btrfs: shrink delalloc fixes Josef Bacik
  2021-06-11 13:53 ` [PATCH 1/4] btrfs: wait on async extents when flushing delalloc Josef Bacik
  2021-06-11 13:53 ` [PATCH 2/4] btrfs: wake up async_delalloc_pages waiters after submit Josef Bacik
@ 2021-06-11 13:53 ` Josef Bacik
  2021-06-14 10:16   ` Nikolay Borisov
  2021-06-11 13:54 ` [PATCH 4/4] btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking Josef Bacik
  2021-06-14 11:13 ` [PATCH 0/4] btrfs: shrink delalloc fixes Johannes Thumshirn
  4 siblings, 1 reply; 14+ messages in thread
From: Josef Bacik @ 2021-06-11 13:53 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Btrfs sometimes needs to flush dirty pages on a bunch of dirty inodes in
order to reclaim metadata reservations.  Unfortunately most helpers in
this area are too smart for us

1) The normal filemap_fdata* helpers only take range and sync modes, and
   don't give any indication of how much was written, so we can only
   flush full inodes, which isn't what we want in most cases.
2) The normal writeback path requires us to have the s_umount sem held,
   but we can't unconditionally take it in this path because we could
   deadlock.
3) The normal writeback path also skips inodes with I_SYNC set if we
   write with WB_SYNC_NONE.  This isn't the behavior we want under heavy
   ENOSPC pressure, we want to actually make sure the pages are under
   writeback before returning, and if another thread is in the middle of
   writing the file we may return before they're under writeback and
   miss our ordered extents and not properly wait for completion.
4) sync_inode() uses the normal writeback path and has the same problem
   as #3.

What we really want is to call do_writepages() with our wbc.  This way
we can make sure that writeback is actually started on the pages, and we
can control how many pages are written as a whole as we write many
inodes using the same wbc.  Accomplish this with a new helper that does
just that so we can use it for our ENOSPC flushing infrastructure.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 include/linux/fs.h |  2 ++
 mm/filemap.c       | 29 ++++++++++++++++++++++++-----
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index c3c88fdb9b2a..aace07f88b73 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2886,6 +2886,8 @@ extern int filemap_fdatawrite_range(struct address_space *mapping,
 				loff_t start, loff_t end);
 extern int filemap_check_errors(struct address_space *mapping);
 extern void __filemap_set_wb_err(struct address_space *mapping, int err);
+extern int filemap_fdatawrite_wbc(struct address_space *mapping,
+				  struct writeback_control *wbc);
 
 static inline int filemap_write_and_wait(struct address_space *mapping)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index 66f7e9fdfbc4..0408bc247e71 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -376,6 +376,29 @@ static int filemap_check_and_keep_errors(struct address_space *mapping)
 		return -ENOSPC;
 	return 0;
 }
+/**
+ * filemap_fdatawrite_wbc - start writeback on mapping dirty pages in range
+ * @mapping:	address space structure to write
+ * @wbc:	the writeback_control controlling the writeout
+ *
+ * This behaves the same way as __filemap_fdatawrite_range, but simply takes the
+ * writeback_control as an argument to allow the caller to have more flexibility
+ * over the writeout parameters, and with no checks around whether the mapping
+ * has dirty pages or anything, it simply calls writepages.
+ *
+ * Return: %0 on success, negative error code otherwise.
+ */
+int filemap_fdatawrite_wbc(struct address_space *mapping,
+			   struct writeback_control *wbc)
+{
+	int ret;
+
+	wbc_attach_fdatawrite_inode(wbc, mapping->host);
+	ret = do_writepages(mapping, wbc);
+	wbc_detach_inode(wbc);
+	return ret;
+}
+EXPORT_SYMBOL(filemap_fdatawrite_wbc);
 
 /**
  * __filemap_fdatawrite_range - start writeback on mapping dirty pages in range
@@ -397,7 +420,6 @@ static int filemap_check_and_keep_errors(struct address_space *mapping)
 int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 				loff_t end, int sync_mode)
 {
-	int ret;
 	struct writeback_control wbc = {
 		.sync_mode = sync_mode,
 		.nr_to_write = LONG_MAX,
@@ -409,10 +431,7 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 	    !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
 		return 0;
 
-	wbc_attach_fdatawrite_inode(&wbc, mapping->host);
-	ret = do_writepages(mapping, &wbc);
-	wbc_detach_inode(&wbc);
-	return ret;
+	return filemap_fdatawrite_wbc(mapping, &wbc);
 }
 
 static inline int __filemap_fdatawrite(struct address_space *mapping,
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 4/4] btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking
  2021-06-11 13:53 [PATCH 0/4] btrfs: shrink delalloc fixes Josef Bacik
                   ` (2 preceding siblings ...)
  2021-06-11 13:53 ` [PATCH 3/4] fs: add a filemap_fdatawrite_wbc helper Josef Bacik
@ 2021-06-11 13:54 ` Josef Bacik
  2021-06-14 10:57   ` Nikolay Borisov
  2021-06-14 11:13 ` [PATCH 0/4] btrfs: shrink delalloc fixes Johannes Thumshirn
  4 siblings, 1 reply; 14+ messages in thread
From: Josef Bacik @ 2021-06-11 13:54 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

sync_inode() has some holes that can cause problems if we're under heavy
ENOSPC pressure.  If there's writeback running on a separate thread
sync_inode() will skip writing the inode altogether.  What we really
want is to make sure writeback has been started on all the pages to make
sure we can see the ordered extents and wait on them if appropriate.
Switch to this new helper which will allow us to accomplish this and
avoid ENOSPC'ing early.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c37271df2c6d..93d113991e2c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9680,7 +9680,7 @@ static int start_delalloc_inodes(struct btrfs_root *root,
 			btrfs_queue_work(root->fs_info->flush_workers,
 					 &work->work);
 		} else {
-			ret = sync_inode(inode, wbc);
+			ret = filemap_fdatawrite_wbc(inode->i_mapping, wbc);
 			btrfs_add_delayed_iput(inode);
 			if (ret || wbc->nr_to_write <= 0)
 				goto out;
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] btrfs: wait on async extents when flushing delalloc
  2021-06-11 13:53 ` [PATCH 1/4] btrfs: wait on async extents when flushing delalloc Josef Bacik
@ 2021-06-14 10:05   ` Nikolay Borisov
  2021-06-14 18:30     ` Josef Bacik
  0 siblings, 1 reply; 14+ messages in thread
From: Nikolay Borisov @ 2021-06-14 10:05 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 11.06.21 г. 16:53, Josef Bacik wrote:
> I've been debugging an early ENOSPC problem in production and finally
> root caused it to this problem.  When we switched to the per-inode
> flushing code I pulled out the async extent handling, because we were
> doing the correct thing by calling filemap_flush() if we had async
> extents set.  This would properly wait on any async extents by locking
> the page in the second flush, thus making sure our ordered extents were
> properly set up.

Which commit is this paragraph referring to? It will be helpful to
include a reference so once can actually trace the  history of this code.
> 
> However when I switched us back to page based flushing, I used
> sync_inode(), which allows us to pass in our own wbc.  The problem here
> is that sync_inode() is smarter than the filemap_* helpers, it tries to
> avoid calling writepages at all.  This means that our second call could
> skip calling do_writepages altogether, and thus not wait on the pagelock
> for the async helpers.  This means we could come back before any ordered
> extents were created and then simply continue on in our flushing
> mechanisms and ENOSPC out when we have plenty of space to use.

AFAICS in the code and based on your problem statement it would seem the
culprit here is really the fact we are calling the 2nd sync_inode with
WBC_SYNC_NONE, which is an indicator for sync_inode (respectively.
writeback_single_inode) to simply return in case I_SYNC is still set
from the first invocation of sync_inode or if the inode doesn't have
I_DIRTY_ALL set. So wouldn't a simple fix be to change the wbc's
.sync_mode to the second sync_inode to be WB_SYNC_ALL that will ensure
that the 2nd sync_inode takes effect?

> 
> Fix this by putting back the async pages logic in shrink_delalloc.  This
> allows us to bulk write out everything that we need to, and then we can
> wait in one place for the async helpers to catch up, and then wait on
> any ordered extents that are created.
> 
> Fixes: e076ab2a2ca7 ("btrfs: shrink delalloc pages instead of full inodes")
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>


> ---
>  fs/btrfs/inode.c      |  4 ----
>  fs/btrfs/space-info.c | 33 +++++++++++++++++++++++++++++++++
>  2 files changed, 33 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 8ba4b194cd3e..6cb73ff59c7c 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9681,10 +9681,6 @@ static int start_delalloc_inodes(struct btrfs_root *root,
>  					 &work->work);
>  		} else {
>  			ret = sync_inode(inode, wbc);
> -			if (!ret &&
> -			    test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
> -				     &BTRFS_I(inode)->runtime_flags))
> -				ret = sync_inode(inode, wbc);
>  			btrfs_add_delayed_iput(inode);
>  			if (ret || wbc->nr_to_write <= 0)
>  				goto out;
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index af467e888545..3c89af4fd729 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -547,9 +547,42 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info,
>  	while ((delalloc_bytes || ordered_bytes) && loops < 3) {
>  		u64 temp = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
>  		long nr_pages = min_t(u64, temp, LONG_MAX);
> +		int async_pages;
>  
>  		btrfs_start_delalloc_roots(fs_info, nr_pages, true);
>  
> +		/*
> +		 * We need to make sure any outstanding async pages are now
> +		 * processed before we continue.  This is because things like
> +		 * sync_inode() try to be smart and skip writing if the inode is
> +		 * marked clean.  We don't use filemap_fwrite for flushing
> +		 * because we want to control how many pages we write out at a
> +		 * time, thus this is the only safe way to make sure we've
> +		 * waited for outstanding compressed workers to have started
> +		 * their jobs and thus have ordered extents set up properly.
> +		 * Josef, when you think you are smart enough to remove this in
> +		 * the future, read this comment 4 times and explain in the
> +		 * commit message why it makes sense to remove it, and then
> +		 * delete the commit and don't remove it.

I think the last sentence can be omitted, simply prefix the paragraph
with N.B. or put a more generic warning e.g. "Be extra careful and
thoughtful when considering changing this logic".

> +		 */
> +		async_pages = atomic_read(&fs_info->async_delalloc_pages);
> +		if (!async_pages)
> +			goto skip_async;
> +
> +		/*
> +		 * We don't want to wait forever, if we wrote less pages in this
> +		 * loop than we have outstanding, only wait for that number of
> +		 * pages, otherwise we can wait for all async pages to finish
> +		 * before continuing.
> +		 */
> +		if (async_pages > nr_pages)
> +			async_pages -= nr_pages;
> +		else
> +			async_pages = 0;
> +		wait_event(fs_info->async_submit_wait,
> +			   atomic_read(&fs_info->async_delalloc_pages) <=
> +			   async_pages);

nit: What's funny is that without this patch async_submit_wait actually
doesn't have a single user, it's only woken up in async_cow_submit but
nobody actually waited for it. Ideally it should have been removed when
its last user was removed. ;)

> +skip_async:
>  		loops++;
>  		if (wait_ordered && !trans) {
>  			btrfs_wait_ordered_roots(fs_info, items, 0, (u64)-1);
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] fs: add a filemap_fdatawrite_wbc helper
  2021-06-11 13:53 ` [PATCH 3/4] fs: add a filemap_fdatawrite_wbc helper Josef Bacik
@ 2021-06-14 10:16   ` Nikolay Borisov
  2021-06-15 13:37     ` Josef Bacik
  0 siblings, 1 reply; 14+ messages in thread
From: Nikolay Borisov @ 2021-06-14 10:16 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 11.06.21 г. 16:53, Josef Bacik wrote:
> Btrfs sometimes needs to flush dirty pages on a bunch of dirty inodes in
> order to reclaim metadata reservations.  Unfortunately most helpers in
> this area are too smart for us
> 
> 1) The normal filemap_fdata* helpers only take range and sync modes, and
>    don't give any indication of how much was written, so we can only
>    flush full inodes, which isn't what we want in most cases.
> 2) The normal writeback path requires us to have the s_umount sem held,
>    but we can't unconditionally take it in this path because we could
>    deadlock.
> 3) The normal writeback path also skips inodes with I_SYNC set if we
>    write with WB_SYNC_NONE.  This isn't the behavior we want under heavy
>    ENOSPC pressure, we want to actually make sure the pages are under
>    writeback before returning, and if another thread is in the middle of
>    writing the file we may return before they're under writeback and
>    miss our ordered extents and not properly wait for completion.
> 4) sync_inode() uses the normal writeback path and has the same problem
>    as #3.
> 
> What we really want is to call do_writepages() with our wbc.  This way
> we can make sure that writeback is actually started on the pages, and we
> can control how many pages are written as a whole as we write many
> inodes using the same wbc.  Accomplish this with a new helper that does
> just that so we can use it for our ENOSPC flushing infrastructure.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  include/linux/fs.h |  2 ++
>  mm/filemap.c       | 29 ++++++++++++++++++++++++-----
>  2 files changed, 26 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index c3c88fdb9b2a..aace07f88b73 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2886,6 +2886,8 @@ extern int filemap_fdatawrite_range(struct address_space *mapping,
>  				loff_t start, loff_t end);
>  extern int filemap_check_errors(struct address_space *mapping);
>  extern void __filemap_set_wb_err(struct address_space *mapping, int err);
> +extern int filemap_fdatawrite_wbc(struct address_space *mapping,
> +				  struct writeback_control *wbc);
>  
>  static inline int filemap_write_and_wait(struct address_space *mapping)
>  {
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 66f7e9fdfbc4..0408bc247e71 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -376,6 +376,29 @@ static int filemap_check_and_keep_errors(struct address_space *mapping)
>  		return -ENOSPC;
>  	return 0;
>  }
> +/**
> + * filemap_fdatawrite_wbc - start writeback on mapping dirty pages in range
> + * @mapping:	address space structure to write
> + * @wbc:	the writeback_control controlling the writeout
> + *
> + * This behaves the same way as __filemap_fdatawrite_range, but simply takes the

That's not true, because __filemap_fdatawrite_range will only issue
writeback in case of PAGECACHE_TAG_DIRTY && the inode's bdi having
BDI_CAP_WRITEBACK. So I think those checks should also be moved to
fdatawrite_wbc.

In fact what would be good for readability since we have a bunch of
__filemap_fdatawrite functions is to have each one call your newly
introduced helper and have their body simply setup the correct
writeback_control structure. Alternative right now one has to chase up
to 3-4 levels of (admittedly very short) functions. I.E

filemap_fdatawrite->__filemap_fdatawrite->__filemap_fdatawrite_range->filemap_fdatawrite_wbc

which is somewhat annoying. Instead I propose having

filemap_fdatawrite->filemap_fdatawrite_wbc
filemap_flush->filemap_fdatawrite_wbc etc...


> + * writeback_control as an argument to allow the caller to have more flexibility
> + * over the writeout parameters, and with no checks around whether the mapping
> + * has dirty pages or anything, it simply calls writepages.
> + *
> + * Return: %0 on success, negative error code otherwise.
> + */
> +int filemap_fdatawrite_wbc(struct address_space *mapping,
> +			   struct writeback_control *wbc)
> +{
> +	int ret;
> +
> +	wbc_attach_fdatawrite_inode(wbc, mapping->host);
> +	ret = do_writepages(mapping, wbc);
> +	wbc_detach_inode(wbc);
> +	return ret;
> +}
> +EXPORT_SYMBOL(filemap_fdatawrite_wbc);
>  
>  /**
>   * __filemap_fdatawrite_range - start writeback on mapping dirty pages in range
> @@ -397,7 +420,6 @@ static int filemap_check_and_keep_errors(struct address_space *mapping)
>  int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
>  				loff_t end, int sync_mode)
>  {
> -	int ret;
>  	struct writeback_control wbc = {
>  		.sync_mode = sync_mode,
>  		.nr_to_write = LONG_MAX,
> @@ -409,10 +431,7 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
>  	    !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
>  		return 0;
>  
> -	wbc_attach_fdatawrite_inode(&wbc, mapping->host);
> -	ret = do_writepages(mapping, &wbc);
> -	wbc_detach_inode(&wbc);
> -	return ret;
> +	return filemap_fdatawrite_wbc(mapping, &wbc);
>  }
>  
>  static inline int __filemap_fdatawrite(struct address_space *mapping,
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/4] btrfs: wake up async_delalloc_pages waiters after submit
  2021-06-11 13:53 ` [PATCH 2/4] btrfs: wake up async_delalloc_pages waiters after submit Josef Bacik
@ 2021-06-14 10:20   ` Nikolay Borisov
  0 siblings, 0 replies; 14+ messages in thread
From: Nikolay Borisov @ 2021-06-14 10:20 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team; +Cc: stable



On 11.06.21 г. 16:53, Josef Bacik wrote:
> We use the async_delalloc_pages mechanism to make sure that we've
> completed our async work before trying to continue our delalloc
> flushing.  The reason for this is we need to see any ordered extents
> that were created by our delalloc flushing.  However we're waking up
> before we do the submit work, which is before we create the ordered
> extents.  This is a pretty wide race window where we could potentially
> think there are no ordered extents and thus exit shrink_delalloc
> prematurely.  Fix this by waking us up after we've done the work to
> create ordered extents.
> 
> cc: stable@vger.kernel.org
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
> ---
>  fs/btrfs/inode.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 6cb73ff59c7c..c37271df2c6d 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1271,11 +1271,6 @@ static noinline void async_cow_submit(struct btrfs_work *work)
>  	nr_pages = (async_chunk->end - async_chunk->start + PAGE_SIZE) >>
>  		PAGE_SHIFT;
>  
> -	/* atomic_sub_return implies a barrier */
> -	if (atomic_sub_return(nr_pages, &fs_info->async_delalloc_pages) <
> -	    5 * SZ_1M)
> -		cond_wake_up_nomb(&fs_info->async_submit_wait);
> -
>  	/*
>  	 * ->inode could be NULL if async_chunk_start has failed to compress,
>  	 * in which case we don't have anything to submit, yet we need to
> @@ -1284,6 +1279,11 @@ static noinline void async_cow_submit(struct btrfs_work *work)
>  	 */
>  	if (async_chunk->inode)
>  		submit_compressed_extents(async_chunk);
> +
> +	/* atomic_sub_return implies a barrier */
> +	if (atomic_sub_return(nr_pages, &fs_info->async_delalloc_pages) <
> +	    5 * SZ_1M)
> +		cond_wake_up_nomb(&fs_info->async_submit_wait);
>  }
>  
>  static noinline void async_cow_free(struct btrfs_work *work)
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 4/4] btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking
  2021-06-11 13:54 ` [PATCH 4/4] btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking Josef Bacik
@ 2021-06-14 10:57   ` Nikolay Borisov
  0 siblings, 0 replies; 14+ messages in thread
From: Nikolay Borisov @ 2021-06-14 10:57 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 11.06.21 г. 16:54, Josef Bacik wrote:
> sync_inode() has some holes that can cause problems if we're under heavy
> ENOSPC pressure.  If there's writeback running on a separate thread
> sync_inode() will skip writing the inode altogether.  What we really
> want is to make sure writeback has been started on all the pages to make
> sure we can see the ordered extents and wait on them if appropriate.
> Switch to this new helper which will allow us to accomplish this and
> avoid ENOSPC'ing early.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

> ---
>  fs/btrfs/inode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index c37271df2c6d..93d113991e2c 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9680,7 +9680,7 @@ static int start_delalloc_inodes(struct btrfs_root *root,
>  			btrfs_queue_work(root->fs_info->flush_workers,
>  					 &work->work);
>  		} else {
> -			ret = sync_inode(inode, wbc);
> +			ret = filemap_fdatawrite_wbc(inode->i_mapping, wbc);
>  			btrfs_add_delayed_iput(inode);
>  			if (ret || wbc->nr_to_write <= 0)
>  				goto out;
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/4] btrfs: shrink delalloc fixes
  2021-06-11 13:53 [PATCH 0/4] btrfs: shrink delalloc fixes Josef Bacik
                   ` (3 preceding siblings ...)
  2021-06-11 13:54 ` [PATCH 4/4] btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking Josef Bacik
@ 2021-06-14 11:13 ` Johannes Thumshirn
  2021-06-14 18:38   ` Josef Bacik
  4 siblings, 1 reply; 14+ messages in thread
From: Johannes Thumshirn @ 2021-06-14 11:13 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team

On 11/06/2021 15:54, Josef Bacik wrote:
> Hello,
> 
> I backported the patch to switch us to using sync_inode() to our kernel inside
> Facebook to fix a deadlock when using the async delalloc shrinker threads.  This
> uncovered a bunch of problems with how we shrink delalloc, as we use -o
> compress-force, and thus everything goes through the async compression threads.
> 
> I ripped out the async pages stuff because originally I had switched us to just
> writing the whole inode.  This caused a performance regression, and so I
> switched us to calling sync_inode() twice to handle the async extent case.  The
> problem is that sync_inode() can skip writing inodes sometimes, and thus we
> weren't properly waiting on the async helpers.  We really do need to wait for
> the async delalloc pages to go down before continuing to shrink delalloc.  There
> was also a race in how we woke up the async delalloc pages waiter which could be
> problematic.
> 
> And then finally there is our use of sync_inode().  It tries to be too clever
> for us, when in reality we want to make sure all pages are under writeback
> before we come back to the shrinking loop.  I've added a small helper to give us
> this flexibilty and have switched us to that helper.
> 
> With these patches, and others that will be sent separately, the early ENOSPC
> problems we were experiencing have been eliminated.  Thanks,
> 
> Josef Bacik (4):
>   btrfs: wait on async extents when flushing delalloc
>   btrfs: wake up async_delalloc_pages waiters after submit
>   fs: add a filemap_fdatawrite_wbc helper
>   btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking
> 
>  fs/btrfs/inode.c      | 16 ++++++----------
>  fs/btrfs/space-info.c | 33 +++++++++++++++++++++++++++++++++
>  include/linux/fs.h    |  2 ++
>  mm/filemap.c          | 29 ++++++++++++++++++++++++-----
>  4 files changed, 65 insertions(+), 15 deletions(-)
> 

I've ran this through xfstests on a zoned device and got the following lockdep
splat. I'm not sure if this is your fault or my fault. btrfs_reclaim_bgs_work()
very much sounds like my fault. Writing it here for completeness anyways.
Otherwise no noticeable deviations from the baseline.

[  634.999371] BTRFS info (device nullb1): reclaiming chunk 3489660928 with 22% used
[  634.999445] BTRFS info (device nullb1): relocating block group 3489660928 flags metadata
                                                                                           
[  642.127246] ======================================================                                                                                                                                                                                                                                                                                                       
[  642.127876] WARNING: possible circular locking dependency detected
[  642.128526] 5.13.0-rc5-josef-delalloc #1080 Not tainted
[  642.129060] ------------------------------------------------------
[  642.129699] kworker/u4:5/11096 is trying to acquire lock:        
[  642.130259] ffff888278839438 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x28/0x120 [btrfs]
[  642.131321]                           
               but task is already holding lock:
[  642.131921] ffff8882bdc45cf8 (btrfs-tree-01#2/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x28/0x120 [btrfs]
[  642.132926] 
               which lock already depends on the new lock.
                                                                                                                                                                                      
[  642.133767]                                                                                                                                                                        
               the existing dependency chain (in reverse order) is:                                                                                                                   
[  642.134529]                                                                                                                                                                        
               -> #2 (btrfs-tree-01#2/1){+.+.}-{3:3}:                                                                                                                                 
[  642.135181]        down_write_nested+0x23/0x60                                                                                                                                     
[  642.135647]        __btrfs_tree_lock+0x28/0x120 [btrfs]
[  642.136238]        btrfs_init_new_buffer+0x7d/0x2a0 [btrfs]
[  642.136844]        btrfs_alloc_tree_block+0x113/0x320 [btrfs]                                                                                                                      
[  642.137460]        alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]                                                                                                                  
[  642.138115]        __btrfs_cow_block+0x13b/0x5e0 [btrfs]            
[  642.138688]        btrfs_cow_block+0x114/0x220 [btrfs]
[  642.139296]        btrfs_search_slot+0x587/0x950 [btrfs]
[  642.139907]        btrfs_lookup_inode+0x2a/0x90 [btrfs]
[  642.140501]        __btrfs_update_delayed_inode+0x80/0x2d0 [btrfs]
[  642.141177]        btrfs_async_run_delayed_root+0x174/0x240 [btrfs]
[  642.141878]        btrfs_work_helper+0xfe/0x530 [btrfs]
[  642.142454]        process_one_work+0x24f/0x570
[  642.142931]        worker_thread+0x4f/0x3e0      
[  642.143357]        kthread+0x12c/0x150  
[  642.143748]        ret_from_fork+0x22/0x30         
[  642.144181]                                      
               -> #1 (btrfs-tree-01#2){++++}-{3:3}:   
[  642.144807]        down_write_nested+0x23/0x60    
[  642.145260]        __btrfs_tree_lock+0x28/0x120 [btrfs] 
[  642.145839]        btrfs_search_slot+0x259/0x950 [btrfs]
[  642.146413]        do_relocation+0xf9/0x5d0 [btrfs]
[  642.146960]        relocate_tree_blocks+0x293/0x610 [btrfs]
[  642.147579]        relocate_block_group+0x1f2/0x560 [btrfs]
[  642.148201]        btrfs_relocate_block_group+0x16c/0x320 [btrfs]
[  642.148869]        btrfs_relocate_chunk+0x38/0x120 [btrfs]
[  642.149479]        btrfs_reclaim_bgs_work.cold+0x5d/0x10f [btrfs]
[  642.150162]        process_one_work+0x24f/0x570
[  642.150628]        worker_thread+0x4f/0x3e0
[  642.151062]        kthread+0x12c/0x150     
[  642.151450]        ret_from_fork+0x22/0x30
[  642.151872]                                 
               -> #0 (btrfs-treloc-02#2){+.+.}-{3:3}:
[  642.152518]        __lock_acquire+0x1235/0x2320                                     
[  642.152985]        lock_acquire+0xab/0x340                                       
[  642.153409]        down_write_nested+0x23/0x60                                          
[  642.153872]        __btrfs_tree_lock+0x28/0x120 [btrfs]                             
[  642.154455]        btrfs_lock_root_node+0x31/0x40 [btrfs]                        
[  642.155064]        btrfs_search_slot+0x6bc/0x950 [btrfs]                                
[  642.155639]        replace_path+0x56f/0xa30 [btrfs]                                 
[  642.156181]        merge_reloc_root+0x258/0x600 [btrfs]                   
[  642.156773]        merge_reloc_roots+0xdd/0x210 [btrfs]   
[  642.157368]        relocate_block_group+0x2c9/0x560 [btrfs]                                                                                                                        
[  642.157986]        btrfs_relocate_block_group+0x16c/0x320 [btrfs]                  
[  642.158660]        btrfs_relocate_chunk+0x38/0x120 [btrfs]        
[  642.159272]        btrfs_reclaim_bgs_work.cold+0x5d/0x10f [btrfs]
[  642.159951]        process_one_work+0x24f/0x570
[  642.160418]        worker_thread+0x4f/0x3e0
[  642.160850]        kthread+0x12c/0x150
[  642.161243]        ret_from_fork+0x22/0x30
[  642.161664] 
               other info that might help us debug this:

[  642.162480] Chain exists of:
                 btrfs-treloc-02#2 --> btrfs-tree-01#2 --> btrfs-tree-01#2/1

[  642.163616]  Possible unsafe locking scenario:

[  642.164226]        CPU0                    CPU1
[  642.164699]        ----                    ----
[  642.165164]   lock(btrfs-tree-01#2/1);
[  642.165553]                                lock(btrfs-tree-01#2);
[  642.166183]                                lock(btrfs-tree-01#2/1);
[  642.166820]   lock(btrfs-treloc-02#2);
[  642.167209] 
                *** DEADLOCK ***

[  642.167812] 6 locks held by kworker/u4:5/11096:
[  642.168288]  #0: ffff8881000c4938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x1c6/0x570
[  642.169317]  #1: ffffc900046e3e78 ((work_completion)(&fs_info->reclaim_bgs_work)){+.+.}-{0:0}, at: process_one_work+0x1c6/0x570
[  642.170507]  #2: ffff888136c6a0a0 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_reclaim_bgs_work+0x5d/0x1b0 [btrfs]
[  642.171687]  #3: ffff888136c68838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x164/0x320 [btrfs]
[  642.172877]  #4: ffff88811d11f620 (sb_internal){.+.+}-{0:0}, at: merge_reloc_root+0x178/0x600 [btrfs]
[  642.173871]  #5: ffff8882bdc45cf8 (btrfs-tree-01#2/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x28/0x120 [btrfs]
[  642.174906] 
[  642.175362] CPU: 0 PID: 11096 Comm: kworker/u4:5 Not tainted 5.13.0-rc5-josef-delalloc #1080
[  642.176215] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
[  642.177366] Workqueue: events_unbound btrfs_reclaim_bgs_work [btrfs]
[  642.178065] Call Trace:
[  642.178323]  dump_stack+0x6d/0x89
[  642.178670]  check_noncircular+0xcf/0xf0
[  642.179085]  __lock_acquire+0x1235/0x2320
[  642.179509]  lock_acquire+0xab/0x340
[  642.179889]  ? __btrfs_tree_lock+0x28/0x120 [btrfs]
[  642.180442]  ? find_held_lock+0x2b/0x80
[  642.180843]  ? btrfs_root_node+0x93/0x1d0 [btrfs]
[  642.181359]  down_write_nested+0x23/0x60
[  642.181781]  ? __btrfs_tree_lock+0x28/0x120 [btrfs]
[  642.182327]  __btrfs_tree_lock+0x28/0x120 [btrfs]
[  642.182862]  btrfs_lock_root_node+0x31/0x40 [btrfs]
[  642.183411]  btrfs_search_slot+0x6bc/0x950 [btrfs]
[  642.183940]  ? release_extent_buffer+0x111/0x160 [btrfs]
[  642.184526]  replace_path+0x56f/0xa30 [btrfs]
[  642.185019]  merge_reloc_root+0x258/0x600 [btrfs]
[  642.185557]  merge_reloc_roots+0xdd/0x210 [btrfs]
[  642.186087]  relocate_block_group+0x2c9/0x560 [btrfs]
[  642.186658]  btrfs_relocate_block_group+0x16c/0x320 [btrfs]
[  642.187283]  btrfs_relocate_chunk+0x38/0x120 [btrfs]
[  642.187839]  btrfs_reclaim_bgs_work.cold+0x5d/0x10f [btrfs]
[  642.188472]  process_one_work+0x24f/0x570
[  642.188894]  worker_thread+0x4f/0x3e0
[  642.189273]  ? process_one_work+0x570/0x570
[  642.189714]  kthread+0x12c/0x150
[  642.190053]  ? __kthread_bind_mask+0x60/0x60
[  642.190492]  ret_from_fork+0x22/0x30
[  644.827436] BTRFS info (device nullb1): found 3645 extents, stage: move data extents
[  645.132633] BTRFS info (device nullb1): reclaiming chunk 4026531840 with 22% used

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] btrfs: wait on async extents when flushing delalloc
  2021-06-14 10:05   ` Nikolay Borisov
@ 2021-06-14 18:30     ` Josef Bacik
  0 siblings, 0 replies; 14+ messages in thread
From: Josef Bacik @ 2021-06-14 18:30 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs, kernel-team

On 6/14/21 6:05 AM, Nikolay Borisov wrote:
> 
> 
> On 11.06.21 г. 16:53, Josef Bacik wrote:
>> I've been debugging an early ENOSPC problem in production and finally
>> root caused it to this problem.  When we switched to the per-inode
>> flushing code I pulled out the async extent handling, because we were
>> doing the correct thing by calling filemap_flush() if we had async
>> extents set.  This would properly wait on any async extents by locking
>> the page in the second flush, thus making sure our ordered extents were
>> properly set up.
> 
> Which commit is this paragraph referring to? It will be helpful to
> include a reference so once can actually trace the  history of this code.
>>
>> However when I switched us back to page based flushing, I used
>> sync_inode(), which allows us to pass in our own wbc.  The problem here
>> is that sync_inode() is smarter than the filemap_* helpers, it tries to
>> avoid calling writepages at all.  This means that our second call could
>> skip calling do_writepages altogether, and thus not wait on the pagelock
>> for the async helpers.  This means we could come back before any ordered
>> extents were created and then simply continue on in our flushing
>> mechanisms and ENOSPC out when we have plenty of space to use.
> 
> AFAICS in the code and based on your problem statement it would seem the
> culprit here is really the fact we are calling the 2nd sync_inode with
> WBC_SYNC_NONE, which is an indicator for sync_inode (respectively.
> writeback_single_inode) to simply return in case I_SYNC is still set
> from the first invocation of sync_inode or if the inode doesn't have
> I_DIRTY_ALL set. So wouldn't a simple fix be to change the wbc's
> .sync_mode to the second sync_inode to be WB_SYNC_ALL that will ensure
> that the 2nd sync_inode takes effect?

It would, except that would make us wait on writeback, which is not what we want 
necessarily, we want to just wait for the async workers to finish and then 
continue writing stuff.

> 
>>
>> Fix this by putting back the async pages logic in shrink_delalloc.  This
>> allows us to bulk write out everything that we need to, and then we can
>> wait in one place for the async helpers to catch up, and then wait on
>> any ordered extents that are created.
>>
>> Fixes: e076ab2a2ca7 ("btrfs: shrink delalloc pages instead of full inodes")
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> 
> 
>> ---
>>   fs/btrfs/inode.c      |  4 ----
>>   fs/btrfs/space-info.c | 33 +++++++++++++++++++++++++++++++++
>>   2 files changed, 33 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 8ba4b194cd3e..6cb73ff59c7c 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -9681,10 +9681,6 @@ static int start_delalloc_inodes(struct btrfs_root *root,
>>   					 &work->work);
>>   		} else {
>>   			ret = sync_inode(inode, wbc);
>> -			if (!ret &&
>> -			    test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
>> -				     &BTRFS_I(inode)->runtime_flags))
>> -				ret = sync_inode(inode, wbc);
>>   			btrfs_add_delayed_iput(inode);
>>   			if (ret || wbc->nr_to_write <= 0)
>>   				goto out;
>> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
>> index af467e888545..3c89af4fd729 100644
>> --- a/fs/btrfs/space-info.c
>> +++ b/fs/btrfs/space-info.c
>> @@ -547,9 +547,42 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info,
>>   	while ((delalloc_bytes || ordered_bytes) && loops < 3) {
>>   		u64 temp = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
>>   		long nr_pages = min_t(u64, temp, LONG_MAX);
>> +		int async_pages;
>>   
>>   		btrfs_start_delalloc_roots(fs_info, nr_pages, true);
>>   
>> +		/*
>> +		 * We need to make sure any outstanding async pages are now
>> +		 * processed before we continue.  This is because things like
>> +		 * sync_inode() try to be smart and skip writing if the inode is
>> +		 * marked clean.  We don't use filemap_fwrite for flushing
>> +		 * because we want to control how many pages we write out at a
>> +		 * time, thus this is the only safe way to make sure we've
>> +		 * waited for outstanding compressed workers to have started
>> +		 * their jobs and thus have ordered extents set up properly.
>> +		 * Josef, when you think you are smart enough to remove this in
>> +		 * the future, read this comment 4 times and explain in the
>> +		 * commit message why it makes sense to remove it, and then
>> +		 * delete the commit and don't remove it.
> 
> I think the last sentence can be omitted, simply prefix the paragraph
> with N.B. or put a more generic warning e.g. "Be extra careful and
> thoughtful when considering changing this logic".

Yeah I meant to go back and change that.

> 
>> +		 */
>> +		async_pages = atomic_read(&fs_info->async_delalloc_pages);
>> +		if (!async_pages)
>> +			goto skip_async;
>> +
>> +		/*
>> +		 * We don't want to wait forever, if we wrote less pages in this
>> +		 * loop than we have outstanding, only wait for that number of
>> +		 * pages, otherwise we can wait for all async pages to finish
>> +		 * before continuing.
>> +		 */
>> +		if (async_pages > nr_pages)
>> +			async_pages -= nr_pages;
>> +		else
>> +			async_pages = 0;
>> +		wait_event(fs_info->async_submit_wait,
>> +			   atomic_read(&fs_info->async_delalloc_pages) <=
>> +			   async_pages);
> 
> nit: What's funny is that without this patch async_submit_wait actually
> doesn't have a single user, it's only woken up in async_cow_submit but
> nobody actually waited for it. Ideally it should have been removed when
> its last user was removed. ;)
> 

Clearly, but hooray I didn't rip it out because now I can just add this bit. 
Thanks,

Josef

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/4] btrfs: shrink delalloc fixes
  2021-06-14 11:13 ` [PATCH 0/4] btrfs: shrink delalloc fixes Johannes Thumshirn
@ 2021-06-14 18:38   ` Josef Bacik
  0 siblings, 0 replies; 14+ messages in thread
From: Josef Bacik @ 2021-06-14 18:38 UTC (permalink / raw)
  To: Johannes Thumshirn, linux-btrfs, kernel-team

On 6/14/21 7:13 AM, Johannes Thumshirn wrote:
> On 11/06/2021 15:54, Josef Bacik wrote:
>> Hello,
>>
>> I backported the patch to switch us to using sync_inode() to our kernel inside
>> Facebook to fix a deadlock when using the async delalloc shrinker threads.  This
>> uncovered a bunch of problems with how we shrink delalloc, as we use -o
>> compress-force, and thus everything goes through the async compression threads.
>>
>> I ripped out the async pages stuff because originally I had switched us to just
>> writing the whole inode.  This caused a performance regression, and so I
>> switched us to calling sync_inode() twice to handle the async extent case.  The
>> problem is that sync_inode() can skip writing inodes sometimes, and thus we
>> weren't properly waiting on the async helpers.  We really do need to wait for
>> the async delalloc pages to go down before continuing to shrink delalloc.  There
>> was also a race in how we woke up the async delalloc pages waiter which could be
>> problematic.
>>
>> And then finally there is our use of sync_inode().  It tries to be too clever
>> for us, when in reality we want to make sure all pages are under writeback
>> before we come back to the shrinking loop.  I've added a small helper to give us
>> this flexibilty and have switched us to that helper.
>>
>> With these patches, and others that will be sent separately, the early ENOSPC
>> problems we were experiencing have been eliminated.  Thanks,
>>
>> Josef Bacik (4):
>>    btrfs: wait on async extents when flushing delalloc
>>    btrfs: wake up async_delalloc_pages waiters after submit
>>    fs: add a filemap_fdatawrite_wbc helper
>>    btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking
>>
>>   fs/btrfs/inode.c      | 16 ++++++----------
>>   fs/btrfs/space-info.c | 33 +++++++++++++++++++++++++++++++++
>>   include/linux/fs.h    |  2 ++
>>   mm/filemap.c          | 29 ++++++++++++++++++++++++-----
>>   4 files changed, 65 insertions(+), 15 deletions(-)
>>
> 
> I've ran this through xfstests on a zoned device and got the following lockdep
> splat. I'm not sure if this is your fault or my fault. btrfs_reclaim_bgs_work()
> very much sounds like my fault. Writing it here for completeness anyways.
> Otherwise no noticeable deviations from the baseline.
> 
> [  634.999371] BTRFS info (device nullb1): reclaiming chunk 3489660928 with 22% used
> [  634.999445] BTRFS info (device nullb1): relocating block group 3489660928 flags metadata
>                                                                                             
> [  642.127246] ======================================================
> [  642.127876] WARNING: possible circular locking dependency detected
> [  642.128526] 5.13.0-rc5-josef-delalloc #1080 Not tainted
> [  642.129060] ------------------------------------------------------
> [  642.129699] kworker/u4:5/11096 is trying to acquire lock:
> [  642.130259] ffff888278839438 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x28/0x120 [btrfs]
> [  642.131321]
>                 but task is already holding lock:
> [  642.131921] ffff8882bdc45cf8 (btrfs-tree-01#2/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x28/0x120 [btrfs]
> [  642.132926]
>                 which lock already depends on the new lock.
>                                                                                                                                                                                        
> [  642.133767]
>                 the existing dependency chain (in reverse order) is:
> [  642.134529]
>                 -> #2 (btrfs-tree-01#2/1){+.+.}-{3:3}:
> [  642.135181]        down_write_nested+0x23/0x60
> [  642.135647]        __btrfs_tree_lock+0x28/0x120 [btrfs]
> [  642.136238]        btrfs_init_new_buffer+0x7d/0x2a0 [btrfs]
> [  642.136844]        btrfs_alloc_tree_block+0x113/0x320 [btrfs]
> [  642.137460]        alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
> [  642.138115]        __btrfs_cow_block+0x13b/0x5e0 [btrfs]
> [  642.138688]        btrfs_cow_block+0x114/0x220 [btrfs]
> [  642.139296]        btrfs_search_slot+0x587/0x950 [btrfs]
> [  642.139907]        btrfs_lookup_inode+0x2a/0x90 [btrfs]
> [  642.140501]        __btrfs_update_delayed_inode+0x80/0x2d0 [btrfs]
> [  642.141177]        btrfs_async_run_delayed_root+0x174/0x240 [btrfs]
> [  642.141878]        btrfs_work_helper+0xfe/0x530 [btrfs]
> [  642.142454]        process_one_work+0x24f/0x570
> [  642.142931]        worker_thread+0x4f/0x3e0
> [  642.143357]        kthread+0x12c/0x150
> [  642.143748]        ret_from_fork+0x22/0x30
> [  642.144181]
>                 -> #1 (btrfs-tree-01#2){++++}-{3:3}:
> [  642.144807]        down_write_nested+0x23/0x60
> [  642.145260]        __btrfs_tree_lock+0x28/0x120 [btrfs]
> [  642.145839]        btrfs_search_slot+0x259/0x950 [btrfs]
> [  642.146413]        do_relocation+0xf9/0x5d0 [btrfs]
> [  642.146960]        relocate_tree_blocks+0x293/0x610 [btrfs]
> [  642.147579]        relocate_block_group+0x1f2/0x560 [btrfs]
> [  642.148201]        btrfs_relocate_block_group+0x16c/0x320 [btrfs]
> [  642.148869]        btrfs_relocate_chunk+0x38/0x120 [btrfs]
> [  642.149479]        btrfs_reclaim_bgs_work.cold+0x5d/0x10f [btrfs]
> [  642.150162]        process_one_work+0x24f/0x570
> [  642.150628]        worker_thread+0x4f/0x3e0
> [  642.151062]        kthread+0x12c/0x150
> [  642.151450]        ret_from_fork+0x22/0x30
> [  642.151872]
>                 -> #0 (btrfs-treloc-02#2){+.+.}-{3:3}:
> [  642.152518]        __lock_acquire+0x1235/0x2320
> [  642.152985]        lock_acquire+0xab/0x340
> [  642.153409]        down_write_nested+0x23/0x60
> [  642.153872]        __btrfs_tree_lock+0x28/0x120 [btrfs]
> [  642.154455]        btrfs_lock_root_node+0x31/0x40 [btrfs]
> [  642.155064]        btrfs_search_slot+0x6bc/0x950 [btrfs]
> [  642.155639]        replace_path+0x56f/0xa30 [btrfs]
> [  642.156181]        merge_reloc_root+0x258/0x600 [btrfs]
> [  642.156773]        merge_reloc_roots+0xdd/0x210 [btrfs]
> [  642.157368]        relocate_block_group+0x2c9/0x560 [btrfs]
> [  642.157986]        btrfs_relocate_block_group+0x16c/0x320 [btrfs]
> [  642.158660]        btrfs_relocate_chunk+0x38/0x120 [btrfs]
> [  642.159272]        btrfs_reclaim_bgs_work.cold+0x5d/0x10f [btrfs]
> [  642.159951]        process_one_work+0x24f/0x570
> [  642.160418]        worker_thread+0x4f/0x3e0
> [  642.160850]        kthread+0x12c/0x150
> [  642.161243]        ret_from_fork+0x22/0x30
> [  642.161664]
>                 other info that might help us debug this:
> 
> [  642.162480] Chain exists of:
>                   btrfs-treloc-02#2 --> btrfs-tree-01#2 --> btrfs-tree-01#2/1
> 
> [  642.163616]  Possible unsafe locking scenario:
> 
> [  642.164226]        CPU0                    CPU1
> [  642.164699]        ----                    ----
> [  642.165164]   lock(btrfs-tree-01#2/1);
> [  642.165553]                                lock(btrfs-tree-01#2);
> [  642.166183]                                lock(btrfs-tree-01#2/1);
> [  642.166820]   lock(btrfs-treloc-02#2);
> [  642.167209]
>                  *** DEADLOCK ***
> 
> [  642.167812] 6 locks held by kworker/u4:5/11096:
> [  642.168288]  #0: ffff8881000c4938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x1c6/0x570
> [  642.169317]  #1: ffffc900046e3e78 ((work_completion)(&fs_info->reclaim_bgs_work)){+.+.}-{0:0}, at: process_one_work+0x1c6/0x570
> [  642.170507]  #2: ffff888136c6a0a0 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_reclaim_bgs_work+0x5d/0x1b0 [btrfs]
> [  642.171687]  #3: ffff888136c68838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x164/0x320 [btrfs]
> [  642.172877]  #4: ffff88811d11f620 (sb_internal){.+.+}-{0:0}, at: merge_reloc_root+0x178/0x600 [btrfs]
> [  642.173871]  #5: ffff8882bdc45cf8 (btrfs-tree-01#2/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x28/0x120 [btrfs]
> [  642.174906]
> [  642.175362] CPU: 0 PID: 11096 Comm: kworker/u4:5 Not tainted 5.13.0-rc5-josef-delalloc #1080
> [  642.176215] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
> [  642.177366] Workqueue: events_unbound btrfs_reclaim_bgs_work [btrfs]
> [  642.178065] Call Trace:
> [  642.178323]  dump_stack+0x6d/0x89
> [  642.178670]  check_noncircular+0xcf/0xf0
> [  642.179085]  __lock_acquire+0x1235/0x2320
> [  642.179509]  lock_acquire+0xab/0x340
> [  642.179889]  ? __btrfs_tree_lock+0x28/0x120 [btrfs]
> [  642.180442]  ? find_held_lock+0x2b/0x80
> [  642.180843]  ? btrfs_root_node+0x93/0x1d0 [btrfs]
> [  642.181359]  down_write_nested+0x23/0x60
> [  642.181781]  ? __btrfs_tree_lock+0x28/0x120 [btrfs]
> [  642.182327]  __btrfs_tree_lock+0x28/0x120 [btrfs]
> [  642.182862]  btrfs_lock_root_node+0x31/0x40 [btrfs]
> [  642.183411]  btrfs_search_slot+0x6bc/0x950 [btrfs]
> [  642.183940]  ? release_extent_buffer+0x111/0x160 [btrfs]
> [  642.184526]  replace_path+0x56f/0xa30 [btrfs]
> [  642.185019]  merge_reloc_root+0x258/0x600 [btrfs]
> [  642.185557]  merge_reloc_roots+0xdd/0x210 [btrfs]
> [  642.186087]  relocate_block_group+0x2c9/0x560 [btrfs]
> [  642.186658]  btrfs_relocate_block_group+0x16c/0x320 [btrfs]
> [  642.187283]  btrfs_relocate_chunk+0x38/0x120 [btrfs]
> [  642.187839]  btrfs_reclaim_bgs_work.cold+0x5d/0x10f [btrfs]
> [  642.188472]  process_one_work+0x24f/0x570
> [  642.188894]  worker_thread+0x4f/0x3e0
> [  642.189273]  ? process_one_work+0x570/0x570
> [  642.189714]  kthread+0x12c/0x150
> [  642.190053]  ? __kthread_bind_mask+0x60/0x60
> [  642.190492]  ret_from_fork+0x22/0x30
> [  644.827436] BTRFS info (device nullb1): found 3645 extents, stage: move data extents
> [  645.132633] BTRFS info (device nullb1): reclaiming chunk 4026531840 with 22% used
> 

This doesn't seem to be either of our faults, unless you changed the locking in 
replace_path.  The locking order is

reloc root -> real root

I'm actually not sure how this happens looking at the code, we clearly come in 
with the path pointing at the reloc root, and then we by hand walk down the real 
root.  At the point that we unlock the path and search down the real root again 
we shouldn't be holding any more reloc root things.  So this requires more 
investigation, because something fishy is happening.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] fs: add a filemap_fdatawrite_wbc helper
  2021-06-14 10:16   ` Nikolay Borisov
@ 2021-06-15 13:37     ` Josef Bacik
  2021-06-16  4:41       ` Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread
From: Josef Bacik @ 2021-06-15 13:37 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs, kernel-team

On 6/14/21 6:16 AM, Nikolay Borisov wrote:
> 
> 
> On 11.06.21 г. 16:53, Josef Bacik wrote:
>> Btrfs sometimes needs to flush dirty pages on a bunch of dirty inodes in
>> order to reclaim metadata reservations.  Unfortunately most helpers in
>> this area are too smart for us
>>
>> 1) The normal filemap_fdata* helpers only take range and sync modes, and
>>     don't give any indication of how much was written, so we can only
>>     flush full inodes, which isn't what we want in most cases.
>> 2) The normal writeback path requires us to have the s_umount sem held,
>>     but we can't unconditionally take it in this path because we could
>>     deadlock.
>> 3) The normal writeback path also skips inodes with I_SYNC set if we
>>     write with WB_SYNC_NONE.  This isn't the behavior we want under heavy
>>     ENOSPC pressure, we want to actually make sure the pages are under
>>     writeback before returning, and if another thread is in the middle of
>>     writing the file we may return before they're under writeback and
>>     miss our ordered extents and not properly wait for completion.
>> 4) sync_inode() uses the normal writeback path and has the same problem
>>     as #3.
>>
>> What we really want is to call do_writepages() with our wbc.  This way
>> we can make sure that writeback is actually started on the pages, and we
>> can control how many pages are written as a whole as we write many
>> inodes using the same wbc.  Accomplish this with a new helper that does
>> just that so we can use it for our ENOSPC flushing infrastructure.
>>
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>> ---
>>   include/linux/fs.h |  2 ++
>>   mm/filemap.c       | 29 ++++++++++++++++++++++++-----
>>   2 files changed, 26 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index c3c88fdb9b2a..aace07f88b73 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -2886,6 +2886,8 @@ extern int filemap_fdatawrite_range(struct address_space *mapping,
>>   				loff_t start, loff_t end);
>>   extern int filemap_check_errors(struct address_space *mapping);
>>   extern void __filemap_set_wb_err(struct address_space *mapping, int err);
>> +extern int filemap_fdatawrite_wbc(struct address_space *mapping,
>> +				  struct writeback_control *wbc);
>>   
>>   static inline int filemap_write_and_wait(struct address_space *mapping)
>>   {
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 66f7e9fdfbc4..0408bc247e71 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -376,6 +376,29 @@ static int filemap_check_and_keep_errors(struct address_space *mapping)
>>   		return -ENOSPC;
>>   	return 0;
>>   }
>> +/**
>> + * filemap_fdatawrite_wbc - start writeback on mapping dirty pages in range
>> + * @mapping:	address space structure to write
>> + * @wbc:	the writeback_control controlling the writeout
>> + *
>> + * This behaves the same way as __filemap_fdatawrite_range, but simply takes the
> 
> That's not true, because __filemap_fdatawrite_range will only issue
> writeback in case of PAGECACHE_TAG_DIRTY && the inode's bdi having
> BDI_CAP_WRITEBACK. So I think those checks should also be moved to
> fdatawrite_wbc.
> 

Yeah I'll move those into _wbc

> In fact what would be good for readability since we have a bunch of
> __filemap_fdatawrite functions is to have each one call your newly
> introduced helper and have their body simply setup the correct
> writeback_control structure. Alternative right now one has to chase up
> to 3-4 levels of (admittedly very short) functions. I.E
> 
> filemap_fdatawrite->__filemap_fdatawrite->__filemap_fdatawrite_range->filemap_fdatawrite_wbc
> 
> which is somewhat annoying. Instead I propose having
> 
> filemap_fdatawrite->filemap_fdatawrite_wbc
> filemap_flush->filemap_fdatawrite_wbc etc...
> 

Yeah I'd like to clean this up at some point, but that's outside the scope of 
this patch.  I want to get a helper in without needing to run it by everybody, 
we can take up cleaning this up at a later point with input from everybody else. 
  Thanks,

Josef

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] fs: add a filemap_fdatawrite_wbc helper
  2021-06-15 13:37     ` Josef Bacik
@ 2021-06-16  4:41       ` Christoph Hellwig
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2021-06-16  4:41 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Nikolay Borisov, linux-btrfs, kernel-team

On Tue, Jun 15, 2021 at 09:37:33AM -0400, Josef Bacik wrote:
> Yeah I'd like to clean this up at some point, but that's outside the scope
> of this patch.  I want to get a helper in without needing to run it by
> everybody, we can take up cleaning this up at a later point with input from
> everybody else.  Thanks,

But you _do_ need to run a new core writeback function past "everybody".
Just posting it on the btrfs list like for this series is definitely
not enough.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-06-16  4:42 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-11 13:53 [PATCH 0/4] btrfs: shrink delalloc fixes Josef Bacik
2021-06-11 13:53 ` [PATCH 1/4] btrfs: wait on async extents when flushing delalloc Josef Bacik
2021-06-14 10:05   ` Nikolay Borisov
2021-06-14 18:30     ` Josef Bacik
2021-06-11 13:53 ` [PATCH 2/4] btrfs: wake up async_delalloc_pages waiters after submit Josef Bacik
2021-06-14 10:20   ` Nikolay Borisov
2021-06-11 13:53 ` [PATCH 3/4] fs: add a filemap_fdatawrite_wbc helper Josef Bacik
2021-06-14 10:16   ` Nikolay Borisov
2021-06-15 13:37     ` Josef Bacik
2021-06-16  4:41       ` Christoph Hellwig
2021-06-11 13:54 ` [PATCH 4/4] btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking Josef Bacik
2021-06-14 10:57   ` Nikolay Borisov
2021-06-14 11:13 ` [PATCH 0/4] btrfs: shrink delalloc fixes Johannes Thumshirn
2021-06-14 18:38   ` Josef Bacik

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.