All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] xfs: don't preempt writeback on page errors
@ 2018-10-31 14:01 Brian Foster
  2018-10-31 14:01 ` [PATCH 1/2] xfs: add writepage map error tag Brian Foster
  2018-10-31 14:01 ` [PATCH 2/2] xfs: don't preempt writeback sequence on single page wb error Brian Foster
  0 siblings, 2 replies; 7+ messages in thread
From: Brian Foster @ 2018-10-31 14:01 UTC (permalink / raw)
  To: linux-xfs

Hi all,

Here's a stab at a fix for the issue demonstrated by the previously
posted xfstest [1]. This basically just filters out error returns to
write_cache_pages() and returns the first encountered error from the
higher level ->writepage[s]() call. AFAICT the end result should be the
same as today in the event of a persistent writeback error.

I'm not totally sure if this is the best solution, but it's the simplest
that comes to mind. FWIW, this patch changes error return behavior
unconditionally. A slightly modified approach might be to only filter
errors as such on integrity writeback so background writeback will
presumably behave as it always has, while allowing fsync/unmount to
process the address space in one writeback job.

If there's a reason we want to preserve the current writeback behavior
in general, I suppose we could consider more involved options like tag
any inode that preempts writeback due to error and then make sure we
process/scan/clear such inodes of delalloc blocks before we unmount
(and/or on fsync). Either way, we may also want to consider rate
limiting the "page discard" errors if it's more likely we dump a bunch
at once.

Thoughts?

Brian

[1] https://marc.info/?l=linux-xfs&m=154031860122440&w=2

Brian Foster (2):
  xfs: add writepage map error tag
  xfs: don't preempt writeback sequence on single page wb error

 fs/xfs/libxfs/xfs_errortag.h |  4 +++-
 fs/xfs/xfs_aops.c            | 19 ++++++++++++++-----
 fs/xfs/xfs_error.c           |  3 +++
 3 files changed, 20 insertions(+), 6 deletions(-)

-- 
2.17.2

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/2] xfs: add writepage map error tag
  2018-10-31 14:01 [PATCH 0/2] xfs: don't preempt writeback on page errors Brian Foster
@ 2018-10-31 14:01 ` Brian Foster
  2018-10-31 14:01 ` [PATCH 2/2] xfs: don't preempt writeback sequence on single page wb error Brian Foster
  1 sibling, 0 replies; 7+ messages in thread
From: Brian Foster @ 2018-10-31 14:01 UTC (permalink / raw)
  To: linux-xfs

Add an error tag to inject errors in the writeback block mapping
codepath. This facilitates testing of the error path responsible for
discarding delalloc blocks that could not be converted to real
blocks.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_errortag.h | 4 +++-
 fs/xfs/xfs_aops.c            | 6 ++++++
 fs/xfs/xfs_error.c           | 3 +++
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_errortag.h b/fs/xfs/libxfs/xfs_errortag.h
index 66077a105cbb..97c9eaa72dee 100644
--- a/fs/xfs/libxfs/xfs_errortag.h
+++ b/fs/xfs/libxfs/xfs_errortag.h
@@ -54,7 +54,8 @@
 #define XFS_ERRTAG_BUF_LRU_REF				31
 #define XFS_ERRTAG_FORCE_SCRUB_REPAIR			32
 #define XFS_ERRTAG_FORCE_SUMMARY_RECALC			33
-#define XFS_ERRTAG_MAX					34
+#define XFS_ERRTAG_WRITEPAGE_MAP			34
+#define XFS_ERRTAG_MAX					35
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -93,5 +94,6 @@
 #define XFS_RANDOM_BUF_LRU_REF				2
 #define XFS_RANDOM_FORCE_SCRUB_REPAIR			1
 #define XFS_RANDOM_FORCE_SUMMARY_RECALC			1
+#define XFS_RANDOM_WRITEPAGE_MAP			2
 
 #endif /* __XFS_ERRORTAG_H_ */
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 49f5f5896a43..3feae3691467 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -21,6 +21,7 @@
 #include "xfs_bmap_util.h"
 #include "xfs_bmap_btree.h"
 #include "xfs_reflink.h"
+#include "xfs_errortag.h"
 #include <linux/writeback.h>
 
 /*
@@ -718,6 +719,11 @@ xfs_writepage_map(
 		if (iop && !test_bit(i, iop->uptodate))
 			continue;
 
+		if (XFS_TEST_ERROR(false, XFS_I(inode)->i_mount,
+				   XFS_ERRTAG_WRITEPAGE_MAP)) {
+			error = -EIO;
+			break;
+		}
 		error = xfs_map_blocks(wpc, inode, file_offset);
 		if (error)
 			break;
diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index 9866f542e77b..e15ac398b1da 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -51,6 +51,7 @@ static unsigned int xfs_errortag_random_default[] = {
 	XFS_RANDOM_BUF_LRU_REF,
 	XFS_RANDOM_FORCE_SCRUB_REPAIR,
 	XFS_RANDOM_FORCE_SUMMARY_RECALC,
+	XFS_RANDOM_WRITEPAGE_MAP,
 };
 
 struct xfs_errortag_attr {
@@ -159,6 +160,7 @@ XFS_ERRORTAG_ATTR_RW(log_item_pin,	XFS_ERRTAG_LOG_ITEM_PIN);
 XFS_ERRORTAG_ATTR_RW(buf_lru_ref,	XFS_ERRTAG_BUF_LRU_REF);
 XFS_ERRORTAG_ATTR_RW(force_repair,	XFS_ERRTAG_FORCE_SCRUB_REPAIR);
 XFS_ERRORTAG_ATTR_RW(bad_summary,	XFS_ERRTAG_FORCE_SUMMARY_RECALC);
+XFS_ERRORTAG_ATTR_RW(writepage_map,	XFS_ERRTAG_WRITEPAGE_MAP);
 
 static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(noerror),
@@ -195,6 +197,7 @@ static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(buf_lru_ref),
 	XFS_ERRORTAG_ATTR_LIST(force_repair),
 	XFS_ERRORTAG_ATTR_LIST(bad_summary),
+	XFS_ERRORTAG_ATTR_LIST(writepage_map),
 	NULL,
 };
 
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 2/2] xfs: don't preempt writeback sequence on single page wb error
  2018-10-31 14:01 [PATCH 0/2] xfs: don't preempt writeback on page errors Brian Foster
  2018-10-31 14:01 ` [PATCH 1/2] xfs: add writepage map error tag Brian Foster
@ 2018-10-31 14:01 ` Brian Foster
  2018-10-31 23:02   ` Dave Chinner
  1 sibling, 1 reply; 7+ messages in thread
From: Brian Foster @ 2018-10-31 14:01 UTC (permalink / raw)
  To: linux-xfs

xfs_do_writepage() currently returns errors directly regardless of
whether it is called via ->writepages() or ->writepage(). In the
case of ->writepages(), an xfs_do_writepage() error return breaks
the current writeback sequence in write_cache_pages(). This means
that an integrity writeback (i.e., sync), for example, returns
before all associated pages have been processed.

This can be problematic in cases like unmount. If the writeback
doesn't process all delalloc pages before unmounting, we end up
reclaiming inodes with non-zero delalloc block counts. In turn, this
breaks block accounting and leaves the fs inconsistent.

XFS explicitly discards delalloc blocks on such writepage failures
to avoid this problem. This isn't terribly useful if we allow an
integrity writeback to complete (and thus a filesystem to unmount)
without addressing the entire set of dirty pages on an inode.
Therefore, change ->writepage[s]() to track high level error state
in the xfs_writepage_ctx structure and return it from the higher
level operation callout rather than xfs_do_writepage(). This ensures
that write_cache_pages() does not exit prematurely when called via
->writepages(), but both ->writepage() and ->writepages() still
ultimately return an error for the higher level operation.

This patch introduces a subtle change in the behavior of background
writeback in the event of persistent errors. The current behavior of
returning an error preempts the background writeback. Writeback
eventually comes around again and repeats the process for a few more
pages (in practice) before it once again fails. This repeats over
and over until the entire set of dirty pages is cleaned. This
behavior results in a somewhat slower stream of "page discard"
errors in the system log and dictates that many repeated fsync calls
may be required before the entire data set is processed and mapping
error consumed. With this change in place, background writeback
executes on as many pages as necessary as if each page writeback
were successful. The pages are cleaned immediately and significantly
more page discard errors can be observed at once.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_aops.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3feae3691467..438cfc66a40e 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -32,6 +32,7 @@ struct xfs_writepage_ctx {
 	unsigned int		io_type;
 	unsigned int		cow_seq;
 	struct xfs_ioend	*ioend;
+	int			error;
 };
 
 struct block_device *
@@ -798,7 +799,9 @@ xfs_writepage_map(
 		end_page_writeback(page);
 done:
 	mapping_set_error(page->mapping, error);
-	return error;
+	if (!wpc->error)
+		wpc->error = error;
+	return 0;
 }
 
 /*
@@ -929,8 +932,8 @@ xfs_vm_writepage(
 
 	ret = xfs_do_writepage(page, wbc, &wpc);
 	if (wpc.ioend)
-		ret = xfs_submit_ioend(wbc, wpc.ioend, ret);
-	return ret;
+		ret = xfs_submit_ioend(wbc, wpc.ioend, wpc.error);
+	return ret ? ret : wpc.error;
 }
 
 STATIC int
@@ -946,8 +949,8 @@ xfs_vm_writepages(
 	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
 	ret = write_cache_pages(mapping, wbc, xfs_do_writepage, &wpc);
 	if (wpc.ioend)
-		ret = xfs_submit_ioend(wbc, wpc.ioend, ret);
-	return ret;
+		ret = xfs_submit_ioend(wbc, wpc.ioend, wpc.error);
+	return ret ? ret : wpc.error;
 }
 
 STATIC int
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] xfs: don't preempt writeback sequence on single page wb error
  2018-10-31 14:01 ` [PATCH 2/2] xfs: don't preempt writeback sequence on single page wb error Brian Foster
@ 2018-10-31 23:02   ` Dave Chinner
  2018-11-01 14:17     ` Brian Foster
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2018-10-31 23:02 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Oct 31, 2018 at 10:01:55AM -0400, Brian Foster wrote:
> xfs_do_writepage() currently returns errors directly regardless of
> whether it is called via ->writepages() or ->writepage(). In the
> case of ->writepages(), an xfs_do_writepage() error return breaks
> the current writeback sequence in write_cache_pages(). This means
> that an integrity writeback (i.e., sync), for example, returns
> before all associated pages have been processed.

That sounds like a bug in write_cache_pages(). It sends pages
one at a time to the writepage context, and it is supposed to
iterate the entire range on a data sync operation. If you look at
the code, is clearly stopping at the first error.

IOWs, every filesystem that uses write_cache_pages() for data
integrity writes is broken in the same way.

And a quick look as fs specific writepages implementations indicates
that ext4_writepages() and ibtrfs's extent_write_cache_pages() have
the same issue.

> This can be problematic in cases like unmount. If the writeback
> doesn't process all delalloc pages before unmounting, we end up
> reclaiming inodes with non-zero delalloc block counts. In turn, this
> breaks block accounting and leaves the fs inconsistent.

XFS is probably the only filesystem that leaves detectable state
around and then detects and reports it....

> XFS explicitly discards delalloc blocks on such writepage failures
> to avoid this problem. This isn't terribly useful if we allow an
> integrity writeback to complete (and thus a filesystem to unmount)
> without addressing the entire set of dirty pages on an inode.
> Therefore, change ->writepage[s]() to track high level error state
> in the xfs_writepage_ctx structure and return it from the higher
> level operation callout rather than xfs_do_writepage(). This ensures
> that write_cache_pages() does not exit prematurely when called via
> ->writepages(), but both ->writepage() and ->writepages() still
> ultimately return an error for the higher level operation.
> 
> This patch introduces a subtle change in the behavior of background
> writeback in the event of persistent errors. The current behavior of
> returning an error preempts the background writeback. Writeback
> eventually comes around again and repeats the process for a few more
> pages (in practice) before it once again fails. This repeats over
> and over until the entire set of dirty pages is cleaned. This
> behavior results in a somewhat slower stream of "page discard"
> errors in the system log and dictates that many repeated fsync calls
> may be required before the entire data set is processed and mapping
> error consumed. With this change in place, background writeback
> executes on as many pages as necessary as if each page writeback
> were successful. The pages are cleaned immediately and significantly
> more page discard errors can be observed at once.

Yeah, this is a good change in behaviour, but I think the
implementation is wrong. write_cache_pages() needs to continue
iterating the range if WB_SYNC_ALL is set even when errors occur.

i.e. the error state should be maintained by write_cache_pages and
returned on completion, not require the filesystem to hide errors
from wcp in it's own specific writepage structure...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] xfs: don't preempt writeback sequence on single page wb error
  2018-10-31 23:02   ` Dave Chinner
@ 2018-11-01 14:17     ` Brian Foster
  2018-11-01 21:37       ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Brian Foster @ 2018-11-01 14:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Nov 01, 2018 at 10:02:25AM +1100, Dave Chinner wrote:
> On Wed, Oct 31, 2018 at 10:01:55AM -0400, Brian Foster wrote:
> > xfs_do_writepage() currently returns errors directly regardless of
> > whether it is called via ->writepages() or ->writepage(). In the
> > case of ->writepages(), an xfs_do_writepage() error return breaks
> > the current writeback sequence in write_cache_pages(). This means
> > that an integrity writeback (i.e., sync), for example, returns
> > before all associated pages have been processed.
> 
> That sounds like a bug in write_cache_pages(). It sends pages
> one at a time to the writepage context, and it is supposed to
> iterate the entire range on a data sync operation. If you look at
> the code, is clearly stopping at the first error.
> 
> IOWs, every filesystem that uses write_cache_pages() for data
> integrity writes is broken in the same way.
> 
> And a quick look as fs specific writepages implementations indicates
> that ext4_writepages() and ibtrfs's extent_write_cache_pages() have
> the same issue.
> 

I veered away from this approach initially because I didn't want to risk
breaking behavior of other filesystems. I can revisit it if more
appropriate and is broadly safe.

> > This can be problematic in cases like unmount. If the writeback
> > doesn't process all delalloc pages before unmounting, we end up
> > reclaiming inodes with non-zero delalloc block counts. In turn, this
> > breaks block accounting and leaves the fs inconsistent.
> 
> XFS is probably the only filesystem that leaves detectable state
> around and then detects and reports it....
> 
> > XFS explicitly discards delalloc blocks on such writepage failures
> > to avoid this problem. This isn't terribly useful if we allow an
> > integrity writeback to complete (and thus a filesystem to unmount)
> > without addressing the entire set of dirty pages on an inode.
> > Therefore, change ->writepage[s]() to track high level error state
> > in the xfs_writepage_ctx structure and return it from the higher
> > level operation callout rather than xfs_do_writepage(). This ensures
> > that write_cache_pages() does not exit prematurely when called via
> > ->writepages(), but both ->writepage() and ->writepages() still
> > ultimately return an error for the higher level operation.
> > 
> > This patch introduces a subtle change in the behavior of background
> > writeback in the event of persistent errors. The current behavior of
> > returning an error preempts the background writeback. Writeback
> > eventually comes around again and repeats the process for a few more
> > pages (in practice) before it once again fails. This repeats over
> > and over until the entire set of dirty pages is cleaned. This
> > behavior results in a somewhat slower stream of "page discard"
> > errors in the system log and dictates that many repeated fsync calls
> > may be required before the entire data set is processed and mapping
> > error consumed. With this change in place, background writeback
> > executes on as many pages as necessary as if each page writeback
> > were successful. The pages are cleaned immediately and significantly
> > more page discard errors can be observed at once.
> 
> Yeah, this is a good change in behaviour, but I think the
> implementation is wrong. write_cache_pages() needs to continue
> iterating the range if WB_SYNC_ALL is set even when errors occur.
> 

The other thing that comes to mind is that any time we have this kind of
iterative/callback interface, it's usually appropriate to have some kind
of means for the callback to signal the caller to stop processing. You
could argue that right now write_cache_pages() works as expected,
returning an error is that stop mechanism and XFS has the bug as it is
the only filesystem that depends on completely processing integrity wbs
for fs correctness (or we're all just using it wrong ;P, I haven't
investigated enough yet to say one way or the other).

OTOH, we could maintain the ability to interrupt the scan with a magic
error return or some such, that just may require a bit more care and
attention across the set of existing callers. Thoughts?

Brian

> i.e. the error state should be maintained by write_cache_pages and
> returned on completion, not require the filesystem to hide errors
> from wcp in it's own specific writepage structure...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] xfs: don't preempt writeback sequence on single page wb error
  2018-11-01 14:17     ` Brian Foster
@ 2018-11-01 21:37       ` Dave Chinner
  2018-11-02 11:42         ` Brian Foster
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2018-11-01 21:37 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Nov 01, 2018 at 10:17:26AM -0400, Brian Foster wrote:
> On Thu, Nov 01, 2018 at 10:02:25AM +1100, Dave Chinner wrote:
> > On Wed, Oct 31, 2018 at 10:01:55AM -0400, Brian Foster wrote:
> > > xfs_do_writepage() currently returns errors directly regardless of
> > > whether it is called via ->writepages() or ->writepage(). In the
> > > case of ->writepages(), an xfs_do_writepage() error return breaks
> > > the current writeback sequence in write_cache_pages(). This means
> > > that an integrity writeback (i.e., sync), for example, returns
> > > before all associated pages have been processed.
> > 
> > That sounds like a bug in write_cache_pages(). It sends pages
> > one at a time to the writepage context, and it is supposed to
> > iterate the entire range on a data sync operation. If you look at
> > the code, is clearly stopping at the first error.
> > 
> > IOWs, every filesystem that uses write_cache_pages() for data
> > integrity writes is broken in the same way.
> > 
> > And a quick look as fs specific writepages implementations indicates
> > that ext4_writepages() and ibtrfs's extent_write_cache_pages() have
> > the same issue.
> > 
> 
> I veered away from this approach initially because I didn't want to risk
> breaking behavior of other filesystems. I can revisit it if more
> appropriate and is broadly safe.

We should fix problems in the generic code if that's where the
problem lies. You can ignore filesystems with their own custom code,
though, because you are more likely to break them that generic
code...

> > > This can be problematic in cases like unmount. If the writeback
> > > doesn't process all delalloc pages before unmounting, we end up
> > > reclaiming inodes with non-zero delalloc block counts. In turn, this
> > > breaks block accounting and leaves the fs inconsistent.
> > 
> > XFS is probably the only filesystem that leaves detectable state
> > around and then detects and reports it....
> > 
> > > XFS explicitly discards delalloc blocks on such writepage failures
> > > to avoid this problem. This isn't terribly useful if we allow an
> > > integrity writeback to complete (and thus a filesystem to unmount)
> > > without addressing the entire set of dirty pages on an inode.
> > > Therefore, change ->writepage[s]() to track high level error state
> > > in the xfs_writepage_ctx structure and return it from the higher
> > > level operation callout rather than xfs_do_writepage(). This ensures
> > > that write_cache_pages() does not exit prematurely when called via
> > > ->writepages(), but both ->writepage() and ->writepages() still
> > > ultimately return an error for the higher level operation.
> > > 
> > > This patch introduces a subtle change in the behavior of background
> > > writeback in the event of persistent errors. The current behavior of
> > > returning an error preempts the background writeback. Writeback
> > > eventually comes around again and repeats the process for a few more
> > > pages (in practice) before it once again fails. This repeats over
> > > and over until the entire set of dirty pages is cleaned. This
> > > behavior results in a somewhat slower stream of "page discard"
> > > errors in the system log and dictates that many repeated fsync calls
> > > may be required before the entire data set is processed and mapping
> > > error consumed. With this change in place, background writeback
> > > executes on as many pages as necessary as if each page writeback
> > > were successful. The pages are cleaned immediately and significantly
> > > more page discard errors can be observed at once.
> > 
> > Yeah, this is a good change in behaviour, but I think the
> > implementation is wrong. write_cache_pages() needs to continue
> > iterating the range if WB_SYNC_ALL is set even when errors occur.
> > 
> 
> The other thing that comes to mind is that any time we have this kind of
> iterative/callback interface, it's usually appropriate to have some kind
> of means for the callback to signal the caller to stop processing. You
> could argue that right now write_cache_pages() works as expected,
> returning an error is that stop mechanism and XFS has the bug as it is
> the only filesystem that depends on completely processing integrity wbs
> for fs correctness (or we're all just using it wrong ;P, I haven't
> investigated enough yet to say one way or the other).

The comment on wcp error handling is enlightening:

		ret = (*writepage)(page, wbc, data);
		if (unlikely(ret)) {
			if (ret == AOP_WRITEPAGE_ACTIVATE) {
				unlock_page(page);
				ret = 0;
			} else {
				/*
				 * done_index is set past this page,
				 * so media errors will not choke
				 * background writeout for the entire
				 * file. This has consequences for
				 * range_cyclic semantics (ie. it may
				 * not be suitable for data integrity
				 * writeout).
				 */
				done_index = page->index + 1;
				done = 1;
				break;
			}
		}

"ie it may not be suitable for data integrity writeout"

In a nutshell. The current error handling behaviour is not suitable
for data integrity writeout, as your proposed patch has demonstrated.


> OTOH, we could maintain the ability to interrupt the scan with a magic
> error return or some such, that just may require a bit more care and
> attention across the set of existing callers. Thoughts?

I'm not sure why we'd want "stop on error" for data integrity
writeback given the problems you've pointed out.  For WB_SYNC_NONE
writeback, the current behaviour (stop on error) is fine, so i don't
see a big need to make unnecessary API changes here.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] xfs: don't preempt writeback sequence on single page wb error
  2018-11-01 21:37       ` Dave Chinner
@ 2018-11-02 11:42         ` Brian Foster
  0 siblings, 0 replies; 7+ messages in thread
From: Brian Foster @ 2018-11-02 11:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Nov 02, 2018 at 08:37:40AM +1100, Dave Chinner wrote:
> On Thu, Nov 01, 2018 at 10:17:26AM -0400, Brian Foster wrote:
> > On Thu, Nov 01, 2018 at 10:02:25AM +1100, Dave Chinner wrote:
> > > On Wed, Oct 31, 2018 at 10:01:55AM -0400, Brian Foster wrote:
> > > > xfs_do_writepage() currently returns errors directly regardless of
> > > > whether it is called via ->writepages() or ->writepage(). In the
> > > > case of ->writepages(), an xfs_do_writepage() error return breaks
> > > > the current writeback sequence in write_cache_pages(). This means
> > > > that an integrity writeback (i.e., sync), for example, returns
> > > > before all associated pages have been processed.
> > > 
> > > That sounds like a bug in write_cache_pages(). It sends pages
> > > one at a time to the writepage context, and it is supposed to
> > > iterate the entire range on a data sync operation. If you look at
> > > the code, is clearly stopping at the first error.
> > > 
> > > IOWs, every filesystem that uses write_cache_pages() for data
> > > integrity writes is broken in the same way.
> > > 
> > > And a quick look as fs specific writepages implementations indicates
> > > that ext4_writepages() and ibtrfs's extent_write_cache_pages() have
> > > the same issue.
> > > 
> > 
> > I veered away from this approach initially because I didn't want to risk
> > breaking behavior of other filesystems. I can revisit it if more
> > appropriate and is broadly safe.
> 
> We should fix problems in the generic code if that's where the
> problem lies. You can ignore filesystems with their own custom code,
> though, because you are more likely to break them that generic
> code...
> 
> > > > This can be problematic in cases like unmount. If the writeback
> > > > doesn't process all delalloc pages before unmounting, we end up
> > > > reclaiming inodes with non-zero delalloc block counts. In turn, this
> > > > breaks block accounting and leaves the fs inconsistent.
> > > 
> > > XFS is probably the only filesystem that leaves detectable state
> > > around and then detects and reports it....
> > > 
> > > > XFS explicitly discards delalloc blocks on such writepage failures
> > > > to avoid this problem. This isn't terribly useful if we allow an
> > > > integrity writeback to complete (and thus a filesystem to unmount)
> > > > without addressing the entire set of dirty pages on an inode.
> > > > Therefore, change ->writepage[s]() to track high level error state
> > > > in the xfs_writepage_ctx structure and return it from the higher
> > > > level operation callout rather than xfs_do_writepage(). This ensures
> > > > that write_cache_pages() does not exit prematurely when called via
> > > > ->writepages(), but both ->writepage() and ->writepages() still
> > > > ultimately return an error for the higher level operation.
> > > > 
> > > > This patch introduces a subtle change in the behavior of background
> > > > writeback in the event of persistent errors. The current behavior of
> > > > returning an error preempts the background writeback. Writeback
> > > > eventually comes around again and repeats the process for a few more
> > > > pages (in practice) before it once again fails. This repeats over
> > > > and over until the entire set of dirty pages is cleaned. This
> > > > behavior results in a somewhat slower stream of "page discard"
> > > > errors in the system log and dictates that many repeated fsync calls
> > > > may be required before the entire data set is processed and mapping
> > > > error consumed. With this change in place, background writeback
> > > > executes on as many pages as necessary as if each page writeback
> > > > were successful. The pages are cleaned immediately and significantly
> > > > more page discard errors can be observed at once.
> > > 
> > > Yeah, this is a good change in behaviour, but I think the
> > > implementation is wrong. write_cache_pages() needs to continue
> > > iterating the range if WB_SYNC_ALL is set even when errors occur.
> > > 
> > 
> > The other thing that comes to mind is that any time we have this kind of
> > iterative/callback interface, it's usually appropriate to have some kind
> > of means for the callback to signal the caller to stop processing. You
> > could argue that right now write_cache_pages() works as expected,
> > returning an error is that stop mechanism and XFS has the bug as it is
> > the only filesystem that depends on completely processing integrity wbs
> > for fs correctness (or we're all just using it wrong ;P, I haven't
> > investigated enough yet to say one way or the other).
> 
> The comment on wcp error handling is enlightening:
> 
> 		ret = (*writepage)(page, wbc, data);
> 		if (unlikely(ret)) {
> 			if (ret == AOP_WRITEPAGE_ACTIVATE) {
> 				unlock_page(page);
> 				ret = 0;
> 			} else {
> 				/*
> 				 * done_index is set past this page,
> 				 * so media errors will not choke
> 				 * background writeout for the entire
> 				 * file. This has consequences for
> 				 * range_cyclic semantics (ie. it may
> 				 * not be suitable for data integrity
> 				 * writeout).
> 				 */
> 				done_index = page->index + 1;
> 				done = 1;
> 				break;
> 			}
> 		}
> 
> "ie it may not be suitable for data integrity writeout"
> 

The way I read that comment is that the done_index bump may not be
suitable for integrity writeback (re: the tradeoff being to not "choke"
bg writeback), but the wording is not totally clear.

> In a nutshell. The current error handling behaviour is not suitable
> for data integrity writeout, as your proposed patch has demonstrated.
> 
> 
> > OTOH, we could maintain the ability to interrupt the scan with a magic
> > error return or some such, that just may require a bit more care and
> > attention across the set of existing callers. Thoughts?
> 
> I'm not sure why we'd want "stop on error" for data integrity
> writeback given the problems you've pointed out.  For WB_SYNC_NONE
> writeback, the current behaviour (stop on error) is fine, so i don't
> see a big need to make unnecessary API changes here.
> 

Neither am I, for XFS at least, I just wasn't sure if other filesystems
depended on or expected this behavior in any way. I'll take a more
thorough look, run some broader testing and send a write_cache_pages()
patch to a wider audience.

BTW, it looks like we already have this AOP_WRITEPAGE_
positive_aop_returns enum for special return codes the callback can use
to control the caller. I suppose we could add a new one if some fs does
have a particular need to forcibly interrupt a writeback.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2018-11-02 20:49 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-31 14:01 [PATCH 0/2] xfs: don't preempt writeback on page errors Brian Foster
2018-10-31 14:01 ` [PATCH 1/2] xfs: add writepage map error tag Brian Foster
2018-10-31 14:01 ` [PATCH 2/2] xfs: don't preempt writeback sequence on single page wb error Brian Foster
2018-10-31 23:02   ` Dave Chinner
2018-11-01 14:17     ` Brian Foster
2018-11-01 21:37       ` Dave Chinner
2018-11-02 11:42         ` Brian Foster

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.