linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business
@ 2017-03-05 13:35 Jeff Layton
  2017-03-05 13:35 ` [PATCH 1/3] nilfs2: set the mapping error when calling SetPageError on writeback Jeff Layton
                   ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: Jeff Layton @ 2017-03-05 13:35 UTC (permalink / raw)
  To: viro, konishi.ryusuke; +Cc: linux-fsdevel, linux-mm, linux-kernel, linux-nilfs

I recently did some work to wire up -ENOSPC handling in ceph, and found
I could get back -EIO errors in some cases when I should have instead
gotten -ENOSPC. The problem was that the ceph writeback code would set
PG_error on a writeback error, and that error would clobber the mapping
error.

While I fixed that problem by simply not setting that bit on errors,
that led me down a rabbit hole of looking at how PG_error is being
handled in the kernel.

This patch series is a few fixes for things that I noticed purely by
inspection. I don't have a great way to test these since they involve
error handling, but if they look plausible up front I can certainly
doctor up a kernel to inject errors in this code and test by hand.
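
To make the failure mode concrete, here is a small standalone userspace
model of the two error channels (the names mirror the kernel helpers, but
this is an illustrative sketch, not the kernel code):

```c
#include <assert.h>
#include <errno.h>

/* Standalone stand-ins for the two mapping error flags (mirroring
 * AS_EIO/AS_ENOSPC); illustrative userspace code only. */
#define AS_EIO    (1u << 0)
#define AS_ENOSPC (1u << 1)

struct address_space { unsigned int flags; };

/* Modeled on mapping_set_error(): record the error in the mapping */
static void mapping_set_error(struct address_space *m, int error)
{
	if (error == -ENOSPC)
		m->flags |= AS_ENOSPC;
	else if (error)
		m->flags |= AS_EIO;
}

/* Modeled on filemap_check_errors(): test-and-clear the stored error */
static int filemap_check_errors(struct address_space *m)
{
	int ret = 0;

	if (m->flags & AS_ENOSPC) {
		m->flags &= ~AS_ENOSPC;
		ret = -ENOSPC;
	}
	if (m->flags & AS_EIO) {
		m->flags &= ~AS_EIO;
		ret = -EIO;
	}
	return ret;
}

/*
 * What the pre-patch fdatawait path effectively does: a PG_error hit
 * produces a blanket -EIO that takes precedence over the (more useful)
 * error recorded in the mapping.
 */
static int old_fdatawait_result(struct address_space *m)
{
	int ret = -EIO;				/* from TestClearPageError */
	int ret2 = filemap_check_errors(m);	/* -ENOSPC, discarded */

	if (!ret)
		ret = ret2;
	return ret;
}
```

With -ENOSPC recorded in the mapping, old_fdatawait_result() still
returns -EIO: the blanket PG_error result wins, and the more specific
error is consumed and thrown away.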

Jeff Layton (3):
  nilfs2: set the mapping error when calling SetPageError on writeback
  mm: don't TestClearPageError in __filemap_fdatawait_range
  mm: set mapping error when launder_pages fails

 fs/nilfs2/segment.c |  1 +
 mm/filemap.c        | 19 ++++---------------
 mm/truncate.c       |  6 +++++-
 3 files changed, 10 insertions(+), 16 deletions(-)

-- 
2.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/3] nilfs2: set the mapping error when calling SetPageError on writeback
  2017-03-05 13:35 [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business Jeff Layton
@ 2017-03-05 13:35 ` Jeff Layton
  2017-03-07 13:46   ` Ryusuke Konishi
  2017-03-05 13:35 ` [PATCH 2/3] mm: don't TestClearPageError in __filemap_fdatawait_range Jeff Layton
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 25+ messages in thread
From: Jeff Layton @ 2017-03-05 13:35 UTC (permalink / raw)
  To: viro, konishi.ryusuke; +Cc: linux-fsdevel, linux-mm, linux-kernel, linux-nilfs

In a later patch, we're going to want to make the fsync codepath not do
a TestClearPageError call as that can override the error set in the
address space. To do that though, we need to ensure that filesystems
that are relying on the PG_error bit for reporting writeback errors
also set an error in the address space.

The only place I've found that looks potentially problematic is this
spot in nilfs2. Ensure that it sets an error in the mapping in addition
to setting PageError.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/nilfs2/segment.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index bedcae2c28e6..c1041b07060e 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1743,6 +1743,7 @@ static void nilfs_end_page_io(struct page *page, int err)
 	} else {
 		__set_page_dirty_nobuffers(page);
 		SetPageError(page);
+		mapping_set_error(page_mapping(page), err);
 	}
 
 	end_page_writeback(page);
-- 
2.9.3


* [PATCH 2/3] mm: don't TestClearPageError in __filemap_fdatawait_range
  2017-03-05 13:35 [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business Jeff Layton
  2017-03-05 13:35 ` [PATCH 1/3] nilfs2: set the mapping error when calling SetPageError on writeback Jeff Layton
@ 2017-03-05 13:35 ` Jeff Layton
  2017-03-05 13:35 ` [PATCH 3/3] mm: set mapping error when launder_pages fails Jeff Layton
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 25+ messages in thread
From: Jeff Layton @ 2017-03-05 13:35 UTC (permalink / raw)
  To: viro, konishi.ryusuke; +Cc: linux-fsdevel, linux-mm, linux-kernel, linux-nilfs

The -EIO returned here can end up overriding whatever error is marked in
the address space. This means that an -ENOSPC error (AS_ENOSPC) can end
up being turned into -EIO if a page gets PG_error set on it during error
handling. Arguably, that's a bug in the writeback code, but...

Read errors are also tracked on a per-page basis using PG_error. Suppose
we have a read error on a page, and then that page is subsequently
dirtied by overwriting the whole page. Writeback doesn't clear PG_error,
so we can then end up successfully writing back that page and still
return -EIO on fsync.

Since the handling of this bit is somewhat inconsistent across
subsystems, let's just rely on marking the address space when there
are writeback errors.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 mm/filemap.c | 19 ++++---------------
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 3f9afded581b..2b0b4ff4668b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -375,17 +375,16 @@ int filemap_flush(struct address_space *mapping)
 }
 EXPORT_SYMBOL(filemap_flush);
 
-static int __filemap_fdatawait_range(struct address_space *mapping,
+static void __filemap_fdatawait_range(struct address_space *mapping,
 				     loff_t start_byte, loff_t end_byte)
 {
 	pgoff_t index = start_byte >> PAGE_SHIFT;
 	pgoff_t end = end_byte >> PAGE_SHIFT;
 	struct pagevec pvec;
 	int nr_pages;
-	int ret = 0;
 
 	if (end_byte < start_byte)
-		goto out;
+		return;
 
 	pagevec_init(&pvec, 0);
 	while ((index <= end) &&
@@ -402,14 +401,10 @@ static int __filemap_fdatawait_range(struct address_space *mapping,
 				continue;
 
 			wait_on_page_writeback(page);
-			if (TestClearPageError(page))
-				ret = -EIO;
 		}
 		pagevec_release(&pvec);
 		cond_resched();
 	}
-out:
-	return ret;
 }
 
 /**
@@ -429,14 +424,8 @@ static int __filemap_fdatawait_range(struct address_space *mapping,
 int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
 			    loff_t end_byte)
 {
-	int ret, ret2;
-
-	ret = __filemap_fdatawait_range(mapping, start_byte, end_byte);
-	ret2 = filemap_check_errors(mapping);
-	if (!ret)
-		ret = ret2;
-
-	return ret;
+	__filemap_fdatawait_range(mapping, start_byte, end_byte);
+	return filemap_check_errors(mapping);
 }
 EXPORT_SYMBOL(filemap_fdatawait_range);
 
-- 
2.9.3


* [PATCH 3/3] mm: set mapping error when launder_pages fails
  2017-03-05 13:35 [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business Jeff Layton
  2017-03-05 13:35 ` [PATCH 1/3] nilfs2: set the mapping error when calling SetPageError on writeback Jeff Layton
  2017-03-05 13:35 ` [PATCH 2/3] mm: don't TestClearPageError in __filemap_fdatawait_range Jeff Layton
@ 2017-03-05 13:35 ` Jeff Layton
  2017-03-05 14:40 ` [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business Jeff Layton
  2017-03-06  3:06 ` [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business NeilBrown
  4 siblings, 0 replies; 25+ messages in thread
From: Jeff Layton @ 2017-03-05 13:35 UTC (permalink / raw)
  To: viro, konishi.ryusuke; +Cc: linux-fsdevel, linux-mm, linux-kernel, linux-nilfs

If launder_page fails, then we hit a problem writing back some inode
data. Ensure that we communicate that fact in a subsequent fsync since
another task could still have it open for write.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 mm/truncate.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index dd7b24e083c5..49ad4e2a6cb6 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -593,11 +593,15 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 
 static int do_launder_page(struct address_space *mapping, struct page *page)
 {
+	int ret;
+
 	if (!PageDirty(page))
 		return 0;
 	if (page->mapping != mapping || mapping->a_ops->launder_page == NULL)
 		return 0;
-	return mapping->a_ops->launder_page(page);
+	ret = mapping->a_ops->launder_page(page);
+	mapping_set_error(mapping, ret);
+	return ret;
 }
 
 /**
-- 
2.9.3


* Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business
  2017-03-05 13:35 [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business Jeff Layton
                   ` (2 preceding siblings ...)
  2017-03-05 13:35 ` [PATCH 3/3] mm: set mapping error when launder_pages fails Jeff Layton
@ 2017-03-05 14:40 ` Jeff Layton
  2017-03-06 23:08   ` Ross Zwisler
  2017-03-06  3:06 ` [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business NeilBrown
  4 siblings, 1 reply; 25+ messages in thread
From: Jeff Layton @ 2017-03-05 14:40 UTC (permalink / raw)
  To: viro, konishi.ryusuke
  Cc: linux-fsdevel, linux-mm, linux-kernel, linux-nilfs, NeilBrown,
	Ross Zwisler

On Sun, 2017-03-05 at 08:35 -0500, Jeff Layton wrote:
> I recently did some work to wire up -ENOSPC handling in ceph, and found
> I could get back -EIO errors in some cases when I should have instead
> gotten -ENOSPC. The problem was that the ceph writeback code would set
> PG_error on a writeback error, and that error would clobber the mapping
> error.
> 

I should also note that relying on PG_error to report writeback errors
is inherently unreliable as well. If someone calls sync() before your
fsync gets in there, then you'll likely lose it anyway.

filemap_fdatawait_keep_errors will preserve the error in the mapping,
but not the individual PG_error flags, so I think we do want to ensure
that the mapping error is set when there is a writeback error, and not
rely on the PG_error bit for that.
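
(For reference, filemap_fdatawait_keep_errors() just waits, leaving the
stored flags for the caller; roughly this shape in v4.10, paraphrased:)

```c
void filemap_fdatawait_keep_errors(struct address_space *mapping)
{
	loff_t i_size = i_size_read(mapping->host);

	if (i_size == 0)
		return;

	/* wait for writeback, but leave AS_EIO/AS_ENOSPC for the caller */
	__filemap_fdatawait_range(mapping, 0, i_size - 1);
}
```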

> While I fixed that problem by simply not setting that bit on errors,
> that led me down a rabbit hole of looking at how PG_error is being
> handled in the kernel.
> 
> This patch series is a few fixes for things that I noticed purely by
> inspection. I don't have a great way to test these since they involve
> error handling, but if they look plausible up front I can certainly
> doctor up a kernel to inject errors in this code and test by hand.
> 
> Jeff Layton (3):
>   nilfs2: set the mapping error when calling SetPageError on writeback
>   mm: don't TestClearPageError in __filemap_fdatawait_range
>   mm: set mapping error when launder_pages fails
> 
>  fs/nilfs2/segment.c |  1 +
>  mm/filemap.c        | 19 ++++---------------
>  mm/truncate.c       |  6 +++++-
>  3 files changed, 10 insertions(+), 16 deletions(-)
> 

(cc'ing Ross...)

Just when I thought that only NILFS2 needed a little work here, I see
another spot...

I think we also need to fix dax_writeback_mapping_range to set a
mapping error on writeback failure as well. It looks like that's not
happening today. Something like the patch below (obviously untested).

I'll also plan to follow up with a patch to vfs.txt to outline how
writeback errors should be handled by filesystems, assuming that this
patchset isn't completely off base.

-------------------8<-----------------------

[PATCH] dax: set error in mapping when writeback fails

In order to get proper error codes from fsync, we must set an error in
the mapping range when writeback fails.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/dax.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index c45598b912e1..9005d90deeda 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -888,8 +888,10 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 
 			ret = dax_writeback_one(bdev, mapping, indices[i],
 					pvec.pages[i]);
-			if (ret < 0)
+			if (ret < 0) {
+				mapping_set_error(mapping, ret);
 				return ret;
+			}
 		}
 	}
 	return 0;
-- 
2.9.3


* Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business
  2017-03-05 13:35 [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business Jeff Layton
                   ` (3 preceding siblings ...)
  2017-03-05 14:40 ` [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business Jeff Layton
@ 2017-03-06  3:06 ` NeilBrown
  2017-03-06 11:43   ` Jeff Layton
  4 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2017-03-06  3:06 UTC (permalink / raw)
  To: Jeff Layton, viro, konishi.ryusuke
  Cc: linux-fsdevel, linux-mm, linux-kernel, linux-nilfs

On Sun, Mar 05 2017, Jeff Layton wrote:

> I recently did some work to wire up -ENOSPC handling in ceph, and found
> I could get back -EIO errors in some cases when I should have instead
> gotten -ENOSPC. The problem was that the ceph writeback code would set
> PG_error on a writeback error, and that error would clobber the mapping
> error.
>
> While I fixed that problem by simply not setting that bit on errors,
> that led me down a rabbit hole of looking at how PG_error is being
> handled in the kernel.

Speaking of rabbit holes... I thought to wonder how IO errors propagate
up from NFS.
It doesn't use SetPageError or mapping_set_error() for files (except in
one case that looks a bit odd).
It has an "nfs_open_context" and stores the latest error in ctx->error.

So when you get around to documenting how this is supposed to work, it
would be worth while describing the required observable behaviour, and
note that while filesystems can use mapping_set_error() to achieve this,
they don't have to.

I notice that
  drivers/staging/lustre/lustre/llite/rw.c
  fs/afs/write.c
  fs/btrfs/extent_io.c
  fs/cifs/file.c
  fs/jffs2/file.c
  fs/jfs/jfs_metapage.c
  fs/ntfs/aops.c

(and possibly others) all have SetPageError() calls that seem to be
in response to a write error to a file, but don't appear to have
matching mapping_set_error() calls.  Did you look at these?  Did I miss
something?

Thanks,
NeilBrown

>
> This patch series is a few fixes for things that I noticed purely by
> inspection. I don't have a great way to test these since they involve
> error handling, but if they look plausible up front I can certainly
> doctor up a kernel to inject errors in this code and test by hand.
>
> Jeff Layton (3):
>   nilfs2: set the mapping error when calling SetPageError on writeback
>   mm: don't TestClearPageError in __filemap_fdatawait_range
>   mm: set mapping error when launder_pages fails
>
>  fs/nilfs2/segment.c |  1 +
>  mm/filemap.c        | 19 ++++---------------
>  mm/truncate.c       |  6 +++++-
>  3 files changed, 10 insertions(+), 16 deletions(-)
>
> -- 
> 2.9.3


* Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business
  2017-03-06  3:06 ` [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business NeilBrown
@ 2017-03-06 11:43   ` Jeff Layton
  0 siblings, 0 replies; 25+ messages in thread
From: Jeff Layton @ 2017-03-06 11:43 UTC (permalink / raw)
  To: NeilBrown, viro, konishi.ryusuke
  Cc: linux-fsdevel, linux-mm, linux-kernel, linux-nilfs, Andrew Morton

On Mon, 2017-03-06 at 14:06 +1100, NeilBrown wrote:
> On Sun, Mar 05 2017, Jeff Layton wrote:
> 
> > I recently did some work to wire up -ENOSPC handling in ceph, and found
> > I could get back -EIO errors in some cases when I should have instead
> > gotten -ENOSPC. The problem was that the ceph writeback code would set
> > PG_error on a writeback error, and that error would clobber the mapping
> > error.
> > 
> > While I fixed that problem by simply not setting that bit on errors,
> > that led me down a rabbit hole of looking at how PG_error is being
> > handled in the kernel.
> 
> Speaking of rabbit holes... I thought to wonder how IO errors propagate
> up from NFS.
> It doesn't use SetPageError or mapping_set_error() for files (except in
> one case that looks a bit odd).
> It has an "nfs_open_context" and stores the latest error in ctx->error.
> 
> So when you get around to documenting how this is supposed to work, it
> would be worth while describing the required observable behaviour, and
> note that while filesystems can use mapping_set_error() to achieve this,
> they don't have to.
> 
> I notice that
>   drivers/staging/lustre/lustre/llite/rw.c
>   fs/afs/write.c
>   fs/btrfs/extent_io.c
>   fs/cifs/file.c
>   fs/jffs2/file.c
>   fs/jfs/jfs_metapage.c
>   fs/ntfs/aops.c
> 
> (and possibly others) all have SetPageError() calls that seem to be
> in response to a write error to a file, but don't appear to have
> matching mapping_set_error() calls.  Did you look at these?  Did I miss
> something?
> 

Those are all in writepage implementations, and the callers of writepage
all call mapping_set_error if it returns an error. The exception is
write_one_page, which is typically used for writing out dir info and
such, so it's not really necessary there.

Now that I look though, I think I may have gotten the page migration
codepath wrong. I had convinced myself it was ok before, but looking
again, I think we probably need to add a mapping_set_error call to 
writeout() as well. I'll go over that more carefully in a little while.
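
For the curious, the pattern I mean is the wrapper that callers of
write_cache_pages(), such as generic_writepages(), use (paraphrased from
mm/page-writeback.c, so double-check against your tree):

```c
static int __writepage(struct page *page, struct writeback_control *wbc,
		       void *data)
{
	struct address_space *mapping = data;
	int ret = mapping->a_ops->writepage(page, wbc);

	/* any ->writepage error gets recorded in the mapping here */
	mapping_set_error(mapping, ret);
	return ret;
}
```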

> > 
> > This patch series is a few fixes for things that I noticed purely by
> > inspection. I don't have a great way to test these since they involve
> > error handling, but if they look plausible up front I can certainly
> > doctor up a kernel to inject errors in this code and test by hand.
> > 
> > Jeff Layton (3):
> >   nilfs2: set the mapping error when calling SetPageError on writeback
> >   mm: don't TestClearPageError in __filemap_fdatawait_range
> >   mm: set mapping error when launder_pages fails
> > 
> >  fs/nilfs2/segment.c |  1 +
> >  mm/filemap.c        | 19 ++++---------------
> >  mm/truncate.c       |  6 +++++-
> >  3 files changed, 10 insertions(+), 16 deletions(-)
> > 
> > -- 
> > 2.9.3

-- 
Jeff Layton <jlayton@redhat.com>


* Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business
  2017-03-05 14:40 ` [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business Jeff Layton
@ 2017-03-06 23:08   ` Ross Zwisler
  2017-03-07 10:26     ` Jan Kara
  0 siblings, 1 reply; 25+ messages in thread
From: Ross Zwisler @ 2017-03-06 23:08 UTC (permalink / raw)
  To: Jeff Layton
  Cc: viro, konishi.ryusuke, linux-fsdevel, linux-mm, linux-kernel,
	linux-nilfs, NeilBrown, Ross Zwisler, Jan Kara

On Sun, Mar 05, 2017 at 09:40:54AM -0500, Jeff Layton wrote:
> On Sun, 2017-03-05 at 08:35 -0500, Jeff Layton wrote:
> > I recently did some work to wire up -ENOSPC handling in ceph, and found
> > I could get back -EIO errors in some cases when I should have instead
> > gotten -ENOSPC. The problem was that the ceph writeback code would set
> > PG_error on a writeback error, and that error would clobber the mapping
> > error.
> > 
> 
> I should also note that relying on PG_error to report writeback errors
> is inherently unreliable as well. If someone calls sync() before your
> fsync gets in there, then you'll likely lose it anyway.
> 
> filemap_fdatawait_keep_errors will preserve the error in the mapping,
> but not the individual PG_error flags, so I think we do want to ensure
> that the mapping error is set when there is a writeback error and not
> rely on PG_error bit for that.
> 
> > While I fixed that problem by simply not setting that bit on errors,
> > that led me down a rabbit hole of looking at how PG_error is being
> > handled in the kernel.
> > 
> > This patch series is a few fixes for things that I noticed purely by
> > inspection. I don't have a great way to test these since they involve
> > error handling, but if they look plausible up front I can certainly
> > doctor up a kernel to inject errors in this code and test by hand.
> > 
> > Jeff Layton (3):
> >   nilfs2: set the mapping error when calling SetPageError on writeback
> >   mm: don't TestClearPageError in __filemap_fdatawait_range
> >   mm: set mapping error when launder_pages fails
> > 
> >  fs/nilfs2/segment.c |  1 +
> >  mm/filemap.c        | 19 ++++---------------
> >  mm/truncate.c       |  6 +++++-
> >  3 files changed, 10 insertions(+), 16 deletions(-)
> > 
> 
> (cc'ing Ross...)
> 
> Just when I thought that only NILFS2 needed a little work here, I see
> another spot...
> 
> I think we also need to fix dax_writeback_mapping_range to set a
> mapping error on writeback failure as well. It looks like that's not
> happening today. Something like the patch below (obviously untested).
> 
> I'll also plan to follow up with a patch to vfs.txt to outline how
> writeback errors should be handled by filesystems, assuming that this
> patchset isn't completely off base.
> 
> -------------------8<-----------------------
> 
> [PATCH] dax: set error in mapping when writeback fails
> 
> In order to get proper error codes from fsync, we must set an error in
> the mapping range when writeback fails.
> 
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>  fs/dax.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index c45598b912e1..9005d90deeda 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -888,8 +888,10 @@ int dax_writeback_mapping_range(struct address_space *mapping,
>  
>  			ret = dax_writeback_one(bdev, mapping, indices[i],
>  					pvec.pages[i]);
> -			if (ret < 0)
> +			if (ret < 0) {
> +				mapping_set_error(mapping, ret);
>  				return ret;
> +			}

(Adding Jan)

I tested this a bit, and for the DAX case at least I don't think this does
what you want.  The current code already returns -EIO if dax_writeback_one()
hits an error, which bubbles up through the call stack and makes the fsync()
call in userspace fail with EIO, as we want.  With both ext4 and xfs this
patch (applied to v4.10) makes it so that we fail the current fsync() due to
the return value of -EIO, then we fail the next fsync() as well because only
then do we actually process the AS_EIO flag inside of filemap_check_errors().

I think maybe the missing piece is that our normal DAX fsync call stack
doesn't include a call to filemap_check_errors() if we return -EIO.  Here's
our stack in xfs:

    dax_writeback_mapping_range+0x32/0x70
    xfs_vm_writepages+0x8c/0xf0
    do_writepages+0x21/0x30
    __filemap_fdatawrite_range+0xc6/0x100
    filemap_write_and_wait_range+0x44/0x90
    xfs_file_fsync+0x7a/0x2c0
    vfs_fsync_range+0x4b/0xb0
    ? trace_hardirqs_on_caller+0xf5/0x1b0
    do_fsync+0x3d/0x70
    SyS_fsync+0x10/0x20
    entry_SYSCALL_64_fastpath+0x1f/0xc2

On the subsequent fsync() call we *do* end up calling filemap_check_errors()
via filemap_fdatawait_range(), which tests & clears the AS_EIO flag in the
mapping:

    filemap_fdatawait_range+0x3b/0x80
    filemap_write_and_wait_range+0x5a/0x90
    xfs_file_fsync+0x7a/0x2c0
    vfs_fsync_range+0x4b/0xb0
    ? trace_hardirqs_on_caller+0xf5/0x1b0
    do_fsync+0x3d/0x70
    SyS_fsync+0x10/0x20
    entry_SYSCALL_64_fastpath+0x1f/0xc2

Was your concern just that you didn't think that fsync() was properly
returning an error when dax_writeback_one() hit an error?  Or is there another
path by which we need to report the error, where it is actually important that
we set AS_EIO?  If it's the latter, then I think we need to rework the fsync
call path so that we both generate and consume AS_EIO on the same call,
probably in filemap_write_and_wait_range().


* Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business
  2017-03-06 23:08   ` Ross Zwisler
@ 2017-03-07 10:26     ` Jan Kara
  2017-03-07 14:03       ` Jeff Layton
                         ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Jan Kara @ 2017-03-07 10:26 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jeff Layton, viro, konishi.ryusuke, linux-fsdevel, linux-mm,
	linux-kernel, linux-nilfs, NeilBrown, Jan Kara

On Mon 06-03-17 16:08:01, Ross Zwisler wrote:
> On Sun, Mar 05, 2017 at 09:40:54AM -0500, Jeff Layton wrote:
> > On Sun, 2017-03-05 at 08:35 -0500, Jeff Layton wrote:
> > > I recently did some work to wire up -ENOSPC handling in ceph, and found
> > > I could get back -EIO errors in some cases when I should have instead
> > > gotten -ENOSPC. The problem was that the ceph writeback code would set
> > > PG_error on a writeback error, and that error would clobber the mapping
> > > error.
> > > 
> > 
> > I should also note that relying on PG_error to report writeback errors
> > is inherently unreliable as well. If someone calls sync() before your
> > fsync gets in there, then you'll likely lose it anyway.
> > 
> > filemap_fdatawait_keep_errors will preserve the error in the mapping,
> > but not the individual PG_error flags, so I think we do want to ensure
> > that the mapping error is set when there is a writeback error and not
> > rely on PG_error bit for that.
> > 
> > > While I fixed that problem by simply not setting that bit on errors,
> > > that led me down a rabbit hole of looking at how PG_error is being
> > > handled in the kernel.
> > > 
> > > This patch series is a few fixes for things that I noticed purely by
> > > inspection. I don't have a great way to test these since they involve
> > > error handling, but if they look plausible up front I can certainly
> > > doctor up a kernel to inject errors in this code and test by hand.
> > > 
> > > Jeff Layton (3):
> > >   nilfs2: set the mapping error when calling SetPageError on writeback
> > >   mm: don't TestClearPageError in __filemap_fdatawait_range
> > >   mm: set mapping error when launder_pages fails
> > > 
> > >  fs/nilfs2/segment.c |  1 +
> > >  mm/filemap.c        | 19 ++++---------------
> > >  mm/truncate.c       |  6 +++++-
> > >  3 files changed, 10 insertions(+), 16 deletions(-)
> > > 
> > 
> > (cc'ing Ross...)
> > 
> > Just when I thought that only NILFS2 needed a little work here, I see
> > another spot...
> > 
> > I think we also need to fix dax_writeback_mapping_range to set a
> > mapping error on writeback failure as well. It looks like that's not
> > happening today. Something like the patch below (obviously untested).
> > 
> > I'll also plan to follow up with a patch to vfs.txt to outline how
> > writeback errors should be handled by filesystems, assuming that this
> > patchset isn't completely off base.
> > 
> > -------------------8<-----------------------
> > 
> > [PATCH] dax: set error in mapping when writeback fails
> > 
> > In order to get proper error codes from fsync, we must set an error in
> > the mapping range when writeback fails.
> > 
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > ---
> >  fs/dax.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index c45598b912e1..9005d90deeda 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -888,8 +888,10 @@ int dax_writeback_mapping_range(struct address_space *mapping,
> >  
> >  			ret = dax_writeback_one(bdev, mapping, indices[i],
> >  					pvec.pages[i]);
> > -			if (ret < 0)
> > +			if (ret < 0) {
> > +				mapping_set_error(mapping, ret);
> >  				return ret;
> > +			}
> 
> (Adding Jan)
> 
> I tested this a bit, and for the DAX case at least I don't think this does
> what you want.  The current code already returns -EIO if dax_writeback_one()
> hits an error, which bubbles up through the call stack and makes the fsync()
> call in userspace fail with EIO, as we want.  With both ext4 and xfs this
> patch (applied to v4.10) makes it so that we fail the current fsync() due to
> the return value of -EIO, then we fail the next fsync() as well because only
> then do we actually process the AS_EIO flag inside of filemap_check_errors().
> 
> I think maybe the missing piece is that our normal DAX fsync call stack
> doesn't include a call to filemap_check_errors() if we return -EIO.  Here's
> our stack in xfs:
> 
>     dax_writeback_mapping_range+0x32/0x70
>     xfs_vm_writepages+0x8c/0xf0
>     do_writepages+0x21/0x30
>     __filemap_fdatawrite_range+0xc6/0x100
>     filemap_write_and_wait_range+0x44/0x90
>     xfs_file_fsync+0x7a/0x2c0
>     vfs_fsync_range+0x4b/0xb0
>     ? trace_hardirqs_on_caller+0xf5/0x1b0
>     do_fsync+0x3d/0x70
>     SyS_fsync+0x10/0x20
>     entry_SYSCALL_64_fastpath+0x1f/0xc2
> 
> On the subsequent fsync() call we *do* end up calling filemap_check_errors()
> via filemap_fdatawait_range(), which tests & clears the AS_EIO flag in the
> mapping:
> 
>     filemap_fdatawait_range+0x3b/0x80
>     filemap_write_and_wait_range+0x5a/0x90
>     xfs_file_fsync+0x7a/0x2c0
>     vfs_fsync_range+0x4b/0xb0
>     ? trace_hardirqs_on_caller+0xf5/0x1b0
>     do_fsync+0x3d/0x70
>     SyS_fsync+0x10/0x20
>     entry_SYSCALL_64_fastpath+0x1f/0xc2
> 
> Was your concern just that you didn't think that fsync() was properly
> returning an error when dax_writeback_one() hit an error?  Or is there another
> path by which we need to report the error, where it is actually important that
> we set AS_EIO?  If it's the latter, then I think we need to rework the fsync
> call path so that we both generate and consume AS_EIO on the same call,
> probably in filemap_write_and_wait_range().

So I believe this is due to the special handling of EIO inside
filemap_write_and_wait(). Normally, filemap_check_errors() happens inside
filemap_fdatawait() there; however, that is skipped for EIO returned from
filemap_fdatawrite(), since in that case we bail out immediately. So I
think Jeff's patch is correct, but we need to change filemap_write_and_wait()
to also call filemap_check_errors() directly on EIO from filemap_fdatawrite().
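
Something like the following, perhaps (untested sketch, just to
illustrate the idea):

```c
int filemap_write_and_wait_range(struct address_space *mapping,
				 loff_t lstart, loff_t lend)
{
	int err = 0;

	if (mapping->nrpages) {
		err = __filemap_fdatawrite_range(mapping, lstart, lend,
						 WB_SYNC_ALL);
		if (err != -EIO) {
			int err2 = filemap_fdatawait_range(mapping,
							   lstart, lend);
			if (!err)
				err = err2;
		} else {
			/* consume the stored error now instead of
			 * leaving it for the next fsync to trip over */
			int err2 = filemap_check_errors(mapping);
			if (err2)
				err = err2;
		}
	}
	return err;
}
```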

On a more general note (DAX is actually fine here), I find the current
practice of clearing page dirty bits on error and reporting the error just
once problematic. It keeps the system running, but data is lost, possibly
without the error ever reaching anywhere it would be useful. We get away
with this because it is a rare event, but it seems like problematic
behavior. But this is more for the discussion at LSF.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 1/3] nilfs2: set the mapping error when calling SetPageError on writeback
  2017-03-05 13:35 ` [PATCH 1/3] nilfs2: set the mapping error when calling SetPageError on writeback Jeff Layton
@ 2017-03-07 13:46   ` Ryusuke Konishi
  0 siblings, 0 replies; 25+ messages in thread
From: Ryusuke Konishi @ 2017-03-07 13:46 UTC (permalink / raw)
  To: Jeff Layton; +Cc: viro, linux-fsdevel, linux-mm, linux-kernel, linux-nilfs

On Sun,  5 Mar 2017 08:35:33 -0500, Jeff Layton <jlayton@redhat.com> wrote:
> In a later patch, we're going to want to make the fsync codepath not do
> a TestClearPageError call as that can override the error set in the
> address space. To do that though, we need to ensure that filesystems
> that are relying on the PG_error bit for reporting writeback errors
> also set an error in the address space.
> 
> The only place I've found that looks potentially problematic is this
> spot in nilfs2. Ensure that it sets an error in the mapping in addition
> to setting PageError.
> 
> Signed-off-by: Jeff Layton <jlayton@redhat.com>

Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

Agreed that nilfs2 needs this if the subsequent patch is applied.
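For reference, a userspace sketch of what the two helpers do with the
error, as I read include/linux/pagemap.h and mm/filemap.c (the atomic
bitops are modeled with plain bools here):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Userspace stand-in for the two writeback error bits. */
struct mapping { bool as_eio, as_enospc; };

/* mapping_set_error(): -ENOSPC gets its own bit, every other
 * error is folded into AS_EIO. */
static void mapping_set_error(struct mapping *m, int error)
{
	if (error) {
		if (error == -ENOSPC)
			m->as_enospc = true;
		else
			m->as_eio = true;
	}
}

/* filemap_check_errors(): test-and-clear both bits, with -EIO
 * taking precedence when both are set. */
static int filemap_check_errors(struct mapping *m)
{
	int ret = 0;

	if (m->as_enospc) {
		m->as_enospc = false;
		ret = -ENOSPC;
	}
	if (m->as_eio) {
		m->as_eio = false;
		ret = -EIO;
	}
	return ret;
}
```

Note how an -EIO recorded by one page's writeback clobbers a previously
recorded -ENOSPC, which is exactly the symptom described in the cover
letter.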

Thanks,
Ryusuke Konishi

> ---
>  fs/nilfs2/segment.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index bedcae2c28e6..c1041b07060e 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -1743,6 +1743,7 @@ static void nilfs_end_page_io(struct page *page, int err)
>  	} else {
>  		__set_page_dirty_nobuffers(page);
>  		SetPageError(page);
> +		mapping_set_error(page_mapping(page), err);
>  	}
>  
>  	end_page_writeback(page);
> -- 
> 2.9.3
> 



* Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business
  2017-03-07 10:26     ` Jan Kara
@ 2017-03-07 14:03       ` Jeff Layton
  2017-03-07 15:59       ` Ross Zwisler
  2017-03-09  2:57       ` Theodore Ts'o
  2 siblings, 0 replies; 25+ messages in thread
From: Jeff Layton @ 2017-03-07 14:03 UTC (permalink / raw)
  To: Jan Kara, Ross Zwisler
  Cc: viro, konishi.ryusuke, linux-fsdevel, linux-mm, linux-kernel,
	linux-nilfs, NeilBrown, Kevin Wolf

On Tue, 2017-03-07 at 11:26 +0100, Jan Kara wrote:
> On Mon 06-03-17 16:08:01, Ross Zwisler wrote:
> > On Sun, Mar 05, 2017 at 09:40:54AM -0500, Jeff Layton wrote:
> > > On Sun, 2017-03-05 at 08:35 -0500, Jeff Layton wrote:
> > > > I recently did some work to wire up -ENOSPC handling in ceph, and found
> > > > I could get back -EIO errors in some cases when I should have instead
> > > > gotten -ENOSPC. The problem was that the ceph writeback code would set
> > > > PG_error on a writeback error, and that error would clobber the mapping
> > > > error.
> > > > 
> > > 
> > > I should also note that relying on PG_error to report writeback errors
> > > is inherently unreliable as well. If someone calls sync() before your
> > > fsync gets in there, then you'll likely lose it anyway.
> > > 
> > > filemap_fdatawait_keep_errors will preserve the error in the mapping,
> > > but not the individual PG_error flags, so I think we do want to ensure
> > > that the mapping error is set when there is a writeback error and not
> > > rely on PG_error bit for that.
> > > 
> > > > While I fixed that problem by simply not setting that bit on errors,
> > > > that led me down a rabbit hole of looking at how PG_error is being
> > > > handled in the kernel.
> > > > 
> > > > This patch series is a few fixes for things that I 100% noticed by
> > > > inspection. I don't have a great way to test these since they involve
> > > > error handling. I can certainly doctor up a kernel to inject errors
> > > > in this code and test by hand however if these look plausible up front.
> > > > 
> > > > Jeff Layton (3):
> > > >   nilfs2: set the mapping error when calling SetPageError on writeback
> > > >   mm: don't TestClearPageError in __filemap_fdatawait_range
> > > >   mm: set mapping error when launder_pages fails
> > > > 
> > > >  fs/nilfs2/segment.c |  1 +
> > > >  mm/filemap.c        | 19 ++++---------------
> > > >  mm/truncate.c       |  6 +++++-
> > > >  3 files changed, 10 insertions(+), 16 deletions(-)
> > > > 
> > > 
> > > (cc'ing Ross...)
> > > 
> > > Just when I thought that only NILFS2 needed a little work here, I see
> > > another spot...
> > > 
> > > I think that we should also need to fix dax_writeback_mapping_range to
> > > set a mapping error on writeback as well. It looks like that's not
> > > happening today. Something like the patch below (obviously untested).
> > > 
> > > I'll also plan to follow up with a patch to vfs.txt to outline how
> > > writeback errors should be handled by filesystems, assuming that this
> > > patchset isn't completely off base.
> > > 
> > > -------------------8<-----------------------
> > > 
> > > [PATCH] dax: set error in mapping when writeback fails
> > > 
> > > In order to get proper error codes from fsync, we must set an error in
> > > the mapping range when writeback fails.
> > > 
> > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > ---
> > >  fs/dax.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index c45598b912e1..9005d90deeda 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -888,8 +888,10 @@ int dax_writeback_mapping_range(struct address_space *mapping,
> > >  
> > >  			ret = dax_writeback_one(bdev, mapping, indices[i],
> > >  					pvec.pages[i]);
> > > -			if (ret < 0)
> > > +			if (ret < 0) {
> > > +				mapping_set_error(mapping, ret);
> > >  				return ret;
> > > +			}
> > 
> > (Adding Jan)
> > 
> > I tested this a bit, and for the DAX case at least I don't think this does
> > what you want.  The current code already returns -EIO if dax_writeback_one()
> > hits an error, which bubbles up through the call stack and makes the fsync()
> > call in userspace fail with EIO, as we want.  With both ext4 and xfs this
> > patch (applied to v4.10) makes it so that we fail the current fsync() due to
> > the return value of -EIO, then we fail the next fsync() as well because only
> > then do we actually process the AS_EIO flag inside of filemap_check_errors().
> > 
> > I think maybe the missing piece is that our normal DAX fsync call stack
> > doesn't include a call to filemap_check_errors() if we return -EIO.  Here's
> > our stack in xfs:
> > 
> >     dax_writeback_mapping_range+0x32/0x70
> >     xfs_vm_writepages+0x8c/0xf0
> >     do_writepages+0x21/0x30
> >     __filemap_fdatawrite_range+0xc6/0x100
> >     filemap_write_and_wait_range+0x44/0x90
> >     xfs_file_fsync+0x7a/0x2c0
> >     vfs_fsync_range+0x4b/0xb0
> >     ? trace_hardirqs_on_caller+0xf5/0x1b0
> >     do_fsync+0x3d/0x70
> >     SyS_fsync+0x10/0x20
> >     entry_SYSCALL_64_fastpath+0x1f/0xc2
> > 
> > On the subsequent fsync() call we *do* end up calling filemap_check_errors()
> > via filemap_fdatawrite_range(), which tests & clears the AS_EIO flag in the
> > mapping:
> > 
> >     filemap_fdatawait_range+0x3b/0x80
> >     filemap_write_and_wait_range+0x5a/0x90
> >     xfs_file_fsync+0x7a/0x2c0
> >     vfs_fsync_range+0x4b/0xb0
> >     ? trace_hardirqs_on_caller+0xf5/0x1b0
> >     do_fsync+0x3d/0x70
> >     SyS_fsync+0x10/0x20
> >     entry_SYSCALL_64_fastpath+0x1f/0xc2
> > 
> > Was your concern just that you didn't think that fsync() was properly
> > returning an error when dax_writeback_one() hit an error?  Or is there another
> > path by which we need to report the error, where it is actually important that
> > we set AS_EIO?  If it's the latter, then I think we need to rework the fsync
> > call path so that we both generate and consume AS_EIO on the same call,
> > probably in filemap_write_and_wait_range().
> 
> So I believe this is due to the special handling of EIO inside
> filemap_write_and_wait(). Normally, filemap_check_errors() happens inside
> filemap_fdatawait() there however not for EIO returned from
> filemap_fdatawrite(). In that case we bail out immediately. So I think
> Jeff's patch is correct but we need to change filemap_write_and_wait() to
> call also filemap_check_errors() directly on EIO from filemap_fdatawrite().
> 

Yes, that makes total sense. I've got a filemap_write_and_wait patch in
my pile now that does this. I'll run what I have through an xfstests
run, see how it does, and plan to post a v2 set once I do.

> On a more general note (DAX is actually fine here), I find the current
> practice of clearing page dirty bits on error and reporting it just once
> problematic. It keeps the system running but data is lost and possibly
> without getting the error anywhere where it is useful. We get away with
> this because it is a rare event but it seems like a problematic behavior.
> But this is more for the discussion at LSF.
> 

That really is the crux of the matter. Unfortunately, that's sort of how
the POSIX write/fsync model is designed. If we want to change that, then
I think that we have to consider what a new interface for this would
look like. Maybe we can do something there with new sync_file_range
flags?

I think this probably dovetails with Kevin Wolf's proposed LSF topic
too, so maybe we can discuss all of this together there.
-- 
Jeff Layton <jlayton@redhat.com>



* Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business
  2017-03-07 10:26     ` Jan Kara
  2017-03-07 14:03       ` Jeff Layton
@ 2017-03-07 15:59       ` Ross Zwisler
  2017-03-07 16:17         ` Jan Kara
  2017-03-09  2:57       ` Theodore Ts'o
  2 siblings, 1 reply; 25+ messages in thread
From: Ross Zwisler @ 2017-03-07 15:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, Jeff Layton, viro, konishi.ryusuke, linux-fsdevel,
	linux-mm, linux-kernel, linux-nilfs, NeilBrown

On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> On Mon 06-03-17 16:08:01, Ross Zwisler wrote:
> > On Sun, Mar 05, 2017 at 09:40:54AM -0500, Jeff Layton wrote:
> > > On Sun, 2017-03-05 at 08:35 -0500, Jeff Layton wrote:
> > > > I recently did some work to wire up -ENOSPC handling in ceph, and found
> > > > I could get back -EIO errors in some cases when I should have instead
> > > > gotten -ENOSPC. The problem was that the ceph writeback code would set
> > > > PG_error on a writeback error, and that error would clobber the mapping
> > > > error.
> > > > 
> > > 
> > > I should also note that relying on PG_error to report writeback errors
> > > is inherently unreliable as well. If someone calls sync() before your
> > > fsync gets in there, then you'll likely lose it anyway.
> > > 
> > > filemap_fdatawait_keep_errors will preserve the error in the mapping,
> > > but not the individual PG_error flags, so I think we do want to ensure
> > > that the mapping error is set when there is a writeback error and not
> > > rely on PG_error bit for that.
> > > 
> > > > While I fixed that problem by simply not setting that bit on errors,
> > > > that led me down a rabbit hole of looking at how PG_error is being
> > > > handled in the kernel.
> > > > 
> > > > This patch series is a few fixes for things that I 100% noticed by
> > > > inspection. I don't have a great way to test these since they involve
> > > > error handling. I can certainly doctor up a kernel to inject errors
> > > > in this code and test by hand however if these look plausible up front.
> > > > 
> > > > Jeff Layton (3):
> > > >   nilfs2: set the mapping error when calling SetPageError on writeback
> > > >   mm: don't TestClearPageError in __filemap_fdatawait_range
> > > >   mm: set mapping error when launder_pages fails
> > > > 
> > > >  fs/nilfs2/segment.c |  1 +
> > > >  mm/filemap.c        | 19 ++++---------------
> > > >  mm/truncate.c       |  6 +++++-
> > > >  3 files changed, 10 insertions(+), 16 deletions(-)
> > > > 
> > > 
> > > (cc'ing Ross...)
> > > 
> > > Just when I thought that only NILFS2 needed a little work here, I see
> > > another spot...
> > > 
> > > I think that we should also need to fix dax_writeback_mapping_range to
> > > set a mapping error on writeback as well. It looks like that's not
> > > happening today. Something like the patch below (obviously untested).
> > > 
> > > I'll also plan to follow up with a patch to vfs.txt to outline how
> > > writeback errors should be handled by filesystems, assuming that this
> > > patchset isn't completely off base.
> > > 
> > > -------------------8<-----------------------
> > > 
> > > [PATCH] dax: set error in mapping when writeback fails
> > > 
> > > In order to get proper error codes from fsync, we must set an error in
> > > the mapping range when writeback fails.
> > > 
> > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > ---
> > >  fs/dax.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index c45598b912e1..9005d90deeda 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -888,8 +888,10 @@ int dax_writeback_mapping_range(struct address_space *mapping,
> > >  
> > >  			ret = dax_writeback_one(bdev, mapping, indices[i],
> > >  					pvec.pages[i]);
> > > -			if (ret < 0)
> > > +			if (ret < 0) {
> > > +				mapping_set_error(mapping, ret);
> > >  				return ret;
> > > +			}
> > 
> > (Adding Jan)
> > 
> > I tested this a bit, and for the DAX case at least I don't think this does
> > what you want.  The current code already returns -EIO if dax_writeback_one()
> > hits an error, which bubbles up through the call stack and makes the fsync()
> > call in userspace fail with EIO, as we want.  With both ext4 and xfs this
> > patch (applied to v4.10) makes it so that we fail the current fsync() due to
> > the return value of -EIO, then we fail the next fsync() as well because only
> > then do we actually process the AS_EIO flag inside of filemap_check_errors().
> > 
> > I think maybe the missing piece is that our normal DAX fsync call stack
> > doesn't include a call to filemap_check_errors() if we return -EIO.  Here's
> > our stack in xfs:
> > 
> >     dax_writeback_mapping_range+0x32/0x70
> >     xfs_vm_writepages+0x8c/0xf0
> >     do_writepages+0x21/0x30
> >     __filemap_fdatawrite_range+0xc6/0x100
> >     filemap_write_and_wait_range+0x44/0x90
> >     xfs_file_fsync+0x7a/0x2c0
> >     vfs_fsync_range+0x4b/0xb0
> >     ? trace_hardirqs_on_caller+0xf5/0x1b0
> >     do_fsync+0x3d/0x70
> >     SyS_fsync+0x10/0x20
> >     entry_SYSCALL_64_fastpath+0x1f/0xc2
> > 
> > On the subsequent fsync() call we *do* end up calling filemap_check_errors()
> > via filemap_fdatawrite_range(), which tests & clears the AS_EIO flag in the
> > mapping:
> > 
> >     filemap_fdatawait_range+0x3b/0x80
> >     filemap_write_and_wait_range+0x5a/0x90
> >     xfs_file_fsync+0x7a/0x2c0
> >     vfs_fsync_range+0x4b/0xb0
> >     ? trace_hardirqs_on_caller+0xf5/0x1b0
> >     do_fsync+0x3d/0x70
> >     SyS_fsync+0x10/0x20
> >     entry_SYSCALL_64_fastpath+0x1f/0xc2
> > 
> > Was your concern just that you didn't think that fsync() was properly
> > returning an error when dax_writeback_one() hit an error?  Or is there another
> > path by which we need to report the error, where it is actually important that
> > we set AS_EIO?  If it's the latter, then I think we need to rework the fsync
> > call path so that we both generate and consume AS_EIO on the same call,
> > probably in filemap_write_and_wait_range().
> 
> So I believe this is due to the special handling of EIO inside
> filemap_write_and_wait(). Normally, filemap_check_errors() happens inside

s/filemap_write_and_wait/filemap_write_and_wait_range/ for this particular
case, but we definitely want to make changes that keep them consistent.

> filemap_fdatawait() there however not for EIO returned from
> filemap_fdatawrite(). In that case we bail out immediately. So I think
> Jeff's patch is correct but we need to change filemap_write_and_wait() to
> call also filemap_check_errors() directly on EIO from filemap_fdatawrite().

So I guess my question was: why is it important that we set AS_EIO, if the EIO
is already being reported correctly?  Is it just for consistency with the
buffered fsync case, or is there currently a path where the -EIO from DAX will
be missed, and we actually need AS_EIO to be set in mapping->flags so that we
correctly report an error?

> On a more general note (DAX is actually fine here), I find the current
> practice of clearing page dirty bits on error and reporting it just once
> problematic. It keeps the system running but data is lost and possibly
> without getting the error anywhere where it is useful. We get away with
> this because it is a rare event but it seems like a problematic behavior.
> But this is more for the discussion at LSF.
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR



* Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business
  2017-03-07 15:59       ` Ross Zwisler
@ 2017-03-07 16:17         ` Jan Kara
  0 siblings, 0 replies; 25+ messages in thread
From: Jan Kara @ 2017-03-07 16:17 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, Jeff Layton, viro, konishi.ryusuke, linux-fsdevel,
	linux-mm, linux-kernel, linux-nilfs, NeilBrown

On Tue 07-03-17 08:59:16, Ross Zwisler wrote:
> On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> > On Mon 06-03-17 16:08:01, Ross Zwisler wrote:
> > > On Sun, Mar 05, 2017 at 09:40:54AM -0500, Jeff Layton wrote:
> > > > On Sun, 2017-03-05 at 08:35 -0500, Jeff Layton wrote:
> > > > > I recently did some work to wire up -ENOSPC handling in ceph, and found
> > > > > I could get back -EIO errors in some cases when I should have instead
> > > > > gotten -ENOSPC. The problem was that the ceph writeback code would set
> > > > > PG_error on a writeback error, and that error would clobber the mapping
> > > > > error.
> > > > > 
> > > > 
> > > > I should also note that relying on PG_error to report writeback errors
> > > > is inherently unreliable as well. If someone calls sync() before your
> > > > fsync gets in there, then you'll likely lose it anyway.
> > > > 
> > > > filemap_fdatawait_keep_errors will preserve the error in the mapping,
> > > > but not the individual PG_error flags, so I think we do want to ensure
> > > > that the mapping error is set when there is a writeback error and not
> > > > rely on PG_error bit for that.
> > > > 
> > > > > While I fixed that problem by simply not setting that bit on errors,
> > > > > that led me down a rabbit hole of looking at how PG_error is being
> > > > > handled in the kernel.
> > > > > 
> > > > > This patch series is a few fixes for things that I 100% noticed by
> > > > > inspection. I don't have a great way to test these since they involve
> > > > > error handling. I can certainly doctor up a kernel to inject errors
> > > > > in this code and test by hand however if these look plausible up front.
> > > > > 
> > > > > Jeff Layton (3):
> > > > >   nilfs2: set the mapping error when calling SetPageError on writeback
> > > > >   mm: don't TestClearPageError in __filemap_fdatawait_range
> > > > >   mm: set mapping error when launder_pages fails
> > > > > 
> > > > >  fs/nilfs2/segment.c |  1 +
> > > > >  mm/filemap.c        | 19 ++++---------------
> > > > >  mm/truncate.c       |  6 +++++-
> > > > >  3 files changed, 10 insertions(+), 16 deletions(-)
> > > > > 
> > > > 
> > > > (cc'ing Ross...)
> > > > 
> > > > Just when I thought that only NILFS2 needed a little work here, I see
> > > > another spot...
> > > > 
> > > > I think that we should also need to fix dax_writeback_mapping_range to
> > > > set a mapping error on writeback as well. It looks like that's not
> > > > happening today. Something like the patch below (obviously untested).
> > > > 
> > > > I'll also plan to follow up with a patch to vfs.txt to outline how
> > > > writeback errors should be handled by filesystems, assuming that this
> > > > patchset isn't completely off base.
> > > > 
> > > > -------------------8<-----------------------
> > > > 
> > > > [PATCH] dax: set error in mapping when writeback fails
> > > > 
> > > > In order to get proper error codes from fsync, we must set an error in
> > > > the mapping range when writeback fails.
> > > > 
> > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > > ---
> > > >  fs/dax.c | 4 +++-
> > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/dax.c b/fs/dax.c
> > > > index c45598b912e1..9005d90deeda 100644
> > > > --- a/fs/dax.c
> > > > +++ b/fs/dax.c
> > > > @@ -888,8 +888,10 @@ int dax_writeback_mapping_range(struct address_space *mapping,
> > > >  
> > > >  			ret = dax_writeback_one(bdev, mapping, indices[i],
> > > >  					pvec.pages[i]);
> > > > -			if (ret < 0)
> > > > +			if (ret < 0) {
> > > > +				mapping_set_error(mapping, ret);
> > > >  				return ret;
> > > > +			}
> > > 
> > > (Adding Jan)
> > > 
> > > I tested this a bit, and for the DAX case at least I don't think this does
> > > what you want.  The current code already returns -EIO if dax_writeback_one()
> > > hits an error, which bubbles up through the call stack and makes the fsync()
> > > call in userspace fail with EIO, as we want.  With both ext4 and xfs this
> > > patch (applied to v4.10) makes it so that we fail the current fsync() due to
> > > the return value of -EIO, then we fail the next fsync() as well because only
> > > then do we actually process the AS_EIO flag inside of filemap_check_errors().
> > > 
> > > I think maybe the missing piece is that our normal DAX fsync call stack
> > > doesn't include a call to filemap_check_errors() if we return -EIO.  Here's
> > > our stack in xfs:
> > > 
> > >     dax_writeback_mapping_range+0x32/0x70
> > >     xfs_vm_writepages+0x8c/0xf0
> > >     do_writepages+0x21/0x30
> > >     __filemap_fdatawrite_range+0xc6/0x100
> > >     filemap_write_and_wait_range+0x44/0x90
> > >     xfs_file_fsync+0x7a/0x2c0
> > >     vfs_fsync_range+0x4b/0xb0
> > >     ? trace_hardirqs_on_caller+0xf5/0x1b0
> > >     do_fsync+0x3d/0x70
> > >     SyS_fsync+0x10/0x20
> > >     entry_SYSCALL_64_fastpath+0x1f/0xc2
> > > 
> > > On the subsequent fsync() call we *do* end up calling filemap_check_errors()
> > > via filemap_fdatawrite_range(), which tests & clears the AS_EIO flag in the
> > > mapping:
> > > 
> > >     filemap_fdatawait_range+0x3b/0x80
> > >     filemap_write_and_wait_range+0x5a/0x90
> > >     xfs_file_fsync+0x7a/0x2c0
> > >     vfs_fsync_range+0x4b/0xb0
> > >     ? trace_hardirqs_on_caller+0xf5/0x1b0
> > >     do_fsync+0x3d/0x70
> > >     SyS_fsync+0x10/0x20
> > >     entry_SYSCALL_64_fastpath+0x1f/0xc2
> > > 
> > > Was your concern just that you didn't think that fsync() was properly
> > > returning an error when dax_writeback_one() hit an error?  Or is there another
> > > path by which we need to report the error, where it is actually important that
> > > we set AS_EIO?  If it's the latter, then I think we need to rework the fsync
> > > call path so that we both generate and consume AS_EIO on the same call,
> > > probably in filemap_write_and_wait_range().
> > 
> > So I believe this is due to the special handling of EIO inside
> > filemap_write_and_wait(). Normally, filemap_check_errors() happens inside
> 
> s/filemap_write_and_wait/filemap_write_and_wait_range/ for this particular
> case, but we definitely want to make changes that keep them consistent.
> 
> > filemap_fdatawait() there however not for EIO returned from
> > filemap_fdatawrite(). In that case we bail out immediately. So I think
> > Jeff's patch is correct but we need to change filemap_write_and_wait() to
> > call also filemap_check_errors() directly on EIO from filemap_fdatawrite().
> 
> So I guess my question was: why is it important that we set AS_EIO, if the EIO
> is already being reported correctly?  Is it just for consistency with the
> buffered fsync case, or is there currently a path where the -EIO from DAX will
> be missed, and we actually need AS_EIO to be set in mapping->flags so that we
> correctly report an error?

It is just for consistency and future-proofing. E.g. if we ever decided to do
some background flushing of CPU caches, current DAX behavior would
suddenly result in missed reports of EIO errors.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR



* Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business
  2017-03-07 10:26     ` Jan Kara
  2017-03-07 14:03       ` Jeff Layton
  2017-03-07 15:59       ` Ross Zwisler
@ 2017-03-09  2:57       ` Theodore Ts'o
  2017-03-09  9:04         ` Jan Kara
  2 siblings, 1 reply; 25+ messages in thread
From: Theodore Ts'o @ 2017-03-09  2:57 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, Jeff Layton, viro, konishi.ryusuke, linux-fsdevel,
	linux-mm, linux-kernel, linux-nilfs, NeilBrown

On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> On a more general note (DAX is actually fine here), I find the current
> practice of clearing page dirty bits on error and reporting it just once
> problematic. It keeps the system running but data is lost and possibly
> without getting the error anywhere where it is useful. We get away with
> this because it is a rare event but it seems like a problematic behavior.
> But this is more for the discussion at LSF.

I'm actually running into this in the last day or two, because some MM
folks at $WORK have been pushing hard for GFP_NOFS removal in ext4 (at
least when we are holding some mutex/semaphore like i_data_sem), since
otherwise it's possible for the OOM killer to be unable to kill
processes that are holding on to locks that ext4 is holding.

I've done some initial investigation, and while it's not that hard to
remove GFP_NOFS from certain parts of the writepages() codepath (which
is where we had been running into problems), a really, REALLY big
problem is that if any_filesystem->writepages() returns ENOMEM, it causes
silent data loss, because the pages are marked clean, and so data
written using buffered writeback goes *poof*.

I confirmed this by creating a test kernel with a simple patch such
that if the ext4 file system is mounted with -o debug, there was a 1
in 16 chance that ext4_writepages would immediately return with ENOMEM
(and printk the inode number, so I knew which inodes had gotten the
ENOMEM treatment).  The result was **NOT** pretty.

What I think we should strongly consider is, at the very least, to
special-case ENOMEM being returned by writepages() during background
writeback: *not* mark the pages clean, and make sure the inode
stays on the dirty inode list, so we can retry the write later.  This
is especially important since the process that issued the write may
have gone away, so there might not even be a userspace process to
complain to.  By converting certain page allocations (most notably in
ext4_mb_load_buddy) from GFP_NOFS to GFP_KERNEL, this allows us to
release the i_data_sem lock and return an error.  This should allow
the OOM killer to do its dirty deed, and hopefully we can retry
the writepages() for that inode later.
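As a toy illustration (userspace model with made-up structures and a
fault-injection counter, not the real writeback code), here is the
difference between today's clean-the-pages-anyway behavior and the
retry behavior I'm proposing:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Made-up inode: "dirty" stands in for dirty-list membership. */
struct toy_inode {
	bool dirty;
	int enomem_left;	/* inject this many -ENOMEM failures */
	int pages_written;
};

static int toy_writepages(struct toy_inode *inode)
{
	if (inode->enomem_left > 0) {
		inode->enomem_left--;
		return -ENOMEM;
	}
	inode->pages_written++;
	return 0;
}

/* Today (simplified): the pages end up clean whether or not
 * writepages() succeeded, so a failed pass silently drops data. */
static void bg_writeback_today(struct toy_inode *inode)
{
	toy_writepages(inode);
	inode->dirty = false;
}

/* Proposed: treat -ENOMEM from background writeback as transient,
 * keep the inode dirty, and let a later pass retry. */
static void bg_writeback_retry(struct toy_inode *inode)
{
	if (toy_writepages(inode) == -ENOMEM)
		return;			/* stays on the dirty list */
	inode->dirty = false;
}
```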

In the case of a data integrity sync being issued by fsync() or
umount(), we could allow ENOMEM to get returned to userspace in that
case as well.  I'm not convinced all userspace code will handle an
ENOMEM correctly or sanely, but at least people will be less likely
to blame file system developers.  :-)

The real problem that's going on here, by the way, is that people are
trying to run programs in insanely tight containers, and then when the
kernel locks up, they blame the mm developers.  But if there is silent
data corruption, they will blame the fs developers instead.  And while
kernel lockups are temporary (all you have to do is let the watchdog
reboot the system :-), silent data corruption is *forever*.  So what
we really need to do is to allow the OOM killer do its work, and if
job owners are unhappy that their processes are getting OOM killed,
maybe they will be suitably incentivized to pay for more memory in
their containers....

						- Ted

P.S. Michael Hocko, apologies for not getting back to you with your
GFP_NOFS removal patches.  But the possibility of fs malfunctions that
might lead to silent data corruption is why I'm being very cautious,
and I now have rather strong confirmation that this is not just an
irrational concern on my part.  (This is also performance review
season, FAST conference was last week, and Usenix ATC program
committee reviews are due this week.  So apologies for any reply
latency.)



* Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business
  2017-03-09  2:57       ` Theodore Ts'o
@ 2017-03-09  9:04         ` Jan Kara
  2017-03-09 10:47           ` Jeff Layton
  2017-03-15  5:07           ` [RFC PATCH] mm: retry writepages() on ENOMEM when doing an data integrity writeback Theodore Ts'o
  0 siblings, 2 replies; 25+ messages in thread
From: Jan Kara @ 2017-03-09  9:04 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, Ross Zwisler, Jeff Layton, viro, konishi.ryusuke,
	linux-fsdevel, linux-mm, linux-kernel, linux-nilfs, NeilBrown

On Wed 08-03-17 21:57:25, Ted Tso wrote:
> On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> > On a more general note (DAX is actually fine here), I find the current
> > practice of clearing page dirty bits on error and reporting it just once
> > problematic. It keeps the system running but data is lost and possibly
> > without getting the error anywhere where it is useful. We get away with
> > this because it is a rare event but it seems like a problematic behavior.
> > But this is more for the discussion at LSF.
> 
> I'm actually running into this in the last day or two because some MM
> folks at $WORK have been trying to push hard for GFP_NOFS removal in
> ext4 (at least when we are holding some mutex/semaphore like
> i_data_sem) because otherwise it's possible for the OOM killer to be
> unable to kill processes because they are holding on to locks that
> ext4 is holding.
> 
> I've done some initial investigation, and while it's not that hard to
> remove GFP_NOFS from certain parts of the writepages() codepath (which
> is where we had been running into problems), a really, REALLY big
> problem is if any_filesystem->writepages() returns ENOMEM, it causes
> silent data loss, because the pages are marked clean, and so data
> written using buffered writeback goes *poof*.
> 
> I confirmed this by creating a test kernel with a simple patch such
> that if the ext4 file system is mounted with -o debug, there was a 1
> in 16 chance that ext4_writepages will immediately return with ENOMEM
> (and printk the inode number, so I knew which inodes had gotten the
> ENOMEM treatment).  The result was **NOT** pretty.
> 
> What I think we should strongly consider is at the very least, special
> case ENOMEM being returned by writepages() during background
> writeback, and *not* mark the pages clean, and make sure the inode
> stays on the dirty inode list, so we can retry the write later.  This
> is especially important since the process that issued the write may
> have gone away, so there might not even be a userspace process to
> complain to.  By converting certain page allocations (most notably in
> ext4_mb_load_buddy) from GFP_NOFS to GFP_KMALLOC, this allows us to
> release the i_data_sem lock and return an error.  This should allow
> allow the OOM killer to do its dirty deed, and hopefully we can retry
> the writepages() for that inode later.

Yeah, so if we can hope the error is transient, keeping pages dirty and
retrying the write is definitely a better option. To start, we could say
that ENOMEM, EINTR, EAGAIN, and ENOSPC are transient errors, and that
anything else means there's no hope of getting the data to disk, so we
just discard it. It would be a somewhat rough distinction, but probably
better than what we have now.
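
The rough transient/permanent split sketched above could be modeled in
plain C like this (the helper name writeback_error_is_transient is
invented for illustration; it is not an existing kernel function):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Sketch of the proposed classification: a few errno values are treated
 * as transient (keep the pages dirty and retry writeback later), while
 * everything else is a hard failure (the dirty data is discarded). */
static bool writeback_error_is_transient(int err)
{
	switch (err) {
	case -ENOMEM:	/* allocator pressure may ease */
	case -EINTR:	/* interrupted; a retry is harmless */
	case -EAGAIN:	/* explicitly "try again" */
	case -ENOSPC:	/* space may be freed up later */
		return true;
	default:	/* -EIO and friends: no hope, drop the data */
		return false;
	}
}
```

Whether ENOSPC really belongs in the transient bucket is exactly what
the rest of the thread debates.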

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business
  2017-03-09  9:04         ` Jan Kara
@ 2017-03-09 10:47           ` Jeff Layton
  2017-03-09 11:02             ` Jan Kara
  2017-03-15  5:07           ` [RFC PATCH] mm: retry writepages() on ENOMEM when doing a data integrity writeback Theodore Ts'o
  1 sibling, 1 reply; 25+ messages in thread
From: Jeff Layton @ 2017-03-09 10:47 UTC (permalink / raw)
  To: Jan Kara, Theodore Ts'o
  Cc: Ross Zwisler, viro, konishi.ryusuke, linux-fsdevel, linux-mm,
	linux-kernel, linux-nilfs, NeilBrown

On Thu, 2017-03-09 at 10:04 +0100, Jan Kara wrote:
> On Wed 08-03-17 21:57:25, Ted Tso wrote:
> > On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> > > On a more general note (DAX is actually fine here), I find the current
> > > practice of clearing page dirty bits on error and reporting it just once
> > > problematic. It keeps the system running but data is lost and possibly
> > > without getting the error anywhere where it is useful. We get away with
> > > this because it is a rare event but it seems like a problematic behavior.
> > > But this is more for the discussion at LSF.
> > 
> > I'm actually running into this in the last day or two because some MM
> > folks at $WORK have been trying to push hard for GFP_NOFS removal in
> > ext4 (at least when we are holding some mutex/semaphore like
> > i_data_sem) because otherwise it's possible for the OOM killer to be
> > unable to kill processes because they are holding on to locks that
> > ext4 is holding.
> > 
> > I've done some initial investigation, and while it's not that hard to
> > remove GFP_NOFS from certain parts of the writepages() codepath (which
> > is where we had been are running into problems), a really, REALLY big
> > problem is if any_filesystem->writepages() returns ENOMEM, it causes
> > silent data loss, because the pages are marked clean, and so data
> > written using buffered writeback goes *poof*.
> > 
> > I confirmed this by creating a test kernel with a simple patch such
> > that if the ext4 file system is mounted with -o debug, there was a 1
> > in 16 chance that ext4_writepages will immediately return with ENOMEM
> > (and printk the inode number, so I knew which inodes had gotten the
> > ENOMEM treatment).  The result was **NOT** pretty.
> > 
> > What I think we should strongly consider is at the very least, special
> > case ENOMEM being returned by writepages() during background
> > writeback, and *not* mark the pages clean, and make sure the inode
> > stays on the dirty inode list, so we can retry the write later.  This
> > is especially important since the process that issued the write may
> > have gone away, so there might not even be a userspace process to
> > complain to.  By converting certain page allocations (most notably in
> > ext4_mb_load_buddy) from GFP_NOFS to GFP_KMALLOC, this allows us to
> > release the i_data_sem lock and return an error.  This should allow
> > allow the OOM killer to do its dirty deed, and hopefully we can retry
> > the writepages() for that inode later.
> 
> Yeah, so if we can hope the error is transient, keeping pages dirty and
> retrying the write is definitely better option. For start we can say that
> ENOMEM, EINTR, EAGAIN, ENOSPC errors are transient, anything else means
> there's no hope of getting data to disk and so we just discard them. It
> will be somewhat rough distinction but probably better than what we have
> now.
> 
> 								Honza

I'm not sure about ENOSPC there. That's a return code that is
specifically expected to be returned by fsync. It seems like that ought
not to be considered a transient error?

-- 
Jeff Layton <jlayton@redhat.com>


* Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business
  2017-03-09 10:47           ` Jeff Layton
@ 2017-03-09 11:02             ` Jan Kara
  2017-03-09 12:43               ` Jeff Layton
  0 siblings, 1 reply; 25+ messages in thread
From: Jan Kara @ 2017-03-09 11:02 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Jan Kara, Theodore Ts'o, Ross Zwisler, viro, konishi.ryusuke,
	linux-fsdevel, linux-mm, linux-kernel, linux-nilfs, NeilBrown

On Thu 09-03-17 05:47:51, Jeff Layton wrote:
> On Thu, 2017-03-09 at 10:04 +0100, Jan Kara wrote:
> > On Wed 08-03-17 21:57:25, Ted Tso wrote:
> > > On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> > > > On a more general note (DAX is actually fine here), I find the current
> > > > practice of clearing page dirty bits on error and reporting it just once
> > > > problematic. It keeps the system running but data is lost and possibly
> > > > without getting the error anywhere where it is useful. We get away with
> > > > this because it is a rare event but it seems like a problematic behavior.
> > > > But this is more for the discussion at LSF.
> > > 
> > > I'm actually running into this in the last day or two because some MM
> > > folks at $WORK have been trying to push hard for GFP_NOFS removal in
> > > ext4 (at least when we are holding some mutex/semaphore like
> > > i_data_sem) because otherwise it's possible for the OOM killer to be
> > > unable to kill processes because they are holding on to locks that
> > > ext4 is holding.
> > > 
> > > I've done some initial investigation, and while it's not that hard to
> > > remove GFP_NOFS from certain parts of the writepages() codepath (which
> > > is where we had been are running into problems), a really, REALLY big
> > > problem is if any_filesystem->writepages() returns ENOMEM, it causes
> > > silent data loss, because the pages are marked clean, and so data
> > > written using buffered writeback goes *poof*.
> > > 
> > > I confirmed this by creating a test kernel with a simple patch such
> > > that if the ext4 file system is mounted with -o debug, there was a 1
> > > in 16 chance that ext4_writepages will immediately return with ENOMEM
> > > (and printk the inode number, so I knew which inodes had gotten the
> > > ENOMEM treatment).  The result was **NOT** pretty.
> > > 
> > > What I think we should strongly consider is at the very least, special
> > > case ENOMEM being returned by writepages() during background
> > > writeback, and *not* mark the pages clean, and make sure the inode
> > > stays on the dirty inode list, so we can retry the write later.  This
> > > is especially important since the process that issued the write may
> > > have gone away, so there might not even be a userspace process to
> > > complain to.  By converting certain page allocations (most notably in
> > > ext4_mb_load_buddy) from GFP_NOFS to GFP_KMALLOC, this allows us to
> > > release the i_data_sem lock and return an error.  This should allow
> > > allow the OOM killer to do its dirty deed, and hopefully we can retry
> > > the writepages() for that inode later.
> > 
> > Yeah, so if we can hope the error is transient, keeping pages dirty and
> > retrying the write is definitely better option. For start we can say that
> > ENOMEM, EINTR, EAGAIN, ENOSPC errors are transient, anything else means
> > there's no hope of getting data to disk and so we just discard them. It
> > will be somewhat rough distinction but probably better than what we have
> > now.
> > 
> > 								Honza
> 
> I'm not sure about ENOSPC there. That's a return code that is
> specifically expected to be returned by fsync. It seems like that ought
> not be considered a transient error?

Yeah, for a start we should probably keep ENOSPC as is to prevent
surprises. Long term, we may need to make at least some ENOSPC
situations behave as transient so that thin-provisioned storage does not
lose data when the admin does not supply additional space fast enough
(i.e., before ENOSPC is actually hit).

EIO is actually in a similar bucket, although probably more on the "hard
failure" side - I can imagine there can be types of storage and
situations where the loss of connectivity to the storage is only
transient. But for a start I would not bother with this.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business
  2017-03-09 11:02             ` Jan Kara
@ 2017-03-09 12:43               ` Jeff Layton
  2017-03-09 13:22                 ` Brian Foster
  2017-03-09 14:21                 ` Theodore Ts'o
  0 siblings, 2 replies; 25+ messages in thread
From: Jeff Layton @ 2017-03-09 12:43 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Ts'o, Ross Zwisler, viro, konishi.ryusuke,
	linux-fsdevel, linux-mm, linux-kernel, linux-nilfs, NeilBrown

On Thu, 2017-03-09 at 12:02 +0100, Jan Kara wrote:
> On Thu 09-03-17 05:47:51, Jeff Layton wrote:
> > On Thu, 2017-03-09 at 10:04 +0100, Jan Kara wrote:
> > > On Wed 08-03-17 21:57:25, Ted Tso wrote:
> > > > On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> > > > > On a more general note (DAX is actually fine here), I find the current
> > > > > practice of clearing page dirty bits on error and reporting it just once
> > > > > problematic. It keeps the system running but data is lost and possibly
> > > > > without getting the error anywhere where it is useful. We get away with
> > > > > this because it is a rare event but it seems like a problematic behavior.
> > > > > But this is more for the discussion at LSF.
> > > > 
> > > > I'm actually running into this in the last day or two because some MM
> > > > folks at $WORK have been trying to push hard for GFP_NOFS removal in
> > > > ext4 (at least when we are holding some mutex/semaphore like
> > > > i_data_sem) because otherwise it's possible for the OOM killer to be
> > > > unable to kill processes because they are holding on to locks that
> > > > ext4 is holding.
> > > > 
> > > > I've done some initial investigation, and while it's not that hard to
> > > > remove GFP_NOFS from certain parts of the writepages() codepath (which
> > > > is where we had been are running into problems), a really, REALLY big
> > > > problem is if any_filesystem->writepages() returns ENOMEM, it causes
> > > > silent data loss, because the pages are marked clean, and so data
> > > > written using buffered writeback goes *poof*.
> > > > 
> > > > I confirmed this by creating a test kernel with a simple patch such
> > > > that if the ext4 file system is mounted with -o debug, there was a 1
> > > > in 16 chance that ext4_writepages will immediately return with ENOMEM
> > > > (and printk the inode number, so I knew which inodes had gotten the
> > > > ENOMEM treatment).  The result was **NOT** pretty.
> > > > 
> > > > What I think we should strongly consider is at the very least, special
> > > > case ENOMEM being returned by writepages() during background
> > > > writeback, and *not* mark the pages clean, and make sure the inode
> > > > stays on the dirty inode list, so we can retry the write later.  This
> > > > is especially important since the process that issued the write may
> > > > have gone away, so there might not even be a userspace process to
> > > > complain to.  By converting certain page allocations (most notably in
> > > > ext4_mb_load_buddy) from GFP_NOFS to GFP_KMALLOC, this allows us to
> > > > release the i_data_sem lock and return an error.  This should allow
> > > > allow the OOM killer to do its dirty deed, and hopefully we can retry
> > > > the writepages() for that inode later.
> > > 
> > > Yeah, so if we can hope the error is transient, keeping pages dirty and
> > > retrying the write is definitely better option. For start we can say that
> > > ENOMEM, EINTR, EAGAIN, ENOSPC errors are transient, anything else means
> > > there's no hope of getting data to disk and so we just discard them. It
> > > will be somewhat rough distinction but probably better than what we have
> > > now.
> > > 
> > > 								Honza
> > 
> > I'm not sure about ENOSPC there. That's a return code that is
> > specifically expected to be returned by fsync. It seems like that ought
> > not be considered a transient error?
> 
> Yeah, for start we should probably keep ENOSPC as is to prevent surprises.
> Long term, we may need to make at least some ENOSPC situations behave as
> transient to make thin provisioned storage not loose data in case admin
> does not supply additional space fast enough (i.e., before ENOSPC is
> actually hit).
> 

Maybe we need a systemwide (or fs-level) tunable that makes ENOSPC a
transient error? Just have it hang until we get enough space when that
tunable is enabled?

> EIO is actually in a similar bucket although probably more on the "hard
> failure" side - I can imagine there can by types of storage and situations
> where the loss of connectivity to the storage is only transient. But for
> start I would not bother with this.
> 
> 								Honza

I don't see what we can reasonably do with -EIO other than return a hard
error. If we want to deal with loss of connectivity to storage as a
transient failure, I think that we'd need to ensure that the lower
layers return more distinct error codes in those cases (ENODEV or ENXIO
maybe? Or declare a new kernel-internal code -- EDEVGONE?).

In any case, I think that the basic idea of marking certain
writepage/writepages/launder_page errors as transient might be a
reasonable approach to handling this sanely.

The problem with all of this though is that we have a pile of existing
code that will likely need to be reworked for the new error handling. I
expect that we'll have to walk all of the
writepage/writepages/launder_page implementations and fix them up one by
one once we sort out the rules for this.
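
One way that per-implementation rework might look, modeled in plain C
with stand-in types (everything here - sim_page, the helper names, the
fields - is hypothetical scaffolding, not kernel API):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for a page's writeback-relevant state. */
struct sim_page {
	bool dirty;
	int mapping_error;	/* sticky, reported at fsync() time */
};

static bool error_is_transient(int err)
{
	return err == -ENOMEM || err == -EINTR || err == -EAGAIN;
}

/* Sketch of the rule each writepage/writepages/launder_page
 * implementation would follow: on a transient error keep the page
 * dirty so later writeback retries it; on a hard error record the
 * error in the mapping so fsync() sees it, and let the page go clean. */
static void handle_writepage_error(struct sim_page *page, int err)
{
	if (error_is_transient(err)) {
		page->dirty = true;		/* retried by later writeback */
	} else {
		page->dirty = false;		/* data is lost... */
		page->mapping_error = err;	/* ...but the error is reported */
	}
}
```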

-- 
Jeff Layton <jlayton@redhat.com>


* Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business
  2017-03-09 12:43               ` Jeff Layton
@ 2017-03-09 13:22                 ` Brian Foster
  2017-03-09 14:21                 ` Theodore Ts'o
  1 sibling, 0 replies; 25+ messages in thread
From: Brian Foster @ 2017-03-09 13:22 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Jan Kara, Theodore Ts'o, Ross Zwisler, viro, konishi.ryusuke,
	linux-fsdevel, linux-mm, linux-kernel, linux-nilfs, NeilBrown

On Thu, Mar 09, 2017 at 07:43:12AM -0500, Jeff Layton wrote:
> On Thu, 2017-03-09 at 12:02 +0100, Jan Kara wrote:
> > On Thu 09-03-17 05:47:51, Jeff Layton wrote:
> > > On Thu, 2017-03-09 at 10:04 +0100, Jan Kara wrote:
> > > > On Wed 08-03-17 21:57:25, Ted Tso wrote:
> > > > > On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> > > > > > On a more general note (DAX is actually fine here), I find the current
> > > > > > practice of clearing page dirty bits on error and reporting it just once
> > > > > > problematic. It keeps the system running but data is lost and possibly
> > > > > > without getting the error anywhere where it is useful. We get away with
> > > > > > this because it is a rare event but it seems like a problematic behavior.
> > > > > > But this is more for the discussion at LSF.
> > > > > 
> > > > > I'm actually running into this in the last day or two because some MM
> > > > > folks at $WORK have been trying to push hard for GFP_NOFS removal in
> > > > > ext4 (at least when we are holding some mutex/semaphore like
> > > > > i_data_sem) because otherwise it's possible for the OOM killer to be
> > > > > unable to kill processes because they are holding on to locks that
> > > > > ext4 is holding.
> > > > > 
> > > > > I've done some initial investigation, and while it's not that hard to
> > > > > remove GFP_NOFS from certain parts of the writepages() codepath (which
> > > > > is where we had been are running into problems), a really, REALLY big
> > > > > problem is if any_filesystem->writepages() returns ENOMEM, it causes
> > > > > silent data loss, because the pages are marked clean, and so data
> > > > > written using buffered writeback goes *poof*.
> > > > > 
> > > > > I confirmed this by creating a test kernel with a simple patch such
> > > > > that if the ext4 file system is mounted with -o debug, there was a 1
> > > > > in 16 chance that ext4_writepages will immediately return with ENOMEM
> > > > > (and printk the inode number, so I knew which inodes had gotten the
> > > > > ENOMEM treatment).  The result was **NOT** pretty.
> > > > > 
> > > > > What I think we should strongly consider is at the very least, special
> > > > > case ENOMEM being returned by writepages() during background
> > > > > writeback, and *not* mark the pages clean, and make sure the inode
> > > > > stays on the dirty inode list, so we can retry the write later.  This
> > > > > is especially important since the process that issued the write may
> > > > > have gone away, so there might not even be a userspace process to
> > > > > complain to.  By converting certain page allocations (most notably in
> > > > > ext4_mb_load_buddy) from GFP_NOFS to GFP_KMALLOC, this allows us to
> > > > > release the i_data_sem lock and return an error.  This should allow
> > > > > allow the OOM killer to do its dirty deed, and hopefully we can retry
> > > > > the writepages() for that inode later.
> > > > 
> > > > Yeah, so if we can hope the error is transient, keeping pages dirty and
> > > > retrying the write is definitely better option. For start we can say that
> > > > ENOMEM, EINTR, EAGAIN, ENOSPC errors are transient, anything else means
> > > > there's no hope of getting data to disk and so we just discard them. It
> > > > will be somewhat rough distinction but probably better than what we have
> > > > now.
> > > > 
> > > > 								Honza
> > > 
> > > I'm not sure about ENOSPC there. That's a return code that is
> > > specifically expected to be returned by fsync. It seems like that ought
> > > not be considered a transient error?
> > 
> > Yeah, for start we should probably keep ENOSPC as is to prevent surprises.
> > Long term, we may need to make at least some ENOSPC situations behave as
> > transient to make thin provisioned storage not loose data in case admin
> > does not supply additional space fast enough (i.e., before ENOSPC is
> > actually hit).
> > 
> 
> Maybe we need a systemwide (or fs-level) tunable that makes ENOSPC a
> transient error? Just have it hang until we get enough space when that
> tunable is enabled?
> 

Just FYI, XFS has a similar error configuration mechanism that we use
for essentially this purpose when dealing with metadata buffer I/O
errors.  The original motivation was to help us deal with the varying
requirements for thin provisioning. I.e., whether a particular error is
permanent or transient depends on the admin's preference and affects
whether the filesystem continuously retries failed I/Os in anticipation
of future success or gives up quickly and shuts down (or something in
between).

See roughly commits 192852be8b5 through e6b3bb7896 for the initial code,
/sys/fs/xfs/<dev>/error on an XFS filesystem for the user interface, and
the "Error handling" section of Documentation/filesystems/xfs.txt for
information on how the interface works.

Brian

> > EIO is actually in a similar bucket although probably more on the "hard
> > failure" side - I can imagine there can by types of storage and situations
> > where the loss of connectivity to the storage is only transient. But for
> > start I would not bother with this.
> > 
> > 								Honza
> 
> I don't see what we can reasonably do with -EIO other than return a hard
> error. If we want to deal with loss of connectivity to storage as a
> transient failure, I think that we'd need to ensure that the lower
> layers return more distinct error codes in those cases (ENODEV or ENXIO
> maybe? Or declare a new kernel-internal code -- EDEVGONE?).
> 
> In any case, I think that the basic idea of marking certain
> writepage/writepages/launder_page errors as transient might be a
> reasonable approach to handling this sanely.
> 
> The problem with all of this though is that we have a pile of existing
> code that will likely need to be reworked for the new error handling. I
> expect that we'll have to walk all of the
> writepage/writepages/launder_page implementations and fix them up one by
> one once we sort out the rules for this.
> 
> -- 
> Jeff Layton <jlayton@redhat.com>


* Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business
  2017-03-09 12:43               ` Jeff Layton
  2017-03-09 13:22                 ` Brian Foster
@ 2017-03-09 14:21                 ` Theodore Ts'o
  1 sibling, 0 replies; 25+ messages in thread
From: Theodore Ts'o @ 2017-03-09 14:21 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Jan Kara, Ross Zwisler, viro, konishi.ryusuke, linux-fsdevel,
	linux-mm, linux-kernel, linux-nilfs, NeilBrown

On Thu, Mar 09, 2017 at 07:43:12AM -0500, Jeff Layton wrote:
> 
> Maybe we need a systemwide (or fs-level) tunable that makes ENOSPC a
> transient error? Just have it hang until we get enough space when that
> tunable is enabled?

Or maybe we need a new kernel-internal errno (a la ERESTARTSYS) which
means it's a "soft ENOSPC"?  It would get translated to ENOSPC if it
gets propagated to userspace, but that way devices like dm-thin or
other storage arrays with thin volumes could send back a soft ENOSPC,
while for file systems where "ENOSPC means ENOSPC", we can treat those
as a hard ENOSPC.
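
A toy model of that idea: an in-kernel value, handled like ERESTARTSYS,
that is translated before it can escape to userspace. The name
ENOSPC_SOFT and its numeric value are invented for illustration:

```c
#include <assert.h>
#include <errno.h>

/* Invented kernel-internal code, a la ERESTARTSYS: chosen above 512 so
 * it cannot collide with errnos that userspace is allowed to see. */
#define ENOSPC_SOFT	521

/* Sketch of the translation step at the syscall boundary: a soft
 * ENOSPC (thin-provisioned pool momentarily full) is reported to
 * userspace as plain ENOSPC, while inside the kernel the two remain
 * distinguishable, so writeback can treat the soft case as transient. */
static int errno_to_user(int err)
{
	if (err == -ENOSPC_SOFT)
		return -ENOSPC;
	return err;
}
```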

						- Ted


* [RFC PATCH] mm: retry writepages() on ENOMEM when doing a data integrity writeback
  2017-03-09  9:04         ` Jan Kara
  2017-03-09 10:47           ` Jeff Layton
@ 2017-03-15  5:07           ` Theodore Ts'o
  2017-03-15 11:59             ` Jan Kara
  2017-03-15 13:03             ` Michal Hocko
  1 sibling, 2 replies; 25+ messages in thread
From: Theodore Ts'o @ 2017-03-15  5:07 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Theodore Ts'o, Jan Kara, Michal Hocko, linux-mm

Currently, a file system's writepages() function must not fail with
ENOMEM, since if it does, it's possible for buffered data to be lost.
This is because on a data integrity writeback writepages() gets called
but once, and if it returns ENOMEM, then if you're lucky the error will
get reflected back to the userspace process calling fsync() --- at which
point the application may or may not be properly checking error codes.
If you aren't lucky, the user is unmounting the file system, and the
dirty pages will simply be lost.

For this reason, file system code generally will use GFP_NOFS, and in
some cases, will retry the allocation in a loop, on the theory that
"kernel livelocks are temporary; data loss is forever".
Unfortunately, this can indeed cause livelocks, since inside the
writepages() call, the file system is holding various mutexes, and
these mutexes may prevent the OOM killer from killing its targeted
victim if it is also holding on to those mutexes.

A better solution would be to allow writepages() to call the memory
allocator with flags that give greater latitude to the allocator to
fail, and then release its locks and return ENOMEM, and in the case of
background writeback, the writes can be retried at a later time.  In
the case of data-integrity writeback, the write can be retried after
waiting a brief amount of time.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---

As we had discussed in an e-mail thread last week, I'm interested in
allowing ext4_writepages() to return ENOMEM without causing dirty
pages from buffered writes getting lost.  It looks like doing so
should be fairly straightforward.   What do folks think?

 mm/page-writeback.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 290e8b7d3181..8666d3f3c57a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2352,10 +2352,16 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 
 	if (wbc->nr_to_write <= 0)
 		return 0;
-	if (mapping->a_ops->writepages)
-		ret = mapping->a_ops->writepages(mapping, wbc);
-	else
-		ret = generic_writepages(mapping, wbc);
+	while (1) {
+		if (mapping->a_ops->writepages)
+			ret = mapping->a_ops->writepages(mapping, wbc);
+		else
+			ret = generic_writepages(mapping, wbc);
+		if ((ret != ENOMEM) || (wbc->sync_mode != WB_SYNC_ALL))
+			break;
+		cond_resched();
+		congestion_wait(BLK_RW_ASYNC, HZ/50);
+	}
 	return ret;
 }
 
-- 
2.11.0.rc0.7.gbe5a750


* Re: [RFC PATCH] mm: retry writepages() on ENOMEM when doing a data integrity writeback
  2017-03-15  5:07           ` [RFC PATCH] mm: retry writepages() on ENOMEM when doing a data integrity writeback Theodore Ts'o
@ 2017-03-15 11:59             ` Jan Kara
  2017-03-15 14:09               ` Theodore Ts'o
  2017-03-15 13:03             ` Michal Hocko
  1 sibling, 1 reply; 25+ messages in thread
From: Jan Kara @ 2017-03-15 11:59 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-fsdevel, Jan Kara, Michal Hocko, linux-mm

On Wed 15-03-17 01:07:43, Ted Tso wrote:
> Currently, file system's writepages() function must not fail with an
> ENOMEM, since if they do, it's possible for buffered data to be lost.
> This is because on a data integrity writeback writepages() gets called
> but once, and if it returns ENOMEM and you're lucky the error will get
> reflected back to the userspace process calling fsync() --- at which
> point the application may or may not be properly checking error codes.
> If you aren't lucky, the user is unmounting the file system, and the
> dirty pages will simply be lost.
> 
> For this reason, file system code generally will use GFP_NOFS, and in
> some cases, will retry the allocation in a loop, on the theory that
> "kernel livelocks are temporary; data loss is forever".
> Unfortunately, this can indeed cause livelocks, since inside the
> writepages() call, the file system is holding various mutexes, and
> these mutexes may prevent the OOM killer from killing its targetted
> victim if it is also holding on to those mutexes.
> 
> A better solution would be to allow writepages() to call the memory
> allocator with flags that give greater latitude to the allocator to
> fail, and then release its locks and return ENOMEM, and in the case of
> background writeback, the writes can be retried at a later time.  In
> the case of data-integrity writeback retry after waiting a brief
> amount of time.
> 
> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> ---
> 
> As we had discussed in an e-mail thread last week, I'm interested in
> allowing ext4_writepages() to return ENOMEM without causing dirty
> pages from buffered writes getting list.  It looks like doing so
> should be fairly straightforward.   What do folks think?

Makes sense to me. One comment below:


> +	while (1) {
> +		if (mapping->a_ops->writepages)
> +			ret = mapping->a_ops->writepages(mapping, wbc);
> +		else
> +			ret = generic_writepages(mapping, wbc);
> +		if ((ret != ENOMEM) || (wbc->sync_mode != WB_SYNC_ALL))

-ENOMEM I guess...


> +			break;
> +		cond_resched();
> +		congestion_wait(BLK_RW_ASYNC, HZ/50);
> +	}
>  	return ret;
>  }

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [RFC PATCH] mm: retry writepages() on ENOMEM when doing a data integrity writeback
  2017-03-15  5:07           ` [RFC PATCH] mm: retry writepages() on ENOMEM when doing a data integrity writeback Theodore Ts'o
  2017-03-15 11:59             ` Jan Kara
@ 2017-03-15 13:03             ` Michal Hocko
  2017-03-16 10:18               ` Tetsuo Handa
  1 sibling, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2017-03-15 13:03 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-fsdevel, Jan Kara, linux-mm

On Wed 15-03-17 01:07:43, Theodore Ts'o wrote:
> Currently, file system's writepages() function must not fail with an
> ENOMEM, since if they do, it's possible for buffered data to be lost.
> This is because on a data integrity writeback writepages() gets called
> but once, and if it returns ENOMEM and you're lucky the error will get
> reflected back to the userspace process calling fsync() --- at which
> point the application may or may not be properly checking error codes.
> If you aren't lucky, the user is unmounting the file system, and the
> dirty pages will simply be lost.
> 
> For this reason, file system code generally will use GFP_NOFS, and in
> some cases, will retry the allocation in a loop, on the theory that
> "kernel livelocks are temporary; data loss is forever".
> Unfortunately, this can indeed cause livelocks, since inside the
> writepages() call, the file system is holding various mutexes, and
> these mutexes may prevent the OOM killer from killing its targeted
> victim if it is also holding on to those mutexes.

The victim might be looping inside do_writepages now instead (especially
when the memory reserves are depleted), though. On the other hand the
recent OOM killer changes do not rely on the oom victim exiting anymore.
We try to reap as much memory from its address space as possible
which alone should help us to move on. Even if that is not sufficient we
will move on to another victim. So unless everything is in this path and
all the memory is sitting unreachable from the reapable address space we
should be safe.

> A better solution would be to allow writepages() to call the memory
> allocator with flags that give greater latitude to the allocator to
> fail, and then release its locks and return ENOMEM, and in the case of
> background writeback, the writes can be retried at a later time.  In
> the case of data-integrity writeback retry after waiting a brief
> amount of time.

Yes, that sounds reasonable to me. BTW, I was proposing
__GFP_RETRY_MAYFAIL recently [1], which sounds like a good fit here.

[1] http://lkml.kernel.org/r/20170307154843.32516-1-mhocko@kernel.org

> Signed-off-by: Theodore Ts'o <tytso@mit.edu>

The patch looks good to me, but I am not familiar enough with all the
callers to be fully qualified to give my Acked-by.

> ---
> 
> As we had discussed in an e-mail thread last week, I'm interested in
> allowing ext4_writepages() to return ENOMEM without causing dirty
> pages from buffered writes getting lost.  It looks like doing so
> should be fairly straightforward.   What do folks think?
> 
>  mm/page-writeback.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 290e8b7d3181..8666d3f3c57a 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2352,10 +2352,16 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
>  
>  	if (wbc->nr_to_write <= 0)
>  		return 0;
> -	if (mapping->a_ops->writepages)
> -		ret = mapping->a_ops->writepages(mapping, wbc);
> -	else
> -		ret = generic_writepages(mapping, wbc);
> +	while (1) {
> +		if (mapping->a_ops->writepages)
> +			ret = mapping->a_ops->writepages(mapping, wbc);
> +		else
> +			ret = generic_writepages(mapping, wbc);
> +		if ((ret != ENOMEM) || (wbc->sync_mode != WB_SYNC_ALL))
> +			break;
> +		cond_resched();
> +		congestion_wait(BLK_RW_ASYNC, HZ/50);
> +	}
>  	return ret;
>  }
>  
> -- 
> 2.11.0.rc0.7.gbe5a750

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH] mm: retry writepages() on ENOMEM when doing a data integrity writeback
  2017-03-15 11:59             ` Jan Kara
@ 2017-03-15 14:09               ` Theodore Ts'o
  0 siblings, 0 replies; 25+ messages in thread
From: Theodore Ts'o @ 2017-03-15 14:09 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Michal Hocko, linux-mm

On Wed, Mar 15, 2017 at 12:59:33PM +0100, Jan Kara wrote:
> > +	while (1) {
> > +		if (mapping->a_ops->writepages)
> > +			ret = mapping->a_ops->writepages(mapping, wbc);
> > +		else
> > +			ret = generic_writepages(mapping, wbc);
> > +		if ((ret != ENOMEM) || (wbc->sync_mode != WB_SYNC_ALL))
> 
> -ENOMEM I guess...

Oops.  Thanks for noticing!

Unless anyone has any objections I plan to carry this in the ext4
tree.

						- Ted

From 063312672cf277b12e337e91309672499bc797f7 Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <tytso@mit.edu>
Date: Tue, 14 Mar 2017 21:13:04 -0400
Subject: [PATCH -v2] mm: retry writepages() on ENOMEM when doing a data integrity writeback

Currently, file system's writepages() function must not fail with an
ENOMEM, since if they do, it's possible for buffered data to be lost.
This is because on a data integrity writeback writepages() gets called
but once, and if it returns ENOMEM, if you're lucky the error will get
reflected back to the userspace process calling fsync().  If you
aren't lucky, the user is unmounting the file system, and the dirty
pages will simply be lost.

For this reason, file system code generally will use GFP_NOFS, and in
some cases, will retry the allocation in a loop, on the theory that
"kernel livelocks are temporary; data loss is forever".
Unfortunately, this can indeed cause livelocks, since inside the
writepages() call, the file system is holding various mutexes, and
these mutexes may prevent the OOM killer from killing its targeted
victim if it is also holding on to those mutexes.

A better solution would be to allow writepages() to call the memory
allocator with flags that give greater latitude to the allocator to
fail, and then release its locks and return ENOMEM, and in the case of
background writeback, the writes can be retried at a later time.  In
the case of data-integrity writeback retry after waiting a brief
amount of time.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 mm/page-writeback.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 290e8b7d3181..c623cef68a8e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2352,10 +2352,16 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 
 	if (wbc->nr_to_write <= 0)
 		return 0;
-	if (mapping->a_ops->writepages)
-		ret = mapping->a_ops->writepages(mapping, wbc);
-	else
-		ret = generic_writepages(mapping, wbc);
+	while (1) {
+		if (mapping->a_ops->writepages)
+			ret = mapping->a_ops->writepages(mapping, wbc);
+		else
+			ret = generic_writepages(mapping, wbc);
+		if ((ret != -ENOMEM) || (wbc->sync_mode != WB_SYNC_ALL))
+			break;
+		cond_resched();
+		congestion_wait(BLK_RW_ASYNC, HZ/50);
+	}
 	return ret;
 }
 
-- 
2.11.0.rc0.7.gbe5a750


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH] mm: retry writepages() on ENOMEM when doing a data integrity writeback
  2017-03-15 13:03             ` Michal Hocko
@ 2017-03-16 10:18               ` Tetsuo Handa
  0 siblings, 0 replies; 25+ messages in thread
From: Tetsuo Handa @ 2017-03-16 10:18 UTC (permalink / raw)
  To: Michal Hocko, Theodore Ts'o; +Cc: linux-fsdevel, Jan Kara, linux-mm

On 2017/03/15 22:03, Michal Hocko wrote:
> On Wed 15-03-17 01:07:43, Theodore Ts'o wrote:
>> Unfortunately, this can indeed cause livelocks, since inside the
>> writepages() call, the file system is holding various mutexes, and
>> these mutexes may prevent the OOM killer from killing its targeted
>> victim if it is also holding on to those mutexes.
> 
> The victim might be looping inside do_writepages now instead (especially
> when the memory reserves are depleted), though. On the other hand the
> recent OOM killer changes do not rely on the oom victim exiting anymore.

True only if CONFIG_MMU=y.

> We try to reap as much memory from its address space as possible
> which alone should help us to move on. Even if that is not sufficient we
> will move on to another victim. So unless everything is in this path and
> all the memory is sitting unreachable from the reapable address space we
> should be safe.

If the caller is doing sync() or umount() syscall, isn't it reasonable
to bail out if fatal_signal_pending() is true, because it is the caller's
responsibility to check whether sync() or umount() succeeded? Though,
I don't know whether writepages() can preserve data for later retry by
other callers.


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2017-03-16 10:18 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-05 13:35 [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business Jeff Layton
2017-03-05 13:35 ` [PATCH 1/3] nilfs2: set the mapping error when calling SetPageError on writeback Jeff Layton
2017-03-07 13:46   ` Ryusuke Konishi
2017-03-05 13:35 ` [PATCH 2/3] mm: don't TestClearPageError in __filemap_fdatawait_range Jeff Layton
2017-03-05 13:35 ` [PATCH 3/3] mm: set mapping error when launder_pages fails Jeff Layton
2017-03-05 14:40 ` [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business Jeff Layton
2017-03-06 23:08   ` Ross Zwisler
2017-03-07 10:26     ` Jan Kara
2017-03-07 14:03       ` Jeff Layton
2017-03-07 15:59       ` Ross Zwisler
2017-03-07 16:17         ` Jan Kara
2017-03-09  2:57       ` Theodore Ts'o
2017-03-09  9:04         ` Jan Kara
2017-03-09 10:47           ` Jeff Layton
2017-03-09 11:02             ` Jan Kara
2017-03-09 12:43               ` Jeff Layton
2017-03-09 13:22                 ` Brian Foster
2017-03-09 14:21                 ` Theodore Ts'o
2017-03-15  5:07           ` [RFC PATCH] mm: retry writepages() on ENOMEM when doing a data integrity writeback Theodore Ts'o
2017-03-15 11:59             ` Jan Kara
2017-03-15 14:09               ` Theodore Ts'o
2017-03-15 13:03             ` Michal Hocko
2017-03-16 10:18               ` Tetsuo Handa
2017-03-06  3:06 ` [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business NeilBrown
2017-03-06 11:43   ` Jeff Layton
