All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
@ 2017-03-31 19:25 Jeff Layton
  2017-03-31 19:26 ` [RFC PATCH 1/4] fs: new infrastructure for writeback error handling and reporting Jeff Layton
                   ` (4 more replies)
  0 siblings, 5 replies; 54+ messages in thread
From: Jeff Layton @ 2017-03-31 19:25 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, linux-ext4, akpm, tytso, jack, willy, neilb

During LSF/MM this year, we had a discussion about the current sorry
state of writeback error reporting, and what could be done to improve
the situation. This patchset represents a first pass at the proposal
I made there.

It first adds a new set of writeback error tracking infrastructure to
ensure that errors are properly stored and reported at fsync time. It
also makes a small but significant change to ensure that writeback
errors are reported on all file descriptors, not just on the first one
where fsync is called.

Note that this is a _very_ rough draft at this point. I did some by-hand
testing with dm-error to ensure that it does the right thing there.
Mostly I'm interested in early feedback at this point -- does this basic
approach make sense?

Jeff Layton (4):
  fs: new infrastructure for writeback error handling and reporting
  dax: set errors in mapping when writeback fails
  buffer: set wb errors using both new and old infrastructure for now
  ext4: wire it up to the new writeback error reporting infrastructure

 Documentation/filesystems/vfs.txt | 14 +++++++--
 fs/buffer.c                       |  6 +++-
 fs/dax.c                          |  4 ++-
 fs/ext4/dir.c                     |  1 +
 fs/ext4/ext4.h                    |  1 +
 fs/ext4/file.c                    |  1 +
 fs/ext4/fsync.c                   | 15 +++++++---
 fs/ext4/inode.c                   |  2 +-
 fs/ext4/page-io.c                 |  4 +--
 fs/open.c                         |  3 ++
 include/linux/fs.h                |  5 ++++
 mm/filemap.c                      | 61 +++++++++++++++++++++++++++++++++++++++
 12 files changed, 106 insertions(+), 11 deletions(-)

-- 
2.9.3

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH 1/4] fs: new infrastructure for writeback error handling and reporting
  2017-03-31 19:25 [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it Jeff Layton
@ 2017-03-31 19:26 ` Jeff Layton
  2017-04-03  7:12   ` Nikolay Borisov
  2017-04-03 14:47   ` Matthew Wilcox
  2017-03-31 19:26 ` [RFC PATCH 2/4] dax: set errors in mapping when writeback fails Jeff Layton
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 54+ messages in thread
From: Jeff Layton @ 2017-03-31 19:26 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, linux-ext4, akpm, tytso, jack, willy, neilb

Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.

It's those non-fsync callers that are problematic. We should be
reporting writeback errors during fsync, but many places in the code
clear out errors before they can be properly reported, or report errors
at nonsensical times. If I get -EIO on a stat() call, how do I know that
was because writeback failed?

This patch adds a small bit of new infrastructure for setting and
reporting errors during pagecache writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.

In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fd may not be associated with one another in any way. They could even be
in different containers, so ensuring coordination between all fsync
callers is not really an option.

One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.

This set adds a wb_error field and a sequence counter to the
address_space, and a corresponding sequence counter in the struct file.
When errors are reported during writeback, we set the error field in the
mapping and increment the sequence counter.

When fsync or flush is called, we check the sequence in the file vs. the
one in the mapping. If the file's counter is behind the one in the
mapping, then we update the sequence counter in the file to the value of
the one in the mapping and report the error. If the file is "caught up"
then we just report 0.

This changes the semantics of fsync such that applications can now use
it to determine whether there were any writeback errors since fsync(fd)
was last called (or since the file was opened in the case of fsync
having never been called).

Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.

The basic idea here is for filesystems to use filemap_set_wb_error to
set the error in the mapping when there are writeback errors, and then
have the fsync and flush operations call filemap_report_wb_error just
before returning to ensure that those errors get reported properly.

Eventually, it may make sense to move the reporting into the generic
vfs_fsync_range helper, but doing it this way for now makes it simpler
to convert filesystems to the new API individually.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 Documentation/filesystems/vfs.txt | 14 +++++++--
 fs/open.c                         |  3 ++
 include/linux/fs.h                |  5 ++++
 mm/filemap.c                      | 61 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 81 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 569211703721..b2b5e411b340 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -577,6 +577,11 @@ should clear PG_Dirty and set PG_Writeback.  It can be actually
 written at any point after PG_Dirty is clear.  Once it is known to be
 safe, PG_Writeback is cleared.
 
+If there is an error during writeback, then the address_space should be
+marked with an error (typically using filemap_set_wb_error), in order to
+ensure that the error can later be reported to the application at fsync
+or close.
+
 Writeback makes use of a writeback_control structure...
 
 struct address_space_operations
@@ -885,11 +890,16 @@ otherwise noted.
 	"private_data" member in the file structure if you want to point
 	to a device structure
 
-  flush: called by the close(2) system call to flush a file
+  flush: called by the close(2) system call to flush a file. Writeback
+	errors not previously reported via fsync should be reported
+	here as you would for fsync.
 
   release: called when the last reference to an open file is closed
 
-  fsync: called by the fsync(2) system call
+  fsync: called by the fsync(2) system call. Filesystems that use the
+	pagecache should call filemap_report_wb_error before returning
+	to ensure that any errors that occurred during writeback are
+	reported and the file's error sequence advanced.
 
   fasync: called by the fcntl(2) system call when asynchronous
 	(non-blocking) mode is enabled for a file
diff --git a/fs/open.c b/fs/open.c
index 949cef29c3bb..26a1483bcad6 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -709,6 +709,9 @@ static int do_dentry_open(struct file *f,
 	f->f_inode = inode;
 	f->f_mapping = inode->i_mapping;
 
+	/* Don't need the i_lock since we're only interested in sequence */
+	f->f_wb_err_seq = inode->i_mapping->wb_err_seq;
+
 	if (unlikely(f->f_flags & O_PATH)) {
 		f->f_mode = FMODE_PATH;
 		f->f_op = &empty_fops;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7251f7bb45e8..88d4577d761a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -394,6 +394,8 @@ struct address_space {
 	gfp_t			gfp_mask;	/* implicit gfp mask for allocations */
 	struct list_head	private_list;	/* ditto */
 	void			*private_data;	/* ditto */
+	u64			wb_err_seq;
+	int			wb_err;
 } __attribute__((aligned(sizeof(long))));
 	/*
 	 * On most architectures that alignment is already the case; but
@@ -868,6 +870,7 @@ struct file {
 	struct list_head	f_tfile_llink;
 #endif /* #ifdef CONFIG_EPOLL */
 	struct address_space	*f_mapping;
+	u64			f_wb_err_seq;
 } __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */
 
 struct file_handle {
@@ -2521,6 +2524,8 @@ extern int __filemap_fdatawrite_range(struct address_space *mapping,
 extern int filemap_fdatawrite_range(struct address_space *mapping,
 				loff_t start, loff_t end);
 extern int filemap_check_errors(struct address_space *mapping);
+extern void filemap_set_wb_error(struct address_space *mapping, int err);
+extern int filemap_report_wb_error(struct file *file);
 
 extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
 			   int datasync);
diff --git a/mm/filemap.c b/mm/filemap.c
index 1694623a6289..703f069b9812 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -546,6 +546,67 @@ int filemap_write_and_wait_range(struct address_space *mapping,
 EXPORT_SYMBOL(filemap_write_and_wait_range);
 
 /**
+ * filemap_set_wb_error - set the wb error in the mapping for later reporting
+ * @mapping: mapping in which the error should be set
+ * @err: error to set
+ *
+ * When an error occurs during writeback of inode data, we must report that
+ * error during fsync. This function sets the writeback error field in the
+ * mapping, and increments the sequence counter. When fsync or close is later
+ * performed, the caller can then check the sequence in the mapping against
+ * the one in the file to determine whether the error should be reported.
+ *
+ * Note that we always use the latest writeback error, under the assumption
+ * that it doesn't really matter which one gets reported in the case where we
+ * have multiple errors (e.g. -EIO followed by -ENOSPC).
+ */
+void filemap_set_wb_error(struct address_space *mapping, int err)
+{
+	struct inode *inode = mapping->host;
+
+	/*
+	 * This should be called with the error code that we want to return
+	 * on fsync. Thus, it should always be <= 0.
+	 */
+	WARN_ON(err > 0);
+
+	spin_lock(&inode->i_lock);
+	mapping->wb_err_seq++;
+	mapping->wb_err = err;
+	spin_unlock(&inode->i_lock);
+}
+EXPORT_SYMBOL(filemap_set_wb_error);
+
+/**
+ * filemap_report_wb_error - report wb error (if any) that was previously set
+ * @file: struct file on which the error is being reported
+ *
+ * When userland calls fsync or close (or something like nfsd does the
+ * equivalent), we want to report any writeback errors that occurred since
+ * the last fsync (or since the file was opened if there haven't been any).
+ *
+ * Check the sequence counter in the file and see if it lags the one in the
+ * mapping. If it does, then return whatever error is in the mapping and
+ * set the sequence counter to the value of the one in the mapping. Otherwise,
+ * return 0.
+ */
+int filemap_report_wb_error(struct file *file)
+{
+	int err = 0;
+	struct inode *inode = file_inode(file);
+	struct address_space *mapping = file->f_mapping;
+
+	spin_lock(&inode->i_lock);
+	if (file->f_wb_err_seq < mapping->wb_err_seq) {
+		err = mapping->wb_err;
+		file->f_wb_err_seq = mapping->wb_err_seq;
+	}
+	spin_unlock(&inode->i_lock);
+	return err;
+}
+EXPORT_SYMBOL(filemap_report_wb_error);
+
+/**
  * replace_page_cache_page - replace a pagecache page with a new one
  * @old:	page to be replaced
  * @new:	page to replace with
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH 2/4] dax: set errors in mapping when writeback fails
  2017-03-31 19:25 [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it Jeff Layton
  2017-03-31 19:26 ` [RFC PATCH 1/4] fs: new infrastructure for writeback error handling and reporting Jeff Layton
@ 2017-03-31 19:26 ` Jeff Layton
  2017-03-31 19:26 ` [RFC PATCH 3/4] buffer: set wb errors using both new and old infrastructure for now Jeff Layton
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 54+ messages in thread
From: Jeff Layton @ 2017-03-31 19:26 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, linux-ext4, akpm, tytso, jack, willy, neilb

In order to get proper error codes from fsync, we must set an error in
the mapping range when writeback fails.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/dax.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index de622d4282a6..b76b3ffc141a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -892,8 +892,10 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 
 			ret = dax_writeback_one(bdev, mapping, indices[i],
 					pvec.pages[i]);
-			if (ret < 0)
+			if (ret < 0) {
+				filemap_set_wb_error(mapping, ret);
 				return ret;
+			}
 		}
 	}
 	return 0;
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH 3/4] buffer: set wb errors using both new and old infrastructure for now
  2017-03-31 19:25 [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it Jeff Layton
  2017-03-31 19:26 ` [RFC PATCH 1/4] fs: new infrastructure for writeback error handling and reporting Jeff Layton
  2017-03-31 19:26 ` [RFC PATCH 2/4] dax: set errors in mapping when writeback fails Jeff Layton
@ 2017-03-31 19:26 ` Jeff Layton
  2017-03-31 19:26 ` [RFC PATCH 4/4] ext4: wire it up to the new writeback error reporting infrastructure Jeff Layton
  2017-04-03  4:25 ` [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it NeilBrown
  4 siblings, 0 replies; 54+ messages in thread
From: Jeff Layton @ 2017-03-31 19:26 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, linux-ext4, akpm, tytso, jack, willy, neilb

In later patches, we'll be converting filesystems over to use the new
filemap_*_wb_error API for storing and reporting writeback errors.

Since several different filesystems use the code in buffer.c, ensure
that we report errors using both schemes for now. Eventually we
should be able to remove the mapping_set_error calls once nothing
relies on it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/buffer.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 9196f2a270da..5dde6de2408f 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -354,6 +354,7 @@ void end_buffer_async_write(struct buffer_head *bh, int uptodate)
 	} else {
 		buffer_io_error(bh, ", lost async page write");
 		mapping_set_error(page->mapping, -EIO);
+		filemap_set_wb_error(page->mapping, -EIO);
 		set_buffer_write_io_error(bh);
 		clear_buffer_uptodate(bh);
 		SetPageError(page);
@@ -1879,6 +1880,7 @@ int __block_write_full_page(struct inode *inode, struct page *page,
 	SetPageError(page);
 	BUG_ON(PageWriteback(page));
 	mapping_set_error(page->mapping, err);
+	filemap_set_wb_error(page->mapping, err);
 	set_page_writeback(page);
 	do {
 		struct buffer_head *next = bh->b_this_page;
@@ -3291,8 +3293,10 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
 
 	bh = head;
 	do {
-		if (buffer_write_io_error(bh) && page->mapping)
+		if (buffer_write_io_error(bh) && page->mapping) {
 			mapping_set_error(page->mapping, -EIO);
+			filemap_set_wb_error(page->mapping, -EIO);
+		}
 		if (buffer_busy(bh))
 			goto failed;
 		bh = bh->b_this_page;
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH 4/4] ext4: wire it up to the new writeback error reporting infrastructure
  2017-03-31 19:25 [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it Jeff Layton
                   ` (2 preceding siblings ...)
  2017-03-31 19:26 ` [RFC PATCH 3/4] buffer: set wb errors using both new and old infrastructure for now Jeff Layton
@ 2017-03-31 19:26 ` Jeff Layton
  2017-04-03  4:25 ` [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it NeilBrown
  4 siblings, 0 replies; 54+ messages in thread
From: Jeff Layton @ 2017-03-31 19:26 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, linux-ext4, akpm, tytso, jack, willy, neilb

Convert ext4 to use filemap_set_wb_error for reporting writeback errors
instead of mapping_set_error. Ensure that it calls filemap_report_wb_error
before returning from fsync, and add a new flush operation that ensures
that just calls filemap_report_wb_error before returning.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ext4/dir.c     |  1 +
 fs/ext4/ext4.h    |  1 +
 fs/ext4/file.c    |  1 +
 fs/ext4/fsync.c   | 15 +++++++++++----
 fs/ext4/inode.c   |  2 +-
 fs/ext4/page-io.c |  4 ++--
 6 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index e8b365000d73..c333b148b5fe 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -656,6 +656,7 @@ const struct file_operations ext4_dir_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ext4_compat_ioctl,
 #endif
+	.flush		= ext4_flush_file,
 	.fsync		= ext4_sync_file,
 	.open		= ext4_dir_open,
 	.release	= ext4_release_dir,
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index f493af666591..20437bedeed0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2382,6 +2382,7 @@ extern int ext4_check_all_de(struct inode *dir, struct buffer_head *bh,
 
 /* fsync.c */
 extern int ext4_sync_file(struct file *, loff_t, loff_t, int);
+extern int ext4_flush_file(struct file *, fl_owner_t);
 
 /* hash.c */
 extern int ext4fs_dirhash(const char *name, int len, struct
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 8210c1f43556..cc536a1aa731 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -735,6 +735,7 @@ const struct file_operations ext4_file_operations = {
 	.mmap		= ext4_file_mmap,
 	.open		= ext4_file_open,
 	.release	= ext4_release_file,
+	.flush		= ext4_flush_file,
 	.fsync		= ext4_sync_file,
 	.get_unmapped_area = thp_get_unmapped_area,
 	.splice_read	= generic_file_splice_read,
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 9d549608fd30..e381ab03f1cc 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -79,6 +79,11 @@ static int ext4_sync_parent(struct inode *inode)
 	return ret;
 }
 
+int ext4_flush_file(struct file *file, fl_owner_t id)
+{
+	return filemap_report_wb_error(file);
+}
+
 /*
  * akpm: A new design for ext4_sync_file().
  *
@@ -96,12 +101,12 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	struct inode *inode = file->f_mapping->host;
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
-	int ret = 0, err;
+	int ret = -EIO, err;
 	tid_t commit_tid;
 	bool needs_barrier = false;
 
 	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
-		return -EIO;
+		goto out;
 
 	J_ASSERT(ext4_journal_current_handle() == NULL);
 
@@ -110,6 +115,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	if (inode->i_sb->s_flags & MS_RDONLY) {
 		/* Make sure that we read updated s_mount_flags value */
 		smp_rmb();
+		ret = 0;
 		if (EXT4_SB(inode->i_sb)->s_mount_flags & EXT4_MF_FS_ABORTED)
 			ret = -EROFS;
 		goto out;
@@ -126,7 +132,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 
 	ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
 	if (ret)
-		return ret;
+		goto out;
 	/*
 	 * data=writeback,ordered:
 	 *  The caller's filemap_fdatawrite()/wait will sync the data.
@@ -159,5 +165,6 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	}
 out:
 	trace_ext4_sync_file_exit(inode, ret);
-	return ret;
+	err = filemap_report_wb_error(file);
+	return ret ? : err;
 }
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4247d8d25687..b9c8a73c404b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2634,7 +2634,7 @@ static int __writepage(struct page *page, struct writeback_control *wbc,
 {
 	struct address_space *mapping = data;
 	int ret = ext4_writepage(page, wbc);
-	mapping_set_error(mapping, ret);
+	filemap_set_wb_error(mapping, ret);
 	return ret;
 }
 
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 208241b06662..5adcf5d0ee9a 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -87,7 +87,7 @@ static void ext4_finish_bio(struct bio *bio)
 
 		if (bio->bi_error) {
 			SetPageError(page);
-			mapping_set_error(page->mapping, -EIO);
+			filemap_set_wb_error(page->mapping, -EIO);
 		}
 		bh = head = page_buffers(page);
 		/*
@@ -311,7 +311,7 @@ static void ext4_end_bio(struct bio *bio)
 			     (long) io_end->size,
 			     (unsigned long long)
 			     bi_sector >> (inode->i_blkbits - 9));
-		mapping_set_error(inode->i_mapping, bio->bi_error);
+		filemap_set_wb_error(inode->i_mapping, bio->bi_error);
 	}
 
 	if (io_end->flag & EXT4_IO_END_UNWRITTEN) {
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-03-31 19:25 [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it Jeff Layton
                   ` (3 preceding siblings ...)
  2017-03-31 19:26 ` [RFC PATCH 4/4] ext4: wire it up to the new writeback error reporting infrastructure Jeff Layton
@ 2017-04-03  4:25 ` NeilBrown
  2017-04-03 10:28   ` Jeff Layton
  2017-04-03 14:51   ` Matthew Wilcox
  4 siblings, 2 replies; 54+ messages in thread
From: NeilBrown @ 2017-04-03  4:25 UTC (permalink / raw)
  To: Jeff Layton, linux-fsdevel
  Cc: linux-kernel, linux-ext4, akpm, tytso, jack, willy

[-- Attachment #1: Type: text/plain, Size: 2897 bytes --]

On Fri, Mar 31 2017, Jeff Layton wrote:

> During LSF/MM this year, we had a discussion about the current sorry
> state of writeback error reporting, and what could be done to improve
> the situation. This patchset represents a first pass at the proposal
> I made there.
>
> It first adds a new set of writeback error tracking infrastructure to
> ensure that errors are properly stored and reported at fsync time. It
> also makes a small but significant change to ensure that writeback
> errors are reported on all file descriptors, not just on the first one
> where fsync is called.
>
> Note that this is a _very_ rough draft at this point. I did some by-hand
> testing with dm-error to ensure that it does the right thing there.
> Mostly I'm interested in early feedback at this point -- does this basic
> approach make sense?

I think that having ->wb_err_seq and returning errors to all file
descriptors is a good idea.
I don't like ->wb_err, particularly that you allow it to be set
to zero:
 +	/*
 +	 * This should be called with the error code that we want to return
 +	 * on fsync. Thus, it should always be <= 0.
 +	 */
 +	WARN_ON(err > 0);

Why is that ??

Also I think that EIO should always over-ride ENOSPC as the possible
responses are different.  That probably means you need a separate seq
number for each, which isn't ideal.

I don't like that you need to add a 'flush' handler to every filesystem,
most of which just call
 +	return filemap_report_wb_error(file);

Could we just have
	if (filp->f_op->flush)
		retval = filp->f_op->flush(filp, id);
+	else
+		retval = filemap_report_wb_error(filp);
in flip_close() ??

... or maybe it is wrong to return this error on close().
After all, the file actually does get closed, so no error occurred.
If an application cares about EIO, it should always call fsync() before
close().

Thanks,
NeilBrown


>
> Jeff Layton (4):
>   fs: new infrastructure for writeback error handling and reporting
>   dax: set errors in mapping when writeback fails
>   buffer: set wb errors using both new and old infrastructure for now
>   ext4: wire it up to the new writeback error reporting infrastructure
>
>  Documentation/filesystems/vfs.txt | 14 +++++++--
>  fs/buffer.c                       |  6 +++-
>  fs/dax.c                          |  4 ++-
>  fs/ext4/dir.c                     |  1 +
>  fs/ext4/ext4.h                    |  1 +
>  fs/ext4/file.c                    |  1 +
>  fs/ext4/fsync.c                   | 15 +++++++---
>  fs/ext4/inode.c                   |  2 +-
>  fs/ext4/page-io.c                 |  4 +--
>  fs/open.c                         |  3 ++
>  include/linux/fs.h                |  5 ++++
>  mm/filemap.c                      | 61 +++++++++++++++++++++++++++++++++++++++
>  12 files changed, 106 insertions(+), 11 deletions(-)
>
> -- 
> 2.9.3

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 1/4] fs: new infrastructure for writeback error handling and reporting
  2017-03-31 19:26 ` [RFC PATCH 1/4] fs: new infrastructure for writeback error handling and reporting Jeff Layton
@ 2017-04-03  7:12   ` Nikolay Borisov
  2017-04-03 10:28     ` Jeff Layton
  2017-04-03 14:47   ` Matthew Wilcox
  1 sibling, 1 reply; 54+ messages in thread
From: Nikolay Borisov @ 2017-04-03  7:12 UTC (permalink / raw)
  To: Jeff Layton, linux-fsdevel
  Cc: linux-kernel, linux-ext4, akpm, tytso, jack, willy, neilb



On 31.03.2017 22:26, Jeff Layton wrote:
> Most filesystems currently use mapping_set_error and
> filemap_check_errors for setting and reporting/clearing writeback errors
> at the mapping level. filemap_check_errors is indirectly called from
> most of the filemap_fdatawait_* functions and from
> filemap_write_and_wait*. These functions are called from all sorts of
> contexts to wait on writeback to finish -- e.g. mostly in fsync, but
> also in truncate calls, getattr, etc.
> 
> It's those non-fsync callers that are problematic. We should be
> reporting writeback errors during fsync, but many places in the code
> clear out errors before they can be properly reported, or report errors
> at nonsensical times. If I get -EIO on a stat() call, how do I know that
> was because writeback failed?
> 
> This patch adds a small bit of new infrastructure for setting and
> reporting errors during pagecache writeback. While the above was my
> original impetus for adding this, I think it's also the case that
> current fsync semantics are just problematic for userland. Most
> applications that call fsync do so to ensure that the data they wrote
> has hit the backing store.
> 
> In the case where there are multiple writers to the file at the same
> time, this is really hard to determine. The first one to call fsync will
> see any stored error, and the rest get back 0. The processes with open
> fd may not be associated with one another in any way. They could even be
> in different containers, so ensuring coordination between all fsync
> callers is not really an option.
> 
> One way to remedy this would be to track what file descriptor was used
> to dirty the file, but that's rather cumbersome and would likely be
> slow. However, there is a simpler way to improve the semantics here
> without incurring too much overhead.
> 
> This set adds a wb_error field and a sequence counter to the
> address_space, and a corresponding sequence counter in the struct file.
> When errors are reported during writeback, we set the error field in the
> mapping and increment the sequence counter.
> 
> When fsync or flush is called, we check the sequence in the file vs. the
> one in the mapping. If the file's counter is behind the one in the
> mapping, then we update the sequence counter in the file to the value of
> the one in the mapping and report the error. If the file is "caught up"
> then we just report 0.
> 
> This changes the semantics of fsync such that applications can now use
> it to determine whether there were any writeback errors since fsync(fd)
> was last called (or since the file was opened in the case of fsync
> having never been called).
> 
> Note that those writeback errors may have occurred when writing data
> that was dirtied via an entirely different fd, but that's the case now
> with the current mapping_set_error/filemap_check_error infrastructure.
> This will at least prevent you from getting a false report of success.
> 
> The basic idea here is for filesystems to use filemap_set_wb_error to
> set the error in the mapping when there are writeback errors, and then
> have the fsync and flush operations call filemap_report_wb_error just
> before returning to ensure that those errors get reported properly.
> 
> Eventually, it may make sense to move the reporting into the generic
> vfs_fsync_range helper, but doing it this way for now makes it simpler
> to convert filesystems to the new API individually.

There is already a mapping_set_error API which sets flags in
mapping->flags (AS_EIO/AS_ENOSPC). Aren't you essentially duplicating
some of the semantics of that API ?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-03  4:25 ` [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it NeilBrown
@ 2017-04-03 10:28   ` Jeff Layton
  2017-04-03 14:32     ` Matthew Wilcox
  2017-04-03 14:51   ` Matthew Wilcox
  1 sibling, 1 reply; 54+ messages in thread
From: Jeff Layton @ 2017-04-03 10:28 UTC (permalink / raw)
  To: NeilBrown, linux-fsdevel
  Cc: linux-kernel, linux-ext4, akpm, tytso, jack, willy

On Mon, 2017-04-03 at 14:25 +1000, NeilBrown wrote:
> On Fri, Mar 31 2017, Jeff Layton wrote:
> 
> > During LSF/MM this year, we had a discussion about the current sorry
> > state of writeback error reporting, and what could be done to improve
> > the situation. This patchset represents a first pass at the proposal
> > I made there.
> > 
> > It first adds a new set of writeback error tracking infrastructure to
> > ensure that errors are properly stored and reported at fsync time. It
> > also makes a small but significant change to ensure that writeback
> > errors are reported on all file descriptors, not just on the first one
> > where fsync is called.
> > 
> > Note that this is a _very_ rough draft at this point. I did some by-hand
> > testing with dm-error to ensure that it does the right thing there.
> > Mostly I'm interested in early feedback at this point -- does this basic
> > approach make sense?
> 
> I think that having ->wb_err_seq and returning errors to all file
> descriptors is a good idea.
> I don't like ->wb_err, particularly that you allow it to be set
> to zero:
>  +	/*
>  +	 * This should be called with the error code that we want to return
>  +	 * on fsync. Thus, it should always be <= 0.
>  +	 */
>  +	WARN_ON(err > 0);
> 
> Why is that ??
> 

It's because I wasn't thinking about all of the places that currently
call mapping_set_error with an error of 0. This worked for ext4 since we
only call this when there is an actual error. You're correct here -- we
should only set the error when it's non-zero. I'll fix that.

> Also I think that EIO should always over-ride ENOSPC as the possible
> responses are different.  That probably means you need a separate seq
> number for each, which isn't ideal.
> 

I'm not quite convinced that it's really useful to do anything but
report the latest error.

But...if we did need to prefer one over another, could we get away with
always reporting -EIO once that error occurs? If so, then we'd still
just need a single sequence counter.

> I don't like that you need to add a 'flush' handler to every filesystem,
> most of which just call
>  +	return filemap_report_wb_error(file);
> 
> Could we just have
> 	if (filp->f_op->flush)
> 		retval = filp->f_op->flush(filp, id);
> +	else
> +		retval = filemap_report_wb_error(filp);
> in flip_close() ??
> 

Sure, that's possible.

I'm leery of making too much in the way of changes to the generic VFS
layer code just yet. After making several abortive attempts to try to
fix some of this with large, sweeping changes to the code, I think the
approach of doing this on a per-filesystem basis will be saner.

My concern there for now is that some code (e.g. fs/buffer.c) is shared
between filesystems and will need to call both routines in the interim.
Suppose we have a filesystem (ext2?) that is using the older routines
for now. Making the change to filp_close above might subtly change its
behavior, and I don't think we want to do that.

Once we have everything converted to use the newer API, we should be
able to collapse a lot of the flush routines into the above though.

> ... or maybe it is wrong to return this error on close().
> After all, the file actually does get closed, so no error occurred.
> If an application cares about EIO, it should always call fsync() before
> close().
> 

Applications should, but the close(2) manpage does say:

       Not  checking  the return value of close() is a common
       but nevertheless serious  programming  error.   It  is
       quite  possible  that  errors  on  a previous write(2)
       operation are first reported  at  the  final  close().
       Not  checking  the  return value when closing the file
       may lead to silent loss of data.

POSIX seems to say that that behavior is optional, but I think reporting
errors at close is a good idea. There are programs that do check for
that, but whether they do anything useful with the error is a little
less clear.

> > 
> > Jeff Layton (4):
> >   fs: new infrastructure for writeback error handling and reporting
> >   dax: set errors in mapping when writeback fails
> >   buffer: set wb errors using both new and old infrastructure for now
> >   ext4: wire it up to the new writeback error reporting infrastructure
> > 
> >  Documentation/filesystems/vfs.txt | 14 +++++++--
> >  fs/buffer.c                       |  6 +++-
> >  fs/dax.c                          |  4 ++-
> >  fs/ext4/dir.c                     |  1 +
> >  fs/ext4/ext4.h                    |  1 +
> >  fs/ext4/file.c                    |  1 +
> >  fs/ext4/fsync.c                   | 15 +++++++---
> >  fs/ext4/inode.c                   |  2 +-
> >  fs/ext4/page-io.c                 |  4 +--
> >  fs/open.c                         |  3 ++
> >  include/linux/fs.h                |  5 ++++
> >  mm/filemap.c                      | 61 +++++++++++++++++++++++++++++++++++++++
> >  12 files changed, 106 insertions(+), 11 deletions(-)
> > 
> > -- 
> > 2.9.3

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 1/4] fs: new infrastructure for writeback error handling and reporting
  2017-04-03  7:12   ` Nikolay Borisov
@ 2017-04-03 10:28     ` Jeff Layton
  0 siblings, 0 replies; 54+ messages in thread
From: Jeff Layton @ 2017-04-03 10:28 UTC (permalink / raw)
  To: Nikolay Borisov, linux-fsdevel
  Cc: linux-kernel, linux-ext4, akpm, tytso, jack, willy, neilb

On Mon, 2017-04-03 at 10:12 +0300, Nikolay Borisov wrote:
> 
> On 31.03.2017 22:26, Jeff Layton wrote:
> > Most filesystems currently use mapping_set_error and
> > filemap_check_errors for setting and reporting/clearing writeback errors
> > at the mapping level. filemap_check_errors is indirectly called from
> > most of the filemap_fdatawait_* functions and from
> > filemap_write_and_wait*. These functions are called from all sorts of
> > contexts to wait on writeback to finish -- e.g. mostly in fsync, but
> > also in truncate calls, getattr, etc.
> > 
> > It's those non-fsync callers that are problematic. We should be
> > reporting writeback errors during fsync, but many places in the code
> > clear out errors before they can be properly reported, or report errors
> > at nonsensical times. If I get -EIO on a stat() call, how do I know that
> > was because writeback failed?
> > 
> > This patch adds a small bit of new infrastructure for setting and
> > reporting errors during pagecache writeback. While the above was my
> > original impetus for adding this, I think it's also the case that
> > current fsync semantics are just problematic for userland. Most
> > applications that call fsync do so to ensure that the data they wrote
> > has hit the backing store.
> > 
> > In the case where there are multiple writers to the file at the same
> > time, this is really hard to determine. The first one to call fsync will
> > see any stored error, and the rest get back 0. The processes with open
> > fd may not be associated with one another in any way. They could even be
> > in different containers, so ensuring coordination between all fsync
> > callers is not really an option.
> > 
> > One way to remedy this would be to track what file descriptor was used
> > to dirty the file, but that's rather cumbersome and would likely be
> > slow. However, there is a simpler way to improve the semantics here
> > without incurring too much overhead.
> > 
> > This set adds a wb_error field and a sequence counter to the
> > address_space, and a corresponding sequence counter in the struct file.
> > When errors are reported during writeback, we set the error field in the
> > mapping and increment the sequence counter.
> > 
> > When fsync or flush is called, we check the sequence in the file vs. the
> > one in the mapping. If the file's counter is behind the one in the
> > mapping, then we update the sequence counter in the file to the value of
> > the one in the mapping and report the error. If the file is "caught up"
> > then we just report 0.
> > 
> > This changes the semantics of fsync such that applications can now use
> > it to determine whether there were any writeback errors since fsync(fd)
> > was last called (or since the file was opened in the case of fsync
> > having never been called).
> > 
> > Note that those writeback errors may have occurred when writing data
> > that was dirtied via an entirely different fd, but that's the case now
> > with the current mapping_set_error/filemap_check_error infrastructure.
> > This will at least prevent you from getting a false report of success.
> > 
> > The basic idea here is for filesystems to use filemap_set_wb_error to
> > set the error in the mapping when there are writeback errors, and then
> > have the fsync and flush operations call filemap_report_wb_error just
> > before returning to ensure that those errors get reported properly.
> > 
> > Eventually, it may make sense to move the reporting into the generic
> > vfs_fsync_range helper, but doing it this way for now makes it simpler
> > to convert filesystems to the new API individually.
> 
> There is already a mapping_set_error API which sets flags in
> mapping->flags (AS_EIO/AS_ENOSPC). Aren't you essentially duplicating
> some of the semantics of that API ?

Yes, more or less for now. The arguments of mapping_set_error and
filemap_set_wb_error are the same, but they do different things with the
error.

The plan is eventually to eliminate mapping_set_error and convert
everything over to use the new infrastructure I'm adding. That's
difficult to do all at once however, so for now some duplication is
necessary.
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-03 10:28   ` Jeff Layton
@ 2017-04-03 14:32     ` Matthew Wilcox
  2017-04-03 17:47       ` Jeff Layton
  0 siblings, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2017-04-03 14:32 UTC (permalink / raw)
  To: Jeff Layton
  Cc: NeilBrown, linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

On Mon, Apr 03, 2017 at 06:28:38AM -0400, Jeff Layton wrote:
> On Mon, 2017-04-03 at 14:25 +1000, NeilBrown wrote:
> > Also I think that EIO should always over-ride ENOSPC as the possible
> > responses are different.  That probably means you need a separate seq
> > number for each, which isn't ideal.
> > 
> 
> I'm not quite convinced that it's really useful to do anything but
> report the latest error.
> 
> But...if we did need to prefer one over another, could we get away with
> always reporting -EIO once that error occurs? If so, then we'd still
> just need a single sequence counter.

I wonder whether it's even worth supporting both EIO and ENOSPC for a
writeback problem.  If I understand correctly, at the time of write(),
filesystems check to see if they have enough blocks to satisfy the
request, so ENOSPC only comes up in the writeback context for thinly
provisioned devices.

Programs have basically no use for the distinction.  In either case,
the situation is the same.  The written data is safely in RAM and cannot
be written to the storage.  If one were to make superhuman efforts,
one could mmap the file and write() it to a different device, but that
is incredibly rare.  For most programs, the response is to just die and
let the human deal with the corrupted file.

>From a sysadmin point of view, of course the situation is different,
and the remedy is different, but they should be getting that information
through a different mechanism than monitoring the errno from every
system call.

If we do want to continue to support both EIO and ENOSPC from writeback,
then let's have EIO override ENOSPC as an error.  ie if an ENOSPC comes
in after an EIO is set, it only bumps the counter and applications will
see EIO, not ENOSPC on fresh calls to fsync().

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 1/4] fs: new infrastructure for writeback error handling and reporting
  2017-03-31 19:26 ` [RFC PATCH 1/4] fs: new infrastructure for writeback error handling and reporting Jeff Layton
  2017-04-03  7:12   ` Nikolay Borisov
@ 2017-04-03 14:47   ` Matthew Wilcox
  2017-04-03 15:19     ` Jeff Layton
  1 sibling, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2017-04-03 14:47 UTC (permalink / raw)
  To: Jeff Layton
  Cc: linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack, neilb

On Fri, Mar 31, 2017 at 03:26:00PM -0400, Jeff Layton wrote:
> This set adds a wb_error field and a sequence counter to the
> address_space, and a corresponding sequence counter in the struct file.
> When errors are reported during writeback, we set the error field in the
> mapping and increment the sequence counter.

> +++ b/fs/open.c
> @@ -709,6 +709,9 @@ static int do_dentry_open(struct file *f,
>  	f->f_inode = inode;
>  	f->f_mapping = inode->i_mapping;
>  
> +	/* Don't need the i_lock since we're only interested in sequence */
> +	f->f_wb_err_seq = inode->i_mapping->wb_err_seq;
> +

Do we need READ_ONCE() though, to ensure we get a consistent view of
wb_err_seq?  In particular, you made it 64 bit, so 32-bit architectures
are going to have a problem if it's rolling over between 2^32-1 and 2^32.

> +++ b/include/linux/fs.h
> @@ -394,6 +394,8 @@ struct address_space {
>  	gfp_t			gfp_mask;	/* implicit gfp mask for allocations */
>  	struct list_head	private_list;	/* ditto */
>  	void			*private_data;	/* ditto */
> +	u64			wb_err_seq;
> +	int			wb_err;
>  } __attribute__((aligned(sizeof(long))));
>  	/*
>  	 * On most architectures that alignment is already the case; but

I thought we had you convinced to make wb_err_seq an s32 and do clock
arithmetic?

> +int filemap_report_wb_error(struct file *file)
> +{
> +	int err = 0;
> +	struct inode *inode = file_inode(file);
> +	struct address_space *mapping = file->f_mapping;
> +
> +	spin_lock(&inode->i_lock);
> +	if (file->f_wb_err_seq < mapping->wb_err_seq) {
> +		err = mapping->wb_err;
> +		file->f_wb_err_seq = mapping->wb_err_seq;
> +	}
> +	spin_unlock(&inode->i_lock);
> +	return err;
> +}

Now that I think about this some more, I don't think you even need clock
arithmetic -- you just need !=.  And that means there's only a 1 in 2^32
chance that you miss an error.  Good enough, I say!  Particularly since
if errors are occurring that frequently that we wrapped the sequence
counter, the chance that we hit that magic point are really low.

We could even combine the two (I know Dave Chinner has been really
against growing struct address_space in the past):

int decode_wb_err(u32 wb_err)
{
	if (wb_err & 1)
		return -EIO;
	if (wb_err & 2)
		return -ENOSPC;
	return 0;
}

void set_wb_err(struct address_space *mapping, int err)
{
	if (err == -EIO)
		mapping->wb_err |= 1;
	else if (err == -ENOSPC)
		mapping->wb_err |= 2;
	else
		return;
	mapping->wb_err += 4;
}

...
	if (file->f_wb_err != mapping->wb_err) {
		err = decode_wb_err(mapping->wb_err);
		file->f_wb_err = mapping->wb_err;
	}

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-03  4:25 ` [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it NeilBrown
  2017-04-03 10:28   ` Jeff Layton
@ 2017-04-03 14:51   ` Matthew Wilcox
  1 sibling, 0 replies; 54+ messages in thread
From: Matthew Wilcox @ 2017-04-03 14:51 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jeff Layton, linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

On Mon, Apr 03, 2017 at 02:25:11PM +1000, NeilBrown wrote:
> I don't like that you need to add a 'flush' handler to every filesystem,
> most of which just call
>  +	return filemap_report_wb_error(file);
> 
> Could we just have
> 	if (filp->f_op->flush)
> 		retval = filp->f_op->flush(filp, id);
> +	else
> +		retval = filemap_report_wb_error(filp);
> in flip_close() ??

Maybe this is badly named as ext4_flush_file().  Maybe this should be
generic_flush_file(), and then there's no per-filesystem overhead to this?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 1/4] fs: new infrastructure for writeback error handling and reporting
  2017-04-03 14:47   ` Matthew Wilcox
@ 2017-04-03 15:19     ` Jeff Layton
  2017-04-03 16:15       ` Matthew Wilcox
  0 siblings, 1 reply; 54+ messages in thread
From: Jeff Layton @ 2017-04-03 15:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack, neilb

On Mon, 2017-04-03 at 07:47 -0700, Matthew Wilcox wrote:
> On Fri, Mar 31, 2017 at 03:26:00PM -0400, Jeff Layton wrote:
> > This set adds a wb_error field and a sequence counter to the
> > address_space, and a corresponding sequence counter in the struct file.
> > When errors are reported during writeback, we set the error field in the
> > mapping and increment the sequence counter.
> > +++ b/fs/open.c
> > @@ -709,6 +709,9 @@ static int do_dentry_open(struct file *f,
> >  	f->f_inode = inode;
> >  	f->f_mapping = inode->i_mapping;
> >  
> > +	/* Don't need the i_lock since we're only interested in sequence */
> > +	f->f_wb_err_seq = inode->i_mapping->wb_err_seq;
> > +
> 
> Do we need READ_ONCE() though, to ensure we get a consistent view of
> wb_err_seq?  In particular, you made it 64 bit, so 32-bit architectures
> are going to have a problem if it's rolling over between 2^32-1 and 2^32.
> 

Yeah, I thought about that, and wasn't sure so I left that off. If you
think it's a good idea, then I'm fine with adding it.

> > +++ b/include/linux/fs.h
> > @@ -394,6 +394,8 @@ struct address_space {
> >  	gfp_t			gfp_mask;	/* implicit gfp mask for allocations */
> >  	struct list_head	private_list;	/* ditto */
> >  	void			*private_data;	/* ditto */
> > +	u64			wb_err_seq;
> > +	int			wb_err;
> >  } __attribute__((aligned(sizeof(long))));
> >  	/*
> >  	 * On most architectures that alignment is already the case; but
> 
> I thought we had you convinced to make wb_err_seq an s32 and do clock
> arithmetic?
> 
> > +int filemap_report_wb_error(struct file *file)
> > +{
> > +	int err = 0;
> > +	struct inode *inode = file_inode(file);
> > +	struct address_space *mapping = file->f_mapping;
> > +
> > +	spin_lock(&inode->i_lock);
> > +	if (file->f_wb_err_seq < mapping->wb_err_seq) {
> > +		err = mapping->wb_err;
> > +		file->f_wb_err_seq = mapping->wb_err_seq;
> > +	}
> > +	spin_unlock(&inode->i_lock);
> > +	return err;
> > +}
> 
> Now that I think about this some more, I don't think you even need clock
> arithmetic -- you just need !=.  And that means there's only a 1 in 2^32
> chance that you miss an error.  Good enough, I say!  Particularly since
> if errors are occurring that frequently that we wrapped the sequence
> counter, the chance that we hit that magic point are really low.
> 

> We could even combine the two (I know Dave Chinner has been really
> against growing struct address_space in the past):
> 
> int decode_wb_err(u32 wb_err)
> {
> 	if (wb_err & 1)
> 		return -EIO;
> 	if (wb_err & 2)
> 		return -ENOSPC;
> 	return 0;
> }
> 
> void set_wb_err(struct address_space *mapping, int err)
> {
> 	if (err == -EIO)
> 		mapping->wb_err |= 1;
> 	else if (err == -ENOSPC)
> 		mapping->wb_err |= 2;
> 	else
> 		return;
> 	mapping->wb_err += 4;
> }
> 
> ...
> 	if (file->f_wb_err != mapping->wb_err) {
> 		err = decode_wb_err(mapping->wb_err);
> 		file->f_wb_err = mapping->wb_err;
> 	}

Agreed. I had the same thought about checking for equality just after I
hit send last week. :)

Yes, so just to be clear here if you bump a 32 bit counter every
microsecond you'll end up wrapping in a little over an hour. How fast
can DAX generate I/O errors? :)

I'm fine with a 32 bit counter (and even with using the low order bits
to store error flags) if we're ok with that limitation. The big
question there is whether it's ok to continue reporting -EIO when there
has actually been nothing but -ENOSPC errors since the last fsync. I
think it's a corner case that's not of terribly great concern so I'm
fine with that.

We could try to mitigate it by zeroing out the value when i_writecount
goes to zero though. Then if you close all of the fds on the file, the
error is cleared. Or maybe we could add a new ioctl to explicitly zero
it out?
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 1/4] fs: new infrastructure for writeback error handling and reporting
  2017-04-03 15:19     ` Jeff Layton
@ 2017-04-03 16:15       ` Matthew Wilcox
  2017-04-03 16:30         ` Jeff Layton
  0 siblings, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2017-04-03 16:15 UTC (permalink / raw)
  To: Jeff Layton
  Cc: linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack, neilb

On Mon, Apr 03, 2017 at 11:19:51AM -0400, Jeff Layton wrote:
> Yes, so just to be clear here if you bump a 32 bit counter every
> microsecond you'll end up wrapping in a little over an hour. How fast
> can DAX generate I/O errors? :)

I admit to not having picked through the code, but how often do we try
to do writebacks?  And how often do we retry writebacks once an -EIO
has happened?  Once we mark a page as PG_error, do we keep trying to
write it back and set the AS error each time?

> I'm fine with a 32 bit counter (and even with using the low order bits
> to store error flags) if we're ok with that limitation. The big
> question there is whether it's ok to continue reporting -EIO when there
> has actually been nothing but -ENOSPC errors since the last fsync. I
> think it's a corner case that's not of terribly great concern so I'm
> fine with that.

Yeah, I was thinking about that, and I'm fine with it too.

> We could try to mitigate it by zeroing out the value when i_writecount
> goes to zero though. Then if you close all of the fds on the file, the
> error is cleared. Or maybe we could add a new ioctl to explicitly zero
> it out?

I'm OK with zeroing the wb_err once i_writecount drops to 0.  Everybody
who cares has already been notified.  The new ioctl feels like overkill.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 1/4] fs: new infrastructure for writeback error handling and reporting
  2017-04-03 16:15       ` Matthew Wilcox
@ 2017-04-03 16:30         ` Jeff Layton
  0 siblings, 0 replies; 54+ messages in thread
From: Jeff Layton @ 2017-04-03 16:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack, neilb

On Mon, 2017-04-03 at 09:15 -0700, Matthew Wilcox wrote:
> On Mon, Apr 03, 2017 at 11:19:51AM -0400, Jeff Layton wrote:
> > Yes, so just to be clear here if you bump a 32 bit counter every
> > microsecond you'll end up wrapping in a little over an hour. How fast
> > can DAX generate I/O errors? :)
> 
> I admit to not having picked through the code, but how often do we try
> to do writebacks?  And how often do we retry writebacks once an -EIO
> has happened?  Once we mark a page as PG_error, do we keep trying to
> write it back and set the AS error each time?
> 

It depends, but I think it could theoretically happen after trying to
sync out every page in a file. With something like DAX it seems like
you could do that pretty quickly.

One thing we could do is to try and push the filemap_set_wb_error calls
out of writepage ops and allow the callers to do that so we can avoid
bumping the counter unnecessarily. Not sure if that's enough to avoid
wrapping too quickly.

> > I'm fine with a 32 bit counter (and even with using the low order bits
> > to store error flags) if we're ok with that limitation. The big
> > question there is whether it's ok to continue reporting -EIO when there
> > has actually been nothing but -ENOSPC errors since the last fsync. I
> > think it's a corner case that's not of terribly great concern so I'm
> > fine with that.
> 
> Yeah, I was thinking about that, and I'm fine with it too.
> 
> > We could try to mitigate it by zeroing out the value when i_writecount
> > goes to zero though. Then if you close all of the fds on the file, the
> > error is cleared. Or maybe we could add a new ioctl to explicitly zero
> > it out?
> 
> I'm OK with zeroing the wb_err once i_writecount drops to 0.  Everybody
> who cares has already been notified.  The new ioctl feels like overkill.

That's my feeling too.
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-03 14:32     ` Matthew Wilcox
@ 2017-04-03 17:47       ` Jeff Layton
  2017-04-03 18:09         ` Jeremy Allison
  2017-04-03 19:16         ` Matthew Wilcox
  0 siblings, 2 replies; 54+ messages in thread
From: Jeff Layton @ 2017-04-03 17:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: NeilBrown, linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

On Mon, 2017-04-03 at 07:32 -0700, Matthew Wilcox wrote:
> On Mon, Apr 03, 2017 at 06:28:38AM -0400, Jeff Layton wrote:
> > On Mon, 2017-04-03 at 14:25 +1000, NeilBrown wrote:
> > > Also I think that EIO should always over-ride ENOSPC as the possible
> > > responses are different.  That probably means you need a separate seq
> > > number for each, which isn't ideal.
> > > 
> > 
> > I'm not quite convinced that it's really useful to do anything but
> > report the latest error.
> > 
> > But...if we did need to prefer one over another, could we get away with
> > always reporting -EIO once that error occurs? If so, then we'd still
> > just need a single sequence counter.
> 
> I wonder whether it's even worth supporting both EIO and ENOSPC for a
> writeback problem.  If I understand correctly, at the time of write(),
> filesystems check to see if they have enough blocks to satisfy the
> request, so ENOSPC only comes up in the writeback context for thinly
> provisioned devices.
> 
> Programs have basically no use for the distinction.  In either case,
> the situation is the same.  The written data is safely in RAM and cannot
> be written to the storage.  If one were to make superhuman efforts,
> one could mmap the file and write() it to a different device, but that
> is incredibly rare.  For most programs, the response is to just die and
> let the human deal with the corrupted file.
> 
> From a sysadmin point of view, of course the situation is different,
> and the remedy is different, but they should be getting that information
> through a different mechanism than monitoring the errno from every
> system call.
> 
> If we do want to continue to support both EIO and ENOSPC from writeback,
> then let's have EIO override ENOSPC as an error.  ie if an ENOSPC comes
> in after an EIO is set, it only bumps the counter and applications will
> see EIO, not ENOSPC on fresh calls to fsync().


No, ENOSPC on writeback can certainly happen with network filesystems.
NFS and CIFS have no way to reserve space. You wouldn't want to have to
do an extra RPC on every buffered write. :)
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-03 17:47       ` Jeff Layton
@ 2017-04-03 18:09         ` Jeremy Allison
  2017-04-03 18:18           ` Jeff Layton
  2017-04-03 19:16         ` Matthew Wilcox
  1 sibling, 1 reply; 54+ messages in thread
From: Jeremy Allison @ 2017-04-03 18:09 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Matthew Wilcox, NeilBrown, linux-fsdevel, linux-kernel,
	linux-ext4, akpm, tytso, jack

On Mon, Apr 03, 2017 at 01:47:37PM -0400, Jeff Layton wrote:
> On Mon, 2017-04-03 at 07:32 -0700, Matthew Wilcox wrote:
> > On Mon, Apr 03, 2017 at 06:28:38AM -0400, Jeff Layton wrote:
> > > On Mon, 2017-04-03 at 14:25 +1000, NeilBrown wrote:
> > > > Also I think that EIO should always over-ride ENOSPC as the possible
> > > > responses are different.  That probably means you need a separate seq
> > > > number for each, which isn't ideal.
> > > > 
> > > 
> > > I'm not quite convinced that it's really useful to do anything but
> > > report the latest error.
> > > 
> > > But...if we did need to prefer one over another, could we get away with
> > > always reporting -EIO once that error occurs? If so, then we'd still
> > > just need a single sequence counter.
> > 
> > I wonder whether it's even worth supporting both EIO and ENOSPC for a
> > writeback problem.  If I understand correctly, at the time of write(),
> > filesystems check to see if they have enough blocks to satisfy the
> > request, so ENOSPC only comes up in the writeback context for thinly
> > provisioned devices.
> > 
> > Programs have basically no use for the distinction.  In either case,
> > the situation is the same.  The written data is safely in RAM and cannot
> > be written to the storage.  If one were to make superhuman efforts,
> > one could mmap the file and write() it to a different device, but that
> > is incredibly rare.  For most programs, the response is to just die and
> > let the human deal with the corrupted file.
> > 
> > From a sysadmin point of view, of course the situation is different,
> > and the remedy is different, but they should be getting that information
> > through a different mechanism than monitoring the errno from every
> > system call.
> > 
> > If we do want to continue to support both EIO and ENOSPC from writeback,
> > then let's have EIO override ENOSPC as an error.  ie if an ENOSPC comes
> > in after an EIO is set, it only bumps the counter and applications will
> > see EIO, not ENOSPC on fresh calls to fsync().
> 
> 
> No, ENOSPC on writeback can certainly happen with network filesystems.
> NFS and CIFS have no way to reserve space. You wouldn't want to have to
> do an extra RPC on every buffered write. :)

CIFS has a way to reserve space. Look into "allocation size" on create.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-03 18:09         ` Jeremy Allison
@ 2017-04-03 18:18           ` Jeff Layton
  2017-04-03 18:36             ` Jeremy Allison
  0 siblings, 1 reply; 54+ messages in thread
From: Jeff Layton @ 2017-04-03 18:18 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Matthew Wilcox, NeilBrown, linux-fsdevel, linux-kernel,
	linux-ext4, akpm, tytso, jack

On Mon, 2017-04-03 at 11:09 -0700, Jeremy Allison wrote:
> On Mon, Apr 03, 2017 at 01:47:37PM -0400, Jeff Layton wrote:
> > On Mon, 2017-04-03 at 07:32 -0700, Matthew Wilcox wrote:
> > > On Mon, Apr 03, 2017 at 06:28:38AM -0400, Jeff Layton wrote:
> > > > On Mon, 2017-04-03 at 14:25 +1000, NeilBrown wrote:
> > > > > Also I think that EIO should always over-ride ENOSPC as the possible
> > > > > responses are different.  That probably means you need a separate seq
> > > > > number for each, which isn't ideal.
> > > > > 
> > > > 
> > > > I'm not quite convinced that it's really useful to do anything but
> > > > report the latest error.
> > > > 
> > > > But...if we did need to prefer one over another, could we get away with
> > > > always reporting -EIO once that error occurs? If so, then we'd still
> > > > just need a single sequence counter.
> > > 
> > > I wonder whether it's even worth supporting both EIO and ENOSPC for a
> > > writeback problem.  If I understand correctly, at the time of write(),
> > > filesystems check to see if they have enough blocks to satisfy the
> > > request, so ENOSPC only comes up in the writeback context for thinly
> > > provisioned devices.
> > > 
> > > Programs have basically no use for the distinction.  In either case,
> > > the situation is the same.  The written data is safely in RAM and cannot
> > > be written to the storage.  If one were to make superhuman efforts,
> > > one could mmap the file and write() it to a different device, but that
> > > is incredibly rare.  For most programs, the response is to just die and
> > > let the human deal with the corrupted file.
> > > 
> > > From a sysadmin point of view, of course the situation is different,
> > > and the remedy is different, but they should be getting that information
> > > through a different mechanism than monitoring the errno from every
> > > system call.
> > > 
> > > If we do want to continue to support both EIO and ENOSPC from writeback,
> > > then let's have EIO override ENOSPC as an error.  ie if an ENOSPC comes
> > > in after an EIO is set, it only bumps the counter and applications will
> > > see EIO, not ENOSPC on fresh calls to fsync().
> > 
> > 
> > No, ENOSPC on writeback can certainly happen with network filesystems.
> > NFS and CIFS have no way to reserve space. You wouldn't want to have to
> > do an extra RPC on every buffered write. :)
> 
> CIFS has a way to reserve space. Look into "allocation size" on create.

That won't help here as it's done on open().

The problem here is that we might create a file (and not preallocate
anything), then write a bunch of stuff to the cache under an oplock.
Then when we go to write back, we get the CIFS equivalent of -ENOSPC.

What local filesystems do (AIUI) is preallocate so that you can catch
an ENOSPC condition earlier, when you're dirtying new pages in the
cache. That's pretty much impossible to do on a network filesystem
though.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-03 18:18           ` Jeff Layton
@ 2017-04-03 18:36             ` Jeremy Allison
  2017-04-03 18:40               ` Jeremy Allison
  0 siblings, 1 reply; 54+ messages in thread
From: Jeremy Allison @ 2017-04-03 18:36 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Matthew Wilcox, NeilBrown, linux-fsdevel, linux-kernel,
	linux-ext4, akpm, tytso, jack

On Mon, Apr 03, 2017 at 02:18:44PM -0400, Jeff Layton wrote:
> On Mon, 2017-04-03 at 11:09 -0700, Jeremy Allison wrote:
> > On Mon, Apr 03, 2017 at 01:47:37PM -0400, Jeff Layton wrote:
> > > On Mon, 2017-04-03 at 07:32 -0700, Matthew Wilcox wrote:
> > > > On Mon, Apr 03, 2017 at 06:28:38AM -0400, Jeff Layton wrote:
> > > > > On Mon, 2017-04-03 at 14:25 +1000, NeilBrown wrote:
> > > > > > Also I think that EIO should always over-ride ENOSPC as the possible
> > > > > > responses are different.  That probably means you need a separate seq
> > > > > > number for each, which isn't ideal.
> > > > > > 
> > > > > 
> > > > > I'm not quite convinced that it's really useful to do anything but
> > > > > report the latest error.
> > > > > 
> > > > > But...if we did need to prefer one over another, could we get away with
> > > > > always reporting -EIO once that error occurs? If so, then we'd still
> > > > > just need a single sequence counter.
> > > > 
> > > > I wonder whether it's even worth supporting both EIO and ENOSPC for a
> > > > writeback problem.  If I understand correctly, at the time of write(),
> > > > filesystems check to see if they have enough blocks to satisfy the
> > > > request, so ENOSPC only comes up in the writeback context for thinly
> > > > provisioned devices.
> > > > 
> > > > Programs have basically no use for the distinction.  In either case,
> > > > the situation is the same.  The written data is safely in RAM and cannot
> > > > be written to the storage.  If one were to make superhuman efforts,
> > > > one could mmap the file and write() it to a different device, but that
> > > > is incredibly rare.  For most programs, the response is to just die and
> > > > let the human deal with the corrupted file.
> > > > 
> > > > From a sysadmin point of view, of course the situation is different,
> > > > and the remedy is different, but they should be getting that information
> > > > through a different mechanism than monitoring the errno from every
> > > > system call.
> > > > 
> > > > If we do want to continue to support both EIO and ENOSPC from writeback,
> > > > then let's have EIO override ENOSPC as an error.  ie if an ENOSPC comes
> > > > in after an EIO is set, it only bumps the counter and applications will
> > > > see EIO, not ENOSPC on fresh calls to fsync().
> > > 
> > > 
> > > No, ENOSPC on writeback can certainly happen with network filesystems.
> > > NFS and CIFS have no way to reserve space. You wouldn't want to have to
> > > do an extra RPC on every buffered write. :)
> > 
> > CIFS has a way to reserve space. Look into "allocation size" on create.
> 
> That won't help here as it's done on open().
> 
> The problem here is that we might create a file (and not preallocate
> anything), then write a bunch of stuff to the cache under an oplock.
> Then when we go to write back, we get the CIFS equivalent of -ENOSPC.
> 
> What local filesystems do (AIUI) is preallocate so that you can catch
> an ENOSPC condition earlier, when you're dirtying new pages in the
> cache. That's pretty much impossible to do on a network filesystem
> though.

There's also SMB_SET_FILE_ALLOCATION_INFO which can be
done over SMB1/2/3 on an open file handle.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-03 18:36             ` Jeremy Allison
@ 2017-04-03 18:40               ` Jeremy Allison
  2017-04-03 18:49                 ` Jeff Layton
  0 siblings, 1 reply; 54+ messages in thread
From: Jeremy Allison @ 2017-04-03 18:40 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Matthew Wilcox, NeilBrown, linux-fsdevel, linux-kernel,
	linux-ext4, akpm, tytso, jack

On Mon, Apr 03, 2017 at 11:36:48AM -0700, Jeremy Allison wrote:
> On Mon, Apr 03, 2017 at 02:18:44PM -0400, Jeff Layton wrote:
> > On Mon, 2017-04-03 at 11:09 -0700, Jeremy Allison wrote:
> > > 
> > > CIFS has a way to reserve space. Look into "allocation size" on create.
> > 
> > That won't help here as it's done on open().
> > 
> > The problem here is that we might create a file (and not preallocate
> > anything), then write a bunch of stuff to the cache under an oplock.
> > Then when we go to write back, we get the CIFS equivalent of -ENOSPC.
> > 
> > What local filesystems do (AIUI) is preallocate so that you can catch
> > an ENOSPC condition earlier, when you're dirtying new pages in the
> > cache. That's pretty much impossible to do on a network filesystem
> > though.
> 
> There's also SMB_SET_FILE_ALLOCATION_INFO which can be
> done over SMB1/2/3 on an open file handle.

There's *always* a way to do something in SMB1/2/3. :-).

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-03 18:40               ` Jeremy Allison
@ 2017-04-03 18:49                 ` Jeff Layton
  0 siblings, 0 replies; 54+ messages in thread
From: Jeff Layton @ 2017-04-03 18:49 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Matthew Wilcox, NeilBrown, linux-fsdevel, linux-kernel,
	linux-ext4, akpm, tytso, jack

On Mon, 2017-04-03 at 11:40 -0700, Jeremy Allison wrote:
> On Mon, Apr 03, 2017 at 11:36:48AM -0700, Jeremy Allison wrote:
> > On Mon, Apr 03, 2017 at 02:18:44PM -0400, Jeff Layton wrote:
> > > On Mon, 2017-04-03 at 11:09 -0700, Jeremy Allison wrote:
> > > > 
> > > > CIFS has a way to reserve space. Look into "allocation size" on create.
> > > 
> > > That won't help here as it's done on open().
> > > 
> > > The problem here is that we might create a file (and not preallocate
> > > anything), then write a bunch of stuff to the cache under an oplock.
> > > Then when we go to write back, we get the CIFS equivalent of -ENOSPC.
> > > 
> > > What local filesystems do (AIUI) is preallocate so that you can catch
> > > an ENOSPC condition earlier, when you're dirtying new pages in the
> > > cache. That's pretty much impossible to do on a network filesystem
> > > though.
> > 
> > There's also SMB_SET_FILE_ALLOCATION_INFO which can be
> > done over SMB1/2/3 on an open file handle.
> 
> There's *always* a way to do something in SMB1/2/3. :-).

Yes, indeed...Still, I think we'll need to deal with this during
writeback as well. Earlier versions of NFS certainly don't have
anything along those lines, though you could probably do some sort of
speculative preallocation with v4.2.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-03 17:47       ` Jeff Layton
  2017-04-03 18:09         ` Jeremy Allison
@ 2017-04-03 19:16         ` Matthew Wilcox
  2017-04-03 20:16           ` Jeff Layton
  1 sibling, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2017-04-03 19:16 UTC (permalink / raw)
  To: Jeff Layton
  Cc: NeilBrown, linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

On Mon, Apr 03, 2017 at 01:47:37PM -0400, Jeff Layton wrote:
> > I wonder whether it's even worth supporting both EIO and ENOSPC for a
> > writeback problem.  If I understand correctly, at the time of write(),
> > filesystems check to see if they have enough blocks to satisfy the
> > request, so ENOSPC only comes up in the writeback context for thinly
> > provisioned devices.
> 
> No, ENOSPC on writeback can certainly happen with network filesystems.
> NFS and CIFS have no way to reserve space. You wouldn't want to have to
> do an extra RPC on every buffered write. :)

Aaah, yes, network filesystems.  I would indeed not want to do an extra
RPC on every write to a hole (it's a hole vs non-hole question, rather
than a buffered/unbuffered question ... unless you're WAFLing and not
reclaiming quickly enough, I suppose).

So, OK, that makes sense, we should keep allowing filesystems to report
ENOSPC as a writeback error.  But I think much of the argument below
still holds, and we should continue to have a prior EIO to be reported
over a new ENOSPC (even if the program has already consumed the EIO).

If you find that unconvincing, we could do something like this ...

void filemap_set_wb_error(struct address_space *mapping, int err)
{
	struct inode *inode = mapping->host;

	if (!err)
		return;
	/*
	 * This should be called with the error code that we want to return
	 * on fsync. Thus, it should always be <= 0.
	 */
	WARN_ON(err > 0);

	spin_lock(&inode->i_lock);
	if (err == -EIO)
		mapping->wb_err |= 1;
	else if (err == -ENOSPC)
		mapping->wb_err |= 2;
	mapping->wb_err += 4;
	spin_unlock(&inode->i_lock);
}

int filemap_report_wb_error(struct file *file)
{
	struct inode *inode = file_inode(file);
	struct address_space *mapping = file->f_mapping;
	int err;

	spin_lock(&inode->i_lock);
	if (file->f_wb_err == mapping->wb_err) {
		err = 0;
	} else if (mapping->wb_err & 1) {
		filp->f_wb_err = mapping->wb_err & ~2;
		err = -EIO;
	} else {
		filp->f_wb_err = mapping->wb_err;
		err = -ENOSPC;
	}
	spin_unlock(&inode->i_lock);
	return err;
}

If I got that right, calling fsync() on an inode which has experienced
both errors would first get an EIO.  Calling fsync() on it again would
get an ENOSPC.  Calling fsync() on it a third time would get 0.  When
either error occurs again, the thread will go back through the cycle
(EIO -> ENOSPC -> 0).

> > Programs have basically no use for the distinction.  In either case,
> > the situation is the same.  The written data is safely in RAM and cannot
> > be written to the storage.  If one were to make superhuman efforts,
> > one could mmap the file and write() it to a different device, but that
> > is incredibly rare.  For most programs, the response is to just die and
> > let the human deal with the corrupted file.
> > 
> > From a sysadmin point of view, of course the situation is different,
> > and the remedy is different, but they should be getting that information
> > through a different mechanism than monitoring the errno from every
> > system call.
> > 
> > If we do want to continue to support both EIO and ENOSPC from writeback,
> > then let's have EIO override ENOSPC as an error.  ie if an ENOSPC comes
> > in after an EIO is set, it only bumps the counter and applications will
> > see EIO, not ENOSPC on fresh calls to fsync().

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-03 19:16         ` Matthew Wilcox
@ 2017-04-03 20:16           ` Jeff Layton
  2017-04-04  2:45             ` Matthew Wilcox
  2017-04-04  3:03             ` NeilBrown
  0 siblings, 2 replies; 54+ messages in thread
From: Jeff Layton @ 2017-04-03 20:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: NeilBrown, linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

On Mon, 2017-04-03 at 12:16 -0700, Matthew Wilcox wrote:
> On Mon, Apr 03, 2017 at 01:47:37PM -0400, Jeff Layton wrote:
> > > I wonder whether it's even worth supporting both EIO and ENOSPC for a
> > > writeback problem.  If I understand correctly, at the time of write(),
> > > filesystems check to see if they have enough blocks to satisfy the
> > > request, so ENOSPC only comes up in the writeback context for thinly
> > > provisioned devices.
> > 
> > No, ENOSPC on writeback can certainly happen with network filesystems.
> > NFS and CIFS have no way to reserve space. You wouldn't want to have to
> > do an extra RPC on every buffered write. :)
> 
> Aaah, yes, network filesystems.  I would indeed not want to do an extra
> RPC on every write to a hole (it's a hole vs non-hole question, rather
> than a buffered/unbuffered question ... unless you're WAFLing and not
> reclaiming quickly enough, I suppose).
> 
> So, OK, that makes sense, we should keep allowing filesystems to report
> ENOSPC as a writeback error.  But I think much of the argument below
> still holds, and we should continue to have a prior EIO to be reported
> over a new ENOSPC (even if the program has already consumed the EIO).
> 

I'm fine with that (though I'd like Neil's thoughts before we decide
anything) there.

> If you find that unconvincing, we could do something like this ...
> 
> void filemap_set_wb_error(struct address_space *mapping, int err)
> {
> 	struct inode *inode = mapping->host;
> 
> 	if (!err)
> 		return;
> 	/*
> 	 * This should be called with the error code that we want to return
> 	 * on fsync. Thus, it should always be <= 0.
> 	 */
> 	WARN_ON(err > 0);
> 
> 	spin_lock(&inode->i_lock);
> 	if (err == -EIO)
> 		mapping->wb_err |= 1;
> 	else if (err == -ENOSPC)
> 		mapping->wb_err |= 2;
> 	mapping->wb_err += 4;
> 	spin_unlock(&inode->i_lock);
> }
> 
> int filemap_report_wb_error(struct file *file)
> {
> 	struct inode *inode = file_inode(file);
> 	struct address_space *mapping = file->f_mapping;
> 	int err;
> 
> 	spin_lock(&inode->i_lock);
> 	if (file->f_wb_err == mapping->wb_err) {
> 		err = 0;
> 	} else if (mapping->wb_err & 1) {
> 		filp->f_wb_err = mapping->wb_err & ~2;
> 		err = -EIO;
> 	} else {
> 		filp->f_wb_err = mapping->wb_err;
> 		err = -ENOSPC;
> 	}
> 	spin_unlock(&inode->i_lock);
> 	return err;
> }
> 
> If I got that right, calling fsync() on an inode which has experienced
> both errors would first get an EIO.  Calling fsync() on it again would
> get an ENOSPC.  Calling fsync() on it a third time would get 0.  When
> either error occurs again, the thread will go back through the cycle
> (EIO -> ENOSPC -> 0).
> 

I don't think so? mapping->wb_err would still have 0x1 set after the
first call so you'd always end up in the first else if branch.

It's getting toward beer 30 here though so I could be misreading it.

In any case, I'd rather not do this any more cleverly than we have to.
Simpler is better here, and letting EIO override ENOSPC would be
preferable to me.

Also, since we're going to do this with 32 bit values, it might be nice
to use atomics and not need the spinlock there, especially if we want
to be able to clear that value out when the i_writecount goes to 0.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-03 20:16           ` Jeff Layton
@ 2017-04-04  2:45             ` Matthew Wilcox
  2017-04-04  3:03             ` NeilBrown
  1 sibling, 0 replies; 54+ messages in thread
From: Matthew Wilcox @ 2017-04-04  2:45 UTC (permalink / raw)
  To: Jeff Layton
  Cc: NeilBrown, linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

On Mon, Apr 03, 2017 at 04:16:17PM -0400, Jeff Layton wrote:
> On Mon, 2017-04-03 at 12:16 -0700, Matthew Wilcox wrote:
> > int filemap_report_wb_error(struct file *file)
> > {
> > 	struct inode *inode = file_inode(file);
> > 	struct address_space *mapping = file->f_mapping;
> > 	int err;
> > 
> > 	spin_lock(&inode->i_lock);
> > 	if (file->f_wb_err == mapping->wb_err) {
> > 		err = 0;
> > 	} else if (mapping->wb_err & 1) {
> > 		filp->f_wb_err = mapping->wb_err & ~2;
> > 		err = -EIO;
> > 	} else {
> > 		filp->f_wb_err = mapping->wb_err;
> > 		err = -ENOSPC;
> > 	}
> > 	spin_unlock(&inode->i_lock);
> > 	return err;
> > }
> > 
> > If I got that right, calling fsync() on an inode which has experienced
> > both errors would first get an EIO.  Calling fsync() on it again would
> > get an ENOSPC.  Calling fsync() on it a third time would get 0.  When
> > either error occurs again, the thread will go back through the cycle
> > (EIO -> ENOSPC -> 0).
> > 
> 
> I don't think so? mapping->wb_err would still have 0x1 set after the
> first call so you'd always end up in the first else if branch.
> 
> It's getting toward beer 30 here though so I could be misreading it.

Well, yes, of course you misread it.  You read what I actually wrote
instead of what I intended to write.  Silly Jeff ...

int filemap_report_wb_error(struct file *file)
{
	struct inode *inode = file_inode(file);
	struct address_space *mapping = file->f_mapping;
	int err;

	spin_lock(&inode->i_lock);
	if (file->f_wb_err == mapping->wb_err) {
		err = 0;
	} else if ((mapping->wb_err ^ file->f_wb_err) == 2) {
		filp->f_wb_err = mapping->wb_err;
		err = -ENOSPC;
	} else {
		filp->f_wb_err = mapping->wb_err & ~2;
		err = -EIO;
	}
	spin_unlock(&inode->i_lock);
	return err;
}

The read side is easier in terms of atomic ...

int filemap_report_wb_error(struct file *file)
{
	unsigned int wb_err = atomic_read(&file->f_mapping->wb_err)

	if (file->f_wb_err == wb_err)
		return 0;
	if ((file->f_wb_err ^ wb_err) == 2) {
		filp->f_wb_err = wb_err;
		return -ENOSPC;
	} else {
		filp->f_wb_err = wb_err & ~2;
		return -EIO;
	}
}

but doing the write side with an atomic looks incredibly painful.  Since
we don't actually need to make the write side scalable, I'd rather see the
write side continue to use a spinlock and do the read side this way:

int filemap_report_wb_error(struct file *file)
{
	unsigned int wb_err = READ_ONCE(file->f_mapping->wb_err)

	if (file->f_wb_err == wb_err)
		return 0;
	if ((file->f_wb_err ^ wb_err) == 2) {
		filp->f_wb_err = wb_err;
		return -ENOSPC;
	} else {
		filp->f_wb_err = wb_err & ~2;
		return -EIO;
	}
}

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-03 20:16           ` Jeff Layton
  2017-04-04  2:45             ` Matthew Wilcox
@ 2017-04-04  3:03             ` NeilBrown
  2017-04-04 11:41               ` Jeff Layton
  2017-04-04 11:53               ` Matthew Wilcox
  1 sibling, 2 replies; 54+ messages in thread
From: NeilBrown @ 2017-04-04  3:03 UTC (permalink / raw)
  To: Jeff Layton, Matthew Wilcox
  Cc: linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

[-- Attachment #1: Type: text/plain, Size: 4060 bytes --]

On Mon, Apr 03 2017, Jeff Layton wrote:

> On Mon, 2017-04-03 at 12:16 -0700, Matthew Wilcox wrote:
>> On Mon, Apr 03, 2017 at 01:47:37PM -0400, Jeff Layton wrote:
>> > > I wonder whether it's even worth supporting both EIO and ENOSPC for a
>> > > writeback problem.  If I understand correctly, at the time of write(),
>> > > filesystems check to see if they have enough blocks to satisfy the
>> > > request, so ENOSPC only comes up in the writeback context for thinly
>> > > provisioned devices.
>> > 
>> > No, ENOSPC on writeback can certainly happen with network filesystems.
>> > NFS and CIFS have no way to reserve space. You wouldn't want to have to
>> > do an extra RPC on every buffered write. :)
>> 
>> Aaah, yes, network filesystems.  I would indeed not want to do an extra
>> RPC on every write to a hole (it's a hole vs non-hole question, rather
>> than a buffered/unbuffered question ... unless you're WAFLing and not
>> reclaiming quickly enough, I suppose).
>> 
>> So, OK, that makes sense, we should keep allowing filesystems to report
>> ENOSPC as a writeback error.  But I think much of the argument below
>> still holds, and we should continue to have a prior EIO to be reported
>> over a new ENOSPC (even if the program has already consumed the EIO).
>> 
>
> I'm fine with that (though I'd like Neil's thoughts before we decide
> anything) there.

I'd like there be a well defined time when old errors were forgotten.
It does make sense for EIO to persist even if ENOSPC or EDQUOT is
received, but not forever.
Clearing the remembered errors when put_write_access() causes
i_writecount to reach zero is one option (as suggested), but I'm not
sure I'm happy with it.

Local filesystems, or network filesystems which receive strong write
delegations, should only ever return EIO to fsync.  We should
concentrate on them first, I think.  As there is only one possible
error, the seq counter is sufficient to "clear" it once it has been
reported to fsync() (or write()?).

Other network filesystems could return a whole host of errors: ENOSPC
EDQUOT ESTALE EPERM EFBIG ...
Do we want to limit exactly which errors are allowed in generic code, or
do we just support EIO generically and expect the filesystem to sort out
the details for anything else?

One possible approach a filesystem could take is just to allow a single
async writeback error.  After that error, all subsequent write()
system calls become synchronous. As write() or fsync() is called on each
file descriptor (which could possibly have sent the write which caused
the error), an error is returned and that fact is counted.  Once we have
returned as many errors as there are open file descriptors
(i_writecount?), and have seen a successful write, the filesystem
forgets all recorded errors and switches back to async writes (for that
inode).   NFS does this switch-to-sync-on-error.  See nfs_need_check_write().

The "which could possibly have sent the write which caused the error" is
an explicit reference to NFS.  NFS doesn't use the AS_EIO/AS_ENOSPC
flags to return async errors.  It allocates an nfs_open_context for each
user who opens a given inode, and stores an error in there.  Each dirty
pages is associated with one of these, so errors a sure to go to the
correct user, though not necessarily the correct fd at present.

When we specify the new behaviour we should be careful to be as vague as
possible while still saying what we need.  This allows filesystems some
flexibility.

  If an error happens during writeback, the next write() or fsync() (or
  ....) on the file descriptor to which data was written will return -1
  with errno set to EIO or some other relevant error.  Other file
  descriptors open on the same file may receive EIO or some other error
  on a subsequent appropriate system call.
  It should not be assumed that close() will return an error.  fsync()
  must be called before close() if writeback errors are important to the
  application.


Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-04  3:03             ` NeilBrown
@ 2017-04-04 11:41               ` Jeff Layton
  2017-04-04 22:41                 ` NeilBrown
  2017-04-04 11:53               ` Matthew Wilcox
  1 sibling, 1 reply; 54+ messages in thread
From: Jeff Layton @ 2017-04-04 11:41 UTC (permalink / raw)
  To: NeilBrown, Matthew Wilcox
  Cc: linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

On Tue, 2017-04-04 at 13:03 +1000, NeilBrown wrote:
> On Mon, Apr 03 2017, Jeff Layton wrote:
> 
> > On Mon, 2017-04-03 at 12:16 -0700, Matthew Wilcox wrote:
> > > On Mon, Apr 03, 2017 at 01:47:37PM -0400, Jeff Layton wrote:
> > > > > I wonder whether it's even worth supporting both EIO and ENOSPC for a
> > > > > writeback problem.  If I understand correctly, at the time of write(),
> > > > > filesystems check to see if they have enough blocks to satisfy the
> > > > > request, so ENOSPC only comes up in the writeback context for thinly
> > > > > provisioned devices.
> > > > 
> > > > No, ENOSPC on writeback can certainly happen with network filesystems.
> > > > NFS and CIFS have no way to reserve space. You wouldn't want to have to
> > > > do an extra RPC on every buffered write. :)
> > > 
> > > Aaah, yes, network filesystems.  I would indeed not want to do an extra
> > > RPC on every write to a hole (it's a hole vs non-hole question, rather
> > > than a buffered/unbuffered question ... unless you're WAFLing and not
> > > reclaiming quickly enough, I suppose).
> > > 
> > > So, OK, that makes sense, we should keep allowing filesystems to report
> > > ENOSPC as a writeback error.  But I think much of the argument below
> > > still holds, and we should continue to have a prior EIO to be reported
> > > over a new ENOSPC (even if the program has already consumed the EIO).
> > > 
> > 
> > I'm fine with that (though I'd like Neil's thoughts before we decide
> > anything) there.
> 
> I'd like there be a well defined time when old errors were forgotten.
> It does make sense for EIO to persist even if ENOSPC or EDQUOT is
> received, but not forever.
> Clearing the remembered errors when put_write_access() causes
> i_writecount to reach zero is one option (as suggested), but I'm not
> sure I'm happy with it.
> 
> Local filesystems, or network filesystems which receive strong write
> delegations, should only ever return EIO to fsync.  We should
> concentrate on them first, I think.  As there is only one possible
> error, the seq counter is sufficient to "clear" it once it has been
> reported to fsync() (or write()?).
> 
> Other network filesystems could return a whole host of errors: ENOSPC
> EDQUOT ESTALE EPERM EFBIG ...
> Do we want to limit exactly which errors are allowed in generic code, or
> do we just support EIO generically and expect the filesystem to sort out
> the details for anything else?
> 
> One possible approach a filesystem could take is just to allow a single
> async writeback error.  After that error, all subsequent write()
> system calls become synchronous. As write() or fsync() is called on each
> file descriptor (which could possibly have sent the write which caused
> the error), an error is returned and that fact is counted.  Once we have
> returned as many errors as there are open file descriptors
> (i_writecount?), and have seen a successful write, the filesystem
> forgets all recorded errors and switches back to async writes (for that
> inode).   NFS does this switch-to-sync-on-error.  See nfs_need_check_write().
> 
> The "which could possibly have sent the write which caused the error" is
> an explicit reference to NFS.  NFS doesn't use the AS_EIO/AS_ENOSPC
> flags to return async errors.  It allocates an nfs_open_context for each
> user who opens a given inode, and stores an error in there.  Each dirty
> pages is associated with one of these, so errors a sure to go to the
> correct user, though not necessarily the correct fd at present.
> 
> When we specify the new behaviour we should be careful to be as vague as
> possible while still saying what we need.  This allows filesystems some
> flexibility.
> 
>   If an error happens during writeback, the next write() or fsync() (or
>   ....) on the file descriptor to which data was written will return -1
>   with errno set to EIO or some other relevant error.  Other file
>   descriptors open on the same file may receive EIO or some other error
>   on a subsequent appropriate system call.
>   It should not be assumed that close() will return an error.  fsync()
>   must be called before close() if writeback errors are important to the
>   application.
> 
> 

A lot in here... :)

While I like the NFS method of switching to sync I/O on error (and
indeed, I'm copying that in the Ceph ENOSPC patches I have), I'm not
sure it would really help anything here. The main reason NFS does that
is to prevent you from dirtying tons of pages that can't be cleaned. 

While that is a laudable goal, it's not really the problem I'm
interested in solving here. My goal is simply to ensure that you see a
writeback error on fsync if one occurred since the last fsync.

I think it just comes down to the fact that I'm not convinced that it
really matters much _what_ error gets reported, as long as you get one.
As you've mentioned in earlier discussions, most programs just treat it
as a fatal error anyway. As long as that error is representative of
some error that occurred during writeback, do we really care what it
was?

Suppose we have a bunch of dirty pages on an inode, get an EIO error
and then ENOSPC on a different write (maybe issued in parallel). We
send the ENOSPC error back to the application on an fsync (since it
came in last). Application then cleans out some junk from the fs and
then reissues the writes. They fail again and then he gets EIO from the
fsync and aborts.

Ok, so we might not have had to clean out the files and reissue the
writes there since we were going to give up anyway. Is it worth going
to extra lengths to avoid that there, given that we're in an error
condition anyway?

I'm just trying to understand why it matters at all what error you get
back when there multiple problems. They all seem equally valid to me in
that situation.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-04  3:03             ` NeilBrown
  2017-04-04 11:41               ` Jeff Layton
@ 2017-04-04 11:53               ` Matthew Wilcox
  2017-04-04 12:17                 ` Jeff Layton
                                   ` (2 more replies)
  1 sibling, 3 replies; 54+ messages in thread
From: Matthew Wilcox @ 2017-04-04 11:53 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jeff Layton, linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

On Tue, Apr 04, 2017 at 01:03:22PM +1000, NeilBrown wrote:
> On Mon, Apr 03 2017, Jeff Layton wrote:
> 
> > On Mon, 2017-04-03 at 12:16 -0700, Matthew Wilcox wrote:
> >> So, OK, that makes sense, we should keep allowing filesystems to report
> >> ENOSPC as a writeback error.  But I think much of the argument below
> >> still holds, and we should continue to have a prior EIO to be reported
> >> over a new ENOSPC (even if the program has already consumed the EIO).
> >
> > I'm fine with that (though I'd like Neil's thoughts before we decide
> > anything) there.
> 
> I'd like there be a well defined time when old errors were forgotten.
> It does make sense for EIO to persist even if ENOSPC or EDQUOT is
> received, but not forever.
> Clearing the remembered errors when put_write_access() causes
> i_writecount to reach zero is one option (as suggested), but I'm not
> sure I'm happy with it.
> 
> Local filesystems, or network filesystems which receive strong write
> delegations, should only ever return EIO to fsync.  We should
> concentrate on them first, I think.  As there is only one possible
> error, the seq counter is sufficient to "clear" it once it has been
> reported to fsync() (or write()?).
> 
> Other network filesystems could return a whole host of errors: ENOSPC
> EDQUOT ESTALE EPERM EFBIG ...
> Do we want to limit exactly which errors are allowed in generic code, or
> do we just support EIO generically and expect the filesystem to sort out
> the details for anything else?

I'd like us to focus on our POSIX compliance here and not return
arbitrary errors.  The relevant pages are here:

http://pubs.opengroup.org/onlinepubs/9699919799/functions/fsync.html
http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html
http://pubs.opengroup.org/onlinepubs/9699919799/functions/close.html

For close(), we have to map every error to EIO.
For fsync(), we can return any error that write() could have.  That limits
us to:

EFBIG ENOSPC EIO ENOBUFS ENXIO

I think EFBIG really isn't a writeback error; are there any network
filesystems that don't know the file size limit at the time they accept
the original write?  ENOBUFS seems like a transient error (*this* call to
fsync() failed, but the next one may succeed ... it's the equivalent of
ENOMEM).  ENXIO seems to me like it's a submission error, not a writeback
error.  So that leaves us with ENOSPC and EIO, as we have support today.

> One possible approach a filesystem could take is just to allow a single
> async writeback error.  After that error, all subsequent write()
> system calls become synchronous. As write() or fsync() is called on each
> file descriptor (which could possibly have sent the write which caused
> the error), an error is returned and that fact is counted.  Once we have
> returned as many errors as there are open file descriptors
> (i_writecount?), and have seen a successful write, the filesystem
> forgets all recorded errors and switches back to async writes (for that
> inode).   NFS does this switch-to-sync-on-error.  See nfs_need_check_write().
> 
> The "which could possibly have sent the write which caused the error" is
> an explicit reference to NFS.  NFS doesn't use the AS_EIO/AS_ENOSPC
> flags to return async errors.  It allocates an nfs_open_context for each
> user who opens a given inode, and stores an error in there.  Each dirty
> pages is associated with one of these, so errors a sure to go to the
> correct user, though not necessarily the correct fd at present.

... and you need the nfs_open_context in order to use the correct
credentials when writing a page to the server, correct?

> When we specify the new behaviour we should be careful to be as vague as
> possible while still saying what we need.  This allows filesystems some
> flexibility.
> 
>   If an error happens during writeback, the next write() or fsync() (or
>   ....) on the file descriptor to which data was written will return -1
>   with errno set to EIO or some other relevant error.  Other file
>   descriptors open on the same file may receive EIO or some other error
>   on a subsequent appropriate system call.
>   It should not be assumed that close() will return an error.  fsync()
>   must be called before close() if writeback errors are important to the
>   application.

Thanks for explaining what NFS does today.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-04 11:53               ` Matthew Wilcox
@ 2017-04-04 12:17                 ` Jeff Layton
  2017-04-04 16:12                   ` Matthew Wilcox
  2017-04-04 13:38                 ` Theodore Ts'o
  2017-04-04 22:28                 ` NeilBrown
  2 siblings, 1 reply; 54+ messages in thread
From: Jeff Layton @ 2017-04-04 12:17 UTC (permalink / raw)
  To: Matthew Wilcox, NeilBrown
  Cc: linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

On Tue, 2017-04-04 at 04:53 -0700, Matthew Wilcox wrote:
> On Tue, Apr 04, 2017 at 01:03:22PM +1000, NeilBrown wrote:
> > On Mon, Apr 03 2017, Jeff Layton wrote:
> > 
> > > On Mon, 2017-04-03 at 12:16 -0700, Matthew Wilcox wrote:
> > > > So, OK, that makes sense, we should keep allowing filesystems to report
> > > > ENOSPC as a writeback error.  But I think much of the argument below
> > > > still holds, and we should continue to have a prior EIO to be reported
> > > > over a new ENOSPC (even if the program has already consumed the EIO).
> > > 
> > > I'm fine with that (though I'd like Neil's thoughts before we decide
> > > anything) there.
> > 
> > I'd like there be a well defined time when old errors were forgotten.
> > It does make sense for EIO to persist even if ENOSPC or EDQUOT is
> > received, but not forever.
> > Clearing the remembered errors when put_write_access() causes
> > i_writecount to reach zero is one option (as suggested), but I'm not
> > sure I'm happy with it.
> > 
> > Local filesystems, or network filesystems which receive strong write
> > delegations, should only ever return EIO to fsync.  We should
> > concentrate on them first, I think.  As there is only one possible
> > error, the seq counter is sufficient to "clear" it once it has been
> > reported to fsync() (or write()?).
> > 
> > Other network filesystems could return a whole host of errors: ENOSPC
> > EDQUOT ESTALE EPERM EFBIG ...
> > Do we want to limit exactly which errors are allowed in generic code, or
> > do we just support EIO generically and expect the filesystem to sort out
> > the details for anything else?
> 
> I'd like us to focus on our POSIX compliance here and not return
> arbitrary errors.  The relevant pages are here:
> 
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/fsync.html
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/close.html
> 
> For close(), we have to map every error to EIO.
> For fsync(), we can return any error that write() could have.  That limits
> us to:
> 
> EFBIG ENOSPC EIO ENOBUFS ENXIO
> 
> I think EFBIG really isn't a writeback error; are there any network
> filesystems that don't know the file size limit at the time they accept
> the original write?  ENOBUFS seems like a transient error (*this* call to
> fsync() failed, but the next one may succeed ... it's the equivalent of
> ENOMEM).  ENXIO seems to me like it's a submission error, not a writeback
> error.  So that leaves us with ENOSPC and EIO, as we have support today.
> 

Agreed that we should focus on POSIX compliance. I'll also note that
POSIX states:

"If more than one error occurs in processing a function call, any one
of the possible errors may be returned, as the order of
detection is undefined."

    http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_03

So, I'd like to push back on this idea that we need to prefer reporting
-EIO over other errors. POSIX certainly doesn't mandate that. 

If we agree that that is the case, then I think the simplest thing to
do here would be to clear the other error flag(s) when we get a new
error, such that we only preserve the latest one. With that, we also
wouldn't need to clear anything out when i_writecount goes to zero
either. It would "just work" without that.

> > One possible approach a filesystem could take is just to allow a single
> > async writeback error.  After that error, all subsequent write()
> > system calls become synchronous. As write() or fsync() is called on each
> > file descriptor (which could possibly have sent the write which caused
> > the error), an error is returned and that fact is counted.  Once we have
> > returned as many errors as there are open file descriptors
> > (i_writecount?), and have seen a successful write, the filesystem
> > forgets all recorded errors and switches back to async writes (for that
> > inode).   NFS does this switch-to-sync-on-error.  See nfs_need_check_write().
> > 
> > The "which could possibly have sent the write which caused the error" is
> > an explicit reference to NFS.  NFS doesn't use the AS_EIO/AS_ENOSPC
> > flags to return async errors.  It allocates an nfs_open_context for each
> > user who opens a given inode, and stores an error in there.  Each dirty
> > pages is associated with one of these, so errors a sure to go to the
> > correct user, though not necessarily the correct fd at present.
> 
> ... and you need the nfs_open_context in order to use the correct
> credentials when writing a page to the server, correct?
> 

Yes, and it is expensive. I don't think we want to do that at the
generic VFS layer if we can at all help it.

> > When we specify the new behaviour we should be careful to be as vague as
> > possible while still saying what we need.  This allows filesystems some
> > flexibility.
> > 
> >   If an error happens during writeback, the next write() or fsync() (or
> >   ....) on the file descriptor to which data was written will return -1
> >   with errno set to EIO or some other relevant error.  Other file
> >   descriptors open on the same file may receive EIO or some other error
> >   on a subsequent appropriate system call.
> >   It should not be assumed that close() will return an error.  fsync()
> >   must be called before close() if writeback errors are important to the
> >   application.
> 

...and I also agree that we leave as much grey area as possible here to
allow for a wide range of implementations.
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-04 11:53               ` Matthew Wilcox
  2017-04-04 12:17                 ` Jeff Layton
@ 2017-04-04 13:38                 ` Theodore Ts'o
  2017-04-04 22:28                 ` NeilBrown
  2 siblings, 0 replies; 54+ messages in thread
From: Theodore Ts'o @ 2017-04-04 13:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: NeilBrown, Jeff Layton, linux-fsdevel, linux-kernel, linux-ext4,
	akpm, jack

On Tue, Apr 04, 2017 at 04:53:58AM -0700, Matthew Wilcox wrote:
> I'd like us to focus on our POSIX compliance here and not return
> arbitrary errors.  The relevant pages are here:
> 
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/fsync.html
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/close.html
> 
> For close(), we have to map every error to EIO.

I don't believe that's true.  POSIX explicitly states that
implementations may return additional errors, and define additional
errors, in addition to the ones that are listeds in the specification.

The specification is pretty clear about saying when a particular
system call "shall" fail (meaning it must fail if it was so listed),
and when it "may" fail.  But no where does it say that these are the
only situations when a system call is allowed to fail.

Which is good, because ext4 and xfs will both return EUCLEAN if the
file system is corrupted.  (Mainly because it's too painful to define
a new errno, EFSCORRUPTED --- not because of trying to get it into
Posix, but because it's painful to get new errno's defined in glibc.)
POSIX says *nothing* about file systems being corrupted, and if your
interpretation were correct, we're already in violation of POSIX....

	       	    	     	   	      - Ted
					      

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-04 12:17                 ` Jeff Layton
@ 2017-04-04 16:12                   ` Matthew Wilcox
  2017-04-04 16:25                     ` Jeff Layton
  0 siblings, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2017-04-04 16:12 UTC (permalink / raw)
  To: Jeff Layton
  Cc: NeilBrown, linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

On Tue, Apr 04, 2017 at 08:17:48AM -0400, Jeff Layton wrote:
> Agreed that we should focus on POSIX compliance. I'll also note that
> POSIX states:
> 
> "If more than one error occurs in processing a function call, any one
> of the possible errors may be returned, as the order of
> detection is undefined."
> 
>     http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_03
> 
> So, I'd like to push back on this idea that we need to prefer reporting
> -EIO over other errors. POSIX certainly doesn't mandate that. 

I honestly wonder if we need to support ENOSPC from writeback at all.
Looking at our history, the AS_EIO / AS_ENOSPC came from this patch
in 2003:

https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=fcad2b42fc2e15a94ba1a1ba8535681a735bfd16

That seems to come from here:
http://lkml.iu.edu/hypermail/linux/kernel/0308.0/0205.html
which is marked as a resend, but I can't find the original.

It's a little misleading because the immediately preceding patch
introduced mapping->error, so there's no precedent here to speak of.
It looks like we used to just silently lose writeback errors (*cough*).

I'd like to suggest that maybe we don't need to support multiple errors
at all.  That all errors, including ENOSPC, get collapsed into EIO.
POSIX already tells us to do that for close() and permits us to do that
for fsync().

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-04 16:12                   ` Matthew Wilcox
@ 2017-04-04 16:25                     ` Jeff Layton
  2017-04-04 17:09                       ` Matthew Wilcox
  2017-04-04 23:13                       ` NeilBrown
  0 siblings, 2 replies; 54+ messages in thread
From: Jeff Layton @ 2017-04-04 16:25 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: NeilBrown, linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

On Tue, 2017-04-04 at 09:12 -0700, Matthew Wilcox wrote:
> On Tue, Apr 04, 2017 at 08:17:48AM -0400, Jeff Layton wrote:
> > Agreed that we should focus on POSIX compliance. I'll also note that
> > POSIX states:
> > 
> > "If more than one error occurs in processing a function call, any one
> > of the possible errors may be returned, as the order of
> > detection is undefined."
> > 
> >     http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_03
> > 
> > So, I'd like to push back on this idea that we need to prefer reporting
> > -EIO over other errors. POSIX certainly doesn't mandate that. 
> 
> I honestly wonder if we need to support ENOSPC from writeback at all.
> Looking at our history, the AS_EIO / AS_ENOSPC came from this patch
> in 2003:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=fcad2b42fc2e15a94ba1a1ba8535681a735bfd16
> 
> That seems to come from here:
> http://lkml.iu.edu/hypermail/linux/kernel/0308.0/0205.html
> which is marked as a resend, but I can't find the original.
> 
> It's a little misleading because the immediately preceding patch
> introduced mapping->error, so there's no precedent here to speak of.
> It looks like we used to just silently lose writeback errors (*cough*).
> 
> I'd like to suggest that maybe we don't need to support multiple errors
> at all.  That all errors, including ENOSPC, get collapsed into EIO.
> POSIX already tells us to do that for close() and permits us to do that
> for fsync().
> 

That is certainly allowed under POSIX as I interpret the spec. At a
minimum we just need a single flag and can collapse all errors under
that.

That said, I think giving more specific errors where we can is useful.
When your program is erroring out and writing 'I/O error' to the logs,
then how much time will your admins burn before they figure out that it
really failed because the filesystem was full?
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-04 16:25                     ` Jeff Layton
@ 2017-04-04 17:09                       ` Matthew Wilcox
  2017-04-04 18:08                         ` Jeff Layton
                                           ` (2 more replies)
  2017-04-04 23:13                       ` NeilBrown
  1 sibling, 3 replies; 54+ messages in thread
From: Matthew Wilcox @ 2017-04-04 17:09 UTC (permalink / raw)
  To: Jeff Layton
  Cc: NeilBrown, linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

On Tue, Apr 04, 2017 at 12:25:46PM -0400, Jeff Layton wrote:
> That said, I think giving more specific errors where we can is useful.
> When your program is erroring out and writing 'I/O error' to the logs,
> then how much time will your admins burn before they figure out that it
> really failed because the filesystem was full?

df is one of the first things I check ... a few years ago, I also learned
to check df -i ... ;-)

Anyway, given the decision to simply report the last error lets us do this
implementation:

void filemap_set_wb_error(struct address_space *mapping, int err)
{
	struct inode *inode = mapping->host;
	unsigned int wb_err;

	if (!err)
		return;
	/*
	 * This should be called with the error code that we want to return
	 * on fsync. Thus, it should always be <= 0.
	 */
	WARN_ON(err > 0 || err < -MAX_ERRNO);

	spin_lock(&inode->i_lock);
	wb_err = ((mapping->wb_err & ~MAX_ERRNO) + (1 << 12)) | -err;
	WRITE_ONCE(mapping->wb_err, wb_err);
	spin_unlock(&inode->i_lock);
}

int filemap_report_wb_error(struct file *file)
{
	struct inode *inode = file_inode(file);
	unsigned int wb_err = READ_ONCE(mapping->wb_err);

	if (file->f_wb_err == wb_err)
		return 0;
	return -(wb_err & 4095);
}

That only gives us 20 bits of counter, but I think that's enough.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-04 17:09                       ` Matthew Wilcox
@ 2017-04-04 18:08                         ` Jeff Layton
  2017-04-04 22:50                         ` NeilBrown
  2017-04-05 19:49                         ` Jeff Layton
  2 siblings, 0 replies; 54+ messages in thread
From: Jeff Layton @ 2017-04-04 18:08 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: NeilBrown, linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

On Tue, 2017-04-04 at 10:09 -0700, Matthew Wilcox wrote:
> On Tue, Apr 04, 2017 at 12:25:46PM -0400, Jeff Layton wrote:
> > That said, I think giving more specific errors where we can is useful.
> > When your program is erroring out and writing 'I/O error' to the logs,
> > then how much time will your admins burn before they figure out that it
> > really failed because the filesystem was full?
> 
> df is one of the first things I check ... a few years ago, I also learned
> to check df -i ... ;-)
> 
> Anyway, given the decision to simply report the last error lets us do this
> implementation:
> 
> void filemap_set_wb_error(struct address_space *mapping, int err)
> {
> 	struct inode *inode = mapping->host;
> 	unsigned int wb_err;
> 
> 	if (!err)
> 		return;
> 	/*
> 	 * This should be called with the error code that we want to return
> 	 * on fsync. Thus, it should always be <= 0.
> 	 */
> 	WARN_ON(err > 0 || err < -MAX_ERRNO);
> 
> 	spin_lock(&inode->i_lock);
> 	wb_err = ((mapping->wb_err & ~MAX_ERRNO) + (1 << 12)) | -err;
> 	WRITE_ONCE(mapping->wb_err, wb_err);

Do we need the WRITE_ONCE, given that you're under a spinlock there?

> 	spin_unlock(&inode->i_lock);
> }
> 
> int filemap_report_wb_error(struct file *file)
> {
> 	struct inode *inode = file_inode(file);
> 	unsigned int wb_err = READ_ONCE(mapping->wb_err);
> 
> 	if (file->f_wb_err == wb_err)
> 		return 0;
> 	return -(wb_err & 4095);
> }
> 
> That only gives us 20 bits of counter, but I think that's enough.

That'd be fine with me, but I'm all for allowing filesystems to return
arbitrary writeback errors on fsync.

Others may have different opinions there. We could add a wrapper
function that sanitizes the error codes if some filesystems wanted that
though.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-04 11:53               ` Matthew Wilcox
  2017-04-04 12:17                 ` Jeff Layton
  2017-04-04 13:38                 ` Theodore Ts'o
@ 2017-04-04 22:28                 ` NeilBrown
  2 siblings, 0 replies; 54+ messages in thread
From: NeilBrown @ 2017-04-04 22:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jeff Layton, linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

[-- Attachment #1: Type: text/plain, Size: 4833 bytes --]

On Tue, Apr 04 2017, Matthew Wilcox wrote:

> On Tue, Apr 04, 2017 at 01:03:22PM +1000, NeilBrown wrote:
>> On Mon, Apr 03 2017, Jeff Layton wrote:
>> 
>> > On Mon, 2017-04-03 at 12:16 -0700, Matthew Wilcox wrote:
>> >> So, OK, that makes sense, we should keep allowing filesystems to report
>> >> ENOSPC as a writeback error.  But I think much of the argument below
>> >> still holds, and we should continue to have a prior EIO to be reported
>> >> over a new ENOSPC (even if the program has already consumed the EIO).
>> >
>> > I'm fine with that (though I'd like Neil's thoughts before we decide
>> > anything) there.
>> 
>> I'd like there be a well defined time when old errors were forgotten.
>> It does make sense for EIO to persist even if ENOSPC or EDQUOT is
>> received, but not forever.
>> Clearing the remembered errors when put_write_access() causes
>> i_writecount to reach zero is one option (as suggested), but I'm not
>> sure I'm happy with it.
>> 
>> Local filesystems, or network filesystems which receive strong write
>> delegations, should only ever return EIO to fsync.  We should
>> concentrate on them first, I think.  As there is only one possible
>> error, the seq counter is sufficient to "clear" it once it has been
>> reported to fsync() (or write()?).
>> 
>> Other network filesystems could return a whole host of errors: ENOSPC
>> EDQUOT ESTALE EPERM EFBIG ...
>> Do we want to limit exactly which errors are allowed in generic code, or
>> do we just support EIO generically and expect the filesystem to sort out
>> the details for anything else?
>
> I'd like us to focus on our POSIX compliance here and not return
> arbitrary errors.  The relevant pages are here:
>
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/fsync.html
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/close.html
>
> For close(), we have to map every error to EIO.
> For fsync(), we can return any error that write() could have.  That limits
> us to:
>
> EFBIG ENOSPC EIO ENOBUFS ENXIO
>
> I think EFBIG really isn't a writeback error; are there any network
> filesystems that don't know the file size limit at the time they accept
> the original write?  ENOBUFS seems like a transient error (*this* call to
> fsync() failed, but the next one may succeed ... it's the equivalent of
> ENOMEM).  ENXIO seems to me like it's a submission error, not a writeback
> error.  So that leaves us with ENOSPC and EIO, as we have support today.

I guess Posix doesn't acknowledge the existence of disk quotas?
I think we need to add EDQUOT to your list.
Other hypothetical errors errors from the server such as EPERM or ESTALE
can reasonably be mapped to EIO.

>
>> One possible approach a filesystem could take is just to allow a single
>> async writeback error.  After that error, all subsequent write()
>> system calls become synchronous. As write() or fsync() is called on each
>> file descriptor (which could possibly have sent the write which caused
>> the error), an error is returned and that fact is counted.  Once we have
>> returned as many errors as there are open file descriptors
>> (i_writecount?), and have seen a successful write, the filesystem
>> forgets all recorded errors and switches back to async writes (for that
>> inode).   NFS does this switch-to-sync-on-error.  See nfs_need_check_write().
>> 
>> The "which could possibly have sent the write which caused the error" is
>> an explicit reference to NFS.  NFS doesn't use the AS_EIO/AS_ENOSPC
>> flags to return async errors.  It allocates an nfs_open_context for each
>> user who opens a given inode, and stores an error in there.  Each dirty
>> pages is associated with one of these, so errors a sure to go to the
>> correct user, though not necessarily the correct fd at present.
>
> ... and you need the nfs_open_context in order to use the correct
> credentials when writing a page to the server, correct?

Correct.

Thanks,
NeilBrown


>
>> When we specify the new behaviour we should be careful to be as vague as
>> possible while still saying what we need.  This allows filesystems some
>> flexibility.
>> 
>>   If an error happens during writeback, the next write() or fsync() (or
>>   ....) on the file descriptor to which data was written will return -1
>>   with errno set to EIO or some other relevant error.  Other file
>>   descriptors open on the same file may receive EIO or some other error
>>   on a subsequent appropriate system call.
>>   It should not be assumed that close() will return an error.  fsync()
>>   must be called before close() if writeback errors are important to the
>>   application.
>
> Thanks for explaining what NFS does today.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-04 11:41               ` Jeff Layton
@ 2017-04-04 22:41                 ` NeilBrown
  0 siblings, 0 replies; 54+ messages in thread
From: NeilBrown @ 2017-04-04 22:41 UTC (permalink / raw)
  To: Jeff Layton, Matthew Wilcox
  Cc: linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

[-- Attachment #1: Type: text/plain, Size: 6870 bytes --]

On Tue, Apr 04 2017, Jeff Layton wrote:

> On Tue, 2017-04-04 at 13:03 +1000, NeilBrown wrote:
>> On Mon, Apr 03 2017, Jeff Layton wrote:
>> 
>> > On Mon, 2017-04-03 at 12:16 -0700, Matthew Wilcox wrote:
>> > > On Mon, Apr 03, 2017 at 01:47:37PM -0400, Jeff Layton wrote:
>> > > > > I wonder whether it's even worth supporting both EIO and ENOSPC for a
>> > > > > writeback problem.  If I understand correctly, at the time of write(),
>> > > > > filesystems check to see if they have enough blocks to satisfy the
>> > > > > request, so ENOSPC only comes up in the writeback context for thinly
>> > > > > provisioned devices.
>> > > > 
>> > > > No, ENOSPC on writeback can certainly happen with network filesystems.
>> > > > NFS and CIFS have no way to reserve space. You wouldn't want to have to
>> > > > do an extra RPC on every buffered write. :)
>> > > 
>> > > Aaah, yes, network filesystems.  I would indeed not want to do an extra
>> > > RPC on every write to a hole (it's a hole vs non-hole question, rather
>> > > than a buffered/unbuffered question ... unless you're WAFLing and not
>> > > reclaiming quickly enough, I suppose).
>> > > 
>> > > So, OK, that makes sense, we should keep allowing filesystems to report
>> > > ENOSPC as a writeback error.  But I think much of the argument below
>> > > still holds, and we should continue to have a prior EIO to be reported
>> > > over a new ENOSPC (even if the program has already consumed the EIO).
>> > > 
>> > 
>> > I'm fine with that (though I'd like Neil's thoughts before we decide
>> > anything) there.
>> 
>> I'd like there be a well defined time when old errors were forgotten.
>> It does make sense for EIO to persist even if ENOSPC or EDQUOT is
>> received, but not forever.
>> Clearing the remembered errors when put_write_access() causes
>> i_writecount to reach zero is one option (as suggested), but I'm not
>> sure I'm happy with it.
>> 
>> Local filesystems, or network filesystems which receive strong write
>> delegations, should only ever return EIO to fsync.  We should
>> concentrate on them first, I think.  As there is only one possible
>> error, the seq counter is sufficient to "clear" it once it has been
>> reported to fsync() (or write()?).
>> 
>> Other network filesystems could return a whole host of errors: ENOSPC
>> EDQUOT ESTALE EPERM EFBIG ...
>> Do we want to limit exactly which errors are allowed in generic code, or
>> do we just support EIO generically and expect the filesystem to sort out
>> the details for anything else?
>> 
>> One possible approach a filesystem could take is just to allow a single
>> async writeback error.  After that error, all subsequent write()
>> system calls become synchronous. As write() or fsync() is called on each
>> file descriptor (which could possibly have sent the write which caused
>> the error), an error is returned and that fact is counted.  Once we have
>> returned as many errors as there are open file descriptors
>> (i_writecount?), and have seen a successful write, the filesystem
>> forgets all recorded errors and switches back to async writes (for that
>> inode).   NFS does this switch-to-sync-on-error.  See nfs_need_check_write().
>> 
>> The "which could possibly have sent the write which caused the error" is
>> an explicit reference to NFS.  NFS doesn't use the AS_EIO/AS_ENOSPC
>> flags to return async errors.  It allocates an nfs_open_context for each
>> user who opens a given inode, and stores an error in there.  Each dirty
>> pages is associated with one of these, so errors a sure to go to the
>> correct user, though not necessarily the correct fd at present.
>> 
>> When we specify the new behaviour we should be careful to be as vague as
>> possible while still saying what we need.  This allows filesystems some
>> flexibility.
>> 
>>   If an error happens during writeback, the next write() or fsync() (or
>>   ....) on the file descriptor to which data was written will return -1
>>   with errno set to EIO or some other relevant error.  Other file
>>   descriptors open on the same file may receive EIO or some other error
>>   on a subsequent appropriate system call.
>>   It should not be assumed that close() will return an error.  fsync()
>>   must be called before close() if writeback errors are important to the
>>   application.
>> 
>> 
>
> A lot in here... :)
>
> While I like the NFS method of switching to sync I/O on error (and
> indeed, I'm copying that in the Ceph ENOSPC patches I have), I'm not
> sure it would really help anything here. The main reason NFS does that
> is to prevent you from dirtying tons of pages that can't be cleaned. 

It would help because it means there are no longer any async errors, so
there is no need to try to keep track of them.

If we decided that "last error wins", then that becomes irrelevant.  But
if we do care about any precedence of errors, then going sync is an easy
way to make sure the right error gets to the right place.

>
> While that is a laudable goal, it's not really the problem I'm
> interested in solving here. My goal is simply to ensure that you see a
> writeback error on fsync if one occurred since the last fsync.
>
> I think it just comes down to the fact that I'm not convinced that it
> really matters much _what_ error gets reported, as long as you get one.
> As you've mentioned in earlier discussions, most programs just treat it
> as a fatal error anyway. As long as that error is representative of
> some error that occurred during writeback, do we really care what it
> was?

I don't think I personally care at all, but there might be programs out
there...
I think it would be a good design goal to ensure that the behaviour seen
when there is only one open file descriptor on a file, remains
unchanged.
That means that file the fd is help open, multiple different error codes
can be returned in arbitrary order (unlikely, but possible).

Thanks,
NeilBrown


>
> Suppose we have a bunch of dirty pages on an inode, get an EIO error
> and then ENOSPC on a different write (maybe issued in parallel). We
> send the ENOSPC error back to the application on an fsync (since it
> came in last). Application then cleans out some junk from the fs and
> then reissues the writes. They fail again and then he gets EIO from the
> fsync and aborts.
>
> Ok, so we might not have had to clean out the files and reissue the
> writes there since we were going to give up anyway. Is it worth going
> to extra lengths to avoid that there, given that we're in an error
> condition anyway?
>
> I'm just trying to understand why it matters at all what error you get
> back when there multiple problems. They all seem equally valid to me in
> that situation.
>
> -- 
> Jeff Layton <jlayton@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-04 17:09                       ` Matthew Wilcox
  2017-04-04 18:08                         ` Jeff Layton
@ 2017-04-04 22:50                         ` NeilBrown
  2017-04-05 19:49                         ` Jeff Layton
  2 siblings, 0 replies; 54+ messages in thread
From: NeilBrown @ 2017-04-04 22:50 UTC (permalink / raw)
  To: Matthew Wilcox, Jeff Layton
  Cc: linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

[-- Attachment #1: Type: text/plain, Size: 1289 bytes --]

On Tue, Apr 04 2017, Matthew Wilcox wrote:

> On Tue, Apr 04, 2017 at 12:25:46PM -0400, Jeff Layton wrote:
>> That said, I think giving more specific errors where we can is useful.
>> When your program is erroring out and writing 'I/O error' to the logs,
>> then how much time will your admins burn before they figure out that it
>> really failed because the filesystem was full?
>
> df is one of the first things I check ... a few years ago, I also learned
> to check df -i ... ;-)
>
> Anyway, given the decision to simply report the last error lets us do this
> implementation:
>
> void filemap_set_wb_error(struct address_space *mapping, int err)
> {
> 	struct inode *inode = mapping->host;
> 	unsigned int wb_err;
>
> 	if (!err)
> 		return;
> 	/*
> 	 * This should be called with the error code that we want to return
> 	 * on fsync. Thus, it should always be <= 0.
> 	 */
> 	WARN_ON(err > 0 || err < -MAX_ERRNO);
>
> 	spin_lock(&inode->i_lock);
> 	wb_err = ((mapping->wb_err & ~MAX_ERRNO) + (1 << 12)) | -err;

Seriously?  You are missing "MAX_ERRNO" (4095) together with "1 << 12"
(4096) in the one expression without a big comment saying why?

Surely:
>       BUILD_BUG_ON_NOT_POWER_OF_2(MAX_ERRNO+1);
> 	wb_err = ((mapping->wb_err & ~MAX_ERRNO) + (MAX_ERRNO+1)) | -err;

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-04 16:25                     ` Jeff Layton
  2017-04-04 17:09                       ` Matthew Wilcox
@ 2017-04-04 23:13                       ` NeilBrown
  2017-04-05 11:14                         ` Jeff Layton
  1 sibling, 1 reply; 54+ messages in thread
From: NeilBrown @ 2017-04-04 23:13 UTC (permalink / raw)
  To: Jeff Layton, Matthew Wilcox
  Cc: linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

[-- Attachment #1: Type: text/plain, Size: 3381 bytes --]

On Tue, Apr 04 2017, Jeff Layton wrote:

> On Tue, 2017-04-04 at 09:12 -0700, Matthew Wilcox wrote:
>> On Tue, Apr 04, 2017 at 08:17:48AM -0400, Jeff Layton wrote:
>> > Agreed that we should focus on POSIX compliance. I'll also note that
>> > POSIX states:
>> > 
>> > "If more than one error occurs in processing a function call, any one
>> > of the possible errors may be returned, as the order of
>> > detection is undefined."
>> > 
>> >     http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_03
>> > 
>> > So, I'd like to push back on this idea that we need to prefer reporting
>> > -EIO over other errors. POSIX certainly doesn't mandate that. 
>> 
>> I honestly wonder if we need to support ENOSPC from writeback at all.
>> Looking at our history, the AS_EIO / AS_ENOSPC came from this patch
>> in 2003:
>> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=fcad2b42fc2e15a94ba1a1ba8535681a735bfd16
>> 
>> That seems to come from here:
>> http://lkml.iu.edu/hypermail/linux/kernel/0308.0/0205.html
>> which is marked as a resend, but I can't find the original.
>> 
>> It's a little misleading because the immediately preceding patch
>> introduced mapping->error, so there's no precedent here to speak of.
>> It looks like we used to just silently lose writeback errors (*cough*).
>> 
>> I'd like to suggest that maybe we don't need to support multiple errors
>> at all.  That all errors, including ENOSPC, get collapsed into EIO.
>> POSIX already tells us to do that for close() and permits us to do that
>> for fsync().
>> 
>
> That is certainly allowed under POSIX as I interpret the spec. At a
> minimum we just need a single flag and can collapse all errors under
> that.
>
> That said, I think giving more specific errors where we can is useful.
> When your program is erroring out and writing 'I/O error' to the logs,
> then how much time will your admins burn before they figure out that it
> really failed because the filesystem was full?

What if you don't have an admin?  What if it was an over-quota error?
I think precise error messages are valuable.
I am leaning towards "last error wins" though.  The complexity of any
scheme that reports "worst recent error" seems to out weigh the value.

I think we should present this as a service to filesystems. e.g. create
a "recent_wb_error" structure which the filesystem can record errors in
when they occur, and syscalls can read errors from.
One of these would be provided in 'struct address_space', but
filesystems can easily embed one in their own data structure
(e.g. nfs_open_context) if they want to.

I don't think we should return a recent_wb_error on close by default,
but individual filesystems can ("man 2 close" implies NFS does this for
EDQUOT at it should continue to do so).

fsync() (and file_sync_range()) should return a recent_wb_error, but
what about write()?  It would be a suitable way to stop an application
early, but it isn't exactly the requested write that failed...
Posix says of EIO from write:

    A physical I/O error has occurred.

which is rather vague.  Where and when did this error in physics (:-)
occur?

O_DIRECT write() can get an EIO from a previous write-back write to the
same file.  Maybe non-O_DIRECT writes should too?

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-04 23:13                       ` NeilBrown
@ 2017-04-05 11:14                         ` Jeff Layton
  2017-04-06  0:24                           ` NeilBrown
  0 siblings, 1 reply; 54+ messages in thread
From: Jeff Layton @ 2017-04-05 11:14 UTC (permalink / raw)
  To: NeilBrown, Matthew Wilcox
  Cc: linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

On Wed, 2017-04-05 at 09:13 +1000, NeilBrown wrote:
> On Tue, Apr 04 2017, Jeff Layton wrote:
> 
> > On Tue, 2017-04-04 at 09:12 -0700, Matthew Wilcox wrote:
> > > On Tue, Apr 04, 2017 at 08:17:48AM -0400, Jeff Layton wrote:
> > > > Agreed that we should focus on POSIX compliance. I'll also note that
> > > > POSIX states:
> > > > 
> > > > "If more than one error occurs in processing a function call, any one
> > > > of the possible errors may be returned, as the order of
> > > > detection is undefined."
> > > > 
> > > >     http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_03
> > > > 
> > > > So, I'd like to push back on this idea that we need to prefer reporting
> > > > -EIO over other errors. POSIX certainly doesn't mandate that. 
> > > 
> > > I honestly wonder if we need to support ENOSPC from writeback at all.
> > > Looking at our history, the AS_EIO / AS_ENOSPC came from this patch
> > > in 2003:
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=fcad2b42fc2e15a94ba1a1ba8535681a735bfd16
> > > 
> > > That seems to come from here:
> > > http://lkml.iu.edu/hypermail/linux/kernel/0308.0/0205.html
> > > which is marked as a resend, but I can't find the original.
> > > 
> > > It's a little misleading because the immediately preceding patch
> > > introduced mapping->error, so there's no precedent here to speak of.
> > > It looks like we used to just silently lose writeback errors (*cough*).
> > > 
> > > I'd like to suggest that maybe we don't need to support multiple errors
> > > at all.  That all errors, including ENOSPC, get collapsed into EIO.
> > > POSIX already tells us to do that for close() and permits us to do that
> > > for fsync().
> > > 
> > 
> > That is certainly allowed under POSIX as I interpret the spec. At a
> > minimum we just need a single flag and can collapse all errors under
> > that.
> > 
> > That said, I think giving more specific errors where we can is useful.
> > When your program is erroring out and writing 'I/O error' to the logs,
> > then how much time will your admins burn before they figure out that it
> > really failed because the filesystem was full?
> 
> What if you don't have an admin?  What if it was an over-quota error?
> I think precise error messages are valuable.
> I am leaning towards "last error wins" though.  The complexity of any
> scheme that reports "worst recent error" seems to out weigh the value.
> 
> I think we should present this as a service to filesystems. e.g. create
> a "recent_wb_error" structure which the filesystem can record errors in
> when they occur, and syscalls can read errors from.
> One of these would be provided in 'struct address_space', but
> filesystems can easily embed one in their own data structure
> (e.g. nfs_open_context) if they want to.
> 
> I don't think we should return a recent_wb_error on close by default,
> but individual filesystems can ("man 2 close" implies NFS does this for
> EDQUOT at it should continue to do so).
> 
> fsync() (and file_sync_range()) should return a recent_wb_error, but
> what about write()?  It would be a suitable way to stop an application
> early, but it isn't exactly the requested write that failed...
> Posix says of EIO from write:
> 
>     A physical I/O error has occurred.
> 
> which is rather vague.  Where and when did this error in physics (:-)
> occur?
> 
> O_DIRECT write() can get an EIO from a previous write-back write to the
> same file.  Maybe non-O_DIRECT writes should too?
> 

Some already do this for buffered writes.

This is really a philosophical question, IMO...is it correct to return
an error on a write call, due to writeback failing previously or during
the write call, quite possibly to a range that the write call does not
touch? I can see an argument either way for this.

Also, if we do think that returning an error on the write is the right
thing to do, should that error advance the sequence counter in the
struct file, such that an fsync afterward gets back 0? My feeling here
is that fsync should still report an error after a failed write, but
maybe that's wrong?

This is certainly one area where switching to synchronous writes on
error would make things a little simpler.
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-04 17:09                       ` Matthew Wilcox
  2017-04-04 18:08                         ` Jeff Layton
  2017-04-04 22:50                         ` NeilBrown
@ 2017-04-05 19:49                         ` Jeff Layton
  2017-04-05 21:03                           ` Matthew Wilcox
  2017-04-06  0:02                           ` NeilBrown
  2 siblings, 2 replies; 54+ messages in thread
From: Jeff Layton @ 2017-04-05 19:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: NeilBrown, linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

On Tue, 2017-04-04 at 10:09 -0700, Matthew Wilcox wrote:
> On Tue, Apr 04, 2017 at 12:25:46PM -0400, Jeff Layton wrote:
> > That said, I think giving more specific errors where we can is useful.
> > When your program is erroring out and writing 'I/O error' to the logs,
> > then how much time will your admins burn before they figure out that it
> > really failed because the filesystem was full?
> 
> df is one of the first things I check ... a few years ago, I also learned
> to check df -i ... ;-)
> 
> Anyway, given the decision to simply report the last error lets us do this
> implementation:
> 
> void filemap_set_wb_error(struct address_space *mapping, int err)
> {
> 	struct inode *inode = mapping->host;
> 	unsigned int wb_err;
> 
> 	if (!err)
> 		return;
> 	/*
> 	 * This should be called with the error code that we want to return
> 	 * on fsync. Thus, it should always be <= 0.
> 	 */
> 	WARN_ON(err > 0 || err < -MAX_ERRNO);
> 
> 	spin_lock(&inode->i_lock);
> 	wb_err = ((mapping->wb_err & ~MAX_ERRNO) + (1 << 12)) | -err;
> 	WRITE_ONCE(mapping->wb_err, wb_err);
> 	spin_unlock(&inode->i_lock);
> }
> 

I like this idea of being able to store arbitrary error codes there.
That should be used judiciously of course, but we already allow
returning arbitrary errors via the ->fsync op anyway.

I'll plan to incorporate something like that into the next set (with
judicious comments and constants).

One question...is the i_lock the right way to protect this? I think we
could do this locklessly too (cmpxchg in a loop, for instance). I'm not
worried about performance here -- it's just nice to be able to call
simple stuff like this without worrying about locking.

> int filemap_report_wb_error(struct file *file)
> {
> 	struct inode *inode = file_inode(file);
> 	unsigned int wb_err = READ_ONCE(mapping->wb_err);
> 
> 	if (file->f_wb_err == wb_err)
> 		return 0;
> 	return -(wb_err & 4095);
> }
> 
> That only gives us 20 bits of counter, but I think that's enough.

2^20 is 1048576, which seems a little small to me.

We may end up bumping the counter on every failed I/O. How fast can we
generate 1M failed I/Os? :)

2^52 however is 4503599627370496 (4Tios or so) ... that might take a
little longer to overflow. Is it worth the cost here to ensure that
this won't occur?

Actually...we could put this field in the inode instead of the mapping.
I know we've traditionally tracked this in the mapping, but is that
required here?

If we put this field in the inode then perhaps we can union it with
something and mitigate the cost of a larger counter...maybe in the
i_pipe union? I don't think S_ISREG inodes use anything in there, do
they?
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-05 19:49                         ` Jeff Layton
@ 2017-04-05 21:03                           ` Matthew Wilcox
  2017-04-06  0:19                             ` NeilBrown
  2017-04-06  0:02                           ` NeilBrown
  1 sibling, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2017-04-05 21:03 UTC (permalink / raw)
  To: Jeff Layton
  Cc: NeilBrown, linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

On Wed, Apr 05, 2017 at 03:49:52PM -0400, Jeff Layton wrote:
> > That only gives us 20 bits of counter, but I think that's enough.
> 
> 2^20 is 1048576, which seems a little small to me.
> 
> We may end up bumping the counter on every failed I/O. How fast can we
> generate 1M failed I/Os? :)

So there's a one-in-a-million chance of missing a failed I/O ... if
we're generating lots of errors, the next time the app calls fsync(),
it'll notice the other million times we've hit the problem :-)

> Actually...we could put this field in the inode instead of the mapping.
> I know we've traditionally tracked this in the mapping, but is that
> required here?
> 
> If we put this field in the inode then perhaps we can union it with
> something and mitigate the cost of a larger counter...maybe in the
> i_pipe union? I don't think S_ISREG inodes use anything in there, do
> they?

But writeback isn't just done on ISREG inodes, but also on S_ISBLK inodes,
which use i_bdev (right?)

Another possibility is to move this out of the address_space and into
either the super_block or the backing_device_info.  Errors don't tend
to be constrained to a single file but affect the entire filesystem,
or even multiple filesystems if you have a partitioned block device ...

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-05 19:49                         ` Jeff Layton
  2017-04-05 21:03                           ` Matthew Wilcox
@ 2017-04-06  0:02                           ` NeilBrown
  2017-04-06  2:55                             ` Matthew Wilcox
                                               ` (2 more replies)
  1 sibling, 3 replies; 54+ messages in thread
From: NeilBrown @ 2017-04-06  0:02 UTC (permalink / raw)
  To: Jeff Layton, Matthew Wilcox
  Cc: linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

[-- Attachment #1: Type: text/plain, Size: 4146 bytes --]

On Thu, Apr 06 2017, Jeff Layton wrote:

> On Tue, 2017-04-04 at 10:09 -0700, Matthew Wilcox wrote:
>> On Tue, Apr 04, 2017 at 12:25:46PM -0400, Jeff Layton wrote:
>> > That said, I think giving more specific errors where we can is useful.
>> > When your program is erroring out and writing 'I/O error' to the logs,
>> > then how much time will your admins burn before they figure out that it
>> > really failed because the filesystem was full?
>> 
>> df is one of the first things I check ... a few years ago, I also learned
>> to check df -i ... ;-)
>> 
>> Anyway, given the decision to simply report the last error lets us do this
>> implementation:
>> 
>> void filemap_set_wb_error(struct address_space *mapping, int err)
>> {
>> 	struct inode *inode = mapping->host;
>> 	unsigned int wb_err;
>> 
>> 	if (!err)
>> 		return;
>> 	/*
>> 	 * This should be called with the error code that we want to return
>> 	 * on fsync. Thus, it should always be <= 0.
>> 	 */
>> 	WARN_ON(err > 0 || err < -MAX_ERRNO);
>> 
>> 	spin_lock(&inode->i_lock);
>> 	wb_err = ((mapping->wb_err & ~MAX_ERRNO) + (1 << 12)) | -err;
>> 	WRITE_ONCE(mapping->wb_err, wb_err);
>> 	spin_unlock(&inode->i_lock);
>> }
>> 
>
> I like this idea of being able to store arbitrary error codes there.
> That should be used judiciously of course, but we already allow
> returning arbitrary errors via the ->fsync op anyway.
>
> I'll plan to incorporate something like that into the next set (with
> judicious comments and constants).
>
> One question...is the i_lock the right way to protect this? I think we
> could do this locklessly too (cmpxchg in a loop, for instance). I'm not
> worried about performance here -- it's just nice to be able to call
> simple stuff like this without worrying about locking.

I like the idea of using cmpxchg.


>
>> int filemap_report_wb_error(struct file *file)
>> {
>> 	struct inode *inode = file_inode(file);
>> 	unsigned int wb_err = READ_ONCE(mapping->wb_err);
>> 
>> 	if (file->f_wb_err == wb_err)
>> 		return 0;
>> 	return -(wb_err & 4095);
>> }
>> 
>> That only gives us 20 bits of counter, but I think that's enough.
>
> 2^20 is 1048576, which seems a little small to me.
>
> We may end up bumping the counter on every failed I/O. How fast can we
> generate 1M failed I/Os? :)

Do we need to count all of those if no-one sees them?
i.e. use one bit to say "this error hasn't been seen".
If an error occurs with has the name error code as is currently stored,
and the bit is set, don't make a change.  Otherwise make the change,
inc the counter, set the bit.
When checking for an error, if the bit is set, clear it first.
Then you can count 500,000 errors-returned-to-some-thread, which is
probably enough.

>
> 2^52 however is 4503599627370496 (4Tios or so) ... that might take a
> little longer to overflow. Is it worth the cost here to ensure that
> this won't occur?
>
> Actually...we could put this field in the inode instead of the mapping.
> I know we've traditionally tracked this in the mapping, but is that
> required here?

What if the address_space is shared by two inodes?  That is the whole
point of the i_mapping pointer.  This would make it harder for the
"other" inode to get the error.
(Does anyone actually use fs/coda ??
 Actually, block devices use i_mapping too.
 If two block device inodes have the same major/minor number, they
 end up having i_mapping point to the same place)

If you are concerned about space in 'struct address_space', just prune
some wastage.
The "host" field brings no value.  It is only ever assigned in
inode_init_always():

	struct address_space *const mapping = &inode->i_data;
......
	mapping->host = inode;

So you could change all references to use
   container_of(mapping, struct inode, i_data)

NeilBrown



>
> If we put this field in the inode then perhaps we can union it with
> something and mitigate the cost of a larger counter...maybe in the
> i_pipe union? I don't think S_ISREG inodes use anything in there, do
> they?
> -- 
> Jeff Layton <jlayton@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-05 21:03                           ` Matthew Wilcox
@ 2017-04-06  0:19                             ` NeilBrown
  0 siblings, 0 replies; 54+ messages in thread
From: NeilBrown @ 2017-04-06  0:19 UTC (permalink / raw)
  To: Matthew Wilcox, Jeff Layton
  Cc: linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

[-- Attachment #1: Type: text/plain, Size: 1587 bytes --]

On Wed, Apr 05 2017, Matthew Wilcox wrote:

> On Wed, Apr 05, 2017 at 03:49:52PM -0400, Jeff Layton wrote:
>> > That only gives us 20 bits of counter, but I think that's enough.
>> 
>> 2^20 is 1048576, which seems a little small to me.
>> 
>> We may end up bumping the counter on every failed I/O. How fast can we
>> generate 1M failed I/Os? :)
>
> So there's a one-in-a-million chance of missing a failed I/O ... if
> we're generating lots of errors, the next time the app calls fsync(),
> it'll notice the other million times we've hit the problem :-)
>
>> Actually...we could put this field in the inode instead of the mapping.
>> I know we've traditionally tracked this in the mapping, but is that
>> required here?
>> 
>> If we put this field in the inode then perhaps we can union it with
>> something and mitigate the cost of a larger counter...maybe in the
>> i_pipe union? I don't think S_ISREG inodes use anything in there, do
>> they?
>
> But writeback isn't just done on ISREG inodes, but also on S_ISBLK inodes,
> which use i_bdev (right?)
>
> Another possibility is to move this out of the address_space and into
> either the super_block or the backing_device_info.  Errors don't tend
> to be constrained to a single file but affect the entire filesystem,
> or even multiple filesystems if you have a partitioned block device ...

EDQUOT.  Remember EDQUOT.  It certainly don't affect the whole
filesystem.
Even without that, filesystems can easily treat different files
differently.  We shouldn't assume one-failes-all-fail.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-05 11:14                         ` Jeff Layton
@ 2017-04-06  0:24                           ` NeilBrown
  0 siblings, 0 replies; 54+ messages in thread
From: NeilBrown @ 2017-04-06  0:24 UTC (permalink / raw)
  To: Jeff Layton, Matthew Wilcox
  Cc: linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

[-- Attachment #1: Type: text/plain, Size: 1585 bytes --]

On Wed, Apr 05 2017, Jeff Layton wrote:

>> 
>> O_DIRECT write() can get an EIO from a previous write-back write to the
>> same file.  Maybe non-O_DIRECT writes should too?
>> 
>
> Some already do this for buffered writes.
>
> This is really a philosophical question, IMO...is it correct to return
> an error on a write call, due to writeback failing previously or during
> the write call, quite possibly to a range that the write call does not
> touch? I can see an argument either way for this.

I like the "we already do" argument.

>
> Also, if we do think that returning an error on the write is the right
> thing to do, should that error advance the sequence counter in the
> struct file, such that an fsync afterward gets back 0? My feeling here
> is that fsync should still report an error after a failed write, but
> maybe that's wrong?

My first thought was that one the error has been returned to any
syscall on a given fd, it has been returned.  Once is enough.
My second thought was that maybe your feeling is right.  Having a well
defined error-return point in fsync feels like a nice design.
My third thought was that this would mean either
 - write continues to fail until fsync is called (probably bad), or
 - we need two counters per "struct file", one for fsync, one for write.
   I don't like that much.

So I'm going back to my first thought.

Thanks,
NeilBrown


>
> This is certainly one area where switching to synchronous writes on
> error would make things a little simpler.
> -- 
> Jeff Layton <jlayton@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-06  0:02                           ` NeilBrown
@ 2017-04-06  2:55                             ` Matthew Wilcox
  2017-04-06  5:12                               ` NeilBrown
  2017-04-06 14:02                             ` Jeff Layton
  2017-04-06 19:14                             ` Jeff Layton
  2 siblings, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2017-04-06  2:55 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jeff Layton, linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

On Thu, Apr 06, 2017 at 10:02:48AM +1000, NeilBrown wrote:
> If you are concerned about space in 'struct address_space', just prune
> some wastage.

I'm trying to (via wlists).  still buggy though.

> The "host" field brings no value.  It is only ever assigned in
> inode_init_always():
> 
> 	struct address_space *const mapping = &inode->i_data;
> ......
> 	mapping->host = inode;
> 
> So you could change all references to use
>    container_of(mapping, struct inode, i_data)

Alas, no:

drivers/dax/dax.c:      inode->i_mapping->host = dax_dev->inode;
fs/gfs2/glock.c:                mapping->host = s->s_bdev->bd_inode;
fs/gfs2/ops_fstype.c:   mapping->host = sb->s_bdev->bd_inode;
fs/nilfs2/page.c:       mapping->host = inode;

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-06  2:55                             ` Matthew Wilcox
@ 2017-04-06  5:12                               ` NeilBrown
  2017-04-06 13:31                                 ` Matthew Wilcox
  0 siblings, 1 reply; 54+ messages in thread
From: NeilBrown @ 2017-04-06  5:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jeff Layton, linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

[-- Attachment #1: Type: text/plain, Size: 2365 bytes --]

On Wed, Apr 05 2017, Matthew Wilcox wrote:

> On Thu, Apr 06, 2017 at 10:02:48AM +1000, NeilBrown wrote:
>> If you are concerned about space in 'struct address_space', just prune
>> some wastage.
>
> I'm trying to (via wlists).  still buggy though.

Cool.
(I wonder what a wlist is.... weighted list?)

>
>> The "host" field brings no value.  It is only ever assigned in
>> inode_init_always():
>> 
>> 	struct address_space *const mapping = &inode->i_data;
>> ......
>> 	mapping->host = inode;
>> 
>> So you could change all references to use
>>    container_of(mapping, struct inode, i_data)
>
> Alas, no:
>
> drivers/dax/dax.c:      inode->i_mapping->host = dax_dev->inode;

	inode->i_mapping = dax_dev->inode->i_mapping;
	inode->i_mapping->host = dax_dev->inode;
so that second line is equivalent to
        dax_dev->inode->i_mapping->host = dax_dev->inode;
so inode->mapping->host leads back to inode.  So this doesn't break the
invariant.
       

> fs/gfs2/glock.c:                mapping->host = s->s_bdev->bd_inode;
> fs/gfs2/ops_fstype.c:   mapping->host = sb->s_bdev->bd_inode;

Hmm.. that's weird.  I cannot quite follow what is happening there.
It creates an address-space for metadata which doesn't have a real
inode, and borrows bits of the bdev inode ... possibly just to be able
to find the blocksize deep in buffer.c or similar.
I suspect that using an 'inode' instead of a 'mapping' would make the
code clearer.

> fs/nilfs2/page.c:       mapping->host = inode;

A nilfs inode is allocated with 2 address spaces,
one for the data and one for btree indexing metadata.
And then there are a couple of extra address spaces for the
global metadata-file (mtd).

I wonder what the ->host pointer is actually used for.
buffer.c uses it:
 - to mark the inode 'dirty' when the page is marked dirty
 - to find the blocksize of the inode, for creating buffer_heads
 - find the size of the mapping (i_size)

I could probably argue that the 'dirty' flag (at least for the data) and
the size really belong in the address_space, not in the inode.
The blocksize, I'm less sure of.

I suspect gfs2 and nilfs2 could be changed to allocate a separate inode
(instead of address_space), or to not make use of the ->host pointer.
It would be more work than I at first thought though.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-06  5:12                               ` NeilBrown
@ 2017-04-06 13:31                                 ` Matthew Wilcox
  2017-04-06 21:53                                   ` NeilBrown
  0 siblings, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2017-04-06 13:31 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jeff Layton, linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

On Thu, Apr 06, 2017 at 03:12:42PM +1000, NeilBrown wrote:
> On Wed, Apr 05 2017, Matthew Wilcox wrote:
> 
> > On Thu, Apr 06, 2017 at 10:02:48AM +1000, NeilBrown wrote:
> >> If you are concerned about space in 'struct address_space', just prune
> >> some wastage.
> >
> > I'm trying to (via wlists).  still buggy though.
> 
> Cool.
> (I wonder what a wlist is.... weighted list?)

A homonym ;-)  Waitlists.  Yet Another Variation of a doubly linked list.
This variation has only a single head pointer (like the hlist), allows
O(1) deletion from any point in the list (but requires that you know the
head of the list).  The tail of the list is pointed to by head->prev.

http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/wlist

It also needs updating to the latest tree; WW locks got revamped.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-06  0:02                           ` NeilBrown
  2017-04-06  2:55                             ` Matthew Wilcox
@ 2017-04-06 14:02                             ` Jeff Layton
  2017-04-06 19:14                             ` Jeff Layton
  2 siblings, 0 replies; 54+ messages in thread
From: Jeff Layton @ 2017-04-06 14:02 UTC (permalink / raw)
  To: NeilBrown, Matthew Wilcox
  Cc: linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

On Thu, 2017-04-06 at 10:02 +1000, NeilBrown wrote:
> On Thu, Apr 06 2017, Jeff Layton wrote:
> 
> > On Tue, 2017-04-04 at 10:09 -0700, Matthew Wilcox wrote:
> > > On Tue, Apr 04, 2017 at 12:25:46PM -0400, Jeff Layton wrote:
> > > > That said, I think giving more specific errors where we can is useful.
> > > > When your program is erroring out and writing 'I/O error' to the logs,
> > > > then how much time will your admins burn before they figure out that it
> > > > really failed because the filesystem was full?
> > > 
> > > df is one of the first things I check ... a few years ago, I also learned
> > > to check df -i ... ;-)
> > > 
> > > Anyway, given the decision to simply report the last error lets us do this
> > > implementation:
> > > 
> > > void filemap_set_wb_error(struct address_space *mapping, int err)
> > > {
> > > 	struct inode *inode = mapping->host;
> > > 	unsigned int wb_err;
> > > 
> > > 	if (!err)
> > > 		return;
> > > 	/*
> > > 	 * This should be called with the error code that we want to return
> > > 	 * on fsync. Thus, it should always be <= 0.
> > > 	 */
> > > 	WARN_ON(err > 0 || err < -MAX_ERRNO);
> > > 
> > > 	spin_lock(&inode->i_lock);
> > > 	wb_err = ((mapping->wb_err & ~MAX_ERRNO) + (1 << 12)) | -err;
> > > 	WRITE_ONCE(mapping->wb_err, wb_err);
> > > 	spin_unlock(&inode->i_lock);
> > > }
> > > 
> > 
> > I like this idea of being able to store arbitrary error codes there.
> > That should be used judiciously of course, but we already allow
> > returning arbitrary errors via the ->fsync op anyway.
> > 
> > I'll plan to incorporate something like that into the next set (with
> > judicious comments and constants).
> > 
> > One question...is the i_lock the right way to protect this? I think we
> > could do this locklessly too (cmpxchg in a loop, for instance). I'm not
> > worried about performance here -- it's just nice to be able to call
> > simple stuff like this without worrying about locking.
> 
> I like the idea of using cmpxchg.
> 
> 
> > 
> > > int filemap_report_wb_error(struct file *file)
> > > {
> > > 	struct inode *inode = file_inode(file);
> > > 	unsigned int wb_err = READ_ONCE(mapping->wb_err);
> > > 
> > > 	if (file->f_wb_err == wb_err)
> > > 		return 0;
> > > 	return -(wb_err & 4095);
> > > }
> > > 
> > > That only gives us 20 bits of counter, but I think that's enough.
> > 
> > 2^20 is 1048576, which seems a little small to me.
> > 
> > We may end up bumping the counter on every failed I/O. How fast can we
> > generate 1M failed I/Os? :)
> 
> Do we need to count all of those if no-one sees them?
> i.e. use one bit to say "this error hasn't been seen".
> If an error occurs with has the name error code as is currently stored,
> and the bit is set, don't make a change.  Otherwise make the change,
> inc the counter, set the bit.
> When checking for an error, if the bit is set, clear it first.
> Then you can count 500,000 errors-returned-to-some-thread, which is
> probably enough.
> 

Yeah, that seems like it might be a good idea if we want to stick to a
small value here.

> > 
> > 2^52 however is 4503599627370496 (4Tios or so) ... that might take a
> > little longer to overflow. Is it worth the cost here to ensure that
> > this won't occur?
> > 
> > Actually...we could put this field in the inode instead of the mapping.
> > I know we've traditionally tracked this in the mapping, but is that
> > required here?
> 
> What if the address_space is shared by two inodes?  That is the whole
> point of the i_mapping pointer.  This would make it harder for the
> "other" inode to get the error.
> (Does anyone actually use fs/coda ??
>  Actually, block devices use i_mapping too.
>  If two block device inodes have the same major/minor number, they
>  end up having i_mapping point to the same place)
> 

Ahh ok, that makes sense. I'll plan to keep it part of the mapping.

> If you are concerned about space in 'struct address_space', just prune
> some wastage.
> The "host" field brings no value.  It is only ever assigned in
> inode_init_always():
> 
> 	struct address_space *const mapping = &inode->i_data;
> ......
> 	mapping->host = inode;
> 
> So you could change all references to use
>    container_of(mapping, struct inode, i_data)
> 

That might be a nice cleanup, but I think I'll leave that to be done
separately.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-06  0:02                           ` NeilBrown
  2017-04-06  2:55                             ` Matthew Wilcox
  2017-04-06 14:02                             ` Jeff Layton
@ 2017-04-06 19:14                             ` Jeff Layton
  2017-04-06 20:05                               ` Matthew Wilcox
  2017-04-06 22:15                               ` NeilBrown
  2 siblings, 2 replies; 54+ messages in thread
From: Jeff Layton @ 2017-04-06 19:14 UTC (permalink / raw)
  To: NeilBrown, Matthew Wilcox
  Cc: linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

On Thu, 2017-04-06 at 10:02 +1000, NeilBrown wrote:
> > 

> On Thu, Apr 06 2017, Jeff Layton wrote:
> 
> > On Tue, 2017-04-04 at 10:09 -0700, Matthew Wilcox wrote:
> > > On Tue, Apr 04, 2017 at 12:25:46PM -0400, Jeff Layton wrote:
> > > > That said, I think giving more specific errors where we can is useful.
> > > > When your program is erroring out and writing 'I/O error' to the logs,
> > > > then how much time will your admins burn before they figure out that it
> > > > really failed because the filesystem was full?
> > > 
> > > df is one of the first things I check ... a few years ago, I also learned
> > > to check df -i ... ;-)
> > > 
> > > Anyway, given the decision to simply report the last error lets us do this
> > > implementation:
> > > 
> > > void filemap_set_wb_error(struct address_space *mapping, int err)
> > > {
> > > 	struct inode *inode = mapping->host;
> > > 	unsigned int wb_err;
> > > 
> > > 	if (!err)
> > > 		return;
> > > 	/*
> > > 	 * This should be called with the error code that we want to return
> > > 	 * on fsync. Thus, it should always be <= 0.
> > > 	 */
> > > 	WARN_ON(err > 0 || err < -MAX_ERRNO);
> > > 
> > > 	spin_lock(&inode->i_lock);
> > > 	wb_err = ((mapping->wb_err & ~MAX_ERRNO) + (1 << 12)) | -err;
> > > 	WRITE_ONCE(mapping->wb_err, wb_err);
> > > 	spin_unlock(&inode->i_lock);
> > > }
> > > 
> > 
> > I like this idea of being able to store arbitrary error codes there.
> > That should be used judiciously of course, but we already allow
> > returning arbitrary errors via the ->fsync op anyway.
> > 
> > I'll plan to incorporate something like that into the next set (with
> > judicious comments and constants).
> > 
> > One question...is the i_lock the right way to protect this? I think we
> > could do this locklessly too (cmpxchg in a loop, for instance). I'm not
> > worried about performance here -- it's just nice to be able to call
> > simple stuff like this without worrying about locking.
> 
> I like the idea of using cmpxchg.
> 
> 
> > 
> > > int filemap_report_wb_error(struct file *file)
> > > {
> > > 	struct inode *inode = file_inode(file);
> > > 	unsigned int wb_err = READ_ONCE(mapping->wb_err);
> > > 
> > > 	if (file->f_wb_err == wb_err)
> > > 		return 0;
> > > 	return -(wb_err & 4095);
> > > }
> > > 
> > > That only gives us 20 bits of counter, but I think that's enough.
> > 
> > 2^20 is 1048576, which seems a little small to me.
> > 
> > We may end up bumping the counter on every failed I/O. How fast can we
> > generate 1M failed I/Os? :)
> 
> Do we need to count all of those if no-one sees them?
> i.e. use one bit to say "this error hasn't been seen".
> If an error occurs with has the name error code as is currently stored,
> and the bit is set, don't make a change.  Otherwise make the change,
> inc the counter, set the bit.
> When checking for an error, if the bit is set, clear it first.
> Then you can count 500,000 errors-returned-to-some-thread, which is
> probably enough.
> 

Ok, so here's a replacement for patch #1. The other 3 are pretty much
the same. The main changes are:

- 32 bit value:
  - 12 bits for error code
  - 1 bit for "seen" flag
  - 19 bits for the counter
- mapping->wb_err is managed with cmpxchg
- file->f_wb_err is protected with file->f_lock

I tried to avoid updating things unnecesssarily. I could use some
guidance on how to specify the constants in terms of MAX_ERRNO as well.

It seems to work, in very basic by-hand testing.

If this looks reasonable, I may try again to plug this in at a higher
level, so we don't need to change so much filesystem code. IOW:

- make filemap_set_wb_error the new implementation of mapping_set_error
- have vfs_fsync_range call filemap_report_wb_error, and return what it
  returns if it's non-zero
- have filemap_check_error grab the current error code without updating
  the counter or the seen flag

That approach may not work, but I'll see. Anyway, here's the updated
patch. I may need to revise the changelog too.

--------------------------8<---------------------

[PATCH] fs: new infrastructure for writeback error handling and reporting

Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.

It's those non-fsync callers that are problematic. We should be
reporting writeback errors during fsync, but many places in the code
clear out errors before they can be properly reported, or report errors
at nonsensical times. If I get -EIO on a stat() call, how do I know that
was because writeback failed?

This patch adds a small bit of new infrastructure for setting and
reporting errors during pagecache writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.

In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fd may not be associated with one another in any way. They could even be
in different containers, so ensuring coordination between all fsync
callers is not really an option.

One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.

This set adds a wb_error field and a sequence counter to the
address_space, and a corresponding sequence counter in the struct file.
When errors are reported during writeback, we set the error field in the
mapping and increment the sequence counter.

When fsync or flush is called, we check the sequence in the file vs. the
one in the mapping. If the file's counter is behind the one in the
mapping, then we update the sequence counter in the file to the value of
the one in the mapping and report the error. If the file is "caught up"
then we just report 0.

This changes the semantics of fsync such that applications can now use
it to determine whether there were any writeback errors since fsync(fd)
was last called (or since the file was opened in the case of fsync
having never been called).

Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.

The basic idea here is for filesystems to use filemap_set_wb_error to
set the error in the mapping when there are writeback errors, and then
have the fsync and flush operations call filemap_report_wb_error just
before returning to ensure that those errors get reported properly.

Eventually, it may make sense to move the reporting into the generic
vfs_fsync_range helper, but doing it this way for now makes it simpler
to convert filesystems to the new API individually.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 Documentation/filesystems/vfs.txt |  14 +++-
 fs/open.c                         |   3 +
 include/linux/fs.h                |   4 +
 mm/filemap.c                      | 162 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 181 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 569211703721..b2b5e411b340 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -577,6 +577,11 @@ should clear PG_Dirty and set PG_Writeback.  It can be actually
 written at any point after PG_Dirty is clear.  Once it is known to be
 safe, PG_Writeback is cleared.
 
+If there is an error during writeback, then the address_space should be
+marked with an error (typically using filemap_set_wb_error), in order to
+ensure that the error can later be reported to the application at fsync
+or close.
+
 Writeback makes use of a writeback_control structure...
 
 struct address_space_operations
@@ -885,11 +890,16 @@ otherwise noted.
 	"private_data" member in the file structure if you want to point
 	to a device structure
 
-  flush: called by the close(2) system call to flush a file
+  flush: called by the close(2) system call to flush a file. Writeback
+	errors not previously reported via fsync should be reported
+	here as you would for fsync.
 
   release: called when the last reference to an open file is closed
 
-  fsync: called by the fsync(2) system call
+  fsync: called by the fsync(2) system call. Filesystems that use the
+	pagecache should call filemap_report_wb_error before returning
+	to ensure that any errors that occurred during writeback are
+	reported and the file's error sequence advanced.
 
   fasync: called by the fcntl(2) system call when asynchronous
 	(non-blocking) mode is enabled for a file
diff --git a/fs/open.c b/fs/open.c
index 949cef29c3bb..baf82f2c642e 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -709,6 +709,9 @@ static int do_dentry_open(struct file *f,
 	f->f_inode = inode;
 	f->f_mapping = inode->i_mapping;
 
+	/* Ensure that we skip any errors that predate opening of the file */
+	f->f_wb_err = READ_ONCE(inode->i_mapping->wb_err);
+
 	if (unlikely(f->f_flags & O_PATH)) {
 		f->f_mode = FMODE_PATH;
 		f->f_op = &empty_fops;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7251f7bb45e8..f33857113ff4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -394,6 +394,7 @@ struct address_space {
 	gfp_t			gfp_mask;	/* implicit gfp mask for allocations */
 	struct list_head	private_list;	/* ditto */
 	void			*private_data;	/* ditto */
+	u32			wb_err;
 } __attribute__((aligned(sizeof(long))));
 	/*
 	 * On most architectures that alignment is already the case; but
@@ -868,6 +869,7 @@ struct file {
 	struct list_head	f_tfile_llink;
 #endif /* #ifdef CONFIG_EPOLL */
 	struct address_space	*f_mapping;
+	u32			f_wb_err;
 } __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */
 
 struct file_handle {
@@ -2521,6 +2523,8 @@ extern int __filemap_fdatawrite_range(struct address_space *mapping,
 extern int filemap_fdatawrite_range(struct address_space *mapping,
 				loff_t start, loff_t end);
 extern int filemap_check_errors(struct address_space *mapping);
+extern void filemap_set_wb_error(struct address_space *mapping, int err);
+extern int filemap_report_wb_error(struct file *file);
 
 extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
 			   int datasync);
diff --git a/mm/filemap.c b/mm/filemap.c
index 1694623a6289..60b6fa417b98 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -545,6 +545,168 @@ int filemap_write_and_wait_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL(filemap_write_and_wait_range);
 
+/*
+ * The wb_err field in the address_space provides a place to store writeback
+ * errors. We endeavor to deliver writeback errors to fsync on all open file
+ * descriptors that were open at the time that the error was caught. We do
+ * this using a 32-bit value to store the error, with the upper bits as a
+ * sequence counter. We can store any error up to MAX_ERROR.
+ *
+ * Additionally, we reserve one bit to indicate whether any fd has grabbed the
+ * value to record in its struct file. If nothing has, then we don't really
+ * need to increment the counter.
+ */
+
+/* This bit is used as a flag to indicate whether the value has been seen */
+#define WB_ERR_SEEN		(1 << 12)
+
+/* Increment the counter by this much to ensure that we don't touch earlier
+ * values */
+#define WB_ERR_CTR_INC		(1 << 13)
+
+/**
+ * filemap_set_wb_error - set the wb error in the mapping for later reporting
+ * @mapping: mapping in which the error should be set
+ * @err: error to set. must be negative value but not less than -MAX_ERRNO
+ *
+ * When an error occurs during writeback of inode data, we must report that
+ * error during fsync. This function sets the writeback error field in the
+ * mapping, and increments the sequence counter. When fsync or close is later
+ * performed, the caller can then check the sequence in the mapping against
+ * the one in the file to determine whether the error should be reported.
+ *
+ * Because there are so few bits for the counter, we try to avoid incrementing
+ * it unless someone is going to record the value for later comparison. This
+ * is tracked by a bit in the 32 bit word that we use as a "seen" flag.
+ *
+ * Note that we always use the latest writeback error, since POSIX states
+ * that when there are multiple errors (e.g. -EIO followed by -ENOSPC),
+ * that any possible error may be returned.
+ */
+void filemap_set_wb_error(struct address_space *mapping, int err)
+{
+	u32 old;
+
+	/*
+	 * The above constants rely indirectly on MAX_ERRNO not changing
+	 * since I'm not sure how to take a log at build time. Suggestions
+	 * of better ways to phrase the flag values would be welcome.
+	 */
+	BUILD_BUG_ON(MAX_ERRNO + 1 != WB_ERR_SEEN);
+
+	/* Optimize for common case of no error */
+	if (likely(!err))
+		return;
+
+	/*
+	 * Ensure the error code actually fits where we want it to go. If it
+	 * doesn't then just throw a warning and don't record anything.
+	 */
+	if (unlikely(err > 0 || err < -MAX_ERRNO)) {
+		WARN(1, "err=%d\n", err);
+		return;
+	}
+
+	old = READ_ONCE(mapping->wb_err);
+	for (;;) {
+		u32 new, cur;
+
+		/* Clear out error bits and set new error */
+		new = (old & ~MAX_ERRNO) | -err;
+
+		/* Only increment if someone has looked at it */
+		if (old & WB_ERR_SEEN) {
+			new += WB_ERR_CTR_INC;
+			new &= ~WB_ERR_SEEN;
+		}
+
+		/* Try to swap the new value into place */
+		cur = cmpxchg(&mapping->wb_err, old, new);
+
+		/*
+		 * Call it success if we did the swap or someone else beat us
+		 * to it for the same value.
+		 */
+		if (likely(cur == old || cur == new))
+			break;
+
+		/* Raced with an update, try again */
+		old = cur;
+	}
+}
+EXPORT_SYMBOL(filemap_set_wb_error);
+
+/**
+ * filemap_report_wb_error - report wb error (if any) that was previously set
+ * @file: struct file on which the error is being reported
+ *
+ * When userland calls fsync or close (or something like nfsd does the
+ * equivalent), we want to report any writeback errors that occurred since
+ * the last fsync (or since the file was opened if there haven't been any).
+ *
+ * Grab the wb_err from the mapping. If it matches what we have in the file,
+ * then just quickly return 0. The file is all caught up.
+ *
+ * If it doesn't match, then take the mapping value, set the "seen" flag in
+ * it and try to swap it into place. If it works, or another task beat us
+ * to it with the new value, then update the f_wb_err and return the error
+ * portion. The error at this point _should_ be reported to userland.
+ *
+ * While we handle mapping->wb_err with atomic operations, the f_wb_err
+ * value is protected by the f_lock since we must ensure that it reflects
+ * the latest value swapped in for this file descriptor.
+ */
+int filemap_report_wb_error(struct file *file)
+{
+	int err = 0;
+	struct address_space *mapping = file->f_mapping;
+	u32 old;
+
+	old = READ_ONCE(mapping->wb_err);
+
+	/*
+	 * This catches the common case of no errors, and the case where
+	 * nothing has changed since we last checked.
+	 */
+	if (old == READ_ONCE(file->f_wb_err))
+		goto out;
+
+	spin_lock(&file->f_lock);
+	for (;;) {
+		u32 cur, new;
+
+		/*
+		 * We always store values with the "seen" bit set, so if this
+		 * matches what we already have, then we can call it done.
+		 * There is nothing to update so just return 0.
+		 */
+		if (old == file->f_wb_err)
+			break;
+
+		/* set flag and try to swap it into place */
+		new = old | WB_ERR_SEEN;
+		cur = cmpxchg(&mapping->wb_err, old, new);
+
+		/*
+		 * We can quit now if we successfully swapped in the new value
+		 * or someone else beat us to it with the same value that we
+		 * were planning to store.
+		 */
+		if (likely(cur == old || cur == new)) {
+			file->f_wb_err = new;
+			err = -(new & MAX_ERRNO);
+			break;
+		}
+
+		/* Raced with an update, try again */
+		old = cur;
+	}
+	spin_unlock(&file->f_lock);
+out:
+	return err;
+}
+EXPORT_SYMBOL(filemap_report_wb_error);
+
 /**
  * replace_page_cache_page - replace a pagecache page with a new one
  * @old:	page to be replaced
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-06 19:14                             ` Jeff Layton
@ 2017-04-06 20:05                               ` Matthew Wilcox
  2017-04-07 13:12                                 ` Jeff Layton
  2017-04-06 22:15                               ` NeilBrown
  1 sibling, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2017-04-06 20:05 UTC (permalink / raw)
  To: Jeff Layton
  Cc: NeilBrown, linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

On Thu, Apr 06, 2017 at 03:14:52PM -0400, Jeff Layton wrote:
> @@ -868,6 +869,7 @@ struct file {
>  	struct list_head	f_tfile_llink;
>  #endif /* #ifdef CONFIG_EPOLL */
>  	struct address_space	*f_mapping;
> +	u32			f_wb_err;
>  } __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */
>  

I think we can squeeze that in next to f_flags?

> +/**
> + * filemap_set_wb_error - set the wb error in the mapping for later reporting
> + * @mapping: mapping in which the error should be set
> + * @err: error to set. must be negative value but not less than -MAX_ERRNO

Do we want to have users call filemap_set_wb_error(mapping, EIO)
or filemap_set_wb_error(mapping, -EIO)?  Either way, we can assert
that it's in the correct range (oh look, we have at least one user of
mapping_set_error calling it with a positive errno ...)

I've been playing with positive or negative errnos for the xarray, and
positive looks better to me, although there's a definite advantage to
being able to just call filemap_set_wb_error(mapping, result).

#define XAS_ERROR(errno)        ((struct xa_node *)((errno << 1) | 1))

static inline int xas_error(const struct xa_state *xas)
{
        unsigned long v = (unsigned long)xas->xa_node;
        return (v & 1) ? -(v >> 1) : 0;
}

static inline void xas_set_err(struct xa_state *xas, unsigned long err)
{
        XA_BUG_ON(err > MAX_ERRNO);
        xas->xa_node = XAS_ERROR(err);
}

> +	/*
> +	 * Ensure the error code actually fits where we want it to go. If it
> +	 * doesn't then just throw a warning and don't record anything.
> +	 */
> +	if (unlikely(err > 0 || err < -MAX_ERRNO)) {
> +		WARN(1, "err=%d\n", err);
> +		return;
> +	}

Cute trick to make this more succinct:

	if (WARN(err > 0 || err < -MAX_ERRNO), "err = %d\n", err)
		return;
or even ...

	if (WARN((unsigned int)-err > MAX_ERRNO), "err = %d\n", err)
		return;

> +		/* Clear out error bits and set new error */
> +		new = (old & ~MAX_ERRNO) | -err;
> +
> +		/* Only increment if someone has looked at it */
> +		if (old & WB_ERR_SEEN) {
> +			new += WB_ERR_CTR_INC;
> +			new &= ~WB_ERR_SEEN;
> +		}

Although we always want to clear out the SEEN bit if we're updating ... so

		new = (old & ~(MAX_ERRNO | WB_ERR_SEEN) | -err;

		/* Only increment if someone has looked at it */
		if (old & WB_ERR_SEEN)
			new += WB_ERR_CTR_INC;

... and then there's no need to update if it's the same errno and nobody's
seen it:

		if (old == new)
			break;

[...]

> +		/*
> +		 * We always store values with the "seen" bit set, so if this
> +		 * matches what we already have, then we can call it done.
> +		 * There is nothing to update so just return 0.
> +		 */
> +		if (old == file->f_wb_err)
> +			break;
> +
> +		/* set flag and try to swap it into place */
> +		new = old | WB_ERR_SEEN;

Again, I think we should avoid the cmpxchg with:

		if (old == new)
			break;

> +		cur = cmpxchg(&mapping->wb_err, old, new);
> +
> +		/*
> +		 * We can quit now if we successfully swapped in the new value
> +		 * or someone else beat us to it with the same value that we
> +		 * were planning to store.
> +		 */
> +		if (likely(cur == old || cur == new)) {
> +			file->f_wb_err = new;
> +			err = -(new & MAX_ERRNO);
> +			break;
> +		}
> +
> +		/* Raced with an update, try again */
> +		old = cur;

Well ... should we?  We're returning an error which is new to this fd anyway.
Do we want to return the most recent error by a nanosecond, or should we
return the previous one and then see this one next time we call fsync()?

I'd lean towards not looping here; not even looking at 'cur'.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-06 13:31                                 ` Matthew Wilcox
@ 2017-04-06 21:53                                   ` NeilBrown
  0 siblings, 0 replies; 54+ messages in thread
From: NeilBrown @ 2017-04-06 21:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jeff Layton, linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

[-- Attachment #1: Type: text/plain, Size: 1091 bytes --]

On Thu, Apr 06 2017, Matthew Wilcox wrote:

> On Thu, Apr 06, 2017 at 03:12:42PM +1000, NeilBrown wrote:
>> On Wed, Apr 05 2017, Matthew Wilcox wrote:
>> 
>> > On Thu, Apr 06, 2017 at 10:02:48AM +1000, NeilBrown wrote:
>> >> If you are concerned about space in 'struct address_space', just prune
>> >> some wastage.
>> >
>> > I'm trying to (via wlists).  still buggy though.
>> 
>> Cool.
>> (I wonder what a wlist is.... weighted list?)
>
> A homonym ;-)  Waitlists.  Yet Another Variation of a doubly linked list.
> This variation has only a single head pointer (like the hlist), allows
> O(1) deletion from any point in the list (but requires that you know the
> head of the list).  The tail of the list is pointed to by head->prev.
>
> http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/wlist
>
> It also needs updating to the latest tree; WW locks got revamped.

So nothing points to the head, meaning you need both the node and the
head to delete something.  But the head is smaller.
I can see how that would be useful.
Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-06 19:14                             ` Jeff Layton
  2017-04-06 20:05                               ` Matthew Wilcox
@ 2017-04-06 22:15                               ` NeilBrown
  1 sibling, 0 replies; 54+ messages in thread
From: NeilBrown @ 2017-04-06 22:15 UTC (permalink / raw)
  To: Jeff Layton, Matthew Wilcox
  Cc: linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

[-- Attachment #1: Type: text/plain, Size: 1313 bytes --]

On Thu, Apr 06 2017, Jeff Layton wrote:

>
> I tried to avoid updating things unnecesssarily. I could use some
> guidance on how to specify the constants in terms of MAX_ERRNO as well.

ilog2() defined in include/linux/log2.h

And you have MAX_ERROR in one comment, instead of MAX_ERRNO :-)

>  
> -  flush: called by the close(2) system call to flush a file
> +  flush: called by the close(2) system call to flush a file. Writeback
> +	errors not previously reported via fsync should be reported
> +	here as you would for fsync.

"could", not "should".  I think it is agreed that this is a good
idea, is it?

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7251f7bb45e8..f33857113ff4 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -394,6 +394,7 @@ struct address_space {
>  	gfp_t			gfp_mask;	/* implicit gfp mask for allocations */
>  	struct list_head	private_list;	/* ditto */
>  	void			*private_data;	/* ditto */
> +	u32			wb_err;

I would rather this was a wb_err_t or similar, and that the functions
which implement it take a pointer to a wb_err_t.
Then the 'error' in 'struct nfs_open_context' could become a 'wb_err_t',
and nfs could use these functions to do error tracking the way it wants
to.

Thanks - looking good.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-06 20:05                               ` Matthew Wilcox
@ 2017-04-07 13:12                                 ` Jeff Layton
  2017-04-09 23:15                                   ` NeilBrown
  0 siblings, 1 reply; 54+ messages in thread
From: Jeff Layton @ 2017-04-07 13:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: NeilBrown, linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

On Thu, 2017-04-06 at 13:05 -0700, Matthew Wilcox wrote:
> On Thu, Apr 06, 2017 at 03:14:52PM -0400, Jeff Layton wrote:
> > @@ -868,6 +869,7 @@ struct file {
> >  	struct list_head	f_tfile_llink;
> >  #endif /* #ifdef CONFIG_EPOLL */
> >  	struct address_space	*f_mapping;
> > +	u32			f_wb_err;
> >  } __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */
> >  
> 
> I think we can squeeze that in next to f_flags?
> 

Sure, will do. I meant to look at pahole output and see if there are
existing holes.

> > +/**
> > + * filemap_set_wb_error - set the wb error in the mapping for later reporting
> > + * @mapping: mapping in which the error should be set
> > + * @err: error to set. must be negative value but not less than -MAX_ERRNO
> 
> Do we want to have users call filemap_set_wb_error(mapping, EIO)
> or filemap_set_wb_error(mapping, -EIO)?  Either way, we can assert
> that it's in the correct range (oh look, we have at least one user of
> mapping_set_error calling it with a positive errno ...)
> 

Yeah, I sent a patch for that a while back but I don't think anyone
picked it up. Luckily that caller is harmless since EIO just ends up in
the default case and gets turned into -EIO.

> I've been playing with positive or negative errnos for the xarray, and
> positive looks better to me, although there's a definite advantage to
> being able to just call filemap_set_wb_error(mapping, result).
> 

That's my main rationale. We generally use negative error codes in the
kernel, so let's do what's easiest for most callsites. I say negative
error codes here.


> #define XAS_ERROR(errno)        ((struct xa_node *)((errno << 1) | 1))
> 
> static inline int xas_error(const struct xa_state *xas)
> {
>         unsigned long v = (unsigned long)xas->xa_node;
>         return (v & 1) ? -(v >> 1) : 0;
> }
> 
> static inline void xas_set_err(struct xa_state *xas, unsigned long err)
> {
>         XA_BUG_ON(err > MAX_ERRNO);
>         xas->xa_node = XAS_ERROR(err);
> }
> 
> > +	/*
> > +	 * Ensure the error code actually fits where we want it to go. If it
> > +	 * doesn't then just throw a warning and don't record anything.
> > +	 */
> > +	if (unlikely(err > 0 || err < -MAX_ERRNO)) {
> > +		WARN(1, "err=%d\n", err);
> > +		return;
> > +	}
> 
> Cute trick to make this more succinct:
> 
> 	if (WARN(err > 0 || err < -MAX_ERRNO), "err = %d\n", err)
> 		return;
> or even ...
> 
> 	if (WARN((unsigned int)-err > MAX_ERRNO), "err = %d\n", err)
> 		return;
> 

Nice. I always forget that WARN has a return. Will fix.

> > +		/* Clear out error bits and set new error */
> > +		new = (old & ~MAX_ERRNO) | -err;
> > +
> > +		/* Only increment if someone has looked at it */
> > +		if (old & WB_ERR_SEEN) {
> > +			new += WB_ERR_CTR_INC;
> > +			new &= ~WB_ERR_SEEN;
> > +		}
> 
> Although we always want to clear out the SEEN bit if we're updating ... so
> 
> 		new = (old & ~(MAX_ERRNO | WB_ERR_SEEN) | -err;
> 
> 		/* Only increment if someone has looked at it */
> 		if (old & WB_ERR_SEEN)
> 			new += WB_ERR_CTR_INC;
> 

Sure, that is more succinct.

> ... and then there's no need to update if it's the same errno and nobody's
> seen it:
> 
> 		if (old == new)
> 			break;
> 

No, we can't do this. The thing could have just been updated by a task
that is setting the "seen" bit. We don't want to lose the error here. We
always have to do the cmpxchg on the set_wb_error side, I think.

> [...]
> 
> > +		/*
> > +		 * We always store values with the "seen" bit set, so if this
> > +		 * matches what we already have, then we can call it done.
> > +		 * There is nothing to update so just return 0.
> > +		 */
> > +		if (old == file->f_wb_err)
> > +			break;
> > +
> > +		/* set flag and try to swap it into place */
> > +		new = old | WB_ERR_SEEN;
> 
> Again, I think we should avoid the cmpxchg with:
> 
> 		if (old == new)
> 			break;
> 

Yeah, we may be able to do this one. I had myself convinced otherwise
yesterday, but I think you may be right.

> > +		cur = cmpxchg(&mapping->wb_err, old, new);
> > +
> > +		/*
> > +		 * We can quit now if we successfully swapped in the new value
> > +		 * or someone else beat us to it with the same value that we
> > +		 * were planning to store.
> > +		 */
> > +		if (likely(cur == old || cur == new)) {
> > +			file->f_wb_err = new;
> > +			err = -(new & MAX_ERRNO);
> > +			break;
> > +		}
> > +
> > +		/* Raced with an update, try again */
> > +		old = cur;
> 
> Well ... should we?  We're returning an error which is new to this fd anyway.
> Do we want to return the most recent error by a nanosecond, or should we
> return the previous one and then see this one next time we call fsync()?
> 
> I'd lean towards not looping here; not even looking at 'cur'.
> 

Yeah, that might be fine here. Let me think about it a bit more.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-07 13:12                                 ` Jeff Layton
@ 2017-04-09 23:15                                   ` NeilBrown
  2017-04-10 13:19                                     ` Jeff Layton
  0 siblings, 1 reply; 54+ messages in thread
From: NeilBrown @ 2017-04-09 23:15 UTC (permalink / raw)
  To: Jeff Layton, Matthew Wilcox
  Cc: linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

[-- Attachment #1: Type: text/plain, Size: 876 bytes --]

On Fri, Apr 07 2017, Jeff Layton wrote:

>
>> ... and then there's no need to update if it's the same errno and nobody's
>> seen it:
>> 
>> 		if (old == new)
>> 			break;
>> 
>
> No, we can't do this. The thing could have just been updated by a task
> that is setting the "seen" bit. We don't want to lose the error here. We
> always have to do the cmpxchg on the set_wb_error side, I think.

I don't follow your logic.
If (old == new) then there was a moment since this function started when
performing the cmpxchg() would not have changed the contents of memory.
So let's pretend it did actually happen at that moment, and not change
memory.

If we race with a task setting the "seen" bit, then it will have seen
the error *after* the new error, that this thread is reporting, actually
happened.  So the result is still correct.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it
  2017-04-09 23:15                                   ` NeilBrown
@ 2017-04-10 13:19                                     ` Jeff Layton
  0 siblings, 0 replies; 54+ messages in thread
From: Jeff Layton @ 2017-04-10 13:19 UTC (permalink / raw)
  To: NeilBrown, Matthew Wilcox
  Cc: linux-fsdevel, linux-kernel, linux-ext4, akpm, tytso, jack

On Mon, 2017-04-10 at 09:15 +1000, NeilBrown wrote:
> On Fri, Apr 07 2017, Jeff Layton wrote:
> 
> > 
> > > ... and then there's no need to update if it's the same errno and nobody's
> > > seen it:
> > > 
> > > 		if (old == new)
> > > 			break;
> > > 
> > 
> > No, we can't do this. The thing could have just been updated by a task
> > that is setting the "seen" bit. We don't want to lose the error here. We
> > always have to do the cmpxchg on the set_wb_error side, I think.
> 
> I don't follow your logic.
> If (old == new) then there was a moment since this function started when
> performing the cmpxchg() would not have changed the contents of memory.
> So let's pretend it did actually happen at that moment, and not change
> memory.
> 
> If we race with a task setting the "seen" bit, then it will have seen
> the error *after* the new error, that this thread is reporting, actually
> happened.  So the result is still correct.
> 

Ok, that does make sense. I'll plan to do that.

There's also a bug in the last patch that I sent. We need to mark the
SEEN bit when we sample the value at open time, so we need a
filemap_sample_wb_error function to grab the current wb_err_t and mark
it SEEN if necessary.

That also gives us a way to handle something like filemap_write_and_wait
(which doesn't take a struct file). We can sample the wb_err_t prior to
starting writeback, and then return an error if anything failed after
that point.

I think that's probably close enough to how the current code works that
we can use it to make drop-in replacements for filemap_write_and_wait*
which should keep us from having to change so much existing code here.

filemap_check_errors would need to take a previously-sampled wb_err_t
argument, but only the lowest-level callers of that and
filemap_fdatawait* would need to deal with them directly.
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2017-04-10 13:20 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-31 19:25 [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it Jeff Layton
2017-03-31 19:26 ` [RFC PATCH 1/4] fs: new infrastructure for writeback error handling and reporting Jeff Layton
2017-04-03  7:12   ` Nikolay Borisov
2017-04-03 10:28     ` Jeff Layton
2017-04-03 14:47   ` Matthew Wilcox
2017-04-03 15:19     ` Jeff Layton
2017-04-03 16:15       ` Matthew Wilcox
2017-04-03 16:30         ` Jeff Layton
2017-03-31 19:26 ` [RFC PATCH 2/4] dax: set errors in mapping when writeback fails Jeff Layton
2017-03-31 19:26 ` [RFC PATCH 3/4] buffer: set wb errors using both new and old infrastructure for now Jeff Layton
2017-03-31 19:26 ` [RFC PATCH 4/4] ext4: wire it up to the new writeback error reporting infrastructure Jeff Layton
2017-04-03  4:25 ` [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it NeilBrown
2017-04-03 10:28   ` Jeff Layton
2017-04-03 14:32     ` Matthew Wilcox
2017-04-03 17:47       ` Jeff Layton
2017-04-03 18:09         ` Jeremy Allison
2017-04-03 18:18           ` Jeff Layton
2017-04-03 18:36             ` Jeremy Allison
2017-04-03 18:40               ` Jeremy Allison
2017-04-03 18:49                 ` Jeff Layton
2017-04-03 19:16         ` Matthew Wilcox
2017-04-03 20:16           ` Jeff Layton
2017-04-04  2:45             ` Matthew Wilcox
2017-04-04  3:03             ` NeilBrown
2017-04-04 11:41               ` Jeff Layton
2017-04-04 22:41                 ` NeilBrown
2017-04-04 11:53               ` Matthew Wilcox
2017-04-04 12:17                 ` Jeff Layton
2017-04-04 16:12                   ` Matthew Wilcox
2017-04-04 16:25                     ` Jeff Layton
2017-04-04 17:09                       ` Matthew Wilcox
2017-04-04 18:08                         ` Jeff Layton
2017-04-04 22:50                         ` NeilBrown
2017-04-05 19:49                         ` Jeff Layton
2017-04-05 21:03                           ` Matthew Wilcox
2017-04-06  0:19                             ` NeilBrown
2017-04-06  0:02                           ` NeilBrown
2017-04-06  2:55                             ` Matthew Wilcox
2017-04-06  5:12                               ` NeilBrown
2017-04-06 13:31                                 ` Matthew Wilcox
2017-04-06 21:53                                   ` NeilBrown
2017-04-06 14:02                             ` Jeff Layton
2017-04-06 19:14                             ` Jeff Layton
2017-04-06 20:05                               ` Matthew Wilcox
2017-04-07 13:12                                 ` Jeff Layton
2017-04-09 23:15                                   ` NeilBrown
2017-04-10 13:19                                     ` Jeff Layton
2017-04-06 22:15                               ` NeilBrown
2017-04-04 23:13                       ` NeilBrown
2017-04-05 11:14                         ` Jeff Layton
2017-04-06  0:24                           ` NeilBrown
2017-04-04 13:38                 ` Theodore Ts'o
2017-04-04 22:28                 ` NeilBrown
2017-04-03 14:51   ` Matthew Wilcox

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.