[PATCH v2 0/5] ext4: fix inconsistency since reading old metadata from disk


* [PATCH v2 0/5] ext4: fix inconsistency since reading old metadata from disk
From: zhangyi (F) @ 2020-06-17 11:59 UTC (permalink / raw)
  To: linux-ext4, tytso, jack
  Cc: adilger.kernel, zhangxiaoxu5, yi.zhang, linux-fsdevel

Changes since v1:
 - Drop the approach of re-setting the metadata buffer's uptodate flag.
 - Patch 1-2: Add a callback for end_buffer_async_write() and invoke
   ext4_error_err() to handle a metadata buffer async write IO error
   immediately.
 - Patch 3: Add a mapping->wb_err check and also invoke ext4_error_err()
   in ext4_journal_get_write_access() if wb_err differs from the
   original value saved at mount time. This patch is needed because
   patch 2 cannot fix all cases.
 - Patch 4-5: Remove the partial fixes <7963e5ac90125> and <9c83a923c67d>.

These 5 patches are based on linux-5.8-rc1 and have been tested with
xfstests, with no new failures.

Thanks,
Yi.

-----------------------

Original background
===================

This patch set aims to fix the inconsistency problem that was discussed
and partially fixed in [1].

The problem occurs on unstable storage with a flaky transport (e.g. an
iSCSI transport that disconnects for a few seconds and then reconnects
because of a bad network environment). If an async metadata write fails
in the background, the end-write routine in the block layer clears the
buffer's uptodate flag, even though the data in the buffer is actually
up to date. When we get the buffer later, we may read "old &&
inconsistent" metadata from the disk, because the uptodate flag was
cleared and we do not check the write IO error flag, or, even worse,
because the buffer has been freed due to memory pressure.

Fortunately, if jbd2 does a checkpoint after the async IO error happens,
the checkpoint routine checks the write_io_error flag and aborts the
journal if it detects an IO error. In the journal recovery case, the
recovery code invokes sync_blockdev() after recovery completes, which
also detects the IO error and refuses to mount the filesystem.

ext4 already deals with this problem in __ext4_get_inode_loc() and in
commit 7963e5ac90125 ("ext4: treat buffers with write errors as
containing valid data"), but that is not enough.

[1] https://lore.kernel.org/linux-ext4/20190823030207.GC8130@mit.edu/
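
The failure mode described above can be illustrated with a small
user-space model. This is only a sketch and not kernel code: the names
model_bh, model_end_async_write(), model_bread() and model_stale_read()
are hypothetical, and the "disk" is a single integer. It shows how a
reader that trusts only the uptodate flag ends up replacing newer
in-memory metadata with the old on-disk copy after one failed
background write.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a buffer_head caching one metadata block. */
struct model_bh {
	int  data;		/* cached (in-memory) contents */
	bool uptodate;		/* models the buffer's uptodate flag */
	bool write_io_error;	/* models the buffer's write_io_error flag */
};

/* Models the end-of-write path: on failure, clear uptodate and flag the error. */
void model_end_async_write(struct model_bh *bh, int *disk, bool io_ok)
{
	if (io_ok) {
		*disk = bh->data;
	} else {
		bh->uptodate = false;
		bh->write_io_error = true;
	}
}

/* Models a reader that looks only at uptodate and ignores write_io_error. */
int model_bread(struct model_bh *bh, const int *disk)
{
	if (!bh->uptodate) {
		bh->data = *disk;	/* re-read overwrites the newer copy */
		bh->uptodate = true;
	}
	return bh->data;
}

/* Returns the value a reader sees after one failed background write. */
int model_stale_read(void)
{
	int disk = 1;		/* old on-disk metadata */
	struct model_bh bh = { .data = 2, .uptodate = true,
			       .write_io_error = false };

	model_end_async_write(&bh, &disk, false);	/* transport glitch */
	return model_bread(&bh, &disk);
}
```

In this model the newer value 2 is lost and the reader gets the old
value 1 back, which is what could then be written out again.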

zhangyi (F) (5):
  fs: add bdev writepage hook to block device
  ext4: mark filesystem error if failed to async write metadata
  ext4: detect metadata async write error when getting journal's write
    access
  ext4: remove ext4_buffer_uptodate()
  ext4: remove write io error check before read inode block

 fs/block_dev.c      |  5 ++++
 fs/ext4/ext4.h      | 24 +++++++++----------
 fs/ext4/ext4_jbd2.c | 34 +++++++++++++++++++++++----
 fs/ext4/inode.c     | 13 ++---------
 fs/ext4/page-io.c   | 57 +++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/super.c     | 32 ++++++++++++++++++++++++-
 include/linux/fs.h  |  1 +
 7 files changed, 136 insertions(+), 30 deletions(-)

-- 
2.25.4


* [PATCH v2 1/5] fs: add bdev writepage hook to block device
From: zhangyi (F) @ 2020-06-17 11:59 UTC (permalink / raw)
  To: linux-ext4, tytso, jack
  Cc: adilger.kernel, zhangxiaoxu5, yi.zhang, linux-fsdevel

Add a new bdev_write_page hook to struct super_operations, called by
blkdev_writepage(), which a filesystem can use to propagate its private
handlers.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
---
 fs/block_dev.c     | 5 +++++
 include/linux/fs.h | 1 +
 2 files changed, 6 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 47860e589388..46e25a4e3ebf 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -606,6 +606,11 @@ EXPORT_SYMBOL(thaw_bdev);
 
 static int blkdev_writepage(struct page *page, struct writeback_control *wbc)
 {
+	struct super_block *super = BDEV_I(page->mapping->host)->bdev.bd_super;
+
+	if (super && super->s_op->bdev_write_page)
+		return super->s_op->bdev_write_page(page, blkdev_get_block, wbc);
+
 	return block_write_full_page(page, blkdev_get_block, wbc);
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index cffc3619eed5..b87b784c6bc6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1983,6 +1983,7 @@ struct super_operations {
 	struct dquot **(*get_dquots)(struct inode *);
 #endif
 	int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
+	int (*bdev_write_page)(struct page *, get_block_t *, struct writeback_control *);
 	long (*nr_cached_objects)(struct super_block *,
 				  struct shrink_control *);
 	long (*free_cached_objects)(struct super_block *,
-- 
2.25.4



* [PATCH v2 2/5] ext4: mark filesystem error if failed to async write metadata
From: zhangyi (F) @ 2020-06-17 11:59 UTC (permalink / raw)
  To: linux-ext4, tytso, jack
  Cc: adilger.kernel, zhangxiaoxu5, yi.zhang, linux-fsdevel

There is a risk of filesystem inconsistency if we fail to write back a
metadata buffer asynchronously in the background. The buffer's end IO
procedure is currently handled by end_buffer_async_write() in the block
layer, which only clears the buffer's uptodate flag and sets the
write_io_error flag, so ext4 cannot detect such a failure immediately.
In most cases of getting a metadata buffer (e.g.
ext4_read_inode_bitmap()), although the buffer's data is actually up to
date, ext4 may still read the data from disk because the buffer's
uptodate flag has been cleared. Finally, this may lead to on-disk
filesystem inconsistency if the old data is read from the disk
successfully and written out again.

This patch propagates an ext4 end buffer callback to the block layer,
which can detect a metadata buffer's async write error and invoke the
ext4 error handler immediately.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
---
 fs/ext4/ext4.h    |  7 +++++++
 fs/ext4/page-io.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/super.c   | 13 +++++++++++++
 3 files changed, 67 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 15b062efcff1..2f22476f41d2 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1515,6 +1515,10 @@ struct ext4_sb_info {
 	/* workqueue for reserved extent conversions (buffered io) */
 	struct workqueue_struct *rsv_conversion_wq;
 
+	/* workqueue for handle metadata buffer async writeback error */
+	struct workqueue_struct *s_bdev_wb_err_wq;
+	struct work_struct s_bdev_wb_err_work;
+
 	/* timer for periodic error stats printing */
 	struct timer_list s_err_report;
 
@@ -3426,6 +3430,9 @@ extern int ext4_bio_write_page(struct ext4_io_submit *io,
 			       bool keep_towrite);
 extern struct ext4_io_end_vec *ext4_alloc_io_end_vec(ext4_io_end_t *io_end);
 extern struct ext4_io_end_vec *ext4_last_io_end_vec(ext4_io_end_t *io_end);
+extern void ext4_end_buffer_async_write_error(struct work_struct *work);
+extern int ext4_bdev_write_page(struct page *page, get_block_t *get_block,
+				struct writeback_control *wbc);
 
 /* mmp.c */
 extern int ext4_multi_mount_protect(struct super_block *, ext4_fsblk_t);
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index de6fe969f773..50aa8e26e38c 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -560,3 +560,50 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 		end_page_writeback(page);
 	return ret;
 }
+
+/*
+ * Handle error of async writeback metadata buffer, just mark the filesystem
+ * error to prevent potential further inconsistency.
+ */
+void ext4_end_buffer_async_write_error(struct work_struct *work)
+{
+	struct ext4_sb_info *sbi = container_of(work,
+				struct ext4_sb_info, s_bdev_wb_err_work);
+
+	/*
+	 * If we failed to async write back metadata buffer, there is a risk
+	 * of filesystem inconsistency since we may read old metadata from the
+	 * disk successfully and write them out again.
+	 */
+	ext4_error_err(sbi->s_sb, -EIO, "Error while async write back metadata buffer");
+}
+
+static void ext4_end_buffer_async_write(struct buffer_head *bh, int uptodate)
+{
+	struct super_block *sb = bh->b_bdev->bd_super;
+
+	end_buffer_async_write(bh, uptodate);
+
+	if (!uptodate && sb && !sb_rdonly(sb)) {
+		struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+		/* Handle error of async writeback metadata buffer */
+		queue_work(sbi->s_bdev_wb_err_wq, &sbi->s_bdev_wb_err_work);
+	}
+}
+
+int ext4_bdev_write_page(struct page *page, get_block_t *get_block,
+			 struct writeback_control *wbc)
+{
+	struct inode * const inode = page->mapping->host;
+	loff_t i_size = i_size_read(inode);
+	const pgoff_t end_index = i_size >> PAGE_SHIFT;
+	unsigned int offset;
+
+	offset = i_size & ~PAGE_MASK;
+	if (page->index == end_index && offset)
+		zero_user_segment(page, offset, PAGE_SIZE);
+
+	return __block_write_full_page(inode, page, get_block,
+				       wbc, ext4_end_buffer_async_write);
+}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 9824cd8203e8..f04b161a64a0 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1013,6 +1013,7 @@ static void ext4_put_super(struct super_block *sb)
 	ext4_quota_off_umount(sb);
 
 	destroy_workqueue(sbi->rsv_conversion_wq);
+	destroy_workqueue(sbi->s_bdev_wb_err_wq);
 
 	/*
 	 * Unregister sysfs before destroying jbd2 journal.
@@ -1492,6 +1493,7 @@ static const struct super_operations ext4_sops = {
 	.get_dquots	= ext4_get_dquots,
 #endif
 	.bdev_try_to_free_page = bdev_try_to_free_page,
+	.bdev_write_page = ext4_bdev_write_page,
 };
 
 static const struct export_operations ext4_export_ops = {
@@ -4598,6 +4600,15 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		goto failed_mount4;
 	}
 
+	EXT4_SB(sb)->s_bdev_wb_err_wq =
+		alloc_workqueue("ext4-bdev-write-error", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
+	if (!EXT4_SB(sb)->s_bdev_wb_err_wq) {
+		ext4_msg(sb, KERN_ERR, "failed to create workqueue\n");
+		ret = -ENOMEM;
+		goto failed_mount4;
+	}
+	INIT_WORK(&EXT4_SB(sb)->s_bdev_wb_err_work, ext4_end_buffer_async_write_error);
+
 	/*
 	 * The jbd2_journal_load will have done any necessary log recovery,
 	 * so we can safely mount the rest of the filesystem now.
@@ -4781,6 +4792,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	sb->s_root = NULL;
 failed_mount4:
 	ext4_msg(sb, KERN_ERR, "mount failed");
+	if (EXT4_SB(sb)->s_bdev_wb_err_wq)
+		destroy_workqueue(EXT4_SB(sb)->s_bdev_wb_err_wq);
 	if (EXT4_SB(sb)->rsv_conversion_wq)
 		destroy_workqueue(EXT4_SB(sb)->rsv_conversion_wq);
 failed_mount_wq:
-- 
2.25.4



* [PATCH v2 3/5] ext4: detect metadata async write error when getting journal's write access
From: zhangyi (F) @ 2020-06-17 11:59 UTC (permalink / raw)
  To: linux-ext4, tytso, jack
  Cc: adilger.kernel, zhangxiaoxu5, yi.zhang, linux-fsdevel

Although we have already introduced s_bdev_wb_err_work to detect and
handle async metadata buffer write errors as soon as possible, there is
still a potential race that could lead to filesystem inconsistency:
the buffer may be read and written out to the journal again before
s_bdev_wb_err_work runs. So this patch checks the bdev mapping->wb_err
when getting the journal's write access and also marks the filesystem
as in error if something bad happened.
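
The check described above can be sketched with a user-space model. This
is a deliberately simplified illustration, not the kernel's errseq_t
implementation: model_errseq, model_errseq_set(),
model_check_and_advance(), model_get_write_access() and
model_errseq_demo() are hypothetical names, and the model is just a
counter plus the last error. It shows the intended behaviour: a value
sampled at mount time is compared against the device's current error
sequence, and a newly recorded error is reported once to the caller
asking for write access.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical simplified error sequence: a counter plus the last error. */
struct model_errseq {
	unsigned int seq;	/* bumped on every recorded error */
	int err;		/* most recent negative errno, 0 if none */
};

/* Models the block layer recording a failed background write. */
void model_errseq_set(struct model_errseq *e, int err)
{
	e->seq++;
	e->err = err;
}

/*
 * Returns the latest error if one was recorded after *since was sampled,
 * and advances *since so the same error is reported only once.
 */
int model_check_and_advance(const struct model_errseq *e, unsigned int *since)
{
	if (*since == e->seq)
		return 0;
	*since = e->seq;
	return e->err;
}

/* Models the gate placed in front of journal write access. */
int model_get_write_access(const struct model_errseq *bdev_wb_err,
			   unsigned int *saved)
{
	int err = model_check_and_advance(bdev_wb_err, saved);

	if (err)
		return err;	/* the patch marks the filesystem in error here */
	return 0;		/* the patch goes on to get journal write access */
}

/* Returns 1 if the model reports a new error exactly once. */
int model_errseq_demo(void)
{
	struct model_errseq wb = { .seq = 0, .err = 0 };
	unsigned int saved = wb.seq;	/* sampled at "mount" time */
	int before, first, second;

	before = model_get_write_access(&wb, &saved);	/* no error yet */
	model_errseq_set(&wb, -EIO);			/* background write fails */
	first = model_get_write_access(&wb, &saved);	/* error is reported */
	second = model_get_write_access(&wb, &saved);	/* already consumed */
	return before == 0 && first == -EIO && second == 0;
}
```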

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
---
 fs/ext4/ext4.h      |  4 ++++
 fs/ext4/ext4_jbd2.c | 34 +++++++++++++++++++++++++++++-----
 fs/ext4/page-io.c   | 12 +++++++++++-
 fs/ext4/super.c     | 17 +++++++++++++++++
 4 files changed, 61 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 2f22476f41d2..82ae41a828dd 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1519,6 +1519,10 @@ struct ext4_sb_info {
 	struct workqueue_struct *s_bdev_wb_err_wq;
 	struct work_struct s_bdev_wb_err_work;
 
+	/* Record the errseq of the backing block device */
+	errseq_t s_bdev_wb_err;
+	spinlock_t s_bdev_wb_lock;
+
 	/* timer for periodic error stats printing */
 	struct timer_list s_err_report;
 
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 0c76cdd44d90..66620324f019 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -198,16 +198,40 @@ static void ext4_journal_abort_handle(const char *caller, unsigned int line,
 int __ext4_journal_get_write_access(const char *where, unsigned int line,
 				    handle_t *handle, struct buffer_head *bh)
 {
+	struct super_block *sb = bh->b_bdev->bd_super;
 	int err = 0;
 
 	might_sleep();
 
-	if (ext4_handle_valid(handle)) {
-		err = jbd2_journal_get_write_access(handle, bh);
-		if (err)
-			ext4_journal_abort_handle(where, line, __func__, bh,
-						  handle, err);
+	/*
+	 * If the block device has write error flag, it may have failed to
+	 * async write out metadata buffers in the background but the error
+	 * handle worker hasn't been executed yet. In this case, we could
+	 * read old data from disk and write it out again, which may lead
+	 * to on-disk filesystem inconsistency.
+	 */
+	if (sb) {
+		struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
+		struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+		if (errseq_check(&mapping->wb_err, READ_ONCE(sbi->s_bdev_wb_err))) {
+			spin_lock(&sbi->s_bdev_wb_lock);
+			err = errseq_check_and_advance(&mapping->wb_err,
+						       &sbi->s_bdev_wb_err);
+			spin_unlock(&sbi->s_bdev_wb_lock);
+			if (err) {
+				ext4_error_err(sb, err,
+					       "Error while async write back metadata");
+				goto out;
+			}
+		}
 	}
+
+	if (ext4_handle_valid(handle))
+		err = jbd2_journal_get_write_access(handle, bh);
+out:
+	if (err)
+		ext4_journal_abort_handle(where, line, __func__, bh, handle, err);
 	return err;
 }
 
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 50aa8e26e38c..b2c3da4be2aa 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -569,13 +569,23 @@ void ext4_end_buffer_async_write_error(struct work_struct *work)
 {
 	struct ext4_sb_info *sbi = container_of(work,
 				struct ext4_sb_info, s_bdev_wb_err_work);
+	struct address_space *mapping = sbi->s_sb->s_bdev->bd_inode->i_mapping;
+	int err;
 
 	/*
 	 * If we failed to async write back metadata buffer, there is a risk
 	 * of filesystem inconsistency since we may read old metadata from the
 	 * disk successfully and write them out again.
 	 */
-	ext4_error_err(sbi->s_sb, -EIO, "Error while async write back metadata buffer");
+	if (errseq_check(&mapping->wb_err, READ_ONCE(sbi->s_bdev_wb_err))) {
+		spin_lock(&sbi->s_bdev_wb_lock);
+		err = errseq_check_and_advance(&mapping->wb_err,
+					       &sbi->s_bdev_wb_err);
+		spin_unlock(&sbi->s_bdev_wb_lock);
+		if (err)
+			ext4_error_err(sbi->s_sb, -EIO,
+				       "Error while async write back metadata buffer");
+	}
 }
 
 static void ext4_end_buffer_async_write(struct buffer_head *bh, int uptodate)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index f04b161a64a0..3e867ff452cd 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4715,6 +4715,15 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	}
 #endif  /* CONFIG_QUOTA */
 
+	/*
+	 * Save the original bdev mapping's wb_err value which could be
+	 * used to detect the metadata async write error.
+	 */
+	spin_lock_init(&sbi->s_bdev_wb_lock);
+	if (!sb_rdonly(sb))
+		errseq_check_and_advance(&sb->s_bdev->bd_inode->i_mapping->wb_err,
+					 &sbi->s_bdev_wb_err);
+	sb->s_bdev->bd_super = sb;
 	EXT4_SB(sb)->s_mount_state |= EXT4_ORPHAN_FS;
 	ext4_orphan_cleanup(sb, es);
 	EXT4_SB(sb)->s_mount_state &= ~EXT4_ORPHAN_FS;
@@ -5580,6 +5589,14 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
 				goto restore_opts;
 			}
 
+			/*
+			 * Update the original bdev mapping's wb_err value
+			 * which could be used to detect the metadata async
+			 * write error.
+			 */
+			errseq_check_and_advance(&sb->s_bdev->bd_inode->i_mapping->wb_err,
+						 &sbi->s_bdev_wb_err);
+
 			/*
 			 * Mounting a RDONLY partition read-write, so reread
 			 * and store the current valid flag.  (It may have
-- 
2.25.4



* [PATCH v2 4/5] ext4: remove ext4_buffer_uptodate()
From: zhangyi (F) @ 2020-06-17 11:59 UTC (permalink / raw)
  To: linux-ext4, tytso, jack
  Cc: adilger.kernel, zhangxiaoxu5, yi.zhang, linux-fsdevel

Now that the ext4_end_buffer_async_write() callback is hooked into the
block layer to detect metadata buffer async write errors in the
background, we can remove the partial fixes for the filesystem
inconsistency problem caused by reading old data from disk, which were
added in commit <7963e5ac9012> "ext4: treat buffers with write errors as
containing valid data" and commit <cf2834a5ed57> "ext4: treat buffers
contining write errors as valid in ext4_sb_bread()".

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
---
 fs/ext4/ext4.h  | 13 -------------
 fs/ext4/inode.c |  4 ++--
 fs/ext4/super.c |  2 +-
 3 files changed, 3 insertions(+), 16 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 82ae41a828dd..a3699677a119 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3497,19 +3497,6 @@ extern const struct iomap_ops ext4_iomap_ops;
 extern const struct iomap_ops ext4_iomap_overwrite_ops;
 extern const struct iomap_ops ext4_iomap_report_ops;
 
-static inline int ext4_buffer_uptodate(struct buffer_head *bh)
-{
-	/*
-	 * If the buffer has the write error flag, we have failed
-	 * to write out data in the block.  In this  case, we don't
-	 * have to read the block because we may read the old data
-	 * successfully.
-	 */
-	if (!buffer_uptodate(bh) && buffer_write_io_error(bh))
-		set_buffer_uptodate(bh);
-	return buffer_uptodate(bh);
-}
-
 #endif	/* __KERNEL__ */
 
 #define EFSBADCRC	EBADMSG		/* Bad CRC detected */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 52be85f96159..8ccb6996c384 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -875,7 +875,7 @@ struct buffer_head *ext4_bread(handle_t *handle, struct inode *inode,
 	bh = ext4_getblk(handle, inode, block, map_flags);
 	if (IS_ERR(bh))
 		return bh;
-	if (!bh || ext4_buffer_uptodate(bh))
+	if (!bh || buffer_uptodate(bh))
 		return bh;
 	ll_rw_block(REQ_OP_READ, REQ_META | REQ_PRIO, 1, &bh);
 	wait_on_buffer(bh);
@@ -902,7 +902,7 @@ int ext4_bread_batch(struct inode *inode, ext4_lblk_t block, int bh_count,
 
 	for (i = 0; i < bh_count; i++)
 		/* Note that NULL bhs[i] is valid because of holes. */
-		if (bhs[i] && !ext4_buffer_uptodate(bhs[i]))
+		if (bhs[i] && !buffer_uptodate(bhs[i]))
 			ll_rw_block(REQ_OP_READ, REQ_META | REQ_PRIO, 1,
 				    &bhs[i]);
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3e867ff452cd..9cb85aa72ba3 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -154,7 +154,7 @@ ext4_sb_bread(struct super_block *sb, sector_t block, int op_flags)
 
 	if (bh == NULL)
 		return ERR_PTR(-ENOMEM);
-	if (ext4_buffer_uptodate(bh))
+	if (buffer_uptodate(bh))
 		return bh;
 	ll_rw_block(REQ_OP_READ, REQ_META | op_flags, 1, &bh);
 	wait_on_buffer(bh);
-- 
2.25.4



* [PATCH v2 5/5] ext4: remove write io error check before read inode block
From: zhangyi (F) @ 2020-06-17 11:59 UTC (permalink / raw)
  To: linux-ext4, tytso, jack
  Cc: adilger.kernel, zhangxiaoxu5, yi.zhang, linux-fsdevel

Now that the ext4_end_buffer_async_write() callback is hooked into the
block layer to detect metadata buffer async write errors in the
background, we can remove the partial fix for the filesystem
inconsistency problem caused by reading old data from disk, which was
added in commit <9c83a923c67d> "ext4: don't read inode block if the
buffer has a write error".

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 8ccb6996c384..b2fc1aef3886 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4281,15 +4281,6 @@ static int __ext4_get_inode_loc(struct inode *inode,
 	if (!buffer_uptodate(bh)) {
 		lock_buffer(bh);
 
-		/*
-		 * If the buffer has the write error flag, we have failed
-		 * to write out another inode in the same block.  In this
-		 * case, we don't have to read the block because we may
-		 * read the old inode data successfully.
-		 */
-		if (buffer_write_io_error(bh) && !buffer_uptodate(bh))
-			set_buffer_uptodate(bh);
-
 		if (buffer_uptodate(bh)) {
 			/* someone brought it uptodate while we waited */
 			unlock_buffer(bh);
-- 
2.25.4



* Re: [PATCH v2 3/5] ext4: detect metadata async write error when getting journal's write access
From: Jan Kara @ 2020-06-17 12:41 UTC (permalink / raw)
  To: zhangyi (F)
  Cc: linux-ext4, tytso, jack, adilger.kernel, zhangxiaoxu5, linux-fsdevel

On Wed 17-06-20 19:59:45, zhangyi (F) wrote:
> Although we have already introduced s_bdev_wb_err_work to detect and
> handle async metadata buffer write errors as soon as possible, there is
> still a potential race that could lead to filesystem inconsistency:
> the buffer may be read and written out to the journal again before
> s_bdev_wb_err_work runs. So this patch checks the bdev mapping->wb_err
> when getting the journal's write access and also marks the filesystem
> as in error if something bad happened.
> 
> Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>

So instead of all this, cannot we just do:

	if (work_pending(&sbi->s_bdev_wb_err_work))
		flush_work(&sbi->s_bdev_wb_err_work);

? And so we are sure the filesystem is aborted if the abort was pending?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 2/5] ext4: mark filesystem error if failed to async write metadata
From: Jan Kara @ 2020-06-17 12:48 UTC (permalink / raw)
  To: zhangyi (F)
  Cc: linux-ext4, tytso, jack, adilger.kernel, zhangxiaoxu5, linux-fsdevel

On Wed 17-06-20 19:59:44, zhangyi (F) wrote:
> There is a risk of filesystem inconsistency if we fail to write back a
> metadata buffer asynchronously in the background. The buffer's end IO
> procedure is currently handled by end_buffer_async_write() in the block
> layer, which only clears the buffer's uptodate flag and sets the
> write_io_error flag, so ext4 cannot detect such a failure immediately.
> In most cases of getting a metadata buffer (e.g.
> ext4_read_inode_bitmap()), although the buffer's data is actually up to
> date, ext4 may still read the data from disk because the buffer's
> uptodate flag has been cleared. Finally, this may lead to on-disk
> filesystem inconsistency if the old data is read from the disk
> successfully and written out again.
> 
> This patch propagates an ext4 end buffer callback to the block layer,
> which can detect a metadata buffer's async write error and invoke the
> ext4 error handler immediately.
> 
> Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>

Thanks for the patch. It looks good, just some language fixes below...

> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 15b062efcff1..2f22476f41d2 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1515,6 +1515,10 @@ struct ext4_sb_info {
>  	/* workqueue for reserved extent conversions (buffered io) */
>  	struct workqueue_struct *rsv_conversion_wq;
>  
> +	/* workqueue for handle metadata buffer async writeback error */
                         ^^ handling

> +	struct workqueue_struct *s_bdev_wb_err_wq;
> +	struct work_struct s_bdev_wb_err_work;
> +
>  	/* timer for periodic error stats printing */
>  	struct timer_list s_err_report;
>  
...
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index de6fe969f773..50aa8e26e38c 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -560,3 +560,50 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
>  		end_page_writeback(page);
>  	return ret;
>  }
> +
> +/*
> + * Handle error of async writeback metadata buffer, just mark the filesystem
> + * error to prevent potential further inconsistency.
> + */

This comment is probably unnecessary. The comment inside the function is
enough. So I'd just delete this one.

> +void ext4_end_buffer_async_write_error(struct work_struct *work)
> +{
> +	struct ext4_sb_info *sbi = container_of(work,
> +				struct ext4_sb_info, s_bdev_wb_err_work);
> +
> +	/*
> +	 * If we failed to async write back metadata buffer, there is a risk
> +	 * of filesystem inconsistency since we may read old metadata from the
> +	 * disk successfully and write them out again.
> +	 */
> +	ext4_error_err(sbi->s_sb, -EIO, "Error while async write back metadata buffer");
> +}
> +
> +static void ext4_end_buffer_async_write(struct buffer_head *bh, int uptodate)
> +{
> +	struct super_block *sb = bh->b_bdev->bd_super;
> +
> +	end_buffer_async_write(bh, uptodate);
> +
> +	if (!uptodate && sb && !sb_rdonly(sb)) {
> +		struct ext4_sb_info *sbi = EXT4_SB(sb);
> +
> +		/* Handle error of async writeback metadata buffer */

Instead of this comment, which isn't very useful, I'd add a comment
explaining why we do it like this. So something like:

/*
 * This function is called from softirq handler and complete abort of a
 * filesystem requires taking sleeping locks and submitting IO. So postpone
 * the real work to a workqueue.
 */

> +		queue_work(sbi->s_bdev_wb_err_wq, &sbi->s_bdev_wb_err_work);
> +	}
> +}
> +

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 3/5] ext4: detect metadata async write error when getting journal's write access
From: zhangyi (F) @ 2020-06-17 13:44 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, tytso, adilger.kernel, zhangxiaoxu5, linux-fsdevel

On 2020/6/17 20:41, Jan Kara wrote:
> On Wed 17-06-20 19:59:45, zhangyi (F) wrote:
>> Although we have already introduced s_bdev_wb_err_work to detect and
>> handle async metadata buffer write errors as soon as possible, there is
>> still a potential race that could lead to filesystem inconsistency:
>> the buffer may be read and written out to the journal again before
>> s_bdev_wb_err_work runs. So this patch checks the bdev mapping->wb_err
>> when getting the journal's write access and also marks the filesystem
>> as in error if something bad happened.
>>
>> Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
> 
> So instead of all this, cannot we just do:
> 
> 	if (work_pending(&sbi->s_bdev_wb_err_work))
> 		flush_work(&sbi->s_bdev_wb_err_work);
> 
> ? And so we are sure the filesystem is aborted if the abort was pending?
> 

Thanks for the suggestion. Yes, we could do this, but it depends on the
second patch: if we check and flush the pending work here, we can no longer
call end_buffer_async_write() from ext4_end_buffer_async_write(). We would
need to open-code it and queue the error work before the buffer is unlocked,
otherwise the race is still there. Do you agree?

Thanks,
Yi.



* Re: [PATCH v2 3/5] ext4: detect metadata async write error when getting journal's write access
From: zhangyi (F) @ 2020-06-18  3:53 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, adilger.kernel, zhangxiaoxu5, linux-fsdevel, yi.zhang

On 2020/6/17 21:44, zhangyi (F) wrote:
> On 2020/6/17 20:41, Jan Kara wrote:
>> On Wed 17-06-20 19:59:45, zhangyi (F) wrote:
>>> Although we have already introduced s_bdev_wb_err_work to detect and
>>> handle async metadata buffer write errors as soon as possible, there is
>>> still a potential race that could lead to filesystem inconsistency:
>>> the buffer may be read and written out to the journal again before
>>> s_bdev_wb_err_work runs. So this patch checks the bdev mapping->wb_err
>>> when getting the journal's write access and also marks the filesystem
>>> as in error if something bad happened.
>>>
>>> Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
>>
>> So instead of all this, cannot we just do:
>>
>> 	if (work_pending(&sbi->s_bdev_wb_err_work))
>> 		flush_work(&sbi->s_bdev_wb_err_work);
>>
>> ? And so we are sure the filesystem is aborted if the abort was pending?
>>
> 
> Thanks for the suggestion. Yes, we could do this, but it depends on the
> second patch: if we check and flush the pending work here, we can no longer
> call end_buffer_async_write() from ext4_end_buffer_async_write(). We would
> need to open-code it and queue the error work before the buffer is unlocked,
> otherwise the race is still there. Do you agree?
> 

One more point: adding the work_pending() check here may not be safe. We need
to make sure the filesystem is aborted, so we need to wait until the error
handling work has finished, but the work's pending bit is cleared before the
work starts running. I think it may be better to just invoke flush_work() here.
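
The point about the pending bit can be illustrated with a user-space
model. This is a sketch under simplifying assumptions and not the kernel
workqueue code: model_work, model_work_pending(), model_worker_start(),
model_flush_work() and model_caller_sees_abort() are hypothetical names,
and the "flush" simply runs the handler to completion. It shows that a
caller which flushes only when the item is still pending can skip the
wait while the handler is already running but has not finished.

```c
#include <assert.h>
#include <stdbool.h>

enum model_work_state { MODEL_IDLE, MODEL_PENDING, MODEL_RUNNING };

/* Hypothetical stand-in for the error work item. */
struct model_work {
	enum model_work_state state;
	bool fs_aborted;	/* set once the handler has completed */
};

/* Models work_pending(): true only while queued and not yet picked up. */
bool model_work_pending(const struct model_work *w)
{
	return w->state == MODEL_PENDING;
}

/* Worker picks up the item: pending is cleared before the handler runs. */
void model_worker_start(struct model_work *w)
{
	if (w->state == MODEL_PENDING)
		w->state = MODEL_RUNNING;
}

/* Models flush_work(): waits until a queued or running item has finished. */
void model_flush_work(struct model_work *w)
{
	if (w->state != MODEL_IDLE) {
		w->fs_aborted = true;	/* handler runs to completion */
		w->state = MODEL_IDLE;
	}
}

/* Returns whether the abort is visible when the caller continues. */
bool model_caller_sees_abort(bool check_pending_first)
{
	struct model_work w = { .state = MODEL_PENDING, .fs_aborted = false };

	model_worker_start(&w);	/* handler started but not finished */
	if (!check_pending_first || model_work_pending(&w))
		model_flush_work(&w);
	return w.fs_aborted;
}
```

With the pending check in front, the model's caller continues without
seeing the abort; calling the flush unconditionally closes that window.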

BTW, I also noticed another race condition that may lead to inconsistency. In
bdev_try_to_free_page(), if we free a buffer with a write error before the
worker has finished, the jbd2 checkpoint procedure will miss this error and
wrongly assume the buffer has already been written to disk successfully.
Finally, it will destroy the log and lead to inconsistency (the same applies
to no-journal mode). So I think the ninth patch in my v1 patch set is still
needed. What do you think?

Thanks,
Yi.


* Re: [PATCH v2 1/5] fs: add bdev writepage hook to block device
From: Christoph Hellwig @ 2020-06-18  7:02 UTC (permalink / raw)
  To: zhangyi (F)
  Cc: linux-ext4, tytso, jack, adilger.kernel, zhangxiaoxu5, linux-fsdevel

On Wed, Jun 17, 2020 at 07:59:43PM +0800, zhangyi (F) wrote:
> Add a new bdev_write_page hook to struct super_operations, called by
> blkdev_writepage(), which a filesystem can use to propagate its private
> handlers.

Sorry, but no.  We've been trying to get the fs decoupled from the whole
buffer_head crap for quite a while, and this just makes it much worse.
Please don't add layering violations like this.
