[PATCH 1/2] Btrfs: add missing end_page_writeback on submit_extent

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 1/2] Btrfs: add missing end_page_writeback on submit_extent_page failure
@ 2014-09-22 16:41 Filipe Manana
  2014-09-22 16:41 ` [PATCH 2/2] Btrfs: be aware of btree inode write errors to avoid fs corruption Filipe Manana
  2014-09-23 10:14 ` [PATCH 1/2] Btrfs: add missing end_page_writeback on submit_extent_page failure Liu Bo
  0 siblings, 2 replies; 21+ messages in thread
From: Filipe Manana @ 2014-09-22 16:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

If submit_extent_page() fails in write_one_eb(), we end up with the current
page not marked dirty anymore, unlocked and marked for writeback. But we never
end up calling end_page_writeback() against the page, which will make calls to
filemap_fdatawait_range (e.g. at transaction commit time) hang forever waiting
for the writeback bit to be cleared from the page.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_io.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3af4966..91f866c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3668,6 +3668,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 		if (ret) {
 			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
 			SetPageError(p);
+			end_page_writeback(p);
 			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
 				end_extent_buffer_writeback(eb);
 			ret = -EIO;
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/2] Btrfs: be aware of btree inode write errors to avoid fs corruption
  2014-09-22 16:41 [PATCH 1/2] Btrfs: add missing end_page_writeback on submit_extent_page failure Filipe Manana
@ 2014-09-22 16:41 ` Filipe Manana
  2014-09-23 14:05   ` Liu Bo
                     ` (11 more replies)
  2014-09-23 10:14 ` [PATCH 1/2] Btrfs: add missing end_page_writeback on submit_extent_page failure Liu Bo
  1 sibling, 12 replies; 21+ messages in thread
From: Filipe Manana @ 2014-09-22 16:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).

Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.

Using the BTRFS_INODE_BTREE_IO_ERR and BTRFS_INODE_BTREE_LOG_IO_ERR
inode flags also makes us achieve the goal of AS_EIO/AS_ENOSPC when
writepages() returns success, started writeback for all dirty pages
and before filemap_fdatawait_range() is called, the writeback for
all dirty pages had already finished with errors - because we were
not using AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return
success, as it could not know that writeback errors happened (the
pages were no longer tagged for writeback).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/btrfs_inode.h |  2 ++
 fs/btrfs/extent_io.c   | 69 +++++++++++++++++++++++++++++++++++++++++++++++---
 fs/btrfs/transaction.c | 20 ++++++++++++---
 fs/btrfs/transaction.h |  3 +--
 fs/btrfs/tree-log.c    | 13 ++++++----
 5 files changed, 93 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 3511031..dbe37dc 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -44,6 +44,8 @@
 #define BTRFS_INODE_IN_DELALLOC_LIST		9
 #define BTRFS_INODE_READDIO_NEED_LOCK		10
 #define BTRFS_INODE_HAS_PROPS		        11
+#define BTRFS_INODE_BTREE_IO_ERR		12
+#define BTRFS_INODE_BTREE_LOG_IO_ERR		13
 
 /* in memory btrfs inode */
 struct btrfs_inode {
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 91f866c..33b113b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -20,6 +20,7 @@
 #include "locking.h"
 #include "rcu-string.h"
 #include "backref.h"
+#include "transaction.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -3606,6 +3607,68 @@ static void end_extent_buffer_writeback(struct extent_buffer *eb)
 	wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
 }
 
+static void set_btree_ioerr(struct page *page, int err)
+{
+	struct extent_buffer *eb = (struct extent_buffer *)page->private;
+	const u64 start = eb->start;
+	const u64 end = eb->start + eb->len - 1;
+	struct btrfs_fs_info *fs_info = eb->fs_info;
+	int ret;
+
+	set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	SetPageError(page);
+
+	/*
+	 * If writeback for a btree extent that doesn't belong to a log tree
+	 * failed, set the bit BTRFS_INODE_BTREE_IO_ERR in the inode btree.
+	 * We do this because while the transaction is running and before it's
+	 * committing (when we call filemap_fdata[write|wait]_range against
+	 * the btree inode), we might have
+	 * btree_inode->i_mapping->a_ops->writepages() called by the VM - if it
+	 * returns an error or an error happens during writeback, when we're
+	 * committing the transaction we wouldn't know about it, since the pages
+	 * can be no longer dirty nor marked anymore for writeback (if a
+	 * subsequent modification to the extent buffer didn't happen before the
+	 * transaction commit), which makes filemap_fdata[write|wait]_range not
+	 * able to find the pages tagged with SetPageError at transaction
+	 * commit time. So if this happens we must abort the transaction,
+	 * otherwise we commit a super block with btree roots that point to
+	 * btree nodes/leafs whose content on disk is invalid - either garbage
+	 * or the content of some node/leaf from a past generation that got
+	 * cowed or deleted and is no longer valid.
+	 *
+	 * Note: setting AS_EIO/AS_ENOSPC in the btree inode's i_mapping would
+	 * not be enough - we need to distinguish between log tree extents vs
+	 * non-log tree extents, and the next filemap_fdatawait_range() call
+	 * will catch and clear such errors in the mapping - and that call might
+	 * be from a log sync and not from a transaction commit. Also, checking
+	 * for the eb flag EXTENT_BUFFER_IOERR at transaction commit time isn't
+	 * done and would not be completely reliable, as the eb might be removed
+	 * from memory and read back when trying to get it, which clears that
+	 * flag right before reading the eb's pages from disk, making us not
+	 * know about the previous write error.
+	 *
+	 * Using the BTRFS_INODE_BTREE_IO_ERR and BTRFS_INODE_BTREE_LOG_IO_ERR
+	 * inode flags also makes us achieve the goal of AS_EIO/AS_ENOSPC when
+	 * writepages() returns success, started writeback for all dirty pages
+	 * and before filemap_fdatawait_range() is called, the writeback for
+	 * all dirty pages had already finished with errors - because we were
+	 * not using AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return
+	 * success, as it could not know that writeback errors happened (the
+	 * pages were no longer tagged for writeback).
+	 */
+	ASSERT(fs_info->running_transaction);
+	ret = test_range_bit(&fs_info->running_transaction->dirty_pages,
+			     start, end, EXTENT_NEED_WAIT | EXTENT_DIRTY,
+			     1, NULL);
+	if (ret)
+		set_bit(BTRFS_INODE_BTREE_IO_ERR,
+			&BTRFS_I(fs_info->btree_inode)->runtime_flags);
+	else
+		set_bit(BTRFS_INODE_BTREE_LOG_IO_ERR,
+			&BTRFS_I(fs_info->btree_inode)->runtime_flags);
+}
+
 static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 {
 	struct bio_vec *bvec;
@@ -3620,9 +3683,8 @@ static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 		done = atomic_dec_and_test(&eb->io_pages);
 
 		if (err || test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
 			ClearPageUptodate(page);
-			SetPageError(page);
+			set_btree_ioerr(page, err < 0 ? err : -EIO);
 		}
 
 		end_page_writeback(page);
@@ -3666,8 +3728,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 					 0, epd->bio_flags, bio_flags);
 		epd->bio_flags = bio_flags;
 		if (ret) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
-			SetPageError(p);
+			set_btree_ioerr(p, ret);
 			end_page_writeback(p);
 			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
 				end_extent_buffer_writeback(eb);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 1e272c0..f17829a 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -844,6 +844,7 @@ int btrfs_write_marked_extents(struct btrfs_root *root,
  * on all the pages and clear them from the dirty pages state tree
  */
 int btrfs_wait_marked_extents(struct btrfs_root *root,
+			      struct btrfs_trans_handle *trans,
 			      struct extent_io_tree *dirty_pages, int mark)
 {
 	int err = 0;
@@ -852,6 +853,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 	struct extent_state *cached_state = NULL;
 	u64 start = 0;
 	u64 end;
+	struct inode *btree_inode = root->fs_info->btree_inode;
 
 	while (!find_first_extent_bit(dirty_pages, start, &start, &end,
 				      EXTENT_NEED_WAIT, &cached_state)) {
@@ -865,6 +867,17 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 	}
 	if (err)
 		werr = err;
+
+	if (dirty_pages == &trans->transaction->dirty_pages) {
+		if (test_and_clear_bit(BTRFS_INODE_BTREE_IO_ERR,
+				       &BTRFS_I(btree_inode)->runtime_flags))
+			werr = werr ? werr : -EIO;
+	} else {
+		if (test_and_clear_bit(BTRFS_INODE_BTREE_LOG_IO_ERR,
+				       &BTRFS_I(btree_inode)->runtime_flags))
+			werr = werr ? werr : -EIO;
+	}
+
 	return werr;
 }
 
@@ -874,6 +887,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
  * those extents are on disk for transaction or log commit
  */
 static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
+				struct btrfs_trans_handle *trans,
 				struct extent_io_tree *dirty_pages, int mark)
 {
 	int ret;
@@ -883,7 +897,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
 	blk_start_plug(&plug);
 	ret = btrfs_write_marked_extents(root, dirty_pages, mark);
 	blk_finish_plug(&plug);
-	ret2 = btrfs_wait_marked_extents(root, dirty_pages, mark);
+	ret2 = btrfs_wait_marked_extents(root, trans, dirty_pages, mark);
 
 	if (ret)
 		return ret;
@@ -892,7 +906,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
 	return 0;
 }
 
-int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
+static int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
 				     struct btrfs_root *root)
 {
 	if (!trans || !trans->transaction) {
@@ -900,7 +914,7 @@ int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
 		btree_inode = root->fs_info->btree_inode;
 		return filemap_write_and_wait(btree_inode->i_mapping);
 	}
-	return btrfs_write_and_wait_marked_extents(root,
+	return btrfs_write_and_wait_marked_extents(root, trans,
 					   &trans->transaction->dirty_pages,
 					   EXTENT_DIRTY);
 }
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 7dd558e..78b754a 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -146,8 +146,6 @@ struct btrfs_trans_handle *btrfs_attach_transaction_barrier(
 					struct btrfs_root *root);
 struct btrfs_trans_handle *btrfs_start_ioctl_transaction(struct btrfs_root *root);
 int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid);
-int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
-				     struct btrfs_root *root);
 
 void btrfs_add_dead_root(struct btrfs_root *root);
 int btrfs_defrag_root(struct btrfs_root *root);
@@ -167,6 +165,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
 int btrfs_write_marked_extents(struct btrfs_root *root,
 				struct extent_io_tree *dirty_pages, int mark);
 int btrfs_wait_marked_extents(struct btrfs_root *root,
+			      struct btrfs_trans_handle *trans,
 				struct extent_io_tree *dirty_pages, int mark);
 int btrfs_transaction_blocked(struct btrfs_fs_info *info);
 int btrfs_transaction_in_commit(struct btrfs_fs_info *info);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 2d0fa43..22ffd32 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2583,7 +2583,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 			mutex_unlock(&log_root_tree->log_mutex);
 			goto out;
 		}
-		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
+		btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
+					  mark);
 		btrfs_free_logged_extents(log, log_transid);
 		mutex_unlock(&log_root_tree->log_mutex);
 		ret = -EAGAIN;
@@ -2599,7 +2600,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	index2 = root_log_ctx.log_transid % 2;
 	if (atomic_read(&log_root_tree->log_commit[index2])) {
 		blk_finish_plug(&plug);
-		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
+		btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
+					  mark);
 		wait_log_commit(trans, log_root_tree,
 				root_log_ctx.log_transid);
 		btrfs_free_logged_extents(log, log_transid);
@@ -2623,7 +2625,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	 */
 	if (btrfs_need_log_full_commit(root->fs_info, trans)) {
 		blk_finish_plug(&plug);
-		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
+		btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
+					  mark);
 		btrfs_free_logged_extents(log, log_transid);
 		mutex_unlock(&log_root_tree->log_mutex);
 		ret = -EAGAIN;
@@ -2641,8 +2644,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 		mutex_unlock(&log_root_tree->log_mutex);
 		goto out_wake_log_root;
 	}
-	btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
-	btrfs_wait_marked_extents(log_root_tree,
+	btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages, mark);
+	btrfs_wait_marked_extents(log_root_tree, trans,
 				  &log_root_tree->dirty_log_pages,
 				  EXTENT_NEW | EXTENT_DIRTY);
 	btrfs_wait_logged_extents(log, log_transid);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] Btrfs: add missing end_page_writeback on submit_extent_page failure
  2014-09-22 16:41 [PATCH 1/2] Btrfs: add missing end_page_writeback on submit_extent_page failure Filipe Manana
  2014-09-22 16:41 ` [PATCH 2/2] Btrfs: be aware of btree inode write errors to avoid fs corruption Filipe Manana
@ 2014-09-23 10:14 ` Liu Bo
  2014-09-23 13:03   ` Filipe David Manana
  1 sibling, 1 reply; 21+ messages in thread
From: Liu Bo @ 2014-09-23 10:14 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On Mon, Sep 22, 2014 at 05:41:04PM +0100, Filipe Manana wrote:
> If submit_extent_page() fails in write_one_eb(), we end up with the current
> page not marked dirty anymore, unlocked and marked for writeback. But we never
> end up calling end_page_writeback() against the page, which will make calls to
> filemap_fdatawait_range (e.g. at transaction commit time) hang forever waiting
> for the writeback bit to be cleared from the page.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/extent_io.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 3af4966..91f866c 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3668,6 +3668,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
>  		if (ret) {
>  			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
>  			SetPageError(p);
> +			end_page_writeback(p);

This is not always true, and it depends on the place where we get error in
submit_extent_page(), whether it has built and submitted a bio, if it's true,
case can be different as bio_endio will be called, and calling end_page_writeback()
again will end up with panic.

thanks,
-liubo
 
>  			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
>  				end_extent_buffer_writeback(eb);
>  			ret = -EIO;
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] Btrfs: add missing end_page_writeback on submit_extent_page failure
  2014-09-23 10:14 ` [PATCH 1/2] Btrfs: add missing end_page_writeback on submit_extent_page failure Liu Bo
@ 2014-09-23 13:03   ` Filipe David Manana
  2014-09-23 13:39     ` Liu Bo
  0 siblings, 1 reply; 21+ messages in thread
From: Filipe David Manana @ 2014-09-23 13:03 UTC (permalink / raw)
  To: Liu Bo; +Cc: Filipe Manana, linux-btrfs

On Tue, Sep 23, 2014 at 11:14 AM, Liu Bo <bo.li.liu@oracle.com> wrote:
> On Mon, Sep 22, 2014 at 05:41:04PM +0100, Filipe Manana wrote:
>> If submit_extent_page() fails in write_one_eb(), we end up with the current
>> page not marked dirty anymore, unlocked and marked for writeback. But we never
>> end up calling end_page_writeback() against the page, which will make calls to
>> filemap_fdatawait_range (e.g. at transaction commit time) hang forever waiting
>> for the writeback bit to be cleared from the page.
>>
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> ---
>>  fs/btrfs/extent_io.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 3af4966..91f866c 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -3668,6 +3668,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
>>               if (ret) {
>>                       set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
>>                       SetPageError(p);
>> +                     end_page_writeback(p);
>
> This is not always true, and it depends on the place where we get error in
> submit_extent_page(), whether it has built and submitted a bio, if it's true,
> case can be different as bio_endio will be called, and calling end_page_writeback()
> again will end up with panic.

No, it's always true when the caller is write_one_eb(). If
submit_extent_page() returns an error, we're sure that the page wasn't
added to a bio, because bio_ret is always not NULL (but *bio_ret can
be). Also, if we call submit_one_bio() in submit_extent_page() (first
if statement) , we know that the bio doesn't contain the page.

thanks

>
> thanks,
> -liubo
>
>>                       if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
>>                               end_extent_buffer_writeback(eb);
>>                       ret = -EIO;
>> --
>> 1.9.1
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] Btrfs: add missing end_page_writeback on submit_extent_page failure
  2014-09-23 13:03   ` Filipe David Manana
@ 2014-09-23 13:39     ` Liu Bo
  0 siblings, 0 replies; 21+ messages in thread
From: Liu Bo @ 2014-09-23 13:39 UTC (permalink / raw)
  To: Filipe David Manana; +Cc: Filipe Manana, linux-btrfs

On Tue, Sep 23, 2014 at 02:03:07PM +0100, Filipe David Manana wrote:
> On Tue, Sep 23, 2014 at 11:14 AM, Liu Bo <bo.li.liu@oracle.com> wrote:
> > On Mon, Sep 22, 2014 at 05:41:04PM +0100, Filipe Manana wrote:
> >> If submit_extent_page() fails in write_one_eb(), we end up with the current
> >> page not marked dirty anymore, unlocked and marked for writeback. But we never
> >> end up calling end_page_writeback() against the page, which will make calls to
> >> filemap_fdatawait_range (e.g. at transaction commit time) hang forever waiting
> >> for the writeback bit to be cleared from the page.
> >>
> >> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> >> ---
> >>  fs/btrfs/extent_io.c | 1 +
> >>  1 file changed, 1 insertion(+)
> >>
> >> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> >> index 3af4966..91f866c 100644
> >> --- a/fs/btrfs/extent_io.c
> >> +++ b/fs/btrfs/extent_io.c
> >> @@ -3668,6 +3668,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
> >>               if (ret) {
> >>                       set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
> >>                       SetPageError(p);
> >> +                     end_page_writeback(p);
> >
> > This is not always true, and it depends on the place where we get error in
> > submit_extent_page(), whether it has built and submitted a bio, if it's true,
> > case can be different as bio_endio will be called, and calling end_page_writeback()
> > again will end up with panic.
> 
> No, it's always true when the caller is write_one_eb(). If
> submit_extent_page() returns an error, we're sure that the page wasn't
> added to a bio, because bio_ret is always not NULL (but *bio_ret can
> be). Also, if we call submit_one_bio() in submit_extent_page() (first
> if statement) , we know that the bio doesn't contain the page.

I see, you're right.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

thanks,
-liubo

> 
> thanks
> 
> >
> > thanks,
> > -liubo
> >
> >>                       if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
> >>                               end_extent_buffer_writeback(eb);
> >>                       ret = -EIO;
> >> --
> >> 1.9.1
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> -- 
> Filipe David Manana,
> 
> "Reasonable men adapt themselves to the world.
>  Unreasonable men adapt the world to themselves.
>  That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] Btrfs: be aware of btree inode write errors to avoid fs corruption
  2014-09-22 16:41 ` [PATCH 2/2] Btrfs: be aware of btree inode write errors to avoid fs corruption Filipe Manana
@ 2014-09-23 14:05   ` Liu Bo
  2014-09-23 14:27     ` Filipe David Manana
  2014-09-24  0:20   ` [PATCH 2/2 v2] " Filipe Manana
                     ` (10 subsequent siblings)
  11 siblings, 1 reply; 21+ messages in thread
From: Liu Bo @ 2014-09-23 14:05 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On Mon, Sep 22, 2014 at 05:41:05PM +0100, Filipe Manana wrote:
> While we have a transaction ongoing, the VM might decide at any time
> to call btree_inode->i_mapping->a_ops->writepages(), which will start
> writeback of dirty pages belonging to btree nodes/leafs. This call
> might return an error or the writeback might finish with an error
> before we attempt to commit the running transaction. If this happens,
> we might have no way of knowing that such error happened when we are
> committing the transaction - because the pages might no longer be
> marked dirty nor tagged for writeback (if a subsequent modification
> to the extent buffer didn't happen before the transaction commit) which
> makes filemap_fdata[write|wait]_range unable to find such pages (even
> if they're marked with SetPageError).
> So if this happens we must abort the transaction, otherwise we commit
> a super block with btree roots that point to btree nodes/leafs whose
> content on disk is invalid - either garbage or the content of some
> node/leaf from a past generation that got cowed or deleted and is no
> longer valid (for this later case we end up getting error messages like
> "parent transid verify failed on 10826481664 wanted 25748 found 29562"
> when reading btree nodes/leafs from disk).

Good catch!

> 
> Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
> i_mapping would not be enough because we need to distinguish between
> log tree extents (not fatal) vs non-log tree extents (fatal) and
> because the next call to filemap_fdatawait_range() will catch and clear
> such errors in the mapping - and that call might be from a log sync and
> not from a transaction commit, which means we would not know about the
> error at transaction commit time. Also, checking for the eb flag
> EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
> not be completely reliable, as the eb might be removed from memory and
> read back when trying to get it, which clears that flag right before
> reading the eb's pages from disk, making us not know about the previous
> write error.
> 
> Using the BTRFS_INODE_BTREE_IO_ERR and BTRFS_INODE_BTREE_LOG_IO_ERR
> inode flags also makes us achieve the goal of AS_EIO/AS_ENOSPC when
> writepages() returns success, started writeback for all dirty pages
> and before filemap_fdatawait_range() is called, the writeback for
> all dirty pages had already finished with errors - because we were
> not using AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return
> success, as it could not know that writeback errors happened (the
> pages were no longer tagged for writeback).

But there is a problem when the extent buffer with IO error is COWed and
deleted, but the BTRFS_INODE_BTREE_IO_ERR flag still remains in
btree_inode's runtime_flags, we'd still run into cleanup_transaction() as this
patch expects.  So I think we need to remove that BTREE_IO_ERR flag in this
case.

Well, I notice that you don't clear rest pages' DIRTY in the error case, so it
can lead to hitting the BUG_ON in btrfs_release_extent_buffer_page() when the eb
with IO error is COWed by a new good eb and gets itself freed then.  I'm making
a patch to add missing clear_page_dirty_for_io().

thanks,
-liubo

> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/btrfs_inode.h |  2 ++
>  fs/btrfs/extent_io.c   | 69 +++++++++++++++++++++++++++++++++++++++++++++++---
>  fs/btrfs/transaction.c | 20 ++++++++++++---
>  fs/btrfs/transaction.h |  3 +--
>  fs/btrfs/tree-log.c    | 13 ++++++----
>  5 files changed, 93 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 3511031..dbe37dc 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -44,6 +44,8 @@
>  #define BTRFS_INODE_IN_DELALLOC_LIST		9
>  #define BTRFS_INODE_READDIO_NEED_LOCK		10
>  #define BTRFS_INODE_HAS_PROPS		        11
> +#define BTRFS_INODE_BTREE_IO_ERR		12
> +#define BTRFS_INODE_BTREE_LOG_IO_ERR		13
>  
>  /* in memory btrfs inode */
>  struct btrfs_inode {
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 91f866c..33b113b 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -20,6 +20,7 @@
>  #include "locking.h"
>  #include "rcu-string.h"
>  #include "backref.h"
> +#include "transaction.h"
>  
>  static struct kmem_cache *extent_state_cache;
>  static struct kmem_cache *extent_buffer_cache;
> @@ -3606,6 +3607,68 @@ static void end_extent_buffer_writeback(struct extent_buffer *eb)
>  	wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
>  }
>  
> +static void set_btree_ioerr(struct page *page, int err)
> +{
> +	struct extent_buffer *eb = (struct extent_buffer *)page->private;
> +	const u64 start = eb->start;
> +	const u64 end = eb->start + eb->len - 1;
> +	struct btrfs_fs_info *fs_info = eb->fs_info;
> +	int ret;
> +
> +	set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
> +	SetPageError(page);
> +
> +	/*
> +	 * If writeback for a btree extent that doesn't belong to a log tree
> +	 * failed, set the bit BTRFS_INODE_BTREE_IO_ERR in the inode btree.
> +	 * We do this because while the transaction is running and before it's
> +	 * committing (when we call filemap_fdata[write|wait]_range against
> +	 * the btree inode), we might have
> +	 * btree_inode->i_mapping->a_ops->writepages() called by the VM - if it
> +	 * returns an error or an error happens during writeback, when we're
> +	 * committing the transaction we wouldn't know about it, since the pages
> +	 * can be no longer dirty nor marked anymore for writeback (if a
> +	 * subsequent modification to the extent buffer didn't happen before the
> +	 * transaction commit), which makes filemap_fdata[write|wait]_range not
> +	 * able to find the pages tagged with SetPageError at transaction
> +	 * commit time. So if this happens we must abort the transaction,
> +	 * otherwise we commit a super block with btree roots that point to
> +	 * btree nodes/leafs whose content on disk is invalid - either garbage
> +	 * or the content of some node/leaf from a past generation that got
> +	 * cowed or deleted and is no longer valid.
> +	 *
> +	 * Note: setting AS_EIO/AS_ENOSPC in the btree inode's i_mapping would
> +	 * not be enough - we need to distinguish between log tree extents vs
> +	 * non-log tree extents, and the next filemap_fdatawait_range() call
> +	 * will catch and clear such errors in the mapping - and that call might
> +	 * be from a log sync and not from a transaction commit. Also, checking
> +	 * for the eb flag EXTENT_BUFFER_IOERR at transaction commit time isn't
> +	 * done and would not be completely reliable, as the eb might be removed
> +	 * from memory and read back when trying to get it, which clears that
> +	 * flag right before reading the eb's pages from disk, making us not
> +	 * know about the previous write error.
> +	 *
> +	 * Using the BTRFS_INODE_BTREE_IO_ERR and BTRFS_INODE_BTREE_LOG_IO_ERR
> +	 * inode flags also makes us achieve the goal of AS_EIO/AS_ENOSPC when
> +	 * writepages() returns success, started writeback for all dirty pages
> +	 * and before filemap_fdatawait_range() is called, the writeback for
> +	 * all dirty pages had already finished with errors - because we were
> +	 * not using AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return
> +	 * success, as it could not know that writeback errors happened (the
> +	 * pages were no longer tagged for writeback).
> +	 */
> +	ASSERT(fs_info->running_transaction);
> +	ret = test_range_bit(&fs_info->running_transaction->dirty_pages,
> +			     start, end, EXTENT_NEED_WAIT | EXTENT_DIRTY,
> +			     1, NULL);
> +	if (ret)
> +		set_bit(BTRFS_INODE_BTREE_IO_ERR,
> +			&BTRFS_I(fs_info->btree_inode)->runtime_flags);
> +	else
> +		set_bit(BTRFS_INODE_BTREE_LOG_IO_ERR,
> +			&BTRFS_I(fs_info->btree_inode)->runtime_flags);
> +}
> +
>  static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
>  {
>  	struct bio_vec *bvec;
> @@ -3620,9 +3683,8 @@ static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
>  		done = atomic_dec_and_test(&eb->io_pages);
>  
>  		if (err || test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
> -			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
>  			ClearPageUptodate(page);
> -			SetPageError(page);
> +			set_btree_ioerr(page, err < 0 ? err : -EIO);
>  		}
>  
>  		end_page_writeback(page);
> @@ -3666,8 +3728,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
>  					 0, epd->bio_flags, bio_flags);
>  		epd->bio_flags = bio_flags;
>  		if (ret) {
> -			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
> -			SetPageError(p);
> +			set_btree_ioerr(p, ret);
>  			end_page_writeback(p);
>  			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
>  				end_extent_buffer_writeback(eb);
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 1e272c0..f17829a 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -844,6 +844,7 @@ int btrfs_write_marked_extents(struct btrfs_root *root,
>   * on all the pages and clear them from the dirty pages state tree
>   */
>  int btrfs_wait_marked_extents(struct btrfs_root *root,
> +			      struct btrfs_trans_handle *trans,
>  			      struct extent_io_tree *dirty_pages, int mark)
>  {
>  	int err = 0;
> @@ -852,6 +853,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
>  	struct extent_state *cached_state = NULL;
>  	u64 start = 0;
>  	u64 end;
> +	struct inode *btree_inode = root->fs_info->btree_inode;
>  
>  	while (!find_first_extent_bit(dirty_pages, start, &start, &end,
>  				      EXTENT_NEED_WAIT, &cached_state)) {
> @@ -865,6 +867,17 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
>  	}
>  	if (err)
>  		werr = err;
> +
> +	if (dirty_pages == &trans->transaction->dirty_pages) {
> +		if (test_and_clear_bit(BTRFS_INODE_BTREE_IO_ERR,
> +				       &BTRFS_I(btree_inode)->runtime_flags))
> +			werr = werr ? werr : -EIO;
> +	} else {
> +		if (test_and_clear_bit(BTRFS_INODE_BTREE_LOG_IO_ERR,
> +				       &BTRFS_I(btree_inode)->runtime_flags))
> +			werr = werr ? werr : -EIO;
> +	}
> +
>  	return werr;
>  }
>  
> @@ -874,6 +887,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
>   * those extents are on disk for transaction or log commit
>   */
>  static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
> +				struct btrfs_trans_handle *trans,
>  				struct extent_io_tree *dirty_pages, int mark)
>  {
>  	int ret;
> @@ -883,7 +897,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
>  	blk_start_plug(&plug);
>  	ret = btrfs_write_marked_extents(root, dirty_pages, mark);
>  	blk_finish_plug(&plug);
> -	ret2 = btrfs_wait_marked_extents(root, dirty_pages, mark);
> +	ret2 = btrfs_wait_marked_extents(root, trans, dirty_pages, mark);
>  
>  	if (ret)
>  		return ret;
> @@ -892,7 +906,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
>  	return 0;
>  }
>  
> -int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
> +static int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
>  				     struct btrfs_root *root)
>  {
>  	if (!trans || !trans->transaction) {
> @@ -900,7 +914,7 @@ int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
>  		btree_inode = root->fs_info->btree_inode;
>  		return filemap_write_and_wait(btree_inode->i_mapping);
>  	}
> -	return btrfs_write_and_wait_marked_extents(root,
> +	return btrfs_write_and_wait_marked_extents(root, trans,
>  					   &trans->transaction->dirty_pages,
>  					   EXTENT_DIRTY);
>  }
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 7dd558e..78b754a 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -146,8 +146,6 @@ struct btrfs_trans_handle *btrfs_attach_transaction_barrier(
>  					struct btrfs_root *root);
>  struct btrfs_trans_handle *btrfs_start_ioctl_transaction(struct btrfs_root *root);
>  int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid);
> -int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
> -				     struct btrfs_root *root);
>  
>  void btrfs_add_dead_root(struct btrfs_root *root);
>  int btrfs_defrag_root(struct btrfs_root *root);
> @@ -167,6 +165,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
>  int btrfs_write_marked_extents(struct btrfs_root *root,
>  				struct extent_io_tree *dirty_pages, int mark);
>  int btrfs_wait_marked_extents(struct btrfs_root *root,
> +			      struct btrfs_trans_handle *trans,
>  				struct extent_io_tree *dirty_pages, int mark);
>  int btrfs_transaction_blocked(struct btrfs_fs_info *info);
>  int btrfs_transaction_in_commit(struct btrfs_fs_info *info);
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 2d0fa43..22ffd32 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -2583,7 +2583,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  			mutex_unlock(&log_root_tree->log_mutex);
>  			goto out;
>  		}
> -		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
> +		btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
> +					  mark);
>  		btrfs_free_logged_extents(log, log_transid);
>  		mutex_unlock(&log_root_tree->log_mutex);
>  		ret = -EAGAIN;
> @@ -2599,7 +2600,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  	index2 = root_log_ctx.log_transid % 2;
>  	if (atomic_read(&log_root_tree->log_commit[index2])) {
>  		blk_finish_plug(&plug);
> -		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
> +		btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
> +					  mark);
>  		wait_log_commit(trans, log_root_tree,
>  				root_log_ctx.log_transid);
>  		btrfs_free_logged_extents(log, log_transid);
> @@ -2623,7 +2625,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  	 */
>  	if (btrfs_need_log_full_commit(root->fs_info, trans)) {
>  		blk_finish_plug(&plug);
> -		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
> +		btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
> +					  mark);
>  		btrfs_free_logged_extents(log, log_transid);
>  		mutex_unlock(&log_root_tree->log_mutex);
>  		ret = -EAGAIN;
> @@ -2641,8 +2644,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  		mutex_unlock(&log_root_tree->log_mutex);
>  		goto out_wake_log_root;
>  	}
> -	btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
> -	btrfs_wait_marked_extents(log_root_tree,
> +	btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages, mark);
> +	btrfs_wait_marked_extents(log_root_tree, trans,
>  				  &log_root_tree->dirty_log_pages,
>  				  EXTENT_NEW | EXTENT_DIRTY);
>  	btrfs_wait_logged_extents(log, log_transid);
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] Btrfs: be aware of btree inode write errors to avoid fs corruption
  2014-09-23 14:05   ` Liu Bo
@ 2014-09-23 14:27     ` Filipe David Manana
  0 siblings, 0 replies; 21+ messages in thread
From: Filipe David Manana @ 2014-09-23 14:27 UTC (permalink / raw)
  To: Liu Bo; +Cc: Filipe Manana, linux-btrfs

On Tue, Sep 23, 2014 at 3:05 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
> On Mon, Sep 22, 2014 at 05:41:05PM +0100, Filipe Manana wrote:
>> While we have a transaction ongoing, the VM might decide at any time
>> to call btree_inode->i_mapping->a_ops->writepages(), which will start
>> writeback of dirty pages belonging to btree nodes/leafs. This call
>> might return an error or the writeback might finish with an error
>> before we attempt to commit the running transaction. If this happens,
>> we might have no way of knowing that such error happened when we are
>> committing the transaction - because the pages might no longer be
>> marked dirty nor tagged for writeback (if a subsequent modification
>> to the extent buffer didn't happen before the transaction commit) which
>> makes filemap_fdata[write|wait]_range unable to find such pages (even
>> if they're marked with SetPageError).
>> So if this happens we must abort the transaction, otherwise we commit
>> a super block with btree roots that point to btree nodes/leafs whose
>> content on disk is invalid - either garbage or the content of some
>> node/leaf from a past generation that got cowed or deleted and is no
>> longer valid (for this later case we end up getting error messages like
>> "parent transid verify failed on 10826481664 wanted 25748 found 29562"
>> when reading btree nodes/leafs from disk).
>
> Good catch!
>
>>
>> Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
>> i_mapping would not be enough because we need to distinguish between
>> log tree extents (not fatal) vs non-log tree extents (fatal) and
>> because the next call to filemap_fdatawait_range() will catch and clear
>> such errors in the mapping - and that call might be from a log sync and
>> not from a transaction commit, which means we would not know about the
>> error at transaction commit time. Also, checking for the eb flag
>> EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
>> not be completely reliable, as the eb might be removed from memory and
>> read back when trying to get it, which clears that flag right before
>> reading the eb's pages from disk, making us not know about the previous
>> write error.
>>
>> Using the BTRFS_INODE_BTREE_IO_ERR and BTRFS_INODE_BTREE_LOG_IO_ERR
>> inode flags also makes us achieve the goal of AS_EIO/AS_ENOSPC when
>> writepages() returns success, started writeback for all dirty pages
>> and before filemap_fdatawait_range() is called, the writeback for
>> all dirty pages had already finished with errors - because we were
>> not using AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return
>> success, as it could not know that writeback errors happened (the
>> pages were no longer tagged for writeback).
>
> But there is a problem when the extent buffer with IO error is COWed and
> deleted, but the BTRFS_INODE_BTREE_IO_ERR flag still remains in
> btree_inode's runtime_flags, we'd still run into cleanup_transaction() as this
> patch expects.  So I think we need to remove that BTREE_IO_ERR flag in this
> case.

Thought about that, no good reason to not do it (yet at least). I'll
think a bit about it and send a v2 if I can't find a reason.

>
> Well, I notice that you don't clear rest pages' DIRTY in the error case,

Where exactly?

> so it
> can lead to hitting the BUG_ON in btrfs_release_extent_buffer_page() when the eb
> with IO error is COWed by a new good eb and gets itself freed then.  I'm making
> a patch to add missing clear_page_dirty_for_io().

clear_page_dirty_for_io() is already called by write_one_eb(), and
immediately after it tags the page(s) for writeback.
If error happens during writeback, end_bio_extent_buffer_writepage()
removes the writeback tag (from page and eb's flags).
So either I'm missing something, or you didn't notice all this.

thanks

>
> thanks,
> -liubo
>
>>
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> ---
>>  fs/btrfs/btrfs_inode.h |  2 ++
>>  fs/btrfs/extent_io.c   | 69 +++++++++++++++++++++++++++++++++++++++++++++++---
>>  fs/btrfs/transaction.c | 20 ++++++++++++---
>>  fs/btrfs/transaction.h |  3 +--
>>  fs/btrfs/tree-log.c    | 13 ++++++----
>>  5 files changed, 93 insertions(+), 14 deletions(-)
>>
>> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
>> index 3511031..dbe37dc 100644
>> --- a/fs/btrfs/btrfs_inode.h
>> +++ b/fs/btrfs/btrfs_inode.h
>> @@ -44,6 +44,8 @@
>>  #define BTRFS_INODE_IN_DELALLOC_LIST         9
>>  #define BTRFS_INODE_READDIO_NEED_LOCK                10
>>  #define BTRFS_INODE_HAS_PROPS                        11
>> +#define BTRFS_INODE_BTREE_IO_ERR             12
>> +#define BTRFS_INODE_BTREE_LOG_IO_ERR         13
>>
>>  /* in memory btrfs inode */
>>  struct btrfs_inode {
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 91f866c..33b113b 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -20,6 +20,7 @@
>>  #include "locking.h"
>>  #include "rcu-string.h"
>>  #include "backref.h"
>> +#include "transaction.h"
>>
>>  static struct kmem_cache *extent_state_cache;
>>  static struct kmem_cache *extent_buffer_cache;
>> @@ -3606,6 +3607,68 @@ static void end_extent_buffer_writeback(struct extent_buffer *eb)
>>       wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
>>  }
>>
>> +static void set_btree_ioerr(struct page *page, int err)
>> +{
>> +     struct extent_buffer *eb = (struct extent_buffer *)page->private;
>> +     const u64 start = eb->start;
>> +     const u64 end = eb->start + eb->len - 1;
>> +     struct btrfs_fs_info *fs_info = eb->fs_info;
>> +     int ret;
>> +
>> +     set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
>> +     SetPageError(page);
>> +
>> +     /*
>> +      * If writeback for a btree extent that doesn't belong to a log tree
>> +      * failed, set the bit BTRFS_INODE_BTREE_IO_ERR in the inode btree.
>> +      * We do this because while the transaction is running and before it's
>> +      * committing (when we call filemap_fdata[write|wait]_range against
>> +      * the btree inode), we might have
>> +      * btree_inode->i_mapping->a_ops->writepages() called by the VM - if it
>> +      * returns an error or an error happens during writeback, when we're
>> +      * committing the transaction we wouldn't know about it, since the pages
>> +      * can be no longer dirty nor marked anymore for writeback (if a
>> +      * subsequent modification to the extent buffer didn't happen before the
>> +      * transaction commit), which makes filemap_fdata[write|wait]_range not
>> +      * able to find the pages tagged with SetPageError at transaction
>> +      * commit time. So if this happens we must abort the transaction,
>> +      * otherwise we commit a super block with btree roots that point to
>> +      * btree nodes/leafs whose content on disk is invalid - either garbage
>> +      * or the content of some node/leaf from a past generation that got
>> +      * cowed or deleted and is no longer valid.
>> +      *
>> +      * Note: setting AS_EIO/AS_ENOSPC in the btree inode's i_mapping would
>> +      * not be enough - we need to distinguish between log tree extents vs
>> +      * non-log tree extents, and the next filemap_fdatawait_range() call
>> +      * will catch and clear such errors in the mapping - and that call might
>> +      * be from a log sync and not from a transaction commit. Also, checking
>> +      * for the eb flag EXTENT_BUFFER_IOERR at transaction commit time isn't
>> +      * done and would not be completely reliable, as the eb might be removed
>> +      * from memory and read back when trying to get it, which clears that
>> +      * flag right before reading the eb's pages from disk, making us not
>> +      * know about the previous write error.
>> +      *
>> +      * Using the BTRFS_INODE_BTREE_IO_ERR and BTRFS_INODE_BTREE_LOG_IO_ERR
>> +      * inode flags also makes us achieve the goal of AS_EIO/AS_ENOSPC when
>> +      * writepages() returns success, started writeback for all dirty pages
>> +      * and before filemap_fdatawait_range() is called, the writeback for
>> +      * all dirty pages had already finished with errors - because we were
>> +      * not using AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return
>> +      * success, as it could not know that writeback errors happened (the
>> +      * pages were no longer tagged for writeback).
>> +      */
>> +     ASSERT(fs_info->running_transaction);
>> +     ret = test_range_bit(&fs_info->running_transaction->dirty_pages,
>> +                          start, end, EXTENT_NEED_WAIT | EXTENT_DIRTY,
>> +                          1, NULL);
>> +     if (ret)
>> +             set_bit(BTRFS_INODE_BTREE_IO_ERR,
>> +                     &BTRFS_I(fs_info->btree_inode)->runtime_flags);
>> +     else
>> +             set_bit(BTRFS_INODE_BTREE_LOG_IO_ERR,
>> +                     &BTRFS_I(fs_info->btree_inode)->runtime_flags);
>> +}
>> +
>>  static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
>>  {
>>       struct bio_vec *bvec;
>> @@ -3620,9 +3683,8 @@ static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
>>               done = atomic_dec_and_test(&eb->io_pages);
>>
>>               if (err || test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
>> -                     set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
>>                       ClearPageUptodate(page);
>> -                     SetPageError(page);
>> +                     set_btree_ioerr(page, err < 0 ? err : -EIO);
>>               }
>>
>>               end_page_writeback(page);
>> @@ -3666,8 +3728,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
>>                                        0, epd->bio_flags, bio_flags);
>>               epd->bio_flags = bio_flags;
>>               if (ret) {
>> -                     set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
>> -                     SetPageError(p);
>> +                     set_btree_ioerr(p, ret);
>>                       end_page_writeback(p);
>>                       if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
>>                               end_extent_buffer_writeback(eb);
>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>> index 1e272c0..f17829a 100644
>> --- a/fs/btrfs/transaction.c
>> +++ b/fs/btrfs/transaction.c
>> @@ -844,6 +844,7 @@ int btrfs_write_marked_extents(struct btrfs_root *root,
>>   * on all the pages and clear them from the dirty pages state tree
>>   */
>>  int btrfs_wait_marked_extents(struct btrfs_root *root,
>> +                           struct btrfs_trans_handle *trans,
>>                             struct extent_io_tree *dirty_pages, int mark)
>>  {
>>       int err = 0;
>> @@ -852,6 +853,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
>>       struct extent_state *cached_state = NULL;
>>       u64 start = 0;
>>       u64 end;
>> +     struct inode *btree_inode = root->fs_info->btree_inode;
>>
>>       while (!find_first_extent_bit(dirty_pages, start, &start, &end,
>>                                     EXTENT_NEED_WAIT, &cached_state)) {
>> @@ -865,6 +867,17 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
>>       }
>>       if (err)
>>               werr = err;
>> +
>> +     if (dirty_pages == &trans->transaction->dirty_pages) {
>> +             if (test_and_clear_bit(BTRFS_INODE_BTREE_IO_ERR,
>> +                                    &BTRFS_I(btree_inode)->runtime_flags))
>> +                     werr = werr ? werr : -EIO;
>> +     } else {
>> +             if (test_and_clear_bit(BTRFS_INODE_BTREE_LOG_IO_ERR,
>> +                                    &BTRFS_I(btree_inode)->runtime_flags))
>> +                     werr = werr ? werr : -EIO;
>> +     }
>> +
>>       return werr;
>>  }
>>
>> @@ -874,6 +887,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
>>   * those extents are on disk for transaction or log commit
>>   */
>>  static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
>> +                             struct btrfs_trans_handle *trans,
>>                               struct extent_io_tree *dirty_pages, int mark)
>>  {
>>       int ret;
>> @@ -883,7 +897,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
>>       blk_start_plug(&plug);
>>       ret = btrfs_write_marked_extents(root, dirty_pages, mark);
>>       blk_finish_plug(&plug);
>> -     ret2 = btrfs_wait_marked_extents(root, dirty_pages, mark);
>> +     ret2 = btrfs_wait_marked_extents(root, trans, dirty_pages, mark);
>>
>>       if (ret)
>>               return ret;
>> @@ -892,7 +906,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
>>       return 0;
>>  }
>>
>> -int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
>> +static int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
>>                                    struct btrfs_root *root)
>>  {
>>       if (!trans || !trans->transaction) {
>> @@ -900,7 +914,7 @@ int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
>>               btree_inode = root->fs_info->btree_inode;
>>               return filemap_write_and_wait(btree_inode->i_mapping);
>>       }
>> -     return btrfs_write_and_wait_marked_extents(root,
>> +     return btrfs_write_and_wait_marked_extents(root, trans,
>>                                          &trans->transaction->dirty_pages,
>>                                          EXTENT_DIRTY);
>>  }
>> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
>> index 7dd558e..78b754a 100644
>> --- a/fs/btrfs/transaction.h
>> +++ b/fs/btrfs/transaction.h
>> @@ -146,8 +146,6 @@ struct btrfs_trans_handle *btrfs_attach_transaction_barrier(
>>                                       struct btrfs_root *root);
>>  struct btrfs_trans_handle *btrfs_start_ioctl_transaction(struct btrfs_root *root);
>>  int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid);
>> -int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
>> -                                  struct btrfs_root *root);
>>
>>  void btrfs_add_dead_root(struct btrfs_root *root);
>>  int btrfs_defrag_root(struct btrfs_root *root);
>> @@ -167,6 +165,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
>>  int btrfs_write_marked_extents(struct btrfs_root *root,
>>                               struct extent_io_tree *dirty_pages, int mark);
>>  int btrfs_wait_marked_extents(struct btrfs_root *root,
>> +                           struct btrfs_trans_handle *trans,
>>                               struct extent_io_tree *dirty_pages, int mark);
>>  int btrfs_transaction_blocked(struct btrfs_fs_info *info);
>>  int btrfs_transaction_in_commit(struct btrfs_fs_info *info);
>> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
>> index 2d0fa43..22ffd32 100644
>> --- a/fs/btrfs/tree-log.c
>> +++ b/fs/btrfs/tree-log.c
>> @@ -2583,7 +2583,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>>                       mutex_unlock(&log_root_tree->log_mutex);
>>                       goto out;
>>               }
>> -             btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
>> +             btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
>> +                                       mark);
>>               btrfs_free_logged_extents(log, log_transid);
>>               mutex_unlock(&log_root_tree->log_mutex);
>>               ret = -EAGAIN;
>> @@ -2599,7 +2600,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>>       index2 = root_log_ctx.log_transid % 2;
>>       if (atomic_read(&log_root_tree->log_commit[index2])) {
>>               blk_finish_plug(&plug);
>> -             btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
>> +             btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
>> +                                       mark);
>>               wait_log_commit(trans, log_root_tree,
>>                               root_log_ctx.log_transid);
>>               btrfs_free_logged_extents(log, log_transid);
>> @@ -2623,7 +2625,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>>        */
>>       if (btrfs_need_log_full_commit(root->fs_info, trans)) {
>>               blk_finish_plug(&plug);
>> -             btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
>> +             btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
>> +                                       mark);
>>               btrfs_free_logged_extents(log, log_transid);
>>               mutex_unlock(&log_root_tree->log_mutex);
>>               ret = -EAGAIN;
>> @@ -2641,8 +2644,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>>               mutex_unlock(&log_root_tree->log_mutex);
>>               goto out_wake_log_root;
>>       }
>> -     btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
>> -     btrfs_wait_marked_extents(log_root_tree,
>> +     btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages, mark);
>> +     btrfs_wait_marked_extents(log_root_tree, trans,
>>                                 &log_root_tree->dirty_log_pages,
>>                                 EXTENT_NEW | EXTENT_DIRTY);
>>       btrfs_wait_logged_extents(log, log_transid);
>> --
>> 1.9.1
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 2/2 v2] Btrfs: be aware of btree inode write errors to avoid fs corruption
  2014-09-22 16:41 ` [PATCH 2/2] Btrfs: be aware of btree inode write errors to avoid fs corruption Filipe Manana
  2014-09-23 14:05   ` Liu Bo
@ 2014-09-24  0:20   ` Filipe Manana
  2014-09-24  0:43   ` [PATCH 2/2 v3] " Filipe Manana
                     ` (9 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Filipe Manana @ 2014-09-24  0:20 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).

Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.

Using the new counters eb_write_errors and log_eb_write_errors in the
transaction also makes us achieve the goal of AS_EIO/AS_ENOSPC when
writepages() returns success, started writeback for all dirty pages
and before filemap_fdatawait_range() is called, the writeback for
all dirty pages had already finished with errors - because we were
not using AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return
success, as it could not know that writeback errors happened (the
pages were no longer tagged for writeback).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: If an extent buffer's write failed but it's also deleted from the tree
    before the transaction commits, don't abort the transaction with -EIO,
    since the unwritten node/leaf it represents can't be pointed to by any
    other node in a tree.

 fs/btrfs/disk-io.c     |  4 +--
 fs/btrfs/extent-tree.c |  8 ++++++
 fs/btrfs/extent_io.c   | 76 +++++++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/extent_io.h   |  3 +-
 fs/btrfs/transaction.c | 22 +++++++++++++--
 fs/btrfs/transaction.h |  5 ++--
 fs/btrfs/tree-log.c    | 13 +++++----
 7 files changed, 111 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 23393ec..8b54acf 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -610,7 +610,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 		goto err;
 
 	eb->read_mirror = mirror;
-	if (test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
+	if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) {
 		ret = -EIO;
 		goto err;
 	}
@@ -683,7 +683,7 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
 	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
 
 	eb = (struct extent_buffer *)page->private;
-	set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = failed_mirror;
 	atomic_dec(&eb->io_pages);
 	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ef0845d..608814b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6221,6 +6221,14 @@ out:
 	 * anymore.
 	 */
 	clear_bit(EXTENT_BUFFER_CORRUPT, &buf->bflags);
+	/*
+	 * The unwritten node/leaf (due to an IO error) isn't pointed to by any
+	 * other node in a tree, so it's safe to forget about the write error
+	 * and avoid a transaction abort.
+	 */
+	if (test_and_clear_bit(EXTENT_BUFFER_WRITE_ERR, &buf->bflags))
+		atomic_dec(&trans->transaction->eb_write_errors);
+
 	btrfs_put_block_group(cache);
 }
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 91f866c..e21f200 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -20,6 +20,7 @@
 #include "locking.h"
 #include "rcu-string.h"
 #include "backref.h"
+#include "transaction.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -3606,6 +3607,68 @@ static void end_extent_buffer_writeback(struct extent_buffer *eb)
 	wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
 }
 
+static void set_btree_ioerr(struct page *page)
+{
+	struct extent_buffer *eb = (struct extent_buffer *)page->private;
+	const u64 start = eb->start;
+	const u64 end = eb->start + eb->len - 1;
+	struct btrfs_transaction *cur_trans;
+	int ret;
+
+	SetPageError(page);
+	if (test_and_set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))
+		return;
+
+	/*
+	 * If writeback for a btree extent that doesn't belong to a log tree
+	 * failed, increment the counter transaction->eb_write_errors.
+	 * We do this because while the transaction is running and before it's
+	 * committing (when we call filemap_fdata[write|wait]_range against
+	 * the btree inode), we might have
+	 * btree_inode->i_mapping->a_ops->writepages() called by the VM - if it
+	 * returns an error or an error happens during writeback, when we're
+	 * committing the transaction we wouldn't know about it, since the pages
+	 * can be no longer dirty nor marked anymore for writeback (if a
+	 * subsequent modification to the extent buffer didn't happen before the
+	 * transaction commit), which makes filemap_fdata[write|wait]_range not
+	 * able to find the pages tagged with SetPageError at transaction
+	 * commit time. So if this happens we must abort the transaction,
+	 * otherwise we commit a super block with btree roots that point to
+	 * btree nodes/leafs whose content on disk is invalid - either garbage
+	 * or the content of some node/leaf from a past generation that got
+	 * cowed or deleted and is no longer valid.
+	 *
+	 * Note: setting AS_EIO/AS_ENOSPC in the btree inode's i_mapping would
+	 * not be enough - we need to distinguish between log tree extents vs
+	 * non-log tree extents, and the next filemap_fdatawait_range() call
+	 * will catch and clear such errors in the mapping - and that call might
+	 * be from a log sync and not from a transaction commit. Also, checking
+	 * for the eb flag EXTENT_BUFFER_WRITE_ERR at transaction commit time is
+	 * not done and would not be completely reliable, as the eb might be
+	 * removed from memory and read back when trying to get it, which clears
+	 * that flag right before reading the eb's pages from disk, making us
+	 * not know about the previous write error.
+	 *
+	 * Using transaction counters eb_write_errors and log_eb_write_errors
+	 * also makes us achieve the goal of AS_EIO/AS_ENOSPC when writepages()
+	 * returns success, started writeback for all dirty pages and before
+	 * filemap_fdatawait_range() is called, the writeback for all dirty
+	 * pages had already finished with errors - because we were not using
+	 * AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return success, as
+	 * it could not know that writeback errors happened (the pages were no
+	 * longer tagged for writeback).
+	 */
+	cur_trans = eb->fs_info->running_transaction;
+	ASSERT(cur_trans);
+	ret = test_range_bit(&cur_trans->dirty_pages,
+			     start, end, EXTENT_NEED_WAIT | EXTENT_DIRTY,
+			     1, NULL);
+	if (ret)
+		atomic_inc(&cur_trans->eb_write_errors);
+	else
+		atomic_add_unless(&cur_trans->log_eb_write_errors, 1, 1);
+}
+
 static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 {
 	struct bio_vec *bvec;
@@ -3619,10 +3682,9 @@ static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 		BUG_ON(!eb);
 		done = atomic_dec_and_test(&eb->io_pages);
 
-		if (err || test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+		if (err || test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
 			ClearPageUptodate(page);
-			SetPageError(page);
+			set_btree_ioerr(page);
 		}
 
 		end_page_writeback(page);
@@ -3649,7 +3711,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	int rw = (epd->sync_io ? WRITE_SYNC : WRITE) | REQ_META;
 	int ret = 0;
 
-	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
 	num_pages = num_extent_pages(eb->start, eb->len);
 	atomic_set(&eb->io_pages, num_pages);
 	if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
@@ -3666,8 +3728,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 					 0, epd->bio_flags, bio_flags);
 		epd->bio_flags = bio_flags;
 		if (ret) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
-			SetPageError(p);
+			set_btree_ioerr(p);
 			end_page_writeback(p);
 			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
 				end_extent_buffer_writeback(eb);
@@ -5069,7 +5130,8 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 		goto unlock_exit;
 	}
 
-	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
 	eb->read_mirror = 0;
 	atomic_set(&eb->io_pages, num_reads);
 	for (i = start_i; i < num_pages; i++) {
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 5e91fb9..6456179 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -41,9 +41,10 @@
 #define EXTENT_BUFFER_TREE_REF 5
 #define EXTENT_BUFFER_STALE 6
 #define EXTENT_BUFFER_WRITEBACK 7
-#define EXTENT_BUFFER_IOERR 8
+#define EXTENT_BUFFER_READ_ERR 8        /* read IO error */
 #define EXTENT_BUFFER_DUMMY 9
 #define EXTENT_BUFFER_IN_TREE 10
+#define EXTENT_BUFFER_WRITE_ERR 11    /* write IO error */
 
 /* these are flags for extent_clear_unlock_delalloc */
 #define PAGE_UNLOCK		(1 << 0)
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 1e272c0..c488bf9 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -193,6 +193,8 @@ loop:
 	 * commit the transaction.
 	 */
 	atomic_set(&cur_trans->use_count, 2);
+	atomic_set(&cur_trans->eb_write_errors, 0);
+	atomic_set(&cur_trans->log_eb_write_errors, 0);
 	cur_trans->start_time = get_seconds();
 
 	cur_trans->delayed_refs.href_root = RB_ROOT;
@@ -844,6 +846,7 @@ int btrfs_write_marked_extents(struct btrfs_root *root,
  * on all the pages and clear them from the dirty pages state tree
  */
 int btrfs_wait_marked_extents(struct btrfs_root *root,
+			      struct btrfs_trans_handle *trans,
 			      struct extent_io_tree *dirty_pages, int mark)
 {
 	int err = 0;
@@ -852,6 +855,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 	struct extent_state *cached_state = NULL;
 	u64 start = 0;
 	u64 end;
+	int errors;
 
 	while (!find_first_extent_bit(dirty_pages, start, &start, &end,
 				      EXTENT_NEED_WAIT, &cached_state)) {
@@ -865,6 +869,17 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 	}
 	if (err)
 		werr = err;
+
+	if (dirty_pages == &trans->transaction->dirty_pages)
+		errors = atomic_xchg(&trans->transaction->eb_write_errors, 0);
+	else
+		errors = !!atomic_add_unless(
+		    &trans->transaction->log_eb_write_errors, -1, 0);
+
+	ASSERT(errors >= 0);
+	if (errors > 0 && !werr)
+		werr = -EIO;
+
 	return werr;
 }
 
@@ -874,6 +889,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
  * those extents are on disk for transaction or log commit
  */
 static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
+				struct btrfs_trans_handle *trans,
 				struct extent_io_tree *dirty_pages, int mark)
 {
 	int ret;
@@ -883,7 +899,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
 	blk_start_plug(&plug);
 	ret = btrfs_write_marked_extents(root, dirty_pages, mark);
 	blk_finish_plug(&plug);
-	ret2 = btrfs_wait_marked_extents(root, dirty_pages, mark);
+	ret2 = btrfs_wait_marked_extents(root, trans, dirty_pages, mark);
 
 	if (ret)
 		return ret;
@@ -892,7 +908,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
 	return 0;
 }
 
-int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
+static int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
 				     struct btrfs_root *root)
 {
 	if (!trans || !trans->transaction) {
@@ -900,7 +916,7 @@ int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
 		btree_inode = root->fs_info->btree_inode;
 		return filemap_write_and_wait(btree_inode->i_mapping);
 	}
-	return btrfs_write_and_wait_marked_extents(root,
+	return btrfs_write_and_wait_marked_extents(root, trans,
 					   &trans->transaction->dirty_pages,
 					   EXTENT_DIRTY);
 }
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 7dd558e..311f3e3 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -46,6 +46,8 @@ struct btrfs_transaction {
 	 */
 	atomic_t num_writers;
 	atomic_t use_count;
+	atomic_t eb_write_errors;
+	atomic_t log_eb_write_errors;
 
 	/* Be protected by fs_info->trans_lock when we want to change it. */
 	enum btrfs_trans_state state;
@@ -146,8 +148,6 @@ struct btrfs_trans_handle *btrfs_attach_transaction_barrier(
 					struct btrfs_root *root);
 struct btrfs_trans_handle *btrfs_start_ioctl_transaction(struct btrfs_root *root);
 int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid);
-int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
-				     struct btrfs_root *root);
 
 void btrfs_add_dead_root(struct btrfs_root *root);
 int btrfs_defrag_root(struct btrfs_root *root);
@@ -167,6 +167,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
 int btrfs_write_marked_extents(struct btrfs_root *root,
 				struct extent_io_tree *dirty_pages, int mark);
 int btrfs_wait_marked_extents(struct btrfs_root *root,
+			      struct btrfs_trans_handle *trans,
 				struct extent_io_tree *dirty_pages, int mark);
 int btrfs_transaction_blocked(struct btrfs_fs_info *info);
 int btrfs_transaction_in_commit(struct btrfs_fs_info *info);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 2d0fa43..22ffd32 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2583,7 +2583,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 			mutex_unlock(&log_root_tree->log_mutex);
 			goto out;
 		}
-		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
+		btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
+					  mark);
 		btrfs_free_logged_extents(log, log_transid);
 		mutex_unlock(&log_root_tree->log_mutex);
 		ret = -EAGAIN;
@@ -2599,7 +2600,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	index2 = root_log_ctx.log_transid % 2;
 	if (atomic_read(&log_root_tree->log_commit[index2])) {
 		blk_finish_plug(&plug);
-		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
+		btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
+					  mark);
 		wait_log_commit(trans, log_root_tree,
 				root_log_ctx.log_transid);
 		btrfs_free_logged_extents(log, log_transid);
@@ -2623,7 +2625,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	 */
 	if (btrfs_need_log_full_commit(root->fs_info, trans)) {
 		blk_finish_plug(&plug);
-		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
+		btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
+					  mark);
 		btrfs_free_logged_extents(log, log_transid);
 		mutex_unlock(&log_root_tree->log_mutex);
 		ret = -EAGAIN;
@@ -2641,8 +2644,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 		mutex_unlock(&log_root_tree->log_mutex);
 		goto out_wake_log_root;
 	}
-	btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
-	btrfs_wait_marked_extents(log_root_tree,
+	btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages, mark);
+	btrfs_wait_marked_extents(log_root_tree, trans,
 				  &log_root_tree->dirty_log_pages,
 				  EXTENT_NEW | EXTENT_DIRTY);
 	btrfs_wait_logged_extents(log, log_transid);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/2 v3] Btrfs: be aware of btree inode write errors to avoid fs corruption
  2014-09-22 16:41 ` [PATCH 2/2] Btrfs: be aware of btree inode write errors to avoid fs corruption Filipe Manana
  2014-09-23 14:05   ` Liu Bo
  2014-09-24  0:20   ` [PATCH 2/2 v2] " Filipe Manana
@ 2014-09-24  0:43   ` Filipe Manana
  2014-09-24  4:49   ` [PATCH 2/2 v4] " Filipe Manana
                     ` (8 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Filipe Manana @ 2014-09-24  0:43 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).

Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.

Using the new counters eb_write_errors and log_eb_write_errors in the
transaction also makes us achieve the goal of AS_EIO/AS_ENOSPC when
writepages() returns success, started writeback for all dirty pages
and before filemap_fdatawait_range() is called, the writeback for
all dirty pages had already finished with errors - because we were
not using AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return
success, as it could not know that writeback errors happened (the
pages were no longer tagged for writeback).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: If an extent buffer's write failed but it's also deleted from the tree
    before the transaction commits, don't abort the transaction with -EIO,
    since the unwritten node/leaf it represents can't be pointed to by any
    other node in a tree.

V3: Correct V2, missed unstaged changes.

 fs/btrfs/disk-io.c     |  4 +--
 fs/btrfs/extent-tree.c |  8 ++++++
 fs/btrfs/extent_io.c   | 76 +++++++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/extent_io.h   |  3 +-
 fs/btrfs/transaction.c | 22 +++++++++++++--
 fs/btrfs/transaction.h |  5 ++--
 fs/btrfs/tree-log.c    | 13 +++++----
 7 files changed, 111 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 23393ec..8b54acf 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -610,7 +610,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 		goto err;
 
 	eb->read_mirror = mirror;
-	if (test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
+	if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) {
 		ret = -EIO;
 		goto err;
 	}
@@ -683,7 +683,7 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
 	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
 
 	eb = (struct extent_buffer *)page->private;
-	set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = failed_mirror;
 	atomic_dec(&eb->io_pages);
 	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ef0845d..608814b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6221,6 +6221,14 @@ out:
 	 * anymore.
 	 */
 	clear_bit(EXTENT_BUFFER_CORRUPT, &buf->bflags);
+	/*
+	 * The unwritten node/leaf (due to an IO error) isn't pointed to by any
+	 * other node in a tree, so it's safe to forget about the write error
+	 * and avoid a transaction abort.
+	 */
+	if (test_and_clear_bit(EXTENT_BUFFER_WRITE_ERR, &buf->bflags))
+		atomic_dec(&trans->transaction->eb_write_errors);
+
 	btrfs_put_block_group(cache);
 }
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 91f866c..da1706f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -20,6 +20,7 @@
 #include "locking.h"
 #include "rcu-string.h"
 #include "backref.h"
+#include "transaction.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -3606,6 +3607,68 @@ static void end_extent_buffer_writeback(struct extent_buffer *eb)
 	wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
 }
 
+static void set_btree_ioerr(struct page *page)
+{
+	struct extent_buffer *eb = (struct extent_buffer *)page->private;
+	const u64 start = eb->start;
+	const u64 end = eb->start + eb->len - 1;
+	struct btrfs_transaction *cur_trans;
+	int ret;
+
+	SetPageError(page);
+	if (test_and_set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))
+		return;
+
+	/*
+	 * If writeback for a btree extent that doesn't belong to a log tree
+	 * failed, increment the counter transaction->eb_write_errors.
+	 * We do this because while the transaction is running and before it's
+	 * committing (when we call filemap_fdata[write|wait]_range against
+	 * the btree inode), we might have
+	 * btree_inode->i_mapping->a_ops->writepages() called by the VM - if it
+	 * returns an error or an error happens during writeback, when we're
+	 * committing the transaction we wouldn't know about it, since the pages
+	 * can be no longer dirty nor marked anymore for writeback (if a
+	 * subsequent modification to the extent buffer didn't happen before the
+	 * transaction commit), which makes filemap_fdata[write|wait]_range not
+	 * able to find the pages tagged with SetPageError at transaction
+	 * commit time. So if this happens we must abort the transaction,
+	 * otherwise we commit a super block with btree roots that point to
+	 * btree nodes/leafs whose content on disk is invalid - either garbage
+	 * or the content of some node/leaf from a past generation that got
+	 * cowed or deleted and is no longer valid.
+	 *
+	 * Note: setting AS_EIO/AS_ENOSPC in the btree inode's i_mapping would
+	 * not be enough - we need to distinguish between log tree extents vs
+	 * non-log tree extents, and the next filemap_fdatawait_range() call
+	 * will catch and clear such errors in the mapping - and that call might
+	 * be from a log sync and not from a transaction commit. Also, checking
+	 * for the eb flag EXTENT_BUFFER_WRITE_ERR at transaction commit time is
+	 * not done and would not be completely reliable, as the eb might be
+	 * removed from memory and read back when trying to get it, which clears
+	 * that flag right before reading the eb's pages from disk, making us
+	 * not know about the previous write error.
+	 *
+	 * Using transaction counters eb_write_errors and log_eb_write_errors
+	 * also makes us achieve the goal of AS_EIO/AS_ENOSPC when writepages()
+	 * returns success, started writeback for all dirty pages and before
+	 * filemap_fdatawait_range() is called, the writeback for all dirty
+	 * pages had already finished with errors - because we were not using
+	 * AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return success, as
+	 * it could not know that writeback errors happened (the pages were no
+	 * longer tagged for writeback).
+	 */
+	cur_trans = eb->fs_info->running_transaction;
+	ASSERT(cur_trans);
+	ret = test_range_bit(&cur_trans->dirty_pages,
+			     start, end, EXTENT_NEED_WAIT | EXTENT_DIRTY,
+			     1, NULL);
+	if (ret)
+		atomic_inc(&cur_trans->eb_write_errors);
+	else
+		atomic_inc(&cur_trans->log_eb_write_errors);
+}
+
 static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 {
 	struct bio_vec *bvec;
@@ -3619,10 +3682,9 @@ static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 		BUG_ON(!eb);
 		done = atomic_dec_and_test(&eb->io_pages);
 
-		if (err || test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+		if (err || test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
 			ClearPageUptodate(page);
-			SetPageError(page);
+			set_btree_ioerr(page);
 		}
 
 		end_page_writeback(page);
@@ -3649,7 +3711,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	int rw = (epd->sync_io ? WRITE_SYNC : WRITE) | REQ_META;
 	int ret = 0;
 
-	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
 	num_pages = num_extent_pages(eb->start, eb->len);
 	atomic_set(&eb->io_pages, num_pages);
 	if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
@@ -3666,8 +3728,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 					 0, epd->bio_flags, bio_flags);
 		epd->bio_flags = bio_flags;
 		if (ret) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
-			SetPageError(p);
+			set_btree_ioerr(p);
 			end_page_writeback(p);
 			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
 				end_extent_buffer_writeback(eb);
@@ -5069,7 +5130,8 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 		goto unlock_exit;
 	}
 
-	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
 	eb->read_mirror = 0;
 	atomic_set(&eb->io_pages, num_reads);
 	for (i = start_i; i < num_pages; i++) {
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 5e91fb9..6456179 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -41,9 +41,10 @@
 #define EXTENT_BUFFER_TREE_REF 5
 #define EXTENT_BUFFER_STALE 6
 #define EXTENT_BUFFER_WRITEBACK 7
-#define EXTENT_BUFFER_IOERR 8
+#define EXTENT_BUFFER_READ_ERR 8        /* read IO error */
 #define EXTENT_BUFFER_DUMMY 9
 #define EXTENT_BUFFER_IN_TREE 10
+#define EXTENT_BUFFER_WRITE_ERR 11    /* write IO error */
 
 /* these are flags for extent_clear_unlock_delalloc */
 #define PAGE_UNLOCK		(1 << 0)
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 1e272c0..0b325d0 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -193,6 +193,8 @@ loop:
 	 * commit the transaction.
 	 */
 	atomic_set(&cur_trans->use_count, 2);
+	atomic_set(&cur_trans->eb_write_errors, 0);
+	atomic_set(&cur_trans->log_eb_write_errors, 0);
 	cur_trans->start_time = get_seconds();
 
 	cur_trans->delayed_refs.href_root = RB_ROOT;
@@ -844,6 +846,7 @@ int btrfs_write_marked_extents(struct btrfs_root *root,
  * on all the pages and clear them from the dirty pages state tree
  */
 int btrfs_wait_marked_extents(struct btrfs_root *root,
+			      struct btrfs_trans_handle *trans,
 			      struct extent_io_tree *dirty_pages, int mark)
 {
 	int err = 0;
@@ -852,6 +855,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 	struct extent_state *cached_state = NULL;
 	u64 start = 0;
 	u64 end;
+	int errors;
 
 	while (!find_first_extent_bit(dirty_pages, start, &start, &end,
 				      EXTENT_NEED_WAIT, &cached_state)) {
@@ -865,6 +869,17 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 	}
 	if (err)
 		werr = err;
+
+	if (dirty_pages == &trans->transaction->dirty_pages)
+		errors = atomic_xchg(&trans->transaction->eb_write_errors, 0);
+	else
+		errors = atomic_xchg(
+		    &trans->transaction->log_eb_write_errors, 0);
+
+	ASSERT(errors >= 0);
+	if (errors > 0 && !werr)
+		werr = -EIO;
+
 	return werr;
 }
 
@@ -874,6 +889,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
  * those extents are on disk for transaction or log commit
  */
 static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
+				struct btrfs_trans_handle *trans,
 				struct extent_io_tree *dirty_pages, int mark)
 {
 	int ret;
@@ -883,7 +899,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
 	blk_start_plug(&plug);
 	ret = btrfs_write_marked_extents(root, dirty_pages, mark);
 	blk_finish_plug(&plug);
-	ret2 = btrfs_wait_marked_extents(root, dirty_pages, mark);
+	ret2 = btrfs_wait_marked_extents(root, trans, dirty_pages, mark);
 
 	if (ret)
 		return ret;
@@ -892,7 +908,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
 	return 0;
 }
 
-int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
+static int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
 				     struct btrfs_root *root)
 {
 	if (!trans || !trans->transaction) {
@@ -900,7 +916,7 @@ int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
 		btree_inode = root->fs_info->btree_inode;
 		return filemap_write_and_wait(btree_inode->i_mapping);
 	}
-	return btrfs_write_and_wait_marked_extents(root,
+	return btrfs_write_and_wait_marked_extents(root, trans,
 					   &trans->transaction->dirty_pages,
 					   EXTENT_DIRTY);
 }
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 7dd558e..311f3e3 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -46,6 +46,8 @@ struct btrfs_transaction {
 	 */
 	atomic_t num_writers;
 	atomic_t use_count;
+	atomic_t eb_write_errors;
+	atomic_t log_eb_write_errors;
 
 	/* Be protected by fs_info->trans_lock when we want to change it. */
 	enum btrfs_trans_state state;
@@ -146,8 +148,6 @@ struct btrfs_trans_handle *btrfs_attach_transaction_barrier(
 					struct btrfs_root *root);
 struct btrfs_trans_handle *btrfs_start_ioctl_transaction(struct btrfs_root *root);
 int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid);
-int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
-				     struct btrfs_root *root);
 
 void btrfs_add_dead_root(struct btrfs_root *root);
 int btrfs_defrag_root(struct btrfs_root *root);
@@ -167,6 +167,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
 int btrfs_write_marked_extents(struct btrfs_root *root,
 				struct extent_io_tree *dirty_pages, int mark);
 int btrfs_wait_marked_extents(struct btrfs_root *root,
+			      struct btrfs_trans_handle *trans,
 				struct extent_io_tree *dirty_pages, int mark);
 int btrfs_transaction_blocked(struct btrfs_fs_info *info);
 int btrfs_transaction_in_commit(struct btrfs_fs_info *info);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 2d0fa43..22ffd32 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2583,7 +2583,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 			mutex_unlock(&log_root_tree->log_mutex);
 			goto out;
 		}
-		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
+		btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
+					  mark);
 		btrfs_free_logged_extents(log, log_transid);
 		mutex_unlock(&log_root_tree->log_mutex);
 		ret = -EAGAIN;
@@ -2599,7 +2600,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	index2 = root_log_ctx.log_transid % 2;
 	if (atomic_read(&log_root_tree->log_commit[index2])) {
 		blk_finish_plug(&plug);
-		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
+		btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
+					  mark);
 		wait_log_commit(trans, log_root_tree,
 				root_log_ctx.log_transid);
 		btrfs_free_logged_extents(log, log_transid);
@@ -2623,7 +2625,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	 */
 	if (btrfs_need_log_full_commit(root->fs_info, trans)) {
 		blk_finish_plug(&plug);
-		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
+		btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
+					  mark);
 		btrfs_free_logged_extents(log, log_transid);
 		mutex_unlock(&log_root_tree->log_mutex);
 		ret = -EAGAIN;
@@ -2641,8 +2644,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 		mutex_unlock(&log_root_tree->log_mutex);
 		goto out_wake_log_root;
 	}
-	btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
-	btrfs_wait_marked_extents(log_root_tree,
+	btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages, mark);
+	btrfs_wait_marked_extents(log_root_tree, trans,
 				  &log_root_tree->dirty_log_pages,
 				  EXTENT_NEW | EXTENT_DIRTY);
 	btrfs_wait_logged_extents(log, log_transid);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/2 v4] Btrfs: be aware of btree inode write errors to avoid fs corruption
  2014-09-22 16:41 ` [PATCH 2/2] Btrfs: be aware of btree inode write errors to avoid fs corruption Filipe Manana
                     ` (2 preceding siblings ...)
  2014-09-24  0:43   ` [PATCH 2/2 v3] " Filipe Manana
@ 2014-09-24  4:49   ` Filipe Manana
  2014-09-24 10:28   ` [PATCH 2/2 v5] " Filipe Manana
                     ` (7 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Filipe Manana @ 2014-09-24  4:49 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).

Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.

Using the new counters eb_write_errors and log_eb_write_errors in the
transaction also makes us achieve the goal of AS_EIO/AS_ENOSPC when
writepages() returns success, started writeback for all dirty pages
and before filemap_fdatawait_range() is called, the writeback for
all dirty pages had already finished with errors - because we were
not using AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return
success, as it could not know that writeback errors happened (the
pages were no longer tagged for writeback).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: If an extent buffer's write failed but it's also deleted from the tree
    before the transaction commits, don't abort the transaction with -EIO,
    since the unwritten node/leaf it represents can't be pointed to by any
    other node in a tree.

V3: Correct V2, missed unstaged changes.

V4: Use root's key to figure out which counter to update.

 fs/btrfs/disk-io.c     |  4 +--
 fs/btrfs/extent-tree.c | 12 +++++++++
 fs/btrfs/extent_io.c   | 71 +++++++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/extent_io.h   |  3 ++-
 fs/btrfs/transaction.c | 22 +++++++++++++---
 fs/btrfs/transaction.h |  5 ++--
 fs/btrfs/tree-log.c    | 13 +++++----
 7 files changed, 110 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 23393ec..8b54acf 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -610,7 +610,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 		goto err;
 
 	eb->read_mirror = mirror;
-	if (test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
+	if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) {
 		ret = -EIO;
 		goto err;
 	}
@@ -683,7 +683,7 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
 	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
 
 	eb = (struct extent_buffer *)page->private;
-	set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = failed_mirror;
 	atomic_dec(&eb->io_pages);
 	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ef0845d..bdacd33 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6221,6 +6221,18 @@ out:
 	 * anymore.
 	 */
 	clear_bit(EXTENT_BUFFER_CORRUPT, &buf->bflags);
+	/*
+	 * The unwritten node/leaf (due to an IO error) isn't pointed to by any
+	 * other node in a tree, so it's safe to forget about the write error
+	 * and avoid a transaction abort.
+	 */
+	if (test_and_clear_bit(EXTENT_BUFFER_WRITE_ERR, &buf->bflags)) {
+		if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID)
+			atomic_dec(&trans->transaction->log_eb_write_errors);
+		else
+			atomic_dec(&trans->transaction->eb_write_errors);
+	}
+
 	btrfs_put_block_group(cache);
 }
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 91f866c..a83e1c5 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -20,6 +20,7 @@
 #include "locking.h"
 #include "rcu-string.h"
 #include "backref.h"
+#include "transaction.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -3606,6 +3607,63 @@ static void end_extent_buffer_writeback(struct extent_buffer *eb)
 	wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
 }
 
+static void set_btree_ioerr(struct page *page)
+{
+	struct extent_buffer *eb = (struct extent_buffer *)page->private;
+	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
+	struct btrfs_transaction *cur_trans;
+
+	SetPageError(page);
+	if (test_and_set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))
+		return;
+
+	/*
+	 * If writeback for a btree extent that doesn't belong to a log tree
+	 * failed, increment the counter transaction->eb_write_errors.
+	 * We do this because while the transaction is running and before it's
+	 * committing (when we call filemap_fdata[write|wait]_range against
+	 * the btree inode), we might have
+	 * btree_inode->i_mapping->a_ops->writepages() called by the VM - if it
+	 * returns an error or an error happens during writeback, when we're
+	 * committing the transaction we wouldn't know about it, since the pages
+	 * can be no longer dirty nor marked anymore for writeback (if a
+	 * subsequent modification to the extent buffer didn't happen before the
+	 * transaction commit), which makes filemap_fdata[write|wait]_range not
+	 * able to find the pages tagged with SetPageError at transaction
+	 * commit time. So if this happens we must abort the transaction,
+	 * otherwise we commit a super block with btree roots that point to
+	 * btree nodes/leafs whose content on disk is invalid - either garbage
+	 * or the content of some node/leaf from a past generation that got
+	 * cowed or deleted and is no longer valid.
+	 *
+	 * Note: setting AS_EIO/AS_ENOSPC in the btree inode's i_mapping would
+	 * not be enough - we need to distinguish between log tree extents vs
+	 * non-log tree extents, and the next filemap_fdatawait_range() call
+	 * will catch and clear such errors in the mapping - and that call might
+	 * be from a log sync and not from a transaction commit. Also, checking
+	 * for the eb flag EXTENT_BUFFER_WRITE_ERR at transaction commit time is
+	 * not done and would not be completely reliable, as the eb might be
+	 * removed from memory and read back when trying to get it, which clears
+	 * that flag right before reading the eb's pages from disk, making us
+	 * not know about the previous write error.
+	 *
+	 * Using transaction counters eb_write_errors and log_eb_write_errors
+	 * also makes us achieve the goal of AS_EIO/AS_ENOSPC when writepages()
+	 * returns success, started writeback for all dirty pages and before
+	 * filemap_fdatawait_range() is called, the writeback for all dirty
+	 * pages had already finished with errors - because we were not using
+	 * AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return success, as
+	 * it could not know that writeback errors happened (the pages were no
+	 * longer tagged for writeback).
+	 */
+	cur_trans = eb->fs_info->running_transaction;
+	ASSERT(cur_trans);
+	if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID)
+		atomic_inc(&cur_trans->log_eb_write_errors);
+	else
+		atomic_inc(&cur_trans->eb_write_errors);
+}
+
 static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 {
 	struct bio_vec *bvec;
@@ -3619,10 +3677,9 @@ static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 		BUG_ON(!eb);
 		done = atomic_dec_and_test(&eb->io_pages);
 
-		if (err || test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+		if (err || test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
 			ClearPageUptodate(page);
-			SetPageError(page);
+			set_btree_ioerr(page);
 		}
 
 		end_page_writeback(page);
@@ -3649,7 +3706,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	int rw = (epd->sync_io ? WRITE_SYNC : WRITE) | REQ_META;
 	int ret = 0;
 
-	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
 	num_pages = num_extent_pages(eb->start, eb->len);
 	atomic_set(&eb->io_pages, num_pages);
 	if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
@@ -3666,8 +3723,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 					 0, epd->bio_flags, bio_flags);
 		epd->bio_flags = bio_flags;
 		if (ret) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
-			SetPageError(p);
+			set_btree_ioerr(p);
 			end_page_writeback(p);
 			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
 				end_extent_buffer_writeback(eb);
@@ -5069,7 +5125,8 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 		goto unlock_exit;
 	}
 
-	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
 	eb->read_mirror = 0;
 	atomic_set(&eb->io_pages, num_reads);
 	for (i = start_i; i < num_pages; i++) {
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 5e91fb9..6456179 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -41,9 +41,10 @@
 #define EXTENT_BUFFER_TREE_REF 5
 #define EXTENT_BUFFER_STALE 6
 #define EXTENT_BUFFER_WRITEBACK 7
-#define EXTENT_BUFFER_IOERR 8
+#define EXTENT_BUFFER_READ_ERR 8        /* read IO error */
 #define EXTENT_BUFFER_DUMMY 9
 #define EXTENT_BUFFER_IN_TREE 10
+#define EXTENT_BUFFER_WRITE_ERR 11    /* write IO error */
 
 /* these are flags for extent_clear_unlock_delalloc */
 #define PAGE_UNLOCK		(1 << 0)
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 1e272c0..d05f6f8 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -193,6 +193,8 @@ loop:
 	 * commit the transaction.
 	 */
 	atomic_set(&cur_trans->use_count, 2);
+	atomic_set(&cur_trans->eb_write_errors, 0);
+	atomic_set(&cur_trans->log_eb_write_errors, 0);
 	cur_trans->start_time = get_seconds();
 
 	cur_trans->delayed_refs.href_root = RB_ROOT;
@@ -844,6 +846,7 @@ int btrfs_write_marked_extents(struct btrfs_root *root,
  * on all the pages and clear them from the dirty pages state tree
  */
 int btrfs_wait_marked_extents(struct btrfs_root *root,
+			      struct btrfs_trans_handle *trans,
 			      struct extent_io_tree *dirty_pages, int mark)
 {
 	int err = 0;
@@ -852,6 +855,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 	struct extent_state *cached_state = NULL;
 	u64 start = 0;
 	u64 end;
+	int errors;
 
 	while (!find_first_extent_bit(dirty_pages, start, &start, &end,
 				      EXTENT_NEED_WAIT, &cached_state)) {
@@ -865,6 +869,17 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 	}
 	if (err)
 		werr = err;
+
+	if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID)
+		errors = atomic_xchg(
+		    &trans->transaction->log_eb_write_errors, 0);
+	else
+		errors = atomic_xchg(&trans->transaction->eb_write_errors, 0);
+
+	ASSERT(errors >= 0);
+	if (errors > 0 && !werr)
+		werr = -EIO;
+
 	return werr;
 }
 
@@ -874,6 +889,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
  * those extents are on disk for transaction or log commit
  */
 static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
+				struct btrfs_trans_handle *trans,
 				struct extent_io_tree *dirty_pages, int mark)
 {
 	int ret;
@@ -883,7 +899,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
 	blk_start_plug(&plug);
 	ret = btrfs_write_marked_extents(root, dirty_pages, mark);
 	blk_finish_plug(&plug);
-	ret2 = btrfs_wait_marked_extents(root, dirty_pages, mark);
+	ret2 = btrfs_wait_marked_extents(root, trans, dirty_pages, mark);
 
 	if (ret)
 		return ret;
@@ -892,7 +908,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
 	return 0;
 }
 
-int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
+static int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
 				     struct btrfs_root *root)
 {
 	if (!trans || !trans->transaction) {
@@ -900,7 +916,7 @@ int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
 		btree_inode = root->fs_info->btree_inode;
 		return filemap_write_and_wait(btree_inode->i_mapping);
 	}
-	return btrfs_write_and_wait_marked_extents(root,
+	return btrfs_write_and_wait_marked_extents(root, trans,
 					   &trans->transaction->dirty_pages,
 					   EXTENT_DIRTY);
 }
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 7dd558e..311f3e3 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -46,6 +46,8 @@ struct btrfs_transaction {
 	 */
 	atomic_t num_writers;
 	atomic_t use_count;
+	atomic_t eb_write_errors;
+	atomic_t log_eb_write_errors;
 
 	/* Be protected by fs_info->trans_lock when we want to change it. */
 	enum btrfs_trans_state state;
@@ -146,8 +148,6 @@ struct btrfs_trans_handle *btrfs_attach_transaction_barrier(
 					struct btrfs_root *root);
 struct btrfs_trans_handle *btrfs_start_ioctl_transaction(struct btrfs_root *root);
 int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid);
-int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
-				     struct btrfs_root *root);
 
 void btrfs_add_dead_root(struct btrfs_root *root);
 int btrfs_defrag_root(struct btrfs_root *root);
@@ -167,6 +167,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
 int btrfs_write_marked_extents(struct btrfs_root *root,
 				struct extent_io_tree *dirty_pages, int mark);
 int btrfs_wait_marked_extents(struct btrfs_root *root,
+			      struct btrfs_trans_handle *trans,
 				struct extent_io_tree *dirty_pages, int mark);
 int btrfs_transaction_blocked(struct btrfs_fs_info *info);
 int btrfs_transaction_in_commit(struct btrfs_fs_info *info);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 2d0fa43..22ffd32 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2583,7 +2583,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 			mutex_unlock(&log_root_tree->log_mutex);
 			goto out;
 		}
-		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
+		btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
+					  mark);
 		btrfs_free_logged_extents(log, log_transid);
 		mutex_unlock(&log_root_tree->log_mutex);
 		ret = -EAGAIN;
@@ -2599,7 +2600,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	index2 = root_log_ctx.log_transid % 2;
 	if (atomic_read(&log_root_tree->log_commit[index2])) {
 		blk_finish_plug(&plug);
-		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
+		btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
+					  mark);
 		wait_log_commit(trans, log_root_tree,
 				root_log_ctx.log_transid);
 		btrfs_free_logged_extents(log, log_transid);
@@ -2623,7 +2625,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	 */
 	if (btrfs_need_log_full_commit(root->fs_info, trans)) {
 		blk_finish_plug(&plug);
-		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
+		btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
+					  mark);
 		btrfs_free_logged_extents(log, log_transid);
 		mutex_unlock(&log_root_tree->log_mutex);
 		ret = -EAGAIN;
@@ -2641,8 +2644,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 		mutex_unlock(&log_root_tree->log_mutex);
 		goto out_wake_log_root;
 	}
-	btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
-	btrfs_wait_marked_extents(log_root_tree,
+	btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages, mark);
+	btrfs_wait_marked_extents(log_root_tree, trans,
 				  &log_root_tree->dirty_log_pages,
 				  EXTENT_NEW | EXTENT_DIRTY);
 	btrfs_wait_logged_extents(log, log_transid);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/2 v5] Btrfs: be aware of btree inode write errors to avoid fs corruption
  2014-09-22 16:41 ` [PATCH 2/2] Btrfs: be aware of btree inode write errors to avoid fs corruption Filipe Manana
                     ` (3 preceding siblings ...)
  2014-09-24  4:49   ` [PATCH 2/2 v4] " Filipe Manana
@ 2014-09-24 10:28   ` Filipe Manana
  2014-09-24 11:16     ` Miao Xie
  2014-09-24 17:19   ` [PATCH 2/2 v6] " Filipe Manana
                     ` (6 subsequent siblings)
  11 siblings, 1 reply; 21+ messages in thread
From: Filipe Manana @ 2014-09-24 10:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).

Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.

Using the new counters eb_write_errors and log_eb_write_errors in the
transaction also makes us achieve the goal of AS_EIO/AS_ENOSPC when
writepages() returns success, started writeback for all dirty pages
and before filemap_fdatawait_range() is called, the writeback for
all dirty pages had already finished with errors - because we were
not using AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return
success, as it could not know that writeback errors happened (the
pages were no longer tagged for writeback).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: If an extent buffer's write failed but it's also deleted from the tree
    before the transaction commits, don't abort the transaction with -EIO,
    since the unwritten node/leaf it represents can't be pointed to by any
    other node in a tree.

V3: Correct V2, missed unstaged changes.

V4: Use root's key to figure out which counter to update.

V5: Decrement the error counters too when an eb is made dirty again (the
    next write attempt might succeed).

 fs/btrfs/disk-io.c     |  4 +--
 fs/btrfs/extent-tree.c |  1 +
 fs/btrfs/extent_io.c   | 98 ++++++++++++++++++++++++++++++++++++++++++++++----
 fs/btrfs/extent_io.h   |  4 ++-
 fs/btrfs/transaction.c | 21 +++++++++--
 fs/btrfs/transaction.h |  5 +--
 fs/btrfs/tree-log.c    | 13 ++++---
 7 files changed, 126 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 23393ec..8b54acf 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -610,7 +610,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 		goto err;
 
 	eb->read_mirror = mirror;
-	if (test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
+	if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) {
 		ret = -EIO;
 		goto err;
 	}
@@ -683,7 +683,7 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
 	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
 
 	eb = (struct extent_buffer *)page->private;
-	set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = failed_mirror;
 	atomic_dec(&eb->io_pages);
 	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ef0845d..ec185b5 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6221,6 +6221,7 @@ out:
 	 * anymore.
 	 */
 	clear_bit(EXTENT_BUFFER_CORRUPT, &buf->bflags);
+	clear_extent_buffer_write_err(buf);
 	btrfs_put_block_group(cache);
 }
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 91f866c..ed1be9c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -20,6 +20,7 @@
 #include "locking.h"
 #include "rcu-string.h"
 #include "backref.h"
+#include "transaction.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -3606,6 +3607,63 @@ static void end_extent_buffer_writeback(struct extent_buffer *eb)
 	wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
 }
 
+static void set_btree_ioerr(struct page *page)
+{
+	struct extent_buffer *eb = (struct extent_buffer *)page->private;
+	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
+	struct btrfs_transaction *cur_trans;
+
+	SetPageError(page);
+	if (test_and_set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))
+		return;
+
+	/*
+	 * If writeback for a btree extent that doesn't belong to a log tree
+	 * failed, increment the counter transaction->eb_write_errors.
+	 * We do this because while the transaction is running and before it's
+	 * committing (when we call filemap_fdata[write|wait]_range against
+	 * the btree inode), we might have
+	 * btree_inode->i_mapping->a_ops->writepages() called by the VM - if it
+	 * returns an error or an error happens during writeback, when we're
+	 * committing the transaction we wouldn't know about it, since the pages
+	 * can be no longer dirty nor marked anymore for writeback (if a
+	 * subsequent modification to the extent buffer didn't happen before the
+	 * transaction commit), which makes filemap_fdata[write|wait]_range not
+	 * able to find the pages tagged with SetPageError at transaction
+	 * commit time. So if this happens we must abort the transaction,
+	 * otherwise we commit a super block with btree roots that point to
+	 * btree nodes/leafs whose content on disk is invalid - either garbage
+	 * or the content of some node/leaf from a past generation that got
+	 * cowed or deleted and is no longer valid.
+	 *
+	 * Note: setting AS_EIO/AS_ENOSPC in the btree inode's i_mapping would
+	 * not be enough - we need to distinguish between log tree extents vs
+	 * non-log tree extents, and the next filemap_fdatawait_range() call
+	 * will catch and clear such errors in the mapping - and that call might
+	 * be from a log sync and not from a transaction commit. Also, checking
+	 * for the eb flag EXTENT_BUFFER_WRITE_ERR at transaction commit time is
+	 * not done and would not be completely reliable, as the eb might be
+	 * removed from memory and read back when trying to get it, which clears
+	 * that flag right before reading the eb's pages from disk, making us
+	 * not know about the previous write error.
+	 *
+	 * Using transaction counters eb_write_errors and log_eb_write_errors
+	 * also makes us achieve the goal of AS_EIO/AS_ENOSPC when writepages()
+	 * returns success, started writeback for all dirty pages and before
+	 * filemap_fdatawait_range() is called, the writeback for all dirty
+	 * pages had already finished with errors - because we were not using
+	 * AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return success, as
+	 * it could not know that writeback errors happened (the pages were no
+	 * longer tagged for writeback).
+	 */
+	cur_trans = eb->fs_info->running_transaction;
+	ASSERT(cur_trans);
+	if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID)
+		atomic_inc(&cur_trans->log_eb_write_errors);
+	else
+		atomic_inc(&cur_trans->eb_write_errors);
+}
+
 static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 {
 	struct bio_vec *bvec;
@@ -3619,10 +3677,9 @@ static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 		BUG_ON(!eb);
 		done = atomic_dec_and_test(&eb->io_pages);
 
-		if (err || test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+		if (err || test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
 			ClearPageUptodate(page);
-			SetPageError(page);
+			set_btree_ioerr(page);
 		}
 
 		end_page_writeback(page);
@@ -3649,7 +3706,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	int rw = (epd->sync_io ? WRITE_SYNC : WRITE) | REQ_META;
 	int ret = 0;
 
-	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	clear_extent_buffer_write_err(eb);
 	num_pages = num_extent_pages(eb->start, eb->len);
 	atomic_set(&eb->io_pages, num_pages);
 	if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
@@ -3666,8 +3723,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 					 0, epd->bio_flags, bio_flags);
 		epd->bio_flags = bio_flags;
 		if (ret) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
-			SetPageError(p);
+			set_btree_ioerr(p);
 			end_page_writeback(p);
 			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
 				end_extent_buffer_writeback(eb);
@@ -5069,7 +5125,7 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 		goto unlock_exit;
 	}
 
-	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = 0;
 	atomic_set(&eb->io_pages, num_reads);
 	for (i = start_i; i < num_pages; i++) {
@@ -5514,3 +5570,31 @@ int try_release_extent_buffer(struct page *page)
 
 	return release_extent_buffer(eb);
 }
+
+void clear_extent_buffer_write_err(struct extent_buffer *eb)
+{
+	/*
+	 * Ignore/discard a write IO error when:
+	 *
+	 * 1) The unwritten node/leaf isn't pointed to by any other node in
+	 * a tree because it was deleted - so it's safe to forget about the
+	 * write error and avoid a transaction abort;
+	 *
+	 * 2) The same eb was marked dirty again, so we will attempt to write
+	 * it to disk again, which might succeed this time.
+	 */
+	if (test_and_clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
+		struct btrfs_root *root;
+		struct btrfs_transaction *cur_trans;
+
+		root = BTRFS_I(eb->pages[0]->mapping->host)->root;
+		cur_trans = eb->fs_info->running_transaction;
+		ASSERT(cur_trans);
+
+		if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID)
+			atomic_dec(&cur_trans->log_eb_write_errors);
+		else
+			atomic_dec(&cur_trans->eb_write_errors);
+	}
+
+}
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 5e91fb9..2c25cc9 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -41,9 +41,10 @@
 #define EXTENT_BUFFER_TREE_REF 5
 #define EXTENT_BUFFER_STALE 6
 #define EXTENT_BUFFER_WRITEBACK 7
-#define EXTENT_BUFFER_IOERR 8
+#define EXTENT_BUFFER_READ_ERR 8        /* read IO error */
 #define EXTENT_BUFFER_DUMMY 9
 #define EXTENT_BUFFER_IN_TREE 10
+#define EXTENT_BUFFER_WRITE_ERR 11    /* write IO error */
 
 /* these are flags for extent_clear_unlock_delalloc */
 #define PAGE_UNLOCK		(1 << 0)
@@ -314,6 +315,7 @@ void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
 			   unsigned long src_offset, unsigned long len);
 void memset_extent_buffer(struct extent_buffer *eb, char c,
 			  unsigned long start, unsigned long len);
+void clear_extent_buffer_write_err(struct extent_buffer *eb);
 void clear_extent_buffer_dirty(struct extent_buffer *eb);
 int set_extent_buffer_dirty(struct extent_buffer *eb);
 int set_extent_buffer_uptodate(struct extent_buffer *eb);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 1e272c0..8f412c0 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -193,6 +193,8 @@ loop:
 	 * commit the transaction.
 	 */
 	atomic_set(&cur_trans->use_count, 2);
+	atomic_set(&cur_trans->eb_write_errors, 0);
+	atomic_set(&cur_trans->log_eb_write_errors, 0);
 	cur_trans->start_time = get_seconds();
 
 	cur_trans->delayed_refs.href_root = RB_ROOT;
@@ -844,6 +846,7 @@ int btrfs_write_marked_extents(struct btrfs_root *root,
  * on all the pages and clear them from the dirty pages state tree
  */
 int btrfs_wait_marked_extents(struct btrfs_root *root,
+			      struct btrfs_trans_handle *trans,
 			      struct extent_io_tree *dirty_pages, int mark)
 {
 	int err = 0;
@@ -852,6 +855,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 	struct extent_state *cached_state = NULL;
 	u64 start = 0;
 	u64 end;
+	int errors;
 
 	while (!find_first_extent_bit(dirty_pages, start, &start, &end,
 				      EXTENT_NEED_WAIT, &cached_state)) {
@@ -865,6 +869,16 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 	}
 	if (err)
 		werr = err;
+
+	if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID)
+		errors = atomic_xchg(
+		    &trans->transaction->log_eb_write_errors, 0);
+	else
+		errors = atomic_xchg(&trans->transaction->eb_write_errors, 0);
+
+	if (errors && !werr)
+		werr = -EIO;
+
 	return werr;
 }
 
@@ -874,6 +888,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
  * those extents are on disk for transaction or log commit
  */
 static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
+				struct btrfs_trans_handle *trans,
 				struct extent_io_tree *dirty_pages, int mark)
 {
 	int ret;
@@ -883,7 +898,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
 	blk_start_plug(&plug);
 	ret = btrfs_write_marked_extents(root, dirty_pages, mark);
 	blk_finish_plug(&plug);
-	ret2 = btrfs_wait_marked_extents(root, dirty_pages, mark);
+	ret2 = btrfs_wait_marked_extents(root, trans, dirty_pages, mark);
 
 	if (ret)
 		return ret;
@@ -892,7 +907,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
 	return 0;
 }
 
-int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
+static int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
 				     struct btrfs_root *root)
 {
 	if (!trans || !trans->transaction) {
@@ -900,7 +915,7 @@ int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
 		btree_inode = root->fs_info->btree_inode;
 		return filemap_write_and_wait(btree_inode->i_mapping);
 	}
-	return btrfs_write_and_wait_marked_extents(root,
+	return btrfs_write_and_wait_marked_extents(root, trans,
 					   &trans->transaction->dirty_pages,
 					   EXTENT_DIRTY);
 }
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 7dd558e..311f3e3 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -46,6 +46,8 @@ struct btrfs_transaction {
 	 */
 	atomic_t num_writers;
 	atomic_t use_count;
+	atomic_t eb_write_errors;
+	atomic_t log_eb_write_errors;
 
 	/* Be protected by fs_info->trans_lock when we want to change it. */
 	enum btrfs_trans_state state;
@@ -146,8 +148,6 @@ struct btrfs_trans_handle *btrfs_attach_transaction_barrier(
 					struct btrfs_root *root);
 struct btrfs_trans_handle *btrfs_start_ioctl_transaction(struct btrfs_root *root);
 int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid);
-int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
-				     struct btrfs_root *root);
 
 void btrfs_add_dead_root(struct btrfs_root *root);
 int btrfs_defrag_root(struct btrfs_root *root);
@@ -167,6 +167,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
 int btrfs_write_marked_extents(struct btrfs_root *root,
 				struct extent_io_tree *dirty_pages, int mark);
 int btrfs_wait_marked_extents(struct btrfs_root *root,
+			      struct btrfs_trans_handle *trans,
 				struct extent_io_tree *dirty_pages, int mark);
 int btrfs_transaction_blocked(struct btrfs_fs_info *info);
 int btrfs_transaction_in_commit(struct btrfs_fs_info *info);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 2d0fa43..22ffd32 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2583,7 +2583,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 			mutex_unlock(&log_root_tree->log_mutex);
 			goto out;
 		}
-		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
+		btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
+					  mark);
 		btrfs_free_logged_extents(log, log_transid);
 		mutex_unlock(&log_root_tree->log_mutex);
 		ret = -EAGAIN;
@@ -2599,7 +2600,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	index2 = root_log_ctx.log_transid % 2;
 	if (atomic_read(&log_root_tree->log_commit[index2])) {
 		blk_finish_plug(&plug);
-		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
+		btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
+					  mark);
 		wait_log_commit(trans, log_root_tree,
 				root_log_ctx.log_transid);
 		btrfs_free_logged_extents(log, log_transid);
@@ -2623,7 +2625,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	 */
 	if (btrfs_need_log_full_commit(root->fs_info, trans)) {
 		blk_finish_plug(&plug);
-		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
+		btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
+					  mark);
 		btrfs_free_logged_extents(log, log_transid);
 		mutex_unlock(&log_root_tree->log_mutex);
 		ret = -EAGAIN;
@@ -2641,8 +2644,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 		mutex_unlock(&log_root_tree->log_mutex);
 		goto out_wake_log_root;
 	}
-	btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
-	btrfs_wait_marked_extents(log_root_tree,
+	btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages, mark);
+	btrfs_wait_marked_extents(log_root_tree, trans,
 				  &log_root_tree->dirty_log_pages,
 				  EXTENT_NEW | EXTENT_DIRTY);
 	btrfs_wait_logged_extents(log, log_transid);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2 v5] Btrfs: be aware of btree inode write errors to avoid fs corruption
  2014-09-24 10:28   ` [PATCH 2/2 v5] " Filipe Manana
@ 2014-09-24 11:16     ` Miao Xie
  2014-09-24 12:59       ` Filipe David Manana
  0 siblings, 1 reply; 21+ messages in thread
From: Miao Xie @ 2014-09-24 11:16 UTC (permalink / raw)
  To: Filipe Manana, linux-btrfs

On Wed, 24 Sep 2014 11:28:26 +0100, Filipe Manana wrote:
[SNIP]
>  int btrfs_wait_marked_extents(struct btrfs_root *root,
> +			      struct btrfs_trans_handle *trans,
>  			      struct extent_io_tree *dirty_pages, int mark)
>  {
>  	int err = 0;
> @@ -852,6 +855,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
>  	struct extent_state *cached_state = NULL;
>  	u64 start = 0;
>  	u64 end;
> +	int errors;
>  
>  	while (!find_first_extent_bit(dirty_pages, start, &start, &end,
>  				      EXTENT_NEED_WAIT, &cached_state)) {
> @@ -865,6 +869,16 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
>  	}
>  	if (err)
>  		werr = err;
> +
> +	if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID)
> +		errors = atomic_xchg(
> +		    &trans->transaction->log_eb_write_errors, 0);
> +	else
> +		errors = atomic_xchg(&trans->transaction->eb_write_errors, 0);

There is a bug in log tree commit case.
As we know, each fs/file tree has two log sub-transaction, when we are committing
a log sub-transaction, the other one may be started by the other file sync tasks.
It is very likely that there is no any error happens in the former, but some write
errors happen in the latter. The above code might clear the number of that errors.

So I think the variant for log write error should be created for each log sub-transaction.

Thanks
Miao

> +
> +	if (errors && !werr)
> +		werr = -EIO;
> +
>  	return werr;
>  }
>  
> @@ -874,6 +888,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
>   * those extents are on disk for transaction or log commit
>   */
>  static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
> +				struct btrfs_trans_handle *trans,
>  				struct extent_io_tree *dirty_pages, int mark)
>  {
>  	int ret;
> @@ -883,7 +898,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
>  	blk_start_plug(&plug);
>  	ret = btrfs_write_marked_extents(root, dirty_pages, mark);
>  	blk_finish_plug(&plug);
> -	ret2 = btrfs_wait_marked_extents(root, dirty_pages, mark);
> +	ret2 = btrfs_wait_marked_extents(root, trans, dirty_pages, mark);
>  
>  	if (ret)
>  		return ret;
> @@ -892,7 +907,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
>  	return 0;
>  }
>  
> -int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
> +static int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
>  				     struct btrfs_root *root)
>  {
>  	if (!trans || !trans->transaction) {
> @@ -900,7 +915,7 @@ int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
>  		btree_inode = root->fs_info->btree_inode;
>  		return filemap_write_and_wait(btree_inode->i_mapping);
>  	}
> -	return btrfs_write_and_wait_marked_extents(root,
> +	return btrfs_write_and_wait_marked_extents(root, trans,
>  					   &trans->transaction->dirty_pages,
>  					   EXTENT_DIRTY);
>  }
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 7dd558e..311f3e3 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -46,6 +46,8 @@ struct btrfs_transaction {
>  	 */
>  	atomic_t num_writers;
>  	atomic_t use_count;
> +	atomic_t eb_write_errors;
> +	atomic_t log_eb_write_errors;
>  
>  	/* Be protected by fs_info->trans_lock when we want to change it. */
>  	enum btrfs_trans_state state;
> @@ -146,8 +148,6 @@ struct btrfs_trans_handle *btrfs_attach_transaction_barrier(
>  					struct btrfs_root *root);
>  struct btrfs_trans_handle *btrfs_start_ioctl_transaction(struct btrfs_root *root);
>  int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid);
> -int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
> -				     struct btrfs_root *root);
>  
>  void btrfs_add_dead_root(struct btrfs_root *root);
>  int btrfs_defrag_root(struct btrfs_root *root);
> @@ -167,6 +167,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
>  int btrfs_write_marked_extents(struct btrfs_root *root,
>  				struct extent_io_tree *dirty_pages, int mark);
>  int btrfs_wait_marked_extents(struct btrfs_root *root,
> +			      struct btrfs_trans_handle *trans,
>  				struct extent_io_tree *dirty_pages, int mark);
>  int btrfs_transaction_blocked(struct btrfs_fs_info *info);
>  int btrfs_transaction_in_commit(struct btrfs_fs_info *info);
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 2d0fa43..22ffd32 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -2583,7 +2583,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  			mutex_unlock(&log_root_tree->log_mutex);
>  			goto out;
>  		}
> -		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
> +		btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
> +					  mark);
>  		btrfs_free_logged_extents(log, log_transid);
>  		mutex_unlock(&log_root_tree->log_mutex);
>  		ret = -EAGAIN;
> @@ -2599,7 +2600,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  	index2 = root_log_ctx.log_transid % 2;
>  	if (atomic_read(&log_root_tree->log_commit[index2])) {
>  		blk_finish_plug(&plug);
> -		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
> +		btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
> +					  mark);
>  		wait_log_commit(trans, log_root_tree,
>  				root_log_ctx.log_transid);
>  		btrfs_free_logged_extents(log, log_transid);
> @@ -2623,7 +2625,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  	 */
>  	if (btrfs_need_log_full_commit(root->fs_info, trans)) {
>  		blk_finish_plug(&plug);
> -		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
> +		btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
> +					  mark);
>  		btrfs_free_logged_extents(log, log_transid);
>  		mutex_unlock(&log_root_tree->log_mutex);
>  		ret = -EAGAIN;
> @@ -2641,8 +2644,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  		mutex_unlock(&log_root_tree->log_mutex);
>  		goto out_wake_log_root;
>  	}
> -	btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
> -	btrfs_wait_marked_extents(log_root_tree,
> +	btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages, mark);
> +	btrfs_wait_marked_extents(log_root_tree, trans,
>  				  &log_root_tree->dirty_log_pages,
>  				  EXTENT_NEW | EXTENT_DIRTY);
>  	btrfs_wait_logged_extents(log, log_transid);
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2 v5] Btrfs: be aware of btree inode write errors to avoid fs corruption
  2014-09-24 11:16     ` Miao Xie
@ 2014-09-24 12:59       ` Filipe David Manana
  0 siblings, 0 replies; 21+ messages in thread
From: Filipe David Manana @ 2014-09-24 12:59 UTC (permalink / raw)
  To: Miao Xie; +Cc: Filipe Manana, linux-btrfs

On Wed, Sep 24, 2014 at 12:16 PM, Miao Xie <miaox@cn.fujitsu.com> wrote:
> On Wed, 24 Sep 2014 11:28:26 +0100, Filipe Manana wrote:
> [SNIP]
>>  int btrfs_wait_marked_extents(struct btrfs_root *root,
>> +                           struct btrfs_trans_handle *trans,
>>                             struct extent_io_tree *dirty_pages, int mark)
>>  {
>>       int err = 0;
>> @@ -852,6 +855,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
>>       struct extent_state *cached_state = NULL;
>>       u64 start = 0;
>>       u64 end;
>> +     int errors;
>>
>>       while (!find_first_extent_bit(dirty_pages, start, &start, &end,
>>                                     EXTENT_NEED_WAIT, &cached_state)) {
>> @@ -865,6 +869,16 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
>>       }
>>       if (err)
>>               werr = err;
>> +
>> +     if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID)
>> +             errors = atomic_xchg(
>> +                 &trans->transaction->log_eb_write_errors, 0);
>> +     else
>> +             errors = atomic_xchg(&trans->transaction->eb_write_errors, 0);
>
> There is a bug in log tree commit case.
> As we know, each fs/file tree has two log sub-transaction, when we are committing
> a log sub-transaction, the other one may be started by the other file sync tasks.
> It is very likely that there is no any error happens in the former, but some write
> errors happen in the latter. The above code might clear the number of that errors.
>
> So I think the variant for log write error should be created for each log sub-transaction.

Right.
I'm fixing that and another issue I missed before.

thanks

>
> Thanks
> Miao
>
>> +
>> +     if (errors && !werr)
>> +             werr = -EIO;
>> +
>>       return werr;
>>  }
>>
>> @@ -874,6 +888,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
>>   * those extents are on disk for transaction or log commit
>>   */
>>  static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
>> +                             struct btrfs_trans_handle *trans,
>>                               struct extent_io_tree *dirty_pages, int mark)
>>  {
>>       int ret;
>> @@ -883,7 +898,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
>>       blk_start_plug(&plug);
>>       ret = btrfs_write_marked_extents(root, dirty_pages, mark);
>>       blk_finish_plug(&plug);
>> -     ret2 = btrfs_wait_marked_extents(root, dirty_pages, mark);
>> +     ret2 = btrfs_wait_marked_extents(root, trans, dirty_pages, mark);
>>
>>       if (ret)
>>               return ret;
>> @@ -892,7 +907,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
>>       return 0;
>>  }
>>
>> -int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
>> +static int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
>>                                    struct btrfs_root *root)
>>  {
>>       if (!trans || !trans->transaction) {
>> @@ -900,7 +915,7 @@ int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
>>               btree_inode = root->fs_info->btree_inode;
>>               return filemap_write_and_wait(btree_inode->i_mapping);
>>       }
>> -     return btrfs_write_and_wait_marked_extents(root,
>> +     return btrfs_write_and_wait_marked_extents(root, trans,
>>                                          &trans->transaction->dirty_pages,
>>                                          EXTENT_DIRTY);
>>  }
>> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
>> index 7dd558e..311f3e3 100644
>> --- a/fs/btrfs/transaction.h
>> +++ b/fs/btrfs/transaction.h
>> @@ -46,6 +46,8 @@ struct btrfs_transaction {
>>        */
>>       atomic_t num_writers;
>>       atomic_t use_count;
>> +     atomic_t eb_write_errors;
>> +     atomic_t log_eb_write_errors;
>>
>>       /* Be protected by fs_info->trans_lock when we want to change it. */
>>       enum btrfs_trans_state state;
>> @@ -146,8 +148,6 @@ struct btrfs_trans_handle *btrfs_attach_transaction_barrier(
>>                                       struct btrfs_root *root);
>>  struct btrfs_trans_handle *btrfs_start_ioctl_transaction(struct btrfs_root *root);
>>  int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid);
>> -int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
>> -                                  struct btrfs_root *root);
>>
>>  void btrfs_add_dead_root(struct btrfs_root *root);
>>  int btrfs_defrag_root(struct btrfs_root *root);
>> @@ -167,6 +167,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
>>  int btrfs_write_marked_extents(struct btrfs_root *root,
>>                               struct extent_io_tree *dirty_pages, int mark);
>>  int btrfs_wait_marked_extents(struct btrfs_root *root,
>> +                           struct btrfs_trans_handle *trans,
>>                               struct extent_io_tree *dirty_pages, int mark);
>>  int btrfs_transaction_blocked(struct btrfs_fs_info *info);
>>  int btrfs_transaction_in_commit(struct btrfs_fs_info *info);
>> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
>> index 2d0fa43..22ffd32 100644
>> --- a/fs/btrfs/tree-log.c
>> +++ b/fs/btrfs/tree-log.c
>> @@ -2583,7 +2583,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>>                       mutex_unlock(&log_root_tree->log_mutex);
>>                       goto out;
>>               }
>> -             btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
>> +             btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
>> +                                       mark);
>>               btrfs_free_logged_extents(log, log_transid);
>>               mutex_unlock(&log_root_tree->log_mutex);
>>               ret = -EAGAIN;
>> @@ -2599,7 +2600,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>>       index2 = root_log_ctx.log_transid % 2;
>>       if (atomic_read(&log_root_tree->log_commit[index2])) {
>>               blk_finish_plug(&plug);
>> -             btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
>> +             btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
>> +                                       mark);
>>               wait_log_commit(trans, log_root_tree,
>>                               root_log_ctx.log_transid);
>>               btrfs_free_logged_extents(log, log_transid);
>> @@ -2623,7 +2625,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>>        */
>>       if (btrfs_need_log_full_commit(root->fs_info, trans)) {
>>               blk_finish_plug(&plug);
>> -             btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
>> +             btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages,
>> +                                       mark);
>>               btrfs_free_logged_extents(log, log_transid);
>>               mutex_unlock(&log_root_tree->log_mutex);
>>               ret = -EAGAIN;
>> @@ -2641,8 +2644,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>>               mutex_unlock(&log_root_tree->log_mutex);
>>               goto out_wake_log_root;
>>       }
>> -     btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
>> -     btrfs_wait_marked_extents(log_root_tree,
>> +     btrfs_wait_marked_extents(log, trans, &log->dirty_log_pages, mark);
>> +     btrfs_wait_marked_extents(log_root_tree, trans,
>>                                 &log_root_tree->dirty_log_pages,
>>                                 EXTENT_NEW | EXTENT_DIRTY);
>>       btrfs_wait_logged_extents(log, log_transid);
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 2/2 v6] Btrfs: be aware of btree inode write errors to avoid fs corruption
  2014-09-22 16:41 ` [PATCH 2/2] Btrfs: be aware of btree inode write errors to avoid fs corruption Filipe Manana
                     ` (4 preceding siblings ...)
  2014-09-24 10:28   ` [PATCH 2/2 v5] " Filipe Manana
@ 2014-09-24 17:19   ` Filipe Manana
  2014-09-24 22:20   ` [PATCH 2/2 v7] " Filipe Manana
                     ` (5 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Filipe Manana @ 2014-09-24 17:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).

Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.

Using the new counters eb_write_errors and log_eb_write_errors in the
transaction also makes us achieve the goal of AS_EIO/AS_ENOSPC when
writepages() returns success, started writeback for all dirty pages
and before filemap_fdatawait_range() is called, the writeback for
all dirty pages had already finished with errors - because we were
not using AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return
success, as it could not know that writeback errors happened (the
pages were no longer tagged for writeback).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: If an extent buffer's write failed but it's also deleted from the tree
    before the transaction commits, don't abort the transaction with -EIO,
    since the unwritten node/leaf it represents can't be pointed to by any
    other node in a tree.

V3: Correct V2, missed unstaged changes.

V4: Use root's key to figure out which counter to update.

V5: Decrement the error counters too when an eb is made dirty again (the
    next write attempt might succeed).

V6: Moved counters from transaction struct to fs_info struct, because there's
    a (short) time window where fs_info->running_transaction is NULL.
    There's now 2 counters for log extent buffers too, each one representing
    a different log transaction.

 fs/btrfs/ctree.h       |   3 ++
 fs/btrfs/disk-io.c     |   7 +++-
 fs/btrfs/extent-tree.c |   1 +
 fs/btrfs/extent_io.c   | 105 +++++++++++++++++++++++++++++++++++++++++++++----
 fs/btrfs/extent_io.h   |   4 +-
 fs/btrfs/transaction.c |  18 +++++++++
 fs/btrfs/tree-log.c    |  23 +++++++++++
 fs/btrfs/tree-log.h    |   2 +
 8 files changed, 153 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f20b60d..0e5ca39 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1510,6 +1510,9 @@ struct btrfs_fs_info {
 	atomic_t async_delalloc_pages;
 	atomic_t open_ioctl_trans;
 
+	atomic_t eb_write_errors;
+	atomic_t log_eb_write_errors[2];
+
 	/*
 	 * this is used to protect the following list -- ordered_roots.
 	 */
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 23393ec..8f1deca 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -610,7 +610,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 		goto err;
 
 	eb->read_mirror = mirror;
-	if (test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
+	if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) {
 		ret = -EIO;
 		goto err;
 	}
@@ -683,7 +683,7 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
 	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
 
 	eb = (struct extent_buffer *)page->private;
-	set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = failed_mirror;
 	atomic_dec(&eb->io_pages);
 	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
@@ -2271,6 +2271,9 @@ int open_ctree(struct super_block *sb,
 	atomic_set(&fs_info->nr_async_bios, 0);
 	atomic_set(&fs_info->defrag_running, 0);
 	atomic_set(&fs_info->qgroup_op_seq, 0);
+	atomic_set(&fs_info->eb_write_errors, 0);
+	atomic_set(&fs_info->log_eb_write_errors[0], 0);
+	atomic_set(&fs_info->log_eb_write_errors[1], 0);
 	atomic64_set(&fs_info->tree_mod_seq, 0);
 	fs_info->sb = sb;
 	fs_info->max_inline = 8192 * 1024;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ef0845d..ec185b5 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6221,6 +6221,7 @@ out:
 	 * anymore.
 	 */
 	clear_bit(EXTENT_BUFFER_CORRUPT, &buf->bflags);
+	clear_extent_buffer_write_err(buf);
 	btrfs_put_block_group(cache);
 }
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 91f866c..24adc8a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -20,6 +20,8 @@
 #include "locking.h"
 #include "rcu-string.h"
 #include "backref.h"
+#include "transaction.h"
+#include "tree-log.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -3606,6 +3608,63 @@ static void end_extent_buffer_writeback(struct extent_buffer *eb)
 	wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
 }
 
+static void set_btree_ioerr(struct page *page)
+{
+	struct extent_buffer *eb = (struct extent_buffer *)page->private;
+	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
+
+	SetPageError(page);
+	if (test_and_set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))
+		return;
+
+	/*
+	 * If writeback for a btree extent that doesn't belong to a log tree
+	 * failed, increment the counter transaction->eb_write_errors.
+	 * We do this because while the transaction is running and before it's
+	 * committing (when we call filemap_fdata[write|wait]_range against
+	 * the btree inode), we might have
+	 * btree_inode->i_mapping->a_ops->writepages() called by the VM - if it
+	 * returns an error or an error happens during writeback, when we're
+	 * committing the transaction we wouldn't know about it, since the pages
+	 * can be no longer dirty nor marked anymore for writeback (if a
+	 * subsequent modification to the extent buffer didn't happen before the
+	 * transaction commit), which makes filemap_fdata[write|wait]_range not
+	 * able to find the pages tagged with SetPageError at transaction
+	 * commit time. So if this happens we must abort the transaction,
+	 * otherwise we commit a super block with btree roots that point to
+	 * btree nodes/leafs whose content on disk is invalid - either garbage
+	 * or the content of some node/leaf from a past generation that got
+	 * cowed or deleted and is no longer valid.
+	 *
+	 * Note: setting AS_EIO/AS_ENOSPC in the btree inode's i_mapping would
+	 * not be enough - we need to distinguish between log tree extents vs
+	 * non-log tree extents, and the next filemap_fdatawait_range() call
+	 * will catch and clear such errors in the mapping - and that call might
+	 * be from a log sync and not from a transaction commit. Also, checking
+	 * for the eb flag EXTENT_BUFFER_WRITE_ERR at transaction commit time is
+	 * not done and would not be completely reliable, as the eb might be
+	 * removed from memory and read back when trying to get it, which clears
+	 * that flag right before reading the eb's pages from disk, making us
+	 * not know about the previous write error.
+	 *
+	 * Using transaction counters eb_write_errors and log_eb_write_errors
+	 * also makes us achieve the goal of AS_EIO/AS_ENOSPC when writepages()
+	 * returns success, started writeback for all dirty pages and before
+	 * filemap_fdatawait_range() is called, the writeback for all dirty
+	 * pages had already finished with errors - because we were not using
+	 * AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return success, as
+	 * it could not know that writeback errors happened (the pages were no
+	 * longer tagged for writeback).
+	 */
+	if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
+		const int i = extent_buffer_log_index(root, eb);
+
+		atomic_inc(&eb->fs_info->log_eb_write_errors[i]);
+	} else {
+		atomic_inc(&eb->fs_info->eb_write_errors);
+	}
+}
+
 static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 {
 	struct bio_vec *bvec;
@@ -3619,10 +3678,9 @@ static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 		BUG_ON(!eb);
 		done = atomic_dec_and_test(&eb->io_pages);
 
-		if (err || test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+		if (err || test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
 			ClearPageUptodate(page);
-			SetPageError(page);
+			set_btree_ioerr(page);
 		}
 
 		end_page_writeback(page);
@@ -3649,7 +3707,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	int rw = (epd->sync_io ? WRITE_SYNC : WRITE) | REQ_META;
 	int ret = 0;
 
-	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	clear_extent_buffer_write_err(eb);
 	num_pages = num_extent_pages(eb->start, eb->len);
 	atomic_set(&eb->io_pages, num_pages);
 	if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
@@ -3666,8 +3724,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 					 0, epd->bio_flags, bio_flags);
 		epd->bio_flags = bio_flags;
 		if (ret) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
-			SetPageError(p);
+			set_btree_ioerr(p);
 			end_page_writeback(p);
 			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
 				end_extent_buffer_writeback(eb);
@@ -5069,7 +5126,7 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 		goto unlock_exit;
 	}
 
-	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = 0;
 	atomic_set(&eb->io_pages, num_reads);
 	for (i = start_i; i < num_pages; i++) {
@@ -5514,3 +5571,37 @@ int try_release_extent_buffer(struct page *page)
 
 	return release_extent_buffer(eb);
 }
+
+void clear_extent_buffer_write_err(struct extent_buffer *eb)
+{
+	/*
+	 * Ignore/discard a write IO error when:
+	 *
+	 * 1) The unwritten node/leaf isn't pointed to by any other node in
+	 * a tree because it was deleted - so it's safe to forget about the
+	 * write error and avoid a transaction abort;
+	 *
+	 * 2) The same eb was marked dirty again, so we will attempt to write
+	 * it to disk again, which might succeed this time.
+	 */
+	if (test_and_clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
+		struct btrfs_root *root;
+
+		root = BTRFS_I(eb->pages[0]->mapping->host)->root;
+
+		if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
+			const int i = extent_buffer_log_index(root, eb);
+
+			atomic_dec(&eb->fs_info->log_eb_write_errors[i]);
+		} else {
+			/*
+			 * The counter can be 0 here, because the previous
+			 * current transaction (fs_info->running_transaction)
+			 * just called btrfs_wait_marked_extents() and is about
+			 * to set the fs to read-only mode and abort.
+			 */
+			atomic_add_unless(&eb->fs_info->eb_write_errors, -1, 0);
+		}
+	}
+
+}
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 5e91fb9..2c25cc9 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -41,9 +41,10 @@
 #define EXTENT_BUFFER_TREE_REF 5
 #define EXTENT_BUFFER_STALE 6
 #define EXTENT_BUFFER_WRITEBACK 7
-#define EXTENT_BUFFER_IOERR 8
+#define EXTENT_BUFFER_READ_ERR 8        /* read IO error */
 #define EXTENT_BUFFER_DUMMY 9
 #define EXTENT_BUFFER_IN_TREE 10
+#define EXTENT_BUFFER_WRITE_ERR 11    /* write IO error */
 
 /* these are flags for extent_clear_unlock_delalloc */
 #define PAGE_UNLOCK		(1 << 0)
@@ -314,6 +315,7 @@ void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
 			   unsigned long src_offset, unsigned long len);
 void memset_extent_buffer(struct extent_buffer *eb, char c,
 			  unsigned long start, unsigned long len);
+void clear_extent_buffer_write_err(struct extent_buffer *eb);
 void clear_extent_buffer_dirty(struct extent_buffer *eb);
 int set_extent_buffer_dirty(struct extent_buffer *eb);
 int set_extent_buffer_uptodate(struct extent_buffer *eb);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 1e272c0..6e696a0 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -852,6 +852,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 	struct extent_state *cached_state = NULL;
 	u64 start = 0;
 	u64 end;
+	int errors;
 
 	while (!find_first_extent_bit(dirty_pages, start, &start, &end,
 				      EXTENT_NEED_WAIT, &cached_state)) {
@@ -865,6 +866,23 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 	}
 	if (err)
 		werr = err;
+
+	if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
+		atomic_t *log_errors = root->fs_info->log_eb_write_errors;
+
+		errors = 0;
+		if (mark & EXTENT_DIRTY)
+			errors += atomic_xchg(&log_errors[0], 0);
+		if (mark & EXTENT_NEW)
+			errors += atomic_xchg(&log_errors[1], 0);
+	} else {
+		errors = atomic_xchg(&root->fs_info->eb_write_errors, 0);
+	}
+
+	ASSERT(errors >= 0);
+	if (errors > 0 && !werr)
+		werr = -EIO;
+
 	return werr;
 }
 
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 2d0fa43..9686be6 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4591,3 +4591,26 @@ int btrfs_log_new_name(struct btrfs_trans_handle *trans,
 				      LLONG_MAX, 1, NULL);
 }
 
+int extent_buffer_log_index(struct btrfs_root *root,
+			    struct extent_buffer *eb)
+{
+	const u64 start = eb->start;
+	const u64 end = eb->start + eb->len - 1;
+	int index;
+	int ret;
+
+	ret = test_range_bit(&root->dirty_log_pages, start, end,
+			     EXTENT_DIRTY, 1, NULL);
+	if (ret) {
+		index = 0;
+	} else {
+#ifdef CONFIG_BTRFS_ASSERT
+		ret = test_range_bit(&root->dirty_log_pages, start, end,
+				     EXTENT_NEW, 1, NULL);
+		ASSERT(ret);
+#endif
+		index = 1;
+	}
+
+	return index;
+}
diff --git a/fs/btrfs/tree-log.h b/fs/btrfs/tree-log.h
index 154990c..6e5c7f8 100644
--- a/fs/btrfs/tree-log.h
+++ b/fs/btrfs/tree-log.h
@@ -80,4 +80,6 @@ void btrfs_record_unlink_dir(struct btrfs_trans_handle *trans,
 int btrfs_log_new_name(struct btrfs_trans_handle *trans,
 			struct inode *inode, struct inode *old_dir,
 			struct dentry *parent);
+int extent_buffer_log_index(struct btrfs_root *root,
+			    struct extent_buffer *eb);
 #endif
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/2 v7] Btrfs: be aware of btree inode write errors to avoid fs corruption
  2014-09-22 16:41 ` [PATCH 2/2] Btrfs: be aware of btree inode write errors to avoid fs corruption Filipe Manana
                     ` (5 preceding siblings ...)
  2014-09-24 17:19   ` [PATCH 2/2 v6] " Filipe Manana
@ 2014-09-24 22:20   ` Filipe Manana
  2014-09-25 11:12   ` [PATCH 2/2 v8] " Filipe Manana
                     ` (4 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Filipe Manana @ 2014-09-24 22:20 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).

Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.

Using the new counters eb_write_errors and log_eb_write_errors in the
transaction also makes us achieve the goal of AS_EIO/AS_ENOSPC when
writepages() returns success, started writeback for all dirty pages
and before filemap_fdatawait_range() is called, the writeback for
all dirty pages had already finished with errors - because we were
not using AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return
success, as it could not know that writeback errors happened (the
pages were no longer tagged for writeback).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: If an extent buffer's write failed but it's also deleted from the tree
    before the transaction commits, don't abort the transaction with -EIO,
    since the unwritten node/leaf it represents can't be pointed to by any
    other node in a tree.

V3: Correct V2, missed unstaged changes.

V4: Use root's key to figure out which counter to update.

V5: Decrement the error counters too when an eb is made dirty again (the
    next write attempt might succeed).

V6: Moved counters from transaction struct to fs_info struct, because there's
    a (short) time window where fs_info->running_transaction is NULL.
    There's now 2 counters for log extent buffers too, each one representing
    a different log transaction.

V7: Track the eb's log index in the eb itself, otherwise it wasn't possible
    to find it when writeback triggered from a transaction commit.

 fs/btrfs/ctree.h       |   3 ++
 fs/btrfs/disk-io.c     |   7 +++-
 fs/btrfs/extent-tree.c |   4 +-
 fs/btrfs/extent_io.c   | 101 +++++++++++++++++++++++++++++++++++++++++++++----
 fs/btrfs/extent_io.h   |   6 ++-
 fs/btrfs/transaction.c |  18 +++++++++
 6 files changed, 128 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f20b60d..0e5ca39 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1510,6 +1510,9 @@ struct btrfs_fs_info {
 	atomic_t async_delalloc_pages;
 	atomic_t open_ioctl_trans;
 
+	atomic_t eb_write_errors;
+	atomic_t log_eb_write_errors[2];
+
 	/*
 	 * this is used to protect the following list -- ordered_roots.
 	 */
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 23393ec..8f1deca 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -610,7 +610,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 		goto err;
 
 	eb->read_mirror = mirror;
-	if (test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
+	if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) {
 		ret = -EIO;
 		goto err;
 	}
@@ -683,7 +683,7 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
 	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
 
 	eb = (struct extent_buffer *)page->private;
-	set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = failed_mirror;
 	atomic_dec(&eb->io_pages);
 	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
@@ -2271,6 +2271,9 @@ int open_ctree(struct super_block *sb,
 	atomic_set(&fs_info->nr_async_bios, 0);
 	atomic_set(&fs_info->defrag_running, 0);
 	atomic_set(&fs_info->qgroup_op_seq, 0);
+	atomic_set(&fs_info->eb_write_errors, 0);
+	atomic_set(&fs_info->log_eb_write_errors[0], 0);
+	atomic_set(&fs_info->log_eb_write_errors[1], 0);
 	atomic64_set(&fs_info->tree_mod_seq, 0);
 	fs_info->sb = sb;
 	fs_info->max_inline = 8192 * 1024;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ef0845d..9442f3c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6221,6 +6221,7 @@ out:
 	 * anymore.
 	 */
 	clear_bit(EXTENT_BUFFER_CORRUPT, &buf->bflags);
+	clear_extent_buffer_write_err(buf);
 	btrfs_put_block_group(cache);
 }
 
@@ -7203,11 +7204,12 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	btrfs_set_buffer_uptodate(buf);
 
 	if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
+		buf->log_index = root->log_transid % 2;
 		/*
 		 * we allow two log transactions at a time, use different
 		 * EXENT bit to differentiate dirty pages.
 		 */
-		if (root->log_transid % 2 == 0)
+		if (buf->log_index == 0)
 			set_extent_dirty(&root->dirty_log_pages, buf->start,
 					buf->start + buf->len - 1, GFP_NOFS);
 		else
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 91f866c..c1139ae 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -20,6 +20,7 @@
 #include "locking.h"
 #include "rcu-string.h"
 #include "backref.h"
+#include "transaction.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -3606,6 +3607,60 @@ static void end_extent_buffer_writeback(struct extent_buffer *eb)
 	wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
 }
 
+static void set_btree_ioerr(struct page *page)
+{
+	struct extent_buffer *eb = (struct extent_buffer *)page->private;
+	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
+
+	SetPageError(page);
+	if (test_and_set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))
+		return;
+
+	/*
+	 * If writeback for a btree extent that doesn't belong to a log tree
+	 * failed, increment the counter transaction->eb_write_errors.
+	 * We do this because while the transaction is running and before it's
+	 * committing (when we call filemap_fdata[write|wait]_range against
+	 * the btree inode), we might have
+	 * btree_inode->i_mapping->a_ops->writepages() called by the VM - if it
+	 * returns an error or an error happens during writeback, when we're
+	 * committing the transaction we wouldn't know about it, since the pages
+	 * can be no longer dirty nor marked anymore for writeback (if a
+	 * subsequent modification to the extent buffer didn't happen before the
+	 * transaction commit), which makes filemap_fdata[write|wait]_range not
+	 * able to find the pages tagged with SetPageError at transaction
+	 * commit time. So if this happens we must abort the transaction,
+	 * otherwise we commit a super block with btree roots that point to
+	 * btree nodes/leafs whose content on disk is invalid - either garbage
+	 * or the content of some node/leaf from a past generation that got
+	 * cowed or deleted and is no longer valid.
+	 *
+	 * Note: setting AS_EIO/AS_ENOSPC in the btree inode's i_mapping would
+	 * not be enough - we need to distinguish between log tree extents vs
+	 * non-log tree extents, and the next filemap_fdatawait_range() call
+	 * will catch and clear such errors in the mapping - and that call might
+	 * be from a log sync and not from a transaction commit. Also, checking
+	 * for the eb flag EXTENT_BUFFER_WRITE_ERR at transaction commit time is
+	 * not done and would not be completely reliable, as the eb might be
+	 * removed from memory and read back when trying to get it, which clears
+	 * that flag right before reading the eb's pages from disk, making us
+	 * not know about the previous write error.
+	 *
+	 * Using transaction counters eb_write_errors and log_eb_write_errors
+	 * also makes us achieve the goal of AS_EIO/AS_ENOSPC when writepages()
+	 * returns success, started writeback for all dirty pages and before
+	 * filemap_fdatawait_range() is called, the writeback for all dirty
+	 * pages had already finished with errors - because we were not using
+	 * AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return success, as
+	 * it could not know that writeback errors happened (the pages were no
+	 * longer tagged for writeback).
+	 */
+	if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID)
+		atomic_inc(&eb->fs_info->log_eb_write_errors[eb->log_index]);
+	else
+		atomic_inc(&eb->fs_info->eb_write_errors);
+}
+
 static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 {
 	struct bio_vec *bvec;
@@ -3619,10 +3674,9 @@ static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 		BUG_ON(!eb);
 		done = atomic_dec_and_test(&eb->io_pages);
 
-		if (err || test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+		if (err || test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
 			ClearPageUptodate(page);
-			SetPageError(page);
+			set_btree_ioerr(page);
 		}
 
 		end_page_writeback(page);
@@ -3649,7 +3703,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	int rw = (epd->sync_io ? WRITE_SYNC : WRITE) | REQ_META;
 	int ret = 0;
 
-	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	clear_extent_buffer_write_err(eb);
 	num_pages = num_extent_pages(eb->start, eb->len);
 	atomic_set(&eb->io_pages, num_pages);
 	if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
@@ -3666,8 +3720,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 					 0, epd->bio_flags, bio_flags);
 		epd->bio_flags = bio_flags;
 		if (ret) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
-			SetPageError(p);
+			set_btree_ioerr(p);
 			end_page_writeback(p);
 			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
 				end_extent_buffer_writeback(eb);
@@ -5069,7 +5122,7 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 		goto unlock_exit;
 	}
 
-	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = 0;
 	atomic_set(&eb->io_pages, num_reads);
 	for (i = start_i; i < num_pages; i++) {
@@ -5514,3 +5567,37 @@ int try_release_extent_buffer(struct page *page)
 
 	return release_extent_buffer(eb);
 }
+
+void clear_extent_buffer_write_err(struct extent_buffer *eb)
+{
+	/*
+	 * Ignore/discard a write IO error when:
+	 *
+	 * 1) The unwritten node/leaf isn't pointed to by any other node in
+	 * a tree because it was deleted - so it's safe to forget about the
+	 * write error and avoid a transaction abort;
+	 *
+	 * 2) The same eb was marked dirty again, so we will attempt to write
+	 * it to disk again, which might succeed this time.
+	 */
+	if (test_and_clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
+		struct btrfs_root *root;
+
+		root = BTRFS_I(eb->pages[0]->mapping->host)->root;
+
+		if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
+			const int i = eb->log_index;
+
+			atomic_dec(&eb->fs_info->log_eb_write_errors[i]);
+		} else {
+			/*
+			 * The counter can be 0 here, because the previous
+			 * current transaction (fs_info->running_transaction)
+			 * just called btrfs_wait_marked_extents() and is about
+			 * to set the fs to read-only mode and abort.
+			 */
+			atomic_add_unless(&eb->fs_info->eb_write_errors, -1, 0);
+		}
+	}
+
+}
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 5e91fb9..f9ba8c4 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -41,9 +41,10 @@
 #define EXTENT_BUFFER_TREE_REF 5
 #define EXTENT_BUFFER_STALE 6
 #define EXTENT_BUFFER_WRITEBACK 7
-#define EXTENT_BUFFER_IOERR 8
+#define EXTENT_BUFFER_READ_ERR 8        /* read IO error */
 #define EXTENT_BUFFER_DUMMY 9
 #define EXTENT_BUFFER_IN_TREE 10
+#define EXTENT_BUFFER_WRITE_ERR 11    /* write IO error */
 
 /* these are flags for extent_clear_unlock_delalloc */
 #define PAGE_UNLOCK		(1 << 0)
@@ -142,6 +143,8 @@ struct extent_buffer {
 	atomic_t spinning_readers;
 	atomic_t spinning_writers;
 	int lock_nested;
+	/* meaningful only if the owner root is from a log tree */
+	int log_index;
 
 	/* protects write locks */
 	rwlock_t lock;
@@ -314,6 +317,7 @@ void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
 			   unsigned long src_offset, unsigned long len);
 void memset_extent_buffer(struct extent_buffer *eb, char c,
 			  unsigned long start, unsigned long len);
+void clear_extent_buffer_write_err(struct extent_buffer *eb);
 void clear_extent_buffer_dirty(struct extent_buffer *eb);
 int set_extent_buffer_dirty(struct extent_buffer *eb);
 int set_extent_buffer_uptodate(struct extent_buffer *eb);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 1e272c0..6e696a0 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -852,6 +852,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 	struct extent_state *cached_state = NULL;
 	u64 start = 0;
 	u64 end;
+	int errors;
 
 	while (!find_first_extent_bit(dirty_pages, start, &start, &end,
 				      EXTENT_NEED_WAIT, &cached_state)) {
@@ -865,6 +866,23 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 	}
 	if (err)
 		werr = err;
+
+	if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
+		atomic_t *log_errors = root->fs_info->log_eb_write_errors;
+
+		errors = 0;
+		if (mark & EXTENT_DIRTY)
+			errors += atomic_xchg(&log_errors[0], 0);
+		if (mark & EXTENT_NEW)
+			errors += atomic_xchg(&log_errors[1], 0);
+	} else {
+		errors = atomic_xchg(&root->fs_info->eb_write_errors, 0);
+	}
+
+	ASSERT(errors >= 0);
+	if (errors > 0 && !werr)
+		werr = -EIO;
+
 	return werr;
 }
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/2 v8] Btrfs: be aware of btree inode write errors to avoid fs corruption
  2014-09-22 16:41 ` [PATCH 2/2] Btrfs: be aware of btree inode write errors to avoid fs corruption Filipe Manana
                     ` (6 preceding siblings ...)
  2014-09-24 22:20   ` [PATCH 2/2 v7] " Filipe Manana
@ 2014-09-25 11:12   ` Filipe Manana
  2014-09-25 16:33   ` [PATCH 2/2 v9] " Filipe Manana
                     ` (3 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Filipe Manana @ 2014-09-25 11:12 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).

Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.

Using the new counters eb_write_errors and log_eb_write_errors in the
transaction also makes us achieve the goal of AS_EIO/AS_ENOSPC when
writepages() returns success, started writeback for all dirty pages
and before filemap_fdatawait_range() is called, the writeback for
all dirty pages had already finished with errors - because we were
not using AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return
success, as it could not know that writeback errors happened (the
pages were no longer tagged for writeback).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: If an extent buffer's write failed but it's also deleted from the tree
    before the transaction commits, don't abort the transaction with -EIO,
    since the unwritten node/leaf it represents can't be pointed to by any
    other node in a tree.

V3: Correct V2, missed unstaged changes.

V4: Use root's key to figure out which counter to update.

V5: Decrement the error counters too when an eb is made dirty again (the
    next write attempt might succeed).

V6: Moved counters from transaction struct to fs_info struct, because there's
    a (short) time window where fs_info->running_transaction is NULL.
    There's now 2 counters for log extent buffers too, each one representing
    a different log transaction.

V7: Track the eb's log index in the eb itself, otherwise it wasn't possible
    to find it when writeback triggered from a transaction commit.

V8: Track the log eb write errors per root instead, and reset them on a
    transaction commit.

 fs/btrfs/ctree.h       |   2 +
 fs/btrfs/disk-io.c     |   7 +++-
 fs/btrfs/extent-tree.c |   4 +-
 fs/btrfs/extent_io.c   | 100 +++++++++++++++++++++++++++++++++++++++++++++----
 fs/btrfs/extent_io.h   |   8 +++-
 fs/btrfs/transaction.c |  18 +++++++++
 fs/btrfs/tree-log.c    |   2 +
 7 files changed, 129 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f20b60d..96f5186 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1509,6 +1509,7 @@ struct btrfs_fs_info {
 	atomic_t nr_async_bios;
 	atomic_t async_delalloc_pages;
 	atomic_t open_ioctl_trans;
+	atomic_t eb_write_errors;
 
 	/*
 	 * this is used to protect the following list -- ordered_roots.
@@ -1790,6 +1791,7 @@ struct btrfs_root {
 	atomic_t log_writers;
 	atomic_t log_commit[2];
 	atomic_t log_batch;
+	atomic_t log_eb_write_errors[2];
 	int log_transid;
 	/* No matter the commit succeeds or not*/
 	int log_transid_committed;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 23393ec..e792ee3 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -610,7 +610,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 		goto err;
 
 	eb->read_mirror = mirror;
-	if (test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
+	if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) {
 		ret = -EIO;
 		goto err;
 	}
@@ -683,7 +683,7 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
 	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
 
 	eb = (struct extent_buffer *)page->private;
-	set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = failed_mirror;
 	atomic_dec(&eb->io_pages);
 	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
@@ -1277,6 +1277,8 @@ static void __setup_root(u32 nodesize, u32 sectorsize, u32 stripesize,
 	atomic_set(&root->log_commit[1], 0);
 	atomic_set(&root->log_writers, 0);
 	atomic_set(&root->log_batch, 0);
+	atomic_set(&root->log_eb_write_errors[0], 0);
+	atomic_set(&root->log_eb_write_errors[1], 0);
 	atomic_set(&root->orphan_inodes, 0);
 	atomic_set(&root->refs, 1);
 	atomic_set(&root->will_be_snapshoted, 0);
@@ -2271,6 +2273,7 @@ int open_ctree(struct super_block *sb,
 	atomic_set(&fs_info->nr_async_bios, 0);
 	atomic_set(&fs_info->defrag_running, 0);
 	atomic_set(&fs_info->qgroup_op_seq, 0);
+	atomic_set(&fs_info->eb_write_errors, 0);
 	atomic64_set(&fs_info->tree_mod_seq, 0);
 	fs_info->sb = sb;
 	fs_info->max_inline = 8192 * 1024;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ef0845d..9442f3c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6221,6 +6221,7 @@ out:
 	 * anymore.
 	 */
 	clear_bit(EXTENT_BUFFER_CORRUPT, &buf->bflags);
+	clear_extent_buffer_write_err(buf);
 	btrfs_put_block_group(cache);
 }
 
@@ -7203,11 +7204,12 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	btrfs_set_buffer_uptodate(buf);
 
 	if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
+		buf->log_index = root->log_transid % 2;
 		/*
 		 * we allow two log transactions at a time, use different
 		 * EXENT bit to differentiate dirty pages.
 		 */
-		if (root->log_transid % 2 == 0)
+		if (buf->log_index == 0)
 			set_extent_dirty(&root->dirty_log_pages, buf->start,
 					buf->start + buf->len - 1, GFP_NOFS);
 		else
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 91f866c..81432c7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -20,6 +20,7 @@
 #include "locking.h"
 #include "rcu-string.h"
 #include "backref.h"
+#include "transaction.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -3606,6 +3607,60 @@ static void end_extent_buffer_writeback(struct extent_buffer *eb)
 	wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
 }
 
+static void set_btree_ioerr(struct page *page)
+{
+	struct extent_buffer *eb = (struct extent_buffer *)page->private;
+	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
+
+	SetPageError(page);
+	if (test_and_set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))
+		return;
+
+	/*
+	 * If writeback for a btree extent that doesn't belong to a log tree
+	 * failed, increment the counter transaction->eb_write_errors.
+	 * We do this because while the transaction is running and before it's
+	 * committing (when we call filemap_fdata[write|wait]_range against
+	 * the btree inode), we might have
+	 * btree_inode->i_mapping->a_ops->writepages() called by the VM - if it
+	 * returns an error or an error happens during writeback, when we're
+	 * committing the transaction we wouldn't know about it, since the pages
+	 * can be no longer dirty nor marked anymore for writeback (if a
+	 * subsequent modification to the extent buffer didn't happen before the
+	 * transaction commit), which makes filemap_fdata[write|wait]_range not
+	 * able to find the pages tagged with SetPageError at transaction
+	 * commit time. So if this happens we must abort the transaction,
+	 * otherwise we commit a super block with btree roots that point to
+	 * btree nodes/leafs whose content on disk is invalid - either garbage
+	 * or the content of some node/leaf from a past generation that got
+	 * cowed or deleted and is no longer valid.
+	 *
+	 * Note: setting AS_EIO/AS_ENOSPC in the btree inode's i_mapping would
+	 * not be enough - we need to distinguish between log tree extents vs
+	 * non-log tree extents, and the next filemap_fdatawait_range() call
+	 * will catch and clear such errors in the mapping - and that call might
+	 * be from a log sync and not from a transaction commit. Also, checking
+	 * for the eb flag EXTENT_BUFFER_WRITE_ERR at transaction commit time is
+	 * not done and would not be completely reliable, as the eb might be
+	 * removed from memory and read back when trying to get it, which clears
+	 * that flag right before reading the eb's pages from disk, making us
+	 * not know about the previous write error.
+	 *
+	 * Using transaction counters eb_write_errors and log_eb_write_errors
+	 * also makes us achieve the goal of AS_EIO/AS_ENOSPC when writepages()
+	 * returns success, started writeback for all dirty pages and before
+	 * filemap_fdatawait_range() is called, the writeback for all dirty
+	 * pages had already finished with errors - because we were not using
+	 * AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return success, as
+	 * it could not know that writeback errors happened (the pages were no
+	 * longer tagged for writeback).
+	 */
+	if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID)
+		atomic_inc(&root->log_eb_write_errors[eb->log_index]);
+	else
+		atomic_inc(&eb->fs_info->eb_write_errors);
+}
+
 static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 {
 	struct bio_vec *bvec;
@@ -3619,10 +3674,9 @@ static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 		BUG_ON(!eb);
 		done = atomic_dec_and_test(&eb->io_pages);
 
-		if (err || test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+		if (err || test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
 			ClearPageUptodate(page);
-			SetPageError(page);
+			set_btree_ioerr(page);
 		}
 
 		end_page_writeback(page);
@@ -3649,7 +3703,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	int rw = (epd->sync_io ? WRITE_SYNC : WRITE) | REQ_META;
 	int ret = 0;
 
-	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	clear_extent_buffer_write_err(eb);
 	num_pages = num_extent_pages(eb->start, eb->len);
 	atomic_set(&eb->io_pages, num_pages);
 	if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
@@ -3666,8 +3720,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 					 0, epd->bio_flags, bio_flags);
 		epd->bio_flags = bio_flags;
 		if (ret) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
-			SetPageError(p);
+			set_btree_ioerr(p);
 			end_page_writeback(p);
 			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
 				end_extent_buffer_writeback(eb);
@@ -5069,7 +5122,7 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 		goto unlock_exit;
 	}
 
-	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = 0;
 	atomic_set(&eb->io_pages, num_reads);
 	for (i = start_i; i < num_pages; i++) {
@@ -5514,3 +5567,36 @@ int try_release_extent_buffer(struct page *page)
 
 	return release_extent_buffer(eb);
 }
+
+void clear_extent_buffer_write_err(struct extent_buffer *eb)
+{
+	/*
+	 * Ignore/discard a write IO error when:
+	 *
+	 * 1) The unwritten node/leaf isn't pointed to by any other node in
+	 * a tree because it was deleted - so it's safe to forget about the
+	 * write error and avoid a transaction abort;
+	 *
+	 * 2) The same eb was marked dirty again, so we will attempt to write
+	 * it to disk again, which might succeed this time.
+	 */
+	if (test_and_clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
+		struct btrfs_root *root;
+
+		root = BTRFS_I(eb->pages[0]->mapping->host)->root;
+
+		/*
+		 * The counter can be 0 here, because the previous current
+		 * transaction (fs_info->running_transaction) just called
+		 * btrfs_wait_marked_extents() and is about to set the fs to
+		 * read-only mode and abort (if it's not a log commit).
+		 */
+		if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
+			atomic_add_unless(
+			      &root->log_eb_write_errors[eb->log_index], -1, 0);
+		} else {
+			atomic_add_unless(&eb->fs_info->eb_write_errors, -1, 0);
+		}
+	}
+
+}
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 5e91fb9..a607680 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -41,9 +41,10 @@
 #define EXTENT_BUFFER_TREE_REF 5
 #define EXTENT_BUFFER_STALE 6
 #define EXTENT_BUFFER_WRITEBACK 7
-#define EXTENT_BUFFER_IOERR 8
+#define EXTENT_BUFFER_READ_ERR 8        /* read IO error */
 #define EXTENT_BUFFER_DUMMY 9
 #define EXTENT_BUFFER_IN_TREE 10
+#define EXTENT_BUFFER_WRITE_ERR 11    /* write IO error */
 
 /* these are flags for extent_clear_unlock_delalloc */
 #define PAGE_UNLOCK		(1 << 0)
@@ -141,7 +142,9 @@ struct extent_buffer {
 	atomic_t blocking_readers;
 	atomic_t spinning_readers;
 	atomic_t spinning_writers;
-	int lock_nested;
+	unsigned int lock_nested:1;
+	/* meaningful only if the owner root is from a log tree */
+	unsigned int log_index:31;
 
 	/* protects write locks */
 	rwlock_t lock;
@@ -314,6 +317,7 @@ void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
 			   unsigned long src_offset, unsigned long len);
 void memset_extent_buffer(struct extent_buffer *eb, char c,
 			  unsigned long start, unsigned long len);
+void clear_extent_buffer_write_err(struct extent_buffer *eb);
 void clear_extent_buffer_dirty(struct extent_buffer *eb);
 int set_extent_buffer_dirty(struct extent_buffer *eb);
 int set_extent_buffer_uptodate(struct extent_buffer *eb);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 1e272c0..c89421b 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -852,6 +852,7 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 	struct extent_state *cached_state = NULL;
 	u64 start = 0;
 	u64 end;
+	bool errors;
 
 	while (!find_first_extent_bit(dirty_pages, start, &start, &end,
 				      EXTENT_NEED_WAIT, &cached_state)) {
@@ -865,6 +866,23 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 	}
 	if (err)
 		werr = err;
+
+	if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
+		int err0 = 0;
+		int err1 = 0;
+
+		if (mark & EXTENT_DIRTY)
+			err0 = atomic_xchg(&root->log_eb_write_errors[0], 0);
+		if (mark & EXTENT_NEW)
+			err1 = atomic_xchg(&root->log_eb_write_errors[1], 0);
+		errors = err0 || err1;
+	} else {
+		errors = atomic_xchg(&root->fs_info->eb_write_errors, 0) != 0;
+	}
+
+	if (errors && !werr)
+		werr = -EIO;
+
 	return werr;
 }
 
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 2d0fa43..a3cdeb2 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2750,6 +2750,8 @@ int btrfs_free_log(struct btrfs_trans_handle *trans, struct btrfs_root *root)
 	if (root->log_root) {
 		free_log_tree(trans, root->log_root);
 		root->log_root = NULL;
+		atomic_set(&root->log_eb_write_errors[0], 0);
+		atomic_set(&root->log_eb_write_errors[1], 0);
 	}
 	return 0;
 }
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2 v10] Btrfs: be aware of btree inode write errors to avoid fs corruption
  2014-09-25 16:51   ` [PATCH 2/2 v10] " Filipe Manana
@ 2014-09-25 16:30     ` David Sterba
  0 siblings, 0 replies; 21+ messages in thread
From: David Sterba @ 2014-09-25 16:30 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On Thu, Sep 25, 2014 at 05:51:45PM +0100, Filipe Manana wrote:
> +	switch (eb->log_index) {
> +	case -1:
> +		set_bit(BTRFS_INODE_BTREE_ERR, &btree_ino->runtime_flags);
> +		break;
> +	case 0:
> +		set_bit(BTRFS_INODE_BTREE_LOG1_ERR, &btree_ino->runtime_flags);
> +		break;
> +	case 1:
> +		set_bit(BTRFS_INODE_BTREE_LOG2_ERR, &btree_ino->runtime_flags);
> +		break;
> +	default:
> +		ASSERT(0); /* unexpected, logic error */

I think we can afford a BUG() here, preceded by a message. The logic is
pretty strict about the possible values so if we get here, something is
really bad and we don't want to make it silent.

This could happen if the memory modules are faulty and tripping over
random BUG_ONs is usually a sign of problems. I've seen that in the
past with certain CPU types during stresstests and it definetely was
worth knowing. (The BUG_ONs were in memory management, it's not a crime
to use them, only in place of proper error hanling.)

> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -141,7 +142,9 @@ struct extent_buffer {
>  	atomic_t blocking_readers;
>  	atomic_t spinning_readers;
>  	atomic_t spinning_writers;
> -	int lock_nested;
> +	unsigned int lock_nested:1;

Please do not use bitfields next to atomic_t or spin_lock (or kref_t),
this can lead to all sorts of funny problems on architectures that are
not able to do atomic bit operations and have to do read-modify-write
cycle. This leads to problems when it's on the same 8-byte word with an
atomic/... and can corrupt it if the timing is right.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 2/2 v9] Btrfs: be aware of btree inode write errors to avoid fs corruption
  2014-09-22 16:41 ` [PATCH 2/2] Btrfs: be aware of btree inode write errors to avoid fs corruption Filipe Manana
                     ` (7 preceding siblings ...)
  2014-09-25 11:12   ` [PATCH 2/2 v8] " Filipe Manana
@ 2014-09-25 16:33   ` Filipe Manana
  2014-09-25 16:51   ` [PATCH 2/2 v10] " Filipe Manana
                     ` (2 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Filipe Manana @ 2014-09-25 16:33 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).

Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.

Using the new counters eb_write_errors and log_eb_write_errors in the
transaction also makes us achieve the goal of AS_EIO/AS_ENOSPC when
writepages() returns success, started writeback for all dirty pages
and before filemap_fdatawait_range() is called, the writeback for
all dirty pages had already finished with errors - because we were
not using AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return
success, as it could not know that writeback errors happened (the
pages were no longer tagged for writeback).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: If an extent buffer's write failed but it's also deleted from the tree
    before the transaction commits, don't abort the transaction with -EIO,
    since the unwritten node/leaf it represents can't be pointed to by any
    other node in a tree.

V3: Correct V2, missed unstaged changes.

V4: Use root's key to figure out which counter to update.

V5: Decrement the error counters too when an eb is made dirty again (the
    next write attempt might succeed).

V6: Moved counters from transaction struct to fs_info struct, because there's
    a (short) time window where fs_info->running_transaction is NULL.
    There's now 2 counters for log extent buffers too, each one representing
    a different log transaction.

V7: Track the eb's log index in the eb itself, otherwise it wasn't possible
    to find it when writeback triggered from a transaction commit.

V8: Track the log eb write errors per root instead, and reset them on a
    transaction commit.

V9: Don't decrement the error counters if the eb is deleted or re-written.
    It is not safe because there's a time window when committing a transaction,
    between setting fs_info->current_transaction to NULL and checking the
    error counters in btrfs_write_and_wait_transaction(), where a new transaction
    can start and delete or re-write an eb that has the write error flag set.
    If this happens it means the previous transaction can write a superblock
    that refers to trees that point to unwritten nodes.
    Replaced the counters with simple flags in the btree inode's runtime
    flags - essentially back to V1 but accounting for the 2 different log
    sub-transactions.
    Removed access to an eb's parent root through
    BTRFS_I(eb->pages[0]->mapping->host)->root since it was not correct, as this
    always gives us the btree inode's root (objectid 1ULL). Instead use the
    field eb->log_index to know wether it's a log btree eb (and which sub-
    -transaction) or a non-log btree eb.

 fs/btrfs/btrfs_inode.h | 11 ++++++++
 fs/btrfs/disk-io.c     |  4 +--
 fs/btrfs/extent-tree.c |  4 ++-
 fs/btrfs/extent_io.c   | 76 +++++++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/extent_io.h   |  8 ++++--
 fs/btrfs/transaction.c | 22 +++++++++++++++
 fs/btrfs/tree-log.c    |  6 ++++
 7 files changed, 119 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 3511031..aee4050 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -44,6 +44,17 @@
 #define BTRFS_INODE_IN_DELALLOC_LIST		9
 #define BTRFS_INODE_READDIO_NEED_LOCK		10
 #define BTRFS_INODE_HAS_PROPS		        11
+/*
+ * The following 3 bits are meant only for the btree inode.
+ * When any of them is set, it means an error happened while writing an
+ * extent buffer belonging to:
+ * 1) a non-log btree
+ * 2) a log btree and first log sub-transaction
+ * 3) a log btree and second log sub-transaction
+ */
+#define BTRFS_INODE_BTREE_ERR		        12
+#define BTRFS_INODE_BTREE_LOG1_ERR		13
+#define BTRFS_INODE_BTREE_LOG2_ERR		14
 
 /* in memory btrfs inode */
 struct btrfs_inode {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 23393ec..8b54acf 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -610,7 +610,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 		goto err;
 
 	eb->read_mirror = mirror;
-	if (test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
+	if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) {
 		ret = -EIO;
 		goto err;
 	}
@@ -683,7 +683,7 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
 	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
 
 	eb = (struct extent_buffer *)page->private;
-	set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = failed_mirror;
 	atomic_dec(&eb->io_pages);
 	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ef0845d..c4df702 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7203,17 +7203,19 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	btrfs_set_buffer_uptodate(buf);
 
 	if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
+		buf->log_index = root->log_transid % 2;
 		/*
 		 * we allow two log transactions at a time, use different
 		 * EXENT bit to differentiate dirty pages.
 		 */
-		if (root->log_transid % 2 == 0)
+		if (buf->log_index == 0)
 			set_extent_dirty(&root->dirty_log_pages, buf->start,
 					buf->start + buf->len - 1, GFP_NOFS);
 		else
 			set_extent_new(&root->dirty_log_pages, buf->start,
 					buf->start + buf->len - 1, GFP_NOFS);
 	} else {
+		buf->log_index = -1;
 		set_extent_dirty(&trans->transaction->dirty_pages, buf->start,
 			 buf->start + buf->len - 1, GFP_NOFS);
 	}
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 91f866c..07b7252 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -20,6 +20,7 @@
 #include "locking.h"
 #include "rcu-string.h"
 #include "backref.h"
+#include "transaction.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -3606,6 +3607,69 @@ static void end_extent_buffer_writeback(struct extent_buffer *eb)
 	wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
 }
 
+static void set_btree_ioerr(struct page *page)
+{
+	struct extent_buffer *eb = (struct extent_buffer *)page->private;
+	struct btrfs_inode *btree_ino = BTRFS_I(eb->fs_info->btree_inode);
+
+	SetPageError(page);
+	if (test_and_set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))
+		return;
+
+	/*
+	 * If writeback for a btree extent that doesn't belong to a log tree
+	 * failed, increment the counter transaction->eb_write_errors.
+	 * We do this because while the transaction is running and before it's
+	 * committing (when we call filemap_fdata[write|wait]_range against
+	 * the btree inode), we might have
+	 * btree_inode->i_mapping->a_ops->writepages() called by the VM - if it
+	 * returns an error or an error happens during writeback, when we're
+	 * committing the transaction we wouldn't know about it, since the pages
+	 * can be no longer dirty nor marked anymore for writeback (if a
+	 * subsequent modification to the extent buffer didn't happen before the
+	 * transaction commit), which makes filemap_fdata[write|wait]_range not
+	 * able to find the pages tagged with SetPageError at transaction
+	 * commit time. So if this happens we must abort the transaction,
+	 * otherwise we commit a super block with btree roots that point to
+	 * btree nodes/leafs whose content on disk is invalid - either garbage
+	 * or the content of some node/leaf from a past generation that got
+	 * cowed or deleted and is no longer valid.
+	 *
+	 * Note: setting AS_EIO/AS_ENOSPC in the btree inode's i_mapping would
+	 * not be enough - we need to distinguish between log tree extents vs
+	 * non-log tree extents, and the next filemap_fdatawait_range() call
+	 * will catch and clear such errors in the mapping - and that call might
+	 * be from a log sync and not from a transaction commit. Also, checking
+	 * for the eb flag EXTENT_BUFFER_WRITE_ERR at transaction commit time is
+	 * not done and would not be completely reliable, as the eb might be
+	 * removed from memory and read back when trying to get it, which clears
+	 * that flag right before reading the eb's pages from disk, making us
+	 * not know about the previous write error.
+	 *
+	 * Using transaction counters eb_write_errors and log_eb_write_errors
+	 * also makes us achieve the goal of AS_EIO/AS_ENOSPC when writepages()
+	 * returns success, started writeback for all dirty pages and before
+	 * filemap_fdatawait_range() is called, the writeback for all dirty
+	 * pages had already finished with errors - because we were not using
+	 * AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return success, as
+	 * it could not know that writeback errors happened (the pages were no
+	 * longer tagged for writeback).
+	 */
+	switch (eb->log_index) {
+	case -1:
+		set_bit(BTRFS_INODE_BTREE_ERR, &btree_ino->runtime_flags);
+		break;
+	case 0:
+		set_bit(BTRFS_INODE_BTREE_LOG1_ERR, &btree_ino->runtime_flags);
+		break;
+	case 1:
+		set_bit(BTRFS_INODE_BTREE_LOG2_ERR, &btree_ino->runtime_flags);
+		break;
+	default:
+		ASSERT(0); /* unexpected, logic error */
+	}
+}
+
 static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 {
 	struct bio_vec *bvec;
@@ -3619,10 +3683,9 @@ static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 		BUG_ON(!eb);
 		done = atomic_dec_and_test(&eb->io_pages);
 
-		if (err || test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+		if (err || test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
 			ClearPageUptodate(page);
-			SetPageError(page);
+			set_btree_ioerr(page);
 		}
 
 		end_page_writeback(page);
@@ -3649,7 +3712,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	int rw = (epd->sync_io ? WRITE_SYNC : WRITE) | REQ_META;
 	int ret = 0;
 
-	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
 	num_pages = num_extent_pages(eb->start, eb->len);
 	atomic_set(&eb->io_pages, num_pages);
 	if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
@@ -3666,8 +3729,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 					 0, epd->bio_flags, bio_flags);
 		epd->bio_flags = bio_flags;
 		if (ret) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
-			SetPageError(p);
+			set_btree_ioerr(p);
 			end_page_writeback(p);
 			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
 				end_extent_buffer_writeback(eb);
@@ -5069,7 +5131,7 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 		goto unlock_exit;
 	}
 
-	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = 0;
 	atomic_set(&eb->io_pages, num_reads);
 	for (i = start_i; i < num_pages; i++) {
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 5e91fb9..fd51d7e 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -41,9 +41,10 @@
 #define EXTENT_BUFFER_TREE_REF 5
 #define EXTENT_BUFFER_STALE 6
 #define EXTENT_BUFFER_WRITEBACK 7
-#define EXTENT_BUFFER_IOERR 8
+#define EXTENT_BUFFER_READ_ERR 8        /* read IO error */
 #define EXTENT_BUFFER_DUMMY 9
 #define EXTENT_BUFFER_IN_TREE 10
+#define EXTENT_BUFFER_WRITE_ERR 11    /* write IO error */
 
 /* these are flags for extent_clear_unlock_delalloc */
 #define PAGE_UNLOCK		(1 << 0)
@@ -141,7 +142,9 @@ struct extent_buffer {
 	atomic_t blocking_readers;
 	atomic_t spinning_readers;
 	atomic_t spinning_writers;
-	int lock_nested;
+	unsigned int lock_nested:1;
+	/* >= 0 if eb belongs to a log tree, -1 otherwise */
+	short log_index;
 
 	/* protects write locks */
 	rwlock_t lock;
@@ -314,6 +317,7 @@ void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
 			   unsigned long src_offset, unsigned long len);
 void memset_extent_buffer(struct extent_buffer *eb, char c,
 			  unsigned long start, unsigned long len);
+void clear_extent_buffer_write_err(struct extent_buffer *eb);
 void clear_extent_buffer_dirty(struct extent_buffer *eb);
 int set_extent_buffer_dirty(struct extent_buffer *eb);
 int set_extent_buffer_uptodate(struct extent_buffer *eb);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 1e272c0..c404e79 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -852,6 +852,8 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 	struct extent_state *cached_state = NULL;
 	u64 start = 0;
 	u64 end;
+	struct btrfs_inode *btree_ino = BTRFS_I(root->fs_info->btree_inode);
+	bool errors = false;
 
 	while (!find_first_extent_bit(dirty_pages, start, &start, &end,
 				      EXTENT_NEED_WAIT, &cached_state)) {
@@ -865,6 +867,26 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 	}
 	if (err)
 		werr = err;
+
+	if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
+		if ((mark & EXTENT_DIRTY) &&
+		    test_and_clear_bit(BTRFS_INODE_BTREE_LOG1_ERR,
+				       &btree_ino->runtime_flags))
+			errors = true;
+
+		if ((mark & EXTENT_NEW) &&
+		    test_and_clear_bit(BTRFS_INODE_BTREE_LOG2_ERR,
+				       &btree_ino->runtime_flags))
+			errors = true;
+	} else {
+		if (test_and_clear_bit(BTRFS_INODE_BTREE_ERR,
+				       &btree_ino->runtime_flags))
+			errors = true;
+	}
+
+	if (errors && !werr)
+		werr = -EIO;
+
 	return werr;
 }
 
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 2d0fa43..80f9b4c 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2748,8 +2748,14 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
 int btrfs_free_log(struct btrfs_trans_handle *trans, struct btrfs_root *root)
 {
 	if (root->log_root) {
+		struct inode *btree_inode = root->fs_info->btree_inode;
+
 		free_log_tree(trans, root->log_root);
 		root->log_root = NULL;
+		clear_bit(BTRFS_INODE_BTREE_LOG1_ERR,
+			  &BTRFS_I(btree_inode)->runtime_flags);
+		clear_bit(BTRFS_INODE_BTREE_LOG2_ERR,
+			  &BTRFS_I(btree_inode)->runtime_flags);
 	}
 	return 0;
 }
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/2 v10] Btrfs: be aware of btree inode write errors to avoid fs corruption
  2014-09-22 16:41 ` [PATCH 2/2] Btrfs: be aware of btree inode write errors to avoid fs corruption Filipe Manana
                     ` (8 preceding siblings ...)
  2014-09-25 16:33   ` [PATCH 2/2 v9] " Filipe Manana
@ 2014-09-25 16:51   ` Filipe Manana
  2014-09-25 16:30     ` David Sterba
  2014-09-25 18:01   ` [PATCH 2/2 v11] " Filipe Manana
  2014-09-26 11:25   ` [PATCH 2/2 v12] " Filipe Manana
  11 siblings, 1 reply; 21+ messages in thread
From: Filipe Manana @ 2014-09-25 16:51 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).

Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.

Using the new counters eb_write_errors and log_eb_write_errors in the
transaction also makes us achieve the goal of AS_EIO/AS_ENOSPC when
writepages() returns success, started writeback for all dirty pages
and before filemap_fdatawait_range() is called, the writeback for
all dirty pages had already finished with errors - because we were
not using AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return
success, as it could not know that writeback errors happened (the
pages were no longer tagged for writeback).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: If an extent buffer's write failed but it's also deleted from the tree
    before the transaction commits, don't abort the transaction with -EIO,
    since the unwritten node/leaf it represents can't be pointed to by any
    other node in a tree.

V3: Correct V2, missed unstaged changes.

V4: Use root's key to figure out which counter to update.

V5: Decrement the error counters too when an eb is made dirty again (the
    next write attempt might succeed).

V6: Moved counters from transaction struct to fs_info struct, because there's
    a (short) time window where fs_info->running_transaction is NULL.
    There's now 2 counters for log extent buffers too, each one representing
    a different log transaction.

V7: Track the eb's log index in the eb itself, otherwise it wasn't possible
    to find it when writeback triggered from a transaction commit.

V8: Track the log eb write errors per root instead, and reset them on a
    transaction commit.

V9: Don't decrement the error counters if the eb is deleted or re-written.
    It is not safe because there's a time window when committing a transaction,
    between setting fs_info->current_transaction to NULL and checking the
    error counters in btrfs_write_and_wait_transaction(), where a new transaction
    can start and delete or re-write an eb that has the write error flag set.
    If this happens it means the previous transaction can write a superblock
    that refers to trees that point to unwritten nodes.
    Replaced the counters with simple flags in the btree inode's runtime
    flags - essentially back to V1 but accounting for the 2 different log
    sub-transactions.
    Removed access to an eb's parent root through
    BTRFS_I(eb->pages[0]->mapping->host)->root since it was not correct, as this
    always gives us the btree inode's root (objectid 1ULL). Instead use the
    field eb->log_index to know wether it's a log btree eb (and which sub-
    -transaction) or a non-log btree eb.

V10: Clear the log eb write error flags in a more logical place (transaction
     commit function).

 fs/btrfs/btrfs_inode.h | 11 ++++++++
 fs/btrfs/disk-io.c     |  4 +--
 fs/btrfs/extent-tree.c |  4 ++-
 fs/btrfs/extent_io.c   | 76 +++++++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/extent_io.h   |  8 ++++--
 fs/btrfs/transaction.c | 26 +++++++++++++++++
 6 files changed, 117 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 3511031..aee4050 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -44,6 +44,17 @@
 #define BTRFS_INODE_IN_DELALLOC_LIST		9
 #define BTRFS_INODE_READDIO_NEED_LOCK		10
 #define BTRFS_INODE_HAS_PROPS		        11
+/*
+ * The following 3 bits are meant only for the btree inode.
+ * When any of them is set, it means an error happened while writing an
+ * extent buffer belonging to:
+ * 1) a non-log btree
+ * 2) a log btree and first log sub-transaction
+ * 3) a log btree and second log sub-transaction
+ */
+#define BTRFS_INODE_BTREE_ERR		        12
+#define BTRFS_INODE_BTREE_LOG1_ERR		13
+#define BTRFS_INODE_BTREE_LOG2_ERR		14
 
 /* in memory btrfs inode */
 struct btrfs_inode {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 23393ec..8b54acf 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -610,7 +610,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 		goto err;
 
 	eb->read_mirror = mirror;
-	if (test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
+	if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) {
 		ret = -EIO;
 		goto err;
 	}
@@ -683,7 +683,7 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
 	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
 
 	eb = (struct extent_buffer *)page->private;
-	set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = failed_mirror;
 	atomic_dec(&eb->io_pages);
 	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ef0845d..c4df702 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7203,17 +7203,19 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	btrfs_set_buffer_uptodate(buf);
 
 	if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
+		buf->log_index = root->log_transid % 2;
 		/*
 		 * we allow two log transactions at a time, use different
 		 * EXENT bit to differentiate dirty pages.
 		 */
-		if (root->log_transid % 2 == 0)
+		if (buf->log_index == 0)
 			set_extent_dirty(&root->dirty_log_pages, buf->start,
 					buf->start + buf->len - 1, GFP_NOFS);
 		else
 			set_extent_new(&root->dirty_log_pages, buf->start,
 					buf->start + buf->len - 1, GFP_NOFS);
 	} else {
+		buf->log_index = -1;
 		set_extent_dirty(&trans->transaction->dirty_pages, buf->start,
 			 buf->start + buf->len - 1, GFP_NOFS);
 	}
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 91f866c..07b7252 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -20,6 +20,7 @@
 #include "locking.h"
 #include "rcu-string.h"
 #include "backref.h"
+#include "transaction.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -3606,6 +3607,69 @@ static void end_extent_buffer_writeback(struct extent_buffer *eb)
 	wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
 }
 
+static void set_btree_ioerr(struct page *page)
+{
+	struct extent_buffer *eb = (struct extent_buffer *)page->private;
+	struct btrfs_inode *btree_ino = BTRFS_I(eb->fs_info->btree_inode);
+
+	SetPageError(page);
+	if (test_and_set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))
+		return;
+
+	/*
+	 * If writeback for a btree extent that doesn't belong to a log tree
+	 * failed, increment the counter transaction->eb_write_errors.
+	 * We do this because while the transaction is running and before it's
+	 * committing (when we call filemap_fdata[write|wait]_range against
+	 * the btree inode), we might have
+	 * btree_inode->i_mapping->a_ops->writepages() called by the VM - if it
+	 * returns an error or an error happens during writeback, when we're
+	 * committing the transaction we wouldn't know about it, since the pages
+	 * can be no longer dirty nor marked anymore for writeback (if a
+	 * subsequent modification to the extent buffer didn't happen before the
+	 * transaction commit), which makes filemap_fdata[write|wait]_range not
+	 * able to find the pages tagged with SetPageError at transaction
+	 * commit time. So if this happens we must abort the transaction,
+	 * otherwise we commit a super block with btree roots that point to
+	 * btree nodes/leafs whose content on disk is invalid - either garbage
+	 * or the content of some node/leaf from a past generation that got
+	 * cowed or deleted and is no longer valid.
+	 *
+	 * Note: setting AS_EIO/AS_ENOSPC in the btree inode's i_mapping would
+	 * not be enough - we need to distinguish between log tree extents vs
+	 * non-log tree extents, and the next filemap_fdatawait_range() call
+	 * will catch and clear such errors in the mapping - and that call might
+	 * be from a log sync and not from a transaction commit. Also, checking
+	 * for the eb flag EXTENT_BUFFER_WRITE_ERR at transaction commit time is
+	 * not done and would not be completely reliable, as the eb might be
+	 * removed from memory and read back when trying to get it, which clears
+	 * that flag right before reading the eb's pages from disk, making us
+	 * not know about the previous write error.
+	 *
+	 * Using transaction counters eb_write_errors and log_eb_write_errors
+	 * also makes us achieve the goal of AS_EIO/AS_ENOSPC when writepages()
+	 * returns success, started writeback for all dirty pages and before
+	 * filemap_fdatawait_range() is called, the writeback for all dirty
+	 * pages had already finished with errors - because we were not using
+	 * AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return success, as
+	 * it could not know that writeback errors happened (the pages were no
+	 * longer tagged for writeback).
+	 */
+	switch (eb->log_index) {
+	case -1:
+		set_bit(BTRFS_INODE_BTREE_ERR, &btree_ino->runtime_flags);
+		break;
+	case 0:
+		set_bit(BTRFS_INODE_BTREE_LOG1_ERR, &btree_ino->runtime_flags);
+		break;
+	case 1:
+		set_bit(BTRFS_INODE_BTREE_LOG2_ERR, &btree_ino->runtime_flags);
+		break;
+	default:
+		ASSERT(0); /* unexpected, logic error */
+	}
+}
+
 static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 {
 	struct bio_vec *bvec;
@@ -3619,10 +3683,9 @@ static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 		BUG_ON(!eb);
 		done = atomic_dec_and_test(&eb->io_pages);
 
-		if (err || test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+		if (err || test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
 			ClearPageUptodate(page);
-			SetPageError(page);
+			set_btree_ioerr(page);
 		}
 
 		end_page_writeback(page);
@@ -3649,7 +3712,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	int rw = (epd->sync_io ? WRITE_SYNC : WRITE) | REQ_META;
 	int ret = 0;
 
-	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
 	num_pages = num_extent_pages(eb->start, eb->len);
 	atomic_set(&eb->io_pages, num_pages);
 	if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
@@ -3666,8 +3729,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 					 0, epd->bio_flags, bio_flags);
 		epd->bio_flags = bio_flags;
 		if (ret) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
-			SetPageError(p);
+			set_btree_ioerr(p);
 			end_page_writeback(p);
 			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
 				end_extent_buffer_writeback(eb);
@@ -5069,7 +5131,7 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 		goto unlock_exit;
 	}
 
-	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = 0;
 	atomic_set(&eb->io_pages, num_reads);
 	for (i = start_i; i < num_pages; i++) {
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 5e91fb9..fd51d7e 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -41,9 +41,10 @@
 #define EXTENT_BUFFER_TREE_REF 5
 #define EXTENT_BUFFER_STALE 6
 #define EXTENT_BUFFER_WRITEBACK 7
-#define EXTENT_BUFFER_IOERR 8
+#define EXTENT_BUFFER_READ_ERR 8        /* read IO error */
 #define EXTENT_BUFFER_DUMMY 9
 #define EXTENT_BUFFER_IN_TREE 10
+#define EXTENT_BUFFER_WRITE_ERR 11    /* write IO error */
 
 /* these are flags for extent_clear_unlock_delalloc */
 #define PAGE_UNLOCK		(1 << 0)
@@ -141,7 +142,9 @@ struct extent_buffer {
 	atomic_t blocking_readers;
 	atomic_t spinning_readers;
 	atomic_t spinning_writers;
-	int lock_nested;
+	unsigned int lock_nested:1;
+	/* >= 0 if eb belongs to a log tree, -1 otherwise */
+	short log_index;
 
 	/* protects write locks */
 	rwlock_t lock;
@@ -314,6 +317,7 @@ void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
 			   unsigned long src_offset, unsigned long len);
 void memset_extent_buffer(struct extent_buffer *eb, char c,
 			  unsigned long start, unsigned long len);
+void clear_extent_buffer_write_err(struct extent_buffer *eb);
 void clear_extent_buffer_dirty(struct extent_buffer *eb);
 int set_extent_buffer_dirty(struct extent_buffer *eb);
 int set_extent_buffer_uptodate(struct extent_buffer *eb);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 1e272c0..51cb84f 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -852,6 +852,8 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 	struct extent_state *cached_state = NULL;
 	u64 start = 0;
 	u64 end;
+	struct btrfs_inode *btree_ino = BTRFS_I(root->fs_info->btree_inode);
+	bool errors = false;
 
 	while (!find_first_extent_bit(dirty_pages, start, &start, &end,
 				      EXTENT_NEED_WAIT, &cached_state)) {
@@ -865,6 +867,26 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 	}
 	if (err)
 		werr = err;
+
+	if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
+		if ((mark & EXTENT_DIRTY) &&
+		    test_and_clear_bit(BTRFS_INODE_BTREE_LOG1_ERR,
+				       &btree_ino->runtime_flags))
+			errors = true;
+
+		if ((mark & EXTENT_NEW) &&
+		    test_and_clear_bit(BTRFS_INODE_BTREE_LOG2_ERR,
+				       &btree_ino->runtime_flags))
+			errors = true;
+	} else {
+		if (test_and_clear_bit(BTRFS_INODE_BTREE_ERR,
+				       &btree_ino->runtime_flags))
+			errors = true;
+	}
+
+	if (errors && !werr)
+		werr = -EIO;
+
 	return werr;
 }
 
@@ -1651,6 +1673,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 {
 	struct btrfs_transaction *cur_trans = trans->transaction;
 	struct btrfs_transaction *prev_trans = NULL;
+	struct btrfs_inode *btree_ino = BTRFS_I(root->fs_info->btree_inode);
 	int ret;
 
 	ret = btrfs_run_ordered_operations(trans, root, 0);
@@ -1897,6 +1920,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 	memcpy(root->fs_info->super_for_commit, root->fs_info->super_copy,
 	       sizeof(*root->fs_info->super_copy));
 
+	clear_bit(BTRFS_INODE_BTREE_LOG1_ERR, &btree_ino->runtime_flags);
+	clear_bit(BTRFS_INODE_BTREE_LOG2_ERR, &btree_ino->runtime_flags);
+
 	spin_lock(&root->fs_info->trans_lock);
 	cur_trans->state = TRANS_STATE_UNBLOCKED;
 	root->fs_info->running_transaction = NULL;
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/2 v11] Btrfs: be aware of btree inode write errors to avoid fs corruption
  2014-09-22 16:41 ` [PATCH 2/2] Btrfs: be aware of btree inode write errors to avoid fs corruption Filipe Manana
                     ` (9 preceding siblings ...)
  2014-09-25 16:51   ` [PATCH 2/2 v10] " Filipe Manana
@ 2014-09-25 18:01   ` Filipe Manana
  2014-09-26 11:25   ` [PATCH 2/2 v12] " Filipe Manana
  11 siblings, 0 replies; 21+ messages in thread
From: Filipe Manana @ 2014-09-25 18:01 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).

Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.

Using the new 3 flags for the btree inode also makes us achieve the
goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
writeback for all dirty pages and before filemap_fdatawait_range() is
called, the writeback for all dirty pages had already finished with
errors - because we were not using AS_EIO/AS_ENOSPC,
filemap_fdatawait_range() would return success, as it could not know
that writeback errors happened (the pages were no longer tagged for
writeback).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: If an extent buffer's write failed but it's also deleted from the tree
    before the transaction commits, don't abort the transaction with -EIO,
    since the unwritten node/leaf it represents can't be pointed to by any
    other node in a tree.

V3: Correct V2, missed unstaged changes.

V4: Use root's key to figure out which counter to update.

V5: Decrement the error counters too when an eb is made dirty again (the
    next write attempt might succeed).

V6: Moved counters from transaction struct to fs_info struct, because there's
    a (short) time window where fs_info->running_transaction is NULL.
    There's now 2 counters for log extent buffers too, each one representing
    a different log transaction.

V7: Track the eb's log index in the eb itself, otherwise it wasn't possible
    to find it when writeback triggered from a transaction commit.

V8: Track the log eb write errors per root instead, and reset them on a
    transaction commit.

V9: Don't decrement the error counters if the eb is deleted or re-written.
    It is not safe because there's a time window when committing a transaction,
    between setting fs_info->current_transaction to NULL and checking the
    error counters in btrfs_write_and_wait_transaction(), where a new transaction
    can start and delete or re-write an eb that has the write error flag set.
    If this happens it means the previous transaction can write a superblock
    that refers to trees that point to unwritten nodes.
    Replaced the counters with simple flags in the btree inode's runtime
    flags - essentially back to V1 but accounting for the 2 different log
    sub-transactions.
    Removed access to an eb's parent root through
    BTRFS_I(eb->pages[0]->mapping->host)->root since it was not correct, as this
    always gives us the btree inode's root (objectid 1ULL). Instead use the
    field eb->log_index to know wether it's a log btree eb (and which sub-
    -transaction) or a non-log btree eb.

V10: Clear the log eb write error flags in a more logical place (transaction
     commit function).

V11: Updated commit message and a comment, replaced an ASSERT() with a BUG()
     and changed eb->lock_nested to a short to keep the structure size.

 fs/btrfs/btrfs_inode.h | 11 ++++++++
 fs/btrfs/disk-io.c     |  4 +--
 fs/btrfs/extent-tree.c |  4 ++-
 fs/btrfs/extent_io.c   | 76 +++++++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/extent_io.h   |  8 ++++--
 fs/btrfs/transaction.c | 26 +++++++++++++++++
 6 files changed, 117 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 3511031..aee4050 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -44,6 +44,17 @@
 #define BTRFS_INODE_IN_DELALLOC_LIST		9
 #define BTRFS_INODE_READDIO_NEED_LOCK		10
 #define BTRFS_INODE_HAS_PROPS		        11
+/*
+ * The following 3 bits are meant only for the btree inode.
+ * When any of them is set, it means an error happened while writing an
+ * extent buffer belonging to:
+ * 1) a non-log btree
+ * 2) a log btree and first log sub-transaction
+ * 3) a log btree and second log sub-transaction
+ */
+#define BTRFS_INODE_BTREE_ERR		        12
+#define BTRFS_INODE_BTREE_LOG1_ERR		13
+#define BTRFS_INODE_BTREE_LOG2_ERR		14
 
 /* in memory btrfs inode */
 struct btrfs_inode {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 23393ec..8b54acf 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -610,7 +610,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 		goto err;
 
 	eb->read_mirror = mirror;
-	if (test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
+	if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) {
 		ret = -EIO;
 		goto err;
 	}
@@ -683,7 +683,7 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
 	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
 
 	eb = (struct extent_buffer *)page->private;
-	set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = failed_mirror;
 	atomic_dec(&eb->io_pages);
 	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ef0845d..c4df702 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7203,17 +7203,19 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	btrfs_set_buffer_uptodate(buf);
 
 	if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
+		buf->log_index = root->log_transid % 2;
 		/*
 		 * we allow two log transactions at a time, use different
 		 * EXENT bit to differentiate dirty pages.
 		 */
-		if (root->log_transid % 2 == 0)
+		if (buf->log_index == 0)
 			set_extent_dirty(&root->dirty_log_pages, buf->start,
 					buf->start + buf->len - 1, GFP_NOFS);
 		else
 			set_extent_new(&root->dirty_log_pages, buf->start,
 					buf->start + buf->len - 1, GFP_NOFS);
 	} else {
+		buf->log_index = -1;
 		set_extent_dirty(&trans->transaction->dirty_pages, buf->start,
 			 buf->start + buf->len - 1, GFP_NOFS);
 	}
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 91f866c..9c5dbef 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -20,6 +20,7 @@
 #include "locking.h"
 #include "rcu-string.h"
 #include "backref.h"
+#include "transaction.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -3606,6 +3607,69 @@ static void end_extent_buffer_writeback(struct extent_buffer *eb)
 	wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
 }
 
+static void set_btree_ioerr(struct page *page)
+{
+	struct extent_buffer *eb = (struct extent_buffer *)page->private;
+	struct btrfs_inode *btree_ino = BTRFS_I(eb->fs_info->btree_inode);
+
+	SetPageError(page);
+	if (test_and_set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))
+		return;
+
+	/*
+	 * If writeback for a btree extent that doesn't belong to a log tree
+	 * failed, increment the counter transaction->eb_write_errors.
+	 * We do this because while the transaction is running and before it's
+	 * committing (when we call filemap_fdata[write|wait]_range against
+	 * the btree inode), we might have
+	 * btree_inode->i_mapping->a_ops->writepages() called by the VM - if it
+	 * returns an error or an error happens during writeback, when we're
+	 * committing the transaction we wouldn't know about it, since the pages
+	 * can be no longer dirty nor marked anymore for writeback (if a
+	 * subsequent modification to the extent buffer didn't happen before the
+	 * transaction commit), which makes filemap_fdata[write|wait]_range not
+	 * able to find the pages tagged with SetPageError at transaction
+	 * commit time. So if this happens we must abort the transaction,
+	 * otherwise we commit a super block with btree roots that point to
+	 * btree nodes/leafs whose content on disk is invalid - either garbage
+	 * or the content of some node/leaf from a past generation that got
+	 * cowed or deleted and is no longer valid.
+	 *
+	 * Note: setting AS_EIO/AS_ENOSPC in the btree inode's i_mapping would
+	 * not be enough - we need to distinguish between log tree extents vs
+	 * non-log tree extents, and the next filemap_fdatawait_range() call
+	 * will catch and clear such errors in the mapping - and that call might
+	 * be from a log sync and not from a transaction commit. Also, checking
+	 * for the eb flag EXTENT_BUFFER_WRITE_ERR at transaction commit time is
+	 * not done and would not be completely reliable, as the eb might be
+	 * removed from memory and read back when trying to get it, which clears
+	 * that flag right before reading the eb's pages from disk, making us
+	 * not know about the previous write error.
+	 *
+	 * Using the flags below in the btree inode also makes us achieve the
+	 * goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
+	 * writeback for all dirty pages and before filemap_fdatawait_range()
+	 * is called, the writeback for all dirty pages had already finished
+	 * with errors - because we were not using AS_EIO/AS_ENOSPC,
+	 * filemap_fdatawait_range() would return success, as it could not know
+	 * that writeback errors happened (the pages were no longer tagged for
+	 * writeback).
+	 */
+	switch (eb->log_index) {
+	case -1:
+		set_bit(BTRFS_INODE_BTREE_ERR, &btree_ino->runtime_flags);
+		break;
+	case 0:
+		set_bit(BTRFS_INODE_BTREE_LOG1_ERR, &btree_ino->runtime_flags);
+		break;
+	case 1:
+		set_bit(BTRFS_INODE_BTREE_LOG2_ERR, &btree_ino->runtime_flags);
+		break;
+	default:
+		BUG(); /* unexpected, logic error */
+	}
+}
+
 static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 {
 	struct bio_vec *bvec;
@@ -3619,10 +3683,9 @@ static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 		BUG_ON(!eb);
 		done = atomic_dec_and_test(&eb->io_pages);
 
-		if (err || test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+		if (err || test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
 			ClearPageUptodate(page);
-			SetPageError(page);
+			set_btree_ioerr(page);
 		}
 
 		end_page_writeback(page);
@@ -3649,7 +3712,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	int rw = (epd->sync_io ? WRITE_SYNC : WRITE) | REQ_META;
 	int ret = 0;
 
-	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
 	num_pages = num_extent_pages(eb->start, eb->len);
 	atomic_set(&eb->io_pages, num_pages);
 	if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
@@ -3666,8 +3729,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 					 0, epd->bio_flags, bio_flags);
 		epd->bio_flags = bio_flags;
 		if (ret) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
-			SetPageError(p);
+			set_btree_ioerr(p);
 			end_page_writeback(p);
 			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
 				end_extent_buffer_writeback(eb);
@@ -5069,7 +5131,7 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 		goto unlock_exit;
 	}
 
-	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = 0;
 	atomic_set(&eb->io_pages, num_reads);
 	for (i = start_i; i < num_pages; i++) {
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 5e91fb9..c8a6d92 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -41,9 +41,10 @@
 #define EXTENT_BUFFER_TREE_REF 5
 #define EXTENT_BUFFER_STALE 6
 #define EXTENT_BUFFER_WRITEBACK 7
-#define EXTENT_BUFFER_IOERR 8
+#define EXTENT_BUFFER_READ_ERR 8        /* read IO error */
 #define EXTENT_BUFFER_DUMMY 9
 #define EXTENT_BUFFER_IN_TREE 10
+#define EXTENT_BUFFER_WRITE_ERR 11    /* write IO error */
 
 /* these are flags for extent_clear_unlock_delalloc */
 #define PAGE_UNLOCK		(1 << 0)
@@ -141,7 +142,9 @@ struct extent_buffer {
 	atomic_t blocking_readers;
 	atomic_t spinning_readers;
 	atomic_t spinning_writers;
-	int lock_nested;
+	short lock_nested;
+	/* >= 0 if eb belongs to a log tree, -1 otherwise */
+	short log_index;
 
 	/* protects write locks */
 	rwlock_t lock;
@@ -314,6 +317,7 @@ void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
 			   unsigned long src_offset, unsigned long len);
 void memset_extent_buffer(struct extent_buffer *eb, char c,
 			  unsigned long start, unsigned long len);
+void clear_extent_buffer_write_err(struct extent_buffer *eb);
 void clear_extent_buffer_dirty(struct extent_buffer *eb);
 int set_extent_buffer_dirty(struct extent_buffer *eb);
 int set_extent_buffer_uptodate(struct extent_buffer *eb);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 1e272c0..51cb84f 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -852,6 +852,8 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 	struct extent_state *cached_state = NULL;
 	u64 start = 0;
 	u64 end;
+	struct btrfs_inode *btree_ino = BTRFS_I(root->fs_info->btree_inode);
+	bool errors = false;
 
 	while (!find_first_extent_bit(dirty_pages, start, &start, &end,
 				      EXTENT_NEED_WAIT, &cached_state)) {
@@ -865,6 +867,26 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 	}
 	if (err)
 		werr = err;
+
+	if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
+		if ((mark & EXTENT_DIRTY) &&
+		    test_and_clear_bit(BTRFS_INODE_BTREE_LOG1_ERR,
+				       &btree_ino->runtime_flags))
+			errors = true;
+
+		if ((mark & EXTENT_NEW) &&
+		    test_and_clear_bit(BTRFS_INODE_BTREE_LOG2_ERR,
+				       &btree_ino->runtime_flags))
+			errors = true;
+	} else {
+		if (test_and_clear_bit(BTRFS_INODE_BTREE_ERR,
+				       &btree_ino->runtime_flags))
+			errors = true;
+	}
+
+	if (errors && !werr)
+		werr = -EIO;
+
 	return werr;
 }
 
@@ -1651,6 +1673,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 {
 	struct btrfs_transaction *cur_trans = trans->transaction;
 	struct btrfs_transaction *prev_trans = NULL;
+	struct btrfs_inode *btree_ino = BTRFS_I(root->fs_info->btree_inode);
 	int ret;
 
 	ret = btrfs_run_ordered_operations(trans, root, 0);
@@ -1897,6 +1920,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 	memcpy(root->fs_info->super_for_commit, root->fs_info->super_copy,
 	       sizeof(*root->fs_info->super_copy));
 
+	clear_bit(BTRFS_INODE_BTREE_LOG1_ERR, &btree_ino->runtime_flags);
+	clear_bit(BTRFS_INODE_BTREE_LOG2_ERR, &btree_ino->runtime_flags);
+
 	spin_lock(&root->fs_info->trans_lock);
 	cur_trans->state = TRANS_STATE_UNBLOCKED;
 	root->fs_info->running_transaction = NULL;
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/2 v12] Btrfs: be aware of btree inode write errors to avoid fs corruption
  2014-09-22 16:41 ` [PATCH 2/2] Btrfs: be aware of btree inode write errors to avoid fs corruption Filipe Manana
                     ` (10 preceding siblings ...)
  2014-09-25 18:01   ` [PATCH 2/2 v11] " Filipe Manana
@ 2014-09-26 11:25   ` Filipe Manana
  11 siblings, 0 replies; 21+ messages in thread
From: Filipe Manana @ 2014-09-26 11:25 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).

Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.

Using the new 3 flags for the btree inode also makes us achieve the
goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
writeback for all dirty pages and before filemap_fdatawait_range() is
called, the writeback for all dirty pages had already finished with
errors - because we were not using AS_EIO/AS_ENOSPC,
filemap_fdatawait_range() would return success, as it could not know
that writeback errors happened (the pages were no longer tagged for
writeback).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: If an extent buffer's write failed but it's also deleted from the tree
    before the transaction commits, don't abort the transaction with -EIO,
    since the unwritten node/leaf it represents can't be pointed to by any
    other node in a tree.

V3: Correct V2, missed unstaged changes.

V4: Use root's key to figure out which counter to update.

V5: Decrement the error counters too when an eb is made dirty again (the
    next write attempt might succeed).

V6: Moved counters from transaction struct to fs_info struct, because there's
    a (short) time window where fs_info->running_transaction is NULL.
    There's now 2 counters for log extent buffers too, each one representing
    a different log transaction.

V7: Track the eb's log index in the eb itself, otherwise it wasn't possible
    to find it when writeback triggered from a transaction commit.

V8: Track the log eb write errors per root instead, and reset them on a
    transaction commit.

V9: Don't decrement the error counters if the eb is deleted or re-written.
    It is not safe because there's a time window when committing a transaction,
    between setting fs_info->current_transaction to NULL and checking the
    error counters in btrfs_write_and_wait_transaction(), where a new transaction
    can start and delete or re-write an eb that has the write error flag set.
    If this happens it means the previous transaction can write a superblock
    that refers to trees that point to unwritten nodes.
    Replaced the counters with simple flags in the btree inode's runtime
    flags - essentially back to V1 but accounting for the 2 different log
    sub-transactions.
    Removed access to an eb's parent root through
    BTRFS_I(eb->pages[0]->mapping->host)->root since it was not correct, as this
    always gives us the btree inode's root (objectid 1ULL). Instead use the
    field eb->log_index to know wether it's a log btree eb (and which sub-
    -transaction) or a non-log btree eb.

V10: Clear the log eb write error flags in a more logical place (transaction
     commit function).

V11: Updated commit message and a comment, replaced an ASSERT() with a BUG()
     and changed eb->lock_nested to a short to keep the structure size.

V12: Removed leftovers from previous versions (no longer necessary #include and
     prototype in extent_io.h of no longer existing function) and updated parts
     from a comment that apply only to some past versions.
     Rebased against latest integration branch (didn't apply cleanly) and re-tested.

 fs/btrfs/btrfs_inode.h | 11 ++++++++
 fs/btrfs/disk-io.c     |  4 +--
 fs/btrfs/extent-tree.c |  4 ++-
 fs/btrfs/extent_io.c   | 74 +++++++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/extent_io.h   |  7 +++--
 fs/btrfs/transaction.c | 26 ++++++++++++++++++
 6 files changed, 114 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 7a7521c..8a42adb 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -44,6 +44,17 @@
 #define BTRFS_INODE_IN_DELALLOC_LIST		9
 #define BTRFS_INODE_READDIO_NEED_LOCK		10
 #define BTRFS_INODE_HAS_PROPS		        11
+/*
+ * The following 3 bits are meant only for the btree inode.
+ * When any of them is set, it means an error happened while writing an
+ * extent buffer belonging to:
+ * 1) a non-log btree
+ * 2) a log btree and first log sub-transaction
+ * 3) a log btree and second log sub-transaction
+ */
+#define BTRFS_INODE_BTREE_ERR		        12
+#define BTRFS_INODE_BTREE_LOG1_ERR		13
+#define BTRFS_INODE_BTREE_LOG2_ERR		14
 
 /* in memory btrfs inode */
 struct btrfs_inode {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4780e66..09b3c8a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -607,7 +607,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 		goto err;
 
 	eb->read_mirror = mirror;
-	if (test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
+	if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) {
 		ret = -EIO;
 		goto err;
 	}
@@ -680,7 +680,7 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
 	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
 
 	eb = (struct extent_buffer *)page->private;
-	set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = failed_mirror;
 	atomic_dec(&eb->io_pages);
 	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 44d0497..8ebe6bf 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7235,17 +7235,19 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	btrfs_set_buffer_uptodate(buf);
 
 	if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
+		buf->log_index = root->log_transid % 2;
 		/*
 		 * we allow two log transactions at a time, use different
 		 * EXENT bit to differentiate dirty pages.
 		 */
-		if (root->log_transid % 2 == 0)
+		if (buf->log_index == 0)
 			set_extent_dirty(&root->dirty_log_pages, buf->start,
 					buf->start + buf->len - 1, GFP_NOFS);
 		else
 			set_extent_new(&root->dirty_log_pages, buf->start,
 					buf->start + buf->len - 1, GFP_NOFS);
 	} else {
+		buf->log_index = -1;
 		set_extent_dirty(&trans->transaction->dirty_pages, buf->start,
 			 buf->start + buf->len - 1, GFP_NOFS);
 	}
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 828aded..9cc757f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3601,6 +3601,68 @@ static void end_extent_buffer_writeback(struct extent_buffer *eb)
 	wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
 }
 
+static void set_btree_ioerr(struct page *page)
+{
+	struct extent_buffer *eb = (struct extent_buffer *)page->private;
+	struct btrfs_inode *btree_ino = BTRFS_I(eb->fs_info->btree_inode);
+
+	SetPageError(page);
+	if (test_and_set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))
+		return;
+
+	/*
+	 * If writeback for a btree extent that doesn't belong to a log tree
+	 * failed, increment the counter transaction->eb_write_errors.
+	 * We do this because while the transaction is running and before it's
+	 * committing (when we call filemap_fdata[write|wait]_range against
+	 * the btree inode), we might have
+	 * btree_inode->i_mapping->a_ops->writepages() called by the VM - if it
+	 * returns an error or an error happens during writeback, when we're
+	 * committing the transaction we wouldn't know about it, since the pages
+	 * can be no longer dirty nor marked anymore for writeback (if a
+	 * subsequent modification to the extent buffer didn't happen before the
+	 * transaction commit), which makes filemap_fdata[write|wait]_range not
+	 * able to find the pages tagged with SetPageError at transaction
+	 * commit time. So if this happens we must abort the transaction,
+	 * otherwise we commit a super block with btree roots that point to
+	 * btree nodes/leafs whose content on disk is invalid - either garbage
+	 * or the content of some node/leaf from a past generation that got
+	 * cowed or deleted and is no longer valid.
+	 *
+	 * Note: setting AS_EIO/AS_ENOSPC in the btree inode's i_mapping would
+	 * not be enough - we need to distinguish between log tree extents vs
+	 * non-log tree extents, and the next filemap_fdatawait_range() call
+	 * will catch and clear such errors in the mapping - and that call might
+	 * be from a log sync and not from a transaction commit. Also, checking
+	 * for the eb flag EXTENT_BUFFER_WRITE_ERR at transaction commit time is
+	 * not done and would not be reliable - the eb might have been released
+	 * from memory and reading it back again means that flag would not be
+	 * set (since it's a runtime flag, not persisted on disk).
+	 *
+	 * Using the flags below in the btree inode also makes us achieve the
+	 * goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
+	 * writeback for all dirty pages and before filemap_fdatawait_range()
+	 * is called, the writeback for all dirty pages had already finished
+	 * with errors - because we were not using AS_EIO/AS_ENOSPC,
+	 * filemap_fdatawait_range() would return success, as it could not know
+	 * that writeback errors happened (the pages were no longer tagged for
+	 * writeback).
+	 */
+	switch (eb->log_index) {
+	case -1:
+		set_bit(BTRFS_INODE_BTREE_ERR, &btree_ino->runtime_flags);
+		break;
+	case 0:
+		set_bit(BTRFS_INODE_BTREE_LOG1_ERR, &btree_ino->runtime_flags);
+		break;
+	case 1:
+		set_bit(BTRFS_INODE_BTREE_LOG2_ERR, &btree_ino->runtime_flags);
+		break;
+	default:
+		BUG(); /* unexpected, logic error */
+	}
+}
+
 static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 {
 	struct bio_vec *bvec;
@@ -3614,10 +3676,9 @@ static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 		BUG_ON(!eb);
 		done = atomic_dec_and_test(&eb->io_pages);
 
-		if (err || test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+		if (err || test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
 			ClearPageUptodate(page);
-			SetPageError(page);
+			set_btree_ioerr(page);
 		}
 
 		end_page_writeback(page);
@@ -3644,7 +3705,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	int rw = (epd->sync_io ? WRITE_SYNC : WRITE) | REQ_META;
 	int ret = 0;
 
-	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
 	num_pages = num_extent_pages(eb->start, eb->len);
 	atomic_set(&eb->io_pages, num_pages);
 	if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
@@ -3661,8 +3722,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 					 0, epd->bio_flags, bio_flags);
 		epd->bio_flags = bio_flags;
 		if (ret) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
-			SetPageError(p);
+			set_btree_ioerr(p);
 			end_page_writeback(p);
 			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
 				end_extent_buffer_writeback(eb);
@@ -5054,7 +5114,7 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 		goto unlock_exit;
 	}
 
-	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = 0;
 	atomic_set(&eb->io_pages, num_reads);
 	for (i = start_i; i < num_pages; i++) {
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 5e91fb9..06f030c 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -41,9 +41,10 @@
 #define EXTENT_BUFFER_TREE_REF 5
 #define EXTENT_BUFFER_STALE 6
 #define EXTENT_BUFFER_WRITEBACK 7
-#define EXTENT_BUFFER_IOERR 8
+#define EXTENT_BUFFER_READ_ERR 8        /* read IO error */
 #define EXTENT_BUFFER_DUMMY 9
 #define EXTENT_BUFFER_IN_TREE 10
+#define EXTENT_BUFFER_WRITE_ERR 11    /* write IO error */
 
 /* these are flags for extent_clear_unlock_delalloc */
 #define PAGE_UNLOCK		(1 << 0)
@@ -141,7 +142,9 @@ struct extent_buffer {
 	atomic_t blocking_readers;
 	atomic_t spinning_readers;
 	atomic_t spinning_writers;
-	int lock_nested;
+	short lock_nested;
+	/* >= 0 if eb belongs to a log tree, -1 otherwise */
+	short log_index;
 
 	/* protects write locks */
 	rwlock_t lock;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 16d0c1b..a47b100 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -851,6 +851,8 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 	struct extent_state *cached_state = NULL;
 	u64 start = 0;
 	u64 end;
+	struct btrfs_inode *btree_ino = BTRFS_I(root->fs_info->btree_inode);
+	bool errors = false;
 
 	while (!find_first_extent_bit(dirty_pages, start, &start, &end,
 				      EXTENT_NEED_WAIT, &cached_state)) {
@@ -864,6 +866,26 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 	}
 	if (err)
 		werr = err;
+
+	if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
+		if ((mark & EXTENT_DIRTY) &&
+		    test_and_clear_bit(BTRFS_INODE_BTREE_LOG1_ERR,
+				       &btree_ino->runtime_flags))
+			errors = true;
+
+		if ((mark & EXTENT_NEW) &&
+		    test_and_clear_bit(BTRFS_INODE_BTREE_LOG2_ERR,
+				       &btree_ino->runtime_flags))
+			errors = true;
+	} else {
+		if (test_and_clear_bit(BTRFS_INODE_BTREE_ERR,
+				       &btree_ino->runtime_flags))
+			errors = true;
+	}
+
+	if (errors && !werr)
+		werr = -EIO;
+
 	return werr;
 }
 
@@ -1629,6 +1651,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 {
 	struct btrfs_transaction *cur_trans = trans->transaction;
 	struct btrfs_transaction *prev_trans = NULL;
+	struct btrfs_inode *btree_ino = BTRFS_I(root->fs_info->btree_inode);
 	int ret;
 
 	/* Stop the commit early if ->aborted is set */
@@ -1871,6 +1894,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 	btrfs_update_commit_device_size(root->fs_info);
 	btrfs_update_commit_device_bytes_used(root, cur_trans);
 
+	clear_bit(BTRFS_INODE_BTREE_LOG1_ERR, &btree_ino->runtime_flags);
+	clear_bit(BTRFS_INODE_BTREE_LOG2_ERR, &btree_ino->runtime_flags);
+
 	spin_lock(&root->fs_info->trans_lock);
 	cur_trans->state = TRANS_STATE_UNBLOCKED;
 	root->fs_info->running_transaction = NULL;
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2014-09-26 10:27 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-22 16:41 [PATCH 1/2] Btrfs: add missing end_page_writeback on submit_extent_page failure Filipe Manana
2014-09-22 16:41 ` [PATCH 2/2] Btrfs: be aware of btree inode write errors to avoid fs corruption Filipe Manana
2014-09-23 14:05   ` Liu Bo
2014-09-23 14:27     ` Filipe David Manana
2014-09-24  0:20   ` [PATCH 2/2 v2] " Filipe Manana
2014-09-24  0:43   ` [PATCH 2/2 v3] " Filipe Manana
2014-09-24  4:49   ` [PATCH 2/2 v4] " Filipe Manana
2014-09-24 10:28   ` [PATCH 2/2 v5] " Filipe Manana
2014-09-24 11:16     ` Miao Xie
2014-09-24 12:59       ` Filipe David Manana
2014-09-24 17:19   ` [PATCH 2/2 v6] " Filipe Manana
2014-09-24 22:20   ` [PATCH 2/2 v7] " Filipe Manana
2014-09-25 11:12   ` [PATCH 2/2 v8] " Filipe Manana
2014-09-25 16:33   ` [PATCH 2/2 v9] " Filipe Manana
2014-09-25 16:51   ` [PATCH 2/2 v10] " Filipe Manana
2014-09-25 16:30     ` David Sterba
2014-09-25 18:01   ` [PATCH 2/2 v11] " Filipe Manana
2014-09-26 11:25   ` [PATCH 2/2 v12] " Filipe Manana
2014-09-23 10:14 ` [PATCH 1/2] Btrfs: add missing end_page_writeback on submit_extent_page failure Liu Bo
2014-09-23 13:03   ` Filipe David Manana
2014-09-23 13:39     ` Liu Bo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.