All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
@ 2017-03-06  2:55 Qu Wenruo
  2017-03-06  2:55 ` [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Qu Wenruo @ 2017-03-06  2:55 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

[BUG]
When btrfs_reloc_clone_csum() reports error, it can underflow metadata
and leads to kernel assertion on outstanding extents in
run_delalloc_nocow() and cow_file_range().

 BTRFS info (device vdb5): relocating block group 12582912 flags data
 BTRFS info (device vdb5): found 1 extents
 assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858

Currently, due to another bug blocking ordered extents, the bug is only
reproducible under certain block group layout and using error injection.

a) Create one data block group with one 4K extent in it.
   To avoid the bug that hangs btrfs due to ordered extent which never
   finishes
b) Make btrfs_reloc_clone_csum() always fail
c) Relocate that block group

[CAUSE]
run_delalloc_nocow() and cow_file_range() handles error from
btrfs_reloc_clone_csum() wrongly:

(The ascii chart shows a more generic case of this bug other than the
bug mentioned above)

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
                    |<----------- cleanup range --------------->|
|<-----------  ----------->|
             \/
 btrfs_finish_ordered_io() range

So error handler, which calls extent_clear_unlock_delalloc() with
EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
will both cover OE n, and free its metadata, causing metadata under flow.

[Fix]
The fix is to ensure after calling btrfs_add_ordered_extent(), we only
call error handler after increasing the iteration offset, so that
cleanup range won't cover any created ordered extent.

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<-----------  ----------->|<---------- cleanup range --------->|
             \/
 btrfs_finish_ordered_io() range

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
changelog:
v6:
  New, split from v5 patch, as this is a separate bug.
---
 fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 39 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b2bc07aad1ae..1d83d504f2e5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode,
 		    BTRFS_DATA_RELOC_TREE_OBJECTID) {
 			ret = btrfs_reloc_clone_csums(inode, start,
 						      cur_alloc_size);
+			/*
+			 * Only drop cache here, and process as normal.
+			 *
+			 * We must not allow extent_clear_unlock_delalloc()
+			 * at out_unlock label to free meta of this ordered
+			 * extent, as its meta should be freed by
+			 * btrfs_finish_ordered_io().
+			 *
+			 * So we must continue until @start is increased to
+			 * skip current ordered extent.
+			 */
 			if (ret)
-				goto out_drop_extent_cache;
+				btrfs_drop_extent_cache(BTRFS_I(inode), start,
+						start + ram_size - 1, 0);
 		}
 
 		btrfs_dec_block_group_reservations(fs_info, ins.objectid);
 
-		if (disk_num_bytes < cur_alloc_size)
-			break;
-
 		/* we're not doing compressed IO, don't unlock the first
 		 * page (which the caller expects to stay locked), don't
 		 * clear any dirty bits and don't set any writeback bits
@@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode,
 					     delalloc_end, locked_page,
 					     EXTENT_LOCKED | EXTENT_DELALLOC,
 					     op);
-		disk_num_bytes -= cur_alloc_size;
+		if (disk_num_bytes > cur_alloc_size)
+			disk_num_bytes = 0;
+		else
+			disk_num_bytes -= cur_alloc_size;
 		num_bytes -= cur_alloc_size;
 		alloc_hint = ins.objectid + ins.offset;
 		start += cur_alloc_size;
+
+		/*
+		 * btrfs_reloc_clone_csums() error, since start is increased
+		 * extent_clear_unlock_delalloc() at out_unlock label won't
+		 * free metadata of current ordered extent, we're OK to exit.
+		 */
+		if (ret)
+			goto out_unlock;
 	}
 out:
 	return ret;
@@ -1414,15 +1434,14 @@ static noinline int run_delalloc_nocow(struct inode *inode,
 		BUG_ON(ret); /* -ENOMEM */
 
 		if (root->root_key.objectid ==
-		    BTRFS_DATA_RELOC_TREE_OBJECTID) {
+		    BTRFS_DATA_RELOC_TREE_OBJECTID)
+			/*
+			 * Error handled later, as we must prevent
+			 * extent_clear_unlock_delalloc() in error handler
+			 * from freeing metadata of created ordered extent.
+			 */
 			ret = btrfs_reloc_clone_csums(inode, cur_offset,
 						      num_bytes);
-			if (ret) {
-				if (!nolock && nocow)
-					btrfs_end_write_no_snapshoting(root);
-				goto error;
-			}
-		}
 
 		extent_clear_unlock_delalloc(inode, cur_offset,
 					     cur_offset + num_bytes - 1, end,
@@ -1434,6 +1453,14 @@ static noinline int run_delalloc_nocow(struct inode *inode,
 		if (!nolock && nocow)
 			btrfs_end_write_no_snapshoting(root);
 		cur_offset = extent_end;
+
+		/*
+		 * btrfs_reloc_clone_csums() error, now we're OK to call error
+		 * handler, as metadata for created ordered extent will only
+		 * be freed by btrfs_finish_ordered_io().
+		 */
+		if (ret)
+			goto error;
 		if (cur_offset > end)
 			break;
 	}
-- 
2.12.0




^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang
  2017-03-06  2:55 [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Qu Wenruo
@ 2017-03-06  2:55 ` Qu Wenruo
  2017-03-06 23:18   ` Filipe Manana
  2017-03-07 22:11   ` Liu Bo
  2017-03-06 23:19 ` [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Filipe Manana
  2017-03-07 20:49 ` Liu Bo
  2 siblings, 2 replies; 16+ messages in thread
From: Qu Wenruo @ 2017-03-06  2:55 UTC (permalink / raw)
  To: fdmanana, linux-btrfs; +Cc: Filipe Manana

[BUG]
If run_delalloc_range() returns error and there is already some ordered
extents created, btrfs will be hanged with the following backtrace:

Call Trace:
 __schedule+0x2d4/0xae0
 schedule+0x3d/0x90
 btrfs_start_ordered_extent+0x160/0x200 [btrfs]
 ? wake_atomic_t_function+0x60/0x60
 btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
 btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
 btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
 process_one_work+0x2af/0x720
 ? process_one_work+0x22b/0x720
 worker_thread+0x4b/0x4f0
 kthread+0x10f/0x150
 ? process_one_work+0x720/0x720
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x2e/0x40

[CAUSE]

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<>|                       |<---------- cleanup range --------->|
 ||
 \_=> First page handled by end_extent_writepage() in __extent_writepage()

The problem is caused by error handler of run_delalloc_range(), which
doesn't handle any created ordered extents, leaving them waiting on
btrfs_finish_ordered_io() to finish.

However after run_delalloc_range() returns error, __extent_writepage()
won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered
except the first page, and btrfs_finish_ordered_io() won't be triggered
for created ordered extents either.

So OE 2~n will hang forever, and if OE 1 is larger than one page, it
will also hang.

[FIX]
Introduce btrfs_cleanup_ordered_extents() function to cleanup created
ordered extents and finish them manually.

The function is based on existing
btrfs_endio_direct_write_update_ordered() function, and modify it to
act just like btrfs_writepage_endio_hook() but handles specified range
other than one page.

After fix, delalloc error will be handled like:

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<>|<--------  ----------->|<------ old error handler --------->|
 ||          ||
 ||          \_=> Cleaned up by cleanup_ordered_extents()
 \_=> First page handled by end_extent_writepage() in __extent_writepage()

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
changelog:
v2:
  Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing
  outstanding extents, which is already done by
  extent_clear_unlock_delalloc() with EXTENT_DO_ACCOUNT control bit
v3:
  Skip first page to avoid underflow ordered->bytes_left.
  Fix range passed in cow_file_range() which doesn't cover the whole
  extent.
  Expend extent_clear_unlock_delalloc() range to allow them to handle
  metadata release.
v4:
  Don't use extra bit to skip metadata freeing for ordered extent,
  but only handle btrfs_reloc_clone_csums() error just before processing
  next extent.
  This makes error handle much easier for run_delalloc_nocow().
v5:
  Variant gramma and comment fixes suggested by Filipe Manana
  Enhanced commit message to focus on the generic error handler bug,
  pointed out by Filipe Manana
  Adding missing wq/func user in __endio_write_update_ordered(), pointed
  out by Filipe Manana.
  Enhanced commit message with ascii art to explain the bug more easily.
  Fix a bug which can leads to corrupted extent allocation, exposed by
  Filipe Manana.
v6:
  Split the metadata underflow fix to another patch.
  Use better comment and btrfs_cleanup_ordered_extent() implementation
  from Filipe Manana.
---
 fs/btrfs/inode.c | 62 ++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 47 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1d83d504f2e5..9b03eb9b27d0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -115,6 +115,30 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len,
 				       u64 ram_bytes, int compress_type,
 				       int type);
 
+static void __endio_write_update_ordered(struct inode *inode,
+					 u64 offset, u64 bytes,
+					 bool uptodate);
+
+/*
+ * Cleanup all submitted ordered extents in specified range to handle errors
+ * from the fill_dellaloc() callback.
+ *
+ * NOTE: caller must ensure that when an error happens, it can not call
+ * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING
+ * and EXTENT_DELALLOC simultaneously, because that causes the reserved metadata
+ * to be released, which we want to happen only when finishing the ordered
+ * extent (btrfs_finish_ordered_io()). Also note that the caller of the
+ * fill_dealloc() callback already does proper cleanup for the first page of
+ * the range, that is, it invokes the callback writepage_end_io_hook() for the
+ * range of the first page.
+ */
+static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
+						 u64 offset, u64 bytes)
+{
+	return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
+					    bytes - PAGE_SIZE, false);
+}
+
 static int btrfs_dirty_inode(struct inode *inode);
 
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
@@ -1536,6 +1560,8 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
 		ret = cow_file_range_async(inode, locked_page, start, end,
 					   page_started, nr_written);
 	}
+	if (ret)
+		btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
 	return ret;
 }
 
@@ -8142,17 +8168,26 @@ static void btrfs_endio_direct_read(struct bio *bio)
 	bio_put(bio);
 }
 
-static void btrfs_endio_direct_write_update_ordered(struct inode *inode,
-						    const u64 offset,
-						    const u64 bytes,
-						    const int uptodate)
+static void __endio_write_update_ordered(struct inode *inode,
+					 u64 offset, u64 bytes,
+					 bool uptodate)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct btrfs_ordered_extent *ordered = NULL;
+	struct btrfs_workqueue *wq;
+	btrfs_work_func_t func;
 	u64 ordered_offset = offset;
 	u64 ordered_bytes = bytes;
 	int ret;
 
+	if (btrfs_is_free_space_inode(BTRFS_I(inode))) {
+		wq = fs_info->endio_freespace_worker;
+		func = btrfs_freespace_write_helper;
+	} else {
+		wq = fs_info->endio_write_workers;
+		func = btrfs_endio_write_helper;
+	}
+
 again:
 	ret = btrfs_dec_test_first_ordered_pending(inode, &ordered,
 						   &ordered_offset,
@@ -8161,9 +8196,8 @@ static void btrfs_endio_direct_write_update_ordered(struct inode *inode,
 	if (!ret)
 		goto out_test;
 
-	btrfs_init_work(&ordered->work, btrfs_endio_write_helper,
-			finish_ordered_fn, NULL, NULL);
-	btrfs_queue_work(fs_info->endio_write_workers, &ordered->work);
+	btrfs_init_work(&ordered->work, func, finish_ordered_fn, NULL, NULL);
+	btrfs_queue_work(wq, &ordered->work);
 out_test:
 	/*
 	 * our bio might span multiple ordered extents.  If we haven't
@@ -8181,10 +8215,8 @@ static void btrfs_endio_direct_write(struct bio *bio)
 	struct btrfs_dio_private *dip = bio->bi_private;
 	struct bio *dio_bio = dip->dio_bio;
 
-	btrfs_endio_direct_write_update_ordered(dip->inode,
-						dip->logical_offset,
-						dip->bytes,
-						!bio->bi_error);
+	__endio_write_update_ordered(dip->inode, dip->logical_offset,
+				     dip->bytes, !bio->bi_error);
 
 	kfree(dip);
 
@@ -8545,10 +8577,10 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 		io_bio = NULL;
 	} else {
 		if (write)
-			btrfs_endio_direct_write_update_ordered(inode,
+			__endio_write_update_ordered(inode,
 						file_offset,
 						dio_bio->bi_iter.bi_size,
-						0);
+						false);
 		else
 			unlock_extent(&BTRFS_I(inode)->io_tree, file_offset,
 			      file_offset + dio_bio->bi_iter.bi_size - 1);
@@ -8683,11 +8715,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 			 */
 			if (dio_data.unsubmitted_oe_range_start <
 			    dio_data.unsubmitted_oe_range_end)
-				btrfs_endio_direct_write_update_ordered(inode,
+				__endio_write_update_ordered(inode,
 					dio_data.unsubmitted_oe_range_start,
 					dio_data.unsubmitted_oe_range_end -
 					dio_data.unsubmitted_oe_range_start,
-					0);
+					false);
 		} else if (ret >= 0 && (size_t)ret < count)
 			btrfs_delalloc_release_space(inode, offset,
 						     count - (size_t)ret);
-- 
2.12.0




^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang
  2017-03-06  2:55 ` [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang Qu Wenruo
@ 2017-03-06 23:18   ` Filipe Manana
  2017-03-07 22:11   ` Liu Bo
  1 sibling, 0 replies; 16+ messages in thread
From: Filipe Manana @ 2017-03-06 23:18 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Filipe Manana

On Mon, Mar 6, 2017 at 2:55 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> [BUG]
> If run_delalloc_range() returns error and there is already some ordered
> extents created, btrfs will be hanged with the following backtrace:
>
> Call Trace:
>  __schedule+0x2d4/0xae0
>  schedule+0x3d/0x90
>  btrfs_start_ordered_extent+0x160/0x200 [btrfs]
>  ? wake_atomic_t_function+0x60/0x60
>  btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
>  btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
>  btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
>  process_one_work+0x2af/0x720
>  ? process_one_work+0x22b/0x720
>  worker_thread+0x4b/0x4f0
>  kthread+0x10f/0x150
>  ? process_one_work+0x720/0x720
>  ? kthread_create_on_node+0x40/0x40
>  ret_from_fork+0x2e/0x40
>
> [CAUSE]
>
> |<------------------ delalloc range --------------------------->|
> | OE 1 | OE 2 | ... | OE n |
> |<>|                       |<---------- cleanup range --------->|
>  ||
>  \_=> First page handled by end_extent_writepage() in __extent_writepage()
>
> The problem is caused by error handler of run_delalloc_range(), which
> doesn't handle any created ordered extents, leaving them waiting on
> btrfs_finish_ordered_io() to finish.
>
> However after run_delalloc_range() returns error, __extent_writepage()
> won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered
> except the first page, and btrfs_finish_ordered_io() won't be triggered
> for created ordered extents either.
>
> So OE 2~n will hang forever, and if OE 1 is larger than one page, it
> will also hang.
>
> [FIX]
> Introduce btrfs_cleanup_ordered_extents() function to cleanup created
> ordered extents and finish them manually.
>
> The function is based on existing
> btrfs_endio_direct_write_update_ordered() function, and modify it to
> act just like btrfs_writepage_endio_hook() but handles specified range
> other than one page.
>
> After fix, delalloc error will be handled like:
>
> |<------------------ delalloc range --------------------------->|
> | OE 1 | OE 2 | ... | OE n |
> |<>|<--------  ----------->|<------ old error handler --------->|
>  ||          ||
>  ||          \_=> Cleaned up by cleanup_ordered_extents()
>  \_=> First page handled by end_extent_writepage() in __extent_writepage()
>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

thanks

> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> changelog:
> v2:
>   Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing
>   outstanding extents, which is already done by
>   extent_clear_unlock_delalloc() with EXTENT_DO_ACCOUNT control bit
> v3:
>   Skip first page to avoid underflow ordered->bytes_left.
>   Fix range passed in cow_file_range() which doesn't cover the whole
>   extent.
>   Expend extent_clear_unlock_delalloc() range to allow them to handle
>   metadata release.
> v4:
>   Don't use extra bit to skip metadata freeing for ordered extent,
>   but only handle btrfs_reloc_clone_csums() error just before processing
>   next extent.
>   This makes error handle much easier for run_delalloc_nocow().
> v5:
>   Variant gramma and comment fixes suggested by Filipe Manana
>   Enhanced commit message to focus on the generic error handler bug,
>   pointed out by Filipe Manana
>   Adding missing wq/func user in __endio_write_update_ordered(), pointed
>   out by Filipe Manana.
>   Enhanced commit message with ascii art to explain the bug more easily.
>   Fix a bug which can leads to corrupted extent allocation, exposed by
>   Filipe Manana.
> v6:
>   Split the metadata underflow fix to another patch.
>   Use better comment and btrfs_cleanup_ordered_extent() implementation
>   from Filipe Manana.
> ---
>  fs/btrfs/inode.c | 62 ++++++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 47 insertions(+), 15 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 1d83d504f2e5..9b03eb9b27d0 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -115,6 +115,30 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len,
>                                        u64 ram_bytes, int compress_type,
>                                        int type);
>
> +static void __endio_write_update_ordered(struct inode *inode,
> +                                        u64 offset, u64 bytes,
> +                                        bool uptodate);
> +
> +/*
> + * Cleanup all submitted ordered extents in specified range to handle errors
> + * from the fill_dellaloc() callback.
> + *
> + * NOTE: caller must ensure that when an error happens, it can not call
> + * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING
> + * and EXTENT_DELALLOC simultaneously, because that causes the reserved metadata
> + * to be released, which we want to happen only when finishing the ordered
> + * extent (btrfs_finish_ordered_io()). Also note that the caller of the
> + * fill_dealloc() callback already does proper cleanup for the first page of
> + * the range, that is, it invokes the callback writepage_end_io_hook() for the
> + * range of the first page.
> + */
> +static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
> +                                                u64 offset, u64 bytes)
> +{
> +       return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
> +                                           bytes - PAGE_SIZE, false);
> +}
> +
>  static int btrfs_dirty_inode(struct inode *inode);
>
>  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> @@ -1536,6 +1560,8 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
>                 ret = cow_file_range_async(inode, locked_page, start, end,
>                                            page_started, nr_written);
>         }
> +       if (ret)
> +               btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
>         return ret;
>  }
>
> @@ -8142,17 +8168,26 @@ static void btrfs_endio_direct_read(struct bio *bio)
>         bio_put(bio);
>  }
>
> -static void btrfs_endio_direct_write_update_ordered(struct inode *inode,
> -                                                   const u64 offset,
> -                                                   const u64 bytes,
> -                                                   const int uptodate)
> +static void __endio_write_update_ordered(struct inode *inode,
> +                                        u64 offset, u64 bytes,
> +                                        bool uptodate)
>  {
>         struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>         struct btrfs_ordered_extent *ordered = NULL;
> +       struct btrfs_workqueue *wq;
> +       btrfs_work_func_t func;
>         u64 ordered_offset = offset;
>         u64 ordered_bytes = bytes;
>         int ret;
>
> +       if (btrfs_is_free_space_inode(BTRFS_I(inode))) {
> +               wq = fs_info->endio_freespace_worker;
> +               func = btrfs_freespace_write_helper;
> +       } else {
> +               wq = fs_info->endio_write_workers;
> +               func = btrfs_endio_write_helper;
> +       }
> +
>  again:
>         ret = btrfs_dec_test_first_ordered_pending(inode, &ordered,
>                                                    &ordered_offset,
> @@ -8161,9 +8196,8 @@ static void btrfs_endio_direct_write_update_ordered(struct inode *inode,
>         if (!ret)
>                 goto out_test;
>
> -       btrfs_init_work(&ordered->work, btrfs_endio_write_helper,
> -                       finish_ordered_fn, NULL, NULL);
> -       btrfs_queue_work(fs_info->endio_write_workers, &ordered->work);
> +       btrfs_init_work(&ordered->work, func, finish_ordered_fn, NULL, NULL);
> +       btrfs_queue_work(wq, &ordered->work);
>  out_test:
>         /*
>          * our bio might span multiple ordered extents.  If we haven't
> @@ -8181,10 +8215,8 @@ static void btrfs_endio_direct_write(struct bio *bio)
>         struct btrfs_dio_private *dip = bio->bi_private;
>         struct bio *dio_bio = dip->dio_bio;
>
> -       btrfs_endio_direct_write_update_ordered(dip->inode,
> -                                               dip->logical_offset,
> -                                               dip->bytes,
> -                                               !bio->bi_error);
> +       __endio_write_update_ordered(dip->inode, dip->logical_offset,
> +                                    dip->bytes, !bio->bi_error);
>
>         kfree(dip);
>
> @@ -8545,10 +8577,10 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
>                 io_bio = NULL;
>         } else {
>                 if (write)
> -                       btrfs_endio_direct_write_update_ordered(inode,
> +                       __endio_write_update_ordered(inode,
>                                                 file_offset,
>                                                 dio_bio->bi_iter.bi_size,
> -                                               0);
> +                                               false);
>                 else
>                         unlock_extent(&BTRFS_I(inode)->io_tree, file_offset,
>                               file_offset + dio_bio->bi_iter.bi_size - 1);
> @@ -8683,11 +8715,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>                          */
>                         if (dio_data.unsubmitted_oe_range_start <
>                             dio_data.unsubmitted_oe_range_end)
> -                               btrfs_endio_direct_write_update_ordered(inode,
> +                               __endio_write_update_ordered(inode,
>                                         dio_data.unsubmitted_oe_range_start,
>                                         dio_data.unsubmitted_oe_range_end -
>                                         dio_data.unsubmitted_oe_range_start,
> -                                       0);
> +                                       false);
>                 } else if (ret >= 0 && (size_t)ret < count)
>                         btrfs_delalloc_release_space(inode, offset,
>                                                      count - (size_t)ret);
> --
> 2.12.0
>
>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  2017-03-06  2:55 [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Qu Wenruo
  2017-03-06  2:55 ` [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang Qu Wenruo
@ 2017-03-06 23:19 ` Filipe Manana
  2017-03-07 17:32   ` David Sterba
  2017-03-07 20:49 ` Liu Bo
  2 siblings, 1 reply; 16+ messages in thread
From: Filipe Manana @ 2017-03-06 23:19 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Mar 6, 2017 at 2:55 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> [BUG]
> When btrfs_reloc_clone_csum() reports error, it can underflow metadata
> and leads to kernel assertion on outstanding extents in
> run_delalloc_nocow() and cow_file_range().
>
>  BTRFS info (device vdb5): relocating block group 12582912 flags data
>  BTRFS info (device vdb5): found 1 extents
>  assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858
>
> Currently, due to another bug blocking ordered extents, the bug is only
> reproducible under certain block group layout and using error injection.
>
> a) Create one data block group with one 4K extent in it.
>    To avoid the bug that hangs btrfs due to ordered extent which never
>    finishes
> b) Make btrfs_reloc_clone_csum() always fail
> c) Relocate that block group
>
> [CAUSE]
> run_delalloc_nocow() and cow_file_range() handles error from
> btrfs_reloc_clone_csum() wrongly:
>
> (The ascii chart shows a more generic case of this bug other than the
> bug mentioned above)
>
> |<------------------ delalloc range --------------------------->|
> | OE 1 | OE 2 | ... | OE n |
>                     |<----------- cleanup range --------------->|
> |<-----------  ----------->|
>              \/
>  btrfs_finish_ordered_io() range
>
> So error handler, which calls extent_clear_unlock_delalloc() with
> EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
> will both cover OE n, and free its metadata, causing metadata under flow.
>
> [Fix]
> The fix is to ensure after calling btrfs_add_ordered_extent(), we only
> call error handler after increasing the iteration offset, so that
> cleanup range won't cover any created ordered extent.
>
> |<------------------ delalloc range --------------------------->|
> | OE 1 | OE 2 | ... | OE n |
> |<-----------  ----------->|<---------- cleanup range --------->|
>              \/
>  btrfs_finish_ordered_io() range
>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

thanks

> ---
> changelog:
> v6:
>   New, split from v5 patch, as this is a separate bug.
> ---
>  fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 39 insertions(+), 12 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index b2bc07aad1ae..1d83d504f2e5 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode,
>                     BTRFS_DATA_RELOC_TREE_OBJECTID) {
>                         ret = btrfs_reloc_clone_csums(inode, start,
>                                                       cur_alloc_size);
> +                       /*
> +                        * Only drop cache here, and process as normal.
> +                        *
> +                        * We must not allow extent_clear_unlock_delalloc()
> +                        * at out_unlock label to free meta of this ordered
> +                        * extent, as its meta should be freed by
> +                        * btrfs_finish_ordered_io().
> +                        *
> +                        * So we must continue until @start is increased to
> +                        * skip current ordered extent.
> +                        */
>                         if (ret)
> -                               goto out_drop_extent_cache;
> +                               btrfs_drop_extent_cache(BTRFS_I(inode), start,
> +                                               start + ram_size - 1, 0);
>                 }
>
>                 btrfs_dec_block_group_reservations(fs_info, ins.objectid);
>
> -               if (disk_num_bytes < cur_alloc_size)
> -                       break;
> -
>                 /* we're not doing compressed IO, don't unlock the first
>                  * page (which the caller expects to stay locked), don't
>                  * clear any dirty bits and don't set any writeback bits
> @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode,
>                                              delalloc_end, locked_page,
>                                              EXTENT_LOCKED | EXTENT_DELALLOC,
>                                              op);
> -               disk_num_bytes -= cur_alloc_size;
> +               if (disk_num_bytes > cur_alloc_size)
> +                       disk_num_bytes = 0;
> +               else
> +                       disk_num_bytes -= cur_alloc_size;
>                 num_bytes -= cur_alloc_size;
>                 alloc_hint = ins.objectid + ins.offset;
>                 start += cur_alloc_size;
> +
> +               /*
> +                * btrfs_reloc_clone_csums() error, since start is increased
> +                * extent_clear_unlock_delalloc() at out_unlock label won't
> +                * free metadata of current ordered extent, we're OK to exit.
> +                */
> +               if (ret)
> +                       goto out_unlock;
>         }
>  out:
>         return ret;
> @@ -1414,15 +1434,14 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>                 BUG_ON(ret); /* -ENOMEM */
>
>                 if (root->root_key.objectid ==
> -                   BTRFS_DATA_RELOC_TREE_OBJECTID) {
> +                   BTRFS_DATA_RELOC_TREE_OBJECTID)
> +                       /*
> +                        * Error handled later, as we must prevent
> +                        * extent_clear_unlock_delalloc() in error handler
> +                        * from freeing metadata of created ordered extent.
> +                        */
>                         ret = btrfs_reloc_clone_csums(inode, cur_offset,
>                                                       num_bytes);
> -                       if (ret) {
> -                               if (!nolock && nocow)
> -                                       btrfs_end_write_no_snapshoting(root);
> -                               goto error;
> -                       }
> -               }
>
>                 extent_clear_unlock_delalloc(inode, cur_offset,
>                                              cur_offset + num_bytes - 1, end,
> @@ -1434,6 +1453,14 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>                 if (!nolock && nocow)
>                         btrfs_end_write_no_snapshoting(root);
>                 cur_offset = extent_end;
> +
> +               /*
> +                * btrfs_reloc_clone_csums() error, now we're OK to call error
> +                * handler, as metadata for created ordered extent will only
> +                * be freed by btrfs_finish_ordered_io().
> +                */
> +               if (ret)
> +                       goto error;
>                 if (cur_offset > end)
>                         break;
>         }
> --
> 2.12.0
>
>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  2017-03-06 23:19 ` [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Filipe Manana
@ 2017-03-07 17:32   ` David Sterba
  2017-03-07 20:51     ` Liu Bo
  0 siblings, 1 reply; 16+ messages in thread
From: David Sterba @ 2017-03-07 17:32 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Qu Wenruo, linux-btrfs

On Mon, Mar 06, 2017 at 11:19:32PM +0000, Filipe Manana wrote:
> On Mon, Mar 6, 2017 at 2:55 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> > [BUG]
> > When btrfs_reloc_clone_csum() reports error, it can underflow metadata
> > and leads to kernel assertion on outstanding extents in
> > run_delalloc_nocow() and cow_file_range().
> >
> >  BTRFS info (device vdb5): relocating block group 12582912 flags data
> >  BTRFS info (device vdb5): found 1 extents
> >  assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858
> >
> > Currently, due to another bug blocking ordered extents, the bug is only
> > reproducible under certain block group layout and using error injection.
> >
> > a) Create one data block group with one 4K extent in it.
> >    To avoid the bug that hangs btrfs due to ordered extent which never
> >    finishes
> > b) Make btrfs_reloc_clone_csum() always fail
> > c) Relocate that block group
> >
> > [CAUSE]
> > run_delalloc_nocow() and cow_file_range() handles error from
> > btrfs_reloc_clone_csum() wrongly:
> >
> > (The ascii chart shows a more generic case of this bug other than the
> > bug mentioned above)
> >
> > |<------------------ delalloc range --------------------------->|
> > | OE 1 | OE 2 | ... | OE n |
> >                     |<----------- cleanup range --------------->|
> > |<-----------  ----------->|
> >              \/
> >  btrfs_finish_ordered_io() range
> >
> > So error handler, which calls extent_clear_unlock_delalloc() with
> > EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
> > will both cover OE n, and free its metadata, causing metadata under flow.
> >
> > [Fix]
> > The fix is to ensure after calling btrfs_add_ordered_extent(), we only
> > call error handler after increasing the iteration offset, so that
> > cleanup range won't cover any created ordered extent.
> >
> > |<------------------ delalloc range --------------------------->|
> > | OE 1 | OE 2 | ... | OE n |
> > |<-----------  ----------->|<---------- cleanup range --------->|
> >              \/
> >  btrfs_finish_ordered_io() range
> >
> > Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> 
> Reviewed-by: Filipe Manana <fdmanana@suse.com>

1-2 added to for-next and queued for 4.11. Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  2017-03-06  2:55 [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Qu Wenruo
  2017-03-06  2:55 ` [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang Qu Wenruo
  2017-03-06 23:19 ` [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Filipe Manana
@ 2017-03-07 20:49 ` Liu Bo
  2017-03-07 20:59   ` Liu Bo
  2017-03-08  0:17   ` Qu Wenruo
  2 siblings, 2 replies; 16+ messages in thread
From: Liu Bo @ 2017-03-07 20:49 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: fdmanana, linux-btrfs

On Mon, Mar 06, 2017 at 10:55:46AM +0800, Qu Wenruo wrote:
> [BUG]
> When btrfs_reloc_clone_csum() reports error, it can underflow metadata
> and leads to kernel assertion on outstanding extents in
> run_delalloc_nocow() and cow_file_range().
> 
>  BTRFS info (device vdb5): relocating block group 12582912 flags data
>  BTRFS info (device vdb5): found 1 extents
>  assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858
> 
> Currently, due to another bug blocking ordered extents, the bug is only
> reproducible under certain block group layout and using error injection.
> 
> a) Create one data block group with one 4K extent in it.
>    To avoid the bug that hangs btrfs due to ordered extent which never
>    finishes
> b) Make btrfs_reloc_clone_csum() always fail
> c) Relocate that block group
> 
> [CAUSE]
> run_delalloc_nocow() and cow_file_range() handles error from
> btrfs_reloc_clone_csum() wrongly:
> 
> (The ascii chart shows a more generic case of this bug other than the
> bug mentioned above)
> 
> |<------------------ delalloc range --------------------------->|
> | OE 1 | OE 2 | ... | OE n |
>                     |<----------- cleanup range --------------->|
> |<-----------  ----------->|
>              \/
>  btrfs_finish_ordered_io() range
> 
> So error handler, which calls extent_clear_unlock_delalloc() with
> EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
> will both cover OE n, and free its metadata, causing metadata under flow.
> 
> [Fix]
> The fix is to ensure after calling btrfs_add_ordered_extent(), we only
> call error handler after increasing the iteration offset, so that
> cleanup range won't cover any created ordered extent.
> 
> |<------------------ delalloc range --------------------------->|
> | OE 1 | OE 2 | ... | OE n |
> |<-----------  ----------->|<---------- cleanup range --------->|
>              \/
>  btrfs_finish_ordered_io() range
> 
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
> changelog:
> v6:
>   New, split from v5 patch, as this is a separate bug.
> ---
>  fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 39 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index b2bc07aad1ae..1d83d504f2e5 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode,
>  		    BTRFS_DATA_RELOC_TREE_OBJECTID) {
>  			ret = btrfs_reloc_clone_csums(inode, start,
>  						      cur_alloc_size);
> +			/*
> +			 * Only drop cache here, and process as normal.
> +			 *
> +			 * We must not allow extent_clear_unlock_delalloc()
> +			 * at out_unlock label to free meta of this ordered
> +			 * extent, as its meta should be freed by
> +			 * btrfs_finish_ordered_io().
> +			 *
> +			 * So we must continue until @start is increased to
> +			 * skip current ordered extent.
> +			 */
>  			if (ret)
> -				goto out_drop_extent_cache;
> +				btrfs_drop_extent_cache(BTRFS_I(inode), start,
> +						start + ram_size - 1, 0);
>  		}
>  
>  		btrfs_dec_block_group_reservations(fs_info, ins.objectid);
>  
> -		if (disk_num_bytes < cur_alloc_size)
> -			break;
> -
>  		/* we're not doing compressed IO, don't unlock the first
>  		 * page (which the caller expects to stay locked), don't
>  		 * clear any dirty bits and don't set any writeback bits
> @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode,
>  					     delalloc_end, locked_page,
>  					     EXTENT_LOCKED | EXTENT_DELALLOC,
>  					     op);
> -		disk_num_bytes -= cur_alloc_size;
> +		if (disk_num_bytes > cur_alloc_size)
> +			disk_num_bytes = 0;
> +		else
> +			disk_num_bytes -= cur_alloc_size;

I don't get the logic here, why do we 'break' if disk_num_bytes > cur_alloc_size?

Thanks,

-liubo

>  		num_bytes -= cur_alloc_size;
>  		alloc_hint = ins.objectid + ins.offset;
>  		start += cur_alloc_size;
> +
> +		/*
> +		 * btrfs_reloc_clone_csums() error, since start is increased
> +		 * extent_clear_unlock_delalloc() at out_unlock label won't
> +		 * free metadata of current ordered extent, we're OK to exit.
> +		 */
> +		if (ret)
> +			goto out_unlock;
>  	}
>  out:
>  	return ret;
> @@ -1414,15 +1434,14 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>  		BUG_ON(ret); /* -ENOMEM */
>  
>  		if (root->root_key.objectid ==
> -		    BTRFS_DATA_RELOC_TREE_OBJECTID) {
> +		    BTRFS_DATA_RELOC_TREE_OBJECTID)
> +			/*
> +			 * Error handled later, as we must prevent
> +			 * extent_clear_unlock_delalloc() in error handler
> +			 * from freeing metadata of created ordered extent.
> +			 */
>  			ret = btrfs_reloc_clone_csums(inode, cur_offset,
>  						      num_bytes);
> -			if (ret) {
> -				if (!nolock && nocow)
> -					btrfs_end_write_no_snapshoting(root);
> -				goto error;
> -			}
> -		}
>  
>  		extent_clear_unlock_delalloc(inode, cur_offset,
>  					     cur_offset + num_bytes - 1, end,
> @@ -1434,6 +1453,14 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>  		if (!nolock && nocow)
>  			btrfs_end_write_no_snapshoting(root);
>  		cur_offset = extent_end;
> +
> +		/*
> +		 * btrfs_reloc_clone_csums() error, now we're OK to call error
> +		 * handler, as metadata for created ordered extent will only
> +		 * be freed by btrfs_finish_ordered_io().
> +		 */
> +		if (ret)
> +			goto error;
>  		if (cur_offset > end)
>  			break;
>  	}
> -- 
> 2.12.0
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  2017-03-07 17:32   ` David Sterba
@ 2017-03-07 20:51     ` Liu Bo
  0 siblings, 0 replies; 16+ messages in thread
From: Liu Bo @ 2017-03-07 20:51 UTC (permalink / raw)
  To: dsterba, Filipe Manana, Qu Wenruo, linux-btrfs

On Tue, Mar 07, 2017 at 06:32:23PM +0100, David Sterba wrote:
> On Mon, Mar 06, 2017 at 11:19:32PM +0000, Filipe Manana wrote:
> > On Mon, Mar 6, 2017 at 2:55 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> > > [BUG]
> > > When btrfs_reloc_clone_csum() reports error, it can underflow metadata
> > > and leads to kernel assertion on outstanding extents in
> > > run_delalloc_nocow() and cow_file_range().
> > >
> > >  BTRFS info (device vdb5): relocating block group 12582912 flags data
> > >  BTRFS info (device vdb5): found 1 extents
> > >  assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858
> > >
> > > Currently, due to another bug blocking ordered extents, the bug is only
> > > reproducible under certain block group layout and using error injection.
> > >
> > > a) Create one data block group with one 4K extent in it.
> > >    To avoid the bug that hangs btrfs due to ordered extent which never
> > >    finishes
> > > b) Make btrfs_reloc_clone_csum() always fail
> > > c) Relocate that block group
> > >
> > > [CAUSE]
> > > run_delalloc_nocow() and cow_file_range() handles error from
> > > btrfs_reloc_clone_csum() wrongly:
> > >
> > > (The ascii chart shows a more generic case of this bug other than the
> > > bug mentioned above)
> > >
> > > |<------------------ delalloc range --------------------------->|
> > > | OE 1 | OE 2 | ... | OE n |
> > >                     |<----------- cleanup range --------------->|
> > > |<-----------  ----------->|
> > >              \/
> > >  btrfs_finish_ordered_io() range
> > >
> > > So error handler, which calls extent_clear_unlock_delalloc() with
> > > EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
> > > will both cover OE n, and free its metadata, causing metadata under flow.
> > >
> > > [Fix]
> > > The fix is to ensure after calling btrfs_add_ordered_extent(), we only
> > > call error handler after increasing the iteration offset, so that
> > > cleanup range won't cover any created ordered extent.
> > >
> > > |<------------------ delalloc range --------------------------->|
> > > | OE 1 | OE 2 | ... | OE n |
> > > |<-----------  ----------->|<---------- cleanup range --------->|
> > >              \/
> > >  btrfs_finish_ordered_io() range
> > >
> > > Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> > 
> > Reviewed-by: Filipe Manana <fdmanana@suse.com>
> 
> 1-2 added to for-next and queued for 4.11. Thanks.

I'm afraid that patch 1 has a problem, please take a look at the thread.

Thanks,

-liubo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  2017-03-07 20:49 ` Liu Bo
@ 2017-03-07 20:59   ` Liu Bo
  2017-03-08  0:17     ` Filipe Manana
  2017-03-08  0:17   ` Qu Wenruo
  1 sibling, 1 reply; 16+ messages in thread
From: Liu Bo @ 2017-03-07 20:59 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: fdmanana, linux-btrfs

On Tue, Mar 07, 2017 at 12:49:58PM -0800, Liu Bo wrote:
> On Mon, Mar 06, 2017 at 10:55:46AM +0800, Qu Wenruo wrote:
> > [BUG]
> > When btrfs_reloc_clone_csum() reports error, it can underflow metadata
> > and leads to kernel assertion on outstanding extents in
> > run_delalloc_nocow() and cow_file_range().
> > 
> >  BTRFS info (device vdb5): relocating block group 12582912 flags data
> >  BTRFS info (device vdb5): found 1 extents
> >  assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858
> > 
> > Currently, due to another bug blocking ordered extents, the bug is only
> > reproducible under certain block group layout and using error injection.
> > 
> > a) Create one data block group with one 4K extent in it.
> >    To avoid the bug that hangs btrfs due to ordered extent which never
> >    finishes
> > b) Make btrfs_reloc_clone_csum() always fail
> > c) Relocate that block group
> > 
> > [CAUSE]
> > run_delalloc_nocow() and cow_file_range() handles error from
> > btrfs_reloc_clone_csum() wrongly:
> > 
> > (The ascii chart shows a more generic case of this bug other than the
> > bug mentioned above)
> > 
> > |<------------------ delalloc range --------------------------->|
> > | OE 1 | OE 2 | ... | OE n |
> >                     |<----------- cleanup range --------------->|
> > |<-----------  ----------->|
> >              \/
> >  btrfs_finish_ordered_io() range
> > 
> > So error handler, which calls extent_clear_unlock_delalloc() with
> > EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
> > will both cover OE n, and free its metadata, causing metadata under flow.
> > 
> > [Fix]
> > The fix is to ensure after calling btrfs_add_ordered_extent(), we only
> > call error handler after increasing the iteration offset, so that
> > cleanup range won't cover any created ordered extent.
> > 
> > |<------------------ delalloc range --------------------------->|
> > | OE 1 | OE 2 | ... | OE n |
> > |<-----------  ----------->|<---------- cleanup range --------->|
> >              \/
> >  btrfs_finish_ordered_io() range
> > 
> > Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> > ---
> > changelog:
> > v6:
> >   New, split from v5 patch, as this is a separate bug.
> > ---
> >  fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------
> >  1 file changed, 39 insertions(+), 12 deletions(-)
> > 
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index b2bc07aad1ae..1d83d504f2e5 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode,
> >  		    BTRFS_DATA_RELOC_TREE_OBJECTID) {
> >  			ret = btrfs_reloc_clone_csums(inode, start,
> >  						      cur_alloc_size);
> > +			/*
> > +			 * Only drop cache here, and process as normal.
> > +			 *
> > +			 * We must not allow extent_clear_unlock_delalloc()
> > +			 * at out_unlock label to free meta of this ordered
> > +			 * extent, as its meta should be freed by
> > +			 * btrfs_finish_ordered_io().
> > +			 *
> > +			 * So we must continue until @start is increased to
> > +			 * skip current ordered extent.
> > +			 */
> >  			if (ret)
> > -				goto out_drop_extent_cache;
> > +				btrfs_drop_extent_cache(BTRFS_I(inode), start,
> > +						start + ram_size - 1, 0);
> >  		}
> >  
> >  		btrfs_dec_block_group_reservations(fs_info, ins.objectid);
> >  
> > -		if (disk_num_bytes < cur_alloc_size)
> > -			break;
> > -
> >  		/* we're not doing compressed IO, don't unlock the first
> >  		 * page (which the caller expects to stay locked), don't
> >  		 * clear any dirty bits and don't set any writeback bits
> > @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode,
> >  					     delalloc_end, locked_page,
> >  					     EXTENT_LOCKED | EXTENT_DELALLOC,
> >  					     op);
> > -		disk_num_bytes -= cur_alloc_size;
> > +		if (disk_num_bytes > cur_alloc_size)
> > +			disk_num_bytes = 0;
> > +		else
> > +			disk_num_bytes -= cur_alloc_size;
> 
> I don't get the logic here, why do we 'break' if disk_num_bytes > cur_alloc_size?

I assume that you've run fstests against this patch, if so, I actually
start worrying about that no fstests found this problem.

Thanks,

-liubo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang
  2017-03-06  2:55 ` [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang Qu Wenruo
  2017-03-06 23:18   ` Filipe Manana
@ 2017-03-07 22:11   ` Liu Bo
  2017-03-08  0:18     ` Qu Wenruo
  1 sibling, 1 reply; 16+ messages in thread
From: Liu Bo @ 2017-03-07 22:11 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: fdmanana, linux-btrfs, Filipe Manana

On Mon, Mar 06, 2017 at 10:55:47AM +0800, Qu Wenruo wrote:
> [BUG]
> If run_delalloc_range() returns error and there is already some ordered
> extents created, btrfs will be hanged with the following backtrace:
> 
> Call Trace:
>  __schedule+0x2d4/0xae0
>  schedule+0x3d/0x90
>  btrfs_start_ordered_extent+0x160/0x200 [btrfs]
>  ? wake_atomic_t_function+0x60/0x60
>  btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
>  btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
>  btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
>  process_one_work+0x2af/0x720
>  ? process_one_work+0x22b/0x720
>  worker_thread+0x4b/0x4f0
>  kthread+0x10f/0x150
>  ? process_one_work+0x720/0x720
>  ? kthread_create_on_node+0x40/0x40
>  ret_from_fork+0x2e/0x40
> 
> [CAUSE]
> 
> |<------------------ delalloc range --------------------------->|
> | OE 1 | OE 2 | ... | OE n |
> |<>|                       |<---------- cleanup range --------->|
>  ||
>  \_=> First page handled by end_extent_writepage() in __extent_writepage()
> 
> The problem is caused by error handler of run_delalloc_range(), which
> doesn't handle any created ordered extents, leaving them waiting on
> btrfs_finish_ordered_io() to finish.
> 
> However after run_delalloc_range() returns error, __extent_writepage()
> won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered
> except the first page, and btrfs_finish_ordered_io() won't be triggered
> for created ordered extents either.
> 
> So OE 2~n will hang forever, and if OE 1 is larger than one page, it
> will also hang.
> 
> [FIX]
> Introduce btrfs_cleanup_ordered_extents() function to cleanup created
> ordered extents and finish them manually.
> 
> The function is based on existing
> btrfs_endio_direct_write_update_ordered() function, and modify it to
> act just like btrfs_writepage_endio_hook() but handles specified range
> other than one page.
> 
> After fix, delalloc error will be handled like:
> 
> |<------------------ delalloc range --------------------------->|
> | OE 1 | OE 2 | ... | OE n |
> |<>|<--------  ----------->|<------ old error handler --------->|
>  ||          ||
>  ||          \_=> Cleaned up by cleanup_ordered_extents()
>  \_=> First page handled by end_extent_writepage() in __extent_writepage()
>

Looks good, some nitpicks below.

> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> changelog:
> v2:
>   Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing
>   outstanding extents, which is already done by
>   extent_clear_unlock_delalloc() with EXTENT_DO_ACCOUNT control bit
> v3:
>   Skip first page to avoid underflow ordered->bytes_left.
>   Fix range passed in cow_file_range() which doesn't cover the whole
>   extent.
>   Expend extent_clear_unlock_delalloc() range to allow them to handle
>   metadata release.
> v4:
>   Don't use extra bit to skip metadata freeing for ordered extent,
>   but only handle btrfs_reloc_clone_csums() error just before processing
>   next extent.
>   This makes error handle much easier for run_delalloc_nocow().
> v5:
>   Variant gramma and comment fixes suggested by Filipe Manana
>   Enhanced commit message to focus on the generic error handler bug,
>   pointed out by Filipe Manana
>   Adding missing wq/func user in __endio_write_update_ordered(), pointed
>   out by Filipe Manana.
>   Enhanced commit message with ascii art to explain the bug more easily.
>   Fix a bug which can leads to corrupted extent allocation, exposed by
>   Filipe Manana.
> v6:
>   Split the metadata underflow fix to another patch.
>   Use better comment and btrfs_cleanup_ordered_extent() implementation
>   from Filipe Manana.
> ---
>  fs/btrfs/inode.c | 62 ++++++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 47 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 1d83d504f2e5..9b03eb9b27d0 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -115,6 +115,30 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len,
>  				       u64 ram_bytes, int compress_type,
>  				       int type);
>  
> +static void __endio_write_update_ordered(struct inode *inode,
> +					 u64 offset, u64 bytes,
> +					 bool uptodate);
> +
> +/*
> + * Cleanup all submitted ordered extents in specified range to handle errors
> + * from the fill_dellaloc() callback.
> + *
> + * NOTE: caller must ensure that when an error happens, it can not call
> + * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING
> + * and EXTENT_DELALLOC simultaneously, because that causes the reserved metadata
> + * to be released, which we want to happen only when finishing the ordered
> + * extent (btrfs_finish_ordered_io()). Also note that the caller of the
> + * fill_dealloc() callback already does proper cleanup for the first page of

fill_delalloc()

> + * the range, that is, it invokes the callback writepage_end_io_hook() for the
> + * range of the first page.
> + */
> +static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
> +						 u64 offset, u64 bytes)
> +{
> +	return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
> +					    bytes - PAGE_SIZE, false);
> +}
> +
>  static int btrfs_dirty_inode(struct inode *inode);
>  
>  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> @@ -1536,6 +1560,8 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
>  		ret = cow_file_range_async(inode, locked_page, start, end,
>  					   page_started, nr_written);
>  	}
> +	if (ret)
> +		btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
>  	return ret;
>  }
>  
> @@ -8142,17 +8168,26 @@ static void btrfs_endio_direct_read(struct bio *bio)
>  	bio_put(bio);
>  }
>  
> -static void btrfs_endio_direct_write_update_ordered(struct inode *inode,
> -						    const u64 offset,
> -						    const u64 bytes,
> -						    const int uptodate)
> +static void __endio_write_update_ordered(struct inode *inode,
> +					 u64 offset, u64 bytes,
> +					 bool uptodate)

Not serious, but why const was killed?

With that fixed, you can have 

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Thanks,

-liubo
>  {
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	struct btrfs_ordered_extent *ordered = NULL;
> +	struct btrfs_workqueue *wq;
> +	btrfs_work_func_t func;
>  	u64 ordered_offset = offset;
>  	u64 ordered_bytes = bytes;
>  	int ret;
>  
> +	if (btrfs_is_free_space_inode(BTRFS_I(inode))) {
> +		wq = fs_info->endio_freespace_worker;
> +		func = btrfs_freespace_write_helper;
> +	} else {
> +		wq = fs_info->endio_write_workers;
> +		func = btrfs_endio_write_helper;
> +	}
> +
>  again:
>  	ret = btrfs_dec_test_first_ordered_pending(inode, &ordered,
>  						   &ordered_offset,
> @@ -8161,9 +8196,8 @@ static void btrfs_endio_direct_write_update_ordered(struct inode *inode,
>  	if (!ret)
>  		goto out_test;
>  
> -	btrfs_init_work(&ordered->work, btrfs_endio_write_helper,
> -			finish_ordered_fn, NULL, NULL);
> -	btrfs_queue_work(fs_info->endio_write_workers, &ordered->work);
> +	btrfs_init_work(&ordered->work, func, finish_ordered_fn, NULL, NULL);
> +	btrfs_queue_work(wq, &ordered->work);
>  out_test:
>  	/*
>  	 * our bio might span multiple ordered extents.  If we haven't
> @@ -8181,10 +8215,8 @@ static void btrfs_endio_direct_write(struct bio *bio)
>  	struct btrfs_dio_private *dip = bio->bi_private;
>  	struct bio *dio_bio = dip->dio_bio;
>  
> -	btrfs_endio_direct_write_update_ordered(dip->inode,
> -						dip->logical_offset,
> -						dip->bytes,
> -						!bio->bi_error);
> +	__endio_write_update_ordered(dip->inode, dip->logical_offset,
> +				     dip->bytes, !bio->bi_error);
>  
>  	kfree(dip);
>  
> @@ -8545,10 +8577,10 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
>  		io_bio = NULL;
>  	} else {
>  		if (write)
> -			btrfs_endio_direct_write_update_ordered(inode,
> +			__endio_write_update_ordered(inode,
>  						file_offset,
>  						dio_bio->bi_iter.bi_size,
> -						0);
> +						false);
>  		else
>  			unlock_extent(&BTRFS_I(inode)->io_tree, file_offset,
>  			      file_offset + dio_bio->bi_iter.bi_size - 1);
> @@ -8683,11 +8715,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>  			 */
>  			if (dio_data.unsubmitted_oe_range_start <
>  			    dio_data.unsubmitted_oe_range_end)
> -				btrfs_endio_direct_write_update_ordered(inode,
> +				__endio_write_update_ordered(inode,
>  					dio_data.unsubmitted_oe_range_start,
>  					dio_data.unsubmitted_oe_range_end -
>  					dio_data.unsubmitted_oe_range_start,
> -					0);
> +					false);
>  		} else if (ret >= 0 && (size_t)ret < count)
>  			btrfs_delalloc_release_space(inode, offset,
>  						     count - (size_t)ret);
> -- 
> 2.12.0
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  2017-03-07 20:59   ` Liu Bo
@ 2017-03-08  0:17     ` Filipe Manana
  2017-03-08  1:16       ` Liu Bo
  0 siblings, 1 reply; 16+ messages in thread
From: Filipe Manana @ 2017-03-08  0:17 UTC (permalink / raw)
  To: Liu Bo; +Cc: Qu Wenruo, linux-btrfs

On Tue, Mar 7, 2017 at 8:59 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
> On Tue, Mar 07, 2017 at 12:49:58PM -0800, Liu Bo wrote:
>> On Mon, Mar 06, 2017 at 10:55:46AM +0800, Qu Wenruo wrote:
>> > [BUG]
>> > When btrfs_reloc_clone_csum() reports error, it can underflow metadata
>> > and leads to kernel assertion on outstanding extents in
>> > run_delalloc_nocow() and cow_file_range().
>> >
>> >  BTRFS info (device vdb5): relocating block group 12582912 flags data
>> >  BTRFS info (device vdb5): found 1 extents
>> >  assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858
>> >
>> > Currently, due to another bug blocking ordered extents, the bug is only
>> > reproducible under certain block group layout and using error injection.
>> >
>> > a) Create one data block group with one 4K extent in it.
>> >    To avoid the bug that hangs btrfs due to ordered extent which never
>> >    finishes
>> > b) Make btrfs_reloc_clone_csum() always fail
>> > c) Relocate that block group
>> >
>> > [CAUSE]
>> > run_delalloc_nocow() and cow_file_range() handles error from
>> > btrfs_reloc_clone_csum() wrongly:
>> >
>> > (The ascii chart shows a more generic case of this bug other than the
>> > bug mentioned above)
>> >
>> > |<------------------ delalloc range --------------------------->|
>> > | OE 1 | OE 2 | ... | OE n |
>> >                     |<----------- cleanup range --------------->|
>> > |<-----------  ----------->|
>> >              \/
>> >  btrfs_finish_ordered_io() range
>> >
>> > So error handler, which calls extent_clear_unlock_delalloc() with
>> > EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
>> > will both cover OE n, and free its metadata, causing metadata under flow.
>> >
>> > [Fix]
>> > The fix is to ensure after calling btrfs_add_ordered_extent(), we only
>> > call error handler after increasing the iteration offset, so that
>> > cleanup range won't cover any created ordered extent.
>> >
>> > |<------------------ delalloc range --------------------------->|
>> > | OE 1 | OE 2 | ... | OE n |
>> > |<-----------  ----------->|<---------- cleanup range --------->|
>> >              \/
>> >  btrfs_finish_ordered_io() range
>> >
>> > Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> > ---
>> > changelog:
>> > v6:
>> >   New, split from v5 patch, as this is a separate bug.
>> > ---
>> >  fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------
>> >  1 file changed, 39 insertions(+), 12 deletions(-)
>> >
>> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> > index b2bc07aad1ae..1d83d504f2e5 100644
>> > --- a/fs/btrfs/inode.c
>> > +++ b/fs/btrfs/inode.c
>> > @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode,
>> >                 BTRFS_DATA_RELOC_TREE_OBJECTID) {
>> >                     ret = btrfs_reloc_clone_csums(inode, start,
>> >                                                   cur_alloc_size);
>> > +                   /*
>> > +                    * Only drop cache here, and process as normal.
>> > +                    *
>> > +                    * We must not allow extent_clear_unlock_delalloc()
>> > +                    * at out_unlock label to free meta of this ordered
>> > +                    * extent, as its meta should be freed by
>> > +                    * btrfs_finish_ordered_io().
>> > +                    *
>> > +                    * So we must continue until @start is increased to
>> > +                    * skip current ordered extent.
>> > +                    */
>> >                     if (ret)
>> > -                           goto out_drop_extent_cache;
>> > +                           btrfs_drop_extent_cache(BTRFS_I(inode), start,
>> > +                                           start + ram_size - 1, 0);
>> >             }
>> >
>> >             btrfs_dec_block_group_reservations(fs_info, ins.objectid);
>> >
>> > -           if (disk_num_bytes < cur_alloc_size)
>> > -                   break;
>> > -
>> >             /* we're not doing compressed IO, don't unlock the first
>> >              * page (which the caller expects to stay locked), don't
>> >              * clear any dirty bits and don't set any writeback bits
>> > @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode,
>> >                                          delalloc_end, locked_page,
>> >                                          EXTENT_LOCKED | EXTENT_DELALLOC,
>> >                                          op);
>> > -           disk_num_bytes -= cur_alloc_size;
>> > +           if (disk_num_bytes > cur_alloc_size)
>> > +                   disk_num_bytes = 0;
>> > +           else
>> > +                   disk_num_bytes -= cur_alloc_size;
>>
>> I don't get the logic here, why do we 'break' if disk_num_bytes > cur_alloc_size?
>
> I assume that you've run fstests against this patch, if so, I actually
> start worrying about that no fstests found this problem.

The operator is definitely wrong, should be < instead of > (previous
patch versions were correct).

It's not a surprise fstests don't catch this - one would need a fs
with no more free space to allocate new data chunks and existing data
chunks would have to be fragmented to the point we need to allocate
two or more smaller extents to satisfy the write, so whether fstests
exercises this depends on the size of the test devices (mine are 100G
for example) and the size of the data the tests create. Having a
deterministic test for that for that should be possible and useful.

>
> Thanks,
>
> -liubo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  2017-03-07 20:49 ` Liu Bo
  2017-03-07 20:59   ` Liu Bo
@ 2017-03-08  0:17   ` Qu Wenruo
  1 sibling, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2017-03-08  0:17 UTC (permalink / raw)
  To: bo.li.liu; +Cc: fdmanana, linux-btrfs



At 03/08/2017 04:49 AM, Liu Bo wrote:
> On Mon, Mar 06, 2017 at 10:55:46AM +0800, Qu Wenruo wrote:
>> [BUG]
>> When btrfs_reloc_clone_csum() reports error, it can underflow metadata
>> and leads to kernel assertion on outstanding extents in
>> run_delalloc_nocow() and cow_file_range().
>>
>>  BTRFS info (device vdb5): relocating block group 12582912 flags data
>>  BTRFS info (device vdb5): found 1 extents
>>  assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858
>>
>> Currently, due to another bug blocking ordered extents, the bug is only
>> reproducible under certain block group layout and using error injection.
>>
>> a) Create one data block group with one 4K extent in it.
>>    To avoid the bug that hangs btrfs due to ordered extent which never
>>    finishes
>> b) Make btrfs_reloc_clone_csum() always fail
>> c) Relocate that block group
>>
>> [CAUSE]
>> run_delalloc_nocow() and cow_file_range() handles error from
>> btrfs_reloc_clone_csum() wrongly:
>>
>> (The ascii chart shows a more generic case of this bug other than the
>> bug mentioned above)
>>
>> |<------------------ delalloc range --------------------------->|
>> | OE 1 | OE 2 | ... | OE n |
>>                     |<----------- cleanup range --------------->|
>> |<-----------  ----------->|
>>              \/
>>  btrfs_finish_ordered_io() range
>>
>> So error handler, which calls extent_clear_unlock_delalloc() with
>> EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
>> will both cover OE n, and free its metadata, causing metadata under flow.
>>
>> [Fix]
>> The fix is to ensure after calling btrfs_add_ordered_extent(), we only
>> call error handler after increasing the iteration offset, so that
>> cleanup range won't cover any created ordered extent.
>>
>> |<------------------ delalloc range --------------------------->|
>> | OE 1 | OE 2 | ... | OE n |
>> |<-----------  ----------->|<---------- cleanup range --------->|
>>              \/
>>  btrfs_finish_ordered_io() range
>>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> ---
>> changelog:
>> v6:
>>   New, split from v5 patch, as this is a separate bug.
>> ---
>>  fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------
>>  1 file changed, 39 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index b2bc07aad1ae..1d83d504f2e5 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode,
>>  		    BTRFS_DATA_RELOC_TREE_OBJECTID) {
>>  			ret = btrfs_reloc_clone_csums(inode, start,
>>  						      cur_alloc_size);
>> +			/*
>> +			 * Only drop cache here, and process as normal.
>> +			 *
>> +			 * We must not allow extent_clear_unlock_delalloc()
>> +			 * at out_unlock label to free meta of this ordered
>> +			 * extent, as its meta should be freed by
>> +			 * btrfs_finish_ordered_io().
>> +			 *
>> +			 * So we must continue until @start is increased to
>> +			 * skip current ordered extent.
>> +			 */
>>  			if (ret)
>> -				goto out_drop_extent_cache;
>> +				btrfs_drop_extent_cache(BTRFS_I(inode), start,
>> +						start + ram_size - 1, 0);
>>  		}
>>
>>  		btrfs_dec_block_group_reservations(fs_info, ins.objectid);
>>
>> -		if (disk_num_bytes < cur_alloc_size)
>> -			break;
>> -
>>  		/* we're not doing compressed IO, don't unlock the first
>>  		 * page (which the caller expects to stay locked), don't
>>  		 * clear any dirty bits and don't set any writeback bits
>> @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode,
>>  					     delalloc_end, locked_page,
>>  					     EXTENT_LOCKED | EXTENT_DELALLOC,
>>  					     op);
>> -		disk_num_bytes -= cur_alloc_size;
>> +		if (disk_num_bytes > cur_alloc_size)
>> +			disk_num_bytes = 0;
>> +		else
>> +			disk_num_bytes -= cur_alloc_size;
>
> I don't get the logic here, why do we 'break' if disk_num_bytes > cur_alloc_size?

Thanks for the catch, it should be disk_num_bytes < cur_alloc_size.

I'll update it.

Thanks,
Qu
>
> Thanks,
>
> -liubo
>
>>  		num_bytes -= cur_alloc_size;
>>  		alloc_hint = ins.objectid + ins.offset;
>>  		start += cur_alloc_size;
>> +
>> +		/*
>> +		 * btrfs_reloc_clone_csums() error, since start is increased
>> +		 * extent_clear_unlock_delalloc() at out_unlock label won't
>> +		 * free metadata of current ordered extent, we're OK to exit.
>> +		 */
>> +		if (ret)
>> +			goto out_unlock;
>>  	}
>>  out:
>>  	return ret;
>> @@ -1414,15 +1434,14 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>  		BUG_ON(ret); /* -ENOMEM */
>>
>>  		if (root->root_key.objectid ==
>> -		    BTRFS_DATA_RELOC_TREE_OBJECTID) {
>> +		    BTRFS_DATA_RELOC_TREE_OBJECTID)
>> +			/*
>> +			 * Error handled later, as we must prevent
>> +			 * extent_clear_unlock_delalloc() in error handler
>> +			 * from freeing metadata of created ordered extent.
>> +			 */
>>  			ret = btrfs_reloc_clone_csums(inode, cur_offset,
>>  						      num_bytes);
>> -			if (ret) {
>> -				if (!nolock && nocow)
>> -					btrfs_end_write_no_snapshoting(root);
>> -				goto error;
>> -			}
>> -		}
>>
>>  		extent_clear_unlock_delalloc(inode, cur_offset,
>>  					     cur_offset + num_bytes - 1, end,
>> @@ -1434,6 +1453,14 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>  		if (!nolock && nocow)
>>  			btrfs_end_write_no_snapshoting(root);
>>  		cur_offset = extent_end;
>> +
>> +		/*
>> +		 * btrfs_reloc_clone_csums() error, now we're OK to call error
>> +		 * handler, as metadata for created ordered extent will only
>> +		 * be freed by btrfs_finish_ordered_io().
>> +		 */
>> +		if (ret)
>> +			goto error;
>>  		if (cur_offset > end)
>>  			break;
>>  	}
>> --
>> 2.12.0
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang
  2017-03-07 22:11   ` Liu Bo
@ 2017-03-08  0:18     ` Qu Wenruo
  2017-03-08  0:21       ` Filipe Manana
  0 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2017-03-08  0:18 UTC (permalink / raw)
  To: bo.li.liu; +Cc: fdmanana, linux-btrfs, Filipe Manana



At 03/08/2017 06:11 AM, Liu Bo wrote:
> On Mon, Mar 06, 2017 at 10:55:47AM +0800, Qu Wenruo wrote:
>> [BUG]
>> If run_delalloc_range() returns error and there is already some ordered
>> extents created, btrfs will be hanged with the following backtrace:
>>
>> Call Trace:
>>  __schedule+0x2d4/0xae0
>>  schedule+0x3d/0x90
>>  btrfs_start_ordered_extent+0x160/0x200 [btrfs]
>>  ? wake_atomic_t_function+0x60/0x60
>>  btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
>>  btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
>>  btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
>>  process_one_work+0x2af/0x720
>>  ? process_one_work+0x22b/0x720
>>  worker_thread+0x4b/0x4f0
>>  kthread+0x10f/0x150
>>  ? process_one_work+0x720/0x720
>>  ? kthread_create_on_node+0x40/0x40
>>  ret_from_fork+0x2e/0x40
>>
>> [CAUSE]
>>
>> |<------------------ delalloc range --------------------------->|
>> | OE 1 | OE 2 | ... | OE n |
>> |<>|                       |<---------- cleanup range --------->|
>>  ||
>>  \_=> First page handled by end_extent_writepage() in __extent_writepage()
>>
>> The problem is caused by error handler of run_delalloc_range(), which
>> doesn't handle any created ordered extents, leaving them waiting on
>> btrfs_finish_ordered_io() to finish.
>>
>> However after run_delalloc_range() returns error, __extent_writepage()
>> won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered
>> except the first page, and btrfs_finish_ordered_io() won't be triggered
>> for created ordered extents either.
>>
>> So OE 2~n will hang forever, and if OE 1 is larger than one page, it
>> will also hang.
>>
>> [FIX]
>> Introduce btrfs_cleanup_ordered_extents() function to cleanup created
>> ordered extents and finish them manually.
>>
>> The function is based on existing
>> btrfs_endio_direct_write_update_ordered() function, and modify it to
>> act just like btrfs_writepage_endio_hook() but handles specified range
>> other than one page.
>>
>> After fix, delalloc error will be handled like:
>>
>> |<------------------ delalloc range --------------------------->|
>> | OE 1 | OE 2 | ... | OE n |
>> |<>|<--------  ----------->|<------ old error handler --------->|
>>  ||          ||
>>  ||          \_=> Cleaned up by cleanup_ordered_extents()
>>  \_=> First page handled by end_extent_writepage() in __extent_writepage()
>>
>
> Looks good, some nitpicks below.
>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> ---
>> changelog:
>> v2:
>>   Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing
>>   outstanding extents, which is already done by
>>   extent_clear_unlock_delalloc() with EXTENT_DO_ACCOUNT control bit
>> v3:
>>   Skip first page to avoid underflow ordered->bytes_left.
>>   Fix range passed in cow_file_range() which doesn't cover the whole
>>   extent.
>>   Expend extent_clear_unlock_delalloc() range to allow them to handle
>>   metadata release.
>> v4:
>>   Don't use extra bit to skip metadata freeing for ordered extent,
>>   but only handle btrfs_reloc_clone_csums() error just before processing
>>   next extent.
>>   This makes error handle much easier for run_delalloc_nocow().
>> v5:
>>   Variant gramma and comment fixes suggested by Filipe Manana
>>   Enhanced commit message to focus on the generic error handler bug,
>>   pointed out by Filipe Manana
>>   Adding missing wq/func user in __endio_write_update_ordered(), pointed
>>   out by Filipe Manana.
>>   Enhanced commit message with ascii art to explain the bug more easily.
>>   Fix a bug which can leads to corrupted extent allocation, exposed by
>>   Filipe Manana.
>> v6:
>>   Split the metadata underflow fix to another patch.
>>   Use better comment and btrfs_cleanup_ordered_extent() implementation
>>   from Filipe Manana.
>> ---
>>  fs/btrfs/inode.c | 62 ++++++++++++++++++++++++++++++++++++++++++--------------
>>  1 file changed, 47 insertions(+), 15 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 1d83d504f2e5..9b03eb9b27d0 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -115,6 +115,30 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len,
>>  				       u64 ram_bytes, int compress_type,
>>  				       int type);
>>
>> +static void __endio_write_update_ordered(struct inode *inode,
>> +					 u64 offset, u64 bytes,
>> +					 bool uptodate);
>> +
>> +/*
>> + * Cleanup all submitted ordered extents in specified range to handle errors
>> + * from the fill_dellaloc() callback.
>> + *
>> + * NOTE: caller must ensure that when an error happens, it can not call
>> + * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING
>> + * and EXTENT_DELALLOC simultaneously, because that causes the reserved metadata
>> + * to be released, which we want to happen only when finishing the ordered
>> + * extent (btrfs_finish_ordered_io()). Also note that the caller of the
>> + * fill_dealloc() callback already does proper cleanup for the first page of
>
> fill_delalloc()
>
>> + * the range, that is, it invokes the callback writepage_end_io_hook() for the
>> + * range of the first page.
>> + */
>> +static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
>> +						 u64 offset, u64 bytes)
>> +{
>> +	return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
>> +					    bytes - PAGE_SIZE, false);
>> +}
>> +
>>  static int btrfs_dirty_inode(struct inode *inode);
>>
>>  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>> @@ -1536,6 +1560,8 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
>>  		ret = cow_file_range_async(inode, locked_page, start, end,
>>  					   page_started, nr_written);
>>  	}
>> +	if (ret)
>> +		btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
>>  	return ret;
>>  }
>>
>> @@ -8142,17 +8168,26 @@ static void btrfs_endio_direct_read(struct bio *bio)
>>  	bio_put(bio);
>>  }
>>
>> -static void btrfs_endio_direct_write_update_ordered(struct inode *inode,
>> -						    const u64 offset,
>> -						    const u64 bytes,
>> -						    const int uptodate)
>> +static void __endio_write_update_ordered(struct inode *inode,
>> +					 u64 offset, u64 bytes,
>> +					 bool uptodate)
>
> Not serious, but why const was killed?

Because const for @offset, @bytes has no meaning, u64/bool are passed by 
their value.

Any modification to @offset/@bytes/@update has no effect on the passed 
values.

Thanks,
Qu

>
> With that fixed, you can have
>
> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
>
> Thanks,
>
> -liubo
>>  {
>>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>>  	struct btrfs_ordered_extent *ordered = NULL;
>> +	struct btrfs_workqueue *wq;
>> +	btrfs_work_func_t func;
>>  	u64 ordered_offset = offset;
>>  	u64 ordered_bytes = bytes;
>>  	int ret;
>>
>> +	if (btrfs_is_free_space_inode(BTRFS_I(inode))) {
>> +		wq = fs_info->endio_freespace_worker;
>> +		func = btrfs_freespace_write_helper;
>> +	} else {
>> +		wq = fs_info->endio_write_workers;
>> +		func = btrfs_endio_write_helper;
>> +	}
>> +
>>  again:
>>  	ret = btrfs_dec_test_first_ordered_pending(inode, &ordered,
>>  						   &ordered_offset,
>> @@ -8161,9 +8196,8 @@ static void btrfs_endio_direct_write_update_ordered(struct inode *inode,
>>  	if (!ret)
>>  		goto out_test;
>>
>> -	btrfs_init_work(&ordered->work, btrfs_endio_write_helper,
>> -			finish_ordered_fn, NULL, NULL);
>> -	btrfs_queue_work(fs_info->endio_write_workers, &ordered->work);
>> +	btrfs_init_work(&ordered->work, func, finish_ordered_fn, NULL, NULL);
>> +	btrfs_queue_work(wq, &ordered->work);
>>  out_test:
>>  	/*
>>  	 * our bio might span multiple ordered extents.  If we haven't
>> @@ -8181,10 +8215,8 @@ static void btrfs_endio_direct_write(struct bio *bio)
>>  	struct btrfs_dio_private *dip = bio->bi_private;
>>  	struct bio *dio_bio = dip->dio_bio;
>>
>> -	btrfs_endio_direct_write_update_ordered(dip->inode,
>> -						dip->logical_offset,
>> -						dip->bytes,
>> -						!bio->bi_error);
>> +	__endio_write_update_ordered(dip->inode, dip->logical_offset,
>> +				     dip->bytes, !bio->bi_error);
>>
>>  	kfree(dip);
>>
>> @@ -8545,10 +8577,10 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
>>  		io_bio = NULL;
>>  	} else {
>>  		if (write)
>> -			btrfs_endio_direct_write_update_ordered(inode,
>> +			__endio_write_update_ordered(inode,
>>  						file_offset,
>>  						dio_bio->bi_iter.bi_size,
>> -						0);
>> +						false);
>>  		else
>>  			unlock_extent(&BTRFS_I(inode)->io_tree, file_offset,
>>  			      file_offset + dio_bio->bi_iter.bi_size - 1);
>> @@ -8683,11 +8715,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>  			 */
>>  			if (dio_data.unsubmitted_oe_range_start <
>>  			    dio_data.unsubmitted_oe_range_end)
>> -				btrfs_endio_direct_write_update_ordered(inode,
>> +				__endio_write_update_ordered(inode,
>>  					dio_data.unsubmitted_oe_range_start,
>>  					dio_data.unsubmitted_oe_range_end -
>>  					dio_data.unsubmitted_oe_range_start,
>> -					0);
>> +					false);
>>  		} else if (ret >= 0 && (size_t)ret < count)
>>  			btrfs_delalloc_release_space(inode, offset,
>>  						     count - (size_t)ret);
>> --
>> 2.12.0
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang
  2017-03-08  0:18     ` Qu Wenruo
@ 2017-03-08  0:21       ` Filipe Manana
  2017-03-08  0:26         ` Qu Wenruo
  0 siblings, 1 reply; 16+ messages in thread
From: Filipe Manana @ 2017-03-08  0:21 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Liu Bo, linux-btrfs, Filipe Manana

On Wed, Mar 8, 2017 at 12:18 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>
>
> At 03/08/2017 06:11 AM, Liu Bo wrote:
>>
>> On Mon, Mar 06, 2017 at 10:55:47AM +0800, Qu Wenruo wrote:
>>>
>>> [BUG]
>>> If run_delalloc_range() returns error and there is already some ordered
>>> extents created, btrfs will be hanged with the following backtrace:
>>>
>>> Call Trace:
>>>  __schedule+0x2d4/0xae0
>>>  schedule+0x3d/0x90
>>>  btrfs_start_ordered_extent+0x160/0x200 [btrfs]
>>>  ? wake_atomic_t_function+0x60/0x60
>>>  btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
>>>  btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
>>>  btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
>>>  process_one_work+0x2af/0x720
>>>  ? process_one_work+0x22b/0x720
>>>  worker_thread+0x4b/0x4f0
>>>  kthread+0x10f/0x150
>>>  ? process_one_work+0x720/0x720
>>>  ? kthread_create_on_node+0x40/0x40
>>>  ret_from_fork+0x2e/0x40
>>>
>>> [CAUSE]
>>>
>>> |<------------------ delalloc range --------------------------->|
>>> | OE 1 | OE 2 | ... | OE n |
>>> |<>|                       |<---------- cleanup range --------->|
>>>  ||
>>>  \_=> First page handled by end_extent_writepage() in
>>> __extent_writepage()
>>>
>>> The problem is caused by error handler of run_delalloc_range(), which
>>> doesn't handle any created ordered extents, leaving them waiting on
>>> btrfs_finish_ordered_io() to finish.
>>>
>>> However after run_delalloc_range() returns error, __extent_writepage()
>>> won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered
>>> except the first page, and btrfs_finish_ordered_io() won't be triggered
>>> for created ordered extents either.
>>>
>>> So OE 2~n will hang forever, and if OE 1 is larger than one page, it
>>> will also hang.
>>>
>>> [FIX]
>>> Introduce btrfs_cleanup_ordered_extents() function to cleanup created
>>> ordered extents and finish them manually.
>>>
>>> The function is based on existing
>>> btrfs_endio_direct_write_update_ordered() function, and modify it to
>>> act just like btrfs_writepage_endio_hook() but handles specified range
>>> other than one page.
>>>
>>> After fix, delalloc error will be handled like:
>>>
>>> |<------------------ delalloc range --------------------------->|
>>> | OE 1 | OE 2 | ... | OE n |
>>> |<>|<--------  ----------->|<------ old error handler --------->|
>>>  ||          ||
>>>  ||          \_=> Cleaned up by cleanup_ordered_extents()
>>>  \_=> First page handled by end_extent_writepage() in
>>> __extent_writepage()
>>>
>>
>> Looks good, some nitpicks below.
>>
>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>> ---
>>> changelog:
>>> v2:
>>>   Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing
>>>   outstanding extents, which is already done by
>>>   extent_clear_unlock_delalloc() with EXTENT_DO_ACCOUNT control bit
>>> v3:
>>>   Skip first page to avoid underflow ordered->bytes_left.
>>>   Fix range passed in cow_file_range() which doesn't cover the whole
>>>   extent.
>>>   Expend extent_clear_unlock_delalloc() range to allow them to handle
>>>   metadata release.
>>> v4:
>>>   Don't use extra bit to skip metadata freeing for ordered extent,
>>>   but only handle btrfs_reloc_clone_csums() error just before processing
>>>   next extent.
>>>   This makes error handle much easier for run_delalloc_nocow().
>>> v5:
>>>   Variant gramma and comment fixes suggested by Filipe Manana
>>>   Enhanced commit message to focus on the generic error handler bug,
>>>   pointed out by Filipe Manana
>>>   Adding missing wq/func user in __endio_write_update_ordered(), pointed
>>>   out by Filipe Manana.
>>>   Enhanced commit message with ascii art to explain the bug more easily.
>>>   Fix a bug which can leads to corrupted extent allocation, exposed by
>>>   Filipe Manana.
>>> v6:
>>>   Split the metadata underflow fix to another patch.
>>>   Use better comment and btrfs_cleanup_ordered_extent() implementation
>>>   from Filipe Manana.
>>> ---
>>>  fs/btrfs/inode.c | 62
>>> ++++++++++++++++++++++++++++++++++++++++++--------------
>>>  1 file changed, 47 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index 1d83d504f2e5..9b03eb9b27d0 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -115,6 +115,30 @@ static struct extent_map *create_io_em(struct inode
>>> *inode, u64 start, u64 len,
>>>                                        u64 ram_bytes, int compress_type,
>>>                                        int type);
>>>
>>> +static void __endio_write_update_ordered(struct inode *inode,
>>> +                                        u64 offset, u64 bytes,
>>> +                                        bool uptodate);
>>> +
>>> +/*
>>> + * Cleanup all submitted ordered extents in specified range to handle
>>> errors
>>> + * from the fill_dellaloc() callback.
>>> + *
>>> + * NOTE: caller must ensure that when an error happens, it can not call
>>> + * extent_clear_unlock_delalloc() to clear both the bits
>>> EXTENT_DO_ACCOUNTING
>>> + * and EXTENT_DELALLOC simultaneously, because that causes the reserved
>>> metadata
>>> + * to be released, which we want to happen only when finishing the
>>> ordered
>>> + * extent (btrfs_finish_ordered_io()). Also note that the caller of the
>>> + * fill_dealloc() callback already does proper cleanup for the first
>>> page of
>>
>>
>> fill_delalloc()
>>
>>> + * the range, that is, it invokes the callback writepage_end_io_hook()
>>> for the
>>> + * range of the first page.
>>> + */
>>> +static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
>>> +                                                u64 offset, u64 bytes)
>>> +{
>>> +       return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
>>> +                                           bytes - PAGE_SIZE, false);
>>> +}
>>> +
>>>  static int btrfs_dirty_inode(struct inode *inode);
>>>
>>>  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>>> @@ -1536,6 +1560,8 @@ static int run_delalloc_range(struct inode *inode,
>>> struct page *locked_page,
>>>                 ret = cow_file_range_async(inode, locked_page, start,
>>> end,
>>>                                            page_started, nr_written);
>>>         }
>>> +       if (ret)
>>> +               btrfs_cleanup_ordered_extents(inode, start, end - start +
>>> 1);
>>>         return ret;
>>>  }
>>>
>>> @@ -8142,17 +8168,26 @@ static void btrfs_endio_direct_read(struct bio
>>> *bio)
>>>         bio_put(bio);
>>>  }
>>>
>>> -static void btrfs_endio_direct_write_update_ordered(struct inode *inode,
>>> -                                                   const u64 offset,
>>> -                                                   const u64 bytes,
>>> -                                                   const int uptodate)
>>> +static void __endio_write_update_ordered(struct inode *inode,
>>> +                                        u64 offset, u64 bytes,
>>> +                                        bool uptodate)
>>
>>
>> Not serious, but why const was killed?
>
>
> Because const for @offset, @bytes has no meaning, u64/bool are passed by
> their value.
>
> Any modification to @offset/@bytes/@update has no effect on the passed
> values.

But it helps detect logic errors inside the function. For example when
you have local variables with names similar to the function
parameters, it's easy to make a mistake and modify a parameter instead
of a local variable - that's why I always use const for new functions
(and not only because I like to type a lot).

>
> Thanks,
> Qu
>
>
>>
>> With that fixed, you can have
>>
>> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
>>
>> Thanks,
>>
>> -liubo
>>>
>>>  {
>>>         struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>>>         struct btrfs_ordered_extent *ordered = NULL;
>>> +       struct btrfs_workqueue *wq;
>>> +       btrfs_work_func_t func;
>>>         u64 ordered_offset = offset;
>>>         u64 ordered_bytes = bytes;
>>>         int ret;
>>>
>>> +       if (btrfs_is_free_space_inode(BTRFS_I(inode))) {
>>> +               wq = fs_info->endio_freespace_worker;
>>> +               func = btrfs_freespace_write_helper;
>>> +       } else {
>>> +               wq = fs_info->endio_write_workers;
>>> +               func = btrfs_endio_write_helper;
>>> +       }
>>> +
>>>  again:
>>>         ret = btrfs_dec_test_first_ordered_pending(inode, &ordered,
>>>                                                    &ordered_offset,
>>> @@ -8161,9 +8196,8 @@ static void
>>> btrfs_endio_direct_write_update_ordered(struct inode *inode,
>>>         if (!ret)
>>>                 goto out_test;
>>>
>>> -       btrfs_init_work(&ordered->work, btrfs_endio_write_helper,
>>> -                       finish_ordered_fn, NULL, NULL);
>>> -       btrfs_queue_work(fs_info->endio_write_workers, &ordered->work);
>>> +       btrfs_init_work(&ordered->work, func, finish_ordered_fn, NULL,
>>> NULL);
>>> +       btrfs_queue_work(wq, &ordered->work);
>>>  out_test:
>>>         /*
>>>          * our bio might span multiple ordered extents.  If we haven't
>>> @@ -8181,10 +8215,8 @@ static void btrfs_endio_direct_write(struct bio
>>> *bio)
>>>         struct btrfs_dio_private *dip = bio->bi_private;
>>>         struct bio *dio_bio = dip->dio_bio;
>>>
>>> -       btrfs_endio_direct_write_update_ordered(dip->inode,
>>> -                                               dip->logical_offset,
>>> -                                               dip->bytes,
>>> -                                               !bio->bi_error);
>>> +       __endio_write_update_ordered(dip->inode, dip->logical_offset,
>>> +                                    dip->bytes, !bio->bi_error);
>>>
>>>         kfree(dip);
>>>
>>> @@ -8545,10 +8577,10 @@ static void btrfs_submit_direct(struct bio
>>> *dio_bio, struct inode *inode,
>>>                 io_bio = NULL;
>>>         } else {
>>>                 if (write)
>>> -                       btrfs_endio_direct_write_update_ordered(inode,
>>> +                       __endio_write_update_ordered(inode,
>>>                                                 file_offset,
>>>                                                 dio_bio->bi_iter.bi_size,
>>> -                                               0);
>>> +                                               false);
>>>                 else
>>>                         unlock_extent(&BTRFS_I(inode)->io_tree,
>>> file_offset,
>>>                               file_offset + dio_bio->bi_iter.bi_size -
>>> 1);
>>> @@ -8683,11 +8715,11 @@ static ssize_t btrfs_direct_IO(struct kiocb
>>> *iocb, struct iov_iter *iter)
>>>                          */
>>>                         if (dio_data.unsubmitted_oe_range_start <
>>>                             dio_data.unsubmitted_oe_range_end)
>>> -
>>> btrfs_endio_direct_write_update_ordered(inode,
>>> +                               __endio_write_update_ordered(inode,
>>>
>>> dio_data.unsubmitted_oe_range_start,
>>>                                         dio_data.unsubmitted_oe_range_end
>>> -
>>>
>>> dio_data.unsubmitted_oe_range_start,
>>> -                                       0);
>>> +                                       false);
>>>                 } else if (ret >= 0 && (size_t)ret < count)
>>>                         btrfs_delalloc_release_space(inode, offset,
>>>                                                      count -
>>> (size_t)ret);
>>> --
>>> 2.12.0
>>>
>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang
  2017-03-08  0:21       ` Filipe Manana
@ 2017-03-08  0:26         ` Qu Wenruo
  0 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2017-03-08  0:26 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Liu Bo, linux-btrfs, Filipe Manana



At 03/08/2017 08:21 AM, Filipe Manana wrote:
> On Wed, Mar 8, 2017 at 12:18 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>>
>>
>> At 03/08/2017 06:11 AM, Liu Bo wrote:
>>>
>>> On Mon, Mar 06, 2017 at 10:55:47AM +0800, Qu Wenruo wrote:
>>>>
>>>> [BUG]
>>>> If run_delalloc_range() returns error and there is already some ordered
>>>> extents created, btrfs will be hanged with the following backtrace:
>>>>
>>>> Call Trace:
>>>>  __schedule+0x2d4/0xae0
>>>>  schedule+0x3d/0x90
>>>>  btrfs_start_ordered_extent+0x160/0x200 [btrfs]
>>>>  ? wake_atomic_t_function+0x60/0x60
>>>>  btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
>>>>  btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
>>>>  btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
>>>>  process_one_work+0x2af/0x720
>>>>  ? process_one_work+0x22b/0x720
>>>>  worker_thread+0x4b/0x4f0
>>>>  kthread+0x10f/0x150
>>>>  ? process_one_work+0x720/0x720
>>>>  ? kthread_create_on_node+0x40/0x40
>>>>  ret_from_fork+0x2e/0x40
>>>>
>>>> [CAUSE]
>>>>
>>>> |<------------------ delalloc range --------------------------->|
>>>> | OE 1 | OE 2 | ... | OE n |
>>>> |<>|                       |<---------- cleanup range --------->|
>>>>  ||
>>>>  \_=> First page handled by end_extent_writepage() in
>>>> __extent_writepage()
>>>>
>>>> The problem is caused by error handler of run_delalloc_range(), which
>>>> doesn't handle any created ordered extents, leaving them waiting on
>>>> btrfs_finish_ordered_io() to finish.
>>>>
>>>> However after run_delalloc_range() returns error, __extent_writepage()
>>>> won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered
>>>> except the first page, and btrfs_finish_ordered_io() won't be triggered
>>>> for created ordered extents either.
>>>>
>>>> So OE 2~n will hang forever, and if OE 1 is larger than one page, it
>>>> will also hang.
>>>>
>>>> [FIX]
>>>> Introduce btrfs_cleanup_ordered_extents() function to cleanup created
>>>> ordered extents and finish them manually.
>>>>
>>>> The function is based on existing
>>>> btrfs_endio_direct_write_update_ordered() function, and modify it to
>>>> act just like btrfs_writepage_endio_hook() but handles specified range
>>>> other than one page.
>>>>
>>>> After fix, delalloc error will be handled like:
>>>>
>>>> |<------------------ delalloc range --------------------------->|
>>>> | OE 1 | OE 2 | ... | OE n |
>>>> |<>|<--------  ----------->|<------ old error handler --------->|
>>>>  ||          ||
>>>>  ||          \_=> Cleaned up by cleanup_ordered_extents()
>>>>  \_=> First page handled by end_extent_writepage() in
>>>> __extent_writepage()
>>>>
>>>
>>> Looks good, some nitpicks below.
>>>
>>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>>> ---
>>>> changelog:
>>>> v2:
>>>>   Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing
>>>>   outstanding extents, which is already done by
>>>>   extent_clear_unlock_delalloc() with EXTENT_DO_ACCOUNT control bit
>>>> v3:
>>>>   Skip first page to avoid underflow ordered->bytes_left.
>>>>   Fix range passed in cow_file_range() which doesn't cover the whole
>>>>   extent.
>>>>   Expend extent_clear_unlock_delalloc() range to allow them to handle
>>>>   metadata release.
>>>> v4:
>>>>   Don't use extra bit to skip metadata freeing for ordered extent,
>>>>   but only handle btrfs_reloc_clone_csums() error just before processing
>>>>   next extent.
>>>>   This makes error handle much easier for run_delalloc_nocow().
>>>> v5:
>>>>   Variant gramma and comment fixes suggested by Filipe Manana
>>>>   Enhanced commit message to focus on the generic error handler bug,
>>>>   pointed out by Filipe Manana
>>>>   Adding missing wq/func user in __endio_write_update_ordered(), pointed
>>>>   out by Filipe Manana.
>>>>   Enhanced commit message with ascii art to explain the bug more easily.
>>>>   Fix a bug which can leads to corrupted extent allocation, exposed by
>>>>   Filipe Manana.
>>>> v6:
>>>>   Split the metadata underflow fix to another patch.
>>>>   Use better comment and btrfs_cleanup_ordered_extent() implementation
>>>>   from Filipe Manana.
>>>> ---
>>>>  fs/btrfs/inode.c | 62
>>>> ++++++++++++++++++++++++++++++++++++++++++--------------
>>>>  1 file changed, 47 insertions(+), 15 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>>> index 1d83d504f2e5..9b03eb9b27d0 100644
>>>> --- a/fs/btrfs/inode.c
>>>> +++ b/fs/btrfs/inode.c
>>>> @@ -115,6 +115,30 @@ static struct extent_map *create_io_em(struct inode
>>>> *inode, u64 start, u64 len,
>>>>                                        u64 ram_bytes, int compress_type,
>>>>                                        int type);
>>>>
>>>> +static void __endio_write_update_ordered(struct inode *inode,
>>>> +                                        u64 offset, u64 bytes,
>>>> +                                        bool uptodate);
>>>> +
>>>> +/*
>>>> + * Cleanup all submitted ordered extents in specified range to handle
>>>> errors
>>>> + * from the fill_dellaloc() callback.
>>>> + *
>>>> + * NOTE: caller must ensure that when an error happens, it can not call
>>>> + * extent_clear_unlock_delalloc() to clear both the bits
>>>> EXTENT_DO_ACCOUNTING
>>>> + * and EXTENT_DELALLOC simultaneously, because that causes the reserved
>>>> metadata
>>>> + * to be released, which we want to happen only when finishing the
>>>> ordered
>>>> + * extent (btrfs_finish_ordered_io()). Also note that the caller of the
>>>> + * fill_dealloc() callback already does proper cleanup for the first
>>>> page of
>>>
>>>
>>> fill_delalloc()
>>>
>>>> + * the range, that is, it invokes the callback writepage_end_io_hook()
>>>> for the
>>>> + * range of the first page.
>>>> + */
>>>> +static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
>>>> +                                                u64 offset, u64 bytes)
>>>> +{
>>>> +       return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
>>>> +                                           bytes - PAGE_SIZE, false);
>>>> +}
>>>> +
>>>>  static int btrfs_dirty_inode(struct inode *inode);
>>>>
>>>>  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>>>> @@ -1536,6 +1560,8 @@ static int run_delalloc_range(struct inode *inode,
>>>> struct page *locked_page,
>>>>                 ret = cow_file_range_async(inode, locked_page, start,
>>>> end,
>>>>                                            page_started, nr_written);
>>>>         }
>>>> +       if (ret)
>>>> +               btrfs_cleanup_ordered_extents(inode, start, end - start +
>>>> 1);
>>>>         return ret;
>>>>  }
>>>>
>>>> @@ -8142,17 +8168,26 @@ static void btrfs_endio_direct_read(struct bio
>>>> *bio)
>>>>         bio_put(bio);
>>>>  }
>>>>
>>>> -static void btrfs_endio_direct_write_update_ordered(struct inode *inode,
>>>> -                                                   const u64 offset,
>>>> -                                                   const u64 bytes,
>>>> -                                                   const int uptodate)
>>>> +static void __endio_write_update_ordered(struct inode *inode,
>>>> +                                        u64 offset, u64 bytes,
>>>> +                                        bool uptodate)
>>>
>>>
>>> Not serious, but why const was killed?
>>
>>
>> Because const for @offset, @bytes has no meaning, u64/bool are passed by
>> their value.
>>
>> Any modification to @offset/@bytes/@update has no effect on the passed
>> values.
>
> But it helps detect logic errors inside the function. For example when
> you have local variables with names similar to the function
> parameters, it's easy to make a mistake and modify a parameter instead
> of a local variable - that's why I always use const for new functions
> (and not only because I like to type a lot).

Yes, that makes sense.

I was originally worried about killing the possibility to reuse @offset 
as iterator, just like what we do with @start in cow_file_range().
But it turns out that, the modification of @start in cow_file_range() is 
already a bad idea.

So in next version const prefix will be added bad.

Thanks,
Qu

>
>>
>> Thanks,
>> Qu
>>
>>
>>>
>>> With that fixed, you can have
>>>
>>> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
>>>
>>> Thanks,
>>>
>>> -liubo
>>>>
>>>>  {
>>>>         struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>>>>         struct btrfs_ordered_extent *ordered = NULL;
>>>> +       struct btrfs_workqueue *wq;
>>>> +       btrfs_work_func_t func;
>>>>         u64 ordered_offset = offset;
>>>>         u64 ordered_bytes = bytes;
>>>>         int ret;
>>>>
>>>> +       if (btrfs_is_free_space_inode(BTRFS_I(inode))) {
>>>> +               wq = fs_info->endio_freespace_worker;
>>>> +               func = btrfs_freespace_write_helper;
>>>> +       } else {
>>>> +               wq = fs_info->endio_write_workers;
>>>> +               func = btrfs_endio_write_helper;
>>>> +       }
>>>> +
>>>>  again:
>>>>         ret = btrfs_dec_test_first_ordered_pending(inode, &ordered,
>>>>                                                    &ordered_offset,
>>>> @@ -8161,9 +8196,8 @@ static void
>>>> btrfs_endio_direct_write_update_ordered(struct inode *inode,
>>>>         if (!ret)
>>>>                 goto out_test;
>>>>
>>>> -       btrfs_init_work(&ordered->work, btrfs_endio_write_helper,
>>>> -                       finish_ordered_fn, NULL, NULL);
>>>> -       btrfs_queue_work(fs_info->endio_write_workers, &ordered->work);
>>>> +       btrfs_init_work(&ordered->work, func, finish_ordered_fn, NULL,
>>>> NULL);
>>>> +       btrfs_queue_work(wq, &ordered->work);
>>>>  out_test:
>>>>         /*
>>>>          * our bio might span multiple ordered extents.  If we haven't
>>>> @@ -8181,10 +8215,8 @@ static void btrfs_endio_direct_write(struct bio
>>>> *bio)
>>>>         struct btrfs_dio_private *dip = bio->bi_private;
>>>>         struct bio *dio_bio = dip->dio_bio;
>>>>
>>>> -       btrfs_endio_direct_write_update_ordered(dip->inode,
>>>> -                                               dip->logical_offset,
>>>> -                                               dip->bytes,
>>>> -                                               !bio->bi_error);
>>>> +       __endio_write_update_ordered(dip->inode, dip->logical_offset,
>>>> +                                    dip->bytes, !bio->bi_error);
>>>>
>>>>         kfree(dip);
>>>>
>>>> @@ -8545,10 +8577,10 @@ static void btrfs_submit_direct(struct bio
>>>> *dio_bio, struct inode *inode,
>>>>                 io_bio = NULL;
>>>>         } else {
>>>>                 if (write)
>>>> -                       btrfs_endio_direct_write_update_ordered(inode,
>>>> +                       __endio_write_update_ordered(inode,
>>>>                                                 file_offset,
>>>>                                                 dio_bio->bi_iter.bi_size,
>>>> -                                               0);
>>>> +                                               false);
>>>>                 else
>>>>                         unlock_extent(&BTRFS_I(inode)->io_tree,
>>>> file_offset,
>>>>                               file_offset + dio_bio->bi_iter.bi_size -
>>>> 1);
>>>> @@ -8683,11 +8715,11 @@ static ssize_t btrfs_direct_IO(struct kiocb
>>>> *iocb, struct iov_iter *iter)
>>>>                          */
>>>>                         if (dio_data.unsubmitted_oe_range_start <
>>>>                             dio_data.unsubmitted_oe_range_end)
>>>> -
>>>> btrfs_endio_direct_write_update_ordered(inode,
>>>> +                               __endio_write_update_ordered(inode,
>>>>
>>>> dio_data.unsubmitted_oe_range_start,
>>>>                                         dio_data.unsubmitted_oe_range_end
>>>> -
>>>>
>>>> dio_data.unsubmitted_oe_range_start,
>>>> -                                       0);
>>>> +                                       false);
>>>>                 } else if (ret >= 0 && (size_t)ret < count)
>>>>                         btrfs_delalloc_release_space(inode, offset,
>>>>                                                      count -
>>>> (size_t)ret);
>>>> --
>>>> 2.12.0
>>>>
>>>>
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>
>>
>
>



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  2017-03-08  0:17     ` Filipe Manana
@ 2017-03-08  1:16       ` Liu Bo
  2017-03-08  1:21         ` Qu Wenruo
  0 siblings, 1 reply; 16+ messages in thread
From: Liu Bo @ 2017-03-08  1:16 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Qu Wenruo, linux-btrfs

On Wed, Mar 08, 2017 at 12:17:16AM +0000, Filipe Manana wrote:
> On Tue, Mar 7, 2017 at 8:59 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
> > On Tue, Mar 07, 2017 at 12:49:58PM -0800, Liu Bo wrote:
> >> On Mon, Mar 06, 2017 at 10:55:46AM +0800, Qu Wenruo wrote:
> >> > [BUG]
> >> > When btrfs_reloc_clone_csum() reports error, it can underflow metadata
> >> > and leads to kernel assertion on outstanding extents in
> >> > run_delalloc_nocow() and cow_file_range().
> >> >
> >> >  BTRFS info (device vdb5): relocating block group 12582912 flags data
> >> >  BTRFS info (device vdb5): found 1 extents
> >> >  assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858
> >> >
> >> > Currently, due to another bug blocking ordered extents, the bug is only
> >> > reproducible under certain block group layout and using error injection.
> >> >
> >> > a) Create one data block group with one 4K extent in it.
> >> >    To avoid the bug that hangs btrfs due to ordered extent which never
> >> >    finishes
> >> > b) Make btrfs_reloc_clone_csum() always fail
> >> > c) Relocate that block group
> >> >
> >> > [CAUSE]
> >> > run_delalloc_nocow() and cow_file_range() handles error from
> >> > btrfs_reloc_clone_csum() wrongly:
> >> >
> >> > (The ascii chart shows a more generic case of this bug other than the
> >> > bug mentioned above)
> >> >
> >> > |<------------------ delalloc range --------------------------->|
> >> > | OE 1 | OE 2 | ... | OE n |
> >> >                     |<----------- cleanup range --------------->|
> >> > |<-----------  ----------->|
> >> >              \/
> >> >  btrfs_finish_ordered_io() range
> >> >
> >> > So error handler, which calls extent_clear_unlock_delalloc() with
> >> > EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
> >> > will both cover OE n, and free its metadata, causing metadata under flow.
> >> >
> >> > [Fix]
> >> > The fix is to ensure after calling btrfs_add_ordered_extent(), we only
> >> > call error handler after increasing the iteration offset, so that
> >> > cleanup range won't cover any created ordered extent.
> >> >
> >> > |<------------------ delalloc range --------------------------->|
> >> > | OE 1 | OE 2 | ... | OE n |
> >> > |<-----------  ----------->|<---------- cleanup range --------->|
> >> >              \/
> >> >  btrfs_finish_ordered_io() range
> >> >
> >> > Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> >> > ---
> >> > changelog:
> >> > v6:
> >> >   New, split from v5 patch, as this is a separate bug.
> >> > ---
> >> >  fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------
> >> >  1 file changed, 39 insertions(+), 12 deletions(-)
> >> >
> >> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> >> > index b2bc07aad1ae..1d83d504f2e5 100644
> >> > --- a/fs/btrfs/inode.c
> >> > +++ b/fs/btrfs/inode.c
> >> > @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode,
> >> >                 BTRFS_DATA_RELOC_TREE_OBJECTID) {
> >> >                     ret = btrfs_reloc_clone_csums(inode, start,
> >> >                                                   cur_alloc_size);
> >> > +                   /*
> >> > +                    * Only drop cache here, and process as normal.
> >> > +                    *
> >> > +                    * We must not allow extent_clear_unlock_delalloc()
> >> > +                    * at out_unlock label to free meta of this ordered
> >> > +                    * extent, as its meta should be freed by
> >> > +                    * btrfs_finish_ordered_io().
> >> > +                    *
> >> > +                    * So we must continue until @start is increased to
> >> > +                    * skip current ordered extent.
> >> > +                    */
> >> >                     if (ret)
> >> > -                           goto out_drop_extent_cache;
> >> > +                           btrfs_drop_extent_cache(BTRFS_I(inode), start,
> >> > +                                           start + ram_size - 1, 0);
> >> >             }
> >> >
> >> >             btrfs_dec_block_group_reservations(fs_info, ins.objectid);
> >> >
> >> > -           if (disk_num_bytes < cur_alloc_size)
> >> > -                   break;
> >> > -
> >> >             /* we're not doing compressed IO, don't unlock the first
> >> >              * page (which the caller expects to stay locked), don't
> >> >              * clear any dirty bits and don't set any writeback bits
> >> > @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode,
> >> >                                          delalloc_end, locked_page,
> >> >                                          EXTENT_LOCKED | EXTENT_DELALLOC,
> >> >                                          op);
> >> > -           disk_num_bytes -= cur_alloc_size;
> >> > +           if (disk_num_bytes > cur_alloc_size)
> >> > +                   disk_num_bytes = 0;
> >> > +           else
> >> > +                   disk_num_bytes -= cur_alloc_size;
> >>
> >> I don't get the logic here, why do we 'break' if disk_num_bytes > cur_alloc_size?
> >
> > I assume that you've run fstests against this patch, if so, I actually
> > start worrying about that no fstests found this problem.
> 
> The operator is definitely wrong, should be < instead of > (previous
> patch versions were correct).
> 
> It's not a surprise fstests don't catch this - one would need a fs
> with no more free space to allocate new data chunks and existing data
> chunks would have to be fragmented to the point we need to allocate
> two or more smaller extents to satisfy the write, so whether fstests
> exercises this depends on the size of the test devices (mine are 100G
> for example) and the size of the data the tests create. Having a
> deterministic test for that for that should be possible and useful.

I see, I think that 'mount -o fragment=data' could help us get it.

Thanks,

-liubo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  2017-03-08  1:16       ` Liu Bo
@ 2017-03-08  1:21         ` Qu Wenruo
  0 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2017-03-08  1:21 UTC (permalink / raw)
  To: bo.li.liu, Filipe Manana; +Cc: linux-btrfs



At 03/08/2017 09:16 AM, Liu Bo wrote:
> On Wed, Mar 08, 2017 at 12:17:16AM +0000, Filipe Manana wrote:
>> On Tue, Mar 7, 2017 at 8:59 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
>>> On Tue, Mar 07, 2017 at 12:49:58PM -0800, Liu Bo wrote:
>>>> On Mon, Mar 06, 2017 at 10:55:46AM +0800, Qu Wenruo wrote:
>>>>> [BUG]
>>>>> When btrfs_reloc_clone_csum() reports error, it can underflow metadata
>>>>> and leads to kernel assertion on outstanding extents in
>>>>> run_delalloc_nocow() and cow_file_range().
>>>>>
>>>>>  BTRFS info (device vdb5): relocating block group 12582912 flags data
>>>>>  BTRFS info (device vdb5): found 1 extents
>>>>>  assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858
>>>>>
>>>>> Currently, due to another bug blocking ordered extents, the bug is only
>>>>> reproducible under certain block group layout and using error injection.
>>>>>
>>>>> a) Create one data block group with one 4K extent in it.
>>>>>    To avoid the bug that hangs btrfs due to ordered extent which never
>>>>>    finishes
>>>>> b) Make btrfs_reloc_clone_csum() always fail
>>>>> c) Relocate that block group
>>>>>
>>>>> [CAUSE]
>>>>> run_delalloc_nocow() and cow_file_range() handles error from
>>>>> btrfs_reloc_clone_csum() wrongly:
>>>>>
>>>>> (The ascii chart shows a more generic case of this bug other than the
>>>>> bug mentioned above)
>>>>>
>>>>> |<------------------ delalloc range --------------------------->|
>>>>> | OE 1 | OE 2 | ... | OE n |
>>>>>                     |<----------- cleanup range --------------->|
>>>>> |<-----------  ----------->|
>>>>>              \/
>>>>>  btrfs_finish_ordered_io() range
>>>>>
>>>>> So error handler, which calls extent_clear_unlock_delalloc() with
>>>>> EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
>>>>> will both cover OE n, and free its metadata, causing metadata under flow.
>>>>>
>>>>> [Fix]
>>>>> The fix is to ensure after calling btrfs_add_ordered_extent(), we only
>>>>> call error handler after increasing the iteration offset, so that
>>>>> cleanup range won't cover any created ordered extent.
>>>>>
>>>>> |<------------------ delalloc range --------------------------->|
>>>>> | OE 1 | OE 2 | ... | OE n |
>>>>> |<-----------  ----------->|<---------- cleanup range --------->|
>>>>>              \/
>>>>>  btrfs_finish_ordered_io() range
>>>>>
>>>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>>>> ---
>>>>> changelog:
>>>>> v6:
>>>>>   New, split from v5 patch, as this is a separate bug.
>>>>> ---
>>>>>  fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------
>>>>>  1 file changed, 39 insertions(+), 12 deletions(-)
>>>>>
>>>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>>>> index b2bc07aad1ae..1d83d504f2e5 100644
>>>>> --- a/fs/btrfs/inode.c
>>>>> +++ b/fs/btrfs/inode.c
>>>>> @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode,
>>>>>                 BTRFS_DATA_RELOC_TREE_OBJECTID) {
>>>>>                     ret = btrfs_reloc_clone_csums(inode, start,
>>>>>                                                   cur_alloc_size);
>>>>> +                   /*
>>>>> +                    * Only drop cache here, and process as normal.
>>>>> +                    *
>>>>> +                    * We must not allow extent_clear_unlock_delalloc()
>>>>> +                    * at out_unlock label to free meta of this ordered
>>>>> +                    * extent, as its meta should be freed by
>>>>> +                    * btrfs_finish_ordered_io().
>>>>> +                    *
>>>>> +                    * So we must continue until @start is increased to
>>>>> +                    * skip current ordered extent.
>>>>> +                    */
>>>>>                     if (ret)
>>>>> -                           goto out_drop_extent_cache;
>>>>> +                           btrfs_drop_extent_cache(BTRFS_I(inode), start,
>>>>> +                                           start + ram_size - 1, 0);
>>>>>             }
>>>>>
>>>>>             btrfs_dec_block_group_reservations(fs_info, ins.objectid);
>>>>>
>>>>> -           if (disk_num_bytes < cur_alloc_size)
>>>>> -                   break;
>>>>> -
>>>>>             /* we're not doing compressed IO, don't unlock the first
>>>>>              * page (which the caller expects to stay locked), don't
>>>>>              * clear any dirty bits and don't set any writeback bits
>>>>> @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode,
>>>>>                                          delalloc_end, locked_page,
>>>>>                                          EXTENT_LOCKED | EXTENT_DELALLOC,
>>>>>                                          op);
>>>>> -           disk_num_bytes -= cur_alloc_size;
>>>>> +           if (disk_num_bytes > cur_alloc_size)
>>>>> +                   disk_num_bytes = 0;
>>>>> +           else
>>>>> +                   disk_num_bytes -= cur_alloc_size;
>>>>
>>>> I don't get the logic here, why do we 'break' if disk_num_bytes > cur_alloc_size?
>>>
>>> I assume that you've run fstests against this patch, if so, I actually
>>> start worrying about that no fstests found this problem.
>>
>> The operator is definitely wrong, should be < instead of > (previous
>> patch versions were correct).
>>
>> It's not a surprise fstests don't catch this - one would need a fs
>> with no more free space to allocate new data chunks and existing data
>> chunks would have to be fragmented to the point we need to allocate
>> two or more smaller extents to satisfy the write, so whether fstests
>> exercises this depends on the size of the test devices (mine are 100G
>> for example) and the size of the data the tests create. Having a
>> deterministic test for that for that should be possible and useful.
>
> I see, I think that 'mount -o fragment=data' could help us get it.
>
> Thanks,
>
> -liubo
>
>
I'll also submit a test case for this.

By filling the fs with small files(sectorsize) with sequential number as 
file name.

Then sync fs, remove the files with odd number, then do a large write.

This one should trigger the problem without the use of the debug option, 
and can be a generic test.

Thanks,
Qu



^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-03-08  1:23 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-06  2:55 [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Qu Wenruo
2017-03-06  2:55 ` [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang Qu Wenruo
2017-03-06 23:18   ` Filipe Manana
2017-03-07 22:11   ` Liu Bo
2017-03-08  0:18     ` Qu Wenruo
2017-03-08  0:21       ` Filipe Manana
2017-03-08  0:26         ` Qu Wenruo
2017-03-06 23:19 ` [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Filipe Manana
2017-03-07 17:32   ` David Sterba
2017-03-07 20:51     ` Liu Bo
2017-03-07 20:49 ` Liu Bo
2017-03-07 20:59   ` Liu Bo
2017-03-08  0:17     ` Filipe Manana
2017-03-08  1:16       ` Liu Bo
2017-03-08  1:21         ` Qu Wenruo
2017-03-08  0:17   ` Qu Wenruo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.