All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
@ 2017-03-08  2:25 Qu Wenruo
  2017-03-08  2:25 ` [PATCH v7 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang Qu Wenruo
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Qu Wenruo @ 2017-03-08  2:25 UTC (permalink / raw)
  To: linux-btrfs, bo.li.liu, fdmanana

[BUG]
When btrfs_reloc_clone_csum() reports error, it can underflow metadata
and leads to kernel assertion on outstanding extents in
run_delalloc_nocow() and cow_file_range().

 BTRFS info (device vdb5): relocating block group 12582912 flags data
 BTRFS info (device vdb5): found 1 extents
 assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858

Currently, due to another bug blocking ordered extents, the bug is only
reproducible under certain block group layout and using error injection.

a) Create one data block group with one 4K extent in it.
   To avoid the bug that hangs btrfs due to ordered extent which never
   finishes
b) Make btrfs_reloc_clone_csum() always fail
c) Relocate that block group

[CAUSE]
run_delalloc_nocow() and cow_file_range() handles error from
btrfs_reloc_clone_csum() wrongly:

(The ascii chart shows a more generic case of this bug other than the
bug mentioned above)

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
                    |<----------- cleanup range --------------->|
|<-----------  ----------->|
             \/
 btrfs_finish_ordered_io() range

So error handler, which calls extent_clear_unlock_delalloc() with
EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
will both cover OE n, and free its metadata, causing metadata under flow.

[Fix]
The fix is to ensure after calling btrfs_add_ordered_extent(), we only
call error handler after increasing the iteration offset, so that
cleanup range won't cover any created ordered extent.

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<-----------  ----------->|<---------- cleanup range --------->|
             \/
 btrfs_finish_ordered_io() range

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
v6:
  New, split from v5 patch, as this is a separate bug.
v7:
  Fix a wrong operator pointed out by Liu Bo.
  Test case will follow soon.
---
 fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 39 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c40060cc481f..fe588bf30d02 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode,
 		    BTRFS_DATA_RELOC_TREE_OBJECTID) {
 			ret = btrfs_reloc_clone_csums(inode, start,
 						      cur_alloc_size);
+			/*
+			 * Only drop cache here, and process as normal.
+			 *
+			 * We must not allow extent_clear_unlock_delalloc()
+			 * at out_unlock label to free meta of this ordered
+			 * extent, as its meta should be freed by
+			 * btrfs_finish_ordered_io().
+			 *
+			 * So we must continue until @start is increased to
+			 * skip current ordered extent.
+			 */
 			if (ret)
-				goto out_drop_extent_cache;
+				btrfs_drop_extent_cache(BTRFS_I(inode), start,
+						start + ram_size - 1, 0);
 		}
 
 		btrfs_dec_block_group_reservations(fs_info, ins.objectid);
 
-		if (disk_num_bytes < cur_alloc_size)
-			break;
-
 		/* we're not doing compressed IO, don't unlock the first
 		 * page (which the caller expects to stay locked), don't
 		 * clear any dirty bits and don't set any writeback bits
@@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode,
 					     delalloc_end, locked_page,
 					     EXTENT_LOCKED | EXTENT_DELALLOC,
 					     op);
-		disk_num_bytes -= cur_alloc_size;
+		if (disk_num_bytes < cur_alloc_size)
+			disk_num_bytes = 0;
+		else
+			disk_num_bytes -= cur_alloc_size;
 		num_bytes -= cur_alloc_size;
 		alloc_hint = ins.objectid + ins.offset;
 		start += cur_alloc_size;
+
+		/*
+		 * btrfs_reloc_clone_csums() error, since start is increased
+		 * extent_clear_unlock_delalloc() at out_unlock label won't
+		 * free metadata of current ordered extent, we're OK to exit.
+		 */
+		if (ret)
+			goto out_unlock;
 	}
 out:
 	return ret;
@@ -1414,15 +1434,14 @@ static noinline int run_delalloc_nocow(struct inode *inode,
 		BUG_ON(ret); /* -ENOMEM */
 
 		if (root->root_key.objectid ==
-		    BTRFS_DATA_RELOC_TREE_OBJECTID) {
+		    BTRFS_DATA_RELOC_TREE_OBJECTID)
+			/*
+			 * Error handled later, as we must prevent
+			 * extent_clear_unlock_delalloc() in error handler
+			 * from freeing metadata of created ordered extent.
+			 */
 			ret = btrfs_reloc_clone_csums(inode, cur_offset,
 						      num_bytes);
-			if (ret) {
-				if (!nolock && nocow)
-					btrfs_end_write_no_snapshoting(root);
-				goto error;
-			}
-		}
 
 		extent_clear_unlock_delalloc(inode, cur_offset,
 					     cur_offset + num_bytes - 1, end,
@@ -1434,6 +1453,14 @@ static noinline int run_delalloc_nocow(struct inode *inode,
 		if (!nolock && nocow)
 			btrfs_end_write_no_snapshoting(root);
 		cur_offset = extent_end;
+
+		/*
+		 * btrfs_reloc_clone_csums() error, now we're OK to call error
+		 * handler, as metadata for created ordered extent will only
+		 * be freed by btrfs_finish_ordered_io().
+		 */
+		if (ret)
+			goto error;
 		if (cur_offset > end)
 			break;
 	}
-- 
2.12.0




^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v7 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang
  2017-03-08  2:25 [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Qu Wenruo
@ 2017-03-08  2:25 ` Qu Wenruo
  2017-03-08 10:19   ` Filipe Manana
  2017-03-09 17:37   ` Liu Bo
  2017-03-08 10:18 ` [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Filipe Manana
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 15+ messages in thread
From: Qu Wenruo @ 2017-03-08  2:25 UTC (permalink / raw)
  To: linux-btrfs, bo.li.liu, fdmanana; +Cc: Filipe Manana

[BUG]
If run_delalloc_range() returns error and there is already some ordered
extents created, btrfs will be hanged with the following backtrace:

Call Trace:
 __schedule+0x2d4/0xae0
 schedule+0x3d/0x90
 btrfs_start_ordered_extent+0x160/0x200 [btrfs]
 ? wake_atomic_t_function+0x60/0x60
 btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
 btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
 btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
 process_one_work+0x2af/0x720
 ? process_one_work+0x22b/0x720
 worker_thread+0x4b/0x4f0
 kthread+0x10f/0x150
 ? process_one_work+0x720/0x720
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x2e/0x40

[CAUSE]

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<>|                       |<---------- cleanup range --------->|
 ||
 \_=> First page handled by end_extent_writepage() in __extent_writepage()

The problem is caused by error handler of run_delalloc_range(), which
doesn't handle any created ordered extents, leaving them waiting on
btrfs_finish_ordered_io() to finish.

However after run_delalloc_range() returns error, __extent_writepage()
won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered
except the first page, and btrfs_finish_ordered_io() won't be triggered
for created ordered extents either.

So OE 2~n will hang forever, and if OE 1 is larger than one page, it
will also hang.

[FIX]
Introduce btrfs_cleanup_ordered_extents() function to cleanup created
ordered extents and finish them manually.

The function is based on existing
btrfs_endio_direct_write_update_ordered() function, and modify it to
act just like btrfs_writepage_endio_hook() but handles specified range
other than one page.

After fix, delalloc error will be handled like:

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<>|<--------  ----------->|<------ old error handler --------->|
 ||          ||
 ||          \_=> Cleaned up by cleanup_ordered_extents()
 \_=> First page handled by end_extent_writepage() in __extent_writepage()

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
v2:
  Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing
  outstanding extents, which is already done by
  extent_clear_unlock_delalloc() with EXTENT_DO_ACCOUNT control bit
v3:
  Skip first page to avoid underflow ordered->bytes_left.
  Fix range passed in cow_file_range() which doesn't cover the whole
  extent.
  Expend extent_clear_unlock_delalloc() range to allow them to handle
  metadata release.
v4:
  Don't use extra bit to skip metadata freeing for ordered extent,
  but only handle btrfs_reloc_clone_csums() error just before processing
  next extent.
  This makes error handle much easier for run_delalloc_nocow().
v5:
  Variant gramma and comment fixes suggested by Filipe Manana
  Enhanced commit message to focus on the generic error handler bug,
  pointed out by Filipe Manana
  Adding missing wq/func user in __endio_write_update_ordered(), pointed
  out by Filipe Manana.
  Enhanced commit message with ascii art to explain the bug more easily.
  Fix a bug which can leads to corrupted extent allocation, exposed by
  Filipe Manana.
v6:
  Split the metadata underflow fix to another patch.
  Use better comment and btrfs_cleanup_ordered_extent() implementation
  from Filipe Manana.
v7:
  Add back const prefix
---
 fs/btrfs/inode.c | 63 ++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 48 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index fe588bf30d02..d3bc68bbe0e7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -115,6 +115,31 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len,
 				       u64 ram_bytes, int compress_type,
 				       int type);
 
+static void __endio_write_update_ordered(struct inode *inode,
+					 const u64 offset, const u64 bytes,
+					 const bool uptodate);
+
+/*
+ * Cleanup all submitted ordered extents in specified range to handle errors
+ * from the fill_dellaloc() callback.
+ *
+ * NOTE: caller must ensure that when an error happens, it can not call
+ * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING
+ * and EXTENT_DELALLOC simultaneously, because that causes the reserved metadata
+ * to be released, which we want to happen only when finishing the ordered
+ * extent (btrfs_finish_ordered_io()). Also note that the caller of the
+ * fill_delalloc() callback already does proper cleanup for the first page of
+ * the range, that is, it invokes the callback writepage_end_io_hook() for the
+ * range of the first page.
+ */
+static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
+						 const u64 offset,
+						 const u64 bytes)
+{
+	return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
+					    bytes - PAGE_SIZE, false);
+}
+
 static int btrfs_dirty_inode(struct inode *inode);
 
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
@@ -1536,6 +1561,8 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
 		ret = cow_file_range_async(inode, locked_page, start, end,
 					   page_started, nr_written);
 	}
+	if (ret)
+		btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
 	return ret;
 }
 
@@ -8142,17 +8169,26 @@ static void btrfs_endio_direct_read(struct bio *bio)
 	bio_put(bio);
 }
 
-static void btrfs_endio_direct_write_update_ordered(struct inode *inode,
-						    const u64 offset,
-						    const u64 bytes,
-						    const int uptodate)
+static void __endio_write_update_ordered(struct inode *inode,
+					 const u64 offset, const u64 bytes,
+					 const bool uptodate)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct btrfs_ordered_extent *ordered = NULL;
+	struct btrfs_workqueue *wq;
+	btrfs_work_func_t func;
 	u64 ordered_offset = offset;
 	u64 ordered_bytes = bytes;
 	int ret;
 
+	if (btrfs_is_free_space_inode(BTRFS_I(inode))) {
+		wq = fs_info->endio_freespace_worker;
+		func = btrfs_freespace_write_helper;
+	} else {
+		wq = fs_info->endio_write_workers;
+		func = btrfs_endio_write_helper;
+	}
+
 again:
 	ret = btrfs_dec_test_first_ordered_pending(inode, &ordered,
 						   &ordered_offset,
@@ -8161,9 +8197,8 @@ static void btrfs_endio_direct_write_update_ordered(struct inode *inode,
 	if (!ret)
 		goto out_test;
 
-	btrfs_init_work(&ordered->work, btrfs_endio_write_helper,
-			finish_ordered_fn, NULL, NULL);
-	btrfs_queue_work(fs_info->endio_write_workers, &ordered->work);
+	btrfs_init_work(&ordered->work, func, finish_ordered_fn, NULL, NULL);
+	btrfs_queue_work(wq, &ordered->work);
 out_test:
 	/*
 	 * our bio might span multiple ordered extents.  If we haven't
@@ -8181,10 +8216,8 @@ static void btrfs_endio_direct_write(struct bio *bio)
 	struct btrfs_dio_private *dip = bio->bi_private;
 	struct bio *dio_bio = dip->dio_bio;
 
-	btrfs_endio_direct_write_update_ordered(dip->inode,
-						dip->logical_offset,
-						dip->bytes,
-						!bio->bi_error);
+	__endio_write_update_ordered(dip->inode, dip->logical_offset,
+				     dip->bytes, !bio->bi_error);
 
 	kfree(dip);
 
@@ -8545,10 +8578,10 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 		io_bio = NULL;
 	} else {
 		if (write)
-			btrfs_endio_direct_write_update_ordered(inode,
+			__endio_write_update_ordered(inode,
 						file_offset,
 						dio_bio->bi_iter.bi_size,
-						0);
+						false);
 		else
 			unlock_extent(&BTRFS_I(inode)->io_tree, file_offset,
 			      file_offset + dio_bio->bi_iter.bi_size - 1);
@@ -8683,11 +8716,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 			 */
 			if (dio_data.unsubmitted_oe_range_start <
 			    dio_data.unsubmitted_oe_range_end)
-				btrfs_endio_direct_write_update_ordered(inode,
+				__endio_write_update_ordered(inode,
 					dio_data.unsubmitted_oe_range_start,
 					dio_data.unsubmitted_oe_range_end -
 					dio_data.unsubmitted_oe_range_start,
-					0);
+					false);
 		} else if (ret >= 0 && (size_t)ret < count)
 			btrfs_delalloc_release_space(inode, offset,
 						     count - (size_t)ret);
-- 
2.12.0




^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  2017-03-08  2:25 [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Qu Wenruo
  2017-03-08  2:25 ` [PATCH v7 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang Qu Wenruo
@ 2017-03-08 10:18 ` Filipe Manana
  2017-03-09 17:35 ` Liu Bo
  2017-03-12 20:49 ` Stefan Priebe - Profihost AG
  3 siblings, 0 replies; 15+ messages in thread
From: Filipe Manana @ 2017-03-08 10:18 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Liu Bo

On Wed, Mar 8, 2017 at 2:25 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> [BUG]
> When btrfs_reloc_clone_csum() reports error, it can underflow metadata
> and leads to kernel assertion on outstanding extents in
> run_delalloc_nocow() and cow_file_range().
>
>  BTRFS info (device vdb5): relocating block group 12582912 flags data
>  BTRFS info (device vdb5): found 1 extents
>  assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858
>
> Currently, due to another bug blocking ordered extents, the bug is only
> reproducible under certain block group layout and using error injection.
>
> a) Create one data block group with one 4K extent in it.
>    To avoid the bug that hangs btrfs due to ordered extent which never
>    finishes
> b) Make btrfs_reloc_clone_csum() always fail
> c) Relocate that block group
>
> [CAUSE]
> run_delalloc_nocow() and cow_file_range() handles error from
> btrfs_reloc_clone_csum() wrongly:
>
> (The ascii chart shows a more generic case of this bug other than the
> bug mentioned above)
>
> |<------------------ delalloc range --------------------------->|
> | OE 1 | OE 2 | ... | OE n |
>                     |<----------- cleanup range --------------->|
> |<-----------  ----------->|
>              \/
>  btrfs_finish_ordered_io() range
>
> So error handler, which calls extent_clear_unlock_delalloc() with
> EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
> will both cover OE n, and free its metadata, causing metadata under flow.
>
> [Fix]
> The fix is to ensure after calling btrfs_add_ordered_extent(), we only
> call error handler after increasing the iteration offset, so that
> cleanup range won't cover any created ordered extent.
>
> |<------------------ delalloc range --------------------------->|
> | OE 1 | OE 2 | ... | OE n |
> |<-----------  ----------->|<---------- cleanup range --------->|
>              \/
>  btrfs_finish_ordered_io() range
>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks

> ---
> v6:
>   New, split from v5 patch, as this is a separate bug.
> v7:
>   Fix a wrong operator pointed out by Liu Bo.
>   Test case will follow soon.
> ---
>  fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 39 insertions(+), 12 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index c40060cc481f..fe588bf30d02 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode,
>                     BTRFS_DATA_RELOC_TREE_OBJECTID) {
>                         ret = btrfs_reloc_clone_csums(inode, start,
>                                                       cur_alloc_size);
> +                       /*
> +                        * Only drop cache here, and process as normal.
> +                        *
> +                        * We must not allow extent_clear_unlock_delalloc()
> +                        * at out_unlock label to free meta of this ordered
> +                        * extent, as its meta should be freed by
> +                        * btrfs_finish_ordered_io().
> +                        *
> +                        * So we must continue until @start is increased to
> +                        * skip current ordered extent.
> +                        */
>                         if (ret)
> -                               goto out_drop_extent_cache;
> +                               btrfs_drop_extent_cache(BTRFS_I(inode), start,
> +                                               start + ram_size - 1, 0);
>                 }
>
>                 btrfs_dec_block_group_reservations(fs_info, ins.objectid);
>
> -               if (disk_num_bytes < cur_alloc_size)
> -                       break;
> -
>                 /* we're not doing compressed IO, don't unlock the first
>                  * page (which the caller expects to stay locked), don't
>                  * clear any dirty bits and don't set any writeback bits
> @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode,
>                                              delalloc_end, locked_page,
>                                              EXTENT_LOCKED | EXTENT_DELALLOC,
>                                              op);
> -               disk_num_bytes -= cur_alloc_size;
> +               if (disk_num_bytes < cur_alloc_size)
> +                       disk_num_bytes = 0;
> +               else
> +                       disk_num_bytes -= cur_alloc_size;
>                 num_bytes -= cur_alloc_size;
>                 alloc_hint = ins.objectid + ins.offset;
>                 start += cur_alloc_size;
> +
> +               /*
> +                * btrfs_reloc_clone_csums() error, since start is increased
> +                * extent_clear_unlock_delalloc() at out_unlock label won't
> +                * free metadata of current ordered extent, we're OK to exit.
> +                */
> +               if (ret)
> +                       goto out_unlock;
>         }
>  out:
>         return ret;
> @@ -1414,15 +1434,14 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>                 BUG_ON(ret); /* -ENOMEM */
>
>                 if (root->root_key.objectid ==
> -                   BTRFS_DATA_RELOC_TREE_OBJECTID) {
> +                   BTRFS_DATA_RELOC_TREE_OBJECTID)
> +                       /*
> +                        * Error handled later, as we must prevent
> +                        * extent_clear_unlock_delalloc() in error handler
> +                        * from freeing metadata of created ordered extent.
> +                        */
>                         ret = btrfs_reloc_clone_csums(inode, cur_offset,
>                                                       num_bytes);
> -                       if (ret) {
> -                               if (!nolock && nocow)
> -                                       btrfs_end_write_no_snapshoting(root);
> -                               goto error;
> -                       }
> -               }
>
>                 extent_clear_unlock_delalloc(inode, cur_offset,
>                                              cur_offset + num_bytes - 1, end,
> @@ -1434,6 +1453,14 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>                 if (!nolock && nocow)
>                         btrfs_end_write_no_snapshoting(root);
>                 cur_offset = extent_end;
> +
> +               /*
> +                * btrfs_reloc_clone_csums() error, now we're OK to call error
> +                * handler, as metadata for created ordered extent will only
> +                * be freed by btrfs_finish_ordered_io().
> +                */
> +               if (ret)
> +                       goto error;
>                 if (cur_offset > end)
>                         break;
>         }
> --
> 2.12.0
>
>
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang
  2017-03-08  2:25 ` [PATCH v7 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang Qu Wenruo
@ 2017-03-08 10:19   ` Filipe Manana
  2017-03-09 17:37   ` Liu Bo
  1 sibling, 0 replies; 15+ messages in thread
From: Filipe Manana @ 2017-03-08 10:19 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Liu Bo, Filipe Manana

On Wed, Mar 8, 2017 at 2:25 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> [BUG]
> If run_delalloc_range() returns error and there is already some ordered
> extents created, btrfs will be hanged with the following backtrace:
>
> Call Trace:
>  __schedule+0x2d4/0xae0
>  schedule+0x3d/0x90
>  btrfs_start_ordered_extent+0x160/0x200 [btrfs]
>  ? wake_atomic_t_function+0x60/0x60
>  btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
>  btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
>  btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
>  process_one_work+0x2af/0x720
>  ? process_one_work+0x22b/0x720
>  worker_thread+0x4b/0x4f0
>  kthread+0x10f/0x150
>  ? process_one_work+0x720/0x720
>  ? kthread_create_on_node+0x40/0x40
>  ret_from_fork+0x2e/0x40
>
> [CAUSE]
>
> |<------------------ delalloc range --------------------------->|
> | OE 1 | OE 2 | ... | OE n |
> |<>|                       |<---------- cleanup range --------->|
>  ||
>  \_=> First page handled by end_extent_writepage() in __extent_writepage()
>
> The problem is caused by error handler of run_delalloc_range(), which
> doesn't handle any created ordered extents, leaving them waiting on
> btrfs_finish_ordered_io() to finish.
>
> However after run_delalloc_range() returns error, __extent_writepage()
> won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered
> except the first page, and btrfs_finish_ordered_io() won't be triggered
> for created ordered extents either.
>
> So OE 2~n will hang forever, and if OE 1 is larger than one page, it
> will also hang.
>
> [FIX]
> Introduce btrfs_cleanup_ordered_extents() function to cleanup created
> ordered extents and finish them manually.
>
> The function is based on existing
> btrfs_endio_direct_write_update_ordered() function, and modify it to
> act just like btrfs_writepage_endio_hook() but handles specified range
> other than one page.
>
> After fix, delalloc error will be handled like:
>
> |<------------------ delalloc range --------------------------->|
> | OE 1 | OE 2 | ... | OE n |
> |<>|<--------  ----------->|<------ old error handler --------->|
>  ||          ||
>  ||          \_=> Cleaned up by cleanup_ordered_extents()
>  \_=> First page handled by end_extent_writepage() in __extent_writepage()
>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

thanks

> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> v2:
>   Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing
>   outstanding extents, which is already done by
>   extent_clear_unlock_delalloc() with EXTENT_DO_ACCOUNT control bit
> v3:
>   Skip first page to avoid underflow ordered->bytes_left.
>   Fix range passed in cow_file_range() which doesn't cover the whole
>   extent.
>   Expend extent_clear_unlock_delalloc() range to allow them to handle
>   metadata release.
> v4:
>   Don't use extra bit to skip metadata freeing for ordered extent,
>   but only handle btrfs_reloc_clone_csums() error just before processing
>   next extent.
>   This makes error handle much easier for run_delalloc_nocow().
> v5:
>   Variant gramma and comment fixes suggested by Filipe Manana
>   Enhanced commit message to focus on the generic error handler bug,
>   pointed out by Filipe Manana
>   Adding missing wq/func user in __endio_write_update_ordered(), pointed
>   out by Filipe Manana.
>   Enhanced commit message with ascii art to explain the bug more easily.
>   Fix a bug which can leads to corrupted extent allocation, exposed by
>   Filipe Manana.
> v6:
>   Split the metadata underflow fix to another patch.
>   Use better comment and btrfs_cleanup_ordered_extent() implementation
>   from Filipe Manana.
> v7:
>   Add back const prefix
> ---
>  fs/btrfs/inode.c | 63 ++++++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 48 insertions(+), 15 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index fe588bf30d02..d3bc68bbe0e7 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -115,6 +115,31 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len,
>                                        u64 ram_bytes, int compress_type,
>                                        int type);
>
> +static void __endio_write_update_ordered(struct inode *inode,
> +                                        const u64 offset, const u64 bytes,
> +                                        const bool uptodate);
> +
> +/*
> + * Cleanup all submitted ordered extents in specified range to handle errors
> + * from the fill_dellaloc() callback.
> + *
> + * NOTE: caller must ensure that when an error happens, it can not call
> + * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING
> + * and EXTENT_DELALLOC simultaneously, because that causes the reserved metadata
> + * to be released, which we want to happen only when finishing the ordered
> + * extent (btrfs_finish_ordered_io()). Also note that the caller of the
> + * fill_delalloc() callback already does proper cleanup for the first page of
> + * the range, that is, it invokes the callback writepage_end_io_hook() for the
> + * range of the first page.
> + */
> +static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
> +                                                const u64 offset,
> +                                                const u64 bytes)
> +{
> +       return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
> +                                           bytes - PAGE_SIZE, false);
> +}
> +
>  static int btrfs_dirty_inode(struct inode *inode);
>
>  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> @@ -1536,6 +1561,8 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
>                 ret = cow_file_range_async(inode, locked_page, start, end,
>                                            page_started, nr_written);
>         }
> +       if (ret)
> +               btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
>         return ret;
>  }
>
> @@ -8142,17 +8169,26 @@ static void btrfs_endio_direct_read(struct bio *bio)
>         bio_put(bio);
>  }
>
> -static void btrfs_endio_direct_write_update_ordered(struct inode *inode,
> -                                                   const u64 offset,
> -                                                   const u64 bytes,
> -                                                   const int uptodate)
> +static void __endio_write_update_ordered(struct inode *inode,
> +                                        const u64 offset, const u64 bytes,
> +                                        const bool uptodate)
>  {
>         struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>         struct btrfs_ordered_extent *ordered = NULL;
> +       struct btrfs_workqueue *wq;
> +       btrfs_work_func_t func;
>         u64 ordered_offset = offset;
>         u64 ordered_bytes = bytes;
>         int ret;
>
> +       if (btrfs_is_free_space_inode(BTRFS_I(inode))) {
> +               wq = fs_info->endio_freespace_worker;
> +               func = btrfs_freespace_write_helper;
> +       } else {
> +               wq = fs_info->endio_write_workers;
> +               func = btrfs_endio_write_helper;
> +       }
> +
>  again:
>         ret = btrfs_dec_test_first_ordered_pending(inode, &ordered,
>                                                    &ordered_offset,
> @@ -8161,9 +8197,8 @@ static void btrfs_endio_direct_write_update_ordered(struct inode *inode,
>         if (!ret)
>                 goto out_test;
>
> -       btrfs_init_work(&ordered->work, btrfs_endio_write_helper,
> -                       finish_ordered_fn, NULL, NULL);
> -       btrfs_queue_work(fs_info->endio_write_workers, &ordered->work);
> +       btrfs_init_work(&ordered->work, func, finish_ordered_fn, NULL, NULL);
> +       btrfs_queue_work(wq, &ordered->work);
>  out_test:
>         /*
>          * our bio might span multiple ordered extents.  If we haven't
> @@ -8181,10 +8216,8 @@ static void btrfs_endio_direct_write(struct bio *bio)
>         struct btrfs_dio_private *dip = bio->bi_private;
>         struct bio *dio_bio = dip->dio_bio;
>
> -       btrfs_endio_direct_write_update_ordered(dip->inode,
> -                                               dip->logical_offset,
> -                                               dip->bytes,
> -                                               !bio->bi_error);
> +       __endio_write_update_ordered(dip->inode, dip->logical_offset,
> +                                    dip->bytes, !bio->bi_error);
>
>         kfree(dip);
>
> @@ -8545,10 +8578,10 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
>                 io_bio = NULL;
>         } else {
>                 if (write)
> -                       btrfs_endio_direct_write_update_ordered(inode,
> +                       __endio_write_update_ordered(inode,
>                                                 file_offset,
>                                                 dio_bio->bi_iter.bi_size,
> -                                               0);
> +                                               false);
>                 else
>                         unlock_extent(&BTRFS_I(inode)->io_tree, file_offset,
>                               file_offset + dio_bio->bi_iter.bi_size - 1);
> @@ -8683,11 +8716,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>                          */
>                         if (dio_data.unsubmitted_oe_range_start <
>                             dio_data.unsubmitted_oe_range_end)
> -                               btrfs_endio_direct_write_update_ordered(inode,
> +                               __endio_write_update_ordered(inode,
>                                         dio_data.unsubmitted_oe_range_start,
>                                         dio_data.unsubmitted_oe_range_end -
>                                         dio_data.unsubmitted_oe_range_start,
> -                                       0);
> +                                       false);
>                 } else if (ret >= 0 && (size_t)ret < count)
>                         btrfs_delalloc_release_space(inode, offset,
>                                                      count - (size_t)ret);
> --
> 2.12.0
>
>
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  2017-03-08  2:25 [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Qu Wenruo
  2017-03-08  2:25 ` [PATCH v7 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang Qu Wenruo
  2017-03-08 10:18 ` [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Filipe Manana
@ 2017-03-09 17:35 ` Liu Bo
  2017-03-12 20:49 ` Stefan Priebe - Profihost AG
  3 siblings, 0 replies; 15+ messages in thread
From: Liu Bo @ 2017-03-09 17:35 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, fdmanana

On Wed, Mar 08, 2017 at 10:25:51AM +0800, Qu Wenruo wrote:
> [BUG]
> When btrfs_reloc_clone_csum() reports error, it can underflow metadata
> and leads to kernel assertion on outstanding extents in
> run_delalloc_nocow() and cow_file_range().
> 
>  BTRFS info (device vdb5): relocating block group 12582912 flags data
>  BTRFS info (device vdb5): found 1 extents
>  assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858
> 
> Currently, due to another bug blocking ordered extents, the bug is only
> reproducible under certain block group layout and using error injection.
> 
> a) Create one data block group with one 4K extent in it.
>    To avoid the bug that hangs btrfs due to ordered extent which never
>    finishes
> b) Make btrfs_reloc_clone_csum() always fail
> c) Relocate that block group
> 
> [CAUSE]
> run_delalloc_nocow() and cow_file_range() handles error from
> btrfs_reloc_clone_csum() wrongly:
> 
> (The ascii chart shows a more generic case of this bug other than the
> bug mentioned above)
> 
> |<------------------ delalloc range --------------------------->|
> | OE 1 | OE 2 | ... | OE n |
>                     |<----------- cleanup range --------------->|
> |<-----------  ----------->|
>              \/
>  btrfs_finish_ordered_io() range
> 
> So error handler, which calls extent_clear_unlock_delalloc() with
> EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
> will both cover OE n, and free its metadata, causing metadata under flow.
> 
> [Fix]
> The fix is to ensure after calling btrfs_add_ordered_extent(), we only
> call error handler after increasing the iteration offset, so that
> cleanup range won't cover any created ordered extent.
> 
> |<------------------ delalloc range --------------------------->|
> | OE 1 | OE 2 | ... | OE n |
> |<-----------  ----------->|<---------- cleanup range --------->|
>              \/
>  btrfs_finish_ordered_io() range

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Thanks,

-liubo
> 
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
> v6:
>   New, split from v5 patch, as this is a separate bug.
> v7:
>   Fix a wrong operator pointed out by Liu Bo.
>   Test case will follow soon.
> ---
>  fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 39 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index c40060cc481f..fe588bf30d02 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode,
>  		    BTRFS_DATA_RELOC_TREE_OBJECTID) {
>  			ret = btrfs_reloc_clone_csums(inode, start,
>  						      cur_alloc_size);
> +			/*
> +			 * Only drop cache here, and process as normal.
> +			 *
> +			 * We must not allow extent_clear_unlock_delalloc()
> +			 * at out_unlock label to free meta of this ordered
> +			 * extent, as its meta should be freed by
> +			 * btrfs_finish_ordered_io().
> +			 *
> +			 * So we must continue until @start is increased to
> +			 * skip current ordered extent.
> +			 */
>  			if (ret)
> -				goto out_drop_extent_cache;
> +				btrfs_drop_extent_cache(BTRFS_I(inode), start,
> +						start + ram_size - 1, 0);
>  		}
>  
>  		btrfs_dec_block_group_reservations(fs_info, ins.objectid);
>  
> -		if (disk_num_bytes < cur_alloc_size)
> -			break;
> -
>  		/* we're not doing compressed IO, don't unlock the first
>  		 * page (which the caller expects to stay locked), don't
>  		 * clear any dirty bits and don't set any writeback bits
> @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode,
>  					     delalloc_end, locked_page,
>  					     EXTENT_LOCKED | EXTENT_DELALLOC,
>  					     op);
> -		disk_num_bytes -= cur_alloc_size;
> +		if (disk_num_bytes < cur_alloc_size)
> +			disk_num_bytes = 0;
> +		else
> +			disk_num_bytes -= cur_alloc_size;
>  		num_bytes -= cur_alloc_size;
>  		alloc_hint = ins.objectid + ins.offset;
>  		start += cur_alloc_size;
> +
> +		/*
> +		 * btrfs_reloc_clone_csums() error, since start is increased
> +		 * extent_clear_unlock_delalloc() at out_unlock label won't
> +		 * free metadata of current ordered extent, we're OK to exit.
> +		 */
> +		if (ret)
> +			goto out_unlock;
>  	}
>  out:
>  	return ret;
> @@ -1414,15 +1434,14 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>  		BUG_ON(ret); /* -ENOMEM */
>  
>  		if (root->root_key.objectid ==
> -		    BTRFS_DATA_RELOC_TREE_OBJECTID) {
> +		    BTRFS_DATA_RELOC_TREE_OBJECTID)
> +			/*
> +			 * Error handled later, as we must prevent
> +			 * extent_clear_unlock_delalloc() in error handler
> +			 * from freeing metadata of created ordered extent.
> +			 */
>  			ret = btrfs_reloc_clone_csums(inode, cur_offset,
>  						      num_bytes);
> -			if (ret) {
> -				if (!nolock && nocow)
> -					btrfs_end_write_no_snapshoting(root);
> -				goto error;
> -			}
> -		}
>  
>  		extent_clear_unlock_delalloc(inode, cur_offset,
>  					     cur_offset + num_bytes - 1, end,
> @@ -1434,6 +1453,14 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>  		if (!nolock && nocow)
>  			btrfs_end_write_no_snapshoting(root);
>  		cur_offset = extent_end;
> +
> +		/*
> +		 * btrfs_reloc_clone_csums() error, now we're OK to call error
> +		 * handler, as metadata for created ordered extent will only
> +		 * be freed by btrfs_finish_ordered_io().
> +		 */
> +		if (ret)
> +			goto error;
>  		if (cur_offset > end)
>  			break;
>  	}
> -- 
> 2.12.0
> 
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang
  2017-03-08  2:25 ` [PATCH v7 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang Qu Wenruo
  2017-03-08 10:19   ` Filipe Manana
@ 2017-03-09 17:37   ` Liu Bo
  1 sibling, 0 replies; 15+ messages in thread
From: Liu Bo @ 2017-03-09 17:37 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, fdmanana, Filipe Manana

On Wed, Mar 08, 2017 at 10:25:52AM +0800, Qu Wenruo wrote:
> [BUG]
> If run_delalloc_range() returns error and there is already some ordered
> extents created, btrfs will be hanged with the following backtrace:
> 
> Call Trace:
>  __schedule+0x2d4/0xae0
>  schedule+0x3d/0x90
>  btrfs_start_ordered_extent+0x160/0x200 [btrfs]
>  ? wake_atomic_t_function+0x60/0x60
>  btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
>  btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
>  btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
>  process_one_work+0x2af/0x720
>  ? process_one_work+0x22b/0x720
>  worker_thread+0x4b/0x4f0
>  kthread+0x10f/0x150
>  ? process_one_work+0x720/0x720
>  ? kthread_create_on_node+0x40/0x40
>  ret_from_fork+0x2e/0x40
> 
> [CAUSE]
> 
> |<------------------ delalloc range --------------------------->|
> | OE 1 | OE 2 | ... | OE n |
> |<>|                       |<---------- cleanup range --------->|
>  ||
>  \_=> First page handled by end_extent_writepage() in __extent_writepage()
> 
> The problem is caused by error handler of run_delalloc_range(), which
> doesn't handle any created ordered extents, leaving them waiting on
> btrfs_finish_ordered_io() to finish.
> 
> However after run_delalloc_range() returns error, __extent_writepage()
> won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered
> except the first page, and btrfs_finish_ordered_io() won't be triggered
> for created ordered extents either.
> 
> So OE 2~n will hang forever, and if OE 1 is larger than one page, it
> will also hang.
> 
> [FIX]
> Introduce btrfs_cleanup_ordered_extents() function to cleanup created
> ordered extents and finish them manually.
> 
> The function is based on existing
> btrfs_endio_direct_write_update_ordered() function, and modify it to
> act just like btrfs_writepage_endio_hook() but handles specified range
> other than one page.
> 
> After fix, delalloc error will be handled like:
> 
> |<------------------ delalloc range --------------------------->|
> | OE 1 | OE 2 | ... | OE n |
> |<>|<--------  ----------->|<------ old error handler --------->|
>  ||          ||
>  ||          \_=> Cleaned up by cleanup_ordered_extents()
>  \_=> First page handled by end_extent_writepage() in __extent_writepage()

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Thanks,

-liubo
> 
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> v2:
>   Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing
>   outstanding extents, which is already done by
>   extent_clear_unlock_delalloc() with EXTENT_DO_ACCOUNT control bit
> v3:
>   Skip first page to avoid underflow ordered->bytes_left.
>   Fix range passed in cow_file_range() which doesn't cover the whole
>   extent.
>   Expend extent_clear_unlock_delalloc() range to allow them to handle
>   metadata release.
> v4:
>   Don't use extra bit to skip metadata freeing for ordered extent,
>   but only handle btrfs_reloc_clone_csums() error just before processing
>   next extent.
>   This makes error handle much easier for run_delalloc_nocow().
> v5:
>   Variant gramma and comment fixes suggested by Filipe Manana
>   Enhanced commit message to focus on the generic error handler bug,
>   pointed out by Filipe Manana
>   Adding missing wq/func user in __endio_write_update_ordered(), pointed
>   out by Filipe Manana.
>   Enhanced commit message with ascii art to explain the bug more easily.
>   Fix a bug which can leads to corrupted extent allocation, exposed by
>   Filipe Manana.
> v6:
>   Split the metadata underflow fix to another patch.
>   Use better comment and btrfs_cleanup_ordered_extent() implementation
>   from Filipe Manana.
> v7:
>   Add back const prefix
> ---
>  fs/btrfs/inode.c | 63 ++++++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 48 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index fe588bf30d02..d3bc68bbe0e7 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -115,6 +115,31 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len,
>  				       u64 ram_bytes, int compress_type,
>  				       int type);
>  
> +static void __endio_write_update_ordered(struct inode *inode,
> +					 const u64 offset, const u64 bytes,
> +					 const bool uptodate);
> +
> +/*
> + * Cleanup all submitted ordered extents in specified range to handle errors
> + * from the fill_dellaloc() callback.
> + *
> + * NOTE: caller must ensure that when an error happens, it can not call
> + * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING
> + * and EXTENT_DELALLOC simultaneously, because that causes the reserved metadata
> + * to be released, which we want to happen only when finishing the ordered
> + * extent (btrfs_finish_ordered_io()). Also note that the caller of the
> + * fill_delalloc() callback already does proper cleanup for the first page of
> + * the range, that is, it invokes the callback writepage_end_io_hook() for the
> + * range of the first page.
> + */
> +static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
> +						 const u64 offset,
> +						 const u64 bytes)
> +{
> +	return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
> +					    bytes - PAGE_SIZE, false);
> +}
> +
>  static int btrfs_dirty_inode(struct inode *inode);
>  
>  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> @@ -1536,6 +1561,8 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
>  		ret = cow_file_range_async(inode, locked_page, start, end,
>  					   page_started, nr_written);
>  	}
> +	if (ret)
> +		btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
>  	return ret;
>  }
>  
> @@ -8142,17 +8169,26 @@ static void btrfs_endio_direct_read(struct bio *bio)
>  	bio_put(bio);
>  }
>  
> -static void btrfs_endio_direct_write_update_ordered(struct inode *inode,
> -						    const u64 offset,
> -						    const u64 bytes,
> -						    const int uptodate)
> +static void __endio_write_update_ordered(struct inode *inode,
> +					 const u64 offset, const u64 bytes,
> +					 const bool uptodate)
>  {
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	struct btrfs_ordered_extent *ordered = NULL;
> +	struct btrfs_workqueue *wq;
> +	btrfs_work_func_t func;
>  	u64 ordered_offset = offset;
>  	u64 ordered_bytes = bytes;
>  	int ret;
>  
> +	if (btrfs_is_free_space_inode(BTRFS_I(inode))) {
> +		wq = fs_info->endio_freespace_worker;
> +		func = btrfs_freespace_write_helper;
> +	} else {
> +		wq = fs_info->endio_write_workers;
> +		func = btrfs_endio_write_helper;
> +	}
> +
>  again:
>  	ret = btrfs_dec_test_first_ordered_pending(inode, &ordered,
>  						   &ordered_offset,
> @@ -8161,9 +8197,8 @@ static void btrfs_endio_direct_write_update_ordered(struct inode *inode,
>  	if (!ret)
>  		goto out_test;
>  
> -	btrfs_init_work(&ordered->work, btrfs_endio_write_helper,
> -			finish_ordered_fn, NULL, NULL);
> -	btrfs_queue_work(fs_info->endio_write_workers, &ordered->work);
> +	btrfs_init_work(&ordered->work, func, finish_ordered_fn, NULL, NULL);
> +	btrfs_queue_work(wq, &ordered->work);
>  out_test:
>  	/*
>  	 * our bio might span multiple ordered extents.  If we haven't
> @@ -8181,10 +8216,8 @@ static void btrfs_endio_direct_write(struct bio *bio)
>  	struct btrfs_dio_private *dip = bio->bi_private;
>  	struct bio *dio_bio = dip->dio_bio;
>  
> -	btrfs_endio_direct_write_update_ordered(dip->inode,
> -						dip->logical_offset,
> -						dip->bytes,
> -						!bio->bi_error);
> +	__endio_write_update_ordered(dip->inode, dip->logical_offset,
> +				     dip->bytes, !bio->bi_error);
>  
>  	kfree(dip);
>  
> @@ -8545,10 +8578,10 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
>  		io_bio = NULL;
>  	} else {
>  		if (write)
> -			btrfs_endio_direct_write_update_ordered(inode,
> +			__endio_write_update_ordered(inode,
>  						file_offset,
>  						dio_bio->bi_iter.bi_size,
> -						0);
> +						false);
>  		else
>  			unlock_extent(&BTRFS_I(inode)->io_tree, file_offset,
>  			      file_offset + dio_bio->bi_iter.bi_size - 1);
> @@ -8683,11 +8716,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>  			 */
>  			if (dio_data.unsubmitted_oe_range_start <
>  			    dio_data.unsubmitted_oe_range_end)
> -				btrfs_endio_direct_write_update_ordered(inode,
> +				__endio_write_update_ordered(inode,
>  					dio_data.unsubmitted_oe_range_start,
>  					dio_data.unsubmitted_oe_range_end -
>  					dio_data.unsubmitted_oe_range_start,
> -					0);
> +					false);
>  		} else if (ret >= 0 && (size_t)ret < count)
>  			btrfs_delalloc_release_space(inode, offset,
>  						     count - (size_t)ret);
> -- 
> 2.12.0
> 
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  2017-03-08  2:25 [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Qu Wenruo
                   ` (2 preceding siblings ...)
  2017-03-09 17:35 ` Liu Bo
@ 2017-03-12 20:49 ` Stefan Priebe - Profihost AG
  2017-03-13  1:16   ` Qu Wenruo
  3 siblings, 1 reply; 15+ messages in thread
From: Stefan Priebe - Profihost AG @ 2017-03-12 20:49 UTC (permalink / raw)
  To: quwenruo, linux-btrfs, fdmanana

Hi Qu,

while V5 was running fine against the openSUSE-42.2 kernel (based on v4.4).

V7 results in OOPS to me:
BUG: unable to handle kernel NULL pointer dereference at 00000000000001f0
IP: [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs]
PGD 14e18d4067 PUD 14e1868067 PMD 0
Oops: 0000 [#1] SMP
Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4
xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set
nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq ata_generic
virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci
i2c_core usb_common ata_piix floppy
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.7.5-20140722_172050-sagunt 04/01/2014
task: ffffffffb4e0f500 ti: ffffffffb4e00000 task.ti: ffffffffb4e00000
RIP: 0010:[<ffffffffc03dde23>] [<ffffffffc03dde23>]
__endio_write_update_ordered+0x33/0x140 [btrfs]
RSP: 0018:ffff8814eae03cd8 EFLAGS: 00010086
RAX: 0000000000000000 RBX: ffff8814e8fd5aa8 RCX: 0000000000000001
RDX: 0000000000100000 RSI: 0000000000100000 RDI: ffff8814e45885c0
RBP: ffff8814eae03d10 R08: ffff8814e8334000 R09: 000000018040003a
R10: ffffea00507d8d00 R11: ffff88141f634080 R12: ffff8814e45885c0
R13: ffff8814e125d700 R14: 0000000000100000 R15: ffff8800376c6a80
FS: 0000000000000000(0000) GS:ffff8814eae00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000001f0 CR3: 00000014e34c9000 CR4: 00000000001406f0Stack:
0000000000000000 0000000000100000 ffff8814e8fd5aa8 ffff8814e953f3c0
ffff8814e125d700 0000000000100000 ffff8800376c6a80 ffff8814eae03d38
ffffffffc03ddf67 ffff8814e86b6a80 ffff8814e8fd5aa8 0000000000000001
Call Trace:
[<ffffffffc03ddf67>] btrfs_endio_direct_write+0x37/0x60 [btrfs]
[<ffffffffb438f2f7>] bio_endio+0x57/0x60
[<ffffffffc04082c1>] btrfs_end_bio+0xa1/0x140 [btrfs]
[<ffffffffb438f2f7>] bio_endio+0x57/0x60
[<ffffffffb439763b>] blk_update_request+0x8b/0x330
[<ffffffffb43a05ba>] blk_mq_end_request+0x1a/0x70
[<ffffffffc039f30f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
[<ffffffffb43a0688>] __blk_mq_complete_request+0x78/0xe0
[<ffffffffb43a070c>] blk_mq_complete_request+0x1c/0x20
[<ffffffffc039f184>] virtblk_done+0x64/0xe0 [virtio_blk]
[<ffffffffb446dd2a>] vring_interrupt+0x3a/0x90
[<ffffffffb40d3fe9>] __handle_irq_event_percpu+0x89/0x1b0
[<ffffffffb40d4133>] handle_irq_event_percpu+0x23/0x60
[<ffffffffb40d41ab>] handle_irq_event+0x3b/0x60
[<ffffffffb40d74ef>] handle_edge_irq+0x6f/0x150
[<ffffffffb4007cad>] handle_irq+0x1d/0x30
[<ffffffffb400750b>] do_IRQ+0x4b/0xd0
[<ffffffffb46af8cc>] common_interrupt+0x8c/0x8c
DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
Leftover inexact backtrace:
2017-03-12 20:33:08     <IRQ><EOI>
2017-03-12 20:33:08      [<ffffffffb404ba46>] ? native_safe_halt+0x6/0x10
[<ffffffffb400fa3e>] default_idle+0x1e/0xe0
[<ffffffffb401021f>] arch_cpu_idle+0xf/0x20
[<ffffffffb40c67eb>] default_idle_call+0x3b/0x40
[<ffffffffb40c6a8a>] cpu_startup_entry+0x29a/0x370
[<ffffffffb46a358c>] rest_init+0x7c/0x80
[<ffffffffb4f67fa5>] start_kernel+0x490/0x49d
[<ffffffffb4f67120>] ? early_idt_handler_array+0x120/0x120
[<ffffffffb4f674b3>] x86_64_start_reservations+0x2a/0x2c
[<ffffffffb4f675f0>] x86_64_start_kernel+0x13b/0x14a
Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 8b 87 70 fc
ff ff 4c 8b 87 38 fe ff ff 48 c7 45 c8 00 00 00 00 48 89 75 d0 <48> 8b
b8 f0 01 00 00 48 3b 47 28 49 8b 84 24 78 fc ff ff 0f 84
RIP [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs]
RSP <ffff8814eae03cd8>
CR2: 00000000000001f0
---[ end trace 7529a0652fd7873e ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x33000000 from 0xffffffff81000000 (relocation range:
0xffffffff80000000-0xffffffffbfffffff)

Greets,
Stefan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  2017-03-12 20:49 ` Stefan Priebe - Profihost AG
@ 2017-03-13  1:16   ` Qu Wenruo
  2017-03-13  7:26     ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2017-03-13  1:16 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG, linux-btrfs, fdmanana



At 03/13/2017 04:49 AM, Stefan Priebe - Profihost AG wrote:
> Hi Qu,
>
> while V5 was running fine against the openSUSE-42.2 kernel (based on v4.4).

Thanks for the test.

>
> V7 results in OOPS to me:
> BUG: unable to handle kernel NULL pointer dereference at 00000000000001f0

This 0x1f0 is the same as offsetof(struct brrfs_root, fs_info), quite 
nice clue.

> IP: [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs]

IP points to:
---
static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode)
{
         struct btrfs_root *root = inode->root; << Either here

         if (root == root->fs_info->tree_root && << Or here
             btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID)

---

Taking the above offset into consideration, it's only possible for later 
case.

So here, we have a btrfs_inode whose @root is NULL.

This can be fixed easily by checking @root inside 
btrfs_is_free_space_inode(), as the backtrace shows that it's only 
happening for DirectIO, and it won't happen for free space cache inode.

But I'm more curious how this happened for a more accurate fix, or we 
could have other NULL pointer access.

Did you have any reproducer for this?

Thanks,
Qu

> PGD 14e18d4067 PUD 14e1868067 PMD 0
> Oops: 0000 [#1] SMP
> Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4
> xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set
> nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq ata_generic
> virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci
> i2c_core usb_common ata_piix floppy
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> 1.7.5-20140722_172050-sagunt 04/01/2014
> task: ffffffffb4e0f500 ti: ffffffffb4e00000 task.ti: ffffffffb4e00000
> RIP: 0010:[<ffffffffc03dde23>] [<ffffffffc03dde23>]
> __endio_write_update_ordered+0x33/0x140 [btrfs]
> RSP: 0018:ffff8814eae03cd8 EFLAGS: 00010086
> RAX: 0000000000000000 RBX: ffff8814e8fd5aa8 RCX: 0000000000000001
> RDX: 0000000000100000 RSI: 0000000000100000 RDI: ffff8814e45885c0
> RBP: ffff8814eae03d10 R08: ffff8814e8334000 R09: 000000018040003a
> R10: ffffea00507d8d00 R11: ffff88141f634080 R12: ffff8814e45885c0
> R13: ffff8814e125d700 R14: 0000000000100000 R15: ffff8800376c6a80
> FS: 0000000000000000(0000) GS:ffff8814eae00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000000001f0 CR3: 00000014e34c9000 CR4: 00000000001406f0Stack:
> 0000000000000000 0000000000100000 ffff8814e8fd5aa8 ffff8814e953f3c0
> ffff8814e125d700 0000000000100000 ffff8800376c6a80 ffff8814eae03d38
> ffffffffc03ddf67 ffff8814e86b6a80 ffff8814e8fd5aa8 0000000000000001
> Call Trace:
> [<ffffffffc03ddf67>] btrfs_endio_direct_write+0x37/0x60 [btrfs]
> [<ffffffffb438f2f7>] bio_endio+0x57/0x60
> [<ffffffffc04082c1>] btrfs_end_bio+0xa1/0x140 [btrfs]
> [<ffffffffb438f2f7>] bio_endio+0x57/0x60
> [<ffffffffb439763b>] blk_update_request+0x8b/0x330
> [<ffffffffb43a05ba>] blk_mq_end_request+0x1a/0x70
> [<ffffffffc039f30f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
> [<ffffffffb43a0688>] __blk_mq_complete_request+0x78/0xe0
> [<ffffffffb43a070c>] blk_mq_complete_request+0x1c/0x20
> [<ffffffffc039f184>] virtblk_done+0x64/0xe0 [virtio_blk]
> [<ffffffffb446dd2a>] vring_interrupt+0x3a/0x90
> [<ffffffffb40d3fe9>] __handle_irq_event_percpu+0x89/0x1b0
> [<ffffffffb40d4133>] handle_irq_event_percpu+0x23/0x60
> [<ffffffffb40d41ab>] handle_irq_event+0x3b/0x60
> [<ffffffffb40d74ef>] handle_edge_irq+0x6f/0x150
> [<ffffffffb4007cad>] handle_irq+0x1d/0x30
> [<ffffffffb400750b>] do_IRQ+0x4b/0xd0
> [<ffffffffb46af8cc>] common_interrupt+0x8c/0x8c
> DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
> Leftover inexact backtrace:
> 2017-03-12 20:33:08     <IRQ><EOI>
> 2017-03-12 20:33:08      [<ffffffffb404ba46>] ? native_safe_halt+0x6/0x10
> [<ffffffffb400fa3e>] default_idle+0x1e/0xe0
> [<ffffffffb401021f>] arch_cpu_idle+0xf/0x20
> [<ffffffffb40c67eb>] default_idle_call+0x3b/0x40
> [<ffffffffb40c6a8a>] cpu_startup_entry+0x29a/0x370
> [<ffffffffb46a358c>] rest_init+0x7c/0x80
> [<ffffffffb4f67fa5>] start_kernel+0x490/0x49d
> [<ffffffffb4f67120>] ? early_idt_handler_array+0x120/0x120
> [<ffffffffb4f674b3>] x86_64_start_reservations+0x2a/0x2c
> [<ffffffffb4f675f0>] x86_64_start_kernel+0x13b/0x14a
> Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 8b 87 70 fc
> ff ff 4c 8b 87 38 fe ff ff 48 c7 45 c8 00 00 00 00 48 89 75 d0 <48> 8b
> b8 f0 01 00 00 48 3b 47 28 49 8b 84 24 78 fc ff ff 0f 84
> RIP [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs]
> RSP <ffff8814eae03cd8>
> CR2: 00000000000001f0
> ---[ end trace 7529a0652fd7873e ]---
> Kernel panic - not syncing: Fatal exception in interrupt
> Kernel Offset: 0x33000000 from 0xffffffff81000000 (relocation range:
> 0xffffffff80000000-0xffffffffbfffffff)
>
> Greets,
> Stefan
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  2017-03-13  1:16   ` Qu Wenruo
@ 2017-03-13  7:26     ` Stefan Priebe - Profihost AG
  2017-03-13  7:39       ` Qu Wenruo
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Priebe - Profihost AG @ 2017-03-13  7:26 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs, fdmanana

Hi Qu,

Am 13.03.2017 um 02:16 schrieb Qu Wenruo:
> 
> At 03/13/2017 04:49 AM, Stefan Priebe - Profihost AG wrote:
>> Hi Qu,
>>
>> while V5 was running fine against the openSUSE-42.2 kernel (based on
>> v4.4).
> 
> Thanks for the test.
> 
>> V7 results in OOPS to me:
>> BUG: unable to handle kernel NULL pointer dereference at 00000000000001f0
> 
> This 0x1f0 is the same as offsetof(struct brrfs_root, fs_info), quite
> nice clue.
> 
>> IP: [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs]
> 
> IP points to:
> ---
> static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode)
> {
>         struct btrfs_root *root = inode->root; << Either here
> 
>         if (root == root->fs_info->tree_root && << Or here
>             btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID)
> 
> ---
> 
> Taking the above offset into consideration, it's only possible for later
> case.
> 
> So here, we have a btrfs_inode whose @root is NULL.

But wasn't this part of the code identical in V5? Why does it only
happen with V7?

> This can be fixed easily by checking @root inside
> btrfs_is_free_space_inode(), as the backtrace shows that it's only
> happening for DirectIO, and it won't happen for free space cache inode.
> 
> But I'm more curious how this happened for a more accurate fix, or we
> could have other NULL pointer access.
> 
> Did you have any reproducer for this?

Sorry no - this is a production MariaDB Server running btrfs with
compress-force=zlib. But if i could test anything i'll do.

Greets,
Stefan

> 
> Thanks,
> Qu
> 
>> PGD 14e18d4067 PUD 14e1868067 PMD 0
>> Oops: 0000 [#1] SMP
>> Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4
>> xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set
>> nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq ata_generic
>> virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci
>> i2c_core usb_common ata_piix floppy
>> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> 1.7.5-20140722_172050-sagunt 04/01/2014
>> task: ffffffffb4e0f500 ti: ffffffffb4e00000 task.ti: ffffffffb4e00000
>> RIP: 0010:[<ffffffffc03dde23>] [<ffffffffc03dde23>]
>> __endio_write_update_ordered+0x33/0x140 [btrfs]
>> RSP: 0018:ffff8814eae03cd8 EFLAGS: 00010086
>> RAX: 0000000000000000 RBX: ffff8814e8fd5aa8 RCX: 0000000000000001
>> RDX: 0000000000100000 RSI: 0000000000100000 RDI: ffff8814e45885c0
>> RBP: ffff8814eae03d10 R08: ffff8814e8334000 R09: 000000018040003a
>> R10: ffffea00507d8d00 R11: ffff88141f634080 R12: ffff8814e45885c0
>> R13: ffff8814e125d700 R14: 0000000000100000 R15: ffff8800376c6a80
>> FS: 0000000000000000(0000) GS:ffff8814eae00000(0000)
>> knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00000000000001f0 CR3: 00000014e34c9000 CR4: 00000000001406f0Stack:
>> 0000000000000000 0000000000100000 ffff8814e8fd5aa8 ffff8814e953f3c0
>> ffff8814e125d700 0000000000100000 ffff8800376c6a80 ffff8814eae03d38
>> ffffffffc03ddf67 ffff8814e86b6a80 ffff8814e8fd5aa8 0000000000000001
>> Call Trace:
>> [<ffffffffc03ddf67>] btrfs_endio_direct_write+0x37/0x60 [btrfs]
>> [<ffffffffb438f2f7>] bio_endio+0x57/0x60
>> [<ffffffffc04082c1>] btrfs_end_bio+0xa1/0x140 [btrfs]
>> [<ffffffffb438f2f7>] bio_endio+0x57/0x60
>> [<ffffffffb439763b>] blk_update_request+0x8b/0x330
>> [<ffffffffb43a05ba>] blk_mq_end_request+0x1a/0x70
>> [<ffffffffc039f30f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
>> [<ffffffffb43a0688>] __blk_mq_complete_request+0x78/0xe0
>> [<ffffffffb43a070c>] blk_mq_complete_request+0x1c/0x20
>> [<ffffffffc039f184>] virtblk_done+0x64/0xe0 [virtio_blk]
>> [<ffffffffb446dd2a>] vring_interrupt+0x3a/0x90
>> [<ffffffffb40d3fe9>] __handle_irq_event_percpu+0x89/0x1b0
>> [<ffffffffb40d4133>] handle_irq_event_percpu+0x23/0x60
>> [<ffffffffb40d41ab>] handle_irq_event+0x3b/0x60
>> [<ffffffffb40d74ef>] handle_edge_irq+0x6f/0x150
>> [<ffffffffb4007cad>] handle_irq+0x1d/0x30
>> [<ffffffffb400750b>] do_IRQ+0x4b/0xd0
>> [<ffffffffb46af8cc>] common_interrupt+0x8c/0x8c
>> DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
>> Leftover inexact backtrace:
>> 2017-03-12 20:33:08     <IRQ><EOI>
>> 2017-03-12 20:33:08      [<ffffffffb404ba46>] ? native_safe_halt+0x6/0x10
>> [<ffffffffb400fa3e>] default_idle+0x1e/0xe0
>> [<ffffffffb401021f>] arch_cpu_idle+0xf/0x20
>> [<ffffffffb40c67eb>] default_idle_call+0x3b/0x40
>> [<ffffffffb40c6a8a>] cpu_startup_entry+0x29a/0x370
>> [<ffffffffb46a358c>] rest_init+0x7c/0x80
>> [<ffffffffb4f67fa5>] start_kernel+0x490/0x49d
>> [<ffffffffb4f67120>] ? early_idt_handler_array+0x120/0x120
>> [<ffffffffb4f674b3>] x86_64_start_reservations+0x2a/0x2c
>> [<ffffffffb4f675f0>] x86_64_start_kernel+0x13b/0x14a
>> Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 8b 87 70 fc
>> ff ff 4c 8b 87 38 fe ff ff 48 c7 45 c8 00 00 00 00 48 89 75 d0 <48> 8b
>> b8 f0 01 00 00 48 3b 47 28 49 8b 84 24 78 fc ff ff 0f 84
>> RIP [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs]
>> RSP <ffff8814eae03cd8>
>> CR2: 00000000000001f0
>> ---[ end trace 7529a0652fd7873e ]---
>> Kernel panic - not syncing: Fatal exception in interrupt
>> Kernel Offset: 0x33000000 from 0xffffffff81000000 (relocation range:
>> 0xffffffff80000000-0xffffffffbfffffff)
>>
>> Greets,
>> Stefan
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  2017-03-13  7:26     ` Stefan Priebe - Profihost AG
@ 2017-03-13  7:39       ` Qu Wenruo
  2017-03-13 13:26         ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2017-03-13  7:39 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG, linux-btrfs, fdmanana



At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote:
> Hi Qu,
>
> Am 13.03.2017 um 02:16 schrieb Qu Wenruo:
>>
>> At 03/13/2017 04:49 AM, Stefan Priebe - Profihost AG wrote:
>>> Hi Qu,
>>>
>>> while V5 was running fine against the openSUSE-42.2 kernel (based on
>>> v4.4).
>>
>> Thanks for the test.
>>
>>> V7 results in OOPS to me:
>>> BUG: unable to handle kernel NULL pointer dereference at 00000000000001f0
>>
>> This 0x1f0 is the same as offsetof(struct brrfs_root, fs_info), quite
>> nice clue.
>>
>>> IP: [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs]
>>
>> IP points to:
>> ---
>> static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode)
>> {
>>         struct btrfs_root *root = inode->root; << Either here
>>
>>         if (root == root->fs_info->tree_root && << Or here
>>             btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID)
>>
>> ---
>>
>> Taking the above offset into consideration, it's only possible for later
>> case.
>>
>> So here, we have a btrfs_inode whose @root is NULL.
>
> But wasn't this part of the code identical in V5? Why does it only
> happen with V7?

There are still difference, but just as you said, the related 
part(checking if inode is free space cache inode) is identical across v5 
and v7.


I'm afraid that's a rare race leading to NULL btrfs_inode->root, which 
could happen in both v5 and v7.

What's the difference between SUSE and mainline kernel?
Maybe some mainline kernel commits have already fixed it?

Thanks,
Qu
>
>> This can be fixed easily by checking @root inside
>> btrfs_is_free_space_inode(), as the backtrace shows that it's only
>> happening for DirectIO, and it won't happen for free space cache inode.
>>
>> But I'm more curious how this happened for a more accurate fix, or we
>> could have other NULL pointer access.
>>
>> Did you have any reproducer for this?
>
> Sorry no - this is a production MariaDB Server running btrfs with
> compress-force=zlib. But if i could test anything i'll do.
>
> Greets,
> Stefan
>
>>
>> Thanks,
>> Qu
>>
>>> PGD 14e18d4067 PUD 14e1868067 PMD 0
>>> Oops: 0000 [#1] SMP
>>> Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4
>>> xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set
>>> nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq ata_generic
>>> virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci
>>> i2c_core usb_common ata_piix floppy
>>> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1
>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>>> 1.7.5-20140722_172050-sagunt 04/01/2014
>>> task: ffffffffb4e0f500 ti: ffffffffb4e00000 task.ti: ffffffffb4e00000
>>> RIP: 0010:[<ffffffffc03dde23>] [<ffffffffc03dde23>]
>>> __endio_write_update_ordered+0x33/0x140 [btrfs]
>>> RSP: 0018:ffff8814eae03cd8 EFLAGS: 00010086
>>> RAX: 0000000000000000 RBX: ffff8814e8fd5aa8 RCX: 0000000000000001
>>> RDX: 0000000000100000 RSI: 0000000000100000 RDI: ffff8814e45885c0
>>> RBP: ffff8814eae03d10 R08: ffff8814e8334000 R09: 000000018040003a
>>> R10: ffffea00507d8d00 R11: ffff88141f634080 R12: ffff8814e45885c0
>>> R13: ffff8814e125d700 R14: 0000000000100000 R15: ffff8800376c6a80
>>> FS: 0000000000000000(0000) GS:ffff8814eae00000(0000)
>>> knlGS:0000000000000000
>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 00000000000001f0 CR3: 00000014e34c9000 CR4: 00000000001406f0Stack:
>>> 0000000000000000 0000000000100000 ffff8814e8fd5aa8 ffff8814e953f3c0
>>> ffff8814e125d700 0000000000100000 ffff8800376c6a80 ffff8814eae03d38
>>> ffffffffc03ddf67 ffff8814e86b6a80 ffff8814e8fd5aa8 0000000000000001
>>> Call Trace:
>>> [<ffffffffc03ddf67>] btrfs_endio_direct_write+0x37/0x60 [btrfs]
>>> [<ffffffffb438f2f7>] bio_endio+0x57/0x60
>>> [<ffffffffc04082c1>] btrfs_end_bio+0xa1/0x140 [btrfs]
>>> [<ffffffffb438f2f7>] bio_endio+0x57/0x60
>>> [<ffffffffb439763b>] blk_update_request+0x8b/0x330
>>> [<ffffffffb43a05ba>] blk_mq_end_request+0x1a/0x70
>>> [<ffffffffc039f30f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
>>> [<ffffffffb43a0688>] __blk_mq_complete_request+0x78/0xe0
>>> [<ffffffffb43a070c>] blk_mq_complete_request+0x1c/0x20
>>> [<ffffffffc039f184>] virtblk_done+0x64/0xe0 [virtio_blk]
>>> [<ffffffffb446dd2a>] vring_interrupt+0x3a/0x90
>>> [<ffffffffb40d3fe9>] __handle_irq_event_percpu+0x89/0x1b0
>>> [<ffffffffb40d4133>] handle_irq_event_percpu+0x23/0x60
>>> [<ffffffffb40d41ab>] handle_irq_event+0x3b/0x60
>>> [<ffffffffb40d74ef>] handle_edge_irq+0x6f/0x150
>>> [<ffffffffb4007cad>] handle_irq+0x1d/0x30
>>> [<ffffffffb400750b>] do_IRQ+0x4b/0xd0
>>> [<ffffffffb46af8cc>] common_interrupt+0x8c/0x8c
>>> DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
>>> Leftover inexact backtrace:
>>> 2017-03-12 20:33:08     <IRQ><EOI>
>>> 2017-03-12 20:33:08      [<ffffffffb404ba46>] ? native_safe_halt+0x6/0x10
>>> [<ffffffffb400fa3e>] default_idle+0x1e/0xe0
>>> [<ffffffffb401021f>] arch_cpu_idle+0xf/0x20
>>> [<ffffffffb40c67eb>] default_idle_call+0x3b/0x40
>>> [<ffffffffb40c6a8a>] cpu_startup_entry+0x29a/0x370
>>> [<ffffffffb46a358c>] rest_init+0x7c/0x80
>>> [<ffffffffb4f67fa5>] start_kernel+0x490/0x49d
>>> [<ffffffffb4f67120>] ? early_idt_handler_array+0x120/0x120
>>> [<ffffffffb4f674b3>] x86_64_start_reservations+0x2a/0x2c
>>> [<ffffffffb4f675f0>] x86_64_start_kernel+0x13b/0x14a
>>> Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 8b 87 70 fc
>>> ff ff 4c 8b 87 38 fe ff ff 48 c7 45 c8 00 00 00 00 48 89 75 d0 <48> 8b
>>> b8 f0 01 00 00 48 3b 47 28 49 8b 84 24 78 fc ff ff 0f 84
>>> RIP [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs]
>>> RSP <ffff8814eae03cd8>
>>> CR2: 00000000000001f0
>>> ---[ end trace 7529a0652fd7873e ]---
>>> Kernel panic - not syncing: Fatal exception in interrupt
>>> Kernel Offset: 0x33000000 from 0xffffffff81000000 (relocation range:
>>> 0xffffffff80000000-0xffffffffbfffffff)
>>>
>>> Greets,
>>> Stefan
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>
>>
>
>



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  2017-03-13  7:39       ` Qu Wenruo
@ 2017-03-13 13:26         ` Stefan Priebe - Profihost AG
  2017-03-14  0:30           ` Qu Wenruo
  2017-03-14  2:50           ` Qu Wenruo
  0 siblings, 2 replies; 15+ messages in thread
From: Stefan Priebe - Profihost AG @ 2017-03-13 13:26 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs, fdmanana


Am 13.03.2017 um 08:39 schrieb Qu Wenruo:
> 
> 
> At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote:
>> Hi Qu,
>>
>> Am 13.03.2017 um 02:16 schrieb Qu Wenruo:
>>>
>>> At 03/13/2017 04:49 AM, Stefan Priebe - Profihost AG wrote:
>>>> Hi Qu,
>>>>
>>>> while V5 was running fine against the openSUSE-42.2 kernel (based on
>>>> v4.4).
>>>
>>> Thanks for the test.
>>>
>>>> V7 results in OOPS to me:
>>>> BUG: unable to handle kernel NULL pointer dereference at
>>>> 00000000000001f0
>>>
>>> This 0x1f0 is the same as offsetof(struct brrfs_root, fs_info), quite
>>> nice clue.
>>>
>>>> IP: [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140
>>>> [btrfs]
>>>
>>> IP points to:
>>> ---
>>> static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode)
>>> {
>>>         struct btrfs_root *root = inode->root; << Either here
>>>
>>>         if (root == root->fs_info->tree_root && << Or here
>>>             btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID)
>>>
>>> ---
>>>
>>> Taking the above offset into consideration, it's only possible for later
>>> case.
>>>
>>> So here, we have a btrfs_inode whose @root is NULL.
>>
>> But wasn't this part of the code identical in V5? Why does it only
>> happen with V7?
> 
> There are still difference, but just as you said, the related
> part(checking if inode is free space cache inode) is identical across v5
> and v7.

But if i boot v7 it always happens. If i boot v5 it always works. Have
done 5 repeatet tests.

> I'm afraid that's a rare race leading to NULL btrfs_inode->root, which
> could happen in both v5 and v7.
> 
> What's the difference between SUSE and mainline kernel?

A lot ;-) But i don't think anything related.

> Maybe some mainline kernel commits have already fixed it?

May be no idea. But i haven't found any reason why v5 works.

Stefan

> 
> Thanks,
> Qu
>>
>>> This can be fixed easily by checking @root inside
>>> btrfs_is_free_space_inode(), as the backtrace shows that it's only
>>> happening for DirectIO, and it won't happen for free space cache inode.
>>>
>>> But I'm more curious how this happened for a more accurate fix, or we
>>> could have other NULL pointer access.
>>>
>>> Did you have any reproducer for this?
>>
>> Sorry no - this is a production MariaDB Server running btrfs with
>> compress-force=zlib. But if i could test anything i'll do.
>>
>> Greets,
>> Stefan
>>
>>>
>>> Thanks,
>>> Qu
>>>
>>>> PGD 14e18d4067 PUD 14e1868067 PMD 0
>>>> Oops: 0000 [#1] SMP
>>>> Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4
>>>> xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set
>>>> nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq
>>>> ata_generic
>>>> virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci
>>>> i2c_core usb_common ata_piix floppy
>>>> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1
>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>>>> 1.7.5-20140722_172050-sagunt 04/01/2014
>>>> task: ffffffffb4e0f500 ti: ffffffffb4e00000 task.ti: ffffffffb4e00000
>>>> RIP: 0010:[<ffffffffc03dde23>] [<ffffffffc03dde23>]
>>>> __endio_write_update_ordered+0x33/0x140 [btrfs]
>>>> RSP: 0018:ffff8814eae03cd8 EFLAGS: 00010086
>>>> RAX: 0000000000000000 RBX: ffff8814e8fd5aa8 RCX: 0000000000000001
>>>> RDX: 0000000000100000 RSI: 0000000000100000 RDI: ffff8814e45885c0
>>>> RBP: ffff8814eae03d10 R08: ffff8814e8334000 R09: 000000018040003a
>>>> R10: ffffea00507d8d00 R11: ffff88141f634080 R12: ffff8814e45885c0
>>>> R13: ffff8814e125d700 R14: 0000000000100000 R15: ffff8800376c6a80
>>>> FS: 0000000000000000(0000) GS:ffff8814eae00000(0000)
>>>> knlGS:0000000000000000
>>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> CR2: 00000000000001f0 CR3: 00000014e34c9000 CR4: 00000000001406f0Stack:
>>>> 0000000000000000 0000000000100000 ffff8814e8fd5aa8 ffff8814e953f3c0
>>>> ffff8814e125d700 0000000000100000 ffff8800376c6a80 ffff8814eae03d38
>>>> ffffffffc03ddf67 ffff8814e86b6a80 ffff8814e8fd5aa8 0000000000000001
>>>> Call Trace:
>>>> [<ffffffffc03ddf67>] btrfs_endio_direct_write+0x37/0x60 [btrfs]
>>>> [<ffffffffb438f2f7>] bio_endio+0x57/0x60
>>>> [<ffffffffc04082c1>] btrfs_end_bio+0xa1/0x140 [btrfs]
>>>> [<ffffffffb438f2f7>] bio_endio+0x57/0x60
>>>> [<ffffffffb439763b>] blk_update_request+0x8b/0x330
>>>> [<ffffffffb43a05ba>] blk_mq_end_request+0x1a/0x70
>>>> [<ffffffffc039f30f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
>>>> [<ffffffffb43a0688>] __blk_mq_complete_request+0x78/0xe0
>>>> [<ffffffffb43a070c>] blk_mq_complete_request+0x1c/0x20
>>>> [<ffffffffc039f184>] virtblk_done+0x64/0xe0 [virtio_blk]
>>>> [<ffffffffb446dd2a>] vring_interrupt+0x3a/0x90
>>>> [<ffffffffb40d3fe9>] __handle_irq_event_percpu+0x89/0x1b0
>>>> [<ffffffffb40d4133>] handle_irq_event_percpu+0x23/0x60
>>>> [<ffffffffb40d41ab>] handle_irq_event+0x3b/0x60
>>>> [<ffffffffb40d74ef>] handle_edge_irq+0x6f/0x150
>>>> [<ffffffffb4007cad>] handle_irq+0x1d/0x30
>>>> [<ffffffffb400750b>] do_IRQ+0x4b/0xd0
>>>> [<ffffffffb46af8cc>] common_interrupt+0x8c/0x8c
>>>> DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
>>>> Leftover inexact backtrace:
>>>> 2017-03-12 20:33:08     <IRQ><EOI>
>>>> 2017-03-12 20:33:08      [<ffffffffb404ba46>] ?
>>>> native_safe_halt+0x6/0x10
>>>> [<ffffffffb400fa3e>] default_idle+0x1e/0xe0
>>>> [<ffffffffb401021f>] arch_cpu_idle+0xf/0x20
>>>> [<ffffffffb40c67eb>] default_idle_call+0x3b/0x40
>>>> [<ffffffffb40c6a8a>] cpu_startup_entry+0x29a/0x370
>>>> [<ffffffffb46a358c>] rest_init+0x7c/0x80
>>>> [<ffffffffb4f67fa5>] start_kernel+0x490/0x49d
>>>> [<ffffffffb4f67120>] ? early_idt_handler_array+0x120/0x120
>>>> [<ffffffffb4f674b3>] x86_64_start_reservations+0x2a/0x2c
>>>> [<ffffffffb4f675f0>] x86_64_start_kernel+0x13b/0x14a
>>>> Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 8b 87 70 fc
>>>> ff ff 4c 8b 87 38 fe ff ff 48 c7 45 c8 00 00 00 00 48 89 75 d0 <48> 8b
>>>> b8 f0 01 00 00 48 3b 47 28 49 8b 84 24 78 fc ff ff 0f 84
>>>> RIP [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140
>>>> [btrfs]
>>>> RSP <ffff8814eae03cd8>
>>>> CR2: 00000000000001f0
>>>> ---[ end trace 7529a0652fd7873e ]---
>>>> Kernel panic - not syncing: Fatal exception in interrupt
>>>> Kernel Offset: 0x33000000 from 0xffffffff81000000 (relocation range:
>>>> 0xffffffff80000000-0xffffffffbfffffff)
>>>>
>>>> Greets,
>>>> Stefan
>>>> -- 
>>>> To unsubscribe from this list: send the line "unsubscribe
>>>> linux-btrfs" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>>
>>>
>>
>>
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  2017-03-13 13:26         ` Stefan Priebe - Profihost AG
@ 2017-03-14  0:30           ` Qu Wenruo
  2017-03-14  2:50           ` Qu Wenruo
  1 sibling, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2017-03-14  0:30 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG, linux-btrfs, fdmanana



At 03/13/2017 09:26 PM, Stefan Priebe - Profihost AG wrote:
>
> Am 13.03.2017 um 08:39 schrieb Qu Wenruo:
>>
>>
>> At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote:
>>> Hi Qu,
>>>
>>> Am 13.03.2017 um 02:16 schrieb Qu Wenruo:
>>>>
>>>> At 03/13/2017 04:49 AM, Stefan Priebe - Profihost AG wrote:
>>>>> Hi Qu,
>>>>>
>>>>> while V5 was running fine against the openSUSE-42.2 kernel (based on
>>>>> v4.4).
>>>>
>>>> Thanks for the test.
>>>>
>>>>> V7 results in OOPS to me:
>>>>> BUG: unable to handle kernel NULL pointer dereference at
>>>>> 00000000000001f0
>>>>
>>>> This 0x1f0 is the same as offsetof(struct brrfs_root, fs_info), quite
>>>> nice clue.
>>>>
>>>>> IP: [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140
>>>>> [btrfs]
>>>>
>>>> IP points to:
>>>> ---
>>>> static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode)
>>>> {
>>>>         struct btrfs_root *root = inode->root; << Either here
>>>>
>>>>         if (root == root->fs_info->tree_root && << Or here
>>>>             btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID)
>>>>
>>>> ---
>>>>
>>>> Taking the above offset into consideration, it's only possible for later
>>>> case.
>>>>
>>>> So here, we have a btrfs_inode whose @root is NULL.
>>>
>>> But wasn't this part of the code identical in V5? Why does it only
>>> happen with V7?
>>
>> There are still difference, but just as you said, the related
>> part(checking if inode is free space cache inode) is identical across v5
>> and v7.
>
> But if i boot v7 it always happens. If i boot v5 it always works. Have
> done 5 repeatet tests.

That's a problem now, I'll dig it further to find out the difference.

Thanks,
Qu

>
>> I'm afraid that's a rare race leading to NULL btrfs_inode->root, which
>> could happen in both v5 and v7.
>>
>> What's the difference between SUSE and mainline kernel?
>
> A lot ;-) But i don't think anything related.
>
>> Maybe some mainline kernel commits have already fixed it?
>
> May be no idea. But i haven't found any reason why v5 works.
>
> Stefan
>
>>
>> Thanks,
>> Qu
>>>
>>>> This can be fixed easily by checking @root inside
>>>> btrfs_is_free_space_inode(), as the backtrace shows that it's only
>>>> happening for DirectIO, and it won't happen for free space cache inode.
>>>>
>>>> But I'm more curious how this happened for a more accurate fix, or we
>>>> could have other NULL pointer access.
>>>>
>>>> Did you have any reproducer for this?
>>>
>>> Sorry no - this is a production MariaDB Server running btrfs with
>>> compress-force=zlib. But if i could test anything i'll do.
>>>
>>> Greets,
>>> Stefan
>>>
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>> PGD 14e18d4067 PUD 14e1868067 PMD 0
>>>>> Oops: 0000 [#1] SMP
>>>>> Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4
>>>>> xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set
>>>>> nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq
>>>>> ata_generic
>>>>> virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci
>>>>> i2c_core usb_common ata_piix floppy
>>>>> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1
>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>>>>> 1.7.5-20140722_172050-sagunt 04/01/2014
>>>>> task: ffffffffb4e0f500 ti: ffffffffb4e00000 task.ti: ffffffffb4e00000
>>>>> RIP: 0010:[<ffffffffc03dde23>] [<ffffffffc03dde23>]
>>>>> __endio_write_update_ordered+0x33/0x140 [btrfs]
>>>>> RSP: 0018:ffff8814eae03cd8 EFLAGS: 00010086
>>>>> RAX: 0000000000000000 RBX: ffff8814e8fd5aa8 RCX: 0000000000000001
>>>>> RDX: 0000000000100000 RSI: 0000000000100000 RDI: ffff8814e45885c0
>>>>> RBP: ffff8814eae03d10 R08: ffff8814e8334000 R09: 000000018040003a
>>>>> R10: ffffea00507d8d00 R11: ffff88141f634080 R12: ffff8814e45885c0
>>>>> R13: ffff8814e125d700 R14: 0000000000100000 R15: ffff8800376c6a80
>>>>> FS: 0000000000000000(0000) GS:ffff8814eae00000(0000)
>>>>> knlGS:0000000000000000
>>>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> CR2: 00000000000001f0 CR3: 00000014e34c9000 CR4: 00000000001406f0Stack:
>>>>> 0000000000000000 0000000000100000 ffff8814e8fd5aa8 ffff8814e953f3c0
>>>>> ffff8814e125d700 0000000000100000 ffff8800376c6a80 ffff8814eae03d38
>>>>> ffffffffc03ddf67 ffff8814e86b6a80 ffff8814e8fd5aa8 0000000000000001
>>>>> Call Trace:
>>>>> [<ffffffffc03ddf67>] btrfs_endio_direct_write+0x37/0x60 [btrfs]
>>>>> [<ffffffffb438f2f7>] bio_endio+0x57/0x60
>>>>> [<ffffffffc04082c1>] btrfs_end_bio+0xa1/0x140 [btrfs]
>>>>> [<ffffffffb438f2f7>] bio_endio+0x57/0x60
>>>>> [<ffffffffb439763b>] blk_update_request+0x8b/0x330
>>>>> [<ffffffffb43a05ba>] blk_mq_end_request+0x1a/0x70
>>>>> [<ffffffffc039f30f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
>>>>> [<ffffffffb43a0688>] __blk_mq_complete_request+0x78/0xe0
>>>>> [<ffffffffb43a070c>] blk_mq_complete_request+0x1c/0x20
>>>>> [<ffffffffc039f184>] virtblk_done+0x64/0xe0 [virtio_blk]
>>>>> [<ffffffffb446dd2a>] vring_interrupt+0x3a/0x90
>>>>> [<ffffffffb40d3fe9>] __handle_irq_event_percpu+0x89/0x1b0
>>>>> [<ffffffffb40d4133>] handle_irq_event_percpu+0x23/0x60
>>>>> [<ffffffffb40d41ab>] handle_irq_event+0x3b/0x60
>>>>> [<ffffffffb40d74ef>] handle_edge_irq+0x6f/0x150
>>>>> [<ffffffffb4007cad>] handle_irq+0x1d/0x30
>>>>> [<ffffffffb400750b>] do_IRQ+0x4b/0xd0
>>>>> [<ffffffffb46af8cc>] common_interrupt+0x8c/0x8c
>>>>> DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
>>>>> Leftover inexact backtrace:
>>>>> 2017-03-12 20:33:08     <IRQ><EOI>
>>>>> 2017-03-12 20:33:08      [<ffffffffb404ba46>] ?
>>>>> native_safe_halt+0x6/0x10
>>>>> [<ffffffffb400fa3e>] default_idle+0x1e/0xe0
>>>>> [<ffffffffb401021f>] arch_cpu_idle+0xf/0x20
>>>>> [<ffffffffb40c67eb>] default_idle_call+0x3b/0x40
>>>>> [<ffffffffb40c6a8a>] cpu_startup_entry+0x29a/0x370
>>>>> [<ffffffffb46a358c>] rest_init+0x7c/0x80
>>>>> [<ffffffffb4f67fa5>] start_kernel+0x490/0x49d
>>>>> [<ffffffffb4f67120>] ? early_idt_handler_array+0x120/0x120
>>>>> [<ffffffffb4f674b3>] x86_64_start_reservations+0x2a/0x2c
>>>>> [<ffffffffb4f675f0>] x86_64_start_kernel+0x13b/0x14a
>>>>> Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 8b 87 70 fc
>>>>> ff ff 4c 8b 87 38 fe ff ff 48 c7 45 c8 00 00 00 00 48 89 75 d0 <48> 8b
>>>>> b8 f0 01 00 00 48 3b 47 28 49 8b 84 24 78 fc ff ff 0f 84
>>>>> RIP [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140
>>>>> [btrfs]
>>>>> RSP <ffff8814eae03cd8>
>>>>> CR2: 00000000000001f0
>>>>> ---[ end trace 7529a0652fd7873e ]---
>>>>> Kernel panic - not syncing: Fatal exception in interrupt
>>>>> Kernel Offset: 0x33000000 from 0xffffffff81000000 (relocation range:
>>>>> 0xffffffff80000000-0xffffffffbfffffff)
>>>>>
>>>>> Greets,
>>>>> Stefan
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>> linux-btrfs" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  2017-03-13 13:26         ` Stefan Priebe - Profihost AG
  2017-03-14  0:30           ` Qu Wenruo
@ 2017-03-14  2:50           ` Qu Wenruo
  2017-03-14  9:06             ` Stefan Priebe - Profihost AG
  1 sibling, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2017-03-14  2:50 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG, linux-btrfs, fdmanana



At 03/13/2017 09:26 PM, Stefan Priebe - Profihost AG wrote:
>
> Am 13.03.2017 um 08:39 schrieb Qu Wenruo:
>>
>>
>> At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote:
>>> Hi Qu,
>>>
>>> Am 13.03.2017 um 02:16 schrieb Qu Wenruo:
>>>
>>> But wasn't this part of the code identical in V5? Why does it only
>>> happen with V7?
>>
>> There are still difference, but just as you said, the related
>> part(checking if inode is free space cache inode) is identical across v5
>> and v7.
>
> But if i boot v7 it always happens. If i boot v5 it always works. Have
> done 5 repeatet tests.

I rechecked the code change between v7 and v5.

It turns out that, the code base may cause the problem.

In v7, the base is v4.11-rc1, which introduced quite a lot of 
btrfs_inode cleanup.

One of the difference is the parameter for btrfs_is_free_space_inode().

In v7, the parameter @inode changed from struct inode to struct btrfs_inode.

So in v7, we're passing BTRFS_I(inode) to btrfs_is_free_space_inode(), 
other than plain inode.

That's the most possible cause for me here.

So would you please paste the final patch applied to your tree?
Git diff or git format-patch can both handle it.

Thanks,
Qu

>
>> I'm afraid that's a rare race leading to NULL btrfs_inode->root, which
>> could happen in both v5 and v7.
>>
>> What's the difference between SUSE and mainline kernel?
>
> A lot ;-) But i don't think anything related.
>
>> Maybe some mainline kernel commits have already fixed it?
>
> May be no idea. But i haven't found any reason why v5 works.
>
> Stefan
>
>>
>> Thanks,
>> Qu
>>>



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  2017-03-14  2:50           ` Qu Wenruo
@ 2017-03-14  9:06             ` Stefan Priebe - Profihost AG
  2017-03-14  9:09               ` Qu Wenruo
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Priebe - Profihost AG @ 2017-03-14  9:06 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs, fdmanana

Thanks Qu, removing BTRFS_I from the inode fixes this issue to me.

Greets,
Stefan


Am 14.03.2017 um 03:50 schrieb Qu Wenruo:
> 
> 
> At 03/13/2017 09:26 PM, Stefan Priebe - Profihost AG wrote:
>>
>> Am 13.03.2017 um 08:39 schrieb Qu Wenruo:
>>>
>>>
>>> At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote:
>>>> Hi Qu,
>>>>
>>>> Am 13.03.2017 um 02:16 schrieb Qu Wenruo:
>>>>
>>>> But wasn't this part of the code identical in V5? Why does it only
>>>> happen with V7?
>>>
>>> There are still difference, but just as you said, the related
>>> part(checking if inode is free space cache inode) is identical across v5
>>> and v7.
>>
>> But if i boot v7 it always happens. If i boot v5 it always works. Have
>> done 5 repeatet tests.
> 
> I rechecked the code change between v7 and v5.
> 
> It turns out that, the code base may cause the problem.
> 
> In v7, the base is v4.11-rc1, which introduced quite a lot of
> btrfs_inode cleanup.
> 
> One of the difference is the parameter for btrfs_is_free_space_inode().
> 
> In v7, the parameter @inode changed from struct inode to struct
> btrfs_inode.
> 
> So in v7, we're passing BTRFS_I(inode) to btrfs_is_free_space_inode(),
> other than plain inode.
> 
> That's the most possible cause for me here.
> 
> So would you please paste the final patch applied to your tree?
> Git diff or git format-patch can both handle it.
> 
> Thanks,
> Qu
> 
>>
>>> I'm afraid that's a rare race leading to NULL btrfs_inode->root, which
>>> could happen in both v5 and v7.
>>>
>>> What's the difference between SUSE and mainline kernel?
>>
>> A lot ;-) But i don't think anything related.
>>
>>> Maybe some mainline kernel commits have already fixed it?
>>
>> May be no idea. But i haven't found any reason why v5 works.
>>
>> Stefan
>>
>>>
>>> Thanks,
>>> Qu
>>>>
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  2017-03-14  9:06             ` Stefan Priebe - Profihost AG
@ 2017-03-14  9:09               ` Qu Wenruo
  0 siblings, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2017-03-14  9:09 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG, linux-btrfs, fdmanana



At 03/14/2017 05:06 PM, Stefan Priebe - Profihost AG wrote:
> Thanks Qu, removing BTRFS_I from the inode fixes this issue to me.
>
> Greets,
> Stefan
>

Glad to hear that.

And a small tip is, when compiling kernel(at least btrfs module), any 
warning should be checked carefully.

Such type mismatch will cause a warning and can help us to avoid such 
problem.

Thanks,
Qu
>
> Am 14.03.2017 um 03:50 schrieb Qu Wenruo:
>>
>>
>> At 03/13/2017 09:26 PM, Stefan Priebe - Profihost AG wrote:
>>>
>>> Am 13.03.2017 um 08:39 schrieb Qu Wenruo:
>>>>
>>>>
>>>> At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote:
>>>>> Hi Qu,
>>>>>
>>>>> Am 13.03.2017 um 02:16 schrieb Qu Wenruo:
>>>>>
>>>>> But wasn't this part of the code identical in V5? Why does it only
>>>>> happen with V7?
>>>>
>>>> There are still difference, but just as you said, the related
>>>> part(checking if inode is free space cache inode) is identical across v5
>>>> and v7.
>>>
>>> But if i boot v7 it always happens. If i boot v5 it always works. Have
>>> done 5 repeatet tests.
>>
>> I rechecked the code change between v7 and v5.
>>
>> It turns out that, the code base may cause the problem.
>>
>> In v7, the base is v4.11-rc1, which introduced quite a lot of
>> btrfs_inode cleanup.
>>
>> One of the difference is the parameter for btrfs_is_free_space_inode().
>>
>> In v7, the parameter @inode changed from struct inode to struct
>> btrfs_inode.
>>
>> So in v7, we're passing BTRFS_I(inode) to btrfs_is_free_space_inode(),
>> other than plain inode.
>>
>> That's the most possible cause for me here.
>>
>> So would you please paste the final patch applied to your tree?
>> Git diff or git format-patch can both handle it.
>>
>> Thanks,
>> Qu
>>
>>>
>>>> I'm afraid that's a rare race leading to NULL btrfs_inode->root, which
>>>> could happen in both v5 and v7.
>>>>
>>>> What's the difference between SUSE and mainline kernel?
>>>
>>> A lot ;-) But i don't think anything related.
>>>
>>>> Maybe some mainline kernel commits have already fixed it?
>>>
>>> May be no idea. But i haven't found any reason why v5 works.
>>>
>>> Stefan
>>>
>>>>
>>>> Thanks,
>>>> Qu
>>>>>
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>



^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2017-03-14  9:09 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-08  2:25 [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Qu Wenruo
2017-03-08  2:25 ` [PATCH v7 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang Qu Wenruo
2017-03-08 10:19   ` Filipe Manana
2017-03-09 17:37   ` Liu Bo
2017-03-08 10:18 ` [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Filipe Manana
2017-03-09 17:35 ` Liu Bo
2017-03-12 20:49 ` Stefan Priebe - Profihost AG
2017-03-13  1:16   ` Qu Wenruo
2017-03-13  7:26     ` Stefan Priebe - Profihost AG
2017-03-13  7:39       ` Qu Wenruo
2017-03-13 13:26         ` Stefan Priebe - Profihost AG
2017-03-14  0:30           ` Qu Wenruo
2017-03-14  2:50           ` Qu Wenruo
2017-03-14  9:06             ` Stefan Priebe - Profihost AG
2017-03-14  9:09               ` Qu Wenruo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.