All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] Btrfs: fix regression when running delayed references
@ 2015-10-22  8:47 fdmanana
  2015-10-22  9:32 ` Qu Wenruo
  2015-10-22 10:05 ` Koen Kooi
  0 siblings, 2 replies; 15+ messages in thread
From: fdmanana @ 2015-10-22  8:47 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

From: Filipe Manana <fdmanana@suse.com>

In the kernel 4.2 merge window we had a refactoring/rework of the delayed
references implementation in order to fix certain problems with qgroups.
However that rework introduced one more regression that leads to the
following trace when running delayed references for metadata:

[35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
[35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2
[35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
[35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
[35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: ffff88010c4c8000
[35908.065201] RIP: 0010:[<ffffffffa04928b5>]  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
[35908.065201] RSP: 0018:ffff88010c4cbb08  EFLAGS: 00010293
[35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000
[35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000
[35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8
[35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000
[35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000
[35908.065201] FS:  0000000000000000(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
[35908.065201] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0
[35908.065201] Stack:
[35908.065201]  ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000
[35908.065201]  ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8
[35908.065201]  ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000
[35908.065201] Call Trace:
[35908.065201]  [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
[35908.065201]  [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
[35908.065201]  [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs]
[35908.065201]  [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f [btrfs]
[35908.065201]  [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f [btrfs]
[35908.065201]  [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
[35908.065201]  [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b [btrfs]
[35908.065201]  [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a [btrfs]
[35908.065201]  [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14 [btrfs]
[35908.065201]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
[35908.065201]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
[35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
[35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
[35908.065201]  [<ffffffff8106904d>] kthread+0xef/0xf7
[35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
[35908.065201]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
[35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
[35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
[35908.065201] RIP  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
[35908.065201]  RSP <ffff88010c4cbb08>
[35908.310885] ---[ end trace fe4299baf0666457 ]---

This happens because the new delayed references code no longer merges
delayed references that have different sequence values. The following
steps are an example sequence leading to this issue:

1) Transaction N starts, fs_info->tree_mod_seq has value 0;

2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
   bytenr A is created, with a value of 1 and a seq value of 0;

3) fs_info->tree_mod_seq is incremented to 1;

4) Extent buffer A is deleted through btrfs_del_items(), which calls
   btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
   later returns the metadata extent associated to extent buffer A to
   the free space cache (the range is not pinned), because the extent
   buffer was created in the current transaction (N) and writeback never
   happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
   in the extent buffer).
   This creates the delayed reference Ref2 for bytenr A, with a value
   of -1 and a seq value of 1;

5) Delayed reference Ref2 is not merged with Ref1 when we create it,
   because they have different sequence numbers (decided at
   add_delayed_ref_tail_merge());

6) fs_info->tree_mod_seq is incremented to 2;

7) Some task attempts to allocate a new extent buffer (done at
   extent-tree.c:find_free_extent()), but due to heavy fragmentation
   and running low on metadata space the clustered allocation fails
   and we fall back to unclustered allocation, which finds the
   extent at offset A, so a new extent buffer at offset A is allocated.
   This creates delayed reference Ref3 for bytenr A, with a value of -1
   and a seq value of 2;

8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
   all have different seq values;

9) We start running the delayed references (__btrfs_run_delayed_refs());

10) The delayed Ref1 is the first one being applied, which ends up
    creating an inline extent backref in the extent tree;

10) Next the delayed reference Ref3 is selected for execution, and not
    Ref2, because select_delayed_ref() always gives a preference for
    positive references (that have an action of BTRFS_ADD_DELAYED_REF);

11) When running Ref3 we encounter alreay the inline extent backref
    in the extent tree at insert_inline_extent_backref(), which makes
    us hit the following BUG_ON:

        BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);

    This is always true because owner corresponds to the level of the
    extent buffer/btree node in the btree.

For the scenario described above we hit the BUG_ON because we never merge
references that have different seq values.

We used to do the merging before the 4.2 kernel, more specifically, before
the commmits:

  c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
  c43d160fcd5e ("btrfs: delayed-ref: Cleanup the unneeded functions.")

This issue became more exposed after the following change that was added
to 4.2 as well:

  cffc3374e567 ("Btrfs: fix order by which delayed references are run")

Which in turn fixed another regression by the two commits previously
mentioned.

So fix this by bringing back the delayed reference merge code, with the
proper adaptations so that it operates against the new data structure
(linked list vs old red black tree implementation).

This issue was hit running fstest btrfs/063 in a loop. Several people have
reported this issue in the mailing list when running on kernels 4.2+.

Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
Reported-by: Peter Becker <floyd.net@gmail.com>
Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Reported-by: Malte Schröder <malte@tnxip.de>
Reported-by: Derek Dongray <derek@valedon.co.uk>
Reported-by: Erkki Seppala <flux-btrfs@inside.org>
Cc: stable@vger.kernel.org  # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/delayed-ref.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/extent-tree.c |  14 ++++++
 2 files changed, 127 insertions(+)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index ac3e81d..4832943 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -197,6 +197,119 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans,
 		trans->delayed_ref_updates--;
 }
 
+static bool merge_ref(struct btrfs_trans_handle *trans,
+		      struct btrfs_delayed_ref_root *delayed_refs,
+		      struct btrfs_delayed_ref_head *head,
+		      struct btrfs_delayed_ref_node *ref,
+		      u64 seq)
+{
+	struct btrfs_delayed_ref_node *next;
+	bool done = false;
+
+	next = list_first_entry(&head->ref_list, struct btrfs_delayed_ref_node,
+				list);
+	while (!done && &next->list != &head->ref_list) {
+		int mod;
+		struct btrfs_delayed_ref_node *next2;
+
+		next2 = list_next_entry(next, list);
+
+		if (next == ref)
+			goto next;
+
+		if (seq && next->seq >= seq)
+			goto next;
+
+		if (next->type != ref->type || next->no_quota != ref->no_quota)
+			goto next;
+
+		if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
+		     ref->type == BTRFS_SHARED_BLOCK_REF_KEY) &&
+		    comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref),
+				   btrfs_delayed_node_to_tree_ref(next),
+				   ref->type))
+			goto next;
+		if ((ref->type == BTRFS_EXTENT_DATA_REF_KEY ||
+		     ref->type == BTRFS_SHARED_DATA_REF_KEY) &&
+		    comp_data_refs(btrfs_delayed_node_to_data_ref(ref),
+				   btrfs_delayed_node_to_data_ref(next)))
+			goto next;
+
+		if (ref->action == next->action) {
+			mod = next->ref_mod;
+		} else {
+			if (ref->ref_mod < next->ref_mod) {
+				swap(ref, next);
+				done = true;
+			}
+			mod = -next->ref_mod;
+		}
+
+		drop_delayed_ref(trans, delayed_refs, head, next);
+		ref->ref_mod += mod;
+		if (ref->ref_mod == 0) {
+			drop_delayed_ref(trans, delayed_refs, head, ref);
+			done = true;
+		} else {
+			/*
+			 * Can't have multiples of the same ref on a tree block.
+			 */
+			WARN_ON(ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
+				ref->type == BTRFS_SHARED_BLOCK_REF_KEY);
+		}
+next:
+		next = next2;
+	}
+
+	return done;
+}
+
+void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
+			      struct btrfs_fs_info *fs_info,
+			      struct btrfs_delayed_ref_root *delayed_refs,
+			      struct btrfs_delayed_ref_head *head)
+{
+	struct btrfs_delayed_ref_node *ref;
+	u64 seq = 0;
+
+	assert_spin_locked(&head->lock);
+
+	if (list_empty(&head->ref_list))
+		return;
+
+	/* We don't have too many refs to merge for data. */
+	if (head->is_data)
+		return;
+
+	spin_lock(&fs_info->tree_mod_seq_lock);
+	if (!list_empty(&fs_info->tree_mod_seq_list)) {
+		struct seq_list *elem;
+
+		elem = list_first_entry(&fs_info->tree_mod_seq_list,
+					struct seq_list, list);
+		seq = elem->seq;
+	}
+	spin_unlock(&fs_info->tree_mod_seq_lock);
+
+	ref = list_first_entry(&head->ref_list, struct btrfs_delayed_ref_node,
+			       list);
+	while (&ref->list != &head->ref_list) {
+		if (seq && ref->seq >= seq)
+			goto next;
+
+		if (merge_ref(trans, delayed_refs, head, ref, seq)) {
+			if (list_empty(&head->ref_list))
+				break;
+			ref = list_first_entry(&head->ref_list,
+					       struct btrfs_delayed_ref_node,
+					       list);
+			continue;
+		}
+next:
+		ref = list_next_entry(ref, list);
+	}
+}
+
 int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info,
 			    struct btrfs_delayed_ref_root *delayed_refs,
 			    u64 seq)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 522fb45..42d9310 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2433,7 +2433,21 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 			}
 		}
 
+		/*
+		 * We need to try and merge add/drops of the same ref since we
+		 * can run into issues with relocate dropping the implicit ref
+		 * and then it being added back again before the drop can
+		 * finish.  If we merged anything we need to re-loop so we can
+		 * get a good ref.
+		 * Or we can get node references of the same type that weren't
+		 * merged when created due to bumps in the tree mod seq, and
+		 * we need to merge them to prevent adding an inline extent
+		 * backref before dropping it (triggering a BUG_ON at
+		 * insert_inline_extent_backref()).
+		 */
 		spin_lock(&locked_ref->lock);
+		btrfs_merge_delayed_refs(trans, fs_info, delayed_refs,
+					 locked_ref);
 
 		/*
 		 * locked_ref is the head node, so we have to go one
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] Btrfs: fix regression when running delayed references
  2015-10-22  8:47 [PATCH] Btrfs: fix regression when running delayed references fdmanana
@ 2015-10-22  9:32 ` Qu Wenruo
  2015-10-22  9:43   ` Filipe Manana
  2015-10-22 10:05 ` Koen Kooi
  1 sibling, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2015-10-22  9:32 UTC (permalink / raw)
  To: fdmanana, linux-btrfs; +Cc: Filipe Manana



  wrote on 2015/10/22 09:47 +0100:
> From: Filipe Manana <fdmanana@suse.com>
>
> In the kernel 4.2 merge window we had a refactoring/rework of the delayed
> references implementation in order to fix certain problems with qgroups.
> However that rework introduced one more regression that leads to the
> following trace when running delayed references for metadata:
>
> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
> [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2
> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
> [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
> [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: ffff88010c4c8000
> [35908.065201] RIP: 0010:[<ffffffffa04928b5>]  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
> [35908.065201] RSP: 0018:ffff88010c4cbb08  EFLAGS: 00010293
> [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000
> [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000
> [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8
> [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000
> [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000
> [35908.065201] FS:  0000000000000000(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
> [35908.065201] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0
> [35908.065201] Stack:
> [35908.065201]  ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000
> [35908.065201]  ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8
> [35908.065201]  ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000
> [35908.065201] Call Trace:
> [35908.065201]  [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
> [35908.065201]  [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
> [35908.065201]  [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs]
> [35908.065201]  [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f [btrfs]
> [35908.065201]  [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f [btrfs]
> [35908.065201]  [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
> [35908.065201]  [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b [btrfs]
> [35908.065201]  [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a [btrfs]
> [35908.065201]  [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14 [btrfs]
> [35908.065201]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
> [35908.065201]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
> [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
> [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
> [35908.065201]  [<ffffffff8106904d>] kthread+0xef/0xf7
> [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
> [35908.065201]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
> [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
> [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
> [35908.065201] RIP  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
> [35908.065201]  RSP <ffff88010c4cbb08>
> [35908.310885] ---[ end trace fe4299baf0666457 ]---
>
> This happens because the new delayed references code no longer merges
> delayed references that have different sequence values. The following
> steps are an example sequence leading to this issue:
>
> 1) Transaction N starts, fs_info->tree_mod_seq has value 0;
>
> 2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
>     bytenr A is created, with a value of 1 and a seq value of 0;
>
> 3) fs_info->tree_mod_seq is incremented to 1;
>
> 4) Extent buffer A is deleted through btrfs_del_items(), which calls
>     btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
>     later returns the metadata extent associated to extent buffer A to
>     the free space cache (the range is not pinned), because the extent
>     buffer was created in the current transaction (N) and writeback never
>     happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
>     in the extent buffer).
>     This creates the delayed reference Ref2 for bytenr A, with a value
>     of -1 and a seq value of 1;
>
> 5) Delayed reference Ref2 is not merged with Ref1 when we create it,
>     because they have different sequence numbers (decided at
>     add_delayed_ref_tail_merge());
>
> 6) fs_info->tree_mod_seq is incremented to 2;
>
> 7) Some task attempts to allocate a new extent buffer (done at
>     extent-tree.c:find_free_extent()), but due to heavy fragmentation
>     and running low on metadata space the clustered allocation fails
>     and we fall back to unclustered allocation, which finds the
>     extent at offset A, so a new extent buffer at offset A is allocated.
>     This creates delayed reference Ref3 for bytenr A, with a value of -1
>     and a seq value of 2;
>
> 8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
>     all have different seq values;
>
> 9) We start running the delayed references (__btrfs_run_delayed_refs());
>
> 10) The delayed Ref1 is the first one being applied, which ends up
>      creating an inline extent backref in the extent tree;
>
> 10) Next the delayed reference Ref3 is selected for execution, and not
>      Ref2, because select_delayed_ref() always gives a preference for
>      positive references (that have an action of BTRFS_ADD_DELAYED_REF);
>
> 11) When running Ref3 we encounter alreay the inline extent backref
>      in the extent tree at insert_inline_extent_backref(), which makes
>      us hit the following BUG_ON:
>
>          BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
>
>      This is always true because owner corresponds to the level of the
>      extent buffer/btree node in the btree.
>
> For the scenario described above we hit the BUG_ON because we never merge
> references that have different seq values.
>
> We used to do the merging before the 4.2 kernel, more specifically, before
> the commmits:
>
>    c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
>    c43d160fcd5e ("btrfs: delayed-ref: Cleanup the unneeded functions.")
>
> This issue became more exposed after the following change that was added
> to 4.2 as well:
>
>    cffc3374e567 ("Btrfs: fix order by which delayed references are run")
>
> Which in turn fixed another regression by the two commits previously
> mentioned.
>
> So fix this by bringing back the delayed reference merge code, with the
> proper adaptations so that it operates against the new data structure
> (linked list vs old red black tree implementation).
>
> This issue was hit running fstest btrfs/063 in a loop. Several people have
> reported this issue in the mailing list when running on kernels 4.2+.

Thanks Filipe,

My fault again. :(
But I'm not completely sure about if tree_mod_seq is still needed now.

IIRC, with the new qgroup accounting happen at commit_transaction time,
btrfs_find_all_roots() should either searching commit tree for old 
roots, or search current tree for new roots.
No need to search using tree_mod_seq.

If so, I'd like just allow merging refs without checking tree_mod_seq.

I was going to do it but not completely sure is there any other user of 
tree_mod_seq.
And if it's possible to get rid of tree_mod_seq and merge with last 
delayed_ref, things should get cleaner without new codes.

Thanks,
Qu


>
> Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
> Reported-by: Peter Becker <floyd.net@gmail.com>
> Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
> Reported-by: Malte Schröder <malte@tnxip.de>
> Reported-by: Derek Dongray <derek@valedon.co.uk>
> Reported-by: Erkki Seppala <flux-btrfs@inside.org>
> Cc: stable@vger.kernel.org  # 4.2+
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>   fs/btrfs/delayed-ref.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/extent-tree.c |  14 ++++++
>   2 files changed, 127 insertions(+)
>
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index ac3e81d..4832943 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -197,6 +197,119 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans,
>   		trans->delayed_ref_updates--;
>   }
>
> +static bool merge_ref(struct btrfs_trans_handle *trans,
> +		      struct btrfs_delayed_ref_root *delayed_refs,
> +		      struct btrfs_delayed_ref_head *head,
> +		      struct btrfs_delayed_ref_node *ref,
> +		      u64 seq)
> +{
> +	struct btrfs_delayed_ref_node *next;
> +	bool done = false;
> +
> +	next = list_first_entry(&head->ref_list, struct btrfs_delayed_ref_node,
> +				list);
> +	while (!done && &next->list != &head->ref_list) {
> +		int mod;
> +		struct btrfs_delayed_ref_node *next2;
> +
> +		next2 = list_next_entry(next, list);
> +
> +		if (next == ref)
> +			goto next;
> +
> +		if (seq && next->seq >= seq)
> +			goto next;
> +
> +		if (next->type != ref->type || next->no_quota != ref->no_quota)
> +			goto next;
> +
> +		if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
> +		     ref->type == BTRFS_SHARED_BLOCK_REF_KEY) &&
> +		    comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref),
> +				   btrfs_delayed_node_to_tree_ref(next),
> +				   ref->type))
> +			goto next;
> +		if ((ref->type == BTRFS_EXTENT_DATA_REF_KEY ||
> +		     ref->type == BTRFS_SHARED_DATA_REF_KEY) &&
> +		    comp_data_refs(btrfs_delayed_node_to_data_ref(ref),
> +				   btrfs_delayed_node_to_data_ref(next)))
> +			goto next;
> +
> +		if (ref->action == next->action) {
> +			mod = next->ref_mod;
> +		} else {
> +			if (ref->ref_mod < next->ref_mod) {
> +				swap(ref, next);
> +				done = true;
> +			}
> +			mod = -next->ref_mod;
> +		}
> +
> +		drop_delayed_ref(trans, delayed_refs, head, next);
> +		ref->ref_mod += mod;
> +		if (ref->ref_mod == 0) {
> +			drop_delayed_ref(trans, delayed_refs, head, ref);
> +			done = true;
> +		} else {
> +			/*
> +			 * Can't have multiples of the same ref on a tree block.
> +			 */
> +			WARN_ON(ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
> +				ref->type == BTRFS_SHARED_BLOCK_REF_KEY);
> +		}
> +next:
> +		next = next2;
> +	}
> +
> +	return done;
> +}
> +
> +void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
> +			      struct btrfs_fs_info *fs_info,
> +			      struct btrfs_delayed_ref_root *delayed_refs,
> +			      struct btrfs_delayed_ref_head *head)
> +{
> +	struct btrfs_delayed_ref_node *ref;
> +	u64 seq = 0;
> +
> +	assert_spin_locked(&head->lock);
> +
> +	if (list_empty(&head->ref_list))
> +		return;
> +
> +	/* We don't have too many refs to merge for data. */
> +	if (head->is_data)
> +		return;
> +
> +	spin_lock(&fs_info->tree_mod_seq_lock);
> +	if (!list_empty(&fs_info->tree_mod_seq_list)) {
> +		struct seq_list *elem;
> +
> +		elem = list_first_entry(&fs_info->tree_mod_seq_list,
> +					struct seq_list, list);
> +		seq = elem->seq;
> +	}
> +	spin_unlock(&fs_info->tree_mod_seq_lock);
> +
> +	ref = list_first_entry(&head->ref_list, struct btrfs_delayed_ref_node,
> +			       list);
> +	while (&ref->list != &head->ref_list) {
> +		if (seq && ref->seq >= seq)
> +			goto next;
> +
> +		if (merge_ref(trans, delayed_refs, head, ref, seq)) {
> +			if (list_empty(&head->ref_list))
> +				break;
> +			ref = list_first_entry(&head->ref_list,
> +					       struct btrfs_delayed_ref_node,
> +					       list);
> +			continue;
> +		}
> +next:
> +		ref = list_next_entry(ref, list);
> +	}
> +}
> +
>   int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info,
>   			    struct btrfs_delayed_ref_root *delayed_refs,
>   			    u64 seq)
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 522fb45..42d9310 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2433,7 +2433,21 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
>   			}
>   		}
>
> +		/*
> +		 * We need to try and merge add/drops of the same ref since we
> +		 * can run into issues with relocate dropping the implicit ref
> +		 * and then it being added back again before the drop can
> +		 * finish.  If we merged anything we need to re-loop so we can
> +		 * get a good ref.
> +		 * Or we can get node references of the same type that weren't
> +		 * merged when created due to bumps in the tree mod seq, and
> +		 * we need to merge them to prevent adding an inline extent
> +		 * backref before dropping it (triggering a BUG_ON at
> +		 * insert_inline_extent_backref()).
> +		 */
>   		spin_lock(&locked_ref->lock);
> +		btrfs_merge_delayed_refs(trans, fs_info, delayed_refs,
> +					 locked_ref);
>
>   		/*
>   		 * locked_ref is the head node, so we have to go one
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Btrfs: fix regression when running delayed references
  2015-10-22  9:32 ` Qu Wenruo
@ 2015-10-22  9:43   ` Filipe Manana
  2015-10-22  9:47     ` Filipe Manana
  2015-10-22 10:06     ` Qu Wenruo
  0 siblings, 2 replies; 15+ messages in thread
From: Filipe Manana @ 2015-10-22  9:43 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Filipe Manana

On Thu, Oct 22, 2015 at 10:32 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>
>
>  wrote on 2015/10/22 09:47 +0100:
>>
>> From: Filipe Manana <fdmanana@suse.com>
>>
>> In the kernel 4.2 merge window we had a refactoring/rework of the delayed
>> references implementation in order to fix certain problems with qgroups.
>> However that rework introduced one more regression that leads to the
>> following trace when running delayed references for metadata:
>>
>> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
>> [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic
>> xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache
>> sunrpc loop fuse parport_pc psmouse i2
>> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G        W
>> 4.3.0-rc5-btrfs-next-17+ #1
>> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
>> [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper
>> [btrfs]
>> [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti:
>> ffff88010c4c8000
>> [35908.065201] RIP: 0010:[<ffffffffa04928b5>]  [<ffffffffa04928b5>]
>> insert_inline_extent_backref+0x52/0xb1 [btrfs]
>> [35908.065201] RSP: 0018:ffff88010c4cbb08  EFLAGS: 00010293
>> [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX:
>> 0000000000000000
>> [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI:
>> 0000000000000000
>> [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09:
>> ffff88010c4cb9f8
>> [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12:
>> 0000000000000000
>> [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15:
>> 0000000000000000
>> [35908.065201] FS:  0000000000000000(0000) GS:ffff88023edc0000(0000)
>> knlGS:0000000000000000
>> [35908.065201] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4:
>> 00000000000006e0
>> [35908.065201] Stack:
>> [35908.065201]  ffff88010c4cbb18 0000000000000f37 ffff88020a74c578
>> ffff88015a408000
>> [35908.065201]  ffff880154a44000 0000000000000000 0000000000000005
>> ffff88010c4cbbd8
>> [35908.065201]  ffffffffa0492b9a 0000000000000005 0000000000000000
>> 0000000000000000
>> [35908.065201] Call Trace:
>> [35908.065201]  [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208
>> [btrfs]
>> [35908.065201]  [<ffffffffa0497117>] ?
>> __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
>> [35908.065201]  [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33
>> [btrfs]
>> [35908.065201]  [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f
>> [btrfs]
>> [35908.065201]  [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f
>> [btrfs]
>> [35908.065201]  [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd
>> [btrfs]
>> [35908.065201]  [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b
>> [btrfs]
>> [35908.065201]  [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a
>> [btrfs]
>> [35908.065201]  [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14
>> [btrfs]
>> [35908.065201]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
>> [35908.065201]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
>> [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
>> [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
>> [35908.065201]  [<ffffffff8106904d>] kthread+0xef/0xf7
>> [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
>> [35908.065201]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
>> [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
>> [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48
>> 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02
>> <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
>> [35908.065201] RIP  [<ffffffffa04928b5>]
>> insert_inline_extent_backref+0x52/0xb1 [btrfs]
>> [35908.065201]  RSP <ffff88010c4cbb08>
>> [35908.310885] ---[ end trace fe4299baf0666457 ]---
>>
>> This happens because the new delayed references code no longer merges
>> delayed references that have different sequence values. The following
>> steps are an example sequence leading to this issue:
>>
>> 1) Transaction N starts, fs_info->tree_mod_seq has value 0;
>>
>> 2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
>>     bytenr A is created, with a value of 1 and a seq value of 0;
>>
>> 3) fs_info->tree_mod_seq is incremented to 1;
>>
>> 4) Extent buffer A is deleted through btrfs_del_items(), which calls
>>     btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
>>     later returns the metadata extent associated to extent buffer A to
>>     the free space cache (the range is not pinned), because the extent
>>     buffer was created in the current transaction (N) and writeback never
>>     happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
>>     in the extent buffer).
>>     This creates the delayed reference Ref2 for bytenr A, with a value
>>     of -1 and a seq value of 1;
>>
>> 5) Delayed reference Ref2 is not merged with Ref1 when we create it,
>>     because they have different sequence numbers (decided at
>>     add_delayed_ref_tail_merge());
>>
>> 6) fs_info->tree_mod_seq is incremented to 2;
>>
>> 7) Some task attempts to allocate a new extent buffer (done at
>>     extent-tree.c:find_free_extent()), but due to heavy fragmentation
>>     and running low on metadata space the clustered allocation fails
>>     and we fall back to unclustered allocation, which finds the
>>     extent at offset A, so a new extent buffer at offset A is allocated.
>>     This creates delayed reference Ref3 for bytenr A, with a value of -1
>>     and a seq value of 2;
>>
>> 8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
>>     all have different seq values;
>>
>> 9) We start running the delayed references (__btrfs_run_delayed_refs());
>>
>> 10) The delayed Ref1 is the first one being applied, which ends up
>>      creating an inline extent backref in the extent tree;
>>
>> 10) Next the delayed reference Ref3 is selected for execution, and not
>>      Ref2, because select_delayed_ref() always gives a preference for
>>      positive references (that have an action of BTRFS_ADD_DELAYED_REF);
>>
>> 11) When running Ref3 we encounter alreay the inline extent backref
>>      in the extent tree at insert_inline_extent_backref(), which makes
>>      us hit the following BUG_ON:
>>
>>          BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
>>
>>      This is always true because owner corresponds to the level of the
>>      extent buffer/btree node in the btree.
>>
>> For the scenario described above we hit the BUG_ON because we never merge
>> references that have different seq values.
>>
>> We used to do the merging before the 4.2 kernel, more specifically, before
>> the commmits:
>>
>>    c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in
>> ref_head.")
>>    c43d160fcd5e ("btrfs: delayed-ref: Cleanup the unneeded functions.")
>>
>> This issue became more exposed after the following change that was added
>> to 4.2 as well:
>>
>>    cffc3374e567 ("Btrfs: fix order by which delayed references are run")
>>
>> Which in turn fixed another regression by the two commits previously
>> mentioned.
>>
>> So fix this by bringing back the delayed reference merge code, with the
>> proper adaptations so that it operates against the new data structure
>> (linked list vs old red black tree implementation).
>>
>> This issue was hit running fstest btrfs/063 in a loop. Several people have
>> reported this issue in the mailing list when running on kernels 4.2+.
>
>
> Thanks Filipe,
>
> My fault again. :(
> But I'm not completely sure about if tree_mod_seq is still needed now.
>
> IIRC, with the new qgroup accounting happen at commit_transaction time,
> btrfs_find_all_roots() should either searching commit tree for old roots, or
> search current tree for new roots.
> No need to search using tree_mod_seq.
>
> If so, I'd like just allow merging refs without checking tree_mod_seq.
>
> I was going to do it but not completely sure is there any other user of
> tree_mod_seq.
> And if it's possible to get rid of tree_mod_seq and merge with last
> delayed_ref, things should get cleaner without new codes.

Well, the tree mod seq is what allows backref walkers (and possibly
other paths) to get a consistent view of all btrees and delayed refs
state while doing some processing - that's why we have calls to
btrfs_check_delayed_seq() when running delayed references - so that
any backref walker will not see changes that happen after it started,
i.e. it will see a consistent view of all the btrees (like an
in-memory snapshot of all btrees while the transaction is running).

I don't think you can get this level of consistency through any other
existing means.
So just adding back yet more code that was removed despite still being needed.

This is affecting way too many people now, I would like to get this
fixed and later, if there's a better (new) solution for this, we can
get it in.

thanks

>
> Thanks,
> Qu
>
>
>
>>
>> Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root
>> in ref_head.")
>> Reported-by: Peter Becker <floyd.net@gmail.com>
>> Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
>> Reported-by: Malte Schröder <malte@tnxip.de>
>> Reported-by: Derek Dongray <derek@valedon.co.uk>
>> Reported-by: Erkki Seppala <flux-btrfs@inside.org>
>> Cc: stable@vger.kernel.org  # 4.2+
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> ---
>>   fs/btrfs/delayed-ref.c | 113
>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/extent-tree.c |  14 ++++++
>>   2 files changed, 127 insertions(+)
>>
>> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
>> index ac3e81d..4832943 100644
>> --- a/fs/btrfs/delayed-ref.c
>> +++ b/fs/btrfs/delayed-ref.c
>> @@ -197,6 +197,119 @@ static inline void drop_delayed_ref(struct
>> btrfs_trans_handle *trans,
>>                 trans->delayed_ref_updates--;
>>   }
>>
>> +static bool merge_ref(struct btrfs_trans_handle *trans,
>> +                     struct btrfs_delayed_ref_root *delayed_refs,
>> +                     struct btrfs_delayed_ref_head *head,
>> +                     struct btrfs_delayed_ref_node *ref,
>> +                     u64 seq)
>> +{
>> +       struct btrfs_delayed_ref_node *next;
>> +       bool done = false;
>> +
>> +       next = list_first_entry(&head->ref_list, struct
>> btrfs_delayed_ref_node,
>> +                               list);
>> +       while (!done && &next->list != &head->ref_list) {
>> +               int mod;
>> +               struct btrfs_delayed_ref_node *next2;
>> +
>> +               next2 = list_next_entry(next, list);
>> +
>> +               if (next == ref)
>> +                       goto next;
>> +
>> +               if (seq && next->seq >= seq)
>> +                       goto next;
>> +
>> +               if (next->type != ref->type || next->no_quota !=
>> ref->no_quota)
>> +                       goto next;
>> +
>> +               if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
>> +                    ref->type == BTRFS_SHARED_BLOCK_REF_KEY) &&
>> +                   comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref),
>> +                                  btrfs_delayed_node_to_tree_ref(next),
>> +                                  ref->type))
>> +                       goto next;
>> +               if ((ref->type == BTRFS_EXTENT_DATA_REF_KEY ||
>> +                    ref->type == BTRFS_SHARED_DATA_REF_KEY) &&
>> +                   comp_data_refs(btrfs_delayed_node_to_data_ref(ref),
>> +                                  btrfs_delayed_node_to_data_ref(next)))
>> +                       goto next;
>> +
>> +               if (ref->action == next->action) {
>> +                       mod = next->ref_mod;
>> +               } else {
>> +                       if (ref->ref_mod < next->ref_mod) {
>> +                               swap(ref, next);
>> +                               done = true;
>> +                       }
>> +                       mod = -next->ref_mod;
>> +               }
>> +
>> +               drop_delayed_ref(trans, delayed_refs, head, next);
>> +               ref->ref_mod += mod;
>> +               if (ref->ref_mod == 0) {
>> +                       drop_delayed_ref(trans, delayed_refs, head, ref);
>> +                       done = true;
>> +               } else {
>> +                       /*
>> +                        * Can't have multiples of the same ref on a tree
>> block.
>> +                        */
>> +                       WARN_ON(ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
>> +                               ref->type == BTRFS_SHARED_BLOCK_REF_KEY);
>> +               }
>> +next:
>> +               next = next2;
>> +       }
>> +
>> +       return done;
>> +}
>> +
>> +void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
>> +                             struct btrfs_fs_info *fs_info,
>> +                             struct btrfs_delayed_ref_root *delayed_refs,
>> +                             struct btrfs_delayed_ref_head *head)
>> +{
>> +       struct btrfs_delayed_ref_node *ref;
>> +       u64 seq = 0;
>> +
>> +       assert_spin_locked(&head->lock);
>> +
>> +       if (list_empty(&head->ref_list))
>> +               return;
>> +
>> +       /* We don't have too many refs to merge for data. */
>> +       if (head->is_data)
>> +               return;
>> +
>> +       spin_lock(&fs_info->tree_mod_seq_lock);
>> +       if (!list_empty(&fs_info->tree_mod_seq_list)) {
>> +               struct seq_list *elem;
>> +
>> +               elem = list_first_entry(&fs_info->tree_mod_seq_list,
>> +                                       struct seq_list, list);
>> +               seq = elem->seq;
>> +       }
>> +       spin_unlock(&fs_info->tree_mod_seq_lock);
>> +
>> +       ref = list_first_entry(&head->ref_list, struct
>> btrfs_delayed_ref_node,
>> +                              list);
>> +       while (&ref->list != &head->ref_list) {
>> +               if (seq && ref->seq >= seq)
>> +                       goto next;
>> +
>> +               if (merge_ref(trans, delayed_refs, head, ref, seq)) {
>> +                       if (list_empty(&head->ref_list))
>> +                               break;
>> +                       ref = list_first_entry(&head->ref_list,
>> +                                              struct
>> btrfs_delayed_ref_node,
>> +                                              list);
>> +                       continue;
>> +               }
>> +next:
>> +               ref = list_next_entry(ref, list);
>> +       }
>> +}
>> +
>>   int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info,
>>                             struct btrfs_delayed_ref_root *delayed_refs,
>>                             u64 seq)
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 522fb45..42d9310 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -2433,7 +2433,21 @@ static noinline int __btrfs_run_delayed_refs(struct
>> btrfs_trans_handle *trans,
>>                         }
>>                 }
>>
>> +               /*
>> +                * We need to try and merge add/drops of the same ref
>> since we
>> +                * can run into issues with relocate dropping the implicit
>> ref
>> +                * and then it being added back again before the drop can
>> +                * finish.  If we merged anything we need to re-loop so we
>> can
>> +                * get a good ref.
>> +                * Or we can get node references of the same type that
>> weren't
>> +                * merged when created due to bumps in the tree mod seq,
>> and
>> +                * we need to merge them to prevent adding an inline
>> extent
>> +                * backref before dropping it (triggering a BUG_ON at
>> +                * insert_inline_extent_backref()).
>> +                */
>>                 spin_lock(&locked_ref->lock);
>> +               btrfs_merge_delayed_refs(trans, fs_info, delayed_refs,
>> +                                        locked_ref);
>>
>>                 /*
>>                  * locked_ref is the head node, so we have to go one
>>
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Btrfs: fix regression when running delayed references
  2015-10-22  9:43   ` Filipe Manana
@ 2015-10-22  9:47     ` Filipe Manana
  2015-10-22 14:58       ` Stéphane Lesimple
  2015-10-22 10:06     ` Qu Wenruo
  1 sibling, 1 reply; 15+ messages in thread
From: Filipe Manana @ 2015-10-22  9:47 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Filipe Manana

On Thu, Oct 22, 2015 at 10:43 AM, Filipe Manana <fdmanana@kernel.org> wrote:
> On Thu, Oct 22, 2015 at 10:32 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>>
>>
>>  wrote on 2015/10/22 09:47 +0100:
>>>
>>> From: Filipe Manana <fdmanana@suse.com>
>>>
>>> In the kernel 4.2 merge window we had a refactoring/rework of the delayed
>>> references implementation in order to fix certain problems with qgroups.
>>> However that rework introduced one more regression that leads to the
>>> following trace when running delayed references for metadata:
>>>
>>> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
>>> [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>>> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic
>>> xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache
>>> sunrpc loop fuse parport_pc psmouse i2
>>> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G        W
>>> 4.3.0-rc5-btrfs-next-17+ #1
>>> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>>> rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
>>> [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper
>>> [btrfs]
>>> [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti:
>>> ffff88010c4c8000
>>> [35908.065201] RIP: 0010:[<ffffffffa04928b5>]  [<ffffffffa04928b5>]
>>> insert_inline_extent_backref+0x52/0xb1 [btrfs]
>>> [35908.065201] RSP: 0018:ffff88010c4cbb08  EFLAGS: 00010293
>>> [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX:
>>> 0000000000000000
>>> [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI:
>>> 0000000000000000
>>> [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09:
>>> ffff88010c4cb9f8
>>> [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12:
>>> 0000000000000000
>>> [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15:
>>> 0000000000000000
>>> [35908.065201] FS:  0000000000000000(0000) GS:ffff88023edc0000(0000)
>>> knlGS:0000000000000000
>>> [35908.065201] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>> [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4:
>>> 00000000000006e0
>>> [35908.065201] Stack:
>>> [35908.065201]  ffff88010c4cbb18 0000000000000f37 ffff88020a74c578
>>> ffff88015a408000
>>> [35908.065201]  ffff880154a44000 0000000000000000 0000000000000005
>>> ffff88010c4cbbd8
>>> [35908.065201]  ffffffffa0492b9a 0000000000000005 0000000000000000
>>> 0000000000000000
>>> [35908.065201] Call Trace:
>>> [35908.065201]  [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208
>>> [btrfs]
>>> [35908.065201]  [<ffffffffa0497117>] ?
>>> __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
>>> [35908.065201]  [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33
>>> [btrfs]
>>> [35908.065201]  [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f
>>> [btrfs]
>>> [35908.065201]  [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f
>>> [btrfs]
>>> [35908.065201]  [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd
>>> [btrfs]
>>> [35908.065201]  [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b
>>> [btrfs]
>>> [35908.065201]  [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a
>>> [btrfs]
>>> [35908.065201]  [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14
>>> [btrfs]
>>> [35908.065201]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
>>> [35908.065201]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
>>> [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
>>> [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
>>> [35908.065201]  [<ffffffff8106904d>] kthread+0xef/0xf7
>>> [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
>>> [35908.065201]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
>>> [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
>>> [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48
>>> 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02
>>> <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
>>> [35908.065201] RIP  [<ffffffffa04928b5>]
>>> insert_inline_extent_backref+0x52/0xb1 [btrfs]
>>> [35908.065201]  RSP <ffff88010c4cbb08>
>>> [35908.310885] ---[ end trace fe4299baf0666457 ]---
>>>
>>> This happens because the new delayed references code no longer merges
>>> delayed references that have different sequence values. The following
>>> steps are an example sequence leading to this issue:
>>>
>>> 1) Transaction N starts, fs_info->tree_mod_seq has value 0;
>>>
>>> 2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
>>>     bytenr A is created, with a value of 1 and a seq value of 0;
>>>
>>> 3) fs_info->tree_mod_seq is incremented to 1;
>>>
>>> 4) Extent buffer A is deleted through btrfs_del_items(), which calls
>>>     btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
>>>     later returns the metadata extent associated to extent buffer A to
>>>     the free space cache (the range is not pinned), because the extent
>>>     buffer was created in the current transaction (N) and writeback never
>>>     happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
>>>     in the extent buffer).
>>>     This creates the delayed reference Ref2 for bytenr A, with a value
>>>     of -1 and a seq value of 1;
>>>
>>> 5) Delayed reference Ref2 is not merged with Ref1 when we create it,
>>>     because they have different sequence numbers (decided at
>>>     add_delayed_ref_tail_merge());
>>>
>>> 6) fs_info->tree_mod_seq is incremented to 2;
>>>
>>> 7) Some task attempts to allocate a new extent buffer (done at
>>>     extent-tree.c:find_free_extent()), but due to heavy fragmentation
>>>     and running low on metadata space the clustered allocation fails
>>>     and we fall back to unclustered allocation, which finds the
>>>     extent at offset A, so a new extent buffer at offset A is allocated.
>>>     This creates delayed reference Ref3 for bytenr A, with a value of -1
>>>     and a seq value of 2;
>>>
>>> 8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
>>>     all have different seq values;
>>>
>>> 9) We start running the delayed references (__btrfs_run_delayed_refs());
>>>
>>> 10) The delayed Ref1 is the first one being applied, which ends up
>>>      creating an inline extent backref in the extent tree;
>>>
>>> 10) Next the delayed reference Ref3 is selected for execution, and not
>>>      Ref2, because select_delayed_ref() always gives a preference for
>>>      positive references (that have an action of BTRFS_ADD_DELAYED_REF);
>>>
>>> 11) When running Ref3 we encounter alreay the inline extent backref
>>>      in the extent tree at insert_inline_extent_backref(), which makes
>>>      us hit the following BUG_ON:
>>>
>>>          BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
>>>
>>>      This is always true because owner corresponds to the level of the
>>>      extent buffer/btree node in the btree.
>>>
>>> For the scenario described above we hit the BUG_ON because we never merge
>>> references that have different seq values.
>>>
>>> We used to do the merging before the 4.2 kernel, more specifically, before
>>> the commmits:
>>>
>>>    c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in
>>> ref_head.")
>>>    c43d160fcd5e ("btrfs: delayed-ref: Cleanup the unneeded functions.")
>>>
>>> This issue became more exposed after the following change that was added
>>> to 4.2 as well:
>>>
>>>    cffc3374e567 ("Btrfs: fix order by which delayed references are run")
>>>
>>> Which in turn fixed another regression by the two commits previously
>>> mentioned.
>>>
>>> So fix this by bringing back the delayed reference merge code, with the
>>> proper adaptations so that it operates against the new data structure
>>> (linked list vs old red black tree implementation).
>>>
>>> This issue was hit running fstest btrfs/063 in a loop. Several people have
>>> reported this issue in the mailing list when running on kernels 4.2+.
>>
>>
>> Thanks Filipe,
>>
>> My fault again. :(
>> But I'm not completely sure about if tree_mod_seq is still needed now.
>>
>> IIRC, with the new qgroup accounting happen at commit_transaction time,
>> btrfs_find_all_roots() should either searching commit tree for old roots, or
>> search current tree for new roots.
>> No need to search using tree_mod_seq.
>>
>> If so, I'd like just allow merging refs without checking tree_mod_seq.

And to make it clear, that wouldn't work. If a backref walker starts
iterating the btrees and then new delayed refs get merged
independently of the current tree mod seq, the walker will see an
inconsistent state in the extent tree if the delayed references are
run (which can happen often before a transaction commit).

So either make delayed references continue using tree mod seq as
before 4.2, or come with a whole new mechanism that replaces the tree
mod seq while still giving the same consistency guarantees.

>>
>> I was going to do it but not completely sure is there any other user of
>> tree_mod_seq.
>> And if it's possible to get rid of tree_mod_seq and merge with last
>> delayed_ref, things should get cleaner without new codes.
>
> Well, the tree mod seq is what allows backref walkers (and possibly
> other paths) to get a consistent view of all btrees and delayed refs
> state while doing some processing - that's why we have calls to
> btrfs_check_delayed_seq() when running delayed references - so that
> any backref walker will not see changes that happen after it started,
> i.e. it will see a consistent view of all the btrees (like an
> in-memory snapshot of all btrees while the transaction is running).
>
> I don't think you can get this level of consistency through any other
> existing means.
> So just adding back yet more code that was removed despite still being needed.
>
> This is affecting way too many people now, I would like to get this
> fixed and later, if there's a better (new) solution for this, we can
> get it in.
>
> thanks
>
>>
>> Thanks,
>> Qu
>>
>>
>>
>>>
>>> Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root
>>> in ref_head.")
>>> Reported-by: Peter Becker <floyd.net@gmail.com>
>>> Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
>>> Reported-by: Malte Schröder <malte@tnxip.de>
>>> Reported-by: Derek Dongray <derek@valedon.co.uk>
>>> Reported-by: Erkki Seppala <flux-btrfs@inside.org>
>>> Cc: stable@vger.kernel.org  # 4.2+
>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>> ---
>>>   fs/btrfs/delayed-ref.c | 113
>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>   fs/btrfs/extent-tree.c |  14 ++++++
>>>   2 files changed, 127 insertions(+)
>>>
>>> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
>>> index ac3e81d..4832943 100644
>>> --- a/fs/btrfs/delayed-ref.c
>>> +++ b/fs/btrfs/delayed-ref.c
>>> @@ -197,6 +197,119 @@ static inline void drop_delayed_ref(struct
>>> btrfs_trans_handle *trans,
>>>                 trans->delayed_ref_updates--;
>>>   }
>>>
>>> +static bool merge_ref(struct btrfs_trans_handle *trans,
>>> +                     struct btrfs_delayed_ref_root *delayed_refs,
>>> +                     struct btrfs_delayed_ref_head *head,
>>> +                     struct btrfs_delayed_ref_node *ref,
>>> +                     u64 seq)
>>> +{
>>> +       struct btrfs_delayed_ref_node *next;
>>> +       bool done = false;
>>> +
>>> +       next = list_first_entry(&head->ref_list, struct
>>> btrfs_delayed_ref_node,
>>> +                               list);
>>> +       while (!done && &next->list != &head->ref_list) {
>>> +               int mod;
>>> +               struct btrfs_delayed_ref_node *next2;
>>> +
>>> +               next2 = list_next_entry(next, list);
>>> +
>>> +               if (next == ref)
>>> +                       goto next;
>>> +
>>> +               if (seq && next->seq >= seq)
>>> +                       goto next;
>>> +
>>> +               if (next->type != ref->type || next->no_quota !=
>>> ref->no_quota)
>>> +                       goto next;
>>> +
>>> +               if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
>>> +                    ref->type == BTRFS_SHARED_BLOCK_REF_KEY) &&
>>> +                   comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref),
>>> +                                  btrfs_delayed_node_to_tree_ref(next),
>>> +                                  ref->type))
>>> +                       goto next;
>>> +               if ((ref->type == BTRFS_EXTENT_DATA_REF_KEY ||
>>> +                    ref->type == BTRFS_SHARED_DATA_REF_KEY) &&
>>> +                   comp_data_refs(btrfs_delayed_node_to_data_ref(ref),
>>> +                                  btrfs_delayed_node_to_data_ref(next)))
>>> +                       goto next;
>>> +
>>> +               if (ref->action == next->action) {
>>> +                       mod = next->ref_mod;
>>> +               } else {
>>> +                       if (ref->ref_mod < next->ref_mod) {
>>> +                               swap(ref, next);
>>> +                               done = true;
>>> +                       }
>>> +                       mod = -next->ref_mod;
>>> +               }
>>> +
>>> +               drop_delayed_ref(trans, delayed_refs, head, next);
>>> +               ref->ref_mod += mod;
>>> +               if (ref->ref_mod == 0) {
>>> +                       drop_delayed_ref(trans, delayed_refs, head, ref);
>>> +                       done = true;
>>> +               } else {
>>> +                       /*
>>> +                        * Can't have multiples of the same ref on a tree
>>> block.
>>> +                        */
>>> +                       WARN_ON(ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
>>> +                               ref->type == BTRFS_SHARED_BLOCK_REF_KEY);
>>> +               }
>>> +next:
>>> +               next = next2;
>>> +       }
>>> +
>>> +       return done;
>>> +}
>>> +
>>> +void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
>>> +                             struct btrfs_fs_info *fs_info,
>>> +                             struct btrfs_delayed_ref_root *delayed_refs,
>>> +                             struct btrfs_delayed_ref_head *head)
>>> +{
>>> +       struct btrfs_delayed_ref_node *ref;
>>> +       u64 seq = 0;
>>> +
>>> +       assert_spin_locked(&head->lock);
>>> +
>>> +       if (list_empty(&head->ref_list))
>>> +               return;
>>> +
>>> +       /* We don't have too many refs to merge for data. */
>>> +       if (head->is_data)
>>> +               return;
>>> +
>>> +       spin_lock(&fs_info->tree_mod_seq_lock);
>>> +       if (!list_empty(&fs_info->tree_mod_seq_list)) {
>>> +               struct seq_list *elem;
>>> +
>>> +               elem = list_first_entry(&fs_info->tree_mod_seq_list,
>>> +                                       struct seq_list, list);
>>> +               seq = elem->seq;
>>> +       }
>>> +       spin_unlock(&fs_info->tree_mod_seq_lock);
>>> +
>>> +       ref = list_first_entry(&head->ref_list, struct
>>> btrfs_delayed_ref_node,
>>> +                              list);
>>> +       while (&ref->list != &head->ref_list) {
>>> +               if (seq && ref->seq >= seq)
>>> +                       goto next;
>>> +
>>> +               if (merge_ref(trans, delayed_refs, head, ref, seq)) {
>>> +                       if (list_empty(&head->ref_list))
>>> +                               break;
>>> +                       ref = list_first_entry(&head->ref_list,
>>> +                                              struct
>>> btrfs_delayed_ref_node,
>>> +                                              list);
>>> +                       continue;
>>> +               }
>>> +next:
>>> +               ref = list_next_entry(ref, list);
>>> +       }
>>> +}
>>> +
>>>   int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info,
>>>                             struct btrfs_delayed_ref_root *delayed_refs,
>>>                             u64 seq)
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 522fb45..42d9310 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -2433,7 +2433,21 @@ static noinline int __btrfs_run_delayed_refs(struct
>>> btrfs_trans_handle *trans,
>>>                         }
>>>                 }
>>>
>>> +               /*
>>> +                * We need to try and merge add/drops of the same ref
>>> since we
>>> +                * can run into issues with relocate dropping the implicit
>>> ref
>>> +                * and then it being added back again before the drop can
>>> +                * finish.  If we merged anything we need to re-loop so we
>>> can
>>> +                * get a good ref.
>>> +                * Or we can get node references of the same type that
>>> weren't
>>> +                * merged when created due to bumps in the tree mod seq,
>>> and
>>> +                * we need to merge them to prevent adding an inline
>>> extent
>>> +                * backref before dropping it (triggering a BUG_ON at
>>> +                * insert_inline_extent_backref()).
>>> +                */
>>>                 spin_lock(&locked_ref->lock);
>>> +               btrfs_merge_delayed_refs(trans, fs_info, delayed_refs,
>>> +                                        locked_ref);
>>>
>>>                 /*
>>>                  * locked_ref is the head node, so we have to go one
>>>
>>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Btrfs: fix regression when running delayed references
  2015-10-22  8:47 [PATCH] Btrfs: fix regression when running delayed references fdmanana
  2015-10-22  9:32 ` Qu Wenruo
@ 2015-10-22 10:05 ` Koen Kooi
  2015-10-22 10:12   ` Filipe Manana
  1 sibling, 1 reply; 15+ messages in thread
From: Koen Kooi @ 2015-10-22 10:05 UTC (permalink / raw)
  To: linux-btrfs

Op 22-10-15 om 10:47 schreef fdmanana@kernel.org:
> From: Filipe Manana <fdmanana@suse.com>
> 
> In the kernel 4.2 merge window we had a refactoring/rework of the delayed
> references implementation in order to fix certain problems with qgroups.
> However that rework introduced one more regression that leads to the
> following trace when running delayed references for metadata:
> 
> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
> [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2
> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
> [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
> [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: ffff88010c4c8000
> [35908.065201] RIP: 0010:[<ffffffffa04928b5>]  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
> [35908.065201] RSP: 0018:ffff88010c4cbb08  EFLAGS: 00010293
> [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000
> [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000
> [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8
> [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000
> [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000
> [35908.065201] FS:  0000000000000000(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
> [35908.065201] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0
> [35908.065201] Stack:
> [35908.065201]  ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000
> [35908.065201]  ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8
> [35908.065201]  ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000
> [35908.065201] Call Trace:
> [35908.065201]  [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
> [35908.065201]  [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
> [35908.065201]  [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs]
> [35908.065201]  [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f [btrfs]
> [35908.065201]  [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f [btrfs]
> [35908.065201]  [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
> [35908.065201]  [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b [btrfs]
> [35908.065201]  [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a [btrfs]
> [35908.065201]  [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14 [btrfs]
> [35908.065201]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
> [35908.065201]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
> [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
> [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
> [35908.065201]  [<ffffffff8106904d>] kthread+0xef/0xf7
> [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
> [35908.065201]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
> [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
> [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
> [35908.065201] RIP  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
> [35908.065201]  RSP <ffff88010c4cbb08>
> [35908.310885] ---[ end trace fe4299baf0666457 ]---

Would this also solve this:

Oct 22 12:03:20 beast kernel: WARNING: CPU: 5 PID: 323 at lib/list_debug.c:62
__list_del_entry+0x5a/0x98()
Oct 22 12:03:20 beast kernel: list_del corruption. next->prev should be
ffff88033f864500, but was ffff88033f8642c0
Oct 22 12:03:20 beast kernel: Modules linked in: arc4 md4 nls_utf8 cifs
dns_resolver fscache ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack veth loop b43
mac80211 cfg80211 ssb mmc_core kvm_intel kvm crct10dif_pclmul crc32_pclmul
ghash_clmulni_intel serio_raw sb_edac edac_core i2c_i801 btusb btrtl btintel
btbcm bluetooth joydev bcma rfkill cp210x tpm_infineon tpm_tis tpm
sch_fq_codel radeon crc32c_intel ttm drm_kms_helper
Oct 22 12:03:20 beast kernel: CPU: 5 PID: 323 Comm: kworker/u16:12 Tainted: G
       W       4.2.2 #50
Oct 22 12:03:20 beast kernel: Hardware name: System manufacturer System
Product Name/X79-DELUXE, BIOS 0901 06/20/2014
Oct 22 12:03:20 beast kernel: Workqueue: btrfs-delalloc btrfs_delalloc_helper
Oct 22 12:03:20 beast kernel:  0000000000000009 ffff88013993fb98
ffffffff8170b663 0000000000000006
Oct 22 12:03:20 beast kernel:  ffff88013993fbe8 ffff88013993fbd8
ffffffff8106aa40 ffff88013993fc78
Oct 22 12:03:20 beast kernel:  ffffffff813392c3 ffff88033f864480
ffff88033f864500 ffff880ba968d510
Oct 22 12:03:20 beast kernel: Call Trace:
Oct 22 12:03:20 beast kernel:  [<ffffffff8170b663>] dump_stack+0x45/0x57
Oct 22 12:03:20 beast kernel:  [<ffffffff8106aa40>] warn_slowpath_common+0xa1/0xbb
Oct 22 12:03:20 beast kernel:  [<ffffffff813392c3>] ? __list_del_entry+0x5a/0x98
Oct 22 12:03:20 beast kernel:  [<ffffffff8106aaa0>] warn_slowpath_fmt+0x46/0x48
Oct 22 12:03:20 beast kernel:  [<ffffffff811699ba>] ? kmem_cache_alloc+0x56/0x10b
Oct 22 12:03:20 beast kernel:  [<ffffffff813392c3>] __list_del_entry+0x5a/0x98
Oct 22 12:03:20 beast kernel:  [<ffffffff8127063d>]
remove_extent_mapping+0x4d/0x65
Oct 22 12:03:20 beast kernel:  [<ffffffff8126d38f>]
btrfs_drop_extent_cache+0x34f/0x388
Oct 22 12:03:20 beast kernel:  [<ffffffff8124edb5>] ?
btrfs_reserve_extent+0x75/0x120
Oct 22 12:03:20 beast kernel:  [<ffffffff81264380>]
submit_compressed_extents+0x1c8/0x430
Oct 22 12:03:20 beast kernel:  [<ffffffff8126466f>] async_cow_submit+0x87/0x8c
Oct 22 12:03:20 beast kernel:  [<ffffffff81284fe2>] normal_work_helper+0x125/0x1c5
Oct 22 12:03:20 beast kernel:  [<ffffffff812850a8>]
btrfs_delalloc_helper+0x12/0x14
Oct 22 12:03:20 beast kernel:  [<ffffffff8107d91c>] process_one_work+0x165/0x29a
Oct 22 12:03:20 beast kernel:  [<ffffffff8107ded4>] worker_thread+0x1ef/0x29a
Oct 22 12:03:20 beast kernel:  [<ffffffff8107dce5>] ? rescuer_thread+0x265/0x265
Oct 22 12:03:20 beast kernel:  [<ffffffff8107dce5>] ? rescuer_thread+0x265/0x265
Oct 22 12:03:20 beast kernel:  [<ffffffff810820e1>] kthread+0xb7/0xbf
Oct 22 12:03:20 beast kernel:  [<ffffffff8108202a>] ? kthread_parkme+0x24/0x24
Oct 22 12:03:20 beast kernel:  [<ffffffff817121cf>] ret_from_fork+0x3f/0x70
Oct 22 12:03:20 beast kernel:  [<ffffffff8108202a>] ? kthread_parkme+0x24/0x24
Oct 22 12:03:20 beast kernel: ---[ end trace 395ca73acb05ecbc ]---

Using the following kernel: Linux beast 4.2.2 #50 SMP Wed Sep 30 13:42:29 CEST
2015 x86_64 GNU/Linux

The above trace shows up every few minutes, basically during every rsync.

regards,

Koen



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Btrfs: fix regression when running delayed references
  2015-10-22  9:43   ` Filipe Manana
  2015-10-22  9:47     ` Filipe Manana
@ 2015-10-22 10:06     ` Qu Wenruo
  1 sibling, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2015-10-22 10:06 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs, Filipe Manana



在 2015年10月22日 17:43, Filipe Manana 写道:
> On Thu, Oct 22, 2015 at 10:32 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>>
>>
>>   wrote on 2015/10/22 09:47 +0100:
>>>
>>> From: Filipe Manana <fdmanana@suse.com>
>>>
>>> In the kernel 4.2 merge window we had a refactoring/rework of the delayed
>>> references implementation in order to fix certain problems with qgroups.
>>> However that rework introduced one more regression that leads to the
>>> following trace when running delayed references for metadata:
>>>
>>> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
>>> [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>>> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic
>>> xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache
>>> sunrpc loop fuse parport_pc psmouse i2
>>> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G        W
>>> 4.3.0-rc5-btrfs-next-17+ #1
>>> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>>> rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
>>> [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper
>>> [btrfs]
>>> [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti:
>>> ffff88010c4c8000
>>> [35908.065201] RIP: 0010:[<ffffffffa04928b5>]  [<ffffffffa04928b5>]
>>> insert_inline_extent_backref+0x52/0xb1 [btrfs]
>>> [35908.065201] RSP: 0018:ffff88010c4cbb08  EFLAGS: 00010293
>>> [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX:
>>> 0000000000000000
>>> [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI:
>>> 0000000000000000
>>> [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09:
>>> ffff88010c4cb9f8
>>> [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12:
>>> 0000000000000000
>>> [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15:
>>> 0000000000000000
>>> [35908.065201] FS:  0000000000000000(0000) GS:ffff88023edc0000(0000)
>>> knlGS:0000000000000000
>>> [35908.065201] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>> [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4:
>>> 00000000000006e0
>>> [35908.065201] Stack:
>>> [35908.065201]  ffff88010c4cbb18 0000000000000f37 ffff88020a74c578
>>> ffff88015a408000
>>> [35908.065201]  ffff880154a44000 0000000000000000 0000000000000005
>>> ffff88010c4cbbd8
>>> [35908.065201]  ffffffffa0492b9a 0000000000000005 0000000000000000
>>> 0000000000000000
>>> [35908.065201] Call Trace:
>>> [35908.065201]  [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208
>>> [btrfs]
>>> [35908.065201]  [<ffffffffa0497117>] ?
>>> __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
>>> [35908.065201]  [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33
>>> [btrfs]
>>> [35908.065201]  [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f
>>> [btrfs]
>>> [35908.065201]  [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f
>>> [btrfs]
>>> [35908.065201]  [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd
>>> [btrfs]
>>> [35908.065201]  [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b
>>> [btrfs]
>>> [35908.065201]  [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a
>>> [btrfs]
>>> [35908.065201]  [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14
>>> [btrfs]
>>> [35908.065201]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
>>> [35908.065201]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
>>> [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
>>> [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
>>> [35908.065201]  [<ffffffff8106904d>] kthread+0xef/0xf7
>>> [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
>>> [35908.065201]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
>>> [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
>>> [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48
>>> 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02
>>> <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
>>> [35908.065201] RIP  [<ffffffffa04928b5>]
>>> insert_inline_extent_backref+0x52/0xb1 [btrfs]
>>> [35908.065201]  RSP <ffff88010c4cbb08>
>>> [35908.310885] ---[ end trace fe4299baf0666457 ]---
>>>
>>> This happens because the new delayed references code no longer merges
>>> delayed references that have different sequence values. The following
>>> steps are an example sequence leading to this issue:
>>>
>>> 1) Transaction N starts, fs_info->tree_mod_seq has value 0;
>>>
>>> 2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
>>>      bytenr A is created, with a value of 1 and a seq value of 0;
>>>
>>> 3) fs_info->tree_mod_seq is incremented to 1;
>>>
>>> 4) Extent buffer A is deleted through btrfs_del_items(), which calls
>>>      btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
>>>      later returns the metadata extent associated to extent buffer A to
>>>      the free space cache (the range is not pinned), because the extent
>>>      buffer was created in the current transaction (N) and writeback never
>>>      happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
>>>      in the extent buffer).
>>>      This creates the delayed reference Ref2 for bytenr A, with a value
>>>      of -1 and a seq value of 1;
>>>
>>> 5) Delayed reference Ref2 is not merged with Ref1 when we create it,
>>>      because they have different sequence numbers (decided at
>>>      add_delayed_ref_tail_merge());
>>>
>>> 6) fs_info->tree_mod_seq is incremented to 2;
>>>
>>> 7) Some task attempts to allocate a new extent buffer (done at
>>>      extent-tree.c:find_free_extent()), but due to heavy fragmentation
>>>      and running low on metadata space the clustered allocation fails
>>>      and we fall back to unclustered allocation, which finds the
>>>      extent at offset A, so a new extent buffer at offset A is allocated.
>>>      This creates delayed reference Ref3 for bytenr A, with a value of -1
>>>      and a seq value of 2;
>>>
>>> 8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
>>>      all have different seq values;
>>>
>>> 9) We start running the delayed references (__btrfs_run_delayed_refs());
>>>
>>> 10) The delayed Ref1 is the first one being applied, which ends up
>>>       creating an inline extent backref in the extent tree;
>>>
>>> 10) Next the delayed reference Ref3 is selected for execution, and not
>>>       Ref2, because select_delayed_ref() always gives a preference for
>>>       positive references (that have an action of BTRFS_ADD_DELAYED_REF);
>>>
>>> 11) When running Ref3 we encounter alreay the inline extent backref
>>>       in the extent tree at insert_inline_extent_backref(), which makes
>>>       us hit the following BUG_ON:
>>>
>>>           BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
>>>
>>>       This is always true because owner corresponds to the level of the
>>>       extent buffer/btree node in the btree.
>>>
>>> For the scenario described above we hit the BUG_ON because we never merge
>>> references that have different seq values.
>>>
>>> We used to do the merging before the 4.2 kernel, more specifically, before
>>> the commmits:
>>>
>>>     c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in
>>> ref_head.")
>>>     c43d160fcd5e ("btrfs: delayed-ref: Cleanup the unneeded functions.")
>>>
>>> This issue became more exposed after the following change that was added
>>> to 4.2 as well:
>>>
>>>     cffc3374e567 ("Btrfs: fix order by which delayed references are run")
>>>
>>> Which in turn fixed another regression by the two commits previously
>>> mentioned.
>>>
>>> So fix this by bringing back the delayed reference merge code, with the
>>> proper adaptations so that it operates against the new data structure
>>> (linked list vs old red black tree implementation).
>>>
>>> This issue was hit running fstest btrfs/063 in a loop. Several people have
>>> reported this issue in the mailing list when running on kernels 4.2+.
>>
>>
>> Thanks Filipe,
>>
>> My fault again. :(
>> But I'm not completely sure about if tree_mod_seq is still needed now.
>>
>> IIRC, with the new qgroup accounting happen at commit_transaction time,
>> btrfs_find_all_roots() should either searching commit tree for old roots, or
>> search current tree for new roots.
>> No need to search using tree_mod_seq.
>>
>> If so, I'd like just allow merging refs without checking tree_mod_seq.
>>
>> I was going to do it but not completely sure is there any other user of
>> tree_mod_seq.
>> And if it's possible to get rid of tree_mod_seq and merge with last
>> delayed_ref, things should get cleaner without new codes.
>
> Well, the tree mod seq is what allows backref walkers (and possibly
> other paths) to get a consistent view of all btrees and delayed refs
> state while doing some processing - that's why we have calls to
> btrfs_check_delayed_seq() when running delayed references - so that
> any backref walker will not see changes that happen after it started,
> i.e. it will see a consistent view of all the btrees (like an
> in-memory snapshot of all btrees while the transaction is running).
>
> I don't think you can get this level of consistency through any other
> existing means.
> So just adding back yet more code that was removed despite still being needed.
>
> This is affecting way too many people now, I would like to get this
> fixed and later, if there's a better (new) solution for this, we can
> get it in.
>
> thanks
>
That's fine.
Let's use the old codes until their is a better solution for it.

Sorry for my fault again.

Thanks,
Qu

>>
>> Thanks,
>> Qu
>>
>>
>>
>>>
>>> Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root
>>> in ref_head.")
>>> Reported-by: Peter Becker <floyd.net@gmail.com>
>>> Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
>>> Reported-by: Malte Schröder <malte@tnxip.de>
>>> Reported-by: Derek Dongray <derek@valedon.co.uk>
>>> Reported-by: Erkki Seppala <flux-btrfs@inside.org>
>>> Cc: stable@vger.kernel.org  # 4.2+
>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>> ---
>>>    fs/btrfs/delayed-ref.c | 113
>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>    fs/btrfs/extent-tree.c |  14 ++++++
>>>    2 files changed, 127 insertions(+)
>>>
>>> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
>>> index ac3e81d..4832943 100644
>>> --- a/fs/btrfs/delayed-ref.c
>>> +++ b/fs/btrfs/delayed-ref.c
>>> @@ -197,6 +197,119 @@ static inline void drop_delayed_ref(struct
>>> btrfs_trans_handle *trans,
>>>                  trans->delayed_ref_updates--;
>>>    }
>>>
>>> +static bool merge_ref(struct btrfs_trans_handle *trans,
>>> +                     struct btrfs_delayed_ref_root *delayed_refs,
>>> +                     struct btrfs_delayed_ref_head *head,
>>> +                     struct btrfs_delayed_ref_node *ref,
>>> +                     u64 seq)
>>> +{
>>> +       struct btrfs_delayed_ref_node *next;
>>> +       bool done = false;
>>> +
>>> +       next = list_first_entry(&head->ref_list, struct
>>> btrfs_delayed_ref_node,
>>> +                               list);
>>> +       while (!done && &next->list != &head->ref_list) {
>>> +               int mod;
>>> +               struct btrfs_delayed_ref_node *next2;
>>> +
>>> +               next2 = list_next_entry(next, list);
>>> +
>>> +               if (next == ref)
>>> +                       goto next;
>>> +
>>> +               if (seq && next->seq >= seq)
>>> +                       goto next;
>>> +
>>> +               if (next->type != ref->type || next->no_quota !=
>>> ref->no_quota)
>>> +                       goto next;
>>> +
>>> +               if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
>>> +                    ref->type == BTRFS_SHARED_BLOCK_REF_KEY) &&
>>> +                   comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref),
>>> +                                  btrfs_delayed_node_to_tree_ref(next),
>>> +                                  ref->type))
>>> +                       goto next;
>>> +               if ((ref->type == BTRFS_EXTENT_DATA_REF_KEY ||
>>> +                    ref->type == BTRFS_SHARED_DATA_REF_KEY) &&
>>> +                   comp_data_refs(btrfs_delayed_node_to_data_ref(ref),
>>> +                                  btrfs_delayed_node_to_data_ref(next)))
>>> +                       goto next;
>>> +
>>> +               if (ref->action == next->action) {
>>> +                       mod = next->ref_mod;
>>> +               } else {
>>> +                       if (ref->ref_mod < next->ref_mod) {
>>> +                               swap(ref, next);
>>> +                               done = true;
>>> +                       }
>>> +                       mod = -next->ref_mod;
>>> +               }
>>> +
>>> +               drop_delayed_ref(trans, delayed_refs, head, next);
>>> +               ref->ref_mod += mod;
>>> +               if (ref->ref_mod == 0) {
>>> +                       drop_delayed_ref(trans, delayed_refs, head, ref);
>>> +                       done = true;
>>> +               } else {
>>> +                       /*
>>> +                        * Can't have multiples of the same ref on a tree
>>> block.
>>> +                        */
>>> +                       WARN_ON(ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
>>> +                               ref->type == BTRFS_SHARED_BLOCK_REF_KEY);
>>> +               }
>>> +next:
>>> +               next = next2;
>>> +       }
>>> +
>>> +       return done;
>>> +}
>>> +
>>> +void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
>>> +                             struct btrfs_fs_info *fs_info,
>>> +                             struct btrfs_delayed_ref_root *delayed_refs,
>>> +                             struct btrfs_delayed_ref_head *head)
>>> +{
>>> +       struct btrfs_delayed_ref_node *ref;
>>> +       u64 seq = 0;
>>> +
>>> +       assert_spin_locked(&head->lock);
>>> +
>>> +       if (list_empty(&head->ref_list))
>>> +               return;
>>> +
>>> +       /* We don't have too many refs to merge for data. */
>>> +       if (head->is_data)
>>> +               return;
>>> +
>>> +       spin_lock(&fs_info->tree_mod_seq_lock);
>>> +       if (!list_empty(&fs_info->tree_mod_seq_list)) {
>>> +               struct seq_list *elem;
>>> +
>>> +               elem = list_first_entry(&fs_info->tree_mod_seq_list,
>>> +                                       struct seq_list, list);
>>> +               seq = elem->seq;
>>> +       }
>>> +       spin_unlock(&fs_info->tree_mod_seq_lock);
>>> +
>>> +       ref = list_first_entry(&head->ref_list, struct
>>> btrfs_delayed_ref_node,
>>> +                              list);
>>> +       while (&ref->list != &head->ref_list) {
>>> +               if (seq && ref->seq >= seq)
>>> +                       goto next;
>>> +
>>> +               if (merge_ref(trans, delayed_refs, head, ref, seq)) {
>>> +                       if (list_empty(&head->ref_list))
>>> +                               break;
>>> +                       ref = list_first_entry(&head->ref_list,
>>> +                                              struct
>>> btrfs_delayed_ref_node,
>>> +                                              list);
>>> +                       continue;
>>> +               }
>>> +next:
>>> +               ref = list_next_entry(ref, list);
>>> +       }
>>> +}
>>> +
>>>    int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info,
>>>                              struct btrfs_delayed_ref_root *delayed_refs,
>>>                              u64 seq)
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 522fb45..42d9310 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -2433,7 +2433,21 @@ static noinline int __btrfs_run_delayed_refs(struct
>>> btrfs_trans_handle *trans,
>>>                          }
>>>                  }
>>>
>>> +               /*
>>> +                * We need to try and merge add/drops of the same ref
>>> since we
>>> +                * can run into issues with relocate dropping the implicit
>>> ref
>>> +                * and then it being added back again before the drop can
>>> +                * finish.  If we merged anything we need to re-loop so we
>>> can
>>> +                * get a good ref.
>>> +                * Or we can get node references of the same type that
>>> weren't
>>> +                * merged when created due to bumps in the tree mod seq,
>>> and
>>> +                * we need to merge them to prevent adding an inline
>>> extent
>>> +                * backref before dropping it (triggering a BUG_ON at
>>> +                * insert_inline_extent_backref()).
>>> +                */
>>>                  spin_lock(&locked_ref->lock);
>>> +               btrfs_merge_delayed_refs(trans, fs_info, delayed_refs,
>>> +                                        locked_ref);
>>>
>>>                  /*
>>>                   * locked_ref is the head node, so we have to go one
>>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Btrfs: fix regression when running delayed references
  2015-10-22 10:05 ` Koen Kooi
@ 2015-10-22 10:12   ` Filipe Manana
  0 siblings, 0 replies; 15+ messages in thread
From: Filipe Manana @ 2015-10-22 10:12 UTC (permalink / raw)
  To: Koen Kooi; +Cc: linux-btrfs

On Thu, Oct 22, 2015 at 11:05 AM, Koen Kooi <koen@dominion.thruhere.net> wrote:
> Op 22-10-15 om 10:47 schreef fdmanana@kernel.org:
>> From: Filipe Manana <fdmanana@suse.com>
>>
>> In the kernel 4.2 merge window we had a refactoring/rework of the delayed
>> references implementation in order to fix certain problems with qgroups.
>> However that rework introduced one more regression that leads to the
>> following trace when running delayed references for metadata:
>>
>> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
>> [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2
>> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
>> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
>> [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
>> [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: ffff88010c4c8000
>> [35908.065201] RIP: 0010:[<ffffffffa04928b5>]  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
>> [35908.065201] RSP: 0018:ffff88010c4cbb08  EFLAGS: 00010293
>> [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000
>> [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000
>> [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8
>> [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000
>> [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000
>> [35908.065201] FS:  0000000000000000(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
>> [35908.065201] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0
>> [35908.065201] Stack:
>> [35908.065201]  ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000
>> [35908.065201]  ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8
>> [35908.065201]  ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000
>> [35908.065201] Call Trace:
>> [35908.065201]  [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
>> [35908.065201]  [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
>> [35908.065201]  [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs]
>> [35908.065201]  [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f [btrfs]
>> [35908.065201]  [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f [btrfs]
>> [35908.065201]  [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
>> [35908.065201]  [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b [btrfs]
>> [35908.065201]  [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a [btrfs]
>> [35908.065201]  [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14 [btrfs]
>> [35908.065201]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
>> [35908.065201]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
>> [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
>> [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
>> [35908.065201]  [<ffffffff8106904d>] kthread+0xef/0xf7
>> [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
>> [35908.065201]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
>> [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
>> [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
>> [35908.065201] RIP  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
>> [35908.065201]  RSP <ffff88010c4cbb08>
>> [35908.310885] ---[ end trace fe4299baf0666457 ]---
>
> Would this also solve this:

No, what you get is a totally different and unrelated problem.

>
> Oct 22 12:03:20 beast kernel: WARNING: CPU: 5 PID: 323 at lib/list_debug.c:62
> __list_del_entry+0x5a/0x98()
> Oct 22 12:03:20 beast kernel: list_del corruption. next->prev should be
> ffff88033f864500, but was ffff88033f8642c0
> Oct 22 12:03:20 beast kernel: Modules linked in: arc4 md4 nls_utf8 cifs
> dns_resolver fscache ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat
> nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack veth loop b43
> mac80211 cfg80211 ssb mmc_core kvm_intel kvm crct10dif_pclmul crc32_pclmul
> ghash_clmulni_intel serio_raw sb_edac edac_core i2c_i801 btusb btrtl btintel
> btbcm bluetooth joydev bcma rfkill cp210x tpm_infineon tpm_tis tpm
> sch_fq_codel radeon crc32c_intel ttm drm_kms_helper
> Oct 22 12:03:20 beast kernel: CPU: 5 PID: 323 Comm: kworker/u16:12 Tainted: G
>        W       4.2.2 #50
> Oct 22 12:03:20 beast kernel: Hardware name: System manufacturer System
> Product Name/X79-DELUXE, BIOS 0901 06/20/2014
> Oct 22 12:03:20 beast kernel: Workqueue: btrfs-delalloc btrfs_delalloc_helper
> Oct 22 12:03:20 beast kernel:  0000000000000009 ffff88013993fb98
> ffffffff8170b663 0000000000000006
> Oct 22 12:03:20 beast kernel:  ffff88013993fbe8 ffff88013993fbd8
> ffffffff8106aa40 ffff88013993fc78
> Oct 22 12:03:20 beast kernel:  ffffffff813392c3 ffff88033f864480
> ffff88033f864500 ffff880ba968d510
> Oct 22 12:03:20 beast kernel: Call Trace:
> Oct 22 12:03:20 beast kernel:  [<ffffffff8170b663>] dump_stack+0x45/0x57
> Oct 22 12:03:20 beast kernel:  [<ffffffff8106aa40>] warn_slowpath_common+0xa1/0xbb
> Oct 22 12:03:20 beast kernel:  [<ffffffff813392c3>] ? __list_del_entry+0x5a/0x98
> Oct 22 12:03:20 beast kernel:  [<ffffffff8106aaa0>] warn_slowpath_fmt+0x46/0x48
> Oct 22 12:03:20 beast kernel:  [<ffffffff811699ba>] ? kmem_cache_alloc+0x56/0x10b
> Oct 22 12:03:20 beast kernel:  [<ffffffff813392c3>] __list_del_entry+0x5a/0x98
> Oct 22 12:03:20 beast kernel:  [<ffffffff8127063d>]
> remove_extent_mapping+0x4d/0x65
> Oct 22 12:03:20 beast kernel:  [<ffffffff8126d38f>]
> btrfs_drop_extent_cache+0x34f/0x388
> Oct 22 12:03:20 beast kernel:  [<ffffffff8124edb5>] ?
> btrfs_reserve_extent+0x75/0x120
> Oct 22 12:03:20 beast kernel:  [<ffffffff81264380>]
> submit_compressed_extents+0x1c8/0x430
> Oct 22 12:03:20 beast kernel:  [<ffffffff8126466f>] async_cow_submit+0x87/0x8c
> Oct 22 12:03:20 beast kernel:  [<ffffffff81284fe2>] normal_work_helper+0x125/0x1c5
> Oct 22 12:03:20 beast kernel:  [<ffffffff812850a8>]
> btrfs_delalloc_helper+0x12/0x14
> Oct 22 12:03:20 beast kernel:  [<ffffffff8107d91c>] process_one_work+0x165/0x29a
> Oct 22 12:03:20 beast kernel:  [<ffffffff8107ded4>] worker_thread+0x1ef/0x29a
> Oct 22 12:03:20 beast kernel:  [<ffffffff8107dce5>] ? rescuer_thread+0x265/0x265
> Oct 22 12:03:20 beast kernel:  [<ffffffff8107dce5>] ? rescuer_thread+0x265/0x265
> Oct 22 12:03:20 beast kernel:  [<ffffffff810820e1>] kthread+0xb7/0xbf
> Oct 22 12:03:20 beast kernel:  [<ffffffff8108202a>] ? kthread_parkme+0x24/0x24
> Oct 22 12:03:20 beast kernel:  [<ffffffff817121cf>] ret_from_fork+0x3f/0x70
> Oct 22 12:03:20 beast kernel:  [<ffffffff8108202a>] ? kthread_parkme+0x24/0x24
> Oct 22 12:03:20 beast kernel: ---[ end trace 395ca73acb05ecbc ]---
>
> Using the following kernel: Linux beast 4.2.2 #50 SMP Wed Sep 30 13:42:29 CEST
> 2015 x86_64 GNU/Linux
>
> The above trace shows up every few minutes, basically during every rsync.
>
> regards,
>
> Koen
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Btrfs: fix regression when running delayed references
  2015-10-22  9:47     ` Filipe Manana
@ 2015-10-22 14:58       ` Stéphane Lesimple
  2015-10-22 17:03         ` Filipe Manana
  0 siblings, 1 reply; 15+ messages in thread
From: Stéphane Lesimple @ 2015-10-22 14:58 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Qu Wenruo, linux-btrfs, Filipe Manana

Le 2015-10-22 11:47, Filipe Manana a écrit :
> On Thu, Oct 22, 2015 at 10:43 AM, Filipe Manana <fdmanana@kernel.org> 
> wrote:
>> On Thu, Oct 22, 2015 at 10:32 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> 
>> wrote:
>>> 
>>> 
>>>  wrote on 2015/10/22 09:47 +0100:
>>>> 
>>>> From: Filipe Manana <fdmanana@suse.com>
>>>> 
>>>> In the kernel 4.2 merge window we had a refactoring/rework of the 
>>>> delayed
>>>> references implementation in order to fix certain problems with 
>>>> qgroups.
>>>> However that rework introduced one more regression that leads to the
>>>> following trace when running delayed references for metadata:
>>>> 
>>>> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
>>>> [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>>>> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs 
>>>> crc32c_generic
>>>> xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace 
>>>> fscache
>>>> sunrpc loop fuse parport_pc psmouse i2
>>>> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G     
>>>>    W
>>>> 4.3.0-rc5-btrfs-next-17+ #1
>>>> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 
>>>> 1996), BIOS
>>>> rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 
>>>> 04/01/2014
>>>> [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper
>>>> [btrfs]
>>>> [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti:
>>>> ffff88010c4c8000
>>>> [35908.065201] RIP: 0010:[<ffffffffa04928b5>]  [<ffffffffa04928b5>]
>>>> insert_inline_extent_backref+0x52/0xb1 [btrfs]
>>>> [35908.065201] RSP: 0018:ffff88010c4cbb08  EFLAGS: 00010293
>>>> [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX:
>>>> 0000000000000000
>>>> [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI:
>>>> 0000000000000000
>>>> [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09:
>>>> ffff88010c4cb9f8
>>>> [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12:
>>>> 0000000000000000
>>>> [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15:
>>>> 0000000000000000
>>>> [35908.065201] FS:  0000000000000000(0000) GS:ffff88023edc0000(0000)
>>>> knlGS:0000000000000000
>>>> [35908.065201] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>> [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4:
>>>> 00000000000006e0
>>>> [35908.065201] Stack:
>>>> [35908.065201]  ffff88010c4cbb18 0000000000000f37 ffff88020a74c578
>>>> ffff88015a408000
>>>> [35908.065201]  ffff880154a44000 0000000000000000 0000000000000005
>>>> ffff88010c4cbbd8
>>>> [35908.065201]  ffffffffa0492b9a 0000000000000005 0000000000000000
>>>> 0000000000000000
>>>> [35908.065201] Call Trace:
>>>> [35908.065201]  [<ffffffffa0492b9a>] 
>>>> __btrfs_inc_extent_ref+0x8b/0x208
>>>> [btrfs]
>>>> [35908.065201]  [<ffffffffa0497117>] ?
>>>> __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
>>>> [35908.065201]  [<ffffffffa049773d>] 
>>>> __btrfs_run_delayed_refs+0xafa/0xd33
>>>> [btrfs]
>>>> [35908.065201]  [<ffffffffa04a976a>] ? 
>>>> join_transaction.isra.10+0x25/0x41f
>>>> [btrfs]
>>>> [35908.065201]  [<ffffffffa04a97ed>] ? 
>>>> join_transaction.isra.10+0xa8/0x41f
>>>> [btrfs]
>>>> [35908.065201]  [<ffffffffa049914d>] 
>>>> btrfs_run_delayed_refs+0x75/0x1dd
>>>> [btrfs]
>>>> [35908.065201]  [<ffffffffa04992f1>] 
>>>> delayed_ref_async_start+0x3c/0x7b
>>>> [btrfs]
>>>> [35908.065201]  [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a
>>>> [btrfs]
>>>> [35908.065201]  [<ffffffffa04d4e93>] 
>>>> btrfs_extent_refs_helper+0x12/0x14
>>>> [btrfs]
>>>> [35908.065201]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
>>>> [35908.065201]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
>>>> [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
>>>> [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
>>>> [35908.065201]  [<ffffffff8106904d>] kthread+0xef/0xf7
>>>> [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
>>>> [35908.065201]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
>>>> [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
>>>> [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 
>>>> c8 48
>>>> 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 
>>>> 77 02
>>>> <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
>>>> [35908.065201] RIP  [<ffffffffa04928b5>]
>>>> insert_inline_extent_backref+0x52/0xb1 [btrfs]
>>>> [35908.065201]  RSP <ffff88010c4cbb08>
>>>> [35908.310885] ---[ end trace fe4299baf0666457 ]---
>>>> 
>>>> This happens because the new delayed references code no longer 
>>>> merges
>>>> delayed references that have different sequence values. The 
>>>> following
>>>> steps are an example sequence leading to this issue:
>>>> 
>>>> 1) Transaction N starts, fs_info->tree_mod_seq has value 0;
>>>> 
>>>> 2) Extent buffer (btree node) A is allocated, delayed reference Ref1 
>>>> for
>>>>     bytenr A is created, with a value of 1 and a seq value of 0;
>>>> 
>>>> 3) fs_info->tree_mod_seq is incremented to 1;
>>>> 
>>>> 4) Extent buffer A is deleted through btrfs_del_items(), which calls
>>>>     btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). 
>>>> The
>>>>     later returns the metadata extent associated to extent buffer A 
>>>> to
>>>>     the free space cache (the range is not pinned), because the 
>>>> extent
>>>>     buffer was created in the current transaction (N) and writeback 
>>>> never
>>>>     happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN 
>>>> not set
>>>>     in the extent buffer).
>>>>     This creates the delayed reference Ref2 for bytenr A, with a 
>>>> value
>>>>     of -1 and a seq value of 1;
>>>> 
>>>> 5) Delayed reference Ref2 is not merged with Ref1 when we create it,
>>>>     because they have different sequence numbers (decided at
>>>>     add_delayed_ref_tail_merge());
>>>> 
>>>> 6) fs_info->tree_mod_seq is incremented to 2;
>>>> 
>>>> 7) Some task attempts to allocate a new extent buffer (done at
>>>>     extent-tree.c:find_free_extent()), but due to heavy 
>>>> fragmentation
>>>>     and running low on metadata space the clustered allocation fails
>>>>     and we fall back to unclustered allocation, which finds the
>>>>     extent at offset A, so a new extent buffer at offset A is 
>>>> allocated.
>>>>     This creates delayed reference Ref3 for bytenr A, with a value 
>>>> of -1
>>>>     and a seq value of 2;
>>>> 
>>>> 8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
>>>>     all have different seq values;
>>>> 
>>>> 9) We start running the delayed references 
>>>> (__btrfs_run_delayed_refs());
>>>> 
>>>> 10) The delayed Ref1 is the first one being applied, which ends up
>>>>      creating an inline extent backref in the extent tree;
>>>> 
>>>> 10) Next the delayed reference Ref3 is selected for execution, and 
>>>> not
>>>>      Ref2, because select_delayed_ref() always gives a preference 
>>>> for
>>>>      positive references (that have an action of 
>>>> BTRFS_ADD_DELAYED_REF);
>>>> 
>>>> 11) When running Ref3 we encounter alreay the inline extent backref
>>>>      in the extent tree at insert_inline_extent_backref(), which 
>>>> makes
>>>>      us hit the following BUG_ON:
>>>> 
>>>>          BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
>>>> 
>>>>      This is always true because owner corresponds to the level of 
>>>> the
>>>>      extent buffer/btree node in the btree.
>>>> 
>>>> For the scenario described above we hit the BUG_ON because we never 
>>>> merge
>>>> references that have different seq values.
>>>> 
>>>> We used to do the merging before the 4.2 kernel, more specifically, 
>>>> before
>>>> the commmits:
>>>> 
>>>>    c6fc24549960 ("btrfs: delayed-ref: Use list to replace the 
>>>> ref_root in
>>>> ref_head.")
>>>>    c43d160fcd5e ("btrfs: delayed-ref: Cleanup the unneeded 
>>>> functions.")
>>>> 
>>>> This issue became more exposed after the following change that was 
>>>> added
>>>> to 4.2 as well:
>>>> 
>>>>    cffc3374e567 ("Btrfs: fix order by which delayed references are 
>>>> run")
>>>> 
>>>> Which in turn fixed another regression by the two commits previously
>>>> mentioned.
>>>> 
>>>> So fix this by bringing back the delayed reference merge code, with 
>>>> the
>>>> proper adaptations so that it operates against the new data 
>>>> structure
>>>> (linked list vs old red black tree implementation).
>>>> 
>>>> This issue was hit running fstest btrfs/063 in a loop. Several 
>>>> people have
>>>> reported this issue in the mailing list when running on kernels 
>>>> 4.2+.
>>> 
>>> 
>>> Thanks Filipe,
>>> 
>>> My fault again. :(
>>> But I'm not completely sure about if tree_mod_seq is still needed 
>>> now.
>>> 
>>> IIRC, with the new qgroup accounting happen at commit_transaction 
>>> time,
>>> btrfs_find_all_roots() should either searching commit tree for old 
>>> roots, or
>>> search current tree for new roots.
>>> No need to search using tree_mod_seq.
>>> 
>>> If so, I'd like just allow merging refs without checking 
>>> tree_mod_seq.
> 
> And to make it clear, that wouldn't work. If a backref walker starts
> iterating the btrees and then new delayed refs get merged
> independently of the current tree mod seq, the walker will see an
> inconsistent state in the extent tree if the delayed references are
> run (which can happen often before a transaction commit).
> 
> So either make delayed references continue using tree mod seq as
> before 4.2, or come with a whole new mechanism that replaces the tree
> mod seq while still giving the same consistency guarantees.

I used your patch at https://patchwork.kernel.org/patch/7463161/ to 
build a 4.3.0-rc6 kernel.
I mounted my FS with skip_balance, cancelled the paused balance (just to 
be sure), then started a new one with :

# btrfs balance start -dconvert=raid5,soft /tank

(half of my data blocks are RAID1, the other half is RAID5, the goal is 
to be fully RAID5).

I got a different stacktrace than the usual one, via netconsole, after a 
few minutes. It's still referencing btrfs_run_delayed_refs though:

[  822.461809] ------------[ cut here ]------------
[  822.461833] kernel BUG at fs/btrfs/extent-tree.c:2287!
[  822.461849] invalid opcode: 0000 [#1] SMP
[  822.461866] Modules linked in: nfnetlink_queue nfnetlink_log 
nfnetlink xt_multiport xt_comment xt_conntrack xt_nat xt_tcpudp xts 
gf128mul drbg ansi_cprng btrfs nf_conntrack_ftp nf_conntrack_sane 
iptable_security iptable_filter iptable_mangle iptable_nat 
nf_conntrack_ipv4 nf_defrag_ipv4
[  822.462365] CPU: 1 PID: 3810 Comm: btrfs-transacti Tainted: G        
W       4.3.0-rc6p7463161+ #3
[  822.462391] Hardware name: ASUS All Series/H87I-PLUS, BIOS 1005 
01/06/2014
[  822.462412] task: ffff88011a4a1a00 ti: ffff8800b6638000 task.ti: 
ffff8800b6638000
[  822.462434] RIP: 0010:[<ffffffffc032310b>]  [<ffffffffc032310b>] 
__btrfs_run_delayed_refs.constprop.73+0x108b/0x10d0 [btrfs]
[  822.462476] RSP: 0018:ffff8800b663bcb0  EFLAGS: 00010202
[  822.462495] RAX: 0000000000000001 RBX: ffff8800b3821888 RCX: 
ffff8800b3a46cb8
[  822.462517] RDX: 0000000000000001 RSI: 00000000000001e1 RDI: 
ffff8800b3a46cb0
[  822.462543] RBP: ffff8800b663bdb8 R08: 0000000000000000 R09: 
ffff8800b3a46cb8
[  822.462565] R10: ffff8800b3a46cb8 R11: ffff8800b3a46cb8 R12: 
0000000000000000
[  822.462587] R13: 0000000000000000 R14: 00000afefd330000 R15: 
ffff8800b3a46c38
[  822.462687] FS:  0000000000000000(0000) GS:ffff88011fb00000(0000) 
knlGS:0000000000000000
[  822.462716] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  822.462736] CR2: 00007ff7a21ca000 CR3: 0000000001c10000 CR4: 
00000000000406e0
[  822.462762] Stack:
[  822.462770]  ffffffffc037c8be ffff88011fb169f0 0000000000000001 
0000000000000001
[  822.462802]  0000000000000001 ffff88011a4a1a60 ffff8800b663bd38 
ffff8800b663bd40
[  822.462831]  ffffffff810aded5 ffffffff8101fcc9 ffff8800b663bd20 
00000000000014d9
[  822.462859] Call Trace:
[  822.462877]  [<ffffffffc037c8be>] ? 
try_merge_free_space.isra.26+0x12e/0x180 [btrfs]
[  822.462902]  [<ffffffff810aded5>] ? put_prev_entity+0x35/0x660
[  822.462924]  [<ffffffff8101fcc9>] ? sched_clock+0x9/0x10
[  822.462949]  [<ffffffffc0325dcd>] btrfs_run_delayed_refs+0x7d/0x2b0 
[btrfs]
[  822.462972]  [<ffffffff817aa7ab>] ? schedule_timeout+0x16b/0x2a0
[  822.462999]  [<ffffffffc033a6b3>] btrfs_commit_transaction+0x43/0xb10 
[btrfs]
[  822.463028]  [<ffffffffc0335c19>] transaction_kthread+0x1a9/0x230 
[btrfs]
[  822.463056]  [<ffffffffc0335a70>] ? 
btrfs_cleanup_transaction+0x550/0x550 [btrfs]
[  822.463080]  [<ffffffff81097099>] kthread+0xc9/0xe0
[  822.463096]  [<ffffffff81096fd0>] ? kthread_park+0x60/0x60
[  822.463116]  [<ffffffff817aba8f>] ret_from_fork+0x3f/0x70
[  822.463134]  [<ffffffff81096fd0>] ? kthread_park+0x60/0x60
[  822.463153] Code: c0 48 8b bd 40 ff ff ff 31 c0 e8 31 94 fe ff 0f 0b 
be d3 00 00 00 48 c7 c7 3b c6 3b c0 e8 6e 63 d5 c0 e9 64 f9 ff ff 0f 0b 
0f 0b <0f> 0b be d3 00 00 00 48 c7 c7 3b c6 3b c0 e8 52 63 d5 c0 e9 60
[  822.463300] RIP  [<ffffffffc032310b>] 
__btrfs_run_delayed_refs.constprop.73+0x108b/0x10d0 [btrfs]
[  822.463335]  RSP <ffff8800b663bcb0>
[  822.472131] ---[ end trace f1e21f38cb0ea144 ]---

A couple other stacktraces follow after some seconds, then the system 
dies completely. sysrqd doesn't even work to reboot it remotely using 
sysrq logic.

fs/btrfs/extent-tree.c:2287 is the line you get from a vanilla 4.3-rc6 + 
your patch. I'll post it as soon as I can get somebody to manually 
reboot this remote machine (my kernel build machine is the same than the 
one hosting the btrfs FS).

Don't hesitate to ask if you need me to debug or even ftrace something.

Thanks,

-- 
Stéphane.

>>> I was going to do it but not completely sure is there any other user 
>>> of
>>> tree_mod_seq.
>>> And if it's possible to get rid of tree_mod_seq and merge with last
>>> delayed_ref, things should get cleaner without new codes.
>> 
>> Well, the tree mod seq is what allows backref walkers (and possibly
>> other paths) to get a consistent view of all btrees and delayed refs
>> state while doing some processing - that's why we have calls to
>> btrfs_check_delayed_seq() when running delayed references - so that
>> any backref walker will not see changes that happen after it started,
>> i.e. it will see a consistent view of all the btrees (like an
>> in-memory snapshot of all btrees while the transaction is running).
>> 
>> I don't think you can get this level of consistency through any other
>> existing means.
>> So just adding back yet more code that was removed despite still being 
>> needed.
>> 
>> This is affecting way too many people now, I would like to get this
>> fixed and later, if there's a better (new) solution for this, we can
>> get it in.
>> 
>> thanks
>> 
>>> 
>>> Thanks,
>>> Qu
>>> 
>>> 
>>> 
>>>> 
>>>> Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the 
>>>> ref_root
>>>> in ref_head.")
>>>> Reported-by: Peter Becker <floyd.net@gmail.com>
>>>> Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
>>>> Reported-by: Malte Schröder <malte@tnxip.de>
>>>> Reported-by: Derek Dongray <derek@valedon.co.uk>
>>>> Reported-by: Erkki Seppala <flux-btrfs@inside.org>
>>>> Cc: stable@vger.kernel.org  # 4.2+
>>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>>> ---
>>>>   fs/btrfs/delayed-ref.c | 113
>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>   fs/btrfs/extent-tree.c |  14 ++++++
>>>>   2 files changed, 127 insertions(+)
>>>> 
>>>> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
>>>> index ac3e81d..4832943 100644
>>>> --- a/fs/btrfs/delayed-ref.c
>>>> +++ b/fs/btrfs/delayed-ref.c
>>>> @@ -197,6 +197,119 @@ static inline void drop_delayed_ref(struct
>>>> btrfs_trans_handle *trans,
>>>>                 trans->delayed_ref_updates--;
>>>>   }
>>>> 
>>>> +static bool merge_ref(struct btrfs_trans_handle *trans,
>>>> +                     struct btrfs_delayed_ref_root *delayed_refs,
>>>> +                     struct btrfs_delayed_ref_head *head,
>>>> +                     struct btrfs_delayed_ref_node *ref,
>>>> +                     u64 seq)
>>>> +{
>>>> +       struct btrfs_delayed_ref_node *next;
>>>> +       bool done = false;
>>>> +
>>>> +       next = list_first_entry(&head->ref_list, struct
>>>> btrfs_delayed_ref_node,
>>>> +                               list);
>>>> +       while (!done && &next->list != &head->ref_list) {
>>>> +               int mod;
>>>> +               struct btrfs_delayed_ref_node *next2;
>>>> +
>>>> +               next2 = list_next_entry(next, list);
>>>> +
>>>> +               if (next == ref)
>>>> +                       goto next;
>>>> +
>>>> +               if (seq && next->seq >= seq)
>>>> +                       goto next;
>>>> +
>>>> +               if (next->type != ref->type || next->no_quota !=
>>>> ref->no_quota)
>>>> +                       goto next;
>>>> +
>>>> +               if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
>>>> +                    ref->type == BTRFS_SHARED_BLOCK_REF_KEY) &&
>>>> +                   
>>>> comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref),
>>>> +                                  
>>>> btrfs_delayed_node_to_tree_ref(next),
>>>> +                                  ref->type))
>>>> +                       goto next;
>>>> +               if ((ref->type == BTRFS_EXTENT_DATA_REF_KEY ||
>>>> +                    ref->type == BTRFS_SHARED_DATA_REF_KEY) &&
>>>> +                   
>>>> comp_data_refs(btrfs_delayed_node_to_data_ref(ref),
>>>> +                                  
>>>> btrfs_delayed_node_to_data_ref(next)))
>>>> +                       goto next;
>>>> +
>>>> +               if (ref->action == next->action) {
>>>> +                       mod = next->ref_mod;
>>>> +               } else {
>>>> +                       if (ref->ref_mod < next->ref_mod) {
>>>> +                               swap(ref, next);
>>>> +                               done = true;
>>>> +                       }
>>>> +                       mod = -next->ref_mod;
>>>> +               }
>>>> +
>>>> +               drop_delayed_ref(trans, delayed_refs, head, next);
>>>> +               ref->ref_mod += mod;
>>>> +               if (ref->ref_mod == 0) {
>>>> +                       drop_delayed_ref(trans, delayed_refs, head, 
>>>> ref);
>>>> +                       done = true;
>>>> +               } else {
>>>> +                       /*
>>>> +                        * Can't have multiples of the same ref on a 
>>>> tree
>>>> block.
>>>> +                        */
>>>> +                       WARN_ON(ref->type == 
>>>> BTRFS_TREE_BLOCK_REF_KEY ||
>>>> +                               ref->type == 
>>>> BTRFS_SHARED_BLOCK_REF_KEY);
>>>> +               }
>>>> +next:
>>>> +               next = next2;
>>>> +       }
>>>> +
>>>> +       return done;
>>>> +}
>>>> +
>>>> +void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
>>>> +                             struct btrfs_fs_info *fs_info,
>>>> +                             struct btrfs_delayed_ref_root 
>>>> *delayed_refs,
>>>> +                             struct btrfs_delayed_ref_head *head)
>>>> +{
>>>> +       struct btrfs_delayed_ref_node *ref;
>>>> +       u64 seq = 0;
>>>> +
>>>> +       assert_spin_locked(&head->lock);
>>>> +
>>>> +       if (list_empty(&head->ref_list))
>>>> +               return;
>>>> +
>>>> +       /* We don't have too many refs to merge for data. */
>>>> +       if (head->is_data)
>>>> +               return;
>>>> +
>>>> +       spin_lock(&fs_info->tree_mod_seq_lock);
>>>> +       if (!list_empty(&fs_info->tree_mod_seq_list)) {
>>>> +               struct seq_list *elem;
>>>> +
>>>> +               elem = list_first_entry(&fs_info->tree_mod_seq_list,
>>>> +                                       struct seq_list, list);
>>>> +               seq = elem->seq;
>>>> +       }
>>>> +       spin_unlock(&fs_info->tree_mod_seq_lock);
>>>> +
>>>> +       ref = list_first_entry(&head->ref_list, struct
>>>> btrfs_delayed_ref_node,
>>>> +                              list);
>>>> +       while (&ref->list != &head->ref_list) {
>>>> +               if (seq && ref->seq >= seq)
>>>> +                       goto next;
>>>> +
>>>> +               if (merge_ref(trans, delayed_refs, head, ref, seq)) 
>>>> {
>>>> +                       if (list_empty(&head->ref_list))
>>>> +                               break;
>>>> +                       ref = list_first_entry(&head->ref_list,
>>>> +                                              struct
>>>> btrfs_delayed_ref_node,
>>>> +                                              list);
>>>> +                       continue;
>>>> +               }
>>>> +next:
>>>> +               ref = list_next_entry(ref, list);
>>>> +       }
>>>> +}
>>>> +
>>>>   int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info,
>>>>                             struct btrfs_delayed_ref_root 
>>>> *delayed_refs,
>>>>                             u64 seq)
>>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>>> index 522fb45..42d9310 100644
>>>> --- a/fs/btrfs/extent-tree.c
>>>> +++ b/fs/btrfs/extent-tree.c
>>>> @@ -2433,7 +2433,21 @@ static noinline int 
>>>> __btrfs_run_delayed_refs(struct
>>>> btrfs_trans_handle *trans,
>>>>                         }
>>>>                 }
>>>> 
>>>> +               /*
>>>> +                * We need to try and merge add/drops of the same 
>>>> ref
>>>> since we
>>>> +                * can run into issues with relocate dropping the 
>>>> implicit
>>>> ref
>>>> +                * and then it being added back again before the 
>>>> drop can
>>>> +                * finish.  If we merged anything we need to re-loop 
>>>> so we
>>>> can
>>>> +                * get a good ref.
>>>> +                * Or we can get node references of the same type 
>>>> that
>>>> weren't
>>>> +                * merged when created due to bumps in the tree mod 
>>>> seq,
>>>> and
>>>> +                * we need to merge them to prevent adding an inline
>>>> extent
>>>> +                * backref before dropping it (triggering a BUG_ON 
>>>> at
>>>> +                * insert_inline_extent_backref()).
>>>> +                */
>>>>                 spin_lock(&locked_ref->lock);
>>>> +               btrfs_merge_delayed_refs(trans, fs_info, 
>>>> delayed_refs,
>>>> +                                        locked_ref);
>>>> 
>>>>                 /*
>>>>                  * locked_ref is the head node, so we have to go one
>>>> 
>>> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" 
> in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Btrfs: fix regression when running delayed references
  2015-10-22 14:58       ` Stéphane Lesimple
@ 2015-10-22 17:03         ` Filipe Manana
  2015-10-22 17:54           ` Stéphane Lesimple
  0 siblings, 1 reply; 15+ messages in thread
From: Filipe Manana @ 2015-10-22 17:03 UTC (permalink / raw)
  To: Stéphane Lesimple; +Cc: Qu Wenruo, linux-btrfs, Filipe Manana

On Thu, Oct 22, 2015 at 3:58 PM, Stéphane Lesimple
<stephane_btrfs@lesimple.fr> wrote:
> Le 2015-10-22 11:47, Filipe Manana a écrit :
>>
>> On Thu, Oct 22, 2015 at 10:43 AM, Filipe Manana <fdmanana@kernel.org>
>> wrote:
>>>
>>> On Thu, Oct 22, 2015 at 10:32 AM, Qu Wenruo <quwenruo@cn.fujitsu.com>
>>> wrote:
>>>>
>>>>
>>>>
>>>>  wrote on 2015/10/22 09:47 +0100:
>>>>>
>>>>>
>>>>> From: Filipe Manana <fdmanana@suse.com>
>>>>>
>>>>> In the kernel 4.2 merge window we had a refactoring/rework of the
>>>>> delayed
>>>>> references implementation in order to fix certain problems with
>>>>> qgroups.
>>>>> However that rework introduced one more regression that leads to the
>>>>> following trace when running delayed references for metadata:
>>>>>
>>>>> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
>>>>> [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>>>>> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic
>>>>> xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace
>>>>> fscache
>>>>> sunrpc loop fuse parport_pc psmouse i2
>>>>> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G
>>>>> W
>>>>> 4.3.0-rc5-btrfs-next-17+ #1
>>>>> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>>>>> BIOS
>>>>> rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
>>>>> [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper
>>>>> [btrfs]
>>>>> [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti:
>>>>> ffff88010c4c8000
>>>>> [35908.065201] RIP: 0010:[<ffffffffa04928b5>]  [<ffffffffa04928b5>]
>>>>> insert_inline_extent_backref+0x52/0xb1 [btrfs]
>>>>> [35908.065201] RSP: 0018:ffff88010c4cbb08  EFLAGS: 00010293
>>>>> [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX:
>>>>> 0000000000000000
>>>>> [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI:
>>>>> 0000000000000000
>>>>> [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09:
>>>>> ffff88010c4cb9f8
>>>>> [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12:
>>>>> 0000000000000000
>>>>> [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15:
>>>>> 0000000000000000
>>>>> [35908.065201] FS:  0000000000000000(0000) GS:ffff88023edc0000(0000)
>>>>> knlGS:0000000000000000
>>>>> [35908.065201] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>>> [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4:
>>>>> 00000000000006e0
>>>>> [35908.065201] Stack:
>>>>> [35908.065201]  ffff88010c4cbb18 0000000000000f37 ffff88020a74c578
>>>>> ffff88015a408000
>>>>> [35908.065201]  ffff880154a44000 0000000000000000 0000000000000005
>>>>> ffff88010c4cbbd8
>>>>> [35908.065201]  ffffffffa0492b9a 0000000000000005 0000000000000000
>>>>> 0000000000000000
>>>>> [35908.065201] Call Trace:
>>>>> [35908.065201]  [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208
>>>>> [btrfs]
>>>>> [35908.065201]  [<ffffffffa0497117>] ?
>>>>> __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
>>>>> [35908.065201]  [<ffffffffa049773d>]
>>>>> __btrfs_run_delayed_refs+0xafa/0xd33
>>>>> [btrfs]
>>>>> [35908.065201]  [<ffffffffa04a976a>] ?
>>>>> join_transaction.isra.10+0x25/0x41f
>>>>> [btrfs]
>>>>> [35908.065201]  [<ffffffffa04a97ed>] ?
>>>>> join_transaction.isra.10+0xa8/0x41f
>>>>> [btrfs]
>>>>> [35908.065201]  [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd
>>>>> [btrfs]
>>>>> [35908.065201]  [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b
>>>>> [btrfs]
>>>>> [35908.065201]  [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a
>>>>> [btrfs]
>>>>> [35908.065201]  [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14
>>>>> [btrfs]
>>>>> [35908.065201]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
>>>>> [35908.065201]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
>>>>> [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
>>>>> [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
>>>>> [35908.065201]  [<ffffffff8106904d>] kthread+0xef/0xf7
>>>>> [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
>>>>> [35908.065201]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
>>>>> [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
>>>>> [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8
>>>>> 48
>>>>> 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77
>>>>> 02
>>>>> <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
>>>>> [35908.065201] RIP  [<ffffffffa04928b5>]
>>>>> insert_inline_extent_backref+0x52/0xb1 [btrfs]
>>>>> [35908.065201]  RSP <ffff88010c4cbb08>
>>>>> [35908.310885] ---[ end trace fe4299baf0666457 ]---
>>>>>
>>>>> This happens because the new delayed references code no longer merges
>>>>> delayed references that have different sequence values. The following
>>>>> steps are an example sequence leading to this issue:
>>>>>
>>>>> 1) Transaction N starts, fs_info->tree_mod_seq has value 0;
>>>>>
>>>>> 2) Extent buffer (btree node) A is allocated, delayed reference Ref1
>>>>> for
>>>>>     bytenr A is created, with a value of 1 and a seq value of 0;
>>>>>
>>>>> 3) fs_info->tree_mod_seq is incremented to 1;
>>>>>
>>>>> 4) Extent buffer A is deleted through btrfs_del_items(), which calls
>>>>>     btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
>>>>>     later returns the metadata extent associated to extent buffer A to
>>>>>     the free space cache (the range is not pinned), because the extent
>>>>>     buffer was created in the current transaction (N) and writeback
>>>>> never
>>>>>     happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not
>>>>> set
>>>>>     in the extent buffer).
>>>>>     This creates the delayed reference Ref2 for bytenr A, with a value
>>>>>     of -1 and a seq value of 1;
>>>>>
>>>>> 5) Delayed reference Ref2 is not merged with Ref1 when we create it,
>>>>>     because they have different sequence numbers (decided at
>>>>>     add_delayed_ref_tail_merge());
>>>>>
>>>>> 6) fs_info->tree_mod_seq is incremented to 2;
>>>>>
>>>>> 7) Some task attempts to allocate a new extent buffer (done at
>>>>>     extent-tree.c:find_free_extent()), but due to heavy fragmentation
>>>>>     and running low on metadata space the clustered allocation fails
>>>>>     and we fall back to unclustered allocation, which finds the
>>>>>     extent at offset A, so a new extent buffer at offset A is
>>>>> allocated.
>>>>>     This creates delayed reference Ref3 for bytenr A, with a value of
>>>>> -1
>>>>>     and a seq value of 2;
>>>>>
>>>>> 8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
>>>>>     all have different seq values;
>>>>>
>>>>> 9) We start running the delayed references
>>>>> (__btrfs_run_delayed_refs());
>>>>>
>>>>> 10) The delayed Ref1 is the first one being applied, which ends up
>>>>>      creating an inline extent backref in the extent tree;
>>>>>
>>>>> 10) Next the delayed reference Ref3 is selected for execution, and not
>>>>>      Ref2, because select_delayed_ref() always gives a preference for
>>>>>      positive references (that have an action of
>>>>> BTRFS_ADD_DELAYED_REF);
>>>>>
>>>>> 11) When running Ref3 we encounter alreay the inline extent backref
>>>>>      in the extent tree at insert_inline_extent_backref(), which makes
>>>>>      us hit the following BUG_ON:
>>>>>
>>>>>          BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
>>>>>
>>>>>      This is always true because owner corresponds to the level of the
>>>>>      extent buffer/btree node in the btree.
>>>>>
>>>>> For the scenario described above we hit the BUG_ON because we never
>>>>> merge
>>>>> references that have different seq values.
>>>>>
>>>>> We used to do the merging before the 4.2 kernel, more specifically,
>>>>> before
>>>>> the commmits:
>>>>>
>>>>>    c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root
>>>>> in
>>>>> ref_head.")
>>>>>    c43d160fcd5e ("btrfs: delayed-ref: Cleanup the unneeded functions.")
>>>>>
>>>>> This issue became more exposed after the following change that was
>>>>> added
>>>>> to 4.2 as well:
>>>>>
>>>>>    cffc3374e567 ("Btrfs: fix order by which delayed references are
>>>>> run")
>>>>>
>>>>> Which in turn fixed another regression by the two commits previously
>>>>> mentioned.
>>>>>
>>>>> So fix this by bringing back the delayed reference merge code, with the
>>>>> proper adaptations so that it operates against the new data structure
>>>>> (linked list vs old red black tree implementation).
>>>>>
>>>>> This issue was hit running fstest btrfs/063 in a loop. Several people
>>>>> have
>>>>> reported this issue in the mailing list when running on kernels 4.2+.
>>>>
>>>>
>>>>
>>>> Thanks Filipe,
>>>>
>>>> My fault again. :(
>>>> But I'm not completely sure about if tree_mod_seq is still needed now.
>>>>
>>>> IIRC, with the new qgroup accounting happen at commit_transaction time,
>>>> btrfs_find_all_roots() should either searching commit tree for old
>>>> roots, or
>>>> search current tree for new roots.
>>>> No need to search using tree_mod_seq.
>>>>
>>>> If so, I'd like just allow merging refs without checking tree_mod_seq.
>>
>>
>> And to make it clear, that wouldn't work. If a backref walker starts
>> iterating the btrees and then new delayed refs get merged
>> independently of the current tree mod seq, the walker will see an
>> inconsistent state in the extent tree if the delayed references are
>> run (which can happen often before a transaction commit).
>>
>> So either make delayed references continue using tree mod seq as
>> before 4.2, or come with a whole new mechanism that replaces the tree
>> mod seq while still giving the same consistency guarantees.
>
>
> I used your patch at https://patchwork.kernel.org/patch/7463161/ to build a
> 4.3.0-rc6 kernel.
> I mounted my FS with skip_balance, cancelled the paused balance (just to be
> sure), then started a new one with :
>
> # btrfs balance start -dconvert=raid5,soft /tank
>
> (half of my data blocks are RAID1, the other half is RAID5, the goal is to
> be fully RAID5).
>
> I got a different stacktrace than the usual one, via netconsole, after a few
> minutes. It's still referencing btrfs_run_delayed_refs though:
>
> [  822.461809] ------------[ cut here ]------------
> [  822.461833] kernel BUG at fs/btrfs/extent-tree.c:2287!
> [  822.461849] invalid opcode: 0000 [#1] SMP
> [  822.461866] Modules linked in: nfnetlink_queue nfnetlink_log nfnetlink
> xt_multiport xt_comment xt_conntrack xt_nat xt_tcpudp xts gf128mul drbg
> ansi_cprng btrfs nf_conntrack_ftp nf_conntrack_sane iptable_security
> iptable_filter iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4
> [  822.462365] CPU: 1 PID: 3810 Comm: btrfs-transacti Tainted: G        W
> 4.3.0-rc6p7463161+ #3
> [  822.462391] Hardware name: ASUS All Series/H87I-PLUS, BIOS 1005
> 01/06/2014
> [  822.462412] task: ffff88011a4a1a00 ti: ffff8800b6638000 task.ti:
> ffff8800b6638000
> [  822.462434] RIP: 0010:[<ffffffffc032310b>]  [<ffffffffc032310b>]
> __btrfs_run_delayed_refs.constprop.73+0x108b/0x10d0 [btrfs]
> [  822.462476] RSP: 0018:ffff8800b663bcb0  EFLAGS: 00010202
> [  822.462495] RAX: 0000000000000001 RBX: ffff8800b3821888 RCX:
> ffff8800b3a46cb8
> [  822.462517] RDX: 0000000000000001 RSI: 00000000000001e1 RDI:
> ffff8800b3a46cb0
> [  822.462543] RBP: ffff8800b663bdb8 R08: 0000000000000000 R09:
> ffff8800b3a46cb8
> [  822.462565] R10: ffff8800b3a46cb8 R11: ffff8800b3a46cb8 R12:
> 0000000000000000
> [  822.462587] R13: 0000000000000000 R14: 00000afefd330000 R15:
> ffff8800b3a46c38
> [  822.462687] FS:  0000000000000000(0000) GS:ffff88011fb00000(0000)
> knlGS:0000000000000000
> [  822.462716] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  822.462736] CR2: 00007ff7a21ca000 CR3: 0000000001c10000 CR4:
> 00000000000406e0
> [  822.462762] Stack:
> [  822.462770]  ffffffffc037c8be ffff88011fb169f0 0000000000000001
> 0000000000000001
> [  822.462802]  0000000000000001 ffff88011a4a1a60 ffff8800b663bd38
> ffff8800b663bd40
> [  822.462831]  ffffffff810aded5 ffffffff8101fcc9 ffff8800b663bd20
> 00000000000014d9
> [  822.462859] Call Trace:
> [  822.462877]  [<ffffffffc037c8be>] ?
> try_merge_free_space.isra.26+0x12e/0x180 [btrfs]
> [  822.462902]  [<ffffffff810aded5>] ? put_prev_entity+0x35/0x660
> [  822.462924]  [<ffffffff8101fcc9>] ? sched_clock+0x9/0x10
> [  822.462949]  [<ffffffffc0325dcd>] btrfs_run_delayed_refs+0x7d/0x2b0
> [btrfs]
> [  822.462972]  [<ffffffff817aa7ab>] ? schedule_timeout+0x16b/0x2a0
> [  822.462999]  [<ffffffffc033a6b3>] btrfs_commit_transaction+0x43/0xb10
> [btrfs]
> [  822.463028]  [<ffffffffc0335c19>] transaction_kthread+0x1a9/0x230 [btrfs]
> [  822.463056]  [<ffffffffc0335a70>] ? btrfs_cleanup_transaction+0x550/0x550
> [btrfs]
> [  822.463080]  [<ffffffff81097099>] kthread+0xc9/0xe0
> [  822.463096]  [<ffffffff81096fd0>] ? kthread_park+0x60/0x60
> [  822.463116]  [<ffffffff817aba8f>] ret_from_fork+0x3f/0x70
> [  822.463134]  [<ffffffff81096fd0>] ? kthread_park+0x60/0x60
> [  822.463153] Code: c0 48 8b bd 40 ff ff ff 31 c0 e8 31 94 fe ff 0f 0b be
> d3 00 00 00 48 c7 c7 3b c6 3b c0 e8 6e 63 d5 c0 e9 64 f9 ff ff 0f 0b 0f 0b
> <0f> 0b be d3 00 00 00 48 c7 c7 3b c6 3b c0 e8 52 63 d5 c0 e9 60
> [  822.463300] RIP  [<ffffffffc032310b>]
> __btrfs_run_delayed_refs.constprop.73+0x108b/0x10d0 [btrfs]
> [  822.463335]  RSP <ffff8800b663bcb0>
> [  822.472131] ---[ end trace f1e21f38cb0ea144 ]---
>
> A couple other stacktraces follow after some seconds, then the system dies
> completely. sysrqd doesn't even work to reboot it remotely using sysrq
> logic.
>
> fs/btrfs/extent-tree.c:2287 is the line you get from a vanilla 4.3-rc6 +
> your patch. I'll post it as soon as I can get somebody to manually reboot
> this remote machine (my kernel build machine is the same than the one
> hosting the btrfs FS).
>
> Don't hesitate to ask if you need me to debug or even ftrace something.

Thanks Stéphane. I haven't seen that crash yet (still running tests
for 2 consecutive days now).
Can you please try the following patch, which works on top of mine,
and enable ftrace before running balance:

Debug patch:  https://friendpaste.com/5s3dItRpcpq3dH1E4KUJor

Enable ftrace:

    $ echo > /sys/kernel/debug/tracing/trace
    $ echo "nop" > /sys/kernel/debug/tracing/current_tracer
    $ echo 100000 > /sys/kernel/debug/tracing/buffer_size_kb   # if
you can use larger buffer size, even better
    $ echo > /sys/kernel/debug/tracing/set_ftrace_filter
    $ echo 1 > /sys/kernel/debug/tracing/tracing_on

    $ run balance... wait until it finishes with IO error or the
patch's printk message shows up in dmesg/syslog

    $ echo 0 > /sys/kernel/debug/tracing/tracing_on

    $ cat /sys/kernel/debug/tracing/trace > some_file.txt

Then send is some_file.txt for debugging, hopefully it will give some
useful information. Note that it might produce tons of messages,
depending on how long it takes for you to hit the BUG_ON.

Thanks a lot for this.





>
> Thanks,
>
> --
> Stéphane.
>
>>>> I was going to do it but not completely sure is there any other user of
>>>> tree_mod_seq.
>>>> And if it's possible to get rid of tree_mod_seq and merge with last
>>>> delayed_ref, things should get cleaner without new codes.
>>>
>>>
>>> Well, the tree mod seq is what allows backref walkers (and possibly
>>> other paths) to get a consistent view of all btrees and delayed refs
>>> state while doing some processing - that's why we have calls to
>>> btrfs_check_delayed_seq() when running delayed references - so that
>>> any backref walker will not see changes that happen after it started,
>>> i.e. it will see a consistent view of all the btrees (like an
>>> in-memory snapshot of all btrees while the transaction is running).
>>>
>>> I don't think you can get this level of consistency through any other
>>> existing means.
>>> So just adding back yet more code that was removed despite still being
>>> needed.
>>>
>>> This is affecting way too many people now, I would like to get this
>>> fixed and later, if there's a better (new) solution for this, we can
>>> get it in.
>>>
>>> thanks
>>>
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>
>>>>
>>>>>
>>>>> Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the
>>>>> ref_root
>>>>> in ref_head.")
>>>>> Reported-by: Peter Becker <floyd.net@gmail.com>
>>>>> Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
>>>>> Reported-by: Malte Schröder <malte@tnxip.de>
>>>>> Reported-by: Derek Dongray <derek@valedon.co.uk>
>>>>> Reported-by: Erkki Seppala <flux-btrfs@inside.org>
>>>>> Cc: stable@vger.kernel.org  # 4.2+
>>>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>>>> ---
>>>>>   fs/btrfs/delayed-ref.c | 113
>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>   fs/btrfs/extent-tree.c |  14 ++++++
>>>>>   2 files changed, 127 insertions(+)
>>>>>
>>>>> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
>>>>> index ac3e81d..4832943 100644
>>>>> --- a/fs/btrfs/delayed-ref.c
>>>>> +++ b/fs/btrfs/delayed-ref.c
>>>>> @@ -197,6 +197,119 @@ static inline void drop_delayed_ref(struct
>>>>> btrfs_trans_handle *trans,
>>>>>                 trans->delayed_ref_updates--;
>>>>>   }
>>>>>
>>>>> +static bool merge_ref(struct btrfs_trans_handle *trans,
>>>>> +                     struct btrfs_delayed_ref_root *delayed_refs,
>>>>> +                     struct btrfs_delayed_ref_head *head,
>>>>> +                     struct btrfs_delayed_ref_node *ref,
>>>>> +                     u64 seq)
>>>>> +{
>>>>> +       struct btrfs_delayed_ref_node *next;
>>>>> +       bool done = false;
>>>>> +
>>>>> +       next = list_first_entry(&head->ref_list, struct
>>>>> btrfs_delayed_ref_node,
>>>>> +                               list);
>>>>> +       while (!done && &next->list != &head->ref_list) {
>>>>> +               int mod;
>>>>> +               struct btrfs_delayed_ref_node *next2;
>>>>> +
>>>>> +               next2 = list_next_entry(next, list);
>>>>> +
>>>>> +               if (next == ref)
>>>>> +                       goto next;
>>>>> +
>>>>> +               if (seq && next->seq >= seq)
>>>>> +                       goto next;
>>>>> +
>>>>> +               if (next->type != ref->type || next->no_quota !=
>>>>> ref->no_quota)
>>>>> +                       goto next;
>>>>> +
>>>>> +               if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
>>>>> +                    ref->type == BTRFS_SHARED_BLOCK_REF_KEY) &&
>>>>> +                   comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref),
>>>>> +
>>>>> btrfs_delayed_node_to_tree_ref(next),
>>>>> +                                  ref->type))
>>>>> +                       goto next;
>>>>> +               if ((ref->type == BTRFS_EXTENT_DATA_REF_KEY ||
>>>>> +                    ref->type == BTRFS_SHARED_DATA_REF_KEY) &&
>>>>> +                   comp_data_refs(btrfs_delayed_node_to_data_ref(ref),
>>>>> +
>>>>> btrfs_delayed_node_to_data_ref(next)))
>>>>> +                       goto next;
>>>>> +
>>>>> +               if (ref->action == next->action) {
>>>>> +                       mod = next->ref_mod;
>>>>> +               } else {
>>>>> +                       if (ref->ref_mod < next->ref_mod) {
>>>>> +                               swap(ref, next);
>>>>> +                               done = true;
>>>>> +                       }
>>>>> +                       mod = -next->ref_mod;
>>>>> +               }
>>>>> +
>>>>> +               drop_delayed_ref(trans, delayed_refs, head, next);
>>>>> +               ref->ref_mod += mod;
>>>>> +               if (ref->ref_mod == 0) {
>>>>> +                       drop_delayed_ref(trans, delayed_refs, head,
>>>>> ref);
>>>>> +                       done = true;
>>>>> +               } else {
>>>>> +                       /*
>>>>> +                        * Can't have multiples of the same ref on a
>>>>> tree
>>>>> block.
>>>>> +                        */
>>>>> +                       WARN_ON(ref->type == BTRFS_TREE_BLOCK_REF_KEY
>>>>> ||
>>>>> +                               ref->type ==
>>>>> BTRFS_SHARED_BLOCK_REF_KEY);
>>>>> +               }
>>>>> +next:
>>>>> +               next = next2;
>>>>> +       }
>>>>> +
>>>>> +       return done;
>>>>> +}
>>>>> +
>>>>> +void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
>>>>> +                             struct btrfs_fs_info *fs_info,
>>>>> +                             struct btrfs_delayed_ref_root
>>>>> *delayed_refs,
>>>>> +                             struct btrfs_delayed_ref_head *head)
>>>>> +{
>>>>> +       struct btrfs_delayed_ref_node *ref;
>>>>> +       u64 seq = 0;
>>>>> +
>>>>> +       assert_spin_locked(&head->lock);
>>>>> +
>>>>> +       if (list_empty(&head->ref_list))
>>>>> +               return;
>>>>> +
>>>>> +       /* We don't have too many refs to merge for data. */
>>>>> +       if (head->is_data)
>>>>> +               return;
>>>>> +
>>>>> +       spin_lock(&fs_info->tree_mod_seq_lock);
>>>>> +       if (!list_empty(&fs_info->tree_mod_seq_list)) {
>>>>> +               struct seq_list *elem;
>>>>> +
>>>>> +               elem = list_first_entry(&fs_info->tree_mod_seq_list,
>>>>> +                                       struct seq_list, list);
>>>>> +               seq = elem->seq;
>>>>> +       }
>>>>> +       spin_unlock(&fs_info->tree_mod_seq_lock);
>>>>> +
>>>>> +       ref = list_first_entry(&head->ref_list, struct
>>>>> btrfs_delayed_ref_node,
>>>>> +                              list);
>>>>> +       while (&ref->list != &head->ref_list) {
>>>>> +               if (seq && ref->seq >= seq)
>>>>> +                       goto next;
>>>>> +
>>>>> +               if (merge_ref(trans, delayed_refs, head, ref, seq)) {
>>>>> +                       if (list_empty(&head->ref_list))
>>>>> +                               break;
>>>>> +                       ref = list_first_entry(&head->ref_list,
>>>>> +                                              struct
>>>>> btrfs_delayed_ref_node,
>>>>> +                                              list);
>>>>> +                       continue;
>>>>> +               }
>>>>> +next:
>>>>> +               ref = list_next_entry(ref, list);
>>>>> +       }
>>>>> +}
>>>>> +
>>>>>   int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info,
>>>>>                             struct btrfs_delayed_ref_root
>>>>> *delayed_refs,
>>>>>                             u64 seq)
>>>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>>>> index 522fb45..42d9310 100644
>>>>> --- a/fs/btrfs/extent-tree.c
>>>>> +++ b/fs/btrfs/extent-tree.c
>>>>> @@ -2433,7 +2433,21 @@ static noinline int
>>>>> __btrfs_run_delayed_refs(struct
>>>>> btrfs_trans_handle *trans,
>>>>>                         }
>>>>>                 }
>>>>>
>>>>> +               /*
>>>>> +                * We need to try and merge add/drops of the same ref
>>>>> since we
>>>>> +                * can run into issues with relocate dropping the
>>>>> implicit
>>>>> ref
>>>>> +                * and then it being added back again before the drop
>>>>> can
>>>>> +                * finish.  If we merged anything we need to re-loop so
>>>>> we
>>>>> can
>>>>> +                * get a good ref.
>>>>> +                * Or we can get node references of the same type that
>>>>> weren't
>>>>> +                * merged when created due to bumps in the tree mod
>>>>> seq,
>>>>> and
>>>>> +                * we need to merge them to prevent adding an inline
>>>>> extent
>>>>> +                * backref before dropping it (triggering a BUG_ON at
>>>>> +                * insert_inline_extent_backref()).
>>>>> +                */
>>>>>                 spin_lock(&locked_ref->lock);
>>>>> +               btrfs_merge_delayed_refs(trans, fs_info, delayed_refs,
>>>>> +                                        locked_ref);
>>>>>
>>>>>                 /*
>>>>>                  * locked_ref is the head node, so we have to go one
>>>>>
>>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Btrfs: fix regression when running delayed references
  2015-10-22 17:03         ` Filipe Manana
@ 2015-10-22 17:54           ` Stéphane Lesimple
  2015-10-22 22:38             ` Stéphane Lesimple
  0 siblings, 1 reply; 15+ messages in thread
From: Stéphane Lesimple @ 2015-10-22 17:54 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Qu Wenruo, linux-btrfs, Filipe Manana

Le 2015-10-22 19:03, Filipe Manana a écrit :
> On Thu, Oct 22, 2015 at 3:58 PM, Stéphane Lesimple
> <stephane_btrfs@lesimple.fr> wrote:
>> Le 2015-10-22 11:47, Filipe Manana a écrit :
>>> 
>>> On Thu, Oct 22, 2015 at 10:43 AM, Filipe Manana <fdmanana@kernel.org>
>>> wrote:
>>>> 
>>>> On Thu, Oct 22, 2015 at 10:32 AM, Qu Wenruo 
>>>> <quwenruo@cn.fujitsu.com>
>>>> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>>  wrote on 2015/10/22 09:47 +0100:
>>>>>> 
>>>>>> 
>>>>>> From: Filipe Manana <fdmanana@suse.com>
>>>>>> 
>>>>>> In the kernel 4.2 merge window we had a refactoring/rework of the
>>>>>> delayed
>>>>>> references implementation in order to fix certain problems with
>>>>>> qgroups.
>>>>>> However that rework introduced one more regression that leads to 
>>>>>> the
>>>>>> following trace when running delayed references for metadata:
>>>>>> 
>>>>>> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
>>>>>> [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP 
>>>>>> DEBUG_PAGEALLOC
>>>>>> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs 
>>>>>> crc32c_generic
>>>>>> xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace
>>>>>> fscache
>>>>>> sunrpc loop fuse parport_pc psmouse i2
>>>>>> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G
>>>>>> W
>>>>>> 4.3.0-rc5-btrfs-next-17+ #1
>>>>>> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 
>>>>>> 1996),
>>>>>> BIOS
>>>>>> rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 
>>>>>> 04/01/2014
>>>>>> [35908.065201] Workqueue: btrfs-extent-refs 
>>>>>> btrfs_extent_refs_helper
>>>>>> [btrfs]
>>>>>> [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 
>>>>>> task.ti:
>>>>>> ffff88010c4c8000
>>>>>> [35908.065201] RIP: 0010:[<ffffffffa04928b5>]  
>>>>>> [<ffffffffa04928b5>]
>>>>>> insert_inline_extent_backref+0x52/0xb1 [btrfs]
>>>>>> [35908.065201] RSP: 0018:ffff88010c4cbb08  EFLAGS: 00010293
>>>>>> [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX:
>>>>>> 0000000000000000
>>>>>> [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI:
>>>>>> 0000000000000000
>>>>>> [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09:
>>>>>> ffff88010c4cb9f8
>>>>>> [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12:
>>>>>> 0000000000000000
>>>>>> [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15:
>>>>>> 0000000000000000
>>>>>> [35908.065201] FS:  0000000000000000(0000) 
>>>>>> GS:ffff88023edc0000(0000)
>>>>>> knlGS:0000000000000000
>>>>>> [35908.065201] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>>>> [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4:
>>>>>> 00000000000006e0
>>>>>> [35908.065201] Stack:
>>>>>> [35908.065201]  ffff88010c4cbb18 0000000000000f37 ffff88020a74c578
>>>>>> ffff88015a408000
>>>>>> [35908.065201]  ffff880154a44000 0000000000000000 0000000000000005
>>>>>> ffff88010c4cbbd8
>>>>>> [35908.065201]  ffffffffa0492b9a 0000000000000005 0000000000000000
>>>>>> 0000000000000000
>>>>>> [35908.065201] Call Trace:
>>>>>> [35908.065201]  [<ffffffffa0492b9a>] 
>>>>>> __btrfs_inc_extent_ref+0x8b/0x208
>>>>>> [btrfs]
>>>>>> [35908.065201]  [<ffffffffa0497117>] ?
>>>>>> __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
>>>>>> [35908.065201]  [<ffffffffa049773d>]
>>>>>> __btrfs_run_delayed_refs+0xafa/0xd33
>>>>>> [btrfs]
>>>>>> [35908.065201]  [<ffffffffa04a976a>] ?
>>>>>> join_transaction.isra.10+0x25/0x41f
>>>>>> [btrfs]
>>>>>> [35908.065201]  [<ffffffffa04a97ed>] ?
>>>>>> join_transaction.isra.10+0xa8/0x41f
>>>>>> [btrfs]
>>>>>> [35908.065201]  [<ffffffffa049914d>] 
>>>>>> btrfs_run_delayed_refs+0x75/0x1dd
>>>>>> [btrfs]
>>>>>> [35908.065201]  [<ffffffffa04992f1>] 
>>>>>> delayed_ref_async_start+0x3c/0x7b
>>>>>> [btrfs]
>>>>>> [35908.065201]  [<ffffffffa04d4b4f>] 
>>>>>> normal_work_helper+0x14c/0x32a
>>>>>> [btrfs]
>>>>>> [35908.065201]  [<ffffffffa04d4e93>] 
>>>>>> btrfs_extent_refs_helper+0x12/0x14
>>>>>> [btrfs]
>>>>>> [35908.065201]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
>>>>>> [35908.065201]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
>>>>>> [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
>>>>>> [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
>>>>>> [35908.065201]  [<ffffffff8106904d>] kthread+0xef/0xf7
>>>>>> [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
>>>>>> [35908.065201]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
>>>>>> [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
>>>>>> [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 
>>>>>> 89 c8
>>>>>> 48
>>>>>> 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 
>>>>>> 00 77
>>>>>> 02
>>>>>> <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
>>>>>> [35908.065201] RIP  [<ffffffffa04928b5>]
>>>>>> insert_inline_extent_backref+0x52/0xb1 [btrfs]
>>>>>> [35908.065201]  RSP <ffff88010c4cbb08>
>>>>>> [35908.310885] ---[ end trace fe4299baf0666457 ]---
>>>>>> 
>>>>>> This happens because the new delayed references code no longer 
>>>>>> merges
>>>>>> delayed references that have different sequence values. The 
>>>>>> following
>>>>>> steps are an example sequence leading to this issue:
>>>>>> 
>>>>>> 1) Transaction N starts, fs_info->tree_mod_seq has value 0;
>>>>>> 
>>>>>> 2) Extent buffer (btree node) A is allocated, delayed reference 
>>>>>> Ref1
>>>>>> for
>>>>>>     bytenr A is created, with a value of 1 and a seq value of 0;
>>>>>> 
>>>>>> 3) fs_info->tree_mod_seq is incremented to 1;
>>>>>> 
>>>>>> 4) Extent buffer A is deleted through btrfs_del_items(), which 
>>>>>> calls
>>>>>>     btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). 
>>>>>> The
>>>>>>     later returns the metadata extent associated to extent buffer 
>>>>>> A to
>>>>>>     the free space cache (the range is not pinned), because the 
>>>>>> extent
>>>>>>     buffer was created in the current transaction (N) and 
>>>>>> writeback
>>>>>> never
>>>>>>     happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN 
>>>>>> not
>>>>>> set
>>>>>>     in the extent buffer).
>>>>>>     This creates the delayed reference Ref2 for bytenr A, with a 
>>>>>> value
>>>>>>     of -1 and a seq value of 1;
>>>>>> 
>>>>>> 5) Delayed reference Ref2 is not merged with Ref1 when we create 
>>>>>> it,
>>>>>>     because they have different sequence numbers (decided at
>>>>>>     add_delayed_ref_tail_merge());
>>>>>> 
>>>>>> 6) fs_info->tree_mod_seq is incremented to 2;
>>>>>> 
>>>>>> 7) Some task attempts to allocate a new extent buffer (done at
>>>>>>     extent-tree.c:find_free_extent()), but due to heavy 
>>>>>> fragmentation
>>>>>>     and running low on metadata space the clustered allocation 
>>>>>> fails
>>>>>>     and we fall back to unclustered allocation, which finds the
>>>>>>     extent at offset A, so a new extent buffer at offset A is
>>>>>> allocated.
>>>>>>     This creates delayed reference Ref3 for bytenr A, with a value 
>>>>>> of
>>>>>> -1
>>>>>>     and a seq value of 2;
>>>>>> 
>>>>>> 8) Ref3 is not merged neither with Ref2 nor Ref1, again because 
>>>>>> they
>>>>>>     all have different seq values;
>>>>>> 
>>>>>> 9) We start running the delayed references
>>>>>> (__btrfs_run_delayed_refs());
>>>>>> 
>>>>>> 10) The delayed Ref1 is the first one being applied, which ends up
>>>>>>      creating an inline extent backref in the extent tree;
>>>>>> 
>>>>>> 10) Next the delayed reference Ref3 is selected for execution, and 
>>>>>> not
>>>>>>      Ref2, because select_delayed_ref() always gives a preference 
>>>>>> for
>>>>>>      positive references (that have an action of
>>>>>> BTRFS_ADD_DELAYED_REF);
>>>>>> 
>>>>>> 11) When running Ref3 we encounter alreay the inline extent 
>>>>>> backref
>>>>>>      in the extent tree at insert_inline_extent_backref(), which 
>>>>>> makes
>>>>>>      us hit the following BUG_ON:
>>>>>> 
>>>>>>          BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
>>>>>> 
>>>>>>      This is always true because owner corresponds to the level of 
>>>>>> the
>>>>>>      extent buffer/btree node in the btree.
>>>>>> 
>>>>>> For the scenario described above we hit the BUG_ON because we 
>>>>>> never
>>>>>> merge
>>>>>> references that have different seq values.
>>>>>> 
>>>>>> We used to do the merging before the 4.2 kernel, more 
>>>>>> specifically,
>>>>>> before
>>>>>> the commmits:
>>>>>> 
>>>>>>    c6fc24549960 ("btrfs: delayed-ref: Use list to replace the 
>>>>>> ref_root
>>>>>> in
>>>>>> ref_head.")
>>>>>>    c43d160fcd5e ("btrfs: delayed-ref: Cleanup the unneeded 
>>>>>> functions.")
>>>>>> 
>>>>>> This issue became more exposed after the following change that was
>>>>>> added
>>>>>> to 4.2 as well:
>>>>>> 
>>>>>>    cffc3374e567 ("Btrfs: fix order by which delayed references are
>>>>>> run")
>>>>>> 
>>>>>> Which in turn fixed another regression by the two commits 
>>>>>> previously
>>>>>> mentioned.
>>>>>> 
>>>>>> So fix this by bringing back the delayed reference merge code, 
>>>>>> with the
>>>>>> proper adaptations so that it operates against the new data 
>>>>>> structure
>>>>>> (linked list vs old red black tree implementation).
>>>>>> 
>>>>>> This issue was hit running fstest btrfs/063 in a loop. Several 
>>>>>> people
>>>>>> have
>>>>>> reported this issue in the mailing list when running on kernels 
>>>>>> 4.2+.
>>>>> 
>>>>> 
>>>>> 
>>>>> Thanks Filipe,
>>>>> 
>>>>> My fault again. :(
>>>>> But I'm not completely sure about if tree_mod_seq is still needed 
>>>>> now.
>>>>> 
>>>>> IIRC, with the new qgroup accounting happen at commit_transaction 
>>>>> time,
>>>>> btrfs_find_all_roots() should either searching commit tree for old
>>>>> roots, or
>>>>> search current tree for new roots.
>>>>> No need to search using tree_mod_seq.
>>>>> 
>>>>> If so, I'd like just allow merging refs without checking 
>>>>> tree_mod_seq.
>>> 
>>> 
>>> And to make it clear, that wouldn't work. If a backref walker starts
>>> iterating the btrees and then new delayed refs get merged
>>> independently of the current tree mod seq, the walker will see an
>>> inconsistent state in the extent tree if the delayed references are
>>> run (which can happen often before a transaction commit).
>>> 
>>> So either make delayed references continue using tree mod seq as
>>> before 4.2, or come with a whole new mechanism that replaces the tree
>>> mod seq while still giving the same consistency guarantees.
>> 
>> 
>> I used your patch at https://patchwork.kernel.org/patch/7463161/ to 
>> build a
>> 4.3.0-rc6 kernel.
>> I mounted my FS with skip_balance, cancelled the paused balance (just 
>> to be
>> sure), then started a new one with :
>> 
>> # btrfs balance start -dconvert=raid5,soft /tank
>> 
>> (half of my data blocks are RAID1, the other half is RAID5, the goal 
>> is to
>> be fully RAID5).
>> 
>> I got a different stacktrace than the usual one, via netconsole, after 
>> a few
>> minutes. It's still referencing btrfs_run_delayed_refs though:
>> 
>> [  822.461809] ------------[ cut here ]------------
>> [  822.461833] kernel BUG at fs/btrfs/extent-tree.c:2287!
>> [  822.461849] invalid opcode: 0000 [#1] SMP
>> [  822.461866] Modules linked in: nfnetlink_queue nfnetlink_log 
>> nfnetlink
>> xt_multiport xt_comment xt_conntrack xt_nat xt_tcpudp xts gf128mul 
>> drbg
>> ansi_cprng btrfs nf_conntrack_ftp nf_conntrack_sane iptable_security
>> iptable_filter iptable_mangle iptable_nat nf_conntrack_ipv4 
>> nf_defrag_ipv4
>> [  822.462365] CPU: 1 PID: 3810 Comm: btrfs-transacti Tainted: G       
>>  W
>> 4.3.0-rc6p7463161+ #3
>> [  822.462391] Hardware name: ASUS All Series/H87I-PLUS, BIOS 1005
>> 01/06/2014
>> [  822.462412] task: ffff88011a4a1a00 ti: ffff8800b6638000 task.ti:
>> ffff8800b6638000
>> [  822.462434] RIP: 0010:[<ffffffffc032310b>]  [<ffffffffc032310b>]
>> __btrfs_run_delayed_refs.constprop.73+0x108b/0x10d0 [btrfs]
>> [  822.462476] RSP: 0018:ffff8800b663bcb0  EFLAGS: 00010202
>> [  822.462495] RAX: 0000000000000001 RBX: ffff8800b3821888 RCX:
>> ffff8800b3a46cb8
>> [  822.462517] RDX: 0000000000000001 RSI: 00000000000001e1 RDI:
>> ffff8800b3a46cb0
>> [  822.462543] RBP: ffff8800b663bdb8 R08: 0000000000000000 R09:
>> ffff8800b3a46cb8
>> [  822.462565] R10: ffff8800b3a46cb8 R11: ffff8800b3a46cb8 R12:
>> 0000000000000000
>> [  822.462587] R13: 0000000000000000 R14: 00000afefd330000 R15:
>> ffff8800b3a46c38
>> [  822.462687] FS:  0000000000000000(0000) GS:ffff88011fb00000(0000)
>> knlGS:0000000000000000
>> [  822.462716] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  822.462736] CR2: 00007ff7a21ca000 CR3: 0000000001c10000 CR4:
>> 00000000000406e0
>> [  822.462762] Stack:
>> [  822.462770]  ffffffffc037c8be ffff88011fb169f0 0000000000000001
>> 0000000000000001
>> [  822.462802]  0000000000000001 ffff88011a4a1a60 ffff8800b663bd38
>> ffff8800b663bd40
>> [  822.462831]  ffffffff810aded5 ffffffff8101fcc9 ffff8800b663bd20
>> 00000000000014d9
>> [  822.462859] Call Trace:
>> [  822.462877]  [<ffffffffc037c8be>] ?
>> try_merge_free_space.isra.26+0x12e/0x180 [btrfs]
>> [  822.462902]  [<ffffffff810aded5>] ? put_prev_entity+0x35/0x660
>> [  822.462924]  [<ffffffff8101fcc9>] ? sched_clock+0x9/0x10
>> [  822.462949]  [<ffffffffc0325dcd>] btrfs_run_delayed_refs+0x7d/0x2b0
>> [btrfs]
>> [  822.462972]  [<ffffffff817aa7ab>] ? schedule_timeout+0x16b/0x2a0
>> [  822.462999]  [<ffffffffc033a6b3>] 
>> btrfs_commit_transaction+0x43/0xb10
>> [btrfs]
>> [  822.463028]  [<ffffffffc0335c19>] transaction_kthread+0x1a9/0x230 
>> [btrfs]
>> [  822.463056]  [<ffffffffc0335a70>] ? 
>> btrfs_cleanup_transaction+0x550/0x550
>> [btrfs]
>> [  822.463080]  [<ffffffff81097099>] kthread+0xc9/0xe0
>> [  822.463096]  [<ffffffff81096fd0>] ? kthread_park+0x60/0x60
>> [  822.463116]  [<ffffffff817aba8f>] ret_from_fork+0x3f/0x70
>> [  822.463134]  [<ffffffff81096fd0>] ? kthread_park+0x60/0x60
>> [  822.463153] Code: c0 48 8b bd 40 ff ff ff 31 c0 e8 31 94 fe ff 0f 
>> 0b be
>> d3 00 00 00 48 c7 c7 3b c6 3b c0 e8 6e 63 d5 c0 e9 64 f9 ff ff 0f 0b 
>> 0f 0b
>> <0f> 0b be d3 00 00 00 48 c7 c7 3b c6 3b c0 e8 52 63 d5 c0 e9 60
>> [  822.463300] RIP  [<ffffffffc032310b>]
>> __btrfs_run_delayed_refs.constprop.73+0x108b/0x10d0 [btrfs]
>> [  822.463335]  RSP <ffff8800b663bcb0>
>> [  822.472131] ---[ end trace f1e21f38cb0ea144 ]---
>> 
>> A couple other stacktraces follow after some seconds, then the system 
>> dies
>> completely. sysrqd doesn't even work to reboot it remotely using sysrq
>> logic.
>> 
>> fs/btrfs/extent-tree.c:2287 is the line you get from a vanilla 4.3-rc6 
>> +
>> your patch. I'll post it as soon as I can get somebody to manually 
>> reboot
>> this remote machine (my kernel build machine is the same than the one
>> hosting the btrfs FS).
>> 
>> Don't hesitate to ask if you need me to debug or even ftrace 
>> something.
> 
> Thanks Stéphane. I haven't seen that crash yet (still running tests
> for 2 consecutive days now).
> Can you please try the following patch, which works on top of mine,
> and enable ftrace before running balance:
> 
> Debug patch:  https://friendpaste.com/5s3dItRpcpq3dH1E4KUJor
> 
> Enable ftrace:
> 
>     $ echo > /sys/kernel/debug/tracing/trace
>     $ echo "nop" > /sys/kernel/debug/tracing/current_tracer
>     $ echo 100000 > /sys/kernel/debug/tracing/buffer_size_kb   # if
> you can use larger buffer size, even better
>     $ echo > /sys/kernel/debug/tracing/set_ftrace_filter
>     $ echo 1 > /sys/kernel/debug/tracing/tracing_on
> 
>     $ run balance... wait until it finishes with IO error or the
> patch's printk message shows up in dmesg/syslog
> 
>     $ echo 0 > /sys/kernel/debug/tracing/tracing_on
> 
>     $ cat /sys/kernel/debug/tracing/trace > some_file.txt
> 
> Then send is some_file.txt for debugging, hopefully it will give some
> useful information. Note that it might produce tons of messages,
> depending on how long it takes for you to hit the BUG_ON.
> 
> Thanks a lot for this.

I'm compiling it now (using your v2 of the friendpaste diff).

I took the liberty to add a tracing_off() right before the return -EIO 
so that the trace tail ends exactly at the right place.

Last time I tried to use ftrace to diagnose the bug we're trying to fix, 
the system crashes so hard that usually it's complicated to get the 
trace contents written somewhere before the system is unusable. But I'll 
eventually work around it by using /sys/kernel/debug/tracing/trace_pipe 
to send the trace live to another machine over the LAN.

This series of bugs are so easy to trigger on my system that we'll 
hopefully get something useful out of the trace. I guess that's a good 
thing !

>> 
>> Thanks,
>> 
>> --
>> Stéphane.
>> 
>>>>> I was going to do it but not completely sure is there any other 
>>>>> user of
>>>>> tree_mod_seq.
>>>>> And if it's possible to get rid of tree_mod_seq and merge with last
>>>>> delayed_ref, things should get cleaner without new codes.
>>>> 
>>>> 
>>>> Well, the tree mod seq is what allows backref walkers (and possibly
>>>> other paths) to get a consistent view of all btrees and delayed refs
>>>> state while doing some processing - that's why we have calls to
>>>> btrfs_check_delayed_seq() when running delayed references - so that
>>>> any backref walker will not see changes that happen after it 
>>>> started,
>>>> i.e. it will see a consistent view of all the btrees (like an
>>>> in-memory snapshot of all btrees while the transaction is running).
>>>> 
>>>> I don't think you can get this level of consistency through any 
>>>> other
>>>> existing means.
>>>> So just adding back yet more code that was removed despite still 
>>>> being
>>>> needed.
>>>> 
>>>> This is affecting way too many people now, I would like to get this
>>>> fixed and later, if there's a better (new) solution for this, we can
>>>> get it in.
>>>> 
>>>> thanks
>>>> 
>>>>> 
>>>>> Thanks,
>>>>> Qu
>>>>> 
>>>>> 
>>>>> 
>>>>>> 
>>>>>> Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the
>>>>>> ref_root
>>>>>> in ref_head.")
>>>>>> Reported-by: Peter Becker <floyd.net@gmail.com>
>>>>>> Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
>>>>>> Reported-by: Malte Schröder <malte@tnxip.de>
>>>>>> Reported-by: Derek Dongray <derek@valedon.co.uk>
>>>>>> Reported-by: Erkki Seppala <flux-btrfs@inside.org>
>>>>>> Cc: stable@vger.kernel.org  # 4.2+
>>>>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>>>>> ---
>>>>>>   fs/btrfs/delayed-ref.c | 113
>>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>   fs/btrfs/extent-tree.c |  14 ++++++
>>>>>>   2 files changed, 127 insertions(+)
>>>>>> 
>>>>>> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
>>>>>> index ac3e81d..4832943 100644
>>>>>> --- a/fs/btrfs/delayed-ref.c
>>>>>> +++ b/fs/btrfs/delayed-ref.c
>>>>>> @@ -197,6 +197,119 @@ static inline void drop_delayed_ref(struct
>>>>>> btrfs_trans_handle *trans,
>>>>>>                 trans->delayed_ref_updates--;
>>>>>>   }
>>>>>> 
>>>>>> +static bool merge_ref(struct btrfs_trans_handle *trans,
>>>>>> +                     struct btrfs_delayed_ref_root *delayed_refs,
>>>>>> +                     struct btrfs_delayed_ref_head *head,
>>>>>> +                     struct btrfs_delayed_ref_node *ref,
>>>>>> +                     u64 seq)
>>>>>> +{
>>>>>> +       struct btrfs_delayed_ref_node *next;
>>>>>> +       bool done = false;
>>>>>> +
>>>>>> +       next = list_first_entry(&head->ref_list, struct
>>>>>> btrfs_delayed_ref_node,
>>>>>> +                               list);
>>>>>> +       while (!done && &next->list != &head->ref_list) {
>>>>>> +               int mod;
>>>>>> +               struct btrfs_delayed_ref_node *next2;
>>>>>> +
>>>>>> +               next2 = list_next_entry(next, list);
>>>>>> +
>>>>>> +               if (next == ref)
>>>>>> +                       goto next;
>>>>>> +
>>>>>> +               if (seq && next->seq >= seq)
>>>>>> +                       goto next;
>>>>>> +
>>>>>> +               if (next->type != ref->type || next->no_quota !=
>>>>>> ref->no_quota)
>>>>>> +                       goto next;
>>>>>> +
>>>>>> +               if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
>>>>>> +                    ref->type == BTRFS_SHARED_BLOCK_REF_KEY) &&
>>>>>> +                   
>>>>>> comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref),
>>>>>> +
>>>>>> btrfs_delayed_node_to_tree_ref(next),
>>>>>> +                                  ref->type))
>>>>>> +                       goto next;
>>>>>> +               if ((ref->type == BTRFS_EXTENT_DATA_REF_KEY ||
>>>>>> +                    ref->type == BTRFS_SHARED_DATA_REF_KEY) &&
>>>>>> +                   
>>>>>> comp_data_refs(btrfs_delayed_node_to_data_ref(ref),
>>>>>> +
>>>>>> btrfs_delayed_node_to_data_ref(next)))
>>>>>> +                       goto next;
>>>>>> +
>>>>>> +               if (ref->action == next->action) {
>>>>>> +                       mod = next->ref_mod;
>>>>>> +               } else {
>>>>>> +                       if (ref->ref_mod < next->ref_mod) {
>>>>>> +                               swap(ref, next);
>>>>>> +                               done = true;
>>>>>> +                       }
>>>>>> +                       mod = -next->ref_mod;
>>>>>> +               }
>>>>>> +
>>>>>> +               drop_delayed_ref(trans, delayed_refs, head, next);
>>>>>> +               ref->ref_mod += mod;
>>>>>> +               if (ref->ref_mod == 0) {
>>>>>> +                       drop_delayed_ref(trans, delayed_refs, 
>>>>>> head,
>>>>>> ref);
>>>>>> +                       done = true;
>>>>>> +               } else {
>>>>>> +                       /*
>>>>>> +                        * Can't have multiples of the same ref on 
>>>>>> a
>>>>>> tree
>>>>>> block.
>>>>>> +                        */
>>>>>> +                       WARN_ON(ref->type == 
>>>>>> BTRFS_TREE_BLOCK_REF_KEY
>>>>>> ||
>>>>>> +                               ref->type ==
>>>>>> BTRFS_SHARED_BLOCK_REF_KEY);
>>>>>> +               }
>>>>>> +next:
>>>>>> +               next = next2;
>>>>>> +       }
>>>>>> +
>>>>>> +       return done;
>>>>>> +}
>>>>>> +
>>>>>> +void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
>>>>>> +                             struct btrfs_fs_info *fs_info,
>>>>>> +                             struct btrfs_delayed_ref_root
>>>>>> *delayed_refs,
>>>>>> +                             struct btrfs_delayed_ref_head *head)
>>>>>> +{
>>>>>> +       struct btrfs_delayed_ref_node *ref;
>>>>>> +       u64 seq = 0;
>>>>>> +
>>>>>> +       assert_spin_locked(&head->lock);
>>>>>> +
>>>>>> +       if (list_empty(&head->ref_list))
>>>>>> +               return;
>>>>>> +
>>>>>> +       /* We don't have too many refs to merge for data. */
>>>>>> +       if (head->is_data)
>>>>>> +               return;
>>>>>> +
>>>>>> +       spin_lock(&fs_info->tree_mod_seq_lock);
>>>>>> +       if (!list_empty(&fs_info->tree_mod_seq_list)) {
>>>>>> +               struct seq_list *elem;
>>>>>> +
>>>>>> +               elem = 
>>>>>> list_first_entry(&fs_info->tree_mod_seq_list,
>>>>>> +                                       struct seq_list, list);
>>>>>> +               seq = elem->seq;
>>>>>> +       }
>>>>>> +       spin_unlock(&fs_info->tree_mod_seq_lock);
>>>>>> +
>>>>>> +       ref = list_first_entry(&head->ref_list, struct
>>>>>> btrfs_delayed_ref_node,
>>>>>> +                              list);
>>>>>> +       while (&ref->list != &head->ref_list) {
>>>>>> +               if (seq && ref->seq >= seq)
>>>>>> +                       goto next;
>>>>>> +
>>>>>> +               if (merge_ref(trans, delayed_refs, head, ref, 
>>>>>> seq)) {
>>>>>> +                       if (list_empty(&head->ref_list))
>>>>>> +                               break;
>>>>>> +                       ref = list_first_entry(&head->ref_list,
>>>>>> +                                              struct
>>>>>> btrfs_delayed_ref_node,
>>>>>> +                                              list);
>>>>>> +                       continue;
>>>>>> +               }
>>>>>> +next:
>>>>>> +               ref = list_next_entry(ref, list);
>>>>>> +       }
>>>>>> +}
>>>>>> +
>>>>>>   int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info,
>>>>>>                             struct btrfs_delayed_ref_root
>>>>>> *delayed_refs,
>>>>>>                             u64 seq)
>>>>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>>>>> index 522fb45..42d9310 100644
>>>>>> --- a/fs/btrfs/extent-tree.c
>>>>>> +++ b/fs/btrfs/extent-tree.c
>>>>>> @@ -2433,7 +2433,21 @@ static noinline int
>>>>>> __btrfs_run_delayed_refs(struct
>>>>>> btrfs_trans_handle *trans,
>>>>>>                         }
>>>>>>                 }
>>>>>> 
>>>>>> +               /*
>>>>>> +                * We need to try and merge add/drops of the same 
>>>>>> ref
>>>>>> since we
>>>>>> +                * can run into issues with relocate dropping the
>>>>>> implicit
>>>>>> ref
>>>>>> +                * and then it being added back again before the 
>>>>>> drop
>>>>>> can
>>>>>> +                * finish.  If we merged anything we need to 
>>>>>> re-loop so
>>>>>> we
>>>>>> can
>>>>>> +                * get a good ref.
>>>>>> +                * Or we can get node references of the same type 
>>>>>> that
>>>>>> weren't
>>>>>> +                * merged when created due to bumps in the tree 
>>>>>> mod
>>>>>> seq,
>>>>>> and
>>>>>> +                * we need to merge them to prevent adding an 
>>>>>> inline
>>>>>> extent
>>>>>> +                * backref before dropping it (triggering a BUG_ON 
>>>>>> at
>>>>>> +                * insert_inline_extent_backref()).
>>>>>> +                */
>>>>>>                 spin_lock(&locked_ref->lock);
>>>>>> +               btrfs_merge_delayed_refs(trans, fs_info, 
>>>>>> delayed_refs,
>>>>>> +                                        locked_ref);
>>>>>> 
>>>>>>                 /*
>>>>>>                  * locked_ref is the head node, so we have to go 
>>>>>> one
>>>>>> 
>>>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe 
>>> linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" 
>> in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Btrfs: fix regression when running delayed references
  2015-10-22 17:54           ` Stéphane Lesimple
@ 2015-10-22 22:38             ` Stéphane Lesimple
  2015-10-22 23:03               ` Filipe Manana
  0 siblings, 1 reply; 15+ messages in thread
From: Stéphane Lesimple @ 2015-10-22 22:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana, Qu Wenruo, Filipe Manana

>>> [ ... thread cleanup ... ]
>>> Don't hesitate to ask if you need me to debug or even ftrace 
>>> something.
>> 
>> Thanks Stéphane. I haven't seen that crash yet (still running tests
>> for 2 consecutive days now).
>> Can you please try the following patch, which works on top of mine,
>> and enable ftrace before running balance:
>> 
>> Debug patch:  https://friendpaste.com/5s3dItRpcpq3dH1E4KUJor
>> 
>> Enable ftrace:
>> 
>>     $ echo > /sys/kernel/debug/tracing/trace
>>     $ echo "nop" > /sys/kernel/debug/tracing/current_tracer
>>     $ echo 100000 > /sys/kernel/debug/tracing/buffer_size_kb   # if
>> you can use larger buffer size, even better
>>     $ echo > /sys/kernel/debug/tracing/set_ftrace_filter
>>     $ echo 1 > /sys/kernel/debug/tracing/tracing_on
>> 
>>     $ run balance... wait until it finishes with IO error or the
>> patch's printk message shows up in dmesg/syslog
>> 
>>     $ echo 0 > /sys/kernel/debug/tracing/tracing_on
>> 
>>     $ cat /sys/kernel/debug/tracing/trace > some_file.txt
>> 
>> Then send is some_file.txt for debugging, hopefully it will give some
>> useful information. Note that it might produce tons of messages,
>> depending on how long it takes for you to hit the BUG_ON.
>> 
>> Thanks a lot for this.
> 
> I'm compiling it now (using your v2 of the friendpaste diff).
> 
> I took the liberty to add a tracing_off() right before the return -EIO
> so that the trace tail ends exactly at the right place.
> 
> Last time I tried to use ftrace to diagnose the bug we're trying to
> fix, the system crashes so hard that usually it's complicated to get
> the trace contents written somewhere before the system is unusable.
> But I'll eventually work around it by using
> /sys/kernel/debug/tracing/trace_pipe to send the trace live to another
> machine over the LAN.
> 
> This series of bugs are so easy to trigger on my system that we'll
> hopefully get something useful out of the trace. I guess that's a good
> thing !

So, this time it took a little over an hour to get the crash, but it did 
reach the -EIO condition eventually.
The ftrace log (2M gzipped) is available here :
http://www.speed47.net/tmp2/btrfs-4.3rc6p7463161-ftrace1.log.gz

The associated kernel log is as follows :

[ 2880.178589] INFO: task btrfs-transacti:7358 blocked for more than 120 
seconds.
[ 2880.178600]       Not tainted 4.3.0-rc6p7463161+ #3
[ 2880.178603] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 3088.829429] Out of memory: Kill process 9449 (df-complex2simp) score 
246 or sacrifice child
[ 3088.829435] Killed process 9449 (df-complex2simp) total-vm:964732kB, 
anon-rss:943764kB, file-rss:0kB
[ 3600.197642] INFO: task btrfs-transacti:7358 blocked for more than 120 
seconds.
[ 3600.197657]       Not tainted 4.3.0-rc6p7463161+ #3
[ 3600.197660] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 3840.204146] INFO: task btrfs-transacti:7358 blocked for more than 120 
seconds.
[ 3840.204180]       Not tainted 4.3.0-rc6p7463161+ #3
[ 3840.204219] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 3993.671982] Out of memory: Kill process 11357 (df-complex2simp) score 
227 or sacrifice child
[ 3993.671989] Killed process 11357 (df-complex2simp) total-vm:891608kB, 
anon-rss:870704kB, file-rss:60kB
[ 4080.210324] INFO: task btrfs-transacti:7358 blocked for more than 120 
seconds.
[ 4080.210336]       Not tainted 4.3.0-rc6p7463161+ #3
[ 4080.210339] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 4320.215635] INFO: task btrfs-transacti:7358 blocked for more than 120 
seconds.
[ 4320.215662]       Not tainted 4.3.0-rc6p7463161+ #3
[ 4320.215667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 4560.221119] INFO: task btrfs-transacti:7358 blocked for more than 120 
seconds.
[ 4560.221146]       Not tainted 4.3.0-rc6p7463161+ #3
[ 4560.221148] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 4800.226884] INFO: task btrfs-transacti:7358 blocked for more than 120 
seconds.
[ 4800.226898]       Not tainted 4.3.0-rc6p7463161+ #3
[ 4800.226902] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 4890.116131] Out of memory: Kill process 13377 (df-complex2simp) score 
207 or sacrifice child
[ 4890.116138] Killed process 13377 (df-complex2simp) total-vm:834976kB, 
anon-rss:793272kB, file-rss:48kB
[ 5785.793580] Out of memory: Kill process 15285 (df-complex2simp) score 
201 or sacrifice child
[ 5785.793586] Killed process 15285 (df-complex2simp) total-vm:802208kB, 
anon-rss:772172kB, file-rss:4kB
[ 6480.269728] INFO: task btrfs-transacti:7358 blocked for more than 120 
seconds.
[ 6480.269738]       Not tainted 4.3.0-rc6p7463161+ #3
[ 6480.269740] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 7081.967354] BTRFS: here, ref_mod != 1, bytenr 12090260504576, ref_mod 
2, seq 0 action 1
[ 7081.967784] BTRFS: error (device dm-3) in 
btrfs_run_delayed_refs:2872: errno=-5 IO failure

The OOM conditions are unrelated, this is an rrdtool cron script that 
apparently takes too much memory from time to time, I'll disable it for 
further tests to get noise out of the way.

If I can be of some more help, just ask.

Thanks,

-- 
Stéphane.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Btrfs: fix regression when running delayed references
  2015-10-22 22:38             ` Stéphane Lesimple
@ 2015-10-22 23:03               ` Filipe Manana
  2015-10-23 10:11                 ` Filipe Manana
  0 siblings, 1 reply; 15+ messages in thread
From: Filipe Manana @ 2015-10-22 23:03 UTC (permalink / raw)
  To: Stéphane Lesimple; +Cc: linux-btrfs, Qu Wenruo, Filipe Manana

On Thu, Oct 22, 2015 at 11:38 PM, Stéphane Lesimple
<stephane_btrfs@lesimple.fr> wrote:
>>>> [ ... thread cleanup ... ]
>>>>
>>>> Don't hesitate to ask if you need me to debug or even ftrace something.
>>>
>>>
>>> Thanks Stéphane. I haven't seen that crash yet (still running tests
>>> for 2 consecutive days now).
>>> Can you please try the following patch, which works on top of mine,
>>> and enable ftrace before running balance:
>>>
>>> Debug patch:  https://friendpaste.com/5s3dItRpcpq3dH1E4KUJor
>>>
>>> Enable ftrace:
>>>
>>>     $ echo > /sys/kernel/debug/tracing/trace
>>>     $ echo "nop" > /sys/kernel/debug/tracing/current_tracer
>>>     $ echo 100000 > /sys/kernel/debug/tracing/buffer_size_kb   # if
>>> you can use larger buffer size, even better
>>>     $ echo > /sys/kernel/debug/tracing/set_ftrace_filter
>>>     $ echo 1 > /sys/kernel/debug/tracing/tracing_on
>>>
>>>     $ run balance... wait until it finishes with IO error or the
>>> patch's printk message shows up in dmesg/syslog
>>>
>>>     $ echo 0 > /sys/kernel/debug/tracing/tracing_on
>>>
>>>     $ cat /sys/kernel/debug/tracing/trace > some_file.txt
>>>
>>> Then send is some_file.txt for debugging, hopefully it will give some
>>> useful information. Note that it might produce tons of messages,
>>> depending on how long it takes for you to hit the BUG_ON.
>>>
>>> Thanks a lot for this.
>>
>>
>> I'm compiling it now (using your v2 of the friendpaste diff).
>>
>> I took the liberty to add a tracing_off() right before the return -EIO
>> so that the trace tail ends exactly at the right place.
>>
>> Last time I tried to use ftrace to diagnose the bug we're trying to
>> fix, the system crashes so hard that usually it's complicated to get
>> the trace contents written somewhere before the system is unusable.
>> But I'll eventually work around it by using
>> /sys/kernel/debug/tracing/trace_pipe to send the trace live to another
>> machine over the LAN.
>>
>> This series of bugs are so easy to trigger on my system that we'll
>> hopefully get something useful out of the trace. I guess that's a good
>> thing !
>
>
> So, this time it took a little over an hour to get the crash, but it did
> reach the -EIO condition eventually.
> The ftrace log (2M gzipped) is available here :
> http://www.speed47.net/tmp2/btrfs-4.3rc6p7463161-ftrace1.log.gz
>
> The associated kernel log is as follows :
>
> [ 2880.178589] INFO: task btrfs-transacti:7358 blocked for more than 120
> seconds.
> [ 2880.178600]       Not tainted 4.3.0-rc6p7463161+ #3
> [ 2880.178603] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 3088.829429] Out of memory: Kill process 9449 (df-complex2simp) score 246
> or sacrifice child
> [ 3088.829435] Killed process 9449 (df-complex2simp) total-vm:964732kB,
> anon-rss:943764kB, file-rss:0kB
> [ 3600.197642] INFO: task btrfs-transacti:7358 blocked for more than 120
> seconds.
> [ 3600.197657]       Not tainted 4.3.0-rc6p7463161+ #3
> [ 3600.197660] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 3840.204146] INFO: task btrfs-transacti:7358 blocked for more than 120
> seconds.
> [ 3840.204180]       Not tainted 4.3.0-rc6p7463161+ #3
> [ 3840.204219] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 3993.671982] Out of memory: Kill process 11357 (df-complex2simp) score 227
> or sacrifice child
> [ 3993.671989] Killed process 11357 (df-complex2simp) total-vm:891608kB,
> anon-rss:870704kB, file-rss:60kB
> [ 4080.210324] INFO: task btrfs-transacti:7358 blocked for more than 120
> seconds.
> [ 4080.210336]       Not tainted 4.3.0-rc6p7463161+ #3
> [ 4080.210339] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 4320.215635] INFO: task btrfs-transacti:7358 blocked for more than 120
> seconds.
> [ 4320.215662]       Not tainted 4.3.0-rc6p7463161+ #3
> [ 4320.215667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 4560.221119] INFO: task btrfs-transacti:7358 blocked for more than 120
> seconds.
> [ 4560.221146]       Not tainted 4.3.0-rc6p7463161+ #3
> [ 4560.221148] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 4800.226884] INFO: task btrfs-transacti:7358 blocked for more than 120
> seconds.
> [ 4800.226898]       Not tainted 4.3.0-rc6p7463161+ #3
> [ 4800.226902] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 4890.116131] Out of memory: Kill process 13377 (df-complex2simp) score 207
> or sacrifice child
> [ 4890.116138] Killed process 13377 (df-complex2simp) total-vm:834976kB,
> anon-rss:793272kB, file-rss:48kB
> [ 5785.793580] Out of memory: Kill process 15285 (df-complex2simp) score 201
> or sacrifice child
> [ 5785.793586] Killed process 15285 (df-complex2simp) total-vm:802208kB,
> anon-rss:772172kB, file-rss:4kB
> [ 6480.269728] INFO: task btrfs-transacti:7358 blocked for more than 120
> seconds.
> [ 6480.269738]       Not tainted 4.3.0-rc6p7463161+ #3
> [ 6480.269740] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 7081.967354] BTRFS: here, ref_mod != 1, bytenr 12090260504576, ref_mod 2,
> seq 0 action 1
> [ 7081.967784] BTRFS: error (device dm-3) in btrfs_run_delayed_refs:2872:
> errno=-5 IO failure
>
> The OOM conditions are unrelated, this is an rrdtool cron script that
> apparently takes too much memory from time to time, I'll disable it for
> further tests to get noise out of the way.
>
> If I can be of some more help, just ask.

That's great Stéphane.
I'll take a look tomorrow morning at the logs and then get back to you.

Thanks a lot for that.

>
> Thanks,
>
> --
> Stéphane.
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Btrfs: fix regression when running delayed references
  2015-10-22 23:03               ` Filipe Manana
@ 2015-10-23 10:11                 ` Filipe Manana
  2015-10-23 10:59                   ` Stéphane Lesimple
  0 siblings, 1 reply; 15+ messages in thread
From: Filipe Manana @ 2015-10-23 10:11 UTC (permalink / raw)
  To: Stéphane Lesimple; +Cc: linux-btrfs, Qu Wenruo, Filipe Manana

On Fri, Oct 23, 2015 at 12:03 AM, Filipe Manana <fdmanana@kernel.org> wrote:
> On Thu, Oct 22, 2015 at 11:38 PM, Stéphane Lesimple
> <stephane_btrfs@lesimple.fr> wrote:
>>>>> [ ... thread cleanup ... ]
>>>>>
>>>>> Don't hesitate to ask if you need me to debug or even ftrace something.
>>>>
>>>>
>>>> Thanks Stéphane. I haven't seen that crash yet (still running tests
>>>> for 2 consecutive days now).
>>>> Can you please try the following patch, which works on top of mine,
>>>> and enable ftrace before running balance:
>>>>
>>>> Debug patch:  https://friendpaste.com/5s3dItRpcpq3dH1E4KUJor
>>>>
>>>> Enable ftrace:
>>>>
>>>>     $ echo > /sys/kernel/debug/tracing/trace
>>>>     $ echo "nop" > /sys/kernel/debug/tracing/current_tracer
>>>>     $ echo 100000 > /sys/kernel/debug/tracing/buffer_size_kb   # if
>>>> you can use larger buffer size, even better
>>>>     $ echo > /sys/kernel/debug/tracing/set_ftrace_filter
>>>>     $ echo 1 > /sys/kernel/debug/tracing/tracing_on
>>>>
>>>>     $ run balance... wait until it finishes with IO error or the
>>>> patch's printk message shows up in dmesg/syslog
>>>>
>>>>     $ echo 0 > /sys/kernel/debug/tracing/tracing_on
>>>>
>>>>     $ cat /sys/kernel/debug/tracing/trace > some_file.txt
>>>>
>>>> Then send is some_file.txt for debugging, hopefully it will give some
>>>> useful information. Note that it might produce tons of messages,
>>>> depending on how long it takes for you to hit the BUG_ON.
>>>>
>>>> Thanks a lot for this.
>>>
>>>
>>> I'm compiling it now (using your v2 of the friendpaste diff).
>>>
>>> I took the liberty to add a tracing_off() right before the return -EIO
>>> so that the trace tail ends exactly at the right place.
>>>
>>> Last time I tried to use ftrace to diagnose the bug we're trying to
>>> fix, the system crashes so hard that usually it's complicated to get
>>> the trace contents written somewhere before the system is unusable.
>>> But I'll eventually work around it by using
>>> /sys/kernel/debug/tracing/trace_pipe to send the trace live to another
>>> machine over the LAN.
>>>
>>> This series of bugs are so easy to trigger on my system that we'll
>>> hopefully get something useful out of the trace. I guess that's a good
>>> thing !
>>
>>
>> So, this time it took a little over an hour to get the crash, but it did
>> reach the -EIO condition eventually.
>> The ftrace log (2M gzipped) is available here :
>> http://www.speed47.net/tmp2/btrfs-4.3rc6p7463161-ftrace1.log.gz
>>
>> The associated kernel log is as follows :
>>
>> [ 2880.178589] INFO: task btrfs-transacti:7358 blocked for more than 120
>> seconds.
>> [ 2880.178600]       Not tainted 4.3.0-rc6p7463161+ #3
>> [ 2880.178603] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [ 3088.829429] Out of memory: Kill process 9449 (df-complex2simp) score 246
>> or sacrifice child
>> [ 3088.829435] Killed process 9449 (df-complex2simp) total-vm:964732kB,
>> anon-rss:943764kB, file-rss:0kB
>> [ 3600.197642] INFO: task btrfs-transacti:7358 blocked for more than 120
>> seconds.
>> [ 3600.197657]       Not tainted 4.3.0-rc6p7463161+ #3
>> [ 3600.197660] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [ 3840.204146] INFO: task btrfs-transacti:7358 blocked for more than 120
>> seconds.
>> [ 3840.204180]       Not tainted 4.3.0-rc6p7463161+ #3
>> [ 3840.204219] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [ 3993.671982] Out of memory: Kill process 11357 (df-complex2simp) score 227
>> or sacrifice child
>> [ 3993.671989] Killed process 11357 (df-complex2simp) total-vm:891608kB,
>> anon-rss:870704kB, file-rss:60kB
>> [ 4080.210324] INFO: task btrfs-transacti:7358 blocked for more than 120
>> seconds.
>> [ 4080.210336]       Not tainted 4.3.0-rc6p7463161+ #3
>> [ 4080.210339] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [ 4320.215635] INFO: task btrfs-transacti:7358 blocked for more than 120
>> seconds.
>> [ 4320.215662]       Not tainted 4.3.0-rc6p7463161+ #3
>> [ 4320.215667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [ 4560.221119] INFO: task btrfs-transacti:7358 blocked for more than 120
>> seconds.
>> [ 4560.221146]       Not tainted 4.3.0-rc6p7463161+ #3
>> [ 4560.221148] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [ 4800.226884] INFO: task btrfs-transacti:7358 blocked for more than 120
>> seconds.
>> [ 4800.226898]       Not tainted 4.3.0-rc6p7463161+ #3
>> [ 4800.226902] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [ 4890.116131] Out of memory: Kill process 13377 (df-complex2simp) score 207
>> or sacrifice child
>> [ 4890.116138] Killed process 13377 (df-complex2simp) total-vm:834976kB,
>> anon-rss:793272kB, file-rss:48kB
>> [ 5785.793580] Out of memory: Kill process 15285 (df-complex2simp) score 201
>> or sacrifice child
>> [ 5785.793586] Killed process 15285 (df-complex2simp) total-vm:802208kB,
>> anon-rss:772172kB, file-rss:4kB
>> [ 6480.269728] INFO: task btrfs-transacti:7358 blocked for more than 120
>> seconds.
>> [ 6480.269738]       Not tainted 4.3.0-rc6p7463161+ #3
>> [ 6480.269740] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [ 7081.967354] BTRFS: here, ref_mod != 1, bytenr 12090260504576, ref_mod 2,
>> seq 0 action 1
>> [ 7081.967784] BTRFS: error (device dm-3) in btrfs_run_delayed_refs:2872:
>> errno=-5 IO failure
>>
>> The OOM conditions are unrelated, this is an rrdtool cron script that
>> apparently takes too much memory from time to time, I'll disable it for
>> further tests to get noise out of the way.
>>
>> If I can be of some more help, just ask.
>
> That's great Stéphane.
> I'll take a look tomorrow morning at the logs and then get back to you.
>
> Thanks a lot for that.

Stéphane, can you do the balance again but with the following new debug patch:

https://friendpaste.com/3NXzDFGcK4Hxi0yiypMgRO

(I added your tracing_off(); call)

It applies on top of my fix.
Also, are you by chance using qgroups (btrfs quotas)? If you are I
think I know what's the problem, but either way please run with this
new debug patch, so that I can confirm my ideas.

Many thanks, much appreciated!

>
>>
>> Thanks,
>>
>> --
>> Stéphane.
>>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Btrfs: fix regression when running delayed references
  2015-10-23 10:11                 ` Filipe Manana
@ 2015-10-23 10:59                   ` Stéphane Lesimple
  2015-10-23 14:35                     ` Stéphane Lesimple
  0 siblings, 1 reply; 15+ messages in thread
From: Stéphane Lesimple @ 2015-10-23 10:59 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, Qu Wenruo, Filipe Manana

Le 2015-10-23 12:11, Filipe Manana a écrit :
> On Fri, Oct 23, 2015 at 12:03 AM, Filipe Manana <fdmanana@kernel.org> 
> wrote:
>> On Thu, Oct 22, 2015 at 11:38 PM, Stéphane Lesimple
>> <stephane_btrfs@lesimple.fr> wrote:
>>>>>> [ ... thread cleanup ... ]
>>>>>> 
>>>>>> Don't hesitate to ask if you need me to debug or even ftrace 
>>>>>> something.
>>>>> 
>>>>> 
>>>>> Thanks Stéphane. I haven't seen that crash yet (still running tests
>>>>> for 2 consecutive days now).
>>>>> Can you please try the following patch, which works on top of mine,
>>>>> and enable ftrace before running balance:
>>>>> 
>>>>> Debug patch:  https://friendpaste.com/5s3dItRpcpq3dH1E4KUJor
>>>>> 
>>>>> Enable ftrace:
>>>>> 
>>>>>     $ echo > /sys/kernel/debug/tracing/trace
>>>>>     $ echo "nop" > /sys/kernel/debug/tracing/current_tracer
>>>>>     $ echo 100000 > /sys/kernel/debug/tracing/buffer_size_kb   # if
>>>>> you can use larger buffer size, even better
>>>>>     $ echo > /sys/kernel/debug/tracing/set_ftrace_filter
>>>>>     $ echo 1 > /sys/kernel/debug/tracing/tracing_on
>>>>> 
>>>>>     $ run balance... wait until it finishes with IO error or the
>>>>> patch's printk message shows up in dmesg/syslog
>>>>> 
>>>>>     $ echo 0 > /sys/kernel/debug/tracing/tracing_on
>>>>> 
>>>>>     $ cat /sys/kernel/debug/tracing/trace > some_file.txt
>>>>> 
>>>>> Then send is some_file.txt for debugging, hopefully it will give 
>>>>> some
>>>>> useful information. Note that it might produce tons of messages,
>>>>> depending on how long it takes for you to hit the BUG_ON.
>>>>> 
>>>>> Thanks a lot for this.
>>>> 
>>>> 
>>>> I'm compiling it now (using your v2 of the friendpaste diff).
>>>> 
>>>> I took the liberty to add a tracing_off() right before the return 
>>>> -EIO
>>>> so that the trace tail ends exactly at the right place.
>>>> 
>>>> Last time I tried to use ftrace to diagnose the bug we're trying to
>>>> fix, the system crashes so hard that usually it's complicated to get
>>>> the trace contents written somewhere before the system is unusable.
>>>> But I'll eventually work around it by using
>>>> /sys/kernel/debug/tracing/trace_pipe to send the trace live to 
>>>> another
>>>> machine over the LAN.
>>>> 
>>>> This series of bugs are so easy to trigger on my system that we'll
>>>> hopefully get something useful out of the trace. I guess that's a 
>>>> good
>>>> thing !
>>> 
>>> 
>>> So, this time it took a little over an hour to get the crash, but it 
>>> did
>>> reach the -EIO condition eventually.
>>> The ftrace log (2M gzipped) is available here :
>>> http://www.speed47.net/tmp2/btrfs-4.3rc6p7463161-ftrace1.log.gz
>>> 
>>> The associated kernel log is as follows :
>>> 
>>> [ 2880.178589] INFO: task btrfs-transacti:7358 blocked for more than 
>>> 120
>>> seconds.
>>> [ 2880.178600]       Not tainted 4.3.0-rc6p7463161+ #3
>>> [ 2880.178603] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>>> disables
>>> this message.
>>> [ 3088.829429] Out of memory: Kill process 9449 (df-complex2simp) 
>>> score 246
>>> or sacrifice child
>>> [ 3088.829435] Killed process 9449 (df-complex2simp) 
>>> total-vm:964732kB,
>>> anon-rss:943764kB, file-rss:0kB
>>> [ 3600.197642] INFO: task btrfs-transacti:7358 blocked for more than 
>>> 120
>>> seconds.
>>> [ 3600.197657]       Not tainted 4.3.0-rc6p7463161+ #3
>>> [ 3600.197660] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>>> disables
>>> this message.
>>> [ 3840.204146] INFO: task btrfs-transacti:7358 blocked for more than 
>>> 120
>>> seconds.
>>> [ 3840.204180]       Not tainted 4.3.0-rc6p7463161+ #3
>>> [ 3840.204219] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>>> disables
>>> this message.
>>> [ 3993.671982] Out of memory: Kill process 11357 (df-complex2simp) 
>>> score 227
>>> or sacrifice child
>>> [ 3993.671989] Killed process 11357 (df-complex2simp) 
>>> total-vm:891608kB,
>>> anon-rss:870704kB, file-rss:60kB
>>> [ 4080.210324] INFO: task btrfs-transacti:7358 blocked for more than 
>>> 120
>>> seconds.
>>> [ 4080.210336]       Not tainted 4.3.0-rc6p7463161+ #3
>>> [ 4080.210339] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>>> disables
>>> this message.
>>> [ 4320.215635] INFO: task btrfs-transacti:7358 blocked for more than 
>>> 120
>>> seconds.
>>> [ 4320.215662]       Not tainted 4.3.0-rc6p7463161+ #3
>>> [ 4320.215667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>>> disables
>>> this message.
>>> [ 4560.221119] INFO: task btrfs-transacti:7358 blocked for more than 
>>> 120
>>> seconds.
>>> [ 4560.221146]       Not tainted 4.3.0-rc6p7463161+ #3
>>> [ 4560.221148] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>>> disables
>>> this message.
>>> [ 4800.226884] INFO: task btrfs-transacti:7358 blocked for more than 
>>> 120
>>> seconds.
>>> [ 4800.226898]       Not tainted 4.3.0-rc6p7463161+ #3
>>> [ 4800.226902] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>>> disables
>>> this message.
>>> [ 4890.116131] Out of memory: Kill process 13377 (df-complex2simp) 
>>> score 207
>>> or sacrifice child
>>> [ 4890.116138] Killed process 13377 (df-complex2simp) 
>>> total-vm:834976kB,
>>> anon-rss:793272kB, file-rss:48kB
>>> [ 5785.793580] Out of memory: Kill process 15285 (df-complex2simp) 
>>> score 201
>>> or sacrifice child
>>> [ 5785.793586] Killed process 15285 (df-complex2simp) 
>>> total-vm:802208kB,
>>> anon-rss:772172kB, file-rss:4kB
>>> [ 6480.269728] INFO: task btrfs-transacti:7358 blocked for more than 
>>> 120
>>> seconds.
>>> [ 6480.269738]       Not tainted 4.3.0-rc6p7463161+ #3
>>> [ 6480.269740] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>>> disables
>>> this message.
>>> [ 7081.967354] BTRFS: here, ref_mod != 1, bytenr 12090260504576, 
>>> ref_mod 2,
>>> seq 0 action 1
>>> [ 7081.967784] BTRFS: error (device dm-3) in 
>>> btrfs_run_delayed_refs:2872:
>>> errno=-5 IO failure
>>> 
>>> The OOM conditions are unrelated, this is an rrdtool cron script that
>>> apparently takes too much memory from time to time, I'll disable it 
>>> for
>>> further tests to get noise out of the way.
>>> 
>>> If I can be of some more help, just ask.
>> 
>> That's great Stéphane.
>> I'll take a look tomorrow morning at the logs and then get back to 
>> you.
>> 
>> Thanks a lot for that.
> 
> Stéphane, can you do the balance again but with the following new debug 
> patch:
> 
> https://friendpaste.com/3NXzDFGcK4Hxi0yiypMgRO
> 
> (I added your tracing_off(); call)
> 
> It applies on top of my fix.
> Also, are you by chance using qgroups (btrfs quotas)? If you are I
> think I know what's the problem, but either way please run with this
> new debug patch, so that I can confirm my ideas.

I am indeed using qgroups (use case for me is to be able to tell how 
much real space are taking snapshots).

I'm applying your patch and resuming the balance, will keep you posted 
as usual!

> Many thanks, much appreciated!

And thanks for taking the time to look into this :)

-- 
Stéphane.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Btrfs: fix regression when running delayed references
  2015-10-23 10:59                   ` Stéphane Lesimple
@ 2015-10-23 14:35                     ` Stéphane Lesimple
  0 siblings, 0 replies; 15+ messages in thread
From: Stéphane Lesimple @ 2015-10-23 14:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana, Qu Wenruo, Filipe Manana

Le 2015-10-23 12:59, Stéphane Lesimple a écrit :
> Le 2015-10-23 12:11, Filipe Manana a écrit :
>> On Fri, Oct 23, 2015 at 12:03 AM, Filipe Manana <fdmanana@kernel.org> 
>> wrote:
>>> On Thu, Oct 22, 2015 at 11:38 PM, Stéphane Lesimple
>>> <stephane_btrfs@lesimple.fr> wrote:
>>>>>>> [ ... thread cleanup ... ]
>>>>>>> 
>>>>>>> Don't hesitate to ask if you need me to debug or even ftrace 
>>>>>>> something.
>>>>>> 
>>>>>> 
>>>>>> Thanks Stéphane. I haven't seen that crash yet (still running 
>>>>>> tests
>>>>>> for 2 consecutive days now).
>>>>>> Can you please try the following patch, which works on top of 
>>>>>> mine,
>>>>>> and enable ftrace before running balance:
>>>>>> 
>>>>>> Debug patch:  https://friendpaste.com/5s3dItRpcpq3dH1E4KUJor
>>>>>> 
>>>>>> Enable ftrace:
>>>>>> 
>>>>>>     $ echo > /sys/kernel/debug/tracing/trace
>>>>>>     $ echo "nop" > /sys/kernel/debug/tracing/current_tracer
>>>>>>     $ echo 100000 > /sys/kernel/debug/tracing/buffer_size_kb   # 
>>>>>> if
>>>>>> you can use larger buffer size, even better
>>>>>>     $ echo > /sys/kernel/debug/tracing/set_ftrace_filter
>>>>>>     $ echo 1 > /sys/kernel/debug/tracing/tracing_on
>>>>>> 
>>>>>>     $ run balance... wait until it finishes with IO error or the
>>>>>> patch's printk message shows up in dmesg/syslog
>>>>>> 
>>>>>>     $ echo 0 > /sys/kernel/debug/tracing/tracing_on
>>>>>> 
>>>>>>     $ cat /sys/kernel/debug/tracing/trace > some_file.txt
>>>>>> 
>>>>>> Then send is some_file.txt for debugging, hopefully it will give 
>>>>>> some
>>>>>> useful information. Note that it might produce tons of messages,
>>>>>> depending on how long it takes for you to hit the BUG_ON.
>>>>>> 
>>>>>> Thanks a lot for this.
>>>>> 
>>>>> 
>>>>> I'm compiling it now (using your v2 of the friendpaste diff).
>>>>> 
>>>>> I took the liberty to add a tracing_off() right before the return 
>>>>> -EIO
>>>>> so that the trace tail ends exactly at the right place.
>>>>> 
>>>>> Last time I tried to use ftrace to diagnose the bug we're trying to
>>>>> fix, the system crashes so hard that usually it's complicated to 
>>>>> get
>>>>> the trace contents written somewhere before the system is unusable.
>>>>> But I'll eventually work around it by using
>>>>> /sys/kernel/debug/tracing/trace_pipe to send the trace live to 
>>>>> another
>>>>> machine over the LAN.
>>>>> 
>>>>> This series of bugs are so easy to trigger on my system that we'll
>>>>> hopefully get something useful out of the trace. I guess that's a 
>>>>> good
>>>>> thing !
>>>> 
>>>> 
>>>> So, this time it took a little over an hour to get the crash, but it 
>>>> did
>>>> reach the -EIO condition eventually.
>>>> The ftrace log (2M gzipped) is available here :
>>>> http://www.speed47.net/tmp2/btrfs-4.3rc6p7463161-ftrace1.log.gz
>>>> 
>>>> The associated kernel log is as follows :
>>>> 
>>>> [ 2880.178589] INFO: task btrfs-transacti:7358 blocked for more than 
>>>> 120
>>>> seconds.
>>>> [ 2880.178600]       Not tainted 4.3.0-rc6p7463161+ #3
>>>> [ 2880.178603] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>>>> disables
>>>> this message.
>>>> [ 3088.829429] Out of memory: Kill process 9449 (df-complex2simp) 
>>>> score 246
>>>> or sacrifice child
>>>> [ 3088.829435] Killed process 9449 (df-complex2simp) 
>>>> total-vm:964732kB,
>>>> anon-rss:943764kB, file-rss:0kB
>>>> [ 3600.197642] INFO: task btrfs-transacti:7358 blocked for more than 
>>>> 120
>>>> seconds.
>>>> [ 3600.197657]       Not tainted 4.3.0-rc6p7463161+ #3
>>>> [ 3600.197660] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>>>> disables
>>>> this message.
>>>> [ 3840.204146] INFO: task btrfs-transacti:7358 blocked for more than 
>>>> 120
>>>> seconds.
>>>> [ 3840.204180]       Not tainted 4.3.0-rc6p7463161+ #3
>>>> [ 3840.204219] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>>>> disables
>>>> this message.
>>>> [ 3993.671982] Out of memory: Kill process 11357 (df-complex2simp) 
>>>> score 227
>>>> or sacrifice child
>>>> [ 3993.671989] Killed process 11357 (df-complex2simp) 
>>>> total-vm:891608kB,
>>>> anon-rss:870704kB, file-rss:60kB
>>>> [ 4080.210324] INFO: task btrfs-transacti:7358 blocked for more than 
>>>> 120
>>>> seconds.
>>>> [ 4080.210336]       Not tainted 4.3.0-rc6p7463161+ #3
>>>> [ 4080.210339] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>>>> disables
>>>> this message.
>>>> [ 4320.215635] INFO: task btrfs-transacti:7358 blocked for more than 
>>>> 120
>>>> seconds.
>>>> [ 4320.215662]       Not tainted 4.3.0-rc6p7463161+ #3
>>>> [ 4320.215667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>>>> disables
>>>> this message.
>>>> [ 4560.221119] INFO: task btrfs-transacti:7358 blocked for more than 
>>>> 120
>>>> seconds.
>>>> [ 4560.221146]       Not tainted 4.3.0-rc6p7463161+ #3
>>>> [ 4560.221148] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>>>> disables
>>>> this message.
>>>> [ 4800.226884] INFO: task btrfs-transacti:7358 blocked for more than 
>>>> 120
>>>> seconds.
>>>> [ 4800.226898]       Not tainted 4.3.0-rc6p7463161+ #3
>>>> [ 4800.226902] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>>>> disables
>>>> this message.
>>>> [ 4890.116131] Out of memory: Kill process 13377 (df-complex2simp) 
>>>> score 207
>>>> or sacrifice child
>>>> [ 4890.116138] Killed process 13377 (df-complex2simp) 
>>>> total-vm:834976kB,
>>>> anon-rss:793272kB, file-rss:48kB
>>>> [ 5785.793580] Out of memory: Kill process 15285 (df-complex2simp) 
>>>> score 201
>>>> or sacrifice child
>>>> [ 5785.793586] Killed process 15285 (df-complex2simp) 
>>>> total-vm:802208kB,
>>>> anon-rss:772172kB, file-rss:4kB
>>>> [ 6480.269728] INFO: task btrfs-transacti:7358 blocked for more than 
>>>> 120
>>>> seconds.
>>>> [ 6480.269738]       Not tainted 4.3.0-rc6p7463161+ #3
>>>> [ 6480.269740] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>>>> disables
>>>> this message.
>>>> [ 7081.967354] BTRFS: here, ref_mod != 1, bytenr 12090260504576, 
>>>> ref_mod 2,
>>>> seq 0 action 1
>>>> [ 7081.967784] BTRFS: error (device dm-3) in 
>>>> btrfs_run_delayed_refs:2872:
>>>> errno=-5 IO failure
>>>> 
>>>> The OOM conditions are unrelated, this is an rrdtool cron script 
>>>> that
>>>> apparently takes too much memory from time to time, I'll disable it 
>>>> for
>>>> further tests to get noise out of the way.
>>>> 
>>>> If I can be of some more help, just ask.
>>> 
>>> That's great Stéphane.
>>> I'll take a look tomorrow morning at the logs and then get back to 
>>> you.
>>> 
>>> Thanks a lot for that.
>> 
>> Stéphane, can you do the balance again but with the following new 
>> debug patch:
>> 
>> https://friendpaste.com/3NXzDFGcK4Hxi0yiypMgRO
>> 
>> (I added your tracing_off(); call)
>> 
>> It applies on top of my fix.
>> Also, are you by chance using qgroups (btrfs quotas)? If you are I
>> think I know what's the problem, but either way please run with this
>> new debug patch, so that I can confirm my ideas.
> 
> I am indeed using qgroups (use case for me is to be able to tell how
> much real space are taking snapshots).
> 
> I'm applying your patch and resuming the balance, will keep you posted 
> as usual!
> 
>> Many thanks, much appreciated!
> 
> And thanks for taking the time to look into this :)

Okay, the system crashed as expected but after ~30 minutes this time. 
I'm lucky on this run because the NMI watchdog successfully kicked in, 
panicked the kernel nicely and autorebooted without physical 
intervention :)

The kernel log is as follows :

[ 1537.855455] BTRFS: here, ref_mod != 1, bytenr 12089671335936, ref_mod 
2, seq 0 action 1
[ 1537.855591] BTRFS: error (device dm-3) in 
btrfs_run_delayed_refs:2877: errno=-5 IO failure
[ 1561.908656] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! 
[btrfs:6297]
[ 1561.909839] Kernel panic - not syncing: softlockup: hung tasks
[ 1561.909861] CPU: 1 PID: 6297 Comm: btrfs Tainted: G        W    L  
4.3.0-rc6p7463161+ #3
[ 1561.909883] Hardware name: ASUS All Series/H87I-PLUS, BIOS 1005 
01/06/2014
[ 1561.909903]  ffff8800b49ddb00 ffff88011fb03e30 ffffffff813a12cd 
ffffffff81ac0dfa
[ 1561.909930]  ffff88011fb03ea8 ffffffff811753fe 0000000000000008 
ffff88011fb03eb8
[ 1561.909955]  ffff88011fb03e58 ffff88011fb03e98 0000000000000000 
0000000000000100
[ 1561.909982] Call Trace:
[ 1561.909990]  <IRQ>  [<ffffffff813a12cd>] dump_stack+0x44/0x57
[ 1561.910011]  [<ffffffff811753fe>] panic+0xc8/0x205
[ 1561.910027]  [<ffffffff8112931b>] watchdog_timer_fn+0x1db/0x1f0
[ 1561.910044]  [<ffffffff81129140>] ? watchdog+0x40/0x40
[ 1561.910062]  [<ffffffff810e2394>] __hrtimer_run_queues+0xe4/0x200
[ 1561.910080]  [<ffffffff810e27c8>] hrtimer_interrupt+0xa8/0x1a0
[ 1561.910099]  [<ffffffff8104e105>] 
local_apic_timer_interrupt+0x35/0x60
[ 1561.910118]  [<ffffffff817ae5fd>] smp_apic_timer_interrupt+0x3d/0x60
[ 1561.910136]  [<ffffffff817ac462>] apic_timer_interrupt+0x82/0x90
[ 1561.910153]  <EOI>  [<ffffffffc090e2f0>] ? 
__del_reloc_root+0xd0/0x100 [btrfs]
[ 1561.910187]  [<ffffffffc090e276>] ? __del_reloc_root+0x56/0x100 
[btrfs]
[ 1561.910211]  [<ffffffffc090e250>] ? __del_reloc_root+0x30/0x100 
[btrfs]
[ 1561.910235]  [<ffffffffc090e341>] free_reloc_roots+0x21/0x30 [btrfs]
[ 1561.910257]  [<ffffffffc091190a>] merge_reloc_roots+0x16a/0x230 
[btrfs]
[ 1561.910280]  [<ffffffffc0911dc5>] relocate_block_group+0x3f5/0x610 
[btrfs]
[ 1561.910304]  [<ffffffffc0912185>] 
btrfs_relocate_block_group+0x1a5/0x290 [btrfs]
[ 1561.910330]  [<ffffffffc08e8217>] 
btrfs_relocate_chunk.isra.36+0x47/0xd0 [btrfs]
[ 1561.910356]  [<ffffffffc08e99a0>] btrfs_balance+0x7d0/0xea0 [btrfs]
[ 1561.910379]  [<ffffffffc08f58de>] btrfs_ioctl_balance+0x17e/0x540 
[btrfs]
[ 1561.910404]  [<ffffffffc08f80a7>] btrfs_ioctl+0x567/0x2440 [btrfs]
[ 1561.910422]  [<ffffffff81186637>] ? 
lru_cache_add_active_or_unevictable+0x27/0x90
[ 1561.910443]  [<ffffffff811a6850>] ? handle_mm_fault+0xb90/0x17d0
[ 1561.910461]  [<ffffffff811aa929>] ? vma_link+0xb9/0xc0
[ 1561.910476]  [<ffffffff8120192d>] do_vfs_ioctl+0x2cd/0x4b0
[ 1561.910492]  [<ffffffff81064797>] ? __do_page_fault+0x1b7/0x430
[ 1561.910509]  [<ffffffff81201b89>] SyS_ioctl+0x79/0x90
[ 1561.910524]  [<ffffffff817ab6f6>] entry_SYSCALL_64_fastpath+0x16/0x75
[ 1561.910554] Kernel Offset: disabled
[ 1561.917367] Rebooting in 5 seconds..
[ 1566.951784] ACPI MEMORY or I/O RESET_REG.

The gzipped ftrace log is less that 4M and available here : 
http://www.speed47.net/tmp2/btrfs-4.3rc6p7463161-ftrace2.log.gz

Thanks,

-- 
Stéphane.




^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-10-23 14:35 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-22  8:47 [PATCH] Btrfs: fix regression when running delayed references fdmanana
2015-10-22  9:32 ` Qu Wenruo
2015-10-22  9:43   ` Filipe Manana
2015-10-22  9:47     ` Filipe Manana
2015-10-22 14:58       ` Stéphane Lesimple
2015-10-22 17:03         ` Filipe Manana
2015-10-22 17:54           ` Stéphane Lesimple
2015-10-22 22:38             ` Stéphane Lesimple
2015-10-22 23:03               ` Filipe Manana
2015-10-23 10:11                 ` Filipe Manana
2015-10-23 10:59                   ` Stéphane Lesimple
2015-10-23 14:35                     ` Stéphane Lesimple
2015-10-22 10:06     ` Qu Wenruo
2015-10-22 10:05 ` Koen Kooi
2015-10-22 10:12   ` Filipe Manana

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.