All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/9] btrfs: Fix no_space on dd and rm loop
@ 2015-04-09  4:34 Zhaolei
  2015-04-09  4:34 ` [PATCH 1/9] btrfs: fix condition of commit transaction Zhaolei
                   ` (8 more replies)
  0 siblings, 9 replies; 16+ messages in thread
From: Zhaolei @ 2015-04-09  4:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zhao Lei

From: Zhao Lei <zhaolei@cn.fujitsu.com>

This is v2 of resend-fix-no-space.

Most of them are send in single patch, I resend them in patchset
to make it easy to access.

Notice that "Btrfs: fix find_free_dev_extent() malfunction in case
device tree has hole" from Forrest Liu in:
  https://patchwork.kernel.org/patch/5800231/
is also need to fix all known no_space bug.

Changelog v1->v2:
1: Rebased on top of v4.0-rc7
2: Fixed a lock problem reported by:
   'Tsutomu Itoh' <t-itoh@jp.fujitsu.com>
3: Add Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
   to [PATCH 2/9] btrfs:

Tested by busy dd and rm loop script in 2000 times.
I'll add xfstests for this case later.

This is available at fix_no_space branch on my tree:
  git://github.com/zhaoleidd/btrfs.git
It is also included in integration-for-chris branch in above tree.

Thanks
Zhaolei

Zhao Lei (9):
  btrfs: fix condition of commit transaction
  btrfs: Fix tail space processing in find_free_dev_extent()
  btrfs: Adjust commit-transaction condition to avoid NO_SPACE more
  btrfs: Set relative data on clear btrfs_block_group_cache->pinned
  btrfs: add WARN_ON() to check is space_info op current
  btrfs: Fix NO_SPACE bug caused by delayed-iput
  btrfs: Support busy loop of write and delete
  btrfs: wait for delayed iputs on no space
  btrfs: cleanup unused alloc_chunk varible

 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/disk-io.c     |  3 ++-
 fs/btrfs/extent-tree.c | 66 +++++++++++++++++++++++++++++++++++++++-----------
 fs/btrfs/inode.c       |  4 +++
 fs/btrfs/volumes.c     | 24 +++++++++---------
 5 files changed, 72 insertions(+), 26 deletions(-)

-- 
1.8.5.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/9] btrfs: fix condition of commit transaction
  2015-04-09  4:34 [PATCH v2 0/9] btrfs: Fix no_space on dd and rm loop Zhaolei
@ 2015-04-09  4:34 ` Zhaolei
  2015-04-09  4:34 ` [PATCH 2/9] btrfs: Fix tail space processing in find_free_dev_extent() Zhaolei
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Zhaolei @ 2015-04-09  4:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zhao Lei

From: Zhao Lei <zhaolei@cn.fujitsu.com>

Old code bypass commit transaction when we don't have enough
pinned space, but another case is there exist freed bgs in current
transction, it have possibility to make alloc_chunk success.

This patch modify the condition to:
if (have_free_bg || have_pinned_space) commit_transaction()

Confirmed above action by printk before and after patch.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8b353ad..ebeedb4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3641,7 +3641,7 @@ int btrfs_check_data_free_space(struct inode *inode, u64 bytes)
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	u64 used;
-	int ret = 0, committed = 0, alloc_chunk = 1;
+	int ret = 0, committed = 0, have_pinned_space = 1, alloc_chunk = 1;
 
 	/* make sure bytes are sectorsize aligned */
 	bytes = ALIGN(bytes, root->sectorsize);
@@ -3709,11 +3709,12 @@ alloc:
 
 		/*
 		 * If we don't have enough pinned space to deal with this
-		 * allocation don't bother committing the transaction.
+		 * allocation, and no removed chunk in current transaction,
+		 * don't bother committing the transaction.
 		 */
 		if (percpu_counter_compare(&data_sinfo->total_bytes_pinned,
 					   bytes) < 0)
-			committed = 1;
+			have_pinned_space = 0;
 		spin_unlock(&data_sinfo->lock);
 
 		/* commit the current transaction and try again */
@@ -3725,10 +3726,15 @@ commit_trans:
 			trans = btrfs_join_transaction(root);
 			if (IS_ERR(trans))
 				return PTR_ERR(trans);
-			ret = btrfs_commit_transaction(trans, root);
-			if (ret)
-				return ret;
-			goto again;
+			if (have_pinned_space ||
+			    trans->transaction->have_free_bgs) {
+				ret = btrfs_commit_transaction(trans, root);
+				if (ret)
+					return ret;
+				goto again;
+			} else {
+				btrfs_end_transaction(trans, root);
+			}
 		}
 
 		trace_btrfs_space_reservation(root->fs_info,
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/9] btrfs: Fix tail space processing in find_free_dev_extent()
  2015-04-09  4:34 [PATCH v2 0/9] btrfs: Fix no_space on dd and rm loop Zhaolei
  2015-04-09  4:34 ` [PATCH 1/9] btrfs: fix condition of commit transaction Zhaolei
@ 2015-04-09  4:34 ` Zhaolei
  2015-04-09  4:34 ` [PATCH 3/9] btrfs: Adjust commit-transaction condition to avoid NO_SPACE more Zhaolei
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Zhaolei @ 2015-04-09  4:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zhao Lei

From: Zhao Lei <zhaolei@cn.fujitsu.com>

It is another reason for NO_SPACE case.

When we found enough free space in loop and saved them to
max_hole_start/size before, and tail space contains pending extent,
origional innocent max_hole_start/size are reset in retry.

As a result, find_free_dev_extent() returns less space than it can,
and cause NO_SPACE in user program.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
---
 fs/btrfs/volumes.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8222f6f..586824a 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1136,11 +1136,11 @@ int find_free_dev_extent(struct btrfs_trans_handle *trans,
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
-again:
+
 	max_hole_start = search_start;
 	max_hole_size = 0;
-	hole_size = 0;
 
+again:
 	if (search_start >= search_end || device->is_tgtdev_for_dev_replace) {
 		ret = -ENOSPC;
 		goto out;
@@ -1233,21 +1233,23 @@ next:
 	 * allocated dev extents, and when shrinking the device,
 	 * search_end may be smaller than search_start.
 	 */
-	if (search_end > search_start)
+	if (search_end > search_start) {
 		hole_size = search_end - search_start;
 
-	if (hole_size > max_hole_size) {
-		max_hole_start = search_start;
-		max_hole_size = hole_size;
-	}
+		if (contains_pending_extent(trans, device, &search_start,
+					    hole_size)) {
+			btrfs_release_path(path);
+			goto again;
+		}
 
-	if (contains_pending_extent(trans, device, &search_start, hole_size)) {
-		btrfs_release_path(path);
-		goto again;
+		if (hole_size > max_hole_size) {
+			max_hole_start = search_start;
+			max_hole_size = hole_size;
+		}
 	}
 
 	/* See above. */
-	if (hole_size < num_bytes)
+	if (max_hole_size < num_bytes)
 		ret = -ENOSPC;
 	else
 		ret = 0;
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/9] btrfs: Adjust commit-transaction condition to avoid NO_SPACE more
  2015-04-09  4:34 [PATCH v2 0/9] btrfs: Fix no_space on dd and rm loop Zhaolei
  2015-04-09  4:34 ` [PATCH 1/9] btrfs: fix condition of commit transaction Zhaolei
  2015-04-09  4:34 ` [PATCH 2/9] btrfs: Fix tail space processing in find_free_dev_extent() Zhaolei
@ 2015-04-09  4:34 ` Zhaolei
  2015-04-09  4:34 ` [PATCH 4/9] btrfs: Set relative data on clear btrfs_block_group_cache->pinned Zhaolei
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Zhaolei @ 2015-04-09  4:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zhao Lei

From: Zhao Lei <zhaolei@cn.fujitsu.com>

If we have any chance to make a successful write, we should not give up.

This patch adjust commit-transaction condition from:
  pinned >= wanted
to
  left + pinned >= wanted

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ebeedb4..644468b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3713,7 +3713,8 @@ alloc:
 		 * don't bother committing the transaction.
 		 */
 		if (percpu_counter_compare(&data_sinfo->total_bytes_pinned,
-					   bytes) < 0)
+					   used + bytes -
+					   data_sinfo->total_bytes) < 0)
 			have_pinned_space = 0;
 		spin_unlock(&data_sinfo->lock);
 
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 4/9] btrfs: Set relative data on clear btrfs_block_group_cache->pinned
  2015-04-09  4:34 [PATCH v2 0/9] btrfs: Fix no_space on dd and rm loop Zhaolei
                   ` (2 preceding siblings ...)
  2015-04-09  4:34 ` [PATCH 3/9] btrfs: Adjust commit-transaction condition to avoid NO_SPACE more Zhaolei
@ 2015-04-09  4:34 ` Zhaolei
  2015-04-09  4:34 ` [PATCH 5/9] btrfs: add WARN_ON() to check is space_info op current Zhaolei
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Zhaolei @ 2015-04-09  4:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zhao Lei

From: Zhao Lei <zhaolei@cn.fujitsu.com>

Bug1:
  space_info->bytes_readonly was set to very large(negative) value in
  btrfs_remove_block_group().

Reason:
  Current code set block_group_cache->pinned = 0 in btrfs_delete_unused_bgs(),
  but above space was not counted to space_info->bytes_readonly.

  Then in btrfs_remove_block_group():
    block_group->space_info->bytes_readonly -= block_group->key.offset;
  We can see following value in trace:
    btrfs_remove_block_group: pid=2677 comm=btrfs-cleaner WARNING: bytes_readonly=12582912, key.offset=134217728

Bug2:
  space_info->total_bytes_pinned grow to value larger than fs size.
  In a 1.2G fs, we can get following trace log:
  at first:
    ZL_DEBUG: add_pinned_bytes: pid=2710 comm=sync change total_bytes_pinned flags=1 869793792 + 95944704 = 965738496
  after some op:
    ZL_DEBUG: add_pinned_bytes: pid=2770 comm=sync change total_bytes_pinned flags=1 1780178944 + 95944704 = 1876123648
  after some op:
    ZL_DEBUG: add_pinned_bytes: pid=3193 comm=sync change total_bytes_pinned flags=1 2924568576 + 95551488 = 3020120064
  ...

Reason:
  Similar to bug1, we also need to adjust space_info->total_bytes_pinned
  in above code block.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 644468b..e04ea1f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9654,8 +9654,18 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		mutex_unlock(&fs_info->unused_bg_unpin_mutex);
 
 		/* Reset pinned so btrfs_put_block_group doesn't complain */
+		spin_lock(&space_info->lock);
+		spin_lock(&block_group->lock);
+
+		space_info->bytes_pinned -= block_group->pinned;
+		space_info->bytes_readonly += block_group->pinned;
+		percpu_counter_add(&space_info->total_bytes_pinned,
+				   -block_group->pinned);
 		block_group->pinned = 0;
 
+		spin_unlock(&block_group->lock);
+		spin_unlock(&space_info->lock);
+
 		/*
 		 * Btrfs_remove_chunk will abort the transaction if things go
 		 * horribly wrong.
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 5/9] btrfs: add WARN_ON() to check is space_info op current
  2015-04-09  4:34 [PATCH v2 0/9] btrfs: Fix no_space on dd and rm loop Zhaolei
                   ` (3 preceding siblings ...)
  2015-04-09  4:34 ` [PATCH 4/9] btrfs: Set relative data on clear btrfs_block_group_cache->pinned Zhaolei
@ 2015-04-09  4:34 ` Zhaolei
  2015-04-09  4:34 ` [PATCH 6/9] btrfs: Fix NO_SPACE bug caused by delayed-iput Zhaolei
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Zhaolei @ 2015-04-09  4:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zhao Lei

From: Zhao Lei <zhaolei@cn.fujitsu.com>

space_info's value calculation is some complex and easy to cause
bug, add WARN_ON() to help debug.

Changelog v1->v2:
 Put WARN_ON()s under the ENOSPC_DEBUG mount option.
 Suggested by: David Sterba <dsterba@suse.cz>

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e04ea1f..203ac63 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9464,9 +9464,19 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 
 	spin_lock(&block_group->space_info->lock);
 	list_del_init(&block_group->ro_list);
+
+	if (btrfs_test_opt(root, ENOSPC_DEBUG)) {
+		WARN_ON(block_group->space_info->total_bytes
+			< block_group->key.offset);
+		WARN_ON(block_group->space_info->bytes_readonly
+			< block_group->key.offset);
+		WARN_ON(block_group->space_info->disk_total
+			< block_group->key.offset * factor);
+	}
 	block_group->space_info->total_bytes -= block_group->key.offset;
 	block_group->space_info->bytes_readonly -= block_group->key.offset;
 	block_group->space_info->disk_total -= block_group->key.offset * factor;
+
 	spin_unlock(&block_group->space_info->lock);
 
 	memcpy(&key, &block_group->key, sizeof(key));
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 6/9] btrfs: Fix NO_SPACE bug caused by delayed-iput
  2015-04-09  4:34 [PATCH v2 0/9] btrfs: Fix no_space on dd and rm loop Zhaolei
                   ` (4 preceding siblings ...)
  2015-04-09  4:34 ` [PATCH 5/9] btrfs: add WARN_ON() to check is space_info op current Zhaolei
@ 2015-04-09  4:34 ` Zhaolei
  2015-07-07  8:13   ` Liu Bo
  2015-04-09  4:34 ` [PATCH 7/9] btrfs: Support busy loop of write and delete Zhaolei
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 16+ messages in thread
From: Zhaolei @ 2015-04-09  4:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zhao Lei

From: Zhao Lei <zhaolei@cn.fujitsu.com>

Steps to reproduce:
  while true; do
    dd if=/dev/zero of=/btrfs_dir/file count=[fs_size * 75%]
    rm /btrfs_dir/file
    sync
  done

  And we'll see dd failed because btrfs return NO_SPACE.

Reason:
  Normally, btrfs_commit_transaction() call btrfs_run_delayed_iputs()
  in end to free fs space for next write, but sometimes it hadn't
  done work on time, because btrfs-cleaner thread get delayed-iputs
  from list before, but do iput() after next write.

  This is log:
  [ 2569.050776] comm=btrfs-cleaner func=btrfs_evict_inode() begin

  [ 2569.084280] comm=sync func=btrfs_commit_transaction() call btrfs_run_delayed_iputs()
  [ 2569.085418] comm=sync func=btrfs_commit_transaction() done btrfs_run_delayed_iputs()
  [ 2569.087554] comm=sync func=btrfs_commit_transaction() end

  [ 2569.191081] comm=dd begin
  [ 2569.790112] comm=dd func=__btrfs_buffered_write() ret=-28

  [ 2569.847479] comm=btrfs-cleaner func=add_pinned_bytes() 0 + 32677888 = 32677888
  [ 2569.849530] comm=btrfs-cleaner func=add_pinned_bytes() 32677888 + 23834624 = 56512512
  ...
  [ 2569.903893] comm=btrfs-cleaner func=add_pinned_bytes() 943976448 + 21762048 = 965738496
  [ 2569.908270] comm=btrfs-cleaner func=btrfs_evict_inode() end

Fix:
  Make btrfs_commit_transaction() wait current running btrfs-cleaner's
  delayed-iputs() done in end.

Test:
  Use script similar to above(more complex),
  before patch:
    7 failed in 100 * 20 loop.
  after patch:
    0 failed in 100 * 20 loop.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
---
 fs/btrfs/ctree.h       | 1 +
 fs/btrfs/disk-io.c     | 3 ++-
 fs/btrfs/extent-tree.c | 6 ++++++
 fs/btrfs/inode.c       | 4 ++++
 4 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f9c89ca..54d4d78 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1513,6 +1513,7 @@ struct btrfs_fs_info {
 
 	spinlock_t delayed_iput_lock;
 	struct list_head delayed_iputs;
+	struct rw_semaphore delayed_iput_sem;
 
 	/* this protects tree_mod_seq_list */
 	spinlock_t tree_mod_seq_lock;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 639f266..6867471 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2241,11 +2241,12 @@ int open_ctree(struct super_block *sb,
 	spin_lock_init(&fs_info->qgroup_op_lock);
 	spin_lock_init(&fs_info->buffer_lock);
 	spin_lock_init(&fs_info->unused_bgs_lock);
-	mutex_init(&fs_info->unused_bg_unpin_mutex);
 	rwlock_init(&fs_info->tree_mod_log_lock);
+	mutex_init(&fs_info->unused_bg_unpin_mutex);
 	mutex_init(&fs_info->reloc_mutex);
 	mutex_init(&fs_info->delalloc_root_mutex);
 	seqlock_init(&fs_info->profiles_lock);
+	init_rwsem(&fs_info->delayed_iput_sem);
 
 	init_completion(&fs_info->kobj_unregister);
 	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 203ac63..6fd7dca 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3732,6 +3732,12 @@ commit_trans:
 				ret = btrfs_commit_transaction(trans, root);
 				if (ret)
 					return ret;
+				/*
+				 * make sure that all running delayed iput are
+				 * done
+				 */
+				down_write(&root->fs_info->delayed_iput_sem);
+				up_write(&root->fs_info->delayed_iput_sem);
 				goto again;
 			} else {
 				btrfs_end_transaction(trans, root);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d2e732d..34d10be 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3110,6 +3110,8 @@ void btrfs_run_delayed_iputs(struct btrfs_root *root)
 	if (empty)
 		return;
 
+	down_read(&fs_info->delayed_iput_sem);
+
 	spin_lock(&fs_info->delayed_iput_lock);
 	list_splice_init(&fs_info->delayed_iputs, &list);
 	spin_unlock(&fs_info->delayed_iput_lock);
@@ -3120,6 +3122,8 @@ void btrfs_run_delayed_iputs(struct btrfs_root *root)
 		iput(delayed->inode);
 		kfree(delayed);
 	}
+
+	up_read(&root->fs_info->delayed_iput_sem);
 }
 
 /*
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 7/9] btrfs: Support busy loop of write and delete
  2015-04-09  4:34 [PATCH v2 0/9] btrfs: Fix no_space on dd and rm loop Zhaolei
                   ` (5 preceding siblings ...)
  2015-04-09  4:34 ` [PATCH 6/9] btrfs: Fix NO_SPACE bug caused by delayed-iput Zhaolei
@ 2015-04-09  4:34 ` Zhaolei
  2015-04-09  4:34 ` [PATCH 8/9] btrfs: wait for delayed iputs on no space Zhaolei
  2015-04-09  4:34 ` [PATCH 9/9] btrfs: cleanup unused alloc_chunk varible Zhaolei
  8 siblings, 0 replies; 16+ messages in thread
From: Zhaolei @ 2015-04-09  4:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zhao Lei

From: Zhao Lei <zhaolei@cn.fujitsu.com>

Reproduce:
 while true; do
   dd if=/dev/zero of=/mnt/btrfs/file count=[75% fs_size]
   rm /mnt/btrfs/file
 done
 Then we can see above loop failed on NO_SPACE.

It it long-term problem since very beginning, because delayed-iput
after rm are not run.

We already have commit_transaction() in alloc_space code, but it is
not triggered in above case.
This patch trigger commit_transaction() to run delayed-iput and
reflash pinned-space to to make write success.

It is based on previous fix of delayed-iput in commit_transaction(),
need to be applied on top of:
btrfs: Fix NO_SPACE bug caused by delayed-iput

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6fd7dca..0572f14 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3641,13 +3641,13 @@ int btrfs_check_data_free_space(struct inode *inode, u64 bytes)
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	u64 used;
-	int ret = 0, committed = 0, have_pinned_space = 1, alloc_chunk = 1;
+	int ret = 0, need_commit = 2, have_pinned_space, alloc_chunk = 1;
 
 	/* make sure bytes are sectorsize aligned */
 	bytes = ALIGN(bytes, root->sectorsize);
 
 	if (btrfs_is_free_space_inode(inode)) {
-		committed = 1;
+		need_commit = 0;
 		ASSERT(current->journal_info);
 	}
 
@@ -3697,8 +3697,10 @@ alloc:
 			if (ret < 0) {
 				if (ret != -ENOSPC)
 					return ret;
-				else
+				else {
+					have_pinned_space = 1;
 					goto commit_trans;
+				}
 			}
 
 			if (!data_sinfo)
@@ -3712,23 +3714,23 @@ alloc:
 		 * allocation, and no removed chunk in current transaction,
 		 * don't bother committing the transaction.
 		 */
-		if (percpu_counter_compare(&data_sinfo->total_bytes_pinned,
-					   used + bytes -
-					   data_sinfo->total_bytes) < 0)
-			have_pinned_space = 0;
+		have_pinned_space = percpu_counter_compare(
+			&data_sinfo->total_bytes_pinned,
+			used + bytes - data_sinfo->total_bytes);
 		spin_unlock(&data_sinfo->lock);
 
 		/* commit the current transaction and try again */
 commit_trans:
-		if (!committed &&
+		if (need_commit &&
 		    !atomic_read(&root->fs_info->open_ioctl_trans)) {
-			committed = 1;
+			need_commit--;
 
 			trans = btrfs_join_transaction(root);
 			if (IS_ERR(trans))
 				return PTR_ERR(trans);
-			if (have_pinned_space ||
-			    trans->transaction->have_free_bgs) {
+			if (have_pinned_space >= 0 ||
+			    trans->transaction->have_free_bgs ||
+			    need_commit > 0) {
 				ret = btrfs_commit_transaction(trans, root);
 				if (ret)
 					return ret;
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 8/9] btrfs: wait for delayed iputs on no space
  2015-04-09  4:34 [PATCH v2 0/9] btrfs: Fix no_space on dd and rm loop Zhaolei
                   ` (6 preceding siblings ...)
  2015-04-09  4:34 ` [PATCH 7/9] btrfs: Support busy loop of write and delete Zhaolei
@ 2015-04-09  4:34 ` Zhaolei
  2015-04-13 14:54   ` Chris Mason
  2015-04-09  4:34 ` [PATCH 9/9] btrfs: cleanup unused alloc_chunk varible Zhaolei
  8 siblings, 1 reply; 16+ messages in thread
From: Zhaolei @ 2015-04-09  4:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zhao Lei

From: Zhao Lei <zhaolei@cn.fujitsu.com>

btrfs will report no_space when we run following write and delete
file loop:
 # FILE_SIZE_M=[ 75% of fs space ]
 # DEV=[ some dev ]
 # MNT=[ some dir ]
 #
 # mkfs.btrfs -f "$DEV"
 # mount -o nodatacow "$DEV" "$MNT"
 # for ((i = 0; i < 100; i++)); do dd if=/dev/zero of="$MNT"/file0 bs=1M count="$FILE_SIZE_M"; rm -f "$MNT"/file0; done
 #

Reason:
 iput() and evict() is run after write pages to block device, if
 write pages work is not finished before next write, the "rm"ed space
 is not freed, and caused above bug.

Fix:
 We can add "-o flushoncommit" mount option to avoid above bug, but
 it have performance problem. Actually, we can to wait for on-the-fly
 writes only when no-space happened, it is which this patch do.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0572f14..d5ec383 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3725,6 +3725,9 @@ commit_trans:
 		    !atomic_read(&root->fs_info->open_ioctl_trans)) {
 			need_commit--;
 
+			if (need_commit > 0)
+				btrfs_wait_ordered_roots(fs_info, -1);
+
 			trans = btrfs_join_transaction(root);
 			if (IS_ERR(trans))
 				return PTR_ERR(trans);
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 9/9] btrfs: cleanup unused alloc_chunk varible
  2015-04-09  4:34 [PATCH v2 0/9] btrfs: Fix no_space on dd and rm loop Zhaolei
                   ` (7 preceding siblings ...)
  2015-04-09  4:34 ` [PATCH 8/9] btrfs: wait for delayed iputs on no space Zhaolei
@ 2015-04-09  4:34 ` Zhaolei
  8 siblings, 0 replies; 16+ messages in thread
From: Zhaolei @ 2015-04-09  4:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zhao Lei

From: Zhao Lei <zhaolei@cn.fujitsu.com>

Remove int alloc_chunk in btrfs_check_data_free_space() for not
necessary.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d5ec383..b009987 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3641,7 +3641,7 @@ int btrfs_check_data_free_space(struct inode *inode, u64 bytes)
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	u64 used;
-	int ret = 0, need_commit = 2, have_pinned_space, alloc_chunk = 1;
+	int ret = 0, need_commit = 2, have_pinned_space;
 
 	/* make sure bytes are sectorsize aligned */
 	bytes = ALIGN(bytes, root->sectorsize);
@@ -3669,7 +3669,7 @@ again:
 		 * if we don't have enough free bytes in this space then we need
 		 * to alloc a new chunk.
 		 */
-		if (!data_sinfo->full && alloc_chunk) {
+		if (!data_sinfo->full) {
 			u64 alloc_target;
 
 			data_sinfo->force_alloc = CHUNK_ALLOC_FORCE;
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 8/9] btrfs: wait for delayed iputs on no space
  2015-04-09  4:34 ` [PATCH 8/9] btrfs: wait for delayed iputs on no space Zhaolei
@ 2015-04-13 14:54   ` Chris Mason
  2015-04-14  4:06     ` Zhao Lei
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Mason @ 2015-04-13 14:54 UTC (permalink / raw)
  To: Zhaolei, linux-btrfs

On 04/09/2015 12:34 AM, Zhaolei wrote:
> From: Zhao Lei <zhaolei@cn.fujitsu.com>
> 
> btrfs will report no_space when we run following write and delete
> file loop:
>  # FILE_SIZE_M=[ 75% of fs space ]
>  # DEV=[ some dev ]
>  # MNT=[ some dir ]
>  #
>  # mkfs.btrfs -f "$DEV"
>  # mount -o nodatacow "$DEV" "$MNT"
>  # for ((i = 0; i < 100; i++)); do dd if=/dev/zero of="$MNT"/file0 bs=1M count="$FILE_SIZE_M"; rm -f "$MNT"/file0; done
>  #
> 
> Reason:
>  iput() and evict() is run after write pages to block device, if
>  write pages work is not finished before next write, the "rm"ed space
>  is not freed, and caused above bug.
> 
> Fix:
>  We can add "-o flushoncommit" mount option to avoid above bug, but
>  it have performance problem. Actually, we can to wait for on-the-fly
>  writes only when no-space happened, it is which this patch do.

Can you please change this so we only do this flush if the first commit
doesn't free up enough space?  I think this is going to have a
performance impact as the FS fills up.

-chris


^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH 8/9] btrfs: wait for delayed iputs on no space
  2015-04-13 14:54   ` Chris Mason
@ 2015-04-14  4:06     ` Zhao Lei
  0 siblings, 0 replies; 16+ messages in thread
From: Zhao Lei @ 2015-04-14  4:06 UTC (permalink / raw)
  To: 'Chris Mason', linux-btrfs

Hi, Chris

> -----Original Message-----
> From: Chris Mason [mailto:clm@fb.com]
> Sent: Monday, April 13, 2015 10:55 PM
> To: Zhaolei; linux-btrfs@vger.kernel.org
> Subject: Re: [PATCH 8/9] btrfs: wait for delayed iputs on no space
> 
> On 04/09/2015 12:34 AM, Zhaolei wrote:
> > From: Zhao Lei <zhaolei@cn.fujitsu.com>
> >
> > btrfs will report no_space when we run following write and delete file
> > loop:
> >  # FILE_SIZE_M=[ 75% of fs space ]
> >  # DEV=[ some dev ]
> >  # MNT=[ some dir ]
> >  #
> >  # mkfs.btrfs -f "$DEV"
> >  # mount -o nodatacow "$DEV" "$MNT"
> >  # for ((i = 0; i < 100; i++)); do dd if=/dev/zero of="$MNT"/file0
> > bs=1M count="$FILE_SIZE_M"; rm -f "$MNT"/file0; done  #
> >
> > Reason:
> >  iput() and evict() is run after write pages to block device, if
> > write pages work is not finished before next write, the "rm"ed space
> > is not freed, and caused above bug.
> >
> > Fix:
> >  We can add "-o flushoncommit" mount option to avoid above bug, but
> > it have performance problem. Actually, we can to wait for on-the-fly
> > writes only when no-space happened, it is which this patch do.
> 
> Can you please change this so we only do this flush if the first commit doesn't
> free up enough space?  I think this is going to have a performance impact as
> the FS fills up.
> 
btrfs_wait_ordered_roots() can only ensure that all bio are finished,
and relative iputs are added into delayed_iputs in end_io.
And we need 2 commit to make free space accessable:
One for run delayed_iputs(), and another for unpin.

It is why I put above line to first commit, to ensure we have
enough commit operation to make free space accessable.

It is only called then the disk is almost full, and have no performance impact
in most case(disk not full).

Another way is to call btrfs_wait_ordered_roots() after first commit() try,
but give it addition commit().

Thanks
Zhaolei



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 6/9] btrfs: Fix NO_SPACE bug caused by delayed-iput
  2015-04-09  4:34 ` [PATCH 6/9] btrfs: Fix NO_SPACE bug caused by delayed-iput Zhaolei
@ 2015-07-07  8:13   ` Liu Bo
  2015-07-07  9:16     ` Zhao Lei
  0 siblings, 1 reply; 16+ messages in thread
From: Liu Bo @ 2015-07-07  8:13 UTC (permalink / raw)
  To: Zhaolei; +Cc: linux-btrfs

Hi,

On Thu, Apr 09, 2015 at 12:34:41PM +0800, Zhaolei wrote:
> From: Zhao Lei <zhaolei@cn.fujitsu.com>
> 
> Steps to reproduce:
>   while true; do
>     dd if=/dev/zero of=/btrfs_dir/file count=[fs_size * 75%]
>     rm /btrfs_dir/file
>     sync
>   done
> 
>   And we'll see dd failed because btrfs return NO_SPACE.
> 
> Reason:
>   Normally, btrfs_commit_transaction() call btrfs_run_delayed_iputs()
>   in end to free fs space for next write, but sometimes it hadn't
>   done work on time, because btrfs-cleaner thread get delayed-iputs
>   from list before, but do iput() after next write.
> 
>   This is log:
>   [ 2569.050776] comm=btrfs-cleaner func=btrfs_evict_inode() begin
> 
>   [ 2569.084280] comm=sync func=btrfs_commit_transaction() call btrfs_run_delayed_iputs()
>   [ 2569.085418] comm=sync func=btrfs_commit_transaction() done btrfs_run_delayed_iputs()
>   [ 2569.087554] comm=sync func=btrfs_commit_transaction() end
> 
>   [ 2569.191081] comm=dd begin
>   [ 2569.790112] comm=dd func=__btrfs_buffered_write() ret=-28
> 
>   [ 2569.847479] comm=btrfs-cleaner func=add_pinned_bytes() 0 + 32677888 = 32677888
>   [ 2569.849530] comm=btrfs-cleaner func=add_pinned_bytes() 32677888 + 23834624 = 56512512
>   ...
>   [ 2569.903893] comm=btrfs-cleaner func=add_pinned_bytes() 943976448 + 21762048 = 965738496
>   [ 2569.908270] comm=btrfs-cleaner func=btrfs_evict_inode() end
> 
> Fix:
>   Make btrfs_commit_transaction() wait current running btrfs-cleaner's
>   delayed-iputs() done in end.
> 
> Test:
>   Use script similar to above(more complex),
>   before patch:
>     7 failed in 100 * 20 loop.
>   after patch:
>     0 failed in 100 * 20 loop.
> 
> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
> ---
>  fs/btrfs/ctree.h       | 1 +
>  fs/btrfs/disk-io.c     | 3 ++-
>  fs/btrfs/extent-tree.c | 6 ++++++
>  fs/btrfs/inode.c       | 4 ++++
>  4 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index f9c89ca..54d4d78 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1513,6 +1513,7 @@ struct btrfs_fs_info {
>  
>  	spinlock_t delayed_iput_lock;
>  	struct list_head delayed_iputs;
> +	struct rw_semaphore delayed_iput_sem;
>  
>  	/* this protects tree_mod_seq_list */
>  	spinlock_t tree_mod_seq_lock;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 639f266..6867471 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2241,11 +2241,12 @@ int open_ctree(struct super_block *sb,
>  	spin_lock_init(&fs_info->qgroup_op_lock);
>  	spin_lock_init(&fs_info->buffer_lock);
>  	spin_lock_init(&fs_info->unused_bgs_lock);
> -	mutex_init(&fs_info->unused_bg_unpin_mutex);
>  	rwlock_init(&fs_info->tree_mod_log_lock);
> +	mutex_init(&fs_info->unused_bg_unpin_mutex);
>  	mutex_init(&fs_info->reloc_mutex);
>  	mutex_init(&fs_info->delalloc_root_mutex);
>  	seqlock_init(&fs_info->profiles_lock);
> +	init_rwsem(&fs_info->delayed_iput_sem);
>  
>  	init_completion(&fs_info->kobj_unregister);
>  	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 203ac63..6fd7dca 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3732,6 +3732,12 @@ commit_trans:
>  				ret = btrfs_commit_transaction(trans, root);
>  				if (ret)
>  					return ret;
> +				/*
> +				 * make sure that all running delayed iput are
> +				 * done
> +				 */
> +				down_write(&root->fs_info->delayed_iput_sem);
> +				up_write(&root->fs_info->delayed_iput_sem);
>  				goto again;
>  			} else {
>  				btrfs_end_transaction(trans, root);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index d2e732d..34d10be 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3110,6 +3110,8 @@ void btrfs_run_delayed_iputs(struct btrfs_root *root)
>  	if (empty)
>  		return;
>  
> +	down_read(&fs_info->delayed_iput_sem);
> +
>  	spin_lock(&fs_info->delayed_iput_lock);
>  	list_splice_init(&fs_info->delayed_iputs, &list);
>  	spin_unlock(&fs_info->delayed_iput_lock);
> @@ -3120,6 +3122,8 @@ void btrfs_run_delayed_iputs(struct btrfs_root *root)
>  		iput(delayed->inode);
>  		kfree(delayed);
>  	}
> +
> +	up_read(&root->fs_info->delayed_iput_sem);

By introducing this "delayed_iput_sem", fstests's generic/241 cries
about a possible deadlock, can we use a waitqueue to avoid this?


[ 2061.345955] =============================================
[ 2061.346027] [ INFO: possible recursive locking detected ]
[ 2061.346027] 4.1.0+ #268 Tainted: G        W      
[ 2061.346027] ---------------------------------------------
[ 2061.346027] btrfs-cleaner/3045 is trying to acquire lock:
[ 2061.346027]  (&fs_info->delayed_iput_sem){++++..}, at:
[<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100
[ 2061.346027] but task is already holding lock:
[ 2061.346027]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100
[ 2061.346027] other info that might help us debug this:
[ 2061.346027]  Possible unsafe locking scenario:

[ 2061.346027]        CPU0
[ 2061.346027]        ----
[ 2061.346027]   lock(&fs_info->delayed_iput_sem);
[ 2061.346027]   lock(&fs_info->delayed_iput_sem);
[ 2061.346027] 
 *** DEADLOCK ***

[ 2061.346027]  May be due to missing lock nesting notation

[ 2061.346027] 2 locks held by btrfs-cleaner/3045:
[ 2061.346027]  #0:  (&fs_info->cleaner_mutex){+.+.+.}, at: [<ffffffff813f6cc4>] cleaner_kthread+0x64/0x1a0
[ 2061.346027]  #1:  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100
[ 2061.346027] stack backtrace:
[ 2061.346027] CPU: 0 PID: 3045 Comm: btrfs-cleaner Tainted: G        W 4.1.0+ #268
[ 2061.346027] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
[ 2061.346027]  ffff88013760a700 ffff8800b4aff888 ffffffff818bb68e 0000000000000007
[ 2061.346027]  ffffffff828fa0a0 ffff8800b4aff968 ffffffff810db18d ffff8800b4aff978
[ 2061.346027]  ffffffff00000000 ffff88013760a750 ffffffff00000001 ffff88013760b470
[ 2061.346027] Call Trace:
[ 2061.346027]  [<ffffffff818bb68e>] dump_stack+0x4c/0x65
[ 2061.346027]  [<ffffffff810db18d>] __lock_acquire+0x214d/0x22f0
[ 2061.346027]  [<ffffffff810d5e3d>] ? trace_hardirqs_off+0xd/0x10
[ 2061.346027]  [<ffffffff810d649d>] ? __lock_is_held+0x4d/0x70
[ 2061.346027]  [<ffffffff810dc050>] lock_acquire+0xd0/0x280
[ 2061.346027]  [<ffffffff814063ab>] ?  btrfs_run_delayed_iputs+0x6b/0x100
[ 2061.346027]  [<ffffffff818c275c>] down_read+0x4c/0x70
[ 2061.346027]  [<ffffffff814063ab>] ?  btrfs_run_delayed_iputs+0x6b/0x100
[ 2061.346027]  [<ffffffff818c48cb>] ? _raw_spin_unlock+0x2b/0x40
[ 2061.346027]  [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100
[ 2061.346027]  [<ffffffff8140042f>] ?  btrfs_commit_transaction+0xa1f/0xc50
[ 2061.346027]  [<ffffffff81400454>] btrfs_commit_transaction+0xa44/0xc50
[ 2061.346027]  [<ffffffff810cdbd0>] ? wait_woken+0xa0/0xa0
[ 2061.346027]  [<ffffffff813e5247>] flush_space+0x97/0x4b0
[ 2061.346027]  [<ffffffff818c48cb>] ? _raw_spin_unlock+0x2b/0x40
[ 2061.346027]  [<ffffffff813e5a3b>] ?  reserve_metadata_bytes+0x20b/0x6d0
[ 2061.346027]  [<ffffffff813e5a52>] reserve_metadata_bytes+0x222/0x6d0
[ 2061.346027]  [<ffffffff813e6245>] ? btrfs_block_rsv_refill+0x65/0x90
[ 2061.346027]  [<ffffffff813e6257>] btrfs_block_rsv_refill+0x77/0x90
[ 2061.346027]  [<ffffffff8140c723>] btrfs_evict_inode+0x473/0x700
[ 2061.346027]  [<ffffffff8123dcbb>] evict+0xab/0x170
[ 2061.346027]  [<ffffffff8123e6fd>] iput+0x1cd/0x350
[ 2061.346027]  [<ffffffff81406409>] btrfs_run_delayed_iputs+0xc9/0x100
[ 2061.346027]  [<ffffffff813f6cc4>] ? cleaner_kthread+0x64/0x1a0
[ 2061.346027]  [<ffffffff813f6dba>] cleaner_kthread+0x15a/0x1a0
[ 2061.346027]  [<ffffffff813f6c60>] ? btree_invalidatepage+0x90/0x90
[ 2061.346027]  [<ffffffff810a3d5f>] kthread+0xef/0x110
[ 2061.346027]  [<ffffffff810a3c70>] ?  kthread_create_on_node+0x210/0x210
[ 2061.346027]  [<ffffffff818c550f>] ret_from_fork+0x3f/0x70
[ 2061.346027]  [<ffffffff810a3c70>] ?  kthread_create_on_node+0x210/0x210


Thanks,

-liubo
>  }
>  
>  /*
> -- 
> 1.8.5.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH 6/9] btrfs: Fix NO_SPACE bug caused by delayed-iput
  2015-07-07  8:13   ` Liu Bo
@ 2015-07-07  9:16     ` Zhao Lei
  2015-07-07  9:26       ` Liu Bo
  0 siblings, 1 reply; 16+ messages in thread
From: Zhao Lei @ 2015-07-07  9:16 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs

Hi, Liubo

> -----Original Message-----
> From: Liu Bo [mailto:bo.li.liu@oracle.com]
> Sent: Tuesday, July 07, 2015 4:14 PM
> To: Zhaolei
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: [PATCH 6/9] btrfs: Fix NO_SPACE bug caused by delayed-iput
> 
> Hi,
> 
> On Thu, Apr 09, 2015 at 12:34:41PM +0800, Zhaolei wrote:
> > From: Zhao Lei <zhaolei@cn.fujitsu.com>
> >
> > Steps to reproduce:
> >   while true; do
> >     dd if=/dev/zero of=/btrfs_dir/file count=[fs_size * 75%]
> >     rm /btrfs_dir/file
> >     sync
> >   done
> >
> >   And we'll see dd failed because btrfs return NO_SPACE.
> >
> > Reason:
> >   Normally, btrfs_commit_transaction() call btrfs_run_delayed_iputs()
> >   in end to free fs space for next write, but sometimes it hadn't
> >   done work on time, because btrfs-cleaner thread get delayed-iputs
> >   from list before, but do iput() after next write.
> >
> >   This is log:
> >   [ 2569.050776] comm=btrfs-cleaner func=btrfs_evict_inode() begin
> >
> >   [ 2569.084280] comm=sync func=btrfs_commit_transaction() call
> btrfs_run_delayed_iputs()
> >   [ 2569.085418] comm=sync func=btrfs_commit_transaction() done
> btrfs_run_delayed_iputs()
> >   [ 2569.087554] comm=sync func=btrfs_commit_transaction() end
> >
> >   [ 2569.191081] comm=dd begin
> >   [ 2569.790112] comm=dd func=__btrfs_buffered_write() ret=-28
> >
> >   [ 2569.847479] comm=btrfs-cleaner func=add_pinned_bytes() 0 +
> 32677888 = 32677888
> >   [ 2569.849530] comm=btrfs-cleaner func=add_pinned_bytes() 32677888 +
> 23834624 = 56512512
> >   ...
> >   [ 2569.903893] comm=btrfs-cleaner func=add_pinned_bytes() 943976448
> + 21762048 = 965738496
> >   [ 2569.908270] comm=btrfs-cleaner func=btrfs_evict_inode() end
> >
> > Fix:
> >   Make btrfs_commit_transaction() wait current running btrfs-cleaner's
> >   delayed-iputs() done in end.
> >
> > Test:
> >   Use script similar to above(more complex),
> >   before patch:
> >     7 failed in 100 * 20 loop.
> >   after patch:
> >     0 failed in 100 * 20 loop.
> >
> > Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
> > ---
> >  fs/btrfs/ctree.h       | 1 +
> >  fs/btrfs/disk-io.c     | 3 ++-
> >  fs/btrfs/extent-tree.c | 6 ++++++
> >  fs/btrfs/inode.c       | 4 ++++
> >  4 files changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index
> > f9c89ca..54d4d78 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -1513,6 +1513,7 @@ struct btrfs_fs_info {
> >
> >  	spinlock_t delayed_iput_lock;
> >  	struct list_head delayed_iputs;
> > +	struct rw_semaphore delayed_iput_sem;
> >
> >  	/* this protects tree_mod_seq_list */
> >  	spinlock_t tree_mod_seq_lock;
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index
> > 639f266..6867471 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -2241,11 +2241,12 @@ int open_ctree(struct super_block *sb,
> >  	spin_lock_init(&fs_info->qgroup_op_lock);
> >  	spin_lock_init(&fs_info->buffer_lock);
> >  	spin_lock_init(&fs_info->unused_bgs_lock);
> > -	mutex_init(&fs_info->unused_bg_unpin_mutex);
> >  	rwlock_init(&fs_info->tree_mod_log_lock);
> > +	mutex_init(&fs_info->unused_bg_unpin_mutex);
> >  	mutex_init(&fs_info->reloc_mutex);
> >  	mutex_init(&fs_info->delalloc_root_mutex);
> >  	seqlock_init(&fs_info->profiles_lock);
> > +	init_rwsem(&fs_info->delayed_iput_sem);
> >
> >  	init_completion(&fs_info->kobj_unregister);
> >  	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index
> > 203ac63..6fd7dca 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -3732,6 +3732,12 @@ commit_trans:
> >  				ret = btrfs_commit_transaction(trans, root);
> >  				if (ret)
> >  					return ret;
> > +				/*
> > +				 * make sure that all running delayed iput are
> > +				 * done
> > +				 */
> > +				down_write(&root->fs_info->delayed_iput_sem);
> > +				up_write(&root->fs_info->delayed_iput_sem);
> >  				goto again;
> >  			} else {
> >  				btrfs_end_transaction(trans, root); diff --git
> a/fs/btrfs/inode.c
> > b/fs/btrfs/inode.c index d2e732d..34d10be 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -3110,6 +3110,8 @@ void btrfs_run_delayed_iputs(struct btrfs_root
> *root)
> >  	if (empty)
> >  		return;
> >
> > +	down_read(&fs_info->delayed_iput_sem);
> > +
> >  	spin_lock(&fs_info->delayed_iput_lock);
> >  	list_splice_init(&fs_info->delayed_iputs, &list);
> >  	spin_unlock(&fs_info->delayed_iput_lock);
> > @@ -3120,6 +3122,8 @@ void btrfs_run_delayed_iputs(struct btrfs_root
> *root)
> >  		iput(delayed->inode);
> >  		kfree(delayed);
> >  	}
> > +
> > +	up_read(&root->fs_info->delayed_iput_sem);
> 
> By introducing this "delayed_iput_sem", fstests's generic/241 cries about a
> possible deadlock, can we use a waitqueue to avoid this?
> 
> 
> [ 2061.345955] =============================================
> [ 2061.346027] [ INFO: possible recursive locking detected ]
> [ 2061.346027] 4.1.0+ #268 Tainted: G        W
> [ 2061.346027] ---------------------------------------------
> [ 2061.346027] btrfs-cleaner/3045 is trying to acquire lock:
> [ 2061.346027]  (&fs_info->delayed_iput_sem){++++..}, at:
> [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100
> [ 2061.346027] but task is already holding lock:
> [ 2061.346027]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffff814063ab>]
> btrfs_run_delayed_iputs+0x6b/0x100
> [ 2061.346027] other info that might help us debug this:
> [ 2061.346027]  Possible unsafe locking scenario:
> 
> [ 2061.346027]        CPU0
> [ 2061.346027]        ----
> [ 2061.346027]   lock(&fs_info->delayed_iput_sem);
> [ 2061.346027]   lock(&fs_info->delayed_iput_sem);
> [ 2061.346027]
>  *** DEADLOCK ***
> 
> [ 2061.346027]  May be due to missing lock nesting notation
> 
> [ 2061.346027] 2 locks held by btrfs-cleaner/3045:
> [ 2061.346027]  #0:  (&fs_info->cleaner_mutex){+.+.+.}, at:
> [<ffffffff813f6cc4>] cleaner_kthread+0x64/0x1a0 [ 2061.346027]  #1:
> (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffff814063ab>]
> btrfs_run_delayed_iputs+0x6b/0x100
> [ 2061.346027] stack backtrace:
> [ 2061.346027] CPU: 0 PID: 3045 Comm: btrfs-cleaner Tainted: G        W
> 4.1.0+ #268
> [ 2061.346027] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> 1.7.5-20140709_153950- 04/01/2014 [ 2061.346027]  ffff88013760a700
> ffff8800b4aff888 ffffffff818bb68e 0000000000000007 [ 2061.346027]
> ffffffff828fa0a0 ffff8800b4aff968 ffffffff810db18d ffff8800b4aff978
> [ 2061.346027]  ffffffff00000000 ffff88013760a750 ffffffff00000001
> ffff88013760b470 [ 2061.346027] Call Trace:
> [ 2061.346027]  [<ffffffff818bb68e>] dump_stack+0x4c/0x65 [ 2061.346027]
> [<ffffffff810db18d>] __lock_acquire+0x214d/0x22f0 [ 2061.346027]
> [<ffffffff810d5e3d>] ? trace_hardirqs_off+0xd/0x10 [ 2061.346027]
> [<ffffffff810d649d>] ? __lock_is_held+0x4d/0x70 [ 2061.346027]
> [<ffffffff810dc050>] lock_acquire+0xd0/0x280 [ 2061.346027]
> [<ffffffff814063ab>] ?  btrfs_run_delayed_iputs+0x6b/0x100
> [ 2061.346027]  [<ffffffff818c275c>] down_read+0x4c/0x70 [ 2061.346027]
> [<ffffffff814063ab>] ?  btrfs_run_delayed_iputs+0x6b/0x100
> [ 2061.346027]  [<ffffffff818c48cb>] ? _raw_spin_unlock+0x2b/0x40
> [ 2061.346027]  [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100
> [ 2061.346027]  [<ffffffff8140042f>] ?
> btrfs_commit_transaction+0xa1f/0xc50
> [ 2061.346027]  [<ffffffff81400454>] btrfs_commit_transaction+0xa44/0xc50
> [ 2061.346027]  [<ffffffff810cdbd0>] ? wait_woken+0xa0/0xa0 [ 2061.346027]
> [<ffffffff813e5247>] flush_space+0x97/0x4b0 [ 2061.346027]
> [<ffffffff818c48cb>] ? _raw_spin_unlock+0x2b/0x40 [ 2061.346027]
> [<ffffffff813e5a3b>] ?  reserve_metadata_bytes+0x20b/0x6d0
> [ 2061.346027]  [<ffffffff813e5a52>] reserve_metadata_bytes+0x222/0x6d0
> [ 2061.346027]  [<ffffffff813e6245>] ? btrfs_block_rsv_refill+0x65/0x90
> [ 2061.346027]  [<ffffffff813e6257>] btrfs_block_rsv_refill+0x77/0x90
> [ 2061.346027]  [<ffffffff8140c723>] btrfs_evict_inode+0x473/0x700
> [ 2061.346027]  [<ffffffff8123dcbb>] evict+0xab/0x170 [ 2061.346027]
> [<ffffffff8123e6fd>] iput+0x1cd/0x350 [ 2061.346027]  [<ffffffff81406409>]
> btrfs_run_delayed_iputs+0xc9/0x100
> [ 2061.346027]  [<ffffffff813f6cc4>] ? cleaner_kthread+0x64/0x1a0
> [ 2061.346027]  [<ffffffff813f6dba>] cleaner_kthread+0x15a/0x1a0
> [ 2061.346027]  [<ffffffff813f6c60>] ? btree_invalidatepage+0x90/0x90
> [ 2061.346027]  [<ffffffff810a3d5f>] kthread+0xef/0x110 [ 2061.346027]
> [<ffffffff810a3c70>] ?  kthread_create_on_node+0x210/0x210
> [ 2061.346027]  [<ffffffff818c550f>] ret_from_fork+0x3f/0x70 [ 2061.346027]
> [<ffffffff810a3c70>] ?  kthread_create_on_node+0x210/0x210
> 
Thanks for report above problem.
Seems it is caused by recursion of btrfs_run_delayed_iputs:

  cleaner_kthread
  -> btrfs_run_delayed_iputs() *1
  -> get delayed_iput_sem lock *2
  -> iput()
  -> ...
  -> btrfs_commit_transaction()
  -> btrfs_run_delayed_iputs() *1
  -> get delayed_iput_sem lock (dead lock) *2

  *1: recursion of btrfs_run_delayed_iputs()
  *2: dead lock of delayed_iput_sem

I'll reproduce this problem in my env, and comfirm fix of
above problem.

Thanks
Zhaolei


> 
> Thanks,
> 
> -liubo
> >  }
> >
> >  /*
> > --
> > 1.8.5.1
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
> > in the body of a message to majordomo@vger.kernel.org More majordomo
> > info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 6/9] btrfs: Fix NO_SPACE bug caused by delayed-iput
  2015-07-07  9:16     ` Zhao Lei
@ 2015-07-07  9:26       ` Liu Bo
  2015-07-07  9:38         ` Zhao Lei
  0 siblings, 1 reply; 16+ messages in thread
From: Liu Bo @ 2015-07-07  9:26 UTC (permalink / raw)
  To: Zhao Lei; +Cc: linux-btrfs

On Tue, Jul 07, 2015 at 05:16:29PM +0800, Zhao Lei wrote:
> Hi, Liubo
> 
> > -----Original Message-----
> > From: Liu Bo [mailto:bo.li.liu@oracle.com]
> > Sent: Tuesday, July 07, 2015 4:14 PM
> > To: Zhaolei
> > Cc: linux-btrfs@vger.kernel.org
> > Subject: Re: [PATCH 6/9] btrfs: Fix NO_SPACE bug caused by delayed-iput
> > 
> > Hi,
> > 
> > On Thu, Apr 09, 2015 at 12:34:41PM +0800, Zhaolei wrote:
> > > From: Zhao Lei <zhaolei@cn.fujitsu.com>
> > >
> > > Steps to reproduce:
> > >   while true; do
> > >     dd if=/dev/zero of=/btrfs_dir/file count=[fs_size * 75%]
> > >     rm /btrfs_dir/file
> > >     sync
> > >   done
> > >
> > >   And we'll see dd failed because btrfs return NO_SPACE.
> > >
> > > Reason:
> > >   Normally, btrfs_commit_transaction() call btrfs_run_delayed_iputs()
> > >   in end to free fs space for next write, but sometimes it hadn't
> > >   done work on time, because btrfs-cleaner thread get delayed-iputs
> > >   from list before, but do iput() after next write.
> > >
> > >   This is log:
> > >   [ 2569.050776] comm=btrfs-cleaner func=btrfs_evict_inode() begin
> > >
> > >   [ 2569.084280] comm=sync func=btrfs_commit_transaction() call
> > btrfs_run_delayed_iputs()
> > >   [ 2569.085418] comm=sync func=btrfs_commit_transaction() done
> > btrfs_run_delayed_iputs()
> > >   [ 2569.087554] comm=sync func=btrfs_commit_transaction() end
> > >
> > >   [ 2569.191081] comm=dd begin
> > >   [ 2569.790112] comm=dd func=__btrfs_buffered_write() ret=-28
> > >
> > >   [ 2569.847479] comm=btrfs-cleaner func=add_pinned_bytes() 0 +
> > 32677888 = 32677888
> > >   [ 2569.849530] comm=btrfs-cleaner func=add_pinned_bytes() 32677888 +
> > 23834624 = 56512512
> > >   ...
> > >   [ 2569.903893] comm=btrfs-cleaner func=add_pinned_bytes() 943976448
> > + 21762048 = 965738496
> > >   [ 2569.908270] comm=btrfs-cleaner func=btrfs_evict_inode() end
> > >
> > > Fix:
> > >   Make btrfs_commit_transaction() wait current running btrfs-cleaner's
> > >   delayed-iputs() done in end.
> > >
> > > Test:
> > >   Use script similar to above(more complex),
> > >   before patch:
> > >     7 failed in 100 * 20 loop.
> > >   after patch:
> > >     0 failed in 100 * 20 loop.
> > >
> > > Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
> > > ---
> > >  fs/btrfs/ctree.h       | 1 +
> > >  fs/btrfs/disk-io.c     | 3 ++-
> > >  fs/btrfs/extent-tree.c | 6 ++++++
> > >  fs/btrfs/inode.c       | 4 ++++
> > >  4 files changed, 13 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index
> > > f9c89ca..54d4d78 100644
> > > --- a/fs/btrfs/ctree.h
> > > +++ b/fs/btrfs/ctree.h
> > > @@ -1513,6 +1513,7 @@ struct btrfs_fs_info {
> > >
> > >  	spinlock_t delayed_iput_lock;
> > >  	struct list_head delayed_iputs;
> > > +	struct rw_semaphore delayed_iput_sem;
> > >
> > >  	/* this protects tree_mod_seq_list */
> > >  	spinlock_t tree_mod_seq_lock;
> > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index
> > > 639f266..6867471 100644
> > > --- a/fs/btrfs/disk-io.c
> > > +++ b/fs/btrfs/disk-io.c
> > > @@ -2241,11 +2241,12 @@ int open_ctree(struct super_block *sb,
> > >  	spin_lock_init(&fs_info->qgroup_op_lock);
> > >  	spin_lock_init(&fs_info->buffer_lock);
> > >  	spin_lock_init(&fs_info->unused_bgs_lock);
> > > -	mutex_init(&fs_info->unused_bg_unpin_mutex);
> > >  	rwlock_init(&fs_info->tree_mod_log_lock);
> > > +	mutex_init(&fs_info->unused_bg_unpin_mutex);
> > >  	mutex_init(&fs_info->reloc_mutex);
> > >  	mutex_init(&fs_info->delalloc_root_mutex);
> > >  	seqlock_init(&fs_info->profiles_lock);
> > > +	init_rwsem(&fs_info->delayed_iput_sem);
> > >
> > >  	init_completion(&fs_info->kobj_unregister);
> > >  	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
> > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index
> > > 203ac63..6fd7dca 100644
> > > --- a/fs/btrfs/extent-tree.c
> > > +++ b/fs/btrfs/extent-tree.c
> > > @@ -3732,6 +3732,12 @@ commit_trans:
> > >  				ret = btrfs_commit_transaction(trans, root);
> > >  				if (ret)
> > >  					return ret;
> > > +				/*
> > > +				 * make sure that all running delayed iput are
> > > +				 * done
> > > +				 */
> > > +				down_write(&root->fs_info->delayed_iput_sem);
> > > +				up_write(&root->fs_info->delayed_iput_sem);
> > >  				goto again;
> > >  			} else {
> > >  				btrfs_end_transaction(trans, root); diff --git
> > a/fs/btrfs/inode.c
> > > b/fs/btrfs/inode.c index d2e732d..34d10be 100644
> > > --- a/fs/btrfs/inode.c
> > > +++ b/fs/btrfs/inode.c
> > > @@ -3110,6 +3110,8 @@ void btrfs_run_delayed_iputs(struct btrfs_root
> > *root)
> > >  	if (empty)
> > >  		return;
> > >
> > > +	down_read(&fs_info->delayed_iput_sem);
> > > +
> > >  	spin_lock(&fs_info->delayed_iput_lock);
> > >  	list_splice_init(&fs_info->delayed_iputs, &list);
> > >  	spin_unlock(&fs_info->delayed_iput_lock);
> > > @@ -3120,6 +3122,8 @@ void btrfs_run_delayed_iputs(struct btrfs_root
> > *root)
> > >  		iput(delayed->inode);
> > >  		kfree(delayed);
> > >  	}
> > > +
> > > +	up_read(&root->fs_info->delayed_iput_sem);
> > 
> > By introducing this "delayed_iput_sem", fstests's generic/241 cries about a
> > possible deadlock, can we use a waitqueue to avoid this?
> > 
> > 
> > [ 2061.345955] =============================================
> > [ 2061.346027] [ INFO: possible recursive locking detected ]
> > [ 2061.346027] 4.1.0+ #268 Tainted: G        W
> > [ 2061.346027] ---------------------------------------------
> > [ 2061.346027] btrfs-cleaner/3045 is trying to acquire lock:
> > [ 2061.346027]  (&fs_info->delayed_iput_sem){++++..}, at:
> > [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100
> > [ 2061.346027] but task is already holding lock:
> > [ 2061.346027]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffff814063ab>]
> > btrfs_run_delayed_iputs+0x6b/0x100
> > [ 2061.346027] other info that might help us debug this:
> > [ 2061.346027]  Possible unsafe locking scenario:
> > 
> > [ 2061.346027]        CPU0
> > [ 2061.346027]        ----
> > [ 2061.346027]   lock(&fs_info->delayed_iput_sem);
> > [ 2061.346027]   lock(&fs_info->delayed_iput_sem);
> > [ 2061.346027]
> >  *** DEADLOCK ***
> > 
> > [ 2061.346027]  May be due to missing lock nesting notation
> > 
> > [ 2061.346027] 2 locks held by btrfs-cleaner/3045:
> > [ 2061.346027]  #0:  (&fs_info->cleaner_mutex){+.+.+.}, at:
> > [<ffffffff813f6cc4>] cleaner_kthread+0x64/0x1a0 [ 2061.346027]  #1:
> > (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffff814063ab>]
> > btrfs_run_delayed_iputs+0x6b/0x100
> > [ 2061.346027] stack backtrace:
> > [ 2061.346027] CPU: 0 PID: 3045 Comm: btrfs-cleaner Tainted: G        W
> > 4.1.0+ #268
> > [ 2061.346027] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> > 1.7.5-20140709_153950- 04/01/2014 [ 2061.346027]  ffff88013760a700
> > ffff8800b4aff888 ffffffff818bb68e 0000000000000007 [ 2061.346027]
> > ffffffff828fa0a0 ffff8800b4aff968 ffffffff810db18d ffff8800b4aff978
> > [ 2061.346027]  ffffffff00000000 ffff88013760a750 ffffffff00000001
> > ffff88013760b470 [ 2061.346027] Call Trace:
> > [ 2061.346027]  [<ffffffff818bb68e>] dump_stack+0x4c/0x65 [ 2061.346027]
> > [<ffffffff810db18d>] __lock_acquire+0x214d/0x22f0 [ 2061.346027]
> > [<ffffffff810d5e3d>] ? trace_hardirqs_off+0xd/0x10 [ 2061.346027]
> > [<ffffffff810d649d>] ? __lock_is_held+0x4d/0x70 [ 2061.346027]
> > [<ffffffff810dc050>] lock_acquire+0xd0/0x280 [ 2061.346027]
> > [<ffffffff814063ab>] ?  btrfs_run_delayed_iputs+0x6b/0x100
> > [ 2061.346027]  [<ffffffff818c275c>] down_read+0x4c/0x70 [ 2061.346027]
> > [<ffffffff814063ab>] ?  btrfs_run_delayed_iputs+0x6b/0x100
> > [ 2061.346027]  [<ffffffff818c48cb>] ? _raw_spin_unlock+0x2b/0x40
> > [ 2061.346027]  [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100
> > [ 2061.346027]  [<ffffffff8140042f>] ?
> > btrfs_commit_transaction+0xa1f/0xc50
> > [ 2061.346027]  [<ffffffff81400454>] btrfs_commit_transaction+0xa44/0xc50
> > [ 2061.346027]  [<ffffffff810cdbd0>] ? wait_woken+0xa0/0xa0 [ 2061.346027]
> > [<ffffffff813e5247>] flush_space+0x97/0x4b0 [ 2061.346027]
> > [<ffffffff818c48cb>] ? _raw_spin_unlock+0x2b/0x40 [ 2061.346027]
> > [<ffffffff813e5a3b>] ?  reserve_metadata_bytes+0x20b/0x6d0
> > [ 2061.346027]  [<ffffffff813e5a52>] reserve_metadata_bytes+0x222/0x6d0
> > [ 2061.346027]  [<ffffffff813e6245>] ? btrfs_block_rsv_refill+0x65/0x90
> > [ 2061.346027]  [<ffffffff813e6257>] btrfs_block_rsv_refill+0x77/0x90
> > [ 2061.346027]  [<ffffffff8140c723>] btrfs_evict_inode+0x473/0x700
> > [ 2061.346027]  [<ffffffff8123dcbb>] evict+0xab/0x170 [ 2061.346027]
> > [<ffffffff8123e6fd>] iput+0x1cd/0x350 [ 2061.346027]  [<ffffffff81406409>]
> > btrfs_run_delayed_iputs+0xc9/0x100
> > [ 2061.346027]  [<ffffffff813f6cc4>] ? cleaner_kthread+0x64/0x1a0
> > [ 2061.346027]  [<ffffffff813f6dba>] cleaner_kthread+0x15a/0x1a0
> > [ 2061.346027]  [<ffffffff813f6c60>] ? btree_invalidatepage+0x90/0x90
> > [ 2061.346027]  [<ffffffff810a3d5f>] kthread+0xef/0x110 [ 2061.346027]
> > [<ffffffff810a3c70>] ?  kthread_create_on_node+0x210/0x210
> > [ 2061.346027]  [<ffffffff818c550f>] ret_from_fork+0x3f/0x70 [ 2061.346027]
> > [<ffffffff810a3c70>] ?  kthread_create_on_node+0x210/0x210
> > 
> Thanks for report above problem.
> Seems it is caused by recursion of btrfs_run_delayed_iputs:
> 
>   cleaner_kthread
>   -> btrfs_run_delayed_iputs() *1
>   -> get delayed_iput_sem lock *2
>   -> iput()
>   -> ...
>   -> btrfs_commit_transaction()
>   -> btrfs_run_delayed_iputs() *1
>   -> get delayed_iput_sem lock (dead lock) *2
> 
>   *1: recursion of btrfs_run_delayed_iputs()
>   *2: dead lock of delayed_iput_sem

Yep, that's what stack trace shows, but I guess it hardly runs to
deadlock, since we've checked if the list is empty before acquiring
the semephore. 

> 
> I'll reproduce this problem in my env, and comfirm fix of
> above problem.

It didn't occur every time, but quite often on my box.

Thanks,

-liubo
> 
> Thanks
> Zhaolei
> 
> 
> > 
> > Thanks,
> > 
> > -liubo
> > >  }
> > >
> > >  /*
> > > --
> > > 1.8.5.1
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
> > > in the body of a message to majordomo@vger.kernel.org More majordomo
> > > info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH 6/9] btrfs: Fix NO_SPACE bug caused by delayed-iput
  2015-07-07  9:26       ` Liu Bo
@ 2015-07-07  9:38         ` Zhao Lei
  0 siblings, 0 replies; 16+ messages in thread
From: Zhao Lei @ 2015-07-07  9:38 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs

Hi, Liubo

> -----Original Message-----
> From: Liu Bo [mailto:bo.li.liu@oracle.com]
> Sent: Tuesday, July 07, 2015 5:27 PM
> To: Zhao Lei
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: [PATCH 6/9] btrfs: Fix NO_SPACE bug caused by delayed-iput
> 
> On Tue, Jul 07, 2015 at 05:16:29PM +0800, Zhao Lei wrote:
> > Hi, Liubo
> >
> > > -----Original Message-----
> > > From: Liu Bo [mailto:bo.li.liu@oracle.com]
> > > Sent: Tuesday, July 07, 2015 4:14 PM
> > > To: Zhaolei
> > > Cc: linux-btrfs@vger.kernel.org
> > > Subject: Re: [PATCH 6/9] btrfs: Fix NO_SPACE bug caused by
> > > delayed-iput
> > >
> > > Hi,
> > >
> > > On Thu, Apr 09, 2015 at 12:34:41PM +0800, Zhaolei wrote:
> > > > From: Zhao Lei <zhaolei@cn.fujitsu.com>
> > > >
> > > > Steps to reproduce:
> > > >   while true; do
> > > >     dd if=/dev/zero of=/btrfs_dir/file count=[fs_size * 75%]
> > > >     rm /btrfs_dir/file
> > > >     sync
> > > >   done
> > > >
> > > >   And we'll see dd failed because btrfs return NO_SPACE.
> > > >
> > > > Reason:
> > > >   Normally, btrfs_commit_transaction() call btrfs_run_delayed_iputs()
> > > >   in end to free fs space for next write, but sometimes it hadn't
> > > >   done work on time, because btrfs-cleaner thread get delayed-iputs
> > > >   from list before, but do iput() after next write.
> > > >
> > > >   This is log:
> > > >   [ 2569.050776] comm=btrfs-cleaner func=btrfs_evict_inode() begin
> > > >
> > > >   [ 2569.084280] comm=sync func=btrfs_commit_transaction() call
> > > btrfs_run_delayed_iputs()
> > > >   [ 2569.085418] comm=sync func=btrfs_commit_transaction() done
> > > btrfs_run_delayed_iputs()
> > > >   [ 2569.087554] comm=sync func=btrfs_commit_transaction() end
> > > >
> > > >   [ 2569.191081] comm=dd begin
> > > >   [ 2569.790112] comm=dd func=__btrfs_buffered_write() ret=-28
> > > >
> > > >   [ 2569.847479] comm=btrfs-cleaner func=add_pinned_bytes() 0 +
> > > 32677888 = 32677888
> > > >   [ 2569.849530] comm=btrfs-cleaner func=add_pinned_bytes()
> > > > 32677888 +
> > > 23834624 = 56512512
> > > >   ...
> > > >   [ 2569.903893] comm=btrfs-cleaner func=add_pinned_bytes()
> > > > 943976448
> > > + 21762048 = 965738496
> > > >   [ 2569.908270] comm=btrfs-cleaner func=btrfs_evict_inode() end
> > > >
> > > > Fix:
> > > >   Make btrfs_commit_transaction() wait current running btrfs-cleaner's
> > > >   delayed-iputs() done in end.
> > > >
> > > > Test:
> > > >   Use script similar to above(more complex),
> > > >   before patch:
> > > >     7 failed in 100 * 20 loop.
> > > >   after patch:
> > > >     0 failed in 100 * 20 loop.
> > > >
> > > > Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
> > > > ---
> > > >  fs/btrfs/ctree.h       | 1 +
> > > >  fs/btrfs/disk-io.c     | 3 ++-
> > > >  fs/btrfs/extent-tree.c | 6 ++++++
> > > >  fs/btrfs/inode.c       | 4 ++++
> > > >  4 files changed, 13 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index
> > > > f9c89ca..54d4d78 100644
> > > > --- a/fs/btrfs/ctree.h
> > > > +++ b/fs/btrfs/ctree.h
> > > > @@ -1513,6 +1513,7 @@ struct btrfs_fs_info {
> > > >
> > > >  	spinlock_t delayed_iput_lock;
> > > >  	struct list_head delayed_iputs;
> > > > +	struct rw_semaphore delayed_iput_sem;
> > > >
> > > >  	/* this protects tree_mod_seq_list */
> > > >  	spinlock_t tree_mod_seq_lock;
> > > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index
> > > > 639f266..6867471 100644
> > > > --- a/fs/btrfs/disk-io.c
> > > > +++ b/fs/btrfs/disk-io.c
> > > > @@ -2241,11 +2241,12 @@ int open_ctree(struct super_block *sb,
> > > >  	spin_lock_init(&fs_info->qgroup_op_lock);
> > > >  	spin_lock_init(&fs_info->buffer_lock);
> > > >  	spin_lock_init(&fs_info->unused_bgs_lock);
> > > > -	mutex_init(&fs_info->unused_bg_unpin_mutex);
> > > >  	rwlock_init(&fs_info->tree_mod_log_lock);
> > > > +	mutex_init(&fs_info->unused_bg_unpin_mutex);
> > > >  	mutex_init(&fs_info->reloc_mutex);
> > > >  	mutex_init(&fs_info->delalloc_root_mutex);
> > > >  	seqlock_init(&fs_info->profiles_lock);
> > > > +	init_rwsem(&fs_info->delayed_iput_sem);
> > > >
> > > >  	init_completion(&fs_info->kobj_unregister);
> > > >  	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
> > > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index
> > > > 203ac63..6fd7dca 100644
> > > > --- a/fs/btrfs/extent-tree.c
> > > > +++ b/fs/btrfs/extent-tree.c
> > > > @@ -3732,6 +3732,12 @@ commit_trans:
> > > >  				ret = btrfs_commit_transaction(trans, root);
> > > >  				if (ret)
> > > >  					return ret;
> > > > +				/*
> > > > +				 * make sure that all running delayed iput are
> > > > +				 * done
> > > > +				 */
> > > > +				down_write(&root->fs_info->delayed_iput_sem);
> > > > +				up_write(&root->fs_info->delayed_iput_sem);
> > > >  				goto again;
> > > >  			} else {
> > > >  				btrfs_end_transaction(trans, root); diff --git
> > > a/fs/btrfs/inode.c
> > > > b/fs/btrfs/inode.c index d2e732d..34d10be 100644
> > > > --- a/fs/btrfs/inode.c
> > > > +++ b/fs/btrfs/inode.c
> > > > @@ -3110,6 +3110,8 @@ void btrfs_run_delayed_iputs(struct
> > > > btrfs_root
> > > *root)
> > > >  	if (empty)
> > > >  		return;
> > > >
> > > > +	down_read(&fs_info->delayed_iput_sem);
> > > > +
> > > >  	spin_lock(&fs_info->delayed_iput_lock);
> > > >  	list_splice_init(&fs_info->delayed_iputs, &list);
> > > >  	spin_unlock(&fs_info->delayed_iput_lock);
> > > > @@ -3120,6 +3122,8 @@ void btrfs_run_delayed_iputs(struct
> > > > btrfs_root
> > > *root)
> > > >  		iput(delayed->inode);
> > > >  		kfree(delayed);
> > > >  	}
> > > > +
> > > > +	up_read(&root->fs_info->delayed_iput_sem);
> > >
> > > By introducing this "delayed_iput_sem", fstests's generic/241 cries
> > > about a possible deadlock, can we use a waitqueue to avoid this?
> > >
> > >
> > > [ 2061.345955] =============================================
> > > [ 2061.346027] [ INFO: possible recursive locking detected ]
> > > [ 2061.346027] 4.1.0+ #268 Tainted: G        W
> > > [ 2061.346027] ---------------------------------------------
> > > [ 2061.346027] btrfs-cleaner/3045 is trying to acquire lock:
> > > [ 2061.346027]  (&fs_info->delayed_iput_sem){++++..}, at:
> > > [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100
> > > [ 2061.346027] but task is already holding lock:
> > > [ 2061.346027]  (&fs_info->delayed_iput_sem){++++..}, at:
> > > [<ffffffff814063ab>]
> > > btrfs_run_delayed_iputs+0x6b/0x100
> > > [ 2061.346027] other info that might help us debug this:
> > > [ 2061.346027]  Possible unsafe locking scenario:
> > >
> > > [ 2061.346027]        CPU0
> > > [ 2061.346027]        ----
> > > [ 2061.346027]   lock(&fs_info->delayed_iput_sem);
> > > [ 2061.346027]   lock(&fs_info->delayed_iput_sem);
> > > [ 2061.346027]
> > >  *** DEADLOCK ***
> > >
> > > [ 2061.346027]  May be due to missing lock nesting notation
> > >
> > > [ 2061.346027] 2 locks held by btrfs-cleaner/3045:
> > > [ 2061.346027]  #0:  (&fs_info->cleaner_mutex){+.+.+.}, at:
> > > [<ffffffff813f6cc4>] cleaner_kthread+0x64/0x1a0 [ 2061.346027]  #1:
> > > (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffff814063ab>]
> > > btrfs_run_delayed_iputs+0x6b/0x100
> > > [ 2061.346027] stack backtrace:
> > > [ 2061.346027] CPU: 0 PID: 3045 Comm: btrfs-cleaner Tainted: G
> W
> > > 4.1.0+ #268
> > > [ 2061.346027] Hardware name: QEMU Standard PC (i440FX + PIIX,
> > > 1996), BIOS
> > > 1.7.5-20140709_153950- 04/01/2014 [ 2061.346027]  ffff88013760a700
> > > ffff8800b4aff888 ffffffff818bb68e 0000000000000007 [ 2061.346027]
> > > ffffffff828fa0a0 ffff8800b4aff968 ffffffff810db18d ffff8800b4aff978
> > > [ 2061.346027]  ffffffff00000000 ffff88013760a750 ffffffff00000001
> > > ffff88013760b470 [ 2061.346027] Call Trace:
> > > [ 2061.346027]  [<ffffffff818bb68e>] dump_stack+0x4c/0x65 [
> > > 2061.346027] [<ffffffff810db18d>] __lock_acquire+0x214d/0x22f0 [
> > > 2061.346027] [<ffffffff810d5e3d>] ? trace_hardirqs_off+0xd/0x10 [
> > > 2061.346027] [<ffffffff810d649d>] ? __lock_is_held+0x4d/0x70 [
> > > 2061.346027] [<ffffffff810dc050>] lock_acquire+0xd0/0x280 [
> > > 2061.346027] [<ffffffff814063ab>] ?
> > > btrfs_run_delayed_iputs+0x6b/0x100
> > > [ 2061.346027]  [<ffffffff818c275c>] down_read+0x4c/0x70 [
> > > 2061.346027] [<ffffffff814063ab>] ?
> > > btrfs_run_delayed_iputs+0x6b/0x100
> > > [ 2061.346027]  [<ffffffff818c48cb>] ? _raw_spin_unlock+0x2b/0x40 [
> > > 2061.346027]  [<ffffffff814063ab>]
> > > btrfs_run_delayed_iputs+0x6b/0x100
> > > [ 2061.346027]  [<ffffffff8140042f>] ?
> > > btrfs_commit_transaction+0xa1f/0xc50
> > > [ 2061.346027]  [<ffffffff81400454>]
> > > btrfs_commit_transaction+0xa44/0xc50
> > > [ 2061.346027]  [<ffffffff810cdbd0>] ? wait_woken+0xa0/0xa0 [
> > > 2061.346027] [<ffffffff813e5247>] flush_space+0x97/0x4b0 [
> > > 2061.346027] [<ffffffff818c48cb>] ? _raw_spin_unlock+0x2b/0x40 [
> > > 2061.346027] [<ffffffff813e5a3b>] ?
> > > reserve_metadata_bytes+0x20b/0x6d0
> > > [ 2061.346027]  [<ffffffff813e5a52>]
> > > reserve_metadata_bytes+0x222/0x6d0
> > > [ 2061.346027]  [<ffffffff813e6245>] ?
> > > btrfs_block_rsv_refill+0x65/0x90 [ 2061.346027]
> > > [<ffffffff813e6257>] btrfs_block_rsv_refill+0x77/0x90 [ 2061.346027]
> > > [<ffffffff8140c723>] btrfs_evict_inode+0x473/0x700 [ 2061.346027]
> > > [<ffffffff8123dcbb>] evict+0xab/0x170 [ 2061.346027]
> > > [<ffffffff8123e6fd>] iput+0x1cd/0x350 [ 2061.346027]
> > > [<ffffffff81406409>]
> > > btrfs_run_delayed_iputs+0xc9/0x100
> > > [ 2061.346027]  [<ffffffff813f6cc4>] ? cleaner_kthread+0x64/0x1a0 [
> > > 2061.346027]  [<ffffffff813f6dba>] cleaner_kthread+0x15a/0x1a0 [
> > > 2061.346027]  [<ffffffff813f6c60>] ? btree_invalidatepage+0x90/0x90
> > > [ 2061.346027]  [<ffffffff810a3d5f>] kthread+0xef/0x110 [
> > > 2061.346027] [<ffffffff810a3c70>] ?
> > > kthread_create_on_node+0x210/0x210
> > > [ 2061.346027]  [<ffffffff818c550f>] ret_from_fork+0x3f/0x70 [
> > > 2061.346027] [<ffffffff810a3c70>] ?
> > > kthread_create_on_node+0x210/0x210
> > >
> > Thanks for report above problem.
> > Seems it is caused by recursion of btrfs_run_delayed_iputs:
> >
> >   cleaner_kthread
> >   -> btrfs_run_delayed_iputs() *1
> >   -> get delayed_iput_sem lock *2
> >   -> iput()
> >   -> ...
> >   -> btrfs_commit_transaction()
> >   -> btrfs_run_delayed_iputs() *1
> >   -> get delayed_iput_sem lock (dead lock) *2
> >
> >   *1: recursion of btrfs_run_delayed_iputs()
> >   *2: dead lock of delayed_iput_sem
> 
> Yep, that's what stack trace shows, but I guess it hardly runs to deadlock, since
> we've checked if the list is empty before acquiring the semephore.
> 
Or some reason, which cause " fs_info->delayed_iputs not empty" in second call.
(only guess)

> >
> > I'll reproduce this problem in my env, and comfirm fix of above
> > problem.
> 
> It didn't occur every time, but quite often on my box.
> 
Got it.
So, 1000 times test.

Thanks
Zhaolei

> Thanks,
> 
> -liubo
> >
> > Thanks
> > Zhaolei
> >
> >
> > >
> > > Thanks,
> > >
> > > -liubo
> > > >  }
> > > >
> > > >  /*
> > > > --
> > > > 1.8.5.1
> > > >
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
> > > > in the body of a message to majordomo@vger.kernel.org More
> > > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> >


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2015-07-07  9:38 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-09  4:34 [PATCH v2 0/9] btrfs: Fix no_space on dd and rm loop Zhaolei
2015-04-09  4:34 ` [PATCH 1/9] btrfs: fix condition of commit transaction Zhaolei
2015-04-09  4:34 ` [PATCH 2/9] btrfs: Fix tail space processing in find_free_dev_extent() Zhaolei
2015-04-09  4:34 ` [PATCH 3/9] btrfs: Adjust commit-transaction condition to avoid NO_SPACE more Zhaolei
2015-04-09  4:34 ` [PATCH 4/9] btrfs: Set relative data on clear btrfs_block_group_cache->pinned Zhaolei
2015-04-09  4:34 ` [PATCH 5/9] btrfs: add WARN_ON() to check is space_info op current Zhaolei
2015-04-09  4:34 ` [PATCH 6/9] btrfs: Fix NO_SPACE bug caused by delayed-iput Zhaolei
2015-07-07  8:13   ` Liu Bo
2015-07-07  9:16     ` Zhao Lei
2015-07-07  9:26       ` Liu Bo
2015-07-07  9:38         ` Zhao Lei
2015-04-09  4:34 ` [PATCH 7/9] btrfs: Support busy loop of write and delete Zhaolei
2015-04-09  4:34 ` [PATCH 8/9] btrfs: wait for delayed iputs on no space Zhaolei
2015-04-13 14:54   ` Chris Mason
2015-04-14  4:06     ` Zhao Lei
2015-04-09  4:34 ` [PATCH 9/9] btrfs: cleanup unused alloc_chunk varible Zhaolei

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.