Linux-BTRFS Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 0/3][V3] Delayed iput fixes
@ 2019-01-11 15:21 Josef Bacik
  2019-01-11 15:21 ` [PATCH 1/3] btrfs: run delayed iputs before committing Josef Bacik
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Josef Bacik @ 2019-01-11 15:21 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

v1->v2:
- only wakeup if the cleaner isn't currently doing work.
- re-arranged some stuff for running delayed iputs during flushint.
- removed the open code wakeup in the waitqueue patch.

-- Original message --

Here are some delayed iput fixes.  Delayed iputs can hold reservations for a
while and there's no real good way to make sure they were gone for good, which
means we could early enospc when in reality if we had just waited for the iput
we would have had plenty of space.  So fix this up by making us wait for delayed
iputs when deciding if we need to commit for enospc flushing, and then cleanup
and rework how we run delayed iputs to make it more straightforward to wait on
them and make sure we're all done using them.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/3] btrfs: run delayed iputs before committing
  2019-01-11 15:21 [PATCH 0/3][V3] Delayed iput fixes Josef Bacik
@ 2019-01-11 15:21 ` Josef Bacik
  2019-01-11 15:21 ` [PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput Josef Bacik
  2019-01-11 15:21 ` [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue Josef Bacik
  2 siblings, 0 replies; 12+ messages in thread
From: Josef Bacik @ 2019-01-11 15:21 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Delayed iputs means we can have final iputs of deleted inodes in the
queue, which could potentially generate a lot of pinned space that could
be free'd.  So before we decide to commit the transaction for ENOPSC
reasons, run the delayed iputs so that any potential space is free'd up.
If there is and we freed enough we can then commit the transaction and
potentially be able to make our reservation.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/extent-tree.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8dfddfd3f315..0127d272cd2a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4953,6 +4953,15 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 			ret = 0;
 		break;
 	case COMMIT_TRANS:
+		/*
+		 * If we have pending delayed iputs then we could free up a
+		 * bunch of pinned space, so make sure we run the iputs before
+		 * we do our pinned bytes check below.
+		 */
+		mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
+		btrfs_run_delayed_iputs(fs_info);
+		mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
+
 		ret = may_commit_transaction(fs_info, space_info);
 		break;
 	default:
-- 
2.14.3


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput
  2019-01-11 15:21 [PATCH 0/3][V3] Delayed iput fixes Josef Bacik
  2019-01-11 15:21 ` [PATCH 1/3] btrfs: run delayed iputs before committing Josef Bacik
@ 2019-01-11 15:21 ` Josef Bacik
  2019-01-11 15:21 ` [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue Josef Bacik
  2 siblings, 0 replies; 12+ messages in thread
From: Josef Bacik @ 2019-01-11 15:21 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

The cleaner thread usually takes care of delayed iputs, with the
exception of the btrfs_end_transaction_throttle path.  Delaying iputs
means we are potentially delaying the eviction of an inode and it's
respective space.  The cleaner thread only gets woken up every 30
seconds, or when we require space.  If there are a lot of inodes that
need to be deleted we could induce a serious amount of latency while we
wait for these inodes to be evicted.  So instead wakeup the cleaner if
it's not already awake to process any new delayed iputs we add to the
list.  If we suddenly need space we will less likely be backed up
behind a bunch of inodes that are waiting to be deleted, and we could
possibly free space before we need to get into the flushing logic which
will save us some latency.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.h   | 3 +++
 fs/btrfs/disk-io.c | 3 +++
 fs/btrfs/inode.c   | 2 ++
 3 files changed, 8 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index c8ddbacb6748..dc56a4d940c3 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -769,6 +769,9 @@ bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
  */
 #define BTRFS_FS_BALANCE_RUNNING		18
 
+/* Indicate that the cleaner thread is awake and doing something. */
+#define BTRFS_FS_CLEANER_RUNNING		19
+
 struct btrfs_fs_info {
 	u8 fsid[BTRFS_FSID_SIZE];
 	u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c5918ff8241b..f40f6fdc1019 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1669,6 +1669,8 @@ static int cleaner_kthread(void *arg)
 	while (1) {
 		again = 0;
 
+		set_bit(BTRFS_FS_CLEANER_RUNNING, &fs_info->flags);
+
 		/* Make the cleaner go to sleep early. */
 		if (btrfs_need_cleaner_sleep(fs_info))
 			goto sleep;
@@ -1715,6 +1717,7 @@ static int cleaner_kthread(void *arg)
 		 */
 		btrfs_delete_unused_bgs(fs_info);
 sleep:
+		clear_bit(BTRFS_FS_CLEANER_RUNNING, &fs_info->flags);
 		if (kthread_should_park())
 			kthread_parkme();
 		if (kthread_should_stop())
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8ac7abe2ae9b..0b9f3e482cea 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3264,6 +3264,8 @@ void btrfs_add_delayed_iput(struct inode *inode)
 	ASSERT(list_empty(&binode->delayed_iput));
 	list_add_tail(&binode->delayed_iput, &fs_info->delayed_iputs);
 	spin_unlock(&fs_info->delayed_iput_lock);
+	if (!test_bit(BTRFS_FS_CLEANER_RUNNING, &fs_info->flags))
+		wake_up_process(fs_info->cleaner_kthread);
 }
 
 void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info)
-- 
2.14.3


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue
  2019-01-11 15:21 [PATCH 0/3][V3] Delayed iput fixes Josef Bacik
  2019-01-11 15:21 ` [PATCH 1/3] btrfs: run delayed iputs before committing Josef Bacik
  2019-01-11 15:21 ` [PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput Josef Bacik
@ 2019-01-11 15:21 ` Josef Bacik
  2019-01-16 19:12   ` David Sterba
  2 siblings, 1 reply; 12+ messages in thread
From: Josef Bacik @ 2019-01-11 15:21 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

The throttle path doesn't take cleaner_delayed_iput_mutex, which means
we could think we're done flushing iputs in the data space reservation
path when we could have a throttler doing an iput.  There's no real
reason to serialize the delayed iput flushing, so instead of taking the
cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
replace it with an atomic counter and a waitqueue.  This removes the
short (or long depending on how big the inode is) window where we think
there are no more pending iputs when there really are some.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.h       |  4 +++-
 fs/btrfs/disk-io.c     |  5 ++---
 fs/btrfs/extent-tree.c | 13 ++++++++-----
 fs/btrfs/inode.c       | 21 +++++++++++++++++++++
 4 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index dc56a4d940c3..20af5d6d81f1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -915,7 +915,8 @@ struct btrfs_fs_info {
 
 	spinlock_t delayed_iput_lock;
 	struct list_head delayed_iputs;
-	struct mutex cleaner_delayed_iput_mutex;
+	atomic_t nr_delayed_iputs;
+	wait_queue_head_t delayed_iputs_wait;
 
 	/* this protects tree_mod_seq_list */
 	spinlock_t tree_mod_seq_lock;
@@ -3240,6 +3241,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root);
 int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size);
 void btrfs_add_delayed_iput(struct inode *inode);
 void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info);
+int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info);
 int btrfs_prealloc_file_range(struct inode *inode, int mode,
 			      u64 start, u64 num_bytes, u64 min_size,
 			      loff_t actual_len, u64 *alloc_hint);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f40f6fdc1019..238e0113f2d3 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1694,9 +1694,7 @@ static int cleaner_kthread(void *arg)
 			goto sleep;
 		}
 
-		mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
 		btrfs_run_delayed_iputs(fs_info);
-		mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
 
 		again = btrfs_clean_one_deleted_snapshot(root);
 		mutex_unlock(&fs_info->cleaner_mutex);
@@ -2654,7 +2652,6 @@ int open_ctree(struct super_block *sb,
 	mutex_init(&fs_info->delete_unused_bgs_mutex);
 	mutex_init(&fs_info->reloc_mutex);
 	mutex_init(&fs_info->delalloc_root_mutex);
-	mutex_init(&fs_info->cleaner_delayed_iput_mutex);
 	seqlock_init(&fs_info->profiles_lock);
 
 	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
@@ -2676,6 +2673,7 @@ int open_ctree(struct super_block *sb,
 	atomic_set(&fs_info->defrag_running, 0);
 	atomic_set(&fs_info->qgroup_op_seq, 0);
 	atomic_set(&fs_info->reada_works_cnt, 0);
+	atomic_set(&fs_info->nr_delayed_iputs, 0);
 	atomic64_set(&fs_info->tree_mod_seq, 0);
 	fs_info->sb = sb;
 	fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE;
@@ -2753,6 +2751,7 @@ int open_ctree(struct super_block *sb,
 	init_waitqueue_head(&fs_info->transaction_wait);
 	init_waitqueue_head(&fs_info->transaction_blocked_wait);
 	init_waitqueue_head(&fs_info->async_submit_wait);
+	init_waitqueue_head(&fs_info->delayed_iputs_wait);
 
 	INIT_LIST_HEAD(&fs_info->pinned_chunks);
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0127d272cd2a..5b6c9fc227ff 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4280,10 +4280,14 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
 				/*
 				 * The cleaner kthread might still be doing iput
 				 * operations. Wait for it to finish so that
-				 * more space is released.
+				 * more space is released.  We don't need to
+				 * explicitly run the delayed iputs here because
+				 * the commit_transaction would have woken up
+				 * the cleaner.
 				 */
-				mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
-				mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
+				ret = btrfs_wait_on_delayed_iputs(fs_info);
+				if (ret)
+					return ret;
 				goto again;
 			} else {
 				btrfs_end_transaction(trans);
@@ -4958,9 +4962,8 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 		 * bunch of pinned space, so make sure we run the iputs before
 		 * we do our pinned bytes check below.
 		 */
-		mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
 		btrfs_run_delayed_iputs(fs_info);
-		mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
+		btrfs_wait_on_delayed_iputs(fs_info);
 
 		ret = may_commit_transaction(fs_info, space_info);
 		break;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0b9f3e482cea..958e30c7c744 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3260,6 +3260,7 @@ void btrfs_add_delayed_iput(struct inode *inode)
 	if (atomic_add_unless(&inode->i_count, -1, 1))
 		return;
 
+	atomic_inc(&fs_info->nr_delayed_iputs);
 	spin_lock(&fs_info->delayed_iput_lock);
 	ASSERT(list_empty(&binode->delayed_iput));
 	list_add_tail(&binode->delayed_iput, &fs_info->delayed_iputs);
@@ -3280,11 +3281,31 @@ void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info)
 		list_del_init(&inode->delayed_iput);
 		spin_unlock(&fs_info->delayed_iput_lock);
 		iput(&inode->vfs_inode);
+		if (atomic_dec_and_test(&fs_info->nr_delayed_iputs))
+			wake_up(&fs_info->delayed_iputs_wait);
 		spin_lock(&fs_info->delayed_iput_lock);
 	}
 	spin_unlock(&fs_info->delayed_iput_lock);
 }
 
+/**
+ * btrfs_wait_on_delayed_iputs - wait on the delayed iputs to be done running
+ * @fs_info - the fs_info for this fs
+ * @return - EINTR if we were killed, 0 if nothing's pending
+ *
+ * This will wait on any delayed iputs that are currently running with KILLABLE
+ * set.  Once they are all done running we will return, unless we are killed in
+ * which case we return EINTR.
+ */
+int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info)
+{
+	int ret = wait_event_killable(fs_info->delayed_iputs_wait,
+			atomic_read(&fs_info->nr_delayed_iputs) == 0);
+	if (ret)
+		return -EINTR;
+	return 0;
+}
+
 /*
  * This creates an orphan entry for the given inode in case something goes wrong
  * in the middle of an unlink.
-- 
2.14.3


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue
  2019-01-11 15:21 ` [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue Josef Bacik
@ 2019-01-16 19:12   ` David Sterba
  2019-02-05 18:23     ` David Sterba
  0 siblings, 1 reply; 12+ messages in thread
From: David Sterba @ 2019-01-16 19:12 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Fri, Jan 11, 2019 at 10:21:03AM -0500, Josef Bacik wrote:
> The throttle path doesn't take cleaner_delayed_iput_mutex, which means
> we could think we're done flushing iputs in the data space reservation
> path when we could have a throttler doing an iput.  There's no real
> reason to serialize the delayed iput flushing, so instead of taking the
> cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
> replace it with an atomic counter and a waitqueue.  This removes the
> short (or long depending on how big the inode is) window where we think
> there are no more pending iputs when there really are some.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/ctree.h       |  4 +++-
>  fs/btrfs/disk-io.c     |  5 ++---
>  fs/btrfs/extent-tree.c | 13 ++++++++-----
>  fs/btrfs/inode.c       | 21 +++++++++++++++++++++
>  4 files changed, 34 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index dc56a4d940c3..20af5d6d81f1 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -915,7 +915,8 @@ struct btrfs_fs_info {
>  
>  	spinlock_t delayed_iput_lock;
>  	struct list_head delayed_iputs;
> -	struct mutex cleaner_delayed_iput_mutex;
> +	atomic_t nr_delayed_iputs;
> +	wait_queue_head_t delayed_iputs_wait;
>  
>  	/* this protects tree_mod_seq_list */
>  	spinlock_t tree_mod_seq_lock;
> @@ -3240,6 +3241,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root);
>  int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size);
>  void btrfs_add_delayed_iput(struct inode *inode);
>  void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info);
> +int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info);
>  int btrfs_prealloc_file_range(struct inode *inode, int mode,
>  			      u64 start, u64 num_bytes, u64 min_size,
>  			      loff_t actual_len, u64 *alloc_hint);
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index f40f6fdc1019..238e0113f2d3 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1694,9 +1694,7 @@ static int cleaner_kthread(void *arg)
>  			goto sleep;
>  		}
>  
> -		mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
>  		btrfs_run_delayed_iputs(fs_info);
> -		mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
>  
>  		again = btrfs_clean_one_deleted_snapshot(root);
>  		mutex_unlock(&fs_info->cleaner_mutex);
> @@ -2654,7 +2652,6 @@ int open_ctree(struct super_block *sb,
>  	mutex_init(&fs_info->delete_unused_bgs_mutex);
>  	mutex_init(&fs_info->reloc_mutex);
>  	mutex_init(&fs_info->delalloc_root_mutex);
> -	mutex_init(&fs_info->cleaner_delayed_iput_mutex);
>  	seqlock_init(&fs_info->profiles_lock);
>  
>  	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
> @@ -2676,6 +2673,7 @@ int open_ctree(struct super_block *sb,
>  	atomic_set(&fs_info->defrag_running, 0);
>  	atomic_set(&fs_info->qgroup_op_seq, 0);
>  	atomic_set(&fs_info->reada_works_cnt, 0);
> +	atomic_set(&fs_info->nr_delayed_iputs, 0);
>  	atomic64_set(&fs_info->tree_mod_seq, 0);
>  	fs_info->sb = sb;
>  	fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE;
> @@ -2753,6 +2751,7 @@ int open_ctree(struct super_block *sb,
>  	init_waitqueue_head(&fs_info->transaction_wait);
>  	init_waitqueue_head(&fs_info->transaction_blocked_wait);
>  	init_waitqueue_head(&fs_info->async_submit_wait);
> +	init_waitqueue_head(&fs_info->delayed_iputs_wait);
>  
>  	INIT_LIST_HEAD(&fs_info->pinned_chunks);
>  
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 0127d272cd2a..5b6c9fc227ff 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4280,10 +4280,14 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
>  				/*
>  				 * The cleaner kthread might still be doing iput
>  				 * operations. Wait for it to finish so that
> -				 * more space is released.
> +				 * more space is released.  We don't need to
> +				 * explicitly run the delayed iputs here because
> +				 * the commit_transaction would have woken up
> +				 * the cleaner.
>  				 */
> -				mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
> -				mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
> +				ret = btrfs_wait_on_delayed_iputs(fs_info);
> +				if (ret)
> +					return ret;
>  				goto again;
>  			} else {
>  				btrfs_end_transaction(trans);
> @@ -4958,9 +4962,8 @@ static void flush_space(struct btrfs_fs_info *fs_info,
>  		 * bunch of pinned space, so make sure we run the iputs before
>  		 * we do our pinned bytes check below.
>  		 */
> -		mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
>  		btrfs_run_delayed_iputs(fs_info);
> -		mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
> +		btrfs_wait_on_delayed_iputs(fs_info);
>  
>  		ret = may_commit_transaction(fs_info, space_info);
>  		break;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 0b9f3e482cea..958e30c7c744 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3260,6 +3260,7 @@ void btrfs_add_delayed_iput(struct inode *inode)
>  	if (atomic_add_unless(&inode->i_count, -1, 1))
>  		return;
>  
> +	atomic_inc(&fs_info->nr_delayed_iputs);
>  	spin_lock(&fs_info->delayed_iput_lock);
>  	ASSERT(list_empty(&binode->delayed_iput));
>  	list_add_tail(&binode->delayed_iput, &fs_info->delayed_iputs);
> @@ -3280,11 +3281,31 @@ void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info)
>  		list_del_init(&inode->delayed_iput);
>  		spin_unlock(&fs_info->delayed_iput_lock);
>  		iput(&inode->vfs_inode);
> +		if (atomic_dec_and_test(&fs_info->nr_delayed_iputs))
> +			wake_up(&fs_info->delayed_iputs_wait);
>  		spin_lock(&fs_info->delayed_iput_lock);
>  	}
>  	spin_unlock(&fs_info->delayed_iput_lock);
>  }
>  
> +/**
> + * btrfs_wait_on_delayed_iputs - wait on the delayed iputs to be done running
> + * @fs_info - the fs_info for this fs
> + * @return - EINTR if we were killed, 0 if nothing's pending
> + *
> + * This will wait on any delayed iputs that are currently running with KILLABLE
> + * set.  Once they are all done running we will return, unless we are killed in
> + * which case we return EINTR.

I still wonder from which context the 'killable' property is used.

btrfs_alloc_data_chunk_ondemand is indirectly inside zero range and
fallocate, there it makes sense. With flush_space is less obvious, so
this should be explained in the changelog what's expected.

> + */
> +int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info)
> +{
> +	int ret = wait_event_killable(fs_info->delayed_iputs_wait,
> +			atomic_read(&fs_info->nr_delayed_iputs) == 0);
> +	if (ret)
> +		return -EINTR;
> +	return 0;
> +}
> +
>  /*
>   * This creates an orphan entry for the given inode in case something goes wrong
>   * in the middle of an unlink.
> -- 
> 2.14.3

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue
  2019-01-16 19:12   ` David Sterba
@ 2019-02-05 18:23     ` David Sterba
  0 siblings, 0 replies; 12+ messages in thread
From: David Sterba @ 2019-02-05 18:23 UTC (permalink / raw)
  To: David Sterba; +Cc: Josef Bacik, linux-btrfs, kernel-team

On Wed, Jan 16, 2019 at 08:12:16PM +0100, David Sterba wrote:
> > +/**
> > + * btrfs_wait_on_delayed_iputs - wait on the delayed iputs to be done running
> > + * @fs_info - the fs_info for this fs
> > + * @return - EINTR if we were killed, 0 if nothing's pending
> > + *
> > + * This will wait on any delayed iputs that are currently running with KILLABLE
> > + * set.  Once they are all done running we will return, unless we are killed in
> > + * which case we return EINTR.
> 
> I still wonder from which context the 'killable' property is used.
> 
> btrfs_alloc_data_chunk_ondemand is indirectly inside zero range and
> fallocate, there it makes sense. With flush_space is less obvious, so
> this should be explained in the changelog what's expected.

Something along those lines is now in the changelog and patch goes to
misc-next.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue
  2018-12-04 11:46   ` Nikolay Borisov
@ 2018-12-04 18:21     ` Josef Bacik
  0 siblings, 0 replies; 12+ messages in thread
From: Josef Bacik @ 2018-12-04 18:21 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, linux-btrfs, kernel-team

On Tue, Dec 04, 2018 at 01:46:58PM +0200, Nikolay Borisov wrote:
> 
> 
> On 3.12.18 г. 18:06 ч., Josef Bacik wrote:
> > The throttle path doesn't take cleaner_delayed_iput_mutex, which means
> > we could think we're done flushing iputs in the data space reservation
> > path when we could have a throttler doing an iput.  There's no real
> > reason to serialize the delayed iput flushing, so instead of taking the
> > cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
> > replace it with an atomic counter and a waitqueue.  This removes the
> > short (or long depending on how big the inode is) window where we think
> > there are no more pending iputs when there really are some.
> > 
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> >  fs/btrfs/ctree.h       |  4 +++-
> >  fs/btrfs/disk-io.c     |  5 ++---
> >  fs/btrfs/extent-tree.c | 13 ++++++++-----
> >  fs/btrfs/inode.c       | 21 +++++++++++++++++++++
> >  4 files changed, 34 insertions(+), 9 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index dc56a4d940c3..20af5d6d81f1 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -915,7 +915,8 @@ struct btrfs_fs_info {
> >  
> >  	spinlock_t delayed_iput_lock;
> >  	struct list_head delayed_iputs;
> > -	struct mutex cleaner_delayed_iput_mutex;
> > +	atomic_t nr_delayed_iputs;
> > +	wait_queue_head_t delayed_iputs_wait;
> >  
> 
> Have you considered whether the same could be achieved with a completion
> rather than an open-coded waitqueue? I tried prototyping it and it could
> be done but it becomes messy regarding when the completion should be
> initialised i.e only when it's not in btrfs_add_delayed_iput
> 

Yeah a waitqueue makes more sense here than a completion since it's not a
one-off.

> 
> <snip>
> 
> > @@ -4958,9 +4962,8 @@ static void flush_space(struct btrfs_fs_info *fs_info,
> >  		 * bunch of pinned space, so make sure we run the iputs before
> >  		 * we do our pinned bytes check below.
> >  		 */
> > -		mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
> >  		btrfs_run_delayed_iputs(fs_info);
> > -		mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
> > +		btrfs_wait_on_delayed_iputs(fs_info);
> 
> Waiting on delayed iputs here is pointless since they are run
> synchronously form this context.
> 

Unless there are other threads (the cleaner thread) running iputs as well.  We
could be running an iput that is evicting the inode in another thread and we
really want that space, so we need to wait here to make sure everybody is truly
done.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue
  2018-12-03 16:06 ` [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue Josef Bacik
@ 2018-12-04 11:46   ` Nikolay Borisov
  2018-12-04 18:21     ` Josef Bacik
  0 siblings, 1 reply; 12+ messages in thread
From: Nikolay Borisov @ 2018-12-04 11:46 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 3.12.18 г. 18:06 ч., Josef Bacik wrote:
> The throttle path doesn't take cleaner_delayed_iput_mutex, which means
> we could think we're done flushing iputs in the data space reservation
> path when we could have a throttler doing an iput.  There's no real
> reason to serialize the delayed iput flushing, so instead of taking the
> cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
> replace it with an atomic counter and a waitqueue.  This removes the
> short (or long depending on how big the inode is) window where we think
> there are no more pending iputs when there really are some.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/ctree.h       |  4 +++-
>  fs/btrfs/disk-io.c     |  5 ++---
>  fs/btrfs/extent-tree.c | 13 ++++++++-----
>  fs/btrfs/inode.c       | 21 +++++++++++++++++++++
>  4 files changed, 34 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index dc56a4d940c3..20af5d6d81f1 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -915,7 +915,8 @@ struct btrfs_fs_info {
>  
>  	spinlock_t delayed_iput_lock;
>  	struct list_head delayed_iputs;
> -	struct mutex cleaner_delayed_iput_mutex;
> +	atomic_t nr_delayed_iputs;
> +	wait_queue_head_t delayed_iputs_wait;
>  

Have you considered whether the same could be achieved with a completion
rather than an open-coded waitqueue? I tried prototyping it and it could
be done but it becomes messy regarding when the completion should be
initialised i.e only when it's not in btrfs_add_delayed_iput


<snip>

> @@ -4958,9 +4962,8 @@ static void flush_space(struct btrfs_fs_info *fs_info,
>  		 * bunch of pinned space, so make sure we run the iputs before
>  		 * we do our pinned bytes check below.
>  		 */
> -		mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
>  		btrfs_run_delayed_iputs(fs_info);
> -		mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
> +		btrfs_wait_on_delayed_iputs(fs_info);

Waiting on delayed iputs here is pointless since they are run
synchronously form this context.

>  
>  		ret = may_commit_transaction(fs_info, space_info);
>  		break;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 0b9f3e482cea..958e30c7c744 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3260,6 +3260,7 @@ void btrfs_add_delayed_iput(struct inode *inode)
>  	if (atomic_add_unless(&inode->i_count, -1, 1))
>  		return;
>  
> +	atomic_inc(&fs_info->nr_delayed_iputs);
>  	spin_lock(&fs_info->delayed_iput_lock);
>  	ASSERT(list_empty(&binode->delayed_iput));
>  	list_add_tail(&binode->delayed_iput, &fs_info->delayed_iputs);
> @@ -3280,11 +3281,31 @@ void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info)
>  		list_del_init(&inode->delayed_iput);
>  		spin_unlock(&fs_info->delayed_iput_lock);
>  		iput(&inode->vfs_inode);
> +		if (atomic_dec_and_test(&fs_info->nr_delayed_iputs))
> +			wake_up(&fs_info->delayed_iputs_wait);
>  		spin_lock(&fs_info->delayed_iput_lock);
>  	}
>  	spin_unlock(&fs_info->delayed_iput_lock);
>  }
>  
> +/**
> + * btrfs_wait_on_delayed_iputs - wait on the delayed iputs to be done running
> + * @fs_info - the fs_info for this fs
> + * @return - EINTR if we were killed, 0 if nothing's pending
> + *
> + * This will wait on any delayed iputs that are currently running with KILLABLE
> + * set.  Once they are all done running we will return, unless we are killed in
> + * which case we return EINTR.
> + */
> +int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info)
> +{
> +	int ret = wait_event_killable(fs_info->delayed_iputs_wait,
> +			atomic_read(&fs_info->nr_delayed_iputs) == 0);
> +	if (ret)
> +		return -EINTR;
> +	return 0;
> +}
> +
>  /*
>   * This creates an orphan entry for the given inode in case something goes wrong
>   * in the middle of an unlink.
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue
  2018-12-03 16:06 [PATCH 0/3][V2] Delayed iput fixes Josef Bacik
@ 2018-12-03 16:06 ` Josef Bacik
  2018-12-04 11:46   ` Nikolay Borisov
  0 siblings, 1 reply; 12+ messages in thread
From: Josef Bacik @ 2018-12-03 16:06 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

The throttle path doesn't take cleaner_delayed_iput_mutex, which means
we could think we're done flushing iputs in the data space reservation
path when we could have a throttler doing an iput.  There's no real
reason to serialize the delayed iput flushing, so instead of taking the
cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
replace it with an atomic counter and a waitqueue.  This removes the
short (or long depending on how big the inode is) window where we think
there are no more pending iputs when there really are some.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.h       |  4 +++-
 fs/btrfs/disk-io.c     |  5 ++---
 fs/btrfs/extent-tree.c | 13 ++++++++-----
 fs/btrfs/inode.c       | 21 +++++++++++++++++++++
 4 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index dc56a4d940c3..20af5d6d81f1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -915,7 +915,8 @@ struct btrfs_fs_info {
 
 	spinlock_t delayed_iput_lock;
 	struct list_head delayed_iputs;
-	struct mutex cleaner_delayed_iput_mutex;
+	atomic_t nr_delayed_iputs;
+	wait_queue_head_t delayed_iputs_wait;
 
 	/* this protects tree_mod_seq_list */
 	spinlock_t tree_mod_seq_lock;
@@ -3240,6 +3241,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root);
 int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size);
 void btrfs_add_delayed_iput(struct inode *inode);
 void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info);
+int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info);
 int btrfs_prealloc_file_range(struct inode *inode, int mode,
 			      u64 start, u64 num_bytes, u64 min_size,
 			      loff_t actual_len, u64 *alloc_hint);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f40f6fdc1019..238e0113f2d3 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1694,9 +1694,7 @@ static int cleaner_kthread(void *arg)
 			goto sleep;
 		}
 
-		mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
 		btrfs_run_delayed_iputs(fs_info);
-		mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
 
 		again = btrfs_clean_one_deleted_snapshot(root);
 		mutex_unlock(&fs_info->cleaner_mutex);
@@ -2654,7 +2652,6 @@ int open_ctree(struct super_block *sb,
 	mutex_init(&fs_info->delete_unused_bgs_mutex);
 	mutex_init(&fs_info->reloc_mutex);
 	mutex_init(&fs_info->delalloc_root_mutex);
-	mutex_init(&fs_info->cleaner_delayed_iput_mutex);
 	seqlock_init(&fs_info->profiles_lock);
 
 	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
@@ -2676,6 +2673,7 @@ int open_ctree(struct super_block *sb,
 	atomic_set(&fs_info->defrag_running, 0);
 	atomic_set(&fs_info->qgroup_op_seq, 0);
 	atomic_set(&fs_info->reada_works_cnt, 0);
+	atomic_set(&fs_info->nr_delayed_iputs, 0);
 	atomic64_set(&fs_info->tree_mod_seq, 0);
 	fs_info->sb = sb;
 	fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE;
@@ -2753,6 +2751,7 @@ int open_ctree(struct super_block *sb,
 	init_waitqueue_head(&fs_info->transaction_wait);
 	init_waitqueue_head(&fs_info->transaction_blocked_wait);
 	init_waitqueue_head(&fs_info->async_submit_wait);
+	init_waitqueue_head(&fs_info->delayed_iputs_wait);
 
 	INIT_LIST_HEAD(&fs_info->pinned_chunks);
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0127d272cd2a..5b6c9fc227ff 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4280,10 +4280,14 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
 				/*
 				 * The cleaner kthread might still be doing iput
 				 * operations. Wait for it to finish so that
-				 * more space is released.
+				 * more space is released.  We don't need to
+				 * explicitly run the delayed iputs here because
+				 * the commit_transaction would have woken up
+				 * the cleaner.
 				 */
-				mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
-				mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
+				ret = btrfs_wait_on_delayed_iputs(fs_info);
+				if (ret)
+					return ret;
 				goto again;
 			} else {
 				btrfs_end_transaction(trans);
@@ -4958,9 +4962,8 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 		 * bunch of pinned space, so make sure we run the iputs before
 		 * we do our pinned bytes check below.
 		 */
-		mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
 		btrfs_run_delayed_iputs(fs_info);
-		mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
+		btrfs_wait_on_delayed_iputs(fs_info);
 
 		ret = may_commit_transaction(fs_info, space_info);
 		break;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0b9f3e482cea..958e30c7c744 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3260,6 +3260,7 @@ void btrfs_add_delayed_iput(struct inode *inode)
 	if (atomic_add_unless(&inode->i_count, -1, 1))
 		return;
 
+	atomic_inc(&fs_info->nr_delayed_iputs);
 	spin_lock(&fs_info->delayed_iput_lock);
 	ASSERT(list_empty(&binode->delayed_iput));
 	list_add_tail(&binode->delayed_iput, &fs_info->delayed_iputs);
@@ -3280,11 +3281,31 @@ void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info)
 		list_del_init(&inode->delayed_iput);
 		spin_unlock(&fs_info->delayed_iput_lock);
 		iput(&inode->vfs_inode);
+		if (atomic_dec_and_test(&fs_info->nr_delayed_iputs))
+			wake_up(&fs_info->delayed_iputs_wait);
 		spin_lock(&fs_info->delayed_iput_lock);
 	}
 	spin_unlock(&fs_info->delayed_iput_lock);
 }
 
+/**
+ * btrfs_wait_on_delayed_iputs - wait on the delayed iputs to be done running
+ * @fs_info - the fs_info for this fs
+ * @return - EINTR if we were killed, 0 if nothing's pending
+ *
+ * This will wait on any delayed iputs that are currently running with KILLABLE
+ * set.  Once they are all done running we will return, unless we are killed in
+ * which case we return EINTR.
+ */
+int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info)
+{
+	int ret = wait_event_killable(fs_info->delayed_iputs_wait,
+			atomic_read(&fs_info->nr_delayed_iputs) == 0);
+	if (ret)
+		return -EINTR;
+	return 0;
+}
+
 /*
  * This creates an orphan entry for the given inode in case something goes wrong
  * in the middle of an unlink.
-- 
2.14.3


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue
  2018-11-27  8:29   ` Nikolay Borisov
@ 2018-11-27 20:01     ` Josef Bacik
  0 siblings, 0 replies; 12+ messages in thread
From: Josef Bacik @ 2018-11-27 20:01 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, linux-btrfs, kernel-team

On Tue, Nov 27, 2018 at 10:29:57AM +0200, Nikolay Borisov wrote:
> 
> 
> On 21.11.18 г. 21:09 ч., Josef Bacik wrote:
> > The throttle path doesn't take cleaner_delayed_iput_mutex, which means
> 
> Which one is the throttle path? btrfs_end_transaction_throttle is only
> called during snapshot drop and relocation? What scenario triggered your
> observation and prompted this fix?
> 

One of my enospc tests runs snapshot creation/deletion in the background.

> > we could think we're done flushing iputs in the data space reservation
> > path when we could have a throttler doing an iput.  There's no real
> > reason to serialize the delayed iput flushing, so instead of taking the
> > cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
> > replace it with an atomic counter and a waitqueue.  This removes the
> > short (or long depending on how big the inode is) window where we think
> > there are no more pending iputs when there really are some.
> > 
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> >  fs/btrfs/ctree.h       |  4 +++-
> >  fs/btrfs/disk-io.c     |  5 ++---
> >  fs/btrfs/extent-tree.c |  9 +++++----
> >  fs/btrfs/inode.c       | 21 +++++++++++++++++++++
> >  4 files changed, 31 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 709de7471d86..a835fe7076eb 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -912,7 +912,8 @@ struct btrfs_fs_info {
> >  
> >  	spinlock_t delayed_iput_lock;
> >  	struct list_head delayed_iputs;
> > -	struct mutex cleaner_delayed_iput_mutex;
> > +	atomic_t nr_delayed_iputs;
> > +	wait_queue_head_t delayed_iputs_wait;
> >  
> >  	/* this protects tree_mod_seq_list */
> >  	spinlock_t tree_mod_seq_lock;
> > @@ -3237,6 +3238,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root);
> >  int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size);
> >  void btrfs_add_delayed_iput(struct inode *inode);
> >  void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info);
> > +int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info);
> >  int btrfs_prealloc_file_range(struct inode *inode, int mode,
> >  			      u64 start, u64 num_bytes, u64 min_size,
> >  			      loff_t actual_len, u64 *alloc_hint);
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index c5918ff8241b..3f81dfaefa32 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -1692,9 +1692,7 @@ static int cleaner_kthread(void *arg)
> >  			goto sleep;
> >  		}
> >  
> > -		mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
> >  		btrfs_run_delayed_iputs(fs_info);
> > -		mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
> >  
> >  		again = btrfs_clean_one_deleted_snapshot(root);
> >  		mutex_unlock(&fs_info->cleaner_mutex);
> > @@ -2651,7 +2649,6 @@ int open_ctree(struct super_block *sb,
> >  	mutex_init(&fs_info->delete_unused_bgs_mutex);
> >  	mutex_init(&fs_info->reloc_mutex);
> >  	mutex_init(&fs_info->delalloc_root_mutex);
> > -	mutex_init(&fs_info->cleaner_delayed_iput_mutex);
> >  	seqlock_init(&fs_info->profiles_lock);
> >  
> >  	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
> > @@ -2673,6 +2670,7 @@ int open_ctree(struct super_block *sb,
> >  	atomic_set(&fs_info->defrag_running, 0);
> >  	atomic_set(&fs_info->qgroup_op_seq, 0);
> >  	atomic_set(&fs_info->reada_works_cnt, 0);
> > +	atomic_set(&fs_info->nr_delayed_iputs, 0);
> >  	atomic64_set(&fs_info->tree_mod_seq, 0);
> >  	fs_info->sb = sb;
> >  	fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE;
> > @@ -2750,6 +2748,7 @@ int open_ctree(struct super_block *sb,
> >  	init_waitqueue_head(&fs_info->transaction_wait);
> >  	init_waitqueue_head(&fs_info->transaction_blocked_wait);
> >  	init_waitqueue_head(&fs_info->async_submit_wait);
> > +	init_waitqueue_head(&fs_info->delayed_iputs_wait);
> >  
> >  	INIT_LIST_HEAD(&fs_info->pinned_chunks);
> >  
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 3a90dc1d6b31..36f43876be22 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -4272,8 +4272,9 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
> >  				 * operations. Wait for it to finish so that
> >  				 * more space is released.
> >  				 */
> > -				mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
> > -				mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
> > +				ret = btrfs_wait_on_delayed_iputs(fs_info);
> > +				if (ret)
> > +					return ret;
> >  				goto again;
> >  			} else {
> >  				btrfs_end_transaction(trans);
> > @@ -4838,9 +4839,9 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
> >  	 * pinned space, so make sure we run the iputs before we do our pinned
> >  	 * bytes check below.
> >  	 */
> > -	mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
> >  	btrfs_run_delayed_iputs(fs_info);
> > -	mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
> > +	wait_event(fs_info->delayed_iputs_wait,
> > +		   atomic_read(&fs_info->nr_delayed_iputs) == 0);
> 
> Why open code btrfs_wait_on_delayed_iputs(fs_info)? Just make it use
> wait_event instead of the killable version and use the function
> uniformally across the code base.

I don't want the flusher thread to be killable, but now that I've said that out
loud I realize that's not possible, so I'll just use the helper.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue
  2018-11-21 19:09 ` [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue Josef Bacik
@ 2018-11-27  8:29   ` Nikolay Borisov
  2018-11-27 20:01     ` Josef Bacik
  0 siblings, 1 reply; 12+ messages in thread
From: Nikolay Borisov @ 2018-11-27  8:29 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 21.11.18 г. 21:09 ч., Josef Bacik wrote:
> The throttle path doesn't take cleaner_delayed_iput_mutex, which means

Which one is the throttle path? btrfs_end_transaction_throttle is only
called during snapshot drop and relocation? What scenario triggered your
observation and prompted this fix?

> we could think we're done flushing iputs in the data space reservation
> path when we could have a throttler doing an iput.  There's no real
> reason to serialize the delayed iput flushing, so instead of taking the
> cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
> replace it with an atomic counter and a waitqueue.  This removes the
> short (or long depending on how big the inode is) window where we think
> there are no more pending iputs when there really are some.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/ctree.h       |  4 +++-
>  fs/btrfs/disk-io.c     |  5 ++---
>  fs/btrfs/extent-tree.c |  9 +++++----
>  fs/btrfs/inode.c       | 21 +++++++++++++++++++++
>  4 files changed, 31 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 709de7471d86..a835fe7076eb 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -912,7 +912,8 @@ struct btrfs_fs_info {
>  
>  	spinlock_t delayed_iput_lock;
>  	struct list_head delayed_iputs;
> -	struct mutex cleaner_delayed_iput_mutex;
> +	atomic_t nr_delayed_iputs;
> +	wait_queue_head_t delayed_iputs_wait;
>  
>  	/* this protects tree_mod_seq_list */
>  	spinlock_t tree_mod_seq_lock;
> @@ -3237,6 +3238,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root);
>  int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size);
>  void btrfs_add_delayed_iput(struct inode *inode);
>  void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info);
> +int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info);
>  int btrfs_prealloc_file_range(struct inode *inode, int mode,
>  			      u64 start, u64 num_bytes, u64 min_size,
>  			      loff_t actual_len, u64 *alloc_hint);
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index c5918ff8241b..3f81dfaefa32 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1692,9 +1692,7 @@ static int cleaner_kthread(void *arg)
>  			goto sleep;
>  		}
>  
> -		mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
>  		btrfs_run_delayed_iputs(fs_info);
> -		mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
>  
>  		again = btrfs_clean_one_deleted_snapshot(root);
>  		mutex_unlock(&fs_info->cleaner_mutex);
> @@ -2651,7 +2649,6 @@ int open_ctree(struct super_block *sb,
>  	mutex_init(&fs_info->delete_unused_bgs_mutex);
>  	mutex_init(&fs_info->reloc_mutex);
>  	mutex_init(&fs_info->delalloc_root_mutex);
> -	mutex_init(&fs_info->cleaner_delayed_iput_mutex);
>  	seqlock_init(&fs_info->profiles_lock);
>  
>  	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
> @@ -2673,6 +2670,7 @@ int open_ctree(struct super_block *sb,
>  	atomic_set(&fs_info->defrag_running, 0);
>  	atomic_set(&fs_info->qgroup_op_seq, 0);
>  	atomic_set(&fs_info->reada_works_cnt, 0);
> +	atomic_set(&fs_info->nr_delayed_iputs, 0);
>  	atomic64_set(&fs_info->tree_mod_seq, 0);
>  	fs_info->sb = sb;
>  	fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE;
> @@ -2750,6 +2748,7 @@ int open_ctree(struct super_block *sb,
>  	init_waitqueue_head(&fs_info->transaction_wait);
>  	init_waitqueue_head(&fs_info->transaction_blocked_wait);
>  	init_waitqueue_head(&fs_info->async_submit_wait);
> +	init_waitqueue_head(&fs_info->delayed_iputs_wait);
>  
>  	INIT_LIST_HEAD(&fs_info->pinned_chunks);
>  
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 3a90dc1d6b31..36f43876be22 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4272,8 +4272,9 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
>  				 * operations. Wait for it to finish so that
>  				 * more space is released.
>  				 */
> -				mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
> -				mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
> +				ret = btrfs_wait_on_delayed_iputs(fs_info);
> +				if (ret)
> +					return ret;
>  				goto again;
>  			} else {
>  				btrfs_end_transaction(trans);
> @@ -4838,9 +4839,9 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>  	 * pinned space, so make sure we run the iputs before we do our pinned
>  	 * bytes check below.
>  	 */
> -	mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
>  	btrfs_run_delayed_iputs(fs_info);
> -	mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
> +	wait_event(fs_info->delayed_iputs_wait,
> +		   atomic_read(&fs_info->nr_delayed_iputs) == 0);

Why open code btrfs_wait_on_delayed_iputs(fs_info)? Just make it use
wait_event instead of the killable version and use the function
uniformally across the code base.

>  
>  	trans = btrfs_join_transaction(fs_info->extent_root);
>  	if (IS_ERR(trans))
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 3c42d8887183..57bf514a90eb 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3260,6 +3260,7 @@ void btrfs_add_delayed_iput(struct inode *inode)
>  	if (atomic_add_unless(&inode->i_count, -1, 1))
>  		return;
>  
> +	atomic_inc(&fs_info->nr_delayed_iputs);
>  	spin_lock(&fs_info->delayed_iput_lock);
>  	ASSERT(list_empty(&binode->delayed_iput));
>  	list_add_tail(&binode->delayed_iput, &fs_info->delayed_iputs);
> @@ -3279,11 +3280,31 @@ void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info)
>  		list_del_init(&inode->delayed_iput);
>  		spin_unlock(&fs_info->delayed_iput_lock);
>  		iput(&inode->vfs_inode);
> +		if (atomic_dec_and_test(&fs_info->nr_delayed_iputs))
> +			wake_up(&fs_info->delayed_iputs_wait);
>  		spin_lock(&fs_info->delayed_iput_lock);
>  	}
>  	spin_unlock(&fs_info->delayed_iput_lock);
>  }
>  
> +/**
> + * btrfs_wait_on_delayed_iputs - wait on the delayed iputs to be done running
> + * @fs_info - the fs_info for this fs
> + * @return - EINTR if we were killed, 0 if nothing's pending
> + *
> + * This will wait on any delayed iputs that are currently running with KILLABLE
> + * set.  Once they are all done running we will return, unless we are killed in
> + * which case we return EINTR.
> + */
> +int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info)
> +{
> +	int ret = wait_event_killable(fs_info->delayed_iputs_wait,
> +			atomic_read(&fs_info->nr_delayed_iputs) == 0);
> +	if (ret)
> +		return -EINTR;
> +	return 0;
> +}
> +
>  /*
>   * This creates an orphan entry for the given inode in case something goes wrong
>   * in the middle of an unlink.
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue
  2018-11-21 19:09 [PATCH 0/3] Delayed iput fixes Josef Bacik
@ 2018-11-21 19:09 ` Josef Bacik
  2018-11-27  8:29   ` Nikolay Borisov
  0 siblings, 1 reply; 12+ messages in thread
From: Josef Bacik @ 2018-11-21 19:09 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

The throttle path doesn't take cleaner_delayed_iput_mutex, which means
we could think we're done flushing iputs in the data space reservation
path when we could have a throttler doing an iput.  There's no real
reason to serialize the delayed iput flushing, so instead of taking the
cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
replace it with an atomic counter and a waitqueue.  This removes the
short (or long depending on how big the inode is) window where we think
there are no more pending iputs when there really are some.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.h       |  4 +++-
 fs/btrfs/disk-io.c     |  5 ++---
 fs/btrfs/extent-tree.c |  9 +++++----
 fs/btrfs/inode.c       | 21 +++++++++++++++++++++
 4 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 709de7471d86..a835fe7076eb 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -912,7 +912,8 @@ struct btrfs_fs_info {
 
 	spinlock_t delayed_iput_lock;
 	struct list_head delayed_iputs;
-	struct mutex cleaner_delayed_iput_mutex;
+	atomic_t nr_delayed_iputs;
+	wait_queue_head_t delayed_iputs_wait;
 
 	/* this protects tree_mod_seq_list */
 	spinlock_t tree_mod_seq_lock;
@@ -3237,6 +3238,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root);
 int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size);
 void btrfs_add_delayed_iput(struct inode *inode);
 void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info);
+int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info);
 int btrfs_prealloc_file_range(struct inode *inode, int mode,
 			      u64 start, u64 num_bytes, u64 min_size,
 			      loff_t actual_len, u64 *alloc_hint);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c5918ff8241b..3f81dfaefa32 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1692,9 +1692,7 @@ static int cleaner_kthread(void *arg)
 			goto sleep;
 		}
 
-		mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
 		btrfs_run_delayed_iputs(fs_info);
-		mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
 
 		again = btrfs_clean_one_deleted_snapshot(root);
 		mutex_unlock(&fs_info->cleaner_mutex);
@@ -2651,7 +2649,6 @@ int open_ctree(struct super_block *sb,
 	mutex_init(&fs_info->delete_unused_bgs_mutex);
 	mutex_init(&fs_info->reloc_mutex);
 	mutex_init(&fs_info->delalloc_root_mutex);
-	mutex_init(&fs_info->cleaner_delayed_iput_mutex);
 	seqlock_init(&fs_info->profiles_lock);
 
 	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
@@ -2673,6 +2670,7 @@ int open_ctree(struct super_block *sb,
 	atomic_set(&fs_info->defrag_running, 0);
 	atomic_set(&fs_info->qgroup_op_seq, 0);
 	atomic_set(&fs_info->reada_works_cnt, 0);
+	atomic_set(&fs_info->nr_delayed_iputs, 0);
 	atomic64_set(&fs_info->tree_mod_seq, 0);
 	fs_info->sb = sb;
 	fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE;
@@ -2750,6 +2748,7 @@ int open_ctree(struct super_block *sb,
 	init_waitqueue_head(&fs_info->transaction_wait);
 	init_waitqueue_head(&fs_info->transaction_blocked_wait);
 	init_waitqueue_head(&fs_info->async_submit_wait);
+	init_waitqueue_head(&fs_info->delayed_iputs_wait);
 
 	INIT_LIST_HEAD(&fs_info->pinned_chunks);
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3a90dc1d6b31..36f43876be22 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4272,8 +4272,9 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
 				 * operations. Wait for it to finish so that
 				 * more space is released.
 				 */
-				mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
-				mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
+				ret = btrfs_wait_on_delayed_iputs(fs_info);
+				if (ret)
+					return ret;
 				goto again;
 			} else {
 				btrfs_end_transaction(trans);
@@ -4838,9 +4839,9 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 	 * pinned space, so make sure we run the iputs before we do our pinned
 	 * bytes check below.
 	 */
-	mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
 	btrfs_run_delayed_iputs(fs_info);
-	mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
+	wait_event(fs_info->delayed_iputs_wait,
+		   atomic_read(&fs_info->nr_delayed_iputs) == 0);
 
 	trans = btrfs_join_transaction(fs_info->extent_root);
 	if (IS_ERR(trans))
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3c42d8887183..57bf514a90eb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3260,6 +3260,7 @@ void btrfs_add_delayed_iput(struct inode *inode)
 	if (atomic_add_unless(&inode->i_count, -1, 1))
 		return;
 
+	atomic_inc(&fs_info->nr_delayed_iputs);
 	spin_lock(&fs_info->delayed_iput_lock);
 	ASSERT(list_empty(&binode->delayed_iput));
 	list_add_tail(&binode->delayed_iput, &fs_info->delayed_iputs);
@@ -3279,11 +3280,31 @@ void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info)
 		list_del_init(&inode->delayed_iput);
 		spin_unlock(&fs_info->delayed_iput_lock);
 		iput(&inode->vfs_inode);
+		if (atomic_dec_and_test(&fs_info->nr_delayed_iputs))
+			wake_up(&fs_info->delayed_iputs_wait);
 		spin_lock(&fs_info->delayed_iput_lock);
 	}
 	spin_unlock(&fs_info->delayed_iput_lock);
 }
 
+/**
+ * btrfs_wait_on_delayed_iputs - wait on the delayed iputs to be done running
+ * @fs_info - the fs_info for this fs
+ * @return - EINTR if we were killed, 0 if nothing's pending
+ *
+ * This will wait on any delayed iputs that are currently running with KILLABLE
+ * set.  Once they are all done running we will return, unless we are killed in
+ * which case we return EINTR.
+ */
+int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info)
+{
+	int ret = wait_event_killable(fs_info->delayed_iputs_wait,
+			atomic_read(&fs_info->nr_delayed_iputs) == 0);
+	if (ret)
+		return -EINTR;
+	return 0;
+}
+
 /*
  * This creates an orphan entry for the given inode in case something goes wrong
  * in the middle of an unlink.
-- 
2.14.3


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, back to index

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-11 15:21 [PATCH 0/3][V3] Delayed iput fixes Josef Bacik
2019-01-11 15:21 ` [PATCH 1/3] btrfs: run delayed iputs before committing Josef Bacik
2019-01-11 15:21 ` [PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput Josef Bacik
2019-01-11 15:21 ` [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue Josef Bacik
2019-01-16 19:12   ` David Sterba
2019-02-05 18:23     ` David Sterba
  -- strict thread matches above, loose matches on Subject: below --
2018-12-03 16:06 [PATCH 0/3][V2] Delayed iput fixes Josef Bacik
2018-12-03 16:06 ` [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue Josef Bacik
2018-12-04 11:46   ` Nikolay Borisov
2018-12-04 18:21     ` Josef Bacik
2018-11-21 19:09 [PATCH 0/3] Delayed iput fixes Josef Bacik
2018-11-21 19:09 ` [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue Josef Bacik
2018-11-27  8:29   ` Nikolay Borisov
2018-11-27 20:01     ` Josef Bacik

Linux-BTRFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-btrfs/0 linux-btrfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-btrfs linux-btrfs/ https://lore.kernel.org/linux-btrfs \
		linux-btrfs@vger.kernel.org linux-btrfs@archiver.kernel.org
	public-inbox-index linux-btrfs


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-btrfs


AGPL code for this site: git clone https://public-inbox.org/ public-inbox