* [Virtio-fs] [PATCH RFC] virtiofs: use fine-grained lock for dmap reclaim
@ 2019-07-04  7:25 Liu Bo
  2019-07-08 20:43 ` Vivek Goyal
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Liu Bo @ 2019-07-04  7:25 UTC (permalink / raw)
  To: virtio-fs

As free fuse dax mappings run low, read performance is impacted
significantly because reads must wait for a free fuse dax mapping.

Although reads trigger reclaim work to try to reclaim fuse dax
mappings, the reclaim code can barely make any progress if most fuse dax
mappings are used by the file we're reading, since the inode lock is
required by the reclaim code.

However, we don't have to take the inode lock for reclaiming if each dax
mapping has its own reference count. The reference count tells the reclaim
code to skip in-use dax mappings, so that we avoid the risk of
accidentally reclaiming a dax mapping that other readers are using.

On the other hand, holding ->i_dmap_sem during reclaim prevents
subsequent reads from getting a dax mapping that is under reclaim.

Another reason this works is that reads/writes only use a fuse dax
mapping within dax_iomap_rw(), so we can do such a trick; mmap/faulting
is a different story, and there we have to take ->i_mmap_sem prior to
reclaiming a dax mapping in order to avoid the race.

This adds a reference count to the fuse dax mapping and removes the
acquisition of the inode lock during reclaim.
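
In outline, the scheme looks like this (a condensed sketch only; the
real changes are in the diff below):

	/* reader side, fuse_fill_iomap(), called via fuse_iomap_begin() */
	atomic_inc(&dmap->refcnt);	/* pin for the duration of dax_iomap_rw() */
	iomap->private = dmap;

	/* reader side, fuse_iomap_end() */
	if (dmap)
		atomic_dec(&dmap->refcnt);	/* unpin once the copy is done */

	/* reclaim side, with ->i_mmap_sem and ->i_dmap_sem held for write */
	if (atomic_read(&dmap->refcnt))
		return 0;	/* pinned by a reader, skip this range */
	/* refcnt is zero and no new reference can be taken: reclaim it */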


RESULTS:

virtiofsd -cache_size=2G

vanilla kernel: IOPS=378
patched kernel: IOPS=4508


*********************************
$ cat fio-rand-read.job
; fio-rand-read.job for fiotest

[global]
name=fio-rand-read
filename=fio_file
rw=randread
bs=4K
direct=1
numjobs=1
time_based=1
runtime=120
directory=/mnt/test/
fsync=1
group_reporting=1

[file1]
size=5G
# use sync/libaio
ioengine=sync
iodepth=1


Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
---
 fs/fuse/file.c   | 55 +++++++++++++++++++++++++++++++++++++------------------
 fs/fuse/fuse_i.h |  3 +++
 fs/fuse/inode.c  |  1 +
 3 files changed, 41 insertions(+), 18 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 4ed45a7..e3ec646 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1870,6 +1870,17 @@ static void fuse_fill_iomap(struct inode *inode, loff_t pos, loff_t length,
 		if (flags & IOMAP_FAULT)
 			iomap->length = ALIGN(len, PAGE_SIZE);
 		iomap->type = IOMAP_MAPPED;
+
+		/*
+		 * increase refcnt so that reclaim code knows this dmap is in
+		 * use.
+		 */
+		atomic_inc(&dmap->refcnt);
+
+		/* iomap->private should be NULL */
+		WARN_ON_ONCE(iomap->private);
+		iomap->private = dmap;
+
 		pr_debug("%s: returns iomap: addr 0x%llx offset 0x%llx"
 				" length 0x%llx\n", __func__, iomap->addr,
 				iomap->offset, iomap->length);
@@ -2024,6 +2035,11 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 			  ssize_t written, unsigned flags,
 			  struct iomap *iomap)
 {
+	struct fuse_dax_mapping *dmap = iomap->private;
+
+	if (dmap)
+		atomic_dec(&dmap->refcnt);
+
 	/* DAX writes beyond end-of-file aren't handled using iomap, so the
 	 * file size is unchanged and there is nothing to do here.
 	 */
@@ -3959,6 +3975,10 @@ static int reclaim_one_dmap_locked(struct fuse_conn *fc, struct inode *inode,
 	int ret;
 	struct fuse_inode *fi = get_fuse_inode(inode);
 
+	/*
+	 * igrab() was done to make sure the inode won't go away under us;
+	 * this further avoids the race with evict().
+	 */
 	ret = dmap_writeback_invalidate(inode, dmap);
 
 	/* TODO: What to do if above fails? For now,
@@ -4053,23 +4073,25 @@ static struct fuse_dax_mapping *alloc_dax_mapping_reclaim(struct fuse_conn *fc,
 	}
 }
 
-int lookup_and_reclaim_dmap_locked(struct fuse_conn *fc, struct inode *inode,
-				   u64 dmap_start)
+static int lookup_and_reclaim_dmap_locked(struct fuse_conn *fc,
+					  struct inode *inode, u64 dmap_start)
 {
 	int ret;
 	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct fuse_dax_mapping *dmap;
 
-	WARN_ON(!inode_is_locked(inode));
-
 	/* Find fuse dax mapping at file offset inode. */
 	dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, dmap_start,
-							dmap_start);
+						 dmap_start);
 
 	/* Range already got cleaned up by somebody else */
 	if (!dmap)
 		return 0;
 
+	/* still in use. */
+	if (atomic_read(&dmap->refcnt))
+		return 0;
+
 	ret = reclaim_one_dmap_locked(fc, inode, dmap);
 	if (ret < 0)
 		return ret;
@@ -4084,29 +4106,21 @@ int lookup_and_reclaim_dmap_locked(struct fuse_conn *fc, struct inode *inode,
 /*
  * Free a range of memory.
  * Locking.
- * 1. Take inode->i_rwsem to prever further read/write.
- * 2. Take fuse_inode->i_mmap_sem to block dax faults.
- * 3. Take fuse_inode->i_dmap_sem to protect interval tree. It might not
+ * 1. Take fuse_inode->i_mmap_sem to block dax faults.
+ * 2. Take fuse_inode->i_dmap_sem to protect interval tree. It might not
  *    be strictly necessary as lock 1 and 2 seem sufficient.
  */
-int lookup_and_reclaim_dmap(struct fuse_conn *fc, struct inode *inode,
-			    u64 dmap_start)
+static int lookup_and_reclaim_dmap(struct fuse_conn *fc, struct inode *inode,
+				   u64 dmap_start)
 {
 	int ret;
 	struct fuse_inode *fi = get_fuse_inode(inode);
 
-	/*
-	 * If process is blocked waiting for memory while holding inode
-	 * lock, we will deadlock. So continue to free next range.
-	 */
-	if (!inode_trylock(inode))
-		return -EAGAIN;
 	down_write(&fi->i_mmap_sem);
 	down_write(&fi->i_dmap_sem);
 	ret = lookup_and_reclaim_dmap_locked(fc, inode, dmap_start);
 	up_write(&fi->i_dmap_sem);
 	up_write(&fi->i_mmap_sem);
-	inode_unlock(inode);
 	return ret;
 }
 
@@ -4134,6 +4148,12 @@ static int try_to_free_dmap_chunks(struct fuse_conn *fc,
 
 		list_for_each_entry_safe(pos, temp, &fc->busy_ranges,
 						busy_list) {
+			dmap = pos;
+
+			/* skip this range if it's in use. */
+			if (atomic_read(&dmap->refcnt))
+				continue;
+
 			inode = igrab(pos->inode);
 			/*
 			 * This inode is going away. That will free
@@ -4147,7 +4167,6 @@ static int try_to_free_dmap_chunks(struct fuse_conn *fc,
 			 * inode lock can't be obtained, this will help with
 			 * selecting new element
 			 */
-			dmap = pos;
 			list_move_tail(&dmap->busy_list, &fc->busy_ranges);
 			dmap_start = dmap->start;
 			window_offset = dmap->window_offset;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 94cfde0..31ecdac 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -143,6 +143,9 @@ struct fuse_dax_mapping {
 
 	/* indicate if the mapping is set up for write purpose */
 	unsigned flags;
+
+	/* reference count when the mapping is used by dax iomap. */
+	atomic_t refcnt;
 };
 
 /** FUSE inode */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 288daee..27c6055 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -671,6 +671,7 @@ static int fuse_dax_mem_range_init(struct fuse_conn *fc,
 		range->length = FUSE_DAX_MEM_RANGE_SZ;
 		list_add_tail(&range->list, &mem_ranges);
 		INIT_LIST_HEAD(&range->busy_list);
+		atomic_set(&range->refcnt, 0);
 		allocated_ranges++;
 	}
 
-- 
1.8.3.1



* Re: [Virtio-fs] [PATCH RFC] virtiofs: use fine-grained lock for dmap reclaim
  2019-07-04  7:25 [Virtio-fs] [PATCH RFC] virtiofs: use fine-grained lock for dmap reclaim Liu Bo
@ 2019-07-08 20:43 ` Vivek Goyal
  2019-07-10 18:41   ` Liu Bo
  2019-07-10 13:38 ` Vivek Goyal
  2019-07-15 20:37 ` Vivek Goyal
  2 siblings, 1 reply; 12+ messages in thread
From: Vivek Goyal @ 2019-07-08 20:43 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Thu, Jul 04, 2019 at 03:25:31PM +0800, Liu Bo wrote:
> As free fuse dax mappings run low, read performance is impacted
> significantly because reads must wait for a free fuse dax mapping.
> 
> Although reads trigger reclaim work to try to reclaim fuse dax
> mappings, the reclaim code can barely make any progress if most fuse dax
> mappings are used by the file we're reading, since the inode lock is
> required by the reclaim code.
> 
> However, we don't have to take the inode lock for reclaiming if each dax
> mapping has its own reference count. The reference count tells the reclaim
> code to skip in-use dax mappings, so that we avoid the risk of
> accidentally reclaiming a dax mapping that other readers are using.
> 
> On the other hand, holding ->i_dmap_sem during reclaim prevents
> subsequent reads from getting a dax mapping that is under reclaim.
> 
> Another reason this works is that reads/writes only use a fuse dax
> mapping within dax_iomap_rw(), so we can do such a trick; mmap/faulting
> is a different story, and there we have to take ->i_mmap_sem prior to
> reclaiming a dax mapping in order to avoid the race.

Hi Liu Bo,

Not sure why we can get rid of the inode lock but not ->i_mmap_sem. Holding
this lock only prevents further page faults, but existing mapped pages
will continue to be accessed by the process.

And in theory we could drop the inode lock and be safe with only the
->i_dmap_sem lock; can't we do something similar for the ->i_mmap_sem lock?

Also, I am a little worried about races w.r.t. truncation and i_size
updates. Dax code assumes that i_size is stable and that the filesystem
is holding enough locks to ensure that.

What kind of testing have you done to make sure this is safe? Try
running blogbench, and possibly a mix of read/write/mmap workloads along
with heavy truncate/punch-hole operations in parallel as well.

I will spend more time thinking about this.

Thanks
Vivek

* Re: [Virtio-fs] [PATCH RFC] virtiofs: use fine-grained lock for dmap reclaim
  2019-07-04  7:25 [Virtio-fs] [PATCH RFC] virtiofs: use fine-grained lock for dmap reclaim Liu Bo
  2019-07-08 20:43 ` Vivek Goyal
@ 2019-07-10 13:38 ` Vivek Goyal
  2019-07-10 18:59   ` Liu Bo
  2019-07-15 20:37 ` Vivek Goyal
  2 siblings, 1 reply; 12+ messages in thread
From: Vivek Goyal @ 2019-07-10 13:38 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Thu, Jul 04, 2019 at 03:25:31PM +0800, Liu Bo wrote:
> As free fuse dax mappings run low, read performance is impacted
> significantly because reads must wait for a free fuse dax mapping.

I am not sure how you are getting so much of a performance gain. Basically,
you are not taking the inode lock in the reclaim path and skipping the range
instead. But reads will still block on ->i_dmap_sem if a free range is not
available, so in terms of waiting this should be similar to waiting on the
inode lock.

If performance is low due to lock contention, then the contention basically
moves from the inode lock to ->i_dmap_sem, doesn't it? If that's the case, I
am wondering how that improves performance.

Vivek


* Re: [Virtio-fs] [PATCH RFC] virtiofs: use fine-grained lock for dmap reclaim
  2019-07-08 20:43 ` Vivek Goyal
@ 2019-07-10 18:41   ` Liu Bo
  2019-07-10 20:37     ` Vivek Goyal
  0 siblings, 1 reply; 12+ messages in thread
From: Liu Bo @ 2019-07-10 18:41 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs

On Mon, Jul 08, 2019 at 04:43:01PM -0400, Vivek Goyal wrote:
> On Thu, Jul 04, 2019 at 03:25:31PM +0800, Liu Bo wrote:
> > [...]
> 
> Hi Liu Bo,
> 
> Not sure why we can get rid of the inode lock but not ->i_mmap_sem. Holding
> this lock only prevents further page faults, but existing mapped pages
> will continue to be accessed by the process.
>

The idea is that reads/writes always go through fuse_iomap_{begin,end}
during the whole IO process, while for mmap, once the fault-in finishes,
the process can read/write the area without fuse_iomap_{begin,end}.
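
(For illustration, the two paths look roughly like this; only the
read/write path is fully bracketed:)

	/* read/write: refcnt pins the dmap across the whole copy */
	dax_iomap_rw()
		-> fuse_iomap_begin()	/* atomic_inc(&dmap->refcnt) */
		-> copy to/from the dax window
		-> fuse_iomap_end()	/* atomic_dec(&dmap->refcnt) */

	/* mmap: only the fault is bracketed; later loads/stores hit the
	 * window directly, which is why reclaim still takes ->i_mmap_sem */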

> And in theory we could drop the inode lock and be safe with only the
> ->i_dmap_sem lock; can't we do something similar for the ->i_mmap_sem lock?
> 
> Also, I am a little worried about races w.r.t. truncation and i_size
> updates. Dax code assumes that i_size is stable and that the filesystem
> is holding enough locks to ensure that.
>

Typically i_mmap_sem is held by truncate to avoid racing with the dax code.

But the race between truncate and i_size changes seems unrelated to this
patch, as the patch only changes how the background reclaim worker
manipulates locks.

I went through fuse_setattr(); we don't remove dax mapping ranges in truncate, either.

> What kind of testing have you done to make sure this is safe? Try
> running blogbench, and possibly a mix of read/write/mmap workloads along
> with heavy truncate/punch-hole operations in parallel as well.
>

Not yet, but I'm certainly planning to run a full round of regression
tests against it.

thanks,
-liubo


* Re: [Virtio-fs] [PATCH RFC] virtiofs: use fine-grained lock for dmap reclaim
  2019-07-10 13:38 ` Vivek Goyal
@ 2019-07-10 18:59   ` Liu Bo
  0 siblings, 0 replies; 12+ messages in thread
From: Liu Bo @ 2019-07-10 18:59 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs

On Wed, Jul 10, 2019 at 09:38:18AM -0400, Vivek Goyal wrote:
> On Thu, Jul 04, 2019 at 03:25:31PM +0800, Liu Bo wrote:
> > As free fuse dax mappings run low, read performance is impacted
> > significantly because reads must wait for a free fuse dax mapping.
> 
> I am not sure how you are getting so much of a performance gain. Basically,
> you are not taking the inode lock in the reclaim path and skipping the range
> instead. But reads will still block on ->i_dmap_sem if a free range is not
> available, so in terms of waiting this should be similar to waiting on the
> inode lock.
> 
> If performance is low due to lock contention, then the contention basically
> moves from the inode lock to ->i_dmap_sem, doesn't it? If that's the case, I
> am wondering how that improves performance.

Yes, that's the case; the gain comes from the fact that i_dmap_sem is held
for a much shorter time than the inode lock, so it's a more fine-grained lock.

After commit "virtio-fs: Fix a race in range reclaim", reads are no longer
able to reclaim dax ranges under inode_lock, so they have to depend on the
reclaim worker to do the job.  But the reclaim code also needs to acquire
the inode lock (write lock) prior to reclaiming ranges.

   read                       reclaim thread
inode_lock
dax_iomap_rw
  wake up reclaim worker
                        --->  inode_trylock() // failed 20 times, then queue a delayed work in 10ms
inode_unlock
retry and wait 


When dax ranges are exhausted, the reclaim thread gets woken up very
frequently but can only reclaim at most 10 ranges per run, or just queues
a delayed work.
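
Roughly, the vanilla worker behaves like the sketch below (the constants
mirror the description above; the helper and field names here are
hypothetical):

	/* simplified, hypothetical sketch of the vanilla reclaim worker */
	for (i = 0; i < 10; i++) {	/* at most 10 ranges per run */
		ret = lookup_and_reclaim_dmap(fc, inode, dmap_start);
		if (ret == -EAGAIN && ++retries >= 20) {
			/* readers hold the inode lock; back off for 10ms */
			schedule_delayed_work(&fc->reclaim_work,
					      msecs_to_jiffies(10));
			return;
		}
	}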

thanks,
-liubo


* Re: [Virtio-fs] [PATCH RFC] virtiofs: use fine-grained lock for dmap reclaim
  2019-07-10 18:41   ` Liu Bo
@ 2019-07-10 20:37     ` Vivek Goyal
  2019-07-11  8:49       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 12+ messages in thread
From: Vivek Goyal @ 2019-07-10 20:37 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Wed, Jul 10, 2019 at 11:41:29AM -0700, Liu Bo wrote:
> On Mon, Jul 08, 2019 at 04:43:01PM -0400, Vivek Goyal wrote:
> > [...]
> > What kind of testing have you done to make sure this is safe? Try
> > running blogbench, and possibly a mix of read/write/mmap workloads along
> > with heavy truncate/punch-hole operations in parallel as well.
> >
> 
> Not yet, but I'm certainly planning to run a full round of regression
> tests against it.

I ran a bunch of fio jobs and compared the results between the vanilla
kernel and your patch. Looks like randwrite jobs are showing some regression.

Here are my scripts.

https://github.com/rhvgoyal/virtiofs-tests

I ran tests with cache=always, cache size=2G, dax enabled.


NAME                    I/O Operation           BW(Read/Write)
virtiofs-vanilla        seqread                 164(MiB/s)
virtiofs-liubo-patch    seqread                 163(MiB/s)

virtiofs-vanilla        seqread-mmap-single     200(MiB/s)
virtiofs-liubo-patch    seqread-mmap-single     220(MiB/s)

virtiofs-vanilla        seqread-mmap-multi      736(MiB/s)
virtiofs-liubo-patch    seqread-mmap-multi      771(MiB/s)

virtiofs-vanilla        randread                1406(KiB/s)
virtiofs-liubo-patch    randread                15(MiB/s)

virtiofs-vanilla        randread-mmap-single    13(MiB/s)
virtiofs-liubo-patch    randread-mmap-single    11(MiB/s)

virtiofs-vanilla        randread-mmap-multi     8934(KiB/s)
virtiofs-liubo-patch    randread-mmap-multi     9264(KiB/s)

virtiofs-vanilla        seqwrite                120(MiB/s)
virtiofs-liubo-patch    seqwrite                110(MiB/s)

virtiofs-vanilla        seqwrite-mmap-single    242(MiB/s)
virtiofs-liubo-patch    seqwrite-mmap-single    250(MiB/s)

virtiofs-vanilla        seqwrite-mmap-multi     595(MiB/s)
virtiofs-liubo-patch    seqwrite-mmap-multi     646(MiB/s)

virtiofs-vanilla        randwrite               20(MiB/s)
virtiofs-liubo-patch    randwrite               15(MiB/s)

virtiofs-vanilla        randwrite-mmap-single   12(MiB/s)
virtiofs-liubo-patch    randwrite-mmap-single   10(MiB/s)

virtiofs-vanilla        randwrite-mmap-multi    11(MiB/s)
virtiofs-liubo-patch    randwrite-mmap-multi    8246(KiB/s)

Vivek



* Re: [Virtio-fs] [PATCH RFC] virtiofs: use fine-grained lock for dmap reclaim
  2019-07-10 20:37     ` Vivek Goyal
@ 2019-07-11  8:49       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 12+ messages in thread
From: Dr. David Alan Gilbert @ 2019-07-11  8:49 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs

* Vivek Goyal (vgoyal@redhat.com) wrote:
> On Wed, Jul 10, 2019 at 11:41:29AM -0700, Liu Bo wrote:
> > [...]
> 
> I ran a bunch of fio jobs and compared the results between the vanilla
> kernel and your patch. Looks like randwrite jobs are showing some regression.

But...

> Here are my scripts.
> 
> https://github.com/rhvgoyal/virtiofs-tests
> 
> I ran tests with cache=always, cache size=2G, dax enabled.
> 
> 
> NAME                    I/O Operation           BW(Read/Write)
> virtiofs-vanilla        seqread                 164(MiB/s)
> virtiofs-liubo-patch    seqread                 163(MiB/s)
> 
> virtiofs-vanilla        seqread-mmap-single     200(MiB/s)
> virtiofs-liubo-patch    seqread-mmap-single     220(MiB/s)
> 
> virtiofs-vanilla        seqread-mmap-multi      736(MiB/s)
> virtiofs-liubo-patch    seqread-mmap-multi      771(MiB/s)
> 
> virtiofs-vanilla        randread                1406(KiB/s)
> virtiofs-liubo-patch    randread                15(MiB/s)

That's a hell of an improvement for randread; perhaps the write
regression is worth it?

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Virtio-fs] [PATCH RFC] virtiofs: use fine-grained lock for dmap reclaim
  2019-07-04  7:25 [Virtio-fs] [PATCH RFC] virtiofs: use fine-grained lock for dmap reclaim Liu Bo
  2019-07-08 20:43 ` Vivek Goyal
  2019-07-10 13:38 ` Vivek Goyal
@ 2019-07-15 20:37 ` Vivek Goyal
  2019-07-15 21:38   ` Liu Bo
  2 siblings, 1 reply; 12+ messages in thread
From: Vivek Goyal @ 2019-07-15 20:37 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Thu, Jul 04, 2019 at 03:25:31PM +0800, Liu Bo wrote:
> As free fuse dax mappings run low, read performance is impacted
> significantly because reads must wait for a free fuse dax mapping.
> 
> Although reads trigger reclaim work to try to reclaim fuse dax
> mappings, the reclaim code can barely make any progress if most fuse dax
> mappings are used by the file we're reading, since the inode lock is
> required by the reclaim code.
> 
> However, we don't have to take the inode lock for reclaiming if each dax
> mapping has its own reference count. The reference count tells the reclaim
> code to skip in-use dax mappings, so that we avoid the risk of
> accidentally reclaiming a dax mapping that other readers are using.
> 
> On the other hand, holding ->i_dmap_sem during reclaim prevents
> subsequent reads from getting a dax mapping that is under reclaim.
> 
> Another reason this works is that reads/writes only use a fuse dax
> mapping within dax_iomap_rw(), so we can do such a trick; mmap/faulting
> is a different story, and there we have to take ->i_mmap_sem prior to
> reclaiming a dax mapping in order to avoid the race.
> 
> This adds a reference count to the fuse dax mapping and removes the
> acquisition of the inode lock during reclaim.

I am not sure that this reference count implementation is safe. For
example, what prevents atomic_dec() from being reordered so that it
executes before the copy to user space has actually finished?

Say cpu1 is reading a dax page and cpu2 is freeing the dmap:

cpu1 read				cpu2 free dmap
---------				-------------
atomic_inc()				atomic_read()			
copy data to user space
atomic_dec

Say atomic_dec() gets reordered w.r.t. the copy of data to user space:

cpu1 read				cpu2 free dmap
---------				-------------
atomic_inc()				atomic_read()			
atomic_dec
copy data to user space

Now cpu2 will free up the dax range while it is still being read?
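
One conventional way to close that window (an illustration only, not
something the posted patch does) would be to order the copy before the
unpin and pair it with a barrier on the reclaim side:

	/* reader side, fuse_iomap_end() */
	if (dmap) {
		/* ensure the user-space copy completes before the dec */
		smp_mb__before_atomic();
		atomic_dec(&dmap->refcnt);
	}

	/* reclaim side, ->i_dmap_sem held for write */
	if (atomic_read(&dmap->refcnt))
		return 0;	/* still pinned, skip */
	smp_mb();		/* pairs with the reader-side barrier */
	/* now safe to writeback/invalidate and reuse the range */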

Thanks
Vivek



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Virtio-fs] [PATCH RFC] virtiofs: use fine-grained lock for dmap reclaim
  2019-07-15 20:37 ` Vivek Goyal
@ 2019-07-15 21:38   ` Liu Bo
  2019-07-16 18:36     ` Vivek Goyal
  0 siblings, 1 reply; 12+ messages in thread
From: Liu Bo @ 2019-07-15 21:38 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs

On Mon, Jul 15, 2019 at 04:37:39PM -0400, Vivek Goyal wrote:
> I am not sure that this reference count implementation is safe. For
> example, what prevents atomic_dec() from being reordered so that it
> executes before the copy to user space has finished?
>
> Say cpu1 is reading a dax page and cpu2 is freeing memory.
>
> cpu1 read                               cpu2 free dmap
> ---------                               -------------
> atomic_inc()                            atomic_read()
> copy data to user space
> atomic_dec()
>
> Say atomic_dec() gets reordered w.r.t. copying the data to user space:
>
> cpu1 read                               cpu2 free dmap
> ---------                               -------------
> atomic_inc()                            atomic_read()
> atomic_dec()
> copy data to user space
>
> Now cpu2 can free the dax range while it is still being read.

Yep, I think this is possible.

For this specific reordering, barriers like smp_mb__{before,after}_atomic()
could fix it.
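
Concretely, one possible placement could look like the following (a
sketch only, untested: smp_mb__before_atomic() fully orders prior
accesses against a following non-value-returning atomic RMW such as
atomic_dec(); the reclaim side's atomic_read() is not an RMW, so it
would need a full smp_mb() or an acquire/release scheme instead):

        /* read side, e.g. in fuse_iomap_end() */
        smp_mb__before_atomic();        /* user copy is ordered before
                                         * the dec becomes visible */
        atomic_dec(&dmap->refcnt);

        /* reclaim side, e.g. in lookup_and_reclaim_dmap_locked() */
        if (atomic_read(&dmap->refcnt))
                return 0;
        smp_mb();                       /* refcnt check is ordered before
                                         * we start tearing the range down */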

thanks,
-liubo


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Virtio-fs] [PATCH RFC] virtiofs: use fine-grained lock for dmap reclaim
  2019-07-15 21:38   ` Liu Bo
@ 2019-07-16 18:36     ` Vivek Goyal
  2019-07-16 19:07       ` Liu Bo
  0 siblings, 1 reply; 12+ messages in thread
From: Vivek Goyal @ 2019-07-16 18:36 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Mon, Jul 15, 2019 at 02:38:06PM -0700, Liu Bo wrote:
> On Mon, Jul 15, 2019 at 04:37:39PM -0400, Vivek Goyal wrote:
> > I am not sure that this reference count implementation is safe. For
> > example, what prevents atomic_dec() from being reordered so that it
> > executes before the copy to user space has finished?
> >
> > Say cpu1 is reading a dax page and cpu2 is freeing memory.
> >
> > cpu1 read                               cpu2 free dmap
> > ---------                               -------------
> > atomic_inc()                            atomic_read()
> > copy data to user space
> > atomic_dec()
> >
> > Say atomic_dec() gets reordered w.r.t. copying the data to user space:
> >
> > cpu1 read                               cpu2 free dmap
> > ---------                               -------------
> > atomic_inc()                            atomic_read()
> > atomic_dec()
> > copy data to user space
> >
> > Now cpu2 can free the dax range while it is still being read.
> 
> Yep, I think this is possible.
> 
> For this specific reordering, barriers like smp_mb__{before,after}_atomic()
> could fix it.

Hi Liu Bo,

I have modified the patch to use refcount_t. Our use case is a little
different from typical reference counting, so I hope there are no bugs
there.

I have also fixed a bunch of other issues and enabled inline range
reclaim for the read path. (That became possible with the dmap refcount
patch.)
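
For reference, the rough shape such a conversion could take (a sketch
under assumptions, not the actual patch, which is in the branch linked
below: refcount_t forbids incrementing from zero, so the mapping would
carry a base reference of 1 for as long as it exists, and a count of 1
would mean "idle"; a nice side effect is that refcount_dec() provides
release ordering, unlike plain atomic_dec()):

        /* fs/fuse/fuse_i.h */
        struct fuse_dax_mapping {
                ...
                /* 1 == idle, >1 == in use by dax iomap */
                refcount_t refcnt;
        };

        /* range setup: take the base reference */
        refcount_set(&range->refcnt, 1);

        /* fuse_fill_iomap(): pin across the iomap operation */
        refcount_inc(&dmap->refcnt);
        iomap->private = dmap;

        /* fuse_iomap_end(): unpin; refcount_dec() has release ordering */
        refcount_dec(&dmap->refcnt);

        /* reclaim: anything above the base reference means in use */
        if (refcount_read(&dmap->refcnt) > 1)
                return 0;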

Pushed my changes here for now. 

https://github.com/rhvgoyal/linux/commits/virtio-fs-dev-5.1

Please have a look and test. If everything looks good, I will squash these
new patches into existing patches.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Virtio-fs] [PATCH RFC] virtiofs: use fine-grained lock for dmap reclaim
  2019-07-16 18:36     ` Vivek Goyal
@ 2019-07-16 19:07       ` Liu Bo
  2019-07-16 19:15         ` Vivek Goyal
  0 siblings, 1 reply; 12+ messages in thread
From: Liu Bo @ 2019-07-16 19:07 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs

Hi Vivek,

On Tue, Jul 16, 2019 at 02:36:35PM -0400, Vivek Goyal wrote:
> On Mon, Jul 15, 2019 at 02:38:06PM -0700, Liu Bo wrote:
> > Yep, I think this is possible.
> > 
> > For this specific reordering, barriers like smp_mb__{before,after}_atomic()
> > could fix it.
> 
> Hi Liu Bo,
> 
> I have modified the patch to use refcount_t. Our use case is a little
> different from typical reference counting, so I hope there are no bugs
> there.
>
> I have also fixed a bunch of other issues and enabled inline range
> reclaim for the read path. (That became possible with the dmap refcount
> patch.)
>
> Pushed my changes here for now. 
> 
> https://github.com/rhvgoyal/linux/commits/virtio-fs-dev-5.1
> 
> Please have a look and test. If everything looks good, I will squash these
> new patches into existing patches.

That's nice, I'll check it out.

Can you please send your patches out to this mailing list?

(In fact, it's difficult for us to track patches/changes on github,
especially when they're squashed, since we then have to rebase our
internal virtiofs base by going over the code line by line.)

thanks,
-liubo


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Virtio-fs] [PATCH RFC] virtiofs: use fine-grained lock for dmap reclaim
  2019-07-16 19:07       ` Liu Bo
@ 2019-07-16 19:15         ` Vivek Goyal
  0 siblings, 0 replies; 12+ messages in thread
From: Vivek Goyal @ 2019-07-16 19:15 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Tue, Jul 16, 2019 at 12:07:38PM -0700, Liu Bo wrote:
> Hi Vivek,
> 
> That's nice, I'll check it out.
> 
> Can you please send your patches out to this mailing list?

Ok, I will. 

> 
> (In fact, it's difficult for us to track patches/changes on github,
> especially when they're squashed, since we then have to rebase our
> internal virtiofs base by going over the code line by line.)

Ideally there should not be many patches/changes outside of the
development tree.

But I have to squash patches because none of this is upstream. And if
I post too many patches upstream, nobody is going to look at the patch
series.

Vivek


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2019-07-16 19:15 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-04  7:25 [Virtio-fs] [PATCH RFC] virtiofs: use fine-grained lock for dmap reclaim Liu Bo
2019-07-08 20:43 ` Vivek Goyal
2019-07-10 18:41   ` Liu Bo
2019-07-10 20:37     ` Vivek Goyal
2019-07-11  8:49       ` Dr. David Alan Gilbert
2019-07-10 13:38 ` Vivek Goyal
2019-07-10 18:59   ` Liu Bo
2019-07-15 20:37 ` Vivek Goyal
2019-07-15 21:38   ` Liu Bo
2019-07-16 18:36     ` Vivek Goyal
2019-07-16 19:07       ` Liu Bo
2019-07-16 19:15         ` Vivek Goyal
