All of lore.kernel.org
 help / color / mirror / Atom feed
* [Virtio-fs] [PATCH 0/2] guest hang fixes
@ 2019-05-07  5:58 Liu Bo
  2019-05-07  5:58 ` [Virtio-fs] [PATCH 1/2] virtio-fs: send setupmapping with write flag only if IOMAP_WRITE Liu Bo
  2019-05-07  5:58 ` [Virtio-fs] [PATCH 2/2] virtiofs: add dmap flags to differentiate write mapping from read mapping Liu Bo
  0 siblings, 2 replies; 13+ messages in thread
From: Liu Bo @ 2019-05-07  5:58 UTC (permalink / raw)
  To: virtio-fs

This is trying to fix guest hang issue due to full host fs, more
details are available at patch 2.

Liu Bo (2):
  virtio-fs: send setupmapping with write flag only if IOMAP_WRITE
  virtiofs: add dmap flags to differentiate write mapping from read
    mapping

 fs/fuse/file.c   | 83 ++++++++++++++++++++++++++++++++++++++++++++++----------
 fs/fuse/fuse_i.h |  3 ++
 2 files changed, 72 insertions(+), 14 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Virtio-fs] [PATCH 1/2] virtio-fs: send setupmapping with write flag only if IOMAP_WRITE
  2019-05-07  5:58 [Virtio-fs] [PATCH 0/2] guest hang fixes Liu Bo
@ 2019-05-07  5:58 ` Liu Bo
  2019-05-07  8:06   ` Tao Peng
  2019-05-07 18:29   ` Vivek Goyal
  2019-05-07  5:58 ` [Virtio-fs] [PATCH 2/2] virtiofs: add dmap flags to differentiate write mapping from read mapping Liu Bo
  1 sibling, 2 replies; 13+ messages in thread
From: Liu Bo @ 2019-05-07  5:58 UTC (permalink / raw)
  To: virtio-fs

This is to let virtiofs daemon to do inline fallocate to preallocate
space for this mapping.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
---
 fs/fuse/file.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 488e932f..6f403c8 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -251,8 +251,8 @@ static void free_dax_mapping(struct fuse_conn *fc,
 
 /* offset passed in should be aligned to FUSE_DAX_MEM_RANGE_SZ */
 static int fuse_setup_one_mapping(struct inode *inode,
-				struct file *file, loff_t offset,
-				struct fuse_dax_mapping *dmap)
+				  struct file *file, loff_t offset, unsigned flags,
+				  struct fuse_dax_mapping *dmap)
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_inode *fi = get_fuse_inode(inode);
@@ -276,14 +276,21 @@ static int fuse_setup_one_mapping(struct inode *inode,
 		inarg.fh = -1;
 	inarg.moffset = dmap->window_offset;
 	inarg.len = FUSE_DAX_MEM_RANGE_SZ;
+
+       /*
+	* with IOMAP_WRITE, virtiofs daemon will make sure space is
+	* allocated for the range.
+	*/
 	if (file) {
-		inarg.flags |= (file->f_mode & FMODE_WRITE) ?
-				FUSE_SETUPMAPPING_FLAG_WRITE : 0;
+		if (file->f_mode & FMODE_WRITE &&
+		    flags & IOMAP_WRITE)
+			inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
 		inarg.flags |= (file->f_mode & FMODE_READ) ?
 				FUSE_SETUPMAPPING_FLAG_READ : 0;
 	} else {
 		inarg.flags |= FUSE_SETUPMAPPING_FLAG_READ;
-		inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
+		if (flags & IOMAP_WRITE)
+			inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
 	}
 	args.in.h.opcode = FUSE_SETUPMAPPING;
 	args.in.h.nodeid = fi->nodeid;
@@ -1847,8 +1854,8 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
 
 		/* Setup one mapping */
 		ret = fuse_setup_one_mapping(inode, NULL,
-				ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
-				alloc_dmap);
+					ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
+					     flags, alloc_dmap);
 		if (ret < 0) {
 			printk("fuse_setup_one_mapping() failed. err=%d"
 				" pos=0x%llx\n", ret, pos);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [Virtio-fs] [PATCH 2/2] virtiofs: add dmap flags to differentiate write mapping from read mapping
  2019-05-07  5:58 [Virtio-fs] [PATCH 0/2] guest hang fixes Liu Bo
  2019-05-07  5:58 ` [Virtio-fs] [PATCH 1/2] virtio-fs: send setupmapping with write flag only if IOMAP_WRITE Liu Bo
@ 2019-05-07  5:58 ` Liu Bo
  2019-05-07  8:02   ` Tao Peng
                     ` (2 more replies)
  1 sibling, 3 replies; 13+ messages in thread
From: Liu Bo @ 2019-05-07  5:58 UTC (permalink / raw)
  To: virtio-fs

There're 2 problems in dax rw path which were found by [1][2],

a) setupmapping always sends a RW mapping message to virtiofs daemon side
no matter it's doing reads or writes, the end result is that guest reads
on a mapping will cause a write page fault on host, which is unnecessary.

b) After a successful setupmapping, it doesn't guarantee the following
writes can land on host, e.g. page fault on host may fail because of
ENOSPC.

This is trying to solve the problems by
a) marking a dmap as RO or RW to indicate it's being used for reads or
   writes,
b) setting up a RO dmap for reads
c) setting up a RW dmap for writes
d) using an existing dmap for reads if it's availalbe in the interval tree
e) converting an existing dmap from RO to RW for writes if it's available
   in the interval tree

The downside of the above approach is write amplification, i.e. even if a
4k write is only made, setupmapping [0, 2M) will do a fallocate [0, 2M)
against the mmap'd file on host fs (fallocate will use FALLOC_FL_KEEP_SIZE
though), this is because some following writes landing on [4k, 2m) are
possible and having no space for [4k, 2m) can end up with guest hanging
forever, which is something we can't afford in real world.

w/o:
- reproducer[1] in guest will hang on dax_copy_{from,to}_iter because
  of write page fault on host fs getting ENOSPC,
- reproducer[2] in guest will hang on dax_copy_{from,to}_iter because
  of write page fault on host fs getting ENOSPC,

w/:
- reproducer[1] can read all 100G as they're just page cache on host fs.
- reproducer[2] can get an early ENOSPC error.

[1]:
mount -odax,tag=myvirt test /mnt/virtiofs
truncate -s 100G /mnt/virtiofs/foobar
od -x /mnt/virtiofs/foobar

[2]
mount -odax,tag=myvirt test /mnt/virtiofs
truncate -s 100G /mnt/virtiofs/foobar
xfs_io -c "pwrite 0 100G" /mnt/virtiofs/foobar

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
---
 fs/fuse/file.c   | 64 +++++++++++++++++++++++++++++++++++++++++++++++++-------
 fs/fuse/fuse_i.h |  3 +++
 2 files changed, 59 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6f403c8..7362aab 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -250,9 +250,9 @@ static void free_dax_mapping(struct fuse_conn *fc,
 }
 
 /* offset passed in should be aligned to FUSE_DAX_MEM_RANGE_SZ */
-static int fuse_setup_one_mapping(struct inode *inode,
+static int __fuse_setup_one_mapping(struct inode *inode,
 				  struct file *file, loff_t offset, unsigned flags,
-				  struct fuse_dax_mapping *dmap)
+				    struct fuse_dax_mapping *dmap, bool mkwrite)
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_inode *fi = get_fuse_inode(inode);
@@ -299,12 +299,19 @@ static int fuse_setup_one_mapping(struct inode *inode,
 	args.in.args[0].value = &inarg;
 	err = fuse_simple_request(fc, &args);
 	if (err < 0) {
-		printk(KERN_ERR "%s request failed at mem_offset=0x%llx %zd\n",
-				 __func__, dmap->window_offset, err);
+		printk(KERN_ERR "%s request failed at mem_offset=0x%llx %zd mkwrite=%d\n",
+		       __func__, dmap->window_offset, err, mkwrite);
 		return err;
 	}
 
-	pr_debug("fuse_setup_one_mapping() succeeded. offset=0x%llx err=%zd\n", offset, err);
+	pr_debug("%s succeeded. offset=0x%llx err=%zd mkwrite=%d\n",
+		 __func__, offset, err, mkwrite);
+
+	/* We only want to flip an existing dmap's flags. */
+	if (mkwrite) {
+		dmap->flags = IOMAP_WRITE;
+		return 0;
+	}
 
 	/*
 	 * We don't take a refernce on inode. inode is valid right now and
@@ -317,6 +324,8 @@ static int fuse_setup_one_mapping(struct inode *inode,
 	dmap->inode = inode;
 	dmap->start = offset;
 	dmap->end = offset + FUSE_DAX_MEM_RANGE_SZ - 1;
+	dmap->flags = (flags & IOMAP_WRITE) ? IOMAP_WRITE : 0;
+
 	/* Protected by fi->i_dmap_sem */
 	fuse_dax_interval_tree_insert(dmap, &fi->dmap_tree);
 	fi->nr_dmaps++;
@@ -327,6 +336,20 @@ static int fuse_setup_one_mapping(struct inode *inode,
 	return 0;
 }
 
+static int fuse_setup_one_mapping(struct inode *inode, struct file *file,
+				  loff_t offset, unsigned flags,
+				  struct fuse_dax_mapping *dmap)
+{
+	return __fuse_setup_one_mapping(inode, file, offset, flags, dmap, false);
+}
+
+static int fuse_mkwrite_one_mapping(struct inode *inode, struct file *file,
+				    loff_t offset, unsigned flags,
+				    struct fuse_dax_mapping *dmap)
+{
+	return __fuse_setup_one_mapping(inode, file, offset, flags, dmap, true);
+}
+
 static int fuse_removemapping_one(struct inode *inode,
 					struct fuse_dax_mapping *dmap)
 {
@@ -1765,6 +1788,16 @@ static void fuse_fill_iomap(struct inode *inode, loff_t pos, loff_t length,
 	}
 }
 
+static bool dmap_match(struct fuse_dax_mapping *dmap, unsigned flags)
+{
+	if (!(flags & IOMAP_WRITE))
+		return true;
+
+	if (dmap->flags & IOMAP_WRITE)
+		return true;
+	return false;
+}
+
 /* This is just for DAX and the mapping is ephemeral, do not use it for other
  * purposes since there is no block device with a permanent mapping.
  */
@@ -1804,7 +1837,7 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
 	down_read(&fi->i_dmap_sem);
 	dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos, pos);
 
-	if (dmap) {
+	if (dmap && dmap_match(dmap, flags)) {
 		fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
 		up_read(&fi->i_dmap_sem);
 		return 0;
@@ -1846,12 +1879,27 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
 		dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos,
 							pos);
 		if (dmap) {
-			fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
+			if (dmap_match(dmap, flags)) {
+				ret = 0;
+			} else {
+				/* switch to IOMAP_WRITE */
+				ret = fuse_mkwrite_one_mapping(inode, NULL,
+					ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
+							       flags, dmap);
+				if (ret < 0) {
+					pr_err("fuse_mkwrite_one_mapping failed err=%d, pos=0x%llx\n",
+					       ret, pos);
+				}
+			}
+			if (!ret)
+				fuse_fill_iomap(inode, pos, length, iomap,
+						dmap, flags);
 			free_dax_mapping(fc, alloc_dmap);
 			up_write(&fi->i_dmap_sem);
-			return 0;
+			return ret;
 		}
 
+		/* !dmap case */
 		/* Setup one mapping */
 		ret = fuse_setup_one_mapping(inode, NULL,
 					ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index c492426..346689d 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -138,6 +138,9 @@ struct fuse_dax_mapping {
 
        /** Length of mapping, in bytes */
        loff_t length;
+
+	/* indicate if the mapping is set up for write purpose */
+	unsigned flags;
 };
 
 /** FUSE inode */
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [Virtio-fs] [PATCH 2/2] virtiofs: add dmap flags to differentiate write mapping from read mapping
  2019-05-07  5:58 ` [Virtio-fs] [PATCH 2/2] virtiofs: add dmap flags to differentiate write mapping from read mapping Liu Bo
@ 2019-05-07  8:02   ` Tao Peng
  2019-05-08 13:54   ` Vivek Goyal
  2019-05-08 15:27   ` Vivek Goyal
  2 siblings, 0 replies; 13+ messages in thread
From: Tao Peng @ 2019-05-07  8:02 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Tue, May 7, 2019 at 1:58 PM Liu Bo <bo.liu@linux.alibaba.com> wrote:
>
> There're 2 problems in dax rw path which were found by [1][2],
>
> a) setupmapping always sends a RW mapping message to virtiofs daemon side
> no matter it's doing reads or writes, the end result is that guest reads
> on a mapping will cause a write page fault on host, which is unnecessary.
>
> b) After a successful setupmapping, it doesn't guarantee the following
> writes can land on host, e.g. page fault on host may fail because of
> ENOSPC.
>
> This is trying to solve the problems by
> a) marking a dmap as RO or RW to indicate it's being used for reads or
>    writes,
> b) setting up a RO dmap for reads
> c) setting up a RW dmap for writes
> d) using an existing dmap for reads if it's availalbe in the interval tree
> e) converting an existing dmap from RO to RW for writes if it's available
>    in the interval tree
>
> The downside of the above approach is write amplification, i.e. even if a
> 4k write is only made, setupmapping [0, 2M) will do a fallocate [0, 2M)
> against the mmap'd file on host fs (fallocate will use FALLOC_FL_KEEP_SIZE
> though), this is because some following writes landing on [4k, 2m) are
> possible and having no space for [4k, 2m) can end up with guest hanging
> forever, which is something we can't afford in real world.
>
> w/o:
> - reproducer[1] in guest will hang on dax_copy_{from,to}_iter because
>   of write page fault on host fs getting ENOSPC,
> - reproducer[2] in guest will hang on dax_copy_{from,to}_iter because
>   of write page fault on host fs getting ENOSPC,
>
> w/:
> - reproducer[1] can read all 100G as they're just page cache on host fs.
> - reproducer[2] can get an early ENOSPC error.
>
> [1]:
> mount -odax,tag=myvirt test /mnt/virtiofs
> truncate -s 100G /mnt/virtiofs/foobar
> od -x /mnt/virtiofs/foobar
>
> [2]
> mount -odax,tag=myvirt test /mnt/virtiofs
> truncate -s 100G /mnt/virtiofs/foobar
> xfs_io -c "pwrite 0 100G" /mnt/virtiofs/foobar
>
> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> ---
>  fs/fuse/file.c   | 64 +++++++++++++++++++++++++++++++++++++++++++++++++-------
>  fs/fuse/fuse_i.h |  3 +++
>  2 files changed, 59 insertions(+), 8 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 6f403c8..7362aab 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -250,9 +250,9 @@ static void free_dax_mapping(struct fuse_conn *fc,
>  }
>
>  /* offset passed in should be aligned to FUSE_DAX_MEM_RANGE_SZ */
> -static int fuse_setup_one_mapping(struct inode *inode,
> +static int __fuse_setup_one_mapping(struct inode *inode,
>                                   struct file *file, loff_t offset, unsigned flags,
> -                                 struct fuse_dax_mapping *dmap)
> +                                   struct fuse_dax_mapping *dmap, bool mkwrite)
>  {
>         struct fuse_conn *fc = get_fuse_conn(inode);
>         struct fuse_inode *fi = get_fuse_inode(inode);
> @@ -299,12 +299,19 @@ static int fuse_setup_one_mapping(struct inode *inode,
>         args.in.args[0].value = &inarg;
>         err = fuse_simple_request(fc, &args);
>         if (err < 0) {
> -               printk(KERN_ERR "%s request failed at mem_offset=0x%llx %zd\n",
> -                                __func__, dmap->window_offset, err);
> +               printk(KERN_ERR "%s request failed at mem_offset=0x%llx %zd mkwrite=%d\n",
> +                      __func__, dmap->window_offset, err, mkwrite);
>                 return err;
>         }
>
> -       pr_debug("fuse_setup_one_mapping() succeeded. offset=0x%llx err=%zd\n", offset, err);
> +       pr_debug("%s succeeded. offset=0x%llx err=%zd mkwrite=%d\n",
> +                __func__, offset, err, mkwrite);
> +
> +       /* We only want to flip an existing dmap's flags. */
> +       if (mkwrite) {
> +               dmap->flags = IOMAP_WRITE;
> +               return 0;
> +       }
>
>         /*
>          * We don't take a refernce on inode. inode is valid right now and
> @@ -317,6 +324,8 @@ static int fuse_setup_one_mapping(struct inode *inode,
>         dmap->inode = inode;
>         dmap->start = offset;
>         dmap->end = offset + FUSE_DAX_MEM_RANGE_SZ - 1;
> +       dmap->flags = (flags & IOMAP_WRITE) ? IOMAP_WRITE : 0;
> +
>         /* Protected by fi->i_dmap_sem */
>         fuse_dax_interval_tree_insert(dmap, &fi->dmap_tree);
>         fi->nr_dmaps++;
> @@ -327,6 +336,20 @@ static int fuse_setup_one_mapping(struct inode *inode,
>         return 0;
>  }
>
> +static int fuse_setup_one_mapping(struct inode *inode, struct file *file,
> +                                 loff_t offset, unsigned flags,
> +                                 struct fuse_dax_mapping *dmap)
> +{
> +       return __fuse_setup_one_mapping(inode, file, offset, flags, dmap, false);
> +}
> +
> +static int fuse_mkwrite_one_mapping(struct inode *inode, struct file *file,
> +                                   loff_t offset, unsigned flags,
> +                                   struct fuse_dax_mapping *dmap)
> +{
> +       return __fuse_setup_one_mapping(inode, file, offset, flags, dmap, true);
> +}
> +
>  static int fuse_removemapping_one(struct inode *inode,
>                                         struct fuse_dax_mapping *dmap)
>  {
> @@ -1765,6 +1788,16 @@ static void fuse_fill_iomap(struct inode *inode, loff_t pos, loff_t length,
>         }
>  }
>
> +static bool dmap_match(struct fuse_dax_mapping *dmap, unsigned flags)
> +{
> +       if (!(flags & IOMAP_WRITE))
> +               return true;
> +
> +       if (dmap->flags & IOMAP_WRITE)
> +               return true;
> +       return false;
> +}
> +
>  /* This is just for DAX and the mapping is ephemeral, do not use it for other
>   * purposes since there is no block device with a permanent mapping.
>   */
> @@ -1804,7 +1837,7 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
>         down_read(&fi->i_dmap_sem);
>         dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos, pos);
>
> -       if (dmap) {
> +       if (dmap && dmap_match(dmap, flags)) {
>                 fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
>                 up_read(&fi->i_dmap_sem);
>                 return 0;

For unmatched dmap, down_write dmap sem and try to lookup again to
avoid dmap allocation/direct reclaim? Downside is that we might have
to drop it again if someone races in and frees the unmatched dmap,
then we have to drop the dmap sem for dmap direct reclaim/allocation.
Since the race is much less likely compared with write after read
scenario, it worth the trouble imo.

> @@ -1846,12 +1879,27 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
>                 dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos,
>                                                         pos);
>                 if (dmap) {
> -                       fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
> +                       if (dmap_match(dmap, flags)) {
> +                               ret = 0;
> +                       } else {
> +                               /* switch to IOMAP_WRITE */
> +                               ret = fuse_mkwrite_one_mapping(inode, NULL,
> +                                       ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
> +                                                              flags, dmap);
> +                               if (ret < 0) {
> +                                       pr_err("fuse_mkwrite_one_mapping failed err=%d, pos=0x%llx\n",
> +                                              ret, pos);
> +                               }
> +                       }
> +                       if (!ret)
> +                               fuse_fill_iomap(inode, pos, length, iomap,
> +                                               dmap, flags);
>                         free_dax_mapping(fc, alloc_dmap);
>                         up_write(&fi->i_dmap_sem);
> -                       return 0;
> +                       return ret;
>                 }
>
> +               /* !dmap case */
>                 /* Setup one mapping */
>                 ret = fuse_setup_one_mapping(inode, NULL,
>                                         ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index c492426..346689d 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -138,6 +138,9 @@ struct fuse_dax_mapping {
>
>         /** Length of mapping, in bytes */
>         loff_t length;
> +
> +       /* indicate if the mapping is set up for write purpose */
> +       unsigned flags;
>  };
>
>  /** FUSE inode */
> --
> 1.8.3.1
>
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs



-- 
bergwolf@hyper.sh


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Virtio-fs] [PATCH 1/2] virtio-fs: send setupmapping with write flag only if IOMAP_WRITE
  2019-05-07  5:58 ` [Virtio-fs] [PATCH 1/2] virtio-fs: send setupmapping with write flag only if IOMAP_WRITE Liu Bo
@ 2019-05-07  8:06   ` Tao Peng
  2019-05-07 18:29   ` Vivek Goyal
  1 sibling, 0 replies; 13+ messages in thread
From: Tao Peng @ 2019-05-07  8:06 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Tue, May 7, 2019 at 1:58 PM Liu Bo <bo.liu@linux.alibaba.com> wrote:
>
> This is to let virtiofs daemon to do inline fallocate to preallocate
> space for this mapping.
>
> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Peng Tao <bergwolf@hyper.sh>

> ---
>  fs/fuse/file.c | 21 ++++++++++++++-------
>  1 file changed, 14 insertions(+), 7 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 488e932f..6f403c8 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -251,8 +251,8 @@ static void free_dax_mapping(struct fuse_conn *fc,
>
>  /* offset passed in should be aligned to FUSE_DAX_MEM_RANGE_SZ */
>  static int fuse_setup_one_mapping(struct inode *inode,
> -                               struct file *file, loff_t offset,
> -                               struct fuse_dax_mapping *dmap)
> +                                 struct file *file, loff_t offset, unsigned flags,
> +                                 struct fuse_dax_mapping *dmap)
>  {
>         struct fuse_conn *fc = get_fuse_conn(inode);
>         struct fuse_inode *fi = get_fuse_inode(inode);
> @@ -276,14 +276,21 @@ static int fuse_setup_one_mapping(struct inode *inode,
>                 inarg.fh = -1;
>         inarg.moffset = dmap->window_offset;
>         inarg.len = FUSE_DAX_MEM_RANGE_SZ;
> +
> +       /*
> +       * with IOMAP_WRITE, virtiofs daemon will make sure space is
> +       * allocated for the range.
> +       */
>         if (file) {
> -               inarg.flags |= (file->f_mode & FMODE_WRITE) ?
> -                               FUSE_SETUPMAPPING_FLAG_WRITE : 0;
> +               if (file->f_mode & FMODE_WRITE &&
> +                   flags & IOMAP_WRITE)
> +                       inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
>                 inarg.flags |= (file->f_mode & FMODE_READ) ?
>                                 FUSE_SETUPMAPPING_FLAG_READ : 0;
>         } else {
>                 inarg.flags |= FUSE_SETUPMAPPING_FLAG_READ;
> -               inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
> +               if (flags & IOMAP_WRITE)
> +                       inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
>         }
>         args.in.h.opcode = FUSE_SETUPMAPPING;
>         args.in.h.nodeid = fi->nodeid;
> @@ -1847,8 +1854,8 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
>
>                 /* Setup one mapping */
>                 ret = fuse_setup_one_mapping(inode, NULL,
> -                               ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
> -                               alloc_dmap);
> +                                       ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
> +                                            flags, alloc_dmap);
>                 if (ret < 0) {
>                         printk("fuse_setup_one_mapping() failed. err=%d"
>                                 " pos=0x%llx\n", ret, pos);
> --
> 1.8.3.1
>
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs



-- 
bergwolf@hyper.sh


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Virtio-fs] [PATCH 1/2] virtio-fs: send setupmapping with write flag only if IOMAP_WRITE
  2019-05-07  5:58 ` [Virtio-fs] [PATCH 1/2] virtio-fs: send setupmapping with write flag only if IOMAP_WRITE Liu Bo
  2019-05-07  8:06   ` Tao Peng
@ 2019-05-07 18:29   ` Vivek Goyal
  2019-05-07 18:41     ` Liu Bo
  1 sibling, 1 reply; 13+ messages in thread
From: Vivek Goyal @ 2019-05-07 18:29 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Tue, May 07, 2019 at 01:58:10PM +0800, Liu Bo wrote:
> This is to let virtiofs daemon to do inline fallocate to preallocate
> space for this mapping.
> 
> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> ---
>  fs/fuse/file.c | 21 ++++++++++++++-------
>  1 file changed, 14 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 488e932f..6f403c8 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -251,8 +251,8 @@ static void free_dax_mapping(struct fuse_conn *fc,
>  
>  /* offset passed in should be aligned to FUSE_DAX_MEM_RANGE_SZ */
>  static int fuse_setup_one_mapping(struct inode *inode,
> -				struct file *file, loff_t offset,
> -				struct fuse_dax_mapping *dmap)
> +				  struct file *file, loff_t offset, unsigned flags,
> +				  struct fuse_dax_mapping *dmap)
>  {
>  	struct fuse_conn *fc = get_fuse_conn(inode);
>  	struct fuse_inode *fi = get_fuse_inode(inode);
> @@ -276,14 +276,21 @@ static int fuse_setup_one_mapping(struct inode *inode,
>  		inarg.fh = -1;
>  	inarg.moffset = dmap->window_offset;
>  	inarg.len = FUSE_DAX_MEM_RANGE_SZ;
> +
> +       /*
> +	* with IOMAP_WRITE, virtiofs daemon will make sure space is
> +	* allocated for the range.
> +	*/
>  	if (file) {
> -		inarg.flags |= (file->f_mode & FMODE_WRITE) ?
> -				FUSE_SETUPMAPPING_FLAG_WRITE : 0;
> +		if (file->f_mode & FMODE_WRITE &&
> +		    flags & IOMAP_WRITE)
> +			inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
>  		inarg.flags |= (file->f_mode & FMODE_READ) ?
>  				FUSE_SETUPMAPPING_FLAG_READ : 0;
>  	} else {
>  		inarg.flags |= FUSE_SETUPMAPPING_FLAG_READ;
> -		inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
> +		if (flags & IOMAP_WRITE)
> +			inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;

Not sure we can afford to map a mapping read only. We don't have a
mechanism yet to upgrade a mapping. IOW, what if one process opens
file read only and does some read. This will lead to a mapping setup
with FUSE_SETUPMAPPING_FLAG_READ which will be cached. Now another process
might open file read/write and try to write to same mapping without
having to go through setupmapping path again and will run into this
-ENOSPC again.

Vivek

>  	}
>  	args.in.h.opcode = FUSE_SETUPMAPPING;
>  	args.in.h.nodeid = fi->nodeid;
> @@ -1847,8 +1854,8 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
>  
>  		/* Setup one mapping */
>  		ret = fuse_setup_one_mapping(inode, NULL,
> -				ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
> -				alloc_dmap);
> +					ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
> +					     flags, alloc_dmap);
>  		if (ret < 0) {
>  			printk("fuse_setup_one_mapping() failed. err=%d"
>  				" pos=0x%llx\n", ret, pos);
> -- 
> 1.8.3.1
> 
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Virtio-fs] [PATCH 1/2] virtio-fs: send setupmapping with write flag only if IOMAP_WRITE
  2019-05-07 18:29   ` Vivek Goyal
@ 2019-05-07 18:41     ` Liu Bo
  0 siblings, 0 replies; 13+ messages in thread
From: Liu Bo @ 2019-05-07 18:41 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs

On Tue, May 07, 2019 at 02:29:43PM -0400, Vivek Goyal wrote:
> On Tue, May 07, 2019 at 01:58:10PM +0800, Liu Bo wrote:
> > This is to let virtiofs daemon to do inline fallocate to preallocate
> > space for this mapping.
> > 
> > Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> > ---
> >  fs/fuse/file.c | 21 ++++++++++++++-------
> >  1 file changed, 14 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index 488e932f..6f403c8 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -251,8 +251,8 @@ static void free_dax_mapping(struct fuse_conn *fc,
> >  
> >  /* offset passed in should be aligned to FUSE_DAX_MEM_RANGE_SZ */
> >  static int fuse_setup_one_mapping(struct inode *inode,
> > -				struct file *file, loff_t offset,
> > -				struct fuse_dax_mapping *dmap)
> > +				  struct file *file, loff_t offset, unsigned flags,
> > +				  struct fuse_dax_mapping *dmap)
> >  {
> >  	struct fuse_conn *fc = get_fuse_conn(inode);
> >  	struct fuse_inode *fi = get_fuse_inode(inode);
> > @@ -276,14 +276,21 @@ static int fuse_setup_one_mapping(struct inode *inode,
> >  		inarg.fh = -1;
> >  	inarg.moffset = dmap->window_offset;
> >  	inarg.len = FUSE_DAX_MEM_RANGE_SZ;
> > +
> > +       /*
> > +	* with IOMAP_WRITE, virtiofs daemon will make sure space is
> > +	* allocated for the range.
> > +	*/
> >  	if (file) {
> > -		inarg.flags |= (file->f_mode & FMODE_WRITE) ?
> > -				FUSE_SETUPMAPPING_FLAG_WRITE : 0;
> > +		if (file->f_mode & FMODE_WRITE &&
> > +		    flags & IOMAP_WRITE)
> > +			inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
> >  		inarg.flags |= (file->f_mode & FMODE_READ) ?
> >  				FUSE_SETUPMAPPING_FLAG_READ : 0;
> >  	} else {
> >  		inarg.flags |= FUSE_SETUPMAPPING_FLAG_READ;
> > -		inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
> > +		if (flags & IOMAP_WRITE)
> > +			inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
> 
> Not sure we can afford to map a mapping read only. We don't have a
> mechanism yet to upgrade a mapping. IOW, what if one process opens
> file read only and does some read. This will lead to a mapping setup
> with FUSE_SETUPMAPPING_FLAG_READ which will be cached. Now another process
> might open file read/write and try to write to same mapping without
> having to go through setupmapping path again and will run into this
> -ENOSPC again.

You're right, patch 2 provides a "mkwrite" helper to map the mapping RW.

thanks,
-liubo

> 
> Vivek
> 
> >  	}
> >  	args.in.h.opcode = FUSE_SETUPMAPPING;
> >  	args.in.h.nodeid = fi->nodeid;
> > @@ -1847,8 +1854,8 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
> >  
> >  		/* Setup one mapping */
> >  		ret = fuse_setup_one_mapping(inode, NULL,
> > -				ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
> > -				alloc_dmap);
> > +					ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
> > +					     flags, alloc_dmap);
> >  		if (ret < 0) {
> >  			printk("fuse_setup_one_mapping() failed. err=%d"
> >  				" pos=0x%llx\n", ret, pos);
> > -- 
> > 1.8.3.1
> > 
> > _______________________________________________
> > Virtio-fs mailing list
> > Virtio-fs@redhat.com
> > https://www.redhat.com/mailman/listinfo/virtio-fs


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Virtio-fs] [PATCH 2/2] virtiofs: add dmap flags to differentiate write mapping from read mapping
  2019-05-07  5:58 ` [Virtio-fs] [PATCH 2/2] virtiofs: add dmap flags to differentiate write mapping from read mapping Liu Bo
  2019-05-07  8:02   ` Tao Peng
@ 2019-05-08 13:54   ` Vivek Goyal
  2019-05-08 15:27   ` Vivek Goyal
  2 siblings, 0 replies; 13+ messages in thread
From: Vivek Goyal @ 2019-05-08 13:54 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Tue, May 07, 2019 at 01:58:11PM +0800, Liu Bo wrote:
> There're 2 problems in dax rw path which were found by [1][2],
> 
> a) setupmapping always sends a RW mapping message to virtiofs daemon side
> no matter it's doing reads or writes, the end result is that guest reads
> on a mapping will cause a write page fault on host, which is unnecessary.
> 
> b) After a successful setupmapping, it doesn't guarantee the following
> writes can land on host, e.g. page fault on host may fail because of
> ENOSPC.
> 
> This is trying to solve the problems by
> a) marking a dmap as RO or RW to indicate it's being used for reads or
>    writes,
> b) setting up a RO dmap for reads
> c) setting up a RW dmap for writes
> d) using an existing dmap for reads if it's availalbe in the interval tree
> e) converting an existing dmap from RO to RW for writes if it's available
>    in the interval tree
> 
> The downside of the above approach is write amplification, i.e. even if a
> 4k write is only made, setupmapping [0, 2M) will do a fallocate [0, 2M)
> against the mmap'd file on host fs (fallocate will use FALLOC_FL_KEEP_SIZE
> though), this is because some following writes landing on [4k, 2m) are
> possible and having no space for [4k, 2m) can end up with guest hanging
> forever, which is something we can't afford in real world.

Write amplification is a real concern with fallocate() approach.
Especially somebody dealing with small files (say 4K files) will
suddenly find host filesystem using much more space than total file
sizes and that will be a huge surprise.

I feel using fallocate(2MB) is just a band aid solution for -ENOSPC
problem. There are other things which can go wrong. We need to find
a generic solution for all class of such errors and propagate these
errors back to guest and to user space.

Modifying async_pf code to handle error and report error back to guest
is one such attempt. Why not enhance that code?

Thanks
Vivek


> 
> w/o:
> - reproducer[1] in guest will hang on dax_copy_{from,to}_iter because
>   of write page fault on host fs getting ENOSPC,
> - reproducer[2] in guest will hang on dax_copy_{from,to}_iter because
>   of write page fault on host fs getting ENOSPC,
> 
> w/:
> - reproducer[1] can read all 100G as they're just page cache on host fs.
> - reproducer[2] can get an early ENOSPC error.
> 
> [1]:
> mount -odax,tag=myvirt test /mnt/virtiofs
> truncate -s 100G /mnt/virtiofs/foobar
> od -x /mnt/virtiofs/foobar
> 
> [2]
> mount -odax,tag=myvirt test /mnt/virtiofs
> truncate -s 100G /mnt/virtiofs/foobar
> xfs_io -c "pwrite 0 100G" /mnt/virtiofs/foobar
> 
> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> ---
>  fs/fuse/file.c   | 64 +++++++++++++++++++++++++++++++++++++++++++++++++-------
>  fs/fuse/fuse_i.h |  3 +++
>  2 files changed, 59 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 6f403c8..7362aab 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -250,9 +250,9 @@ static void free_dax_mapping(struct fuse_conn *fc,
>  }
>  
>  /* offset passed in should be aligned to FUSE_DAX_MEM_RANGE_SZ */
> -static int fuse_setup_one_mapping(struct inode *inode,
> +static int __fuse_setup_one_mapping(struct inode *inode,
>  				  struct file *file, loff_t offset, unsigned flags,
> -				  struct fuse_dax_mapping *dmap)
> +				    struct fuse_dax_mapping *dmap, bool mkwrite)
>  {
>  	struct fuse_conn *fc = get_fuse_conn(inode);
>  	struct fuse_inode *fi = get_fuse_inode(inode);
> @@ -299,12 +299,19 @@ static int fuse_setup_one_mapping(struct inode *inode,
>  	args.in.args[0].value = &inarg;
>  	err = fuse_simple_request(fc, &args);
>  	if (err < 0) {
> -		printk(KERN_ERR "%s request failed at mem_offset=0x%llx %zd\n",
> -				 __func__, dmap->window_offset, err);
> +		printk(KERN_ERR "%s request failed at mem_offset=0x%llx %zd mkwrite=%d\n",
> +		       __func__, dmap->window_offset, err, mkwrite);
>  		return err;
>  	}
>  
> -	pr_debug("fuse_setup_one_mapping() succeeded. offset=0x%llx err=%zd\n", offset, err);
> +	pr_debug("%s succeeded. offset=0x%llx err=%zd mkwrite=%d\n",
> +		 __func__, offset, err, mkwrite);
> +
> +	/* We only want to flip an existing dmap's flags. */
> +	if (mkwrite) {
> +		dmap->flags = IOMAP_WRITE;
> +		return 0;
> +	}
>  
>  	/*
>  	 * We don't take a refernce on inode. inode is valid right now and
> @@ -317,6 +324,8 @@ static int fuse_setup_one_mapping(struct inode *inode,
>  	dmap->inode = inode;
>  	dmap->start = offset;
>  	dmap->end = offset + FUSE_DAX_MEM_RANGE_SZ - 1;
> +	dmap->flags = (flags & IOMAP_WRITE) ? IOMAP_WRITE : 0;
> +
>  	/* Protected by fi->i_dmap_sem */
>  	fuse_dax_interval_tree_insert(dmap, &fi->dmap_tree);
>  	fi->nr_dmaps++;
> @@ -327,6 +336,20 @@ static int fuse_setup_one_mapping(struct inode *inode,
>  	return 0;
>  }
>  
> +static int fuse_setup_one_mapping(struct inode *inode, struct file *file,
> +				  loff_t offset, unsigned flags,
> +				  struct fuse_dax_mapping *dmap)
> +{
> +	return __fuse_setup_one_mapping(inode, file, offset, flags, dmap, false);
> +}
> +
> +static int fuse_mkwrite_one_mapping(struct inode *inode, struct file *file,
> +				    loff_t offset, unsigned flags,
> +				    struct fuse_dax_mapping *dmap)
> +{
> +	return __fuse_setup_one_mapping(inode, file, offset, flags, dmap, true);
> +}
> +
>  static int fuse_removemapping_one(struct inode *inode,
>  					struct fuse_dax_mapping *dmap)
>  {
> @@ -1765,6 +1788,16 @@ static void fuse_fill_iomap(struct inode *inode, loff_t pos, loff_t length,
>  	}
>  }
>  
> +static bool dmap_match(struct fuse_dax_mapping *dmap, unsigned flags)
> +{
> +	if (!(flags & IOMAP_WRITE))
> +		return true;
> +
> +	if (dmap->flags & IOMAP_WRITE)
> +		return true;
> +	return false;
> +}
> +
>  /* This is just for DAX and the mapping is ephemeral, do not use it for other
>   * purposes since there is no block device with a permanent mapping.
>   */
> @@ -1804,7 +1837,7 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
>  	down_read(&fi->i_dmap_sem);
>  	dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos, pos);
>  
> -	if (dmap) {
> +	if (dmap && dmap_match(dmap, flags)) {
>  		fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
>  		up_read(&fi->i_dmap_sem);
>  		return 0;
> @@ -1846,12 +1879,27 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
>  		dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos,
>  							pos);
>  		if (dmap) {
> -			fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
> +			if (dmap_match(dmap, flags)) {
> +				ret = 0;
> +			} else {
> +				/* switch to IOMAP_WRITE */
> +				ret = fuse_mkwrite_one_mapping(inode, NULL,
> +					ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
> +							       flags, dmap);
> +				if (ret < 0) {
> +					pr_err("fuse_mkwrite_one_mapping failed err=%d, pos=0x%llx\n",
> +					       ret, pos);
> +				}
> +			}
> +			if (!ret)
> +				fuse_fill_iomap(inode, pos, length, iomap,
> +						dmap, flags);
>  			free_dax_mapping(fc, alloc_dmap);
>  			up_write(&fi->i_dmap_sem);
> -			return 0;
> +			return ret;
>  		}
>  
> +		/* !dmap case */
>  		/* Setup one mapping */
>  		ret = fuse_setup_one_mapping(inode, NULL,
>  					ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index c492426..346689d 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -138,6 +138,9 @@ struct fuse_dax_mapping {
>  
>         /** Length of mapping, in bytes */
>         loff_t length;
> +
> +	/* indicate if the mapping is set up for write purpose */
> +	unsigned flags;
>  };
>  
>  /** FUSE inode */
> -- 
> 1.8.3.1
> 
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Virtio-fs] [PATCH 2/2] virtiofs: add dmap flags to differentiate write mapping from read mapping
  2019-05-07  5:58 ` [Virtio-fs] [PATCH 2/2] virtiofs: add dmap flags to differentiate write mapping from read mapping Liu Bo
  2019-05-07  8:02   ` Tao Peng
  2019-05-08 13:54   ` Vivek Goyal
@ 2019-05-08 15:27   ` Vivek Goyal
  2019-05-08 18:48     ` Liu Bo
  2 siblings, 1 reply; 13+ messages in thread
From: Vivek Goyal @ 2019-05-08 15:27 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Tue, May 07, 2019 at 01:58:11PM +0800, Liu Bo wrote:
> There're 2 problems in dax rw path which were found by [1][2],
> 
> a) setupmapping always sends a RW mapping message to virtiofs daemon side
> no matter it's doing reads or writes, the end result is that guest reads
> on a mapping will cause a write page fault on host, which is unnecessary.
> 
> b) After a successful setupmapping, it doesn't guarantee the following
> writes can land on host, e.g. page fault on host may fail because of
> ENOSPC.
> 
> This is trying to solve the problems by
> a) marking a dmap as RO or RW to indicate it's being used for reads or
>    writes,
> b) setting up a RO dmap for reads
> c) setting up a RW dmap for writes
> d) using an existing dmap for reads if it's availalbe in the interval tree
> e) converting an existing dmap from RO to RW for writes if it's available
>    in the interval tree
> 
> The downside of the above approach is write amplification.

Another downside of using fallocate() is performance overhead. I wrote
a problem to do small 16K file extending writes and measure total time
with and without fallocate. Fallocate version seems to be much slower.

With fallocate()
===============
# ./file-extending-writes /mnt/virtio-fs/foo.txt
16384 extending writes took 14 secs and 647210 us
16384 extending writes took 11 secs and 571941 us
16384 extending writes took 11 secs and 981329 us

With dax + direct write (FUSE_WRITE)
==============================
# ./file-extending-writes /mnt/virtio-fs/foo.txt
16384 extending writes took 2 secs and 477642 us
16384 extending writes took 2 secs and 352413 us
16384 extending writes took 2 secs and 347223 us

I ran your test script as well and that does not show much difference.
I think part of the problem is that every write launches new xfs_io
process. So overhead of launching process, opening fd again hides the
fallocate overhead.

Here is my C program.

===========================================================
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

void drop_caches(void)
{
	system("sync");
	system("echo 3 > /proc/sys/vm/drop_caches");
}

void run_test(int fd)
{
	int nr_writes = 16 * 1024;
	struct timeval tv_start, tv_end, tv_elapsed;
	char *buf = "Hello";
	int buf_len = strlen(buf);
	ssize_t written = 0;
	int i;

	drop_caches();

	/* Extend file in a loop */		
	gettimeofday(&tv_start, NULL);
	for (i = 0; i < nr_writes; i++) {
		written += write(fd, buf, buf_len);
		if (written == -1) {
			fprintf(stderr, "Failed to write %d bytes:%s,"
				" errorno=%d\n", strlen(buf), strerror(errno),
				errno);
			exit(1);
		}
	}
	gettimeofday(&tv_end, NULL);
	timersub(&tv_end, &tv_start, &tv_elapsed);	
	printf("%d extending writes took %d secs and %d us\n", nr_writes, tv_elapsed.tv_sec, tv_elapsed.tv_usec);
}

int main(int argc, char *argv[])
{
	int fd, flags = 0;
	mode_t mode = 0;
	int i, nr_runs = 3;

	if (argc != 2) {
		printf("Usage:%s <file-to-open>\n", argv[0]);
		exit(1);
	}

	fd = open(argv[1], O_RDWR | O_CREAT | O_APPEND | O_TRUNC);
	if (fd == -1) {
		fprintf(stderr, "Failed to open file %s:%s, errorno=%d\n",
			argv[0], strerror(errno), errno);
		exit(1);
	}

	for(i = 0; i < nr_runs; i++)
		run_test(fd);
	close(fd);
}

==========================================================================


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Virtio-fs] [PATCH 2/2] virtiofs: add dmap flags to differentiate write mapping from read mapping
  2019-05-08 15:27   ` Vivek Goyal
@ 2019-05-08 18:48     ` Liu Bo
  2019-05-09  2:01       ` Tao Peng
  0 siblings, 1 reply; 13+ messages in thread
From: Liu Bo @ 2019-05-08 18:48 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs

On Wed, May 08, 2019 at 11:27:04AM -0400, Vivek Goyal wrote:
> On Tue, May 07, 2019 at 01:58:11PM +0800, Liu Bo wrote:
> > There're 2 problems in dax rw path which were found by [1][2],
> > 
> > a) setupmapping always sends a RW mapping message to virtiofs daemon side
> > no matter it's doing reads or writes, the end result is that guest reads
> > on a mapping will cause a write page fault on host, which is unnecessary.
> > 
> > b) After a successful setupmapping, it doesn't guarantee the following
> > writes can land on host, e.g. page fault on host may fail because of
> > ENOSPC.
> > 
> > This is trying to solve the problems by
> > a) marking a dmap as RO or RW to indicate it's being used for reads or
> >    writes,
> > b) setting up a RO dmap for reads
> > c) setting up a RW dmap for writes
> > d) using an existing dmap for reads if it's availalbe in the interval tree
> > e) converting an existing dmap from RO to RW for writes if it's available
> >    in the interval tree
> > 
> > The downside of the above approach is write amplification.
> 
> Another downside of using fallocate() is performance overhead. I wrote
> a problem to do small 16K file extending writes and measure total time
> with and without fallocate. Fallocate version seems to be much slower.
> 
> With fallocate()
> ===============
> # ./file-extending-writes /mnt/virtio-fs/foo.txt
> 16384 extending writes took 14 secs and 647210 us
> 16384 extending writes took 11 secs and 571941 us
> 16384 extending writes took 11 secs and 981329 us
> 
> With dax + direct write (FUSE_WRITE)
> ==============================
> # ./file-extending-writes /mnt/virtio-fs/foo.txt
> 16384 extending writes took 2 secs and 477642 us
> 16384 extending writes took 2 secs and 352413 us
> 16384 extending writes took 2 secs and 347223 us
> 
> I ran your test script as well and that does not show much difference.
> I think part of the problem is that every write launches new xfs_io
> process. So overhead of launching process, opening fd again hides the
> fallocate overhead.

It's so weird, with the below C test, I couldn't reproduce the huge
difference of w/ or w/o fallocate.

=================================
mount -t virtio_fs atest /mnt/test/ -otag=myfs-2,rootmode=040000,user_id=0,group_id=0,dax

- w/

[root@debug010000002015 virtiofs]
$ ./a.out /mnt/test/foobar
16384 extending writes took 2 secs and 407494 us
16384 extending writes took 2 secs and 368892 us
16384 extending writes took 2 secs and 347208 us
[root@debug010000002015 virtiofs]
$ ./a.out /mnt/test/foobar
16384 extending writes took 2 secs and 306272 us
16384 extending writes took 2 secs and 364296 us
16384 extending writes took 2 secs and 354999 us
[root@debug010000002015 virtiofs]
$ ./a.out /mnt/test/foobar
16384 extending writes took 2 secs and 363470 us
16384 extending writes took 2 secs and 742043 us
16384 extending writes took 2 secs and 527137 us

- w/o (Revert "fuse, dax: Do not use dax for file extnding writes" + 
Revert "virtiofsd: do fallocate for setupmapping requests with write flag")

[root@debug010000002015 virtiofs]
$ ./a.out /mnt/test/foobar
16384 extending writes took 2 secs and 331097 us
16384 extending writes took 2 secs and 298868 us
16384 extending writes took 2 secs and 337858 us
[root@debug010000002015 virtiofs]
$ ./a.out /mnt/test/foobar
16384 extending writes took 2 secs and 319989 us
16384 extending writes took 2 secs and 519414 us
16384 extending writes took 2 secs and 436929 us
[root@debug010000002015 virtiofs]
$ ./a.out /mnt/test/foobar
16384 extending writes took 2 secs and 365331 us
16384 extending writes took 2 secs and 353973 us
16384 extending writes took 2 secs and 354729 us
=================================

thanks,
-liubo

> 
> Here is my C program.
> 
> ===========================================================
> #include <stdio.h>
> #include <stdlib.h>
> #include <errno.h>
> #include <string.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <sys/time.h>
> 
> void drop_caches(void)
> {
> 	system("sync");
> 	system("echo 3 > /proc/sys/vm/drop_caches");
> }
> 
> void run_test(int fd)
> {
> 	int nr_writes = 16 * 1024;
> 	struct timeval tv_start, tv_end, tv_elapsed;
> 	char *buf = "Hello";
> 	int buf_len = strlen(buf);
> 	ssize_t written = 0;
> 	int i;
> 
> 	drop_caches();
> 
> 	/* Extend file in a loop */		
> 	gettimeofday(&tv_start, NULL);
> 	for (i = 0; i < nr_writes; i++) {
> 		written += write(fd, buf, buf_len);
> 		if (written == -1) {
> 			fprintf(stderr, "Failed to write %d bytes:%s,"
> 				" errorno=%d\n", strlen(buf), strerror(errno),
> 				errno);
> 			exit(1);
> 		}
> 	}
> 	gettimeofday(&tv_end, NULL);
> 	timersub(&tv_end, &tv_start, &tv_elapsed);	
> 	printf("%d extending writes took %d secs and %d us\n", nr_writes, tv_elapsed.tv_sec, tv_elapsed.tv_usec);
> }
> 
> int main(int argc, char *argv[])
> {
> 	int fd, flags = 0;
> 	mode_t mode = 0;
> 	int i, nr_runs = 3;
> 
> 	if (argc != 2) {
> 		printf("Usage:%s <file-to-open>\n", argv[0]);
> 		exit(1);
> 	}
> 
> 	fd = open(argv[1], O_RDWR | O_CREAT | O_APPEND | O_TRUNC);
> 	if (fd == -1) {
> 		fprintf(stderr, "Failed to open file %s:%s, errorno=%d\n",
> 			argv[0], strerror(errno), errno);
> 		exit(1);
> 	}
> 
> 	for(i = 0; i < nr_runs; i++)
> 		run_test(fd);
> 	close(fd);
> }
> 
> ==========================================================================


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Virtio-fs] [PATCH 2/2] virtiofs: add dmap flags to differentiate write mapping from read mapping
  2019-05-08 18:48     ` Liu Bo
@ 2019-05-09  2:01       ` Tao Peng
  2019-05-09  2:32         ` Liu Bo
  0 siblings, 1 reply; 13+ messages in thread
From: Tao Peng @ 2019-05-09  2:01 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Thu, May 9, 2019 at 2:48 AM Liu Bo <bo.liu@linux.alibaba.com> wrote:
>
> On Wed, May 08, 2019 at 11:27:04AM -0400, Vivek Goyal wrote:
> > On Tue, May 07, 2019 at 01:58:11PM +0800, Liu Bo wrote:
> > > There're 2 problems in dax rw path which were found by [1][2],
> > >
> > > a) setupmapping always sends a RW mapping message to virtiofs daemon side
> > > no matter it's doing reads or writes, the end result is that guest reads
> > > on a mapping will cause a write page fault on host, which is unnecessary.
> > >
> > > b) After a successful setupmapping, it doesn't guarantee the following
> > > writes can land on host, e.g. page fault on host may fail because of
> > > ENOSPC.
> > >
> > > This is trying to solve the problems by
> > > a) marking a dmap as RO or RW to indicate it's being used for reads or
> > >    writes,
> > > b) setting up a RO dmap for reads
> > > c) setting up a RW dmap for writes
> > > d) using an existing dmap for reads if it's availalbe in the interval tree
> > > e) converting an existing dmap from RO to RW for writes if it's available
> > >    in the interval tree
> > >
> > > The downside of the above approach is write amplification.
> >
> > Another downside of using fallocate() is performance overhead. I wrote
> > a problem to do small 16K file extending writes and measure total time
> > with and without fallocate. Fallocate version seems to be much slower.
> >
> > With fallocate()
> > ===============
> > # ./file-extending-writes /mnt/virtio-fs/foo.txt
> > 16384 extending writes took 14 secs and 647210 us
> > 16384 extending writes took 11 secs and 571941 us
> > 16384 extending writes took 11 secs and 981329 us
> >
> > With dax + direct write (FUSE_WRITE)
> > ==============================
> > # ./file-extending-writes /mnt/virtio-fs/foo.txt
> > 16384 extending writes took 2 secs and 477642 us
> > 16384 extending writes took 2 secs and 352413 us
> > 16384 extending writes took 2 secs and 347223 us
> >
> > I ran your test script as well and that does not show much difference.
> > I think part of the problem is that every write launches new xfs_io
> > process. So overhead of launching process, opening fd again hides the
> > fallocate overhead.
>
> It's so weird, with the below C test, I couldn't reproduce the huge
> difference of w/ or w/o fallocate.
>
ext4 vs. xfs difference?

Cheers,
Tao
-- 
bergwolf@hyper.sh


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Virtio-fs] [PATCH 2/2] virtiofs: add dmap flags to differentiate write mapping from read mapping
  2019-05-09  2:01       ` Tao Peng
@ 2019-05-09  2:32         ` Liu Bo
  2019-05-09  3:44           ` Tao Peng
  0 siblings, 1 reply; 13+ messages in thread
From: Liu Bo @ 2019-05-09  2:32 UTC (permalink / raw)
  To: Tao Peng; +Cc: virtio-fs

On Thu, May 09, 2019 at 10:01:03AM +0800, Tao Peng wrote:
> On Thu, May 9, 2019 at 2:48 AM Liu Bo <bo.liu@linux.alibaba.com> wrote:
> >
> > On Wed, May 08, 2019 at 11:27:04AM -0400, Vivek Goyal wrote:
> > > On Tue, May 07, 2019 at 01:58:11PM +0800, Liu Bo wrote:
> > > > There're 2 problems in dax rw path which were found by [1][2],

> > > >
> > > > a) setupmapping always sends a RW mapping message to virtiofs daemon side
> > > > no matter it's doing reads or writes, the end result is that guest reads
> > > > on a mapping will cause a write page fault on host, which is unnecessary.
> > > >
> > > > b) After a successful setupmapping, it doesn't guarantee the following
> > > > writes can land on host, e.g. page fault on host may fail because of
> > > > ENOSPC.
> > > >
> > > > This is trying to solve the problems by
> > > > a) marking a dmap as RO or RW to indicate it's being used for reads or
> > > >    writes,
> > > > b) setting up a RO dmap for reads
> > > > c) setting up a RW dmap for writes
> > > > d) using an existing dmap for reads if it's availalbe in the interval tree
> > > > e) converting an existing dmap from RO to RW for writes if it's available
> > > >    in the interval tree
> > > >
> > > > The downside of the above approach is write amplification.
> > >
> > > Another downside of using fallocate() is performance overhead. I wrote
> > > a problem to do small 16K file extending writes and measure total time
> > > with and without fallocate. Fallocate version seems to be much slower.
> > >
> > > With fallocate()
> > > ===============
> > > # ./file-extending-writes /mnt/virtio-fs/foo.txt
> > > 16384 extending writes took 14 secs and 647210 us
> > > 16384 extending writes took 11 secs and 571941 us
> > > 16384 extending writes took 11 secs and 981329 us
> > >
> > > With dax + direct write (FUSE_WRITE)
> > > ==============================
> > > # ./file-extending-writes /mnt/virtio-fs/foo.txt
> > > 16384 extending writes took 2 secs and 477642 us
> > > 16384 extending writes took 2 secs and 352413 us
> > > 16384 extending writes took 2 secs and 347223 us
> > >
> > > I ran your test script as well and that does not show much difference.
> > > I think part of the problem is that every write launches new xfs_io
> > > process. So overhead of launching process, opening fd again hides the
> > > fallocate overhead.
> >
> > It's so weird, with the below C test, I couldn't reproduce the huge
> > difference of w/ or w/o fallocate.
> >
> ext4 vs. xfs difference?

Indeed there exists a gap, but on my box it's not as much as Vivek's,

=========================
using xfs as shared dir
- w/ fallocate
$ ./a.out /mnt/test/newtest
16384 extending writes took 3 secs and 908534 us
16384 extending writes took 3 secs and 988464 us
16384 extending writes took 3 secs and 908328 us
[root@debug010000002015 virtiofs]
$ ./a.out /mnt/test/newtest
16384 extending writes took 3 secs and 523781 us
16384 extending writes took 3 secs and 474909 us
16384 extending writes took 3 secs and 524108 us
[root@debug010000002015 virtiofs]
$ ./a.out /mnt/test/newtest
16384 extending writes took 3 secs and 846989 us
16384 extending writes took 3 secs and 888489 us
16384 extending writes took 3 secs and 885019 us

- w/o fallocate
[root@debug010000002015 virtiofs]
$ ./a.out /mnt/test/newtest 
16384 extending writes took 2 secs and 317730 us
16384 extending writes took 2 secs and 336573 us
16384 extending writes took 2 secs and 398379 us
[root@debug010000002015 virtiofs]
$ ./a.out /mnt/test/newtest
16384 extending writes took 2 secs and 184630 us
16384 extending writes took 2 secs and 117942 us
16384 extending writes took 2 secs and 168083 us
[root@debug010000002015 virtiofs]
$ ./a.out /mnt/test/newtest
16384 extending writes took 2 secs and 200756 us
16384 extending writes took 2 secs and 225786 us
16384 extending writes took 2 secs and 183911 us


thanks,
-liubo


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Virtio-fs] [PATCH 2/2] virtiofs: add dmap flags to differentiate write mapping from read mapping
  2019-05-09  2:32         ` Liu Bo
@ 2019-05-09  3:44           ` Tao Peng
  0 siblings, 0 replies; 13+ messages in thread
From: Tao Peng @ 2019-05-09  3:44 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Thu, May 9, 2019 at 10:32 AM Liu Bo <bo.liu@linux.alibaba.com> wrote:
>
> On Thu, May 09, 2019 at 10:01:03AM +0800, Tao Peng wrote:
> > On Thu, May 9, 2019 at 2:48 AM Liu Bo <bo.liu@linux.alibaba.com> wrote:
> > >
> > > On Wed, May 08, 2019 at 11:27:04AM -0400, Vivek Goyal wrote:
> > > > On Tue, May 07, 2019 at 01:58:11PM +0800, Liu Bo wrote:
> > > > > There're 2 problems in dax rw path which were found by [1][2],
>
> > > > >
> > > > > a) setupmapping always sends a RW mapping message to virtiofs daemon side
> > > > > no matter it's doing reads or writes, the end result is that guest reads
> > > > > on a mapping will cause a write page fault on host, which is unnecessary.
> > > > >
> > > > > b) After a successful setupmapping, it doesn't guarantee the following
> > > > > writes can land on host, e.g. page fault on host may fail because of
> > > > > ENOSPC.
> > > > >
> > > > > This is trying to solve the problems by
> > > > > a) marking a dmap as RO or RW to indicate it's being used for reads or
> > > > >    writes,
> > > > > b) setting up a RO dmap for reads
> > > > > c) setting up a RW dmap for writes
> > > > > d) using an existing dmap for reads if it's availalbe in the interval tree
> > > > > e) converting an existing dmap from RO to RW for writes if it's available
> > > > >    in the interval tree
> > > > >
> > > > > The downside of the above approach is write amplification.
> > > >
> > > > Another downside of using fallocate() is performance overhead. I wrote
> > > > a problem to do small 16K file extending writes and measure total time
> > > > with and without fallocate. Fallocate version seems to be much slower.
> > > >
> > > > With fallocate()
> > > > ===============
> > > > # ./file-extending-writes /mnt/virtio-fs/foo.txt
> > > > 16384 extending writes took 14 secs and 647210 us
> > > > 16384 extending writes took 11 secs and 571941 us
> > > > 16384 extending writes took 11 secs and 981329 us
> > > >
> > > > With dax + direct write (FUSE_WRITE)
> > > > ==============================
> > > > # ./file-extending-writes /mnt/virtio-fs/foo.txt
> > > > 16384 extending writes took 2 secs and 477642 us
> > > > 16384 extending writes took 2 secs and 352413 us
> > > > 16384 extending writes took 2 secs and 347223 us
> > > >
> > > > I ran your test script as well and that does not show much difference.
> > > > I think part of the problem is that every write launches new xfs_io
> > > > process. So overhead of launching process, opening fd again hides the
> > > > fallocate overhead.
> > >
> > > It's so weird, with the below C test, I couldn't reproduce the huge
> > > difference of w/ or w/o fallocate.
> > >
> > ext4 vs. xfs difference?
>
> Indeed there exists a gap, but on my box it's not as much as Vivek's,
>
fallocate requires to zero the newly allocated blocks. So underlying
block device speed can also make a difference.

-- 
bergwolf@hyper.sh


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-05-09  3:44 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-07  5:58 [Virtio-fs] [PATCH 0/2] guest hang fixes Liu Bo
2019-05-07  5:58 ` [Virtio-fs] [PATCH 1/2] virtio-fs: send setupmapping with write flag only if IOMAP_WRITE Liu Bo
2019-05-07  8:06   ` Tao Peng
2019-05-07 18:29   ` Vivek Goyal
2019-05-07 18:41     ` Liu Bo
2019-05-07  5:58 ` [Virtio-fs] [PATCH 2/2] virtiofs: add dmap flags to differentiate write mapping from read mapping Liu Bo
2019-05-07  8:02   ` Tao Peng
2019-05-08 13:54   ` Vivek Goyal
2019-05-08 15:27   ` Vivek Goyal
2019-05-08 18:48     ` Liu Bo
2019-05-09  2:01       ` Tao Peng
2019-05-09  2:32         ` Liu Bo
2019-05-09  3:44           ` Tao Peng

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.