* [Virtio-fs] [PATCH 0/4] virtiofsd fixes
From: Liu Bo @ 2019-04-16 19:08 UTC
  To: virtio-fs

Hi,

This set contains a few fixes for virtiofsd.

Patch 1 fixes a direct I/O read hang.

Patch 2 adds nanosecond resolution support for the file timestamps
returned by stat(2).

Patch 3 fixes a problem found by xfstests generic/413, which shows that
the vhost-user backend doesn't include all in-use memory mappings.

Patch 4 switches to fallocate(2) to support more fallocate modes.

thanks,
liubo

Eryu Guan (1):
  virtiofsd: send reply correctly on read failure

Jiufei Xue (1):
  virtiofsd: support nanosecond resolution for file timestamp

Xiaoguang Wang (2):
  virtiofsd: use file-backed memory region for virtiofsd's cache area
  virtiofsd: use fallocate(2) instead of posix_fallocate(3)

 configure                          | 16 +++++++++++
 contrib/virtiofsd/fuse_lowlevel.c  |  4 +--
 contrib/virtiofsd/fuse_misc.h      |  1 +
 contrib/virtiofsd/fuse_virtio.c    | 12 ++++++--
 contrib/virtiofsd/passthrough_ll.c | 11 ++-----
 hw/virtio/vhost-user-fs.c          | 59 +++++++++++++++++++++++++++++++-------
 6 files changed, 80 insertions(+), 23 deletions(-)

-- 
1.8.3.1



* [Virtio-fs] [PATCH 1/4] virtiofsd: send reply correctly on read failure
From: Liu Bo @ 2019-04-16 19:08 UTC
  To: virtio-fs

From: Eryu Guan <eguan@linux.alibaba.com>

Currently, when a lo_read() operation fails, we don't send the failure
back to the fuse client, and the read(2) operation in the guest kernel
hangs waiting for the reply.

This is easily triggered by a direct read with a non-aligned length.

Fix it by detecting the preadv(2) error in virtio_send_data_iov(), and
teaching fuse_reply_data() to reply with the error in that case.
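
For example, a guest-side read like the following sketch would hang
forever before this fix (the mount point and file name are
hypothetical):

    /*
     * O_DIRECT read with a non-block-aligned length: the host-side
     * preadv(2) fails, and without this fix virtiofsd never sends a
     * reply, so the guest's read(2) blocks indefinitely.
     */
    char buf[4096] __attribute__((aligned(4096)));
    int fd = open("/mnt/virtiofs/file", O_RDONLY | O_DIRECT);
    ssize_t n = read(fd, buf, 100);    /* 100 is not block-aligned */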

Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
---
 contrib/virtiofsd/fuse_lowlevel.c |  4 ++--
 contrib/virtiofsd/fuse_virtio.c   | 12 +++++++++---
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/contrib/virtiofsd/fuse_lowlevel.c b/contrib/virtiofsd/fuse_lowlevel.c
index 111c6e1..aeb5fe2 100644
--- a/contrib/virtiofsd/fuse_lowlevel.c
+++ b/contrib/virtiofsd/fuse_lowlevel.c
@@ -524,11 +524,11 @@ int fuse_reply_data(fuse_req_t req, struct fuse_bufvec *bufv,
 	out.error = 0;
 
 	res = fuse_send_data_iov(req->se, req->ch, iov, 1, bufv, flags);
-	if (res <= 0) {
+	if (res >= 0) {
 		fuse_free_req(req);
 		return res;
 	} else {
-		return fuse_reply_err(req, res);
+		return fuse_reply_err(req, -res);
 	}
 }
 
diff --git a/contrib/virtiofsd/fuse_virtio.c b/contrib/virtiofsd/fuse_virtio.c
index ca988aa..0b5736d 100644
--- a/contrib/virtiofsd/fuse_virtio.c
+++ b/contrib/virtiofsd/fuse_virtio.c
@@ -333,8 +333,13 @@ int virtio_send_data_iov(struct fuse_session *se, struct fuse_chan *ch,
                 ret = preadv(buf->buf[0].fd, in_sg_ptr, in_sg_cpy_count, buf->buf[0].pos);
 
                 if (se->debug)
-                        fprintf(stderr, "%s: preadv_res=%d len=%zd\n",
-                                __func__, ret, len);
+                        fprintf(stderr, "%s: preadv_res=%d(%s) len=%zd\n",
+                                __func__, ret, strerror(errno), len);
+                if (ret == -1) {
+                        ret = -errno;
+                        free(in_sg_cpy);
+                        goto err;
+                }
                 if (ret < len && ret) {
                         if (se->debug)
                                 fprintf(stderr, "%s: ret < len\n", __func__);
@@ -379,7 +384,8 @@ int virtio_send_data_iov(struct fuse_session *se, struct fuse_chan *ch,
         vu_queue_notify(&se->virtio_dev->dev, q);
 
 err:
-        ch->qi->reply_sent = true;
+        if (!ret)
+                ch->qi->reply_sent = true;
 
         return ret;
 }
-- 
1.8.3.1



* [Virtio-fs] [PATCH 2/4] virtiofsd: support nanosecond resolution for file timestamp
From: Liu Bo @ 2019-04-16 19:08 UTC
  To: virtio-fs

From: Jiufei Xue <jiufei.xue@linux.alibaba.com>

Define HAVE_STRUCT_STAT_ST_ATIM if `st_atim' is a member of `struct
stat', which means the file timestamp fields carry nanosecond
resolution.
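
For reference, fuse_misc.h keys off this define with a pattern like the
following (a simplified sketch of the libfuse convention, not a quote
of this patch):

    #ifdef HAVE_STRUCT_STAT_ST_ATIM
    /* Linux: struct stat carries struct timespec members */
    #define ST_ATIM_NSEC(stbuf) ((stbuf)->st_atim.tv_nsec)
    #else
    #define ST_ATIM_NSEC(stbuf) 0
    #endif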

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
---
 configure                     | 16 ++++++++++++++++
 contrib/virtiofsd/fuse_misc.h |  1 +
 2 files changed, 17 insertions(+)

diff --git a/configure b/configure
index 5c21a65..57336ec 100755
--- a/configure
+++ b/configure
@@ -5147,6 +5147,19 @@ if compile_prog "" "" ; then
     strchrnul=yes
 fi
 
+#########################################
+# check if we have st_atim
+
+st_atim=no
+cat > $TMPC << EOF
+#include <sys/stat.h>
+#include <stddef.h>
+int main(void) { return offsetof(struct stat, st_atim); }
+EOF
+if compile_prog "" "" ; then
+    st_atim=yes
+fi
+
 ##########################################
 # check if trace backend exists
 
@@ -6757,6 +6770,9 @@ fi
 if test "$strchrnul" = "yes" ; then
   echo "HAVE_STRCHRNUL=y" >> $config_host_mak
 fi
+if test "$st_atim" = "yes" ; then
+  echo "HAVE_STRUCT_STAT_ST_ATIM=y" >> $config_host_mak
+fi
 if test "$byteswap_h" = "yes" ; then
   echo "CONFIG_BYTESWAP_H=y" >> $config_host_mak
 fi
diff --git a/contrib/virtiofsd/fuse_misc.h b/contrib/virtiofsd/fuse_misc.h
index 2f6663e..30a1d7f 100644
--- a/contrib/virtiofsd/fuse_misc.h
+++ b/contrib/virtiofsd/fuse_misc.h
@@ -7,6 +7,7 @@
 */
 
 #include <pthread.h>
+#include "config-host.h"
 
 /*
   Versioned symbols cannot be used in some cases because it
-- 
1.8.3.1



* [Virtio-fs] [PATCH 3/4] virtiofsd: use file-backed memory region for virtiofsd's cache area
From: Liu Bo @ 2019-04-16 19:08 UTC
  To: virtio-fs

From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

When running xfstests test case generic/413, we found the following issue:
    1, create a file in one virtiofsd mount point with dax enabled
    2, mmap this file, get virtual addr: A
    3, write(fd, A, len), where fd comes from another file in another
       virtiofsd mount point without dax enabled; also note that this
       write(2) is direct I/O.
    4, this direct I/O hangs forever, because virtiofsd has crashed.
Here is the stack:
[  247.166276] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  247.167171] t_mmap_dio      D    0  2335   2102 0x00000000
[  247.168006] Call Trace:
[  247.169067]  ? __schedule+0x3d0/0x830
[  247.170219]  schedule+0x32/0x80
[  247.171328]  schedule_timeout+0x1e2/0x350
[  247.172416]  ? fuse_direct_io+0x2e5/0x6b0 [fuse]
[  247.173516]  wait_for_completion+0x123/0x190
[  247.174593]  ? wake_up_q+0x70/0x70
[  247.175640]  fuse_direct_IO+0x265/0x310 [fuse]
[  247.176724]  generic_file_read_iter+0xaa/0xd20
[  247.177824]  fuse_file_read_iter+0x81/0x130 [fuse]
[  247.178938]  ? fuse_simple_request+0x104/0x1b0 [fuse]
[  247.180041]  ? fuse_fsync_common+0xad/0x240 [fuse]
[  247.181136]  __vfs_read+0x108/0x190
[  247.181930]  vfs_read+0x91/0x130
[  247.182671]  ksys_read+0x52/0xc0
[  247.183454]  do_syscall_64+0x55/0x170
[  247.184200]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
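
(The hanging t_mmap_dio task above corresponds roughly to the following
guest-side sketch; the mount points are hypothetical:)

    /* file A is on a DAX-enabled virtiofs mount, file B is not */
    size_t len = 65536;    /* arbitrary */
    int fda = open("/mnt/dax/filea", O_RDWR);
    int fdb = open("/mnt/nodax/fileb", O_RDWR | O_DIRECT);
    char *a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fda, 0);

    /*
     * The buffer's guest-physical pages live in the DAX cache BAR,
     * which the second mount's virtiofsd cannot translate, so this
     * write never completes.
     */
    write(fdb, a, len);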

virtiofsd crashed because vu_gpa_to_va() cannot translate the guest
physical address correctly. For a memory-mapped area in dax mode, the
page for this area actually points into virtiofsd's cache area, or
rather the virtio pci device's cache bar. Currently qemu implements
this cache bar with anonymous memory and does not pass the bar's
address info to the vhost-user backend, so vu_gpa_to_va() fails.

To fix this issue, back the vhost cache area with a file-backed memory
region, so that its mapping can be passed to the vhost-user backend.
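
As an aside, and purely an assumption rather than part of this patch:
on Linux, memfd_create(2) could back the region without creating a
visible temp file at all, at the cost of requiring a newer kernel:

    /* hypothetical alternative to the mkstemp/unlink dance below;
     * needs _GNU_SOURCE and <sys/mman.h> */
    int fd = memfd_create("virtio-fs-cache", MFD_CLOEXEC);
    if (fd >= 0 && ftruncate(fd, size) == -1) {
            close(fd);
            fd = -1;
    }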

Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
---
 hw/virtio/vhost-user-fs.c | 59 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 49 insertions(+), 10 deletions(-)

diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
index a5110a9..fc2ed38 100644
--- a/hw/virtio/vhost-user-fs.c
+++ b/hw/virtio/vhost-user-fs.c
@@ -300,6 +300,35 @@ static bool vuf_guest_notifier_pending(VirtIODevice *vdev, int idx)
     return vhost_virtqueue_pending(&fs->vhost_dev, idx);
 }
 
+static int vuf_get_tmpfile(size_t size)
+{
+    const char *tmpdir = g_get_tmp_dir();
+    gchar *fname;
+    int fd;
+
+    fname = g_strdup_printf("%s/virtio-fs-cache-XXXXXX", tmpdir);
+    fd = mkstemp(fname);
+    if (fd < 0) {
+        fprintf(stderr, "%s: mkstemp(%s) failed\n", __func__, fname);
+        g_free(fname);
+        return -1;
+    }
+
+    /*
+     * Note: the file is unlinked now, but it is not actually deleted
+     * until the fd is closed or the process exits.
+     */
+    unlink(fname);
+    if (ftruncate(fd, size) == -1) {
+        fprintf(stderr, "%s: ftruncate(%s) failed\n", __func__, fname);
+        close(fd);
+        fd = -1;
+    }
+
+    g_free(fname);
+    return fd;
+}
+
 static void vuf_device_realize(DeviceState *dev, Error **errp)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
@@ -349,16 +378,26 @@ static void vuf_device_realize(DeviceState *dev, Error **errp)
                          "no smaller than the page size");
         return;
     }
-    /* We need a region with some host memory, 'ram' is the easiest */
-    memory_region_init_ram_nomigrate(&fs->cache, OBJECT(vdev),
-                       "virtio-fs-cache",
-                       fs->conf.cache_size, NULL);
-    /*
-     * But we don't actually want anyone reading/writing the raw
-     * area with no cache data.
-     */
-    mprotect(memory_region_get_ram_ptr(&fs->cache), fs->conf.cache_size,
-             PROT_NONE);
+
+    if (fs->conf.cache_size) {
+            int fd;
+
+            fd = vuf_get_tmpfile(fs->conf.cache_size);
+            if (fd < 0) {
+                    error_setg(errp, "failed to initialize virtio-fs-cache");
+                    return;
+            }
+            memory_region_init_ram_from_fd(&fs->cache, OBJECT(vdev),
+                                           "virtio-fs-cache",
+                                           fs->conf.cache_size, true, fd, NULL);
+
+            /*
+             * But we don't actually want anyone reading/writing the raw
+             * area with no cache data.
+             */
+            mprotect(memory_region_get_ram_ptr(&fs->cache),
+                     fs->conf.cache_size, PROT_NONE);
+    }
 
     if (!vhost_user_init(&fs->vhost_user, &fs->conf.chardev, errp)) {
         return;
-- 
1.8.3.1



* [Virtio-fs] [PATCH 4/4] virtiofsd: use fallocate(2) instead of posix_fallocate(3)
From: Liu Bo @ 2019-04-16 19:08 UTC
  To: virtio-fs

From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

Switch to fallocate(2) because posix_fallocate(3) does not support
FALLOC_FL_KEEP_SIZE or FALLOC_FL_PUNCH_HOLE. Our underlying host
filesystem is ext4, which supports both flags well, so this change
is safe.
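
For example, punching a hole while keeping the file size, which
posix_fallocate(3) has no way to express, now simply becomes:

    /* deallocate [offset, offset+length) without changing i_size;
     * PUNCH_HOLE must be combined with KEEP_SIZE */
    fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
              offset, length);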

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
---
 contrib/virtiofsd/passthrough_ll.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/contrib/virtiofsd/passthrough_ll.c b/contrib/virtiofsd/passthrough_ll.c
index 10ea8aa..ccb5312 100644
--- a/contrib/virtiofsd/passthrough_ll.c
+++ b/contrib/virtiofsd/passthrough_ll.c
@@ -1568,15 +1568,10 @@ static void lo_fallocate(fuse_req_t req, fuse_ino_t ino, int mode,
 			 off_t offset, off_t length, struct fuse_file_info *fi)
 {
 	int err;
-	(void) ino;
-
-	if (mode) {
-		fuse_reply_err(req, EOPNOTSUPP);
-		return;
-	}
 
-	err = posix_fallocate(lo_fi_fd(req, fi), offset,
-			      length);
+	err = fallocate(lo_fi_fd(req, fi), mode, offset, length);
+        if (err < 0)
+                err = errno;
 
 	fuse_reply_err(req, err);
 }
-- 
1.8.3.1



* Re: [Virtio-fs] [PATCH 4/4] virtiofsd: use fallocate(2) instead of posix_fallocate(3)
From: Dr. David Alan Gilbert @ 2019-04-17 13:18 UTC
  To: Liu Bo, mszeredi; +Cc: virtio-fs

* Liu Bo (bo.liu@linux.alibaba.com) wrote:
> From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> 
> This is because posix_fallocate(3) does not support FALLOC_FL_KEEP_SIZE
> and FALLOC_FL_PUNCH_HOLE. Our underlying host filesystem is ext4 and
> ext4 supports FALLOC_FL_KEEP_SIZE and FALLOC_FL_PUNCH_HOLE well, so
> this change will be ok.
> 
> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

  We need to check what 'fuse' expects - is it defined what
fallocate features it has, and what the semantics are?

Miklos: Do you know?

Dave

  
> ---
>  contrib/virtiofsd/passthrough_ll.c | 11 +++--------
>  1 file changed, 3 insertions(+), 8 deletions(-)
> 
> diff --git a/contrib/virtiofsd/passthrough_ll.c b/contrib/virtiofsd/passthrough_ll.c
> index 10ea8aa..ccb5312 100644
> --- a/contrib/virtiofsd/passthrough_ll.c
> +++ b/contrib/virtiofsd/passthrough_ll.c
> @@ -1568,15 +1568,10 @@ static void lo_fallocate(fuse_req_t req, fuse_ino_t ino, int mode,
>  			 off_t offset, off_t length, struct fuse_file_info *fi)
>  {
>  	int err;
> -	(void) ino;
> -
> -	if (mode) {
> -		fuse_reply_err(req, EOPNOTSUPP);
> -		return;
> -	}
>  
> -	err = posix_fallocate(lo_fi_fd(req, fi), offset,
> -			      length);
> +	err = fallocate(lo_fi_fd(req, fi), mode, offset, length);
> +        if (err < 0)
> +                err = errno;
>  
>  	fuse_reply_err(req, err);
>  }
> -- 
> 1.8.3.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Virtio-fs] [PATCH 2/4] virtiofsd: support nanosecond resolution for file timestamp
From: Dr. David Alan Gilbert @ 2019-04-17 13:31 UTC
  To: Liu Bo; +Cc: virtio-fs

* Liu Bo (bo.liu@linux.alibaba.com) wrote:
> From: Jiufei Xue <jiufei.xue@linux.alibaba.com>
> 
> Define HAVE_STRUCT_STAT_ST_ATIM to 1 if `st_atim' is member of `struct
> stat' which means support nanosecond resolution for the file timestamp
> fields.
> 
> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Thanks, applied to my tree.

> ---
>  configure                     | 16 ++++++++++++++++
>  contrib/virtiofsd/fuse_misc.h |  1 +
>  2 files changed, 17 insertions(+)
> 
> diff --git a/configure b/configure
> index 5c21a65..57336ec 100755
> --- a/configure
> +++ b/configure
> @@ -5147,6 +5147,19 @@ if compile_prog "" "" ; then
>      strchrnul=yes
>  fi
>  
> +#########################################
> +# check if we have st_atim
> +
> +st_atim=no
> +cat > $TMPC << EOF
> +#include <sys/stat.h>
> +#include <stddef.h>
> +int main(void) { return offsetof(struct stat, st_atim); }
> +EOF
> +if compile_prog "" "" ; then
> +    st_atim=yes
> +fi
> +
>  ##########################################
>  # check if trace backend exists
>  
> @@ -6757,6 +6770,9 @@ fi
>  if test "$strchrnul" = "yes" ; then
>    echo "HAVE_STRCHRNUL=y" >> $config_host_mak
>  fi
> +if test "$st_atim" = "yes" ; then
> +  echo "HAVE_STRUCT_STAT_ST_ATIM=y" >> $config_host_mak
> +fi
>  if test "$byteswap_h" = "yes" ; then
>    echo "CONFIG_BYTESWAP_H=y" >> $config_host_mak
>  fi
> diff --git a/contrib/virtiofsd/fuse_misc.h b/contrib/virtiofsd/fuse_misc.h
> index 2f6663e..30a1d7f 100644
> --- a/contrib/virtiofsd/fuse_misc.h
> +++ b/contrib/virtiofsd/fuse_misc.h
> @@ -7,6 +7,7 @@
>  */
>  
>  #include <pthread.h>
> +#include "config-host.h"
>  
>  /*
>    Versioned symbols cannot be used in some cases because it
> -- 
> 1.8.3.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Virtio-fs] [PATCH 4/4] virtiofsd: use fallocate(2) instead of posix_fallocate(3)
From: Miklos Szeredi @ 2019-04-17 13:44 UTC
  To: Dr. David Alan Gilbert; +Cc: virtio-fs

On Wed, Apr 17, 2019 at 3:18 PM Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> * Liu Bo (bo.liu@linux.alibaba.com) wrote:
> > From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> >
> > This is because posix_fallocate(3) does not support FALLOC_FL_KEEP_SIZE
> > and FALLOC_FL_PUNCH_HOLE. Our underlying host filesystem is ext4 and
> > ext4 supports FALLOC_FL_KEEP_SIZE and FALLOC_FL_PUNCH_HOLE well, so
> > this change will be ok.
> >
> > Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
>
>   We need to check what 'fuse' expects - is it defined what
> fallocate features it has, and what the semantics are?

The patch looks good to me.

Fuse (the kernel part) supports FALLOC_FL_KEEP_SIZE and
FALLOC_FL_PUNCH_HOLE, but returns EOPNOTSUPP for any other flag.

Even if, at a later time, fuse starts supporting additional fallocate
flags, then the passthrough implementation calling fallocate(2) should
be fine.

Thanks,
Miklos



* Re: [Virtio-fs] [PATCH 4/4] virtiofsd: use fallocate(2) instead of posix_fallocate(3)
From: Dr. David Alan Gilbert @ 2019-04-17 14:05 UTC
  To: Miklos Szeredi; +Cc: virtio-fs

* Miklos Szeredi (mszeredi@redhat.com) wrote:
> On Wed, Apr 17, 2019 at 3:18 PM Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> >
> > * Liu Bo (bo.liu@linux.alibaba.com) wrote:
> > > From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> > >
> > > This is because posix_fallocate(3) does not support FALLOC_FL_KEEP_SIZE
> > > and FALLOC_FL_PUNCH_HOLE. Our underlying host filesystem is ext4 and
> > > ext4 supports FALLOC_FL_KEEP_SIZE and FALLOC_FL_PUNCH_HOLE well, so
> > > this change will be ok.
> > >
> > > Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> >
> >   We need to check what 'fuse' expects - is it defined what
> > fallocate features it has, and what the semantics are?
> 
> The patch looks good to me.
> 
> Fuse (the kernel part) supports FALLOC_FL_KEEP_SIZE and
> FALLOC_FL_PUNCH_HOLE, but returns EOPNOTSUPP for any other flag.
> 
> Even if, at a later time, fuse starts supporting additional fallocate
> flags, then the passthrough implementation calling fallocate(2) should
> be fine.

OK, so then I'll take this patch; would it make sense for it to be sent
to upstream fuse?

Dave

> Thanks,
> Miklos
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Virtio-fs] [PATCH 4/4] virtiofsd: use fallocate(2) instead of posix_fallocate(3)
From: Miklos Szeredi @ 2019-04-17 14:25 UTC
  To: Dr. David Alan Gilbert; +Cc: virtio-fs

On Wed, Apr 17, 2019 at 4:05 PM Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> * Miklos Szeredi (mszeredi@redhat.com) wrote:
> > On Wed, Apr 17, 2019 at 3:18 PM Dr. David Alan Gilbert
> > <dgilbert@redhat.com> wrote:
> > >
> > > * Liu Bo (bo.liu@linux.alibaba.com) wrote:
> > > > From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> > > >
> > > > This is because posix_fallocate(3) does not support FALLOC_FL_KEEP_SIZE
> > > > and FALLOC_FL_PUNCH_HOLE. Our underlying host filesystem is ext4 and
> > > > ext4 supports FALLOC_FL_KEEP_SIZE and FALLOC_FL_PUNCH_HOLE well, so
> > > > this change will be ok.
> > > >
> > > > Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> > >
> > >   We need to check what 'fuse' expects - is it defined what
> > > fallocate features it has, and what the semantics are?
> >
> > The patch looks good to me.
> >
> > Fuse (the kernel part) supports FALLOC_FL_KEEP_SIZE and
> > FALLOC_FL_PUNCH_HOLE, but returns EOPNOTSUPP for any other flag.
> >
> > Even if, at a later time, fuse starts supporting additional fallocate
> > flags, then the passthrough implementation calling fallocate(2) should
> > be fine.
>
> OK, so then I'll take this patch; would it make sense for it to be sent
> to upstream fuse?

It would have to be wrapped in #ifdef HAVE_FALLOCATE to make it
portable.   Otherwise I don't see why not.
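
For illustration, such a wrapper might look like the sketch below,
reusing the names from this patch and assuming configure provides
HAVE_FALLOCATE:

    #ifdef HAVE_FALLOCATE
            err = fallocate(lo_fi_fd(req, fi), mode, offset, length);
            if (err < 0)
                    err = errno;
    #else
            /* fall back to the pre-patch behaviour */
            if (mode) {
                    fuse_reply_err(req, EOPNOTSUPP);
                    return;
            }
            err = posix_fallocate(lo_fi_fd(req, fi), offset, length);
    #endif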

Thanks,
Miklos



* Re: [Virtio-fs] [PATCH 4/4] virtiofsd: use fallocate(2) instead of posix_fallocate(3)
From: Dr. David Alan Gilbert @ 2019-04-17 14:29 UTC
  To: Miklos Szeredi; +Cc: virtio-fs

* Miklos Szeredi (mszeredi@redhat.com) wrote:
> On Wed, Apr 17, 2019 at 4:05 PM Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> >
> > * Miklos Szeredi (mszeredi@redhat.com) wrote:
> > > On Wed, Apr 17, 2019 at 3:18 PM Dr. David Alan Gilbert
> > > <dgilbert@redhat.com> wrote:
> > > >
> > > > * Liu Bo (bo.liu@linux.alibaba.com) wrote:
> > > > > From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> > > > >
> > > > > This is because posix_fallocate(3) does not support FALLOC_FL_KEEP_SIZE
> > > > > and FALLOC_FL_PUNCH_HOLE. Our underlying host filesystem is ext4 and
> > > > > ext4 supports FALLOC_FL_KEEP_SIZE and FALLOC_FL_PUNCH_HOLE well, so
> > > > > this change will be ok.
> > > > >
> > > > > Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> > > >
> > > >   We need to check what 'fuse' expects - is it defined what
> > > > fallocate features it has, and what the semantics are?
> > >
> > > The patch looks good to me.
> > >
> > > Fuse (the kernel part) supports FALLOC_FL_KEEP_SIZE and
> > > FALLOC_FL_PUNCH_HOLE, but returns EOPNOTSUPP for any other flag.
> > >
> > > Even if, at a later time, fuse starts supporting additional fallocate
> > > flags, then the passthrough implementation calling fallocate(2) should
> > > be fine.
> >
> > OK, so then I'll take this patch; would it make sense for it to be sent
> > to upstream fuse?
> 
> It would have to be wrapped in #ifdef HAVE_FALLOCATE to make it
> portable.   Otherwise I don't see why not.

OK, I'll take it for now.

Liu Bo: Can you make it portable please and post it upstream on the fuse
list?

Dave

> Thanks,
> Miklos
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Virtio-fs] [PATCH 3/4] virtiofsd: use file-backed memory region for virtiofsd's cache area
From: Dr. David Alan Gilbert @ 2019-04-17 14:51 UTC
  To: Liu Bo, vgoyal, stefanha; +Cc: virtio-fs

* Liu Bo (bo.liu@linux.alibaba.com) wrote:
> From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> 
> When running xfstests test case generic/413, we found such issue:
>     1, create a file in one virtiofsd mount point with dax enabled
>     2, mmap this file, get virtual addr: A
>     3, write(fd, A, len), here fd comes from another file in another
>        virtiofsd mount point without dax enabled, also note here write(2)
>        is direct io.
>     4, this direct io will hang forever, because the virtiofsd has crashed.
> Here is the stack:
> [  247.166276] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  247.167171] t_mmap_dio      D    0  2335   2102 0x00000000
> [  247.168006] Call Trace:
> [  247.169067]  ? __schedule+0x3d0/0x830
> [  247.170219]  schedule+0x32/0x80
> [  247.171328]  schedule_timeout+0x1e2/0x350
> [  247.172416]  ? fuse_direct_io+0x2e5/0x6b0 [fuse]
> [  247.173516]  wait_for_completion+0x123/0x190
> [  247.174593]  ? wake_up_q+0x70/0x70
> [  247.175640]  fuse_direct_IO+0x265/0x310 [fuse]
> [  247.176724]  generic_file_read_iter+0xaa/0xd20
> [  247.177824]  fuse_file_read_iter+0x81/0x130 [fuse]
> [  247.178938]  ? fuse_simple_request+0x104/0x1b0 [fuse]
> [  247.180041]  ? fuse_fsync_common+0xad/0x240 [fuse]
> [  247.181136]  __vfs_read+0x108/0x190
> [  247.181930]  vfs_read+0x91/0x130
> [  247.182671]  ksys_read+0x52/0xc0
> [  247.183454]  do_syscall_64+0x55/0x170
> [  247.184200]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> And virtiofsd crashed because vu_gpa_to_va() can not handle guest physical
> address correctly. For a memory mapped area in dax mode, indeed the page
> for this area points virtiofsd's cache area, or rather virtio pci device's
> cache bar. In qemu, currently this cache bar is implemented with an anonymous
> memory and will not pass this cache bar's address info to vhost-user backend,
> so vu_gpa_to_va() will fail.
> 
> To fix this issue, we create this vhost cache area with a file backend
> memory area.

Thanks,
  I know there was another case of the daemon trying to access the
buffer, which Stefan and Vivek hit but fixed by persuading the kernel
not to do it.  Stefan/Vivek: what do you think?

It worries me a little to expose the area back to the daemon; the guest
can write the BAR and change the mapping, and I doubt anything would
notice that (though I also doubt it happens much).

Dave

> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> ---
>  hw/virtio/vhost-user-fs.c | 59 +++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 49 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
> index a5110a9..fc2ed38 100644
> --- a/hw/virtio/vhost-user-fs.c
> +++ b/hw/virtio/vhost-user-fs.c
> @@ -300,6 +300,35 @@ static bool vuf_guest_notifier_pending(VirtIODevice *vdev, int idx)
>      return vhost_virtqueue_pending(&fs->vhost_dev, idx);
>  }
>  
> +static int vuf_get_tmpfile(size_t size)
> +{
> +    const char *tmpdir = g_get_tmp_dir();
> +    gchar *fname;
> +    int fd;
> +
> +    fname = g_strdup_printf("%s/virtio-fs-cache-XXXXXX", tmpdir);
> +    fd = mkstemp(fname);
> +    if (fd < 0) {
> +        fprintf(stderr, "%s: mkstemp(%s) failed\n", __func__, fname);
> +        g_free(fname);
> +        return -1;
> +    }
> +
> +    /*
> +     * Note: this temp file would be deleted until file is closed or
> +     * process exits.
> +     */
> +    unlink(fname);
> +    if (ftruncate(fd, size) == -1) {
> +        fprintf(stderr, "%s: ftruncate(%s) failed\n", __func__, fname);
> +        close(fd);
> +        fd = -1;
> +    }
> +
> +    g_free(fname);
> +    return fd;
> +}
> +
>  static void vuf_device_realize(DeviceState *dev, Error **errp)
>  {
>      VirtIODevice *vdev = VIRTIO_DEVICE(dev);
> @@ -349,16 +378,26 @@ static void vuf_device_realize(DeviceState *dev, Error **errp)
>                           "no smaller than the page size");
>          return;
>      }
> -    /* We need a region with some host memory, 'ram' is the easiest */
> -    memory_region_init_ram_nomigrate(&fs->cache, OBJECT(vdev),
> -                       "virtio-fs-cache",
> -                       fs->conf.cache_size, NULL);
> -    /*
> -     * But we don't actually want anyone reading/writing the raw
> -     * area with no cache data.
> -     */
> -    mprotect(memory_region_get_ram_ptr(&fs->cache), fs->conf.cache_size,
> -             PROT_NONE);
> +
> +    if (fs->conf.cache_size) {
> +            int fd;
> +
> +            fd = vuf_get_tmpfile(fs->conf.cache_size);
> +            if (fd < 0) {
> +                    error_setg(errp, "failed to initialize virtio-fs-cache");
> +                    return;
> +            }
> +            memory_region_init_ram_from_fd(&fs->cache, OBJECT(vdev),
> +                                           "virtio-fs-cache",
> +                                           fs->conf.cache_size, true, fd, NULL);
> +
> +            /*
> +             * But we don't actually want anyone reading/writing the raw
> +             * area with no cache data.
> +             */
> +            mprotect(memory_region_get_ram_ptr(&fs->cache),
> +                     fs->conf.cache_size, PROT_NONE);
> +    }
>  
>      if (!vhost_user_init(&fs->vhost_user, &fs->conf.chardev, errp)) {
>          return;
> -- 
> 1.8.3.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Virtio-fs] [PATCH 1/4] virtiofsd: send reply correctly on read failure
From: Dr. David Alan Gilbert @ 2019-04-17 17:26 UTC
  To: Liu Bo; +Cc: virtio-fs

* Liu Bo (bo.liu@linux.alibaba.com) wrote:
> From: Eryu Guan <eguan@linux.alibaba.com>
> 
> Currently when a lo_read() operation fails, we don't send the failure
> back to fuse client, and read(2) operation from guest kernel would hang
> on waiting for the reply.
> 
> This is easily triggered by a direct read with non-aligned length.
> 
> Fix it by detecting preadv(2) error in virtio_send_data_iov(), and
> teaching fuse_reply_data() to reply error on error case.

Thank you for spotting this.

> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
> Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
> ---
>  contrib/virtiofsd/fuse_lowlevel.c |  4 ++--
>  contrib/virtiofsd/fuse_virtio.c   | 12 +++++++++---
>  2 files changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/contrib/virtiofsd/fuse_lowlevel.c b/contrib/virtiofsd/fuse_lowlevel.c
> index 111c6e1..aeb5fe2 100644
> --- a/contrib/virtiofsd/fuse_lowlevel.c
> +++ b/contrib/virtiofsd/fuse_lowlevel.c
> @@ -524,11 +524,11 @@ int fuse_reply_data(fuse_req_t req, struct fuse_bufvec *bufv,
>  	out.error = 0;
>  
>  	res = fuse_send_data_iov(req->se, req->ch, iov, 1, bufv, flags);
> -	if (res <= 0) {
> +	if (res >= 0) {

I need to go and ask the upstream fuse list about this; it's not clear
to me what fuse_send_data_iov is supposed to return;  let me clarify
that and get back to you.

Dave

>  		fuse_free_req(req);
>  		return res;
>  	} else {
> -		return fuse_reply_err(req, res);
> +		return fuse_reply_err(req, -res);
>  	}
>  }
>  
> diff --git a/contrib/virtiofsd/fuse_virtio.c b/contrib/virtiofsd/fuse_virtio.c
> index ca988aa..0b5736d 100644
> --- a/contrib/virtiofsd/fuse_virtio.c
> +++ b/contrib/virtiofsd/fuse_virtio.c
> @@ -333,8 +333,13 @@ int virtio_send_data_iov(struct fuse_session *se, struct fuse_chan *ch,
>                  ret = preadv(buf->buf[0].fd, in_sg_ptr, in_sg_cpy_count, buf->buf[0].pos);
>  
>                  if (se->debug)
> -                        fprintf(stderr, "%s: preadv_res=%d len=%zd\n",
> -                                __func__, ret, len);
> +                        fprintf(stderr, "%s: preadv_res=%d(%s) len=%zd\n",
> +                                __func__, ret, strerror(errno), len);
> +                if (ret == -1) {
> +                        ret = -errno;
> +                        free(in_sg_cpy);
> +                        goto err;
> +                }
>                  if (ret < len && ret) {
>                          if (se->debug)
>                                  fprintf(stderr, "%s: ret < len\n", __func__);
> @@ -379,7 +384,8 @@ int virtio_send_data_iov(struct fuse_session *se, struct fuse_chan *ch,
>          vu_queue_notify(&se->virtio_dev->dev, q);
>  
>  err:
> -        ch->qi->reply_sent = true;
> +        if (!ret)
> +                ch->qi->reply_sent = true;
>  
>          return ret;
>  }
> -- 
> 1.8.3.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Virtio-fs] [PATCH 4/4] virtiofsd: use fallocate(2) instead of posix_fallocate(3)
From: Liu Bo @ 2019-04-17 19:24 UTC
  To: Dr. David Alan Gilbert; +Cc: virtio-fs, Miklos Szeredi

On Wed, Apr 17, 2019 at 03:29:00PM +0100, Dr. David Alan Gilbert wrote:
> * Miklos Szeredi (mszeredi@redhat.com) wrote:
> > On Wed, Apr 17, 2019 at 4:05 PM Dr. David Alan Gilbert
> > <dgilbert@redhat.com> wrote:
> > >
> > > * Miklos Szeredi (mszeredi@redhat.com) wrote:
> > > > On Wed, Apr 17, 2019 at 3:18 PM Dr. David Alan Gilbert
> > > > <dgilbert@redhat.com> wrote:
> > > > >
> > > > > * Liu Bo (bo.liu@linux.alibaba.com) wrote:
> > > > > > From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> > > > > >
> > > > > > This is because posix_fallocate(3) does not support FALLOC_FL_KEEP_SIZE
> > > > > > and FALLOC_FL_PUNCH_HOLE. Our underlying host filesystem is ext4 and
> > > > > > ext4 supports FALLOC_FL_KEEP_SIZE and FALLOC_FL_PUNCH_HOLE well, so
> > > > > > this change will be ok.
> > > > > >
> > > > > > Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> > > > >
> > > > >   We need to check what 'fuse' expects - is it defined what
> > > > > fallocate features it has, and what the semantics are?
> > > >
> > > > The patch looks good to me.
> > > >
> > > > Fuse (the kernel part) supports FALLOC_FL_KEEP_SIZE and
> > > > FALLOC_FL_PUNCH_HOLE, but returns EOPNOTSUPP for any other flag.
> > > >
> > > > Even if, at a later time, fuse starts supporting additional fallocate
> > > > flags, then the passthrough implementation calling fallocate(2) should
> > > > be fine.
> > >
> > > OK, so then I'll take this patch; would it make sense for it to be sent
> > > to upstream fuse?
> > 
> > It would have to be wrapped in #ifdef HAVE_FALLOCATE to make it
> > portable.   Otherwise I don't see why not.
> 
> OK, I'll take it for now.
> 
> Liu Bo: Can you make it portable please and post it upstream on the fuse
> list?
>

No problem, will do it.

thanks,
-liubo



* Re: [Virtio-fs] [PATCH 1/4] virtiofsd: send reply correctly on read failure
From: Dr. David Alan Gilbert @ 2019-04-18 12:25 UTC
  To: Liu Bo, eguan; +Cc: virtio-fs

* Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> * Liu Bo (bo.liu@linux.alibaba.com) wrote:
> > From: Eryu Guan <eguan@linux.alibaba.com>
> > 
> > Currently when a lo_read() operation fails, we don't send the failure
> > back to fuse client, and read(2) operation from guest kernel would hang
> > on waiting for the reply.
> > 
> > This is easily triggered by a direct read with non-aligned length.
> > 
> > Fix it by detecting preadv(2) error in virtio_send_data_iov(), and
> > teaching fuse_reply_data() to reply error on error case.
> 
> Thank you for spotting this.
> 
> > Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
> > Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
> > ---
> >  contrib/virtiofsd/fuse_lowlevel.c |  4 ++--
> >  contrib/virtiofsd/fuse_virtio.c   | 12 +++++++++---
> >  2 files changed, 11 insertions(+), 5 deletions(-)
> > 
> > diff --git a/contrib/virtiofsd/fuse_lowlevel.c b/contrib/virtiofsd/fuse_lowlevel.c
> > index 111c6e1..aeb5fe2 100644
> > --- a/contrib/virtiofsd/fuse_lowlevel.c
> > +++ b/contrib/virtiofsd/fuse_lowlevel.c
> > @@ -524,11 +524,11 @@ int fuse_reply_data(fuse_req_t req, struct fuse_bufvec *bufv,
> >  	out.error = 0;
> >  
> >  	res = fuse_send_data_iov(req->se, req->ch, iov, 1, bufv, flags);
> > -	if (res <= 0) {
> > +	if (res >= 0) {
> 
> I need to go and ask the upstream fuse list about this; it's not clear
> to me what fuse_send_data_iov is supposed to return;  let me clarify
> that and get back to you.

Miklos replied to me on the upstream fuse list:
https://sourceforge.net/p/fuse/mailman/fuse-devel/thread/CAOssrKdVViWewmh3nrfKkq6eaWNJ629AR7w90KG3XT2hV9_D1w%40mail.gmail.com/#msg36642807

so the original comparison is actually already correct: a negative
value means something is wrong with the fuse connection (or, in our
case, virtio), so we can't send an error reply anyway.
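
In other words, the return convention boils down to this (a summary of
the thread, not a quote of the code):

    /*
     * fuse_send_data_iov() return convention:
     *    0   the reply was sent successfully
     *  > 0   errno-style error; the caller replies with fuse_reply_err()
     *  < 0   fuse/virtio transport error; no reply can be sent at all
     */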

> Dave
> 
> >  		fuse_free_req(req);
> >  		return res;
> >  	} else {
> > -		return fuse_reply_err(req, res);
> > +		return fuse_reply_err(req, -res);
> >  	}
> >  }
> >  
> > diff --git a/contrib/virtiofsd/fuse_virtio.c b/contrib/virtiofsd/fuse_virtio.c
> > index ca988aa..0b5736d 100644
> > --- a/contrib/virtiofsd/fuse_virtio.c
> > +++ b/contrib/virtiofsd/fuse_virtio.c
> > @@ -333,8 +333,13 @@ int virtio_send_data_iov(struct fuse_session *se, struct fuse_chan *ch,
> >                  ret = preadv(buf->buf[0].fd, in_sg_ptr, in_sg_cpy_count, buf->buf[0].pos);
> >  
> >                  if (se->debug)
> > -                        fprintf(stderr, "%s: preadv_res=%d len=%zd\n",
> > -                                __func__, ret, len);
> > +                        fprintf(stderr, "%s: preadv_res=%d(%s) len=%zd\n",
> > +                                __func__, ret, strerror(errno), len);
> > +                if (ret == -1) {
> > +                        ret = -errno;

So this will need to be      ret = errno    to give a positive error
response, meaning the error was on the file side rather than on the
fuse side. (There's also another case below where I set ret = -EIO
when it should be EIO, I think.)

> > +                        free(in_sg_cpy);
> > +                        goto err;
> > +                }
> >                  if (ret < len && ret) {
> >                          if (se->debug)
> >                                  fprintf(stderr, "%s: ret < len\n", __func__);
> > @@ -379,7 +384,8 @@ int virtio_send_data_iov(struct fuse_session *se, struct fuse_chan *ch,
> >          vu_queue_notify(&se->virtio_dev->dev, q);
> >  
> >  err:
> > -        ch->qi->reply_sent = true;
> > +        if (!ret)
> > +                ch->qi->reply_sent = true;

Yes, I think that's OK.

Dave

> >  
> >          return ret;
> >  }
> > -- 
> > 1.8.3.1
> > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Virtio-fs] [PATCH 1/4] virtiofsd: send reply correctly on read failure
From: Liu Bo @ 2019-04-18 18:14 UTC
  To: Dr. David Alan Gilbert; +Cc: virtio-fs

On Thu, Apr 18, 2019 at 01:25:27PM +0100, Dr. David Alan Gilbert wrote:
> * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > * Liu Bo (bo.liu@linux.alibaba.com) wrote:
> > > From: Eryu Guan <eguan@linux.alibaba.com>
> > > 
> > > Currently when a lo_read() operation fails, we don't send the failure
> > > back to fuse client, and read(2) operation from guest kernel would hang
> > > on waiting for the reply.
> > > 
> > > This is easily triggered by a direct read with non-aligned length.
> > > 
> > > Fix it by detecting preadv(2) error in virtio_send_data_iov(), and
> > > teaching fuse_reply_data() to reply error on error case.
> > 
> > Thank you for spotting this.
> > 
> > > Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
> > > Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
> > > ---
> > >  contrib/virtiofsd/fuse_lowlevel.c |  4 ++--
> > >  contrib/virtiofsd/fuse_virtio.c   | 12 +++++++++---
> > >  2 files changed, 11 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/contrib/virtiofsd/fuse_lowlevel.c b/contrib/virtiofsd/fuse_lowlevel.c
> > > index 111c6e1..aeb5fe2 100644
> > > --- a/contrib/virtiofsd/fuse_lowlevel.c
> > > +++ b/contrib/virtiofsd/fuse_lowlevel.c
> > > @@ -524,11 +524,11 @@ int fuse_reply_data(fuse_req_t req, struct fuse_bufvec *bufv,
> > >  	out.error = 0;
> > >  
> > >  	res = fuse_send_data_iov(req->se, req->ch, iov, 1, bufv, flags);
> > > -	if (res <= 0) {
> > > +	if (res >= 0) {
> > 
> > I need to go and ask the upstream fuse list about this; it's not clear
> > to me what fuse_send_data_iov is supposed to return;  let me clarify
> > that and get back to you.
> 
> Miklos replied to me on the upstream fuse list:
> https://sourceforge.net/p/fuse/mailman/fuse-devel/thread/CAOssrKdVViWewmh3nrfKkq6eaWNJ629AR7w90KG3XT2hV9_D1w%40mail.gmail.com/#msg36642807
> 
> so this comparison is actually already correct because a negative value
> means that something is wrong with the fuse connection (or in our case
> virtio) so we can't send an error reply anyway.
>

I see, that makes sense.

> > Dave
> > 
> > >  		fuse_free_req(req);
> > >  		return res;
> > >  	} else {
> > > -		return fuse_reply_err(req, res);
> > > +		return fuse_reply_err(req, -res);
> > >  	}
> > >  }
> > >  
> > > diff --git a/contrib/virtiofsd/fuse_virtio.c b/contrib/virtiofsd/fuse_virtio.c
> > > index ca988aa..0b5736d 100644
> > > --- a/contrib/virtiofsd/fuse_virtio.c
> > > +++ b/contrib/virtiofsd/fuse_virtio.c
> > > @@ -333,8 +333,13 @@ int virtio_send_data_iov(struct fuse_session *se, struct fuse_chan *ch,
> > >                  ret = preadv(buf->buf[0].fd, in_sg_ptr, in_sg_cpy_count, buf->buf[0].pos);
> > >  
> > >                  if (se->debug)
> > > -                        fprintf(stderr, "%s: preadv_res=%d len=%zd\n",
> > > -                                __func__, ret, len);
> > > +                        fprintf(stderr, "%s: preadv_res=%d(%s) len=%zd\n",
> > > +                                __func__, ret, strerror(errno), len);
> > > +                if (ret == -1) {
> > > +                        ret = -errno;
> 
> So this will need to be      ret = errno    to give a postive error
> response to mean that the error was on the file side on the fuse side.
> (There's also another case below that where I set ret = EIO where it
> should be EIO I think)
>

OK, do you want me to fix both in the same patch?

> > > +                        free(in_sg_cpy);
> > > +                        goto err;
> > > +                }
> > >                  if (ret < len && ret) {
> > >                          if (se->debug)
> > >                                  fprintf(stderr, "%s: ret < len\n", __func__);
> > > @@ -379,7 +384,8 @@ int virtio_send_data_iov(struct fuse_session *se, struct fuse_chan *ch,
> > >          vu_queue_notify(&se->virtio_dev->dev, q);
> > >  
> > >  err:
> > > -        ch->qi->reply_sent = true;
> > > +        if (!ret)
> > > +                ch->qi->reply_sent = true;
> 
> Yes, I think that's OK.

Correct, this is the one causing the hang on the fuse kernel side.

thanks,
-liubo

> 
> Dave
> 
> > >  
> > >          return ret;
> > >  }
> > > -- 
> > > 1.8.3.1
> > > 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Virtio-fs] [PATCH 1/4] virtiofsd: send reply correctly on read failure
From: Dr. David Alan Gilbert @ 2019-04-18 19:05 UTC
  To: Liu Bo; +Cc: virtio-fs

* Liu Bo (bo.liu@linux.alibaba.com) wrote:
> On Thu, Apr 18, 2019 at 01:25:27PM +0100, Dr. David Alan Gilbert wrote:
> > * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > > * Liu Bo (bo.liu@linux.alibaba.com) wrote:
> > > > From: Eryu Guan <eguan@linux.alibaba.com>
> > > > 
> > > > Currently when a lo_read() operation fails, we don't send the failure
> > > > back to fuse client, and read(2) operation from guest kernel would hang
> > > > on waiting for the reply.
> > > > 
> > > > This is easily triggered by a direct read with non-aligned length.
> > > > 
> > > > Fix it by detecting preadv(2) error in virtio_send_data_iov(), and
> > > > teaching fuse_reply_data() to reply error on error case.
> > > 
> > > Thank you for spotting this.
> > > 
> > > > Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
> > > > Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
> > > > ---
> > > >  contrib/virtiofsd/fuse_lowlevel.c |  4 ++--
> > > >  contrib/virtiofsd/fuse_virtio.c   | 12 +++++++++---
> > > >  2 files changed, 11 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/contrib/virtiofsd/fuse_lowlevel.c b/contrib/virtiofsd/fuse_lowlevel.c
> > > > index 111c6e1..aeb5fe2 100644
> > > > --- a/contrib/virtiofsd/fuse_lowlevel.c
> > > > +++ b/contrib/virtiofsd/fuse_lowlevel.c
> > > > @@ -524,11 +524,11 @@ int fuse_reply_data(fuse_req_t req, struct fuse_bufvec *bufv,
> > > >  	out.error = 0;
> > > >  
> > > >  	res = fuse_send_data_iov(req->se, req->ch, iov, 1, bufv, flags);
> > > > -	if (res <= 0) {
> > > > +	if (res >= 0) {
> > > 
> > > I need to go and ask the upstream fuse list about this; it's not clear
> > > to me what fuse_send_data_iov is supposed to return;  let me clarify
> > > that and get back to you.
> > 
> > Miklos replied to me on the upstream fuse list:
> > https://sourceforge.net/p/fuse/mailman/fuse-devel/thread/CAOssrKdVViWewmh3nrfKkq6eaWNJ629AR7w90KG3XT2hV9_D1w%40mail.gmail.com/#msg36642807
> > 
> > so this comparison is actually already correct because a negative value
> > means that something is wrong with the fuse connection (or in our case
> > virtio) so we can't send an error reply anyway.
> >
> 
> I see, that makes sense.
> 
> > > Dave
> > > 
> > > >  		fuse_free_req(req);
> > > >  		return res;
> > > >  	} else {
> > > > -		return fuse_reply_err(req, res);
> > > > +		return fuse_reply_err(req, -res);
> > > >  	}
> > > >  }
> > > >  
> > > > diff --git a/contrib/virtiofsd/fuse_virtio.c b/contrib/virtiofsd/fuse_virtio.c
> > > > index ca988aa..0b5736d 100644
> > > > --- a/contrib/virtiofsd/fuse_virtio.c
> > > > +++ b/contrib/virtiofsd/fuse_virtio.c
> > > > @@ -333,8 +333,13 @@ int virtio_send_data_iov(struct fuse_session *se, struct fuse_chan *ch,
> > > >                  ret = preadv(buf->buf[0].fd, in_sg_ptr, in_sg_cpy_count, buf->buf[0].pos);
> > > >  
> > > >                  if (se->debug)
> > > > -                        fprintf(stderr, "%s: preadv_res=%d len=%zd\n",
> > > > -                                __func__, ret, len);
> > > > +                        fprintf(stderr, "%s: preadv_res=%d(%s) len=%zd\n",
> > > > +                                __func__, ret, strerror(errno), len);
> > > > +                if (ret == -1) {
> > > > +                        ret = -errno;
> > 
> > So this will need to be      ret = errno    to give a postive error
> > response to mean that the error was on the file side on the fuse side.
> > (There's also another case below that where I set ret = EIO where it
> > should be EIO I think)
> >
> 
> OK, do you want me to fix both in the same patch?

Yes please.

> > > > +                        free(in_sg_cpy);
> > > > +                        goto err;
> > > > +                }
> > > >                  if (ret < len && ret) {
> > > >                          if (se->debug)
> > > >                                  fprintf(stderr, "%s: ret < len\n", __func__);
> > > > @@ -379,7 +384,8 @@ int virtio_send_data_iov(struct fuse_session *se, struct fuse_chan *ch,
> > > >          vu_queue_notify(&se->virtio_dev->dev, q);
> > > >  
> > > >  err:
> > > > -        ch->qi->reply_sent = true;
> > > > +        if (!ret)
> > > > +                ch->qi->reply_sent = true;
> > 
> > Yes, I think that's OK.
> 
> Correct, this is the one causing fuse kernel side's hang.
> 

Thanks!

Dave (Note, I'm out for a week)

> thanks,
> -liubo
> 
> > 
> > Dave
> > 
> > > >  
> > > >          return ret;
> > > >  }
> > > > -- 
> > > > 1.8.3.1
> > > > 
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* [Virtio-fs] [PATCH v2] virtiofsd: send reply correctly on read failure
From: Eryu Guan @ 2019-04-23  6:27 UTC
  To: virtio-fs

Currently, when a lo_read() operation fails, we don't send the failure
back to the fuse client, and the read(2) operation in the guest kernel
hangs waiting for the reply.

This is easily triggered by a direct read with a non-aligned length.

Fix it by detecting the preadv(2) error in virtio_send_data_iov(), and
teaching fuse_reply_data() to reply with the error in that case.

Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
---
v2:
- return positive error number on read error
- also return positive EIO in the short read case
- drop the "-1" -> "-EOPNOTSUPP" update, which should be in a separate patch

 contrib/virtiofsd/fuse_virtio.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/contrib/virtiofsd/fuse_virtio.c b/contrib/virtiofsd/fuse_virtio.c
index ca988aac923f..7337f2a2c3ab 100644
--- a/contrib/virtiofsd/fuse_virtio.c
+++ b/contrib/virtiofsd/fuse_virtio.c
@@ -333,8 +333,13 @@ int virtio_send_data_iov(struct fuse_session *se, struct fuse_chan *ch,
                 ret = preadv(buf->buf[0].fd, in_sg_ptr, in_sg_cpy_count, buf->buf[0].pos);
 
                 if (se->debug)
-                        fprintf(stderr, "%s: preadv_res=%d len=%zd\n",
-                                __func__, ret, len);
+                        fprintf(stderr, "%s: preadv_res=%d(%s) len=%zd\n",
+                                __func__, ret, strerror(errno), len);
+                if (ret == -1) {
+                        ret = errno;
+                        free(in_sg_cpy);
+                        goto err;
+                }
                 if (ret < len && ret) {
                         if (se->debug)
                                 fprintf(stderr, "%s: ret < len\n", __func__);
@@ -356,7 +361,7 @@ int virtio_send_data_iov(struct fuse_session *se, struct fuse_chan *ch,
                 if (ret != len) {
                         if (se->debug)
                                 fprintf(stderr, "%s: ret!=len\n", __func__);
-                        ret = -EIO;
+                        ret = EIO;
                         free(in_sg_cpy);
                         goto err;
                 }
@@ -379,7 +384,8 @@ int virtio_send_data_iov(struct fuse_session *se, struct fuse_chan *ch,
         vu_queue_notify(&se->virtio_dev->dev, q);
 
 err:
-        ch->qi->reply_sent = true;
+        if (ret == 0)
+                ch->qi->reply_sent = true;
 
         return ret;
 }
-- 
2.14.4.44.g2045bb6



* Re: [Virtio-fs] [PATCH 3/4] virtiofsd: use file-backed memory region for virtiofsd's cache area
From: Stefan Hajnoczi @ 2019-04-23 12:09 UTC
  To: Dr. David Alan Gilbert; +Cc: virtio-fs


On Wed, Apr 17, 2019 at 03:51:21PM +0100, Dr. David Alan Gilbert wrote:
> * Liu Bo (bo.liu@linux.alibaba.com) wrote:
> > From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> > 
> > When running xfstests test case generic/413, we found such issue:
> >     1, create a file in one virtiofsd mount point with dax enabled
> >     2, mmap this file, get virtual addr: A
> >     3, write(fd, A, len), here fd comes from another file in another
> >        virtiofsd mount point without dax enabled, also note here write(2)
> >        is direct io.
> >     4, this direct io will hang forever, because the virtiofsd has crashed.
> > Here is the stack:
> > [  247.166276] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [  247.167171] t_mmap_dio      D    0  2335   2102 0x00000000
> > [  247.168006] Call Trace:
> > [  247.169067]  ? __schedule+0x3d0/0x830
> > [  247.170219]  schedule+0x32/0x80
> > [  247.171328]  schedule_timeout+0x1e2/0x350
> > [  247.172416]  ? fuse_direct_io+0x2e5/0x6b0 [fuse]
> > [  247.173516]  wait_for_completion+0x123/0x190
> > [  247.174593]  ? wake_up_q+0x70/0x70
> > [  247.175640]  fuse_direct_IO+0x265/0x310 [fuse]
> > [  247.176724]  generic_file_read_iter+0xaa/0xd20
> > [  247.177824]  fuse_file_read_iter+0x81/0x130 [fuse]
> > [  247.178938]  ? fuse_simple_request+0x104/0x1b0 [fuse]
> > [  247.180041]  ? fuse_fsync_common+0xad/0x240 [fuse]
> > [  247.181136]  __vfs_read+0x108/0x190
> > [  247.181930]  vfs_read+0x91/0x130
> > [  247.182671]  ksys_read+0x52/0xc0
> > [  247.183454]  do_syscall_64+0x55/0x170
> > [  247.184200]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > And virtiofsd crashed because vu_gpa_to_va() cannot handle the guest
> > physical address correctly. For a memory-mapped area in dax mode, the page
> > for this area actually points to virtiofsd's cache area, or rather to the
> > virtio pci device's cache bar. In qemu, this cache bar is currently
> > implemented with anonymous memory, and its address info is not passed to
> > the vhost-user backend, so vu_gpa_to_va() will fail.
> >
> > To fix this issue, we create this vhost cache area with a file-backed
> > memory area.
> 
> Thanks,
>   I know there was another case of the daemon trying to access the
> buffer that Stefan and Vivek hit, but fixed by persuading the kernel
> not to do it;  Stefan/Vivek: What do you think?

That case happened with cache=none and the dax mount option.

The general problem is when FUSE_READ/FUSE_WRITE is sent and the buffer
is outside guest RAM.

> 
> It worries me a little exposing the area back to the daemon; the guest
> can write the BAR and change the mapping, I doubt anything would notice
> that (but also I doubt it happens much).

If two virtiofsd processes are involved then it's even harder since they
do not have up-to-date access to each other's DAX window.

Stefan



* Re: [Virtio-fs] [PATCH 3/4] virtiofsd: use file-backend memory region for virtiofsd's cache area
  2019-04-23 12:09     ` Stefan Hajnoczi
@ 2019-04-23 18:49       ` Liu Bo
  2019-04-25 14:33         ` Stefan Hajnoczi
  0 siblings, 1 reply; 31+ messages in thread
From: Liu Bo @ 2019-04-23 18:49 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs

On Tue, Apr 23, 2019 at 01:09:19PM +0100, Stefan Hajnoczi wrote:
> On Wed, Apr 17, 2019 at 03:51:21PM +0100, Dr. David Alan Gilbert wrote:
> > * Liu Bo (bo.liu@linux.alibaba.com) wrote:
> > > From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> > > 
> > > When running xfstests test case generic/413, we found such issue:
> > >     1, create a file in one virtiofsd mount point with dax enabled
> > >     2, mmap this file, get virtual addr: A
> > >     3, write(fd, A, len), here fd comes from another file in another
> > >        virtiofsd mount point without dax enabled, also note here write(2)
> > >        is direct io.
> > >     4, this direct io will hang forever, because the virtiofsd has crashed.
> > > Here is the stack:
> > > [  247.166276] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [  247.167171] t_mmap_dio      D    0  2335   2102 0x00000000
> > > [  247.168006] Call Trace:
> > > [  247.169067]  ? __schedule+0x3d0/0x830
> > > [  247.170219]  schedule+0x32/0x80
> > > [  247.171328]  schedule_timeout+0x1e2/0x350
> > > [  247.172416]  ? fuse_direct_io+0x2e5/0x6b0 [fuse]
> > > [  247.173516]  wait_for_completion+0x123/0x190
> > > [  247.174593]  ? wake_up_q+0x70/0x70
> > > [  247.175640]  fuse_direct_IO+0x265/0x310 [fuse]
> > > [  247.176724]  generic_file_read_iter+0xaa/0xd20
> > > [  247.177824]  fuse_file_read_iter+0x81/0x130 [fuse]
> > > [  247.178938]  ? fuse_simple_request+0x104/0x1b0 [fuse]
> > > [  247.180041]  ? fuse_fsync_common+0xad/0x240 [fuse]
> > > [  247.181136]  __vfs_read+0x108/0x190
> > > [  247.181930]  vfs_read+0x91/0x130
> > > [  247.182671]  ksys_read+0x52/0xc0
> > > [  247.183454]  do_syscall_64+0x55/0x170
> > > [  247.184200]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > 
> > > And virtiofsd crashed because vu_gpa_to_va() cannot handle the guest
> > > physical address correctly. For a memory-mapped area in dax mode, the page
> > > for this area actually points to virtiofsd's cache area, or rather to the
> > > virtio pci device's cache bar. In qemu, this cache bar is currently
> > > implemented with anonymous memory, and its address info is not passed to
> > > the vhost-user backend, so vu_gpa_to_va() will fail.
> > >
> > > To fix this issue, we create this vhost cache area with a file-backed
> > > memory area.
> > 
> > Thanks,
> >   I know there was another case of the daemon trying to access the
> > buffer that Stefan and Vivek hit, but fixed by persuading the kernel
> > not to do it;  Stefan/Vivek: What do you think?
> 
> That case happened with cache=none and the dax mount option.
> 
> The general problem is when FUSE_READ/FUSE_WRITE is sent and the buffer
> is outside guest RAM.
>

Can you please elaborate on how the buffer is outside guest RAM?
Is it also via direct IO?

> > 
> > It worries me a little exposing the area back to the daemon; the guest
> > can write the BAR and change the mapping, I doubt anything would notice
> > that (but also I doubt it happens much).
> 
> If two virtiofsd processes are involved then it's even harder since they
> do not have up-to-date access to each other's DAX window.
> 

In the case of direct IO, kvm is able to make sure that the guest's dax
mapping is sync'd with the underlying host mmap region, isn't it?


thanks,
-liubo



* Re: [Virtio-fs] [PATCH 3/4] virtiofsd: use file-backend memory region for virtiofsd's cache area
  2019-04-23 18:49       ` Liu Bo
@ 2019-04-25 14:33         ` Stefan Hajnoczi
  2019-04-25 21:21           ` Vivek Goyal
  0 siblings, 1 reply; 31+ messages in thread
From: Stefan Hajnoczi @ 2019-04-25 14:33 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs


On Tue, Apr 23, 2019 at 11:49:15AM -0700, Liu Bo wrote:
> On Tue, Apr 23, 2019 at 01:09:19PM +0100, Stefan Hajnoczi wrote:
> > On Wed, Apr 17, 2019 at 03:51:21PM +0100, Dr. David Alan Gilbert wrote:
> > > * Liu Bo (bo.liu@linux.alibaba.com) wrote:
> > > > From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> > > > 
> > > > When running xfstests test case generic/413, we found such issue:
> > > >     1, create a file in one virtiofsd mount point with dax enabled
> > > >     2, mmap this file, get virtual addr: A
> > > >     3, write(fd, A, len), here fd comes from another file in another
> > > >        virtiofsd mount point without dax enabled, also note here write(2)
> > > >        is direct io.
> > > >     4, this direct io will hang forever, because the virtiofsd has crashed.
> > > > Here is the stack:
> > > > [  247.166276] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > [  247.167171] t_mmap_dio      D    0  2335   2102 0x00000000
> > > > [  247.168006] Call Trace:
> > > > [  247.169067]  ? __schedule+0x3d0/0x830
> > > > [  247.170219]  schedule+0x32/0x80
> > > > [  247.171328]  schedule_timeout+0x1e2/0x350
> > > > [  247.172416]  ? fuse_direct_io+0x2e5/0x6b0 [fuse]
> > > > [  247.173516]  wait_for_completion+0x123/0x190
> > > > [  247.174593]  ? wake_up_q+0x70/0x70
> > > > [  247.175640]  fuse_direct_IO+0x265/0x310 [fuse]
> > > > [  247.176724]  generic_file_read_iter+0xaa/0xd20
> > > > [  247.177824]  fuse_file_read_iter+0x81/0x130 [fuse]
> > > > [  247.178938]  ? fuse_simple_request+0x104/0x1b0 [fuse]
> > > > [  247.180041]  ? fuse_fsync_common+0xad/0x240 [fuse]
> > > > [  247.181136]  __vfs_read+0x108/0x190
> > > > [  247.181930]  vfs_read+0x91/0x130
> > > > [  247.182671]  ksys_read+0x52/0xc0
> > > > [  247.183454]  do_syscall_64+0x55/0x170
> > > > [  247.184200]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > 
> > > > And virtiofsd crashed because vu_gpa_to_va() cannot handle the guest
> > > > physical address correctly. For a memory-mapped area in dax mode, the page
> > > > for this area actually points to virtiofsd's cache area, or rather to the
> > > > virtio pci device's cache bar. In qemu, this cache bar is currently
> > > > implemented with anonymous memory, and its address info is not passed to
> > > > the vhost-user backend, so vu_gpa_to_va() will fail.
> > > >
> > > > To fix this issue, we create this vhost cache area with a file-backed
> > > > memory area.
> > > 
> > > Thanks,
> > >   I know there was another case of the daemon trying to access the
> > > buffer that Stefan and Vivek hit, but fixed by persuading the kernel
> > > not to do it;  Stefan/Vivek: What do you think?
> > 
> > That case happened with cache=none and the dax mount option.
> > 
> > The general problem is when FUSE_READ/FUSE_WRITE is sent and the buffer
> > is outside guest RAM.
> >
> 
> Can you please elaborate on how the buffer is outside guest RAM?
> Is it also via direct IO?

The DAX window is a PCI BAR on the virtio-fs PCI device.  It is not
guest RAM.

vhost-user only shares guest RAM with the vhost-user device backend
process (virtiofsd).  Therefore virtiofsd does not have access to the
contents of the DAX window.

This only happens when the virtio-fs file system is mounted with the
"dax" option.

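To make the failure concrete, here is a minimal sketch (simplified; not the
actual libvhost-user code or signatures) of the kind of region walk that
vu_gpa_to_va() does.  An address inside the DAX window falls outside every
region that vhost-user registered, so the lookup comes back NULL and the
daemon has nothing it can read from or write to:

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch only: one entry per memory region the master registered. */
    typedef struct {
        uint64_t gpa_start;   /* guest physical base of the region */
        uint64_t size;        /* region length in bytes */
        uint8_t *host_va;     /* where the region is mmap'd in the daemon */
    } Region;

    static void *gpa_to_va(Region *regions, int n, uint64_t gpa)
    {
        for (int i = 0; i < n; i++) {
            if (gpa >= regions[i].gpa_start &&
                gpa - regions[i].gpa_start < regions[i].size) {
                return regions[i].host_va + (gpa - regions[i].gpa_start);
            }
        }
        return NULL;   /* e.g. a gpa that points into the DAX window BAR */
    }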
> > > 
> > > It worries me a little exposing the area back to the daemon; the guest
> > > can write the BAR and change the mapping, I doubt anything would notice
> > > that (but also I doubt it happens much).
> > 
> > If two virtiofsd processes are involved then it's even harder since they
> > do not have up-to-date access to each other's DAX window.
> > 
> 
> In the case of direct IO, kvm is able to make sure that the guest's dax
> mapping is sync'd with the underlying host mmap region, isn't it?

See my explanation above.  In theory a virtiofsd could keep track of the
mapping region by performing the same mmap(2) system calls as the QEMU
process based on the FUSE_SETUPMAPPING requests (currently only QEMU
does the mmaps).  What I'm saying here is that it gets even harder when
multiple virtiofsd processes are involved because they do not have
access to each other's DAX windows.
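
As a rough illustration of that idea -- purely hypothetical, with made-up
names, since virtiofsd does not do this today -- the daemon could mirror
each FUSE_SETUPMAPPING request into a private replica of the DAX window:

    #include <errno.h>
    #include <stdint.h>
    #include <sys/mman.h>

    /* Hypothetical: dax_mirror is a MAP_ANONYMOUS reservation the size of
     * the DAX window; each setupmapping is replayed on top of it. */
    static int mirror_setupmapping(uint8_t *dax_mirror, int fd,
                                   uint64_t foffset, uint64_t moffset,
                                   uint64_t len)
    {
        void *p = mmap(dax_mirror + moffset, len,
                       PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_FIXED, fd, foffset);
        return p == MAP_FAILED ? -errno : 0;
    }

Even then a daemon would only know about its own mappings, which is why the
multi-daemon case stays hard.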

Stefan



* Re: [Virtio-fs] [PATCH 3/4] virtiofsd: use file-backend memory region for virtiofsd's cache area
  2019-04-25 14:33         ` Stefan Hajnoczi
@ 2019-04-25 21:21           ` Vivek Goyal
  2019-04-26  9:05             ` Stefan Hajnoczi
  0 siblings, 1 reply; 31+ messages in thread
From: Vivek Goyal @ 2019-04-25 21:21 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs

On Thu, Apr 25, 2019 at 03:33:23PM +0100, Stefan Hajnoczi wrote:
> On Tue, Apr 23, 2019 at 11:49:15AM -0700, Liu Bo wrote:
> > On Tue, Apr 23, 2019 at 01:09:19PM +0100, Stefan Hajnoczi wrote:
> > > On Wed, Apr 17, 2019 at 03:51:21PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Liu Bo (bo.liu@linux.alibaba.com) wrote:
> > > > > From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> > > > > 
> > > > > When running xfstests test case generic/413, we found such issue:
> > > > >     1, create a file in one virtiofsd mount point with dax enabled
> > > > >     2, mmap this file, get virtual addr: A
> > > > >     3, write(fd, A, len), here fd comes from another file in another
> > > > >        virtiofsd mount point without dax enabled, also note here write(2)
> > > > >        is direct io.
> > > > >     4, this direct io will hang forever, because the virtiofsd has crashed.
> > > > > Here is the stack:
> > > > > [  247.166276] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > [  247.167171] t_mmap_dio      D    0  2335   2102 0x00000000
> > > > > [  247.168006] Call Trace:
> > > > > [  247.169067]  ? __schedule+0x3d0/0x830
> > > > > [  247.170219]  schedule+0x32/0x80
> > > > > [  247.171328]  schedule_timeout+0x1e2/0x350
> > > > > [  247.172416]  ? fuse_direct_io+0x2e5/0x6b0 [fuse]
> > > > > [  247.173516]  wait_for_completion+0x123/0x190
> > > > > [  247.174593]  ? wake_up_q+0x70/0x70
> > > > > [  247.175640]  fuse_direct_IO+0x265/0x310 [fuse]
> > > > > [  247.176724]  generic_file_read_iter+0xaa/0xd20
> > > > > [  247.177824]  fuse_file_read_iter+0x81/0x130 [fuse]
> > > > > [  247.178938]  ? fuse_simple_request+0x104/0x1b0 [fuse]
> > > > > [  247.180041]  ? fuse_fsync_common+0xad/0x240 [fuse]
> > > > > [  247.181136]  __vfs_read+0x108/0x190
> > > > > [  247.181930]  vfs_read+0x91/0x130
> > > > > [  247.182671]  ksys_read+0x52/0xc0
> > > > > [  247.183454]  do_syscall_64+0x55/0x170
> > > > > [  247.184200]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > > 
> > > > > And virtiofsd crashed because vu_gpa_to_va() cannot handle the guest
> > > > > physical address correctly. For a memory-mapped area in dax mode, the page
> > > > > for this area actually points to virtiofsd's cache area, or rather to the
> > > > > virtio pci device's cache bar. In qemu, this cache bar is currently
> > > > > implemented with anonymous memory, and its address info is not passed to
> > > > > the vhost-user backend, so vu_gpa_to_va() will fail.
> > > > >
> > > > > To fix this issue, we create this vhost cache area with a file-backed
> > > > > memory area.
> > > > 
> > > > Thanks,
> > > >   I know there was another case of the daemon trying to access the
> > > > buffer that Stefan and Vivek hit, but fixed by persuading the kernel
> > > > not to do it;  Stefan/Vivek: What do you think?
> > > 
> > > That case happened with cache=none and the dax mount option.
> > > 
> > > The general problem is when FUSE_READ/FUSE_WRITE is sent and the buffer
> > > is outside guest RAM.

Stefan,

Can this be emulated by sending a request to qemu? If virtiofsd can detect
that the source/destination of a READ/WRITE is not in guest RAM, can it
forward the message to qemu (which has access to all the DAX windows) and
let qemu do the operation?

This probably will mean introducing new messages like
setupmapping/removemapping between virtiofsd and qemu.

Thanks
Vivek



* Re: [Virtio-fs] [PATCH 3/4] virtiofsd: use file-backend memory region for virtiofsd's cache area
  2019-04-25 21:21           ` Vivek Goyal
@ 2019-04-26  9:05             ` Stefan Hajnoczi
  2019-05-01 18:59               ` Dr. David Alan Gilbert
  2019-05-18  2:28               ` Liu Bo
  0 siblings, 2 replies; 31+ messages in thread
From: Stefan Hajnoczi @ 2019-04-26  9:05 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs


On Thu, Apr 25, 2019 at 05:21:58PM -0400, Vivek Goyal wrote:
> On Thu, Apr 25, 2019 at 03:33:23PM +0100, Stefan Hajnoczi wrote:
> > On Tue, Apr 23, 2019 at 11:49:15AM -0700, Liu Bo wrote:
> > > On Tue, Apr 23, 2019 at 01:09:19PM +0100, Stefan Hajnoczi wrote:
> > > > On Wed, Apr 17, 2019 at 03:51:21PM +0100, Dr. David Alan Gilbert wrote:
> > > > > * Liu Bo (bo.liu@linux.alibaba.com) wrote:
> > > > > > From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> > > > > > 
> > > > > > When running xfstests test case generic/413, we found such issue:
> > > > > >     1, create a file in one virtiofsd mount point with dax enabled
> > > > > >     2, mmap this file, get virtual addr: A
> > > > > >     3, write(fd, A, len), here fd comes from another file in another
> > > > > >        virtiofsd mount point without dax enabled, also note here write(2)
> > > > > >        is direct io.
> > > > > >     4, this direct io will hang forever, because the virtiofsd has crashed.
> > > > > > Here is the stack:
> > > > > > [  247.166276] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > > [  247.167171] t_mmap_dio      D    0  2335   2102 0x00000000
> > > > > > [  247.168006] Call Trace:
> > > > > > [  247.169067]  ? __schedule+0x3d0/0x830
> > > > > > [  247.170219]  schedule+0x32/0x80
> > > > > > [  247.171328]  schedule_timeout+0x1e2/0x350
> > > > > > [  247.172416]  ? fuse_direct_io+0x2e5/0x6b0 [fuse]
> > > > > > [  247.173516]  wait_for_completion+0x123/0x190
> > > > > > [  247.174593]  ? wake_up_q+0x70/0x70
> > > > > > [  247.175640]  fuse_direct_IO+0x265/0x310 [fuse]
> > > > > > [  247.176724]  generic_file_read_iter+0xaa/0xd20
> > > > > > [  247.177824]  fuse_file_read_iter+0x81/0x130 [fuse]
> > > > > > [  247.178938]  ? fuse_simple_request+0x104/0x1b0 [fuse]
> > > > > > [  247.180041]  ? fuse_fsync_common+0xad/0x240 [fuse]
> > > > > > [  247.181136]  __vfs_read+0x108/0x190
> > > > > > [  247.181930]  vfs_read+0x91/0x130
> > > > > > [  247.182671]  ksys_read+0x52/0xc0
> > > > > > [  247.183454]  do_syscall_64+0x55/0x170
> > > > > > [  247.184200]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > > > 
> > > > > > And virtiofsd crashed because vu_gpa_to_va() cannot handle the guest
> > > > > > physical address correctly. For a memory-mapped area in dax mode, the page
> > > > > > for this area actually points to virtiofsd's cache area, or rather to the
> > > > > > virtio pci device's cache bar. In qemu, this cache bar is currently
> > > > > > implemented with anonymous memory, and its address info is not passed to
> > > > > > the vhost-user backend, so vu_gpa_to_va() will fail.
> > > > > >
> > > > > > To fix this issue, we create this vhost cache area with a file-backed
> > > > > > memory area.
> > > > > 
> > > > > Thanks,
> > > > >   I know there was another case of the daemon trying to access the
> > > > > buffer that Stefan and Vivek hit, but fixed by persuading the kernel
> > > > > not to do it;  Stefan/Vivek: What do you think?
> > > > 
> > > > That case happened with cache=none and the dax mount option.
> > > > 
> > > > The general problem is when FUSE_READ/FUSE_WRITE is sent and the buffer
> > > > is outside guest RAM.
> 
> Stefan,
> 
> Can this be emulated by sending a request to qemu? If virtiofsd can detect
> that the source/destination of a READ/WRITE is not in guest RAM, can it
> forward the message to qemu (which has access to all the DAX windows) and
> let qemu do the operation?
>
> This probably will mean introducing new messages like
> setupmapping/removemapping between virtiofsd and qemu.

Yes, interesting idea!

When virtiofsd is unable to map the virtqueue iovecs due to addresses
outside guest RAM, it could forward READ/WRITE requests to QEMU along
with the file descriptor.  It would be slow but fixes the problem.

Implementing this is a little tricky because the libvhost-user code
probably fails before fuse_lowlevel.c is able to parse the FUSE request
header.  It will require reworking libvhost-user and fuse_virtio.c code,
I think.
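
To see why the ordering is awkward: the opcode only becomes known once the
fuse_in_header at the front of the request has been read, so the forwarding
decision has to be made after the header is mapped but before the data
buffers are touched.  A sketch of the check (the helper and its callers are
hypothetical; fuse_in_header, FUSE_READ and FUSE_WRITE are the real
<linux/fuse.h> ABI):

    #include <linux/fuse.h>
    #include <stdbool.h>
    #include <sys/uio.h>

    /* Hypothetical: 'hdr_iov' is the first descriptor of the request,
     * which lives in guest RAM and therefore did map. */
    static bool needs_qemu_forward(const struct iovec *hdr_iov,
                                   bool has_unmappable_buffers)
    {
        const struct fuse_in_header *in = hdr_iov->iov_base;
        return has_unmappable_buffers &&
               (in->opcode == FUSE_READ || in->opcode == FUSE_WRITE);
    }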

Stefan



* Re: [Virtio-fs] [PATCH v2] virtiofsd: send reply correctly on read failure
  2019-04-23  6:27   ` [Virtio-fs] [PATCH v2] " Eryu Guan
@ 2019-04-30 14:58     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 31+ messages in thread
From: Dr. David Alan Gilbert @ 2019-04-30 14:58 UTC (permalink / raw)
  To: Eryu Guan; +Cc: virtio-fs

* Eryu Guan (eguan@linux.alibaba.com) wrote:
> Currently when a lo_read() operation fails, we don't send the failure
> back to fuse client, and read(2) operation from guest kernel would hang
> on waiting for the reply.
> 
> This is easily triggered by a direct read with non-aligned length.
> 
> Fix it by detecting preadv(2) errors in virtio_send_data_iov(), and
> teaching fuse_reply_data() to reply with the error in the failure case.
> 
> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
> Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Thank you, I've added that to my dev tree.
> ---
> v2:
> - return a positive error number on read error
> - also return a positive EIO in the short read case
> - drop the "-1" -> "-EOPNOTSUPP" update, which should be in a separate patch
> 
>  contrib/virtiofsd/fuse_virtio.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/contrib/virtiofsd/fuse_virtio.c b/contrib/virtiofsd/fuse_virtio.c
> index ca988aac923f..7337f2a2c3ab 100644
> --- a/contrib/virtiofsd/fuse_virtio.c
> +++ b/contrib/virtiofsd/fuse_virtio.c
> @@ -333,8 +333,13 @@ int virtio_send_data_iov(struct fuse_session *se, struct fuse_chan *ch,
>                  ret = preadv(buf->buf[0].fd, in_sg_ptr, in_sg_cpy_count, buf->buf[0].pos);
>  
>                  if (se->debug)
> -                        fprintf(stderr, "%s: preadv_res=%d len=%zd\n",
> -                                __func__, ret, len);
> +                        fprintf(stderr, "%s: preadv_res=%d(%s) len=%zd\n",
> +                                __func__, ret, strerror(errno), len);
> +                if (ret == -1) {
> +                        ret = errno;
> +                        free(in_sg_cpy);
> +                        goto err;
> +                }
>                  if (ret < len && ret) {
>                          if (se->debug)
>                                  fprintf(stderr, "%s: ret < len\n", __func__);
> @@ -356,7 +361,7 @@ int virtio_send_data_iov(struct fuse_session *se, struct fuse_chan *ch,
>                  if (ret != len) {
>                          if (se->debug)
>                                  fprintf(stderr, "%s: ret!=len\n", __func__);
> -                        ret = -EIO;
> +                        ret = EIO;
>                          free(in_sg_cpy);
>                          goto err;
>                  }
> @@ -379,7 +384,8 @@ int virtio_send_data_iov(struct fuse_session *se, struct fuse_chan *ch,
>          vu_queue_notify(&se->virtio_dev->dev, q);
>  
>  err:
> -        ch->qi->reply_sent = true;
> +        if (ret == 0)
> +                ch->qi->reply_sent = true;
>  
>          return ret;
>  }
> -- 
> 2.14.4.44.g2045bb6
> 
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Virtio-fs] [PATCH 3/4] virtiofsd: use file-backend memory region for virtiofsd's cache area
  2019-04-26  9:05             ` Stefan Hajnoczi
@ 2019-05-01 18:59               ` Dr. David Alan Gilbert
  2019-05-02 11:46                 ` Stefan Hajnoczi
  2019-05-18  2:28               ` Liu Bo
  1 sibling, 1 reply; 31+ messages in thread
From: Dr. David Alan Gilbert @ 2019-05-01 18:59 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Thu, Apr 25, 2019 at 05:21:58PM -0400, Vivek Goyal wrote:
> > On Thu, Apr 25, 2019 at 03:33:23PM +0100, Stefan Hajnoczi wrote:
> > > On Tue, Apr 23, 2019 at 11:49:15AM -0700, Liu Bo wrote:
> > > > On Tue, Apr 23, 2019 at 01:09:19PM +0100, Stefan Hajnoczi wrote:
> > > > > On Wed, Apr 17, 2019 at 03:51:21PM +0100, Dr. David Alan Gilbert wrote:
> > > > > > * Liu Bo (bo.liu@linux.alibaba.com) wrote:
> > > > > > > From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> > > > > > > 
> > > > > > > When running xfstests test case generic/413, we found such issue:
> > > > > > >     1, create a file in one virtiofsd mount point with dax enabled
> > > > > > >     2, mmap this file, get virtual addr: A
> > > > > > >     3, write(fd, A, len), here fd comes from another file in another
> > > > > > >        virtiofsd mount point without dax enabled, also note here write(2)
> > > > > > >        is direct io.
> > > > > > >     4, this direct io will hang forever, because the virtiofsd has crashed.
> > > > > > > Here is the stack:
> > > > > > > [  247.166276] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > > > [  247.167171] t_mmap_dio      D    0  2335   2102 0x00000000
> > > > > > > [  247.168006] Call Trace:
> > > > > > > [  247.169067]  ? __schedule+0x3d0/0x830
> > > > > > > [  247.170219]  schedule+0x32/0x80
> > > > > > > [  247.171328]  schedule_timeout+0x1e2/0x350
> > > > > > > [  247.172416]  ? fuse_direct_io+0x2e5/0x6b0 [fuse]
> > > > > > > [  247.173516]  wait_for_completion+0x123/0x190
> > > > > > > [  247.174593]  ? wake_up_q+0x70/0x70
> > > > > > > [  247.175640]  fuse_direct_IO+0x265/0x310 [fuse]
> > > > > > > [  247.176724]  generic_file_read_iter+0xaa/0xd20
> > > > > > > [  247.177824]  fuse_file_read_iter+0x81/0x130 [fuse]
> > > > > > > [  247.178938]  ? fuse_simple_request+0x104/0x1b0 [fuse]
> > > > > > > [  247.180041]  ? fuse_fsync_common+0xad/0x240 [fuse]
> > > > > > > [  247.181136]  __vfs_read+0x108/0x190
> > > > > > > [  247.181930]  vfs_read+0x91/0x130
> > > > > > > [  247.182671]  ksys_read+0x52/0xc0
> > > > > > > [  247.183454]  do_syscall_64+0x55/0x170
> > > > > > > [  247.184200]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > > > > 
> > > > > > > And virtiofsd crashed because vu_gpa_to_va() cannot handle the guest
> > > > > > > physical address correctly. For a memory-mapped area in dax mode, the page
> > > > > > > for this area actually points to virtiofsd's cache area, or rather to the
> > > > > > > virtio pci device's cache bar. In qemu, this cache bar is currently
> > > > > > > implemented with anonymous memory, and its address info is not passed to
> > > > > > > the vhost-user backend, so vu_gpa_to_va() will fail.
> > > > > > >
> > > > > > > To fix this issue, we create this vhost cache area with a file-backed
> > > > > > > memory area.
> > > > > > 
> > > > > > Thanks,
> > > > > >   I know there was another case of the daemon trying to access the
> > > > > > buffer that Stefan and Vivek hit, but fixed by persuading the kernel
> > > > > > not to do it;  Stefan/Vivek: What do you think?
> > > > > 
> > > > > That case happened with cache=none and the dax mount option.
> > > > > 
> > > > > The general problem is when FUSE_READ/FUSE_WRITE is sent and the buffer
> > > > > is outside guest RAM.
> > 
> > Stefan,
> > 
> > Can this be emulated by sending a request to qemu? If virtiofsd can detect
> > that the source/destination of a READ/WRITE is not in guest RAM, can it
> > forward the message to qemu (which has access to all the DAX windows) and
> > let qemu do the operation?
> >
> > This probably will mean introducing new messages like
> > setupmapping/removemapping between virtiofsd and qemu.
> 
> Yes, interesting idea!
> 
> When virtiofsd is unable to map the virtqueue iovecs due to addresses
> outside guest RAM, it could forward READ/WRITE requests to QEMU along
> with the file descriptor.  It would be slow but fixes the problem.
> 
> Implementing this is a little tricky because the libvhost-user code
> probably fails before fuse_lowlevel.c is able to parse the FUSE request
> header.  It will require reworking libvhost-user and fuse_virtio.c code,
> I think.

Yes, this doesn't look too bad; I need to tweak
   vu_queue_pop->vu_queue_map_desc->virtqueue_map_desc
to give back a list of the unmappable parts of the iovec
(assuming that the first few elements of the iovec are
mappable and the rest aren't, and not allowing weird mixes).
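
Something along these lines -- the structure and names are entirely made up,
just to show the shape of the change:

    #include <stdint.h>
    #include <sys/uio.h>

    /* Hypothetical result of a modified vu_queue_pop(): the descriptors
     * that mapped normally, plus the trailing run that could not be. */
    typedef struct {
        struct iovec *mapped;        /* host-addressable elements */
        unsigned      num_mapped;
        uint64_t      unmapped_gpa;  /* guest-physical start of the rest */
        uint64_t      unmapped_len;  /* 0 when everything mapped */
    } PopResult;

Restricting it to "a mappable prefix followed by one unmappable tail" keeps
the bookkeeping simple; a request that interleaves RAM and DAX-window
buffers would just be rejected.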

One thing that worries me a bit is that if we do a read() or write()
in the qemu code, it might block on the mmap'd backing file.

Dave

> Stefan


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Virtio-fs] [PATCH 3/4] virtiofsd: use file-backend memory region for virtiofsd's cache area
  2019-05-01 18:59               ` Dr. David Alan Gilbert
@ 2019-05-02 11:46                 ` Stefan Hajnoczi
  0 siblings, 0 replies; 31+ messages in thread
From: Stefan Hajnoczi @ 2019-05-02 11:46 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: virtio-fs


On Wed, May 01, 2019 at 07:59:17PM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > On Thu, Apr 25, 2019 at 05:21:58PM -0400, Vivek Goyal wrote:
> > > On Thu, Apr 25, 2019 at 03:33:23PM +0100, Stefan Hajnoczi wrote:
> > > > On Tue, Apr 23, 2019 at 11:49:15AM -0700, Liu Bo wrote:
> > > > > On Tue, Apr 23, 2019 at 01:09:19PM +0100, Stefan Hajnoczi wrote:
> > > > > > On Wed, Apr 17, 2019 at 03:51:21PM +0100, Dr. David Alan Gilbert wrote:
> > > > > > > * Liu Bo (bo.liu@linux.alibaba.com) wrote:
> > > > > > > > From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> > > > > > > > 
> > > > > > > > When running xfstests test case generic/413, we found such issue:
> > > > > > > >     1, create a file in one virtiofsd mount point with dax enabled
> > > > > > > >     2, mmap this file, get virtual addr: A
> > > > > > > >     3, write(fd, A, len), here fd comes from another file in another
> > > > > > > >        virtiofsd mount point without dax enabled, also note here write(2)
> > > > > > > >        is direct io.
> > > > > > > >     4, this direct io will hang forever, because the virtiofsd has crashed.
> > > > > > > > Here is the stack:
> > > > > > > > [  247.166276] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > > > > [  247.167171] t_mmap_dio      D    0  2335   2102 0x00000000
> > > > > > > > [  247.168006] Call Trace:
> > > > > > > > [  247.169067]  ? __schedule+0x3d0/0x830
> > > > > > > > [  247.170219]  schedule+0x32/0x80
> > > > > > > > [  247.171328]  schedule_timeout+0x1e2/0x350
> > > > > > > > [  247.172416]  ? fuse_direct_io+0x2e5/0x6b0 [fuse]
> > > > > > > > [  247.173516]  wait_for_completion+0x123/0x190
> > > > > > > > [  247.174593]  ? wake_up_q+0x70/0x70
> > > > > > > > [  247.175640]  fuse_direct_IO+0x265/0x310 [fuse]
> > > > > > > > [  247.176724]  generic_file_read_iter+0xaa/0xd20
> > > > > > > > [  247.177824]  fuse_file_read_iter+0x81/0x130 [fuse]
> > > > > > > > [  247.178938]  ? fuse_simple_request+0x104/0x1b0 [fuse]
> > > > > > > > [  247.180041]  ? fuse_fsync_common+0xad/0x240 [fuse]
> > > > > > > > [  247.181136]  __vfs_read+0x108/0x190
> > > > > > > > [  247.181930]  vfs_read+0x91/0x130
> > > > > > > > [  247.182671]  ksys_read+0x52/0xc0
> > > > > > > > [  247.183454]  do_syscall_64+0x55/0x170
> > > > > > > > [  247.184200]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > > > > > 
> > > > > > > > And virtiofsd crashed because vu_gpa_to_va() cannot handle the guest
> > > > > > > > physical address correctly. For a memory-mapped area in dax mode, the page
> > > > > > > > for this area actually points to virtiofsd's cache area, or rather to the
> > > > > > > > virtio pci device's cache bar. In qemu, this cache bar is currently
> > > > > > > > implemented with anonymous memory, and its address info is not passed to
> > > > > > > > the vhost-user backend, so vu_gpa_to_va() will fail.
> > > > > > > >
> > > > > > > > To fix this issue, we create this vhost cache area with a file-backed
> > > > > > > > memory area.
> > > > > > > 
> > > > > > > Thanks,
> > > > > > >   I know there was another case of the daemon trying to access the
> > > > > > > buffer that Stefan and Vivek hit, but fixed by persuading the kernel
> > > > > > > not to do it;  Stefan/Vivek: What do you think?
> > > > > > 
> > > > > > That case happened with cache=none and the dax mount option.
> > > > > > 
> > > > > > The general problem is when FUSE_READ/FUSE_WRITE is sent and the buffer
> > > > > > is outside guest RAM.
> > > 
> > > Stefan,
> > > 
> > > Can this be emulated by sending a request to qemu? If virtiofsd can detect
> > > that the source/destination of a READ/WRITE is not in guest RAM, can it
> > > forward the message to qemu (which has access to all the DAX windows) and
> > > let qemu do the operation?
> > >
> > > This probably will mean introducing new messages like
> > > setupmapping/removemapping between virtiofsd and qemu.
> > 
> > Yes, interesting idea!
> > 
> > When virtiofsd is unable to map the virtqueue iovecs due to addresses
> > outside guest RAM, it could forward READ/WRITE requests to QEMU along
> > with the file descriptor.  It would be slow but fixes the problem.
> > 
> > Implementing this is a little tricky because the libvhost-user code
> > probably fails before fuse_lowlevel.c is able to parse the FUSE request
> > header.  It will require reworking libvhost-user and fuse_virtio.c code,
> > I think.
> 
> Yes, this doesn't look too bad; I need to tweak
>    vu_queue_pop->vu_queue_map_desc->virtqueue_map_desc
> to give back a list of the unmappable parts of the iovec
> (assuming that the first few elements of the iovec are
> mappable and the rest aren't, and not allowing weird mixes).
> 
> One thing that worries me a bit is that if we do a read() or write()
> in the qemu code, it might block on the mmap'd backing file.

Even a memcpy (no read()/write()) could block if the page is not
resident :(.  We have to expect that this operation could block.

Stefan



* Re: [Virtio-fs] [PATCH 3/4] virtiofsd: use file-backend memory region for virtiofsd's cache area
  2019-04-26  9:05             ` Stefan Hajnoczi
  2019-05-01 18:59               ` Dr. David Alan Gilbert
@ 2019-05-18  2:28               ` Liu Bo
  2019-05-20 13:49                 ` Vivek Goyal
  2019-05-20 17:58                 ` Dr. David Alan Gilbert
  1 sibling, 2 replies; 31+ messages in thread
From: Liu Bo @ 2019-05-18  2:28 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, Vivek Goyal

On Fri, Apr 26, 2019 at 10:05:24AM +0100, Stefan Hajnoczi wrote:
> On Thu, Apr 25, 2019 at 05:21:58PM -0400, Vivek Goyal wrote:
> > On Thu, Apr 25, 2019 at 03:33:23PM +0100, Stefan Hajnoczi wrote:
> > > On Tue, Apr 23, 2019 at 11:49:15AM -0700, Liu Bo wrote:
> > > > On Tue, Apr 23, 2019 at 01:09:19PM +0100, Stefan Hajnoczi wrote:
> > > > > On Wed, Apr 17, 2019 at 03:51:21PM +0100, Dr. David Alan Gilbert wrote:
> > > > > > * Liu Bo (bo.liu@linux.alibaba.com) wrote:
> > > > > > > From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> > > > > > > 
> > > > > > > When running xfstests test case generic/413, we found such issue:
> > > > > > >     1, create a file in one virtiofsd mount point with dax enabled
> > > > > > >     2, mmap this file, get virtual addr: A
> > > > > > >     3, write(fd, A, len), here fd comes from another file in another
> > > > > > >        virtiofsd mount point without dax enabled, also note here write(2)
> > > > > > >        is direct io.
> > > > > > >     4, this direct io will hang forever, because the virtiofsd has crashed.
> > > > > > > Here is the stack:
> > > > > > > [  247.166276] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > > > [  247.167171] t_mmap_dio      D    0  2335   2102 0x00000000
> > > > > > > [  247.168006] Call Trace:
> > > > > > > [  247.169067]  ? __schedule+0x3d0/0x830
> > > > > > > [  247.170219]  schedule+0x32/0x80
> > > > > > > [  247.171328]  schedule_timeout+0x1e2/0x350
> > > > > > > [  247.172416]  ? fuse_direct_io+0x2e5/0x6b0 [fuse]
> > > > > > > [  247.173516]  wait_for_completion+0x123/0x190
> > > > > > > [  247.174593]  ? wake_up_q+0x70/0x70
> > > > > > > [  247.175640]  fuse_direct_IO+0x265/0x310 [fuse]
> > > > > > > [  247.176724]  generic_file_read_iter+0xaa/0xd20
> > > > > > > [  247.177824]  fuse_file_read_iter+0x81/0x130 [fuse]
> > > > > > > [  247.178938]  ? fuse_simple_request+0x104/0x1b0 [fuse]
> > > > > > > [  247.180041]  ? fuse_fsync_common+0xad/0x240 [fuse]
> > > > > > > [  247.181136]  __vfs_read+0x108/0x190
> > > > > > > [  247.181930]  vfs_read+0x91/0x130
> > > > > > > [  247.182671]  ksys_read+0x52/0xc0
> > > > > > > [  247.183454]  do_syscall_64+0x55/0x170
> > > > > > > [  247.184200]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > > > > 
> > > > > > > And virtiofsd crashed because vu_gpa_to_va() cannot handle the guest
> > > > > > > physical address correctly. For a memory-mapped area in dax mode, the page
> > > > > > > for this area actually points to virtiofsd's cache area, or rather to the
> > > > > > > virtio pci device's cache bar. In qemu, this cache bar is currently
> > > > > > > implemented with anonymous memory, and its address info is not passed to
> > > > > > > the vhost-user backend, so vu_gpa_to_va() will fail.
> > > > > > >
> > > > > > > To fix this issue, we create this vhost cache area with a file-backed
> > > > > > > memory area.
> > > > > > 
> > > > > > Thanks,
> > > > > >   I know there was another case of the daemon trying to access the
> > > > > > buffer that Stefan and Vivek hit, but fixed by persuading the kernel
> > > > > > not to do it;  Stefan/Vivek: What do you think?
> > > > > 
> > > > > That case happened with cache=none and the dax mount option.
> > > > > 
> > > > > The general problem is when FUSE_READ/FUSE_WRITE is sent and the buffer
> > > > > is outside guest RAM.
> > 
> > Stefan,
> > 
> > Can this be emulated by sending a request to qemu? If virtiofsd can detect
> > that the source/destination of a READ/WRITE is not in guest RAM, can it
> > forward the message to qemu (which has access to all the DAX windows) and
> > let qemu do the operation?
> >
> > This probably will mean introducing new messages like
> > setupmapping/removemapping between virtiofsd and qemu.
> 
> Yes, interesting idea!
> 
> When virtiofsd is unable to map the virtqueue iovecs due to addresses
> outside guest RAM, it could forward READ/WRITE requests to QEMU along
> with the file descriptor.  It would be slow but fixes the problem.
>

It is probably not easy to do.

Imagine the following case,
// foo1 is on a dax virtiofs, foo2 is on a nondax virtiofs

p = mmap(foo1, ...);
write(foo2, p, ...);
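
Spelled out as a runnable reproducer (error handling omitted; "foo1" and
"foo2" stand for files on the dax and non-dax mounts, and foo1 is assumed
to be at least a page long):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 4096;                          /* O_DIRECT-aligned */
        int fd1 = open("foo1", O_RDWR);             /* dax mount */
        int fd2 = open("foo2", O_RDWR | O_DIRECT);  /* non-dax mount */
        char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd1, 0);

        /* The source buffer lives in the DAX window, not in guest RAM:
         * this is the write(2) that kills the second virtiofsd. */
        write(fd2, p, len);
        return 0;
    }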

The virtiofsd serving foo2 needs to interpret a gpa from the virtiofs
where foo1 lives, along with foo1's fd, but a write fuse_req doesn't
carry foo1's fd.

And are you suggesting that qemu goes to read the data at the gpa and
returns it via a vhost-user message?  Or that this virtiofsd (foo2)
mmaps foo1 directly?

thanks,
-liubo

> Implementing this is a little tricky because the libvhost-user code
> probably fails before fuse_lowlevel.c is able to parse the FUSE request
> header.  It will require reworking libvhost-user and fuse_virtio.c code,
> I think.
> 
> Stefan



* Re: [Virtio-fs] [PATCH 3/4] virtiofsd: use file-backend memory region for virtiofsd's cache area
  2019-05-18  2:28               ` Liu Bo
@ 2019-05-20 13:49                 ` Vivek Goyal
  2019-05-20 18:33                   ` Liu Bo
  2019-05-20 17:58                 ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 31+ messages in thread
From: Vivek Goyal @ 2019-05-20 13:49 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Fri, May 17, 2019 at 07:28:22PM -0700, Liu Bo wrote:

[..]
> > > > > > That case happened with cache=none and the dax mount option.
> > > > > > 
> > > > > > The general problem is when FUSE_READ/FUSE_WRITE is sent and the buffer
> > > > > > is outside guest RAM.
> > > 
> > > Stefan,
> > > 
> > > Can this be emulated by sending a request to qemu? If virtiofsd can detect
> > > that the source/destination of a READ/WRITE is not in guest RAM, can it
> > > forward the message to qemu (which has access to all the DAX windows) and
> > > let qemu do the operation?
> > >
> > > This probably will mean introducing new messages like
> > > setupmapping/removemapping between virtiofsd and qemu.
> > 
> > Yes, interesting idea!
> > 
> > When virtiofsd is unable to map the virtqueue iovecs due to addresses
> > outside guest RAM, it could forward READ/WRITE requests to QEMU along
> > with the file descriptor.  It would be slow but fixes the problem.
> >
> 
> It is probably not easy to do.
> 
> Imagine the following case,
> // foo1 is on a dax virtiofs, foo2 is on a nondax virtiofs
> 
> p = mmap(foo1, ...);
> write(foo2, p, ...);
> 
> The virtiofsd serving foo2 needs to interpret a gpa from the virtiofs
> where foo1 lives, along with foo1's fd, but a write fuse_req doesn't
> carry foo1's fd.
> 
> And are you suggesting that qemu goes to read the data at the gpa and
> returns it via a vhost-user message?  Or that this virtiofsd (foo2)
> mmaps foo1 directly?

Hi Liu Bo,

Not sure I understand the question. Assuming the foo2 virtiofs instance
is mounted with cache=none, write(foo2) will trigger a FUSE_WRITE
request to the daemon with the source address being in the dax window of
the first virtiofs instance.

This request should be forwarded to qemu, which can just read from
foo1's mmaped area in qemu's address space and write to foo2. I am
assuming that before the FUSE_WRITE comes in, the mapping for foo1 has
already been established.

IOW, foo1 is already mmaped in qemu's address space, so there is no
need to pass foo1's fd in the FUSE_WRITE(foo2) request. Did I miss
something?

Vivek



* Re: [Virtio-fs] [PATCH 3/4] virtiofsd: use file-backend memory region for virtiofsd's cache area
  2019-05-18  2:28               ` Liu Bo
  2019-05-20 13:49                 ` Vivek Goyal
@ 2019-05-20 17:58                 ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 31+ messages in thread
From: Dr. David Alan Gilbert @ 2019-05-20 17:58 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs, Vivek Goyal

* Liu Bo (bo.liu@linux.alibaba.com) wrote:
> On Fri, Apr 26, 2019 at 10:05:24AM +0100, Stefan Hajnoczi wrote:
> > On Thu, Apr 25, 2019 at 05:21:58PM -0400, Vivek Goyal wrote:
> > > On Thu, Apr 25, 2019 at 03:33:23PM +0100, Stefan Hajnoczi wrote:
> > > > On Tue, Apr 23, 2019 at 11:49:15AM -0700, Liu Bo wrote:
> > > > > On Tue, Apr 23, 2019 at 01:09:19PM +0100, Stefan Hajnoczi wrote:
> > > > > > On Wed, Apr 17, 2019 at 03:51:21PM +0100, Dr. David Alan Gilbert wrote:
> > > > > > > * Liu Bo (bo.liu@linux.alibaba.com) wrote:
> > > > > > > > From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> > > > > > > > 
> > > > > > > > When running xfstests test case generic/413, we found such issue:
> > > > > > > >     1, create a file in one virtiofsd mount point with dax enabled
> > > > > > > >     2, mmap this file, get virtual addr: A
> > > > > > > >     3, write(fd, A, len), here fd comes from another file in another
> > > > > > > >        virtiofsd mount point without dax enabled, also note here write(2)
> > > > > > > >        is direct io.
> > > > > > > >     4, this direct io will hang forever, because the virtiofsd has crashed.
> > > > > > > > Here is the stack:
> > > > > > > > [  247.166276] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > > > > [  247.167171] t_mmap_dio      D    0  2335   2102 0x00000000
> > > > > > > > [  247.168006] Call Trace:
> > > > > > > > [  247.169067]  ? __schedule+0x3d0/0x830
> > > > > > > > [  247.170219]  schedule+0x32/0x80
> > > > > > > > [  247.171328]  schedule_timeout+0x1e2/0x350
> > > > > > > > [  247.172416]  ? fuse_direct_io+0x2e5/0x6b0 [fuse]
> > > > > > > > [  247.173516]  wait_for_completion+0x123/0x190
> > > > > > > > [  247.174593]  ? wake_up_q+0x70/0x70
> > > > > > > > [  247.175640]  fuse_direct_IO+0x265/0x310 [fuse]
> > > > > > > > [  247.176724]  generic_file_read_iter+0xaa/0xd20
> > > > > > > > [  247.177824]  fuse_file_read_iter+0x81/0x130 [fuse]
> > > > > > > > [  247.178938]  ? fuse_simple_request+0x104/0x1b0 [fuse]
> > > > > > > > [  247.180041]  ? fuse_fsync_common+0xad/0x240 [fuse]
> > > > > > > > [  247.181136]  __vfs_read+0x108/0x190
> > > > > > > > [  247.181930]  vfs_read+0x91/0x130
> > > > > > > > [  247.182671]  ksys_read+0x52/0xc0
> > > > > > > > [  247.183454]  do_syscall_64+0x55/0x170
> > > > > > > > [  247.184200]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > > > > > 
> > > > > > > > And virtiofsd crashed because vu_gpa_to_va() cannot handle the guest
> > > > > > > > physical address correctly. For a memory-mapped area in dax mode, the page
> > > > > > > > for this area actually points to virtiofsd's cache area, or rather to the
> > > > > > > > virtio pci device's cache bar. In qemu, this cache bar is currently
> > > > > > > > implemented with anonymous memory, and its address info is not passed to
> > > > > > > > the vhost-user backend, so vu_gpa_to_va() will fail.
> > > > > > > >
> > > > > > > > To fix this issue, we create this vhost cache area with a file-backed
> > > > > > > > memory area.
> > > > > > > 
> > > > > > > Thanks,
> > > > > > >   I know there was another case of the daemon trying to access the
> > > > > > > buffer that Stefan and Vivek hit, but fixed by persuading the kernel
> > > > > > > not to do it;  Stefan/Vivek: What do you think?
> > > > > > 
> > > > > > That case happened with cache=none and the dax mount option.
> > > > > > 
> > > > > > The general problem is when FUSE_READ/FUSE_WRITE is sent and the buffer
> > > > > > is outside guest RAM.
> > > 
> > > Stefan,
> > > 
> > > Can this be emulated by sending a request to qemu? If virtiofsd can detect
> > > that the source/destination of a READ/WRITE is not in guest RAM, can it
> > > forward the message to qemu (which has access to all the DAX windows) and
> > > let qemu do the operation?
> > >
> > > This probably will mean introducing new messages like
> > > setupmapping/removemapping between virtiofsd and qemu.
> > 
> > Yes, interesting idea!
> > 
> > When virtiofsd is unable to map the virtqueue iovecs due to addresses
> > outside guest RAM, it could forward READ/WRITE requests to QEMU along
> > with the file descriptor.  It would be slow but fixes the problem.
> >
> 
> It is probably not easy to do.
> 
> Imagine the following case,
> // foo1 is on a dax virtiofs, foo2 is on a nondax virtiofs
> 
> p = mmap(foo1, ...);
> write(foo2, p, ...);
> 
> The virtiofsd serving foo2 needs to interpret a gpa from the virtiofs
> where foo1 lives, along with foo1's fd, but a write fuse_req doesn't
> carry foo1's fd.
> 
> And are you suggesting that qemu goes to read the data at the gpa and
> returns it via a vhost-user message?  Or that this virtiofsd (foo2)
> mmaps foo1 directly?

I have a patchset I'm just tidying up that passes this case back to qemu
to handle.  I intend to post it by the end of the week.

What it does is that when the virtiofsd receives a read/write to an area
of memory that it doesn't have a mapping for, it forms a new slave
message back to qemu together with the fd, asking qemu to read/write at
the given GPA.  Then it's up to QEMU to deal with it.

That should work even if there are two separate daemons.

It's not a pretty solution; but I think it should work.
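
Roughly, the slave message carries the fd plus a (GPA, length, file offset,
direction) tuple -- the struct below is made up for illustration, not the
actual protocol addition:

    #include <stdint.h>

    /* Hypothetical vhost-user slave-channel payload: "QEMU, please do this
     * file I/O against a guest physical address I cannot map".  The fd
     * itself travels as SCM_RIGHTS ancillary data. */
    struct fs_io_request {
        uint64_t fd_offset;  /* offset in the file */
        uint64_t gpa;        /* guest physical address of the buffer */
        uint64_t len;        /* bytes to transfer */
        uint32_t flags;      /* direction: read into gpa or write from it */
    };

QEMU can service it because it has every device's DAX window mapped in its
own address space: it translates the gpa to a host address and does a plain
pread()/pwrite() on the received fd.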

Dave

> thanks,
> -liubo
> 
> > Implementing this is a little tricky because the libvhost-user code
> > probably fails before fuse_lowlevel.c is able to parse the FUSE request
> > header.  It will require reworking libvhost-user and fuse_virtio.c code,
> > I think.
> > 
> > Stefan
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Virtio-fs] [PATCH 3/4] virtiofsd: use file-backend memory region for virtiofsd's cache area
  2019-05-20 13:49                 ` Vivek Goyal
@ 2019-05-20 18:33                   ` Liu Bo
  2019-05-20 19:01                     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 31+ messages in thread
From: Liu Bo @ 2019-05-20 18:33 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs

On Mon, May 20, 2019 at 09:49:34AM -0400, Vivek Goyal wrote:
> On Fri, May 17, 2019 at 07:28:22PM -0700, Liu Bo wrote:
> 
> [..]
> > > > > > > That case happened with cache=none and the dax mount option.
> > > > > > > 
> > > > > > > The general problem is when FUSE_READ/FUSE_WRITE is sent and the buffer
> > > > > > > is outside guest RAM.
> > > > 
> > > > Stefan,
> > > > 
> > > > Can this be emulated by sending a request to qemu? If virtiofsd can detect
> > > > that the source/destination of a READ/WRITE is not in guest RAM, can it
> > > > forward the message to qemu (which has access to all the DAX windows) and
> > > > let qemu do the operation?
> > > >
> > > > This probably will mean introducing new messages like
> > > > setupmapping/removemapping between virtiofsd and qemu.
> > > 
> > > Yes, interesting idea!
> > > 
> > > When virtiofsd is unable to map the virtqueue iovecs due to addresses
> > > outside guest RAM, it could forward READ/WRITE requests to QEMU along
> > > with the file descriptor.  It would be slow but fixes the problem.
> > >
> > 
> > It is probably not easy to do.
> > 
> > Imagine the following case,
> > // foo1 is on a dax virtiofs, foo2 is on a nondax virtiofs
> > 
> > p = mmap(foo1, ...);
> > write(foo2, p, ...);
> > 
> > The virtiofsd serving foo2 needs to interpret a gpa from the virtiofs
> > where foo1 lives, along with foo1's fd, but a write fuse_req doesn't
> > carry foo1's fd.
> > 
> > And are you suggesting that qemu goes to read the data at the gpa and
> > returns it via a vhost-user message?  Or that this virtiofsd (foo2)
> > mmaps foo1 directly?
> 
> Hi Liu Bo,
> 
> Not sure I understand the question. Assuming the foo2 virtiofs instance
> is mounted with cache=none, write(foo2) will trigger a FUSE_WRITE
> request to the daemon with the source address being in the dax window of
> the first virtiofs instance.
> 
> This request should be forwarded to qemu, which can just read from
> foo1's mmaped area in qemu's address space and write to foo2. I am
> assuming that before the FUSE_WRITE comes in, the mapping for foo1 has
> already been established.

Before sending any message to qemu, fv_queue_thread() will fail at
vu_queue_pop()->vu_queue_map_desc()->virtqueue_map_desc()->vu_gpa_to_va(),
because virtiofsd (where foo2 lives) doesn't have the information about
foo1's mapping within its address space.

It could be reproduced by fstests/generic/413.

thanks,
-liubo

> 
> IOW, foo1 is already mmaped in qemu's address space, so there is no
> need to pass foo1's fd in the FUSE_WRITE(foo2) request. Did I miss
> something?
> 
> Vivek



* Re: [Virtio-fs] [PATCH 3/4] virtiofsd: use file-backend memory region for virtiofsd's cache area
  2019-05-20 18:33                   ` Liu Bo
@ 2019-05-20 19:01                     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 31+ messages in thread
From: Dr. David Alan Gilbert @ 2019-05-20 19:01 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs, Vivek Goyal

* Liu Bo (bo.liu@linux.alibaba.com) wrote:
> On Mon, May 20, 2019 at 09:49:34AM -0400, Vivek Goyal wrote:
> > On Fri, May 17, 2019 at 07:28:22PM -0700, Liu Bo wrote:
> > 
> > [..]
> > > > > > > > That case happened with cache=none and the dax mount option.
> > > > > > > > 
> > > > > > > > The general problem is when FUSE_READ/FUSE_WRITE is sent and the buffer
> > > > > > > > is outside guest RAM.
> > > > > 
> > > > > Stefan,
> > > > > 
> > > > > Can this be emulated by sending a request to qemu? If virtiofsd can detect
> > > > > that the source/destination of a READ/WRITE is not in guest RAM, can it
> > > > > forward the message to qemu (which has access to all the DAX windows) and
> > > > > let qemu do the operation?
> > > > >
> > > > > This probably will mean introducing new messages like
> > > > > setupmapping/removemapping between virtiofsd and qemu.
> > > > 
> > > > Yes, interesting idea!
> > > > 
> > > > When virtiofsd is unable to map the virtqueue iovecs due to addresses
> > > > outside guest RAM, it could forward READ/WRITE requests to QEMU along
> > > > with the file descriptor.  It would be slow but fixes the problem.
> > > >
> > > 
> > > It is probably not easy to do.
> > > 
> > > Imagine the following case,
> > > // foo1 is on a dax virtiofs, foo2 is on a nondax virtiofs
> > > 
> > > p = mmap(foo1, ...);
> > > write(foo2, p, ...);
> > > 
> > > The virtiofsd serving foo2 needs to interpret a gpa from the virtiofs
> > > where foo1 lives, along with foo1's fd, but a write fuse_req doesn't
> > > carry foo1's fd.
> > > 
> > > And are you suggesting that qemu goes to read the data at the gpa and
> > > returns it via a vhost-user message?  Or that this virtiofsd (foo2)
> > > mmaps foo1 directly?
> > 
> > Hi Liu Bo,
> > 
> > Not sure I understand the question. Assuming the foo2 virtiofs instance
> > is mounted with cache=none, write(foo2) will trigger a FUSE_WRITE
> > request to the daemon with the source address being in the dax window of
> > the first virtiofs instance.
> > 
> > This request should be forwarded to qemu, which can just read from
> > foo1's mmaped area in qemu's address space and write to foo2. I am
> > assuming that before the FUSE_WRITE comes in, the mapping for foo1 has
> > already been established.
> 
> Before sending any message to qemu, fv_queue_thread() will fail at
> vu_queue_pop()->vu_queue_map_desc()->virtqueue_map_desc()->vu_gpa_to_va(),
> because virtiofsd (where foo2 lives) doesn't have the information about
> foo1's mapping within its address space.

The trick is to bounce the problem back to qemu: as soon as the mapping
fails I set a flag saying so, and that causes the request to be sent to
qemu via a slave command - and qemu has all of the mappings, so it can
read/write to all of the fs's.

Dave

> It could be reproduced by fstests/generic/413.
> 
> thanks,
> -liubo
> 
> > 
> > IOW, foo1 is already mmaped in qemu's address space, so there is no
> > need to pass foo1's fd in the FUSE_WRITE(foo2) request. Did I miss
> > something?
> > 
> > Vivek
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



end of thread

Thread overview: 31+ messages
2019-04-16 19:08 [Virtio-fs] [PATCH 0/4] virtiofsd fixes Liu Bo
2019-04-16 19:08 ` [Virtio-fs] [PATCH 1/4] virtiofsd: send reply correctly on read failure Liu Bo
2019-04-17 17:26   ` Dr. David Alan Gilbert
2019-04-18 12:25     ` Dr. David Alan Gilbert
2019-04-18 18:14       ` Liu Bo
2019-04-18 19:05         ` Dr. David Alan Gilbert
2019-04-23  6:27   ` [Virtio-fs] [PATCH v2] " Eryu Guan
2019-04-30 14:58     ` Dr. David Alan Gilbert
2019-04-16 19:08 ` [Virtio-fs] [PATCH 2/4] virtiofsd: support nanosecond resolution for file timestamp Liu Bo
2019-04-17 13:31   ` Dr. David Alan Gilbert
2019-04-16 19:08 ` [Virtio-fs] [PATCH 3/4] virtiofsd: use file-backend memory region for virtiofsd's cache area Liu Bo
2019-04-17 14:51   ` Dr. David Alan Gilbert
2019-04-23 12:09     ` Stefan Hajnoczi
2019-04-23 18:49       ` Liu Bo
2019-04-25 14:33         ` Stefan Hajnoczi
2019-04-25 21:21           ` Vivek Goyal
2019-04-26  9:05             ` Stefan Hajnoczi
2019-05-01 18:59               ` Dr. David Alan Gilbert
2019-05-02 11:46                 ` Stefan Hajnoczi
2019-05-18  2:28               ` Liu Bo
2019-05-20 13:49                 ` Vivek Goyal
2019-05-20 18:33                   ` Liu Bo
2019-05-20 19:01                     ` Dr. David Alan Gilbert
2019-05-20 17:58                 ` Dr. David Alan Gilbert
2019-04-16 19:08 ` [Virtio-fs] [PATCH 4/4] virtiofsd: use fallocate(2) instead posix_fallocate(3) Liu Bo
2019-04-17 13:18   ` Dr. David Alan Gilbert
2019-04-17 13:44     ` Miklos Szeredi
2019-04-17 14:05       ` Dr. David Alan Gilbert
2019-04-17 14:25         ` Miklos Szeredi
2019-04-17 14:29           ` Dr. David Alan Gilbert
2019-04-17 19:24             ` Liu Bo
