* [Virtio-fs] [PATCH 0/9] virtio-fs fixes
@ 2019-04-16 18:03 Liu Bo
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 1/9] virtio-fs: fix multiple tag support Liu Bo
                   ` (9 more replies)
  0 siblings, 10 replies; 38+ messages in thread
From: Liu Bo @ 2019-04-16 18:03 UTC (permalink / raw)
  To: virtio-fs

Hi,

This set contains a few fixes collected by running xfstests.

thanks,
liubo

Liu Bo (6):
  virtio-fs: fix multiple tag support
  virtio-fs: clean up dax mapping before aborting connection
  fuse: export fuse_drop_waiting()
  virtio-fs: let dax style override directIO style when dax+cache=none
  fuse: return early if punch_hole fails
  virtio-fs: honor RLIMIT_FSIZE in fuse_file_fallocate

Xiaoguang Wang (3):
  virtio-fs: fix use-after-free against virtio_fs_vq's fuse_dev info
  fuse: do not write whole page while page straddles i_size
  fuse: fix deadlock in __fuse_file_fallocate()

 fs/fuse/dev.c       |  3 ++-
 fs/fuse/file.c      | 31 +++++++++++++++++++++++--------
 fs/fuse/fuse_i.h    |  3 +++
 fs/fuse/inode.c     |  9 +++++++++
 fs/fuse/virtio_fs.c | 13 +++++++++++--
 5 files changed, 48 insertions(+), 11 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Virtio-fs] [PATCH 1/9] virtio-fs: fix multiple tag support
  2019-04-16 18:03 [Virtio-fs] [PATCH 0/9] virtio-fs fixes Liu Bo
@ 2019-04-16 18:03 ` Liu Bo
  2019-04-16 18:09   ` Vivek Goyal
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 2/9] virtio-fs: clean up dax mapping before aborting connection Liu Bo
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 38+ messages in thread
From: Liu Bo @ 2019-04-16 18:03 UTC (permalink / raw)
  To: virtio-fs

While memremapping the pci_dev's system bus address to a kernel virtual
address, we assign a wrong value to the %end of pgmap.res, which ends up
as a wrong resource size during memremap and further prevents the second
virtiofs pci device from being probed successfully.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
---
 fs/fuse/virtio_fs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index d669512..68fd11b 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -589,7 +589,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
 	pgmap->res = (struct resource){
 		.name = "virtio-fs dax window",
 		.start = (phys_addr_t) cache_reg.addr,
-		.end = (phys_addr_t) cache_reg.addr + cache_reg.len,
+		.end = (phys_addr_t) cache_reg.addr + cache_reg.len - 1,
 	};
 
 	fs->window_kaddr = devm_memremap_pages(&vdev->dev, pgmap);
-- 
1.8.3.1



* [Virtio-fs] [PATCH 2/9] virtio-fs: clean up dax mapping before aborting connection
  2019-04-16 18:03 [Virtio-fs] [PATCH 0/9] virtio-fs fixes Liu Bo
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 1/9] virtio-fs: fix multiple tag support Liu Bo
@ 2019-04-16 18:03 ` Liu Bo
  2019-04-16 18:18   ` Vivek Goyal
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 3/9] fuse: export fuse_drop_waiting() Liu Bo
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 38+ messages in thread
From: Liu Bo @ 2019-04-16 18:03 UTC (permalink / raw)
  To: virtio-fs

fstests generic/127 reported that we failed to remove dax mappings during
umount:

-----
run fstests generic/127 at 2019-03-20 04:54:35
fuse_removemapping_one request failed -107
Failed to removemapping. offset=0x200000 len=0x200000
-----

The root cause:
virtio_kill_sb
  -> fuse_kill_sb_anon
    -> fuse_sb_destroy
      -> fuse_abort_conn # fiq->connected = 0
  -> kill_anon_super
    -> fuse_evict_inode
      -> fuse_removemapping
        -> fuse_removemapping_one
           # get -ENOTCONN, bail out

We have to make sure that all dax mappings linked in the per-inode
interval tree are moved back to fc->busy_ranges in order to avoid a
mapping leak.

Since all dax mappings are tracked in fc->busy_ranges, force-free the
dax mappings before aborting the connection.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
---
 fs/fuse/fuse_i.h | 2 ++
 fs/fuse/inode.c  | 9 +++++++++
 2 files changed, 11 insertions(+)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 25e1e50..2c32846 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1202,4 +1202,6 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
 void fuse_dax_free_mem_worker(struct work_struct *work);
 void fuse_removemapping(struct inode *inode);
 
+int fuse_dax_free_memory(struct fuse_conn *fc, unsigned long nr_to_free);
+
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 6e96613..22d5f6a 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1378,6 +1378,15 @@ static void fuse_sb_destroy(struct super_block *sb)
 	if (fc) {
 		fuse_send_destroy(fc);
 
+		/* cleanup dax mapping ranges before aborting connection. */
+		spin_lock(&fc->lock);
+		if (fc->nr_busy_ranges) {
+			spin_unlock(&fc->lock);
+			fuse_dax_free_memory(fc, -1UL);
+		} else {
+			spin_unlock(&fc->lock);
+		}
+
 		fuse_abort_conn(fc, false);
 		fuse_wait_aborted(fc);
 
-- 
1.8.3.1



* [Virtio-fs] [PATCH 3/9] fuse: export fuse_drop_waiting()
  2019-04-16 18:03 [Virtio-fs] [PATCH 0/9] virtio-fs fixes Liu Bo
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 1/9] virtio-fs: fix multiple tag support Liu Bo
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 2/9] virtio-fs: clean up dax mapping before aborting connection Liu Bo
@ 2019-04-16 18:03 ` Liu Bo
  2019-04-16 18:28   ` Vivek Goyal
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 4/9] virtio-fs: fix use-after-free against virtio_fs_vq's fuse_dev info Liu Bo
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 38+ messages in thread
From: Liu Bo @ 2019-04-16 18:03 UTC (permalink / raw)
  To: virtio-fs

virtio-fs will need to refcount FORGET requests from outside
fs/fuse/dev.c.  Make the symbol visible.

Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
---
 fs/fuse/dev.c    | 3 ++-
 fs/fuse/fuse_i.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 07ae706..7e6afd0 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -128,7 +128,7 @@ static bool fuse_block_alloc(struct fuse_conn *fc, bool for_background)
 	return !fc->initialized || (for_background && fc->blocked);
 }
 
-static void fuse_drop_waiting(struct fuse_conn *fc)
+void fuse_drop_waiting(struct fuse_conn *fc)
 {
 	if (fc->connected) {
 		atomic_dec(&fc->num_waiting);
@@ -137,6 +137,7 @@ static void fuse_drop_waiting(struct fuse_conn *fc)
 		wake_up_all(&fc->blocked_waitq);
 	}
 }
+EXPORT_SYMBOL_GPL(fuse_drop_waiting);
 
 static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
 				       bool for_background)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 2c32846..171aee8 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1199,6 +1199,7 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
  * Get the next unique ID for a request
  */
 u64 fuse_get_unique(struct fuse_iqueue *fiq);
+void fuse_drop_waiting(struct fuse_conn *fc);
 void fuse_dax_free_mem_worker(struct work_struct *work);
 void fuse_removemapping(struct inode *inode);
 
-- 
1.8.3.1



* [Virtio-fs] [PATCH 4/9] virtio-fs: fix use-after-free against virtio_fs_vq's fuse_dev info
  2019-04-16 18:03 [Virtio-fs] [PATCH 0/9] virtio-fs fixes Liu Bo
                   ` (2 preceding siblings ...)
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 3/9] fuse: export fuse_drop_waiting() Liu Bo
@ 2019-04-16 18:03 ` Liu Bo
  2019-04-23 19:51   ` Vivek Goyal
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 5/9] fuse: do not write whole page while page straddles i_size Liu Bo
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 38+ messages in thread
From: Liu Bo @ 2019-04-16 18:03 UTC (permalink / raw)
  To: virtio-fs

From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

When running xfstests generic/135, we hit below panic:
[  376.386031] Workqueue: events virtio_fs_hiprio_done_work [virtio_fs]
[  376.386086] RIP: 0010:_raw_spin_lock+0xc/0x20
[  376.386132] Code: 00 f0 0f b1 17 75 02 f3 c3 89 c6 e8 6e b6 91 ff 66
90 c3 90 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 31 c0 ba 01 00 00
00 <f0> 0f b1 17 75 02 f3 c3 89 c6 e8 45 b6 91 ff 66 90 c3 66 90 66 66
[  376.386234] RSP: 0018:ffffc900024f7e50 EFLAGS: 00010246
[  376.386273] RAX: 0000000000000000 RBX: ffff888136c08000 RCX:
ffff88813b862520
[  376.386316] RDX: 0000000000000001 RSI: ffff88813b014ab0 RDI:
000000000000000c
[  376.386361] RBP: 000000000000000c R08: 000073746e657665 R09:
8080808080808080
[  376.386404] R10: 0000000000000000 R11: fffffffffffffff8 R12:
ffff88813b866600
[  376.386447] R13: 0000000000000000 R14: ffff888136eee180 R15:
ffff8881374bece8
[  376.386491] FS:  0000000000000000(0000) GS:ffff88813b840000(0000)
knlGS:0000000000000000
[  376.386539] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  376.386577] CR2: 000000000000000c CR3: 000000000200a003 CR4:
00000000000606e0
[  376.386624] Call Trace:
[  376.386693]  virtio_fs_hiprio_done_work+0x32/0xa0 [virtio_fs]
[  376.386745]  process_one_work+0x19e/0x380
[  376.386787]  worker_thread+0x4f/0x3b0
[  376.386821]  ? rescuer_thread+0x350/0x350
[  376.386858]  kthread+0xf6/0x130
[  376.386895]  ? kthread_create_worker_on_cpu+0x70/0x70
[  376.386933]  ret_from_fork+0x1f/0x30

After some investigation, the root cause is that FORGET requests in
virtio-fs are not refcounted, so virtio_fs_vq's fuse_dev info may be
freed in virtio_fs_free_devs() while there are still unfinished FORGET
requests; see the race below:

kernel issue FORGET request      |
                                 |  user issue umount operation
                                 |
                                 |  virtio_kill_sb
                                 |  ==> fuse_kill_sb_anon
                                 |  ====> fuse_sb_destroy
                                 |  ======> fuse_wait_aborted
                                 |         Note: here we don't wait for
                                 |         FORGET requests
                                 |  ==> virtio_fs_free_devs
                                 |      free fuse_dev info
                                 |
                                 |
FORGET request is issued to host |
                                 |
FORGET request is completed, and |
virtio_fs_hiprio_done_work() will|
be called, then use-after-free   |
against fuse_dev occurs.         |
                                 |

Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
---
 fs/fuse/virtio_fs.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 68fd11b..c55f1b1 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -181,6 +181,7 @@ static void virtio_fs_hiprio_done_work(struct work_struct *work)
 	struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
 						 done_work);
 	struct fuse_pqueue *fpq = &fsvq->fud->pq;
+	struct fuse_conn *fc = fsvq->fud->fc;
 	struct virtqueue *vq = fsvq->vq;
 
 	/* Free completed FUSE_FORGET requests */
@@ -191,8 +192,10 @@ static void virtio_fs_hiprio_done_work(struct work_struct *work)
 
 		virtqueue_disable_cb(vq);
 
-		while ((req = virtqueue_get_buf(vq, &len)) != NULL)
+		while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
 			kfree(req);
+			fuse_drop_waiting(fc);
+		}
 	} while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
 	spin_unlock(&fpq->lock);
 }
@@ -208,6 +211,7 @@ static void virtio_fs_hiprio_dispatch_work(struct work_struct *work)
 	struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
 						 dispatch_work.work);
 	struct fuse_pqueue *fpq = &fsvq->fud->pq;
+	struct fuse_conn *fc = fsvq->fud->fc;
 	struct virtqueue *vq = fsvq->vq;
 	struct scatterlist sg;
 	struct scatterlist *sgs[] = {&sg};
@@ -246,6 +250,7 @@ static void virtio_fs_hiprio_dispatch_work(struct work_struct *work)
 			spin_unlock(&fpq->lock);
 			return;
 		}
+		atomic_inc(&fc->num_waiting);
 
 		notify = virtqueue_kick_prepare(vq);
 		spin_unlock(&fpq->lock);
@@ -747,6 +752,7 @@ static int virtio_fs_restore(struct virtio_device *vdev)
 static void virtio_fs_wake_forget_and_unlock(struct fuse_iqueue *fiq)
 __releases(fiq->waitq.lock)
 {
+	struct fuse_conn *fc;
 	struct fuse_forget_link *link;
 	struct virtio_fs_forget *forget;
 	struct fuse_pqueue *fpq;
@@ -813,6 +819,9 @@ static void virtio_fs_wake_forget_and_unlock(struct fuse_iqueue *fiq)
 		goto out;
 	}
 
+	fc = fsvq->fud->fc;
+	atomic_inc(&fc->num_waiting);
+
 	notify = virtqueue_kick_prepare(vq);
 
 	spin_unlock(&fpq->lock);
-- 
1.8.3.1



* [Virtio-fs] [PATCH 5/9] fuse: do not write whole page while page straddles i_size
  2019-04-16 18:03 [Virtio-fs] [PATCH 0/9] virtio-fs fixes Liu Bo
                   ` (3 preceding siblings ...)
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 4/9] virtio-fs: fix use-after-free against virtio_fs_vq's fuse_dev info Liu Bo
@ 2019-04-16 18:03 ` Liu Bo
  2019-04-16 20:16   ` Vivek Goyal
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 6/9] virtio-fs: let dax style override directIO style when dax+cache=none Liu Bo
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 38+ messages in thread
From: Liu Bo @ 2019-04-16 18:03 UTC (permalink / raw)
  To: virtio-fs

From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

If a page straddles i_size and we write back the whole page, the fuse
user-space filesystem may extend the file size, which confuses users.
Before this patch:
xfs_io -t -f \
        -c "truncate 5120" \
        -c "pwrite -S 0x58 0 5120"      \
        -c "mmap -rw 0 5120"            \
        -c "mwrite -S 0x59 2048 3072" \
        -c "close"      \
        testfile
testfile's length ends up as 8192 bytes; with this patch, testfile's
length is 5120 bytes.

Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
---
 fs/fuse/file.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index bb45acc..c6090f5 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2293,6 +2293,8 @@ static int fuse_writepages_fill(struct page *page,
 	struct page *tmp_page;
 	bool is_writeback;
 	int err;
+	loff_t size;
+	unsigned int len;
 
 	if (!data->ff) {
 		err = -EIO;
@@ -2364,7 +2366,12 @@ static int fuse_writepages_fill(struct page *page,
 	copy_highpage(tmp_page, page);
 	req->pages[req->num_pages] = tmp_page;
 	req->page_descs[req->num_pages].offset = 0;
-	req->page_descs[req->num_pages].length = PAGE_SIZE;
+	size = i_size_read(inode);
+	if (page->index == size >> PAGE_SHIFT)
+		len = size & ~PAGE_MASK;
+	else
+		len = PAGE_SIZE;
+	req->page_descs[req->num_pages].length = len;
 
 	inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
 	inc_node_page_state(tmp_page, NR_WRITEBACK_TEMP);
-- 
1.8.3.1



* [Virtio-fs] [PATCH 6/9] virtio-fs: let dax style override directIO style when dax+cache=none
  2019-04-16 18:03 [Virtio-fs] [PATCH 0/9] virtio-fs fixes Liu Bo
                   ` (4 preceding siblings ...)
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 5/9] fuse: do not write whole page while page straddles i_size Liu Bo
@ 2019-04-16 18:03 ` Liu Bo
  2019-04-16 19:38   ` Vivek Goyal
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 7/9] fuse: return early if punch_hole fails Liu Bo
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 38+ messages in thread
From: Liu Bo @ 2019-04-16 18:03 UTC (permalink / raw)
  To: virtio-fs

In the dax+cache=none case, mmap prefers the dax path over the directIO
path, while read/write do not; there seems to be no reason for the
inconsistency.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
---
 fs/fuse/file.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index c6090f5..620326e 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1158,12 +1158,12 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	struct inode *inode = file->f_mapping->host;
 	struct fuse_conn *fc = get_fuse_conn(inode);
 
-	if (ff->open_flags & FOPEN_DIRECT_IO)
-		return fuse_direct_read_iter(iocb, to);
-
 	if (IS_DAX(inode))
 		return fuse_dax_read_iter(iocb, to);
 
+	if (ff->open_flags & FOPEN_DIRECT_IO)
+		return fuse_direct_read_iter(iocb, to);
+
 	/*
 	 * In auto invalidate mode, always update attributes on read.
 	 * Otherwise, only update if we attempt to read past EOF (to ensure
@@ -1426,11 +1426,12 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	ssize_t err;
 	loff_t endbyte = 0;
 
-	if (ff->open_flags & FOPEN_DIRECT_IO)
-		return fuse_direct_write_iter(iocb, from);
 	if (IS_DAX(inode))
 		return fuse_dax_write_iter(iocb, from);
 
+	if (ff->open_flags & FOPEN_DIRECT_IO)
+		return fuse_direct_write_iter(iocb, from);
+
 	if (get_fuse_conn(inode)->writeback_cache) {
 		/* Update size (EOF optimization) and mode (SUID clearing) */
 		err = fuse_update_attributes(mapping->host, file);
-- 
1.8.3.1



* [Virtio-fs] [PATCH 7/9] fuse: return early if punch_hole fails
  2019-04-16 18:03 [Virtio-fs] [PATCH 0/9] virtio-fs fixes Liu Bo
                   ` (5 preceding siblings ...)
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 6/9] virtio-fs: let dax style override directIO style when dax+cache=none Liu Bo
@ 2019-04-16 18:03 ` Liu Bo
  2019-04-16 19:51   ` Vivek Goyal
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 8/9] virtio-fs: honor RLIMIT_FSIZE in fuse_file_fallocate Liu Bo
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 38+ messages in thread
From: Liu Bo @ 2019-04-16 18:03 UTC (permalink / raw)
  To: virtio-fs

We don't have to clear FUSE_I_SIZE_UNSTABLE if punch_hole fails at
filemap_write_and_wait_range(), since the bit has not been set yet.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
---
 fs/fuse/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 620326e..712fd97 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3585,7 +3585,7 @@ static long __fuse_file_fallocate(struct file *file, int mode,
 		err = filemap_write_and_wait_range(inode->i_mapping, offset,
 							endbyte);
 		if (err)
-			goto out;
+			return err;
 		fuse_sync_writes(inode);
 	}
 
-- 
1.8.3.1



* [Virtio-fs] [PATCH 8/9] virtio-fs: honor RLIMIT_FSIZE in fuse_file_fallocate
  2019-04-16 18:03 [Virtio-fs] [PATCH 0/9] virtio-fs fixes Liu Bo
                   ` (6 preceding siblings ...)
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 7/9] fuse: return early if punch_hole fails Liu Bo
@ 2019-04-16 18:03 ` Liu Bo
  2019-04-16 19:57   ` Vivek Goyal
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 9/9] fuse: fix deadlock in __fuse_file_fallocate() Liu Bo
  2019-04-24 18:41 ` [Virtio-fs] [PATCH 0/9] virtio-fs fixes Vivek Goyal
  9 siblings, 1 reply; 38+ messages in thread
From: Liu Bo @ 2019-04-16 18:03 UTC (permalink / raw)
  To: virtio-fs

generic/228 reported that fuse fallocate does not honor the file size
limit set via 'ulimit -f' (RLIMIT_FSIZE).

This adds the necessary inode_newsize_ok() check.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
---
 fs/fuse/file.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 712fd97..6ab23d7 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3589,6 +3589,13 @@ static long __fuse_file_fallocate(struct file *file, int mode,
 		fuse_sync_writes(inode);
 	}
 
+	if (!(mode & FALLOC_FL_KEEP_SIZE) &&
+	    offset + length > i_size_read(inode)) {
+		err = inode_newsize_ok(inode, offset + length);
+		if (err)
+			return err;
+	}
+
 	if (!(mode & FALLOC_FL_KEEP_SIZE))
 		set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
 
-- 
1.8.3.1



* [Virtio-fs] [PATCH 9/9] fuse: fix deadlock in __fuse_file_fallocate()
  2019-04-16 18:03 [Virtio-fs] [PATCH 0/9] virtio-fs fixes Liu Bo
                   ` (7 preceding siblings ...)
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 8/9] virtio-fs: honor RLIMIT_FSIZE in fuse_file_fallocate Liu Bo
@ 2019-04-16 18:03 ` Liu Bo
  2019-04-16 18:07   ` Vivek Goyal
  2019-04-24 18:41 ` [Virtio-fs] [PATCH 0/9] virtio-fs fixes Vivek Goyal
  9 siblings, 1 reply; 38+ messages in thread
From: Liu Bo @ 2019-04-16 18:03 UTC (permalink / raw)
  To: virtio-fs

From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

__fuse_file_fallocate() takes fi->i_mmap_sem twice instead of releasing
it after truncate_pagecache_range(), which deadlocks.  Turn the second
down_write() into up_write().

Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
---
 fs/fuse/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6ab23d7..0236783 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3623,7 +3623,7 @@ static long __fuse_file_fallocate(struct file *file, int mode,
 	if (mode & FALLOC_FL_PUNCH_HOLE) {
 		down_write(&fi->i_mmap_sem);
 		truncate_pagecache_range(inode, offset, offset + length - 1);
-		down_write(&fi->i_mmap_sem);
+		up_write(&fi->i_mmap_sem);
 	}
 	fuse_invalidate_attr(inode);
 
-- 
1.8.3.1



* Re: [Virtio-fs] [PATCH 9/9] fuse: fix deadlock in __fuse_file_fallocate()
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 9/9] fuse: fix deadlock in __fuse_file_fallocate() Liu Bo
@ 2019-04-16 18:07   ` Vivek Goyal
  2019-05-02 22:10     ` Liu Bo
  0 siblings, 1 reply; 38+ messages in thread
From: Vivek Goyal @ 2019-04-16 18:07 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Wed, Apr 17, 2019 at 02:03:22AM +0800, Liu Bo wrote:
> From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> 
> This bug is obvious, fix it.
> 
> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

Thanks. This one I ran into late last week while testing
fallocate(PUNCH_HOLE) and fixed it already in my internal branch.

Vivek
> ---
>  fs/fuse/file.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 6ab23d7..0236783 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -3623,7 +3623,7 @@ static long __fuse_file_fallocate(struct file *file, int mode,
>  	if (mode & FALLOC_FL_PUNCH_HOLE) {
>  		down_write(&fi->i_mmap_sem);
>  		truncate_pagecache_range(inode, offset, offset + length - 1);
> -		down_write(&fi->i_mmap_sem);
> +		up_write(&fi->i_mmap_sem);
>  	}
>  	fuse_invalidate_attr(inode);
>  
> -- 
> 1.8.3.1
> 
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs



* Re: [Virtio-fs] [PATCH 1/9] virtio-fs: fix multiple tag support
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 1/9] virtio-fs: fix multiple tag support Liu Bo
@ 2019-04-16 18:09   ` Vivek Goyal
  0 siblings, 0 replies; 38+ messages in thread
From: Vivek Goyal @ 2019-04-16 18:09 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Wed, Apr 17, 2019 at 02:03:14AM +0800, Liu Bo wrote:
> While doing memremap from pci_dev's system bus address to kernel virtual
> address, we assign a wrong value to the %end of pgmap.res, which ends up
> with a wrong resource size in the process of memremap, and that further
> prevent the second virtiofs pci device from being probed successfully.
> 
> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>

Thanks. Already included in my internal branch.

Thanks
Vivek

> ---
>  fs/fuse/virtio_fs.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index d669512..68fd11b 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -589,7 +589,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
>  	pgmap->res = (struct resource){
>  		.name = "virtio-fs dax window",
>  		.start = (phys_addr_t) cache_reg.addr,
> -		.end = (phys_addr_t) cache_reg.addr + cache_reg.len,
> +		.end = (phys_addr_t) cache_reg.addr + cache_reg.len - 1,
>  	};
>  
>  	fs->window_kaddr = devm_memremap_pages(&vdev->dev, pgmap);
> -- 
> 1.8.3.1
> 



* Re: [Virtio-fs] [PATCH 2/9] virtio-fs: clean up dax mapping before aborting connection
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 2/9] virtio-fs: clean up dax mapping before aborting connection Liu Bo
@ 2019-04-16 18:18   ` Vivek Goyal
  2019-04-16 18:40     ` Liu Bo
  0 siblings, 1 reply; 38+ messages in thread
From: Vivek Goyal @ 2019-04-16 18:18 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Wed, Apr 17, 2019 at 02:03:15AM +0800, Liu Bo wrote:
> fstests generic/127 reported that we failed to remove dax mapping during
> umount,

Is it still a problem with 0.2? I have fixed it by calling
fuse_removemapping_one() only if connection is still there
(fc->connected).

I had talked to Miklos in the past and he preferred that the virtiofs
daemon have the capability to clean this up. It might happen that the
guest does a hard reboot and never cleans up the ranges.

I think David Gilbert has now taken care of doing cleanup on ->init,
and I think on ->destroy as well. I have not looked at the code closely.

Vivek

> 
> -----
> run fstests generic/127 at 2019-03-20 04:54:35
> fuse_removemapping_one request failed -107
> Failed to removemapping. offset=0x200000 len=0x200000
> -----
> 
> The root cause:
> virtio_kill_sb
>   -> fuse_kill_sb_anon
>     -> fuse_sb_destroy
>       -> fuse_abort_conn #fiq->connect = 0
>   -> kill_anon_super
>     -> fuse_evict_inode
>       -> fuse_removemapping
>         -> fuse_removemapping_one
>            # get -ENOTCONN, bail out
> 
> Unfortunately we have to make sure that all dax mappings linked in
> per-inode interval tree are moved back to fc->busy_ranges in order to
> avoid mapping leakage.
> 
> As we've tracked all dax mappings in fc->busy_ranges, this forces to free
> dax mappings before aborting connection.
> 
> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
> ---
>  fs/fuse/fuse_i.h | 2 ++
>  fs/fuse/inode.c  | 9 +++++++++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 25e1e50..2c32846 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -1202,4 +1202,6 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
>  void fuse_dax_free_mem_worker(struct work_struct *work);
>  void fuse_removemapping(struct inode *inode);
>  
> +int fuse_dax_free_memory(struct fuse_conn *fc, unsigned long nr_to_free);
> +
>  #endif /* _FS_FUSE_I_H */
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 6e96613..22d5f6a 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1378,6 +1378,15 @@ static void fuse_sb_destroy(struct super_block *sb)
>  	if (fc) {
>  		fuse_send_destroy(fc);
>  
> +		/* cleanup dax mapping ranges before aborting connection. */
> +		spin_lock(&fc->lock);
> +		if (fc->nr_busy_ranges) {
> +			spin_unlock(&fc->lock);
> +			fuse_dax_free_memory(fc, -1UL);
> +		} else {
> +			spin_unlock(&fc->lock);
> +		}
> +
>  		fuse_abort_conn(fc, false);
>  		fuse_wait_aborted(fc);
>  
> -- 
> 1.8.3.1
> 



* Re: [Virtio-fs] [PATCH 3/9] fuse: export fuse_drop_waiting()
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 3/9] fuse: export fuse_drop_waiting() Liu Bo
@ 2019-04-16 18:28   ` Vivek Goyal
  2019-04-16 18:43     ` Liu Bo
  0 siblings, 1 reply; 38+ messages in thread
From: Vivek Goyal @ 2019-04-16 18:28 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Wed, Apr 17, 2019 at 02:03:16AM +0800, Liu Bo wrote:
> virtio-fs will need to do ref count against FORGET requests from
> outside fs/fuse/dev.c.  Make the symbol visible.
> 

Who needs it? virtio-fs does not seem to be calling it?

Vivek

> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> ---
>  fs/fuse/dev.c    | 3 ++-
>  fs/fuse/fuse_i.h | 1 +
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 07ae706..7e6afd0 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -128,7 +128,7 @@ static bool fuse_block_alloc(struct fuse_conn *fc, bool for_background)
>  	return !fc->initialized || (for_background && fc->blocked);
>  }
>  
> -static void fuse_drop_waiting(struct fuse_conn *fc)
> +void fuse_drop_waiting(struct fuse_conn *fc)
>  {
>  	if (fc->connected) {
>  		atomic_dec(&fc->num_waiting);
> @@ -137,6 +137,7 @@ static void fuse_drop_waiting(struct fuse_conn *fc)
>  		wake_up_all(&fc->blocked_waitq);
>  	}
>  }
> +EXPORT_SYMBOL_GPL(fuse_drop_waiting);
>  
>  static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
>  				       bool for_background)
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 2c32846..171aee8 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -1199,6 +1199,7 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
>   * Get the next unique ID for a request
>   */
>  u64 fuse_get_unique(struct fuse_iqueue *fiq);
> +void fuse_drop_waiting(struct fuse_conn *fc);
>  void fuse_dax_free_mem_worker(struct work_struct *work);
>  void fuse_removemapping(struct inode *inode);
>  
> -- 
> 1.8.3.1
> 



* Re: [Virtio-fs] [PATCH 2/9] virtio-fs: clean up dax mapping before aborting connection
  2019-04-16 18:18   ` Vivek Goyal
@ 2019-04-16 18:40     ` Liu Bo
  0 siblings, 0 replies; 38+ messages in thread
From: Liu Bo @ 2019-04-16 18:40 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs

On Tue, Apr 16, 2019 at 02:18:28PM -0400, Vivek Goyal wrote:
> On Wed, Apr 17, 2019 at 02:03:15AM +0800, Liu Bo wrote:
> > fstests generic/127 reported that we failed to remove dax mapping during
> > umount,
> 
> Is it still a problem with 0.2? I have fixed it by calling
> fuse_removemapping_one() only if connection is still there
> (fc->connected).
>

Haven't tested generic/127 against 0.2, but I think the patch is still
valid (will verify later).

"-ENOTCONN" is returned by __fuse_request_send() because of
"fiq->connected being 0", not "fc->connected".

> Had talked to miklos in the past and he preferred that virtiofs
> daemon should have the capability to clean this up. It might
> happen that guest does hard reboot and never cleans up ranges.
>
> I think now David Gilbert has taken care of doing cleanup
> on ->init and I think ->destroy as well. Have not looked
> at the code closely.
> 

It makes sense if a hard reboot occurs, but generic/127 does a valid
umount, which IMO is responsible for cleaning up the dax mappings.

thanks,
-liubo

> Vivek
> 
> > 
> > -----
> > run fstests generic/127 at 2019-03-20 04:54:35
> > fuse_removemapping_one request failed -107
> > Failed to removemapping. offset=0x200000 len=0x200000
> > -----
> > 
> > The root cause:
> > virtio_kill_sb
> >   -> fuse_kill_sb_anon
> >     -> fuse_sb_destroy
> >       -> fuse_abort_conn #fiq->connect = 0
> >   -> kill_anon_super
> >     -> fuse_evict_inode
> >       -> fuse_removemapping
> >         -> fuse_removemapping_one
> >            # get -ENOTCONN, bail out
> > 
> > Unfortunately we have to make sure that all dax mappings linked in
> > per-inode interval tree are moved back to fc->busy_ranges in order to
> > avoid mapping leakage.
> > 
> > As we've tracked all dax mappings in fc->busy_ranges, this forces to free
> > dax mappings before aborting connection.
> > 
> > Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> > Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
> > ---
> >  fs/fuse/fuse_i.h | 2 ++
> >  fs/fuse/inode.c  | 9 +++++++++
> >  2 files changed, 11 insertions(+)
> > 
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 25e1e50..2c32846 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -1202,4 +1202,6 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
> >  void fuse_dax_free_mem_worker(struct work_struct *work);
> >  void fuse_removemapping(struct inode *inode);
> >  
> > +int fuse_dax_free_memory(struct fuse_conn *fc, unsigned long nr_to_free);
> > +
> >  #endif /* _FS_FUSE_I_H */
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index 6e96613..22d5f6a 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -1378,6 +1378,15 @@ static void fuse_sb_destroy(struct super_block *sb)
> >  	if (fc) {
> >  		fuse_send_destroy(fc);
> >  
> > +		/* cleanup dax mapping ranges before aborting connection. */
> > +		spin_lock(&fc->lock);
> > +		if (fc->nr_busy_ranges) {
> > +			spin_unlock(&fc->lock);
> > +			fuse_dax_free_memory(fc, -1UL);
> > +		} else {
> > +			spin_unlock(&fc->lock);
> > +		}
> > +
> >  		fuse_abort_conn(fc, false);
> >  		fuse_wait_aborted(fc);
> >  
> > -- 
> > 1.8.3.1
> > 
> > _______________________________________________
> > Virtio-fs mailing list
> > Virtio-fs@redhat.com
> > https://www.redhat.com/mailman/listinfo/virtio-fs



* Re: [Virtio-fs] [PATCH 3/9] fuse: export fuse_drop_waiting()
  2019-04-16 18:28   ` Vivek Goyal
@ 2019-04-16 18:43     ` Liu Bo
  0 siblings, 0 replies; 38+ messages in thread
From: Liu Bo @ 2019-04-16 18:43 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs

On Tue, Apr 16, 2019 at 02:28:49PM -0400, Vivek Goyal wrote:
> On Wed, Apr 17, 2019 at 02:03:16AM +0800, Liu Bo wrote:
> > virtio-fs will need to do ref count against FORGET requests from
> > outside fs/fuse/dev.c.  Make the symbol visible.
> > 
> 
> Who needs it? virtio-fs does not seem to be calling it?
>

Ah, this is a preparation for patch 4.

thanks,
-liubo

> Vivek
> 
> > Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
> > Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
> > Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> > ---
> >  fs/fuse/dev.c    | 3 ++-
> >  fs/fuse/fuse_i.h | 1 +
> >  2 files changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > index 07ae706..7e6afd0 100644
> > --- a/fs/fuse/dev.c
> > +++ b/fs/fuse/dev.c
> > @@ -128,7 +128,7 @@ static bool fuse_block_alloc(struct fuse_conn *fc, bool for_background)
> >  	return !fc->initialized || (for_background && fc->blocked);
> >  }
> >  
> > -static void fuse_drop_waiting(struct fuse_conn *fc)
> > +void fuse_drop_waiting(struct fuse_conn *fc)
> >  {
> >  	if (fc->connected) {
> >  		atomic_dec(&fc->num_waiting);
> > @@ -137,6 +137,7 @@ static void fuse_drop_waiting(struct fuse_conn *fc)
> >  		wake_up_all(&fc->blocked_waitq);
> >  	}
> >  }
> > +EXPORT_SYMBOL_GPL(fuse_drop_waiting);
> >  
> >  static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
> >  				       bool for_background)
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 2c32846..171aee8 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -1199,6 +1199,7 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
> >   * Get the next unique ID for a request
> >   */
> >  u64 fuse_get_unique(struct fuse_iqueue *fiq);
> > +void fuse_drop_waiting(struct fuse_conn *fc);
> >  void fuse_dax_free_mem_worker(struct work_struct *work);
> >  void fuse_removemapping(struct inode *inode);
> >  
> > -- 
> > 1.8.3.1
> > 
> > _______________________________________________
> > Virtio-fs mailing list
> > Virtio-fs@redhat.com
> > https://www.redhat.com/mailman/listinfo/virtio-fs



* Re: [Virtio-fs] [PATCH 6/9] virtio-fs: let dax style override directIO style when dax+cache=none
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 6/9] virtio-fs: let dax style override directIO style when dax+cache=none Liu Bo
@ 2019-04-16 19:38   ` Vivek Goyal
  2019-04-17  8:25     ` Miklos Szeredi
  0 siblings, 1 reply; 38+ messages in thread
From: Vivek Goyal @ 2019-04-16 19:38 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs, Miklos Szeredi

On Wed, Apr 17, 2019 at 02:03:19AM +0800, Liu Bo wrote:
> In case of dax+cache=none, mmap uses dax style prior to directIO style,
> while read/write don't, but it seems that there is no reason not to do so.
> 
> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

This is interesting; I was thinking about it just today. I noticed
that ext4 and xfs also check for a DAX inode first and use the dax path
if dax is enabled.

cache=never sets FOPEN_DIRECT_IO (even if the application never asked for
direct IO). If dax is enabled, for data it's equivalent to doing direct
IO. And for mmap() we are already checking for DAX first. So it makes
sense to do the same thing for the read/write path as well.

CCing Miklos as well. He might have some thoughts on this. I am curious
why he initially made this change only for mmap() and not for the
read/write paths.

Thanks
Vivek

> ---
>  fs/fuse/file.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index c6090f5..620326e 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1158,12 +1158,12 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  	struct inode *inode = file->f_mapping->host;
>  	struct fuse_conn *fc = get_fuse_conn(inode);
>  
> -	if (ff->open_flags & FOPEN_DIRECT_IO)
> -		return fuse_direct_read_iter(iocb, to);
> -
>  	if (IS_DAX(inode))
>  		return fuse_dax_read_iter(iocb, to);
>  
> +	if (ff->open_flags & FOPEN_DIRECT_IO)
> +		return fuse_direct_read_iter(iocb, to);
> +
>  	/*
>  	 * In auto invalidate mode, always update attributes on read.
>  	 * Otherwise, only update if we attempt to read past EOF (to ensure
> @@ -1426,11 +1426,12 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  	ssize_t err;
>  	loff_t endbyte = 0;
>  
> -	if (ff->open_flags & FOPEN_DIRECT_IO)
> -		return fuse_direct_write_iter(iocb, from);
>  	if (IS_DAX(inode))
>  		return fuse_dax_write_iter(iocb, from);
>  
> +	if (ff->open_flags & FOPEN_DIRECT_IO)
> +		return fuse_direct_write_iter(iocb, from);
> +
>  	if (get_fuse_conn(inode)->writeback_cache) {
>  		/* Update size (EOF optimization) and mode (SUID clearing) */
>  		err = fuse_update_attributes(mapping->host, file);
> -- 
> 1.8.3.1
> 
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs



* Re: [Virtio-fs] [PATCH 7/9] fuse: return early if punch_hole fails
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 7/9] fuse: return early if punch_hole fails Liu Bo
@ 2019-04-16 19:51   ` Vivek Goyal
  0 siblings, 0 replies; 38+ messages in thread
From: Vivek Goyal @ 2019-04-16 19:51 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Wed, Apr 17, 2019 at 02:03:20AM +0800, Liu Bo wrote:
> We don't have to clear FUSE_I_SIZE_UNSTABLE if punch_hole fails at
> filemap_write_and_wait_rang(), since the bit is not set yet.

Right. PUNCH_HOLE mandates setting FALLOC_FL_KEEP_SIZE as well. So we
will not clear FUSE_I_SIZE_UNSTABLE anyway.

        if (!(mode & FALLOC_FL_KEEP_SIZE))
                clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);

This does not seem like a very important change to make at this point in
time. I had taken this function out and it might be structured differently
down the line, so for the time being I will leave it out.

Vivek
> 
> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> ---
>  fs/fuse/file.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 620326e..712fd97 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -3585,7 +3585,7 @@ static long __fuse_file_fallocate(struct file *file, int mode,
>  		err = filemap_write_and_wait_range(inode->i_mapping, offset,
>  							endbyte);
>  		if (err)
> -			goto out;
> +			return err;
>  		fuse_sync_writes(inode);
>  	}
>  
> -- 
> 1.8.3.1
> 
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs



* Re: [Virtio-fs] [PATCH 8/9] virtio-fs: honor RLIMIT_FSIZE in fuse_file_fallocate
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 8/9] virtio-fs: honor RLIMIT_FSIZE in fuse_file_fallocate Liu Bo
@ 2019-04-16 19:57   ` Vivek Goyal
  0 siblings, 0 replies; 38+ messages in thread
From: Vivek Goyal @ 2019-04-16 19:57 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Wed, Apr 17, 2019 at 02:03:21AM +0800, Liu Bo wrote:
> generic/228 reported this failure that fuse fallocate does not honor what
> 'ulimit -f' has set.
> 
> This adds the necessary inode_newsize_ok() check.
> 
> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>

This sounds like a generic fallocate() fix needed in fuse as well? If
that's the case, it would be better to send it to Miklos directly and cc
the fuse mailing list.

Vivek

> ---
>  fs/fuse/file.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 712fd97..6ab23d7 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -3589,6 +3589,13 @@ static long __fuse_file_fallocate(struct file *file, int mode,
>  		fuse_sync_writes(inode);
>  	}
>  
> +	if (!(mode & FALLOC_FL_KEEP_SIZE) &&
> +	    offset + length > i_size_read(inode)) {
> +		err = inode_newsize_ok(inode, offset + length);
> +		if (err)
> +			return err;
> +	}
> +
>  	if (!(mode & FALLOC_FL_KEEP_SIZE))
>  		set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
>  
> -- 
> 1.8.3.1
> 
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs



* Re: [Virtio-fs] [PATCH 5/9] fuse: do not write whole page while page straddles i_size
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 5/9] fuse: do not write whole page while page straddles i_size Liu Bo
@ 2019-04-16 20:16   ` Vivek Goyal
  2019-04-17  0:12     ` Liu Bo
  0 siblings, 1 reply; 38+ messages in thread
From: Vivek Goyal @ 2019-04-16 20:16 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs, Miklos Szeredi

On Wed, Apr 17, 2019 at 02:03:18AM +0800, Liu Bo wrote:
> From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> 
> If page straddles i_size and we write the whole page, the fuse
> user-space filesystem may extend file size, it will confuse users.
> 
> Before this patch:
> xfs_io -t -f \
>         -c "truncate 5120" \
>         -c "pwrite -S 0x58 0 5120"      \
>         -c "mmap -rw 0 5120"            \
>         -c "mwrite -S 0x59 2048 3072" \
>         -c "close"      \
>         testfile
> testfile's length will be 8192 bytes, with this patch, testfile's
> length will be 5120 bytes.
> 
> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

[cc miklos]

This sounds like a generic fuse issue as well (and not virtio-fs specific)?
If yes, can you please send this fix separately to Miklos and the fuse
mailing list?

Vivek
> ---
>  fs/fuse/file.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index bb45acc..c6090f5 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -2293,6 +2293,8 @@ static int fuse_writepages_fill(struct page *page,
>  	struct page *tmp_page;
>  	bool is_writeback;
>  	int err;
> +	loff_t size;
> +	unsigned int len;
>  
>  	if (!data->ff) {
>  		err = -EIO;
> @@ -2364,7 +2366,12 @@ static int fuse_writepages_fill(struct page *page,
>  	copy_highpage(tmp_page, page);
>  	req->pages[req->num_pages] = tmp_page;
>  	req->page_descs[req->num_pages].offset = 0;
> -	req->page_descs[req->num_pages].length = PAGE_SIZE;
> +	size = i_size_read(inode);
> +	if (page->index == size >> PAGE_SHIFT)
> +		len = size & ~PAGE_MASK;
> +	else
> +		len = PAGE_SIZE;
> +	req->page_descs[req->num_pages].length = len;
>  
>  	inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
>  	inc_node_page_state(tmp_page, NR_WRITEBACK_TEMP);
> -- 
> 1.8.3.1
> 
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs



* Re: [Virtio-fs] [PATCH 5/9] fuse: do not write whole page while page straddles i_size
  2019-04-16 20:16   ` Vivek Goyal
@ 2019-04-17  0:12     ` Liu Bo
  0 siblings, 0 replies; 38+ messages in thread
From: Liu Bo @ 2019-04-17  0:12 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs, Miklos Szeredi

On Tue, Apr 16, 2019 at 04:16:21PM -0400, Vivek Goyal wrote:
> On Wed, Apr 17, 2019 at 02:03:18AM +0800, Liu Bo wrote:
> > From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> > 
> > If page straddles i_size and we write the whole page, the fuse
> > user-space filesystem may extend file size, it will confuse users.
> > 
> > Before this patch:
> > xfs_io -t -f \
> >         -c "truncate 5120" \
> >         -c "pwrite -S 0x58 0 5120"      \
> >         -c "mmap -rw 0 5120"            \
> >         -c "mwrite -S 0x59 2048 3072" \
> >         -c "close"      \
> >         testfile
> > testfile's length will be 8192 bytes, with this patch, testfile's
> > length will be 5120 bytes.
> > 
> > Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
> > Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> 
> [cc miklos]
> 
> This sounds like a generic fuse issue as well (and not virtio-fs specic)?
> If yes, can you please send this fix separately to miklos and fuse
> mailing list.

Sure, will do.

thanks,
-liubo

> 
> Vivek
> > ---
> >  fs/fuse/file.c | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index bb45acc..c6090f5 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -2293,6 +2293,8 @@ static int fuse_writepages_fill(struct page *page,
> >  	struct page *tmp_page;
> >  	bool is_writeback;
> >  	int err;
> > +	loff_t size;
> > +	unsigned int len;
> >  
> >  	if (!data->ff) {
> >  		err = -EIO;
> > @@ -2364,7 +2366,12 @@ static int fuse_writepages_fill(struct page *page,
> >  	copy_highpage(tmp_page, page);
> >  	req->pages[req->num_pages] = tmp_page;
> >  	req->page_descs[req->num_pages].offset = 0;
> > -	req->page_descs[req->num_pages].length = PAGE_SIZE;
> > +	size = i_size_read(inode);
> > +	if (page->index == size >> PAGE_SHIFT)
> > +		len = size & ~PAGE_MASK;
> > +	else
> > +		len = PAGE_SIZE;
> > +	req->page_descs[req->num_pages].length = len;
> >  
> >  	inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
> >  	inc_node_page_state(tmp_page, NR_WRITEBACK_TEMP);
> > -- 
> > 1.8.3.1
> > 
> > _______________________________________________
> > Virtio-fs mailing list
> > Virtio-fs@redhat.com
> > https://www.redhat.com/mailman/listinfo/virtio-fs



* Re: [Virtio-fs] [PATCH 6/9] virtio-fs: let dax style override directIO style when dax+cache=none
  2019-04-16 19:38   ` Vivek Goyal
@ 2019-04-17  8:25     ` Miklos Szeredi
  2019-04-17 19:35       ` Liu Bo
                         ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Miklos Szeredi @ 2019-04-17  8:25 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs

On Tue, Apr 16, 2019 at 9:38 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Wed, Apr 17, 2019 at 02:03:19AM +0800, Liu Bo wrote:
> > In case of dax+cache=none, mmap uses dax style prior to directIO style,
> > while read/write don't, but it seems that there is no reason not to do so.
> >
> > Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> > Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
>
> This is interesting. I was thinking about it today itself. I noticed
> that ext4 and xfs also check for DAX inode first and use dax path
> if dax is enabled.
>
> cache=never sets FOPEN_DIRECT_IO (even if application never asked for
> direct IO). If dax is enabled, for data its equivalent to doing direct
> IO. And for mmap() we are already checking for DAX first. So it makes
> sense to do same thing for read/write path as well.
>
> CCing Miklos as well. He might have some thougts on this. I am curios
> that initially whey did he make this change only for mmap() and not
> for read/write paths.

AFAIR the main reason was that we had performance issues with size
extending writes with dax.

There is also the question of mtime updates and atime updates, which
are handled properly with FOPEN_DIRECT_IO, but not with DAX I/O.

If we solve those issues then I think it makes sense to switch
read/write to using DAX.

Does that answer your question?

Thanks,
Miklos

>
> Thanks
> Vivek
>
> > ---
> >  fs/fuse/file.c | 11 ++++++-----
> >  1 file changed, 6 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index c6090f5..620326e 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -1158,12 +1158,12 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >       struct inode *inode = file->f_mapping->host;
> >       struct fuse_conn *fc = get_fuse_conn(inode);
> >
> > -     if (ff->open_flags & FOPEN_DIRECT_IO)
> > -             return fuse_direct_read_iter(iocb, to);
> > -
> >       if (IS_DAX(inode))
> >               return fuse_dax_read_iter(iocb, to);
> >
> > +     if (ff->open_flags & FOPEN_DIRECT_IO)
> > +             return fuse_direct_read_iter(iocb, to);
> > +
> >       /*
> >        * In auto invalidate mode, always update attributes on read.
> >        * Otherwise, only update if we attempt to read past EOF (to ensure
> > @@ -1426,11 +1426,12 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >       ssize_t err;
> >       loff_t endbyte = 0;
> >
> > -     if (ff->open_flags & FOPEN_DIRECT_IO)
> > -             return fuse_direct_write_iter(iocb, from);
> >       if (IS_DAX(inode))
> >               return fuse_dax_write_iter(iocb, from);
> >
> > +     if (ff->open_flags & FOPEN_DIRECT_IO)
> > +             return fuse_direct_write_iter(iocb, from);
> > +
> >       if (get_fuse_conn(inode)->writeback_cache) {
> >               /* Update size (EOF optimization) and mode (SUID clearing) */
> >               err = fuse_update_attributes(mapping->host, file);
> > --
> > 1.8.3.1
> >
> > _______________________________________________
> > Virtio-fs mailing list
> > Virtio-fs@redhat.com
> > https://www.redhat.com/mailman/listinfo/virtio-fs



* Re: [Virtio-fs] [PATCH 6/9] virtio-fs: let dax style override directIO style when dax+cache=none
  2019-04-17  8:25     ` Miklos Szeredi
@ 2019-04-17 19:35       ` Liu Bo
  2019-04-25 18:35         ` Vivek Goyal
  2019-04-17 20:56       ` Vivek Goyal
  2019-04-22 18:14       ` Vivek Goyal
  2 siblings, 1 reply; 38+ messages in thread
From: Liu Bo @ 2019-04-17 19:35 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: virtio-fs

On Wed, Apr 17, 2019 at 10:25:53AM +0200, Miklos Szeredi wrote:
> On Tue, Apr 16, 2019 at 9:38 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Wed, Apr 17, 2019 at 02:03:19AM +0800, Liu Bo wrote:
> > > In case of dax+cache=none, mmap uses dax style prior to directIO style,
> > > while read/write don't, but it seems that there is no reason not to do so.
> > >
> > > Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> > > Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
> >
> > This is interesting. I was thinking about it today itself. I noticed
> > that ext4 and xfs also check for DAX inode first and use dax path
> > if dax is enabled.
> >
> > cache=never sets FOPEN_DIRECT_IO (even if application never asked for
> > direct IO). If dax is enabled, for data its equivalent to doing direct
> > IO. And for mmap() we are already checking for DAX first. So it makes
> > sense to do same thing for read/write path as well.
> >
> > CCing Miklos as well. He might have some thougts on this. I am curios
> > that initially whey did he make this change only for mmap() and not
> > for read/write paths.
> 
> AFAIR the main reason was that we had performance issues with size
> extending writes with dax.
> 
> There is also the question of mtime updates and atime updates, which
> are handled properly with FOPEN_DIRECT_IO, but not with DAX I/O.

Looks like fuse_iget() sets the inode with NOATIME and NOCMTIME in the
"!fc->writeback_cache" case, which is true in dax mode, so do we still
care about time updates?

AFAICS, this metadata is maintained on the host side.

thanks,
-liubo

> 
> If we solve those issues then I think it makes sense to switch
> read/write to using DAX.
> 
> Does that answer your question?
> 
> Thanks,
> Miklos
> 
> >
> > Thanks
> > Vivek
> >
> > > ---
> > >  fs/fuse/file.c | 11 ++++++-----
> > >  1 file changed, 6 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > > index c6090f5..620326e 100644
> > > --- a/fs/fuse/file.c
> > > +++ b/fs/fuse/file.c
> > > @@ -1158,12 +1158,12 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > >       struct inode *inode = file->f_mapping->host;
> > >       struct fuse_conn *fc = get_fuse_conn(inode);
> > >
> > > -     if (ff->open_flags & FOPEN_DIRECT_IO)
> > > -             return fuse_direct_read_iter(iocb, to);
> > > -
> > >       if (IS_DAX(inode))
> > >               return fuse_dax_read_iter(iocb, to);
> > >
> > > +     if (ff->open_flags & FOPEN_DIRECT_IO)
> > > +             return fuse_direct_read_iter(iocb, to);
> > > +
> > >       /*
> > >        * In auto invalidate mode, always update attributes on read.
> > >        * Otherwise, only update if we attempt to read past EOF (to ensure
> > > @@ -1426,11 +1426,12 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > >       ssize_t err;
> > >       loff_t endbyte = 0;
> > >
> > > -     if (ff->open_flags & FOPEN_DIRECT_IO)
> > > -             return fuse_direct_write_iter(iocb, from);
> > >       if (IS_DAX(inode))
> > >               return fuse_dax_write_iter(iocb, from);
> > >
> > > +     if (ff->open_flags & FOPEN_DIRECT_IO)
> > > +             return fuse_direct_write_iter(iocb, from);
> > > +
> > >       if (get_fuse_conn(inode)->writeback_cache) {
> > >               /* Update size (EOF optimization) and mode (SUID clearing) */
> > >               err = fuse_update_attributes(mapping->host, file);
> > > --
> > > 1.8.3.1
> > >
> > > _______________________________________________
> > > Virtio-fs mailing list
> > > Virtio-fs@redhat.com
> > > https://www.redhat.com/mailman/listinfo/virtio-fs



* Re: [Virtio-fs] [PATCH 6/9] virtio-fs: let dax style override directIO style when dax+cache=none
  2019-04-17  8:25     ` Miklos Szeredi
  2019-04-17 19:35       ` Liu Bo
@ 2019-04-17 20:56       ` Vivek Goyal
  2019-04-22 18:55         ` Liu Bo
  2019-04-22 18:14       ` Vivek Goyal
  2 siblings, 1 reply; 38+ messages in thread
From: Vivek Goyal @ 2019-04-17 20:56 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: virtio-fs

On Wed, Apr 17, 2019 at 10:25:53AM +0200, Miklos Szeredi wrote:
> On Tue, Apr 16, 2019 at 9:38 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Wed, Apr 17, 2019 at 02:03:19AM +0800, Liu Bo wrote:
> > > In case of dax+cache=none, mmap uses dax style prior to directIO style,
> > > while read/write don't, but it seems that there is no reason not to do so.
> > >
> > > Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> > > Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
> >
> > This is interesting. I was thinking about it today itself. I noticed
> > that ext4 and xfs also check for DAX inode first and use dax path
> > if dax is enabled.
> >
> > cache=never sets FOPEN_DIRECT_IO (even if application never asked for
> > direct IO). If dax is enabled, for data its equivalent to doing direct
> > IO. And for mmap() we are already checking for DAX first. So it makes
> > sense to do same thing for read/write path as well.
> >
> > CCing Miklos as well. He might have some thougts on this. I am curios
> > that initially whey did he make this change only for mmap() and not
> > for read/write paths.
> 
> AFAIR the main reason was that we had performance issues with size
> extending writes with dax.

Finally I decided to do some measurements on the performance cost of file
extending writes. I wrote a small program that keeps writing 5 bytes at
the end of a file 16K times and measures the total time.

With cache=never and dax not enabled, it takes around 2.5 to 3 seconds.
With cache=never and dax enabled (and code modified to call dax path),
it takes around 12 to 13 seconds.

So the fallocate() path definitely seems to be 4-5 times slower. I tried
replacing fallocate() with a truncate operation, but that does not help
much either.

Part of the reason it is slow seems to be that the fallocate() operation
on the host itself is expensive. It took roughly 4 seconds to perform 16K
fallocate() requests, while it took only 100 us to perform 16K write
requests (as received by lo_write_buf()).

But that explains only about 4 seconds of the extra latency. Assuming the
fuse and virtio communication latency is the same for the two commands
(FUSE_WRITE, FUSE_FALLOCATE), I am not sure where the other 5-6 seconds
of latency comes from.

Apart from latency, fallocate() also has the issue that it is not atomic.

Thanks
Vivek



* Re: [Virtio-fs] [PATCH 6/9] virtio-fs: let dax style override directIO style when dax+cache=none
  2019-04-17  8:25     ` Miklos Szeredi
  2019-04-17 19:35       ` Liu Bo
  2019-04-17 20:56       ` Vivek Goyal
@ 2019-04-22 18:14       ` Vivek Goyal
  2 siblings, 0 replies; 38+ messages in thread
From: Vivek Goyal @ 2019-04-22 18:14 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: virtio-fs

On Wed, Apr 17, 2019 at 10:25:53AM +0200, Miklos Szeredi wrote:
> On Tue, Apr 16, 2019 at 9:38 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Wed, Apr 17, 2019 at 02:03:19AM +0800, Liu Bo wrote:
> > > In case of dax+cache=none, mmap uses dax style prior to directIO style,
> > > while read/write don't, but it seems that there is no reason not to do so.
> > >
> > > Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> > > Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
> >
> > This is interesting. I was thinking about it today itself. I noticed
> > that ext4 and xfs also check for DAX inode first and use dax path
> > if dax is enabled.
> >
> > cache=never sets FOPEN_DIRECT_IO (even if application never asked for
> > direct IO). If dax is enabled, for data its equivalent to doing direct
> > IO. And for mmap() we are already checking for DAX first. So it makes
> > sense to do same thing for read/write path as well.
> >
> > CCing Miklos as well. He might have some thougts on this. I am curios
> > that initially whey did he make this change only for mmap() and not
> > for read/write paths.
> 
> AFAIR the main reason was that we had performance issues with size
> extending writes with dax.

Hi Miklos,

How about falling back to sending a FUSE WRITE message to the fuse daemon
in case of extending writes (and not using DAX)? That would solve the
issue of the write and the i_size modification not being atomic, and would
also improve the performance of file extending writes.

I wrote following hack patch and this seems to work.

Thanks
Vivek


Subject: fuse, dax: Do not use dax for file extending writes

Right now we use dax for file extending writes and call fallocate() first
to make sure the file on the host is extended, so that a later write
through mmap() will not result in SIGBUS.

This approach has two problems. First of all, the write and the i_size
update are not atomic. That means that if the guest crashes after
fallocate(), trailing zeros can be exposed in the file, giving the
appearance that user data was lost. Secondly, calling fallocate() for
every extending write is very slow. Apart from the communication overhead,
fallocate() on the host is much slower than calling write on the host.

So do not use dax for file extending writes; instead just send a WRITE
message to the daemon (like we do for the direct I/O path), which should
solve both of the above problems.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/file.c |   67 ++++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 50 insertions(+), 17 deletions(-)

Index: rhvgoyal-linux-fuse/fs/fuse/file.c
===================================================================
--- rhvgoyal-linux-fuse.orig/fs/fuse/file.c	2019-04-18 16:08:19.048407845 -0400
+++ rhvgoyal-linux-fuse/fs/fuse/file.c	2019-04-22 13:22:20.939679793 -0400
@@ -1809,6 +1809,13 @@ static int fuse_iomap_begin(struct inode
 	pr_debug("fuse_iomap_begin() called. pos=0x%llx length=0x%llx\n",
 			pos, length);
 
+	/*
+	 * Writes beyond end of file are not handled using dax path. Instead
+	 * a fuse write message is sent to daemon
+	 */
+	if (flags & IOMAP_WRITE && pos >= i_size_read(inode))
+		return -EIO;
+
 	iomap->offset = pos;
 	iomap->flags = 0;
 	iomap->bdev = NULL;
@@ -1929,11 +1936,34 @@ static ssize_t fuse_dax_read_iter(struct
 	return ret;
 }
 
-static ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
+static bool file_extending_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	return (iov_iter_rw(from) == WRITE &&
+	    	((iocb->ki_pos) >= i_size_read(inode)));
+}
+
+static ssize_t fuse_dax_direct_write(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
+	struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(iocb);
 	ssize_t ret;
 
+	ret = fuse_direct_io(&io, from, &iocb->ki_pos, FUSE_DIO_WRITE);
+	if (ret < 0)
+		return ret;
+
+	fuse_invalidate_attr(inode);
+	fuse_write_update_size(inode, iocb->ki_pos);
+	return ret;
+}
+
+static ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	ssize_t ret, count;
+
 	if (iocb->ki_flags & IOCB_NOWAIT) {
 		if (!inode_trylock(inode))
 			return -EAGAIN;
@@ -1950,26 +1980,29 @@ static ssize_t fuse_dax_write_iter(struc
 		goto out;
 	/* TODO file_update_time() but we don't want metadata I/O */
 
-	/* TODO handle growing the file */
-	/* Grow file here if need be. iomap_begin() does not have access
-	 * to file pointer
-	 */
-	if (iov_iter_rw(from) == WRITE &&
-	    ((iocb->ki_pos + iov_iter_count(from)) > i_size_read(inode))) {
-		ret = __fuse_file_fallocate(iocb->ki_filp, 0, iocb->ki_pos,
-						iov_iter_count(from));
-		if (ret < 0) {
-			printk("fallocate(offset=0x%llx length=0x%zx)"
-			" failed. err=%zd\n", iocb->ki_pos,
-			iov_iter_count(from), ret);
-			goto out;
-		}
-		pr_debug("fallocate(offset=0x%llx length=0x%zx)"
-		" succeed. ret=%zd\n", iocb->ki_pos, iov_iter_count(from), ret);
+	/* Do not use dax for file-extending writes, as it's an mmap and
+	 * trying to write beyond the end of an existing page will generate
+	 * SIGBUS.
+	 */
+	if (file_extending_write(iocb, from)) {
+		ret = fuse_dax_direct_write(iocb, from);
+		goto out;
 	}
 
 	ret = dax_iomap_rw(iocb, from, &fuse_iomap_ops);
+	if (ret < 0)
+		goto out;
 
+	/*
+	 * If part of the write was file extending, fuse dax path will not
+	 * take care of that. Do direct write instead.
+	 */
+	if (iov_iter_count(from) && file_extending_write(iocb, from)) {
+		count = fuse_dax_direct_write(iocb, from);
+		if (count < 0)
+			goto out;
+		ret += count;
+	}
 out:
 	inode_unlock(inode);
 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Virtio-fs] [PATCH 6/9] virtio-fs: let dax style override directIO style when dax+cache=none
  2019-04-17 20:56       ` Vivek Goyal
@ 2019-04-22 18:55         ` Liu Bo
  0 siblings, 0 replies; 38+ messages in thread
From: Liu Bo @ 2019-04-22 18:55 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs, Miklos Szeredi

On Wed, Apr 17, 2019 at 04:56:53PM -0400, Vivek Goyal wrote:
> On Wed, Apr 17, 2019 at 10:25:53AM +0200, Miklos Szeredi wrote:
> > On Tue, Apr 16, 2019 at 9:38 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Wed, Apr 17, 2019 at 02:03:19AM +0800, Liu Bo wrote:
> > > > In case of dax+cache=none, mmap uses dax style prior to directIO style,
> > > > while read/write don't, but it seems that there is no reason not to do so.
> > > >
> > > > Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> > > > Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
> > >
> > > This is interesting. I was thinking about it today itself. I noticed
> > > that ext4 and xfs also check for DAX inode first and use dax path
> > > if dax is enabled.
> > >
> > > cache=never sets FOPEN_DIRECT_IO (even if application never asked for
> > > direct IO). If dax is enabled, for data its equivalent to doing direct
> > > IO. And for mmap() we are already checking for DAX first. So it makes
> > > sense to do same thing for read/write path as well.
> > >
> > > CCing Miklos as well. He might have some thoughts on this. I am curious
> > > why he initially made this change only for mmap() and not for the
> > > read/write paths.
> > 
> > AFAIR the main reason was that we had performance issues with size
> > extending writes with dax.
> 
> Finally I decided to do some measurement on performance cost of file
> extending writes. I wrote a small program to keep on writing 5 bytes
> at the end of file 16K times and measure total time.
> 
> With cache=never and dax not enabled, it takes around 2.5 to 3 seconds.
> With cache=never and dax enabled (and code modified to call dax path),
> it takes around 12 to 13 seconds.
> 
> So the fallocate() path definitely seems to be 4-5 times slower. I tried
> replacing fallocate() with a truncate operation, but that does not help
> much either.
> 
> Part of the reason it is slow seems to be that the fallocate() operation
> on the host itself is expensive. It roughly took 4 seconds to perform 16K
> fallocate() requests while it took only 100 us to perform 16K write
> requests (as received by lo_write_buf()).

Hmm, I suppose that fallocate(2) would be much faster than posix_fallocate(3),
as posix_fallocate(3) may write zeroes through the range while fallocate(2)
just allocates extents on the lower fs for the range.

thanks,
-liubo
> 
> But that explains only about 4 seconds of the extra latency. Assuming fuse
> and virtio communication latency is the same for the two commands
> (FUSE_WRITE, FUSE_FALLOCATE), I am not sure where the other 5-6 seconds
> of latency comes from.
> 
> Apart from latency, fallocate() also has the issue that it's not atomic.
> 
> Thanks
> Vivek


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Virtio-fs] [PATCH 4/9] virtio-fs: fix use-after-free against virtio_fs_vq's fuse_dev info
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 4/9] virtio-fs: fix use-after-free against virtio_fs_vq's fuse_dev info Liu Bo
@ 2019-04-23 19:51   ` Vivek Goyal
  2019-04-25  0:16     ` Liu Bo
  0 siblings, 1 reply; 38+ messages in thread
From: Vivek Goyal @ 2019-04-23 19:51 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

Hi,

fuse code does not really track forget requests. If anything has not been
sent to the fuse daemon, it is dropped when the connection is aborted.

So it makes sense not to wait for these forget requests even in the case
of virtio-fs. All we are doing in completion is freeing the request, and
the fuse device does not have to be around by then.

The problem seems to be the dependency on fud->fpq->lock. How about using
a different lock to provide locking around sending/receiving requests from
the vq? That way we don't have to take fpq->lock in this path, and this
problem should not happen.

Can you try the following patch and see if it fixes the problem?

Thanks
Vivek

Subject: virtio-fs: Use virtio_fs_vq->lock for locking vq

Instead of using fuse_pqueue->lock, use virtio_fs_vq->lock for locking. That
seems more logical as well as helps decouple fuse device and virtio_fs_vq.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/virtio_fs.c |   41 ++++++++++++++++++++---------------------
 1 file changed, 20 insertions(+), 21 deletions(-)

Index: rhvgoyal-linux-fuse/fs/fuse/virtio_fs.c
===================================================================
--- rhvgoyal-linux-fuse.orig/fs/fuse/virtio_fs.c	2019-04-16 11:34:34.805036786 -0400
+++ rhvgoyal-linux-fuse/fs/fuse/virtio_fs.c	2019-04-23 15:03:40.139017074 -0400
@@ -24,7 +24,8 @@ enum {
 
 /* Per-virtqueue state */
 struct virtio_fs_vq {
-	struct virtqueue *vq;     /* protected by fpq->lock */
+	spinlock_t lock;
+	struct virtqueue *vq;     /* protected by lock */
 	struct work_struct done_work;
 	struct list_head queued_reqs;
 	struct delayed_work dispatch_work;
@@ -180,11 +181,10 @@ static void virtio_fs_hiprio_done_work(s
 {
 	struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
 						 done_work);
-	struct fuse_pqueue *fpq = &fsvq->fud->pq;
 	struct virtqueue *vq = fsvq->vq;
 
 	/* Free completed FUSE_FORGET requests */
-	spin_lock(&fpq->lock);
+	spin_lock(&fsvq->lock);
 	do {
 		unsigned len;
 		void *req;
@@ -194,7 +194,7 @@ static void virtio_fs_hiprio_done_work(s
 		while ((req = virtqueue_get_buf(vq, &len)) != NULL)
 			kfree(req);
 	} while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
-	spin_unlock(&fpq->lock);
+	spin_unlock(&fsvq->lock);
 }
 
 static void virtio_fs_dummy_dispatch_work(struct work_struct *work)
@@ -207,7 +207,6 @@ static void virtio_fs_hiprio_dispatch_wo
 	struct virtio_fs_forget *forget;
 	struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
 						 dispatch_work.work);
-	struct fuse_pqueue *fpq = &fsvq->fud->pq;
 	struct virtqueue *vq = fsvq->vq;
 	struct scatterlist sg;
 	struct scatterlist *sgs[] = {&sg};
@@ -216,11 +215,11 @@ static void virtio_fs_hiprio_dispatch_wo
 
 	pr_debug("worker virtio_fs_hiprio_dispatch_work() called.\n");
 	while(1) {
-		spin_lock(&fpq->lock);
+		spin_lock(&fsvq->lock);
 		forget = list_first_entry_or_null(&fsvq->queued_reqs,
 					struct virtio_fs_forget, list);
 		if (!forget) {
-			spin_unlock(&fpq->lock);
+			spin_unlock(&fsvq->lock);
 			return;
 		}
 
@@ -243,12 +242,12 @@ static void virtio_fs_hiprio_dispatch_wo
 					 " err=%d. Dropping it.\n", ret);
 				kfree(forget);
 			}
-			spin_unlock(&fpq->lock);
+			spin_unlock(&fsvq->lock);
 			return;
 		}
 
 		notify = virtqueue_kick_prepare(vq);
-		spin_unlock(&fpq->lock);
+		spin_unlock(&fsvq->lock);
 
 		if (notify)
 			virtqueue_notify(vq);
@@ -335,7 +334,7 @@ static void virtio_fs_requests_done_work
 	LIST_HEAD(reqs);
 
 	/* Collect completed requests off the virtqueue */
-	spin_lock(&fpq->lock);
+	spin_lock(&fsvq->lock);
 	do {
 		unsigned len;
 
@@ -344,7 +343,7 @@ static void virtio_fs_requests_done_work
 		while ((req = virtqueue_get_buf(vq, &len)) != NULL)
 			list_move_tail(&req->list, &reqs);
 	} while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
-	spin_unlock(&fpq->lock);
+	spin_unlock(&fsvq->lock);
 
 	/* End requests */
 	list_for_each_entry_safe(req, next, &reqs, list) {
@@ -413,9 +412,11 @@ static int virtio_fs_setup_vqs(struct vi
 	INIT_LIST_HEAD(&fs->vqs[VQ_HIPRIO].queued_reqs);
 	INIT_DELAYED_WORK(&fs->vqs[VQ_HIPRIO].dispatch_work,
 			virtio_fs_hiprio_dispatch_work);
+	spin_lock_init(&fs->vqs[VQ_HIPRIO].lock);
 
 	/* Initialize the requests virtqueues */
 	for (i = VQ_REQUEST; i < fs->nvqs; i++) {
+		spin_lock_init(&fs->vqs[i].lock);
 		INIT_WORK(&fs->vqs[i].done_work, virtio_fs_requests_done_work);
 		INIT_DELAYED_WORK(&fs->vqs[i].dispatch_work,
 					virtio_fs_dummy_dispatch_work);
@@ -744,7 +745,6 @@ __releases(fiq->waitq.lock)
 {
 	struct fuse_forget_link *link;
 	struct virtio_fs_forget *forget;
-	struct fuse_pqueue *fpq;
 	struct scatterlist sg;
 	struct scatterlist *sgs[] = {&sg};
 	struct virtio_fs *fs;
@@ -788,8 +788,7 @@ __releases(fiq->waitq.lock)
 	/* Enqueue the request */
 	vq = fsvq->vq;
 	dev_dbg(&vq->vdev->dev, "%s\n", __func__);
-	fpq = vq_to_fpq(vq);
-	spin_lock(&fpq->lock);
+	spin_lock(&fsvq->lock);
 
 	ret = virtqueue_add_sgs(vq, sgs, 1, 0, forget, GFP_ATOMIC);
 	if (ret < 0) {
@@ -804,13 +803,13 @@ __releases(fiq->waitq.lock)
 				 " Dropping it.\n", ret);
 			kfree(forget);
 		}
-		spin_unlock(&fpq->lock);
+		spin_unlock(&fsvq->lock);
 		goto out;
 	}
 
 	notify = virtqueue_kick_prepare(vq);
 
-	spin_unlock(&fpq->lock);
+	spin_unlock(&fsvq->lock);
 
 	if (notify)
 		virtqueue_notify(vq);
@@ -903,7 +902,7 @@ static int virtio_fs_enqueue_req(struct
 	struct scatterlist stack_sg[ARRAY_SIZE(stack_sgs)];
 	struct scatterlist **sgs = stack_sgs;
 	struct scatterlist *sg = stack_sg;
-	struct fuse_pqueue *fpq;
+	struct virtio_fs_vq *fsvq;
 	unsigned argbuf_used = 0;
 	unsigned out_sgs = 0;
 	unsigned in_sgs = 0;
@@ -950,19 +949,19 @@ static int virtio_fs_enqueue_req(struct
 	for (i = 0; i < total_sgs; i++)
 		sgs[i] = &sg[i];
 
-	fpq = vq_to_fpq(vq);
-	spin_lock(&fpq->lock);
+	fsvq = vq_to_fsvq(vq);
+	spin_lock(&fsvq->lock);
 
 	ret = virtqueue_add_sgs(vq, sgs, out_sgs, in_sgs, req, GFP_ATOMIC);
 	if (ret < 0) {
 		/* TODO handle full virtqueue */
-		spin_unlock(&fpq->lock);
+		spin_unlock(&fsvq->lock);
 		goto out;
 	}
 
 	notify = virtqueue_kick_prepare(vq);
 
-	spin_unlock(&fpq->lock);
+	spin_unlock(&fsvq->lock);
 
 	if (notify)
 		virtqueue_notify(vq);


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Virtio-fs] [PATCH 0/9] virtio-fs fixes
  2019-04-16 18:03 [Virtio-fs] [PATCH 0/9] virtio-fs fixes Liu Bo
                   ` (8 preceding siblings ...)
  2019-04-16 18:03 ` [Virtio-fs] [PATCH 9/9] fuse: fix deadlock in __fuse_file_fallocate() Liu Bo
@ 2019-04-24 18:41 ` Vivek Goyal
  2019-04-24 23:12   ` Liu Bo
  9 siblings, 1 reply; 38+ messages in thread
From: Vivek Goyal @ 2019-04-24 18:41 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

Hi Liubo,

I have made some fixes, taken some of yours, and pushed the latest
snapshot of my internal tree here.

https://github.com/rhvgoyal/linux/commits/virtio-fs-dev-5.1

Patches have been rebased onto the 5.1-rc5 kernel. I am thinking of
updating this branch frequently with the latest code.

Vivek

On Wed, Apr 17, 2019 at 02:03:13AM +0800, Liu Bo wrote:

> Hi,
> 
> This set contains a few fixes collected by running xfstests.
> 
> thanks,
> liubo
> 
> Liu Bo (6):
>   virtio-fs: fix multiple tag support
>   virtio-fs: clean up dax mapping before aborting connection
>   fuse: export fuse_drop_waiting()
>   virtio-fs: let dax style override directIO style when dax+cache=none
>   fuse: return early if punch_hole fails
>   virtio-fs: honor RLIMIT_FSIZE in fuse_file_fallocate
> 
> Xiaoguang Wang (3):
>   virtio-fs: fix use-after-free against virtio_fs_vq's fuse_dev info
>   fuse: do not write whole page while page straddles i_size
>   fuse: fix deadlock in __fuse_file_fallocate()
> 
>  fs/fuse/dev.c       |  3 ++-
>  fs/fuse/file.c      | 31 +++++++++++++++++++++++--------
>  fs/fuse/fuse_i.h    |  3 +++
>  fs/fuse/inode.c     |  9 +++++++++
>  fs/fuse/virtio_fs.c | 13 +++++++++++--
>  5 files changed, 48 insertions(+), 11 deletions(-)
> 
> -- 
> 1.8.3.1
> 
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Virtio-fs] [PATCH 0/9] virtio-fs fixes
  2019-04-24 18:41 ` [Virtio-fs] [PATCH 0/9] virtio-fs fixes Vivek Goyal
@ 2019-04-24 23:12   ` Liu Bo
  2019-04-25 14:59     ` Vivek Goyal
  0 siblings, 1 reply; 38+ messages in thread
From: Liu Bo @ 2019-04-24 23:12 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs

Hi Vivek,

On Wed, Apr 24, 2019 at 02:41:30PM -0400, Vivek Goyal wrote:
> Hi Liubo,
> 
> I have made some fixes and took some of yours and pushed latest snapshot
> of my internal tree here.
> 
> https://github.com/rhvgoyal/linux/commits/virtio-fs-dev-5.1
> 
> Patches have been rebased to 5.1-rc5 kernel. I am thinking of updating
> this branch frequently with latest code.

With this branch, generic/476 still hangs, and yes, it's related to
"async page fault related events", just as you mentioned on #irc.

I confirmed this with kvm and kvmmmu tracepoints.

The tracepoints[1] showed that
[1]: https://paste.ubuntu.com/p/N9ngrthKCf/

---
handle_ept_violation
  kvm_mmu_page_fault(error_code=182)
    tdp_page_fault
      fast_page_fault # spte not present
      try_async_pf #queue a async_pf work and return RETRY

vcpu_run
 kvm_check_async_pf_completion
   kvm_arch_async_page_ready
     tdp_page_fault(vcpu, work->gva, 0, true);
       fast_page_fault(error_code == 0);
       try_async_pf # found hpa
       __direct_map()
	  set_spte(error_code == 0) # won't set the write bit

handle_ept_violation
  kvm_mmu_page_fault(error_code=1aa)
    tdp_page_fault
      fast_page_fault # spte present but no write bit
      try_async_pf # no hpa again queue a async_pf work and return RETRY
---


The host kernel in use is the latest 4.9 stable, and I've also tried the
5.0 kernel, but no luck. I've spent quite a while on this and was
wondering if this is a kvm issue. How did you solve it?

Please suggest.

thanks,
-liubo

> 
> Vivek
> 
> On Wed, Apr 17, 2019 at 02:03:13AM +0800, Liu Bo wrote:
> 
> > Hi,
> > 
> > This set contains a few fixes collected by running xfstests.
> > 
> > thanks,
> > liubo
> > 
> > Liu Bo (6):
> >   virtio-fs: fix multiple tag support
> >   virtio-fs: clean up dax mapping before aborting connection
> >   fuse: export fuse_drop_waiting()
> >   virtio-fs: let dax style override directIO style when dax+cache=none
> >   fuse: return early if punch_hole fails
> >   virtio-fs: honor RLIMIT_FSIZE in fuse_file_fallocate
> > 
> > Xiaoguang Wang (3):
> >   virtio-fs: fix use-after-free against virtio_fs_vq's fuse_dev info
> >   fuse: do not write whole page while page straddles i_size
> >   fuse: fix deadlock in __fuse_file_fallocate()
> > 
> >  fs/fuse/dev.c       |  3 ++-
> >  fs/fuse/file.c      | 31 +++++++++++++++++++++++--------
> >  fs/fuse/fuse_i.h    |  3 +++
> >  fs/fuse/inode.c     |  9 +++++++++
> >  fs/fuse/virtio_fs.c | 13 +++++++++++--
> >  5 files changed, 48 insertions(+), 11 deletions(-)
> > 
> > -- 
> > 1.8.3.1
> > 
> > _______________________________________________
> > Virtio-fs mailing list
> > Virtio-fs@redhat.com
> > https://www.redhat.com/mailman/listinfo/virtio-fs


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Virtio-fs] [PATCH 4/9] virtio-fs: fix use-after-free against virtio_fs_vq's fuse_dev info
  2019-04-23 19:51   ` Vivek Goyal
@ 2019-04-25  0:16     ` Liu Bo
  0 siblings, 0 replies; 38+ messages in thread
From: Liu Bo @ 2019-04-25  0:16 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs

On Tue, Apr 23, 2019 at 03:51:41PM -0400, Vivek Goyal wrote:
> Hi,
> 
> fuse code does not really track forget requests. If anything has not been
> sent to the fuse daemon, it is dropped when the connection is aborted.
> 
> So it makes sense not to wait for these forget requests even in the case
> of virtio-fs. All we are doing in completion is freeing the request, and
> the fuse device does not have to be around by then.
> 
> The problem seems to be the dependency on fud->fpq->lock. How about using
> a different lock to provide locking around sending/receiving requests from
> the vq? That way we don't have to take fpq->lock in this path, and this
> problem should not happen.
> 
> Can you try the following patch and see if it fixes the problem?
>

Hi Vivek,

This logic looks reasonable to me. I tested 100 times with this one and
no panic showed up.

As the original race is not 100% reproducible, I'm not able to run into
the panic after reverting
"virtio-fs: fix use-after-free against virtio_fs_vq's fuse_dev info".

Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>

thanks,
-liubo

> Thanks
> Vivek
> 
> Subject: virtio-fs: Use virtio_fs_vq->lock for locking vq
> 
> Instead of using fuse_pqueue->lock, use virtio_fs_vq->lock for locking. That
> seems more logical as well as helps decouple fuse device and virtio_fs_vq.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  fs/fuse/virtio_fs.c |   41 ++++++++++++++++++++---------------------
>  1 file changed, 20 insertions(+), 21 deletions(-)
> 
> Index: rhvgoyal-linux-fuse/fs/fuse/virtio_fs.c
> ===================================================================
> --- rhvgoyal-linux-fuse.orig/fs/fuse/virtio_fs.c	2019-04-16 11:34:34.805036786 -0400
> +++ rhvgoyal-linux-fuse/fs/fuse/virtio_fs.c	2019-04-23 15:03:40.139017074 -0400
> @@ -24,7 +24,8 @@ enum {
>  
>  /* Per-virtqueue state */
>  struct virtio_fs_vq {
> -	struct virtqueue *vq;     /* protected by fpq->lock */
> +	spinlock_t lock;
> +	struct virtqueue *vq;     /* protected by lock */
>  	struct work_struct done_work;
>  	struct list_head queued_reqs;
>  	struct delayed_work dispatch_work;
> @@ -180,11 +181,10 @@ static void virtio_fs_hiprio_done_work(s
>  {
>  	struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
>  						 done_work);
> -	struct fuse_pqueue *fpq = &fsvq->fud->pq;
>  	struct virtqueue *vq = fsvq->vq;
>  
>  	/* Free completed FUSE_FORGET requests */
> -	spin_lock(&fpq->lock);
> +	spin_lock(&fsvq->lock);
>  	do {
>  		unsigned len;
>  		void *req;
> @@ -194,7 +194,7 @@ static void virtio_fs_hiprio_done_work(s
>  		while ((req = virtqueue_get_buf(vq, &len)) != NULL)
>  			kfree(req);
>  	} while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
> -	spin_unlock(&fpq->lock);
> +	spin_unlock(&fsvq->lock);
>  }
>  
>  static void virtio_fs_dummy_dispatch_work(struct work_struct *work)
> @@ -207,7 +207,6 @@ static void virtio_fs_hiprio_dispatch_wo
>  	struct virtio_fs_forget *forget;
>  	struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
>  						 dispatch_work.work);
> -	struct fuse_pqueue *fpq = &fsvq->fud->pq;
>  	struct virtqueue *vq = fsvq->vq;
>  	struct scatterlist sg;
>  	struct scatterlist *sgs[] = {&sg};
> @@ -216,11 +215,11 @@ static void virtio_fs_hiprio_dispatch_wo
>  
>  	pr_debug("worker virtio_fs_hiprio_dispatch_work() called.\n");
>  	while(1) {
> -		spin_lock(&fpq->lock);
> +		spin_lock(&fsvq->lock);
>  		forget = list_first_entry_or_null(&fsvq->queued_reqs,
>  					struct virtio_fs_forget, list);
>  		if (!forget) {
> -			spin_unlock(&fpq->lock);
> +			spin_unlock(&fsvq->lock);
>  			return;
>  		}
>  
> @@ -243,12 +242,12 @@ static void virtio_fs_hiprio_dispatch_wo
>  					 " err=%d. Dropping it.\n", ret);
>  				kfree(forget);
>  			}
> -			spin_unlock(&fpq->lock);
> +			spin_unlock(&fsvq->lock);
>  			return;
>  		}
>  
>  		notify = virtqueue_kick_prepare(vq);
> -		spin_unlock(&fpq->lock);
> +		spin_unlock(&fsvq->lock);
>  
>  		if (notify)
>  			virtqueue_notify(vq);
> @@ -335,7 +334,7 @@ static void virtio_fs_requests_done_work
>  	LIST_HEAD(reqs);
>  
>  	/* Collect completed requests off the virtqueue */
> -	spin_lock(&fpq->lock);
> +	spin_lock(&fsvq->lock);
>  	do {
>  		unsigned len;
>  
> @@ -344,7 +343,7 @@ static void virtio_fs_requests_done_work
>  		while ((req = virtqueue_get_buf(vq, &len)) != NULL)
>  			list_move_tail(&req->list, &reqs);
>  	} while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
> -	spin_unlock(&fpq->lock);
> +	spin_unlock(&fsvq->lock);
>  
>  	/* End requests */
>  	list_for_each_entry_safe(req, next, &reqs, list) {
> @@ -413,9 +412,11 @@ static int virtio_fs_setup_vqs(struct vi
>  	INIT_LIST_HEAD(&fs->vqs[VQ_HIPRIO].queued_reqs);
>  	INIT_DELAYED_WORK(&fs->vqs[VQ_HIPRIO].dispatch_work,
>  			virtio_fs_hiprio_dispatch_work);
> +	spin_lock_init(&fs->vqs[VQ_HIPRIO].lock);
>  
>  	/* Initialize the requests virtqueues */
>  	for (i = VQ_REQUEST; i < fs->nvqs; i++) {
> +		spin_lock_init(&fs->vqs[i].lock);
>  		INIT_WORK(&fs->vqs[i].done_work, virtio_fs_requests_done_work);
>  		INIT_DELAYED_WORK(&fs->vqs[i].dispatch_work,
>  					virtio_fs_dummy_dispatch_work);
> @@ -744,7 +745,6 @@ __releases(fiq->waitq.lock)
>  {
>  	struct fuse_forget_link *link;
>  	struct virtio_fs_forget *forget;
> -	struct fuse_pqueue *fpq;
>  	struct scatterlist sg;
>  	struct scatterlist *sgs[] = {&sg};
>  	struct virtio_fs *fs;
> @@ -788,8 +788,7 @@ __releases(fiq->waitq.lock)
>  	/* Enqueue the request */
>  	vq = fsvq->vq;
>  	dev_dbg(&vq->vdev->dev, "%s\n", __func__);
> -	fpq = vq_to_fpq(vq);
> -	spin_lock(&fpq->lock);
> +	spin_lock(&fsvq->lock);
>  
>  	ret = virtqueue_add_sgs(vq, sgs, 1, 0, forget, GFP_ATOMIC);
>  	if (ret < 0) {
> @@ -804,13 +803,13 @@ __releases(fiq->waitq.lock)
>  				 " Dropping it.\n", ret);
>  			kfree(forget);
>  		}
> -		spin_unlock(&fpq->lock);
> +		spin_unlock(&fsvq->lock);
>  		goto out;
>  	}
>  
>  	notify = virtqueue_kick_prepare(vq);
>  
> -	spin_unlock(&fpq->lock);
> +	spin_unlock(&fsvq->lock);
>  
>  	if (notify)
>  		virtqueue_notify(vq);
> @@ -903,7 +902,7 @@ static int virtio_fs_enqueue_req(struct
>  	struct scatterlist stack_sg[ARRAY_SIZE(stack_sgs)];
>  	struct scatterlist **sgs = stack_sgs;
>  	struct scatterlist *sg = stack_sg;
> -	struct fuse_pqueue *fpq;
> +	struct virtio_fs_vq *fsvq;
>  	unsigned argbuf_used = 0;
>  	unsigned out_sgs = 0;
>  	unsigned in_sgs = 0;
> @@ -950,19 +949,19 @@ static int virtio_fs_enqueue_req(struct
>  	for (i = 0; i < total_sgs; i++)
>  		sgs[i] = &sg[i];
>  
> -	fpq = vq_to_fpq(vq);
> -	spin_lock(&fpq->lock);
> +	fsvq = vq_to_fsvq(vq);
> +	spin_lock(&fsvq->lock);
>  
>  	ret = virtqueue_add_sgs(vq, sgs, out_sgs, in_sgs, req, GFP_ATOMIC);
>  	if (ret < 0) {
>  		/* TODO handle full virtqueue */
> -		spin_unlock(&fpq->lock);
> +		spin_unlock(&fsvq->lock);
>  		goto out;
>  	}
>  
>  	notify = virtqueue_kick_prepare(vq);
>  
> -	spin_unlock(&fpq->lock);
> +	spin_unlock(&fsvq->lock);
>  
>  	if (notify)
>  		virtqueue_notify(vq);


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Virtio-fs] [PATCH 0/9] virtio-fs fixes
  2019-04-24 23:12   ` Liu Bo
@ 2019-04-25 14:59     ` Vivek Goyal
  2019-04-25 18:10       ` Liu Bo
  0 siblings, 1 reply; 38+ messages in thread
From: Vivek Goyal @ 2019-04-25 14:59 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Wed, Apr 24, 2019 at 04:12:59PM -0700, Liu Bo wrote:
> Hi Vivek,
> 
> On Wed, Apr 24, 2019 at 02:41:30PM -0400, Vivek Goyal wrote:
> > Hi Liubo,
> > 
> > I have made some fixes and took some of yours and pushed latest snapshot
> > of my internal tree here.
> > 
> > https://github.com/rhvgoyal/linux/commits/virtio-fs-dev-5.1
> > 
> > Patches have been rebased to 5.1-rc5 kernel. I am thinking of updating
> > this branch frequently with latest code.
> 
> With this branch, generic/476 still hangs, and yes, it's related to
> "async page fault related events", just as you mentioned on #irc.
> 
> I confirmed this with kvm and kvmmmu tracepoints.
> 
> The tracepoints[1] showed that
> [1]: https://paste.ubuntu.com/p/N9ngrthKCf/
> 
> ---
> handle_ept_violation
>   kvm_mmu_page_fault(error_code=182)
>     tdp_page_fault
>       fast_page_fault # spte not present
>       try_async_pf #queue a async_pf work and return RETRY
> 
> vcpu_run
>  kvm_check_async_pf_completion
>    kvm_arch_async_page_ready
>      tdp_page_fault(vcpu, work->gva, 0, true);
>        fast_page_fault(error_code == 0);
>        try_async_pf # found hpa
>        __direct_map()
> 	  set_spte(error_code == 0) # won't set the write bit
> 
> handle_ept_violation
>   kvm_mmu_page_fault(error_code=1aa)
>     tdp_page_fault
>       fast_page_fault # spte present but no write bit
>       try_async_pf # no hpa again queue a async_pf work and return RETRY

So why is there no "hpa"?

I was running a different test. I mmapped a file in the guest, then
truncated the file to size 0 on the host, and then the guest tried to
read/write the mmapped region.

This triggers an async page fault on the host. But given that the file
size is zero, the page fault will not succeed.

The current async pf logic has no notion of failure. It assumes it will
always succeed. It does not even check the return code of
get_user_pages_remote(), which can return an error.

So there are a few things to be done.

- Modify the async pf logic so that it can capture and report errors.
- If guest user space mmap()ed the file in question, then send SIGBUS to
  the process.
- If the guest kernel is trying to access memory which async pf can't
  resolve, then create an escape path and return an error to user space
  (something like memcpy_mcsafe(), I think).

I was playing with this and made some progress. But that work is not
complete. I thought of dealing with this problem later. If you are
curious, I have pushed my unfinished code here.

Kernel:
https://github.com/rhvgoyal/linux/commits/virtio-fs-dev-async-pf

Qemu:
https://github.com/rhvgoyal/qemu/commits/virtio-fs-async-pf

Thanks
Vivek


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Virtio-fs] [PATCH 0/9] virtio-fs fixes
  2019-04-25 14:59     ` Vivek Goyal
@ 2019-04-25 18:10       ` Liu Bo
  2019-04-27  0:58         ` Liu Bo
  0 siblings, 1 reply; 38+ messages in thread
From: Liu Bo @ 2019-04-25 18:10 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs

On Thu, Apr 25, 2019 at 10:59:50AM -0400, Vivek Goyal wrote:
> On Wed, Apr 24, 2019 at 04:12:59PM -0700, Liu Bo wrote:
> > Hi Vivek,
> > 
> > On Wed, Apr 24, 2019 at 02:41:30PM -0400, Vivek Goyal wrote:
> > > Hi Liubo,
> > > 
> > > I have made some fixes and took some of yours and pushed latest snapshot
> > > of my internal tree here.
> > > 
> > > https://github.com/rhvgoyal/linux/commits/virtio-fs-dev-5.1
> > > 
> > > Patches have been rebased to 5.1-rc5 kernel. I am thinking of updating
> > > this branch frequently with latest code.
> > 
> > With this branch, generic/476 still hangs, and yes, it's related to
> > "async page fault related events", just as you mentioned on #irc.
> > 
> > I confirmed this with kvm and kvmmmu tracepoints.
> > 
> > The tracepoints[1] showed that
> > [1]: https://paste.ubuntu.com/p/N9ngrthKCf/
> > 
> > ---
> > handle_ept_violation
> >   kvm_mmu_page_fault(error_code=182)
> >     tdp_page_fault
> >       fast_page_fault # spte not present
> >       try_async_pf #queue a async_pf work and return RETRY
> > 
> > vcpu_run
> >  kvm_check_async_pf_completion
> >    kvm_arch_async_page_ready
> >      tdp_page_fault(vcpu, work->gva, 0, true);
> >        fast_page_fault(error_code == 0);
> >        try_async_pf # found hpa
> >        __direct_map()
> > 	  set_spte(error_code == 0) # won't set the write bit
> > 
> > handle_ept_violation
> >   kvm_mmu_page_fault(error_code=1aa)
> >     tdp_page_fault
> >       fast_page_fault # spte present but no write bit
> >       try_async_pf # no hpa again queue a async_pf work and return RETRY
> 
> So why is there no "hpa"?
>

TBH, I have no idea. __gfn_to_pfn_memslot() did return a pfn
successfully after the async pf, but during the following EPT_VIOLATION,
__gfn_to_pfn_memslot() returned KVM_PFN_ERR_FAULT and told callers to do
another async pf, over and over again.

> I was running a different test. I mmapped a file in the guest, then
> truncated the file to size 0 on the host, and then the guest tried to
> read/write the mmapped region.
> 
> This triggers an async page fault on the host. But given that the file
> size is zero, the page fault will not succeed.
>

I see. I checked the file I used on the host; I could use FIEMAP to read
all its extents, and it wasn't truncated.

> The current async pf logic has no notion of failure. It assumes it will
> always succeed. It does not even check the return code of
> get_user_pages_remote(), which can return an error.
> 
> So there are a few things to be done.
> 
> - Modify the async pf logic so that it can capture and report errors.
> - If guest user space mmap()ed the file in question, then send SIGBUS to
>   the process.
> - If the guest kernel is trying to access memory which async pf can't
>   resolve, then create an escape path and return an error to user space
>   (something like memcpy_mcsafe(), I think).
>

I need to think more about this. In my case, the guest is just doing a
plain write(2) or writev(2); it shouldn't hang like that in any case.

Thanks for sharing the code, will take a look.

thanks,
-liubo
> I was playing with this and made some progress. But that work is not
> complete. I thought of dealing with this problem later. If you are
> curious, I have pushed my unfinished code here.
> 
> Kernel:
> https://github.com/rhvgoyal/linux/commits/virtio-fs-dev-async-pf
> 
> Qemu:
> https://github.com/rhvgoyal/qemu/commits/virtio-fs-async-pf
> 
> Thanks
> Vivek


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Virtio-fs] [PATCH 6/9] virtio-fs: let dax style override directIO style when dax+cache=none
  2019-04-17 19:35       ` Liu Bo
@ 2019-04-25 18:35         ` Vivek Goyal
  0 siblings, 0 replies; 38+ messages in thread
From: Vivek Goyal @ 2019-04-25 18:35 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs, Miklos Szeredi

On Wed, Apr 17, 2019 at 12:35:54PM -0700, Liu Bo wrote:
> On Wed, Apr 17, 2019 at 10:25:53AM +0200, Miklos Szeredi wrote:
> > On Tue, Apr 16, 2019 at 9:38 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Wed, Apr 17, 2019 at 02:03:19AM +0800, Liu Bo wrote:
> > > > In case of dax+cache=none, mmap uses dax style prior to directIO style,
> > > > while read/write don't, but it seems that there is no reason not to do so.
> > > >
> > > > Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> > > > Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
> > >
> > > This is interesting. I was thinking about it today itself. I noticed
> > > that ext4 and xfs also check for DAX inode first and use dax path
> > > if dax is enabled.
> > >
> > > cache=never sets FOPEN_DIRECT_IO (even if application never asked for
> > > direct IO). If dax is enabled, for data it's equivalent to doing direct
> > > IO. And for mmap() we are already checking for DAX first. So it makes
> > > sense to do same thing for read/write path as well.
> > >
> > > CCing Miklos as well. He might have some thoughts on this. I am curious
> > > why he initially made this change only for mmap() and not for the
> > > read/write paths.
> > 
> > AFAIR the main reason was that we had performance issues with size
> > extending writes with dax.
> > 
> > There is also the question of mtime updates and atime updates, which
> > are handled properly with FOPEN_DIRECT_IO, but not with DAX I/O.
> 
> Looks like fuse_iget() has set inode with NOATIME and NOCMTIME under
> the case "!fc->writeback_cache", which is true in dax mode, do we
> still care about time update?
> 
> AFAICS, these metadata are maintained by host side.

I guess what Miklos is alluding to is that with cache=none,
FOPEN_DIRECT_IO will always be set. That means file data will not be
cached in the guest, and all reads/writes and file changes will go
through the fuse daemon, which will result in atime/ctime/mtime updates
on the file on the host (because of the fuse daemon's operations).

But with the DAX path, it translates to mmap() IO on the host, and the
atime/mtime/ctime update semantics are very different there as opposed
to read/write.

My take on this is that we have two modes here: "cache=none" and
"cache=none+dax". Those who are concerned with strict atime/ctime/mtime
update behavior should not mount with dax enabled.

Thanks
Vivek



* Re: [Virtio-fs] [PATCH 0/9] virtio-fs fixes
  2019-04-25 18:10       ` Liu Bo
@ 2019-04-27  0:58         ` Liu Bo
  2019-04-29 13:18           ` Vivek Goyal
  0 siblings, 1 reply; 38+ messages in thread
From: Liu Bo @ 2019-04-27  0:58 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs

On Thu, Apr 25, 2019 at 11:10:08AM -0700, Liu Bo wrote:
> On Thu, Apr 25, 2019 at 10:59:50AM -0400, Vivek Goyal wrote:
> > On Wed, Apr 24, 2019 at 04:12:59PM -0700, Liu Bo wrote:
> > > Hi Vivek,
> > > 
> > > On Wed, Apr 24, 2019 at 02:41:30PM -0400, Vivek Goyal wrote:
> > > > Hi Liubo,
> > > > 
> > > > I have made some fixes and took some of yours and pushed latest snapshot
> > > > of my internal tree here.
> > > > 
> > > > https://github.com/rhvgoyal/linux/commits/virtio-fs-dev-5.1
> > > > 
> > > > Patches have been rebased to 5.1-rc5 kernel. I am thinking of updating
> > > > this branch frequently with latest code.
> > > 
> > > With this branch, generic/476 still got hang, and yes, it's related to
> > > "async page fault related events" just as what you've mentioned on #irc.
> > > 
> > > I confirmed this with kvm and kvmmmu tracepoints.
> > > 
> > > The tracepoints[1] showed that
> > > [1]: https://paste.ubuntu.com/p/N9ngrthKCf/
> > > 
> > > ---
> > > handle_ept_violation
> > >   kvm_mmu_page_fault(error_code=182)
> > >     tdp_page_fault
> > >       fast_page_fault # spte not present
> > >       try_async_pf #queue a async_pf work and return RETRY
> > > 
> > > vcpu_run
> > >  kvm_check_async_pf_completion
> > >    kvm_arch_async_page_ready
> > >      tdp_page_fault(vcpu, work->gva, 0, true);
> > >        fast_page_fault(error_code == 0);
> > >        try_async_pf # found hpa
> > >        __direct_map()
> > > 	  set_spte(error_code == 0) # won't set the write bit
> > > 
> > > handle_ept_violation
> > >   kvm_mmu_page_fault(error_code=1aa)
> > >     tdp_page_fault
> > >       fast_page_fault # spte present but no write bit
> > >       try_async_pf # no hpa again queue a async_pf work and return RETRY
> > 
> > So why there is no "hpa"?
> >
> 
> TBH, I have no idea, __gfn_to_pfn_memslot() did returned a pfn
> successfully after async pf, but during its following EPT_VIOLATION,
> __gfn_to_pfn_memslot() returned KVM_PFN_ERR_FAULT and indicated
> callers to do another async pf, and over and over again.
>

So I think I've figured it out; here is the summary.

virtiofs's dax write implementation sends a fallocate request to extend the
inode size and allocate space on the underlying fs so that the underlying
mmap can fault in pages on demand.

There are two problems here:

1) virtiofs write(2) only checks whether the write range is within the inode
   size.  However, this doesn't work all the time, because besides write(2)
   and fallocate(2), the inode size can also be extended by truncate(2),
   which doesn't allocate space on the underlying fs.  So when the guest VM
   writes to this address, it causes an EPT_VIOLATION, which faults in the
   necessary page from the underlying %vma.  If it's a write fault,
   page_mkwrite() is called; if the required space is not yet allocated,
   page_mkwrite() tries to allocate it, which may fail with ENOSPC if the
   underlying fs is already full.

2) async pf doesn't check whether gup is successful.

-------
the call stack analysis:

(vcpu thread)
handle_ept_violation
|->kvm_mmu_page_fault(error_code=182)
   |->tdp_page_fault
      |->try_async_pf  # the 1st page fault (write fault)
         |->__gfn_to_pfn_memslot
           |->get_user_page
              ->faultin_page
                ->handle_mm_fault
                  ->ext4_filemap_fault
                    ->ext4_page_mkwrite # return -ENOSPC, the write fault fails
        
	     |->kvm_arch_setup_async_pf # schedule a async_pf work on kworker
		    ->INIT_WORK(&work->work, async_pf_execute);

(kworker thread)
async_pf_execute
  ->get_user_pages_remote # the 2nd page fault (write fault)


(vcpu thread)
vcpu_run
|->kvm_check_async_pf_completion
   |->kvm_arch_async_page_ready
      |->tdp_page_fault
	     |->try_async_pf # the 3rd page fault (read fault, successful)
	     |->__direct_map
	        |->set_spte # install spte into ept
-------

So Vivek is right that async_pf doesn't check gup's failure, but in the
meantime we need to deal with 'truncate'.

thanks,
-liubo

> > I was running a different test. I mmaped a file in guest, then truncated
> > file to size 0 on host and then guest tried to read/write the mmaped
> > region.
> > 
> > This will trigger async page fault on host. But given file size is zero,
> > that page fault will not succeed.
> >
> 
> I see, I checked the file I used on host, I could use FIEMAP to read
> all its extent and it wasn't truncated.
> 
> > Current async pf logic has no notion of failure. It assumes it will always
> > succeed. It does not even check the return code of
> > get_user_pages_remote(), which can return error.
> > 
> > So there are few things to be done.
> > 
> > - Modify async pf logic so that it can it capture and report error.
> > - If guest user space mmaped() file in question, then send SIGBUS to
> >   process.
> > - If guest kernel is trying to access memory which async pf can't
> >   resolve, then create an escape path and return error to user
> >   space. (something like memcpy_mcsafe() I think).
> >
> 
> I need to think more about this, in my case, guest is just doing a
> plain write(2) or writev(2), it shouldn't get into hang like that in
> any case.
> 
> Thanks for sharing the code, will take a look.
> 
> thanks,
> -liubo
> > I was playing with this and made some progress. But that work is not
> > complete. I thought of dealing with this problem later. If you are
> > curious, I have pushed my unfinished code here.
> > 
> > Kernel:
> > https://github.com/rhvgoyal/linux/commits/virtio-fs-dev-async-pf
> > 
> > Qemu:
> > https://github.com/rhvgoyal/qemu/commits/virtio-fs-async-pf
> > 
> > Thanks
> > Vivek
> 
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs



* Re: [Virtio-fs] [PATCH 0/9] virtio-fs fixes
  2019-04-27  0:58         ` Liu Bo
@ 2019-04-29 13:18           ` Vivek Goyal
  2019-04-30  1:38             ` Liu Bo
  0 siblings, 1 reply; 38+ messages in thread
From: Vivek Goyal @ 2019-04-29 13:18 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Fri, Apr 26, 2019 at 05:58:39PM -0700, Liu Bo wrote:
> On Thu, Apr 25, 2019 at 11:10:08AM -0700, Liu Bo wrote:
> > On Thu, Apr 25, 2019 at 10:59:50AM -0400, Vivek Goyal wrote:
> > > On Wed, Apr 24, 2019 at 04:12:59PM -0700, Liu Bo wrote:
> > > > Hi Vivek,
> > > > 
> > > > On Wed, Apr 24, 2019 at 02:41:30PM -0400, Vivek Goyal wrote:
> > > > > Hi Liubo,
> > > > > 
> > > > > I have made some fixes and took some of yours and pushed latest snapshot
> > > > > of my internal tree here.
> > > > > 
> > > > > https://github.com/rhvgoyal/linux/commits/virtio-fs-dev-5.1
> > > > > 
> > > > > Patches have been rebased to 5.1-rc5 kernel. I am thinking of updating
> > > > > this branch frequently with latest code.
> > > > 
> > > > With this branch, generic/476 still got hang, and yes, it's related to
> > > > "async page fault related events" just as what you've mentioned on #irc.
> > > > 
> > > > I confirmed this with kvm and kvmmmu tracepoints.
> > > > 
> > > > The tracepoints[1] showed that
> > > > [1]: https://paste.ubuntu.com/p/N9ngrthKCf/
> > > > 
> > > > ---
> > > > handle_ept_violation
> > > >   kvm_mmu_page_fault(error_code=182)
> > > >     tdp_page_fault
> > > >       fast_page_fault # spte not present
> > > >       try_async_pf #queue a async_pf work and return RETRY
> > > > 
> > > > vcpu_run
> > > >  kvm_check_async_pf_completion
> > > >    kvm_arch_async_page_ready
> > > >      tdp_page_fault(vcpu, work->gva, 0, true);
> > > >        fast_page_fault(error_code == 0);
> > > >        try_async_pf # found hpa
> > > >        __direct_map()
> > > > 	  set_spte(error_code == 0) # won't set the write bit
> > > > 
> > > > handle_ept_violation
> > > >   kvm_mmu_page_fault(error_code=1aa)
> > > >     tdp_page_fault
> > > >       fast_page_fault # spte present but no write bit
> > > >       try_async_pf # no hpa again queue a async_pf work and return RETRY
> > > 
> > > So why there is no "hpa"?
> > >
> > 
> > TBH, I have no idea, __gfn_to_pfn_memslot() did returned a pfn
> > successfully after async pf, but during its following EPT_VIOLATION,
> > __gfn_to_pfn_memslot() returned KVM_PFN_ERR_FAULT and indicated
> > callers to do another async pf, and over and over again.
> >
> 
> So I think I've figured out it, here is the summary,
> 
> virtiofs's dax write implementation sends a fallocate request to extend inode
> size and allocate space on the underlying fs so that the underlying mmap can
> fault in pages on demands.
> 
> There're two problems here,

> 
> 1) virtiofs write(2) only checks if the write range is within inode size,
>    however, this doesn't work all the time because besides write(2) and
>    fallocate(2), inode size can also be extended by truncate(2) which doesn't
>    allocate space on the underlying fs, so when guest VM writes to this address,
>    it then causes a EPT_VIOLATION which will help fault-in the necessary page
>    from the underlying %vma, and if it's a write fault, page_mkwrite() will be
>    called, if the required space is not yet allocated, page_mkwrite() then tries
>    to allocate the space, which may fail with ENOSPC if the underlying fs has
>    already been full,
> 
> 2) async pf doesn't check whether gup is successful.

Ok. So the filesystem on the host is full but truncate still succeeds (as
it did not require fs block allocations). But later, when a write from a
guest process happens, it results in an async pf on the host, and that
fails because an fs block can't be allocated.

But this still sounds like an issue with async pf, where an error needs
to be captured and somehow communicated back to the guest OS. In this
case, -ENOSPC.

Thanks
Vivek



* Re: [Virtio-fs] [PATCH 0/9] virtio-fs fixes
  2019-04-29 13:18           ` Vivek Goyal
@ 2019-04-30  1:38             ` Liu Bo
  0 siblings, 0 replies; 38+ messages in thread
From: Liu Bo @ 2019-04-30  1:38 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs

On Mon, Apr 29, 2019 at 09:18:22AM -0400, Vivek Goyal wrote:
> On Fri, Apr 26, 2019 at 05:58:39PM -0700, Liu Bo wrote:
> > On Thu, Apr 25, 2019 at 11:10:08AM -0700, Liu Bo wrote:
> > > On Thu, Apr 25, 2019 at 10:59:50AM -0400, Vivek Goyal wrote:
> > > > On Wed, Apr 24, 2019 at 04:12:59PM -0700, Liu Bo wrote:
> > > > > Hi Vivek,
> > > > > 
> > > > > On Wed, Apr 24, 2019 at 02:41:30PM -0400, Vivek Goyal wrote:
> > > > > > Hi Liubo,
> > > > > > 
> > > > > > I have made some fixes and took some of yours and pushed latest snapshot
> > > > > > of my internal tree here.
> > > > > > 
> > > > > > https://github.com/rhvgoyal/linux/commits/virtio-fs-dev-5.1
> > > > > > 
> > > > > > Patches have been rebased to 5.1-rc5 kernel. I am thinking of updating
> > > > > > this branch frequently with latest code.
> > > > > 
> > > > > With this branch, generic/476 still got hang, and yes, it's related to
> > > > > "async page fault related events" just as what you've mentioned on #irc.
> > > > > 
> > > > > I confirmed this with kvm and kvmmmu tracepoints.
> > > > > 
> > > > > The tracepoints[1] showed that
> > > > > [1]: https://paste.ubuntu.com/p/N9ngrthKCf/
> > > > > 
> > > > > ---
> > > > > handle_ept_violation
> > > > >   kvm_mmu_page_fault(error_code=182)
> > > > >     tdp_page_fault
> > > > >       fast_page_fault # spte not present
> > > > >       try_async_pf #queue a async_pf work and return RETRY
> > > > > 
> > > > > vcpu_run
> > > > >  kvm_check_async_pf_completion
> > > > >    kvm_arch_async_page_ready
> > > > >      tdp_page_fault(vcpu, work->gva, 0, true);
> > > > >        fast_page_fault(error_code == 0);
> > > > >        try_async_pf # found hpa
> > > > >        __direct_map()
> > > > > 	  set_spte(error_code == 0) # won't set the write bit
> > > > > 
> > > > > handle_ept_violation
> > > > >   kvm_mmu_page_fault(error_code=1aa)
> > > > >     tdp_page_fault
> > > > >       fast_page_fault # spte present but no write bit
> > > > >       try_async_pf # no hpa again queue a async_pf work and return RETRY
> > > > 
> > > > So why there is no "hpa"?
> > > >
> > > 
> > > TBH, I have no idea, __gfn_to_pfn_memslot() did returned a pfn
> > > successfully after async pf, but during its following EPT_VIOLATION,
> > > __gfn_to_pfn_memslot() returned KVM_PFN_ERR_FAULT and indicated
> > > callers to do another async pf, and over and over again.
> > >
> > 
> > So I think I've figured out it, here is the summary,
> > 
> > virtiofs's dax write implementation sends a fallocate request to extend inode
> > size and allocate space on the underlying fs so that the underlying mmap can
> > fault in pages on demands.
> > 
> > There're two problems here,
> 
> > 
> > 1) virtiofs write(2) only checks if the write range is within inode size,
> >    however, this doesn't work all the time because besides write(2) and
> >    fallocate(2), inode size can also be extended by truncate(2) which doesn't
> >    allocate space on the underlying fs, so when guest VM writes to this address,
> >    it then causes a EPT_VIOLATION which will help fault-in the necessary page
> >    from the underlying %vma, and if it's a write fault, page_mkwrite() will be
> >    called, if the required space is not yet allocated, page_mkwrite() then tries
> >    to allocate the space, which may fail with ENOSPC if the underlying fs has
> >    already been full,
> > 
> > 2) async pf doesn't check whether gup is successful.
> 
> Ok. So filesystem on host is full but truncate still succeeds (as it did
> not reuiqre fs block allocations). But later when a write from guest
> process happens, it results in async pf on host and that fails because
> fs block can't be allocated.
> 
> But this still sounds like an issue with async pf where an error needs
> to be captured and somehow communicated back to guest OS. In this
> case -ENOSPC.

I have a question about how the guest responds to this kind of error: the
guest vm is doing dax_copy_from_iter() (in case of write), and eventually
it's memory-copying an iovec, right?

I'm not sure how the guest can exit gracefully from there.  Can copy_in()
return an EFAULT somehow?

My workaround is to ensure that enough fs space is allocated to the dax
mapping range when doing SETUPMAPPING; in other words, we can do a plain
fallocate on the range before sending messages to the vhost-user backend.

thanks,
-liubo



* Re: [Virtio-fs] [PATCH 9/9] fuse: fix deadlock in __fuse_file_fallocate()
  2019-04-16 18:07   ` Vivek Goyal
@ 2019-05-02 22:10     ` Liu Bo
  2019-05-03 15:22       ` Vivek Goyal
  0 siblings, 1 reply; 38+ messages in thread
From: Liu Bo @ 2019-05-02 22:10 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs

On Tue, Apr 16, 2019 at 02:07:56PM -0400, Vivek Goyal wrote:
> On Wed, Apr 17, 2019 at 02:03:22AM +0800, Liu Bo wrote:
> > From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> > 
> > This bug is obvious, fix it.
> > 
> > Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
> > Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> 
> Thanks. This one I ran into late last week while testing
> fallocate(PUNCH_HOLE) and fixed it already in my internal branch.

Hi Vivek,

FYI, the bug still exists in
https://github.com/rhvgoyal/linux/tree/virtio-fs-4.19.28

thanks,
-liubo

> 
> Vivek
> > ---
> >  fs/fuse/file.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index 6ab23d7..0236783 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -3623,7 +3623,7 @@ static long __fuse_file_fallocate(struct file *file, int mode,
> >  	if (mode & FALLOC_FL_PUNCH_HOLE) {
> >  		down_write(&fi->i_mmap_sem);
> >  		truncate_pagecache_range(inode, offset, offset + length - 1);
> > -		down_write(&fi->i_mmap_sem);
> > +		up_write(&fi->i_mmap_sem);
> >  	}
> >  	fuse_invalidate_attr(inode);
> >  
> > -- 
> > 1.8.3.1
> > 
> > _______________________________________________
> > Virtio-fs mailing list
> > Virtio-fs@redhat.com
> > https://www.redhat.com/mailman/listinfo/virtio-fs



* Re: [Virtio-fs] [PATCH 9/9] fuse: fix deadlock in __fuse_file_fallocate()
  2019-05-02 22:10     ` Liu Bo
@ 2019-05-03 15:22       ` Vivek Goyal
  0 siblings, 0 replies; 38+ messages in thread
From: Vivek Goyal @ 2019-05-03 15:22 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

Hi Liu Bo,

Thanks. I fixed it now. Check it out. This change was part of my
5.1-rc rebase (and not 4.19 rebase).

Thanks
Vivek

On Thu, May 02, 2019 at 03:10:15PM -0700, Liu Bo wrote:
> On Tue, Apr 16, 2019 at 02:07:56PM -0400, Vivek Goyal wrote:
> > On Wed, Apr 17, 2019 at 02:03:22AM +0800, Liu Bo wrote:
> > > From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> > > 
> > > This bug is obvious, fix it.
> > > 
> > > Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
> > > Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> > 
> > Thanks. This one I ran into late last week while testing
> > fallocate(PUNCH_HOLE) and fixed it already in my internal branch.
> 
> Hi Vivek,
> 
> FYI, the bug still exists in
> https://github.com/rhvgoyal/linux/tree/virtio-fs-4.19.28
> 
> thanks,
> -liubo
> 
> > 
> > Vivek
> > > ---
> > >  fs/fuse/file.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > > index 6ab23d7..0236783 100644
> > > --- a/fs/fuse/file.c
> > > +++ b/fs/fuse/file.c
> > > @@ -3623,7 +3623,7 @@ static long __fuse_file_fallocate(struct file *file, int mode,
> > >  	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > >  		down_write(&fi->i_mmap_sem);
> > >  		truncate_pagecache_range(inode, offset, offset + length - 1);
> > > -		down_write(&fi->i_mmap_sem);
> > > +		up_write(&fi->i_mmap_sem);
> > >  	}
> > >  	fuse_invalidate_attr(inode);
> > >  
> > > -- 
> > > 1.8.3.1
> > > 
> > > _______________________________________________
> > > Virtio-fs mailing list
> > > Virtio-fs@redhat.com
> > > https://www.redhat.com/mailman/listinfo/virtio-fs



end of thread, other threads:[~2019-05-03 15:22 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-16 18:03 [Virtio-fs] [PATCH 0/9] virtio-fs fixes Liu Bo
2019-04-16 18:03 ` [Virtio-fs] [PATCH 1/9] virtio-fs: fix multiple tag support Liu Bo
2019-04-16 18:09   ` Vivek Goyal
2019-04-16 18:03 ` [Virtio-fs] [PATCH 2/9] virtio-fs: clean up dax mapping before aborting connection Liu Bo
2019-04-16 18:18   ` Vivek Goyal
2019-04-16 18:40     ` Liu Bo
2019-04-16 18:03 ` [Virtio-fs] [PATCH 3/9] fuse: export fuse_drop_waiting() Liu Bo
2019-04-16 18:28   ` Vivek Goyal
2019-04-16 18:43     ` Liu Bo
2019-04-16 18:03 ` [Virtio-fs] [PATCH 4/9] virtio-fs: fix use-after-free against virtio_fs_vq's fuse_dev info Liu Bo
2019-04-23 19:51   ` Vivek Goyal
2019-04-25  0:16     ` Liu Bo
2019-04-16 18:03 ` [Virtio-fs] [PATCH 5/9] fuse: do not write whole page while page straddles i_size Liu Bo
2019-04-16 20:16   ` Vivek Goyal
2019-04-17  0:12     ` Liu Bo
2019-04-16 18:03 ` [Virtio-fs] [PATCH 6/9] virtio-fs: let dax style override directIO style when dax+cache=none Liu Bo
2019-04-16 19:38   ` Vivek Goyal
2019-04-17  8:25     ` Miklos Szeredi
2019-04-17 19:35       ` Liu Bo
2019-04-25 18:35         ` Vivek Goyal
2019-04-17 20:56       ` Vivek Goyal
2019-04-22 18:55         ` Liu Bo
2019-04-22 18:14       ` Vivek Goyal
2019-04-16 18:03 ` [Virtio-fs] [PATCH 7/9] fuse: return early if punch_hole fails Liu Bo
2019-04-16 19:51   ` Vivek Goyal
2019-04-16 18:03 ` [Virtio-fs] [PATCH 8/9] virtio-fs: honor RLIMIT_FSIZE in fuse_file_fallocate Liu Bo
2019-04-16 19:57   ` Vivek Goyal
2019-04-16 18:03 ` [Virtio-fs] [PATCH 9/9] fuse: fix deadlock in __fuse_file_fallocate() Liu Bo
2019-04-16 18:07   ` Vivek Goyal
2019-05-02 22:10     ` Liu Bo
2019-05-03 15:22       ` Vivek Goyal
2019-04-24 18:41 ` [Virtio-fs] [PATCH 0/9] virtio-fs fixes Vivek Goyal
2019-04-24 23:12   ` Liu Bo
2019-04-25 14:59     ` Vivek Goyal
2019-04-25 18:10       ` Liu Bo
2019-04-27  0:58         ` Liu Bo
2019-04-29 13:18           ` Vivek Goyal
2019-04-30  1:38             ` Liu Bo
