* [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines
@ 2019-05-15 19:26 Vivek Goyal
  2019-05-15 19:26 ` [PATCH v2 01/30] fuse: delete dentry if timeout is zero Vivek Goyal
                   ` (29 more replies)
  0 siblings, 30 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:26 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

Hi,

Here are the RFC patches for V2 of virtio-fs. These patches apply on top
of the 5.1 kernel. They are also available here:

https://github.com/rhvgoyal/linux/commits/virtio-fs-dev-5.1
  
Patches for V1 were posted here.
  
https://lwn.net/ml/linux-fsdevel/20181210171318.16998-1-vgoyal@redhat.com/

This is still work in progress. As of now one can pass through a host
directory into the guest and it works reasonably well. The pjdfstest test
suite passes and blogbench runs. But this directory can't be shared
between guests, and the host can't modify files in the directory yet.
That's still TBD.
  
Posting another version to gather feedback and comments on progress so far.
  
More information about the project can be found here.
  
https://virtio-fs.gitlab.io/

Changes from V1
===============
- Various bug fixes.
- virtio-fs DAX huge page support is working, leading to improved performance.
- Fixed warnings reported by automated kernel testing.
- Better handling of shared cache region reporting by the virtio device.

Description from V1 posting
---------------------------
Problem Description
===================
We want to be able to take a directory tree on the host and share it with
guest[s]. Our goal is to do this in a fast, consistent and secure manner.
Our primary use case is Kata Containers, but it should be usable in other
scenarios as well.

Containers may rely on local file system semantics for shared volumes:
read-write mounts that multiple containers access simultaneously. File
system changes must be visible to other containers with the same
consistency expected of a local file system, including mmap MAP_SHARED.

Existing Solutions
==================
We looked at existing solutions. virtio-9p already provides basic shared
file system functionality, although it does not offer local file system
semantics, causing some workloads and test suites to fail. In addition,
virtio-9p performance has been an issue for Kata Containers and we believe
it cannot be improved without major changes that do not fit into the 9P
protocol.

Design Overview
===============
With the goal of designing something with better performance and local file
system semantics, several ideas were proposed.

- Use the fuse protocol (instead of 9p) for communication between guest
  and host. The guest kernel acts as the fuse client and a fuse server
  runs on the host to serve the requests. Benchmark results are encouraging
  and show this approach performs well (2x to 8x improvement depending on
  the test being run).

- For data access inside the guest, mmap portions of the file into the
  QEMU address space and let the guest access this memory using DAX. That
  way the guest page cache is bypassed and there is only one copy of data
  (on the host). This also enables mmap(MAP_SHARED) between guests. (A
  rough sketch of this idea follows this list.)

- For metadata coherency, there is a shared memory region which contains a
  version number associated with metadata; any guest changing metadata
  updates the version number, and other guests refresh metadata on the next
  access. This is yet to be implemented.
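
As a rough sketch of the DAX mapping idea (all names below are illustrative,
not the actual protocol; the setupmapping/removemapping patches later in this
series define the real commands):

  /*
   * Illustrative only: the guest asks the host to map a file range into
   * the device's shared cache window; afterwards the guest reads/writes
   * those bytes directly, with no guest page cache copy and no vmexit
   * per access.
   */
  struct map_request {                    /* hypothetical */
          u64 fh;                         /* file handle to map */
          u64 foffset;                    /* offset within the file */
          u64 len;                        /* length of the mapping */
          u64 moffset;                    /* offset into the cache window */
  };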

How virtio-fs differs from existing approaches
==============================================
The unique idea behind virtio-fs is to take advantage of the co-location
of the virtual machine and hypervisor to avoid communication (vmexits).

DAX allows file contents to be accessed without communication with the
hypervisor. The shared memory region for metadata avoids communication in
the common case where metadata is unchanged.

By replacing expensive communication with cheaper shared memory accesses,
we expect to achieve better performance than approaches based on network
file system protocols. In addition, this also makes it easier to achieve
local file system semantics (coherency).

These techniques are not applicable to network file system protocols, since
they bypass the communications channel entirely by taking advantage of
shared memory on a local machine. This is why we decided to build virtio-fs
rather than focus on 9P or NFS.

HOWTO
=====
We have put instructions on how to use it here.

https://virtio-fs.gitlab.io/

Caching Modes
=============
Like virtio-9p, different caching modes are supported, and they determine
the coherency level as well. The "cache=FOO" and "writeback" options control
the level of coherence between the guest and host filesystems. The "shared"
option only affects coherence between virtio-fs instances running inside
different guests.

- cache=none
  Metadata, data and pathname lookup are not cached in the guest. They are
  always fetched from the host and any changes are immediately pushed to
  the host.

- cache=always
  Metadata, data and pathname lookup are cached in the guest and never
  expire.

- cache=auto
  Metadata and pathname lookup cache expires after a configured amount of
  time (default is 1 second). Data is cached while the file is open
  (close-to-open consistency). A sketch of the timeout logic follows this
  list.

- writeback/no_writeback
  These options control the writeback strategy. If writeback is disabled,
  then normal writes will immediately be synchronized with the host fs. If
  writeback is enabled, then writes may be cached in the guest until the
  file is closed or fsync(2) is called. This option has no effect on mmap-ed
  writes or writes going through the DAX mechanism.

- shared/no_shared
  These options control the use of the shared version table. If shared mode
  is enabled then metadata and pathname lookup are cached in the guest, but
  are refreshed when another virtio-fs instance changes them.
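
A simplified sketch of the timeout check behind cache=auto (modeled on the
jiffies-based dentry timeout fuse already uses; cache=none behaves like a
timeout of zero, cache=always like one that never expires):

  static bool attr_still_valid(u64 valid_until)
  {
          /* cached metadata is trusted only until its deadline */
          return time_before64(get_jiffies_64(), valid_until);
  }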

DAX
===
- DAX can be turned on/off when mounting virtio-fs inside the guest.

TODO
====
- Implement the "cache=shared" option.
- Improve error handling on the host. If a page fault on the host fails, we
  need to propagate it into the guest.
- Fine tune for performance.
- Bug fixes

RESULTS
=======
- The pjdfstest suite passes. Tried cache=none/auto/always with dax on/off.

  https://github.com/pjd/pjdfstest

  (One symlink test fails and that seems to be due to xfs on the host. Yet
   to look into it.)

- Ran blogbench and that works too.

Thanks
Vivek  

Miklos Szeredi (2):
  fuse: delete dentry if timeout is zero
  fuse: Use default_file_splice_read for direct IO

Sebastien Boeuf (3):
  virtio: Add get_shm_region method
  virtio: Implement get_shm_region for PCI transport
  virtio: Implement get_shm_region for MMIO transport

Stefan Hajnoczi (10):
  fuse: export fuse_end_request()
  fuse: export fuse_len_args()
  fuse: export fuse_get_unique()
  fuse: extract fuse_fill_super_common()
  fuse: add fuse_iqueue_ops callbacks
  virtio_fs: add skeleton virtio_fs.ko module
  dax: remove block device dependencies
  fuse, dax: add fuse_conn->dax_dev field
  virtio_fs, dax: Set up virtio_fs dax_device
  fuse, dax: add DAX mmap support

Vivek Goyal (15):
  fuse: Clear setuid bit even in cache=never path
  fuse: Export fuse_send_init_request()
  fuse: Separate fuse device allocation and installation in fuse_conn
  dax: Pass dax_dev to dax_writeback_mapping_range()
  fuse: Keep a list of free dax memory ranges
  fuse: Introduce setupmapping/removemapping commands
  fuse, dax: Implement dax read/write operations
  fuse: Define dax address space operations
  fuse, dax: Take ->i_mmap_sem lock during dax page fault
  fuse: Maintain a list of busy elements
  fuse: Add logic to free up a memory range
  fuse: Release file in process context
  fuse: Reschedule dax free work if too many EAGAIN attempts
  fuse: Take inode lock for dax inode truncation
  virtio-fs: Do not provide abort interface in fusectl

 drivers/dax/super.c                |    3 +-
 drivers/virtio/virtio_mmio.c       |   32 +
 drivers/virtio/virtio_pci_modern.c |  108 +++
 fs/dax.c                           |   23 +-
 fs/ext2/inode.c                    |    2 +-
 fs/ext4/inode.c                    |    2 +-
 fs/fuse/Kconfig                    |   11 +
 fs/fuse/Makefile                   |    1 +
 fs/fuse/control.c                  |    4 +-
 fs/fuse/cuse.c                     |    5 +-
 fs/fuse/dev.c                      |   80 +-
 fs/fuse/dir.c                      |   28 +-
 fs/fuse/file.c                     |  953 ++++++++++++++++++++++-
 fs/fuse/fuse_i.h                   |  206 ++++-
 fs/fuse/inode.c                    |  307 ++++++--
 fs/fuse/virtio_fs.c                | 1129 ++++++++++++++++++++++++++++
 fs/splice.c                        |    3 +-
 fs/xfs/xfs_aops.c                  |    2 +-
 include/linux/dax.h                |    6 +-
 include/linux/fs.h                 |    2 +
 include/linux/virtio_config.h      |   17 +
 include/uapi/linux/fuse.h          |   34 +
 include/uapi/linux/virtio_fs.h     |   44 ++
 include/uapi/linux/virtio_ids.h    |    1 +
 include/uapi/linux/virtio_mmio.h   |   11 +
 include/uapi/linux/virtio_pci.h    |   10 +
 26 files changed, 2875 insertions(+), 149 deletions(-)
 create mode 100644 fs/fuse/virtio_fs.c
 create mode 100644 include/uapi/linux/virtio_fs.h

-- 
2.20.1


* [PATCH v2 01/30] fuse: delete dentry if timeout is zero
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
@ 2019-05-15 19:26 ` Vivek Goyal
  2019-05-15 19:26 ` [PATCH v2 02/30] fuse: Clear setuid bit even in cache=never path Vivek Goyal
                   ` (28 subsequent siblings)
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:26 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

From: Miklos Szeredi <mszeredi@redhat.com>

Don't hold onto the dentry in the LRU list if it will need to be looked up
again at the next access anyway.

A more advanced version of this patch would periodically flush stale
dentries from the LRU.
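
For context, a condensed sketch of the retention decision dput() makes
(see fs/dcache.c; not the literal code). With ->d_delete wired up as below,
an already-expired dentry is freed on dput() instead of lingering on the
LRU:

  static bool retain_dentry_sketch(struct dentry *dentry)
  {
          if ((dentry->d_flags & DCACHE_OP_DELETE) &&
              dentry->d_op->d_delete(dentry))
                  return false;   /* expired: kill it now */
          return true;            /* keep it cached on the LRU */
  }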

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/fuse/dir.c | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index dd0f64f7bc06..fd8636e67ae9 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -29,12 +29,26 @@ union fuse_dentry {
 	struct rcu_head rcu;
 };
 
-static inline void fuse_dentry_settime(struct dentry *entry, u64 time)
+static void fuse_dentry_settime(struct dentry *dentry, u64 time)
 {
-	((union fuse_dentry *) entry->d_fsdata)->time = time;
+	/*
+	 * Mess with DCACHE_OP_DELETE because dput() will be faster without it.
+	 *  Don't care about races, either way it's just an optimization
+	 */
+	if ((time && (dentry->d_flags & DCACHE_OP_DELETE)) ||
+	    (!time && !(dentry->d_flags & DCACHE_OP_DELETE))) {
+		spin_lock(&dentry->d_lock);
+		if (time)
+			dentry->d_flags &= ~DCACHE_OP_DELETE;
+		else
+			dentry->d_flags |= DCACHE_OP_DELETE;
+		spin_unlock(&dentry->d_lock);
+	}
+
+	((union fuse_dentry *) dentry->d_fsdata)->time = time;
 }
 
-static inline u64 fuse_dentry_time(struct dentry *entry)
+static inline u64 fuse_dentry_time(const struct dentry *entry)
 {
 	return ((union fuse_dentry *) entry->d_fsdata)->time;
 }
@@ -255,8 +269,14 @@ static void fuse_dentry_release(struct dentry *dentry)
 	kfree_rcu(fd, rcu);
 }
 
+static int fuse_dentry_delete(const struct dentry *dentry)
+{
+	return time_before64(fuse_dentry_time(dentry), get_jiffies_64());
+}
+
 const struct dentry_operations fuse_dentry_operations = {
 	.d_revalidate	= fuse_dentry_revalidate,
+	.d_delete	= fuse_dentry_delete,
 	.d_init		= fuse_dentry_init,
 	.d_release	= fuse_dentry_release,
 };
-- 
2.20.1


* [PATCH v2 02/30] fuse: Clear setuid bit even in cache=never path
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
  2019-05-15 19:26 ` [PATCH v2 01/30] fuse: delete dentry if timeout is zero Vivek Goyal
@ 2019-05-15 19:26 ` Vivek Goyal
  2019-05-20 14:41   ` Miklos Szeredi
  2019-05-15 19:26 ` [PATCH v2 03/30] fuse: Use default_file_splice_read for direct IO Vivek Goyal
                   ` (27 subsequent siblings)
  29 siblings, 1 reply; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:26 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

If the fuse daemon is started with cache=never, fuse falls back to direct
IO. In that write path we don't call file_remove_privs(), which means the
setuid bit is not cleared if an unprivileged user writes to a file with the
setuid bit set.

pjdfstest chmod test 12.t tests this and fails.

Fix this by calling file_remove_privs() even in the direct I/O path.
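
For reference, a condensed sketch of the mode-bit check behind
file_remove_privs() (see should_remove_suid() in fs/inode.c; simplified,
the real code also handles capabilities and takes a dentry):

  static int should_remove_suid_sketch(struct inode *inode)
  {
          umode_t mode = inode->i_mode;
          int kill = 0;

          if (unlikely(mode & S_ISUID))
                  kill = ATTR_KILL_SUID;
          /* setgid only matters if the group-exec bit is also set */
          if (unlikely((mode & S_ISGID) && (mode & S_IXGRP)))
                  kill |= ATTR_KILL_SGID;
          return kill;
  }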

I tested this as follows.

- Run the fuse example passthrough fs:

  $ passthrough_ll /mnt/pasthrough-mnt -o default_permissions,allow_other,cache=never
  $ mkdir /mnt/pasthrough-mnt/testdir
  $ cd /mnt/pasthrough-mnt/testdir
  $ prove -rv pjdfstests/tests/chmod/12.t

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/file.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 06096b60f1df..5baf07fd2876 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1456,14 +1456,18 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	/* Don't allow parallel writes to the same file */
 	inode_lock(inode);
 	res = generic_write_checks(iocb, from);
-	if (res > 0) {
-		if (!is_sync_kiocb(iocb) && iocb->ki_flags & IOCB_DIRECT) {
-			res = fuse_direct_IO(iocb, from);
-		} else {
-			res = fuse_direct_io(&io, from, &iocb->ki_pos,
-					     FUSE_DIO_WRITE);
-		}
+	if (res <= 0)
+		goto out;
+
+	res = file_remove_privs(iocb->ki_filp);
+	if (res)
+		goto out;
+	if (!is_sync_kiocb(iocb) && iocb->ki_flags & IOCB_DIRECT) {
+		res = fuse_direct_IO(iocb, from);
+	} else {
+		res = fuse_direct_io(&io, from, &iocb->ki_pos, FUSE_DIO_WRITE);
 	}
+out:
 	fuse_invalidate_attr(inode);
 	if (res > 0)
 		fuse_write_update_size(inode, iocb->ki_pos);
-- 
2.20.1


* [PATCH v2 03/30] fuse: Use default_file_splice_read for direct IO
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
  2019-05-15 19:26 ` [PATCH v2 01/30] fuse: delete dentry if timeout is zero Vivek Goyal
  2019-05-15 19:26 ` [PATCH v2 02/30] fuse: Clear setuid bit even in cache=never path Vivek Goyal
@ 2019-05-15 19:26 ` Vivek Goyal
  2019-05-15 19:26 ` [PATCH v2 04/30] fuse: export fuse_end_request() Vivek Goyal
                   ` (26 subsequent siblings)
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:26 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

From: Miklos Szeredi <mszeredi@redhat.com>

---
 fs/fuse/file.c     | 15 ++++++++++++++-
 fs/splice.c        |  3 ++-
 include/linux/fs.h |  2 ++
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 5baf07fd2876..e9a7aa97c539 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2167,6 +2167,19 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 	return 0;
 }
 
+static ssize_t fuse_file_splice_read(struct file *in, loff_t *ppos,
+				     struct pipe_inode_info *pipe, size_t len,
+				     unsigned int flags)
+{
+	struct fuse_file *ff = in->private_data;
+
+	if (ff->open_flags & FOPEN_DIRECT_IO)
+		return default_file_splice_read(in, ppos, pipe, len, flags);
+	else
+		return generic_file_splice_read(in, ppos, pipe, len, flags);
+
+}
+
 static int convert_fuse_file_lock(struct fuse_conn *fc,
 				  const struct fuse_file_lock *ffl,
 				  struct file_lock *fl)
@@ -3174,7 +3187,7 @@ static const struct file_operations fuse_file_operations = {
 	.fsync		= fuse_fsync,
 	.lock		= fuse_file_lock,
 	.flock		= fuse_file_flock,
-	.splice_read	= generic_file_splice_read,
+	.splice_read	= fuse_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.unlocked_ioctl	= fuse_file_ioctl,
 	.compat_ioctl	= fuse_file_compat_ioctl,
diff --git a/fs/splice.c b/fs/splice.c
index 25212dcca2df..e2e881e34935 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -361,7 +361,7 @@ static ssize_t kernel_readv(struct file *file, const struct kvec *vec,
 	return res;
 }
 
-static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
+ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 				 struct pipe_inode_info *pipe, size_t len,
 				 unsigned int flags)
 {
@@ -425,6 +425,7 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 	iov_iter_advance(&to, copied);	/* truncates and discards */
 	return res;
 }
+EXPORT_SYMBOL(default_file_splice_read);
 
 /*
  * Send 'sd->len' bytes to socket from 'sd->file' at position 'sd->pos'
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dd28e7679089..6804aecf7e30 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3055,6 +3055,8 @@ extern void block_sync_page(struct page *page);
 /* fs/splice.c */
 extern ssize_t generic_file_splice_read(struct file *, loff_t *,
 		struct pipe_inode_info *, size_t, unsigned int);
+extern ssize_t default_file_splice_read(struct file *, loff_t *,
+		struct pipe_inode_info *, size_t, unsigned int);
 extern ssize_t iter_file_splice_write(struct pipe_inode_info *,
 		struct file *, loff_t *, size_t, unsigned int);
 extern ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe,
-- 
2.20.1


* [PATCH v2 04/30] fuse: export fuse_end_request()
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (2 preceding siblings ...)
  2019-05-15 19:26 ` [PATCH v2 03/30] fuse: Use default_file_splice_read for direct IO Vivek Goyal
@ 2019-05-15 19:26 ` Vivek Goyal
  2019-05-15 19:26 ` [PATCH v2 05/30] fuse: export fuse_len_args() Vivek Goyal
                   ` (25 subsequent siblings)
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:26 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

From: Stefan Hajnoczi <stefanha@redhat.com>

virtio-fs will need to complete requests from outside fs/fuse/dev.c.
Make the symbol visible.
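
A minimal sketch of how an external transport might complete a request,
assuming the reply has already been copied into req->out (the helper name
is hypothetical, not part of this patch):

  static void transport_complete_req(struct fuse_conn *fc,
                                     struct fuse_req *req, int error)
  {
          req->out.h.error = error;       /* 0 on success */
          fuse_request_end(fc, req);      /* runs ->end, drops the ref */
  }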

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 fs/fuse/dev.c    | 19 ++++++++++---------
 fs/fuse/fuse_i.h |  5 +++++
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 9971a35cf1ef..46d1aecd7506 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -427,7 +427,7 @@ static void flush_bg_queue(struct fuse_conn *fc)
  * the 'end' callback is called if given, else the reference to the
  * request is released
  */
-static void request_end(struct fuse_conn *fc, struct fuse_req *req)
+void fuse_request_end(struct fuse_conn *fc, struct fuse_req *req)
 {
 	struct fuse_iqueue *fiq = &fc->iq;
 
@@ -480,6 +480,7 @@ static void request_end(struct fuse_conn *fc, struct fuse_req *req)
 put_request:
 	fuse_put_request(fc, req);
 }
+EXPORT_SYMBOL_GPL(fuse_request_end);
 
 static int queue_interrupt(struct fuse_iqueue *fiq, struct fuse_req *req)
 {
@@ -567,12 +568,12 @@ static void __fuse_request_send(struct fuse_conn *fc, struct fuse_req *req)
 		req->in.h.unique = fuse_get_unique(fiq);
 		queue_request(fiq, req);
 		/* acquire extra reference, since request is still needed
-		   after request_end() */
+		   after fuse_request_end() */
 		__fuse_get_request(req);
 		spin_unlock(&fiq->waitq.lock);
 
 		request_wait_answer(fc, req);
-		/* Pairs with smp_wmb() in request_end() */
+		/* Pairs with smp_wmb() in fuse_request_end() */
 		smp_rmb();
 	}
 }
@@ -1302,7 +1303,7 @@ __releases(fiq->waitq.lock)
  * the pending list and copies request data to userspace buffer.  If
  * no reply is needed (FORGET) or request has been aborted or there
  * was an error during the copying then it's finished by calling
- * request_end().  Otherwise add it to the processing list, and set
+ * fuse_request_end().  Otherwise add it to the processing list, and set
  * the 'sent' flag.
  */
 static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
@@ -1362,7 +1363,7 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
 		/* SETXATTR is special, since it may contain too large data */
 		if (in->h.opcode == FUSE_SETXATTR)
 			req->out.h.error = -E2BIG;
-		request_end(fc, req);
+		fuse_request_end(fc, req);
 		goto restart;
 	}
 	spin_lock(&fpq->lock);
@@ -1405,7 +1406,7 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
 	if (!test_bit(FR_PRIVATE, &req->flags))
 		list_del_init(&req->list);
 	spin_unlock(&fpq->lock);
-	request_end(fc, req);
+	fuse_request_end(fc, req);
 	return err;
 
  err_unlock:
@@ -1913,7 +1914,7 @@ static int copy_out_args(struct fuse_copy_state *cs, struct fuse_out *out,
  * the write buffer.  The request is then searched on the processing
  * list by the unique ID found in the header.  If found, then remove
  * it from the list and copy the rest of the buffer to the request.
- * The request is finished by calling request_end()
+ * The request is finished by calling fuse_request_end().
  */
 static ssize_t fuse_dev_do_write(struct fuse_dev *fud,
 				 struct fuse_copy_state *cs, size_t nbytes)
@@ -2000,7 +2001,7 @@ static ssize_t fuse_dev_do_write(struct fuse_dev *fud,
 		list_del_init(&req->list);
 	spin_unlock(&fpq->lock);
 
-	request_end(fc, req);
+	fuse_request_end(fc, req);
 out:
 	return err ? err : nbytes;
 
@@ -2140,7 +2141,7 @@ static void end_requests(struct fuse_conn *fc, struct list_head *head)
 		req->out.h.error = -ECONNABORTED;
 		clear_bit(FR_SENT, &req->flags);
 		list_del_init(&req->list);
-		request_end(fc, req);
+		fuse_request_end(fc, req);
 	}
 }
 
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 0920c0c032a0..c4584c873b87 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -949,6 +949,11 @@ ssize_t fuse_simple_request(struct fuse_conn *fc, struct fuse_args *args);
 void fuse_request_send_background(struct fuse_conn *fc, struct fuse_req *req);
 bool fuse_request_queue_background(struct fuse_conn *fc, struct fuse_req *req);
 
+/**
+ * End a finished request
+ */
+void fuse_request_end(struct fuse_conn *fc, struct fuse_req *req);
+
 /* Abort all requests */
 void fuse_abort_conn(struct fuse_conn *fc);
 void fuse_wait_aborted(struct fuse_conn *fc);
-- 
2.20.1


* [PATCH v2 05/30] fuse: export fuse_len_args()
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (3 preceding siblings ...)
  2019-05-15 19:26 ` [PATCH v2 04/30] fuse: export fuse_end_request() Vivek Goyal
@ 2019-05-15 19:26 ` Vivek Goyal
  2019-05-15 19:26 ` [PATCH v2 06/30] fuse: Export fuse_send_init_request() Vivek Goyal
                   ` (24 subsequent siblings)
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:26 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

From: Stefan Hajnoczi <stefanha@redhat.com>

virtio-fs will need to query the length of fuse_arg lists.  Make the
symbol visible.
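
For illustration, the kind of calculation this enables outside dev.c,
mirroring the in.h.len computation in queue_request() (the helper is
hypothetical):

  static unsigned req_wire_len(struct fuse_req *req)
  {
          /* header plus the total size of all input arguments */
          return sizeof(struct fuse_in_header) +
                 fuse_len_args(req->in.numargs,
                               (struct fuse_arg *) req->in.args);
  }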

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 fs/fuse/dev.c    | 7 ++++---
 fs/fuse/fuse_i.h | 5 +++++
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 46d1aecd7506..d8054b1a45f5 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -350,7 +350,7 @@ void fuse_put_request(struct fuse_conn *fc, struct fuse_req *req)
 }
 EXPORT_SYMBOL_GPL(fuse_put_request);
 
-static unsigned len_args(unsigned numargs, struct fuse_arg *args)
+unsigned fuse_len_args(unsigned numargs, struct fuse_arg *args)
 {
 	unsigned nbytes = 0;
 	unsigned i;
@@ -360,6 +360,7 @@ static unsigned len_args(unsigned numargs, struct fuse_arg *args)
 
 	return nbytes;
 }
+EXPORT_SYMBOL_GPL(fuse_len_args);
 
 static u64 fuse_get_unique(struct fuse_iqueue *fiq)
 {
@@ -375,7 +376,7 @@ static unsigned int fuse_req_hash(u64 unique)
 static void queue_request(struct fuse_iqueue *fiq, struct fuse_req *req)
 {
 	req->in.h.len = sizeof(struct fuse_in_header) +
-		len_args(req->in.numargs, (struct fuse_arg *) req->in.args);
+		fuse_len_args(req->in.numargs, (struct fuse_arg *) req->in.args);
 	list_add_tail(&req->list, &fiq->pending);
 	wake_up_locked(&fiq->waitq);
 	kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
@@ -1894,7 +1895,7 @@ static int copy_out_args(struct fuse_copy_state *cs, struct fuse_out *out,
 	if (out->h.error)
 		return nbytes != reqsize ? -EINVAL : 0;
 
-	reqsize += len_args(out->numargs, out->args);
+	reqsize += fuse_len_args(out->numargs, out->args);
 
 	if (reqsize < nbytes || (reqsize > nbytes && !out->argvar))
 		return -EINVAL;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index c4584c873b87..3a235386d667 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1091,4 +1091,9 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type);
 /* readdir.c */
 int fuse_readdir(struct file *file, struct dir_context *ctx);
 
+/**
+ * Return the number of bytes in an arguments list
+ */
+unsigned fuse_len_args(unsigned numargs, struct fuse_arg *args);
+
 #endif /* _FS_FUSE_I_H */
-- 
2.20.1


* [PATCH v2 06/30] fuse: Export fuse_send_init_request()
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (4 preceding siblings ...)
  2019-05-15 19:26 ` [PATCH v2 05/30] fuse: export fuse_len_args() Vivek Goyal
@ 2019-05-15 19:26 ` Vivek Goyal
  2019-05-15 19:26 ` [PATCH v2 07/30] fuse: export fuse_get_unique() Vivek Goyal
                   ` (23 subsequent siblings)
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:26 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

This will be used by virtio-fs to send the init request to the fuse server
after initialization of the virtqueues.
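
A sketch of the intended call site (hypothetical; it mirrors what
fuse_fill_super() does for /dev/fuse mounts):

  /* after the virtqueues are set up and the connection exists */
  static int send_init_after_vq_setup(struct fuse_conn *fc)
  {
          struct fuse_req *init_req = fuse_request_alloc(0);

          if (!init_req)
                  return -ENOMEM;
          __set_bit(FR_BACKGROUND, &init_req->flags);
          fuse_send_init(fc, init_req);
          return 0;
  }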

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/dev.c    | 1 +
 fs/fuse/fuse_i.h | 1 +
 fs/fuse/inode.c  | 3 ++-
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index d8054b1a45f5..40eb827caa10 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -139,6 +139,7 @@ void fuse_request_free(struct fuse_req *req)
 	fuse_req_pages_free(req);
 	kmem_cache_free(fuse_req_cachep, req);
 }
+EXPORT_SYMBOL_GPL(fuse_request_free);
 
 void __fuse_get_request(struct fuse_req *req)
 {
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 3a235386d667..16f238d7f624 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -987,6 +987,7 @@ void fuse_conn_put(struct fuse_conn *fc);
 
 struct fuse_dev *fuse_dev_alloc(struct fuse_conn *fc);
 void fuse_dev_free(struct fuse_dev *fud);
+void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req);
 
 /**
  * Add connection to control filesystem
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index ec5d9953dfb6..f02291469518 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -958,7 +958,7 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req)
 	wake_up_all(&fc->blocked_waitq);
 }
 
-static void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req)
+void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req)
 {
 	struct fuse_init_in *arg = &req->misc.init_in;
 
@@ -988,6 +988,7 @@ static void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req)
 	req->end = process_init_reply;
 	fuse_request_send_background(fc, req);
 }
+EXPORT_SYMBOL_GPL(fuse_send_init);
 
 static void fuse_free_conn(struct fuse_conn *fc)
 {
-- 
2.20.1


* [PATCH v2 07/30] fuse: export fuse_get_unique()
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (5 preceding siblings ...)
  2019-05-15 19:26 ` [PATCH v2 06/30] fuse: Export fuse_send_init_request() Vivek Goyal
@ 2019-05-15 19:26 ` Vivek Goyal
  2019-05-15 19:26 ` [PATCH v2 08/30] fuse: extract fuse_fill_super_common() Vivek Goyal
                   ` (22 subsequent siblings)
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:26 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

From: Stefan Hajnoczi <stefanha@redhat.com>

virtio-fs will need unique IDs for FORGET requests from outside
fs/fuse/dev.c.  Make the symbol visible.
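
A sketch of the intended use outside dev.c (the helper is hypothetical;
fiq->waitq.lock protects the reqctr counter, as in queue_request()):

  static void stamp_unique(struct fuse_iqueue *fiq, struct fuse_req *req)
  {
          spin_lock(&fiq->waitq.lock);
          req->in.h.unique = fuse_get_unique(fiq);
          spin_unlock(&fiq->waitq.lock);
  }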

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 fs/fuse/dev.c    | 3 ++-
 fs/fuse/fuse_i.h | 5 +++++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 40eb827caa10..42fd3b576686 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -363,11 +363,12 @@ unsigned fuse_len_args(unsigned numargs, struct fuse_arg *args)
 }
 EXPORT_SYMBOL_GPL(fuse_len_args);
 
-static u64 fuse_get_unique(struct fuse_iqueue *fiq)
+u64 fuse_get_unique(struct fuse_iqueue *fiq)
 {
 	fiq->reqctr += FUSE_REQ_ID_STEP;
 	return fiq->reqctr;
 }
+EXPORT_SYMBOL_GPL(fuse_get_unique);
 
 static unsigned int fuse_req_hash(u64 unique)
 {
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 16f238d7f624..38a572ba650d 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1097,4 +1097,9 @@ int fuse_readdir(struct file *file, struct dir_context *ctx);
  */
 unsigned fuse_len_args(unsigned numargs, struct fuse_arg *args);
 
+/**
+ * Get the next unique ID for a request
+ */
+u64 fuse_get_unique(struct fuse_iqueue *fiq);
+
 #endif /* _FS_FUSE_I_H */
-- 
2.20.1


* [PATCH v2 08/30] fuse: extract fuse_fill_super_common()
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (6 preceding siblings ...)
  2019-05-15 19:26 ` [PATCH v2 07/30] fuse: export fuse_get_unique() Vivek Goyal
@ 2019-05-15 19:26 ` Vivek Goyal
  2019-05-15 19:26 ` [PATCH v2 09/30] fuse: add fuse_iqueue_ops callbacks Vivek Goyal
                   ` (21 subsequent siblings)
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:26 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

From: Stefan Hajnoczi <stefanha@redhat.com>

fuse_fill_super() includes code to process the fd= option and link the
struct fuse_dev to the fd's struct file.  In virtio-fs there is no file
descriptor because /dev/fuse is not used.

This patch extracts fuse_fill_super_common() so that both classic fuse
and virtio-fs can share the code to initialize a mount.

parse_fuse_opt() is also extracted so that the fuse_fill_super_common()
caller has access to the mount options.  This allows classic fuse to
handle the fd= option outside fuse_fill_super_common().
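
A sketch of a non-/dev/fuse caller (hypothetical; the point is only that
the fuse_dev pointer can live somewhere other than file->private_data):

  static int virtio_fs_fill_super_sketch(struct super_block *sb,
                                         struct fuse_mount_data *d,
                                         void **dev_private)
  {
          d->fudptr = dev_private;        /* must contain NULL on entry */
          return fuse_fill_super_common(sb, d);
  }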

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/fuse/fuse_i.h |  33 ++++++++++++
 fs/fuse/inode.c  | 137 ++++++++++++++++++++++++-----------------------
 2 files changed, 103 insertions(+), 67 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 38a572ba650d..84f094e4ac36 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -56,6 +56,25 @@ extern struct mutex fuse_mutex;
 extern unsigned max_user_bgreq;
 extern unsigned max_user_congthresh;
 
+/** Mount options */
+struct fuse_mount_data {
+	int fd;
+	unsigned rootmode;
+	kuid_t user_id;
+	kgid_t group_id;
+	unsigned fd_present:1;
+	unsigned rootmode_present:1;
+	unsigned user_id_present:1;
+	unsigned group_id_present:1;
+	unsigned default_permissions:1;
+	unsigned allow_other:1;
+	unsigned max_read;
+	unsigned blksize;
+
+	/* fuse_dev pointer to fill in, should contain NULL on entry */
+	void **fudptr;
+};
+
 /* One forget request */
 struct fuse_forget_link {
 	struct fuse_forget_one forget_one;
@@ -989,6 +1008,20 @@ struct fuse_dev *fuse_dev_alloc(struct fuse_conn *fc);
 void fuse_dev_free(struct fuse_dev *fud);
 void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req);
 
+/**
+ * Parse a mount options string
+ */
+int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
+				struct user_namespace *user_ns);
+
+/**
+ * Fill in superblock and initialize fuse connection
+ * @sb: partially-initialized superblock to fill in
+ * @mount_data: mount parameters
+ */
+int fuse_fill_super_common(struct super_block *sb,
+			   struct fuse_mount_data *mount_data);
+
 /**
  * Add connection to control filesystem
  */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index f02291469518..baf2966a753a 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -59,21 +59,6 @@ MODULE_PARM_DESC(max_user_congthresh,
 /** Congestion starts at 75% of maximum */
 #define FUSE_DEFAULT_CONGESTION_THRESHOLD (FUSE_DEFAULT_MAX_BACKGROUND * 3 / 4)
 
-struct fuse_mount_data {
-	int fd;
-	unsigned rootmode;
-	kuid_t user_id;
-	kgid_t group_id;
-	unsigned fd_present:1;
-	unsigned rootmode_present:1;
-	unsigned user_id_present:1;
-	unsigned group_id_present:1;
-	unsigned default_permissions:1;
-	unsigned allow_other:1;
-	unsigned max_read;
-	unsigned blksize;
-};
-
 struct fuse_forget_link *fuse_alloc_forget(void)
 {
 	return kzalloc(sizeof(struct fuse_forget_link), GFP_KERNEL);
@@ -482,7 +467,7 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
 	return err;
 }
 
-static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
+int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
 			  struct user_namespace *user_ns)
 {
 	char *p;
@@ -559,12 +544,13 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
 		}
 	}
 
-	if (!d->fd_present || !d->rootmode_present ||
-	    !d->user_id_present || !d->group_id_present)
+	if (!d->rootmode_present || !d->user_id_present ||
+	    !d->group_id_present)
 		return 0;
 
 	return 1;
 }
+EXPORT_SYMBOL_GPL(parse_fuse_opt);
 
 static int fuse_show_options(struct seq_file *m, struct dentry *root)
 {
@@ -1079,15 +1065,13 @@ void fuse_dev_free(struct fuse_dev *fud)
 }
 EXPORT_SYMBOL_GPL(fuse_dev_free);
 
-static int fuse_fill_super(struct super_block *sb, void *data, int silent)
+int fuse_fill_super_common(struct super_block *sb,
+			   struct fuse_mount_data *mount_data)
 {
 	struct fuse_dev *fud;
 	struct fuse_conn *fc;
 	struct inode *root;
-	struct fuse_mount_data d;
-	struct file *file;
 	struct dentry *root_dentry;
-	struct fuse_req *init_req;
 	int err;
 	int is_bdev = sb->s_bdev != NULL;
 
@@ -1097,13 +1081,10 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 
 	sb->s_flags &= ~(SB_NOSEC | SB_I_VERSION);
 
-	if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
-		goto err;
-
 	if (is_bdev) {
 #ifdef CONFIG_BLOCK
 		err = -EINVAL;
-		if (!sb_set_blocksize(sb, d.blksize))
+		if (!sb_set_blocksize(sb, mount_data->blksize))
 			goto err;
 #endif
 	} else {
@@ -1120,19 +1101,6 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	if (sb->s_user_ns != &init_user_ns)
 		sb->s_iflags |= SB_I_UNTRUSTED_MOUNTER;
 
-	file = fget(d.fd);
-	err = -EINVAL;
-	if (!file)
-		goto err;
-
-	/*
-	 * Require mount to happen from the same user namespace which
-	 * opened /dev/fuse to prevent potential attacks.
-	 */
-	if (file->f_op != &fuse_dev_operations ||
-	    file->f_cred->user_ns != sb->s_user_ns)
-		goto err_fput;
-
 	/*
 	 * If we are not in the initial user namespace posix
 	 * acls must be translated.
@@ -1143,7 +1111,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	fc = kmalloc(sizeof(*fc), GFP_KERNEL);
 	err = -ENOMEM;
 	if (!fc)
-		goto err_fput;
+		goto err;
 
 	fuse_conn_init(fc, sb->s_user_ns);
 	fc->release = fuse_free_conn;
@@ -1163,17 +1131,17 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 		fc->dont_mask = 1;
 	sb->s_flags |= SB_POSIXACL;
 
-	fc->default_permissions = d.default_permissions;
-	fc->allow_other = d.allow_other;
-	fc->user_id = d.user_id;
-	fc->group_id = d.group_id;
-	fc->max_read = max_t(unsigned, 4096, d.max_read);
+	fc->default_permissions = mount_data->default_permissions;
+	fc->allow_other = mount_data->allow_other;
+	fc->user_id = mount_data->user_id;
+	fc->group_id = mount_data->group_id;
+	fc->max_read = max_t(unsigned, 4096, mount_data->max_read);
 
 	/* Used by get_root_inode() */
 	sb->s_fs_info = fc;
 
 	err = -ENOMEM;
-	root = fuse_get_root_inode(sb, d.rootmode);
+	root = fuse_get_root_inode(sb, mount_data->rootmode);
 	sb->s_d_op = &fuse_root_dentry_operations;
 	root_dentry = d_make_root(root);
 	if (!root_dentry)
@@ -1181,20 +1149,15 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	/* Root dentry doesn't have .d_revalidate */
 	sb->s_d_op = &fuse_dentry_operations;
 
-	init_req = fuse_request_alloc(0);
-	if (!init_req)
-		goto err_put_root;
-	__set_bit(FR_BACKGROUND, &init_req->flags);
-
 	if (is_bdev) {
 		fc->destroy_req = fuse_request_alloc(0);
 		if (!fc->destroy_req)
-			goto err_free_init_req;
+			goto err_put_root;
 	}
 
 	mutex_lock(&fuse_mutex);
 	err = -EINVAL;
-	if (file->private_data)
+	if (*mount_data->fudptr)
 		goto err_unlock;
 
 	err = fuse_ctl_add_conn(fc);
@@ -1203,23 +1166,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 
 	list_add_tail(&fc->entry, &fuse_conn_list);
 	sb->s_root = root_dentry;
-	file->private_data = fud;
+	*mount_data->fudptr = fud;
 	mutex_unlock(&fuse_mutex);
-	/*
-	 * atomic_dec_and_test() in fput() provides the necessary
-	 * memory barrier for file->private_data to be visible on all
-	 * CPUs after this
-	 */
-	fput(file);
-
-	fuse_send_init(fc, init_req);
-
 	return 0;
 
  err_unlock:
 	mutex_unlock(&fuse_mutex);
- err_free_init_req:
-	fuse_request_free(init_req);
  err_put_root:
 	dput(root_dentry);
  err_dev_free:
@@ -1227,11 +1179,62 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
  err_put_conn:
 	fuse_conn_put(fc);
 	sb->s_fs_info = NULL;
- err_fput:
-	fput(file);
  err:
 	return err;
 }
+EXPORT_SYMBOL_GPL(fuse_fill_super_common);
+
+static int fuse_fill_super(struct super_block *sb, void *data, int silent)
+{
+	struct fuse_mount_data d;
+	struct file *file;
+	int is_bdev = sb->s_bdev != NULL;
+	int err;
+	struct fuse_req *init_req;
+
+	err = -EINVAL;
+	if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
+		goto err;
+	if (!d.fd_present)
+		goto err;
+
+	file = fget(d.fd);
+	if (!file)
+		goto err;
+
+	/*
+	 * Require mount to happen from the same user namespace which
+	 * opened /dev/fuse to prevent potential attacks.
+	 */
+	if ((file->f_op != &fuse_dev_operations) ||
+	    (file->f_cred->user_ns != sb->s_user_ns))
+		goto err_fput;
+
+	init_req = fuse_request_alloc(0);
+	if (!init_req)
+		goto err_fput;
+	__set_bit(FR_BACKGROUND, &init_req->flags);
+
+	d.fudptr = &file->private_data;
+	err = fuse_fill_super_common(sb, &d);
+	if (err < 0)
+		goto err_free_init_req;
+	/*
+	 * atomic_dec_and_test() in fput() provides the necessary
+	 * memory barrier for file->private_data to be visible on all
+	 * CPUs after this
+	 */
+	fput(file);
+	fuse_send_init(get_fuse_conn_super(sb), init_req);
+	return 0;
+
+err_free_init_req:
+	fuse_request_free(init_req);
+err_fput:
+	fput(file);
+err:
+	return err;
+}
 
 static struct dentry *fuse_mount(struct file_system_type *fs_type,
 		       int flags, const char *dev_name,
-- 
2.20.1


* [PATCH v2 09/30] fuse: add fuse_iqueue_ops callbacks
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (7 preceding siblings ...)
  2019-05-15 19:26 ` [PATCH v2 08/30] fuse: extract fuse_fill_super_common() Vivek Goyal
@ 2019-05-15 19:26 ` Vivek Goyal
  2019-05-15 19:26 ` [PATCH v2 10/30] fuse: Separate fuse device allocation and installation in fuse_conn Vivek Goyal
                   ` (20 subsequent siblings)
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:26 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

From: Stefan Hajnoczi <stefanha@redhat.com>

The /dev/fuse device uses fiq->waitq and fasync to signal that requests
are available.  These mechanisms do not apply to virtio-fs.  This patch
introduces callbacks so alternative behavior can be used.

Note that queue_interrupt() changes along these lines:

  spin_lock(&fiq->waitq.lock);
  wake_up_locked(&fiq->waitq);
+ kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
  spin_unlock(&fiq->waitq.lock);
- kill_fasync(&fiq->fasync, SIGIO, POLL_IN);

Since queue_request() and fuse_queue_forget() also call kill_fasync()
inside the spinlock, this should be safe.
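
For illustration, an alternative callback might look like this
(hypothetical; a real transport would hand the request to its own queue
rather than waking fiq->waitq):

  static void example_wake_pending_and_unlock(struct fuse_iqueue *fiq)
  __releases(fiq->waitq.lock)
  {
          struct fuse_req *req;

          /* the caller has just queued a request, so the list is
           * non-empty; fiq->priv holds device-specific queue state */
          req = list_first_entry(&fiq->pending, struct fuse_req, list);
          list_del_init(&req->list);
          spin_unlock(&fiq->waitq.lock);
          /* ...dispatch req to the device-specific queue here... */
  }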

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/fuse/cuse.c   |  2 +-
 fs/fuse/dev.c    | 50 ++++++++++++++++++++++++++++++++----------------
 fs/fuse/fuse_i.h | 48 +++++++++++++++++++++++++++++++++++++++++++++-
 fs/fuse/inode.c  | 16 ++++++++++++----
 4 files changed, 94 insertions(+), 22 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index 55a26f351467..a6ed7a036b50 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -504,7 +504,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
 	 * Limit the cuse channel to requests that can
 	 * be represented in file->f_cred->user_ns.
 	 */
-	fuse_conn_init(&cc->fc, file->f_cred->user_ns);
+	fuse_conn_init(&cc->fc, file->f_cred->user_ns, &fuse_dev_fiq_ops, NULL);
 
 	fud = fuse_dev_alloc(&cc->fc);
 	if (!fud) {
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 42fd3b576686..ef489beadf58 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -375,13 +375,33 @@ static unsigned int fuse_req_hash(u64 unique)
 	return hash_long(unique & ~FUSE_INT_REQ_BIT, FUSE_PQ_HASH_BITS);
 }
 
-static void queue_request(struct fuse_iqueue *fiq, struct fuse_req *req)
+/**
+ * A new request is available, wake fiq->waitq
+ */
+static void fuse_dev_wake_and_unlock(struct fuse_iqueue *fiq)
+__releases(fiq->waitq.lock)
 {
-	req->in.h.len = sizeof(struct fuse_in_header) +
-		fuse_len_args(req->in.numargs, (struct fuse_arg *) req->in.args);
-	list_add_tail(&req->list, &fiq->pending);
 	wake_up_locked(&fiq->waitq);
 	kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
+	spin_unlock(&fiq->waitq.lock);
+}
+
+const struct fuse_iqueue_ops fuse_dev_fiq_ops = {
+	.wake_forget_and_unlock		= fuse_dev_wake_and_unlock,
+	.wake_interrupt_and_unlock	= fuse_dev_wake_and_unlock,
+	.wake_pending_and_unlock	= fuse_dev_wake_and_unlock,
+};
+EXPORT_SYMBOL_GPL(fuse_dev_fiq_ops);
+
+static void queue_request_and_unlock(struct fuse_iqueue *fiq,
+				     struct fuse_req *req)
+__releases(fiq->waitq.lock)
+{
+	req->in.h.len = sizeof(struct fuse_in_header) +
+		fuse_len_args(req->in.numargs,
+			      (struct fuse_arg *) req->in.args);
+	list_add_tail(&req->list, &fiq->pending);
+	fiq->ops->wake_pending_and_unlock(fiq);
 }
 
 void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forget,
@@ -396,12 +416,11 @@ void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forget,
 	if (fiq->connected) {
 		fiq->forget_list_tail->next = forget;
 		fiq->forget_list_tail = forget;
-		wake_up_locked(&fiq->waitq);
-		kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
+		fiq->ops->wake_forget_and_unlock(fiq);
 	} else {
 		kfree(forget);
+		spin_unlock(&fiq->waitq.lock);
 	}
-	spin_unlock(&fiq->waitq.lock);
 }
 
 static void flush_bg_queue(struct fuse_conn *fc)
@@ -417,8 +436,7 @@ static void flush_bg_queue(struct fuse_conn *fc)
 		fc->active_background++;
 		spin_lock(&fiq->waitq.lock);
 		req->in.h.unique = fuse_get_unique(fiq);
-		queue_request(fiq, req);
-		spin_unlock(&fiq->waitq.lock);
+		queue_request_and_unlock(fiq, req);
 	}
 }
 
@@ -506,10 +524,10 @@ static int queue_interrupt(struct fuse_iqueue *fiq, struct fuse_req *req)
 			spin_unlock(&fiq->waitq.lock);
 			return 0;
 		}
-		wake_up_locked(&fiq->waitq);
-		kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
+		fiq->ops->wake_interrupt_and_unlock(fiq);
+	} else {
+		spin_unlock(&fiq->waitq.lock);
 	}
-	spin_unlock(&fiq->waitq.lock);
 	return 0;
 }
 
@@ -569,11 +587,10 @@ static void __fuse_request_send(struct fuse_conn *fc, struct fuse_req *req)
 		req->out.h.error = -ENOTCONN;
 	} else {
 		req->in.h.unique = fuse_get_unique(fiq);
-		queue_request(fiq, req);
 		/* acquire extra reference, since request is still needed
 		   after fuse_request_end() */
 		__fuse_get_request(req);
-		spin_unlock(&fiq->waitq.lock);
+		queue_request_and_unlock(fiq, req);
 
 		request_wait_answer(fc, req);
 		/* Pairs with smp_wmb() in fuse_request_end() */
@@ -706,10 +723,11 @@ static int fuse_request_send_notify_reply(struct fuse_conn *fc,
 	req->in.h.unique = unique;
 	spin_lock(&fiq->waitq.lock);
 	if (fiq->connected) {
-		queue_request(fiq, req);
+		queue_request_and_unlock(fiq, req);
 		err = 0;
+	} else {
+		spin_unlock(&fiq->waitq.lock);
 	}
-	spin_unlock(&fiq->waitq.lock);
 
 	return err;
 }
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 84f094e4ac36..0b578e07156d 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -71,6 +71,12 @@ struct fuse_mount_data {
 	unsigned max_read;
 	unsigned blksize;
 
+	/* fuse input queue operations */
+	const struct fuse_iqueue_ops *fiq_ops;
+
+	/* device-specific state for fuse_iqueue */
+	void *fiq_priv;
+
 	/* fuse_dev pointer to fill in, should contain NULL on entry */
 	void **fudptr;
 };
@@ -461,6 +467,39 @@ struct fuse_req {
 	struct file *stolen_file;
 };
 
+struct fuse_iqueue;
+
+/**
+ * Input queue callbacks
+ *
+ * Input queue signalling is device-specific.  For example, the /dev/fuse file
+ * uses fiq->waitq and fasync to wake processes that are waiting on queue
+ * readiness.  These callbacks allow other device types to respond to input
+ * queue activity.
+ */
+struct fuse_iqueue_ops {
+	/**
+	 * Signal that a forget has been queued
+	 */
+	void (*wake_forget_and_unlock)(struct fuse_iqueue *fiq)
+	__releases(fiq->waitq.lock);
+
+	/**
+	 * Signal that an INTERRUPT request has been queued
+	 */
+	void (*wake_interrupt_and_unlock)(struct fuse_iqueue *fiq)
+	__releases(fiq->waitq.lock);
+
+	/**
+	 * Signal that a request has been queued
+	 */
+	void (*wake_pending_and_unlock)(struct fuse_iqueue *fiq)
+	__releases(fiq->waitq.lock);
+};
+
+/** /dev/fuse input queue operations */
+extern const struct fuse_iqueue_ops fuse_dev_fiq_ops;
+
 struct fuse_iqueue {
 	/** Connection established */
 	unsigned connected;
@@ -486,6 +525,12 @@ struct fuse_iqueue {
 
 	/** O_ASYNC requests */
 	struct fasync_struct *fasync;
+
+	/** Device-specific callbacks */
+	const struct fuse_iqueue_ops *ops;
+
+	/** Device-specific state */
+	void *priv;
 };
 
 #define FUSE_PQ_HASH_BITS 8
@@ -997,7 +1042,8 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
 /**
  * Initialize fuse_conn
  */
-void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
+		    const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv);
 
 /**
  * Release reference to fuse_conn
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index baf2966a753a..126e77854dac 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -570,7 +570,9 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 	return 0;
 }
 
-static void fuse_iqueue_init(struct fuse_iqueue *fiq)
+static void fuse_iqueue_init(struct fuse_iqueue *fiq,
+			     const struct fuse_iqueue_ops *ops,
+			     void *priv)
 {
 	memset(fiq, 0, sizeof(struct fuse_iqueue));
 	init_waitqueue_head(&fiq->waitq);
@@ -578,6 +580,8 @@ static void fuse_iqueue_init(struct fuse_iqueue *fiq)
 	INIT_LIST_HEAD(&fiq->interrupts);
 	fiq->forget_list_tail = &fiq->forget_list_head;
 	fiq->connected = 1;
+	fiq->ops = ops;
+	fiq->priv = priv;
 }
 
 static void fuse_pqueue_init(struct fuse_pqueue *fpq)
@@ -591,7 +595,8 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
 	fpq->connected = 1;
 }
 
-void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
+		    const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv)
 {
 	memset(fc, 0, sizeof(*fc));
 	spin_lock_init(&fc->lock);
@@ -601,7 +606,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
 	atomic_set(&fc->dev_count, 1);
 	init_waitqueue_head(&fc->blocked_waitq);
 	init_waitqueue_head(&fc->reserved_req_waitq);
-	fuse_iqueue_init(&fc->iq);
+	fuse_iqueue_init(&fc->iq, fiq_ops, fiq_priv);
 	INIT_LIST_HEAD(&fc->bg_queue);
 	INIT_LIST_HEAD(&fc->entry);
 	INIT_LIST_HEAD(&fc->devices);
@@ -1113,7 +1118,8 @@ int fuse_fill_super_common(struct super_block *sb,
 	if (!fc)
 		goto err;
 
-	fuse_conn_init(fc, sb->s_user_ns);
+	fuse_conn_init(fc, sb->s_user_ns, mount_data->fiq_ops,
+		       mount_data->fiq_priv);
 	fc->release = fuse_free_conn;
 
 	fud = fuse_dev_alloc(fc);
@@ -1215,6 +1221,8 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 		goto err_fput;
 	__set_bit(FR_BACKGROUND, &init_req->flags);
 
+	d.fiq_ops = &fuse_dev_fiq_ops;
+	d.fiq_priv = NULL;
 	d.fudptr = &file->private_data;
 	err = fuse_fill_super_common(sb, &d);
 	if (err < 0)
-- 
2.20.1


* [PATCH v2 10/30] fuse: Separate fuse device allocation and installation in fuse_conn
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (8 preceding siblings ...)
  2019-05-15 19:26 ` [PATCH v2 09/30] fuse: add fuse_iqueue_ops callbacks Vivek Goyal
@ 2019-05-15 19:26 ` Vivek Goyal
  2019-05-15 19:26 ` [PATCH v2 11/30] virtio_fs: add skeleton virtio_fs.ko module Vivek Goyal
                   ` (19 subsequent siblings)
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:26 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

As of now fuse_dev_alloc() both allocates a fuse device and installs it in
the fuse_conn list. fuse_dev_alloc() can fail if the fuse device allocation
fails.

virtio-fs needs to initialize multiple fuse devices (one per virtio
queue). It initializes one fuse device as part of the call to
fuse_fill_super_common() and the rest of the devices are allocated and
installed after that.

But we can't afford to fail after calling fuse_fill_super_common() as we
don't have a way to undo all the actions done by fuse_fill_super_common().
So to avoid failures after the call to fuse_fill_super_common(),
pre-allocate all fuse devices early and install them into the fuse
connection later.

This patch provides two separate helpers for fuse device allocation and
fuse device installation in fuse_conn.
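
The resulting pattern for a multiqueue transport looks roughly like this
(a hypothetical sketch, not code from this series):

  static int alloc_and_install_devs(struct fuse_conn *fc,
                                    struct fuse_dev **fud, int nr_queues)
  {
          int i;

          /* allocations that may fail happen up front */
          for (i = 0; i < nr_queues; i++) {
                  fud[i] = fuse_dev_alloc();
                  if (!fud[i])
                          return -ENOMEM; /* caller frees fud[0..i-1] */
          }

          /* ...fuse_fill_super_common() would run in between... */

          /* installation cannot fail */
          for (i = 0; i < nr_queues; i++)
                  fuse_dev_install(fud[i], fc);
          return 0;
  }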

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/cuse.c   |  2 +-
 fs/fuse/dev.c    |  2 +-
 fs/fuse/fuse_i.h |  4 +++-
 fs/fuse/inode.c  | 25 ++++++++++++++++++++-----
 4 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index a6ed7a036b50..a509747153a7 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -506,7 +506,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
 	 */
 	fuse_conn_init(&cc->fc, file->f_cred->user_ns, &fuse_dev_fiq_ops, NULL);
 
-	fud = fuse_dev_alloc(&cc->fc);
+	fud = fuse_dev_alloc_install(&cc->fc);
 	if (!fud) {
 		kfree(cc);
 		return -ENOMEM;
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index ef489beadf58..ee9dd38bc0f0 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2318,7 +2318,7 @@ static int fuse_device_clone(struct fuse_conn *fc, struct file *new)
 	if (new->private_data)
 		return -EINVAL;
 
-	fud = fuse_dev_alloc(fc);
+	fud = fuse_dev_alloc_install(fc);
 	if (!fud)
 		return -ENOMEM;
 
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 0b578e07156d..4008ed65a48d 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1050,7 +1050,9 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
  */
 void fuse_conn_put(struct fuse_conn *fc);
 
-struct fuse_dev *fuse_dev_alloc(struct fuse_conn *fc);
+struct fuse_dev *fuse_dev_alloc_install(struct fuse_conn *fc);
+struct fuse_dev *fuse_dev_alloc(void);
+void fuse_dev_install(struct fuse_dev *fud, struct fuse_conn *fc);
 void fuse_dev_free(struct fuse_dev *fud);
 void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req);
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 126e77854dac..9b0114437a14 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1027,8 +1027,7 @@ static int fuse_bdi_init(struct fuse_conn *fc, struct super_block *sb)
 	return 0;
 }
 
-struct fuse_dev *fuse_dev_alloc(struct fuse_conn *fc)
-{
+struct fuse_dev *fuse_dev_alloc(void) {
 	struct fuse_dev *fud;
 	struct list_head *pq;
 
@@ -1043,16 +1042,32 @@ struct fuse_dev *fuse_dev_alloc(struct fuse_conn *fc)
 	}
 
 	fud->pq.processing = pq;
-	fud->fc = fuse_conn_get(fc);
 	fuse_pqueue_init(&fud->pq);
 
+	return fud;
+}
+EXPORT_SYMBOL_GPL(fuse_dev_alloc);
+
+void fuse_dev_install(struct fuse_dev *fud, struct fuse_conn *fc) {
+	fud->fc = fuse_conn_get(fc);
 	spin_lock(&fc->lock);
 	list_add_tail(&fud->entry, &fc->devices);
 	spin_unlock(&fc->lock);
+}
+EXPORT_SYMBOL_GPL(fuse_dev_install);
 
+struct fuse_dev *fuse_dev_alloc_install(struct fuse_conn *fc)
+{
+	struct fuse_dev *fud;
+
+	fud = fuse_dev_alloc();
+	if (!fud)
+		return NULL;
+
+	fuse_dev_install(fud, fc);
 	return fud;
 }
-EXPORT_SYMBOL_GPL(fuse_dev_alloc);
+EXPORT_SYMBOL_GPL(fuse_dev_alloc_install);
 
 void fuse_dev_free(struct fuse_dev *fud)
 {
@@ -1122,7 +1137,7 @@ int fuse_fill_super_common(struct super_block *sb,
 		       mount_data->fiq_priv);
 	fc->release = fuse_free_conn;
 
-	fud = fuse_dev_alloc(fc);
+	fud = fuse_dev_alloc_install(fc);
 	if (!fud)
 		goto err_put_conn;
 
-- 
2.20.1


* [PATCH v2 11/30] virtio_fs: add skeleton virtio_fs.ko module
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (9 preceding siblings ...)
  2019-05-15 19:26 ` [PATCH v2 10/30] fuse: Separate fuse device allocation and installation in fuse_conn Vivek Goyal
@ 2019-05-15 19:26 ` Vivek Goyal
  2019-05-15 19:26 ` [PATCH v2 12/30] dax: remove block device dependencies Vivek Goyal
                   ` (18 subsequent siblings)
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:26 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

From: Stefan Hajnoczi <stefanha@redhat.com>

Add a basic file system module for virtio-fs.
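
The scaffolding such a module needs looks roughly like this (an
illustrative sketch, not the actual ~950-line virtio_fs.c this patch adds;
it assumes the VIRTIO_ID_FS device ID that this series adds to
virtio_ids.h):

  #include <linux/module.h>
  #include <linux/virtio.h>
  #include <uapi/linux/virtio_ids.h>

  static const struct virtio_device_id id_table[] = {
          { VIRTIO_ID_FS, VIRTIO_DEV_ANY_ID },
          {},
  };

  static int virtio_fs_probe(struct virtio_device *vdev)
  {
          /* set up virtqueues and register the file system here */
          return 0;
  }

  static void virtio_fs_remove(struct virtio_device *vdev)
  {
          /* tear down virtqueues, unregister the file system */
  }

  static struct virtio_driver virtio_fs_driver = {
          .driver.name    = "virtio_fs",
          .driver.owner   = THIS_MODULE,
          .id_table       = id_table,
          .probe          = virtio_fs_probe,
          .remove         = virtio_fs_remove,
  };

  module_virtio_driver(virtio_fs_driver);
  MODULE_LICENSE("GPL");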

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/Kconfig                 |  11 +
 fs/fuse/Makefile                |   1 +
 fs/fuse/fuse_i.h                |  13 +
 fs/fuse/inode.c                 |  15 +-
 fs/fuse/virtio_fs.c             | 956 ++++++++++++++++++++++++++++++++
 include/uapi/linux/virtio_fs.h  |  41 ++
 include/uapi/linux/virtio_ids.h |   1 +
 7 files changed, 1035 insertions(+), 3 deletions(-)
 create mode 100644 fs/fuse/virtio_fs.c
 create mode 100644 include/uapi/linux/virtio_fs.h

diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
index 76f09ce7e5b2..46e9a8ff9f7a 100644
--- a/fs/fuse/Kconfig
+++ b/fs/fuse/Kconfig
@@ -26,3 +26,14 @@ config CUSE
 
 	  If you want to develop or use a userspace character device
 	  based on CUSE, answer Y or M.
+
+config VIRTIO_FS
+	tristate "Virtio Filesystem"
+	depends on FUSE_FS
+	select VIRTIO
+	help
+	  The Virtio Filesystem allows guests to mount file systems from the
+          host.
+
+	  If you want to share files between guests or with the host, answer Y
+          or M.
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index f7b807bc1027..47b78fac5809 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -4,5 +4,6 @@
 
 obj-$(CONFIG_FUSE_FS) += fuse.o
 obj-$(CONFIG_CUSE) += cuse.o
+obj-$(CONFIG_VIRTIO_FS) += virtio_fs.o
 
 fuse-objs := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 4008ed65a48d..f5cb4d40b83f 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -59,15 +59,18 @@ extern unsigned max_user_congthresh;
 /** Mount options */
 struct fuse_mount_data {
 	int fd;
+	const char *tag; /* lifetime: .fill_super() data argument */
 	unsigned rootmode;
 	kuid_t user_id;
 	kgid_t group_id;
 	unsigned fd_present:1;
+	unsigned tag_present:1;
 	unsigned rootmode_present:1;
 	unsigned user_id_present:1;
 	unsigned group_id_present:1;
 	unsigned default_permissions:1;
 	unsigned allow_other:1;
+	unsigned destroy:1;
 	unsigned max_read;
 	unsigned blksize;
 
@@ -465,6 +468,9 @@ struct fuse_req {
 
 	/** Request is stolen from fuse_file->reserved_req */
 	struct file *stolen_file;
+
+	/** virtio-fs's physically contiguous buffer for in and out args */
+	void *argbuf;
 };
 
 struct fuse_iqueue;
@@ -1070,6 +1076,13 @@ int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
 int fuse_fill_super_common(struct super_block *sb,
 			   struct fuse_mount_data *mount_data);
 
+/**
+ * Disassociate fuse connection from superblock and kill the superblock
+ *
+ * Calls kill_anon_super(); do not use with bdev mounts.
+ */
+void fuse_kill_sb_anon(struct super_block *sb);
+
 /**
  * Add connection to control filesystem
  */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 9b0114437a14..731a8a74d032 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -434,6 +434,7 @@ static int fuse_statfs(struct dentry *dentry, struct kstatfs *buf)
 
 enum {
 	OPT_FD,
+	OPT_TAG,
 	OPT_ROOTMODE,
 	OPT_USER_ID,
 	OPT_GROUP_ID,
@@ -446,6 +447,7 @@ enum {
 
 static const match_table_t tokens = {
 	{OPT_FD,			"fd=%u"},
+	{OPT_TAG,			"tag=%s"},
 	{OPT_ROOTMODE,			"rootmode=%o"},
 	{OPT_USER_ID,			"user_id=%u"},
 	{OPT_GROUP_ID,			"group_id=%u"},
@@ -492,6 +494,11 @@ int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
 			d->fd_present = 1;
 			break;
 
+		case OPT_TAG:
+			d->tag = args[0].from;
+			d->tag_present = 1;
+			break;
+
 		case OPT_ROOTMODE:
 			if (match_octal(&args[0], &value))
 				return 0;
@@ -1170,7 +1177,7 @@ int fuse_fill_super_common(struct super_block *sb,
 	/* Root dentry doesn't have .d_revalidate */
 	sb->s_d_op = &fuse_dentry_operations;
 
-	if (is_bdev) {
+	if (mount_data->destroy) {
 		fc->destroy_req = fuse_request_alloc(0);
 		if (!fc->destroy_req)
 			goto err_put_root;
@@ -1216,7 +1223,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	err = -EINVAL;
 	if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
 		goto err;
-	if (!d.fd_present)
+	if (!d.fd_present || d.tag_present)
 		goto err;
 
 	file = fget(d.fd);
@@ -1239,6 +1246,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	d.fiq_ops = &fuse_dev_fiq_ops;
 	d.fiq_priv = NULL;
 	d.fudptr = &file->private_data;
+	d.destroy = is_bdev;
 	err = fuse_fill_super_common(sb, &d);
 	if (err < 0)
 		goto err_free_init_req;
@@ -1282,11 +1290,12 @@ static void fuse_sb_destroy(struct super_block *sb)
 	}
 }
 
-static void fuse_kill_sb_anon(struct super_block *sb)
+void fuse_kill_sb_anon(struct super_block *sb)
 {
 	fuse_sb_destroy(sb);
 	kill_anon_super(sb);
 }
+EXPORT_SYMBOL_GPL(fuse_kill_sb_anon);
 
 static struct file_system_type fuse_fs_type = {
 	.owner		= THIS_MODULE,
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
new file mode 100644
index 000000000000..e76e0f5dce40
--- /dev/null
+++ b/fs/fuse/virtio_fs.c
@@ -0,0 +1,956 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * virtio-fs: Virtio Filesystem
+ * Copyright (C) 2018 Red Hat, Inc.
+ */
+
+#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/virtio.h>
+#include <linux/virtio_fs.h>
+#include <linux/delay.h>
+#include "fuse_i.h"
+
+/* List of virtio-fs device instances and a lock for the list */
+static DEFINE_MUTEX(virtio_fs_mutex);
+static LIST_HEAD(virtio_fs_instances);
+
+enum {
+	VQ_HIPRIO,
+	VQ_REQUEST
+};
+
+/* Per-virtqueue state */
+struct virtio_fs_vq {
+	spinlock_t lock;
+	struct virtqueue *vq;     /* protected by ->lock */
+	struct work_struct done_work;
+	struct list_head queued_reqs;
+	struct delayed_work dispatch_work;
+	struct fuse_dev *fud;
+	char name[24];
+} ____cacheline_aligned_in_smp;
+
+/* A virtio-fs device instance */
+struct virtio_fs {
+	struct list_head list;    /* on virtio_fs_instances */
+	char *tag;
+	struct virtio_fs_vq *vqs;
+	unsigned nvqs;            /* number of virtqueues */
+	unsigned num_queues;      /* number of request queues */
+};
+
+struct virtio_fs_forget {
+	struct fuse_in_header ih;
+	struct fuse_forget_in arg;
+	/* This request can be temporarily queued on virt queue */
+	struct list_head list;
+};
+
+static inline struct virtio_fs_vq *vq_to_fsvq(struct virtqueue *vq)
+{
+	struct virtio_fs *fs = vq->vdev->priv;
+
+	return &fs->vqs[vq->index];
+}
+
+static inline struct fuse_pqueue *vq_to_fpq(struct virtqueue *vq)
+{
+	return &vq_to_fsvq(vq)->fud->pq;
+}
+
+/* Add a new instance to the list or return -EEXIST if tag name exists */
+static int virtio_fs_add_instance(struct virtio_fs *fs)
+{
+	struct virtio_fs *fs2;
+	bool duplicate = false;
+
+	mutex_lock(&virtio_fs_mutex);
+
+	list_for_each_entry(fs2, &virtio_fs_instances, list) {
+		if (strcmp(fs->tag, fs2->tag) == 0)
+			duplicate = true;
+	}
+
+	if (!duplicate)
+		list_add_tail(&fs->list, &virtio_fs_instances);
+
+	mutex_unlock(&virtio_fs_mutex);
+
+	if (duplicate)
+		return -EEXIST;
+	return 0;
+}
+
+/* Return the virtio_fs with a given tag, or NULL */
+static struct virtio_fs *virtio_fs_find_instance(const char *tag)
+{
+	struct virtio_fs *fs;
+
+	mutex_lock(&virtio_fs_mutex);
+
+	list_for_each_entry(fs, &virtio_fs_instances, list) {
+		if (strcmp(fs->tag, tag) == 0)
+			goto found;
+	}
+
+	fs = NULL; /* not found */
+
+found:
+	mutex_unlock(&virtio_fs_mutex);
+
+	return fs;
+}
+
+static void virtio_fs_free_devs(struct virtio_fs *fs)
+{
+	unsigned int i;
+
+	/* TODO lock */
+
+	for (i = 0; i < fs->nvqs; i++) {
+		struct virtio_fs_vq *fsvq = &fs->vqs[i];
+
+		if (!fsvq->fud)
+			continue;
+
+		flush_work(&fsvq->done_work);
+		flush_delayed_work(&fsvq->dispatch_work);
+
+		fuse_dev_free(fsvq->fud); /* TODO need to quiesce/end_requests/decrement dev_count */
+		fsvq->fud = NULL;
+	}
+}
+
+/* Read filesystem name from virtio config into fs->tag (devm-allocated). */
+static int virtio_fs_read_tag(struct virtio_device *vdev, struct virtio_fs *fs)
+{
+	char tag_buf[sizeof_field(struct virtio_fs_config, tag)];
+	char *end;
+	size_t len;
+
+	virtio_cread_bytes(vdev, offsetof(struct virtio_fs_config, tag),
+			   &tag_buf, sizeof(tag_buf));
+	end = memchr(tag_buf, '\0', sizeof(tag_buf));
+	if (end == tag_buf)
+		return -EINVAL; /* empty tag */
+	if (!end)
+		end = &tag_buf[sizeof(tag_buf)];
+
+	len = end - tag_buf;
+	fs->tag = devm_kmalloc(&vdev->dev, len + 1, GFP_KERNEL);
+	if (!fs->tag)
+		return -ENOMEM;
+	memcpy(fs->tag, tag_buf, len);
+	fs->tag[len] = '\0';
+	return 0;
+}
+
+/* Work function for hiprio completion */
+static void virtio_fs_hiprio_done_work(struct work_struct *work)
+{
+	struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
+						 done_work);
+	struct virtqueue *vq = fsvq->vq;
+
+	/* Free completed FUSE_FORGET requests */
+	spin_lock(&fsvq->lock);
+	do {
+		unsigned len;
+		void *req;
+
+		virtqueue_disable_cb(vq);
+
+		while ((req = virtqueue_get_buf(vq, &len)) != NULL)
+			kfree(req);
+	} while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
+	spin_unlock(&fsvq->lock);
+}
+
+static void virtio_fs_dummy_dispatch_work(struct work_struct *work)
+{
+	return;
+}
+
+static void virtio_fs_hiprio_dispatch_work(struct work_struct *work)
+{
+	struct virtio_fs_forget *forget;
+	struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
+						 dispatch_work.work);
+	struct virtqueue *vq = fsvq->vq;
+	struct scatterlist sg;
+	struct scatterlist *sgs[] = {&sg};
+	bool notify;
+	int ret;
+
+	pr_debug("worker virtio_fs_hiprio_dispatch_work() called.\n");
+	while (1) {
+		spin_lock(&fsvq->lock);
+		forget = list_first_entry_or_null(&fsvq->queued_reqs,
+					struct virtio_fs_forget, list);
+		if (!forget) {
+			spin_unlock(&fsvq->lock);
+			return;
+		}
+
+		list_del(&forget->list);
+		sg_init_one(&sg, forget, sizeof(*forget));
+
+		/* Enqueue the request */
+		dev_dbg(&vq->vdev->dev, "%s\n", __func__);
+		ret = virtqueue_add_sgs(vq, sgs, 1, 0, forget, GFP_ATOMIC);
+		if (ret < 0) {
+			if (ret == -ENOMEM || ret == -ENOSPC) {
+				pr_debug("virtio-fs: Could not queue FORGET:"
+					 " err=%d. Will try later\n", ret);
+				list_add_tail(&forget->list,
+						&fsvq->queued_reqs);
+				schedule_delayed_work(&fsvq->dispatch_work,
+						msecs_to_jiffies(1));
+			} else {
+				pr_debug("virtio-fs: Could not queue FORGET:"
+					 " err=%d. Dropping it.\n", ret);
+				kfree(forget);
+			}
+			spin_unlock(&fsvq->lock);
+			return;
+		}
+
+		notify = virtqueue_kick_prepare(vq);
+		spin_unlock(&fsvq->lock);
+
+		if (notify)
+			virtqueue_notify(vq);
+		pr_debug("worker virtio_fs_hiprio_dispatch_work() dispatched one forget request.\n");
+	}
+}
+
+/* Allocate and copy args into req->argbuf */
+static int copy_args_to_argbuf(struct fuse_req *req)
+{
+	unsigned offset = 0;
+	unsigned num_in;
+	unsigned num_out;
+	unsigned len;
+	unsigned i;
+
+	num_in = req->in.numargs - req->in.argpages;
+	num_out = req->out.numargs - req->out.argpages;
+	len = fuse_len_args(num_in, (struct fuse_arg *)req->in.args) +
+	      fuse_len_args(num_out, req->out.args);
+
+	req->argbuf = kmalloc(len, GFP_ATOMIC);
+	if (!req->argbuf)
+		return -ENOMEM;
+
+	for (i = 0; i < num_in; i++) {
+		memcpy(req->argbuf + offset,
+		       req->in.args[i].value,
+		       req->in.args[i].size);
+		offset += req->in.args[i].size;
+	}
+
+	return 0;
+}
+
+/* Copy args out of and free req->argbuf */
+static void copy_args_from_argbuf(struct fuse_req *req)
+{
+	unsigned remaining;
+	unsigned offset;
+	unsigned num_in;
+	unsigned num_out;
+	unsigned i;
+
+	remaining = req->out.h.len - sizeof(req->out.h);
+	num_in = req->in.numargs - req->in.argpages;
+	num_out = req->out.numargs - req->out.argpages;
+	offset = fuse_len_args(num_in, (struct fuse_arg *)req->in.args);
+
+	for (i = 0; i < num_out; i++) {
+		unsigned argsize = req->out.args[i].size;
+
+		if (req->out.argvar &&
+		    i == req->out.numargs - 1 &&
+		    argsize > remaining) {
+			argsize = remaining;
+		}
+
+		memcpy(req->out.args[i].value, req->argbuf + offset, argsize);
+		offset += argsize;
+
+		if (i != req->out.numargs - 1)
+			remaining -= argsize;
+	}
+
+	/* Store the actual size of the variable-length arg */
+	if (req->out.argvar)
+		req->out.args[req->out.numargs - 1].size = remaining;
+
+	kfree(req->argbuf);
+	req->argbuf = NULL;
+}
+
+/* Work function for request completion */
+static void virtio_fs_requests_done_work(struct work_struct *work)
+{
+	struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
+						 done_work);
+	struct fuse_pqueue *fpq = &fsvq->fud->pq;
+	struct fuse_conn *fc = fsvq->fud->fc;
+	struct virtqueue *vq = fsvq->vq;
+	struct fuse_req *req;
+	struct fuse_req *next;
+	LIST_HEAD(reqs);
+
+	/* Collect completed requests off the virtqueue */
+	spin_lock(&fsvq->lock);
+	do {
+		unsigned len;
+
+		virtqueue_disable_cb(vq);
+
+		while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
+			spin_lock(&fpq->lock);
+			list_move_tail(&req->list, &reqs);
+			spin_unlock(&fpq->lock);
+		}
+	} while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
+	spin_unlock(&fsvq->lock);
+
+	/* End requests */
+	list_for_each_entry_safe(req, next, &reqs, list) {
+		/* TODO check unique */
+		/* TODO fuse_len_args(out) against oh.len */
+
+		copy_args_from_argbuf(req);
+
+		/* TODO zeroing? */
+
+		spin_lock(&fpq->lock);
+		clear_bit(FR_SENT, &req->flags);
+		list_del_init(&req->list);
+		spin_unlock(&fpq->lock);
+
+		fuse_request_end(fc, req);
+	}
+}
+
+/* Virtqueue interrupt handler */
+static void virtio_fs_vq_done(struct virtqueue *vq)
+{
+	struct virtio_fs_vq *fsvq = vq_to_fsvq(vq);
+
+	dev_dbg(&vq->vdev->dev, "%s %s\n", __func__, fsvq->name);
+
+	schedule_work(&fsvq->done_work);
+}
+
+/* Initialize virtqueues */
+static int virtio_fs_setup_vqs(struct virtio_device *vdev,
+			       struct virtio_fs *fs)
+{
+	struct virtqueue **vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	unsigned i;
+	int ret;
+
+	virtio_cread(vdev, struct virtio_fs_config, num_queues,
+		     &fs->num_queues);
+	if (fs->num_queues == 0)
+		return -EINVAL;
+
+	fs->nvqs = 1 + fs->num_queues;
+
+	fs->vqs = devm_kcalloc(&vdev->dev, fs->nvqs,
+				sizeof(fs->vqs[VQ_HIPRIO]), GFP_KERNEL);
+	if (!fs->vqs)
+		return -ENOMEM;
+
+	vqs = kmalloc_array(fs->nvqs, sizeof(vqs[VQ_HIPRIO]), GFP_KERNEL);
+	callbacks = kmalloc_array(fs->nvqs, sizeof(callbacks[VQ_HIPRIO]),
+					GFP_KERNEL);
+	names = kmalloc_array(fs->nvqs, sizeof(names[VQ_HIPRIO]), GFP_KERNEL);
+	if (!vqs || !callbacks || !names) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	callbacks[VQ_HIPRIO] = virtio_fs_vq_done;
+	snprintf(fs->vqs[VQ_HIPRIO].name, sizeof(fs->vqs[VQ_HIPRIO].name),
+			"hiprio");
+	names[VQ_HIPRIO] = fs->vqs[VQ_HIPRIO].name;
+	INIT_WORK(&fs->vqs[VQ_HIPRIO].done_work, virtio_fs_hiprio_done_work);
+	INIT_LIST_HEAD(&fs->vqs[VQ_HIPRIO].queued_reqs);
+	INIT_DELAYED_WORK(&fs->vqs[VQ_HIPRIO].dispatch_work,
+			virtio_fs_hiprio_dispatch_work);
+	spin_lock_init(&fs->vqs[VQ_HIPRIO].lock);
+
+	/* Initialize the requests virtqueues */
+	for (i = VQ_REQUEST; i < fs->nvqs; i++) {
+		spin_lock_init(&fs->vqs[i].lock);
+		INIT_WORK(&fs->vqs[i].done_work, virtio_fs_requests_done_work);
+		INIT_DELAYED_WORK(&fs->vqs[i].dispatch_work,
+					virtio_fs_dummy_dispatch_work);
+		INIT_LIST_HEAD(&fs->vqs[i].queued_reqs);
+		snprintf(fs->vqs[i].name, sizeof(fs->vqs[i].name),
+			 "requests.%u", i - VQ_REQUEST);
+		callbacks[i] = virtio_fs_vq_done;
+		names[i] = fs->vqs[i].name;
+	}
+
+	ret = virtio_find_vqs(vdev, fs->nvqs, vqs, callbacks, names, NULL);
+	if (ret < 0)
+		goto out;
+
+	for (i = 0; i < fs->nvqs; i++)
+		fs->vqs[i].vq = vqs[i];
+
+out:
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
+	return ret;
+}
+
+/* Free virtqueues (device must already be reset) */
+static void virtio_fs_cleanup_vqs(struct virtio_device *vdev,
+				  struct virtio_fs *fs)
+{
+	vdev->config->del_vqs(vdev);
+}
+
+static int virtio_fs_probe(struct virtio_device *vdev)
+{
+	struct virtio_fs *fs;
+	int ret;
+
+	fs = devm_kzalloc(&vdev->dev, sizeof(*fs), GFP_KERNEL);
+	if (!fs)
+		return -ENOMEM;
+	vdev->priv = fs;
+
+	ret = virtio_fs_read_tag(vdev, fs);
+	if (ret < 0)
+		goto out;
+
+	ret = virtio_fs_setup_vqs(vdev, fs);
+	if (ret < 0)
+		goto out;
+
+	/* TODO vq affinity */
+	/* TODO populate notifications vq */
+
+	/* Bring the device online in case the filesystem is mounted and
+	 * requests need to be sent before we return.
+	 */
+	virtio_device_ready(vdev);
+
+	ret = virtio_fs_add_instance(fs);
+	if (ret < 0)
+		goto out_vqs;
+
+	return 0;
+
+out_vqs:
+	vdev->config->reset(vdev);
+	virtio_fs_cleanup_vqs(vdev, fs);
+
+out:
+	vdev->priv = NULL;
+	return ret;
+}
+
+static void virtio_fs_remove(struct virtio_device *vdev)
+{
+	struct virtio_fs *fs = vdev->priv;
+
+	virtio_fs_free_devs(fs);
+
+	vdev->config->reset(vdev);
+	virtio_fs_cleanup_vqs(vdev, fs);
+
+	mutex_lock(&virtio_fs_mutex);
+	list_del(&fs->list);
+	mutex_unlock(&virtio_fs_mutex);
+
+	vdev->priv = NULL;
+}
+
+#ifdef CONFIG_PM_SLEEP
+static int virtio_fs_freeze(struct virtio_device *vdev)
+{
+	return 0; /* TODO */
+}
+
+static int virtio_fs_restore(struct virtio_device *vdev)
+{
+	return 0; /* TODO */
+}
+#endif /* CONFIG_PM_SLEEP */
+
+static const struct virtio_device_id id_table[] = {
+	{ VIRTIO_ID_FS, VIRTIO_DEV_ANY_ID },
+	{},
+};
+
+static const unsigned int feature_table[] = {};
+
+static struct virtio_driver virtio_fs_driver = {
+	.driver.name		= KBUILD_MODNAME,
+	.driver.owner		= THIS_MODULE,
+	.id_table		= id_table,
+	.feature_table		= feature_table,
+	.feature_table_size	= ARRAY_SIZE(feature_table),
+	/* TODO validate config_get != NULL */
+	.probe			= virtio_fs_probe,
+	.remove			= virtio_fs_remove,
+#ifdef CONFIG_PM_SLEEP
+	.freeze			= virtio_fs_freeze,
+	.restore		= virtio_fs_restore,
+#endif
+};
+
+static void virtio_fs_wake_forget_and_unlock(struct fuse_iqueue *fiq)
+__releases(fiq->waitq.lock)
+{
+	struct fuse_forget_link *link;
+	struct virtio_fs_forget *forget;
+	struct scatterlist sg;
+	struct scatterlist *sgs[] = {&sg};
+	struct virtio_fs *fs;
+	struct virtqueue *vq;
+	struct virtio_fs_vq *fsvq;
+	bool notify;
+	u64 unique;
+	int ret;
+
+	BUG_ON(!fiq->forget_list_head.next);
+	link = fiq->forget_list_head.next;
+	BUG_ON(link->next);
+	fiq->forget_list_head.next = NULL;
+	fiq->forget_list_tail = &fiq->forget_list_head;
+
+	unique = fuse_get_unique(fiq);
+
+	fs = fiq->priv;
+	fsvq = &fs->vqs[VQ_HIPRIO];
+	spin_unlock(&fiq->waitq.lock);
+
+	/* Allocate a buffer for the request */
+	forget = kmalloc(sizeof(*forget), GFP_ATOMIC);
+	if (!forget) {
+		pr_err("virtio-fs: dropped FORGET: kmalloc failed\n");
+		goto out; /* TODO avoid dropping it? */
+	}
+
+	forget->ih = (struct fuse_in_header){
+		.opcode = FUSE_FORGET,
+		.nodeid = link->forget_one.nodeid,
+		.unique = unique,
+		.len = sizeof(*forget),
+	};
+	forget->arg = (struct fuse_forget_in){
+		.nlookup = link->forget_one.nlookup,
+	};
+
+	sg_init_one(&sg, forget, sizeof(*forget));
+
+	/* Enqueue the request */
+	vq = fsvq->vq;
+	dev_dbg(&vq->vdev->dev, "%s\n", __func__);
+	spin_lock(&fsvq->lock);
+
+	ret = virtqueue_add_sgs(vq, sgs, 1, 0, forget, GFP_ATOMIC);
+	if (ret < 0) {
+		if (ret == -ENOMEM || ret == -ENOSPC) {
+			pr_debug("virtio-fs: Could not queue FORGET: err=%d."
+				 " Will try later.\n", ret);
+			list_add_tail(&forget->list, &fsvq->queued_reqs);
+			schedule_delayed_work(&fsvq->dispatch_work,
+					msecs_to_jiffies(1));
+		} else {
+			pr_debug("virtio-fs: Could not queue FORGET: err=%d."
+				 " Dropping it.\n", ret);
+			kfree(forget);
+		}
+		spin_unlock(&fsvq->lock);
+		goto out;
+	}
+
+	notify = virtqueue_kick_prepare(vq);
+
+	spin_unlock(&fsvq->lock);
+
+	if (notify)
+		virtqueue_notify(vq);
+out:
+	kfree(link);
+}
+
+static void virtio_fs_wake_interrupt_and_unlock(struct fuse_iqueue *fiq)
+__releases(fiq->waitq.lock)
+{
+	/* TODO */
+	spin_unlock(&fiq->waitq.lock);
+}
+
+/* Return the number of scatter-gather list elements required */
+static unsigned sg_count_fuse_req(struct fuse_req *req)
+{
+	unsigned total_sgs = 1 /* fuse_in_header */;
+
+	if (req->in.numargs - req->in.argpages)
+		total_sgs += 1;
+
+	if (req->in.argpages)
+		total_sgs += req->num_pages;
+
+	if (!test_bit(FR_ISREPLY, &req->flags))
+		return total_sgs;
+
+	total_sgs += 1 /* fuse_out_header */;
+
+	if (req->out.numargs - req->out.argpages)
+		total_sgs += 1;
+
+	if (req->out.argpages)
+		total_sgs += req->num_pages;
+
+	return total_sgs;
+}
+
+/* Add pages to scatter-gather list and return number of elements used */
+static unsigned sg_init_fuse_pages(struct scatterlist *sg,
+				   struct page **pages,
+				   struct fuse_page_desc *page_descs,
+				   unsigned num_pages)
+{
+	unsigned i;
+
+	for (i = 0; i < num_pages; i++) {
+		sg_init_table(&sg[i], 1);
+		sg_set_page(&sg[i], pages[i],
+			    page_descs[i].length,
+			    page_descs[i].offset);
+	}
+
+	return i;
+}
+
+/* Add args to scatter-gather list and return number of elements used */
+static unsigned sg_init_fuse_args(struct scatterlist *sg,
+				  struct fuse_req *req,
+				  struct fuse_arg *args,
+				  unsigned numargs,
+				  bool argpages,
+				  void *argbuf,
+				  unsigned *len_used)
+{
+	unsigned total_sgs = 0;
+	unsigned len;
+
+	len = fuse_len_args(numargs - argpages, args);
+	if (len)
+		sg_init_one(&sg[total_sgs++], argbuf, len);
+
+	if (argpages)
+		total_sgs += sg_init_fuse_pages(&sg[total_sgs],
+						req->pages,
+						req->page_descs,
+						req->num_pages);
+
+	if (len_used)
+		*len_used = len;
+
+	return total_sgs;
+}
+
+/* Add a request to a virtqueue and kick the device */
+static int virtio_fs_enqueue_req(struct virtqueue *vq, struct fuse_req *req)
+{
+	struct scatterlist *stack_sgs[6 /* requests need at least 4 elements */];
+	struct scatterlist stack_sg[ARRAY_SIZE(stack_sgs)];
+	struct scatterlist **sgs = stack_sgs;
+	struct scatterlist *sg = stack_sg;
+	struct virtio_fs_vq *fsvq;
+	unsigned argbuf_used = 0;
+	unsigned out_sgs = 0;
+	unsigned in_sgs = 0;
+	unsigned total_sgs;
+	unsigned i;
+	int ret;
+	bool notify;
+
+	/* Does the sglist fit on the stack? */
+	total_sgs = sg_count_fuse_req(req);
+	if (total_sgs > ARRAY_SIZE(stack_sgs)) {
+		sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), GFP_ATOMIC);
+		sg = kmalloc_array(total_sgs, sizeof(sg[0]), GFP_ATOMIC);
+		if (!sgs || !sg) {
+			ret = -ENOMEM;
+			goto out;
+		}
+	}
+
+	/* Use a bounce buffer since stack args cannot be mapped */
+	ret = copy_args_to_argbuf(req);
+	if (ret < 0)
+		goto out;
+
+	/* Request elements */
+	sg_init_one(&sg[out_sgs++], &req->in.h, sizeof(req->in.h));
+	out_sgs += sg_init_fuse_args(&sg[out_sgs], req,
+				     (struct fuse_arg *)req->in.args,
+				     req->in.numargs, req->in.argpages,
+				     req->argbuf, &argbuf_used);
+
+	/* Reply elements */
+	if (test_bit(FR_ISREPLY, &req->flags)) {
+		sg_init_one(&sg[out_sgs + in_sgs++],
+			    &req->out.h, sizeof(req->out.h));
+		in_sgs += sg_init_fuse_args(&sg[out_sgs + in_sgs], req,
+					    req->out.args, req->out.numargs,
+					    req->out.argpages,
+					    req->argbuf + argbuf_used, NULL);
+	}
+
+	BUG_ON(out_sgs + in_sgs != total_sgs);
+
+	for (i = 0; i < total_sgs; i++)
+		sgs[i] = &sg[i];
+
+	fsvq = vq_to_fsvq(vq);
+	spin_lock(&fsvq->lock);
+
+	ret = virtqueue_add_sgs(vq, sgs, out_sgs, in_sgs, req, GFP_ATOMIC);
+	if (ret < 0) {
+		/* TODO handle full virtqueue */
+		spin_unlock(&fsvq->lock);
+		goto out;
+	}
+
+	notify = virtqueue_kick_prepare(vq);
+
+	spin_unlock(&fsvq->lock);
+
+	if (notify)
+		virtqueue_notify(vq);
+
+out:
+	if (ret < 0 && req->argbuf) {
+		kfree(req->argbuf);
+		req->argbuf = NULL;
+	}
+	if (sgs != stack_sgs) {
+		kfree(sgs);
+		kfree(sg);
+	}
+
+	return ret;
+}
+
+static void virtio_fs_wake_pending_and_unlock(struct fuse_iqueue *fiq)
+__releases(fiq->waitq.lock)
+{
+	unsigned queue_id = VQ_REQUEST; /* TODO multiqueue */
+	struct virtio_fs *fs;
+	struct fuse_conn *fc;
+	struct fuse_req *req;
+	struct fuse_pqueue *fpq;
+	int ret;
+
+	BUG_ON(list_empty(&fiq->pending));
+	req = list_last_entry(&fiq->pending, struct fuse_req, list);
+	clear_bit(FR_PENDING, &req->flags);
+	list_del_init(&req->list);
+	BUG_ON(!list_empty(&fiq->pending));
+	spin_unlock(&fiq->waitq.lock);
+
+	fs = fiq->priv;
+	fc = fs->vqs[queue_id].fud->fc;
+
+	dev_dbg(&fs->vqs[queue_id].vq->vdev->dev,
+		"%s: opcode %u unique %#llx nodeid %#llx in.len %u out.len %u\n",
+		__func__, req->in.h.opcode, req->in.h.unique, req->in.h.nodeid,
+		req->in.h.len, fuse_len_args(req->out.numargs, req->out.args));
+
+	fpq = &fs->vqs[queue_id].fud->pq;
+	spin_lock(&fpq->lock);
+	if (!fpq->connected) {
+		spin_unlock(&fpq->lock);
+		req->out.h.error = -ENODEV;
+		printk(KERN_ERR "%s: disconnected\n", __func__);
+		fuse_request_end(fc, req);
+		return;
+	}
+	list_add_tail(&req->list, fpq->processing);
+	spin_unlock(&fpq->lock);
+	set_bit(FR_SENT, &req->flags);
+	/* matches barrier in request_wait_answer() */
+	smp_mb__after_atomic();
+	/* TODO check for FR_INTERRUPTED? */
+
+retry:
+	ret = virtio_fs_enqueue_req(fs->vqs[queue_id].vq, req);
+	if (ret < 0) {
+		if (ret == -ENOMEM || ret == -ENOSPC) {
+			/* Virtqueue full. Retry submission */
+			usleep_range(20, 30);
+			goto retry;
+		}
+		req->out.h.error = ret;
+		printk(KERN_ERR "%s: virtio_fs_enqueue_req failed %d\n",
+			__func__, ret);
+		fuse_request_end(fc, req);
+		return;
+	}
+}
+
+static const struct fuse_iqueue_ops virtio_fs_fiq_ops = {
+	.wake_forget_and_unlock		= virtio_fs_wake_forget_and_unlock,
+	.wake_interrupt_and_unlock	= virtio_fs_wake_interrupt_and_unlock,
+	.wake_pending_and_unlock	= virtio_fs_wake_pending_and_unlock,
+};
+
+static int virtio_fs_fill_super(struct super_block *sb, void *data,
+				int silent)
+{
+	struct fuse_mount_data d;
+	struct fuse_conn *fc;
+	struct virtio_fs *fs;
+	int is_bdev = sb->s_bdev != NULL;
+	unsigned int i;
+	int err;
+	struct fuse_req *init_req;
+
+	err = -EINVAL;
+	if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
+		goto err;
+	if (d.fd_present) {
+		printk(KERN_ERR "virtio-fs: fd option cannot be used\n");
+		goto err;
+	}
+	if (!d.tag_present) {
+		printk(KERN_ERR "virtio-fs: missing tag option\n");
+		goto err;
+	}
+
+	fs = virtio_fs_find_instance(d.tag);
+	if (!fs) {
+		printk(KERN_ERR "virtio-fs: tag not found\n");
+		err = -ENOENT;
+		goto err;
+	}
+
+	/* TODO lock */
+	if (fs->vqs[VQ_REQUEST].fud) {
+		printk(KERN_ERR "virtio-fs: device already in use\n");
+		err = -EBUSY;
+		goto err;
+	}
+
+	err = -ENOMEM;
+	/* Allocate fuse_dev for hiprio and notification queues */
+	for (i = 0; i < VQ_REQUEST; i++) {
+		struct virtio_fs_vq *fsvq = &fs->vqs[i];
+
+		fsvq->fud = fuse_dev_alloc();
+		if (!fsvq->fud)
+			goto err_free_fuse_devs;
+	}
+
+	init_req = fuse_request_alloc(0);
+	if (!init_req)
+		goto err_free_fuse_devs;
+	__set_bit(FR_BACKGROUND, &init_req->flags);
+
+	d.fiq_ops = &virtio_fs_fiq_ops;
+	d.fiq_priv = fs;
+	d.fudptr = (void **)&fs->vqs[VQ_REQUEST].fud;
+	d.destroy = true; /* Send destroy request on unmount */
+	err = fuse_fill_super_common(sb, &d);
+	if (err < 0)
+		goto err_free_init_req;
+
+	fc = fs->vqs[VQ_REQUEST].fud->fc;
+
+	/* TODO take fuse_mutex around this loop? */
+	for (i = 0; i < fs->nvqs; i++) {
+		struct virtio_fs_vq *fsvq = &fs->vqs[i];
+
+		if (i == VQ_REQUEST)
+			continue; /* already initialized */
+		fuse_dev_install(fsvq->fud, fc);
+		atomic_inc(&fc->dev_count);
+	}
+
+	fuse_send_init(fc, init_req);
+	return 0;
+
+err_free_init_req:
+	fuse_request_free(init_req);
+err_free_fuse_devs:
+	for (i = 0; i < fs->nvqs; i++) {
+		struct virtio_fs_vq *fsvq = &fs->vqs[i];
+
+		/* Only some queues may have a fuse_dev at this point */
+		if (fsvq->fud)
+			fuse_dev_free(fsvq->fud);
+	}
+err:
+	return err;
+}
+
+static void virtio_kill_sb(struct super_block *sb)
+{
+	struct fuse_conn *fc = get_fuse_conn_super(sb);
+
+	fuse_kill_sb_anon(sb);
+	if (fc) {
+		struct virtio_fs *vfs = fc->iq.priv;
+
+		virtio_fs_free_devs(vfs);
+	}
+}
+
+static struct dentry *virtio_fs_mount(struct file_system_type *fs_type,
+				      int flags, const char *dev_name,
+				      void *raw_data)
+{
+	return mount_nodev(fs_type, flags, raw_data, virtio_fs_fill_super);
+}
+
+static struct file_system_type virtio_fs_type = {
+	.owner		= THIS_MODULE,
+	.name		= KBUILD_MODNAME,
+	.mount		= virtio_fs_mount,
+	.kill_sb	= virtio_kill_sb,
+};
+
+static int __init virtio_fs_init(void)
+{
+	int ret;
+
+	ret = register_virtio_driver(&virtio_fs_driver);
+	if (ret < 0)
+		return ret;
+
+	ret = register_filesystem(&virtio_fs_type);
+	if (ret < 0) {
+		unregister_virtio_driver(&virtio_fs_driver);
+		return ret;
+	}
+
+	return 0;
+}
+module_init(virtio_fs_init);
+
+static void __exit virtio_fs_exit(void)
+{
+	unregister_filesystem(&virtio_fs_type);
+	unregister_virtio_driver(&virtio_fs_driver);
+}
+module_exit(virtio_fs_exit);
+
+MODULE_AUTHOR("Stefan Hajnoczi <stefanha@redhat.com>");
+MODULE_DESCRIPTION("Virtio Filesystem");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_FS(KBUILD_MODNAME);
+MODULE_DEVICE_TABLE(virtio, id_table);
diff --git a/include/uapi/linux/virtio_fs.h b/include/uapi/linux/virtio_fs.h
new file mode 100644
index 000000000000..48f3590dcfbe
--- /dev/null
+++ b/include/uapi/linux/virtio_fs.h
@@ -0,0 +1,41 @@
+#ifndef _UAPI_LINUX_VIRTIO_FS_H
+#define _UAPI_LINUX_VIRTIO_FS_H
+/* This header is BSD licensed so anyone can use the definitions to implement
+ * compatible drivers/servers.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of IBM nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE. */
+#include <linux/types.h>
+#include <linux/virtio_ids.h>
+#include <linux/virtio_config.h>
+#include <linux/virtio_types.h>
+
+struct virtio_fs_config {
+	/* Filesystem name (UTF-8, not NUL-terminated, padded with NULs) */
+	__u8 tag[36];
+
+	/* Number of request queues */
+	__u32 num_queues;
+} __attribute__((packed));
+
+#endif /* _UAPI_LINUX_VIRTIO_FS_H */
diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
index 6d5c3b2d4f4d..884b0e2734bb 100644
--- a/include/uapi/linux/virtio_ids.h
+++ b/include/uapi/linux/virtio_ids.h
@@ -43,5 +43,6 @@
 #define VIRTIO_ID_INPUT        18 /* virtio input */
 #define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
 #define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
+#define VIRTIO_ID_FS           26 /* virtio filesystem */
 
 #endif /* _LINUX_VIRTIO_IDS_H */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread
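
One mechanism in the patch above deserves a note: FUSE request arguments
live in scattered guest kernel memory, so copy_args_to_argbuf() packs
them into one physically contiguous bounce buffer before the request is
added to the virtqueue, and copy_args_from_argbuf() unpacks the reply on
completion. A sketch of the buffer layout, with hypothetical argument
sizes:

	/*
	 * argbuf for a request with two in args and one out arg:
	 *
	 *   offset  0: in.args[0]   (e.g. a 40-byte fixed header)
	 *   offset 40: in.args[1]   (e.g. a 5-byte name string)
	 *   offset 45: out.args[0]  (device writes the reply here)
	 *
	 * in args are copied in before submission; out args are copied
	 * back out on completion, truncated to out.h.len for argvar.
	 */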

* [PATCH v2 12/30] dax: remove block device dependencies
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (10 preceding siblings ...)
  2019-05-15 19:26 ` [PATCH v2 11/30] virtio_fs: add skeleton virtio_fs.ko module Vivek Goyal
@ 2019-05-15 19:26 ` Vivek Goyal
  2019-05-16  0:21   ` Dan Williams
  2019-05-15 19:26 ` [PATCH v2 13/30] dax: Pass dax_dev to dax_writeback_mapping_range() Vivek Goyal
                   ` (17 subsequent siblings)
  29 siblings, 1 reply; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:26 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

From: Stefan Hajnoczi <stefanha@redhat.com>

Although struct dax_device itself is not tied to a block device, some
DAX code assumes there is a block device.  Make block devices optional
by allowing bdev to be NULL in commonly used DAX APIs.

When there is no block device:
 * Skip the partition offset calculation in bdev_dax_pgoff()
 * Skip the blkdev_issue_zeroout() optimization

Note that more block device assumptions remain, but I haven't reached
those code paths yet.

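A worked example of the offset math (sector and partition start values
are illustrative; 4 KiB pages assumed):

	pgoff_t pgoff;

	/* bdev starting at sector 256: phys_off = (256 + 2048) * 512 */
	bdev_dax_pgoff(bdev, 2048, PAGE_SIZE, &pgoff);	/* pgoff = 288 */

	/* bdev == NULL (virtio-fs): phys_off = 2048 * 512 = 1 MiB */
	bdev_dax_pgoff(NULL, 2048, PAGE_SIZE, &pgoff);	/* pgoff = 256 */
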
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 drivers/dax/super.c | 3 ++-
 fs/dax.c            | 7 ++++++-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0a339b85133e..cb44ec663991 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -53,7 +53,8 @@ EXPORT_SYMBOL_GPL(dax_read_unlock);
 int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
 		pgoff_t *pgoff)
 {
-	phys_addr_t phys_off = (get_start_sect(bdev) + sector) * 512;
+	sector_t start_sect = bdev ? get_start_sect(bdev) : 0;
+	phys_addr_t phys_off = (start_sect + sector) * 512;
 
 	if (pgoff)
 		*pgoff = PHYS_PFN(phys_off);
diff --git a/fs/dax.c b/fs/dax.c
index e5e54da1715f..815bc32fd967 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1042,7 +1042,12 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
 static bool dax_range_is_aligned(struct block_device *bdev,
 				 unsigned int offset, unsigned int length)
 {
-	unsigned short sector_size = bdev_logical_block_size(bdev);
+	unsigned short sector_size;
+
+	if (!bdev)
+		return false;
+
+	sector_size = bdev_logical_block_size(bdev);
 
 	if (!IS_ALIGNED(offset, sector_size))
 		return false;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 13/30] dax: Pass dax_dev to dax_writeback_mapping_range()
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (11 preceding siblings ...)
  2019-05-15 19:26 ` [PATCH v2 12/30] dax: remove block device dependencies Vivek Goyal
@ 2019-05-15 19:26 ` Vivek Goyal
  2019-05-15 19:26 ` [PATCH v2 14/30] virtio: Add get_shm_region method Vivek Goyal
                   ` (16 subsequent siblings)
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:26 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

Right now dax_writeback_mapping_range() is passed a bdev and the
dax_dev is looked up from that bdev's name.

virtio-fs does not have a bdev, so pass the dax_dev in to
dax_writeback_mapping_range() as well. If a dax_dev is passed in, the
bdev is not used; otherwise the dax_dev is looked up from the bdev.

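A hypothetical caller, anticipating how virtio-fs can use the new
parameter once fuse_conn gains a dax_dev field later in this series
(the function name is illustrative):

static int virtio_fs_writepages(struct address_space *mapping,
				struct writeback_control *wbc)
{
	struct fuse_conn *fc = get_fuse_conn(mapping->host);

	/* No block device backs this mapping; pass the dax_dev directly */
	return dax_writeback_mapping_range(mapping, NULL, fc->dax_dev, wbc);
}
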
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/dax.c            | 16 ++++++++++------
 fs/ext2/inode.c     |  2 +-
 fs/ext4/inode.c     |  2 +-
 fs/xfs/xfs_aops.c   |  2 +-
 include/linux/dax.h |  6 ++++--
 5 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 815bc32fd967..c944c1efc78f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -932,12 +932,12 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
  * on persistent storage prior to completion of the operation.
  */
 int dax_writeback_mapping_range(struct address_space *mapping,
-		struct block_device *bdev, struct writeback_control *wbc)
+		struct block_device *bdev, struct dax_device *dax_dev,
+		struct writeback_control *wbc)
 {
 	XA_STATE(xas, &mapping->i_pages, wbc->range_start >> PAGE_SHIFT);
 	struct inode *inode = mapping->host;
 	pgoff_t end_index = wbc->range_end >> PAGE_SHIFT;
-	struct dax_device *dax_dev;
 	void *entry;
 	int ret = 0;
 	unsigned int scanned = 0;
@@ -948,9 +948,12 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 	if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
 		return 0;
 
-	dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
-	if (!dax_dev)
-		return -EIO;
+	if (bdev) {
+		WARN_ON(dax_dev);
+		dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+		if (!dax_dev)
+			return -EIO;
+	}
 
 	trace_dax_writeback_range(inode, xas.xa_index, end_index);
 
@@ -972,7 +975,8 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 		xas_lock_irq(&xas);
 	}
 	xas_unlock_irq(&xas);
-	put_dax(dax_dev);
+	if (bdev)
+		put_dax(dax_dev);
 	trace_dax_writeback_range_done(inode, xas.xa_index, end_index);
 	return ret;
 }
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index c27c27300d95..9b0131c53429 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -956,7 +956,7 @@ static int
 ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
 	return dax_writeback_mapping_range(mapping,
-			mapping->host->i_sb->s_bdev, wbc);
+			mapping->host->i_sb->s_bdev, NULL, wbc);
 }
 
 const struct address_space_operations ext2_aops = {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b32a57bc5d5d..cb8cf5eddd9b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2972,7 +2972,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
 	percpu_down_read(&sbi->s_journal_flag_rwsem);
 	trace_ext4_writepages(inode, wbc);
 
-	ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, wbc);
+	ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, NULL, wbc);
 	trace_ext4_writepages_result(inode, wbc, ret,
 				     nr_to_write - wbc->nr_to_write);
 	percpu_up_read(&sbi->s_journal_flag_rwsem);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3619e9e8d359..27f71ff55096 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -994,7 +994,7 @@ xfs_dax_writepages(
 {
 	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
 	return dax_writeback_mapping_range(mapping,
-			xfs_find_bdev_for_inode(mapping->host), wbc);
+			xfs_find_bdev_for_inode(mapping->host), NULL, wbc);
 }
 
 STATIC int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0dd316a74a29..bf3b00b5f0bf 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -87,7 +87,8 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
 
 struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
 int dax_writeback_mapping_range(struct address_space *mapping,
-		struct block_device *bdev, struct writeback_control *wbc);
+		struct block_device *bdev, struct dax_device *dax_dev,
+		struct writeback_control *wbc);
 
 struct page *dax_layout_busy_page(struct address_space *mapping);
 dax_entry_t dax_lock_page(struct page *page);
@@ -119,7 +120,8 @@ static inline struct page *dax_layout_busy_page(struct address_space *mapping)
 }
 
 static inline int dax_writeback_mapping_range(struct address_space *mapping,
-		struct block_device *bdev, struct writeback_control *wbc)
+		struct block_device *bdev, struct dax_device *dax_dev,
+		struct writeback_control *wbc)
 {
 	return -EOPNOTSUPP;
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 14/30] virtio: Add get_shm_region method
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (12 preceding siblings ...)
  2019-05-15 19:26 ` [PATCH v2 13/30] dax: Pass dax_dev to dax_writeback_mapping_range() Vivek Goyal
@ 2019-05-15 19:26 ` Vivek Goyal
  2019-05-15 19:27 ` [PATCH v2 15/30] virtio: Implement get_shm_region for PCI transport Vivek Goyal
                   ` (15 subsequent siblings)
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:26 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

From: Sebastien Boeuf <sebastien.boeuf@intel.com>

Virtio defines 'shared memory regions', which are regions of memory
continuously shared between the host and guest.

Provide a method to find a particular region on a device.

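A sketch of the expected calling pattern (shm_id is a placeholder for a
device-type-specific region id; virtio-fs defines one for its cache
window later in this series):

	struct virtio_shm_region region;

	if (!virtio_get_shm_region(vdev, &region, shm_id))
		return -ENODEV;	/* region not exposed by device/transport */

	/* region.addr and region.len now describe the shared window */
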
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/linux/virtio_config.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index bb4cc4910750..c859f000a751 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -10,6 +10,11 @@
 
 struct irq_affinity;
 
+struct virtio_shm_region {
+	u64 addr;
+	u64 len;
+};
+
 /**
  * virtio_config_ops - operations for configuring a virtio device
  * Note: Do not assume that a transport implements all of the operations
@@ -65,6 +70,7 @@ struct irq_affinity;
  *      the caller can then copy.
  * @set_vq_affinity: set the affinity for a virtqueue (optional).
  * @get_vq_affinity: get the affinity for a virtqueue (optional).
+ * @get_shm_region: get a shared memory region based on the index.
  */
 typedef void vq_callback_t(struct virtqueue *);
 struct virtio_config_ops {
@@ -88,6 +94,8 @@ struct virtio_config_ops {
 			       const struct cpumask *cpu_mask);
 	const struct cpumask *(*get_vq_affinity)(struct virtio_device *vdev,
 			int index);
+	bool (*get_shm_region)(struct virtio_device *vdev,
+			       struct virtio_shm_region *region, u8 id);
 };
 
 /* If driver didn't advertise the feature, it will never appear. */
@@ -250,6 +258,15 @@ int virtqueue_set_affinity(struct virtqueue *vq, const struct cpumask *cpu_mask)
 	return 0;
 }
 
+static inline
+bool virtio_get_shm_region(struct virtio_device *vdev,
+			   struct virtio_shm_region *region, u8 id)
+{
+	if (!vdev->config->get_shm_region)
+		return false;
+	return vdev->config->get_shm_region(vdev, region, id);
+}
+
 static inline bool virtio_is_little_endian(struct virtio_device *vdev)
 {
 	return virtio_has_feature(vdev, VIRTIO_F_VERSION_1) ||
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 15/30] virtio: Implement get_shm_region for PCI transport
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (13 preceding siblings ...)
  2019-05-15 19:26 ` [PATCH v2 14/30] virtio: Add get_shm_region method Vivek Goyal
@ 2019-05-15 19:27 ` Vivek Goyal
  2019-05-15 19:27 ` [PATCH v2 16/30] virtio: Implement get_shm_region for MMIO transport Vivek Goyal
                   ` (14 subsequent siblings)
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:27 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

From: Sebastien Boeuf <sebastien.boeuf@intel.com>

On PCI the shm regions are described by vendor capability entries;
a region is found by searching for the capability with the matching id.

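The offset and length are each 64 bits, split between the generic
virtio_pci_cap (low halves) and the new shm capability (high halves).
Condensed, the assembly mirrors what the code below does:

	u32 lo, hi;
	u64 shm_offset;

	pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap, offset), &lo);
	pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_shm_cap, offset_hi), &hi);
	shm_offset = ((u64)hi << 32) | lo;	/* likewise for length */
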
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 drivers/virtio/virtio_pci_modern.c | 108 +++++++++++++++++++++++++++++
 include/uapi/linux/virtio_pci.h    |  10 +++
 2 files changed, 118 insertions(+)

diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index 07571daccfec..51c9e6eca5ac 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -446,6 +446,112 @@ static void del_vq(struct virtio_pci_vq_info *info)
 	vring_del_virtqueue(vq);
 }
 
+static int virtio_pci_find_shm_cap(struct pci_dev *dev,
+				   u8 required_id,
+				   u8 *bar, u64 *offset, u64 *len)
+{
+	int pos;
+
+	for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
+	     pos > 0;
+	     pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
+		u8 type, cap_len, id;
+		u32 tmp32;
+		u64 res_offset, res_length;
+
+		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+							 cfg_type),
+				     &type);
+		if (type != VIRTIO_PCI_CAP_SHARED_MEMORY_CFG)
+			continue;
+
+		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+							 cap_len),
+				     &cap_len);
+		if (cap_len != sizeof(struct virtio_pci_shm_cap)) {
+			printk(KERN_ERR "%s: shm cap with bad size offset: %d size: %d\n",
+			       __func__, pos, cap_len);
+			continue;
+		}
+
+		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_shm_cap,
+							 id),
+				     &id);
+		if (id != required_id)
+			continue;
+
+		/* Type and ID match, looks good */
+		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+							 bar),
+				     bar);
+
+		/* Read the lower 32 bits of length and offset */
+		pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap, offset),
+				      &tmp32);
+		res_offset = tmp32;
+		pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap, length),
+				      &tmp32);
+		res_length = tmp32;
+
+		/* and now the top half */
+		pci_read_config_dword(dev,
+				      pos + offsetof(struct virtio_pci_shm_cap,
+						     offset_hi),
+				      &tmp32);
+		res_offset |= ((u64)tmp32) << 32;
+		pci_read_config_dword(dev,
+				      pos + offsetof(struct virtio_pci_shm_cap,
+						     length_hi),
+				      &tmp32);
+		res_length |= ((u64)tmp32) << 32;
+
+		*offset = res_offset;
+		*len = res_length;
+
+		return pos;
+	}
+	return 0;
+}
+
+static bool vp_get_shm_region(struct virtio_device *vdev,
+			      struct virtio_shm_region *region, u8 id)
+{
+	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+	struct pci_dev *pci_dev = vp_dev->pci_dev;
+	u8 bar;
+	u64 offset, len;
+	phys_addr_t phys_addr;
+	size_t bar_len;
+	int ret;
+
+	if (!virtio_pci_find_shm_cap(pci_dev, id, &bar, &offset, &len))
+		return false;
+
+	ret = pci_request_region(pci_dev, bar, "virtio-pci-shm");
+	if (ret < 0) {
+		dev_err(&pci_dev->dev, "%s: failed to request BAR\n",
+			__func__);
+		return false;
+	}
+
+	phys_addr = pci_resource_start(pci_dev, bar);
+	bar_len = pci_resource_len(pci_dev, bar);
+
+	if (offset + len > bar_len) {
+		dev_err(&pci_dev->dev,
+			"%s: bar shorter than cap offset+len\n",
+			__func__);
+		return false;
+	}
+
+	region->len = len;
+	region->addr = (u64) phys_addr + offset;
+
+	return true;
+}
+
 static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
 	.get		= NULL,
 	.set		= NULL,
@@ -460,6 +566,7 @@ static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
 	.bus_name	= vp_bus_name,
 	.set_vq_affinity = vp_set_vq_affinity,
 	.get_vq_affinity = vp_get_vq_affinity,
+	.get_shm_region  = vp_get_shm_region,
 };
 
 static const struct virtio_config_ops virtio_pci_config_ops = {
@@ -476,6 +583,7 @@ static const struct virtio_config_ops virtio_pci_config_ops = {
 	.bus_name	= vp_bus_name,
 	.set_vq_affinity = vp_set_vq_affinity,
 	.get_vq_affinity = vp_get_vq_affinity,
+	.get_shm_region  = vp_get_shm_region,
 };
 
 /**
diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index 90007a1abcab..31841a60a4ad 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -113,6 +113,8 @@
 #define VIRTIO_PCI_CAP_DEVICE_CFG	4
 /* PCI configuration access */
 #define VIRTIO_PCI_CAP_PCI_CFG		5
+/* Additional shared memory capability */
+#define VIRTIO_PCI_CAP_SHARED_MEMORY_CFG 8
 
 /* This is the PCI capability header: */
 struct virtio_pci_cap {
@@ -163,6 +165,14 @@ struct virtio_pci_cfg_cap {
 	__u8 pci_cfg_data[4]; /* Data for BAR access. */
 };
 
+/* Fields in VIRTIO_PCI_CAP_SHARED_MEMORY_CFG */
+struct virtio_pci_shm_cap {
+       struct virtio_pci_cap cap;
+       __le32 offset_hi;             /* Most sig 32 bits of offset */
+       __le32 length_hi;             /* Most sig 32 bits of length */
+       __u8   id;                    /* To distinguish shm chunks */
+};
+
 /* Macro versions of offsets for the Old Timers! */
 #define VIRTIO_PCI_CAP_VNDR		0
 #define VIRTIO_PCI_CAP_NEXT		1
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 16/30] virtio: Implement get_shm_region for MMIO transport
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (14 preceding siblings ...)
  2019-05-15 19:27 ` [PATCH v2 15/30] virtio: Implement get_shm_region for PCI transport Vivek Goyal
@ 2019-05-15 19:27 ` Vivek Goyal
  2019-05-15 19:27 ` [PATCH v2 17/30] fuse, dax: add fuse_conn->dax_dev field Vivek Goyal
                   ` (13 subsequent siblings)
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:27 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

From: Sebastien Boeuf <sebastien.boeuf@intel.com>

On MMIO a new set of registers is defined for finding SHM
regions.  Add their definitions and use them to find the region.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
---
 drivers/virtio/virtio_mmio.c     | 32 ++++++++++++++++++++++++++++++++
 include/uapi/linux/virtio_mmio.h | 11 +++++++++++
 2 files changed, 43 insertions(+)

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index d9dd0f789279..ac42520abd86 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -499,6 +499,37 @@ static const char *vm_bus_name(struct virtio_device *vdev)
 	return vm_dev->pdev->name;
 }
 
+static bool vm_get_shm_region(struct virtio_device *vdev,
+			      struct virtio_shm_region *region, u8 id)
+{
+	struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
+	u64 len, addr;
+
+	/* Select the region we're interested in */
+	writel(id, vm_dev->base + VIRTIO_MMIO_SHM_SEL);
+
+	/* Read the region size */
+	len = (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_LEN_LOW);
+	len |= (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_LEN_HIGH) << 32;
+
+	region->len = len;
+
+	/* Check if region length is -1. If that's the case, the shared memory
+	 * region does not exist and there is no need to proceed further.
+	 */
+	if (len == ~(u64)0)
+		return false;
+
+	/* Read the region base address */
+	addr = (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_BASE_LOW);
+	addr |= (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_BASE_HIGH) << 32;
+
+	region->addr = addr;
+
+	return true;
+}
+
 static const struct virtio_config_ops virtio_mmio_config_ops = {
 	.get		= vm_get,
 	.set		= vm_set,
@@ -511,6 +542,7 @@ static const struct virtio_config_ops virtio_mmio_config_ops = {
 	.get_features	= vm_get_features,
 	.finalize_features = vm_finalize_features,
 	.bus_name	= vm_bus_name,
+	.get_shm_region = vm_get_shm_region,
 };
 
 
diff --git a/include/uapi/linux/virtio_mmio.h b/include/uapi/linux/virtio_mmio.h
index c4b09689ab64..0650f91bea6c 100644
--- a/include/uapi/linux/virtio_mmio.h
+++ b/include/uapi/linux/virtio_mmio.h
@@ -122,6 +122,17 @@
 #define VIRTIO_MMIO_QUEUE_USED_LOW	0x0a0
 #define VIRTIO_MMIO_QUEUE_USED_HIGH	0x0a4
 
+/* Shared memory region id */
+#define VIRTIO_MMIO_SHM_SEL             0x0ac
+
+/* Shared memory region length, 64 bits in two halves */
+#define VIRTIO_MMIO_SHM_LEN_LOW         0x0b0
+#define VIRTIO_MMIO_SHM_LEN_HIGH        0x0b4
+
+/* Shared memory region base address, 64 bits in two halves */
+#define VIRTIO_MMIO_SHM_BASE_LOW        0x0b8
+#define VIRTIO_MMIO_SHM_BASE_HIGH       0x0bc
+
 /* Configuration atomicity value */
 #define VIRTIO_MMIO_CONFIG_GENERATION	0x0fc
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 17/30] fuse, dax: add fuse_conn->dax_dev field
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (15 preceding siblings ...)
  2019-05-15 19:27 ` [PATCH v2 16/30] virtio: Implement get_shm_region for MMIO transport Vivek Goyal
@ 2019-05-15 19:27 ` Vivek Goyal
  2019-05-15 19:27 ` [PATCH v2 18/30] virtio_fs, dax: Set up virtio_fs dax_device Vivek Goyal
                   ` (12 subsequent siblings)
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:27 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

From: Stefan Hajnoczi <stefanha@redhat.com>

A struct dax_device instance is a prerequisite for the DAX filesystem
APIs.  Let virtio_fs associate a dax_device with a fuse_conn.  Classic
FUSE and CUSE set the pointer to NULL, disabling DAX.

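For orientation, a sketch of how a transport populates the new field
before calling fuse_fill_super_common() (virtio-fs passes NULL in this
patch and wires up a real dax_dev in a later one):

	d.dax_dev = fs->dax_dev;	/* or NULL to disable DAX */
	d.fiq_ops = &virtio_fs_fiq_ops;
	d.fiq_priv = fs;
	err = fuse_fill_super_common(sb, &d);
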
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 fs/fuse/cuse.c      | 3 ++-
 fs/fuse/fuse_i.h    | 9 ++++++++-
 fs/fuse/inode.c     | 9 ++++++---
 fs/fuse/virtio_fs.c | 1 +
 4 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index a509747153a7..417448f11f9f 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -504,7 +504,8 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
 	 * Limit the cuse channel to requests that can
 	 * be represented in file->f_cred->user_ns.
 	 */
-	fuse_conn_init(&cc->fc, file->f_cred->user_ns, &fuse_dev_fiq_ops, NULL);
+	fuse_conn_init(&cc->fc, file->f_cred->user_ns, NULL, &fuse_dev_fiq_ops,
+					NULL);
 
 	fud = fuse_dev_alloc_install(&cc->fc);
 	if (!fud) {
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f5cb4d40b83f..46fc1a454084 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -74,6 +74,9 @@ struct fuse_mount_data {
 	unsigned max_read;
 	unsigned blksize;
 
+	/* DAX device, may be NULL */
+	struct dax_device *dax_dev;
+
 	/* fuse input queue operations */
 	const struct fuse_iqueue_ops *fiq_ops;
 
@@ -822,6 +825,9 @@ struct fuse_conn {
 
 	/** List of device instances belonging to this connection */
 	struct list_head devices;
+
+	/** DAX device, non-NULL if DAX is supported */
+	struct dax_device *dax_dev;
 };
 
 static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
@@ -1049,7 +1055,8 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
  * Initialize fuse_conn
  */
 void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
-		    const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv);
+			struct dax_device *dax_dev,
+			const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv);
 
 /**
  * Release reference to fuse_conn
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 731a8a74d032..42f3ac5b7521 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -603,7 +603,8 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
 }
 
 void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
-		    const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv)
+			struct dax_device *dax_dev,
+			const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv)
 {
 	memset(fc, 0, sizeof(*fc));
 	spin_lock_init(&fc->lock);
@@ -628,6 +629,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
 	atomic64_set(&fc->attr_version, 1);
 	get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
 	fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
+	fc->dax_dev = dax_dev;
 	fc->user_ns = get_user_ns(user_ns);
 	fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
 }
@@ -1140,8 +1142,8 @@ int fuse_fill_super_common(struct super_block *sb,
 	if (!fc)
 		goto err;
 
-	fuse_conn_init(fc, sb->s_user_ns, mount_data->fiq_ops,
-		       mount_data->fiq_priv);
+	fuse_conn_init(fc, sb->s_user_ns, mount_data->dax_dev,
+		       mount_data->fiq_ops, mount_data->fiq_priv);
 	fc->release = fuse_free_conn;
 
 	fud = fuse_dev_alloc_install(fc);
@@ -1243,6 +1245,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 		goto err_fput;
 	__set_bit(FR_BACKGROUND, &init_req->flags);
 
+	d.dax_dev = NULL;
 	d.fiq_ops = &fuse_dev_fiq_ops;
 	d.fiq_priv = NULL;
 	d.fudptr = &file->private_data;
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index e76e0f5dce40..a23a1fb67217 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -866,6 +866,7 @@ static int virtio_fs_fill_super(struct super_block *sb, void *data,
  		goto err_free_fuse_devs;
 	__set_bit(FR_BACKGROUND, &init_req->flags);
 
+	d.dax_dev = NULL;
 	d.fiq_ops = &virtio_fs_fiq_ops;
 	d.fiq_priv = fs;
 	d.fudptr = (void **)&fs->vqs[VQ_REQUEST].fud;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 18/30] virtio_fs, dax: Set up virtio_fs dax_device
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (16 preceding siblings ...)
  2019-05-15 19:27 ` [PATCH v2 17/30] fuse, dax: add fuse_conn->dax_dev field Vivek Goyal
@ 2019-05-15 19:27 ` Vivek Goyal
  2019-07-17 17:27   ` Halil Pasic
  2019-05-15 19:27 ` [PATCH v2 19/30] fuse: Keep a list of free dax memory ranges Vivek Goyal
                   ` (11 subsequent siblings)
  29 siblings, 1 reply; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:27 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

From: Stefan Hajnoczi <stefanha@redhat.com>

Set up a dax device.

Use the shm capability to find the cache entry and map it.

The DAX window is accessed by the fs/dax.c infrastructure and must have
struct pages (at least on x86).  Use devm_memremap_pages() to map the
DAX window PCI BAR and allocate struct page.
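
Condensed, the setup sequence implemented here looks roughly like this (a
sketch of the code below, with error handling and the percpu_ref plumbing
elided):

    /* virtio_fs_setup_dax(), abridged */
    if (!virtio_get_shm_region(vdev, &cache_reg, VIRTIO_FS_SHMCAP_ID_CACHE))
        return -ENXIO;            /* device exposes no cache window */

    pgmap->type = MEMORY_DEVICE_FS_DAX;
    pgmap->res = (struct resource){  /* BAR range to back with struct pages */
        .start = cache_reg.addr,
        .end   = cache_reg.addr + cache_reg.len - 1,
    };
    fs->window_kaddr = devm_memremap_pages(&vdev->dev, pgmap);

    fs->dax_dev = alloc_dax(fs, NULL, &virtio_fs_dax_ops);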

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
---
 fs/fuse/fuse_i.h               |   1 +
 fs/fuse/inode.c                |   8 ++
 fs/fuse/virtio_fs.c            | 173 ++++++++++++++++++++++++++++++++-
 include/uapi/linux/virtio_fs.h |   3 +
 4 files changed, 183 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 46fc1a454084..840c88af711c 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -70,6 +70,7 @@ struct fuse_mount_data {
 	unsigned group_id_present:1;
 	unsigned default_permissions:1;
 	unsigned allow_other:1;
+	unsigned dax:1;
 	unsigned destroy:1;
 	unsigned max_read;
 	unsigned blksize;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 42f3ac5b7521..97d218a7daa8 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -442,6 +442,7 @@ enum {
 	OPT_ALLOW_OTHER,
 	OPT_MAX_READ,
 	OPT_BLKSIZE,
+	OPT_DAX,
 	OPT_ERR
 };
 
@@ -455,6 +456,7 @@ static const match_table_t tokens = {
 	{OPT_ALLOW_OTHER,		"allow_other"},
 	{OPT_MAX_READ,			"max_read=%u"},
 	{OPT_BLKSIZE,			"blksize=%u"},
+	{OPT_DAX,			"dax"},
 	{OPT_ERR,			NULL}
 };
 
@@ -546,6 +548,10 @@ int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
 			d->blksize = value;
 			break;
 
+		case OPT_DAX:
+			d->dax = 1;
+			break;
+
 		default:
 			return 0;
 		}
@@ -574,6 +580,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 		seq_printf(m, ",max_read=%u", fc->max_read);
 	if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
 		seq_printf(m, ",blksize=%lu", sb->s_blocksize);
+	if (fc->dax_dev)
+		seq_printf(m, ",dax");
 	return 0;
 }
 
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index a23a1fb67217..2b790865dc21 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -5,6 +5,9 @@
  */
 
 #include <linux/fs.h>
+#include <linux/dax.h>
+#include <linux/pci.h>
+#include <linux/pfn_t.h>
 #include <linux/module.h>
 #include <linux/virtio.h>
 #include <linux/virtio_fs.h>
@@ -31,6 +34,18 @@ struct virtio_fs_vq {
 	char name[24];
 } ____cacheline_aligned_in_smp;
 
+/* State needed for devm_memremap_pages().  This API is called on the
+ * underlying pci_dev instead of struct virtio_fs (layering violation).  Since
+ * the memremap release function only gets called when the pci_dev is released,
+ * keep the associated state separate from struct virtio_fs (it has a different
+ * lifecycle from pci_dev).
+ */
+struct virtio_fs_memremap_info {
+	struct dev_pagemap pgmap;
+	struct percpu_ref ref;
+	struct completion completion;
+};
+
 /* A virtio-fs device instance */
 struct virtio_fs {
 	struct list_head list;    /* on virtio_fs_instances */
@@ -38,6 +53,12 @@ struct virtio_fs {
 	struct virtio_fs_vq *vqs;
 	unsigned nvqs;            /* number of virtqueues */
 	unsigned num_queues;      /* number of request queues */
+	struct dax_device *dax_dev;
+
+	/* DAX memory window where file contents are mapped */
+	void *window_kaddr;
+	phys_addr_t window_phys_addr;
+	size_t window_len;
 };
 
 struct virtio_fs_forget {
@@ -421,6 +442,151 @@ static void virtio_fs_cleanup_vqs(struct virtio_device *vdev,
 	vdev->config->del_vqs(vdev);
 }
 
+/* Map a window offset to a page frame number.  The window offset will have
+ * been produced by .iomap_begin(), which maps a file offset to a window
+ * offset.
+ */
+static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+				    long nr_pages, void **kaddr, pfn_t *pfn)
+{
+	struct virtio_fs *fs = dax_get_private(dax_dev);
+	phys_addr_t offset = PFN_PHYS(pgoff);
+	size_t max_nr_pages = fs->window_len/PAGE_SIZE - pgoff;
+
+	if (kaddr)
+		*kaddr = fs->window_kaddr + offset;
+	if (pfn)
+		*pfn = phys_to_pfn_t(fs->window_phys_addr + offset,
+					PFN_DEV | PFN_MAP);
+	return nr_pages > max_nr_pages ? max_nr_pages : nr_pages;
+}
+
+static size_t virtio_fs_copy_from_iter(struct dax_device *dax_dev,
+				       pgoff_t pgoff, void *addr,
+				       size_t bytes, struct iov_iter *i)
+{
+	return copy_from_iter(addr, bytes, i);
+}
+
+static size_t virtio_fs_copy_to_iter(struct dax_device *dax_dev,
+				       pgoff_t pgoff, void *addr,
+				       size_t bytes, struct iov_iter *i)
+{
+	return copy_to_iter(addr, bytes, i);
+}
+
+static const struct dax_operations virtio_fs_dax_ops = {
+	.direct_access = virtio_fs_direct_access,
+	.copy_from_iter = virtio_fs_copy_from_iter,
+	.copy_to_iter = virtio_fs_copy_to_iter,
+};
+
+static void virtio_fs_percpu_release(struct percpu_ref *ref)
+{
+	struct virtio_fs_memremap_info *mi =
+		container_of(ref, struct virtio_fs_memremap_info, ref);
+
+	complete(&mi->completion);
+}
+
+static void virtio_fs_percpu_exit(void *data)
+{
+	struct virtio_fs_memremap_info *mi = data;
+
+	wait_for_completion(&mi->completion);
+	percpu_ref_exit(&mi->ref);
+}
+
+static void virtio_fs_percpu_kill(struct percpu_ref *ref)
+{
+	percpu_ref_kill(ref);
+}
+
+static void virtio_fs_cleanup_dax(void *data)
+{
+	struct virtio_fs *fs = data;
+
+	kill_dax(fs->dax_dev);
+	put_dax(fs->dax_dev);
+}
+
+static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
+{
+	struct virtio_shm_region cache_reg;
+	struct virtio_fs_memremap_info *mi;
+	struct dev_pagemap *pgmap;
+	bool have_cache;
+	int ret;
+
+	if (!IS_ENABLED(CONFIG_DAX_DRIVER))
+		return 0;
+
+	/* Get cache region */
+	have_cache = virtio_get_shm_region(vdev,
+					   &cache_reg,
+					   (u8)VIRTIO_FS_SHMCAP_ID_CACHE);
+	if (!have_cache) {
+		dev_err(&vdev->dev, "%s: No cache capability\n", __func__);
+		return -ENXIO;
+	} else {
+		dev_notice(&vdev->dev, "Cache len: 0x%llx @ 0x%llx\n",
+			   cache_reg.len, cache_reg.addr);
+	}
+
+	mi = devm_kzalloc(&vdev->dev, sizeof(*mi), GFP_KERNEL);
+	if (!mi)
+		return -ENOMEM;
+
+	init_completion(&mi->completion);
+	ret = percpu_ref_init(&mi->ref, virtio_fs_percpu_release, 0,
+			      GFP_KERNEL);
+	if (ret < 0) {
+		dev_err(&vdev->dev, "%s: percpu_ref_init failed (%d)\n",
+			__func__, ret);
+		return ret;
+	}
+
+	ret = devm_add_action(&vdev->dev, virtio_fs_percpu_exit, mi);
+	if (ret < 0) {
+		percpu_ref_exit(&mi->ref);
+		return ret;
+	}
+
+	pgmap = &mi->pgmap;
+	pgmap->altmap_valid = false;
+	pgmap->ref = &mi->ref;
+	pgmap->kill = virtio_fs_percpu_kill;
+	pgmap->type = MEMORY_DEVICE_FS_DAX;
+
+	/* Ideally we would directly use the PCI BAR resource but
+	 * devm_memremap_pages() wants its own copy in pgmap.  So
+	 * initialize a struct resource from scratch (only the start
+	 * and end fields will be used).
+	 */
+	pgmap->res = (struct resource){
+		.name = "virtio-fs dax window",
+		.start = (phys_addr_t) cache_reg.addr,
+		.end = (phys_addr_t) cache_reg.addr + cache_reg.len - 1,
+	};
+
+	fs->window_kaddr = devm_memremap_pages(&vdev->dev, pgmap);
+	if (IS_ERR(fs->window_kaddr))
+		return PTR_ERR(fs->window_kaddr);
+
+	fs->window_phys_addr = (phys_addr_t) cache_reg.addr;
+	fs->window_len = (phys_addr_t) cache_reg.len;
+
+	dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx"
+		" len 0x%llx\n", __func__, fs->window_kaddr, cache_reg.addr,
+		cache_reg.len);
+
+	fs->dax_dev = alloc_dax(fs, NULL, &virtio_fs_dax_ops);
+	if (!fs->dax_dev)
+		return -ENOMEM;
+
+	return devm_add_action_or_reset(&vdev->dev, virtio_fs_cleanup_dax, fs);
+}
+
 static int virtio_fs_probe(struct virtio_device *vdev)
 {
 	struct virtio_fs *fs;
@@ -442,6 +608,10 @@ static int virtio_fs_probe(struct virtio_device *vdev)
 	/* TODO vq affinity */
 	/* TODO populate notifications vq */
 
+	ret = virtio_fs_setup_dax(vdev, fs);
+	if (ret < 0)
+		goto out_vqs;
+
 	/* Bring the device online in case the filesystem is mounted and
 	 * requests need to be sent before we return.
 	 */
@@ -456,7 +626,6 @@ static int virtio_fs_probe(struct virtio_device *vdev)
 out_vqs:
 	vdev->config->reset(vdev);
 	virtio_fs_cleanup_vqs(vdev, fs);
-
 out:
 	vdev->priv = NULL;
 	return ret;
@@ -866,7 +1035,7 @@ static int virtio_fs_fill_super(struct super_block *sb, void *data,
  		goto err_free_fuse_devs;
 	__set_bit(FR_BACKGROUND, &init_req->flags);
 
-	d.dax_dev = NULL;
+	d.dax_dev = d.dax ? fs->dax_dev : NULL;
 	d.fiq_ops = &virtio_fs_fiq_ops;
 	d.fiq_priv = fs;
 	d.fudptr = (void **)&fs->vqs[VQ_REQUEST].fud;
diff --git a/include/uapi/linux/virtio_fs.h b/include/uapi/linux/virtio_fs.h
index 48f3590dcfbe..d4bb549568eb 100644
--- a/include/uapi/linux/virtio_fs.h
+++ b/include/uapi/linux/virtio_fs.h
@@ -38,4 +38,7 @@ struct virtio_fs_config {
 	__u32 num_queues;
 } __attribute__((packed));
 
+/* For the id field in virtio_pci_shm_cap */
+#define VIRTIO_FS_SHMCAP_ID_CACHE 0
+
 #endif /* _UAPI_LINUX_VIRTIO_FS_H */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 19/30] fuse: Keep a list of free dax memory ranges
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (17 preceding siblings ...)
  2019-05-15 19:27 ` [PATCH v2 18/30] virtio_fs, dax: Set up virtio_fs dax_device Vivek Goyal
@ 2019-05-15 19:27 ` Vivek Goyal
  2019-05-15 19:27 ` [PATCH v2 20/30] fuse: Introduce setupmapping/removemapping commands Vivek Goyal
                   ` (10 subsequent siblings)
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:27 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

Divide the dax memory range into fixed-size ranges (2MB for now) and put
them on a list that tracks free ranges. Once an inode requires a free
range, we take one from this list and put it in the interval tree of
ranges assigned to the inode.
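
The carve-up itself is a simple loop; roughly (a sketch of
fuse_dax_mem_range_init() below, with error handling elided):

    /* one fuse_dax_mapping per 2MB slice of the DAX window */
    for (i = 0; i < nr_pages / FUSE_DAX_MEM_RANGE_PAGES; i++) {
        range = kzalloc(sizeof(*range), GFP_KERNEL);
        range->window_offset = i * FUSE_DAX_MEM_RANGE_SZ;
        range->length = FUSE_DAX_MEM_RANGE_SZ;
        list_add_tail(&range->list, &fc->free_ranges);
    }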

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com>
---
 fs/fuse/fuse_i.h    | 23 ++++++++++++
 fs/fuse/inode.c     | 86 +++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/virtio_fs.c |  2 ++
 3 files changed, 111 insertions(+)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 840c88af711c..5439e4628362 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -46,6 +46,10 @@
 /** Number of page pointers embedded in fuse_req */
 #define FUSE_REQ_INLINE_PAGES 1
 
+/* Default memory range size, 2MB */
+#define FUSE_DAX_MEM_RANGE_SZ	(2*1024*1024)
+#define FUSE_DAX_MEM_RANGE_PAGES	(FUSE_DAX_MEM_RANGE_SZ/PAGE_SIZE)
+
 /** List of active connections */
 extern struct list_head fuse_conn_list;
 
@@ -94,6 +98,18 @@ struct fuse_forget_link {
 	struct fuse_forget_link *next;
 };
 
+/** Translation information for file offsets to DAX window offsets */
+struct fuse_dax_mapping {
+	/* Will connect in fc->free_ranges to keep track of free memory */
+	struct list_head list;
+
+	/** Position in DAX window */
+	u64 window_offset;
+
+	/** Length of mapping, in bytes */
+	loff_t length;
+};
+
 /** FUSE inode */
 struct fuse_inode {
 	/** Inode data */
@@ -829,6 +845,13 @@ struct fuse_conn {
 
 	/** DAX device, non-NULL if DAX is supported */
 	struct dax_device *dax_dev;
+
+	/*
+	 * DAX Window Free Ranges. TODO: This might not be best place to store
+	 * this free list
+	 */
+	long nr_free_ranges;
+	struct list_head free_ranges;
 };
 
 static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 97d218a7daa8..8a3dd72f9843 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -22,6 +22,8 @@
 #include <linux/exportfs.h>
 #include <linux/posix_acl.h>
 #include <linux/pid_namespace.h>
+#include <linux/dax.h>
+#include <linux/pfn_t.h>
 
 MODULE_AUTHOR("Miklos Szeredi <miklos@szeredi.hu>");
 MODULE_DESCRIPTION("Filesystem in Userspace");
@@ -610,6 +612,76 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
 	fpq->connected = 1;
 }
 
+static void fuse_free_dax_mem_ranges(struct list_head *mem_list)
+{
+	struct fuse_dax_mapping *range, *temp;
+
+	/* Free All allocated elements */
+	list_for_each_entry_safe(range, temp, mem_list, list) {
+		list_del(&range->list);
+		kfree(range);
+	}
+}
+
+#ifdef CONFIG_FS_DAX
+static int fuse_dax_mem_range_init(struct fuse_conn *fc,
+				   struct dax_device *dax_dev)
+{
+	long nr_pages, nr_ranges;
+	void *kaddr;
+	pfn_t pfn;
+	struct fuse_dax_mapping *range;
+	LIST_HEAD(mem_ranges);
+	phys_addr_t phys_addr;
+	int ret = 0, id;
+	size_t dax_size = -1;
+	unsigned long i;
+
+	id = dax_read_lock();
+	nr_pages = dax_direct_access(dax_dev, 0, PHYS_PFN(dax_size), &kaddr,
+					&pfn);
+	dax_read_unlock(id);
+	if (nr_pages < 0) {
+		pr_debug("dax_direct_access() returned %ld\n", nr_pages);
+		return nr_pages;
+	}
+
+	phys_addr = pfn_t_to_phys(pfn);
+	nr_ranges = nr_pages/FUSE_DAX_MEM_RANGE_PAGES;
+	pr_debug("fuse_dax_mem_range_init(): dax mapped %ld pages. nr_ranges=%ld\n", nr_pages, nr_ranges);
+
+	for (i = 0; i < nr_ranges; i++) {
+		range = kzalloc(sizeof(struct fuse_dax_mapping), GFP_KERNEL);
+		if (!range) {
+			pr_debug("memory allocation for mem_range failed.\n");
+			ret = -ENOMEM;
+			goto out_err;
+		}
+		/* TODO: This offset only works if the virtio-fs driver does
+		 * not hide some memory at the beginning of the window. This
+		 * needs better handling.
+		 */
+		range->window_offset = i * FUSE_DAX_MEM_RANGE_SZ;
+		range->length = FUSE_DAX_MEM_RANGE_SZ;
+		list_add_tail(&range->list, &mem_ranges);
+	}
+
+	list_replace_init(&mem_ranges, &fc->free_ranges);
+	fc->nr_free_ranges = nr_ranges;
+	return 0;
+out_err:
+	/* Free All allocated elements */
+	fuse_free_dax_mem_ranges(&mem_ranges);
+	return ret;
+}
+#else /* !CONFIG_FS_DAX */
+static inline int fuse_dax_mem_range_init(struct fuse_conn *fc,
+					  struct dax_device *dax_dev)
+{
+	return 0;
+}
+#endif /* CONFIG_FS_DAX */
+
 void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
 			struct dax_device *dax_dev,
 			const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv)
@@ -640,6 +712,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
 	fc->dax_dev = dax_dev;
 	fc->user_ns = get_user_ns(user_ns);
 	fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
+	INIT_LIST_HEAD(&fc->free_ranges);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_init);
 
@@ -648,6 +721,8 @@ void fuse_conn_put(struct fuse_conn *fc)
 	if (refcount_dec_and_test(&fc->count)) {
 		if (fc->destroy_req)
 			fuse_request_free(fc->destroy_req);
+		if (fc->dax_dev)
+			fuse_free_dax_mem_ranges(&fc->free_ranges);
 		put_pid_ns(fc->pid_ns);
 		put_user_ns(fc->user_ns);
 		fc->release(fc);
@@ -1154,6 +1229,14 @@ int fuse_fill_super_common(struct super_block *sb,
 		       mount_data->fiq_ops, mount_data->fiq_priv);
 	fc->release = fuse_free_conn;
 
+	if (mount_data->dax_dev) {
+		err = fuse_dax_mem_range_init(fc, mount_data->dax_dev);
+		if (err) {
+			pr_debug("fuse_dax_mem_range_init() returned %d\n", err);
+			goto err_free_ranges;
+		}
+	}
+
 	fud = fuse_dev_alloc_install(fc);
 	if (!fud)
 		goto err_put_conn;
@@ -1214,6 +1297,9 @@ int fuse_fill_super_common(struct super_block *sb,
 	dput(root_dentry);
  err_dev_free:
 	fuse_dev_free(fud);
+ err_free_ranges:
+	if (mount_data->dax_dev)
+		fuse_free_dax_mem_ranges(&fc->free_ranges);
  err_put_conn:
 	fuse_conn_put(fc);
 	sb->s_fs_info = NULL;
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 2b790865dc21..76c46edcc8ac 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -453,6 +453,8 @@ static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
 	phys_addr_t offset = PFN_PHYS(pgoff);
 	size_t max_nr_pages = fs->window_len/PAGE_SIZE - pgoff;
 
+	pr_debug("virtio_fs_direct_access(): called. nr_pages=%ld max_nr_pages=%zu\n", nr_pages, max_nr_pages);
+
 	if (kaddr)
 		*kaddr = fs->window_kaddr + offset;
 	if (pfn)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 20/30] fuse: Introduce setupmapping/removemapping commands
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (18 preceding siblings ...)
  2019-05-15 19:27 ` [PATCH v2 19/30] fuse: Keep a list of free dax memory ranges Vivek Goyal
@ 2019-05-15 19:27 ` Vivek Goyal
  2019-05-15 19:27 ` [PATCH v2 21/30] fuse, dax: Implement dax read/write operations Vivek Goyal
                   ` (9 subsequent siblings)
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:27 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

Introduce two new fuse commands to set up/remove memory mappings. These
will be used to set up/tear down file mappings in the dax window.
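
For illustration, a host daemon could satisfy FUSE_SETUPMAPPING by
mmap()ing the requested file range into the cache window it exports to the
guest. This is only a hypothetical sketch (cache_base, fd_for_handle() and
the reply plumbing are made-up names; the daemon itself is not part of this
series):

    struct fuse_setupmapping_in *in = ...;  /* parsed from the request */
    int prot = PROT_READ;

    if (in->flags & FUSE_SETUPMAPPING_FLAG_WRITE)
        prot |= PROT_WRITE;
    /* map file range [foffset, foffset + len) at moffset in the window;
     * cache_base is assumed to be a char * mapping of the DAX window */
    if (mmap(cache_base + in->moffset, in->len, prot,
             MAP_SHARED | MAP_FIXED, fd_for_handle(in->fh),
             in->foffset) == MAP_FAILED)
        return -errno;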

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 include/uapi/linux/fuse.h | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 2ac598614a8f..9eb313220549 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -399,6 +399,8 @@ enum fuse_opcode {
 	FUSE_RENAME2		= 45,
 	FUSE_LSEEK		= 46,
 	FUSE_COPY_FILE_RANGE	= 47,
+	FUSE_SETUPMAPPING       = 48,
+	FUSE_REMOVEMAPPING      = 49,
 
 	/* CUSE specific operations */
 	CUSE_INIT		= 4096,
@@ -822,4 +824,35 @@ struct fuse_copy_file_range_in {
 	uint64_t	flags;
 };
 
+#define FUSE_SETUPMAPPING_ENTRIES 8
+#define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
+struct fuse_setupmapping_in {
+	/* An already open handle */
+	uint64_t	fh;
+	/* Offset into the file to start the mapping */
+	uint64_t	foffset;
+	/* Length of mapping required */
+	uint64_t	len;
+	/* Flags, FUSE_SETUPMAPPING_FLAG_* */
+	uint64_t	flags;
+	/* Offset in Memory Window */
+	uint64_t	moffset;
+};
+
+struct fuse_setupmapping_out {
+	/* Offsets into the cache of mappings */
+	uint64_t	coffset[FUSE_SETUPMAPPING_ENTRIES];
+	/* Lengths of each mapping */
+	uint64_t	len[FUSE_SETUPMAPPING_ENTRIES];
+};
+
+struct fuse_removemapping_in {
+	/* An already open handle */
+	uint64_t	fh;
+	/* Offset into the dax window to start the unmapping */
+	uint64_t	moffset;
+	/* Length of mapping to remove */
+	uint64_t	len;
+};
+
 #endif /* _LINUX_FUSE_H */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 21/30] fuse, dax: Implement dax read/write operations
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (19 preceding siblings ...)
  2019-05-15 19:27 ` [PATCH v2 20/30] fuse: Introduce setupmapping/removemapping commands Vivek Goyal
@ 2019-05-15 19:27 ` Vivek Goyal
  2019-05-15 19:27 ` [PATCH v2 22/30] fuse, dax: add DAX mmap support Vivek Goyal
                   ` (8 subsequent siblings)
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:27 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

This patch implements basic DAX support. mmap() is not implemented
yet and will come in later patches. This patch implements the
read/write paths.

We make use of an interval tree to keep track of per-inode dax mappings.

Do not use dax for file-extending writes; instead just send a WRITE message
to the daemon (like we do for the direct I/O path). This keeps the write and
the i_size change atomic w.r.t. crash.
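
The resulting decision in the write path looks roughly like this (abridged
from fuse_dax_write_iter() in the diff below):

    /* extending writes bypass dax and go out as regular FUSE WRITEs */
    if (file_extending_write(iocb, from))
        return fuse_dax_direct_write(iocb, from);

    /* everything else is copied through the dax window */
    return dax_iomap_rw(iocb, from, &fuse_iomap_ops);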

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
---
 fs/fuse/file.c            | 454 +++++++++++++++++++++++++++++++++++++-
 fs/fuse/fuse_i.h          |  21 ++
 fs/fuse/inode.c           |   6 +
 include/uapi/linux/fuse.h |   1 +
 4 files changed, 476 insertions(+), 6 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index e9a7aa97c539..edbb11ca735e 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -18,6 +18,12 @@
 #include <linux/swap.h>
 #include <linux/falloc.h>
 #include <linux/uio.h>
+#include <linux/dax.h>
+#include <linux/iomap.h>
+#include <linux/interval_tree_generic.h>
+
+INTERVAL_TREE_DEFINE(struct fuse_dax_mapping, rb, __u64, __subtree_last,
+                     START, LAST, static inline, fuse_dax_interval_tree);
 
 static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 			  int opcode, struct fuse_open_out *outargp)
@@ -171,6 +177,173 @@ static void fuse_link_write_file(struct file *file)
 	spin_unlock(&fi->lock);
 }
 
+static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
+{
+	struct fuse_dax_mapping *dmap = NULL;
+
+	spin_lock(&fc->lock);
+
+	/* TODO: Add logic to try to free up memory if wait is allowed */
+	if (fc->nr_free_ranges <= 0) {
+		spin_unlock(&fc->lock);
+		return NULL;
+	}
+
+	WARN_ON(list_empty(&fc->free_ranges));
+
+	/* Take a free range */
+	dmap = list_first_entry(&fc->free_ranges, struct fuse_dax_mapping,
+					list);
+	list_del_init(&dmap->list);
+	fc->nr_free_ranges--;
+	spin_unlock(&fc->lock);
+	return dmap;
+}
+
+/* This assumes fc->lock is held */
+static void __free_dax_mapping(struct fuse_conn *fc,
+				struct fuse_dax_mapping *dmap)
+{
+	list_add_tail(&dmap->list, &fc->free_ranges);
+	fc->nr_free_ranges++;
+}
+
+static void free_dax_mapping(struct fuse_conn *fc,
+				struct fuse_dax_mapping *dmap)
+{
+	/* Return fuse_dax_mapping to free list */
+	spin_lock(&fc->lock);
+	__free_dax_mapping(fc, dmap);
+	spin_unlock(&fc->lock);
+}
+
+/* offset passed in should be aligned to FUSE_DAX_MEM_RANGE_SZ */
+static int fuse_setup_one_mapping(struct inode *inode,
+				struct file *file, loff_t offset,
+				struct fuse_dax_mapping *dmap)
+{
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_file *ff = NULL;
+	struct fuse_setupmapping_in inarg;
+	FUSE_ARGS(args);
+	ssize_t err;
+
+	if (file)
+		ff = file->private_data;
+
+	WARN_ON(offset % FUSE_DAX_MEM_RANGE_SZ);
+	WARN_ON(fc->nr_free_ranges < 0);
+
+	/* Ask fuse daemon to setup mapping */
+	memset(&inarg, 0, sizeof(inarg));
+	inarg.foffset = offset;
+	if (ff)
+		inarg.fh = ff->fh;
+	else
+		inarg.fh = -1;
+	inarg.moffset = dmap->window_offset;
+	inarg.len = FUSE_DAX_MEM_RANGE_SZ;
+	if (file) {
+		inarg.flags |= (file->f_mode & FMODE_WRITE) ?
+				FUSE_SETUPMAPPING_FLAG_WRITE : 0;
+		inarg.flags |= (file->f_mode & FMODE_READ) ?
+				FUSE_SETUPMAPPING_FLAG_READ : 0;
+	} else {
+		inarg.flags |= FUSE_SETUPMAPPING_FLAG_READ;
+		inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
+	}
+	args.in.h.opcode = FUSE_SETUPMAPPING;
+	args.in.h.nodeid = fi->nodeid;
+	args.in.numargs = 1;
+	args.in.args[0].size = sizeof(inarg);
+	args.in.args[0].value = &inarg;
+	err = fuse_simple_request(fc, &args);
+	if (err < 0) {
+		printk(KERN_ERR "%s request failed at mem_offset=0x%llx %zd\n",
+				 __func__, dmap->window_offset, err);
+		return err;
+	}
+
+	pr_debug("fuse_setup_one_mapping() succeeded. offset=0x%llx err=%zd\n", offset, err);
+
+	/* TODO: What locking is required here. For now, using fc->lock */
+	dmap->start = offset;
+	dmap->end = offset + FUSE_DAX_MEM_RANGE_SZ - 1;
+	/* Protected by fi->i_dmap_sem */
+	fuse_dax_interval_tree_insert(dmap, &fi->dmap_tree);
+	fi->nr_dmaps++;
+	return 0;
+}
+
+static int fuse_removemapping_one(struct inode *inode,
+					struct fuse_dax_mapping *dmap)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_removemapping_in inarg;
+	FUSE_ARGS(args);
+
+	memset(&inarg, 0, sizeof(inarg));
+	inarg.moffset = dmap->window_offset;
+	inarg.len = dmap->length;
+	args.in.h.opcode = FUSE_REMOVEMAPPING;
+	args.in.h.nodeid = fi->nodeid;
+	args.in.numargs = 1;
+	args.in.args[0].size = sizeof(inarg);
+	args.in.args[0].value = &inarg;
+	return fuse_simple_request(fc, &args);
+}
+
+/*
+ * This is called from evict_inode() and by that time the inode is going
+ * away. So this function does not take any locks like fi->i_dmap_sem for
+ * traversing the fuse inode interval tree. If that lock is taken then the
+ * lock validator complains of a deadlock situation w.r.t. the fs_reclaim lock.
+ */
+void fuse_removemapping(struct inode *inode)
+{
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	ssize_t err;
+	struct fuse_dax_mapping *dmap;
+
+	/* Clear the mappings list */
+	while (true) {
+		WARN_ON(fi->nr_dmaps < 0);
+
+		dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, 0,
+								-1);
+		if (dmap) {
+			fuse_dax_interval_tree_remove(dmap, &fi->dmap_tree);
+			fi->nr_dmaps--;
+		}
+
+		if (!dmap)
+			break;
+
+		/*
+		 * During umount/shutdown, fuse connection is dropped first
+		 * and evict_inode() is called later. That means any
+		 * removemapping messages are going to fail. Send messages
+		 * only if connection is up. Otherwise fuse daemon is
+		 * responsible for cleaning up any leftover references and
+		 * mappings.
+		 */
+		if (fc->connected) {
+			err = fuse_removemapping_one(inode, dmap);
+			if (err) {
+				pr_warn("Failed to removemapping. offset=0x%llx"
+					" len=0x%llx\n", dmap->window_offset,
+					dmap->length);
+			}
+		}
+
+		/* Add it back to free ranges list */
+		free_dax_mapping(fc, dmap);
+	}
+}
+
 void fuse_finish_open(struct inode *inode, struct file *file)
 {
 	struct fuse_file *ff = file->private_data;
@@ -1476,32 +1649,290 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	return res;
 }
 
+static ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
 static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct file *file = iocb->ki_filp;
 	struct fuse_file *ff = file->private_data;
+	struct inode *inode = file->f_mapping->host;
 
 	if (is_bad_inode(file_inode(file)))
 		return -EIO;
 
-	if (!(ff->open_flags & FOPEN_DIRECT_IO))
-		return fuse_cache_read_iter(iocb, to);
-	else
+	if (IS_DAX(inode))
+		return fuse_dax_read_iter(iocb, to);
+
+	if (ff->open_flags & FOPEN_DIRECT_IO)
 		return fuse_direct_read_iter(iocb, to);
+
+	return fuse_cache_read_iter(iocb, to);
 }
 
+static ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
 static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *file = iocb->ki_filp;
 	struct fuse_file *ff = file->private_data;
+	struct inode *inode = file->f_mapping->host;
 
 	if (is_bad_inode(file_inode(file)))
 		return -EIO;
 
-	if (!(ff->open_flags & FOPEN_DIRECT_IO))
-		return fuse_cache_write_iter(iocb, from);
-	else
+	if (IS_DAX(inode))
+		return fuse_dax_write_iter(iocb, from);
+
+	if (ff->open_flags & FOPEN_DIRECT_IO)
 		return fuse_direct_write_iter(iocb, from);
+
+	return fuse_cache_write_iter(iocb, from);
+}
+
+static void fuse_fill_iomap_hole(struct iomap *iomap, loff_t length)
+{
+	iomap->addr = IOMAP_NULL_ADDR;
+	iomap->length = length;
+	iomap->type = IOMAP_HOLE;
+}
+
+static void fuse_fill_iomap(struct inode *inode, loff_t pos, loff_t length,
+			struct iomap *iomap, struct fuse_dax_mapping *dmap,
+			unsigned flags)
+{
+	loff_t offset, len;
+	loff_t i_size = i_size_read(inode);
+
+	offset = pos - dmap->start;
+	len = min(length, dmap->length - offset);
+
+	/* If length is beyond end of file, truncate further */
+	if (pos + len > i_size)
+		len = i_size - pos;
+
+	if (len > 0) {
+		iomap->addr = dmap->window_offset + offset;
+		iomap->length = len;
+		if (flags & IOMAP_FAULT)
+			iomap->length = ALIGN(len, PAGE_SIZE);
+		iomap->type = IOMAP_MAPPED;
+		pr_debug("%s: returns iomap: addr 0x%llx offset 0x%llx"
+				" length 0x%llx\n", __func__, iomap->addr,
+				iomap->offset, iomap->length);
+	} else {
+		/* Mapping beyond end of file is hole */
+		fuse_fill_iomap_hole(iomap, length);
+		pr_debug("%s: returns iomap: addr 0x%llx offset 0x%llx"
+				" length 0x%llx\n", __func__, iomap->addr,
+				iomap->offset, iomap->length);
+	}
+}
+
+/* This is just for DAX and the mapping is ephemeral, do not use it for other
+ * purposes since there is no block device with a permanent mapping.
+ */
+static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
+			    unsigned flags, struct iomap *iomap)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_dax_mapping *dmap, *alloc_dmap = NULL;
+	int ret;
+
+	/* We don't support FIEMAP */
+	BUG_ON(flags & IOMAP_REPORT);
+
+	pr_debug("fuse_iomap_begin() called. pos=0x%llx length=0x%llx\n",
+			pos, length);
+
+	/*
+	 * Writes beyond end of file are not handled using the dax path.
+	 * Instead a fuse write message is sent to the daemon.
+	 */
+	if (flags & IOMAP_WRITE && pos >= i_size_read(inode))
+		return -EIO;
+
+	iomap->offset = pos;
+	iomap->flags = 0;
+	iomap->bdev = NULL;
+	iomap->dax_dev = fc->dax_dev;
+
+	/*
+	 * Both read/write and mmap path can race here. So we need something
+	 * to make sure if we are setting up mapping, then other path waits
+	 *
+	 * For now, use a semaphore for this. It probably needs to be
+	 * optimized later.
+	 */
+	down_read(&fi->i_dmap_sem);
+	dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos, pos);
+
+	if (dmap) {
+		fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
+		up_read(&fi->i_dmap_sem);
+		return 0;
+	} else {
+		up_read(&fi->i_dmap_sem);
+		pr_debug("%s: no mapping at offset 0x%llx length 0x%llx\n",
+				__func__, pos, length);
+		if (pos >= i_size_read(inode))
+			goto iomap_hole;
+
+		alloc_dmap = alloc_dax_mapping(fc);
+		if (!alloc_dmap)
+			return -EBUSY;
+
+		/*
+		 * Drop read lock and take write lock so that only one
+		 * caller can try to set up a mapping and others wait
+		 */
+		down_write(&fi->i_dmap_sem);
+		/*
+		 * We dropped the lock. Check again in case somebody else
+		 * set up the mapping already.
+		 */
+		dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos,
+							pos);
+		if (dmap) {
+			fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
+			free_dax_mapping(fc, alloc_dmap);
+			up_write(&fi->i_dmap_sem);
+			return 0;
+		}
+
+		/* Setup one mapping */
+		ret = fuse_setup_one_mapping(inode, NULL,
+				ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
+				alloc_dmap);
+		if (ret < 0) {
+			pr_err("fuse_setup_one_mapping() failed. err=%d"
+				" pos=0x%llx\n", ret, pos);
+			free_dax_mapping(fc, alloc_dmap);
+			up_write(&fi->i_dmap_sem);
+			return ret;
+		}
+		fuse_fill_iomap(inode, pos, length, iomap, alloc_dmap, flags);
+		up_write(&fi->i_dmap_sem);
+		return 0;
+	}
+
+	/*
+	 * If a read beyond end of file happens, fs code seems to return
+	 * it as a hole
+	 */
+iomap_hole:
+	fuse_fill_iomap_hole(iomap, length);
+	pr_debug("fuse_iomap_begin() returning hole mapping. pos=0x%llx length_asked=0x%llx length_returned=0x%llx\n", pos, length, iomap->length);
+	return 0;
+}
+
+static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t length,
+			  ssize_t written, unsigned flags,
+			  struct iomap *iomap)
+{
+	/* DAX writes beyond end-of-file aren't handled using iomap, so the
+	 * file size is unchanged and there is nothing to do here.
+	 */
+	return 0;
+}
+
+static const struct iomap_ops fuse_iomap_ops = {
+	.iomap_begin = fuse_iomap_begin,
+	.iomap_end = fuse_iomap_end,
+};
+
+static ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	ssize_t ret;
+
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!inode_trylock_shared(inode))
+			return -EAGAIN;
+	} else {
+		inode_lock_shared(inode);
+	}
+
+	ret = dax_iomap_rw(iocb, to, &fuse_iomap_ops);
+	inode_unlock_shared(inode);
+
+	/* TODO file_accessed(iocb->f_filp) */
+
+	return ret;
+}
+
+static bool file_extending_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	return (iov_iter_rw(from) == WRITE &&
+		((iocb->ki_pos) >= i_size_read(inode)));
+}
+
+static ssize_t fuse_dax_direct_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(iocb);
+	ssize_t ret;
+
+	ret = fuse_direct_io(&io, from, &iocb->ki_pos, FUSE_DIO_WRITE);
+	if (ret < 0)
+		return ret;
+
+	fuse_invalidate_attr(inode);
+	fuse_write_update_size(inode, iocb->ki_pos);
+	return ret;
+}
+
+static ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	ssize_t ret, count;
+
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!inode_trylock(inode))
+			return -EAGAIN;
+	} else {
+		inode_lock(inode);
+	}
+
+	ret = generic_write_checks(iocb, from);
+	if (ret <= 0)
+		goto out;
+
+	ret = file_remove_privs(iocb->ki_filp);
+	if (ret)
+		goto out;
+	/* TODO file_update_time() but we don't want metadata I/O */
+
+	/* Do not use dax for file-extending writes as it's an mmap and
+	 * trying to write beyond the end of an existing page will generate
+	 * SIGBUS.
+	 */
+	if (file_extending_write(iocb, from)) {
+		ret = fuse_dax_direct_write(iocb, from);
+		goto out;
+	}
+
+	ret = dax_iomap_rw(iocb, from, &fuse_iomap_ops);
+	if (ret < 0)
+		goto out;
+
+	/*
+	 * If part of the write was file extending, fuse dax path will not
+	 * take care of that. Do direct write instead.
+	 */
+	if (iov_iter_count(from) && file_extending_write(iocb, from)) {
+		count = fuse_dax_direct_write(iocb, from);
+		if (count < 0)
+			goto out;
+		ret += count;
+	}
+
+out:
+	inode_unlock(inode);
+
+	if (ret > 0)
+		ret = generic_write_sync(iocb, ret);
+	return ret;
 }
 
 static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
@@ -2180,6 +2611,11 @@ static ssize_t fuse_file_splice_read(struct file *in, loff_t *ppos,
 
 }
 
+static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	return -EINVAL; /* TODO */
+}
+
 static int convert_fuse_file_lock(struct fuse_conn *fc,
 				  const struct fuse_file_lock *ffl,
 				  struct file_lock *fl)
@@ -3212,6 +3648,7 @@ static const struct address_space_operations fuse_file_aops  = {
 void fuse_init_file_inode(struct inode *inode)
 {
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_conn *fc = get_fuse_conn(inode);
 
 	inode->i_fop = &fuse_file_operations;
 	inode->i_data.a_ops = &fuse_file_aops;
@@ -3221,4 +3658,9 @@ void fuse_init_file_inode(struct inode *inode)
 	fi->writectr = 0;
 	init_waitqueue_head(&fi->page_waitq);
 	INIT_LIST_HEAD(&fi->writepages);
+	fi->dmap_tree = RB_ROOT_CACHED;
+
+	if (fc->dax_dev) {
+		inode->i_flags |= S_DAX;
+	}
 }
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 5439e4628362..f1ae549eff98 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -98,11 +98,22 @@ struct fuse_forget_link {
 	struct fuse_forget_link *next;
 };
 
+#define START(node) ((node)->start)
+#define LAST(node) ((node)->end)
+
 /** Translation information for file offsets to DAX window offsets */
 struct fuse_dax_mapping {
 	/* Will connect in fc->free_ranges to keep track of free memory */
 	struct list_head list;
 
+	/* For interval tree in file/inode */
+	struct rb_node rb;
+	/** Start Position in file */
+	__u64 start;
+	/** End Position in file */
+	__u64 end;
+	__u64 __subtree_last;
+
 	/** Position in DAX window */
 	u64 window_offset;
 
@@ -195,6 +206,15 @@ struct fuse_inode {
 
 	/** Lock to protect write related fields */
 	spinlock_t lock;
+
+	/*
+	 * Semaphore to protect modifications to dmap_tree
+	 */
+	struct rw_semaphore i_dmap_sem;
+
+	/** Sorted rb tree of struct fuse_dax_mapping elements */
+	struct rb_root_cached dmap_tree;
+	unsigned long nr_dmaps;
 };
 
 /** FUSE inode state bits */
@@ -1226,5 +1246,6 @@ unsigned fuse_len_args(unsigned numargs, struct fuse_arg *args);
  * Get the next unique ID for a request
  */
 u64 fuse_get_unique(struct fuse_iqueue *fiq);
+void fuse_removemapping(struct inode *inode);
 
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 8a3dd72f9843..ad66a353554b 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -83,7 +83,9 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
 	fi->attr_version = 0;
 	fi->orig_ino = 0;
 	fi->state = 0;
+	fi->nr_dmaps = 0;
 	mutex_init(&fi->mutex);
+	init_rwsem(&fi->i_dmap_sem);
 	spin_lock_init(&fi->lock);
 	fi->forget = fuse_alloc_forget();
 	if (!fi->forget) {
@@ -119,6 +121,10 @@ static void fuse_evict_inode(struct inode *inode)
 	if (inode->i_sb->s_flags & SB_ACTIVE) {
 		struct fuse_conn *fc = get_fuse_conn(inode);
 		struct fuse_inode *fi = get_fuse_inode(inode);
+		if (IS_DAX(inode)) {
+			fuse_removemapping(inode);
+			WARN_ON(fi->nr_dmaps);
+		}
 		fuse_queue_forget(fc, fi->forget, fi->nodeid, fi->nlookup);
 		fi->forget = NULL;
 	}
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 9eb313220549..5042e227e8a8 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -826,6 +826,7 @@ struct fuse_copy_file_range_in {
 
 #define FUSE_SETUPMAPPING_ENTRIES 8
 #define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
+#define FUSE_SETUPMAPPING_FLAG_READ (1ull << 1)
 struct fuse_setupmapping_in {
 	/* An already open handle */
 	uint64_t	fh;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 22/30] fuse, dax: add DAX mmap support
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (20 preceding siblings ...)
  2019-05-15 19:27 ` [PATCH v2 21/30] fuse, dax: Implement dax read/write operations Vivek Goyal
@ 2019-05-15 19:27 ` Vivek Goyal
  2019-05-15 19:27 ` [PATCH v2 23/30] fuse: Define dax address space operations Vivek Goyal
                   ` (7 subsequent siblings)
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:27 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

From: Stefan Hajnoczi <stefanha@redhat.com>

Add DAX mmap() support.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 fs/fuse/file.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 63 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index edbb11ca735e..a053bcb9498d 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2576,10 +2576,15 @@ static const struct vm_operations_struct fuse_file_vm_ops = {
 	.page_mkwrite	= fuse_page_mkwrite,
 };
 
+static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma);
 static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct fuse_file *ff = file->private_data;
 
+	/* DAX mmap is superior to direct_io mmap */
+	if (IS_DAX(file_inode(file)))
+		return fuse_dax_mmap(file, vma);
+
 	if (ff->open_flags & FOPEN_DIRECT_IO) {
 		/* Can't provide the coherency needed for MAP_SHARED */
 		if (vma->vm_flags & VM_MAYSHARE)
@@ -2611,9 +2616,65 @@ static ssize_t fuse_file_splice_read(struct file *in, loff_t *ppos,
 
 }
 
+static int __fuse_dax_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
+			    bool write)
+{
+	vm_fault_t ret;
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	struct super_block *sb = inode->i_sb;
+	pfn_t pfn;
+
+	if (write)
+		sb_start_pagefault(sb);
+
+	/* TODO inode semaphore to protect faults vs truncate */
+
+	ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
+
+	if (ret & VM_FAULT_NEEDDSYNC)
+		ret = dax_finish_sync_fault(vmf, pe_size, pfn);
+
+	if (write)
+		sb_end_pagefault(sb);
+
+	return ret;
+}
+
+static vm_fault_t fuse_dax_fault(struct vm_fault *vmf)
+{
+	return __fuse_dax_fault(vmf, PE_SIZE_PTE,
+				vmf->flags & FAULT_FLAG_WRITE);
+}
+
+static vm_fault_t fuse_dax_huge_fault(struct vm_fault *vmf,
+			       enum page_entry_size pe_size)
+{
+	return __fuse_dax_fault(vmf, pe_size, vmf->flags & FAULT_FLAG_WRITE);
+}
+
+static vm_fault_t fuse_dax_page_mkwrite(struct vm_fault *vmf)
+{
+	return __fuse_dax_fault(vmf, PE_SIZE_PTE, true);
+}
+
+static vm_fault_t fuse_dax_pfn_mkwrite(struct vm_fault *vmf)
+{
+	return __fuse_dax_fault(vmf, PE_SIZE_PTE, true);
+}
+
+static const struct vm_operations_struct fuse_dax_vm_ops = {
+	.fault		= fuse_dax_fault,
+	.huge_fault	= fuse_dax_huge_fault,
+	.page_mkwrite	= fuse_dax_page_mkwrite,
+	.pfn_mkwrite	= fuse_dax_pfn_mkwrite,
+};
+
 static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	return -EINVAL; /* TODO */
+	file_accessed(file);
+	vma->vm_ops = &fuse_dax_vm_ops;
+	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	return 0;
 }
 
 static int convert_fuse_file_lock(struct fuse_conn *fc,
@@ -3622,6 +3683,7 @@ static const struct file_operations fuse_file_operations = {
 	.release	= fuse_release,
 	.fsync		= fuse_fsync,
 	.lock		= fuse_file_lock,
+	.get_unmapped_area = thp_get_unmapped_area,
 	.flock		= fuse_file_flock,
 	.splice_read	= fuse_file_splice_read,
 	.splice_write	= iter_file_splice_write,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 23/30] fuse: Define dax address space operations
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (21 preceding siblings ...)
  2019-05-15 19:27 ` [PATCH v2 22/30] fuse, dax: add DAX mmap support Vivek Goyal
@ 2019-05-15 19:27 ` Vivek Goyal
  2019-05-15 19:27 ` [PATCH v2 24/30] fuse, dax: Take ->i_mmap_sem lock during dax page fault Vivek Goyal
                   ` (6 subsequent siblings)
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:27 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

This is done along the lines of ext4 and xfs. I primarily wanted the
->writepages hook at this time so that I could call into
dax_writeback_mapping_range(). This in turn will decide which pfns need to
be written back.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/file.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index a053bcb9498d..2777355bc245 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2402,6 +2402,17 @@ static int fuse_writepages_fill(struct page *page,
 	return err;
 }
 
+static int fuse_dax_writepages(struct address_space *mapping,
+				struct writeback_control *wbc)
+{
+
+	struct inode *inode = mapping->host;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	return dax_writeback_mapping_range(mapping,
+		NULL, fc->dax_dev, wbc);
+}
+
 static int fuse_writepages(struct address_space *mapping,
 			   struct writeback_control *wbc)
 {
@@ -3707,6 +3718,13 @@ static const struct address_space_operations fuse_file_aops  = {
 	.write_end	= fuse_write_end,
 };
 
+static const struct address_space_operations fuse_dax_file_aops  = {
+	.writepages	= fuse_dax_writepages,
+	.direct_IO	= noop_direct_IO,
+	.set_page_dirty	= noop_set_page_dirty,
+	.invalidatepage	= noop_invalidatepage,
+};
+
 void fuse_init_file_inode(struct inode *inode)
 {
 	struct fuse_inode *fi = get_fuse_inode(inode);
@@ -3724,5 +3742,6 @@ void fuse_init_file_inode(struct inode *inode)
 
 	if (fc->dax_dev) {
 		inode->i_flags |= S_DAX;
+		inode->i_data.a_ops = &fuse_dax_file_aops;
 	}
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 24/30] fuse, dax: Take ->i_mmap_sem lock during dax page fault
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (22 preceding siblings ...)
  2019-05-15 19:27 ` [PATCH v2 23/30] fuse: Define dax address space operations Vivek Goyal
@ 2019-05-15 19:27 ` Vivek Goyal
  2019-05-15 19:27 ` [PATCH v2 25/30] fuse: Maintain a list of busy elements Vivek Goyal
                   ` (5 subsequent siblings)
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:27 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

We need some kind of locking mechanism here. Normal file systems like
ext4 and xfs seem to take their own semaphore to protect against
truncate while a fault is going on.

We have an additional requirement to protect against fuse dax memory range
reclaim. When a range has been selected for reclaim, we need to make sure
no other read/write/fault can try to access that memory range while
reclaim is in progress. Once reclaim is complete, the lock will be released
and read/write/fault will trigger allocation of a fresh dax range.

Taking inode_lock() is not an option in the fault path as lockdep complains
about circular dependencies. So define a new fuse_inode->i_mmap_sem.
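
Abridged from the diff below, the two sides of the new semaphore look like
this:

    /* fault side (__fuse_dax_fault()) */
    down_read(&fi->i_mmap_sem);
    ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
    up_read(&fi->i_mmap_sem);

    /* truncate side (fuse_do_setattr()) */
    down_write(&fi->i_mmap_sem);
    truncate_pagecache(inode, outarg.attr.size);
    up_write(&fi->i_mmap_sem);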

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/dir.c    |  2 ++
 fs/fuse/file.c   | 17 +++++++++++++----
 fs/fuse/fuse_i.h |  7 +++++++
 fs/fuse/inode.c  |  1 +
 4 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index fd8636e67ae9..84c0b638affb 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1559,8 +1559,10 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
 	 */
 	if ((is_truncate || !is_wb) &&
 	    S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
+		down_write(&fi->i_mmap_sem);
 		truncate_pagecache(inode, outarg.attr.size);
 		invalidate_inode_pages2(inode->i_mapping);
+		up_write(&fi->i_mmap_sem);
 	}
 
 	clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 2777355bc245..e536a04aaa06 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2638,13 +2638,20 @@ static int __fuse_dax_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 	if (write)
 		sb_start_pagefault(sb);
 
-	/* TODO inode semaphore to protect faults vs truncate */
-
+	/*
+	 * We need to serialize against not only truncate but also against
+	 * fuse dax memory range reclaim. While a range is being reclaimed,
+	 * we do not want any read/write/mmap to make progress and try
+	 * to populate page cache or access memory we are trying to free.
+	 */
+	down_read(&get_fuse_inode(inode)->i_mmap_sem);
 	ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
 
 	if (ret & VM_FAULT_NEEDDSYNC)
 		ret = dax_finish_sync_fault(vmf, pe_size, pfn);
 
+	up_read(&get_fuse_inode(inode)->i_mmap_sem);
+
 	if (write)
 		sb_end_pagefault(sb);
 
@@ -3593,9 +3600,11 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 			file_update_time(file);
 	}
 
-	if (mode & FALLOC_FL_PUNCH_HOLE)
+	if (mode & FALLOC_FL_PUNCH_HOLE) {
+		down_write(&fi->i_mmap_sem);
 		truncate_pagecache_range(inode, offset, offset + length - 1);
-
+		up_write(&fi->i_mmap_sem);
+	}
 	fuse_invalidate_attr(inode);
 
 out:
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f1ae549eff98..a234cf30538d 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -212,6 +212,13 @@ struct fuse_inode {
 	 */
 	struct rw_semaphore i_dmap_sem;
 
+	/**
+	 * Can't take inode lock in fault path (leads to circular dependency).
+	 * So take this in fuse dax fault path to make sure truncate and
+	 * punch hole etc. can't make progress in parallel.
+	 */
+	struct rw_semaphore i_mmap_sem;
+
 	/** Sorted rb tree of struct fuse_dax_mapping elements */
 	struct rb_root_cached dmap_tree;
 	unsigned long nr_dmaps;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index ad66a353554b..713c5f32ab35 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -85,6 +85,7 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
 	fi->state = 0;
 	fi->nr_dmaps = 0;
 	mutex_init(&fi->mutex);
+	init_rwsem(&fi->i_mmap_sem);
 	init_rwsem(&fi->i_dmap_sem);
 	spin_lock_init(&fi->lock);
 	fi->forget = fuse_alloc_forget();
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 25/30] fuse: Maintain a list of busy elements
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (23 preceding siblings ...)
  2019-05-15 19:27 ` [PATCH v2 24/30] fuse, dax: Take ->i_mmap_sem lock during dax page fault Vivek Goyal
@ 2019-05-15 19:27 ` Vivek Goyal
  2019-05-15 19:27 ` [PATCH v2 26/30] fuse: Add logic to free up a memory range Vivek Goyal
                   ` (4 subsequent siblings)
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:27 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

This list will be used for selecting a fuse_dax_mapping to free when the
number of free mappings drops below a threshold.
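
The bookkeeping is minimal; abridged from the diff below, a range joins
fc->busy_ranges when its mapping is set up and leaves the list when the
mapping is removed:

    spin_lock(&fc->lock);
    list_add_tail(&dmap->busy_list, &fc->busy_ranges);
    fc->nr_busy_ranges++;
    spin_unlock(&fc->lock);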

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/file.c   | 8 ++++++++
 fs/fuse/fuse_i.h | 7 +++++++
 fs/fuse/inode.c  | 4 ++++
 3 files changed, 19 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index e536a04aaa06..3f0f7a387341 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -273,6 +273,10 @@ static int fuse_setup_one_mapping(struct inode *inode,
 	/* Protected by fi->i_dmap_sem */
 	fuse_dax_interval_tree_insert(dmap, &fi->dmap_tree);
 	fi->nr_dmaps++;
+	spin_lock(&fc->lock);
+	list_add_tail(&dmap->busy_list, &fc->busy_ranges);
+	fc->nr_busy_ranges++;
+	spin_unlock(&fc->lock);
 	return 0;
 }
 
@@ -317,6 +321,10 @@ void fuse_removemapping(struct inode *inode)
 		if (dmap) {
 			fuse_dax_interval_tree_remove(dmap, &fi->dmap_tree);
 			fi->nr_dmaps--;
+			spin_lock(&fc->lock);
+			list_del_init(&dmap->busy_list);
+			fc->nr_busy_ranges--;
+			spin_unlock(&fc->lock);
 		}
 
 		if (!dmap)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index a234cf30538d..c93e9155b723 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -114,6 +114,9 @@ struct fuse_dax_mapping {
 	__u64 end;
 	__u64 __subtree_last;
 
+	/* Will connect in fc->busy_ranges to keep track busy memory */
+	struct list_head busy_list;
+
 	/** Position in DAX window */
 	u64 window_offset;
 
@@ -873,6 +876,10 @@ struct fuse_conn {
 	/** DAX device, non-NULL if DAX is supported */
 	struct dax_device *dax_dev;
 
+	/* List of memory ranges which are busy */
+	unsigned long nr_busy_ranges;
+	struct list_head busy_ranges;
+
 	/*
 	 * DAX Window Free Ranges. TODO: This might not be best place to store
 	 * this free list
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 713c5f32ab35..f57f7ce02acc 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -626,6 +626,8 @@ static void fuse_free_dax_mem_ranges(struct list_head *mem_list)
 	/* Free All allocated elements */
 	list_for_each_entry_safe(range, temp, mem_list, list) {
 		list_del(&range->list);
+		if (!list_empty(&range->busy_list))
+			list_del(&range->busy_list);
 		kfree(range);
 	}
 }
@@ -670,6 +672,7 @@ static int fuse_dax_mem_range_init(struct fuse_conn *fc,
 		 */
 		range->window_offset = i * FUSE_DAX_MEM_RANGE_SZ;
 		range->length = FUSE_DAX_MEM_RANGE_SZ;
+		INIT_LIST_HEAD(&range->busy_list);
 		list_add_tail(&range->list, &mem_ranges);
 	}
 
@@ -720,6 +723,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
 	fc->user_ns = get_user_ns(user_ns);
 	fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
 	INIT_LIST_HEAD(&fc->free_ranges);
+	INIT_LIST_HEAD(&fc->busy_ranges);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_init);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 26/30] fuse: Add logic to free up a memory range
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (24 preceding siblings ...)
  2019-05-15 19:27 ` [PATCH v2 25/30] fuse: Maintain a list of busy elements Vivek Goyal
@ 2019-05-15 19:27 ` Vivek Goyal
       [not found]   ` <CAN+Pk99SNKSf+GjSQUUWt_eu1fSjTy_ByUOEQUXHi8zNqXY1zA@mail.gmail.com>
  2019-05-15 19:27 ` [PATCH v2 27/30] fuse: Release file in process context Vivek Goyal
                   ` (3 subsequent siblings)
  29 siblings, 1 reply; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:27 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

Add logic to free up a busy memory range. A freed memory range is
returned to the free pool. Add a worker which can be started to select
and free some busy memory ranges.

In certain cases (the write path), a process can steal one of its busy
dax ranges if no free range is available.

If no free range is available and nothing can be stolen from the same
inode, the caller waits on a waitq for a free range to become available.

For reclaiming a range, as of now we need to hold the following locks in
the specified order.

	inode_trylock(inode);
	down_write(&fi->i_mmap_sem);
	down_write(&fi->i_dmap_sem);

This means one can not wait for a range to become free while in the fault
path, because it can lead to deadlock in the following two situations.

- Worker thread to free memory might block on fuse_inode->i_mmap_sem as well.
- This inode is holding all the memory and more memory can't be freed.

In both cases, deadlock will ensue. So return -ENOSPC from iomap_begin()
in the fault path if memory can't be allocated; drop fuse_inode->i_mmap_sem,
wait for a free range to become available, and retry.

The read path can't do direct reclaim either, because it holds the shared
inode lock while reclaim assumes that the inode lock is held exclusively.
With a shared lock, one reader might still be reading from a range while
another reader reclaims that range, leading to problems. So the read path
also returns -ENOSPC and higher layers retry (like the fault path).

The write path is a different story. We hold the inode lock and the lock
ordering allows us to grab fuse_inode->i_mmap_sem, if needed. That means we
can do direct reclaim in that path. But if there is no memory allocated to
this inode, then direct reclaim will not work and we need to wait for a
memory range to become free. So try the following order.

A. Try to get a free range.
B. If not, try direct reclaim.
C. If not, wait for a memory range to become free.

Here, sleeping with locks held should be fine because in step B we made
sure this inode is not holding any ranges. That means other inodes are
holding ranges and somebody should be able to free memory. Also, the worker
thread does a trylock() on the inode lock, so it will not wait on this
inode and will move on to the next memory range. Hence the above sequence
should be deadlock free.
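
Put together, the write path allocation looks roughly like this (a
hypothetical condensation; the inline reclaim helper name is illustrative,
while alloc_dax_mapping() and the waitq are from this series):

    dmap = alloc_dax_mapping(fc);                        /* A: free list */
    if (!dmap)
        dmap = inode_inline_reclaim_one_dmap(fc, inode); /* B: steal ours */
    if (!dmap)
        /* C: wait for somebody to free a range, then retry */
        wait_event_killable_exclusive(fc->dax_range_waitq,
                                      fc->nr_free_ranges > 0);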

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/file.c   | 357 +++++++++++++++++++++++++++++++++++++++++++++--
 fs/fuse/fuse_i.h |  22 +++
 fs/fuse/inode.c  |   4 +
 3 files changed, 374 insertions(+), 9 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 3f0f7a387341..87fc2b5e0a3a 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -25,6 +25,8 @@
 INTERVAL_TREE_DEFINE(struct fuse_dax_mapping, rb, __u64, __subtree_last,
                      START, LAST, static inline, fuse_dax_interval_tree);
 
+static struct fuse_dax_mapping *alloc_dax_mapping_reclaim(struct fuse_conn *fc,
+				struct inode *inode);
 static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 			  int opcode, struct fuse_open_out *outargp)
 {
@@ -179,6 +181,7 @@ static void fuse_link_write_file(struct file *file)
 
 static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
 {
+	unsigned long free_threshold;
 	struct fuse_dax_mapping *dmap = NULL;
 
 	spin_lock(&fc->lock);
@@ -186,7 +189,7 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
 	/* TODO: Add logic to try to free up memory if wait is allowed */
 	if (fc->nr_free_ranges <= 0) {
 		spin_unlock(&fc->lock);
-		return NULL;
+		goto out_kick;
 	}
 
 	WARN_ON(list_empty(&fc->free_ranges));
@@ -197,15 +200,43 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
 	list_del_init(&dmap->list);
 	fc->nr_free_ranges--;
 	spin_unlock(&fc->lock);
+
+out_kick:
+	/* If number of free ranges are below threshold, start reclaim */
+	free_threshold = max((fc->nr_ranges * FUSE_DAX_RECLAIM_THRESHOLD)/100,
+				(unsigned long)1);
+	if (fc->nr_free_ranges < free_threshold) {
+		pr_debug("fuse: Kicking dax memory reclaim worker. nr_free_ranges=%ld nr_total_ranges=%ld\n", fc->nr_free_ranges, fc->nr_ranges);
+		queue_delayed_work(system_long_wq, &fc->dax_free_work, 0);
+	}
 	return dmap;
 }
 
+/* This assumes fc->lock is held */
+static void __dmap_remove_busy_list(struct fuse_conn *fc,
+				struct fuse_dax_mapping *dmap)
+{
+	list_del_init(&dmap->busy_list);
+	WARN_ON(fc->nr_busy_ranges == 0);
+	fc->nr_busy_ranges--;
+}
+
+static void dmap_remove_busy_list(struct fuse_conn *fc,
+				struct fuse_dax_mapping *dmap)
+{
+	spin_lock(&fc->lock);
+	__dmap_remove_busy_list(fc, dmap);
+	spin_unlock(&fc->lock);
+}
+
 /* This assumes fc->lock is held */
 static void __free_dax_mapping(struct fuse_conn *fc,
 				struct fuse_dax_mapping *dmap)
 {
 	list_add_tail(&dmap->list, &fc->free_ranges);
 	fc->nr_free_ranges++;
+	/* TODO: Wake up only when needed */
+	wake_up(&fc->dax_range_waitq);
 }
 
 static void free_dax_mapping(struct fuse_conn *fc,
@@ -267,7 +298,15 @@ static int fuse_setup_one_mapping(struct inode *inode,
 
 	pr_debug("fuse_setup_one_mapping() succeeded. offset=0x%llx err=%zd\n", offset, err);
 
-	/* TODO: What locking is required here. For now, using fc->lock */
+	/*
+	 * We don't take a reference on the inode. The inode is valid right
+	 * now and when the inode is going away, cleanup logic should first
+	 * clean up the dmap entries.
+	 *
+	 * TODO: Do we need to ensure that we are holding the inode lock
+	 * as well?
+	 */
+	dmap->inode = inode;
 	dmap->start = offset;
 	dmap->end = offset + FUSE_DAX_MEM_RANGE_SZ - 1;
 	/* Protected by fi->i_dmap_sem */
@@ -321,10 +360,7 @@ void fuse_removemapping(struct inode *inode)
 		if (dmap) {
 			fuse_dax_interval_tree_remove(dmap, &fi->dmap_tree);
 			fi->nr_dmaps--;
-			spin_lock(&fc->lock);
-			list_del_init(&dmap->busy_list);
-			fc->nr_busy_ranges--;
-			spin_unlock(&fc->lock);
+			dmap_remove_busy_list(fc, dmap);
 		}
 
 		if (!dmap)
@@ -347,6 +383,8 @@ void fuse_removemapping(struct inode *inode)
 			}
 		}
 
+		dmap->inode = NULL;
+
 		/* Add it back to free ranges list */
 		free_dax_mapping(fc, dmap);
 	}
@@ -1784,8 +1822,23 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
 		if (pos >= i_size_read(inode))
 			goto iomap_hole;
 
-		alloc_dmap = alloc_dax_mapping(fc);
-		if (!alloc_dmap)
+		/* Can't do reclaim in fault path yet due to lock ordering.
+		 * Read path takes shared inode lock and that's not sufficient
+		 * for inline range reclaim. Caller needs to drop lock, wait
+		 * and retry.
+		 */
+		if (flags & IOMAP_FAULT || !(flags & IOMAP_WRITE)) {
+			alloc_dmap = alloc_dax_mapping(fc);
+			if (!alloc_dmap)
+				return -ENOSPC;
+		} else {
+			alloc_dmap = alloc_dax_mapping_reclaim(fc, inode);
+			if (IS_ERR(alloc_dmap))
+				return PTR_ERR(alloc_dmap);
+		}
+
+		/* If we are here, we should have memory allocated */
+		if (WARN_ON(!alloc_dmap))
 			return -EBUSY;
 
 		/*
@@ -1850,7 +1903,18 @@ static const struct iomap_ops fuse_iomap_ops = {
 static ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
+	struct fuse_conn *fc = get_fuse_conn(inode);
 	ssize_t ret;
+	bool retry = false;
+
+retry:
+	if (retry && !(fc->nr_free_ranges > 0)) {
+		ret = -EINTR;
+		if (wait_event_killable_exclusive(fc->dax_range_waitq,
+						  (fc->nr_free_ranges > 0))) {
+			goto out;
+		}
+	}
 
 	if (iocb->ki_flags & IOCB_NOWAIT) {
 		if (!inode_trylock_shared(inode))
@@ -1862,8 +1926,19 @@ static ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	ret = dax_iomap_rw(iocb, to, &fuse_iomap_ops);
 	inode_unlock_shared(inode);
 
+	/* If a dax range could not be allocated and it can't be reclaimed
+	 * inline, then drop inode lock and retry. Range reclaim logic
+	 * requires exclusive access to inode lock.
+	 *
+	 * TODO: What if -ENOSPC needs to be returned to user space. Fix it.
+	 */
+	if (ret == -ENOSPC) {
+		retry = true;
+		goto retry;
+	}
 	/* TODO file_accessed(iocb->f_filp) */
 
+out:
 	return ret;
 }
 
@@ -2642,10 +2717,21 @@ static int __fuse_dax_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	struct super_block *sb = inode->i_sb;
 	pfn_t pfn;
+	int error = 0;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	bool retry = false;
 
 	if (write)
 		sb_start_pagefault(sb);
 
+retry:
+	if (retry && !(fc->nr_free_ranges > 0)) {
+		ret = VM_FAULT_SIGBUS; /* vm_fault_t cannot carry -EINTR */
+		if (wait_event_killable_exclusive(fc->dax_range_waitq,
+					(fc->nr_free_ranges > 0)))
+			goto out;
+	}
+
 	/*
 	 * We need to serialize against not only truncate but also against
 	 * fuse dax memory range reclaim. While a range is being reclaimed,
@@ -2653,13 +2739,20 @@ static int __fuse_dax_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 	 * to populate page cache or access memory we are trying to free.
 	 */
 	down_read(&get_fuse_inode(inode)->i_mmap_sem);
-	ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
+	ret = dax_iomap_fault(vmf, pe_size, &pfn, &error, &fuse_iomap_ops);
+	if ((ret & VM_FAULT_ERROR) && error == -ENOSPC) {
+		error = 0;
+		retry = true;
+		up_read(&get_fuse_inode(inode)->i_mmap_sem);
+		goto retry;
+	}
 
 	if (ret & VM_FAULT_NEEDDSYNC)
 		ret = dax_finish_sync_fault(vmf, pe_size, pfn);
 
 	up_read(&get_fuse_inode(inode)->i_mmap_sem);
 
+out:
 	if (write)
 		sb_end_pagefault(sb);
 
@@ -3762,3 +3855,249 @@ void fuse_init_file_inode(struct inode *inode)
 		inode->i_data.a_ops = &fuse_dax_file_aops;
 	}
 }
+
+int fuse_dax_reclaim_dmap_locked(struct fuse_conn *fc, struct inode *inode,
+				struct fuse_dax_mapping *dmap)
+{
+	int ret;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	ret = filemap_fdatawrite_range(inode->i_mapping, dmap->start,
+					dmap->end);
+	if (ret) {
+		printk("filemap_fdatawrite_range() failed. err=%d start=0x%llx,"
+			" end=0x%llx\n", ret, dmap->start, dmap->end);
+		return ret;
+	}
+
+	ret = invalidate_inode_pages2_range(inode->i_mapping,
+					dmap->start >> PAGE_SHIFT,
+					dmap->end >> PAGE_SHIFT);
+	/* TODO: What to do if above fails? For now,
+	 * leave the range in place.
+	 */
+	if (ret) {
+		printk("invalidate_inode_pages2_range() failed err=%d\n", ret);
+		return ret;
+	}
+
+	/* Remove dax mapping from inode interval tree now */
+	fuse_dax_interval_tree_remove(dmap, &fi->dmap_tree);
+	fi->nr_dmaps--;
+
+	ret = fuse_removemapping_one(inode, dmap);
+	if (ret) {
+		pr_warn("Failed to remove mapping. offset=0x%llx len=0x%llx\n",
+			dmap->window_offset, dmap->length);
+	}
+
+	return 0;
+}
+
+/* Find the first mapping in the tree and free it. */
+struct fuse_dax_mapping *fuse_dax_reclaim_first_mapping_locked(
+				struct fuse_conn *fc, struct inode *inode)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_dax_mapping *dmap;
+	int ret;
+
+	/* Find the first fuse dax mapping of this inode. */
+	dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, 0, -1);
+	if (!dmap)
+		return NULL;
+
+	ret = fuse_dax_reclaim_dmap_locked(fc, inode, dmap);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	/* Clean up dmap. Do not add back to free list */
+	dmap_remove_busy_list(fc, dmap);
+	dmap->inode = NULL;
+	dmap->start = dmap->end = 0;
+
+	pr_debug("fuse: reclaimed memory range window_offset=0x%llx,"
+				" length=0x%llx\n", dmap->window_offset,
+				dmap->length);
+	return dmap;
+}
+
+/*
+ * Find the first mapping in the tree, free it and return it. Do not add
+ * it back to free pool.
+ *
+ * This is called with inode lock held.
+ */
+struct fuse_dax_mapping *fuse_dax_reclaim_first_mapping(struct fuse_conn *fc,
+					struct inode *inode)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_dax_mapping *dmap;
+
+	down_write(&fi->i_mmap_sem);
+	down_write(&fi->i_dmap_sem);
+	dmap = fuse_dax_reclaim_first_mapping_locked(fc, inode);
+	up_write(&fi->i_dmap_sem);
+	up_write(&fi->i_mmap_sem);
+	return dmap;
+}
+
+static struct fuse_dax_mapping *alloc_dax_mapping_reclaim(struct fuse_conn *fc,
+					struct inode *inode)
+{
+	struct fuse_dax_mapping *dmap;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	while(1) {
+		dmap = alloc_dax_mapping(fc);
+		if (dmap)
+			return dmap;
+
+		if (fi->nr_dmaps)
+			return fuse_dax_reclaim_first_mapping(fc, inode);
+		/*
+		 * There are no mappings which can be reclaimed.
+		 * Wait for one.
+		 */
+		if (!(fc->nr_free_ranges > 0)) {
+			if (wait_event_killable_exclusive(fc->dax_range_waitq,
+					(fc->nr_free_ranges > 0)))
+				return ERR_PTR(-EINTR);
+		}
+	}
+}
+
+int fuse_dax_free_one_mapping_locked(struct fuse_conn *fc, struct inode *inode,
+				u64 dmap_start)
+{
+	int ret;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_dax_mapping *dmap;
+
+	WARN_ON(!inode_is_locked(inode));
+
+	/* Find the fuse dax mapping at file offset dmap_start. */
+	dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, dmap_start,
+							dmap_start);
+
+	/* Range already got cleaned up by somebody else */
+	if (!dmap)
+		return 0;
+
+	ret = fuse_dax_reclaim_dmap_locked(fc, inode, dmap);
+	if (ret < 0)
+		return ret;
+
+	/* Cleanup dmap entry and add back to free list */
+	spin_lock(&fc->lock);
+	__dmap_remove_busy_list(fc, dmap);
+	dmap->inode = NULL;
+	dmap->start = dmap->end = 0;
+	__free_dax_mapping(fc, dmap);
+	spin_unlock(&fc->lock);
+
+	pr_debug("fuse: freed memory range window_offset=0x%llx,"
+				" length=0x%llx\n", dmap->window_offset,
+				dmap->length);
+	return ret;
+}
+
+/*
+ * Free a range of memory.
+ * Locking.
+ * 1. Take inode->i_rwsem to prevent further read/write.
+ * 2. Take fuse_inode->i_mmap_sem to block dax faults.
+ * 3. Take fuse_inode->i_dmap_sem to protect interval tree. It might not
+ *    be strictly necessary as lock 1 and 2 seem sufficient.
+ */
+int fuse_dax_free_one_mapping(struct fuse_conn *fc, struct inode *inode,
+				u64 dmap_start)
+{
+	int ret;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	/*
+	 * If process is blocked waiting for memory while holding inode
+	 * lock, we will deadlock. So continue to free next range.
+	 */
+	if (!inode_trylock(inode))
+		return -EAGAIN;
+	down_write(&fi->i_mmap_sem);
+	down_write(&fi->i_dmap_sem);
+	ret = fuse_dax_free_one_mapping_locked(fc, inode, dmap_start);
+	up_write(&fi->i_dmap_sem);
+	up_write(&fi->i_mmap_sem);
+	inode_unlock(inode);
+	return ret;
+}
+
+int fuse_dax_free_memory(struct fuse_conn *fc, unsigned long nr_to_free)
+{
+	struct fuse_dax_mapping *dmap, *pos, *temp;
+	int ret, nr_freed = 0;
+	u64 dmap_start = 0, window_offset = 0;
+	struct inode *inode = NULL;
+
+	/* Pick the first busy range and free it for now */
+	while(1) {
+		if (nr_freed >= nr_to_free)
+			break;
+
+		dmap = NULL;
+		spin_lock(&fc->lock);
+
+		list_for_each_entry_safe(pos, temp, &fc->busy_ranges,
+						busy_list) {
+			inode = igrab(pos->inode);
+			/*
+			 * This inode is going away. That will free
+			 * up all the ranges anyway, continue to
+			 * next range.
+			 */
+			if (!inode)
+				continue;
+			/*
+			 * Take this element off the list and add it to the
+			 * tail. If the inode lock can't be obtained, this
+			 * will help with selecting a new element next time.
+			 */
+			dmap = pos;
+			list_move_tail(&dmap->busy_list, &fc->busy_ranges);
+			dmap_start = dmap->start;
+			window_offset = dmap->window_offset;
+			break;
+		}
+		spin_unlock(&fc->lock);
+		if (!dmap)
+			return 0;
+
+		ret = fuse_dax_free_one_mapping(fc, inode, dmap_start);
+		iput(inode);
+		if (ret && ret != -EAGAIN) {
+			printk("%s(window_offset=0x%llx) failed. err=%d\n",
+				__func__, window_offset, ret);
+			return ret;
+		}
+
+		/* Could not get inode lock. Try next element */
+		if (ret == -EAGAIN)
+			continue;
+		nr_freed++;
+	}
+	return 0;
+}
+
+/* TODO: This probably should go in inode.c */
+void fuse_dax_free_mem_worker(struct work_struct *work)
+{
+	int ret;
+	struct fuse_conn *fc = container_of(work, struct fuse_conn,
+						dax_free_work.work);
+	pr_debug("fuse: Worker to free memory called.\n");
+	pr_debug("fuse: Worker to free memory called. nr_free_ranges=%lu"
+		 " nr_busy_ranges=%lu\n", fc->nr_free_ranges,
+		 fc->nr_busy_ranges);
+	ret = fuse_dax_free_memory(fc, FUSE_DAX_RECLAIM_CHUNK);
+	if (ret)
+		pr_debug("fuse: fuse_dax_free_memory() failed with err=%d\n", ret);
+}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index c93e9155b723..b4a5728444bb 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -50,6 +50,16 @@
 #define FUSE_DAX_MEM_RANGE_SZ	(2*1024*1024)
 #define FUSE_DAX_MEM_RANGE_PAGES	(FUSE_DAX_MEM_RANGE_SZ/PAGE_SIZE)
 
+/* Number of ranges reclaimer will try to free in one invocation */
+#define FUSE_DAX_RECLAIM_CHUNK		(10)
+
+/*
+ * Dax memory reclaim threshold as a percentage of total ranges. When the
+ * number of free ranges drops below this threshold, reclaim is triggered.
+ * Default is 20%.
+ */
+#define FUSE_DAX_RECLAIM_THRESHOLD	(20)
+
 /** List of active connections */
 extern struct list_head fuse_conn_list;
 
@@ -103,6 +113,9 @@ struct fuse_forget_link {
 
 /** Translation information for file offsets to DAX window offsets */
 struct fuse_dax_mapping {
+	/* Pointer to inode where this memory range is mapped */
+	struct inode *inode;
+
 	/* Will connect in fc->free_ranges to keep track of free memory */
 	struct list_head list;
 
@@ -880,12 +893,20 @@ struct fuse_conn {
 	unsigned long nr_busy_ranges;
 	struct list_head busy_ranges;
 
+	/* Worker to free up memory ranges */
+	struct delayed_work dax_free_work;
+
+	/* Wait queue for a dax range to become free */
+	wait_queue_head_t dax_range_waitq;
+
 	/*
 	 * DAX Window Free Ranges. TODO: This might not be best place to store
 	 * this free list
 	 */
 	long nr_free_ranges;
 	struct list_head free_ranges;
+
+	unsigned long nr_ranges;
 };
 
 static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
@@ -1260,6 +1281,7 @@ unsigned fuse_len_args(unsigned numargs, struct fuse_arg *args);
  * Get the next unique ID for a request
  */
 u64 fuse_get_unique(struct fuse_iqueue *fiq);
+void fuse_dax_free_mem_worker(struct work_struct *work);
 void fuse_removemapping(struct inode *inode);
 
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index f57f7ce02acc..8af7f31c6e19 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -678,6 +678,7 @@ static int fuse_dax_mem_range_init(struct fuse_conn *fc,
 
 	list_replace_init(&mem_ranges, &fc->free_ranges);
 	fc->nr_free_ranges = nr_ranges;
+	fc->nr_ranges = nr_ranges;
 	return 0;
 out_err:
 	/* Free All allocated elements */
@@ -704,6 +705,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
 	atomic_set(&fc->dev_count, 1);
 	init_waitqueue_head(&fc->blocked_waitq);
 	init_waitqueue_head(&fc->reserved_req_waitq);
+	init_waitqueue_head(&fc->dax_range_waitq);
 	fuse_iqueue_init(&fc->iq, fiq_ops, fiq_priv);
 	INIT_LIST_HEAD(&fc->bg_queue);
 	INIT_LIST_HEAD(&fc->entry);
@@ -724,6 +726,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
 	fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
 	INIT_LIST_HEAD(&fc->free_ranges);
 	INIT_LIST_HEAD(&fc->busy_ranges);
+	INIT_DELAYED_WORK(&fc->dax_free_work, fuse_dax_free_mem_worker);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_init);
 
@@ -732,6 +735,7 @@ void fuse_conn_put(struct fuse_conn *fc)
 	if (refcount_dec_and_test(&fc->count)) {
 		if (fc->destroy_req)
 			fuse_request_free(fc->destroy_req);
+		flush_delayed_work(&fc->dax_free_work);
 		if (fc->dax_dev)
 			fuse_free_dax_mem_ranges(&fc->free_ranges);
 		put_pid_ns(fc->pid_ns);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 27/30] fuse: Release file in process context
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (25 preceding siblings ...)
  2019-05-15 19:27 ` [PATCH v2 26/30] fuse: Add logic to free up a memory range Vivek Goyal
@ 2019-05-15 19:27 ` Vivek Goyal
  2019-05-15 19:27 ` [PATCH v2 28/30] fuse: Reschedule dax free work if too many EAGAIN attempts Vivek Goyal
                   ` (2 subsequent siblings)
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:27 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

fuse_file_put() can be called with sync=true or sync=false. If sync=true,
it waits for the release request response and then calls iput() in the
caller's context. If sync=false, it does not wait for the release request
response; it frees the fuse_file struct immediately and the req->end
function does the iput().

iput() can be a problem with DAX if called in req->end context. If this
is the last reference to the inode (VFS has already let go of its
reference), then iput() will clean up DAX mappings as well, send
REMOVEMAPPING requests and wait for their completion. (All of this
happens in the worker thread context which is processing fuse replies
from the daemon on the host.)

That means the worker thread blocks, stops processing further replies,
and the system deadlocks.

So for now, force a sync release of the file for DAX inodes.
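
Roughly, the problematic async path looks like this (a sketch; the exact
call chain may differ slightly):

	worker thread processing fuse replies
	  -> req->end() for FUSE_RELEASE
	       -> iput(inode)                     /* last reference */
	            -> inode eviction
	                 -> fuse_removemapping()  /* sends REMOVEMAPPING */
	                      -> waits for a reply that only this same
	                         worker thread could process => deadlock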

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/file.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 87fc2b5e0a3a..b0293a308b5e 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -475,6 +475,7 @@ void fuse_release_common(struct file *file, bool isdir)
 	struct fuse_file *ff = file->private_data;
 	struct fuse_req *req = ff->reserved_req;
 	int opcode = isdir ? FUSE_RELEASEDIR : FUSE_RELEASE;
+	bool sync = false;
 
 	fuse_prepare_release(fi, ff, file->f_flags, opcode);
 
@@ -495,8 +496,20 @@ void fuse_release_common(struct file *file, bool isdir)
 	 * Make the release synchronous if this is a fuseblk mount,
 	 * synchronous RELEASE is allowed (and desirable) in this case
 	 * because the server can be trusted not to screw up.
+	 *
+	 * For DAX, fuse server is trusted. So it should be fine to
+	 * do a sync file put. Doing async file put is creating
+	 * problems right now because when request finish, iput()
+	 * can lead to freeing of inode. That means it tears down
+	 * mappings backing DAX memory and sends REMOVEMAPPING message
+	 * to server and blocks for completion. Currently, waiting
+	 * in req->end context deadlocks the system as same worker thread
+	 * can't process REMOVEMAPPING reply it is waiting for.
 	 */
-	fuse_file_put(ff, ff->fc->destroy_req != NULL, isdir);
+	if (IS_DAX(req->misc.release.inode) || ff->fc->destroy_req != NULL)
+		sync = true;
+
+	fuse_file_put(ff, sync, isdir);
 }
 
 static int fuse_open(struct inode *inode, struct file *file)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 28/30] fuse: Reschedule dax free work if too many EAGAIN attempts
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (26 preceding siblings ...)
  2019-05-15 19:27 ` [PATCH v2 27/30] fuse: Release file in process context Vivek Goyal
@ 2019-05-15 19:27 ` Vivek Goyal
  2019-05-15 19:27 ` [PATCH v2 29/30] fuse: Take inode lock for dax inode truncation Vivek Goyal
  2019-05-15 19:27 ` [PATCH v2 30/30] virtio-fs: Do not provide abort interface in fusectl Vivek Goyal
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:27 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

fuse_dax_free_memory() can be very cpu intensive in corner cases. For
example, if one inode has consumed all the memory and a setupmapping
request is pending, the inode lock is held by that request and the worker
thread will not get the lock for a while. And given there is only one
inode consuming all the dax ranges, all attempts to acquire the lock will
fail.

So if there are too many inode lock failures (-EAGAIN), reschedule the
worker with a 10ms delay.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/file.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index b0293a308b5e..9b82d9b4ebc3 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -4047,7 +4047,7 @@ int fuse_dax_free_one_mapping(struct fuse_conn *fc, struct inode *inode,
 int fuse_dax_free_memory(struct fuse_conn *fc, unsigned long nr_to_free)
 {
 	struct fuse_dax_mapping *dmap, *pos, *temp;
-	int ret, nr_freed = 0;
+	int ret, nr_freed = 0, nr_eagain = 0;
 	u64 dmap_start = 0, window_offset = 0;
 	struct inode *inode = NULL;
 
@@ -4056,6 +4056,12 @@ int fuse_dax_free_memory(struct fuse_conn *fc, unsigned long nr_to_free)
 		if (nr_freed >= nr_to_free)
 			break;
 
+		if (nr_eagain > 20) {
+			queue_delayed_work(system_long_wq, &fc->dax_free_work,
+						msecs_to_jiffies(10));
+			return 0;
+		}
+
 		dmap = NULL;
 		spin_lock(&fc->lock);
 
@@ -4093,8 +4099,10 @@ int fuse_dax_free_memory(struct fuse_conn *fc, unsigned long nr_to_free)
 		}
 
 		/* Could not get inode lock. Try next element */
-		if (ret == -EAGAIN)
+		if (ret == -EAGAIN) {
+			nr_eagain++;
 			continue;
+		}
 		nr_freed++;
 	}
 	return 0;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 29/30] fuse: Take inode lock for dax inode truncation
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (27 preceding siblings ...)
  2019-05-15 19:27 ` [PATCH v2 28/30] fuse: Reschedule dax free work if too many EAGAIN attempts Vivek Goyal
@ 2019-05-15 19:27 ` Vivek Goyal
  2019-05-15 19:27 ` [PATCH v2 30/30] virtio-fs: Do not provide abort interface in fusectl Vivek Goyal
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:27 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

When a file is opened with O_TRUNC, we need to make sure that any other
DAX operation is not in progress. DAX expects i_size to be stable.

In fuse_iomap_begin() we check for i_size at multiple places and we expect
i_size to not change.

Another problem is that if we set up a mapping in fuse_iomap_begin() and
the file gets truncated and a dax read/write happens, KVM currently hangs.
It tries to fault in a page which does not exist on the host (the file
got truncated). This probably requires fixing in KVM.

So for now, take the inode lock. Once KVM is fixed, we might have to
look at this again.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9b82d9b4ebc3..d0979dc32f08 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -420,7 +420,7 @@ int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
 	int err;
 	bool lock_inode = (file->f_flags & O_TRUNC) &&
 			  fc->atomic_o_trunc &&
-			  fc->writeback_cache;
+			  (fc->writeback_cache || IS_DAX(inode));
 
 	err = generic_file_open(inode, file);
 	if (err)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 30/30] virtio-fs: Do not provide abort interface in fusectl
  2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
                   ` (28 preceding siblings ...)
  2019-05-15 19:27 ` [PATCH v2 29/30] fuse: Take inode lock for dax inode truncation Vivek Goyal
@ 2019-05-15 19:27 ` Vivek Goyal
  29 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-15 19:27 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm, linux-nvdimm
  Cc: vgoyal, miklos, stefanha, dgilbert, swhiteho

virtio-fs does not support aborting requests which are being processed,
that is, requests which have already been sent to the fuse daemon on the
host.

So do not provide the "abort" interface for virtio-fs in fusectl.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/control.c   | 4 ++--
 fs/fuse/fuse_i.h    | 4 ++++
 fs/fuse/inode.c     | 1 +
 fs/fuse/virtio_fs.c | 1 +
 4 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/control.c b/fs/fuse/control.c
index fe80bea4ad89..c1423f2ebc5e 100644
--- a/fs/fuse/control.c
+++ b/fs/fuse/control.c
@@ -278,8 +278,8 @@ int fuse_ctl_add_conn(struct fuse_conn *fc)
 
 	if (!fuse_ctl_add_dentry(parent, fc, "waiting", S_IFREG | 0400, 1,
 				 NULL, &fuse_ctl_waiting_ops) ||
-	    !fuse_ctl_add_dentry(parent, fc, "abort", S_IFREG | 0200, 1,
-				 NULL, &fuse_ctl_abort_ops) ||
+	    (!fc->no_abort && !fuse_ctl_add_dentry(parent, fc, "abort",
+			S_IFREG | 0200, 1, NULL, &fuse_ctl_abort_ops)) ||
 	    !fuse_ctl_add_dentry(parent, fc, "max_background", S_IFREG | 0600,
 				 1, NULL, &fuse_conn_max_background_ops) ||
 	    !fuse_ctl_add_dentry(parent, fc, "congestion_threshold",
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index b4a5728444bb..7ac7f9a0b81b 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -86,6 +86,7 @@ struct fuse_mount_data {
 	unsigned allow_other:1;
 	unsigned dax:1;
 	unsigned destroy:1;
+	unsigned no_abort:1;
 	unsigned max_read;
 	unsigned blksize;
 
@@ -847,6 +848,9 @@ struct fuse_conn {
 	/** Does the filesystem support copy_file_range? */
 	unsigned no_copy_file_range:1;
 
+	/** Do not create abort file in fuse control fs */
+	unsigned no_abort:1;
+
 	/** The number of requests waiting for completion */
 	atomic_t num_waiting;
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 8af7f31c6e19..302f7e04b645 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1272,6 +1272,7 @@ int fuse_fill_super_common(struct super_block *sb,
 	fc->user_id = mount_data->user_id;
 	fc->group_id = mount_data->group_id;
 	fc->max_read = max_t(unsigned, 4096, mount_data->max_read);
+	fc->no_abort = mount_data->no_abort;
 
 	/* Used by get_root_inode() */
 	sb->s_fs_info = fc;
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 76c46edcc8ac..18fc0dca0abc 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1042,6 +1042,7 @@ static int virtio_fs_fill_super(struct super_block *sb, void *data,
 	d.fiq_priv = fs;
 	d.fudptr = (void **)&fs->vqs[VQ_REQUEST].fud;
 	d.destroy = true; /* Send destroy request on unmount */
+	d.no_abort = 1;
 	err = fuse_fill_super_common(sb, &d);
 	if (err < 0)
 		goto err_free_init_req;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 12/30] dax: remove block device dependencies
  2019-05-15 19:26 ` [PATCH v2 12/30] dax: remove block device dependencies Vivek Goyal
@ 2019-05-16  0:21   ` Dan Williams
  2019-05-16 10:07     ` Stefan Hajnoczi
  2019-05-16 14:23     ` Vivek Goyal
  0 siblings, 2 replies; 52+ messages in thread
From: Dan Williams @ 2019-05-16  0:21 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Linux Kernel Mailing List, KVM list, linux-nvdimm,
	Steven Whitehouse, Dr. David Alan Gilbert, Stefan Hajnoczi,
	Miklos Szeredi

On Wed, May 15, 2019 at 12:28 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> From: Stefan Hajnoczi <stefanha@redhat.com>
>
> Although struct dax_device itself is not tied to a block device, some
> DAX code assumes there is a block device.  Make block devices optional
> by allowing bdev to be NULL in commonly used DAX APIs.
>
> When there is no block device:
>  * Skip the partition offset calculation in bdev_dax_pgoff()
>  * Skip the blkdev_issue_zeroout() optimization
>
> Note that more block device assumptions remain but I haven't reached those
> code paths yet.
>

Is there a generic object that non-block-based filesystems reference
for physical storage as a bdev stand-in? I assume "sector_t" is still
the common type for addressing filesystem capacity?

It just seems to me that we should stop pretending that the
filesystem-dax facility requires block devices and try to move this
functionality to generically use a dax device across all interfaces.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 12/30] dax: remove block device dependencies
  2019-05-16  0:21   ` Dan Williams
@ 2019-05-16 10:07     ` Stefan Hajnoczi
  2019-05-16 14:23     ` Vivek Goyal
  1 sibling, 0 replies; 52+ messages in thread
From: Stefan Hajnoczi @ 2019-05-16 10:07 UTC (permalink / raw)
  To: Dan Williams
  Cc: Vivek Goyal, linux-fsdevel, Linux Kernel Mailing List, KVM list,
	linux-nvdimm, Steven Whitehouse, Dr. David Alan Gilbert,
	Miklos Szeredi

[-- Attachment #1: Type: text/plain, Size: 1452 bytes --]

On Wed, May 15, 2019 at 05:21:51PM -0700, Dan Williams wrote:
> On Wed, May 15, 2019 at 12:28 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > From: Stefan Hajnoczi <stefanha@redhat.com>
> >
> > Although struct dax_device itself is not tied to a block device, some
> > DAX code assumes there is a block device.  Make block devices optional
> > by allowing bdev to be NULL in commonly used DAX APIs.
> >
> > When there is no block device:
> >  * Skip the partition offset calculation in bdev_dax_pgoff()
> >  * Skip the blkdev_issue_zeroout() optimization
> >
> > Note that more block device assumptions remain but I haven't reached those
> > code paths yet.
> >
> 
> Is there a generic object that non-block-based filesystems reference
> for physical storage as a bdev stand-in? I assume "sector_t" is still
> the common type for addressing filesystem capacity?
> 
> It just seems to me that we should stop pretending that the
> filesystem-dax facility requires block devices and try to move this
> functionality to generically use a dax device across all interfaces.

virtio-fs uses a PCI BAR called the DAX Window to access data.  This
object is internal to the virtio_fs.ko driver, not really a generic
object that DAX code can reference.

But does the DAX code need to reference any object at all?  It seems
like block device users just want callbacks for the partition offset
calculation and blkdev_issue_zeroout().
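
Something along these lines, perhaps (a purely hypothetical sketch; the
struct and member names are invented here, not an existing API):

	struct dax_holder_ops {
		/* stand-in for the partition offset math in bdev_dax_pgoff() */
		pgoff_t (*to_dev_pgoff)(struct dax_device *dax_dev, pgoff_t pgoff);

		/* stand-in for the blkdev_issue_zeroout() optimization */
		int (*zero_range)(struct dax_device *dax_dev, pgoff_t pgoff,
				  size_t nr_pages);
	};

A block-device-backed filesystem would wire these up to the existing bdev
helpers, while virtio-fs could supply trivial implementations for its DAX
window.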

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 12/30] dax: remove block device dependencies
  2019-05-16  0:21   ` Dan Williams
  2019-05-16 10:07     ` Stefan Hajnoczi
@ 2019-05-16 14:23     ` Vivek Goyal
  1 sibling, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-16 14:23 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-fsdevel, Linux Kernel Mailing List, KVM list, linux-nvdimm,
	Steven Whitehouse, Dr. David Alan Gilbert, Stefan Hajnoczi,
	Miklos Szeredi

On Wed, May 15, 2019 at 05:21:51PM -0700, Dan Williams wrote:

[..]
> It just seems to me that we should stop pretending that the
> filesystem-dax facility requires block devices and try to move this
> functionality to generically use a dax device across all interfaces.

That sounds reasonable and will help with our use case where we don't
have the block device at all.

Vivek

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 26/30] fuse: Add logic to free up a memory range
       [not found]   ` <CAN+Pk99SNKSf+GjSQUUWt_eu1fSjTy_ByUOEQUXHi8zNqXY1zA@mail.gmail.com>
@ 2019-05-20 12:53     ` Vivek Goyal
  0 siblings, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-20 12:53 UTC (permalink / raw)
  To: Eric Ren
  Cc: linux-fsdevel, linux-kernel, kvm, linux-nvdimm, miklos, stefanha,
	dgilbert, swhiteho

On Sun, May 19, 2019 at 03:48:05PM +0800, Eric Ren wrote:
> Hi,
> 
> @@ -1784,8 +1822,23 @@ static int fuse_iomap_begin(struct inode *inode,
> > loff_t pos, loff_t length,
> >                 if (pos >= i_size_read(inode))
> >                         goto iomap_hole;
> >
> > -               alloc_dmap = alloc_dax_mapping(fc);
> > -               if (!alloc_dmap)
> > +               /* Can't do reclaim in fault path yet due to lock ordering.
> > +                * Read path takes shared inode lock and that's not
> > sufficient
> > +                * for inline range reclaim. Caller needs to drop lock,
> > wait
> > +                * and retry.
> > +                */
> > +               if (flags & IOMAP_FAULT || !(flags & IOMAP_WRITE)) {
> > +                       alloc_dmap = alloc_dax_mapping(fc);
> > +                       if (!alloc_dmap)
> > +                               return -ENOSPC;
> > +               } else {
> > +                       alloc_dmap = alloc_dax_mapping_reclaim(fc, inode);
> >
> 
> alloc_dmap could be NULL as follows:
> 
> alloc_dax_mapping_reclaim
>    -->fuse_dax_reclaim_first_mapping
>              -->fuse_dax_reclaim_first_mapping_locked
>                   --> fuse_dax_interval_tree_iter_first  ==> return NULL
> and
> 
> IS_ERR(NULL) is false, so we may miss that error case.

Hi Eric,

Good catch. I will fix it next version. 
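
(Probably something like the following untested sketch: fall through and
wait when the reclaim helper finds the tree already empty.)

	if (fi->nr_dmaps) {
		dmap = fuse_dax_reclaim_first_mapping(fc, inode);
		/* NULL means the tree was emptied under us; wait instead */
		if (dmap)
			return dmap;
	}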

Thanks
Vivek

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 02/30] fuse: Clear setuid bit even in cache=never path
  2019-05-15 19:26 ` [PATCH v2 02/30] fuse: Clear setuid bit even in cache=never path Vivek Goyal
@ 2019-05-20 14:41   ` Miklos Szeredi
  2019-05-20 14:44     ` Miklos Szeredi
  2019-05-21 15:01     ` Vivek Goyal
  0 siblings, 2 replies; 52+ messages in thread
From: Miklos Szeredi @ 2019-05-20 14:41 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, kvm, linux-nvdimm, stefanha,
	dgilbert, swhiteho

On Wed, May 15, 2019 at 03:26:47PM -0400, Vivek Goyal wrote:
> If fuse daemon is started with cache=never, fuse falls back to direct IO.
> In that write path we don't call file_remove_privs() and that means setuid
> bit is not cleared if unprivileged user writes to a file with setuid bit set.
> 
> pjdfstest chmod test 12.t tests this and fails.

I think a better solution is to tell the server if the suid bit needs to
be removed, so it can do so in a race-free way.

Here's the kernel patch, and I'll reply with the libfuse patch.

---
 fs/fuse2/file.c           |    2 ++
 include/uapi/linux/fuse.h |    3 +++
 2 files changed, 5 insertions(+)

--- a/fs/fuse2/file.c
+++ b/fs/fuse2/file.c
@@ -363,6 +363,8 @@ static ssize_t fuse_send_write(struct fu
 		inarg->flags |= O_DSYNC;
 	if (iocb->ki_flags & IOCB_SYNC)
 		inarg->flags |= O_SYNC;
+	if (!capable(CAP_FSETID))
+		inarg->write_flags |= FUSE_WRITE_KILL_PRIV;
 	req->inh.opcode = FUSE_WRITE;
 	req->inh.nodeid = ff->nodeid;
 	req->inh.len = req->inline_inlen + count;
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -125,6 +125,7 @@
  *
  *  7.29
  *  - add FUSE_NO_OPENDIR_SUPPORT flag
+ *  - add FUSE_WRITE_KILL_PRIV flag
  */
 
 #ifndef _LINUX_FUSE_H
@@ -318,9 +319,11 @@ struct fuse_file_lock {
  *
  * FUSE_WRITE_CACHE: delayed write from page cache, file handle is guessed
  * FUSE_WRITE_LOCKOWNER: lock_owner field is valid
+ * FUSE_WRITE_KILL_PRIV: kill suid and sgid bits
  */
 #define FUSE_WRITE_CACHE	(1 << 0)
 #define FUSE_WRITE_LOCKOWNER	(1 << 1)
+#define FUSE_WRITE_KILL_PRIV	(1 << 2)
 
 /**
  * Read flags



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 02/30] fuse: Clear setuid bit even in cache=never path
  2019-05-20 14:41   ` Miklos Szeredi
@ 2019-05-20 14:44     ` Miklos Szeredi
  2019-05-20 20:25       ` Nikolaus Rath
  2019-05-21 15:01     ` Vivek Goyal
  1 sibling, 1 reply; 52+ messages in thread
From: Miklos Szeredi @ 2019-05-20 14:44 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, kvm, linux-nvdimm, stefanha,
	dgilbert, swhiteho

[-- Attachment #1: Type: text/plain, Size: 672 bytes --]

On Mon, May 20, 2019 at 04:41:37PM +0200, Miklos Szeredi wrote:
> On Wed, May 15, 2019 at 03:26:47PM -0400, Vivek Goyal wrote:
> > If fuse daemon is started with cache=never, fuse falls back to direct IO.
> > In that write path we don't call file_remove_privs() and that means setuid
> > bit is not cleared if unprivileged user writes to a file with setuid bit set.
> > 
> > pjdfstest chmod test 12.t tests this and fails.
> 
> I think a better solution is to tell the server if the suid bit needs to
> be removed, so it can do so in a race-free way.
> 
> Here's the kernel patch, and I'll reply with the libfuse patch.

Here are the patches for libfuse and passthrough_ll.

[-- Attachment #2: libfuse-add-fuse_write_kill_priv.patch --]
[-- Type: text/plain, Size: 2439 bytes --]

---
 include/fuse_common.h |    5 ++++-
 include/fuse_kernel.h |    2 ++
 lib/fuse_lowlevel.c   |   12 ++++++++----
 3 files changed, 14 insertions(+), 5 deletions(-)

--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -64,8 +64,11 @@ struct fuse_file_info {
 	   May only be set in ->release(). */
 	unsigned int flock_release : 1;
 
+	/* Kill suid and sgid bits on write */
+	unsigned int write_kill_priv : 1;
+
 	/** Padding.  Do not use*/
-	unsigned int padding : 27;
+	unsigned int padding : 26;
 
 	/** File handle.  May be filled in by filesystem in open().
 	    Available in all other file operations */
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -304,9 +304,11 @@ struct fuse_file_lock {
  *
  * FUSE_WRITE_CACHE: delayed write from page cache, file handle is guessed
  * FUSE_WRITE_LOCKOWNER: lock_owner field is valid
+ * FUSE_WRITE_KILL_PRIV: kill suid and sgid bits
  */
 #define FUSE_WRITE_CACHE	(1 << 0)
 #define FUSE_WRITE_LOCKOWNER	(1 << 1)
+#define FUSE_WRITE_KILL_PRIV	(1 << 2)
 
 /**
  * Read flags
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -1315,12 +1315,14 @@ static void do_write(fuse_req_t req, fus
 
 	memset(&fi, 0, sizeof(fi));
 	fi.fh = arg->fh;
-	fi.writepage = (arg->write_flags & 1) != 0;
+	fi.writepage = (arg->write_flags & FUSE_WRITE_CACHE) != 0;
+	fi.write_kill_priv = (arg->write_flags & FUSE_WRITE_KILL_PRIV) != 0;
 
 	if (req->se->conn.proto_minor < 9) {
 		param = ((char *) arg) + FUSE_COMPAT_WRITE_IN_SIZE;
 	} else {
-		fi.lock_owner = arg->lock_owner;
+		if (arg->write_flags & FUSE_WRITE_LOCKOWNER)
+			fi.lock_owner = arg->lock_owner;
 		fi.flags = arg->flags;
 		param = PARAM(arg);
 	}
@@ -1345,7 +1347,8 @@ static void do_write_buf(fuse_req_t req,
 
 	memset(&fi, 0, sizeof(fi));
 	fi.fh = arg->fh;
-	fi.writepage = arg->write_flags & 1;
+	fi.writepage = (arg->write_flags & FUSE_WRITE_CACHE) != 0;
+	fi.write_kill_priv = (arg->write_flags & FUSE_WRITE_KILL_PRIV) != 0;
 
 	if (se->conn.proto_minor < 9) {
 		bufv.buf[0].mem = ((char *) arg) + FUSE_COMPAT_WRITE_IN_SIZE;
@@ -1353,7 +1356,8 @@ static void do_write_buf(fuse_req_t req,
 			FUSE_COMPAT_WRITE_IN_SIZE;
 		assert(!(bufv.buf[0].flags & FUSE_BUF_IS_FD));
 	} else {
-		fi.lock_owner = arg->lock_owner;
+		if (arg->write_flags & FUSE_WRITE_LOCKOWNER)
+			fi.lock_owner = arg->lock_owner;
 		fi.flags = arg->flags;
 		if (!(bufv.buf[0].flags & FUSE_BUF_IS_FD))
 			bufv.buf[0].mem = PARAM(arg);

[-- Attachment #3: passthrough_ll-kill-suid.patch --]
[-- Type: text/plain, Size: 1824 bytes --]

---
 example/passthrough_ll.c |   29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

--- a/example/passthrough_ll.c
+++ b/example/passthrough_ll.c
@@ -56,6 +56,7 @@
 #include <sys/file.h>
 #include <sys/xattr.h>
 #include <sys/syscall.h>
+#include <sys/capability.h>
 
 /* We are re-using pointers to our `struct lo_inode` and `struct
    lo_dirp` elements as inodes. This means that we must be able to
@@ -965,6 +966,11 @@ static void lo_write_buf(fuse_req_t req,
 	(void) ino;
 	ssize_t res;
 	struct fuse_bufvec out_buf = FUSE_BUFVEC_INIT(fuse_buf_size(in_buf));
+	struct __user_cap_header_struct cap_hdr = {
+		.version = _LINUX_CAPABILITY_VERSION_1,
+	};
+	struct __user_cap_data_struct cap_orig;
+	struct __user_cap_data_struct cap_new;
 
 	out_buf.buf[0].flags = FUSE_BUF_IS_FD | FUSE_BUF_FD_SEEK;
 	out_buf.buf[0].fd = fi->fh;
@@ -974,7 +980,28 @@ static void lo_write_buf(fuse_req_t req,
 		fprintf(stderr, "lo_write(ino=%" PRIu64 ", size=%zd, off=%lu)\n",
 			ino, out_buf.buf[0].size, (unsigned long) off);
 
+	if (fi->write_kill_priv) {
+		res = capget(&cap_hdr, &cap_orig);
+		if (res == -1) {
+			fuse_reply_err(req, errno);
+			return;
+		}
+		cap_new = cap_orig;
+		cap_new.effective &= ~(1 << CAP_FSETID);
+		res = capset(&cap_hdr, &cap_new);
+		if (res == -1) {
+			fuse_reply_err(req, errno);
+			return;
+		}
+	}
+
 	res = fuse_buf_copy(&out_buf, in_buf, 0);
+
+	if (fi->write_kill_priv) {
+		if (capset(&cap_hdr, &cap_orig) != 0)
+			abort();
+	}
+
 	if(res < 0)
 		fuse_reply_err(req, -res);
 	else
@@ -1215,7 +1242,7 @@ static void lo_copy_file_range(fuse_req_
 	res = copy_file_range(fi_in->fh, &off_in, fi_out->fh, &off_out, len,
 			      flags);
 	if (res < 0)
-		fuse_reply_err(req, -errno);
+		fuse_reply_err(req, errno);
 	else
 		fuse_reply_write(req, res);
 }

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 02/30] fuse: Clear setuid bit even in cache=never path
  2019-05-20 14:44     ` Miklos Szeredi
@ 2019-05-20 20:25       ` Nikolaus Rath
  0 siblings, 0 replies; 52+ messages in thread
From: Nikolaus Rath @ 2019-05-20 20:25 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Vivek Goyal, linux-fsdevel, linux-kernel, kvm, linux-nvdimm,
	stefanha, dgilbert, swhiteho

On May 20 2019, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Mon, May 20, 2019 at 04:41:37PM +0200, Miklos Szeredi wrote:
>> On Wed, May 15, 2019 at 03:26:47PM -0400, Vivek Goyal wrote:
>> > If fuse daemon is started with cache=never, fuse falls back to direct IO.
>> > In that write path we don't call file_remove_privs() and that means setuid
>> > bit is not cleared if unprivileged user writes to a file with setuid bit set.
>> > 
>> > pjdfstest chmod test 12.t tests this and fails.
>> 
>> I think a better solution is to tell the server if the suid bit needs to
>> be removed, so it can do so in a race-free way.
>> 
>> Here's the kernel patch, and I'll reply with the libfuse patch.
>
> Here are the patches for libfuse and passthrough_ll.

Could you also submit them as pull requests at https://github.com/libfuse/libfuse/pulls?

Best,
-Nikolaus

-- 
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

             »Time flies like an arrow, fruit flies like a Banana.«

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 02/30] fuse: Clear setuid bit even in cache=never path
  2019-05-20 14:41   ` Miklos Szeredi
  2019-05-20 14:44     ` Miklos Szeredi
@ 2019-05-21 15:01     ` Vivek Goyal
  1 sibling, 0 replies; 52+ messages in thread
From: Vivek Goyal @ 2019-05-21 15:01 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-fsdevel, linux-kernel, kvm, linux-nvdimm, stefanha,
	dgilbert, swhiteho

On Mon, May 20, 2019 at 04:41:37PM +0200, Miklos Szeredi wrote:
> On Wed, May 15, 2019 at 03:26:47PM -0400, Vivek Goyal wrote:
> > If fuse daemon is started with cache=never, fuse falls back to direct IO.
> > In that write path we don't call file_remove_privs() and that means setuid
> > bit is not cleared if unprivileged user writes to a file with setuid bit set.
> > 
> > pjdfstest chmod test 12.t tests this and fails.
> 
> I think a better solution is to tell the server if the suid bit needs to
> be removed, so it can do so in a race-free way.
> 
> Here's the kernel patch, and I'll reply with the libfuse patch.

Hi Miklos,

I tested and it works for me.

Vivek

> 
> ---
>  fs/fuse2/file.c           |    2 ++
>  include/uapi/linux/fuse.h |    3 +++
>  2 files changed, 5 insertions(+)
> 
> --- a/fs/fuse2/file.c
> +++ b/fs/fuse2/file.c
> @@ -363,6 +363,8 @@ static ssize_t fuse_send_write(struct fu
>  		inarg->flags |= O_DSYNC;
>  	if (iocb->ki_flags & IOCB_SYNC)
>  		inarg->flags |= O_SYNC;
> +	if (!capable(CAP_FSETID))
> +		inarg->write_flags |= FUSE_WRITE_KILL_PRIV;
>  	req->inh.opcode = FUSE_WRITE;
>  	req->inh.nodeid = ff->nodeid;
>  	req->inh.len = req->inline_inlen + count;
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -125,6 +125,7 @@
>   *
>   *  7.29
>   *  - add FUSE_NO_OPENDIR_SUPPORT flag
> + *  - add FUSE_WRITE_KILL_PRIV flag
>   */
>  
>  #ifndef _LINUX_FUSE_H
> @@ -318,9 +319,11 @@ struct fuse_file_lock {
>   *
>   * FUSE_WRITE_CACHE: delayed write from page cache, file handle is guessed
>   * FUSE_WRITE_LOCKOWNER: lock_owner field is valid
> + * FUSE_WRITE_KILL_PRIV: kill suid and sgid bits
>   */
>  #define FUSE_WRITE_CACHE	(1 << 0)
>  #define FUSE_WRITE_LOCKOWNER	(1 << 1)
> +#define FUSE_WRITE_KILL_PRIV	(1 << 2)
>  
>  /**
>   * Read flags
> 
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 18/30] virtio_fs, dax: Set up virtio_fs dax_device
  2019-05-15 19:27 ` [PATCH v2 18/30] virtio_fs, dax: Set up virtio_fs dax_device Vivek Goyal
@ 2019-07-17 17:27   ` Halil Pasic
  2019-07-18  9:04     ` Cornelia Huck
  2019-07-18 13:15     ` Vivek Goyal
  0 siblings, 2 replies; 52+ messages in thread
From: Halil Pasic @ 2019-07-17 17:27 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, kvm, linux-nvdimm, miklos, stefanha,
	dgilbert, swhiteho, Sebastian Ott, Cornelia Huck,
	Christian Borntraeger, Collin Walling

On Wed, 15 May 2019 15:27:03 -0400
Vivek Goyal <vgoyal@redhat.com> wrote:

> From: Stefan Hajnoczi <stefanha@redhat.com>
> 
> Setup a dax device.
> 
> Use the shm capability to find the cache entry and map it.
> 
> The DAX window is accessed by the fs/dax.c infrastructure and must have
> struct pages (at least on x86).  Use devm_memremap_pages() to map the
> DAX window PCI BAR and allocate struct page.
>

Sorry for being this late. I don't see any more recent version so I will
comment here.

I'm trying to figure out how this is supposed to work on s390. My concern
is, that on s390 PCI memory needs to be accessed by special
instructions. This is taken care of by the stuff defined in
arch/s390/include/asm/io.h. E.g. we 'override' __raw_writew so it uses
the appropriate s390 instruction. However if the code does not use the
linux abstractions for accessing PCI memory, but assumes it can be
accessed like RAM, we have a problem.

Looking at this patch, it seems to me that we might end up with exactly
the case described. For example, AFAICT copy_to_iter() (3) resolves to
the function in lib/iov_iter.c, which does not seem to cater for s390
oddities.
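
For illustration (simplified, not actual code from either file): the
generic iov_iter code ends up doing plain CPU copies per segment, while
PCI memory on s390 would need the special copy helpers from
arch/s390/include/asm/pci_io.h, i.e. roughly the difference between:

	/* what copy_to_iter() effectively does today */
	memcpy(iter_buf, window_kaddr + off, n);

	/* what s390 PCI memory would require instead */
	zpci_memcpy_fromio(iter_buf, window_kaddr + off, n);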

I didn't have the time to investigate this properly, and since virtio-fs
is virtual, we may be able to get around what is otherwise a limitation
on s390. My understanding of these areas is admittedly shallow, and since
I'm not sure I'll have much more time to invest in the near future, I
decided to raise the concern.

Any opinions?

[CCing some s390 people who are probably more knowledgeable than me on
these matters.]

Regards,
Halil


> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> ---

[..]
  
> +/* Map a window offset to a page frame number.  The window offset will have
> + * been produced by .iomap_begin(), which maps a file offset to a window
> + * offset.
> + */
> +static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
> +				    long nr_pages, void **kaddr, pfn_t *pfn)
> +{
> +	struct virtio_fs *fs = dax_get_private(dax_dev);
> +	phys_addr_t offset = PFN_PHYS(pgoff);
> +	size_t max_nr_pages = fs->window_len/PAGE_SIZE - pgoff;
> +
> +	if (kaddr)
> +		*kaddr = fs->window_kaddr + offset;

(2) Here we use fs->window_kaddr, basically directing the access to the
virtio shared memory region.

> +	if (pfn)
> +		*pfn = phys_to_pfn_t(fs->window_phys_addr + offset,
> +					PFN_DEV | PFN_MAP);
> +	return nr_pages > max_nr_pages ? max_nr_pages : nr_pages;
> +}
> +
> +static size_t virtio_fs_copy_from_iter(struct dax_device *dax_dev,
> +				       pgoff_t pgoff, void *addr,
> +				       size_t bytes, struct iov_iter *i)
> +{
> +	return copy_from_iter(addr, bytes, i);
> +}
> +
> +static size_t virtio_fs_copy_to_iter(struct dax_device *dax_dev,
> +				       pgoff_t pgoff, void *addr,
> +				       size_t bytes, struct iov_iter *i)
> +{
> +	return copy_to_iter(addr, bytes, i);

(3) And this should be the access to it, which does not seem to use the special s390 accessors.

> +}
> +
> +static const struct dax_operations virtio_fs_dax_ops = {
> +	.direct_access = virtio_fs_direct_access,
> +	.copy_from_iter = virtio_fs_copy_from_iter,
> +	.copy_to_iter = virtio_fs_copy_to_iter,
> +};
> +
> +static void virtio_fs_percpu_release(struct percpu_ref *ref)
> +{
> +	struct virtio_fs_memremap_info *mi =
> +		container_of(ref, struct virtio_fs_memremap_info, ref);
> +
> +	complete(&mi->completion);
> +}
> +
> +static void virtio_fs_percpu_exit(void *data)
> +{
> +	struct virtio_fs_memremap_info *mi = data;
> +
> +	wait_for_completion(&mi->completion);
> +	percpu_ref_exit(&mi->ref);
> +}
> +
> +static void virtio_fs_percpu_kill(struct percpu_ref *ref)
> +{
> +	percpu_ref_kill(ref);
> +}
> +
> +static void virtio_fs_cleanup_dax(void *data)
> +{
> +	struct virtio_fs *fs = data;
> +
> +	kill_dax(fs->dax_dev);
> +	put_dax(fs->dax_dev);
> +}
> +
> +static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> +{
> +	struct virtio_shm_region cache_reg;
> +	struct virtio_fs_memremap_info *mi;
> +	struct dev_pagemap *pgmap;
> +	bool have_cache;
> +	int ret;
> +
> +	if (!IS_ENABLED(CONFIG_DAX_DRIVER))
> +		return 0;
> +
> +	/* Get cache region */
> +	have_cache = virtio_get_shm_region(vdev,
> +					   &cache_reg,
> +					   (u8)VIRTIO_FS_SHMCAP_ID_CACHE);
> +	if (!have_cache) {
> +		dev_err(&vdev->dev, "%s: No cache capability\n", __func__);
> +		return -ENXIO;
> +	} else {
> +		dev_notice(&vdev->dev, "Cache len: 0x%llx @ 0x%llx\n",
> +			   cache_reg.len, cache_reg.addr);
> +	}
> +
> +	mi = devm_kzalloc(&vdev->dev, sizeof(*mi), GFP_KERNEL);
> +	if (!mi)
> +		return -ENOMEM;
> +
> +	init_completion(&mi->completion);
> +	ret = percpu_ref_init(&mi->ref, virtio_fs_percpu_release, 0,
> +			      GFP_KERNEL);
> +	if (ret < 0) {
> +		dev_err(&vdev->dev, "%s: percpu_ref_init failed (%d)\n",
> +			__func__, ret);
> +		return ret;
> +	}
> +
> +	ret = devm_add_action(&vdev->dev, virtio_fs_percpu_exit, mi);
> +	if (ret < 0) {
> +		percpu_ref_exit(&mi->ref);
> +		return ret;
> +	}
> +
> +	pgmap = &mi->pgmap;
> +	pgmap->altmap_valid = false;
> +	pgmap->ref = &mi->ref;
> +	pgmap->kill = virtio_fs_percpu_kill;
> +	pgmap->type = MEMORY_DEVICE_FS_DAX;
> +
> +	/* Ideally we would directly use the PCI BAR resource but
> +	 * devm_memremap_pages() wants its own copy in pgmap.  So
> +	 * initialize a struct resource from scratch (only the start
> +	 * and end fields will be used).
> +	 */
> +	pgmap->res = (struct resource){
> +		.name = "virtio-fs dax window",
> +		.start = (phys_addr_t) cache_reg.addr,
> +		.end = (phys_addr_t) cache_reg.addr + cache_reg.len - 1,
> +	};
> +
> +	fs->window_kaddr = devm_memremap_pages(&vdev->dev, pgmap);

(1) Here we assign fs->window_kaddr basically from the virtio shm region.

> +	if (IS_ERR(fs->window_kaddr))
> +		return PTR_ERR(fs->window_kaddr);
> +
> +	fs->window_phys_addr = (phys_addr_t) cache_reg.addr;
> +	fs->window_len = (phys_addr_t) cache_reg.len;
> +
> +	dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx"
> +		" len 0x%llx\n", __func__, fs->window_kaddr, cache_reg.addr,
> +		cache_reg.len);
> +
> +	fs->dax_dev = alloc_dax(fs, NULL, &virtio_fs_dax_ops);
> +	if (!fs->dax_dev)
> +		return -ENOMEM;
> +
> +	return devm_add_action_or_reset(&vdev->dev, virtio_fs_cleanup_dax, fs);
> +}
> +

[..]


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 18/30] virtio_fs, dax: Set up virtio_fs dax_device
  2019-07-17 17:27   ` Halil Pasic
@ 2019-07-18  9:04     ` Cornelia Huck
  2019-07-18 11:20       ` Halil Pasic
  2019-07-18 13:15     ` Vivek Goyal
  1 sibling, 1 reply; 52+ messages in thread
From: Cornelia Huck @ 2019-07-18  9:04 UTC (permalink / raw)
  To: Halil Pasic
  Cc: Vivek Goyal, linux-fsdevel, linux-kernel, kvm, linux-nvdimm,
	miklos, stefanha, dgilbert, swhiteho, Sebastian Ott,
	Christian Borntraeger, Collin Walling, David Hildenbrand

On Wed, 17 Jul 2019 19:27:25 +0200
Halil Pasic <pasic@linux.ibm.com> wrote:

> On Wed, 15 May 2019 15:27:03 -0400
> Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > From: Stefan Hajnoczi <stefanha@redhat.com>
> > 
> > Setup a dax device.
> > 
> > Use the shm capability to find the cache entry and map it.
> > 
> > The DAX window is accessed by the fs/dax.c infrastructure and must have
> > struct pages (at least on x86).  Use devm_memremap_pages() to map the
> > DAX window PCI BAR and allocate struct page.
> >  
> 
> Sorry for being this late. I don't see any more recent version so I will
> comment here.

[Yeah, this one has been sitting in my to-review queue far too long as
well :(]

> 
> I'm trying to figure out how this is supposed to work on s390. My concern
> is, that on s390 PCI memory needs to be accessed by special
> instructions. This is taken care of by the stuff defined in
> arch/s390/include/asm/io.h. E.g. we 'override' __raw_writew so it uses
> the appropriate s390 instruction. However if the code does not use the
> linux abstractions for accessing PCI memory, but assumes it can be
> accessed like RAM, we have a problem.
> 
> Looking at this patch, it seems to me that we might end up with exactly
> the case described. For example, AFAICT copy_to_iter() (3) resolves to
> the function in lib/iov_iter.c, which does not seem to cater for s390
> oddities.

What about the new pci instructions recently introduced? Not sure how
they differ from the old ones (which are currently the only ones
supported in QEMU...), but I'm pretty sure they are supposed to solve
an issue :)

> 
> I didn't have the time to investigate this properly, and since virtio-fs
> is virtual, we may be able to get around what is otherwise a
> limitation on s390. My understanding of these areas is admittedly
> shallow, and since I'm not sure I'll have much more time to
> invest in the near future I decided to raise concern.
> 
> Any opinions?

Let me point to the thread starting at
https://marc.info/?l=linux-s390&m=155048406205221&w=2 as well. That
memory region stuff is still unsolved for ccw, and I'm not sure if we
need to do something for zpci as well.

Does s390 work with DAX at all? ISTR that DAX evolved from XIP, so I
thought it did?

> 
> [CCing some s390 people who are probably more knowledgeable than me on
> these matters.]
> 
> Regards,
> Halil
> 
> 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
> > Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> > ---  
> 
> [..]
>   
> > +/* Map a window offset to a page frame number.  The window offset will have
> > + * been produced by .iomap_begin(), which maps a file offset to a window
> > + * offset.
> > + */
> > +static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
> > +				    long nr_pages, void **kaddr, pfn_t *pfn)
> > +{
> > +	struct virtio_fs *fs = dax_get_private(dax_dev);
> > +	phys_addr_t offset = PFN_PHYS(pgoff);
> > +	size_t max_nr_pages = fs->window_len/PAGE_SIZE - pgoff;
> > +
> > +	if (kaddr)
> > +		*kaddr = fs->window_kaddr + offset;  
> 
> (2) Here we use fs->window_kaddr, basically directing the access to the
> virtio shared memory region.
> 
> > +	if (pfn)
> > +		*pfn = phys_to_pfn_t(fs->window_phys_addr + offset,
> > +					PFN_DEV | PFN_MAP);
> > +	return nr_pages > max_nr_pages ? max_nr_pages : nr_pages;
> > +}
> > +
> > +static size_t virtio_fs_copy_from_iter(struct dax_device *dax_dev,
> > +				       pgoff_t pgoff, void *addr,
> > +				       size_t bytes, struct iov_iter *i)
> > +{
> > +	return copy_from_iter(addr, bytes, i);
> > +}
> > +
> > +static size_t virtio_fs_copy_to_iter(struct dax_device *dax_dev,
> > +				       pgoff_t pgoff, void *addr,
> > +				       size_t bytes, struct iov_iter *i)
> > +{
> > +	return copy_to_iter(addr, bytes, i);  
> 
> (3) And this should be the access to it, which does not seem to use the special s390 accessors.
> 
> > +}
> > +
> > +static const struct dax_operations virtio_fs_dax_ops = {
> > +	.direct_access = virtio_fs_direct_access,
> > +	.copy_from_iter = virtio_fs_copy_from_iter,
> > +	.copy_to_iter = virtio_fs_copy_to_iter,
> > +};
> > +
> > +static void virtio_fs_percpu_release(struct percpu_ref *ref)
> > +{
> > +	struct virtio_fs_memremap_info *mi =
> > +		container_of(ref, struct virtio_fs_memremap_info, ref);
> > +
> > +	complete(&mi->completion);
> > +}
> > +
> > +static void virtio_fs_percpu_exit(void *data)
> > +{
> > +	struct virtio_fs_memremap_info *mi = data;
> > +
> > +	wait_for_completion(&mi->completion);
> > +	percpu_ref_exit(&mi->ref);
> > +}
> > +
> > +static void virtio_fs_percpu_kill(struct percpu_ref *ref)
> > +{
> > +	percpu_ref_kill(ref);
> > +}
> > +
> > +static void virtio_fs_cleanup_dax(void *data)
> > +{
> > +	struct virtio_fs *fs = data;
> > +
> > +	kill_dax(fs->dax_dev);
> > +	put_dax(fs->dax_dev);
> > +}
> > +
> > +static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> > +{
> > +	struct virtio_shm_region cache_reg;
> > +	struct virtio_fs_memremap_info *mi;
> > +	struct dev_pagemap *pgmap;
> > +	bool have_cache;
> > +	int ret;
> > +
> > +	if (!IS_ENABLED(CONFIG_DAX_DRIVER))
> > +		return 0;
> > +
> > +	/* Get cache region */
> > +	have_cache = virtio_get_shm_region(vdev,
> > +					   &cache_reg,
> > +					   (u8)VIRTIO_FS_SHMCAP_ID_CACHE);
> > +	if (!have_cache) {
> > +		dev_err(&vdev->dev, "%s: No cache capability\n", __func__);
> > +		return -ENXIO;
> > +	} else {
> > +		dev_notice(&vdev->dev, "Cache len: 0x%llx @ 0x%llx\n",
> > +			   cache_reg.len, cache_reg.addr);
> > +	}
> > +
> > +	mi = devm_kzalloc(&vdev->dev, sizeof(*mi), GFP_KERNEL);
> > +	if (!mi)
> > +		return -ENOMEM;
> > +
> > +	init_completion(&mi->completion);
> > +	ret = percpu_ref_init(&mi->ref, virtio_fs_percpu_release, 0,
> > +			      GFP_KERNEL);
> > +	if (ret < 0) {
> > +		dev_err(&vdev->dev, "%s: percpu_ref_init failed (%d)\n",
> > +			__func__, ret);
> > +		return ret;
> > +	}
> > +
> > +	ret = devm_add_action(&vdev->dev, virtio_fs_percpu_exit, mi);
> > +	if (ret < 0) {
> > +		percpu_ref_exit(&mi->ref);
> > +		return ret;
> > +	}
> > +
> > +	pgmap = &mi->pgmap;
> > +	pgmap->altmap_valid = false;
> > +	pgmap->ref = &mi->ref;
> > +	pgmap->kill = virtio_fs_percpu_kill;
> > +	pgmap->type = MEMORY_DEVICE_FS_DAX;
> > +
> > +	/* Ideally we would directly use the PCI BAR resource but
> > +	 * devm_memremap_pages() wants its own copy in pgmap.  So
> > +	 * initialize a struct resource from scratch (only the start
> > +	 * and end fields will be used).
> > +	 */
> > +	pgmap->res = (struct resource){
> > +		.name = "virtio-fs dax window",
> > +		.start = (phys_addr_t) cache_reg.addr,
> > +		.end = (phys_addr_t) cache_reg.addr + cache_reg.len - 1,
> > +	};
> > +
> > +	fs->window_kaddr = devm_memremap_pages(&vdev->dev, pgmap);  
> 
> (1) Here we assign fs->window_kaddr basically from the virtio shm region.
> 
> > +	if (IS_ERR(fs->window_kaddr))
> > +		return PTR_ERR(fs->window_kaddr);
> > +
> > +	fs->window_phys_addr = (phys_addr_t) cache_reg.addr;
> > +	fs->window_len = (phys_addr_t) cache_reg.len;
> > +
> > +	dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx"
> > +		" len 0x%llx\n", __func__, fs->window_kaddr, cache_reg.addr,
> > +		cache_reg.len);
> > +
> > +	fs->dax_dev = alloc_dax(fs, NULL, &virtio_fs_dax_ops);
> > +	if (!fs->dax_dev)
> > +		return -ENOMEM;
> > +
> > +	return devm_add_action_or_reset(&vdev->dev, virtio_fs_cleanup_dax, fs);
> > +}
> > +  
> 
> [..]
> 
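To make Halil's point concrete: if the cache window had to be treated as
I/O memory rather than RAM, the copy ops quoted above could not stay thin
wrappers around copy_from_iter()/copy_to_iter(). A minimal sketch of what
an io-aware read-side op might look like instead (purely illustrative:
the function name is made up, error handling and the __iomem annotations
are glossed over):

static size_t virtio_fs_copy_to_iter_io(struct dax_device *dax_dev,
					pgoff_t pgoff, void *addr,
					size_t bytes, struct iov_iter *i)
{
	/* Bounce through a RAM buffer so the window is only read via
	 * memcpy_fromio(), which architectures like s390 can back with
	 * their special instructions; plain copy_to_iter() would
	 * memcpy() straight out of the window. */
	void *buf = kmalloc(bytes, GFP_KERNEL);
	size_t done = 0;

	if (buf) {
		memcpy_fromio(buf, (const void __iomem *)addr, bytes);
		done = copy_to_iter(buf, bytes, i);
		kfree(buf);
	}
	return done;
}

The write side would mirror this with memcpy_toio(). Whether any of this
is needed is exactly the open question: if the window ends up being guest
RAM backed by a KVM memory slot, the plain copies are fine as they are.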


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 18/30] virtio_fs, dax: Set up virtio_fs dax_device
  2019-07-18  9:04     ` Cornelia Huck
@ 2019-07-18 11:20       ` Halil Pasic
  2019-07-18 14:47         ` Cornelia Huck
  0 siblings, 1 reply; 52+ messages in thread
From: Halil Pasic @ 2019-07-18 11:20 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Vivek Goyal, linux-fsdevel, linux-kernel, kvm, linux-nvdimm,
	miklos, stefanha, dgilbert, swhiteho, Sebastian Ott,
	Christian Borntraeger, Collin Walling, David Hildenbrand

On Thu, 18 Jul 2019 11:04:17 +0200
Cornelia Huck <cohuck@redhat.com> wrote:

> On Wed, 17 Jul 2019 19:27:25 +0200
> Halil Pasic <pasic@linux.ibm.com> wrote:
> 
> > On Wed, 15 May 2019 15:27:03 -0400
> > Vivek Goyal <vgoyal@redhat.com> wrote:
> > 
> > > From: Stefan Hajnoczi <stefanha@redhat.com>
> > > 
> > > Setup a dax device.
> > > 
> > > Use the shm capability to find the cache entry and map it.
> > > 
> > > The DAX window is accessed by the fs/dax.c infrastructure and must have
> > > struct pages (at least on x86).  Use devm_memremap_pages() to map the
> > > DAX window PCI BAR and allocate struct page.
> > >  
> > 
> > Sorry for being this late. I don't see any more recent version so I will
> > comment here.
> 
> [Yeah, this one has been sitting in my to-review queue far too long as
> well :(]
> 
> > 
> > I'm trying to figure out how this is supposed to work on s390. My concern
> > is, that on s390 PCI memory needs to be accessed by special
> > instructions. This is taken care of by the stuff defined in
> > arch/s390/include/asm/io.h. E.g. we 'override' __raw_writew so it uses
> > the appropriate s390 instruction. However if the code does not use the
> > linux abstractions for accessing PCI memory, but assumes it can be
> > accessed like RAM, we have a problem.
> > 
> > Looking at this patch, it seems to me, that we might end up with exactly
> > the case described. For example AFAICT copy_to_iter() (3) resolves to
> > the function in lib/iov_iter.c which does not seem to cater for s390
> > oddities.
> 
> What about the new pci instructions recently introduced? Not sure how
> they differ from the old ones (which are currently the only ones
> supported in QEMU...), but I'm pretty sure they are supposed to solve
> an issue :)
> 

I'm struggling to find the connection between this topic and the new pci
instructions. Can you please explain in more detail?

> > 
> > I didn't have the time to investigate this properly, and since virtio-fs
> > is virtual, we may be able to get around what is otherwise a
> > limitation on s390. My understanding of these areas is admittedly
> > shallow, and since I'm not sure I'll have much more time to
> > invest in the near future I decided to raise concern.
> > 
> > Any opinions?
> 
> Let me point to the thread starting at
> https://marc.info/?l=linux-s390&m=155048406205221&w=2 as well. That
> memory region stuff is still unsolved for ccw, and I'm not sure if we
> need to do something for zpci as well.
> 

Right, virtio-ccw is another problem, but at least there we don't have the
need to limit ourselves to a very specific set of instructions (for
accessing memory).

zPCI, i.e. virtio-pci on z, should require much less dedicated love, if any
at all. Unfortunately I'm not very knowledgeable on either PCI in general
or its s390 variant.

> Does s390 work with DAX at all? ISTR that DAX evolved from XIP, so I
> thought it did?
> 

Documentation/filesystems/dax.txt even mentions "dcssblk: s390 dcss block
device driver" as a source of inspiration. So I suppose it does work.

Regards,
Halil

> > 
> > [CCing some s390 people who are probably more knowledgeable than me
> > on these matters.]
> > 
> > Regards,
> > Halil
> > 
> > 
> > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > > Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
> > > Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> > > ---  
> > 
> > [..]
> >   
> > > +/* Map a window offset to a page frame number.  The window offset will have
> > > + * been produced by .iomap_begin(), which maps a file offset to a window
> > > + * offset.
> > > + */
> > > +static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
> > > +				    long nr_pages, void **kaddr, pfn_t *pfn)
> > > +{
> > > +	struct virtio_fs *fs = dax_get_private(dax_dev);
> > > +	phys_addr_t offset = PFN_PHYS(pgoff);
> > > +	size_t max_nr_pages = fs->window_len/PAGE_SIZE - pgoff;
> > > +
> > > +	if (kaddr)
> > > +		*kaddr = fs->window_kaddr + offset;
> > 
> > (2) Here we use fs->window_kaddr, basically directing the access to
> > the virtio shared memory region.
> > 
> > > +	if (pfn)
> > > +		*pfn = phys_to_pfn_t(fs->window_phys_addr + offset,
> > > +					PFN_DEV | PFN_MAP);
> > > +	return nr_pages > max_nr_pages ? max_nr_pages : nr_pages;
> > > +}
> > > +
> > > +static size_t virtio_fs_copy_from_iter(struct dax_device *dax_dev,
> > > +				       pgoff_t pgoff, void *addr,
> > > +				       size_t bytes, struct iov_iter *i)
> > > +{
> > > +	return copy_from_iter(addr, bytes, i);
> > > +}
> > > +
> > > +static size_t virtio_fs_copy_to_iter(struct dax_device *dax_dev,
> > > +				       pgoff_t pgoff, void *addr,
> > > +				       size_t bytes, struct iov_iter *i)
> > > +{
> > > +	return copy_to_iter(addr, bytes, i);
> > 
> > (3) And this should be the access to it, which does not seem to use
> > the linux abstractions for accessing PCI memory.
> > 
> > > +}
> > > +
> > > +static const struct dax_operations virtio_fs_dax_ops = {
> > > +	.direct_access = virtio_fs_direct_access,
> > > +	.copy_from_iter = virtio_fs_copy_from_iter,
> > > +	.copy_to_iter = virtio_fs_copy_to_iter,
> > > +};
> > > +
> > > +static void virtio_fs_percpu_release(struct percpu_ref *ref)
> > > +{
> > > +	struct virtio_fs_memremap_info *mi =
> > > +		container_of(ref, struct virtio_fs_memremap_info, ref);
> > > +
> > > +	complete(&mi->completion);
> > > +}
> > > +
> > > +static void virtio_fs_percpu_exit(void *data)
> > > +{
> > > +	struct virtio_fs_memremap_info *mi = data;
> > > +
> > > +	wait_for_completion(&mi->completion);
> > > +	percpu_ref_exit(&mi->ref);
> > > +}
> > > +
> > > +static void virtio_fs_percpu_kill(struct percpu_ref *ref)
> > > +{
> > > +	percpu_ref_kill(ref);
> > > +}
> > > +
> > > +static void virtio_fs_cleanup_dax(void *data)
> > > +{
> > > +	struct virtio_fs *fs = data;
> > > +
> > > +	kill_dax(fs->dax_dev);
> > > +	put_dax(fs->dax_dev);
> > > +}
> > > +
> > > +static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> > > +{
> > > +	struct virtio_shm_region cache_reg;
> > > +	struct virtio_fs_memremap_info *mi;
> > > +	struct dev_pagemap *pgmap;
> > > +	bool have_cache;
> > > +	int ret;
> > > +
> > > +	if (!IS_ENABLED(CONFIG_DAX_DRIVER))
> > > +		return 0;
> > > +
> > > +	/* Get cache region */
> > > +	have_cache = virtio_get_shm_region(vdev,
> > > +					   &cache_reg,
> > > +					   (u8)VIRTIO_FS_SHMCAP_ID_CACHE);
> > > +	if (!have_cache) {
> > > +		dev_err(&vdev->dev, "%s: No cache capability\n", __func__);
> > > +		return -ENXIO;
> > > +	} else {
> > > +		dev_notice(&vdev->dev, "Cache len: 0x%llx @ 0x%llx\n",
> > > +			   cache_reg.len, cache_reg.addr);
> > > +	}
> > > +
> > > +	mi = devm_kzalloc(&vdev->dev, sizeof(*mi), GFP_KERNEL);
> > > +	if (!mi)
> > > +		return -ENOMEM;
> > > +
> > > +	init_completion(&mi->completion);
> > > +	ret = percpu_ref_init(&mi->ref, virtio_fs_percpu_release, 0,
> > > +			      GFP_KERNEL);
> > > +	if (ret < 0) {
> > > +		dev_err(&vdev->dev, "%s: percpu_ref_init failed (%d)\n",
> > > +			__func__, ret);
> > > +		return ret;
> > > +	}
> > > +
> > > +	ret = devm_add_action(&vdev->dev, virtio_fs_percpu_exit, mi);
> > > +	if (ret < 0) {
> > > +		percpu_ref_exit(&mi->ref);
> > > +		return ret;
> > > +	}
> > > +
> > > +	pgmap = &mi->pgmap;
> > > +	pgmap->altmap_valid = false;
> > > +	pgmap->ref = &mi->ref;
> > > +	pgmap->kill = virtio_fs_percpu_kill;
> > > +	pgmap->type = MEMORY_DEVICE_FS_DAX;
> > > +
> > > +	/* Ideally we would directly use the PCI BAR resource but
> > > +	 * devm_memremap_pages() wants its own copy in pgmap.  So
> > > +	 * initialize a struct resource from scratch (only the start
> > > +	 * and end fields will be used).
> > > +	 */
> > > +	pgmap->res = (struct resource){
> > > +		.name = "virtio-fs dax window",
> > > +		.start = (phys_addr_t) cache_reg.addr,
> > > +		.end = (phys_addr_t) cache_reg.addr + cache_reg.len - 1,
> > > +	};
> > > +
> > > +	fs->window_kaddr = devm_memremap_pages(&vdev->dev, pgmap);
> > 
> > (1) Here we assign fs->window_kaddr basically from the virtio shm
> > region.
> > 
> > > +	if (IS_ERR(fs->window_kaddr))
> > > +		return PTR_ERR(fs->window_kaddr);
> > > +
> > > +	fs->window_phys_addr = (phys_addr_t) cache_reg.addr;
> > > +	fs->window_len = (phys_addr_t) cache_reg.len;
> > > +
> > > +	dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx"
> > > +		" len 0x%llx\n", __func__, fs->window_kaddr, cache_reg.addr,
> > > +		cache_reg.len);
> > > +
> > > +	fs->dax_dev = alloc_dax(fs, NULL, &virtio_fs_dax_ops);
> > > +	if (!fs->dax_dev)
> > > +		return -ENOMEM;
> > > +
> > > +	return devm_add_action_or_reset(&vdev->dev, virtio_fs_cleanup_dax, fs);
> > > +}
> > > +
> > 
> > [..]
> > 
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 18/30] virtio_fs, dax: Set up virtio_fs dax_device
  2019-07-17 17:27   ` Halil Pasic
  2019-07-18  9:04     ` Cornelia Huck
@ 2019-07-18 13:15     ` Vivek Goyal
  2019-07-18 14:30       ` Dan Williams
  1 sibling, 1 reply; 52+ messages in thread
From: Vivek Goyal @ 2019-07-18 13:15 UTC (permalink / raw)
  To: Halil Pasic
  Cc: linux-fsdevel, linux-kernel, kvm, linux-nvdimm, miklos, stefanha,
	dgilbert, swhiteho, Sebastian Ott, Cornelia Huck,
	Christian Borntraeger, Collin Walling

On Wed, Jul 17, 2019 at 07:27:25PM +0200, Halil Pasic wrote:
> On Wed, 15 May 2019 15:27:03 -0400
> Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > From: Stefan Hajnoczi <stefanha@redhat.com>
> > 
> > Setup a dax device.
> > 
> > Use the shm capability to find the cache entry and map it.
> > 
> > The DAX window is accessed by the fs/dax.c infrastructure and must have
> > struct pages (at least on x86).  Use devm_memremap_pages() to map the
> > DAX window PCI BAR and allocate struct page.
> >
> 
> Sorry for being this late. I don't see any more recent version so I will
> comment here.
> 
> I'm trying to figure out how this is supposed to work on s390. My concern
> is, that on s390 PCI memory needs to be accessed by special
> instructions. This is taken care of by the stuff defined in
> arch/s390/include/asm/io.h. E.g. we 'override' __raw_writew so it uses
> the appropriate s390 instruction. However if the code does not use the
> linux abstractions for accessing PCI memory, but assumes it can be
> accessed like RAM, we have a problem.
> 
> Looking at this patch, it seems to me, that we might end up with exactly
> the case described. For example AFAICT copy_to_iter() (3) resolves to
> the function in lib/iov_iter.c which does not seem to cater for s390
> oddities.
> 
> I didn't have the time to investigate this properly, and since virtio-fs
> is virtual, we may be able to get around what is otherwise a
> limitation on s390. My understanding of these areas is admittedly
> shallow, and since I'm not sure I'll have much more time to
> invest in the near future I decided to raise concern.
> 
> Any opinions?

Hi Halil,

I don't understand s390 or how PCI works there either. Is there any
other transport we can use there to map IO memory directly and access
it using DAX?

BTW, is DAX supported on s390?

I am also hoping somebody who knows better can chip in. Till that time,
we could still use virtio-fs on s390 without DAX.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 18/30] virtio_fs, dax: Set up virtio_fs dax_device
  2019-07-18 13:15     ` Vivek Goyal
@ 2019-07-18 14:30       ` Dan Williams
  2019-07-22 10:51         ` Christian Borntraeger
  0 siblings, 1 reply; 52+ messages in thread
From: Dan Williams @ 2019-07-18 14:30 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Halil Pasic, Collin Walling, Cornelia Huck, Sebastian Ott,
	KVM list, Miklos Szeredi, linux-nvdimm,
	Linux Kernel Mailing List, Dr. David Alan Gilbert,
	Christian Borntraeger, Stefan Hajnoczi, linux-fsdevel,
	Steven Whitehouse

On Thu, Jul 18, 2019 at 6:15 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Wed, Jul 17, 2019 at 07:27:25PM +0200, Halil Pasic wrote:
> > On Wed, 15 May 2019 15:27:03 -0400
> > Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > > From: Stefan Hajnoczi <stefanha@redhat.com>
> > >
> > > Setup a dax device.
> > >
> > > Use the shm capability to find the cache entry and map it.
> > >
> > > The DAX window is accessed by the fs/dax.c infrastructure and must have
> > > struct pages (at least on x86).  Use devm_memremap_pages() to map the
> > > DAX window PCI BAR and allocate struct page.
> > >
> >
> > Sorry for being this late. I don't see any more recent version so I will
> > comment here.
> >
> > I'm trying to figure out how this is supposed to work on s390. My concern
> > is, that on s390 PCI memory needs to be accessed by special
> > instructions. This is taken care of by the stuff defined in
> > arch/s390/include/asm/io.h. E.g. we 'override' __raw_writew so it uses
> > the appropriate s390 instruction. However if the code does not use the
> > linux abstractions for accessing PCI memory, but assumes it can be
> > accessed like RAM, we have a problem.
> >
> > Looking at this patch, it seems to me, that we might end up with exactly
> > the case described. For example AFAICT copy_to_iter() (3) resolves to
> > the function in lib/iov_iter.c which does not seem to cater for s390
> > oddities.
> >
> > I didn't have the time to investigate this properly, and since virtio-fs
> > is virtual, we may be able to get around what is otherwise a
> > limitation on s390. My understanding of these areas is admittedly
> > shallow, and since I'm not sure I'll have much more time to
> > invest in the near future I decided to raise concern.
> >
> > Any opinions?
>
> Hi Halil,
>
> I don't understand s390 and how PCI works there as well. Is there any
> other transport we can use there to map IO memory directly and access
> using DAX?
>
> BTW, is DAX supported for s390.
>
> I am also hoping somebody who knows better can chip in. Till that time,
> we could still use virtio-fs on s390 without DAX.

s390 has so-called "limited" dax support, see CONFIG_FS_DAX_LIMITED.
In practice that means that support for PTE_DEVMAP is missing which
means no get_user_pages() support for dax mappings. Effectively it's
only useful for execute-in-place as operations like fork() and ptrace
of dax mappings will fail.
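
To illustrate (a rough sketch, not the actual mm/gup.c code): gup has to
turn a PTE into a pinnable struct page, and for dax that only works when
the PTE carries the devmap bit, i.e. when the driver created struct pages
via devm_memremap_pages():

/* Sketch of the gup decision for a dax PTE, heavily simplified */
static struct page *dax_pte_to_page(pte_t pte)
{
	if (pte_devmap(pte))	/* ZONE_DEVICE: struct pages exist */
		return pte_page(pte);
	if (pte_special(pte))	/* FS_DAX_LIMITED: no struct pages */
		return NULL;	/* gup fails, so fork()/ptrace break */
	return pte_page(pte);
}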

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 18/30] virtio_fs, dax: Set up virtio_fs dax_device
  2019-07-18 11:20       ` Halil Pasic
@ 2019-07-18 14:47         ` Cornelia Huck
  0 siblings, 0 replies; 52+ messages in thread
From: Cornelia Huck @ 2019-07-18 14:47 UTC (permalink / raw)
  To: Halil Pasic
  Cc: Vivek Goyal, linux-fsdevel, linux-kernel, kvm, linux-nvdimm,
	miklos, stefanha, dgilbert, swhiteho, Sebastian Ott,
	Christian Borntraeger, Collin Walling, David Hildenbrand

On Thu, 18 Jul 2019 13:20:49 +0200
Halil Pasic <pasic@linux.ibm.com> wrote:

> On Thu, 18 Jul 2019 11:04:17 +0200
> Cornelia Huck <cohuck@redhat.com> wrote:
> 
> > On Wed, 17 Jul 2019 19:27:25 +0200
> > Halil Pasic <pasic@linux.ibm.com> wrote:

> > > I'm trying to figure out how this is supposed to work on s390. My concern
> > > is, that on s390 PCI memory needs to be accessed by special
> > > instructions. This is taken care of by the stuff defined in
> > > arch/s390/include/asm/io.h. E.g. we 'override' __raw_writew so it uses
> > > the appropriate s390 instruction. However if the code does not use the
> > > linux abstractions for accessing PCI memory, but assumes it can be
> > > accessed like RAM, we have a problem.
> > > 
> > > Looking at this patch, it seems to me, that we might end up with exactly
> > > the case described. For example AFAICT copy_to_iter() (3) resolves to
> > > the function in lib/iov_iter.c which does not seem to cater for s390
> > > oddities.  
> > 
> > What about the new pci instructions recently introduced? Not sure how
> > they differ from the old ones (which are currently the only ones
> > supported in QEMU...), but I'm pretty sure they are supposed to solve
> > an issue :)
> >   
> 
> I'm struggling to find the connection between this topic and the new pci
> instructions. Can you please explain in more detail?

The problem is that I'm lacking the details myself... if the new approach is
handling some things substantially differently (e.g. you set up
something and then do read/writes instead of going through
instructions), things will probably work out differently.

> 
> > > 
> > > I didn't have the time to investigate this properly, and since virtio-fs
> > > is virtual, we may be able to get around what is otherwise a
> > > limitation on s390. My understanding of these areas is admittedly
> > > shallow, and since I'm not sure I'll have much more time to
> > > invest in the near future I decided to raise concern.
> > > 
> > > Any opinions?  
> > 
> > Let me point to the thread starting at
> > https://marc.info/?l=linux-s390&m=155048406205221&w=2 as well. That
> > memory region stuff is still unsolved for ccw, and I'm not sure if we
> > need to do something for zpci as well.
> >   
> 
> Right, virtio-ccw is another problem, but at least there we don't have the
> need to limit ourselves to a very specific set of instructions (for
> accessing memory).
> 
> zPCI, i.e. virtio-pci on z, should require much less dedicated love, if any

s/virtio-pci/pci/

> at all. Unfortunately I'm not very knowledgeable on either PCI in general
> or its s390 variant.

Right, the biggest issue with zpci and shared regions is the
interaction with ccw using shared regions as well.

Unfortunately, I can't judge any zpci details from here, either :(

If virtio-fs is working in its non-dax version, we'll at least have
something on s390. (Has anyone tried that, btw?) It seems that s390 is
only supporting a limited subset of dax anyway.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 18/30] virtio_fs, dax: Set up virtio_fs dax_device
  2019-07-18 14:30       ` Dan Williams
@ 2019-07-22 10:51         ` Christian Borntraeger
  2019-07-22 10:56           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 52+ messages in thread
From: Christian Borntraeger @ 2019-07-22 10:51 UTC (permalink / raw)
  To: Dan Williams, Vivek Goyal
  Cc: Halil Pasic, Collin Walling, Cornelia Huck, Sebastian Ott,
	KVM list, Miklos Szeredi, linux-nvdimm,
	Linux Kernel Mailing List, Dr. David Alan Gilbert,
	Stefan Hajnoczi, linux-fsdevel, Steven Whitehouse,
	Heiko Carstens, David Hildenbrand



On 18.07.19 16:30, Dan Williams wrote:
> On Thu, Jul 18, 2019 at 6:15 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>>
>> On Wed, Jul 17, 2019 at 07:27:25PM +0200, Halil Pasic wrote:
>>> On Wed, 15 May 2019 15:27:03 -0400
>>> Vivek Goyal <vgoyal@redhat.com> wrote:
>>>
>>>> From: Stefan Hajnoczi <stefanha@redhat.com>
>>>>
>>>> Setup a dax device.
>>>>
>>>> Use the shm capability to find the cache entry and map it.
>>>>
>>>> The DAX window is accessed by the fs/dax.c infrastructure and must have
>>>> struct pages (at least on x86).  Use devm_memremap_pages() to map the
>>>> DAX window PCI BAR and allocate struct page.
>>>>
>>>
>>> Sorry for being this late. I don't see any more recent version so I will
>>> comment here.
>>>
>>> I'm trying to figure out how this is supposed to work on s390. My concern
>>> is, that on s390 PCI memory needs to be accessed by special
>>> instructions. This is taken care of by the stuff defined in
>>> arch/s390/include/asm/io.h. E.g. we 'override' __raw_writew so it uses
>>> the appropriate s390 instruction. However if the code does not use the
>>> linux abstractions for accessing PCI memory, but assumes it can be
>>> accessed like RAM, we have a problem.
>>>
>>> Looking at this patch, it seems to me, that we might end up with exactly
>>> the case described. For example AFAICT copy_to_iter() (3) resolves to
>>> the function in lib/iov_iter.c which does not seem to cater for s390
>>> oddities.
>>>
>>> I didn't have the time to investigate this properly, and since virtio-fs
>>> is virtual, we may be able to get around what is otherwise a
>>> limitation on s390. My understanding of these areas is admittedly
>>> shallow, and since I'm not sure I'll have much more time to
>>> invest in the near future I decided to raise concern.
>>>
>>> Any opinions?
>>
>> Hi Halil,
>>
>> I don't understand s390 or how PCI works there either. Is there any
>> other transport we can use there to map IO memory directly and access
>> it using DAX?
>>
>> BTW, is DAX supported on s390?
>>
>> I am also hoping somebody who knows better can chip in. Till that time,
>> we could still use virtio-fs on s390 without DAX.
> 
> s390 has so-called "limited" dax support, see CONFIG_FS_DAX_LIMITED.
> In practice that means that support for PTE_DEVMAP is missing which
> means no get_user_pages() support for dax mappings. Effectively it's
> only useful for execute-in-place as operations like fork() and ptrace
> of dax mappings will fail.


This is only true for the dcssblk device driver (drivers/s390/block/dcssblk.c
and arch/s390/mm/extmem.c). 

For what it's worth, the dcssblk looks to Linux like normal memory (just above
the previously detected memory) that can be used like normal memory. In
previous times we even had struct pages for this memory - this was removed
long ago (when it was still xip) to reduce the memory footprint for large
dcss blocks and small memory guests.
Can CONFIG_FS_DAX_LIMITED go away if we have struct pages for that memory?

Now some observations:
- dcssblk is z/VM only (not KVM)
- Setting CONFIG_FS_DAX_LIMITED globally as a Kconfig option depending on
  whether a device driver is compiled in or not seems not flexible enough in
  case you have one device driver that does have struct pages and another one
  that doesn't
- I do not see a reason why we should not be able to map anything from QEMU
  into the guest real memory via an additional KVM memory slot.
  We would need to handle that in the guest somehow (and not as a PCI bar),
  register this with struct pages etc.
- we must then look at how we can create the link between the guest memory
  and the virtio-fs driver. For virtio-ccw we might be able to add a new ccw
  command or whatever. Maybe we could also piggy-back on some memory hotplug
  work from David Hildenbrand (add cc).

Regarding limitations on the platform:
- while we do have PCI, the virtio devices are usually plugged via the ccw bus.
  That implies no PCI bars. I assume you use those PCI bars only to implicitly
  have the location of the shared memory (see the sketch below).
  Correct?
- no real memory-mapped I/O. Instead there are instructions that work on the
  mmio. As I understand things, this is of no concern regarding virtio-fs as
  you do not need mmio in the sense that a memory access of the guest to such
  an address triggers an exit. You just need the shared memory as a means to
  have the data inside the guest. Any notification is done via normal
  virtqueue mechanisms.
  Correct?
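
If the answer to both is yes, then from the guest driver's point of view
all a transport has to deliver is a base and a length. Condensed from the
quoted patch (sketch only, error handling dropped):

	struct virtio_shm_region cache_reg;

	/* The driver only consumes (addr, len); how the transport
	 * discovers the region (PCI shm capability today, maybe a CCW
	 * for us) stays hidden behind virtio_get_shm_region(). */
	if (!virtio_get_shm_region(vdev, &cache_reg,
				   (u8)VIRTIO_FS_SHMCAP_ID_CACHE))
		return -ENXIO;
	/* cache_reg.addr / cache_reg.len are all that
	 * devm_memremap_pages() needs to see */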


Adding Heiko, maybe he remembers some details of the dcssblk/xip history.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 18/30] virtio_fs, dax: Set up virtio_fs dax_device
  2019-07-22 10:51         ` Christian Borntraeger
@ 2019-07-22 10:56           ` Dr. David Alan Gilbert
  2019-07-22 11:20             ` Christian Borntraeger
  0 siblings, 1 reply; 52+ messages in thread
From: Dr. David Alan Gilbert @ 2019-07-22 10:56 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Dan Williams, Vivek Goyal, Halil Pasic, Collin Walling,
	Cornelia Huck, Sebastian Ott, KVM list, Miklos Szeredi,
	linux-nvdimm, Linux Kernel Mailing List, Stefan Hajnoczi,
	linux-fsdevel, Steven Whitehouse, Heiko Carstens,
	David Hildenbrand

* Christian Borntraeger (borntraeger@de.ibm.com) wrote:
> 
> 
> On 18.07.19 16:30, Dan Williams wrote:
> > On Thu, Jul 18, 2019 at 6:15 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> >>
> >> On Wed, Jul 17, 2019 at 07:27:25PM +0200, Halil Pasic wrote:
> >>> On Wed, 15 May 2019 15:27:03 -0400
> >>> Vivek Goyal <vgoyal@redhat.com> wrote:
> >>>
> >>>> From: Stefan Hajnoczi <stefanha@redhat.com>
> >>>>
> >>>> Setup a dax device.
> >>>>
> >>>> Use the shm capability to find the cache entry and map it.
> >>>>
> >>>> The DAX window is accessed by the fs/dax.c infrastructure and must have
> >>>> struct pages (at least on x86).  Use devm_memremap_pages() to map the
> >>>> DAX window PCI BAR and allocate struct page.
> >>>>
> >>>
> >>> Sorry for being this late. I don't see any more recent version so I will
> >>> comment here.
> >>>
> >>> I'm trying to figure out how this is supposed to work on s390. My concern
> >>> is, that on s390 PCI memory needs to be accessed by special
> >>> instructions. This is taken care of by the stuff defined in
> >>> arch/s390/include/asm/io.h. E.g. we 'override' __raw_writew so it uses
> >>> the appropriate s390 instruction. However if the code does not use the
> >>> linux abstractions for accessing PCI memory, but assumes it can be
> >>> accessed like RAM, we have a problem.
> >>>
> >>> Looking at this patch, it seems to me, that we might end up with exactly
> >>> the case described. For example AFAICT copy_to_iter() (3) resolves to
> >>> the function in lib/iov_iter.c which does not seem to cater for s390
> >>> oddities.
> >>>
> >>> I didn't have the time to investigate this properly, and since virtio-fs
> >>> is virtual, we may be able to get around what is otherwise a
> >>> limitation on s390. My understanding of these areas is admittedly
> >>> shallow, and since I'm not sure I'll have much more time to
> >>> invest in the near future I decided to raise concern.
> >>>
> >>> Any opinions?
> >>
> >> Hi Halil,
> >>
> >> I don't understand s390 or how PCI works there either. Is there any
> >> other transport we can use there to map IO memory directly and access
> >> it using DAX?
> >>
> >> BTW, is DAX supported on s390?
> >>
> >> I am also hoping somebody who knows better can chip in. Till that time,
> >> we could still use virtio-fs on s390 without DAX.
> > 
> > s390 has so-called "limited" dax support, see CONFIG_FS_DAX_LIMITED.
> > In practice that means that support for PTE_DEVMAP is missing which
> > means no get_user_pages() support for dax mappings. Effectively it's
> > only useful for execute-in-place as operations like fork() and ptrace
> > of dax mappings will fail.
> 
> 
> This is only true for the dcssblk device driver (drivers/s390/block/dcssblk.c
> and arch/s390/mm/extmem.c). 
> 
> For what it's worth, the dcssblk looks to Linux like normal memory (just above the
> previously detected memory) that can be used like normal memory. In previous times
> we even had struct pages for this memory - this was removed long ago (when it was
> still xip) to reduce the memory footprint for large dcss blocks and small memory
> guests.
> Can the CONFIG_FS_DAX_LIMITED go away if we have struct pages for that memory?
> 
> Now some observations: 
> - dcssblk is z/VM only (not KVM)
> - Setting CONFIG_FS_DAX_LIMITED globally as a Kconfig option depending on whether
>   a device driver is compiled in or not seems not flexible enough in case you
>   have one device driver that does have struct pages and another one that doesn't
> - I do not see a reason why we should not be able to map anything from QEMU
>   into the guest real memory via an additional KVM memory slot. 
>   We would need to handle that in the guest somehow (and not as a PCI bar),
>   register this with struct pages etc.
> - we must then look at how we can create the link between the guest memory and the
>   virtio-fs driver. For virtio-ccw we might be able to add a new ccw command or
>   whatever. Maybe we could also piggy-back on some memory hotplug work from David
>   Hildenbrand (add cc).
> 
> Regarding limitations on the platform:
> - while we do have PCI, the virtio devices are usually plugged via the ccw bus.
>   That implies no PCI bars. I assume you use those PCI bars only to implicitly
>   have the location of the shared memory
>   Correct?

Right.

> - no real memory mapped I/O. Instead there are instructions that work on the mmio.
>   As I understand things, this is of no concern regarding virtio-fs as you do not
>   need mmio in the sense that a memory access of the guest to such an address 
>   triggers an exit. You just need the shared memory as a means to have the data
>   inside the guest. Any notification is done via normal virtqueue mechanisms
>   Correct?

Yep.

> 
> Adding Heiko, maybe he remembers some details of the dcssblk/xip history.
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 18/30] virtio_fs, dax: Set up virtio_fs dax_device
  2019-07-22 10:56           ` Dr. David Alan Gilbert
@ 2019-07-22 11:20             ` Christian Borntraeger
  2019-07-22 11:43               ` Cornelia Huck
  0 siblings, 1 reply; 52+ messages in thread
From: Christian Borntraeger @ 2019-07-22 11:20 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Dan Williams, Vivek Goyal, Halil Pasic, Collin Walling,
	Cornelia Huck, Sebastian Ott, KVM list, Miklos Szeredi,
	linux-nvdimm, Linux Kernel Mailing List, Stefan Hajnoczi,
	linux-fsdevel, Steven Whitehouse, Heiko Carstens,
	David Hildenbrand



On 22.07.19 12:56, Dr. David Alan Gilbert wrote:
> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
>>
>>
>> On 18.07.19 16:30, Dan Williams wrote:
>>> On Thu, Jul 18, 2019 at 6:15 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>>>>
>>>> On Wed, Jul 17, 2019 at 07:27:25PM +0200, Halil Pasic wrote:
>>>>> On Wed, 15 May 2019 15:27:03 -0400
>>>>> Vivek Goyal <vgoyal@redhat.com> wrote:
>>>>>
>>>>>> From: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>
>>>>>> Setup a dax device.
>>>>>>
>>>>>> Use the shm capability to find the cache entry and map it.
>>>>>>
>>>>>> The DAX window is accessed by the fs/dax.c infrastructure and must have
>>>>>> struct pages (at least on x86).  Use devm_memremap_pages() to map the
>>>>>> DAX window PCI BAR and allocate struct page.
>>>>>>
>>>>>
>>>>> Sorry for being this late. I don't see any more recent version so I will
>>>>> comment here.
>>>>>
>>>>> I'm trying to figure out how this is supposed to work on s390. My concern
>>>>> is, that on s390 PCI memory needs to be accessed by special
>>>>> instructions. This is taken care of by the stuff defined in
>>>>> arch/s390/include/asm/io.h. E.g. we 'override' __raw_writew so it uses
>>>>> the appropriate s390 instruction. However if the code does not use the
>>>>> linux abstractions for accessing PCI memory, but assumes it can be
>>>>> accessed like RAM, we have a problem.
>>>>>
>>>>> Looking at this patch, it seems to me, that we might end up with exactly
>>>>> the case described. For example AFAICT copy_to_iter() (3) resolves to
>>>>> the function in lib/iov_iter.c which does not seem to cater for s390
>>>>> oddities.
>>>>>
>>>>> I didn't have the time to investigate this properly, and since virtio-fs
>>>>> is virtual, we may be able to get around what is otherwise a
>>>>> limitation on s390. My understanding of these areas is admittedly
>>>>> shallow, and since I'm not sure I'll have much more time to
>>>>> invest in the near future I decided to raise concern.
>>>>>
>>>>> Any opinions?
>>>>
>>>> Hi Halil,
>>>>
>>>> I don't understand s390 or how PCI works there either. Is there any
>>>> other transport we can use there to map IO memory directly and access
>>>> it using DAX?
>>>>
>>>> BTW, is DAX supported on s390?
>>>>
>>>> I am also hoping somebody who knows better can chip in. Till that time,
>>>> we could still use virtio-fs on s390 without DAX.
>>>
>>> s390 has so-called "limited" dax support, see CONFIG_FS_DAX_LIMITED.
>>> In practice that means that support for PTE_DEVMAP is missing which
>>> means no get_user_pages() support for dax mappings. Effectively it's
>>> only useful for execute-in-place as operations like fork() and ptrace
>>> of dax mappings will fail.
>>
>>
>> This is only true for the dcssblk device driver (drivers/s390/block/dcssblk.c
>> and arch/s390/mm/extmem.c). 
>>
>> For what it's worth, the dcssblk looks to Linux like normal memory (just above the
>> previously detected memory) that can be used like normal memory. In previous times
>> we even had struct pages for this memory - this was removed long ago (when it was
>> still xip) to reduce the memory footprint for large dcss blocks and small memory
>> guests.
>> Can the CONFIG_FS_DAX_LIMITED go away if we have struct pages for that memory?
>>
>> Now some observations: 
>> - dcssblk is z/VM only (not KVM)
>> - Setting CONFIG_FS_DAX_LIMITED globally as a Kconfig option depending on whether
>>   a device driver is compiled in or not seems not flexible enough in case you
>>   have one device driver that does have struct pages and another one that doesn't
>> - I do not see a reason why we should not be able to map anything from QEMU
>>   into the guest real memory via an additional KVM memory slot. 
>>   We would need to handle that in the guest somehow (and not as a PCI bar),
>>   register this with struct pages etc.
>> - we must then look at how we can create the link between the guest memory and the
>>   virtio-fs driver. For virtio-ccw we might be able to add a new ccw command or
>>   whatever. Maybe we could also piggy-back on some memory hotplug work from David
>>   Hildenbrand (add cc).
>>
>> Regarding limitations on the platform:
>> - while we do have PCI, the virtio devices are usually plugged via the ccw bus.
>>   That implies no PCI bars. I assume you use those PCI bars only to implicitly
>>   have the location of the shared memory
>>   Correct?
> 
> Right.

So in essence we just have to provide a vm_get_shm_region callback in the virtio-ccw
guest code?

How many regions do we have to support? One region per device? Or many?
Even if we need more, this should be possible with 2 new CCWs, e.g. READ_SHM_BASE(id)
and READ_SHM_SIZE(id)


> 
>> - no real memory mapped I/O. Instead there are instructions that work on the mmio.
>>   As I understand things, this is of no concern regarding virtio-fs as you do not
>>   need mmio in the sense that a memory access of the guest to such an address 
>>   triggers an exit. You just need the shared memory as a means to have the data
>>   inside the guest. Any notification is done via normal virtqueue mechanisms
>>   Correct?
> 
> Yep.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 18/30] virtio_fs, dax: Set up virtio_fs dax_device
  2019-07-22 11:20             ` Christian Borntraeger
@ 2019-07-22 11:43               ` Cornelia Huck
  2019-07-22 12:00                 ` Christian Borntraeger
  0 siblings, 1 reply; 52+ messages in thread
From: Cornelia Huck @ 2019-07-22 11:43 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Dr. David Alan Gilbert, Dan Williams, Vivek Goyal, Halil Pasic,
	Collin Walling, Sebastian Ott, KVM list, Miklos Szeredi,
	linux-nvdimm, Linux Kernel Mailing List, Stefan Hajnoczi,
	linux-fsdevel, Steven Whitehouse, Heiko Carstens,
	David Hildenbrand

On Mon, 22 Jul 2019 13:20:18 +0200
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> On 22.07.19 12:56, Dr. David Alan Gilbert wrote:
> > * Christian Borntraeger (borntraeger@de.ibm.com) wrote:  
> >>
> >>
> >> On 18.07.19 16:30, Dan Williams wrote:  
> >>> On Thu, Jul 18, 2019 at 6:15 AM Vivek Goyal <vgoyal@redhat.com> wrote:  
> >>>>
> >>>> On Wed, Jul 17, 2019 at 07:27:25PM +0200, Halil Pasic wrote:  
> >>>>> On Wed, 15 May 2019 15:27:03 -0400
> >>>>> Vivek Goyal <vgoyal@redhat.com> wrote:
> >>>>>  
> >>>>>> From: Stefan Hajnoczi <stefanha@redhat.com>
> >>>>>>
> >>>>>> Setup a dax device.
> >>>>>>
> >>>>>> Use the shm capability to find the cache entry and map it.
> >>>>>>
> >>>>>> The DAX window is accessed by the fs/dax.c infrastructure and must have
> >>>>>> struct pages (at least on x86).  Use devm_memremap_pages() to map the
> >>>>>> DAX window PCI BAR and allocate struct page.
> >>>>>>  
> >>>>>
> >>>>> Sorry for being this late. I don't see any more recent version so I will
> >>>>> comment here.
> >>>>>
> >>>>> I'm trying to figure out how this is supposed to work on s390. My concern
> >>>>> is, that on s390 PCI memory needs to be accessed by special
> >>>>> instructions. This is taken care of by the stuff defined in
> >>>>> arch/s390/include/asm/io.h. E.g. we 'override' __raw_writew so it uses
> >>>>> the appropriate s390 instruction. However if the code does not use the
> >>>>> linux abstractions for accessing PCI memory, but assumes it can be
> >>>>> accessed like RAM, we have a problem.
> >>>>>
> >>>>> Looking at this patch, it seems to me, that we might end up with exactly
> >>>>> the case described. For example AFAICT copy_to_iter() (3) resolves to
> >>>>> the function in lib/iov_iter.c which does not seem to cater for s390
> >>>>> oddities.
> >>>>>
> >>>>> I didn't have the time to investigate this properly, and since virtio-fs
> >>>>> is virtual, we may be able to get around what is otherwise a
> >>>>> limitation on s390. My understanding of these areas is admittedly
> >>>>> shallow, and since I'm not sure I'll have much more time to
> >>>>> invest in the near future I decided to raise concern.
> >>>>>
> >>>>> Any opinions?  
> >>>>
> >>>> Hi Halil,
> >>>>
> >>>> I don't understand s390 or how PCI works there either. Is there any
> >>>> other transport we can use there to map IO memory directly and access
> >>>> it using DAX?
> >>>>
> >>>> BTW, is DAX supported on s390?
> >>>>
> >>>> I am also hoping somebody who knows better can chip in. Till that time,
> >>>> we could still use virtio-fs on s390 without DAX.  
> >>>
> >>> s390 has so-called "limited" dax support, see CONFIG_FS_DAX_LIMITED.
> >>> In practice that means that support for PTE_DEVMAP is missing which
> >>> means no get_user_pages() support for dax mappings. Effectively it's
> >>> only useful for execute-in-place as operations like fork() and ptrace
> >>> of dax mappings will fail.  
> >>
> >>
> >> This is only true for the dcssblk device driver (drivers/s390/block/dcssblk.c
> >> and arch/s390/mm/extmem.c). 
> >>
> >> For what it's worth, the dcssblk looks to Linux like normal memory (just above the
> >> previously detected memory) that can be used like normal memory. In previous times
> >> we even had struct pages for this memory - this was removed long ago (when it was
> >> still xip) to reduce the memory footprint for large dcss blocks and small memory
> >> guests.
> >> Can the CONFIG_FS_DAX_LIMITED go away if we have struct pages for that memory?
> >>
> >> Now some observations: 
> >> - dcssblk is z/VM only (not KVM)
> >> - Setting CONFIG_FS_DAX_LIMITED globally as a Kconfig option depending on whether
> >>   a device driver is compiled in or not seems not flexible enough in case you
> >>   have one device driver that does have struct pages and another one that doesn't
> >> - I do not see a reason why we should not be able to map anything from QEMU
> >>   into the guest real memory via an additional KVM memory slot. 
> >>   We would need to handle that in the guest somehow (and not as a PCI bar),
> >>   register this with struct pages etc.

You mean for ccw, right? I don't think we want pci to behave
differently than everywhere else.

> >> - we must then look at how we can create the link between the guest memory and the
> >>   virtio-fs driver. For virtio-ccw we might be able to add a new ccw command or
> >>   whatever. Maybe we could also piggy-back on some memory hotplug work from David
> >>   Hildenbrand (add cc).
> >>
> >> Regarding limitations on the platform:
> >> - while we do have PCI, the virtio devices are usually plugged via the ccw bus.
> >>   That implies no PCI bars. I assume you use those PCI bars only to implicitly
> >>   have the location of the shared memory
> >>   Correct?  
> > 
> > Right.  
> 
> So in essence we just have to provide a vm_get_shm_region callback in the virtio-ccw
> guest code?
> 
> How many regions do we have to support? One region per device? Or many?
> Even if we need more, this should be possible with 2 new CCWs, e.g. READ_SHM_BASE(id)
> and READ_SHM_SIZE(id)

I'd just add a single CCW with a control block containing id and size.
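
Something along these lines, with every name invented for the sake of the
example (none of this is specified anywhere yet):

/* Hypothetical control block for a single READ_SHM_REGION CCW */
struct vcdev_shm_region_info {
	__u8  id;		/* region id, set by the driver */
	__u8  reserved[7];
	__u64 addr;		/* guest-physical base, set by the device */
	__u64 len;		/* region length, 0 if the id is unknown */
} __packed;

static bool virtio_ccw_get_shm_region(struct virtio_device *vdev,
				      struct virtio_shm_region *region,
				      u8 id)
{
	struct vcdev_shm_region_info info = { .id = id };

	/* issue the (hypothetical) CCW on vdev's ccw device here,
	 * pointing it at &info ... */

	region->addr = info.addr;
	region->len = info.len;
	return info.len != 0;
}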

The main issue is where we put those regions, and what happens if we
use both virtio-pci and virtio-ccw on the same machine.

> 
> 
> >   
> >> - no real memory mapped I/O. Instead there are instructions that work on the mmio.
> >>   As I understand things, this is of no concern regarding virtio-fs as you do not
> >>   need mmio in the sense that a memory access of the guest to such an address 
> >>   triggers an exit. You just need the shared memory as a means to have the data
> >>   inside the guest. Any notification is done via normal virtqueue mechanisms
> >>   Correct?  
> > 
> > Yep.  
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 18/30] virtio_fs, dax: Set up virtio_fs dax_device
  2019-07-22 11:43               ` Cornelia Huck
@ 2019-07-22 12:00                 ` Christian Borntraeger
  2019-07-22 12:08                   ` David Hildenbrand
  0 siblings, 1 reply; 52+ messages in thread
From: Christian Borntraeger @ 2019-07-22 12:00 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Dr. David Alan Gilbert, Dan Williams, Vivek Goyal, Halil Pasic,
	Collin Walling, Sebastian Ott, KVM list, Miklos Szeredi,
	linux-nvdimm, Linux Kernel Mailing List, Stefan Hajnoczi,
	linux-fsdevel, Steven Whitehouse, Heiko Carstens,
	David Hildenbrand



On 22.07.19 13:43, Cornelia Huck wrote:
> On Mon, 22 Jul 2019 13:20:18 +0200
> Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> 
>> On 22.07.19 12:56, Dr. David Alan Gilbert wrote:
>>> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:  
>>>>
>>>>
>>>> On 18.07.19 16:30, Dan Williams wrote:  
>>>>> On Thu, Jul 18, 2019 at 6:15 AM Vivek Goyal <vgoyal@redhat.com> wrote:  
>>>>>>
>>>>>> On Wed, Jul 17, 2019 at 07:27:25PM +0200, Halil Pasic wrote:  
>>>>>>> On Wed, 15 May 2019 15:27:03 -0400
>>>>>>> Vivek Goyal <vgoyal@redhat.com> wrote:
>>>>>>>  
>>>>>>>> From: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>>>
>>>>>>>> Setup a dax device.
>>>>>>>>
>>>>>>>> Use the shm capability to find the cache entry and map it.
>>>>>>>>
>>>>>>>> The DAX window is accessed by the fs/dax.c infrastructure and must have
>>>>>>>> struct pages (at least on x86).  Use devm_memremap_pages() to map the
>>>>>>>> DAX window PCI BAR and allocate struct page.
>>>>>>>>  
>>>>>>>
>>>>>>> Sorry for being this late. I don't see any more recent version so I will
>>>>>>> comment here.
>>>>>>>
>>>>>>> I'm trying to figure out how this is supposed to work on s390. My concern
>>>>>>> is, that on s390 PCI memory needs to be accessed by special
>>>>>>> instructions. This is taken care of by the stuff defined in
>>>>>>> arch/s390/include/asm/io.h. E.g. we 'override' __raw_writew so it uses
>>>>>>> the appropriate s390 instruction. However if the code does not use the
>>>>>>> linux abstractions for accessing PCI memory, but assumes it can be
>>>>>>> accessed like RAM, we have a problem.
>>>>>>>
>>>>>>> Looking at this patch, it seems to me, that we might end up with exactly
>>>>>>> the case described. For example AFAICT copy_to_iter() (3) resolves to
>>>>>>> the function in lib/iov_iter.c which does not seem to cater for s390
>>>>>>> oddities.
>>>>>>>
>>>>>>> I didn't have the time to investigate this properly, and since virtio-fs
>>>>>>> is virtual, we may be able to get around what is otherwise a
>>>>>>> limitation on s390. My understanding of these areas is admittedly
>>>>>>> shallow, and since I'm not sure I'll have much more time to
>>>>>>> invest in the near future I decided to raise concern.
>>>>>>>
>>>>>>> Any opinions?  
>>>>>>
>>>>>> Hi Halil,
>>>>>>
>>>>>> I don't understand s390 or how PCI works there either. Is there any
>>>>>> other transport we can use there to map IO memory directly and access
>>>>>> it using DAX?
>>>>>>
>>>>>> BTW, is DAX supported on s390?
>>>>>>
>>>>>> I am also hoping somebody who knows better can chip in. Till that time,
>>>>>> we could still use virtio-fs on s390 without DAX.  
>>>>>
>>>>> s390 has so-called "limited" dax support, see CONFIG_FS_DAX_LIMITED.
>>>>> In practice that means that support for PTE_DEVMAP is missing which
>>>>> means no get_user_pages() support for dax mappings. Effectively it's
>>>>> only useful for execute-in-place as operations like fork() and ptrace
>>>>> of dax mappings will fail.  
>>>>
>>>>
>>>> This is only true for the dcssblk device driver (drivers/s390/block/dcssblk.c
>>>> and arch/s390/mm/extmem.c). 
>>>>
>>>> For what it's worth, the dcssblk looks to Linux like normal memory (just above the
>>>> previously detected memory) that can be used like normal memory. In previous times
>>>> we even had struct pages for this memory - this was removed long ago (when it was
>>>> still xip) to reduce the memory footprint for large dcss blocks and small memory
>>>> guests.
>>>> Can the CONFIG_FS_DAX_LIMITED go away if we have struct pages for that memory?
>>>>
>>>> Now some observations: 
>>>> - dcssblk is z/VM only (not KVM)
>>>> - Setting CONFIG_FS_DAX_LIMITED globally as a Kconfig option depending on whether
>>>>   a device driver is compiled in or not seems not flexible enough in case you
>>>>   have one device driver that does have struct pages and another one that doesn't
>>>> - I do not see a reason why we should not be able to map anything from QEMU
>>>>   into the guest real memory via an additional KVM memory slot. 
>>>>   We would need to handle that in the guest somehow (and not as a PCI bar),
>>>>   register this with struct pages etc.
> 
> You mean for ccw, right? I don't think we want pci to behave
> differently than everywhere else.

Yes for virtio-ccw. We would need to have a look at how virtio-ccw can create a memory
mapping with struct pages, so that DAX will work. (Dan, it is just struct pages that
you need, correct?)
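
For reference, condensed from the quoted patch: the transport-independent
part that actually creates the struct pages boils down to this (sketch
only, setup details omitted; base/len stand for whatever the transport
reports):

	pgmap->type = MEMORY_DEVICE_FS_DAX;
	pgmap->res = (struct resource) {
		.start = base,
		.end = base + len - 1,
	};
	/* allocates the struct pages for the range */
	kaddr = devm_memremap_pages(dev, pgmap);

So all a ccw-based lookup has to produce is that base/len pair.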


> 
>>>> - we must then look at how we can create the link between the guest memory and the
>>>>   virtio-fs driver. For virtio-ccw we might be able to add a new ccw command or
>>>>   whatever. Maybe we could also piggy-back on some memory hotplug work from David
>>>>   Hildenbrand (add cc).
>>>>
>>>> Regarding limitations on the platform:
>>>> - while we do have PCI, the virtio devices are usually plugged via the ccw bus.
>>>>   That implies no PCI bars. I assume you use those PCI bars only to implicitly
>>>>   have the location of the shared memory
>>>>   Correct?  
>>>
>>> Right.  
>>
>> So in essence we just have to provide a vm_get_shm_region callback in the virtio-ccw
>> guest code?
>>
>> How many regions do we have to support? One region per device? Or many?
>> Even if we need more, this should be possible with 2 new CCWs, e.g. READ_SHM_BASE(id)
>> and READ_SHM_SIZE(id)
> 
> I'd just add a single CCW with a control block containing id and size.
> 
> The main issue is where we put those regions, and what happens if we
> use both virtio-pci and virtio-ccw on the same machine.

Then these 2 devices should get independent memory regions that are added in an
independent (but still exclusive) way.
> 
>>
>>
>>>   
>>>> - no real memory mapped I/O. Instead there are instructions that work on the mmio.
>>>>   As I understand things, this is of no concern regarding virtio-fs as you do not
>>>>   need mmio in the sense that a memory access of the guest to such an address 
>>>>   triggers an exit. You just need the shared memory as a means to have the data
>>>>   inside the guest. Any notification is done via normal virtqueue mechanisms
>>>>   Correct?  
>>>
>>> Yep.  
>>
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 18/30] virtio_fs, dax: Set up virtio_fs dax_device
  2019-07-22 12:00                 ` Christian Borntraeger
@ 2019-07-22 12:08                   ` David Hildenbrand
  2019-07-29 13:20                     ` Stefan Hajnoczi
  0 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2019-07-22 12:08 UTC (permalink / raw)
  To: Christian Borntraeger, Cornelia Huck
  Cc: Dr. David Alan Gilbert, Dan Williams, Vivek Goyal, Halil Pasic,
	Collin Walling, Sebastian Ott, KVM list, Miklos Szeredi,
	linux-nvdimm, Linux Kernel Mailing List, Stefan Hajnoczi,
	linux-fsdevel, Steven Whitehouse, Heiko Carstens

On 22.07.19 14:00, Christian Borntraeger wrote:
> 
> 
> On 22.07.19 13:43, Cornelia Huck wrote:
>> On Mon, 22 Jul 2019 13:20:18 +0200
>> Christian Borntraeger <borntraeger@de.ibm.com> wrote:
>>
>>> On 22.07.19 12:56, Dr. David Alan Gilbert wrote:
>>>> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:  
>>>>>
>>>>>
>>>>> On 18.07.19 16:30, Dan Williams wrote:  
>>>>>> On Thu, Jul 18, 2019 at 6:15 AM Vivek Goyal <vgoyal@redhat.com> wrote:  
>>>>>>>
>>>>>>> On Wed, Jul 17, 2019 at 07:27:25PM +0200, Halil Pasic wrote:  
>>>>>>>> On Wed, 15 May 2019 15:27:03 -0400
>>>>>>>> Vivek Goyal <vgoyal@redhat.com> wrote:
>>>>>>>>  
>>>>>>>>> From: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>>>>
>>>>>>>>> Setup a dax device.
>>>>>>>>>
>>>>>>>>> Use the shm capability to find the cache entry and map it.
>>>>>>>>>
>>>>>>>>> The DAX window is accessed by the fs/dax.c infrastructure and must have
>>>>>>>>> struct pages (at least on x86).  Use devm_memremap_pages() to map the
>>>>>>>>> DAX window PCI BAR and allocate struct page.
>>>>>>>>>  
>>>>>>>>
>>>>>>>> Sorry for being this late. I don't see any more recent version so I will
>>>>>>>> comment here.
>>>>>>>>
>>>>>>>> I'm trying to figure out how this is supposed to work on s390. My concern
>>>>>>>> is, that on s390 PCI memory needs to be accessed by special
>>>>>>>> instructions. This is taken care of by the stuff defined in
>>>>>>>> arch/s390/include/asm/io.h. E.g. we 'override' __raw_writew so it uses
>>>>>>>> the appropriate s390 instruction. However if the code does not use the
>>>>>>>> linux abstractions for accessing PCI memory, but assumes it can be
>>>>>>>> accessed like RAM, we have a problem.
>>>>>>>>
>>>>>>>> Looking at this patch, it seems to me, that we might end up with exactly
>>>>>>>> the case described. For example AFAICT copy_to_iter() (3) resolves to
>>>>>>>> the function in lib/iov_iter.c which does not seem to cater for s390
>>>>>>>> oddities.
>>>>>>>>
>>>>>>>> I didn't have the time to investigate this properly, and since virtio-fs
>>>>>>>> is virtual, we may be able to get around what is otherwise a
>>>>>>>> limitation on s390. My understanding of these areas is admittedly
>>>>>>>> shallow, and since I'm not sure I'll have much more time to
>>>>>>>> invest in the near future I decided to raise concern.
>>>>>>>>
>>>>>>>> Any opinions?  
>>>>>>>
>>>>>>> Hi Halil,
>>>>>>>
> >>>>>> I don't understand s390 or how PCI works there either. Is there any
> >>>>>> other transport we can use there to map IO memory directly and access
> >>>>>> it using DAX?
> >>>>>>
> >>>>>> BTW, is DAX supported on s390?
>>>>>>>
>>>>>>> I am also hoping somebody who knows better can chip in. Till that time,
>>>>>>> we could still use virtio-fs on s390 without DAX.  
>>>>>>
>>>>>> s390 has so-called "limited" dax support, see CONFIG_FS_DAX_LIMITED.
>>>>>> In practice that means that support for PTE_DEVMAP is missing which
>>>>>> means no get_user_pages() support for dax mappings. Effectively it's
>>>>>> only useful for execute-in-place as operations like fork() and ptrace
>>>>>> of dax mappings will fail.  
>>>>>
>>>>>
>>>>> This is only true for the dcssblk device driver (drivers/s390/block/dcssblk.c
>>>>> and arch/s390/mm/extmem.c). 
>>>>>
>>>>> For what it's worth, the dcssblk looks to Linux like normal memory (just above the
>>>>> previously detected memory) that can be used like normal memory. In previous times
>>>>> we even had struct pages for this memory - this was removed long ago (when it was
>>>>> still xip) to reduce the memory footprint for large dcss blocks and small memory
>>>>> guests.
>>>>> Can the CONFIG_FS_DAX_LIMITED go away if we have struct pages for that memory?
>>>>>
>>>>> Now some observations: 
>>>>> - dcssblk is z/VM only (not KVM)
>>>>> - Setting CONFIG_FS_DAX_LIMITED globally as a Kconfig option depending on whether
>>>>>   a device driver is compiled in or not seems not flexible enough in case you
>>>>>   have one device driver that does have struct pages and another one that doesn't
>>>>> - I do not see a reason why we should not be able to map anything from QEMU
>>>>>   into the guest real memory via an additional KVM memory slot. 
>>>>>   We would need to handle that in the guest somehow (and not as a PCI bar),
>>>>>   register this with struct pages etc.
>>
>> You mean for ccw, right? I don't think we want pci to behave
>> differently than everywhere else.
> 
> Yes for virtio-ccw. We would need to have a look at how virtio-ccw can create a memory
> mapping with struct pages, so that DAX will work. (Dan, it is just struct pages that
> you need, correct?)
> 
> 
>>
>>>>> - we must then look at how we can create the link between the guest memory and the
>>>>>   virtio-fs driver. For virtio-ccw we might be able to add a new ccw command or
>>>>>   whatever. Maybe we could also piggy-back on some memory hotplug work from David
>>>>>   Hildenbrand (add cc).
>>>>>
>>>>> Regarding limitations on the platform:
>>>>> - while we do have PCI, the virtio devices are usually plugged via the ccw bus.
>>>>>   That implies no PCI bars. I assume you use those PCI bars only to implicitly
>>>>>   have the location of the shared memory
>>>>>   Correct?  
>>>>
>>>> Right.  
>>>
>>> So in essence we just have to provide a vm_get_shm_region callback in the virtio-ccw
>>> guest code?
>>>
>>> How many regions do we have to support? One region per device? Or many?
>>> Even if we need more, this should be possible with 2 new CCWs, e.g. READ_SHM_BASE(id)
>>> and READ_SHM_SIZE(id)
>>
>> I'd just add a single CCW with a control block containing id and size.
>>
>> The main issue is where we put those regions, and what happens if we
>> use both virtio-pci and virtio-ccw on the same machine.
> 
> Then these 2 devices should get independent memory regions that are added in an
> independent (but still exclusive) way.

I remember that one discussion was about who dictates the physical address
mapping. If I'm not wrong, PCI bars can be mapped freely by the guest
into the address space. So it would not just be querying the start+size.
Unless we want a pre-determined mapping (which might make more sense for
s390x).

-- 

Thanks,

David / dhildenb
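
To make the single CCW with a control block suggested above a bit more
concrete, here is a rough guest-side sketch. Only struct virtio_shm_region
and the get_shm_region method come from this patch series (patches 14-16);
the command code, the control block layout, and the virtio_ccw_do_ccw()
helper are invented for illustration and are not part of any spec or
posted patch (kernel context assumed, so kernel types and allocators are
used as-is):

  /* Hypothetical only: command code and control block are made up. */
  #define CCW_CMD_READ_SHM_REGION 0x86

  struct vcdev_shm_region_info {
  	__le64 id;	/* in:  shm region id to query */
  	__le64 addr;	/* out: guest-physical base address */
  	__le64 len;	/* out: length in bytes, 0 if id does not exist */
  };

  static bool virtio_ccw_get_shm_region(struct virtio_device *vdev,
  				      struct virtio_shm_region *region, u8 id)
  {
  	struct virtio_ccw_device *vcdev = to_vc_device(vdev);
  	struct vcdev_shm_region_info *info;
  	bool found = false;

  	/* 31-bit addressable, like other virtio-ccw control blocks */
  	info = kzalloc(sizeof(*info), GFP_DMA | GFP_KERNEL);
  	if (!info)
  		return false;
  	info->id = cpu_to_le64(id);

  	/* virtio_ccw_do_ccw() is a stand-in for issuing a channel
  	 * program the way ccw_io_helper() does for existing commands */
  	if (virtio_ccw_do_ccw(vcdev, CCW_CMD_READ_SHM_REGION,
  			      info, sizeof(*info)) == 0 && info->len) {
  		region->addr = le64_to_cpu(info->addr);
  		region->len = le64_to_cpu(info->len);
  		found = true;
  	}

  	kfree(info);
  	return found;
  }

A single command with id in and addr+len out would also answer the "how
many regions" question above: the guest just iterates over ids until len
comes back as 0.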

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 18/30] virtio_fs, dax: Set up virtio_fs dax_device
  2019-07-22 12:08                   ` David Hildenbrand
@ 2019-07-29 13:20                     ` Stefan Hajnoczi
  0 siblings, 0 replies; 52+ messages in thread
From: Stefan Hajnoczi @ 2019-07-29 13:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Christian Borntraeger, Cornelia Huck, Dr. David Alan Gilbert,
	Dan Williams, Vivek Goyal, Halil Pasic, Collin Walling,
	Sebastian Ott, KVM list, Miklos Szeredi, linux-nvdimm,
	Linux Kernel Mailing List, Stefan Hajnoczi, linux-fsdevel,
	Steven Whitehouse, Heiko Carstens


On Mon, Jul 22, 2019 at 02:08:02PM +0200, David Hildenbrand wrote:
> On 22.07.19 14:00, Christian Borntraeger wrote:
> > 
> > 
> > On 22.07.19 13:43, Cornelia Huck wrote:
> >> On Mon, 22 Jul 2019 13:20:18 +0200
> >> Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> >>
> >>> On 22.07.19 12:56, Dr. David Alan Gilbert wrote:
> >>>> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:  
> >>>>>
> >>>>>
> >>>>> On 18.07.19 16:30, Dan Williams wrote:  
> >>>>>> On Thu, Jul 18, 2019 at 6:15 AM Vivek Goyal <vgoyal@redhat.com> wrote:  
> >>>>>>>
> >>>>>>> On Wed, Jul 17, 2019 at 07:27:25PM +0200, Halil Pasic wrote:  
> >>>>>>>> On Wed, 15 May 2019 15:27:03 -0400
> >>>>>>>> Vivek Goyal <vgoyal@redhat.com> wrote:
> >>>>>>>>  
> >>>>>>>>> From: Stefan Hajnoczi <stefanha@redhat.com>
> >>>>>>>>>
> >>>>>>>>> Setup a dax device.
> >>>>>>>>>
> >>>>>>>>> Use the shm capability to find the cache entry and map it.
> >>>>>>>>>
> >>>>>>>>> The DAX window is accessed by the fs/dax.c infrastructure and must have
> >>>>>>>>> struct pages (at least on x86).  Use devm_memremap_pages() to map the
> >>>>>>>>> DAX window PCI BAR and allocate struct page.
> >>>>>>>>>  
> >>>>>>>>
> >>>>>>>> Sorry for being this late. I don't see any more recent version so I will
> >>>>>>>> comment here.
> >>>>>>>>
> >>>>>>>> I'm trying to figure out how this is supposed to work on s390. My concern
> >>>>>>>> is, that on s390 PCI memory needs to be accessed by special
> >>>>>>>> instructions. This is taken care of by the stuff defined in
> >>>>>>>> arch/s390/include/asm/io.h. E.g. we 'override' __raw_writew so it uses
> >>>>>>>> the appropriate s390 instruction. However if the code does not use the
> >>>>>>>> linux abstractions for accessing PCI memory, but assumes it can be
> >>>>>>>> accessed like RAM, we have a problem.
> >>>>>>>>
> >>>>>>>> Looking at this patch, it seems to me, that we might end up with exactly
> >>>>>>>> the case described. For example AFAICT copy_to_iter() (3) resolves to
> >>>>>>>> the function in lib/iov_iter.c which does not seem to cater for s390
> >>>>>>>> oddities.
> >>>>>>>>
> >>>>>>>> I didn't have the time to investigate this properly, and since virtio-fs
> >>>>>>>> is virtual, we may be able to get around what is otherwise a
> >>>>>>>> limitation on s390. My understanding of these areas is admittedly
> >>>>>>>> shallow, and since I'm not sure I'll have much more time to
> >>>>>>>> invest in the near future, I decided to raise the concern.
> >>>>>>>>
> >>>>>>>> Any opinions?  
> >>>>>>>
> >>>>>>> Hi Halil,
> >>>>>>>
> >>>>>>> I don't understand s390 and how PCI works there either. Is there any
> >>>>>>> other transport we can use there to map IO memory directly and access
> >>>>>>> it using DAX?
> >>>>>>>
> >>>>>>> BTW, is DAX supported on s390?
> >>>>>>>
> >>>>>>> I am also hoping somebody who knows better can chip in. Till that time,
> >>>>>>> we could still use virtio-fs on s390 without DAX.  
> >>>>>>
> >>>>>> s390 has so-called "limited" dax support, see CONFIG_FS_DAX_LIMITED.
> >>>>>> In practice that means that support for PTE_DEVMAP is missing which
> >>>>>> means no get_user_pages() support for dax mappings. Effectively it's
> >>>>>> only useful for execute-in-place as operations like fork() and ptrace
> >>>>>> of dax mappings will fail.  
> >>>>>
> >>>>>
> >>>>> This is only true for the dcssblk device driver (drivers/s390/block/dcssblk.c
> >>>>> and arch/s390/mm/extmem.c). 
> >>>>>
> >>>>> For what it's worth, the dcssblk looks to Linux like normal memory (just above the
> >>>>> previously detected memory) that can be used like normal memory. At one time
> >>>>> we even had struct pages for this memory - they were removed long ago (when it was
> >>>>> still xip) to reduce the memory footprint for large dcss blocks and small-memory
> >>>>> guests.
> >>>>> Can the CONFIG_FS_DAX_LIMITED go away if we have struct pages for that memory?
> >>>>>
> >>>>> Now some observations: 
> >>>>> - dcssblk is z/VM only (not KVM)
> >>>>> - Setting CONFIG_FS_DAX_LIMITED globally as a Kconfig option, depending on whether
> >>>>>   a device driver is compiled in or not, seems not flexible enough in case you
> >>>>>   have one device driver that does have struct pages and another one that doesn't
> >>>>> - I do not see a reason why we should not be able to map anything from QEMU
> >>>>>   into the guest real memory via an additional KVM memory slot. 
> >>>>>   We would need to handle that in the guest somehow (and not as a PCI bar),
> >>>>>   register this with struct pages etc.
> >>
> >> You mean for ccw, right? I don't think we want pci to behave
> >> differently than everywhere else.
> > 
> > Yes for virtio-ccw. We would need to have a look at how virtio-ccw can create a memory
> > mapping with struct pages, so that DAX will work. (Dan, it is just struct pages that
> > you need, correct?)
> > 
> > 
> >>
> >>>>> - we must then look at how we can create the link between the guest memory and the
> >>>>>   virtio-fs driver. For virtio-ccw we might be able to add a new ccw command or
> >>>>>   whatever. Maybe we could also piggy-back on some memory hotplug work from David
> >>>>>   Hildenbrand (add cc).
> >>>>>
> >>>>> Regarding limitations on the platform:
> >>>>> - while we do have PCI, the virtio devices are usually plugged via the ccw bus.
> >>>>>   That implies no PCI bars. I assume you use those PCI bars only to implicitly
> >>>>>   have the location of the shared memory.
> >>>>>   Correct?
> >>>>
> >>>> Right.  
> >>>
> >>> So in essence we just have to provide a vm_get_shm_region callback in the virtio-ccw
> >>> guest code?
> >>>
> >>> How many regions do we have to support? One region per device? Or many?
> >>> Even if we need more, this should be possible with 2 new CCWs, e.g. READ_SHM_BASE(id)
> >>> and READ_SHM_SIZE(id)
> >>
> >> I'd just add a single CCW with a control block containing id and size.
> >>
> >> The main issue is where we put those regions, and what happens if we
> >> use both virtio-pci and virtio-ccw on the same machine.
> > 
> > Then these 2 devices should get independent memory regions that are added in an
> > independent (but still exclusive) way.
> 
> I remember that one discussion was about who dictates the physical address
> mapping. If I'm not wrong, PCI bars can be mapped freely by the guest
> into the address space. So it would not just be querying the start+size.
> Unless we want a pre-determined mapping (which might make more sense for
> s390x).

Yes, guests can (re)map PCI BARs.  A PCI driver first probes the BAR to
determine the type (MMIO or PIO) and size.  Then it can set the address,
but often this has already been set by the firmware and the OS
keeps the existing location.

Stefan

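For readers unfamiliar with the BAR probe Stefan describes: it works by
saving the BAR, writing all ones to it in config space, reading back a
mask in which the size bits are hardwired to zero, and restoring the BAR.
A simplified sketch follows; pci_cfg_read()/pci_cfg_write() are assumed
platform config-space accessors (not a real kernel API), while the
register offsets and masks are the standard ones from linux/pci_regs.h:

  #include <linux/pci_regs.h>

  /* Assumed accessors for this sketch only. */
  extern u32 pci_cfg_read(int bdf, int off);
  extern void pci_cfg_write(int bdf, int off, u32 val);

  static u32 probe_bar_size(int bdf, int bar_off)
  {
  	u32 orig, mask;

  	orig = pci_cfg_read(bdf, bar_off);	/* save current value */
  	pci_cfg_write(bdf, bar_off, 0xffffffff);
  	mask = pci_cfg_read(bdf, bar_off);	/* size bits read back as 0 */
  	pci_cfg_write(bdf, bar_off, orig);	/* restore */

  	if (orig & PCI_BASE_ADDRESS_SPACE_IO)	/* bit 0: PIO vs. MMIO */
  		return ~(mask & PCI_BASE_ADDRESS_IO_MASK) + 1;

  	/* MMIO: bits 3:0 are flags (type, prefetchable); 64-bit BARs
  	 * span two dwords and need the upper half probed the same way. */
  	return ~(mask & PCI_BASE_ADDRESS_MEM_MASK) + 1;
  }

The "it can set the address" step is then just a plain write of a new
guest-physical address into the same dword, which is why, as David notes,
a ccw transport that only reports a fixed start+size is a different model
from PCI.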

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2019-07-29 13:20 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-15 19:26 [PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
2019-05-15 19:26 ` [PATCH v2 01/30] fuse: delete dentry if timeout is zero Vivek Goyal
2019-05-15 19:26 ` [PATCH v2 02/30] fuse: Clear setuid bit even in cache=never path Vivek Goyal
2019-05-20 14:41   ` Miklos Szeredi
2019-05-20 14:44     ` Miklos Szeredi
2019-05-20 20:25       ` Nikolaus Rath
2019-05-21 15:01     ` Vivek Goyal
2019-05-15 19:26 ` [PATCH v2 03/30] fuse: Use default_file_splice_read for direct IO Vivek Goyal
2019-05-15 19:26 ` [PATCH v2 04/30] fuse: export fuse_end_request() Vivek Goyal
2019-05-15 19:26 ` [PATCH v2 05/30] fuse: export fuse_len_args() Vivek Goyal
2019-05-15 19:26 ` [PATCH v2 06/30] fuse: Export fuse_send_init_request() Vivek Goyal
2019-05-15 19:26 ` [PATCH v2 07/30] fuse: export fuse_get_unique() Vivek Goyal
2019-05-15 19:26 ` [PATCH v2 08/30] fuse: extract fuse_fill_super_common() Vivek Goyal
2019-05-15 19:26 ` [PATCH v2 09/30] fuse: add fuse_iqueue_ops callbacks Vivek Goyal
2019-05-15 19:26 ` [PATCH v2 10/30] fuse: Separate fuse device allocation and installation in fuse_conn Vivek Goyal
2019-05-15 19:26 ` [PATCH v2 11/30] virtio_fs: add skeleton virtio_fs.ko module Vivek Goyal
2019-05-15 19:26 ` [PATCH v2 12/30] dax: remove block device dependencies Vivek Goyal
2019-05-16  0:21   ` Dan Williams
2019-05-16 10:07     ` Stefan Hajnoczi
2019-05-16 14:23     ` Vivek Goyal
2019-05-15 19:26 ` [PATCH v2 13/30] dax: Pass dax_dev to dax_writeback_mapping_range() Vivek Goyal
2019-05-15 19:26 ` [PATCH v2 14/30] virtio: Add get_shm_region method Vivek Goyal
2019-05-15 19:27 ` [PATCH v2 15/30] virtio: Implement get_shm_region for PCI transport Vivek Goyal
2019-05-15 19:27 ` [PATCH v2 16/30] virtio: Implement get_shm_region for MMIO transport Vivek Goyal
2019-05-15 19:27 ` [PATCH v2 17/30] fuse, dax: add fuse_conn->dax_dev field Vivek Goyal
2019-05-15 19:27 ` [PATCH v2 18/30] virtio_fs, dax: Set up virtio_fs dax_device Vivek Goyal
2019-07-17 17:27   ` Halil Pasic
2019-07-18  9:04     ` Cornelia Huck
2019-07-18 11:20       ` Halil Pasic
2019-07-18 14:47         ` Cornelia Huck
2019-07-18 13:15     ` Vivek Goyal
2019-07-18 14:30       ` Dan Williams
2019-07-22 10:51         ` Christian Borntraeger
2019-07-22 10:56           ` Dr. David Alan Gilbert
2019-07-22 11:20             ` Christian Borntraeger
2019-07-22 11:43               ` Cornelia Huck
2019-07-22 12:00                 ` Christian Borntraeger
2019-07-22 12:08                   ` David Hildenbrand
2019-07-29 13:20                     ` Stefan Hajnoczi
2019-05-15 19:27 ` [PATCH v2 19/30] fuse: Keep a list of free dax memory ranges Vivek Goyal
2019-05-15 19:27 ` [PATCH v2 20/30] fuse: Introduce setupmapping/removemapping commands Vivek Goyal
2019-05-15 19:27 ` [PATCH v2 21/30] fuse, dax: Implement dax read/write operations Vivek Goyal
2019-05-15 19:27 ` [PATCH v2 22/30] fuse, dax: add DAX mmap support Vivek Goyal
2019-05-15 19:27 ` [PATCH v2 23/30] fuse: Define dax address space operations Vivek Goyal
2019-05-15 19:27 ` [PATCH v2 24/30] fuse, dax: Take ->i_mmap_sem lock during dax page fault Vivek Goyal
2019-05-15 19:27 ` [PATCH v2 25/30] fuse: Maintain a list of busy elements Vivek Goyal
2019-05-15 19:27 ` [PATCH v2 26/30] fuse: Add logic to free up a memory range Vivek Goyal
     [not found]   ` <CAN+Pk99SNKSf+GjSQUUWt_eu1fSjTy_ByUOEQUXHi8zNqXY1zA@mail.gmail.com>
2019-05-20 12:53     ` Vivek Goyal
2019-05-15 19:27 ` [PATCH v2 27/30] fuse: Release file in process context Vivek Goyal
2019-05-15 19:27 ` [PATCH v2 28/30] fuse: Reschedule dax free work if too many EAGAIN attempts Vivek Goyal
2019-05-15 19:27 ` [PATCH v2 29/30] fuse: Take inode lock for dax inode truncation Vivek Goyal
2019-05-15 19:27 ` [PATCH v2 30/30] virtio-fs: Do not provide abort interface in fusectl Vivek Goyal
