* [RFC PATCH 00/13] fuse uring communication
@ 2023-03-21  1:10 Bernd Schubert
  2023-03-21  1:10 ` [PATCH 01/13] fuse: Add uring data structures and documentation Bernd Schubert
                   ` (14 more replies)
  0 siblings, 15 replies; 29+ messages in thread
From: Bernd Schubert @ 2023-03-21  1:10 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dharmendra Singh, Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	fuse-devel

This adds support for uring communication between kernel and
userspace daemon using the IORING_OP_URING_CMD opcode. The basic
approach was taken from ublk.  The patches are in RFC state -
I'm not sure about all decisions and some questions are marked
with XXX.

The userspace side has to send IOCTL(s) to configure the ring queue(s)
and it has the choice to configure exactly one ring or one
ring per core. If there are use cases we can also consider
allowing a different number of rings - the ioctl configuration
option is rather generic (number of queues).
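
To illustrate the per-core setup, here is a minimal userspace sketch
of the configuration loop. This is just a sketch and not the libfuse
code; it assumes libnuma for the node lookup, uses the fields of
struct fuse_uring_cfg from patch 01 and simplifies error handling:

#include <sys/ioctl.h>
#include <errno.h>
#include <numa.h>
#include <linux/fuse.h>

/* Configure one ring queue per core; 'fd' is the /dev/fuse session fd. */
static int uring_cfg_queues(int fd, unsigned int nr_cores)
{
	for (unsigned int qid = 0; qid < nr_cores; qid++) {
		struct fuse_uring_cfg cfg = {
			.cmd = FUSE_URING_IOCTL_CMD_QUEUE_CFG,
			.qid = qid,
			.nr_queues = nr_cores,
			.fg_queue_depth = 16,
			.bg_queue_depth = 4,
			.req_arg_len = 1024 * 1024,
			/* assumption: qid N is pinned to core N */
			.numa_node_id = numa_node_of_cpu(qid),
		};

		if (ioctl(fd, FUSE_DEV_IOC_URING, &cfg) == -1)
			return -errno;
	}
	return 0;
}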

Right now a queue lock is taken for any ring entry state change,
mostly to correctly handle unmount/daemon-stop. In fact,
correctly stopping the ring took most of the development
time - new corner cases kept coming up.
I ran dozens of xfstest cycles; with earlier versions
I once saw a warning about the ring start_stop
mutex being in the wrong state - probably another stop issue,
but I have not been able to track it down yet.
Regarding the queue lock - I still need to do profiling, but
my assumption is that it should not matter for the
one-ring-per-core configuration. For the single-ring config
option lock contention might come up, but I see this
configuration mostly for development only.
Adding more complexity and protecting ring entries with
their own locks can be done later.

The current code also keeps the fuse request allocation; initially
I only had that for background requests when the ring queue
didn't have free entries anymore. The allocation is kept
to reduce initial complexity, especially also for ring stop.
The allocation-free mode can be added back later.

Right now the ring queue of the submitting core is always used;
especially for page-cached background requests
we might later consider also enqueuing on the queues of other
cores (when these are not busy, of course).

Splice/zero-copy is not supported yet; all requests go
through the shared memory queue entry buffer. I am also
following the splice and ublk/zero-copy discussions and will
look into these options in the next days/weeks.
To have that buffer allocated on the right numa node,
a vmalloc is done per ring queue, on the numa node the
userspace daemon side asks for.
My assumption is that the mmap offset parameter will be
part of a debate and I'm curious what others think about
that approach.

Benchmarking and tuning is on my agenda for the next
days. For now I only have xfstest results - most longer
running tests were running at about 2x, but somehow when
I cleaned up the patches for submission I lost that.
My development VM/kernel has all sanitizers enabled -
hard to profile what happened. Performance
results with profiling will be submitted in a few days.

The patches include a design document, which has a few more
details.

The corresponding libfuse patches are on my uring branch,
but need cleanup for submission - that will happen during
the next few days.
https://github.com/bsbernd/libfuse/tree/uring

In case it makes review easier, the patches posted here are
also available on this branch
https://github.com/bsbernd/linux/tree/fuse-uring-for-6.2


Bernd Schubert (13):
  fuse: Add uring data structures and documentation
  fuse: rename to fuse_dev_end_requests and make non-static
  fuse: Move fuse_get_dev to header file
  Add a vmalloc_node_user function
  fuse: Add a uring config ioctl and ring destruction
  fuse: Add an interval ring stop worker/monitor
  fuse: Add uring mmap method
  fuse: Move request bits
  fuse: Add wait stop ioctl support to the ring
  fuse: Handle SQEs - register commands
  fuse: Add support to copy from/to the ring buffer
  fuse: Add uring sqe commit and fetch support
  fuse: Allow to queue to the ring

 Documentation/filesystems/fuse-uring.rst |  179 +++
 fs/fuse/Makefile                         |    2 +-
 fs/fuse/dev.c                            |  193 +++-
 fs/fuse/dev_uring.c                      | 1292 ++++++++++++++++++++++
 fs/fuse/dev_uring_i.h                    |   23 +
 fs/fuse/fuse_dev_i.h                     |   62 ++
 fs/fuse/fuse_i.h                         |  178 +++
 fs/fuse/inode.c                          |   10 +
 include/linux/vmalloc.h                  |    1 +
 include/uapi/linux/fuse.h                |  131 +++
 mm/nommu.c                               |    6 +
 mm/vmalloc.c                             |   41 +-
 12 files changed, 2064 insertions(+), 54 deletions(-)
 create mode 100644 Documentation/filesystems/fuse-uring.rst
 create mode 100644 fs/fuse/dev_uring.c
 create mode 100644 fs/fuse/dev_uring_i.h
 create mode 100644 fs/fuse/fuse_dev_i.h

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: linux-fsdevel@vger.kernel.org
cc: Amir Goldstein <amir73il@gmail.com>
cc: fuse-devel@lists.sourceforge.net

-- 
2.37.2



* [PATCH 01/13] fuse: Add uring data structures and documentation
  2023-03-21  1:10 [RFC PATCH 00/13] fuse uring communication Bernd Schubert
@ 2023-03-21  1:10 ` Bernd Schubert
  2023-03-21  1:10 ` [PATCH 02/13] fuse: rename to fuse_dev_end_requests and make non-static Bernd Schubert
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Bernd Schubert @ 2023-03-21  1:10 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dharmendra Singh, Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	fuse-devel

This just adds a design document and data structures needed by later
commits to support kernel/userspace communication using the uring
IORING_OP_URING_CMD command.
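
As an illustration of how the new uapi structures are meant to be used
from the daemon side, a minimal sketch of the initial
FUSE_URING_REQ_FETCH submission could look like the following. This
assumes liburing and a ring created with IORING_SETUP_SQE128 (so the
80 byte command area is available) and is not the final libfuse code:

#include <liburing.h>
#include <linux/fuse.h>
#include <stdint.h>
#include <string.h>

/* Register one ring entry (qid/tag) with the kernel so it becomes
 * available to transport fuse requests.
 */
static int uring_fetch_sqe(struct io_uring *ring, int fuse_fd,
			   uint16_t qid, uint16_t tag)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct fuse_uring_cmd_req *req;

	if (!sqe)
		return -EAGAIN;

	memset(sqe, 0, 2 * sizeof(*sqe)); /* 128B SQE */
	sqe->opcode = IORING_OP_URING_CMD;
	sqe->fd = fuse_fd;
	sqe->cmd_op = FUSE_URING_REQ_FETCH;

	/* the command payload lives in the 80B area of the big SQE */
	req = (struct fuse_uring_cmd_req *)sqe->cmd;
	req->qid = qid;
	req->tag = tag;

	return io_uring_submit(ring);
}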

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: linux-fsdevel@vger.kernel.org
cc: Amir Goldstein <amir73il@gmail.com>
cc: fuse-devel@lists.sourceforge.net
---
 Documentation/filesystems/fuse-uring.rst | 179 +++++++++++++++++++++++
 include/uapi/linux/fuse.h                | 131 +++++++++++++++++
 2 files changed, 310 insertions(+)
 create mode 100644 Documentation/filesystems/fuse-uring.rst

diff --git a/Documentation/filesystems/fuse-uring.rst b/Documentation/filesystems/fuse-uring.rst
new file mode 100644
index 000000000000..088b97bbc289
--- /dev/null
+++ b/Documentation/filesystems/fuse-uring.rst
@@ -0,0 +1,179 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================
+FUSE Uring design documentation
+===============================
+
+This documentation covers basic details of how the fuse
+kernel/userspace communication through uring is configured
+and works. For generic details about FUSE see fuse.rst.
+
+This document also covers the current interface, which is
+still in development and might change.
+
+Limitations
+===========
+As of now not all request types are supported through uring; the userspace
+side is required to also handle requests through /dev/fuse after
+uring setup is complete. These are especially notifications (initiated
+from the daemon side), interrupts and forgets.
+Interrupts are probably not working at all when uring is used. At least
+the current state of libfuse will not be able to handle those for requests
+on ring queues.
+All these limitations will be addressed later.
+
+Fuse uring configuration
+========================
+
+Fuse kernel requests are queued through the classical /dev/fuse
+read/write interface - until uring setup is complete.
+
+IOCTL configuration
+-------------------
+
+The userspace daemon side has to initiate ring configuration through
+the FUSE_DEV_IOC_URING ioctl, with cmd FUSE_URING_IOCTL_CMD_QUEUE_CFG.
+
+The number of queues can be
+    - 1
+        - One ring for all cores and all requests.
+    - Number of cores
+        - One ring per core; requests are queued on the ring queue
+          of the submitting core. Especially for background
+          requests we might consider using queues of other cores
+          as well - future work.
+        - Kernel and userspace have to agree on the number of cores,
+          on mismatch the ioctl is rejected.
+        - For each queue a separate ioctl needs to be sent.
+
+Example:
+
+fuse_uring_configure_kernel_queue()
+{
+	struct fuse_uring_cfg ioc_cfg = {
+		.cmd = FUSE_URING_IOCTL_CMD_QUEUE_CFG,
+		.qid = 2,
+		.nr_queues = 3,
+		.fg_queue_depth = 16,
+		.bg_queue_depth = 4,
+		.req_arg_len = 1024 * 1024,
+		.numa_node_id = 1,
+	};
+
+	rc = ioctl(se->fd, FUSE_DEV_IOC_URING, &ioc_cfg);
+}
+
+
+On the kernel side the first ioctl that arrives configures the basic fuse
+ring and then the queue given by its qid. All further ioctls configure only
+their queue. Each queue gets a memory allocation that is then assigned per
+queue entry.
+
+MMAP
+====
+
+For shared memory communication the memory allocated per queue is mapped
+with mmap. The corresponding queue is identified with the offset parameter.
+Important is a strict agreement between the kernel and userspace daemon side
+on the memory assignment per queue entry - a mismatch would lead to data
+corruption.
+Ideal would be an mmap per ring entry and to verify the pointer on SQE
+submission, but the result obtained in the file_operations::mmap method
+is scrambled further down the stack - the fuse kernel does not know the exact
+pointer value returned to the mmap initiated by userspace.
+
+
+Kernel - userspace interface using uring
+========================================
+
+After queue ioctl setup and memory mapping userspace submits
+SQEs (opcode = IORING_OP_URING_CMD) in order to fetch
+fuse requests. The initial submit is with the sub command
+FUSE_URING_REQ_FETCH, which just registers entries
+to be available on the kernel side - it sets the corresponding
+entry state and marks the entry as available in the queue bitmap.
+
+Once all entries for all queues are submitted, the kernel side starts
+to enqueue to the ring queue(s). The request is copied into the shared
+memory queue entry buffer and submitted as a CQE to the userspace
+side.
+The userspace side handles the CQE and submits the result as subcommand
+FUSE_URING_REQ_COMMIT_AND_FETCH - the kernel side completes the request
+and also marks the queue entry as available again. If there are
+pending requests waiting, the next request is immediately submitted
+to userspace again.
+
+Initial SQE
+-----------
+
+ |                                    |  FUSE filesystem daemon
+ |                                    |
+ |                                    |  >io_uring_submit()
+ |                                    |   IORING_OP_URING_CMD /
+ |                                    |   FUSE_URING_REQ_FETCH
+ |                                    |  [wait cqe]
+ |                                    |   >io_uring_wait_cqe() or
+ |                                    |   >io_uring_submit_and_wait()
+ |                                    |
+ |  >fuse_uring_cmd()                 |
+ |   >fuse_uring_fetch()              |
+ |    >fuse_uring_ent_release()       |
+
+
+Sending requests with CQEs
+--------------------------
+
+ |                                         |  FUSE filesystem daemon
+ |                                         |  [waiting for CQEs]
+ |  "rm /mnt/fuse/file"                    |
+ |                                         |
+ |  >sys_unlink()                          |
+ |    >fuse_unlink()                       |
+ |      [allocate request]                 |
+ |      >__fuse_request_send()             |
+ |        ...                              |
+ |       >fuse_uring_queue_fuse_req        |
+ |        [queue request on fg or          |
+ |          bg queue]                      |
+ |         >fuse_uring_assign_ring_entry() |
+ |         >fuse_uring_send_to_ring()      |
+ |          >fuse_uring_copy_to_ring()     |
+ |          >io_uring_cmd_done()           |
+ |          >request_wait_answer()         |
+ |           [sleep on req->waitq]         |
+ |                                         |  [receives and handles CQE]
+ |                                         |  [submit result and fetch next]
+ |                                         |  >io_uring_submit()
+ |                                         |   IORING_OP_URING_CMD/
+ |                                         |   FUSE_URING_REQ_COMMIT_AND_FETCH
+ |  >fuse_uring_cmd()                      |
+ |   >fuse_uring_commit_and_release()      |
+ |    >fuse_uring_copy_from_ring()         |
+ |     [ copy the result to the fuse req]  |
+ |     >fuse_uring_req_end_and_get_next()  |
+ |      >fuse_request_end()                |
+ |       [wake up req->waitq]              |
+ |      >fuse_uring_ent_release_and_fetch()|
+ |       [wait or handle next req]         |
+ |                                         |
+ |                                         |
+ |       [req->waitq woken up]             |
+ |    <fuse_unlink()                       |
+ |  <sys_unlink()                          |
+
+
+Shutdown
+========
+
+A delayed workqueue is started when the ring gets configured with ioctls and
+runs periodically to complete ring entries on umount or daemon stop.
+See fuse_uring_stop_mon() and subfunctions for details - basically it needs
+to run io_uring_cmd_done() for waiting SQEs and fuse_request_end() for
+queue entries that have a fuse request assigned.
+
+In order to avoid periodic cpu cycles for shutdown checks, the userspace
+daemon can create a thread and put that thread into a waiting state with the
+FUSE_DEV_IOC_URING ioctl and the FUSE_URING_IOCTL_CMD_WAIT subcommand.
+The kernel side will stop the periodic monitor on receiving this ioctl
+and will go into a waitq. On umount or daemon termination it will
+wake up and start the delayed stop workq again before returning to
+userspace.
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index e3c54109bae9..0f59507b4b18 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -966,9 +966,64 @@ struct fuse_notify_retrieve_in {
 	uint64_t	dummy4;
 };
 
+
+enum fuse_uring_ioctl_cmd {
+	/* the ioctl struct was not correctly initialized if this is set */
+	FUSE_URING_IOCTL_CMD_INVALID    = 0,
+
+	/* The ioctl is a queue configuration command */
+	FUSE_URING_IOCTL_CMD_QUEUE_CFG = 1,
+
+	/* Wait in the kernel until the process gets terminated, process
+	 * termination will wake up the waitq and initiate ring shutdown.
+	 * This avoids the need to run a check in intervals if ring termination
+	 * should be started (less cpu cycles) and also helps for faster ring
+	 * shutdown.
+	 */
+	FUSE_URING_IOCTL_CMD_WAIT      = 2,
+
+	/* Daemon side wants to explicitly stop the waiter thread. This will
+	 * restart the interval termination checker.
+	 */
+	FUSE_URING_IOCTL_CMD_STOP      = 3,
+};
+
+struct fuse_uring_cfg {
+	/* currently unused */
+	uint32_t flags;
+
+	/* configuration command */
+	uint16_t cmd;
+
+	uint16_t padding;
+
+	/* qid the config command is for */
+	uint32_t qid;
+
+	/* number of queues */
+	uint32_t nr_queues;
+
+	/* number of foreground entries per queue */
+	uint32_t fg_queue_depth;
+
+	/* number of background entries per queue */
+	uint32_t bg_queue_depth;
+
+	/* argument (data length) of a request */
+	uint32_t req_arg_len;
+
+	/* numa node this queue runs on; UINT32_MAX if any */
+	uint32_t numa_node_id;
+
+	/* reserved space for future additions */
+	uint64_t reserve[8];
+};
+
 /* Device ioctls: */
 #define FUSE_DEV_IOC_MAGIC		229
 #define FUSE_DEV_IOC_CLONE		_IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
+#define FUSE_DEV_IOC_URING		_IOR(FUSE_DEV_IOC_MAGIC, 1, \
+					     struct fuse_uring_cfg)
 
 struct fuse_lseek_in {
 	uint64_t	fh;
@@ -1047,4 +1102,80 @@ struct fuse_secctx_header {
 	uint32_t	nr_secctx;
 };
 
+
+/**
+ * Size of the ring buffer header
+ */
+#define FUSE_RING_HEADER_BUF_SIZE 4096
+#define FUSE_RING_MIN_IN_OUT_ARG_SIZE 4096
+
+enum fuse_ring_req_cmd {
+	FUSE_RING_BUF_CMD_INVALID = 0,
+
+	/* return an iovec pointer */
+	FUSE_RING_BUF_CMD_IOVEC = 1,
+
+	/* report an error */
+	FUSE_RING_BUF_CMD_ERROR = 2,
+};
+
+/* Request is background type. Daemon side is free to use this information
+ * to handle foreground/background CQEs with different priorities.
+ */
+#define FUSE_RING_REQ_FLAG_BACKGROUND (1ull << 0)
+
+/**
+ * This structure is mapped onto the mmaped per queue-entry buffer
+ */
+struct fuse_ring_req {
+
+	union {
+		/* The first 4K are command data */
+		char ring_header[FUSE_RING_HEADER_BUF_SIZE];
+
+		struct {
+			uint64_t flags;
+
+			/* enum fuse_ring_req_cmd */
+			uint32_t cmd;
+			uint32_t in_out_arg_len;
+
+			/* kernel fills in, reads out */
+			union {
+				struct fuse_in_header in;
+				struct fuse_out_header out;
+			};
+		};
+	};
+
+	char in_out_arg[];
+};
+
+/**
+ * sqe commands to the kernel
+ */
+enum fuse_uring_cmd {
+	FUSE_URING_REQ_INVALID = 0,
+
+	/* submit sqe to kernel to get a request */
+	FUSE_URING_REQ_FETCH = 1,
+
+	/* commit result and fetch next request */
+	FUSE_URING_REQ_COMMIT_AND_FETCH = 2,
+};
+
+/**
+ * In the 80B command area of the SQE.
+ */
+struct fuse_uring_cmd_req {
+	/* queue the command is for (queue index) */
+	uint16_t qid;
+
+	/* queue entry (array index) */
+	uint16_t tag;
+
+	/* unused padding */
+	uint32_t padding;
+};
+
 #endif /* _LINUX_FUSE_H */
-- 
2.37.2



* [PATCH 02/13] fuse: rename to fuse_dev_end_requests and make non-static
  2023-03-21  1:10 [RFC PATCH 00/13] fuse uring communication Bernd Schubert
  2023-03-21  1:10 ` [PATCH 01/13] fuse: Add uring data structures and documentation Bernd Schubert
@ 2023-03-21  1:10 ` Bernd Schubert
  2023-03-21  1:10 ` [PATCH 03/13] fuse: Move fuse_get_dev to header file Bernd Schubert
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Bernd Schubert @ 2023-03-21  1:10 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dharmendra Singh, Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	fuse-devel

This function is needed by dev_uring.c to clean up ring queues,
so make it non-static. As a non-static function, the name
'end_requests' should be prefixed with fuse_.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: linux-fsdevel@vger.kernel.org
cc: Amir Goldstein <amir73il@gmail.com>
cc: fuse-devel@lists.sourceforge.net
---
 fs/fuse/dev.c        |  6 +++---
 fs/fuse/fuse_dev_i.h | 14 ++++++++++++++
 2 files changed, 17 insertions(+), 3 deletions(-)
 create mode 100644 fs/fuse/fuse_dev_i.h

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index e8b60ce72c9a..02e9299ba781 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2075,7 +2075,7 @@ static __poll_t fuse_dev_poll(struct file *file, poll_table *wait)
 }
 
 /* Abort all requests on the given list (pending or processing) */
-static void end_requests(struct list_head *head)
+void fuse_dev_end_requests(struct list_head *head)
 {
 	while (!list_empty(head)) {
 		struct fuse_req *req;
@@ -2178,7 +2178,7 @@ void fuse_abort_conn(struct fuse_conn *fc)
 		wake_up_all(&fc->blocked_waitq);
 		spin_unlock(&fc->lock);
 
-		end_requests(&to_end);
+		fuse_dev_end_requests(&to_end);
 	} else {
 		spin_unlock(&fc->lock);
 	}
@@ -2208,7 +2208,7 @@ int fuse_dev_release(struct inode *inode, struct file *file)
 			list_splice_init(&fpq->processing[i], &to_end);
 		spin_unlock(&fpq->lock);
 
-		end_requests(&to_end);
+		fuse_dev_end_requests(&to_end);
 
 		/* Are we the last open device? */
 		if (atomic_dec_and_test(&fc->dev_count)) {
diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
new file mode 100644
index 000000000000..68e7da9f96ee
--- /dev/null
+++ b/fs/fuse/fuse_dev_i.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * FUSE: Filesystem in Userspace
+ * Copyright (C) 2001-2008  Miklos Szeredi <miklos@szeredi.hu>
+ */
+
+#ifndef _FS_FUSE_DEV_I_H
+#define _FS_FUSE_DEV_I_H
+
+void fuse_dev_end_requests(struct list_head *head);
+
+#endif
+
+
-- 
2.37.2



* [PATCH 03/13] fuse: Move fuse_get_dev to header file
  2023-03-21  1:10 [RFC PATCH 00/13] fuse uring communication Bernd Schubert
  2023-03-21  1:10 ` [PATCH 01/13] fuse: Add uring data structures and documentation Bernd Schubert
  2023-03-21  1:10 ` [PATCH 02/13] fuse: rename to fuse_dev_end_requests and make non-static Bernd Schubert
@ 2023-03-21  1:10 ` Bernd Schubert
  2023-03-21  1:10 ` [PATCH 04/13] Add a vmalloc_node_user function Bernd Schubert
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Bernd Schubert @ 2023-03-21  1:10 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dharmendra Singh, Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	fuse-devel

Another preparation patch, as this function will be needed by
fuse/dev.c and fuse/dev_uring.c.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: linux-fsdevel@vger.kernel.org
cc: Amir Goldstein <amir73il@gmail.com>
cc: fuse-devel@lists.sourceforge.net
---
 fs/fuse/dev.c        | 10 +---------
 fs/fuse/fuse_dev_i.h |  9 +++++++++
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 02e9299ba781..e0669b8e4618 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -7,6 +7,7 @@
 */
 
 #include "fuse_i.h"
+#include "fuse_dev_i.h"
 
 #include <linux/init.h>
 #include <linux/module.h>
@@ -31,15 +32,6 @@ MODULE_ALIAS("devname:fuse");
 
 static struct kmem_cache *fuse_req_cachep;
 
-static struct fuse_dev *fuse_get_dev(struct file *file)
-{
-	/*
-	 * Lockless access is OK, because file->private data is set
-	 * once during mount and is valid until the file is released.
-	 */
-	return READ_ONCE(file->private_data);
-}
-
 static void fuse_request_init(struct fuse_mount *fm, struct fuse_req *req)
 {
 	INIT_LIST_HEAD(&req->list);
diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
index 68e7da9f96ee..f623a85c4c24 100644
--- a/fs/fuse/fuse_dev_i.h
+++ b/fs/fuse/fuse_dev_i.h
@@ -7,6 +7,15 @@
 #ifndef _FS_FUSE_DEV_I_H
 #define _FS_FUSE_DEV_I_H
 
+static inline struct fuse_dev *fuse_get_dev(struct file *file)
+{
+	/*
+	 * Lockless access is OK, because file->private data is set
+	 * once during mount and is valid until the file is released.
+	 */
+	return READ_ONCE(file->private_data);
+}
+
 void fuse_dev_end_requests(struct list_head *head);
 
 #endif
-- 
2.37.2



* [PATCH 04/13] Add a vmalloc_node_user function
  2023-03-21  1:10 [RFC PATCH 00/13] fuse uring communication Bernd Schubert
                   ` (2 preceding siblings ...)
  2023-03-21  1:10 ` [PATCH 03/13] fuse: Move fuse_get_dev to header file Bernd Schubert
@ 2023-03-21  1:10 ` Bernd Schubert
  2023-03-21 21:21   ` Andrew Morton
  2023-03-21  1:10 ` [PATCH 05/13] fuse: Add a uring config ioctl and ring destruction Bernd Schubert
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 29+ messages in thread
From: Bernd Schubert @ 2023-03-21  1:10 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dharmendra Singh, Bernd Schubert, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, linux-mm, Miklos Szeredi,
	Amir Goldstein, fuse-devel

This is to have a numa aware vmalloc function for memory exposed to
userspace. Fuse uring will allocate queue memory using this
new function.
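
A rough sketch of the intended usage pattern in the later fuse patches
(simplified, the helper names here are only for illustration): allocate
the per-queue buffer on the daemon's numa node and expose it to the
daemon from the ->mmap handler with remap_vmalloc_range():

#include <linux/vmalloc.h>
#include <linux/mm.h>

/* allocate a zeroed queue buffer on the given numa node; VM_USERMAP
 * allows it to be remapped to userspace later
 */
static void *queue_buf_alloc(size_t size, int node)
{
	return vmalloc_node_user(size, node);
}

/* called from the file_operations->mmap handler */
static int queue_buf_mmap(struct vm_area_struct *vma, void *buf)
{
	return remap_vmalloc_range(vma, buf, 0);
}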

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Uladzislau Rezki <urezki@gmail.com>
cc: Christoph Hellwig <hch@infradead.org>
cc: linux-mm@kvack.org
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: linux-fsdevel@vger.kernel.org
cc: Amir Goldstein <amir73il@gmail.com>
cc: fuse-devel@lists.sourceforge.net
---
 include/linux/vmalloc.h |  1 +
 mm/nommu.c              |  6 ++++++
 mm/vmalloc.c            | 41 +++++++++++++++++++++++++++++++++++++----
 3 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 096d48aa3437..e4e6f8f220b9 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -142,6 +142,7 @@ static inline unsigned long vmalloc_nr_pages(void) { return 0; }
 extern void *vmalloc(unsigned long size) __alloc_size(1);
 extern void *vzalloc(unsigned long size) __alloc_size(1);
 extern void *vmalloc_user(unsigned long size) __alloc_size(1);
+extern void *vmalloc_node_user(unsigned long size, int node) __alloc_size(1);
 extern void *vmalloc_node(unsigned long size, int node) __alloc_size(1);
 extern void *vzalloc_node(unsigned long size, int node) __alloc_size(1);
 extern void *vmalloc_32(unsigned long size) __alloc_size(1);
diff --git a/mm/nommu.c b/mm/nommu.c
index 5b83938ecb67..a7710c90447a 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -186,6 +186,12 @@ void *vmalloc_user(unsigned long size)
 }
 EXPORT_SYMBOL(vmalloc_user);
 
+void *vmalloc_node_user(unsigned long size, int node)
+{
+	return __vmalloc_user_flags(size, GFP_KERNEL | __GFP_ZERO);
+}
+EXPORT_SYMBOL(vmalloc_node_user);
+
 struct page *vmalloc_to_page(const void *addr)
 {
 	return virt_to_page(addr);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ca71de7c9d77..9ad98e6c5e59 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3367,6 +3367,25 @@ void *vzalloc(unsigned long size)
 }
 EXPORT_SYMBOL(vzalloc);
 
+/**
+ * _vmalloc_node_user - allocate zeroed virtually contiguous memory for userspace
+ * on the given numa node
+ * @size: allocation size
+ * @node: numa node
+ *
+ * The resulting memory area is zeroed so it can be mapped to userspace
+ * without leaking data.
+ *
+ * Return: pointer to the allocated memory or %NULL on error
+ */
+static void *_vmalloc_node_user(unsigned long size, int node)
+{
+	return __vmalloc_node_range(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
+				    GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
+				    VM_USERMAP, node,
+				    __builtin_return_address(0));
+}
+
 /**
  * vmalloc_user - allocate zeroed virtually contiguous memory for userspace
  * @size: allocation size
@@ -3378,13 +3397,27 @@ EXPORT_SYMBOL(vzalloc);
  */
 void *vmalloc_user(unsigned long size)
 {
-	return __vmalloc_node_range(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
-				    GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
-				    VM_USERMAP, NUMA_NO_NODE,
-				    __builtin_return_address(0));
+	return _vmalloc_node_user(size, NUMA_NO_NODE);
 }
 EXPORT_SYMBOL(vmalloc_user);
 
+/**
+ * vmalloc_node_user - allocate zeroed virtually contiguous memory for userspace on
+ *                a numa node
+ * @size: allocation size
+ * @node: numa node
+ *
+ * The resulting memory area is zeroed so it can be mapped to userspace
+ * without leaking data.
+ *
+ * Return: pointer to the allocated memory or %NULL on error
+ */
+void *vmalloc_node_user(unsigned long size, int node)
+{
+	return _vmalloc_node_user(size, node);
+}
+EXPORT_SYMBOL(vmalloc_node_user);
+
 /**
  * vmalloc_node - allocate memory on a specific node
  * @size:	  allocation size
-- 
2.37.2



* [PATCH 05/13] fuse: Add a uring config ioctl and ring destruction
  2023-03-21  1:10 [RFC PATCH 00/13] fuse uring communication Bernd Schubert
                   ` (3 preceding siblings ...)
  2023-03-21  1:10 ` [PATCH 04/13] Add a vmalloc_node_user function Bernd Schubert
@ 2023-03-21  1:10 ` Bernd Schubert
  2023-03-21  1:10 ` [PATCH 06/13] fuse: Add an interval ring stop worker/monitor Bernd Schubert
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Bernd Schubert @ 2023-03-21  1:10 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dharmendra Singh, Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	fuse-devel

Ring data are created with an ioctl, destruction goes via
fuse_abort_conn(). A new module parameter is added, for
now uring defaults to disabled.

This also adds the remaining fuse ring data structures.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: linux-fsdevel@vger.kernel.org
cc: Amir Goldstein <amir73il@gmail.com>
cc: fuse-devel@lists.sourceforge.net
---
 fs/fuse/Makefile      |   2 +-
 fs/fuse/dev.c         |  21 +++
 fs/fuse/dev_uring.c   | 330 ++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/dev_uring_i.h |  20 +++
 fs/fuse/fuse_i.h      | 178 +++++++++++++++++++++++
 fs/fuse/inode.c       |   7 +
 6 files changed, 557 insertions(+), 1 deletion(-)
 create mode 100644 fs/fuse/dev_uring.c
 create mode 100644 fs/fuse/dev_uring_i.h

diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 0c48b35c058d..634d47477393 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -7,7 +7,7 @@ obj-$(CONFIG_FUSE_FS) += fuse.o
 obj-$(CONFIG_CUSE) += cuse.o
 obj-$(CONFIG_VIRTIO_FS) += virtiofs.o
 
-fuse-y := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o ioctl.o
+fuse-y := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o ioctl.o dev_uring.o
 fuse-$(CONFIG_FUSE_DAX) += dax.o
 
 virtiofs-y := virtio_fs.o
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index e0669b8e4618..07323b041377 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -8,6 +8,7 @@
 
 #include "fuse_i.h"
 #include "fuse_dev_i.h"
+#include "dev_uring_i.h"
 
 #include <linux/init.h>
 #include <linux/module.h>
@@ -2171,6 +2172,11 @@ void fuse_abort_conn(struct fuse_conn *fc)
 		spin_unlock(&fc->lock);
 
 		fuse_dev_end_requests(&to_end);
+
+		mutex_lock(&fc->ring.start_stop_lock);
+		if (fc->ring.configured && !fc->ring.queues_stopped)
+			fuse_uring_end_requests(fc);
+		mutex_unlock(&fc->ring.start_stop_lock);
 	} else {
 		spin_unlock(&fc->lock);
 	}
@@ -2247,6 +2253,7 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
 	int res;
 	int oldfd;
 	struct fuse_dev *fud = NULL;
+	struct fuse_uring_cfg ring_conf;
 
 	switch (cmd) {
 	case FUSE_DEV_IOC_CLONE:
@@ -2271,6 +2278,20 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
 				fput(old);
 			}
 		}
+		break;
+	case FUSE_DEV_IOC_URING:
+		/* XXX fud ensures fc->ring.start_stop_lock is initialized? */
+		fud = fuse_get_dev(file);
+		if (fud) {
+			res = copy_from_user(&ring_conf, (void *)arg,
+					     sizeof(ring_conf));
+			if (res == 0)
+				res = fuse_uring_ioctl(file, &ring_conf);
+			else
+				res = -EFAULT;
+		} else
+			pr_info("%s: Did not get fud\n", __func__);
+
 		break;
 	default:
 		res = -ENOTTY;
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
new file mode 100644
index 000000000000..12fd21526b2b
--- /dev/null
+++ b/fs/fuse/dev_uring.c
@@ -0,0 +1,330 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * FUSE: Filesystem in Userspace
+ * Copyright (C) 2001-2008  Miklos Szeredi <miklos@szeredi.hu>
+ */
+
+#include "fuse_i.h"
+#include "fuse_dev_i.h"
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/sched/signal.h>
+#include <linux/uio.h>
+#include <linux/miscdevice.h>
+#include <linux/pagemap.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/pipe_fs_i.h>
+#include <linux/swap.h>
+#include <linux/splice.h>
+#include <linux/sched.h>
+#include <linux/io_uring.h>
+#include <linux/mm.h>
+#include <linux/io.h>
+#include <linux/io_uring.h>
+#include <linux/topology.h>
+
+static bool __read_mostly enable_uring;
+module_param(enable_uring, bool, 0644);
+MODULE_PARM_DESC(enable_uring,
+	"Enable uring userspace communication through uring.");
+
+static struct fuse_ring_queue *
+fuse_uring_get_queue(struct fuse_conn *fc, int qid)
+{
+	char *ptr = (char *)fc->ring.queues;
+
+	if (unlikely(qid >= fc->ring.nr_queues)) {
+		WARN_ON(1);
+		qid = 0;
+	}
+
+	return (struct fuse_ring_queue *)(ptr + qid * fc->ring.queue_size);
+}
+
+/* Abort all list queued request on the given ring queue */
+static void fuse_uring_end_queue_requests(struct fuse_ring_queue *queue)
+{
+	spin_lock(&queue->lock);
+	queue->aborted = 1;
+	fuse_dev_end_requests(&queue->fg_queue);
+	fuse_dev_end_requests(&queue->bg_queue);
+	spin_unlock(&queue->lock);
+}
+
+void fuse_uring_end_requests(struct fuse_conn *fc)
+{
+	int qid;
+
+	for (qid = 0; qid < fc->ring.nr_queues; qid++) {
+		struct fuse_ring_queue *queue =
+			fuse_uring_get_queue(fc, qid);
+
+		if (!queue->configured)
+			continue;
+
+		fuse_uring_end_queue_requests(queue);
+	}
+}
+
+/**
+ * Allocate the per-queue shared request buffer on the numa node
+ * the userspace daemon asked for.
+ */
+static char *fuse_uring_alloc_queue_buf(int size, int node)
+{
+	char *buf;
+
+	if (size <= 0) {
+		pr_info("Invalid queue buf size: %d.\n", size);
+		return ERR_PTR(-EINVAL);
+	}
+
+	buf = vmalloc_node_user(size, node);
+	return buf ? buf : ERR_PTR(-ENOMEM);
+}
+
+/**
+ * Ring setup for this connection
+ */
+static int fuse_uring_conn_cfg(struct fuse_conn *fc,
+			       struct fuse_uring_cfg *cfg)
+__must_hold(fc->ring.stop_waitq.lock)
+{
+	size_t queue_sz;
+
+	if (cfg->nr_queues == 0) {
+		pr_info("zero number of queues is invalid.\n");
+		return -EINVAL;
+	}
+
+	if (cfg->nr_queues > 1 &&
+	    cfg->nr_queues != num_present_cpus()) {
+		pr_info("nr-queues (%d) does not match nr-cores (%d).\n",
+			cfg->nr_queues, num_present_cpus());
+		return -EINVAL;
+	}
+
+	if (cfg->qid > cfg->nr_queues) {
+		pr_info("qid (%d) exceeds number of queues (%d)\n",
+			cfg->qid, cfg->nr_queues);
+		return -EINVAL;
+	}
+
+	if (cfg->req_arg_len < FUSE_RING_MIN_IN_OUT_ARG_SIZE) {
+		pr_info("Per req buffer size too small (%d), min: %d\n",
+			cfg->req_arg_len, FUSE_RING_MIN_IN_OUT_ARG_SIZE);
+		return -EINVAL;
+	}
+
+	if (unlikely(fc->ring.queues)) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+
+	fc->ring.daemon = current;
+	get_task_struct(fc->ring.daemon);
+
+	fc->ring.nr_queues = cfg->nr_queues;
+	fc->ring.per_core_queue = cfg->nr_queues > 1;
+
+	fc->ring.max_fg = cfg->fg_queue_depth;
+	fc->ring.max_bg = cfg->bg_queue_depth;
+	fc->ring.queue_depth = cfg->fg_queue_depth + cfg->bg_queue_depth;
+
+	fc->ring.req_arg_len = cfg->req_arg_len;
+	fc->ring.req_buf_sz =
+		round_up(sizeof(struct fuse_ring_req) + fc->ring.req_arg_len,
+			 PAGE_SIZE);
+
+	/* verified during mmap that kernel and userspace have the same
+	 * buffer size
+	 */
+	fc->ring.queue_buf_size = fc->ring.req_buf_sz * fc->ring.queue_depth;
+
+	queue_sz = sizeof(*fc->ring.queues) +
+			fc->ring.queue_depth * sizeof(struct fuse_ring_ent);
+	fc->ring.queues = kcalloc(cfg->nr_queues, queue_sz, GFP_KERNEL);
+	if (!fc->ring.queues)
+		return -ENOMEM;
+	fc->ring.queue_size = queue_sz;
+
+	fc->ring.queue_refs = 0;
+
+	return 0;
+}
+
+static int fuse_uring_queue_cfg(struct fuse_conn *fc, unsigned int qid,
+				unsigned int node_id)
+__must_hold(fc->ring.stop_waitq.lock)
+{
+	int tag;
+	struct fuse_ring_queue *queue;
+	char *buf;
+
+	if (qid >= fc->ring.nr_queues) {
+		pr_info("fuse ring queue config: qid=%u >= nr-queues=%zu\n",
+			qid, fc->ring.nr_queues);
+		return -EINVAL;
+	}
+	queue = fuse_uring_get_queue(fc, qid);
+
+	if (queue->configured) {
+		pr_info("fuse ring qid=%u already configured!\n", qid);
+		return -EALREADY;
+	}
+
+	queue->qid = qid;
+	queue->fc = fc;
+	queue->req_fg = 0;
+	bitmap_zero(queue->req_avail_map, fc->ring.queue_depth);
+	spin_lock_init(&queue->lock);
+	INIT_LIST_HEAD(&queue->fg_queue);
+	INIT_LIST_HEAD(&queue->bg_queue);
+
+	buf = fuse_uring_alloc_queue_buf(fc->ring.queue_buf_size, node_id);
+	queue->queue_req_buf = buf;
+	if (IS_ERR(queue->queue_req_buf)) {
+		int err = PTR_ERR(queue->queue_req_buf);
+
+		queue->queue_req_buf = NULL;
+		return err;
+	}
+
+	for (tag = 0; tag < fc->ring.queue_depth; tag++) {
+		struct fuse_ring_ent *ent = &queue->ring_ent[tag];
+
+		ent->queue = queue;
+		ent->tag = tag;
+		ent->fuse_req = NULL;
+		ent->rreq = (struct fuse_ring_req *)buf;
+
+		pr_devel("initialize qid=%d tag=%d queue=%p req=%p",
+			 qid, tag, queue, ent);
+
+		ent->rreq->flags = 0;
+
+		ent->state = FRRS_INIT;
+		ent->need_cmd_done = 0;
+		ent->need_req_end = 0;
+		fc->ring.queue_refs++;
+		buf += fc->ring.req_buf_sz;
+	}
+
+	queue->configured = 1;
+	queue->aborted = 0;
+	fc->ring.nr_queues_ioctl_init++;
+	if (fc->ring.nr_queues_ioctl_init == fc->ring.nr_queues) {
+		fc->ring.configured = 1;
+		pr_devel("fc=%p nr-queues=%zu depth=%zu ioctl ready\n",
+			fc, fc->ring.nr_queues, fc->ring.queue_depth);
+	}
+
+	return 0;
+}
+
+/**
+ * Configure the queue for the given qid. First call will also initialize
+ * the ring for this connection.
+ */
+static int fuse_uring_cfg(struct fuse_conn *fc, unsigned int qid,
+			  struct fuse_uring_cfg *cfg)
+{
+	int rc;
+
+	/* The lock is taken so that user space may issue the per-queue
+	 * configuration ioctls in parallel without racing each other
+	 */
+	mutex_lock(&fc->ring.start_stop_lock);
+
+	if (fc->ring.configured) {
+		rc = -EALREADY;
+		goto unlock;
+	}
+
+	if (fc->ring.daemon == NULL) {
+		rc = fuse_uring_conn_cfg(fc, cfg);
+		if (rc != 0)
+			goto unlock;
+	}
+
+	rc = fuse_uring_queue_cfg(fc, qid, cfg->numa_node_id);
+
+unlock:
+	mutex_unlock(&fc->ring.start_stop_lock);
+
+	return rc;
+}
+
+int fuse_uring_ioctl(struct file *file, struct fuse_uring_cfg *cfg)
+{
+	struct fuse_dev *fud = fuse_get_dev(file);
+	struct fuse_conn *fc;
+
+	if (fud == NULL)
+		return -ENODEV;
+
+	fc = fud->fc;
+
+	pr_devel("%s fc=%p flags=%x cmd=%d qid=%d nq=%d fg=%d bg=%d\n",
+		 __func__, fc, cfg->flags, cfg->cmd, cfg->qid, cfg->nr_queues,
+		 cfg->fg_queue_depth, cfg->bg_queue_depth);
+
+
+	switch (cfg->cmd) {
+	case FUSE_URING_IOCTL_CMD_QUEUE_CFG:
+		return fuse_uring_cfg(fc, cfg->qid, cfg);
+	default:
+		return -EINVAL;
+	}
+
+	/* no cmd flag set */
+	return -EINVAL;
+}
+
+/**
+ * Finalize the ring destruction when queue ref counters are zero.
+ */
+void fuse_uring_ring_destruct(struct fuse_conn *fc)
+{
+	unsigned int qid;
+
+	if (READ_ONCE(fc->ring.queue_refs) != 0) {
+		pr_info("fc=%p refs=%d configured=%d",
+			fc, fc->ring.queue_refs, fc->ring.configured);
+		WARN_ON(1);
+		return;
+	}
+
+	put_task_struct(fc->ring.daemon);
+	fc->ring.daemon = NULL;
+
+	for (qid = 0; qid < fc->ring.nr_queues; qid++) {
+		int tag;
+		struct fuse_ring_queue *queue = fuse_uring_get_queue(fc, qid);
+
+		if (!queue->configured)
+			continue;
+
+		for (tag = 0; tag < fc->ring.queue_depth; tag++) {
+			struct fuse_ring_ent *ent = &queue->ring_ent[tag];
+
+			if (ent->need_cmd_done) {
+				pr_warn("fc=%p qid=%d tag=%d cmd not done\n",
+					fc, qid, tag);
+				io_uring_cmd_done(ent->cmd, -ENOTCONN, 0);
+				ent->need_cmd_done = 0;
+			}
+		}
+
+		vfree(queue->queue_req_buf);
+	}
+
+	kfree(fc->ring.queues);
+	fc->ring.queues = NULL;
+	fc->ring.nr_queues_ioctl_init = 0;
+	fc->ring.queue_depth = 0;
+	fc->ring.nr_queues = 0;
+}
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
new file mode 100644
index 000000000000..4ab440ee00f2
--- /dev/null
+++ b/fs/fuse/dev_uring_i.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * FUSE: Filesystem in Userspace
+ * Copyright (C) 2001-2008  Miklos Szeredi <miklos@szeredi.hu>
+ */
+
+#ifndef _FS_FUSE_DEV_URING_I_H
+#define _FS_FUSE_DEV_URING_I_H
+
+#include "fuse_i.h"
+
+void fuse_uring_end_requests(struct fuse_conn *fc);
+void fuse_uring_ring_destruct(struct fuse_conn *fc);
+int fuse_uring_ioctl(struct file *file, struct fuse_uring_cfg *cfg);
+#endif
+
+
+
+
+
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 46797a171a84..634d90084690 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -529,6 +529,177 @@ struct fuse_sync_bucket {
 	struct rcu_head rcu;
 };
 
+enum fuse_ring_req_state {
+
+	FRRS_INVALD = 0,
+
+	/* request is basically initialized */
+	FRRS_INIT = 1u << 0,
+
+	/* request is committed from user space and waiting for a new fuse req */
+	FRRS_FUSE_FETCH_COMMIT = 1u << 1,
+
+	/* The ring request waits for a new fuse request */
+	FRRS_FUSE_WAIT = 1u << 2,
+
+	/* The ring req got assigned a fuse req */
+	FRRS_FUSE_REQ = 1u << 3,
+
+	/* request is in or on the way to user space */
+	FRRS_USERSPACE = 1u << 4,
+
+	/* request is in the process of being freed */
+	FRRS_FREEING   = 1u << 5,
+
+	/* fuse_req_end was already done */
+	FRRS_FUSE_REQ_END = 1u << 6,
+
+	/* An error occurred in the uring cmd receiving function,
+	 * the request will then go back to user space
+	 */
+	FRRS_CMD_ERR      = 1u << 7,
+
+	/* request is released */
+	FRRS_FREED = 1u << 8,
+};
+
+/** A fuse ring entry, part of the ring queue */
+struct fuse_ring_ent {
+	/* pointer to kernel request buffer, userspace side has direct access
+	 * to it through the mmaped buffer
+	 */
+	struct fuse_ring_req *rreq;
+
+	int tag;
+
+	struct fuse_ring_queue *queue;
+
+	/* state the request is currently in */
+	u64 state;
+
+	int need_cmd_done:1;
+	int need_req_end:1;
+
+	struct fuse_req *fuse_req; /* when a list request is handled */
+
+	struct io_uring_cmd *cmd;
+};
+
+/* IORING_MAX_ENTRIES */
+#define FUSE_URING_MAX_QUEUE_DEPTH 32768
+
+struct fuse_ring_queue {
+	unsigned long flags;
+
+	struct fuse_conn *fc;
+
+	int qid;
+
+	/* This bitmap holds, which entries are available in the fuse_ring_ent
+	 * array.
+	 * XXX: Is there a way to make this dynamic
+	 */
+	DECLARE_BITMAP(req_avail_map, FUSE_URING_MAX_QUEUE_DEPTH);
+
+	/* available number of foreground requests  */
+	int req_fg;
+
+	/* available number of background requests */
+	int req_bg;
+
+	/* queue lock, taken when any value in the queue changes _and_ also
+	 * a ring entry state changes.
+	 */
+	spinlock_t lock;
+
+	/* per queue memory buffer that is divided per request */
+	char *queue_req_buf;
+
+	struct list_head bg_queue;
+	struct list_head fg_queue;
+
+	int configured:1;
+	int aborted:1;
+
+	/* size depends on queue depth */
+	struct fuse_ring_ent ring_ent[] ____cacheline_aligned_in_smp;
+};
+
+/**
+ * Describes if uring is used for communication and holds all the data needed
+ * for uring communication
+ */
+struct fuse_ring {
+
+	/* number of ring queues */
+	size_t nr_queues;
+
+	/* number of entries per queue */
+	size_t queue_depth;
+
+	/* max arg size for a request */
+	size_t req_arg_len;
+
+	/* sizeof(struct fuse_ring_req) + req_arg_len, rounded up to page size */
+	size_t req_buf_sz;
+
+	/* max number of background requests per queue */
+	size_t max_bg;
+
+	/* max number of foreground requests */
+	size_t max_fg;
+
+	/* size of struct fuse_ring_queue + queue-depth * entry-size */
+	size_t queue_size;
+
+	/* buffer size per queue, that is used per queue entry */
+	size_t queue_buf_size;
+
+	/* When zero the queue can be freed on destruction */
+	int queue_refs;
+
+	/* Hold ring requests */
+	struct fuse_ring_queue *queues;
+
+	/* number of initialized queues with the ioctl */
+	int nr_queues_ioctl_init;
+
+	/* number of initialized queues with the uring cmd */
+	atomic_t nr_queues_cmd_init;
+
+	/* one queue per core or a single queue only ? */
+	unsigned int per_core_queue:1;
+
+	/* userspace sent a stop ioctl */
+	unsigned int stop_requested:1;
+
+	/* Is the ring completely ioctl configured */
+	unsigned int configured:1;
+
+	/* Is the ring ready to take requests */
+	unsigned int ready:1;
+
+	/* used on shutdown */
+	unsigned int queues_stopped:1;
+
+	/* userspace process */
+	struct task_struct *daemon;
+
+	struct mutex start_stop_lock;
+
+	/* userspace has a special thread that exists only to wait
+	 * in the kernel for process stop, to release uring
+	 */
+	wait_queue_head_t stop_waitq;
+
+	/* The daemon might get killed and uring then needs
+	 * to be released without getting a umount notification, this
+	 * workqueue exists to release uring even without a process
+	 * being held in the stop_waitq
+	 */
+	struct delayed_work stop_monitor;
+};
+
 /**
  * A Fuse connection.
  *
@@ -836,6 +1007,13 @@ struct fuse_conn {
 
 	/* New writepages go into this bucket */
 	struct fuse_sync_bucket __rcu *curr_bucket;
+
+	/*
+	 * XXX Move to struct fuse_dev?
+	 * XXX Allocate dynamically?
+	 */
+	/** uring connection information */
+	struct fuse_ring ring;
 };
 
 /*
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index de9b9ec5ce81..3f765e65a7b0 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -7,6 +7,7 @@
 */
 
 #include "fuse_i.h"
+#include "dev_uring_i.h"
 
 #include <linux/pagemap.h>
 #include <linux/slab.h>
@@ -855,6 +856,9 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
 	fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
 	fc->max_pages_limit = FUSE_MAX_MAX_PAGES;
 
+	mutex_init(&fc->ring.start_stop_lock);
+	fc->ring.daemon = NULL;
+
 	INIT_LIST_HEAD(&fc->mounts);
 	list_add(&fm->fc_entry, &fc->mounts);
 	fm->fc = fc;
@@ -1785,6 +1789,9 @@ void fuse_conn_destroy(struct fuse_mount *fm)
 		fuse_ctl_remove_conn(fc);
 		mutex_unlock(&fuse_mutex);
 	}
+
+	if (fc->ring.daemon != NULL)
+		fuse_uring_ring_destruct(fc);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_destroy);
 
-- 
2.37.2



* [PATCH 06/13] fuse: Add an interval ring stop worker/monitor
  2023-03-21  1:10 [RFC PATCH 00/13] fuse uring communication Bernd Schubert
                   ` (4 preceding siblings ...)
  2023-03-21  1:10 ` [PATCH 05/13] fuse: Add a uring config ioctl and ring destruction Bernd Schubert
@ 2023-03-21  1:10 ` Bernd Schubert
  2023-03-23 10:27   ` Miklos Szeredi
  2023-03-21  1:10 ` [PATCH 07/13] fuse: Add uring mmap method Bernd Schubert
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 29+ messages in thread
From: Bernd Schubert @ 2023-03-21  1:10 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dharmendra Singh, Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	fuse-devel

This adds a delayed work queue that runs in intervals
to check and to stop the ring if needed. Fuse connection
abort now waits for this worker to complete.

On stop the worker iterates over all queues and their ring
entries and tries to release entries when they are in the
'right' state.

FRRS_INIT - the ring entry is not used at all yet, nothing
to be done.

FRRS_FUSE_WAIT - a CQE needs to be sent. This is really important
to do, as uring otherwise keeps workers in D state and prints a warning.

FRRS_USERSPACE bit set - a CQE was already sent and must not be
sent again from shutdown code, but typically a fuse request
needs to be completed.

Any other state - the ring entry is currently worked on, shutdown
has to wait until this is completed.

Also, the queue lock is held on any queue entry state change,
shutdown handling is the main reason for that.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: linux-fsdevel@vger.kernel.org
cc: Amir Goldstein <amir73il@gmail.com>
cc: fuse-devel@lists.sourceforge.net
---
 fs/fuse/dev.c         |  12 ++-
 fs/fuse/dev_uring.c   | 168 ++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/dev_uring_i.h |   2 +-
 fs/fuse/inode.c       |   3 +
 4 files changed, 183 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 07323b041377..d9c40d782c94 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2174,9 +2174,12 @@ void fuse_abort_conn(struct fuse_conn *fc)
 		fuse_dev_end_requests(&to_end);
 
 		mutex_lock(&fc->ring.start_stop_lock);
-		if (fc->ring.configured && !fc->ring.queues_stopped)
+		if (fc->ring.configured && !fc->ring.queues_stopped) {
 			fuse_uring_end_requests(fc);
+			schedule_delayed_work(&fc->ring.stop_monitor, 0);
+		}
 		mutex_unlock(&fc->ring.start_stop_lock);
+
 	} else {
 		spin_unlock(&fc->lock);
 	}
@@ -2187,7 +2190,14 @@ void fuse_wait_aborted(struct fuse_conn *fc)
 {
 	/* matches implicit memory barrier in fuse_drop_waiting() */
 	smp_mb();
+
 	wait_event(fc->blocked_waitq, atomic_read(&fc->num_waiting) == 0);
+
+	/* XXX use struct completion? */
+	if (fc->ring.daemon != NULL) {
+		schedule_delayed_work(&fc->ring.stop_monitor, 0);
+		wait_event(fc->ring.stop_waitq, fc->ring.queues_stopped == 1);
+	}
 }
 
 int fuse_dev_release(struct inode *inode, struct file *file)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 12fd21526b2b..44ff23ce5ebf 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -26,6 +26,9 @@
 #include <linux/io_uring.h>
 #include <linux/topology.h>
 
+/* default monitor interval for a dying daemon */
+#define FURING_DAEMON_MON_PERIOD (5 * HZ)
+
 static bool __read_mostly enable_uring;
 module_param(enable_uring, bool, 0644);
 MODULE_PARM_DESC(enable_uring,
@@ -44,6 +47,15 @@ fuse_uring_get_queue(struct fuse_conn *fc, int qid)
 	return (struct fuse_ring_queue *)(ptr + qid * fc->ring.queue_size);
 }
 
+/* dummy function will be replaced in later commits */
+static void fuse_uring_bit_set(struct fuse_ring_ent *ent, bool is_bg,
+			       const char *str)
+{
+	(void)ent;
+	(void)is_bg;
+	(void)str;
+}
+
 /* Abort all list queued request on the given ring queue */
 static void fuse_uring_end_queue_requests(struct fuse_ring_queue *queue)
 {
@@ -69,6 +81,156 @@ void fuse_uring_end_requests(struct fuse_conn *fc)
 	}
 }
 
+/**
+ * Simplified ring-entry release function, for shutdown only
+ */
+static void _fuse_uring_shutdown_release_ent(struct fuse_ring_ent *ent)
+__must_hold(&queue->lock)
+{
+	bool is_bg = !!(ent->rreq->flags & FUSE_RING_REQ_FLAG_BACKGROUND);
+
+	ent->state |= FRRS_FUSE_REQ_END;
+	ent->need_req_end = 0;
+	fuse_request_end(ent->fuse_req);
+	ent->fuse_req = NULL;
+	fuse_uring_bit_set(ent, is_bg, __func__);
+}
+
+/*
+ * Release a request/entry on connection shutdown
+ */
+static void fuse_uring_shutdown_release_ent(struct fuse_ring_ent *ent)
+__must_hold(&fc->ring.start_stop_lock)
+__must_hold(&queue->lock)
+{
+	struct fuse_ring_queue *queue = ent->queue;
+	struct fuse_conn *fc = queue->fc;
+	bool may_release = false;
+	int state;
+
+	pr_devel("%s fc=%p qid=%d tag=%d state=%llu\n",
+		 __func__, fc, queue->qid, ent->tag, ent->state);
+
+	if (ent->state & FRRS_FREED)
+		goto out; /* no work left, freed before */
+
+	state = ent->state;
+
+	if (state == FRRS_INIT || state == FRRS_FUSE_WAIT ||
+	    ((state & FRRS_USERSPACE) && queue->aborted)) {
+		ent->state |= FRRS_FREED;
+
+		if (ent->need_cmd_done) {
+			pr_devel("qid=%d tag=%d sending cmd_done\n",
+				queue->qid, ent->tag);
+			io_uring_cmd_done(ent->cmd, -ENOTCONN, 0);
+			ent->need_cmd_done = 0;
+		}
+
+		if (ent->need_req_end)
+			_fuse_uring_shutdown_release_ent(ent);
+		may_release = true;
+	} else {
+		/* somewhere in between states, another thread should currently
+		 * handle it
+		 */
+		pr_devel("%s qid=%d tag=%d state=%llu\n",
+			 __func__, queue->qid, ent->tag, ent->state);
+	}
+
+out:
+	/* might free the queue - needs to have the queue waitq lock released */
+	if (may_release) {
+		int refs = --fc->ring.queue_refs;
+
+		pr_devel("free-req fc=%p qid=%d tag=%d refs=%d\n",
+			 fc, queue->qid, ent->tag, refs);
+		if (refs == 0) {
+			fc->ring.queues_stopped = 1;
+			wake_up_all(&fc->ring.stop_waitq);
+		}
+	}
+}
+
+static void fuse_uring_stop_queue(struct fuse_ring_queue *queue)
+__must_hold(&fc->ring.start_stop_lock)
+__must_hold(&queue->lock)
+{
+	struct fuse_conn *fc = queue->fc;
+	int tag;
+	bool empty =
+		(list_empty(&queue->fg_queue) && list_empty(&queue->bg_queue));
+
+	if (!empty && !queue->aborted)
+		return;
+
+	for (tag = 0; tag < fc->ring.queue_depth; tag++) {
+		struct fuse_ring_ent *ent = &queue->ring_ent[tag];
+
+		fuse_uring_shutdown_release_ent(ent);
+	}
+}
+
+/*
+ *  Stop the ring queues
+ */
+static void fuse_uring_stop_queues(struct fuse_conn *fc)
+__must_hold(fc->ring.start_stop_lock)
+{
+	int qid;
+
+	if (fc->ring.daemon == NULL)
+		return;
+
+	fc->ring.stop_requested = 1;
+	fc->ring.ready = 0;
+
+	for (qid = 0; qid < fc->ring.nr_queues; qid++) {
+		struct fuse_ring_queue *queue =
+			fuse_uring_get_queue(fc, qid);
+
+		if (!queue->configured)
+			continue;
+
+		spin_lock(&queue->lock);
+		fuse_uring_stop_queue(queue);
+		spin_unlock(&queue->lock);
+	}
+}
+
+/*
+ * monitoring function to check if the fuse ring shall be destructed,
+ * run as a delayed task
+ */
+static void fuse_uring_stop_mon(struct work_struct *work)
+{
+	struct fuse_conn *fc = container_of(work, struct fuse_conn,
+					    ring.stop_monitor.work);
+	struct fuse_iqueue *fiq = &fc->iq;
+
+	pr_devel("fc=%p running stop-mon, queues-stopped=%u queue-refs=%d\n",
+		fc, fc->ring.queues_stopped, fc->ring.queue_refs);
+
+	mutex_lock(&fc->ring.start_stop_lock);
+
+	if (!fiq->connected || fc->ring.stop_requested ||
+	    (fc->ring.daemon->flags & PF_EXITING)) {
+		pr_devel("%s Stopping queues connected=%d stop-req=%d exit=%d\n",
+			__func__, fiq->connected, fc->ring.stop_requested,
+			(fc->ring.daemon->flags & PF_EXITING));
+		fuse_uring_stop_queues(fc);
+	}
+
+	if (!fc->ring.queues_stopped)
+		schedule_delayed_work(&fc->ring.stop_monitor,
+				      FURING_DAEMON_MON_PERIOD);
+	else
+		pr_devel("Not scheduling work queues-stopped=%u queue-refs=%d.\n",
+			fc->ring.queues_stopped,  fc->ring.queue_refs);
+
+	mutex_unlock(&fc->ring.start_stop_lock);
+}
+
 /**
  * use __vmalloc_node_range() (needs to be
  * exported?) or add a new (exported) function vm_alloc_user_node()
@@ -127,6 +289,11 @@ __must_hold(fc->ring.stop_waitq.lock)
 	fc->ring.daemon = current;
 	get_task_struct(fc->ring.daemon);
 
+	INIT_DELAYED_WORK(&fc->ring.stop_monitor,
+			  fuse_uring_stop_mon);
+	schedule_delayed_work(&fc->ring.stop_monitor,
+			      FURING_DAEMON_MON_PERIOD);
+
 	fc->ring.nr_queues = cfg->nr_queues;
 	fc->ring.per_core_queue = cfg->nr_queues > 1;
 
@@ -298,6 +465,7 @@ void fuse_uring_ring_destruct(struct fuse_conn *fc)
 		return;
 	}
 
+	cancel_delayed_work_sync(&fc->ring.stop_monitor);
 	put_task_struct(fc->ring.daemon);
 	fc->ring.daemon = NULL;
 
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 4ab440ee00f2..d5cb9bdca64e 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -10,8 +10,8 @@
 #include "fuse_i.h"
 
 void fuse_uring_end_requests(struct fuse_conn *fc);
-void fuse_uring_ring_destruct(struct fuse_conn *fc);
 int fuse_uring_ioctl(struct file *file, struct fuse_uring_cfg *cfg);
+void fuse_uring_ring_destruct(struct fuse_conn *fc);
 #endif
 
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 3f765e65a7b0..91c912793dca 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -856,6 +856,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
 	fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
 	fc->max_pages_limit = FUSE_MAX_MAX_PAGES;
 
+	init_waitqueue_head(&fc->ring.stop_waitq);
 	mutex_init(&fc->ring.start_stop_lock);
 	fc->ring.daemon = NULL;
 
@@ -1792,6 +1793,8 @@ void fuse_conn_destroy(struct fuse_mount *fm)
 
 	if (fc->ring.daemon != NULL)
 		fuse_uring_ring_destruct(fc);
+
+	mutex_destroy(&fc->ring.start_stop_lock);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_destroy);
 
-- 
2.37.2



* [PATCH 07/13] fuse: Add uring mmap method
  2023-03-21  1:10 [RFC PATCH 00/13] fuse uring communication Bernd Schubert
                   ` (5 preceding siblings ...)
  2023-03-21  1:10 ` [PATCH 06/13] fuse: Add an interval ring stop worker/monitor Bernd Schubert
@ 2023-03-21  1:10 ` Bernd Schubert
  2023-03-21  1:10 ` [PATCH 08/13] fuse: Move request bits Bernd Schubert
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Bernd Schubert @ 2023-03-21  1:10 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dharmendra Singh, Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	fuse-devel

This adds the uring mmap method. Mmap is currently done per ring queue;
the queue is identified using the offset parameter. The reason to have an
mmap per queue is to have a numa-aware allocation per queue.
The trade-off is that the offset limits the number of possible queues
(although to a very high number) and it might cause issues if another
mmap is needed later on.
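
For reference, the userspace side of that offset encoding might look
like the sketch below. This is my reading of the offset math in this
patch and not the final libfuse code; the helper name and parameters
are made up for illustration. The byte offset passed to mmap selects
the queue as qid * queue_depth * page_size:

#include <sys/mman.h>
#include <unistd.h>

/* map the shared request buffer of one ring queue into the daemon */
static void *uring_mmap_queue(int fuse_fd, unsigned int qid,
			      size_t queue_depth, size_t queue_buf_size)
{
	off_t off = (off_t)qid * queue_depth * sysconf(_SC_PAGESIZE);

	return mmap(NULL, queue_buf_size, PROT_READ | PROT_WRITE,
		    MAP_SHARED, fuse_fd, off);
}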

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: linux-fsdevel@vger.kernel.org
cc: Amir Goldstein <amir73il@gmail.com>
cc: fuse-devel@lists.sourceforge.net
---
 fs/fuse/dev.c         |  1 +
 fs/fuse/dev_uring.c   | 49 +++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/dev_uring_i.h |  1 +
 3 files changed, 51 insertions(+)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index d9c40d782c94..256936af4f2e 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2323,6 +2323,7 @@ const struct file_operations fuse_dev_operations = {
 	.fasync		= fuse_dev_fasync,
 	.unlocked_ioctl = fuse_dev_ioctl,
 	.compat_ioctl   = compat_ptr_ioctl,
+	.mmap		= fuse_uring_mmap,
 };
 EXPORT_SYMBOL_GPL(fuse_dev_operations);
 
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 44ff23ce5ebf..ade341d86c03 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -496,3 +496,52 @@ void fuse_uring_ring_destruct(struct fuse_conn *fc)
 	fc->ring.queue_depth = 0;
 	fc->ring.nr_queues = 0;
 }
+
+/**
+ * fuse uring mmap, per ring queue. The queue is identified by the offset
+ * parameter
+ */
+int fuse_uring_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	struct fuse_dev *fud = fuse_get_dev(filp);
+	struct fuse_conn *fc = fud->fc;
+	size_t sz = vma->vm_end - vma->vm_start;
+	unsigned int qid = 0;
+	int ret;
+	loff_t off;
+	struct fuse_ring_queue *queue;
+
+	/* check if uring is configured and if the requested size matches */
+	if (fc->ring.nr_queues == 0 || fc->ring.queue_depth == 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (sz != fc->ring.queue_buf_size) {
+		ret = -EINVAL;
+		pr_devel("mmap size mismatch, expected %zu got %zu\n",
+			 fc->ring.queue_buf_size, sz);
+		goto out;
+	}
+
+	/* XXX: Enforce a cloned session per ring and assign fud per queue
+	 * and use fud as key to find the right queue?
+	 */
+	off = (vma->vm_pgoff << PAGE_SHIFT) / PAGE_SIZE;
+	qid = off / (fc->ring.queue_depth);
+
+	queue = fuse_uring_get_queue(fc, qid);
+
+	if (queue == NULL) {
+		pr_devel("fuse uring mmap: invalid qid=%u\n", qid);
+		return -ERANGE;
+	}
+
+	ret = remap_vmalloc_range(vma, queue->queue_req_buf, 0);
+out:
+	pr_devel("%s: pid %d qid: %u addr: %p sz: %zu  ret: %d\n",
+		 __func__, current->pid, qid, (char *)vma->vm_start,
+		 sz, ret);
+
+	return ret;
+}
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index d5cb9bdca64e..4032dccca8b6 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -12,6 +12,7 @@
 void fuse_uring_end_requests(struct fuse_conn *fc);
 int fuse_uring_ioctl(struct file *file, struct fuse_uring_cfg *cfg);
 void fuse_uring_ring_destruct(struct fuse_conn *fc);
+int fuse_uring_mmap(struct file *filp, struct vm_area_struct *vma);
 #endif
 
 
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 08/13] fuse: Move request bits
  2023-03-21  1:10 [RFC PATCH 00/13] fuse uring communication Bernd Schubert
                   ` (6 preceding siblings ...)
  2023-03-21  1:10 ` [PATCH 07/13] fuse: Add uring mmap method Bernd Schubert
@ 2023-03-21  1:10 ` Bernd Schubert
  2023-03-21  1:10 ` [PATCH 09/13] fuse: Add wait stop ioctl support to the ring Bernd Schubert
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Bernd Schubert @ 2023-03-21  1:10 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dharmendra Singh, Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	fuse-devel

These are needed by dev_uring functions as well

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: linux-fsdevel@vger.kernel.org
cc: Amir Goldstein <amir73il@gmail.com>
cc: fuse-devel@lists.sourceforge.net
---
 fs/fuse/dev.c        | 4 ----
 fs/fuse/fuse_dev_i.h | 4 ++++
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 256936af4f2e..4e79cdba540c 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -27,10 +27,6 @@
 MODULE_ALIAS_MISCDEV(FUSE_MINOR);
 MODULE_ALIAS("devname:fuse");
 
-/* Ordinary requests have even IDs, while interrupts IDs are odd */
-#define FUSE_INT_REQ_BIT (1ULL << 0)
-#define FUSE_REQ_ID_STEP (1ULL << 1)
-
 static struct kmem_cache *fuse_req_cachep;
 
 static void fuse_request_init(struct fuse_mount *fm, struct fuse_req *req)
diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
index f623a85c4c24..aea1f3f7aa5d 100644
--- a/fs/fuse/fuse_dev_i.h
+++ b/fs/fuse/fuse_dev_i.h
@@ -7,6 +7,10 @@
 #ifndef _FS_FUSE_DEV_I_H
 #define _FS_FUSE_DEV_I_H
 
+/* Ordinary requests have even IDs, while interrupts IDs are odd */
+#define FUSE_INT_REQ_BIT (1ULL << 0)
+#define FUSE_REQ_ID_STEP (1ULL << 1)
+
 static inline struct fuse_dev *fuse_get_dev(struct file *file)
 {
 	/*
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 09/13] fuse: Add wait stop ioctl support to the ring
  2023-03-21  1:10 [RFC PATCH 00/13] fuse uring communication Bernd Schubert
                   ` (7 preceding siblings ...)
  2023-03-21  1:10 ` [PATCH 08/13] fuse: Move request bits Bernd Schubert
@ 2023-03-21  1:10 ` Bernd Schubert
  2023-03-21  1:10 ` [PATCH 10/13] fuse: Handle SQEs - register commands Bernd Schubert
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Bernd Schubert @ 2023-03-21  1:10 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dharmendra Singh, Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	fuse-devel

This is an optional ioctl to avoid running the stop monitor
(delayed workq) in intervals at run time - it saves cpu cycles.
When the FUSE_DEV_IOC_URING ioctl with subcommand
FUSE_URING_IOCTL_CMD_WAIT is received, it cancels the stop monitor
(delayed workq) and then goes into an interruptible waitq - on
process termination it gets woken up and schedules the stop monitor
again.
As the submitting thread waits forever in that waitq, the userspace
daemon has to create a separate thread for it.

The additional ioctl subcommand FUSE_URING_IOCTL_CMD_STOP exists
to let userspace explicitly initiate fuse uring shutdown and to wake up
the thread waiting in FUSE_URING_IOCTL_CMD_WAIT.
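
For illustration, a daemon-side sketch (the helper names are placeholders
and only the cmd field of struct fuse_uring_cfg is shown being used; the
exact uapi layout comes from the first patch of this series):

#include <sys/ioctl.h>
#include <string.h>

/* dedicated thread: blocks in the kernel until the ring is stopped */
static int example_wait_for_ring_stop(int fuse_dev_fd)
{
	struct fuse_uring_cfg cfg;

	memset(&cfg, 0, sizeof(cfg));
	cfg.cmd = FUSE_URING_IOCTL_CMD_WAIT;

	return ioctl(fuse_dev_fd, FUSE_DEV_IOC_URING, &cfg);
}

/* any other thread: explicitly initiate shutdown and wake the waiter */
static int example_request_ring_stop(int fuse_dev_fd)
{
	struct fuse_uring_cfg cfg;

	memset(&cfg, 0, sizeof(cfg));
	cfg.cmd = FUSE_URING_IOCTL_CMD_STOP;

	return ioctl(fuse_dev_fd, FUSE_DEV_IOC_URING, &cfg);
}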

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: linux-fsdevel@vger.kernel.org
cc: Amir Goldstein <amir73il@gmail.com>
cc: fuse-devel@lists.sourceforge.net
---
 fs/fuse/dev_uring.c | 47 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index ade341d86c03..e19c652e7071 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -425,6 +425,49 @@ static int fuse_uring_cfg(struct fuse_conn *fc, unsigned int qid,
 	return rc;
 }
 
+/**
+ * Wait until uring shutdown is requested, then release uring resources
+ */
+static int fuse_uring_wait_stop(struct fuse_conn *fc)
+{
+	struct fuse_iqueue *fiq = &fc->iq;
+
+	pr_devel("%s stop_requested=%d", __func__, fc->ring.stop_requested);
+
+	if (fc->ring.stop_requested)
+		return -EINTR;
+
+	/* This userspace thread can stop uring on process stop, no need
+	 * for the interval worker
+	 */
+	pr_devel("%s cancel stop monitor\n", __func__);
+	cancel_delayed_work_sync(&fc->ring.stop_monitor);
+
+	wait_event_interruptible(fc->ring.stop_waitq,
+				 !fiq->connected ||
+				 fc->ring.stop_requested);
+
+	/* The userspace task goes back to userspace, so we need
+	 * the interval worker again. It runs immediately for quick cleanup
+	 * in shutdown/process kill.
+	 */
+
+	mutex_lock(&fc->ring.start_stop_lock);
+	if (!fc->ring.queues_stopped)
+		mod_delayed_work(system_wq, &fc->ring.stop_monitor, 0);
+	mutex_unlock(&fc->ring.start_stop_lock);
+
+	return 0;
+}
+
+static int fuse_uring_shutdown_wakeup(struct fuse_conn *fc)
+{
+	fc->ring.stop_requested = 1;
+	wake_up_all(&fc->ring.stop_waitq);
+
+	return 0;
+}
+
 int fuse_uring_ioctl(struct file *file, struct fuse_uring_cfg *cfg)
 {
 	struct fuse_dev *fud = fuse_get_dev(file);
@@ -443,6 +486,10 @@ int fuse_uring_ioctl(struct file *file, struct fuse_uring_cfg *cfg)
 	switch (cfg->cmd) {
 	case FUSE_URING_IOCTL_CMD_QUEUE_CFG:
 		return fuse_uring_cfg(fc, cfg->qid, cfg);
+	case FUSE_URING_IOCTL_CMD_WAIT:
+		return fuse_uring_wait_stop(fc);
+	case FUSE_URING_IOCTL_CMD_STOP:
+		return fuse_uring_shutdown_wakeup(fc);
 	default:
 		return -EINVAL;
 	}
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 10/13] fuse: Handle SQEs - register commands
  2023-03-21  1:10 [RFC PATCH 00/13] fuse uring communication Bernd Schubert
                   ` (8 preceding siblings ...)
  2023-03-21  1:10 ` [PATCH 09/13] fuse: Add wait stop ioctl support to the ring Bernd Schubert
@ 2023-03-21  1:10 ` Bernd Schubert
  2023-03-21  1:10 ` [PATCH 11/13] fuse: Add support to copy from/to the ring buffer Bernd Schubert
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Bernd Schubert @ 2023-03-21  1:10 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dharmendra Singh, Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	fuse-devel

This adds basic support for ring SQEs (with opcode=IORING_OP_URING_CMD).
For now only FUSE_URING_REQ_FETCH is handled to register queue entries.
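
For illustration, a daemon-side sketch of registering one queue entry
(uses liburing; the ring must have been created with IORING_SETUP_SQE128
and struct fuse_uring_cmd_req with its qid/tag fields is taken from the
uapi header added earlier in this series):

#include <liburing.h>
#include <string.h>

static int example_register_entry(struct io_uring *ring, int fuse_dev_fd,
				  unsigned int qid, unsigned int tag)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct fuse_uring_cmd_req req;

	if (!sqe)
		return -EAGAIN;

	memset(&req, 0, sizeof(req));
	req.qid = qid;
	req.tag = tag;

	io_uring_prep_nop(sqe);			/* zeroes the sqe fields */
	sqe->opcode = IORING_OP_URING_CMD;
	sqe->fd = fuse_dev_fd;
	sqe->cmd_op = FUSE_URING_REQ_FETCH;
	memcpy(sqe->cmd, &req, sizeof(req));	/* payload area of the big sqe */

	return io_uring_submit(ring);
}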

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: linux-fsdevel@vger.kernel.org
cc: Amir Goldstein <amir73il@gmail.com>
cc: fuse-devel@lists.sourceforge.net
---
 fs/fuse/dev_uring.c | 240 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 240 insertions(+)

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index e19c652e7071..744a38064131 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -81,6 +81,44 @@ void fuse_uring_end_requests(struct fuse_conn *fc)
 	}
 }
 
+/*
+ * Release a ring request - it is no longer needed and can handle
+ * new data
+ */
+static void fuse_uring_ent_release(struct fuse_ring_ent *ring_ent,
+					   struct fuse_ring_queue *queue,
+					   bool bg)
+__must_hold(&queue->lock)
+{
+	struct fuse_conn *fc = queue->fc;
+
+	/* unsets all previous flags - basically resets */
+	pr_devel("%s fc=%p qid=%d tag=%d state=%llu bg=%d\n",
+		__func__, fc, ring_ent->queue->qid, ring_ent->tag,
+		ring_ent->state, bg);
+
+	if (ring_ent->state & FRRS_USERSPACE) {
+		pr_warn("%s qid=%d tag=%d state=%llu is_bg=%d\n",
+			__func__, ring_ent->queue->qid, ring_ent->tag,
+			ring_ent->state, bg);
+		WARN_ON(1);
+		return;
+	}
+
+	fuse_uring_bit_set(ring_ent, bg, __func__);
+
+	/* Check if this call comes through the shutdown/release task and
+	 * the request is already about to be released - the state must not
+	 * be reset then, as state FRRS_FUSE_WAIT would introduce a double
+	 * io_uring_cmd_done
+	 */
+	if (ring_ent->state & FRRS_FREEING)
+		return;
+
+	/* Note: the bit in req->flag got already cleared in fuse_request_end */
+	ring_ent->rreq->flags = 0;
+	ring_ent->state = FRRS_FUSE_WAIT;
+}
 /**
  * Simplified ring-entry release function, for shutdown only
  */
@@ -592,3 +630,205 @@ int fuse_uring_mmap(struct file *filp, struct vm_area_struct *vma)
 
 	return ret;
 }
+
+/*
+ * fuse_uring_req_fetch command handling
+ */
+static int fuse_uring_fetch(struct fuse_ring_ent *ring_ent,
+			    struct io_uring_cmd *cmd)
+{
+	struct fuse_ring_queue *queue = ring_ent->queue;
+	struct fuse_conn *fc = queue->fc;
+	int ret;
+	bool is_bg = false;
+	int nr_queue_init = 0;
+
+	spin_lock(&queue->lock);
+
+	/* register requests for foreground requests first, then backgrounds */
+	if (queue->req_fg >= fc->ring.max_fg)
+		is_bg = true;
+	fuse_uring_ent_release(ring_ent, queue, is_bg);
+
+	/* daemon side registered all requests, this queue is complete */
+	if (queue->req_fg + queue->req_bg == fc->ring.queue_depth)
+		nr_queue_init =
+			atomic_inc_return(&fc->ring.nr_queues_cmd_init);
+
+	ret = 0;
+	if (queue->req_fg + queue->req_bg > fc->ring.queue_depth) {
+		/* should have been caught by the ring state and queue
+		 * depth checks before
+		 */
+		WARN_ON(1);
+		pr_info("qid=%d tag=%d req cnt (fg=%d bg=%d) exceeds depth=%zu\n",
+			queue->qid, ring_ent->tag, queue->req_fg,
+			queue->req_bg, fc->ring.queue_depth);
+		ret = -ERANGE;
+
+		/* avoid completion through fuse_req_end, as there is no
+		 * fuse req assigned yet
+		 */
+		ring_ent->state = FRRS_INIT;
+	}
+
+	pr_devel("%s:%d qid=%d tag=%d nr-fg=%d nr-bg=%d nr_queue_init=%d\n",
+		__func__, __LINE__,
+		queue->qid, ring_ent->tag, queue->req_fg, queue->req_bg,
+		nr_queue_init);
+
+	spin_unlock(&queue->lock);
+	if (ret)
+		goto out; /* erange */
+
+	WRITE_ONCE(ring_ent->cmd, cmd);
+
+	if (nr_queue_init == fc->ring.nr_queues)
+		fc->ring.ready = 1;
+
+out:
+	return ret;
+}
+
+struct fuse_ring_queue *
+fuse_uring_get_verify_queue(struct fuse_conn *fc,
+			 const struct fuse_uring_cmd_req *cmd_req,
+			 unsigned int issue_flags)
+{
+	struct fuse_ring_queue *queue;
+	int ret;
+
+	if (!(issue_flags & IO_URING_F_SQE128)) {
+		pr_info("qid=%d tag=%d SQE128 not set\n",
+			cmd_req->qid, cmd_req->tag);
+		ret = -EINVAL;
+		goto err;
+	}
+
+	if (unlikely(fc->ring.stop_requested)) {
+		ret = -ENOTCONN;
+		goto err;
+	}
+
+	if (unlikely(!fc->ring.configured)) {
+		pr_info("command for a connection that is not ring configured\n");
+		ret = -ENODEV;
+		goto err;
+	}
+
+	if (unlikely(cmd_req->qid >= fc->ring.nr_queues)) {
+		pr_devel("qid=%u >= nr-queues=%zu\n",
+			cmd_req->qid, fc->ring.nr_queues);
+		ret = -EINVAL;
+		goto err;
+	}
+
+	queue = fuse_uring_get_queue(fc, cmd_req->qid);
+	if (unlikely(queue == NULL)) {
+		pr_info("Got NULL queue for qid=%d\n", cmd_req->qid);
+		ret = -EIO;
+		goto err;
+	}
+
+	if (unlikely(!fc->ring.configured || !queue->configured ||
+		     queue->aborted)) {
+		pr_info("Ring or queue (qid=%u) not ready.\n", cmd_req->qid);
+		ret = -ENOTCONN;
+		goto err;
+	}
+
+	if (cmd_req->tag >= fc->ring.queue_depth) {
+		pr_info("tag=%u >= queue-depth=%zu\n",
+			cmd_req->tag, fc->ring.queue_depth);
+		ret = -EINVAL;
+		goto err;
+	}
+
+	return queue;
+
+err:
+	return ERR_PTR(ret);
+}
+
+/**
+ * Entry function from io_uring to handle the given passthrough command
+ * (op code IORING_OP_URING_CMD)
+ */
+int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
+{
+	const struct fuse_uring_cmd_req *cmd_req =
+		(struct fuse_uring_cmd_req *)cmd->cmd;
+	struct fuse_dev *fud = fuse_get_dev(cmd->file);
+	struct fuse_conn *fc = fud->fc;
+	struct fuse_ring_queue *queue;
+	struct fuse_ring_ent *ring_ent = NULL;
+	u32 cmd_op = cmd->cmd_op;
+	u64 prev_state;
+	int ret = 0;
+
+	queue = fuse_uring_get_verify_queue(fc, cmd_req, issue_flags);
+	if (IS_ERR(queue)) {
+		ret = PTR_ERR(queue);
+		goto out;
+	}
+
+	ring_ent = &queue->ring_ent[cmd_req->tag];
+
+	pr_devel("%s:%d received: cmd op %d qid %d (%p) tag %d  (%p)\n",
+		 __func__, __LINE__,
+		 cmd_op, cmd_req->qid, queue, cmd_req->tag, ring_ent);
+
+	spin_lock(&queue->lock);
+	if (unlikely(queue->aborted)) {
+		/* XXX how to ensure queue still exists? Add
+		 * an rw fc->ring.stop lock? And take that at the beginning
+		 * of this function? Better would be to advise uring
+		 * not to call this function at all? Or free the queue memory
+		 * only, on daemon PF_EXITING?
+		 */
+		ret = -ENOTCONN;
+		spin_unlock(&queue->lock);
+		goto out;
+	}
+
+	prev_state = ring_ent->state;
+	ring_ent->state |= FRRS_FUSE_FETCH_COMMIT;
+	ring_ent->state &= ~FRRS_USERSPACE;
+	ring_ent->need_cmd_done = 1;
+	spin_unlock(&queue->lock);
+
+	switch (cmd_op) {
+	case FUSE_URING_REQ_FETCH:
+		if (prev_state != FRRS_INIT) {
+			pr_info_ratelimited("register req state %llu expected %d",
+					    prev_state, FRRS_INIT);
+			ret = -EINVAL;
+			goto out;
+
+			/* XXX error injection or test with malicious daemon */
+		}
+
+		ret = fuse_uring_fetch(ring_ent, cmd);
+		break;
+	default:
+		ret = -EINVAL;
+		pr_devel("Unknown uring command %d", cmd_op);
+		goto out;
+	}
+
+out:
+	pr_devel("uring cmd op=%d, qid=%d tag=%d ret=%d\n",
+		 cmd_op, cmd_req->qid, cmd_req->tag, ret);
+
+	if (ret < 0) {
+		if (ring_ent != NULL) {
+			spin_lock(&queue->lock);
+			ring_ent->state |= (FRRS_CMD_ERR | FRRS_USERSPACE);
+			ring_ent->need_cmd_done = 0;
+			spin_unlock(&queue->lock);
+		}
+		io_uring_cmd_done(cmd, ret, 0);
+	}
+
+	return -EIOCBQUEUED;
+}
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 11/13] fuse: Add support to copy from/to the ring buffer
  2023-03-21  1:10 [RFC PATCH 00/13] fuse uring communication Bernd Schubert
                   ` (9 preceding siblings ...)
  2023-03-21  1:10 ` [PATCH 10/13] fuse: Handle SQEs - register commands Bernd Schubert
@ 2023-03-21  1:10 ` Bernd Schubert
  2023-03-21  1:10 ` [PATCH 12/13] fuse: Add uring sqe commit and fetch support Bernd Schubert
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Bernd Schubert @ 2023-03-21  1:10 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dharmendra Singh, Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	fuse-devel

This adds support to the existing fuse copy code to copy
from/to the ring buffer. The ring buffer here is mmap'ed and
shared between kernel and userspace.

This also adds the fuse_ prefix to the copy_out_args function.
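
For illustration, this is essentially how the new state is meant to be
filled (a condensed version of what the SQE commit and fetch patch later
in this series does; req_arg_len corresponds to the configured
per-request argument buffer size):

static int example_copy_req_to_ring(struct fuse_req *req,
				    struct fuse_ring_req *rreq,
				    unsigned int req_arg_len)
{
	struct fuse_copy_state cs;
	struct fuse_args *args = req->args;

	fuse_copy_init(&cs, 1, NULL);		/* write to the buffer */
	cs.is_uring = 1;
	cs.ring.buf = rreq->in_out_arg;		/* mmap'ed shared memory */
	cs.ring.buf_sz = req_arg_len;
	cs.req = req;

	return fuse_copy_args(&cs, args->in_numargs, args->in_pages,
			      (struct fuse_arg *)args->in_args, 0);
}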

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: linux-fsdevel@vger.kernel.org
cc: Amir Goldstein <amir73il@gmail.com>
cc: fuse-devel@lists.sourceforge.net
---
 fs/fuse/dev.c        | 60 ++++++++++++++++++++++++++------------------
 fs/fuse/fuse_dev_i.h | 35 ++++++++++++++++++++++++++
 2 files changed, 70 insertions(+), 25 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 4e79cdba540c..de9193f66c8b 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -628,21 +628,7 @@ static int unlock_request(struct fuse_req *req)
 	return err;
 }
 
-struct fuse_copy_state {
-	int write;
-	struct fuse_req *req;
-	struct iov_iter *iter;
-	struct pipe_buffer *pipebufs;
-	struct pipe_buffer *currbuf;
-	struct pipe_inode_info *pipe;
-	unsigned long nr_segs;
-	struct page *pg;
-	unsigned len;
-	unsigned offset;
-	unsigned move_pages:1;
-};
-
-static void fuse_copy_init(struct fuse_copy_state *cs, int write,
+void fuse_copy_init(struct fuse_copy_state *cs, int write,
 			   struct iov_iter *iter)
 {
 	memset(cs, 0, sizeof(*cs));
@@ -653,6 +639,7 @@ static void fuse_copy_init(struct fuse_copy_state *cs, int write,
 /* Unmap and put previous page of userspace buffer */
 static void fuse_copy_finish(struct fuse_copy_state *cs)
 {
+
 	if (cs->currbuf) {
 		struct pipe_buffer *buf = cs->currbuf;
 
@@ -717,6 +704,10 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
 			cs->pipebufs++;
 			cs->nr_segs++;
 		}
+	} else if (cs->is_uring) {
+		if (cs->ring.offset > cs->ring.buf_sz)
+			return -ERANGE;
+		cs->len = cs->ring.buf_sz - cs->ring.offset;
 	} else {
 		size_t off;
 		err = iov_iter_get_pages2(cs->iter, &page, PAGE_SIZE, 1, &off);
@@ -735,21 +726,35 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
 static int fuse_copy_do(struct fuse_copy_state *cs, void **val, unsigned *size)
 {
 	unsigned ncpy = min(*size, cs->len);
+
 	if (val) {
-		void *pgaddr = kmap_local_page(cs->pg);
-		void *buf = pgaddr + cs->offset;
+
+		void *pgaddr = NULL;
+		void *buf;
+
+		if (cs->is_uring) {
+			buf = cs->ring.buf + cs->ring.offset;
+			cs->ring.offset += ncpy;
+
+		} else {
+			pgaddr = kmap_local_page(cs->pg);
+			buf = pgaddr + cs->offset;
+		}
 
 		if (cs->write)
 			memcpy(buf, *val, ncpy);
 		else
 			memcpy(*val, buf, ncpy);
 
-		kunmap_local(pgaddr);
+		if (pgaddr)
+			kunmap_local(pgaddr);
+
 		*val += ncpy;
 	}
 	*size -= ncpy;
 	cs->len -= ncpy;
 	cs->offset += ncpy;
+
 	return ncpy;
 }
 
@@ -997,9 +1002,9 @@ static int fuse_copy_one(struct fuse_copy_state *cs, void *val, unsigned size)
 }
 
 /* Copy request arguments to/from userspace buffer */
-static int fuse_copy_args(struct fuse_copy_state *cs, unsigned numargs,
-			  unsigned argpages, struct fuse_arg *args,
-			  int zeroing)
+int fuse_copy_args(struct fuse_copy_state *cs, unsigned int numargs,
+		   unsigned int argpages, struct fuse_arg *args,
+		   int zeroing)
 {
 	int err = 0;
 	unsigned i;
@@ -1806,10 +1811,15 @@ static struct fuse_req *request_find(struct fuse_pqueue *fpq, u64 unique)
 	return NULL;
 }
 
-static int copy_out_args(struct fuse_copy_state *cs, struct fuse_args *args,
-			 unsigned nbytes)
+int fuse_copy_out_args(struct fuse_copy_state *cs, struct fuse_args *args,
+		       unsigned int nbytes)
 {
-	unsigned reqsize = sizeof(struct fuse_out_header);
+
+	unsigned int reqsize = 0;
+
+	/* Uring has the out header outside of args */
+	if (!cs->is_uring)
+		reqsize = sizeof(struct fuse_out_header);
 
 	reqsize += fuse_len_args(args->out_numargs, args->out_args);
 
@@ -1909,7 +1919,7 @@ static ssize_t fuse_dev_do_write(struct fuse_dev *fud,
 	if (oh.error)
 		err = nbytes != sizeof(oh) ? -EINVAL : 0;
 	else
-		err = copy_out_args(cs, req->args, nbytes);
+		err = fuse_copy_out_args(cs, req->args, nbytes);
 	fuse_copy_finish(cs);
 
 	spin_lock(&fpq->lock);
diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
index aea1f3f7aa5d..ccd128f81628 100644
--- a/fs/fuse/fuse_dev_i.h
+++ b/fs/fuse/fuse_dev_i.h
@@ -11,6 +11,33 @@
 #define FUSE_INT_REQ_BIT (1ULL << 0)
 #define FUSE_REQ_ID_STEP (1ULL << 1)
 
+struct fuse_copy_state {
+	int write;
+	struct fuse_req *req;
+	struct iov_iter *iter;
+	struct pipe_buffer *pipebufs;
+	struct pipe_buffer *currbuf;
+	struct pipe_inode_info *pipe;
+	unsigned long nr_segs;
+	struct page *pg;
+	unsigned int len;
+	unsigned int offset;
+	unsigned int move_pages:1, is_uring:1;
+	struct {
+		/* pointer into the ring buffer */
+		char *buf;
+
+		/* for copy to the ring request buffer, the buffer size - must
+		 * not be exceeded, for copy from the ring request buffer,
+		 * the size filled in by user space
+		 */
+		unsigned int buf_sz;
+
+		/* offset within buf while it is copying from/to the buf */
+		unsigned int offset;
+	} ring;
+};
+
 static inline struct fuse_dev *fuse_get_dev(struct file *file)
 {
 	/*
@@ -22,6 +49,14 @@ static inline struct fuse_dev *fuse_get_dev(struct file *file)
 
 void fuse_dev_end_requests(struct list_head *head);
 
+void fuse_copy_init(struct fuse_copy_state *cs, int write,
+			   struct iov_iter *iter);
+int fuse_copy_args(struct fuse_copy_state *cs, unsigned int numargs,
+		   unsigned int argpages, struct fuse_arg *args,
+		   int zeroing);
+int fuse_copy_out_args(struct fuse_copy_state *cs, struct fuse_args *args,
+		       unsigned int nbytes);
+
 #endif
 
 
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 12/13] fuse: Add uring sqe commit and fetch support
  2023-03-21  1:10 [RFC PATCH 00/13] fuse uring communication Bernd Schubert
                   ` (10 preceding siblings ...)
  2023-03-21  1:10 ` [PATCH 11/13] fuse: Add support to copy from/to the ring buffer Bernd Schubert
@ 2023-03-21  1:10 ` Bernd Schubert
  2023-03-21  1:10 ` [PATCH 13/13] fuse: Allow to queue to the ring Bernd Schubert
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Bernd Schubert @ 2023-03-21  1:10 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dharmendra Singh, Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	fuse-devel

This adds support for fuse request completion through ring SQEs
(FUSE_URING_REQ_COMMIT_AND_FETCH handling). After committing,
the ring entry becomes available for new fuse requests.
Handling of requests through the ring (SQE/CQE handling)
is complete now.

Fuse request data are copied through the mmap'ed ring buffer;
there is no support for any zero copy yet.
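
For illustration, a rough daemon-side sketch of the resulting per-entry
flow. handle_fuse_request() stands in for the file system's request
dispatch and example_cmd_sqe() for an SQE setup helper like the one
sketched in the note to the "Handle SQEs - register commands" patch,
just with a selectable command op - both are placeholders, not part of
these patches. A real daemon also has to demultiplex completions of all
entries of a queue via the cqe user_data instead of looping per entry:

static int example_entry_loop(struct io_uring *ring, int fuse_dev_fd,
			      struct fuse_ring_req *rreq,
			      unsigned int qid, unsigned int tag)
{
	struct io_uring_cqe *cqe;
	int ret;

	for (;;) {
		/* completion means the kernel filled the shared buffer */
		ret = io_uring_wait_cqe(ring, &cqe);
		if (ret)
			return ret;
		io_uring_cqe_seen(ring, cqe);

		/* read the request from rreq, write the reply into
		 * rreq->out / rreq->in_out_arg
		 */
		handle_fuse_request(rreq);

		/* commit the reply and make the entry available again */
		ret = example_cmd_sqe(ring, fuse_dev_fd, qid, tag,
				      FUSE_URING_REQ_COMMIT_AND_FETCH);
		if (ret < 0)
			return ret;
	}
}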

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: linux-fsdevel@vger.kernel.org
cc: Amir Goldstein <amir73il@gmail.com>
cc: fuse-devel@lists.sourceforge.net
---
 fs/fuse/dev.c         |   1 +
 fs/fuse/dev_uring.c   | 408 +++++++++++++++++++++++++++++++++++++++++-
 fs/fuse/dev_uring_i.h |   1 +
 3 files changed, 401 insertions(+), 9 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index de9193f66c8b..cce55eaed8a3 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2330,6 +2330,7 @@ const struct file_operations fuse_dev_operations = {
 	.unlocked_ioctl = fuse_dev_ioctl,
 	.compat_ioctl   = compat_ptr_ioctl,
 	.mmap		= fuse_uring_mmap,
+	.uring_cmd	= fuse_uring_cmd,
 };
 EXPORT_SYMBOL_GPL(fuse_dev_operations);
 
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 744a38064131..5c41f9f71410 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -34,6 +34,9 @@ module_param(enable_uring, bool, 0644);
 MODULE_PARM_DESC(enable_uring,
 	"Enable uring userspace communication through uring.");
 
+static bool fuse_uring_ent_release_and_fetch(struct fuse_ring_ent *ring_ent);
+static void fuse_uring_send_to_ring(struct fuse_ring_ent *ring_ent);
+
 static struct fuse_ring_queue *
 fuse_uring_get_queue(struct fuse_conn *fc, int qid)
 {
@@ -47,15 +50,6 @@ fuse_uring_get_queue(struct fuse_conn *fc, int qid)
 	return (struct fuse_ring_queue *)(ptr + qid * fc->ring.queue_size);
 }
 
-/* dummy function will be replaced in later commits */
-static void fuse_uring_bit_set(struct fuse_ring_ent *ent, bool is_bg,
-			       const char *str)
-{
-	(void)ent;
-	(void)is_bg;
-	(void)str;
-}
-
 /* Abort all list queued request on the given ring queue */
 static void fuse_uring_end_queue_requests(struct fuse_ring_queue *queue)
 {
@@ -81,6 +75,363 @@ void fuse_uring_end_requests(struct fuse_conn *fc)
 	}
 }
 
+/*
+ * Finalize a fuse request, then fetch and send the next entry, if available.
+ *
+ * Takes/drops the queue lock to avoid holding it across fuse_request_end.
+ */
+static void
+fuse_uring_req_end_and_get_next(struct fuse_ring_ent *ring_ent, bool set_err,
+				int error)
+{
+	bool already = false;
+	struct fuse_req *req = ring_ent->fuse_req;
+	bool send;
+
+	spin_lock(&ring_ent->queue->lock);
+	if (ring_ent->state & FRRS_FUSE_REQ_END || !ring_ent->need_req_end)
+		already = true;
+	else {
+		ring_ent->state |= FRRS_FUSE_REQ_END;
+		ring_ent->need_req_end = 0;
+	}
+	spin_unlock(&ring_ent->queue->lock);
+
+	if (already) {
+		struct fuse_ring_queue *queue = ring_ent->queue;
+
+		if (!queue->aborted) {
+			pr_info("request end not needed state=%llu end-bit=%d\n",
+				ring_ent->state, ring_ent->need_req_end);
+			WARN_ON(1);
+		}
+		return;
+	}
+
+	if (set_err)
+		req->out.h.error = error;
+
+	fuse_request_end(ring_ent->fuse_req);
+	ring_ent->fuse_req = NULL;
+
+	send = fuse_uring_ent_release_and_fetch(ring_ent);
+	if (send)
+		fuse_uring_send_to_ring(ring_ent);
+}
+
+/*
+ * Copy data from the req to the ring buffer
+ */
+static int fuse_uring_copy_to_ring(struct fuse_conn *fc,
+				   struct fuse_req *req,
+				   struct fuse_ring_req *rreq)
+{
+	struct fuse_copy_state cs;
+	struct fuse_args *args = req->args;
+	int err;
+
+	fuse_copy_init(&cs, 1, NULL);
+	cs.is_uring = 1;
+	cs.ring.buf = rreq->in_out_arg;
+	cs.ring.buf_sz = fc->ring.req_arg_len;
+	cs.req = req;
+
+	pr_devel("%s:%d buf=%p len=%d args=%d\n", __func__, __LINE__,
+		 cs.ring.buf, cs.ring.buf_sz, args->out_numargs);
+
+	err = fuse_copy_args(&cs, args->in_numargs, args->in_pages,
+			     (struct fuse_arg *) args->in_args, 0);
+	rreq->in_out_arg_len = cs.ring.offset;
+
+	pr_devel("%s:%d buf=%p len=%d args=%d err=%d\n", __func__, __LINE__,
+		 cs.ring.buf, cs.ring.buf_sz, args->out_numargs, err);
+
+	return err;
+}
+
+/*
+ * Copy data from the ring buffer to the fuse request
+ */
+static int fuse_uring_copy_from_ring(struct fuse_conn *fc,
+				     struct fuse_req *req,
+				     struct fuse_ring_req *rreq)
+{
+	struct fuse_copy_state cs;
+	struct fuse_args *args = req->args;
+
+	fuse_copy_init(&cs, 0, NULL);
+	cs.is_uring = 1;
+	cs.ring.buf = rreq->in_out_arg;
+
+	if (rreq->in_out_arg_len > fc->ring.req_arg_len) {
+		pr_devel("Max ring buffer len exceeded (%u vs %zu)\n",
+			 rreq->in_out_arg_len,  fc->ring.req_arg_len);
+		return -EINVAL;
+	}
+	cs.ring.buf_sz = rreq->in_out_arg_len;
+	cs.req = req;
+
+	pr_devel("%s:%d buf=%p len=%d args=%d\n", __func__, __LINE__,
+		 cs.ring.buf, cs.ring.buf_sz, args->out_numargs);
+
+	return fuse_copy_out_args(&cs, args, rreq->in_out_arg_len);
+}
+
+/**
+ * Write data to the ring buffer and send the request to userspace,
+ * userspace will read it
+ * This is comparable with classical read(/dev/fuse)
+ */
+static void fuse_uring_send_to_ring(struct fuse_ring_ent *ring_ent)
+{
+	struct fuse_conn *fc = ring_ent->queue->fc;
+	struct fuse_ring_req *rreq = ring_ent->rreq;
+	struct fuse_req *req = ring_ent->fuse_req;
+	int err = 0;
+
+	pr_devel("%s:%d ring-req=%p fuse_req=%p state=%llu args=%p\n", __func__,
+		 __LINE__, ring_ent, ring_ent->fuse_req, ring_ent->state, req->args);
+
+	spin_lock(&ring_ent->queue->lock);
+	if (unlikely((ring_ent->state & FRRS_USERSPACE) ||
+		     (ring_ent->state & FRRS_FREED))) {
+		pr_err("ring-req=%p buf_req=%p invalid state %llu on send\n",
+		       ring_ent, rreq, ring_ent->state);
+		WARN_ON(1);
+		err = -EIO;
+	} else
+		ring_ent->state |= FRRS_USERSPACE;
+
+	ring_ent->need_cmd_done = 0;
+	spin_unlock(&ring_ent->queue->lock);
+	if (err)
+		goto err;
+
+	err = fuse_uring_copy_to_ring(fc, req, rreq);
+	if (unlikely(err)) {
+		ring_ent->state &= ~FRRS_USERSPACE;
+		ring_ent->need_cmd_done = 1;
+		goto err;
+	}
+
+	/* ring req go directly into the shared memory buffer */
+	rreq->in = req->in.h;
+
+	pr_devel("%s qid=%d tag=%d state=%llu cmd-done op=%d unique=%llu\n",
+		__func__, ring_ent->queue->qid, ring_ent->tag, ring_ent->state,
+		rreq->in.opcode, rreq->in.unique);
+
+	io_uring_cmd_done(ring_ent->cmd, 0, 0);
+	return;
+
+err:
+	fuse_uring_req_end_and_get_next(ring_ent, true, err);
+}
+
+/**
+ * Set the given ring entry as available in the queue bitmap
+ */
+static void fuse_uring_bit_set(struct fuse_ring_ent *ring_ent, bool bg,
+			       const char *str)
+__must_hold(ring_ent->queue->lock)
+{
+	int old;
+	struct fuse_ring_queue *queue = ring_ent->queue;
+	const struct fuse_conn *fc = queue->fc;
+	int tag = ring_ent->tag;
+
+	old = test_and_set_bit(tag, queue->req_avail_map);
+	if (unlikely(old != 0)) {
+		pr_warn("%8s invalid bit value on clear for qid=%d tag=%d",
+			str, queue->qid, tag);
+		WARN_ON(1);
+	}
+	if (bg)
+		queue->req_bg++;
+	else
+		queue->req_fg++;
+
+	pr_devel("%35s bit set fc=%p is_bg=%d qid=%d tag=%d fg=%d bg=%d bgq: %d\n",
+		 str, fc, bg, queue->qid, ring_ent->tag, queue->req_fg,
+		 queue->req_bg, !list_empty(&queue->bg_queue));
+}
+
+/**
+ * Mark the ring entry as not available for other requests
+ */
+static int fuse_uring_bit_clear(struct fuse_ring_ent *ring_ent, int is_bg,
+				const char *str)
+__must_hold(ring_ent->queue->lock)
+{
+	int old;
+	struct fuse_ring_queue *queue = ring_ent->queue;
+	const struct fuse_conn *fc = queue->fc;
+	int tag = ring_ent->tag;
+	int *value = is_bg ? &queue->req_bg : &queue->req_fg;
+
+	if (unlikely(*value <= 0)) {
+		pr_warn("%s qid=%d tag=%d is_bg=%d zero req avail fg=%d bg=%d\n",
+			str, queue->qid, ring_ent->tag, is_bg,
+			queue->req_bg, queue->req_fg);
+		WARN_ON(1);
+		return -EINVAL;
+	}
+
+	old = test_and_clear_bit(tag, queue->req_avail_map);
+	if (unlikely(old != 1)) {
+		pr_warn("%8s invalid bit value on clear for qid=%d tag=%d",
+			str, queue->qid, tag);
+		WARN_ON(1);
+		return -EIO;
+	}
+
+	ring_ent->rreq->flags = 0;
+
+	if (is_bg) {
+		ring_ent->rreq->flags |= FUSE_RING_REQ_FLAG_BACKGROUND;
+		queue->req_bg--;
+	} else
+		queue->req_fg--;
+
+	pr_devel("%35s ring bit clear fc=%p is_bg=%d qid=%d tag=%d fg=%d bg=%d\n",
+		 str, fc, is_bg, queue->qid, ring_ent->tag,
+		 queue->req_fg, queue->req_bg);
+
+	ring_ent->state |= FRRS_FUSE_REQ;
+
+	return 0;
+}
+
+/*
+ * Assign a fuse queue entry to the given entry
+ *
+ */
+static bool fuse_uring_assign_ring_entry(struct fuse_ring_ent *ring_ent,
+					     struct list_head *head,
+					     int is_bg)
+__must_hold(&queue.waitq.lock)
+{
+	struct fuse_req *req;
+	int res;
+
+	if (list_empty(head))
+		return false;
+
+	res = fuse_uring_bit_clear(ring_ent, is_bg, __func__);
+	if (unlikely(res))
+		return false;
+
+	req = list_first_entry(head, struct fuse_req, list);
+	list_del_init(&req->list);
+	clear_bit(FR_PENDING, &req->flags);
+	ring_ent->fuse_req = req;
+	ring_ent->need_req_end = 1;
+
+	return true;
+}
+
+/*
+ * Checks for errors and stores it into the request
+ */
+static int fuse_uring_ring_ent_has_err(struct fuse_conn *fc,
+				       struct fuse_ring_ent *ring_ent)
+{
+	struct fuse_req *req = ring_ent->fuse_req;
+	struct fuse_out_header *oh = &req->out.h;
+	int err;
+
+	if (oh->unique == 0) {
+		/* Not supported through request-based uring, this needs another
+		 * ring from user space to kernel
+		 */
+		pr_warn("Unsupported fuse-notify\n");
+		err = -EINVAL;
+		goto seterr;
+	}
+
+	if (oh->error <= -512 || oh->error > 0) {
+		err = -EINVAL;
+		goto seterr;
+	}
+
+	if (oh->error) {
+		err = oh->error;
+		pr_devel("%s:%d err=%d op=%d req-ret=%d",
+			 __func__, __LINE__, err, req->args->opcode,
+			 req->out.h.error);
+		goto err; /* error already set */
+	}
+
+	if ((oh->unique & ~FUSE_INT_REQ_BIT) != req->in.h.unique) {
+
+		pr_warn("Unexpected seqno mismatch, expected: %llu got %llu\n",
+			req->in.h.unique, oh->unique & ~FUSE_INT_REQ_BIT);
+		err = -ENOENT;
+		goto seterr;
+	}
+
+	/* Is it an interrupt reply ID?	 */
+	if (oh->unique & FUSE_INT_REQ_BIT) {
+		err = 0;
+		if (oh->error == -ENOSYS)
+			fc->no_interrupt = 1;
+		else if (oh->error == -EAGAIN) {
+			/* XXX Needs to copy to the next cq and submit it */
+			// err = queue_interrupt(req);
+			pr_warn("Interrupt EAGAIN not supported yet\n");
+			err = -EINVAL;
+		}
+
+		goto seterr;
+	}
+
+	return 0;
+
+seterr:
+	pr_devel("%s:%d err=%d op=%d req-ret=%d",
+		 __func__, __LINE__, err, req->args->opcode,
+		 req->out.h.error);
+	oh->error = err;
+err:
+	pr_devel("%s:%d err=%d op=%d req-ret=%d",
+		 __func__, __LINE__, err, req->args->opcode,
+		 req->out.h.error);
+	return err;
+}
+
+/**
+ * Read data from the ring buffer, which user space has written to.
+ * This is comparable with handling of classical write(/dev/fuse).
+ * Also make the ring request available again for new fuse requests.
+ */
+static void fuse_uring_commit_and_release(struct fuse_dev *fud,
+					  struct fuse_ring_ent *ring_ent)
+{
+	struct fuse_ring_req *rreq = ring_ent->rreq;
+	struct fuse_req *req = ring_ent->fuse_req;
+	ssize_t err = 0;
+	bool set_err = false;
+
+	req->out.h = rreq->out;
+
+	err = fuse_uring_ring_ent_has_err(fud->fc, ring_ent);
+	if (err) {
+		/* req->out.h.error already set */
+		pr_devel("%s:%d err=%zd oh->err=%d\n",
+			 __func__, __LINE__, err, req->out.h.error);
+		goto out;
+	}
+
+	err = fuse_uring_copy_from_ring(fud->fc, req, rreq);
+	if (err)
+		set_err = true;
+
+out:
+	pr_devel("%s:%d ret=%zd op=%d req-ret=%d\n",
+		 __func__, __LINE__, err, req->args->opcode, req->out.h.error);
+	fuse_uring_req_end_and_get_next(ring_ent, set_err, err);
+}
+
 /*
  * Release a ring request, it is no longer needed and can handle new data
  *
@@ -119,6 +470,25 @@ __must_hold(&queue->lock)
 	ring_ent->rreq->flags = 0;
 	ring_ent->state = FRRS_FUSE_WAIT;
 }
+
+/*
+ * Release a uring entry and fetch the next fuse request if available
+ */
+static bool fuse_uring_ent_release_and_fetch(struct fuse_ring_ent *ring_ent)
+{
+	struct fuse_ring_queue *queue = ring_ent->queue;
+	bool is_bg = !!(ring_ent->rreq->flags & FUSE_RING_REQ_FLAG_BACKGROUND);
+	bool send = false;
+	struct list_head *head = is_bg ? &queue->bg_queue : &queue->fg_queue;
+
+	spin_lock(&ring_ent->queue->lock);
+	fuse_uring_ent_release(ring_ent, queue, is_bg);
+	send = fuse_uring_assign_ring_entry(ring_ent, head, is_bg);
+	spin_unlock(&ring_ent->queue->lock);
+
+	return send;
+}
+
 /**
  * Simplified ring-entry release function, for shutdown only
  */
@@ -810,6 +1180,26 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 
 		ret = fuse_uring_fetch(ring_ent, cmd);
 		break;
+	case FUSE_URING_REQ_COMMIT_AND_FETCH:
+		if (unlikely(!fc->ring.ready)) {
+			pr_info("commit and fetch, but the ring is not ready yet");
+			goto out;
+		}
+
+		if (!(prev_state & FRRS_USERSPACE)) {
+			pr_info("qid=%d tag=%d state %llu misses %d\n",
+				queue->qid, ring_ent->tag, ring_ent->state,
+				FRRS_USERSPACE);
+			goto out;
+		}
+
+		/* XXX Test inject error */
+
+		WRITE_ONCE(ring_ent->cmd, cmd);
+		fuse_uring_commit_and_release(fud, ring_ent);
+
+		ret = 0;
+		break;
 	default:
 		ret = -EINVAL;
 		pr_devel("Unknown uring command %d", cmd_op);
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 4032dccca8b6..b0ef36215b80 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -13,6 +13,7 @@ void fuse_uring_end_requests(struct fuse_conn *fc);
 int fuse_uring_ioctl(struct file *file, struct fuse_uring_cfg *cfg);
 void fuse_uring_ring_destruct(struct fuse_conn *fc);
 int fuse_uring_mmap(struct file *filp, struct vm_area_struct *vma);
+int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
 #endif
 
 
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 13/13] fuse: Allow to queue to the ring
  2023-03-21  1:10 [RFC PATCH 00/13] fuse uring communication Bernd Schubert
                   ` (11 preceding siblings ...)
  2023-03-21  1:10 ` [PATCH 12/13] fuse: Add uring sqe commit and fetch support Bernd Schubert
@ 2023-03-21  1:10 ` Bernd Schubert
  2023-03-21  1:26 ` [RFC PATCH 00/13] fuse uring communication Bernd Schubert
  2023-03-21  9:35 ` Amir Goldstein
  14 siblings, 0 replies; 29+ messages in thread
From: Bernd Schubert @ 2023-03-21  1:10 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dharmendra Singh, Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	fuse-devel

This enables enqueuing requests through fuse uring queues.

For initial simplicity, requests are always allocated the normal way,
then added to the ring queue lists and only then copied to ring queue
entries. Later on, the allocation and adding the requests to a list
can be avoided by directly using a ring entry. This introduces
some code complexity and is therefore not done for now.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: linux-fsdevel@vger.kernel.org
cc: Amir Goldstein <amir73il@gmail.com>
cc: fuse-devel@lists.sourceforge.net
---
 fs/fuse/dev.c         | 80 ++++++++++++++++++++++++++++++++++++++-----
 fs/fuse/dev_uring.c   | 68 ++++++++++++++++++++++++++++++++++++
 fs/fuse/dev_uring_i.h |  1 +
 3 files changed, 141 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index cce55eaed8a3..e82db13da8f6 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -211,13 +211,29 @@ const struct fuse_iqueue_ops fuse_dev_fiq_ops = {
 };
 EXPORT_SYMBOL_GPL(fuse_dev_fiq_ops);
 
-static void queue_request_and_unlock(struct fuse_iqueue *fiq,
-				     struct fuse_req *req)
+
+static void queue_request_and_unlock(struct fuse_conn *fc,
+				     struct fuse_req *req, bool allow_uring)
 __releases(fiq->lock)
 {
+	struct fuse_iqueue *fiq = &fc->iq;
+
 	req->in.h.len = sizeof(struct fuse_in_header) +
 		fuse_len_args(req->args->in_numargs,
 			      (struct fuse_arg *) req->args->in_args);
+
+	if (allow_uring && fc->ring.ready) {
+		int res;
+
+		/* this lock is not needed at all for ring req handling */
+		spin_unlock(&fiq->lock);
+		res = fuse_uring_queue_fuse_req(fc, req);
+		if (!res)
+			return;
+
+		/* fallthrough, handled through /dev/fuse read/write */
+	}
+
 	list_add_tail(&req->list, &fiq->pending);
 	fiq->ops->wake_pending_and_unlock(fiq);
 }
@@ -254,7 +270,7 @@ static void flush_bg_queue(struct fuse_conn *fc)
 		fc->active_background++;
 		spin_lock(&fiq->lock);
 		req->in.h.unique = fuse_get_unique(fiq);
-		queue_request_and_unlock(fiq, req);
+		queue_request_and_unlock(fc, req, true);
 	}
 }
 
@@ -398,7 +414,8 @@ static void request_wait_answer(struct fuse_req *req)
 
 static void __fuse_request_send(struct fuse_req *req)
 {
-	struct fuse_iqueue *fiq = &req->fm->fc->iq;
+	struct fuse_conn *fc = req->fm->fc;
+	struct fuse_iqueue *fiq = &fc->iq;
 
 	BUG_ON(test_bit(FR_BACKGROUND, &req->flags));
 	spin_lock(&fiq->lock);
@@ -410,7 +427,7 @@ static void __fuse_request_send(struct fuse_req *req)
 		/* acquire extra reference, since request is still needed
 		   after fuse_request_end() */
 		__fuse_get_request(req);
-		queue_request_and_unlock(fiq, req);
+		queue_request_and_unlock(fc, req, true);
 
 		request_wait_answer(req);
 		/* Pairs with smp_wmb() in fuse_request_end() */
@@ -478,6 +495,12 @@ ssize_t fuse_simple_request(struct fuse_mount *fm, struct fuse_args *args)
 	if (args->force) {
 		atomic_inc(&fc->num_waiting);
 		req = fuse_request_alloc(fm, GFP_KERNEL | __GFP_NOFAIL);
+		if (unlikely(!req)) {
+			/* should only happen with uring on shutdown */
+			WARN_ON(!fc->ring.configured);
+			ret = -ENOTCONN;
+			goto err;
+		}
 
 		if (!args->nocreds)
 			fuse_force_creds(req);
@@ -505,16 +528,55 @@ ssize_t fuse_simple_request(struct fuse_mount *fm, struct fuse_args *args)
 	}
 	fuse_put_request(req);
 
+err:
 	return ret;
 }
 
-static bool fuse_request_queue_background(struct fuse_req *req)
+static bool fuse_request_queue_background_uring(struct fuse_conn *fc,
+					       struct fuse_req *req)
+{
+	struct fuse_iqueue *fiq = &fc->iq;
+	int err;
+
+	req->in.h.unique = fuse_get_unique(fiq);
+	req->in.h.len = sizeof(struct fuse_in_header) +
+		fuse_len_args(req->args->in_numargs,
+			      (struct fuse_arg *) req->args->in_args);
+
+	err = fuse_uring_queue_fuse_req(fc, req);
+	if (!err) {
+		/* XXX remove and lets the users of that use per queue values -
+		/* XXX remove this and let its users use per-queue values -
+		 * Is this needed at all?
+		 */
+		spin_lock(&fc->bg_lock);
+		fc->num_background++;
+		fc->active_background++;
+
+
+		/* XXX block when per ring queues get occupied */
+		if (fc->num_background == fc->max_background)
+			fc->blocked = 1;
+		spin_unlock(&fc->bg_lock);
+	}
+
+	return err ? false : true;
+}
+
+/*
+ * @return true if queued
+ */
+static int fuse_request_queue_background(struct fuse_req *req)
 {
 	struct fuse_mount *fm = req->fm;
 	struct fuse_conn *fc = fm->fc;
 	bool queued = false;
 
 	WARN_ON(!test_bit(FR_BACKGROUND, &req->flags));
+
+	if (fc->ring.ready)
+		return fuse_request_queue_background_uring(fc, req);
+
 	if (!test_bit(FR_WAITING, &req->flags)) {
 		__set_bit(FR_WAITING, &req->flags);
 		atomic_inc(&fc->num_waiting);
@@ -567,7 +629,8 @@ static int fuse_simple_notify_reply(struct fuse_mount *fm,
 				    struct fuse_args *args, u64 unique)
 {
 	struct fuse_req *req;
-	struct fuse_iqueue *fiq = &fm->fc->iq;
+	struct fuse_conn *fc = fm->fc;
+	struct fuse_iqueue *fiq = &fc->iq;
 	int err = 0;
 
 	req = fuse_get_req(fm, false);
@@ -581,7 +644,8 @@ static int fuse_simple_notify_reply(struct fuse_mount *fm,
 
 	spin_lock(&fiq->lock);
 	if (fiq->connected) {
-		queue_request_and_unlock(fiq, req);
+		/* uring for notify not supported yet */
+		queue_request_and_unlock(fc, req, false);
 	} else {
 		err = -ENODEV;
 		spin_unlock(&fiq->lock);
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 5c41f9f71410..9e02e58ac688 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -330,6 +330,74 @@ __must_hold(&queue.waitq.lock)
 	return true;
 }
 
+int fuse_uring_queue_fuse_req(struct fuse_conn *fc, struct fuse_req *req)
+{
+	struct fuse_ring_queue *queue;
+	int qid = 0;
+	struct fuse_ring_ent *ring_ent = NULL;
+	const size_t queue_depth = fc->ring.queue_depth;
+	int res;
+	int is_bg = test_bit(FR_BACKGROUND, &req->flags);
+	struct list_head *head;
+	int *queue_avail;
+
+	pr_devel("%s req=%p bg=%d\n", __func__, req, is_bg);
+
+	if (fc->ring.per_core_queue) {
+		qid = task_cpu(current);
+		if (unlikely(qid >= fc->ring.nr_queues)) {
+			WARN_ONCE(1, "Core number (%u) exceeds nr queues (%zu)\n",
+				  qid, fc->ring.nr_queues);
+			qid = 0;
+		}
+	}
+
+	queue = fuse_uring_get_queue(fc, qid);
+	head = is_bg ?  &queue->bg_queue : &queue->fg_queue;
+	queue_avail = is_bg ? &queue->req_bg : &queue->req_fg;
+
+	spin_lock(&queue->lock);
+
+	if (unlikely(queue->aborted)) {
+		res = -ENOTCONN;
+		goto err_unlock;
+	}
+
+	list_add_tail(&req->list, head);
+	if (*queue_avail) {
+		bool got_req;
+		int tag = find_first_bit(queue->req_avail_map, queue_depth);
+
+		if (unlikely(tag == queue_depth)) {
+			pr_err("queue: no free bit found for qid=%d "
+				"qdepth=%zu av-fg=%d av-bg=%d max-fg=%zu "
+				"max-bg=%zu is_bg=%d\n", queue->qid,
+				queue_depth, queue->req_fg, queue->req_bg,
+				fc->ring.max_fg, fc->ring.max_bg, is_bg);
+
+			WARN_ON(1);
+			res = -ENOENT;
+			goto err_unlock;
+		}
+		ring_ent = &queue->ring_ent[tag];
+		got_req = fuse_uring_assign_ring_entry(ring_ent, head, is_bg);
+		if (unlikely(!got_req)) {
+			WARN_ON(1);
+			ring_ent = NULL;
+		}
+	}
+	spin_unlock(&queue->lock);
+
+	if (ring_ent != NULL)
+		fuse_uring_send_to_ring(ring_ent);
+
+	return 0;
+
+err_unlock:
+	spin_unlock(&queue->lock);
+	return res;
+}
+
 /*
  * Checks for errors and stores it into the request
  */
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index b0ef36215b80..42260a2a22ee 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -10,6 +10,7 @@
 #include "fuse_i.h"
 
 void fuse_uring_end_requests(struct fuse_conn *fc);
+int fuse_uring_queue_fuse_req(struct fuse_conn *fc, struct fuse_req *req);
 int fuse_uring_ioctl(struct file *file, struct fuse_uring_cfg *cfg);
 void fuse_uring_ring_destruct(struct fuse_conn *fc);
 int fuse_uring_mmap(struct file *filp, struct vm_area_struct *vma);
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 00/13] fuse uring communication
  2023-03-21  1:10 [RFC PATCH 00/13] fuse uring communication Bernd Schubert
                   ` (12 preceding siblings ...)
  2023-03-21  1:10 ` [PATCH 13/13] fuse: Allow to queue to the ring Bernd Schubert
@ 2023-03-21  1:26 ` Bernd Schubert
  2023-03-21  9:35 ` Amir Goldstein
  14 siblings, 0 replies; 29+ messages in thread
From: Bernd Schubert @ 2023-03-21  1:26 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dharmendra Singh, Miklos Szeredi, Amir Goldstein, fuse-devel

The introduction seems to be missing on the fsdevel list

On 3/21/23 02:10, Bernd Schubert wrote:
> This adds support for uring communication between kernel and
> userspace daemon using opcode the IORING_OP_URING_CMD. The basic
> appraoch was taken from ublk.  The patches are in RFC state -
> I'm not sure about all decisions and some questions are marked
> with XXX.
> 
> Userspace side has to send IOCTL(s) to configure ring queue(s)
> and it has the choice to configure exactly one ring or one
> ring per core. If there are use case we can also consider
> to allow a different number of rings - the ioctl configuration
> option is rather generic (number of queues).
> 
> Right now a queue lock is taken for any ring entry state change,
> mostly to correctly handle unmount/daemon-stop. In fact,
> correctly stopping the ring took most of the development
> time - always new corner cases came up.
> I had run dozens of xfstest cycles,
> versions I had once seen a warning about the ring start_stop
> mutex being the wrong state - probably another stop issue,
> but I have not been able to track it down yet.
> Regarding the queue lock - I still need to do profiling, but
> my assumption is that it should not matter for the
> one-ring-per-core configuration. For the single ring config
> option lock contention might come up, but I see this
> configuration mostly for development only.
> Adding more complexity and protecting ring entries with
> their own locks can be done later.
> 
> Current code also keep the fuse request allocation, initially
> I only had that for background requests when the ring queue
> didn't have free entries anymore. The allocation is done
> to reduce initial complexity, especially also for ring stop.
> The allocation free mode can be added back later.
> 
> Right now always the ring queue of the submitting core
> is used, especially for page cached background requests
> we might consider later to also enqueue on other core queues
> (when these are not busy, of course).
> 
> Splice/zero-copy is not supported yet, all requests go
> through the shared memory queue entry buffer. I also
> following splice and ublk/zc copy discussions, I will
> look into these options in the next days/weeks.
> To have that buffer allocated on the right numa node,
> a vmalloc is done per ring queue and on the numa node
> userspace daemon side asks for.
> My assumption is that the mmap offset parameter will be
> part of a debate and I'm curious what other think about
> that appraoch.
> 
> Benchmarking and tuning is on my agenda for the next
> days. For now I only have xfstest results - most longer
> running tests were running at about 2x, but somehow when
> I cleaned up the patches for submission I lost that.
> My development VM/kernel has all sanitizers enabled -
> hard to profile what happened. Performance
> results with profiling will be submitted in a few days.
> 
> The patches include a design document, which has a few more
> details.
> 
> The corresponding libfuse patches are on my uring branch,
> but need cleanup for submission - will happen during the next
> days.
> https://github.com/bsbernd/libfuse/tree/uring
> 
> If it should make review easier, patches posted here are on
> this branch
> https://github.com/bsbernd/linux/tree/fuse-uring-for-6.2
> 
> 
> Bernd Schubert (13):
>    fuse: Add uring data structures and documentation
>    fuse: rename to fuse_dev_end_requests and make non-static
>    fuse: Move fuse_get_dev to header file
>    Add a vmalloc_node_user function
>    fuse: Add a uring config ioctl and ring destruction
>    fuse: Add an interval ring stop worker/monitor
>    fuse: Add uring mmap method
>    fuse: Move request bits
>    fuse: Add wait stop ioctl support to the ring
>    fuse: Handle SQEs - register commands
>    fuse: Add support to copy from/to the ring buffer
>    fuse: Add uring sqe commit and fetch support
>    fuse: Allow to queue to the ring
> 
>   Documentation/filesystems/fuse-uring.rst |  179 +++
>   fs/fuse/Makefile                         |    2 +-
>   fs/fuse/dev.c                            |  193 +++-
>   fs/fuse/dev_uring.c                      | 1292 ++++++++++++++++++++++
>   fs/fuse/dev_uring_i.h                    |   23 +
>   fs/fuse/fuse_dev_i.h                     |   62 ++
>   fs/fuse/fuse_i.h                         |  178 +++
>   fs/fuse/inode.c                          |   10 +
>   include/linux/vmalloc.h                  |    1 +
>   include/uapi/linux/fuse.h                |  131 +++
>   mm/nommu.c                               |    6 +
>   mm/vmalloc.c                             |   41 +-
>   12 files changed, 2064 insertions(+), 54 deletions(-)
>   create mode 100644 Documentation/filesystems/fuse-uring.rst
>   create mode 100644 fs/fuse/dev_uring.c
>   create mode 100644 fs/fuse/dev_uring_i.h
>   create mode 100644 fs/fuse/fuse_dev_i.h
> 
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> cc: Miklos Szeredi <miklos@szeredi.hu>
> cc: linux-fsdevel@vger.kernel.org
> cc: Amir Goldstein <amir73il@gmail.com>
> cc: fuse-devel@lists.sourceforge.net
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 00/13] fuse uring communication
  2023-03-21  1:10 [RFC PATCH 00/13] fuse uring communication Bernd Schubert
                   ` (13 preceding siblings ...)
  2023-03-21  1:26 ` [RFC PATCH 00/13] fuse uring communication Bernd Schubert
@ 2023-03-21  9:35 ` Amir Goldstein
  2023-03-23 11:18   ` Bernd Schubert
  14 siblings, 1 reply; 29+ messages in thread
From: Amir Goldstein @ 2023-03-21  9:35 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: linux-fsdevel, Dharmendra Singh, Miklos Szeredi, fuse-devel

On Tue, Mar 21, 2023 at 3:11 AM Bernd Schubert <bschubert@ddn.com> wrote:
>
> This adds support for uring communication between kernel and
> userspace daemon using opcode the IORING_OP_URING_CMD. The basic
> appraoch was taken from ublk.  The patches are in RFC state -
> I'm not sure about all decisions and some questions are marked
> with XXX.
>
> Userspace side has to send IOCTL(s) to configure ring queue(s)
> and it has the choice to configure exactly one ring or one
> ring per core. If there are use case we can also consider
> to allow a different number of rings - the ioctl configuration
> option is rather generic (number of queues).
>
> Right now a queue lock is taken for any ring entry state change,
> mostly to correctly handle unmount/daemon-stop. In fact,
> correctly stopping the ring took most of the development
> time - always new corner cases came up.
> I had run dozens of xfstest cycles,
> versions I had once seen a warning about the ring start_stop
> mutex being the wrong state - probably another stop issue,
> but I have not been able to track it down yet.
> Regarding the queue lock - I still need to do profiling, but
> my assumption is that it should not matter for the
> one-ring-per-core configuration. For the single ring config
> option lock contention might come up, but I see this
> configuration mostly for development only.
> Adding more complexity and protecting ring entries with
> their own locks can be done later.
>
> Current code also keep the fuse request allocation, initially
> I only had that for background requests when the ring queue
> didn't have free entries anymore. The allocation is done
> to reduce initial complexity, especially also for ring stop.
> The allocation free mode can be added back later.
>
> Right now always the ring queue of the submitting core
> is used, especially for page cached background requests
> we might consider later to also enqueue on other core queues
> (when these are not busy, of course).
>
> Splice/zero-copy is not supported yet, all requests go
> through the shared memory queue entry buffer. I also
> following splice and ublk/zc copy discussions, I will
> look into these options in the next days/weeks.
> To have that buffer allocated on the right numa node,
> a vmalloc is done per ring queue and on the numa node
> userspace daemon side asks for.
> My assumption is that the mmap offset parameter will be
> part of a debate and I'm curious what other think about
> that appraoch.
>
> Benchmarking and tuning is on my agenda for the next
> days. For now I only have xfstest results - most longer
> running tests were running at about 2x, but somehow when
> I cleaned up the patches for submission I lost that.
> My development VM/kernel has all sanitizers enabled -
> hard to profile what happened. Performance
> results with profiling will be submitted in a few days.

When posting those benchmarks and with future RFC postings,
it would be useful for people reading this introduction for the
first time to explicitly state the motivation of your work, which
can only be inferred from the mention of "benchmarks".

I think it would also be useful to link to prior work (ZUFS, fuse2)
and mention the current FUSE performance issues related to
context switches and cache line bouncing that was discussed in
those threads.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 04/13] Add a vmalloc_node_user function
  2023-03-21  1:10 ` [PATCH 04/13] Add a vmalloc_node_user function Bernd Schubert
@ 2023-03-21 21:21   ` Andrew Morton
  0 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2023-03-21 21:21 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: linux-fsdevel, Dharmendra Singh, Uladzislau Rezki,
	Christoph Hellwig, linux-mm, Miklos Szeredi, Amir Goldstein,
	fuse-devel

On Tue, 21 Mar 2023 02:10:38 +0100 Bernd Schubert <bschubert@ddn.com> wrote:

> This is to have a numa aware vmalloc function for memory exposed to
> userspace. Fuse uring will allocate queue memory using this
> new function.
> 

Acked-by: Andrew Morton <akpm@linux-foundation.org>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 06/13] fuse: Add an interval ring stop worker/monitor
  2023-03-21  1:10 ` [PATCH 06/13] fuse: Add an interval ring stop worker/monitor Bernd Schubert
@ 2023-03-23 10:27   ` Miklos Szeredi
  2023-03-23 11:04     ` Bernd Schubert
  0 siblings, 1 reply; 29+ messages in thread
From: Miklos Szeredi @ 2023-03-23 10:27 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: linux-fsdevel, Dharmendra Singh, Amir Goldstein, fuse-devel

On Tue, 21 Mar 2023 at 02:11, Bernd Schubert <bschubert@ddn.com> wrote:
>
> This adds a delayed work queue that runs in intervals
> to check and to stop the ring if needed. Fuse connection
> abort now waits for this worker to complete.

This seems like a hack.   Can you explain what the problem is?

The first thing I notice is that you store a reference to the task
that initiated the ring creation.  This already looks fishy, as the
ring could well survive the task (thread) that created it, no?

Can you explain why the fuse case is different than regular io-uring?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 06/13] fuse: Add an interval ring stop worker/monitor
  2023-03-23 10:27   ` Miklos Szeredi
@ 2023-03-23 11:04     ` Bernd Schubert
  2023-03-23 12:35       ` Miklos Szeredi
  0 siblings, 1 reply; 29+ messages in thread
From: Bernd Schubert @ 2023-03-23 11:04 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-fsdevel, Dharmendra Singh, Amir Goldstein, fuse-devel,
	Ming Lei, Aleksandr Mikhalitsyn

Thanks for looking at these patches!

I'm adding in Ming Lei, as I had taken several ideas from ublk. I guess
I should also explain in the commit messages and code why it is
done that way.

On 3/23/23 11:27, Miklos Szeredi wrote:
> On Tue, 21 Mar 2023 at 02:11, Bernd Schubert <bschubert@ddn.com> wrote:
>>
>> This adds a delayed work queue that runs in intervals
>> to check and to stop the ring if needed. Fuse connection
>> abort now waits for this worker to complete.
> 
> This seems like a hack.   Can you explain what the problem is?
> 
> The first thing I notice is that you store a reference to the task
> that initiated the ring creation.  This already looks fishy, as the
> ring could well survive the task (thread) that created it, no?

You mean the currently ongoing work, where the daemon can be restarted? 
Daemon restart will need some work with ring communication; I will take 
care of that once we have agreed on an approach. [Also added Aleksandr to Cc.]

fuse_uring_stop_mon() checks if the daemon process is exiting by 
looking at fc->ring.daemon->flags & PF_EXITING - this is what the process 
reference is for.

> 
> Can you explain why the fuse case is different than regular io-uring?


libfuse sends an IORING_OP_URING_CMD (the forward command) and everything 
is handled by fuse.ko - fuse.ko receives the SQE and stores it. On shutdown 
that command needs to be completed with io_uring_cmd_done(). If you 
forget to do that, io_uring worker queues will go into D-state and print 
warning messages in intervals.

The purpose of the PF_EXITING check is to detect that the daemon is getting 
killed and to release uring resources, even if the filesystem has not been 
unmounted yet.
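
Roughly, the stop monitor does the following (a simplified sketch, not 
the exact patch code - the field and helper names below are 
approximations):

/* delayed work, re-armed in intervals while the ring is active */
static void fuse_uring_stop_mon(struct work_struct *work)
{
	struct fuse_conn *fc = container_of(to_delayed_work(work),
					    struct fuse_conn,
					    ring.stop_monitor);

	if (fc->ring.daemon->flags & PF_EXITING) {
		/* Daemon is being killed, although the fs might still be
		 * mounted. Complete all stored uring commands with
		 * io_uring_cmd_done() so io_uring workers do not end up
		 * in D-state, then release the ring resources. */
		fuse_uring_stop_queues(fc);	/* approximated helper name */
		return;
	}

	/* nothing to do yet, check again later */
	schedule_delayed_work(&fc->ring.stop_monitor,
			      FUSE_URING_STOP_MON_INTERVAL);
}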


Thanks,
Bernd




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 00/13] fuse uring communication
  2023-03-21  9:35 ` Amir Goldstein
@ 2023-03-23 11:18   ` Bernd Schubert
  2023-03-23 11:55     ` Amir Goldstein
  0 siblings, 1 reply; 29+ messages in thread
From: Bernd Schubert @ 2023-03-23 11:18 UTC (permalink / raw)
  To: Amir Goldstein, Bernd Schubert
  Cc: linux-fsdevel, Dharmendra Singh, Miklos Szeredi, fuse-devel



On 3/21/23 10:35, Amir Goldstein wrote:
> On Tue, Mar 21, 2023 at 3:11 AM Bernd Schubert <bschubert@ddn.com> wrote:
>>
>> This adds support for uring communication between kernel and
>> userspace daemon using opcode the IORING_OP_URING_CMD. The basic
>> appraoch was taken from ublk.  The patches are in RFC state -
>> I'm not sure about all decisions and some questions are marked
>> with XXX.
>>
>> Userspace side has to send IOCTL(s) to configure ring queue(s)
>> and it has the choice to configure exactly one ring or one
>> ring per core. If there are use case we can also consider
>> to allow a different number of rings - the ioctl configuration
>> option is rather generic (number of queues).
>>
>> Right now a queue lock is taken for any ring entry state change,
>> mostly to correctly handle unmount/daemon-stop. In fact,
>> correctly stopping the ring took most of the development
>> time - always new corner cases came up.
>> I had run dozens of xfstest cycles,
>> versions I had once seen a warning about the ring start_stop
>> mutex being the wrong state - probably another stop issue,
>> but I have not been able to track it down yet.
>> Regarding the queue lock - I still need to do profiling, but
>> my assumption is that it should not matter for the
>> one-ring-per-core configuration. For the single ring config
>> option lock contention might come up, but I see this
>> configuration mostly for development only.
>> Adding more complexity and protecting ring entries with
>> their own locks can be done later.
>>
>> Current code also keep the fuse request allocation, initially
>> I only had that for background requests when the ring queue
>> didn't have free entries anymore. The allocation is done
>> to reduce initial complexity, especially also for ring stop.
>> The allocation free mode can be added back later.
>>
>> Right now always the ring queue of the submitting core
>> is used, especially for page cached background requests
>> we might consider later to also enqueue on other core queues
>> (when these are not busy, of course).
>>
>> Splice/zero-copy is not supported yet, all requests go
>> through the shared memory queue entry buffer. I also
>> following splice and ublk/zc copy discussions, I will
>> look into these options in the next days/weeks.
>> To have that buffer allocated on the right numa node,
>> a vmalloc is done per ring queue and on the numa node
>> userspace daemon side asks for.
>> My assumption is that the mmap offset parameter will be
>> part of a debate and I'm curious what other think about
>> that appraoch.
>>
>> Benchmarking and tuning is on my agenda for the next
>> days. For now I only have xfstest results - most longer
>> running tests were running at about 2x, but somehow when
>> I cleaned up the patches for submission I lost that.
>> My development VM/kernel has all sanitizers enabled -
>> hard to profile what happened. Performance
>> results with profiling will be submitted in a few days.
> 
> When posting those benchmarks and with future RFC posting,
> it's would be useful for people reading this introduction for the
> first time, to explicitly state the motivation of your work, which
> can only be inferred from the mention of "benchmarks".
> 
> I think it would also be useful to link to prior work (ZUFS, fuse2)
> and mention the current FUSE performance issues related to
> context switches and cache line bouncing that was discussed in
> those threads.

Oh yes sure, I entirely forgot to mention the motivation. Will do in the 
next patch round. Do you have these links by any chance? I know that 
there were several zufs threads, but I don't remember discussions about 
cache lines - maybe I had missed that. I can try to read through the old 
threads in case you don't have them.
Our own motivation for the ring basically comes from atomic-open benchmarks, 
which gave totally confusing results in multi-threaded libfuse 
mode - fewer requests resulted in lower IOPS - while switching to 
single-threaded mode gave the expected IOPS increase. Part of it was due to 
a libfuse issue - persistent thread destruction/creation due to 
min_idle_threads - but another part can be explained by thread switching 
alone. When I added (slight) spinning in fuse_dev_do_read(), the 
hard/impossible part was to avoid letting multiple threads spin - even with 
a single-threaded application creating/deleting files (like bonnie++), 
multiple libfuse threads started to spin for no good reason. So spinning 
resulted in much improved latency, but high CPU usage, because multiple 
threads were spinning. I will add those explanations to the next patch set.
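
For illustration, the spinning experiment was roughly the following (a 
sketch from memory, not code from any posted patch; FUSE_SPIN_LIMIT is a 
made-up knob):

	/* in fuse_dev_do_read(), before sleeping on the waitq: briefly
	 * poll for an incoming request instead of scheduling away */
	for (i = 0; i < FUSE_SPIN_LIMIT; i++) {
		if (request_pending(fiq))
			break;
		cpu_relax();
	}
	/* only if still nothing is pending, fall back to the normal
	 * wait_event_interruptible_exclusive() path */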

Thanks,
Bernd


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 00/13] fuse uring communication
  2023-03-23 11:18   ` Bernd Schubert
@ 2023-03-23 11:55     ` Amir Goldstein
  2023-06-07 14:20       ` Miklos Szeredi
  0 siblings, 1 reply; 29+ messages in thread
From: Amir Goldstein @ 2023-03-23 11:55 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Bernd Schubert, linux-fsdevel, Dharmendra Singh, Miklos Szeredi,
	fuse-devel

On Thu, Mar 23, 2023 at 1:18 PM Bernd Schubert
<bernd.schubert@fastmail.fm> wrote:
>
>
>
> On 3/21/23 10:35, Amir Goldstein wrote:
> > On Tue, Mar 21, 2023 at 3:11 AM Bernd Schubert <bschubert@ddn.com> wrote:
> >>
> >> This adds support for uring communication between kernel and
> >> userspace daemon using opcode the IORING_OP_URING_CMD. The basic
> >> appraoch was taken from ublk.  The patches are in RFC state -
> >> I'm not sure about all decisions and some questions are marked
> >> with XXX.
> >>
> >> Userspace side has to send IOCTL(s) to configure ring queue(s)
> >> and it has the choice to configure exactly one ring or one
> >> ring per core. If there are use case we can also consider
> >> to allow a different number of rings - the ioctl configuration
> >> option is rather generic (number of queues).
> >>
> >> Right now a queue lock is taken for any ring entry state change,
> >> mostly to correctly handle unmount/daemon-stop. In fact,
> >> correctly stopping the ring took most of the development
> >> time - always new corner cases came up.
> >> I had run dozens of xfstest cycles,
> >> versions I had once seen a warning about the ring start_stop
> >> mutex being the wrong state - probably another stop issue,
> >> but I have not been able to track it down yet.
> >> Regarding the queue lock - I still need to do profiling, but
> >> my assumption is that it should not matter for the
> >> one-ring-per-core configuration. For the single ring config
> >> option lock contention might come up, but I see this
> >> configuration mostly for development only.
> >> Adding more complexity and protecting ring entries with
> >> their own locks can be done later.
> >>
> >> Current code also keep the fuse request allocation, initially
> >> I only had that for background requests when the ring queue
> >> didn't have free entries anymore. The allocation is done
> >> to reduce initial complexity, especially also for ring stop.
> >> The allocation free mode can be added back later.
> >>
> >> Right now always the ring queue of the submitting core
> >> is used, especially for page cached background requests
> >> we might consider later to also enqueue on other core queues
> >> (when these are not busy, of course).
> >>
> >> Splice/zero-copy is not supported yet, all requests go
> >> through the shared memory queue entry buffer. I also
> >> following splice and ublk/zc copy discussions, I will
> >> look into these options in the next days/weeks.
> >> To have that buffer allocated on the right numa node,
> >> a vmalloc is done per ring queue and on the numa node
> >> userspace daemon side asks for.
> >> My assumption is that the mmap offset parameter will be
> >> part of a debate and I'm curious what other think about
> >> that appraoch.
> >>
> >> Benchmarking and tuning is on my agenda for the next
> >> days. For now I only have xfstest results - most longer
> >> running tests were running at about 2x, but somehow when
> >> I cleaned up the patches for submission I lost that.
> >> My development VM/kernel has all sanitizers enabled -
> >> hard to profile what happened. Performance
> >> results with profiling will be submitted in a few days.
> >
> > When posting those benchmarks and with future RFC posting,
> > it's would be useful for people reading this introduction for the
> > first time, to explicitly state the motivation of your work, which
> > can only be inferred from the mention of "benchmarks".
> >
> > I think it would also be useful to link to prior work (ZUFS, fuse2)
> > and mention the current FUSE performance issues related to
> > context switches and cache line bouncing that was discussed in
> > those threads.
>
> Oh yes sure, entirely forgot to mention the motivation. Will do in the
> next patch round. You don't have these links by any chance? I know that

I have this thread which you are on:
https://lore.kernel.org/linux-fsdevel/CAJfpegtjEoE7H8tayLaQHG9fRSBiVuaspnmPr2oQiOZXVB1+7g@mail.gmail.com/

> there were several zufs threads, but I don't remember discussions about
> cache line - maybe I had missed it. I can try to read through the old
> threads, in case you don't have it.

Miklos talked about it somewhere...

> Our own motivation for ring basically comes from atomic-open benchmarks,
> which gave totally confusing benchmark results in multi threaded libfuse
> mode - less requests caused lower IOPs - switching to single threaded
> then gave expect IOP increase. Part of it was due to a libfuse issue -
> persistent thread destruction/creation due to min_idle_threads, but
> another part can be explained with thread switching only. When I added
> (slight) spinning in fuse_dev_do_read(), the hard part/impossible part
> was to avoid letting multiple threads spin - even with a single threaded
> application creating/deleting files (like bonnie++), multiple libfuse
> threads started to spin for no good reason. So spinning resulted in a
> much improved latency, but high cpu usage, because multiple threads were
> spinning. I will add those explanations to the next patch set.
>

That would be great.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 06/13] fuse: Add an interval ring stop worker/monitor
  2023-03-23 11:04     ` Bernd Schubert
@ 2023-03-23 12:35       ` Miklos Szeredi
  2023-03-23 13:18         ` Bernd Schubert
  2023-03-23 13:26         ` Ming Lei
  0 siblings, 2 replies; 29+ messages in thread
From: Miklos Szeredi @ 2023-03-23 12:35 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: linux-fsdevel, Dharmendra Singh, Amir Goldstein, fuse-devel,
	Ming Lei, Aleksandr Mikhalitsyn

On Thu, 23 Mar 2023 at 12:04, Bernd Schubert <bschubert@ddn.com> wrote:
>
> Thanks for looking at these patches!
>
> I'm adding in Ming Lei, as I had taken several ideas from ublkm I guess
> I also should also explain in the commit messages and code why it is
> done that way.
>
> On 3/23/23 11:27, Miklos Szeredi wrote:
> > On Tue, 21 Mar 2023 at 02:11, Bernd Schubert <bschubert@ddn.com> wrote:
> >>
> >> This adds a delayed work queue that runs in intervals
> >> to check and to stop the ring if needed. Fuse connection
> >> abort now waits for this worker to complete.
> >
> > This seems like a hack.   Can you explain what the problem is?
> >
> > The first thing I notice is that you store a reference to the task
> > that initiated the ring creation.  This already looks fishy, as the
> > ring could well survive the task (thread) that created it, no?
>
> You mean the currently ongoing work, where the daemon can be restarted?
> Daemon restart will need some work with ring communication, I will take
> care of that once we have agreed on an approach. [Also added in Alexsandre].
>
> fuse_uring_stop_mon() checks if the daemon process is exiting and and
> looks at fc->ring.daemon->flags & PF_EXITING - this is what the process
> reference is for.

Okay, so you are saying that the lifetime of the ring is bound to the
lifetime of the thread that created it?

Why is that?

It's much more common to bind the lifetime of an object to that of an
open file.  io_uring_setup() does that, for example.

It's much easier to hook into the destruction of an open file than
into the destruction of a process (as you've observed). And the way
you do it is even more confusing, as the ring is destroyed not when the
process is destroyed, but when a specific thread is destroyed, making
this a thread-specific behavior that is probably best avoided.

So the obvious solution would be to destroy the ring(s) in
fuse_dev_release().  Why wouldn't that work?
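
I.e. something along these lines (just a sketch of the idea - the 
teardown helper is made up, and the real cleanup would have to complete 
the stored uring commands):

static int fuse_dev_release(struct inode *inode, struct file *file)
{
	struct fuse_dev *fud = fuse_get_dev(file);

	if (fud) {
		/* ... existing request/queue cleanup ... */

		/* tear down the uring queues bound to this connection,
		 * completing any stored uring commands */
		fuse_uring_stop_queues(fud->fc);	/* made-up helper */
	}

	return 0;
}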

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 06/13] fuse: Add an interval ring stop worker/monitor
  2023-03-23 12:35       ` Miklos Szeredi
@ 2023-03-23 13:18         ` Bernd Schubert
  2023-03-23 20:51           ` Bernd Schubert
  2023-03-23 13:26         ` Ming Lei
  1 sibling, 1 reply; 29+ messages in thread
From: Bernd Schubert @ 2023-03-23 13:18 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-fsdevel, Dharmendra Singh, Amir Goldstein, fuse-devel,
	Ming Lei, Aleksandr Mikhalitsyn

On 3/23/23 13:35, Miklos Szeredi wrote:
> On Thu, 23 Mar 2023 at 12:04, Bernd Schubert <bschubert@ddn.com> wrote:
>>
>> Thanks for looking at these patches!
>>
>> I'm adding in Ming Lei, as I had taken several ideas from ublkm I guess
>> I also should also explain in the commit messages and code why it is
>> done that way.
>>
>> On 3/23/23 11:27, Miklos Szeredi wrote:
>>> On Tue, 21 Mar 2023 at 02:11, Bernd Schubert <bschubert@ddn.com> wrote:
>>>>
>>>> This adds a delayed work queue that runs in intervals
>>>> to check and to stop the ring if needed. Fuse connection
>>>> abort now waits for this worker to complete.
>>>
>>> This seems like a hack.   Can you explain what the problem is?
>>>
>>> The first thing I notice is that you store a reference to the task
>>> that initiated the ring creation.  This already looks fishy, as the
>>> ring could well survive the task (thread) that created it, no?
>>
>> You mean the currently ongoing work, where the daemon can be restarted?
>> Daemon restart will need some work with ring communication, I will take
>> care of that once we have agreed on an approach. [Also added in Alexsandre].
>>
>> fuse_uring_stop_mon() checks if the daemon process is exiting and and
>> looks at fc->ring.daemon->flags & PF_EXITING - this is what the process
>> reference is for.
> 
> Okay, so you are saying that the lifetime of the ring is bound to the
> lifetime of the thread that created it?
> 
> Why is that?
> 
> I'ts much more common to bind a lifetime of an object to that of an
> open file.  io_uring_setup() will do that for example.
> 
> It's much easier to hook into the destruction of an open file, than
> into the destruction of a process (as you've observed). And the way
> you do it is even more confusing as the ring is destroyed not when the
> process is destroyed, but when a specific thread is destroyed, making
> this a thread specific behavior that is probably best avoided.
> 
> So the obvious solution would be to destroy the ring(s) in
> fuse_dev_release().  Why wouldn't that work?
> 

I _think_ I had tried that at the beginning, ran into issues and then 
switched to the ublk approach. Going to try again now.


Thanks,
Bernd


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 06/13] fuse: Add an interval ring stop worker/monitor
  2023-03-23 12:35       ` Miklos Szeredi
  2023-03-23 13:18         ` Bernd Schubert
@ 2023-03-23 13:26         ` Ming Lei
  1 sibling, 0 replies; 29+ messages in thread
From: Ming Lei @ 2023-03-23 13:26 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Bernd Schubert, linux-fsdevel, Dharmendra Singh, Amir Goldstein,
	fuse-devel, Aleksandr Mikhalitsyn, io-uring, Jens Axboe

On Thu, Mar 23, 2023 at 01:35:24PM +0100, Miklos Szeredi wrote:
> On Thu, 23 Mar 2023 at 12:04, Bernd Schubert <bschubert@ddn.com> wrote:
> >
> > Thanks for looking at these patches!
> >
> > I'm adding in Ming Lei, as I had taken several ideas from ublkm I guess
> > I also should also explain in the commit messages and code why it is
> > done that way.
> >
> > On 3/23/23 11:27, Miklos Szeredi wrote:
> > > On Tue, 21 Mar 2023 at 02:11, Bernd Schubert <bschubert@ddn.com> wrote:
> > >>
> > >> This adds a delayed work queue that runs in intervals
> > >> to check and to stop the ring if needed. Fuse connection
> > >> abort now waits for this worker to complete.
> > >
> > > This seems like a hack.   Can you explain what the problem is?
> > >
> > > The first thing I notice is that you store a reference to the task
> > > that initiated the ring creation.  This already looks fishy, as the
> > > ring could well survive the task (thread) that created it, no?
> >
> > You mean the currently ongoing work, where the daemon can be restarted?
> > Daemon restart will need some work with ring communication, I will take
> > care of that once we have agreed on an approach. [Also added in Alexsandre].
> >
> > fuse_uring_stop_mon() checks if the daemon process is exiting and and
> > looks at fc->ring.daemon->flags & PF_EXITING - this is what the process
> > reference is for.
> 
> Okay, so you are saying that the lifetime of the ring is bound to the
> lifetime of the thread that created it?
> 
> Why is that?

Cc Jens and io_uring list

For ublk:

1) it is an MQ device, so it is natural to map each queue to a pthread/uring

2) the io_uring context is invisible to the driver and we don't know when it is
destructed, so we bind the io_uring context to the queue/pthread, because we have
to complete all uring commands before the io_uring context exits. The uring cmd
usage for ublk/fuse is special and unique: it is like a poll request, sent to the
device beforehand, and it is only completed when the driver has something incoming
that needs userspace to handle - but ublk/fuse may never have anything that needs
userspace to look at.

If io_uring could provide an API for registering an exit callback, things would be
easier for ublk/fuse. However, we still need to know the exact io_uring context
associated with our commands, so either more io_uring implementation details have
to be exposed to the driver, or proper APIs have to be provided.
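
For example something like this (purely hypothetical - no such interface
exists in io_uring today; names and signature are made up just to
illustrate the idea):

	/* hypothetical: notify the driver before the io_uring context that
	 * owns its pending uring commands goes away, so the driver can
	 * complete them with io_uring_cmd_done() in time */
	typedef void (*io_uring_exit_cb_t)(void *driver_data);

	int io_uring_register_exit_callback(struct io_uring_cmd *cmd,
					    io_uring_exit_cb_t cb,
					    void *driver_data);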

> 
> I'ts much more common to bind a lifetime of an object to that of an
> open file.  io_uring_setup() will do that for example.
> 
> It's much easier to hook into the destruction of an open file, than
> into the destruction of a process (as you've observed). And the way
> you do it is even more confusing as the ring is destroyed not when the
> process is destroyed, but when a specific thread is destroyed, making
> this a thread specific behavior that is probably best avoided.
> 
> So the obvious solution would be to destroy the ring(s) in
> fuse_dev_release().  Why wouldn't that work?

io_uring is used for submitting I/O to multiple files, so its lifetime can't be
bound to a single file; also, io_uring is invisible to the driver - if "the
ring(s)" here means io_uring.


thanks,
Ming


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 06/13] fuse: Add an interval ring stop worker/monitor
  2023-03-23 13:18         ` Bernd Schubert
@ 2023-03-23 20:51           ` Bernd Schubert
  2023-03-27 13:22             ` Pavel Begunkov
  0 siblings, 1 reply; 29+ messages in thread
From: Bernd Schubert @ 2023-03-23 20:51 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-fsdevel, Dharmendra Singh, Amir Goldstein, fuse-devel,
	Ming Lei, Aleksandr Mikhalitsyn, io-uring, Jens Axboe

On 3/23/23 14:18, Bernd Schubert wrote:
> On 3/23/23 13:35, Miklos Szeredi wrote:
>> On Thu, 23 Mar 2023 at 12:04, Bernd Schubert <bschubert@ddn.com> wrote:
>>>
>>> Thanks for looking at these patches!
>>>
>>> I'm adding in Ming Lei, as I had taken several ideas from ublkm I guess
>>> I also should also explain in the commit messages and code why it is
>>> done that way.
>>>
>>> On 3/23/23 11:27, Miklos Szeredi wrote:
>>>> On Tue, 21 Mar 2023 at 02:11, Bernd Schubert <bschubert@ddn.com> wrote:
>>>>>
>>>>> This adds a delayed work queue that runs in intervals
>>>>> to check and to stop the ring if needed. Fuse connection
>>>>> abort now waits for this worker to complete.
>>>>
>>>> This seems like a hack.   Can you explain what the problem is?
>>>>
>>>> The first thing I notice is that you store a reference to the task
>>>> that initiated the ring creation.  This already looks fishy, as the
>>>> ring could well survive the task (thread) that created it, no?
>>>
>>> You mean the currently ongoing work, where the daemon can be restarted?
>>> Daemon restart will need some work with ring communication, I will take
>>> care of that once we have agreed on an approach. [Also added in 
>>> Alexsandre].
>>>
>>> fuse_uring_stop_mon() checks if the daemon process is exiting and and
>>> looks at fc->ring.daemon->flags & PF_EXITING - this is what the process
>>> reference is for.
>>
>> Okay, so you are saying that the lifetime of the ring is bound to the
>> lifetime of the thread that created it?
>>
>> Why is that?
>>
>> I'ts much more common to bind a lifetime of an object to that of an
>> open file.  io_uring_setup() will do that for example.
>>
>> It's much easier to hook into the destruction of an open file, than
>> into the destruction of a process (as you've observed). And the way
>> you do it is even more confusing as the ring is destroyed not when the
>> process is destroyed, but when a specific thread is destroyed, making
>> this a thread specific behavior that is probably best avoided.
>>
>> So the obvious solution would be to destroy the ring(s) in
>> fuse_dev_release().  Why wouldn't that work?
>>
> 
> I _think_ I had tried it at the beginning and run into issues and then 
> switched the ublk approach. Going to try again now.
> 

Found the reason why I complete SQEs when the daemon stops - on the daemon 
side I have

ret = io_uring_wait_cqe(&queue->ring, &cqe);

and that hangs when you stop the user side with SIGTERM/SIGINT. Maybe that 
could be solved with io_uring_wait_cqe_timeout(), but would that really be 
a good solution? We would then have periodic CPU activity on the daemon 
side for no good reason - and the shorter the timeout, the faster 
SIGTERM/SIGINT reacts.
So at best, it should be the uring side that stops waiting on receiving a 
signal.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 06/13] fuse: Add an interval ring stop worker/monitor
  2023-03-23 20:51           ` Bernd Schubert
@ 2023-03-27 13:22             ` Pavel Begunkov
  2023-03-27 14:02               ` Bernd Schubert
  0 siblings, 1 reply; 29+ messages in thread
From: Pavel Begunkov @ 2023-03-27 13:22 UTC (permalink / raw)
  To: Bernd Schubert, Miklos Szeredi
  Cc: linux-fsdevel, Dharmendra Singh, Amir Goldstein, fuse-devel,
	Ming Lei, Aleksandr Mikhalitsyn, io-uring, Jens Axboe

On 3/23/23 20:51, Bernd Schubert wrote:
> On 3/23/23 14:18, Bernd Schubert wrote:
>> On 3/23/23 13:35, Miklos Szeredi wrote:
>>> On Thu, 23 Mar 2023 at 12:04, Bernd Schubert <bschubert@ddn.com> wrote:
[...]
> Found the reason why I complete SQEs when the daemon stops - on daemon
> side I have
> 
> ret = io_uring_wait_cqe(&queue->ring, &cqe);
> 
> and that hangs when you stop user side with SIGTERM/SIGINT. Maybe that
> could be solved with io_uring_wait_cqe_timeout() /
> io_uring_wait_cqe_timeout(), but would that really be a good solution?

It can be some sort of an eventfd triggered from the signal handler
and waited upon by an io_uring poll/read request. Or maybe signalfd.
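
Roughly (just a sketch of the idea, error handling omitted; STOP_MARKER 
and the queue loop around io_uring_wait_cqe() are assumptions based on 
your snippet above):

#include <liburing.h>
#include <sys/eventfd.h>
#include <signal.h>
#include <unistd.h>
#include <stdint.h>

#define STOP_MARKER ((uint64_t)-1)	/* made-up user_data marker */

static int stop_fd;

static void stop_handler(int sig)
{
	uint64_t one = 1;

	/* write(2) is async-signal-safe */
	(void)write(stop_fd, &one, sizeof(one));
}

static void arm_stop_event(struct io_uring *ring)
{
	static uint64_t val;
	struct io_uring_sqe *sqe;

	stop_fd = eventfd(0, EFD_CLOEXEC);
	signal(SIGTERM, stop_handler);
	signal(SIGINT, stop_handler);

	/* queue a read on the eventfd next to the FUSE uring commands */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, stop_fd, &val, sizeof(val), 0);
	io_uring_sqe_set_data64(sqe, STOP_MARKER);
	io_uring_submit(ring);
}

/* io_uring_wait_cqe() in the queue loop then also returns when
 * SIGTERM/SIGINT arrives; cqe->user_data tells the stop event apart
 * from FUSE requests */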

> We would now have CPU activity in intervals on the daemon side for now
> good reason - the more often the faster SIGTERM/SIGINT works.
> So at best, it should be uring side that stops to wait on a receiving a
> signal.

FWIW, io_uring (i.e. the kernel side) will stop waiting if there are pending
signals, and we'd need to check that liburing honours that, e.g. does not
retry waiting.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 06/13] fuse: Add an interval ring stop worker/monitor
  2023-03-27 13:22             ` Pavel Begunkov
@ 2023-03-27 14:02               ` Bernd Schubert
  0 siblings, 0 replies; 29+ messages in thread
From: Bernd Schubert @ 2023-03-27 14:02 UTC (permalink / raw)
  To: Pavel Begunkov, Miklos Szeredi
  Cc: linux-fsdevel, Dharmendra Singh, Amir Goldstein, fuse-devel,
	Ming Lei, Aleksandr Mikhalitsyn, io-uring, Jens Axboe

On 3/27/23 15:22, Pavel Begunkov wrote:
> On 3/23/23 20:51, Bernd Schubert wrote:
>> On 3/23/23 14:18, Bernd Schubert wrote:
>>> On 3/23/23 13:35, Miklos Szeredi wrote:
>>>> On Thu, 23 Mar 2023 at 12:04, Bernd Schubert <bschubert@ddn.com> wrote:
> [...]
>> Found the reason why I complete SQEs when the daemon stops - on daemon
>> side I have
>>
>> ret = io_uring_wait_cqe(&queue->ring, &cqe);
>>
>> and that hangs when you stop user side with SIGTERM/SIGINT. Maybe that
>> could be solved with io_uring_wait_cqe_timeout() /
>> io_uring_wait_cqe_timeout(), but would that really be a good solution?
> 
> It can be some sort of an eventfd triggered from the signal handler
> and waited upon by an io_uring poll/read request. Or maybe signalfd.
> 
>> We would now have CPU activity in intervals on the daemon side for now
>> good reason - the more often the faster SIGTERM/SIGINT works.
>> So at best, it should be uring side that stops to wait on a receiving a
>> signal.
> 
> FWIW, io_uring (i.e. kernel side) will stop waiting if there are pending
> signals, and we'd need to check liburing to honour it, e.g. not to retry
> waiting.
> 

I'm going to check where and why it hangs; I'm busy with something else 
today - by tomorrow I should know what happens.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 00/13] fuse uring communication
  2023-03-23 11:55     ` Amir Goldstein
@ 2023-06-07 14:20       ` Miklos Szeredi
  2023-06-08 21:31         ` Bernd Schubert
  0 siblings, 1 reply; 29+ messages in thread
From: Miklos Szeredi @ 2023-06-07 14:20 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Bernd Schubert, Bernd Schubert, linux-fsdevel, Dharmendra Singh

On Thu, 23 Mar 2023 at 12:55, Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Thu, Mar 23, 2023 at 1:18 PM Bernd Schubert
> <bernd.schubert@fastmail.fm> wrote:

> > there were several zufs threads, but I don't remember discussions about
> > cache line - maybe I had missed it. I can try to read through the old
> > threads, in case you don't have it.
>
> Miklos talked about it somewhere...

It was a private exchange between Amir and me:

    On Tue, 25 Feb 2020 at 20:33, Miklos Szeredi <miklos@szeredi.hu> wrote
    > On Tue, Feb 25, 2020 at 6:49 PM Amir Goldstein <amir73il@gmail.com> wrote:
    [...]
    > > BTW, out of curiosity, what was the purpose of the example of
    > > "use shared memory instead of threads"?
    >
    > In the threaded case there's a shared piece of memory in the kernel
    > (mm->cpu_bitmap) that is updated on each context switch (i.e. each
    > time a request is processed by one of the server threads).  If this is
    > a big NUMA system then cacheline pingpong on this bitmap can be a real
    > performance hit.
    >
    > Using shared memory means that the address space is not shared, hence
    > each server "thread" will have a separate "mm" structure, hence no
    > cacheline pingpong.
    >
    > It would be nice if the underlying problem with shared address space
    > could be solved in a scalable way instead of having to resort to this
    > hack, but it's not a trivial thing to do.  If you look at the
    > scheduler code, there's already a workaround for this issue in the
    > kernel threads case, but that doesn't work for user threads.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 00/13] fuse uring communication
  2023-06-07 14:20       ` Miklos Szeredi
@ 2023-06-08 21:31         ` Bernd Schubert
  0 siblings, 0 replies; 29+ messages in thread
From: Bernd Schubert @ 2023-06-08 21:31 UTC (permalink / raw)
  To: Miklos Szeredi, Amir Goldstein
  Cc: Bernd Schubert, linux-fsdevel, Dharmendra Singh



On 6/7/23 16:20, Miklos Szeredi wrote:
> On Thu, 23 Mar 2023 at 12:55, Amir Goldstein <amir73il@gmail.com> wrote:
>>
>> On Thu, Mar 23, 2023 at 1:18 PM Bernd Schubert
>> <bernd.schubert@fastmail.fm> wrote:
> 
>>> there were several zufs threads, but I don't remember discussions about
>>> cache line - maybe I had missed it. I can try to read through the old
>>> threads, in case you don't have it.
>>
>> Miklos talked about it somewhere...
> 
> It was a private exchange between Amir and me:
> 
>      On Tue, 25 Feb 2020 at 20:33, Miklos Szeredi <miklos@szeredi.hu> wrote
>      > On Tue, Feb 25, 2020 at 6:49 PM Amir Goldstein <amir73il@gmail.com> wrote:
>      [...]
>      > > BTW, out of curiosity, what was the purpose of the example of
>      > > "use shared memory instead of threads"?
>      >
>      > In the threaded case there's a shared piece of memory in the kernel
>      > (mm->cpu_bitmap) that is updated on each context switch (i.e. each
>      > time a request is processed by one of the server threads).  If this is
>      > a big NUMA system then cacheline pingpong on this bitmap can be a real
>      > performance hit.
>      >
>      > Using shared memory means that the address space is not shared, hence
>      > each server "thread" will have a separate "mm" structure, hence no
>      > cacheline pingpong.
>      >
>      > It would be nice if the underlying problem with shared address space
>      > could be solved in a scalable way instead of having to resort to this
>      > hack, but it's not a trivial thing to do.  If you look at the
>      > scheduler code, there's already a workaround for this issue in the
>      > kernel threads case, but that doesn't work for user threads.

Ah, thank you! I can then quote this mail in the next version.

Thanks,
Bernd

PS: I'm currently distracted by other work; I hope I can get back to 
fuse by tomorrow.

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2023-06-08 21:31 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-21  1:10 [RFC PATCH 00/13] fuse uring communication Bernd Schubert
2023-03-21  1:10 ` [PATCH 01/13] fuse: Add uring data structures and documentation Bernd Schubert
2023-03-21  1:10 ` [PATCH 02/13] fuse: rename to fuse_dev_end_requests and make non-static Bernd Schubert
2023-03-21  1:10 ` [PATCH 03/13] fuse: Move fuse_get_dev to header file Bernd Schubert
2023-03-21  1:10 ` [PATCH 04/13] Add a vmalloc_node_user function Bernd Schubert
2023-03-21 21:21   ` Andrew Morton
2023-03-21  1:10 ` [PATCH 05/13] fuse: Add a uring config ioctl and ring destruction Bernd Schubert
2023-03-21  1:10 ` [PATCH 06/13] fuse: Add an interval ring stop worker/monitor Bernd Schubert
2023-03-23 10:27   ` Miklos Szeredi
2023-03-23 11:04     ` Bernd Schubert
2023-03-23 12:35       ` Miklos Szeredi
2023-03-23 13:18         ` Bernd Schubert
2023-03-23 20:51           ` Bernd Schubert
2023-03-27 13:22             ` Pavel Begunkov
2023-03-27 14:02               ` Bernd Schubert
2023-03-23 13:26         ` Ming Lei
2023-03-21  1:10 ` [PATCH 07/13] fuse: Add uring mmap method Bernd Schubert
2023-03-21  1:10 ` [PATCH 08/13] fuse: Move request bits Bernd Schubert
2023-03-21  1:10 ` [PATCH 09/13] fuse: Add wait stop ioctl support to the ring Bernd Schubert
2023-03-21  1:10 ` [PATCH 10/13] fuse: Handle SQEs - register commands Bernd Schubert
2023-03-21  1:10 ` [PATCH 11/13] fuse: Add support to copy from/to the ring buffer Bernd Schubert
2023-03-21  1:10 ` [PATCH 12/13] fuse: Add uring sqe commit and fetch support Bernd Schubert
2023-03-21  1:10 ` [PATCH 13/13] fuse: Allow to queue to the ring Bernd Schubert
2023-03-21  1:26 ` [RFC PATCH 00/13] fuse uring communication Bernd Schubert
2023-03-21  9:35 ` Amir Goldstein
2023-03-23 11:18   ` Bernd Schubert
2023-03-23 11:55     ` Amir Goldstein
2023-06-07 14:20       ` Miklos Szeredi
2023-06-08 21:31         ` Bernd Schubert
