* [PATCH v5 0/6] io_uring passthrough for nvme
       [not found] <CGME20220511055306epcas5p3bf3b4c1e32d2bb43db12785bd7caf5da@epcas5p3.samsung.com>
@ 2022-05-11  5:47 ` Kanchan Joshi
       [not found]   ` <CGME20220511055308epcas5p3627bcb0ec10d7a2222e701898e9ad0db@epcas5p3.samsung.com>
                     ` (7 more replies)
  0 siblings, 8 replies; 21+ messages in thread
From: Kanchan Joshi @ 2022-05-11  5:47 UTC (permalink / raw)
  To: axboe, hch
  Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
	joshiiitr, anuj20.g, gost.dev

This series is against the "for-5.19/io_uring-passthrough" branch (linux-block).
The patches are to be refreshed on top of 2bb04df7c ("io_uring: support CQE32").

uring-cmd is the facility that enables io_uring capabilities (async being
one of them) for any arbitrary command (ioctl, fsctl or whatever else)
exposed by a command provider (driver, fs etc.). The series introduces
uring-cmd and connects nvme passthrough (over the generic device
/dev/ngXnY) to it.

uring-cmd is specified by IORING_OP_URING_CMD. The storage for the
command is provided in the SQE itself. On a regular ring, 16 bytes of
space are available and can be accessed via "sqe->cmd". Alternatively,
the application can set up the ring with the IORING_SETUP_SQE128 flag.
In that case each SQE of the ring is 128 bytes in size and provides
80 bytes of storage space for placing the command.

nvme io-passthrough is specified by the new operation NVME_URING_CMD_IO.
This operates on a new structure, nvme_uring_cmd, which is 72 bytes in
size. nvme passthrough requires two results to be returned to user-space,
so the ring needs to be set up with the IORING_SETUP_CQE32 flag. When this
flag is specified, each CQE of the ring is 32 bytes in size. The uring-cmd
infrastructure exports helpers so that the additional result can be
collected from the provider and placed into the CQE.
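
For illustration, a minimal userspace sketch (not part of the series) is
shown below. It assumes a liburing new enough to understand SQE128/CQE32
rings plus headers that already carry the new uapi bits from these patches;
nvme_fd is an open /dev/ngXnY and cmd a filled struct nvme_uring_cmd:

#include <liburing.h>
#include <linux/nvme_ioctl.h>
#include <stdio.h>
#include <string.h>

/* sketch only: submit one NVME_URING_CMD_IO and reap the 32-byte CQE */
static int submit_one(int nvme_fd, struct nvme_uring_cmd *cmd)
{
	struct io_uring_params p = {
		.flags = IORING_SETUP_SQE128 | IORING_SETUP_CQE32,
	};
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	int ret;

	ret = io_uring_queue_init_params(8, &ring, &p);
	if (ret)
		return ret;

	sqe = io_uring_get_sqe(&ring);
	memset(sqe, 0, 2 * sizeof(*sqe));	/* big SQE: 128 bytes */
	sqe->opcode = IORING_OP_URING_CMD;
	sqe->fd = nvme_fd;			/* /dev/ngXnY */
	sqe->cmd_op = NVME_URING_CMD_IO;
	/* the command payload lives inside the SQE: 80 bytes with SQE128 */
	memcpy(sqe->cmd, cmd, sizeof(*cmd));

	io_uring_submit(&ring);
	ret = io_uring_wait_cqe(&ring, &cqe);
	if (!ret) {
		/* cqe->res carries the status, big_cqe[0] the nvme result */
		printf("status %d, result %llu\n", cqe->res,
		       (unsigned long long)cqe->big_cqe[0]);
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return ret;
}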

Testing is done using this custom fio branch:
https://github.com/joshkan/fio/tree/big-cqe-pt.v4
Regular io_uring I/O (read/write) is turned into passthrough I/O when the
"-uring_cmd=1" option is supplied.

Example command line:
fio -iodepth=1 -rw=randread -ioengine=io_uring -bs=4k -numjobs=1 -size=4k -group_reporting -filename=/dev/ng0n1 -name=io_uring_1 -uring_cmd=1

Changes since v4:
https://lore.kernel.org/linux-nvme/20220505060616.803816-1-joshi.k@samsung.com/
- Allow uring-cmd to operate on regular ring too
- Move big-sqe/big-cqe requirement to nvme
- Add support for cases when uring-cmd needs deferral
- Reworked Patch 3
- In nvme, use READ_ONCE while reading cmd fields from SQE
- Refactored Patch 4 based on Christoph's feedback

Changes since v3:
https://lore.kernel.org/linux-nvme/20220503184831.78705-1-p.raghav@samsung.com/
- Cleaned up placements of new fields in sqe and io_uring_cmd
- Removed packed-attribute from nvme_uring_cmd struct
- Applied all other Christoph's feedback too
- Applied Jens feedback
- Collected reviewed-by

Changes since v2:
https://lore.kernel.org/linux-nvme/20220401110310.611869-1-joshi.k@samsung.com/
- Rewire uring-cmd infrastructure on top of new big CQE
- Prep patch (refactored) and feedback from Christoph
- Add new opcode and structure in nvme for uring-cmd
- Enable vectored-io

Changes since v1:
https://lore.kernel.org/linux-nvme/20220308152105.309618-1-joshi.k@samsung.com/
- Trim down by removing patches for polling, fixed-buffer and bio-cache
- Add big CQE and move uring-cmd to use that
- Remove indirect (pointer) submission

Anuj Gupta (1):
  nvme: add vectored-io support for uring-cmd

Christoph Hellwig (1):
  nvme: refactor nvme_submit_user_cmd()

Jens Axboe (3):
  fs,io_uring: add infrastructure for uring-cmd
  block: wire-up support for passthrough plugging
  io_uring: finish IOPOLL/ioprio prep handler removal

Kanchan Joshi (1):
  nvme: wire-up uring-cmd support for io-passthru on char-device.

 block/blk-mq.c                  |  73 +++++-----
 drivers/nvme/host/core.c        |   1 +
 drivers/nvme/host/ioctl.c       | 247 ++++++++++++++++++++++++++++++--
 drivers/nvme/host/multipath.c   |   1 +
 drivers/nvme/host/nvme.h        |   4 +
 fs/io_uring.c                   | 135 ++++++++++++++---
 include/linux/fs.h              |   2 +
 include/linux/io_uring.h        |  33 +++++
 include/uapi/linux/io_uring.h   |  21 +--
 include/uapi/linux/nvme_ioctl.h |  26 ++++
 10 files changed, 471 insertions(+), 72 deletions(-)

-- 
2.25.1



* [PATCH v5 1/6] fs,io_uring: add infrastructure for uring-cmd
       [not found]   ` <CGME20220511055308epcas5p3627bcb0ec10d7a2222e701898e9ad0db@epcas5p3.samsung.com>
@ 2022-05-11  5:47     ` Kanchan Joshi
  2022-05-11 12:42       ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Kanchan Joshi @ 2022-05-11  5:47 UTC (permalink / raw)
  To: axboe, hch
  Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
	joshiiitr, anuj20.g, gost.dev

From: Jens Axboe <axboe@kernel.dk>

file_operations->uring_cmd is a file private handler. It is somewhat
similar to ioctl, but hopefully a lot more sane and useful, as it can be
used to enable many io_uring capabilities for the underlying operation.

IORING_OP_URING_CMD is a file private kind of request. io_uring doesn't
know what is in this command type; it's up to the provider of
->uring_cmd() to deal with it.
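
As a rough sketch (a hypothetical "foo" driver; struct foo_cmd, the
FOO_URING_CMD_* opcodes and the foo_* helpers are all made up for
illustration), a provider wires this up along these lines:

/* hypothetical provider-side sketch, everything foo_* is made up */
static int foo_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags)
{
	/* the command payload sits in the SQE, as laid out by userspace */
	const struct foo_cmd *fcmd = ioucmd->cmd;

	switch (ioucmd->cmd_op) {
	case FOO_URING_CMD_SYNC:
		/* completing inline: the return value becomes cqe->res */
		return foo_handle_sync(fcmd);
	case FOO_URING_CMD_ASYNC:
		/*
		 * Queue async work; the driver later completes the request
		 * via io_uring_cmd_done(ioucmd, ret, res2).
		 */
		foo_queue_work(fcmd, ioucmd);
		return -EIOCBQUEUED;
	default:
		return -ENOTTY;
	}
}

static const struct file_operations foo_fops = {
	.owner		= THIS_MODULE,
	.uring_cmd	= foo_uring_cmd,
	/* ... regular fops ... */
};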

Co-developed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 123 ++++++++++++++++++++++++++++++++--
 include/linux/fs.h            |   2 +
 include/linux/io_uring.h      |  33 +++++++++
 include/uapi/linux/io_uring.h |  21 +++---
 4 files changed, 164 insertions(+), 15 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index ceaf7826ed71..592be5b89add 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -202,13 +202,6 @@ struct io_rings {
 	struct io_uring_cqe	cqes[] ____cacheline_aligned_in_smp;
 };
 
-enum io_uring_cmd_flags {
-	IO_URING_F_COMPLETE_DEFER	= 1,
-	IO_URING_F_UNLOCKED		= 2,
-	/* int's last bit, sign checks are usually faster than a bit test */
-	IO_URING_F_NONBLOCK		= INT_MIN,
-};
-
 struct io_mapped_ubuf {
 	u64		ubuf;
 	u64		ubuf_end;
@@ -972,6 +965,7 @@ struct io_kiocb {
 		struct io_xattr		xattr;
 		struct io_socket	sock;
 		struct io_nop		nop;
+		struct io_uring_cmd	uring_cmd;
 	};
 
 	u8				opcode;
@@ -1050,6 +1044,14 @@ struct io_cancel_data {
 	int seq;
 };
 
+/*
+ * The URING_CMD payload starts at 'cmd' in the first sqe, and continues into
+ * the following sqe if SQE128 is used.
+ */
+#define uring_cmd_pdu_size(is_sqe128)				\
+	((1 + !!(is_sqe128)) * sizeof(struct io_uring_sqe) -	\
+		offsetof(struct io_uring_sqe, cmd))
+
 struct io_op_def {
 	/* needs req->file assigned */
 	unsigned		needs_file : 1;
@@ -1289,6 +1291,12 @@ static const struct io_op_def io_op_defs[] = {
 	[IORING_OP_SOCKET] = {
 		.audit_skip		= 1,
 	},
+	[IORING_OP_URING_CMD] = {
+		.needs_file		= 1,
+		.plug			= 1,
+		.needs_async_setup	= 1,
+		.async_size		= uring_cmd_pdu_size(1),
+	},
 };
 
 /* requests with any of those set should undergo io_disarm_next() */
@@ -1428,6 +1436,8 @@ const char *io_uring_get_opcode(u8 opcode)
 		return "GETXATTR";
 	case IORING_OP_SOCKET:
 		return "SOCKET";
+	case IORING_OP_URING_CMD:
+		return "URING_CMD";
 	case IORING_OP_LAST:
 		return "INVALID";
 	}
@@ -4910,6 +4920,96 @@ static int io_linkat(struct io_kiocb *req, unsigned int issue_flags)
 	return 0;
 }
 
+static void io_uring_cmd_work(struct io_kiocb *req, bool *locked)
+{
+	req->uring_cmd.task_work_cb(&req->uring_cmd);
+}
+
+void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd,
+			void (*task_work_cb)(struct io_uring_cmd *))
+{
+	struct io_kiocb *req = container_of(ioucmd, struct io_kiocb, uring_cmd);
+
+	req->uring_cmd.task_work_cb = task_work_cb;
+	req->io_task_work.func = io_uring_cmd_work;
+	io_req_task_work_add(req, !!(req->ctx->flags & IORING_SETUP_SQPOLL));
+}
+EXPORT_SYMBOL_GPL(io_uring_cmd_complete_in_task);
+
+/*
+ * Called by consumers of io_uring_cmd, if they originally returned
+ * -EIOCBQUEUED upon receiving the command.
+ */
+void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret, ssize_t res2)
+{
+	struct io_kiocb *req = container_of(ioucmd, struct io_kiocb, uring_cmd);
+
+	if (ret < 0)
+		req_set_fail(req);
+	if (req->ctx->flags & IORING_SETUP_CQE32)
+		__io_req_complete32(req, 0, ret, 0, res2, 0);
+	else
+		io_req_complete(req, ret);
+}
+EXPORT_SYMBOL_GPL(io_uring_cmd_done);
+
+static int io_uring_cmd_prep_async(struct io_kiocb *req)
+{
+	size_t cmd_size;
+
+	cmd_size = uring_cmd_pdu_size(req->ctx->flags & IORING_SETUP_SQE128);
+
+	memcpy(req->async_data, req->uring_cmd.cmd, cmd_size);
+	return 0;
+}
+
+static int io_uring_cmd_prep(struct io_kiocb *req,
+			     const struct io_uring_sqe *sqe)
+{
+	struct io_uring_cmd *ioucmd = &req->uring_cmd;
+
+	if (sqe->ioprio || sqe->rw_flags)
+		return -EINVAL;
+	ioucmd->cmd = sqe->cmd;
+	ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
+	return 0;
+}
+
+static int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_uring_cmd *ioucmd = &req->uring_cmd;
+	struct io_ring_ctx *ctx = req->ctx;
+	struct file *file = req->file;
+	int ret;
+
+	if (!req->file->f_op->uring_cmd)
+		return -EOPNOTSUPP;
+
+	if (ctx->flags & IORING_SETUP_SQE128)
+		issue_flags |= IO_URING_F_SQE128;
+	if (ctx->flags & IORING_SETUP_CQE32)
+		issue_flags |= IO_URING_F_CQE32;
+	if (ctx->flags & IORING_SETUP_IOPOLL)
+		issue_flags |= IO_URING_F_IOPOLL;
+
+	if (req_has_async_data(req))
+		ioucmd->cmd = req->async_data;
+
+	ret = file->f_op->uring_cmd(ioucmd, issue_flags);
+	if (ret == -EAGAIN) {
+		if (!req_has_async_data(req)) {
+			if (io_alloc_async_data(req))
+				return -ENOMEM;
+			io_uring_cmd_prep_async(req);
+		}
+		return -EAGAIN;
+	}
+
+	if (ret != -EIOCBQUEUED)
+		io_uring_cmd_done(ioucmd, ret, 0);
+	return 0;
+}
+
 static int io_shutdown_prep(struct io_kiocb *req,
 			    const struct io_uring_sqe *sqe)
 {
@@ -7755,6 +7855,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return io_getxattr_prep(req, sqe);
 	case IORING_OP_SOCKET:
 		return io_socket_prep(req, sqe);
+	case IORING_OP_URING_CMD:
+		return io_uring_cmd_prep(req, sqe);
 	}
 
 	printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n",
@@ -7787,6 +7889,8 @@ static int io_req_prep_async(struct io_kiocb *req)
 		return io_recvmsg_prep_async(req);
 	case IORING_OP_CONNECT:
 		return io_connect_prep_async(req);
+	case IORING_OP_URING_CMD:
+		return io_uring_cmd_prep_async(req);
 	}
 	printk_once(KERN_WARNING "io_uring: prep_async() bad opcode %d\n",
 		    req->opcode);
@@ -8081,6 +8185,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 	case IORING_OP_SOCKET:
 		ret = io_socket(req, issue_flags);
 		break;
+	case IORING_OP_URING_CMD:
+		ret = io_uring_cmd(req, issue_flags);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -12699,6 +12806,8 @@ static int __init io_uring_init(void)
 
 	BUILD_BUG_ON(sizeof(atomic_t) != sizeof(u32));
 
+	BUILD_BUG_ON(sizeof(struct io_uring_cmd) > 64);
+
 	req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC |
 				SLAB_ACCOUNT);
 	return 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bbde95387a23..87b5af1d9fbe 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1953,6 +1953,7 @@ struct dir_context {
 #define REMAP_FILE_ADVISORY		(REMAP_FILE_CAN_SHORTEN)
 
 struct iov_iter;
+struct io_uring_cmd;
 
 struct file_operations {
 	struct module *owner;
@@ -1995,6 +1996,7 @@ struct file_operations {
 				   struct file *file_out, loff_t pos_out,
 				   loff_t len, unsigned int remap_flags);
 	int (*fadvise)(struct file *, loff_t, loff_t, int);
+	int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
 } __randomize_layout;
 
 struct inode_operations {
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 24651c229ed2..4a2f6cc5a492 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -5,7 +5,32 @@
 #include <linux/sched.h>
 #include <linux/xarray.h>
 
+enum io_uring_cmd_flags {
+	IO_URING_F_COMPLETE_DEFER	= 1,
+	IO_URING_F_UNLOCKED		= 2,
+	/* int's last bit, sign checks are usually faster than a bit test */
+	IO_URING_F_NONBLOCK		= INT_MIN,
+
+	/* ctx state flags, for URING_CMD */
+	IO_URING_F_SQE128		= 4,
+	IO_URING_F_CQE32		= 8,
+	IO_URING_F_IOPOLL		= 16,
+};
+
+struct io_uring_cmd {
+	struct file	*file;
+	const void	*cmd;
+	/* callback to defer completions to task context */
+	void (*task_work_cb)(struct io_uring_cmd *cmd);
+	u32		cmd_op;
+	u32		pad;
+	u8		pdu[32]; /* available inline for free use */
+};
+
 #if defined(CONFIG_IO_URING)
+void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret, ssize_t res2);
+void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd,
+			void (*task_work_cb)(struct io_uring_cmd *));
 struct sock *io_uring_get_socket(struct file *file);
 void __io_uring_cancel(bool cancel_all);
 void __io_uring_free(struct task_struct *tsk);
@@ -30,6 +55,14 @@ static inline void io_uring_free(struct task_struct *tsk)
 		__io_uring_free(tsk);
 }
 #else
+static inline void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret,
+		ssize_t ret2)
+{
+}
+static inline void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd,
+			void (*task_work_cb)(struct io_uring_cmd *))
+{
+}
 static inline struct sock *io_uring_get_socket(struct file *file)
 {
 	return NULL;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index ac2d90d669c3..23618be55dd2 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -22,6 +22,7 @@ struct io_uring_sqe {
 	union {
 		__u64	off;	/* offset into file */
 		__u64	addr2;
+		__u32	cmd_op;
 	};
 	union {
 		__u64	addr;	/* pointer to buffer or iovecs */
@@ -61,14 +62,17 @@ struct io_uring_sqe {
 		__s32	splice_fd_in;
 		__u32	file_index;
 	};
-	__u64	addr3;
-	__u64	__pad2[1];
-
-	/*
-	 * If the ring is initialized with IORING_SETUP_SQE128, then this field
-	 * contains 64-bytes of padding, doubling the size of the SQE.
-	 */
-	__u64	__big_sqe_pad[0];
+	union {
+		struct {
+			__u64	addr3;
+			__u64	__pad2[1];
+		};
+		/*
+		 * If the ring is initialized with IORING_SETUP_SQE128, then
+		 * this field is used for 80 bytes of arbitrary command data
+		 */
+		__u8	cmd[0];
+	};
 };
 
 enum {
@@ -175,6 +179,7 @@ enum io_uring_op {
 	IORING_OP_FGETXATTR,
 	IORING_OP_GETXATTR,
 	IORING_OP_SOCKET,
+	IORING_OP_URING_CMD,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
-- 
2.25.1



* [PATCH v5 2/6] block: wire-up support for passthrough plugging
       [not found]   ` <CGME20220511055310epcas5p46650f5b6fe963279f686b8f50a98a286@epcas5p4.samsung.com>
@ 2022-05-11  5:47     ` Kanchan Joshi
  2022-05-12  5:25       ` Ming Lei
  0 siblings, 1 reply; 21+ messages in thread
From: Kanchan Joshi @ 2022-05-11  5:47 UTC (permalink / raw)
  To: axboe, hch
  Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
	joshiiitr, anuj20.g, gost.dev

From: Jens Axboe <axboe@kernel.dk>

Add support for plugging in the passthrough path. When plugging is
enabled, requests are added to a plug instead of being dispatched to the
driver directly. When the plug is finished, the whole batch gets
dispatched via ->queue_rqs, which turns out to be more efficient than the
previous behaviour of dispatching one request at a time via ->queue_rq.
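
For reference, a submitter only gets the batching if it holds a plug
across the submissions; in the io_uring case that is the plug io_uring
already holds around SQE submission. A sketch (rqs[], nr and my_end_io
are placeholders, not from this patch):

struct blk_plug plug;
int i;

blk_start_plug(&plug);
for (i = 0; i < nr; i++)
	/* with the plug held, requests are parked instead of dispatched */
	blk_execute_rq_nowait(rqs[i], false, my_end_io);
blk_finish_plug(&plug);		/* whole batch goes out via ->queue_rqs */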

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 73 +++++++++++++++++++++++++++-----------------------
 1 file changed, 39 insertions(+), 34 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 84d749511f55..2cf011b57cf9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2340,6 +2340,40 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	blk_mq_hctx_mark_pending(hctx, ctx);
 }
 
+/*
+ * Allow 2x BLK_MAX_REQUEST_COUNT requests on plug queue for multiple
+ * queues. This is important for md arrays to benefit from merging
+ * requests.
+ */
+static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)
+{
+	if (plug->multiple_queues)
+		return BLK_MAX_REQUEST_COUNT * 2;
+	return BLK_MAX_REQUEST_COUNT;
+}
+
+static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
+{
+	struct request *last = rq_list_peek(&plug->mq_list);
+
+	if (!plug->rq_count) {
+		trace_block_plug(rq->q);
+	} else if (plug->rq_count >= blk_plug_max_rq_count(plug) ||
+		   (!blk_queue_nomerges(rq->q) &&
+		    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
+		blk_mq_flush_plug_list(plug, false);
+		trace_block_plug(rq->q);
+	}
+
+	if (!plug->multiple_queues && last && last->q != rq->q)
+		plug->multiple_queues = true;
+	if (!plug->has_elevator && (rq->rq_flags & RQF_ELV))
+		plug->has_elevator = true;
+	rq->rq_next = NULL;
+	rq_list_add(&plug->mq_list, rq);
+	plug->rq_count++;
+}
+
 /**
  * blk_mq_request_bypass_insert - Insert a request at dispatch list.
  * @rq: Pointer to request to be inserted.
@@ -2353,7 +2387,12 @@ void blk_mq_request_bypass_insert(struct request *rq, bool at_head,
 				  bool run_queue)
 {
 	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
+	struct blk_plug *plug = current->plug;
 
+	if (plug) {
+		blk_add_rq_to_plug(plug, rq);
+		return;
+	}
 	spin_lock(&hctx->lock);
 	if (at_head)
 		list_add(&rq->queuelist, &hctx->dispatch);
@@ -2676,40 +2715,6 @@ void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
 		hctx->queue->mq_ops->commit_rqs(hctx);
 }
 
-/*
- * Allow 2x BLK_MAX_REQUEST_COUNT requests on plug queue for multiple
- * queues. This is important for md arrays to benefit from merging
- * requests.
- */
-static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)
-{
-	if (plug->multiple_queues)
-		return BLK_MAX_REQUEST_COUNT * 2;
-	return BLK_MAX_REQUEST_COUNT;
-}
-
-static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
-{
-	struct request *last = rq_list_peek(&plug->mq_list);
-
-	if (!plug->rq_count) {
-		trace_block_plug(rq->q);
-	} else if (plug->rq_count >= blk_plug_max_rq_count(plug) ||
-		   (!blk_queue_nomerges(rq->q) &&
-		    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
-		blk_mq_flush_plug_list(plug, false);
-		trace_block_plug(rq->q);
-	}
-
-	if (!plug->multiple_queues && last && last->q != rq->q)
-		plug->multiple_queues = true;
-	if (!plug->has_elevator && (rq->rq_flags & RQF_ELV))
-		plug->has_elevator = true;
-	rq->rq_next = NULL;
-	rq_list_add(&plug->mq_list, rq);
-	plug->rq_count++;
-}
-
 static bool blk_mq_attempt_bio_merge(struct request_queue *q,
 				     struct bio *bio, unsigned int nr_segs)
 {
-- 
2.25.1



* [PATCH v5 3/6] nvme: refactor nvme_submit_user_cmd()
       [not found]   ` <CGME20220511055312epcas5p3b1e9989a32cb1a79f8a941476fc433d1@epcas5p3.samsung.com>
@ 2022-05-11  5:47     ` Kanchan Joshi
  0 siblings, 0 replies; 21+ messages in thread
From: Kanchan Joshi @ 2022-05-11  5:47 UTC (permalink / raw)
  To: axboe, hch
  Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
	joshiiitr, anuj20.g, gost.dev

From: Christoph Hellwig <hch@lst.de>

Divide the work into two helpers, namely nvme_alloc_user_request and
nvme_execute_user_rq. This is a prep patch to help wire up uring-cmd
support in nvme.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[axboe: fold in fix for assuming bio is non-NULL]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/nvme/host/ioctl.c | 56 +++++++++++++++++++++++++++++++--------
 1 file changed, 45 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 554566371ffa..8d2569b656cc 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -53,10 +53,20 @@ static void *nvme_add_user_metadata(struct bio *bio, void __user *ubuf,
 	return ERR_PTR(ret);
 }
 
-static int nvme_submit_user_cmd(struct request_queue *q,
+static int nvme_finish_user_metadata(struct request *req, void __user *ubuf,
+		void *meta, unsigned len, int ret)
+{
+	if (!ret && req_op(req) == REQ_OP_DRV_IN &&
+	    copy_to_user(ubuf, meta, len))
+		ret = -EFAULT;
+	kfree(meta);
+	return ret;
+}
+
+static struct request *nvme_alloc_user_request(struct request_queue *q,
 		struct nvme_command *cmd, void __user *ubuffer,
 		unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
-		u32 meta_seed, u64 *result, unsigned timeout, bool vec)
+		u32 meta_seed, void **metap, unsigned timeout, bool vec)
 {
 	bool write = nvme_is_write(cmd);
 	struct nvme_ns *ns = q->queuedata;
@@ -68,7 +78,7 @@ static int nvme_submit_user_cmd(struct request_queue *q,
 
 	req = blk_mq_alloc_request(q, nvme_req_op(cmd), 0);
 	if (IS_ERR(req))
-		return PTR_ERR(req);
+		return req;
 	nvme_init_request(req, cmd);
 
 	if (timeout)
@@ -105,26 +115,50 @@ static int nvme_submit_user_cmd(struct request_queue *q,
 				goto out_unmap;
 			}
 			req->cmd_flags |= REQ_INTEGRITY;
+			*metap = meta;
 		}
 	}
 
+	return req;
+
+out_unmap:
+	if (bio)
+		blk_rq_unmap_user(bio);
+out:
+	blk_mq_free_request(req);
+	return ERR_PTR(ret);
+}
+
+static int nvme_submit_user_cmd(struct request_queue *q,
+		struct nvme_command *cmd, void __user *ubuffer,
+		unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
+		u32 meta_seed, u64 *result, unsigned timeout, bool vec)
+{
+	struct request *req;
+	void *meta = NULL;
+	struct bio *bio;
+	int ret;
+
+	req = nvme_alloc_user_request(q, cmd, ubuffer, bufflen, meta_buffer,
+			meta_len, meta_seed, &meta, timeout, vec);
+	if (IS_ERR(req))
+		return PTR_ERR(req);
+
+	bio = req->bio;
+
 	ret = nvme_execute_passthru_rq(req);
+
 	if (result)
 		*result = le64_to_cpu(nvme_req(req)->result.u64);
-	if (meta && !ret && !write) {
-		if (copy_to_user(meta_buffer, meta, meta_len))
-			ret = -EFAULT;
-	}
-	kfree(meta);
- out_unmap:
+	if (meta)
+		ret = nvme_finish_user_metadata(req, meta_buffer, meta,
+						meta_len, ret);
 	if (bio)
 		blk_rq_unmap_user(bio);
- out:
 	blk_mq_free_request(req);
 	return ret;
 }
 
-
 static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 {
 	struct nvme_user_io io;
-- 
2.25.1



* [PATCH v5 4/6] nvme: wire-up uring-cmd support for io-passthru on char-device.
       [not found]   ` <CGME20220511055314epcas5p42ddbc17608f2677814c07b0d790e1318@epcas5p4.samsung.com>
@ 2022-05-11  5:47     ` Kanchan Joshi
  0 siblings, 0 replies; 21+ messages in thread
From: Kanchan Joshi @ 2022-05-11  5:47 UTC (permalink / raw)
  To: axboe, hch
  Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
	joshiiitr, anuj20.g, gost.dev

Introduce a handler for fops->uring_cmd(), implementing async passthru
on the char device (/dev/ngX). The handler supports the newly introduced
operation NVME_URING_CMD_IO. This operates on a new structure,
nvme_uring_cmd, which is similar to struct nvme_passthru_cmd64 but
without the embedded 8-byte result field. That field is not needed since
uring-cmd allows an additional result to be returned via the big CQE.
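
For illustration, userspace fills the new structure roughly as below for
a single 4KiB read (sketch; buf, nsid, slba and nlb are placeholders, and
the cdw10-12 mapping follows the usual NVMe read command layout rather
than anything defined by this patch):

struct nvme_uring_cmd cmd = { 0 };

cmd.opcode   = 0x02;			/* nvme_cmd_read */
cmd.nsid     = nsid;
cmd.addr     = (__u64)(uintptr_t)buf;	/* user data buffer */
cmd.data_len = 4096;
cmd.cdw10    = slba & 0xffffffff;	/* starting LBA, low 32 bits */
cmd.cdw11    = slba >> 32;		/* starting LBA, high 32 bits */
cmd.cdw12    = nlb - 1;			/* 0-based number of blocks */
/* then copied into sqe->cmd of an IORING_OP_URING_CMD SQE with
   sqe->cmd_op = NVME_URING_CMD_IO */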

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c        |   1 +
 drivers/nvme/host/ioctl.c       | 192 +++++++++++++++++++++++++++++++-
 drivers/nvme/host/multipath.c   |   1 +
 drivers/nvme/host/nvme.h        |   4 +
 include/uapi/linux/nvme_ioctl.h |  25 +++++
 5 files changed, 220 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index e1846d04817f..682df98db341 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3699,6 +3699,7 @@ static const struct file_operations nvme_ns_chr_fops = {
 	.release	= nvme_ns_chr_release,
 	.unlocked_ioctl	= nvme_ns_chr_ioctl,
 	.compat_ioctl	= compat_ptr_ioctl,
+	.uring_cmd	= nvme_ns_chr_uring_cmd,
 };
 
 static int nvme_add_ns_cdev(struct nvme_ns *ns)
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 8d2569b656cc..92d695262d8f 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -5,6 +5,7 @@
  */
 #include <linux/ptrace.h>	/* for force_successful_syscall_return */
 #include <linux/nvme_ioctl.h>
+#include <linux/io_uring.h>
 #include "nvme.h"
 
 /*
@@ -66,7 +67,8 @@ static int nvme_finish_user_metadata(struct request *req, void __user *ubuf,
 static struct request *nvme_alloc_user_request(struct request_queue *q,
 		struct nvme_command *cmd, void __user *ubuffer,
 		unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
-		u32 meta_seed, void **metap, unsigned timeout, bool vec)
+		u32 meta_seed, void **metap, unsigned timeout, bool vec,
+		unsigned int rq_flags, blk_mq_req_flags_t blk_flags)
 {
 	bool write = nvme_is_write(cmd);
 	struct nvme_ns *ns = q->queuedata;
@@ -76,7 +78,7 @@ static struct request *nvme_alloc_user_request(struct request_queue *q,
 	void *meta = NULL;
 	int ret;
 
-	req = blk_mq_alloc_request(q, nvme_req_op(cmd), 0);
+	req = blk_mq_alloc_request(q, nvme_req_op(cmd) | rq_flags, blk_flags);
 	if (IS_ERR(req))
 		return req;
 	nvme_init_request(req, cmd);
@@ -140,7 +142,7 @@ static int nvme_submit_user_cmd(struct request_queue *q,
 	int ret;
 
 	req = nvme_alloc_user_request(q, cmd, ubuffer, bufflen, meta_buffer,
-			meta_len, meta_seed, &meta, timeout, vec);
+			meta_len, meta_seed, &meta, timeout, vec, 0, 0);
 	if (IS_ERR(req))
 		return PTR_ERR(req);
 
@@ -330,6 +332,139 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	return status;
 }
 
+struct nvme_uring_data {
+	__u64	metadata;
+	__u64	addr;
+	__u32	data_len;
+	__u32	metadata_len;
+	__u32	timeout_ms;
+};
+
+/*
+ * This overlays struct io_uring_cmd pdu.
+ * Expect build errors if this grows larger than that.
+ */
+struct nvme_uring_cmd_pdu {
+	union {
+		struct bio *bio;
+		struct request *req;
+	};
+	void *meta; /* kernel-resident buffer */
+	void __user *meta_buffer;
+	u32 meta_len;
+};
+
+static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(
+		struct io_uring_cmd *ioucmd)
+{
+	return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu;
+}
+
+static void nvme_uring_task_cb(struct io_uring_cmd *ioucmd)
+{
+	struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
+	struct request *req = pdu->req;
+	struct bio *bio = req->bio;
+	int status;
+	u64 result;
+
+	if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
+		status = -EINTR;
+	else
+		status = nvme_req(req)->status;
+
+	result = le64_to_cpu(nvme_req(req)->result.u64);
+
+	if (pdu->meta)
+		status = nvme_finish_user_metadata(req, pdu->meta_buffer,
+					pdu->meta, pdu->meta_len, status);
+	if (bio)
+		blk_rq_unmap_user(bio);
+	blk_mq_free_request(req);
+
+	io_uring_cmd_done(ioucmd, status, result);
+}
+
+static void nvme_uring_cmd_end_io(struct request *req, blk_status_t err)
+{
+	struct io_uring_cmd *ioucmd = req->end_io_data;
+	struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
+	/* extract bio before reusing the same field for request */
+	struct bio *bio = pdu->bio;
+
+	pdu->req = req;
+	req->bio = bio;
+	/* this takes care of moving rest of completion-work to task context */
+	io_uring_cmd_complete_in_task(ioucmd, nvme_uring_task_cb);
+}
+
+static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
+		struct io_uring_cmd *ioucmd, unsigned int issue_flags)
+{
+	struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
+	const struct nvme_uring_cmd *cmd = ioucmd->cmd;
+	struct request_queue *q = ns ? ns->queue : ctrl->admin_q;
+	struct nvme_uring_data d;
+	struct nvme_command c;
+	struct request *req;
+	unsigned int rq_flags = 0;
+	blk_mq_req_flags_t blk_flags = 0;
+	void *meta = NULL;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	c.common.opcode = READ_ONCE(cmd->opcode);
+	c.common.flags = READ_ONCE(cmd->flags);
+	if (c.common.flags)
+		return -EINVAL;
+
+	c.common.command_id = 0;
+	c.common.nsid = cpu_to_le32(cmd->nsid);
+	if (!nvme_validate_passthru_nsid(ctrl, ns, le32_to_cpu(c.common.nsid)))
+		return -EINVAL;
+
+	c.common.cdw2[0] = cpu_to_le32(READ_ONCE(cmd->cdw2));
+	c.common.cdw2[1] = cpu_to_le32(READ_ONCE(cmd->cdw3));
+	c.common.metadata = 0;
+	c.common.dptr.prp1 = c.common.dptr.prp2 = 0;
+	c.common.cdw10 = cpu_to_le32(READ_ONCE(cmd->cdw10));
+	c.common.cdw11 = cpu_to_le32(READ_ONCE(cmd->cdw11));
+	c.common.cdw12 = cpu_to_le32(READ_ONCE(cmd->cdw12));
+	c.common.cdw13 = cpu_to_le32(READ_ONCE(cmd->cdw13));
+	c.common.cdw14 = cpu_to_le32(READ_ONCE(cmd->cdw14));
+	c.common.cdw15 = cpu_to_le32(READ_ONCE(cmd->cdw15));
+
+	d.metadata = READ_ONCE(cmd->metadata);
+	d.addr = READ_ONCE(cmd->addr);
+	d.data_len = READ_ONCE(cmd->data_len);
+	d.metadata_len = READ_ONCE(cmd->metadata_len);
+	d.timeout_ms = READ_ONCE(cmd->timeout_ms);
+
+	if (issue_flags & IO_URING_F_NONBLOCK) {
+		rq_flags = REQ_NOWAIT;
+		blk_flags = BLK_MQ_REQ_NOWAIT;
+	}
+
+	req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(d.addr),
+			d.data_len, nvme_to_user_ptr(d.metadata),
+			d.metadata_len, 0, &meta, d.timeout_ms ?
+			msecs_to_jiffies(d.timeout_ms) : 0, 0, rq_flags,
+			blk_flags);
+	if (IS_ERR(req))
+		return PTR_ERR(req);
+	req->end_io_data = ioucmd;
+
+	/* to free bio on completion, as req->bio will be null at that time */
+	pdu->bio = req->bio;
+	pdu->meta = meta;
+	pdu->meta_buffer = nvme_to_user_ptr(d.metadata);
+	pdu->meta_len = d.metadata_len;
+
+	blk_execute_rq_nowait(req, 0, nvme_uring_cmd_end_io);
+	return -EIOCBQUEUED;
+}
+
 static bool is_ctrl_ioctl(unsigned int cmd)
 {
 	if (cmd == NVME_IOCTL_ADMIN_CMD || cmd == NVME_IOCTL_ADMIN64_CMD)
@@ -421,6 +556,42 @@ long nvme_ns_chr_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	return __nvme_ioctl(ns, cmd, (void __user *)arg);
 }
 
+static int nvme_ns_uring_cmd(struct nvme_ns *ns, struct io_uring_cmd *ioucmd,
+			     unsigned int issue_flags)
+{
+	struct nvme_ctrl *ctrl = ns->ctrl;
+	int ret;
+
+	BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) > sizeof(ioucmd->pdu));
+
+	/* IOPOLL not supported yet */
+	if (issue_flags & IO_URING_F_IOPOLL)
+		return -EOPNOTSUPP;
+
+	/* NVMe passthrough requires bit SQE/CQE support */
+	if ((issue_flags & (IO_URING_F_SQE128|IO_URING_F_CQE32)) !=
+	    (IO_URING_F_SQE128|IO_URING_F_CQE32))
+		return -EOPNOTSUPP;
+
+	switch (ioucmd->cmd_op) {
+	case NVME_URING_CMD_IO:
+		ret = nvme_uring_cmd_io(ctrl, ns, ioucmd, issue_flags);
+		break;
+	default:
+		ret = -ENOTTY;
+	}
+
+	return ret;
+}
+
+int nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags)
+{
+	struct nvme_ns *ns = container_of(file_inode(ioucmd->file)->i_cdev,
+			struct nvme_ns, cdev);
+
+	return nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
+}
+
 #ifdef CONFIG_NVME_MULTIPATH
 static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
 		void __user *argp, struct nvme_ns_head *head, int srcu_idx)
@@ -487,6 +658,21 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
 	srcu_read_unlock(&head->srcu, srcu_idx);
 	return ret;
 }
+
+int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
+		unsigned int issue_flags)
+{
+	struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
+	struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
+	int srcu_idx = srcu_read_lock(&head->srcu);
+	struct nvme_ns *ns = nvme_find_path(head);
+	int ret = -EINVAL;
+
+	if (ns)
+		ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
+	srcu_read_unlock(&head->srcu, srcu_idx);
+	return ret;
+}
 #endif /* CONFIG_NVME_MULTIPATH */
 
 static int nvme_dev_user_cmd(struct nvme_ctrl *ctrl, void __user *argp)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index d464fdf978fb..d3e2440d8abb 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -437,6 +437,7 @@ static const struct file_operations nvme_ns_head_chr_fops = {
 	.release	= nvme_ns_head_chr_release,
 	.unlocked_ioctl	= nvme_ns_head_chr_ioctl,
 	.compat_ioctl	= compat_ptr_ioctl,
+	.uring_cmd	= nvme_ns_head_chr_uring_cmd,
 };
 
 static int nvme_add_ns_head_cdev(struct nvme_ns_head *head)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index a2b53ca63335..086ccbdd7003 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -782,6 +782,10 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
 		unsigned long arg);
 long nvme_dev_ioctl(struct file *file, unsigned int cmd,
 		unsigned long arg);
+int nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd,
+		unsigned int issue_flags);
+int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
+		unsigned int issue_flags);
 int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo);
 
 extern const struct attribute_group *nvme_ns_id_attr_groups[];
diff --git a/include/uapi/linux/nvme_ioctl.h b/include/uapi/linux/nvme_ioctl.h
index b2e43185e3b5..04e458c649ab 100644
--- a/include/uapi/linux/nvme_ioctl.h
+++ b/include/uapi/linux/nvme_ioctl.h
@@ -70,6 +70,28 @@ struct nvme_passthru_cmd64 {
 	__u64	result;
 };
 
+/* same as struct nvme_passthru_cmd64, minus the 8b result field */
+struct nvme_uring_cmd {
+	__u8	opcode;
+	__u8	flags;
+	__u16	rsvd1;
+	__u32	nsid;
+	__u32	cdw2;
+	__u32	cdw3;
+	__u64	metadata;
+	__u64	addr;
+	__u32	metadata_len;
+	__u32	data_len;
+	__u32	cdw10;
+	__u32	cdw11;
+	__u32	cdw12;
+	__u32	cdw13;
+	__u32	cdw14;
+	__u32	cdw15;
+	__u32	timeout_ms;
+	__u32   rsvd2;
+};
+
 #define nvme_admin_cmd nvme_passthru_cmd
 
 #define NVME_IOCTL_ID		_IO('N', 0x40)
@@ -83,4 +105,7 @@ struct nvme_passthru_cmd64 {
 #define NVME_IOCTL_IO64_CMD	_IOWR('N', 0x48, struct nvme_passthru_cmd64)
 #define NVME_IOCTL_IO64_CMD_VEC	_IOWR('N', 0x49, struct nvme_passthru_cmd64)
 
+/* io_uring async commands: */
+#define NVME_URING_CMD_IO	_IOWR('N', 0x80, struct nvme_uring_cmd)
+
 #endif /* _UAPI_LINUX_NVME_IOCTL_H */
-- 
2.25.1



* [PATCH v5 5/6] nvme: add vectored-io support for uring-cmd
       [not found]   ` <CGME20220511055319epcas5p1d78cb03bd1b145a6d58c8e616795af14@epcas5p1.samsung.com>
@ 2022-05-11  5:47     ` Kanchan Joshi
  0 siblings, 0 replies; 21+ messages in thread
From: Kanchan Joshi @ 2022-05-11  5:47 UTC (permalink / raw)
  To: axboe, hch
  Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
	joshiiitr, anuj20.g, gost.dev

From: Anuj Gupta <anuj20.g@samsung.com>

Wire up support for async passthru that takes an array of buffers (using
iovec), exposed via a new op NVME_URING_CMD_IO_VEC. The same 'struct
nvme_uring_cmd' is used, with:

1. cmd.addr as base address of user iovec array
2. cmd.data_len as count of iovec array elements
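
A sketch of the difference on the submission side (buf0/buf1 are
placeholder buffers):

struct iovec iov[2] = {
	{ .iov_base = buf0, .iov_len = 4096 },
	{ .iov_base = buf1, .iov_len = 4096 },
};

cmd.addr     = (__u64)(uintptr_t)iov;	/* base of the iovec array */
cmd.data_len = 2;			/* count of iovec elements */
/* and the SQE uses sqe->cmd_op = NVME_URING_CMD_IO_VEC */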

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/ioctl.c       | 9 ++++++---
 include/uapi/linux/nvme_ioctl.h | 1 +
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 92d695262d8f..7b0e2c9cdcae 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -399,7 +399,7 @@ static void nvme_uring_cmd_end_io(struct request *req, blk_status_t err)
 }
 
 static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
-		struct io_uring_cmd *ioucmd, unsigned int issue_flags)
+		struct io_uring_cmd *ioucmd, unsigned int issue_flags, bool vec)
 {
 	struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
 	const struct nvme_uring_cmd *cmd = ioucmd->cmd;
@@ -449,7 +449,7 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(d.addr),
 			d.data_len, nvme_to_user_ptr(d.metadata),
 			d.metadata_len, 0, &meta, d.timeout_ms ?
-			msecs_to_jiffies(d.timeout_ms) : 0, 0, rq_flags,
+			msecs_to_jiffies(d.timeout_ms) : 0, vec, rq_flags,
 			blk_flags);
 	if (IS_ERR(req))
 		return PTR_ERR(req);
@@ -575,7 +575,10 @@ static int nvme_ns_uring_cmd(struct nvme_ns *ns, struct io_uring_cmd *ioucmd,
 
 	switch (ioucmd->cmd_op) {
 	case NVME_URING_CMD_IO:
-		ret = nvme_uring_cmd_io(ctrl, ns, ioucmd, issue_flags);
+		ret = nvme_uring_cmd_io(ctrl, ns, ioucmd, issue_flags, false);
+		break;
+	case NVME_URING_CMD_IO_VEC:
+		ret = nvme_uring_cmd_io(ctrl, ns, ioucmd, issue_flags, true);
 		break;
 	default:
 		ret = -ENOTTY;
diff --git a/include/uapi/linux/nvme_ioctl.h b/include/uapi/linux/nvme_ioctl.h
index 04e458c649ab..0b1876aa5a59 100644
--- a/include/uapi/linux/nvme_ioctl.h
+++ b/include/uapi/linux/nvme_ioctl.h
@@ -107,5 +107,6 @@ struct nvme_uring_cmd {
 
 /* io_uring async commands: */
 #define NVME_URING_CMD_IO	_IOWR('N', 0x80, struct nvme_uring_cmd)
+#define NVME_URING_CMD_IO_VEC	_IOWR('N', 0x81, struct nvme_uring_cmd)
 
 #endif /* _UAPI_LINUX_NVME_IOCTL_H */
-- 
2.25.1



* [PATCH v5 6/6] io_uring: finish IOPOLL/ioprio prep handler removal
       [not found]   ` <CGME20220511055320epcas5p2457aaf8e7405387282f60831f5a4eca9@epcas5p2.samsung.com>
@ 2022-05-11  5:47     ` Kanchan Joshi
  2022-05-11  6:54       ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Kanchan Joshi @ 2022-05-11  5:47 UTC (permalink / raw)
  To: axboe, hch
  Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
	joshiiitr, anuj20.g, gost.dev

From: Jens Axboe <axboe@kernel.dk>

Commit 73911426aaaa unified the checking of this, but it was done before
we had a few new opcode handlers.

Finish the job and apply it to the xattr/socket/uring_cmd handlers as
well, so we check and use this fully consistently.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 592be5b89add..44c57dca358d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4517,10 +4517,6 @@ static int __io_getxattr_prep(struct io_kiocb *req,
 	const char __user *name;
 	int ret;
 
-	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
-		return -EINVAL;
-	if (unlikely(sqe->ioprio))
-		return -EINVAL;
 	if (unlikely(req->flags & REQ_F_FIXED_FILE))
 		return -EBADF;
 
@@ -4630,10 +4626,6 @@ static int __io_setxattr_prep(struct io_kiocb *req,
 	const char __user *name;
 	int ret;
 
-	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
-		return -EINVAL;
-	if (unlikely(sqe->ioprio))
-		return -EINVAL;
 	if (unlikely(req->flags & REQ_F_FIXED_FILE))
 		return -EBADF;
 
@@ -4968,7 +4960,7 @@ static int io_uring_cmd_prep(struct io_kiocb *req,
 {
 	struct io_uring_cmd *ioucmd = &req->uring_cmd;
 
-	if (sqe->ioprio || sqe->rw_flags)
+	if (sqe->rw_flags)
 		return -EINVAL;
 	ioucmd->cmd = sqe->cmd;
 	ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
@@ -6405,9 +6397,7 @@ static int io_socket_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_socket *sock = &req->sock;
 
-	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
-		return -EINVAL;
-	if (sqe->ioprio || sqe->addr || sqe->rw_flags || sqe->buf_index)
+	if (sqe->addr || sqe->rw_flags || sqe->buf_index)
 		return -EINVAL;
 
 	sock->domain = READ_ONCE(sqe->fd);
-- 
2.25.1



* Re: [PATCH v5 6/6] io_uring: finish IOPOLL/ioprio prep handler removal
  2022-05-11  5:47     ` [PATCH v5 6/6] io_uring: finish IOPOLL/ioprio prep handler removal Kanchan Joshi
@ 2022-05-11  6:54       ` Christoph Hellwig
  2022-05-11 11:05         ` Kanchan Joshi
  2022-05-11 12:38         ` Jens Axboe
  0 siblings, 2 replies; 21+ messages in thread
From: Christoph Hellwig @ 2022-05-11  6:54 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: axboe, hch, io-uring, linux-nvme, asml.silence, ming.lei, mcgrof,
	shr, joshiiitr, anuj20.g, gost.dev

I think this should go first, and the uring_cmd bits need to do the
right thing from the beginning.

On Wed, May 11, 2022 at 11:17:50AM +0530, Kanchan Joshi wrote:
> From: Jens Axboe <axboe@kernel.dk>
> 
> Commit 73911426aaaa unified the checking of this, but was done before we
> had a few new opcode handlers.

Also please quote the actual commit subject as well here, per the usual
Fixes-like reference style.


* Re: [PATCH v5 6/6] io_uring: finish IOPOLL/ioprio prep handler removal
  2022-05-11  6:54       ` Christoph Hellwig
@ 2022-05-11 11:05         ` Kanchan Joshi
  2022-05-11 12:38         ` Jens Axboe
  1 sibling, 0 replies; 21+ messages in thread
From: Kanchan Joshi @ 2022-05-11 11:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
	joshiiitr, anuj20.g, gost.dev


On Wed, May 11, 2022 at 08:54:23AM +0200, Christoph Hellwig wrote:
>I think this should go first, and the uring_cmd bits need to do the
>right thing from the beginning.

So this patch should do what it does, but only for xattr/socket, and it
should then move down to a point before the uring-cmd patch.
Jens - let me know if you want me to iterate with that.





* Re: [PATCH v5 6/6] io_uring: finish IOPOLL/ioprio prep handler removal
  2022-05-11  6:54       ` Christoph Hellwig
  2022-05-11 11:05         ` Kanchan Joshi
@ 2022-05-11 12:38         ` Jens Axboe
  1 sibling, 0 replies; 21+ messages in thread
From: Jens Axboe @ 2022-05-11 12:38 UTC (permalink / raw)
  To: Christoph Hellwig, Kanchan Joshi
  Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
	joshiiitr, anuj20.g, gost.dev

On 5/11/22 12:54 AM, Christoph Hellwig wrote:
> I think this should go first, and the uring_cmd bits need to do the
> right thing from the beginning.

It just needs folding in, I'll do that.

-- 
Jens Axboe



* Re: [PATCH v5 0/6] io_uring passthrough for nvme
  2022-05-11  5:47 ` [PATCH v5 0/6] io_uring passthrough for nvme Kanchan Joshi
                     ` (5 preceding siblings ...)
       [not found]   ` <CGME20220511055320epcas5p2457aaf8e7405387282f60831f5a4eca9@epcas5p2.samsung.com>
@ 2022-05-11 12:39   ` Jens Axboe
  2022-05-11 13:14     ` Kanchan Joshi
  2022-05-11 13:48   ` Jens Axboe
  7 siblings, 1 reply; 21+ messages in thread
From: Jens Axboe @ 2022-05-11 12:39 UTC (permalink / raw)
  To: Kanchan Joshi, hch
  Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
	joshiiitr, anuj20.g, gost.dev

Which patches had changes in this series? I'm assuming it's just patch
1, but the changelog doesn't actually say. Would save me from comparing
to what's in-tree already.

-- 
Jens Axboe



* Re: [PATCH v5 1/6] fs,io_uring: add infrastructure for uring-cmd
  2022-05-11  5:47     ` [PATCH v5 1/6] fs,io_uring: add infrastructure for uring-cmd Kanchan Joshi
@ 2022-05-11 12:42       ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2022-05-11 12:42 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: axboe, hch, io-uring, linux-nvme, asml.silence, ming.lei, mcgrof,
	shr, joshiiitr, anuj20.g, gost.dev

Looks good (modulo the bits from patch 6 that should be folded in):

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v5 0/6] io_uring passthrough for nvme
  2022-05-11 12:39   ` [PATCH v5 0/6] io_uring passthrough for nvme Jens Axboe
@ 2022-05-11 13:14     ` Kanchan Joshi
  2022-05-11 13:39       ` Jens Axboe
  0 siblings, 1 reply; 21+ messages in thread
From: Kanchan Joshi @ 2022-05-11 13:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Kanchan Joshi, Christoph Hellwig, io-uring, linux-nvme,
	Pavel Begunkov, Ming Lei, Luis Chamberlain, Stefan Roesch,
	Anuj Gupta, gost.dev

On Wed, May 11, 2022 at 6:09 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Which patches had changes in this series? I'm assuming it's just patch
> 1, but the changelog doesn't actually say. Would save me from comparing
> to what's in-tree already.

Compared to in-tree, it is Patch 1 and Patch 4.
This part in patch 4:
+int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
+               unsigned int issue_flags)
+{
+       struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
+       struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
+       int srcu_idx = srcu_read_lock(&head->srcu);
+       struct nvme_ns *ns = nvme_find_path(head);
+       int ret = -EINVAL;
+
+       if (ns)
+               ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
+       srcu_read_unlock(&head->srcu, srcu_idx);
+       return ret;
+}
The change is initializing ret to -EINVAL rather than 0.
We do not support admin commands yet, so ns can be NULL only if
something goes wrong with multipath. So if anything does go wrong and ns
is NULL, it is better to return failure than success.

And I removed the lore links from commit-messages, thinking those will
be refreshed too.


* Re: [PATCH v5 0/6] io_uring passthrough for nvme
  2022-05-11 13:14     ` Kanchan Joshi
@ 2022-05-11 13:39       ` Jens Axboe
  0 siblings, 0 replies; 21+ messages in thread
From: Jens Axboe @ 2022-05-11 13:39 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Kanchan Joshi, Christoph Hellwig, io-uring, linux-nvme,
	Pavel Begunkov, Ming Lei, Luis Chamberlain, Stefan Roesch,
	Anuj Gupta, gost.dev

On 5/11/22 7:14 AM, Kanchan Joshi wrote:
> On Wed, May 11, 2022 at 6:09 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Which patches had changes in this series? I'm assuming it's just patch
>> 1, but the changelog doesn't actually say. Would save me from comparing
>> to what's in-tree already.
> 
> Compared to in-tree, it is Patch 1 and Patch 4.
> This part in patch 4:
> +int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
> +               unsigned int issue_flags)
> +{
> +       struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
> +       struct nvme_ns_head *head = container_of(cdev, struct
> nvme_ns_head, cdev);
> +       int srcu_idx = srcu_read_lock(&head->srcu);
> +       struct nvme_ns *ns = nvme_find_path(head);
> +       int ret = -EINVAL;
> +
> +       if (ns)
> +               ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
> +       srcu_read_unlock(&head->srcu, srcu_idx);
> +       return ret;
> +}
> Initializing ret to -EINVAL rather than 0.
> We do not support admin commands yet, so ns can be null only if
> something goes wrong with multipath.
> So if at all anything goes wrong and ns is null, it is better to
> return failure than success.
> 
> And I removed the lore links from commit-messages, thinking those will
> be refreshed too.

OK all good, that's what I expected. I'll update the series.

-- 
Jens Axboe



* Re: [PATCH v5 0/6] io_uring passthrough for nvme
  2022-05-11  5:47 ` [PATCH v5 0/6] io_uring passthrough for nvme Kanchan Joshi
                     ` (6 preceding siblings ...)
  2022-05-11 12:39   ` [PATCH v5 0/6] io_uring passthrough for nvme Jens Axboe
@ 2022-05-11 13:48   ` Jens Axboe
  7 siblings, 0 replies; 21+ messages in thread
From: Jens Axboe @ 2022-05-11 13:48 UTC (permalink / raw)
  To: Christoph Hellwig, joshi.k
  Cc: joshiiitr, anuj20.g, shr, ming.lei, linux-nvme, io-uring, mcgrof,
	gost.dev, asml.silence

On Wed, 11 May 2022 11:17:44 +0530, Kanchan Joshi wrote:
> This series is against "for-5.19/io_uring-passthrough" branch (linux-block).
> Patches to be refreshed on top of 2bb04df7c ("io_uring: support CQE32").
> 
> uring-cmd is the facility to enable io_uring capabilities (async is one
> of those) for any arbitrary command (ioctl, fsctl or whatever else)
> exposed by the command-providers (driver, fs etc.). The series
> introduces uring-cmd, and connects nvme passthrough (over generic device
> /dev/ngXnY) to it.
> 
> [...]

Applied, thanks!

[1/6] fs,io_uring: add infrastructure for uring-cmd
      commit: ee692a21e9bf8354bd3ec816f1cf4bff8619ed77
[2/6] block: wire-up support for passthrough plugging
      commit: 1c2d2fff6dc04662dc8e86b537989643e1abeed9
[3/6] nvme: refactor nvme_submit_user_cmd()
      commit: bcad2565b5d64700cf68cc9d48618ab817ff5bc4
[4/6] nvme: wire-up uring-cmd support for io-passthru on char-device.
      commit: 456cba386e94f22fa1b1426303fdcac9e66b1417
[5/6] nvme: add vectored-io support for uring-cmd
      commit: f569add47119fa910ed7711b26b8d38e21f7ea77
[6/6] io_uring: finish IOPOLL/ioprio prep handler removal
      (no commit info)

Best regards,
-- 
Jens Axboe




* Re: [PATCH v5 2/6] block: wire-up support for passthrough plugging
  2022-05-11  5:47     ` [PATCH v5 2/6] block: wire-up support for passthrough plugging Kanchan Joshi
@ 2022-05-12  5:25       ` Ming Lei
  2022-05-12  8:09         ` Christoph Hellwig
  2022-05-12 11:50         ` Anuj Gupta
  0 siblings, 2 replies; 21+ messages in thread
From: Ming Lei @ 2022-05-12  5:25 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: axboe, hch, io-uring, linux-nvme, asml.silence, mcgrof, shr,
	joshiiitr, anuj20.g, gost.dev

Hello,

On Wed, May 11, 2022 at 11:17:46AM +0530, Kanchan Joshi wrote:
> From: Jens Axboe <axboe@kernel.dk>
> 
> Add support for plugging in passthrough path. When plugging is enabled, the
> requests are added to a plug instead of getting dispatched to the driver.
> And when the plug is finished, the whole batch gets dispatched via
> ->queue_rqs which turns out to be more efficient. Otherwise dispatching
> used to happen via ->queue_rq, one request at a time.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  block/blk-mq.c | 73 +++++++++++++++++++++++++++-----------------------
>  1 file changed, 39 insertions(+), 34 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 84d749511f55..2cf011b57cf9 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2340,6 +2340,40 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  	blk_mq_hctx_mark_pending(hctx, ctx);
>  }
>  
> +/*
> + * Allow 2x BLK_MAX_REQUEST_COUNT requests on plug queue for multiple
> + * queues. This is important for md arrays to benefit from merging
> + * requests.
> + */
> +static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)
> +{
> +	if (plug->multiple_queues)
> +		return BLK_MAX_REQUEST_COUNT * 2;
> +	return BLK_MAX_REQUEST_COUNT;
> +}
> +
> +static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
> +{
> +	struct request *last = rq_list_peek(&plug->mq_list);
> +
> +	if (!plug->rq_count) {
> +		trace_block_plug(rq->q);
> +	} else if (plug->rq_count >= blk_plug_max_rq_count(plug) ||
> +		   (!blk_queue_nomerges(rq->q) &&
> +		    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
> +		blk_mq_flush_plug_list(plug, false);
> +		trace_block_plug(rq->q);
> +	}
> +
> +	if (!plug->multiple_queues && last && last->q != rq->q)
> +		plug->multiple_queues = true;
> +	if (!plug->has_elevator && (rq->rq_flags & RQF_ELV))
> +		plug->has_elevator = true;
> +	rq->rq_next = NULL;
> +	rq_list_add(&plug->mq_list, rq);
> +	plug->rq_count++;
> +}
> +
>  /**
>   * blk_mq_request_bypass_insert - Insert a request at dispatch list.
>   * @rq: Pointer to request to be inserted.
> @@ -2353,7 +2387,12 @@ void blk_mq_request_bypass_insert(struct request *rq, bool at_head,
>  				  bool run_queue)
>  {
>  	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> +	struct blk_plug *plug = current->plug;
>  
> +	if (plug) {
> +		blk_add_rq_to_plug(plug, rq);
> +		return;
> +	}

This way may cause nested plugging, and it breaks xfstests generic/131.
It may also cause an IO hang, since the request can't be polled before
the plug is flushed in blk_execute_rq().

I'd suggest applying the plug in blk_execute_rq_nowait(), such as:


diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2cf011b57cf9..60c29c0229d5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1169,6 +1169,62 @@ static void blk_end_sync_rq(struct request *rq, blk_status_t error)
 	complete(waiting);
 }
 
+/*
+ * Allow 2x BLK_MAX_REQUEST_COUNT requests on plug queue for multiple
+ * queues. This is important for md arrays to benefit from merging
+ * requests.
+ */
+static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)
+{
+	if (plug->multiple_queues)
+		return BLK_MAX_REQUEST_COUNT * 2;
+	return BLK_MAX_REQUEST_COUNT;
+}
+
+static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
+{
+	struct request *last = rq_list_peek(&plug->mq_list);
+
+	if (!plug->rq_count) {
+		trace_block_plug(rq->q);
+	} else if (plug->rq_count >= blk_plug_max_rq_count(plug) ||
+		   (!blk_queue_nomerges(rq->q) &&
+		    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
+		blk_mq_flush_plug_list(plug, false);
+		trace_block_plug(rq->q);
+	}
+
+	if (!plug->multiple_queues && last && last->q != rq->q)
+		plug->multiple_queues = true;
+	if (!plug->has_elevator && (rq->rq_flags & RQF_ELV))
+		plug->has_elevator = true;
+	rq->rq_next = NULL;
+	rq_list_add(&plug->mq_list, rq);
+	plug->rq_count++;
+}
+
+static void __blk_execute_rq_nowait(struct request *rq, bool at_head,
+		rq_end_io_fn *done, bool use_plug)
+{
+	WARN_ON(irqs_disabled());
+	WARN_ON(!blk_rq_is_passthrough(rq));
+
+	rq->end_io = done;
+
+	blk_account_io_start(rq);
+
+	if (use_plug && current->plug) {
+		blk_add_rq_to_plug(current->plug, rq);
+		return;
+	}
+	/*
+	 * don't check dying flag for MQ because the request won't
+	 * be reused after dying flag is set
+	 */
+	blk_mq_sched_insert_request(rq, at_head, true, false);
+}
+
+
 /**
  * blk_execute_rq_nowait - insert a request to I/O scheduler for execution
  * @rq:		request to insert
@@ -1184,18 +1240,8 @@ static void blk_end_sync_rq(struct request *rq, blk_status_t error)
  */
 void blk_execute_rq_nowait(struct request *rq, bool at_head, rq_end_io_fn *done)
 {
-	WARN_ON(irqs_disabled());
-	WARN_ON(!blk_rq_is_passthrough(rq));
-
-	rq->end_io = done;
-
-	blk_account_io_start(rq);
+	__blk_execute_rq_nowait(rq, at_head, done, true);
 
-	/*
-	 * don't check dying flag for MQ because the request won't
-	 * be reused after dying flag is set
-	 */
-	blk_mq_sched_insert_request(rq, at_head, true, false);
 }
 EXPORT_SYMBOL_GPL(blk_execute_rq_nowait);
 
@@ -1234,7 +1280,7 @@ blk_status_t blk_execute_rq(struct request *rq, bool at_head)
 	unsigned long hang_check;
 
 	rq->end_io_data = &wait;
-	blk_execute_rq_nowait(rq, at_head, blk_end_sync_rq);
+	__blk_execute_rq_nowait(rq, at_head, blk_end_sync_rq, false);
 
 	/* Prevent hang_check timer from firing at us during very long I/O */
 	hang_check = sysctl_hung_task_timeout_secs;
@@ -2340,40 +2386,6 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	blk_mq_hctx_mark_pending(hctx, ctx);
 }
 
-/*
- * Allow 2x BLK_MAX_REQUEST_COUNT requests on plug queue for multiple
- * queues. This is important for md arrays to benefit from merging
- * requests.
- */
-static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)
-{
-	if (plug->multiple_queues)
-		return BLK_MAX_REQUEST_COUNT * 2;
-	return BLK_MAX_REQUEST_COUNT;
-}
-
-static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
-{
-	struct request *last = rq_list_peek(&plug->mq_list);
-
-	if (!plug->rq_count) {
-		trace_block_plug(rq->q);
-	} else if (plug->rq_count >= blk_plug_max_rq_count(plug) ||
-		   (!blk_queue_nomerges(rq->q) &&
-		    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
-		blk_mq_flush_plug_list(plug, false);
-		trace_block_plug(rq->q);
-	}
-
-	if (!plug->multiple_queues && last && last->q != rq->q)
-		plug->multiple_queues = true;
-	if (!plug->has_elevator && (rq->rq_flags & RQF_ELV))
-		plug->has_elevator = true;
-	rq->rq_next = NULL;
-	rq_list_add(&plug->mq_list, rq);
-	plug->rq_count++;
-}
-
 /**
  * blk_mq_request_bypass_insert - Insert a request at dispatch list.
  * @rq: Pointer to request to be inserted.
@@ -2387,12 +2399,7 @@ void blk_mq_request_bypass_insert(struct request *rq, bool at_head,
 				  bool run_queue)
 {
 	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
-	struct blk_plug *plug = current->plug;
 
-	if (plug) {
-		blk_add_rq_to_plug(plug, rq);
-		return;
-	}
 	spin_lock(&hctx->lock);
 	if (at_head)
 		list_add(&rq->queuelist, &hctx->dispatch);


Thanks,
Ming



* Re: [PATCH v5 2/6] block: wire-up support for passthrough plugging
  2022-05-12  5:25       ` Ming Lei
@ 2022-05-12  8:09         ` Christoph Hellwig
  2022-05-12  8:44           ` Ming Lei
  2022-05-12 11:50         ` Anuj Gupta
  1 sibling, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2022-05-12  8:09 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kanchan Joshi, axboe, hch, io-uring, linux-nvme, asml.silence,
	mcgrof, shr, joshiiitr, anuj20.g, gost.dev

On Thu, May 12, 2022 at 01:25:24PM +0800, Ming Lei wrote:
> This way may cause nested plugging, and breaks xfstests generic/131.
> Also may cause io hang since request can't be polled before flushing
> plug in blk_execute_rq().

Looking at this again, yes blk_mq_request_bypass_insert is probably the
wrong place.

> I'd suggest to apply the plug in blk_execute_rq_nowait(), such as:

Do we really need the use_plug parameter and the extra helper?  If
someone holds a plug over passthrough command submission I think
we can assume they actually do want to use it.  Otherwise this does
indeed look like the better plan.
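
For illustration, that would boil down to something like the sketch below in
blk-mq.c -- no extra helper and no use_plug flag, the plug is simply honoured
whenever the caller holds one (with blk_add_rq_to_plug() moved up as in the
suggested diff):

void blk_execute_rq_nowait(struct request *rq, bool at_head,
		rq_end_io_fn *done)
{
	WARN_ON(irqs_disabled());
	WARN_ON(!blk_rq_is_passthrough(rq));

	rq->end_io = done;

	blk_account_io_start(rq);

	/* a caller holding a plug presumably wants the batching */
	if (current->plug) {
		blk_add_rq_to_plug(current->plug, rq);
		return;
	}
	/*
	 * don't check dying flag for MQ because the request won't
	 * be reused after dying flag is set
	 */
	blk_mq_sched_insert_request(rq, at_head, true, false);
}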

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/6] block: wire-up support for passthrough plugging
  2022-05-12  8:09         ` Christoph Hellwig
@ 2022-05-12  8:44           ` Ming Lei
  2022-05-12 12:31             ` Jens Axboe
  0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2022-05-12  8:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kanchan Joshi, axboe, io-uring, linux-nvme, asml.silence, mcgrof,
	shr, joshiiitr, anuj20.g, gost.dev

On Thu, May 12, 2022 at 10:09:12AM +0200, Christoph Hellwig wrote:
> On Thu, May 12, 2022 at 01:25:24PM +0800, Ming Lei wrote:
> > This way may cause nested plugging, and breaks xfstests generic/131.
> > Also may cause io hang since request can't be polled before flushing
> > plug in blk_execute_rq().
> 
> Looking at this again, yes blk_mq_request_bypass_insert is probably the
> wrong place.
> 
> > I'd suggest to apply the plug in blk_execute_rq_nowait(), such as:
> 
> Do we really need the use_plug parameter and the extra helper?  If
> someone holds a plug over passthrough command submission I think
> we can assume they actually do want to use it.  Otherwise this does
> indeed look like the better plan.

use_plug is just there to avoid a hang in blk_rq_poll_completion(), so
I think we can bypass the plug when a polled rq is executed inside
blk_execute_rq().
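
For illustration, one possible shape without the use_plug parameter is to key
off the request itself -- a sketch only, assuming the existing static
blk_rq_is_poll() helper in blk-mq.c can be used at this point:

void blk_execute_rq_nowait(struct request *rq, bool at_head,
		rq_end_io_fn *done)
{
	WARN_ON(irqs_disabled());
	WARN_ON(!blk_rq_is_passthrough(rq));

	rq->end_io = done;

	blk_account_io_start(rq);

	/*
	 * Never plug a polled request: blk_execute_rq() polls for the
	 * completion before any plug would get flushed, which would hang.
	 */
	if (current->plug && !blk_rq_is_poll(rq)) {
		blk_add_rq_to_plug(current->plug, rq);
		return;
	}
	blk_mq_sched_insert_request(rq, at_head, true, false);
}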


Thanks,
Ming


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/6] block: wire-up support for passthrough plugging
  2022-05-12  5:25       ` Ming Lei
  2022-05-12  8:09         ` Christoph Hellwig
@ 2022-05-12 11:50         ` Anuj Gupta
  2022-05-12 14:06           ` Ming Lei
  1 sibling, 1 reply; 21+ messages in thread
From: Anuj Gupta @ 2022-05-12 11:50 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kanchan Joshi, axboe, hch, io-uring, linux-nvme, asml.silence,
	mcgrof, shr, joshiiitr, gost.dev

On Thu, May 12, 2022 at 01:25:24PM +0800, Ming Lei wrote:
> Hello,
> 
> On Wed, May 11, 2022 at 11:17:46AM +0530, Kanchan Joshi wrote:
> > From: Jens Axboe <axboe@kernel.dk>
> > 
> > Add support for plugging in passthrough path. When plugging is enabled, the
> > requests are added to a plug instead of getting dispatched to the driver.
> > And when the plug is finished, the whole batch gets dispatched via
> > ->queue_rqs which turns out to be more efficient. Otherwise dispatching
> > used to happen via ->queue_rq, one request at a time.
> > 
> > Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > ---
> >  block/blk-mq.c | 73 +++++++++++++++++++++++++++-----------------------
> >  1 file changed, 39 insertions(+), 34 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 84d749511f55..2cf011b57cf9 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2340,6 +2340,40 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> >  	blk_mq_hctx_mark_pending(hctx, ctx);
> >  }
> >  
> > +/*
> > + * Allow 2x BLK_MAX_REQUEST_COUNT requests on plug queue for multiple
> > + * queues. This is important for md arrays to benefit from merging
> > + * requests.
> > + */
> > +static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)
> > +{
> > +	if (plug->multiple_queues)
> > +		return BLK_MAX_REQUEST_COUNT * 2;
> > +	return BLK_MAX_REQUEST_COUNT;
> > +}
> > +
> > +static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
> > +{
> > +	struct request *last = rq_list_peek(&plug->mq_list);
> > +
> > +	if (!plug->rq_count) {
> > +		trace_block_plug(rq->q);
> > +	} else if (plug->rq_count >= blk_plug_max_rq_count(plug) ||
> > +		   (!blk_queue_nomerges(rq->q) &&
> > +		    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
> > +		blk_mq_flush_plug_list(plug, false);
> > +		trace_block_plug(rq->q);
> > +	}
> > +
> > +	if (!plug->multiple_queues && last && last->q != rq->q)
> > +		plug->multiple_queues = true;
> > +	if (!plug->has_elevator && (rq->rq_flags & RQF_ELV))
> > +		plug->has_elevator = true;
> > +	rq->rq_next = NULL;
> > +	rq_list_add(&plug->mq_list, rq);
> > +	plug->rq_count++;
> > +}
> > +
> >  /**
> >   * blk_mq_request_bypass_insert - Insert a request at dispatch list.
> >   * @rq: Pointer to request to be inserted.
> > @@ -2353,7 +2387,12 @@ void blk_mq_request_bypass_insert(struct request *rq, bool at_head,
> >  				  bool run_queue)
> >  {
> >  	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> > +	struct blk_plug *plug = current->plug;
> >  
> > +	if (plug) {
> > +		blk_add_rq_to_plug(plug, rq);
> > +		return;
> > +	}
> 
> This way may cause nested plugging, and breaks xfstests generic/131.
> Also may cause io hang since request can't be polled before flushing
> plug in blk_execute_rq().
>
Hi Ming,
Could you please share your test setup?
I tried test 131 with xfs and it passed.

I followed these steps:
1) mkfs.xfs -f /dev/nvme0n1
2) mount /dev/nvme0n1 /mnt/test
3) ./check tests/generic/131

Tried the same with ext4 and it passed as well.

Thanks,
Anuj

> I'd suggest to apply the plug in blk_execute_rq_nowait(), such as:
> 
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 2cf011b57cf9..60c29c0229d5 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1169,6 +1169,62 @@ static void blk_end_sync_rq(struct request *rq, blk_status_t error)
>  	complete(waiting);
>  }
>  
> +/*
> + * Allow 2x BLK_MAX_REQUEST_COUNT requests on plug queue for multiple
> + * queues. This is important for md arrays to benefit from merging
> + * requests.
> + */
> +static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)
> +{
> +	if (plug->multiple_queues)
> +		return BLK_MAX_REQUEST_COUNT * 2;
> +	return BLK_MAX_REQUEST_COUNT;
> +}
> +
> +static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
> +{
> +	struct request *last = rq_list_peek(&plug->mq_list);
> +
> +	if (!plug->rq_count) {
> +		trace_block_plug(rq->q);
> +	} else if (plug->rq_count >= blk_plug_max_rq_count(plug) ||
> +		   (!blk_queue_nomerges(rq->q) &&
> +		    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
> +		blk_mq_flush_plug_list(plug, false);
> +		trace_block_plug(rq->q);
> +	}
> +
> +	if (!plug->multiple_queues && last && last->q != rq->q)
> +		plug->multiple_queues = true;
> +	if (!plug->has_elevator && (rq->rq_flags & RQF_ELV))
> +		plug->has_elevator = true;
> +	rq->rq_next = NULL;
> +	rq_list_add(&plug->mq_list, rq);
> +	plug->rq_count++;
> +}
> +
> +static void __blk_execute_rq_nowait(struct request *rq, bool at_head,
> +		rq_end_io_fn *done, bool use_plug)
> +{
> +	WARN_ON(irqs_disabled());
> +	WARN_ON(!blk_rq_is_passthrough(rq));
> +
> +	rq->end_io = done;
> +
> +	blk_account_io_start(rq);
> +
> +	if (use_plug && current->plug) {
> +		blk_add_rq_to_plug(current->plug, rq);
> +		return;
> +	}
> +	/*
> +	 * don't check dying flag for MQ because the request won't
> +	 * be reused after dying flag is set
> +	 */
> +	blk_mq_sched_insert_request(rq, at_head, true, false);
> +}
> +
> +
>  /**
>   * blk_execute_rq_nowait - insert a request to I/O scheduler for execution
>   * @rq:		request to insert
> @@ -1184,18 +1240,8 @@ static void blk_end_sync_rq(struct request *rq, blk_status_t error)
>   */
>  void blk_execute_rq_nowait(struct request *rq, bool at_head, rq_end_io_fn *done)
>  {
> -	WARN_ON(irqs_disabled());
> -	WARN_ON(!blk_rq_is_passthrough(rq));
> -
> -	rq->end_io = done;
> -
> -	blk_account_io_start(rq);
> +	__blk_execute_rq_nowait(rq, at_head, done, true);
>  
> -	/*
> -	 * don't check dying flag for MQ because the request won't
> -	 * be reused after dying flag is set
> -	 */
> -	blk_mq_sched_insert_request(rq, at_head, true, false);
>  }
>  EXPORT_SYMBOL_GPL(blk_execute_rq_nowait);
>  
> @@ -1234,7 +1280,7 @@ blk_status_t blk_execute_rq(struct request *rq, bool at_head)
>  	unsigned long hang_check;
>  
>  	rq->end_io_data = &wait;
> -	blk_execute_rq_nowait(rq, at_head, blk_end_sync_rq);
> +	__blk_execute_rq_nowait(rq, at_head, blk_end_sync_rq, false);
>  
>  	/* Prevent hang_check timer from firing at us during very long I/O */
>  	hang_check = sysctl_hung_task_timeout_secs;
> @@ -2340,40 +2386,6 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  	blk_mq_hctx_mark_pending(hctx, ctx);
>  }
>  
> -/*
> - * Allow 2x BLK_MAX_REQUEST_COUNT requests on plug queue for multiple
> - * queues. This is important for md arrays to benefit from merging
> - * requests.
> - */
> -static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)
> -{
> -	if (plug->multiple_queues)
> -		return BLK_MAX_REQUEST_COUNT * 2;
> -	return BLK_MAX_REQUEST_COUNT;
> -}
> -
> -static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
> -{
> -	struct request *last = rq_list_peek(&plug->mq_list);
> -
> -	if (!plug->rq_count) {
> -		trace_block_plug(rq->q);
> -	} else if (plug->rq_count >= blk_plug_max_rq_count(plug) ||
> -		   (!blk_queue_nomerges(rq->q) &&
> -		    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
> -		blk_mq_flush_plug_list(plug, false);
> -		trace_block_plug(rq->q);
> -	}
> -
> -	if (!plug->multiple_queues && last && last->q != rq->q)
> -		plug->multiple_queues = true;
> -	if (!plug->has_elevator && (rq->rq_flags & RQF_ELV))
> -		plug->has_elevator = true;
> -	rq->rq_next = NULL;
> -	rq_list_add(&plug->mq_list, rq);
> -	plug->rq_count++;
> -}
> -
>  /**
>   * blk_mq_request_bypass_insert - Insert a request at dispatch list.
>   * @rq: Pointer to request to be inserted.
> @@ -2387,12 +2399,7 @@ void blk_mq_request_bypass_insert(struct request *rq, bool at_head,
>  				  bool run_queue)
>  {
>  	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> -	struct blk_plug *plug = current->plug;
>  
> -	if (plug) {
> -		blk_add_rq_to_plug(plug, rq);
> -		return;
> -	}
>  	spin_lock(&hctx->lock);
>  	if (at_head)
>  		list_add(&rq->queuelist, &hctx->dispatch);
> 
> 
> Thanks,
> Ming
> 
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/6] block: wire-up support for passthrough plugging
  2022-05-12  8:44           ` Ming Lei
@ 2022-05-12 12:31             ` Jens Axboe
  0 siblings, 0 replies; 21+ messages in thread
From: Jens Axboe @ 2022-05-12 12:31 UTC (permalink / raw)
  To: Ming Lei, Christoph Hellwig
  Cc: Kanchan Joshi, io-uring, linux-nvme, asml.silence, mcgrof, shr,
	joshiiitr, anuj20.g, gost.dev

On 5/12/22 2:44 AM, Ming Lei wrote:
> On Thu, May 12, 2022 at 10:09:12AM +0200, Christoph Hellwig wrote:
>> On Thu, May 12, 2022 at 01:25:24PM +0800, Ming Lei wrote:
>>> This way may cause nested plugging, and breaks xfstests generic/131.
>>> Also may cause io hang since request can't be polled before flushing
>>> plug in blk_execute_rq().
>>
>> Looking at this again, yes blk_mq_request_bypass_insert is probably the
>> wrong place.
>>
>>> I'd suggest to apply the plug in blk_execute_rq_nowait(), such as:
>>
>> Do we really need the use_plug parameter and the extra helper?  If
>> someone holds a plug over passthrough command submission I think
>> we can assume they actually do want to use it.  Otherwise this does
>> indeed look like the better plan.
> 
> use_plug is just there to avoid a hang in blk_rq_poll_completion(), so
> I think we can bypass the plug when a polled rq is executed inside
> blk_execute_rq().

Agree, and good catch. Do you want to send out a patch for this?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/6] block: wire-up support for passthrough plugging
  2022-05-12 11:50         ` Anuj Gupta
@ 2022-05-12 14:06           ` Ming Lei
  0 siblings, 0 replies; 21+ messages in thread
From: Ming Lei @ 2022-05-12 14:06 UTC (permalink / raw)
  To: Anuj Gupta
  Cc: Kanchan Joshi, axboe, hch, io-uring, linux-nvme, asml.silence,
	mcgrof, shr, joshiiitr, gost.dev

On Thu, May 12, 2022 at 05:20:47PM +0530, Anuj Gupta wrote:
> On Thu, May 12, 2022 at 01:25:24PM +0800, Ming Lei wrote:
> > Hello,
> > 
> > On Wed, May 11, 2022 at 11:17:46AM +0530, Kanchan Joshi wrote:
> > > From: Jens Axboe <axboe@kernel.dk>
> > > 
> > > Add support for plugging in passthrough path. When plugging is enabled, the
> > > requests are added to a plug instead of getting dispatched to the driver.
> > > And when the plug is finished, the whole batch gets dispatched via
> > > ->queue_rqs which turns out to be more efficient. Otherwise dispatching
> > > used to happen via ->queue_rq, one request at a time.
> > > 
> > > Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > ---
> > >  block/blk-mq.c | 73 +++++++++++++++++++++++++++-----------------------
> > >  1 file changed, 39 insertions(+), 34 deletions(-)
> > > 
> > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > index 84d749511f55..2cf011b57cf9 100644
> > > --- a/block/blk-mq.c
> > > +++ b/block/blk-mq.c
> > > @@ -2340,6 +2340,40 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> > >  	blk_mq_hctx_mark_pending(hctx, ctx);
> > >  }
> > >  
> > > +/*
> > > + * Allow 2x BLK_MAX_REQUEST_COUNT requests on plug queue for multiple
> > > + * queues. This is important for md arrays to benefit from merging
> > > + * requests.
> > > + */
> > > +static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)
> > > +{
> > > +	if (plug->multiple_queues)
> > > +		return BLK_MAX_REQUEST_COUNT * 2;
> > > +	return BLK_MAX_REQUEST_COUNT;
> > > +}
> > > +
> > > +static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
> > > +{
> > > +	struct request *last = rq_list_peek(&plug->mq_list);
> > > +
> > > +	if (!plug->rq_count) {
> > > +		trace_block_plug(rq->q);
> > > +	} else if (plug->rq_count >= blk_plug_max_rq_count(plug) ||
> > > +		   (!blk_queue_nomerges(rq->q) &&
> > > +		    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
> > > +		blk_mq_flush_plug_list(plug, false);
> > > +		trace_block_plug(rq->q);
> > > +	}
> > > +
> > > +	if (!plug->multiple_queues && last && last->q != rq->q)
> > > +		plug->multiple_queues = true;
> > > +	if (!plug->has_elevator && (rq->rq_flags & RQF_ELV))
> > > +		plug->has_elevator = true;
> > > +	rq->rq_next = NULL;
> > > +	rq_list_add(&plug->mq_list, rq);
> > > +	plug->rq_count++;
> > > +}
> > > +
> > >  /**
> > >   * blk_mq_request_bypass_insert - Insert a request at dispatch list.
> > >   * @rq: Pointer to request to be inserted.
> > > @@ -2353,7 +2387,12 @@ void blk_mq_request_bypass_insert(struct request *rq, bool at_head,
> > >  				  bool run_queue)
> > >  {
> > >  	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> > > +	struct blk_plug *plug = current->plug;
> > >  
> > > +	if (plug) {
> > > +		blk_add_rq_to_plug(plug, rq);
> > > +		return;
> > > +	}
> > 
> > This way may cause nested plugging, and breaks xfstests generic/131.
> > Also may cause io hang since request can't be polled before flushing
> > plug in blk_execute_rq().
> >
> Hi Ming,
> Could you please share your test setup?
> I tried test 131 with xfs and it passed.
> 
> I followed these steps:
> 1) mkfs.xfs -f /dev/nvme0n1
> 2) mount /dev/nvme0n1 /mnt/test
> 3) ./check tests/generic/131

It is triggered when I run xfstests over UBD & xfs:

https://github.com/ming1/linux/tree/my_for-5.18-ubd-devel_v2
https://github.com/ming1/ubdsrv/tree/devel-v2



Thanks,
Ming


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2022-05-12 14:06 UTC | newest]

Thread overview: 21+ messages
     [not found] <CGME20220511055306epcas5p3bf3b4c1e32d2bb43db12785bd7caf5da@epcas5p3.samsung.com>
2022-05-11  5:47 ` [PATCH v5 0/6] io_uring passthrough for nvme Kanchan Joshi
     [not found]   ` <CGME20220511055308epcas5p3627bcb0ec10d7a2222e701898e9ad0db@epcas5p3.samsung.com>
2022-05-11  5:47     ` [PATCH v5 1/6] fs,io_uring: add infrastructure for uring-cmd Kanchan Joshi
2022-05-11 12:42       ` Christoph Hellwig
     [not found]   ` <CGME20220511055310epcas5p46650f5b6fe963279f686b8f50a98a286@epcas5p4.samsung.com>
2022-05-11  5:47     ` [PATCH v5 2/6] block: wire-up support for passthrough plugging Kanchan Joshi
2022-05-12  5:25       ` Ming Lei
2022-05-12  8:09         ` Christoph Hellwig
2022-05-12  8:44           ` Ming Lei
2022-05-12 12:31             ` Jens Axboe
2022-05-12 11:50         ` Anuj Gupta
2022-05-12 14:06           ` Ming Lei
     [not found]   ` <CGME20220511055312epcas5p3b1e9989a32cb1a79f8a941476fc433d1@epcas5p3.samsung.com>
2022-05-11  5:47     ` [PATCH v5 3/6] nvme: refactor nvme_submit_user_cmd() Kanchan Joshi
     [not found]   ` <CGME20220511055314epcas5p42ddbc17608f2677814c07b0d790e1318@epcas5p4.samsung.com>
2022-05-11  5:47     ` [PATCH v5 4/6] nvme: wire-up uring-cmd support for io-passthru on char-device Kanchan Joshi
     [not found]   ` <CGME20220511055319epcas5p1d78cb03bd1b145a6d58c8e616795af14@epcas5p1.samsung.com>
2022-05-11  5:47     ` [PATCH v5 5/6] nvme: add vectored-io support for uring-cmd Kanchan Joshi
     [not found]   ` <CGME20220511055320epcas5p2457aaf8e7405387282f60831f5a4eca9@epcas5p2.samsung.com>
2022-05-11  5:47     ` [PATCH v5 6/6] io_uring: finish IOPOLL/ioprio prep handler removal Kanchan Joshi
2022-05-11  6:54       ` Christoph Hellwig
2022-05-11 11:05         ` Kanchan Joshi
2022-05-11 12:38         ` Jens Axboe
2022-05-11 12:39   ` [PATCH v5 0/6] io_uring passthrough for nvme Jens Axboe
2022-05-11 13:14     ` Kanchan Joshi
2022-05-11 13:39       ` Jens Axboe
2022-05-11 13:48   ` Jens Axboe
