io-uring.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/4] Asynchronous passthrough ioctl
       [not found] <CGME20210127150134epcas5p251fc1de3ff3581dd4c68b3fbe0b9dd91@epcas5p2.samsung.com>
@ 2021-01-27 15:00 ` Kanchan Joshi
       [not found]   ` <CGME20210127150140epcas5p32832cc0c0db953db199eb9dd326f2d4c@epcas5p3.samsung.com>
                     ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Kanchan Joshi @ 2021-01-27 15:00 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi
  Cc: linux-nvme, io-uring, linux-fsdevel, linux-kernel, javier.gonz,
	nj.shetty, selvakuma.s1, Kanchan Joshi

This RFC patchset adds asynchronous ioctl capability for NVMe devices.
Purpose of RFC is to get the feedback and optimize the path.

At the uppermost io-uring layer, a new opcode IORING_OP_IOCTL_PT is
presented to user-space applications. Like regular-ioctl, it takes
ioctl opcode and an optional argument (ioctl-specific input/output
parameter). Unlike regular-ioctl, it is made to skip the block-layer
and reach directly to the underlying driver (nvme in the case of this
patchset). This path between io-uring and nvme is via a newly
introduced block-device operation "async_ioctl". This operation
expects io-uring to supply a callback function which can be used to
report completion at later stage.

For a regular ioctl, NVMe driver submits the command to the device and
the submitter (task) is made to wait until completion arrives. For
async-ioctl, completion is decoupled from submission. Submitter goes
back to its business without waiting for nvme-completion. When
nvme-completion arrives, it informs io-uring via the registered
completion-handler. But some ioctls may require updating certain
ioctl-specific fields which can be accessed only in context of the
submitter task. For that reason, NVMe driver uses task-work infra for
that ioctl-specific update. Since task-work is not exported, it cannot
be referenced when nvme is compiled as a module. Therefore, one of the
patch exports task-work API.

Here goes example of usage (pseudo-code).
Actual nvme-cli source, modified to issue all ioctls via this opcode
is present at-
https://github.com/joshkan/nvme-cli/commit/a008a733f24ab5593e7874cfbc69ee04e88068c5

With regular ioctl-
int nvme_submit_passthru(int fd, unsigned long ioctl_cmd,
                         struct nvme_passthru_cmd *cmd)
{
	return ioctl(fd, ioctl_cmd, cmd);
}

With uring passthru ioctl-
int nvme_submit_passthru(int fd, unsigned long ioctl_cmd,
                         struct nvme_passthru_cmd *cmd)
{
	return uring_ioctl(fd, ioctl_cmd, cmd);
}
int uring_ioctl(int fd, unsinged long cmd, u64 arg)
{
	sqe = io_uring_get_sqe(ring);

	/* prepare sqe */
	sqe->fd = fd;
	sqe->opcode = IORING_OP_IOCTL_PT;
	sqe->ioctl_cmd = cmd;
	sqe->ioctl_arg = arg;

	/* submit sqe */
	io_uring_submit(ring);

	/* reap completion and obtain result */
	io_uring_wait_cqe(ring, &cqe);
	printf("ioctl result =%d\n", cqe->res)
}

Kanchan Joshi (4):
  block: introduce async ioctl operation
  kernel: export task_work_add
  nvme: add async ioctl support
  io_uring: add async passthrough ioctl support

 drivers/nvme/host/core.c      | 347 +++++++++++++++++++++++++++-------
 fs/io_uring.c                 |  77 ++++++++
 include/linux/blkdev.h        |  12 ++
 include/uapi/linux/io_uring.h |   7 +-
 kernel/task_work.c            |   2 +-
 5 files changed, 376 insertions(+), 69 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC PATCH 1/4] block: introduce async ioctl operation
       [not found]   ` <CGME20210127150140epcas5p32832cc0c0db953db199eb9dd326f2d4c@epcas5p3.samsung.com>
@ 2021-01-27 15:00     ` Kanchan Joshi
  0 siblings, 0 replies; 16+ messages in thread
From: Kanchan Joshi @ 2021-01-27 15:00 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi
  Cc: linux-nvme, io-uring, linux-fsdevel, linux-kernel, javier.gonz,
	nj.shetty, selvakuma.s1, Kanchan Joshi

Add a new block-dev operation for async-ioctl.
Driver managing the block-dev can choose to implement it.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
 include/linux/blkdev.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f94ee3089e01..c9f6cc26d675 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1848,6 +1848,16 @@ static inline void blk_ksm_unregister(struct request_queue *q) { }
 
 #endif /* CONFIG_BLK_INLINE_ENCRYPTION */
 
+struct pt_ioctl_ctx {
+	/* submitter task context */
+	struct task_struct *task;
+	/* callback supplied by upper layer */
+	void (*pt_complete)(struct pt_ioctl_ctx *ptioc, long ret);
+	/* driver-allocated data */
+	void *ioc_data;
+	/* to schedule task-work */
+	struct callback_head pt_work;
+};
 
 struct block_device_operations {
 	blk_qc_t (*submit_bio) (struct bio *bio);
@@ -1856,6 +1866,8 @@ struct block_device_operations {
 	int (*rw_page)(struct block_device *, sector_t, struct page *, unsigned int);
 	int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
+	int (*async_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long,
+				struct pt_ioctl_ctx *);
 	unsigned int (*check_events) (struct gendisk *disk,
 				      unsigned int clearing);
 	void (*unlock_native_capacity) (struct gendisk *);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH 2/4] kernel: export task_work_add
       [not found]   ` <CGME20210127150144epcas5p29ccb35d7e7170aba7947b5ee16fd2db0@epcas5p2.samsung.com>
@ 2021-01-27 15:00     ` Kanchan Joshi
  0 siblings, 0 replies; 16+ messages in thread
From: Kanchan Joshi @ 2021-01-27 15:00 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi
  Cc: linux-nvme, io-uring, linux-fsdevel, linux-kernel, javier.gonz,
	nj.shetty, selvakuma.s1, Kanchan Joshi

Task-work infra is required to introduce async-ioctl in nvme driver.
Without this being exported, NVMe needs to be built statically.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
 kernel/task_work.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/task_work.c b/kernel/task_work.c
index 9cde961875c0..3cf413c89639 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -57,7 +57,7 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
 
 	return 0;
 }
-
+EXPORT_SYMBOL(task_work_add);
 /**
  * task_work_cancel - cancel a pending work added by task_work_add()
  * @task: the task which should execute the work
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH 3/4] nvme: add async ioctl support
       [not found]   ` <CGME20210127150149epcas5p4fa8edd47712f28ccdd9bac5139fc6e61@epcas5p4.samsung.com>
@ 2021-01-27 15:00     ` Kanchan Joshi
  0 siblings, 0 replies; 16+ messages in thread
From: Kanchan Joshi @ 2021-01-27 15:00 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi
  Cc: linux-nvme, io-uring, linux-fsdevel, linux-kernel, javier.gonz,
	nj.shetty, selvakuma.s1, Kanchan Joshi, Anuj Gupta

Add async_ioctl handler that implements asynchronous handling of ioctl
operation. If requested ioctl opcode does not involve submitting a
command to device (e.g. NVME_IOCTL_ID), it is made to return instantly.
Otherwise, ioctl-completion is decoupled from submission, and
-EIOCBQUEUED is returned post submission. When completion arrives from
device, nvme calls the ioctl-completion handler supplied by upper-layer.
But there is execption to that. An ioctl completion may also require
updating certain ioctl-specific user buffers/fields which can be
accessed only in context of original submitter-task. For such ioctl,
nvme-completion schedules a task-work which first updates ioctl-specific
buffers/fields and after that invokes the ioctl-completion handler.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 drivers/nvme/host/core.c | 347 +++++++++++++++++++++++++++++++--------
 1 file changed, 280 insertions(+), 67 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 200bdd672c28..57f3040bae34 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -21,6 +21,7 @@
 #include <linux/nvme_ioctl.h>
 #include <linux/pm_qos.h>
 #include <asm/unaligned.h>
+#include <linux/task_work.h>
 
 #include "nvme.h"
 #include "fabrics.h"
@@ -1092,7 +1093,107 @@ static void nvme_passthru_end(struct nvme_ctrl *ctrl, u32 effects)
 	}
 }
 
-void nvme_execute_passthru_rq(struct request *rq)
+struct async_pt_desc {
+	struct bio *bio;
+	int status; /* command status */
+	u64 result; /* nvme cmd result */
+	void __user *res_ptr; /* can be null, 32bit addr or 64 bit addr */
+	void __user *meta_ptr;
+	void *meta; /* kernel-space resident buffer */
+	unsigned metalen; /* length of meta */
+	bool is_res64 : 1; /* res_ptr refers to 64bit of space */
+	bool is_write : 1;
+	bool is_taskwork : 1;
+};
+
+static int nvme_add_task_work(struct task_struct *tsk,
+			struct callback_head *twork,
+			task_work_func_t work_func)
+{
+	int ret;
+
+	get_task_struct(tsk);
+	init_task_work(twork, work_func);
+	ret = task_work_add(tsk, twork, TWA_SIGNAL);
+	if (!ret)
+		wake_up_process(tsk);
+	return ret;
+}
+
+static void async_pt_update_work(struct callback_head *cbh)
+{
+	struct pt_ioctl_ctx *ptioc;
+	struct async_pt_desc *ptd;
+	struct task_struct *tsk;
+	int ret;
+
+	ptioc = container_of(cbh, struct pt_ioctl_ctx, pt_work);
+	ptd = ptioc->ioc_data;
+	tsk = ptioc->task;
+
+	/* handle meta update */
+	if (ptd->meta) {
+		if (!ptd->status && !ptd->is_write)
+			if (copy_to_user(ptd->meta_ptr, ptd->meta, ptd->metalen))
+				ptd->status = -EFAULT;
+		kfree(ptd->meta);
+	}
+	/* handle result update */
+	if (ptd->res_ptr) {
+		if (!ptd->is_res64)
+			ret = put_user(ptd->result, (u32 __user *)ptd->res_ptr);
+		else
+			ret = put_user(ptd->result, (u64 __user *)ptd->res_ptr);
+		if (ret)
+			ptd->status = -EFAULT;
+	}
+
+	ptioc->pt_complete(ptioc, ptd->status);
+	put_task_struct(tsk);
+	kfree(ptd);
+}
+
+static void nvme_end_async_pt(struct request *req, blk_status_t err)
+{
+	struct pt_ioctl_ctx *ptioc;
+	struct async_pt_desc *ptd;
+	struct bio *bio;
+
+	ptioc = req->end_io_data;
+	ptd = ptioc->ioc_data;
+
+	if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
+		ptd->status = -EINTR;
+	else
+		ptd->status = nvme_req(req)->status;
+
+	ptd->result = le64_to_cpu(nvme_req(req)->result.u64);
+	bio = ptd->bio;
+	/* setup task work if needed */
+	if (ptd->is_taskwork) {
+		int ret = nvme_add_task_work(ptioc->task, &ptioc->pt_work,
+				async_pt_update_work);
+		/* update failure if task-work could not be setup */
+		if (ret < 0) {
+			put_task_struct(ptioc->task);
+			ptioc->pt_complete(ptioc, ret);
+			kfree(ptd->meta);
+			kfree(ptd);
+		}
+	} else {
+		/* return status via callback, nothing else to update */
+		ptioc->pt_complete(ptioc, ptd->status);
+		kfree(ptd);
+	}
+
+	/* unmap pages, free bio, nvme command and request */
+	blk_rq_unmap_user(bio);
+	kfree(nvme_req(req)->cmd);
+	blk_mq_free_request(req);
+}
+
+
+void nvme_execute_passthru_rq_common(struct request *rq, int async)
 {
 	struct nvme_command *cmd = nvme_req(rq)->cmd;
 	struct nvme_ctrl *ctrl = nvme_req(rq)->ctrl;
@@ -1101,15 +1202,52 @@ void nvme_execute_passthru_rq(struct request *rq)
 	u32 effects;
 
 	effects = nvme_passthru_start(ctrl, ns, cmd->common.opcode);
-	blk_execute_rq(rq->q, disk, rq, 0);
+	if (!async)
+		blk_execute_rq(rq->q, disk, rq, 0);
+	else
+		blk_execute_rq_nowait(rq->q, disk, rq, 0, nvme_end_async_pt);
 	nvme_passthru_end(ctrl, effects);
 }
+
+void nvme_execute_passthru_rq(struct request *rq)
+{
+	return nvme_execute_passthru_rq_common(rq, 0);
+}
 EXPORT_SYMBOL_NS_GPL(nvme_execute_passthru_rq, NVME_TARGET_PASSTHRU);
 
+static int setup_async_pt_desc(struct request *rq, struct pt_ioctl_ctx *ptioc,
+		void __user *resptr, void __user *meta_buffer, void *meta,
+		unsigned meta_len, bool write, bool is_res64)
+{
+	struct async_pt_desc *ptd;
+
+	ptd = kzalloc(sizeof(struct async_pt_desc), GFP_KERNEL);
+	if (!ptd)
+		return -ENOMEM;
+
+	/* to free bio on completion, as req->bio will be null at that time */
+	ptd->bio = rq->bio;
+	ptd->res_ptr = resptr;
+	ptd->is_write = write;
+	ptd->is_res64 = is_res64;
+	if (meta) {
+		ptd->meta_ptr = meta_buffer;
+		ptd->meta = meta;
+		ptd->metalen = meta_len;
+	}
+	if (resptr)
+		ptd->is_taskwork = 1;
+
+	ptioc->ioc_data = ptd;
+	rq->end_io_data = ptioc;
+	return 0;
+}
+
 static int nvme_submit_user_cmd(struct request_queue *q,
 		struct nvme_command *cmd, void __user *ubuffer,
 		unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
-		u32 meta_seed, u64 *result, unsigned timeout)
+		u32 meta_seed, u64 *result, unsigned timeout,
+		struct pt_ioctl_ctx *ptioc, bool is_res64)
 {
 	bool write = nvme_is_write(cmd);
 	struct nvme_ns *ns = q->queuedata;
@@ -1145,6 +1283,18 @@ static int nvme_submit_user_cmd(struct request_queue *q,
 		}
 	}
 
+	if (ptioc) { /* async handling */
+		ret = setup_async_pt_desc(req, ptioc, result, meta_buffer,
+			       meta, meta_len, write, is_res64);
+		if (ret) {
+			kfree(meta);
+			goto out_unmap;
+		}
+		/* send request for async processing */
+		nvme_execute_passthru_rq_common(req, 1);
+		return ret;
+	}
+	/* sync handling */
 	nvme_execute_passthru_rq(req);
 	if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
 		ret = -EINTR;
@@ -1521,10 +1671,11 @@ static void __user *nvme_to_user_ptr(uintptr_t ptrval)
 	return (void __user *)ptrval;
 }
 
-static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
+static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio,
+			struct pt_ioctl_ctx *ptioc)
 {
 	struct nvme_user_io io;
-	struct nvme_command c;
+	struct nvme_command c, *cptr;
 	unsigned length, meta_len;
 	void __user *metadata;
 
@@ -1554,31 +1705,42 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 			return -EINVAL;
 	}
 
-	memset(&c, 0, sizeof(c));
-	c.rw.opcode = io.opcode;
-	c.rw.flags = io.flags;
-	c.rw.nsid = cpu_to_le32(ns->head->ns_id);
-	c.rw.slba = cpu_to_le64(io.slba);
-	c.rw.length = cpu_to_le16(io.nblocks);
-	c.rw.control = cpu_to_le16(io.control);
-	c.rw.dsmgmt = cpu_to_le32(io.dsmgmt);
-	c.rw.reftag = cpu_to_le32(io.reftag);
-	c.rw.apptag = cpu_to_le16(io.apptag);
-	c.rw.appmask = cpu_to_le16(io.appmask);
-
-	return nvme_submit_user_cmd(ns->queue, &c,
+	if (!ptioc)
+		cptr = &c;
+	else { /* for async - allocate cmd dynamically */
+		cptr = kmalloc(sizeof(struct nvme_command), GFP_KERNEL);
+		if (!cptr)
+			return -ENOMEM;
+	}
+
+	memset(cptr, 0, sizeof(c));
+	cptr->rw.opcode = io.opcode;
+	cptr->rw.flags = io.flags;
+	cptr->rw.nsid = cpu_to_le32(ns->head->ns_id);
+	cptr->rw.slba = cpu_to_le64(io.slba);
+	cptr->rw.length = cpu_to_le16(io.nblocks);
+	cptr->rw.control = cpu_to_le16(io.control);
+	cptr->rw.dsmgmt = cpu_to_le32(io.dsmgmt);
+	cptr->rw.reftag = cpu_to_le32(io.reftag);
+	cptr->rw.apptag = cpu_to_le16(io.apptag);
+	cptr->rw.appmask = cpu_to_le16(io.appmask);
+
+	return nvme_submit_user_cmd(ns->queue, cptr,
 			nvme_to_user_ptr(io.addr), length,
-			metadata, meta_len, lower_32_bits(io.slba), NULL, 0);
+			metadata, meta_len, lower_32_bits(io.slba), NULL, 0,
+			ptioc, 0);
 }
 
 static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
-			struct nvme_passthru_cmd __user *ucmd)
+			struct nvme_passthru_cmd __user *ucmd,
+			struct pt_ioctl_ctx *ptioc)
 {
 	struct nvme_passthru_cmd cmd;
-	struct nvme_command c;
+	struct nvme_command c, *cptr;
 	unsigned timeout = 0;
 	u64 result;
 	int status;
+	void *resptr;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EACCES;
@@ -1586,43 +1748,61 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 		return -EFAULT;
 	if (cmd.flags)
 		return -EINVAL;
+	if (!ptioc) {
+		cptr = &c;
+		resptr = &result;
+	} else {
+		/*
+		 * for async - (a) allocate cmd dynamically
+		 * (b) use user-space result addr
+		 */
+		cptr = kmalloc(sizeof(struct nvme_command), GFP_KERNEL);
+		if (!cptr)
+			return -ENOMEM;
+		resptr = &ucmd->result;
+	}
 
-	memset(&c, 0, sizeof(c));
-	c.common.opcode = cmd.opcode;
-	c.common.flags = cmd.flags;
-	c.common.nsid = cpu_to_le32(cmd.nsid);
-	c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
-	c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
-	c.common.cdw10 = cpu_to_le32(cmd.cdw10);
-	c.common.cdw11 = cpu_to_le32(cmd.cdw11);
-	c.common.cdw12 = cpu_to_le32(cmd.cdw12);
-	c.common.cdw13 = cpu_to_le32(cmd.cdw13);
-	c.common.cdw14 = cpu_to_le32(cmd.cdw14);
-	c.common.cdw15 = cpu_to_le32(cmd.cdw15);
+	memset(cptr, 0, sizeof(c));
+	cptr->common.opcode = cmd.opcode;
+	cptr->common.flags = cmd.flags;
+	cptr->common.nsid = cpu_to_le32(cmd.nsid);
+	cptr->common.cdw2[0] = cpu_to_le32(cmd.cdw2);
+	cptr->common.cdw2[1] = cpu_to_le32(cmd.cdw3);
+	cptr->common.cdw10 = cpu_to_le32(cmd.cdw10);
+	cptr->common.cdw11 = cpu_to_le32(cmd.cdw11);
+	cptr->common.cdw12 = cpu_to_le32(cmd.cdw12);
+	cptr->common.cdw13 = cpu_to_le32(cmd.cdw13);
+	cptr->common.cdw14 = cpu_to_le32(cmd.cdw14);
+	cptr->common.cdw15 = cpu_to_le32(cmd.cdw15);
 
 	if (cmd.timeout_ms)
 		timeout = msecs_to_jiffies(cmd.timeout_ms);
 
-	status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
+	status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, cptr,
 			nvme_to_user_ptr(cmd.addr), cmd.data_len,
 			nvme_to_user_ptr(cmd.metadata), cmd.metadata_len,
-			0, &result, timeout);
+			0, resptr, timeout, ptioc, 0);
 
-	if (status >= 0) {
+	if (!ptioc && status >= 0) {
 		if (put_user(result, &ucmd->result))
 			return -EFAULT;
 	}
+	/* async case, free cmd in case of error */
+	if (ptioc && status < 0)
+		kfree(cptr);
 
 	return status;
 }
 
 static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
-			struct nvme_passthru_cmd64 __user *ucmd)
+			struct nvme_passthru_cmd64 __user *ucmd,
+			struct pt_ioctl_ctx *ptioc)
 {
 	struct nvme_passthru_cmd64 cmd;
-	struct nvme_command c;
+	struct nvme_command c, *cptr;
 	unsigned timeout = 0;
 	int status;
+	void *resptr;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EACCES;
@@ -1631,31 +1811,43 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	if (cmd.flags)
 		return -EINVAL;
 
-	memset(&c, 0, sizeof(c));
-	c.common.opcode = cmd.opcode;
-	c.common.flags = cmd.flags;
-	c.common.nsid = cpu_to_le32(cmd.nsid);
-	c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
-	c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
-	c.common.cdw10 = cpu_to_le32(cmd.cdw10);
-	c.common.cdw11 = cpu_to_le32(cmd.cdw11);
-	c.common.cdw12 = cpu_to_le32(cmd.cdw12);
-	c.common.cdw13 = cpu_to_le32(cmd.cdw13);
-	c.common.cdw14 = cpu_to_le32(cmd.cdw14);
-	c.common.cdw15 = cpu_to_le32(cmd.cdw15);
+	if (!ptioc) {
+		cptr = &c;
+		resptr = &cmd.result;
+	} else {
+		cptr = kmalloc(sizeof(struct nvme_command), GFP_KERNEL);
+		if (!cptr)
+			return -ENOMEM;
+		resptr = &ucmd->result;
+	}
+
+	memset(cptr, 0, sizeof(struct nvme_command));
+	cptr->common.opcode = cmd.opcode;
+	cptr->common.flags = cmd.flags;
+	cptr->common.nsid = cpu_to_le32(cmd.nsid);
+	cptr->common.cdw2[0] = cpu_to_le32(cmd.cdw2);
+	cptr->common.cdw2[1] = cpu_to_le32(cmd.cdw3);
+	cptr->common.cdw10 = cpu_to_le32(cmd.cdw10);
+	cptr->common.cdw11 = cpu_to_le32(cmd.cdw11);
+	cptr->common.cdw12 = cpu_to_le32(cmd.cdw12);
+	cptr->common.cdw13 = cpu_to_le32(cmd.cdw13);
+	cptr->common.cdw14 = cpu_to_le32(cmd.cdw14);
+	cptr->common.cdw15 = cpu_to_le32(cmd.cdw15);
 
 	if (cmd.timeout_ms)
 		timeout = msecs_to_jiffies(cmd.timeout_ms);
 
-	status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
+	status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, cptr,
 			nvme_to_user_ptr(cmd.addr), cmd.data_len,
 			nvme_to_user_ptr(cmd.metadata), cmd.metadata_len,
-			0, &cmd.result, timeout);
+			0, resptr, timeout, ptioc, 1);
 
-	if (status >= 0) {
+	if (!ptioc && status >= 0) {
 		if (put_user(cmd.result, &ucmd->result))
 			return -EFAULT;
 	}
+	if (ptioc && status < 0)
+		kfree(cptr);
 
 	return status;
 }
@@ -1702,7 +1894,8 @@ static bool is_ctrl_ioctl(unsigned int cmd)
 static int nvme_handle_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
 				  void __user *argp,
 				  struct nvme_ns_head *head,
-				  int srcu_idx)
+				  int srcu_idx,
+				  struct pt_ioctl_ctx *ptioc)
 {
 	struct nvme_ctrl *ctrl = ns->ctrl;
 	int ret;
@@ -1712,21 +1905,24 @@ static int nvme_handle_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
 
 	switch (cmd) {
 	case NVME_IOCTL_ADMIN_CMD:
-		ret = nvme_user_cmd(ctrl, NULL, argp);
+		ret = nvme_user_cmd(ctrl, NULL, argp, ptioc);
 		break;
 	case NVME_IOCTL_ADMIN64_CMD:
-		ret = nvme_user_cmd64(ctrl, NULL, argp);
+		ret = nvme_user_cmd64(ctrl, NULL, argp, ptioc);
 		break;
 	default:
-		ret = sed_ioctl(ctrl->opal_dev, cmd, argp);
+		if (!ptioc)
+			ret = sed_ioctl(ctrl->opal_dev, cmd, argp);
+		else
+			ret = -EOPNOTSUPP; /* RFP: no support for now */
 		break;
 	}
 	nvme_put_ctrl(ctrl);
 	return ret;
 }
 
-static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
-		unsigned int cmd, unsigned long arg)
+static int nvme_async_ioctl(struct block_device *bdev, fmode_t mode,
+		unsigned int cmd, unsigned long arg, struct pt_ioctl_ctx *ptioc)
 {
 	struct nvme_ns_head *head = NULL;
 	void __user *argp = (void __user *)arg;
@@ -1743,33 +1939,49 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
 	 * deadlock when deleting namespaces using the passthrough interface.
 	 */
 	if (is_ctrl_ioctl(cmd))
-		return nvme_handle_ctrl_ioctl(ns, cmd, argp, head, srcu_idx);
+		return nvme_handle_ctrl_ioctl(ns, cmd, argp, head, srcu_idx, ptioc);
 
 	switch (cmd) {
 	case NVME_IOCTL_ID:
 		force_successful_syscall_return();
 		ret = ns->head->ns_id;
+		if (ptioc)
+			goto put_ns; /* return in sync fashion always */
 		break;
 	case NVME_IOCTL_IO_CMD:
-		ret = nvme_user_cmd(ns->ctrl, ns, argp);
+		ret = nvme_user_cmd(ns->ctrl, ns, argp, ptioc);
 		break;
 	case NVME_IOCTL_SUBMIT_IO:
-		ret = nvme_submit_io(ns, argp);
+		ret = nvme_submit_io(ns, argp, ptioc);
 		break;
 	case NVME_IOCTL_IO64_CMD:
-		ret = nvme_user_cmd64(ns->ctrl, ns, argp);
+		ret = nvme_user_cmd64(ns->ctrl, ns, argp, ptioc);
 		break;
 	default:
+		if (ptioc) {
+			/* RFP- don't support this for now */
+			ret = -EOPNOTSUPP;
+			break;
+		}
 		if (ns->ndev)
 			ret = nvme_nvm_ioctl(ns, cmd, arg);
 		else
 			ret = -ENOTTY;
 	}
-
+	/* if there is no error, return queued for async-ioctl */
+	if (ptioc && ret >= 0)
+		ret = -EIOCBQUEUED;
+ put_ns:
 	nvme_put_ns_from_disk(head, srcu_idx);
 	return ret;
 }
 
+static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
+		unsigned int cmd, unsigned long arg)
+{
+	return nvme_async_ioctl(bdev, mode, cmd, arg, NULL);
+}
+
 #ifdef CONFIG_COMPAT
 struct nvme_user_io32 {
 	__u8	opcode;
@@ -2324,6 +2536,7 @@ EXPORT_SYMBOL_GPL(nvme_sec_submit);
 static const struct block_device_operations nvme_bdev_ops = {
 	.owner		= THIS_MODULE,
 	.ioctl		= nvme_ioctl,
+	.async_ioctl	= nvme_async_ioctl,
 	.compat_ioctl	= nvme_compat_ioctl,
 	.open		= nvme_open,
 	.release	= nvme_release,
@@ -3261,7 +3474,7 @@ static int nvme_dev_user_cmd(struct nvme_ctrl *ctrl, void __user *argp)
 	kref_get(&ns->kref);
 	up_read(&ctrl->namespaces_rwsem);
 
-	ret = nvme_user_cmd(ctrl, ns, argp);
+	ret = nvme_user_cmd(ctrl, ns, argp, NULL);
 	nvme_put_ns(ns);
 	return ret;
 
@@ -3278,9 +3491,9 @@ static long nvme_dev_ioctl(struct file *file, unsigned int cmd,
 
 	switch (cmd) {
 	case NVME_IOCTL_ADMIN_CMD:
-		return nvme_user_cmd(ctrl, NULL, argp);
+		return nvme_user_cmd(ctrl, NULL, argp, NULL);
 	case NVME_IOCTL_ADMIN64_CMD:
-		return nvme_user_cmd64(ctrl, NULL, argp);
+		return nvme_user_cmd64(ctrl, NULL, argp, NULL);
 	case NVME_IOCTL_IO_CMD:
 		return nvme_dev_user_cmd(ctrl, argp);
 	case NVME_IOCTL_RESET:
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH 4/4] io_uring: add async passthrough ioctl support
       [not found]   ` <CGME20210127150156epcas5p26cdf368e4ff6bffb132fa1c7f9430653@epcas5p2.samsung.com>
@ 2021-01-27 15:00     ` Kanchan Joshi
  0 siblings, 0 replies; 16+ messages in thread
From: Kanchan Joshi @ 2021-01-27 15:00 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi
  Cc: linux-nvme, io-uring, linux-fsdevel, linux-kernel, javier.gonz,
	nj.shetty, selvakuma.s1, Kanchan Joshi, Anuj Gupta

Introduce IORING_OP_IOCTL_PT for async ioctl. It skips entering into
block-layer and reaches to underlying block-driver managing the
block-device. This is done by calling newly introduced "async_ioctl"
block-device operation.
The requested operation may be completed synchronously, and in that case
CQE is updated on the fly. For asynchronous update, lower-layer calls
the completion-callback supplied by io-uring.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 fs/io_uring.c                 | 77 +++++++++++++++++++++++++++++++++++
 include/uapi/linux/io_uring.h |  7 +++-
 2 files changed, 83 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 985a9e3f976d..c15852dfb727 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -468,6 +468,19 @@ struct io_rw {
 	u64				len;
 };
 
+/*
+ * passthru ioctl skips block-layer and reaches to block device driver via
+ * async_ioctl() block-dev operation.
+ */
+struct io_pt_ioctl {
+	struct file			*file;
+	/* arg and cmd like regular ioctl */
+	u64				arg;
+	u32				cmd;
+	/* defined by block layer */
+	struct pt_ioctl_ctx		ioctx;
+};
+
 struct io_connect {
 	struct file			*file;
 	struct sockaddr __user		*addr;
@@ -699,6 +712,7 @@ struct io_kiocb {
 		struct io_shutdown	shutdown;
 		struct io_rename	rename;
 		struct io_unlink	unlink;
+		struct io_pt_ioctl	ptioctl;
 		/* use only after cleaning per-op data, see io_clean_op() */
 		struct io_completion	compl;
 	};
@@ -824,6 +838,10 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_file		= 1,
 		.work_flags		= IO_WQ_WORK_BLKCG,
 	},
+	[IORING_OP_IOCTL_PT] = {
+		.needs_file		= 1,
+		.work_flags		= IO_WQ_WORK_MM,
+	},
 	[IORING_OP_READ_FIXED] = {
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
@@ -3704,6 +3722,60 @@ static int io_write(struct io_kiocb *req, bool force_nonblock,
 	return ret;
 }
 
+static int io_pt_ioctl_prep(struct io_kiocb *req,
+			    const struct io_uring_sqe *sqe)
+{
+	unsigned int cmd = READ_ONCE(sqe->ioctl_cmd);
+	unsigned long arg = READ_ONCE(sqe->ioctl_arg);
+	struct io_ring_ctx *ctx = req->ctx;
+	struct block_device *bdev = I_BDEV(req->file->f_mapping->host);
+	struct gendisk *disk = NULL;
+
+	disk = bdev->bd_disk;
+	if (!disk || !disk->fops || !disk->fops->async_ioctl)
+		return -EOPNOTSUPP;
+	/* for sqpoll, use sqo_task */
+	if (ctx->flags & IORING_SETUP_SQPOLL)
+		req->ptioctl.ioctx.task = ctx->sqo_task;
+	else
+		req->ptioctl.ioctx.task = current;
+
+	req->ptioctl.arg = arg;
+	req->ptioctl.cmd = cmd;
+	return 0;
+}
+
+void pt_complete(struct pt_ioctl_ctx *ptioc, long ret)
+{
+	struct io_kiocb *req = container_of(ptioc, struct io_kiocb, ptioctl.ioctx);
+
+	if (ret < 0)
+		req_set_fail_links(req);
+	io_req_complete(req, ret);
+}
+
+static int io_pt_ioctl(struct io_kiocb *req, bool force_nonblock)
+{
+	long ret = 0;
+	struct block_device *bdev = I_BDEV(req->file->f_mapping->host);
+	fmode_t mode = req->file->f_mode;
+	struct gendisk *disk = NULL;
+
+	disk = bdev->bd_disk;
+	/* set up callback for async */
+	req->ptioctl.ioctx.pt_complete = pt_complete;
+
+	ret = disk->fops->async_ioctl(bdev, mode, req->ptioctl.cmd,
+				req->ptioctl.arg, &req->ptioctl.ioctx);
+	if (ret == -EIOCBQUEUED) /*async completion */
+		return 0;
+	if (ret < 0)
+		req_set_fail_links(req);
+
+	io_req_complete(req, ret);
+	return 0;
+}
+
 static int io_renameat_prep(struct io_kiocb *req,
 			    const struct io_uring_sqe *sqe)
 {
@@ -6078,6 +6150,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return io_renameat_prep(req, sqe);
 	case IORING_OP_UNLINKAT:
 		return io_unlinkat_prep(req, sqe);
+	case IORING_OP_IOCTL_PT:
+		return io_pt_ioctl_prep(req, sqe);
 	}
 
 	printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n",
@@ -6337,6 +6411,9 @@ static int io_issue_sqe(struct io_kiocb *req, bool force_nonblock,
 	case IORING_OP_UNLINKAT:
 		ret = io_unlinkat(req, force_nonblock);
 		break;
+	case IORING_OP_IOCTL_PT:
+		ret = io_pt_ioctl(req, force_nonblock);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index d31a2a1e8ef9..60671e2b00ba 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -22,12 +22,16 @@ struct io_uring_sqe {
 	union {
 		__u64	off;	/* offset into file */
 		__u64	addr2;
+		__u64	ioctl_arg;
 	};
 	union {
 		__u64	addr;	/* pointer to buffer or iovecs */
 		__u64	splice_off_in;
 	};
-	__u32	len;		/* buffer size or number of iovecs */
+	union {
+		__u32	len;	/* buffer size or number of iovecs */
+		__u32	ioctl_cmd;
+	};
 	union {
 		__kernel_rwf_t	rw_flags;
 		__u32		fsync_flags;
@@ -137,6 +141,7 @@ enum {
 	IORING_OP_SHUTDOWN,
 	IORING_OP_RENAMEAT,
 	IORING_OP_UNLINKAT,
+	IORING_OP_IOCTL_PT,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4] Asynchronous passthrough ioctl
  2021-01-27 15:00 ` [RFC PATCH 0/4] Asynchronous passthrough ioctl Kanchan Joshi
                     ` (3 preceding siblings ...)
       [not found]   ` <CGME20210127150156epcas5p26cdf368e4ff6bffb132fa1c7f9430653@epcas5p2.samsung.com>
@ 2021-01-27 15:42   ` Pavel Begunkov
  2021-01-27 15:53     ` Pavel Begunkov
  4 siblings, 1 reply; 16+ messages in thread
From: Pavel Begunkov @ 2021-01-27 15:42 UTC (permalink / raw)
  To: Kanchan Joshi, axboe, kbusch, hch, sagi
  Cc: linux-nvme, io-uring, linux-fsdevel, linux-kernel, javier.gonz,
	nj.shetty, selvakuma.s1

On 27/01/2021 15:00, Kanchan Joshi wrote:
> This RFC patchset adds asynchronous ioctl capability for NVMe devices.
> Purpose of RFC is to get the feedback and optimize the path.
> 
> At the uppermost io-uring layer, a new opcode IORING_OP_IOCTL_PT is
> presented to user-space applications. Like regular-ioctl, it takes
> ioctl opcode and an optional argument (ioctl-specific input/output
> parameter). Unlike regular-ioctl, it is made to skip the block-layer
> and reach directly to the underlying driver (nvme in the case of this
> patchset). This path between io-uring and nvme is via a newly
> introduced block-device operation "async_ioctl". This operation
> expects io-uring to supply a callback function which can be used to
> report completion at later stage.
> 
> For a regular ioctl, NVMe driver submits the command to the device and
> the submitter (task) is made to wait until completion arrives. For
> async-ioctl, completion is decoupled from submission. Submitter goes
> back to its business without waiting for nvme-completion. When
> nvme-completion arrives, it informs io-uring via the registered
> completion-handler. But some ioctls may require updating certain
> ioctl-specific fields which can be accessed only in context of the
> submitter task. For that reason, NVMe driver uses task-work infra for
> that ioctl-specific update. Since task-work is not exported, it cannot
> be referenced when nvme is compiled as a module. Therefore, one of the
> patch exports task-work API.
> 
> Here goes example of usage (pseudo-code).
> Actual nvme-cli source, modified to issue all ioctls via this opcode
> is present at-
> https://github.com/joshkan/nvme-cli/commit/a008a733f24ab5593e7874cfbc69ee04e88068c5

see https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops

Looks like good time to bring that branch/discussion back

> 
> With regular ioctl-
> int nvme_submit_passthru(int fd, unsigned long ioctl_cmd,
>                          struct nvme_passthru_cmd *cmd)
> {
> 	return ioctl(fd, ioctl_cmd, cmd);
> }
> 
> With uring passthru ioctl-
> int nvme_submit_passthru(int fd, unsigned long ioctl_cmd,
>                          struct nvme_passthru_cmd *cmd)
> {
> 	return uring_ioctl(fd, ioctl_cmd, cmd);
> }
> int uring_ioctl(int fd, unsinged long cmd, u64 arg)
> {
> 	sqe = io_uring_get_sqe(ring);
> 
> 	/* prepare sqe */
> 	sqe->fd = fd;
> 	sqe->opcode = IORING_OP_IOCTL_PT;
> 	sqe->ioctl_cmd = cmd;
> 	sqe->ioctl_arg = arg;
> 
> 	/* submit sqe */
> 	io_uring_submit(ring);
> 
> 	/* reap completion and obtain result */
> 	io_uring_wait_cqe(ring, &cqe);
> 	printf("ioctl result =%d\n", cqe->res)
> }
> 
> Kanchan Joshi (4):
>   block: introduce async ioctl operation
>   kernel: export task_work_add
>   nvme: add async ioctl support
>   io_uring: add async passthrough ioctl support
> 
>  drivers/nvme/host/core.c      | 347 +++++++++++++++++++++++++++-------
>  fs/io_uring.c                 |  77 ++++++++
>  include/linux/blkdev.h        |  12 ++
>  include/uapi/linux/io_uring.h |   7 +-
>  kernel/task_work.c            |   2 +-
>  5 files changed, 376 insertions(+), 69 deletions(-)
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4] Asynchronous passthrough ioctl
  2021-01-27 15:42   ` [RFC PATCH 0/4] Asynchronous passthrough ioctl Pavel Begunkov
@ 2021-01-27 15:53     ` Pavel Begunkov
  2021-01-28 12:04       ` Kanchan Joshi
  0 siblings, 1 reply; 16+ messages in thread
From: Pavel Begunkov @ 2021-01-27 15:53 UTC (permalink / raw)
  To: Kanchan Joshi, axboe, kbusch, hch, sagi
  Cc: linux-nvme, io-uring, linux-fsdevel, linux-kernel, javier.gonz,
	nj.shetty, selvakuma.s1

On 27/01/2021 15:42, Pavel Begunkov wrote:
> On 27/01/2021 15:00, Kanchan Joshi wrote:
>> This RFC patchset adds asynchronous ioctl capability for NVMe devices.
>> Purpose of RFC is to get the feedback and optimize the path.
>>
>> At the uppermost io-uring layer, a new opcode IORING_OP_IOCTL_PT is
>> presented to user-space applications. Like regular-ioctl, it takes
>> ioctl opcode and an optional argument (ioctl-specific input/output
>> parameter). Unlike regular-ioctl, it is made to skip the block-layer
>> and reach directly to the underlying driver (nvme in the case of this
>> patchset). This path between io-uring and nvme is via a newly
>> introduced block-device operation "async_ioctl". This operation
>> expects io-uring to supply a callback function which can be used to
>> report completion at later stage.
>>
>> For a regular ioctl, NVMe driver submits the command to the device and
>> the submitter (task) is made to wait until completion arrives. For
>> async-ioctl, completion is decoupled from submission. Submitter goes
>> back to its business without waiting for nvme-completion. When
>> nvme-completion arrives, it informs io-uring via the registered
>> completion-handler. But some ioctls may require updating certain
>> ioctl-specific fields which can be accessed only in context of the
>> submitter task. For that reason, NVMe driver uses task-work infra for
>> that ioctl-specific update. Since task-work is not exported, it cannot
>> be referenced when nvme is compiled as a module. Therefore, one of the
>> patch exports task-work API.
>>
>> Here goes example of usage (pseudo-code).
>> Actual nvme-cli source, modified to issue all ioctls via this opcode
>> is present at-
>> https://github.com/joshkan/nvme-cli/commit/a008a733f24ab5593e7874cfbc69ee04e88068c5
> 
> see https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops
> 
> Looks like good time to bring that branch/discussion back

a bit more context:
https://github.com/axboe/liburing/issues/270

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4] Asynchronous passthrough ioctl
  2021-01-27 15:53     ` Pavel Begunkov
@ 2021-01-28 12:04       ` Kanchan Joshi
  2021-01-28 14:38         ` Jens Axboe
  2021-01-28 14:50         ` Jens Axboe
  0 siblings, 2 replies; 16+ messages in thread
From: Kanchan Joshi @ 2021-01-28 12:04 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe
  Cc: Kanchan Joshi, Keith Busch, Christoph Hellwig, sagi, linux-nvme,
	io-uring, linux-fsdevel, linux-kernel, Javier Gonzalez,
	Nitesh Shetty, Selvakumar S

On Wed, Jan 27, 2021 at 9:32 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 27/01/2021 15:42, Pavel Begunkov wrote:
> > On 27/01/2021 15:00, Kanchan Joshi wrote:
> >> This RFC patchset adds asynchronous ioctl capability for NVMe devices.
> >> Purpose of RFC is to get the feedback and optimize the path.
> >>
> >> At the uppermost io-uring layer, a new opcode IORING_OP_IOCTL_PT is
> >> presented to user-space applications. Like regular-ioctl, it takes
> >> ioctl opcode and an optional argument (ioctl-specific input/output
> >> parameter). Unlike regular-ioctl, it is made to skip the block-layer
> >> and reach directly to the underlying driver (nvme in the case of this
> >> patchset). This path between io-uring and nvme is via a newly
> >> introduced block-device operation "async_ioctl". This operation
> >> expects io-uring to supply a callback function which can be used to
> >> report completion at later stage.
> >>
> >> For a regular ioctl, NVMe driver submits the command to the device and
> >> the submitter (task) is made to wait until completion arrives. For
> >> async-ioctl, completion is decoupled from submission. Submitter goes
> >> back to its business without waiting for nvme-completion. When
> >> nvme-completion arrives, it informs io-uring via the registered
> >> completion-handler. But some ioctls may require updating certain
> >> ioctl-specific fields which can be accessed only in context of the
> >> submitter task. For that reason, NVMe driver uses task-work infra for
> >> that ioctl-specific update. Since task-work is not exported, it cannot
> >> be referenced when nvme is compiled as a module. Therefore, one of the
> >> patch exports task-work API.
> >>
> >> Here goes example of usage (pseudo-code).
> >> Actual nvme-cli source, modified to issue all ioctls via this opcode
> >> is present at-
> >> https://github.com/joshkan/nvme-cli/commit/a008a733f24ab5593e7874cfbc69ee04e88068c5
> >
> > see https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops
> >
> > Looks like good time to bring that branch/discussion back
>
> a bit more context:
> https://github.com/axboe/liburing/issues/270

Thanks, it looked good. It seems key differences (compared to
uring-patch that I posted) are -
1. using file-operation instead of block-dev operation.
2. repurpose the sqe memory for ioctl-cmd. If an application does
ioctl with <=40 bytes of cmd, it does not have to allocate ioctl-cmd.
That's nifty. We still need to support passing larger-cmd (e.g.
nvme-passthru ioctl takes 72 bytes) but that shouldn't get too
difficult I suppose.

And for some ioctls, driver may still need to use task-work to update
the user-space pointers (embedded in uring/ioctl cmd) during
completion.

@Jens - will it be fine if I start looking at plumbing nvme-part of
this series on top of your work?


Thanks,

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4] Asynchronous passthrough ioctl
  2021-01-28 12:04       ` Kanchan Joshi
@ 2021-01-28 14:38         ` Jens Axboe
  2021-01-28 17:13           ` Kanchan Joshi
  2021-01-28 14:50         ` Jens Axboe
  1 sibling, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2021-01-28 14:38 UTC (permalink / raw)
  To: Kanchan Joshi, Pavel Begunkov
  Cc: Kanchan Joshi, Keith Busch, Christoph Hellwig, sagi, linux-nvme,
	io-uring, linux-fsdevel, linux-kernel, Javier Gonzalez,
	Nitesh Shetty, Selvakumar S

On 1/28/21 5:04 AM, Kanchan Joshi wrote:
> On Wed, Jan 27, 2021 at 9:32 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> On 27/01/2021 15:42, Pavel Begunkov wrote:
>>> On 27/01/2021 15:00, Kanchan Joshi wrote:
>>>> This RFC patchset adds asynchronous ioctl capability for NVMe devices.
>>>> Purpose of RFC is to get the feedback and optimize the path.
>>>>
>>>> At the uppermost io-uring layer, a new opcode IORING_OP_IOCTL_PT is
>>>> presented to user-space applications. Like regular-ioctl, it takes
>>>> ioctl opcode and an optional argument (ioctl-specific input/output
>>>> parameter). Unlike regular-ioctl, it is made to skip the block-layer
>>>> and reach directly to the underlying driver (nvme in the case of this
>>>> patchset). This path between io-uring and nvme is via a newly
>>>> introduced block-device operation "async_ioctl". This operation
>>>> expects io-uring to supply a callback function which can be used to
>>>> report completion at later stage.
>>>>
>>>> For a regular ioctl, NVMe driver submits the command to the device and
>>>> the submitter (task) is made to wait until completion arrives. For
>>>> async-ioctl, completion is decoupled from submission. Submitter goes
>>>> back to its business without waiting for nvme-completion. When
>>>> nvme-completion arrives, it informs io-uring via the registered
>>>> completion-handler. But some ioctls may require updating certain
>>>> ioctl-specific fields which can be accessed only in context of the
>>>> submitter task. For that reason, NVMe driver uses task-work infra for
>>>> that ioctl-specific update. Since task-work is not exported, it cannot
>>>> be referenced when nvme is compiled as a module. Therefore, one of the
>>>> patch exports task-work API.
>>>>
>>>> Here goes example of usage (pseudo-code).
>>>> Actual nvme-cli source, modified to issue all ioctls via this opcode
>>>> is present at-
>>>> https://github.com/joshkan/nvme-cli/commit/a008a733f24ab5593e7874cfbc69ee04e88068c5
>>>
>>> see https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops
>>>
>>> Looks like good time to bring that branch/discussion back
>>
>> a bit more context:
>> https://github.com/axboe/liburing/issues/270
> 
> Thanks, it looked good. It seems key differences (compared to
> uring-patch that I posted) are -
> 1. using file-operation instead of block-dev operation.

Right, it's meant to span wider than just block devices.

> 2. repurpose the sqe memory for ioctl-cmd. If an application does
> ioctl with <=40 bytes of cmd, it does not have to allocate ioctl-cmd.
> That's nifty. We still need to support passing larger-cmd (e.g.
> nvme-passthru ioctl takes 72 bytes) but that shouldn't get too
> difficult I suppose.

It's actually 48 bytes in the as-posted version, and I've bumped it to
56 bytes in the latest branch. So not quite enough for everything,
nothing ever will be, but should work for a lot of cases without
requiring per-command allocations just for the actual command.

> And for some ioctls, driver may still need to use task-work to update
> the user-space pointers (embedded in uring/ioctl cmd) during
> completion.
> 
> @Jens - will it be fine if I start looking at plumbing nvme-part of
> this series on top of your work?

Sure, go ahead. Just beware that things are still changing, so you might
have to adapt it a few times. It's still early days, but I do think
that's the way forward in providing controlled access to what is
basically async ioctls.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4] Asynchronous passthrough ioctl
  2021-01-28 12:04       ` Kanchan Joshi
  2021-01-28 14:38         ` Jens Axboe
@ 2021-01-28 14:50         ` Jens Axboe
  2021-01-28 17:25           ` Kanchan Joshi
  1 sibling, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2021-01-28 14:50 UTC (permalink / raw)
  To: Kanchan Joshi, Pavel Begunkov
  Cc: Kanchan Joshi, Keith Busch, Christoph Hellwig, sagi, linux-nvme,
	io-uring, linux-fsdevel, linux-kernel, Javier Gonzalez,
	Nitesh Shetty, Selvakumar S

On 1/28/21 5:04 AM, Kanchan Joshi wrote:
> And for some ioctls, driver may still need to use task-work to update
> the user-space pointers (embedded in uring/ioctl cmd) during
> completion.

For this use case, we should ensure that just io_uring handles this
part. It's already got everything setup for it, and I'd rather avoid
having drivers touch any of those parts. Could be done by having an
io_uring helper ala:

io_uring_cmd_complete_in_task(cmd, handler);

which takes care of the nitty gritty details.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4] Asynchronous passthrough ioctl
  2021-01-28 14:38         ` Jens Axboe
@ 2021-01-28 17:13           ` Kanchan Joshi
  2021-01-28 17:24             ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Kanchan Joshi @ 2021-01-28 17:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Pavel Begunkov, Kanchan Joshi, Keith Busch, Christoph Hellwig,
	sagi, linux-nvme, io-uring, linux-fsdevel, linux-kernel,
	Javier Gonzalez, Nitesh Shetty, Selvakumar S

On Thu, Jan 28, 2021 at 8:08 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 1/28/21 5:04 AM, Kanchan Joshi wrote:
> > On Wed, Jan 27, 2021 at 9:32 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >>
> >> On 27/01/2021 15:42, Pavel Begunkov wrote:
> >>> On 27/01/2021 15:00, Kanchan Joshi wrote:
> >>>> This RFC patchset adds asynchronous ioctl capability for NVMe devices.
> >>>> Purpose of RFC is to get the feedback and optimize the path.
> >>>>
> >>>> At the uppermost io-uring layer, a new opcode IORING_OP_IOCTL_PT is
> >>>> presented to user-space applications. Like regular-ioctl, it takes
> >>>> ioctl opcode and an optional argument (ioctl-specific input/output
> >>>> parameter). Unlike regular-ioctl, it is made to skip the block-layer
> >>>> and reach directly to the underlying driver (nvme in the case of this
> >>>> patchset). This path between io-uring and nvme is via a newly
> >>>> introduced block-device operation "async_ioctl". This operation
> >>>> expects io-uring to supply a callback function which can be used to
> >>>> report completion at later stage.
> >>>>
> >>>> For a regular ioctl, NVMe driver submits the command to the device and
> >>>> the submitter (task) is made to wait until completion arrives. For
> >>>> async-ioctl, completion is decoupled from submission. Submitter goes
> >>>> back to its business without waiting for nvme-completion. When
> >>>> nvme-completion arrives, it informs io-uring via the registered
> >>>> completion-handler. But some ioctls may require updating certain
> >>>> ioctl-specific fields which can be accessed only in context of the
> >>>> submitter task. For that reason, NVMe driver uses task-work infra for
> >>>> that ioctl-specific update. Since task-work is not exported, it cannot
> >>>> be referenced when nvme is compiled as a module. Therefore, one of the
> >>>> patch exports task-work API.
> >>>>
> >>>> Here goes example of usage (pseudo-code).
> >>>> Actual nvme-cli source, modified to issue all ioctls via this opcode
> >>>> is present at-
> >>>> https://github.com/joshkan/nvme-cli/commit/a008a733f24ab5593e7874cfbc69ee04e88068c5
> >>>
> >>> see https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops
> >>>
> >>> Looks like good time to bring that branch/discussion back
> >>
> >> a bit more context:
> >> https://github.com/axboe/liburing/issues/270
> >
> > Thanks, it looked good. It seems key differences (compared to
> > uring-patch that I posted) are -
> > 1. using file-operation instead of block-dev operation.
>
> Right, it's meant to span wider than just block devices.
>
> > 2. repurpose the sqe memory for ioctl-cmd. If an application does
> > ioctl with <=40 bytes of cmd, it does not have to allocate ioctl-cmd.
> > That's nifty. We still need to support passing larger-cmd (e.g.
> > nvme-passthru ioctl takes 72 bytes) but that shouldn't get too
> > difficult I suppose.
>
> It's actually 48 bytes in the as-posted version, and I've bumped it to
> 56 bytes in the latest branch. So not quite enough for everything,
> nothing ever will be, but should work for a lot of cases without
> requiring per-command allocations just for the actual command.

Agreed. But if I got it right, you are open to support both in-the-sqe
command (<= 56 bytes) and out-of-sqe command (> 56 bytes) with this
interface.
Driver processing the ioctl can fetch the cmd from user-space in one
case (as it does now), and skips in another.

> > And for some ioctls, driver may still need to use task-work to update
> > the user-space pointers (embedded in uring/ioctl cmd) during
> > completion.
> >
> > @Jens - will it be fine if I start looking at plumbing nvme-part of
> > this series on top of your work?
>
> Sure, go ahead. Just beware that things are still changing, so you might
> have to adapt it a few times. It's still early days, but I do think
> that's the way forward in providing controlled access to what is
> basically async ioctls.

Sounds good, I will start with the latest branch that you posted. Thanks.

-- 
Kanchan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4] Asynchronous passthrough ioctl
  2021-01-28 17:13           ` Kanchan Joshi
@ 2021-01-28 17:24             ` Jens Axboe
  2021-02-22 13:42               ` Kanchan Joshi
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2021-01-28 17:24 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Pavel Begunkov, Kanchan Joshi, Keith Busch, Christoph Hellwig,
	sagi, linux-nvme, io-uring, linux-fsdevel, linux-kernel,
	Javier Gonzalez, Nitesh Shetty, Selvakumar S

On 1/28/21 10:13 AM, Kanchan Joshi wrote:
> On Thu, Jan 28, 2021 at 8:08 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 1/28/21 5:04 AM, Kanchan Joshi wrote:
>>> On Wed, Jan 27, 2021 at 9:32 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>>>
>>>> On 27/01/2021 15:42, Pavel Begunkov wrote:
>>>>> On 27/01/2021 15:00, Kanchan Joshi wrote:
>>>>>> This RFC patchset adds asynchronous ioctl capability for NVMe devices.
>>>>>> Purpose of RFC is to get the feedback and optimize the path.
>>>>>>
>>>>>> At the uppermost io-uring layer, a new opcode IORING_OP_IOCTL_PT is
>>>>>> presented to user-space applications. Like regular-ioctl, it takes
>>>>>> ioctl opcode and an optional argument (ioctl-specific input/output
>>>>>> parameter). Unlike regular-ioctl, it is made to skip the block-layer
>>>>>> and reach directly to the underlying driver (nvme in the case of this
>>>>>> patchset). This path between io-uring and nvme is via a newly
>>>>>> introduced block-device operation "async_ioctl". This operation
>>>>>> expects io-uring to supply a callback function which can be used to
>>>>>> report completion at later stage.
>>>>>>
>>>>>> For a regular ioctl, NVMe driver submits the command to the device and
>>>>>> the submitter (task) is made to wait until completion arrives. For
>>>>>> async-ioctl, completion is decoupled from submission. Submitter goes
>>>>>> back to its business without waiting for nvme-completion. When
>>>>>> nvme-completion arrives, it informs io-uring via the registered
>>>>>> completion-handler. But some ioctls may require updating certain
>>>>>> ioctl-specific fields which can be accessed only in context of the
>>>>>> submitter task. For that reason, NVMe driver uses task-work infra for
>>>>>> that ioctl-specific update. Since task-work is not exported, it cannot
>>>>>> be referenced when nvme is compiled as a module. Therefore, one of the
>>>>>> patch exports task-work API.
>>>>>>
>>>>>> Here goes example of usage (pseudo-code).
>>>>>> Actual nvme-cli source, modified to issue all ioctls via this opcode
>>>>>> is present at-
>>>>>> https://github.com/joshkan/nvme-cli/commit/a008a733f24ab5593e7874cfbc69ee04e88068c5
>>>>>
>>>>> see https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops
>>>>>
>>>>> Looks like good time to bring that branch/discussion back
>>>>
>>>> a bit more context:
>>>> https://github.com/axboe/liburing/issues/270
>>>
>>> Thanks, it looked good. It seems key differences (compared to
>>> uring-patch that I posted) are -
>>> 1. using file-operation instead of block-dev operation.
>>
>> Right, it's meant to span wider than just block devices.
>>
>>> 2. repurpose the sqe memory for ioctl-cmd. If an application does
>>> ioctl with <=40 bytes of cmd, it does not have to allocate ioctl-cmd.
>>> That's nifty. We still need to support passing larger-cmd (e.g.
>>> nvme-passthru ioctl takes 72 bytes) but that shouldn't get too
>>> difficult I suppose.
>>
>> It's actually 48 bytes in the as-posted version, and I've bumped it to
>> 56 bytes in the latest branch. So not quite enough for everything,
>> nothing ever will be, but should work for a lot of cases without
>> requiring per-command allocations just for the actual command.
> 
> Agreed. But if I got it right, you are open to support both in-the-sqe
> command (<= 56 bytes) and out-of-sqe command (> 56 bytes) with this
> interface.
> Driver processing the ioctl can fetch the cmd from user-space in one
> case (as it does now), and skips in another.

Your out-of-seq command would be none of io_urings business, outside of
the fact that we'd need to ensure it's stable if we need to postpone
it. So yes, that would be fine, it just means your actual command is
passed in as a pointer, and you would be responsible for copying it
in for execution

We're going to need something to handle postponing, and something
for ensuring that eg cancelations free the allocated memory.

>>> And for some ioctls, driver may still need to use task-work to update
>>> the user-space pointers (embedded in uring/ioctl cmd) during
>>> completion.
>>>
>>> @Jens - will it be fine if I start looking at plumbing nvme-part of
>>> this series on top of your work?
>>
>> Sure, go ahead. Just beware that things are still changing, so you might
>> have to adapt it a few times. It's still early days, but I do think
>> that's the way forward in providing controlled access to what is
>> basically async ioctls.
> 
> Sounds good, I will start with the latest branch that you posted. Thanks.

It's io_uring-fops.v2 for now, use that one.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4] Asynchronous passthrough ioctl
  2021-01-28 14:50         ` Jens Axboe
@ 2021-01-28 17:25           ` Kanchan Joshi
  0 siblings, 0 replies; 16+ messages in thread
From: Kanchan Joshi @ 2021-01-28 17:25 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Pavel Begunkov, Kanchan Joshi, Keith Busch, Christoph Hellwig,
	sagi, linux-nvme, io-uring, linux-fsdevel, linux-kernel,
	Javier Gonzalez, Nitesh Shetty, anuj20.g

On Thu, Jan 28, 2021 at 8:20 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 1/28/21 5:04 AM, Kanchan Joshi wrote:
> > And for some ioctls, driver may still need to use task-work to update
> > the user-space pointers (embedded in uring/ioctl cmd) during
> > completion.
>
> For this use case, we should ensure that just io_uring handles this
> part. It's already got everything setup for it, and I'd rather avoid
> having drivers touch any of those parts. Could be done by having an
> io_uring helper ala:
>
> io_uring_cmd_complete_in_task(cmd, handler);
>
> which takes care of the nitty gritty details.

Ah right. With that, I can do away with exporting task-work.
NVMe completion can invoke (depending on ioctl) this uring-helper with
a handler that does the ioctl-specific update in task context.



-- 
Kanchan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4] Asynchronous passthrough ioctl
  2021-01-28 17:24             ` Jens Axboe
@ 2021-02-22 13:42               ` Kanchan Joshi
  2021-02-22 14:33                 ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Kanchan Joshi @ 2021-02-22 13:42 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Pavel Begunkov, Kanchan Joshi, Keith Busch, Christoph Hellwig,
	sagi, linux-nvme, io-uring, linux-fsdevel, linux-kernel,
	Javier Gonzalez, Nitesh Shetty, Selvakumar S

On Thu, Jan 28, 2021 at 10:54 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 1/28/21 10:13 AM, Kanchan Joshi wrote:
> > On Thu, Jan 28, 2021 at 8:08 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> On 1/28/21 5:04 AM, Kanchan Joshi wrote:
> >>> On Wed, Jan 27, 2021 at 9:32 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >>>>
> >>>> On 27/01/2021 15:42, Pavel Begunkov wrote:
> >>>>> On 27/01/2021 15:00, Kanchan Joshi wrote:
> >>>>>> This RFC patchset adds asynchronous ioctl capability for NVMe devices.
> >>>>>> Purpose of RFC is to get the feedback and optimize the path.
> >>>>>>
> >>>>>> At the uppermost io-uring layer, a new opcode IORING_OP_IOCTL_PT is
> >>>>>> presented to user-space applications. Like regular-ioctl, it takes
> >>>>>> ioctl opcode and an optional argument (ioctl-specific input/output
> >>>>>> parameter). Unlike regular-ioctl, it is made to skip the block-layer
> >>>>>> and reach directly to the underlying driver (nvme in the case of this
> >>>>>> patchset). This path between io-uring and nvme is via a newly
> >>>>>> introduced block-device operation "async_ioctl". This operation
> >>>>>> expects io-uring to supply a callback function which can be used to
> >>>>>> report completion at later stage.
> >>>>>>
> >>>>>> For a regular ioctl, NVMe driver submits the command to the device and
> >>>>>> the submitter (task) is made to wait until completion arrives. For
> >>>>>> async-ioctl, completion is decoupled from submission. Submitter goes
> >>>>>> back to its business without waiting for nvme-completion. When
> >>>>>> nvme-completion arrives, it informs io-uring via the registered
> >>>>>> completion-handler. But some ioctls may require updating certain
> >>>>>> ioctl-specific fields which can be accessed only in context of the
> >>>>>> submitter task. For that reason, NVMe driver uses task-work infra for
> >>>>>> that ioctl-specific update. Since task-work is not exported, it cannot
> >>>>>> be referenced when nvme is compiled as a module. Therefore, one of the
> >>>>>> patch exports task-work API.
> >>>>>>
> >>>>>> Here goes example of usage (pseudo-code).
> >>>>>> Actual nvme-cli source, modified to issue all ioctls via this opcode
> >>>>>> is present at-
> >>>>>> https://github.com/joshkan/nvme-cli/commit/a008a733f24ab5593e7874cfbc69ee04e88068c5
> >>>>>
> >>>>> see https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops
> >>>>>
> >>>>> Looks like good time to bring that branch/discussion back
> >>>>
> >>>> a bit more context:
> >>>> https://github.com/axboe/liburing/issues/270
> >>>
> >>> Thanks, it looked good. It seems key differences (compared to
> >>> uring-patch that I posted) are -
> >>> 1. using file-operation instead of block-dev operation.
> >>
> >> Right, it's meant to span wider than just block devices.
> >>
> >>> 2. repurpose the sqe memory for ioctl-cmd. If an application does
> >>> ioctl with <=40 bytes of cmd, it does not have to allocate ioctl-cmd.
> >>> That's nifty. We still need to support passing larger-cmd (e.g.
> >>> nvme-passthru ioctl takes 72 bytes) but that shouldn't get too
> >>> difficult I suppose.
> >>
> >> It's actually 48 bytes in the as-posted version, and I've bumped it to
> >> 56 bytes in the latest branch. So not quite enough for everything,
> >> nothing ever will be, but should work for a lot of cases without
> >> requiring per-command allocations just for the actual command.
> >
> > Agreed. But if I got it right, you are open to support both in-the-sqe
> > command (<= 56 bytes) and out-of-sqe command (> 56 bytes) with this
> > interface.
> > Driver processing the ioctl can fetch the cmd from user-space in one
> > case (as it does now), and skips in another.
>
> Your out-of-seq command would be none of io_urings business, outside of
> the fact that we'd need to ensure it's stable if we need to postpone
> it. So yes, that would be fine, it just means your actual command is
> passed in as a pointer, and you would be responsible for copying it
> in for execution
>
> We're going to need something to handle postponing, and something
> for ensuring that eg cancelations free the allocated memory.

I have few doubts about allocation/postponing. Are you referring to
uring allocating memory for this case, similar to the way
"req->async_data" is managed for few other opcodes?
Or can it (i.e. larger cmd) remain a user-space pointer, and the
underlying driver fetches the command in.
If submission context changes (for sqo/io-wq case), uring seemed to
apply context-grabbing techniques to make that work.

> >>> And for some ioctls, driver may still need to use task-work to update
> >>> the user-space pointers (embedded in uring/ioctl cmd) during
> >>> completion.
> >>>
> >>> @Jens - will it be fine if I start looking at plumbing nvme-part of
> >>> this series on top of your work?
> >>
> >> Sure, go ahead. Just beware that things are still changing, so you might
> >> have to adapt it a few times. It's still early days, but I do think
> >> that's the way forward in providing controlled access to what is
> >> basically async ioctls.
> >
> > Sounds good, I will start with the latest branch that you posted. Thanks.
>
> It's io_uring-fops.v2 for now, use that one.

Moved to v3 now.
nvme_user_io is 48 bytes, while nvme passthrough requires 72 or 80
bytes (passthru with 64 bit result).
The block_uring_cmd has 32 bytes of available space. If NVMe defines
its own "nvme_uring_cmd" (which can be used for nvme char interface)
that will buy some more space, but still won't be enough for passthru
command.

So I am looking at adding support for large-cmd in uring. And felt the
need to clear those doubts I mentioned above.


Thanks

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4] Asynchronous passthrough ioctl
  2021-02-22 13:42               ` Kanchan Joshi
@ 2021-02-22 14:33                 ` Jens Axboe
  2021-02-23  4:41                   ` Kanchan Joshi
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2021-02-22 14:33 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Pavel Begunkov, Kanchan Joshi, Keith Busch, Christoph Hellwig,
	sagi, linux-nvme, io-uring, linux-fsdevel, linux-kernel,
	Javier Gonzalez, Nitesh Shetty, Selvakumar S

On 2/22/21 6:42 AM, Kanchan Joshi wrote:
> On Thu, Jan 28, 2021 at 10:54 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 1/28/21 10:13 AM, Kanchan Joshi wrote:
>>> On Thu, Jan 28, 2021 at 8:08 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 1/28/21 5:04 AM, Kanchan Joshi wrote:
>>>>> On Wed, Jan 27, 2021 at 9:32 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>>>>>
>>>>>> On 27/01/2021 15:42, Pavel Begunkov wrote:
>>>>>>> On 27/01/2021 15:00, Kanchan Joshi wrote:
>>>>>>>> This RFC patchset adds asynchronous ioctl capability for NVMe devices.
>>>>>>>> Purpose of RFC is to get the feedback and optimize the path.
>>>>>>>>
>>>>>>>> At the uppermost io-uring layer, a new opcode IORING_OP_IOCTL_PT is
>>>>>>>> presented to user-space applications. Like regular-ioctl, it takes
>>>>>>>> ioctl opcode and an optional argument (ioctl-specific input/output
>>>>>>>> parameter). Unlike regular-ioctl, it is made to skip the block-layer
>>>>>>>> and reach directly to the underlying driver (nvme in the case of this
>>>>>>>> patchset). This path between io-uring and nvme is via a newly
>>>>>>>> introduced block-device operation "async_ioctl". This operation
>>>>>>>> expects io-uring to supply a callback function which can be used to
>>>>>>>> report completion at later stage.
>>>>>>>>
>>>>>>>> For a regular ioctl, NVMe driver submits the command to the device and
>>>>>>>> the submitter (task) is made to wait until completion arrives. For
>>>>>>>> async-ioctl, completion is decoupled from submission. Submitter goes
>>>>>>>> back to its business without waiting for nvme-completion. When
>>>>>>>> nvme-completion arrives, it informs io-uring via the registered
>>>>>>>> completion-handler. But some ioctls may require updating certain
>>>>>>>> ioctl-specific fields which can be accessed only in context of the
>>>>>>>> submitter task. For that reason, NVMe driver uses task-work infra for
>>>>>>>> that ioctl-specific update. Since task-work is not exported, it cannot
>>>>>>>> be referenced when nvme is compiled as a module. Therefore, one of the
>>>>>>>> patch exports task-work API.
>>>>>>>>
>>>>>>>> Here goes example of usage (pseudo-code).
>>>>>>>> Actual nvme-cli source, modified to issue all ioctls via this opcode
>>>>>>>> is present at-
>>>>>>>> https://github.com/joshkan/nvme-cli/commit/a008a733f24ab5593e7874cfbc69ee04e88068c5
>>>>>>>
>>>>>>> see https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops
>>>>>>>
>>>>>>> Looks like good time to bring that branch/discussion back
>>>>>>
>>>>>> a bit more context:
>>>>>> https://github.com/axboe/liburing/issues/270
>>>>>
>>>>> Thanks, it looked good. It seems key differences (compared to
>>>>> uring-patch that I posted) are -
>>>>> 1. using file-operation instead of block-dev operation.
>>>>
>>>> Right, it's meant to span wider than just block devices.
>>>>
>>>>> 2. repurpose the sqe memory for ioctl-cmd. If an application does
>>>>> ioctl with <=40 bytes of cmd, it does not have to allocate ioctl-cmd.
>>>>> That's nifty. We still need to support passing larger-cmd (e.g.
>>>>> nvme-passthru ioctl takes 72 bytes) but that shouldn't get too
>>>>> difficult I suppose.
>>>>
>>>> It's actually 48 bytes in the as-posted version, and I've bumped it to
>>>> 56 bytes in the latest branch. So not quite enough for everything,
>>>> nothing ever will be, but should work for a lot of cases without
>>>> requiring per-command allocations just for the actual command.
>>>
>>> Agreed. But if I got it right, you are open to support both in-the-sqe
>>> command (<= 56 bytes) and out-of-sqe command (> 56 bytes) with this
>>> interface.
>>> Driver processing the ioctl can fetch the cmd from user-space in one
>>> case (as it does now), and skips in another.
>>
>> Your out-of-seq command would be none of io_urings business, outside of
>> the fact that we'd need to ensure it's stable if we need to postpone
>> it. So yes, that would be fine, it just means your actual command is
>> passed in as a pointer, and you would be responsible for copying it
>> in for execution
>>
>> We're going to need something to handle postponing, and something
>> for ensuring that eg cancelations free the allocated memory.
> 
> I have few doubts about allocation/postponing. Are you referring to
> uring allocating memory for this case, similar to the way
> "req->async_data" is managed for few other opcodes?
> Or can it (i.e. larger cmd) remain a user-space pointer, and the
> underlying driver fetches the command in.
> If submission context changes (for sqo/io-wq case), uring seemed to
> apply context-grabbing techniques to make that work.

There are two concerns here:

1) We need more space than the 48 bytes, which means we need to allocate
   the command or part of the command when get the sqe, and of course
   free that when the command is done.

2) When io_uring_enter() returns and has consumed N commands, the state
   for those commands must be stable. Hence if you're passing in a
   pointer to a struct, that struct will have been read and store
   safely. This prevents things like on-stack structures from being an
   issue.

->async_data deals with #2 here, it's used when a command needs to store
data because we're switching to an async context to execute the command
(or the command is otherwise deferred, eg links and such). You can
always rely on the context being sane, it's either the task itself or
equivalent.

>>>>> And for some ioctls, driver may still need to use task-work to update
>>>>> the user-space pointers (embedded in uring/ioctl cmd) during
>>>>> completion.
>>>>>
>>>>> @Jens - will it be fine if I start looking at plumbing nvme-part of
>>>>> this series on top of your work?
>>>>
>>>> Sure, go ahead. Just beware that things are still changing, so you might
>>>> have to adapt it a few times. It's still early days, but I do think
>>>> that's the way forward in providing controlled access to what is
>>>> basically async ioctls.
>>>
>>> Sounds good, I will start with the latest branch that you posted. Thanks.
>>
>> It's io_uring-fops.v2 for now, use that one.
> 
> Moved to v3 now.
> nvme_user_io is 48 bytes, while nvme passthrough requires 72 or 80
> bytes (passthru with 64 bit result).
> The block_uring_cmd has 32 bytes of available space. If NVMe defines
> its own "nvme_uring_cmd" (which can be used for nvme char interface)
> that will buy some more space, but still won't be enough for passthru
> command.
> 
> So I am looking at adding support for large-cmd in uring. And felt the
> need to clear those doubts I mentioned above.

The simple solution is to just keep the command type the same on the
NVMe side, and just pass in a pointer to it. Then API could then be
nr_commands and **commands, for example.

But I think we're getting to the point where it'd be easier to just
discuss actual code. So if you rebase on top of the v3 code, then send
out those patches and we can discuss the most convenient API to present
for nvme passthrough and friends. Does that work?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4] Asynchronous passthrough ioctl
  2021-02-22 14:33                 ` Jens Axboe
@ 2021-02-23  4:41                   ` Kanchan Joshi
  0 siblings, 0 replies; 16+ messages in thread
From: Kanchan Joshi @ 2021-02-23  4:41 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Pavel Begunkov, Kanchan Joshi, Keith Busch, Christoph Hellwig,
	sagi, linux-nvme, io-uring, linux-fsdevel, linux-kernel,
	Javier Gonzalez, Nitesh Shetty, Selvakumar S

On Mon, Feb 22, 2021 at 8:03 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 2/22/21 6:42 AM, Kanchan Joshi wrote:
> > On Thu, Jan 28, 2021 at 10:54 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> On 1/28/21 10:13 AM, Kanchan Joshi wrote:
> >>> On Thu, Jan 28, 2021 at 8:08 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>
> >>>> On 1/28/21 5:04 AM, Kanchan Joshi wrote:
> >>>>> On Wed, Jan 27, 2021 at 9:32 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >>>>>>
> >>>>>> On 27/01/2021 15:42, Pavel Begunkov wrote:
> >>>>>>> On 27/01/2021 15:00, Kanchan Joshi wrote:
> >>>>>>>> This RFC patchset adds asynchronous ioctl capability for NVMe devices.
> >>>>>>>> Purpose of RFC is to get the feedback and optimize the path.
> >>>>>>>>
> >>>>>>>> At the uppermost io-uring layer, a new opcode IORING_OP_IOCTL_PT is
> >>>>>>>> presented to user-space applications. Like regular-ioctl, it takes
> >>>>>>>> ioctl opcode and an optional argument (ioctl-specific input/output
> >>>>>>>> parameter). Unlike regular-ioctl, it is made to skip the block-layer
> >>>>>>>> and reach directly to the underlying driver (nvme in the case of this
> >>>>>>>> patchset). This path between io-uring and nvme is via a newly
> >>>>>>>> introduced block-device operation "async_ioctl". This operation
> >>>>>>>> expects io-uring to supply a callback function which can be used to
> >>>>>>>> report completion at later stage.
> >>>>>>>>
> >>>>>>>> For a regular ioctl, NVMe driver submits the command to the device and
> >>>>>>>> the submitter (task) is made to wait until completion arrives. For
> >>>>>>>> async-ioctl, completion is decoupled from submission. Submitter goes
> >>>>>>>> back to its business without waiting for nvme-completion. When
> >>>>>>>> nvme-completion arrives, it informs io-uring via the registered
> >>>>>>>> completion-handler. But some ioctls may require updating certain
> >>>>>>>> ioctl-specific fields which can be accessed only in context of the
> >>>>>>>> submitter task. For that reason, NVMe driver uses task-work infra for
> >>>>>>>> that ioctl-specific update. Since task-work is not exported, it cannot
> >>>>>>>> be referenced when nvme is compiled as a module. Therefore, one of the
> >>>>>>>> patch exports task-work API.
> >>>>>>>>
> >>>>>>>> Here goes example of usage (pseudo-code).
> >>>>>>>> Actual nvme-cli source, modified to issue all ioctls via this opcode
> >>>>>>>> is present at-
> >>>>>>>> https://github.com/joshkan/nvme-cli/commit/a008a733f24ab5593e7874cfbc69ee04e88068c5
> >>>>>>>
> >>>>>>> see https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops
> >>>>>>>
> >>>>>>> Looks like good time to bring that branch/discussion back
> >>>>>>
> >>>>>> a bit more context:
> >>>>>> https://github.com/axboe/liburing/issues/270
> >>>>>
> >>>>> Thanks, it looked good. It seems key differences (compared to
> >>>>> uring-patch that I posted) are -
> >>>>> 1. using file-operation instead of block-dev operation.
> >>>>
> >>>> Right, it's meant to span wider than just block devices.
> >>>>
> >>>>> 2. repurpose the sqe memory for ioctl-cmd. If an application does
> >>>>> ioctl with <=40 bytes of cmd, it does not have to allocate ioctl-cmd.
> >>>>> That's nifty. We still need to support passing larger-cmd (e.g.
> >>>>> nvme-passthru ioctl takes 72 bytes) but that shouldn't get too
> >>>>> difficult I suppose.
> >>>>
> >>>> It's actually 48 bytes in the as-posted version, and I've bumped it to
> >>>> 56 bytes in the latest branch. So not quite enough for everything,
> >>>> nothing ever will be, but should work for a lot of cases without
> >>>> requiring per-command allocations just for the actual command.
> >>>
> >>> Agreed. But if I got it right, you are open to support both in-the-sqe
> >>> command (<= 56 bytes) and out-of-sqe command (> 56 bytes) with this
> >>> interface.
> >>> Driver processing the ioctl can fetch the cmd from user-space in one
> >>> case (as it does now), and skips in another.
> >>
> >> Your out-of-seq command would be none of io_urings business, outside of
> >> the fact that we'd need to ensure it's stable if we need to postpone
> >> it. So yes, that would be fine, it just means your actual command is
> >> passed in as a pointer, and you would be responsible for copying it
> >> in for execution
> >>
> >> We're going to need something to handle postponing, and something
> >> for ensuring that eg cancelations free the allocated memory.
> >
> > I have few doubts about allocation/postponing. Are you referring to
> > uring allocating memory for this case, similar to the way
> > "req->async_data" is managed for few other opcodes?
> > Or can it (i.e. larger cmd) remain a user-space pointer, and the
> > underlying driver fetches the command in.
> > If submission context changes (for sqo/io-wq case), uring seemed to
> > apply context-grabbing techniques to make that work.
>
> There are two concerns here:
>
> 1) We need more space than the 48 bytes, which means we need to allocate
>    the command or part of the command when get the sqe, and of course
>    free that when the command is done.
>
> 2) When io_uring_enter() returns and has consumed N commands, the state
>    for those commands must be stable. Hence if you're passing in a
>    pointer to a struct, that struct will have been read and store
>    safely. This prevents things like on-stack structures from being an
>    issue.
>
> ->async_data deals with #2 here, it's used when a command needs to store
> data because we're switching to an async context to execute the command
> (or the command is otherwise deferred, eg links and such). You can
> always rely on the context being sane, it's either the task itself or
> equivalent.

Thanks for sorting this out.

> >>>>> And for some ioctls, driver may still need to use task-work to update
> >>>>> the user-space pointers (embedded in uring/ioctl cmd) during
> >>>>> completion.
> >>>>>
> >>>>> @Jens - will it be fine if I start looking at plumbing nvme-part of
> >>>>> this series on top of your work?
> >>>>
> >>>> Sure, go ahead. Just beware that things are still changing, so you might
> >>>> have to adapt it a few times. It's still early days, but I do think
> >>>> that's the way forward in providing controlled access to what is
> >>>> basically async ioctls.
> >>>
> >>> Sounds good, I will start with the latest branch that you posted. Thanks.
> >>
> >> It's io_uring-fops.v2 for now, use that one.
> >
> > Moved to v3 now.
> > nvme_user_io is 48 bytes, while nvme passthrough requires 72 or 80
> > bytes (passthru with 64 bit result).
> > The block_uring_cmd has 32 bytes of available space. If NVMe defines
> > its own "nvme_uring_cmd" (which can be used for nvme char interface)
> > that will buy some more space, but still won't be enough for passthru
> > command.
> >
> > So I am looking at adding support for large-cmd in uring. And felt the
> > need to clear those doubts I mentioned above.
>
> The simple solution is to just keep the command type the same on the
> NVMe side, and just pass in a pointer to it. Then API could then be
> nr_commands and **commands, for example.
>
> But I think we're getting to the point where it'd be easier to just
> discuss actual code. So if you rebase on top of the v3 code, then send
> out those patches and we can discuss the most convenient API to present
> for nvme passthrough and friends. Does that work?

Yes, perfect. I will go about that.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-02-23  4:42 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20210127150134epcas5p251fc1de3ff3581dd4c68b3fbe0b9dd91@epcas5p2.samsung.com>
2021-01-27 15:00 ` [RFC PATCH 0/4] Asynchronous passthrough ioctl Kanchan Joshi
     [not found]   ` <CGME20210127150140epcas5p32832cc0c0db953db199eb9dd326f2d4c@epcas5p3.samsung.com>
2021-01-27 15:00     ` [RFC PATCH 1/4] block: introduce async ioctl operation Kanchan Joshi
     [not found]   ` <CGME20210127150144epcas5p29ccb35d7e7170aba7947b5ee16fd2db0@epcas5p2.samsung.com>
2021-01-27 15:00     ` [RFC PATCH 2/4] kernel: export task_work_add Kanchan Joshi
     [not found]   ` <CGME20210127150149epcas5p4fa8edd47712f28ccdd9bac5139fc6e61@epcas5p4.samsung.com>
2021-01-27 15:00     ` [RFC PATCH 3/4] nvme: add async ioctl support Kanchan Joshi
     [not found]   ` <CGME20210127150156epcas5p26cdf368e4ff6bffb132fa1c7f9430653@epcas5p2.samsung.com>
2021-01-27 15:00     ` [RFC PATCH 4/4] io_uring: add async passthrough " Kanchan Joshi
2021-01-27 15:42   ` [RFC PATCH 0/4] Asynchronous passthrough ioctl Pavel Begunkov
2021-01-27 15:53     ` Pavel Begunkov
2021-01-28 12:04       ` Kanchan Joshi
2021-01-28 14:38         ` Jens Axboe
2021-01-28 17:13           ` Kanchan Joshi
2021-01-28 17:24             ` Jens Axboe
2021-02-22 13:42               ` Kanchan Joshi
2021-02-22 14:33                 ` Jens Axboe
2021-02-23  4:41                   ` Kanchan Joshi
2021-01-28 14:50         ` Jens Axboe
2021-01-28 17:25           ` Kanchan Joshi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).