io-uring.vger.kernel.org archive mirror
* [PATCH for-next v7 0/5] fixed-buffer for uring-cmd/passthru
       [not found] <CGME20220909103131epcas5p23d146916eccedf30d498e0ea23e54052@epcas5p2.samsung.com>
@ 2022-09-09 10:21 ` Kanchan Joshi
       [not found]   ` <CGME20220909103136epcas5p38ea3a933e90d9f9d7451848dc3a60829@epcas5p3.samsung.com>
                     ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Kanchan Joshi @ 2022-09-09 10:21 UTC (permalink / raw)
  To: axboe, hch, kbusch, asml.silence
  Cc: io-uring, linux-nvme, linux-block, gost.dev, Kanchan Joshi

Currently uring-cmd lacks the ability to leverage pre-registered
buffers. This series adds that support in uring-cmd, and plumbs
nvme passthrough to work with it.

Using registered buffers showed an IOPS increase from 1.9M to 2.2M in my tests.

Patch 1, 3, 4 = prep
Patch 2 = expand io_uring command to use registered-buffers
Patch 5 = expand nvme passthrough to use registered-buffers

Changes since v6:
- Patch 1: fix warning for io_uring_cmd_import_fixed (robot)

Changes since v5:
- Patch 4: newly added, to split an nvme function into two
- Patch 3: folded cleanups in bio_map_user_iov (Chaitanya, Pankaj)
- Rebase to latest for-next

Changes since v4:
- Patch 1, 2: folded all review comments of Jens

Changes since v3:
- uring_cmd_flags, change from u16 to u32 (Jens)
- patch 3, add another helper to reduce code-duplication (Jens)

Changes since v2:
- Kill the new opcode, add a flag instead (Pavel)
- Fix standalone build issue with patch 1 (Pavel)

Changes since v1:
- Fix a naming issue for an exported helper



Anuj Gupta (2):
  io_uring: add io_uring_cmd_import_fixed
  io_uring: introduce fixed buffer support for io_uring_cmd

Kanchan Joshi (3):
  nvme: refactor nvme_alloc_user_request
  block: add helper to map bvec iterator for passthrough
  nvme: wire up fixed buffer support for nvme passthrough

 block/blk-map.c               |  87 ++++++++++++++++++++---
 drivers/nvme/host/ioctl.c     | 126 +++++++++++++++++++++-------------
 include/linux/blk-mq.h        |   1 +
 include/linux/io_uring.h      |  10 ++-
 include/uapi/linux/io_uring.h |   9 +++
 io_uring/uring_cmd.c          |  26 ++++++-
 6 files changed, 199 insertions(+), 60 deletions(-)

-- 
2.25.1



* [PATCH for-next v7 1/5] io_uring: add io_uring_cmd_import_fixed
       [not found]   ` <CGME20220909103136epcas5p38ea3a933e90d9f9d7451848dc3a60829@epcas5p3.samsung.com>
@ 2022-09-09 10:21     ` Kanchan Joshi
  0 siblings, 0 replies; 16+ messages in thread
From: Kanchan Joshi @ 2022-09-09 10:21 UTC (permalink / raw)
  To: axboe, hch, kbusch, asml.silence
  Cc: io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta, Kanchan Joshi

From: Anuj Gupta <anuj20.g@samsung.com>

This is a new helper that callers can use to obtain a bvec iterator for
the previously mapped buffer. This is preparatory work to enable
fixed-buffer support for io_uring_cmd.
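
As an illustration (not part of this patch), a driver-side caller could
use it roughly as in the sketch below; the function name is made up, the
request setup is elided, and the bvec-mapping helper it feeds into is
only added later in this series:

        static int drv_map_fixed_buf(struct io_uring_cmd *ioucmd,
                                     struct request *rq,
                                     u64 ubuf, unsigned long len)
        {
                struct iov_iter iter;
                int ret;

                /* turn the pre-registered buffer into a bvec iterator */
                ret = io_uring_cmd_import_fixed(ubuf, len, rq_data_dir(rq),
                                                &iter, ioucmd);
                if (ret < 0)
                        return ret;

                /* hand the bvecs to the block layer (patch 4 of this series) */
                return blk_rq_map_user_bvec(rq, &iter);
        }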

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
 include/linux/io_uring.h |  8 ++++++++
 io_uring/uring_cmd.c     | 10 ++++++++++
 2 files changed, 18 insertions(+)

diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 58676c0a398f..1dbf51115c30 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -4,6 +4,7 @@
 
 #include <linux/sched.h>
 #include <linux/xarray.h>
+#include <uapi/linux/io_uring.h>
 
 enum io_uring_cmd_flags {
 	IO_URING_F_COMPLETE_DEFER	= 1,
@@ -32,6 +33,8 @@ struct io_uring_cmd {
 };
 
 #if defined(CONFIG_IO_URING)
+int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
+			      struct iov_iter *iter, void *ioucmd);
 void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret, ssize_t res2);
 void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd,
 			void (*task_work_cb)(struct io_uring_cmd *));
@@ -59,6 +62,11 @@ static inline void io_uring_free(struct task_struct *tsk)
 		__io_uring_free(tsk);
 }
 #else
+static int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
+			      struct iov_iter *iter, void *ioucmd)
+{
+	return -EOPNOTSUPP;
+}
 static inline void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret,
 		ssize_t ret2)
 {
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index f3ed61e9bd0f..6a6d69523d75 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -8,6 +8,7 @@
 #include <uapi/linux/io_uring.h>
 
 #include "io_uring.h"
+#include "rsrc.h"
 #include "uring_cmd.h"
 
 static void io_uring_cmd_work(struct io_kiocb *req, bool *locked)
@@ -129,3 +130,12 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
 
 	return IOU_ISSUE_SKIP_COMPLETE;
 }
+
+int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
+			      struct iov_iter *iter, void *ioucmd)
+{
+	struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
+
+	return io_import_fixed(rw, iter, req->imu, ubuf, len);
+}
+EXPORT_SYMBOL_GPL(io_uring_cmd_import_fixed);
-- 
2.25.1



* [PATCH for-next v7 2/5] io_uring: introduce fixed buffer support for io_uring_cmd
       [not found]   ` <CGME20220909103140epcas5p36689726422eb68e6fdc1d39019a4a8ba@epcas5p3.samsung.com>
@ 2022-09-09 10:21     ` Kanchan Joshi
  0 siblings, 0 replies; 16+ messages in thread
From: Kanchan Joshi @ 2022-09-09 10:21 UTC (permalink / raw)
  To: axboe, hch, kbusch, asml.silence
  Cc: io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta, Kanchan Joshi

From: Anuj Gupta <anuj20.g@samsung.com>

Add an IORING_URING_CMD_FIXED flag that is to be used for sending an
io_uring command with previously registered buffers. User space passes
the buffer index in sqe->buf_index, the same as is done in the read/write
variants that use fixed buffers.
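
As an illustration (not part of this patch), a user-space submission could
look roughly like the sketch below. It assumes a ring set up with
IORING_SETUP_SQE128, buffers registered beforehand via
io_uring_register_buffers(), and placeholder names (ring, nvme_fd, buf_idx):

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

        /* big SQE: struct nvme_uring_cmd is placed in sqe->cmd as usual */
        sqe->opcode = IORING_OP_URING_CMD;
        sqe->fd = nvme_fd;                              /* nvme char device */
        sqe->cmd_op = NVME_URING_CMD_IO;
        sqe->uring_cmd_flags = IORING_URING_CMD_FIXED;  /* use registered buf */
        sqe->buf_index = buf_idx;                       /* registration index */
        /* the data address inside the nvme command still points into the
         * registered buffer; the kernel resolves it against buf_index */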

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
 include/linux/io_uring.h      |  2 +-
 include/uapi/linux/io_uring.h |  9 +++++++++
 io_uring/uring_cmd.c          | 16 +++++++++++++++-
 3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 1dbf51115c30..e10c5cc81082 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -28,7 +28,7 @@ struct io_uring_cmd {
 		void *cookie;
 	};
 	u32		cmd_op;
-	u32		pad;
+	u32		flags;
 	u8		pdu[32]; /* available inline for free use */
 };
 
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 972b179bc07a..f94f377f2ae6 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -56,6 +56,7 @@ struct io_uring_sqe {
 		__u32		hardlink_flags;
 		__u32		xattr_flags;
 		__u32		msg_ring_flags;
+		__u32		uring_cmd_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
 	/* pack this to avoid bogus arm OABI complaints */
@@ -218,6 +219,14 @@ enum io_uring_op {
 	IORING_OP_LAST,
 };
 
+/*
+ * sqe->uring_cmd_flags
+ * IORING_URING_CMD_FIXED	use registered buffer; pass this flag
+ *				along with setting sqe->buf_index.
+ */
+#define IORING_URING_CMD_FIXED	(1U << 0)
+
+
 /*
  * sqe->fsync_flags
  */
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 6a6d69523d75..faefa9f6f259 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -4,6 +4,7 @@
 #include <linux/file.h>
 #include <linux/io_uring.h>
 #include <linux/security.h>
+#include <linux/nospec.h>
 
 #include <uapi/linux/io_uring.h>
 
@@ -77,8 +78,21 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
 
-	if (sqe->rw_flags || sqe->__pad1)
+	if (sqe->__pad1)
 		return -EINVAL;
+
+	ioucmd->flags = READ_ONCE(sqe->uring_cmd_flags);
+	if (ioucmd->flags & IORING_URING_CMD_FIXED) {
+		struct io_ring_ctx *ctx = req->ctx;
+		u16 index;
+
+		req->buf_index = READ_ONCE(sqe->buf_index);
+		if (unlikely(req->buf_index >= ctx->nr_user_bufs))
+			return -EFAULT;
+		index = array_index_nospec(req->buf_index, ctx->nr_user_bufs);
+		req->imu = ctx->user_bufs[index];
+		io_req_set_rsrc_node(req, ctx, 0);
+	}
 	ioucmd->cmd = sqe->cmd;
 	ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
 	return 0;
-- 
2.25.1



* [PATCH for-next v7 3/5] nvme: refactor nvme_alloc_user_request
       [not found]   ` <CGME20220909103143epcas5p2eda60190cd23b79fb8f48596af3e1524@epcas5p2.samsung.com>
@ 2022-09-09 10:21     ` Kanchan Joshi
  2022-09-20 12:02       ` Christoph Hellwig
  0 siblings, 1 reply; 16+ messages in thread
From: Kanchan Joshi @ 2022-09-09 10:21 UTC (permalink / raw)
  To: axboe, hch, kbusch, asml.silence
  Cc: io-uring, linux-nvme, linux-block, gost.dev, Kanchan Joshi

Separate this out into two functions with a reduced number of arguments.
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
 drivers/nvme/host/ioctl.c | 116 ++++++++++++++++++++++----------------
 1 file changed, 66 insertions(+), 50 deletions(-)

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 548aca8b5b9f..cb2fa4db50dd 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -65,18 +65,10 @@ static int nvme_finish_user_metadata(struct request *req, void __user *ubuf,
 }
 
 static struct request *nvme_alloc_user_request(struct request_queue *q,
-		struct nvme_command *cmd, void __user *ubuffer,
-		unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
-		u32 meta_seed, void **metap, unsigned timeout, bool vec,
+		struct nvme_command *cmd, unsigned timeout,
 		blk_opf_t rq_flags, blk_mq_req_flags_t blk_flags)
 {
-	bool write = nvme_is_write(cmd);
-	struct nvme_ns *ns = q->queuedata;
-	struct block_device *bdev = ns ? ns->disk->part0 : NULL;
 	struct request *req;
-	struct bio *bio = NULL;
-	void *meta = NULL;
-	int ret;
 
 	req = blk_mq_alloc_request(q, nvme_req_op(cmd) | rq_flags, blk_flags);
 	if (IS_ERR(req))
@@ -86,49 +78,61 @@ static struct request *nvme_alloc_user_request(struct request_queue *q,
 	if (timeout)
 		req->timeout = timeout;
 	nvme_req(req)->flags |= NVME_REQ_USERCMD;
+	return req;
+}
 
-	if (ubuffer && bufflen) {
-		if (!vec)
-			ret = blk_rq_map_user(q, req, NULL, ubuffer, bufflen,
-				GFP_KERNEL);
-		else {
-			struct iovec fast_iov[UIO_FASTIOV];
-			struct iovec *iov = fast_iov;
-			struct iov_iter iter;
-
-			ret = import_iovec(rq_data_dir(req), ubuffer, bufflen,
-					UIO_FASTIOV, &iov, &iter);
-			if (ret < 0)
-				goto out;
-			ret = blk_rq_map_user_iov(q, req, NULL, &iter,
-					GFP_KERNEL);
-			kfree(iov);
-		}
-		if (ret)
+static int nvme_map_user_request(struct request *req, void __user *ubuffer,
+		unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
+		u32 meta_seed, void **metap, bool vec)
+{
+	struct request_queue *q = req->q;
+	struct nvme_ns *ns = q->queuedata;
+	struct block_device *bdev = ns ? ns->disk->part0 : NULL;
+	struct bio *bio = NULL;
+	void *meta = NULL;
+	int ret;
+
+	if (!ubuffer || !bufflen)
+		return 0;
+
+	if (!vec)
+		ret = blk_rq_map_user(q, req, NULL, ubuffer, bufflen,
+			GFP_KERNEL);
+	else {
+		struct iovec fast_iov[UIO_FASTIOV];
+		struct iovec *iov = fast_iov;
+		struct iov_iter iter;
+
+		ret = import_iovec(rq_data_dir(req), ubuffer, bufflen,
+				UIO_FASTIOV, &iov, &iter);
+		if (ret < 0)
 			goto out;
-		bio = req->bio;
-		if (bdev)
-			bio_set_dev(bio, bdev);
-		if (bdev && meta_buffer && meta_len) {
-			meta = nvme_add_user_metadata(bio, meta_buffer, meta_len,
-					meta_seed, write);
-			if (IS_ERR(meta)) {
-				ret = PTR_ERR(meta);
-				goto out_unmap;
-			}
-			req->cmd_flags |= REQ_INTEGRITY;
-			*metap = meta;
+		ret = blk_rq_map_user_iov(q, req, NULL, &iter, GFP_KERNEL);
+		kfree(iov);
+	}
+	bio = req->bio;
+	if (ret)
+		goto out_unmap;
+	if (bdev)
+		bio_set_dev(bio, bdev);
+	if (bdev && meta_buffer && meta_len) {
+		meta = nvme_add_user_metadata(bio, meta_buffer, meta_len,
+				meta_seed, req_op(req) == REQ_OP_DRV_OUT);
+		if (IS_ERR(meta)) {
+			ret = PTR_ERR(meta);
+			goto out_unmap;
 		}
+		req->cmd_flags |= REQ_INTEGRITY;
+		*metap = meta;
 	}
 
-	return req;
+	return ret;
 
 out_unmap:
 	if (bio)
 		blk_rq_unmap_user(bio);
 out:
-	blk_mq_free_request(req);
-	return ERR_PTR(ret);
+	return ret;
 }
 
 static int nvme_submit_user_cmd(struct request_queue *q,
@@ -141,13 +145,16 @@ static int nvme_submit_user_cmd(struct request_queue *q,
 	struct bio *bio;
 	int ret;
 
-	req = nvme_alloc_user_request(q, cmd, ubuffer, bufflen, meta_buffer,
-			meta_len, meta_seed, &meta, timeout, vec, 0, 0);
+	req = nvme_alloc_user_request(q, cmd, timeout, 0, 0);
 	if (IS_ERR(req))
 		return PTR_ERR(req);
 
-	bio = req->bio;
+	ret = nvme_map_user_request(req, ubuffer, bufflen, meta_buffer,
+			meta_len, meta_seed, &meta, vec);
+	if (ret)
+		goto out;
 
+	bio = req->bio;
 	ret = nvme_execute_passthru_rq(req);
 
 	if (result)
@@ -157,6 +164,7 @@ static int nvme_submit_user_cmd(struct request_queue *q,
 						meta_len, ret);
 	if (bio)
 		blk_rq_unmap_user(bio);
+out:
 	blk_mq_free_request(req);
 	return ret;
 }
@@ -418,6 +426,7 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	blk_opf_t rq_flags = 0;
 	blk_mq_req_flags_t blk_flags = 0;
 	void *meta = NULL;
+	int ret;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EACCES;
@@ -457,13 +466,17 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 		rq_flags |= REQ_POLLED;
 
 retry:
-	req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(d.addr),
-			d.data_len, nvme_to_user_ptr(d.metadata),
-			d.metadata_len, 0, &meta, d.timeout_ms ?
-			msecs_to_jiffies(d.timeout_ms) : 0, vec, rq_flags,
-			blk_flags);
+	req = nvme_alloc_user_request(q, &c,
+			d.timeout_ms ? msecs_to_jiffies(d.timeout_ms) : 0,
+			rq_flags, blk_flags);
 	if (IS_ERR(req))
 		return PTR_ERR(req);
+
+	ret = nvme_map_user_request(req, nvme_to_user_ptr(d.addr),
+			d.data_len, nvme_to_user_ptr(d.metadata),
+			d.metadata_len, 0, &meta, vec);
+	if (ret)
+		goto out_err;
 	req->end_io = nvme_uring_cmd_end_io;
 	req->end_io_data = ioucmd;
 
@@ -486,6 +499,9 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 
 	blk_execute_rq_nowait(req, false);
 	return -EIOCBQUEUED;
+out_err:
+	blk_mq_free_request(req);
+	return ret;
 }
 
 static bool is_ctrl_ioctl(unsigned int cmd)
-- 
2.25.1



* [PATCH for-next v7 4/5] block: add helper to map bvec iterator for passthrough
       [not found]   ` <CGME20220909103147epcas5p2a83ec151333bcb1d2abb8c7536789bfd@epcas5p2.samsung.com>
@ 2022-09-09 10:21     ` Kanchan Joshi
  2022-09-20 12:08       ` Christoph Hellwig
  0 siblings, 1 reply; 16+ messages in thread
From: Kanchan Joshi @ 2022-09-09 10:21 UTC (permalink / raw)
  To: axboe, hch, kbusch, asml.silence
  Cc: io-uring, linux-nvme, linux-block, gost.dev, Kanchan Joshi, Anuj Gupta

Add blk_rq_map_user_bvec, which maps a bvec iterator into a bio and
places that into the request. This helper will be used in nvme for
uring passthrough with fixed buffers.
While at it, create another helper, bio_map_get, to reduce code
duplication.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 block/blk-map.c        | 87 +++++++++++++++++++++++++++++++++++++-----
 include/linux/blk-mq.h |  1 +
 2 files changed, 78 insertions(+), 10 deletions(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index 7693f8e3c454..5dcfa112f240 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -241,17 +241,10 @@ static void bio_map_put(struct bio *bio)
 	}
 }
 
-static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
+static struct bio *bio_map_get(struct request *rq, unsigned int nr_vecs,
 		gfp_t gfp_mask)
 {
-	unsigned int max_sectors = queue_max_hw_sectors(rq->q);
-	unsigned int nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS);
 	struct bio *bio;
-	int ret;
-	int j;
-
-	if (!iov_iter_count(iter))
-		return -EINVAL;
 
 	if (rq->cmd_flags & REQ_POLLED) {
 		blk_opf_t opf = rq->cmd_flags | REQ_ALLOC_CACHE;
@@ -259,13 +252,31 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		bio = bio_alloc_bioset(NULL, nr_vecs, opf, gfp_mask,
 					&fs_bio_set);
 		if (!bio)
-			return -ENOMEM;
+			return NULL;
 	} else {
 		bio = bio_kmalloc(nr_vecs, gfp_mask);
 		if (!bio)
-			return -ENOMEM;
+			return NULL;
 		bio_init(bio, NULL, bio->bi_inline_vecs, nr_vecs, req_op(rq));
 	}
+	return bio;
+}
+
+static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
+		gfp_t gfp_mask)
+{
+	unsigned int max_sectors = queue_max_hw_sectors(rq->q);
+	unsigned int nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS);
+	struct bio *bio;
+	int ret;
+	int j;
+
+	if (!iov_iter_count(iter))
+		return -EINVAL;
+
+	bio = bio_map_get(rq, nr_vecs, gfp_mask);
+	if (bio == NULL)
+		return -ENOMEM;
 
 	while (iov_iter_count(iter)) {
 		struct page **pages, *stack_pages[UIO_FASTIOV];
@@ -611,6 +622,62 @@ int blk_rq_map_user(struct request_queue *q, struct request *rq,
 }
 EXPORT_SYMBOL(blk_rq_map_user);
 
+/* Prepare bio for passthrough IO given an existing bvec iter */
+int blk_rq_map_user_bvec(struct request *rq, struct iov_iter *iter)
+{
+	struct request_queue *q = rq->q;
+	size_t nr_iter, nr_segs, i;
+	struct bio *bio;
+	struct bio_vec *bv, *bvecs, *bvprvp = NULL;
+	struct queue_limits *lim = &q->limits;
+	unsigned int nsegs = 0, bytes = 0;
+
+	nr_iter = iov_iter_count(iter);
+	nr_segs = iter->nr_segs;
+
+	if (!nr_iter || (nr_iter >> SECTOR_SHIFT) > queue_max_hw_sectors(q))
+		return -EINVAL;
+	if (nr_segs > queue_max_segments(q))
+		return -EINVAL;
+
+	/* no iovecs to alloc, as we already have a BVEC iterator */
+	bio = bio_map_get(rq, 0, GFP_KERNEL);
+	if (bio == NULL)
+		return -ENOMEM;
+
+	bio_iov_bvec_set(bio, iter);
+	blk_rq_bio_prep(rq, bio, nr_segs);
+
+	/* loop to perform a bunch of sanity checks */
+	bvecs = (struct bio_vec *)iter->bvec;
+	for (i = 0; i < nr_segs; i++) {
+		bv = &bvecs[i];
+		/*
+		 * If the queue doesn't support SG gaps and adding this
+		 * offset would create a gap, disallow it.
+		 */
+		if (bvprvp && bvec_gap_to_prev(lim, bvprvp, bv->bv_offset))
+			goto out_err;
+
+		/* check full condition */
+		if (nsegs >= nr_segs || bytes > UINT_MAX - bv->bv_len)
+			goto out_err;
+
+		if (bytes + bv->bv_len <= nr_iter &&
+				bv->bv_offset + bv->bv_len <= PAGE_SIZE) {
+			nsegs++;
+			bytes += bv->bv_len;
+		} else
+			goto out_err;
+		bvprvp = bv;
+	}
+	return 0;
+out_err:
+	bio_map_put(bio);
+	return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(blk_rq_map_user_bvec);
+
 /**
  * blk_rq_unmap_user - unmap a request with user data
  * @bio:	       start of bio list
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b43c81d91892..83bef362f0f9 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -970,6 +970,7 @@ struct rq_map_data {
 	bool from_user;
 };
 
+int blk_rq_map_user_bvec(struct request *rq, struct iov_iter *iter);
 int blk_rq_map_user(struct request_queue *, struct request *,
 		struct rq_map_data *, void __user *, unsigned long, gfp_t);
 int blk_rq_map_user_iov(struct request_queue *, struct request *,
-- 
2.25.1



* [PATCH for-next v7 5/5] nvme: wire up fixed buffer support for nvme passthrough
       [not found]   ` <CGME20220909103151epcas5p1e25127c3053ba21e8f8418a701878973@epcas5p1.samsung.com>
@ 2022-09-09 10:21     ` Kanchan Joshi
  0 siblings, 0 replies; 16+ messages in thread
From: Kanchan Joshi @ 2022-09-09 10:21 UTC (permalink / raw)
  To: axboe, hch, kbusch, asml.silence
  Cc: io-uring, linux-nvme, linux-block, gost.dev, Kanchan Joshi

If io_uring sends a passthrough command with the IORING_URING_CMD_FIXED
flag, use the pre-registered buffer to form the bio.
While at it, modify nvme_submit_user_cmd to take ubuffer as a plain integer
argument, and do away with the nvme_to_user_ptr conversion in the callers.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
 drivers/nvme/host/ioctl.c | 40 ++++++++++++++++++++++++++-------------
 1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index cb2fa4db50dd..e47ef12ce047 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -81,9 +81,10 @@ static struct request *nvme_alloc_user_request(struct request_queue *q,
 	return req;
 }
 
-static int nvme_map_user_request(struct request *req, void __user *ubuffer,
+static int nvme_map_user_request(struct request *req, u64 ubuffer,
 		unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
-		u32 meta_seed, void **metap, bool vec)
+		u32 meta_seed, void **metap, struct io_uring_cmd *ioucmd,
+		bool vec)
 {
 	struct request_queue *q = req->q;
 	struct nvme_ns *ns = q->queuedata;
@@ -91,20 +92,33 @@ static int nvme_map_user_request(struct request *req, void __user *ubuffer,
 	struct bio *bio = NULL;
 	void *meta = NULL;
 	int ret;
+	bool fixedbufs = ioucmd && (ioucmd->flags & IORING_URING_CMD_FIXED);
 
 	if (!ubuffer || !bufflen)
 		return 0;
 
 	if (!vec)
-		ret = blk_rq_map_user(q, req, NULL, ubuffer, bufflen,
-			GFP_KERNEL);
+		if (fixedbufs) {
+			struct iov_iter iter;
+
+			ret = io_uring_cmd_import_fixed(ubuffer, bufflen,
+					rq_data_dir(req), &iter, ioucmd);
+			if (ret < 0)
+				goto out;
+			ret = blk_rq_map_user_bvec(req, &iter);
+
+		} else {
+			ret = blk_rq_map_user(q, req, NULL,
+					nvme_to_user_ptr(ubuffer), bufflen,
+					GFP_KERNEL);
+		}
 	else {
 		struct iovec fast_iov[UIO_FASTIOV];
 		struct iovec *iov = fast_iov;
 		struct iov_iter iter;
 
-		ret = import_iovec(rq_data_dir(req), ubuffer, bufflen,
-				UIO_FASTIOV, &iov, &iter);
+		ret = import_iovec(rq_data_dir(req), nvme_to_user_ptr(ubuffer),
+				bufflen, UIO_FASTIOV, &iov, &iter);
 		if (ret < 0)
 			goto out;
 		ret = blk_rq_map_user_iov(q, req, NULL, &iter, GFP_KERNEL);
@@ -136,7 +150,7 @@ static int nvme_map_user_request(struct request *req, void __user *ubuffer,
 }
 
 static int nvme_submit_user_cmd(struct request_queue *q,
-		struct nvme_command *cmd, void __user *ubuffer,
+		struct nvme_command *cmd, u64 ubuffer,
 		unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
 		u32 meta_seed, u64 *result, unsigned timeout, bool vec)
 {
@@ -150,7 +164,7 @@ static int nvme_submit_user_cmd(struct request_queue *q,
 		return PTR_ERR(req);
 
 	ret = nvme_map_user_request(req, ubuffer, bufflen, meta_buffer,
-			meta_len, meta_seed, &meta, vec);
+			meta_len, meta_seed, &meta, NULL, vec);
 	if (ret)
 		goto out;
 
@@ -228,7 +242,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 	c.rw.appmask = cpu_to_le16(io.appmask);
 
 	return nvme_submit_user_cmd(ns->queue, &c,
-			nvme_to_user_ptr(io.addr), length,
+			io.addr, length,
 			metadata, meta_len, lower_32_bits(io.slba), NULL, 0,
 			false);
 }
@@ -282,7 +296,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 		timeout = msecs_to_jiffies(cmd.timeout_ms);
 
 	status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
-			nvme_to_user_ptr(cmd.addr), cmd.data_len,
+			cmd.addr, cmd.data_len,
 			nvme_to_user_ptr(cmd.metadata), cmd.metadata_len,
 			0, &result, timeout, false);
 
@@ -328,7 +342,7 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 		timeout = msecs_to_jiffies(cmd.timeout_ms);
 
 	status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
-			nvme_to_user_ptr(cmd.addr), cmd.data_len,
+			cmd.addr, cmd.data_len,
 			nvme_to_user_ptr(cmd.metadata), cmd.metadata_len,
 			0, &cmd.result, timeout, vec);
 
@@ -472,9 +486,9 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	if (IS_ERR(req))
 		return PTR_ERR(req);
 
-	ret = nvme_map_user_request(req, nvme_to_user_ptr(d.addr),
+	ret = nvme_map_user_request(req, d.addr,
 			d.data_len, nvme_to_user_ptr(d.metadata),
-			d.metadata_len, 0, &meta, vec);
+			d.metadata_len, 0, &meta, ioucmd, vec);
 	if (ret)
 		goto out_err;
 	req->end_io = nvme_uring_cmd_end_io;
-- 
2.25.1



* Re: [PATCH for-next v7 3/5] nvme: refactor nvme_alloc_user_request
  2022-09-09 10:21     ` [PATCH for-next v7 3/5] nvme: refactor nvme_alloc_user_request Kanchan Joshi
@ 2022-09-20 12:02       ` Christoph Hellwig
  2022-09-22 15:46         ` Kanchan Joshi
  2022-09-23  9:25         ` Kanchan Joshi
  0 siblings, 2 replies; 16+ messages in thread
From: Christoph Hellwig @ 2022-09-20 12:02 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: axboe, hch, kbusch, asml.silence, io-uring, linux-nvme,
	linux-block, gost.dev

On Fri, Sep 09, 2022 at 03:51:34PM +0530, Kanchan Joshi wrote:
> Separate this out into two functions with a reduced number of arguments.
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> ---
>  drivers/nvme/host/ioctl.c | 116 ++++++++++++++++++++++----------------
>  1 file changed, 66 insertions(+), 50 deletions(-)
> 
> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
> index 548aca8b5b9f..cb2fa4db50dd 100644
> --- a/drivers/nvme/host/ioctl.c
> +++ b/drivers/nvme/host/ioctl.c
> @@ -65,18 +65,10 @@ static int nvme_finish_user_metadata(struct request *req, void __user *ubuf,
>  }
>  
>  static struct request *nvme_alloc_user_request(struct request_queue *q,
> +		struct nvme_command *cmd, unsigned timeout,
>  		blk_opf_t rq_flags, blk_mq_req_flags_t blk_flags)

I think we can also drop the timeout argument here, which seems like it
can be handled more cleanly in the callers: the callers that want
to set it can just do that directly on the returned request.

> +static int nvme_map_user_request(struct request *req, void __user *ubuffer,
> +		unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
> +		u32 meta_seed, void **metap, bool vec)
> +{
> +	struct request_queue *q = req->q;
> +	struct nvme_ns *ns = q->queuedata;
> +	struct block_device *bdev = ns ? ns->disk->part0 : NULL;
> +	struct bio *bio = NULL;
> +	void *meta = NULL;
> +	int ret;
> +
> +	if (!ubuffer || !bufflen)
> +		return 0;

I'd leave these in the callers and not call the helper if there is
no data to transfer.

> +
> +	if (!vec)
> +		ret = blk_rq_map_user(q, req, NULL, ubuffer, bufflen,
> +			GFP_KERNEL);
> +	else {
> +		struct iovec fast_iov[UIO_FASTIOV];
> +		struct iovec *iov = fast_iov;
> +		struct iov_iter iter;
> +
> +		ret = import_iovec(rq_data_dir(req), ubuffer, bufflen,
> +				UIO_FASTIOV, &iov, &iter);
> +		if (ret < 0)
>  			goto out;
> +		ret = blk_rq_map_user_iov(q, req, NULL, &iter, GFP_KERNEL);
> +		kfree(iov);

To me some of this almost screams for lifting the vectored vs
non-vectored handling into a separate helper in the block layer.
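
Roughly something like the sketch below, just to illustrate the idea (the
name blk_rq_map_user_io and its exact signature are placeholders, not an
existing API):

static int blk_rq_map_user_io(struct request *req, struct rq_map_data *map_data,
                void __user *ubuffer, unsigned long bufflen, gfp_t gfp_mask,
                bool vec)
{
        struct iovec fast_iov[UIO_FASTIOV];
        struct iovec *iov = fast_iov;
        struct iov_iter iter;
        int ret;

        if (!vec)
                return blk_rq_map_user(req->q, req, map_data, ubuffer,
                                bufflen, gfp_mask);

        /* for the vectored case, bufflen carries the iovec count */
        ret = import_iovec(rq_data_dir(req), ubuffer, bufflen, UIO_FASTIOV,
                        &iov, &iter);
        if (ret < 0)
                return ret;
        ret = blk_rq_map_user_iov(req->q, req, map_data, &iter, gfp_mask);
        kfree(iov);
        return ret;
}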

> +	}
> +	bio = req->bio;
> +	if (ret)
> +		goto out_unmap;

This seems incorrect, we don't need to unmap if blk_rq_map_user*
failed.

> +	if (bdev)
> +		bio_set_dev(bio, bdev);

I think we can actually drop this now - bi_bdev should only be used
by the non-passthrough path these days.

> +	if (bdev && meta_buffer && meta_len) {
> +		meta = nvme_add_user_metadata(bio, meta_buffer, meta_len,
> +				meta_seed, req_op(req) == REQ_OP_DRV_OUT);
> +		if (IS_ERR(meta)) {
> +			ret = PTR_ERR(meta);
> +			goto out_unmap;
>  		}
> +		req->cmd_flags |= REQ_INTEGRITY;
> +		*metap = meta;

And if we pass the request to nvme_add_user_metadata, that can set
REQ_INTEGRITY.  And we don't need this second helper at all.


* Re: [PATCH for-next v7 4/5] block: add helper to map bvec iterator for passthrough
  2022-09-09 10:21     ` [PATCH for-next v7 4/5] block: add helper to map bvec iterator for passthrough Kanchan Joshi
@ 2022-09-20 12:08       ` Christoph Hellwig
  2022-09-22 15:23         ` Kanchan Joshi
  0 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2022-09-20 12:08 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: axboe, hch, kbusch, asml.silence, io-uring, linux-nvme,
	linux-block, gost.dev, Anuj Gupta

> -static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
> +static struct bio *bio_map_get(struct request *rq, unsigned int nr_vecs,
>  		gfp_t gfp_mask)

bio_map_get is a very confusing name.  And I also still think this is
the wrong way to go.  If plain slab allocations don't use proper
per-cpu caches we have a MM problem and need to talk to the slab
maintainers and not use the overkill bio_set here.

> +/* Prepare bio for passthrough IO given an existing bvec iter */
> +int blk_rq_map_user_bvec(struct request *rq, struct iov_iter *iter)

I'm a little confused about the interface we're trying to present from
the block layer to the driver here.

blk_rq_map_user_iov really should be able to detect that it is called
on a bvec iter and just do the right thing rather than needing different
helpers.

> +		/*
> +		 * If the queue doesn't support SG gaps and adding this
> +		 * offset would create a gap, disallow it.
> +		 */
> +		if (bvprvp && bvec_gap_to_prev(lim, bvprvp, bv->bv_offset))
> +			goto out_err;

So now you limit the input that is accepted?  That's not really how
iov_iters are used.   We can either try to reshuffle the bvecs, or
just fall back to the copy data version as blk_rq_map_user_iov does
for 'weird' iters.

> +
> +		/* check full condition */
> +		if (nsegs >= nr_segs || bytes > UINT_MAX - bv->bv_len)
> +			goto out_err;
> +
> +		if (bytes + bv->bv_len <= nr_iter &&
> +				bv->bv_offset + bv->bv_len <= PAGE_SIZE) {
> +			nsegs++;
> +			bytes += bv->bv_len;
> +		} else
> +			goto out_err;

Nit: This would read much better as:

		if (bytes + bv->bv_len > nr_iter)
			goto out_err;
		if (bv->bv_offset + bv->bv_len > PAGE_SIZE)
			goto out_err;

		bytes += bv->bv_len;
		nsegs++;



* Re: [PATCH for-next v7 4/5] block: add helper to map bvec iterator for passthrough
  2022-09-20 12:08       ` Christoph Hellwig
@ 2022-09-22 15:23         ` Kanchan Joshi
  2022-09-23 15:29           ` Christoph Hellwig
  0 siblings, 1 reply; 16+ messages in thread
From: Kanchan Joshi @ 2022-09-22 15:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, kbusch, asml.silence, io-uring, linux-nvme, linux-block,
	gost.dev, Anuj Gupta


On Tue, Sep 20, 2022 at 02:08:02PM +0200, Christoph Hellwig wrote:
>> -static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
>> +static struct bio *bio_map_get(struct request *rq, unsigned int nr_vecs,
>>  		gfp_t gfp_mask)
>
>bio_map_get is a very confusing name.

So I chose that name because the functionality is the opposite of what
we do inside the existing bio_map_put helper. In that way it is symmetric.

>And I also still think this is
>the wrong way to go.  If plain slab allocations don't use proper
>per-cpu caches we have a MM problem and need to talk to the slab
>maintainers and not use the overkill bio_set here.

This series is not about using (or not using) a bio_set. The attempt here
has been to use the pre-mapped buffers (and bvecs) that we got from io_uring.

>> +/* Prepare bio for passthrough IO given an existing bvec iter */
>> +int blk_rq_map_user_bvec(struct request *rq, struct iov_iter *iter)
>
>I'm a little confused about the interface we're trying to present from
>the block layer to the driver here.
>
>blk_rq_map_user_iov really should be able to detect that it is called
>on a bvec iter and just do the right thing rather than needing different
>helpers.

I too explored that possibility, but found that it does not. It maps the
user-pages into bio either directly or by doing that copy (in certain odd
conditions) but does not know how to deal with existing bvec.
The reason, I guess, is that no one felt the need to try passthrough for
bvecs before. It makes sense only in the context of io_uring passthrough.
And it really felt cleaner to me to write a new function rather than
overloading blk_rq_map_user_iov with multiple if/else branches.
I tried that again after your comment, but it does not seem to produce
any good-looking code.
The other factor is that it seemed safer to go this way, as I am more
sure that I will not break something else (that uses blk_rq_map_user_iov).

>> +		/*
>> +		 * If the queue doesn't support SG gaps and adding this
>> +		 * offset would create a gap, disallow it.
>> +		 */
>> +		if (bvprvp && bvec_gap_to_prev(lim, bvprvp, bv->bv_offset))
>> +			goto out_err;
>
>So now you limit the input that is accepted?  That's not really how
>iov_iters are used.   We can either try to reshuffle the bvecs, or
>just fall back to the copy data version as blk_rq_map_user_iov does
>for 'weird' iters.

Since I was writing a 'new' helper for passthrough only, I thought it
would not be too bad to just bail out (rather than try to handle it using
a copy) if we hit this queue_virt_boundary related situation.

To handle it the 'copy data' way we would need this -

585         else if (queue_virt_boundary(q))
586                 copy = queue_virt_boundary(q) & iov_iter_gap_alignment(iter);
587

But iov_iter_gap_alignment does not work on bvec iters. Line #1274 below

1264 unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
1265 {
1266         unsigned long res = 0;
1267         unsigned long v = 0;
1268         size_t size = i->count;
1269         unsigned k;
1270
1271         if (iter_is_ubuf(i))
1272                 return 0;
1273
1274         if (WARN_ON(!iter_is_iovec(i)))
1275                 return ~0U;

Do you see a way to overcome this? Or maybe this can be revisited, as we
are not missing a lot?

>> +
>> +		/* check full condition */
>> +		if (nsegs >= nr_segs || bytes > UINT_MAX - bv->bv_len)
>> +			goto out_err;
>> +
>> +		if (bytes + bv->bv_len <= nr_iter &&
>> +				bv->bv_offset + bv->bv_len <= PAGE_SIZE) {
>> +			nsegs++;
>> +			bytes += bv->bv_len;
>> +		} else
>> +			goto out_err;
>
>Nit: This would read much better as:
>
>		if (bytes + bv->bv_len > nr_iter)
>			goto out_err;
>		if (bv->bv_offset + bv->bv_len > PAGE_SIZE)
>			goto out_err;
>
>		bytes += bv->bv_len;
>		nsegs++;

Indeed, cleaner. Thanks.





* Re: [PATCH for-next v7 3/5] nvme: refactor nvme_alloc_user_request
  2022-09-20 12:02       ` Christoph Hellwig
@ 2022-09-22 15:46         ` Kanchan Joshi
  2022-09-23  9:25         ` Kanchan Joshi
  1 sibling, 0 replies; 16+ messages in thread
From: Kanchan Joshi @ 2022-09-22 15:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, kbusch, asml.silence, io-uring, linux-nvme, linux-block, gost.dev


On Tue, Sep 20, 2022 at 02:02:26PM +0200, Christoph Hellwig wrote:

I should be able to fold all the changes you mentioned.






* Re: [PATCH for-next v7 3/5] nvme: refactor nvme_alloc_user_request
  2022-09-20 12:02       ` Christoph Hellwig
  2022-09-22 15:46         ` Kanchan Joshi
@ 2022-09-23  9:25         ` Kanchan Joshi
  1 sibling, 0 replies; 16+ messages in thread
From: Kanchan Joshi @ 2022-09-23  9:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, kbusch, asml.silence, io-uring, linux-nvme, linux-block, gost.dev


>> +
>> +	if (!vec)
>> +		ret = blk_rq_map_user(q, req, NULL, ubuffer, bufflen,
>> +			GFP_KERNEL);
>> +	else {
>> +		struct iovec fast_iov[UIO_FASTIOV];
>> +		struct iovec *iov = fast_iov;
>> +		struct iov_iter iter;
>> +
>> +		ret = import_iovec(rq_data_dir(req), ubuffer, bufflen,
>> +				UIO_FASTIOV, &iov, &iter);
>> +		if (ret < 0)
>>  			goto out;
>> +		ret = blk_rq_map_user_iov(q, req, NULL, &iter, GFP_KERNEL);
>> +		kfree(iov);
>
>To me some of this almost screams like lifting the vectored vs
>not to the block layer into a separate helper.
>
So I skipped doing this, as the cleanup is effective only when we can see
the whole elephant; only a part of it is visible here. The last patch (nvme
fixed-buffer support) also changes this region.
I can post a cleanup once all these moving pieces settle.

>> +	}
>> +	bio = req->bio;
>> +	if (ret)
>> +		goto out_unmap;
>
>This seems incorrect, we don't need to unmap if blk_rq_map_user*
>failed.
>
>> +	if (bdev)
>> +		bio_set_dev(bio, bdev);
>
>I think we can actually drop this now - bi_bdev should only be used
>by the non-passthrough path these days.
Not sure if I am missing something, but this seemed necessary. bi_bdev was
null otherwise.

Did all other changes.






* Re: [PATCH for-next v7 4/5] block: add helper to map bvec iterator for passthrough
  2022-09-22 15:23         ` Kanchan Joshi
@ 2022-09-23 15:29           ` Christoph Hellwig
  2022-09-23 18:43             ` Kanchan Joshi
  0 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2022-09-23 15:29 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Christoph Hellwig, axboe, kbusch, asml.silence, io-uring,
	linux-nvme, linux-block, gost.dev, Anuj Gupta

On Thu, Sep 22, 2022 at 08:53:31PM +0530, Kanchan Joshi wrote:
>> blk_rq_map_user_iov really should be able to detect that it is called
>> on a bvec iter and just do the right thing rather than needing different
>> helpers.
>
> I too explored that possibility, but found that it does not. It maps the
> user-pages into bio either directly or by doing that copy (in certain odd
> conditions) but does not know how to deal with existing bvec.

What do you mean with existing bvec?  We allocate a brand new bio here
that we want to map the next chunk of the iov_iter to, and that
is exactly what blk_rq_map_user_iov does.  What blk_rq_map_user_iov
currently does not do is to implement this mapping efficiently
for ITER_BVEC iters, but that is something that could and should
be fixed.

> And it really felt cleaner to me to write a new function rather than
> overloading blk_rq_map_user_iov with multiple if/else branches.

No.  The whole point of the iov_iter is to support this "overload".

> But iov_iter_gap_alignment does not work on bvec iters. Line #1274 below

So we'll need to fix it.

> 1264 unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
> 1265 {
> 1266         unsigned long res = 0;
> 1267         unsigned long v = 0;
> 1268         size_t size = i->count;
> 1269         unsigned k;
> 1270
> 1271         if (iter_is_ubuf(i))
> 1272                 return 0;
> 1273
> 1274         if (WARN_ON(!iter_is_iovec(i)))
> 1275                 return ~0U;
>
> Do you see a way to overcome this? Or maybe this can be revisited, as we
> are not missing a lot?

We just need to implement the equivalent functionality for bvecs.  It
isn't really hard, it just wasn't required so far.
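
A bvec flavour could be fairly small, something along these lines (an
illustrative sketch only: it ignores iov_offset/count trimming, and the
caller would AND the result with queue_virt_boundary() just like the
iovec case quoted earlier):

static unsigned long bvec_iter_gap_alignment(const struct iov_iter *i)
{
        const struct bio_vec *bvec = i->bvec;
        unsigned long res = 0, v = 0;
        unsigned k;

        if (WARN_ON_ONCE(!iov_iter_is_bvec(i)))
                return ~0UL;

        for (k = 0; k < i->nr_segs; k++) {
                unsigned long base = bvec[k].bv_offset;

                if (!bvec[k].bv_len)
                        continue;
                if (v)          /* gap between previous end and this start */
                        res |= base | v;
                v = base + bvec[k].bv_len;
        }
        return res;
}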


* Re: [PATCH for-next v7 4/5] block: add helper to map bvec iterator for passthrough
  2022-09-23 15:29           ` Christoph Hellwig
@ 2022-09-23 18:43             ` Kanchan Joshi
  2022-09-25 17:46               ` Kanchan Joshi
  2022-09-26 14:50               ` Christoph Hellwig
  0 siblings, 2 replies; 16+ messages in thread
From: Kanchan Joshi @ 2022-09-23 18:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, kbusch, asml.silence, io-uring, linux-nvme, linux-block,
	gost.dev, Anuj Gupta


On Fri, Sep 23, 2022 at 05:29:41PM +0200, Christoph Hellwig wrote:
>On Thu, Sep 22, 2022 at 08:53:31PM +0530, Kanchan Joshi wrote:
>>> blk_rq_map_user_iov really should be able to detect that it is called
>>> on a bvec iter and just do the right thing rather than needing different
>>> helpers.
>>
>> I too explored that possibility, but found that it does not. It maps the
>> user-pages into bio either directly or by doing that copy (in certain odd
>> conditions) but does not know how to deal with existing bvec.
>
>What do you mean with existing bvec?  We allocate a brand new bio here
>that we want to map the next chunk of the iov_iter to, and that
>is exactly what blk_rq_map_user_iov does.  What blk_rq_map_user_iov
>currently does not do is to implement this mapping efficiently
>for ITER_BVEC iters

It is clear that it was not written for ITER_BVEC iters.
Otherwise that WARN_ON would not have hit.

And efficiency is the concern, as we are moving to a more heavyweight
helper that 'handles' weird conditions rather than just 'bails out'.
These alignment checks end up adding a loop that traverses
the entire ITER_BVEC.
Also, blk_rq_map_user_iov uses iov_iter_advance, which also seems
cycle-consuming given the code comment in io_import_fixed():

if (offset) {
        /*
         * Don't use iov_iter_advance() here, as it's really slow for
         * using the latter parts of a big fixed buffer - it iterates
         * over each segment manually. We can cheat a bit here, because
         * we know that:

So if at all I could move the code inside blk_rq_map_user_iov, I will
need to see that I skip doing iov_iter_advance.

I still think it would be better to take this route only when there are
other use cases/callers of this. And that is a future thing. For the current
requirement, it seems better to prioritize efficiency.

>, but that is something that could and should
>be fixed.
>
>> And it really felt cleaner to me write a new function rather than
>> overloading the blk_rq_map_user_iov with multiple if/else canals.
>
>No.  The whole point of the iov_iter is to support this "overload".

Even if I try taking that route, the WARN_ON is a blocker that prevents
me from putting this code inside blk_rq_map_user_iov.

>> But iov_iter_gap_alignment does not work on bvec iters. Line #1274 below
>
>So we'll need to fix it.

Do you see a good way to trigger this virt-alignment condition? I have
not seen this hitting (the SG gap checks) when running with fixedbufs.

>> 1264 unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
>> 1265 {
>> 1266         unsigned long res = 0;
>> 1267         unsigned long v = 0;
>> 1268         size_t size = i->count;
>> 1269         unsigned k;
>> 1270
>> 1271         if (iter_is_ubuf(i))
>> 1272                 return 0;
>> 1273
>> 1274         if (WARN_ON(!iter_is_iovec(i)))
>> 1275                 return ~0U;
>>
>> Do you see a way to overcome this. Or maybe this can be revisted as we
>> are not missing a lot?
>
>We just need to implement the equivalent functionality for bvecs.  It
>isn't really hard, it just wasn't required so far.

Can the virt-boundary alignment gap exist for an ITER_BVEC iter in the
first place? Two reasons to ask this question:

1. Commit description of this code (from Al Viro) says -

"iov_iter_gap_alignment(): get rid of iterate_all_kinds()

For one thing, it's only used for iovec (and makes sense only for
those)."

2. I did not hit it so far as I mentioned above.





* Re: [PATCH for-next v7 4/5] block: add helper to map bvec iterator for passthrough
  2022-09-23 18:43             ` Kanchan Joshi
@ 2022-09-25 17:46               ` Kanchan Joshi
  2022-09-26 14:50               ` Christoph Hellwig
  1 sibling, 0 replies; 16+ messages in thread
From: Kanchan Joshi @ 2022-09-25 17:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, kbusch, asml.silence, io-uring, linux-nvme, linux-block,
	gost.dev, Anuj Gupta


On Sat, Sep 24, 2022 at 12:13:49AM +0530, Kanchan Joshi wrote:
>On Fri, Sep 23, 2022 at 05:29:41PM +0200, Christoph Hellwig wrote:
>>On Thu, Sep 22, 2022 at 08:53:31PM +0530, Kanchan Joshi wrote:
>>>>blk_rq_map_user_iov really should be able to detect that it is called
>>>>on a bvec iter and just do the right thing rather than needing different
>>>>helpers.
>>>
>>>I too explored that possibility, but found that it does not. It maps the
>>>user-pages into bio either directly or by doing that copy (in certain odd
>>>conditions) but does not know how to deal with existing bvec.
>>
>>What do you mean with existing bvec?  We allocate a brand new bio here
>>that we want to map the next chunk of the iov_iter to, and that
>>is exactly what blk_rq_map_user_iov does.  What blk_rq_map_user_iov
>>currently does not do is to implement this mapping efficiently
>>for ITER_BVEC iters
>
>It is clear that it was not written for ITER_BVEC iters.
>Otherwise that WARN_ON would not have hit.
>
>And efficiency is the concern, as we are moving to a more heavyweight
>helper that 'handles' weird conditions rather than just 'bails out'.
>These alignment checks end up adding a loop that traverses
>the entire ITER_BVEC.
>Also, blk_rq_map_user_iov uses iov_iter_advance, which also seems
>cycle-consuming given the code comment in io_import_fixed():
>
>if (offset) {
>       /*
>        * Don't use iov_iter_advance() here, as it's really slow for
>        * using the latter parts of a big fixed buffer - it iterates
>        * over each segment manually. We can cheat a bit here, because
>        * we know that:
>
>So if at all I could move the code inside blk_rq_map_user_iov, I will
>need to see that I skip doing iov_iter_advance.
>
>I still think it would be better to take this route only when there are
>other use cases/callers of this. And that is a future thing. For the current
>requirement, it seems better to prioritize efficiency.
>
>>, but that is something that could and should
>>be fixed.
>>
>>>And it really felt cleaner to me write a new function rather than
>>>overloading the blk_rq_map_user_iov with multiple if/else canals.
>>
>>No.  The whole point of the iov_iter is to support this "overload".
>
>Even if I try taking that route, the WARN_ON is a blocker that prevents
>me from putting this code inside blk_rq_map_user_iov.
>
>>>But iov_iter_gap_alignment does not work on bvec iters. Line #1274 below
>>
>>So we'll need to fix it.
>
>Do you see a good way to trigger this virt-alignment condition? I have
>not seen this hitting (the SG gap checks) when running with fixedbufs.
>
>>>1264 unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
>>>1265 {
>>>1266         unsigned long res = 0;
>>>1267         unsigned long v = 0;
>>>1268         size_t size = i->count;
>>>1269         unsigned k;
>>>1270
>>>1271         if (iter_is_ubuf(i))
>>>1272                 return 0;
>>>1273
>>>1274         if (WARN_ON(!iter_is_iovec(i)))
>>>1275                 return ~0U;
>>>
>>>Do you see a way to overcome this? Or maybe this can be revisited, as we
>>>are not missing a lot?
>>
>>We just need to implement the equivalent functionality for bvecs.  It
>>isn't really hard, it just wasn't required so far.
>
>Can the virt-boundary alignment gap exist for an ITER_BVEC iter in the
>first place? Two reasons to ask this question:
>
>1. Commit description of this code (from Al Viro) says -
>
>"iov_iter_gap_alignment(): get rid of iterate_all_kinds()
>
>For one thing, it's only used for iovec (and makes sense only for
>those)."
>
>2. I did not hit it so far as I mentioned above.

And we also have the below check (a patch from Linus) that restricts
blk_rq_map_user_iov to iovec iterators only:

commit a0ac402cfcdc904f9772e1762b3fda112dcc56a0
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Tue Dec 6 16:18:14 2016 -0800

    Don't feed anything but regular iovec's to blk_rq_map_user_iov

    In theory we could map other things, but there's a reason that function
    is called "user_iov".  Using anything else (like splice can do) just
    confuses it.

    Reported-and-tested-by: Johannes Thumshirn <jthumshirn@suse.de>
    Cc: Al Viro <viro@ZenIV.linux.org.uk>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/block/blk-map.c b/block/blk-map.c
index b8657fa8dc9a..27fd8d92892d 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -118,6 +118,9 @@ int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,
        struct iov_iter i;
        int ret;

+       if (!iter_is_iovec(iter))
+               goto fail;
+
        if (map_data)
                copy = true;
        else if (iov_iter_alignment(iter) & align)
@@ -140,6 +143,7 @@ int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,

 unmap_rq:
        __blk_rq_unmap_user(bio);
+fail:
        rq->bio = NULL;
        return -EINVAL;
 }






* Re: [PATCH for-next v7 4/5] block: add helper to map bvec iterator for passthrough
  2022-09-23 18:43             ` Kanchan Joshi
  2022-09-25 17:46               ` Kanchan Joshi
@ 2022-09-26 14:50               ` Christoph Hellwig
  2022-09-27 16:47                 ` Kanchan Joshi
  1 sibling, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2022-09-26 14:50 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Christoph Hellwig, axboe, kbusch, asml.silence, io-uring,
	linux-nvme, linux-block, gost.dev, Anuj Gupta

On Sat, Sep 24, 2022 at 12:13:49AM +0530, Kanchan Joshi wrote:
> And efficiency is the concern, as we are moving to a more heavyweight
> helper that 'handles' weird conditions rather than just 'bails out'.
> These alignment checks end up adding a loop that traverses
> the entire ITER_BVEC.
> Also, blk_rq_map_user_iov uses iov_iter_advance, which also seems
> cycle-consuming given the code comment in io_import_fixed():

No one says you should use the existing loop in blk_rq_map_user_iov.
Just make it call your new helper early on when an ITER_BVEC iter is
passed in.
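
Roughly like this inside blk-map.c, as an illustrative sketch of the
dispatch only (not the final patch):

static int blk_rq_map_iter(struct request *rq, struct rq_map_data *map_data,
                struct iov_iter *iter, gfp_t gfp_mask, bool copy)
{
        /* pre-mapped bvecs: take the new fast path, no page pinning needed */
        if (!copy && iov_iter_is_bvec(iter))
                return blk_rq_map_user_bvec(rq, iter);
        if (copy)
                return bio_copy_user_iov(rq, map_data, iter, gfp_mask);
        return bio_map_user_iov(rq, iter, gfp_mask);
}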

> Do you see a good way to trigger this virt-alignment condition? I have
> not seen this hitting (the SG gap checks) when running with fixedbufs.

You'd need to make sure the iovec passed to the fixed buffer
registration is chunked up smaller than the nvme page size.

E.g. if you pass lots of non-contiguous 512 byte sized iovecs to the
buffer registration.

>> We just need to implement the equivalent functionality for bvecs.  It
>> isn't really hard, it just wasn't required so far.
>
> Can the virt-boundary alignment gap exist for an ITER_BVEC iter in the
> first place?

Yes.  bvecs are just a way to represent data.  If the individual
segments don't fit the virt boundary you still need to deal with it.


* Re: [PATCH for-next v7 4/5] block: add helper to map bvec iterator for passthrough
  2022-09-26 14:50               ` Christoph Hellwig
@ 2022-09-27 16:47                 ` Kanchan Joshi
  0 siblings, 0 replies; 16+ messages in thread
From: Kanchan Joshi @ 2022-09-27 16:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, kbusch, asml.silence, io-uring, linux-nvme, linux-block,
	gost.dev, Anuj Gupta


On Mon, Sep 26, 2022 at 04:50:40PM +0200, Christoph Hellwig wrote:
>On Sat, Sep 24, 2022 at 12:13:49AM +0530, Kanchan Joshi wrote:
>> And efficiency is the concern, as we are moving to a more heavyweight
>> helper that 'handles' weird conditions rather than just 'bails out'.
>> These alignment checks end up adding a loop that traverses
>> the entire ITER_BVEC.
>> Also, blk_rq_map_user_iov uses iov_iter_advance, which also seems
>> cycle-consuming given the code comment in io_import_fixed():
>
>No one says you should use the existing loop in blk_rq_map_user_iov.
>Just make it call your new helper early on when a ITER_BVEC iter is
>passed in.

Indeed. I will send the v10 with that approach.

>> Do you see a good way to trigger this virt-alignment condition? I have
>> not seen this hitting (the SG gap checks) when running with fixedbufs.
>
>You'd need to make sure the iovec passed to the fixed buffer
>registration is chunked up smaller than the nvme page size.
>
>E.g. if you pass lots of non-contiguous 512 byte sized iovecs to the
>buffer registration.
>
>>> We just need to implement the equivalent functionality for bvecs.  It
>>> isn't really hard, it just wasn't required so far.
>>
>> Can the virt-boundary alignment gap exist for an ITER_BVEC iter in the
>> first place?
>
>Yes.  bvecs are just a way to represent data.  If the individual
>segments don't fit the virt boundary you still need to deal with it.

Thanks for clearing this up.





end of thread

Thread overview: 16+ messages
     [not found] <CGME20220909103131epcas5p23d146916eccedf30d498e0ea23e54052@epcas5p2.samsung.com>
2022-09-09 10:21 ` [PATCH for-next v7 0/5] fixed-buffer for uring-cmd/passthru Kanchan Joshi
     [not found]   ` <CGME20220909103136epcas5p38ea3a933e90d9f9d7451848dc3a60829@epcas5p3.samsung.com>
2022-09-09 10:21     ` [PATCH for-next v7 1/5] io_uring: add io_uring_cmd_import_fixed Kanchan Joshi
     [not found]   ` <CGME20220909103140epcas5p36689726422eb68e6fdc1d39019a4a8ba@epcas5p3.samsung.com>
2022-09-09 10:21     ` [PATCH for-next v7 2/5] io_uring: introduce fixed buffer support for io_uring_cmd Kanchan Joshi
     [not found]   ` <CGME20220909103143epcas5p2eda60190cd23b79fb8f48596af3e1524@epcas5p2.samsung.com>
2022-09-09 10:21     ` [PATCH for-next v7 3/5] nvme: refactor nvme_alloc_user_request Kanchan Joshi
2022-09-20 12:02       ` Christoph Hellwig
2022-09-22 15:46         ` Kanchan Joshi
2022-09-23  9:25         ` Kanchan Joshi
     [not found]   ` <CGME20220909103147epcas5p2a83ec151333bcb1d2abb8c7536789bfd@epcas5p2.samsung.com>
2022-09-09 10:21     ` [PATCH for-next v7 4/5] block: add helper to map bvec iterator for passthrough Kanchan Joshi
2022-09-20 12:08       ` Christoph Hellwig
2022-09-22 15:23         ` Kanchan Joshi
2022-09-23 15:29           ` Christoph Hellwig
2022-09-23 18:43             ` Kanchan Joshi
2022-09-25 17:46               ` Kanchan Joshi
2022-09-26 14:50               ` Christoph Hellwig
2022-09-27 16:47                 ` Kanchan Joshi
     [not found]   ` <CGME20220909103151epcas5p1e25127c3053ba21e8f8418a701878973@epcas5p1.samsung.com>
2022-09-09 10:21     ` [PATCH for-next v7 5/5] nvme: wire up fixed buffer support for nvme passthrough Kanchan Joshi
