linux-nvme.lists.infradead.org archive mirror
* [PATCHSET v4 0/8] io_uring passthrough support
@ 2021-03-17 22:10 Jens Axboe
  2021-03-17 22:10 ` [PATCH 1/8] io_uring: split up io_uring_sqe into hdr + main Jens Axboe
                   ` (7 more replies)
  0 siblings, 8 replies; 25+ messages in thread
From: Jens Axboe @ 2021-03-17 22:10 UTC
  To: io-uring; +Cc: joshi.k, hch, kbusch, linux-nvme, metze

Hi,

I fiddled a bit with the v3 repo, and came up with what I think is a
better solution. Basically we split the io_uring_sqe into a header part,
and then a main part. io_uring_sqe remains the same, obviously, but
io_uring_cmd_sqe is then the sqe for these kinds of passthrough payloads.

In turn, consumers of that can then overlay on io_uring_cmd_sqe. Since
I think we need the personality in there, we may as well add op and len
as most/all will want that too. That leaves 40 bytes that can be used
freely. That may not seem like much, but remember that's 40 bytes outside
of the fd, len, and command op.
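
As a rough illustration of the space budget described above, here is a
minimal userspace sketch of the layout. It mirrors the struct fields
from patch 2/8 using <stdint.h> types in place of the kernel's __u8 /
__u16 typedefs. The names sqe_hdr, cmd_sqe, example_pdu and
cmd_sqe_free_bytes() are made up for this example and are not part of
the series; example_pdu in particular is a hypothetical consumer
payload.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror of io_uring_sqe_hdr: the 8 bytes shared by both sqe formats. */
struct sqe_hdr {
	uint8_t  opcode;
	uint8_t  flags;
	uint16_t ioprio;
	int32_t  fd;
};

/* Mirror of io_uring_cmd_sqe as laid out in patch 2/8. */
struct cmd_sqe {
	struct sqe_hdr hdr;		/* bytes  0..7  */
	uint64_t       user_data;	/* bytes  8..15 */
	uint16_t       op;		/* bytes 16..17 */
	uint16_t       personality;	/* bytes 18..19 */
	uint32_t       len;		/* bytes 20..23 */
	uint64_t       pdu[5];		/* bytes 24..63, free for the consumer */
};

/* Hypothetical consumer payload that would be overlaid on pdu[]. */
struct example_pdu {
	uint64_t addr;
	uint64_t lba;
	uint32_t nblocks;
	uint32_t flags;
	uint64_t reserved[2];
};

/* The whole entry must stay at 64 bytes, like io_uring_sqe. */
static_assert(sizeof(struct cmd_sqe) == 64, "cmd sqe must be 64 bytes");
static_assert(offsetof(struct cmd_sqe, pdu) == 24, "pdu starts at byte 24");
static_assert(sizeof(struct example_pdu) <= sizeof(((struct cmd_sqe *)0)->pdu),
	      "consumer payload must fit in the inline pdu area");

/* Number of bytes a consumer can use freely. */
static size_t cmd_sqe_free_bytes(void)
{
	return sizeof(((struct cmd_sqe *)0)->pdu);
}
```

With this layout, cmd_sqe_free_bytes() evaluates to the 40 bytes
mentioned above, and the static asserts fail at compile time if a
consumer payload grows past that.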

I updated and tested the block ioctl example, but didn't update the
net side outside of needing a tweak on the net command. Outside of that,
it should work like before.

I'd be interested in feedback on this approach. My main goal is to make
this flexible enough to be useful, but also fast enough to be useful.
That means no extra allocations if at all avoidable, and even being wary
of adding extra branches to the io_uring hot path. With this series, we
don't do the nasty split in io_init_req() anymore, which I really
disliked in the previous series.

This is by no means perfect yet, but I do think it's better than v3 by
quite a lot. So please send feedback and comments, I'd like to get this
moving forward as we have various folks already lined up to use it...

Kanchan, can you try and address the NVMe feedback and rebase on top
of this branch? Thanks!

You can also find this branch here:

https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops.v4

 block/blk-mq.c                |  11 +++
 fs/block_dev.c                |  30 ++++++
 fs/io_uring.c                 | 181 ++++++++++++++++++++++++----------
 include/linux/blk-mq.h        |   6 ++
 include/linux/blkdev.h        |  13 +++
 include/linux/fs.h            |  11 +++
 include/linux/io_uring.h      |  16 +++
 include/linux/net.h           |   2 +
 include/net/raw.h             |   3 +
 include/net/sock.h            |   6 ++
 include/net/tcp.h             |   2 +
 include/net/udp.h             |   2 +
 include/uapi/linux/io_uring.h |  21 ++++
 include/uapi/linux/net.h      |  17 ++++
 net/core/sock.c               |  17 +++-
 net/dccp/ipv4.c               |   1 +
 net/ipv4/af_inet.c            |   3 +
 net/ipv4/raw.c                |  27 +++++
 net/ipv4/tcp.c                |  36 +++++++
 net/ipv4/tcp_ipv4.c           |   1 +
 net/ipv4/udp.c                |  18 ++++
 net/ipv6/raw.c                |   1 +
 net/ipv6/tcp_ipv6.c           |   1 +
 net/ipv6/udp.c                |   1 +
 net/l2tp/l2tp_ip.c            |   1 +
 net/mptcp/protocol.c          |   1 +
 net/sctp/protocol.c           |   1 +
 net/socket.c                  |  13 +++
 28 files changed, 391 insertions(+), 52 deletions(-)

-- 
Jens Axboe




_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme


* [PATCH 1/8] io_uring: split up io_uring_sqe into hdr + main
  2021-03-17 22:10 [PATCHSET v4 0/8] io_uring passthrough support Jens Axboe
@ 2021-03-17 22:10 ` Jens Axboe
  2021-03-18  5:34   ` Christoph Hellwig
  2021-03-17 22:10 ` [PATCH 2/8] io_uring: add infrastructure around io_uring_cmd_sqe issue type Jens Axboe
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Jens Axboe @ 2021-03-17 22:10 UTC
  To: io-uring; +Cc: joshi.k, hch, kbusch, linux-nvme, metze, Jens Axboe

In preparation for overlaying passthrough commands on the io_uring_sqe
struct, split out the header part as we'll be reusing that for the
new format as well.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 73 ++++++++++++++++++-----------------
 include/uapi/linux/io_uring.h | 11 ++++++
 2 files changed, 48 insertions(+), 36 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5538568f24e9..416e47832468 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2722,7 +2722,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if ((kiocb->ki_flags & IOCB_NOWAIT) || (file->f_flags & O_NONBLOCK))
 		req->flags |= REQ_F_NOWAIT;
 
-	ioprio = READ_ONCE(sqe->ioprio);
+	ioprio = READ_ONCE(sqe->hdr.ioprio);
 	if (ioprio) {
 		ret = ioprio_check_cap(ioprio);
 		if (ret)
@@ -3467,7 +3467,7 @@ static int io_renameat_prep(struct io_kiocb *req,
 	if (unlikely(req->flags & REQ_F_FIXED_FILE))
 		return -EBADF;
 
-	ren->old_dfd = READ_ONCE(sqe->fd);
+	ren->old_dfd = READ_ONCE(sqe->hdr.fd);
 	oldf = u64_to_user_ptr(READ_ONCE(sqe->addr));
 	newf = u64_to_user_ptr(READ_ONCE(sqe->addr2));
 	ren->new_dfd = READ_ONCE(sqe->len);
@@ -3514,7 +3514,7 @@ static int io_unlinkat_prep(struct io_kiocb *req,
 	if (unlikely(req->flags & REQ_F_FIXED_FILE))
 		return -EBADF;
 
-	un->dfd = READ_ONCE(sqe->fd);
+	un->dfd = READ_ONCE(sqe->hdr.fd);
 
 	un->flags = READ_ONCE(sqe->unlink_flags);
 	if (un->flags & ~AT_REMOVEDIR)
@@ -3555,7 +3555,7 @@ static int io_shutdown_prep(struct io_kiocb *req,
 #if defined(CONFIG_NET)
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
-	if (sqe->ioprio || sqe->off || sqe->addr || sqe->rw_flags ||
+	if (sqe->hdr.ioprio || sqe->off || sqe->addr || sqe->rw_flags ||
 	    sqe->buf_index)
 		return -EINVAL;
 
@@ -3711,7 +3711,7 @@ static int io_fsync_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 
 	if (unlikely(ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
-	if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index))
+	if (unlikely(sqe->addr || sqe->hdr.ioprio || sqe->buf_index))
 		return -EINVAL;
 
 	req->sync.flags = READ_ONCE(sqe->fsync_flags);
@@ -3744,7 +3744,7 @@ static int io_fsync(struct io_kiocb *req, unsigned int issue_flags)
 static int io_fallocate_prep(struct io_kiocb *req,
 			     const struct io_uring_sqe *sqe)
 {
-	if (sqe->ioprio || sqe->buf_index || sqe->rw_flags)
+	if (sqe->hdr.ioprio || sqe->buf_index || sqe->rw_flags)
 		return -EINVAL;
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
@@ -3775,7 +3775,7 @@ static int __io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe
 	const char __user *fname;
 	int ret;
 
-	if (unlikely(sqe->ioprio || sqe->buf_index))
+	if (unlikely(sqe->hdr.ioprio || sqe->buf_index))
 		return -EINVAL;
 	if (unlikely(req->flags & REQ_F_FIXED_FILE))
 		return -EBADF;
@@ -3784,7 +3784,7 @@ static int __io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe
 	if (!(req->open.how.flags & O_PATH) && force_o_largefile())
 		req->open.how.flags |= O_LARGEFILE;
 
-	req->open.dfd = READ_ONCE(sqe->fd);
+	req->open.dfd = READ_ONCE(sqe->hdr.fd);
 	fname = u64_to_user_ptr(READ_ONCE(sqe->addr));
 	req->open.filename = getname(fname);
 	if (IS_ERR(req->open.filename)) {
@@ -3900,10 +3900,10 @@ static int io_remove_buffers_prep(struct io_kiocb *req,
 	struct io_provide_buf *p = &req->pbuf;
 	u64 tmp;
 
-	if (sqe->ioprio || sqe->rw_flags || sqe->addr || sqe->len || sqe->off)
+	if (sqe->hdr.ioprio || sqe->rw_flags || sqe->addr || sqe->len || sqe->off)
 		return -EINVAL;
 
-	tmp = READ_ONCE(sqe->fd);
+	tmp = READ_ONCE(sqe->hdr.fd);
 	if (!tmp || tmp > USHRT_MAX)
 		return -EINVAL;
 
@@ -3970,10 +3970,10 @@ static int io_provide_buffers_prep(struct io_kiocb *req,
 	struct io_provide_buf *p = &req->pbuf;
 	u64 tmp;
 
-	if (sqe->ioprio || sqe->rw_flags)
+	if (sqe->hdr.ioprio || sqe->rw_flags)
 		return -EINVAL;
 
-	tmp = READ_ONCE(sqe->fd);
+	tmp = READ_ONCE(sqe->hdr.fd);
 	if (!tmp || tmp > USHRT_MAX)
 		return -E2BIG;
 	p->nbufs = tmp;
@@ -4050,12 +4050,12 @@ static int io_epoll_ctl_prep(struct io_kiocb *req,
 			     const struct io_uring_sqe *sqe)
 {
 #if defined(CONFIG_EPOLL)
-	if (sqe->ioprio || sqe->buf_index)
+	if (sqe->hdr.ioprio || sqe->buf_index)
 		return -EINVAL;
 	if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL)))
 		return -EINVAL;
 
-	req->epoll.epfd = READ_ONCE(sqe->fd);
+	req->epoll.epfd = READ_ONCE(sqe->hdr.fd);
 	req->epoll.op = READ_ONCE(sqe->len);
 	req->epoll.fd = READ_ONCE(sqe->off);
 
@@ -4096,7 +4096,7 @@ static int io_epoll_ctl(struct io_kiocb *req, unsigned int issue_flags)
 static int io_madvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 #if defined(CONFIG_ADVISE_SYSCALLS) && defined(CONFIG_MMU)
-	if (sqe->ioprio || sqe->buf_index || sqe->off)
+	if (sqe->hdr.ioprio || sqe->buf_index || sqe->off)
 		return -EINVAL;
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
@@ -4131,7 +4131,7 @@ static int io_madvise(struct io_kiocb *req, unsigned int issue_flags)
 
 static int io_fadvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
-	if (sqe->ioprio || sqe->buf_index || sqe->addr)
+	if (sqe->hdr.ioprio || sqe->buf_index || sqe->addr)
 		return -EINVAL;
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
@@ -4169,12 +4169,12 @@ static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL)))
 		return -EINVAL;
-	if (sqe->ioprio || sqe->buf_index)
+	if (sqe->hdr.ioprio || sqe->buf_index)
 		return -EINVAL;
 	if (req->flags & REQ_F_FIXED_FILE)
 		return -EBADF;
 
-	req->statx.dfd = READ_ONCE(sqe->fd);
+	req->statx.dfd = READ_ONCE(sqe->hdr.fd);
 	req->statx.mask = READ_ONCE(sqe->len);
 	req->statx.filename = u64_to_user_ptr(READ_ONCE(sqe->addr));
 	req->statx.buffer = u64_to_user_ptr(READ_ONCE(sqe->addr2));
@@ -4208,13 +4208,13 @@ static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
-	if (sqe->ioprio || sqe->off || sqe->addr || sqe->len ||
+	if (sqe->hdr.ioprio || sqe->off || sqe->addr || sqe->len ||
 	    sqe->rw_flags || sqe->buf_index)
 		return -EINVAL;
 	if (req->flags & REQ_F_FIXED_FILE)
 		return -EBADF;
 
-	req->close.fd = READ_ONCE(sqe->fd);
+	req->close.fd = READ_ONCE(sqe->hdr.fd);
 	return 0;
 }
 
@@ -4277,7 +4277,7 @@ static int io_sfr_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 
 	if (unlikely(ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
-	if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index))
+	if (unlikely(sqe->addr || sqe->hdr.ioprio || sqe->buf_index))
 		return -EINVAL;
 
 	req->sync.off = READ_ONCE(sqe->off);
@@ -4698,7 +4698,7 @@ static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
-	if (sqe->ioprio || sqe->len || sqe->buf_index)
+	if (sqe->hdr.ioprio || sqe->len || sqe->buf_index)
 		return -EINVAL;
 
 	accept->addr = u64_to_user_ptr(READ_ONCE(sqe->addr));
@@ -4746,7 +4746,7 @@ static int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
-	if (sqe->ioprio || sqe->len || sqe->buf_index || sqe->rw_flags)
+	if (sqe->hdr.ioprio || sqe->len || sqe->buf_index || sqe->rw_flags)
 		return -EINVAL;
 
 	conn->addr = u64_to_user_ptr(READ_ONCE(sqe->addr));
@@ -5290,7 +5290,7 @@ static int io_poll_remove_prep(struct io_kiocb *req,
 {
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
-	if (sqe->ioprio || sqe->off || sqe->len || sqe->buf_index ||
+	if (sqe->hdr.ioprio || sqe->off || sqe->len || sqe->buf_index ||
 	    sqe->poll_events)
 		return -EINVAL;
 
@@ -5341,7 +5341,7 @@ static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe
 
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
-	if (sqe->addr || sqe->ioprio || sqe->off || sqe->len || sqe->buf_index)
+	if (sqe->addr || sqe->hdr.ioprio || sqe->off || sqe->len || sqe->buf_index)
 		return -EINVAL;
 
 	events = READ_ONCE(sqe->poll32_events);
@@ -5466,7 +5466,7 @@ static int io_timeout_remove_prep(struct io_kiocb *req,
 		return -EINVAL;
 	if (unlikely(req->flags & (REQ_F_FIXED_FILE | REQ_F_BUFFER_SELECT)))
 		return -EINVAL;
-	if (sqe->ioprio || sqe->buf_index || sqe->len)
+	if (sqe->hdr.ioprio || sqe->buf_index || sqe->len)
 		return -EINVAL;
 
 	tr->addr = READ_ONCE(sqe->addr);
@@ -5525,7 +5525,7 @@ static int io_timeout_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
-	if (sqe->ioprio || sqe->buf_index || sqe->len != 1)
+	if (sqe->hdr.ioprio || sqe->buf_index || sqe->len != 1)
 		return -EINVAL;
 	if (off && is_timeout_link)
 		return -EINVAL;
@@ -5677,7 +5677,7 @@ static int io_async_cancel_prep(struct io_kiocb *req,
 		return -EINVAL;
 	if (unlikely(req->flags & (REQ_F_FIXED_FILE | REQ_F_BUFFER_SELECT)))
 		return -EINVAL;
-	if (sqe->ioprio || sqe->off || sqe->len || sqe->cancel_flags)
+	if (sqe->hdr.ioprio || sqe->off || sqe->len || sqe->cancel_flags)
 		return -EINVAL;
 
 	req->cancel.addr = READ_ONCE(sqe->addr);
@@ -5738,7 +5738,7 @@ static int io_rsrc_update_prep(struct io_kiocb *req,
 		return -EINVAL;
 	if (unlikely(req->flags & (REQ_F_FIXED_FILE | REQ_F_BUFFER_SELECT)))
 		return -EINVAL;
-	if (sqe->ioprio || sqe->rw_flags)
+	if (sqe->hdr.ioprio || sqe->rw_flags)
 		return -EINVAL;
 
 	req->rsrc_update.offset = READ_ONCE(sqe->off);
@@ -6390,9 +6390,9 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	unsigned int sqe_flags;
 	int personality, ret = 0;
 
-	req->opcode = READ_ONCE(sqe->opcode);
+	req->opcode = READ_ONCE(sqe->hdr.opcode);
 	/* same numerical values with corresponding REQ_F_*, safe to copy */
-	req->flags = sqe_flags = READ_ONCE(sqe->flags);
+	req->flags = sqe_flags = READ_ONCE(sqe->hdr.flags);
 	req->user_data = READ_ONCE(sqe->user_data);
 	req->async_data = NULL;
 	req->file = NULL;
@@ -6445,7 +6445,8 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	if (io_op_defs[req->opcode].needs_file) {
 		bool fixed = req->flags & REQ_F_FIXED_FILE;
 
-		req->file = io_file_get(state, req, READ_ONCE(sqe->fd), fixed);
+		req->file = io_file_get(state, req, READ_ONCE(sqe->hdr.fd),
+					fixed);
 		if (unlikely(!req->file))
 			ret = -EBADF;
 	}
@@ -9914,10 +9915,10 @@ static int __init io_uring_init(void)
 #define BUILD_BUG_SQE_ELEM(eoffset, etype, ename) \
 	__BUILD_BUG_VERIFY_ELEMENT(struct io_uring_sqe, eoffset, etype, ename)
 	BUILD_BUG_ON(sizeof(struct io_uring_sqe) != 64);
-	BUILD_BUG_SQE_ELEM(0,  __u8,   opcode);
-	BUILD_BUG_SQE_ELEM(1,  __u8,   flags);
-	BUILD_BUG_SQE_ELEM(2,  __u16,  ioprio);
-	BUILD_BUG_SQE_ELEM(4,  __s32,  fd);
+	BUILD_BUG_SQE_ELEM(0,  __u8,   hdr.opcode);
+	BUILD_BUG_SQE_ELEM(1,  __u8,   hdr.flags);
+	BUILD_BUG_SQE_ELEM(2,  __u16,  hdr.ioprio);
+	BUILD_BUG_SQE_ELEM(4,  __s32,  hdr.fd);
 	BUILD_BUG_SQE_ELEM(8,  __u64,  off);
 	BUILD_BUG_SQE_ELEM(8,  __u64,  addr2);
 	BUILD_BUG_SQE_ELEM(16, __u64,  addr);
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 2514eb6b1cf2..5609474ccd9f 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -14,11 +14,22 @@
 /*
  * IO submission data structure (Submission Queue Entry)
  */
+struct io_uring_sqe_hdr {
+	__u8	opcode;		/* type of operation for this sqe */
+	__u8	flags;		/* IOSQE_ flags */
+	__u16	ioprio;		/* ioprio for the request */
+	__s32	fd;		/* file descriptor to do IO on */
+};
+
 struct io_uring_sqe {
+#ifdef __KERNEL__
+	struct io_uring_sqe_hdr	hdr;
+#else
 	__u8	opcode;		/* type of operation for this sqe */
 	__u8	flags;		/* IOSQE_ flags */
 	__u16	ioprio;		/* ioprio for the request */
 	__s32	fd;		/* file descriptor to do IO on */
+#endif
 	union {
 		__u64	off;	/* offset into file */
 		__u64	addr2;
-- 
2.31.0
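
The reason patch 1/8 can be free of functional changes is that the
kernel-side hdr member occupies exactly the same bytes as the four flat
fields that userspace keeps seeing. A minimal userspace sketch of that
equivalence follows. The struct names sqe_user_view and sqe_kernel_view
and the helper sqe_views_match() are invented for this example, and the
remaining 56 bytes of the sqe are abbreviated to an opaque array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror of io_uring_sqe_hdr from the patch above. */
struct sqe_hdr {
	uint8_t  opcode;
	uint8_t  flags;
	uint16_t ioprio;
	int32_t  fd;
};

/* Userspace view (the #else branch): four flat fields. */
struct sqe_user_view {
	uint8_t  opcode;
	uint8_t  flags;
	uint16_t ioprio;
	int32_t  fd;
	uint64_t rest[7];	/* remaining 56 bytes, abbreviated */
};

/* Kernel view (the __KERNEL__ branch): the same bytes behind hdr. */
struct sqe_kernel_view {
	struct sqe_hdr hdr;
	uint64_t       rest[7];	/* remaining 56 bytes, abbreviated */
};

/* Returns 1 if both views agree on size and on every header offset. */
static int sqe_views_match(void)
{
	return sizeof(struct sqe_user_view) == sizeof(struct sqe_kernel_view) &&
	       offsetof(struct sqe_user_view, opcode) ==
			offsetof(struct sqe_kernel_view, hdr.opcode) &&
	       offsetof(struct sqe_user_view, flags) ==
			offsetof(struct sqe_kernel_view, hdr.flags) &&
	       offsetof(struct sqe_user_view, ioprio) ==
			offsetof(struct sqe_kernel_view, hdr.ioprio) &&
	       offsetof(struct sqe_user_view, fd) ==
			offsetof(struct sqe_kernel_view, hdr.fd) &&
	       offsetof(struct sqe_user_view, rest) ==
			offsetof(struct sqe_kernel_view, rest);
}
```

This is the same property the BUILD_BUG_SQE_ELEM() lines in the patch
check at build time for the real struct.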



* [PATCH 2/8] io_uring: add infrastructure around io_uring_cmd_sqe issue type
  2021-03-17 22:10 [PATCHSET v4 0/8] io_uring passthrough support Jens Axboe
  2021-03-17 22:10 ` [PATCH 1/8] io_uring: split up io_uring_sqe into hdr + main Jens Axboe
@ 2021-03-17 22:10 ` Jens Axboe
  2021-03-17 22:10 ` [PATCH 3/8] fs: add file_operations->uring_cmd() Jens Axboe
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Jens Axboe @ 2021-03-17 22:10 UTC
  To: io-uring; +Cc: joshi.k, hch, kbusch, linux-nvme, metze, Jens Axboe

Define an io_uring_cmd_sqe struct that passthrough commands can use,
and define an array that has offset information for the two members
that we care about (user_data and personality). Then we can init the
two command types in basically the same way, just reading the user_data
and personality at the defined offsets for the command type.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 57 +++++++++++++++++++++++++++--------
 include/uapi/linux/io_uring.h | 10 ++++++
 2 files changed, 54 insertions(+), 13 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 416e47832468..a4699b066172 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -824,6 +824,22 @@ struct io_defer_entry {
 	u32			seq;
 };
 
+struct sqe_offset {
+	unsigned char		user_data;
+	unsigned char		personality;
+};
+
+static struct sqe_offset sqe_offsets[] = {
+	{
+		.user_data	= offsetof(struct io_uring_sqe, user_data),
+		.personality	= offsetof(struct io_uring_sqe, personality)
+	},
+	{
+		.user_data	= offsetof(struct io_uring_cmd_sqe, user_data),
+		.personality	= offsetof(struct io_uring_cmd_sqe, personality)
+	}
+};
+
 struct io_op_def {
 	/* needs req->file assigned */
 	unsigned		needs_file : 1;
@@ -844,6 +860,8 @@ struct io_op_def {
 	unsigned		plug : 1;
 	/* size of async data needed, if any */
 	unsigned short		async_size;
+	/* offset definition for user_data/personality */
+	unsigned short		offsets;
 };
 
 static const struct io_op_def io_op_defs[] = {
@@ -988,6 +1006,9 @@ static const struct io_op_def io_op_defs[] = {
 	},
 	[IORING_OP_RENAMEAT] = {},
 	[IORING_OP_UNLINKAT] = {},
+	[IORING_OP_URING_CMD] = {
+		.offsets		= 1,
+	},
 };
 
 static bool io_disarm_next(struct io_kiocb *req);
@@ -6384,16 +6405,21 @@ static inline bool io_check_restriction(struct io_ring_ctx *ctx,
 }
 
 static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
-		       const struct io_uring_sqe *sqe)
+		       const struct io_uring_sqe_hdr *hdr)
 {
 	struct io_submit_state *state;
+	const struct io_op_def *def;
 	unsigned int sqe_flags;
+	const __u64 *uptr;
+	const __u16 *pptr;
 	int personality, ret = 0;
 
-	req->opcode = READ_ONCE(sqe->hdr.opcode);
+	req->opcode = READ_ONCE(hdr->opcode);
+	def = &io_op_defs[req->opcode];
 	/* same numerical values with corresponding REQ_F_*, safe to copy */
-	req->flags = sqe_flags = READ_ONCE(sqe->hdr.flags);
-	req->user_data = READ_ONCE(sqe->user_data);
+	req->flags = sqe_flags = READ_ONCE(hdr->flags);
+	uptr = (const void *) hdr + sqe_offsets[def->offsets].user_data;
+	req->user_data = READ_ONCE(*uptr);
 	req->async_data = NULL;
 	req->file = NULL;
 	req->ctx = ctx;
@@ -6419,11 +6445,11 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	if (unlikely(!io_check_restriction(ctx, req, sqe_flags)))
 		return -EACCES;
 
-	if ((sqe_flags & IOSQE_BUFFER_SELECT) &&
-	    !io_op_defs[req->opcode].buffer_select)
+	if ((sqe_flags & IOSQE_BUFFER_SELECT) && !def->buffer_select)
 		return -EOPNOTSUPP;
 
-	personality = READ_ONCE(sqe->personality);
+	pptr = (const void *) hdr + sqe_offsets[def->offsets].personality;
+	personality = READ_ONCE(*pptr);
 	if (personality) {
 		req->work.creds = xa_load(&ctx->personalities, personality);
 		if (!req->work.creds)
@@ -6436,17 +6462,15 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	 * Plug now if we have more than 1 IO left after this, and the target
 	 * is potentially a read/write to block based storage.
 	 */
-	if (!state->plug_started && state->ios_left > 1 &&
-	    io_op_defs[req->opcode].plug) {
+	if (!state->plug_started && state->ios_left > 1 && def->plug) {
 		blk_start_plug(&state->plug);
 		state->plug_started = true;
 	}
 
-	if (io_op_defs[req->opcode].needs_file) {
+	if (def->needs_file) {
 		bool fixed = req->flags & REQ_F_FIXED_FILE;
 
-		req->file = io_file_get(state, req, READ_ONCE(sqe->hdr.fd),
-					fixed);
+		req->file = io_file_get(state, req, READ_ONCE(hdr->fd), fixed);
 		if (unlikely(!req->file))
 			ret = -EBADF;
 	}
@@ -6461,7 +6485,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	struct io_submit_link *link = &ctx->submit_state.link;
 	int ret;
 
-	ret = io_init_req(ctx, req, sqe);
+	ret = io_init_req(ctx, req, &sqe->hdr);
 	if (unlikely(ret)) {
 fail_req:
 		io_req_complete_failed(req, ret);
@@ -9915,6 +9939,7 @@ static int __init io_uring_init(void)
 #define BUILD_BUG_SQE_ELEM(eoffset, etype, ename) \
 	__BUILD_BUG_VERIFY_ELEMENT(struct io_uring_sqe, eoffset, etype, ename)
 	BUILD_BUG_ON(sizeof(struct io_uring_sqe) != 64);
+	BUILD_BUG_ON(sizeof(struct io_uring_cmd_sqe) != 64);
 	BUILD_BUG_SQE_ELEM(0,  __u8,   hdr.opcode);
 	BUILD_BUG_SQE_ELEM(1,  __u8,   hdr.flags);
 	BUILD_BUG_SQE_ELEM(2,  __u16,  hdr.ioprio);
@@ -9943,6 +9968,12 @@ static int __init io_uring_init(void)
 	BUILD_BUG_SQE_ELEM(40, __u16,  buf_index);
 	BUILD_BUG_SQE_ELEM(42, __u16,  personality);
 	BUILD_BUG_SQE_ELEM(44, __s32,  splice_fd_in);
+#define BUILD_BUG_SQEC_ELEM(eoffset, etype, ename) \
+	__BUILD_BUG_VERIFY_ELEMENT(struct io_uring_cmd_sqe, eoffset, etype, ename)
+	BUILD_BUG_SQEC_ELEM(8,				__u64,	user_data);
+	BUILD_BUG_SQEC_ELEM(18,				__u16,	personality);
+	BUILD_BUG_SQEC_ELEM(sqe_offsets[1].user_data,	__u64,	user_data);
+	BUILD_BUG_SQEC_ELEM(sqe_offsets[1].personality,	__u16,	personality);
 
 	BUILD_BUG_ON(ARRAY_SIZE(io_op_defs) != IORING_OP_LAST);
 	BUILD_BUG_ON(__REQ_F_LAST_BIT >= 8 * sizeof(int));
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 5609474ccd9f..165ac406f00b 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -74,6 +74,15 @@ struct io_uring_sqe {
 	};
 };
 
+struct io_uring_cmd_sqe {
+	struct io_uring_sqe_hdr	hdr;
+	__u64			user_data;
+	__u16			op;
+	__u16			personality;
+	__u32			len;
+	__u64			pdu[5];
+};
+
 enum {
 	IOSQE_FIXED_FILE_BIT,
 	IOSQE_IO_DRAIN_BIT,
@@ -148,6 +157,7 @@ enum {
 	IORING_OP_SHUTDOWN,
 	IORING_OP_RENAMEAT,
 	IORING_OP_UNLINKAT,
+	IORING_OP_URING_CMD,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
-- 
2.31.0



* [PATCH 3/8] fs: add file_operations->uring_cmd()
  2021-03-17 22:10 [PATCHSET v4 0/8] io_uring passthrough support Jens Axboe
  2021-03-17 22:10 ` [PATCH 1/8] io_uring: split up io_uring_sqe into hdr + main Jens Axboe
  2021-03-17 22:10 ` [PATCH 2/8] io_uring: add infrastructure around io_uring_cmd_sqe issue type Jens Axboe
@ 2021-03-17 22:10 ` Jens Axboe
  2021-03-18  5:38   ` Christoph Hellwig
  2022-02-17  1:25   ` Luis Chamberlain
  2021-03-17 22:10 ` [PATCH 4/8] io_uring: add support for IORING_OP_URING_CMD Jens Axboe
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 25+ messages in thread
From: Jens Axboe @ 2021-03-17 22:10 UTC
  To: io-uring; +Cc: joshi.k, hch, kbusch, linux-nvme, metze, Jens Axboe

This is a file-private handler, similar to ioctls but hopefully a lot
more sane and useful.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c      |  5 -----
 include/linux/fs.h | 11 +++++++++++
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index a4699b066172..fecf10e0625f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -188,11 +188,6 @@ struct io_rings {
 	struct io_uring_cqe	cqes[] ____cacheline_aligned_in_smp;
 };
 
-enum io_uring_cmd_flags {
-	IO_URING_F_NONBLOCK		= 1,
-	IO_URING_F_COMPLETE_DEFER	= 2,
-};
-
 struct io_mapped_ubuf {
 	u64		ubuf;
 	size_t		len;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ec8f3ddf4a6a..009abc668987 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1884,6 +1884,15 @@ struct dir_context {
 #define REMAP_FILE_ADVISORY		(REMAP_FILE_CAN_SHORTEN)
 
 struct iov_iter;
+struct io_uring_cmd;
+
+/*
+ * f_op->uring_cmd() issue flags
+ */
+enum io_uring_cmd_flags {
+	IO_URING_F_NONBLOCK		= 1,
+	IO_URING_F_COMPLETE_DEFER	= 2,
+};
 
 struct file_operations {
 	struct module *owner;
@@ -1925,6 +1934,8 @@ struct file_operations {
 				   struct file *file_out, loff_t pos_out,
 				   loff_t len, unsigned int remap_flags);
 	int (*fadvise)(struct file *, loff_t, loff_t, int);
+
+	int (*uring_cmd)(struct io_uring_cmd *, enum io_uring_cmd_flags);
 } __randomize_layout;
 
 struct inode_operations {
-- 
2.31.0



* [PATCH 4/8] io_uring: add support for IORING_OP_URING_CMD
  2021-03-17 22:10 [PATCHSET v4 0/8] io_uring passthrough support Jens Axboe
                   ` (2 preceding siblings ...)
  2021-03-17 22:10 ` [PATCH 3/8] fs: add file_operations->uring_cmd() Jens Axboe
@ 2021-03-17 22:10 ` Jens Axboe
  2021-03-18  5:42   ` Christoph Hellwig
  2021-03-17 22:10 ` [PATCH 5/8] block: wire up support for file_operations->uring_cmd() Jens Axboe
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Jens Axboe @ 2021-03-17 22:10 UTC
  To: io-uring; +Cc: joshi.k, hch, kbusch, linux-nvme, metze, Jens Axboe

This is a file-private kind of request. io_uring doesn't know what's
in this command type; it's for the file_operations->uring_cmd()
handler to deal with.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c            | 54 ++++++++++++++++++++++++++++++++++++++++
 include/linux/io_uring.h | 16 ++++++++++++
 2 files changed, 70 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index fecf10e0625f..a66f953f71d4 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -770,6 +770,7 @@ struct io_kiocb {
 		struct io_shutdown	shutdown;
 		struct io_rename	rename;
 		struct io_unlink	unlink;
+		struct io_uring_cmd	uring_cmd;
 		/* use only after cleaning per-op data, see io_clean_op() */
 		struct io_completion	compl;
 	};
@@ -1002,6 +1003,7 @@ static const struct io_op_def io_op_defs[] = {
 	[IORING_OP_RENAMEAT] = {},
 	[IORING_OP_UNLINKAT] = {},
 	[IORING_OP_URING_CMD] = {
+		.needs_file		= 1,
 		.offsets		= 1,
 	},
 };
@@ -3565,6 +3567,53 @@ static int io_unlinkat(struct io_kiocb *req, unsigned int issue_flags)
 	return 0;
 }
 
+/*
+ * Called by consumers of io_uring_cmd, if they originally returned
+ * -EIOCBQUEUED upon receiving the command.
+ */
+void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret)
+{
+	struct io_kiocb *req = container_of(cmd, struct io_kiocb, uring_cmd);
+
+	if (ret < 0)
+		req_set_fail_links(req);
+	io_req_complete(req, ret);
+}
+EXPORT_SYMBOL(io_uring_cmd_done);
+
+static int io_uring_cmd_prep(struct io_kiocb *req,
+			     const struct io_uring_sqe *sqe)
+{
+	const struct io_uring_cmd_sqe *csqe = (const void *) sqe;
+	struct io_uring_cmd *cmd = &req->uring_cmd;
+
+	if (!req->file->f_op->uring_cmd)
+		return -EOPNOTSUPP;
+
+	cmd->op = READ_ONCE(csqe->op);
+	cmd->len = READ_ONCE(csqe->len);
+
+	/*
+	 * The payload is the last 40 bytes of an io_uring_cmd_sqe, with the
+	 * type being defined by the recipient.
+	 */
+	memcpy(&cmd->pdu, &csqe->pdu, sizeof(cmd->pdu));
+	return 0;
+}
+
+static int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct file *file = req->file;
+	int ret;
+
+	ret = file->f_op->uring_cmd(&req->uring_cmd, issue_flags);
+	/* queued async, consumer will call io_uring_cmd_done() when complete */
+	if (ret == -EIOCBQUEUED)
+		return 0;
+	io_uring_cmd_done(&req->uring_cmd, ret);
+	return 0;
+}
+
 static int io_shutdown_prep(struct io_kiocb *req,
 			    const struct io_uring_sqe *sqe)
 {
@@ -5858,6 +5907,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return io_renameat_prep(req, sqe);
 	case IORING_OP_UNLINKAT:
 		return io_unlinkat_prep(req, sqe);
+	case IORING_OP_URING_CMD:
+		return io_uring_cmd_prep(req, sqe);
 	}
 
 	printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n",
@@ -6114,6 +6165,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 	case IORING_OP_UNLINKAT:
 		ret = io_unlinkat(req, issue_flags);
 		break;
+	case IORING_OP_URING_CMD:
+		ret = io_uring_cmd(req, issue_flags);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 9761a0ec9f95..fd5c8ca40a70 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -30,7 +30,20 @@ struct io_uring_task {
 	struct callback_head	task_work;
 };
 
+/*
+ * Note that the first member here must be a struct file, as the
+ * io_uring command layout depends on that.
+ */
+struct io_uring_cmd {
+	struct file	*file;
+	__u16		op;
+	__u16		unused;
+	__u32		len;
+	__u64		pdu[5];	/* 40 bytes available inline for free use */
+};
+
 #if defined(CONFIG_IO_URING)
+void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret);
 struct sock *io_uring_get_socket(struct file *file);
 void __io_uring_task_cancel(void);
 void __io_uring_files_cancel(struct files_struct *files);
@@ -52,6 +65,9 @@ static inline void io_uring_free(struct task_struct *tsk)
 		__io_uring_free(tsk);
 }
 #else
+static inline void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret)
+{
+}
 static inline struct sock *io_uring_get_socket(struct file *file)
 {
 	return NULL;
-- 
2.31.0
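
The completion contract in patch 4/8 (a handler either returns its
result directly, or returns -EIOCBQUEUED and later calls
io_uring_cmd_done() itself) can be modelled in userspace roughly as
below. This is only a sketch under stated assumptions: struct cmd,
cmd_done(), issue_cmd() and the two handlers are stand-ins invented for
the example, and EIOCBQUEUED is defined locally because it is a
kernel-internal errno value that userspace <errno.h> does not provide.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#ifndef EIOCBQUEUED
#define EIOCBQUEUED 529	/* kernel-internal: iocb queued, completion follows */
#endif

/* Stand-in for io_uring_cmd plus the bits of io_kiocb we care about. */
struct cmd {
	uint16_t op;
	uint32_t len;
	uint64_t pdu[5];
	int      completed;	/* set once a completion has been posted */
	long     result;	/* value that would land in the CQE */
};

typedef int (*cmd_handler)(struct cmd *);

/* Models io_uring_cmd_done(): post the completion exactly once. */
static void cmd_done(struct cmd *c, long ret)
{
	c->completed = 1;
	c->result = ret;
}

/* Models io_uring_cmd(): complete inline unless the handler queued it. */
static int issue_cmd(struct cmd *c, cmd_handler handler)
{
	int ret = handler(c);

	/* queued async, the handler's owner calls cmd_done() later */
	if (ret == -EIOCBQUEUED)
		return 0;
	cmd_done(c, ret);
	return 0;
}

/* Hypothetical handler that finishes synchronously. */
static int sync_handler(struct cmd *c)
{
	return (int)c->len;
}

/* Hypothetical handler that defers completion. */
static int async_handler(struct cmd *c)
{
	(void)c;
	return -EIOCBQUEUED;
}
```

In the synchronous case the completion is posted before issue_cmd()
returns. In the deferred case nothing is posted until the consumer
calls cmd_done(), so a consumer that returns -EIOCBQUEUED and never
completes the command would leave the request hanging.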



* [PATCH 5/8] block: wire up support for file_operations->uring_cmd()
  2021-03-17 22:10 [PATCHSET v4 0/8] io_uring passthrough support Jens Axboe
                   ` (3 preceding siblings ...)
  2021-03-17 22:10 ` [PATCH 4/8] io_uring: add support for IORING_OP_URING_CMD Jens Axboe
@ 2021-03-17 22:10 ` Jens Axboe
  2021-03-18  5:44   ` Christoph Hellwig
  2021-03-17 22:10 ` [PATCH 6/8] block: add example ioctl Jens Axboe
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Jens Axboe @ 2021-03-17 22:10 UTC
  To: io-uring; +Cc: joshi.k, hch, kbusch, linux-nvme, metze, Jens Axboe

Pass it through the mq_ops->uring_cmd() handler, so we can plumb it
through all the way to the device driver.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-mq.c         | 11 +++++++++++
 fs/block_dev.c         | 10 ++++++++++
 include/linux/blk-mq.h |  6 ++++++
 include/linux/blkdev.h |  2 ++
 4 files changed, 29 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d4d7c1caa439..6c68540a89c0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3919,6 +3919,17 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 }
 EXPORT_SYMBOL_GPL(blk_poll);
 
+int blk_uring_cmd(struct block_device *bdev, struct io_uring_cmd *cmd,
+		  enum io_uring_cmd_flags issue_flags)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (!q->mq_ops || !q->mq_ops->uring_cmd)
+		return -EOPNOTSUPP;
+
+	return q->mq_ops->uring_cmd(q, cmd, issue_flags);
+}
+
 unsigned int blk_mq_rq_cpu(struct request *rq)
 {
 	return rq->mq_ctx->cpu;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 92ed7d5df677..cbc403ad0330 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -34,6 +34,7 @@
 #include <linux/part_stat.h>
 #include <linux/uaccess.h>
 #include <linux/suspend.h>
+#include <linux/io_uring.h>
 #include "internal.h"
 
 struct bdev_inode {
@@ -317,6 +318,14 @@ struct blkdev_dio {
 
 static struct bio_set blkdev_dio_pool;
 
+static int blkdev_uring_cmd(struct io_uring_cmd *cmd,
+			    enum io_uring_cmd_flags flags)
+{
+	struct block_device *bdev = I_BDEV(cmd->file->f_mapping->host);
+
+	return blk_uring_cmd(bdev, cmd, flags);
+}
+
 static int blkdev_iopoll(struct kiocb *kiocb, bool wait)
 {
 	struct block_device *bdev = I_BDEV(kiocb->ki_filp->f_mapping->host);
@@ -1840,6 +1849,7 @@ const struct file_operations def_blk_fops = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= blkdev_fallocate,
+	.uring_cmd	= blkdev_uring_cmd,
 };
 
 /**
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2c473c9b8990..70ee55c148c1 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -376,6 +376,12 @@ struct blk_mq_ops {
 	 */
 	int (*map_queues)(struct blk_mq_tag_set *set);
 
+	/**
+	 * @uring_cmd: queues requests through io_uring IORING_OP_URING_CMD
+	 */
+	int (*uring_cmd)(struct request_queue *q, struct io_uring_cmd *cmd,
+				enum io_uring_cmd_flags issue_flags);
+
 #ifdef CONFIG_BLK_DEBUG_FS
 	/**
 	 * @show_rq: Used by the debugfs implementation to show driver-specific
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index bc6bc8383b43..7eb993e82783 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -957,6 +957,8 @@ int blk_status_to_errno(blk_status_t status);
 blk_status_t errno_to_blk_status(int errno);
 
 int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin);
+int blk_uring_cmd(struct block_device *bdev, struct io_uring_cmd *cmd,
+			enum io_uring_cmd_flags issue_flags);
 
 static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
 {
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 6/8] block: add example ioctl
  2021-03-17 22:10 [PATCHSET v4 0/8] io_uring passthrough support Jens Axboe
                   ` (4 preceding siblings ...)
  2021-03-17 22:10 ` [PATCH 5/8] block: wire up support for file_operations->uring_cmd() Jens Axboe
@ 2021-03-17 22:10 ` Jens Axboe
  2021-03-18  5:45   ` Christoph Hellwig
  2021-03-17 22:10 ` [PATCH 7/8] net: wire up support for file_operations->uring_cmd() Jens Axboe
  2021-03-17 22:10 ` [PATCH 8/8] net: add example SOCKET_URING_OP_SIOCINQ/SOCKET_URING_OP_SIOCOUTQ Jens Axboe
  7 siblings, 1 reply; 25+ messages in thread
From: Jens Axboe @ 2021-03-17 22:10 UTC (permalink / raw)
  To: io-uring; +Cc: joshi.k, hch, kbusch, linux-nvme, metze, Jens Axboe

Grab op == 1, BLOCK_URING_OP_IOCTL, and use it to implement basic
ioctl functionality.

Example code, to issue BLKBSZGET through IORING_OP_URING_CMD:

struct block_uring_cmd {
	__u32	ioctl_cmd;
	__u32	unused1;
	__u64	unused2[4];
};

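static int get_bs(struct io_uring *ring, const char *dev)
{
	struct io_uring_cqe *cqe;
	struct io_uring_sqe *sqe;
	struct io_uring_cmd_sqe *csqe;
	struct block_uring_cmd *cmd;
	int ret, fd;

	fd = open(dev, O_RDONLY);
	if (fd < 0)
		goto err;

	sqe = io_uring_get_sqe(ring);
	if (!sqe)
		goto err_close;
	csqe = (void *) sqe;
	memset(csqe, 0, sizeof(*csqe));
	csqe->hdr.opcode = IORING_OP_URING_CMD;
	csqe->hdr.fd = fd;
	csqe->user_data = 0x1234;
	csqe->op = BLOCK_URING_OP_IOCTL;
	/* assumes the sqe exposes its 40-byte payload as ->pdu, like io_uring_cmd */
	cmd = (void *) &csqe->pdu;
	cmd->ioctl_cmd = BLKBSZGET;

	ret = io_uring_submit(ring);
	if (ret < 0)
		goto err_close;
	ret = io_uring_wait_cqe(ring, &cqe);
	if (ret < 0)
		goto err_close;
	printf("bs=%d\n", cqe->res);
	io_uring_cqe_seen(ring, cqe);
	close(fd);
	return 0;
err_close:
	close(fd);
err:
	return 1;
}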

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/block_dev.c         | 20 ++++++++++++++++++++
 include/linux/blkdev.h | 11 +++++++++++
 2 files changed, 31 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index cbc403ad0330..9e44f63a0fe1 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -318,11 +318,31 @@ struct blkdev_dio {
 
 static struct bio_set blkdev_dio_pool;
 
+static int blkdev_uring_ioctl(struct block_device *bdev,
+			      struct io_uring_cmd *cmd)
+{
+	struct block_uring_cmd *bcmd = (struct block_uring_cmd *) &cmd->pdu;
+
+	switch (bcmd->ioctl_cmd) {
+	case BLKBSZGET:
+		return block_size(bdev);
+	default:
+		return -ENOTTY;
+	}
+}
+
 static int blkdev_uring_cmd(struct io_uring_cmd *cmd,
 			    enum io_uring_cmd_flags flags)
 {
 	struct block_device *bdev = I_BDEV(cmd->file->f_mapping->host);
 
+	switch (cmd->op) {
+	case BLOCK_URING_OP_IOCTL:
+		return blkdev_uring_ioctl(bdev, cmd);
+	default:
+		break;
+	}
+
 	return blk_uring_cmd(bdev, cmd, flags);
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7eb993e82783..fa895aa3b51a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -44,6 +44,17 @@ struct blk_queue_stats;
 struct blk_stat_callback;
 struct blk_keyslot_manager;
 
+enum {
+	BLOCK_URING_OP_IOCTL = 1,
+};
+
+/* This overlays struct io_uring_cmd pdu (40 bytes) */
+struct block_uring_cmd {
+	__u32	ioctl_cmd;
+	__u32	unused1;
+	__u64	unused2[4];
+};
+
 #define BLKDEV_MIN_RQ	4
 #define BLKDEV_MAX_RQ	128	/* Default maximum */
 
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 7/8] net: wire up support for file_operations->uring_cmd()
  2021-03-17 22:10 [PATCHSET v4 0/8] io_uring passthrough support Jens Axboe
                   ` (5 preceding siblings ...)
  2021-03-17 22:10 ` [PATCH 6/8] block: add example ioctl Jens Axboe
@ 2021-03-17 22:10 ` Jens Axboe
  2022-02-17  1:03   ` Luis Chamberlain
  2021-03-17 22:10 ` [PATCH 8/8] net: add example SOCKET_URING_OP_SIOCINQ/SOCKET_URING_OP_SIOCOUTQ Jens Axboe
  7 siblings, 1 reply; 25+ messages in thread
From: Jens Axboe @ 2021-03-17 22:10 UTC (permalink / raw)
  To: io-uring; +Cc: joshi.k, hch, kbusch, linux-nvme, metze, Jens Axboe

Pass it through the proto_ops->uring_cmd() handler, so we can plumb it
through all the way to the proto->uring_cmd() handler.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/net.h  |  2 ++
 include/net/sock.h   |  6 ++++++
 net/core/sock.c      | 17 +++++++++++++++--
 net/dccp/ipv4.c      |  1 +
 net/ipv4/af_inet.c   |  3 +++
 net/l2tp/l2tp_ip.c   |  1 +
 net/mptcp/protocol.c |  1 +
 net/sctp/protocol.c  |  1 +
 net/socket.c         | 13 +++++++++++++
 9 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index ba736b457a06..b61c6cfefc15 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -159,6 +159,8 @@ struct proto_ops {
 	int	 	(*compat_ioctl) (struct socket *sock, unsigned int cmd,
 				      unsigned long arg);
 #endif
+	int		(*uring_cmd)(struct socket *sock, struct io_uring_cmd *cmd,
+					enum io_uring_cmd_flags issue_flags);
 	int		(*gettstamp) (struct socket *sock, void __user *userstamp,
 				      bool timeval, bool time32);
 	int		(*listen)    (struct socket *sock, int len);
diff --git a/include/net/sock.h b/include/net/sock.h
index 636810ddcd9b..9c2921f4357a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -110,6 +110,7 @@ typedef struct {
 struct sock;
 struct proto;
 struct net;
+struct io_uring_cmd;
 
 typedef __u32 __bitwise __portpair;
 typedef __u64 __bitwise __addrpair;
@@ -1146,6 +1147,9 @@ struct proto {
 
 	int			(*ioctl)(struct sock *sk, int cmd,
 					 unsigned long arg);
+	int			(*uring_cmd)(struct sock *sk,
+					struct io_uring_cmd *cmd,
+					enum io_uring_cmd_flags issue_flags);
 	int			(*init)(struct sock *sk);
 	void			(*destroy)(struct sock *sk);
 	void			(*shutdown)(struct sock *sk, int how);
@@ -1761,6 +1765,8 @@ int sock_common_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 			int flags);
 int sock_common_setsockopt(struct socket *sock, int level, int optname,
 			   sockptr_t optval, unsigned int optlen);
+int sock_common_uring_cmd(struct socket *sock, struct io_uring_cmd *cmd,
+				enum io_uring_cmd_flags issue_flags);
 
 void sk_common_release(struct sock *sk);
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 0ed98f20448a..e3c1bd68fdfd 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3264,6 +3264,18 @@ int sock_common_setsockopt(struct socket *sock, int level, int optname,
 }
 EXPORT_SYMBOL(sock_common_setsockopt);
 
+int sock_common_uring_cmd(struct socket *sock, struct io_uring_cmd *cmd,
+				enum io_uring_cmd_flags issue_flags)
+{
+	struct sock *sk = sock->sk;
+
+	if (!sk->sk_prot || !sk->sk_prot->uring_cmd)
+		return -EOPNOTSUPP;
+
+	return sk->sk_prot->uring_cmd(sk, cmd, issue_flags);
+}
+EXPORT_SYMBOL(sock_common_uring_cmd);
+
 void sk_common_release(struct sock *sk)
 {
 	if (sk->sk_prot->destroy)
@@ -3615,7 +3627,7 @@ static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
 {
 
 	seq_printf(seq, "%-9s %4u %6d  %6ld   %-3s %6u   %-3s  %-10s "
-			"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
+			"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
 		   proto->name,
 		   proto->obj_size,
 		   sock_prot_inuse_get(seq_file_net(seq), proto),
@@ -3629,6 +3641,7 @@ static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
 		   proto_method_implemented(proto->disconnect),
 		   proto_method_implemented(proto->accept),
 		   proto_method_implemented(proto->ioctl),
+		   proto_method_implemented(proto->uring_cmd),
 		   proto_method_implemented(proto->init),
 		   proto_method_implemented(proto->destroy),
 		   proto_method_implemented(proto->shutdown),
@@ -3657,7 +3670,7 @@ static int proto_seq_show(struct seq_file *seq, void *v)
 			   "maxhdr",
 			   "slab",
 			   "module",
-			   "cl co di ac io in de sh ss gs se re sp bi br ha uh gp em\n");
+			   "cl co di ac io ur in de sh ss gs se re sp bi br ha uh gp em\n");
 	else
 		proto_seq_printf(seq, list_entry(v, struct proto, node));
 	return 0;
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 2455b0c0e486..8ba02865ae98 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -983,6 +983,7 @@ static const struct proto_ops inet_dccp_ops = {
 	/* FIXME: work on tcp_poll to rename it to inet_csk_poll */
 	.poll		   = dccp_poll,
 	.ioctl		   = inet_ioctl,
+	.uring_cmd	   = sock_common_uring_cmd,
 	.gettstamp	   = sock_gettstamp,
 	/* FIXME: work on inet_listen to rename it to sock_common_listen */
 	.listen		   = inet_dccp_listen,
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 1355e6c0d567..7dc4d399b2ef 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1030,6 +1030,7 @@ const struct proto_ops inet_stream_ops = {
 	.getname	   = inet_getname,
 	.poll		   = tcp_poll,
 	.ioctl		   = inet_ioctl,
+	.uring_cmd	   = sock_common_uring_cmd,
 	.gettstamp	   = sock_gettstamp,
 	.listen		   = inet_listen,
 	.shutdown	   = inet_shutdown,
@@ -1064,6 +1065,7 @@ const struct proto_ops inet_dgram_ops = {
 	.getname	   = inet_getname,
 	.poll		   = udp_poll,
 	.ioctl		   = inet_ioctl,
+	.uring_cmd	   = sock_common_uring_cmd,
 	.gettstamp	   = sock_gettstamp,
 	.listen		   = sock_no_listen,
 	.shutdown	   = inet_shutdown,
@@ -1095,6 +1097,7 @@ static const struct proto_ops inet_sockraw_ops = {
 	.getname	   = inet_getname,
 	.poll		   = datagram_poll,
 	.ioctl		   = inet_ioctl,
+	.uring_cmd	   = sock_common_uring_cmd,
 	.gettstamp	   = sock_gettstamp,
 	.listen		   = sock_no_listen,
 	.shutdown	   = inet_shutdown,
diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index 97ae1255fcb6..9b5a4b3b5acb 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -615,6 +615,7 @@ static const struct proto_ops l2tp_ip_ops = {
 	.getname	   = l2tp_ip_getname,
 	.poll		   = datagram_poll,
 	.ioctl		   = inet_ioctl,
+	.uring_cmd	   = sock_common_uring_cmd,
 	.gettstamp	   = sock_gettstamp,
 	.listen		   = sock_no_listen,
 	.shutdown	   = inet_shutdown,
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 76958570ae7f..7f61fb783f50 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -3454,6 +3454,7 @@ static const struct proto_ops mptcp_stream_ops = {
 	.getname	   = inet_getname,
 	.poll		   = mptcp_poll,
 	.ioctl		   = inet_ioctl,
+	.uring_cmd	   = sock_common_uring_cmd,
 	.gettstamp	   = sock_gettstamp,
 	.listen		   = mptcp_listen,
 	.shutdown	   = inet_shutdown,
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 6f2bbfeec3a4..fdb960e752b3 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1133,6 +1133,7 @@ static const struct proto_ops inet_seqpacket_ops = {
 	.getname	   = inet_getname,	/* Semantics are different.  */
 	.poll		   = sctp_poll,
 	.ioctl		   = inet_ioctl,
+	.uring_cmd	   = sock_common_uring_cmd,
 	.gettstamp	   = sock_gettstamp,
 	.listen		   = sctp_inet_listen,
 	.shutdown	   = inet_shutdown,	/* Looks harmless.  */
diff --git a/net/socket.c b/net/socket.c
index 84a8049c2b09..19ab0986af9d 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -86,6 +86,7 @@
 #include <linux/xattr.h>
 #include <linux/nospec.h>
 #include <linux/indirect_call_wrapper.h>
+#include <linux/io_uring.h>
 
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
@@ -113,6 +114,7 @@ unsigned int sysctl_net_busy_poll __read_mostly;
 static ssize_t sock_read_iter(struct kiocb *iocb, struct iov_iter *to);
 static ssize_t sock_write_iter(struct kiocb *iocb, struct iov_iter *from);
 static int sock_mmap(struct file *file, struct vm_area_struct *vma);
+static int sock_uring_cmd(struct io_uring_cmd *cmd, enum io_uring_cmd_flags issue_flags);
 
 static int sock_close(struct inode *inode, struct file *file);
 static __poll_t sock_poll(struct file *file,
@@ -156,6 +158,7 @@ static const struct file_operations socket_file_ops = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl = compat_sock_ioctl,
 #endif
+	.uring_cmd =	sock_uring_cmd,
 	.mmap =		sock_mmap,
 	.release =	sock_close,
 	.fasync =	sock_fasync,
@@ -1182,6 +1185,16 @@ static long sock_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 	return err;
 }
 
+static int sock_uring_cmd(struct io_uring_cmd *cmd, enum io_uring_cmd_flags issue_flags)
+{
+	struct socket *sock = cmd->file->private_data;
+
+	if (!sock->ops || !sock->ops->uring_cmd)
+		return -EOPNOTSUPP;
+
+	return sock->ops->uring_cmd(sock, cmd, issue_flags);
+}
+
 /**
  *	sock_create_lite - creates a socket
  *	@family: protocol family (AF_INET, ...)
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 8/8] net: add example SOCKET_URING_OP_SIOCINQ/SOCKET_URING_OP_SIOCOUTQ
  2021-03-17 22:10 [PATCHSET v4 0/8] io_uring passthrough support Jens Axboe
                   ` (6 preceding siblings ...)
  2021-03-17 22:10 ` [PATCH 7/8] net: wire up support for file_operations->uring_cmd() Jens Axboe
@ 2021-03-17 22:10 ` Jens Axboe
  7 siblings, 0 replies; 25+ messages in thread
From: Jens Axboe @ 2021-03-17 22:10 UTC (permalink / raw)
  To: io-uring; +Cc: joshi.k, hch, kbusch, linux-nvme, metze, Jens Axboe

This adds support for these sample ioctls for tcp/udp/raw ipv4/ipv6.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/net/raw.h        |  3 +++
 include/net/tcp.h        |  2 ++
 include/net/udp.h        |  2 ++
 include/uapi/linux/net.h | 17 +++++++++++++++++
 net/ipv4/raw.c           | 27 +++++++++++++++++++++++++++
 net/ipv4/tcp.c           | 36 ++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp_ipv4.c      |  1 +
 net/ipv4/udp.c           | 18 ++++++++++++++++++
 net/ipv6/raw.c           |  1 +
 net/ipv6/tcp_ipv6.c      |  1 +
 net/ipv6/udp.c           |  1 +
 11 files changed, 109 insertions(+)

diff --git a/include/net/raw.h b/include/net/raw.h
index 8ad8df594853..27098db724dd 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -82,4 +82,7 @@ static inline bool raw_sk_bound_dev_eq(struct net *net, int bound_dev_if,
 #endif
 }
 
+int raw_uring_cmd(struct sock *sk, struct io_uring_cmd *cmd,
+			enum io_uring_cmd_flags issue_flags);
+
 #endif	/* _RAW_H */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 963cd86d12dd..b2aca8ce3293 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -350,6 +350,8 @@ void tcp_twsk_destructor(struct sock *sk);
 ssize_t tcp_splice_read(struct socket *sk, loff_t *ppos,
 			struct pipe_inode_info *pipe, size_t len,
 			unsigned int flags);
+int tcp_uring_cmd(struct sock *sk, struct io_uring_cmd *cmd,
+			enum io_uring_cmd_flags issue_flags);
 
 void tcp_enter_quickack_mode(struct sock *sk, unsigned int max_quickacks);
 static inline void tcp_dec_quickack_mode(struct sock *sk,
diff --git a/include/net/udp.h b/include/net/udp.h
index a132a02b2f2c..0588ca8a9406 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -329,6 +329,8 @@ struct sock *__udp6_lib_lookup(struct net *net,
 			       struct sk_buff *skb);
 struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb,
 				 __be16 sport, __be16 dport);
+int udp_uring_cmd(struct sock *sk, struct io_uring_cmd *cmd,
+			enum io_uring_cmd_flags issue_flags);
 
 /* UDP uses skb->dev_scratch to cache as much information as possible and avoid
  * possibly multiple cache miss on dequeue()
diff --git a/include/uapi/linux/net.h b/include/uapi/linux/net.h
index 4dabec6bd957..5e8d604e4cc6 100644
--- a/include/uapi/linux/net.h
+++ b/include/uapi/linux/net.h
@@ -19,6 +19,7 @@
 #ifndef _UAPI_LINUX_NET_H
 #define _UAPI_LINUX_NET_H
 
+#include <linux/types.h>
 #include <linux/socket.h>
 #include <asm/socket.h>
 
@@ -55,4 +56,20 @@ typedef enum {
 
 #define __SO_ACCEPTCON	(1 << 16)	/* performed a listen		*/
 
+enum {
+	SOCKET_URING_OP_SIOCINQ		= 0,
+	SOCKET_URING_OP_SIOCOUTQ,
+
+	/*
+	 * This is reserved for custom sub protocol
+	 */
+	SOCKET_URING_OP_SUBPROTO_CMD	= 0xffff,
+};
+
+struct sock_uring_cmd {
+	__u16	op;
+	__u16	unused[3];
+	__u64	unused2[4];
+};
+
 #endif /* _UAPI_LINUX_NET_H */
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 50a73178d63a..f93011d8f174 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -75,6 +75,7 @@
 #include <linux/netfilter_ipv4.h>
 #include <linux/compat.h>
 #include <linux/uio.h>
+#include <linux/io_uring.h>
 
 struct raw_frag_vec {
 	struct msghdr *msg;
@@ -878,6 +879,31 @@ static int raw_getsockopt(struct sock *sk, int level, int optname,
 	return do_raw_getsockopt(sk, level, optname, optval, optlen);
 }
 
+int raw_uring_cmd(struct sock *sk, struct io_uring_cmd *cmd,
+		  enum io_uring_cmd_flags issue_flags)
+{
+	struct sock_uring_cmd *scmd = (struct sock_uring_cmd *)&cmd->pdu;
+
+	switch (scmd->op) {
+	case SOCKET_URING_OP_SIOCOUTQ:
+		return sk_wmem_alloc_get(sk);
+	case SOCKET_URING_OP_SIOCINQ: {
+		struct sk_buff *skb;
+		int amount = 0;
+
+		spin_lock_bh(&sk->sk_receive_queue.lock);
+		skb = skb_peek(&sk->sk_receive_queue);
+		if (skb)
+			amount = skb->len;
+		spin_unlock_bh(&sk->sk_receive_queue.lock);
+		return amount;
+		}
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+EXPORT_SYMBOL_GPL(raw_uring_cmd);
+
 static int raw_ioctl(struct sock *sk, int cmd, unsigned long arg)
 {
 	switch (cmd) {
@@ -956,6 +982,7 @@ struct proto raw_prot = {
 	.release_cb	   = ip4_datagram_release_cb,
 	.hash		   = raw_hash_sk,
 	.unhash		   = raw_unhash_sk,
+	.uring_cmd	   = raw_uring_cmd,
 	.obj_size	   = sizeof(struct raw_sock),
 	.useroffset	   = offsetof(struct raw_sock, filter),
 	.usersize	   = sizeof_field(struct raw_sock, filter),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index de7cc8445ac0..b9d4c6098049 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -279,6 +279,7 @@
 #include <linux/uaccess.h>
 #include <asm/ioctls.h>
 #include <net/busy_poll.h>
+#include <linux/io_uring.h>
 
 /* Track pending CMSGs. */
 enum {
@@ -600,6 +601,41 @@ __poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 }
 EXPORT_SYMBOL(tcp_poll);
 
+int tcp_uring_cmd(struct sock *sk, struct io_uring_cmd *cmd,
+		  enum io_uring_cmd_flags issue_flags)
+{
+	struct sock_uring_cmd *scmd = (struct sock_uring_cmd *)&cmd->pdu;
+	struct tcp_sock *tp = tcp_sk(sk);
+	bool slow;
+	int ret;
+
+	switch (scmd->op) {
+	case SOCKET_URING_OP_SIOCINQ:
+		if (sk->sk_state == TCP_LISTEN)
+			return -EINVAL;
+
+		slow = lock_sock_fast(sk);
+		ret = tcp_inq(sk);
+		unlock_sock_fast(sk, slow);
+		break;
+	case SOCKET_URING_OP_SIOCOUTQ:
+		if (sk->sk_state == TCP_LISTEN)
+			return -EINVAL;
+
+		if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))
+			ret = 0;
+		else
+			ret = READ_ONCE(tp->write_seq) - tp->snd_una;
+		break;
+	default:
+		ret = -EOPNOTSUPP;
+		break;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tcp_uring_cmd);
+
 int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index daad4f99db32..c131eb1007b1 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2791,6 +2791,7 @@ struct proto tcp_prot = {
 	.disconnect		= tcp_disconnect,
 	.accept			= inet_csk_accept,
 	.ioctl			= tcp_ioctl,
+	.uring_cmd		= tcp_uring_cmd,
 	.init			= tcp_v4_init_sock,
 	.destroy		= tcp_v4_destroy_sock,
 	.shutdown		= tcp_shutdown,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 4a0478b17243..809ac8ae7e41 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -113,6 +113,7 @@
 #include <net/sock_reuseport.h>
 #include <net/addrconf.h>
 #include <net/udp_tunnel.h>
+#include <linux/io_uring.h>
 #if IS_ENABLED(CONFIG_IPV6)
 #include <net/ipv6_stubs.h>
 #endif
@@ -1682,6 +1683,22 @@ static int first_packet_length(struct sock *sk)
 	return res;
 }
 
+int udp_uring_cmd(struct sock *sk, struct io_uring_cmd *cmd,
+		  enum io_uring_cmd_flags issue_flags)
+{
+	struct sock_uring_cmd *scmd = (struct sock_uring_cmd *)&cmd->pdu;
+
+	switch (scmd->op) {
+	case SOCKET_URING_OP_SIOCINQ:
+		return max_t(int, 0, first_packet_length(sk));
+	case SOCKET_URING_OP_SIOCOUTQ:
+		return sk_wmem_alloc_get(sk);
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+EXPORT_SYMBOL_GPL(udp_uring_cmd);
+
 /*
  *	IOCTL requests applicable to the UDP protocol
  */
@@ -2837,6 +2854,7 @@ struct proto udp_prot = {
 	.connect		= ip4_datagram_connect,
 	.disconnect		= udp_disconnect,
 	.ioctl			= udp_ioctl,
+	.uring_cmd		= udp_uring_cmd,
 	.init			= udp_init_sock,
 	.destroy		= udp_destroy_sock,
 	.setsockopt		= udp_setsockopt,
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 1f56d9aae589..50f1e8189482 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -1235,6 +1235,7 @@ struct proto rawv6_prot = {
 	.connect	   = ip6_datagram_connect_v6_only,
 	.disconnect	   = __udp_disconnect,
 	.ioctl		   = rawv6_ioctl,
+	.uring_cmd	   = raw_uring_cmd,
 	.init		   = rawv6_init_sk,
 	.setsockopt	   = rawv6_setsockopt,
 	.getsockopt	   = rawv6_getsockopt,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index bd44ded7e50c..1ce253cc28f5 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2119,6 +2119,7 @@ struct proto tcpv6_prot = {
 	.disconnect		= tcp_disconnect,
 	.accept			= inet_csk_accept,
 	.ioctl			= tcp_ioctl,
+	.uring_cmd		= tcp_uring_cmd,
 	.init			= tcp_v6_init_sock,
 	.destroy		= tcp_v6_destroy_sock,
 	.shutdown		= tcp_shutdown,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index d25e5a9252fd..082593726a1e 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1702,6 +1702,7 @@ struct proto udpv6_prot = {
 	.connect		= ip6_datagram_connect,
 	.disconnect		= udp_disconnect,
 	.ioctl			= udp_ioctl,
+	.uring_cmd		= udp_uring_cmd,
 	.init			= udp_init_sock,
 	.destroy		= udpv6_destroy_sock,
 	.setsockopt		= udpv6_setsockopt,
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/8] io_uring: split up io_uring_sqe into hdr + main
  2021-03-17 22:10 ` [PATCH 1/8] io_uring: split up io_uring_sqe into hdr + main Jens Axboe
@ 2021-03-18  5:34   ` Christoph Hellwig
  2021-03-18 18:40     ` Jens Axboe
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2021-03-18  5:34 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, joshi.k, hch, kbusch, linux-nvme, metze

> @@ -14,11 +14,22 @@
>  /*
>   * IO submission data structure (Submission Queue Entry)
>   */
> +struct io_uring_sqe_hdr {
> +	__u8	opcode;		/* type of operation for this sqe */
> +	__u8	flags;		/* IOSQE_ flags */
> +	__u16	ioprio;		/* ioprio for the request */
> +	__s32	fd;		/* file descriptor to do IO on */
> +};
> +
>  struct io_uring_sqe {
> +#ifdef __KERNEL__
> +	struct io_uring_sqe_hdr	hdr;
> +#else
>  	__u8	opcode;		/* type of operation for this sqe */
>  	__u8	flags;		/* IOSQE_ flags */
>  	__u16	ioprio;		/* ioprio for the request */
>  	__s32	fd;		/* file descriptor to do IO on */
> +#endif
>  	union {
>  		__u64	off;	/* offset into file */
>  		__u64	addr2;

Please don't do that ifdef __KERNEL__ mess.  We never guaranteed
userspace API compatibility, just ABI compatibility.
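
The API/ABI distinction can be made concrete with a small userspace
check. The sketch below uses stand-in struct names (flat_sqe_head and
nested_sqe_head, neither is from the patch) and <stdint.h> types in
place of __u8 and friends, and it only models the first 16 bytes of the
sqe. Grouping the leading fields into a header struct changes how source
code names them (sqe->hdr.fd instead of sqe->fd), which is an API change,
but every field keeps its offset, so the binary layout is unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flat layout: the leading sqe fields as userspace sees them today. */
struct flat_sqe_head {
	uint8_t		opcode;
	uint8_t		flags;
	uint16_t	ioprio;
	int32_t		fd;
	uint64_t	off;
};

/* The same four fields grouped into a header struct. */
struct sqe_hdr {
	uint8_t		opcode;
	uint8_t		flags;
	uint16_t	ioprio;
	int32_t		fd;
};

/* Nested layout: header struct followed by the next field. */
struct nested_sqe_head {
	struct sqe_hdr	hdr;
	uint64_t	off;
};

/* Returns 1 when every field sits at the same offset in both layouts. */
static int layouts_match(void)
{
	return offsetof(struct flat_sqe_head, opcode) ==
		       offsetof(struct nested_sqe_head, hdr.opcode) &&
	       offsetof(struct flat_sqe_head, flags) ==
		       offsetof(struct nested_sqe_head, hdr.flags) &&
	       offsetof(struct flat_sqe_head, ioprio) ==
		       offsetof(struct nested_sqe_head, hdr.ioprio) &&
	       offsetof(struct flat_sqe_head, fd) ==
		       offsetof(struct nested_sqe_head, hdr.fd) &&
	       offsetof(struct flat_sqe_head, off) ==
		       offsetof(struct nested_sqe_head, off) &&
	       sizeof(struct flat_sqe_head) == sizeof(struct nested_sqe_head);
}
```

If layouts_match() holds, a binary built against the flat definition
keeps working against a kernel that uses the nested one; only source
code that names the fields would need to change.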

But we really do have a bigger problem here, and that is that ioprio is
a field that is specific to the read and write commands and thus
should not be in the generic header.  The personality, on the other
hand, is generic and should be.

So I'm not sure trying to retrofit this even makes all that much sense.

Maybe we should just define io_uring_sqe_hdr the way it makes
sense:

struct io_uring_sqe_hdr {
	__u8	opcode;	
	__u8	flags;
	__u16	personality;
	__s32	fd;
	__u64	user_data;
};

and use that for all new commands going forward while marking the
old ones as legacy.

io_uring_cmd_sqe would then be:

struct io_uring_cmd_sqe {
	struct io_uring_sqe_hdr	hdr;
	__u32			ioc;
	__u32			len;
	__u8			data[40];
};

for example.  Note the 32-bit opcode just like ioctl to avoid
getting into too much trouble due to collisions.
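
The sizes in this proposal can be checked by compiling the two structs
in userspace. This is only a sketch: it substitutes <stdint.h> types for
the kernel's __u8/__u16/__u32/__u64, and cmd_sqe_payload_bytes() is a
made-up helper for illustration. It confirms that the header is 16
bytes and that the command sqe fills the existing 64-byte sqe slot
exactly, leaving 40 bytes of payload.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Proposed generic header: 1 + 1 + 2 + 4 + 8 = 16 bytes. */
struct io_uring_sqe_hdr {
	uint8_t		opcode;
	uint8_t		flags;
	uint16_t	personality;
	int32_t		fd;
	uint64_t	user_data;
};

/* Proposed command sqe: 16 + 4 + 4 + 40 = 64 bytes. */
struct io_uring_cmd_sqe {
	struct io_uring_sqe_hdr	hdr;
	uint32_t		ioc;
	uint32_t		len;
	uint8_t			data[40];
};

_Static_assert(sizeof(struct io_uring_sqe_hdr) == 16,
	       "header must be 16 bytes");
_Static_assert(sizeof(struct io_uring_cmd_sqe) == 64,
	       "command sqe must fill one 64-byte sqe slot");
_Static_assert(offsetof(struct io_uring_cmd_sqe, data) == 24,
	       "payload starts right after hdr, ioc and len");

/* Bytes left for the command-specific payload. */
static size_t cmd_sqe_payload_bytes(void)
{
	return sizeof(((struct io_uring_cmd_sqe *)0)->data);
}
```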


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/8] fs: add file_operations->uring_cmd()
  2021-03-17 22:10 ` [PATCH 3/8] fs: add file_operations->uring_cmd() Jens Axboe
@ 2021-03-18  5:38   ` Christoph Hellwig
  2021-03-18 18:41     ` Jens Axboe
  2022-02-17  1:27     ` Luis Chamberlain
  2022-02-17  1:25   ` Luis Chamberlain
  1 sibling, 2 replies; 25+ messages in thread
From: Christoph Hellwig @ 2021-03-18  5:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, joshi.k, hch, kbusch, linux-nvme, metze

On Wed, Mar 17, 2021 at 04:10:22PM -0600, Jens Axboe wrote:
> This is a file private handler, similar to ioctls but hopefully a lot
> more sane and useful.

I really hate defining the interface in terms of io_uring.  This really
is nothing but an async ioctl.

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ec8f3ddf4a6a..009abc668987 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1884,6 +1884,15 @@ struct dir_context {
>  #define REMAP_FILE_ADVISORY		(REMAP_FILE_CAN_SHORTEN)
>  
>  struct iov_iter;
> +struct io_uring_cmd;
> +
> +/*
> + * f_op->uring_cmd() issue flags
> + */
> +enum io_uring_cmd_flags {
> +	IO_URING_F_NONBLOCK		= 1,
> +	IO_URING_F_COMPLETE_DEFER	= 2,
> +};

I'm a little worried about exposing a completely io_uring-specific
concept like IO_URING_F_COMPLETE_DEFER to random drivers.  This
needs to be better encapsulated.

>  struct file_operations {
>  	struct module *owner;
> @@ -1925,6 +1934,8 @@ struct file_operations {
>  				   struct file *file_out, loff_t pos_out,
>  				   loff_t len, unsigned int remap_flags);
>  	int (*fadvise)(struct file *, loff_t, loff_t, int);
> +
> +	int (*uring_cmd)(struct io_uring_cmd *, enum io_uring_cmd_flags);

As of this patch io_uring_cmd is still a private structure.  In general
I'm not sure this patch makes much sense on its own either.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/8] io_uring: add support for IORING_OP_URING_CMD
  2021-03-17 22:10 ` [PATCH 4/8] io_uring: add support for IORING_OP_URING_CMD Jens Axboe
@ 2021-03-18  5:42   ` Christoph Hellwig
  2021-03-18 18:43     ` Jens Axboe
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2021-03-18  5:42 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, joshi.k, hch, kbusch, linux-nvme, metze

On Wed, Mar 17, 2021 at 04:10:23PM -0600, Jens Axboe wrote:
> +/*
> + * Called by consumers of io_uring_cmd, if they originally returned
> + * -EIOCBQUEUED upon receiving the command.
> + */
> +void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret)
> +{
> +	struct io_kiocb *req = container_of(cmd, struct io_kiocb, uring_cmd);
> +
> +	if (ret < 0)
> +		req_set_fail_links(req);
> +	io_req_complete(req, ret);
> +}
> +EXPORT_SYMBOL(io_uring_cmd_done);

This really should be EXPORT_SYMBOL_GPL. But more importantly, I'm not
sure it is all that useful an interface.  All useful non-trivial ioctls
tend to access user memory, so something that queues up work in the task
context like in Joshi's patch should really be part of the documented
interface.

> +
> +static int io_uring_cmd_prep(struct io_kiocb *req,
> +			     const struct io_uring_sqe *sqe)
> +{
> +	const struct io_uring_cmd_sqe *csqe = (const void *) sqe;

We really should not need this casting.  The struct io_uring_sqe
usage in io_uring.c needs to be replaced with a union or some other
properly type safe way to handle this.

> +	ret = file->f_op->uring_cmd(&req->uring_cmd, issue_flags);
> +	/* queued async, consumer will call io_uring_cmd_done() when complete */
> +	if (ret == -EIOCBQUEUED)
> +		return 0;
> +	io_uring_cmd_done(&req->uring_cmd, ret);
> +	return 0;

This can be simplified to:

	if (ret != -EIOCBQUEUED)
		io_uring_cmd_done(&req->uring_cmd, ret);
	return 0;
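
To see that the two forms behave the same, here is a userspace mock.
mock_cmd_done() and the issue_*() helpers are invented for illustration,
and EIOCBQUEUED is defined locally because it is a kernel-internal errno
that userspace headers do not provide.

```c
#include <assert.h>

/* Kernel-internal errno, defined here only for the mock. */
#define EIOCBQUEUED	529

static int done_calls;	/* how many times completion was signalled */
static long last_ret;	/* result passed to the last completion */

/* Stand-in for io_uring_cmd_done(): just records the call. */
static void mock_cmd_done(long ret)
{
	done_calls++;
	last_ret = ret;
}

/* Control flow as posted in the patch. */
static int issue_original(long ret)
{
	if (ret == -EIOCBQUEUED)
		return 0;
	mock_cmd_done(ret);
	return 0;
}

/* Control flow as suggested in the review. */
static int issue_simplified(long ret)
{
	if (ret != -EIOCBQUEUED)
		mock_cmd_done(ret);
	return 0;
}
```

Both helpers complete the command inline for any result other than
-EIOCBQUEUED and leave completion to the consumer otherwise.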


> +/*
> + * Note that the first member here must be a struct file, as the
> + * io_uring command layout depends on that.
> + */
> +struct io_uring_cmd {
> +	struct file	*file;
> +	__u16		op;
> +	__u16		unused;
> +	__u32		len;
> +	__u64		pdu[5];	/* 40 bytes available inline for free use */
> +};

I am a little worried about exposing this internal structure to random
drivers.  OTOH we need something that eventually allows a container_of
to io_kiocb for the completion, so I can't think of anything better
at the moment either.
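
For readers unfamiliar with the pattern, the container_of step can be
sketched in userspace as follows. Everything here is a mock: mock_req
stands in for io_kiocb, mock_uring_cmd mirrors the fields of the quoted
struct io_uring_cmd, and the macro is a simplified version of the
kernel's container_of() without its type checking. The point is that
the driver only ever holds a pointer to the embedded command, and the
completion side recovers the enclosing request from it.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Mirrors the fields of struct io_uring_cmd quoted above. */
struct mock_uring_cmd {
	void			*file;
	unsigned short		op;
	unsigned short		unused;
	unsigned int		len;
	unsigned long long	pdu[5];
};

/* Hypothetical stand-in for io_kiocb: the command is embedded in it. */
struct mock_req {
	int			completed;
	long			result;
	struct mock_uring_cmd	uring_cmd;
};

/* Completion side: map the command back to its request and finish it. */
static void mock_cmd_done(struct mock_uring_cmd *cmd, long ret)
{
	struct mock_req *req = container_of(cmd, struct mock_req, uring_cmd);

	req->result = ret;
	req->completed = 1;
}
```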

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/8] block: wire up support for file_operations->uring_cmd()
  2021-03-17 22:10 ` [PATCH 5/8] block: wire up support for file_operations->uring_cmd() Jens Axboe
@ 2021-03-18  5:44   ` Christoph Hellwig
  0 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2021-03-18  5:44 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, joshi.k, hch, kbusch, linux-nvme, metze

> +int blk_uring_cmd(struct block_device *bdev, struct io_uring_cmd *cmd,
> +		  enum io_uring_cmd_flags issue_flags)
> +{
> +	struct request_queue *q = bdev_get_queue(bdev);
> +
> +	if (!q->mq_ops || !q->mq_ops->uring_cmd)
> +		return -EOPNOTSUPP;
> +
> +	return q->mq_ops->uring_cmd(q, cmd, issue_flags);
> +}

This has absolutely no business in blk-mq.  It is a plain
block_device_operation that has nothing to do with requests or
blk-mq.


* Re: [PATCH 6/8] block: add example ioctl
  2021-03-17 22:10 ` [PATCH 6/8] block: add example ioctl Jens Axboe
@ 2021-03-18  5:45   ` Christoph Hellwig
  2021-03-18 12:43     ` Pavel Begunkov
  2021-03-18 18:44     ` Jens Axboe
  0 siblings, 2 replies; 25+ messages in thread
From: Christoph Hellwig @ 2021-03-18  5:45 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, joshi.k, hch, kbusch, linux-nvme, metze

On Wed, Mar 17, 2021 at 04:10:25PM -0600, Jens Axboe wrote:
> +static int blkdev_uring_ioctl(struct block_device *bdev,
> +			      struct io_uring_cmd *cmd)
> +{
> +	struct block_uring_cmd *bcmd = (struct block_uring_cmd *) &cmd->pdu;
> +
> +	switch (bcmd->ioctl_cmd) {
> +	case BLKBSZGET:
> +		return block_size(bdev);
> +	default:
> +		return -ENOTTY;
> +	}
> +}
> +
>  static int blkdev_uring_cmd(struct io_uring_cmd *cmd,
>  			    enum io_uring_cmd_flags flags)
>  {
>  	struct block_device *bdev = I_BDEV(cmd->file->f_mapping->host);
>  
> +	switch (cmd->op) {
> +	case BLOCK_URING_OP_IOCTL:
> +		return blkdev_uring_ioctl(bdev, cmd);

I don't think the two level dispatch here makes any sense.  Then again
I don't think this code makes sense either except as an example..


* Re: [PATCH 6/8] block: add example ioctl
  2021-03-18  5:45   ` Christoph Hellwig
@ 2021-03-18 12:43     ` Pavel Begunkov
  2021-03-18 18:44     ` Jens Axboe
  1 sibling, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2021-03-18 12:43 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: io-uring, joshi.k, kbusch, linux-nvme, metze

On 18/03/2021 05:45, Christoph Hellwig wrote:
> On Wed, Mar 17, 2021 at 04:10:25PM -0600, Jens Axboe wrote:
>> +static int blkdev_uring_ioctl(struct block_device *bdev,
>> +			      struct io_uring_cmd *cmd)
>> +{
>> +	struct block_uring_cmd *bcmd = (struct block_uring_cmd *) &cmd->pdu;
>> +
>> +	switch (bcmd->ioctl_cmd) {
>> +	case BLKBSZGET:
>> +		return block_size(bdev);
>> +	default:
>> +		return -ENOTTY;
>> +	}
>> +}
>> +
>>  static int blkdev_uring_cmd(struct io_uring_cmd *cmd,
>>  			    enum io_uring_cmd_flags flags)
>>  {
>>  	struct block_device *bdev = I_BDEV(cmd->file->f_mapping->host);
>>  
>> +	switch (cmd->op) {
>> +	case BLOCK_URING_OP_IOCTL:
>> +		return blkdev_uring_ioctl(bdev, cmd);
> 
> I don't think the two level dispatch here makes any sense.  Then again

At least it's in my plans to rework it a bit to resolve callbacks in
advance and get rid of double dispatching (for some cases). Like

struct io_cmd {
	void (*cb)(...);
	...
};

struct file_operations {
	struct io_cmd *get_cmd(...);
};

// registration
ctx->cmds[i] = file->get_cmd(...);

And first we do registration, and then use it
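
Roughly, as a user-space sketch. All names here (mock_io_cmd,
mock_get_cmd(), mock_register(), mock_issue()) are hypothetical and
only illustrate the idea of resolving the callback once at registration
time, so that issue is a single indirect call:

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_MAX_CMDS 4

/* Hypothetical pre-resolved command: one callback, no further switch. */
struct mock_io_cmd {
	int (*cb)(int arg);
};

static int mock_get_block_size(int arg) { (void)arg; return 4096; }
static int mock_echo(int arg) { return arg; }

static const struct mock_io_cmd mock_cmd_bsz = { .cb = mock_get_block_size };
static const struct mock_io_cmd mock_cmd_echo = { .cb = mock_echo };

/* Stand-in for a per-file get_cmd() lookup; the only place with a switch. */
static const struct mock_io_cmd *mock_get_cmd(unsigned int op)
{
	switch (op) {
	case 0:
		return &mock_cmd_bsz;
	case 1:
		return &mock_cmd_echo;
	default:
		return NULL;
	}
}

struct mock_ctx {
	const struct mock_io_cmd *cmds[MOCK_MAX_CMDS];
};

/* Registration: resolve the callback ahead of time and store it in a slot. */
static int mock_register(struct mock_ctx *ctx, unsigned int slot, unsigned int op)
{
	const struct mock_io_cmd *cmd;

	if (slot >= MOCK_MAX_CMDS)
		return -1;
	cmd = mock_get_cmd(op);
	if (!cmd)
		return -1;
	ctx->cmds[slot] = cmd;
	return 0;
}

/* Issue: one indirect call through the registered slot. */
static int mock_issue(const struct mock_ctx *ctx, unsigned int slot, int arg)
{
	if (slot >= MOCK_MAX_CMDS || !ctx->cmds[slot])
		return -1;
	return ctx->cmds[slot]->cb(arg);
}
```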

> I don't think this code makes sense either except as an example..

-- 
Pavel Begunkov


* Re: [PATCH 1/8] io_uring: split up io_uring_sqe into hdr + main
  2021-03-18  5:34   ` Christoph Hellwig
@ 2021-03-18 18:40     ` Jens Axboe
  2021-03-19 11:20       ` Stefan Metzmacher
                         ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Jens Axboe @ 2021-03-18 18:40 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: io-uring, joshi.k, kbusch, linux-nvme, metze

On 3/17/21 11:34 PM, Christoph Hellwig wrote:
>> @@ -14,11 +14,22 @@
>>  /*
>>   * IO submission data structure (Submission Queue Entry)
>>   */
>> +struct io_uring_sqe_hdr {
>> +	__u8	opcode;		/* type of operation for this sqe */
>> +	__u8	flags;		/* IOSQE_ flags */
>> +	__u16	ioprio;		/* ioprio for the request */
>> +	__s32	fd;		/* file descriptor to do IO on */
>> +};
>> +
>>  struct io_uring_sqe {
>> +#ifdef __KERNEL__
>> +	struct io_uring_sqe_hdr	hdr;
>> +#else
>>  	__u8	opcode;		/* type of operation for this sqe */
>>  	__u8	flags;		/* IOSQE_ flags */
>>  	__u16	ioprio;		/* ioprio for the request */
>>  	__s32	fd;		/* file descriptor to do IO on */
>> +#endif
>>  	union {
>>  		__u64	off;	/* offset into file */
>>  		__u64	addr2;
> 
> Please don't do that ifdef __KERNEL__ mess.  We never guaranteed
> userspace API compatibility, just ABI compatibility.

Right, but I'm the one that has to deal with the fallout. For the
in-kernel one I can skip the __KERNEL__ part, and the layout is the
same anyway.
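
As a quick sanity check of "the layout is the same", a user-space
sketch with compile-time assertions. Only the leading fields of the SQE
are modelled, and the struct names are made up for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Header as proposed in patch 1/8. */
struct mock_sqe_hdr {
	uint8_t		opcode;
	uint8_t		flags;
	uint16_t	ioprio;
	int32_t		fd;
};

/* Leading fields of the existing SQE, flat (remaining fields omitted). */
struct mock_sqe_flat {
	uint8_t		opcode;
	uint8_t		flags;
	uint16_t	ioprio;
	int32_t		fd;
	uint64_t	off;
};

/* Same prefix, with the leading fields wrapped in the header struct. */
struct mock_sqe_wrapped {
	struct mock_sqe_hdr	hdr;
	uint64_t		off;
};

static_assert(sizeof(struct mock_sqe_hdr) == 8, "header is 8 bytes");
static_assert(offsetof(struct mock_sqe_flat, opcode) ==
	      offsetof(struct mock_sqe_wrapped, hdr.opcode), "opcode offset");
static_assert(offsetof(struct mock_sqe_flat, flags) ==
	      offsetof(struct mock_sqe_wrapped, hdr.flags), "flags offset");
static_assert(offsetof(struct mock_sqe_flat, ioprio) ==
	      offsetof(struct mock_sqe_wrapped, hdr.ioprio), "ioprio offset");
static_assert(offsetof(struct mock_sqe_flat, fd) ==
	      offsetof(struct mock_sqe_wrapped, hdr.fd), "fd offset");
static_assert(offsetof(struct mock_sqe_flat, off) ==
	      offsetof(struct mock_sqe_wrapped, off), "off offset");
static_assert(sizeof(struct mock_sqe_flat) ==
	      sizeof(struct mock_sqe_wrapped), "same total size");
```

If any of these failed, the wrapped variant would change the ABI and
not just the API.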

> But we really do have a bigger problem here, and that is ioprio is
> a field that is specific to the read and write commands and thus
> should not be in the generic header.  On the other hand the
> personality is.
> 
> So I'm not sure trying to retrofit this even makes all that much sense.
> 
> Maybe we should just define io_uring_sqe_hdr the way it makes
> sense:
> 
> struct io_uring_sqe_hdr {
> 	__u8	opcode;	
> 	__u8	flags;
> 	__u16	personality;
> 	__s32	fd;
> 	__u64	user_data;
> };
> 
> and use that for all new commands going forward while marking the
> old ones as legacy.
> 
> io_uring_cmd_sqe would then be:
> 
> struct io_uring_cmd_sqe {
>         struct io_uring_sqe_hdr	hdr;
> 	__u32			ioc;
> 	__u32 			len;
> 	__u8			data[40];
> };
> 
> for example.  Note the 32-bit opcode just like ioctl to avoid
> getting into too much trouble due to collisions.

I was debating that with myself too, it's essentially making
the existing io_uring_sqe into io_uring_sqe_v1 and then making a new
v2 one. That would impact _all_ commands, and we'd need some trickery
to have newly compiled stuff use v2 and have existing applications
continue to work with the v1 format. That's very different from having
a single (or new) opcodes use a v2 format, effectively.

Looking into the feasibility of this. But if that is done, there are
other things that need to be factored in, as I'm not at all interested
in having a v3 down the line as well. And I'd need to be able to do this
seamlessly, both from an application point of view, and a performance
point of view (no stupid conversions inline).

Things that come up when something like this is on the table

- Should flags be extended? We're almost out... It hasn't been an
  issue so far, but seems a bit silly to go v2 and not at least leave
  a bit of room there. But obviously comes at a cost of losing eg 8
  bits somewhere else.

- Is u8 enough for the opcode? Again, we're nowhere near the limits
  here, but eventually multiplexing might be necessary.

That's just off the top of my head, probably other things to consider
too.

-- 
Jens Axboe



* Re: [PATCH 3/8] fs: add file_operations->uring_cmd()
  2021-03-18  5:38   ` Christoph Hellwig
@ 2021-03-18 18:41     ` Jens Axboe
  2022-02-17  1:27     ` Luis Chamberlain
  1 sibling, 0 replies; 25+ messages in thread
From: Jens Axboe @ 2021-03-18 18:41 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: io-uring, joshi.k, kbusch, linux-nvme, metze

On 3/17/21 11:38 PM, Christoph Hellwig wrote:
> On Wed, Mar 17, 2021 at 04:10:22PM -0600, Jens Axboe wrote:
>> This is a file private handler, similar to ioctls but hopefully a lot
>> more sane and useful.
> 
> I really hate defining the interface in terms of io_uring.  This really
> is nothing but an async ioctl.

Sure, it's generic, could potentially use any transport. But the way it's
currently set up is using an io_uring transport and embedded command.

>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index ec8f3ddf4a6a..009abc668987 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -1884,6 +1884,15 @@ struct dir_context {
>>  #define REMAP_FILE_ADVISORY		(REMAP_FILE_CAN_SHORTEN)
>>  
>>  struct iov_iter;
>> +struct io_uring_cmd;
>> +
>> +/*
>> + * f_op->uring_cmd() issue flags
>> + */
>> +enum io_uring_cmd_flags {
>> +	IO_URING_F_NONBLOCK		= 1,
>> +	IO_URING_F_COMPLETE_DEFER	= 2,
>> +};
> 
> I'm a little worried about exposing a complete io_uring specific
> concept like IO_URING_F_COMPLETE_DEFER to random drivers.  This
> needs to be better encapsulated.

Agree.

>>  struct file_operations {
>>  	struct module *owner;
>> @@ -1925,6 +1934,8 @@ struct file_operations {
>>  				   struct file *file_out, loff_t pos_out,
>>  				   loff_t len, unsigned int remap_flags);
>>  	int (*fadvise)(struct file *, loff_t, loff_t, int);
>> +
>> +	int (*uring_cmd)(struct io_uring_cmd *, enum io_uring_cmd_flags);
> 
> As of this patch io_uring_cmd is still a private structure.  In general
> I'm not sure this patch makes much sense on its own either.

Might indeed just fold it in or reshuffle, I'll take a look.

-- 
Jens Axboe



* Re: [PATCH 4/8] io_uring: add support for IORING_OP_URING_CMD
  2021-03-18  5:42   ` Christoph Hellwig
@ 2021-03-18 18:43     ` Jens Axboe
  0 siblings, 0 replies; 25+ messages in thread
From: Jens Axboe @ 2021-03-18 18:43 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: io-uring, joshi.k, kbusch, linux-nvme, metze

On 3/17/21 11:42 PM, Christoph Hellwig wrote:
> On Wed, Mar 17, 2021 at 04:10:23PM -0600, Jens Axboe wrote:
>> +/*
>> + * Called by consumers of io_uring_cmd, if they originally returned
>> + * -EIOCBQUEUED upon receiving the command.
>> + */
>> +void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret)
>> +{
>> +	struct io_kiocb *req = container_of(cmd, struct io_kiocb, uring_cmd);
>> +
>> +	if (ret < 0)
>> +		req_set_fail_links(req);
>> +	io_req_complete(req, ret);
>> +}
>> +EXPORT_SYMBOL(io_uring_cmd_done);
> 
> This really should be EXPORT_SYMBOL_GPL. But more importantly I'm not

Did make that change in my tree yesterday.

> sure it is all that useful an interface.  All useful non-trivial ioctls
> tend to access user memory, so something that queues up work in the task
> context like in Joshis patch should really be part of the documented
> interface.

Agree, and I made some comments on that patch on how to make that situation
better. Should go in with this part, to have in-task completions for
finishing it up.

>> +static int io_uring_cmd_prep(struct io_kiocb *req,
>> +			     const struct io_uring_sqe *sqe)
>> +{
>> +	const struct io_uring_cmd_sqe *csqe = (const void *) sqe;
> 
> We really should not need this casting.  The struct io_uring_sqe
> usage in io_uring.c needs to be replaced with a union or some other
> properly type safe way to handle this.
> 
>> +	ret = file->f_op->uring_cmd(&req->uring_cmd, issue_flags);
>> +	/* queued async, consumer will call io_uring_cmd_done() when complete */
>> +	if (ret == -EIOCBQUEUED)
>> +		return 0;
>> +	io_uring_cmd_done(&req->uring_cmd, ret);
>> +	return 0;
> 
> This can be simplified to:
> 
> 	if (ret != -EIOCBQUEUED)
> 		io_uring_cmd_done(&req->uring_cmd, ret);
> 	return 0;

Good point, will do that.

>> + * Note that the first member here must be a struct file, as the
>> + * io_uring command layout depends on that.
>> + */
>> +struct io_uring_cmd {
>> +	struct file	*file;
>> +	__u16		op;
>> +	__u16		unused;
>> +	__u32		len;
>> +	__u64		pdu[5];	/* 40 bytes available inline for free use */
>> +};
> 
> I am a little worried about exposing this internal structure to random
> drivers.  OTOH we need something that eventually allows a container_of
> to io_kiocb for the completion, so I can't think of anything better
> at the moment either.
> 


-- 
Jens Axboe



* Re: [PATCH 6/8] block: add example ioctl
  2021-03-18  5:45   ` Christoph Hellwig
  2021-03-18 12:43     ` Pavel Begunkov
@ 2021-03-18 18:44     ` Jens Axboe
  1 sibling, 0 replies; 25+ messages in thread
From: Jens Axboe @ 2021-03-18 18:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: io-uring, joshi.k, kbusch, linux-nvme, metze

On 3/17/21 11:45 PM, Christoph Hellwig wrote:
> On Wed, Mar 17, 2021 at 04:10:25PM -0600, Jens Axboe wrote:
>> +static int blkdev_uring_ioctl(struct block_device *bdev,
>> +			      struct io_uring_cmd *cmd)
>> +{
>> +	struct block_uring_cmd *bcmd = (struct block_uring_cmd *) &cmd->pdu;
>> +
>> +	switch (bcmd->ioctl_cmd) {
>> +	case BLKBSZGET:
>> +		return block_size(bdev);
>> +	default:
>> +		return -ENOTTY;
>> +	}
>> +}
>> +
>>  static int blkdev_uring_cmd(struct io_uring_cmd *cmd,
>>  			    enum io_uring_cmd_flags flags)
>>  {
>>  	struct block_device *bdev = I_BDEV(cmd->file->f_mapping->host);
>>  
>> +	switch (cmd->op) {
>> +	case BLOCK_URING_OP_IOCTL:
>> +		return blkdev_uring_ioctl(bdev, cmd);
> 
> I don't think the two level dispatch here makes any sense.  Then again
> I don't think this code makes sense either except as an example..

That's all it is, an example. And, for me, just a quick way to test
that everything stacks and layers appropriately. But yes, once we have
something more concrete, this POC can be dropped and then re-introduced
when there's a real use case.

-- 
Jens Axboe



* Re: [PATCH 1/8] io_uring: split up io_uring_sqe into hdr + main
  2021-03-18 18:40     ` Jens Axboe
@ 2021-03-19 11:20       ` Stefan Metzmacher
  2021-03-19 13:29       ` Christoph Hellwig
  2022-02-24 22:34       ` Luis Chamberlain
  2 siblings, 0 replies; 25+ messages in thread
From: Stefan Metzmacher @ 2021-03-19 11:20 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig; +Cc: io-uring, joshi.k, kbusch, linux-nvme


Am 18.03.21 um 19:40 schrieb Jens Axboe:
> On 3/17/21 11:34 PM, Christoph Hellwig wrote:
>>> @@ -14,11 +14,22 @@
>>>  /*
>>>   * IO submission data structure (Submission Queue Entry)
>>>   */
>>> +struct io_uring_sqe_hdr {
>>> +	__u8	opcode;		/* type of operation for this sqe */
>>> +	__u8	flags;		/* IOSQE_ flags */
>>> +	__u16	ioprio;		/* ioprio for the request */
>>> +	__s32	fd;		/* file descriptor to do IO on */
>>> +};
>>> +
>>>  struct io_uring_sqe {
>>> +#ifdef __KERNEL__
>>> +	struct io_uring_sqe_hdr	hdr;
>>> +#else
>>>  	__u8	opcode;		/* type of operation for this sqe */
>>>  	__u8	flags;		/* IOSQE_ flags */
>>>  	__u16	ioprio;		/* ioprio for the request */
>>>  	__s32	fd;		/* file descriptor to do IO on */
>>> +#endif
>>>  	union {
>>>  		__u64	off;	/* offset into file */
>>>  		__u64	addr2;
>>
>> Please don't do that ifdef __KERNEL__ mess.  We never guaranteed
>> userspace API compatibility, just ABI compatibility.
> 
> Right, but I'm the one that has to deal with the fallout. For the
> in-kernel one I can skip the __KERNEL__ part, and the layout is the
> same anyway.
> 
>> But we really do have a bigger problem here, and that is ioprio is
>> a field that is specific to the read and write commands and thus
>> should not be in the generic header.  On the other hand the
>> personality is.
>>
>> So I'm not sure trying to retrofit this even makes all that much sense.
>>
>> Maybe we should just define io_uring_sqe_hdr the way it makes
>> sense:
>>
>> struct io_uring_sqe_hdr {
>> 	__u8	opcode;	
>> 	__u8	flags;
>> 	__u16	personality;
>> 	__s32	fd;
>> 	__u64	user_data;
>> };
>>
>> and use that for all new commands going forward while marking the
>> old ones as legacy.
>>
>> io_uring_cmd_sqe would then be:
>>
>> struct io_uring_cmd_sqe {
>>         struct io_uring_sqe_hdr	hdr;
>> 	__u32			ioc;
>> 	__u32 			len;
>> 	__u8			data[40];
>> };
>>
>> for example.  Note the 32-bit opcode just like ioctl to avoid
>> getting into too much trouble due to collisions.
> 
> I was debating that with myself too, it's essentially making
> the existing io_uring_sqe into io_uring_sqe_v1 and then making a new
> v2 one. That would impact _all_ commands, and we'd need some trickery
> to have newly compiled stuff use v2 and have existing applications
> continue to work with the v1 format. That's very different from having
> a single (or new) opcodes use a v2 format, effectively.

I think we should use v0 and v1.

I think io_init_req and io_prep_req could be merged into an io_init_prep_req()
which could then do:

switch (ctx->sqe_version) {
case 0:
      return io_init_prep_req_v0();
case 1:
      return io_init_prep_req_v1();
default:
      return -EINVAL;
}

The kernel would return IORING_FEAT_SQE_V1
and set ctx->sqe_version = 1 if IORING_SETUP_SQE_V1 was passed from
the caller.
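
A compact user-space sketch of that negotiation plus dispatch. The flag
value below is made up (no bit has been assigned to IORING_SETUP_SQE_V1
here), and the mock_* functions are placeholders that just return the
version they handle:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical bit for IORING_SETUP_SQE_V1; no real value is defined. */
#define MOCK_SETUP_SQE_V1	(1U << 0)

struct mock_ctx {
	unsigned int sqe_version;
};

/* Ring setup: pick the SQE format once, based on the caller's flags. */
static void mock_setup(struct mock_ctx *ctx, unsigned int setup_flags)
{
	ctx->sqe_version = (setup_flags & MOCK_SETUP_SQE_V1) ? 1 : 0;
}

/* Placeholders for the per-version init+prep paths. */
static int mock_init_prep_req_v0(void) { return 0; }
static int mock_init_prep_req_v1(void) { return 1; }

/* Merged init+prep entry point: one switch on the per-ring version. */
static int mock_init_prep_req(const struct mock_ctx *ctx)
{
	switch (ctx->sqe_version) {
	case 0:
		return mock_init_prep_req_v0();
	case 1:
		return mock_init_prep_req_v1();
	default:
		return -EINVAL;
	}
}
```

The version is a property of the ring, so the branch is taken the same
way for every request on a given ring.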

liburing would then need to pass struct io_uring *ring to
io_uring_prep_*(), io_uring_sqe_set_flags() and io_uring_sqe_set_data()
in order to use struct io_uring->sq.sqe_version to alter the behavior.
(I think we should also have an io_uring_sqe_set_personality() helper.)

static inline void io_uring_prep_nop(struct io_uring *ring, struct io_uring_sqe *sqe)
{
	struct io_uring_sqe_common *nop = &sqe->common;
	if (ring->sq.sqe_version == 0)
        	io_uring_prep_rw_v0(IORING_OP_NOP, sqe, -1, NULL, 0, 0);
	else
		*nop = (struct io_uring_sqe_common) {
			.hdr = {
				.opcode = IORING_OP_NOP,
			},
		};
}

For new features the prep functions would return a pointer to
the specific structure (see also below).

static inline struct io_uring_sqe_file_cmd *
io_uring_prep_file_cmd(struct io_uring *ring, struct io_uring_sqe *sqe, int fd, uint32_t cmd_opcode)
{
	struct io_uring_sqe_file_cmd *file_cmd = &sqe->file_cmd;

	*file_cmd = (struct io_uring_sqe_file_cmd) {
		.hdr = {
			.opcode = IORING_OP_FILE_CMD,
		},
		.fd = fd,
		.cmd_opcode = cmd_opcode,
	};

	return file_cmd;
}

The application could then also check for a n
In order to test v1 it should have a way to skip IORING_FEAT_SQE_V2
and all existing tests could have a helper function to toggle that
based on an environment variable, so that make runtests could run
each test in both modes.

> Looking into the feasibility of this. But if that is done, there are
> other things that need to be factored in, as I'm not at all interested
> in having a v3 down the line as well. And I'd need to be able to do this
> seamlessly, both from an application point of view, and a performance
> point of view (no stupid conversions inline).


> Things that come up when something like this is on the table
> 
> - Should flags be extended? We're almost out... It hasn't been an
>   issue so far, but seems a bit silly to go v2 and not at least leave
>   a bit of room there. But obviously comes at a cost of losing eg 8
>   bits somewhere else.
> 
> - Is u8 enough for the opcode? Again, we're nowhere near the limits
>   here, but eventually multiplexing might be necessary.
> 
> That's just off the top of my head, probably other things to consider
> too.

What about using something like this:

struct io_uring_sqe_hdr {
 	__u64	user_data;
 	__u16	personality;
 	__u16	opcode;
        __u32   flags;
};

I moved __s32 fd out of it as not all commands need it and some need more than
one. So I guess it's easier to have them in the per-opcode structure,
and the io_file_get() would better be done in the per-opcode prep_vX function.

struct io_uring_sqe_common {
	struct io_uring_sqe_hdr hdr;
	__u8 __reserved[48];
};

struct io_uring_sqe_rw_common {
	struct io_uring_sqe_hdr hdr;
	__s32 fd;        /* file descriptor to do IO on */
	__u32 len;       /* buffer size or number of iovecs */
	__u64 off;       /* offset into file */
	__u64 addr;      /* pointer to buffer or iovecs */
	__kernel_rwf_t   rw_flags;
	__u16 ioprio;    /* ioprio for the request */
	__u16 buf_index; /* index into fixed buffers, if used */
	__u8 __reserved[16];
};

struct io_uring_sqe_file_cmd {
	struct io_uring_sqe_hdr	hdr;
	__s32 fd;           /* file descriptor to do IO on */
	__u32 cmd_opcode;   /* file specific command */
	__u8  cmd_data[40]; /* command specific data */
};

struct io_uring_sqe {
	union {
		struct io_uring_sqe_common common;
		struct io_uring_sqe_common nop;
		struct io_uring_sqe_rw_common readv;
		struct io_uring_sqe_rw_common writev;
		struct io_uring_sqe_rw_common read_fixed;
		struct io_uring_sqe_rw_common write_fixed;
		struct io_uring_sqe_file_cmd file_cmd;
        };
};
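
A user-space sketch to double-check that every variant above still
fills exactly one 64-byte SQE slot. It uses fixed-width types and
mock_ names, and it assumes __kernel_rwf_t is a 32-bit int:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mock_sqe_hdr {
	uint64_t	user_data;
	uint16_t	personality;
	uint16_t	opcode;
	uint32_t	flags;
};

struct mock_sqe_common {
	struct mock_sqe_hdr	hdr;
	uint8_t			reserved[48];
};

struct mock_sqe_rw_common {
	struct mock_sqe_hdr	hdr;
	int32_t			fd;
	uint32_t		len;
	uint64_t		off;
	uint64_t		addr;
	int32_t			rw_flags;	/* assumption: __kernel_rwf_t is 32 bits */
	uint16_t		ioprio;
	uint16_t		buf_index;
	uint8_t			reserved[16];
};

struct mock_sqe_file_cmd {
	struct mock_sqe_hdr	hdr;
	int32_t			fd;
	uint32_t		cmd_opcode;
	uint8_t			cmd_data[40];
};

union mock_sqe {
	struct mock_sqe_common		common;
	struct mock_sqe_rw_common	rw;
	struct mock_sqe_file_cmd	file_cmd;
};

/* 16-byte header, and every per-opcode layout fills the 64-byte slot. */
static_assert(sizeof(struct mock_sqe_hdr) == 16, "hdr size");
static_assert(sizeof(struct mock_sqe_common) == 64, "common size");
static_assert(sizeof(struct mock_sqe_rw_common) == 64, "rw size");
static_assert(sizeof(struct mock_sqe_file_cmd) == 64, "file_cmd size");
static_assert(sizeof(union mock_sqe) == 64, "sqe size");
static_assert(offsetof(struct mock_sqe_file_cmd, cmd_data) == 24,
	      "40 bytes of command data start at offset 24");
```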

Each _opcode_prep() function would then check hdr.flags for unsupported flags
and __reserved for zeros. Instead of having a global io_op_defs[] array,
the _opcode_prep() function would have a static const definition for the opcode
and set req->op_def (which would be const struct io_op_def *op_def).
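
For a NOP-like opcode, the validation part could look roughly like the
following user-space sketch. MOCK_NOP_SUPPORTED_FLAGS and
mock_nop_prep() are invented for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct mock_sqe_hdr {
	uint64_t	user_data;
	uint16_t	personality;
	uint16_t	opcode;
	uint32_t	flags;
};

struct mock_sqe_common {
	struct mock_sqe_hdr	hdr;
	uint8_t			reserved[48];
};

/* Hypothetical: the only flag bit this opcode accepts. */
#define MOCK_NOP_SUPPORTED_FLAGS	0x1u

/* Per-opcode prep: reject unknown flags and non-zero reserved bytes. */
static int mock_nop_prep(const struct mock_sqe_common *sqe)
{
	size_t i;

	if (sqe->hdr.flags & ~MOCK_NOP_SUPPORTED_FLAGS)
		return -EINVAL;

	for (i = 0; i < sizeof(sqe->reserved); i++) {
		if (sqe->reserved[i])
			return -EINVAL;
	}
	return 0;
}
```

Rejecting non-zero reserved bytes up front is what keeps that space
usable for later extensions.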

Does that sound useful in any way?

metze


* Re: [PATCH 1/8] io_uring: split up io_uring_sqe into hdr + main
  2021-03-18 18:40     ` Jens Axboe
  2021-03-19 11:20       ` Stefan Metzmacher
@ 2021-03-19 13:29       ` Christoph Hellwig
  2022-02-24 22:34       ` Luis Chamberlain
  2 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2021-03-19 13:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, io-uring, joshi.k, kbusch, linux-nvme, metze

On Thu, Mar 18, 2021 at 12:40:25PM -0600, Jens Axboe wrote:
> > and use that for all new commands going forward while marking the
> > old ones as legacy.
> > 
> > io_uring_cmd_sqe would then be:
> > 
> > struct io_uring_cmd_sqe {
> >         struct io_uring_sqe_hdr	hdr;
> > 	__u32			ioc;
> > 	__u32 			len;
> > 	__u8			data[40];
> > };
> > 
> > for example.  Note the 32-bit opcode just like ioctl to avoid
> > getting into too much trouble due to collisions.
> 
> I was debating that with myself too, it's essentially making
> the existing io_uring_sqe into io_uring_sqe_v1 and then making a new
> v2 one. That would impact _all_ commands, and we'd need some trickery
> to have newly compiled stuff use v2 and have existing applications
> continue to work with the v1 format. That's very different from having
> a single (or new) opcodes use a v2 format, effectively.

I only proposed it for all new commands because we have so many
existing ones.

> Looking into the feasibility of this. But if that is done, there are
> other things that need to be factored in, as I'm not at all interested
> in having a v3 down the line as well. And I'd need to be able to do this
> seamlessly, both from an application point of view, and a performance
> point of view (no stupid conversions inline).
> 
> Things that come up when something like this is on the table
> 
> - Should flags be extended? We're almost out... It hasn't been an
>   issue so far, but seems a bit silly to go v2 and not at least leave
>   a bit of room there. But obviously comes at a cost of losing eg 8
>   bits somewhere else.
> 
> - Is u8 enough for the opcode? Again, we're nowhere near the limits
>   here, but eventually multiplexing might be necessary.
> 
> That's just off the top of my head, probably other things to consider
> too.

At some point there isn't much left of the common space if we
extend all that, but yeah.



* Re: [PATCH 7/8] net: wire up support for file_operations->uring_cmd()
  2021-03-17 22:10 ` [PATCH 7/8] net: wire up support for file_operations->uring_cmd() Jens Axboe
@ 2022-02-17  1:03   ` Luis Chamberlain
  0 siblings, 0 replies; 25+ messages in thread
From: Luis Chamberlain @ 2022-02-17  1:03 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, joshi.k, hch, kbusch, linux-nvme, metze, mcgrof

On Wed, Mar 17, 2021 at 04:10:26PM -0600, Jens Axboe wrote:
> Pass it through the proto_ops->uring_cmd() handler, so we can plumb it
> through all the way to the proto->uring_cmd() handler.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

Without a user I think this is just a distraction for now, although a
nice proof of concept.

metze,

do we have a user lined up yet? :)

  Luis



* Re: [PATCH 3/8] fs: add file_operations->uring_cmd()
  2021-03-17 22:10 ` [PATCH 3/8] fs: add file_operations->uring_cmd() Jens Axboe
  2021-03-18  5:38   ` Christoph Hellwig
@ 2022-02-17  1:25   ` Luis Chamberlain
  1 sibling, 0 replies; 25+ messages in thread
From: Luis Chamberlain @ 2022-02-17  1:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, joshi.k, hch, kbusch, linux-nvme, metze

On Wed, Mar 17, 2021 at 04:10:22PM -0600, Jens Axboe wrote:
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ec8f3ddf4a6a..009abc668987 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1884,6 +1884,15 @@ struct dir_context {
>  #define REMAP_FILE_ADVISORY		(REMAP_FILE_CAN_SHORTEN)
>  
>  struct iov_iter;
> +struct io_uring_cmd;
> +
> +/*
> + * f_op->uring_cmd() issue flags
> + */

Adding kdoc style comments would be nice.

  Luis



* Re: [PATCH 3/8] fs: add file_operations->uring_cmd()
  2021-03-18  5:38   ` Christoph Hellwig
  2021-03-18 18:41     ` Jens Axboe
@ 2022-02-17  1:27     ` Luis Chamberlain
  1 sibling, 0 replies; 25+ messages in thread
From: Luis Chamberlain @ 2022-02-17  1:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, io-uring, joshi.k, kbusch, linux-nvme, metze

On Thu, Mar 18, 2021 at 06:38:32AM +0100, Christoph Hellwig wrote:
> On Wed, Mar 17, 2021 at 04:10:22PM -0600, Jens Axboe wrote:
> > This is a file private handler, similar to ioctls but hopefully a lot
> > more sane and useful.
> 
> I really hate defining the interface in terms of io_uring.  This really
> is nothing but an async ioctl.

Calling it an ioctl does a disservice to what this is allowing.
Although ioctls might be a first use case, there is nothing tying the
commands to them.

  Luis



* Re: [PATCH 1/8] io_uring: split up io_uring_sqe into hdr + main
  2021-03-18 18:40     ` Jens Axboe
  2021-03-19 11:20       ` Stefan Metzmacher
  2021-03-19 13:29       ` Christoph Hellwig
@ 2022-02-24 22:34       ` Luis Chamberlain
  2 siblings, 0 replies; 25+ messages in thread
From: Luis Chamberlain @ 2022-02-24 22:34 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, io-uring, joshi.k, kbusch, linux-nvme, metze

On Thu, Mar 18, 2021 at 12:40:25PM -0600, Jens Axboe wrote:
> I'm not at all interested
> in having a v3 down the line as well. And I'd need to be able to do this
> seamlessly, both from an application point of view, and a performance
> point of view (no stupid conversions inline).

At this point I've traced the history of attempts at
io-uring "ioctl" work through 3 separate, independent efforts:

2019-12-14: Pavel Begunkov - https://lore.kernel.org/all/f77ac379ddb6a67c3ac6a9dc54430142ead07c6f.1576336565.git.asml.silence@gmail.com/
2020-11-02: Hao Xu - https://lore.kernel.org/all/1604303041-184595-1-git-send-email-haoxu@linux.alibaba.com/
2021-01-27: Kanchan Joshi - https://lore.kernel.org/linux-nvme/20210127150029.13766-1-joshi.k@samsung.com/#r

So clearly there is interest in this moving forward.

On the same day as Joshi's post you posted your file_operations based
implementation. So that's 2 years, 2 months to this day since Pavel's
first patchset... Wouldn't it be a bit too much of a burden to ensure a
v2 will suffice for *all* use cases? If so, adaptability for evolution
sounds like a more fitting goal for the design here. That way
we reduce our requirements and allow for experimentation, while
enabling improvements on future design.

  Luis



Thread overview: 25+ messages
2021-03-17 22:10 [PATCHSET v4 0/8] io_uring passthrough support Jens Axboe
2021-03-17 22:10 ` [PATCH 1/8] io_uring: split up io_uring_sqe into hdr + main Jens Axboe
2021-03-18  5:34   ` Christoph Hellwig
2021-03-18 18:40     ` Jens Axboe
2021-03-19 11:20       ` Stefan Metzmacher
2021-03-19 13:29       ` Christoph Hellwig
2022-02-24 22:34       ` Luis Chamberlain
2021-03-17 22:10 ` [PATCH 2/8] io_uring: add infrastructure around io_uring_cmd_sqe issue type Jens Axboe
2021-03-17 22:10 ` [PATCH 3/8] fs: add file_operations->uring_cmd() Jens Axboe
2021-03-18  5:38   ` Christoph Hellwig
2021-03-18 18:41     ` Jens Axboe
2022-02-17  1:27     ` Luis Chamberlain
2022-02-17  1:25   ` Luis Chamberlain
2021-03-17 22:10 ` [PATCH 4/8] io_uring: add support for IORING_OP_URING_CMD Jens Axboe
2021-03-18  5:42   ` Christoph Hellwig
2021-03-18 18:43     ` Jens Axboe
2021-03-17 22:10 ` [PATCH 5/8] block: wire up support for file_operations->uring_cmd() Jens Axboe
2021-03-18  5:44   ` Christoph Hellwig
2021-03-17 22:10 ` [PATCH 6/8] block: add example ioctl Jens Axboe
2021-03-18  5:45   ` Christoph Hellwig
2021-03-18 12:43     ` Pavel Begunkov
2021-03-18 18:44     ` Jens Axboe
2021-03-17 22:10 ` [PATCH 7/8] net: wire up support for file_operations->uring_cmd() Jens Axboe
2022-02-17  1:03   ` Luis Chamberlain
2021-03-17 22:10 ` [PATCH 8/8] net: add example SOCKET_URING_OP_SIOCINQ/SOCKET_URING_OP_SIOCOUTQ Jens Axboe
