io-uring.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHSET v3] io_uring support for automatic buffers
@ 2020-02-28 20:30 Jens Axboe
  2020-02-28 20:30 ` [PATCH 1/6] io_uring: buffer registration infrastructure Jens Axboe
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Jens Axboe @ 2020-02-28 20:30 UTC (permalink / raw)
  To: io-uring; +Cc: andres

With the poll retry based async IO patchset I posted last week, the one
big missing thing for me was the ability to have automatic buffer
selection. Generally applications that handle tons of sockets like to
poll for activity on them, then issue IO when they become ready. This is
of course at least two system calls, but it also means that it provides
an application a chance to manage how many IO buffers it needs. With the
io_uring based polled IO, the application need only issue an
IORING_OP_RECV (for example, to receive socket data), it doesn't need to
poll at all. However, this means that the application no longer has an
opportune moment to select how many IO buffers to keep in flight, it has
to be equal to what it currently has pending.

I had originally intended to use BPF to provide some means of buffer
selection, but I had a hard time imagining how life times of the buffer
could be managed through that. I had a false start today, but Andres
suggested a nifty approach that also solves the life time issue.

Basically the application registers buffers with the kernel. Each buffer
is registered with a given group ID, and buffer ID. The buffers are
organized by group ID, and the application selects a buffer pool based
on this group ID. One use case might be to group by size. There's an
opcode for this, IORING_OP_PROVIDE_BUFFERS.

IORING_OP_PROVIDE_BUFFERS takes a start address, length of a buffer, and
number of buffers. It also provides a group ID with which these buffers
should be associated, and a starting buffer ID. The buffers are then
added, and the buffer ID is incremented by 1 for each buffer.

With that, when doing the same IORING_OP_RECV, no buffer is passed in
with the request. Instead, it's flagged with IOSQE_BUFFER_SELECT, and
sqe->buf_group is filled in with a valid group ID. When the kernel can
satisfy the receive, a buffer is selected from the specified group ID
pool. If none are available, the IO is terminated with -ENOBUFS. On
success, the buffer ID is passed back through the (CQE) completion
event. This tells the application what specific buffer was used.

A buffer can be used only once. On completion, the application may
choose to free it, or register it again with IORING_OP_PROVIDE_BUFFERS.

Patches can also be found in the below repo:

https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-buf-select

which is sitting on top of for-5.7/io_uring.

v3:
- Add support for IORING_OP_READV and IORING_OP_RECVMSG
- Drop write side
- More cleanups on address space, retained state
- Add a few helpers
- Rebase on new for-5.7/io_uring

v2:
- Cleanup address space
- Fix locking for async offload issue
- Add lockdep annotation for uring_lock
- Verify sqe fields on PROVIDE_BUFFERS prep
- Fix send/recv kbuf leak on import failure
- Fix send/recv error handling on -ENOBUFS
- Change IORING_OP_PROVIDE_BUFFER to PROVIDE_BUFFERS, and allow multiple
  contig buffers in one call

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/6] io_uring: buffer registration infrastructure
  2020-02-28 20:30 [PATCHSET v3] io_uring support for automatic buffers Jens Axboe
@ 2020-02-28 20:30 ` Jens Axboe
  2020-02-28 20:30 ` [PATCH 2/6] io_uring: add IORING_OP_PROVIDE_BUFFERS Jens Axboe
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2020-02-28 20:30 UTC (permalink / raw)
  To: io-uring; +Cc: andres, Jens Axboe

This just prepares the ring for having lists of buffers associated with
it, that the application can provide for SQEs to consume instead of
providing their own.

The buffers are organized by group ID.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 8d8d93adb9c2..f6a0f07e35b5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -195,6 +195,13 @@ struct fixed_file_data {
 	struct completion		done;
 };
 
+struct io_buffer {
+	struct list_head list;
+	__u64 addr;
+	__s32 len;
+	__u16 bid;
+};
+
 struct io_ring_ctx {
 	struct {
 		struct percpu_ref	refs;
@@ -272,6 +279,8 @@ struct io_ring_ctx {
 	struct socket		*ring_sock;
 #endif
 
+	struct idr		io_buffer_idr;
+
 	struct idr		personality_idr;
 
 	struct {
@@ -875,6 +884,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_LIST_HEAD(&ctx->cq_overflow_list);
 	init_completion(&ctx->completions[0]);
 	init_completion(&ctx->completions[1]);
+	idr_init(&ctx->io_buffer_idr);
 	idr_init(&ctx->personality_idr);
 	mutex_init(&ctx->uring_lock);
 	init_waitqueue_head(&ctx->wait);
@@ -6565,6 +6575,28 @@ static int io_eventfd_unregister(struct io_ring_ctx *ctx)
 	return -ENXIO;
 }
 
+static int __io_destroy_buffers(int id, void *p, void *data)
+{
+	struct io_ring_ctx *ctx = data;
+	struct list_head *buf_list = p;
+	struct io_buffer *buf;
+
+	while (!list_empty(buf_list)) {
+		buf = list_first_entry(buf_list, struct io_buffer, list);
+		list_del(&buf->list);
+		kfree(buf);
+	}
+	idr_remove(&ctx->io_buffer_idr, id);
+	kfree(buf_list);
+	return 0;
+}
+
+static void io_destroy_buffers(struct io_ring_ctx *ctx)
+{
+	idr_for_each(&ctx->io_buffer_idr, __io_destroy_buffers, ctx);
+	idr_destroy(&ctx->io_buffer_idr);
+}
+
 static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 {
 	io_finish_async(ctx);
@@ -6575,6 +6607,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	io_sqe_buffer_unregister(ctx);
 	io_sqe_files_unregister(ctx);
 	io_eventfd_unregister(ctx);
+	io_destroy_buffers(ctx);
 	idr_destroy(&ctx->personality_idr);
 
 #if defined(CONFIG_UNIX)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/6] io_uring: add IORING_OP_PROVIDE_BUFFERS
  2020-02-28 20:30 [PATCHSET v3] io_uring support for automatic buffers Jens Axboe
  2020-02-28 20:30 ` [PATCH 1/6] io_uring: buffer registration infrastructure Jens Axboe
@ 2020-02-28 20:30 ` Jens Axboe
  2020-02-29  0:43   ` Pavel Begunkov
                     ` (2 more replies)
  2020-02-28 20:30 ` [PATCH 3/6] io_uring: support buffer selection Jens Axboe
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 22+ messages in thread
From: Jens Axboe @ 2020-02-28 20:30 UTC (permalink / raw)
  To: io-uring; +Cc: andres, Jens Axboe

IORING_OP_PROVIDE_BUFFERS uses the buffer registration infrastructure to
support passing in an addr/len that is associated with a buffer ID and
buffer group ID. The group ID is used to index and lookup the buffers,
while the buffer ID can be used to notify the application which buffer
in the group was used. The addr passed in is the starting buffer address,
and length is each buffer length. A number of buffers to add with can be
specified, in which case addr is incremented by length for each addition,
and each buffer increments the buffer ID specified.

No validation is done of the buffer ID. If the application provides
buffers within the same group with identical buffer IDs, then it'll have
a hard time telling which buffer ID was used. The only restriction is
that the buffer ID can be a max of 16-bits in size, so USHRT_MAX is the
maximum ID that can be used.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 127 +++++++++++++++++++++++++++++++++-
 include/uapi/linux/io_uring.h |   9 ++-
 2 files changed, 133 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index f6a0f07e35b5..d6dc5faf3605 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -447,6 +447,15 @@ struct io_splice {
 	unsigned int			flags;
 };
 
+struct io_provide_buf {
+	struct file			*file;
+	__u64				addr;
+	__s32				len;
+	__u32				gid;
+	__u16				nbufs;
+	__u16				bid;
+};
+
 struct io_async_connect {
 	struct sockaddr_storage		address;
 };
@@ -572,6 +581,7 @@ struct io_kiocb {
 		struct io_madvise	madvise;
 		struct io_epoll		epoll;
 		struct io_splice	splice;
+		struct io_provide_buf	pbuf;
 	};
 
 	struct io_async_ctx		*io;
@@ -799,7 +809,8 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_file		= 1,
 		.hash_reg_file		= 1,
 		.unbound_nonreg_file	= 1,
-	}
+	},
+	[IORING_OP_PROVIDE_BUFFERS] = {},
 };
 
 static void io_wq_submit_work(struct io_wq_work **workptr);
@@ -2801,6 +2812,109 @@ static int io_openat(struct io_kiocb *req, struct io_kiocb **nxt,
 	return io_openat2(req, nxt, force_nonblock);
 }
 
+static int io_provide_buffers_prep(struct io_kiocb *req,
+				   const struct io_uring_sqe *sqe)
+{
+	struct io_provide_buf *p = &req->pbuf;
+	u64 tmp;
+
+	if (sqe->ioprio || sqe->rw_flags)
+		return -EINVAL;
+
+	tmp = READ_ONCE(sqe->fd);
+	if (!tmp || tmp > USHRT_MAX)
+		return -EINVAL;
+	p->nbufs = tmp;
+	p->addr = READ_ONCE(sqe->addr);
+	p->len = READ_ONCE(sqe->len);
+
+	if (!access_ok(u64_to_user_ptr(p->addr), p->len))
+		return -EFAULT;
+
+	p->gid = READ_ONCE(sqe->buf_group);
+	tmp = READ_ONCE(sqe->off);
+	if (tmp > USHRT_MAX)
+		return -EINVAL;
+	p->bid = tmp;
+	return 0;
+}
+
+static int io_add_buffers(struct io_provide_buf *pbuf, struct list_head *list)
+{
+	struct io_buffer *buf;
+	u64 addr = pbuf->addr;
+	int i, bid = pbuf->bid;
+
+	for (i = 0; i < pbuf->nbufs; i++) {
+		buf = kmalloc(sizeof(*buf), GFP_KERNEL);
+		if (!buf)
+			break;
+
+		buf->addr = addr;
+		buf->len = pbuf->len;
+		buf->bid = bid;
+		list_add(&buf->list, list);
+		addr += pbuf->len;
+		bid++;
+	}
+
+	return i;
+}
+
+static int io_provide_buffers(struct io_kiocb *req, struct io_kiocb **nxt,
+			      bool force_nonblock)
+{
+	struct io_provide_buf *p = &req->pbuf;
+	struct io_ring_ctx *ctx = req->ctx;
+	struct list_head *list;
+	int ret = 0;
+
+	/*
+	 * "Normal" inline submissions always hold the uring_lock, since we
+	 * grab it from the system call. Same is true for the SQPOLL offload.
+	 * The only exception is when we've detached the request and issue it
+	 * from an async worker thread, grab the lock for that case.
+	 */
+	if (!force_nonblock)
+		mutex_lock(&ctx->uring_lock);
+
+	lockdep_assert_held(&ctx->uring_lock);
+
+	list = idr_find(&ctx->io_buffer_idr, p->gid);
+	if (!list) {
+		list = kmalloc(sizeof(*list), GFP_KERNEL);
+		if (!list) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		INIT_LIST_HEAD(list);
+		ret = idr_alloc(&ctx->io_buffer_idr, list, p->gid, p->gid + 1,
+					GFP_KERNEL);
+		if (ret < 0) {
+			kfree(list);
+			goto out;
+		}
+	}
+
+	ret = io_add_buffers(p, list);
+	if (!ret) {
+		/* no buffers added and list empty, remove entry */
+		if (list_empty(list)) {
+			idr_remove(&ctx->io_buffer_idr, p->gid);
+			kfree(list);
+		}
+		ret = -ENOMEM;
+	}
+out:
+	if (!force_nonblock)
+		mutex_unlock(&ctx->uring_lock);
+	if (ret < 0)
+		req_set_fail_links(req);
+	io_cqring_add_event(req, ret);
+	io_put_req_find_next(req, nxt);
+	return 0;
+}
+
 static int io_epoll_ctl_prep(struct io_kiocb *req,
 			     const struct io_uring_sqe *sqe)
 {
@@ -4419,6 +4533,9 @@ static int io_req_defer_prep(struct io_kiocb *req,
 	case IORING_OP_SPLICE:
 		ret = io_splice_prep(req, sqe);
 		break;
+	case IORING_OP_PROVIDE_BUFFERS:
+		ret = io_provide_buffers_prep(req, sqe);
+		break;
 	default:
 		printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n",
 				req->opcode);
@@ -4696,6 +4813,14 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		}
 		ret = io_splice(req, nxt, force_nonblock);
 		break;
+	case IORING_OP_PROVIDE_BUFFERS:
+		if (sqe) {
+			ret = io_provide_buffers_prep(req, sqe);
+			if (ret)
+				break;
+		}
+		ret = io_provide_buffers(req, nxt, force_nonblock);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 53b36311cdac..1de1f683cc3c 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -45,8 +45,12 @@ struct io_uring_sqe {
 	__u64	user_data;	/* data to be passed back at completion time */
 	union {
 		struct {
-			/* index into fixed buffers, if used */
-			__u16	buf_index;
+			union {
+				/* index into fixed buffers, if used */
+				__u16	buf_index;
+				/* for grouped buffer selection */
+				__u16	buf_group;
+			};
 			/* personality to use, if used */
 			__u16	personality;
 			__s32	splice_fd_in;
@@ -119,6 +123,7 @@ enum {
 	IORING_OP_OPENAT2,
 	IORING_OP_EPOLL_CTL,
 	IORING_OP_SPLICE,
+	IORING_OP_PROVIDE_BUFFERS,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/6] io_uring: support buffer selection
  2020-02-28 20:30 [PATCHSET v3] io_uring support for automatic buffers Jens Axboe
  2020-02-28 20:30 ` [PATCH 1/6] io_uring: buffer registration infrastructure Jens Axboe
  2020-02-28 20:30 ` [PATCH 2/6] io_uring: add IORING_OP_PROVIDE_BUFFERS Jens Axboe
@ 2020-02-28 20:30 ` Jens Axboe
  2020-02-29 12:21   ` Pavel Begunkov
  2020-03-09 17:21   ` Andres Freund
  2020-02-28 20:30 ` [PATCH 4/6] io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_READV Jens Axboe
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 22+ messages in thread
From: Jens Axboe @ 2020-02-28 20:30 UTC (permalink / raw)
  To: io-uring; +Cc: andres, Jens Axboe

If a server process has tons of pending socket connections, generally
it uses epoll to wait for activity. When the socket is ready for reading
(or writing), the task can select a buffer and issue a recv/send on the
given fd.

Now that we have fast (non-async thread) support, a task can have tons
of pending reads or writes pending. But that means they need buffers to
back that data, and if the number of connections is high enough, having
them preallocated for all possible connections is unfeasible.

With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to
use for any request. The request then sets IOSQE_BUFFER_SELECT in the
sqe, and a given group ID in sqe->buf_group. When the fd becomes ready,
a free buffer from the specified group is selected. If none are
available, the request is terminated with -ENOBUFS. If successful, the
CQE on completion will contain the buffer ID chosen in the cqe->flags
member, encoded as:

	(buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;

Once a buffer has been consumed by a request, it is no longer available
and must be registered again with IORING_OP_PROVIDE_BUFFERS.

Requests need to support this feature. For now, IORING_OP_READ and
IORING_OP_RECV support it. This is checked on SQE submission, a CQE with
res == -EINVAL will be posted if attempted on unsupported requests.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 190 ++++++++++++++++++++++++++++++----
 include/uapi/linux/io_uring.h |  14 +++
 2 files changed, 183 insertions(+), 21 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index d6dc5faf3605..224a452a637f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -395,7 +395,9 @@ struct io_sr_msg {
 		void __user		*buf;
 	};
 	int				msg_flags;
+	int				gid;
 	size_t				len;
+	struct io_buffer		*kbuf;
 };
 
 struct io_open {
@@ -490,6 +492,7 @@ enum {
 	REQ_F_LINK_BIT		= IOSQE_IO_LINK_BIT,
 	REQ_F_HARDLINK_BIT	= IOSQE_IO_HARDLINK_BIT,
 	REQ_F_FORCE_ASYNC_BIT	= IOSQE_ASYNC_BIT,
+	REQ_F_BUFFER_SELECT_BIT	= IOSQE_BUFFER_SELECT_BIT,
 
 	REQ_F_LINK_NEXT_BIT,
 	REQ_F_FAIL_LINK_BIT,
@@ -506,6 +509,7 @@ enum {
 	REQ_F_NEED_CLEANUP_BIT,
 	REQ_F_OVERFLOW_BIT,
 	REQ_F_POLLED_BIT,
+	REQ_F_BUFFER_SELECTED_BIT,
 };
 
 enum {
@@ -519,6 +523,8 @@ enum {
 	REQ_F_HARDLINK		= BIT(REQ_F_HARDLINK_BIT),
 	/* IOSQE_ASYNC */
 	REQ_F_FORCE_ASYNC	= BIT(REQ_F_FORCE_ASYNC_BIT),
+	/* IOSQE_BUFFER_SELECT */
+	REQ_F_BUFFER_SELECT	= BIT(REQ_F_BUFFER_SELECT_BIT),
 
 	/* already grabbed next link */
 	REQ_F_LINK_NEXT		= BIT(REQ_F_LINK_NEXT_BIT),
@@ -550,6 +556,8 @@ enum {
 	REQ_F_OVERFLOW		= BIT(REQ_F_OVERFLOW_BIT),
 	/* already went through poll handler */
 	REQ_F_POLLED		= BIT(REQ_F_POLLED_BIT),
+	/* buffer already selected */
+	REQ_F_BUFFER_SELECTED	= BIT(REQ_F_BUFFER_SELECTED_BIT),
 };
 
 struct async_poll {
@@ -612,6 +620,7 @@ struct io_kiocb {
 			struct callback_head	task_work;
 			struct hlist_node	hash_node;
 			struct async_poll	*apoll;
+			int			cflags;
 		};
 		struct io_wq_work	work;
 	};
@@ -661,6 +670,8 @@ struct io_op_def {
 	/* set if opcode supports polled "wait" */
 	unsigned		pollin : 1;
 	unsigned		pollout : 1;
+	/* op supports buffer selection */
+	unsigned		buffer_select : 1;
 };
 
 static const struct io_op_def io_op_defs[] = {
@@ -770,6 +781,7 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
 		.pollin			= 1,
+		.buffer_select		= 1,
 	},
 	[IORING_OP_WRITE] = {
 		.needs_mm		= 1,
@@ -794,6 +806,7 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
 		.pollin			= 1,
+		.buffer_select		= 1,
 	},
 	[IORING_OP_OPENAT2] = {
 		.needs_file		= 1,
@@ -1181,7 +1194,7 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
 		if (cqe) {
 			WRITE_ONCE(cqe->user_data, req->user_data);
 			WRITE_ONCE(cqe->res, req->result);
-			WRITE_ONCE(cqe->flags, 0);
+			WRITE_ONCE(cqe->flags, req->flags);
 		} else {
 			WRITE_ONCE(ctx->rings->cq_overflow,
 				atomic_inc_return(&ctx->cached_cq_overflow));
@@ -1205,7 +1218,7 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
 	return cqe != NULL;
 }
 
-static void io_cqring_fill_event(struct io_kiocb *req, long res)
+static void __io_cqring_fill_event(struct io_kiocb *req, long res, long cflags)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 	struct io_uring_cqe *cqe;
@@ -1221,7 +1234,7 @@ static void io_cqring_fill_event(struct io_kiocb *req, long res)
 	if (likely(cqe)) {
 		WRITE_ONCE(cqe->user_data, req->user_data);
 		WRITE_ONCE(cqe->res, res);
-		WRITE_ONCE(cqe->flags, 0);
+		WRITE_ONCE(cqe->flags, cflags);
 	} else if (ctx->cq_overflow_flushed) {
 		WRITE_ONCE(ctx->rings->cq_overflow,
 				atomic_inc_return(&ctx->cached_cq_overflow));
@@ -1233,23 +1246,34 @@ static void io_cqring_fill_event(struct io_kiocb *req, long res)
 		req->flags |= REQ_F_OVERFLOW;
 		refcount_inc(&req->refs);
 		req->result = res;
+		req->cflags = cflags;
 		list_add_tail(&req->list, &ctx->cq_overflow_list);
 	}
 }
 
-static void io_cqring_add_event(struct io_kiocb *req, long res)
+static void io_cqring_fill_event(struct io_kiocb *req, long res)
+{
+	__io_cqring_fill_event(req, res, 0);
+}
+
+static void __io_cqring_add_event(struct io_kiocb *req, long res, long cflags)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 	unsigned long flags;
 
 	spin_lock_irqsave(&ctx->completion_lock, flags);
-	io_cqring_fill_event(req, res);
+	__io_cqring_fill_event(req, res, cflags);
 	io_commit_cqring(ctx);
 	spin_unlock_irqrestore(&ctx->completion_lock, flags);
 
 	io_cqring_ev_posted(ctx);
 }
 
+static void io_cqring_add_event(struct io_kiocb *req, long res)
+{
+	__io_cqring_add_event(req, res, 0);
+}
+
 static inline bool io_is_fallback_req(struct io_kiocb *req)
 {
 	return req == (struct io_kiocb *)
@@ -1630,6 +1654,18 @@ static inline bool io_req_multi_free(struct req_batch *rb, struct io_kiocb *req)
 	return true;
 }
 
+static int io_rw_common_cflags(struct io_kiocb *req)
+{
+	struct io_buffer *kbuf = (struct io_buffer *) req->rw.addr;
+	int cflags;
+
+	cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
+	cflags |= IORING_CQE_F_BUFFER;
+	req->rw.addr = 0;
+	kfree(kbuf);
+	return cflags;
+}
+
 /*
  * Find and free completed poll iocbs
  */
@@ -1641,10 +1677,15 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
 
 	rb.to_free = rb.need_iter = 0;
 	while (!list_empty(done)) {
+		int cflags = 0;
+
 		req = list_first_entry(done, struct io_kiocb, list);
 		list_del(&req->list);
 
-		io_cqring_fill_event(req, req->result);
+		if (req->flags & REQ_F_BUFFER_SELECTED)
+			cflags = io_rw_common_cflags(req);
+
+		__io_cqring_fill_event(req, req->result, cflags);
 		(*nr_events)++;
 
 		if (refcount_dec_and_test(&req->refs) &&
@@ -1819,13 +1860,16 @@ static inline void req_set_fail_links(struct io_kiocb *req)
 static void io_complete_rw_common(struct kiocb *kiocb, long res)
 {
 	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);
+	int cflags = 0;
 
 	if (kiocb->ki_flags & IOCB_WRITE)
 		kiocb_end_write(req);
 
 	if (res != req->result)
 		req_set_fail_links(req);
-	io_cqring_add_event(req, res);
+	if (req->flags & REQ_F_BUFFER_SELECTED)
+		cflags = io_rw_common_cflags(req);
+	__io_cqring_add_event(req, res, cflags);
 }
 
 static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
@@ -2014,7 +2058,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 
 	req->rw.addr = READ_ONCE(sqe->addr);
 	req->rw.len = READ_ONCE(sqe->len);
-	/* we own ->private, reuse it for the buffer index */
+	/* we own ->private, reuse it for the buffer index  / buffer ID */
 	req->rw.kiocb.private = (void *) (unsigned long)
 					READ_ONCE(sqe->buf_index);
 	return 0;
@@ -2127,8 +2171,43 @@ static ssize_t io_import_fixed(struct io_kiocb *req, int rw,
 	return len;
 }
 
+static struct io_buffer *io_buffer_select(struct io_kiocb *req, int gid,
+					  struct io_buffer *kbuf,
+					  bool needs_lock)
+{
+	struct list_head *list;
+
+	if (req->flags & REQ_F_BUFFER_SELECTED)
+		return kbuf;
+
+	/*
+	 * "Normal" inline submissions always hold the uring_lock, since we
+	 * grab it from the system call. Same is true for the SQPOLL offload.
+	 * The only exception is when we've detached the request and issue it
+	 * from an async worker thread, grab the lock for that case.
+	 */
+	if (needs_lock)
+		mutex_lock(&req->ctx->uring_lock);
+
+	lockdep_assert_held(&req->ctx->uring_lock);
+
+	list = idr_find(&req->ctx->io_buffer_idr, gid);
+	if (list && !list_empty(list)) {
+		kbuf = list_first_entry(list, struct io_buffer, list);
+		list_del(&kbuf->list);
+	} else {
+		kbuf = ERR_PTR(-ENOBUFS);
+	}
+
+	if (needs_lock)
+		mutex_unlock(&req->ctx->uring_lock);
+
+	return kbuf;
+}
+
 static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
-			       struct iovec **iovec, struct iov_iter *iter)
+			       struct iovec **iovec, struct iov_iter *iter,
+			       bool needs_lock)
 {
 	void __user *buf = u64_to_user_ptr(req->rw.addr);
 	size_t sqe_len = req->rw.len;
@@ -2140,12 +2219,30 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
 		return io_import_fixed(req, rw, iter);
 	}
 
-	/* buffer index only valid with fixed read/write */
-	if (req->rw.kiocb.private)
+	/* buffer index only valid with fixed read/write, or buffer select  */
+	if (req->rw.kiocb.private && !(req->flags & REQ_F_BUFFER_SELECT))
 		return -EINVAL;
 
 	if (opcode == IORING_OP_READ || opcode == IORING_OP_WRITE) {
 		ssize_t ret;
+
+		if (req->flags & REQ_F_BUFFER_SELECT) {
+			struct io_buffer *kbuf = (struct io_buffer *) req->rw.addr;
+			int gid;
+
+			gid = (int) (unsigned long) req->rw.kiocb.private;
+			kbuf = io_buffer_select(req, gid, kbuf, needs_lock);
+			if (IS_ERR(kbuf)) {
+				*iovec = NULL;
+				return PTR_ERR(kbuf);
+			}
+			req->rw.addr = (u64) kbuf;
+			if (sqe_len > kbuf->len)
+				sqe_len = kbuf->len;
+			req->flags |= REQ_F_BUFFER_SELECTED;
+			buf = u64_to_user_ptr(kbuf->addr);
+		}
+
 		ret = import_single_range(rw, buf, sqe_len, *iovec, iter);
 		*iovec = NULL;
 		return ret < 0 ? ret : sqe_len;
@@ -2288,7 +2385,7 @@ static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	io = req->io;
 	io->rw.iov = io->rw.fast_iov;
 	req->io = NULL;
-	ret = io_import_iovec(READ, req, &io->rw.iov, &iter);
+	ret = io_import_iovec(READ, req, &io->rw.iov, &iter, !force_nonblock);
 	req->io = io;
 	if (ret < 0)
 		return ret;
@@ -2306,7 +2403,7 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt,
 	size_t iov_count;
 	ssize_t io_size, ret;
 
-	ret = io_import_iovec(READ, req, &iovec, &iter);
+	ret = io_import_iovec(READ, req, &iovec, &iter, !force_nonblock);
 	if (ret < 0)
 		return ret;
 
@@ -2378,7 +2475,7 @@ static int io_write_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	io = req->io;
 	io->rw.iov = io->rw.fast_iov;
 	req->io = NULL;
-	ret = io_import_iovec(WRITE, req, &io->rw.iov, &iter);
+	ret = io_import_iovec(WRITE, req, &io->rw.iov, &iter, !force_nonblock);
 	req->io = io;
 	if (ret < 0)
 		return ret;
@@ -2396,7 +2493,7 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt,
 	size_t iov_count;
 	ssize_t ret, io_size;
 
-	ret = io_import_iovec(WRITE, req, &iovec, &iter);
+	ret = io_import_iovec(WRITE, req, &iovec, &iter, !force_nonblock);
 	if (ret < 0)
 		return ret;
 
@@ -3403,6 +3500,29 @@ static int io_send(struct io_kiocb *req, struct io_kiocb **nxt,
 #endif
 }
 
+static struct io_buffer *io_recv_buffer_select(struct io_kiocb *req,
+					       int *cflags, bool needs_lock)
+{
+	struct io_sr_msg *sr = &req->sr_msg;
+	struct io_buffer *kbuf;
+
+	if (!(req->flags & REQ_F_BUFFER_SELECT))
+		return NULL;
+
+	kbuf = io_buffer_select(req, sr->gid, sr->kbuf, needs_lock);
+	if (IS_ERR(kbuf))
+		return kbuf;
+
+	sr->kbuf = kbuf;
+	if (sr->len > kbuf->len)
+		sr->len = kbuf->len;
+	req->flags |= REQ_F_BUFFER_SELECTED;
+
+	*cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
+	*cflags |= IORING_CQE_F_BUFFER;
+	return kbuf;
+}
+
 static int io_recvmsg_prep(struct io_kiocb *req,
 			   const struct io_uring_sqe *sqe)
 {
@@ -3414,6 +3534,7 @@ static int io_recvmsg_prep(struct io_kiocb *req,
 	sr->msg_flags = READ_ONCE(sqe->msg_flags);
 	sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr));
 	sr->len = READ_ONCE(sqe->len);
+	sr->gid = READ_ONCE(sqe->buf_group);
 
 #ifdef CONFIG_COMPAT
 	if (req->ctx->compat)
@@ -3505,8 +3626,9 @@ static int io_recv(struct io_kiocb *req, struct io_kiocb **nxt,
 		   bool force_nonblock)
 {
 #if defined(CONFIG_NET)
+	struct io_buffer *kbuf = NULL;
 	struct socket *sock;
-	int ret;
+	int ret, cflags = 0;
 
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
@@ -3514,15 +3636,25 @@ static int io_recv(struct io_kiocb *req, struct io_kiocb **nxt,
 	sock = sock_from_file(req->file, &ret);
 	if (sock) {
 		struct io_sr_msg *sr = &req->sr_msg;
+		void __user *buf = sr->buf;
 		struct msghdr msg;
 		struct iovec iov;
 		unsigned flags;
 
-		ret = import_single_range(READ, sr->buf, sr->len, &iov,
+		kbuf = io_recv_buffer_select(req, &cflags, !force_nonblock);
+		if (IS_ERR(kbuf))
+			return PTR_ERR(kbuf);
+		else if (kbuf)
+			buf = u64_to_user_ptr(kbuf->addr);
+
+		ret = import_single_range(READ, buf, sr->len, &iov,
 						&msg.msg_iter);
-		if (ret)
+		if (ret) {
+			kfree(kbuf);
 			return ret;
+		}
 
+		req->flags |= REQ_F_NEED_CLEANUP;
 		msg.msg_name = NULL;
 		msg.msg_control = NULL;
 		msg.msg_controllen = 0;
@@ -3543,7 +3675,9 @@ static int io_recv(struct io_kiocb *req, struct io_kiocb **nxt,
 			ret = -EINTR;
 	}
 
-	io_cqring_add_event(req, ret);
+	kfree(kbuf);
+	req->flags &= ~REQ_F_NEED_CLEANUP;
+	__io_cqring_add_event(req, ret, cflags);
 	if (ret < 0)
 		req_set_fail_links(req);
 	io_put_req_find_next(req, nxt);
@@ -4585,6 +4719,8 @@ static void io_cleanup_req(struct io_kiocb *req)
 	case IORING_OP_WRITEV:
 	case IORING_OP_WRITE_FIXED:
 	case IORING_OP_WRITE:
+		if (req->flags & REQ_F_BUFFER_SELECTED)
+			kfree((void *)req->rw.addr);
 		if (io->rw.iov != io->rw.fast_iov)
 			kfree(io->rw.iov);
 		break;
@@ -4593,6 +4729,10 @@ static void io_cleanup_req(struct io_kiocb *req)
 		if (io->msg.iov != io->msg.fast_iov)
 			kfree(io->msg.iov);
 		break;
+	case IORING_OP_RECV:
+		if (req->flags & REQ_F_BUFFER_SELECTED)
+			kfree(req->sr_msg.kbuf);
+		break;
 	case IORING_OP_OPENAT:
 	case IORING_OP_OPENAT2:
 	case IORING_OP_STATX:
@@ -5177,7 +5317,8 @@ static inline void io_queue_link_head(struct io_kiocb *req)
 }
 
 #define SQE_VALID_FLAGS	(IOSQE_FIXED_FILE|IOSQE_IO_DRAIN|IOSQE_IO_LINK|	\
-				IOSQE_IO_HARDLINK | IOSQE_ASYNC)
+				IOSQE_IO_HARDLINK | IOSQE_ASYNC | \
+				IOSQE_BUFFER_SELECT)
 
 static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 			  struct io_submit_state *state, struct io_kiocb **link)
@@ -5194,6 +5335,12 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		goto err_req;
 	}
 
+	if ((sqe_flags & IOSQE_BUFFER_SELECT) &&
+	    !io_op_defs[req->opcode].buffer_select) {
+		ret = -EINVAL;
+		goto err_req;
+	}
+
 	id = READ_ONCE(sqe->personality);
 	if (id) {
 		req->work.creds = idr_find(&ctx->personality_idr, id);
@@ -5206,7 +5353,8 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 
 	/* same numerical values with corresponding REQ_F_*, safe to copy */
 	req->flags |= sqe_flags & (IOSQE_IO_DRAIN | IOSQE_IO_HARDLINK |
-					IOSQE_ASYNC | IOSQE_FIXED_FILE);
+					IOSQE_ASYNC | IOSQE_FIXED_FILE |
+					IOSQE_BUFFER_SELECT);
 
 	ret = io_req_set_file(state, req, sqe);
 	if (unlikely(ret)) {
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 1de1f683cc3c..36ecd1f8d5d3 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -65,6 +65,7 @@ enum {
 	IOSQE_IO_LINK_BIT,
 	IOSQE_IO_HARDLINK_BIT,
 	IOSQE_ASYNC_BIT,
+	IOSQE_BUFFER_SELECT_BIT,
 };
 
 /*
@@ -80,6 +81,8 @@ enum {
 #define IOSQE_IO_HARDLINK	(1U << IOSQE_IO_HARDLINK_BIT)
 /* always go async */
 #define IOSQE_ASYNC		(1U << IOSQE_ASYNC_BIT)
+/* select buffer from sqe->buf_group */
+#define IOSQE_BUFFER_SELECT	(1U << IOSQE_BUFFER_SELECT_BIT)
 
 /*
  * io_uring_setup() flags
@@ -154,6 +157,17 @@ struct io_uring_cqe {
 	__u32	flags;
 };
 
+/*
+ * cqe->flags
+ *
+ * IORING_CQE_F_BUFFER	If set, the upper 16 bits are the buffer ID
+ */
+#define IORING_CQE_F_BUFFER		(1U << 0)
+
+enum {
+	IORING_CQE_BUFFER_SHIFT		= 16,
+};
+
 /*
  * Magic offsets for the application to mmap the data it needs
  */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/6] io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_READV
  2020-02-28 20:30 [PATCHSET v3] io_uring support for automatic buffers Jens Axboe
                   ` (2 preceding siblings ...)
  2020-02-28 20:30 ` [PATCH 3/6] io_uring: support buffer selection Jens Axboe
@ 2020-02-28 20:30 ` Jens Axboe
  2020-02-28 20:30 ` [PATCH 5/6] net: abstract out normal and compat msghdr import Jens Axboe
  2020-02-28 20:30 ` [PATCH 6/6] io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_RECVMSG Jens Axboe
  5 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2020-02-28 20:30 UTC (permalink / raw)
  To: io-uring; +Cc: andres, Jens Axboe

This adds support for the vectored read. This is limited to supporting
just 1 segment in the iov, and is provided just for convenience for
applications that use IORING_OP_READV already.

The iov helpers will be used for IORING_OP_RECVMSG as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 110 +++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 96 insertions(+), 14 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 224a452a637f..561120460422 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -682,6 +682,7 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
 		.pollin			= 1,
+		.buffer_select		= 1,
 	},
 	[IORING_OP_WRITEV] = {
 		.async_ctx		= 1,
@@ -2205,12 +2206,96 @@ static struct io_buffer *io_buffer_select(struct io_kiocb *req, int gid,
 	return kbuf;
 }
 
+static void __user *io_rw_buffer_select(struct io_kiocb *req, size_t *len,
+					bool needs_lock)
+{
+	struct io_buffer *kbuf = (struct io_buffer *) req->rw.addr;
+	int gid;
+
+	gid = (int) (unsigned long) req->rw.kiocb.private;
+	kbuf = io_buffer_select(req, gid, kbuf, needs_lock);
+	if (IS_ERR(kbuf))
+		return kbuf;
+	req->rw.addr = (u64) kbuf;
+	if (*len > kbuf->len)
+		*len = kbuf->len;
+	req->flags |= REQ_F_BUFFER_SELECTED;
+	return u64_to_user_ptr(kbuf->addr);
+}
+
+#ifdef CONFIG_COMPAT
+static ssize_t io_compat_import(struct io_kiocb *req, struct iovec *iov,
+				bool needs_lock)
+{
+	struct compat_iovec __user *uiov;
+	compat_ssize_t clen;
+	void __user *buf;
+	ssize_t len;
+
+	uiov = u64_to_user_ptr(req->rw.addr);
+	if (!access_ok(uiov, sizeof(*uiov)))
+		return -EFAULT;
+	if (__get_user(clen, &uiov->iov_len))
+		return -EFAULT;
+	if (clen < 0)
+		return -EINVAL;
+
+	len = clen;
+	buf = io_rw_buffer_select(req, &len, needs_lock);
+	if (IS_ERR(buf))
+		return PTR_ERR(buf);
+	iov[0].iov_base = buf;
+	iov[0].iov_len = (compat_size_t) len;
+	return 0;
+}
+#endif
+
+static ssize_t __io_iov_buffer_select(struct io_kiocb *req, struct iovec *iov,
+				      bool needs_lock)
+{
+	struct iovec __user *uiov = u64_to_user_ptr(req->rw.addr);
+	void __user *buf;
+	ssize_t len;
+
+	if (copy_from_user(iov, uiov, sizeof(*uiov)))
+		return -EFAULT;
+
+	len = iov[0].iov_len;
+	if (len < 0)
+		return -EINVAL;
+	buf = io_rw_buffer_select(req, &len, needs_lock);
+	if (IS_ERR(buf))
+		return PTR_ERR(buf);
+	iov[0].iov_base = buf;
+	iov[0].iov_len = len;
+	return 0;
+}
+
+static ssize_t io_iov_buffer_select(struct io_kiocb *req, struct iovec *iov,
+				    bool needs_lock)
+{
+	if (req->flags & REQ_F_BUFFER_SELECTED)
+		return 0;
+	if (!req->rw.len)
+		return 0;
+	else if (req->rw.len > 1)
+		return -EINVAL;
+
+#ifdef CONFIG_COMPAT
+	if (req->ctx->compat)
+		return io_compat_import(req, iov, needs_lock);
+#endif
+
+	return __io_iov_buffer_select(req, iov, needs_lock);
+}
+
 static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
 			       struct iovec **iovec, struct iov_iter *iter,
 			       bool needs_lock)
 {
 	void __user *buf = u64_to_user_ptr(req->rw.addr);
 	size_t sqe_len = req->rw.len;
+	ssize_t ret;
 	u8 opcode;
 
 	opcode = req->opcode;
@@ -2224,23 +2309,12 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
 		return -EINVAL;
 
 	if (opcode == IORING_OP_READ || opcode == IORING_OP_WRITE) {
-		ssize_t ret;
-
 		if (req->flags & REQ_F_BUFFER_SELECT) {
-			struct io_buffer *kbuf = (struct io_buffer *) req->rw.addr;
-			int gid;
-
-			gid = (int) (unsigned long) req->rw.kiocb.private;
-			kbuf = io_buffer_select(req, gid, kbuf, needs_lock);
-			if (IS_ERR(kbuf)) {
+			buf = io_rw_buffer_select(req, &sqe_len, needs_lock);
+			if (IS_ERR(buf)) {
 				*iovec = NULL;
-				return PTR_ERR(kbuf);
+				return PTR_ERR(buf);
 			}
-			req->rw.addr = (u64) kbuf;
-			if (sqe_len > kbuf->len)
-				sqe_len = kbuf->len;
-			req->flags |= REQ_F_BUFFER_SELECTED;
-			buf = u64_to_user_ptr(kbuf->addr);
 		}
 
 		ret = import_single_range(rw, buf, sqe_len, *iovec, iter);
@@ -2258,6 +2332,14 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
 		return iorw->size;
 	}
 
+	if (req->flags & REQ_F_BUFFER_SELECT) {
+		ret = io_iov_buffer_select(req, *iovec, needs_lock);
+		if (!ret)
+			iov_iter_init(iter, rw, *iovec, 1, (*iovec)->iov_len);
+		*iovec = NULL;
+		return ret;
+	}
+
 #ifdef CONFIG_COMPAT
 	if (req->ctx->compat)
 		return compat_import_iovec(rw, buf, sqe_len, UIO_FASTIOV,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 5/6] net: abstract out normal and compat msghdr import
  2020-02-28 20:30 [PATCHSET v3] io_uring support for automatic buffers Jens Axboe
                   ` (3 preceding siblings ...)
  2020-02-28 20:30 ` [PATCH 4/6] io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_READV Jens Axboe
@ 2020-02-28 20:30 ` Jens Axboe
  2020-02-28 20:30 ` [PATCH 6/6] io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_RECVMSG Jens Axboe
  5 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2020-02-28 20:30 UTC (permalink / raw)
  To: io-uring; +Cc: andres, Jens Axboe

This splits it into two parts, one that imports the message, and one
that imports the iovec. This allows a caller to only do the first part,
and import the iovec manually afterwards.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/socket.h |  4 ++++
 include/net/compat.h   |  3 +++
 net/compat.c           | 30 +++++++++++++++++++++++-------
 net/socket.c           | 25 +++++++++++++++++++++----
 4 files changed, 51 insertions(+), 11 deletions(-)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 2d2313403101..fc59ac825561 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -391,6 +391,10 @@ extern int recvmsg_copy_msghdr(struct msghdr *msg,
 			       struct user_msghdr __user *umsg, unsigned flags,
 			       struct sockaddr __user **uaddr,
 			       struct iovec **iov);
+extern int __copy_msghdr_from_user(struct msghdr *kmsg,
+				   struct user_msghdr __user *umsg,
+				   struct sockaddr __user **save_addr,
+				   struct iovec __user **uiov, size_t *nsegs);
 
 /* helpers which do the actual work for syscalls */
 extern int __sys_recvfrom(int fd, void __user *ubuf, size_t size,
diff --git a/include/net/compat.h b/include/net/compat.h
index f277653c7e17..e341260642fe 100644
--- a/include/net/compat.h
+++ b/include/net/compat.h
@@ -38,6 +38,9 @@ struct compat_cmsghdr {
 #define compat_mmsghdr	mmsghdr
 #endif /* defined(CONFIG_COMPAT) */
 
+int __get_compat_msghdr(struct msghdr *kmsg, struct compat_msghdr __user *umsg,
+			struct sockaddr __user **save_addr, compat_uptr_t *ptr,
+			compat_size_t *len);
 int get_compat_msghdr(struct msghdr *, struct compat_msghdr __user *,
 		      struct sockaddr __user **, struct iovec **);
 struct sock_fprog __user *get_compat_bpf_fprog(char __user *optval);
diff --git a/net/compat.c b/net/compat.c
index 47d99c784947..4bed96e84d9a 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -33,10 +33,10 @@
 #include <linux/uaccess.h>
 #include <net/compat.h>
 
-int get_compat_msghdr(struct msghdr *kmsg,
-		      struct compat_msghdr __user *umsg,
-		      struct sockaddr __user **save_addr,
-		      struct iovec **iov)
+int __get_compat_msghdr(struct msghdr *kmsg,
+			struct compat_msghdr __user *umsg,
+			struct sockaddr __user **save_addr,
+			compat_uptr_t *ptr, compat_size_t *len)
 {
 	struct compat_msghdr msg;
 	ssize_t err;
@@ -79,10 +79,26 @@ int get_compat_msghdr(struct msghdr *kmsg,
 		return -EMSGSIZE;
 
 	kmsg->msg_iocb = NULL;
+	*ptr = msg.msg_iov;
+	*len = msg.msg_iovlen;
+	return 0;
+}
+
+int get_compat_msghdr(struct msghdr *kmsg,
+		      struct compat_msghdr __user *umsg,
+		      struct sockaddr __user **save_addr,
+		      struct iovec **iov)
+{
+	compat_uptr_t ptr;
+	compat_size_t len;
+	ssize_t err;
+
+	err = __get_compat_msghdr(kmsg, umsg, save_addr, &ptr, &len);
+	if (err)
+		return err;
 
-	err = compat_import_iovec(save_addr ? READ : WRITE,
-				   compat_ptr(msg.msg_iov), msg.msg_iovlen,
-				   UIO_FASTIOV, iov, &kmsg->msg_iter);
+	err = compat_import_iovec(save_addr ? READ : WRITE, compat_ptr(ptr),
+				   len, UIO_FASTIOV, iov, &kmsg->msg_iter);
 	return err < 0 ? err : 0;
 }
 
diff --git a/net/socket.c b/net/socket.c
index b79a05de7c6e..70ede74ab24b 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2226,10 +2226,10 @@ struct used_address {
 	unsigned int name_len;
 };
 
-static int copy_msghdr_from_user(struct msghdr *kmsg,
-				 struct user_msghdr __user *umsg,
-				 struct sockaddr __user **save_addr,
-				 struct iovec **iov)
+int __copy_msghdr_from_user(struct msghdr *kmsg,
+			    struct user_msghdr __user *umsg,
+			    struct sockaddr __user **save_addr,
+			    struct iovec __user **uiov, size_t *nsegs)
 {
 	struct user_msghdr msg;
 	ssize_t err;
@@ -2271,6 +2271,23 @@ static int copy_msghdr_from_user(struct msghdr *kmsg,
 		return -EMSGSIZE;
 
 	kmsg->msg_iocb = NULL;
+	*uiov = msg.msg_iov;
+	*nsegs = msg.msg_iovlen;
+	return 0;
+}
+
+static int copy_msghdr_from_user(struct msghdr *kmsg,
+				 struct user_msghdr __user *umsg,
+				 struct sockaddr __user **save_addr,
+				 struct iovec **iov)
+{
+	struct user_msghdr msg;
+	ssize_t err;
+
+	err = __copy_msghdr_from_user(kmsg, umsg, save_addr, &msg.msg_iov,
+					&msg.msg_iovlen);
+	if (err)
+		return err;
 
 	err = import_iovec(save_addr ? READ : WRITE,
 			    msg.msg_iov, msg.msg_iovlen,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 6/6] io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_RECVMSG
  2020-02-28 20:30 [PATCHSET v3] io_uring support for automatic buffers Jens Axboe
                   ` (4 preceding siblings ...)
  2020-02-28 20:30 ` [PATCH 5/6] net: abstract out normal and compat msghdr import Jens Axboe
@ 2020-02-28 20:30 ` Jens Axboe
  5 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2020-02-28 20:30 UTC (permalink / raw)
  To: io-uring; +Cc: andres, Jens Axboe

Like IORING_OP_READV, this is limited to supporting just a single
segment in the iovec passed in.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 115 ++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 103 insertions(+), 12 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 561120460422..fadffe21d4da 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -44,6 +44,7 @@
 #include <linux/errno.h>
 #include <linux/syscalls.h>
 #include <linux/compat.h>
+#include <net/compat.h>
 #include <linux/refcount.h>
 #include <linux/uio.h>
 #include <linux/bits.h>
@@ -729,6 +730,7 @@ static const struct io_op_def io_op_defs[] = {
 		.unbound_nonreg_file	= 1,
 		.needs_fs		= 1,
 		.pollin			= 1,
+		.buffer_select		= 1,
 	},
 	[IORING_OP_TIMEOUT] = {
 		.async_ctx		= 1,
@@ -3582,6 +3584,90 @@ static int io_send(struct io_kiocb *req, struct io_kiocb **nxt,
 #endif
 }
 
+static int __io_recvmsg_copy_hdr(struct io_kiocb *req, struct io_async_ctx *io)
+{
+	struct io_sr_msg *sr = &req->sr_msg;
+	struct iovec __user *uiov;
+	size_t iov_len;
+	int ret;
+
+	ret = __copy_msghdr_from_user(&io->msg.msg, sr->msg, &io->msg.uaddr,
+					&uiov, &iov_len);
+	if (ret)
+		return ret;
+
+	if (req->flags & REQ_F_BUFFER_SELECT) {
+		if (iov_len > 1)
+			return -EINVAL;
+		if (copy_from_user(io->msg.iov, uiov, sizeof(*uiov)))
+			return -EFAULT;
+		sr->len = io->msg.iov[0].iov_len;
+		iov_iter_init(&io->msg.msg.msg_iter, READ, io->msg.iov, 1,
+				sr->len);
+		io->msg.iov = NULL;
+	} else {
+		ret = import_iovec(READ, uiov, iov_len, UIO_FASTIOV,
+					&io->msg.iov, &io->msg.msg.msg_iter);
+		if (ret > 0)
+			ret = 0;
+	}
+
+	return ret;
+}
+
+#ifdef CONFIG_COMPAT
+static int __io_compat_recvmsg_copy_hdr(struct io_kiocb *req,
+					struct io_async_ctx *io)
+{
+	struct compat_msghdr __user *msg_compat;
+	struct io_sr_msg *sr = &req->sr_msg;
+	struct compat_iovec __user *uiov;
+	compat_uptr_t ptr;
+	compat_size_t len;
+	int ret;
+
+	msg_compat = (struct compat_msghdr __user *) sr->msg;
+	ret = __get_compat_msghdr(&io->msg.msg, msg_compat, &io->msg.uaddr,
+					&ptr, &len);
+
+	uiov = compat_ptr(ptr);
+	if (req->flags & REQ_F_BUFFER_SELECT) {
+		compat_ssize_t clen;
+
+		if (len > 1)
+			return -EINVAL;
+		if (!access_ok(uiov, sizeof(*uiov)))
+			return -EFAULT;
+		if (__get_user(clen, &uiov->iov_len))
+			return -EFAULT;
+		if (clen < 0)
+			return -EINVAL;
+		sr->len = io->msg.iov[0].iov_len;
+		io->msg.iov = NULL;
+	} else {
+		ret = compat_import_iovec(READ, uiov, len, UIO_FASTIOV,
+						&io->msg.iov,
+						&io->msg.msg.msg_iter);
+		if (ret > 0)
+			ret = 0;
+	}
+
+	return 0;
+}
+#endif
+
+static int io_recvmsg_copy_hdr(struct io_kiocb *req, struct io_async_ctx *io)
+{
+	io->msg.iov = io->msg.fast_iov;
+
+#ifdef CONFIG_COMPAT
+	if (req->ctx->compat)
+		return __io_compat_recvmsg_copy_hdr(req, io);
+#endif
+
+	return __io_recvmsg_copy_hdr(req, io);
+}
+
 static struct io_buffer *io_recv_buffer_select(struct io_kiocb *req,
 					       int *cflags, bool needs_lock)
 {
@@ -3629,9 +3715,7 @@ static int io_recvmsg_prep(struct io_kiocb *req,
 	if (req->flags & REQ_F_NEED_CLEANUP)
 		return 0;
 
-	io->msg.iov = io->msg.fast_iov;
-	ret = recvmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags,
-					&io->msg.uaddr, &io->msg.iov);
+	ret = io_recvmsg_copy_hdr(req, io);
 	if (!ret)
 		req->flags |= REQ_F_NEED_CLEANUP;
 	return ret;
@@ -3646,13 +3730,14 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt,
 #if defined(CONFIG_NET)
 	struct io_async_msghdr *kmsg = NULL;
 	struct socket *sock;
-	int ret;
+	int ret, cflags = 0;
 
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
 
 	sock = sock_from_file(req->file, &ret);
 	if (sock) {
+		struct io_buffer *kbuf;
 		struct io_async_ctx io;
 		unsigned flags;
 
@@ -3664,19 +3749,23 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt,
 				kmsg->iov = kmsg->fast_iov;
 			kmsg->msg.msg_iter.iov = kmsg->iov;
 		} else {
-			struct io_sr_msg *sr = &req->sr_msg;
-
 			kmsg = &io.msg;
 			kmsg->msg.msg_name = &io.msg.addr;
 
-			io.msg.iov = io.msg.fast_iov;
-			ret = recvmsg_copy_msghdr(&io.msg.msg, sr->msg,
-					sr->msg_flags, &io.msg.uaddr,
-					&io.msg.iov);
+			ret = io_recvmsg_copy_hdr(req, &io);
 			if (ret)
 				return ret;
 		}
 
+		kbuf = io_recv_buffer_select(req, &cflags, !force_nonblock);
+		if (IS_ERR(kbuf)) {
+			return PTR_ERR(kbuf);
+		} else if (kbuf) {
+			kmsg->fast_iov[0].iov_base = u64_to_user_ptr(kbuf->addr);
+			iov_iter_init(&kmsg->msg.msg_iter, READ, kmsg->iov,
+					1, req->sr_msg.len);
+		}
+
 		flags = req->sr_msg.msg_flags;
 		if (flags & MSG_DONTWAIT)
 			req->flags |= REQ_F_NOWAIT;
@@ -3694,7 +3783,7 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt,
 	if (kmsg && kmsg->iov != kmsg->fast_iov)
 		kfree(kmsg->iov);
 	req->flags &= ~REQ_F_NEED_CLEANUP;
-	io_cqring_add_event(req, ret);
+	__io_cqring_add_event(req, ret, cflags);
 	if (ret < 0)
 		req_set_fail_links(req);
 	io_put_req_find_next(req, nxt);
@@ -4806,8 +4895,10 @@ static void io_cleanup_req(struct io_kiocb *req)
 		if (io->rw.iov != io->rw.fast_iov)
 			kfree(io->rw.iov);
 		break;
-	case IORING_OP_SENDMSG:
 	case IORING_OP_RECVMSG:
+		if (req->flags & REQ_F_BUFFER_SELECTED)
+			kfree(req->sr_msg.kbuf);
+	case IORING_OP_SENDMSG:
 		if (io->msg.iov != io->msg.fast_iov)
 			kfree(io->msg.iov);
 		break;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/6] io_uring: add IORING_OP_PROVIDE_BUFFERS
  2020-02-28 20:30 ` [PATCH 2/6] io_uring: add IORING_OP_PROVIDE_BUFFERS Jens Axboe
@ 2020-02-29  0:43   ` Pavel Begunkov
  2020-02-29  4:50     ` Jens Axboe
  2020-02-29 12:08   ` Pavel Begunkov
  2020-03-09 17:03   ` Andres Freund
  2 siblings, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2020-02-29  0:43 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: andres


[-- Attachment #1.1: Type: text/plain, Size: 3399 bytes --]



On 28/02/2020 23:30, Jens Axboe wrote:
> IORING_OP_PROVIDE_BUFFERS uses the buffer registration infrastructure to
> support passing in an addr/len that is associated with a buffer ID and
> buffer group ID. The group ID is used to index and lookup the buffers,
> while the buffer ID can be used to notify the application which buffer
> in the group was used. The addr passed in is the starting buffer address,
> and length is each buffer length. A number of buffers to add with can be
> specified, in which case addr is incremented by length for each addition,
> and each buffer increments the buffer ID specified.
> 
> No validation is done of the buffer ID. If the application provides
> buffers within the same group with identical buffer IDs, then it'll have
> a hard time telling which buffer ID was used. The only restriction is
> that the buffer ID can be a max of 16-bits in size, so USHRT_MAX is the
> maximum ID that can be used.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>


> +
> +static int io_add_buffers(struct io_provide_buf *pbuf, struct list_head *list)
> +{
> +	struct io_buffer *buf;
> +	u64 addr = pbuf->addr;
> +	int i, bid = pbuf->bid;
> +
> +	for (i = 0; i < pbuf->nbufs; i++) {
> +		buf = kmalloc(sizeof(*buf), GFP_KERNEL);
> +		if (!buf)
> +			break;
> +
> +		buf->addr = addr;
> +		buf->len = pbuf->len;
> +		buf->bid = bid;
> +		list_add(&buf->list, list);
> +		addr += pbuf->len;

So, it chops a linear buffer into pbuf->nbufs chunks of size pbuf->len.
Did you consider iovec? I'll try to think about it after getting some sleep

> +		bid++;
> +	}
> +
> +	return i;
> +}
> +
> +static int io_provide_buffers(struct io_kiocb *req, struct io_kiocb **nxt,
> +			      bool force_nonblock)
> +{
> +	struct io_provide_buf *p = &req->pbuf;
> +	struct io_ring_ctx *ctx = req->ctx;
> +	struct list_head *list;
> +	int ret = 0;
> +
> +	/*
> +	 * "Normal" inline submissions always hold the uring_lock, since we
> +	 * grab it from the system call. Same is true for the SQPOLL offload.
> +	 * The only exception is when we've detached the request and issue it
> +	 * from an async worker thread, grab the lock for that case.
> +	 */
> +	if (!force_nonblock)
> +		mutex_lock(&ctx->uring_lock);

io_poll_task_handler() calls it with force_nonblock==true, but it doesn't hold
the mutex AFAIK.

> +
> +	lockdep_assert_held(&ctx->uring_lock);
> +
> +	list = idr_find(&ctx->io_buffer_idr, p->gid);
> +	if (!list) {
> +		list = kmalloc(sizeof(*list), GFP_KERNEL);
> +		if (!list) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +		INIT_LIST_HEAD(list);
> +		ret = idr_alloc(&ctx->io_buffer_idr, list, p->gid, p->gid + 1,
> +					GFP_KERNEL);
> +		if (ret < 0) {
> +			kfree(list);
> +			goto out;
> +		}
> +	}
> +
> +	ret = io_add_buffers(p, list);

Isn't it better to not do partial registration?
i.e. it may return ret < pbuf->nbufs

> +	if (!ret) {
> +		/* no buffers added and list empty, remove entry */
> +		if (list_empty(list)) {
> +			idr_remove(&ctx->io_buffer_idr, p->gid);
> +			kfree(list);
> +		}
> +		ret = -ENOMEM;
> +	}
> +out:
> +	if (!force_nonblock)
> +		mutex_unlock(&ctx->uring_lock);
> +	if (ret < 0)
> +		req_set_fail_links(req);
> +	io_cqring_add_event(req, ret);
> +	io_put_req_find_next(req, nxt);
> +	return 0;
> +}
> +
-- 
Pavel Begunkov


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/6] io_uring: add IORING_OP_PROVIDE_BUFFERS
  2020-02-29  0:43   ` Pavel Begunkov
@ 2020-02-29  4:50     ` Jens Axboe
  2020-02-29 11:36       ` Pavel Begunkov
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2020-02-29  4:50 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: andres

On 2/28/20 5:43 PM, Pavel Begunkov wrote:
> 
> 
> On 28/02/2020 23:30, Jens Axboe wrote:
>> IORING_OP_PROVIDE_BUFFERS uses the buffer registration infrastructure to
>> support passing in an addr/len that is associated with a buffer ID and
>> buffer group ID. The group ID is used to index and lookup the buffers,
>> while the buffer ID can be used to notify the application which buffer
>> in the group was used. The addr passed in is the starting buffer address,
>> and length is each buffer length. A number of buffers to add with can be
>> specified, in which case addr is incremented by length for each addition,
>> and each buffer increments the buffer ID specified.
>>
>> No validation is done of the buffer ID. If the application provides
>> buffers within the same group with identical buffer IDs, then it'll have
>> a hard time telling which buffer ID was used. The only restriction is
>> that the buffer ID can be a max of 16-bits in size, so USHRT_MAX is the
>> maximum ID that can be used.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> 
>> +
>> +static int io_add_buffers(struct io_provide_buf *pbuf, struct list_head *list)
>> +{
>> +	struct io_buffer *buf;
>> +	u64 addr = pbuf->addr;
>> +	int i, bid = pbuf->bid;
>> +
>> +	for (i = 0; i < pbuf->nbufs; i++) {
>> +		buf = kmalloc(sizeof(*buf), GFP_KERNEL);
>> +		if (!buf)
>> +			break;
>> +
>> +		buf->addr = addr;
>> +		buf->len = pbuf->len;
>> +		buf->bid = bid;
>> +		list_add(&buf->list, list);
>> +		addr += pbuf->len;
> 
> So, it chops a linear buffer into pbuf->nbufs chunks of size pbuf->len.

Right

> Did you consider iovec? I'll try to think about it after getting some sleep

I did, my issue there is that if you do vecs, then you want non-contig
buffer IDs. And then you'd need something else than an iovec, and it
gets messy pretty quick.

But in general, I'd really like to reuse buffer IDs, because I think
that's what applications would generally want. Maybe that means that
single buffers is just easier to deal with, but at least you can do that
with this approach without needing any sort of "packaging" for the data
passed in - it's just a pointer.

I'm open to suggestions, though!

>> +static int io_provide_buffers(struct io_kiocb *req, struct io_kiocb **nxt,
>> +			      bool force_nonblock)
>> +{
>> +	struct io_provide_buf *p = &req->pbuf;
>> +	struct io_ring_ctx *ctx = req->ctx;
>> +	struct list_head *list;
>> +	int ret = 0;
>> +
>> +	/*
>> +	 * "Normal" inline submissions always hold the uring_lock, since we
>> +	 * grab it from the system call. Same is true for the SQPOLL offload.
>> +	 * The only exception is when we've detached the request and issue it
>> +	 * from an async worker thread, grab the lock for that case.
>> +	 */
>> +	if (!force_nonblock)
>> +		mutex_lock(&ctx->uring_lock);
> 
> io_poll_task_handler() calls it with force_nonblock==true, but it
> doesn't hold the mutex AFAIK.

True, that's the only exception. And that command doesn't transfer data
so would never need a buffer, but I agree that's perhaps not fully
clear. The async task handler grabs the mutex.

>> +	lockdep_assert_held(&ctx->uring_lock);
>> +
>> +	list = idr_find(&ctx->io_buffer_idr, p->gid);
>> +	if (!list) {
>> +		list = kmalloc(sizeof(*list), GFP_KERNEL);
>> +		if (!list) {
>> +			ret = -ENOMEM;
>> +			goto out;
>> +		}
>> +		INIT_LIST_HEAD(list);
>> +		ret = idr_alloc(&ctx->io_buffer_idr, list, p->gid, p->gid + 1,
>> +					GFP_KERNEL);
>> +		if (ret < 0) {
>> +			kfree(list);
>> +			goto out;
>> +		}
>> +	}
>> +
>> +	ret = io_add_buffers(p, list);
> 
> Isn't it better to not do partial registration?
> i.e. it may return ret < pbuf->nbufs

Most things work like that, though. If you ask for an 8k read, you can't
unwind if you just get 4k. We return 4k for that. I think in general, if
it fails, you're probably somewhat screwed in either case. At least with
the partial return, you know which buffers got registered and how many
you can use. If you return 0 and undo it all, then the application
really has no way to continue except abort.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/6] io_uring: add IORING_OP_PROVIDE_BUFFERS
  2020-02-29  4:50     ` Jens Axboe
@ 2020-02-29 11:36       ` Pavel Begunkov
  2020-02-29 17:32         ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2020-02-29 11:36 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: andres

On 2/29/2020 7:50 AM, Jens Axboe wrote:
> On 2/28/20 5:43 PM, Pavel Begunkov wrote:
>>> +static int io_provide_buffers(struct io_kiocb *req, struct io_kiocb **nxt,
>>> +			      bool force_nonblock)
>>> +{
>>> +	struct io_provide_buf *p = &req->pbuf;
>>> +	struct io_ring_ctx *ctx = req->ctx;
>>> +	struct list_head *list;
>>> +	int ret = 0;
>>> +
>>> +	/*
>>> +	 * "Normal" inline submissions always hold the uring_lock, since we
>>> +	 * grab it from the system call. Same is true for the SQPOLL offload.
>>> +	 * The only exception is when we've detached the request and issue it
>>> +	 * from an async worker thread, grab the lock for that case.
>>> +	 */
>>> +	if (!force_nonblock)
>>> +		mutex_lock(&ctx->uring_lock);
>>
>> io_poll_task_handler() calls it with force_nonblock==true, but it
>> doesn't hold the mutex AFAIK.
> 
> True, that's the only exception. And that command doesn't transfer data
> so would never need a buffer, but I agree that's perhaps not fully
> clear. The async task handler grabs the mutex.

Hmm, I meant io_poll_task_func(), which do __io_queue_sqe() for @nxt
request, which in its turn calls io_issue_sqe(force_nonblock=true).

Does io_poll_task_func() hold @uring_mutex? Otherwise, if @nxt happened
to be io_provide_buffers(), we get there without holding the mutex and
with force_nonblock=true.


>>> +	lockdep_assert_held(&ctx->uring_lock);
>>> +
>>> +	list = idr_find(&ctx->io_buffer_idr, p->gid);
>>> +	if (!list) {
>>> +		list = kmalloc(sizeof(*list), GFP_KERNEL);
>>> +		if (!list) {
>>> +			ret = -ENOMEM;
>>> +			goto out;
>>> +		}
>>> +		INIT_LIST_HEAD(list);
>>> +		ret = idr_alloc(&ctx->io_buffer_idr, list, p->gid, p->gid + 1,
>>> +					GFP_KERNEL);
>>> +		if (ret < 0) {
>>> +			kfree(list);
>>> +			goto out;
>>> +		}
>>> +	}
>>> +
>>> +	ret = io_add_buffers(p, list);
>>
>> Isn't it better to not do partial registration?
>> i.e. it may return ret < pbuf->nbufs
> 
> Most things work like that, though. If you ask for an 8k read, you can't
> unwind if you just get 4k. We return 4k for that. I think in general, if
> it fails, you're probably somewhat screwed in either case. At least with
> the partial return, you know which buffers got registered and how many
> you can use. If you return 0 and undo it all, then the application
> really has no way to continue except abort.
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/6] io_uring: add IORING_OP_PROVIDE_BUFFERS
  2020-02-28 20:30 ` [PATCH 2/6] io_uring: add IORING_OP_PROVIDE_BUFFERS Jens Axboe
  2020-02-29  0:43   ` Pavel Begunkov
@ 2020-02-29 12:08   ` Pavel Begunkov
  2020-02-29 17:34     ` Jens Axboe
  2020-03-09 17:03   ` Andres Freund
  2 siblings, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2020-02-29 12:08 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: andres

On 2/28/2020 11:30 PM, Jens Axboe wrote:
> IORING_OP_PROVIDE_BUFFERS uses the buffer registration infrastructure to
> support passing in an addr/len that is associated with a buffer ID and
> buffer group ID. The group ID is used to index and lookup the buffers,
> while the buffer ID can be used to notify the application which buffer
> in the group was used. The addr passed in is the starting buffer address,
> and length is each buffer length. A number of buffers to add with can be
> specified, in which case addr is incremented by length for each addition,
> and each buffer increments the buffer ID specified.
> 
> No validation is done of the buffer ID. If the application provides
> buffers within the same group with identical buffer IDs, then it'll have
> a hard time telling which buffer ID was used. The only restriction is
> that the buffer ID can be a max of 16-bits in size, so USHRT_MAX is the
> maximum ID that can be used.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---

>  
> +static int io_provide_buffers_prep(struct io_kiocb *req,
> +				   const struct io_uring_sqe *sqe)
> +{

*provide* may be confusing, at least it's for me. It's not immediately
obvious, who gives and who consumes. Not sure what name would be better yet.

> +static int io_provide_buffers(struct io_kiocb *req, struct io_kiocb **nxt,
> +			      bool force_nonblock)
> +{
> +	struct io_provide_buf *p = &req->pbuf;
> +	struct io_ring_ctx *ctx = req->ctx;
> +	struct list_head *list;
> +	int ret = 0;
> +
> +	/*
> +	 * "Normal" inline submissions always hold the uring_lock, since we
> +	 * grab it from the system call. Same is true for the SQPOLL offload.
> +	 * The only exception is when we've detached the request and issue it
> +	 * from an async worker thread, grab the lock for that case.
> +	 */
> +	if (!force_nonblock)
> +		mutex_lock(&ctx->uring_lock);
> +
> +	lockdep_assert_held(&ctx->uring_lock);
> +
> +	list = idr_find(&ctx->io_buffer_idr, p->gid);
> +	if (!list) {
> +		list = kmalloc(sizeof(*list), GFP_KERNEL);

Could be easier to hook struct io_buffer into idr directly, i.e. without
a separate allocated list-head entry.

> +		if (!list) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +		INIT_LIST_HEAD(list);
> +		ret = idr_alloc(&ctx->io_buffer_idr, list, p->gid, p->gid + 1,
> +					GFP_KERNEL);
> +		if (ret < 0) {
> +			kfree(list);
> +			goto out;
> +		}
> +	}
> +
> +	ret = io_add_buffers(p, list);
> +	if (!ret) {
> +		/* no buffers added and list empty, remove entry */
> +		if (list_empty(list)) {
> +			idr_remove(&ctx->io_buffer_idr, p->gid);
> +			kfree(list);
> +		}
> +		ret = -ENOMEM;
> +	}
> +out:
> +	if (!force_nonblock)
> +		mutex_unlock(&ctx->uring_lock);
> +	if (ret < 0)
> +		req_set_fail_links(req);
> +	io_cqring_add_event(req, ret);
> +	io_put_req_find_next(req, nxt);
> +	return 0;
> +}

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/6] io_uring: support buffer selection
  2020-02-28 20:30 ` [PATCH 3/6] io_uring: support buffer selection Jens Axboe
@ 2020-02-29 12:21   ` Pavel Begunkov
  2020-02-29 17:35     ` Jens Axboe
  2020-03-09 17:21   ` Andres Freund
  1 sibling, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2020-02-29 12:21 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: andres

On 2/28/2020 11:30 PM, Jens Axboe wrote:

> +static int io_rw_common_cflags(struct io_kiocb *req)

Sounds more like sort of const/idempotent function, but not one changing
internal state (i.e. deallocation kbuf). Could it be named closer to
deallocate, remove, disarm, etc?

> +{
> +	struct io_buffer *kbuf = (struct io_buffer *) req->rw.addr;
> +	int cflags;
> +
> +	cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
> +	cflags |= IORING_CQE_F_BUFFER;
> +	req->rw.addr = 0;
> +	kfree(kbuf);
> +	return cflags;
> +}



> +static struct io_buffer *io_buffer_select(struct io_kiocb *req, int gid,
> +					  struct io_buffer *kbuf,
> +					  bool needs_lock)
> +{
> +	struct list_head *list;
> +
> +	if (req->flags & REQ_F_BUFFER_SELECTED)
> +		return kbuf;
> +
> +	/*
> +	 * "Normal" inline submissions always hold the uring_lock, since we
> +	 * grab it from the system call. Same is true for the SQPOLL offload.
> +	 * The only exception is when we've detached the request and issue it
> +	 * from an async worker thread, grab the lock for that case.
> +	 */
> +	if (needs_lock)
> +		mutex_lock(&req->ctx->uring_lock);

The same concern as for the [2/6]

> +
> +	lockdep_assert_held(&req->ctx->uring_lock);
> +
> +	list = idr_find(&req->ctx->io_buffer_idr, gid);
> +	if (list && !list_empty(list)) {
> +		kbuf = list_first_entry(list, struct io_buffer, list);
> +		list_del(&kbuf->list);

free(list), if it became empty? As mentioned, may go away naturally if
idr would store io_buffer directly.

> +	} else {
> +		kbuf = ERR_PTR(-ENOBUFS);
> +	}
> +
> +	if (needs_lock)
> +		mutex_unlock(&req->ctx->uring_lock);
> +
> +	return kbuf;
> +}
> +

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/6] io_uring: add IORING_OP_PROVIDE_BUFFERS
  2020-02-29 11:36       ` Pavel Begunkov
@ 2020-02-29 17:32         ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2020-02-29 17:32 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: andres

On 2/29/20 4:36 AM, Pavel Begunkov wrote:
> On 2/29/2020 7:50 AM, Jens Axboe wrote:
>> On 2/28/20 5:43 PM, Pavel Begunkov wrote:
>>>> +static int io_provide_buffers(struct io_kiocb *req, struct io_kiocb **nxt,
>>>> +			      bool force_nonblock)
>>>> +{
>>>> +	struct io_provide_buf *p = &req->pbuf;
>>>> +	struct io_ring_ctx *ctx = req->ctx;
>>>> +	struct list_head *list;
>>>> +	int ret = 0;
>>>> +
>>>> +	/*
>>>> +	 * "Normal" inline submissions always hold the uring_lock, since we
>>>> +	 * grab it from the system call. Same is true for the SQPOLL offload.
>>>> +	 * The only exception is when we've detached the request and issue it
>>>> +	 * from an async worker thread, grab the lock for that case.
>>>> +	 */
>>>> +	if (!force_nonblock)
>>>> +		mutex_lock(&ctx->uring_lock);
>>>
>>> io_poll_task_handler() calls it with force_nonblock==true, but it
>>> doesn't hold the mutex AFAIK.
>>
>> True, that's the only exception. And that command doesn't transfer data
>> so would never need a buffer, but I agree that's perhaps not fully
>> clear. The async task handler grabs the mutex.
> 
> Hmm, I meant io_poll_task_func(), which do __io_queue_sqe() for @nxt
> request, which in its turn calls io_issue_sqe(force_nonblock=true).
> 
> Does io_poll_task_func() hold @uring_mutex? Otherwise, if @nxt happened
> to be io_provide_buffers(), we get there without holding the mutex and
> with force_nonblock=true.

Ah yes, good point! Not sure why I missed that... I've added the locking
around __io_queue_sqe() if 'nxt' is set in the handler.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/6] io_uring: add IORING_OP_PROVIDE_BUFFERS
  2020-02-29 12:08   ` Pavel Begunkov
@ 2020-02-29 17:34     ` Jens Axboe
  2020-02-29 18:11       ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2020-02-29 17:34 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: andres

On 2/29/20 5:08 AM, Pavel Begunkov wrote:
> On 2/28/2020 11:30 PM, Jens Axboe wrote:
>> IORING_OP_PROVIDE_BUFFERS uses the buffer registration infrastructure to
>> support passing in an addr/len that is associated with a buffer ID and
>> buffer group ID. The group ID is used to index and lookup the buffers,
>> while the buffer ID can be used to notify the application which buffer
>> in the group was used. The addr passed in is the starting buffer address,
>> and length is each buffer length. A number of buffers to add with can be
>> specified, in which case addr is incremented by length for each addition,
>> and each buffer increments the buffer ID specified.
>>
>> No validation is done of the buffer ID. If the application provides
>> buffers within the same group with identical buffer IDs, then it'll have
>> a hard time telling which buffer ID was used. The only restriction is
>> that the buffer ID can be a max of 16-bits in size, so USHRT_MAX is the
>> maximum ID that can be used.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
> 
>>  
>> +static int io_provide_buffers_prep(struct io_kiocb *req,
>> +				   const struct io_uring_sqe *sqe)
>> +{
> 
> *provide* may be confusing, at least it's for me. It's not immediately
> obvious, who gives and who consumes. Not sure what name would be better yet.

It's the application providing the buffers upfront, at least that's how
the naming makes sense to me.

>> +static int io_provide_buffers(struct io_kiocb *req, struct io_kiocb **nxt,
>> +			      bool force_nonblock)
>> +{
>> +	struct io_provide_buf *p = &req->pbuf;
>> +	struct io_ring_ctx *ctx = req->ctx;
>> +	struct list_head *list;
>> +	int ret = 0;
>> +
>> +	/*
>> +	 * "Normal" inline submissions always hold the uring_lock, since we
>> +	 * grab it from the system call. Same is true for the SQPOLL offload.
>> +	 * The only exception is when we've detached the request and issue it
>> +	 * from an async worker thread, grab the lock for that case.
>> +	 */
>> +	if (!force_nonblock)
>> +		mutex_lock(&ctx->uring_lock);
>> +
>> +	lockdep_assert_held(&ctx->uring_lock);
>> +
>> +	list = idr_find(&ctx->io_buffer_idr, p->gid);
>> +	if (!list) {
>> +		list = kmalloc(sizeof(*list), GFP_KERNEL);
> 
> Could be easier to hook struct io_buffer into idr directly, i.e. without
> a separate allocated list-head entry.

Good point, we can just make the first kbuf the list, point to the next
one (or NULL) when a kbuf is removed. I'll make that change, gets rid of
the list alloc.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/6] io_uring: support buffer selection
  2020-02-29 12:21   ` Pavel Begunkov
@ 2020-02-29 17:35     ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2020-02-29 17:35 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: andres

On 2/29/20 5:21 AM, Pavel Begunkov wrote:
> On 2/28/2020 11:30 PM, Jens Axboe wrote:
> 
>> +static int io_rw_common_cflags(struct io_kiocb *req)
> 
> Sounds more like sort of const/idempotent function, but not one changing
> internal state (i.e. deallocation kbuf). Could it be named closer to
> deallocate, remove, disarm, etc?

io_put_kbuf() or something like that is probably better.


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/6] io_uring: add IORING_OP_PROVIDE_BUFFERS
  2020-02-29 17:34     ` Jens Axboe
@ 2020-02-29 18:11       ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2020-02-29 18:11 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: andres

On 2/29/20 10:34 AM, Jens Axboe wrote:
>>> +static int io_provide_buffers(struct io_kiocb *req, struct io_kiocb **nxt,
>>> +			      bool force_nonblock)
>>> +{
>>> +	struct io_provide_buf *p = &req->pbuf;
>>> +	struct io_ring_ctx *ctx = req->ctx;
>>> +	struct list_head *list;
>>> +	int ret = 0;
>>> +
>>> +	/*
>>> +	 * "Normal" inline submissions always hold the uring_lock, since we
>>> +	 * grab it from the system call. Same is true for the SQPOLL offload.
>>> +	 * The only exception is when we've detached the request and issue it
>>> +	 * from an async worker thread, grab the lock for that case.
>>> +	 */
>>> +	if (!force_nonblock)
>>> +		mutex_lock(&ctx->uring_lock);
>>> +
>>> +	lockdep_assert_held(&ctx->uring_lock);
>>> +
>>> +	list = idr_find(&ctx->io_buffer_idr, p->gid);
>>> +	if (!list) {
>>> +		list = kmalloc(sizeof(*list), GFP_KERNEL);
>>
>> Could be easier to hook struct io_buffer into idr directly, i.e. without
>> a separate allocated list-head entry.
> 
> Good point, we can just make the first kbuf the list, point to the next
> one (or NULL) when a kbuf is removed. I'll make that change, gets rid of
> the list alloc.

I took a look at this, and it does come with tradeoffs. The nice thing
about the list is that it provides a constant lookup pointer, whereas if
we just use the kbuf as the idr index and hang other buffers off that,
then we at least need an idr_replace() or similar when the list becomes
empty. And we need to grab buffers from the tail to retain the head kbuf
(which is idr indexed) for as long as possible. The latter isn't really
a tradeoff, it's just list management.

I do still think it'll end up nicer though, I'll go ahead and give it a
whirl.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/6] io_uring: add IORING_OP_PROVIDE_BUFFERS
  2020-02-28 20:30 ` [PATCH 2/6] io_uring: add IORING_OP_PROVIDE_BUFFERS Jens Axboe
  2020-02-29  0:43   ` Pavel Begunkov
  2020-02-29 12:08   ` Pavel Begunkov
@ 2020-03-09 17:03   ` Andres Freund
  2020-03-09 17:17     ` Jens Axboe
  2 siblings, 1 reply; 22+ messages in thread
From: Andres Freund @ 2020-03-09 17:03 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

Hi,

I like this feature quite a bit...

Sorry for the late response.


On 2020-02-28 13:30:49 -0700, Jens Axboe wrote:
> +static int io_provide_buffers_prep(struct io_kiocb *req,
> +				   const struct io_uring_sqe *sqe)
> +{
> +	struct io_provide_buf *p = &req->pbuf;
> +	u64 tmp;
> +
> +	if (sqe->ioprio || sqe->rw_flags)
> +		return -EINVAL;
> +
> +	tmp = READ_ONCE(sqe->fd);
> +	if (!tmp || tmp > USHRT_MAX)
> +		return -EINVAL;

Hm, seems like it'd be less confusing if this didn't use
io_uring_sqe->fd, but a separate union member?


> +	p->nbufs = tmp;
> +	p->addr = READ_ONCE(sqe->addr);
> +	p->len = READ_ONCE(sqe->len);
> +
> +	if (!access_ok(u64_to_user_ptr(p->addr), p->len))
> +		return -EFAULT;
> +
> +	p->gid = READ_ONCE(sqe->buf_group);
> +	tmp = READ_ONCE(sqe->off);
> +	if (tmp > USHRT_MAX)
> +		return -EINVAL;

Would it make sense to return a distinct error for the >= USHRT_MAX
cases? E2BIG or something roughly in that direction? Seems good to be
able to recognizably refuse "large" buffer group ids.


> +static int io_add_buffers(struct io_provide_buf *pbuf, struct list_head *list)
> +{
> +	struct io_buffer *buf;
> +	u64 addr = pbuf->addr;
> +	int i, bid = pbuf->bid;
> +
> +	for (i = 0; i < pbuf->nbufs; i++) {
> +		buf = kmalloc(sizeof(*buf), GFP_KERNEL);
> +		if (!buf)
> +			break;
> +
> +		buf->addr = addr;
> +		buf->len = pbuf->len;
> +		buf->bid = bid;
> +		list_add(&buf->list, list);
> +		addr += pbuf->len;
> +		bid++;
> +	}
> +
> +	return i;
> +}

Hm, aren't you loosing an error here if you kmalloc fails for i > 0?
Afaict io_provide_buffes() only checks for ret != 0. I think userland
should know that a PROVIDE_BUFFERS failed, even if just partially (I'd
just make it fail wholesale).


> +static int io_provide_buffers(struct io_kiocb *req, struct io_kiocb **nxt,
> +			      bool force_nonblock)
> +{
> +	struct io_provide_buf *p = &req->pbuf;
> +	struct io_ring_ctx *ctx = req->ctx;
> +	struct list_head *list;
> +	int ret = 0;

> +	list = idr_find(&ctx->io_buffer_idr, p->gid);

I'd rename gid to bgid or such to distinguish from other types of group
ids, but whatever.


Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/6] io_uring: add IORING_OP_PROVIDE_BUFFERS
  2020-03-09 17:03   ` Andres Freund
@ 2020-03-09 17:17     ` Jens Axboe
  2020-03-09 17:28       ` Andres Freund
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2020-03-09 17:17 UTC (permalink / raw)
  To: Andres Freund; +Cc: io-uring

On 3/9/20 11:03 AM, Andres Freund wrote:
> Hi,
> 
> I like this feature quite a bit...
> 
> Sorry for the late response.
> 
> 
> On 2020-02-28 13:30:49 -0700, Jens Axboe wrote:
>> +static int io_provide_buffers_prep(struct io_kiocb *req,
>> +				   const struct io_uring_sqe *sqe)
>> +{
>> +	struct io_provide_buf *p = &req->pbuf;
>> +	u64 tmp;
>> +
>> +	if (sqe->ioprio || sqe->rw_flags)
>> +		return -EINVAL;
>> +
>> +	tmp = READ_ONCE(sqe->fd);
>> +	if (!tmp || tmp > USHRT_MAX)
>> +		return -EINVAL;
> 
> Hm, seems like it'd be less confusing if this didn't use
> io_uring_sqe->fd, but a separate union member?

Sure, not a big deal to me, and we can make this change after the fact
too as it won't change the ABI.

>> +	p->nbufs = tmp;
>> +	p->addr = READ_ONCE(sqe->addr);
>> +	p->len = READ_ONCE(sqe->len);
>> +
>> +	if (!access_ok(u64_to_user_ptr(p->addr), p->len))
>> +		return -EFAULT;
>> +
>> +	p->gid = READ_ONCE(sqe->buf_group);
>> +	tmp = READ_ONCE(sqe->off);
>> +	if (tmp > USHRT_MAX)
>> +		return -EINVAL;
> 
> Would it make sense to return a distinct error for the >= USHRT_MAX
> cases? E2BIG or something roughly in that direction? Seems good to be
> able to recognizably refuse "large" buffer group ids.

Don't feel too strongly, but it does help to separate the error.

>> +static int io_add_buffers(struct io_provide_buf *pbuf, struct list_head *list)
>> +{
>> +	struct io_buffer *buf;
>> +	u64 addr = pbuf->addr;
>> +	int i, bid = pbuf->bid;
>> +
>> +	for (i = 0; i < pbuf->nbufs; i++) {
>> +		buf = kmalloc(sizeof(*buf), GFP_KERNEL);
>> +		if (!buf)
>> +			break;
>> +
>> +		buf->addr = addr;
>> +		buf->len = pbuf->len;
>> +		buf->bid = bid;
>> +		list_add(&buf->list, list);
>> +		addr += pbuf->len;
>> +		bid++;
>> +	}
>> +
>> +	return i;
>> +}
> 
> Hm, aren't you loosing an error here if you kmalloc fails for i > 0?
> Afaict io_provide_buffes() only checks for ret != 0. I think userland
> should know that a PROVIDE_BUFFERS failed, even if just partially (I'd
> just make it fail wholesale).

The above one does have the issue that we're losing the error for i ==
0, current one does:

return i ? i : -ENOMEM;

But this is what most interfaces end up doing, return the number
processed, if any, or error if none of them were added. Like a short
read, for example, and you'd get EIO if you forwarded and tried again.
So I tend to prefer doing it like that, at least to me it seems more
logical than unwinding. The application won't know what buffer caused
the error if you unwind, whereas it's perfectly clear if you asked to
add 128 and we return 64 that the issue is with the 65th buffer.

>> +static int io_provide_buffers(struct io_kiocb *req, struct io_kiocb **nxt,
>> +			      bool force_nonblock)
>> +{
>> +	struct io_provide_buf *p = &req->pbuf;
>> +	struct io_ring_ctx *ctx = req->ctx;
>> +	struct list_head *list;
>> +	int ret = 0;
> 
>> +	list = idr_find(&ctx->io_buffer_idr, p->gid);
> 
> I'd rename gid to bgid or such to distinguish from other types of group
> ids, but whatever.

Noted!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/6] io_uring: support buffer selection
  2020-02-28 20:30 ` [PATCH 3/6] io_uring: support buffer selection Jens Axboe
  2020-02-29 12:21   ` Pavel Begunkov
@ 2020-03-09 17:21   ` Andres Freund
  2020-03-10 13:37     ` Jens Axboe
  1 sibling, 1 reply; 22+ messages in thread
From: Andres Freund @ 2020-03-09 17:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

Hi,

On 2020-02-28 13:30:50 -0700, Jens Axboe wrote:
> If a server process has tons of pending socket connections, generally
> it uses epoll to wait for activity. When the socket is ready for reading
> (or writing), the task can select a buffer and issue a recv/send on the
> given fd.
> 
> Now that we have fast (non-async thread) support, a task can have tons
> of pending reads or writes pending. But that means they need buffers to
> back that data, and if the number of connections is high enough, having
> them preallocated for all possible connections is unfeasible.
> 
> With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to
> use for any request. The request then sets IOSQE_BUFFER_SELECT in the
> sqe, and a given group ID in sqe->buf_group. When the fd becomes ready,
> a free buffer from the specified group is selected. If none are
> available, the request is terminated with -ENOBUFS. If successful, the
> CQE on completion will contain the buffer ID chosen in the cqe->flags
> member, encoded as:
> 
> 	(buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;
> 
> Once a buffer has been consumed by a request, it is no longer available
> and must be registered again with IORING_OP_PROVIDE_BUFFERS.
> 
> Requests need to support this feature. For now, IORING_OP_READ and
> IORING_OP_RECV support it. This is checked on SQE submission, a CQE with
> res == -EINVAL will be posted if attempted on unsupported requests.

Why not EOPNOTSUPP or such? Makes it more feasible for applications to
handle the case separately.


> +static int io_rw_common_cflags(struct io_kiocb *req)
> +{
> +	struct io_buffer *kbuf = (struct io_buffer *) req->rw.addr;
> +	int cflags;
> +
> +	cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
> +	cflags |= IORING_CQE_F_BUFFER;
> +	req->rw.addr = 0;
> +	kfree(kbuf);
> +	return cflags;
> +}

>  		if (refcount_dec_and_test(&req->refs) &&
> @@ -1819,13 +1860,16 @@ static inline void req_set_fail_links(struct io_kiocb *req)
>  static void io_complete_rw_common(struct kiocb *kiocb, long res)
>  {
>  	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);
> +	int cflags = 0;
>  
>  	if (kiocb->ki_flags & IOCB_WRITE)
>  		kiocb_end_write(req);
>  
>  	if (res != req->result)
>  		req_set_fail_links(req);
> -	io_cqring_add_event(req, res);
> +	if (req->flags & REQ_F_BUFFER_SELECTED)
> +		cflags = io_rw_common_cflags(req);
> +	__io_cqring_add_event(req, res, cflags);
>  }

Besides the naming already commented upon by Pavel, I'm also wondering
if it's the right thing to call this unconditionally from
io_complete_*rw*_common() - hard to see how this feature would ever be
used in the write path...


> +static struct io_buffer *io_buffer_select(struct io_kiocb *req, int gid,
> +					  struct io_buffer *kbuf,
> +					  bool needs_lock)
> +{
> +	struct list_head *list;
> +
> +	if (req->flags & REQ_F_BUFFER_SELECTED)
> +		return kbuf;
> +
> +	/*
> +	 * "Normal" inline submissions always hold the uring_lock, since we
> +	 * grab it from the system call. Same is true for the SQPOLL offload.
> +	 * The only exception is when we've detached the request and issue it
> +	 * from an async worker thread, grab the lock for that case.
> +	 */
> +	if (needs_lock)
> +		mutex_lock(&req->ctx->uring_lock);
> +
> +	lockdep_assert_held(&req->ctx->uring_lock);

This comment is in a few places, perhaps there's a way to unify by
placing the conditional acquisition into a helper?


> +	list = idr_find(&req->ctx->io_buffer_idr, gid);
> +	if (list && !list_empty(list)) {
> +		kbuf = list_first_entry(list, struct io_buffer, list);
> +		list_del(&kbuf->list);
> +	} else {
> +		kbuf = ERR_PTR(-ENOBUFS);
> +	}
> +
> +	if (needs_lock)
> +		mutex_unlock(&req->ctx->uring_lock);
> +
> +	return kbuf;
> +}
> +
>  static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
> -			       struct iovec **iovec, struct iov_iter *iter)
> +			       struct iovec **iovec, struct iov_iter *iter,
> +			       bool needs_lock)
>  {
>  	void __user *buf = u64_to_user_ptr(req->rw.addr);
>  	size_t sqe_len = req->rw.len;
> @@ -2140,12 +2219,30 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
>  		return io_import_fixed(req, rw, iter);
>  	}
>  
> -	/* buffer index only valid with fixed read/write */
> -	if (req->rw.kiocb.private)
> +	/* buffer index only valid with fixed read/write, or buffer select  */
> +	if (req->rw.kiocb.private && !(req->flags & REQ_F_BUFFER_SELECT))
>  		return -EINVAL;
>  
>  	if (opcode == IORING_OP_READ || opcode == IORING_OP_WRITE) {
>  		ssize_t ret;
> +
> +		if (req->flags & REQ_F_BUFFER_SELECT) {
> +			struct io_buffer *kbuf = (struct io_buffer *) req->rw.addr;
> +			int gid;
> +
> +			gid = (int) (unsigned long) req->rw.kiocb.private;
> +			kbuf = io_buffer_select(req, gid, kbuf, needs_lock);
> +			if (IS_ERR(kbuf)) {
> +				*iovec = NULL;
> +				return PTR_ERR(kbuf);
> +			}
> +			req->rw.addr = (u64) kbuf;
> +			if (sqe_len > kbuf->len)
> +				sqe_len = kbuf->len;
> +			req->flags |= REQ_F_BUFFER_SELECTED;
> +			buf = u64_to_user_ptr(kbuf->addr);
> +		}

Feels a bit dangerous to have addr sometimes pointing to the user
specified data, and sometimes to kernel data. Even if indicated by
REQ_F_BUFFER_SELECTED.


> +static struct io_buffer *io_recv_buffer_select(struct io_kiocb *req,
> +					       int *cflags, bool needs_lock)
> +{
> +	struct io_sr_msg *sr = &req->sr_msg;
> +	struct io_buffer *kbuf;
> +
> +	if (!(req->flags & REQ_F_BUFFER_SELECT))
> +		return NULL;
> +
> +	kbuf = io_buffer_select(req, sr->gid, sr->kbuf, needs_lock);
> +	if (IS_ERR(kbuf))
> +		return kbuf;
> +
> +	sr->kbuf = kbuf;
> +	if (sr->len > kbuf->len)
> +		sr->len = kbuf->len;
> +	req->flags |= REQ_F_BUFFER_SELECTED;
> +
> +	*cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
> +	*cflags |= IORING_CQE_F_BUFFER;
> +	return kbuf;
> +}

Could more of this be moved into io_buffer_select? Looks like every
REQ_F_BUFFER_SELECT supporting op is going to need most of it?


>  static int io_recvmsg_prep(struct io_kiocb *req,
>  			   const struct io_uring_sqe *sqe)
>  {

Looks like this would be unused if !defined(CONFIG_NET)?


Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/6] io_uring: add IORING_OP_PROVIDE_BUFFERS
  2020-03-09 17:17     ` Jens Axboe
@ 2020-03-09 17:28       ` Andres Freund
  2020-03-10 13:33         ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Andres Freund @ 2020-03-09 17:28 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

Hi,

On 2020-03-09 11:17:46 -0600, Jens Axboe wrote:
> >> +static int io_add_buffers(struct io_provide_buf *pbuf, struct list_head *list)
> >> +{
> >> +	struct io_buffer *buf;
> >> +	u64 addr = pbuf->addr;
> >> +	int i, bid = pbuf->bid;
> >> +
> >> +	for (i = 0; i < pbuf->nbufs; i++) {
> >> +		buf = kmalloc(sizeof(*buf), GFP_KERNEL);
> >> +		if (!buf)
> >> +			break;
> >> +
> >> +		buf->addr = addr;
> >> +		buf->len = pbuf->len;
> >> +		buf->bid = bid;
> >> +		list_add(&buf->list, list);
> >> +		addr += pbuf->len;
> >> +		bid++;
> >> +	}
> >> +
> >> +	return i;
> >> +}
> > 
> > Hm, aren't you loosing an error here if you kmalloc fails for i > 0?
> > Afaict io_provide_buffes() only checks for ret != 0. I think userland
> > should know that a PROVIDE_BUFFERS failed, even if just partially (I'd
> > just make it fail wholesale).
> 
> The above one does have the issue that we're losing the error for i ==
> 0, current one does:
> 
> return i ? i : -ENOMEM;
> 
> But this is what most interfaces end up doing, return the number
> processed, if any, or error if none of them were added. Like a short
> read, for example, and you'd get EIO if you forwarded and tried again.
> So I tend to prefer doing it like that, at least to me it seems more
> logical than unwinding. The application won't know what buffer caused
> the error if you unwind, whereas it's perfectly clear if you asked to
> add 128 and we return 64 that the issue is with the 65th buffer.

Fair enough. I was/am thinking that this'd pretty much always be a fatal
error for the application. Which does seem a bit different from the
short read/write case, where there are plenty reasons to handle them
"silently" during normal operation.

But I can error out with the current interface, so ...

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/6] io_uring: add IORING_OP_PROVIDE_BUFFERS
  2020-03-09 17:28       ` Andres Freund
@ 2020-03-10 13:33         ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2020-03-10 13:33 UTC (permalink / raw)
  To: Andres Freund; +Cc: io-uring

On 3/9/20 11:28 AM, Andres Freund wrote:
> Hi,
> 
> On 2020-03-09 11:17:46 -0600, Jens Axboe wrote:
>>>> +static int io_add_buffers(struct io_provide_buf *pbuf, struct list_head *list)
>>>> +{
>>>> +	struct io_buffer *buf;
>>>> +	u64 addr = pbuf->addr;
>>>> +	int i, bid = pbuf->bid;
>>>> +
>>>> +	for (i = 0; i < pbuf->nbufs; i++) {
>>>> +		buf = kmalloc(sizeof(*buf), GFP_KERNEL);
>>>> +		if (!buf)
>>>> +			break;
>>>> +
>>>> +		buf->addr = addr;
>>>> +		buf->len = pbuf->len;
>>>> +		buf->bid = bid;
>>>> +		list_add(&buf->list, list);
>>>> +		addr += pbuf->len;
>>>> +		bid++;
>>>> +	}
>>>> +
>>>> +	return i;
>>>> +}
>>>
>>> Hm, aren't you loosing an error here if you kmalloc fails for i > 0?
>>> Afaict io_provide_buffes() only checks for ret != 0. I think userland
>>> should know that a PROVIDE_BUFFERS failed, even if just partially (I'd
>>> just make it fail wholesale).
>>
>> The above one does have the issue that we're losing the error for i ==
>> 0, current one does:
>>
>> return i ? i : -ENOMEM;
>>
>> But this is what most interfaces end up doing, return the number
>> processed, if any, or error if none of them were added. Like a short
>> read, for example, and you'd get EIO if you forwarded and tried again.
>> So I tend to prefer doing it like that, at least to me it seems more
>> logical than unwinding. The application won't know what buffer caused
>> the error if you unwind, whereas it's perfectly clear if you asked to
>> add 128 and we return 64 that the issue is with the 65th buffer.
> 
> Fair enough. I was/am thinking that this'd pretty much always be a fatal
> error for the application. Which does seem a bit different from the
> short read/write case, where there are plenty reasons to handle them
> "silently" during normal operation.
> 
> But I can error out with the current interface, so ...

Even if it is most likely a fatal condition for the application, it
would usually indicate that the application is doing something wrong.
Which means you want to debug it, and that's a lot more approachable
if you know exactly which buffer caused the issue. There's merit to
saying "buffer N isn't valid" rather than "One of the N buffers
submitted has an issue, good luck!".

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/6] io_uring: support buffer selection
  2020-03-09 17:21   ` Andres Freund
@ 2020-03-10 13:37     ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2020-03-10 13:37 UTC (permalink / raw)
  To: Andres Freund; +Cc: io-uring

On 3/9/20 11:21 AM, Andres Freund wrote:
> Hi,
> 
> On 2020-02-28 13:30:50 -0700, Jens Axboe wrote:
>> If a server process has tons of pending socket connections, generally
>> it uses epoll to wait for activity. When the socket is ready for reading
>> (or writing), the task can select a buffer and issue a recv/send on the
>> given fd.
>>
>> Now that we have fast (non-async thread) support, a task can have tons
>> of pending reads or writes pending. But that means they need buffers to
>> back that data, and if the number of connections is high enough, having
>> them preallocated for all possible connections is unfeasible.
>>
>> With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to
>> use for any request. The request then sets IOSQE_BUFFER_SELECT in the
>> sqe, and a given group ID in sqe->buf_group. When the fd becomes ready,
>> a free buffer from the specified group is selected. If none are
>> available, the request is terminated with -ENOBUFS. If successful, the
>> CQE on completion will contain the buffer ID chosen in the cqe->flags
>> member, encoded as:
>>
>> 	(buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;
>>
>> Once a buffer has been consumed by a request, it is no longer available
>> and must be registered again with IORING_OP_PROVIDE_BUFFERS.
>>
>> Requests need to support this feature. For now, IORING_OP_READ and
>> IORING_OP_RECV support it. This is checked on SQE submission, a CQE with
>> res == -EINVAL will be posted if attempted on unsupported requests.
> 
> Why not EOPNOTSUPP or such? Makes it more feasible for applications to
> handle the case separately.

Good point, I can make that change.

>> +static int io_rw_common_cflags(struct io_kiocb *req)
>> +{
>> +	struct io_buffer *kbuf = (struct io_buffer *) req->rw.addr;
>> +	int cflags;
>> +
>> +	cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
>> +	cflags |= IORING_CQE_F_BUFFER;
>> +	req->rw.addr = 0;
>> +	kfree(kbuf);
>> +	return cflags;
>> +}
> 
>>  		if (refcount_dec_and_test(&req->refs) &&
>> @@ -1819,13 +1860,16 @@ static inline void req_set_fail_links(struct io_kiocb *req)
>>  static void io_complete_rw_common(struct kiocb *kiocb, long res)
>>  {
>>  	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);
>> +	int cflags = 0;
>>  
>>  	if (kiocb->ki_flags & IOCB_WRITE)
>>  		kiocb_end_write(req);
>>  
>>  	if (res != req->result)
>>  		req_set_fail_links(req);
>> -	io_cqring_add_event(req, res);
>> +	if (req->flags & REQ_F_BUFFER_SELECTED)
>> +		cflags = io_rw_common_cflags(req);
>> +	__io_cqring_add_event(req, res, cflags);
>>  }
> 
> Besides the naming already commented upon by Pavel, I'm also wondering
> if it's the right thing to call this unconditionally from
> io_complete_*rw*_common() - hard to see how this feature would ever be
> used in the write path...

Doesn't really matter I think, I'd rather have that dead branch for
writes than needing a separate handler. I did change the naming, this
posting is almost two weeks old. I'll change the other little bits from
here and post a new series so we're all on the same page.

>> +static struct io_buffer *io_buffer_select(struct io_kiocb *req, int gid,
>> +					  struct io_buffer *kbuf,
>> +					  bool needs_lock)
>> +{
>> +	struct list_head *list;
>> +
>> +	if (req->flags & REQ_F_BUFFER_SELECTED)
>> +		return kbuf;
>> +
>> +	/*
>> +	 * "Normal" inline submissions always hold the uring_lock, since we
>> +	 * grab it from the system call. Same is true for the SQPOLL offload.
>> +	 * The only exception is when we've detached the request and issue it
>> +	 * from an async worker thread, grab the lock for that case.
>> +	 */
>> +	if (needs_lock)
>> +		mutex_lock(&req->ctx->uring_lock);
>> +
>> +	lockdep_assert_held(&req->ctx->uring_lock);
> 
> This comment is in a few places, perhaps there's a way to unify by
> placing the conditional acquisition into a helper?

We could have a io_lock_ring(ctx, force_nonblock) helper and just put it
in there, ditto for the unlock.

>> +	list = idr_find(&req->ctx->io_buffer_idr, gid);
>> +	if (list && !list_empty(list)) {
>> +		kbuf = list_first_entry(list, struct io_buffer, list);
>> +		list_del(&kbuf->list);
>> +	} else {
>> +		kbuf = ERR_PTR(-ENOBUFS);
>> +	}
>> +
>> +	if (needs_lock)
>> +		mutex_unlock(&req->ctx->uring_lock);
>> +
>> +	return kbuf;
>> +}
>> +
>>  static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
>> -			       struct iovec **iovec, struct iov_iter *iter)
>> +			       struct iovec **iovec, struct iov_iter *iter,
>> +			       bool needs_lock)
>>  {
>>  	void __user *buf = u64_to_user_ptr(req->rw.addr);
>>  	size_t sqe_len = req->rw.len;
>> @@ -2140,12 +2219,30 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
>>  		return io_import_fixed(req, rw, iter);
>>  	}
>>  
>> -	/* buffer index only valid with fixed read/write */
>> -	if (req->rw.kiocb.private)
>> +	/* buffer index only valid with fixed read/write, or buffer select  */
>> +	if (req->rw.kiocb.private && !(req->flags & REQ_F_BUFFER_SELECT))
>>  		return -EINVAL;
>>  
>>  	if (opcode == IORING_OP_READ || opcode == IORING_OP_WRITE) {
>>  		ssize_t ret;
>> +
>> +		if (req->flags & REQ_F_BUFFER_SELECT) {
>> +			struct io_buffer *kbuf = (struct io_buffer *) req->rw.addr;
>> +			int gid;
>> +
>> +			gid = (int) (unsigned long) req->rw.kiocb.private;
>> +			kbuf = io_buffer_select(req, gid, kbuf, needs_lock);
>> +			if (IS_ERR(kbuf)) {
>> +				*iovec = NULL;
>> +				return PTR_ERR(kbuf);
>> +			}
>> +			req->rw.addr = (u64) kbuf;
>> +			if (sqe_len > kbuf->len)
>> +				sqe_len = kbuf->len;
>> +			req->flags |= REQ_F_BUFFER_SELECTED;
>> +			buf = u64_to_user_ptr(kbuf->addr);
>> +		}
> 
> Feels a bit dangerous to have addr sometimes pointing to the user
> specified data, and sometimes to kernel data. Even if indicated by
> REQ_F_BUFFER_SELECTED.

It's not ideal, but it's either that or have the struct io_rw blow over
the cacheline, which I don't want. So the tradeoff seemed like the right
one to me. All the initial io_kiocb per-request-type unions are 64b or
less.

>> +static struct io_buffer *io_recv_buffer_select(struct io_kiocb *req,
>> +					       int *cflags, bool needs_lock)
>> +{
>> +	struct io_sr_msg *sr = &req->sr_msg;
>> +	struct io_buffer *kbuf;
>> +
>> +	if (!(req->flags & REQ_F_BUFFER_SELECT))
>> +		return NULL;
>> +
>> +	kbuf = io_buffer_select(req, sr->gid, sr->kbuf, needs_lock);
>> +	if (IS_ERR(kbuf))
>> +		return kbuf;
>> +
>> +	sr->kbuf = kbuf;
>> +	if (sr->len > kbuf->len)
>> +		sr->len = kbuf->len;
>> +	req->flags |= REQ_F_BUFFER_SELECTED;
>> +
>> +	*cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
>> +	*cflags |= IORING_CQE_F_BUFFER;
>> +	return kbuf;
>> +}
> 
> Could more of this be moved into io_buffer_select? Looks like every
> REQ_F_BUFFER_SELECT supporting op is going to need most of it?

Probably could be, I'll take a look and see if we can move more of that
logic in there.

>>  static int io_recvmsg_prep(struct io_kiocb *req,
>>  			   const struct io_uring_sqe *sqe)
>>  {
> 
> Looks like this would be unused if !defined(CONFIG_NET)?

Also fixed a week ago or so.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2020-03-10 13:37 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-28 20:30 [PATCHSET v3] io_uring support for automatic buffers Jens Axboe
2020-02-28 20:30 ` [PATCH 1/6] io_uring: buffer registration infrastructure Jens Axboe
2020-02-28 20:30 ` [PATCH 2/6] io_uring: add IORING_OP_PROVIDE_BUFFERS Jens Axboe
2020-02-29  0:43   ` Pavel Begunkov
2020-02-29  4:50     ` Jens Axboe
2020-02-29 11:36       ` Pavel Begunkov
2020-02-29 17:32         ` Jens Axboe
2020-02-29 12:08   ` Pavel Begunkov
2020-02-29 17:34     ` Jens Axboe
2020-02-29 18:11       ` Jens Axboe
2020-03-09 17:03   ` Andres Freund
2020-03-09 17:17     ` Jens Axboe
2020-03-09 17:28       ` Andres Freund
2020-03-10 13:33         ` Jens Axboe
2020-02-28 20:30 ` [PATCH 3/6] io_uring: support buffer selection Jens Axboe
2020-02-29 12:21   ` Pavel Begunkov
2020-02-29 17:35     ` Jens Axboe
2020-03-09 17:21   ` Andres Freund
2020-03-10 13:37     ` Jens Axboe
2020-02-28 20:30 ` [PATCH 4/6] io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_READV Jens Axboe
2020-02-28 20:30 ` [PATCH 5/6] net: abstract out normal and compat msghdr import Jens Axboe
2020-02-28 20:30 ` [PATCH 6/6] io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_RECVMSG Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).