IO-Uring Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v3 00/13] io_uring: buffer registration enhancements
@ 2020-12-18 18:07 Bijan Mottahedeh
  2020-12-18 18:07 ` [PATCH v3 01/13] io_uring: modularize io_sqe_buffer_register Bijan Mottahedeh
                   ` (13 more replies)
  0 siblings, 14 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-18 18:07 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

v3:

- batch file->rsrc renames into a signle patch when possible
- fix other review changes from v2
- fix checkpatch warnings

v2:

- drop readv/writev with fixed buffers patch
- handle ref_nodes both both files/buffers with a single ref_list
- make file/buffer handling more unified

This patchset implements a set of enhancements to buffer registration
consistent with existing file registration functionality:

- buffer registration updates		IORING_REGISTER_BUFFERS_UPDATE
					IORING_OP_BUFFERS_UPDATE

- buffer registration sharing		IORING_SETUP_SHARE_BUF
					IORING_SETUP_ATTACH_BUF

I have kept the original patchset unchanged for the most part to
facilitate reviewing and so this set adds a number of additional patches
mostly making file/buffer handling more unified.

Patch 1-2 modularize existing buffer registration code.

Patch 3-7 generalize fixed_file functionality to fixed_rsrc.

Patch 8 applies fixed_rsrc functionality for fixed buffers support.

Patch 9-10 generalize files_update functionality to rsrc_update.

Patch 11 implements buffer registration update, and introduces
IORING_REGISTER_BUFFERS_UPDATE and IORING_OP_BUFFERS_UPDATE, consistent
with file registration update.

Patch 12 generalizes fixed resource allocation 

Patch 13 implements buffer sharing among multiple rings; it works as follows:

- A new ring, A,  is setup. Since no buffers have been registered, the
  registered buffer state is an empty set, Z. That's different from the
  NULL state in current implementation.

- Ring B is setup, attaching to Ring A. It's also attaching to it's
  buffer registrations, now we have two references to the same empty
  set, Z.

- Ring A registers buffers into set Z, which is no longer empty.

- Ring B sees this immediately, since it's already sharing that set.

Testing

I have used liburing file-{register,update} tests as models for
buffer-{register,update,share}, tests and they run ok.

TBD

- I tried to use a single opcode for files/buffers but ran into an
issue since work_flags is different for files/buffers.  This should
be ok for the most part since req->work.flags is ultimately examined;
however, there are place where io_op_defs[opcode].work_flags is examined
directly, and I wasn't sure what would the best way to handle that.

- Need to still address Pavel's comments about deadlocks. I figure
to send out the set anyway since this is a last patch and may even be
handled separately.

Bijan Mottahedeh (13):
  io_uring: modularize io_sqe_buffer_register
  io_uring: modularize io_sqe_buffers_register
  io_uring: rename file related variables to rsrc
  io_uring: generalize io_queue_rsrc_removal
  io_uring: separate ref_list from fixed_rsrc_data
  io_uring: generalize fixed_file_ref_node functionality
  io_uring: add rsrc_ref locking routines
  io_uring: implement fixed buffers registration similar to fixed files
  io_uring: create common fixed_rsrc_ref_node handling routines
  io_uring: generalize files_update functionlity to rsrc_update
  io_uring: support buffer registration updates
  io_uring: create common fixed_rsrc_data allocation routines.
  io_uring: support buffer registration sharing

 fs/io_uring.c                 | 1004 +++++++++++++++++++++++++++++------------
 include/uapi/linux/io_uring.h |   12 +-
 2 files changed, 735 insertions(+), 281 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 01/13] io_uring: modularize io_sqe_buffer_register
  2020-12-18 18:07 [PATCH v3 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
@ 2020-12-18 18:07 ` Bijan Mottahedeh
  2021-01-04 21:54   ` Pavel Begunkov
  2020-12-18 18:07 ` [PATCH v3 02/13] io_uring: modularize io_sqe_buffers_register Bijan Mottahedeh
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-18 18:07 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Split io_sqe_buffer_register into two routines:

- io_sqe_buffer_register() registers a single buffer
- io_sqe_buffers_register iterates over all user specified buffers

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c | 210 ++++++++++++++++++++++++++++++----------------------------
 1 file changed, 107 insertions(+), 103 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index bdfb473..7d6718a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -8252,7 +8252,7 @@ static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
 	return pages;
 }
 
-static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
+static int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
 {
 	int i, j;
 
@@ -8370,14 +8370,103 @@ static int io_buffer_account_pin(struct io_ring_ctx *ctx, struct page **pages,
 	return ret;
 }
 
-static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
-				  unsigned nr_args)
+static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
+				  struct io_mapped_ubuf *imu,
+				  struct page **last_hpage)
 {
 	struct vm_area_struct **vmas = NULL;
 	struct page **pages = NULL;
+	unsigned long off, start, end, ubuf;
+	size_t size;
+	int ret, pret, nr_pages, i;
+
+	ubuf = (unsigned long) iov->iov_base;
+	end = (ubuf + iov->iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	start = ubuf >> PAGE_SHIFT;
+	nr_pages = end - start;
+
+	ret = -ENOMEM;
+
+	pages = kvmalloc_array(nr_pages, sizeof(struct page *), GFP_KERNEL);
+	if (!pages)
+		goto done;
+
+	vmas = kvmalloc_array(nr_pages, sizeof(struct vm_area_struct *),
+			      GFP_KERNEL);
+	if (!vmas)
+		goto done;
+
+	imu->bvec = kvmalloc_array(nr_pages, sizeof(struct bio_vec),
+				   GFP_KERNEL);
+	if (!imu->bvec)
+		goto done;
+
+	ret = 0;
+	mmap_read_lock(current->mm);
+	pret = pin_user_pages(ubuf, nr_pages, FOLL_WRITE | FOLL_LONGTERM,
+			      pages, vmas);
+	if (pret == nr_pages) {
+		/* don't support file backed memory */
+		for (i = 0; i < nr_pages; i++) {
+			struct vm_area_struct *vma = vmas[i];
+
+			if (vma->vm_file &&
+			    !is_file_hugepages(vma->vm_file)) {
+				ret = -EOPNOTSUPP;
+				break;
+			}
+		}
+	} else {
+		ret = pret < 0 ? pret : -EFAULT;
+	}
+	mmap_read_unlock(current->mm);
+	if (ret) {
+		/*
+		 * if we did partial map, or found file backed vmas,
+		 * release any pages we did get
+		 */
+		if (pret > 0)
+			unpin_user_pages(pages, pret);
+		kvfree(imu->bvec);
+		goto done;
+	}
+
+	ret = io_buffer_account_pin(ctx, pages, pret, imu, last_hpage);
+	if (ret) {
+		unpin_user_pages(pages, pret);
+		kvfree(imu->bvec);
+		goto done;
+	}
+
+	off = ubuf & ~PAGE_MASK;
+	size = iov->iov_len;
+	for (i = 0; i < nr_pages; i++) {
+		size_t vec_len;
+
+		vec_len = min_t(size_t, size, PAGE_SIZE - off);
+		imu->bvec[i].bv_page = pages[i];
+		imu->bvec[i].bv_len = vec_len;
+		imu->bvec[i].bv_offset = off;
+		off = 0;
+		size -= vec_len;
+	}
+	/* store original address for later verification */
+	imu->ubuf = ubuf;
+	imu->len = iov->iov_len;
+	imu->nr_bvecs = nr_pages;
+	ret = 0;
+done:
+	kvfree(pages);
+	kvfree(vmas);
+	return ret;
+}
+
+static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
+				   unsigned int nr_args)
+{
+	int i, ret;
+	struct iovec iov;
 	struct page *last_hpage = NULL;
-	int i, j, got_pages = 0;
-	int ret = -EINVAL;
 
 	if (ctx->user_bufs)
 		return -EBUSY;
@@ -8391,14 +8480,10 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 
 	for (i = 0; i < nr_args; i++) {
 		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
-		unsigned long off, start, end, ubuf;
-		int pret, nr_pages;
-		struct iovec iov;
-		size_t size;
 
 		ret = io_copy_iov(ctx, &iov, arg, i);
 		if (ret)
-			goto err;
+			break;
 
 		/*
 		 * Don't impose further limits on the size and buffer
@@ -8407,103 +8492,22 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 		 */
 		ret = -EFAULT;
 		if (!iov.iov_base || !iov.iov_len)
-			goto err;
+			break;
 
 		/* arbitrary limit, but we need something */
 		if (iov.iov_len > SZ_1G)
-			goto err;
-
-		ubuf = (unsigned long) iov.iov_base;
-		end = (ubuf + iov.iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT;
-		start = ubuf >> PAGE_SHIFT;
-		nr_pages = end - start;
-
-		ret = 0;
-		if (!pages || nr_pages > got_pages) {
-			kvfree(vmas);
-			kvfree(pages);
-			pages = kvmalloc_array(nr_pages, sizeof(struct page *),
-						GFP_KERNEL);
-			vmas = kvmalloc_array(nr_pages,
-					sizeof(struct vm_area_struct *),
-					GFP_KERNEL);
-			if (!pages || !vmas) {
-				ret = -ENOMEM;
-				goto err;
-			}
-			got_pages = nr_pages;
-		}
-
-		imu->bvec = kvmalloc_array(nr_pages, sizeof(struct bio_vec),
-						GFP_KERNEL);
-		ret = -ENOMEM;
-		if (!imu->bvec)
-			goto err;
-
-		ret = 0;
-		mmap_read_lock(current->mm);
-		pret = pin_user_pages(ubuf, nr_pages,
-				      FOLL_WRITE | FOLL_LONGTERM,
-				      pages, vmas);
-		if (pret == nr_pages) {
-			/* don't support file backed memory */
-			for (j = 0; j < nr_pages; j++) {
-				struct vm_area_struct *vma = vmas[j];
-
-				if (vma->vm_file &&
-				    !is_file_hugepages(vma->vm_file)) {
-					ret = -EOPNOTSUPP;
-					break;
-				}
-			}
-		} else {
-			ret = pret < 0 ? pret : -EFAULT;
-		}
-		mmap_read_unlock(current->mm);
-		if (ret) {
-			/*
-			 * if we did partial map, or found file backed vmas,
-			 * release any pages we did get
-			 */
-			if (pret > 0)
-				unpin_user_pages(pages, pret);
-			kvfree(imu->bvec);
-			goto err;
-		}
-
-		ret = io_buffer_account_pin(ctx, pages, pret, imu, &last_hpage);
-		if (ret) {
-			unpin_user_pages(pages, pret);
-			kvfree(imu->bvec);
-			goto err;
-		}
+			break;
 
-		off = ubuf & ~PAGE_MASK;
-		size = iov.iov_len;
-		for (j = 0; j < nr_pages; j++) {
-			size_t vec_len;
-
-			vec_len = min_t(size_t, size, PAGE_SIZE - off);
-			imu->bvec[j].bv_page = pages[j];
-			imu->bvec[j].bv_len = vec_len;
-			imu->bvec[j].bv_offset = off;
-			off = 0;
-			size -= vec_len;
-		}
-		/* store original address for later verification */
-		imu->ubuf = ubuf;
-		imu->len = iov.iov_len;
-		imu->nr_bvecs = nr_pages;
+		ret = io_sqe_buffer_register(ctx, &iov, imu, &last_hpage);
+		if (ret)
+			break;
 
 		ctx->nr_user_bufs++;
 	}
-	kvfree(pages);
-	kvfree(vmas);
-	return 0;
-err:
-	kvfree(pages);
-	kvfree(vmas);
-	io_sqe_buffer_unregister(ctx);
+
+	if (ret)
+		io_sqe_buffers_unregister(ctx);
+
 	return ret;
 }
 
@@ -8557,7 +8561,7 @@ static void io_destroy_buffers(struct io_ring_ctx *ctx)
 static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 {
 	io_finish_async(ctx);
-	io_sqe_buffer_unregister(ctx);
+	io_sqe_buffers_unregister(ctx);
 
 	if (ctx->sqo_task) {
 		put_task_struct(ctx->sqo_task);
@@ -9853,13 +9857,13 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 
 	switch (opcode) {
 	case IORING_REGISTER_BUFFERS:
-		ret = io_sqe_buffer_register(ctx, arg, nr_args);
+		ret = io_sqe_buffers_register(ctx, arg, nr_args);
 		break;
 	case IORING_UNREGISTER_BUFFERS:
 		ret = -EINVAL;
 		if (arg || nr_args)
 			break;
-		ret = io_sqe_buffer_unregister(ctx);
+		ret = io_sqe_buffers_unregister(ctx);
 		break;
 	case IORING_REGISTER_FILES:
 		ret = io_sqe_files_register(ctx, arg, nr_args);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 02/13] io_uring: modularize io_sqe_buffers_register
  2020-12-18 18:07 [PATCH v3 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
  2020-12-18 18:07 ` [PATCH v3 01/13] io_uring: modularize io_sqe_buffer_register Bijan Mottahedeh
@ 2020-12-18 18:07 ` Bijan Mottahedeh
  2021-01-04 21:48   ` Pavel Begunkov
  2020-12-18 18:07 ` [PATCH v3 03/13] io_uring: rename file related variables to rsrc Bijan Mottahedeh
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-18 18:07 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Move allocation of buffer management structures, and validation of
buffers into separate routines.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c | 51 ++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 34 insertions(+), 17 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 7d6718a..d8505e2 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -8461,13 +8461,8 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 	return ret;
 }
 
-static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
-				   unsigned int nr_args)
+static int io_buffers_map_alloc(struct io_ring_ctx *ctx, unsigned int nr_args)
 {
-	int i, ret;
-	struct iovec iov;
-	struct page *last_hpage = NULL;
-
 	if (ctx->user_bufs)
 		return -EBUSY;
 	if (!nr_args || nr_args > UIO_MAXIOV)
@@ -8478,6 +8473,37 @@ static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 	if (!ctx->user_bufs)
 		return -ENOMEM;
 
+	return 0;
+}
+
+static int io_buffer_validate(struct iovec *iov)
+{
+	/*
+	 * Don't impose further limits on the size and buffer
+	 * constraints here, we'll -EINVAL later when IO is
+	 * submitted if they are wrong.
+	 */
+	if (!iov->iov_base || !iov->iov_len)
+		return -EFAULT;
+
+	/* arbitrary limit, but we need something */
+	if (iov->iov_len > SZ_1G)
+		return -EFAULT;
+
+	return 0;
+}
+
+static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
+				   unsigned int nr_args)
+{
+	int i, ret;
+	struct iovec iov;
+	struct page *last_hpage = NULL;
+
+	ret = io_buffers_map_alloc(ctx, nr_args);
+	if (ret)
+		return ret;
+
 	for (i = 0; i < nr_args; i++) {
 		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
 
@@ -8485,17 +8511,8 @@ static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 		if (ret)
 			break;
 
-		/*
-		 * Don't impose further limits on the size and buffer
-		 * constraints here, we'll -EINVAL later when IO is
-		 * submitted if they are wrong.
-		 */
-		ret = -EFAULT;
-		if (!iov.iov_base || !iov.iov_len)
-			break;
-
-		/* arbitrary limit, but we need something */
-		if (iov.iov_len > SZ_1G)
+		ret = io_buffer_validate(&iov);
+		if (ret)
 			break;
 
 		ret = io_sqe_buffer_register(ctx, &iov, imu, &last_hpage);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 03/13] io_uring: rename file related variables to rsrc
  2020-12-18 18:07 [PATCH v3 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
  2020-12-18 18:07 ` [PATCH v3 01/13] io_uring: modularize io_sqe_buffer_register Bijan Mottahedeh
  2020-12-18 18:07 ` [PATCH v3 02/13] io_uring: modularize io_sqe_buffers_register Bijan Mottahedeh
@ 2020-12-18 18:07 ` Bijan Mottahedeh
  2021-01-05  1:53   ` Pavel Begunkov
  2020-12-18 18:07 ` [PATCH v3 04/13] io_uring: generalize io_queue_rsrc_removal Bijan Mottahedeh
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-18 18:07 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

This is a prep rename patch for subsequent patches to generalize file
registration.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c                 | 200 ++++++++++++++++++++++--------------------
 include/uapi/linux/io_uring.h |   2 +-
 2 files changed, 104 insertions(+), 98 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index d8505e2..a14f1ba 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -195,24 +195,29 @@ struct io_mapped_ubuf {
 	unsigned long	acct_pages;
 };
 
-struct fixed_file_table {
+struct io_rsrc_put {
+	struct list_head list;
+	struct file *file;
+};
+
+struct fixed_rsrc_table {
 	struct file		**files;
 };
 
-struct fixed_file_ref_node {
+struct fixed_rsrc_ref_node {
 	struct percpu_ref		refs;
 	struct list_head		node;
-	struct list_head		file_list;
-	struct fixed_file_data		*file_data;
+	struct list_head		rsrc_list;
+	struct fixed_rsrc_data		*rsrc_data;
 	struct llist_node		llist;
 	bool				done;
 };
 
-struct fixed_file_data {
-	struct fixed_file_table		*table;
+struct fixed_rsrc_data {
+	struct fixed_rsrc_table		*table;
 	struct io_ring_ctx		*ctx;
 
-	struct fixed_file_ref_node	*node;
+	struct fixed_rsrc_ref_node	*node;
 	struct percpu_ref		refs;
 	struct completion		done;
 	struct list_head		ref_list;
@@ -318,7 +323,7 @@ struct io_ring_ctx {
 	 * readers must ensure that ->refs is alive as long as the file* is
 	 * used. Only updated through io_uring_register(2).
 	 */
-	struct fixed_file_data	*file_data;
+	struct fixed_rsrc_data	*file_data;
 	unsigned		nr_user_files;
 
 	/* if used, fixed mapped user buffers */
@@ -382,8 +387,8 @@ struct io_ring_ctx {
 		struct list_head	inflight_list;
 	} ____cacheline_aligned_in_smp;
 
-	struct delayed_work		file_put_work;
-	struct llist_head		file_put_llist;
+	struct delayed_work		rsrc_put_work;
+	struct llist_head		rsrc_put_llist;
 
 	struct work_struct		exit_work;
 	struct io_restriction		restrictions;
@@ -493,7 +498,7 @@ struct io_open {
 	unsigned long			nofile;
 };
 
-struct io_files_update {
+struct io_rsrc_update {
 	struct file			*file;
 	u64				arg;
 	u32				nr_args;
@@ -687,7 +692,7 @@ struct io_kiocb {
 		struct io_sr_msg	sr_msg;
 		struct io_open		open;
 		struct io_close		close;
-		struct io_files_update	files_update;
+		struct io_rsrc_update	rsrc_update;
 		struct io_fadvise	fadvise;
 		struct io_madvise	madvise;
 		struct io_epoll		epoll;
@@ -717,7 +722,7 @@ struct io_kiocb {
 	u64				user_data;
 
 	struct io_kiocb			*link;
-	struct percpu_ref		*fixed_file_refs;
+	struct percpu_ref		*fixed_rsrc_refs;
 
 	/*
 	 * 1. used with ctx->iopoll_list with reads/writes
@@ -1002,13 +1007,13 @@ static void __io_complete_rw(struct io_kiocb *req, long res, long res2,
 static void __io_queue_linked_timeout(struct io_kiocb *req);
 static void io_queue_linked_timeout(struct io_kiocb *req);
 static int __io_sqe_files_update(struct io_ring_ctx *ctx,
-				 struct io_uring_files_update *ip,
+				 struct io_uring_rsrc_update *ip,
 				 unsigned nr_args);
 static void __io_clean_op(struct io_kiocb *req);
 static struct file *io_file_get(struct io_submit_state *state,
 				struct io_kiocb *req, int fd, bool fixed);
 static void __io_queue_sqe(struct io_kiocb *req, struct io_comp_state *cs);
-static void io_file_put_work(struct work_struct *work);
+static void io_rsrc_put_work(struct work_struct *work);
 
 static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
 			       struct iovec **iovec, struct iov_iter *iter,
@@ -1048,9 +1053,9 @@ static inline void io_set_resource_node(struct io_kiocb *req)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 
-	if (!req->fixed_file_refs) {
-		req->fixed_file_refs = &ctx->file_data->node->refs;
-		percpu_ref_get(req->fixed_file_refs);
+	if (!req->fixed_rsrc_refs) {
+		req->fixed_rsrc_refs = &ctx->file_data->node->refs;
+		percpu_ref_get(req->fixed_rsrc_refs);
 	}
 }
 
@@ -1309,8 +1314,8 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_LIST_HEAD(&ctx->timeout_list);
 	spin_lock_init(&ctx->inflight_lock);
 	INIT_LIST_HEAD(&ctx->inflight_list);
-	INIT_DELAYED_WORK(&ctx->file_put_work, io_file_put_work);
-	init_llist_head(&ctx->file_put_llist);
+	INIT_DELAYED_WORK(&ctx->rsrc_put_work, io_rsrc_put_work);
+	init_llist_head(&ctx->rsrc_put_llist);
 	return ctx;
 err:
 	if (ctx->fallback_req)
@@ -1948,8 +1953,8 @@ static void io_dismantle_req(struct io_kiocb *req)
 		kfree(req->async_data);
 	if (req->file)
 		io_put_file(req, req->file, (req->flags & REQ_F_FIXED_FILE));
-	if (req->fixed_file_refs)
-		percpu_ref_put(req->fixed_file_refs);
+	if (req->fixed_rsrc_refs)
+		percpu_ref_put(req->fixed_rsrc_refs);
 	io_req_clean_work(req);
 }
 
@@ -5927,7 +5932,7 @@ static int io_async_cancel(struct io_kiocb *req)
 	return 0;
 }
 
-static int io_files_update_prep(struct io_kiocb *req,
+static int io_rsrc_update_prep(struct io_kiocb *req,
 				const struct io_uring_sqe *sqe)
 {
 	if (unlikely(req->ctx->flags & IORING_SETUP_SQPOLL))
@@ -5937,11 +5942,11 @@ static int io_files_update_prep(struct io_kiocb *req,
 	if (sqe->ioprio || sqe->rw_flags)
 		return -EINVAL;
 
-	req->files_update.offset = READ_ONCE(sqe->off);
-	req->files_update.nr_args = READ_ONCE(sqe->len);
-	if (!req->files_update.nr_args)
+	req->rsrc_update.offset = READ_ONCE(sqe->off);
+	req->rsrc_update.nr_args = READ_ONCE(sqe->len);
+	if (!req->rsrc_update.nr_args)
 		return -EINVAL;
-	req->files_update.arg = READ_ONCE(sqe->addr);
+	req->rsrc_update.arg = READ_ONCE(sqe->addr);
 	return 0;
 }
 
@@ -5949,17 +5954,17 @@ static int io_files_update(struct io_kiocb *req, bool force_nonblock,
 			   struct io_comp_state *cs)
 {
 	struct io_ring_ctx *ctx = req->ctx;
-	struct io_uring_files_update up;
+	struct io_uring_rsrc_update up;
 	int ret;
 
 	if (force_nonblock)
 		return -EAGAIN;
 
-	up.offset = req->files_update.offset;
-	up.fds = req->files_update.arg;
+	up.offset = req->rsrc_update.offset;
+	up.fds = req->rsrc_update.arg;
 
 	mutex_lock(&ctx->uring_lock);
-	ret = __io_sqe_files_update(ctx, &up, req->files_update.nr_args);
+	ret = __io_sqe_files_update(ctx, &up, req->rsrc_update.nr_args);
 	mutex_unlock(&ctx->uring_lock);
 
 	if (ret < 0)
@@ -6014,7 +6019,7 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	case IORING_OP_CLOSE:
 		return io_close_prep(req, sqe);
 	case IORING_OP_FILES_UPDATE:
-		return io_files_update_prep(req, sqe);
+		return io_rsrc_update_prep(req, sqe);
 	case IORING_OP_STATX:
 		return io_statx_prep(req, sqe);
 	case IORING_OP_FADVISE:
@@ -6364,7 +6369,7 @@ static struct io_wq_work *io_wq_submit_work(struct io_wq_work *work)
 static inline struct file *io_file_from_index(struct io_ring_ctx *ctx,
 					      int index)
 {
-	struct fixed_file_table *table;
+	struct fixed_rsrc_table *table;
 
 	table = &ctx->file_data->table[index >> IORING_FILE_TABLE_SHIFT];
 	return table->files[index & IORING_FILE_TABLE_MASK];
@@ -6750,7 +6755,7 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	req->ctx = ctx;
 	req->flags = 0;
 	req->link = NULL;
-	req->fixed_file_refs = NULL;
+	req->fixed_rsrc_refs = NULL;
 	/* one is dropped after submission, the other at completion */
 	refcount_set(&req->refs, 2);
 	req->task = current;
@@ -7234,18 +7239,18 @@ static void __io_sqe_files_unregister(struct io_ring_ctx *ctx)
 #endif
 }
 
-static void io_file_ref_kill(struct percpu_ref *ref)
+static void io_rsrc_ref_kill(struct percpu_ref *ref)
 {
-	struct fixed_file_data *data;
+	struct fixed_rsrc_data *data;
 
-	data = container_of(ref, struct fixed_file_data, refs);
+	data = container_of(ref, struct fixed_rsrc_data, refs);
 	complete(&data->done);
 }
 
 static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
 {
-	struct fixed_file_data *data = ctx->file_data;
-	struct fixed_file_ref_node *ref_node = NULL;
+	struct fixed_rsrc_data *data = ctx->file_data;
+	struct fixed_rsrc_ref_node *ref_node = NULL;
 	unsigned nr_tables, i;
 
 	if (!data)
@@ -7260,7 +7265,7 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
 	percpu_ref_kill(&data->refs);
 
 	/* wait for all refs nodes to complete */
-	flush_delayed_work(&ctx->file_put_work);
+	flush_delayed_work(&ctx->rsrc_put_work);
 	wait_for_completion(&data->done);
 
 	__io_sqe_files_unregister(ctx);
@@ -7494,13 +7499,13 @@ static int io_sqe_files_scm(struct io_ring_ctx *ctx)
 }
 #endif
 
-static int io_sqe_alloc_file_tables(struct fixed_file_data *file_data,
+static int io_sqe_alloc_file_tables(struct fixed_rsrc_data *file_data,
 				    unsigned nr_tables, unsigned nr_files)
 {
 	int i;
 
 	for (i = 0; i < nr_tables; i++) {
-		struct fixed_file_table *table = &file_data->table[i];
+		struct fixed_rsrc_table *table = &file_data->table[i];
 		unsigned this_files;
 
 		this_files = min(nr_files, IORING_MAX_FILES_TABLE);
@@ -7515,7 +7520,7 @@ static int io_sqe_alloc_file_tables(struct fixed_file_data *file_data,
 		return 0;
 
 	for (i = 0; i < nr_tables; i++) {
-		struct fixed_file_table *table = &file_data->table[i];
+		struct fixed_rsrc_table *table = &file_data->table[i];
 		kfree(table->files);
 	}
 	return 1;
@@ -7583,56 +7588,51 @@ static void io_ring_file_put(struct io_ring_ctx *ctx, struct file *file)
 #endif
 }
 
-struct io_file_put {
-	struct list_head list;
-	struct file *file;
-};
-
-static void __io_file_put_work(struct fixed_file_ref_node *ref_node)
+static void __io_rsrc_put_work(struct fixed_rsrc_ref_node *ref_node)
 {
-	struct fixed_file_data *file_data = ref_node->file_data;
-	struct io_ring_ctx *ctx = file_data->ctx;
-	struct io_file_put *pfile, *tmp;
+	struct fixed_rsrc_data *rsrc_data = ref_node->rsrc_data;
+	struct io_ring_ctx *ctx = rsrc_data->ctx;
+	struct io_rsrc_put *prsrc, *tmp;
 
-	list_for_each_entry_safe(pfile, tmp, &ref_node->file_list, list) {
-		list_del(&pfile->list);
-		io_ring_file_put(ctx, pfile->file);
-		kfree(pfile);
+	list_for_each_entry_safe(prsrc, tmp, &ref_node->rsrc_list, list) {
+		list_del(&prsrc->list);
+		io_ring_file_put(ctx, prsrc->file);
+		kfree(prsrc);
 	}
 
 	percpu_ref_exit(&ref_node->refs);
 	kfree(ref_node);
-	percpu_ref_put(&file_data->refs);
+	percpu_ref_put(&rsrc_data->refs);
 }
 
-static void io_file_put_work(struct work_struct *work)
+static void io_rsrc_put_work(struct work_struct *work)
 {
 	struct io_ring_ctx *ctx;
 	struct llist_node *node;
 
-	ctx = container_of(work, struct io_ring_ctx, file_put_work.work);
-	node = llist_del_all(&ctx->file_put_llist);
+	ctx = container_of(work, struct io_ring_ctx, rsrc_put_work.work);
+	node = llist_del_all(&ctx->rsrc_put_llist);
 
 	while (node) {
-		struct fixed_file_ref_node *ref_node;
+		struct fixed_rsrc_ref_node *ref_node;
 		struct llist_node *next = node->next;
 
-		ref_node = llist_entry(node, struct fixed_file_ref_node, llist);
-		__io_file_put_work(ref_node);
+		ref_node = llist_entry(node, struct fixed_rsrc_ref_node, llist);
+		__io_rsrc_put_work(ref_node);
 		node = next;
 	}
 }
 
-static void io_file_data_ref_zero(struct percpu_ref *ref)
+static void io_rsrc_data_ref_zero(struct percpu_ref *ref)
 {
-	struct fixed_file_ref_node *ref_node;
-	struct fixed_file_data *data;
+	struct fixed_rsrc_ref_node *ref_node;
+	struct fixed_rsrc_data *data;
 	struct io_ring_ctx *ctx;
 	bool first_add = false;
 	int delay = HZ;
 
-	ref_node = container_of(ref, struct fixed_file_ref_node, refs);
-	data = ref_node->file_data;
+	ref_node = container_of(ref, struct fixed_rsrc_ref_node, refs);
+	data = ref_node->rsrc_data;
 	ctx = data->ctx;
 
 	spin_lock_bh(&data->lock);
@@ -7640,12 +7640,12 @@ static void io_file_data_ref_zero(struct percpu_ref *ref)
 
 	while (!list_empty(&data->ref_list)) {
 		ref_node = list_first_entry(&data->ref_list,
-					struct fixed_file_ref_node, node);
+					struct fixed_rsrc_ref_node, node);
 		/* recycle ref nodes in order */
 		if (!ref_node->done)
 			break;
 		list_del(&ref_node->node);
-		first_add |= llist_add(&ref_node->llist, &ctx->file_put_llist);
+		first_add |= llist_add(&ref_node->llist, &ctx->rsrc_put_llist);
 	}
 	spin_unlock_bh(&data->lock);
 
@@ -7653,33 +7653,33 @@ static void io_file_data_ref_zero(struct percpu_ref *ref)
 		delay = 0;
 
 	if (!delay)
-		mod_delayed_work(system_wq, &ctx->file_put_work, 0);
+		mod_delayed_work(system_wq, &ctx->rsrc_put_work, 0);
 	else if (first_add)
-		queue_delayed_work(system_wq, &ctx->file_put_work, delay);
+		queue_delayed_work(system_wq, &ctx->rsrc_put_work, delay);
 }
 
-static struct fixed_file_ref_node *alloc_fixed_file_ref_node(
+static struct fixed_rsrc_ref_node *alloc_fixed_file_ref_node(
 			struct io_ring_ctx *ctx)
 {
-	struct fixed_file_ref_node *ref_node;
+	struct fixed_rsrc_ref_node *ref_node;
 
 	ref_node = kzalloc(sizeof(*ref_node), GFP_KERNEL);
 	if (!ref_node)
 		return ERR_PTR(-ENOMEM);
 
-	if (percpu_ref_init(&ref_node->refs, io_file_data_ref_zero,
+	if (percpu_ref_init(&ref_node->refs, io_rsrc_data_ref_zero,
 			    0, GFP_KERNEL)) {
 		kfree(ref_node);
 		return ERR_PTR(-ENOMEM);
 	}
 	INIT_LIST_HEAD(&ref_node->node);
-	INIT_LIST_HEAD(&ref_node->file_list);
-	ref_node->file_data = ctx->file_data;
+	INIT_LIST_HEAD(&ref_node->rsrc_list);
+	ref_node->rsrc_data = ctx->file_data;
 	ref_node->done = false;
 	return ref_node;
 }
 
-static void destroy_fixed_file_ref_node(struct fixed_file_ref_node *ref_node)
+static void destroy_fixed_rsrc_ref_node(struct fixed_rsrc_ref_node *ref_node)
 {
 	percpu_ref_exit(&ref_node->refs);
 	kfree(ref_node);
@@ -7692,8 +7692,8 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 	unsigned nr_tables, i;
 	struct file *file;
 	int fd, ret = -ENOMEM;
-	struct fixed_file_ref_node *ref_node;
-	struct fixed_file_data *file_data;
+	struct fixed_rsrc_ref_node *ref_node;
+	struct fixed_rsrc_data *file_data;
 
 	if (ctx->file_data)
 		return -EBUSY;
@@ -7716,7 +7716,7 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 	if (!file_data->table)
 		goto out_free;
 
-	if (percpu_ref_init(&file_data->refs, io_file_ref_kill,
+	if (percpu_ref_init(&file_data->refs, io_rsrc_ref_kill,
 				PERCPU_REF_ALLOW_REINIT, GFP_KERNEL))
 		goto out_free;
 
@@ -7725,7 +7725,7 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 	ctx->file_data = file_data;
 
 	for (i = 0; i < nr_args; i++, ctx->nr_user_files++) {
-		struct fixed_file_table *table;
+		struct fixed_rsrc_table *table;
 		unsigned index;
 
 		if (copy_from_user(&fd, &fds[i], sizeof(fd))) {
@@ -7836,28 +7836,34 @@ static int io_sqe_file_register(struct io_ring_ctx *ctx, struct file *file,
 #endif
 }
 
-static int io_queue_file_removal(struct fixed_file_data *data,
-				 struct file *file)
+static int io_queue_rsrc_removal(struct fixed_rsrc_data *data,
+				 struct file *rsrc)
 {
-	struct io_file_put *pfile;
-	struct fixed_file_ref_node *ref_node = data->node;
+	struct io_rsrc_put *prsrc;
+	struct fixed_rsrc_ref_node *ref_node = data->node;
 
-	pfile = kzalloc(sizeof(*pfile), GFP_KERNEL);
-	if (!pfile)
+	prsrc = kzalloc(sizeof(*prsrc), GFP_KERNEL);
+	if (!prsrc)
 		return -ENOMEM;
 
-	pfile->file = file;
-	list_add(&pfile->list, &ref_node->file_list);
+	prsrc->file = rsrc;
+	list_add(&prsrc->list, &ref_node->rsrc_list);
 
 	return 0;
 }
 
+static inline int io_queue_file_removal(struct fixed_rsrc_data *data,
+					struct file *file)
+{
+	return io_queue_rsrc_removal(data, file);
+}
+
 static int __io_sqe_files_update(struct io_ring_ctx *ctx,
-				 struct io_uring_files_update *up,
+				 struct io_uring_rsrc_update *up,
 				 unsigned nr_args)
 {
-	struct fixed_file_data *data = ctx->file_data;
-	struct fixed_file_ref_node *ref_node;
+	struct fixed_rsrc_data *data = ctx->file_data;
+	struct fixed_rsrc_ref_node *ref_node;
 	struct file *file;
 	__s32 __user *fds;
 	int fd, i, err;
@@ -7876,7 +7882,7 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 	done = 0;
 	fds = u64_to_user_ptr(up->fds);
 	while (nr_args) {
-		struct fixed_file_table *table;
+		struct fixed_rsrc_table *table;
 		unsigned index;
 
 		err = 0;
@@ -7935,7 +7941,7 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 		spin_unlock_bh(&data->lock);
 		percpu_ref_get(&ctx->file_data->refs);
 	} else
-		destroy_fixed_file_ref_node(ref_node);
+		destroy_fixed_rsrc_ref_node(ref_node);
 
 	return done ? done : err;
 }
@@ -7943,7 +7949,7 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 static int io_sqe_files_update(struct io_ring_ctx *ctx, void __user *arg,
 			       unsigned nr_args)
 {
-	struct io_uring_files_update up;
+	struct io_uring_rsrc_update up;
 
 	if (!ctx->file_data)
 		return -ENXIO;
@@ -9292,7 +9298,7 @@ static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m)
 	seq_printf(m, "SqThreadCpu:\t%d\n", sq ? task_cpu(sq->thread) : -1);
 	seq_printf(m, "UserFiles:\t%u\n", ctx->nr_user_files);
 	for (i = 0; has_lock && i < ctx->nr_user_files; i++) {
-		struct fixed_file_table *table;
+		struct fixed_rsrc_table *table;
 		struct file *f;
 
 		table = &ctx->file_data->table[i >> IORING_FILE_TABLE_SHIFT];
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index d31a2a1..d421f70 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -285,7 +285,7 @@ enum {
 	IORING_REGISTER_LAST
 };
 
-struct io_uring_files_update {
+struct io_uring_rsrc_update {
 	__u32 offset;
 	__u32 resv;
 	__aligned_u64 /* __s32 * */ fds;
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 04/13] io_uring: generalize io_queue_rsrc_removal
  2020-12-18 18:07 [PATCH v3 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
                   ` (2 preceding siblings ...)
  2020-12-18 18:07 ` [PATCH v3 03/13] io_uring: rename file related variables to rsrc Bijan Mottahedeh
@ 2020-12-18 18:07 ` Bijan Mottahedeh
  2020-12-18 18:07 ` [PATCH v3 05/13] io_uring: separate ref_list from fixed_rsrc_data Bijan Mottahedeh
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-18 18:07 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Generalize io_queue_rsrc_removal to handle both files and buffers.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index a14f1ba..c47f2ac 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -195,13 +195,22 @@ struct io_mapped_ubuf {
 	unsigned long	acct_pages;
 };
 
+struct io_ring_ctx;
+
 struct io_rsrc_put {
 	struct list_head list;
-	struct file *file;
+	union {
+		void *rsrc;
+		struct file *file;
+		struct io_mapped_ubuf *buf;
+	};
 };
 
 struct fixed_rsrc_table {
-	struct file		**files;
+	union {
+		struct file		**files;
+		struct io_mapped_ubuf	*bufs;
+	};
 };
 
 struct fixed_rsrc_ref_node {
@@ -209,6 +218,8 @@ struct fixed_rsrc_ref_node {
 	struct list_head		node;
 	struct list_head		rsrc_list;
 	struct fixed_rsrc_data		*rsrc_data;
+	void				(*rsrc_put)(struct io_ring_ctx *ctx,
+						    struct io_rsrc_put *prsrc);
 	struct llist_node		llist;
 	bool				done;
 };
@@ -7526,8 +7537,9 @@ static int io_sqe_alloc_file_tables(struct fixed_rsrc_data *file_data,
 	return 1;
 }
 
-static void io_ring_file_put(struct io_ring_ctx *ctx, struct file *file)
+static void io_ring_file_put(struct io_ring_ctx *ctx, struct io_rsrc_put *prsrc)
 {
+	struct file *file = prsrc->file;
 #if defined(CONFIG_UNIX)
 	struct sock *sock = ctx->ring_sock->sk;
 	struct sk_buff_head list, *head = &sock->sk_receive_queue;
@@ -7596,7 +7608,7 @@ static void __io_rsrc_put_work(struct fixed_rsrc_ref_node *ref_node)
 
 	list_for_each_entry_safe(prsrc, tmp, &ref_node->rsrc_list, list) {
 		list_del(&prsrc->list);
-		io_ring_file_put(ctx, prsrc->file);
+		ref_node->rsrc_put(ctx, prsrc);
 		kfree(prsrc);
 	}
 
@@ -7675,6 +7687,7 @@ static struct fixed_rsrc_ref_node *alloc_fixed_file_ref_node(
 	INIT_LIST_HEAD(&ref_node->node);
 	INIT_LIST_HEAD(&ref_node->rsrc_list);
 	ref_node->rsrc_data = ctx->file_data;
+	ref_node->rsrc_put = io_ring_file_put;
 	ref_node->done = false;
 	return ref_node;
 }
@@ -7836,8 +7849,7 @@ static int io_sqe_file_register(struct io_ring_ctx *ctx, struct file *file,
 #endif
 }
 
-static int io_queue_rsrc_removal(struct fixed_rsrc_data *data,
-				 struct file *rsrc)
+static int io_queue_rsrc_removal(struct fixed_rsrc_data *data, void *rsrc)
 {
 	struct io_rsrc_put *prsrc;
 	struct fixed_rsrc_ref_node *ref_node = data->node;
@@ -7846,7 +7858,7 @@ static int io_queue_rsrc_removal(struct fixed_rsrc_data *data,
 	if (!prsrc)
 		return -ENOMEM;
 
-	prsrc->file = rsrc;
+	prsrc->rsrc = rsrc;
 	list_add(&prsrc->list, &ref_node->rsrc_list);
 
 	return 0;
@@ -7855,7 +7867,7 @@ static int io_queue_rsrc_removal(struct fixed_rsrc_data *data,
 static inline int io_queue_file_removal(struct fixed_rsrc_data *data,
 					struct file *file)
 {
-	return io_queue_rsrc_removal(data, file);
+	return io_queue_rsrc_removal(data, (void *)file);
 }
 
 static int __io_sqe_files_update(struct io_ring_ctx *ctx,
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 05/13] io_uring: separate ref_list from fixed_rsrc_data
  2020-12-18 18:07 [PATCH v3 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
                   ` (3 preceding siblings ...)
  2020-12-18 18:07 ` [PATCH v3 04/13] io_uring: generalize io_queue_rsrc_removal Bijan Mottahedeh
@ 2020-12-18 18:07 ` Bijan Mottahedeh
  2020-12-18 18:07 ` [PATCH v3 06/13] io_uring: generalize fixed_file_ref_node functionality Bijan Mottahedeh
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-18 18:07 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Uplevel ref_list and make it common to all resources.  This is to
allow one common ref_list to be used for both files, and buffers
in upcoming patches.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index c47f2ac..28e178b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -231,8 +231,6 @@ struct fixed_rsrc_data {
 	struct fixed_rsrc_ref_node	*node;
 	struct percpu_ref		refs;
 	struct completion		done;
-	struct list_head		ref_list;
-	spinlock_t			lock;
 };
 
 struct io_buffer {
@@ -400,6 +398,8 @@ struct io_ring_ctx {
 
 	struct delayed_work		rsrc_put_work;
 	struct llist_head		rsrc_put_llist;
+	struct list_head		rsrc_ref_list;
+	spinlock_t			rsrc_ref_lock;
 
 	struct work_struct		exit_work;
 	struct io_restriction		restrictions;
@@ -1325,6 +1325,8 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_LIST_HEAD(&ctx->timeout_list);
 	spin_lock_init(&ctx->inflight_lock);
 	INIT_LIST_HEAD(&ctx->inflight_list);
+	spin_lock_init(&ctx->rsrc_ref_lock);
+	INIT_LIST_HEAD(&ctx->rsrc_ref_list);
 	INIT_DELAYED_WORK(&ctx->rsrc_put_work, io_rsrc_put_work);
 	init_llist_head(&ctx->rsrc_put_llist);
 	return ctx;
@@ -7267,9 +7269,9 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
 	if (!data)
 		return -ENXIO;
 
-	spin_lock_bh(&data->lock);
+	spin_lock_bh(&ctx->rsrc_ref_lock);
 	ref_node = data->node;
-	spin_unlock_bh(&data->lock);
+	spin_unlock_bh(&ctx->rsrc_ref_lock);
 	if (ref_node)
 		percpu_ref_kill(&ref_node->refs);
 
@@ -7647,11 +7649,11 @@ static void io_rsrc_data_ref_zero(struct percpu_ref *ref)
 	data = ref_node->rsrc_data;
 	ctx = data->ctx;
 
-	spin_lock_bh(&data->lock);
+	spin_lock_bh(&ctx->rsrc_ref_lock);
 	ref_node->done = true;
 
-	while (!list_empty(&data->ref_list)) {
-		ref_node = list_first_entry(&data->ref_list,
+	while (!list_empty(&ctx->rsrc_ref_list)) {
+		ref_node = list_first_entry(&ctx->rsrc_ref_list,
 					struct fixed_rsrc_ref_node, node);
 		/* recycle ref nodes in order */
 		if (!ref_node->done)
@@ -7659,7 +7661,7 @@ static void io_rsrc_data_ref_zero(struct percpu_ref *ref)
 		list_del(&ref_node->node);
 		first_add |= llist_add(&ref_node->llist, &ctx->rsrc_put_llist);
 	}
-	spin_unlock_bh(&data->lock);
+	spin_unlock_bh(&ctx->rsrc_ref_lock);
 
 	if (percpu_ref_is_dying(&data->refs))
 		delay = 0;
@@ -7720,8 +7722,6 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 		return -ENOMEM;
 	file_data->ctx = ctx;
 	init_completion(&file_data->done);
-	INIT_LIST_HEAD(&file_data->ref_list);
-	spin_lock_init(&file_data->lock);
 
 	nr_tables = DIV_ROUND_UP(nr_args, IORING_MAX_FILES_TABLE);
 	file_data->table = kcalloc(nr_tables, sizeof(*file_data->table),
@@ -7783,9 +7783,9 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 	}
 
 	file_data->node = ref_node;
-	spin_lock_bh(&file_data->lock);
-	list_add_tail(&ref_node->node, &file_data->ref_list);
-	spin_unlock_bh(&file_data->lock);
+	spin_lock_bh(&ctx->rsrc_ref_lock);
+	list_add_tail(&ref_node->node, &ctx->rsrc_ref_list);
+	spin_unlock_bh(&ctx->rsrc_ref_lock);
 	percpu_ref_get(&file_data->refs);
 	return ret;
 out_fput:
@@ -7947,10 +7947,10 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 
 	if (needs_switch) {
 		percpu_ref_kill(&data->node->refs);
-		spin_lock_bh(&data->lock);
-		list_add_tail(&ref_node->node, &data->ref_list);
+		spin_lock_bh(&ctx->rsrc_ref_lock);
+		list_add_tail(&ref_node->node, &ctx->rsrc_ref_list);
 		data->node = ref_node;
-		spin_unlock_bh(&data->lock);
+		spin_unlock_bh(&ctx->rsrc_ref_lock);
 		percpu_ref_get(&ctx->file_data->refs);
 	} else
 		destroy_fixed_rsrc_ref_node(ref_node);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 06/13] io_uring: generalize fixed_file_ref_node functionality
  2020-12-18 18:07 [PATCH v3 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
                   ` (4 preceding siblings ...)
  2020-12-18 18:07 ` [PATCH v3 05/13] io_uring: separate ref_list from fixed_rsrc_data Bijan Mottahedeh
@ 2020-12-18 18:07 ` Bijan Mottahedeh
  2020-12-18 18:07 ` [PATCH v3 07/13] io_uring: add rsrc_ref locking routines Bijan Mottahedeh
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-18 18:07 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Split alloc_fixed_file_ref_node into resource generic/specific parts,
rename destroy_fixed_file_ref_node, and factor out fixed_file_ref_node
switching, to be be leveraged by fixed buffers.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c | 33 ++++++++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 7 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 28e178b..e23f67f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7672,7 +7672,7 @@ static void io_rsrc_data_ref_zero(struct percpu_ref *ref)
 		queue_delayed_work(system_wq, &ctx->rsrc_put_work, delay);
 }
 
-static struct fixed_rsrc_ref_node *alloc_fixed_file_ref_node(
+static struct fixed_rsrc_ref_node *alloc_fixed_rsrc_ref_node(
 			struct io_ring_ctx *ctx)
 {
 	struct fixed_rsrc_ref_node *ref_node;
@@ -7688,9 +7688,21 @@ static struct fixed_rsrc_ref_node *alloc_fixed_file_ref_node(
 	}
 	INIT_LIST_HEAD(&ref_node->node);
 	INIT_LIST_HEAD(&ref_node->rsrc_list);
+	ref_node->done = false;
+	return ref_node;
+}
+
+static struct fixed_rsrc_ref_node *alloc_fixed_file_ref_node(
+			struct io_ring_ctx *ctx)
+{
+	struct fixed_rsrc_ref_node *ref_node;
+
+	ref_node = alloc_fixed_rsrc_ref_node(ctx);
+	if (!ref_node)
+		return NULL;
+
 	ref_node->rsrc_data = ctx->file_data;
 	ref_node->rsrc_put = io_ring_file_put;
-	ref_node->done = false;
 	return ref_node;
 }
 
@@ -7870,6 +7882,17 @@ static inline int io_queue_file_removal(struct fixed_rsrc_data *data,
 	return io_queue_rsrc_removal(data, (void *)file);
 }
 
+static void switch_fixed_rsrc_ref_node(struct fixed_rsrc_ref_node *ref_node,
+				       struct fixed_rsrc_data *data,
+				       struct io_ring_ctx *ctx)
+{
+	percpu_ref_kill(&data->node->refs);
+	spin_lock_bh(&ctx->rsrc_ref_lock);
+	list_add_tail(&ref_node->node, &ctx->rsrc_ref_list);
+	data->node = ref_node;
+	spin_unlock_bh(&ctx->rsrc_ref_lock);
+}
+
 static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 				 struct io_uring_rsrc_update *up,
 				 unsigned nr_args)
@@ -7946,11 +7969,7 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 	}
 
 	if (needs_switch) {
-		percpu_ref_kill(&data->node->refs);
-		spin_lock_bh(&ctx->rsrc_ref_lock);
-		list_add_tail(&ref_node->node, &ctx->rsrc_ref_list);
-		data->node = ref_node;
-		spin_unlock_bh(&ctx->rsrc_ref_lock);
+		switch_fixed_rsrc_ref_node(ref_node, data, ctx);
 		percpu_ref_get(&ctx->file_data->refs);
 	} else
 		destroy_fixed_rsrc_ref_node(ref_node);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 07/13] io_uring: add rsrc_ref locking routines
  2020-12-18 18:07 [PATCH v3 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
                   ` (5 preceding siblings ...)
  2020-12-18 18:07 ` [PATCH v3 06/13] io_uring: generalize fixed_file_ref_node functionality Bijan Mottahedeh
@ 2020-12-18 18:07 ` Bijan Mottahedeh
  2020-12-18 18:07 ` [PATCH v3 08/13] io_uring: implement fixed buffers registration similar to fixed files Bijan Mottahedeh
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-18 18:07 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Encapsulate resource reference locking into separate routines.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index e23f67f..bad2477 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7260,6 +7260,16 @@ static void io_rsrc_ref_kill(struct percpu_ref *ref)
 	complete(&data->done);
 }
 
+static inline void io_rsrc_ref_lock(struct io_ring_ctx *ctx)
+{
+	spin_lock_bh(&ctx->rsrc_ref_lock);
+}
+
+static inline void io_rsrc_ref_unlock(struct io_ring_ctx *ctx)
+{
+	spin_unlock_bh(&ctx->rsrc_ref_lock);
+}
+
 static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
 {
 	struct fixed_rsrc_data *data = ctx->file_data;
@@ -7269,9 +7279,9 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
 	if (!data)
 		return -ENXIO;
 
-	spin_lock_bh(&ctx->rsrc_ref_lock);
+	io_rsrc_ref_lock(ctx);
 	ref_node = data->node;
-	spin_unlock_bh(&ctx->rsrc_ref_lock);
+	io_rsrc_ref_unlock(ctx);
 	if (ref_node)
 		percpu_ref_kill(&ref_node->refs);
 
@@ -7649,7 +7659,7 @@ static void io_rsrc_data_ref_zero(struct percpu_ref *ref)
 	data = ref_node->rsrc_data;
 	ctx = data->ctx;
 
-	spin_lock_bh(&ctx->rsrc_ref_lock);
+	io_rsrc_ref_lock(ctx);
 	ref_node->done = true;
 
 	while (!list_empty(&ctx->rsrc_ref_list)) {
@@ -7661,7 +7671,7 @@ static void io_rsrc_data_ref_zero(struct percpu_ref *ref)
 		list_del(&ref_node->node);
 		first_add |= llist_add(&ref_node->llist, &ctx->rsrc_put_llist);
 	}
-	spin_unlock_bh(&ctx->rsrc_ref_lock);
+	io_rsrc_ref_unlock(ctx);
 
 	if (percpu_ref_is_dying(&data->refs))
 		delay = 0;
@@ -7795,9 +7805,9 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 	}
 
 	file_data->node = ref_node;
-	spin_lock_bh(&ctx->rsrc_ref_lock);
+	io_rsrc_ref_lock(ctx);
 	list_add_tail(&ref_node->node, &ctx->rsrc_ref_list);
-	spin_unlock_bh(&ctx->rsrc_ref_lock);
+	io_rsrc_ref_unlock(ctx);
 	percpu_ref_get(&file_data->refs);
 	return ret;
 out_fput:
@@ -7887,10 +7897,10 @@ static void switch_fixed_rsrc_ref_node(struct fixed_rsrc_ref_node *ref_node,
 				       struct io_ring_ctx *ctx)
 {
 	percpu_ref_kill(&data->node->refs);
-	spin_lock_bh(&ctx->rsrc_ref_lock);
+	io_rsrc_ref_lock(ctx);
 	list_add_tail(&ref_node->node, &ctx->rsrc_ref_list);
 	data->node = ref_node;
-	spin_unlock_bh(&ctx->rsrc_ref_lock);
+	io_rsrc_ref_unlock(ctx);
 }
 
 static int __io_sqe_files_update(struct io_ring_ctx *ctx,
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 08/13] io_uring: implement fixed buffers registration similar to fixed files
  2020-12-18 18:07 [PATCH v3 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
                   ` (6 preceding siblings ...)
  2020-12-18 18:07 ` [PATCH v3 07/13] io_uring: add rsrc_ref locking routines Bijan Mottahedeh
@ 2020-12-18 18:07 ` Bijan Mottahedeh
  2021-01-05  2:43   ` Pavel Begunkov
  2020-12-18 18:07 ` [PATCH v3 09/13] io_uring: create common fixed_rsrc_ref_node handling routines Bijan Mottahedeh
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-18 18:07 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Apply fixed_rsrc functionality for fixed buffers support.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c | 242 +++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 206 insertions(+), 36 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index bad2477..bac2813 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -104,6 +104,14 @@
 #define IORING_MAX_RESTRICTIONS	(IORING_RESTRICTION_LAST + \
 				 IORING_REGISTER_LAST + IORING_OP_LAST)
 
+/*
+ * Shift of 7 is 128 entries, or exactly one page on 64-bit archs
+ */
+#define IORING_BUF_TABLE_SHIFT	7	/* struct io_mapped_ubuf */
+#define IORING_MAX_BUFS_TABLE	(1U << IORING_BUF_TABLE_SHIFT)
+#define IORING_BUF_TABLE_MASK	(IORING_MAX_BUFS_TABLE - 1)
+#define IORING_MAX_FIXED_BUFS	UIO_MAXIOV
+
 struct io_uring {
 	u32 head ____cacheline_aligned_in_smp;
 	u32 tail ____cacheline_aligned_in_smp;
@@ -336,8 +344,8 @@ struct io_ring_ctx {
 	unsigned		nr_user_files;
 
 	/* if used, fixed mapped user buffers */
+	struct fixed_rsrc_data	*buf_data;
 	unsigned		nr_user_bufs;
-	struct io_mapped_ubuf	*user_bufs;
 
 	struct user_struct	*user;
 
@@ -2962,6 +2970,15 @@ static void kiocb_done(struct kiocb *kiocb, ssize_t ret,
 		io_rw_done(kiocb, ret);
 }
 
+static inline struct io_mapped_ubuf *io_buf_from_index(struct io_ring_ctx *ctx,
+						       int index)
+{
+	struct fixed_rsrc_table *table;
+
+	table = &ctx->buf_data->table[index >> IORING_BUF_TABLE_SHIFT];
+	return &table->bufs[index & IORING_BUF_TABLE_MASK];
+}
+
 static ssize_t io_import_fixed(struct io_kiocb *req, int rw,
 			       struct iov_iter *iter)
 {
@@ -2975,7 +2992,7 @@ static ssize_t io_import_fixed(struct io_kiocb *req, int rw,
 	if (unlikely(buf_index >= ctx->nr_user_bufs))
 		return -EFAULT;
 	index = array_index_nospec(buf_index, ctx->nr_user_bufs);
-	imu = &ctx->user_bufs[index];
+	imu = io_buf_from_index(ctx, index);
 	buf_addr = req->rw.addr;
 
 	/* overflow */
@@ -8299,28 +8316,71 @@ static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
 	return pages;
 }
 
-static int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
+static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
 {
-	int i, j;
+	unsigned int i;
 
-	if (!ctx->user_bufs)
-		return -ENXIO;
+	for (i = 0; i < imu->nr_bvecs; i++)
+		unpin_user_page(imu->bvec[i].bv_page);
 
-	for (i = 0; i < ctx->nr_user_bufs; i++) {
-		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
+	if (imu->acct_pages)
+		io_unaccount_mem(ctx, imu->nr_bvecs, ACCT_PINNED);
+	kvfree(imu->bvec);
+	imu->nr_bvecs = 0;
+}
 
-		for (j = 0; j < imu->nr_bvecs; j++)
-			unpin_user_page(imu->bvec[j].bv_page);
+static void io_buffers_unmap(struct io_ring_ctx *ctx)
+{
+	unsigned int i;
+	struct io_mapped_ubuf *imu;
 
-		if (imu->acct_pages)
-			io_unaccount_mem(ctx, imu->acct_pages, ACCT_PINNED);
-		kvfree(imu->bvec);
-		imu->nr_bvecs = 0;
+	for (i = 0; i < ctx->nr_user_bufs; i++) {
+		imu = io_buf_from_index(ctx, i);
+		io_buffer_unmap(ctx, imu);
 	}
+}
+
+static void io_buffers_map_free(struct io_ring_ctx *ctx)
+{
+	struct fixed_rsrc_data *data = ctx->buf_data;
+	unsigned int nr_tables, i;
 
-	kfree(ctx->user_bufs);
-	ctx->user_bufs = NULL;
+	if (!data)
+		return;
+
+	nr_tables = DIV_ROUND_UP(ctx->nr_user_bufs, IORING_MAX_BUFS_TABLE);
+	for (i = 0; i < nr_tables; i++)
+		kfree(data->table[i].bufs);
+	kfree(data->table);
+	percpu_ref_exit(&data->refs);
+	kfree(data);
+	ctx->buf_data = NULL;
 	ctx->nr_user_bufs = 0;
+}
+
+static int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
+{
+	struct fixed_rsrc_data *data = ctx->buf_data;
+	struct fixed_rsrc_ref_node *ref_node = NULL;
+
+	if (!data)
+		return -ENXIO;
+
+	io_rsrc_ref_lock(ctx);
+	ref_node = data->node;
+	io_rsrc_ref_unlock(ctx);
+	if (ref_node)
+		percpu_ref_kill(&ref_node->refs);
+
+	percpu_ref_kill(&data->refs);
+
+	/* wait for all refs nodes to complete */
+	flush_delayed_work(&ctx->rsrc_put_work);
+	wait_for_completion(&data->done);
+
+	io_buffers_unmap(ctx);
+	io_buffers_map_free(ctx);
+
 	return 0;
 }
 
@@ -8373,7 +8433,13 @@ static bool headpage_already_acct(struct io_ring_ctx *ctx, struct page **pages,
 
 	/* check previously registered pages */
 	for (i = 0; i < ctx->nr_user_bufs; i++) {
-		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
+		struct fixed_rsrc_table *table;
+		struct io_mapped_ubuf *imu;
+		unsigned int index;
+
+		table = &ctx->buf_data->table[i >> IORING_BUF_TABLE_SHIFT];
+		index = i & IORING_BUF_TABLE_MASK;
+		imu = &table->bufs[index];
 
 		for (j = 0; j < imu->nr_bvecs; j++) {
 			if (!PageCompound(imu->bvec[j].bv_page))
@@ -8508,19 +8574,79 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 	return ret;
 }
 
-static int io_buffers_map_alloc(struct io_ring_ctx *ctx, unsigned int nr_args)
+static void io_free_buf_tables(struct fixed_rsrc_data *buf_data,
+			       unsigned int nr_tables)
 {
-	if (ctx->user_bufs)
-		return -EBUSY;
-	if (!nr_args || nr_args > UIO_MAXIOV)
-		return -EINVAL;
+	int i;
 
-	ctx->user_bufs = kcalloc(nr_args, sizeof(struct io_mapped_ubuf),
-					GFP_KERNEL);
-	if (!ctx->user_bufs)
-		return -ENOMEM;
+	for (i = 0; i < nr_tables; i++) {
+		struct fixed_rsrc_table *table = &buf_data->table[i];
 
-	return 0;
+		kfree(table->bufs);
+	}
+}
+
+static int io_alloc_buf_tables(struct fixed_rsrc_data *buf_data,
+			       unsigned int nr_tables, unsigned int nr_bufs)
+{
+	int i;
+		
+	for (i = 0; i < nr_tables; i++) {
+		struct fixed_rsrc_table *table = &buf_data->table[i];
+		unsigned int this_bufs;
+
+		this_bufs = min(nr_bufs, IORING_MAX_BUFS_TABLE);
+		table->bufs = kcalloc(this_bufs, sizeof(struct io_mapped_ubuf),
+				      GFP_KERNEL);
+		if (!table->bufs)
+			break;
+		nr_bufs -= this_bufs;
+	}
+
+	if (i == nr_tables)
+		return 0;
+
+	io_free_buf_tables(buf_data, nr_tables);
+	return 1;
+}
+
+static struct fixed_rsrc_data *io_buffers_map_alloc(struct io_ring_ctx *ctx,
+						    unsigned int nr_args)
+{
+	unsigned nr_tables;
+	struct fixed_rsrc_data *buf_data;
+	int ret = -ENOMEM;
+
+	if (!nr_args || nr_args > IORING_MAX_FIXED_BUFS)
+		return ERR_PTR(-EINVAL);
+
+	buf_data = kzalloc(sizeof(*ctx->buf_data), GFP_KERNEL);
+	if (!buf_data)
+		return ERR_PTR(-ENOMEM);
+	buf_data->ctx = ctx;
+	init_completion(&buf_data->done);
+
+	nr_tables = DIV_ROUND_UP(nr_args, IORING_MAX_BUFS_TABLE);
+	buf_data->table = kcalloc(nr_tables, sizeof(*buf_data->table),
+				  GFP_KERNEL);
+	if (!buf_data->table)
+		goto out_free;
+
+	if (percpu_ref_init(&buf_data->refs, io_rsrc_ref_kill,
+			    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL))
+		goto out_free;
+
+	if (io_alloc_buf_tables(buf_data, nr_tables, nr_args))
+		goto out_ref;
+
+	return buf_data;
+
+out_ref:
+	percpu_ref_exit(&buf_data->refs);
+out_free:
+	kfree(buf_data->table);
+	kfree(buf_data);
+	return ERR_PTR(ret);
 }
 
 static int io_buffer_validate(struct iovec *iov)
@@ -8540,39 +8666,82 @@ static int io_buffer_validate(struct iovec *iov)
 	return 0;
 }
 
+static void io_ring_buf_put(struct io_ring_ctx *ctx, struct io_rsrc_put *prsrc)
+{
+	io_buffer_unmap(ctx, prsrc->buf);
+}
+
+static struct fixed_rsrc_ref_node *alloc_fixed_buf_ref_node(
+			struct io_ring_ctx *ctx)
+{
+	struct fixed_rsrc_ref_node *ref_node;
+
+	ref_node = alloc_fixed_rsrc_ref_node(ctx);
+	ref_node->rsrc_data = ctx->buf_data;
+	ref_node->rsrc_put = io_ring_buf_put;
+	return ref_node;
+}
+
 static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 				   unsigned int nr_args)
 {
 	int i, ret;
 	struct iovec iov;
 	struct page *last_hpage = NULL;
+	struct fixed_rsrc_ref_node *ref_node;
+	struct fixed_rsrc_data *buf_data;
 
-	ret = io_buffers_map_alloc(ctx, nr_args);
-	if (ret)
-		return ret;
+	if (ctx->nr_user_bufs)
+		return -EBUSY;
 
-	for (i = 0; i < nr_args; i++) {
-		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
+	buf_data = io_buffers_map_alloc(ctx, nr_args);
+	if (IS_ERR(buf_data))
+		return PTR_ERR(buf_data);
+
+	for (i = 0; i < nr_args; i++, ctx->nr_user_bufs++) {
+		struct fixed_rsrc_table *table;
+		struct io_mapped_ubuf *imu;
+		unsigned int index;
 
 		ret = io_copy_iov(ctx, &iov, arg, i);
 		if (ret)
 			break;
 
+		/* allow sparse sets */
+		if (!iov.iov_base && !iov.iov_len)
+			continue;
+
 		ret = io_buffer_validate(&iov);
 		if (ret)
 			break;
 
+		table = &buf_data->table[i >> IORING_BUF_TABLE_SHIFT];
+		index = i & IORING_BUF_TABLE_MASK;
+		imu = &table->bufs[index];
+
 		ret = io_sqe_buffer_register(ctx, &iov, imu, &last_hpage);
 		if (ret)
 			break;
+	}
 
-		ctx->nr_user_bufs++;
+	ctx->buf_data = buf_data;
+	if (ret) {
+		io_sqe_buffers_unregister(ctx);
+		return ret;
 	}
 
-	if (ret)
+	ref_node = alloc_fixed_buf_ref_node(ctx);
+	if (IS_ERR(ref_node)) {
 		io_sqe_buffers_unregister(ctx);
+		return PTR_ERR(ref_node);
+	}
 
-	return ret;
+	buf_data->node = ref_node;
+	io_rsrc_ref_lock(ctx);
+	list_add(&ref_node->node, &ctx->rsrc_ref_list);
+	io_rsrc_ref_unlock(ctx);
+	percpu_ref_get(&buf_data->refs);
+	return 0;
 }
 
 static int io_eventfd_register(struct io_ring_ctx *ctx, void __user *arg)
@@ -9351,7 +9520,7 @@ static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m)
 	}
 	seq_printf(m, "UserBufs:\t%u\n", ctx->nr_user_bufs);
 	for (i = 0; has_lock && i < ctx->nr_user_bufs; i++) {
-		struct io_mapped_ubuf *buf = &ctx->user_bufs[i];
+		struct io_mapped_ubuf *buf = io_buf_from_index(ctx, i);
 
 		seq_printf(m, "%5u: 0x%llx/%u\n", i, buf->ubuf,
 						(unsigned int) buf->len);
@@ -9854,6 +10023,7 @@ static bool io_register_op_must_quiesce(int op)
 	switch (op) {
 	case IORING_UNREGISTER_FILES:
 	case IORING_REGISTER_FILES_UPDATE:
+	case IORING_UNREGISTER_BUFFERS:
 	case IORING_REGISTER_PROBE:
 	case IORING_REGISTER_PERSONALITY:
 	case IORING_UNREGISTER_PERSONALITY:
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 09/13] io_uring: create common fixed_rsrc_ref_node handling routines
  2020-12-18 18:07 [PATCH v3 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
                   ` (7 preceding siblings ...)
  2020-12-18 18:07 ` [PATCH v3 08/13] io_uring: implement fixed buffers registration similar to fixed files Bijan Mottahedeh
@ 2020-12-18 18:07 ` Bijan Mottahedeh
  2020-12-18 18:07 ` [PATCH v3 10/13] io_uring: generalize files_update functionlity to rsrc_update Bijan Mottahedeh
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-18 18:07 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Create common routines to be used for both files/buffers registration.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c | 66 +++++++++++++++++++++++++++--------------------------------
 1 file changed, 30 insertions(+), 36 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index bac2813..7e1467c 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7287,14 +7287,10 @@ static inline void io_rsrc_ref_unlock(struct io_ring_ctx *ctx)
 	spin_unlock_bh(&ctx->rsrc_ref_lock);
 }
 
-static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
+static void io_rsrc_ref_quiesce(struct fixed_rsrc_data *data,
+				struct io_ring_ctx *ctx)
 {
-	struct fixed_rsrc_data *data = ctx->file_data;
 	struct fixed_rsrc_ref_node *ref_node = NULL;
-	unsigned nr_tables, i;
-
-	if (!data)
-		return -ENXIO;
 
 	io_rsrc_ref_lock(ctx);
 	ref_node = data->node;
@@ -7307,6 +7303,17 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
 	/* wait for all refs nodes to complete */
 	flush_delayed_work(&ctx->rsrc_put_work);
 	wait_for_completion(&data->done);
+}
+
+static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
+{
+	struct fixed_rsrc_data *data = ctx->file_data;
+	unsigned nr_tables, i;
+
+	if (!data)
+		return -ENXIO;
+
+	io_rsrc_ref_quiesce(data, ctx);
 
 	__io_sqe_files_unregister(ctx);
 	nr_tables = DIV_ROUND_UP(ctx->nr_user_files, IORING_MAX_FILES_TABLE);
@@ -7739,6 +7746,17 @@ static void destroy_fixed_rsrc_ref_node(struct fixed_rsrc_ref_node *ref_node)
 	kfree(ref_node);
 }
 
+static void add_fixed_rsrc_ref_node(struct fixed_rsrc_ref_node *ref_node,
+				    struct fixed_rsrc_data *data,
+				    struct io_ring_ctx *ctx)
+{
+	io_rsrc_ref_lock(ctx);
+	list_add_tail(&ref_node->node, &ctx->rsrc_ref_list);
+	data->node = ref_node;
+	io_rsrc_ref_unlock(ctx);
+	percpu_ref_get(&data->refs);
+}
+
 static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 				 unsigned nr_args)
 {
@@ -7821,11 +7839,7 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 		return PTR_ERR(ref_node);
 	}
 
-	file_data->node = ref_node;
-	io_rsrc_ref_lock(ctx);
-	list_add_tail(&ref_node->node, &ctx->rsrc_ref_list);
-	io_rsrc_ref_unlock(ctx);
-	percpu_ref_get(&file_data->refs);
+	add_fixed_rsrc_ref_node(ref_node, file_data, ctx);
 	return ret;
 out_fput:
 	for (i = 0; i < ctx->nr_user_files; i++) {
@@ -7914,10 +7928,7 @@ static void switch_fixed_rsrc_ref_node(struct fixed_rsrc_ref_node *ref_node,
 				       struct io_ring_ctx *ctx)
 {
 	percpu_ref_kill(&data->node->refs);
-	io_rsrc_ref_lock(ctx);
-	list_add_tail(&ref_node->node, &ctx->rsrc_ref_list);
-	data->node = ref_node;
-	io_rsrc_ref_unlock(ctx);
+	add_fixed_rsrc_ref_node(ref_node, data, ctx);
 }
 
 static int __io_sqe_files_update(struct io_ring_ctx *ctx,
@@ -7995,10 +8006,9 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 		up->offset++;
 	}
 
-	if (needs_switch) {
+	if (needs_switch)
 		switch_fixed_rsrc_ref_node(ref_node, data, ctx);
-		percpu_ref_get(&ctx->file_data->refs);
-	} else
+	else
 		destroy_fixed_rsrc_ref_node(ref_node);
 
 	return done ? done : err;
@@ -8361,23 +8371,11 @@ static void io_buffers_map_free(struct io_ring_ctx *ctx)
 static int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
 {
 	struct fixed_rsrc_data *data = ctx->buf_data;
-	struct fixed_rsrc_ref_node *ref_node = NULL;
 
 	if (!data)
 		return -ENXIO;
 
-	io_rsrc_ref_lock(ctx);
-	ref_node = data->node;
-	io_rsrc_ref_unlock(ctx);
-	if (ref_node)
-		percpu_ref_kill(&ref_node->refs);
-
-	percpu_ref_kill(&data->refs);
-
-	/* wait for all refs nodes to complete */
-	flush_delayed_work(&ctx->rsrc_put_work);
-	wait_for_completion(&data->done);
-
+	io_rsrc_ref_quiesce(data, ctx);
 	io_buffers_unmap(ctx);
 	io_buffers_map_free(ctx);
 
@@ -8736,11 +8734,7 @@ static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 		return PTR_ERR(ref_node);
 	}
 
-	buf_data->node = ref_node;
-	io_rsrc_ref_lock(ctx);
-	list_add(&ref_node->node, &ctx->rsrc_ref_list);
-	io_rsrc_ref_unlock(ctx);
-	percpu_ref_get(&buf_data->refs);
+	add_fixed_rsrc_ref_node(ref_node, buf_data, ctx);
 	return 0;
 }
 
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 10/13] io_uring: generalize files_update functionlity to rsrc_update
  2020-12-18 18:07 [PATCH v3 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
                   ` (8 preceding siblings ...)
  2020-12-18 18:07 ` [PATCH v3 09/13] io_uring: create common fixed_rsrc_ref_node handling routines Bijan Mottahedeh
@ 2020-12-18 18:07 ` Bijan Mottahedeh
  2020-12-18 18:07 ` [PATCH v3 11/13] io_uring: support buffer registration updates Bijan Mottahedeh
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-18 18:07 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Generalize files_update functionality to rsrc_update in order to
leverage it for buffers updates.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c                 | 19 ++++++++++++++-----
 include/uapi/linux/io_uring.h |  6 +++++-
 2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 7e1467c..dbac1ea 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5963,7 +5963,7 @@ static int io_async_cancel(struct io_kiocb *req)
 }
 
 static int io_rsrc_update_prep(struct io_kiocb *req,
-				const struct io_uring_sqe *sqe)
+			       const struct io_uring_sqe *sqe)
 {
 	if (unlikely(req->ctx->flags & IORING_SETUP_SQPOLL))
 		return -EINVAL;
@@ -5980,8 +5980,11 @@ static int io_rsrc_update_prep(struct io_kiocb *req,
 	return 0;
 }
 
-static int io_files_update(struct io_kiocb *req, bool force_nonblock,
-			   struct io_comp_state *cs)
+static int io_rsrc_update(struct io_kiocb *req, bool force_nonblock,
+			  struct io_comp_state *cs,
+			  int (*update)(struct io_ring_ctx *ctx,
+					struct io_uring_rsrc_update *up,
+					unsigned int nr_args))
 {
 	struct io_ring_ctx *ctx = req->ctx;
 	struct io_uring_rsrc_update up;
@@ -5991,10 +5994,10 @@ static int io_files_update(struct io_kiocb *req, bool force_nonblock,
 		return -EAGAIN;
 
 	up.offset = req->rsrc_update.offset;
-	up.fds = req->rsrc_update.arg;
+	up.rsrc = req->rsrc_update.arg;
 
 	mutex_lock(&ctx->uring_lock);
-	ret = __io_sqe_files_update(ctx, &up, req->rsrc_update.nr_args);
+	ret = (*update)(ctx, &up, req->rsrc_update.nr_args);
 	mutex_unlock(&ctx->uring_lock);
 
 	if (ret < 0)
@@ -6003,6 +6006,12 @@ static int io_files_update(struct io_kiocb *req, bool force_nonblock,
 	return 0;
 }
 
+static int io_files_update(struct io_kiocb *req, bool force_nonblock,
+			   struct io_comp_state *cs)
+{
+	return io_rsrc_update(req, force_nonblock, cs, __io_sqe_files_update);
+}
+
 static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	switch (req->opcode) {
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index d421f70..8cc672c 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -288,7 +288,11 @@ enum {
 struct io_uring_rsrc_update {
 	__u32 offset;
 	__u32 resv;
-	__aligned_u64 /* __s32 * */ fds;
+	union {
+		__aligned_u64 /* __s32 * */ fds;
+		__aligned_u64 /* __s32 * */ iovs;
+		__aligned_u64 /* __s32 * */ rsrc;
+	};
 };
 
 #define IO_URING_OP_SUPPORTED	(1U << 0)
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 11/13] io_uring: support buffer registration updates
  2020-12-18 18:07 [PATCH v3 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
                   ` (9 preceding siblings ...)
  2020-12-18 18:07 ` [PATCH v3 10/13] io_uring: generalize files_update functionlity to rsrc_update Bijan Mottahedeh
@ 2020-12-18 18:07 ` Bijan Mottahedeh
  2020-12-18 18:07 ` [PATCH v3 12/13] io_uring: create common fixed_rsrc_data allocation routines Bijan Mottahedeh
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-18 18:07 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Introduce IORING_REGISTER_BUFFERS_UPDATE and IORING_OP_BUFFERS_UPDATE,
consistent with file registration update.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c                 | 123 +++++++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/io_uring.h |   2 +
 2 files changed, 124 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index dbac1ea..87c1420 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1009,6 +1009,9 @@ struct io_op_def {
 		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_FILES |
 						IO_WQ_WORK_FS | IO_WQ_WORK_BLKCG,
 	},
+	[IORING_OP_BUFFERS_UPDATE] = {
+		.work_flags		= IO_WQ_WORK_MM,
+	},
 };
 
 enum io_mem_account {
@@ -1028,6 +1031,9 @@ static void __io_complete_rw(struct io_kiocb *req, long res, long res2,
 static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 				 struct io_uring_rsrc_update *ip,
 				 unsigned nr_args);
+static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
+				   struct io_uring_rsrc_update *up,
+				   unsigned int nr_args);
 static void __io_clean_op(struct io_kiocb *req);
 static struct file *io_file_get(struct io_submit_state *state,
 				struct io_kiocb *req, int fd, bool fixed);
@@ -6012,6 +6018,12 @@ static int io_files_update(struct io_kiocb *req, bool force_nonblock,
 	return io_rsrc_update(req, force_nonblock, cs, __io_sqe_files_update);
 }
 
+static int io_buffers_update(struct io_kiocb *req, bool force_nonblock,
+			     struct io_comp_state *cs)
+{
+	return io_rsrc_update(req, force_nonblock, cs, __io_sqe_buffers_update);
+}
+
 static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	switch (req->opcode) {
@@ -6083,11 +6095,13 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return io_renameat_prep(req, sqe);
 	case IORING_OP_UNLINKAT:
 		return io_unlinkat_prep(req, sqe);
+	case IORING_OP_BUFFERS_UPDATE:
+		return io_rsrc_update_prep(req, sqe);
 	}
 
 	printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n",
 			req->opcode);
-	return-EINVAL;
+	return -EINVAL;
 }
 
 static int io_req_defer_prep(struct io_kiocb *req,
@@ -6342,6 +6356,9 @@ static int io_issue_sqe(struct io_kiocb *req, bool force_nonblock,
 	case IORING_OP_UNLINKAT:
 		ret = io_unlinkat(req, force_nonblock);
 		break;
+	case IORING_OP_BUFFERS_UPDATE:
+		ret = io_buffers_update(req, force_nonblock, cs);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -8345,6 +8362,7 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
 	if (imu->acct_pages)
 		io_unaccount_mem(ctx, imu->nr_bvecs, ACCT_PINNED);
 	kvfree(imu->bvec);
+	imu->bvec = NULL;
 	imu->nr_bvecs = 0;
 }
 
@@ -8548,6 +8566,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 		if (pret > 0)
 			unpin_user_pages(pages, pret);
 		kvfree(imu->bvec);
+		imu->bvec = NULL;
 		goto done;
 	}
 
@@ -8676,6 +8695,8 @@ static int io_buffer_validate(struct iovec *iov)
 static void io_ring_buf_put(struct io_ring_ctx *ctx, struct io_rsrc_put *prsrc)
 {
 	io_buffer_unmap(ctx, prsrc->buf);
+	kvfree(prsrc->buf);
+	prsrc->buf = NULL;
 }
 
 static struct fixed_rsrc_ref_node *alloc_fixed_buf_ref_node(
@@ -8747,6 +8768,102 @@ static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 	return 0;
 }
 
+static inline int io_queue_buffer_removal(struct fixed_rsrc_data *data,
+					  struct io_mapped_ubuf *imu)
+{
+	return io_queue_rsrc_removal(data, (void *)imu);
+}
+
+static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
+				   struct io_uring_rsrc_update *up,
+				   unsigned int nr_args)
+{
+	struct fixed_rsrc_data *data = ctx->buf_data;
+	struct fixed_rsrc_ref_node *ref_node;
+	struct io_mapped_ubuf *imu;
+	struct iovec iov;
+	struct iovec __user *iovs;
+	struct page *last_hpage = NULL;
+	__u32 done;
+	int i, err;
+	bool needs_switch = false;
+
+	if (check_add_overflow(up->offset, nr_args, &done))
+		return -EOVERFLOW;
+	if (done > ctx->nr_user_bufs)
+		return -EINVAL;
+
+	ref_node = alloc_fixed_buf_ref_node(ctx);
+	if (IS_ERR(ref_node))
+		return PTR_ERR(ref_node);
+
+	done = 0;
+	iovs = u64_to_user_ptr(up->iovs);
+	while (nr_args) {
+		struct fixed_rsrc_table *table;
+		unsigned int index;
+
+		err = 0;
+		if (copy_from_user(&iov, &iovs[done], sizeof(iov))) {
+			err = -EFAULT;
+			break;
+		}
+		i = array_index_nospec(up->offset, ctx->nr_user_bufs);
+		table = &ctx->buf_data->table[i >> IORING_BUF_TABLE_SHIFT];
+		index = i & IORING_BUF_TABLE_MASK;
+		imu = &table->bufs[index];
+		if (table->bufs[index].ubuf) {
+			struct io_mapped_ubuf *dup;
+
+			dup = kmemdup(imu, sizeof(*imu), GFP_KERNEL);
+			if (!dup) {
+				err = -ENOMEM;
+				break;
+			}
+			err = io_queue_buffer_removal(data, dup);
+			if (err)
+				break;
+			memset(imu, 0, sizeof(*imu));
+			needs_switch = true;
+		}
+		if (!io_buffer_validate(&iov)) {
+			err = io_sqe_buffer_register(ctx, &iov, imu,
+						     &last_hpage);
+			if (err) {
+				memset(imu, 0, sizeof(*imu));
+				break;
+			}
+		}
+		nr_args--;
+		done++;
+		up->offset++;
+	}
+
+	if (needs_switch)
+		switch_fixed_rsrc_ref_node(ref_node, data, ctx);
+	else
+		destroy_fixed_rsrc_ref_node(ref_node);
+
+	return done ? done : err;
+}
+
+static int io_sqe_buffers_update(struct io_ring_ctx *ctx, void __user *arg,
+				 unsigned int nr_args)
+{
+	struct io_uring_rsrc_update up;
+
+	if (!ctx->buf_data)
+		return -ENXIO;
+	if (!nr_args)
+		return -EINVAL;
+	if (copy_from_user(&up, arg, sizeof(up)))
+		return -EFAULT;
+	if (up.resv)
+		return -EINVAL;
+
+	return __io_sqe_buffers_update(ctx, &up, nr_args);
+}
+
 static int io_eventfd_register(struct io_ring_ctx *ctx, void __user *arg)
 {
 	__s32 __user *fds = arg;
@@ -10027,6 +10144,7 @@ static bool io_register_op_must_quiesce(int op)
 	case IORING_UNREGISTER_FILES:
 	case IORING_REGISTER_FILES_UPDATE:
 	case IORING_UNREGISTER_BUFFERS:
+	case IORING_REGISTER_BUFFERS_UPDATE:
 	case IORING_REGISTER_PROBE:
 	case IORING_REGISTER_PERSONALITY:
 	case IORING_UNREGISTER_PERSONALITY:
@@ -10102,6 +10220,9 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			break;
 		ret = io_sqe_buffers_unregister(ctx);
 		break;
+	case IORING_REGISTER_BUFFERS_UPDATE:
+		ret = io_sqe_buffers_update(ctx, arg, nr_args);
+		break;
 	case IORING_REGISTER_FILES:
 		ret = io_sqe_files_register(ctx, arg, nr_args);
 		break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 8cc672c..0d9ac12 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -137,6 +137,7 @@ enum {
 	IORING_OP_SHUTDOWN,
 	IORING_OP_RENAMEAT,
 	IORING_OP_UNLINKAT,
+	IORING_OP_BUFFERS_UPDATE,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
@@ -280,6 +281,7 @@ enum {
 	IORING_UNREGISTER_PERSONALITY		= 10,
 	IORING_REGISTER_RESTRICTIONS		= 11,
 	IORING_REGISTER_ENABLE_RINGS		= 12,
+	IORING_REGISTER_BUFFERS_UPDATE		= 13,
 
 	/* this goes last */
 	IORING_REGISTER_LAST
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 12/13] io_uring: create common fixed_rsrc_data allocation routines.
  2020-12-18 18:07 [PATCH v3 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
                   ` (10 preceding siblings ...)
  2020-12-18 18:07 ` [PATCH v3 11/13] io_uring: support buffer registration updates Bijan Mottahedeh
@ 2020-12-18 18:07 ` Bijan Mottahedeh
  2020-12-18 18:07 ` [PATCH v3 13/13] io_uring: support buffer registration sharing Bijan Mottahedeh
  2021-01-04 17:09 ` [PATCH v3 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
  13 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-18 18:07 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Create common alloc/free fixed_rsrc_data routines for both files and
buffers.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c | 77 ++++++++++++++++++++++++++++++++---------------------------
 1 file changed, 42 insertions(+), 35 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 87c1420..f5706e1 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7331,6 +7331,33 @@ static void io_rsrc_ref_quiesce(struct fixed_rsrc_data *data,
 	wait_for_completion(&data->done);
 }
 
+static struct fixed_rsrc_data *alloc_fixed_rsrc_data(struct io_ring_ctx *ctx)
+{
+	struct fixed_rsrc_data *data;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return ERR_PTR(-ENOMEM);
+
+	data->ctx = ctx;
+	init_completion(&data->done);
+
+	if (percpu_ref_init(&data->refs, io_rsrc_ref_kill,
+			    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
+		kfree(data);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	return data;
+}
+
+static void free_fixed_rsrc_data(struct fixed_rsrc_data *data)
+{
+	percpu_ref_exit(&data->refs);
+	kfree(data->table);
+	kfree(data);
+}
+
 static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
 {
 	struct fixed_rsrc_data *data = ctx->file_data;
@@ -7345,9 +7372,7 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
 	nr_tables = DIV_ROUND_UP(ctx->nr_user_files, IORING_MAX_FILES_TABLE);
 	for (i = 0; i < nr_tables; i++)
 		kfree(data->table[i].files);
-	kfree(data->table);
-	percpu_ref_exit(&data->refs);
-	kfree(data);
+	free_fixed_rsrc_data(ctx->file_data);
 	ctx->file_data = NULL;
 	ctx->nr_user_files = 0;
 	return 0;
@@ -7800,11 +7825,9 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 	if (nr_args > IORING_MAX_FIXED_FILES)
 		return -EMFILE;
 
-	file_data = kzalloc(sizeof(*ctx->file_data), GFP_KERNEL);
-	if (!file_data)
-		return -ENOMEM;
-	file_data->ctx = ctx;
-	init_completion(&file_data->done);
+	file_data = alloc_fixed_rsrc_data(ctx);
+	if (IS_ERR(file_data))
+		return PTR_ERR(ref_node);
 
 	nr_tables = DIV_ROUND_UP(nr_args, IORING_MAX_FILES_TABLE);
 	file_data->table = kcalloc(nr_tables, sizeof(*file_data->table),
@@ -7812,12 +7835,8 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 	if (!file_data->table)
 		goto out_free;
 
-	if (percpu_ref_init(&file_data->refs, io_rsrc_ref_kill,
-				PERCPU_REF_ALLOW_REINIT, GFP_KERNEL))
-		goto out_free;
-
 	if (io_sqe_alloc_file_tables(file_data, nr_tables, nr_args))
-		goto out_ref;
+		goto out_free;
 	ctx->file_data = file_data;
 
 	for (i = 0; i < nr_args; i++, ctx->nr_user_files++) {
@@ -7876,11 +7895,8 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 	for (i = 0; i < nr_tables; i++)
 		kfree(file_data->table[i].files);
 	ctx->nr_user_files = 0;
-out_ref:
-	percpu_ref_exit(&file_data->refs);
 out_free:
-	kfree(file_data->table);
-	kfree(file_data);
+	free_fixed_rsrc_data(ctx->file_data);
 	ctx->file_data = NULL;
 	return ret;
 }
@@ -8639,39 +8655,30 @@ static int io_alloc_buf_tables(struct fixed_rsrc_data *buf_data,
 static struct fixed_rsrc_data *io_buffers_map_alloc(struct io_ring_ctx *ctx,
 						    unsigned int nr_args)
 {
-	unsigned nr_tables;
 	struct fixed_rsrc_data *buf_data;
+	unsigned int nr_tables;
 	int ret = -ENOMEM;
 
 	if (!nr_args || nr_args > IORING_MAX_FIXED_BUFS)
 		return ERR_PTR(-EINVAL);
 
-	buf_data = kzalloc(sizeof(*ctx->buf_data), GFP_KERNEL);
-	if (!buf_data)
-		return ERR_PTR(-ENOMEM);
-	buf_data->ctx = ctx;
-	init_completion(&buf_data->done);
+	buf_data = alloc_fixed_rsrc_data(ctx);
+	if (IS_ERR(buf_data))
+		return buf_data;
 
 	nr_tables = DIV_ROUND_UP(nr_args, IORING_MAX_BUFS_TABLE);
 	buf_data->table = kcalloc(nr_tables, sizeof(*buf_data->table),
 				  GFP_KERNEL);
 	if (!buf_data->table)
-		goto out_free;
-
-	if (percpu_ref_init(&buf_data->refs, io_rsrc_ref_kill,
-			    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL))
-		goto out_free;
+		goto out;
 
 	if (io_alloc_buf_tables(buf_data, nr_tables, nr_args))
-		goto out_ref;
+		goto out;
 
 	return buf_data;
-
-out_ref:
-	percpu_ref_exit(&buf_data->refs);
-out_free:
-	kfree(buf_data->table);
-	kfree(buf_data);
+out:
+	free_fixed_rsrc_data(ctx->buf_data);
+	ctx->buf_data = NULL;
 	return ERR_PTR(ret);
 }
 
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 13/13] io_uring: support buffer registration sharing
  2020-12-18 18:07 [PATCH v3 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
                   ` (11 preceding siblings ...)
  2020-12-18 18:07 ` [PATCH v3 12/13] io_uring: create common fixed_rsrc_data allocation routines Bijan Mottahedeh
@ 2020-12-18 18:07 ` Bijan Mottahedeh
  2021-01-04 17:09 ` [PATCH v3 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
  13 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-18 18:07 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Implement buffer sharing among multiple rings.

A ring shares its (future) buffer registrations at setup time with
IORING_SETUP_SHARE_BUF. A ring attaches to another ring's buffer
registration at setup time with IORING_SETUP_ATTACH_BUF, after
authenticating with the buffer registration owner's fd. Any updates to
the owner's buffer registrations become immediately available to the
attached rings.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c                 | 85 +++++++++++++++++++++++++++++++++++++++++--
 include/uapi/linux/io_uring.h |  2 +
 2 files changed, 83 insertions(+), 4 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index f5706e1..1341526 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -8411,6 +8411,13 @@ static void io_buffers_map_free(struct io_ring_ctx *ctx)
 	ctx->nr_user_bufs = 0;
 }
 
+static void io_detach_buf_data(struct io_ring_ctx *ctx)
+{
+	percpu_ref_put(&ctx->buf_data->refs);
+	ctx->buf_data = NULL;
+	ctx->nr_user_bufs = 0;
+}
+
 static int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
 {
 	struct fixed_rsrc_data *data = ctx->buf_data;
@@ -8418,6 +8425,11 @@ static int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
 	if (!data)
 		return -ENXIO;
 
+	if (ctx->flags & IORING_SETUP_ATTACH_BUF) {
+		io_detach_buf_data(ctx);
+		return 0;
+	}
+
 	io_rsrc_ref_quiesce(data, ctx);
 	io_buffers_unmap(ctx);
 	io_buffers_map_free(ctx);
@@ -8662,9 +8674,13 @@ static struct fixed_rsrc_data *io_buffers_map_alloc(struct io_ring_ctx *ctx,
 	if (!nr_args || nr_args > IORING_MAX_FIXED_BUFS)
 		return ERR_PTR(-EINVAL);
 
-	buf_data = alloc_fixed_rsrc_data(ctx);
-	if (IS_ERR(buf_data))
-		return buf_data;
+	if (ctx->buf_data) {
+		buf_data = ctx->buf_data;
+	} else {
+		buf_data = alloc_fixed_rsrc_data(ctx);
+		if (IS_ERR(buf_data))
+			return buf_data;
+	}
 
 	nr_tables = DIV_ROUND_UP(nr_args, IORING_MAX_BUFS_TABLE);
 	buf_data->table = kcalloc(nr_tables, sizeof(*buf_data->table),
@@ -8729,9 +8745,17 @@ static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 	if (ctx->nr_user_bufs)
 		return -EBUSY;
 
+	if (ctx->flags & IORING_SETUP_ATTACH_BUF) {
+		if (!ctx->buf_data)
+			return -EFAULT;
+		ctx->nr_user_bufs = ctx->buf_data->ctx->nr_user_bufs;
+		return 0;
+	}
+
 	buf_data = io_buffers_map_alloc(ctx, nr_args);
 	if (IS_ERR(buf_data))
 		return PTR_ERR(buf_data);
+	ctx->buf_data = buf_data;
 
 	for (i = 0; i < nr_args; i++, ctx->nr_user_bufs++) {
 		struct fixed_rsrc_table *table;
@@ -8759,7 +8783,6 @@ static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 			break;
 	}
 
-	ctx->buf_data = buf_data;
 	if (ret) {
 		io_sqe_buffers_unregister(ctx);
 		return ret;
@@ -9789,6 +9812,55 @@ static int io_uring_get_fd(struct io_ring_ctx *ctx)
 	return ret;
 }
 
+static int io_attach_buf_data(struct io_ring_ctx *ctx,
+			      struct io_uring_params *p)
+{
+	struct io_ring_ctx *ctx_attach;
+	struct fd f;
+
+	f = fdget(p->wq_fd);
+	if (!f.file)
+		return -EBADF;
+	if (f.file->f_op != &io_uring_fops) {
+		fdput(f);
+		return -EINVAL;
+	}
+
+	ctx_attach = f.file->private_data;
+	if (!ctx_attach->buf_data) {
+		fdput(f);
+		return -EINVAL;
+	}
+	ctx->buf_data = ctx_attach->buf_data;
+
+	percpu_ref_get(&ctx->buf_data->refs);
+	fdput(f);
+	return 0;
+}
+
+static int io_init_buf_data(struct io_ring_ctx *ctx, struct io_uring_params *p)
+{
+	if ((p->flags & (IORING_SETUP_SHARE_BUF | IORING_SETUP_ATTACH_BUF)) ==
+	    (IORING_SETUP_SHARE_BUF | IORING_SETUP_ATTACH_BUF))
+		return -EINVAL;
+
+	if (p->flags & IORING_SETUP_SHARE_BUF) {
+		struct fixed_rsrc_data *buf_data;
+
+		buf_data = alloc_fixed_rsrc_data(ctx);
+		if (IS_ERR(buf_data))
+			return PTR_ERR(buf_data);
+
+		ctx->buf_data = buf_data;
+		return 0;
+	}
+
+	if (p->flags & IORING_SETUP_ATTACH_BUF)
+		return io_attach_buf_data(ctx, p);
+
+	return 0;
+}
+
 static int io_uring_create(unsigned entries, struct io_uring_params *p,
 			   struct io_uring_params __user *params)
 {
@@ -9903,6 +9975,10 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p,
 	if (ret)
 		goto err;
 
+	ret = io_init_buf_data(ctx, p);
+	if (ret)
+		goto err;
+
 	ret = io_sq_offload_create(ctx, p);
 	if (ret)
 		goto err;
@@ -9974,6 +10050,7 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 	if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL |
 			IORING_SETUP_SQ_AFF | IORING_SETUP_CQSIZE |
 			IORING_SETUP_CLAMP | IORING_SETUP_ATTACH_WQ |
+			IORING_SETUP_SHARE_BUF | IORING_SETUP_ATTACH_BUF |
 			IORING_SETUP_R_DISABLED))
 		return -EINVAL;
 
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 0d9ac12..2366126 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -98,6 +98,8 @@ enum {
 #define IORING_SETUP_CLAMP	(1U << 4)	/* clamp SQ/CQ ring sizes */
 #define IORING_SETUP_ATTACH_WQ	(1U << 5)	/* attach to existing wq */
 #define IORING_SETUP_R_DISABLED	(1U << 6)	/* start with ring disabled */
+#define IORING_SETUP_SHARE_BUF	(1U << 7)	/* share buffer registration */
+#define IORING_SETUP_ATTACH_BUF	(1U << 8)	/* attach buffer registration */
 
 enum {
 	IORING_OP_NOP,
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 00/13] io_uring: buffer registration enhancements
  2020-12-18 18:07 [PATCH v3 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
                   ` (12 preceding siblings ...)
  2020-12-18 18:07 ` [PATCH v3 13/13] io_uring: support buffer registration sharing Bijan Mottahedeh
@ 2021-01-04 17:09 ` Bijan Mottahedeh
  13 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2021-01-04 17:09 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

A reminder to please review this version.

> v3:
> 
> - batch file->rsrc renames into a signle patch when possible
> - fix other review changes from v2
> - fix checkpatch warnings
> 
> v2:
> 
> - drop readv/writev with fixed buffers patch
> - handle ref_nodes both both files/buffers with a single ref_list
> - make file/buffer handling more unified
> 
> This patchset implements a set of enhancements to buffer registration
> consistent with existing file registration functionality:
> 
> - buffer registration updates		IORING_REGISTER_BUFFERS_UPDATE
> 					IORING_OP_BUFFERS_UPDATE
> 
> - buffer registration sharing		IORING_SETUP_SHARE_BUF
> 					IORING_SETUP_ATTACH_BUF
> 
> I have kept the original patchset unchanged for the most part to
> facilitate reviewing and so this set adds a number of additional patches
> mostly making file/buffer handling more unified.
> 
> Patch 1-2 modularize existing buffer registration code.
> 
> Patch 3-7 generalize fixed_file functionality to fixed_rsrc.
> 
> Patch 8 applies fixed_rsrc functionality for fixed buffers support.
> 
> Patch 9-10 generalize files_update functionality to rsrc_update.
> 
> Patch 11 implements buffer registration update, and introduces
> IORING_REGISTER_BUFFERS_UPDATE and IORING_OP_BUFFERS_UPDATE, consistent
> with file registration update.
> 
> Patch 12 generalizes fixed resource allocation
> 
> Patch 13 implements buffer sharing among multiple rings; it works as follows:
> 
> - A new ring, A,  is setup. Since no buffers have been registered, the
>    registered buffer state is an empty set, Z. That's different from the
>    NULL state in current implementation.
> 
> - Ring B is setup, attaching to Ring A. It's also attaching to it's
>    buffer registrations, now we have two references to the same empty
>    set, Z.
> 
> - Ring A registers buffers into set Z, which is no longer empty.
> 
> - Ring B sees this immediately, since it's already sharing that set.
> 
> Testing
> 
> I have used liburing file-{register,update} tests as models for
> buffer-{register,update,share}, tests and they run ok.
> 
> TBD
> 
> - I tried to use a single opcode for files/buffers but ran into an
> issue since work_flags is different for files/buffers.  This should
> be ok for the most part since req->work.flags is ultimately examined;
> however, there are place where io_op_defs[opcode].work_flags is examined
> directly, and I wasn't sure what would the best way to handle that.
> 
> - Need to still address Pavel's comments about deadlocks. I figure
> to send out the set anyway since this is a last patch and may even be
> handled separately.
> 
> Bijan Mottahedeh (13):
>    io_uring: modularize io_sqe_buffer_register
>    io_uring: modularize io_sqe_buffers_register
>    io_uring: rename file related variables to rsrc
>    io_uring: generalize io_queue_rsrc_removal
>    io_uring: separate ref_list from fixed_rsrc_data
>    io_uring: generalize fixed_file_ref_node functionality
>    io_uring: add rsrc_ref locking routines
>    io_uring: implement fixed buffers registration similar to fixed files
>    io_uring: create common fixed_rsrc_ref_node handling routines
>    io_uring: generalize files_update functionlity to rsrc_update
>    io_uring: support buffer registration updates
>    io_uring: create common fixed_rsrc_data allocation routines.
>    io_uring: support buffer registration sharing
> 
>   fs/io_uring.c                 | 1004 +++++++++++++++++++++++++++++------------
>   include/uapi/linux/io_uring.h |   12 +-
>   2 files changed, 735 insertions(+), 281 deletions(-)
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 02/13] io_uring: modularize io_sqe_buffers_register
  2020-12-18 18:07 ` [PATCH v3 02/13] io_uring: modularize io_sqe_buffers_register Bijan Mottahedeh
@ 2021-01-04 21:48   ` Pavel Begunkov
  0 siblings, 0 replies; 32+ messages in thread
From: Pavel Begunkov @ 2021-01-04 21:48 UTC (permalink / raw)
  To: Bijan Mottahedeh, axboe, io-uring

On 18/12/2020 18:07, Bijan Mottahedeh wrote:
> Move allocation of buffer management structures, and validation of
> buffers into separate routines.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 01/13] io_uring: modularize io_sqe_buffer_register
  2020-12-18 18:07 ` [PATCH v3 01/13] io_uring: modularize io_sqe_buffer_register Bijan Mottahedeh
@ 2021-01-04 21:54   ` Pavel Begunkov
  2021-01-06 19:46     ` Bijan Mottahedeh
  0 siblings, 1 reply; 32+ messages in thread
From: Pavel Begunkov @ 2021-01-04 21:54 UTC (permalink / raw)
  To: Bijan Mottahedeh, axboe, io-uring

On 18/12/2020 18:07, Bijan Mottahedeh wrote:
> Split io_sqe_buffer_register into two routines:
> 
> - io_sqe_buffer_register() registers a single buffer
> - io_sqe_buffers_register iterates over all user specified buffers

It's a bit worse in terms of extra allocations, but not so hot to be
be a problem, and looks simpler.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>

Jens, I suggest to take 1,2 while they still apply (3/13 does not).
I'll review others in a meanwhile.
 
-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 03/13] io_uring: rename file related variables to rsrc
  2020-12-18 18:07 ` [PATCH v3 03/13] io_uring: rename file related variables to rsrc Bijan Mottahedeh
@ 2021-01-05  1:53   ` Pavel Begunkov
  2021-01-06 19:46     ` Bijan Mottahedeh
  0 siblings, 1 reply; 32+ messages in thread
From: Pavel Begunkov @ 2021-01-05  1:53 UTC (permalink / raw)
  To: Bijan Mottahedeh, axboe, io-uring

On 18/12/2020 18:07, Bijan Mottahedeh wrote:
> This is a prep rename patch for subsequent patches to generalize file
> registration.
[...]
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index d31a2a1..d421f70 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -285,7 +285,7 @@ enum {
>  	IORING_REGISTER_LAST
>  };
>  
> -struct io_uring_files_update {
> +struct io_uring_rsrc_update {

It's a user API, i.e. the header used by userspace programs, so can't
be changed or would break them.

>  	__u32 offset;
>  	__u32 resv;
>  	__aligned_u64 /* __s32 * */ fds;
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 08/13] io_uring: implement fixed buffers registration similar to fixed files
  2020-12-18 18:07 ` [PATCH v3 08/13] io_uring: implement fixed buffers registration similar to fixed files Bijan Mottahedeh
@ 2021-01-05  2:43   ` Pavel Begunkov
  2021-01-06 19:46     ` Bijan Mottahedeh
  0 siblings, 1 reply; 32+ messages in thread
From: Pavel Begunkov @ 2021-01-05  2:43 UTC (permalink / raw)
  To: Bijan Mottahedeh, axboe, io-uring

On 18/12/2020 18:07, Bijan Mottahedeh wrote:
> Apply fixed_rsrc functionality for fixed buffers support.

git generated a pretty messy diff...

Because it's do quiesce, fixed read/write access buffers from asynchronous
contexts without synchronisation. That won't work anymore, so

1. either we save it in advance, that would require extra req_async
allocation for linked fixed rw

2. or synchronise whenever async. But that would mean that a request
may get and do IO on two different buffers, that's rotten.

3. do mixed -- lazy, but if do IO then alloc.

3.5 also "synchronise" there would mean uring_lock, that's not welcome,
but we can probably do rcu.

Let me think of a patch...

[...]
> @@ -8373,7 +8433,13 @@ static bool headpage_already_acct(struct io_ring_ctx *ctx, struct page **pages,
>  
>  	/* check previously registered pages */
>  	for (i = 0; i < ctx->nr_user_bufs; i++) {
> -		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
> +		struct fixed_rsrc_table *table;
> +		struct io_mapped_ubuf *imu;
> +		unsigned int index;
> +
> +		table = &ctx->buf_data->table[i >> IORING_BUF_TABLE_SHIFT];
> +		index = i & IORING_BUF_TABLE_MASK;
> +		imu = &table->bufs[index];

io_buf_from_index() may tak buf_data, so can be reused.

>  
>  		for (j = 0; j < imu->nr_bvecs; j++) {
>  			if (!PageCompound(imu->bvec[j].bv_page))
> @@ -8508,19 +8574,79 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
>  	return ret;
>  }
>  
> -static int io_buffers_map_alloc(struct io_ring_ctx *ctx, unsigned int nr_args)
> +static void io_free_buf_tables(struct fixed_rsrc_data *buf_data,
> +			       unsigned int nr_tables)
>  {
> -	if (ctx->user_bufs)
> -		return -EBUSY;
> -	if (!nr_args || nr_args > UIO_MAXIOV)
> -		return -EINVAL;
> +	int i;
>  
> -	ctx->user_bufs = kcalloc(nr_args, sizeof(struct io_mapped_ubuf),
> -					GFP_KERNEL);
> -	if (!ctx->user_bufs)
> -		return -ENOMEM;
> +	for (i = 0; i < nr_tables; i++) {
> +		struct fixed_rsrc_table *table = &buf_data->table[i];
>  
> -	return 0;
> +		kfree(table->bufs);
> +	}
> +}
> +
> +static int io_alloc_buf_tables(struct fixed_rsrc_data *buf_data,
> +			       unsigned int nr_tables, unsigned int nr_bufs)
> +{
> +	int i;
> +		

trailing tabs

> +	for (i = 0; i < nr_tables; i++) {
> +		struct fixed_rsrc_table *table = &buf_data->table[i];
> +		unsigned int this_bufs;
> +
> +		this_bufs = min(nr_bufs, IORING_MAX_BUFS_TABLE);
> +		table->bufs = kcalloc(this_bufs, sizeof(struct io_mapped_ubuf),
> +				      GFP_KERNEL);
> +		if (!table->bufs)
> +			break;
> +		nr_bufs -= this_bufs;
> +	}
> +
> +	if (i == nr_tables)
> +		return 0;
> +
> +	io_free_buf_tables(buf_data, nr_tables);

Would work because kcalloc() zeroed buf_data->table, but

io_free_buf_tables(buf_data, __i__);

> +	return 1;
> +}
> +
[...]
>  static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
>  				   unsigned int nr_args)
>  {
>  	int i, ret;
>  	struct iovec iov;
>  	struct page *last_hpage = NULL;
> +	struct fixed_rsrc_ref_node *ref_node;
> +	struct fixed_rsrc_data *buf_data;
>  
> -	ret = io_buffers_map_alloc(ctx, nr_args);
> -	if (ret)
> -		return ret;
> +	if (ctx->nr_user_bufs)
> +		return -EBUSY;
>  
> -	for (i = 0; i < nr_args; i++) {
> -		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
> +	buf_data = io_buffers_map_alloc(ctx, nr_args);
> +	if (IS_ERR(buf_data))
> +		return PTR_ERR(buf_data);
> +
> +	for (i = 0; i < nr_args; i++, ctx->nr_user_bufs++) {
> +		struct fixed_rsrc_table *table;
> +		struct io_mapped_ubuf *imu;
> +		unsigned int index;
>  
>  		ret = io_copy_iov(ctx, &iov, arg, i);
>  		if (ret)
>  			break;
>  
> +		/* allow sparse sets */
> +		if (!iov.iov_base && !iov.iov_len)
> +			continue;
> +
>  		ret = io_buffer_validate(&iov);
>  		if (ret)
>  			break;
>  
> +		table = &buf_data->table[i >> IORING_BUF_TABLE_SHIFT];

same, io_buf_from_index() can be reused

> +		index = i & IORING_BUF_TABLE_MASK;
> +		imu = &table->bufs[index];
> +
>  		ret = io_sqe_buffer_register(ctx, &iov, imu, &last_hpage);
>  		if (ret)
>  			break;
> +	}
>  
[...]
> @@ -9854,6 +10023,7 @@ static bool io_register_op_must_quiesce(int op)
>  	switch (op) {
>  	case IORING_UNREGISTER_FILES:
>  	case IORING_REGISTER_FILES_UPDATE:
> +	case IORING_UNREGISTER_BUFFERS:

what about REGISTER_BUFFERS? 

>  	case IORING_REGISTER_PROBE:
>  	case IORING_REGISTER_PERSONALITY:
>  	case IORING_UNREGISTER_PERSONALITY:
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 01/13] io_uring: modularize io_sqe_buffer_register
  2021-01-04 21:54   ` Pavel Begunkov
@ 2021-01-06 19:46     ` Bijan Mottahedeh
  0 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2021-01-06 19:46 UTC (permalink / raw)
  To: Pavel Begunkov, axboe, io-uring

On 1/4/2021 1:54 PM, Pavel Begunkov wrote:
> On 18/12/2020 18:07, Bijan Mottahedeh wrote:
>> Split io_sqe_buffer_register into two routines:
>>
>> - io_sqe_buffer_register() registers a single buffer
>> - io_sqe_buffers_register iterates over all user specified buffers
> 
> It's a bit worse in terms of extra allocations, but not so hot to be
> be a problem, and looks simpler.
> 
> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
> 
> Jens, I suggest to take 1,2 while they still apply (3/13 does not).
> I'll review others in a meanwhile.

I rebased the patches and will send the version soon, so the rest should 
apply as well.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 08/13] io_uring: implement fixed buffers registration similar to fixed files
  2021-01-05  2:43   ` Pavel Begunkov
@ 2021-01-06 19:46     ` Bijan Mottahedeh
  2021-01-06 22:22       ` Pavel Begunkov
  2021-01-07  2:37       ` Pavel Begunkov
  0 siblings, 2 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2021-01-06 19:46 UTC (permalink / raw)
  To: Pavel Begunkov, axboe, io-uring

On 1/4/2021 6:43 PM, Pavel Begunkov wrote:
> On 18/12/2020 18:07, Bijan Mottahedeh wrote:
>> Apply fixed_rsrc functionality for fixed buffers support.
> 
> git generated a pretty messy diff...

I had tried to break this up a few ways but it didn't work well because 
I think most of the code changes depend on the io_uring structure 
changes.  I can look again or if you some idea of how you want to split 
it, I can do that.

> Because it's do quiesce, fixed read/write access buffers from asynchronous
> contexts without synchronisation. That won't work anymore, so
> 
> 1. either we save it in advance, that would require extra req_async
> allocation for linked fixed rw
> 
> 2. or synchronise whenever async. But that would mean that a request
> may get and do IO on two different buffers, that's rotten.
> 
> 3. do mixed -- lazy, but if do IO then alloc.
> 
> 3.5 also "synchronise" there would mean uring_lock, that's not welcome,
> but we can probably do rcu.

Are you referring to a case where a fixed buffer request can be 
submitted from async context while those buffers are being unregistered, 
or something like that?

> Let me think of a patch...

Ok, will wait for it.

>> @@ -8373,7 +8433,13 @@ static bool headpage_already_acct(struct io_ring_ctx *ctx, struct page **pages,
>>   
>>   	/* check previously registered pages */
>>   	for (i = 0; i < ctx->nr_user_bufs; i++) {
>> -		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
>> +		struct fixed_rsrc_table *table;
>> +		struct io_mapped_ubuf *imu;
>> +		unsigned int index;
>> +
>> +		table = &ctx->buf_data->table[i >> IORING_BUF_TABLE_SHIFT];
>> +		index = i & IORING_BUF_TABLE_MASK;
>> +		imu = &table->bufs[index];
> 
> io_buf_from_index() may tak buf_data, so can be reused.

Ok.

>> +	for (i = 0; i < nr_tables; i++) {
>> +		struct fixed_rsrc_table *table = &buf_data->table[i];
>> +		unsigned int this_bufs;
>> +
>> +		this_bufs = min(nr_bufs, IORING_MAX_BUFS_TABLE);
>> +		table->bufs = kcalloc(this_bufs, sizeof(struct io_mapped_ubuf),
>> +				      GFP_KERNEL);
>> +		if (!table->bufs)
>> +			break;
>> +		nr_bufs -= this_bufs;
>> +	}
>> +
>> +	if (i == nr_tables)
>> +		return 0;
>> +
>> +	io_free_buf_tables(buf_data, nr_tables);
> 
> Would work because kcalloc() zeroed buf_data->table, but
> 
> io_free_buf_tables(buf_data, __i__);

Ok.

>>   		ret = io_buffer_validate(&iov);
>>   		if (ret)
>>   			break;
>>   
>> +		table = &buf_data->table[i >> IORING_BUF_TABLE_SHIFT];
> 
> same, io_buf_from_index() can be reused
> 
>> +		index = i & IORING_BUF_TABLE_MASK;
>> +		imu = &table->bufs[index];

Ok.

>> @@ -9854,6 +10023,7 @@ static bool io_register_op_must_quiesce(int op)
>>   	switch (op) {
>>   	case IORING_UNREGISTER_FILES:
>>   	case IORING_REGISTER_FILES_UPDATE:
>> +	case IORING_UNREGISTER_BUFFERS:
> 
> what about REGISTER_BUFFERS?

I followed the FILES case, which deals with UNREGISTER and UPDATE only, 
should I add REGISTER for buffers only?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 03/13] io_uring: rename file related variables to rsrc
  2021-01-05  1:53   ` Pavel Begunkov
@ 2021-01-06 19:46     ` Bijan Mottahedeh
  0 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2021-01-06 19:46 UTC (permalink / raw)
  To: Pavel Begunkov, axboe, io-uring

On 1/4/2021 5:53 PM, Pavel Begunkov wrote:
> On 18/12/2020 18:07, Bijan Mottahedeh wrote:
>> This is a prep rename patch for subsequent patches to generalize file
>> registration.
> [...]
>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>> index d31a2a1..d421f70 100644
>> --- a/include/uapi/linux/io_uring.h
>> +++ b/include/uapi/linux/io_uring.h
>> @@ -285,7 +285,7 @@ enum {
>>   	IORING_REGISTER_LAST
>>   };
>>   
>> -struct io_uring_files_update {
>> +struct io_uring_rsrc_update {
> 
> It's a user API, i.e. the header used by userspace programs, so can't
> be changed or would break them.
> 
>>   	__u32 offset;
>>   	__u32 resv;
>>   	__aligned_u64 /* __s32 * */ fds;
>>
> 

I defined files to be same as rsrc in io_uring.h:

#define io_uring_files_update   io_uring_rsrc_update

As for liburing, I have modified it to use io_uring_rsrc_update.

Does that look reasonable?



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 08/13] io_uring: implement fixed buffers registration similar to fixed files
  2021-01-06 19:46     ` Bijan Mottahedeh
@ 2021-01-06 22:22       ` Pavel Begunkov
  2021-01-07  2:37       ` Pavel Begunkov
  1 sibling, 0 replies; 32+ messages in thread
From: Pavel Begunkov @ 2021-01-06 22:22 UTC (permalink / raw)
  To: Bijan Mottahedeh, axboe, io-uring

On 06/01/2021 19:46, Bijan Mottahedeh wrote:
> On 1/4/2021 6:43 PM, Pavel Begunkov wrote:
>> On 18/12/2020 18:07, Bijan Mottahedeh wrote:
>>> Apply fixed_rsrc functionality for fixed buffers support.
>>
>> git generated a pretty messy diff...
> 
> I had tried to break this up a few ways but it didn't work well because I think most of the code changes depend on the io_uring structure changes.  I can look again or if you some idea of how you want to split it, I can do that.

Nah, that's fine, I just may have missed something
without applying it.

> 
>> Because it's do quiesce, fixed read/write access buffers from asynchronous
>> contexts without synchronisation. That won't work anymore, so
>>
>> 1. either we save it in advance, that would require extra req_async
>> allocation for linked fixed rw
>>
>> 2. or synchronise whenever async. But that would mean that a request
>> may get and do IO on two different buffers, that's rotten.
>>
>> 3. do mixed -- lazy, but if do IO then alloc.
>>
>> 3.5 also "synchronise" there would mean uring_lock, that's not welcome,
>> but we can probably do rcu.
> 
> Are you referring to a case where a fixed buffer request can be submitted from async context while those buffers are being unregistered, or something like that?

Yes, io_import_fixed() called from io-wq, and in parallel with
unregister/etc. That may happen in many cases, e.g. linked reqs
or IOSQE_ASYNC.


-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 08/13] io_uring: implement fixed buffers registration similar to fixed files
  2021-01-06 19:46     ` Bijan Mottahedeh
  2021-01-06 22:22       ` Pavel Begunkov
@ 2021-01-07  2:37       ` Pavel Begunkov
  2021-01-07 21:21         ` Bijan Mottahedeh
  1 sibling, 1 reply; 32+ messages in thread
From: Pavel Begunkov @ 2021-01-07  2:37 UTC (permalink / raw)
  To: Bijan Mottahedeh, axboe, io-uring

On 06/01/2021 19:46, Bijan Mottahedeh wrote:
> On 1/4/2021 6:43 PM, Pavel Begunkov wrote:
>> On 18/12/2020 18:07, Bijan Mottahedeh wrote:
>>> Apply fixed_rsrc functionality for fixed buffers support.
>>
>> git generated a pretty messy diff...
> 
> I had tried to break this up a few ways but it didn't work well because I think most of the code changes depend on the io_uring structure changes.  I can look again or if you some idea of how you want to split it, I can do that.
> 
>> Because it's do quiesce, fixed read/write access buffers from asynchronous
>> contexts without synchronisation. That won't work anymore, so
>>
>> 1. either we save it in advance, that would require extra req_async
>> allocation for linked fixed rw
>>
>> 2. or synchronise whenever async. But that would mean that a request
>> may get and do IO on two different buffers, that's rotten.
>>
>> 3. do mixed -- lazy, but if do IO then alloc.
>>
>> 3.5 also "synchronise" there would mean uring_lock, that's not welcome,
>> but we can probably do rcu.
> 
> Are you referring to a case where a fixed buffer request can be submitted from async context while those buffers are being unregistered, or something like that?
> 
>> Let me think of a patch...

The most convenient API would be [1], it selects a buffer during
submission, but allocates if needs to go async or for all linked
requests.

[2] should be correct from the kernel perspective (no races), it
also solves doing IO on 2 different buffers, that's nasty (BTW,
[1] solves this problem naturally). However, a buffer might be
selected async, but the following can happen, and user should
wait for request completion before removing a buffer.

1. register buf id=0
2. syscall io_uring_enter(submit=RW_FIXED,buf_id=0,IOSQE_ASYNC)
3. unregister buffers
4. the request may not find the buffer and fail.

Not very convenient + can actually add overhead on the userspace
side, can be even some heavy synchronisation.

uring_lock in [2] is not nice, but I think I can replace it
with rcu, probably can even help with sharing, but I need to
try to implement to be sure.

So that's an open question what API to have.
Neither of diffs is tested.

[1]
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 7e35283fc0b1..2171836a9ce3 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -826,6 +826,7 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
 		.pollin			= 1,
+		.needs_async_data	= 1,
 		.plug			= 1,
 		.async_size		= sizeof(struct io_async_rw),
 		.work_flags		= IO_WQ_WORK_BLKCG | IO_WQ_WORK_MM,
@@ -835,6 +836,7 @@ static const struct io_op_def io_op_defs[] = {
 		.hash_reg_file		= 1,
 		.unbound_nonreg_file	= 1,
 		.pollout		= 1,
+		.needs_async_data	= 1,
 		.plug			= 1,
 		.async_size		= sizeof(struct io_async_rw),
 		.work_flags		= IO_WQ_WORK_BLKCG | IO_WQ_WORK_FSIZE |



[2]
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 7e35283fc0b1..31560b879fb3 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3148,7 +3148,12 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
 	opcode = req->opcode;
 	if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) {
 		*iovec = NULL;
-		return io_import_fixed(req, rw, iter);
+
+		io_ring_submit_lock(req->ctx, needs_lock);
+		lockdep_assert_held(&req->ctx->uring_lock);
+		ret = io_import_fixed(req, rw, iter);
+		io_ring_submit_unlock(req->ctx, needs_lock);
+		return ret;
 	}
 
 	/* buffer index only valid with fixed read/write, or buffer select  */
@@ -3638,7 +3643,7 @@ static int io_write(struct io_kiocb *req, bool force_nonblock,
 copy_iov:
 		/* some cases will consume bytes even on error returns */
 		iov_iter_revert(iter, io_size - iov_iter_count(iter));
-		ret = io_setup_async_rw(req, iovec, inline_vecs, iter, false);
+		ret = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
 		if (!ret)
 			return -EAGAIN;
 	}


-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 08/13] io_uring: implement fixed buffers registration similar to fixed files
  2021-01-07  2:37       ` Pavel Begunkov
@ 2021-01-07 21:21         ` Bijan Mottahedeh
  2021-01-07 21:37           ` Pavel Begunkov
  0 siblings, 1 reply; 32+ messages in thread
From: Bijan Mottahedeh @ 2021-01-07 21:21 UTC (permalink / raw)
  To: Pavel Begunkov, axboe, io-uring


>>> Because it's do quiesce, fixed read/write access buffers from asynchronous
>>> contexts without synchronisation. That won't work anymore, so
>>>
>>> 1. either we save it in advance, that would require extra req_async
>>> allocation for linked fixed rw
>>>
>>> 2. or synchronise whenever async. But that would mean that a request
>>> may get and do IO on two different buffers, that's rotten.
>>>
>>> 3. do mixed -- lazy, but if do IO then alloc.
>>>
>>> 3.5 also "synchronise" there would mean uring_lock, that's not welcome,
>>> but we can probably do rcu.
>>
>> Are you referring to a case where a fixed buffer request can be submitted from async context while those buffers are being unregistered, or something like that?
>>
>>> Let me think of a patch...
> 
> The most convenient API would be [1], it selects a buffer during
> submission, but allocates if needs to go async or for all linked
> requests.
> 
> [2] should be correct from the kernel perspective (no races), it
> also solves doing IO on 2 different buffers, that's nasty (BTW,
> [1] solves this problem naturally). However, a buffer might be
> selected async, but the following can happen, and user should
> wait for request completion before removing a buffer.
> 
> 1. register buf id=0
> 2. syscall io_uring_enter(submit=RW_FIXED,buf_id=0,IOSQE_ASYNC)
> 3. unregister buffers
> 4. the request may not find the buffer and fail.
> 
> Not very convenient + can actually add overhead on the userspace
> side, can be even some heavy synchronisation.
> 
> uring_lock in [2] is not nice, but I think I can replace it
> with rcu, probably can even help with sharing, but I need to
> try to implement to be sure.
> 
> So that's an open question what API to have.
> Neither of diffs is tested.
> 
> [1]
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 7e35283fc0b1..2171836a9ce3 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -826,6 +826,7 @@ static const struct io_op_def io_op_defs[] = {
>   		.needs_file		= 1,
>   		.unbound_nonreg_file	= 1,
>   		.pollin			= 1,
> +		.needs_async_data	= 1,
>   		.plug			= 1,
>   		.async_size		= sizeof(struct io_async_rw),
>   		.work_flags		= IO_WQ_WORK_BLKCG | IO_WQ_WORK_MM,
> @@ -835,6 +836,7 @@ static const struct io_op_def io_op_defs[] = {
>   		.hash_reg_file		= 1,
>   		.unbound_nonreg_file	= 1,
>   		.pollout		= 1,
> +		.needs_async_data	= 1,
>   		.plug			= 1,
>   		.async_size		= sizeof(struct io_async_rw),
>   		.work_flags		= IO_WQ_WORK_BLKCG | IO_WQ_WORK_FSIZE |
> 
> 
> 
> [2]
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 7e35283fc0b1..31560b879fb3 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -3148,7 +3148,12 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
>   	opcode = req->opcode;
>   	if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) {
>   		*iovec = NULL;
> -		return io_import_fixed(req, rw, iter);
> +
> +		io_ring_submit_lock(req->ctx, needs_lock);
> +		lockdep_assert_held(&req->ctx->uring_lock);
> +		ret = io_import_fixed(req, rw, iter);
> +		io_ring_submit_unlock(req->ctx, needs_lock);
> +		return ret;
>   	}
>   
>   	/* buffer index only valid with fixed read/write, or buffer select  */
> @@ -3638,7 +3643,7 @@ static int io_write(struct io_kiocb *req, bool force_nonblock,
>   copy_iov:
>   		/* some cases will consume bytes even on error returns */
>   		iov_iter_revert(iter, io_size - iov_iter_count(iter));
> -		ret = io_setup_async_rw(req, iovec, inline_vecs, iter, false);
> +		ret = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
>   		if (!ret)
>   			return -EAGAIN;
>   	}
> 
> 

For my understanding, is [1] essentially about stashing the iovec for 
the fixed IO in an io_async_rw struct and referencing it in async 
context?  I don't understand how this prevents unregistering the buffer 
(described by the iovec) while the IO takes place.

Taking a step back, what is the cost of keeping the quiesce for buffer 
registration operations?  It should not be a frequent operation even a 
heavy handed quiesce should not be a big issue?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 08/13] io_uring: implement fixed buffers registration similar to fixed files
  2021-01-07 21:21         ` Bijan Mottahedeh
@ 2021-01-07 21:37           ` Pavel Begunkov
  2021-01-07 22:14             ` Bijan Mottahedeh
  0 siblings, 1 reply; 32+ messages in thread
From: Pavel Begunkov @ 2021-01-07 21:37 UTC (permalink / raw)
  To: Bijan Mottahedeh, axboe, io-uring

On 07/01/2021 21:21, Bijan Mottahedeh wrote:
> 
>>>> Because it's do quiesce, fixed read/write access buffers from asynchronous
>>>> contexts without synchronisation. That won't work anymore, so
>>>>
>>>> 1. either we save it in advance, that would require extra req_async
>>>> allocation for linked fixed rw
>>>>
>>>> 2. or synchronise whenever async. But that would mean that a request
>>>> may get and do IO on two different buffers, that's rotten.
>>>>
>>>> 3. do mixed -- lazy, but if do IO then alloc.
>>>>
>>>> 3.5 also "synchronise" there would mean uring_lock, that's not welcome,
>>>> but we can probably do rcu.
>>>
>>> Are you referring to a case where a fixed buffer request can be submitted from async context while those buffers are being unregistered, or something like that?
>>>
>>>> Let me think of a patch...
>>
>> The most convenient API would be [1], it selects a buffer during
>> submission, but allocates if needs to go async or for all linked
>> requests.
>>
>> [2] should be correct from the kernel perspective (no races), it
>> also solves doing IO on 2 different buffers, that's nasty (BTW,
>> [1] solves this problem naturally). However, a buffer might be
>> selected async, but the following can happen, and user should
>> wait for request completion before removing a buffer.
>>
>> 1. register buf id=0
>> 2. syscall io_uring_enter(submit=RW_FIXED,buf_id=0,IOSQE_ASYNC)
>> 3. unregister buffers
>> 4. the request may not find the buffer and fail.
>>
>> Not very convenient + can actually add overhead on the userspace
>> side, can be even some heavy synchronisation.
>>
>> uring_lock in [2] is not nice, but I think I can replace it
>> with rcu, probably can even help with sharing, but I need to
>> try to implement to be sure.
>>
>> So that's an open question what API to have.
>> Neither of diffs is tested.
>>
>> [1]
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 7e35283fc0b1..2171836a9ce3 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -826,6 +826,7 @@ static const struct io_op_def io_op_defs[] = {
>>           .needs_file        = 1,
>>           .unbound_nonreg_file    = 1,
>>           .pollin            = 1,
>> +        .needs_async_data    = 1,
>>           .plug            = 1,
>>           .async_size        = sizeof(struct io_async_rw),
>>           .work_flags        = IO_WQ_WORK_BLKCG | IO_WQ_WORK_MM,
>> @@ -835,6 +836,7 @@ static const struct io_op_def io_op_defs[] = {
>>           .hash_reg_file        = 1,
>>           .unbound_nonreg_file    = 1,
>>           .pollout        = 1,
>> +        .needs_async_data    = 1,
>>           .plug            = 1,
>>           .async_size        = sizeof(struct io_async_rw),
>>           .work_flags        = IO_WQ_WORK_BLKCG | IO_WQ_WORK_FSIZE |
>>
>>
>>
>> [2]
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 7e35283fc0b1..31560b879fb3 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -3148,7 +3148,12 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
>>       opcode = req->opcode;
>>       if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) {
>>           *iovec = NULL;
>> -        return io_import_fixed(req, rw, iter);
>> +
>> +        io_ring_submit_lock(req->ctx, needs_lock);
>> +        lockdep_assert_held(&req->ctx->uring_lock);
>> +        ret = io_import_fixed(req, rw, iter);
>> +        io_ring_submit_unlock(req->ctx, needs_lock);
>> +        return ret;
>>       }
>>         /* buffer index only valid with fixed read/write, or buffer select  */
>> @@ -3638,7 +3643,7 @@ static int io_write(struct io_kiocb *req, bool force_nonblock,
>>   copy_iov:
>>           /* some cases will consume bytes even on error returns */
>>           iov_iter_revert(iter, io_size - iov_iter_count(iter));
>> -        ret = io_setup_async_rw(req, iovec, inline_vecs, iter, false);
>> +        ret = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
>>           if (!ret)
>>               return -EAGAIN;
>>       }
>>
>>
> 
> For my understanding, is [1] essentially about stashing the iovec for the fixed IO in an io_async_rw struct and referencing it in async context?

Yes, like that. It actually doesn't use iov but employs bvec, which
it gets from struct io_mapped_ubuf, and stores it inside iter.

> I don't understand how this prevents unregistering the buffer (described by the iovec) while the IO takes place.

The bvec itself is guaranteed to be alive during the whole lifetime
of the request, that's because of all that percpu_ref in nodes.
However, the table storing buffers (i.e. ctx->user_bufs) may be
overwritten.

reg/unreg/update happens with uring_lock held, as well as submission.
Hence if we always grab a buffer during submission it will be fine.

> 
> Taking a step back, what is the cost of keeping the quiesce for buffer registration operations?  It should not be a frequent operation even a heavy handed quiesce should not be a big issue?

It waits for __all__ inflight requests to complete and doesn't allow
submissions in the meantime (basically all io_uring_enter() attempts
will fail). +grace period.

It's pretty heavy, but worse is that it shuts down everything while
waiting. However, if an application is prepared for that and it's
really rare or done once, that should be ok.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 08/13] io_uring: implement fixed buffers registration similar to fixed files
  2021-01-07 21:37           ` Pavel Begunkov
@ 2021-01-07 22:14             ` Bijan Mottahedeh
  2021-01-07 22:33               ` Pavel Begunkov
  0 siblings, 1 reply; 32+ messages in thread
From: Bijan Mottahedeh @ 2021-01-07 22:14 UTC (permalink / raw)
  To: Pavel Begunkov, axboe, io-uring

On 1/7/2021 1:37 PM, Pavel Begunkov wrote:
> On 07/01/2021 21:21, Bijan Mottahedeh wrote:
>>
>>>>> Because it's do quiesce, fixed read/write access buffers from asynchronous
>>>>> contexts without synchronisation. That won't work anymore, so
>>>>>
>>>>> 1. either we save it in advance, that would require extra req_async
>>>>> allocation for linked fixed rw
>>>>>
>>>>> 2. or synchronise whenever async. But that would mean that a request
>>>>> may get and do IO on two different buffers, that's rotten.
>>>>>
>>>>> 3. do mixed -- lazy, but if do IO then alloc.
>>>>>
>>>>> 3.5 also "synchronise" there would mean uring_lock, that's not welcome,
>>>>> but we can probably do rcu.
>>>>
>>>> Are you referring to a case where a fixed buffer request can be submitted from async context while those buffers are being unregistered, or something like that?
>>>>
>>>>> Let me think of a patch...
>>>
>>> The most convenient API would be [1], it selects a buffer during
>>> submission, but allocates if needs to go async or for all linked
>>> requests.
>>>
>>> [2] should be correct from the kernel perspective (no races), it
>>> also solves doing IO on 2 different buffers, that's nasty (BTW,
>>> [1] solves this problem naturally). However, a buffer might be
>>> selected async, but the following can happen, and user should
>>> wait for request completion before removing a buffer.
>>>
>>> 1. register buf id=0
>>> 2. syscall io_uring_enter(submit=RW_FIXED,buf_id=0,IOSQE_ASYNC)
>>> 3. unregister buffers
>>> 4. the request may not find the buffer and fail.
>>>
>>> Not very convenient + can actually add overhead on the userspace
>>> side, can be even some heavy synchronisation.
>>>
>>> uring_lock in [2] is not nice, but I think I can replace it
>>> with rcu, probably can even help with sharing, but I need to
>>> try to implement to be sure.
>>>
>>> So that's an open question what API to have.
>>> Neither of diffs is tested.
>>>
>>> [1]
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index 7e35283fc0b1..2171836a9ce3 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -826,6 +826,7 @@ static const struct io_op_def io_op_defs[] = {
>>>            .needs_file        = 1,
>>>            .unbound_nonreg_file    = 1,
>>>            .pollin            = 1,
>>> +        .needs_async_data    = 1,
>>>            .plug            = 1,
>>>            .async_size        = sizeof(struct io_async_rw),
>>>            .work_flags        = IO_WQ_WORK_BLKCG | IO_WQ_WORK_MM,
>>> @@ -835,6 +836,7 @@ static const struct io_op_def io_op_defs[] = {
>>>            .hash_reg_file        = 1,
>>>            .unbound_nonreg_file    = 1,
>>>            .pollout        = 1,
>>> +        .needs_async_data    = 1,
>>>            .plug            = 1,
>>>            .async_size        = sizeof(struct io_async_rw),
>>>            .work_flags        = IO_WQ_WORK_BLKCG | IO_WQ_WORK_FSIZE |
>>>
>>>
>>>
>>> [2]
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index 7e35283fc0b1..31560b879fb3 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -3148,7 +3148,12 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
>>>        opcode = req->opcode;
>>>        if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) {
>>>            *iovec = NULL;
>>> -        return io_import_fixed(req, rw, iter);
>>> +
>>> +        io_ring_submit_lock(req->ctx, needs_lock);
>>> +        lockdep_assert_held(&req->ctx->uring_lock);
>>> +        ret = io_import_fixed(req, rw, iter);
>>> +        io_ring_submit_unlock(req->ctx, needs_lock);
>>> +        return ret;
>>>        }
>>>          /* buffer index only valid with fixed read/write, or buffer select  */
>>> @@ -3638,7 +3643,7 @@ static int io_write(struct io_kiocb *req, bool force_nonblock,
>>>    copy_iov:
>>>            /* some cases will consume bytes even on error returns */
>>>            iov_iter_revert(iter, io_size - iov_iter_count(iter));
>>> -        ret = io_setup_async_rw(req, iovec, inline_vecs, iter, false);
>>> +        ret = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
>>>            if (!ret)
>>>                return -EAGAIN;
>>>        }
>>>
>>>
>>
>> For my understanding, is [1] essentially about stashing the iovec for the fixed IO in an io_async_rw struct and referencing it in async context?
> 
> Yes, like that. It actually doesn't use iov but employs bvec, which
> it gets from struct io_mapped_ubuf, and stores it inside iter.
> 
>> I don't understand how this prevents unregistering the buffer (described by the iovec) while the IO takes place.
> 
> The bvec itself is guaranteed to be alive during the whole lifetime
> of the request, that's because of all that percpu_ref in nodes.
> However, the table storing buffers (i.e. ctx->user_bufs) may be
> overwritten.
> 
> reg/unreg/update happens with uring_lock held, as well as submission.
> Hence if we always grab a buffer during submission it will be fine.

So because of the uring_lock being held, if we implement [1], then once 
we grab a fixed buffer during submission, we are guaranteed that the IO 
successfully completes, even if the buffer table is overwritten?

Would the bvec persistence help us with buffer sharing and the deadlock 
scenario you brought up as well?  If the sharing task wouldn't have to 
block for the attached tasks to get rid of their references, it seems 
that any outstanding IO would complete successfully.

My concern however is what would happen if the sharing task actually 
*frees* its buffers after returning from unregister, since those buffers 
would still live in the buf_data, right?


>> Taking a step back, what is the cost of keeping the quiesce for buffer registration operations?  It should not be a frequent operation even a heavy handed quiesce should not be a big issue?
> 
> It waits for __all__ inflight requests to complete and doesn't allow
> submissions in the meantime (basically all io_uring_enter() attempts
> will fail). +grace period.
> 
> It's pretty heavy, but worse is that it shuts down everything while
> waiting. However, if an application is prepared for that and it's
> really rare or done once, that should be ok.
> 

Jens, what do you think?


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 08/13] io_uring: implement fixed buffers registration similar to fixed files
  2021-01-07 22:14             ` Bijan Mottahedeh
@ 2021-01-07 22:33               ` Pavel Begunkov
  2021-01-07 23:10                 ` Pavel Begunkov
  2021-01-08  0:17                 ` Bijan Mottahedeh
  0 siblings, 2 replies; 32+ messages in thread
From: Pavel Begunkov @ 2021-01-07 22:33 UTC (permalink / raw)
  To: Bijan Mottahedeh, axboe, io-uring

On 07/01/2021 22:14, Bijan Mottahedeh wrote:
> On 1/7/2021 1:37 PM, Pavel Begunkov wrote:
>> On 07/01/2021 21:21, Bijan Mottahedeh wrote:
>>>
>>>>>> Because it's do quiesce, fixed read/write access buffers from asynchronous
>>>>>> contexts without synchronisation. That won't work anymore, so
>>>>>>
>>>>>> 1. either we save it in advance, that would require extra req_async
>>>>>> allocation for linked fixed rw
>>>>>>
>>>>>> 2. or synchronise whenever async. But that would mean that a request
>>>>>> may get and do IO on two different buffers, that's rotten.
>>>>>>
>>>>>> 3. do mixed -- lazy, but if do IO then alloc.
>>>>>>
>>>>>> 3.5 also "synchronise" there would mean uring_lock, that's not welcome,
>>>>>> but we can probably do rcu.
>>>>>
>>>>> Are you referring to a case where a fixed buffer request can be submitted from async context while those buffers are being unregistered, or something like that?
>>>>>
>>>>>> Let me think of a patch...
>>>>
>>>> The most convenient API would be [1], it selects a buffer during
>>>> submission, but allocates if needs to go async or for all linked
>>>> requests.
>>>>
>>>> [2] should be correct from the kernel perspective (no races), it
>>>> also solves doing IO on 2 different buffers, that's nasty (BTW,
>>>> [1] solves this problem naturally). However, a buffer might be
>>>> selected async, but the following can happen, and user should
>>>> wait for request completion before removing a buffer.
>>>>
>>>> 1. register buf id=0
>>>> 2. syscall io_uring_enter(submit=RW_FIXED,buf_id=0,IOSQE_ASYNC)
>>>> 3. unregister buffers
>>>> 4. the request may not find the buffer and fail.
>>>>
>>>> Not very convenient + can actually add overhead on the userspace
>>>> side, can be even some heavy synchronisation.
>>>>
>>>> uring_lock in [2] is not nice, but I think I can replace it
>>>> with rcu, probably can even help with sharing, but I need to
>>>> try to implement to be sure.
>>>>
>>>> So that's an open question what API to have.
>>>> Neither of diffs is tested.
>>>>
>>>> [1]
>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>> index 7e35283fc0b1..2171836a9ce3 100644
>>>> --- a/fs/io_uring.c
>>>> +++ b/fs/io_uring.c
>>>> @@ -826,6 +826,7 @@ static const struct io_op_def io_op_defs[] = {
>>>>            .needs_file        = 1,
>>>>            .unbound_nonreg_file    = 1,
>>>>            .pollin            = 1,
>>>> +        .needs_async_data    = 1,
>>>>            .plug            = 1,
>>>>            .async_size        = sizeof(struct io_async_rw),
>>>>            .work_flags        = IO_WQ_WORK_BLKCG | IO_WQ_WORK_MM,
>>>> @@ -835,6 +836,7 @@ static const struct io_op_def io_op_defs[] = {
>>>>            .hash_reg_file        = 1,
>>>>            .unbound_nonreg_file    = 1,
>>>>            .pollout        = 1,
>>>> +        .needs_async_data    = 1,
>>>>            .plug            = 1,
>>>>            .async_size        = sizeof(struct io_async_rw),
>>>>            .work_flags        = IO_WQ_WORK_BLKCG | IO_WQ_WORK_FSIZE |
>>>>
>>>>
>>>>
>>>> [2]
>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>> index 7e35283fc0b1..31560b879fb3 100644
>>>> --- a/fs/io_uring.c
>>>> +++ b/fs/io_uring.c
>>>> @@ -3148,7 +3148,12 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
>>>>        opcode = req->opcode;
>>>>        if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) {
>>>>            *iovec = NULL;
>>>> -        return io_import_fixed(req, rw, iter);
>>>> +
>>>> +        io_ring_submit_lock(req->ctx, needs_lock);
>>>> +        lockdep_assert_held(&req->ctx->uring_lock);
>>>> +        ret = io_import_fixed(req, rw, iter);
>>>> +        io_ring_submit_unlock(req->ctx, needs_lock);
>>>> +        return ret;
>>>>        }
>>>>          /* buffer index only valid with fixed read/write, or buffer select  */
>>>> @@ -3638,7 +3643,7 @@ static int io_write(struct io_kiocb *req, bool force_nonblock,
>>>>    copy_iov:
>>>>            /* some cases will consume bytes even on error returns */
>>>>            iov_iter_revert(iter, io_size - iov_iter_count(iter));
>>>> -        ret = io_setup_async_rw(req, iovec, inline_vecs, iter, false);
>>>> +        ret = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
>>>>            if (!ret)
>>>>                return -EAGAIN;
>>>>        }
>>>>
>>>>
>>>
>>> For my understanding, is [1] essentially about stashing the iovec for the fixed IO in an io_async_rw struct and referencing it in async context?
>>
>> Yes, like that. It actually doesn't use iov but employs bvec, which
>> it gets from struct io_mapped_ubuf, and stores it inside iter.
>>
>>> I don't understand how this prevents unregistering the buffer (described by the iovec) while the IO takes place.
>>
>> The bvec itself is guaranteed to be alive during the whole lifetime
>> of the request, that's because of all that percpu_ref in nodes.
>> However, the table storing buffers (i.e. ctx->user_bufs) may be
>> overwritten.
>>
>> reg/unreg/update happens with uring_lock held, as well as submission.
>> Hence if we always grab a buffer during submission it will be fine.
> 
> So because of the uring_lock being held, if we implement [1], then once we grab a fixed buffer during submission, we are guaranteed that the IO successfully completes, even if the buffer table is overwritten?

There are two separate things.
1. bvec itself. Currently quiesce guarantees its validity, and for your
patches node->refs keeps it.

2. the table where bvecs are stored, i.e. array of pointers to bvecs.
Naturally, it's racy to read and write in parallel and not synchronised
from it. Currently it's also synchronised by quiesce, but [1] and [2]
sync it with uring_lock, but in a different fashion.
I may be able to replace uring_lock there with RCU. 

> 
> Would the bvec persistence help us with buffer sharing and the deadlock scenario you brought up as well?  If the sharing task wouldn't have to block for the attached tasks to get rid of their references, it seems that any outstanding IO would complete successfully.

bvecs (1.) should be fine/easy to do, one of the problems is the table
itself (2.). When I get time I'll look into RCU option, and I have a
hunch it would help with it as well.
But IIRC there are other issues.

> 
> My concern however is what would happen if the sharing task actually *frees* its buffers after returning from unregister, since those buffers would still live in the buf_data, right?

Don't remember the patch, but it must not. That should be the easy
part because we can rely on node::refs

>>> Taking a step back, what is the cost of keeping the quiesce for buffer registration operations?  It should not be a frequent operation even a heavy handed quiesce should not be a big issue?
>>
>> It waits for __all__ inflight requests to complete and doesn't allow
>> submissions in the meantime (basically all io_uring_enter() attempts
>> will fail). +grace period.
>>
>> It's pretty heavy, but worse is that it shuts down everything while
>> waiting. However, if an application is prepared for that and it's
>> really rare or done once, that should be ok.> Jens, what do you think?

Just to note, that's how it works now. And IORING_UPDATE_BUFFERS would
work same way if added head on.

You mentioned that this work is important for you, so I'd rather ask
your opinion on that matter. Is it ok for your use case? How often
do you expect to do register/unregister/update buffers?

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 08/13] io_uring: implement fixed buffers registration similar to fixed files
  2021-01-07 22:33               ` Pavel Begunkov
@ 2021-01-07 23:10                 ` Pavel Begunkov
  2021-01-08  1:53                   ` Bijan Mottahedeh
  2021-01-08  0:17                 ` Bijan Mottahedeh
  1 sibling, 1 reply; 32+ messages in thread
From: Pavel Begunkov @ 2021-01-07 23:10 UTC (permalink / raw)
  To: Bijan Mottahedeh, axboe, io-uring

On 07/01/2021 22:33, Pavel Begunkov wrote:
> On 07/01/2021 22:14, Bijan Mottahedeh wrote:
>> On 1/7/2021 1:37 PM, Pavel Begunkov wrote:
>>> On 07/01/2021 21:21, Bijan Mottahedeh wrote:
>>>>
>>>>>>> Because it's do quiesce, fixed read/write access buffers from asynchronous
>>>>>>> contexts without synchronisation. That won't work anymore, so
>>>>>>>
>>>>>>> 1. either we save it in advance, that would require extra req_async
>>>>>>> allocation for linked fixed rw
>>>>>>>
>>>>>>> 2. or synchronise whenever async. But that would mean that a request
>>>>>>> may get and do IO on two different buffers, that's rotten.
>>>>>>>
>>>>>>> 3. do mixed -- lazy, but if do IO then alloc.
>>>>>>>
>>>>>>> 3.5 also "synchronise" there would mean uring_lock, that's not welcome,
>>>>>>> but we can probably do rcu.
>>>>>>
>>>>>> Are you referring to a case where a fixed buffer request can be submitted from async context while those buffers are being unregistered, or something like that?
>>>>>>
>>>>>>> Let me think of a patch...
>>>>>
>>>>> The most convenient API would be [1], it selects a buffer during
>>>>> submission, but allocates if needs to go async or for all linked
>>>>> requests.
>>>>>
>>>>> [2] should be correct from the kernel perspective (no races), it
>>>>> also solves doing IO on 2 different buffers, that's nasty (BTW,
>>>>> [1] solves this problem naturally). However, a buffer might be
>>>>> selected async, but the following can happen, and user should
>>>>> wait for request completion before removing a buffer.
>>>>>
>>>>> 1. register buf id=0
>>>>> 2. syscall io_uring_enter(submit=RW_FIXED,buf_id=0,IOSQE_ASYNC)
>>>>> 3. unregister buffers
>>>>> 4. the request may not find the buffer and fail.
>>>>>
>>>>> Not very convenient + can actually add overhead on the userspace
>>>>> side, can be even some heavy synchronisation.
>>>>>
>>>>> uring_lock in [2] is not nice, but I think I can replace it
>>>>> with rcu, probably can even help with sharing, but I need to
>>>>> try to implement to be sure.
>>>>>
>>>>> So that's an open question what API to have.
>>>>> Neither of diffs is tested.
>>>>>
>>>>> [1]
>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>> index 7e35283fc0b1..2171836a9ce3 100644
>>>>> --- a/fs/io_uring.c
>>>>> +++ b/fs/io_uring.c
>>>>> @@ -826,6 +826,7 @@ static const struct io_op_def io_op_defs[] = {
>>>>>            .needs_file        = 1,
>>>>>            .unbound_nonreg_file    = 1,
>>>>>            .pollin            = 1,
>>>>> +        .needs_async_data    = 1,
>>>>>            .plug            = 1,
>>>>>            .async_size        = sizeof(struct io_async_rw),
>>>>>            .work_flags        = IO_WQ_WORK_BLKCG | IO_WQ_WORK_MM,
>>>>> @@ -835,6 +836,7 @@ static const struct io_op_def io_op_defs[] = {
>>>>>            .hash_reg_file        = 1,
>>>>>            .unbound_nonreg_file    = 1,
>>>>>            .pollout        = 1,
>>>>> +        .needs_async_data    = 1,
>>>>>            .plug            = 1,
>>>>>            .async_size        = sizeof(struct io_async_rw),
>>>>>            .work_flags        = IO_WQ_WORK_BLKCG | IO_WQ_WORK_FSIZE |
>>>>>
>>>>>
>>>>>
>>>>> [2]
>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>> index 7e35283fc0b1..31560b879fb3 100644
>>>>> --- a/fs/io_uring.c
>>>>> +++ b/fs/io_uring.c
>>>>> @@ -3148,7 +3148,12 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
>>>>>        opcode = req->opcode;
>>>>>        if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) {
>>>>>            *iovec = NULL;
>>>>> -        return io_import_fixed(req, rw, iter);
>>>>> +
>>>>> +        io_ring_submit_lock(req->ctx, needs_lock);
>>>>> +        lockdep_assert_held(&req->ctx->uring_lock);
>>>>> +        ret = io_import_fixed(req, rw, iter);
>>>>> +        io_ring_submit_unlock(req->ctx, needs_lock);
>>>>> +        return ret;
>>>>>        }
>>>>>          /* buffer index only valid with fixed read/write, or buffer select  */
>>>>> @@ -3638,7 +3643,7 @@ static int io_write(struct io_kiocb *req, bool force_nonblock,
>>>>>    copy_iov:
>>>>>            /* some cases will consume bytes even on error returns */
>>>>>            iov_iter_revert(iter, io_size - iov_iter_count(iter));
>>>>> -        ret = io_setup_async_rw(req, iovec, inline_vecs, iter, false);
>>>>> +        ret = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
>>>>>            if (!ret)
>>>>>                return -EAGAIN;
>>>>>        }
>>>>>
>>>>>
>>>>
>>>> For my understanding, is [1] essentially about stashing the iovec for the fixed IO in an io_async_rw struct and referencing it in async context?
>>>
>>> Yes, like that. It actually doesn't use iov but employs bvec, which
>>> it gets from struct io_mapped_ubuf, and stores it inside iter.
>>>
>>>> I don't understand how this prevents unregistering the buffer (described by the iovec) while the IO takes place.
>>>
>>> The bvec itself is guaranteed to be alive during the whole lifetime
>>> of the request, that's because of all that percpu_ref in nodes.
>>> However, the table storing buffers (i.e. ctx->user_bufs) may be
>>> overwritten.
>>>
>>> reg/unreg/update happens with uring_lock held, as well as submission.
>>> Hence if we always grab a buffer during submission it will be fine.
>>
>> So because of the uring_lock being held, if we implement [1], then once we grab a fixed buffer during submission, we are guaranteed that the IO successfully completes, even if the buffer table is overwritten?
> 
> There are two separate things.
> 1. bvec itself. Currently quiesce guarantees its validity, and for your
> patches node->refs keeps it.
> 
> 2. the table where bvecs are stored, i.e. array of pointers to bvecs.
> Naturally, it's racy to read and write in parallel and not synchronised
> from it. Currently it's also synchronised by quiesce, but [1] and [2]
> sync it with uring_lock, but in a different fashion.
> I may be able to replace uring_lock there with RCU. 
> 
>>
>> Would the bvec persistence help us with buffer sharing and the deadlock scenario you brought up as well?  If the sharing task wouldn't have to block for the attached tasks to get rid of their references, it seems that any outstanding IO would complete successfully.
> 
> bvecs (1.) should be fine/easy to do, one of the problems is the table
> itself (2.). When I get time I'll look into RCU option, and I have a
> hunch it would help with it as well.
> But IIRC there are other issues.
> 
>>
>> My concern however is what would happen if the sharing task actually *frees* its buffers after returning from unregister, since those buffers would still live in the buf_data, right?
> 
> Don't remember the patch, but it must not. That should be the easy
> part because we can rely on node::refs

Forgot to mention, I failed to find where you do
io_set_resource_node() in v4 for fixed reads/writes.
Also, it should be done during submission, io_prep_rw()
is the right place for that.


>>>> Taking a step back, what is the cost of keeping the quiesce for buffer registration operations?  It should not be a frequent operation even a heavy handed quiesce should not be a big issue?
>>>
>>> It waits for __all__ inflight requests to complete and doesn't allow
>>> submissions in the meantime (basically all io_uring_enter() attempts
>>> will fail). +grace period.
>>>
>>> It's pretty heavy, but worse is that it shuts down everything while
>>> waiting. However, if an application is prepared for that and it's
>>> really rare or done once, that should be ok.> Jens, what do you think?
> 
> Just to note, that's how it works now. And IORING_UPDATE_BUFFERS would
> work same way if added head on.
> 
> You mentioned that this work is important for you, so I'd rather ask
> your opinion on that matter. Is it ok for your use case? How often
> do you expect to do register/unregister/update buffers?
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 08/13] io_uring: implement fixed buffers registration similar to fixed files
  2021-01-07 22:33               ` Pavel Begunkov
  2021-01-07 23:10                 ` Pavel Begunkov
@ 2021-01-08  0:17                 ` Bijan Mottahedeh
  1 sibling, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2021-01-08  0:17 UTC (permalink / raw)
  To: Pavel Begunkov, axboe, io-uring

On 1/7/2021 2:33 PM, Pavel Begunkov wrote:
> On 07/01/2021 22:14, Bijan Mottahedeh wrote:
>> On 1/7/2021 1:37 PM, Pavel Begunkov wrote:
>>> On 07/01/2021 21:21, Bijan Mottahedeh wrote:
>>>>
>>>>>>> Because it's do quiesce, fixed read/write access buffers from asynchronous
>>>>>>> contexts without synchronisation. That won't work anymore, so
>>>>>>>
>>>>>>> 1. either we save it in advance, that would require extra req_async
>>>>>>> allocation for linked fixed rw
>>>>>>>
>>>>>>> 2. or synchronise whenever async. But that would mean that a request
>>>>>>> may get and do IO on two different buffers, that's rotten.
>>>>>>>
>>>>>>> 3. do mixed -- lazy, but if do IO then alloc.
>>>>>>>
>>>>>>> 3.5 also "synchronise" there would mean uring_lock, that's not welcome,
>>>>>>> but we can probably do rcu.
>>>>>>
>>>>>> Are you referring to a case where a fixed buffer request can be submitted from async context while those buffers are being unregistered, or something like that?
>>>>>>
>>>>>>> Let me think of a patch...
>>>>>
>>>>> The most convenient API would be [1], it selects a buffer during
>>>>> submission, but allocates if needs to go async or for all linked
>>>>> requests.
>>>>>
>>>>> [2] should be correct from the kernel perspective (no races), it
>>>>> also solves doing IO on 2 different buffers, that's nasty (BTW,
>>>>> [1] solves this problem naturally). However, a buffer might be
>>>>> selected async, but the following can happen, and user should
>>>>> wait for request completion before removing a buffer.
>>>>>
>>>>> 1. register buf id=0
>>>>> 2. syscall io_uring_enter(submit=RW_FIXED,buf_id=0,IOSQE_ASYNC)
>>>>> 3. unregister buffers
>>>>> 4. the request may not find the buffer and fail.
>>>>>
>>>>> Not very convenient + can actually add overhead on the userspace
>>>>> side, can be even some heavy synchronisation.
>>>>>
>>>>> uring_lock in [2] is not nice, but I think I can replace it
>>>>> with rcu, probably can even help with sharing, but I need to
>>>>> try to implement to be sure.
>>>>>
>>>>> So that's an open question what API to have.
>>>>> Neither of diffs is tested.
>>>>>
>>>>> [1]
>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>> index 7e35283fc0b1..2171836a9ce3 100644
>>>>> --- a/fs/io_uring.c
>>>>> +++ b/fs/io_uring.c
>>>>> @@ -826,6 +826,7 @@ static const struct io_op_def io_op_defs[] = {
>>>>>             .needs_file        = 1,
>>>>>             .unbound_nonreg_file    = 1,
>>>>>             .pollin            = 1,
>>>>> +        .needs_async_data    = 1,
>>>>>             .plug            = 1,
>>>>>             .async_size        = sizeof(struct io_async_rw),
>>>>>             .work_flags        = IO_WQ_WORK_BLKCG | IO_WQ_WORK_MM,
>>>>> @@ -835,6 +836,7 @@ static const struct io_op_def io_op_defs[] = {
>>>>>             .hash_reg_file        = 1,
>>>>>             .unbound_nonreg_file    = 1,
>>>>>             .pollout        = 1,
>>>>> +        .needs_async_data    = 1,
>>>>>             .plug            = 1,
>>>>>             .async_size        = sizeof(struct io_async_rw),
>>>>>             .work_flags        = IO_WQ_WORK_BLKCG | IO_WQ_WORK_FSIZE |
>>>>>
>>>>>
>>>>>
>>>>> [2]
>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>> index 7e35283fc0b1..31560b879fb3 100644
>>>>> --- a/fs/io_uring.c
>>>>> +++ b/fs/io_uring.c
>>>>> @@ -3148,7 +3148,12 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
>>>>>         opcode = req->opcode;
>>>>>         if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) {
>>>>>             *iovec = NULL;
>>>>> -        return io_import_fixed(req, rw, iter);
>>>>> +
>>>>> +        io_ring_submit_lock(req->ctx, needs_lock);
>>>>> +        lockdep_assert_held(&req->ctx->uring_lock);
>>>>> +        ret = io_import_fixed(req, rw, iter);
>>>>> +        io_ring_submit_unlock(req->ctx, needs_lock);
>>>>> +        return ret;
>>>>>         }
>>>>>           /* buffer index only valid with fixed read/write, or buffer select  */
>>>>> @@ -3638,7 +3643,7 @@ static int io_write(struct io_kiocb *req, bool force_nonblock,
>>>>>     copy_iov:
>>>>>             /* some cases will consume bytes even on error returns */
>>>>>             iov_iter_revert(iter, io_size - iov_iter_count(iter));
>>>>> -        ret = io_setup_async_rw(req, iovec, inline_vecs, iter, false);
>>>>> +        ret = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
>>>>>             if (!ret)
>>>>>                 return -EAGAIN;
>>>>>         }
>>>>>
>>>>>
>>>>
>>>> For my understanding, is [1] essentially about stashing the iovec for the fixed IO in an io_async_rw struct and referencing it in async context?
>>>
>>> Yes, like that. It actually doesn't use iov but employs bvec, which
>>> it gets from struct io_mapped_ubuf, and stores it inside iter.
>>>
>>>> I don't understand how this prevents unregistering the buffer (described by the iovec) while the IO takes place.
>>>
>>> The bvec itself is guaranteed to be alive during the whole lifetime
>>> of the request, that's because of all that percpu_ref in nodes.
>>> However, the table storing buffers (i.e. ctx->user_bufs) may be
>>> overwritten.
>>>
>>> reg/unreg/update happens with uring_lock held, as well as submission.
>>> Hence if we always grab a buffer during submission it will be fine.
>>
>> So because of the uring_lock being held, if we implement [1], then once we grab a fixed buffer during submission, we are guaranteed that the IO successfully completes, even if the buffer table is overwritten?
> 
> There are two separate things.
> 1. bvec itself. Currently quiesce guarantees its validity, and for your
> patches node->refs keeps it.
> 
> 2. the table where bvecs are stored, i.e. array of pointers to bvecs.
> Naturally, it's racy to read and write in parallel and not synchronised
> from it. Currently it's also synchronised by quiesce, but [1] and [2]
> sync it with uring_lock, but in a different fashion.
> I may be able to replace uring_lock there with RCU.
> 
>>
>> Would the bvec persistence help us with buffer sharing and the deadlock scenario you brought up as well?  If the sharing task wouldn't have to block for the attached tasks to get rid of their references, it seems that any outstanding IO would complete successfully.
> 
> bvecs (1.) should be fine/easy to do, one of the problems is the table
> itself (2.). When I get time I'll look into RCU option, and I have a
> hunch it would help with it as well.
> But IIRC there are other issues.
> 
>>
>> My concern however is what would happen if the sharing task actually *frees* its buffers after returning from unregister, since those buffers would still live in the buf_data, right?
> 
> Don't remember the patch, but it must not. That should be the easy
> part because we can rely on node::refs

The buffer sharing patch is the last one, #13.

> 
>>>> Taking a step back, what is the cost of keeping the quiesce for buffer registration operations?  It should not be a frequent operation even a heavy handed quiesce should not be a big issue?
>>>
>>> It waits for __all__ inflight requests to complete and doesn't allow
>>> submissions in the meantime (basically all io_uring_enter() attempts
>>> will fail). +grace period.

What failure would be expect for submissions?

>>> It's pretty heavy, but worse is that it shuts down everything while
>>> waiting. However, if an application is prepared for that and it's
>>> really rare or done once, that should be ok.> Jens, what do you think?
> 
> Just to note, that's how it works now. And IORING_UPDATE_BUFFERS would
> work same way if added head on.
> 
> You mentioned that this work is important for you, so I'd rather ask
> your opinion on that matter. Is it ok for your use case? How often
> do you expect to do register/unregister/update buffers?
> 

For our use case, a (primary) process shares and incrementally registers 
buffers in chunks inside a large shmem segment during the app startup. 
Other (secondary) processes then attach the buffers and initiate IO with 
the buffers already registered.  The registered buffers should pretty 
much persist during the app lifetime.  It seems that a heavy quiesce 
cost shouldn't be significant compared to the cost of actual 
registrations etc.

On the other hand, grab/release of the uring_lock with [2] could have a 
more significant perf impact.

However, if [2] doesn't affect perf, and if it can facilitate the buffer 
sharing implementation, then I'd say that'd be the preferred approach.

For our case, sharing of buffer registrations is highest priority, 
followed by the incremental updates.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 08/13] io_uring: implement fixed buffers registration similar to fixed files
  2021-01-07 23:10                 ` Pavel Begunkov
@ 2021-01-08  1:53                   ` Bijan Mottahedeh
  2021-01-11  5:12                     ` Pavel Begunkov
  0 siblings, 1 reply; 32+ messages in thread
From: Bijan Mottahedeh @ 2021-01-08  1:53 UTC (permalink / raw)
  To: Pavel Begunkov, axboe, io-uring


> Forgot to mention, I failed to find where you do
> io_set_resource_node() in v4 for fixed reads/writes.
> Also, it should be done during submission, io_prep_rw()
> is the right place for that.

Would something like below be ok?  I renamed io_set_resource_node() to 
io_get_fixed_rsrc_ref() to make its function more clear and also 
distinguish it from io_sqe_rsrc_set_node().



diff --git a/fs/io_uring.c b/fs/io_uring.c
index bd55d11..a9b9881 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1083,12 +1083,11 @@ static inline void io_clean_op(struct io_kiocb *req)
                 __io_clean_op(req);
  }

-static inline void io_set_resource_node(struct io_kiocb *req)
+static inline void io_get_fixed_rsrc_ref(struct io_kiocb *req,
+                                        struct fixed_rsrc_data *rsrc_data)
  {
-       struct io_ring_ctx *ctx = req->ctx;
-
         if (!req->fixed_rsrc_refs) {
-               req->fixed_rsrc_refs = &ctx->file_data->node->refs;
+               req->fixed_rsrc_refs = &rsrc_data->node->refs;
                 percpu_ref_get(req->fixed_rsrc_refs);
         }
  }
@@ -2928,6 +2927,9 @@ static int io_prep_rw(struct io_kiocb *req, const 
struct i
         req->rw.addr = READ_ONCE(sqe->addr);
         req->rw.len = READ_ONCE(sqe->len);
         req->buf_index = READ_ONCE(sqe->buf_index);
+       if (req->opcode == IORING_OP_READ_FIXED ||
+           req->opcode == IORING_OP_WRITE_FIXED)
+               io_get_fixed_rsrc_ref(req, ctx->buf_data);
         return 0;
  }

@@ -6452,7 +6454,7 @@ static struct file *io_file_get(struct 
io_submit_state *st
                         return NULL;
                 fd = array_index_nospec(fd, ctx->nr_user_files);
                 file = io_file_from_index(ctx, fd);
-               io_set_resource_node(req);
+               io_get_fixed_rsrc_ref(req, ctx->file_data);
         } else {
                 trace_io_uring_file_get(ctx, fd);
                 file = __io_file_get(state, fd);


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 08/13] io_uring: implement fixed buffers registration similar to fixed files
  2021-01-08  1:53                   ` Bijan Mottahedeh
@ 2021-01-11  5:12                     ` Pavel Begunkov
  0 siblings, 0 replies; 32+ messages in thread
From: Pavel Begunkov @ 2021-01-11  5:12 UTC (permalink / raw)
  To: Bijan Mottahedeh, axboe, io-uring

On 08/01/2021 01:53, Bijan Mottahedeh wrote:
> 
>> Forgot to mention, I failed to find where you do
>> io_set_resource_node() in v4 for fixed reads/writes.
>> Also, it should be done during submission, io_prep_rw()
>> is the right place for that.
> 
> Would something like below be ok?  I renamed io_set_resource_node() to io_get_fixed_rsrc_ref() to make its function more clear and also distinguish it from io_sqe_rsrc_set_node().

looks good

> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index bd55d11..a9b9881 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -1083,12 +1083,11 @@ static inline void io_clean_op(struct io_kiocb *req)
>                 __io_clean_op(req);
>  }
> 
> -static inline void io_set_resource_node(struct io_kiocb *req)
> +static inline void io_get_fixed_rsrc_ref(struct io_kiocb *req,
> +                                        struct fixed_rsrc_data *rsrc_data)
>  {
> -       struct io_ring_ctx *ctx = req->ctx;
> -
>         if (!req->fixed_rsrc_refs) {
> -               req->fixed_rsrc_refs = &ctx->file_data->node->refs;
> +               req->fixed_rsrc_refs = &rsrc_data->node->refs;
>                 percpu_ref_get(req->fixed_rsrc_refs);
>         }
>  }
> @@ -2928,6 +2927,9 @@ static int io_prep_rw(struct io_kiocb *req, const struct i
>         req->rw.addr = READ_ONCE(sqe->addr);
>         req->rw.len = READ_ONCE(sqe->len);
>         req->buf_index = READ_ONCE(sqe->buf_index);
> +       if (req->opcode == IORING_OP_READ_FIXED ||
> +           req->opcode == IORING_OP_WRITE_FIXED)
> +               io_get_fixed_rsrc_ref(req, ctx->buf_data);
>         return 0;
>  }
> 
> @@ -6452,7 +6454,7 @@ static struct file *io_file_get(struct io_submit_state *st
>                         return NULL;
>                 fd = array_index_nospec(fd, ctx->nr_user_files);
>                 file = io_file_from_index(ctx, fd);
> -               io_set_resource_node(req);
> +               io_get_fixed_rsrc_ref(req, ctx->file_data);
>         } else {
>                 trace_io_uring_file_get(ctx, fd);
>                 file = __io_file_get(state, fd);
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, back to index

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-18 18:07 [PATCH v3 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
2020-12-18 18:07 ` [PATCH v3 01/13] io_uring: modularize io_sqe_buffer_register Bijan Mottahedeh
2021-01-04 21:54   ` Pavel Begunkov
2021-01-06 19:46     ` Bijan Mottahedeh
2020-12-18 18:07 ` [PATCH v3 02/13] io_uring: modularize io_sqe_buffers_register Bijan Mottahedeh
2021-01-04 21:48   ` Pavel Begunkov
2020-12-18 18:07 ` [PATCH v3 03/13] io_uring: rename file related variables to rsrc Bijan Mottahedeh
2021-01-05  1:53   ` Pavel Begunkov
2021-01-06 19:46     ` Bijan Mottahedeh
2020-12-18 18:07 ` [PATCH v3 04/13] io_uring: generalize io_queue_rsrc_removal Bijan Mottahedeh
2020-12-18 18:07 ` [PATCH v3 05/13] io_uring: separate ref_list from fixed_rsrc_data Bijan Mottahedeh
2020-12-18 18:07 ` [PATCH v3 06/13] io_uring: generalize fixed_file_ref_node functionality Bijan Mottahedeh
2020-12-18 18:07 ` [PATCH v3 07/13] io_uring: add rsrc_ref locking routines Bijan Mottahedeh
2020-12-18 18:07 ` [PATCH v3 08/13] io_uring: implement fixed buffers registration similar to fixed files Bijan Mottahedeh
2021-01-05  2:43   ` Pavel Begunkov
2021-01-06 19:46     ` Bijan Mottahedeh
2021-01-06 22:22       ` Pavel Begunkov
2021-01-07  2:37       ` Pavel Begunkov
2021-01-07 21:21         ` Bijan Mottahedeh
2021-01-07 21:37           ` Pavel Begunkov
2021-01-07 22:14             ` Bijan Mottahedeh
2021-01-07 22:33               ` Pavel Begunkov
2021-01-07 23:10                 ` Pavel Begunkov
2021-01-08  1:53                   ` Bijan Mottahedeh
2021-01-11  5:12                     ` Pavel Begunkov
2021-01-08  0:17                 ` Bijan Mottahedeh
2020-12-18 18:07 ` [PATCH v3 09/13] io_uring: create common fixed_rsrc_ref_node handling routines Bijan Mottahedeh
2020-12-18 18:07 ` [PATCH v3 10/13] io_uring: generalize files_update functionlity to rsrc_update Bijan Mottahedeh
2020-12-18 18:07 ` [PATCH v3 11/13] io_uring: support buffer registration updates Bijan Mottahedeh
2020-12-18 18:07 ` [PATCH v3 12/13] io_uring: create common fixed_rsrc_data allocation routines Bijan Mottahedeh
2020-12-18 18:07 ` [PATCH v3 13/13] io_uring: support buffer registration sharing Bijan Mottahedeh
2021-01-04 17:09 ` [PATCH v3 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh

IO-Uring Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/io-uring/0 io-uring/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 io-uring io-uring/ https://lore.kernel.org/io-uring \
		io-uring@vger.kernel.org
	public-inbox-index io-uring

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.io-uring


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git