All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 00/13] io_uring: buffer registration enhancements
@ 2020-12-07 22:15 Bijan Mottahedeh
  2020-12-07 22:15 ` [PATCH v2 01/13] io_uring: modularize io_sqe_buffer_register Bijan Mottahedeh
                   ` (14 more replies)
  0 siblings, 15 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-07 22:15 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

v2:

- drop readv/writev with fixed buffers patch
- handle ref_nodes both both files/buffers with a single ref_list
- make file/buffer handling more unified

This patchset implements a set of enhancements to buffer registration
consistent with existing file registration functionality:

- buffer registration updates		IORING_REGISTER_BUFFERS_UPDATE
					IORING_OP_BUFFERS_UPDATE

- buffer registration sharing		IORING_SETUP_SHARE_BUF
					IORING_SETUP_ATTACH_BUF

I have kept the original patchset unchanged for the most part to
facilitate reviewing and so this set adds a number of additional patches
mostly making file/buffer handling more unified.

Patch 1-2 modularize existing buffer registration code.

Patch 3-7 generalize fixed_file functionality to fixed_rsrc.

Patch 8 applies fixed_rsrc functionality for fixed buffers support.

Patch 9-10 generalize files_update functionality to rsrc_update.

Patch 11 implements buffer registration update, and introduces
IORING_REGISTER_BUFFERS_UPDATE and IORING_OP_BUFFERS_UPDATE, consistent
with file registration update.

Patch 12 generalizes fixed resource allocation 

Patch 13 implements buffer sharing among multiple rings; it works as follows:

- A new ring, A,  is setup. Since no buffers have been registered, the
  registered buffer state is an empty set, Z. That's different from the
  NULL state in current implementation.

- Ring B is setup, attaching to Ring A. It's also attaching to it's
  buffer registrations, now we have two references to the same empty
  set, Z.

- Ring A registers buffers into set Z, which is no longer empty.

- Ring B sees this immediately, since it's already sharing that set.

Testing

I have used liburing file-{register,update} tests as models for
buffer-{register,update,share}, tests and they run ok.

TBD

- Haven't addressed Pavel's comment yet on using a single opcode for
files/buffers update, pending Jen's opinion.  Could we encode the resource
type into the sqe (e.g. rw_flags)?

Bijan Mottahedeh (13):
  io_uring: modularize io_sqe_buffer_register
  io_uring: modularize io_sqe_buffers_register
  io_uring: generalize fixed file functionality
  io_uring: rename fixed_file variables to fixed_rsrc
  io_uring: separate ref_list from fixed_rsrc_data
  io_uring: generalize fixed_file_ref_node functionality
  io_uring: add rsrc_ref locking routines
  io_uring: implement fixed buffers registration similar to fixed files
  io_uring: create common fixed_rsrc_ref_node handling routines
  io_uring: generalize files_update functionlity to rsrc_update
  io_uring: support buffer registration updates
  io_uring: create common fixed_rsrc_data allocation routines.
  io_uring: support buffer registration sharing

 fs/io_uring.c                 | 1004 +++++++++++++++++++++++++++++------------
 include/uapi/linux/io_uring.h |   12 +-
 2 files changed, 732 insertions(+), 284 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 01/13] io_uring: modularize io_sqe_buffer_register
  2020-12-07 22:15 [PATCH v2 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
@ 2020-12-07 22:15 ` Bijan Mottahedeh
  2020-12-07 22:15 ` [PATCH v2 02/13] io_uring: modularize io_sqe_buffers_register Bijan Mottahedeh
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-07 22:15 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Split io_sqe_buffer_register into two routines:

- io_sqe_buffer_register() registers a single buffer
- io_sqe_buffers_register iterates over all user specified buffers

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c | 210 ++++++++++++++++++++++++++++++----------------------------
 1 file changed, 107 insertions(+), 103 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index bdfb473..7d6718a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -8252,7 +8252,7 @@ static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
 	return pages;
 }
 
-static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
+static int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
 {
 	int i, j;
 
@@ -8370,14 +8370,103 @@ static int io_buffer_account_pin(struct io_ring_ctx *ctx, struct page **pages,
 	return ret;
 }
 
-static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
-				  unsigned nr_args)
+static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
+				  struct io_mapped_ubuf *imu,
+				  struct page **last_hpage)
 {
 	struct vm_area_struct **vmas = NULL;
 	struct page **pages = NULL;
+	unsigned long off, start, end, ubuf;
+	size_t size;
+	int ret, pret, nr_pages, i;
+
+	ubuf = (unsigned long) iov->iov_base;
+	end = (ubuf + iov->iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	start = ubuf >> PAGE_SHIFT;
+	nr_pages = end - start;
+
+	ret = -ENOMEM;
+
+	pages = kvmalloc_array(nr_pages, sizeof(struct page *), GFP_KERNEL);
+	if (!pages)
+		goto done;
+
+	vmas = kvmalloc_array(nr_pages, sizeof(struct vm_area_struct *),
+			      GFP_KERNEL);
+	if (!vmas)
+		goto done;
+
+	imu->bvec = kvmalloc_array(nr_pages, sizeof(struct bio_vec),
+				   GFP_KERNEL);
+	if (!imu->bvec)
+		goto done;
+
+	ret = 0;
+	mmap_read_lock(current->mm);
+	pret = pin_user_pages(ubuf, nr_pages, FOLL_WRITE | FOLL_LONGTERM,
+			      pages, vmas);
+	if (pret == nr_pages) {
+		/* don't support file backed memory */
+		for (i = 0; i < nr_pages; i++) {
+			struct vm_area_struct *vma = vmas[i];
+
+			if (vma->vm_file &&
+			    !is_file_hugepages(vma->vm_file)) {
+				ret = -EOPNOTSUPP;
+				break;
+			}
+		}
+	} else {
+		ret = pret < 0 ? pret : -EFAULT;
+	}
+	mmap_read_unlock(current->mm);
+	if (ret) {
+		/*
+		 * if we did partial map, or found file backed vmas,
+		 * release any pages we did get
+		 */
+		if (pret > 0)
+			unpin_user_pages(pages, pret);
+		kvfree(imu->bvec);
+		goto done;
+	}
+
+	ret = io_buffer_account_pin(ctx, pages, pret, imu, last_hpage);
+	if (ret) {
+		unpin_user_pages(pages, pret);
+		kvfree(imu->bvec);
+		goto done;
+	}
+
+	off = ubuf & ~PAGE_MASK;
+	size = iov->iov_len;
+	for (i = 0; i < nr_pages; i++) {
+		size_t vec_len;
+
+		vec_len = min_t(size_t, size, PAGE_SIZE - off);
+		imu->bvec[i].bv_page = pages[i];
+		imu->bvec[i].bv_len = vec_len;
+		imu->bvec[i].bv_offset = off;
+		off = 0;
+		size -= vec_len;
+	}
+	/* store original address for later verification */
+	imu->ubuf = ubuf;
+	imu->len = iov->iov_len;
+	imu->nr_bvecs = nr_pages;
+	ret = 0;
+done:
+	kvfree(pages);
+	kvfree(vmas);
+	return ret;
+}
+
+static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
+				   unsigned int nr_args)
+{
+	int i, ret;
+	struct iovec iov;
 	struct page *last_hpage = NULL;
-	int i, j, got_pages = 0;
-	int ret = -EINVAL;
 
 	if (ctx->user_bufs)
 		return -EBUSY;
@@ -8391,14 +8480,10 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 
 	for (i = 0; i < nr_args; i++) {
 		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
-		unsigned long off, start, end, ubuf;
-		int pret, nr_pages;
-		struct iovec iov;
-		size_t size;
 
 		ret = io_copy_iov(ctx, &iov, arg, i);
 		if (ret)
-			goto err;
+			break;
 
 		/*
 		 * Don't impose further limits on the size and buffer
@@ -8407,103 +8492,22 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 		 */
 		ret = -EFAULT;
 		if (!iov.iov_base || !iov.iov_len)
-			goto err;
+			break;
 
 		/* arbitrary limit, but we need something */
 		if (iov.iov_len > SZ_1G)
-			goto err;
-
-		ubuf = (unsigned long) iov.iov_base;
-		end = (ubuf + iov.iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT;
-		start = ubuf >> PAGE_SHIFT;
-		nr_pages = end - start;
-
-		ret = 0;
-		if (!pages || nr_pages > got_pages) {
-			kvfree(vmas);
-			kvfree(pages);
-			pages = kvmalloc_array(nr_pages, sizeof(struct page *),
-						GFP_KERNEL);
-			vmas = kvmalloc_array(nr_pages,
-					sizeof(struct vm_area_struct *),
-					GFP_KERNEL);
-			if (!pages || !vmas) {
-				ret = -ENOMEM;
-				goto err;
-			}
-			got_pages = nr_pages;
-		}
-
-		imu->bvec = kvmalloc_array(nr_pages, sizeof(struct bio_vec),
-						GFP_KERNEL);
-		ret = -ENOMEM;
-		if (!imu->bvec)
-			goto err;
-
-		ret = 0;
-		mmap_read_lock(current->mm);
-		pret = pin_user_pages(ubuf, nr_pages,
-				      FOLL_WRITE | FOLL_LONGTERM,
-				      pages, vmas);
-		if (pret == nr_pages) {
-			/* don't support file backed memory */
-			for (j = 0; j < nr_pages; j++) {
-				struct vm_area_struct *vma = vmas[j];
-
-				if (vma->vm_file &&
-				    !is_file_hugepages(vma->vm_file)) {
-					ret = -EOPNOTSUPP;
-					break;
-				}
-			}
-		} else {
-			ret = pret < 0 ? pret : -EFAULT;
-		}
-		mmap_read_unlock(current->mm);
-		if (ret) {
-			/*
-			 * if we did partial map, or found file backed vmas,
-			 * release any pages we did get
-			 */
-			if (pret > 0)
-				unpin_user_pages(pages, pret);
-			kvfree(imu->bvec);
-			goto err;
-		}
-
-		ret = io_buffer_account_pin(ctx, pages, pret, imu, &last_hpage);
-		if (ret) {
-			unpin_user_pages(pages, pret);
-			kvfree(imu->bvec);
-			goto err;
-		}
+			break;
 
-		off = ubuf & ~PAGE_MASK;
-		size = iov.iov_len;
-		for (j = 0; j < nr_pages; j++) {
-			size_t vec_len;
-
-			vec_len = min_t(size_t, size, PAGE_SIZE - off);
-			imu->bvec[j].bv_page = pages[j];
-			imu->bvec[j].bv_len = vec_len;
-			imu->bvec[j].bv_offset = off;
-			off = 0;
-			size -= vec_len;
-		}
-		/* store original address for later verification */
-		imu->ubuf = ubuf;
-		imu->len = iov.iov_len;
-		imu->nr_bvecs = nr_pages;
+		ret = io_sqe_buffer_register(ctx, &iov, imu, &last_hpage);
+		if (ret)
+			break;
 
 		ctx->nr_user_bufs++;
 	}
-	kvfree(pages);
-	kvfree(vmas);
-	return 0;
-err:
-	kvfree(pages);
-	kvfree(vmas);
-	io_sqe_buffer_unregister(ctx);
+
+	if (ret)
+		io_sqe_buffers_unregister(ctx);
+
 	return ret;
 }
 
@@ -8557,7 +8561,7 @@ static void io_destroy_buffers(struct io_ring_ctx *ctx)
 static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 {
 	io_finish_async(ctx);
-	io_sqe_buffer_unregister(ctx);
+	io_sqe_buffers_unregister(ctx);
 
 	if (ctx->sqo_task) {
 		put_task_struct(ctx->sqo_task);
@@ -9853,13 +9857,13 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 
 	switch (opcode) {
 	case IORING_REGISTER_BUFFERS:
-		ret = io_sqe_buffer_register(ctx, arg, nr_args);
+		ret = io_sqe_buffers_register(ctx, arg, nr_args);
 		break;
 	case IORING_UNREGISTER_BUFFERS:
 		ret = -EINVAL;
 		if (arg || nr_args)
 			break;
-		ret = io_sqe_buffer_unregister(ctx);
+		ret = io_sqe_buffers_unregister(ctx);
 		break;
 	case IORING_REGISTER_FILES:
 		ret = io_sqe_files_register(ctx, arg, nr_args);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 02/13] io_uring: modularize io_sqe_buffers_register
  2020-12-07 22:15 [PATCH v2 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
  2020-12-07 22:15 ` [PATCH v2 01/13] io_uring: modularize io_sqe_buffer_register Bijan Mottahedeh
@ 2020-12-07 22:15 ` Bijan Mottahedeh
  2020-12-07 22:15 ` [PATCH v2 03/13] io_uring: generalize fixed file functionality Bijan Mottahedeh
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-07 22:15 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Move allocation of buffer management structures, and validation of
buffers into separate routines.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c | 51 ++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 34 insertions(+), 17 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 7d6718a..d8505e2 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -8461,13 +8461,8 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 	return ret;
 }
 
-static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
-				   unsigned int nr_args)
+static int io_buffers_map_alloc(struct io_ring_ctx *ctx, unsigned int nr_args)
 {
-	int i, ret;
-	struct iovec iov;
-	struct page *last_hpage = NULL;
-
 	if (ctx->user_bufs)
 		return -EBUSY;
 	if (!nr_args || nr_args > UIO_MAXIOV)
@@ -8478,6 +8473,37 @@ static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 	if (!ctx->user_bufs)
 		return -ENOMEM;
 
+	return 0;
+}
+
+static int io_buffer_validate(struct iovec *iov)
+{
+	/*
+	 * Don't impose further limits on the size and buffer
+	 * constraints here, we'll -EINVAL later when IO is
+	 * submitted if they are wrong.
+	 */
+	if (!iov->iov_base || !iov->iov_len)
+		return -EFAULT;
+
+	/* arbitrary limit, but we need something */
+	if (iov->iov_len > SZ_1G)
+		return -EFAULT;
+
+	return 0;
+}
+
+static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
+				   unsigned int nr_args)
+{
+	int i, ret;
+	struct iovec iov;
+	struct page *last_hpage = NULL;
+
+	ret = io_buffers_map_alloc(ctx, nr_args);
+	if (ret)
+		return ret;
+
 	for (i = 0; i < nr_args; i++) {
 		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
 
@@ -8485,17 +8511,8 @@ static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 		if (ret)
 			break;
 
-		/*
-		 * Don't impose further limits on the size and buffer
-		 * constraints here, we'll -EINVAL later when IO is
-		 * submitted if they are wrong.
-		 */
-		ret = -EFAULT;
-		if (!iov.iov_base || !iov.iov_len)
-			break;
-
-		/* arbitrary limit, but we need something */
-		if (iov.iov_len > SZ_1G)
+		ret = io_buffer_validate(&iov);
+		if (ret)
 			break;
 
 		ret = io_sqe_buffer_register(ctx, &iov, imu, &last_hpage);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 03/13] io_uring: generalize fixed file functionality
  2020-12-07 22:15 [PATCH v2 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
  2020-12-07 22:15 ` [PATCH v2 01/13] io_uring: modularize io_sqe_buffer_register Bijan Mottahedeh
  2020-12-07 22:15 ` [PATCH v2 02/13] io_uring: modularize io_sqe_buffers_register Bijan Mottahedeh
@ 2020-12-07 22:15 ` Bijan Mottahedeh
  2020-12-07 22:15 ` [PATCH v2 04/13] io_uring: rename fixed_file variables to fixed_rsrc Bijan Mottahedeh
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-07 22:15 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Generalize fixed_file functionality to fixed_rsrc in order to leverage
it for fixed buffers.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c | 159 +++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 91 insertions(+), 68 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index d8505e2..2b42049 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -195,24 +195,40 @@ struct io_mapped_ubuf {
 	unsigned long	acct_pages;
 };
 
-struct fixed_file_table {
-	struct file		**files;
+struct io_ring_ctx;
+
+struct io_rsrc_put {
+	struct list_head list;
+	union {
+		void *rsrc;
+		struct file *file;
+		struct io_mapped_ubuf *buf;
+	};
+};
+
+struct fixed_rsrc_table {
+	union {
+		struct file		**files;
+		struct io_mapped_ubuf	*bufs;
+	};
 };
 
-struct fixed_file_ref_node {
+struct fixed_rsrc_ref_node {
 	struct percpu_ref		refs;
 	struct list_head		node;
-	struct list_head		file_list;
-	struct fixed_file_data		*file_data;
+	struct list_head		rsrc_list;
+	struct fixed_rsrc_data		*rsrc_data;
+	void 				(*rsrc_put)(struct io_ring_ctx *ctx,
+						    struct io_rsrc_put *prsrc);
 	struct llist_node		llist;
 	bool				done;
 };
 
-struct fixed_file_data {
-	struct fixed_file_table		*table;
+struct fixed_rsrc_data {
+	struct fixed_rsrc_table		*table;
 	struct io_ring_ctx		*ctx;
 
-	struct fixed_file_ref_node	*node;
+	struct fixed_rsrc_ref_node	*node;
 	struct percpu_ref		refs;
 	struct completion		done;
 	struct list_head		ref_list;
@@ -318,7 +334,7 @@ struct io_ring_ctx {
 	 * readers must ensure that ->refs is alive as long as the file* is
 	 * used. Only updated through io_uring_register(2).
 	 */
-	struct fixed_file_data	*file_data;
+	struct fixed_rsrc_data	*file_data;
 	unsigned		nr_user_files;
 
 	/* if used, fixed mapped user buffers */
@@ -6364,7 +6380,7 @@ static struct io_wq_work *io_wq_submit_work(struct io_wq_work *work)
 static inline struct file *io_file_from_index(struct io_ring_ctx *ctx,
 					      int index)
 {
-	struct fixed_file_table *table;
+	struct fixed_rsrc_table *table;
 
 	table = &ctx->file_data->table[index >> IORING_FILE_TABLE_SHIFT];
 	return table->files[index & IORING_FILE_TABLE_MASK];
@@ -7234,18 +7250,18 @@ static void __io_sqe_files_unregister(struct io_ring_ctx *ctx)
 #endif
 }
 
-static void io_file_ref_kill(struct percpu_ref *ref)
+static void io_rsrc_ref_kill(struct percpu_ref *ref)
 {
-	struct fixed_file_data *data;
+	struct fixed_rsrc_data *data;
 
-	data = container_of(ref, struct fixed_file_data, refs);
+	data = container_of(ref, struct fixed_rsrc_data, refs);
 	complete(&data->done);
 }
 
 static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
 {
-	struct fixed_file_data *data = ctx->file_data;
-	struct fixed_file_ref_node *ref_node = NULL;
+	struct fixed_rsrc_data *data = ctx->file_data;
+	struct fixed_rsrc_ref_node *ref_node = NULL;
 	unsigned nr_tables, i;
 
 	if (!data)
@@ -7494,13 +7510,13 @@ static int io_sqe_files_scm(struct io_ring_ctx *ctx)
 }
 #endif
 
-static int io_sqe_alloc_file_tables(struct fixed_file_data *file_data,
+static int io_sqe_alloc_file_tables(struct fixed_rsrc_data *file_data,
 				    unsigned nr_tables, unsigned nr_files)
 {
 	int i;
 
 	for (i = 0; i < nr_tables; i++) {
-		struct fixed_file_table *table = &file_data->table[i];
+		struct fixed_rsrc_table *table = &file_data->table[i];
 		unsigned this_files;
 
 		this_files = min(nr_files, IORING_MAX_FILES_TABLE);
@@ -7515,14 +7531,15 @@ static int io_sqe_alloc_file_tables(struct fixed_file_data *file_data,
 		return 0;
 
 	for (i = 0; i < nr_tables; i++) {
-		struct fixed_file_table *table = &file_data->table[i];
+		struct fixed_rsrc_table *table = &file_data->table[i];
 		kfree(table->files);
 	}
 	return 1;
 }
 
-static void io_ring_file_put(struct io_ring_ctx *ctx, struct file *file)
+static void io_ring_file_put(struct io_ring_ctx *ctx, struct io_rsrc_put *prsrc)
 {
+	struct file *file = prsrc->file;
 #if defined(CONFIG_UNIX)
 	struct sock *sock = ctx->ring_sock->sk;
 	struct sk_buff_head list, *head = &sock->sk_receive_queue;
@@ -7583,26 +7600,34 @@ static void io_ring_file_put(struct io_ring_ctx *ctx, struct file *file)
 #endif
 }
 
-struct io_file_put {
-	struct list_head list;
-	struct file *file;
-};
-
-static void __io_file_put_work(struct fixed_file_ref_node *ref_node)
+static void __io_rsrc_put_work(struct fixed_rsrc_ref_node *ref_node)
 {
-	struct fixed_file_data *file_data = ref_node->file_data;
-	struct io_ring_ctx *ctx = file_data->ctx;
-	struct io_file_put *pfile, *tmp;
+	struct fixed_rsrc_data *rsrc_data = ref_node->rsrc_data;
+	struct io_ring_ctx *ctx = rsrc_data->ctx;
+	struct io_rsrc_put *prsrc, *tmp;
 
-	list_for_each_entry_safe(pfile, tmp, &ref_node->file_list, list) {
-		list_del(&pfile->list);
-		io_ring_file_put(ctx, pfile->file);
-		kfree(pfile);
+	list_for_each_entry_safe(prsrc, tmp, &ref_node->rsrc_list, list) {
+		list_del(&prsrc->list);
+		ref_node->rsrc_put(ctx, prsrc);
+		kfree(prsrc);
 	}
 
 	percpu_ref_exit(&ref_node->refs);
 	kfree(ref_node);
-	percpu_ref_put(&file_data->refs);
+	percpu_ref_put(&rsrc_data->refs);
+}
+
+static void io_rsrc_put_work(struct llist_node *node)
+{
+	struct fixed_rsrc_ref_node *ref_node;
+	struct llist_node *next;
+
+	while (node) {
+		next = node->next;
+		ref_node = llist_entry(node, struct fixed_rsrc_ref_node, llist);
+		__io_rsrc_put_work(ref_node);
+		node = next;
+	}
 }
 
 static void io_file_put_work(struct work_struct *work)
@@ -7612,27 +7637,19 @@ static void io_file_put_work(struct work_struct *work)
 
 	ctx = container_of(work, struct io_ring_ctx, file_put_work.work);
 	node = llist_del_all(&ctx->file_put_llist);
-
-	while (node) {
-		struct fixed_file_ref_node *ref_node;
-		struct llist_node *next = node->next;
-
-		ref_node = llist_entry(node, struct fixed_file_ref_node, llist);
-		__io_file_put_work(ref_node);
-		node = next;
-	}
+	io_rsrc_put_work(node);
 }
 
 static void io_file_data_ref_zero(struct percpu_ref *ref)
 {
-	struct fixed_file_ref_node *ref_node;
-	struct fixed_file_data *data;
+	struct fixed_rsrc_ref_node *ref_node;
+	struct fixed_rsrc_data *data;
 	struct io_ring_ctx *ctx;
 	bool first_add = false;
 	int delay = HZ;
 
-	ref_node = container_of(ref, struct fixed_file_ref_node, refs);
-	data = ref_node->file_data;
+	ref_node = container_of(ref, struct fixed_rsrc_ref_node, refs);
+	data = ref_node->rsrc_data;
 	ctx = data->ctx;
 
 	spin_lock_bh(&data->lock);
@@ -7640,7 +7657,7 @@ static void io_file_data_ref_zero(struct percpu_ref *ref)
 
 	while (!list_empty(&data->ref_list)) {
 		ref_node = list_first_entry(&data->ref_list,
-					struct fixed_file_ref_node, node);
+					struct fixed_rsrc_ref_node, node);
 		/* recycle ref nodes in order */
 		if (!ref_node->done)
 			break;
@@ -7658,10 +7675,10 @@ static void io_file_data_ref_zero(struct percpu_ref *ref)
 		queue_delayed_work(system_wq, &ctx->file_put_work, delay);
 }
 
-static struct fixed_file_ref_node *alloc_fixed_file_ref_node(
+static struct fixed_rsrc_ref_node *alloc_fixed_file_ref_node(
 			struct io_ring_ctx *ctx)
 {
-	struct fixed_file_ref_node *ref_node;
+	struct fixed_rsrc_ref_node *ref_node;
 
 	ref_node = kzalloc(sizeof(*ref_node), GFP_KERNEL);
 	if (!ref_node)
@@ -7673,13 +7690,14 @@ static struct fixed_file_ref_node *alloc_fixed_file_ref_node(
 		return ERR_PTR(-ENOMEM);
 	}
 	INIT_LIST_HEAD(&ref_node->node);
-	INIT_LIST_HEAD(&ref_node->file_list);
-	ref_node->file_data = ctx->file_data;
+	INIT_LIST_HEAD(&ref_node->rsrc_list);
+	ref_node->rsrc_data = ctx->file_data;
+	ref_node->rsrc_put = io_ring_file_put;
 	ref_node->done = false;
 	return ref_node;
 }
 
-static void destroy_fixed_file_ref_node(struct fixed_file_ref_node *ref_node)
+static void destroy_fixed_file_ref_node(struct fixed_rsrc_ref_node *ref_node)
 {
 	percpu_ref_exit(&ref_node->refs);
 	kfree(ref_node);
@@ -7692,8 +7710,8 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 	unsigned nr_tables, i;
 	struct file *file;
 	int fd, ret = -ENOMEM;
-	struct fixed_file_ref_node *ref_node;
-	struct fixed_file_data *file_data;
+	struct fixed_rsrc_ref_node *ref_node;
+	struct fixed_rsrc_data *file_data;
 
 	if (ctx->file_data)
 		return -EBUSY;
@@ -7716,7 +7734,7 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 	if (!file_data->table)
 		goto out_free;
 
-	if (percpu_ref_init(&file_data->refs, io_file_ref_kill,
+	if (percpu_ref_init(&file_data->refs, io_rsrc_ref_kill,
 				PERCPU_REF_ALLOW_REINIT, GFP_KERNEL))
 		goto out_free;
 
@@ -7725,7 +7743,7 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 	ctx->file_data = file_data;
 
 	for (i = 0; i < nr_args; i++, ctx->nr_user_files++) {
-		struct fixed_file_table *table;
+		struct fixed_rsrc_table *table;
 		unsigned index;
 
 		if (copy_from_user(&fd, &fds[i], sizeof(fd))) {
@@ -7836,28 +7854,33 @@ static int io_sqe_file_register(struct io_ring_ctx *ctx, struct file *file,
 #endif
 }
 
-static int io_queue_file_removal(struct fixed_file_data *data,
-				 struct file *file)
+static int io_queue_rsrc_removal(struct fixed_rsrc_data *data, void *rsrc)
 {
-	struct io_file_put *pfile;
-	struct fixed_file_ref_node *ref_node = data->node;
+	struct io_rsrc_put *prsrc;
+	struct fixed_rsrc_ref_node *ref_node = data->node;
 
-	pfile = kzalloc(sizeof(*pfile), GFP_KERNEL);
-	if (!pfile)
+	prsrc = kzalloc(sizeof(*prsrc), GFP_KERNEL);
+	if (!prsrc)
 		return -ENOMEM;
 
-	pfile->file = file;
-	list_add(&pfile->list, &ref_node->file_list);
+	prsrc->rsrc = rsrc;
+	list_add(&prsrc->list, &ref_node->rsrc_list);
 
 	return 0;
 }
 
+static inline int io_queue_file_removal(struct fixed_rsrc_data *data,
+					struct file *file)
+{
+	return io_queue_rsrc_removal(data, (void *)file);
+}
+
 static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 				 struct io_uring_files_update *up,
 				 unsigned nr_args)
 {
-	struct fixed_file_data *data = ctx->file_data;
-	struct fixed_file_ref_node *ref_node;
+	struct fixed_rsrc_data *data = ctx->file_data;
+	struct fixed_rsrc_ref_node *ref_node;
 	struct file *file;
 	__s32 __user *fds;
 	int fd, i, err;
@@ -7876,7 +7899,7 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 	done = 0;
 	fds = u64_to_user_ptr(up->fds);
 	while (nr_args) {
-		struct fixed_file_table *table;
+		struct fixed_rsrc_table *table;
 		unsigned index;
 
 		err = 0;
@@ -9292,7 +9315,7 @@ static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m)
 	seq_printf(m, "SqThreadCpu:\t%d\n", sq ? task_cpu(sq->thread) : -1);
 	seq_printf(m, "UserFiles:\t%u\n", ctx->nr_user_files);
 	for (i = 0; has_lock && i < ctx->nr_user_files; i++) {
-		struct fixed_file_table *table;
+		struct fixed_rsrc_table *table;
 		struct file *f;
 
 		table = &ctx->file_data->table[i >> IORING_FILE_TABLE_SHIFT];
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 04/13] io_uring: rename fixed_file variables to fixed_rsrc
  2020-12-07 22:15 [PATCH v2 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
                   ` (2 preceding siblings ...)
  2020-12-07 22:15 ` [PATCH v2 03/13] io_uring: generalize fixed file functionality Bijan Mottahedeh
@ 2020-12-07 22:15 ` Bijan Mottahedeh
  2020-12-07 22:15 ` [PATCH v2 05/13] io_uring: separate ref_list from fixed_rsrc_data Bijan Mottahedeh
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-07 22:15 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Change fixed_file_refs names to fixed_rsrc_refs, and change
set_resource_node() to io_get_fixed_rsrc_ref() to make its function
more clear.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 2b42049..33b2ff6 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -733,7 +733,7 @@ struct io_kiocb {
 	u64				user_data;
 
 	struct io_kiocb			*link;
-	struct percpu_ref		*fixed_file_refs;
+	struct percpu_ref		*fixed_rsrc_refs;
 
 	/*
 	 * 1. used with ctx->iopoll_list with reads/writes
@@ -1060,13 +1060,13 @@ static inline void io_clean_op(struct io_kiocb *req)
 		__io_clean_op(req);
 }
 
-static inline void io_set_resource_node(struct io_kiocb *req)
+static inline void io_get_fixed_rsrc_ref(struct io_kiocb *req)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 
-	if (!req->fixed_file_refs) {
-		req->fixed_file_refs = &ctx->file_data->node->refs;
-		percpu_ref_get(req->fixed_file_refs);
+	if (!req->fixed_rsrc_refs) {
+		req->fixed_rsrc_refs = &ctx->file_data->node->refs;
+		percpu_ref_get(req->fixed_rsrc_refs);
 	}
 }
 
@@ -1964,8 +1964,8 @@ static void io_dismantle_req(struct io_kiocb *req)
 		kfree(req->async_data);
 	if (req->file)
 		io_put_file(req, req->file, (req->flags & REQ_F_FIXED_FILE));
-	if (req->fixed_file_refs)
-		percpu_ref_put(req->fixed_file_refs);
+	if (req->fixed_rsrc_refs)
+		percpu_ref_put(req->fixed_rsrc_refs);
 	io_req_clean_work(req);
 }
 
@@ -6397,7 +6397,7 @@ static struct file *io_file_get(struct io_submit_state *state,
 			return NULL;
 		fd = array_index_nospec(fd, ctx->nr_user_files);
 		file = io_file_from_index(ctx, fd);
-		io_set_resource_node(req);
+		io_get_fixed_rsrc_ref(req);
 	} else {
 		trace_io_uring_file_get(ctx, fd);
 		file = __io_file_get(state, fd);
@@ -6766,7 +6766,7 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	req->ctx = ctx;
 	req->flags = 0;
 	req->link = NULL;
-	req->fixed_file_refs = NULL;
+	req->fixed_rsrc_refs = NULL;
 	/* one is dropped after submission, the other at completion */
 	refcount_set(&req->refs, 2);
 	req->task = current;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 05/13] io_uring: separate ref_list from fixed_rsrc_data
  2020-12-07 22:15 [PATCH v2 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
                   ` (3 preceding siblings ...)
  2020-12-07 22:15 ` [PATCH v2 04/13] io_uring: rename fixed_file variables to fixed_rsrc Bijan Mottahedeh
@ 2020-12-07 22:15 ` Bijan Mottahedeh
  2020-12-07 22:15 ` [PATCH v2 06/13] io_uring: generalize fixed_file_ref_node functionality Bijan Mottahedeh
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-07 22:15 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Uplevel ref_list and make it common to all resources.  This is to
allow one common ref_list to be used for both files, and buffers
in upcoming patches.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c | 77 ++++++++++++++++++++++++++++-------------------------------
 1 file changed, 36 insertions(+), 41 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 33b2ff6..1ed63bc 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -231,8 +231,6 @@ struct fixed_rsrc_data {
 	struct fixed_rsrc_ref_node	*node;
 	struct percpu_ref		refs;
 	struct completion		done;
-	struct list_head		ref_list;
-	spinlock_t			lock;
 };
 
 struct io_buffer {
@@ -398,8 +396,10 @@ struct io_ring_ctx {
 		struct list_head	inflight_list;
 	} ____cacheline_aligned_in_smp;
 
-	struct delayed_work		file_put_work;
-	struct llist_head		file_put_llist;
+	struct delayed_work		rsrc_put_work;
+	struct llist_head		rsrc_put_llist;
+	struct list_head		rsrc_ref_list;
+	spinlock_t			rsrc_ref_lock;
 
 	struct work_struct		exit_work;
 	struct io_restriction		restrictions;
@@ -1024,7 +1024,7 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 static struct file *io_file_get(struct io_submit_state *state,
 				struct io_kiocb *req, int fd, bool fixed);
 static void __io_queue_sqe(struct io_kiocb *req, struct io_comp_state *cs);
-static void io_file_put_work(struct work_struct *work);
+static void io_rsrc_put_work(struct work_struct *work);
 
 static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
 			       struct iovec **iovec, struct iov_iter *iter,
@@ -1325,8 +1325,10 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_LIST_HEAD(&ctx->timeout_list);
 	spin_lock_init(&ctx->inflight_lock);
 	INIT_LIST_HEAD(&ctx->inflight_list);
-	INIT_DELAYED_WORK(&ctx->file_put_work, io_file_put_work);
-	init_llist_head(&ctx->file_put_llist);
+	spin_lock_init(&ctx->rsrc_ref_lock);
+	INIT_LIST_HEAD(&ctx->rsrc_ref_list);
+	INIT_DELAYED_WORK(&ctx->rsrc_put_work, io_rsrc_put_work);
+	init_llist_head(&ctx->rsrc_put_llist);
 	return ctx;
 err:
 	if (ctx->fallback_req)
@@ -7267,16 +7269,16 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
 	if (!data)
 		return -ENXIO;
 
-	spin_lock_bh(&data->lock);
+	spin_lock_bh(&ctx->rsrc_ref_lock);
 	ref_node = data->node;
-	spin_unlock_bh(&data->lock);
+	spin_unlock_bh(&ctx->rsrc_ref_lock);
 	if (ref_node)
 		percpu_ref_kill(&ref_node->refs);
 
 	percpu_ref_kill(&data->refs);
 
 	/* wait for all refs nodes to complete */
-	flush_delayed_work(&ctx->file_put_work);
+	flush_delayed_work(&ctx->rsrc_put_work);
 	wait_for_completion(&data->done);
 
 	__io_sqe_files_unregister(ctx);
@@ -7617,30 +7619,25 @@ static void __io_rsrc_put_work(struct fixed_rsrc_ref_node *ref_node)
 	percpu_ref_put(&rsrc_data->refs);
 }
 
-static void io_rsrc_put_work(struct llist_node *node)
+static void io_rsrc_put_work(struct work_struct *work)
 {
-	struct fixed_rsrc_ref_node *ref_node;
-	struct llist_node *next;
+	struct io_ring_ctx *ctx;
+	struct llist_node *node;
+
+	ctx = container_of(work, struct io_ring_ctx, rsrc_put_work.work);
+	node = llist_del_all(&ctx->rsrc_put_llist);
 
 	while (node) {
-		next = node->next;
+		struct fixed_rsrc_ref_node *ref_node;
+		struct llist_node *next = node->next;
+
 		ref_node = llist_entry(node, struct fixed_rsrc_ref_node, llist);
 		__io_rsrc_put_work(ref_node);
 		node = next;
 	}
 }
 
-static void io_file_put_work(struct work_struct *work)
-{
-	struct io_ring_ctx *ctx;
-	struct llist_node *node;
-
-	ctx = container_of(work, struct io_ring_ctx, file_put_work.work);
-	node = llist_del_all(&ctx->file_put_llist);
-	io_rsrc_put_work(node);
-}
-
-static void io_file_data_ref_zero(struct percpu_ref *ref)
+static void io_rsrc_data_ref_zero(struct percpu_ref *ref)
 {
 	struct fixed_rsrc_ref_node *ref_node;
 	struct fixed_rsrc_data *data;
@@ -7652,27 +7649,27 @@ static void io_file_data_ref_zero(struct percpu_ref *ref)
 	data = ref_node->rsrc_data;
 	ctx = data->ctx;
 
-	spin_lock_bh(&data->lock);
+	spin_lock_bh(&ctx->rsrc_ref_lock);
 	ref_node->done = true;
 
-	while (!list_empty(&data->ref_list)) {
-		ref_node = list_first_entry(&data->ref_list,
+	while (!list_empty(&ctx->rsrc_ref_list)) {
+		ref_node = list_first_entry(&ctx->rsrc_ref_list,
 					struct fixed_rsrc_ref_node, node);
 		/* recycle ref nodes in order */
 		if (!ref_node->done)
 			break;
 		list_del(&ref_node->node);
-		first_add |= llist_add(&ref_node->llist, &ctx->file_put_llist);
+		first_add |= llist_add(&ref_node->llist, &ctx->rsrc_put_llist);
 	}
-	spin_unlock_bh(&data->lock);
+	spin_unlock_bh(&ctx->rsrc_ref_lock);
 
 	if (percpu_ref_is_dying(&data->refs))
 		delay = 0;
 
 	if (!delay)
-		mod_delayed_work(system_wq, &ctx->file_put_work, 0);
+		mod_delayed_work(system_wq, &ctx->rsrc_put_work, 0);
 	else if (first_add)
-		queue_delayed_work(system_wq, &ctx->file_put_work, delay);
+		queue_delayed_work(system_wq, &ctx->rsrc_put_work, delay);
 }
 
 static struct fixed_rsrc_ref_node *alloc_fixed_file_ref_node(
@@ -7684,7 +7681,7 @@ static struct fixed_rsrc_ref_node *alloc_fixed_file_ref_node(
 	if (!ref_node)
 		return ERR_PTR(-ENOMEM);
 
-	if (percpu_ref_init(&ref_node->refs, io_file_data_ref_zero,
+	if (percpu_ref_init(&ref_node->refs, io_rsrc_data_ref_zero,
 			    0, GFP_KERNEL)) {
 		kfree(ref_node);
 		return ERR_PTR(-ENOMEM);
@@ -7725,8 +7722,6 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 		return -ENOMEM;
 	file_data->ctx = ctx;
 	init_completion(&file_data->done);
-	INIT_LIST_HEAD(&file_data->ref_list);
-	spin_lock_init(&file_data->lock);
 
 	nr_tables = DIV_ROUND_UP(nr_args, IORING_MAX_FILES_TABLE);
 	file_data->table = kcalloc(nr_tables, sizeof(*file_data->table),
@@ -7788,9 +7783,9 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 	}
 
 	file_data->node = ref_node;
-	spin_lock_bh(&file_data->lock);
-	list_add_tail(&ref_node->node, &file_data->ref_list);
-	spin_unlock_bh(&file_data->lock);
+	spin_lock_bh(&ctx->rsrc_ref_lock);
+	list_add_tail(&ref_node->node, &ctx->rsrc_ref_list);
+	spin_unlock_bh(&ctx->rsrc_ref_lock);
 	percpu_ref_get(&file_data->refs);
 	return ret;
 out_fput:
@@ -7952,10 +7947,10 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 
 	if (needs_switch) {
 		percpu_ref_kill(&data->node->refs);
-		spin_lock_bh(&data->lock);
-		list_add_tail(&ref_node->node, &data->ref_list);
+		spin_lock_bh(&ctx->rsrc_ref_lock);
+		list_add_tail(&ref_node->node, &ctx->rsrc_ref_list);
 		data->node = ref_node;
-		spin_unlock_bh(&data->lock);
+		spin_unlock_bh(&ctx->rsrc_ref_lock);
 		percpu_ref_get(&ctx->file_data->refs);
 	} else
 		destroy_fixed_file_ref_node(ref_node);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 06/13] io_uring: generalize fixed_file_ref_node functionality
  2020-12-07 22:15 [PATCH v2 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
                   ` (4 preceding siblings ...)
  2020-12-07 22:15 ` [PATCH v2 05/13] io_uring: separate ref_list from fixed_rsrc_data Bijan Mottahedeh
@ 2020-12-07 22:15 ` Bijan Mottahedeh
  2020-12-16 14:53   ` Pavel Begunkov
  2020-12-07 22:15 ` [PATCH v2 07/13] io_uring: add rsrc_ref locking routines Bijan Mottahedeh
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-07 22:15 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Split alloc_fixed_file_ref_node into resource generic/specific parts,
rename destroy_fixed_file_ref_node, and factor out fixed_file_ref_node
switching, to be be leveraged by fixed buffers.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c | 34 +++++++++++++++++++++++++---------
 1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 1ed63bc..126237e 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7672,7 +7672,7 @@ static void io_rsrc_data_ref_zero(struct percpu_ref *ref)
 		queue_delayed_work(system_wq, &ctx->rsrc_put_work, delay);
 }
 
-static struct fixed_rsrc_ref_node *alloc_fixed_file_ref_node(
+static struct fixed_rsrc_ref_node *alloc_fixed_rsrc_ref_node(
 			struct io_ring_ctx *ctx)
 {
 	struct fixed_rsrc_ref_node *ref_node;
@@ -7688,13 +7688,22 @@ static struct fixed_rsrc_ref_node *alloc_fixed_file_ref_node(
 	}
 	INIT_LIST_HEAD(&ref_node->node);
 	INIT_LIST_HEAD(&ref_node->rsrc_list);
+	ref_node->done = false;
+	return ref_node;
+}
+
+static struct fixed_rsrc_ref_node *alloc_fixed_file_ref_node(
+			struct io_ring_ctx *ctx)
+{
+	struct fixed_rsrc_ref_node *ref_node;
+
+	ref_node = alloc_fixed_rsrc_ref_node(ctx);
 	ref_node->rsrc_data = ctx->file_data;
 	ref_node->rsrc_put = io_ring_file_put;
-	ref_node->done = false;
 	return ref_node;
 }
 
-static void destroy_fixed_file_ref_node(struct fixed_rsrc_ref_node *ref_node)
+static void destroy_fixed_rsrc_ref_node(struct fixed_rsrc_ref_node *ref_node)
 {
 	percpu_ref_exit(&ref_node->refs);
 	kfree(ref_node);
@@ -7870,6 +7879,17 @@ static inline int io_queue_file_removal(struct fixed_rsrc_data *data,
 	return io_queue_rsrc_removal(data, (void *)file);
 }
 
+static void switch_fixed_rsrc_ref_node(struct fixed_rsrc_ref_node *ref_node,
+				       struct fixed_rsrc_data *data,
+				       struct io_ring_ctx *ctx)
+{
+	percpu_ref_kill(&data->node->refs);
+	spin_lock_bh(&ctx->rsrc_ref_lock);
+	list_add_tail(&ref_node->node, &ctx->rsrc_ref_list);
+	data->node = ref_node;
+	spin_unlock_bh(&ctx->rsrc_ref_lock);
+}
+
 static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 				 struct io_uring_files_update *up,
 				 unsigned nr_args)
@@ -7946,14 +7966,10 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 	}
 
 	if (needs_switch) {
-		percpu_ref_kill(&data->node->refs);
-		spin_lock_bh(&ctx->rsrc_ref_lock);
-		list_add_tail(&ref_node->node, &ctx->rsrc_ref_list);
-		data->node = ref_node;
-		spin_unlock_bh(&ctx->rsrc_ref_lock);
+		switch_fixed_rsrc_ref_node(ref_node, data, ctx);
 		percpu_ref_get(&ctx->file_data->refs);
 	} else
-		destroy_fixed_file_ref_node(ref_node);
+		destroy_fixed_rsrc_ref_node(ref_node);
 
 	return done ? done : err;
 }
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 07/13] io_uring: add rsrc_ref locking routines
  2020-12-07 22:15 [PATCH v2 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
                   ` (5 preceding siblings ...)
  2020-12-07 22:15 ` [PATCH v2 06/13] io_uring: generalize fixed_file_ref_node functionality Bijan Mottahedeh
@ 2020-12-07 22:15 ` Bijan Mottahedeh
  2020-12-07 22:15 ` [PATCH v2 08/13] io_uring: implement fixed buffers registration similar to fixed files Bijan Mottahedeh
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-07 22:15 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Encapsulate resource reference locking into separate routines.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 126237e..25b575e 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7260,6 +7260,16 @@ static void io_rsrc_ref_kill(struct percpu_ref *ref)
 	complete(&data->done);
 }
 
+static inline void io_rsrc_ref_lock(struct io_ring_ctx *ctx)
+{
+	spin_lock_bh(&ctx->rsrc_ref_lock);
+}
+
+static inline void io_rsrc_ref_unlock(struct io_ring_ctx *ctx)
+{
+	spin_unlock_bh(&ctx->rsrc_ref_lock);
+}
+
 static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
 {
 	struct fixed_rsrc_data *data = ctx->file_data;
@@ -7269,9 +7279,9 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
 	if (!data)
 		return -ENXIO;
 
-	spin_lock_bh(&ctx->rsrc_ref_lock);
+	io_rsrc_ref_lock(ctx);
 	ref_node = data->node;
-	spin_unlock_bh(&ctx->rsrc_ref_lock);
+	io_rsrc_ref_unlock(ctx);
 	if (ref_node)
 		percpu_ref_kill(&ref_node->refs);
 
@@ -7649,7 +7659,7 @@ static void io_rsrc_data_ref_zero(struct percpu_ref *ref)
 	data = ref_node->rsrc_data;
 	ctx = data->ctx;
 
-	spin_lock_bh(&ctx->rsrc_ref_lock);
+	io_rsrc_ref_lock(ctx);
 	ref_node->done = true;
 
 	while (!list_empty(&ctx->rsrc_ref_list)) {
@@ -7661,7 +7671,7 @@ static void io_rsrc_data_ref_zero(struct percpu_ref *ref)
 		list_del(&ref_node->node);
 		first_add |= llist_add(&ref_node->llist, &ctx->rsrc_put_llist);
 	}
-	spin_unlock_bh(&ctx->rsrc_ref_lock);
+	io_rsrc_ref_unlock(ctx);
 
 	if (percpu_ref_is_dying(&data->refs))
 		delay = 0;
@@ -7792,9 +7802,9 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 	}
 
 	file_data->node = ref_node;
-	spin_lock_bh(&ctx->rsrc_ref_lock);
+	io_rsrc_ref_lock(ctx);
 	list_add_tail(&ref_node->node, &ctx->rsrc_ref_list);
-	spin_unlock_bh(&ctx->rsrc_ref_lock);
+	io_rsrc_ref_unlock(ctx);
 	percpu_ref_get(&file_data->refs);
 	return ret;
 out_fput:
@@ -7884,10 +7894,10 @@ static void switch_fixed_rsrc_ref_node(struct fixed_rsrc_ref_node *ref_node,
 				       struct io_ring_ctx *ctx)
 {
 	percpu_ref_kill(&data->node->refs);
-	spin_lock_bh(&ctx->rsrc_ref_lock);
+	io_rsrc_ref_lock(ctx);
 	list_add_tail(&ref_node->node, &ctx->rsrc_ref_list);
 	data->node = ref_node;
-	spin_unlock_bh(&ctx->rsrc_ref_lock);
+	io_rsrc_ref_unlock(ctx);
 }
 
 static int __io_sqe_files_update(struct io_ring_ctx *ctx,
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 08/13] io_uring: implement fixed buffers registration similar to fixed files
  2020-12-07 22:15 [PATCH v2 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
                   ` (6 preceding siblings ...)
  2020-12-07 22:15 ` [PATCH v2 07/13] io_uring: add rsrc_ref locking routines Bijan Mottahedeh
@ 2020-12-07 22:15 ` Bijan Mottahedeh
  2020-12-16 14:58   ` Pavel Begunkov
  2020-12-16 14:59   ` Pavel Begunkov
  2020-12-07 22:15 ` [PATCH v2 09/13] io_uring: create common fixed_rsrc_ref_node handling routines Bijan Mottahedeh
                   ` (6 subsequent siblings)
  14 siblings, 2 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-07 22:15 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Apply fixed_rsrc functionality for fixed buffers support.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c | 240 +++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 204 insertions(+), 36 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 25b575e..a51d1e6 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -104,6 +104,14 @@
 #define IORING_MAX_RESTRICTIONS	(IORING_RESTRICTION_LAST + \
 				 IORING_REGISTER_LAST + IORING_OP_LAST)
 
+/*
+ * Shift of 7 is 128 entries, or exactly one page on 64-bit archs
+ */
+#define IORING_BUF_TABLE_SHIFT	7	/* struct io_mapped_ubuf */
+#define IORING_MAX_BUFS_TABLE	(1U << IORING_BUF_TABLE_SHIFT)
+#define IORING_BUF_TABLE_MASK	(IORING_MAX_BUFS_TABLE - 1)
+#define IORING_MAX_FIXED_BUFS	UIO_MAXIOV
+
 struct io_uring {
 	u32 head ____cacheline_aligned_in_smp;
 	u32 tail ____cacheline_aligned_in_smp;
@@ -336,8 +344,8 @@ struct io_ring_ctx {
 	unsigned		nr_user_files;
 
 	/* if used, fixed mapped user buffers */
+	struct fixed_rsrc_data	*buf_data;
 	unsigned		nr_user_bufs;
-	struct io_mapped_ubuf	*user_bufs;
 
 	struct user_struct	*user;
 
@@ -2962,6 +2970,15 @@ static void kiocb_done(struct kiocb *kiocb, ssize_t ret,
 		io_rw_done(kiocb, ret);
 }
 
+static inline struct io_mapped_ubuf *io_buf_from_index(struct io_ring_ctx *ctx,
+						       int index)
+{
+	struct fixed_rsrc_table *table;
+
+	table = &ctx->buf_data->table[index >> IORING_BUF_TABLE_SHIFT];
+	return &table->bufs[index & IORING_BUF_TABLE_MASK];
+}
+
 static ssize_t io_import_fixed(struct io_kiocb *req, int rw,
 			       struct iov_iter *iter)
 {
@@ -2975,7 +2992,7 @@ static ssize_t io_import_fixed(struct io_kiocb *req, int rw,
 	if (unlikely(buf_index >= ctx->nr_user_bufs))
 		return -EFAULT;
 	index = array_index_nospec(buf_index, ctx->nr_user_bufs);
-	imu = &ctx->user_bufs[index];
+	imu = io_buf_from_index(ctx, index);
 	buf_addr = req->rw.addr;
 
 	/* overflow */
@@ -8296,28 +8313,71 @@ static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
 	return pages;
 }
 
-static int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
+static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
 {
-	int i, j;
+	unsigned i;
 
-	if (!ctx->user_bufs)
-		return -ENXIO;
+	for (i = 0; i < imu->nr_bvecs; i++)
+		unpin_user_page(imu->bvec[i].bv_page);
 
-	for (i = 0; i < ctx->nr_user_bufs; i++) {
-		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
+	if (imu->acct_pages)
+		io_unaccount_mem(ctx, imu->nr_bvecs, ACCT_PINNED);
+	kvfree(imu->bvec);
+	imu->nr_bvecs = 0;
+}
 
-		for (j = 0; j < imu->nr_bvecs; j++)
-			unpin_user_page(imu->bvec[j].bv_page);
+static void io_buffers_unmap(struct io_ring_ctx *ctx)
+{
+	unsigned i;
+	struct io_mapped_ubuf *imu;
 
-		if (imu->acct_pages)
-			io_unaccount_mem(ctx, imu->acct_pages, ACCT_PINNED);
-		kvfree(imu->bvec);
-		imu->nr_bvecs = 0;
+	for (i = 0; i < ctx->nr_user_bufs; i++) {
+		imu = io_buf_from_index(ctx, i);
+		io_buffer_unmap(ctx, imu);
 	}
+}
+
+static void io_buffers_map_free(struct io_ring_ctx *ctx)
+{
+	struct fixed_rsrc_data *data = ctx->buf_data;
+	unsigned nr_tables, i;
+
+	if (!data)
+		return;
 
-	kfree(ctx->user_bufs);
-	ctx->user_bufs = NULL;
+	nr_tables = DIV_ROUND_UP(ctx->nr_user_bufs, IORING_MAX_BUFS_TABLE);
+	for (i = 0; i < nr_tables; i++)
+		kfree(data->table[i].bufs);
+	kfree(data->table);
+	percpu_ref_exit(&data->refs);
+	kfree(data);
+	ctx->buf_data = NULL;
 	ctx->nr_user_bufs = 0;
+}
+
+static int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
+{
+	struct fixed_rsrc_data *data = ctx->buf_data;
+	struct fixed_rsrc_ref_node *ref_node = NULL;
+
+	if (!data)
+		return -ENXIO;
+
+	io_rsrc_ref_lock(ctx);
+	ref_node = data->node;
+	io_rsrc_ref_unlock(ctx);
+	if (ref_node)
+		percpu_ref_kill(&ref_node->refs);
+
+	percpu_ref_kill(&data->refs);
+
+	/* wait for all refs nodes to complete */
+	flush_delayed_work(&ctx->rsrc_put_work);
+	wait_for_completion(&data->done);
+
+	io_buffers_unmap(ctx);
+	io_buffers_map_free(ctx);
+
 	return 0;
 }
 
@@ -8370,7 +8430,13 @@ static bool headpage_already_acct(struct io_ring_ctx *ctx, struct page **pages,
 
 	/* check previously registered pages */
 	for (i = 0; i < ctx->nr_user_bufs; i++) {
-		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
+		struct fixed_rsrc_table *table;
+		struct io_mapped_ubuf *imu;
+		unsigned index;
+
+		table = &ctx->buf_data->table[i >> IORING_BUF_TABLE_SHIFT];
+		index = i & IORING_BUF_TABLE_MASK;
+		imu = &table->bufs[index];
 
 		for (j = 0; j < imu->nr_bvecs; j++) {
 			if (!PageCompound(imu->bvec[j].bv_page))
@@ -8505,19 +8571,80 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 	return ret;
 }
 
-static int io_buffers_map_alloc(struct io_ring_ctx *ctx, unsigned int nr_args)
+static void io_free_buf_tables(struct fixed_rsrc_data *buf_data,
+			       unsigned nr_tables)
 {
-	if (ctx->user_bufs)
-		return -EBUSY;
-	if (!nr_args || nr_args > UIO_MAXIOV)
-		return -EINVAL;
+	int i;
 
-	ctx->user_bufs = kcalloc(nr_args, sizeof(struct io_mapped_ubuf),
-					GFP_KERNEL);
-	if (!ctx->user_bufs)
-		return -ENOMEM;
+	for (i = 0; i < nr_tables; i++) {
+		struct fixed_rsrc_table *table = &buf_data->table[i];
+		kfree(table->bufs);
+	}
+}
 
-	return 0;
+static int io_alloc_buf_tables(struct fixed_rsrc_data *buf_data,
+			       unsigned nr_tables, unsigned nr_bufs)
+{
+	int i;
+		
+	for (i = 0; i < nr_tables; i++) {
+		struct fixed_rsrc_table *table = &buf_data->table[i];
+		unsigned this_bufs;
+
+		this_bufs = min(nr_bufs, IORING_MAX_BUFS_TABLE);
+		table->bufs = kcalloc(this_bufs, sizeof(struct io_mapped_ubuf),
+				      GFP_KERNEL);
+		if (!table->bufs)
+			break;
+		nr_bufs -= this_bufs;
+	}
+
+	if (i == nr_tables)
+		return 0;
+
+	io_free_buf_tables(buf_data, nr_tables);
+	return 1;
+}
+
+static struct fixed_rsrc_data *io_buffers_map_alloc(struct io_ring_ctx *ctx,
+						    unsigned int nr_args)
+{
+	unsigned nr_tables;
+	struct fixed_rsrc_data *buf_data;
+	int ret = -ENOMEM;
+
+	if (ctx->buf_data)
+		return ERR_PTR(-EBUSY);
+	if (!nr_args || nr_args > IORING_MAX_FIXED_BUFS)
+		return ERR_PTR(-EINVAL);
+
+	buf_data = kzalloc(sizeof(*ctx->buf_data), GFP_KERNEL);
+	if (!buf_data)
+		return ERR_PTR(-ENOMEM);
+	buf_data->ctx = ctx;
+	init_completion(&buf_data->done);
+
+	nr_tables = DIV_ROUND_UP(nr_args, IORING_MAX_BUFS_TABLE);
+	buf_data->table = kcalloc(nr_tables, sizeof(*buf_data->table),
+				  GFP_KERNEL);
+	if (!buf_data->table)
+		goto out_free;
+
+	if (percpu_ref_init(&buf_data->refs, io_rsrc_ref_kill,
+			    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL))
+		goto out_free;
+
+	if (io_alloc_buf_tables(buf_data, nr_tables, nr_args))
+		goto out_ref;
+
+	return buf_data;
+
+out_ref:
+	percpu_ref_exit(&buf_data->refs);
+out_free:
+	kfree(buf_data->table);
+	kfree(buf_data);
+	return ERR_PTR(ret);
 }
 
 static int io_buffer_validate(struct iovec *iov)
@@ -8537,39 +8664,79 @@ static int io_buffer_validate(struct iovec *iov)
 	return 0;
 }
 
+static void io_ring_buf_put(struct io_ring_ctx *ctx, struct io_rsrc_put *prsrc)
+{
+	io_buffer_unmap(ctx, prsrc->buf);
+}
+
+static struct fixed_rsrc_ref_node *alloc_fixed_buf_ref_node(
+			struct io_ring_ctx *ctx)
+{
+	struct fixed_rsrc_ref_node *ref_node;
+
+	ref_node = alloc_fixed_rsrc_ref_node(ctx);
+	ref_node->rsrc_data = ctx->buf_data;
+	ref_node->rsrc_put = io_ring_buf_put;
+	return ref_node;
+}
+
 static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 				   unsigned int nr_args)
 {
 	int i, ret;
 	struct iovec iov;
 	struct page *last_hpage = NULL;
+	struct fixed_rsrc_ref_node *ref_node;
+	struct fixed_rsrc_data *buf_data;
 
-	ret = io_buffers_map_alloc(ctx, nr_args);
-	if (ret)
-		return ret;
+	buf_data = io_buffers_map_alloc(ctx, nr_args);
+	if (IS_ERR(buf_data))
+		return PTR_ERR(buf_data);
 
-	for (i = 0; i < nr_args; i++) {
-		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
+	for (i = 0; i < nr_args; i++, ctx->nr_user_bufs++) {
+		struct fixed_rsrc_table *table;
+		struct io_mapped_ubuf *imu;
+		unsigned index;
 
 		ret = io_copy_iov(ctx, &iov, arg, i);
 		if (ret)
 			break;
 
+		/* allow sparse sets */
+		if (!iov.iov_base && !iov.iov_len)
+			continue;
+
 		ret = io_buffer_validate(&iov);
 		if (ret)
 			break;
 
+		table = &buf_data->table[i >> IORING_BUF_TABLE_SHIFT];
+		index = i & IORING_BUF_TABLE_MASK;
+		imu = &table->bufs[index];
+
 		ret = io_sqe_buffer_register(ctx, &iov, imu, &last_hpage);
 		if (ret)
 			break;
+	}
 
-		ctx->nr_user_bufs++;
+	ctx->buf_data = buf_data;
+	if (ret) {
+		io_sqe_buffers_unregister(ctx);
+		return ret;
 	}
 
-	if (ret)
+	ref_node = alloc_fixed_buf_ref_node(ctx);
+	if (IS_ERR(ref_node)) {
 		io_sqe_buffers_unregister(ctx);
+		return PTR_ERR(ref_node);
+	}
 
-	return ret;
+	buf_data->node = ref_node;
+	io_rsrc_ref_lock(ctx);
+	list_add(&ref_node->node, &ctx->rsrc_ref_list);
+	io_rsrc_ref_unlock(ctx);
+	percpu_ref_get(&buf_data->refs);
+	return 0;
 }
 
 static int io_eventfd_register(struct io_ring_ctx *ctx, void __user *arg)
@@ -9348,7 +9515,7 @@ static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m)
 	}
 	seq_printf(m, "UserBufs:\t%u\n", ctx->nr_user_bufs);
 	for (i = 0; has_lock && i < ctx->nr_user_bufs; i++) {
-		struct io_mapped_ubuf *buf = &ctx->user_bufs[i];
+		struct io_mapped_ubuf *buf = io_buf_from_index(ctx, i);
 
 		seq_printf(m, "%5u: 0x%llx/%u\n", i, buf->ubuf,
 						(unsigned int) buf->len);
@@ -9851,6 +10018,7 @@ static bool io_register_op_must_quiesce(int op)
 	switch (op) {
 	case IORING_UNREGISTER_FILES:
 	case IORING_REGISTER_FILES_UPDATE:
+	case IORING_UNREGISTER_BUFFERS:
 	case IORING_REGISTER_PROBE:
 	case IORING_REGISTER_PERSONALITY:
 	case IORING_UNREGISTER_PERSONALITY:
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 09/13] io_uring: create common fixed_rsrc_ref_node handling routines
  2020-12-07 22:15 [PATCH v2 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
                   ` (7 preceding siblings ...)
  2020-12-07 22:15 ` [PATCH v2 08/13] io_uring: implement fixed buffers registration similar to fixed files Bijan Mottahedeh
@ 2020-12-07 22:15 ` Bijan Mottahedeh
  2020-12-07 22:15 ` [PATCH v2 10/13] io_uring: generalize files_update functionlity to rsrc_update Bijan Mottahedeh
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-07 22:15 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Create common routines to be used for both files/buffers registration.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c | 66 +++++++++++++++++++++++++++--------------------------------
 1 file changed, 30 insertions(+), 36 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index a51d1e6..384ff3c 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7287,14 +7287,10 @@ static inline void io_rsrc_ref_unlock(struct io_ring_ctx *ctx)
 	spin_unlock_bh(&ctx->rsrc_ref_lock);
 }
 
-static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
+static void io_rsrc_ref_quiesce(struct fixed_rsrc_data *data,
+				struct io_ring_ctx *ctx)
 {
-	struct fixed_rsrc_data *data = ctx->file_data;
 	struct fixed_rsrc_ref_node *ref_node = NULL;
-	unsigned nr_tables, i;
-
-	if (!data)
-		return -ENXIO;
 
 	io_rsrc_ref_lock(ctx);
 	ref_node = data->node;
@@ -7307,6 +7303,17 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
 	/* wait for all refs nodes to complete */
 	flush_delayed_work(&ctx->rsrc_put_work);
 	wait_for_completion(&data->done);
+}
+
+static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
+{
+	struct fixed_rsrc_data *data = ctx->file_data;
+	unsigned nr_tables, i;
+
+	if (!data)
+		return -ENXIO;
+
+	io_rsrc_ref_quiesce(data, ctx);
 
 	__io_sqe_files_unregister(ctx);
 	nr_tables = DIV_ROUND_UP(ctx->nr_user_files, IORING_MAX_FILES_TABLE);
@@ -7736,6 +7743,17 @@ static void destroy_fixed_rsrc_ref_node(struct fixed_rsrc_ref_node *ref_node)
 	kfree(ref_node);
 }
 
+static void add_fixed_rsrc_ref_node(struct fixed_rsrc_ref_node *ref_node,
+				    struct fixed_rsrc_data *data,
+				    struct io_ring_ctx *ctx)
+{
+	io_rsrc_ref_lock(ctx);
+	list_add_tail(&ref_node->node, &ctx->rsrc_ref_list);
+	data->node = ref_node;
+	io_rsrc_ref_unlock(ctx);
+	percpu_ref_get(&data->refs);
+}
+
 static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 				 unsigned nr_args)
 {
@@ -7818,11 +7836,7 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 		return PTR_ERR(ref_node);
 	}
 
-	file_data->node = ref_node;
-	io_rsrc_ref_lock(ctx);
-	list_add_tail(&ref_node->node, &ctx->rsrc_ref_list);
-	io_rsrc_ref_unlock(ctx);
-	percpu_ref_get(&file_data->refs);
+	add_fixed_rsrc_ref_node(ref_node, file_data, ctx);
 	return ret;
 out_fput:
 	for (i = 0; i < ctx->nr_user_files; i++) {
@@ -7911,10 +7925,7 @@ static void switch_fixed_rsrc_ref_node(struct fixed_rsrc_ref_node *ref_node,
 				       struct io_ring_ctx *ctx)
 {
 	percpu_ref_kill(&data->node->refs);
-	io_rsrc_ref_lock(ctx);
-	list_add_tail(&ref_node->node, &ctx->rsrc_ref_list);
-	data->node = ref_node;
-	io_rsrc_ref_unlock(ctx);
+	add_fixed_rsrc_ref_node(ref_node, data, ctx);
 }
 
 static int __io_sqe_files_update(struct io_ring_ctx *ctx,
@@ -7992,10 +8003,9 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 		up->offset++;
 	}
 
-	if (needs_switch) {
+	if (needs_switch)
 		switch_fixed_rsrc_ref_node(ref_node, data, ctx);
-		percpu_ref_get(&ctx->file_data->refs);
-	} else
+	else
 		destroy_fixed_rsrc_ref_node(ref_node);
 
 	return done ? done : err;
@@ -8358,23 +8368,11 @@ static void io_buffers_map_free(struct io_ring_ctx *ctx)
 static int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
 {
 	struct fixed_rsrc_data *data = ctx->buf_data;
-	struct fixed_rsrc_ref_node *ref_node = NULL;
 
 	if (!data)
 		return -ENXIO;
 
-	io_rsrc_ref_lock(ctx);
-	ref_node = data->node;
-	io_rsrc_ref_unlock(ctx);
-	if (ref_node)
-		percpu_ref_kill(&ref_node->refs);
-
-	percpu_ref_kill(&data->refs);
-
-	/* wait for all refs nodes to complete */
-	flush_delayed_work(&ctx->rsrc_put_work);
-	wait_for_completion(&data->done);
-
+	io_rsrc_ref_quiesce(data, ctx);
 	io_buffers_unmap(ctx);
 	io_buffers_map_free(ctx);
 
@@ -8731,11 +8729,7 @@ static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 		return PTR_ERR(ref_node);
 	}
 
-	buf_data->node = ref_node;
-	io_rsrc_ref_lock(ctx);
-	list_add(&ref_node->node, &ctx->rsrc_ref_list);
-	io_rsrc_ref_unlock(ctx);
-	percpu_ref_get(&buf_data->refs);
+	add_fixed_rsrc_ref_node(ref_node, buf_data, ctx);
 	return 0;
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 10/13] io_uring: generalize files_update functionlity to rsrc_update
  2020-12-07 22:15 [PATCH v2 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
                   ` (8 preceding siblings ...)
  2020-12-07 22:15 ` [PATCH v2 09/13] io_uring: create common fixed_rsrc_ref_node handling routines Bijan Mottahedeh
@ 2020-12-07 22:15 ` Bijan Mottahedeh
  2020-12-07 22:15 ` [PATCH v2 11/13] io_uring: support buffer registration updates Bijan Mottahedeh
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-07 22:15 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Generalize files_update functionality to rsrc_update in order to leverage
it for buffers updates.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c                 | 45 ++++++++++++++++++++++++++-----------------
 include/uapi/linux/io_uring.h | 10 ++++++++++
 2 files changed, 37 insertions(+), 18 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 384ff3c..1108682 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -517,7 +517,7 @@ struct io_open {
 	unsigned long			nofile;
 };
 
-struct io_files_update {
+struct io_rsrc_update {
 	struct file			*file;
 	u64				arg;
 	u32				nr_args;
@@ -711,7 +711,7 @@ struct io_kiocb {
 		struct io_sr_msg	sr_msg;
 		struct io_open		open;
 		struct io_close		close;
-		struct io_files_update	files_update;
+		struct io_rsrc_update	rsrc_update;
 		struct io_fadvise	fadvise;
 		struct io_madvise	madvise;
 		struct io_epoll		epoll;
@@ -1026,7 +1026,7 @@ static void __io_complete_rw(struct io_kiocb *req, long res, long res2,
 static void __io_queue_linked_timeout(struct io_kiocb *req);
 static void io_queue_linked_timeout(struct io_kiocb *req);
 static int __io_sqe_files_update(struct io_ring_ctx *ctx,
-				 struct io_uring_files_update *ip,
+				 struct io_uring_rsrc_update *ip,
 				 unsigned nr_args);
 static void __io_clean_op(struct io_kiocb *req);
 static struct file *io_file_get(struct io_submit_state *state,
@@ -5962,8 +5962,8 @@ static int io_async_cancel(struct io_kiocb *req)
 	return 0;
 }
 
-static int io_files_update_prep(struct io_kiocb *req,
-				const struct io_uring_sqe *sqe)
+static int io_rsrc_update_prep(struct io_kiocb *req,
+			       const struct io_uring_sqe *sqe)
 {
 	if (unlikely(req->ctx->flags & IORING_SETUP_SQPOLL))
 		return -EINVAL;
@@ -5972,29 +5972,32 @@ static int io_files_update_prep(struct io_kiocb *req,
 	if (sqe->ioprio || sqe->rw_flags)
 		return -EINVAL;
 
-	req->files_update.offset = READ_ONCE(sqe->off);
-	req->files_update.nr_args = READ_ONCE(sqe->len);
-	if (!req->files_update.nr_args)
+	req->rsrc_update.offset = READ_ONCE(sqe->off);
+	req->rsrc_update.nr_args = READ_ONCE(sqe->len);
+	if (!req->rsrc_update.nr_args)
 		return -EINVAL;
-	req->files_update.arg = READ_ONCE(sqe->addr);
+	req->rsrc_update.arg = READ_ONCE(sqe->addr);
 	return 0;
 }
 
-static int io_files_update(struct io_kiocb *req, bool force_nonblock,
-			   struct io_comp_state *cs)
+static int io_rsrc_update(struct io_kiocb *req, bool force_nonblock,
+			  struct io_comp_state *cs,
+			  int (*update)(struct io_ring_ctx *ctx,
+					struct io_uring_rsrc_update *up,
+					unsigned nr_args))
 {
 	struct io_ring_ctx *ctx = req->ctx;
-	struct io_uring_files_update up;
+	struct io_uring_rsrc_update up;
 	int ret;
 
 	if (force_nonblock)
 		return -EAGAIN;
 
-	up.offset = req->files_update.offset;
-	up.fds = req->files_update.arg;
+	up.offset = req->rsrc_update.offset;
+	up.rsrc = req->rsrc_update.arg;
 
 	mutex_lock(&ctx->uring_lock);
-	ret = __io_sqe_files_update(ctx, &up, req->files_update.nr_args);
+	ret = (*update)(ctx, &up, req->rsrc_update.nr_args);
 	mutex_unlock(&ctx->uring_lock);
 
 	if (ret < 0)
@@ -6003,6 +6006,12 @@ static int io_files_update(struct io_kiocb *req, bool force_nonblock,
 	return 0;
 }
 
+static int io_files_update(struct io_kiocb *req, bool force_nonblock,
+			   struct io_comp_state *cs)
+{
+	return io_rsrc_update(req, force_nonblock, cs, __io_sqe_files_update);
+}
+
 static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	switch (req->opcode) {
@@ -6049,7 +6058,7 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	case IORING_OP_CLOSE:
 		return io_close_prep(req, sqe);
 	case IORING_OP_FILES_UPDATE:
-		return io_files_update_prep(req, sqe);
+		return io_rsrc_update_prep(req, sqe);
 	case IORING_OP_STATX:
 		return io_statx_prep(req, sqe);
 	case IORING_OP_FADVISE:
@@ -7929,7 +7938,7 @@ static void switch_fixed_rsrc_ref_node(struct fixed_rsrc_ref_node *ref_node,
 }
 
 static int __io_sqe_files_update(struct io_ring_ctx *ctx,
-				 struct io_uring_files_update *up,
+				 struct io_uring_rsrc_update *up,
 				 unsigned nr_args)
 {
 	struct fixed_rsrc_data *data = ctx->file_data;
@@ -8014,7 +8023,7 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 static int io_sqe_files_update(struct io_ring_ctx *ctx, void __user *arg,
 			       unsigned nr_args)
 {
-	struct io_uring_files_update up;
+	struct io_uring_rsrc_update up;
 
 	if (!ctx->file_data)
 		return -ENXIO;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index d31a2a1..ba5deff 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -291,6 +291,16 @@ struct io_uring_files_update {
 	__aligned_u64 /* __s32 * */ fds;
 };
 
+struct io_uring_rsrc_update {
+	__u32 offset;
+	__u32 resv;
+	union {
+		__aligned_u64 /* __s32 * */ fds;
+		__aligned_u64 /* __s32 * */ iovs;
+		__aligned_u64 /* __s32 * */ rsrc;
+	};
+};
+
 #define IO_URING_OP_SUPPORTED	(1U << 0)
 
 struct io_uring_probe_op {
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 11/13] io_uring: support buffer registration updates
  2020-12-07 22:15 [PATCH v2 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
                   ` (9 preceding siblings ...)
  2020-12-07 22:15 ` [PATCH v2 10/13] io_uring: generalize files_update functionlity to rsrc_update Bijan Mottahedeh
@ 2020-12-07 22:15 ` Bijan Mottahedeh
  2020-12-07 22:15 ` [PATCH v2 12/13] io_uring: create common fixed_rsrc_data allocation routines Bijan Mottahedeh
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-07 22:15 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Introduce IORING_REGISTER_BUFFERS_UPDATE and IORING_OP_BUFFERS_UPDATE,
consistent with file registration update.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c                 | 122 +++++++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/io_uring.h |   8 +--
 2 files changed, 123 insertions(+), 7 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 1108682..b7a1f65 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1009,6 +1009,9 @@ struct io_op_def {
 		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_FILES |
 						IO_WQ_WORK_FS | IO_WQ_WORK_BLKCG,
 	},
+	[IORING_OP_BUFFERS_UPDATE] = {
+		.work_flags		= IO_WQ_WORK_MM,
+	},
 };
 
 enum io_mem_account {
@@ -1028,6 +1031,9 @@ static void __io_complete_rw(struct io_kiocb *req, long res, long res2,
 static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 				 struct io_uring_rsrc_update *ip,
 				 unsigned nr_args);
+static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
+				   struct io_uring_rsrc_update *up,
+				   unsigned nr_args);
 static void __io_clean_op(struct io_kiocb *req);
 static struct file *io_file_get(struct io_submit_state *state,
 				struct io_kiocb *req, int fd, bool fixed);
@@ -6012,6 +6018,12 @@ static int io_files_update(struct io_kiocb *req, bool force_nonblock,
 	return io_rsrc_update(req, force_nonblock, cs, __io_sqe_files_update);
 }
 
+static int io_buffers_update(struct io_kiocb *req, bool force_nonblock,
+			     struct io_comp_state *cs)
+{
+	return io_rsrc_update(req, force_nonblock, cs, __io_sqe_buffers_update);
+}
+
 static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	switch (req->opcode) {
@@ -6083,11 +6095,13 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return io_renameat_prep(req, sqe);
 	case IORING_OP_UNLINKAT:
 		return io_unlinkat_prep(req, sqe);
+	case IORING_OP_BUFFERS_UPDATE:
+		return io_rsrc_update_prep(req, sqe);
 	}
 
 	printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n",
 			req->opcode);
-	return-EINVAL;
+	return -EINVAL;
 }
 
 static int io_req_defer_prep(struct io_kiocb *req,
@@ -6342,6 +6356,9 @@ static int io_issue_sqe(struct io_kiocb *req, bool force_nonblock,
 	case IORING_OP_UNLINKAT:
 		ret = io_unlinkat(req, force_nonblock);
 		break;
+	case IORING_OP_BUFFERS_UPDATE:
+		ret = io_buffers_update(req, force_nonblock, cs);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -8342,6 +8359,7 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
 	if (imu->acct_pages)
 		io_unaccount_mem(ctx, imu->nr_bvecs, ACCT_PINNED);
 	kvfree(imu->bvec);
+	imu->bvec = NULL;
 	imu->nr_bvecs = 0;
 }
 
@@ -8545,6 +8563,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 		if (pret > 0)
 			unpin_user_pages(pages, pret);
 		kvfree(imu->bvec);
+		imu->bvec = NULL;
 		goto done;
 	}
 
@@ -8674,6 +8693,8 @@ static int io_buffer_validate(struct iovec *iov)
 static void io_ring_buf_put(struct io_ring_ctx *ctx, struct io_rsrc_put *prsrc)
 {
 	io_buffer_unmap(ctx, prsrc->buf);
+	kvfree(prsrc->buf);
+	prsrc->buf = NULL;
 }
 
 static struct fixed_rsrc_ref_node *alloc_fixed_buf_ref_node(
@@ -8742,6 +8763,101 @@ static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 	return 0;
 }
 
+static inline int io_queue_buffer_removal(struct fixed_rsrc_data *data,
+					  struct io_mapped_ubuf *imu)
+{
+	return io_queue_rsrc_removal(data, (void *)imu);
+}
+
+static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
+				   struct io_uring_rsrc_update *up,
+				   unsigned nr_args)
+{
+	struct fixed_rsrc_data *data = ctx->buf_data;
+	struct fixed_rsrc_ref_node *ref_node;
+	struct io_mapped_ubuf *imu;
+	struct iovec iov;
+	struct iovec __user *iovs;
+	struct page *last_hpage = NULL;
+	__u32 done;
+	int i, err;
+	bool needs_switch = false;
+
+	if (check_add_overflow(up->offset, nr_args, &done))
+		return -EOVERFLOW;
+	if (done > ctx->nr_user_bufs)
+		return -EINVAL;
+
+	ref_node = alloc_fixed_buf_ref_node(ctx);
+	if (IS_ERR(ref_node))
+		return PTR_ERR(ref_node);
+
+	done = 0;
+	iovs = u64_to_user_ptr(up->iovs);
+	while (nr_args) {
+		struct fixed_rsrc_table *table;
+		unsigned index;
+
+		err = 0;
+		if (copy_from_user(&iov, &iovs[done], sizeof(iov))) {
+			err = -EFAULT;
+			break;
+		}
+		i = array_index_nospec(up->offset, ctx->nr_user_bufs);
+		table = &ctx->buf_data->table[i >> IORING_BUF_TABLE_SHIFT];
+		index = i & IORING_BUF_TABLE_MASK;
+		imu = &table->bufs[index];
+		if (table->bufs[index].ubuf) {
+			struct io_mapped_ubuf *dup;
+			dup = kmemdup(imu, sizeof(*imu), GFP_KERNEL);
+			if (!dup) {
+				err = -ENOMEM;
+				break;
+			}
+			err = io_queue_buffer_removal(data, dup);
+			if (err)
+				break;
+			memset(imu, 0, sizeof(*imu));
+			needs_switch = true;
+		}
+		if (!io_buffer_validate(&iov)) {
+			err = io_sqe_buffer_register(ctx, &iov, imu,
+						     &last_hpage);
+			if (err) {
+				memset(imu, 0, sizeof(*imu));
+				break;
+			}
+		}
+		nr_args--;
+		done++;
+		up->offset++;
+	}
+
+	if (needs_switch)
+		switch_fixed_rsrc_ref_node(ref_node, data, ctx);
+	else
+		destroy_fixed_rsrc_ref_node(ref_node);
+
+	return done ? done : err;
+}
+
+static int io_sqe_buffers_update(struct io_ring_ctx *ctx, void __user *arg,
+				 unsigned nr_args)
+{
+	struct io_uring_rsrc_update up;
+
+	if (!ctx->buf_data)
+		return -ENXIO;
+	if (!nr_args)
+		return -EINVAL;
+	if (copy_from_user(&up, arg, sizeof(up)))
+		return -EFAULT;
+	if (up.resv)
+		return -EINVAL;
+
+	return __io_sqe_buffers_update(ctx, &up, nr_args);
+}
+
 static int io_eventfd_register(struct io_ring_ctx *ctx, void __user *arg)
 {
 	__s32 __user *fds = arg;
@@ -10022,6 +10138,7 @@ static bool io_register_op_must_quiesce(int op)
 	case IORING_UNREGISTER_FILES:
 	case IORING_REGISTER_FILES_UPDATE:
 	case IORING_UNREGISTER_BUFFERS:
+	case IORING_REGISTER_BUFFERS_UPDATE:
 	case IORING_REGISTER_PROBE:
 	case IORING_REGISTER_PERSONALITY:
 	case IORING_UNREGISTER_PERSONALITY:
@@ -10097,6 +10214,9 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			break;
 		ret = io_sqe_buffers_unregister(ctx);
 		break;
+	case IORING_REGISTER_BUFFERS_UPDATE:
+		ret = io_sqe_buffers_update(ctx, arg, nr_args);
+		break;
 	case IORING_REGISTER_FILES:
 		ret = io_sqe_files_register(ctx, arg, nr_args);
 		break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index ba5deff..0d9ac12 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -137,6 +137,7 @@ enum {
 	IORING_OP_SHUTDOWN,
 	IORING_OP_RENAMEAT,
 	IORING_OP_UNLINKAT,
+	IORING_OP_BUFFERS_UPDATE,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
@@ -280,17 +281,12 @@ enum {
 	IORING_UNREGISTER_PERSONALITY		= 10,
 	IORING_REGISTER_RESTRICTIONS		= 11,
 	IORING_REGISTER_ENABLE_RINGS		= 12,
+	IORING_REGISTER_BUFFERS_UPDATE		= 13,
 
 	/* this goes last */
 	IORING_REGISTER_LAST
 };
 
-struct io_uring_files_update {
-	__u32 offset;
-	__u32 resv;
-	__aligned_u64 /* __s32 * */ fds;
-};
-
 struct io_uring_rsrc_update {
 	__u32 offset;
 	__u32 resv;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 12/13] io_uring: create common fixed_rsrc_data allocation routines.
  2020-12-07 22:15 [PATCH v2 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
                   ` (10 preceding siblings ...)
  2020-12-07 22:15 ` [PATCH v2 11/13] io_uring: support buffer registration updates Bijan Mottahedeh
@ 2020-12-07 22:15 ` Bijan Mottahedeh
  2020-12-07 22:15 ` [PATCH v2 13/13] io_uring: support buffer registration sharing Bijan Mottahedeh
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-07 22:15 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Create common alloc/free fixed_rsrc_data routines for both files and
buffers.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c | 79 ++++++++++++++++++++++++++++++++---------------------------
 1 file changed, 43 insertions(+), 36 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index b7a1f65..479a6b9 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7331,6 +7331,33 @@ static void io_rsrc_ref_quiesce(struct fixed_rsrc_data *data,
 	wait_for_completion(&data->done);
 }
 
+static struct fixed_rsrc_data *alloc_fixed_rsrc_data(struct io_ring_ctx *ctx)
+{
+	struct fixed_rsrc_data *data;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return ERR_PTR(-ENOMEM);
+
+	data->ctx = ctx;
+	init_completion(&data->done);
+
+	if (percpu_ref_init(&data->refs, io_rsrc_ref_kill,
+			    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
+		kfree(data);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	return data;
+}
+
+static void free_fixed_rsrc_data(struct fixed_rsrc_data *data)
+{
+	percpu_ref_exit(&data->refs);
+	kfree(data->table);
+	kfree(data);
+}
+
 static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
 {
 	struct fixed_rsrc_data *data = ctx->file_data;
@@ -7345,9 +7372,7 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
 	nr_tables = DIV_ROUND_UP(ctx->nr_user_files, IORING_MAX_FILES_TABLE);
 	for (i = 0; i < nr_tables; i++)
 		kfree(data->table[i].files);
-	kfree(data->table);
-	percpu_ref_exit(&data->refs);
-	kfree(data);
+	free_fixed_rsrc_data(ctx->file_data);
 	ctx->file_data = NULL;
 	ctx->nr_user_files = 0;
 	return 0;
@@ -7797,11 +7822,9 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 	if (nr_args > IORING_MAX_FIXED_FILES)
 		return -EMFILE;
 
-	file_data = kzalloc(sizeof(*ctx->file_data), GFP_KERNEL);
-	if (!file_data)
-		return -ENOMEM;
-	file_data->ctx = ctx;
-	init_completion(&file_data->done);
+	file_data = alloc_fixed_rsrc_data(ctx);
+	if (IS_ERR(file_data))
+		return PTR_ERR(ref_node);
 
 	nr_tables = DIV_ROUND_UP(nr_args, IORING_MAX_FILES_TABLE);
 	file_data->table = kcalloc(nr_tables, sizeof(*file_data->table),
@@ -7809,12 +7832,8 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 	if (!file_data->table)
 		goto out_free;
 
-	if (percpu_ref_init(&file_data->refs, io_rsrc_ref_kill,
-				PERCPU_REF_ALLOW_REINIT, GFP_KERNEL))
-		goto out_free;
-
 	if (io_sqe_alloc_file_tables(file_data, nr_tables, nr_args))
-		goto out_ref;
+		goto out_free;
 	ctx->file_data = file_data;
 
 	for (i = 0; i < nr_args; i++, ctx->nr_user_files++) {
@@ -7873,11 +7892,8 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 	for (i = 0; i < nr_tables; i++)
 		kfree(file_data->table[i].files);
 	ctx->nr_user_files = 0;
-out_ref:
-	percpu_ref_exit(&file_data->refs);
 out_free:
-	kfree(file_data->table);
-	kfree(file_data);
+	free_fixed_rsrc_data(ctx->file_data);
 	ctx->file_data = NULL;
 	return ret;
 }
@@ -8635,41 +8651,32 @@ static int io_alloc_buf_tables(struct fixed_rsrc_data *buf_data,
 static struct fixed_rsrc_data *io_buffers_map_alloc(struct io_ring_ctx *ctx,
 						    unsigned int nr_args)
 {
-	unsigned nr_tables;
 	struct fixed_rsrc_data *buf_data;
+	unsigned nr_tables;
 	int ret = -ENOMEM;
 
-	if (ctx->buf_data)
+	if (ctx->nr_user_bufs)
 		return ERR_PTR(-EBUSY);
 	if (!nr_args || nr_args > IORING_MAX_FIXED_BUFS)
 		return ERR_PTR(-EINVAL);
 
-	buf_data = kzalloc(sizeof(*ctx->buf_data), GFP_KERNEL);
-	if (!buf_data)
-		return ERR_PTR(-ENOMEM);
-	buf_data->ctx = ctx;
-	init_completion(&buf_data->done);
+	buf_data = alloc_fixed_rsrc_data(ctx);
+	if (IS_ERR(buf_data))
+		return buf_data;
 
 	nr_tables = DIV_ROUND_UP(nr_args, IORING_MAX_BUFS_TABLE);
 	buf_data->table = kcalloc(nr_tables, sizeof(*buf_data->table),
 				  GFP_KERNEL);
 	if (!buf_data->table)
-		goto out_free;
-
-	if (percpu_ref_init(&buf_data->refs, io_rsrc_ref_kill,
-			    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL))
-		goto out_free;
+		goto out;
 
 	if (io_alloc_buf_tables(buf_data, nr_tables, nr_args))
-		goto out_ref;
+		goto out;
 
 	return buf_data;
-
-out_ref:
-	percpu_ref_exit(&buf_data->refs);
-out_free:
-	kfree(buf_data->table);
-	kfree(buf_data);
+out:
+	free_fixed_rsrc_data(ctx->buf_data);
+	ctx->buf_data = NULL;
 	return ERR_PTR(ret);
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 13/13] io_uring: support buffer registration sharing
  2020-12-07 22:15 [PATCH v2 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
                   ` (11 preceding siblings ...)
  2020-12-07 22:15 ` [PATCH v2 12/13] io_uring: create common fixed_rsrc_data allocation routines Bijan Mottahedeh
@ 2020-12-07 22:15 ` Bijan Mottahedeh
  2020-12-16 15:29   ` Pavel Begunkov
  2020-12-14 19:09 ` [PATCH v2 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
  2020-12-16 15:34 ` Pavel Begunkov
  14 siblings, 1 reply; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-07 22:15 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Implement buffer sharing among multiple rings.

A ring shares its (future) buffer registrations at setup time with
IORING_SETUP_SHARE_BUF. A ring attaches to another ring's buffer
registration at setup time with IORING_SETUP_ATTACH_BUF, after
authenticating with the buffer registration owner's fd. Any updates to
the owner's buffer registrations become immediately available to the
attached rings.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io_uring.c                 | 85 +++++++++++++++++++++++++++++++++++++++++--
 include/uapi/linux/io_uring.h |  2 +
 2 files changed, 83 insertions(+), 4 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 479a6b9..b75cbd7 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -8408,6 +8408,12 @@ static void io_buffers_map_free(struct io_ring_ctx *ctx)
 	ctx->nr_user_bufs = 0;
 }
 
+static void io_detach_buf_data(struct io_ring_ctx *ctx)
+{
+	percpu_ref_put(&ctx->buf_data->refs);
+	ctx->buf_data = NULL;
+}
+
 static int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
 {
 	struct fixed_rsrc_data *data = ctx->buf_data;
@@ -8415,6 +8421,12 @@ static int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
 	if (!data)
 		return -ENXIO;
 
+	if (ctx->flags & IORING_SETUP_ATTACH_BUF) {
+		io_detach_buf_data(ctx);
+		ctx->nr_user_bufs = 0;
+		return 0;
+	}
+
 	io_rsrc_ref_quiesce(data, ctx);
 	io_buffers_unmap(ctx);
 	io_buffers_map_free(ctx);
@@ -8660,9 +8672,13 @@ static struct fixed_rsrc_data *io_buffers_map_alloc(struct io_ring_ctx *ctx,
 	if (!nr_args || nr_args > IORING_MAX_FIXED_BUFS)
 		return ERR_PTR(-EINVAL);
 
-	buf_data = alloc_fixed_rsrc_data(ctx);
-	if (IS_ERR(buf_data))
-		return buf_data;
+	if (ctx->buf_data) {
+		buf_data = ctx->buf_data;
+	} else {
+		buf_data = alloc_fixed_rsrc_data(ctx);
+		if (IS_ERR(buf_data))
+			return buf_data;
+	}
 
 	nr_tables = DIV_ROUND_UP(nr_args, IORING_MAX_BUFS_TABLE);
 	buf_data->table = kcalloc(nr_tables, sizeof(*buf_data->table),
@@ -8724,9 +8740,17 @@ static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 	struct fixed_rsrc_ref_node *ref_node;
 	struct fixed_rsrc_data *buf_data;
 
+	if (ctx->flags & IORING_SETUP_ATTACH_BUF) {
+		if (!ctx->buf_data)
+			return -EFAULT;
+		ctx->nr_user_bufs = ctx->buf_data->ctx->nr_user_bufs;
+		return 0;
+	}
+
 	buf_data = io_buffers_map_alloc(ctx, nr_args);
 	if (IS_ERR(buf_data))
 		return PTR_ERR(buf_data);
+	ctx->buf_data = buf_data;
 
 	for (i = 0; i < nr_args; i++, ctx->nr_user_bufs++) {
 		struct fixed_rsrc_table *table;
@@ -8754,7 +8778,6 @@ static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 			break;
 	}
 
-	ctx->buf_data = buf_data;
 	if (ret) {
 		io_sqe_buffers_unregister(ctx);
 		return ret;
@@ -9783,6 +9806,55 @@ static int io_uring_get_fd(struct io_ring_ctx *ctx)
 	return ret;
 }
 
+static int io_attach_buf_data(struct io_ring_ctx *ctx,
+			      struct io_uring_params *p)
+{
+	struct io_ring_ctx *ctx_attach;
+	struct fd f;
+
+	f = fdget(p->wq_fd);
+	if (!f.file)
+		return -EBADF;
+	if (f.file->f_op != &io_uring_fops) {
+		fdput(f);
+		return -EINVAL;
+	}
+
+	ctx_attach = f.file->private_data;
+	if (!ctx_attach->buf_data) {
+		fdput(f);
+		return -EINVAL;
+	}
+	ctx->buf_data = ctx_attach->buf_data;
+
+	percpu_ref_get(&ctx->buf_data->refs);
+	fdput(f);
+	return 0;
+}
+
+static int io_init_buf_data(struct io_ring_ctx *ctx, struct io_uring_params *p)
+{
+	if ((p->flags & (IORING_SETUP_SHARE_BUF | IORING_SETUP_ATTACH_BUF)) ==
+	    (IORING_SETUP_SHARE_BUF | IORING_SETUP_ATTACH_BUF))
+		return -EINVAL;
+
+	if (p->flags & IORING_SETUP_SHARE_BUF) {
+		struct fixed_rsrc_data *buf_data;
+
+		buf_data = alloc_fixed_rsrc_data(ctx);
+		if (IS_ERR(buf_data))
+			return PTR_ERR(buf_data);
+
+		ctx->buf_data = buf_data;
+		return 0;
+	}
+
+	if (p->flags & IORING_SETUP_ATTACH_BUF)
+		return io_attach_buf_data(ctx, p);
+
+	return 0;
+}
+
 static int io_uring_create(unsigned entries, struct io_uring_params *p,
 			   struct io_uring_params __user *params)
 {
@@ -9897,6 +9969,10 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p,
 	if (ret)
 		goto err;
 
+	ret = io_init_buf_data(ctx, p);
+	if (ret)
+		goto err;
+
 	ret = io_sq_offload_create(ctx, p);
 	if (ret)
 		goto err;
@@ -9968,6 +10044,7 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 	if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL |
 			IORING_SETUP_SQ_AFF | IORING_SETUP_CQSIZE |
 			IORING_SETUP_CLAMP | IORING_SETUP_ATTACH_WQ |
+			IORING_SETUP_SHARE_BUF | IORING_SETUP_ATTACH_BUF |
 			IORING_SETUP_R_DISABLED))
 		return -EINVAL;
 
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 0d9ac12..2366126 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -98,6 +98,8 @@ enum {
 #define IORING_SETUP_CLAMP	(1U << 4)	/* clamp SQ/CQ ring sizes */
 #define IORING_SETUP_ATTACH_WQ	(1U << 5)	/* attach to existing wq */
 #define IORING_SETUP_R_DISABLED	(1U << 6)	/* start with ring disabled */
+#define IORING_SETUP_SHARE_BUF	(1U << 7)	/* share buffer registration */
+#define IORING_SETUP_ATTACH_BUF	(1U << 8)	/* attach buffer registration */
 
 enum {
 	IORING_OP_NOP,
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/13] io_uring: buffer registration enhancements
  2020-12-07 22:15 [PATCH v2 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
                   ` (12 preceding siblings ...)
  2020-12-07 22:15 ` [PATCH v2 13/13] io_uring: support buffer registration sharing Bijan Mottahedeh
@ 2020-12-14 19:09 ` Bijan Mottahedeh
  2020-12-14 19:29   ` Jens Axboe
  2020-12-16 15:34 ` Pavel Begunkov
  14 siblings, 1 reply; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-14 19:09 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring

Just a ping.  Anything I can do to facilitate the review, please let me 
know.


> v2:
> 
> - drop readv/writev with fixed buffers patch
> - handle ref_nodes both both files/buffers with a single ref_list
> - make file/buffer handling more unified
> 
> This patchset implements a set of enhancements to buffer registration
> consistent with existing file registration functionality:
> 
> - buffer registration updates		IORING_REGISTER_BUFFERS_UPDATE
> 					IORING_OP_BUFFERS_UPDATE
> 
> - buffer registration sharing		IORING_SETUP_SHARE_BUF
> 					IORING_SETUP_ATTACH_BUF
> 
> I have kept the original patchset unchanged for the most part to
> facilitate reviewing and so this set adds a number of additional patches
> mostly making file/buffer handling more unified.
> 
> Patch 1-2 modularize existing buffer registration code.
> 
> Patch 3-7 generalize fixed_file functionality to fixed_rsrc.
> 
> Patch 8 applies fixed_rsrc functionality for fixed buffers support.
> 
> Patch 9-10 generalize files_update functionality to rsrc_update.
> 
> Patch 11 implements buffer registration update, and introduces
> IORING_REGISTER_BUFFERS_UPDATE and IORING_OP_BUFFERS_UPDATE, consistent
> with file registration update.
> 
> Patch 12 generalizes fixed resource allocation
> 
> Patch 13 implements buffer sharing among multiple rings; it works as follows:
> 
> - A new ring, A,  is setup. Since no buffers have been registered, the
>    registered buffer state is an empty set, Z. That's different from the
>    NULL state in current implementation.
> 
> - Ring B is setup, attaching to Ring A. It's also attaching to it's
>    buffer registrations, now we have two references to the same empty
>    set, Z.
> 
> - Ring A registers buffers into set Z, which is no longer empty.
> 
> - Ring B sees this immediately, since it's already sharing that set.
> 
> Testing
> 
> I have used liburing file-{register,update} tests as models for
> buffer-{register,update,share}, tests and they run ok.
> 
> TBD
> 
> - Haven't addressed Pavel's comment yet on using a single opcode for
> files/buffers update, pending Jen's opinion.  Could we encode the resource
> type into the sqe (e.g. rw_flags)?
> 
> Bijan Mottahedeh (13):
>    io_uring: modularize io_sqe_buffer_register
>    io_uring: modularize io_sqe_buffers_register
>    io_uring: generalize fixed file functionality
>    io_uring: rename fixed_file variables to fixed_rsrc
>    io_uring: separate ref_list from fixed_rsrc_data
>    io_uring: generalize fixed_file_ref_node functionality
>    io_uring: add rsrc_ref locking routines
>    io_uring: implement fixed buffers registration similar to fixed files
>    io_uring: create common fixed_rsrc_ref_node handling routines
>    io_uring: generalize files_update functionlity to rsrc_update
>    io_uring: support buffer registration updates
>    io_uring: create common fixed_rsrc_data allocation routines.
>    io_uring: support buffer registration sharing
> 
>   fs/io_uring.c                 | 1004 +++++++++++++++++++++++++++++------------
>   include/uapi/linux/io_uring.h |   12 +-
>   2 files changed, 732 insertions(+), 284 deletions(-)
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/13] io_uring: buffer registration enhancements
  2020-12-14 19:09 ` [PATCH v2 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
@ 2020-12-14 19:29   ` Jens Axboe
  2020-12-14 19:43     ` Bijan Mottahedeh
  2020-12-14 20:59     ` Pavel Begunkov
  0 siblings, 2 replies; 32+ messages in thread
From: Jens Axboe @ 2020-12-14 19:29 UTC (permalink / raw)
  To: Bijan Mottahedeh, asml.silence, io-uring

On 12/14/20 12:09 PM, Bijan Mottahedeh wrote:
> Just a ping.  Anything I can do to facilitate the review, please let me 
> know.

I'll get to this soon - sorry that this means that it'll miss 5.11, but
I wanted to make sure that we get this absolutely right. It is
definitely an interesting and useful feature, but worth spending the
necessary time on to ensure we don't have any mistakes we'll regret
later.

For your question, yes I think we could add sqe->update_flags (something
like that) and union it with the other flags, and add a flag that means
we're updating buffers instead of files. A bit iffy with the naming of
the opcode itself, but probably still a useful way to go.

I'd also love to see a bunch of test cases for this that exercise all
parts of it.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/13] io_uring: buffer registration enhancements
  2020-12-14 19:29   ` Jens Axboe
@ 2020-12-14 19:43     ` Bijan Mottahedeh
  2020-12-14 19:47       ` Jens Axboe
  2020-12-14 20:59     ` Pavel Begunkov
  1 sibling, 1 reply; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-14 19:43 UTC (permalink / raw)
  To: Jens Axboe, asml.silence, io-uring

On 12/14/2020 11:29 AM, Jens Axboe wrote:
> On 12/14/20 12:09 PM, Bijan Mottahedeh wrote:
>> Just a ping.  Anything I can do to facilitate the review, please let me
>> know.
> 
> I'll get to this soon - sorry that this means that it'll miss 5.11, but
> I wanted to make sure that we get this absolutely right. It is
> definitely an interesting and useful feature, but worth spending the
> necessary time on to ensure we don't have any mistakes we'll regret
> later.

Makes total sense.

> 
> For your question, yes I think we could add sqe->update_flags (something
> like that) and union it with the other flags, and add a flag that means
> we're updating buffers instead of files. A bit iffy with the naming of
> the opcode itself, but probably still a useful way to go.

I'll look into that and we can fold it in the next round, would that work?

> 
> I'd also love to see a bunch of test cases for this that exercise all
> parts of it.
> 

Great idea.  Should I send out the liburing changes and test cases now, 
that would definitely help identify the gaps early.

Thanks.

--bijan



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/13] io_uring: buffer registration enhancements
  2020-12-14 19:43     ` Bijan Mottahedeh
@ 2020-12-14 19:47       ` Jens Axboe
  0 siblings, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2020-12-14 19:47 UTC (permalink / raw)
  To: Bijan Mottahedeh, asml.silence, io-uring

On 12/14/20 12:43 PM, Bijan Mottahedeh wrote:
> On 12/14/2020 11:29 AM, Jens Axboe wrote:
>> On 12/14/20 12:09 PM, Bijan Mottahedeh wrote:
>>> Just a ping.  Anything I can do to facilitate the review, please let me
>>> know.
>>
>> I'll get to this soon - sorry that this means that it'll miss 5.11, but
>> I wanted to make sure that we get this absolutely right. It is
>> definitely an interesting and useful feature, but worth spending the
>> necessary time on to ensure we don't have any mistakes we'll regret
>> later.
> 
> Makes total sense.
> 
>>
>> For your question, yes I think we could add sqe->update_flags (something
>> like that) and union it with the other flags, and add a flag that means
>> we're updating buffers instead of files. A bit iffy with the naming of
>> the opcode itself, but probably still a useful way to go.
> 
> I'll look into that and we can fold it in the next round, would that work?

That totally works, thanks.

>> I'd also love to see a bunch of test cases for this that exercise all
>> parts of it.
>>
> 
> Great idea.  Should I send out the liburing changes and test cases now, 
> that would definitely help identify the gaps early.

Yes, you can send them out now, and then we can just stuff them in a
separate branch. Makes it easy to test and adapt to any potential kernel
side changes as this is groomed for the next release.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/13] io_uring: buffer registration enhancements
  2020-12-14 19:29   ` Jens Axboe
  2020-12-14 19:43     ` Bijan Mottahedeh
@ 2020-12-14 20:59     ` Pavel Begunkov
  2020-12-18 18:06       ` Bijan Mottahedeh
  1 sibling, 1 reply; 32+ messages in thread
From: Pavel Begunkov @ 2020-12-14 20:59 UTC (permalink / raw)
  To: Jens Axboe, Bijan Mottahedeh, io-uring

On 14/12/2020 19:29, Jens Axboe wrote:
> On 12/14/20 12:09 PM, Bijan Mottahedeh wrote:
>> Just a ping.  Anything I can do to facilitate the review, please let me 
>> know.
> 
> I'll get to this soon - sorry that this means that it'll miss 5.11, but
> I wanted to make sure that we get this absolutely right. It is
> definitely an interesting and useful feature, but worth spending the
> necessary time on to ensure we don't have any mistakes we'll regret
> later.

I'll take a look as I familiarised myself with it before.
Also, io-wq punting is probably needed to be fixed first.

> For your question, yes I think we could add sqe->update_flags (something
> like that) and union it with the other flags, and add a flag that means
> we're updating buffers instead of files. A bit iffy with the naming of
> the opcode itself, but probably still a useful way to go.

#define OPCODE_UPDATE_RESOURCES OPCODE_UPDATE_FILES

With define + documenting that they're same IMHO should be fine.

> 
> I'd also love to see a bunch of test cases for this that exercise all
> parts of it.
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 06/13] io_uring: generalize fixed_file_ref_node functionality
  2020-12-07 22:15 ` [PATCH v2 06/13] io_uring: generalize fixed_file_ref_node functionality Bijan Mottahedeh
@ 2020-12-16 14:53   ` Pavel Begunkov
  2020-12-18 18:06     ` Bijan Mottahedeh
  0 siblings, 1 reply; 32+ messages in thread
From: Pavel Begunkov @ 2020-12-16 14:53 UTC (permalink / raw)
  To: Bijan Mottahedeh; +Cc: axboe, io-uring

On 07/12/2020 22:15, Bijan Mottahedeh wrote:
> Split alloc_fixed_file_ref_node into resource generic/specific parts,
> rename destroy_fixed_file_ref_node, and factor out fixed_file_ref_node
> switching, to be be leveraged by fixed buffers.
> 
> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
> ---
>  fs/io_uring.c | 34 +++++++++++++++++++++++++---------
>  1 file changed, 25 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 1ed63bc..126237e 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -7672,7 +7672,7 @@ static void io_rsrc_data_ref_zero(struct percpu_ref *ref)
>  		queue_delayed_work(system_wq, &ctx->rsrc_put_work, delay);
>  }
>  
> -static struct fixed_rsrc_ref_node *alloc_fixed_file_ref_node(
> +static struct fixed_rsrc_ref_node *alloc_fixed_rsrc_ref_node(
>  			struct io_ring_ctx *ctx)
>  {
>  	struct fixed_rsrc_ref_node *ref_node;
> @@ -7688,13 +7688,22 @@ static struct fixed_rsrc_ref_node *alloc_fixed_file_ref_node(
>  	}
>  	INIT_LIST_HEAD(&ref_node->node);
>  	INIT_LIST_HEAD(&ref_node->rsrc_list);
> +	ref_node->done = false;
> +	return ref_node;
> +}
> +
> +static struct fixed_rsrc_ref_node *alloc_fixed_file_ref_node(
> +			struct io_ring_ctx *ctx)
> +{
> +	struct fixed_rsrc_ref_node *ref_node;
> +
> +	ref_node = alloc_fixed_rsrc_ref_node(ctx);

if (!ref_node)
	return NULL;

>  	ref_node->rsrc_data = ctx->file_data;
>  	ref_node->rsrc_put = io_ring_file_put;
> -	ref_node->done = false;
>  	return ref_node;
>  }
>  
> -static void destroy_fixed_file_ref_node(struct fixed_rsrc_ref_node *ref_node)
> +static void destroy_fixed_rsrc_ref_node(struct fixed_rsrc_ref_node *ref_node)
>  {
>  	percpu_ref_exit(&ref_node->refs);
>  	kfree(ref_node);
> @@ -7870,6 +7879,17 @@ static inline int io_queue_file_removal(struct fixed_rsrc_data *data,
>  	return io_queue_rsrc_removal(data, (void *)file);
>  }
>  
> +static void switch_fixed_rsrc_ref_node(struct fixed_rsrc_ref_node *ref_node,
> +				       struct fixed_rsrc_data *data,
> +				       struct io_ring_ctx *ctx)
> +{
> +	percpu_ref_kill(&data->node->refs);
> +	spin_lock_bh(&ctx->rsrc_ref_lock);
> +	list_add_tail(&ref_node->node, &ctx->rsrc_ref_list);
> +	data->node = ref_node;
> +	spin_unlock_bh(&ctx->rsrc_ref_lock);
> +}
> +
>  static int __io_sqe_files_update(struct io_ring_ctx *ctx,
>  				 struct io_uring_files_update *up,
>  				 unsigned nr_args)
> @@ -7946,14 +7966,10 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
>  	}
>  
>  	if (needs_switch) {
> -		percpu_ref_kill(&data->node->refs);
> -		spin_lock_bh(&ctx->rsrc_ref_lock);
> -		list_add_tail(&ref_node->node, &ctx->rsrc_ref_list);
> -		data->node = ref_node;
> -		spin_unlock_bh(&ctx->rsrc_ref_lock);
> +		switch_fixed_rsrc_ref_node(ref_node, data, ctx);
>  		percpu_ref_get(&ctx->file_data->refs);
>  	} else
> -		destroy_fixed_file_ref_node(ref_node);
> +		destroy_fixed_rsrc_ref_node(ref_node);
>  
>  	return done ? done : err;
>  }
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 08/13] io_uring: implement fixed buffers registration similar to fixed files
  2020-12-07 22:15 ` [PATCH v2 08/13] io_uring: implement fixed buffers registration similar to fixed files Bijan Mottahedeh
@ 2020-12-16 14:58   ` Pavel Begunkov
  2020-12-18 18:06     ` Bijan Mottahedeh
  2020-12-16 14:59   ` Pavel Begunkov
  1 sibling, 1 reply; 32+ messages in thread
From: Pavel Begunkov @ 2020-12-16 14:58 UTC (permalink / raw)
  To: Bijan Mottahedeh; +Cc: axboe, io-uring

On 07/12/2020 22:15, Bijan Mottahedeh wrote:
> Apply fixed_rsrc functionality for fixed buffers support.
> 
> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
> ---
>  fs/io_uring.c | 240 +++++++++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 204 insertions(+), 36 deletions(-)
> 
[...]
>  	/* overflow */
> @@ -8296,28 +8313,71 @@ static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
>  	return pages;
>  }
>  
> -static int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
> +static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)

I think this and some others from here can go into a separate patch, would
be cleaner.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 08/13] io_uring: implement fixed buffers registration similar to fixed files
  2020-12-07 22:15 ` [PATCH v2 08/13] io_uring: implement fixed buffers registration similar to fixed files Bijan Mottahedeh
  2020-12-16 14:58   ` Pavel Begunkov
@ 2020-12-16 14:59   ` Pavel Begunkov
  1 sibling, 0 replies; 32+ messages in thread
From: Pavel Begunkov @ 2020-12-16 14:59 UTC (permalink / raw)
  To: Bijan Mottahedeh, axboe, io-uring

On 07/12/2020 22:15, Bijan Mottahedeh wrote:
> Apply fixed_rsrc functionality for fixed buffers support.
> 
> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
> ---
[...]
> +static int io_alloc_buf_tables(struct fixed_rsrc_data *buf_data,
> +			       unsigned nr_tables, unsigned nr_bufs)
> +{
> +	int i;
> +		

trailing tabs, btw

> +	for (i = 0; i < nr_tables; i++) {
> +		struct fixed_rsrc_table *table = &buf_data->table[i];
> +		unsigned this_bufs;

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 13/13] io_uring: support buffer registration sharing
  2020-12-07 22:15 ` [PATCH v2 13/13] io_uring: support buffer registration sharing Bijan Mottahedeh
@ 2020-12-16 15:29   ` Pavel Begunkov
  2020-12-18 18:06     ` Bijan Mottahedeh
  0 siblings, 1 reply; 32+ messages in thread
From: Pavel Begunkov @ 2020-12-16 15:29 UTC (permalink / raw)
  To: Bijan Mottahedeh, axboe, io-uring

On 07/12/2020 22:15, Bijan Mottahedeh wrote:
> Implement buffer sharing among multiple rings.
> 
> A ring shares its (future) buffer registrations at setup time with
> IORING_SETUP_SHARE_BUF. A ring attaches to another ring's buffer
> registration at setup time with IORING_SETUP_ATTACH_BUF, after
> authenticating with the buffer registration owner's fd. Any updates to
> the owner's buffer registrations become immediately available to the
> attached rings.
> 
> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
> ---
>  fs/io_uring.c                 | 85 +++++++++++++++++++++++++++++++++++++++++--
>  include/uapi/linux/io_uring.h |  2 +
>  2 files changed, 83 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 479a6b9..b75cbd7 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -8408,6 +8408,12 @@ static void io_buffers_map_free(struct io_ring_ctx *ctx)
>  	ctx->nr_user_bufs = 0;
>  }
>  
> +static void io_detach_buf_data(struct io_ring_ctx *ctx)
> +{
> +	percpu_ref_put(&ctx->buf_data->refs);
> +	ctx->buf_data = NULL;
> +}
> +
>  static int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
>  {
>  	struct fixed_rsrc_data *data = ctx->buf_data;
> @@ -8415,6 +8421,12 @@ static int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
>  	if (!data)
>  		return -ENXIO;
>  
> +	if (ctx->flags & IORING_SETUP_ATTACH_BUF) {
> +		io_detach_buf_data(ctx);
> +		ctx->nr_user_bufs = 0;

nr_user_bufs is a part of invariant and should stay together with
stuff in io_detach_buf_data().

> +		return 0;
> +	}
> +
>  	io_rsrc_ref_quiesce(data, ctx);
>  	io_buffers_unmap(ctx);
>  	io_buffers_map_free(ctx);
> @@ -8660,9 +8672,13 @@ static struct fixed_rsrc_data *io_buffers_map_alloc(struct io_ring_ctx *ctx,
>  	if (!nr_args || nr_args > IORING_MAX_FIXED_BUFS)
>  		return ERR_PTR(-EINVAL);
>  
> -	buf_data = alloc_fixed_rsrc_data(ctx);
> -	if (IS_ERR(buf_data))
> -		return buf_data;
> +	if (ctx->buf_data) {
> +		buf_data = ctx->buf_data;
> +	} else {
> +		buf_data = alloc_fixed_rsrc_data(ctx);
> +		if (IS_ERR(buf_data))
> +			return buf_data;
> +	}
>  
>  	nr_tables = DIV_ROUND_UP(nr_args, IORING_MAX_BUFS_TABLE);
>  	buf_data->table = kcalloc(nr_tables, sizeof(*buf_data->table),
> @@ -8724,9 +8740,17 @@ static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
>  	struct fixed_rsrc_ref_node *ref_node;
>  	struct fixed_rsrc_data *buf_data;
>  
> +	if (ctx->flags & IORING_SETUP_ATTACH_BUF) {
> +		if (!ctx->buf_data)
> +			return -EFAULT;
> +		ctx->nr_user_bufs = ctx->buf_data->ctx->nr_user_bufs;

Why? Once a table is initialised it shouldn't change its size, would
be racy otherwise.

> +		return 0;
> +	}
> +
>  	buf_data = io_buffers_map_alloc(ctx, nr_args);
>  	if (IS_ERR(buf_data))
>  		return PTR_ERR(buf_data);
> +	ctx->buf_data = buf_data;

Wanted to write that there is missing
`if (ctx->user_bufs) return -EBUSY`

but apparently it was moved into io_buffers_map_alloc().
I'd really prefer to have it here.

>  
>  	for (i = 0; i < nr_args; i++, ctx->nr_user_bufs++) {
>  		struct fixed_rsrc_table *table;
> @@ -8754,7 +8778,6 @@ static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
>  			break;
>  	}
>  
> -	ctx->buf_data = buf_data;
>  	if (ret) {
>  		io_sqe_buffers_unregister(ctx);
>  		return ret;
> @@ -9783,6 +9806,55 @@ static int io_uring_get_fd(struct io_ring_ctx *ctx)
>  	return ret;
>  }
>  
> +static int io_attach_buf_data(struct io_ring_ctx *ctx,
> +			      struct io_uring_params *p)
> +{
> +	struct io_ring_ctx *ctx_attach;
> +	struct fd f;
> +
> +	f = fdget(p->wq_fd);
> +	if (!f.file)
> +		return -EBADF;
> +	if (f.file->f_op != &io_uring_fops) {
> +		fdput(f);
> +		return -EINVAL;
> +	}
> +
> +	ctx_attach = f.file->private_data;
> +	if (!ctx_attach->buf_data) {

It looks racy. What prevents it from being deleted while we're
working on it, e.g. by io_sqe_buffers_unregister?

> +		fdput(f);
> +		return -EINVAL;
> +	}
> +	ctx->buf_data = ctx_attach->buf_data;

Before updates, etc. (e.g. __io_sqe_buffers_update()) were synchronised
by uring_lock, now it's modified concurrently, that looks to be really
racy.

> +
> +	percpu_ref_get(&ctx->buf_data->refs);

Ok, now the original io_uring instance will wait until the attached
once get rid of their references. That's a versatile ground to have
in kernel deadlocks.

task1: uring1 = create()
task2: uring2 = create()
task1: uring3 = create(share=uring2);
task2: uring4 = create(share=uring1);

task1: io_sqe_buffers_unregister(uring1)
task2: io_sqe_buffers_unregister(uring2)

If I skimmed through the code right, that should hang unkillably.

> +	fdput(f);
> +	return 0;
> +}
> +
> +static int io_init_buf_data(struct io_ring_ctx *ctx, struct io_uring_params *p)
> +{
> +	if ((p->flags & (IORING_SETUP_SHARE_BUF | IORING_SETUP_ATTACH_BUF)) ==
> +	    (IORING_SETUP_SHARE_BUF | IORING_SETUP_ATTACH_BUF))
> +		return -EINVAL;
> +
> +	if (p->flags & IORING_SETUP_SHARE_BUF) {
> +		struct fixed_rsrc_data *buf_data;
> +
> +		buf_data = alloc_fixed_rsrc_data(ctx);
> +		if (IS_ERR(buf_data))
> +			return PTR_ERR(buf_data);
> +
> +		ctx->buf_data = buf_data;
> +		return 0;
> +	}
> +
> +	if (p->flags & IORING_SETUP_ATTACH_BUF)
> +		return io_attach_buf_data(ctx, p);
> +
> +	return 0;
> +}
> +
>  static int io_uring_create(unsigned entries, struct io_uring_params *p,
>  			   struct io_uring_params __user *params)
>  {
> @@ -9897,6 +9969,10 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p,
>  	if (ret)
>  		goto err;
>  
> +	ret = io_init_buf_data(ctx, p);
> +	if (ret)
> +		goto err;
> +
>  	ret = io_sq_offload_create(ctx, p);
>  	if (ret)
>  		goto err;
> @@ -9968,6 +10044,7 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
>  	if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL |
>  			IORING_SETUP_SQ_AFF | IORING_SETUP_CQSIZE |
>  			IORING_SETUP_CLAMP | IORING_SETUP_ATTACH_WQ |
> +			IORING_SETUP_SHARE_BUF | IORING_SETUP_ATTACH_BUF |
>  			IORING_SETUP_R_DISABLED))
>  		return -EINVAL;
>  
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 0d9ac12..2366126 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -98,6 +98,8 @@ enum {
>  #define IORING_SETUP_CLAMP	(1U << 4)	/* clamp SQ/CQ ring sizes */
>  #define IORING_SETUP_ATTACH_WQ	(1U << 5)	/* attach to existing wq */
>  #define IORING_SETUP_R_DISABLED	(1U << 6)	/* start with ring disabled */
> +#define IORING_SETUP_SHARE_BUF	(1U << 7)	/* share buffer registration */
> +#define IORING_SETUP_ATTACH_BUF	(1U << 8)	/* attach buffer registration */
>  
>  enum {
>  	IORING_OP_NOP,
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/13] io_uring: buffer registration enhancements
  2020-12-07 22:15 [PATCH v2 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
                   ` (13 preceding siblings ...)
  2020-12-14 19:09 ` [PATCH v2 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
@ 2020-12-16 15:34 ` Pavel Begunkov
  2020-12-18 18:06   ` Bijan Mottahedeh
  14 siblings, 1 reply; 32+ messages in thread
From: Pavel Begunkov @ 2020-12-16 15:34 UTC (permalink / raw)
  To: Bijan Mottahedeh, axboe, io-uring

On 07/12/2020 22:15, Bijan Mottahedeh wrote:
> v2:
> 
> - drop readv/writev with fixed buffers patch
> - handle ref_nodes both both files/buffers with a single ref_list
> - make file/buffer handling more unified
> 
> This patchset implements a set of enhancements to buffer registration
> consistent with existing file registration functionality:
> 
> - buffer registration updates		IORING_REGISTER_BUFFERS_UPDATE
> 					IORING_OP_BUFFERS_UPDATE
> 
> - buffer registration sharing		IORING_SETUP_SHARE_BUF
> 					IORING_SETUP_ATTACH_BUF
> 
> I have kept the original patchset unchanged for the most part to
> facilitate reviewing and so this set adds a number of additional patches
> mostly making file/buffer handling more unified.

Thanks for the patches! I'd _really_ prefer for all s/files/rsrc/
renaming being in a single prep patch instead of spilling it all around.
I found one bug, but to be honest it's hard to get through when functional
changes and other cleanups are buried under tons of such renaming.

> 
> Patch 1-2 modularize existing buffer registration code.
> 
> Patch 3-7 generalize fixed_file functionality to fixed_rsrc.
> 
> Patch 8 applies fixed_rsrc functionality for fixed buffers support.
> 
> Patch 9-10 generalize files_update functionality to rsrc_update.
> 
> Patch 11 implements buffer registration update, and introduces
> IORING_REGISTER_BUFFERS_UPDATE and IORING_OP_BUFFERS_UPDATE, consistent
> with file registration update.
> 
> Patch 12 generalizes fixed resource allocation 
> 
> Patch 13 implements buffer sharing among multiple rings; it works as follows:
> 
> - A new ring, A,  is setup. Since no buffers have been registered, the
>   registered buffer state is an empty set, Z. That's different from the
>   NULL state in current implementation.
> 
> - Ring B is setup, attaching to Ring A. It's also attaching to it's
>   buffer registrations, now we have two references to the same empty
>   set, Z.
> 
> - Ring A registers buffers into set Z, which is no longer empty.
> 
> - Ring B sees this immediately, since it's already sharing that set.
> 
> Testing
> 
> I have used liburing file-{register,update} tests as models for
> buffer-{register,update,share}, tests and they run ok.
> 
> TBD
> 
> - Haven't addressed Pavel's comment yet on using a single opcode for
> files/buffers update, pending Jen's opinion.  Could we encode the resource
> type into the sqe (e.g. rw_flags)?
> 
> Bijan Mottahedeh (13):
>   io_uring: modularize io_sqe_buffer_register
>   io_uring: modularize io_sqe_buffers_register
>   io_uring: generalize fixed file functionality
>   io_uring: rename fixed_file variables to fixed_rsrc
>   io_uring: separate ref_list from fixed_rsrc_data
>   io_uring: generalize fixed_file_ref_node functionality
>   io_uring: add rsrc_ref locking routines
>   io_uring: implement fixed buffers registration similar to fixed files
>   io_uring: create common fixed_rsrc_ref_node handling routines
>   io_uring: generalize files_update functionlity to rsrc_update
>   io_uring: support buffer registration updates
>   io_uring: create common fixed_rsrc_data allocation routines.
>   io_uring: support buffer registration sharing
> 
>  fs/io_uring.c                 | 1004 +++++++++++++++++++++++++++++------------
>  include/uapi/linux/io_uring.h |   12 +-
>  2 files changed, 732 insertions(+), 284 deletions(-)
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/13] io_uring: buffer registration enhancements
  2020-12-16 15:34 ` Pavel Begunkov
@ 2020-12-18 18:06   ` Bijan Mottahedeh
  0 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-18 18:06 UTC (permalink / raw)
  To: Pavel Begunkov, axboe, io-uring


> Thanks for the patches! I'd _really_ prefer for all s/files/rsrc/
> renaming being in a single prep patch instead of spilling it all around.
> I found one bug, but to be honest it's hard to get through when functional
> changes and other cleanups are buried under tons of such renaming.

I've tried to gather as much of the renames that can be grouped together 
in the next version I'm about to send.  It is hard however to put all 
renames in one place since they often go hand in hand with a 
corresponding functionality change.  I appreciate though that the 
renames get in the way, sorry for that :-)  If you think more can be 
done after the next version, I'll look into more.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/13] io_uring: buffer registration enhancements
  2020-12-14 20:59     ` Pavel Begunkov
@ 2020-12-18 18:06       ` Bijan Mottahedeh
  0 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-18 18:06 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe, io-uring


>> For your question, yes I think we could add sqe->update_flags (something
>> like that) and union it with the other flags, and add a flag that means
>> we're updating buffers instead of files. A bit iffy with the naming of
>> the opcode itself, but probably still a useful way to go.
> 
> #define OPCODE_UPDATE_RESOURCES OPCODE_UPDATE_FILES
> 
> With define + documenting that they're same IMHO should be fine.

I tried to use a single opcode for files/buffers but ran into an
issue since work_flags is different for files/buffers.  This should
be ok for the most part since req->work.flags is ultimately examined;
however, there are place where io_op_defs[opcode].work_flags is examined
directly, and I wasn't sure what would the best way to handle that.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 06/13] io_uring: generalize fixed_file_ref_node functionality
  2020-12-16 14:53   ` Pavel Begunkov
@ 2020-12-18 18:06     ` Bijan Mottahedeh
  0 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-18 18:06 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: axboe, io-uring


>> +static struct fixed_rsrc_ref_node *alloc_fixed_file_ref_node(
>> +			struct io_ring_ctx *ctx)
>> +{
>> +	struct fixed_rsrc_ref_node *ref_node;
>> +
>> +	ref_node = alloc_fixed_rsrc_ref_node(ctx);
> 
> if (!ref_node)
> 	return NULL;

Ok.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 08/13] io_uring: implement fixed buffers registration similar to fixed files
  2020-12-16 14:58   ` Pavel Begunkov
@ 2020-12-18 18:06     ` Bijan Mottahedeh
  0 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-18 18:06 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: axboe, io-uring


>> -static int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
>> +static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
> 
> I think this and some others from here can go into a separate patch, would
> be cleaner.
> 

I tried to break this up a few ways but it didn't work well because I 
think most of the code changes depend on the io_uring structure changes.
I've left it as is for now but if you have some idea on what you want to 
do, we can do it next.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 13/13] io_uring: support buffer registration sharing
  2020-12-16 15:29   ` Pavel Begunkov
@ 2020-12-18 18:06     ` Bijan Mottahedeh
  2021-01-07  0:50       ` Bijan Mottahedeh
  0 siblings, 1 reply; 32+ messages in thread
From: Bijan Mottahedeh @ 2020-12-18 18:06 UTC (permalink / raw)
  To: Pavel Begunkov, axboe, io-uring


>> @@ -8415,6 +8421,12 @@ static int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
>>   	if (!data)
>>   		return -ENXIO;
>>   
>> +	if (ctx->flags & IORING_SETUP_ATTACH_BUF) {
>> +		io_detach_buf_data(ctx);
>> +		ctx->nr_user_bufs = 0;
> 
> nr_user_bufs is a part of invariant and should stay together with
> stuff in io_detach_buf_data().

Moved to io_detach_buf_data.


>> @@ -8724,9 +8740,17 @@ static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
>>   	struct fixed_rsrc_ref_node *ref_node;
>>   	struct fixed_rsrc_data *buf_data;
>>   
>> +	if (ctx->flags & IORING_SETUP_ATTACH_BUF) {
>> +		if (!ctx->buf_data)
>> +			return -EFAULT;
>> +		ctx->nr_user_bufs = ctx->buf_data->ctx->nr_user_bufs;
> 
> Why? Once a table is initialised it shouldn't change its size, would
> be racy otherwise.

ctx->buf_data is set at ring setup time but the sharing process 
(SETUP_SHARE) may do the actual buffer registration at an arbitrary time 
later, so the attaching process must ensure to get the updated value of 
nr_user_bufs if available.

>>   	buf_data = io_buffers_map_alloc(ctx, nr_args);
>>   	if (IS_ERR(buf_data))
>>   		return PTR_ERR(buf_data);
>> +	ctx->buf_data = buf_data;
> 
> Wanted to write that there is missing
> `if (ctx->user_bufs) return -EBUSY`
> 
> but apparently it was moved into io_buffers_map_alloc().
> I'd really prefer to have it here.

Moved it back.

>> +static int io_attach_buf_data(struct io_ring_ctx *ctx,
>> +			      struct io_uring_params *p)
>> +{
>> +	struct io_ring_ctx *ctx_attach;
>> +	struct fd f;
>> +
>> +	f = fdget(p->wq_fd);
>> +	if (!f.file)
>> +		return -EBADF;
>> +	if (f.file->f_op != &io_uring_fops) {
>> +		fdput(f);
>> +		return -EINVAL;
>> +	}
>> +
>> +	ctx_attach = f.file->private_data;
>> +	if (!ctx_attach->buf_data) {
> 
> It looks racy. What prevents it from being deleted while we're
> working on it, e.g. by io_sqe_buffers_unregister?

I think the premise here is that buffer sharing happens between trusted 
and coordinated processes.  If I understand your concern correctly, then 
if the sharing process unregisters its buffers after having shared them, 
than that process is acting improperly.  The race could lead to a failed 
attach but that would be expected and reasonable I would think?  What do 
you think should happen in this case?

> 
>> +		fdput(f);
>> +		return -EINVAL;
>> +	}
>> +	ctx->buf_data = ctx_attach->buf_data;
> 
> Before updates, etc. (e.g. __io_sqe_buffers_update()) were synchronised
> by uring_lock, now it's modified concurrently, that looks to be really
> racy.

Racy from the attaching process perspective you mean?

> 
>> +
>> +	percpu_ref_get(&ctx->buf_data->refs);
> 
> Ok, now the original io_uring instance will wait until the attached
> once get rid of their references. That's a versatile ground to have
> in kernel deadlocks.
> 
> task1: uring1 = create()
> task2: uring2 = create()
> task1: uring3 = create(share=uring2);
> task2: uring4 = create(share=uring1);
> 
> task1: io_sqe_buffers_unregister(uring1)
> task2: io_sqe_buffers_unregister(uring2)
> 
> If I skimmed through the code right, that should hang unkillably.

So we need a way to enforce that a process can only have one role, 
sharing or attaching? But I'm not what the best way to do that.  Is this 
an issue for other resource sharing, work queues or polling thread?


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 13/13] io_uring: support buffer registration sharing
  2020-12-18 18:06     ` Bijan Mottahedeh
@ 2021-01-07  0:50       ` Bijan Mottahedeh
  2021-01-11  5:19         ` Pavel Begunkov
  0 siblings, 1 reply; 32+ messages in thread
From: Bijan Mottahedeh @ 2021-01-07  0:50 UTC (permalink / raw)
  To: Pavel Begunkov, axboe, io-uring

On 12/18/2020 10:06 AM, Bijan Mottahedeh wrote:
> 
>>> @@ -8415,6 +8421,12 @@ static int io_sqe_buffers_unregister(struct 
>>> io_ring_ctx *ctx)
>>>       if (!data)
>>>           return -ENXIO;
>>> +    if (ctx->flags & IORING_SETUP_ATTACH_BUF) {
>>> +        io_detach_buf_data(ctx);
>>> +        ctx->nr_user_bufs = 0;
>>
>> nr_user_bufs is a part of invariant and should stay together with
>> stuff in io_detach_buf_data().
> 
> Moved to io_detach_buf_data.
> 
> 
>>> @@ -8724,9 +8740,17 @@ static int io_sqe_buffers_register(struct 
>>> io_ring_ctx *ctx, void __user *arg,
>>>       struct fixed_rsrc_ref_node *ref_node;
>>>       struct fixed_rsrc_data *buf_data;
>>> +    if (ctx->flags & IORING_SETUP_ATTACH_BUF) {
>>> +        if (!ctx->buf_data)
>>> +            return -EFAULT;
>>> +        ctx->nr_user_bufs = ctx->buf_data->ctx->nr_user_bufs;
>>
>> Why? Once a table is initialised it shouldn't change its size, would
>> be racy otherwise.
> 
> ctx->buf_data is set at ring setup time but the sharing process 
> (SETUP_SHARE) may do the actual buffer registration at an arbitrary time 
> later, so the attaching process must ensure to get the updated value of 
> nr_user_bufs if available.
> 
>>>       buf_data = io_buffers_map_alloc(ctx, nr_args);
>>>       if (IS_ERR(buf_data))
>>>           return PTR_ERR(buf_data);
>>> +    ctx->buf_data = buf_data;
>>
>> Wanted to write that there is missing
>> `if (ctx->user_bufs) return -EBUSY`
>>
>> but apparently it was moved into io_buffers_map_alloc().
>> I'd really prefer to have it here.
> 
> Moved it back.
> 
>>> +static int io_attach_buf_data(struct io_ring_ctx *ctx,
>>> +                  struct io_uring_params *p)
>>> +{
>>> +    struct io_ring_ctx *ctx_attach;
>>> +    struct fd f;
>>> +
>>> +    f = fdget(p->wq_fd);
>>> +    if (!f.file)
>>> +        return -EBADF;
>>> +    if (f.file->f_op != &io_uring_fops) {
>>> +        fdput(f);
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    ctx_attach = f.file->private_data;
>>> +    if (!ctx_attach->buf_data) {
>>
>> It looks racy. What prevents it from being deleted while we're
>> working on it, e.g. by io_sqe_buffers_unregister?
> 
> I think the premise here is that buffer sharing happens between trusted 
> and coordinated processes.  If I understand your concern correctly, then 
> if the sharing process unregisters its buffers after having shared them, 
> than that process is acting improperly.  The race could lead to a failed 
> attach but that would be expected and reasonable I would think?  What do 
> you think should happen in this case?
> 
>>
>>> +        fdput(f);
>>> +        return -EINVAL;
>>> +    }
>>> +    ctx->buf_data = ctx_attach->buf_data;
>>
>> Before updates, etc. (e.g. __io_sqe_buffers_update()) were synchronised
>> by uring_lock, now it's modified concurrently, that looks to be really
>> racy.
> 
> Racy from the attaching process perspective you mean?
> 
>>
>>> +
>>> +    percpu_ref_get(&ctx->buf_data->refs);
>>
>> Ok, now the original io_uring instance will wait until the attached
>> once get rid of their references. That's a versatile ground to have
>> in kernel deadlocks.
>>
>> task1: uring1 = create()
>> task2: uring2 = create()
>> task1: uring3 = create(share=uring2);
>> task2: uring4 = create(share=uring1);
>>
>> task1: io_sqe_buffers_unregister(uring1)
>> task2: io_sqe_buffers_unregister(uring2)
>>
>> If I skimmed through the code right, that should hang unkillably.
> 
> So we need a way to enforce that a process can only have one role, 
> sharing or attaching? But I'm not what the best way to do that.  Is this 
> an issue for other resource sharing, work queues or polling thread?
> 

The intended use case for buffer registration is:

- a group of processes attach a shmem segment
- one process registers the buffers in the shmem segment and shares it
- other processes attach that registration

For this case, it seems that there is really no need to wait for the 
attached processes to get rid of the their references since the shmem 
segment (and thus the registered buffers) will persist anyway until the 
last attached process goes away.  So the last unregister could quiesce 
all references and get rid of the shared buf_data.

I'm not sure how useful the non-shmem use case would be anyway.

Would it makes sense to restrict the scope of this feature?



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 13/13] io_uring: support buffer registration sharing
  2021-01-07  0:50       ` Bijan Mottahedeh
@ 2021-01-11  5:19         ` Pavel Begunkov
  2021-01-12 21:50           ` Bijan Mottahedeh
  0 siblings, 1 reply; 32+ messages in thread
From: Pavel Begunkov @ 2021-01-11  5:19 UTC (permalink / raw)
  To: Bijan Mottahedeh, axboe, io-uring

> The intended use case for buffer registration is:
> 
> - a group of processes attach a shmem segment
> - one process registers the buffers in the shmem segment and shares it
> - other processes attach that registration
> 
> For this case, it seems that there is really no need to wait for the attached processes to get rid of the their references since the shmem segment (and thus the registered buffers) will persist anyway until the last attached process goes away.  So the last unregister could quiesce all references and get rid of the shared buf_data.
> 
> I'm not sure how useful the non-shmem use case would be anyway.
> 
> Would it makes sense to restrict the scope of this feature?

I have to say I like that generic resources thing, makes it easier to
extend in the future. e.g. pre-allocating dma mappings, structs like
bios, etc.

I didn't think it through properly but it also looks that with refnodes
it would be much easier to do sharing in the end, if not possible
vs impossible.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 13/13] io_uring: support buffer registration sharing
  2021-01-11  5:19         ` Pavel Begunkov
@ 2021-01-12 21:50           ` Bijan Mottahedeh
  0 siblings, 0 replies; 32+ messages in thread
From: Bijan Mottahedeh @ 2021-01-12 21:50 UTC (permalink / raw)
  To: Pavel Begunkov, axboe, io-uring

On 1/10/2021 9:19 PM, Pavel Begunkov wrote:
>> The intended use case for buffer registration is:
>>
>> - a group of processes attach a shmem segment
>> - one process registers the buffers in the shmem segment and shares it
>> - other processes attach that registration
>>
>> For this case, it seems that there is really no need to wait for the attached processes to get rid of the their references since the shmem segment (and thus the registered buffers) will persist anyway until the last attached process goes away.  So the last unregister could quiesce all references and get rid of the shared buf_data.
>>
>> I'm not sure how useful the non-shmem use case would be anyway.
>>
>> Would it makes sense to restrict the scope of this feature?
> 
> I have to say I like that generic resources thing, makes it easier to
> extend in the future. e.g. pre-allocating dma mappings, structs like
> bios, etc.
> 
> I didn't think it through properly but it also looks that with refnodes
> it would be much easier to do sharing in the end, if not possible
> vs impossible.
> 

Do you think that an the unkillable deadlock is still an issue with your 
changes in the v5 version of io_rsrc_ref_quiesce() I just sent out?

We're calling wait_for_completion_interruptible() so I assume it should 
be interruptible.  I think we'll then skip unmapping the buffers though, 
so it's not clear to me what the right solution is for the below 
scenario you raised in the first place:

task1: uring1 = create()
task2: uring2 = create()
task1: uring3 = create(share=uring2);
task2: uring4 = create(share=uring1);

task1: io_sqe_buffers_unregister(uring1)
task2: io_sqe_buffers_unregister(uring2)

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2021-01-12 21:51 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-07 22:15 [PATCH v2 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
2020-12-07 22:15 ` [PATCH v2 01/13] io_uring: modularize io_sqe_buffer_register Bijan Mottahedeh
2020-12-07 22:15 ` [PATCH v2 02/13] io_uring: modularize io_sqe_buffers_register Bijan Mottahedeh
2020-12-07 22:15 ` [PATCH v2 03/13] io_uring: generalize fixed file functionality Bijan Mottahedeh
2020-12-07 22:15 ` [PATCH v2 04/13] io_uring: rename fixed_file variables to fixed_rsrc Bijan Mottahedeh
2020-12-07 22:15 ` [PATCH v2 05/13] io_uring: separate ref_list from fixed_rsrc_data Bijan Mottahedeh
2020-12-07 22:15 ` [PATCH v2 06/13] io_uring: generalize fixed_file_ref_node functionality Bijan Mottahedeh
2020-12-16 14:53   ` Pavel Begunkov
2020-12-18 18:06     ` Bijan Mottahedeh
2020-12-07 22:15 ` [PATCH v2 07/13] io_uring: add rsrc_ref locking routines Bijan Mottahedeh
2020-12-07 22:15 ` [PATCH v2 08/13] io_uring: implement fixed buffers registration similar to fixed files Bijan Mottahedeh
2020-12-16 14:58   ` Pavel Begunkov
2020-12-18 18:06     ` Bijan Mottahedeh
2020-12-16 14:59   ` Pavel Begunkov
2020-12-07 22:15 ` [PATCH v2 09/13] io_uring: create common fixed_rsrc_ref_node handling routines Bijan Mottahedeh
2020-12-07 22:15 ` [PATCH v2 10/13] io_uring: generalize files_update functionlity to rsrc_update Bijan Mottahedeh
2020-12-07 22:15 ` [PATCH v2 11/13] io_uring: support buffer registration updates Bijan Mottahedeh
2020-12-07 22:15 ` [PATCH v2 12/13] io_uring: create common fixed_rsrc_data allocation routines Bijan Mottahedeh
2020-12-07 22:15 ` [PATCH v2 13/13] io_uring: support buffer registration sharing Bijan Mottahedeh
2020-12-16 15:29   ` Pavel Begunkov
2020-12-18 18:06     ` Bijan Mottahedeh
2021-01-07  0:50       ` Bijan Mottahedeh
2021-01-11  5:19         ` Pavel Begunkov
2021-01-12 21:50           ` Bijan Mottahedeh
2020-12-14 19:09 ` [PATCH v2 00/13] io_uring: buffer registration enhancements Bijan Mottahedeh
2020-12-14 19:29   ` Jens Axboe
2020-12-14 19:43     ` Bijan Mottahedeh
2020-12-14 19:47       ` Jens Axboe
2020-12-14 20:59     ` Pavel Begunkov
2020-12-18 18:06       ` Bijan Mottahedeh
2020-12-16 15:34 ` Pavel Begunkov
2020-12-18 18:06   ` Bijan Mottahedeh

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.