io-uring.vger.kernel.org archive mirror
* [PATCH for-next 00/12] io_uring: retarget rsrc nodes periodically
@ 2022-10-31 13:41 Dylan Yudaken
  2022-10-31 13:41 ` [PATCH for-next 01/12] io_uring: infrastructure for retargeting rsrc nodes Dylan Yudaken
                   ` (12 more replies)
  0 siblings, 13 replies; 30+ messages in thread
From: Dylan Yudaken @ 2022-10-31 13:41 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov; +Cc: io-uring, kernel-team, Dylan Yudaken

There is a problem with long running io_uring requests and rsrc node
cleanup, where a single long running request can block cleanup of all
subsequent nodes. For example a network server may have both long running
accepts and also use registered files for its connections. When sockets are
closed and their slots returned (either through close_direct or through
register_files_update), the underlying file will not be freed until the
original accept completes. For the multishot case that may be the lifetime
of the application, so the number of open files grows unbounded, eventually
resulting in either OOMs or ENFILE errors.
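
Roughly, the problematic pattern looks something like the following (a
rough liburing sketch for illustration only, not part of this series;
error handling omitted, and it assumes a liburing recent enough to have
the *_direct helpers):

#include <liburing.h>

/*
 * Hypothetical sketch of the pattern described above, purely for
 * illustration. Error handling omitted.
 */
static void setup(struct io_uring *ring, int listen_fd)
{
	struct io_uring_sqe *sqe;

	/* sparse registered file table for connections */
	io_uring_register_files_sparse(ring, 1024);

	/*
	 * Long running multishot accept: it may stay in flight for the
	 * lifetime of the application and holds a reference to the rsrc
	 * node that was current when it was issued.
	 */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_multishot_accept_direct(sqe, listen_fd, NULL, NULL, 0);
	io_uring_submit(ring);
}

/* later, when the connection in registered slot 'idx' is finished */
static void close_conn(struct io_uring *ring, unsigned idx)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_close_direct(sqe, idx);
	io_uring_submit(ring);
	/*
	 * The slot can be reused, but the old struct file is only put once
	 * the rsrc nodes queued behind the long running accept are
	 * released, so the number of open files keeps growing.
	 */
}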

To fix this I propose retargeting the rsrc nodes of ongoing requests to
the current rsrc node of the io_uring. This only needs to happen for long
running request types, specifically those that complete as a result of
some external event. For this reason I exclude write/send style ops for
the time being: even though they can cause this issue in principle, in
practice it would be unexpected to have a write block for hours. This
support can obviously be added later if needed.

In order to retarget nodes, all the outstanding requests (in both the poll
tables and the io-wq) need to be iterated, and each request needs to be
checked to make sure the retargeting is valid. For example, for FIXED_FILE
requests this involves ensuring the file is still referenced in the current
node. This O(N) operation seems to take ~1ms locally for 30k outstanding
requests. Note that it holds the io_uring lock while it runs, so no other
work can occur. To amortize this cost slightly, I propose running the
operation at most every 60 seconds. The period is currently hard coded, but
I would be happy to take suggestions if it should be customizable (and how
to do such a thing).

Without a customizable retargeting period, it's a bit difficult to submit
tests for this. I have a test, but it takes many minutes to run, which is
not going to be acceptable for liburing.

Patches 1-5 are the basic io_uring infrastructure
Patch 6 is a helper function used in the per op customisations
Patch 7 splits out the zerocopy specific field in io_sr_msg
Patches 8-12 are opcode implementations for retargeting

Dylan Yudaken (12):
  io_uring: infrastructure for retargeting rsrc nodes
  io_uring: io-wq helper to iterate all work
  io_uring: support retargeting rsrc on requests in the io-wq
  io_uring: reschedule retargeting at shutdown of ring
  io_uring: add tracing for io_uring_rsrc_retarget
  io_uring: add fixed file peeking function
  io_uring: split send_zc specific struct out of io_sr_msg
  io_uring: recv/recvmsg retarget_rsrc support
  io_uring: accept retarget_rsrc support
  io_uring: read retarget_rsrc support
  io_uring: read_fixed retarget_rsrc support
  io_uring: poll_add retarget_rsrc support

 include/linux/io_uring_types.h  |   2 +
 include/trace/events/io_uring.h |  30 +++++++
 io_uring/io-wq.c                |  49 +++++++++++
 io_uring/io-wq.h                |   3 +
 io_uring/io_uring.c             |  28 ++++--
 io_uring/io_uring.h             |   1 +
 io_uring/net.c                  | 114 ++++++++++++++++--------
 io_uring/net.h                  |   2 +
 io_uring/opdef.c                |   7 ++
 io_uring/opdef.h                |   1 +
 io_uring/poll.c                 |  12 +++
 io_uring/poll.h                 |   2 +
 io_uring/rsrc.c                 | 148 ++++++++++++++++++++++++++++++++
 io_uring/rsrc.h                 |   2 +
 io_uring/rw.c                   |  29 +++++++
 io_uring/rw.h                   |   2 +
 16 files changed, 390 insertions(+), 42 deletions(-)


base-commit: 30209debe98b6f66b13591e59e5272cb65b3945e
-- 
2.30.2



* [PATCH for-next 01/12] io_uring: infrastructure for retargeting rsrc nodes
  2022-10-31 13:41 [PATCH for-next 00/12] io_uring: retarget rsrc nodes periodically Dylan Yudaken
@ 2022-10-31 13:41 ` Dylan Yudaken
  2022-10-31 16:02   ` Jens Axboe
  2022-11-02 11:20   ` Pavel Begunkov
  2022-10-31 13:41 ` [PATCH for-next 02/12] io_uring: io-wq helper to iterate all work Dylan Yudaken
                   ` (11 subsequent siblings)
  12 siblings, 2 replies; 30+ messages in thread
From: Dylan Yudaken @ 2022-10-31 13:41 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov; +Cc: io-uring, kernel-team, Dylan Yudaken

rsrc node cleanup can be indefinitely delayed when there are long lived
requests. For example, if a file is located in the same rsrc node as a
long lived socket with a multishot poll, then even if the file is
unregistered it will not be closed while the poll request is still active.

Introduce a timer when the rsrc node is switched, so that periodically we
can retarget these long lived requests to the newest node. That allows the
old nodes to be cleaned up, freeing resources.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
---
 include/linux/io_uring_types.h |  2 +
 io_uring/io_uring.c            |  1 +
 io_uring/opdef.h               |  1 +
 io_uring/rsrc.c                | 92 ++++++++++++++++++++++++++++++++++
 io_uring/rsrc.h                |  1 +
 5 files changed, 97 insertions(+)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index f5b687a787a3..1d4eff4e632c 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -327,6 +327,8 @@ struct io_ring_ctx {
 	struct llist_head		rsrc_put_llist;
 	struct list_head		rsrc_ref_list;
 	spinlock_t			rsrc_ref_lock;
+	struct delayed_work		rsrc_retarget_work;
+	bool				rsrc_retarget_scheduled;
 
 	struct list_head		io_buffers_pages;
 
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 6cc16e39b27f..ea2260359c56 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -320,6 +320,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	spin_lock_init(&ctx->rsrc_ref_lock);
 	INIT_LIST_HEAD(&ctx->rsrc_ref_list);
 	INIT_DELAYED_WORK(&ctx->rsrc_put_work, io_rsrc_put_work);
+	INIT_DELAYED_WORK(&ctx->rsrc_retarget_work, io_rsrc_retarget_work);
 	init_llist_head(&ctx->rsrc_put_llist);
 	init_llist_head(&ctx->work_llist);
 	INIT_LIST_HEAD(&ctx->tctx_list);
diff --git a/io_uring/opdef.h b/io_uring/opdef.h
index 3efe06d25473..1b72b14cb5ab 100644
--- a/io_uring/opdef.h
+++ b/io_uring/opdef.h
@@ -37,6 +37,7 @@ struct io_op_def {
 	int (*prep_async)(struct io_kiocb *);
 	void (*cleanup)(struct io_kiocb *);
 	void (*fail)(struct io_kiocb *);
+	bool (*can_retarget_rsrc)(struct io_kiocb *);
 };
 
 extern const struct io_op_def io_op_defs[];
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 55d4ab96fb92..106210e0d5d5 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -15,6 +15,7 @@
 #include "io_uring.h"
 #include "openclose.h"
 #include "rsrc.h"
+#include "opdef.h"
 
 struct io_rsrc_update {
 	struct file			*file;
@@ -204,6 +205,95 @@ void io_rsrc_put_work(struct work_struct *work)
 	}
 }
 
+
+static unsigned int io_rsrc_retarget_req(struct io_ring_ctx *ctx,
+					 struct io_kiocb *req)
+	__must_hold(&ctx->uring_lock)
+{
+	if (!req->rsrc_node ||
+	     req->rsrc_node == ctx->rsrc_node)
+		return 0;
+	if (!io_op_defs[req->opcode].can_retarget_rsrc)
+		return 0;
+	if (!(*io_op_defs[req->opcode].can_retarget_rsrc)(req))
+		return 0;
+
+	io_rsrc_put_node(req->rsrc_node, 1);
+	req->rsrc_node = ctx->rsrc_node;
+	return 1;
+}
+
+static unsigned int io_rsrc_retarget_table(struct io_ring_ctx *ctx,
+				   struct io_hash_table *table)
+{
+	unsigned int nr_buckets = 1U << table->hash_bits;
+	unsigned int refs = 0;
+	struct io_kiocb *req;
+	int i;
+
+	for (i = 0; i < nr_buckets; i++) {
+		struct io_hash_bucket *hb = &table->hbs[i];
+
+		spin_lock(&hb->lock);
+		hlist_for_each_entry(req, &hb->list, hash_node)
+			refs += io_rsrc_retarget_req(ctx, req);
+		spin_unlock(&hb->lock);
+	}
+	return refs;
+}
+
+static void io_rsrc_retarget_schedule(struct io_ring_ctx *ctx)
+	__must_hold(&ctx->uring_lock)
+{
+	percpu_ref_get(&ctx->refs);
+	mod_delayed_work(system_wq, &ctx->rsrc_retarget_work, 60 * HZ);
+	ctx->rsrc_retarget_scheduled = true;
+}
+
+static void __io_rsrc_retarget_work(struct io_ring_ctx *ctx)
+	__must_hold(&ctx->uring_lock)
+{
+	struct io_rsrc_node *node;
+	unsigned int refs;
+	bool any_waiting;
+
+	if (!ctx->rsrc_node)
+		return;
+
+	spin_lock_irq(&ctx->rsrc_ref_lock);
+	any_waiting = false;
+	list_for_each_entry(node, &ctx->rsrc_ref_list, node) {
+		if (!node->done) {
+			any_waiting = true;
+			break;
+		}
+	}
+	spin_unlock_irq(&ctx->rsrc_ref_lock);
+
+	if (!any_waiting)
+		return;
+
+	refs = io_rsrc_retarget_table(ctx, &ctx->cancel_table);
+	refs += io_rsrc_retarget_table(ctx, &ctx->cancel_table_locked);
+
+	ctx->rsrc_cached_refs -= refs;
+	while (unlikely(ctx->rsrc_cached_refs < 0))
+		io_rsrc_refs_refill(ctx);
+}
+
+void io_rsrc_retarget_work(struct work_struct *work)
+{
+	struct io_ring_ctx *ctx;
+
+	ctx = container_of(work, struct io_ring_ctx, rsrc_retarget_work.work);
+
+	mutex_lock(&ctx->uring_lock);
+	ctx->rsrc_retarget_scheduled = false;
+	__io_rsrc_retarget_work(ctx);
+	mutex_unlock(&ctx->uring_lock);
+	percpu_ref_put(&ctx->refs);
+}
+
 void io_wait_rsrc_data(struct io_rsrc_data *data)
 {
 	if (data && !atomic_dec_and_test(&data->refs))
@@ -285,6 +375,8 @@ void io_rsrc_node_switch(struct io_ring_ctx *ctx,
 		atomic_inc(&data_to_kill->refs);
 		percpu_ref_kill(&rsrc_node->refs);
 		ctx->rsrc_node = NULL;
+		if (!ctx->rsrc_retarget_scheduled)
+			io_rsrc_retarget_schedule(ctx);
 	}
 
 	if (!ctx->rsrc_node) {
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 81445a477622..2b94df8fd9e8 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -54,6 +54,7 @@ struct io_mapped_ubuf {
 };
 
 void io_rsrc_put_work(struct work_struct *work);
+void io_rsrc_retarget_work(struct work_struct *work);
 void io_rsrc_refs_refill(struct io_ring_ctx *ctx);
 void io_wait_rsrc_data(struct io_rsrc_data *data);
 void io_rsrc_node_destroy(struct io_rsrc_node *ref_node);
-- 
2.30.2



* [PATCH for-next 02/12] io_uring: io-wq helper to iterate all work
  2022-10-31 13:41 [PATCH for-next 00/12] io_uring: retarget rsrc nodes periodically Dylan Yudaken
  2022-10-31 13:41 ` [PATCH for-next 01/12] io_uring: infrastructure for retargeting rsrc nodes Dylan Yudaken
@ 2022-10-31 13:41 ` Dylan Yudaken
  2022-10-31 13:41 ` [PATCH for-next 03/12] io_uring: support retargeting rsrc on requests in the io-wq Dylan Yudaken
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 30+ messages in thread
From: Dylan Yudaken @ 2022-10-31 13:41 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov; +Cc: io-uring, kernel-team, Dylan Yudaken

Add a helper to iterate all work currently queued on an io-wq.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
---
 io_uring/io-wq.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++++
 io_uring/io-wq.h |  3 +++
 2 files changed, 52 insertions(+)

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 6f1d0e5df23a..47cbe2df05c4 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -38,6 +38,11 @@ enum {
 	IO_ACCT_STALLED_BIT	= 0,	/* stalled on hash */
 };
 
+struct io_for_each_work_data {
+	work_for_each_fn	*cb;
+	void			*data;
+};
+
 /*
  * One for each thread in a wqe pool
  */
@@ -856,6 +861,19 @@ static bool io_wq_for_each_worker(struct io_wqe *wqe,
 	return ret;
 }
 
+static bool io_wq_for_each_work_cb(struct io_worker *w, void *data)
+{
+	struct io_for_each_work_data *f = data;
+
+	raw_spin_lock(&w->lock);
+	if (w->cur_work)
+		f->cb(w->cur_work, f->data);
+	if (w->next_work)
+		f->cb(w->next_work, f->data);
+	raw_spin_unlock(&w->lock);
+	return false;
+}
+
 static bool io_wq_worker_wake(struct io_worker *worker, void *data)
 {
 	__set_notify_signal(worker->task);
@@ -1113,6 +1131,37 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
 	return IO_WQ_CANCEL_NOTFOUND;
 }
 
+void io_wq_for_each(struct io_wq *wq, work_for_each_fn *cb, void *data)
+{
+	int node, i;
+	struct io_for_each_work_data wq_data = {
+		.cb = cb,
+		.data = data
+	};
+
+	for_each_node(node) {
+		struct io_wqe *wqe = wq->wqes[node];
+
+		for (i = 0; i < IO_WQ_ACCT_NR; i++) {
+			struct io_wqe_acct *acct = io_get_acct(wqe, i == 0);
+			struct io_wq_work_node *node, *prev;
+			struct io_wq_work *work;
+
+			raw_spin_lock(&acct->lock);
+			wq_list_for_each(node, prev, &acct->work_list) {
+				work = container_of(node, struct io_wq_work, list);
+				cb(work, data);
+			}
+			raw_spin_unlock(&acct->lock);
+		}
+
+
+		raw_spin_lock(&wqe->lock);
+		io_wq_for_each_worker(wqe, io_wq_for_each_work_cb, &wq_data);
+		raw_spin_unlock(&wqe->lock);
+	}
+}
+
 static int io_wqe_hash_wake(struct wait_queue_entry *wait, unsigned mode,
 			    int sync, void *key)
 {
diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
index 31228426d192..163cb12259b0 100644
--- a/io_uring/io-wq.h
+++ b/io_uring/io-wq.h
@@ -63,6 +63,9 @@ typedef bool (work_cancel_fn)(struct io_wq_work *, void *);
 enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
 					void *data, bool cancel_all);
 
+typedef void (work_for_each_fn)(struct io_wq_work *, void *);
+void io_wq_for_each(struct io_wq *wq, work_for_each_fn *cb, void *data);
+
 #if defined(CONFIG_IO_WQ)
 extern void io_wq_worker_sleeping(struct task_struct *);
 extern void io_wq_worker_running(struct task_struct *);
-- 
2.30.2



* [PATCH for-next 03/12] io_uring: support retargeting rsrc on requests in the io-wq
  2022-10-31 13:41 [PATCH for-next 00/12] io_uring: retarget rsrc nodes periodically Dylan Yudaken
  2022-10-31 13:41 ` [PATCH for-next 01/12] io_uring: infrastructure for retargeting rsrc nodes Dylan Yudaken
  2022-10-31 13:41 ` [PATCH for-next 02/12] io_uring: io-wq helper to iterate all work Dylan Yudaken
@ 2022-10-31 13:41 ` Dylan Yudaken
  2022-10-31 18:19   ` Jens Axboe
  2022-10-31 13:41 ` [PATCH for-next 04/12] io_uring: reschedule retargeting at shutdown of ring Dylan Yudaken
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 30+ messages in thread
From: Dylan Yudaken @ 2022-10-31 13:41 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov; +Cc: io-uring, kernel-team, Dylan Yudaken

Requests can be in flight on the io-wq, and they can be long lived (for
example a double read will end up on the io-wq). So make sure to retarget
the rsrc nodes of those requests as well.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
---
 io_uring/rsrc.c | 46 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 106210e0d5d5..8d0d40713a63 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -16,6 +16,7 @@
 #include "openclose.h"
 #include "rsrc.h"
 #include "opdef.h"
+#include "tctx.h"
 
 struct io_rsrc_update {
 	struct file			*file;
@@ -24,6 +25,11 @@ struct io_rsrc_update {
 	u32				offset;
 };
 
+struct io_retarget_data {
+	struct io_ring_ctx		*ctx;
+	unsigned int			refs;
+};
+
 static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 				  struct io_mapped_ubuf **pimu,
 				  struct page **last_hpage);
@@ -250,11 +256,42 @@ static void io_rsrc_retarget_schedule(struct io_ring_ctx *ctx)
 	ctx->rsrc_retarget_scheduled = true;
 }
 
+static void io_retarget_rsrc_wq_cb(struct io_wq_work *work, void *data)
+{
+	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
+	struct io_retarget_data *rd = data;
+
+	if (req->ctx != rd->ctx)
+		return;
+
+	rd->refs += io_rsrc_retarget_req(rd->ctx, req);
+}
+
+static void io_rsrc_retarget_wq(struct io_retarget_data *data)
+	__must_hold(&data->ctx->uring_lock)
+{
+	struct io_ring_ctx *ctx = data->ctx;
+	struct io_tctx_node *node;
+
+	list_for_each_entry(node, &ctx->tctx_list, ctx_node) {
+		struct io_uring_task *tctx = node->task->io_uring;
+
+		if (!tctx->io_wq)
+			continue;
+
+		io_wq_for_each(tctx->io_wq, io_retarget_rsrc_wq_cb, data);
+	}
+}
+
 static void __io_rsrc_retarget_work(struct io_ring_ctx *ctx)
 	__must_hold(&ctx->uring_lock)
 {
 	struct io_rsrc_node *node;
-	unsigned int refs;
+	struct io_retarget_data data = {
+		.ctx = ctx,
+		.refs = 0
+	};
+	unsigned int poll_refs;
 	bool any_waiting;
 
 	if (!ctx->rsrc_node)
@@ -273,10 +310,11 @@ static void __io_rsrc_retarget_work(struct io_ring_ctx *ctx)
 	if (!any_waiting)
 		return;
 
-	refs = io_rsrc_retarget_table(ctx, &ctx->cancel_table);
-	refs += io_rsrc_retarget_table(ctx, &ctx->cancel_table_locked);
+	poll_refs = io_rsrc_retarget_table(ctx, &ctx->cancel_table);
+	poll_refs += io_rsrc_retarget_table(ctx, &ctx->cancel_table_locked);
+	io_rsrc_retarget_wq(&data);
 
-	ctx->rsrc_cached_refs -= refs;
+	ctx->rsrc_cached_refs -= (poll_refs + data.refs);
 	while (unlikely(ctx->rsrc_cached_refs < 0))
 		io_rsrc_refs_refill(ctx);
 }
-- 
2.30.2



* [PATCH for-next 04/12] io_uring: reschedule retargeting at shutdown of ring
  2022-10-31 13:41 [PATCH for-next 00/12] io_uring: retarget rsrc nodes periodically Dylan Yudaken
                   ` (2 preceding siblings ...)
  2022-10-31 13:41 ` [PATCH for-next 03/12] io_uring: support retargeting rsrc on requests in the io-wq Dylan Yudaken
@ 2022-10-31 13:41 ` Dylan Yudaken
  2022-10-31 16:02   ` Jens Axboe
  2022-10-31 13:41 ` [PATCH for-next 05/12] io_uring: add tracing for io_uring_rsrc_retarget Dylan Yudaken
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 30+ messages in thread
From: Dylan Yudaken @ 2022-10-31 13:41 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov; +Cc: io-uring, kernel-team, Dylan Yudaken

When the ring shuts down, instead of waiting for the retarget work to
release its reference, reschedule it to run immediately and get the
reference back that way.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
---
 io_uring/io_uring.c |  1 +
 io_uring/rsrc.c     | 26 +++++++++++++++++++++-----
 io_uring/rsrc.h     |  1 +
 3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index ea2260359c56..32eb305c4ce7 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2751,6 +2751,7 @@ static __cold void io_ring_exit_work(struct work_struct *work)
 		}
 
 		io_req_caches_free(ctx);
+		io_rsrc_retarget_exiting(ctx);
 
 		if (WARN_ON_ONCE(time_after(jiffies, timeout))) {
 			/* there is little hope left, don't run it too often */
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 8d0d40713a63..40b37899e943 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -248,12 +248,20 @@ static unsigned int io_rsrc_retarget_table(struct io_ring_ctx *ctx,
 	return refs;
 }
 
-static void io_rsrc_retarget_schedule(struct io_ring_ctx *ctx)
+static void io_rsrc_retarget_schedule(struct io_ring_ctx *ctx, bool delay)
 	__must_hold(&ctx->uring_lock)
 {
-	percpu_ref_get(&ctx->refs);
-	mod_delayed_work(system_wq, &ctx->rsrc_retarget_work, 60 * HZ);
-	ctx->rsrc_retarget_scheduled = true;
+	unsigned long del;
+
+	if (delay)
+		del = 60 * HZ;
+	else
+		del = 0;
+
+	if (likely(!mod_delayed_work(system_wq, &ctx->rsrc_retarget_work, del))) {
+		percpu_ref_get(&ctx->refs);
+		ctx->rsrc_retarget_scheduled = true;
+	}
 }
 
 static void io_retarget_rsrc_wq_cb(struct io_wq_work *work, void *data)
@@ -332,6 +340,14 @@ void io_rsrc_retarget_work(struct work_struct *work)
 	percpu_ref_put(&ctx->refs);
 }
 
+void io_rsrc_retarget_exiting(struct io_ring_ctx *ctx)
+{
+	mutex_lock(&ctx->uring_lock);
+	if (ctx->rsrc_retarget_scheduled)
+		io_rsrc_retarget_schedule(ctx, false);
+	mutex_unlock(&ctx->uring_lock);
+}
+
 void io_wait_rsrc_data(struct io_rsrc_data *data)
 {
 	if (data && !atomic_dec_and_test(&data->refs))
@@ -414,7 +430,7 @@ void io_rsrc_node_switch(struct io_ring_ctx *ctx,
 		percpu_ref_kill(&rsrc_node->refs);
 		ctx->rsrc_node = NULL;
 		if (!ctx->rsrc_retarget_scheduled)
-			io_rsrc_retarget_schedule(ctx);
+			io_rsrc_retarget_schedule(ctx, true);
 	}
 
 	if (!ctx->rsrc_node) {
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 2b94df8fd9e8..93c66475796e 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -55,6 +55,7 @@ struct io_mapped_ubuf {
 
 void io_rsrc_put_work(struct work_struct *work);
 void io_rsrc_retarget_work(struct work_struct *work);
+void io_rsrc_retarget_exiting(struct io_ring_ctx *ctx);
 void io_rsrc_refs_refill(struct io_ring_ctx *ctx);
 void io_wait_rsrc_data(struct io_rsrc_data *data);
 void io_rsrc_node_destroy(struct io_rsrc_node *ref_node);
-- 
2.30.2



* [PATCH for-next 05/12] io_uring: add tracing for io_uring_rsrc_retarget
  2022-10-31 13:41 [PATCH for-next 00/12] io_uring: retarget rsrc nodes periodically Dylan Yudaken
                   ` (3 preceding siblings ...)
  2022-10-31 13:41 ` [PATCH for-next 04/12] io_uring: reschedule retargeting at shutdown of ring Dylan Yudaken
@ 2022-10-31 13:41 ` Dylan Yudaken
  2022-10-31 13:41 ` [PATCH for-next 06/12] io_uring: add fixed file peeking function Dylan Yudaken
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 30+ messages in thread
From: Dylan Yudaken @ 2022-10-31 13:41 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov; +Cc: io-uring, kernel-team, Dylan Yudaken

Add event tracing to show how many poll/wq requests were retargeted

Signed-off-by: Dylan Yudaken <dylany@meta.com>
---
 include/trace/events/io_uring.h | 30 ++++++++++++++++++++++++++++++
 io_uring/rsrc.c                 |  2 ++
 2 files changed, 32 insertions(+)

diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h
index 936fd41bf147..b47be89dd270 100644
--- a/include/trace/events/io_uring.h
+++ b/include/trace/events/io_uring.h
@@ -684,6 +684,36 @@ TRACE_EVENT(io_uring_local_work_run,
 	TP_printk("ring %p, count %d, loops %u", __entry->ctx, __entry->count, __entry->loops)
 );
 
+/*
+ * io_uring_rsrc_retarget - ran a rsrc retarget
+ *
+ * @ctx:		pointer to a io_uring_ctx
+ * @poll:		how many retargeted that were polling
+ * @wq:			how many retargeted that were in the wq
+ *
+ */
+TRACE_EVENT(io_uring_rsrc_retarget,
+
+	TP_PROTO(void *ctx, unsigned int poll, unsigned int wq),
+
+	TP_ARGS(ctx, poll, wq),
+
+	TP_STRUCT__entry(
+		__field(void *,		ctx)
+		__field(unsigned int,	poll)
+		__field(unsigned int,	wq)
+	),
+
+	TP_fast_assign(
+		__entry->ctx		= ctx;
+		__entry->poll		= poll;
+		__entry->wq		= wq;
+	),
+
+	TP_printk("ring %p, poll %u, wq %u",
+		  __entry->ctx, __entry->poll, __entry->wq)
+);
+
 #endif /* _TRACE_IO_URING_H */
 
 /* This part must be outside protection */
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 40b37899e943..00402533cee5 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -325,6 +325,8 @@ static void __io_rsrc_retarget_work(struct io_ring_ctx *ctx)
 	ctx->rsrc_cached_refs -= (poll_refs + data.refs);
 	while (unlikely(ctx->rsrc_cached_refs < 0))
 		io_rsrc_refs_refill(ctx);
+
+	trace_io_uring_rsrc_retarget(ctx, poll_refs, data.refs);
 }
 
 void io_rsrc_retarget_work(struct work_struct *work)
-- 
2.30.2



* [PATCH for-next 06/12] io_uring: add fixed file peeking function
  2022-10-31 13:41 [PATCH for-next 00/12] io_uring: retarget rsrc nodes periodically Dylan Yudaken
                   ` (4 preceding siblings ...)
  2022-10-31 13:41 ` [PATCH for-next 05/12] io_uring: add tracing for io_uring_rsrc_retarget Dylan Yudaken
@ 2022-10-31 13:41 ` Dylan Yudaken
  2022-10-31 16:04   ` Jens Axboe
  2022-11-02 11:23   ` Pavel Begunkov
  2022-10-31 13:41 ` [PATCH for-next 07/12] io_uring: split send_zc specific struct out of io_sr_msg Dylan Yudaken
                   ` (6 subsequent siblings)
  12 siblings, 2 replies; 30+ messages in thread
From: Dylan Yudaken @ 2022-10-31 13:41 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov; +Cc: io-uring, kernel-team, Dylan Yudaken

Add a helper function to grab the fixed file at a given offset. This will
be useful for the retarget op handlers.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
---
 io_uring/io_uring.c | 26 ++++++++++++++++++++------
 io_uring/io_uring.h |  1 +
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 32eb305c4ce7..a052653fc65e 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1841,6 +1841,23 @@ void io_wq_submit_work(struct io_wq_work *work)
 		io_req_task_queue_fail(req, ret);
 }
 
+static unsigned long __io_file_peek_fixed(struct io_kiocb *req, int fd)
+	__must_hold(&req->ctx->uring_lock)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+
+	if (unlikely((unsigned int)fd >= ctx->nr_user_files))
+		return 0;
+	fd = array_index_nospec(fd, ctx->nr_user_files);
+	return io_fixed_file_slot(&ctx->file_table, fd)->file_ptr;
+}
+
+struct file *io_file_peek_fixed(struct io_kiocb *req, int fd)
+	__must_hold(&req->ctx->uring_lock)
+{
+	return (struct file *) (__io_file_peek_fixed(req, fd) & FFS_MASK);
+}
+
 inline struct file *io_file_get_fixed(struct io_kiocb *req, int fd,
 				      unsigned int issue_flags)
 {
@@ -1849,17 +1866,14 @@ inline struct file *io_file_get_fixed(struct io_kiocb *req, int fd,
 	unsigned long file_ptr;
 
 	io_ring_submit_lock(ctx, issue_flags);
-
-	if (unlikely((unsigned int)fd >= ctx->nr_user_files))
-		goto out;
-	fd = array_index_nospec(fd, ctx->nr_user_files);
-	file_ptr = io_fixed_file_slot(&ctx->file_table, fd)->file_ptr;
+	file_ptr = __io_file_peek_fixed(req, fd);
 	file = (struct file *) (file_ptr & FFS_MASK);
 	file_ptr &= ~FFS_MASK;
 	/* mask in overlapping REQ_F and FFS bits */
 	req->flags |= (file_ptr << REQ_F_SUPPORT_NOWAIT_BIT);
 	io_req_set_rsrc_node(req, ctx, 0);
-out:
+	WARN_ON_ONCE(file && !test_bit(fd, ctx->file_table.bitmap));
+
 	io_ring_submit_unlock(ctx, issue_flags);
 	return file;
 }
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index ef77d2aa3172..781471bfba12 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -44,6 +44,7 @@ struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages);
 struct file *io_file_get_normal(struct io_kiocb *req, int fd);
 struct file *io_file_get_fixed(struct io_kiocb *req, int fd,
 			       unsigned issue_flags);
+struct file *io_file_peek_fixed(struct io_kiocb *req, int fd);
 
 static inline bool io_req_ffs_set(struct io_kiocb *req)
 {
-- 
2.30.2



* [PATCH for-next 07/12] io_uring: split send_zc specific struct out of io_sr_msg
  2022-10-31 13:41 [PATCH for-next 00/12] io_uring: retarget rsrc nodes periodically Dylan Yudaken
                   ` (5 preceding siblings ...)
  2022-10-31 13:41 ` [PATCH for-next 06/12] io_uring: add fixed file peeking function Dylan Yudaken
@ 2022-10-31 13:41 ` Dylan Yudaken
  2022-11-02 11:32   ` Pavel Begunkov
  2022-10-31 13:41 ` [PATCH for-next 08/12] io_uring: recv/recvmsg retarget_rsrc support Dylan Yudaken
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 30+ messages in thread
From: Dylan Yudaken @ 2022-10-31 13:41 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov; +Cc: io-uring, kernel-team, Dylan Yudaken

Split the send_zc specific parts out of struct io_sr_msg, as other
opcodes are also going to be specialized.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
---
 io_uring/net.c | 77 +++++++++++++++++++++++++++-----------------------
 1 file changed, 42 insertions(+), 35 deletions(-)

diff --git a/io_uring/net.c b/io_uring/net.c
index 15dea91625e2..f4638e79a022 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -63,10 +63,14 @@ struct io_sr_msg {
 	/* initialised and used only by !msg send variants */
 	u16				addr_len;
 	void __user			*addr;
-	/* used only for send zerocopy */
-	struct io_kiocb 		*notif;
 };
 
+struct io_send_zc_msg {
+	struct io_sr_msg	sr;
+	struct io_kiocb		*notif;
+};
+
+
 #define IO_APOLL_MULTI_POLLED (REQ_F_APOLL_MULTISHOT | REQ_F_POLLED)
 
 int io_shutdown_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
@@ -910,7 +914,7 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
 
 void io_send_zc_cleanup(struct io_kiocb *req)
 {
-	struct io_sr_msg *zc = io_kiocb_to_cmd(req, struct io_sr_msg);
+	struct io_send_zc_msg *zc = io_kiocb_to_cmd(req, struct io_send_zc_msg);
 	struct io_async_msghdr *io;
 
 	if (req_has_async_data(req)) {
@@ -927,8 +931,9 @@ void io_send_zc_cleanup(struct io_kiocb *req)
 
 int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
-	struct io_sr_msg *zc = io_kiocb_to_cmd(req, struct io_sr_msg);
+	struct io_send_zc_msg *zc = io_kiocb_to_cmd(req, struct io_send_zc_msg);
 	struct io_ring_ctx *ctx = req->ctx;
+	struct io_sr_msg *sr = &zc->sr;
 	struct io_kiocb *notif;
 
 	if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
@@ -937,8 +942,8 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (req->flags & REQ_F_CQE_SKIP)
 		return -EINVAL;
 
-	zc->flags = READ_ONCE(sqe->ioprio);
-	if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST |
+	sr->flags = READ_ONCE(sqe->ioprio);
+	if (sr->flags & ~(IORING_RECVSEND_POLL_FIRST |
 			  IORING_RECVSEND_FIXED_BUF))
 		return -EINVAL;
 	notif = zc->notif = io_alloc_notif(ctx);
@@ -948,7 +953,7 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	notif->cqe.res = 0;
 	notif->cqe.flags = IORING_CQE_F_NOTIF;
 	req->flags |= REQ_F_NEED_CLEANUP;
-	if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
+	if (sr->flags & IORING_RECVSEND_FIXED_BUF) {
 		unsigned idx = READ_ONCE(sqe->buf_index);
 
 		if (unlikely(idx >= ctx->nr_user_bufs))
@@ -961,26 +966,26 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (req->opcode == IORING_OP_SEND_ZC) {
 		if (READ_ONCE(sqe->__pad3[0]))
 			return -EINVAL;
-		zc->addr = u64_to_user_ptr(READ_ONCE(sqe->addr2));
-		zc->addr_len = READ_ONCE(sqe->addr_len);
+		sr->addr = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+		sr->addr_len = READ_ONCE(sqe->addr_len);
 	} else {
 		if (unlikely(sqe->addr2 || sqe->file_index))
 			return -EINVAL;
-		if (unlikely(zc->flags & IORING_RECVSEND_FIXED_BUF))
+		if (unlikely(sr->flags & IORING_RECVSEND_FIXED_BUF))
 			return -EINVAL;
 	}
 
-	zc->buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
-	zc->len = READ_ONCE(sqe->len);
-	zc->msg_flags = READ_ONCE(sqe->msg_flags) | MSG_NOSIGNAL;
-	if (zc->msg_flags & MSG_DONTWAIT)
+	sr->buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	sr->len = READ_ONCE(sqe->len);
+	sr->msg_flags = READ_ONCE(sqe->msg_flags) | MSG_NOSIGNAL;
+	if (sr->msg_flags & MSG_DONTWAIT)
 		req->flags |= REQ_F_NOWAIT;
 
-	zc->done_io = 0;
+	sr->done_io = 0;
 
 #ifdef CONFIG_COMPAT
 	if (req->ctx->compat)
-		zc->msg_flags |= MSG_CMSG_COMPAT;
+		sr->msg_flags |= MSG_CMSG_COMPAT;
 #endif
 	return 0;
 }
@@ -1046,7 +1051,8 @@ static int io_sg_from_iter(struct sock *sk, struct sk_buff *skb,
 int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct sockaddr_storage __address;
-	struct io_sr_msg *zc = io_kiocb_to_cmd(req, struct io_sr_msg);
+	struct io_send_zc_msg *zc = io_kiocb_to_cmd(req, struct io_send_zc_msg);
+	struct io_sr_msg *sr = &zc->sr;
 	struct msghdr msg;
 	struct iovec iov;
 	struct socket *sock;
@@ -1064,42 +1070,42 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
 	msg.msg_controllen = 0;
 	msg.msg_namelen = 0;
 
-	if (zc->addr) {
+	if (sr->addr) {
 		if (req_has_async_data(req)) {
 			struct io_async_msghdr *io = req->async_data;
 
 			msg.msg_name = &io->addr;
 		} else {
-			ret = move_addr_to_kernel(zc->addr, zc->addr_len, &__address);
+			ret = move_addr_to_kernel(sr->addr, sr->addr_len, &__address);
 			if (unlikely(ret < 0))
 				return ret;
 			msg.msg_name = (struct sockaddr *)&__address;
 		}
-		msg.msg_namelen = zc->addr_len;
+		msg.msg_namelen = sr->addr_len;
 	}
 
 	if (!(req->flags & REQ_F_POLLED) &&
-	    (zc->flags & IORING_RECVSEND_POLL_FIRST))
+	    (sr->flags & IORING_RECVSEND_POLL_FIRST))
 		return io_setup_async_addr(req, &__address, issue_flags);
 
-	if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
+	if (sr->flags & IORING_RECVSEND_FIXED_BUF) {
 		ret = io_import_fixed(WRITE, &msg.msg_iter, req->imu,
-					(u64)(uintptr_t)zc->buf, zc->len);
+					(u64)(uintptr_t)sr->buf, sr->len);
 		if (unlikely(ret))
 			return ret;
 		msg.sg_from_iter = io_sg_from_iter;
 	} else {
-		ret = import_single_range(WRITE, zc->buf, zc->len, &iov,
+		ret = import_single_range(WRITE, sr->buf, sr->len, &iov,
 					  &msg.msg_iter);
 		if (unlikely(ret))
 			return ret;
-		ret = io_notif_account_mem(zc->notif, zc->len);
+		ret = io_notif_account_mem(zc->notif, sr->len);
 		if (unlikely(ret))
 			return ret;
 		msg.sg_from_iter = io_sg_from_iter_iovec;
 	}
 
-	msg_flags = zc->msg_flags | MSG_ZEROCOPY;
+	msg_flags = sr->msg_flags | MSG_ZEROCOPY;
 	if (issue_flags & IO_URING_F_NONBLOCK)
 		msg_flags |= MSG_DONTWAIT;
 	if (msg_flags & MSG_WAITALL)
@@ -1114,9 +1120,9 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
 			return io_setup_async_addr(req, &__address, issue_flags);
 
 		if (ret > 0 && io_net_retry(sock, msg.msg_flags)) {
-			zc->len -= ret;
-			zc->buf += ret;
-			zc->done_io += ret;
+			sr->len -= ret;
+			sr->buf += ret;
+			sr->done_io += ret;
 			req->flags |= REQ_F_PARTIAL_IO;
 			return io_setup_async_addr(req, &__address, issue_flags);
 		}
@@ -1126,9 +1132,9 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
 	}
 
 	if (ret >= 0)
-		ret += zc->done_io;
-	else if (zc->done_io)
-		ret = zc->done_io;
+		ret += sr->done_io;
+	else if (sr->done_io)
+		ret = sr->done_io;
 
 	/*
 	 * If we're in io-wq we can't rely on tw ordering guarantees, defer
@@ -1144,8 +1150,9 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
 
 int io_sendmsg_zc(struct io_kiocb *req, unsigned int issue_flags)
 {
-	struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
+	struct io_send_zc_msg *zc = io_kiocb_to_cmd(req, struct io_send_zc_msg);
 	struct io_async_msghdr iomsg, *kmsg;
+	struct io_sr_msg *sr = &zc->sr;
 	struct socket *sock;
 	unsigned flags;
 	int ret, min_ret = 0;
@@ -1175,7 +1182,7 @@ int io_sendmsg_zc(struct io_kiocb *req, unsigned int issue_flags)
 	if (flags & MSG_WAITALL)
 		min_ret = iov_iter_count(&kmsg->msg.msg_iter);
 
-	kmsg->msg.msg_ubuf = &io_notif_to_data(sr->notif)->uarg;
+	kmsg->msg.msg_ubuf = &io_notif_to_data(zc->notif)->uarg;
 	kmsg->msg.sg_from_iter = io_sg_from_iter_iovec;
 	ret = __sys_sendmsg_sock(sock, &kmsg->msg, flags);
 
@@ -1209,7 +1216,7 @@ int io_sendmsg_zc(struct io_kiocb *req, unsigned int issue_flags)
 	 * flushing notif to io_send_zc_cleanup()
 	 */
 	if (!(issue_flags & IO_URING_F_UNLOCKED)) {
-		io_notif_flush(sr->notif);
+		io_notif_flush(zc->notif);
 		req->flags &= ~REQ_F_NEED_CLEANUP;
 	}
 	io_req_set_res(req, ret, IORING_CQE_F_MORE);
-- 
2.30.2



* [PATCH for-next 08/12] io_uring: recv/recvmsg retarget_rsrc support
  2022-10-31 13:41 [PATCH for-next 00/12] io_uring: retarget rsrc nodes periodically Dylan Yudaken
                   ` (6 preceding siblings ...)
  2022-10-31 13:41 ` [PATCH for-next 07/12] io_uring: split send_zc specific struct out of io_sr_msg Dylan Yudaken
@ 2022-10-31 13:41 ` Dylan Yudaken
  2022-10-31 13:41 ` [PATCH for-next 09/12] io_uring: accept " Dylan Yudaken
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 30+ messages in thread
From: Dylan Yudaken @ 2022-10-31 13:41 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov; +Cc: io-uring, kernel-team, Dylan Yudaken

Add can_retarget_rsrc handler for recv/recvmsg

Signed-off-by: Dylan Yudaken <dylany@meta.com>
---
 io_uring/net.c   | 22 +++++++++++++++++++++-
 io_uring/net.h   |  1 +
 io_uring/opdef.c |  2 ++
 3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/io_uring/net.c b/io_uring/net.c
index f4638e79a022..0fa05ef52dd3 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -70,6 +70,11 @@ struct io_send_zc_msg {
 	struct io_kiocb		*notif;
 };
 
+struct io_recv_msg {
+	struct io_sr_msg	sr;
+	int			retarget_fd;
+};
+
 
 #define IO_APOLL_MULTI_POLLED (REQ_F_APOLL_MULTISHOT | REQ_F_POLLED)
 
@@ -547,7 +552,8 @@ int io_recvmsg_prep_async(struct io_kiocb *req)
 
 int io_recvmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
-	struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
+	struct io_recv_msg *rcv = io_kiocb_to_cmd(req, struct io_recv_msg);
+	struct io_sr_msg *sr = &rcv->sr;
 
 	if (unlikely(sqe->file_index || sqe->addr2))
 		return -EINVAL;
@@ -572,6 +578,11 @@ int io_recvmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		req->flags |= REQ_F_APOLL_MULTISHOT;
 	}
 
+	if (req->flags & REQ_F_FIXED_FILE)
+		rcv->retarget_fd = req->cqe.fd;
+	else
+		rcv->retarget_fd = -1;
+
 #ifdef CONFIG_COMPAT
 	if (req->ctx->compat)
 		sr->msg_flags |= MSG_CMSG_COMPAT;
@@ -709,6 +720,15 @@ static int io_recvmsg_multishot(struct socket *sock, struct io_sr_msg *io,
 			kmsg->controllen + err;
 }
 
+bool io_recv_can_retarget_rsrc(struct io_kiocb *req)
+{
+	struct io_recv_msg *rcv = io_kiocb_to_cmd(req, struct io_recv_msg);
+
+	if (rcv->retarget_fd < 0)
+		return false;
+	return io_file_peek_fixed(req, rcv->retarget_fd) == req->file;
+}
+
 int io_recvmsg(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
diff --git a/io_uring/net.h b/io_uring/net.h
index 5ffa11bf5d2e..6b5719084494 100644
--- a/io_uring/net.h
+++ b/io_uring/net.h
@@ -43,6 +43,7 @@ int io_recvmsg_prep_async(struct io_kiocb *req);
 int io_recvmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_recvmsg(struct io_kiocb *req, unsigned int issue_flags);
 int io_recv(struct io_kiocb *req, unsigned int issue_flags);
+bool io_recv_can_retarget_rsrc(struct io_kiocb *req);
 
 void io_sendrecv_fail(struct io_kiocb *req);
 
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 83dc0f9ad3b2..1a0be5681c7b 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -178,6 +178,7 @@ const struct io_op_def io_op_defs[] = {
 		.prep_async		= io_recvmsg_prep_async,
 		.cleanup		= io_sendmsg_recvmsg_cleanup,
 		.fail			= io_sendrecv_fail,
+		.can_retarget_rsrc	= io_recv_can_retarget_rsrc,
 #else
 		.prep			= io_eopnotsupp_prep,
 #endif
@@ -340,6 +341,7 @@ const struct io_op_def io_op_defs[] = {
 		.prep			= io_recvmsg_prep,
 		.issue			= io_recv,
 		.fail			= io_sendrecv_fail,
+		.can_retarget_rsrc	= io_recv_can_retarget_rsrc,
 #else
 		.prep			= io_eopnotsupp_prep,
 #endif
-- 
2.30.2



* [PATCH for-next 09/12] io_uring: accept retarget_rsrc support
  2022-10-31 13:41 [PATCH for-next 00/12] io_uring: retarget rsrc nodes periodically Dylan Yudaken
                   ` (7 preceding siblings ...)
  2022-10-31 13:41 ` [PATCH for-next 08/12] io_uring: recv/recvmsg retarget_rsrc support Dylan Yudaken
@ 2022-10-31 13:41 ` Dylan Yudaken
  2022-10-31 13:41 ` [PATCH for-next 10/12] io_uring: read " Dylan Yudaken
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 30+ messages in thread
From: Dylan Yudaken @ 2022-10-31 13:41 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov; +Cc: io-uring, kernel-team, Dylan Yudaken

Add can_retarget_rsrc handler for accept

Signed-off-by: Dylan Yudaken <dylany@meta.com>
---
 io_uring/net.c   | 15 +++++++++++++++
 io_uring/net.h   |  1 +
 io_uring/opdef.c |  1 +
 3 files changed, 17 insertions(+)

diff --git a/io_uring/net.c b/io_uring/net.c
index 0fa05ef52dd3..429176f3d191 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -30,6 +30,7 @@ struct io_accept {
 	int				flags;
 	u32				file_slot;
 	unsigned long			nofile;
+	int				retarget_fd;
 };
 
 struct io_socket {
@@ -1255,6 +1256,15 @@ void io_sendrecv_fail(struct io_kiocb *req)
 		req->cqe.flags |= IORING_CQE_F_MORE;
 }
 
+bool io_accept_can_retarget_rsrc(struct io_kiocb *req)
+{
+	struct io_accept *accept = io_kiocb_to_cmd(req, struct io_accept);
+
+	if (accept->retarget_fd < 0)
+		return false;
+	return io_file_peek_fixed(req, accept->retarget_fd) == req->file;
+}
+
 int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_accept *accept = io_kiocb_to_cmd(req, struct io_accept);
@@ -1285,6 +1295,11 @@ int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		accept->flags = (accept->flags & ~SOCK_NONBLOCK) | O_NONBLOCK;
 	if (flags & IORING_ACCEPT_MULTISHOT)
 		req->flags |= REQ_F_APOLL_MULTISHOT;
+
+	if (req->flags & REQ_F_FIXED_FILE)
+		accept->retarget_fd = req->cqe.fd;
+	else
+		accept->retarget_fd = -1;
 	return 0;
 }
 
diff --git a/io_uring/net.h b/io_uring/net.h
index 6b5719084494..67fafb94d7de 100644
--- a/io_uring/net.h
+++ b/io_uring/net.h
@@ -49,6 +49,7 @@ void io_sendrecv_fail(struct io_kiocb *req);
 
 int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_accept(struct io_kiocb *req, unsigned int issue_flags);
+bool io_accept_can_retarget_rsrc(struct io_kiocb *req);
 
 int io_socket_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_socket(struct io_kiocb *req, unsigned int issue_flags);
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 1a0be5681c7b..7c94f1a4315a 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -207,6 +207,7 @@ const struct io_op_def io_op_defs[] = {
 #if defined(CONFIG_NET)
 		.prep			= io_accept_prep,
 		.issue			= io_accept,
+		.can_retarget_rsrc	= io_accept_can_retarget_rsrc,
 #else
 		.prep			= io_eopnotsupp_prep,
 #endif
-- 
2.30.2



* [PATCH for-next 10/12] io_uring: read retarget_rsrc support
  2022-10-31 13:41 [PATCH for-next 00/12] io_uring: retarget rsrc nodes periodically Dylan Yudaken
                   ` (8 preceding siblings ...)
  2022-10-31 13:41 ` [PATCH for-next 09/12] io_uring: accept " Dylan Yudaken
@ 2022-10-31 13:41 ` Dylan Yudaken
  2022-10-31 13:41 ` [PATCH for-next 11/12] io_uring: read_fixed " Dylan Yudaken
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 30+ messages in thread
From: Dylan Yudaken @ 2022-10-31 13:41 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov; +Cc: io-uring, kernel-team, Dylan Yudaken

Add can_retarget_rsrc handler for read

Signed-off-by: Dylan Yudaken <dylany@meta.com>
---
 io_uring/opdef.c |  2 ++
 io_uring/rw.c    | 14 ++++++++++++++
 io_uring/rw.h    |  1 +
 3 files changed, 17 insertions(+)

diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 7c94f1a4315a..0018fe39cbb5 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -70,6 +70,7 @@ const struct io_op_def io_op_defs[] = {
 		.prep_async		= io_readv_prep_async,
 		.cleanup		= io_readv_writev_cleanup,
 		.fail			= io_rw_fail,
+		.can_retarget_rsrc	= io_read_can_retarget_rsrc,
 	},
 	[IORING_OP_WRITEV] = {
 		.needs_file		= 1,
@@ -284,6 +285,7 @@ const struct io_op_def io_op_defs[] = {
 		.prep			= io_prep_rw,
 		.issue			= io_read,
 		.fail			= io_rw_fail,
+		.can_retarget_rsrc	= io_read_can_retarget_rsrc,
 	},
 	[IORING_OP_WRITE] = {
 		.needs_file		= 1,
diff --git a/io_uring/rw.c b/io_uring/rw.c
index bb47cc4da713..7618e402dcec 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -1068,3 +1068,17 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 	io_free_batch_list(ctx, pos);
 	return nr_events;
 }
+
+bool io_read_can_retarget_rsrc(struct io_kiocb *req)
+{
+	struct file *f;
+
+	if (!(req->flags & REQ_F_FIXED_FILE))
+		return true;
+
+	f = io_file_peek_fixed(req, req->cqe.fd);
+	if (f != req->file)
+		return false;
+
+	return true;
+}
diff --git a/io_uring/rw.h b/io_uring/rw.h
index 3b733f4b610a..715e7249463b 100644
--- a/io_uring/rw.h
+++ b/io_uring/rw.h
@@ -22,3 +22,4 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags);
 int io_writev_prep_async(struct io_kiocb *req);
 void io_readv_writev_cleanup(struct io_kiocb *req);
 void io_rw_fail(struct io_kiocb *req);
+bool io_read_can_retarget_rsrc(struct io_kiocb *req);
-- 
2.30.2



* [PATCH for-next 11/12] io_uring: read_fixed retarget_rsrc support
  2022-10-31 13:41 [PATCH for-next 00/12] io_uring: retarget rsrc nodes periodically Dylan Yudaken
                   ` (9 preceding siblings ...)
  2022-10-31 13:41 ` [PATCH for-next 10/12] io_uring: read " Dylan Yudaken
@ 2022-10-31 13:41 ` Dylan Yudaken
  2022-10-31 13:41 ` [PATCH for-next 12/12] io_uring: poll_add " Dylan Yudaken
  2022-11-02 11:44 ` [PATCH for-next 00/12] io_uring: retarget rsrc nodes periodically Pavel Begunkov
  12 siblings, 0 replies; 30+ messages in thread
From: Dylan Yudaken @ 2022-10-31 13:41 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov; +Cc: io-uring, kernel-team, Dylan Yudaken

Add can_retarget_rsrc handler for read_fixed

Signed-off-by: Dylan Yudaken <dylany@meta.com>
---
 io_uring/opdef.c |  1 +
 io_uring/rw.c    | 15 +++++++++++++++
 io_uring/rw.h    |  1 +
 3 files changed, 17 insertions(+)

diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 0018fe39cbb5..5159b3abc2b2 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -109,6 +109,7 @@ const struct io_op_def io_op_defs[] = {
 		.prep			= io_prep_rw,
 		.issue			= io_read,
 		.fail			= io_rw_fail,
+		.can_retarget_rsrc	= io_read_fixed_can_retarget_rsrc,
 	},
 	[IORING_OP_WRITE_FIXED] = {
 		.needs_file		= 1,
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 7618e402dcec..d82fbe074bd9 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -1082,3 +1082,18 @@ bool io_read_can_retarget_rsrc(struct io_kiocb *req)
 
 	return true;
 }
+
+bool io_read_fixed_can_retarget_rsrc(struct io_kiocb *req)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+	u16 index;
+
+	if (unlikely(req->buf_index >= ctx->nr_user_bufs))
+		return false;
+
+	index = array_index_nospec(req->buf_index, ctx->nr_user_bufs);
+	if (ctx->user_bufs[index] != req->imu)
+		return false;
+
+	return io_read_can_retarget_rsrc(req);
+}
diff --git a/io_uring/rw.h b/io_uring/rw.h
index 715e7249463b..69cbc36560f6 100644
--- a/io_uring/rw.h
+++ b/io_uring/rw.h
@@ -23,3 +23,4 @@ int io_writev_prep_async(struct io_kiocb *req);
 void io_readv_writev_cleanup(struct io_kiocb *req);
 void io_rw_fail(struct io_kiocb *req);
 bool io_read_can_retarget_rsrc(struct io_kiocb *req);
+bool io_read_fixed_can_retarget_rsrc(struct io_kiocb *req);
-- 
2.30.2



* [PATCH for-next 12/12] io_uring: poll_add retarget_rsrc support
  2022-10-31 13:41 [PATCH for-next 00/12] io_uring: retarget rsrc nodes periodically Dylan Yudaken
                   ` (10 preceding siblings ...)
  2022-10-31 13:41 ` [PATCH for-next 11/12] io_uring: read_fixed " Dylan Yudaken
@ 2022-10-31 13:41 ` Dylan Yudaken
  2022-11-02 11:44 ` [PATCH for-next 00/12] io_uring: retarget rsrc nodes periodically Pavel Begunkov
  12 siblings, 0 replies; 30+ messages in thread
From: Dylan Yudaken @ 2022-10-31 13:41 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov; +Cc: io-uring, kernel-team, Dylan Yudaken

Add can_retarget_rsrc handler for poll.

Note that the copy of the fd is stashed in the middle of struct io_poll,
as there is a hole there; this is the only way to ensure that the
structure does not grow beyond the size of struct io_cmd_data.
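
For reference, the resulting layout on a typical 64-bit build looks
roughly like the sketch below (offsets are illustrative, not taken from
pahole output):

struct io_poll {
	struct file			*file;		/* offset  0, size 8 */
	struct wait_queue_head		*head;		/* offset  8, size 8 */
	__poll_t			events;		/* offset 16, size 4 */
	int				fd;		/* offset 20, size 4: fills the former 4-byte hole */
	struct wait_queue_entry		wait;		/* offset 24, 8-byte aligned */
};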

Signed-off-by: Dylan Yudaken <dylany@meta.com>
---
 io_uring/opdef.c |  1 +
 io_uring/poll.c  | 12 ++++++++++++
 io_uring/poll.h  |  2 ++
 3 files changed, 15 insertions(+)

diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 5159b3abc2b2..952ea8ff5032 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -133,6 +133,7 @@ const struct io_op_def io_op_defs[] = {
 		.name			= "POLL_ADD",
 		.prep			= io_poll_add_prep,
 		.issue			= io_poll_add,
+		.can_retarget_rsrc	= io_poll_can_retarget_rsrc,
 	},
 	[IORING_OP_POLL_REMOVE] = {
 		.audit_skip		= 1,
diff --git a/io_uring/poll.c b/io_uring/poll.c
index 0d9f49c575e0..fde8060b9399 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -863,6 +863,7 @@ int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return -EINVAL;
 
 	poll->events = io_poll_parse_events(sqe, flags);
+	poll->fd = req->cqe.fd;
 	return 0;
 }
 
@@ -963,3 +964,14 @@ void io_apoll_cache_free(struct io_cache_entry *entry)
 {
 	kfree(container_of(entry, struct async_poll, cache));
 }
+
+bool io_poll_can_retarget_rsrc(struct io_kiocb *req)
+{
+	struct io_poll *poll = io_kiocb_to_cmd(req, struct io_poll);
+
+	if (req->flags & REQ_F_FIXED_FILE &&
+	    io_file_peek_fixed(req, poll->fd) != req->file)
+		return false;
+
+	return true;
+}
diff --git a/io_uring/poll.h b/io_uring/poll.h
index 5f3bae50fc81..dcc4b06bcea1 100644
--- a/io_uring/poll.h
+++ b/io_uring/poll.h
@@ -12,6 +12,7 @@ struct io_poll {
 	struct file			*file;
 	struct wait_queue_head		*head;
 	__poll_t			events;
+	int				fd; /* only used by poll_add */
 	struct wait_queue_entry		wait;
 };
 
@@ -37,3 +38,4 @@ bool io_poll_remove_all(struct io_ring_ctx *ctx, struct task_struct *tsk,
 			bool cancel_all);
 
 void io_apoll_cache_free(struct io_cache_entry *entry);
+bool io_poll_can_retarget_rsrc(struct io_kiocb *req);
-- 
2.30.2



* Re: [PATCH for-next 04/12] io_uring: reschedule retargeting at shutdown of ring
  2022-10-31 13:41 ` [PATCH for-next 04/12] io_uring: reschedule retargeting at shutdown of ring Dylan Yudaken
@ 2022-10-31 16:02   ` Jens Axboe
  2022-10-31 16:44     ` Dylan Yudaken
  0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2022-10-31 16:02 UTC (permalink / raw)
  To: Dylan Yudaken, Pavel Begunkov; +Cc: io-uring, kernel-team

On 10/31/22 7:41 AM, Dylan Yudaken wrote:
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index 8d0d40713a63..40b37899e943 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -248,12 +248,20 @@ static unsigned int io_rsrc_retarget_table(struct io_ring_ctx *ctx,
>  	return refs;
>  }
>  
> -static void io_rsrc_retarget_schedule(struct io_ring_ctx *ctx)
> +static void io_rsrc_retarget_schedule(struct io_ring_ctx *ctx, bool delay)
>  	__must_hold(&ctx->uring_lock)
>  {
> -	percpu_ref_get(&ctx->refs);
> -	mod_delayed_work(system_wq, &ctx->rsrc_retarget_work, 60 * HZ);
> -	ctx->rsrc_retarget_scheduled = true;
> +	unsigned long del;
> +
> +	if (delay)
> +		del = 60 * HZ;
> +	else
> +		del = 0;
> +
> +	if (likely(!mod_delayed_work(system_wq, &ctx->rsrc_retarget_work, del))) {
> +		percpu_ref_get(&ctx->refs);
> +		ctx->rsrc_retarget_scheduled = true;
> +	}
>  }

What happens for del == 0 and the work running ala:

CPU 0				CPU 1
mod_delayed_work(.., 0);
				delayed_work runs
					put ctx
percpu_ref_get(ctx)

Also I think that likely() needs to get dropped.

-- 
Jens Axboe


* Re: [PATCH for-next 01/12] io_uring: infrastructure for retargeting rsrc nodes
  2022-10-31 13:41 ` [PATCH for-next 01/12] io_uring: infrastructure for retargeting rsrc nodes Dylan Yudaken
@ 2022-10-31 16:02   ` Jens Axboe
  2022-10-31 16:45     ` Dylan Yudaken
  2022-11-02 11:20   ` Pavel Begunkov
  1 sibling, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2022-10-31 16:02 UTC (permalink / raw)
  To: Dylan Yudaken, Pavel Begunkov; +Cc: io-uring, kernel-team

On 10/31/22 7:41 AM, Dylan Yudaken wrote:
> +static void io_rsrc_retarget_schedule(struct io_ring_ctx *ctx)
> +	__must_hold(&ctx->uring_lock)
> +{
> +	percpu_ref_get(&ctx->refs);
> +	mod_delayed_work(system_wq, &ctx->rsrc_retarget_work, 60 * HZ);
> +	ctx->rsrc_retarget_scheduled = true;
> +}

Can this ever be called with rsrc_retarget_work already pending? If so,
that would seem to leak a ctx ref.

-- 
Jens Axboe


* Re: [PATCH for-next 06/12] io_uring: add fixed file peeking function
  2022-10-31 13:41 ` [PATCH for-next 06/12] io_uring: add fixed file peeking function Dylan Yudaken
@ 2022-10-31 16:04   ` Jens Axboe
  2022-10-31 16:47     ` Dylan Yudaken
  2022-11-02 11:23   ` Pavel Begunkov
  1 sibling, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2022-10-31 16:04 UTC (permalink / raw)
  To: Dylan Yudaken, Pavel Begunkov; +Cc: io-uring, kernel-team

On 10/31/22 7:41 AM, Dylan Yudaken wrote:
> @@ -1849,17 +1866,14 @@ inline struct file *io_file_get_fixed(struct io_kiocb *req, int fd,
>  	unsigned long file_ptr;
>  
>  	io_ring_submit_lock(ctx, issue_flags);
> -
> -	if (unlikely((unsigned int)fd >= ctx->nr_user_files))
> -		goto out;
> -	fd = array_index_nospec(fd, ctx->nr_user_files);
> -	file_ptr = io_fixed_file_slot(&ctx->file_table, fd)->file_ptr;
> +	file_ptr = __io_file_peek_fixed(req, fd);
>  	file = (struct file *) (file_ptr & FFS_MASK);
>  	file_ptr &= ~FFS_MASK;
>  	/* mask in overlapping REQ_F and FFS bits */
>  	req->flags |= (file_ptr << REQ_F_SUPPORT_NOWAIT_BIT);
>  	io_req_set_rsrc_node(req, ctx, 0);
> -out:
> +	WARN_ON_ONCE(file && !test_bit(fd, ctx->file_table.bitmap));
> +
>  	io_ring_submit_unlock(ctx, issue_flags);
>  	return file;
>  }

Is this WARN_ON_ONCE() a leftover from being originally based on a tree
before:

commit 4d5059512d283dab7372d282c2fbd43c7f5a2456
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Sun Oct 16 21:30:49 2022 +0100

    io_uring: kill hot path fixed file bitmap debug checks

got added? Seems not related to the changes here otherwise.

-- 
Jens Axboe


* Re: [PATCH for-next 04/12] io_uring: reschedule retargeting at shutdown of ring
  2022-10-31 16:02   ` Jens Axboe
@ 2022-10-31 16:44     ` Dylan Yudaken
  2022-10-31 19:13       ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: Dylan Yudaken @ 2022-10-31 16:44 UTC (permalink / raw)
  To: Dylan Yudaken, axboe, asml.silence; +Cc: Kernel Team, io-uring

On Mon, 2022-10-31 at 10:02 -0600, Jens Axboe wrote:
> On 10/31/22 7:41 AM, Dylan Yudaken wrote:
> > diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> > index 8d0d40713a63..40b37899e943 100644
> > --- a/io_uring/rsrc.c
> > +++ b/io_uring/rsrc.c
> > @@ -248,12 +248,20 @@ static unsigned int
> > io_rsrc_retarget_table(struct io_ring_ctx *ctx,
> >         return refs;
> >  }
> >  
> > -static void io_rsrc_retarget_schedule(struct io_ring_ctx *ctx)
> > +static void io_rsrc_retarget_schedule(struct io_ring_ctx *ctx,
> > bool delay)
> >         __must_hold(&ctx->uring_lock)
> >  {
> > -       percpu_ref_get(&ctx->refs);
> > -       mod_delayed_work(system_wq, &ctx->rsrc_retarget_work, 60 *
> > HZ);
> > -       ctx->rsrc_retarget_scheduled = true;
> > +       unsigned long del;
> > +
> > +       if (delay)
> > +               del = 60 * HZ;
> > +       else
> > +               del = 0;
> > +
> > +       if (likely(!mod_delayed_work(system_wq, &ctx-
> > >rsrc_retarget_work, del))) {
> > +               percpu_ref_get(&ctx->refs);
> > +               ctx->rsrc_retarget_scheduled = true;
> > +       }
> >  }
> 
> What happens for del == 0 and the work running ala:
> 
> CPU 0                           CPU 1
> mod_delayed_work(.., 0);
>                                 delayed_work runs
>                                         put ctx
> percpu_ref_get(ctx)

The work takes the lock before put(ctx), and CPU 0 only releases the
lock after calling get(ctx) so it should be ok.

> 
> Also I think that likely() needs to get dropped.
> 

It's not a big thing, but the only time it will already be enqueued is at
ring shutdown, when there is an outstanding enqueue. Other times it will
not get double enqueued, as it is protected by the _scheduled bool (this
is important, or else it would continually be pushed back by one period
and maybe never run).




* Re: [PATCH for-next 01/12] io_uring: infrastructure for retargeting rsrc nodes
  2022-10-31 16:02   ` Jens Axboe
@ 2022-10-31 16:45     ` Dylan Yudaken
  0 siblings, 0 replies; 30+ messages in thread
From: Dylan Yudaken @ 2022-10-31 16:45 UTC (permalink / raw)
  To: Dylan Yudaken, axboe, asml.silence; +Cc: Kernel Team, io-uring

On Mon, 2022-10-31 at 10:02 -0600, Jens Axboe wrote:
> On 10/31/22 7:41 AM, Dylan Yudaken wrote:
> > +static void io_rsrc_retarget_schedule(struct io_ring_ctx *ctx)
> > +       __must_hold(&ctx->uring_lock)
> > +{
> > +       percpu_ref_get(&ctx->refs);
> > +       mod_delayed_work(system_wq, &ctx->rsrc_retarget_work, 60 *
> > HZ);
> > +       ctx->rsrc_retarget_scheduled = true;
> > +}
> 
> Can this ever be called with rsrc_retarget_work already pending? If
> so,
> that would seem to leak a ctx ref.
> 

Not in this patch.
In a later patch it can (but only at shutdown), and that case is checked for.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH for-next 06/12] io_uring: add fixed file peeking function
  2022-10-31 16:04   ` Jens Axboe
@ 2022-10-31 16:47     ` Dylan Yudaken
  0 siblings, 0 replies; 30+ messages in thread
From: Dylan Yudaken @ 2022-10-31 16:47 UTC (permalink / raw)
  To: Dylan Yudaken, axboe, asml.silence; +Cc: Kernel Team, io-uring

On Mon, 2022-10-31 at 10:04 -0600, Jens Axboe wrote:
> On 10/31/22 7:41 AM, Dylan Yudaken wrote:
> > @@ -1849,17 +1866,14 @@ inline struct file
> > *io_file_get_fixed(struct io_kiocb *req, int fd,
> >         unsigned long file_ptr;
> >  
> >         io_ring_submit_lock(ctx, issue_flags);
> > -
> > -       if (unlikely((unsigned int)fd >= ctx->nr_user_files))
> > -               goto out;
> > -       fd = array_index_nospec(fd, ctx->nr_user_files);
> > -       file_ptr = io_fixed_file_slot(&ctx->file_table, fd)-
> > >file_ptr;
> > +       file_ptr = __io_file_peek_fixed(req, fd);
> >         file = (struct file *) (file_ptr & FFS_MASK);
> >         file_ptr &= ~FFS_MASK;
> >         /* mask in overlapping REQ_F and FFS bits */
> >         req->flags |= (file_ptr << REQ_F_SUPPORT_NOWAIT_BIT);
> >         io_req_set_rsrc_node(req, ctx, 0);
> > -out:
> > +       WARN_ON_ONCE(file && !test_bit(fd, ctx-
> > >file_table.bitmap));
> > +
> >         io_ring_submit_unlock(ctx, issue_flags);
> >         return file;
> >  }
> 
> Is this WARN_ON_ONCE() a leftover from being originally based on a
> tree
> before:
> 
> commit 4d5059512d283dab7372d282c2fbd43c7f5a2456
> Author: Pavel Begunkov <asml.silence@gmail.com>
> Date:   Sun Oct 16 21:30:49 2022 +0100
> 
>     io_uring: kill hot path fixed file bitmap debug checks
> 
> got added? Seems not related to the changes here otherwise.
> 

Ah yes - that was a bad merge then, around the "out:" label.
I'll fix it in v2; I am assuming there will be some more comments.
Thanks for pointing this out.


Dylan

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH for-next 03/12] io_uring: support retargeting rsrc on requests in the io-wq
  2022-10-31 13:41 ` [PATCH for-next 03/12] io_uring: support retargeting rsrc on requests in the io-wq Dylan Yudaken
@ 2022-10-31 18:19   ` Jens Axboe
  0 siblings, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2022-10-31 18:19 UTC (permalink / raw)
  To: Dylan Yudaken, Pavel Begunkov; +Cc: io-uring, kernel-team

On 10/31/22 7:41 AM, Dylan Yudaken wrote:
> Requests can be in flight on the io-wq, and can be long lived (for example
> a double read will get onto the io-wq). So make sure to retarget the rsrc
> nodes on those requests.
> 
> Signed-off-by: Dylan Yudaken <dylany@meta.com>
> ---
>  io_uring/rsrc.c | 46 ++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 42 insertions(+), 4 deletions(-)
> 
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index 106210e0d5d5..8d0d40713a63 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -16,6 +16,7 @@
>  #include "openclose.h"
>  #include "rsrc.h"
>  #include "opdef.h"
> +#include "tctx.h"
>  
>  struct io_rsrc_update {
>  	struct file			*file;
> @@ -24,6 +25,11 @@ struct io_rsrc_update {
>  	u32				offset;
>  };
>  
> +struct io_retarget_data {
> +	struct io_ring_ctx		*ctx;
> +	unsigned int			refs;
> +};

Do we really need this struct? As far as I can tell, you only pass in ctx
and get back refs. It's passed to the callbacks, but they only care
about ctx. If io_rsrc_retarget_wq() returned the refs rather than
using data->refs, then we could just pass in the ctx?

Or you could at least keep it local to io_rsrc_retarget_wq() and
io_retarget_rsrc_wq_cb().

Not a big deal, it's just always nice to keep the scope of structs as small
as possible (or get rid of them entirely).

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH for-next 04/12] io_uring: reschedule retargeting at shutdown of ring
  2022-10-31 16:44     ` Dylan Yudaken
@ 2022-10-31 19:13       ` Jens Axboe
  2022-11-01 12:09         ` Dylan Yudaken
  0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2022-10-31 19:13 UTC (permalink / raw)
  To: Dylan Yudaken, asml.silence; +Cc: Kernel Team, io-uring

On 10/31/22 10:44 AM, Dylan Yudaken wrote:
> On Mon, 2022-10-31 at 10:02 -0600, Jens Axboe wrote:
>> On 10/31/22 7:41 AM, Dylan Yudaken wrote:
>>> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
>>> index 8d0d40713a63..40b37899e943 100644
>>> --- a/io_uring/rsrc.c
>>> +++ b/io_uring/rsrc.c
>>> @@ -248,12 +248,20 @@ static unsigned int
>>> io_rsrc_retarget_table(struct io_ring_ctx *ctx,
>>>         return refs;
>>>  }
>>>  
>>> -static void io_rsrc_retarget_schedule(struct io_ring_ctx *ctx)
>>> +static void io_rsrc_retarget_schedule(struct io_ring_ctx *ctx,
>>> bool delay)
>>>         __must_hold(&ctx->uring_lock)
>>>  {
>>> -       percpu_ref_get(&ctx->refs);
>>> -       mod_delayed_work(system_wq, &ctx->rsrc_retarget_work, 60 *
>>> HZ);
>>> -       ctx->rsrc_retarget_scheduled = true;
>>> +       unsigned long del;
>>> +
>>> +       if (delay)
>>> +               del = 60 * HZ;
>>> +       else
>>> +               del = 0;
>>> +
>>> +       if (likely(!mod_delayed_work(system_wq, &ctx-
>>>> rsrc_retarget_work, del))) {
>>> +               percpu_ref_get(&ctx->refs);
>>> +               ctx->rsrc_retarget_scheduled = true;
>>> +       }
>>>  }
>>
>> What happens for del == 0 and the work running ala:
>>
>> CPU 0                           CPU 1
>> mod_delayed_work(.., 0);
>>                                 delayed_work runs
>>                                         put ctx
>> percpu_ref_get(ctx)
> 
> The work takes the lock before put(ctx), and CPU 0 only releases the
> lock after calling get(ctx) so it should be ok.

But io_ring_ctx_ref_free() could already have run at that point? Maybe I'm
missing something...

In any case, it would be saner to always grab that ref first. Or at
least have a proper comment as to why it's safe, because it looks
iffy.

>> Also I think that likely() needs to get dropped.
>>
> 
> It's not a big thing, but the only time it will be enqueued is on ring
> shutdown if there is an outstanding enqueue. Other times it will not
> get double enqueued as it is protected by the _scheduled bool (this is
> important or else it will continually push back by 1 period and maybe
> never run)

We've already called into this function, so I don't think it's worth a
likely. The same goes for most of the others added in this series; imho they
only really make sense on a very hot path where that branch is
inlined.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH for-next 04/12] io_uring: reschedule retargeting at shutdown of ring
  2022-10-31 19:13       ` Jens Axboe
@ 2022-11-01 12:09         ` Dylan Yudaken
  0 siblings, 0 replies; 30+ messages in thread
From: Dylan Yudaken @ 2022-11-01 12:09 UTC (permalink / raw)
  To: Dylan Yudaken, axboe, asml.silence; +Cc: Kernel Team, io-uring

On Mon, 2022-10-31 at 13:13 -0600, Jens Axboe wrote:
> On 10/31/22 10:44 AM, Dylan Yudaken wrote:
> > On Mon, 2022-10-31 at 10:02 -0600, Jens Axboe wrote:
> > > On 10/31/22 7:41 AM, Dylan Yudaken wrote:
> > > > diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> > > > index 8d0d40713a63..40b37899e943 100644
> > > > --- a/io_uring/rsrc.c
> > > > +++ b/io_uring/rsrc.c
> > > > @@ -248,12 +248,20 @@ static unsigned int
> > > > io_rsrc_retarget_table(struct io_ring_ctx *ctx,
> > > >         return refs;
> > > >  }
> > > >  
> > > > -static void io_rsrc_retarget_schedule(struct io_ring_ctx *ctx)
> > > > +static void io_rsrc_retarget_schedule(struct io_ring_ctx *ctx,
> > > > bool delay)
> > > >         __must_hold(&ctx->uring_lock)
> > > >  {
> > > > -       percpu_ref_get(&ctx->refs);
> > > > -       mod_delayed_work(system_wq, &ctx->rsrc_retarget_work,
> > > > 60 *
> > > > HZ);
> > > > -       ctx->rsrc_retarget_scheduled = true;
> > > > +       unsigned long del;
> > > > +
> > > > +       if (delay)
> > > > +               del = 60 * HZ;
> > > > +       else
> > > > +               del = 0;
> > > > +
> > > > +       if (likely(!mod_delayed_work(system_wq, &ctx-
> > > > > rsrc_retarget_work, del))) {
> > > > +               percpu_ref_get(&ctx->refs);
> > > > +               ctx->rsrc_retarget_scheduled = true;
> > > > +       }
> > > >  }
> > > 
> > > What happens for del == 0 and the work running ala:
> > > 
> > > CPU 0                           CPU 1
> > > mod_delayed_work(.., 0);
> > >                                 delayed_work runs
> > >                                         put ctx
> > > percpu_ref_get(ctx)
> > 
> > The work takes the lock before put(ctx), and CPU 0 only releases
> > the
> > lock after calling get(ctx) so it should be ok.
> 
> But io_ring_ctx_ref_free() would've run at that point? Maybe I'm
> missing something...
> 
> In any case, would be saner to always grab that ref first. Or at
> least have a proper comment as to why it's safe, because it looks
> iffy.

I think I misunderstood - I assumed a ref was already taken higher up
the stack. That is not the case, and in fact in the _exiting() calls it
is not really valid to take the reference, as it may have already hit
zero. Instead we can use cancel_delayed_work() in the exit path (no need
to retarget rsrc nodes at that point), which makes things a bit cleaner.
I'll update this in v2.

> 
> > > Also I think that likely() needs to get dropped.
> > > 
> > 
> > It's not a big thing, but the only time it will be enqueued is on
> > ring
> > shutdown if there is an outstanding enqueue. Other times it will
> > not
> > get double enqueued as it is protected by the _scheduled bool (this
> > is
> > important or else it will continually push back by 1 period and
> > maybe
> > never run)
> 
> We've already called into this function, don't think it's worth a
> likely. Same for most of the others added in this series, imho they
> only really make sense for a very hot path where that branch is
> inline.

Will remove it.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH for-next 01/12] io_uring: infrastructure for retargeting rsrc nodes
  2022-10-31 13:41 ` [PATCH for-next 01/12] io_uring: infrastructure for retargeting rsrc nodes Dylan Yudaken
  2022-10-31 16:02   ` Jens Axboe
@ 2022-11-02 11:20   ` Pavel Begunkov
  1 sibling, 0 replies; 30+ messages in thread
From: Pavel Begunkov @ 2022-11-02 11:20 UTC (permalink / raw)
  To: Dylan Yudaken, Jens Axboe; +Cc: io-uring, kernel-team

On 10/31/22 13:41, Dylan Yudaken wrote:
> rsrc node cleanup can be indefinitely delayed when there are long lived
> requests. For example if a file is located in the same rsrc node as a long
> lived socket with multishot poll, then even if unregistering the file it
> will not be closed while the poll request is still active.
> 
> Introduce a timer when rsrc node is switched, so that periodically we can
> retarget these long lived requests to the newest nodes. That will allow
> the old nodes to be cleaned up, freeing resources.
> 
> Signed-off-by: Dylan Yudaken <dylany@meta.com>
> ---
>   include/linux/io_uring_types.h |  2 +
>   io_uring/io_uring.c            |  1 +
>   io_uring/opdef.h               |  1 +
>   io_uring/rsrc.c                | 92 ++++++++++++++++++++++++++++++++++
>   io_uring/rsrc.h                |  1 +
>   5 files changed, 97 insertions(+)
> 
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index f5b687a787a3..1d4eff4e632c 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -327,6 +327,8 @@ struct io_ring_ctx {
>   	struct llist_head		rsrc_put_llist;
>   	struct list_head		rsrc_ref_list;
>   	spinlock_t			rsrc_ref_lock;
> +	struct delayed_work		rsrc_retarget_work;
> +	bool				rsrc_retarget_scheduled;
>   
>   	struct list_head		io_buffers_pages;
>   
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 6cc16e39b27f..ea2260359c56 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -320,6 +320,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
>   	spin_lock_init(&ctx->rsrc_ref_lock);
>   	INIT_LIST_HEAD(&ctx->rsrc_ref_list);
>   	INIT_DELAYED_WORK(&ctx->rsrc_put_work, io_rsrc_put_work);
> +	INIT_DELAYED_WORK(&ctx->rsrc_retarget_work, io_rsrc_retarget_work);
>   	init_llist_head(&ctx->rsrc_put_llist);
>   	init_llist_head(&ctx->work_llist);
>   	INIT_LIST_HEAD(&ctx->tctx_list);
> diff --git a/io_uring/opdef.h b/io_uring/opdef.h
> index 3efe06d25473..1b72b14cb5ab 100644
> --- a/io_uring/opdef.h
> +++ b/io_uring/opdef.h
> @@ -37,6 +37,7 @@ struct io_op_def {
>   	int (*prep_async)(struct io_kiocb *);
>   	void (*cleanup)(struct io_kiocb *);
>   	void (*fail)(struct io_kiocb *);
> +	bool (*can_retarget_rsrc)(struct io_kiocb *);

side note: this needs to be split at some point into 2 tables depending
on hotness; we want better caching for ->issue and ->prep.

>   };
>   
>   extern const struct io_op_def io_op_defs[];
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index 55d4ab96fb92..106210e0d5d5 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -15,6 +15,7 @@
>   #include "io_uring.h"
>   #include "openclose.h"
>   #include "rsrc.h"
> +#include "opdef.h"
>   
>   struct io_rsrc_update {
>   	struct file			*file;
> @@ -204,6 +205,95 @@ void io_rsrc_put_work(struct work_struct *work)
>   	}
>   }
>   
> +
> +static unsigned int io_rsrc_retarget_req(struct io_ring_ctx *ctx,
> +					 struct io_kiocb *req)
> +	__must_hold(&ctx->uring_lock)
> +{
> +	if (!req->rsrc_node ||
> +	     req->rsrc_node == ctx->rsrc_node)
> +		return 0;
> +	if (!io_op_defs[req->opcode].can_retarget_rsrc)
> +		return 0;
> +	if (!(*io_op_defs[req->opcode].can_retarget_rsrc)(req))
> +		return 0;

nit, there should be no need to deref fptr.

if (!io_op_defs[req->opcode].can_retarget_rsrc(req)) ...

> +
> +	io_rsrc_put_node(req->rsrc_node, 1);
> +	req->rsrc_node = ctx->rsrc_node;
> +	return 1;
> +}
> +
> +static unsigned int io_rsrc_retarget_table(struct io_ring_ctx *ctx,
> +				   struct io_hash_table *table)
> +{
> +	unsigned int nr_buckets = 1U << table->hash_bits;
> +	unsigned int refs = 0;
> +	struct io_kiocb *req;
> +	int i;
> +
> +	for (i = 0; i < nr_buckets; i++) {
> +		struct io_hash_bucket *hb = &table->hbs[i];
> +
> +		spin_lock(&hb->lock);
> +		hlist_for_each_entry(req, &hb->list, hash_node)
> +			refs += io_rsrc_retarget_req(ctx, req);
> +		spin_unlock(&hb->lock);
> +	}
> +	return refs;
> +}
> +
> +static void io_rsrc_retarget_schedule(struct io_ring_ctx *ctx)
> +	__must_hold(&ctx->uring_lock)
> +{
> +	percpu_ref_get(&ctx->refs);
> +	mod_delayed_work(system_wq, &ctx->rsrc_retarget_work, 60 * HZ);
> +	ctx->rsrc_retarget_scheduled = true;
> +}
> +
> +static void __io_rsrc_retarget_work(struct io_ring_ctx *ctx)
> +	__must_hold(&ctx->uring_lock)
> +{
> +	struct io_rsrc_node *node;
> +	unsigned int refs;
> +	bool any_waiting;
> +
> +	if (!ctx->rsrc_node)
> +		return;
> +
> +	spin_lock_irq(&ctx->rsrc_ref_lock);
> +	any_waiting = false;
> +	list_for_each_entry(node, &ctx->rsrc_ref_list, node) {
> +		if (!node->done) {
> +			any_waiting = true;
> +			break;
> +		}
> +	}
> +	spin_unlock_irq(&ctx->rsrc_ref_lock);
> +
> +	if (!any_waiting)
> +		return;
> +
> +	refs = io_rsrc_retarget_table(ctx, &ctx->cancel_table);
> +	refs += io_rsrc_retarget_table(ctx, &ctx->cancel_table_locked);
> +
> +	ctx->rsrc_cached_refs -= refs;
> +	while (unlikely(ctx->rsrc_cached_refs < 0))
> +		io_rsrc_refs_refill(ctx);

We can charge ->rsrc_cached_refs after setting up nodes in prep / submission
without underflowing the actual refs, because we know that the requests are
not yet submitted and so won't consume refs. This one looks more troublesome.


> +}
> +
> +void io_rsrc_retarget_work(struct work_struct *work)
> +{
> +	struct io_ring_ctx *ctx;
> +
> +	ctx = container_of(work, struct io_ring_ctx, rsrc_retarget_work.work);
> +
> +	mutex_lock(&ctx->uring_lock);
> +	ctx->rsrc_retarget_scheduled = false;
> +	__io_rsrc_retarget_work(ctx);
> +	mutex_unlock(&ctx->uring_lock);
> +	percpu_ref_put(&ctx->refs);
> +}
> +
>   void io_wait_rsrc_data(struct io_rsrc_data *data)
>   {
>   	if (data && !atomic_dec_and_test(&data->refs))
> @@ -285,6 +375,8 @@ void io_rsrc_node_switch(struct io_ring_ctx *ctx,
>   		atomic_inc(&data_to_kill->refs);
>   		percpu_ref_kill(&rsrc_node->refs);
>   		ctx->rsrc_node = NULL;
> +		if (!ctx->rsrc_retarget_scheduled)
> +			io_rsrc_retarget_schedule(ctx);
>   	}
>   
>   	if (!ctx->rsrc_node) {
> diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
> index 81445a477622..2b94df8fd9e8 100644
> --- a/io_uring/rsrc.h
> +++ b/io_uring/rsrc.h
> @@ -54,6 +54,7 @@ struct io_mapped_ubuf {
>   };
>   
>   void io_rsrc_put_work(struct work_struct *work);
> +void io_rsrc_retarget_work(struct work_struct *work);
>   void io_rsrc_refs_refill(struct io_ring_ctx *ctx);
>   void io_wait_rsrc_data(struct io_rsrc_data *data);
>   void io_rsrc_node_destroy(struct io_rsrc_node *ref_node);

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH for-next 06/12] io_uring: add fixed file peeking function
  2022-10-31 13:41 ` [PATCH for-next 06/12] io_uring: add fixed file peeking function Dylan Yudaken
  2022-10-31 16:04   ` Jens Axboe
@ 2022-11-02 11:23   ` Pavel Begunkov
  1 sibling, 0 replies; 30+ messages in thread
From: Pavel Begunkov @ 2022-11-02 11:23 UTC (permalink / raw)
  To: Dylan Yudaken, Jens Axboe; +Cc: io-uring, kernel-team

On 10/31/22 13:41, Dylan Yudaken wrote:
> Add a helper function to grab the fixed file at a given offset. Will be
> useful for retarget op handlers.
> 
> Signed-off-by: Dylan Yudaken <dylany@meta.com>
> ---
>   io_uring/io_uring.c | 26 ++++++++++++++++++++------
>   io_uring/io_uring.h |  1 +
>   2 files changed, 21 insertions(+), 6 deletions(-)
> 
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 32eb305c4ce7..a052653fc65e 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -1841,6 +1841,23 @@ void io_wq_submit_work(struct io_wq_work *work)
>   		io_req_task_queue_fail(req, ret);
>   }
>   
> +static unsigned long __io_file_peek_fixed(struct io_kiocb *req, int fd)
> +	__must_hold(&req->ctx->uring_lock)

Let's mark it inline, it's in the hot path. Yeah, it's small, but I've
battled compilers enough; from time to time they leave it
not inlined.


> +{
> +	struct io_ring_ctx *ctx = req->ctx;
> +
> +	if (unlikely((unsigned int)fd >= ctx->nr_user_files))
> +		return 0;
> +	fd = array_index_nospec(fd, ctx->nr_user_files);
> +	return io_fixed_file_slot(&ctx->file_table, fd)->file_ptr;
> +}
> +
> +struct file *io_file_peek_fixed(struct io_kiocb *req, int fd)
> +	__must_hold(&req->ctx->uring_lock)
> +{
> +	return (struct file *) (__io_file_peek_fixed(req, fd) & FFS_MASK);
> +}
> +
>   inline struct file *io_file_get_fixed(struct io_kiocb *req, int fd,
>   				      unsigned int issue_flags)
>   {
> @@ -1849,17 +1866,14 @@ inline struct file *io_file_get_fixed(struct io_kiocb *req, int fd,
>   	unsigned long file_ptr;
>   
>   	io_ring_submit_lock(ctx, issue_flags);
> -
> -	if (unlikely((unsigned int)fd >= ctx->nr_user_files))
> -		goto out;
> -	fd = array_index_nospec(fd, ctx->nr_user_files);
> -	file_ptr = io_fixed_file_slot(&ctx->file_table, fd)->file_ptr;
> +	file_ptr = __io_file_peek_fixed(req, fd);
>   	file = (struct file *) (file_ptr & FFS_MASK);
>   	file_ptr &= ~FFS_MASK;
>   	/* mask in overlapping REQ_F and FFS bits */
>   	req->flags |= (file_ptr << REQ_F_SUPPORT_NOWAIT_BIT);
>   	io_req_set_rsrc_node(req, ctx, 0);
> -out:
> +	WARN_ON_ONCE(file && !test_bit(fd, ctx->file_table.bitmap));
> +
>   	io_ring_submit_unlock(ctx, issue_flags);
>   	return file;
>   }
> diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
> index ef77d2aa3172..781471bfba12 100644
> --- a/io_uring/io_uring.h
> +++ b/io_uring/io_uring.h
> @@ -44,6 +44,7 @@ struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages);
>   struct file *io_file_get_normal(struct io_kiocb *req, int fd);
>   struct file *io_file_get_fixed(struct io_kiocb *req, int fd,
>   			       unsigned issue_flags);
> +struct file *io_file_peek_fixed(struct io_kiocb *req, int fd);
>   
>   static inline bool io_req_ffs_set(struct io_kiocb *req)
>   {

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH for-next 07/12] io_uring: split send_zc specific struct out of io_sr_msg
  2022-10-31 13:41 ` [PATCH for-next 07/12] io_uring: split send_zc specific struct out of io_sr_msg Dylan Yudaken
@ 2022-11-02 11:32   ` Pavel Begunkov
  2022-11-02 13:45     ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: Pavel Begunkov @ 2022-11-02 11:32 UTC (permalink / raw)
  To: Dylan Yudaken, Jens Axboe; +Cc: io-uring, kernel-team

On 10/31/22 13:41, Dylan Yudaken wrote:
> Split out the specific sendzc parts of struct io_sr_msg as other opcodes
> are going to be specialized.

I'd suggest putting the fields into a union rather than splitting the structs
for now; that can be done later. The reason is that this file changes
relatively often, and the split will add conflicts, complicating
backporting and cross-tree development (i.e. series that rely on both the
net and io_uring trees).
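
To illustrate the shape of that suggestion (field names and placement here
are purely illustrative, not the real io_sr_msg layout):

	struct io_sr_msg {
		struct file	*file;
		/* ... fields shared by send/recv/send_zc ... */
		union {
			struct {
				struct io_kiocb	*notif;
				/* ... other send_zc-only fields ... */
			} zc;
			/* ... recv/recvmsg-specific fields can overlay here later ... */
		};
	};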

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH for-next 00/12] io_uring: retarget rsrc nodes periodically
  2022-10-31 13:41 [PATCH for-next 00/12] io_uring: retarget rsrc nodes periodically Dylan Yudaken
                   ` (11 preceding siblings ...)
  2022-10-31 13:41 ` [PATCH for-next 12/12] io_uring: poll_add " Dylan Yudaken
@ 2022-11-02 11:44 ` Pavel Begunkov
  2022-11-02 13:08   ` Dylan Yudaken
  12 siblings, 1 reply; 30+ messages in thread
From: Pavel Begunkov @ 2022-11-02 11:44 UTC (permalink / raw)
  To: Dylan Yudaken, Jens Axboe; +Cc: io-uring, kernel-team

On 10/31/22 13:41, Dylan Yudaken wrote:
> There is a problem with long running io_uring requests and rsrc node
> cleanup, where a single long running request can block cleanup of all
> subsequent nodes. For example a network server may have both long running
> accepts and also use registered files for connections. When sockets are
> closed and returned (either through close_direct, or through
> register_files_update) the underlying file will not be freed until that
> original accept completes. For the case of multishot this may be the
> lifetime of the application, which will cause file numbers to grow
> unbounded - resulting in either OOMs or ENFILE errors.
> 
> To fix this I propose retargeting the rsrc nodes from ongoing requests to
> the current main request of the io_uring. This only needs to happen for
> long running requests types, and specifically those that happen as a
> result of some external event. For this reason I exclude write/send style
> ops for the time being as even though these can cause this issue in
> reality it would be unexpected to have a write block for hours. This
> support can obviously be added later if needed.

Is there a particular reason why it retargets instead of downgrading?
Taking a file ref / etc. sounds more robust, e.g. what happens if we send
a lingering request and then remove the file from the table? It also
avoids having to cache the file index.
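
To spell out what downgrading could mean here - a hypothetical sketch only,
and whether clearing REQ_F_FIXED_FILE mid-flight is actually legal is
discussed further down the thread:

	/* pin the struct file itself and let go of the rsrc node */
	static bool io_req_downgrade_fixed_file(struct io_kiocb *req)
	{
		if (!(req->flags & REQ_F_FIXED_FILE) || !req->file)
			return false;

		get_file(req->file);			/* take a real file reference */
		req->flags &= ~REQ_F_FIXED_FILE;	/* req now owns that ref */
		io_rsrc_put_node(req->rsrc_node, 1);
		req->rsrc_node = NULL;			/* assumes callers handle NULL */
		return true;
	}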


> In order to retarget nodes all the outstanding requests (in both poll
> tables and io-wq) need to be iterated and the request needs to be checked
> to make sure the retargeting is valid. For example for FIXED_FILE requests
> this involves ensuring the file is still referenced in the current node.
> This O(N) operation seems to take ~1ms locally for 30k outstanding
> requests. Note it locks the io_uring while it happens and so no other work
> can occur. In order to amortize this cost slightly, I propose running this
> operation at most every 60 seconds. It is hard coded currently, but would
> be happy to take suggestions if this should be customizable (and how to do
> such a thing).
> 
> Without customizable retargeting period, it's a bit difficult to submit
> tests for this. I have a test but it obviously takes a many minutes to run
> which is not going to be acceptable for liburing.

We may also want to trigger it if there are too many rsrc nodes queued.
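
E.g. something as simple as this (purely hypothetical - the counter and the
threshold do not exist in the series):

	/* in io_rsrc_node_switch(): kick retargeting early if nodes pile up */
	if (++ctx->rsrc_pending_nodes > 128)
		io_rsrc_retarget_schedule(ctx, false);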

> Patches 1-5 are the basic io_uring infrastructure
> Patch 6 is a helper function used in the per op customisations
> Patch 7 splits out the zerocopy specific field in io_sr_msg
> Patches 8-12 are opcode implementations for retargeting
> 
> Dylan Yudaken (12):
>    io_uring: infrastructure for retargeting rsrc nodes
>    io_uring: io-wq helper to iterate all work
>    io_uring: support retargeting rsrc on requests in the io-wq
>    io_uring: reschedule retargeting at shutdown of ring
>    io_uring: add tracing for io_uring_rsrc_retarget
>    io_uring: add fixed file peeking function
>    io_uring: split send_zc specific struct out of io_sr_msg
>    io_uring: recv/recvmsg retarget_rsrc support
>    io_uring: accept retarget_rsrc support
>    io_uring: read retarget_rsrc support
>    io_uring: read_fixed retarget_rsrc support
>    io_uring: poll_add retarget_rsrc support
> 
>   include/linux/io_uring_types.h  |   2 +
>   include/trace/events/io_uring.h |  30 +++++++
>   io_uring/io-wq.c                |  49 +++++++++++
>   io_uring/io-wq.h                |   3 +
>   io_uring/io_uring.c             |  28 ++++--
>   io_uring/io_uring.h             |   1 +
>   io_uring/net.c                  | 114 ++++++++++++++++--------
>   io_uring/net.h                  |   2 +
>   io_uring/opdef.c                |   7 ++
>   io_uring/opdef.h                |   1 +
>   io_uring/poll.c                 |  12 +++
>   io_uring/poll.h                 |   2 +
>   io_uring/rsrc.c                 | 148 ++++++++++++++++++++++++++++++++
>   io_uring/rsrc.h                 |   2 +
>   io_uring/rw.c                   |  29 +++++++
>   io_uring/rw.h                   |   2 +
>   16 files changed, 390 insertions(+), 42 deletions(-)
> 
> 
> base-commit: 30209debe98b6f66b13591e59e5272cb65b3945e

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH for-next 00/12] io_uring: retarget rsrc nodes periodically
  2022-11-02 11:44 ` [PATCH for-next 00/12] io_uring: retarget rsrc nodes periodically Pavel Begunkov
@ 2022-11-02 13:08   ` Dylan Yudaken
  0 siblings, 0 replies; 30+ messages in thread
From: Dylan Yudaken @ 2022-11-02 13:08 UTC (permalink / raw)
  To: Dylan Yudaken, axboe, asml.silence; +Cc: Kernel Team, io-uring

On Wed, 2022-11-02 at 11:44 +0000, Pavel Begunkov wrote:
> On 10/31/22 13:41, Dylan Yudaken wrote:
> > There is a problem with long running io_uring requests and rsrc
> > node
> > cleanup, where a single long running request can block cleanup of
> > all
> > subsequent nodes. For example a network server may have both long
> > running
> > accepts and also use registered files for connections. When sockets
> > are
> > closed and returned (either through close_direct, or through
> > register_files_update) the underlying file will not be freed until
> > that
> > original accept completes. For the case of multishot this may be
> > the
> > lifetime of the application, which will cause file numbers to grow
> > unbounded - resulting in either OOMs or ENFILE errors.
> > 
> > To fix this I propose retargeting the rsrc nodes from ongoing
> > requests to
> > the current main request of the io_uring. This only needs to happen
> > for
> > long running requests types, and specifically those that happen as
> > a
> > result of some external event. For this reason I exclude write/send
> > style
> > ops for the time being as even though these can cause this issue in
> > reality it would be unexpected to have a write block for hours.
> > This
> > support can obviously be added later if needed.
> 
> Is there a particular reason why it tries to retarget instead of
> downgrading? Taking a file ref / etc. 

Downgrading could work - but it is less general, as it will not work for
buffers (or whatever future resources get added to this system). If it
could avoid the periodic work that would be good, but I don't really see
how that would happen.

> sounds more robust, e.g.
> what if we send a lingering request and then remove the file
> from the table? 

That case will work: the file will no longer match, and so the request
cannot be retargeted.
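
For instance, the per-op ->can_retarget_rsrc() check is roughly of this
shape - a sketch using the peek helper from patch 6, where the helper name
and the cached 'sr->fd' field are assumptions and the real patches may
differ:

	/* called under ctx->uring_lock from the retarget walk */
	static bool io_recv_can_retarget_rsrc(struct io_kiocb *req)
	{
		struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);

		if (!(req->flags & REQ_F_FIXED_FILE))
			return false;
		/* only retarget if that slot still points at the same file */
		return io_file_peek_fixed(req, sr->fd) == req->file;
	}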


> It also doesn't need caching the file index.

Yeah, that would be a big benefit. Perhaps we should do that for some
ops (accept/poll) and retarget for the others that can have fixed buffers?
It would make the code a bit simpler for the simple ops, too.

That being said, removing the REQ_F_FIXED_FILE flag from the req sounds
like it might be problematic, as flags have historically been constant?

> 
> 
> > In order to retarget nodes all the outstanding requests (in both
> > poll
> > tables and io-wq) need to be iterated and the request needs to be
> > checked
> > to make sure the retargeting is valid. For example for FIXED_FILE
> > requests
> > this involves ensuring the file is still referenced in the current
> > node.
> > This O(N) operation seems to take ~1ms locally for 30k outstanding
> > requests. Note it locks the io_uring while it happens and so no
> > other work
> > can occur. In order to amortize this cost slightly, I propose
> > running this
> > operation at most every 60 seconds. It is hard coded currently, but
> > would
> > be happy to take suggestions if this should be customizable (and
> > how to do
> > such a thing).
> > 
> > Without customizable retargeting period, it's a bit difficult to
> > submit
> > tests for this. I have a test but it obviously takes a many minutes
> > to run
> > which is not going to be acceptable for liburing.
> 
> We may also want to trigger it if there are too many rsrc nodes
> queued

In my testing this hasn't really been necessary, as the individual nodes
do not take up much space. But I am happy to add this if you think
it will help.


Also - thanks for the patch reviews, will get those into a v2 once the
general approach is ironed out.

Dylan


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH for-next 07/12] io_uring: split send_zc specific struct out of io_sr_msg
  2022-11-02 11:32   ` Pavel Begunkov
@ 2022-11-02 13:45     ` Jens Axboe
  2022-11-02 14:09       ` Pavel Begunkov
  0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2022-11-02 13:45 UTC (permalink / raw)
  To: Pavel Begunkov, Dylan Yudaken; +Cc: io-uring, kernel-team

On 11/2/22 5:32 AM, Pavel Begunkov wrote:
> On 10/31/22 13:41, Dylan Yudaken wrote:
>> Split out the specific sendzc parts of struct io_sr_msg as other opcodes
>> are going to be specialized.
> 
> I'd suggest to put the fields into a union and not splitting the structs
> for now, it can be done later. The reason is that the file keeps changing
> relatively often, and this change will add conflicts complicating
> backporting and cross-tree development (i.e. series that rely on both
> net and io_uring trees).

Not super important, but I greatly prefer having them split. That
way the ownership is much clearer than a union, which always
gets a bit iffy.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH for-next 07/12] io_uring: split send_zc specific struct out of io_sr_msg
  2022-11-02 13:45     ` Jens Axboe
@ 2022-11-02 14:09       ` Pavel Begunkov
  2022-11-02 14:12         ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: Pavel Begunkov @ 2022-11-02 14:09 UTC (permalink / raw)
  To: Jens Axboe, Dylan Yudaken; +Cc: io-uring, kernel-team

On 11/2/22 13:45, Jens Axboe wrote:
> On 11/2/22 5:32 AM, Pavel Begunkov wrote:
>> On 10/31/22 13:41, Dylan Yudaken wrote:
>>> Split out the specific sendzc parts of struct io_sr_msg as other opcodes
>>> are going to be specialized.
>>
>> I'd suggest to put the fields into a union and not splitting the structs
>> for now, it can be done later. The reason is that the file keeps changing
>> relatively often, and this change will add conflicts complicating
>> backporting and cross-tree development (i.e. series that rely on both
>> net and io_uring trees).
> 
> Not super important, but I greatly prefer having them split. That
> way the ownership is much clearer than a union, which always
> gets a bit iffy.

I'd agree in general, but I think it will be easier to do in a few releases
than now. And this one is nowhere near the level of nastiness of
req->cqe.fd and some of the cases we had before.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH for-next 07/12] io_uring: split send_zc specific struct out of io_sr_msg
  2022-11-02 14:09       ` Pavel Begunkov
@ 2022-11-02 14:12         ` Jens Axboe
  0 siblings, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2022-11-02 14:12 UTC (permalink / raw)
  To: Pavel Begunkov, Dylan Yudaken; +Cc: io-uring, kernel-team

On 11/2/22 8:09 AM, Pavel Begunkov wrote:
> On 11/2/22 13:45, Jens Axboe wrote:
>> On 11/2/22 5:32 AM, Pavel Begunkov wrote:
>>> On 10/31/22 13:41, Dylan Yudaken wrote:
>>>> Split out the specific sendzc parts of struct io_sr_msg as other opcodes
>>>> are going to be specialized.
>>>
>>> I'd suggest to put the fields into a union and not splitting the structs
>>> for now, it can be done later. The reason is that the file keeps changing
>>> relatively often, and this change will add conflicts complicating
>>> backporting and cross-tree development (i.e. series that rely on both
>>> net and io_uring trees).
>>
>> Not super important, but I greatly prefer having them split. That
>> way the ownership is much clearer than a union, which always
>> gets a bit iffy.
> 
> I'd agree in general, but I think it's easier to do in a few releases
> than now.

I'd rather just deal with that pain; in reality it'll be very minor. And
if lots of churn is expected there, it'll be the most trivial of rejects
for this particular area.

> And this one is nothing in levels of nastiness comparing to
> req->cqe.fd and some cases that we had before.

I agree, but that's not a reason to allow more of it, quite the
contrary.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2022-11-02 14:12 UTC | newest]

Thread overview: 30+ messages
2022-10-31 13:41 [PATCH for-next 00/12] io_uring: retarget rsrc nodes periodically Dylan Yudaken
2022-10-31 13:41 ` [PATCH for-next 01/12] io_uring: infrastructure for retargeting rsrc nodes Dylan Yudaken
2022-10-31 16:02   ` Jens Axboe
2022-10-31 16:45     ` Dylan Yudaken
2022-11-02 11:20   ` Pavel Begunkov
2022-10-31 13:41 ` [PATCH for-next 02/12] io_uring: io-wq helper to iterate all work Dylan Yudaken
2022-10-31 13:41 ` [PATCH for-next 03/12] io_uring: support retargeting rsrc on requests in the io-wq Dylan Yudaken
2022-10-31 18:19   ` Jens Axboe
2022-10-31 13:41 ` [PATCH for-next 04/12] io_uring: reschedule retargeting at shutdown of ring Dylan Yudaken
2022-10-31 16:02   ` Jens Axboe
2022-10-31 16:44     ` Dylan Yudaken
2022-10-31 19:13       ` Jens Axboe
2022-11-01 12:09         ` Dylan Yudaken
2022-10-31 13:41 ` [PATCH for-next 05/12] io_uring: add tracing for io_uring_rsrc_retarget Dylan Yudaken
2022-10-31 13:41 ` [PATCH for-next 06/12] io_uring: add fixed file peeking function Dylan Yudaken
2022-10-31 16:04   ` Jens Axboe
2022-10-31 16:47     ` Dylan Yudaken
2022-11-02 11:23   ` Pavel Begunkov
2022-10-31 13:41 ` [PATCH for-next 07/12] io_uring: split send_zc specific struct out of io_sr_msg Dylan Yudaken
2022-11-02 11:32   ` Pavel Begunkov
2022-11-02 13:45     ` Jens Axboe
2022-11-02 14:09       ` Pavel Begunkov
2022-11-02 14:12         ` Jens Axboe
2022-10-31 13:41 ` [PATCH for-next 08/12] io_uring: recv/recvmsg retarget_rsrc support Dylan Yudaken
2022-10-31 13:41 ` [PATCH for-next 09/12] io_uring: accept " Dylan Yudaken
2022-10-31 13:41 ` [PATCH for-next 10/12] io_uring: read " Dylan Yudaken
2022-10-31 13:41 ` [PATCH for-next 11/12] io_uring: read_fixed " Dylan Yudaken
2022-10-31 13:41 ` [PATCH for-next 12/12] io_uring: poll_add " Dylan Yudaken
2022-11-02 11:44 ` [PATCH for-next 00/12] io_uring: retarget rsrc nodes periodically Pavel Begunkov
2022-11-02 13:08   ` Dylan Yudaken
