From: Jens Axboe <axboe@kernel.dk>
To: Pavel Begunkov <asml.silence@gmail.com>, io-uring@vger.kernel.org
Subject: Re: [PATCH RFC] io_uring: limit inflight IO
Date: Sat, 9 Nov 2019 08:15:21 -0700
Message-ID: <38f51d0c-cd27-6631-c4d3-06fbb26a5c1e@kernel.dk>
In-Reply-To: <d8002007-7641-3e9d-0560-123358300e66@kernel.dk>

On 11/9/19 7:23 AM, Jens Axboe wrote:
> On 11/9/19 4:16 AM, Pavel Begunkov wrote:
>>> I've been struggling a bit with how to make this reliable, and I'm not
>>> so sure there's a way to do that. Let's say an application sets up a
>>> ring with 8 sq entries, which would then default to 16 cq entries. With
>>> this patch, we'd allow 16 ios inflight. But what if the application does
>>>
>>> for (i = 0; i < 32; i++) {
>>> 	sqe = get_sqe();
>>> 	prep_sqe();
>>> 	submit_sqe();
>>> }
>>>
>>> And then directly proceeds to:
>>>
>>> do {
>>> 	get_completions();
>>> } while (has_completions);
>>>
>>> As long as fewer than 16 requests complete before we start reaping,
>>> we don't lose any events. Hence there's a risk of breaking existing
>>> setups with this, even though I don't think that's a high risk.
>>>
>>
>> I think this should be considered an erroneous usage of the API.
>> It's better to fail ASAP than to be surprised in a production
>> system by the non-deterministic nature of such code. Debugging
>> that kind of failure is even worse.
>>
>> As for me, cases like the one below are too far-fetched:
>>
>> for (i = 0; i < n; i++)
>> 	submit_read_sqe()
>> for (i = 0; i < n; i++) {
>> 	device_allow_next_read()
>> 	get_single_cqe()
>> }
> 
> I can't really disagree with that; it's a use case that's bound to fail
> every now and then...
> 
> But if we agree that's the case, then we should be able to just limit
> based on the cq ring size in question.
> 
> Do we make it different for CQ_NODROP and !CQ_NODROP or not? Because the
> above case would work with CQ_NODROP, reliably. At least CQ_NODROP is
> new, so we get to set the rules for that one; they just have to make
> sense.

Just tossing this one out there; it's an incremental on top of v2 of the patch.

- Check upfront if we're going over the limit, using the same kind of
  cost amortization logic, but adapted to run once per batch.

- Fold it in with the backpressure -EBUSY logic.

This avoids breaking up chains, for example, and also means we don't
have to run these checks for every request.

The limit kicks in once inflight exceeds 2 * cq_entries (e.g. with the
default 16 CQ entries for an 8-entry SQ ring, submission starts returning
-EBUSY roughly once more than 32 requests are in flight). I think that's
liberal enough not to cause issues, while still bearing a relation to the
sq/cq ring sizes, which I like.
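
For what it's worth, handling that backpressure from userspace could look
roughly like the below. This is just an illustrative sketch using
liburing-style calls (submit_with_backpressure is a made-up helper name),
not part of the patch: on -EBUSY, reap a completion to drop back under the
limit, then retry the submit.

/*
 * Sketch only: treat -EBUSY from submit as transient backpressure.
 * Error handling is trimmed down for brevity.
 */
#include <liburing.h>

static int submit_with_backpressure(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;
	int ret;

	for (;;) {
		ret = io_uring_submit(ring);
		if (ret != -EBUSY)
			return ret;

		/*
		 * Over the inflight limit (or an overflow is pending with
		 * CQ_NODROP): reap one completion to make room, then retry.
		 */
		ret = io_uring_wait_cqe(ring, &cqe);
		if (ret < 0)
			return ret;
		io_uring_cqe_seen(ring, cqe);
	}
}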


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 18711d45b994..53ccd4e1dee2 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -737,25 +737,6 @@ static struct io_kiocb *io_get_fallback_req(struct io_ring_ctx *ctx)
 	return NULL;
 }
 
-static bool io_req_over_limit(struct io_ring_ctx *ctx)
-{
-	unsigned inflight;
-
-	/*
-	 * This doesn't need to be super precise, so only check every once
-	 * in a while.
-	 */
-	if (ctx->cached_sq_head & ctx->sq_mask)
-		return false;
-
-	/*
-	 * Use 2x the max CQ ring size
-	 */
-	inflight = ctx->cached_sq_head -
-		  (ctx->cached_cq_tail + atomic_read(&ctx->cached_cq_overflow));
-	return inflight >= 2 * IORING_MAX_CQ_ENTRIES;
-}
-
 static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
 				   struct io_submit_state *state)
 {
@@ -766,8 +747,6 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
 		return ERR_PTR(-ENXIO);
 
 	if (!state) {
-		if (unlikely(io_req_over_limit(ctx)))
-			goto out_limit;
 		req = kmem_cache_alloc(req_cachep, gfp);
 		if (unlikely(!req))
 			goto fallback;
@@ -775,8 +754,6 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
 		size_t sz;
 		int ret;
 
-		if (unlikely(io_req_over_limit(ctx)))
-			goto out_limit;
 		sz = min_t(size_t, state->ios_left, ARRAY_SIZE(state->reqs));
 		ret = kmem_cache_alloc_bulk(req_cachep, gfp, sz, state->reqs);
 
@@ -812,7 +789,6 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
 	req = io_get_fallback_req(ctx);
 	if (req)
 		goto got_it;
-out_limit:
 	percpu_ref_put(&ctx->refs);
 	return ERR_PTR(-EBUSY);
 }
@@ -3021,6 +2997,30 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
 	return false;
 }
 
+static bool io_sq_over_limit(struct io_ring_ctx *ctx, unsigned to_submit)
+{
+	unsigned inflight;
+
+	if ((ctx->flags & IORING_SETUP_CQ_NODROP) &&
+	    !list_empty(&ctx->cq_overflow_list))
+		return true;
+
+	/*
+	 * This doesn't need to be super precise, so only check every once
+	 * in a while.
+	 */
+	if ((ctx->cached_sq_head & ctx->sq_mask) !=
+	    ((ctx->cached_sq_head + to_submit) & ctx->sq_mask))
+		return false;
+
+	/*
+	 * Limit us to 2x the CQ ring size
+	 */
+	inflight = ctx->cached_sq_head -
+		  (ctx->cached_cq_tail + atomic_read(&ctx->cached_cq_overflow));
+	return inflight > 2 * ctx->cq_entries;
+}
+
 static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 			  struct file *ring_file, int ring_fd,
 			  struct mm_struct **mm, bool async)
@@ -3031,8 +3031,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 	int i, submitted = 0;
 	bool mm_fault = false;
 
-	if ((ctx->flags & IORING_SETUP_CQ_NODROP) &&
-	    !list_empty(&ctx->cq_overflow_list))
+	if (unlikely(io_sq_over_limit(ctx, nr)))
 		return -EBUSY;
 
 	if (nr > IO_PLUG_THRESHOLD) {

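As an aside, opting into the no-drop behaviour that the io_sq_over_limit()
check cooperates with would look something like the below from userspace.
IORING_SETUP_CQ_NODROP is the flag proposed earlier in this series, so it
only exists with those patches (and the matching uapi header) applied;
setup_nodrop_ring is just an illustrative name.

/*
 * Sketch only: set up a ring with the proposed CQ_NODROP flag so
 * completions are never dropped on CQ ring overflow.
 */
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int setup_nodrop_ring(unsigned entries)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	p.flags = IORING_SETUP_CQ_NODROP;	/* from this patch series */

	/* returns the ring fd on success, -1 with errno set on failure */
	return syscall(__NR_io_uring_setup, entries, &p);
}
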
-- 
Jens Axboe

