Re: [PATCH v2 2/2] io_uring: flush timeouts that should already have expired

From: Pavel Begunkov <asml.silence@gmail.com>
To: Marcelo Diop-Gonzalez <marcelo827@gmail.com>, axboe@kernel.dk
Cc: io-uring@vger.kernel.org
Subject: Re: [PATCH v2 2/2] io_uring: flush timeouts that should already have expired
Date: Sat, 2 Jan 2021 20:26:26 +0000
Message-ID: <c0cde7df-f19f-92fd-e0f6-855396d126ab@gmail.com>
In-Reply-To: <d3feb2bc-b456-d057-e553-af024b234d31@gmail.com>

On 02/01/2021 19:54, Pavel Begunkov wrote:
> On 19/12/2020 19:15, Marcelo Diop-Gonzalez wrote:
>> Right now io_flush_timeouts() checks if the current number of events
>> is equal to ->timeout.target_seq, but this will miss some timeouts if
>> there have been more than 1 event added since the last time they were
>> flushed (possible in io_submit_flush_completions(), for example). Fix
>> it by recording the starting value of ->cached_cq_overflow -
>> ->cq_timeouts instead of the target value, so that we can safely
>> (without overflow problems) compare the number of events that have
>> happened with the number of events needed to trigger the timeout.

https://www.spinics.net/lists/kernel/msg3475160.html

The idea was to replace u32 cached_cq_tail with u64 while keeping
timeout offsets u32. Assuming that we won't ever hit ~2^62 inflight
requests, complete all requests falling into some large enough window
behind that u64 cached_cq_tail.

Simplifying:

s64 d = target_off - ctx->u64_cq_tail;
if (d <= 0 && d > -(1LL << 32))
	complete_it();

I'm not fond of it, but at least it worked at the time. You can try
this approach if you want, but it would be perfect if you found
something more elegant :)
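
For illustration, the window check above can be sketched as a small userspace
function. This is only a sketch under assumptions: timeout_should_fire() and
its parameters are hypothetical names, both the target and the tail are taken
to be 64-bit sequence numbers, and the unsigned-to-signed conversion is assumed
to wrap in the usual two's complement way.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: decide whether a timeout with
 * 64-bit target sequence target_off should complete, given a 64-bit
 * CQ tail counter cq_tail.
 */
static bool timeout_should_fire(uint64_t target_off, uint64_t cq_tail)
{
	/*
	 * Signed distance from the tail to the target. Zero or negative
	 * means the tail has reached or passed the target.
	 */
	int64_t d = (int64_t)(target_off - cq_tail);

	/*
	 * Accept only targets that are fewer than 2^32 events behind the
	 * tail, i.e. within the window a u32 offset can describe.
	 */
	return d <= 0 && d > -((int64_t)1 << 32);
}
```

A target equal to the tail, or behind it by up to 2^32 - 1 events, is treated
as expired; anything ahead of the tail, or further behind, is left alone.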

>>
>> Signed-off-by: Marcelo Diop-Gonzalez <marcelo827@gmail.com>
>> ---
>>  fs/io_uring.c | 30 +++++++++++++++++++++++-------
>>  1 file changed, 23 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index f394bf358022..f62de0cb5fc4 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -444,7 +444,7 @@ struct io_cancel {
>>  struct io_timeout {
>>  	struct file			*file;
>>  	u32				off;
>> -	u32				target_seq;
>> +	u32				start_seq;
>>  	struct list_head		list;
>>  	/* head of the link, used by linked timeouts only */
>>  	struct io_kiocb			*head;
>> @@ -1629,6 +1629,24 @@ static void __io_queue_deferred(struct io_ring_ctx *ctx)
>>  	} while (!list_empty(&ctx->defer_list));
>>  }
>>  
>> +static inline u32 io_timeout_events_left(struct io_kiocb *req)
>> +{
>> +	struct io_ring_ctx *ctx = req->ctx;
>> +	u32 events;
>> +
>> +	/*
>> +	 * events -= req->timeout.start_seq and the comparison between
>> +	 * ->timeout.off and events will not overflow because each time
>> +	 * ->cq_timeouts is incremented, ->cached_cq_tail is incremented too.
>> +	 */
>> +
>> +	events = ctx->cached_cq_tail - atomic_read(&ctx->cq_timeouts);
>> +	events -= req->timeout.start_seq;
> 
> It looks to me like events, before the start_seq subtraction, may already have
> wrapped around start_seq.
> 
> e.g.
> 1) you submit a timeout with off=0xff...ff (start_seq=0 for convenience)
> 
> 2) some time has passed, let @events = 0xff...ff - 1
> so the timeout still waits
> 
> 3) we commit 5 requests at once and call io_commit_cqring() only once for
> them, so we get @events == 0xff...ff - 1 + 5, i.e. 3
> 
> @events == 3 < off == 0xff...ff,
> so we didn't trigger our timeout even though we should have
> 
>> +	if (req->timeout.off > events)
>> +		return req->timeout.off - events;
>> +	return 0;
>> +}
>> +
>>  static void io_flush_timeouts(struct io_ring_ctx *ctx)
>>  {
>>  	while (!list_empty(&ctx->timeout_list)) {
>> @@ -1637,8 +1655,7 @@ static void io_flush_timeouts(struct io_ring_ctx *ctx)
>>  
>>  		if (io_is_timeout_noseq(req))
>>  			break;
>> -		if (req->timeout.target_seq != ctx->cached_cq_tail
>> -					- atomic_read(&ctx->cq_timeouts))
>> +		if (io_timeout_events_left(req) > 0)
>>  			break;
>>  
>>  		list_del_init(&req->timeout.list);
>> @@ -5785,7 +5802,6 @@ static int io_timeout(struct io_kiocb *req)
>>  	struct io_ring_ctx *ctx = req->ctx;
>>  	struct io_timeout_data *data = req->async_data;
>>  	struct list_head *entry;
>> -	u32 tail, off = req->timeout.off;
>>  
>>  	spin_lock_irq(&ctx->completion_lock);
>>  
>> @@ -5799,8 +5815,8 @@ static int io_timeout(struct io_kiocb *req)
>>  		goto add;
>>  	}
>>  
>> -	tail = ctx->cached_cq_tail - atomic_read(&ctx->cq_timeouts);
>> -	req->timeout.target_seq = tail + off;
>> +	req->timeout.start_seq = ctx->cached_cq_tail -
>> +		atomic_read(&ctx->cq_timeouts);
>>  
>>  	/*
>>  	 * Insertion sort, ensuring the first entry in the list is always
>> @@ -5813,7 +5829,7 @@ static int io_timeout(struct io_kiocb *req)
>>  		if (io_is_timeout_noseq(nxt))
>>  			continue;
>>  		/* nxt.seq is behind @tail, otherwise would've been completed */
>> -		if (off >= nxt->timeout.target_seq - tail)
>> +		if (req->timeout.off >= io_timeout_events_left(nxt))
>>  			break;
>>  	}
>>  add:
>>
> 
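The wraparound case from the quoted review can be reproduced in userspace with
a copy of the patch's arithmetic. This is a sketch, not the kernel function:
events_left_u32() is a hypothetical name, and tail_minus_timeouts stands in for
ctx->cached_cq_tail - atomic_read(&ctx->cq_timeouts).

```c
#include <stdint.h>

/*
 * Userspace mirror of the arithmetic in the patch's
 * io_timeout_events_left(). All values are 32-bit, so the event
 * counter silently wraps at 2^32.
 */
static uint32_t events_left_u32(uint32_t off, uint32_t start_seq,
				uint32_t tail_minus_timeouts)
{
	/* Events seen since the timeout was queued (mod 2^32). */
	uint32_t events = tail_minus_timeouts - start_seq;

	if (off > events)
		return off - events;
	return 0;
}
```

With off = 0xffffffff and start_seq = 0, a counter value of 0xfffffffe leaves
one event to go. If five completions are then committed in one batch, the
counter wraps to 3 and the function reports 0xfffffffc events left instead
of 0, so the timeout is not flushed.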

-- 
Pavel Begunkov