I/O cancellation in io-uring (was: io_uring: clear TIF_NOTIFY_SIGNAL ...)

From: Nadav Amit <nadav.amit@gmail.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>,
	Olivier Langlois <olivier@trillion01.com>,
	io-uring@vger.kernel.org,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: I/O cancellation in io-uring (was: io_uring: clear TIF_NOTIFY_SIGNAL ...)
Date: Tue, 10 Aug 2021 22:40:58 -0700	[thread overview]
Message-ID: <29430B77-37C2-4E65-B279-2590CF825FB3@gmail.com> (raw)
In-Reply-To: <1bf56100-e904-65b5-bbb8-fa313d85b01a@kernel.dk>

> On Aug 10, 2021, at 7:51 PM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> There's no way to cancel file/bdev related IO, and there likely never
> will be. That's basically the only exception, everything else can get
> canceled pretty easily. Many things can be written on why that is the
> case, and they have (myself included), but it boils down to proper
> hardware support which we'll likely never have as it's not a well tested
> path. For other kind of async IO, we're either waiting in poll (which is
> trivially cancellable) or in an async thread (which is also easily
> cancellable). For DMA+irq driven block storage, we'd need to be able to
> reliably cancel on the hardware side, to prevent errant DMA after the
> fact.
> 
> None of this is really io_uring specific, io_uring just suffers from the
> same limitations as others would (or are).
> 
>> Otherwise they might potentially never complete, as happens in my
>> use-case.
> 
> If you see that, that is most certainly a bug. While bdev/reg file IO
> can't really be canceled, they all have the property that they complete
> in finite time. Either the IO completes normally in a "short" amount of
> time, or a timeout will cancel it and complete it in error. There are no
> unbounded execution times for uncancellable IO.

I understand your point that hardware reads/writes cannot easily be cancelled.
(well, the buffers can be unmapped from the IOMMU tables, but let's put this
discussion aside).

Yet the question is whether reads/writes from special files such as pipes,
eventfd, signalfd, fuse should be cancellable. Obviously, it is always
possible to issue a blocking read/write from a worker thread. Yet, there are
inherent overheads that are associated with this design, specifically
context-switches. While the overhead of a context-switch is not as high as
portrayed by some, it is still high for low latency use-cases.

There is a potential alternative, however. When a write to a pipe is
performed, or when an event takes place or signal sent, queued io-uring reads
can be fulfilled immediately, without a context-switch to a worker. I
specifically want to fulfill userfaultfd reads and notify userspace on
page-faults in such manner. I do not have the numbers in front of me, but
doing so shows considerable performance improvement.

To allow such use-cases, cancellation of the read/write is needed. A read from
a pipe might never complete if there is no further writes to the pipe.
Cancellation is not hard to implement for such cases (it's only the mess with
the existing AIO's ki_cancel() that bothers me, but as you said - it is a
single use-case).

Admittedly, there are no such use-cases in the kernel today, but arguably,
this is due to the lack of infrastructure. I see no alternative which is as
performant as the one I propose here. Using poll+read or any other option
will have unwarranted overhead.  

If an RFC might convince you, or some more mainstream use-case such as queued
pipe reads would convince you, I can work on such in order to try to get
something like that.