Re: [RFC] Programming model for io_uring + eBPF

From: Pavel Begunkov <asml.silence@gmail.com>
To: Christian Dietrich <stettberger@dokucode.de>,
	io-uring <io-uring@vger.kernel.org>
Cc: Horst Schirmeier <horst.schirmeier@tu-dortmund.de>,
	"Franz-B. Tuneke" <franz-bernhard.tuneke@tu-dortmund.de>
Subject: Re: [RFC] Programming model for io_uring + eBPF
Date: Fri, 21 May 2021 11:27:22 +0100	[thread overview]
Message-ID: <e11cd3e6-b1be-2098-732a-2987a5a9f842@gmail.com> (raw)
In-Reply-To: <s7bsg2hwitp.fsf@dokucode.de>

On 5/20/21 4:01 PM, Christian Dietrich wrote:
> Pavel Begunkov <asml.silence@gmail.com> [20. May 2021]:
> 
>> atomics/spinlocks scale purely, can be used we would surely prefer to
>> avoid them in hot paths if possible.
> 
> I understand that. Do you consider SQE-Linking part of the hot path or
> is it OK if linking SQE results overhead?
> 
>> E.g., if userspace is single threaded or already finely sync
>> submission/completion, the exactly same algorithm can be implemented
>> without any extra synchronisation there.
> 
> The problem that I see is that eBPF in io_uring breaks this fine
> synchronization as eBPF SQE submission and userspace SQE submission can
> run in parallel.

It definitely won't be a part of ABI, but they actually do serialise
at the moment. But even if not, there are way doing that sync-less
not ctx-wide, but per submitter.

> But going back to my original wish: I wanted to ensure that I can
> serialize eBPF-SQEs such that I'm sure that they do not run in parallel.
> My idea was to use synchronization groups as a generalization of
> SQE linking in order to make it also useful for others (not only for eBPF).

So, let's dissect it a bit more, why do you need serialising as
such? What use case you have in mind, and let's see if it's indeed
can't be implemented efficiently with what we have.

To recap: BPFs don't share SQ with userspace at all, and may have
separate CQs to reap events from. You may post an event and it's
wait synchronised, so may act as a message-based synchronisation,
see test3 in the recently posted v2 for example. I'll also be
adding futex support (bpf + separate requests), it might
play handy for some users.

> My reasoning being not doing this serialization in userspace is that I
> want to use the SQPOLL mode and execute long chains of
> IO/computation-SQEs without leaving the kernelspace.

btw, "in userspace" is now more vague as it can be done by BPF
as well. For some use cases I'd expect BPF acting as a reactor,
e.g. on completion of previous CQEs and submitting new requests
in response, and so keeping it entirely in kernel space until
it have anything to tell to the userspace, e.g. by posting
into the main CQ. 

>> Not telling that the feature can't have place, just needs good
>> enough justification (e.g. performance or opening new use cases)
>> comparing to complexity/etc. So the simpler/less intrusive the
>> better.
> 
> I hope that you find this discussion as fruitful as I do. I really enjoy
> searching for an abstraction that is simple and yet powerful enough to
> fulfill user requirements.

It is, I just have my doubts that it's the most flexible/performant
option. Would be easier to reason with specific use cases in mind,
so we may see if it can't be done in a better way

>>> At submission time, we have to append requests, if there is a
>>> predecessor. For this, we extend io_submit_sqe to work with multiple
>>> groups:
>>>
>>>    u8 sg = req->sync_group;
>>>    struct io_kiocb **link_field_new =
>>>        (req->flags & (REQ_F_LINK | REQ_F_HARDLINK)) ? &(req->link) : NULL;
>>>
>>> retry:
>>>    struct io_kiocb **link_field = ctx->sync_groups[sg]
>>>    if (link_field) {
>>>        // Try to append to previous SQE. However, we might run in
>>>        // parallel to __io_req_find_next.
>>>
>>>        // Edit the link field of the previous SQE.
>>>        *link_field = req;
>>
>> By this time the req might already be gone (and sync_groups[] changed to
>> NULL/next), and so you may get use after free. Also modifying it will
>> break some cancellation bits who want it to be stable and conditionally
>> sync using completion_lock.
> 
> Yes, that is right if the completion side frees requests. Is this the
> case or is the SQE returned to an io_kiocb cache?

Might be freed. Caching them is a recent feature and doesn't cover
all the cases, at least for now.

>>> In essence, the sync_groups would be a lock_free queue with a dangling
>>> head that is even wait-free on the completion side. The above is surely
>>> not correct, but with a few strategic load_aquire and the store_release
>>> it probably can be made correct.
>>
>> Neither those are free -- cache bouncing
> 
> The problem that I had when thinking about the implementation is that
> IO_LINK semantic works in the wrong direction: Link the next SQE,
> whenever it comes to this SQE. If it would be the other way around
> ("Link this SQE to the previous one") it would be much easier as the
> cost would only arise if we actually request linking. But compatibility..

Stack vs queue style linking? If I understand what you mean right, that's
because this is how SQ is parsed and so that's the most efficient way.

>>> And while it is not free, there already should be a similar kind of
>>> synchronization between submission and completion if it should be
>>> possible to link SQE to SQEs that are already in flight and could
>>> complete while we want to link it.
>>> Otherwise, SQE linking would only work for SQEs that are submitted in
>>> one go, but as io_submit_state_end() does not clear
>>> state->link.head, I think this is supposed to work.
>>
>> Just to clarify, it submits all the collected link before returning,
>> just the invariant is based on link.head, if NULL there is no link,
>> if not tail is initialised.
> 
> Ok, but what happens if the last SQE in an io_submit_sqes() call
> requests linking? Is it envisioned that the first SQE that comes with
> the next io_submit_sqes() is linked to that one?

No, it doesn't leave submission boundary (e.g. a single syscall).
In theory may be left there _not_ submitted, but don't see much
profit in it.

> If this is not supported, what happens if I use the SQPOLL mode where
>   the poller thread can partition my submitted SQEs at an arbitrary
>   point into multiple io_submit_sqes() calls?

It's not arbitrary, submission is atomic in nature, first you fill
SQEs in memory but they are not visible to SQPOLL in a meanwhile,
and then you "commit" them by overwriting SQ tail pointer.

Not a great exception for that is shared SQPOLL task, but it
just waits someone to take care of the case.

if (cap_entries && to_submit > 8)
	to_submit = 8;

> If this is supported, link.head has to point to the last submitted SQE after
>   the first io_submit_sqes()-call. Isn't then appending SQEs in the
>   second io_submit_sqes()-call racy with the completion side. (With the
>   same problems that I tried to solve?

Exactly why it's not supported

-- 
Pavel Begunkov