From: Pavel Begunkov <asml.silence@gmail.com>
To: Christian Dietrich <stettberger@dokucode.de>,
	io-uring <io-uring@vger.kernel.org>
Cc: Horst Schirmeier <horst.schirmeier@tu-dortmund.de>,
	"Franz-B. Tuneke" <franz-bernhard.tuneke@tu-dortmund.de>
Subject: Re: [RFC] Programming model for io_uring + eBPF
Date: Wed, 2 Jun 2021 11:47:50 +0100	[thread overview]
Message-ID: <49df117c-dcbd-9b91-e181-e5b2757ae6aa@gmail.com> (raw)
In-Reply-To: <s7b1r9sfn1k.fsf@dokucode.de>

On 5/27/21 12:12 PM, Christian Dietrich wrote:
> Pavel Begunkov <asml.silence@gmail.com> [21. May 2021]:
> 
>>> The problem that I see is that eBPF in io_uring breaks this fine
>>> synchronization as eBPF SQE submission and userspace SQE submission can
>>> run in parallel.
>>
>> It definitely won't be a part of ABI, but they actually do serialise
>> at the moment.
> 
> They serialize because they are executed by the same worker thread,
> right?

No, because all submissions are currently under the ctx->uring_lock mutex
and also executed in the submitting task's context (no io-wq). I hope to
get rid of the second restriction, i.e. prior to upstreaming, and maybe
do something about the first one in the far future if there is a need.
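
Roughly, the serialisation point looks like this today (a simplified
sketch of the submission path, not the exact kernel code):

/*
 * Simplified sketch, not the exact kernel code: both userspace
 * submission via io_uring_enter() and BPF-driven submission
 * currently funnel through the same per-ring mutex and run in
 * the submitting task.
 */
static int submit_locked(struct io_ring_ctx *ctx, unsigned int nr)
{
        int submitted;

        mutex_lock(&ctx->uring_lock);
        submitted = io_submit_sqes(ctx, nr);
        mutex_unlock(&ctx->uring_lock);
        return submitted;
}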

> Perhaps that is the solution to my synchronization problem. If/when
> io_uring supports more than one eBPF executor, we should make the
> number of executors configurable at setup time. Thereby, the user can
> implicitly manage serialization of eBPF execution.

I wouldn't keep it a part of the ABI, but it may be good enough for
experiments. So it'll be UB and may break unless we figure
something out.

... actually it sounds like a sane idea to do the lock grabbing lazily,
i.e. only if the program actually submits requests. That may make sense if
you control several requests from BPF, e.g. keeping up QD + batching.
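
Something along these lines, a purely hypothetical sketch (the names
are made up), where the lock is only taken once the BPF program
actually queues new SQEs:

/* Hypothetical sketch of lazy lock grabbing: invocations that only
 * look at CQEs stay lock-free, the mutex is taken on first submit.
 */
struct bpf_submit_state {
        struct io_ring_ctx *ctx;
        bool locked;
};

static void bpf_state_lock_once(struct bpf_submit_state *st)
{
        if (!st->locked) {
                mutex_lock(&st->ctx->uring_lock);
                st->locked = true;
        }
}

static void bpf_state_unlock(struct bpf_submit_state *st)
{
        if (st->locked) {
                mutex_unlock(&st->ctx->uring_lock);
                st->locked = false;
        }
}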

>>> But going back to my original wish: I wanted to ensure that I can
>>> serialize eBPF-SQEs such that I'm sure that they do not run in parallel.
>>> My idea was to use synchronization groups as a generalization of
>>> SQE linking in order to make it also useful for others (not only for eBPF).
>>
>> So, let's dissect it a bit more: why do you need serialising as
>> such? What use case do you have in mind, and let's see if it indeed
>> can't be implemented efficiently with what we have.
> 
> What I want to do is to manipulate (read-calculate-update) user memory
> from eBPF without the need to synchronize between eBPF invocations.
> 
> As eBPF invocations have run-to-completion semantics, it feels bad to
> use lock-based synchronization. Besides waiting for user memory to be
> swapped in, they will usually be short and plug together results and
> newly emitted CQEs.

Swapping can't be done under spinlocks; that's the reason why userspace
reads/writes are available only to sleepable BPF.
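
For reference, the read side is already expressible in a sleepable
program via bpf_copy_from_user(); a minimal sketch (the attach point
and the way the user pointer is passed in are placeholders, not the
io_uring-BPF API):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Placeholder: user address handed over via a single-slot array map. */
struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
} user_ptr SEC(".maps");

SEC("fentry.s/io_submit_sqes")  /* placeholder sleepable attach point */
int BPF_PROG(read_user_counter)
{
        __u32 key = 0;
        __u64 *uaddr = bpf_map_lookup_elem(&user_ptr, &key);
        __u64 counter;

        if (!uaddr)
                return 0;
        /* May fault the page in, hence sleepable-only. */
        if (bpf_copy_from_user(&counter, sizeof(counter),
                               (void *)(unsigned long)*uaddr))
                return 0;
        /* ... use counter ... */
        return 0;
}

char LICENSE[] SEC("license") = "GPL";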

Btw, I was thinking about adding atomic ops on userspace memory.
I can code an easy solution, but it may have too much additional
overhead. It's the same situation as with normal loads/stores being
slow, mentioned below.
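
Very roughly what I mean by an "easy solution", with a made-up helper
name, nothing like this exists yet; the overhead comes from pinning
and mapping the page per call:

/* Hypothetical kernel-side sketch, not real code: pin the user page,
 * do the atomic op on the kernel mapping, unpin. Correct but not
 * exactly cheap per operation.
 */
static long bpf_uatomic_add64(u64 __user *uaddr, u64 val)
{
        struct page *page;
        void *kaddr;

        if (get_user_pages_fast((unsigned long)uaddr, 1, FOLL_WRITE, &page) != 1)
                return -EFAULT;

        kaddr = kmap_atomic(page);
        atomic64_add(val, (atomic64_t *)(kaddr + offset_in_page(uaddr)));
        kunmap_atomic(kaddr);
        put_page(page);
        return 0;
}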

> 
>> To recap: BPF programs don't share the SQ with userspace at all, and may have
>> separate CQs to reap events from. You may post an event and it's
>> wait-synchronised, so it may act as message-based synchronisation,
>> see test3 in the recently posted v2 for example. I'll also be
>> adding futex support (bpf + separate requests), it might
>> come in handy for some users.
> 
> I'm sure that it is possible to use those mechanisms for synchronizing,
> but I assume that explicit synchronization (locks, passing tokens
> around) is more expensive than sequentializing requests (implicit
> synchronization) before starting to execute them.

As you know, it depends on the work/synchronisation ratio, but I'm also
concerned about scalability. Consider the same argument applied to
sequentializing normal user threads (e.g. pthreads). Not quite
the same extent, but it shows the idea.

> But probably, we need some benchmarks to see what performs better.

Would be great to have some in any case. Userspace reads/writes
may be slow. That can be solved (if it's a problem), but it would
require work to be done on the BPF kernel side.

>>> My reasoning behind not doing this serialization in userspace is that I
>>> want to use the SQPOLL mode and execute long chains of
>>> IO/computation-SQEs without leaving the kernelspace.
>>
>> btw, "in userspace" is now more vague as it can be done by BPF
>> as well. For some use cases I'd expect BPF acting as a reactor,
>> e.g. on completion of previous CQEs and submitting new requests
>> in response, and so keeping it entirely in kernel space until
>> it has anything to tell the userspace, e.g. by posting
>> into the main CQ.
> 
> Yes, exactly that is my motivation. But I also think that it is a useful
> pattern to have many eBPF callbacks pending (e.g. one for each
> connection).

Good case. One more question is how much generality such cases need.
E.g. there are some data structures in BPF, I don't remember the names,
but something like perf buffers or BPF ring buffers.
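
E.g. BPF_MAP_TYPE_RINGBUF would cover the "hand results back to
userspace" part of such per-connection callbacks; a minimal sketch
(the hook it is called from is left abstract):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct conn_event {
        __u32 conn_id;
        __u32 bytes;
};

struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 256 * 1024);        /* bytes, power of two */
} events SEC(".maps");

/* Reserve, fill, submit; userspace drains it via libbpf's ring_buffer__poll(). */
static __always_inline int report(__u32 conn_id, __u32 bytes)
{
        struct conn_event *e;

        e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
        if (!e)
                return 0;       /* buffer full, drop */
        e->conn_id = conn_id;
        e->bytes = bytes;
        bpf_ringbuf_submit(e, 0);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";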

> 
> With one pending invocation per connection, synchronization with a fixed
> number of additional CQEs might be problematic: For example, for a
> per-connection barrier synchronization with the CQ-reap approach, one
> needs one CQ for each connection.
> 
>>> The problem that I had when thinking about the implementation is that
>>> IO_LINK semantic works in the wrong direction: Link the next SQE,
>>> whenever it comes to this SQE. If it would be the other way around
>>> ("Link this SQE to the previous one") it would be much easier as the
>>> cost would only arise if we actually request linking. But compatibility..
>>
>> Stack vs queue style linking? If I understand what you mean right, that's
>> because this is how SQ is parsed and so that's the most efficient way.
> 
> No, I did not want to argue about the ordering within the link chain,
> but about the semantics of the link flag. I thought that it might have been
> beneficial to let the flag indicate that the SQE should be linked to the
> previous one. However, after thinking this through in more detail, I now
> believe that it does not make any difference for the submission path.
> 
> 
>>> Ok, but what happens if the last SQE in an io_submit_sqes() call
>>> requests linking? Is it envisioned that the first SQE that comes with
>>> the next io_submit_sqes() is linked to that one?
>>
>> No, it doesn't leave the submission boundary (e.g. a single syscall).
>> In theory it may be left there _not_ submitted, but I don't see much
>> profit in it.
>>
>>> If this is not supported, what happens if I use the SQPOLL mode where
>>>   the poller thread can partition my submitted SQEs at an arbitrary
>>>   point into multiple io_submit_sqes() calls?
>>
>> It's not arbitrary, submission is atomic in nature: first you fill
>> SQEs in memory, but they are not visible to SQPOLL in the meantime,
>> and then you "commit" them by overwriting the SQ tail pointer.
>>
>> A not-so-great exception to that is the shared SQPOLL task, but it
>> just waits for someone to take care of the case.
>>
>> if (cap_entries && to_submit > 8)
>> 	to_submit = 8;
>>
>>> If this is supported, link.head has to point to the last submitted SQE after
>>>   the first io_submit_sqes() call. Isn't appending SQEs in the
>>>   second io_submit_sqes() call then racy with the completion side (with the
>>>   same problems that I tried to solve)?
>>
>> Exactly why it's not supported
> 
> Thank you for this detailed explanation. I now understand the design
> decision with the SQE linking much better, and why delayed linking of
> SQEs introduces a coupling between the completion and submission sides
> that is undesirable.

You're welcome, hope we'll get the API into shape in no time.
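
P.S. For completeness, the "commit by overwriting the SQ tail pointer"
step quoted above is roughly this on the userspace side (a simplified
sketch in the spirit of liburing, not its exact code):

#include <stdatomic.h>
#include <stdint.h>

struct sq_ring {
        _Atomic uint32_t *ktail;        /* tail index shared with the kernel */
        uint32_t local_tail;            /* shadow tail, only touched by the app */
};

/* SQEs written into the ring stay invisible to the kernel (including an
 * SQPOLL thread) until the tail is published with release semantics,
 * so each submission batch becomes visible atomically.
 */
static void sq_commit(struct sq_ring *sq, uint32_t filled)
{
        sq->local_tail += filled;
        atomic_store_explicit(sq->ktail, sq->local_tail, memory_order_release);
}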

-- 
Pavel Begunkov
