From: Pavel Begunkov <asml.silence@gmail.com>
To: Christian Dietrich <stettberger@dokucode.de>,
	io-uring <io-uring@vger.kernel.org>
Cc: Horst Schirmeier <horst.schirmeier@tu-dortmund.de>,
	"Franz-B. Tuneke" <franz-bernhard.tuneke@tu-dortmund.de>
Subject: Re: [RFC] Programming model for io_uring + eBPF
Date: Wed, 2 Jun 2021 11:47:50 +0100	[thread overview]
Message-ID: <49df117c-dcbd-9b91-e181-e5b2757ae6aa@gmail.com> (raw)
In-Reply-To: <s7b1r9sfn1k.fsf@dokucode.de>

On 5/27/21 12:12 PM, Christian Dietrich wrote:
> Pavel Begunkov <asml.silence@gmail.com> [21. May 2021]:
> 
>>> The problem that I see is that eBPF in io_uring breaks this fine
>>> synchronization as eBPF SQE submission and userspace SQE submission can
>>> run in parallel.
>>
>> It definitely won't be a part of ABI, but they actually do serialise
>> at the moment.
> 
> They serialize because they are executed by the same worker thread,
> right?

No, because all submissions are currently under the ctx->uring_lock mutex
and also executed in the submitting task's context (no io-wq). I hope to
get rid of the second restriction, i.e. prior to upstreaming, and maybe
do something about the first one in the far future if there is a need.
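
Roughly, the serialisation point looks like this today (a simplified
sketch of the submission path, not the exact kernel code):

/*
 * Simplified sketch, not the exact kernel code: both userspace
 * submission via io_uring_enter() and BPF-driven submission
 * currently funnel through the same per-ring mutex and run in
 * the submitting task.
 */
static int submit_locked(struct io_ring_ctx *ctx, unsigned int nr)
{
        int submitted;

        mutex_lock(&ctx->uring_lock);
        submitted = io_submit_sqes(ctx, nr);
        mutex_unlock(&ctx->uring_lock);
        return submitted;
}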

> Perhaps that is the solution to my synchronization problem. If/when
> io_uring supports more than one eBPF executor, we should make the
> number of executors configurable at setup time. Thereby, the user can
> implicitly manage serialization of eBPF execution.

I wouldn't keep it a part of the ABI, but it may be good enough for
experiments. So it'll be UB and may break unless we figure
something out.

... actually it sounds like a sane idea to do the lock grabbing lazily,
i.e. only if the program actually submits requests. That may make sense if
you control several requests from BPF, e.g. keeping up QD + batching.
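
Something along these lines, a purely hypothetical sketch (the names
are made up), where the lock is only taken once the BPF program
actually queues new SQEs:

/* Hypothetical sketch of lazy lock grabbing: invocations that only
 * look at CQEs stay lock-free, the mutex is taken on first submit.
 */
struct bpf_submit_state {
        struct io_ring_ctx *ctx;
        bool locked;
};

static void bpf_state_lock_once(struct bpf_submit_state *st)
{
        if (!st->locked) {
                mutex_lock(&st->ctx->uring_lock);
                st->locked = true;
        }
}

static void bpf_state_unlock(struct bpf_submit_state *st)
{
        if (st->locked) {
                mutex_unlock(&st->ctx->uring_lock);
                st->locked = false;
        }
}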

>>> But going back to my original wish: I wanted to ensure that I can
>>> serialize eBPF-SQEs such that I'm sure that they do not run in parallel.
>>> My idea was to use synchronization groups as a generalization of
>>> SQE linking in order to make it also useful for others (not only for eBPF).
>>
>> So, let's dissect it a bit more: why do you need serialising as
>> such? What use case do you have in mind, and let's see if it indeed
>> can't be implemented efficiently with what we have.
> 
> What I want to do is to manipulate (read-calculate-update) user memory
> from eBPF without the need to synchronize between eBPF invocations.
> 
> As eBPF invocations have run-to-completion semantics, it feels bad to
> use lock-based synchronization. Besides waiting for user memory to be
> swapped in, they will usually be short and plug together results and
> newly emitted CQEs.

Swapping can't be done under spinlocks; that's the reason why userspace
reads/writes are available only to sleepable BPF.
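
For reference, the read side is already expressible in a sleepable
program via bpf_copy_from_user(); a minimal sketch (the attach point
and the way the user pointer is passed in are placeholders, not the
io_uring-BPF API):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Placeholder: user address handed over via a single-slot array map. */
struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
} user_ptr SEC(".maps");

SEC("fentry.s/io_submit_sqes")  /* placeholder sleepable attach point */
int BPF_PROG(read_user_counter)
{
        __u32 key = 0;
        __u64 *uaddr = bpf_map_lookup_elem(&user_ptr, &key);
        __u64 counter;

        if (!uaddr)
                return 0;
        /* May fault the page in, hence sleepable-only. */
        if (bpf_copy_from_user(&counter, sizeof(counter),
                               (void *)(unsigned long)*uaddr))
                return 0;
        /* ... use counter ... */
        return 0;
}

char LICENSE[] SEC("license") = "GPL";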

Btw, I was thinking about adding atomic ops on userspace memory.
I can code an easy solution, but it may have too much additional
overhead. It's the same situation as with normal loads/stores being
slow, mentioned below.
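
Very roughly what I mean by an "easy solution", with a made-up helper
name, nothing like this exists yet; the overhead comes from pinning
and mapping the page per call:

/* Hypothetical kernel-side sketch, not real code: pin the user page,
 * do the atomic op on the kernel mapping, unpin. Correct but not
 * exactly cheap per operation.
 */
static long bpf_uatomic_add64(u64 __user *uaddr, u64 val)
{
        struct page *page;
        void *kaddr;

        if (get_user_pages_fast((unsigned long)uaddr, 1, FOLL_WRITE, &page) != 1)
                return -EFAULT;

        kaddr = kmap_atomic(page);
        atomic64_add(val, (atomic64_t *)(kaddr + offset_in_page(uaddr)));
        kunmap_atomic(kaddr);
        put_page(page);
        return 0;
}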

> 
>> To recap: BPF programs don't share the SQ with userspace at all, and may have
>> separate CQs to reap events from. You may post an event and it's
>> wait-synchronised, so it may act as message-based synchronisation,
>> see test3 in the recently posted v2 for example. I'll also be
>> adding futex support (bpf + separate requests), it might
>> come in handy for some users.
> 
> I'm sure that it is possible to use those mechanisms for synchronizing,
> but I assume that explicit synchronization (locks, passing tokens
> around) is more expensive than sequentializing requests (implicit
> synchronization) before starting to execute them.

As you know, it depends on the work/synchronisation ratio, but I'm also
concerned about scalability. Consider the same argument applied to
sequentializing normal user threads (e.g. pthreads). Not quite
the same extent, but it shows the idea.

> But probably, we need some benchmarks to see what performs better.

Would be great to have some in any case. Userspace reads/writes
may be slow. That can be solved (if it's a problem), but it would
require work to be done on the BPF kernel side.

>>> My reasoning behind not doing this serialization in userspace is that I
>>> want to use the SQPOLL mode and execute long chains of
>>> IO/computation-SQEs without leaving the kernelspace.
>>
>> btw, "in userspace" is now more vague as it can be done by BPF
>> as well. For some use cases I'd expect BPF acting as a reactor,
>> e.g. on completion of previous CQEs and submitting new requests
>> in response, and so keeping it entirely in kernel space until
>> it has anything to tell the userspace, e.g. by posting
>> into the main CQ.
> 
> Yes, exactly that is my motivation. But I also think that it is a useful
> pattern to have many eBPF callbacks pending (e.g. one for each
> connection).

Good case. One more question is how much generality such cases need.
E.g. there are some data structures in BPF, I don't remember the names,
but something like perf buffers or BPF ring buffers.
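
E.g. BPF_MAP_TYPE_RINGBUF would cover the "hand results back to
userspace" part of such per-connection callbacks; a minimal sketch
(the hook it is called from is left abstract):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct conn_event {
        __u32 conn_id;
        __u32 bytes;
};

struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 256 * 1024);        /* bytes, power of two */
} events SEC(".maps");

/* Reserve, fill, submit; userspace drains it via libbpf's ring_buffer__poll(). */
static __always_inline int report(__u32 conn_id, __u32 bytes)
{
        struct conn_event *e;

        e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
        if (!e)
                return 0;       /* buffer full, drop */
        e->conn_id = conn_id;
        e->bytes = bytes;
        bpf_ringbuf_submit(e, 0);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";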

> 
> With one pending invocation per connection, synchronization with a fixed
> number of additional CQEs might be problematic: For example, for a
> per-connection barrier synchronization with the CQ-reap approach, one
> needs one CQ for each connection.
> 
>>> The problem that I had when thinking about the implementation is that
>>> IO_LINK semantic works in the wrong direction: Link the next SQE,
>>> whenever it comes to this SQE. If it would be the other way around
>>> ("Link this SQE to the previous one") it would be much easier as the
>>> cost would only arise if we actually request linking. But compatibility..
>>
>> Stack vs queue style linking? If I understand what you mean right, that's
>> because this is how SQ is parsed and so that's the most efficient way.
> 
> No, I did not want to argue about the ordering within the link chain,
> but about the semantics of the link flag. I thought that it might have been
> beneficial to let the flag indicate that the SQE should be linked to the
> previous one. However, after thinking this through in more detail, I now
> believe that it does not make any difference for the submission path.
> 
> 
>>> Ok, but what happens if the last SQE in an io_submit_sqes() call
>>> requests linking? Is it envisioned that the first SQE that comes with
>>> the next io_submit_sqes() is linked to that one?
>>
>> No, it doesn't leave the submission boundary (e.g. a single syscall).
>> In theory it may be left there _not_ submitted, but I don't see much
>> profit in it.
>>
>>> If this is not supported, what happens if I use the SQPOLL mode where
>>>   the poller thread can partition my submitted SQEs at an arbitrary
>>>   point into multiple io_submit_sqes() calls?
>>
>> It's not arbitrary, submission is atomic in nature: first you fill
>> SQEs in memory, but they are not visible to SQPOLL in the meantime,
>> and then you "commit" them by overwriting the SQ tail pointer.
>>
>> A not-so-great exception to that is the shared SQPOLL task, but it
>> just waits for someone to take care of the case.
>>
>> if (cap_entries && to_submit > 8)
>> 	to_submit = 8;
>>
>>> If this is supported, link.head has to point to the last submitted SQE after
>>>   the first io_submit_sqes() call. Isn't appending SQEs in the
>>>   second io_submit_sqes() call then racy with the completion side (with the
>>>   same problems that I tried to solve)?
>>
>> Exactly why it's not supported
> 
> Thank you for this detailed explanation. I now understand the design
> decision with the SQE linking much better, and why delayed linking of
> SQEs introduces a coupling between the completion and submission sides
> that is undesirable.

You're welcome, hope we'll get the API into shape in no time.
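
P.S. For completeness, the "commit by overwriting the SQ tail pointer"
step quoted above is roughly this on the userspace side (a simplified
sketch in the spirit of liburing, not its exact code):

#include <stdatomic.h>
#include <stdint.h>

struct sq_ring {
        _Atomic uint32_t *ktail;        /* tail index shared with the kernel */
        uint32_t local_tail;            /* shadow tail, only touched by the app */
};

/* SQEs written into the ring stay invisible to the kernel (including an
 * SQPOLL thread) until the tail is published with release semantics,
 * so each submission batch becomes visible atomically.
 */
static void sq_commit(struct sq_ring *sq, uint32_t filled)
{
        sq->local_tail += filled;
        atomic_store_explicit(sq->ktail, sq->local_tail, memory_order_release);
}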

-- 
Pavel Begunkov
