Re: [RFC] Programming model for io_uring + eBPF

From: Christian Dietrich <stettberger@dokucode.de>
To: Pavel Begunkov <asml.silence@gmail.com>,
	io-uring <io-uring@vger.kernel.org>
Cc: Horst Schirmeier <horst.schirmeier@tu-dortmund.de>,
	"Franz-B. Tuneke" <franz-bernhard.tuneke@tu-dortmund.de>
Subject: Re: [RFC] Programming model for io_uring + eBPF
Date: Thu, 20 May 2021 17:01:06 +0200	[thread overview]
Message-ID: <s7bsg2hwitp.fsf@dokucode.de> (raw)
In-Reply-To: <0468c1d5-9d0a-f8c0-618c-4a40b4677099@gmail.com>

Pavel Begunkov <asml.silence@gmail.com> [20. May 2021]:

> atomics/spinlocks scale purely, can be used we would surely prefer to
> avoid them in hot paths if possible.

I understand that. Do you consider SQE-Linking part of the hot path or
is it OK if linking SQE results overhead?

> E.g., if userspace is single threaded or already finely sync
> submission/completion, the exactly same algorithm can be implemented
> without any extra synchronisation there.

The problem that I see is that eBPF in io_uring breaks this fine
synchronization as eBPF SQE submission and userspace SQE submission can
run in parallel.

But going back to my original wish: I wanted to ensure that I can
serialize eBPF-SQEs such that I'm sure that they do not run in parallel.
My idea was to use synchronization groups as a generalization of
SQE linking in order to make it also useful for others (not only for eBPF).

My reasoning being not doing this serialization in userspace is that I
want to use the SQPOLL mode and execute long chains of
IO/computation-SQEs without leaving the kernelspace.

> Not telling that the feature can't have place, just needs good
> enough justification (e.g. performance or opening new use cases)
> comparing to complexity/etc. So the simpler/less intrusive the
> better.

I hope that you find this discussion as fruitful as I do. I really enjoy
searching for an abstraction that is simple and yet powerful enough to
fulfill user requirements.

>> At submission time, we have to append requests, if there is a
>> predecessor. For this, we extend io_submit_sqe to work with multiple
>> groups:
>> 
>>    u8 sg = req->sync_group;
>>    struct io_kiocb **link_field_new =
>>        (req->flags & (REQ_F_LINK | REQ_F_HARDLINK)) ? &(req->link) : NULL;
>> 
>> retry:
>>    struct io_kiocb **link_field = ctx->sync_groups[sg]
>>    if (link_field) {
>>        // Try to append to previous SQE. However, we might run in
>>        // parallel to __io_req_find_next.
>> 
>>        // Edit the link field of the previous SQE.
>>        *link_field = req;
>
> By this time the req might already be gone (and sync_groups[] changed to
> NULL/next), and so you may get use after free. Also modifying it will
> break some cancellation bits who want it to be stable and conditionally
> sync using completion_lock.

Yes, that is right if the completion side frees requests. Is this the
case or is the SQE returned to an io_kiocb cache?

>> In essence, the sync_groups would be a lock_free queue with a dangling
>> head that is even wait-free on the completion side. The above is surely
>> not correct, but with a few strategic load_aquire and the store_release
>> it probably can be made correct.
>
> Neither those are free -- cache bouncing

The problem that I had when thinking about the implementation is that
IO_LINK semantic works in the wrong direction: Link the next SQE,
whenever it comes to this SQE. If it would be the other way around
("Link this SQE to the previous one") it would be much easier as the
cost would only arise if we actually request linking. But compatibility..

>> And while it is not free, there already should be a similar kind of
>> synchronization between submission and completion if it should be
>> possible to link SQE to SQEs that are already in flight and could
>> complete while we want to link it.
>> Otherwise, SQE linking would only work for SQEs that are submitted in
>> one go, but as io_submit_state_end() does not clear
>> state->link.head, I think this is supposed to work.
>
> Just to clarify, it submits all the collected link before returning,
> just the invariant is based on link.head, if NULL there is no link,
> if not tail is initialised.

Ok, but what happens if the last SQE in an io_submit_sqes() call
requests linking? Is it envisioned that the first SQE that comes with
the next io_submit_sqes() is linked to that one?

If this is not supported, what happens if I use the SQPOLL mode where
  the poller thread can partition my submitted SQEs at an arbitrary
  point into multiple io_submit_sqes() calls?

If this is supported, link.head has to point to the last submitted SQE after
  the first io_submit_sqes()-call. Isn't then appending SQEs in the
  second io_submit_sqes()-call racy with the completion side. (With the
  same problems that I tried to solve?

chris
-- 
Prof. Dr.-Ing. Christian Dietrich
Operating System Group (E-EXK4)
Technische Universität Hamburg
Am Schwarzenberg-Campus 3 (E), 4.092
21073 Hamburg

eMail:  christian.dietrich@tuhh.de
Tel:    +49 40 42878 2188
WWW:    https://osg.tuhh.de/