io-uring.vger.kernel.org archive mirror
* tcp short writes / write ordering / etc
@ 2021-01-31  9:10 dormando
  2021-02-01 10:56 ` Pavel Begunkov
  0 siblings, 1 reply; 2+ messages in thread
From: dormando @ 2021-01-31  9:10 UTC (permalink / raw)
  To: io-uring

Hey,

I'm trying to puzzle out an architecture on top of io_uring for a tcp
proxy I'm working on. I have a high level question, then I'll explain what
I'm doing for context:

- How is ordering maintained for write()s to the same FD from different
SQEs to a network socket? i.e., I get request A and queue a write(); later
request B comes in and gets queued, and then A finishes short. There was no
chance to IOSQE_LINK A to B. Does B get cancelled? This makes sense for disk
IO, but I can't wrap my head around it for network sockets.
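
To make that concrete, here's roughly what I mean in liburing terms (just a
sketch; the names and buffers are made up):

/* Two writes to the same backend socket, queued from separate
 * submissions because B arrives later, with no IOSQE_IO_LINK between
 * them. */
#include <liburing.h>

static void queue_write(struct io_uring *ring, int fd,
                        const void *buf, unsigned len, void *tag)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring); /* NULL check omitted */
    io_uring_prep_write(sqe, fd, buf, len, 0);
    io_uring_sqe_set_data(sqe, tag);   /* to match the CQE back to a request */
    io_uring_submit(ring);
}

/* t0: request A arrives; its write is queued and submitted:
 *       queue_write(ring, backend_fd, a_buf, a_len, &req_a);
 * t1: request B arrives before A's CQE is reaped:
 *       queue_write(ring, backend_fd, b_buf, b_len, &req_b);
 * t2: A's CQE reports a short write (cqe->res < a_len).
 *     B was never linked to A -- what happens to B and to the byte stream? */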

The setup:

- N per-core worker threads. Each thread handles X client sockets.
- Y backend sockets in a global shared pool. These point to storage
servers (or other proxies/anything).

- client sockets wake up with requests for an arbitrary number of keys (1
to 100 or so).
  - each key is mapped to a backend (like keyhash % Y).
  - new requests are dispatched for each key to its backend socket (see the
sketch below).
  - the results are put back into order and returned to the client.
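
The dispatch side is roughly this (placeholder names, not the real code):

/* Hash each key to one of the Y shared backend sockets and append the
 * sub-request to that backend's pending batch.  Names are illustrative. */
#include <stddef.h>
#include <stdint.h>

struct subreq {
    const char *key;
    size_t      klen;
    /* client back-pointer, original position for reordering, buffers... */
};

extern size_t   num_backends;                              /* Y */
extern uint64_t keyhash(const char *key, size_t klen);
extern void     backend_enqueue(size_t backend_idx, struct subreq *sr);

static void dispatch_keys(struct subreq *reqs, size_t nkeys)
{
    for (size_t i = 0; i < nkeys; i++) {
        size_t b = keyhash(reqs[i].key, reqs[i].klen) % num_backends;
        backend_enqueue(b, &reqs[i]);
    }
    /* responses come back per-backend and are matched against reqs[]
     * so the client sees them in the original order */
}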

The workers are designed such that they should not have to wait for a
large request set before processing the next ready client socket. i.e.,
thread N1 gets a request for 100 keys; it queues that work off, and then
starts on a request for a single key. It picks up the results of the
original request later and returns them. Otherwise we get poor long-tail
latency.

I've been working out a test program to mock this new backend. I have mock
worker threads that submit batches of work from fake connections, and then
have libevent or io_uring handle things.

In libevent/epoll mode:
 - workers can directly call write() to backend sockets while holding a
lock around a descriptive structure. This ensures ordering.
 - OR workers submit stacks to one or more threads that the backend
sockets are striped across. These threads lock and write(). This mode
helps with latency pileup.
 - a dedicated thread sits in epoll_wait() on EPOLLIN for each backend
socket. This avoids repeated epoll_ctl() add/mod calls. As responses
are parsed, completed sets of requests are shipped back to the worker
threads (sketched below).
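
That reader thread is shaped roughly like this (simplified; each backend fd
was added once with epoll_ctl(EPOLL_CTL_ADD) and the sockets are
non-blocking):

/* Dedicated reader: wait for EPOLLIN on any backend socket and drain it.
 * Response parsing and hand-off to workers are elided. */
#include <sys/epoll.h>
#include <unistd.h>
#include <errno.h>

static void reader_loop(int epfd)
{
    struct epoll_event evs[64];
    char buf[65536];

    for (;;) {
        int n = epoll_wait(epfd, evs, 64, -1);
        for (int i = 0; i < n; i++) {
            int fd = evs[i].data.fd;   /* set at EPOLL_CTL_ADD time */
            ssize_t r;
            while ((r = read(fd, buf, sizeof(buf))) > 0) {
                /* parse; when a client's full set of responses is
                 * complete, ship it back to the owning worker thread */
            }
            if (r == 0 || (r < 0 && errno != EAGAIN && errno != EWOULDBLOCK)) {
                /* backend closed or errored: tear down / reconnect */
            }
        }
    }
}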

In uring mode:
 - workers should submit to a single thread (or a few), each with a private
ring. SQEs are stacked and submit()'ed in a batch, ideally saving all of
the overhead of write()'ing to a bunch of sockets. (not working yet)
 - a dedicated thread with its own ring sits on recv() for each
backend socket. It works the same as epoll mode, except that after each read
I have to submit a new SQE for the next read (see the sketch below).
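
The recv() side currently looks roughly like this (sketch; struct conn and
the parsing are placeholders):

/* Dedicated recv thread: one outstanding recv per backend socket; each
 * completion is processed and then a fresh recv SQE is queued for that
 * socket. */
#include <liburing.h>

struct conn {
    int  fd;             /* backend socket */
    char buf[65536];
    /* parser state, pending request list, ... */
};

static void arm_recv(struct io_uring *ring, struct conn *c)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring); /* NULL check omitted */
    io_uring_prep_recv(sqe, c->fd, c->buf, sizeof(c->buf), 0);
    io_uring_sqe_set_data(sqe, c);
}

static void recv_loop(struct io_uring *ring, struct conn *conns, int nconns)
{
    for (int i = 0; i < nconns; i++)
        arm_recv(ring, &conns[i]);
    io_uring_submit(ring);

    for (;;) {
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(ring, &cqe) < 0)
            continue;
        struct conn *c = io_uring_cqe_get_data(cqe);
        int res = cqe->res;
        io_uring_cqe_seen(ring, cqe);

        if (res > 0) {
            /* parse res bytes; ship completed request sets to workers */
            arm_recv(ring, c);       /* re-arm the next read by hand */
            io_uring_submit(ring);
        } else {
            /* 0 = backend closed, <0 = error: reconnect / tear down */
        }
    }
}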

(I have everything sharing the same WQ, for what it's worth)

I'm trying to figure out uring mode's single submission thread, but
I'm drawing a blank on the IO ordering issues. Requests can come
in interleaved as the backends are shared, and waiting for a batch to
complete before submitting the next one defeats the purpose (I think).

What would be super nice but I'm pretty sure is impossible:

- M (possibly 1) thread(s) sitting on recv(), each in its own ring
- N client-handling worker threads with independent rings on the same WQ
 - SQEs with writes to the same backend FD are serialized by a magical
unicorn.

Then:
- a worker with a request for 100 keys builds and submits the SQEs itself,
  then moves on to the next client connection.
- the recv() thread gathers responses and signals the worker when the batch
is complete.

If I can avoid issues with short/colliding writes, I can still make this
work, as my protocol can allow for out-of-order responses; but that's not the
default mode, so I need both to work anyway.

Apologies if this isn't clear or was answered recently; I did try to read
archives/code/etc.

Thanks,
-Dormando


* Re: tcp short writes / write ordering / etc
  2021-01-31  9:10 tcp short writes / write ordering / etc dormando
@ 2021-02-01 10:56 ` Pavel Begunkov
  0 siblings, 0 replies; 2+ messages in thread
From: Pavel Begunkov @ 2021-02-01 10:56 UTC (permalink / raw)
  To: dormando, io-uring

On 31/01/2021 09:10, dormando wrote:
> Hey,
> 
> I'm trying to puzzle out an architecture on top of io_uring for a tcp
> proxy I'm working on. I have a high level question, then I'll explain what
> I'm doing for context:
> 
> - How is ordering maintained for write()s to the same FD from different
> SQEs to a network socket? i.e., I get request A and queue a write(); later

Without IOSQE_LINK or anything -- no ordering guarantees. Even if CQEs came
in some order, the actual I/O may have been executed in reverse.

> request B comes in and gets queued, and then A finishes short. There was no
> chance to IOSQE_LINK A to B. Does B get cancelled? This makes sense for disk
> IO, but I can't wrap my head around it for network sockets.
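
With a link it would look roughly like below (just a sketch). The linked
write isn't started until the first one completes, and if the first one fails
or comes up short the chain is severed and the second completes with
-ECANCELED instead of touching the socket.

/* Two writes to the same fd chained with IOSQE_IO_LINK, queued together
 * and submitted in one go. */
#include <liburing.h>

static int queue_two_ordered_writes(struct io_uring *ring, int fd,
                                    const void *a, unsigned alen,
                                    const void *b, unsigned blen)
{
    struct io_uring_sqe *sqe;

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_write(sqe, fd, a, alen, 0);
    sqe->flags |= IOSQE_IO_LINK;     /* link to the next SQE */

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_write(sqe, fd, b, blen, 0);

    return io_uring_submit(ring);    /* both go in one submission */
}

That only helps when both writes are known at submission time, of course.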

[...]

-- 
Pavel Begunkov


