io_uring-only sendmsg + recvmsg zerocopy

* io_uring-only sendmsg + recvmsg zerocopy
@ 2020-11-10 21:31 Victor Stewart
  2020-11-10 23:23 ` Pavel Begunkov
  0 siblings, 1 reply; 6+ messages in thread
From: Victor Stewart @ 2020-11-10 21:31 UTC (permalink / raw)
  To: io-uring

here’s the design i’m flirting with for "recvmsg and sendmsg zerocopy"
with persistent buffers patch.

we'd be looking at approx +100% throughput each on the send and recv
paths (per TCP_ZEROCOPY_RECEIVE benchmarks).

these would be io_uring only operations given the sendmsg completion
logic described below. want to get some conscious that this design
could/would be acceptable for merging before I begin writing the code.

the problem with zerocopy send is the asynchronous ACK from the NIC
confirming transmission. and you can’t just block on a syscall til
then. MSG_ZEROCOPY tackled this by putting the ACK on the
MSG_ERRQUEUE. but that logic is very disjointed and requires a double
completion (once from sendmsg once the send is enqueued, and again
once the NIC ACKs the transmission), and requires costly userspace
bookkeeping.

so what i propose instead is to exploit the asynchrony of io_uring.

you’d submit the IORING_OP_SENDMSG_ZEROCOPY operation, and then
sometime later receive the completion event on the ring’s completion
queue (either failure or success once ACK-ed by the NIC). 1 unified
completion flow.

we can somehow tag the socket as registered to io_uring, then when the
NIC ACKs, instead of finding the socket's error queue and putting the
completion there like MSG_ZEROCOPY, the kernel would find the io_uring
instance the socket is registered to and call into an io_uring
sendmsg_zerocopy_completion function. Then the cqe would get pushed
onto the completion queue.

the "recvmsg zerocopy" is straight forward enough. mimicking
TCP_ZEROCOPY_RECEIVE, i'll go into specifics next time.

the other big concern is the lifecycle of the persistent memory
buffers in the case of nefarious actors. but since we already have
buffer registration for O_DIRECT, I assume those mechanics already
address those issues and can just be repurposed?

and so with those persistent memory buffers, you'd only pay the cost
of pinning the memory into the kernel once upon registration, before
you even start your server listening... thus "free". versus pinning
per sendmsg like with MSG_ZEROCOPY... thus "true zerocopy".

^ permalink raw reply	[flat|nested] 6+ messages in thread