From: Pavel Begunkov <asml.silence@gmail.com>
To: io-uring@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
Cc: "David S . Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	Jonathan Lemon <jonathan.lemon@gmail.com>,
	Willem de Bruijn <willemb@google.com>,
	Jens Axboe <axboe@kernel.dk>, David Ahern <dsahern@kernel.org>,
	kernel-team@fb.com
Subject: Re: [PATCH net-next v3 00/25] io_uring zerocopy send
Date: Tue, 5 Jul 2022 16:04:44 +0100
Message-ID: <15cff5cd-52d5-68af-75c1-32be28137773@gmail.com>
In-Reply-To: <cover.1656318994.git.asml.silence@gmail.com>

On 7/5/22 16:01, Pavel Begunkov wrote:

NOTE: This is not to be picked up directly due to cross-subsystem merge
problems. After reaching consensus and getting the necessary acks, I'll work
out merging with Jakub and Jens.


> The patchset implements io_uring zerocopy send. It works with both registered
> and normal buffers; mixing is allowed but not recommended. Apart from the
> usual request completions, just as with MSG_ZEROCOPY, io_uring separately
> notifies userspace when buffers are freed and can be reused (see API design
> below). These notifications are delivered into io_uring's Completion Queue.
> The "buffer-free" notifications are not necessarily per request: userspace
> controls this and should explicitly attach a number of requests to a single
> notification. The series also adds some internal optimisations for registered
> buffers, such as removing page referencing.
> 
> From the kernel networking perspective there are two main changes. The first
> one is passing ubuf_info into the network layer from io_uring (inside an
> in-kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info
> caching on the io_uring side, and also helps to avoid cross-referencing
> and synchronisation problems. The second is an optional optimisation
> removing page referencing for requests with registered buffers.
> 
> Benchmarking was done with an optimised version of the selftest (see [1]),
> which in a loop sends a batch of requests and then waits for their
> completions. The "zc + flush" column posts one additional "buffer-free"
> notification per request, while plain "zc" doesn't post buffer notifications
> at all.
> 
> NIC (requests / second):
> IO size | non-zc    | zc             | zc + flush
> 4000    | 495134    | 606420 (+22%)  | 558971 (+12%)
> 1500    | 551808    | 577116 (+4.5%) | 565803 (+2.5%)
> 1000    | 584677    | 592088 (+1.2%) | 560885 (-4%)
> 600     | 596292    | 598550 (+0.4%) | 555366 (-6.7%)
> 
> dummy (requests / second):
> IO size | non-zc    | zc             | zc + flush
> 8000    | 1299916   | 2396600 (+84%) | 2224219 (+71%)
> 4000    | 1869230   | 2344146 (+25%) | 2170069 (+16%)
> 1200    | 2071617   | 2361960 (+14%) | 2203052 (+6%)
> 600     | 2106794   | 2381527 (+13%) | 2195295 (+4%)
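
For readers checking the tables: the percentages in parentheses appear to be
the relative change in requests per second against the "non-zc" column of the
same row. A minimal sketch of that arithmetic, where the helper name
rel_change_pct is made up for illustration and is not part of the series:

```c
#include <assert.h>

/*
 * Relative change of `value` against `baseline`, in percent.
 * Hypothetical helper for reading the tables; not code from the patchset.
 */
static double rel_change_pct(double baseline, double value)
{
    assert(baseline != 0.0); /* a zero baseline has no relative change */
    return (value - baseline) / baseline * 100.0;
}
```

For the NIC row with IO size 4000, rel_change_pct(495134, 606420) comes out
between 22 and 23, in line with the "+22%" shown. The figures in the tables
look rounded or truncated, so small deviations are to be expected.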
> 
> Previously it also brought a massive performance speedup compared to the
> msg_zerocopy tool (see [3]), which is probably not super interesting.
> 
> An additional set of refcounting optimisations was omitted from the series
> for simplicity and because they don't change the picture drastically. They
> will be sent as a follow-up, together with flushing optimisations that close
> the performance gap between the last two columns.
> 
> Note: the series is based on net-next + for-5.20/io_uring, but as vanilla
> net-next fails for me, the repo (see [2]) is on top of for-5.20/io_uring.
> 
> Links:
> 
>    liburing (benchmark + some tests):
>    [1] https://github.com/isilence/liburing/tree/zc_v3
> 
>    kernel repo:
>    [2] https://github.com/isilence/linux/tree/zc_v3
> 
>    RFC v1:
>    [3] https://lore.kernel.org/io-uring/cover.1638282789.git.asml.silence@gmail.com/
> 
>    RFC v2:
>    https://lore.kernel.org/io-uring/cover.1640029579.git.asml.silence@gmail.com/
> 
> API design overview:
> 
>    The series introduces an io_uring concept of notifiers. From the userspace
>    perspective, a notifier is an entity to which userspace can bind one or
>    more requests and then request that it be flushed. Flushing a notifier
>    makes it impossible to attach new requests to it, and instructs the
>    notifier to post a completion once all requests attached to it are
>    completed and the kernel doesn't need the buffers anymore.
> 
>    Notifiers are stored in notification slots, which should be registered as
>    an array in io_uring. Each slot stores only one notifier at any particular
>    moment. Flushing removes the notifier from its slot, and the slot
>    automatically replaces it with a new one. All operations on a notifier are
>    done by specifying the index of the slot it is currently in.
> 
>    When registering notification slots, userspace specifies a u64 tag for
>    each slot, which is copied into notification completion entries as
>    cqe::user_data. cqe::res is 0 and cqe::flags holds a wrap-around u32
>    sequence number counting the notifiers of a slot.
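
To make the slot semantics above concrete, here is a small userspace model of
them. It is only a sketch: every name in it (model_ring, slot_flush, and so
on) is hypothetical, it is neither the kernel implementation nor the uapi
from the series, and it assumes that sequence numbers start at 0 and that
flushing a notifier with no attached requests posts its completion
immediately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX 16 /* arbitrary capacity for this sketch */

struct model_cqe {
    uint64_t user_data; /* tag given at slot registration */
    int32_t res;        /* 0 for notifications */
    uint32_t flags;     /* wrap-around per-slot sequence number */
};

struct model_notif {
    uint64_t tag;
    uint32_t seq;
    unsigned int inflight; /* attached requests whose buffers are in use */
    bool flushed;          /* detached from its slot, awaiting completion */
};

struct model_ring {
    struct model_notif pool[MODEL_MAX];
    size_t pool_used;
    struct model_cqe cq[MODEL_MAX];
    size_t cq_len;
};

struct model_slot {
    uint64_t tag;
    uint32_t next_seq;
    struct model_notif *cur; /* exactly one notifier per slot at a time */
};

static struct model_notif *notif_new(struct model_ring *r, struct model_slot *s)
{
    if (r->pool_used == MODEL_MAX)
        return NULL;
    struct model_notif *n = &r->pool[r->pool_used++];
    n->tag = s->tag;
    n->seq = s->next_seq++; /* uint32_t arithmetic wraps modulo 2^32 */
    n->inflight = 0;
    n->flushed = false;
    return n;
}

/* Post the "buffer-free" completion: user_data = tag, res = 0, flags = seq. */
static void notif_post(struct model_ring *r, const struct model_notif *n)
{
    /* Each pooled notifier posts at most once, so cq cannot overflow. */
    if (r->cq_len < MODEL_MAX)
        r->cq[r->cq_len++] = (struct model_cqe){ n->tag, 0, n->seq };
}

static bool slot_register(struct model_ring *r, struct model_slot *s, uint64_t tag)
{
    s->tag = tag;
    s->next_seq = 0; /* assumption: numbering starts at 0 */
    s->cur = notif_new(r, s);
    return s->cur != NULL;
}

/* Bind one request to whichever notifier currently sits in the slot. */
static struct model_notif *slot_attach(struct model_slot *s)
{
    assert(s->cur != NULL);
    s->cur->inflight++;
    return s->cur;
}

/* A request completed and its buffer is no longer needed. */
static void request_done(struct model_ring *r, struct model_notif *n)
{
    if (n->inflight == 0)
        return; /* ignore unbalanced calls in this model */
    if (--n->inflight == 0 && n->flushed)
        notif_post(r, n);
}

/* Detach the current notifier and install a fresh one in the slot. */
static bool slot_flush(struct model_ring *r, struct model_slot *s)
{
    struct model_notif *fresh = notif_new(r, s);
    if (!fresh)
        return false;
    struct model_notif *old = s->cur;
    s->cur = fresh;
    old->flushed = true;
    if (old->inflight == 0) /* assumption: an empty notifier posts at once */
        notif_post(r, old);
    return true;
}
```

In this model, attaching two requests, flushing the slot, and then completing
both requests yields a single completion entry carrying the slot tag, res 0
and sequence number 0. Requests attached after the flush land on the fresh
notifier and are reported under sequence number 1 once that one is flushed.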
> 
> Changelog:
> 
>    RFC v2 -> v3:
>      mem accounting for non-registered buffers
>      allow mixing registered and normal requests per notifier
>      notification flushing via IORING_OP_RSRC_UPDATE
>      TCP support
>      fix buffer indexing
>      fix io-wq ->uring_lock locking
>      fix bugs when mixing with MSG_ZEROCOPY
>      fix managed refs bugs in skbuff.c
> 
>    RFC -> RFC v2:
>      remove additional overhead for non-zc from skb_release_data()
>      avoid msg propagation, hide extra bits of non-zc overhead
>      task_work based "buffer free" notifications
>      improve io_uring's notification refcounting
>      added 5/19, (no pfmemalloc tracking)
>      added 8/19 and 9/19 preventing small copies with zc
>      misc small changes
> 
> Pavel Begunkov (25):
>    ipv4: avoid partial copy for zc
>    ipv6: avoid partial copy for zc
>    skbuff: add SKBFL_DONT_ORPHAN flag
>    skbuff: carry external ubuf_info in msghdr
>    net: bvec specific path in zerocopy_sg_from_iter
>    net: optimise bvec-based zc page referencing
>    net: don't track pfmemalloc for managed frags
>    skbuff: don't mix ubuf_info of different types
>    ipv4/udp: support zc with managed data
>    ipv6/udp: support zc with managed data
>    tcp: support zc with managed data
>    io_uring: add zc notification infrastructure
>    io_uring: export task put
>    io_uring: cache struct io_notif
>    io_uring: complete notifiers in tw
>    io_uring: add notification slot registration
>    io_uring: wire send zc request type
>    io_uring: account locked pages for non-fixed zc
>    io_uring: allow to pass addr into sendzc
>    io_uring: add rsrc referencing for notifiers
>    io_uring: sendzc with fixed buffers
>    io_uring: flush notifiers after sendzc
>    io_uring: rename IORING_OP_FILES_UPDATE
>    io_uring: add zc notification flush requests
>    selftests/io_uring: test zerocopy send
> 
>   include/linux/io_uring_types.h                |  37 ++
>   include/linux/skbuff.h                        |  59 +-
>   include/linux/socket.h                        |   7 +
>   include/uapi/linux/io_uring.h                 |  43 +-
>   io_uring/Makefile                             |   2 +-
>   io_uring/io_uring.c                           |  40 +-
>   io_uring/io_uring.h                           |  21 +
>   io_uring/net.c                                | 134 ++++
>   io_uring/net.h                                |   4 +
>   io_uring/notif.c                              | 215 +++++++
>   io_uring/notif.h                              |  87 +++
>   io_uring/opdef.c                              |  24 +-
>   io_uring/rsrc.c                               |  55 +-
>   io_uring/rsrc.h                               |  16 +-
>   io_uring/tctx.h                               |  26 -
>   net/compat.c                                  |   2 +
>   net/core/datagram.c                           |  53 +-
>   net/core/skbuff.c                             |  35 +-
>   net/ipv4/ip_output.c                          |  63 +-
>   net/ipv4/tcp.c                                |  52 +-
>   net/ipv6/ip6_output.c                         |  62 +-
>   net/socket.c                                  |   6 +
>   tools/testing/selftests/net/Makefile          |   1 +
>   .../selftests/net/io_uring_zerocopy_tx.c      | 605 ++++++++++++++++++
>   .../selftests/net/io_uring_zerocopy_tx.sh     | 131 ++++
>   25 files changed, 1652 insertions(+), 128 deletions(-)
>   create mode 100644 io_uring/notif.c
>   create mode 100644 io_uring/notif.h
>   create mode 100644 tools/testing/selftests/net/io_uring_zerocopy_tx.c
>   create mode 100755 tools/testing/selftests/net/io_uring_zerocopy_tx.sh
> 

-- 
Pavel Begunkov
