netdev.vger.kernel.org archive mirror
From: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
To: Stefan Hajnoczi <stefanha@redhat.com>,
	Stefano Garzarella <sgarzare@redhat.com>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Jason Wang <jasowang@redhat.com>,
	Bobby Eshleman <bobby.eshleman@bytedance.com>
Cc: <kvm@vger.kernel.org>,
	<virtualization@lists.linux-foundation.org>,
	<netdev@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<kernel@sberdevices.ru>, <oxffffaa@gmail.com>,
	<avkrasnov@sberdevices.ru>,
	Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Subject: [RFC PATCH v2 00/15] vsock: MSG_ZEROCOPY flag support
Date: Sun, 23 Apr 2023 22:26:28 +0300
Message-ID: <20230423192643.1537470-1-AVKrasnov@sberdevices.ru>

Hello,

                           DESCRIPTION

This patchset adds MSG_ZEROCOPY support for virtio/vsock. I tried to
follow the current TCP implementation as closely as possible:

1) The sender must enable the SO_ZEROCOPY socket option to use this
   feature. Without it, data is sent in the "classic" copy manner and
   the MSG_ZEROCOPY flag is ignored (i.e. no completion is generated).

2) The kernel reports completions through the socket's error queue: one
   completion per tx syscall (or several completions can be merged into
   a single one). I reused the existing MSG_ZEROCOPY logic:
   'msg_zerocopy_realloc()' etc.

The difference from the copy path is not significant. During packet
allocation, a non-linear skb is created, then I call 'pin_user_pages()'
for each page from the user's iov iterator and add each returned page to
the skb as a fragment. There are also some updates for the vhost and
guest parts of the transport: in both cases I've added handling of
non-linear skbs for the virtio part. vhost copies data from such an skb
to the guest's rx virtio buffers. In the guest, the virtio transport
fills the tx virtio queue with pages from the skb.
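
Since every pinned page becomes one skb fragment, the number of
fragments depends on both the length and the alignment of the user
buffer. The arithmetic can be sketched as follows; 'pages_spanned()' is
a hypothetical user-space helper for illustration, not a function from
the patchset.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of pages touched by the user buffer [addr, addr + len).
 * An unaligned buffer can touch one page more than len / page_size. */
size_t pages_spanned(uintptr_t addr, size_t len, size_t page_size)
{
	uintptr_t first, last;

	if (len == 0)
		return 0;

	first = addr / page_size;
	last = (addr + len - 1) / page_size;
	return (size_t)(last - first + 1);
}
```

For example, with 4KB pages an aligned 4096-byte buffer needs one page,
while a 2-byte buffer starting at offset 4095 already needs two.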

This version has several limitations/problems:

1) As this feature depends entirely on the transport, there is no way
   (or it is difficult) to check whether the transport can handle it
   when SO_ZEROCOPY is set. It seems I need to call the AF_VSOCK
   specific setsockopt callback from the SOL_SOCKET setsockopt callback,
   but this leads to a locking problem, because the AF_VSOCK and
   SOL_SOCKET callbacks are not designed to be called from each other.
   So in the current version SO_ZEROCOPY is set successfully on any
   type (i.e. transport) of AF_VSOCK socket, but if the transport does
   not support MSG_ZEROCOPY, the tx routine fails with EOPNOTSUPP.

2) When MSG_ZEROCOPY is used, we need to enqueue one completion for each
   tx system call. Each completion carries a flag that shows how the tx
   was performed: zerocopy or copy. This means that the whole message
   must be sent either in zerocopy or in copy mode - we can't send part
   of a message by copying and the rest in zerocopy mode (or vice
   versa). Now we need to account for the vsock credit logic, i.e. we
   can't send all the data at once - only the allowed number of bytes
   can be sent at any moment. In copy mode this is not a problem, as in
   the worst case we can send single bytes, but zerocopy is more
   complex because the smallest transmission unit is a single page. So
   if there is not enough space on the peer's side to send a whole
   number of pages (at least one), we will wait, thus stalling the tx
   side. To overcome this problem I've added a simple rule: zerocopy is
   possible only when there is enough space on the other side for the
   whole message (to check whether the current 'msghdr' was already
   used in previous tx iterations I use the 'iov_offset' field of its
   iov iter).

3) The loopback transport is not supported, because it requires
   non-linear skb handling in the dequeue logic (as we "send" a fragged
   skb and "receive" it from the same queue). I'm going to implement it
   in the next versions.

   ^^^ fixed in v2

4) The current implementation sets the max packet length to 64KB. IIUC
   this is due to 'kmalloc()' allocated data buffers. I think that for
   MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
   not used for data - user space pages are used as buffers. Also,
   this limit trims every message that is > 64KB, so such messages are
   sent in copy mode due to the 'iov_offset' check in 2).

   ^^^ fixed in v2
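
The rule from item 2) above can be sketched as a single predicate. This
is only an illustration of the decision as described in this letter:
'can_use_zerocopy()' and its parameters are hypothetical names, not the
code from the patches.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decide the tx mode for one message.
 * msg_len:    total length of the message in bytes
 * iov_offset: offset already consumed from the iov iterator
 * peer_free:  bytes the peer can currently accept (vsock credit) */
bool can_use_zerocopy(size_t msg_len, size_t iov_offset, uint32_t peer_free)
{
	/* A non-zero offset means this 'msghdr' was already partially
	 * sent in a previous tx iteration, so fall back to copying. */
	if (iov_offset != 0)
		return false;

	/* The whole message must fit, otherwise tx could stall while
	 * waiting for enough credit to send a whole page. */
	return msg_len <= peer_free;
}
```

With this rule a message is never split between the two modes, so a
single completion flag describes it correctly.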

                         PATCHSET STRUCTURE

Patchset has the following structure:
1) Handle non-linear skbuff on receive in virtio/vhost.
2) Handle non-linear skbuff on send in virtio/vhost.
3) Updates for AF_VSOCK.
4) Enable MSG_ZEROCOPY support on transports.
5) Tests/tools/docs updates.

                            PERFORMANCE

Performance: it is a little tricky to compare performance between copy
and zerocopy transmissions. In zerocopy mode we need to wait until the
kernel releases the user buffers, so it is something like a synchronous
path (wait until the device driver has processed the data), while in
copy mode we can feed as much data to the kernel as we want, without
caring about the device driver. So I compared only the time spent in
the 'send()' syscall. Combining this value with the total number of
transmitted bytes gives the Gbit/s figure. Also, to avoid tx stalls due
to insufficient credit, the receiver allocates the same amount of space
as the sender needs.
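
The Gbit/s figure is derived as sketched here, assuming the time spent
in 'send()' is accumulated in nanoseconds. 'gbit_per_sec()' is a
hypothetical helper for illustration and is not necessarily how
vsock_perf computes it.

```c
#include <stdint.h>

/* Throughput from total bytes sent and total time spent inside send().
 * One bit per nanosecond equals one Gbit/s, so no extra scaling is
 * needed. Returns 0.0 if no time was measured. */
double gbit_per_sec(uint64_t bytes, uint64_t send_ns)
{
	if (send_ns == 0)
		return 0.0;

	return (double)bytes * 8.0 / (double)send_ns;
}
```

Note that this measures only the syscall cost, not the end-to-end
transfer rate, which is why the zerocopy numbers can be so high.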

Sender:
./vsock_perf --sender <CID> --buf-size <buf size> --bytes 256M [--zc]

Receiver:
./vsock_perf --vsk-size 256M

G2H transmission (values are Gbit/s):

*-------------------------------*
|          |         |          |
| buf size |   copy  | zerocopy |
|          |         |          |
*-------------------------------*
|   4KB    |    3    |    10    |
*-------------------------------*
|   32KB   |    9    |    45    |
*-------------------------------*
|   256KB  |    24   |    195   |
*-------------------------------*
|    1M    |    27   |    270   |
*-------------------------------*
|    8M    |    22   |    277   |
*-------------------------------*

H2G:

*-------------------------------*
|          |         |          |
| buf size |   copy  | zerocopy |
|          |         |          |
*-------------------------------*
|   4KB    |    17   |    11    |
*-------------------------------*
|   32KB   |    30   |    66    |
*-------------------------------*
|   256KB  |    38   |    179   |
*-------------------------------*
|    1M    |    38   |    234   |
*-------------------------------*
|    8M    |    28   |    279   |
*-------------------------------*

Loopback:

*-------------------------------*
|          |         |          |
| buf size |   copy  | zerocopy |
|          |         |          |
*-------------------------------*
|   4KB    |    8    |    7     |
*-------------------------------*
|   32KB   |    34   |    42    |
*-------------------------------*
|   256KB  |    43   |    83    |
*-------------------------------*
|    1M    |    40   |    109   |
*-------------------------------*
|    8M    |    40   |    171   |
*-------------------------------*

I suppose the huge difference between the two modes above has two reasons:
1) We don't need to copy data.
2) We don't need to allocate a buffer for data, only for the header.

Zerocopy is faster than classic copy mode for buffers of 32KB and
larger (at 4KB it is slower in the H2G and loopback cases), but of
course it requires a specific application architecture due to user
page pinning, buffer size and alignment.

If the host fails to send data with "Cannot allocate memory", check the
value of /proc/sys/net/core/optmem_max - it is accounted for during
completion skb allocation.

                            TESTING

This patchset includes a set of tests for the MSG_ZEROCOPY feature. I
tried to cover the new code as much as possible, so there are different
cases for MSG_ZEROCOPY transmissions: with SO_ZEROCOPY disabled and with
several io vector types (different sizes, alignments, with unmapped
pages). I also ran the tests with the loopback transport and with
vsockmon running.

Thanks, Arseniy

Link to v1:
https://lore.kernel.org/netdev/0e7c6fc4-b4a6-a27b-36e9-359597bba2b5@sberdevices.ru/

Changelog:
v1 -> v2:
 - Replace 'get_user_pages()' with 'pin_user_pages()'.
 - Loopback transport support.

Arseniy Krasnov (15):
  vsock/virtio: prepare for non-linear skb support
  vhost/vsock: non-linear skb handling support
  vsock/virtio: non-linear skb handling support
  vsock/virtio: non-linear skb handling for tap
  vsock/virtio: MSG_ZEROCOPY flag support
  vsock: check error queue to set EPOLLERR
  vsock: read from socket's error queue
  vsock: check for MSG_ZEROCOPY support
  vhost/vsock: support MSG_ZEROCOPY for transport
  vsock/virtio: support MSG_ZEROCOPY for transport
  vsock/loopback: support MSG_ZEROCOPY for transport
  net/sock: enable setting SO_ZEROCOPY for PF_VSOCK
  test/vsock: MSG_ZEROCOPY flag tests
  test/vsock: MSG_ZEROCOPY support for vsock_perf
  docs: net: description of MSG_ZEROCOPY for AF_VSOCK

 Documentation/networking/msg_zerocopy.rst |  12 +-
 drivers/vhost/vsock.c                     |  29 +-
 include/linux/socket.h                    |   1 +
 include/linux/virtio_vsock.h              |   7 +
 include/net/af_vsock.h                    |   3 +
 net/core/sock.c                           |   4 +-
 net/vmw_vsock/af_vsock.c                  |  16 +-
 net/vmw_vsock/virtio_transport.c          |  39 +-
 net/vmw_vsock/virtio_transport_common.c   | 497 ++++++++++++++++++---
 net/vmw_vsock/vsock_loopback.c            |   8 +
 tools/testing/vsock/Makefile              |   2 +-
 tools/testing/vsock/util.h                |   1 +
 tools/testing/vsock/vsock_perf.c          | 139 +++++-
 tools/testing/vsock/vsock_test.c          |  11 +
 tools/testing/vsock/vsock_test_zerocopy.c | 501 ++++++++++++++++++++++
 tools/testing/vsock/vsock_test_zerocopy.h |  12 +
 16 files changed, 1194 insertions(+), 88 deletions(-)
 create mode 100644 tools/testing/vsock/vsock_test_zerocopy.c
 create mode 100644 tools/testing/vsock/vsock_test_zerocopy.h

-- 
2.25.1


